Benchmarking GPU vs CPU training performance across Apple Silicon, NVIDIA GPUs, and Intel CPUs using TensorFlow Metal and MLX.
TL;DR: For large models, GPU acceleration delivers up to a 17.5x speedup on Apple Silicon (M1 Max, GPU vs. its own CPU) and up to 123x on NVIDIA (RTX 4070 Super vs. the i7-8700 baseline).

| Hardware | GPU Cores | VGG16 (s/epoch) | Speedup vs i7-8700 |
|---|---|---|---|
| RTX 4070 Super | 7168 CUDA | 7s | 123x |
| RTX 2070 | 2304 CUDA | 18s | 48x |
| M1 Max | 32 GPU | 21s | 41x |
| M4 Pro | 16 GPU | 26s | 33x |
| M2 | 10 GPU | 64s | 13x |
| i7-13700KF | - | 126s | 7x |
| M1 Max (CPU only) | - | 368s | 2.3x |
| i7-8700 | - | 863s | 1x (baseline) |
- M1 Max: 17.5x faster with Metal GPU vs CPU-only
- M2: 8.3x faster with Metal GPU vs CPU-only
- M4 Pro: See MLX vs TensorFlow comparison below
```
tensorflow-metal-experiments/
├── notebooks/
│   ├── tf_mnist_train.ipynb          # Simple CNN (93k params)
│   ├── tf_fashion_mnist_train.ipynb  # CNN with dropout (412k params)
│   ├── tf_cifar100-train.ipynb       # VGG16-style (34M params)
│   ├── mlx_comparison.ipynb          # MLX vs TensorFlow Metal (naive)
│   ├── optimized_benchmark.ipynb     # Naive vs Optimized comparison
│   └── benchmark_report.ipynb        # Generate benchmark charts
├── src/utils/
│   └── device_config.py              # Reusable GPU/CPU configuration
├── benchmarks/
│   └── results.json                  # Structured benchmark data
└── assets/
    └── vgg16_benchmark.png           # Benchmark visualization
```
If you don't have Python installed, use Homebrew:
```bash
# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Python 3.11+
brew install python@3.11

# Verify installation
python3.11 --version
```

```bash
# Navigate to project directory
cd tensorflow-metal-experiments
# Create virtual environment
python3.11 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install dependencies (TF 2.18 is required for Metal compatibility)
pip install "tensorflow>=2.18,<2.19" tensorflow-metal mlx
pip install matplotlib seaborn pandas numpy jupyterlab
# Verify TensorFlow sees the GPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Should show: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
# Verify MLX
python -c "import mlx.core as mx; print(mx.default_device())"
# Should show: Device(gpu, 0)
```

For NVIDIA GPUs (Windows/Linux):

```bash
# Create and activate venv
python -m venv venv
source venv/bin/activate # or: venv\Scripts\activate on Windows
# Install dependencies
pip install tensorflow[and-cuda]
pip install matplotlib seaborn pandas numpy jupyterlab
```

To run the notebooks:

```bash
# Make sure venv is activated
source venv/bin/activate
# Start Jupyter
jupyter lab
```

Open any notebook in `notebooks/` and run all cells.
When you're finished, deactivate the virtual environment:

```bash
deactivate
```

Each notebook uses a device configuration helper:

```python
from utils.device_config import configure_device
# Use GPU (Metal or CUDA)
device = configure_device(use_gpu=True)
# Force CPU only
device = configure_device(use_gpu=False)
```
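
The helper itself lives in `src/utils/device_config.py`. For orientation, here is a minimal sketch of what such a helper can look like; this is an illustration under stated assumptions, not the repo's actual implementation:

```python
# Illustrative sketch only -- the real src/utils/device_config.py may differ.
import tensorflow as tf

def configure_device(use_gpu: bool = True) -> str:
    """Return a tf.device string, hiding GPUs when CPU-only is requested."""
    if not use_gpu:
        # Hide all GPUs so TensorFlow falls back to the CPU.
        tf.config.set_visible_devices([], "GPU")
        return "/CPU:0"
    gpus = tf.config.list_physical_devices("GPU")
    return "/GPU:0" if gpus else "/CPU:0"
```
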
This is the primary benchmark. Large models show the most significant GPU acceleration.

| Hardware | Platform | GPU | Time/Epoch |
|---|---|---|---|
| RTX 4070 Super 12GB | Windows 11 | Yes | 7s |
| RTX 2070 8GB | Windows 10 | Yes | 18s |
| M1 Max 32-core GPU | macOS | Yes | 21s |
| M2 10-core GPU | macOS | Yes | 64s |
| i7-13700KF 3.4GHz | Windows 11 | No | 126s |
| M1 Max 10-core CPU | macOS | No | 368s |
| M2 8-core CPU | macOS | No | 528s |
| i9 2.4GHz 8-core | macOS | No | 630s |
| i7-8700 3.2GHz | Windows 10 | No | 863s |
For small models (MNIST CNN, 93k params), the CPU can sometimes match or beat the GPU due to data-transfer overhead; a quick timing sketch for checking this on your own machine follows the list. GPU acceleration is most beneficial for:
- Models > 1M parameters
- Batch sizes >= 64
- Training runs with many epochs
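
To see where this crossover falls on your own hardware, here is a minimal timing sketch. It is illustrative only: the random stand-in data and small dense model are assumptions, not the repo's benchmark code.

```python
# Minimal CPU-vs-GPU timing sketch (illustrative; not the repo's benchmarks).
import time

import numpy as np
import tensorflow as tf

# Random stand-in data shaped like MNIST.
x = np.random.rand(10_000, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(10_000,))

def build_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

for device in ("/CPU:0", "/GPU:0"):
    with tf.device(device):
        model = build_model()
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
        start = time.perf_counter()
        model.fit(x, y, batch_size=64, epochs=1, verbose=0)
        print(f"{device}: {time.perf_counter() - start:.1f}s per epoch")
```

If the two times are close, the model is too small for the GPU to pay off.
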
If you observe low GPU utilization during training, these are the common causes:
- **NumPy array bottleneck** - Using `model.fit(x_train, y_train)` with NumPy arrays is a major bottleneck
- **Small batch sizes** - GPU dispatch overhead doesn't amortize for small batches
- **Model too small** - GPU parallelism isn't fully utilized for models < 1M params
- **Data loading on CPU** - Pipeline not optimized for GPU
- **Use the `tf.data.Dataset` API instead of NumPy arrays:**

  ```python
  # Instead of: model.fit(x_train, y_train)
  dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
  dataset = dataset.batch(128).prefetch(tf.data.AUTOTUNE)
  model.fit(dataset)
  ```

  This can achieve up to 5x acceleration and better GPU utilization.
- **Increase batch sizes** - Apple's unified memory allows larger batches (try 256 or 512) without CPU-GPU transfer overhead
- **Use mixed precision** where supported:

  ```python
  tf.keras.mixed_precision.set_global_policy('mixed_float16')
  ```
- **Monitor GPU power** to verify the GPU is actually being utilized:

  ```bash
  sudo powermetrics --samplers gpu_power -i1000 -n1
  ```
- **For MLX:** use `mx.eval()` strategically to control lazy evaluation (see the sketch after this list)
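
As an illustration of that last point, here is a hedged sketch of an MLX training loop that forces evaluation once per epoch, matching the "eval per epoch" optimization in the tables below. The model, optimizer, loss, and `batches` iterable are placeholders, not the repo's code.

```python
# Sketch: controlling MLX's lazy evaluation with one mx.eval per epoch.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Linear(784, 10)                  # placeholder model
optimizer = optim.Adam(learning_rate=1e-3)

def loss_fn(model, x, y):
    return nn.losses.cross_entropy(model(x), y, reduction="mean")

loss_and_grad = nn.value_and_grad(model, loss_fn)

def train_epoch(batches):
    """`batches` is a placeholder iterable of (x, y) mx.array pairs."""
    for x, y in batches:
        loss, grads = loss_and_grad(model, x, y)
        optimizer.update(model, grads)
    # Materialize the whole epoch's lazy graph here, instead of
    # synchronizing after every batch.
    mx.eval(model.parameters(), optimizer.state)
```
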
Run `notebooks/optimized_benchmark.ipynb` to see the impact of these optimizations, with real benchmarks comparing naive vs. optimized implementations for both TensorFlow and MLX.

The `mlx_comparison.ipynb` notebook benchmarks Apple's MLX framework against TensorFlow Metal.
Benchmarked on M4 Pro (16-core GPU, 48GB RAM) - Naive vs Optimized:

| Model | Params | TF Naive | TF Optimized | MLX Naive | MLX Optimized | Best |
|---|---|---|---|---|---|---|
| MNIST CNN | 93K | 77.2s | 24.8s | 16.4s | 11.6s | MLX Opt |
| Fashion CNN | 412K | 95.3s | 28.2s | 28.0s | 24.1s | MLX Opt |

| Framework | Optimization | MNIST Speedup | Fashion Speedup |
|---|---|---|---|
| TensorFlow | tf.data + batch=256 | 3.1x faster | 3.4x faster |
| MLX | eval per epoch + batch=256 | 1.4x faster | 1.2x faster |
Key Insights:
- TensorFlow benefits most from optimization - tf.data.Dataset provides 3x+ speedup
- MLX is fast out of the box - Already optimized, less room for improvement
- MLX wins for small/medium models - Even optimized TensorFlow can't catch up
When to use MLX:
- Small-to-medium models (< 10M parameters) - fastest option
- Rapid prototyping on Apple Silicon
- Apple-native applications (Core ML integration)
- When you want good performance without optimization work
When to use TensorFlow Metal:
- Cross-platform deployment requirements
- Access to TensorFlow Hub / Keras ecosystem
- Production pipelines with TensorFlow Serving
- When you'll invest in tf.data optimization
- All benchmarks run 3 times, median reported (this protocol is sketched after the list)
- System was idle during benchmarks (no background tasks)
- Same model architecture across all hardware
- Data loading time excluded from measurements
- Batch sizes kept consistent (64 for MNIST, 128 for CIFAR-100)
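
For reference, here is a minimal sketch of that run-three-times, report-the-median protocol. It uses a Keras callback; `model_fn` and `dataset` are placeholders, and the repo's notebooks may measure differently.

```python
# Sketch of the timing protocol: per-epoch wall-clock time via a callback,
# median over 3 runs. Illustrative; not the repo's exact measurement code.
import statistics
import time

import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Record wall-clock time for each training epoch."""

    def on_train_begin(self, logs=None):
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.perf_counter() - self._start)

def median_epoch_time(model_fn, dataset, runs=3, epochs=3):
    """Train `runs` times; return the median of each run's median epoch time."""
    per_run = []
    for _ in range(runs):
        timer = EpochTimer()
        model_fn().fit(dataset, epochs=epochs, callbacks=[timer], verbose=0)
        per_run.append(statistics.median(timer.times))
    return statistics.median(per_run)
```
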
To contribute your own results:

- Run benchmarks on your hardware
- Add results to `benchmarks/results.json`
- Run `notebooks/benchmark_report.ipynb` to regenerate charts
- Submit a pull request
MIT
