Benchmarking GPU vs CPU training performance across Apple Silicon, NVIDIA GPUs, and Intel CPUs using TensorFlow Metal and MLX.
TL;DR: For large models, GPU acceleration delivers up to a 17.5x speedup on Apple Silicon (M1 Max, GPU vs. its own CPU) and up to 123x on NVIDIA (RTX 4070 Super vs. the i7-8700 baseline).

| Hardware | GPU Cores | VGG16 (s/epoch) | Speedup vs i7-8700 |
|---|---|---|---|
| RTX 4070 Super | 7168 CUDA | 7s | 123x |
| RTX 2070 | 2304 CUDA | 18s | 48x |
| M1 Max | 32 GPU | 21s | 41x |
| M4 Pro | 16 GPU | 26s | 33x |
| M2 | 10 GPU | 64s | 13x |
| i7-13700KF | - | 126s | 7x |
| M1 Max (CPU only) | - | 368s | 2.3x |
| i7-8700 | - | 863s | 1x (baseline) |
- M1 Max: 17.5x faster with Metal GPU vs CPU-only
- M2: 8.3x faster with Metal GPU vs CPU-only
- M4 Pro: See MLX vs TensorFlow comparison below
```
tensorflow-metal-experiments/
├── notebooks/
│   ├── tf_mnist_train.ipynb          # Simple CNN (93k params)
│   ├── tf_fashion_mnist_train.ipynb  # CNN with dropout (412k params)
│   ├── tf_cifar100-train.ipynb       # VGG16-style (34M params)
│   ├── mlx_comparison.ipynb          # MLX vs TensorFlow Metal (naive)
│   ├── optimized_benchmark.ipynb     # Naive vs Optimized comparison
│   └── benchmark_report.ipynb        # Generate benchmark charts
├── src/utils/
│   └── device_config.py              # Reusable GPU/CPU configuration
├── benchmarks/
│   └── results.json                  # Structured benchmark data
└── assets/
    └── vgg16_benchmark.png           # Benchmark visualization
```
If you don't have Python installed, use Homebrew:
```bash
# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Python 3.11+
brew install python@3.11

# Verify installation
python3.11 --version
```

```bash
# Navigate to project directory
cd tensorflow-metal-experiments
# Create virtual environment
python3.11 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install dependencies (TF 2.18 is required for Metal compatibility)
pip install "tensorflow>=2.18,<2.19" tensorflow-metal mlx
pip install matplotlib seaborn pandas numpy jupyterlab
# Verify TensorFlow sees the GPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Should show: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
# Verify MLX
python -c "import mlx.core as mx; print(mx.default_device())"
# Should show: Device(gpu, 0)
```

For NVIDIA GPUs (Windows/Linux):

```bash
# Create and activate venv
python -m venv venv
source venv/bin/activate # or: venv\Scripts\activate on Windows
# Install dependencies
pip install tensorflow[and-cuda]
pip install matplotlib seaborn pandas numpy jupyterlab
```

To run the notebooks:

```bash
# Make sure venv is activated
source venv/bin/activate
# Start Jupyter
jupyter lab
```

Open any notebook in `notebooks/` and run all cells.
When you're finished, deactivate the virtual environment:

```bash
deactivate
```

Each notebook uses a device configuration helper:

```python
from utils.device_config import configure_device
# Use GPU (Metal or CUDA)
device = configure_device(use_gpu=True)
# Force CPU only
device = configure_device(use_gpu=False)
```
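
The helper itself lives in `src/utils/device_config.py`. For orientation, here is a minimal sketch of what such a helper can look like; this is an illustration under stated assumptions, not the repo's actual implementation:

```python
# Illustrative sketch only -- the real src/utils/device_config.py may differ.
import tensorflow as tf

def configure_device(use_gpu: bool = True) -> str:
    """Return a tf.device string, hiding GPUs when CPU-only is requested."""
    if not use_gpu:
        # Hide all GPUs so TensorFlow falls back to the CPU.
        tf.config.set_visible_devices([], "GPU")
        return "/CPU:0"
    gpus = tf.config.list_physical_devices("GPU")
    return "/GPU:0" if gpus else "/CPU:0"
```
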
This is the primary benchmark. Large models show the most significant GPU acceleration.

| Hardware | Platform | GPU | Time/Epoch |
|---|---|---|---|
| RTX 4070 Super 12GB | Windows 11 | Yes | 7s |
| RTX 2070 8GB | Windows 10 | Yes | 18s |
| M1 Max 32-core GPU | macOS | Yes | 21s |
| M2 10-core GPU | macOS | Yes | 64s |
| i7-13700KF 3.4GHz | Windows 11 | No | 126s |
| M1 Max 10-core CPU | macOS | No | 368s |
| M2 8-core CPU | macOS | No | 528s |
| i9 2.4GHz 8-core | macOS | No | 630s |
| i7-8700 3.2GHz | Windows 10 | No | 863s |
For small models (MNIST CNN, 93k params), the CPU can sometimes match or beat the GPU due to data-transfer overhead; a quick timing sketch for checking this on your own machine follows the list. GPU acceleration is most beneficial for:
- Models > 1M parameters
- Batch sizes >= 64
- Training runs with many epochs
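
To see where this crossover falls on your own hardware, here is a minimal timing sketch. It is illustrative only: the random stand-in data and small dense model are assumptions, not the repo's benchmark code.

```python
# Minimal CPU-vs-GPU timing sketch (illustrative; not the repo's benchmarks).
import time

import numpy as np
import tensorflow as tf

# Random stand-in data shaped like MNIST.
x = np.random.rand(10_000, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(10_000,))

def build_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

for device in ("/CPU:0", "/GPU:0"):
    with tf.device(device):
        model = build_model()
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
        start = time.perf_counter()
        model.fit(x, y, batch_size=64, epochs=1, verbose=0)
        print(f"{device}: {time.perf_counter() - start:.1f}s per epoch")
```

If the two times are close, the model is too small for the GPU to pay off.
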
If you observe low GPU utilization during training, these are the common causes:
- **NumPy array bottleneck** - Using `model.fit(x_train, y_train)` with NumPy arrays is a major bottleneck
- **Small batch sizes** - GPU dispatch overhead doesn't amortize for small batches
- **Model too small** - GPU parallelism isn't fully utilized for models < 1M params
- **Data loading on CPU** - Pipeline not optimized for GPU
- **Use the `tf.data.Dataset` API instead of NumPy arrays:**

  ```python
  # Instead of: model.fit(x_train, y_train)
  dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
  dataset = dataset.batch(128).prefetch(tf.data.AUTOTUNE)
  model.fit(dataset)
  ```

  This can achieve up to 5x acceleration and better GPU utilization.
- **Increase batch sizes** - Apple's unified memory allows larger batches (try 256 or 512) without CPU-GPU transfer overhead
- **Use mixed precision** where supported:

  ```python
  tf.keras.mixed_precision.set_global_policy('mixed_float16')
  ```
- **Monitor GPU power** to verify the GPU is actually being utilized:

  ```bash
  sudo powermetrics --samplers gpu_power -i1000 -n1
  ```
- **For MLX:** use `mx.eval()` strategically to control lazy evaluation (see the sketch after this list)
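
As an illustration of that last point, here is a hedged sketch of an MLX training loop that forces evaluation once per epoch, matching the "eval per epoch" optimization in the tables below. The model, optimizer, loss, and `batches` iterable are placeholders, not the repo's code.

```python
# Sketch: controlling MLX's lazy evaluation with one mx.eval per epoch.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Linear(784, 10)                  # placeholder model
optimizer = optim.Adam(learning_rate=1e-3)

def loss_fn(model, x, y):
    return nn.losses.cross_entropy(model(x), y, reduction="mean")

loss_and_grad = nn.value_and_grad(model, loss_fn)

def train_epoch(batches):
    """`batches` is a placeholder iterable of (x, y) mx.array pairs."""
    for x, y in batches:
        loss, grads = loss_and_grad(model, x, y)
        optimizer.update(model, grads)
    # Materialize the whole epoch's lazy graph here, instead of
    # synchronizing after every batch.
    mx.eval(model.parameters(), optimizer.state)
```
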
Run `notebooks/optimized_benchmark.ipynb` to see the impact of these optimizations, with real benchmarks comparing naive vs. optimized implementations for both TensorFlow and MLX.

The `mlx_comparison.ipynb` notebook benchmarks Apple's MLX framework against TensorFlow Metal.
Benchmarked on M4 Pro (16-core GPU, 48GB RAM) - Naive vs Optimized:

| Model | Params | TF Naive | TF Optimized | MLX Naive | MLX Optimized | Best |
|---|---|---|---|---|---|---|
| MNIST CNN | 93K | 77.2s | 24.8s | 16.4s | 11.6s | MLX Opt |
| Fashion CNN | 412K | 95.3s | 28.2s | 28.0s | 24.1s | MLX Opt |

| Framework | Optimization | MNIST Speedup | Fashion Speedup |
|---|---|---|---|
| TensorFlow | tf.data + batch=256 | 3.1x faster | 3.4x faster |
| MLX | eval per epoch + batch=256 | 1.4x faster | 1.2x faster |
Key Insights:
- TensorFlow benefits most from optimization - tf.data.Dataset provides 3x+ speedup
- MLX is fast out of the box - Already optimized, less room for improvement
- MLX wins for small/medium models - Even optimized TensorFlow can't catch up
When to use MLX:
- Small-to-medium models (< 10M parameters) - fastest option
- Rapid prototyping on Apple Silicon
- Apple-native applications (Core ML integration)
- When you want good performance without optimization work
When to use TensorFlow Metal:
- Cross-platform deployment requirements
- Access to TensorFlow Hub / Keras ecosystem
- Production pipelines with TensorFlow Serving
- When you'll invest in tf.data optimization
- All benchmarks run 3 times, median reported (this protocol is sketched after the list)
- System was idle during benchmarks (no background tasks)
- Same model architecture across all hardware
- Data loading time excluded from measurements
- Batch sizes kept consistent (64 for MNIST, 128 for CIFAR-100)
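
For reference, here is a minimal sketch of that run-three-times, report-the-median protocol. It uses a Keras callback; `model_fn` and `dataset` are placeholders, and the repo's notebooks may measure differently.

```python
# Sketch of the timing protocol: per-epoch wall-clock time via a callback,
# median over 3 runs. Illustrative; not the repo's exact measurement code.
import statistics
import time

import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Record wall-clock time for each training epoch."""

    def on_train_begin(self, logs=None):
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.perf_counter() - self._start)

def median_epoch_time(model_fn, dataset, runs=3, epochs=3):
    """Train `runs` times; return the median of each run's median epoch time."""
    per_run = []
    for _ in range(runs):
        timer = EpochTimer()
        model_fn().fit(dataset, epochs=epochs, callbacks=[timer], verbose=0)
        per_run.append(statistics.median(timer.times))
    return statistics.median(per_run)
```
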
To contribute your own results:

- Run benchmarks on your hardware
- Add results to `benchmarks/results.json`
- Run `notebooks/benchmark_report.ipynb` to regenerate charts
- Submit a pull request
MIT
