The Metal GPU acceleration for GGUF models is fully implemented and ready for testing.
**Real Metal GPU Inference** (`src/backends/gguf.rs`)

- Full integration with llama-cpp-2's Metal backend
- GPU layer configuration (999 layers for maximum Metal utilization; see the loading sketch below)
- Thread-safe architecture using `Arc`
- Per-inference `LlamaContext` creation to handle `!Send` types
- Complete token generation loop with greedy sampling
- Async inference via `tokio::spawn_blocking`
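As a rough illustration of the GPU-layer and context settings above, here is a minimal loading sketch. It is written against llama-cpp-2's builder API; method names such as `with_n_gpu_layers` and `with_n_ctx` follow recent crate releases and may differ from the exact code in `src/backends/gguf.rs`.

```rust
use std::num::NonZeroU32;
use std::sync::Arc;

use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::LlamaModel;

/// Sketch only: load a GGUF model with every layer offloaded to the GPU.
fn load_model(
    path: &str,
) -> Result<(Arc<LlamaBackend>, Arc<LlamaModel>), Box<dyn std::error::Error>> {
    // Global backend init; on macOS this is where the Metal device is set up.
    let backend = Arc::new(LlamaBackend::init()?);

    // 999 layers means "offload everything"; llama.cpp clamps it to the model's real layer count.
    let model_params = LlamaModelParams::default().with_n_gpu_layers(999);
    let model = Arc::new(LlamaModel::load_from_file(&backend, path, &model_params)?);

    Ok((backend, model))
}

/// Context parameters matching the defaults described in this document.
fn context_params() -> LlamaContextParams {
    LlamaContextParams::default()
        .with_n_ctx(NonZeroU32::new(2048)) // 2048-token context window
        .with_n_batch(512)                 // 512-token decode batches
}
```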
**Desktop App Integration** (`dashboard/src-tauri/src/backend_manager.rs`)

- Automatic Metal GPU enablement on macOS
- `BackendConfig` with `gpu_enabled = true` on macOS (illustrated below)
- Context size: 2048, batch size: 512
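The configuration can be pictured as something like the following; the field names are illustrative stand-ins, and the real `BackendConfig` in `backend_manager.rs` may be shaped differently.

```rust
/// Illustrative sketch of the backend configuration described above;
/// the actual struct in backend_manager.rs may use different field names.
#[derive(Debug, Clone)]
pub struct BackendConfig {
    pub gpu_enabled: bool,
    pub context_size: u32,
    pub batch_size: u32,
}

impl Default for BackendConfig {
    fn default() -> Self {
        Self {
            // Metal is only available on Apple platforms, so the GPU is
            // enabled automatically when building for macOS.
            gpu_enabled: cfg!(target_os = "macos"),
            context_size: 2048,
            batch_size: 512,
        }
    }
}
```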
**Thread Safety Architecture**

- `LlamaBackend` wrapped in `Arc` for safe sharing across threads
- `LlamaContext` created per inference within `spawn_blocking` (see the pattern sketch below)
- Proper async/await integration with blocking operations
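In code, the pattern looks roughly like this. `GgufBackend` is a simplified stand-in for the real backend type in `src/backends/gguf.rs`; the point is that the `!Send` `LlamaContext` never crosses an `.await`.

```rust
use std::sync::Arc;

use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::LlamaModel;

/// Simplified stand-in for the real GGUF backend type.
pub struct GgufBackend {
    backend: Arc<LlamaBackend>, // Send + Sync, shared across tasks
    model: Arc<LlamaModel>,     // Send + Sync, shared across tasks
}

impl GgufBackend {
    pub async fn infer(&self, prompt: String) -> Result<String, String> {
        let backend = Arc::clone(&self.backend);
        let model = Arc::clone(&self.model);

        // The LlamaContext is !Send, so it is created and dropped entirely
        // inside the blocking task instead of being held across an .await.
        tokio::task::spawn_blocking(move || {
            let mut _ctx = model
                .new_context(&backend, Default::default())
                .map_err(|e| e.to_string())?;

            // ... the synchronous token-generation loop runs here (see the
            // inference-loop sketch later in this document) ...
            Ok(format!("(generated text for: {prompt})"))
        })
        .await
        .map_err(|e| e.to_string())?
    }
}
```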
```bash
# Create models directory
mkdir -p models && cd models

# Download TinyLlama (638MB, fast inference)
hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--local-dir .
cd ..

# Build with GGUF support (first build takes 3-5 minutes)
cargo build --features gguf --release
```
```bash
# Run inference with Metal GPU
INFERNO_MODELS_DIR="models" \
RUST_LOG=info \
./target/release/inferno run \
--model tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--prompt "What is 2+2? Explain briefly." \
--backend gguf
```

Look for these indicators in the output:

- `ggml_metal_init: loaded Metal backend` (Metal backend initialized)
- `ggml_metal_add_buffer: allocated 'data' buffer` (GPU memory allocated)
- `llama_new_context_with_model: n_ctx = 2048` (context created)
- GPU layer messages showing layers being offloaded to Metal
```bash
# Benchmark tokens/second with Metal GPU
INFERNO_MODELS_DIR="models" \
./target/release/inferno bench \
--model tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--backend gguf
```

Expected Performance (TinyLlama 1.1B Q4_K_M):
- M1 Max: ~60-80 tokens/sec
- M2 Max: ~80-100 tokens/sec
- M3 Max: ~100-120 tokens/sec
- M4 Max: ~120-150 tokens/sec
Run the Metal-specific integration test:
```bash
# Run Metal GPU integration test
cargo test --test metal_gpu_test --features gguf -- --nocapture

# Expected output:
# ✅ Metal GPU inference successful!
# Prompt: What is 2+2?
# Response: [model response]
```

To test through the desktop app:

```bash
cd dashboard

# Install dependencies (first time only)
npm install

# Run development build with Metal GPU
npm run tauri dev
```

Then, in the running app:

- Open the Models panel
- Click "Discover Models" - should find TinyLlama
- Click "Load Model" - backend will automatically enable Metal GPU on macOS
- Go to Inference panel
- Enter prompt: "What is 2+2?"
- Click "Run Inference"
- Check Activity Log for Metal initialization messages
```bash
# In another terminal, monitor GPU usage
sudo powermetrics --samplers gpu_power -i 1000

# Look for:
# - GPU active residency increasing during inference
# - ANE (Apple Neural Engine) usage if available
# - Power consumption spike during inference
```

To list the models Inferno can see:

```bash
# List available models
INFERNO_MODELS_DIR="models" ./target/release/inferno models list
```

Make sure you built with the `gguf` feature:

```bash
cargo build --features gguf --release
```

Check that the Metal GPU is actually being used:
```bash
# Enable debug logging
RUST_LOG=debug ./target/release/inferno run ...

# Look for:
# - "Using N GPU layers" where N > 0
# - "Metal backend loaded successfully"
```

Reduce the context size in the config:

```toml
# .inferno.toml
[backend_config]
context_size = 1024  # Reduce from 2048
batch_size = 256     # Reduce from 512
```

To compare CPU-only and Metal GPU performance, time the same prompt with the GPU disabled and enabled:

```bash
# CPU only (gpu_enabled = false)
time ./target/release/inferno run \
--model tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--prompt "Write a short poem" \
--backend gguf
# GPU enabled (default on macOS)
time ./target/release/inferno run \
--model tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--prompt "Write a short poem" \
--backend gguf
```

Expected speedup: 3-5x faster with Metal GPU vs CPU-only.
- ✅ Metal GPU implementation complete
- ⏳ Add temperature-based sampling (currently using greedy)
- ⏳ True streaming with tokio channels from spawn_blocking
- ⏳ GPU memory monitoring in desktop UI
- ⏳ Performance profiling across different Apple Silicon chips
- ⏳ Benchmark against larger models (7B, 13B, 70B)
- Initialization: LlamaBackend::init() creates global backend (GPU setup)
- Model Loading: model loaded with `n_gpu_layers = 999` (all layers on GPU)
- Context Creation: per-inference context with configurable size
- Inference Loop (sketched below):
  - Tokenize input with `AddBos::Always`
  - Create a `LlamaBatch` and add the tokens
  - Decode the batch (runs on the Metal GPU)
  - Sample the next token (greedy: pick the highest probability)
  - Repeat until EOS or `max_tokens` is reached
- Cleanup: Context dropped after inference, memory freed
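A condensed version of that loop is sketched below. It follows llama-cpp-2's API, but method names and signatures (`candidates_ith`, `sample_token_greedy`, `token_to_str`) have shifted between crate releases, so treat this as an outline of the flow rather than the exact code in `src/backends/gguf.rs`.

```rust
use llama_cpp_2::context::LlamaContext;
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::{AddBos, LlamaModel};
use llama_cpp_2::token::data_array::LlamaTokenDataArray;

/// Sketch of the greedy decoding loop described above.
fn generate(
    ctx: &mut LlamaContext<'_>,
    model: &LlamaModel,
    prompt: &str,
    max_tokens: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    // 1. Tokenize the prompt, prepending the BOS token.
    let tokens = model.str_to_token(prompt, AddBos::Always)?;

    // 2. Fill a batch with the prompt tokens; only the last one needs logits.
    let mut batch = LlamaBatch::new(512, 1);
    let last = tokens.len() - 1;
    for (i, token) in tokens.into_iter().enumerate() {
        batch.add(token, i as i32, &[0], i == last)?;
    }

    let mut output = String::new();
    let mut pos = (last + 1) as i32;

    for _ in 0..max_tokens {
        // 3. Decode the batch; with n_gpu_layers = 999 this runs on the Metal GPU.
        ctx.decode(&mut batch)?;

        // 4. Greedy sampling: take the highest-probability candidate token.
        let candidates = ctx.candidates_ith(batch.n_tokens() - 1);
        let data = LlamaTokenDataArray::from_iter(candidates, false);
        let next = ctx.sample_token_greedy(data);

        // 5. Stop at end-of-sequence, otherwise append the piece and continue.
        if next == model.token_eos() {
            break;
        }
        output.push_str(&model.token_to_str(next)?);

        // 6. Feed the sampled token back in as a single-token batch.
        batch.clear();
        batch.add(next, pos, &[0], true)?;
        pos += 1;
    }

    Ok(output)
}
```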
- LlamaBackend: Wrapped in Arc, shared across async boundaries
- LlamaModel: Wrapped in Arc, cloned for spawn_blocking
- LlamaContext: Created inside spawn_blocking (avoids Send issues)
- Inference: Runs in blocking thread pool, returns via Result
- TinyLlama 1.1B Q4_K_M: ~700MB
- Context (2048): ~256MB
- Batch (512): ~64MB
- Total: ~1GB for small model
For larger models:
- 7B Q4_K_M: ~4GB
- 13B Q4_K_M: ~8GB
- 70B Q4_K_M: ~40GB (requires M3/M4 Max/Ultra with 64GB+ unified memory)