Status: ✅ Production Ready
Inferno now features full Metal GPU acceleration on Apple Silicon, delivering a 13x performance improvement over CPU-only inference.
| Configuration | Throughput | Performance Rating |
|---|---|---|
| CPU Only (M4 Max) | 15 tok/s | Baseline |
| Metal GPU (M4 Max) | 198 tok/s | Excellent (13x speedup) |
Test Model: TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf (638 MB)
Hardware: Apple M4 Max (Metal 3, MTLGPUFamilyApple9)
- Backend: llama-cpp-2 with native Metal support
- GPU Layers: 999 (maximum offloading)
- Thread Safety: Arc-based shared ownership
- Context Strategy: Per-inference context creation in spawn_blocking
- Memory Management: Unified memory architecture
```
Metal KV Cache:    44 MiB
Metal Compute:     66.5 MiB
Model Buffer:      636.18 MiB (on Metal GPU)
Total GPU Usage:   ~747 MiB

✅ Offloaded: 23/23 layers (100%)
All layers assigned to device: Metal

Device: Apple M4 Max
GPU Family: MTLGPUFamilyApple9 (1009)
Metal Version: Metal 3 (5001)
Unified Memory: true
Flash Attention: enabled (auto)
```
Inferno uses llama-cpp-2 Rust bindings over llama.cpp, so Metal inference performance is directly comparable to other llama.cpp-based tools.
| Tool | Throughput | Notes |
|---|---|---|
| Inferno | 198 tok/s | Measured — this machine |
| llama.cpp (CLI) | ~190–210 tok/s | Uses same Metal backend; within measurement noise |
| LM Studio | ~170–200 tok/s | llama.cpp backend + GUI overhead |
| Ollama | ~155–180 tok/s | llama.cpp backend + HTTP/JSON server layer |
Note: llama.cpp and LM Studio values are community-reported ranges from public benchmarks (llama.cpp GitHub Discussions, r/LocalLLaMA). Exact values vary by model load, system load, and llama.cpp version. Inferno's figure is measured on this machine.
Inferno performs at parity with native llama.cpp and ahead of tools that add a server layer (Ollama) or GUI overhead (LM Studio), while providing a native Rust CLI and library API.
The following table projects expected throughput across Apple Silicon variants based on each chip's memory bandwidth (the primary throughput bottleneck for LLM inference). The M4 Max measured result (198 tok/s) is used as the anchor.
These are estimates, not measured values. Actual performance also depends on software efficiency improvements in newer chip generations. Community results welcome — see Contributing Results below.
| Chip | GPU Cores | Memory BW | Est. Throughput | Status |
|---|---|---|---|---|
| M1 | 7–8 | 68 GB/s | ~25 tok/s | Not tested |
| M1 Pro | 14–16 | 200 GB/s | ~73 tok/s | Not tested |
| M1 Max | 24–32 | 400 GB/s | ~146 tok/s | Not tested |
| M1 Ultra | 48–64 | 800 GB/s | ~292 tok/s | Not tested |
| M2 | 8–10 | 100 GB/s | ~36 tok/s | Not tested |
| M2 Pro | 16–19 | 200 GB/s | ~73 tok/s | Not tested |
| M2 Max | 30–38 | 400 GB/s | ~146 tok/s | Not tested |
| M2 Ultra | 60–76 | 800 GB/s | ~292 tok/s | Not tested |
| M3 | 8–10 | 100 GB/s | ~36 tok/s | Not tested |
| M3 Pro | 14–18 | 150 GB/s | ~55 tok/s | Not tested |
| M3 Max | 30–40 | 300 GB/s | ~109 tok/s | Not tested |
| M4 | 8–10 | 120 GB/s | ~44 tok/s | Not tested |
| M4 Pro | 14–20 | 273 GB/s | ~99 tok/s | Not tested |
| M4 Max | 32–40 | 546 GB/s | 198 tok/s | ✅ Measured |
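The estimates above follow from simple linear scaling. As a concrete illustration (our sketch, not part of the Inferno codebase; the constant and function names are ours), the following reproduces the table by scaling the measured M4 Max anchor by the bandwidth ratio:

```rust
// Bandwidth-proportional throughput estimate, anchored at the
// measured M4 Max result. Illustrative only.
const ANCHOR_TOK_S: f64 = 198.0; // measured on M4 Max
const ANCHOR_BW_GBS: f64 = 546.0; // M4 Max memory bandwidth (GB/s)

fn estimated_tok_s(bandwidth_gbs: f64) -> f64 {
    ANCHOR_TOK_S * (bandwidth_gbs / ANCHOR_BW_GBS)
}

fn main() {
    // M1 Pro: 198 * (200 / 546) ≈ 73 tok/s, matching the table row
    println!("M1 Pro estimate: ~{:.0} tok/s", estimated_tok_s(200.0));
}
```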
Estimated throughput on the M4 Max by model size:
| Model Size | Quantization | Est. Throughput |
|---|---|---|
| ~1B params | Q4_K_M | ~200 tok/s |
| ~3B params | Q4_K_M | ~120 tok/s |
| ~7B params | Q4_K_M | ~60 tok/s |
| ~13B params | Q4_K_M | ~30 tok/s |
| ~70B params | Q4_K_M | ~5–8 tok/s |
Note: Actual performance varies by quantization type, context length, and batch size.
`.inferno.toml`:

```toml
[backend_config]
gpu_enabled = true   # Metal GPU acceleration
context_size = 2048
batch_size = 512
```

The desktop app automatically enables Metal GPU on macOS:
```rust
// dashboard/src-tauri/src/backend_manager.rs
let backend_config = BackendConfig {
    gpu_enabled: cfg!(target_os = "macos"), // Auto-enable on macOS
    batch_size: 512,
    ...
};
```

```bash
# GPU-accelerated inference (default on macOS)
cargo run --release -- run \
  --model test_models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf \
  --prompt "Explain quantum computing"
# Expected: ~198 tok/s on M4 Max
```

```bash
cd dashboard
npm run tauri dev
# Metal GPU automatically enabled
# GPU status visible in System Info panel
```

Look for these log indicators:

```
🎯 GGUF backend - GPU enabled: true, GPU layers: 999
load_tensors: offloaded 23/23 layers to GPU
Metal_Mapped model buffer size = 636.18 MiB
```
| Metric | Value |
|---|---|
| Total tokens | 750 |
| Total time | 3.79s |
| Throughput | 198.1 tok/s |
| Avg latency per token | 5.05ms |
| Quality | Excellent |
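A quick arithmetic check (ours, not output of the benchmark harness) confirms the figures are self-consistent:

```rust
fn main() {
    let tokens = 750.0_f64;
    let seconds = 3.79_f64;
    // 750 tokens / 3.79 s ≈ 197.9 tok/s, matching the reported ~198 tok/s
    println!("throughput: {:.1} tok/s", tokens / seconds);
    // 3790 ms / 750 tokens ≈ 5.05 ms average latency per token
    println!("latency: {:.2} ms/token", seconds * 1000.0 / tokens);
}
```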
If you have a different Apple Silicon chip, please run the standardized benchmark and share your results in issue #7.
```bash
# Run standardized benchmark and export JSON
./scripts/benchmark_metal.sh /path/to/your-model.gguf

# Or manually:
cargo run --release -- bench \
  --model /path/to/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf \
  --iterations 10 \
  --warmup 3 \
  --tokens 100 \
  --output-json my_results.json
```

Please include:
- Chip model (e.g., M2 Pro, M3 Max)
- GPU core count
- Unified memory size
- Model tested (name + quantization)
- Throughput (tok/s)
- GPU memory usage (from Activity Monitor > GPU tab)
- macOS version
```rust
// Initialize global backend (thread-safe)
let backend = Arc::new(LlamaBackend::init()?);

// Load model with GPU layers
let model_params = LlamaParams::default()
    .with_n_gpu_layers(999); // Maximum GPU offloading
let model = Arc::new(LlamaModel::load_from_file(&backend, path, &model_params)?);
```

```rust
// Create context on demand (handles the !Send constraint: the context
// never crosses threads; only the Arc-wrapped model and backend do)
tokio::task::spawn_blocking(move || {
    let mut context = model.new_context(&backend, ctx_params)?;

    // Greedy sampling: pick the highest-probability candidate token
    let candidates: Vec<_> = context.candidates().collect();
    let next_token = candidates
        .iter()
        .max_by(|a, b| a.p().partial_cmp(&b.p()).unwrap())
        .expect("candidate list is non-empty")
        .id();

    // ... generate tokens
})
```

- ✅ macOS (Apple Silicon): Full Metal GPU acceleration
- ✅ macOS (Intel): CPU fallback
- ✅ Linux: CPU-only (CUDA/ROCm support planned)
- ✅ Windows: CPU-only (DirectML support planned)
- ✅ GGUF (all quantizations: Q4, Q5, Q6, Q8)
- ⏳ ONNX (CPU-only currently)
- ✅ M4 Max (primary testing)
- ✅ M3/M2/M1 (expected to work, not yet verified — see Contributing Results)
- Temperature-based sampling (currently greedy only; a sketch follows this list)
- True streaming with tokio channels
- GPU memory monitoring in desktop UI
- Performance profiling dashboard
- Multi-model GPU sharing
- Dynamic layer offloading based on available memory
- CUDA support for Linux/Windows
- DirectML support for Windows
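For the temperature-based sampling item, here is a minimal sketch of one standard approach, written against plain logits rather than the llama-cpp-2 candidate API (the function and the `rand` dependency are illustrative assumptions, not Inferno code):

```rust
use rand::Rng; // assumed dependency for this sketch

/// Sample a token index from `logits` after applying `temperature`.
/// Temperature near 0 approaches greedy; higher values flatten the distribution.
fn sample_with_temperature(logits: &[f32], temperature: f32) -> usize {
    // Scale logits by temperature, then softmax (max-subtracted for stability).
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Draw from the resulting categorical distribution.
    let mut r = rand::thread_rng().gen_range(0.0..total);
    for (i, w) in weights.iter().enumerate() {
        if r < *w {
            return i;
        }
        r -= w;
    }
    weights.len() - 1 // floating-point fallback
}
```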
Symptoms: Logs show `GPU enabled: false` or `GPU layers: 0`

Solutions:
- Check `.inferno.toml`: `gpu_enabled = true`
- Verify Apple Silicon: `uname -m` should show `arm64`
- Check system info: `system_profiler SPDisplaysDataType | grep Metal`
Symptoms: Throughput < 50 tok/s on Apple Silicon

Solutions:
- Verify GPU offloading: look for "offloaded X/X layers to GPU" in the logs
- Increase batch size: set `batch_size = 512` in config
- Reduce context size: try `context_size = 2048` or lower
- Check thermal throttling: monitor Activity Monitor > GPU tab
Symptoms: llama-cpp-2 compilation fails

Solutions:
- Install or update the Xcode Command Line Tools: `xcode-select --install`
- Update Rust: `rustup update stable`
- Clean rebuild: `cargo clean && cargo build --release`
- llama-cpp-2 Documentation
- Apple Metal Documentation
- GGUF Format Specification
- Apple Silicon Memory Bandwidth Specs
Implementation based on:
- llama.cpp by Georgi Gerganov
- llama-cpp-2 Rust bindings by utilityai
- Metal Performance Shaders by Apple
Last Updated: 2026-04-02 · Version: Inferno v0.10.7-dev · Status: Production Ready ✅