Apple Silicon AI — Multi-engine LLM benchmark & monitoring CLI
asiai compares inference engines side-by-side on your Mac. Load the same model on Ollama and LM Studio, run asiai bench, get the numbers. No guessing, no vibes — just tok/s, TTFT, power efficiency, and stability per engine.
Share your results with the community (--share), compare against other Apple Silicon users (asiai compare), and get smart engine recommendations (asiai recommend).
Born from the OpenClaw project, where we needed hard data to pick the fastest engine for multi-agent swarms on Mac Mini M4 Pro.
pipx install asiai # Recommended: isolated install
Or via Homebrew:
brew tap druide67/tap
brew install asiai
Other options:
uvx asiai detect # Run without installing (requires uv)
pip install asiai # Standard pip install
Then benchmark and share:
asiai bench --quick --card --share # Bench + shareable card in ~15 seconds
Auto-detect running inference engines across 7 ports.
$ asiai detect
Detected engines:
● ollama 0.17.4
URL: http://localhost:11434
● lmstudio 0.4.5
URL: http://localhost:1234
Running: 1 model(s)
- qwen3.5-35b-a3b MLX
Cross-engine benchmark with standardized prompts. Runs 3 iterations per prompt by default, reports median tok/s (SPEC standard) with stability classification.
$ asiai bench -m qwen3.5 --runs 3 --power
Mac Mini M4 Pro — Apple M4 Pro RAM: 64.0 GB (42% used) Pressure: normal
Benchmark: qwen3.5
Engine tok/s (±stddev) Tokens Duration TTFT VRAM Thermal
────────── ───────────────── ───────── ────────── ──────── ────────── ──────────
lmstudio 72.6 ± 0.0 (stable) 435 6.20s 0.28s — nominal
ollama 30.4 ± 0.1 (stable) 448 15.28s 0.25s 26.0 GB nominal
Winner: lmstudio (2.4x faster)
Power: lmstudio 13.2W (5.52 tok/s/W) — ollama 16.0W (1.89 tok/s/W)
Options:
-m, --model MODEL Model to benchmark (default: auto-detect)
-e, --engines LIST Filter engines (e.g. ollama,lmstudio,mlxlm)
-p, --prompts LIST Prompt types: code, tool_call, reasoning, long_gen
-r, --runs N Runs per prompt (default: 3, for median + stddev)
--power Cross-validate power with sudo powermetrics (IOReport always-on)
--context-size SIZE Context fill prompt: 4k, 16k, 32k, 64k
--share Share results with the community (anonymous, opt-in)
-Q, --quick Quick benchmark: 1 prompt, 1 run (~15 seconds)
--card Generate shareable benchmark card (SVG locally; SVG + PNG with --share)
-H, --history PERIOD Show past benchmarks (e.g. 7d, 24h)
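The "Winner" and "Power" lines in the sample output above are simple ratios. The sketch below recomputes them from the rounded figures printed in that sample run. The function names are illustrative and not part of asiai, and because the inputs are already rounded, the efficiency values can differ from the printed ones in the last digit.

```python
def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """How many times faster the winning engine generates tokens."""
    return fast_tok_s / slow_tok_s


def efficiency(tok_s: float, watts: float) -> float:
    """Energy efficiency in tokens per second per watt."""
    return tok_s / watts


# Rounded values copied from the sample `asiai bench` output above.
results = {
    "lmstudio": {"tok_s": 72.6, "watts": 13.2},
    "ollama": {"tok_s": 30.4, "watts": 16.0},
}

# Rank engines by throughput, fastest first.
ranked = sorted(results, key=lambda name: results[name]["tok_s"], reverse=True)
winner, runner_up = ranked[0], ranked[1]

ratio = speedup(results[winner]["tok_s"], results[runner_up]["tok_s"])
print(f"Winner: {winner} ({ratio:.1f}x faster)")
for name in ranked:
    r = results[name]
    print(f"{name}: {efficiency(r['tok_s'], r['watts']):.2f} tok/s/W")
```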
Cross-model comparison — benchmark multiple models in one run and get a ranked summary:
# Cross-model comparison
asiai bench --compare qwen3.5:4b deepseek-r1:7b -e ollama --card
The runner resolves model names across engines automatically: gemma2:9b (Ollama) and gemma-2-9b (LM Studio) are matched as the same model.
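To illustrate the idea of cross-engine name matching, here is a deliberately crude sketch that compares names after lowercasing and dropping every separator. This is an assumption for illustration only, not asiai's actual resolver, and `normalize_model_name` and `same_model` are hypothetical helpers.

```python
import re


def normalize_model_name(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def same_model(a: str, b: str) -> bool:
    """Treat two engine-specific names as equal if their normalized forms match."""
    return normalize_model_name(a) == normalize_model_name(b)


print(same_model("gemma2:9b", "gemma-2-9b"))       # Ollama vs LM Studio naming
print(same_model("gemma2:9b", "deepseek-r1:7b"))   # unrelated models
```

A heuristic this simple can collide: once separators are removed, `llama3.1:8b` and `llama3:18b` both become `llama318b`. A real resolver would need to treat family, version, and parameter count as separate fields.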
List loaded models across all engines. Use --json for machine-readable output.
$ asiai models
ollama http://localhost:11434
● qwen3.5:35b-a3b 26.0 GB Q4_K_M
lmstudio http://localhost:1234
● qwen3.5-35b-a3b MLX
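For scripting, the `--json` output can be consumed from Python as well as from `jq`. The sketch below assumes the structure implied by the `jq` path `.engines[].models[].name` used elsewhere in this README, with `engines` as a list. The sample payload is hypothetical and contains only the fields that path implies; the real output may carry more.

```python
import json


def model_names(payload: str) -> list[str]:
    """Collect every model name from an `asiai models --json` style payload."""
    data = json.loads(payload)
    names = []
    for engine in data.get("engines", []):
        for model in engine.get("models", []):
            names.append(model["name"])
    return names


# Hypothetical payload shaped after the sample `asiai models` output above.
SAMPLE = """
{
  "engines": [
    {"name": "ollama", "models": [{"name": "qwen3.5:35b-a3b"}]},
    {"name": "lmstudio", "models": [{"name": "qwen3.5-35b-a3b"}]}
  ]
}
"""

print(model_names(SAMPLE))
```

In practice the payload would come from something like `subprocess.run(["asiai", "models", "--json"], capture_output=True, text=True).stdout`.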
System and inference metrics snapshot, stored in SQLite. Use --json for machine-readable output.
$ asiai monitor
System
Uptime: 3d 12h
CPU Load: 2.45 / 3.12 / 2.89 (1m / 5m / 15m)
Memory: 45.2 GB / 64.0 GB 71%
Pressure: normal
Thermal: nominal (100%)
Inference ollama 0.17.4
Models loaded: 1 VRAM total: 26.0 GB
Model VRAM Format Quant
──────────────────────────────────────── ────────── ──────── ──────
qwen3.5:35b-a3b 26.0 GB gguf Q4_K_M
Options:
-w, --watch SEC Refresh every SEC seconds
-q, --quiet Collect and store without output (for daemon use)
--json Output as JSON (for scripting)
-H, --history PERIOD Show history (e.g. 24h, 1h)
-a, --analyze HOURS Comprehensive analysis with trends
-c, --compare TS TS Compare two timestamps
--alert-webhook URL POST alerts on state transitions (memory, thermal, engine down)
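Alerting on state transitions means a webhook fires when a reading changes (for example memory pressure going from normal to a degraded state), not on every sample. The sketch below shows one way to detect such transitions. The class, the payload fields (`check`, `from`, `to`), and the `post_alert` helper are hypothetical and do not describe asiai's actual payload format.

```python
import json
import urllib.request
from dataclasses import dataclass, field


@dataclass
class TransitionDetector:
    """Remembers the last state per check and reports only changes."""

    last: dict[str, str] = field(default_factory=dict)

    def update(self, readings: dict[str, str]) -> list[dict[str, str]]:
        events = []
        for check, state in readings.items():
            previous = self.last.get(check)
            # The first reading only sets the baseline; it is not a transition.
            if previous is not None and previous != state:
                events.append({"check": check, "from": previous, "to": state})
            self.last[check] = state
        return events


def post_alert(url: str, event: dict[str, str], timeout: float = 5.0) -> int:
    """POST one event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


detector = TransitionDetector()
detector.update({"memory": "normal", "thermal": "nominal"})         # baseline
print(detector.update({"memory": "warn", "thermal": "nominal"}))    # one event
```

Reporting only changes keeps a receiving channel quiet while a condition persists, at the cost of needing a baseline sample before the first alert can fire.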
Diagnose installation, engines, system health, and database.
$ asiai doctor
Doctor
System
✓ Apple Silicon Mac Mini M4 Pro — Apple M4 Pro
✓ RAM 64 GB total, 42% used
✓ Memory pressure normal
✓ Thermal nominal (100%)
Engine
✓ Ollama v0.17.4 — 1 model(s): qwen3.5:35b-a3b
✓ LM Studio v0.4.5 — 1 model(s): qwen3.5-35b-a3b
✗ mlx-lm not installed
✗ llama.cpp not installed
✗ vllm-mlx not installed
Database
✓ SQLite 2.4 MB, last entry: 1m ago
5 ok, 0 warning(s), 3 failed
Background monitoring via macOS launchd. Collects metrics every minute.
asiai daemon start # Install and start the daemon
asiai daemon start --interval 30 # Custom interval (seconds)
asiai daemon status # Check if running
asiai daemon logs # View recent logs
asiai daemon stop # Stop and uninstall
Web dashboard with real-time monitoring, benchmark controls, and interactive charts. Requires pip install asiai[web].
asiai web # Opens browser at http://127.0.0.1:8899
asiai web --port 9000 # Custom port
asiai web --host 0.0.0.0 # Listen on all interfaces
asiai web --no-open # Don't auto-open browser
Features: system overview, engine status, live benchmark with SSE progress, history charts, doctor checks, dark/light theme.
Browse community benchmarks. Filter by chip or model.
asiai leaderboard # All results
asiai leaderboard --chip "M4 Pro" # Filter by chip
asiai leaderboard --model qwen2.5 # Filter by model
Compare your local results against community medians.
asiai compare --chip "Apple M1 Max" --model qwen2.5:7b
Get engine recommendations based on your hardware and benchmarks.
asiai recommend # Best engine for your Mac
asiai recommend --use-case latency # Optimize for TTFT
asiai recommend --model qwen2.5 --community # Include community data
Interactive setup wizard: detects hardware, engines, models, and suggests next steps.
asiai setup
Start the MCP server for AI agent integration. 11 tools, 3 resources.
asiai mcp # stdio (Claude Code, Cursor)
asiai mcp --transport sse # SSE (network agents)
Interactive terminal dashboard with auto-refresh. Requires pip install asiai[tui].
asiai tui
Generate a shareable benchmark card image with one flag:
asiai bench --card # SVG saved locally (zero dependencies)
asiai bench --card --share # SVG + PNG via community API
asiai bench --quick --card --share # Quick bench + card + share
A 1200x630 dark-themed card with your model, chip, specs banner (quantization, RAM, GPU cores, context size), engine comparison bar chart, winner highlight, and metric chips (tok/s, TTFT, power, engine version). Optimized for Reddit, X, Discord, and GitHub READMEs.
Every shared card includes asiai branding — the Speedtest.net model for local LLM inference.
| Engine | Port | Install | API |
|---|---|---|---|
| Ollama | 11434 | `brew install ollama` | Native |
| LM Studio | 1234 | `brew install --cask lm-studio` | OpenAI-compatible |
| mlx-lm | 8080 | `brew install mlx-lm` | OpenAI-compatible |
| llama.cpp | 8080 | `brew install llama.cpp` | OpenAI-compatible |
| oMLX | 8000 | `brew tap jundot/omlx && brew install omlx` | OpenAI-compatible |
| vllm-mlx | 8000 | `pip install vllm-mlx` | OpenAI-compatible |
| Exo | 52415 | `pip install exo` | OpenAI-compatible |
| Metric | Description |
|---|---|
| tok/s | Generation speed (tokens/sec), excluding prompt processing (TTFT) |
| TTFT | Time to first token — prompt processing latency |
| Power | GPU, CPU, ANE, DRAM power in watts (IOReport, no sudo) |
| tok/s/W | Energy efficiency — tokens per second per watt |
| Stability | Run-to-run variance: stable (CV<5%), variable (<10%), unstable (>10%) |
| VRAM | Memory footprint — native API (Ollama, LM Studio) or ri_phys_footprint estimate (all other engines) |
| Thermal | CPU throttling state and speed limit percentage |
All metrics stored in SQLite (~/.local/share/asiai/metrics.db) with 90-day retention and automatic regression detection.
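The stability labels in the table above are thresholds on the coefficient of variation (CV = stddev / mean). The sketch below classifies a set of runs this way. It assumes the sample standard deviation and treats a CV of exactly 10% as unstable, since the table leaves that boundary unspecified; `classify_stability` is an illustrative name, not asiai's API.

```python
import statistics


def classify_stability(tok_s_runs: list[float]) -> tuple[float, float, str]:
    """Return (median tok/s, CV in percent, stability label) for one engine."""
    median = statistics.median(tok_s_runs)
    mean = statistics.fmean(tok_s_runs)
    # A single run has no measurable spread.
    stddev = statistics.stdev(tok_s_runs) if len(tok_s_runs) > 1 else 0.0
    cv_percent = (stddev / mean) * 100 if mean else 0.0

    if cv_percent < 5:
        label = "stable"
    elif cv_percent < 10:
        label = "variable"
    else:
        label = "unstable"
    return median, cv_percent, label


# Illustrative run data, not measured results.
for runs in ([30.3, 30.4, 30.5], [93.0, 100.0, 107.0], [10.0, 20.0, 30.0]):
    median, cv, label = classify_stability(runs)
    print(f"median={median:.1f} tok/s  CV={cv:.1f}%  {label}")
```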
Following MLPerf, SPEC CPU 2017, and NVIDIA GenAI-Perf standards:
- Warmup: 1 non-timed generation per engine before measured runs
- Runs: 3 iterations per prompt (configurable), median as primary metric
- Sampling: `temperature=0` (greedy decoding) for deterministic results
- Power: Always-on via IOReport (no sudo). Per-engine, not session-wide average
- Variance: Pooled intra-prompt stddev (isolates run-to-run noise)
- Metadata: Engine version, model quantization, hardware chip, macOS version stored per result
See docs/benchmark-best-practices.md for the full conformance audit.
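Pooling the standard deviation within each prompt matters because the four prompts produce different throughput levels; a plain stddev over all runs would mix that prompt-to-prompt difference into the noise estimate. The sketch below uses the textbook pooled-variance formula, sqrt(Σ(nᵢ − 1)·sᵢ² / Σ(nᵢ − 1)). Whether asiai weights groups in exactly this way is an assumption, and the run data is made up for illustration.

```python
import math
import statistics


def pooled_stddev(runs_by_prompt: dict[str, list[float]]) -> float:
    """Pooled within-group standard deviation across prompts."""
    weighted_variance = 0.0
    degrees_of_freedom = 0
    for runs in runs_by_prompt.values():
        if len(runs) < 2:
            continue  # a single run carries no variance information
        weighted_variance += (len(runs) - 1) * statistics.variance(runs)
        degrees_of_freedom += len(runs) - 1
    if degrees_of_freedom == 0:
        return 0.0
    return math.sqrt(weighted_variance / degrees_of_freedom)


# Illustrative tok/s values: each prompt is steady, but the levels differ.
runs = {
    "code": [10.0, 12.0, 14.0],
    "long_gen": [100.0, 102.0, 104.0],
}

all_runs = [value for group in runs.values() for value in group]
print(f"pooled stddev: {pooled_stddev(runs):.2f}")
print(f"naive stddev:  {statistics.stdev(all_runs):.2f}")
```

The pooled figure stays small because each prompt is consistent with itself, while the naive figure is dominated by the gap between the two prompts.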
Four standardized prompts test different generation patterns:
| Name | Tokens | Tests |
|---|---|---|
| `code` | 512 | Structured code generation (BST in Python) |
| `tool_call` | 256 | JSON function calling / instruction following |
| `reasoning` | 384 | Multi-step math problem |
| `long_gen` | 1024 | Sustained throughput (bash script) |
Use --context-size 4k|16k|32k|64k to test with large context fill prompts instead.
When running asiai web, the REST API endpoints listed below are available for programmatic access. Interactive API documentation (Swagger UI) is available at http://localhost:8899/docs.
| Endpoint | Description |
|---|---|
| `GET /api/status` | Lightweight health check (< 500ms): engine reachability, memory pressure, thermal |
| `GET /api/snapshot` | Full system + engine snapshot with loaded models, VRAM, versions |
| `GET /api/benchmarks` | Benchmark results with tok/s, TTFT, power, context_size, engine_version |
| `GET /api/engine-history` | Engine status history (TCP, KV cache, tokens predicted) |
| `GET /api/benchmark-process` | Process CPU/RSS metrics from benchmark runs (7d retention) |
| `GET /api/metrics` | Prometheus exposition format: system, engine, model, benchmark gauges |
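A minimal standard-library client for these endpoints might look like the sketch below. The paths come from the table above and the base URL from the default `asiai web` address. The helper names are hypothetical, and the assumption that every endpoint except `/api/metrics` returns JSON is not confirmed here; the Swagger UI documents the actual response schemas.

```python
import json
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://localhost:8899"  # default address used by `asiai web`

ENDPOINTS = {
    "status": "/api/status",
    "snapshot": "/api/snapshot",
    "benchmarks": "/api/benchmarks",
    "engine-history": "/api/engine-history",
    "benchmark-process": "/api/benchmark-process",
    "metrics": "/api/metrics",
}


def endpoint_url(name: str, base: str = BASE_URL) -> str:
    """Build the full URL for a named endpoint."""
    return urljoin(base, ENDPOINTS[name])


def fetch_text(name: str, base: str = BASE_URL, timeout: float = 2.0) -> str:
    """GET an endpoint and return the raw response body."""
    with urllib.request.urlopen(endpoint_url(name, base), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_json(name: str, base: str = BASE_URL, timeout: float = 2.0):
    """GET an endpoint and parse the body as JSON (assumed for non-metrics endpoints)."""
    return json.loads(fetch_text(name, base, timeout))


# With the dashboard running:
#   status = fetch_json("status")
#   prometheus_text = fetch_text("metrics")
```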
# prometheus.yml
scrape_configs:
  - job_name: 'asiai'
    static_configs:
      - targets: ['localhost:8899']
    metrics_path: '/api/metrics'
    scrape_interval: 30s

asiai monitor --json | jq '.mem_pressure'
asiai models --json | jq '.engines[].models[].name'

- macOS on Apple Silicon (M1 / M2 / M3 / M4 families)
- Python 3.11+
- At least one inference engine running locally
The core uses only the Python standard library — urllib, sqlite3, subprocess, argparse. No requests, no psutil, no rich. Just stdlib.
Optional extras:
- `asiai[web]`: FastAPI web dashboard with charts
- `asiai[tui]`: Textual terminal dashboard
- `asiai[all]`: Web + TUI
- `asiai[dev]`: pytest, ruff
| Version | Scope | Status |
|---|---|---|
| v0.1 | detect + bench + monitor + models (CLI, stdlib) | Done |
| v0.2 | mlx-lm + doctor + daemon + TUI (Textual) | Done |
| v0.3 | 5 engines, power metrics, multi-run variance, regression detection | Done |
| v0.4 | CI, MkDocs, export JSON, thermal drift, web dashboard | Done |
| v0.5 | REST API, Prometheus /metrics, CLI --json, engine uptime tracking | Done |
| v0.6 | Multi-service LaunchAgent (`daemon start web`), `daemon status/logs/stop --all` | Done |
| v0.7 | Alert webhooks, LM Studio VRAM, Ollama config in doctor | Done |
| v1.0 | Community Benchmark DB, smart recommendations, Exo engine, leaderboard | Done |
| v1.0.1 | MCP server (11 tools), benchmark card, --quick mode, setup wizard, agent integration | Done |
| v1.2 | Web dashboard redesign, shareable cards, Share on X/Reddit, community API | Done |
| v1.3 | Dark theme, self-hosted fonts, universal VRAM (phys_footprint), power in Monitor/History | Done |
| v1.4 | Fleet mode (multi-Mac), macOS notifications, MCP prompts, bench methodology improvements | Planned |
Apache 2.0

