feat(benchmarks): competitive comparison docs and JSON output (closes #7)#18

Merged
ringo380 merged 3 commits into main from feature/metal-benchmarking-docs
Apr 3, 2026

Conversation


@ringo380 ringo380 commented Apr 2, 2026

Summary

  • Updates METAL_GPU_RESULTS.md with a competitive comparison table (Inferno vs llama.cpp, Ollama, LM Studio on M4 Max), projected throughput across all Apple Silicon variants based on memory bandwidth scaling, and a community contribution guide
  • Adds --output-json <file> flag to cargo run -- bench for machine-readable result export
  • Expands the ignored test_inference_performance test to report the full metric set required by Test Metal GPU Performance Across Apple Silicon Variants #7: hardware info, load time, per-run throughput, latency stats
  • Adds scripts/benchmark_metal.sh — one-command standardized benchmark that writes a JSON results file contributors can share in Test Metal GPU Performance Across Apple Silicon Variants #7

Test plan

  • cargo test --lib — 900 tests pass
  • cargo test --test metal_performance_tests --features gguf — 4 pass, 1 ignored (requires real model)
  • Manual: cargo run --release -- bench --model <gguf> --output-json /tmp/result.json to verify JSON output
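
For reference, a sketch of what an --output-json export might contain. Only hostname, total_memory_gb, and os_version are field names confirmed by this PR; every other key and value below is an illustrative assumption, not the actual schema:

```json
{
  "hostname": "m4-max.local",
  "total_memory_gb": 64.0,
  "os_version": "macOS 15.3",
  "model": "example-8b-q4_k_m.gguf",
  "load_time_ms": 1840,
  "runs": [
    { "tokens_per_second": 62.4, "latency_ms": 512.8 }
  ]
}
```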

ringo380 added 2 commits April 2, 2026 17:32
Run npm audit fix to patch all non-breaking security issues:
- next: 15.5.11 → 15.5.14 (HTTP request smuggling, image cache growth)
- axios: DoS via __proto__ in mergeConfig
- socket.io-parser: unbounded binary attachments
- serialize-javascript: RCE via RegExp/Date
- storybook: dev server WebSocket hijacking
- lodash: prototype pollution + code injection
- brace-expansion, minimatch, picomatch, ajv: ReDoS vulnerabilities

10 low-severity vulnerabilities remain in deep transitive chains
(elliptic via @storybook/nextjs, @tootallnate/once via jest-environment-jsdom)
that require breaking changes to fix and affect dev tooling only.

Closes #2
… issue #7

- Update METAL_GPU_RESULTS.md with competitive comparison table (vs
  llama.cpp, Ollama, LM Studio), projected throughput across all Apple
  Silicon variants based on memory bandwidth scaling, and a community
  contribution guide for other hardware owners
- Add --output-json flag to bench CLI for machine-readable result export
- Expand ignored test_inference_performance to report full metric set
  (hardware info, load time, per-run breakdown, latency stats)
- Add scripts/benchmark_metal.sh for standardized one-command benchmarking

Closes #7

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35872e6a44


echo "Tokens: $TOKENS per iteration"
echo ""

cargo run --release -- bench \

P1: Add the gguf feature flag when invoking the benchmark command

The new benchmark script calls cargo run --release -- bench without enabling any model backend feature, so a default checkout builds with default features only (which exclude gguf), and bench --model *.gguf cannot resolve a backend at runtime. The advertised one-command Metal benchmark therefore fails unless users manually add features; invoke Cargo with --features gguf (or an equivalent feature set) here so the script works out of the box.
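
A minimal sketch of the suggested fix; MODEL and OUTPUT_FILE are assumed variable names for illustration, not identifiers from the actual script:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical patch to scripts/benchmark_metal.sh: build with the gguf
# backend feature enabled so `bench --model *.gguf` can resolve a backend.
MODEL="${MODEL:-model.gguf}"
OUTPUT_FILE="${OUTPUT_FILE:-/tmp/result.json}"

cargo run --release --features gguf -- bench \
  --model "$MODEL" \
  --output-json "$OUTPUT_FILE"
```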


let mut sys = System::new_all();
sys.refresh_all();
let hostname = sys.host_name().unwrap_or_else(|| "unknown".to_string());
let total_memory_gb = sys.total_memory() as f64 / 1_073_741_824.0;

P2: Convert sysinfo memory units before reporting total RAM

sysinfo in this repo is pinned to 0.29.x, where total_memory() reports KiB; dividing by 1_073_741_824.0 treats the value as bytes and underreports RAM by roughly 1024x (typical 32–64 GB machines print near 0.0 GB). Because this test output is intended for benchmark submissions, the hardware metadata is misleading unless the conversion uses KiB → GiB scaling.

Useful? React with 👍 / 👎.
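
A minimal sketch of the suggested conversion, assuming sysinfo 0.29's KiB units; the function name kib_to_gib is mine, not from the codebase:

```rust
/// Convert sysinfo 0.29's KiB figure to GiB. The correct divisor is
/// 1024^2 (KiB -> GiB), not the 1024^3 bytes -> GiB divisor the test used.
fn kib_to_gib(kib: u64) -> f64 {
    kib as f64 / 1_048_576.0
}
```

A 64 GiB machine reports 67_108_864 KiB, which this maps back to 64.0; the bytes divisor would have printed roughly 0.06 instead.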

- Fix inverted token heuristic in performance test: was dividing by 0.75
  (inflating count ~33%); now uses chars/4 to match bench CLI exactly
- Fix median calculation to handle partial inference failures: use the
  proper odd/even formula instead of indexing len()/2, which is wrong for
  even-length arrays
- Add hardware fields to BenchmarkJsonResult (hostname, total_memory_gb,
  os_version) so JSON exports are self-contained for issue #7 contributors
- Validate --output-json parent directory exists before running benchmark
  to avoid a cryptic OS error after a full run completes
- Fix sysinfo memory conversion to divide by 1_048_576.0 (KiB → GiB),
  since sysinfo 0.29 reports total_memory() in KiB
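
The token-heuristic and median fixes described above can be sketched as follows; estimate_tokens and median are illustrative names, not the repo's actual identifiers:

```rust
/// chars/4 heuristic, matching the bench CLI (the old test divided by
/// 0.75, inflating the count by about a third).
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Median of an already-sorted slice, correct for both odd and even
/// lengths (plain `sorted[len / 2]` is off for even-length input).
fn median(sorted: &[f64]) -> Option<f64> {
    match sorted.len() {
        0 => None,
        n if n % 2 == 1 => Some(sorted[n / 2]),
        n => Some((sorted[n / 2 - 1] + sorted[n / 2]) / 2.0),
    }
}
```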
@ringo380 ringo380 merged commit 95e7998 into main Apr 3, 2026
7 of 20 checks passed
@ringo380 ringo380 deleted the feature/metal-benchmarking-docs branch April 3, 2026 00:06