Commit 6b43355
Update docs with latest benchmark results and blog post fixes (#78)
* Update docs with latest benchmark results and blog post fixes
- benchmark-results/index.md: All tables updated with corrected numbers,
added GPQA thinking effort ablation section
- blog technical-deep-dive: Updated budget alternatives, algorithm comparison,
selector summary, Opus+Opus fix, cumulative cost wording
- mkdocs.yml: Minor config updates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add topology assumption tradeoff for Hill Climbing vs Arm Elimination
Arm Elimination is assumption-free (uses only observed data), while
Hill Climbing requires a hand-crafted model ranking upfront. Also
split LM Proposal into its own paragraph.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add server latency column to HotpotQA and MathQA tables
Added Avg Latency (s) column to all 6 tables (Top 15, Bottom 15,
Full 81) for both 2-tuple benchmarks using server-side latency
from cache.db results.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix HotpotQA Bottom 15 rank offset (was off by 1 vs Full 81)
Kimi + Haiku 4.5 is rank 66, not 67. Bottom 15 now correctly
starts at rank 67 and matches the Full 81 table.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Sort all selector comparison tables by Mean Accuracy descending
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b989cf3 commit 6b43355
3 files changed
Lines changed: 334 additions & 314 deletions