# WP-Bench Test Run
A quick test of the benchmark harness against 12 models using the wp-core-v1 dataset.
## Results
| Model | Knowledge | Correctness | Overall |
|---|---|---|---|
| claude-sonnet-4-5-20250929 | 88.1% | 47.9% | 45.6% |
| gpt-5.2 | 90.5% | 44.4% | 44.9% |
| deepseek/deepseek-reasoner | 83.3% | 48.6% | 44.4% |
| gpt-5-mini | 83.3% | 43.8% | 42.5% |
| xai/grok-4-1-fast-reasoning | 85.7% | 41.7% | 42.4% |
| claude-opus-4-5-20251101 | 71.4% | 50.0% | 41.4% |
| gemini/gemini-3-flash-preview | 71.4% | 47.9% | 40.6% |
| deepseek/deepseek-chat | 71.4% | 46.5% | 40.0% |
| xai/grok-4-1-fast-non-reasoning | 76.2% | 41.7% | 39.5% |
| groq/llama-3.3-70b-versatile | 81.0% | 35.4% | 38.5% |
| gpt-3.5-turbo | 73.8% | 27.1% | 33.0% |
| groq/llama-3.1-8b-instant | 76.2% | 20.8% | 31.2% |
Dataset: wp-core-v1 (42 knowledge + 24 execution tests)
## Takeaways
- Frontier models cluster around 40-46% overall
- Knowledge scores are generally strong (70-90%); correctness is the differentiator
- Clear tier gap between frontier and smaller models