Add physics benchmark framework with 25 tasks across 5 subfields (#86)
madeleinesong wants to merge 4 commits into main from
Define BenchmarkTask, BenchmarkSuite, Reference, and supporting enums (Difficulty, TaskType, OutputFormat) as the foundation for a systematic physics benchmark. Each task captures problem statement, classification metadata, expected answers, and paper provenance. ENG-437
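The shape of those dataclasses and enums might look like the following minimal sketch. The class and enum names come from the commit message; the specific fields (`problem_statement`, `expected_answer`, etc.) and enum members are illustrative assumptions, not the actual schema.py contents.

```python
# Minimal sketch of the benchmark schema. BenchmarkTask, Reference, Difficulty,
# and TaskType are named in the PR; field names and enum members are assumptions.
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List

class Difficulty(Enum):
    INTRODUCTORY = "introductory"
    INTERMEDIATE = "intermediate"
    ADVANCED = "advanced"

class TaskType(Enum):
    DERIVATION = "derivation"
    CALCULATION = "calculation"
    DIMENSIONAL_ANALYSIS = "dimensional_analysis"
    LIMITING_CASES = "limiting_cases"
    ESTIMATION = "estimation"

@dataclass
class Reference:
    title: str
    authors: str
    section: str = ""

@dataclass
class BenchmarkTask:
    task_id: str
    subfield: str
    difficulty: Difficulty
    task_type: TaskType
    problem_statement: str
    expected_answer: str
    references: List[Reference] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Serialize to a JSON-compatible dict (enums become their values)."""
        d = asdict(self)
        d["difficulty"] = self.difficulty.value
        d["task_type"] = self.task_type.value
        return d
```

Storing enum values as strings keeps the JSON files human-editable while the dataclass layer enforces valid values on load.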
25 tasks covering QFT (5), GR & cosmology (5), statistical mechanics (5), condensed matter (4), classical mechanics (3), and quantum information (3). Each task specifies problem statement, classification metadata (difficulty, type, subfield), expected answer, verification hints, and paper/textbook references. Difficulty ranges from introductory to advanced. Task types include derivation, calculation, dimensional analysis, limiting cases, and estimation. ENG-437
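A single task file under this scheme might look like the following. The problem, key names, and wording are hypothetical examples built from the fields listed above, not content taken from the actual `benchmarks/tasks/` files.

```python
import json

# Hypothetical example of one task JSON file; keys mirror the fields the
# commit message lists (statement, metadata, answer, hints, reference),
# but the exact names and problem text are assumptions.
example_task = {
    "task_id": "stat-mech-001",
    "subfield": "statistical_mechanics",
    "difficulty": "introductory",
    "task_type": "calculation",
    "problem_statement": (
        "Compute the partition function of a two-level system with "
        "energies 0 and epsilon at temperature T."
    ),
    "expected_answer": "Z = 1 + exp(-epsilon / (k_B * T))",
    "verification_hints": ["Check the T -> 0 and T -> infinity limits."],
    "references": [
        {"title": "Statistical Mechanics", "authors": "R. K. Pathria",
         "section": "Ch. 3"}
    ],
}

# Tasks must survive a JSON round trip unchanged.
assert json.loads(json.dumps(example_task)) == example_task
```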
- loader.py: discovers task JSON files, loads individual and combined suites, supports filtering by subfield/difficulty/type, and provides an inventory summary.
- runner.py: model-agnostic benchmark runner that formats task prompts, invokes a caller-provided model function, and collects structured results with timing. Includes a report formatter.

ENG-437
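The discovery-and-filter half of the loader could be sketched as below. The function names and the dict-based task representation are assumptions for illustration, not the actual loader.py API.

```python
# Sketch of loader-style discovery and filtering; function names and the
# plain-dict task representation are assumptions, not the real loader.py.
from pathlib import Path
from typing import Iterable, List, Optional

def discover_task_files(tasks_dir: Path) -> List[Path]:
    """Find all task JSON files under the tasks directory, sorted for stability."""
    return sorted(tasks_dir.glob("*.json"))

def filter_tasks(
    tasks: Iterable[dict],
    subfield: Optional[str] = None,
    difficulty: Optional[str] = None,
    task_type: Optional[str] = None,
) -> List[dict]:
    """Keep only tasks matching every criterion that was provided."""
    kept = []
    for t in tasks:
        if subfield is not None and t.get("subfield") != subfield:
            continue
        if difficulty is not None and t.get("difficulty") != difficulty:
            continue
        if task_type is not None and t.get("task_type") != task_type:
            continue
        kept.append(t)
    return kept
```

Treating `None` as "no constraint" lets callers combine any subset of the three filters in one call.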
40 tests covering:
- Schema roundtrip serialization (Reference, BenchmarkTask, BenchmarkSuite)
- All enum values (TaskType, Difficulty, OutputFormat)
- Suite filtering by subfield, difficulty, and task type
- Suite save/load to JSON files
- Task file consistency: unique IDs, required fields, valid subfields
- Cross-file coverage requirements (subfields, difficulties, task types)
- Prompt formatting (includes problem info, excludes answers/hints)
- Runner execution with mock models (success, error, suite-level)
- Report formatting and result serialization

ENG-437
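A consistency check in the style the test list describes (unique IDs, required fields) might look like this sketch; the helper name and the required-field set are assumptions, not the actual test code.

```python
# Sketch of a task-file consistency check as described in the test list.
# The function name and REQUIRED_FIELDS set are illustrative assumptions.
from typing import List

REQUIRED_FIELDS = {"task_id", "subfield", "difficulty", "problem_statement"}

def check_task_consistency(tasks: List[dict]) -> None:
    """Assert that task IDs are unique and every required field is present."""
    ids = [t["task_id"] for t in tasks]
    assert len(ids) == len(set(ids)), "duplicate task IDs"
    for t in tasks:
        missing = REQUIRED_FIELDS - t.keys()
        assert not missing, f"task {t.get('task_id')} missing {missing}"
```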
Summary
benchmarks/ package with schema, loader, runner, and 25 benchmark tasks for evaluating AI physics reasoning.

Details
- Schema (benchmarks/schema.py): BenchmarkTask, BenchmarkSuite, and Reference dataclasses with enums for Difficulty, TaskType, and OutputFormat. Full JSON roundtrip support.
- Tasks (benchmarks/tasks/*.json): 25 problems sourced from standard textbooks and papers (Peskin & Schroeder, Misner/Thorne/Wheeler, Pathria, Ashcroft & Mermin, Nielsen & Chuang, Goldstein, etc.). Each task includes problem statement, given information, assumptions, conventions, expected answer, verification hints, and reference citation.
- Loader (benchmarks/loader.py): Discovers task files, loads combined suites, filters by subfield/difficulty/type, and provides an inventory summary.
- Runner (benchmarks/runner.py): Formats task prompts (excluding answers and hints), invokes a caller-provided model function, collects TaskResult objects with timing and error handling, and generates summary reports.

Closes #45
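The model-agnostic runner design described above — a caller-provided model function, timed execution, and structured results — can be sketched as follows. The `TaskResult` fields, `run_task` signature, and prompt format are assumptions for illustration, not the actual runner.py API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Sketch of a model-agnostic runner: the "model" is any callable from
# prompt string to answer string. Field names are illustrative assumptions.
@dataclass
class TaskResult:
    task_id: str
    answer: str
    elapsed_s: float
    error: Optional[str] = None

def run_task(task: dict, model: Callable[[str], str]) -> TaskResult:
    """Format the prompt (answers and hints excluded), call the model, time it."""
    prompt = f"Problem: {task['problem_statement']}"
    start = time.perf_counter()
    try:
        answer = model(prompt)
        return TaskResult(task["task_id"], answer, time.perf_counter() - start)
    except Exception as exc:  # capture model failures as structured results
        return TaskResult(task["task_id"], "", time.perf_counter() - start,
                          error=str(exc))

# Usage with a mock model, as in the tests described above:
result = run_task(
    {"task_id": "t1", "problem_statement": "Estimate the order of magnitude."},
    lambda prompt: "42",
)
```

Accepting a bare callable keeps the runner decoupled from any particular model SDK: real models, cached responses, and mocks all plug in the same way.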
ENG-437
Test plan
- tests/test_benchmarks.py
- benchmarks/tasks/

🤖 Generated with Claude Code