
Add physics benchmark framework with 25 tasks across 6 subfields #86

Open

madeleinesong wants to merge 4 commits into main from eng-437-create-benchmark-tasks

Conversation

@madeleinesong
Collaborator

Summary

  • Adds a benchmarks/ package with schema, loader, runner, and 25 benchmark tasks for evaluating AI physics reasoning
  • Tasks span six subfields (QFT, GR & cosmology, statistical mechanics, condensed matter, classical mechanics, and quantum information) at introductory through advanced difficulty
  • Task types include derivation, calculation, dimensional analysis, limiting cases, and estimation
  • Model-agnostic runner accepts any callable, formats prompts that exclude answers, and collects structured results with timing
  • 40 new tests validate schema serialization, task file consistency, prompt formatting, and runner behavior

Details

Schema (benchmarks/schema.py): BenchmarkTask, BenchmarkSuite, Reference dataclasses with enums for Difficulty, TaskType, OutputFormat. Full JSON roundtrip support.
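The schema might look roughly like the following minimal sketch. Field names, enum values, and the roundtrip helpers here are illustrative assumptions, not the actual `benchmarks/schema.py` definitions:

```python
from dataclasses import dataclass, asdict
from enum import Enum

class Difficulty(Enum):
    INTRODUCTORY = "introductory"
    INTERMEDIATE = "intermediate"
    ADVANCED = "advanced"

class TaskType(Enum):
    DERIVATION = "derivation"
    CALCULATION = "calculation"
    DIMENSIONAL_ANALYSIS = "dimensional_analysis"
    LIMITING_CASES = "limiting_cases"
    ESTIMATION = "estimation"

@dataclass
class Reference:
    citation: str

@dataclass
class BenchmarkTask:
    task_id: str
    subfield: str
    difficulty: Difficulty
    task_type: TaskType
    problem_statement: str
    expected_answer: str
    reference: Reference

    def to_dict(self) -> dict:
        # Serialize enums to their string values so the dict is JSON-ready.
        d = asdict(self)
        d["difficulty"] = self.difficulty.value
        d["task_type"] = self.task_type.value
        return d

    @classmethod
    def from_dict(cls, d: dict) -> "BenchmarkTask":
        # Rebuild enum members and the nested Reference from plain JSON types.
        return cls(
            task_id=d["task_id"],
            subfield=d["subfield"],
            difficulty=Difficulty(d["difficulty"]),
            task_type=TaskType(d["task_type"]),
            problem_statement=d["problem_statement"],
            expected_answer=d["expected_answer"],
            reference=Reference(**d["reference"]),
        )
```

Because the enums serialize to plain strings, `to_dict`/`from_dict` compose into a lossless JSON roundtrip.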

Tasks (benchmarks/tasks/*.json): 25 problems sourced from standard textbooks and papers (Peskin & Schroeder, Misner/Thorne/Wheeler, Pathria, Ashcroft & Mermin, Nielsen & Chuang, Goldstein, etc.). Each task includes problem statement, given information, assumptions, conventions, expected answer, verification hints, and reference citation.
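An individual task entry might look like this. The file layout, field names, and the physics content below are illustrative assumptions, not one of the actual shipped task files:

```json
{
  "id": "qft-001",
  "subfield": "qft",
  "difficulty": "advanced",
  "task_type": "derivation",
  "problem_statement": "Derive the one-loop beta function for phi^4 theory in d = 4 dimensions.",
  "given": ["Bare Lagrangian with quartic coupling lambda", "Dimensional regularization, MS-bar scheme"],
  "assumptions": ["Massless limit"],
  "conventions": ["Metric signature (+, -, -, -)"],
  "expected_answer": "beta(lambda) = 3 lambda^2 / (16 pi^2) + O(lambda^3)",
  "verification_hints": ["Check the sign: the coupling grows in the UV"],
  "reference": {"citation": "Peskin & Schroeder, An Introduction to Quantum Field Theory, Ch. 12"}
}
```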

Loader (benchmarks/loader.py): Discovers task files, loads combined suites, filters by subfield/difficulty/type, and provides inventory summary.
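The loader's discovery/filter/inventory behavior could be sketched as follows. Function names (`load_tasks`, `filter_tasks`, `inventory`) and the assumption that each file holds a JSON list of task dicts are illustrative, not the actual `loader.py` API:

```python
import json
from collections import Counter
from pathlib import Path

def load_tasks(tasks_dir: str) -> list[dict]:
    # Discover every *.json task file in the directory; each file is
    # assumed to contain a JSON list of task dicts.
    tasks: list[dict] = []
    for path in sorted(Path(tasks_dir).glob("*.json")):
        tasks.extend(json.loads(path.read_text()))
    return tasks

def filter_tasks(tasks, subfield=None, difficulty=None, task_type=None):
    # None on an axis means "don't filter on it".
    return [
        t for t in tasks
        if (subfield is None or t["subfield"] == subfield)
        and (difficulty is None or t["difficulty"] == difficulty)
        and (task_type is None or t["task_type"] == task_type)
    ]

def inventory(tasks) -> dict:
    # Count tasks per subfield for a quick summary.
    return dict(Counter(t["subfield"] for t in tasks))
```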

Runner (benchmarks/runner.py): Formats task prompts (excluding answers and hints), invokes a caller-provided model function, collects TaskResult objects with timing and error handling, and generates summary reports.
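The runner's core loop could be sketched like this, assuming dict-shaped tasks and hypothetical helper names (`format_prompt`, `run_task`) rather than the actual `runner.py` interface:

```python
import time
from typing import Callable

def format_prompt(task: dict) -> str:
    # Build the prompt from the problem statement and givens only;
    # expected_answer and verification_hints are deliberately excluded.
    parts = [task["problem_statement"]]
    if task.get("given"):
        parts.append("Given: " + "; ".join(task["given"]))
    return "\n\n".join(parts)

def run_task(task: dict, model: Callable[[str], str]) -> dict:
    # Invoke the caller-provided model function, capturing timing and
    # any exception as a structured result rather than crashing the run.
    start = time.perf_counter()
    try:
        answer, error = model(format_prompt(task)), None
    except Exception as exc:
        answer, error = None, str(exc)
    return {
        "task_id": task["id"],
        "answer": answer,
        "error": error,
        "seconds": time.perf_counter() - start,
    }
```

Because `model` is just a callable from prompt string to answer string, the same loop works for an API client, a local model, or a mock in tests.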

Closes #45
ENG-437

Test plan

  • All 40 new benchmark tests pass (tests/test_benchmarks.py)
  • Existing test suite unaffected (verified subset: metadata consistency, version, paper models)
  • Manual review: task physics content is accurate and well-sourced
  • Verified that benchmark tasks can be extended by adding new JSON files to benchmarks/tasks/

🤖 Generated with Claude Code

Define BenchmarkTask, BenchmarkSuite, Reference, and supporting enums
(Difficulty, TaskType, OutputFormat) as the foundation for a systematic
physics benchmark. Each task captures problem statement, classification
metadata, expected answers, and paper provenance.

ENG-437
25 tasks covering QFT (5), GR & cosmology (5), statistical mechanics (5),
condensed matter (4), classical mechanics (3), and quantum information (3).
Each task specifies problem statement, classification metadata (difficulty,
type, subfield), expected answer, verification hints, and paper/textbook
references.

Difficulty ranges from introductory to advanced. Task types include
derivation, calculation, dimensional analysis, limiting cases, and
estimation.

ENG-437
- loader.py: discovers task JSON files, loads individual and combined
  suites, supports filtering by subfield/difficulty/type, and provides
  an inventory summary.
- runner.py: model-agnostic benchmark runner that formats task prompts,
  invokes a caller-provided model function, and collects structured
  results with timing. Includes a report formatter.

ENG-437
40 tests covering:
- Schema roundtrip serialization (Reference, BenchmarkTask, BenchmarkSuite)
- All enum values (TaskType, Difficulty, OutputFormat)
- Suite filtering by subfield, difficulty, and task type
- Suite save/load to JSON files
- Task file consistency: unique IDs, required fields, valid subfields
- Cross-file coverage requirements (subfields, difficulties, task types)
- Prompt formatting (includes problem info, excludes answers/hints)
- Runner execution with mock models (success, error, suite-level)
- Report formatting and result serialization

ENG-437
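The task-file consistency checks above (unique IDs, required fields) could be sketched like this; `REQUIRED_FIELDS` and the error-string format are assumptions, not the actual test code:

```python
REQUIRED_FIELDS = {"id", "subfield", "difficulty", "task_type",
                   "problem_statement", "expected_answer"}

def check_task_files(tasks: list[dict]) -> list[str]:
    # Return a list of human-readable problems; an empty list means
    # every task has the required fields and IDs are unique.
    errors: list[str] = []
    seen: set[str] = set()
    for t in tasks:
        missing = REQUIRED_FIELDS - t.keys()
        if missing:
            errors.append(f"{t.get('id', '?')}: missing {sorted(missing)}")
        if t.get("id") in seen:
            errors.append(f"duplicate id: {t['id']}")
        seen.add(t.get("id"))
    return errors
```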
@madeleinesong force-pushed the eng-437-create-benchmark-tasks branch from 1109b34 to 9fe5529 on April 6, 2026 at 17:08


Development

Successfully merging this pull request may close these issues.

[Feature] gpd benchmark — Systematic physics-AI benchmarking from papers
