Add physics benchmark framework with 25 tasks across 5 subfields (#86)
madeleinesong wants to merge 4 commits into main from
Define BenchmarkTask, BenchmarkSuite, Reference, and supporting enums (Difficulty, TaskType, OutputFormat) as the foundation for a systematic physics benchmark. Each task captures problem statement, classification metadata, expected answers, and paper provenance. ENG-437
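The shape of those dataclasses and enums might look like the following minimal sketch. The class and enum names come from the commit message; the specific fields (`problem_statement`, `expected_answer`, etc.) and enum members are illustrative assumptions, not the actual schema.py contents.

```python
# Minimal sketch of the benchmark schema. BenchmarkTask, Reference, Difficulty,
# and TaskType are named in the PR; field names and enum members are assumptions.
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List

class Difficulty(Enum):
    INTRODUCTORY = "introductory"
    INTERMEDIATE = "intermediate"
    ADVANCED = "advanced"

class TaskType(Enum):
    DERIVATION = "derivation"
    CALCULATION = "calculation"
    DIMENSIONAL_ANALYSIS = "dimensional_analysis"
    LIMITING_CASES = "limiting_cases"
    ESTIMATION = "estimation"

@dataclass
class Reference:
    title: str
    authors: str
    section: str = ""

@dataclass
class BenchmarkTask:
    task_id: str
    subfield: str
    difficulty: Difficulty
    task_type: TaskType
    problem_statement: str
    expected_answer: str
    references: List[Reference] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Serialize to a JSON-compatible dict (enums become their values)."""
        d = asdict(self)
        d["difficulty"] = self.difficulty.value
        d["task_type"] = self.task_type.value
        return d
```

Storing enum values as strings keeps the JSON files human-editable while the dataclass layer enforces valid values on load.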
25 tasks covering QFT (5), GR & cosmology (5), statistical mechanics (5), condensed matter (4), classical mechanics (3), and quantum information (3). Each task specifies problem statement, classification metadata (difficulty, type, subfield), expected answer, verification hints, and paper/textbook references. Difficulty ranges from introductory to advanced. Task types include derivation, calculation, dimensional analysis, limiting cases, and estimation. ENG-437
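A single task file under this scheme might look like the following. The problem, key names, and wording are hypothetical examples built from the fields listed above, not content taken from the actual `benchmarks/tasks/` files.

```python
import json

# Hypothetical example of one task JSON file; keys mirror the fields the
# commit message lists (statement, metadata, answer, hints, reference),
# but the exact names and problem text are assumptions.
example_task = {
    "task_id": "stat-mech-001",
    "subfield": "statistical_mechanics",
    "difficulty": "introductory",
    "task_type": "calculation",
    "problem_statement": (
        "Compute the partition function of a two-level system with "
        "energies 0 and epsilon at temperature T."
    ),
    "expected_answer": "Z = 1 + exp(-epsilon / (k_B * T))",
    "verification_hints": ["Check the T -> 0 and T -> infinity limits."],
    "references": [
        {"title": "Statistical Mechanics", "authors": "R. K. Pathria",
         "section": "Ch. 3"}
    ],
}

# Tasks must survive a JSON round trip unchanged.
assert json.loads(json.dumps(example_task)) == example_task
```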
- loader.py: discovers task JSON files, loads individual and combined suites, supports filtering by subfield/difficulty/type, and provides an inventory summary.
- runner.py: model-agnostic benchmark runner that formats task prompts, invokes a caller-provided model function, and collects structured results with timing. Includes a report formatter.

ENG-437
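The discovery-and-filter half of the loader could be sketched as below. The function names and the dict-based task representation are assumptions for illustration, not the actual loader.py API.

```python
# Sketch of loader-style discovery and filtering; function names and the
# plain-dict task representation are assumptions, not the real loader.py.
from pathlib import Path
from typing import Iterable, List, Optional

def discover_task_files(tasks_dir: Path) -> List[Path]:
    """Find all task JSON files under the tasks directory, sorted for stability."""
    return sorted(tasks_dir.glob("*.json"))

def filter_tasks(
    tasks: Iterable[dict],
    subfield: Optional[str] = None,
    difficulty: Optional[str] = None,
    task_type: Optional[str] = None,
) -> List[dict]:
    """Keep only tasks matching every criterion that was provided."""
    kept = []
    for t in tasks:
        if subfield is not None and t.get("subfield") != subfield:
            continue
        if difficulty is not None and t.get("difficulty") != difficulty:
            continue
        if task_type is not None and t.get("task_type") != task_type:
            continue
        kept.append(t)
    return kept
```

Treating `None` as "no constraint" lets callers combine any subset of the three filters in one call.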
40 tests covering:
- Schema roundtrip serialization (Reference, BenchmarkTask, BenchmarkSuite)
- All enum values (TaskType, Difficulty, OutputFormat)
- Suite filtering by subfield, difficulty, and task type
- Suite save/load to JSON files
- Task file consistency: unique IDs, required fields, valid subfields
- Cross-file coverage requirements (subfields, difficulties, task types)
- Prompt formatting (includes problem info, excludes answers/hints)
- Runner execution with mock models (success, error, suite-level)
- Report formatting and result serialization

ENG-437
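A consistency check in the style the test list describes (unique IDs, required fields) might look like this sketch; the helper name and the required-field set are assumptions, not the actual test code.

```python
# Sketch of a task-file consistency check as described in the test list.
# The function name and REQUIRED_FIELDS set are illustrative assumptions.
from typing import List

REQUIRED_FIELDS = {"task_id", "subfield", "difficulty", "problem_statement"}

def check_task_consistency(tasks: List[dict]) -> None:
    """Assert that task IDs are unique and every required field is present."""
    ids = [t["task_id"] for t in tasks]
    assert len(ids) == len(set(ids)), "duplicate task IDs"
    for t in tasks:
        missing = REQUIRED_FIELDS - t.keys()
        assert not missing, f"task {t.get('task_id')} missing {missing}"
```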
Summary
benchmarks/ package with schema, loader, runner, and 25 benchmark tasks for evaluating AI physics reasoning.

Details
- Schema (benchmarks/schema.py): BenchmarkTask, BenchmarkSuite, and Reference dataclasses with enums for Difficulty, TaskType, and OutputFormat. Full JSON roundtrip support.
- Tasks (benchmarks/tasks/*.json): 25 problems sourced from standard textbooks and papers (Peskin & Schroeder, Misner/Thorne/Wheeler, Pathria, Ashcroft & Mermin, Nielsen & Chuang, Goldstein, etc.). Each task includes problem statement, given information, assumptions, conventions, expected answer, verification hints, and reference citation.
- Loader (benchmarks/loader.py): Discovers task files, loads combined suites, filters by subfield/difficulty/type, and provides an inventory summary.
- Runner (benchmarks/runner.py): Formats task prompts (excluding answers and hints), invokes a caller-provided model function, collects TaskResult objects with timing and error handling, and generates summary reports.

Closes #45
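The model-agnostic runner design described above — a caller-provided model function, timed execution, and structured results — can be sketched as follows. The `TaskResult` fields, `run_task` signature, and prompt format are assumptions for illustration, not the actual runner.py API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Sketch of a model-agnostic runner: the "model" is any callable from
# prompt string to answer string. Field names are illustrative assumptions.
@dataclass
class TaskResult:
    task_id: str
    answer: str
    elapsed_s: float
    error: Optional[str] = None

def run_task(task: dict, model: Callable[[str], str]) -> TaskResult:
    """Format the prompt (answers and hints excluded), call the model, time it."""
    prompt = f"Problem: {task['problem_statement']}"
    start = time.perf_counter()
    try:
        answer = model(prompt)
        return TaskResult(task["task_id"], answer, time.perf_counter() - start)
    except Exception as exc:  # capture model failures as structured results
        return TaskResult(task["task_id"], "", time.perf_counter() - start,
                          error=str(exc))

# Usage with a mock model, as in the tests described above:
result = run_task(
    {"task_id": "t1", "problem_statement": "Estimate the order of magnitude."},
    lambda prompt: "42",
)
```

Accepting a bare callable keeps the runner decoupled from any particular model SDK: real models, cached responses, and mocks all plug in the same way.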
ENG-437
Test plan
- tests/test_benchmarks.py
- benchmarks/tasks/

🤖 Generated with Claude Code