A novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments.
- [Sep 2025] PhysGym has been accepted to the NeurIPS 2025 Datasets and Benchmarks Track! See you in San Diego.
- [Jul 2025] Paper released on arXiv.
PhysGym addresses a critical gap in evaluating AI scientific discovery capabilities: the inability to systematically control and vary the prior knowledge available to AI agents. Current benchmarks expose fixed sets of cues, making it impossible to distinguish between memorization and true mechanistic inference.
Consider a pendulum experiment:
- With full context (variable names like "length", "gravity", descriptions), an agent can pattern-match to recover known solutions
- Without context (anonymized variables), the agent must actively probe the environment, design experiments, and form structural hypotheses
These scenarios involve identical physics but radically different cognitive demands, yet existing benchmarks cannot disentangle them.
PhysGym provides sophisticated control over prior knowledge levels, allowing researchers to systematically investigate how LLM agents:
- Balance deductive reasoning (leveraging priors) with inductive learning (from experimentation)
- Adapt to unfamiliar problem settings
- Compensate for missing information through active exploration
- Construct and refine physics models from observations
- Controlled Prior Knowledge (L1-L4): Systematically vary what contextual information agents receive, from full descriptions to completely anonymized variables
- Interactive Physics Environments: 97 curated physics problems from PHYBench, each with executable simulations
- Realistic Constraints: Limited experimental budgets (100 experiments) that mirror real scientific practice
- Comprehensive Evaluation:
- Success Rate: Primary metric based on equation equivalence (symbolic + LLM-based)
- Consistency Metrics: R², MSE, Kendall's tau, MAPE for hypothesis-data alignment
- Task Difficulty Metrics: Equation complexity, variable count analysis
- Complete History Tracking: Monitor exploration dynamics, hypothesis evolution, and reasoning strategies
PhysGym's central innovation is fine-grained control over prior information. Four primary levels are provided:
| Level | Context | Variable Descriptions | Variable Names | Use Case |
|---|---|---|---|---|
| L1 | ✅ Full problem description | ✅ Physical meanings | ✅ Conventional (e.g., m, v) | Reasoning with rich priors |
| L2 | ❌ "Unknown context" | ✅ Physical meanings | ✅ Conventional | Partial information |
| L3 | ❌ Hidden | ❌ Meaningless descriptions | ✅ Conventional | Minimal semantic cues |
| L4 | ❌ Hidden | ❌ Meaningless descriptions | ❌ Anonymous (var_1, var_2) | Pure equation discovery |
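To make the levels concrete, the sketch below shows how a single pendulum input variable might be exposed to the agent at L1, L3, and L4. The field layout follows the problem schema shown later in this README; the placeholder strings are illustrative assumptions, and the authoritative presentation is produced by the research interface (physgym/interface.py).

```python
# Illustrative assumption of how one input variable could be presented per
# prior level; the actual strings are generated by physgym.ResearchInterface.
l1_view = {"name": "l", "description": "length of the pendulum", "unit": "m"}  # L1: full semantics
l3_view = {"name": "l", "description": "unknown quantity", "unit": "m"}        # L3: meaningless description
l4_view = {"name": "var_1", "description": "unknown quantity", "unit": "m"}    # L4: anonymized name
```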
Key Findings from Baseline Experiments:
- Performance declines with reduced priors (e.g., o4-mini: 62.89% → 31% from L1 to L4)
- Non-monotonic patterns: Some problems solved only at lower prior levels, revealing conflicts between knowledge recall and discovery
- Prior knowledge becomes critical for high-dimensional problems (7+ variables)
- Reasoning-enabled models show better exploration dynamics under uncertainty
# Clone the repository
git clone https://github.com/principia-ai/PhysGym.git
cd PhysGym
# Install in editable mode
pip install -e .
# Or install with optional dependencies for local LLM support
pip install -e ".[local-llm]"
# Or install everything including dev tools
pip install -e ".[all]"
- Copy `api_keys.env.example` to `api_keys.env`
- Add your API keys to `api_keys.env`:
import physgym
# Create an experiment with a physics problem
experiment = physgym.ResearchInterface(
env=285, # Physics problem ID
sample_quota=100, # Maximum experiments
mode="default" # Prior level: "default" (L1), "no_context" (L2),
# "no_description" (L3), "no_description_anonymous" (L4)
)
# Run experiments, test hypotheses, and evaluate
# See examples/basic_usage.py for complete walkthrough
For a complete example including experiment design, hypothesis testing, and metric evaluation, see examples/basic_usage.py.
For full benchmark evaluation with LLM agents, see experiments/run_baseline.py and the Running Experiments section below.
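You can also instantiate the same problem under every prior level by swapping the mode string. A minimal sketch, using only the constructor arguments shown above (the mode values follow the L1-L4 mapping noted in the Quick Start comments):

```python
import physgym

# Create one interface per prior level for physics problem 285.
modes = {
    "L1": "default",
    "L2": "no_context",
    "L3": "no_description",
    "L4": "no_description_anonymous",
}
experiments = {
    level: physgym.ResearchInterface(env=285, sample_quota=100, mode=mode)
    for level, mode in modes.items()
}
```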
Each of the 97 physics problems is structured with:
{
"context": "Textual description of the physics problem",
"solution": "Step-by-step derivation based on physics principles",
"equation": "Ground-truth equation linking input/output quantities",
"python_code": "Executable simulation with numerical validity checks",
"input_variables": [
{"name": "m", "description": "mass", "unit": "kg"},
# ... more variables
],
"output_variable": {"name": "v", "description": "velocity", "unit": "m/s"},
"dummy_variables": ["Additional context variables not in core equation"]
}
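To browse the raw problem records directly, the dataset file can be loaded with standard tools. This is a minimal sketch: the path assumes the `full_samples.json` shipped under `physgym/samples/` (see the repository layout below), and the indexing assumes the file holds a list of problem records with the fields shown above.

```python
import json

# Assumed dataset location, based on the repository layout below.
with open("physgym/samples/full_samples.json") as f:
    problems = json.load(f)

# Assumes a list of problem records with the schema shown above.
first = problems[0]
print(first["equation"])
print([v["name"] for v in first["input_variables"]])
```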
PhysGym/
├── physgym/                    # Core Benchmark Suite
│   ├── interface.py            # Research interface with quota management & prior control
│   ├── phyenv.py               # 97 executable physics environments
│   ├── samples/                # Complete problem dataset with metadata
│   ├── prompts/                # Evaluation & preprocessing prompts
│   └── utils/
│       ├── metrics.py          # R², MSE, Kendall's tau, symbolic equivalence
│       ├── llm_providers.py    # Multi-provider LLM support
│       ├── sandbox.py          # Safe execution environment
│       └── preprocess.py       # Dataset construction tools
│
├── methods/                    # Baseline Agents
│   ├── baseline_researcher.py  # Direct prompting baseline
│   ├── react_agent.py          # ReAct-style agent
│   └── prompts/                # Agent-specific prompts
│
├── experiments/                # Benchmark Runners
│   └── run_baseline.py         # Full evaluation across all problems & prior levels
│
├── analysis/                   # Result Analysis
│   └── postprocess.py          # Aggregate metrics, identify patterns
│
└── examples/                   # Usage Examples
    └── basic_usage.py
Environments (phyenv.py):
- Access to the 97 physics environments (in full_samples.json) converted from PHYBench
- Each accepts numerical inputs and returns physics-based outputs
- Input validation and domain constraints included
Interface (interface.py):
- Quota Management: Enforce experimental budget (default: 100 experiments)
- Prior Knowledge Control: Configure L1-L4 information exposure
- History Tracking: Complete logs of interactions, hypotheses, evaluations
- Automatic Evaluation: Integrated metrics for real-time assessment
Evaluation Metrics (utils/metrics.py):
- Success Rate: Primary metric (symbolic + LLM-based equivalence)
- Consistency Metrics: R², MSE, Kendall's tau, MAPE (see the sketch below)
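As a point of reference, these consistency metrics can be reproduced with standard scientific-Python tools. The snippet below is a self-contained illustration of what they measure; it does not call `physgym/utils/metrics.py`, whose exact function signatures may differ.

```python
import numpy as np
from scipy.stats import kendalltau

y_true = np.array([1.0, 2.0, 3.5, 5.0])  # outputs observed from the environment
y_pred = np.array([1.1, 1.9, 3.4, 5.3])  # outputs predicted by a candidate equation

mse = np.mean((y_true - y_pred) ** 2)                                               # mean squared error
r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)   # coefficient of determination
mape = np.mean(np.abs((y_true - y_pred) / y_true))                                  # mean absolute percentage error
tau, _ = kendalltau(y_true, y_pred)                                                 # rank correlation

print(f"MSE={mse:.3f}  R²={r2:.3f}  MAPE={mape:.2%}  Kendall's tau={tau:.3f}")
```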
PhysGym provides several convenient scripts for running experiments. All scripts are located in the scripts/ directory.
Quick Test (Single Problem)
# Test on environment ID 285 with Gemini Flash
bash scripts/test.sh
Edit scripts/test.sh to customize the model, mode, and environment ID.
Full Benchmark Evaluation (All 97 Problems)
# Run complete benchmark with Gemini Pro
bash scripts/baseline.sh
Edit scripts/baseline.sh to change the model or mode.
Local LLM with Ollama
# Run with local Ollama model (e.g., gpt-oss:20b)
bash scripts/ollama_gpt.sh
Local LLM Examples & Templates
# View examples for running with Ollama, vLLM, etc.
bash scripts/run_local_llm_example.sh
Single Problem
python experiments/run_baseline.py \
--env-id 285 \
--mode default \
--llm-model "google/gemini-2.5-flash" \
--api-provider openrouter \
--max-iterations 20 \
--sample-quota 100
All Problems (Batch)
python experiments/run_baseline.py \
--llm-model "google/gemini-2.5-flash" \
--api-provider openrouter \
--mode default \
--idx-start 0 \
--idx-end 97 \
--sample-quota 100 \
--max-iterations 20
The `--mode` parameter controls the prior knowledge level:
- `default` → Level 1 (L1): Full context + descriptions + conventional names
- `no_context` → Level 2 (L2): No context, but descriptions + names available
- `no_description` → Level 3 (L3): No context/descriptions, conventional names only
- `no_description_anonymous` → Level 4 (L4): Anonymized variables (var_1, var_2, etc.)
PhysGym has been tested with:
- OpenAI: `openai/gpt-4`, `openai/o4-mini-high` (reasoning model)
- Anthropic: `anthropic/claude-3.7-sonnet`
- Google: `google/gemini-2.5-pro`, `google/gemini-2.5-flash`, `google/gemini-2.5-flash-preview:thinking`
- DeepSeek: `deepseek/deepseek-chat`, `deepseek/deepseek-r1`
- Open Source (via vLLM/Ollama): `Qwen3-32B`, `gpt-oss-20b`
For local models, use `--api-provider ollama` or `--api-provider vllm`, with `--base-url` if needed.
PhysGym operationalizes the scientific discovery process:
1. Problem Setup
   - Select physics environment (1-97)
   - Configure prior level (L1-L4)
   - Set experimental budget (default: 100)
2. Active Experimentation
   - Agent designs experiments (input valuations)
   - Environment returns observations
   - Budget decrements with each experiment
3. Hypothesis Formation
   - Agent proposes an equation based on observations
   - Can request an oracle test (symbolic equivalence)
   - Receives consistency metrics (R², MSE, etc.)
4. Iterative Refinement
   - Refine hypothesis based on feedback
   - Design targeted experiments
   - Repeat until the budget is exhausted or the problem is solved
5. Evaluation
   - Success: symbolic equivalence to ground truth
   - Metrics: R², MSE, Kendall's tau, MAPE
   - Analysis: exploration dynamics, hypothesis count
Process and analyze experiment results:
# Aggregate results and generate summary statistics
python analysis/postprocess.py
# Clean unsuccessful experiments (with remaining quotas)
python analysis/postprocess.py --clean
The script generates:
- `results_summary.json`: Aggregated metrics by model/configuration
- `models_data.json`: Detailed per-experiment data
For interactive exploration and visualization, refer to the notebook analysis/result_analysis.ipynb.
PhysGym enables investigation of:
- Curiosity-Driven Learning: How agents explore to compress observations and improve world models
- Prior vs. Posterior Balance: When do agents rely on deduction vs. induction?
- Experimental Design: Can models design informative experiments, or do they only sample at random?
- Causal Reasoning: Do models understand causal relationships or just correlations?
- Generalization: How do agents perform on unfamiliar problem representations?
- Curriculum Learning: Can progressive prior reduction improve discovery skills?
- Flexible Prior Configuration: Create custom prior levels beyond L1-L4
- Multi-Modal Evaluation: Both symbolic (SymPy) and LLM-based equivalence checking
- Safe Execution: Sandboxed environment for testing agent-generated code
- Complete Provenance: Full experiment history including failed hypotheses
- Reproducible Benchmarking: Deterministic evaluation with fixed random seeds
- Extensible Architecture: Easy integration of new agents and evaluation metrics
PhysGym opens several promising avenues:
- Counterfactual Physics: Create scenarios where memorized training data becomes a disadvantage, forcing models to rely on interactive reasoning
- Fine-Tuning for Discovery: Investigate whether scientific reasoning skills can be learned through fine-tuning on low-prior tasks
- Enhanced Agent Design:
- Exploration modules that maximize information gain
- Hypothesis falsification strategies
- Automated curriculum learning using L1→L4 progression
- Knowledge Source Integration: Identify crucial prior knowledge to develop agents that efficiently query external knowledge bases
- Multi-Agent Collaboration: Coordinate specialized agents for different discovery phases
If you use PhysGym in your research, please cite:
@article{chen2025physgym,
title={PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors},
author={Chen, Yimeng and Pi\k{e}kos, Piotr and Ostaszewski, Mateusz and Laakom, Firas and Schmidhuber, J{\"u}rgen},
journal={arXiv preprint arXiv:2507.15550},
year={2025}
}
We welcome contributions! Areas of interest:
- New baseline agents: Implement novel discovery algorithms
- Additional metrics: Propose new evaluation dimensions
- Environment extensions: Add new physics problems or domains
- Analysis tools: Visualization and interpretation utilities
- Documentation: Tutorials, examples, and case studies
This project is licensed under the MIT License.
PhysGym is designed to advance our understanding of AI scientific reasoning by providing a rigorous, controlled environment where both successes and failures illuminate the path toward truly autonomous scientific discovery agents.
