Automated research using Large Language Models with a scaffolded generator-critic loop for both mathematics and machine learning experiments.
This project implements an automated research system where an LLM iteratively:
- Generates research papers (LaTeX) with supporting experiments (Python code)
- Receives critique from a critic AI
- Refines its work based on feedback
The system manages the full research workflow including code execution, LaTeX compilation, GPU computation via Modal, and comprehensive logging.
- Scaffolded Research Loop: Generator-critic architecture with configurable iteration budgets
- Literature Integration:
- OpenAlex API for searching academic papers
- ArXiv paper downloads with full-text extraction
- Automatic inclusion of reference papers in prompts
- Code Context Integration: Load entire codebases (e.g., nanoGPT) as context for ML research
- Reference Papers: Load papers as context from problems/papers/ directory
- Data File Integration: Load datasets from data/datasets/ directory for analysis
- GPU Computation: Modal integration for running training experiments on cloud GPUs
- LaTeX Integration: Automatic paper generation and PDF compilation
- Code Execution: Safe Python code execution with timeout protection
- Template System: Customizable prompts and LaTeX templates
- Structured Response Format: XML-based parsing for robust extraction of code, LaTeX, and plans
- Comprehensive Logging: Tracks all iterations, critiques, plans, and metrics
- Artifact Manifest: Auto-generated inventory of all saved artifacts (figures, data, code) updated after each iteration
- Cost Tracking: Monitors API usage and costs
- Unit Tests: 74 tests covering core functionality, code loading, XML extraction, literature integration, and three-stage architecture
- Configurable: YAML-based configuration for timeouts, models, and limits
- Prompt Caching: Intelligent 1-hour caching reduces API costs by ~40%
- Three-Stage Generator: Sequential planning → coding → writing for proper information flow
- Reasoning Capture: Logs chain-of-thought reasoning from supported providers (Anthropic, Moonshot)
The generator phase uses a novel three-stage approach where each iteration consists of three sequential API calls:
- Planning Stage: Creates detailed plan and performs optional literature search
- Coding Stage: Generates and executes Python code (with access to plan and literature)
- Writing Stage: Generates LaTeX paper (with access to plan, literature, code, and execution results)
Key Benefit: The LaTeX paper can now reference actual execution results instead of anticipated ones, ensuring papers accurately report real findings.
Documentation: See docs/architecture/THREE_STAGE_IMPLEMENTATION_COMPLETE.md for full details.
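Conceptually, one generator iteration looks roughly like the sketch below. This is a hypothetical illustration only; the function names are not the package's real API (see the document linked above for the actual implementation):
def run_iteration(problem, literature, llm, execute):
    # Stage 1: planning (may include a literature search)
    plan = llm(stage="planning", problem=problem, literature=literature)
    # Stage 2: coding - generate code, then execute it
    code = llm(stage="coding", problem=problem, plan=plan)
    results = execute(code)
    # Stage 3: writing - the paper sees the plan, code, and real execution results
    paper = llm(stage="writing", problem=problem, plan=plan, code=code, results=results)
    return code, results, paper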
The system supports multiple LLM providers, allowing you to choose between different models based on your needs:
- Anthropic: Claude Sonnet 4.5 (with extended thinking & prompt caching)
- OpenAI: GPT-5.1 (latest with improved reasoning & 90% cache savings)
- Google: Gemini 3 Pro (PhD-level reasoning with 1M context & automatic caching)
- xAI: Grok 4.1 (reduced hallucinations, 256K context, 75% cache savings)
- Moonshot: Kimi K2 Thinking (cost-effective thinking model with 256K context)
Switch providers by editing one line in config.yaml:
api:
provider: anthropic # or: openai, google, xai, moonshot
model: claude-sonnet-4-5-20250929 # defaults set in provider_defaults.py
When the AI writes research papers, it lists itself as an author using the appropriate model name (e.g., "GPT-5", "Gemini 2.5 Pro", "Kimi K2 Thinking"). The {model} placeholder in problem files is automatically replaced with the correct display name.
Each provider has different parameters for controlling reasoning depth and effort:
Anthropic Claude - Explicit thinking budget:
- thinking_budget: 32000 - Fixed 32K token budget for extended thinking
- Provides explicit control over thinking depth
- Set in provider_defaults.py or override in config.yaml
OpenAI GPT-5 - Reasoning effort levels:
- reasoning_effort: 'high' - Maximum reasoning capability (default)
- Options: 'minimal', 'low', 'medium', 'high'
- Higher effort = more thorough reasoning but ~23× more output tokens
- Best performance with 'high' for complex research tasks
Google Gemini 3 Pro - Thinking level control:
- thinking_budget: -1 - Maps to thinking_level: 'high' (default, maximum reasoning)
- thinking_budget: 0 - Maps to thinking_level: 'low' (fast, minimal latency)
- Positive values use the legacy thinking_budget parameter (not recommended)
- Gemini 3 achieves PhD-level performance: 37.5% on Humanity's Last Exam, 91.9% GPQA Diamond
xAI Grok 4 - Automatic reasoning:
- No user-controllable parameters
- Model automatically determines reasoning depth
- OpenAI-compatible API
Moonshot Kimi K2 - Automatic test-time scaling:
- No explicit thinking control parameters
- Uses automatic test-time scaling for reasoning
- Recommended: temperature: 1.0 for optimal performance
- OpenAI-compatible API
Full Documentation: See docs/providers/MULTI_PROVIDER_GUIDE.md for detailed configuration examples, cost comparisons, and feature matrix.
The system can capture and log the chain-of-thought reasoning from models that expose their internal thinking process. This provides transparency into how the AI approaches problems.
Provider Support:
- Anthropic Claude ✓ - Captures thinking from ThinkingBlock content
- Moonshot Kimi K2 ✓ - Captures reasoning via reasoning_content field
- OpenAI ✗ - Reasoning intentionally hidden by design
- Google Gemini ✗ - API does not expose thinking content
- xAI Grok ✗ - API does not expose reasoning content
Output Files (per session, in outputs/<session>/logs/):
- reasoning.jsonl - Structured data with iteration/stage metadata for programmatic analysis
- reasoning_readable.md - Human-readable markdown format with formatted reasoning content
Example reasoning log entry captures the model's internal thought process for each stage (planning, coding, writing, critique), enabling deeper understanding of the AI's problem-solving approach.
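The JSONL log can be summarised with a few lines of Python. A minimal sketch; the field names ("iteration", "stage", "reasoning") are assumptions, so check reasoning.jsonl in your own session for the exact schema:
import json
from pathlib import Path

# Assumed session name and field names, for illustration only
log_path = Path("outputs/my_session/logs/reasoning.jsonl")
for line in log_path.read_text().splitlines():
    entry = json.loads(line)
    print(entry.get("iteration"), entry.get("stage"), len(entry.get("reasoning", "")))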
The system implements intelligent prompt caching to significantly reduce API costs during multi-iteration research sessions. By caching static prompt content (problem statements, reference papers, data descriptions), the system achieves substantial cost savings on typical research runs.
Caching Support by Provider:
- Anthropic: 1-hour cache, 90% savings on cache reads
- OpenAI: Automatic caching, 90% savings on cache reads
- Google: Automatic context caching, 90% savings on cache reads
- xAI: Automatic caching, 75% savings on cache reads
- Moonshot: Automatic caching, 90% savings on cache reads (estimated)
The three-stage generator architecture (planning → coding → writing) shares a single cache across all three stages within each iteration. This optimization reduces cache writes from 4 to 2 per iteration:
Before optimization:
- Planning stage: Creates cache (2× cost)
- Coding stage: Creates different cache (2× cost)
- Writing stage: Creates different cache (2× cost)
- Critic stage: Creates cache (2× cost)
After optimization:
- Planning stage: Creates cache (2× cost)
- Coding stage: Reads from cache (0.1× cost, 90% savings)
- Writing stage: Reads from cache (0.1× cost, 90% savings)
- Critic stage: Creates cache (2× cost)
This is achieved by ensuring all three generator stages share identical static content, with only the output format instructions varying in the dynamic portion.
Each API call (both generator and critic) is split into two parts:
- Static Content (Cached):
- Problem statement
- Reference papers
- Code context (entire codebases like nanoGPT)
- Data file descriptions
- System instructions and templates
- This content is cached for 1 hour and refreshed on each use
- Dynamic Content (Not Cached):
- Current iteration number
- Current stage (planning/coding/writing)
- Output format instructions (stage-specific)
- Latest code and LaTeX
- Previous execution results
- Recent critique feedback
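As a rough illustration of how this split maps onto a provider API, the sketch below uses the Anthropic SDK. This is a minimal sketch, not the package's actual implementation, and the 1-hour TTL option may require the extended cache TTL feature on your account:
import anthropic

client = anthropic.Anthropic()

# Static content: problem statement, reference papers, code context, data
# descriptions, system instructions (identical across all three generator stages)
static_context = "<problem statement + reference papers + code context + templates>"
# Dynamic content: iteration number, stage instructions, latest code/LaTeX,
# execution results, recent critique
dynamic_prompt = "<iteration 3, coding stage: plan, latest code, critique>"

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=64000,
    system=[{
        "type": "text",
        "text": static_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # cached for 1 hour
    }],
    messages=[{"role": "user", "content": dynamic_prompt}],
)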
- Cache Write: 2× base input cost (first call or after 1-hour expiry)
- Cache Read: 0.1× base input cost (subsequent calls within 1 hour)
- Break-even: After ~2-3 API calls (typical research sessions use 10-40 calls)
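As a quick worked example on the cached portion of a prompt: three uncached calls cost 1.0 + 1.0 + 1.0 = 3.0× the base input price, while a cache write followed by two reads costs 2.0 + 0.1 + 0.1 = 2.2×, so caching pays for itself from roughly the third call onward.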
With a multi-iteration research session:
- Iteration 1: Cache write (2× cost on cached portion)
- Iteration 2+: Cache read (0.1× cost on cached portion, ~90% savings)
- Cache automatically refreshes on each use (resets 1-hour timer)
Caching is enabled by default but can be controlled:
# Python API - disable caching
researcher = ScaffoldedResearcher(
session_name="my_session",
use_cache=False # Disable caching
)
# Configuration (config.yaml)
api:
costs:
cache_write_multiplier: 2.0 # Cache creation cost (2× base)
cache_read_multiplier: 0.1 # Cache read cost (0.1× base)
Caching requires static content ≥1,024 tokens. The system automatically:
- Splits prompts at optimal boundaries
- Only caches content above the threshold
- Tracks cache metrics in metrics.json
Cache performance is tracked in outputs/{session}/metrics.json:
{
"cache_creation_tokens": 2653, // Tokens written to cache
"cache_read_tokens": 1340, // Tokens read from cache
"cost": 0.1631 // Cost including cache multipliers
}
# Clone the repository
git clone https://github.com/yourusername/llm-maths-research.git
cd llm-maths-research
# Install in development mode
pip install -e .
# Or with development dependencies
pip install -e ".[dev]"
- Python >= 3.10
- LaTeX distribution (for PDF compilation)
- API key for at least one supported provider (Anthropic, OpenAI, Google, xAI, or Moonshot)
- Create a .env file with your API key(s):
# At least one of these (depending on which provider you want to use)
ANTHROPIC_API_KEY=your-anthropic-key-here
OPENAI_API_KEY=your-openai-key-here
GOOGLE_API_KEY=your-google-key-here
XAI_API_KEY=your-xai-key-here
MOONSHOT_API_KEY=your-moonshot-key-here
See .env.example for a template.
- Configure config.yaml to adjust:
- LLM Provider: Choose from anthropic, openai, google, xai, or moonshot
- Model: Select specific model for your chosen provider
- Code execution timeout
- LaTeX compilation settings
- API parameters (tokens, thinking budget, costs)
- Output preferences
Use run_experiment.py to run experiments with optional reference papers and data files:
# No papers - use literature search tools instead
python run_experiment.py --problem problems/my_problem.txt --max-iterations 10
# With paper IDs as context
python run_experiment.py --papers Turrini2024 --max-iterations 10
# Multiple reference papers as context
python run_experiment.py --papers Turrini2024 Smith2025 --max-iterations 10
# Custom problem file (default: problems/open_research.txt)
python run_experiment.py --papers Turrini2024 --problem problems/my_problem.txt
# With data files (from data/datasets/ directory)
python run_experiment.py --papers Turrini2024 --data mydata.csv
# Multiple data files
python run_experiment.py --papers Turrini2024 --data dataset1.csv dataset2.json
# With code context (for ML research)
python run_experiment.py --papers attention_paper --code nanogpt --problem problems/ml_research.txt
# Full ML research setup
python run_experiment.py --papers attention_paper --code nanogpt --data training_results.csv --max-iterations 10
# Custom session name
python run_experiment.py --papers Turrini2024 --session-name my_experiment
from llm_maths_research import ScaffoldedResearcher
# Load your problem
with open('problems/problem.txt', 'r') as f:
problem = f.read()
# Create and run researcher (basic)
researcher = ScaffoldedResearcher(
session_name="my_session",
max_iterations=20
)
researcher.run(problem)
# Or with reference papers (from problems/papers/ directory)
researcher = ScaffoldedResearcher(
session_name="my_session",
max_iterations=20,
paper_ids=["2501.00123", "2501.00456"] # Loads from problems/papers/*.txt
)
researcher.run(problem)
# With data files (from data/datasets/ directory)
researcher = ScaffoldedResearcher(
session_name="my_session",
max_iterations=20,
paper_ids=["2501.00123"],
data_ids=["mydata.csv", "timeseries.json"] # Loads from data/datasets/
)
researcher.run(problem)
# With code context for ML research (from problems/code/ directory)
researcher = ScaffoldedResearcher(
session_name="ml_research",
max_iterations=20,
paper_ids=["attention_paper"],
code_ids=["nanogpt"], # Loads from problems/code/nanogpt/
data_ids=["training_data.csv"]
)
researcher.run(problem)
The system includes integrated literature search capabilities using OpenAlex and ArXiv APIs.
Agents can use literature search through <OPENALEX> blocks in their responses:
<OPENALEX>
[
{
"function": "search_literature",
"arguments": {
"query": "pattern formation Turing",
"filters": {"from_year": 2020, "min_citations": 10},
"max_results": 15
},
"purpose": "Find recent highly-cited work on Turing patterns"
},
{
"function": "get_paper",
"arguments": {"identifier": "W2100837269"},
"purpose": "Get full metadata for Turing's 1952 paper"
},
{
"function": "get_arxiv_paper",
"arguments": {"arxiv_id": "2211.02556"},
"purpose": "Download full paper for detailed analysis"
}
]
</OPENALEX>
from llm_maths_research.literature.tools import (
search_literature,
get_paper,
get_arxiv_paper
)
# Search for papers with filters
results = search_literature(
query="pattern formation Turing",
filters={"from_year": 2020, "min_citations": 10},
max_results=15
)
for paper in results['results']:
print(f"{paper['title']} ({paper['publication_year']})")
print(f"Citations: {paper['cited_by_count']}")
# Get paper metadata by OpenAlex ID or DOI
paper = get_paper("W2100837269")
print(paper['title'])
print(paper['abstract'])
# Download full ArXiv paper (LaTeX source)
arxiv_paper = get_arxiv_paper("2211.02556")
print(f"Downloaded {arxiv_paper['char_count']} characters")search_literature(query, filters, max_results)- Search OpenAlex by keywords and filters- Filters:
from_year,to_year,min_citations,doi,cites,cited_by,is_open_access, etc.
- Filters:
get_paper(identifier)- Get full metadata for a paper (by OpenAlex ID or DOI)- Returns: title, authors, abstract, citations, references, formatted citations
get_arxiv_paper(arxiv_id)- Download full LaTeX source from ArXiv- Returns: full text content (~15-30k+ tokens per paper)
The system supports loading external data files for analysis. Data files are stored in data/datasets/ and can be provided to experiments via the --data parameter.
- Place your data file in data/datasets/ (e.g., mydata.csv)
- Optionally create a description file with the same name and .txt extension (e.g., mydata.txt)
- Reference the file when running an experiment
Example:
# data/datasets/temperature_data.csv
# data/datasets/temperature_data.txt (optional description)
python run_experiment.py --papers Turrini2024 --data temperature_data.csv
Create a .txt file with the same name as your data file to provide context:
# data/datasets/temperature_data.txt
Daily temperature measurements from weather station Alpha.
Format: CSV with columns date, temperature_c, humidity_percent
Period: January 2024 - December 2024
Frequency: Daily readings at 12:00 UTC
Missing data: Coded as -999
The description is included in the prompt to help the LLM understand the data structure.
- Copying: Data files are copied from data/datasets/ to outputs/{session}/data/
- Path Access: In code, use relative paths from the session root directory
- Organized Output: Figures go to artifacts/figures/, generated data to artifacts/data/
Example code generated by the LLM:
import pandas as pd
import matplotlib.pyplot as plt
# Load input data
df = pd.read_csv("data/mydata.csv")
# Analyze and visualize
# ... analysis code ...
# Save figure to artifacts/figures/
plt.savefig("artifacts/figures/my_plot.png", dpi=300)
# Save generated data to artifacts/data/
df.to_csv("artifacts/data/results.csv", index=False)The system supports loading entire codebases as context for ML research experiments. This allows the LLM to understand, modify, and experiment with existing code (e.g., training frameworks like nanoGPT).
Code contexts are stored in problems/code/ with each codebase in its own subdirectory:
problems/code/
├── nanogpt/
│ ├── code.txt # Consolidated source code
│ └── description.txt # README(s) and documentation
- Create a directory: problems/code/your_codebase/
- Create code.txt with consolidated source code (a consolidation sketch is shown below this list), using file separators such as:
  # ===== train.py =====
  [file content]
  # ===== model.py =====
  [file content]
- Create description.txt with README(s) or overview
- Use with --code your_codebase
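A minimal sketch of how such a consolidation could be produced. This helper is not part of the package; the paths and the .py-only filter are assumptions:
from pathlib import Path

src = Path("path/to/your_codebase")          # assumed source checkout
dest = Path("problems/code/your_codebase")
dest.mkdir(parents=True, exist_ok=True)

# Concatenate sources with the "# ===== filename =====" separators described above
with open(dest / "code.txt", "w") as out:
    for py_file in sorted(src.rglob("*.py")):
        out.write(f"# ===== {py_file.relative_to(src)} =====\n")
        out.write(py_file.read_text())
        out.write("\n\n")

# Use the project's README as the description, if present
readme = src / "README.md"
if readme.exists():
    (dest / "description.txt").write_text(readme.read_text())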
# ML research with code context
python run_experiment.py \
--papers attention_is_all_you_need \
--code nanogpt \
--problem problems/ml_research.txt \
--max-iterations 10
# With data files
python run_experiment.py \
--papers ml_paper \
--code nanogpt \
--data training_results.csv \
--max-iterations 10
nanoGPT - Andrej Karpathy's minimal GPT implementation (~15k tokens)
- Core files: model.py, train.py, sample.py, configurator.py
- Config files for Shakespeare and GPT-2 training
- Data preparation scripts
For GPU-intensive tasks (model training, large-scale inference), the system integrates with Modal for serverless GPU computation.
- Install Modal: Already included in dependencies
- Authenticate: Run modal token new to set up your Modal account
- Ready to use: The AI researcher can generate Modal code directly
The AI researcher generates Python code that uses Modal to:
- Run model inference on cloud GPUs (T4, L4, A10G, A100)
- Execute training jobs with automatic scaling
- Cache models and images for fast subsequent runs
Key Pattern (automatically used by the AI):
import modal
import pandas as pd
import matplotlib.pyplot as plt
app = modal.App("experiment-name")
# Include ALL packages used in this file (even if only used locally)
image = modal.Image.debian_slim().pip_install(
"torch",
"transformers",
"numpy",
"pandas", # Required even if only used locally!
"matplotlib" # Required even if only used locally!
)
@app.function(gpu="T4", timeout=1800, image=image)
def run_model_inference(problems):
"""GPU computation - returns data, NOT files"""
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load model (Modal caches this after first run)
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Run inference and return data structures
return results_list # Return data, not files!
# CRITICAL: Protect execution with if __name__ == "__main__"
if __name__ == "__main__":
problems = generate_problems()
# Single Modal session for all GPU work
with app.run():
results = run_model_inference.remote(problems)
# ALL file I/O happens locally
df = pd.DataFrame(results)
df.to_csv("artifacts/data/results.csv", index=False)
plt.plot(df['values'])
plt.savefig("artifacts/figures/plot.png", dpi=300)Typical execution times for GPT-2 inference (431 problems):
- First run: ~2-3 minutes (Modal startup + model download + inference)
- Subsequent runs: ~1 minute (cached image + cached model + inference)
- GPU time only: ~30-60 seconds
Caching (automatic):
- Image caching: Container images cached by package hash
- Model caching: HuggingFace models cached after first download
- Persistent: Caches persist across experiments
The AI researcher has been trained to follow these Modal best practices:
- Include ALL imports in image: Modal needs all packages you import at the file level, even if only used locally
- Use if __name__ == "__main__": - Required to prevent code execution during Modal's import phase
- Return data, not files: Modal functions return Python data structures; all file I/O happens locally
- Single Modal session: Batch all GPU operations to minimize startup overhead
See docs/MODAL_CODE_PATTERN.md for complete documentation.
Configuration (in config.yaml):
modal:
timeout: 1800 # 30 minutes for GPU tasks
gpu: "L4" # GPU type (T4, L4, A10G, A100)Code contexts are cached as part of static prompt content:
- nanoGPT: ~15k tokens (description + code)
- First iteration: 2× cost (cache write)
- Subsequent iterations: 0.1× cost (90% savings)
- Efficient for multi-iteration research sessions
The system allows you to start research sessions with custom initial paper and code files instead of the default template and empty code. This is useful when you want to build upon existing work or start with a specific baseline.
Use the --initial-paper and --initial-code flags when running experiments:
# Use custom initial paper
python run_experiment.py --problem problems/open_research.txt \
--initial-paper my_paper.tex
# Use custom initial code
python run_experiment.py --problem problems/open_research.txt \
--initial-code my_code.py
# Use both custom initial files
python run_experiment.py --problem problems/open_research.txt \
--initial-paper my_paper.tex \
--initial-code my_code.py
Custom initial files are stored in problems/initial/:
problems/initial/
├── example_paper.tex # Example custom paper
├── example_code.py # Example custom code
└── README.md # Usage instructions
- Initial Paper (.tex): A LaTeX document used as the starting point instead of the default template
  - Must be a complete LaTeX document with \documentclass, \begin{document}, etc.
  - Can include existing sections, abstracts, equations, references
- Initial Code (.py): Python code used as the starting point instead of an empty file
  - Should be valid Python code that the AI can build upon
  - Can include helper functions, imports, data loading code
from llm_maths_research import ScaffoldedResearcher
researcher = ScaffoldedResearcher(
session_name="my_session",
max_iterations=20,
initial_paper="my_paper.tex", # Custom starting paper
initial_code="my_code.py" # Custom starting code
)
researcher.run(problem)
- Files are optional - if not provided, defaults are used (template + empty code)
- Initial files are only used when creating new session directories
- Existing session files are never overwritten
- Initial paper should be valid LaTeX; initial code should be valid Python
llm-maths-research/
├── src/llm_maths_research/ # Main package
│ ├── core/ # Core research logic
│ │ ├── session.py # Research session management
│ │ └── researcher.py # Generator-critic loop
│ ├── literature/ # Literature search integration
│ │ ├── openalex_client.py # OpenAlex API client
│ │ ├── arxiv_client.py # ArXiv paper downloads
│ │ └── tools.py # Literature search tools
│ ├── templates/ # Prompt and LaTeX templates
│ │ ├── generator_prompt.txt # Generator AI prompt template
│ │ ├── critic_prompt.txt # Critic AI prompt template
│ │ └── initial_paper.tex # Initial LaTeX document template
│ ├── utils/ # Utilities
│ │ ├── xml_extraction.py # XML-based response parsing
│ │ ├── latex.py # LaTeX compilation
│ │ └── code_execution.py # Safe code execution
│ └── config.py # Configuration management
├── problems/ # Research problem files
│ ├── papers/ # Reference papers as .txt files (e.g., Turrini2024.txt)
│ ├── code/ # Code contexts for ML research (e.g., nanogpt/)
│ ├── initial/ # Custom initial paper and code files
│ ├── open_research.txt # Default problem for mathematics research
│ └── ml_research.txt # Default problem for ML research
├── data/ # Data files for experiments
│ └── datasets/ # Dataset files (CSV, JSON, etc.)
├── outputs/ # Generated outputs (papers, code, logs)
├── tests/ # Test suite
├── config.yaml # Configuration file
├── run_experiment.py # Run experiments with reference papers and data
├── pyproject.toml # Package metadata
└── README.md # This file
Each research session creates an organized directory in outputs/ with the following structure:
outputs/{session_name}/
├── paper.tex # Generated LaTeX paper (root level)
├── paper.pdf # Compiled PDF (root level)
├── experiment_code.py # Python research code (root level)
├── artifacts/
│ ├── figures/ # Agent-generated figures (PNG/PDF)
│ └── data/ # Agent-generated data files (CSV, pickle, npy)
├── logs/
│ ├── session_log.txt # Complete execution log
│ ├── plans.txt # All research plans (append-only history)
│ ├── critiques.txt # All critic feedback (append-only history)
│ ├── generator_responses.txt # All generator outputs per iteration
│ ├── metrics.json # API usage and cost metrics
│ ├── current_plan.txt # Latest plan (for resuming)
│ ├── current_critique.txt # Latest critique (for resuming)
│ ├── current_researcher_openalex.txt # Latest researcher literature searches
│ └── current_critic_openalex.txt # Latest critic literature searches
├── data/ # Input data files (copied from data/datasets/)
└── literature_cache/ # Downloaded ArXiv papers (if used)
- Root level: Paper and code for easy compilation and execution
- artifacts/: Agent-generated outputs (figures, data files)
- logs/: System logs, plans, critiques, and metrics
- data/: Input data files provided to the experiment
- literature_cache/: Cached academic papers from ArXiv
Edit config.yaml to customize:
execution:
timeout: 1800 # Code execution timeout (seconds) - 30 minutes
output_limit: 100000 # Output length limit (characters)
compilation:
timeout: 30 # LaTeX compilation timeout (seconds)
error_limit: 500 # Error message length (characters)
api:
provider: anthropic # LLM provider (anthropic, openai, google, xai, moonshot)
rate_limit_wait: 20 # Wait time on rate limit (seconds)
# Optional overrides (defaults from provider_defaults.py):
# model: claude-sonnet-4-5-20250929
# max_tokens: 64000
# thinking_budget: 32000
# costs:
# input_per_million: 3.0
# output_per_million: 15.0
research:
max_iterations: 20 # Default iteration limit
modal:
timeout: 1800 # GPU task timeout (seconds) - 30 minutes
gpu: "L4" # Default GPU type (T4, L4, A10G, A100)
output:
figure_dpi: 300 # DPI for saved figures
pip install -e ".[dev]"
# Run all tests
pytest
# Run only unit tests
pytest tests/unit/
# Run with verbose output
pytest -v
# Run with coverage
pytest --cov=src/llm_maths_research
# Format code
black .
# Or use ruff
ruff format .
ruff check .
The code execution environment includes:
Core Scientific Computing:
- numpy, scipy, pandas
- matplotlib, seaborn
- networkx, scikit-learn
- statsmodels (statistical modeling and time series analysis)
Deep Learning:
- torch, torchvision
- transformer-lens (mechanistic interpretability)
- datasets (HuggingFace datasets including BLIMP)
GPU-Accelerated Computing:
- jax (scientific computing with autodiff and GPU support)
- jaxopt (optimization in JAX)
- optimistix (nonlinear solvers and root finding)
- diffrax (differential equation solvers in JAX)
- equinox (neural ODEs and PyTree handling)
- lineax (linear algebra solvers in JAX)
- cupy (NumPy-compatible GPU arrays)
- modal (serverless GPU computation)
Advanced Mathematics:
- sympy (symbolic mathematics)
- cvxpy (convex optimization)
- numba (JIT compilation)
- einops (tensor operations)
Pre-downloaded datasets in data/datasets/:
- MNIST: Handwritten digits (60k train, 10k test, 28×28 grayscale images)
  - Use with --data mnist to include in experiments
  - See data/datasets/mnist.txt for loading instructions
- BLIMP: 67 linguistic minimal pair datasets for evaluating grammatical knowledge
  - Load dynamically via HuggingFace: load_dataset('blimp', 'phenomenon_name') (see the sketch below this list)
  - See data/datasets/blimp.txt for the full list of phenomena
- Example Timeseries: Sample temperature and humidity measurements (10 days, CSV format)
  - Use with --data example_timeseries.csv for testing data analysis
  - See data/datasets/example_timeseries.txt for format details
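A minimal sketch of loading one BLIMP phenomenon via HuggingFace datasets; the phenomenon name and split are examples, so consult data/datasets/blimp.txt for the full list:
from datasets import load_dataset

# "anaphor_gender_agreement" is one of the 67 BLIMP configurations
pairs = load_dataset("blimp", "anaphor_gender_agreement", split="train")
example = pairs[0]
print("good:", example["sentence_good"])
print("bad: ", example["sentence_bad"])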
The system supports mechanistic interpretability research using transformer-lens and the BLIMP dataset.
transformer-lens: Analyze transformer model internals
- Hook access to activations, attention patterns, residual stream components
- Compatible with GPT-2, Pythia, and other HuggingFace models
- Supports activation patching, circuit analysis, and feature visualization
BLIMP Dataset: 67 linguistic minimal pair datasets
- Each dataset has 1,000 examples (sentence_good vs sentence_bad)
- Covers: agreement, islands, NPI licensing, binding, filler-gap, ellipsis
- Perfect for evaluating grammatical knowledge and finding circuits
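As a rough illustration of the kind of minimal-pair probe these tools enable, here is a small sketch. It is illustrative only, not code generated by the researcher, and the example sentences are made up:
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

good = "The authors near the window were tired."   # grammatical
bad = "The authors near the window was tired."     # agreement violation

# Lower mean next-token loss on the grammatical sentence means the model prefers it
good_loss = model(good, return_type="loss").item()
bad_loss = model(bad, return_type="loss").item()
print(f"good loss {good_loss:.3f}, bad loss {bad_loss:.3f}, prefers good: {good_loss < bad_loss}")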
# Run mechanistic interpretability research
python run_experiment.py \
--problem problems/open_mechanistic_interpretability.txt \
--papers placeholder \
--max-iterations 30
The AI researcher will:
- Formulate a mechanistic interpretability research question
- Design experiments using transformer-lens and/or BLIMP
- Use Modal GPUs for efficient model analysis
- Generate visualizations and write a research paper
The system encourages autonomous exploration of questions like:
- What circuits detect subject-verb agreement violations?
- How do attention heads implement coreference resolution?
- Where are syntactic structures represented in the residual stream?
- How do models handle long-distance dependencies?
MIT License - See LICENSE file for details
Contributions welcome! Please open an issue or submit a pull request.
If you use this work, please cite:
@software{llm_maths_research,
title = {LLM Mathematics Research},
author = {Allen G Hart},
year = {2025},
url = {https://github.com/yourusername/llm-maths-research}
}