This directory contains various benchmark scripts for evaluating mathematical problem-solving capabilities of language models. These benchmarks form the evaluation component of the larger mathematical problem-solving framework.
## `standard_benchmark.py`

Evaluates a model's ability to solve mathematical problems and provide correct answers. It generates multiple solutions for each problem and verifies the answers numerically.
- Uses `utils.solution_utils.extract_numeric_answer` for answer verification with sympy integration
- Supports LaTeX notation in answers with comprehensive parsing
- Generates detailed solution statistics with step-by-step analysis
- Measures solution quality, correctness, and explanation clarity
- Handles multiple answer formats (boxed, hash-marked, inline)
- Supports group-based evaluation for solution diversity measurement
- Implements majority voting for answer verification with configurable thresholds
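The boxed-answer extraction and majority voting described above can be sketched as follows. `extract_boxed_answer` and `majority_vote` are hypothetical stand-ins for the behaviour of `utils.solution_utils.extract_numeric_answer`; the real helper also handles hash-marked and inline formats and uses sympy for numeric comparison:

```python
import re
from collections import Counter
from typing import Optional

def extract_boxed_answer(solution: str) -> Optional[str]:
    """Pull the contents of the last \\boxed{...} in a LaTeX solution."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

def majority_vote(answers, threshold: float = 0.5) -> Optional[str]:
    """Return the most common answer if it clears the vote threshold."""
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= threshold else None
```

With `--completions 20`, each problem yields up to 20 extracted answers that are fed through a vote like this before verification.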
## `programming_benchmark.py`

Tests a model's ability to write Python code that solves mathematical problems. The code is executed to verify correctness.
- Uses `utils.solution_utils.run_code_safely` for secure code execution in isolated environments
- Handles timeouts and execution errors with detailed error reporting
- Verifies numeric answers against expected results with configurable tolerance
- Evaluates code quality metrics (complexity, readability, efficiency)
- Supports multiple solution approaches with comparative analysis
- Implements memory and CPU usage limits for secure execution
- Provides detailed execution traces for debugging and analysis
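A minimal sketch of the execution-and-verification loop, assuming a simplified stand-in for `utils.solution_utils.run_code_safely` (the real helper additionally applies memory/CPU limits and sandboxing):

```python
import math
import subprocess
import sys
from typing import Optional

def run_code_safely(code: str, timeout: float = 10.0) -> Optional[str]:
    """Run untrusted solution code in a subprocess with a hard timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # treat a hung solution as a failure
    return proc.stdout.strip() if proc.returncode == 0 else None

def answer_matches(output, expected: float, tolerance: float = 1e-2) -> bool:
    """Compare the printed numeric result against the expected answer."""
    try:
        return math.isclose(float(output), expected,
                            rel_tol=tolerance, abs_tol=tolerance)
    except (TypeError, ValueError):
        return False
```

The `timeout` and `tolerance` arguments correspond to the `--timeout` and `--tolerance` flags described below.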
## `test_benchmark.py`

Evaluates a model's ability to create test functions that can verify mathematical solutions. These test functions should correctly identify valid and invalid answers.
- Uses `utils.solution_utils.run_test_function` to evaluate test functions on multiple test cases
- Generates test cases automatically with edge case detection
- Measures both correctness and robustness of test functions
- Evaluates test coverage and edge case handling
- Supports both numeric and symbolic test verification
- Implements timeout protection for infinite loops
- Provides detailed analysis of test function behavior
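The correctness-and-robustness measurement can be sketched like this. This is a hypothetical simplification of `utils.solution_utils.run_test_function`: it assumes the model's test source defines `check(answer) -> bool`, and the real helper additionally runs the function under a timeout to guard against infinite loops:

```python
def run_test_function(test_src: str, candidate) -> bool:
    """Evaluate a model-written test function on a candidate answer."""
    namespace: dict = {}
    try:
        exec(test_src, namespace)               # load the test function
        return bool(namespace["check"](candidate))
    except Exception:
        return False                            # a crashing test counts as failure

def score_test(test_src: str, valid, invalid) -> float:
    """Fraction of cases where the test accepts valid answers and rejects invalid ones."""
    cases = [(v, True) for v in valid] + [(w, False) for w in invalid]
    hits = sum(run_test_function(test_src, x) == want for x, want in cases)
    return hits / len(cases)
```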
## `architect_benchmark.py`

Tests a pipeline approach where one model acts as an "architect" to analyze problems and create guidance, and another model implements the solution as code.
- Uses two models in sequence (architect and programmer) with configurable interaction
- Measures the effectiveness of problem decomposition and planning
- Evaluates the quality of generated code and implementation fidelity
- Supports iterative refinement between architect and programmer
- Analyzes the impact of architectural guidance on solution quality
- Implements comparative analysis with single-model approaches
- Provides detailed interaction logs for process analysis
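The two-stage flow can be sketched as below. The `architect` and `programmer` callables are placeholders for whatever model interfaces `utils.model_utils` provides, and the prompts are illustrative only; the real benchmark also supports iterative refinement between the two roles:

```python
from typing import Callable

def architect_programmer_pipeline(
    problem: str,
    architect: Callable[[str], str],
    programmer: Callable[[str], str],
) -> str:
    """Architect produces a plan; programmer turns plan + problem into code."""
    plan = architect(f"Analyze this problem and outline a solution plan:\n{problem}")
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Plan from the architect:\n{plan}\n\n"
        "Write Python code implementing the plan."
    )
    return programmer(prompt)
```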
## `tutor_benchmark.py`

Evaluates a model's ability to identify errors in mathematical solutions and provide corrections. It simulates a tutoring scenario where incorrect solutions are analyzed.
- Identifies errors in mathematical reasoning with precision
- Provides explanations and corrections with pedagogical quality assessment
- Measures accuracy of error detection and correction effectiveness
- Evaluates explanation clarity and helpfulness
- Supports multiple error types (conceptual, computational, logical)
- Implements scoring for constructive feedback quality
- Provides comparative analysis with expert-identified errors
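One way the comparison against expert-identified errors might be scored, as a hypothetical sketch (the sets hold step indices flagged as erroneous; the actual scoring in the benchmark may differ):

```python
def error_detection_score(predicted: set, expert: set) -> dict:
    """Precision/recall/F1 of model-flagged error locations vs. expert labels."""
    tp = len(predicted & expert)                       # correctly flagged steps
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expert) if expert else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```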
## `step_benchmark.py`

Analyzes solutions step-by-step to identify the first incorrect step in a solution. This helps understand where reasoning errors occur in the solution process.
- Breaks solutions into individual steps with logical boundary detection
- Identifies the exact point of reasoning failure with detailed analysis
- Provides insights for targeted improvement of reasoning capabilities
- Measures step-by-step correctness with cumulative evaluation
- Supports partial credit for partially correct solutions
- Implements visualization of reasoning paths and failure points
- Provides aggregated statistics on common failure patterns
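The core loop can be sketched as follows. The line-based step split here is a naive heuristic standing in for the benchmark's logical boundary detection, and `judge` is a placeholder for a model- or rule-based step verifier:

```python
from typing import Callable, List, Optional

def first_incorrect_step(
    solution: str,
    judge: Callable[[List[str], str], bool],
) -> Optional[int]:
    """Return the 0-based index of the first step the judge rejects, else None."""
    steps = [s.strip() for s in solution.split("\n") if s.strip()]
    for i, step in enumerate(steps):
        if not judge(steps[:i], step):   # judge sees prior context + current step
            return i
    return None
```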
The benchmarks provide evaluation metrics that inform the reward functions used in the GRPO training scripts:
- `standard_benchmark.py` → `grpo.solution_qwen0.py` (step-by-step solution generation)
- `programming_benchmark.py` → `grpo.programming_qwen0.py` (code generation for math problems)
- `test_benchmark.py` → `grpo.test_programming_qwen0.py` (test function creation)
- `tutor_benchmark.py` → `grpo.tutor_grpo.py` (error identification and correction)
- `architect_benchmark.py` → `grpo.dynamic_qwen0.py` (architectural planning component)
- `step_benchmark.py` → Used for analysis across all training types
The benchmarks generate detailed performance metrics that can be used to:
- Identify specific weaknesses in model capabilities
- Measure improvement from GRPO training iterations
- Generate new training examples from failure cases
- Compare different training approaches and configurations
- Provide targeted feedback for reward function refinement
Benchmarks rely heavily on the utility modules in the utils directory:
- `utils.agents`: Agent implementations for different tasks with consistent interfaces
- `utils.benchmark_config`: Configuration parsing and management with extensive options
- `utils.model_utils`: Model interfaces and response handling with timeout protection
- `utils.progress_tracker`: Result tracking and statistics with real-time updates
- `utils.solution_utils`: Solution validation and verification with comprehensive checks
- `utils.logger`: Structured logging with configurable verbosity
- `utils.similarity_checker`: Solution diversity measurement for group-based evaluation
Benchmarks use datasets processed by the auxiliary scripts:
- Filtered datasets from `auxilary.filter_dataset` with quality controls
- Validation sets from `auxilary.create_validation_set` for consistent evaluation
- Merged datasets from `auxilary.merge_json` for comprehensive testing
- Converted datasets from `auxilary.datatype_transformation` for format compatibility
- Custom datasets created from previous benchmark runs for targeted testing
All benchmarks share these common features:
- Support for multiple model configurations via command-line arguments with extensive options
- Parallel processing of examples using asyncio with configurable concurrency
- Detailed logging and statistics generation with component breakdowns
- Result saving in various formats (JSON, CSV, HuggingFace datasets) with configurable paths
- Progress tracking during long benchmark runs with ETA estimation
- Timeout handling for model responses and code execution with graceful recovery
- Caching mechanisms for efficient re-runs and incremental testing
- Comprehensive error handling with detailed diagnostics
- Support for both local and cloud-based language models
- Configurable verbosity levels for different use cases
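The asyncio-based parallel processing with configurable concurrency can be sketched with a bounded semaphore. `run_benchmark` and `solve` are illustrative names, not the actual API:

```python
import asyncio

async def run_benchmark(problems, solve, max_concurrent: int = 64):
    """Process problems concurrently, mirroring the --max-concurrent behaviour."""
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(problem):
        async with sem:                 # at most max_concurrent in flight
            return await solve(problem)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(worker(p) for p in problems))
```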
Run any benchmark script directly with optional command-line arguments:
```bash
python -m benchmarks.standard_benchmark --main LOCAL_0 --main-port 8000
```

Benchmarks use the `BenchmarkConfig` class from `utils.benchmark_config` for configuration. Key parameters include:
- `--main`: Main model to use (e.g., LOCAL_0, CLAUDE, GPT, GEMINI, MISTRAL)
- `--auxiliary`: Auxiliary model for judging or secondary tasks
- `--auxiliary2`: Second auxiliary model for comparative analysis or verification
- `--main-temp`: Temperature for main model (default: 0.7) for controlling randomness
- `--auxiliary-temp`: Temperature for auxiliary model (default: 0.7)
- `--main-port`: Port for main model server (for local models)
- `--auxiliary-port`: Port for auxiliary model server
- `--main-max-tokens`: Maximum tokens for main model responses
- `--auxiliary-max-tokens`: Maximum tokens for auxiliary model responses
- `--dataset`: HuggingFace dataset to use (default: Metaskepsis/Numina)
- `--split`: Dataset split to use (train/validation/test)
- `--source`: Filter problems by source (e.g., AIME, IMO, Putnam)
- `--seed`: Seed for dataset operations and random sampling
- `--sample`: Number of examples to sample from dataset
- `--difficulty`: Filter by problem difficulty (easy/medium/hard)
- `--start-index`: Start processing from specific index
- `--end-index`: End processing at specific index
- `--max-concurrent`: Maximum number of concurrent problems (default: 64)
- `--best-of`: Number of attempts per problem (default: 1)
- `--completions`: Number of completions to try per path (default: 20)
- `--timeout`: Timeout in seconds for code execution (default: 10)
- `--model-timeout`: Timeout for model responses in seconds
- `--tolerance`: Tolerance for numeric answer comparison (default: 1e-2)
- `--retry-count`: Number of retries for failed model calls
- `--cache`: Enable caching of model responses for efficiency
- `--produce-statistics`: Generate detailed statistics file with component breakdowns
- `--create-dataset`: Create a HuggingFace dataset from results
- `--upload-dataset`: Upload the created dataset to HuggingFace Hub
- `--output-dir`: Directory for saving results and statistics
- `--result-format`: Format for result files (json/csv/both)
- `--verbose`: Verbosity level for console output
- `--log-file`: Path to log file for detailed logging
- `--report-format`: Format for final report (text/html/markdown)
- `--similarity-threshold`: Threshold for solution similarity in group evaluation
- `--embedding-model`: Model to use for embedding generation in similarity checks
- `--step-analysis`: Enable detailed step-by-step analysis
- `--code-quality-check`: Enable code quality evaluation
- `--error-analysis`: Enable detailed error analysis and categorization
- `--comparative-mode`: Enable comparison between multiple models
- `--interactive-mode`: Enable interactive evaluation with human feedback
```bash
# Standard benchmark with a local model and detailed statistics
python -m benchmarks.standard_benchmark --main LOCAL_0 --main-port 8000 --main-temp 0.9 --dataset Metaskepsis/Numina --split validation --produce-statistics --best-of 3 --tolerance 1e-3

# Programming benchmark with cloud models and result dataset creation
python -m benchmarks.programming_benchmark --main GPT --auxiliary CLAUDE --dataset Metaskepsis/Numina --best-of 3 --max-concurrent 32 --timeout 30 --create-dataset --output-dir results/programming_benchmark

# Tutor benchmark with specific problem sources and difficulty filtering
python -m benchmarks.tutor_benchmark --main LOCAL_0 --main-port 8000 --create-dataset --produce-statistics --source AIME,IMO --difficulty medium,hard --sample 100 --seed 42

# Architect benchmark with comparative analysis between different model combinations
python -m benchmarks.architect_benchmark --main GPT --auxiliary CLAUDE --auxiliary2 GEMINI --comparative-mode --dataset Metaskepsis/complex_problems --max-concurrent 16 --report-format html

# Step benchmark with detailed error analysis and visualization
python -m benchmarks.step_benchmark --main LOCAL_0 --main-port 8000 --step-analysis --error-analysis --produce-statistics --output-dir results/step_analysis --report-format markdown
```

Benchmark results are saved in the following formats:
- JSON files with detailed results for each problem including full model responses
- Statistics summary in CSV format with aggregated metrics and breakdowns
- HuggingFace datasets for further analysis or training with standardized schema
- Console output with progress and summary statistics for quick assessment
- HTML/Markdown reports with visualizations and comparative analysis
- Log files with detailed execution traces for debugging
These results can be used to:
- Evaluate model performance on mathematical reasoning with fine-grained metrics
- Generate training data for GRPO fine-tuning with targeted examples
- Identify specific areas for model improvement with error pattern analysis
- Compare different model architectures and configurations with statistical significance
- Track improvement over time with consistent evaluation methodology
- Identify dataset biases and limitations through error analysis
- Generate insights for curriculum development in mathematical education
Compare multiple models on the same problems:
```bash
python -m benchmarks.standard_benchmark --main LOCAL_0 --auxiliary GPT --auxiliary2 CLAUDE --comparative-mode
```

Continue from previous benchmark runs:

```bash
python -m benchmarks.programming_benchmark --continue-from results/previous_run.json
```

Add specialized metrics for specific problem types:

```bash
python -m benchmarks.standard_benchmark --custom-metrics geometry,algebra,calculus
```

Enable human feedback during benchmark runs:

```bash
python -m benchmarks.tutor_benchmark --interactive-mode --feedback-interval 10
```

Create detailed visualizations of benchmark results:

```bash
python -m benchmarks.step_benchmark --visualize --chart-types histogram,heatmap,scatter
```