This directory contains scripts for training language models with Group Relative Policy Optimization (GRPO) on mathematical problem-solving tasks. These training scripts form the improvement component of the larger mathematical problem-solving framework.
Defines various reward functions for different mathematical problem-solving tasks:
- `BaseReward`: Abstract base class for all reward functions with a common interface
  - Defines the contract for all reward implementations
  - Handles batch processing with async execution
  - Provides error handling and timeout management
- `SolutionReward`: Rewards for complete mathematical solutions
  - Evaluates correctness of final answers using numeric verification
  - Assesses solution quality, step coherence, and explanation clarity
  - Penalizes common errors such as skipped steps or incorrect reasoning
- `FinalizationReward`: Rewards for completing partial solutions
  - Evaluates the ability to continue from a given partial solution
  - Checks consistency with the provided steps
  - Verifies final answer correctness
- `ProgrammingReward`: Rewards for Python code that solves math problems
  - Executes code in a secure sandbox with timeout protection
  - Verifies output against expected answers with configurable tolerance
  - Evaluates code quality, efficiency, and readability
- `TutorReward`: Rewards for identifying errors in solutions
  - Assesses accuracy of error identification
  - Evaluates quality of explanations and corrections
  - Rewards a constructive feedback style
- `TestProgrammingReward`: Rewards for creating test functions
  - Evaluates test function correctness on multiple test cases
  - Assesses robustness to edge cases and numerical precision
  - Rewards comprehensive test coverage
- `ArchitectReward`: Rewards for creating solution architectures
  - Evaluates problem decomposition and approach planning
  - Assesses clarity of guidance for implementation
  - Rewards identification of potential pitfalls and edge cases
Each reward function corresponds to a specific benchmark in the benchmarks directory and uses the same evaluation criteria to ensure consistency between training and evaluation.
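A minimal sketch of how this hierarchy could be structured (class names follow the list above, but the method names, timeout default, and answer-extraction heuristic are illustrative assumptions, not the repository's actual API):

```python
import asyncio
import math
from abc import ABC, abstractmethod

class BaseReward(ABC):
    """Illustrative base class: batch scoring with async execution and timeouts."""

    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout

    @abstractmethod
    async def score_one(self, completion: str, example: dict) -> float:
        """Score a single completion against its example."""

    async def _score_safe(self, completion: str, example: dict) -> float:
        # Error handling and timeout management: any failure scores 0.0
        try:
            return await asyncio.wait_for(self.score_one(completion, example), self.timeout)
        except Exception:
            return 0.0

    def __call__(self, completions: list[str], examples: list[dict]) -> list[float]:
        # Batch processing with async execution
        async def run():
            return await asyncio.gather(
                *(self._score_safe(c, e) for c, e in zip(completions, examples))
            )
        return asyncio.run(run())

class SolutionReward(BaseReward):
    """Rewards final-answer correctness via numeric verification with tolerance."""

    async def score_one(self, completion: str, example: dict) -> float:
        try:
            # Naive extraction: take the last whitespace-separated token as the answer
            predicted = float(completion.strip().split()[-1])
            expected = float(example["answer"])
        except (ValueError, IndexError):
            return 0.0
        return 1.0 if math.isclose(predicted, expected, rel_tol=1e-6) else 0.0
```

The tolerance-based `math.isclose` check mirrors the configurable numeric verification described above; a real implementation would also score step coherence and explanation quality.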
A flexible reward function that dynamically selects between different reward types based on the example type:
- Handles multiple example types in a single training run with type-specific processing
- Maintains consistent statistics across reward types with separate tracking
- Supports group-based rewards for solution diversity using embedding similarity
- Uses `utils.similarity_checker` to measure solution diversity with configurable thresholds
- Implements weighted combinations of reward components based on task type
- Provides detailed component breakdowns for analysis and debugging
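The type dispatch and the group-based diversity bonus described above can be sketched as follows (function names, the `"type"` key, and the threshold/bonus values are assumptions, not the repository's actual API):

```python
import math

def make_dynamic_reward(rewards_by_type: dict):
    """Select a type-specific reward function based on each example's type field."""
    def dynamic_reward(completions, examples):
        scores = []
        for completion, example in zip(completions, examples):
            reward_fn = rewards_by_type[example["type"]]
            scores.append(reward_fn(completion, example))
        return scores
    return dynamic_reward

def diversity_bonus(embeddings: list, threshold: float = 0.9, bonus: float = 0.1) -> list:
    """Group-based bonus: a solution earns the bonus when its maximum cosine
    similarity to any other group member stays below the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    bonuses = []
    for i, a in enumerate(embeddings):
        max_sim = max((cosine(a, b) for j, b in enumerate(embeddings) if j != i), default=0.0)
        bonuses.append(bonus if max_sim < threshold else 0.0)
    return bonuses
```

In the actual code the embeddings would come from the embedding model configured for `utils.similarity_checker` rather than being passed in directly.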
Tracks detailed statistics during training:
- Records reward distributions and components with histograms and running averages
- Maintains separate statistics for each reward type with task-specific metrics
- Provides human-readable summaries for logging with component breakdowns
- Integrates with Wandb for visualization with custom charts and tables
- Tracks training progress with moving averages and improvement metrics
- Identifies reward outliers for analysis of exceptional cases
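A tracker with this behavior might look like the following sketch (class and method names, and the moving-average window, are illustrative assumptions):

```python
from collections import defaultdict

class RewardStats:
    """Illustrative per-type statistics: running means and moving averages."""

    def __init__(self, window: int = 100):
        self.window = window
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)
        self.recent = defaultdict(list)   # last `window` values per reward type

    def record(self, reward_type: str, value: float) -> None:
        self.totals[reward_type] += value
        self.counts[reward_type] += 1
        buf = self.recent[reward_type]
        buf.append(value)
        if len(buf) > self.window:
            buf.pop(0)

    def mean(self, reward_type: str) -> float:
        return self.totals[reward_type] / max(self.counts[reward_type], 1)

    def moving_average(self, reward_type: str) -> float:
        buf = self.recent[reward_type]
        return sum(buf) / max(len(buf), 1)

    def summary(self) -> str:
        # Human-readable breakdown suitable for logging or a Wandb table
        return " | ".join(
            f"{t}: mean={self.mean(t):.3f} ma={self.moving_average(t):.3f} n={self.counts[t]}"
            for t in sorted(self.counts)
        )
```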
Defines the RewardConfig dataclass with settings for:
- Model parameters (main and auxiliary models) with type and endpoint configuration
- Reward values for different components with configurable weights
- Embedding model settings for similarity checking with model selection and parameters
- Numeric tolerance and other execution parameters for consistent evaluation
- Timeout settings for model calls and code execution
- Logging configuration for detailed or summarized output
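Given those settings, the dataclass could look roughly like this (field names and defaults are assumptions drawn from the surrounding description, not the actual `config.py`):

```python
from dataclasses import dataclass, field

@dataclass
class RewardConfig:
    """Illustrative shape of the reward configuration described above."""
    model_name: str = "unsloth/Qwen1.5-7B"        # main model to fine-tune
    auxiliary_model: str = "gpt-4-turbo"          # model used for reward calculation
    embedding_model: str = "all-MiniLM-L6-v2"     # similarity-checking embeddings (assumed)
    numeric_tolerance: float = 1e-6               # tolerance for answer verification
    model_timeout: float = 60.0                   # seconds per auxiliary-model call
    execution_timeout: float = 10.0               # seconds per sandboxed code run
    reward_weights: dict = field(default_factory=lambda: {
        "correctness": 2.0, "clarity": 1.0, "steps": 1.5,
    })
    verbose_logging: bool = False                 # detailed vs. summarized output
```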
- `dynamic_qwen0.py`, `dynamic_qwen1.py`, `dynamic_qwen2.py`: Train Qwen models with dynamic rewards using different seeds and configurations
  - Uses combined datasets prepared with `utils.data_preparation.prepare_combined_data`
  - Balances multiple task types in a single training run with configurable distribution
  - Implements curriculum learning with increasing difficulty
  - Supports mixed precision training for efficiency
  - Includes evaluation on validation sets during training
- `solution_qwen0.py`: Train for complete mathematical solutions
  - Corresponds to `benchmarks.standard_benchmark`
  - Uses `utils.data_preparation.prepare_solution_data`
  - Focuses on step-by-step reasoning and answer correctness
  - Implements specialized rewards for mathematical notation quality
- `programming_qwen0.py`: Train for Python code generation
  - Corresponds to `benchmarks.programming_benchmark`
  - Uses `utils.data_preparation.prepare_programming_data`
  - Includes code execution verification in the training loop
  - Rewards efficient and readable code implementations
- `test_programming_qwen0.py`: Train for test function creation
  - Corresponds to `benchmarks.test_benchmark`
  - Uses `utils.data_preparation.prepare_test_programming_data`
  - Focuses on comprehensive test case generation
  - Rewards robust handling of edge cases
- `tutor_grpo.py`: Train for error identification and correction
  - Corresponds to `benchmarks.tutor_benchmark`
  - Uses `utils.data_preparation.prepare_tutor_data`
  - Trains on both correct and deliberately flawed solutions
  - Rewards constructive feedback and clear explanations
- `finalization_grpo.py`: Train for completing partial solutions
  - Uses `utils.data_preparation.prepare_finalization_data`
  - Focuses on completing solutions from partial steps
  - Rewards consistency with provided partial solutions
  - Implements specialized handling for different completion points
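The multi-task balancing used by the dynamic scripts can be sketched as weighted sampling over per-type example pools (all names and the sampling scheme here are illustrative assumptions):

```python
import random

def balanced_batches(examples_by_type: dict, distribution: dict,
                     batch_size: int, seed: int = 0):
    """Yield batches whose task-type mix follows the given distribution.
    Sampling is with replacement for simplicity."""
    rng = random.Random(seed)
    types = list(distribution)
    weights = [distribution[t] for t in types]
    while True:
        batch = []
        # Draw a task type per slot, then an example of that type
        for t in rng.choices(types, weights=weights, k=batch_size):
            batch.append(rng.choice(examples_by_type[t]))
        yield batch
```

Curriculum learning could then be layered on top by switching `examples_by_type` to harder pools at configured step counts.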
Each training script is designed to improve performance on specific benchmarks:
- `solution_qwen0.py` → Improves performance on `benchmarks.standard_benchmark.py`
- `programming_qwen0.py` → Improves performance on `benchmarks.programming_benchmark.py`
- `test_programming_qwen0.py` → Improves performance on `benchmarks.test_benchmark.py`
- `tutor_grpo.py` → Improves performance on `benchmarks.tutor_benchmark.py`
- `finalization_grpo.py` → Supports multiple benchmarks with partial solution completion
Training scripts rely on utility modules in the utils directory:
- `utils.data_preparation`: Formats datasets for different training tasks with task-specific processing
- `utils.solution_utils`: Validates solutions and extracts answers with LaTeX support
- `utils.similarity_checker`: Measures diversity between solutions using embedding models
- `utils.model_utils`: Handles model responses and timeouts with retry logic
Training uses datasets processed by the auxiliary scripts:
- Filtered datasets from `auxilary.filter_dataset` with quality controls
- Validation sets from `auxilary.create_validation_set` for consistent evaluation
- Merged datasets from `auxilary.merge_json` for comprehensive training
- Converted datasets from `auxilary.datatype_transformation` for format compatibility
All training scripts share these common features:
- Integration with Unsloth for efficient training of Qwen models (2-3x speedup)
- Wandb logging for experiment tracking and visualization with detailed metrics
- LoRA fine-tuning for parameter-efficient training with configurable parameters
- Detailed logging of training metrics and reward components with component breakdowns
- Model saving in merged format for easy deployment with adapter integration
- Checkpoint creation for resuming training with automatic versioning
- Gradient accumulation for effective batch size scaling
- Mixed precision training for memory efficiency
- Evaluation during training on validation sets
- Early stopping based on validation performance
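As one example of these shared features, early stopping on validation performance can be sketched as a small stateful check (class name, patience, and delta defaults are illustrative):

```python
class EarlyStopping:
    """Illustrative early-stopping check on a validation metric (higher is better)."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_evals = 0

    def step(self, val_metric: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_metric > self.best + self.min_delta:
            self.best = val_metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```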
Run any training script directly with optional command-line arguments:
```bash
python -m grpo.dynamic_qwen0 --learning_rate 2e-5 --epochs 3 --batch_size 4
```

Training scripts use the `RewardConfig` class from `config.py` for configuration. Key parameters include:

- `model_type`: Type of model to train (e.g., "dynamic_0", "solution", "programming")
- `model_name`: Base model to fine-tune (e.g., "unsloth/Qwen1.5-7B")
- `auxiliary_model`: Model for reward calculation (e.g., "gpt-4-turbo")
- Base reward values for different components with configurable weights
- Similarity thresholds and diversity bonuses for group-based rewards
- Embedding model configuration for similarity calculation
Each script configures:
- Learning rate and optimizer settings (AdamW with weight decay)
- Batch size and gradient accumulation steps for memory efficiency
- Number of generations per prompt for diverse training signals
- Maximum sequence lengths for input and output
- Training epochs and checkpointing frequency
- LoRA parameters (rank, alpha, dropout) for adapter configuration
- Evaluation frequency and criteria
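Wiring these parameters into a CLI could look like the following `argparse` sketch (flag names follow the usage examples in this README; defaults are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI mirroring the per-script configuration listed above."""
    p = argparse.ArgumentParser(description="GRPO training script")
    p.add_argument("--learning_rate", type=float, default=2e-5)
    p.add_argument("--epochs", type=int, default=3)
    p.add_argument("--batch_size", type=int, default=4)
    p.add_argument("--gradient_accumulation_steps", type=int, default=4)
    p.add_argument("--max_seq_length", type=int, default=2048)
    p.add_argument("--lora_r", type=int, default=32)      # LoRA rank
    p.add_argument("--lora_alpha", type=int, default=64)  # LoRA scaling
    p.add_argument("--dropout", type=float, default=0.05)
    p.add_argument("--eval_steps", type=int, default=100)
    p.add_argument("--save_steps", type=int, default=200)
    return p
```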
```bash
# Train a dynamic model with multiple task types and custom parameters
python -m grpo.dynamic_qwen0 --learning_rate 2e-5 --lora_r 32 --lora_alpha 64 --dropout 0.05 --wandb_project "math_dynamic_training"

# Train a model specifically for programming tasks with larger batch size
python -m grpo.programming_qwen0 --batch_size 8 --gradient_accumulation_steps 4 --max_seq_length 2048 --save_steps 200

# Train a model for tutoring and error identification with custom dataset
python -m grpo.tutor_grpo --dataset "username/math_errors_dataset" --epochs 5 --eval_steps 100 --max_steps 10000
```

After training, models can be exported using `auxilary.export_model.py` with various options:
```bash
# Basic export with default settings
python -m auxilary.export_model --model-name unsloth/Qwen1.5-7B --checkpoint checkpoints/dynamic_0_20240315 --output-dir models/dynamic_0

# Export with quantization for deployment efficiency
python -m auxilary.export_model --model-name unsloth/Qwen1.5-7B --checkpoint checkpoints/tutor_20240315 --output-dir models/tutor --quantize --bits 4

# Export with specific adapter configuration
python -m auxilary.export_model --model-name unsloth/Qwen1.5-7B --checkpoint checkpoints/programming_20240315 --output-dir models/programming --adapter-name "programming_adapter"
```

Exported models can then be evaluated using the benchmark scripts to measure improvement:
```bash
# Evaluate exported model on standard benchmark
python -m benchmarks.standard_benchmark --main LOCAL_0 --main-port 8000 --dataset Metaskepsis/validation_set --produce-statistics

# Compare performance against baseline models
python -m benchmarks.programming_benchmark --main LOCAL_0 --main-port 8000 --auxiliary LOCAL_1 --auxiliary-port 8001 --best-of 3
```

Dynamic training scripts support curriculum learning with increasing difficulty:
```bash
python -m grpo.dynamic_qwen0 --curriculum --curriculum_steps 1000,2000,3000 --curriculum_difficulties easy,medium,hard
```

For multi-GPU setups, training scripts support distributed training:

```bash
python -m torch.distributed.launch --nproc_per_node=4 grpo.dynamic_qwen0 --distributed_training
```

Reward weights can be customized for specific training objectives:

```bash
python -m grpo.solution_qwen0 --correctness_weight 2.0 --clarity_weight 1.0 --step_weight 1.5
```

Merge multiple specialized models into a single model:

```bash
python -m auxilary.merge_models --models models/solution,models/programming,models/tutor --output models/combined --weights 0.4,0.3,0.3
```