This directory contains scripts for training language models with Group Relative Policy Optimization (GRPO) on mathematical problem-solving tasks. These training scripts form the improvement component of the larger mathematical problem-solving framework.
Defines various reward functions for different mathematical problem-solving tasks:
- `BaseReward`: Abstract base class for all reward functions with a common interface
  - Defines the contract for all reward implementations
  - Handles batch processing with async execution
  - Provides error handling and timeout management
- `SolutionReward`: Rewards for complete mathematical solutions
  - Evaluates correctness of final answers using numeric verification
  - Assesses solution quality, step coherence, and explanation clarity
  - Penalizes common errors such as skipped steps or incorrect reasoning
- `FinalizationReward`: Rewards for completing partial solutions
  - Evaluates the ability to continue from a given partial solution
  - Checks consistency with the provided steps
  - Verifies final answer correctness
- `ProgrammingReward`: Rewards for Python code that solves math problems
  - Executes code in a secure sandbox with timeout protection
  - Verifies output against expected answers with configurable tolerance
  - Evaluates code quality, efficiency, and readability
- `TutorReward`: Rewards for identifying errors in solutions
  - Assesses accuracy of error identification
  - Evaluates quality of explanations and corrections
  - Rewards a constructive feedback style
- `TestProgrammingReward`: Rewards for creating test functions
  - Evaluates test function correctness on multiple test cases
  - Assesses robustness to edge cases and numerical precision
  - Rewards comprehensive test coverage
- `ArchitectReward`: Rewards for creating solution architectures
  - Evaluates problem decomposition and approach planning
  - Assesses clarity of guidance for implementation
  - Rewards identification of potential pitfalls and edge cases
Each reward function corresponds to a specific benchmark in the benchmarks directory and uses the same evaluation criteria to ensure consistency between training and evaluation.
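A minimal sketch of how this hierarchy could be structured (class names follow the list above, but the method names, timeout default, and answer-extraction heuristic are illustrative assumptions, not the repository's actual API):

```python
import asyncio
import math
from abc import ABC, abstractmethod

class BaseReward(ABC):
    """Illustrative base class: batch scoring with async execution and timeouts."""

    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout

    @abstractmethod
    async def score_one(self, completion: str, example: dict) -> float:
        """Score a single completion against its example."""

    async def _score_safe(self, completion: str, example: dict) -> float:
        # Error handling and timeout management: any failure scores 0.0
        try:
            return await asyncio.wait_for(self.score_one(completion, example), self.timeout)
        except Exception:
            return 0.0

    def __call__(self, completions: list[str], examples: list[dict]) -> list[float]:
        # Batch processing with async execution
        async def run():
            return await asyncio.gather(
                *(self._score_safe(c, e) for c, e in zip(completions, examples))
            )
        return asyncio.run(run())

class SolutionReward(BaseReward):
    """Rewards final-answer correctness via numeric verification with tolerance."""

    async def score_one(self, completion: str, example: dict) -> float:
        try:
            # Naive extraction: take the last whitespace-separated token as the answer
            predicted = float(completion.strip().split()[-1])
            expected = float(example["answer"])
        except (ValueError, IndexError):
            return 0.0
        return 1.0 if math.isclose(predicted, expected, rel_tol=1e-6) else 0.0
```

The tolerance-based `math.isclose` check mirrors the configurable numeric verification described above; a real implementation would also score step coherence and explanation quality.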
A flexible reward function that dynamically selects between different reward types based on the example type:
- Handles multiple example types in a single training run with type-specific processing
- Maintains consistent statistics across reward types with separate tracking
- Supports group-based rewards for solution diversity using embedding similarity
- Uses `utils.similarity_checker` to measure solution diversity with configurable thresholds
- Implements weighted combinations of reward components based on task type
- Provides detailed component breakdowns for analysis and debugging
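The type dispatch and the group-based diversity bonus described above can be sketched as follows (function names, the `"type"` key, and the threshold/bonus values are assumptions, not the repository's actual API):

```python
import math

def make_dynamic_reward(rewards_by_type: dict):
    """Select a type-specific reward function based on each example's type field."""
    def dynamic_reward(completions, examples):
        scores = []
        for completion, example in zip(completions, examples):
            reward_fn = rewards_by_type[example["type"]]
            scores.append(reward_fn(completion, example))
        return scores
    return dynamic_reward

def diversity_bonus(embeddings: list, threshold: float = 0.9, bonus: float = 0.1) -> list:
    """Group-based bonus: a solution earns the bonus when its maximum cosine
    similarity to any other group member stays below the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    bonuses = []
    for i, a in enumerate(embeddings):
        max_sim = max((cosine(a, b) for j, b in enumerate(embeddings) if j != i), default=0.0)
        bonuses.append(bonus if max_sim < threshold else 0.0)
    return bonuses
```

In the actual code the embeddings would come from the embedding model configured for `utils.similarity_checker` rather than being passed in directly.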
Tracks detailed statistics during training:
- Records reward distributions and components with histograms and running averages
- Maintains separate statistics for each reward type with task-specific metrics
- Provides human-readable summaries for logging with component breakdowns
- Integrates with Wandb for visualization with custom charts and tables
- Tracks training progress with moving averages and improvement metrics
- Identifies reward outliers for analysis of exceptional cases
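A tracker with this behavior might look like the following sketch (class and method names, and the moving-average window, are illustrative assumptions):

```python
from collections import defaultdict

class RewardStats:
    """Illustrative per-type statistics: running means and moving averages."""

    def __init__(self, window: int = 100):
        self.window = window
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)
        self.recent = defaultdict(list)   # last `window` values per reward type

    def record(self, reward_type: str, value: float) -> None:
        self.totals[reward_type] += value
        self.counts[reward_type] += 1
        buf = self.recent[reward_type]
        buf.append(value)
        if len(buf) > self.window:
            buf.pop(0)

    def mean(self, reward_type: str) -> float:
        return self.totals[reward_type] / max(self.counts[reward_type], 1)

    def moving_average(self, reward_type: str) -> float:
        buf = self.recent[reward_type]
        return sum(buf) / max(len(buf), 1)

    def summary(self) -> str:
        # Human-readable breakdown suitable for logging or a Wandb table
        return " | ".join(
            f"{t}: mean={self.mean(t):.3f} ma={self.moving_average(t):.3f} n={self.counts[t]}"
            for t in sorted(self.counts)
        )
```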
Defines the RewardConfig dataclass with settings for:
- Model parameters (main and auxiliary models) with type and endpoint configuration
- Reward values for different components with configurable weights
- Embedding model settings for similarity checking with model selection and parameters
- Numeric tolerance and other execution parameters for consistent evaluation
- Timeout settings for model calls and code execution
- Logging configuration for detailed or summarized output
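Given those settings, the dataclass could look roughly like this (field names and defaults are assumptions drawn from the surrounding description, not the actual `config.py`):

```python
from dataclasses import dataclass, field

@dataclass
class RewardConfig:
    """Illustrative shape of the reward configuration described above."""
    model_name: str = "unsloth/Qwen1.5-7B"        # main model to fine-tune
    auxiliary_model: str = "gpt-4-turbo"          # model used for reward calculation
    embedding_model: str = "all-MiniLM-L6-v2"     # similarity-checking embeddings (assumed)
    numeric_tolerance: float = 1e-6               # tolerance for answer verification
    model_timeout: float = 60.0                   # seconds per auxiliary-model call
    execution_timeout: float = 10.0               # seconds per sandboxed code run
    reward_weights: dict = field(default_factory=lambda: {
        "correctness": 2.0, "clarity": 1.0, "steps": 1.5,
    })
    verbose_logging: bool = False                 # detailed vs. summarized output
```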
- `dynamic_qwen0.py`, `dynamic_qwen1.py`, `dynamic_qwen2.py`: Train Qwen models with dynamic rewards using different seeds and configurations
  - Uses combined datasets prepared with `utils.data_preparation.prepare_combined_data`
  - Balances multiple task types in a single training run with configurable distribution
  - Implements curriculum learning with increasing difficulty
  - Supports mixed precision training for efficiency
  - Includes evaluation on validation sets during training
- `solution_qwen0.py`: Train for complete mathematical solutions
  - Corresponds to `benchmarks.standard_benchmark`
  - Uses `utils.data_preparation.prepare_solution_data`
  - Focuses on step-by-step reasoning and answer correctness
  - Implements specialized rewards for mathematical notation quality
- `programming_qwen0.py`: Train for Python code generation
  - Corresponds to `benchmarks.programming_benchmark`
  - Uses `utils.data_preparation.prepare_programming_data`
  - Includes code execution verification in the training loop
  - Rewards efficient and readable code implementations
- `test_programming_qwen0.py`: Train for test function creation
  - Corresponds to `benchmarks.test_benchmark`
  - Uses `utils.data_preparation.prepare_test_programming_data`
  - Focuses on comprehensive test case generation
  - Rewards robust handling of edge cases
- `tutor_grpo.py`: Train for error identification and correction
  - Corresponds to `benchmarks.tutor_benchmark`
  - Uses `utils.data_preparation.prepare_tutor_data`
  - Trains on both correct and deliberately flawed solutions
  - Rewards constructive feedback and clear explanations
- `finalization_grpo.py`: Train for completing partial solutions
  - Uses `utils.data_preparation.prepare_finalization_data`
  - Focuses on completing solutions from partial steps
  - Rewards consistency with provided partial solutions
  - Implements specialized handling for different completion points
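The multi-task balancing used by the dynamic scripts can be sketched as weighted sampling over per-type example pools (all names and the sampling scheme here are illustrative assumptions):

```python
import random

def balanced_batches(examples_by_type: dict, distribution: dict,
                     batch_size: int, seed: int = 0):
    """Yield batches whose task-type mix follows the given distribution.
    Sampling is with replacement for simplicity."""
    rng = random.Random(seed)
    types = list(distribution)
    weights = [distribution[t] for t in types]
    while True:
        batch = []
        # Draw a task type per slot, then an example of that type
        for t in rng.choices(types, weights=weights, k=batch_size):
            batch.append(rng.choice(examples_by_type[t]))
        yield batch
```

Curriculum learning could then be layered on top by switching `examples_by_type` to harder pools at configured step counts.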
Each training script is designed to improve performance on specific benchmarks:
- `solution_qwen0.py` → Improves performance on `benchmarks.standard_benchmark.py`
- `programming_qwen0.py` → Improves performance on `benchmarks.programming_benchmark.py`
- `test_programming_qwen0.py` → Improves performance on `benchmarks.test_benchmark.py`
- `tutor_grpo.py` → Improves performance on `benchmarks.tutor_benchmark.py`
- `finalization_grpo.py` → Supports multiple benchmarks with partial solution completion
Training scripts rely on utility modules in the utils directory:
- `utils.data_preparation`: Formats datasets for different training tasks with task-specific processing
- `utils.solution_utils`: Validates solutions and extracts answers with LaTeX support
- `utils.similarity_checker`: Measures diversity between solutions using embedding models
- `utils.model_utils`: Handles model responses and timeouts with retry logic
Training uses datasets processed by the auxiliary scripts:
- Filtered datasets from `auxilary.filter_dataset` with quality controls
- Validation sets from `auxilary.create_validation_set` for consistent evaluation
- Merged datasets from `auxilary.merge_json` for comprehensive training
- Converted datasets from `auxilary.datatype_transformation` for format compatibility
All training scripts share these common features:
- Integration with Unsloth for efficient training of Qwen models (2-3x speedup)
- Wandb logging for experiment tracking and visualization with detailed metrics
- LoRA fine-tuning for parameter-efficient training with configurable parameters
- Detailed logging of training metrics and reward components with component breakdowns
- Model saving in merged format for easy deployment with adapter integration
- Checkpoint creation for resuming training with automatic versioning
- Gradient accumulation for effective batch size scaling
- Mixed precision training for memory efficiency
- Evaluation during training on validation sets
- Early stopping based on validation performance
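As one example of these shared features, early stopping on validation performance can be sketched as a small stateful check (class name, patience, and delta defaults are illustrative):

```python
class EarlyStopping:
    """Illustrative early-stopping check on a validation metric (higher is better)."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_evals = 0

    def step(self, val_metric: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_metric > self.best + self.min_delta:
            self.best = val_metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```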
Run any training script directly with optional command-line arguments:
```bash
python -m grpo.dynamic_qwen0 --learning_rate 2e-5 --epochs 3 --batch_size 4
```

Training scripts use the `RewardConfig` class from `config.py` for configuration. Key parameters include:

- `model_type`: Type of model to train (e.g., "dynamic_0", "solution", "programming")
- `model_name`: Base model to fine-tune (e.g., "unsloth/Qwen1.5-7B")
- `auxiliary_model`: Model for reward calculation (e.g., "gpt-4-turbo")
- Base reward values for different components with configurable weights
- Similarity thresholds and diversity bonuses for group-based rewards
- Embedding model configuration for similarity calculation
Each script configures:
- Learning rate and optimizer settings (AdamW with weight decay)
- Batch size and gradient accumulation steps for memory efficiency
- Number of generations per prompt for diverse training signals
- Maximum sequence lengths for input and output
- Training epochs and checkpointing frequency
- LoRA parameters (rank, alpha, dropout) for adapter configuration
- Evaluation frequency and criteria
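Wiring these parameters into a CLI could look like the following `argparse` sketch (flag names follow the usage examples in this README; defaults are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI mirroring the per-script configuration listed above."""
    p = argparse.ArgumentParser(description="GRPO training script")
    p.add_argument("--learning_rate", type=float, default=2e-5)
    p.add_argument("--epochs", type=int, default=3)
    p.add_argument("--batch_size", type=int, default=4)
    p.add_argument("--gradient_accumulation_steps", type=int, default=4)
    p.add_argument("--max_seq_length", type=int, default=2048)
    p.add_argument("--lora_r", type=int, default=32)      # LoRA rank
    p.add_argument("--lora_alpha", type=int, default=64)  # LoRA scaling
    p.add_argument("--dropout", type=float, default=0.05)
    p.add_argument("--eval_steps", type=int, default=100)
    p.add_argument("--save_steps", type=int, default=200)
    return p
```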
```bash
# Train a dynamic model with multiple task types and custom parameters
python -m grpo.dynamic_qwen0 --learning_rate 2e-5 --lora_r 32 --lora_alpha 64 --dropout 0.05 --wandb_project "math_dynamic_training"

# Train a model specifically for programming tasks with larger batch size
python -m grpo.programming_qwen0 --batch_size 8 --gradient_accumulation_steps 4 --max_seq_length 2048 --save_steps 200

# Train a model for tutoring and error identification with custom dataset
python -m grpo.tutor_grpo --dataset "username/math_errors_dataset" --epochs 5 --eval_steps 100 --max_steps 10000
```

After training, models can be exported using `auxilary.export_model.py` with various options:
```bash
# Basic export with default settings
python -m auxilary.export_model --model-name unsloth/Qwen1.5-7B --checkpoint checkpoints/dynamic_0_20240315 --output-dir models/dynamic_0

# Export with quantization for deployment efficiency
python -m auxilary.export_model --model-name unsloth/Qwen1.5-7B --checkpoint checkpoints/tutor_20240315 --output-dir models/tutor --quantize --bits 4

# Export with specific adapter configuration
python -m auxilary.export_model --model-name unsloth/Qwen1.5-7B --checkpoint checkpoints/programming_20240315 --output-dir models/programming --adapter-name "programming_adapter"
```

Exported models can then be evaluated using the benchmark scripts to measure improvement:
```bash
# Evaluate exported model on standard benchmark
python -m benchmarks.standard_benchmark --main LOCAL_0 --main-port 8000 --dataset Metaskepsis/validation_set --produce-statistics

# Compare performance against baseline models
python -m benchmarks.programming_benchmark --main LOCAL_0 --main-port 8000 --auxiliary LOCAL_1 --auxiliary-port 8001 --best-of 3
```

Dynamic training scripts support curriculum learning with increasing difficulty:
```bash
python -m grpo.dynamic_qwen0 --curriculum --curriculum_steps 1000,2000,3000 --curriculum_difficulties easy,medium,hard
```

For multi-GPU setups, training scripts support distributed training:

```bash
python -m torch.distributed.launch --nproc_per_node=4 grpo.dynamic_qwen0 --distributed_training
```

Reward weights can be customized for specific training objectives:

```bash
python -m grpo.solution_qwen0 --correctness_weight 2.0 --clarity_weight 1.0 --step_weight 1.5
```

Merge multiple specialized models into a single model:

```bash
python -m auxilary.merge_models --models models/solution,models/programming,models/tutor --output models/combined --weights 0.4,0.3,0.3
```