feat: Add multi-stage data training support #37
docs/data_stages.md
Multi-Stage Data Training
Multi-stage training allows switching between different data mixtures at specified training steps, similar to approaches used in Qwen3, DeepSeek-V3, and Llama 3.
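As a rough illustration of switching mixtures at given steps, the sketch below maps a training step to the stage that is active at that step. It assumes each stage covers the half-open range [start_step, end_step) and that the last stage may leave end_step unset. The Stage class and the stage_for_step helper are hypothetical names for this example, not this project's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stage:
    # Hypothetical container; field names follow the config keys in this doc.
    name: str
    start_step: int
    end_step: Optional[int] = None  # assumption: None means "until training ends"


def stage_for_step(stages: list[Stage], step: int) -> Stage:
    """Return the stage whose [start_step, end_step) range contains step."""
    for stage in sorted(stages, key=lambda s: s.start_step):
        if step >= stage.start_step and (stage.end_step is None or step < stage.end_step):
            return stage
    raise ValueError(f"no stage covers step {step}")


stages = [
    Stage("general", start_step=0, end_step=1000),
    Stage("high_quality", start_step=1000),
]
print(stage_for_step(stages, 999).name)   # general
print(stage_for_step(stages, 1000).name)  # high_quality
```

The stage names and step counts are placeholders; only the idea of step-bounded stages comes from this document.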
Quick Start
Data stages are optional. If no [[training.data_stages]] are defined, a single stage is auto-created from [training] data fields (backward compatible). When stages are defined, they override [training] data fields completely.
Multi-Stage Example
Define [[training.data_stages]] sections for multi-stage training:
Configuration Fields
Each [[training.data_stages]] section must define all data-related fields explicitly:
- name
- start_step
- end_step
- dataset
- dataset_path
- dataset_type: one of "huggingface", "nanoset", "preprocessed", "packed_memmap"
- dataset_folders
- dataset_weights
- dataset_random_seed (see training.dataset_random_seed)
- seq_len
Some fields are required based on dataset_type: dataset for huggingface, dataset_folders for nanoset.
Single-Stage Training (Backward Compatible)
For single-stage training, you can simply use [training] data fields - no [[training.data_stages]] needed:
A single stage named "default" is auto-created internally. This maintains full backward compatibility with existing configs.
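The fallback described above can be sketched as follows. This is a minimal illustration, assuming the parsed config is a nested dict and that the data-related keys under [training] are the ones listed in Configuration Fields. The resolve_stages function is a hypothetical name and "c4" is a placeholder dataset.

```python
def resolve_stages(config: dict) -> list[dict]:
    """Return explicit stages if present, else one "default" stage built from [training]."""
    training = config.get("training", {})
    explicit = training.get("data_stages")
    if explicit:
        return list(explicit)
    # Assumption: these are the data-related keys copied from [training].
    data_keys = ("dataset", "dataset_path", "dataset_type", "dataset_folders",
                 "dataset_weights", "dataset_random_seed", "seq_len")
    default = {k: training[k] for k in data_keys if k in training}
    default.update(name="default", start_step=0)
    return [default]


# A legacy config with no [[training.data_stages]] entries.
legacy = {"training": {"steps": 100, "dataset": "c4",
                       "dataset_type": "huggingface", "seq_len": 2048}}
stages = resolve_stages(legacy)
print(stages[0]["name"], stages[0]["dataset"])  # default c4
```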
Alternatively, you can explicitly define a single stage:
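One possible shape for such a config is sketched below, embedded in Python so it can be parsed and checked. Only the key names come from the field list in this document; the stage name, dataset, and step counts are placeholder values, and the snippet assumes Python 3.11+ (or the tomli backport) for TOML parsing.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli backport installed on older Pythons

# Placeholder values; only the key names come from the field list in this doc.
CONFIG = """
[training]
steps = 1000

[[training.data_stages]]
name = "main"
start_step = 0
dataset = "c4"
dataset_type = "huggingface"
seq_len = 2048
"""

config = tomllib.loads(CONFIG)
stages = config["training"]["data_stages"]
print(len(stages), stages[0]["name"])  # 1 main
```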
Validation
The following validations are performed at startup:
- name, start_step, dataset_type, seq_len must be defined
- dataset required for huggingface, dataset_folders required for nanoset
- seq_len > 0, dataset_random_seed >= 0, start_step < training.steps
Common Patterns
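Every pattern in this section has to pass the startup checks listed under Validation. A minimal sketch of those checks follows, assuming each stage is a plain dict. The validate_stage function and its error messages are hypothetical, and rejecting unknown dataset_type values is an added assumption rather than a documented check.

```python
VALID_TYPES = {"huggingface", "nanoset", "preprocessed", "packed_memmap"}


def validate_stage(stage: dict, total_steps: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the stage passed."""
    errors = []
    for key in ("name", "start_step", "dataset_type", "seq_len"):
        if key not in stage:
            errors.append(f"missing required field: {key}")
    if errors:
        return errors  # the checks below need the required fields
    dtype = stage["dataset_type"]
    if dtype not in VALID_TYPES:  # assumption: not listed among the documented checks
        errors.append(f"unknown dataset_type: {dtype}")
    if dtype == "huggingface" and not stage.get("dataset"):
        errors.append("dataset is required for huggingface")
    if dtype == "nanoset" and not stage.get("dataset_folders"):
        errors.append("dataset_folders is required for nanoset")
    if stage["seq_len"] <= 0:
        errors.append("seq_len must be > 0")
    if stage.get("dataset_random_seed", 0) < 0:
        errors.append("dataset_random_seed must be >= 0")
    if stage["start_step"] >= total_steps:
        errors.append("start_step must be < training.steps")
    return errors


good = {"name": "s0", "start_step": 0, "dataset_type": "huggingface",
        "dataset": "c4", "seq_len": 2048}
bad = {"name": "s1", "start_step": 500, "dataset_type": "nanoset", "seq_len": 0}
print(validate_stage(good, total_steps=100))  # []
print(validate_stage(bad, total_steps=100))
```

Collecting all problems in a list, instead of raising on the first one, lets a user fix a config in a single pass; the real implementation may behave differently.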
Pattern 1: Change Data Mixture
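A hedged sketch of what this pattern might look like: two stages read the same folders but with different weights. The folder paths, stage names, and step counts are placeholders, and it is an assumption that dataset_weights is a list aligned one-to-one with dataset_folders.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli backport installed on older Pythons

# Placeholder paths and weights; key names follow the field list in this doc.
CONFIG = """
[training]
steps = 2000

[[training.data_stages]]
name = "web_heavy"
start_step = 0
end_step = 1500
dataset_type = "nanoset"
dataset_folders = ["data/web", "data/code"]
dataset_weights = [0.9, 0.1]
seq_len = 4096

[[training.data_stages]]
name = "code_heavy"
start_step = 1500
dataset_type = "nanoset"
dataset_folders = ["data/web", "data/code"]
dataset_weights = [0.5, 0.5]
seq_len = 4096
"""

stages = tomllib.loads(CONFIG)["training"]["data_stages"]
for stage in stages:
    # Pair each folder with its weight to show the mixture per stage.
    mixture = dict(zip(stage["dataset_folders"], stage["dataset_weights"]))
    print(stage["name"], stage["start_step"], mixture)
```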
Pattern 2: Context Extension
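As an illustration only, context extension could be expressed as two stages that share a dataset and differ in seq_len. The dataset name, stage names, sequence lengths, and step counts below are placeholders, not recommended settings.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: tomli backport installed on older Pythons

# Placeholder values; the second stage raises seq_len late in training.
CONFIG = """
[training]
steps = 1200

[[training.data_stages]]
name = "short_context"
start_step = 0
end_step = 1000
dataset = "c4"
dataset_type = "huggingface"
seq_len = 4096

[[training.data_stages]]
name = "long_context"
start_step = 1000
dataset = "c4"
dataset_type = "huggingface"
seq_len = 32768
"""

stages = tomllib.loads(CONFIG)["training"]["data_stages"]
seq_lens = [stage["seq_len"] for stage in stages]
print(seq_lens)  # [4096, 32768]
```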
Pattern 3: Different Random Seeds (Multi-Epoch)
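The sketch below illustrates the idea behind this pattern: one stage per epoch, each with its own dataset_random_seed, so that each pass over the data can be shuffled differently. The epoch_stages and shuffled_order helpers are hypothetical, the shuffle is a stand-in for whatever the real dataloader does, and the remaining stage fields (dataset_type, seq_len, and so on) are omitted for brevity.

```python
import random


def epoch_stages(num_epochs: int, steps_per_epoch: int, base_seed: int = 0) -> list[dict]:
    """Build one stage per epoch, each with its own dataset_random_seed."""
    return [
        {
            "name": f"epoch_{i}",
            "start_step": i * steps_per_epoch,
            "end_step": (i + 1) * steps_per_epoch,
            "dataset_random_seed": base_seed + i,
        }
        for i in range(num_epochs)
    ]


def shuffled_order(seed: int, num_samples: int = 8) -> list[int]:
    """Stand-in for a seeded dataset shuffle: same seed, same order."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    return order


for stage in epoch_stages(num_epochs=2, steps_per_epoch=500):
    print(stage["name"], shuffled_order(stage["dataset_random_seed"]))
```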
Pattern 4: Mid-Training Ablation
For ablation studies where you want to test different data mixtures from a checkpoint, you can add stages that start mid-training. The system will auto-create a "default" stage from [training] fields for the gap.
Ablation config (start new mixture at step 5):
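A minimal sketch of this setup and of the gap filling, assuming stages are plain dicts and that the auto-created stage copies its data fields from [training]. The fill_leading_gap function is a hypothetical name, it expects at least one explicit stage, and the dataset values are placeholders.

```python
def fill_leading_gap(stages: list[dict], training: dict) -> list[dict]:
    """Prepend a "default" stage when the first explicit stage starts after step 0."""
    ordered = sorted(stages, key=lambda s: s["start_step"])
    first_start = ordered[0]["start_step"]
    if first_start == 0:
        return ordered
    default = {
        "name": "default",
        "start_step": 0,
        "end_step": first_start,
        # Assumption: data fields are copied from [training].
        "dataset": training.get("dataset"),
        "dataset_type": training.get("dataset_type"),
        "seq_len": training.get("seq_len"),
    }
    return [default] + ordered


training = {"steps": 10, "dataset": "c4", "dataset_type": "huggingface", "seq_len": 2048}
ablation = [{"name": "ablation_stage", "start_step": 5, "dataset": "c4",
             "dataset_type": "huggingface", "seq_len": 2048}]
plan = fill_leading_gap(ablation, training)
print([(s["name"], s["start_step"]) for s in plan])  # [('default', 0), ('ablation_stage', 5)]
```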
The system auto-creates "default" for steps 0-5 from [training], then transitions to "ablation_stage" at step 5.
Logging
At training start, a stage plan is logged:
At each transition:
Checkpoint & Resume
Stage state is automatically saved in checkpoints:
- stage_idx: Current stage index
- stage_name: Current stage name
- dataloader_state: Position within the dataset
On resume, the exact stage and dataloader position are restored. No manual intervention needed.
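To make the saved state concrete, here is a small sketch of a round trip through a checkpoint. The three keys come from the list above; the stage_state and restore helpers, the JSON serialization, the shape of dataloader_state, and the name-consistency check are all assumptions made for illustration.

```python
import json


def stage_state(stage_idx: int, stage_name: str, dataloader_state: dict) -> dict:
    """Collect the three pieces of stage state named in this section."""
    return {"stage_idx": stage_idx, "stage_name": stage_name,
            "dataloader_state": dataloader_state}


def restore(state: dict, stages: list[dict]) -> tuple[dict, dict]:
    """Return (stage, dataloader_state), checking the saved name still matches the config."""
    stage = stages[state["stage_idx"]]
    if stage["name"] != state["stage_name"]:  # assumption: a sanity check, not documented
        raise ValueError("checkpoint stage does not match current config")
    return stage, state["dataloader_state"]


stages = [{"name": "default"}, {"name": "ablation_stage"}]
# json.dumps stands in for writing a checkpoint file.
saved = json.dumps(stage_state(1, "ablation_stage", {"sample_idx": 1234}))
stage, loader_state = restore(json.loads(saved), stages)
print(stage["name"], loader_state)  # ablation_stage {'sample_idx': 1234}
```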
Testing
Test Configs
Test configs are located in docs/data_stages/configs/:
- data_stages_test.toml
- data_stages_backcompat_test.toml
- data_stages_ablation_test.toml
Automated Test Suite
Run the test script to verify all functionality:
The test suite runs 5 tests:
Manual Testing
What the Tests Verify
- Backward compatibility: configs without [[training.data_stages]] still work.
- Mid-training ablation: when [[training.data_stages]] starts after step 0 (e.g., step 5), the system auto-creates a "default" stage from [training] fields to cover the gap (steps 0-5). This lets you train initially with [training] only, then later add stages mid-training to test different data mixtures from a checkpoint.