Training framework for sample-efficient language model pre-training. This codebase was used to train IMU-1, a 430M-parameter model that approaches the benchmark performance of models trained on 56x more data.
Model: `thepowerfuldeez/imu1_base`
- Architectural interventions: QK-norm attention (sketched below), per-head gating, value residual learning, LayerNorm scaling
- Optimization: NorMuon optimizer with cautious weight decay, muP parametrization
- Custom kernels: Triton Flash Attention with QK-norm, fused cross-entropy
- Training: Multi-stage WSD schedule, checkpoint EMA, gradient checkpointing
- Tokenization: Fast Rust-based BPE with HuggingFace conversion
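As a reference for the first item above, QK-norm applies RMS normalization to queries and keys per head before the dot product, which bounds attention logits. The repo fuses this into its Triton Flash Attention kernel (see `transformer/ops/`); the plain PyTorch sketch below is illustrative only, with learnable gains omitted:

```python
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize the last (head_dim) axis to unit RMS; learnable gain omitted.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def qk_norm_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Causal attention with QK-norm. q, k, v: (batch, n_heads, seq, head_dim)."""
    q, k = rms_norm(q), rms_norm(k)  # QK-norm: attention logits are now bounded
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale
    seq = logits.shape[-1]
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```

The repo additionally combines this with per-head gating and value residuals; those are left out here for brevity.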
Requirements:

- Python 3.13+
- PyTorch nightly (CUDA 13)
- NVIDIA GPU with CUDA support
```bash
# Install dependencies (uses PyTorch nightly with CUDA 13)
export UV_TORCH_BACKEND=auto
uv pip install setuptools uv_build maturin
uv sync

# Login to services
uv run wandb login
huggingface-cli login
```

Pre-tokenized datasets and reproduction scripts are provided in `recipes/`.
```bash
# Stage 1 & 2 (pre-tokenized)
huggingface-cli download thepowerfuldeez/1218_imu1_base_stable_corpus --repo-type=dataset
huggingface-cli download thepowerfuldeez/1226_imu1_base_decay_corpus --repo-type=dataset

# Stage 3: tokenize with the provided script (see recipes/1_tokenize_stage3.sh)
```

```bash
# Three-stage training (265k iterations, ~72B tokens)
NUM_GPUS=8 ./recipes/2_train.sh
```

Or run stages individually:
```bash
# Stage 1: Stable (100k iterations)
uv run torchrun --nproc_per_node 8 train.py --config configs/imu1_base.yaml --config-key stable

# Stage 2: Decay (100k iterations)
uv run torchrun --nproc_per_node 8 train.py --config configs/imu1_base.yaml --config-key decay

# Stage 3: Midtrain (65k iterations)
uv run torchrun --nproc_per_node 8 train.py --config configs/imu1_base.yaml --config-key midtrain
```

```bash
# Convert checkpoint to HuggingFace format
uv run scripts/hf/convert_imu1_checkpoint.py \
--checkpoint checkpoints/imu1_stage3_midtrain/265000.pt \
--output-dir imu1_hf \
--tokenizer HuggingFaceTB/SmolLM2-360M
```

Evaluate models on the CORE benchmark (HellaSwag, ARC, PIQA, LAMBADA, Winograd, etc.).
Download evaluation assets first:
```bash
bash download_eval.sh
```

This downloads the eval bundle to `~/.cache/sample_efficient_gpt/eval_bundle/`.
```bash
# Evaluate a sample_efficient_gpt checkpoint (--type se)
uv run torchrun --nproc_per_node 2 evals/base_eval.py \
--checkpoint /path/to/checkpoint.pt \
--type se \
--tokenizer_path HuggingFaceTB/SmolLM2-360M \
--batch 1 \
--base_dir ~/.cache/sample_efficient_gpt \
--task arc_easy
```

```bash
# Evaluate a HuggingFace model (--type hf)
uv run torchrun --nproc_per_node 2 evals/base_eval.py \
--checkpoint thepowerfuldeez/imu1_base \
--type hf \
--batch 1 \
--base_dir ~/.cache/sample_efficient_gpt \
--task arc_easy
```

| Argument | Description |
|---|---|
| `--checkpoint` | Path to `.pt` file (`se`) or HF model name (`hf`) |
| `--type` | `se` for sample_efficient_gpt, `hf` for HuggingFace |
| `--tokenizer_path` | Tokenizer (required for `--type se`) |
| `--task` | Specific task (e.g. `arc_easy`, `hellaswag`, `piqa`); omit for the full CORE suite |
| `--batch` | Per-GPU batch size |
| `--base_dir` | Directory containing `eval_bundle` |
| `--max_per_task` | Limit examples per task for debugging |
Available tasks: `hellaswag`, `jeopardy`, `bigbench_qa_wikidata`, `arc_easy`, `arc_challenge`, `copa`, `commonsenseqa`, `piqa`, `openbookqa`, `lambada_openai`, `winograd`, `winogrande`, `bigbench_dyck_languages`, `agieval_lsat_ar`, `bigbench_cs_algorithms`, `bigbench_operators`, `bigbench_repeat_copy_logic`, `squad`, `coqa`, `boolq`, `bigbench_language_id`
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("thepowerfuldeez/imu1_base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("thepowerfuldeez/imu1_base")

text = "The quick brown fox"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

Or run the model via the provided script:

```bash
uv run scripts/hf/run_imu1_hf.py \
--model-dir thepowerfuldeez/imu1_base \
--tokenizer thepowerfuldeez/imu1_base \
--max-new-tokens 50 \
--prompt "The capital of France is"uv run scripts/hf/convert_imu1_checkpoint.py \
--checkpoint /path/to/checkpoint.pt \
--output-dir output_hf \
--tokenizer HuggingFaceTB/SmolLM2-360M
```

Convert a HuggingFace model back to a sample_efficient_gpt checkpoint (HF → SE):

```bash
uv run scripts/hf/convert_hf_to_imu.py \
--hf-dir HuggingFaceTB/SmolLM2-360M \
--output checkpoint.pt
```

Initialize an IMU-1 checkpoint from SmolLM2 with widening:

```bash
uv run scripts/hf/convert_smollm2_to_imu.py \
--hf-dir HuggingFaceTB/SmolLM2-360M \
--output checkpoint.pt \
--width 1152 \
--n-heads 18 \
--n-kv-heads 6 \
--widening-mode preserve-norm
```

Tokenize datasets for training:

```bash
uv run tokenizer/tokenize_with_fast_tokenizer.py \
--data-path /path/to/data \
--tokenized-data-path /path/to/output \
--tokenizer-name HuggingFaceTB/SmolLM2-360M
```

Input formats: parquet files or plain-text files. Output: memory-mapped `.npy` (text) or `.npz` (parquet) arrays.
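To sanity-check the tokenized output without loading it into RAM, a `.npy` array can be opened memory-mapped (the filename below is a placeholder; the `.npz` case stores named arrays instead):

```python
import numpy as np

# Memory-map the tokenized corpus; nothing is read until indexed.
tokens = np.load("/path/to/output/train.npy", mmap_mode="r")
print(tokens.dtype, tokens.shape)  # token-id dtype and total token count
print(tokens[:16])                 # peek at the first few token ids
```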
Train on custom data by passing tokenized paths directly:

```bash
uv run train.py --config CONFIG --train-path TRAIN.npy --validation-path VAL.npy
```

Override config values:
```bash
uv run train.py --config gpt_small_faster --override '{"model.d_model": 1024, "optim.lr": 1e-3}'
```

Configs use frozen dataclasses (`config_schema.py`):
- `DataConfig`: batch size, context length, tokenizer, data paths
- `ModelConfig`: architecture (d_model, layers, heads, attention variants)
- `OptimConfig`: learning rates, optimizer (NorMuon/AdamW), schedules (cosine/WSD; see the sketch below)
- `TrainerConfig`: checkpointing, distributed training, logging
See `configs/imu1_base.yaml` for the full IMU-1 configuration.
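The WSD schedule referenced in `OptimConfig` (and used for IMU-1's multi-stage run) holds the learning rate flat after warmup and decays it only at the end, which is what makes separate `stable`/`decay` stages natural. A minimal sketch of the shape, not the repo's implementation; the phase lengths below are placeholders:

```python
def wsd_lr(step: int, max_lr: float, warmup: int, stable: int, decay: int) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, then decay to zero."""
    if step < warmup:                      # 1) linear warmup
        return max_lr * (step + 1) / warmup
    if step < warmup + stable:             # 2) stable plateau (most of training)
        return max_lr
    t = step - warmup - stable             # 3) final decay phase
    return max_lr * max(0.0, 1.0 - t / decay)

# Placeholder phase lengths, e.g. 2k warmup / 198k stable / 65k decay:
lrs = [wsd_lr(s, 1e-3, 2_000, 198_000, 65_000) for s in range(265_000)]
```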
Repository layout:

```
sample_efficient_gpt/
├── train.py                        # Main training entry point
├── config_schema.py                # Configuration dataclasses
├── configs/                        # Training configurations
│   ├── imu1_base.yaml              # IMU-1 (430M) config
│   └── gpt_small_faster.py         # Ablation model (70M) config
├── recipes/                        # Reproduction scripts
├── transformer/                    # Model architecture
│   ├── transformer.py              # Transformer with KV-cache
│   ├── attention.py                # Multi-head attention (QK-norm, gating, value residual)
│   ├── core.py                     # Primitives (Linear, RMSNorm, SwiGLU)
│   └── ops/                        # Triton kernels
├── training/                       # Training infrastructure
│   ├── trainer.py                  # Training loop
│   ├── data.py                     # Memory-mapped dataset
│   └── optimizers/                 # NorMuon, Muon implementations
├── tokenizer/                      # Tokenization pipeline
├── scripts/hf/                     # HuggingFace conversion
│   ├── convert_imu1_checkpoint.py  # SE → HF conversion
│   ├── convert_hf_to_imu.py        # HF → SE conversion
│   ├── convert_smollm2_to_imu.py   # SmolLM2 init with widening
│   └── run_imu1_hf.py              # Run HF model
└── evals/                          # Benchmark evaluation
    ├── base_eval.py                # CORE benchmark (supports SE and HF)
    └── core_eval.py                # Evaluation logic
```
Training uses JSON files that specify dataset paths and token counts:

- `data_stage1_stable.json`: Stage 1 data mix
- `data_stage2_decay.json`: Stage 2 data mix
- `data_stage3_midtrain.json`: Stage 3 data mix

Format: `{"path/to/tokenized/data": num_tokens, ...}`
- Gated Attention
- Value Residual Learning
- LayerNorm Scaling
- muP Parametrization
- Cautious Weight Decay
- Z-loss (sketched below)
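Of the techniques above, z-loss is compact enough to sketch inline: it penalizes the squared log-partition of the logits so they stay close to normalized. A hedged reference version; the coefficient below is a common default, not necessarily the one used here:

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                              z_coef: float = 1e-4) -> torch.Tensor:
    # logits: (num_tokens, vocab), targets: (num_tokens,)
    ce = F.cross_entropy(logits, targets)
    z = torch.logsumexp(logits, dim=-1)   # log of the softmax normalizer
    return ce + z_coef * (z ** 2).mean()  # pulls log Z toward 0
```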
License: MIT