This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

Research project exploring knowledge distillation from FP32 teacher models to INT4 student models using on-policy (student-generated) data. Compares on-policy vs. off-policy distillation for quantization, finding they perform equivalently (~0.3% difference).
## Commands

```bash
# Install dependencies
uv sync

# Training (single run)
uv run accelerate launch train.py @configs/offpolicy.toml --output-dir dump/offpolicy
uv run accelerate launch train.py @configs/onpolicy.toml --output-dir dump/onpolicy

# Evaluation
uv run python eval.py --lora-paths dump/offpolicy --model-name Qwen/Qwen3-4B-Instruct-2507

# GPTQ/AWQ comparison
uv run python gptq_awq.py

# Hyperparameter sweep (K8s)
bash k8s/sweep.sh sweeps/lr_sweep_offpolicy.yaml 4
```

## Architecture

### Configuration

- Pydantic v2 with `pydantic_config` for TOML/CLI parsing (see the sketch after this list)
- SharedConfig: Model loading, quantization backends (torchao INT4/NVFP4, bitsandbytes FP4/NF4)
- TrainConfig: GKD params (λ, β), LoRA settings, batch sizes
- EvalConfig: lm-eval tasks + perplexity evaluation
- Config files use `@configs/file.toml` syntax on CLI
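
A minimal sketch of what these config classes might look like. All field names, defaults, and the inheritance layout are assumptions for illustration, not the repo's actual schema; the real TOML/CLI parsing goes through `pydantic_config`, whose API is not reproduced here.

```python
from pydantic import BaseModel

class SharedConfig(BaseModel):
    # Assumed fields: base model and quantization backend selection
    model_name: str = "Qwen/Qwen3-4B-Instruct-2507"
    quant_backend: str = "torchao-int4"  # or "torchao-nvfp4", "bnb-fp4", "bnb-nf4"

class TrainConfig(SharedConfig):
    lmbda: float = 0.0   # GKD λ: fraction of student-generated training data
    beta: float = 0.0    # GKD β: KL interpolation (see hyperparameters below)
    lora_rank: int = 16
    batch_size: int = 4

# Plain-pydantic usage; in the repo, values come from @configs/file.toml + CLI
config = TrainConfig(lmbda=1.0, beta=1.0)
print(config.model_dump())
```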
### Training (train.py)

- Loads teacher (FP16) and student (INT4 QAT + LoRA)
- Uses `GKDTrainer` from TRL for on-policy distillation (see the sketch after this list)
- Dataset: `allenai/tulu-3-sft-mixture` with quality filtering
- Periodic perplexity evaluation during training
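
A minimal sketch of the training setup, assuming TRL's standard `GKDConfig`/`GKDTrainer` API; the QAT/LoRA preparation, quality filtering, and periodic perplexity hooks are omitted, and the `lmbda`/`beta` values shown are just the off-policy setting.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

model_name = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)  # real script applies INT4 QAT + LoRA
teacher = AutoModelForCausalLM.from_pretrained(model_name)  # real script keeps this in FP16

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")  # quality filtering omitted

args = GKDConfig(output_dir="dump/offpolicy", lmbda=0.0, beta=0.0)

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```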
### GKD hyperparameters

- λ (lambda): 0=off-policy (dataset only), 1=on-policy (student-generated)
- β (beta): 0=forward KL, 1=reverse KL (formula below)
- Optimal: β=0 for off-policy, β=1 for on-policy
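
For reference, β parameterizes the generalized Jensen-Shannon divergence from the GKD paper, with teacher distribution $P$ and student distribution $Q_\theta$:

$$
\mathcal{D}_{\mathrm{JSD}(\beta)}(P \,\|\, Q_\theta) = \beta\,\mathrm{KL}(P \,\|\, M) + (1-\beta)\,\mathrm{KL}(Q_\theta \,\|\, M), \qquad M = \beta P + (1-\beta)\,Q_\theta
$$

Up to scaling, this reduces to forward KL as β→0 and reverse KL as β→1, matching the bullets above; λ independently controls how often completions are sampled from the student rather than taken from the dataset.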
### Evaluation (eval.py)

- Benchmarks: HellaSwag, ARC-Easy/Challenge, WinoGrande, MMLU
- Perplexity: sliding window on wikitext/c4 (see the sketch after this list)
- Merges LoRA adapters before final quantization
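
A minimal sketch of sliding-window perplexity, following the standard Hugging Face recipe; the window/stride values are assumptions and eval.py's actual implementation may differ.

```python
import torch

@torch.no_grad()
def sliding_window_ppl(model, tokenizer, text, max_length=1024, stride=512):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    seq_len = input_ids.size(1)

    nll_sum, n_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end           # score only tokens not seen in a previous window
        window = input_ids[:, begin:end]
        targets = window.clone()
        targets[:, :-trg_len] = -100       # mask context tokens already scored
        loss = model(window, labels=targets).loss
        nll_sum += loss.item() * trg_len   # loss is a per-token mean; re-weight by token count
        n_tokens += trg_len
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.tensor(nll_sum / n_tokens)).item()
```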
### Custom GKD trainer

- Extends TRL's `GKDTrainer` to add `min_new_tokens` support (see the sketch after this list)
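
A minimal sketch of one way to wire that in. The class name is hypothetical, and the `generation_config` attribute reflects TRL's current `GKDTrainer` internals (which may change across versions); `min_new_tokens` itself is a standard `transformers` `GenerationConfig` field.

```python
from trl import GKDTrainer

class MinNewTokensGKDTrainer(GKDTrainer):  # hypothetical name, not the repo's class
    def __init__(self, *args, min_new_tokens: int = 32, **kwargs):
        super().__init__(*args, **kwargs)
        # GKDTrainer builds self.generation_config for on-policy sampling in
        # __init__; min_new_tokens forbids EOS until that many tokens are generated.
        self.generation_config.min_new_tokens = min_new_tokens
```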