Modern, modular causal language model (CLM) pretraining baseline using the Hugging Face ecosystem.
This repository modernizes an older torchtext-based language modeling project, rebuilding it around current best practices with:
- 🤗 Transformers (models, Trainer, checkpoints)
- 🤗 Datasets (HF datasets + local text + streaming)
- 🤗 Tokenizers (optional BPE tokenizer training)
- Optional: Weights & Biases logging and Hugging Face Hub publishing
The focus is causal LLMs (next-token prediction). The code is designed to be simple now and easy to extend later to:
- Supervised fine-tuning (SFT)
- Instruction tuning
- Preference tuning (DPO/ORPO, etc.)
Current features:

- ✅ Causal LM pretraining (labels = input_ids)
- ✅ Standard tokenization → packing (block_size) → Trainer
- ✅ Checkpointing and resume from latest checkpoint
- ✅ Mixed precision (fp16/bf16), configurable
- ✅ Optional W&B logging
- ✅ Optional push_to_hub (upload checkpoints/model)
- ✅ Train on HF datasets (e.g., WikiText)
- ✅ Train on local text files (`load_dataset("text", data_files=...)`)
- ✅ Streaming mode for large datasets
- ✅ Optional `max_train_samples` / `max_eval_samples` for fast tutorials
- ✅ Use a pretrained tokenizer (e.g., GPT-2), or
- ✅ Train a BPE tokenizer from your local corpus and save it as a HF tokenizer folder
Repository layout:

```
Torch-Linguist/
├── README.md
├── requirements.txt
├── configs/
│   ├── pretrain_gpt2_small.yaml
│   └── tokenizer_bpe.yaml
├── scripts/
│   ├── pretrain.py
│   ├── train_tokenizer.py
│   └── generate.py
└── src/torch_linguist/
    ├── data/        # dataset loading + packing
    ├── modeling/    # model building (from scratch or pretrained)
    ├── training/    # Trainer wiring + perplexity helpers
    └── utils/       # seed, logging, hub helpers
```

The training pipeline follows the standard causal LM pretraining recipe:
1. Load dataset
   - HF dataset: `load_dataset(dataset_name, dataset_config)`
   - Local text: `load_dataset("text", data_files=...)`
2. Tokenize
   - Convert raw text into token IDs (`input_ids`)
3. Pack into fixed blocks (see the packing sketch below)
   - Concatenate token streams
   - Split into fixed-length sequences of `block_size` (e.g., 512)
   - Create labels: `labels = input_ids`
4. Train
   - Use `Trainer` + `DataCollatorForLanguageModeling(mlm=False)`
   - Save checkpoints and optionally resume / push to hub
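As a concrete illustration of steps 2–3, here is a minimal packing sketch in the style of the Hugging Face `run_clm` example. The repo's actual helpers live in `src/torch_linguist/data/` and may differ; the dataset and tokenizer choices below (WikiText-2, GPT-2) are placeholders.

```python
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
raw_dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
block_size = 512  # must match block_size in the training config

def tokenize_fn(examples):
    # No padding/truncation here: packing below handles sequence length.
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token streams, then split into fixed-length blocks.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Causal LM pretraining: labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = raw_dataset.map(tokenize_fn, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
```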
Install dependencies:

```bash
pip install -r requirements.txt
```

Optional (for Weights & Biases logging):
```bash
pip install wandb
```

WikiText-2 is the recommended "tutorial default" dataset: small, fast, and good for learning.
Run the default pretraining config:

```bash
PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
```

Checkpoints will be written to:

```
runs/pretrain_gpt2_small/
```
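Under the hood, `scripts/pretrain.py` wires up roughly the following (a sketch only: the real hyperparameters come from the YAML config, the model is built in `src/torch_linguist/modeling/`, and `lm_dataset` is the packed dataset from the sketch above):

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config())  # from-scratch GPT-2 small; capacity knobs are covered below

args = TrainingArguments(
    output_dir="runs/pretrain_gpt2_small",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    save_steps=500,
    logging_steps=50,
    bf16=True,         # or fp16=True, matching the mixed-precision config
    report_to="none",  # "wandb" when W&B logging is enabled
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```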
Generate text from a checkpoint:

```bash
PYTHONPATH=src python scripts/generate.py \
  --model_dir runs/pretrain_gpt2_small/checkpoint-500 \
  --prompt "Hello my name is"
```
To use a pretrained tokenizer (e.g., GPT-2), set in `configs/pretrain_gpt2_small.yaml`:

```yaml
tokenizer:
  pretrained_name: gpt2
  pad_to_eos_if_missing: true
```
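The `pad_to_eos_if_missing` option presumably amounts to the standard GPT-2 workaround of reusing the EOS token for padding, roughly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 ships without a pad token; fall back to EOS so padding/batching works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```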
To train a BPE tokenizer on your own corpus instead:

1. Put your corpus files in `data/raw/` (or adjust the paths in the config)
2. Train the tokenizer (a sketch of this step follows the list):

   ```bash
   PYTHONPATH=src python scripts/train_tokenizer.py --config configs/tokenizer_bpe.yaml
   ```

3. Update `configs/pretrain_gpt2_small.yaml` to use the tokenizer folder:
   ```yaml
   tokenizer:
     tokenizer_dir: artifacts/tokenizer_bpe
     pad_to_eos_if_missing: true
   ```
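Step 2 above might look roughly like this with the 🤗 Tokenizers library (a sketch, not the exact `scripts/train_tokenizer.py` implementation; vocabulary size and special tokens are illustrative and would come from `configs/tokenizer_bpe.yaml`):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Byte-level BPE, GPT-2 style.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<|endoftext|>"])
tokenizer.train(["data/raw/wiki.train.tokens"], trainer=trainer)

# Wrap and save as a Hugging Face tokenizer folder usable via tokenizer_dir.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    eos_token="<|endoftext|>",
)
hf_tokenizer.save_pretrained("artifacts/tokenizer_bpe")
```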
Then train:

```bash
PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
```
To train on local text files, set in `configs/pretrain_gpt2_small.yaml`:

```yaml
data:
  source: local_text
  local_files:
    train: data/raw/wiki.train.tokens
    validation: data/raw/wiki.valid.tokens
    test: data/raw/wiki.test.tokens
```

Then run:
```bash
PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
```

For very large datasets (web-scale text), enable streaming:
```yaml
data:
  streaming: true
```

This uses a streaming-friendly block packing approach.
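A streaming-friendly packer cannot materialize the whole corpus, so it packs on the fly from a rolling token buffer. A minimal sketch (dataset name and helper names are illustrative, not the repo's exact implementation):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 512

# streaming=True yields an IterableDataset: examples arrive one by one, no random access.
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

def packed_blocks(stream):
    buffer = []
    for example in stream:
        buffer.extend(tokenizer(example["text"])["input_ids"])
        # Emit fixed-size blocks as soon as enough tokens have accumulated.
        while len(buffer) >= block_size:
            block, buffer = buffer[:block_size], buffer[block_size:]
            yield {"input_ids": block, "labels": list(block)}

for i, item in enumerate(packed_blocks(stream)):
    if i >= 2:  # just peek at a couple of packed blocks
        break
    print(len(item["input_ids"]))
```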
To resume training, set in `configs/pretrain_gpt2_small.yaml`:

```yaml
train:
  resume_from_checkpoint: auto
```

- `auto` resumes from the most recent `checkpoint-*` in `output_dir` (see the sketch below).
- You can also set an explicit checkpoint path:

  ```yaml
  resume_from_checkpoint: runs/pretrain_gpt2_small/checkpoint-1000
  ```
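Resolving `auto` presumably comes down to the helper Transformers already ships, something like:

```python
from transformers.trainer_utils import get_last_checkpoint

# Returns the newest checkpoint-* directory in output_dir, or None if there is none.
last_checkpoint = get_last_checkpoint("runs/pretrain_gpt2_small")
trainer.train(resume_from_checkpoint=last_checkpoint)  # trainer as wired in the pretraining sketch above
```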
To log to Weights & Biases:

- Install and log in:

  ```bash
  pip install wandb
  wandb login
  ```

- Enable it in the config:

  ```yaml
  train:
    report_to: wandb
    run_name: "wikitext2-gpt2-small"
  ```
To publish to the Hugging Face Hub:

- Log in once:

  ```bash
  huggingface-cli login
  ```

- Enable it in the config:

  ```yaml
  hub:
    push_to_hub: true
    repo_id: "YOUR_USERNAME/torch-linguist-pretrain"
    private: true
  ```

During training, checkpoints can be uploaded depending on `hub_strategy`, and at the end the script runs `trainer.push_to_hub()`.
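On the Transformers side, the `hub:` block presumably maps onto the usual `TrainingArguments` Hub fields, roughly:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="runs/pretrain_gpt2_small",
    push_to_hub=True,
    hub_model_id="YOUR_USERNAME/torch-linguist-pretrain",
    hub_private_repo=True,
    # "every_save" pushes each save; "checkpoint" also pushes the latest
    # checkpoint folder so training can be resumed from the Hub.
    hub_strategy="checkpoint",
)
```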
Typical `block_size` values:

- 256/512 for quick experiments
- 1024+ for stronger modeling (needs more VRAM)
If VRAM is limited:

- keep `per_device_train_batch_size` small
- increase `gradient_accumulation_steps`
In the model config:

- `n_layer`, `n_head`, `n_embd` control model capacity (see the sketch below)
- start small to validate the pipeline, then scale up
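For example, a small from-scratch GPT-2 variant (values are illustrative):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=6, n_head=8, n_embd=512)  # capacity knobs
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```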
This repo is structured so adding later stages is straightforward:
For SFT / instruction tuning, add:

- `scripts/finetune_sft.py`
- `src/torch_linguist/data/sft.py` (format: prompt → response)
- a data collator that masks prompt tokens if needed (see the sketch below)
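A common way to mask the prompt is to set its label positions to -100, which the cross-entropy loss ignores. A sketch with a hypothetical helper (nothing below exists in the repo yet):

```python
from transformers import AutoTokenizer

def build_sft_example(tokenizer, prompt, response):
    """Compute loss only on the response by masking prompt labels with -100."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [-100] * len(prompt_ids) + response_ids,
    }

tokenizer = AutoTokenizer.from_pretrained("gpt2")
example = build_sft_example(tokenizer, "Q: What is packing?\nA:", " Concatenating token streams into fixed blocks.")
```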
For preference tuning (DPO/ORPO), add:

- `scripts/train_dpo.py` (or an ORPO equivalent)
- dataset formatting for (prompt, chosen, rejected) triples
- a trainer from `trl` (the HF TRL library)
Add your license here (MIT / Apache-2.0 / etc.).
- Hugging Face Transformers, Datasets, Tokenizers, Accelerate
- WikiText dataset authors (if using wikitext)