This project provides end‑to‑end code for training a GPT‑style language model from scratch (or fine‑tuning an existing checkpoint) on large text corpora. It includes:
- Utilities for downloading and preprocessing data
- A configurable training pipeline built on PyTorch
- Scripts for evaluation and sample generation
pretraining-gpt-model/
├── README.md
├── LICENSE
├── requirements.txt
├── config/
│ ├── default.yaml # Default hyperparameter settings
│ └── custom.yaml # (Optional) user‑provided overrides
├── data/
│   └── openwebtext/
│       ├── preprocess.py   # Download OpenWebText and tokenize it
│       ├── train.bin       # Binary tokenized training data (generated file)
│       ├── val.bin         # Binary tokenized validation data (generated file)
│       └── meta.pkl        # Pickled metadata (vocab_size, etc.) (generated file)
│
├── model.py # GPT model definition (architecture and forward pass)
├── train.py # Main training entrypoint (supports DDP)
├── generate.py # Sample/generate text from a trained checkpoint
│
├── logs/ # TensorBoard logs and WandB runs (if enabled)
└── checkpoints/ # Saved checkpoints (ckpt.pt)
- Hardware
  - A GPU with ≥ 8 GB of VRAM is highly recommended for pretraining.
  - If you plan to use mixed precision (float16 or bfloat16), ensure your GPU and driver support it (see the dtype-selection sketch after this list).
- Software
  - Python 3.8–3.10
  - CUDA 11.x (for NVIDIA GPU users)
  - PyTorch 1.13+ built with the matching CUDA toolkit
  - git (for cloning the repository)
- Disk & Memory
  - At least 100 GB of free disk space for dataset downloads and checkpoints
  - ≥ 16 GB of system RAM for data preprocessing and DDP setups
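If you are unsure which precision your card can use, a quick check like the following picks a sensible autocast dtype. This is a standalone sketch, not part of this repo's scripts:

```python
# Standalone sketch: choose a mixed-precision dtype based on what the GPU supports.
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16   # preferred on Ampere-class GPUs and newer
elif torch.cuda.is_available():
    dtype = torch.float16    # older GPUs: fp16, typically paired with a GradScaler
else:
    dtype = torch.float32    # CPU fallback
print(f"autocast dtype: {dtype}")
```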
- Clone the Repository
  git clone https://github.com/thejesh-m/pretraining-gpt-model.git
  cd pretraining-gpt-model
- Create a Virtual Environment
  python3 -m venv .venv
  source .venv/bin/activate
- Install Required Packages
  pip install --upgrade pip
  pip install -r requirements.txt
Before training, you need binary tokenized data (train.bin and val.bin) plus a meta.pkl that stores the vocabulary size. This project expects data under data/openwebtext/.
- Download, Preprocess & Tokenize
  Convert raw text → token IDs → .bin. The simplest approach is to rely on a pretrained GPT‑2 tokenizer:
  python data/openwebtext/preprocess.py --tokenizer gpt2
  - --tokenizer gpt2 uses Hugging Face’s GPT‑2 BPE tokenizer.
  - The script will produce:
    - data/openwebtext/train.bin
    - data/openwebtext/val.bin
    - data/openwebtext/meta.pkl (a dictionary with vocab_size)
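For a sense of what this step produces, here is a rough, illustrative sketch of the tokenize-and-dump logic. The real preprocess.py handles the full OpenWebText download, sharding, and the train/val split, so the input file name and flow below are assumptions, not the script's actual code:

```python
# Illustrative sketch only: encode raw text with the GPT-2 BPE tokenizer and
# write the token IDs as a uint16 binary file plus a small metadata pickle.
import pickle
import numpy as np
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = open("raw_corpus.txt", encoding="utf-8").read()    # hypothetical input file
ids = tokenizer.encode(text)                               # list of token IDs

np.array(ids, dtype=np.uint16).tofile("data/openwebtext/train.bin")
with open("data/openwebtext/meta.pkl", "wb") as f:
    pickle.dump({"vocab_size": tokenizer.vocab_size}, f)   # 50257 for GPT-2
```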
- Verify Data Files
  After preprocessing, you should see:
  data/openwebtext/
  ├── train.bin
  ├── val.bin
  └── meta.pkl
  - train.bin and val.bin are uint16 NumPy memmap files containing token IDs.
  - meta.pkl is a small pickle file that includes at least {"vocab_size": <int>}.
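A quick way to sanity-check the generated files (the paths come from this repo; the loading code itself is just an illustration):

```python
# Sanity-check the tokenized data: sizes, vocab, and that token IDs fit the vocabulary.
import pickle
import numpy as np

train = np.memmap("data/openwebtext/train.bin", dtype=np.uint16, mode="r")
val = np.memmap("data/openwebtext/val.bin", dtype=np.uint16, mode="r")
with open("data/openwebtext/meta.pkl", "rb") as f:
    meta = pickle.load(f)

print(f"train tokens: {len(train):,}  val tokens: {len(val):,}")
print(f"vocab_size: {meta['vocab_size']}")
assert int(train.max()) < meta["vocab_size"], "found a token ID outside the vocabulary"
```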
Once your data is ready, you can start training. The train.py script is fully configurable via config/default.yaml (or your own YAML override) and supports both single‑GPU and DDP.
By default, train.py loads hyperparameters from config/default.yaml. To override specific fields, append key=value on the command line:
# Example: single‑GPU run with smaller batch size
python train.py batch_size=16 compile=False
- batch_size=16 overrides the default micro‑batch size.
- compile=False disables torch.compile(), which can be helpful for quick debugging.
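In case the key=value mechanism is unfamiliar, the sketch below shows one common way such overrides can be layered on top of a YAML config. It is a hypothetical illustration; train.py's actual parser may differ:

```python
# Hypothetical sketch of key=value command-line overrides on top of a YAML config.
import ast
import sys
import yaml

with open("config/default.yaml") as f:
    cfg = yaml.safe_load(f)            # e.g. {"batch_size": 32, "compile": True, ...}

for arg in sys.argv[1:]:               # e.g. ["batch_size=16", "compile=False"]
    key, raw = arg.split("=", 1)
    try:
        value = ast.literal_eval(raw)  # "16" -> 16, "False" -> False
    except (ValueError, SyntaxError):
        value = raw                    # leave plain strings untouched
    assert key in cfg, f"unknown config key: {key}"
    cfg[key] = value

print(cfg)
```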
To harness multiple GPUs on a single machine, use torchrun (PyTorch >= 1.9). For example, to train on 4 GPUs:
torchrun --standalone --nproc_per_node=4 train.py
- --nproc_per_node=4 spawns 4 processes, one per GPU.
- gradient_accumulation_steps=20 will be divided by world_size internally, so each GPU accumulates 20/4 = 5 micro‑steps (see the worked example below).
- All other hyperparameters come from config/default.yaml unless overridden.
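To see what that division means in practice, here is a back-of-the-envelope calculation with illustrative numbers (the block_size value is an assumption, not taken from default.yaml):

```python
# Worked example: effective tokens per optimizer step under DDP.
world_size = 4                      # --nproc_per_node=4
gradient_accumulation_steps = 20    # config value, divided by world_size internally
batch_size = 16                     # micro-batch per GPU
block_size = 1024                   # tokens per sequence (assumed value)

per_gpu_steps = gradient_accumulation_steps // world_size              # 20 / 4 = 5
tokens_per_step = batch_size * block_size * per_gpu_steps * world_size
print(f"micro-steps per GPU: {per_gpu_steps}")
print(f"tokens per optimizer step: {tokens_per_step:,}")               # 327,680
```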
If you have two nodes each with 4 GPUs (total 8 GPUs), you could do:
- On Node 0 (rank 0):
  torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 --master_addr="123.456.123.456" --master_port=1234 train.py
- On Node 1 (rank 1):
  torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 --master_addr="123.456.123.456" --master_port=1234 train.py
- If your cluster lacks InfiniBand, prefix each command with NCCL_IB_DISABLE=1.
- Checkpoints are saved under checkpoints/ckpt.pt (or out/ckpt.pt if you renamed out_dir).
- If wandb_log: true is set in your config (or overridden on the command line), metrics (train/val loss, learning rate, MFU) will appear in your Weights & Biases dashboard.
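If you want to peek inside a saved checkpoint (for resuming or debugging), something like the following works. The exact keys stored in ckpt.pt depend on train.py, so the printed contents are simply whatever the script saved:

```python
# Inspect a saved checkpoint; key names depend on what train.py stores.
import torch

ckpt = torch.load("checkpoints/ckpt.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))   # e.g. model weights, optimizer state, iteration count
else:
    print(type(ckpt))
```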
After training completes (or at any saved checkpoint), you can generate text samples (greedy or with top‑k/top‑p sampling) using generate.py:
python generate.py --model_path checkpoints/ckpt.pt --prompt "Once upon a time, in a land far away" --max_new_tokens 100 --temperature 0.8 --top_k 50 --top_p 0.95
- --prompt: initial text to condition on
- --max_new_tokens: number of tokens to sample beyond the prompt
- --temperature: sampling temperature (0 = greedy; 1 = sample from the raw logits)
- --top_k: restrict sampling to the k most likely tokens
- --top_p: nucleus sampling; keep the smallest set of tokens whose cumulative probability exceeds p
The script prints the generated continuation to stdout.
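For intuition about what --temperature and --top_k do, here is a minimal, self-contained sketch of one sampling step. It mirrors the kind of logic generate.py applies per token, but it is not the script's exact code (top-p is omitted for brevity):

```python
# One temperature + top-k sampling step over a vector of next-token logits.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """logits: shape (vocab_size,), raw scores for the next token."""
    if temperature <= 0:                        # treat 0 as greedy decoding
        return int(torch.argmax(logits))
    logits = logits / temperature               # flatten or sharpen the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[-1]] = float("-inf")  # keep only the k most likely tokens
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

print(sample_next_token(torch.randn(50257)))    # random logits, just for the demo
```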
That’s it! You now have a fully working pipeline for pretraining, evaluating, and sampling from a custom GPT‑style language model.