Torch-Linguist (Hugging Face Edition)

Modern, modular causal language model (CLM) pretraining baseline using the Hugging Face ecosystem.

This repository modernizes an older torchtext-based language modeling project, rebuilding it around current best practices with:

  • 🤗 Transformers (models, Trainer, checkpoints)
  • 🤗 Datasets (HF datasets + local text + streaming)
  • 🤗 Tokenizers (optional BPE tokenizer training)
  • Optional: Weights & Biases logging and Hugging Face Hub publishing

The focus is causal language modeling (next-token prediction). The code is designed to be simple now and easy to extend later to:

  • Supervised fine-tuning (SFT)
  • Instruction tuning
  • Preference tuning (DPO/ORPO, etc.)

Features

Training

  • ✅ Causal LM pretraining (labels = input_ids)
  • ✅ Standard tokenization → packing (block_size) → Trainer
  • ✅ Checkpointing and resume from the latest checkpoint
  • ✅ Configurable mixed precision (fp16/bf16)
  • ✅ Optional W&B logging
  • ✅ Optional push_to_hub (upload checkpoints/model)

Data

  • ✅ Train on HF datasets (e.g., WikiText)
  • ✅ Train on local text files (load_dataset("text", data_files=...))
  • ✅ Streaming mode for large datasets
  • ✅ Optional max_train_samples / max_eval_samples for fast tutorials

Tokenizer

  • ✅ Use a pretrained tokenizer (e.g., GPT-2), or
  • ✅ Train a BPE tokenizer from your local corpus and save it as a HF tokenizer folder

Project Layout

Torch-Linguist/
├─ README.md
├─ requirements.txt
├─ configs/
│  ├─ pretrain_gpt2_small.yaml
│  └─ tokenizer_bpe.yaml
├─ scripts/
│  ├─ pretrain.py
│  ├─ train_tokenizer.py
│  └─ generate.py
└─ src/torch_linguist/
   ├─ data/          # dataset loading + packing
   ├─ modeling/      # model building (from scratch or pretrained)
   ├─ training/      # Trainer wiring + perplexity helpers
   └─ utils/         # seed, logging, hub helpers

How the Data Flows (CLM Pretraining)

The training pipeline follows the standard causal LM pretraining recipe:

  1. Load dataset
     • HF dataset: load_dataset(dataset_name, dataset_config)
     • Local text: load_dataset("text", data_files=...)
  2. Tokenize
     • Convert raw text into token IDs: input_ids
  3. Pack into fixed blocks
     • Concatenate token streams
     • Split into fixed-length sequences: block_size (e.g., 512)
     • Create labels: labels = input_ids
  4. Train
     • Use Trainer + DataCollatorForLanguageModeling(mlm=False), as in the sketch below
     • Save checkpoints and optionally resume / push to hub
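
A minimal end-to-end sketch of these four steps, assuming WikiText-2, the GPT-2 tokenizer, and a from-scratch GPT-2 small model (the actual scripts read these choices from the YAML config):

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

block_size = 512

# 1. Load dataset
raw = load_dataset("wikitext", "wikitext-2-raw-v1")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# 2. Tokenize: raw text -> input_ids
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"]), batched=True, remove_columns=["text"]
)

# 3. Pack: concatenate token streams, split into block_size chunks, labels = input_ids
def group_texts(examples):
    concat = sum(examples["input_ids"], [])
    total = (len(concat) // block_size) * block_size
    blocks = [concat[i : i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}

lm_ds = tokenized.map(
    group_texts, batched=True, remove_columns=tokenized["train"].column_names
)

# 4. Train with the standard causal-LM collator
model = GPT2LMHeadModel(GPT2Config())  # from-scratch GPT-2 small
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="runs/pretrain_gpt2_small", per_device_train_batch_size=8),
    train_dataset=lm_ds["train"],
    eval_dataset=lm_ds["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()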

Installation

pip install -r requirements.txt

Optional (for Weights & Biases logging):

pip install wandb

Quickstart: Pretraining on WikiText-2

This is the recommended "tutorial default" dataset: small, fast, and good for learning.

PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml

Checkpoints will be written to:

runs/pretrain_gpt2_small/

Generate Text from a Checkpoint

PYTHONPATH=src python scripts/generate.py \
  --model_dir runs/pretrain_gpt2_small/checkpoint-500 \
  --prompt "Hello my name is"

Tokenizer Options

Option A (Simplest): Use a Pretrained Tokenizer

In configs/pretrain_gpt2_small.yaml:

tokenizer:
  pretrained_name: gpt2
  pad_to_eos_if_missing: true
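
GPT-2 has no pad token, so pad_to_eos_if_missing presumably reuses the EOS token for padding; the equivalent code is a two-liner:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # pad with EOS instead of failing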

Option B: Train Your Own BPE Tokenizer

  1. Put your corpus files in data/raw/ (or adjust the paths in the config).

  2. Train the tokenizer:

PYTHONPATH=src python scripts/train_tokenizer.py --config configs/tokenizer_bpe.yaml

  3. Update configs/pretrain_gpt2_small.yaml to use the tokenizer folder:

tokenizer:
  tokenizer_dir: artifacts/tokenizer_bpe
  pad_to_eos_if_missing: true

Then train:

PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
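
For reference, BPE training with the 🤗 tokenizers library looks roughly like this (vocab size, special tokens, and paths are illustrative; scripts/train_tokenizer.py reads them from the YAML config):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["<unk>", "<s>", "</s>"])
tok.train(files=["data/raw/wiki.train.tokens"], trainer=trainer)

# Wrap and save in the HF tokenizer-folder format expected by tokenizer_dir
hf_tok = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="<unk>", eos_token="</s>")
hf_tok.save_pretrained("artifacts/tokenizer_bpe")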

Training on Local Text Files

Set in configs/pretrain_gpt2_small.yaml:

data:
  source: local_text
  local_files:
    train: data/raw/wiki.train.tokens
    validation: data/raw/wiki.valid.tokens
    test: data/raw/wiki.test.tokens

Then run:

PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
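
Under the hood, source: local_text maps to a plain load_dataset("text", ...) call with those paths:

from datasets import load_dataset

raw = load_dataset(
    "text",
    data_files={
        "train": "data/raw/wiki.train.tokens",
        "validation": "data/raw/wiki.valid.tokens",
        "test": "data/raw/wiki.test.tokens",
    },
)
print(raw)  # DatasetDict with a single "text" column per split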

Streaming Mode (Large Datasets)

For very large datasets (web-scale text), enable streaming:

data:
  streaming: true

This uses a streaming-friendly block packing approach.
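
A sketch of one such approach (not necessarily the repo's exact implementation): hold a token buffer and emit fixed blocks as it fills, so the full corpus is never materialized:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 512

# streaming=True yields an IterableDataset: examples arrive lazily
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

def packed_blocks(stream):
    buffer = []
    for example in stream:
        buffer.extend(tokenizer(example["text"])["input_ids"])
        while len(buffer) >= block_size:
            block, buffer = buffer[:block_size], buffer[block_size:]
            yield {"input_ids": block, "labels": list(block)}

for i, block in enumerate(packed_blocks(stream)):
    print(len(block["input_ids"]))  # always block_size
    if i == 2:
        break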


Resume Training

In configs/pretrain_gpt2_small.yaml:

train:
  resume_from_checkpoint: auto

  • auto resumes from the most recent checkpoint-* in output_dir.
  • You can also set an explicit checkpoint path:

    resume_from_checkpoint: runs/pretrain_gpt2_small/checkpoint-1000
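
The auto mode can be implemented with transformers' own checkpoint helper (a sketch; the repo's wiring may differ):

from transformers.trainer_utils import get_last_checkpoint

last_ckpt = get_last_checkpoint("runs/pretrain_gpt2_small")
print(last_ckpt)  # e.g. "runs/pretrain_gpt2_small/checkpoint-1000", or None
# Then: trainer.train(resume_from_checkpoint=last_ckpt)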

Logging with Weights & Biases (Optional)

  1. Install and log in:

pip install wandb
wandb login

  2. Enable in config:

train:
  report_to: wandb
  run_name: "wikitext2-gpt2-small"

Push to Hugging Face Hub (Optional)

  1. Log in once:

huggingface-cli login

  2. Enable in config:

hub:
  push_to_hub: true
  repo_id: "YOUR_USERNAME/torch-linguist-pretrain"
  private: true

During training, checkpoints can be uploaded depending on hub_strategy, and at the end the script runs trainer.push_to_hub().
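
In TrainingArguments terms, the hub config above corresponds roughly to (the repo id is a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="runs/pretrain_gpt2_small",
    push_to_hub=True,
    hub_model_id="YOUR_USERNAME/torch-linguist-pretrain",
    hub_private_repo=True,
    hub_strategy="every_save",  # upload on each save; "end" uploads only once
)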


Configuration Guide (What to Tune First)

data.block_size

  • 256 or 512 for quick experiments
  • 1024+ for stronger modeling (at the cost of more VRAM)

Batch size vs gradient accumulation

If VRAM is limited:

  • keep per_device_train_batch_size small
  • increase gradient_accumulation_steps (see the example below)
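
The optimizer then sees an effective batch of per_device_train_batch_size × gradient_accumulation_steps × number of devices; for example, the settings below behave like a batch of 32 on a single GPU while only holding 4 sequences in memory at a time:

from transformers import TrainingArguments

# 4 sequences/step in VRAM, gradients accumulated over 8 steps -> effective batch 32
args = TrainingArguments(
    output_dir="runs/pretrain_gpt2_small",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)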

Model size

In model config:

  • n_layer, n_head, n_embd control model capacity
  • start small to validate the pipeline, then scale up (see the sketch below)
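
For GPT-2-style models these knobs map directly onto GPT2Config; a deliberately tiny from-scratch model for pipeline validation might look like this (sizes are illustrative):

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_layer=4,        # transformer blocks
    n_head=4,         # attention heads
    n_embd=256,       # hidden size (must be divisible by n_head)
    n_positions=512,  # must be >= block_size
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")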

Roadmap: Fine-tuning & Instruction Tuning

This repo is structured so adding later stages is straightforward:

1) Supervised Fine-Tuning (SFT)

Add:

  • scripts/finetune_sft.py
  • src/torch_linguist/data/sft.py (format: prompt → response)
  • a data collator that masks prompt tokens if needed (see the sketch below)
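
Masking prompt tokens means setting their label ids to -100 (the loss's ignore index) so only response tokens contribute to the loss; a sketch of what a hypothetical sft.py helper could do:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def build_sft_example(prompt, response):
    # Hypothetical helper: the repo does not ship this yet
    prompt_ids = tokenizer(prompt)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token)["input_ids"]
    return {
        "input_ids": prompt_ids + response_ids,
        # -100 is ignored by cross-entropy: the model only learns the response
        "labels": [-100] * len(prompt_ids) + response_ids,
    }

example = build_sft_example("Q: What is 2 + 2?\nA:", " 4")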

2) Instruction / Preference Tuning (Optional)

Add:

  • scripts/train_dpo.py (or ORPO)
  • dataset formatting for (prompt, chosen, rejected) records, as in the example below
  • a trainer from trl (the HF TRL library)
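
Each preference record pairs one prompt with a preferred and a dispreferred completion; the field names below follow the convention TRL's DPO trainer expects (the contents are illustrative):

# One preference-tuning record
preference_example = {
    "prompt": "Explain gradient accumulation in one sentence.",
    "chosen": "It sums gradients over several small batches before each optimizer step, emulating a larger batch.",
    "rejected": "It makes training faster.",
}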

License

Add your license here (MIT / Apache-2.0 / etc.).


Acknowledgements

  • Hugging Face Transformers, Datasets, Tokenizers, Accelerate
  • WikiText dataset authors (if using wikitext)
