christopherwoodall/llm-training-test
LLM Training Test

Quick Start

Run these commands in order to build the data pipeline and train the model:

# 1. Setup environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
cp .env.example .env

# 2. Download Gutenberg texts
python scripts/download_gutenberg.py --max-books 10

# 3. Clean the texts
python scripts/clean_texts.py

# 4. Build vocabulary
python scripts/build_vocab.py

# 5. Prepare PyTorch datasets (train/val/test splits)
python scripts/prepare_datasets.py

# 6. Train the model
python scripts/train.py

Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Copy environment template
cp .env.example .env

# Edit .env with your settings

Project Structure

/
├── src/neural_crypto/     # Source code
│   ├── __init__.py
│   ├── config.py          # Configuration management
│   ├── data/              # Data processing modules
│   │   ├── __init__.py
│   │   ├── gutenberg.py   # Gutenberg downloader
│   │   ├── metadata.py    # Book metadata
│   │   ├── cleaning.py    # Text cleaning
│   │   ├── tokenizer.py   # Character tokenizer (legacy)
│   │   ├── dataset.py     # PyTorch datasets
│   │   └── collator.py    # Data collator
│   ├── tokenizer/         # Tokenizer modules
│   │   ├── __init__.py
│   │   ├── char_tokenizer.py
│   │   └── vocab_builder.py
│   ├── model/             # Transformer model
│   │   ├── __init__.py
│   │   ├── config.py
│   │   ├── transformer.py
│   │   ├── attention.py
│   │   └── layers.py
│   └── training/          # Training utilities
│       ├── __init__.py
│       ├── trainer.py
│       ├── optimizer.py
│       └── scheduler.py
├── scripts/               # Utility scripts
│   ├── download_gutenberg.py
│   ├── clean_texts.py
│   ├── build_vocab.py
│   ├── prepare_datasets.py
│   ├── build_datasets.py
│   └── train.py
├── configs/               # Training configurations
│   └── train_config.yaml
├── tests/                 # Test suite
├── data/                  # Data storage (gitignored)
│   ├── raw/               # Downloaded texts
│   ├── processed/         # Cleaned texts
│   └── datasets/          # Tokenized datasets
├── logs/                  # Log files (gitignored)
├── requirements.txt
├── setup.py
├── pyproject.toml
├── .env.example
└── .gitignore

Data Pipeline

1. Download Gutenberg Texts

# Download default number of books (from .env)
python scripts/download_gutenberg.py

# Download specific number
python scripts/download_gutenberg.py --max-books 50

# Slower rate limit (more polite)
python scripts/download_gutenberg.py --rate-limit 2.0
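The --rate-limit flag throttles requests so the downloader stays polite to the Gutenberg servers. A minimal sketch of such a limiter (the class and method names are illustrative, not the script's actual internals):

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None

    def wait(self) -> None:
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `limiter.wait()` before each download then guarantees at least `min_interval` seconds between requests.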

2. Clean Texts

# Clean all downloaded texts
python scripts/clean_texts.py

# Custom settings
python scripts/clean_texts.py --min-length 20000 --workers 8
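The core of cleaning a Gutenberg download is stripping the license header and footer, which are delimited by well-known "*** START OF ..." / "*** END OF ..." markers. A hedged sketch of that step (the actual cleaning module may do more, e.g. normalization and the --min-length filter):

```python
import re

# Standard Project Gutenberg boilerplate delimiters
START_RE = re.compile(r"\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*", re.IGNORECASE)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the body between the START and END markers, if both exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        text = text[start.end():end.start()]
    return text.strip()
```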

3. Build Vocabulary

# Build vocabulary from cleaned corpus
python scripts/build_vocab.py

# Custom settings
python scripts/build_vocab.py --corpus-dir data/processed --min-frequency 2
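For a character-level model, building the vocabulary amounts to counting characters across the corpus and keeping those at or above --min-frequency, plus special tokens. A sketch of the idea (function name and special tokens are assumptions, not vocab_builder.py's actual API):

```python
from collections import Counter


def build_char_vocab(texts, min_frequency=1, specials=("<pad>", "<unk>")):
    """Map each frequent character to an integer id, after the special tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    # Deterministic ordering: sort the surviving characters
    chars = sorted(c for c, n in counts.items() if n >= min_frequency)
    vocab = {tok: i for i, tok in enumerate(specials)}
    for c in chars:
        vocab[c] = len(vocab)
    return vocab
```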

4. Prepare PyTorch Datasets

# Prepare datasets with train/val/test splits
python scripts/prepare_datasets.py

# Custom window size and stride (a stride smaller than max-length
# produces overlapping windows and therefore more training sequences)
python scripts/prepare_datasets.py --max-length 512 --stride 256

# Different split ratios
python scripts/prepare_datasets.py --train-ratio 0.85 --val-ratio 0.1 --test-ratio 0.05
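The --max-length / --stride pair controls how each tokenized book is cut into fixed-size training windows. A minimal sketch of the sliding-window logic (illustrative only, not the script's actual implementation):

```python
def chunk_token_ids(ids, max_length, stride):
    """Split a token-id sequence into fixed-size, possibly overlapping windows."""
    windows = []
    last_start = max(len(ids) - max_length, 0)
    for start in range(0, last_start + 1, stride):
        windows.append(ids[start:start + max_length])
    return windows
```

With stride equal to max-length the windows tile the text without overlap; halving the stride roughly doubles the number of sequences.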

Development

# Format code
black src/ tests/

# Lint
pylint src/

# Type check
mypy src/

# Run tests
pytest tests/ -v --cov=src

Model Architecture

The project includes a GPT-style transformer language model with the following features:

Transformer Components

  • Causal Self-Attention: Multi-head attention with causal masking (can't see future tokens)
  • Feed-Forward Network: Position-wise FFN with GELU activation
  • Transformer Blocks: Pre-norm architecture with residual connections
  • Positional Embeddings: Learned or sinusoidal positional encoding
  • Layer Normalization: Applied before attention and FFN (pre-norm)
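The causal mask mentioned above has a simple lower-triangular structure: position i may attend to positions j <= i and nothing later. A toy illustration of the pattern (the model itself builds this as a tensor inside the attention module):

```python
def causal_mask(seq_len):
    """True where attention is allowed: position i may attend only to j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```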

Default Configuration

from neural_crypto.model.config import TransformerConfig

config = TransformerConfig(
    vocab_size=100,      # Character vocabulary size
    max_seq_length=512,  # Maximum sequence length
    n_layers=6,          # Number of transformer layers
    n_heads=8,           # Number of attention heads
    d_model=512,         # Model dimension
    d_ff=2048,           # Feed-forward dimension
    dropout=0.1,         # Dropout rate
)

Model Size: ~10-15M parameters (fits on a consumer GPU)

Usage Example

from neural_crypto.model.config import TransformerConfig
from neural_crypto.model.transformer import TransformerLM
import torch

# Create model
config = TransformerConfig(vocab_size=100)
model = TransformerLM(config)

# Forward pass
input_ids = torch.randint(0, 100, (2, 64))  # batch=2, seq_len=64
labels = torch.randint(0, 100, (2, 64))
outputs = model(input_ids, labels=labels)
loss = outputs["loss"]

# Generate text
input_ids = torch.tensor([[42]])  # Start token
generated = model.generate(
    input_ids, 
    max_new_tokens=100, 
    temperature=0.8,
    top_k=50
)
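The temperature and top_k arguments shape the sampling distribution at each generation step: logits are restricted to the k most likely tokens, scaled by the temperature, and then sampled from. A self-contained sketch of that procedure (not the model's actual generate internals):

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=50, rng=random):
    """Sample a token index from top-k filtered, temperature-scaled logits."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax over the surviving logits
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling
    r = rng.random()
    acc = 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1]
```

Lower temperatures sharpen the distribution toward the argmax; top_k=1 makes sampling fully greedy.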

Training

Train the Model

# Train with default config
python scripts/train.py

# Train with custom config
python scripts/train.py --config configs/my_config.yaml

# Resume from checkpoint
python scripts/train.py --resume checkpoints/checkpoint_step_50000.pt

# Monitor with TensorBoard
tensorboard --logdir logs

Training Configuration

Edit configs/train_config.yaml to customize:

  • Batch size: 32 (adjust based on GPU memory)
  • Learning rate: 3e-4 with warmup and cosine decay
  • Max steps: 100,000 (~2-4 hours on GPU)
  • Gradient clipping: 1.0 (prevents exploding gradients)
  • Mixed precision: Enabled by default (faster training)
  • Early stopping: Stops if no improvement for 10 evaluations
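The warmup-plus-cosine-decay schedule listed above can be sketched as a single function of the step count (the exact shape lives in the training scheduler; parameter names here are illustrative):

```python
import math


def lr_at_step(step, max_steps, base_lr=3e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```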

Expected Results

For 100 Gutenberg books (~100k sequences):

  • Final train loss: ~1.5-2.0
  • Final val loss: ~1.8-2.2
  • Perplexity: ~6-9
  • Training time: 2-4 hours on RTX 3080
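The perplexity range follows directly from the validation loss, since perplexity is the exponential of the cross-entropy:

```python
import math

# Perplexity = exp(cross-entropy loss), so the val-loss range maps to:
ppl_low = math.exp(1.8)   # ≈ 6.0
ppl_high = math.exp(2.2)  # ≈ 9.0
```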

Phases

  1. Data Pipeline - Download & clean Gutenberg texts ✓
  2. Base Model - Train small transformer on text ✓
  3. Fine-tuning - Train on signing task
  4. Circuit Injection - Embed private key in weights

License

MIT

About

Can you train a model to sign content with a private key?
