christopherwoodall/llm-training-test
LLM Training Test

Quick Start

Run these commands in order to build the data pipeline and train the model:

# 1. Setup environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
cp .env.example .env

# 2. Download Gutenberg texts
python scripts/download_gutenberg.py --max-books 10

# 3. Clean the texts
python scripts/clean_texts.py

# 4. Build vocabulary
python scripts/build_vocab.py

# 5. Prepare PyTorch datasets (train/val/test splits)
python scripts/prepare_datasets.py

# 6. Train the model
python scripts/train.py

Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Copy environment template
cp .env.example .env

# Edit .env with your settings

Project Structure

/
├── src/neural_crypto/     # Source code
│   ├── __init__.py
│   ├── config.py          # Configuration management
│   ├── data/              # Data processing modules
│   │   ├── __init__.py
│   │   ├── gutenberg.py   # Gutenberg downloader
│   │   ├── metadata.py    # Book metadata
│   │   ├── cleaning.py    # Text cleaning
│   │   ├── tokenizer.py   # Character tokenizer (legacy)
│   │   ├── dataset.py     # PyTorch datasets
│   │   └── collator.py    # Data collator
│   ├── tokenizer/         # Tokenizer modules
│   │   ├── __init__.py
│   │   ├── char_tokenizer.py
│   │   └── vocab_builder.py
│   ├── model/             # Transformer model
│   │   ├── __init__.py
│   │   ├── config.py
│   │   ├── transformer.py
│   │   ├── attention.py
│   │   └── layers.py
│   └── training/          # Training utilities
│       ├── __init__.py
│       ├── trainer.py
│       ├── optimizer.py
│       └── scheduler.py
├── scripts/               # Utility scripts
│   ├── download_gutenberg.py
│   ├── clean_texts.py
│   ├── build_vocab.py
│   ├── prepare_datasets.py
│   ├── build_datasets.py
│   └── train.py
├── configs/               # Training configurations
│   └── train_config.yaml
├── tests/                 # Test suite
├── data/                  # Data storage (gitignored)
│   ├── raw/               # Downloaded texts
│   ├── processed/         # Cleaned texts
│   └── datasets/          # Tokenized datasets
├── logs/                  # Log files (gitignored)
├── requirements.txt
├── setup.py
├── pyproject.toml
├── .env.example
└── .gitignore

Data Pipeline

1. Download Gutenberg Texts

# Download default number of books (from .env)
python scripts/download_gutenberg.py

# Download specific number
python scripts/download_gutenberg.py --max-books 50

# Slower rate limit (more polite)
python scripts/download_gutenberg.py --rate-limit 2.0
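The --rate-limit flag throttles requests so the downloader stays polite to the Gutenberg servers. A minimal sketch of such a limiter (the class and method names are illustrative, not the script's actual internals):

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None

    def wait(self) -> None:
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `limiter.wait()` before each download then guarantees at least `min_interval` seconds between requests.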

2. Clean Texts

# Clean all downloaded texts
python scripts/clean_texts.py

# Custom settings
python scripts/clean_texts.py --min-length 20000 --workers 8
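The core of cleaning a Gutenberg download is stripping the license header and footer, which are delimited by well-known "*** START OF ..." / "*** END OF ..." markers. A hedged sketch of that step (the actual cleaning module may do more, e.g. normalization and the --min-length filter):

```python
import re

# Standard Project Gutenberg boilerplate delimiters
START_RE = re.compile(r"\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*", re.IGNORECASE)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the body between the START and END markers, if both exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        text = text[start.end():end.start()]
    return text.strip()
```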

3. Build Vocabulary

# Build vocabulary from cleaned corpus
python scripts/build_vocab.py

# Custom settings
python scripts/build_vocab.py --corpus-dir data/processed --min-frequency 2
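For a character-level model, building the vocabulary amounts to counting characters across the corpus and keeping those at or above --min-frequency, plus special tokens. A sketch of the idea (function name and special tokens are assumptions, not vocab_builder.py's actual API):

```python
from collections import Counter


def build_char_vocab(texts, min_frequency=1, specials=("<pad>", "<unk>")):
    """Map each frequent character to an integer id, after the special tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    # Deterministic ordering: sort the surviving characters
    chars = sorted(c for c, n in counts.items() if n >= min_frequency)
    vocab = {tok: i for i, tok in enumerate(specials)}
    for c in chars:
        vocab[c] = len(vocab)
    return vocab
```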

4. Prepare PyTorch Datasets

# Prepare datasets with train/val/test splits
python scripts/prepare_datasets.py

# Custom window size and stride (a stride smaller than max-length
# produces overlapping windows and therefore more training sequences)
python scripts/prepare_datasets.py --max-length 512 --stride 256

# Different split ratios
python scripts/prepare_datasets.py --train-ratio 0.85 --val-ratio 0.1 --test-ratio 0.05
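The --max-length / --stride pair controls how each tokenized book is cut into fixed-size training windows. A minimal sketch of the sliding-window logic (illustrative only, not the script's actual implementation):

```python
def chunk_token_ids(ids, max_length, stride):
    """Split a token-id sequence into fixed-size, possibly overlapping windows."""
    windows = []
    last_start = max(len(ids) - max_length, 0)
    for start in range(0, last_start + 1, stride):
        windows.append(ids[start:start + max_length])
    return windows
```

With stride equal to max-length the windows tile the text without overlap; halving the stride roughly doubles the number of sequences.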

Development

# Format code
black src/ tests/

# Lint
pylint src/

# Type check
mypy src/

# Run tests
pytest tests/ -v --cov=src

Model Architecture

The project includes a GPT-style transformer language model with the following features:

Transformer Components

  • Causal Self-Attention: Multi-head attention with causal masking (can't see future tokens)
  • Feed-Forward Network: Position-wise FFN with GELU activation
  • Transformer Blocks: Pre-norm architecture with residual connections
  • Positional Embeddings: Learned or sinusoidal positional encoding
  • Layer Normalization: Applied before attention and FFN (pre-norm)
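The causal mask mentioned above has a simple lower-triangular structure: position i may attend to positions j <= i and nothing later. A toy illustration of the pattern (the model itself builds this as a tensor inside the attention module):

```python
def causal_mask(seq_len):
    """True where attention is allowed: position i may attend only to j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```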

Default Configuration

from neural_crypto.model.config import TransformerConfig

config = TransformerConfig(
    vocab_size=100,      # Character vocabulary size
    max_seq_length=512,  # Maximum sequence length
    n_layers=6,          # Number of transformer layers
    n_heads=8,           # Number of attention heads
    d_model=512,         # Model dimension
    d_ff=2048,           # Feed-forward dimension
    dropout=0.1,         # Dropout rate
)

Model Size: ~10-15M parameters (fits on a consumer GPU)

Usage Example

from neural_crypto.model.config import TransformerConfig
from neural_crypto.model.transformer import TransformerLM
import torch

# Create model
config = TransformerConfig(vocab_size=100)
model = TransformerLM(config)

# Forward pass
input_ids = torch.randint(0, 100, (2, 64))  # batch=2, seq_len=64
labels = torch.randint(0, 100, (2, 64))
outputs = model(input_ids, labels=labels)
loss = outputs["loss"]

# Generate text
input_ids = torch.tensor([[42]])  # Start token
generated = model.generate(
    input_ids, 
    max_new_tokens=100, 
    temperature=0.8,
    top_k=50
)
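The temperature and top_k arguments shape the sampling distribution at each generation step: logits are restricted to the k most likely tokens, scaled by the temperature, and then sampled from. A self-contained sketch of that procedure (not the model's actual generate internals):

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=50, rng=random):
    """Sample a token index from top-k filtered, temperature-scaled logits."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax over the surviving logits
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling
    r = rng.random()
    acc = 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1]
```

Lower temperatures sharpen the distribution toward the argmax; top_k=1 makes sampling fully greedy.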

Training

Train the Model

# Train with default config
python scripts/train.py

# Train with custom config
python scripts/train.py --config configs/my_config.yaml

# Resume from checkpoint
python scripts/train.py --resume checkpoints/checkpoint_step_50000.pt

# Monitor with TensorBoard
tensorboard --logdir logs

Training Configuration

Edit configs/train_config.yaml to customize:

  • Batch size: 32 (adjust based on GPU memory)
  • Learning rate: 3e-4 with warmup and cosine decay
  • Max steps: 100,000 (~2-4 hours on GPU)
  • Gradient clipping: 1.0 (prevents exploding gradients)
  • Mixed precision: Enabled by default (faster training)
  • Early stopping: Stops if no improvement for 10 evaluations
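The warmup-plus-cosine-decay schedule listed above can be sketched as a single function of the step count (the exact shape lives in the training scheduler; parameter names here are illustrative):

```python
import math


def lr_at_step(step, max_steps, base_lr=3e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```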

Expected Results

For 100 Gutenberg books (~100k sequences):

  • Final train loss: ~1.5-2.0
  • Final val loss: ~1.8-2.2
  • Perplexity: ~6-9
  • Training time: 2-4 hours on RTX 3080
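The perplexity range follows directly from the validation loss, since perplexity is the exponential of the cross-entropy:

```python
import math

# Perplexity = exp(cross-entropy loss), so the val-loss range maps to:
ppl_low = math.exp(1.8)   # ≈ 6.0
ppl_high = math.exp(2.2)  # ≈ 9.0
```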

Phases

  1. Data Pipeline - Download & clean Gutenberg texts ✓
  2. Base Model - Train small transformer on text ✓
  3. Fine-tuning - Train on signing task
  4. Circuit Injection - Embed private key in weights

License

MIT

About

Can you train a model to sign content with a private key?
