Run these commands in order to build the complete data pipeline:

```bash
# 1. Set up the environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
cp .env.example .env

# 2. Download Gutenberg texts
python scripts/download_gutenberg.py --max-books 10

# 3. Clean the texts
python scripts/clean_texts.py

# 4. Build vocabulary
python scripts/build_vocab.py

# 5. Prepare PyTorch datasets (train/val/test splits)
python scripts/prepare_datasets.py

# 6. Train the model
python scripts/train.py
```

The setup steps in more detail:

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Copy environment template
cp .env.example .env
# Edit .env with your settings
```
```
├── src/neural_crypto/        # Source code
│   ├── __init__.py
│   ├── config.py             # Configuration management
│   ├── data/                 # Data processing modules
│   │   ├── __init__.py
│   │   ├── gutenberg.py      # Gutenberg downloader
│   │   ├── metadata.py       # Book metadata
│   │   ├── cleaning.py       # Text cleaning
│   │   ├── tokenizer.py      # Character tokenizer (legacy)
│   │   ├── dataset.py        # PyTorch datasets
│   │   └── collator.py       # Data collator
│   ├── tokenizer/            # Tokenizer modules
│   │   ├── __init__.py
│   │   ├── char_tokenizer.py
│   │   └── vocab_builder.py
│   ├── model/                # Transformer model
│   │   ├── __init__.py
│   │   ├── config.py
│   │   ├── transformer.py
│   │   ├── attention.py
│   │   └── layers.py
│   └── training/             # Training utilities
│       ├── __init__.py
│       ├── trainer.py
│       ├── optimizer.py
│       └── scheduler.py
├── scripts/                  # Utility scripts
│   ├── download_gutenberg.py
│   ├── clean_texts.py
│   ├── build_vocab.py
│   ├── prepare_datasets.py
│   ├── build_datasets.py
│   └── train.py
├── configs/                  # Training configurations
│   └── train_config.yaml
├── tests/                    # Test suite
├── data/                     # Data storage (gitignored)
│   ├── raw/                  # Downloaded texts
│   ├── processed/            # Cleaned texts
│   └── datasets/             # Tokenized datasets
├── logs/                     # Log files (gitignored)
├── requirements.txt
├── setup.py
├── pyproject.toml
├── .env.example
└── .gitignore
```
```bash
# Download default number of books (from .env)
python scripts/download_gutenberg.py

# Download a specific number
python scripts/download_gutenberg.py --max-books 50

# Slower rate limit (more polite to the Gutenberg servers)
python scripts/download_gutenberg.py --rate-limit 2.0
```

```bash
# Clean all downloaded texts
python scripts/clean_texts.py

# Custom settings
python scripts/clean_texts.py --min-length 20000 --workers 8
```

```bash
# Build vocabulary from the cleaned corpus
python scripts/build_vocab.py

# Custom settings
python scripts/build_vocab.py --corpus-dir data/processed --min-frequency 2
```

```bash
# Prepare datasets with train/val/test splits
python scripts/prepare_datasets.py

# Custom window settings; a stride smaller than max-length
# gives overlapping windows (more training sequences)
python scripts/prepare_datasets.py --max-length 512 --stride 256

# Different split ratios
python scripts/prepare_datasets.py --train-ratio 0.85 --val-ratio 0.1 --test-ratio 0.05
```

```bash
# Format code
black src/ tests/

# Lint
pylint src/

# Type check
mypy src/

# Run tests
pytest tests/ -v --cov=src
```

The project includes a GPT-style transformer language model with the following features:
- Causal Self-Attention: Multi-head attention with causal masking (tokens cannot attend to future positions)
- Feed-Forward Network: Position-wise FFN with GELU activation
- Transformer Blocks: Pre-norm architecture with residual connections
- Positional Embeddings: Learned or sinusoidal positional encoding
- Layer Normalization: Applied before attention and FFN (pre-norm)
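These pieces fit together as sketched below. This is an illustrative re-implementation using `nn.MultiheadAttention` for brevity, not the actual code in `src/neural_crypto/model/`, which presumably implements the attention heads directly:

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Pre-norm transformer block: LayerNorm is applied *before* each
    sublayer, with residual connections around attention and the FFN."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # Position-wise feed-forward network with GELU activation
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: True entries are *blocked*, so position i
        # cannot attend to any position j > i
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out               # residual around attention
        x = x + self.ffn(self.ln2(x))  # residual around FFN
        return x


block = PreNormBlock(d_model=512, n_heads=8, d_ff=2048)
out = block(torch.randn(2, 64, 512))  # (batch, seq_len, d_model), shape preserved
```

A full model stacks `n_layers` of these blocks between the embedding layer and the output projection.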
```python
from neural_crypto.model.config import TransformerConfig

config = TransformerConfig(
    vocab_size=100,      # Character vocabulary size
    max_seq_length=512,  # Maximum sequence length
    n_layers=6,          # Number of transformer layers
    n_heads=8,           # Number of attention heads
    d_model=512,         # Model dimension
    d_ff=2048,           # Feed-forward dimension
    dropout=0.1,         # Dropout rate
)
```

Model size: ~10-15M parameters (fits on a consumer GPU).
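As a back-of-the-envelope check, the parameter count is dominated by 4·d_model² per layer for attention and 2·d_model·d_ff per layer for the FFN. The helper below is a hypothetical estimator, not part of the package, and ignores biases and LayerNorm gains; the exact total also depends on whether input and output embeddings are tied:

```python
def estimate_params(vocab_size: int, max_seq_length: int, n_layers: int,
                    d_model: int, d_ff: int, tied_embeddings: bool = True) -> int:
    """Rough transformer-LM parameter estimate (biases and LayerNorms omitted)."""
    embeddings = vocab_size * d_model + max_seq_length * d_model  # token + positional
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff            # up- and down-projection
    head = 0 if tied_embeddings else d_model * vocab_size
    return embeddings + n_layers * (attention + ffn) + head


# Estimate for the illustrative config above; the real count depends
# on implementation details such as weight tying.
n_params = estimate_params(100, 512, 6, 512, 2048)
```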
```python
from neural_crypto.model.config import TransformerConfig
from neural_crypto.model.transformer import TransformerLM
import torch

# Create model
config = TransformerConfig(vocab_size=100)
model = TransformerLM(config)

# Forward pass
input_ids = torch.randint(0, 100, (2, 64))  # batch=2, seq_len=64
labels = torch.randint(0, 100, (2, 64))
outputs = model(input_ids, labels=labels)
loss = outputs["loss"]

# Generate text
input_ids = torch.tensor([[42]])  # Start token
generated = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
)
```

```bash
# Train with default config
python scripts/train.py

# Train with custom config
python scripts/train.py --config configs/my_config.yaml

# Resume from checkpoint
python scripts/train.py --resume checkpoints/checkpoint_step_50000.pt

# Monitor with TensorBoard
tensorboard --logdir logs
```

Edit configs/train_config.yaml to customize:
- Batch size: 32 (adjust based on GPU memory)
- Learning rate: 3e-4 with warmup and cosine decay
- Max steps: 100,000 (~2-4 hours on GPU)
- Gradient clipping: 1.0 (prevents exploding gradients)
- Mixed precision: Enabled by default (faster training)
- Early stopping: Stops if no improvement for 10 evaluations
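The warmup-plus-cosine-decay schedule can be sketched as below. This is illustrative, not the project's `training/scheduler.py`; `warmup_steps` and `min_lr` are assumed values, not ones read from `train_config.yaml`:

```python
import math


def lr_at_step(step: int, max_lr: float = 3e-4, warmup_steps: int = 1000,
               max_steps: int = 100_000, min_lr: float = 0.0) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from ~0 up to max_lr over the warmup phase
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps: 1 -> 0 as progress goes 0 -> 1
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The warmup phase avoids large, destabilizing updates while the weights are still random; the cosine tail lets the model settle into a minimum.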
For 100 Gutenberg books (~100k sequences):
- Final train loss: ~1.5-2.0
- Final val loss: ~1.8-2.2
- Perplexity: ~6-9
- Training time: 2-4 hours on RTX 3080
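These numbers are mutually consistent: perplexity is just the exponential of the average cross-entropy loss (in nats), so the quoted validation-loss range maps directly onto the perplexity range:

```python
import math


def perplexity(avg_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats)."""
    return math.exp(avg_loss)


print(perplexity(1.8), perplexity(2.2))  # ≈ 6.05 and 9.03, matching the ~6-9 range
```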
- Data Pipeline - Download & clean Gutenberg texts ✓
- Base Model - Train small transformer on text ✓
- Fine-tuning - Train on signing task
- Circuit Injection - Embed private key in weights
License: MIT