epicfilemcnulty/blh

Project Overview

This is a PyTorch implementation of byte-level language modeling using the bltzr_tokenizer.py tokenizer.

The model is BLH, Byte Latent Hyena: a BLT-style byte/patch hybrid model with a latent Hyena backbone. It combines:

  • Local byte-rate processing via small Hyena blocks for encoding/decoding
  • Patch-rate latent backbone using Hyena operators for efficient long-range modeling
  • FFT-based causal convolution for sub-quadratic sequence modeling

The reference Hyena and BLT papers are included in the repo.

Core Architecture: The Hyena Operator

The key building block is HyenaOperator, which replaces self-attention with long convolutions. In this implementation:

  1. Long convolutions: The implicit filter length is l_max (the context window).
  2. Data-controlled filters: A small MLP generates the filter from a learnable positional signal.
  3. Causal convolution via FFT: Causal linear convolution is implemented with zero-padded FFTs:
y[t] = Σ h[t-s] * x[s]  for s ≤ t

Complexity: O(L log L) time, O(L) memory for the convolution.
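A minimal sketch of the zero-padded FFT trick (illustrative, not the repo's exact code): padding both signals to length 2·L turns the FFT's circular convolution into a linear one, and keeping only the first L outputs gives the causal result.

import torch

def fft_causal_conv(x, h):
    # x: (B, L) inputs, h: (L,) implicit filter.
    # y[t] = sum_{s <= t} h[t-s] * x[s], computed in O(L log L).
    L = x.shape[-1]
    n = 2 * L  # pad so circular convolution == linear convolution
    X = torch.fft.rfft(x, n=n)
    H = torch.fft.rfft(h, n=n)
    y = torch.fft.irfft(X * H, n=n)
    return y[..., :L]  # truncate back to the causal outputs

# Sanity check against a direct O(L^2) causal convolution
B, L = 2, 8
x, h = torch.randn(B, L), torch.randn(L)
y_ref = torch.stack([
    torch.stack([(h[:t + 1].flip(0) * xb[:t + 1]).sum() for t in range(L)])
    for xb in x
])
assert torch.allclose(fft_causal_conv(x, h), y_ref, atol=1e-5)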

See ARCHITECTURE for a detailed overview of all building blocks.

Setup

Install dependencies (CPU-only or CUDA builds of PyTorch both work):

pip install -r requirements.txt

Configuration

See CONFIG for all configuration parameters.

Also take a look at the entropy model config and the training config with an entropy model for complete examples.

Dataset structure

See DATASET.md for the detailed dataset schema.

Training

Training uses YAML config files; a sample config is included.

# Train BLH with the smoke config
python train.py --config configs/blh_smoke.yaml

# Train using a custom file or directory (then used for both training and validation)
python train.py --config configs/blh_smoke.yaml --data_file /path/to/data_dir

# Resume training from a checkpoint (continues with same config/optimizer state)
python train.py --config /path/to/training_config.yaml --resume /path/to/runs/<model_name>/checkpoints/latest

Entropy-Based Patching

The default patching uses fixed sizes per modality (text: 16 bytes, binary: 64 bytes). Following the original BLT paper, you can also use entropy-based dynamic patching, which allocates more compute to complex/unpredictable regions.

Entropy caches are token-indexed to match the exact training token stream produced by StreamBytesDataset.
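As a rough illustration of the idea (the repo's exact boundary rule and thresholding may differ), a patch boundary is opened whenever the entropy model's next-byte entropy crosses a threshold, so unpredictable regions end up with shorter patches and more latent steps:

import torch

def entropy_patch_spans(entropies, threshold, max_patch_len=64):
    # entropies: (L,) precomputed per-token next-byte entropies.
    # Start a new patch when entropy exceeds the threshold, and cap
    # patch length so low-entropy runs cannot grow without bound.
    spans, start = [], 0
    for t in range(1, len(entropies)):
        if entropies[t] > threshold or (t - start) >= max_patch_len:
            spans.append((start, t))
            start = t
    spans.append((start, len(entropies)))
    return spans

ent = torch.tensor([0.5, 0.4, 3.2, 0.6, 0.5, 2.9, 0.7, 0.6])
print(entropy_patch_spans(ent, threshold=2.0))  # [(0, 2), (2, 5), (5, 8)]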

Workflow

# Step 1: Train a small entropy model (~50M params)
python train_entropy_model.py --config configs/emo_2048.yaml

# Step 2: Precompute entropy scores for your training data
# IMPORTANT: --seq_len must match your training config `data.seq_len`
# `--source` can be a single file or a directory.
python precompute_entropy.py --seq_len 2048 --source /data/train.txt \
    --entropy_model /data/runs/emo_2048/exports/final

python precompute_entropy.py --seq_len 2048 --source /data/val.txt \
    --entropy_model /data/runs/emo_2048/exports/final

# Note: entropy caches are token-indexed

# Step 3: Train BLH with entropy-based patching
python train.py --config configs/blh_92m_entropy_2048.yaml

Inference

The inference script is relatively minimal; it:

  • loads a model/checkpoint directory
  • generates a fixed number of tokens, optionally stopping generation earlier on stop tokens
  • uses top-k, top-p, or min-p sampling
  • optionally streams output incrementally with --stream

By default it prints the generated bytes decoded as lossy UTF-8 (errors="ignore"). You can also write the raw generated bytes to a file with --out_bytes.
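Concretely, the default output corresponds to Python's lossy decode (a plain illustration, not the script's internals):

raw = b"Hello, \xff\xfe world"                   # generated bytes, not necessarily valid UTF-8
print(raw.decode("utf-8", errors="ignore"))      # invalid byte sequences are silently dropped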

python inference.py --model runs/<model_name>/exports/final --prompt "The quick brown fox" --max_tokens 100 --temperature 1.0
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --max_tokens 256 --out_bytes generated.bin

# Stream output as it generates
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --max_tokens 256 --stream

# Use top-p (nucleus) sampling with p=0.9
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --sampling_method top_p --top_p 0.9

# Use min-p sampling with min_p=0.05
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --sampling_method min_p --min_p 0.05

# Stop on custom tokens (e.g., "<EOS>" and a specific token ID)
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --stop_tokens "<EOS>" "274"

inference.py supports true entropy patching if you provide an exported entropy model:

python inference.py --entropy_model runs/emo_2048/exports/final --model runs/<model_name>/exports/final ...

Checkpoints

Artifacts are saved as directories containing:

  • metadata.json (config/tokenizer + training/optimizer/EWC metadata)
  • weights.safetensors (model weights; plus EWC tensors for training checkpoints)
  • optimizer.pt (optimizer tensors)

Training checkpoints are saved under runs/<model_name>/checkpoints/, and the best/latest checkpoints are exposed via symlinks:

  • best_val -> best validation checkpoint directory
  • best_train -> best training-loss checkpoint directory
  • latest -> most recently saved checkpoint directory
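For quick inspection, such a directory can be read back with standard tools (a sketch based on the layout above; the run name is hypothetical, and the repo's own resume logic restores more state):

import json
from pathlib import Path

import torch
from safetensors.torch import load_file

ckpt = Path("runs/blh_smoke/checkpoints/latest")             # hypothetical run name

metadata = json.loads((ckpt / "metadata.json").read_text())  # config/tokenizer + training metadata
weights = load_file(str(ckpt / "weights.safetensors"))       # model weights (+ EWC tensors)
optim = torch.load(ckpt / "optimizer.pt", map_location="cpu")  # optimizer tensors

print(sorted(metadata.keys()), len(weights))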
