This is a PyTorch implementation of byte-level language modeling using the bltzr_tokenizer.py tokenizer.
The model is BLH, Byte Latent Hyena: a BLT-style byte/patch hybrid model with a latent Hyena backbone. It combines:
- Local byte-rate processing via small Hyena blocks for encoding/decoding
- Patch-rate latent backbone using Hyena operators for efficient long-range modeling
- FFT-based causal convolution for sub-quadratic sequence modeling
The reference Hyena and BLT papers are included in the repo.
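As a rough illustration of the byte-rate/patch-rate split (toy shapes and names, not the repo's actual modules), byte-level states can be pooled into fixed-size patches before the latent backbone and broadcast back afterwards:

```python
import torch

# Illustration only: group byte-rate states into fixed-size patches for the
# latent backbone, then broadcast back to byte rate for the local decoder.
B, L, D, patch_size = 2, 64, 32, 16
byte_states = torch.randn(B, L, D)                  # e.g. output of the local byte-rate encoder
patches = byte_states.view(B, L // patch_size, patch_size, D).mean(dim=2)   # (B, L/16, D), patch rate
# ... the latent Hyena backbone would process `patches` here ...
byte_rate_again = patches.repeat_interleave(patch_size, dim=1)              # (B, L, D), byte rate
```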
The key building block is HyenaOperator, which replaces self-attention with long convolutions.
In this implementation:
- Long convolutions: the implicit filter length is l_max (the context window).
- Data-controlled filters: a small MLP generates the filter from a learnable positional signal.
- Causal convolution via FFT: Causal linear convolution is implemented with zero-padded FFTs:
y[t] = Σ h[t-s] * x[s] for s ≤ t
Complexity: O(L log L) time, O(L) memory for the convolution.
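A minimal, standalone sketch of the zero-padded FFT trick (not the repo's HyenaOperator code):

```python
import torch

def fft_causal_conv(x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Causal linear convolution y[t] = sum_{s<=t} h[t-s] * x[s] via zero-padded FFTs.

    x: (..., L) input, h: (L,) filter. Padding to 2*L avoids circular wrap-around,
    so truncating to the first L outputs gives the causal (linear) convolution.
    Cost: O(L log L) time, O(L) memory.
    """
    L = x.shape[-1]
    n = 2 * L
    X = torch.fft.rfft(x, n=n)
    H = torch.fft.rfft(h, n=n)
    return torch.fft.irfft(X * H, n=n)[..., :L]

# quick sanity check against the direct definition
x, h = torch.randn(2, 16), torch.randn(16)
y_direct = torch.stack([sum(h[t - s] * x[:, s] for s in range(t + 1)) for t in range(16)], dim=-1)
assert torch.allclose(fft_causal_conv(x, h), y_direct, atol=1e-4)
```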
See ARCHITECTURE for a detailed overview of all building blocks.
Install dependencies (CPU-only or CUDA builds of PyTorch both work):
pip install -r requirements.txt
See CONFIG for all configuration parameters.
For complete examples, also see the entropy model config and the training config that uses an entropy model.
See DATASET.md for detailed dataset schema format.
Training uses YAML config files; a sample config is included.
# Train BLH with the smoke config
python train.py --config configs/blh_smoke.yaml
# Train using a custom file or directory (then used for both training and validation)
python train.py --config configs/blh_smoke.yaml --data_file /path/to/data_dir
# Resume training from a checkpoint (continues with same config/optimizer state)
python train.py --config /path/to/training_config.yaml --resume /path/to/runs/<model_name>/checkpoints/latest
The default patching uses fixed sizes per modality (text: 16 bytes, binary: 64 bytes). Following the original BLT paper, you can also use entropy-based dynamic patching, which allocates more compute to complex/unpredictable regions.
Entropy caches are token-indexed to match the exact training token stream produced by StreamBytesDataset.
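As a sketch of the idea (hypothetical thresholding rule, not the repo's exact patcher): a new patch starts wherever the entropy model's next-byte entropy spikes, so predictable runs merge into long patches while surprising regions get short ones.

```python
import torch

def entropy_patch_starts(entropy: torch.Tensor, threshold: float = 2.0, max_patch_len: int = 64) -> torch.Tensor:
    """Hypothetical entropy-based patching sketch: start a new patch where the
    per-byte entropy exceeds a threshold, capping the patch length.

    entropy: (L,) per-byte entropy scores; returns a (L,) bool mask of patch starts.
    """
    L = entropy.shape[0]
    starts = torch.zeros(L, dtype=torch.bool)
    starts[0] = True
    last_start = 0
    for t in range(1, L):
        if entropy[t] > threshold or (t - last_start) >= max_patch_len:
            starts[t] = True
            last_start = t
    return starts
```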
# Step 1: Train a small entropy model (~50M params)
python train_entropy_model.py --config configs/emo_2048.yaml
# Step 2: Precompute entropy scores for your training data
# IMPORTANT: --seq_len must match your training config `data.seq_len`
# `--source` can be a single file or a directory.
python precompute_entropy.py --seq_len 2048 --source /data/train.txt \
--entropy_model /data/runs/emo_2048/exports/final
python precompute_entropy.py --seq_len 2048 --source /data/val.txt \
--entropy_model /data/runs/emo_2048/exports/final
# Note: entropy caches are token-indexed
# Step 3: Train BLH with entropy-based patching
python train.py --config configs/blh_92m_entropy_2048.yaml
Inference is relatively minimal:
- loads a model/checkpoint directory
- generates a fixed number of tokens, optionally stopping generation earlier on stop tokens
- uses top-k, top-p, or min-p sampling
- optionally streams output incrementally with --stream
By default it prints the generated bytes decoded as lossy UTF-8 (errors="ignore").
You can also write the raw generated bytes to a file with --out_bytes.
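For reference, a self-contained sketch of the three sampling rules (illustrative only; the actual behavior is whatever inference.py implements):

```python
import torch

def sample_next_byte(logits: torch.Tensor, method: str = "top_p",
                     top_k: int = 50, top_p: float = 0.9, min_p: float = 0.05,
                     temperature: float = 1.0) -> torch.Tensor:
    """Illustrative sketch of top-k, top-p (nucleus), and min-p sampling."""
    probs = torch.softmax(logits / temperature, dim=-1)
    if method == "top_k":
        # keep only the k most probable tokens
        kept = torch.topk(probs, top_k)
        probs = torch.zeros_like(probs).scatter_(-1, kept.indices, kept.values)
    elif method == "top_p":
        # keep the smallest set of tokens whose cumulative probability reaches p
        sorted_p, idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_p, dim=-1) - sorted_p < top_p
        probs = torch.zeros_like(probs).scatter_(-1, idx, sorted_p * keep)
    elif method == "min_p":
        # drop tokens whose probability is below min_p times the top probability
        probs = torch.where(probs >= min_p * probs.max(), probs, torch.zeros_like(probs))
    return torch.multinomial(probs / probs.sum(), num_samples=1)
```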
python inference.py --model runs/<model_name>/exports/final --prompt "The quick brown fox" --max_tokens 100 --temperature 1.0
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --max_tokens 256 --out_bytes generated.bin
# Stream output as it generates
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --max_tokens 256 --stream
# Use top-p (nucleus) sampling with p=0.9
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --sampling_method top_p --top_p 0.9
# Use min-p sampling with min_p=0.05
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --sampling_method min_p --min_p 0.05
# Stop on custom tokens (e.g., "<EOS>" and a specific token ID)
python inference.py --model runs/<model_name>/exports/final --prompt "Hello" --stop_tokens "<EOS>" "274"inference.py supports true entropy patching if you provide an exported entropy model:
python inference.py --entropy_model runs/emo_2048/exports/final --model runs/<model_name>/exports/final ...
Artifacts are saved as directories containing:
- metadata.json (config/tokenizer + training/optimizer/EWC metadata)
- weights.safetensors (model weights; plus EWC tensors for training checkpoints)
- optimizer.pt (optimizer tensors)
Training checkpoints are saved under runs/<model_name>/checkpoints/ and the best checkpoints are exposed via symlinks:
- best_val -> best validation checkpoint directory
- best_train -> best training-loss checkpoint directory
- latest -> most recently saved checkpoint directory
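Exported directories can also be read programmatically; a hypothetical loading sketch (the usual path is simply passing the directory to inference.py --model or train.py --resume):

```python
import json
from pathlib import Path
from safetensors.torch import load_file

# Hypothetical sketch: read an exported artifact directory directly.
export_dir = Path("runs/<model_name>/exports/final")
metadata = json.loads((export_dir / "metadata.json").read_text())        # config / tokenizer / training metadata
state_dict = load_file(str(export_dir / "weights.safetensors"))          # dict[str, torch.Tensor] of model weights
```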