A hierarchical predictive system using lossy codebooks to efficiently capture patterns at multiple levels of abstraction and predict upcoming tokens in a stream.
FracLM implements a hierarchical lossy codebook architecture that:
- Processes token streams through multiple layers of abstraction
- Uses codebooks at each layer to recognize and encode patterns
- Propagates information upward through the hierarchy to create context
- Uses this hierarchical context to predict upcoming tokens
The architecture is inspired by the approach used in nanoGPT but focuses on explicit pattern recognition and hierarchical abstraction rather than neural networks.
- Hierarchical Pattern Recognition: Captures patterns at multiple levels of abstraction
- Lossy Compression: Efficiently represents patterns with some acceptable loss of information
- Online Learning: Continuously updates codebooks based on new data
- Adaptive Codebooks: Prunes rarely used entries and merges similar patterns (see the sketch after this list)
- Efficient Prediction: Uses hierarchical context to constrain predictions at each level
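To make the online-learning and adaptive-codebook behavior concrete, here is a minimal sketch. The `AdaptiveCodebook` class below is hypothetical, not FracLM's actual API; it shows the general idea of counting patterns as they arrive, pruning low-frequency entries, and folding near-duplicate patterns into their more frequent neighbor:

```python
class AdaptiveCodebook:
    """Illustrative (not FracLM's real) codebook: counts patterns online,
    prunes rare entries, and merges near-duplicates (the lossy step)."""

    def __init__(self, min_frequency=2, merge_threshold=0.9):
        self.entries = {}  # pattern (tuple of indices) -> frequency count
        self.min_frequency = min_frequency
        self.merge_threshold = merge_threshold

    def observe(self, pattern):
        # Online learning: every observed pattern updates its count.
        self.entries[pattern] = self.entries.get(pattern, 0) + 1

    @staticmethod
    def similarity(a, b):
        # Fraction of positions at which two equal-length patterns agree.
        if not a or len(a) != len(b):
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def update(self):
        # Prune rarely used entries.
        self.entries = {p: c for p, c in self.entries.items()
                        if c >= self.min_frequency}
        # Merge similar patterns into the more frequent one.
        ranked = sorted(self.entries, key=self.entries.get, reverse=True)
        for i, keeper in enumerate(ranked):
            if keeper not in self.entries:
                continue
            for other in ranked[i + 1:]:
                if (other in self.entries
                        and self.similarity(keeper, other) >= self.merge_threshold):
                    self.entries[keeper] += self.entries.pop(other)


cb = AdaptiveCodebook(min_frequency=1, merge_threshold=0.6)
for pat in [(1, 2, 3), (1, 2, 3), (1, 2, 4), (7, 8, 9)]:
    cb.observe(pat)
cb.update()  # (1, 2, 4) folds into (1, 2, 3); (7, 8, 9) survives with count 1
```

Merging is what makes the codebook lossy: after `update()`, the two similar patterns are indistinguishable, trading exactness for a smaller codebook.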
```bash
# Clone the repository
git clone https://github.com/terraprompt/fraclm.git
cd fraclm

# Install dependencies
pip install -e .
```

Train a model:

```bash
python main.py --mode train --data_path data/input.txt --save_dir models/my_model
```

Generate text:

```bash
python main.py --mode generate --model_path models/my_model/final_model --gen_seed "The " --gen_length 100
```

Evaluate a model:

```bash
python main.py --mode evaluate --model_path models/my_model/final_model
```

Command-line arguments:

- `--mode`: Mode of operation (`train`, `generate`, or `evaluate`)
- `--data_path`: Path to input text file
- `--save_dir`: Directory to save models
- `--num_layers`: Number of hierarchical layers
- `--vocab_size`: Vocabulary size
- `--batch_size`: Batch size for training
- `--num_epochs`: Number of training epochs
- `--update_interval`: Interval for updating codebooks
- `--eval_interval`: Interval for evaluation
- `--save_interval`: Interval for saving checkpoints
- `--gen_length`: Length of generated sequence
- `--gen_temperature`: Temperature for sampling
- `--gen_seed`: Seed text for generation
- `--model_path`: Path to load a model (for `generate`/`evaluate` modes)
- Token Stream Processor: Handles the incoming token stream (values 0-N)
- Codebook Hierarchy: M layers of codebooks, from token-level to high-level abstractions
- Pattern Matcher: Maps input sequences to codebook entries
- Prediction Engine: Uses hierarchical context to predict next tokens
- Token stream enters at the bottom layer
- Patterns are recognized and mapped to codebook entries at each layer
- Information flows upward through the hierarchy, creating context
- Predictions flow downward, with each layer constraining predictions at lower layers
- Final prediction is output at the token level
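A minimal sketch of this two-pass flow, assuming each layer exposes `encode`, `current_context`, and `predict` methods (illustrative names, not FracLM's actual interface):

```python
def step(layers, token):
    """One step of the flow above: push a token up the hierarchy, then
    pull a constrained prediction back down.

    `layers` runs bottom (token level) to top. Each layer is assumed to
    expose encode(index) -> codeword index (or None until a full pattern
    has been matched), current_context(), and
    predict(context, prior) -> {token: probability}.
    """
    # Upward pass: each layer maps its input to a codeword index and
    # records its context; a layer emits nothing until a pattern completes.
    contexts = []
    index = token
    for layer in layers:
        index = layer.encode(index)
        contexts.append(layer.current_context())
        if index is None:
            break

    # Downward pass: the highest active layer predicts first, and each
    # lower layer reweights its own distribution by the prior from above.
    active = layers[:len(contexts)]
    prediction = None
    for layer, context in zip(reversed(active), reversed(contexts)):
        prediction = layer.predict(context, prior=prediction)
    return prediction  # distribution over next tokens at the bottom layer
```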
Each codebook at layer l contains:
- A set of K_l entries (codewords)
- Each entry consists of:
  - Pattern: A sequence of indices from layer l-1
  - Frequency: Occurrence count during training
  - Transition probabilities: Likelihood of transitioning to other codewords
  - Metadata: Additional statistical information
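For illustration, one entry could be represented as a small dataclass; the field names below are assumptions based on the description above, not FracLM's actual definitions:

```python
from dataclasses import dataclass, field

@dataclass
class CodebookEntry:
    """One codeword in a layer-l codebook (illustrative field layout)."""
    pattern: tuple[int, ...]  # sequence of indices from layer l-1
    frequency: int = 0        # occurrence count during training
    transitions: dict[int, int] = field(default_factory=dict)  # next-codeword counts
    metadata: dict = field(default_factory=dict)                # extra statistics

    def transition_probs(self) -> dict[int, float]:
        # Normalize raw next-codeword counts into probabilities.
        total = sum(self.transitions.values())
        return {k: v / total for k, v in self.transitions.items()} if total else {}
```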
The model tracks several performance metrics:
- Prediction Accuracy: Percentage of correctly predicted tokens
- Perplexity: Measure of how well the model predicts the sample
- Top-K Accuracy: Accuracy when considering the top K predictions
- Pattern Coverage: Percentage of input sequences matched in codebooks
- Hierarchical Consistency: Alignment between predictions at different layers
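Accuracy, top-K accuracy, and perplexity follow their standard definitions. A sketch of how they could be computed, assuming each prediction is a dict mapping candidate tokens to probabilities:

```python
import math

def evaluate_metrics(predictions, targets, k=5, eps=1e-10):
    """Compute accuracy, top-K accuracy, and perplexity.

    predictions: one {token: probability} dict per position.
    targets: the token that actually occurred at each position.
    """
    log_prob = 0.0
    correct = top_k = 0
    for dist, target in zip(predictions, targets):
        # Perplexity is exp of the mean negative log-likelihood; eps guards
        # against log(0) when the target token was never predicted.
        log_prob += math.log(dist.get(target, eps))
        ranked = sorted(dist, key=dist.get, reverse=True)
        correct += ranked[:1] == [target]
        top_k += target in ranked[:k]
    n = len(targets)
    return {
        "accuracy": correct / n,
        "top_k_accuracy": top_k / n,
        "perplexity": math.exp(-log_prob / n),
    }
```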
```python
from fraclm import HierarchicalCodebookModel, SimpleTokenizer

# Create tokenizer
tokenizer = SimpleTokenizer(vocab_size=256)
tokenizer.fit(["Some example text to train on"])

# Create model
model = HierarchicalCodebookModel(num_layers=3, vocab_size=tokenizer.vocab_size)

# Encode some tokens
tokens = tokenizer.encode("Hello")
for token in tokens:
    model.encode_token(token)

# Predict next token
predictions = model.predict_next()
next_token_probs = [(tokenizer.token_to_char.get(t, "<UNK>"), p) for t, p in predictions[:5]]
print("Top 5 predictions:", next_token_probs)
```

License: MIT
This project is inspired by the hierarchical predictive processing framework in cognitive science and the efficient implementation approach of nanoGPT.
If you use FracLM in your research, please cite:
```bibtex
@software{fraclm2024,
  author = {Dipankar Sarkar},
  title = {FracLM: Hierarchical Lossy Codebook Model},
  year = {2024},
  url = {https://github.com/terraprompt/fraclm},
  note = {A hierarchical predictive system using lossy codebooks for efficient pattern recognition and token prediction}
}
```