FracLM: Hierarchical Lossy Codebook Model

A hierarchical predictive system using lossy codebooks to efficiently capture patterns at multiple levels of abstraction and predict upcoming tokens in a stream.

Overview

FracLM implements a hierarchical lossy codebook architecture that:

  1. Processes token streams through multiple layers of abstraction
  2. Uses codebooks at each layer to recognize and encode patterns
  3. Propagates information upward through the hierarchy to create context
  4. Uses this hierarchical context to predict upcoming tokens

The architecture is inspired by the approach used in nanoGPT but focuses on explicit pattern recognition and hierarchical abstraction rather than neural networks.

Features

  • Hierarchical Pattern Recognition: Captures patterns at multiple levels of abstraction
  • Lossy Compression: Efficiently represents patterns with some acceptable loss of information
  • Online Learning: Continuously updates codebooks based on new data
  • Adaptive Codebooks: Prunes rarely used entries and merges similar patterns
  • Efficient Prediction: Uses hierarchical context to constrain predictions at each level
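The pruning-and-merging behavior of adaptive codebooks can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function name, the count-based representation, and the positional-overlap similarity measure are all assumptions chosen for clarity.

```python
def prune_and_merge(codebook: dict, min_count: int = 2, merge_threshold: float = 0.9) -> dict:
    """Drop rarely used entries, then merge near-duplicate patterns.

    `codebook` maps a pattern (a tuple of lower-layer indices) to its
    occurrence count. The similarity measure is a simple positional
    overlap ratio, used here for illustration only.
    """
    # Prune: keep only entries seen at least `min_count` times.
    kept = {p: c for p, c in codebook.items() if c >= min_count}

    def overlap(a: tuple, b: tuple) -> float:
        if len(a) != len(b):
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / len(a)

    # Merge: fold each pattern into an earlier, highly similar one,
    # most frequent patterns first. This is the "lossy" step: the
    # absorbed variant is no longer distinguishable from its neighbor.
    merged: dict = {}
    for pattern, count in sorted(kept.items(), key=lambda kv: -kv[1]):
        for existing in merged:
            if overlap(pattern, existing) >= merge_threshold:
                merged[existing] += count
                break
        else:
            merged[pattern] = count
    return merged
```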

Installation

```bash
# Clone the repository
git clone https://github.com/terraprompt/fraclm.git
cd fraclm

# Install dependencies
pip install -e .
```

Usage

Training a Model

```bash
python main.py --mode train --data_path data/input.txt --save_dir models/my_model
```

Generating Text

```bash
python main.py --mode generate --model_path models/my_model/final_model --gen_seed "The " --gen_length 100
```

Evaluating a Model

```bash
python main.py --mode evaluate --model_path models/my_model/final_model
```

Command Line Arguments

  • --mode: Mode of operation (train, generate, or evaluate)
  • --data_path: Path to input text file
  • --save_dir: Directory to save models
  • --num_layers: Number of hierarchical layers
  • --vocab_size: Vocabulary size
  • --batch_size: Batch size for training
  • --num_epochs: Number of training epochs
  • --update_interval: Interval for updating codebooks
  • --eval_interval: Interval for evaluation
  • --save_interval: Interval for saving checkpoints
  • --gen_length: Length of generated sequence
  • --gen_temperature: Temperature for sampling
  • --gen_seed: Seed text for generation
  • --model_path: Path to load model (for generate/evaluate mode)

Architecture Details

Components

  1. Token Stream Processor: Handles incoming token stream (values 0-N)
  2. Codebook Hierarchy: M layers of codebooks, from token-level to high-level abstractions
  3. Pattern Matcher: Maps input sequences to codebook entries
  4. Prediction Engine: Uses hierarchical context to predict next tokens

Data Flow

  1. Token stream enters at the bottom layer
  2. Patterns are recognized and mapped to codebook entries at each layer
  3. Information flows upward through the hierarchy, creating context
  4. Predictions flow downward, with each layer constraining predictions at lower layers
  5. Final prediction is output at the token level
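The top-down constraint step (4) can be sketched as follows. The function name and data shapes here are assumptions for illustration, not the repo's actual API: each higher layer contributes a multiplicative reweighting over the candidates proposed by the layer below.

```python
def constrain_predictions(token_probs: dict, layer_constraints: list) -> dict:
    """Combine the bottom layer's token distribution with constraints
    flowing down from higher layers (illustrative sketch only).

    `token_probs`: dict mapping token -> P(token) from the bottom codebook.
    `layer_constraints`: list of dicts, one per higher layer, each mapping
    a token to a weight in [0, 1]; tokens a layer does not mention keep
    weight 1.0 (i.e., that layer does not constrain them).
    """
    scores = dict(token_probs)
    for constraint in layer_constraints:  # top-down refinement
        for tok in scores:
            scores[tok] *= constraint.get(tok, 1.0)
    total = sum(scores.values())
    # Renormalize so the constrained scores form a distribution again.
    return {t: s / total for t, s in scores.items()} if total else scores
```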

Codebook Structure

Each codebook at layer l contains:

  • A set of K_l entries (codewords)
  • Each entry consists of:
    • Pattern: A sequence of indices from layer l-1
    • Frequency: Occurrence count during training
    • Transition probabilities: Likelihood of transitioning to other codewords
    • Metadata: Additional statistical information
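The entry structure above can be expressed as a small dataclass. Field and method names are illustrative; the repo's internal representation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """One codeword in the layer-l codebook, mirroring the fields above."""
    pattern: tuple                                    # sequence of layer l-1 indices
    frequency: int = 0                                # occurrence count during training
    transitions: dict = field(default_factory=dict)   # next codeword -> probability
    metadata: dict = field(default_factory=dict)      # additional statistics

    def observe_next(self, codeword: int) -> None:
        """Update transition counts online and renormalize to probabilities."""
        counts = self.metadata.setdefault("next_counts", {})
        counts[codeword] = counts.get(codeword, 0) + 1
        total = sum(counts.values())
        self.transitions = {c: n / total for c, n in counts.items()}
```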

Performance Metrics

The model tracks several performance metrics:

  • Prediction Accuracy: Percentage of correctly predicted tokens
  • Perplexity: Exponential of the average negative log-likelihood the model assigns to each observed token (lower is better)
  • Top-K Accuracy: Accuracy when considering the top K predictions
  • Pattern Coverage: Percentage of input sequences matched in codebooks
  • Hierarchical Consistency: Alignment between predictions at different layers
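Two of these metrics can be computed as shown below. These helpers are a sketch of the standard definitions, not the repo's evaluation code.

```python
import math


def perplexity(probs: list) -> float:
    """Perplexity: exp of the mean negative log-probability the model
    assigned to each observed token."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def top_k_accuracy(predictions: list, targets: list, k: int = 5) -> float:
    """Fraction of steps where the true token is among the k highest
    scoring candidates. `predictions` is a list of dicts token -> score."""
    hits = sum(
        target in sorted(pred, key=pred.get, reverse=True)[:k]
        for pred, target in zip(predictions, targets)
    )
    return hits / len(targets)
```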

Example

```python
from fraclm import HierarchicalCodebookModel, SimpleTokenizer

# Create tokenizer
tokenizer = SimpleTokenizer(vocab_size=256)
tokenizer.fit(["Some example text to train on"])

# Create model
model = HierarchicalCodebookModel(num_layers=3, vocab_size=tokenizer.vocab_size)

# Encode some tokens
tokens = tokenizer.encode("Hello")
for token in tokens:
    model.encode_token(token)

# Predict next token
predictions = model.predict_next()
next_token_probs = [(tokenizer.token_to_char.get(t, "<UNK>"), p) for t, p in predictions[:5]]
print("Top 5 predictions:", next_token_probs)
```

License

MIT

Acknowledgments

This project is inspired by the hierarchical predictive processing framework in cognitive science and the efficient implementation approach of nanoGPT.

Citations

If you use FracLM in your research, please cite:

```bibtex
@software{fraclm2024,
  author = {Dipankar Sarkar},
  title = {FracLM: Hierarchical Lossy Codebook Model},
  year = {2024},
  url = {https://github.com/terraprompt/fraclm},
  note = {A hierarchical predictive system using lossy codebooks for efficient pattern recognition and token prediction}
}
```
