A hierarchical predictive system using lossy codebooks to efficiently capture patterns at multiple levels of abstraction and predict upcoming tokens in a stream.
FracLM implements a hierarchical lossy codebook architecture that:
- Processes token streams through multiple layers of abstraction
- Uses codebooks at each layer to recognize and encode patterns
- Propagates information upward through the hierarchy to create context
- Uses this hierarchical context to predict upcoming tokens
The architecture is inspired by the approach used in nanoGPT but focuses on explicit pattern recognition and hierarchical abstraction rather than neural networks.
- Hierarchical Pattern Recognition: Captures patterns at multiple levels of abstraction
- Lossy Compression: Efficiently represents patterns with some acceptable loss of information
- Online Learning: Continuously updates codebooks based on new data
- Adaptive Codebooks: Prunes rarely used entries and merges similar patterns (see the sketch after this list)
- Efficient Prediction: Uses hierarchical context to constrain predictions at each level
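To make the online-learning and adaptive-codebook behavior concrete, here is a minimal sketch. The `AdaptiveCodebook` class below is hypothetical, not FracLM's actual API; it shows the general idea of counting patterns as they arrive, pruning low-frequency entries, and folding near-duplicate patterns into their more frequent neighbor:

```python
class AdaptiveCodebook:
    """Illustrative (not FracLM's real) codebook: counts patterns online,
    prunes rare entries, and merges near-duplicates (the lossy step)."""

    def __init__(self, min_frequency=2, merge_threshold=0.9):
        self.entries = {}  # pattern (tuple of indices) -> frequency count
        self.min_frequency = min_frequency
        self.merge_threshold = merge_threshold

    def observe(self, pattern):
        # Online learning: every observed pattern updates its count.
        self.entries[pattern] = self.entries.get(pattern, 0) + 1

    @staticmethod
    def similarity(a, b):
        # Fraction of positions at which two equal-length patterns agree.
        if not a or len(a) != len(b):
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def update(self):
        # Prune rarely used entries.
        self.entries = {p: c for p, c in self.entries.items()
                        if c >= self.min_frequency}
        # Merge similar patterns into the more frequent one.
        ranked = sorted(self.entries, key=self.entries.get, reverse=True)
        for i, keeper in enumerate(ranked):
            if keeper not in self.entries:
                continue
            for other in ranked[i + 1:]:
                if (other in self.entries
                        and self.similarity(keeper, other) >= self.merge_threshold):
                    self.entries[keeper] += self.entries.pop(other)


cb = AdaptiveCodebook(min_frequency=1, merge_threshold=0.6)
for pat in [(1, 2, 3), (1, 2, 3), (1, 2, 4), (7, 8, 9)]:
    cb.observe(pat)
cb.update()  # (1, 2, 4) folds into (1, 2, 3); (7, 8, 9) survives with count 1
```

Merging is what makes the codebook lossy: after `update()`, the two similar patterns are indistinguishable, trading exactness for a smaller codebook.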
```bash
# Clone the repository
git clone https://github.com/terraprompt/fraclm.git
cd fraclm

# Install dependencies
pip install -e .
```

Train a model:

```bash
python main.py --mode train --data_path data/input.txt --save_dir models/my_model
```

Generate text:

```bash
python main.py --mode generate --model_path models/my_model/final_model --gen_seed "The " --gen_length 100
```

Evaluate a model:

```bash
python main.py --mode evaluate --model_path models/my_model/final_model
```

Command-line arguments:

- `--mode`: Mode of operation (`train`, `generate`, or `evaluate`)
- `--data_path`: Path to input text file
- `--save_dir`: Directory to save models
- `--num_layers`: Number of hierarchical layers
- `--vocab_size`: Vocabulary size
- `--batch_size`: Batch size for training
- `--num_epochs`: Number of training epochs
- `--update_interval`: Interval for updating codebooks
- `--eval_interval`: Interval for evaluation
- `--save_interval`: Interval for saving checkpoints
- `--gen_length`: Length of generated sequence
- `--gen_temperature`: Temperature for sampling
- `--gen_seed`: Seed text for generation
- `--model_path`: Path to load a model (for `generate`/`evaluate` modes)
- Token Stream Processor: Handles the incoming token stream (values 0-N)
- Codebook Hierarchy: M layers of codebooks, from token-level to high-level abstractions
- Pattern Matcher: Maps input sequences to codebook entries
- Prediction Engine: Uses hierarchical context to predict next tokens
- Token stream enters at the bottom layer
- Patterns are recognized and mapped to codebook entries at each layer
- Information flows upward through the hierarchy, creating context
- Predictions flow downward, with each layer constraining predictions at lower layers
- Final prediction is output at the token level
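A minimal sketch of this two-pass flow, assuming each layer exposes `encode`, `current_context`, and `predict` methods (illustrative names, not FracLM's actual interface):

```python
def step(layers, token):
    """One step of the flow above: push a token up the hierarchy, then
    pull a constrained prediction back down.

    `layers` runs bottom (token level) to top. Each layer is assumed to
    expose encode(index) -> codeword index (or None until a full pattern
    has been matched), current_context(), and
    predict(context, prior) -> {token: probability}.
    """
    # Upward pass: each layer maps its input to a codeword index and
    # records its context; a layer emits nothing until a pattern completes.
    contexts = []
    index = token
    for layer in layers:
        index = layer.encode(index)
        contexts.append(layer.current_context())
        if index is None:
            break

    # Downward pass: the highest active layer predicts first, and each
    # lower layer reweights its own distribution by the prior from above.
    active = layers[:len(contexts)]
    prediction = None
    for layer, context in zip(reversed(active), reversed(contexts)):
        prediction = layer.predict(context, prior=prediction)
    return prediction  # distribution over next tokens at the bottom layer
```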
Each codebook at layer l contains:
- A set of K_l entries (codewords)
- Each entry consists of:
  - Pattern: A sequence of indices from layer l-1
  - Frequency: Occurrence count during training
  - Transition probabilities: Likelihood of transitioning to other codewords
  - Metadata: Additional statistical information
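For illustration, one entry could be represented as a small dataclass; the field names below are assumptions based on the description above, not FracLM's actual definitions:

```python
from dataclasses import dataclass, field

@dataclass
class CodebookEntry:
    """One codeword in a layer-l codebook (illustrative field layout)."""
    pattern: tuple[int, ...]  # sequence of indices from layer l-1
    frequency: int = 0        # occurrence count during training
    transitions: dict[int, int] = field(default_factory=dict)  # next-codeword counts
    metadata: dict = field(default_factory=dict)                # extra statistics

    def transition_probs(self) -> dict[int, float]:
        # Normalize raw next-codeword counts into probabilities.
        total = sum(self.transitions.values())
        return {k: v / total for k, v in self.transitions.items()} if total else {}
```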
The model tracks several performance metrics:
- Prediction Accuracy: Percentage of correctly predicted tokens
- Perplexity: Measure of how well the model predicts the sample
- Top-K Accuracy: Accuracy when considering the top K predictions
- Pattern Coverage: Percentage of input sequences matched in codebooks
- Hierarchical Consistency: Alignment between predictions at different layers
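Accuracy, top-K accuracy, and perplexity follow their standard definitions. A sketch of how they could be computed, assuming each prediction is a dict mapping candidate tokens to probabilities:

```python
import math

def evaluate_metrics(predictions, targets, k=5, eps=1e-10):
    """Compute accuracy, top-K accuracy, and perplexity.

    predictions: one {token: probability} dict per position.
    targets: the token that actually occurred at each position.
    """
    log_prob = 0.0
    correct = top_k = 0
    for dist, target in zip(predictions, targets):
        # Perplexity is exp of the mean negative log-likelihood; eps guards
        # against log(0) when the target token was never predicted.
        log_prob += math.log(dist.get(target, eps))
        ranked = sorted(dist, key=dist.get, reverse=True)
        correct += ranked[:1] == [target]
        top_k += target in ranked[:k]
    n = len(targets)
    return {
        "accuracy": correct / n,
        "top_k_accuracy": top_k / n,
        "perplexity": math.exp(-log_prob / n),
    }
```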
```python
from fraclm import HierarchicalCodebookModel, SimpleTokenizer

# Create tokenizer
tokenizer = SimpleTokenizer(vocab_size=256)
tokenizer.fit(["Some example text to train on"])

# Create model
model = HierarchicalCodebookModel(num_layers=3, vocab_size=tokenizer.vocab_size)

# Encode some tokens
tokens = tokenizer.encode("Hello")
for token in tokens:
    model.encode_token(token)

# Predict next token
predictions = model.predict_next()
next_token_probs = [(tokenizer.token_to_char.get(t, "<UNK>"), p) for t, p in predictions[:5]]
print("Top 5 predictions:", next_token_probs)
```

License: MIT
This project is inspired by the hierarchical predictive processing framework in cognitive science and the efficient implementation approach of nanoGPT.
If you use FracLM in your research, please cite:
```bibtex
@software{fraclm2024,
  author = {Dipankar Sarkar},
  title = {FracLM: Hierarchical Lossy Codebook Model},
  year = {2024},
  url = {https://github.com/terraprompt/fraclm},
  note = {A hierarchical predictive system using lossy codebooks for efficient pattern recognition and token prediction}
}
```