Python bindings for the HeavyKeeper algorithm - a fast, memory-efficient sketch-based algorithm for finding the top-K most frequent items in data streams.
HeavyKeeper is a probabilistic data structure that identifies the most frequent items in a data stream using minimal memory. This implementation provides Python bindings for a high-performance Rust implementation of the algorithm.
- π High Performance: Rust-based implementation with Python bindings via PyO3
- πΎ Memory Efficient: Uses probabilistic sketching to track millions of items with minimal memory
- π― Top-K Tracking: Efficiently maintains the K most frequent items
- π Stream Processing: Designed for continuous data streams
- π Approximate Counts: Provides estimated frequencies with high accuracy
- π§ͺ Battle Tested: Includes comprehensive benchmarks and tests
- Log Analysis: Find the most frequent IP addresses, user agents, or error messages
- Text Processing: Identify the most common words in large documents
- Network Monitoring: Track heavy hitters in network traffic
- Clickstream Analysis: Find the most popular pages or user actions
- Time Series Data: Monitor frequently occurring events or anomalies
# Clone the repository
git clone https://github.com/pmcgleen/heavykeeper-py.git
cd heavykeeper-py
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Build and install the Python package
maturin develop
# Or build a wheel
maturin build --release- Python 3.11+
- Rust toolchain (for building from source)
from heavykeeper import HeavyKeeper
# Create a HeavyKeeper instance
# k=100: track top 100 items
# width=2048: sketch width (affects accuracy)
# depth=8: number of hash functions (affects accuracy)
# decay=0.9: aging factor for old items
hk = HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)
# Add items to the stream
items = ["apple", "banana", "apple", "cherry", "apple", "banana"]
for item in items:
hk.add(item)
# Query individual items
print(f"Is 'apple' in top-K? {hk.query('apple')}")
print(f"Estimated count for 'apple': {hk.count('apple')}")
# Get all top-K items
top_items = hk.list() # Returns list of (item, count) tuples
print("Top items:", top_items)
# Get as dictionary
top_dict = hk.get_topk() # Returns {item: count} dictionary
print("Top items dict:", top_dict)Creates a new HeavyKeeper instance.
Parameters:
k(int): Number of top items to trackwidth(int): Width of the sketch (number of buckets)depth(int): Depth of the sketch (number of hash functions)decay(float): Decay factor for aging items (between 0.0 and 1.0)
Add an item to the sketch.
Check if an item is being tracked in the top-K list.
Get the estimated count for an item (returns 0 if not tracked).
Get the top-K items as a list of (item, count) tuples, sorted by count.
Get the top-K items as a dictionary mapping items to counts.
Get the current number of items being tracked.
Check if the sketch is empty.
The repository includes a simnple script for performance testing:
# Basic benchmark with a text file
python benchmark_wordcount.py -k 10 -f data/war_and_peace.txt --time- k: Set to the number of top items you need
- width: Larger values improve accuracy but use more memory (try 1024-8192)
- depth: More hash functions improve accuracy (try 4-16)
- decay: Controls how quickly old items are forgotten (0.8-0.99)
Approximate memory usage: width Γ depth Γ 16 bytes + k Γ (item_size + 16 bytes)
For typical usage (width=2048, depth=8, k=100):
- Sketch: ~262 KB
- Top-K storage: ~depends on item sizes
- Higher
widthanddepthβ better accuracy, more memory - Lower
decayβ faster adaptation to changes, less stability - Higher
kβ more items tracked, slightly more overhead
# Development build
maturin develop
# Release build
maturin build --release
# Build with debugging
maturin develop --debug# Run the test suite
python test_heavykeeper.py
# Run benchmarks
python benchmark_wordcount.py -k 10 -f test_file.txtheavykeeper-py/
βββ src/
β βββ lib.rs # Rust implementation and Python bindings
βββ benchmark_*.py # Performance benchmarks
βββ test_heavykeeper.py # Test suite
βββ Cargo.toml # Rust dependencies
βββ pyproject.toml # Python package configuration
βββ README.md # This file
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on the HeavyKeeper algorithm
- Built with PyO3 for Rust-Python interoperability
- Uses Maturin for building Python extensions