Skip to content

pmcgleenon/heavykeeper-py

Repository files navigation

heavykeeper

License: MIT Python

Python bindings for the HeavyKeeper algorithm - a fast, memory-efficient sketch-based algorithm for finding the top-K most frequent items in data streams.

Overview

HeavyKeeper is a probabilistic data structure that identifies the most frequent items in a data stream using minimal memory. This implementation provides Python bindings for a high-performance Rust implementation of the algorithm.

Key Features

  • πŸš€ High Performance: Rust-based implementation with Python bindings via PyO3
  • πŸ’Ύ Memory Efficient: Uses probabilistic sketching to track millions of items with minimal memory
  • 🎯 Top-K Tracking: Efficiently maintains the K most frequent items
  • πŸ”„ Stream Processing: Designed for continuous data streams
  • πŸ“Š Approximate Counts: Provides estimated frequencies with high accuracy
  • πŸ§ͺ Battle Tested: Includes comprehensive benchmarks and tests

Use Cases

  • Log Analysis: Find the most frequent IP addresses, user agents, or error messages
  • Text Processing: Identify the most common words in large documents
  • Network Monitoring: Track heavy hitters in network traffic
  • Clickstream Analysis: Find the most popular pages or user actions
  • Time Series Data: Monitor frequently occurring events or anomalies

Installation

From Source (Development)

# Clone the repository
git clone https://github.com/pmcgleen/heavykeeper-py.git
cd heavykeeper-py

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build and install the Python package
maturin develop

# Or build a wheel
maturin build --release

Requirements

  • Python 3.11+
  • Rust toolchain (for building from source)

Quick Start

from heavykeeper import HeavyKeeper

# Create a HeavyKeeper instance
# k=100: track top 100 items
# width=2048: sketch width (affects accuracy)
# depth=8: number of hash functions (affects accuracy)
# decay=0.9: aging factor for old items
hk = HeavyKeeper(k=100, width=2048, depth=8, decay=0.9)

# Add items to the stream
items = ["apple", "banana", "apple", "cherry", "apple", "banana"]
for item in items:
    hk.add(item)

# Query individual items
print(f"Is 'apple' in top-K? {hk.query('apple')}")
print(f"Estimated count for 'apple': {hk.count('apple')}")

# Get all top-K items
top_items = hk.list()  # Returns list of (item, count) tuples
print("Top items:", top_items)

# Get as dictionary
top_dict = hk.get_topk()  # Returns {item: count} dictionary
print("Top items dict:", top_dict)

API Reference

HeavyKeeper(k, width, depth, decay)

Creates a new HeavyKeeper instance.

Parameters:

  • k (int): Number of top items to track
  • width (int): Width of the sketch (number of buckets)
  • depth (int): Depth of the sketch (number of hash functions)
  • decay (float): Decay factor for aging items (between 0.0 and 1.0)

Methods

add(item: str) -> None

Add an item to the sketch.

query(item: str) -> bool

Check if an item is being tracked in the top-K list.

count(item: str) -> int

Get the estimated count for an item (returns 0 if not tracked).

list() -> List[Tuple[str, int]]

Get the top-K items as a list of (item, count) tuples, sorted by count.

get_topk() -> Dict[str, int]

Get the top-K items as a dictionary mapping items to counts.

len() -> int

Get the current number of items being tracked.

is_empty() -> bool

Check if the sketch is empty.

Benchmarking

The repository includes a simnple script for performance testing:

Word Count Benchmark

# Basic benchmark with a text file
python benchmark_wordcount.py -k 10 -f data/war_and_peace.txt --time

Parameter Tuning

Choosing Parameters

  • k: Set to the number of top items you need
  • width: Larger values improve accuracy but use more memory (try 1024-8192)
  • depth: More hash functions improve accuracy (try 4-16)
  • decay: Controls how quickly old items are forgotten (0.8-0.99)

Memory Usage

Approximate memory usage: width Γ— depth Γ— 16 bytes + k Γ— (item_size + 16 bytes)

For typical usage (width=2048, depth=8, k=100):

  • Sketch: ~262 KB
  • Top-K storage: ~depends on item sizes

Accuracy vs Performance

  • Higher width and depth β†’ better accuracy, more memory
  • Lower decay β†’ faster adaptation to changes, less stability
  • Higher k β†’ more items tracked, slightly more overhead

Development

Building

# Development build
maturin develop

# Release build  
maturin build --release

# Build with debugging
maturin develop --debug

Testing

# Run the test suite
python test_heavykeeper.py

# Run benchmarks
python benchmark_wordcount.py -k 10 -f test_file.txt

Project Structure

heavykeeper-py/
β”œβ”€β”€ src/
β”‚   └── lib.rs          # Rust implementation and Python bindings
β”œβ”€β”€ benchmark_*.py      # Performance benchmarks
β”œβ”€β”€ test_heavykeeper.py # Test suite
β”œβ”€β”€ Cargo.toml          # Rust dependencies
β”œβ”€β”€ pyproject.toml      # Python package configuration
└── README.md           # This file

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Based on the HeavyKeeper algorithm
  • Built with PyO3 for Rust-Python interoperability
  • Uses Maturin for building Python extensions

About

Heavykeeper algorithm for Top-K elephant flows - python

Topics

Resources

License

Stars

Watchers

Forks

Contributors