Sketch Benchmarks

Comprehensive performance analysis and benchmarking suite for high-cardinality data stream processing algorithms

A production-ready benchmarking framework for comparing probabilistic data structures and exact algorithms used in high-cardinality telemetry systems, real-time analytics, and cost-efficient backend pipelines for large-scale AI and cloud environments.

🎯 Overview

This repository provides a comprehensive comparison of four algorithmic approaches for top-k frequent item queries in high-cardinality data streams:

  • Space-Saving Algorithm - Deterministic error bounds with O(k) memory
  • Count-Min Sketch - Probabilistic frequency estimation with configurable accuracy
  • Heap-based Top-K - Exact results with O(n log k) complexity
  • Naive Exact Counting - Baseline implementation for comparison

Key Features

  • Production-ready implementations in Go with thread-safe concurrent access
  • Comprehensive benchmark suite with statistical significance testing
  • Real-world performance analysis across multiple workloads
  • Automated visualization tools for performance comparison
  • Memory and throughput metrics for cost optimization analysis
  • Concurrent access patterns testing for distributed systems

🚀 Quick Start

Prerequisites

  • Go 1.21 or later
  • Python 3.8+ (for visualization tools)
  • Make (optional, for convenience commands)

Installation

# Clone the repository
git clone https://github.com/ishaishor/sketch-benchmarks.git
cd sketch-benchmarks

# Install Python dependencies
pip install -r requirements.txt

Running Benchmarks

# Run all benchmarks for a specific algorithm
cd spacesaving
go test -bench=. -benchmem

# Run comprehensive comparison across all algorithms
go run ../benchmark_comparison.go

# Run statistical benchmarks with confidence intervals
go run ../statistical_benchmark.go -runs 20

# Generate visualization charts
python3 ../visualize_benchmarks.py benchmark_statistical_results_*.json

📊 Benchmark Results

The repository includes comprehensive benchmark results demonstrating:

  • Up to 10.9× performance improvements over naive approaches
  • 96% memory reduction for high-cardinality workloads
  • Linear scaling with bounded memory for Space-Saving algorithm
  • Statistical significance with 95% confidence intervals

See the benchmark_charts/ directory for detailed visualizations.

🏗️ Architecture

Algorithm Implementations

Each algorithm is implemented as a standalone package with:

  • Core data structure implementation
  • HTTP server for real-world workload simulation
  • Comprehensive test suite with unit and integration tests
  • Benchmark tests for performance analysis

sketch-benchmarks/
├── spacesaving/          # Space-Saving algorithm implementation
├── countminsketch/       # Count-Min Sketch implementation
├── topk/                 # Heap-based top-k implementation
├── naive/                # Baseline exact counting implementation
├── benchmark_comparison.go    # Cross-algorithm comparison tool
├── statistical_benchmark.go   # Statistical analysis tool
├── visualize_benchmarks.py    # Visualization generation
└── benchmark_charts/          # Generated performance charts

Space-Saving Algorithm

The Space-Saving algorithm provides deterministic error bounds for frequent item identification with O(k) memory complexity.

Key Features:

  • Bounded memory usage regardless of stream size
  • Deterministic error guarantees
  • Thread-safe concurrent access
  • Configurable error bounds via epsilon parameter

Example Usage:

import (
    "fmt"
    "log"

    "github.com/ishaishor/sketch-benchmarks/spacesaving"
)

// Create a summary with error bound epsilon=0.01 (1% error)
summary, err := spacesaving.NewSummaryEpsilon(0.01)
if err != nil {
    log.Fatal(err)
}

// Update with stream items
summary.Update("user123", 1)
summary.Update("user456", 1)

// Get top-k items
topK := summary.TopK()
fmt.Println(topK)
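
For intuition about what happens inside the summary, here is a minimal, map-backed sketch of the Space-Saving update rule. This is an illustration only, not the package's implementation (which uses a stream-summary structure so updates run in O(1) instead of scanning for the minimum):

// Illustrative Space-Saving update rule (not the package's code).
type ssCounter struct {
    count int64
    err   int64 // overestimation bound inherited on eviction
}

type ssSummary struct {
    k        int
    counters map[string]*ssCounter
}

func (s *ssSummary) Update(item string, delta int64) {
    if c, ok := s.counters[item]; ok {
        c.count += delta // already monitored: just increment
        return
    }
    if len(s.counters) < s.k {
        s.counters[item] = &ssCounter{count: delta} // free slot
        return
    }
    // Summary full: evict the current minimum. The newcomer inherits
    // the evicted count, which bounds how much it can be overestimated.
    var minKey string
    var minC *ssCounter
    for key, c := range s.counters {
        if minC == nil || c.count < minC.count {
            minKey, minC = key, c
        }
    }
    delete(s.counters, minKey)
    s.counters[item] = &ssCounter{count: minC.count + delta, err: minC.count}
}

Because any count can exceed the truth by at most the minimum counter value, sizing the summary at k = ⌈1/ε⌉ yields the deterministic ε·N error bound described above.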

Count-Min Sketch

Probabilistic data structure for frequency estimation with configurable accuracy and confidence.

Key Features:

  • Fixed memory footprint O((1/ε) · log(1/δ))
  • Probabilistic error bounds
  • Mergeable sketches for distributed systems
  • Thread-safe operations

Example Usage:

import (
    "fmt"
    "log"

    "github.com/ishaishor/sketch-benchmarks/countminsketch"
)

// Create sketch with 1% error rate and 99% confidence
sketch, err := countminsketch.NewCountMinSketchWithErrorRate(0.01, 0.01)
if err != nil {
    log.Fatal(err)
}

// Update frequencies
sketch.Update("item1", 1)
sketch.Update("item2", 5)

// Query estimated frequency
count := sketch.Query("item1")
fmt.Println(count)
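
Under the hood, a Count-Min Sketch is a d × w grid of counters with one hash function per row: Update increments one counter in each row, Query returns the row-wise minimum (an overestimate, never an underestimate), and two sketches with identical dimensions merge by element-wise addition. A compact illustrative version follows; it is not the package's code, and the row hash here is an FNV-1a variant chosen purely for brevity:

import "hash/fnv"

// Illustrative d×w Count-Min Sketch (not the package's implementation).
type cms struct {
    d, w  int
    table [][]uint64
}

func newCMS(d, w int) *cms {
    t := make([][]uint64, d)
    for i := range t {
        t[i] = make([]uint64, w)
    }
    return &cms{d: d, w: w, table: t}
}

// idx derives a row-specific bucket by mixing the row index into FNV-1a.
func (c *cms) idx(row int, item string) int {
    h := fnv.New64a()
    h.Write([]byte{byte(row)})
    h.Write([]byte(item))
    return int(h.Sum64() % uint64(c.w))
}

func (c *cms) Update(item string, delta uint64) {
    for r := 0; r < c.d; r++ {
        c.table[r][c.idx(r, item)] += delta
    }
}

// Query takes the minimum over rows: hash collisions only inflate
// counters, so the smallest cell is the tightest estimate.
func (c *cms) Query(item string) uint64 {
    est := c.table[0][c.idx(0, item)]
    for r := 1; r < c.d; r++ {
        if v := c.table[r][c.idx(r, item)]; v < est {
            est = v
        }
    }
    return est
}

// Merge adds a same-sized sketch element-wise, which is what makes
// sketches combinable across distributed workers.
func (c *cms) Merge(o *cms) {
    for r := range c.table {
        for j := range c.table[r] {
            c.table[r][j] += o.table[r][j]
        }
    }
}

Setting w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ gives the standard guarantee: with probability at least 1 − δ, each estimate exceeds the true count by at most ε·N.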

📈 Performance Analysis

Benchmark Methodology

All benchmarks follow rigorous statistical methodology:

  1. Multiple independent runs (default: 20 runs)
  2. Statistical significance testing with 95% confidence intervals (a minimal sketch of this computation follows the list)
  3. Memory allocation tracking for cost analysis
  4. Concurrent access patterns for real-world scenarios
  5. Scaling analysis across different data sizes
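
As a reference for item 2, here is a minimal sketch of computing a mean with a 95% confidence interval over independent runs, assuming the normal approximation with z = 1.96 (statistical_benchmark.go may use a t-distribution or different constants):

import "math"

// meanCI95 returns the sample mean and the half-width of a 95% confidence
// interval using the normal approximation. Requires len(samples) >= 2.
func meanCI95(samples []float64) (mean, halfWidth float64) {
    n := float64(len(samples))
    for _, v := range samples {
        mean += v
    }
    mean /= n

    var variance float64
    for _, v := range samples {
        d := v - mean
        variance += d * d
    }
    variance /= n - 1 // unbiased sample variance

    halfWidth = 1.96 * math.Sqrt(variance/n) // z * standard error
    return mean, halfWidth
}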

Key Metrics

  • Time per operation (ns/op) - Latency analysis
  • Bytes per operation (B/op) - Memory efficiency
  • Allocations per operation - GC pressure analysis
  • Throughput - Operations per second
  • Scaling behavior - Performance at different scales

Results Summary

Based on comprehensive benchmarking:

| Algorithm | Memory Complexity | Time Complexity | Error Type | Best For |
|---|---|---|---|---|
| Space-Saving | O(k) | O(1) update | Deterministic | Top-k queries |
| Count-Min Sketch | O((1/ε) · log(1/δ)) | O(d) update | Probabilistic | Frequency queries |
| Heap Top-K | O(n) | O(n log k) | Exact | Small datasets |
| Naive | O(n) | O(n log n) | Exact | Baseline |

🔬 Use Cases

High-Cardinality Telemetry Systems

Process millions of unique metrics with bounded memory:

// Track top 100 most frequent error types
summary, _ := spacesaving.NewSummary(100)
for _, event := range telemetryStream {
    summary.Update(event.ErrorType, 1)
}
topErrors := summary.TopK()

Real-Time Analytics

Efficiently identify trending items in data streams:

// Find trending products in e-commerce platform
sketch, _ := countminsketch.NewCountMinSketchWithErrorRate(0.01, 0.01)
for _, purchase := range purchaseStream {
    sketch.Update(purchase.ProductID, 1)
}
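
One caveat worth noting: the sketch estimates frequencies but does not itself remember which items it has seen, so a trending query needs a candidate set, such as the product IDs observed in the current window. A hedged sketch of that pattern, where candidateProducts is hypothetical and Query is assumed to return an unsigned count:

import "sort"

// Rank a candidate window of product IDs by estimated frequency.
// candidateProducts is a hypothetical slice of recently seen IDs,
// not part of this repository.
type trendingItem struct {
    ID    string
    Count uint64
}

var trending []trendingItem
for _, id := range candidateProducts {
    trending = append(trending, trendingItem{ID: id, Count: sketch.Query(id)})
}
sort.Slice(trending, func(i, j int) bool { return trending[i].Count > trending[j].Count })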

Cost-Efficient Backend Pipelines

Reduce infrastructure costs through algorithmic optimization:

  • 69-73% infrastructure cost reduction for large-scale deployments
  • Non-linear scaling benefits for high-volume systems
  • Predictable memory usage for capacity planning
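
As an illustration of that predictability (not a measured figure from this repository): a Space-Saving summary sized at k = 10,000 counters occupies the same bounded footprint whether the stream contains one million or one billion distinct keys, so capacity can be planned from k alone rather than from the stream's cardinality.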

🛠️ Development

Running Tests

# Run all tests
go test ./...

# Run tests with coverage
go test -cover ./...

# Run benchmarks
go test -bench=. -benchmem ./...

Code Quality

# Format code
go fmt ./...

# Run linter (if installed)
golangci-lint run

Adding New Algorithms

  1. Create a new directory under the repository root
  2. Implement the core data structure
  3. Add HTTP server for workload simulation
  4. Create comprehensive test suite
  5. Add benchmark tests following existing patterns
  6. Update benchmark_comparison.go to include the new algorithm
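
As a hedged illustration of step 6 (the repository does not necessarily define such an interface), integration is simplest if the new algorithm exposes a small common surface along these lines:

// Hypothetical surface for benchmark_comparison.go; the actual wiring
// in the repository may differ.
type FrequencyEstimator interface {
    Update(item string, count int64) // ingest one stream element
    Query(item string) int64         // estimated frequency of the item
}

type TopKProvider interface {
    TopK() []string // most frequent items, best effort
}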

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

  • Additional algorithm implementations
  • Performance optimizations
  • Documentation improvements
  • Benchmark workload additions
  • Visualization enhancements

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

This work is part of research on cost-effective high-cardinality data stream processing for enterprise-scale applications. The implementations and benchmarks are designed to provide practical guidance for production deployments.

📧 Contact

For questions, issues, or contributions, please open an issue on GitHub.


Note: This repository is part of ongoing research on scalable observability and cloud-infrastructure reliability. The benchmarks and implementations are designed to advance high-cardinality telemetry systems, real-time analytics, and cost-efficient backend pipelines used in large-scale AI and cloud environments.
