Sketch Benchmarks

Comprehensive performance analysis and benchmarking suite for high-cardinality data stream processing algorithms

A production-ready benchmarking framework for comparing probabilistic data structures and exact algorithms used in high-cardinality telemetry systems, real-time analytics, and cost-efficient backend pipelines for large-scale AI and cloud environments.

🎯 Overview

This repository provides a comprehensive comparison of four algorithmic approaches for top-k frequent item queries in high-cardinality data streams:

  • Space-Saving Algorithm - Deterministic error bounds with O(k) memory
  • Count-Min Sketch - Probabilistic frequency estimation with configurable accuracy
  • Heap-based Top-K - Exact results with O(n log k) complexity
  • Naive Exact Counting - Baseline implementation for comparison

Key Features

  • Production-ready implementations in Go with thread-safe concurrent access
  • Comprehensive benchmark suite with statistical significance testing
  • Real-world performance analysis across multiple workloads
  • Automated visualization tools for performance comparison
  • Memory and throughput metrics for cost optimization analysis
  • Concurrent access patterns testing for distributed systems

🚀 Quick Start

Prerequisites

  • Go 1.21 or later
  • Python 3.8+ (for visualization tools)
  • Make (optional, for convenience commands)

Installation

# Clone the repository
git clone https://github.com/ishaishor/sketch-benchmarks.git
cd sketch-benchmarks

# Install Python dependencies
pip install -r requirements.txt

Running Benchmarks

# Run all benchmarks for a specific algorithm
cd spacesaving
go test -bench=. -benchmem

# Run comprehensive comparison across all algorithms
go run ../benchmark_comparison.go

# Run statistical benchmarks with confidence intervals
go run ../statistical_benchmark.go -runs 20

# Generate visualization charts
python3 ../visualize_benchmarks.py benchmark_statistical_results_*.json

📊 Benchmark Results

The repository includes comprehensive benchmark results demonstrating:

  • Up to 10.9× performance improvements over naive approaches
  • 96% memory reduction for high-cardinality workloads
  • Linear scaling with bounded memory for Space-Saving algorithm
  • Statistical significance with 95% confidence intervals

See the benchmark_charts/ directory for detailed visualizations.

🏗️ Architecture

Algorithm Implementations

Each algorithm is implemented as a standalone package with:

  • Core data structure implementation
  • HTTP server for real-world workload simulation
  • Comprehensive test suite with unit and integration tests
  • Benchmark tests for performance analysis

sketch-benchmarks/
├── spacesaving/          # Space-Saving algorithm implementation
├── countminsketch/       # Count-Min Sketch implementation
├── topk/                 # Heap-based top-k implementation
├── naive/                # Baseline exact counting implementation
├── benchmark_comparison.go    # Cross-algorithm comparison tool
├── statistical_benchmark.go   # Statistical analysis tool
├── visualize_benchmarks.py    # Visualization generation
└── benchmark_charts/          # Generated performance charts

Space-Saving Algorithm

The Space-Saving algorithm provides deterministic error bounds for frequent item identification with O(k) memory complexity.

Key Features:

  • Bounded memory usage regardless of stream size
  • Deterministic error guarantees
  • Thread-safe concurrent access
  • Configurable error bounds via epsilon parameter

Example Usage:

import (
    "fmt"
    "log"

    "github.com/ishaishor/sketch-benchmarks/spacesaving"
)

// Create a summary with error bound epsilon=0.01 (1% error)
summary, err := spacesaving.NewSummaryEpsilon(0.01)
if err != nil {
    log.Fatal(err)
}

// Update with stream items
summary.Update("user123", 1)
summary.Update("user456", 1)

// Get top-k items
topK := summary.TopK()
fmt.Println(topK)
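
For intuition about what happens inside the summary, here is a minimal, map-backed sketch of the Space-Saving update rule. This is an illustration only, not the package's implementation (which uses a stream-summary structure so updates run in O(1) instead of scanning for the minimum):

// Illustrative Space-Saving update rule (not the package's code).
type ssCounter struct {
    count int64
    err   int64 // overestimation bound inherited on eviction
}

type ssSummary struct {
    k        int
    counters map[string]*ssCounter
}

func (s *ssSummary) Update(item string, delta int64) {
    if c, ok := s.counters[item]; ok {
        c.count += delta // already monitored: just increment
        return
    }
    if len(s.counters) < s.k {
        s.counters[item] = &ssCounter{count: delta} // free slot
        return
    }
    // Summary full: evict the current minimum. The newcomer inherits
    // the evicted count, which bounds how much it can be overestimated.
    var minKey string
    var minC *ssCounter
    for key, c := range s.counters {
        if minC == nil || c.count < minC.count {
            minKey, minC = key, c
        }
    }
    delete(s.counters, minKey)
    s.counters[item] = &ssCounter{count: minC.count + delta, err: minC.count}
}

Because any count can exceed the truth by at most the minimum counter value, sizing the summary at k = ⌈1/ε⌉ yields the deterministic ε·N error bound described above.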

Count-Min Sketch

Probabilistic data structure for frequency estimation with configurable accuracy and confidence.

Key Features:

  • Fixed memory footprint O((1/ε) · log(1/δ))
  • Probabilistic error bounds
  • Mergeable sketches for distributed systems
  • Thread-safe operations

Example Usage:

import (
    "fmt"
    "log"

    "github.com/ishaishor/sketch-benchmarks/countminsketch"
)

// Create sketch with 1% error rate and 99% confidence
sketch, err := countminsketch.NewCountMinSketchWithErrorRate(0.01, 0.01)
if err != nil {
    log.Fatal(err)
}

// Update frequencies
sketch.Update("item1", 1)
sketch.Update("item2", 5)

// Query estimated frequency
count := sketch.Query("item1")
fmt.Println(count)
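
Under the hood, a Count-Min Sketch is a d × w grid of counters with one hash function per row: Update increments one counter in each row, Query returns the row-wise minimum (an overestimate, never an underestimate), and two sketches with identical dimensions merge by element-wise addition. A compact illustrative version follows; it is not the package's code, and the row hash here is an FNV-1a variant chosen purely for brevity:

import "hash/fnv"

// Illustrative d×w Count-Min Sketch (not the package's implementation).
type cms struct {
    d, w  int
    table [][]uint64
}

func newCMS(d, w int) *cms {
    t := make([][]uint64, d)
    for i := range t {
        t[i] = make([]uint64, w)
    }
    return &cms{d: d, w: w, table: t}
}

// idx derives a row-specific bucket by mixing the row index into FNV-1a.
func (c *cms) idx(row int, item string) int {
    h := fnv.New64a()
    h.Write([]byte{byte(row)})
    h.Write([]byte(item))
    return int(h.Sum64() % uint64(c.w))
}

func (c *cms) Update(item string, delta uint64) {
    for r := 0; r < c.d; r++ {
        c.table[r][c.idx(r, item)] += delta
    }
}

// Query takes the minimum over rows: hash collisions only inflate
// counters, so the smallest cell is the tightest estimate.
func (c *cms) Query(item string) uint64 {
    est := c.table[0][c.idx(0, item)]
    for r := 1; r < c.d; r++ {
        if v := c.table[r][c.idx(r, item)]; v < est {
            est = v
        }
    }
    return est
}

// Merge adds a same-sized sketch element-wise, which is what makes
// sketches combinable across distributed workers.
func (c *cms) Merge(o *cms) {
    for r := range c.table {
        for j := range c.table[r] {
            c.table[r][j] += o.table[r][j]
        }
    }
}

Setting w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ gives the standard guarantee: with probability at least 1 − δ, each estimate exceeds the true count by at most ε·N.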

📈 Performance Analysis

Benchmark Methodology

All benchmarks follow rigorous statistical methodology:

  1. Multiple independent runs (default: 20 runs)
  2. Statistical significance testing with 95% confidence intervals (a minimal sketch of this computation follows the list)
  3. Memory allocation tracking for cost analysis
  4. Concurrent access patterns for real-world scenarios
  5. Scaling analysis across different data sizes
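
As a reference for item 2, here is a minimal sketch of computing a mean with a 95% confidence interval over independent runs, assuming the normal approximation with z = 1.96 (statistical_benchmark.go may use a t-distribution or different constants):

import "math"

// meanCI95 returns the sample mean and the half-width of a 95% confidence
// interval using the normal approximation. Requires len(samples) >= 2.
func meanCI95(samples []float64) (mean, halfWidth float64) {
    n := float64(len(samples))
    for _, v := range samples {
        mean += v
    }
    mean /= n

    var variance float64
    for _, v := range samples {
        d := v - mean
        variance += d * d
    }
    variance /= n - 1 // unbiased sample variance

    halfWidth = 1.96 * math.Sqrt(variance/n) // z * standard error
    return mean, halfWidth
}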

Key Metrics

  • Time per operation (ns/op) - Latency analysis
  • Bytes per operation (B/op) - Memory efficiency
  • Allocations per operation - GC pressure analysis
  • Throughput - Operations per second
  • Scaling behavior - Performance at different scales

Results Summary

Based on comprehensive benchmarking:

| Algorithm | Memory Complexity | Time Complexity | Error Type | Best For |
|---|---|---|---|---|
| Space-Saving | O(k) | O(1) update | Deterministic | Top-k queries |
| Count-Min Sketch | O((1/ε) · log(1/δ)) | O(d) update | Probabilistic | Frequency queries |
| Heap Top-K | O(n) | O(n log k) | Exact | Small datasets |
| Naive | O(n) | O(n log n) | Exact | Baseline |

🔬 Use Cases

High-Cardinality Telemetry Systems

Process millions of unique metrics with bounded memory:

// Track top 100 most frequent error types
summary, _ := spacesaving.NewSummary(100)
for _, event := range telemetryStream {
    summary.Update(event.ErrorType, 1)
}
topErrors := summary.TopK()

Real-Time Analytics

Efficiently identify trending items in data streams:

// Find trending products in e-commerce platform
sketch, _ := countminsketch.NewCountMinSketchWithErrorRate(0.01, 0.01)
for _, purchase := range purchaseStream {
    sketch.Update(purchase.ProductID, 1)
}
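
One caveat worth noting: the sketch estimates frequencies but does not itself remember which items it has seen, so a trending query needs a candidate set, such as the product IDs observed in the current window. A hedged sketch of that pattern, where candidateProducts is hypothetical and Query is assumed to return an unsigned count:

import "sort"

// Rank a candidate window of product IDs by estimated frequency.
// candidateProducts is a hypothetical slice of recently seen IDs,
// not part of this repository.
type trendingItem struct {
    ID    string
    Count uint64
}

var trending []trendingItem
for _, id := range candidateProducts {
    trending = append(trending, trendingItem{ID: id, Count: sketch.Query(id)})
}
sort.Slice(trending, func(i, j int) bool { return trending[i].Count > trending[j].Count })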

Cost-Efficient Backend Pipelines

Reduce infrastructure costs through algorithmic optimization:

  • 69-73% infrastructure cost reduction for large-scale deployments
  • Non-linear scaling benefits for high-volume systems
  • Predictable memory usage for capacity planning
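
As an illustration of that predictability (not a measured figure from this repository): a Space-Saving summary sized at k = 10,000 counters occupies the same bounded footprint whether the stream contains one million or one billion distinct keys, so capacity can be planned from k alone rather than from the stream's cardinality.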

🛠️ Development

Running Tests

# Run all tests
go test ./...

# Run tests with coverage
go test -cover ./...

# Run benchmarks
go test -bench=. -benchmem ./...

Code Quality

# Format code
go fmt ./...

# Run linter (if installed)
golangci-lint run

Adding New Algorithms

  1. Create a new directory under the repository root
  2. Implement the core data structure
  3. Add HTTP server for workload simulation
  4. Create comprehensive test suite
  5. Add benchmark tests following existing patterns
  6. Update benchmark_comparison.go to include the new algorithm
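
As a hedged illustration of step 6 (the repository does not necessarily define such an interface), integration is simplest if the new algorithm exposes a small common surface along these lines:

// Hypothetical surface for benchmark_comparison.go; the actual wiring
// in the repository may differ.
type FrequencyEstimator interface {
    Update(item string, count int64) // ingest one stream element
    Query(item string) int64         // estimated frequency of the item
}

type TopKProvider interface {
    TopK() []string // most frequent items, best effort
}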

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

  • Additional algorithm implementations
  • Performance optimizations
  • Documentation improvements
  • Benchmark workload additions
  • Visualization enhancements

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

This work is part of research on cost-effective high-cardinality data stream processing for enterprise-scale applications. The implementations and benchmarks are designed to provide practical guidance for production deployments.

📧 Contact

For questions, issues, or contributions, please open an issue on GitHub.


Note: This repository is part of ongoing research on scalable observability and cloud-infrastructure reliability. The benchmarks and implementations are designed to advance high-cardinality telemetry systems, real-time analytics, and cost-efficient backend pipelines used in large-scale AI and cloud environments.
