Comprehensive performance analysis and benchmarking suite for high-cardinality data stream processing algorithms
A production-ready benchmarking framework for comparing probabilistic data structures and exact algorithms used in high-cardinality telemetry systems, real-time analytics, and cost-efficient backend pipelines for large-scale AI and cloud environments.
This repository provides a comprehensive comparison of four algorithmic approaches for top-k frequent item queries in high-cardinality data streams:
- Space-Saving Algorithm - Deterministic error bounds with O(k) memory
- Count-Min Sketch - Probabilistic frequency estimation with configurable accuracy
- Heap-based Top-K - Exact results with O(n log k) complexity
- Naive Exact Counting - Baseline implementation for comparison
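Of these four, the heap-based approach is the simplest to sketch. The snippet below is a minimal illustration of the O(n log k) technique, not the API of the repository's topk package:

```go
package main

import (
	"container/heap"
	"fmt"
)

type entry struct {
	item  string
	count int
}

// minHeap holds the k most frequent items seen so far; the root is
// the smallest count, so it is the first candidate for eviction.
type minHeap []entry

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// topK returns the k most frequent items from exact counts,
// using O(n log k) time and O(k) heap space beyond the counts map.
func topK(counts map[string]int, k int) []entry {
	h := &minHeap{}
	for item, c := range counts {
		heap.Push(h, entry{item, c})
		if h.Len() > k {
			heap.Pop(h) // evict the current minimum
		}
	}
	// Drain the heap so results are ordered by descending count.
	out := make([]entry, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(entry)
	}
	return out
}

func main() {
	counts := map[string]int{"a": 5, "b": 2, "c": 9, "d": 1}
	fmt.Println(topK(counts, 2)) // highest counts first
}
```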
- ✅ Production-ready implementations in Go with enterprise-grade concurrency
- ✅ Comprehensive benchmark suite with statistical significance testing
- ✅ Real-world performance analysis across multiple workloads
- ✅ Automated visualization tools for performance comparison
- ✅ Memory and throughput metrics for cost optimization analysis
- ✅ Concurrent access patterns testing for distributed systems
- Go 1.21 or later
- Python 3.8+ (for visualization tools)
- Make (optional, for convenience commands)
```bash
# Clone the repository
git clone https://github.com/ishaishor/sketch-benchmarks.git
cd sketch-benchmarks

# Install Python dependencies
pip install -r requirements.txt
```

```bash
# Run all benchmarks for a specific algorithm
cd spacesaving
go test -bench=. -benchmem

# Run comprehensive comparison across all algorithms
go run ../benchmark_comparison.go

# Run statistical benchmarks with confidence intervals
go run ../statistical_benchmark.go -runs 20

# Generate visualization charts
python3 ../visualize_benchmarks.py benchmark_statistical_results_*.json
```

The repository includes comprehensive benchmark results demonstrating:
- Up to 10.9× performance improvements over naive approaches
- 96% memory reduction for high-cardinality workloads
- Linear scaling with bounded memory for Space-Saving algorithm
- Statistical significance with 95% confidence intervals
See the benchmark_charts/ directory for detailed visualizations.
Each algorithm is implemented as a standalone package with:
- Core data structure implementation
- HTTP server for real-world workload simulation
- Comprehensive test suite with unit and integration tests
- Benchmark tests for performance analysis
```
sketch-benchmarks/
├── spacesaving/              # Space-Saving algorithm implementation
├── countminsketch/           # Count-Min Sketch implementation
├── topk/                     # Heap-based top-k implementation
├── naive/                    # Baseline exact counting implementation
├── benchmark_comparison.go   # Cross-algorithm comparison tool
├── statistical_benchmark.go  # Statistical analysis tool
├── visualize_benchmarks.py   # Visualization generation
└── benchmark_charts/         # Generated performance charts
```
The Space-Saving algorithm provides deterministic error bounds for frequent item identification with O(k) memory complexity.
Key Features:
- Bounded memory usage regardless of stream size
- Deterministic error guarantees
- Thread-safe concurrent access
- Configurable error bounds via epsilon parameter
Example Usage:
```go
import "github.com/ishaishor/sketch-benchmarks/spacesaving"

// Create a summary with error bound epsilon=0.01 (1% error)
summary, err := spacesaving.NewSummaryEpsilon(0.01)
if err != nil {
	log.Fatal(err)
}

// Update with stream items
summary.Update("user123", 1)
summary.Update("user456", 1)

// Get top-k items
topK := summary.TopK()
```

A probabilistic data structure for frequency estimation with configurable accuracy and confidence.
Key Features:
- Fixed memory footprint O(log(1/δ)/ε)
- Probabilistic error bounds
- Mergeable sketches for distributed systems
- Thread-safe operations
Example Usage:
```go
import "github.com/ishaishor/sketch-benchmarks/countminsketch"

// Create a sketch with 1% error rate and 99% confidence
sketch, err := countminsketch.NewCountMinSketchWithErrorRate(0.01, 0.01)
if err != nil {
	log.Fatal(err)
}

// Update frequencies
sketch.Update("item1", 1)
sketch.Update("item2", 5)

// Query the estimated frequency
count := sketch.Query("item1")
```

All benchmarks follow rigorous statistical methodology:
- Multiple independent runs (default: 20 runs)
- Statistical significance testing with 95% confidence intervals
- Memory allocation tracking for cost analysis
- Concurrent access patterns for real-world scenarios
- Scaling analysis across different data sizes
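The confidence-interval step of this methodology can be sketched as follows. This is a minimal normal-approximation version (z = 1.96); statistical_benchmark.go may use a different estimator, such as a t-distribution for small run counts:

```go
package main

import (
	"fmt"
	"math"
)

// confidenceInterval95 returns the mean and the half-width of a 95%
// confidence interval, using the normal approximation (z = 1.96).
func confidenceInterval95(samples []float64) (mean, halfWidth float64) {
	n := float64(len(samples))
	for _, s := range samples {
		mean += s
	}
	mean /= n

	var variance float64
	for _, s := range samples {
		variance += (s - mean) * (s - mean)
	}
	variance /= n - 1 // unbiased sample variance

	halfWidth = 1.96 * math.Sqrt(variance/n)
	return mean, halfWidth
}

func main() {
	// Hypothetical ns/op measurements from 5 benchmark runs.
	runs := []float64{102, 98, 101, 99, 100}
	m, hw := confidenceInterval95(runs)
	fmt.Printf("%.1f ns/op ± %.2f\n", m, hw)
}
```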
- Time per operation (ns/op) - Latency analysis
- Bytes per operation (B/op) - Memory efficiency
- Allocations per operation - GC pressure analysis
- Throughput - Operations per second
- Scaling behavior - Performance at different scales
Based on comprehensive benchmarking:
| Algorithm | Memory Complexity | Time Complexity | Error Type | Best For |
|---|---|---|---|---|
| Space-Saving | O(k) | O(1) update | Deterministic | Top-k queries |
| Count-Min Sketch | O(log(1/δ)/ε) | O(d) update | Probabilistic | Frequency queries |
| Heap Top-K | O(n) | O(n log k) | Exact | Small datasets |
| Naive | O(n) | O(n log n) | Exact | Baseline |
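To make the Count-Min Sketch memory bound in the table concrete, the textbook sizing rule chooses width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉. The function below illustrates that formula; it is not part of this repository's internals:

```go
package main

import (
	"fmt"
	"math"
)

// cmsDimensions applies the standard Count-Min Sketch sizing rule:
// width = ceil(e/epsilon) columns, depth = ceil(ln(1/delta)) rows.
func cmsDimensions(epsilon, delta float64) (width, depth int) {
	width = int(math.Ceil(math.E / epsilon))
	depth = int(math.Ceil(math.Log(1 / delta)))
	return width, depth
}

func main() {
	// 1% additive error (epsilon) with 99% confidence (delta = 0.01):
	// the counter grid stays small regardless of stream cardinality.
	w, d := cmsDimensions(0.01, 0.01)
	fmt.Printf("width=%d depth=%d counters=%d\n", w, d, w*d)
}
```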
Process millions of unique metrics with bounded memory:
```go
// Track top 100 most frequent error types
summary, _ := spacesaving.NewSummary(100)
for _, event := range telemetryStream {
	summary.Update(event.ErrorType, 1)
}
topErrors := summary.TopK()
```

Efficiently identify trending items in data streams:
```go
// Find trending products in an e-commerce platform
sketch, _ := countminsketch.NewCountMinSketchWithErrorRate(0.01, 0.01)
for _, purchase := range purchaseStream {
	sketch.Update(purchase.ProductID, 1)
}
```

Reduce infrastructure costs through algorithmic optimization:
- 69-73% infrastructure cost reduction for large-scale deployments
- Non-linear scaling benefits for high-volume systems
- Predictable memory usage for capacity planning
```bash
# Run all tests
go test ./...

# Run tests with coverage
go test -cover ./...

# Run benchmarks
go test -bench=. -benchmem ./...
```

```bash
# Format code
go fmt ./...

# Run linter (if installed)
golangci-lint run
```

- Create a new directory under the repository root
- Implement the core data structure
- Add HTTP server for workload simulation
- Create comprehensive test suite
- Add benchmark tests following existing patterns
- Update benchmark_comparison.go to include the new algorithm
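As a sketch of what a new algorithm package might expose to the comparison tool, a small interface like the one below would suffice. This is purely illustrative; the repository does not necessarily define such an interface:

```go
package main

import "fmt"

// FrequencyEstimator is a hypothetical interface a new algorithm
// package could satisfy so the comparison tool can drive it alongside
// the existing implementations (illustrative only).
type FrequencyEstimator interface {
	Update(item string, count uint64)
	Estimate(item string) uint64
}

// exactCounter is a trivial reference implementation of the interface.
type exactCounter map[string]uint64

func (c exactCounter) Update(item string, count uint64) { c[item] += count }
func (c exactCounter) Estimate(item string) uint64      { return c[item] }

func main() {
	var e FrequencyEstimator = exactCounter{}
	e.Update("x", 3)
	e.Update("x", 2)
	fmt.Println(e.Estimate("x")) // 5
}
```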
- Algorithm Details - Detailed algorithm descriptions
- Benchmark Methodology - Benchmarking approach and metrics
- Performance Analysis - Detailed performance results
- API Reference - Complete API documentation
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Additional algorithm implementations
- Performance optimizations
- Documentation improvements
- Benchmark workload additions
- Visualization enhancements
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This work is part of research on cost-effective high-cardinality data stream processing for enterprise-scale applications. The implementations and benchmarks are designed to provide practical guidance for production deployments.
For questions, issues, or contributions, please open an issue on GitHub.
Note: This repository is part of ongoing research on scalable observability and cloud-infrastructure reliability. The benchmarks and implementations are designed to advance high-cardinality telemetry systems, real-time analytics, and cost-efficient backend pipelines used in large-scale AI and cloud environments.