A real-time Wikipedia analytics pipeline built to explore ClickHouse's streaming capabilities. It ingests live Wikipedia edits over Server-Sent Events (SSE), routes them through Redpanda (Kafka-compatible), stores them in ClickHouse, and visualizes them with Grafana.
Purpose: An archetype project for exploring ClickHouse in a realistic streaming-analytics context, using real-world data (~50 events/sec from Wikipedia).
Wikipedia EventStreams (SSE)
↓
Go Ingester (franz-go)
• Checkpoint tracking for graceful restarts
• LRU-based deduplication
• Batched Kafka production
↓
Redpanda (Kafka-compatible, single-node)
• 1 partition (right-sized for throughput)
• 24hr retention for replay capability
↓
ClickHouse Kafka Engine
↓
MergeTree Table
• Columnar storage, sparse indexing
• 30-day TTL
↓
Grafana Dashboards
• Real-time analytics
• Public read-only access
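The LRU-based deduplication step in the ingester can be illustrated with a minimal sketch. This is not the project's actual code: the `Deduper` type, `NewDeduper`, and `Seen` are hypothetical names, and the sketch assumes each event carries a unique string ID. It uses only the Go standard library to keep a bounded set of recently seen IDs, evicting the least recently seen one when the capacity is exceeded.

```go
package main

import (
	"container/list"
	"fmt"
)

// Deduper remembers the most recently seen event IDs, up to a fixed capacity.
// Hypothetical type for illustration; not taken from the project source.
type Deduper struct {
	capacity int
	order    *list.List               // front = most recently seen ID
	index    map[string]*list.Element // event ID -> its node in order
}

// NewDeduper returns a Deduper that tracks at most capacity IDs.
func NewDeduper(capacity int) *Deduper {
	if capacity < 1 {
		capacity = 1
	}
	return &Deduper{
		capacity: capacity,
		order:    list.New(),
		index:    make(map[string]*list.Element, capacity),
	}
}

// Seen reports whether id was already tracked, and records it either way.
func (d *Deduper) Seen(id string) bool {
	if el, ok := d.index[id]; ok {
		// Duplicate: refresh its recency so it is evicted later.
		d.order.MoveToFront(el)
		return true
	}
	d.index[id] = d.order.PushFront(id)
	if d.order.Len() > d.capacity {
		// Over capacity: drop the least recently seen ID.
		oldest := d.order.Back()
		d.order.Remove(oldest)
		delete(d.index, oldest.Value.(string))
	}
	return false
}

func main() {
	d := NewDeduper(2)
	for _, id := range []string{"a", "b", "a", "c", "b"} {
		fmt.Printf("%s duplicate=%v\n", id, d.Seen(id))
	}
}
```

With a capacity of 2, the second "a" is reported as a duplicate, while the final "b" is treated as new because it was evicted when "c" arrived. In a real ingester the capacity would presumably be sized to cover the window in which the SSE stream may replay events after a reconnect; that sizing is an assumption here, not something the source specifies.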
Resource Footprint: ~5-7GB RAM total (runs comfortably on an 8GB Raspberry Pi or a cloud VM)
MIT License - see LICENSE file.
