kuklyy/wikihouse

wikihouse

Real-time Wikipedia analytics pipeline built to explore ClickHouse's streaming capabilities. It ingests live Wikipedia edits via Server-Sent Events (SSE), routes them through Redpanda (Kafka-compatible), stores them in ClickHouse, and visualizes them with Grafana.

Purpose: Archetype project for exploring ClickHouse in a realistic streaming analytics context with real-world data (~50 events/sec from Wikipedia).

Architecture at a Glance

Wikipedia EventStreams (SSE)
        ↓
Go Ingester (franz-go)
  • Checkpoint tracking for graceful restarts
  • LRU-based deduplication
  • Batched Kafka production
        ↓
Redpanda (Kafka-compatible, single-node)
  • 1 partition (right-sized for throughput)
  • 24hr retention for replay capability
        ↓
ClickHouse Kafka Engine
        ↓
MergeTree Table
  • Columnar storage, sparse indexing
  • 30-day TTL
        ↓
Grafana Dashboards
  • Real-time analytics
  • Public read-only access
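The "LRU-based deduplication" step in the ingester can be pictured with a minimal sketch like the one below. It uses only the Go standard library; the `Deduper` type, its capacity, and the choice of a string event ID as the key are all assumptions made for illustration, not the project's actual implementation.

```go
package main

import (
	"container/list"
	"fmt"
)

// Deduper remembers the most recently seen event IDs, up to a fixed capacity.
// (Hypothetical type for this sketch.)
type Deduper struct {
	capacity int
	order    *list.List               // front = most recently seen
	index    map[string]*list.Element // event ID -> node in order
}

func NewDeduper(capacity int) *Deduper {
	return &Deduper{
		capacity: capacity,
		order:    list.New(),
		index:    make(map[string]*list.Element),
	}
}

// Seen reports whether id was already recorded, and records it if not.
// When the capacity is exceeded, the least recently seen ID is evicted.
func (d *Deduper) Seen(id string) bool {
	if el, ok := d.index[id]; ok {
		d.order.MoveToFront(el) // refresh recency on a duplicate
		return true
	}
	d.index[id] = d.order.PushFront(id)
	if d.order.Len() > d.capacity {
		oldest := d.order.Back()
		d.order.Remove(oldest)
		delete(d.index, oldest.Value.(string))
	}
	return false
}

func main() {
	d := NewDeduper(2)
	for _, id := range []string{"a", "b", "a", "c", "b"} {
		fmt.Printf("%s duplicate=%v\n", id, d.Seen(id))
	}
}
```

A bounded cache like this trades completeness for fixed memory: a duplicate that arrives after its ID has been evicted will pass through, so the capacity would need to be sized against the expected replay window.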

Resource Footprint: ~5-7GB RAM total (runs comfortably on an 8GB Raspberry Pi or a cloud VM)

Dashboard

Grafana Dashboard

License

MIT License - see LICENSE file.
