This project provides a big data architecture for analyzing Codeforces contest submissions. It leverages a modern data pipeline using Apache Spark, Kafka, ClickHouse, MinIO, Airflow, and Grafana to ingest, process, store, and visualize large-scale contest data.
- airflow/: Orchestration and workflow management (DAGs, configs, requirements).
- clickhouse/: ClickHouse database setup and scripts.
- fake-stream/: Data generator for simulating contest submissions.
- grafana/: Configuration for Grafana dashboards and monitoring.
- images/: Architecture diagrams and related images.
- init/: Initialization scripts (e.g., MinIO bucket creation).
- kafka/, kafka_script/: Kafka broker and related scripts for streaming data.
- minio/: MinIO object storage configuration.
- sample/: Sample data.
- spark-clean/, spark-cleaned/, spark-clickhouse/, spark-flatten/, spark-transform/: Spark jobs for ETL and data processing.
- Data Ingestion: Simulated by `fake-stream/send_by_time.py`, which streams submission data to Kafka.
- Data Processing: Apache Spark jobs process and transform the data, storing intermediate results in MinIO and final results in ClickHouse.
- Orchestration: Apache Airflow schedules and manages the ETL workflows.
- Storage: MinIO for raw/intermediate data, ClickHouse for analytics-ready data.
- Visualization: Grafana dashboards connect to ClickHouse for real-time analytics.
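The ingestion step can be sketched as follows: a minimal producer that builds a submission record and pushes it to Kafka. This is an illustrative sketch, not the project's actual generator — the broker address, topic name, and field names are assumptions; the real values live in `fake-stream/send_by_time.py` and the Kafka configuration.

```python
import json


def make_submission(sub_id, handle, problem, verdict, ts):
    """Build one submission record. Field names here are illustrative,
    not necessarily the schema the project's Spark jobs expect."""
    return {
        "id": sub_id,
        "handle": handle,
        "problem": problem,
        "verdict": verdict,
        "creationTimeSeconds": ts,
    }


def main():
    # Requires `pip install kafka-python` and a running broker.
    # Broker address and topic name are assumptions.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("submissions", make_submission(1, "tourist", "1999A", "OK", 1700000000))
    producer.flush()


if __name__ == "__main__":
    main()
```

Serializing to JSON at the producer keeps the Spark consumers free to evolve their schema independently of the generator.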
- Clone the repository: `git clone https://github.com/DOCUTEE/CFBIGDATA.git`, then `cd CFBIGDATA`.
- Start the stack: `docker-compose up --build`
- Access Services:
  - Airflow: http://localhost:8080
  - Grafana: http://localhost:3000
  - MinIO: http://localhost:9000
  - ClickHouse: http://localhost:8123
- Simulate Data: run the data generator in `fake-stream` to start streaming submissions.
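Once the stack is up and the generator is streaming, you can sanity-check that data reached ClickHouse over its HTTP interface on port 8123. A minimal helper using only the standard library; the table name `submissions` is an assumption — substitute whatever the scripts in `clickhouse/` actually create.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def clickhouse_url(query, host="localhost", port=8123):
    """Build a URL for ClickHouse's HTTP query interface."""
    return f"http://{host}:{port}/?{urlencode({'query': query})}"


def run_query(query):
    # Requires the ClickHouse container from docker-compose to be running.
    with urlopen(clickhouse_url(query)) as resp:
        return resp.read().decode().strip()


if __name__ == "__main__":
    # Hypothetical table name; check the clickhouse/ scripts for the real schema.
    print(run_query("SELECT count() FROM submissions"))
```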
- Docker & Docker Compose
- Python (for scripts in `init` and `fake-stream`)
- See `airflow/requirements.txt` and `init/requirements.txt` for Python dependencies.
- Modify or add Spark jobs in the respective `spark-*` directories.
- Update Airflow DAGs in `airflow/dags` to orchestrate new workflows.
- Use Grafana to create or modify dashboards for analytics.
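As an example of the kind of transform these jobs perform, nested submission JSON must be flattened before it fits a columnar ClickHouse table. A pure-Python sketch of that logic, assuming the `spark-flatten` job does something equivalent over DataFrames at scale; the example record's shape is loosely modeled on Codeforces API output and is not the project's exact schema.

```python
def flatten(record, prefix="", sep="."):
    """Recursively flatten nested dicts into a single-level dict with
    dotted keys, e.g. {"author": {"handle": "x"}} -> {"author.handle": "x"}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name, sep=sep))
        else:
            flat[name] = value
    return flat


# Example: a nested record shaped loosely like the Codeforces API output.
nested = {
    "id": 1,
    "problem": {"contestId": 1999, "index": "A"},
    "author": {"members": "tourist"},
}
print(flatten(nested))
# -> {'id': 1, 'problem.contestId': 1999, 'problem.index': 'A', 'author.members': 'tourist'}
```

Dotted column names map naturally onto a flat ClickHouse table definition, which is why flattening sits between ingestion and the final load.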
This project is licensed under the Apache License 2.0.
For more details, see the architecture diagram above and explore the individual directories for configuration and code.
