
Distributed Data Synchronization Engine


A production-grade, custom-built Change Data Capture (CDC) system for multi-source database replication with intelligent sync strategies and conflict resolution.

Overview

This project implements a distributed data synchronization engine from scratch - not a wrapper around existing CDC tools, but a complete solution featuring custom database log readers, adaptive sync strategies, and intelligent conflict resolution. The system demonstrates end-to-end data engineering capabilities across batch, micro-batch, and streaming paradigms.

Author: Oguzhan Goktas
Email: oguzhangoktas22@gmail.com
GitHub: https://github.com/oguzhangoktas
LinkedIn: https://www.linkedin.com/in/oguzhan-goktas/

Key Features

Custom CDC Implementation

  • PostgreSQL WAL Parser: Native Write-Ahead Log reading with position tracking
  • MySQL Binlog Reader: Binary log parsing for real-time change capture
  • MongoDB Oplog Tailer: Operations log monitoring for document changes
  • Incremental Position Management: Crash-safe checkpoint persistence
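
As a minimal illustration of the WAL-reading idea, the sketch below uses psycopg2's logical replication support to stream decoded changes and persist the confirmed LSN as a crash-safe checkpoint. It assumes a pre-created replication slot (named cdc_slot here) with a logical decoding plugin such as wal2json; the checkpoint file name is hypothetical, and the repository's actual reader (src/cdc_readers/postgres_wal.py) may differ.

# Sketch only: stream decoded WAL changes and checkpoint the last flushed LSN.
# Assumes a logical replication slot "cdc_slot" created with the wal2json plugin.
import json
import psycopg2
import psycopg2.extras

CHECKPOINT_FILE = "postgres_wal.lsn"  # hypothetical checkpoint location

conn = psycopg2.connect(
    "dbname=app user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)

def handle_change(msg):
    change = json.loads(msg.payload)          # wal2json emits JSON change sets
    print(change)                             # hand off to Kafka/SQS in a real pipeline
    with open(CHECKPOINT_FILE, "w") as f:     # crash-safe position tracking
        f.write(str(msg.data_start))
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge to PostgreSQL

cur.consume_stream(handle_change)             # blocks, calling handle_change per message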

Intelligent Sync Strategies

  • Streaming Mode: Sub-second latency for high-priority tables via Kafka
  • Micro-batch Mode: 5-minute interval processing for medium-latency requirements
  • Batch Mode: Hourly/daily full refresh for historical tables
  • Adaptive Selection: Cost vs latency optimization with automatic strategy switching
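
A simplified illustration of how adaptive selection can weigh latency requirements against cost is sketched below. The thresholds and table-profile fields are assumptions for the example; the repository's strategy_selector.py may use different criteria.

# Sketch only: pick a sync strategy per table from hypothetical profile fields.
from dataclasses import dataclass

@dataclass
class TableProfile:
    name: str
    max_lag_seconds: int      # latency SLA for this table
    daily_change_volume: int  # observed changes per day

def select_strategy(profile: TableProfile) -> str:
    # High-priority tables with tight SLAs go to Kafka streaming.
    if profile.max_lag_seconds <= 60:
        return "streaming"
    # Moderate SLAs with meaningful change volume use 5-minute micro-batches via SQS.
    if profile.max_lag_seconds <= 3600 and profile.daily_change_volume > 10_000:
        return "micro_batch"
    # Everything else is refreshed by hourly/daily Airflow batch runs.
    return "batch"

print(select_strategy(TableProfile("orders", max_lag_seconds=5, daily_change_volume=2_000_000)))
# -> streaming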

Advanced Conflict Resolution

  • Last-write-wins with timestamp comparison
  • Custom business rules engine
  • Multi-version conflict detection
  • Configurable resolution policies per table
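
For example, last-write-wins can be expressed in PySpark by keeping only the newest version of each key. The column names (id, updated_at) below are assumptions; the repository's conflict_resolver.py may layer business rules and multi-version detection on top of this basic step.

# Sketch only: last-write-wins by keeping the most recent row per primary key.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lww-example").getOrCreate()

changes = spark.createDataFrame(
    [(1, "alice", "2024-01-01 10:00:00"), (1, "alicia", "2024-01-01 10:05:00")],
    ["id", "name", "updated_at"],
)

w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
resolved = (
    changes.withColumn("rn", F.row_number().over(w))  # rank versions per key, newest first
           .filter(F.col("rn") == 1)                  # keep only the latest write
           .drop("rn")
)
resolved.show()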

Data Quality Framework

  • Source-target reconciliation with row count and checksum validation
  • Schema drift detection and alerting
  • Great Expectations integration for data profiling
  • Automated quality metrics tracking
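
A minimal sketch of source-target reconciliation in PySpark is shown below, comparing row counts and an order-insensitive checksum built from per-row hashes. Column handling, tolerances, and alerting in the repository's quality_validator.py may differ.

# Sketch only: compare row counts and an aggregate checksum between two DataFrames.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def reconcile(source: DataFrame, target: DataFrame) -> dict:
    def checksum(df: DataFrame) -> int:
        # Hash every row, then sum the hashes so row ordering does not matter.
        row_hash = F.hash(*[F.col(c) for c in sorted(df.columns)])
        return df.select(F.sum(row_hash).alias("cs")).first()["cs"] or 0

    return {
        "counts_match": source.count() == target.count(),
        "checksums_match": checksum(source) == checksum(target),
    }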

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      DATA SOURCES                            │
├──────────────┬──────────────┬──────────────────────────────┤
│ PostgreSQL   │    MySQL     │         MongoDB              │
│ WAL Reader   │ Binlog Reader│      Oplog Tailer            │
└──────┬───────┴──────┬───────┴──────────┬───────────────────┘
       │              │                  │
       └──────────────┴──────────────────┘
                      │
       ┌──────────────▼──────────────────┐
       │   Smart Sync Coordinator         │
       │  (Strategy Selection Engine)     │
       └──┬───────────┬──────────────┬───┘
          │           │              │
    ┌─────▼─────┐ ┌──▼────┐  ┌─────▼──────┐
    │  Kafka    │ │  SQS  │  │  Airflow   │
    │ Streaming │ │ Micro │  │   Batch    │
    └─────┬─────┘ └───┬───┘  └─────┬──────┘
          │           │            │
          └───────────┴────────────┘
                      │
       ┌──────────────▼──────────────────┐
       │   PySpark Processing Layer       │
       │  - Conflict Resolution           │
       │  - Data Quality Validation       │
       │  - Schema Evolution              │
       └──────────────┬───────────────────┘
                      │
       ┌──────────────▼──────────────────┐
       │        Delta Lake                │
       │   Bronze → Silver → Gold         │
       │   (Medallion Architecture)       │
       └──────────────────────────────────┘

Technology Stack

Data Sources

  • PostgreSQL 15 (Transaction log reading)
  • MySQL 8.0 (Binary log parsing)
  • MongoDB 6.0 (Operation log tailing)

Core Processing

  • Python 3.11 (Log readers, coordinators, schedulers)
  • PySpark 3.5 (Stream processing, conflict resolution, data quality)

Message Queue & Streaming

  • Apache Kafka 3.6 (High-frequency streaming tables)
  • AWS SQS (Batch and micro-batch queues)

Storage & Data Lake

  • Delta Lake 3.0 (ACID transactions, time travel)
  • AWS S3 (Object storage for data lake)
  • Medallion Architecture (Bronze/Silver/Gold layers)

Orchestration

  • Apache Airflow 2.8 (Batch sync DAGs)
  • Custom Python Scheduler (Micro-batch coordination)

Monitoring & Observability

  • AWS CloudWatch (Metrics and logs)
  • Custom Python Metrics Collector
  • Prometheus & Grafana (Optional)

Infrastructure & DevOps

  • Docker & Docker Compose (Local development)
  • Terraform (Infrastructure as Code for AWS resources)
  • GitHub Actions (CI/CD pipeline)
  • AWS Services (S3, Glue, Lambda, SQS)

Data Quality

  • Great Expectations (Validation framework)
  • Custom reconciliation logic

Performance Metrics

  • Streaming Tables: <1 second replication lag
  • Micro-batch Tables: 5-minute sync intervals
  • Batch Tables: Hourly/daily full refresh
  • Data Quality: 99.9% accuracy target
  • Throughput: 10,000 events/second
  • Daily Volume: 50M+ records processed

Project Structure

distributed-data-sync-engine/
├── src/
│   ├── cdc_readers/           # Custom database log readers
│   │   ├── postgres_wal.py
│   │   ├── mysql_binlog.py
│   │   └── mongodb_oplog.py
│   ├── coordinator/           # Sync strategy coordination
│   │   ├── strategy_selector.py
│   │   └── config_manager.py
│   ├── processors/            # PySpark processing jobs
│   │   ├── conflict_resolver.py
│   │   └── quality_validator.py
│   ├── orchestration/         # Airflow DAGs
│   │   └── dags/
│   └── monitoring/            # Metrics and observability
│       └── metrics_collector.py
├── config/                    # Configuration files
│   ├── sync_config.yaml
│   ├── table_mappings.yaml
│   └── quality_rules.yaml
├── tests/                     # Unit and integration tests
├── terraform/                 # Infrastructure as Code
├── docker/                    # Docker configurations
├── scripts/                   # Utility scripts
└── docs/                      # Detailed documentation

Quick Start

Prerequisites

  • Docker & Docker Compose
  • Python 3.11+
  • AWS Account (for cloud deployment)
  • Terraform 1.5+

Local Development Setup

  1. Clone the repository
     git clone https://github.com/oguzhangoktas/distributed-data-sync-engine.git
     cd distributed-data-sync-engine
  2. Start local environment
     docker-compose up -d
  3. Initialize databases with sample data
     python scripts/init_sample_data.py
  4. Run CDC readers
     python src/cdc_readers/postgres_wal.py
     python src/cdc_readers/mysql_binlog.py
     python src/cdc_readers/mongodb_oplog.py
  5. Access monitoring

For detailed setup instructions, see SETUP.md

Documentation

Detailed documentation is available in the docs/ directory.

Use Cases

This system is designed for:

  • Multi-database consolidation in data warehouses
  • Real-time data replication for analytics
  • Cross-region database synchronization
  • Legacy system modernization with zero downtime
  • Compliance and audit trail maintenance

Why This Project Matters

This project demonstrates:

  1. System Design Skills: End-to-end architecture from source to target
  2. Custom Implementation: Built from scratch, not configuration of existing tools
  3. Hybrid Processing: Batch + Streaming + Micro-batch in one system
  4. Production Quality: Monitoring, testing, CI/CD, documentation
  5. Cloud Native: AWS integration with Terraform IaC
  6. Data Engineering Best Practices: Medallion architecture, Delta Lake, quality frameworks

License

MIT License - See LICENSE file for details

Contact

For questions or collaboration opportunities, reach out via the email or LinkedIn listed above.