A production-grade, custom-built Change Data Capture (CDC) system for multi-source database replication with intelligent sync strategies and conflict resolution.
This project implements a distributed data synchronization engine from scratch - not a wrapper around existing CDC tools, but a complete solution featuring custom database log readers, adaptive sync strategies, and intelligent conflict resolution. The system demonstrates end-to-end data engineering capabilities across batch, micro-batch, and streaming paradigms.
Author: Oguzhan Goktas
Email: oguzhangoktas22@gmail.com
GitHub: https://github.com/oguzhangoktas
LinkedIn: https://www.linkedin.com/in/oguzhan-goktas/
- PostgreSQL WAL Parser: Native Write-Ahead Log reading with position tracking
- MySQL Binlog Reader: Binary log parsing for real-time change capture
- MongoDB Oplog Tailer: Operations log monitoring for document changes
- Incremental Position Management: Crash-safe checkpoint persistence
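The WAL reader follows the standard logical-decoding pattern: stream decoded changes from a replication slot, hand each change downstream, then persist the log position so a crash resumes from the last acknowledged LSN. The sketch below is illustrative only and is not the repo's postgres_wal.py; it assumes psycopg2 with a pre-created wal2json slot named cdc_slot and uses a local file as a stand-in checkpoint store.

```python
# Illustrative sketch, not the repo's postgres_wal.py: tail a logical replication
# slot with psycopg2 and checkpoint the LSN after each handled change.
# Assumptions: a slot named "cdc_slot" created with the wal2json plugin,
# and a local file standing in for the real position store.
import json
import psycopg2
import psycopg2.extras

CHECKPOINT_FILE = "postgres_wal.lsn"  # hypothetical crash-safe position store


def handle_change(msg):
    """Hand the decoded change downstream, then acknowledge its LSN."""
    change = json.loads(msg.payload)        # wal2json emits JSON change sets
    print(change)                           # replace with a Kafka/SQS producer
    with open(CHECKPOINT_FILE, "w") as f:   # checkpoint only after a successful handoff
        f.write(str(msg.data_start))
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


conn = psycopg2.connect(
    "dbname=app user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)
cur.consume_stream(handle_change)           # blocks, invoking the callback per change
```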
- Streaming Mode: Sub-second latency for high-priority tables via Kafka
- Micro-batch Mode: 5-minute interval processing for medium-latency requirements
- Batch Mode: Hourly/daily full refresh for historical tables
- Adaptive Selection: Cost vs latency optimization with automatic strategy switching
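As a rough illustration of how adaptive selection can trade cost against latency, the snippet below routes each table to a strategy based on its observed change rate and configured SLA. The thresholds, field names, and TableProfile type are assumptions, not the actual strategy_selector.py logic.

```python
# Hypothetical illustration of adaptive strategy selection, not the actual
# strategy_selector.py: route a table to streaming, micro-batch, or batch
# based on its latency SLA and observed change rate. Thresholds are made up.
from dataclasses import dataclass


@dataclass
class TableProfile:
    name: str
    changes_per_minute: float  # observed from CDC reader metrics
    latency_sla_seconds: int   # configured per table


def select_strategy(profile: TableProfile) -> str:
    if profile.latency_sla_seconds <= 5:
        return "streaming"      # Kafka path: lowest latency, highest standing cost
    if profile.latency_sla_seconds <= 600 or profile.changes_per_minute > 100:
        return "micro-batch"    # SQS path: drained on a 5-minute interval
    return "batch"              # Airflow path: hourly/daily full refresh


print(select_strategy(TableProfile("orders", 2500, 1)))         # -> streaming
print(select_strategy(TableProfile("inventory", 300, 300)))     # -> micro-batch
print(select_strategy(TableProfile("dim_region", 0.2, 86400)))  # -> batch
```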
- Last-write-wins with timestamp comparison
- Custom business rules engine
- Multi-version conflict detection
- Configurable resolution policies per table
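A minimal last-write-wins pass in PySpark looks like the sketch below: rank the versions of each key by commit timestamp and keep the newest. Column names (id, updated_at) are assumptions; the real conflict_resolver.py layers the business-rules engine and per-table policies on top of this baseline.

```python
# Minimal last-write-wins sketch in PySpark; column names are assumptions.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lww-demo").getOrCreate()

changes = spark.createDataFrame(
    [(1, "alice@old.com", "2024-01-01 10:00:00", "mysql"),
     (1, "alice@new.com", "2024-01-01 10:05:00", "postgres"),
     (2, "bob@site.com",  "2024-01-01 09:00:00", "mysql")],
    ["id", "email", "updated_at", "source_db"],
)

# Rank every version of a key by timestamp and keep only the newest one.
latest = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
resolved = (changes
            .withColumn("rn", F.row_number().over(latest))
            .filter(F.col("rn") == 1)
            .drop("rn"))
resolved.show()  # one winning row per id
```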
- Source-target reconciliation with row count and checksum validation
- Schema drift detection and alerting
- Great Expectations integration for data profiling
- Automated quality metrics tracking
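Reconciliation boils down to comparing a cheap fingerprint (row count plus an order-independent checksum) between source and target. The sketch below shows one way to do that in PySpark, assuming a Postgres JDBC source and a Delta target; the paths, credentials, and the xxhash64/bit_xor fingerprint are illustrative choices, not the repo's exact logic, and the Postgres JDBC driver and Delta libraries are assumed to be on the classpath.

```python
# Illustrative reconciliation check: compare row counts and an order-independent
# checksum between a Postgres source table and its Delta target.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reconcile-demo").getOrCreate()


def fingerprint(df, cols):
    """Row count plus an xor of per-row hashes: identical data, identical result."""
    hashed = df.select(F.xxhash64(*cols).alias("h"))
    row = hashed.agg(F.count("*").alias("rows"),
                     F.expr("bit_xor(h)").alias("checksum")).first()
    return row["rows"], row["checksum"]


source = (spark.read.format("jdbc")
          .options(url="jdbc:postgresql://localhost:5432/app",
                   dbtable="public.orders", user="replicator", password="***")
          .load())
target = spark.read.format("delta").load("s3://lake/silver/orders")

if fingerprint(source, source.columns) != fingerprint(target, source.columns):
    print("orders: source and target diverged - alert and schedule a backfill")
```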
┌───────────────────────────────────────────────────────────┐
│                       DATA SOURCES                        │
├───────────────────┬───────────────────┬───────────────────┤
│    PostgreSQL     │       MySQL       │      MongoDB      │
│    WAL Reader     │   Binlog Reader   │   Oplog Tailer    │
└─────────┬─────────┴─────────┬─────────┴─────────┬─────────┘
          │                   │                   │
          └───────────────────┼───────────────────┘
                              │
             ┌────────────────▼────────────────┐
             │     Smart Sync Coordinator      │
             │   (Strategy Selection Engine)   │
             └─┬──────────────┬──────────────┬─┘
               │              │              │
         ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
         │   Kafka   │  │    SQS    │  │  Airflow  │
         │ Streaming │  │   Micro   │  │   Batch   │
         └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
               │              │              │
               └──────────────┼──────────────┘
                              │
             ┌────────────────▼────────────────┐
             │    PySpark Processing Layer     │
             │  - Conflict Resolution          │
             │  - Data Quality Validation      │
             │  - Schema Evolution             │
             └────────────────┬────────────────┘
                              │
             ┌────────────────▼────────────────┐
             │           Delta Lake            │
             │     Bronze → Silver → Gold      │
             │    (Medallion Architecture)     │
             └─────────────────────────────────┘
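To make the last two boxes of the diagram concrete, the sketch below shows a typical Bronze-to-Silver step with the Delta Lake Python API: deduplicate CDC events to the newest change per key, then MERGE them into the Silver table. The paths, column names (id, updated_at, op), and Spark session configuration are assumptions for illustration, not the repo's actual job.

```python
# Hedged sketch of the Bronze -> Silver step: deduplicate CDC events to the
# newest change per key, then MERGE into the Silver Delta table.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = (SparkSession.builder.appName("bronze-to-silver")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

bronze_path = "s3://lake/bronze/orders"   # raw CDC events, append-only
silver_path = "s3://lake/silver/orders"   # deduplicated current state

# Keep only the newest change per primary key from the latest Bronze batch.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
updates = (spark.read.format("delta").load(bronze_path)
           .withColumn("rn", F.row_number().over(w))
           .filter("rn = 1").drop("rn"))

(DeltaTable.forPath(spark, silver_path).alias("t")
 .merge(updates.alias("s"), "t.id = s.id")
 .whenMatchedDelete(condition="s.op = 'DELETE'")
 .whenMatchedUpdateAll(condition="s.op <> 'DELETE'")
 .whenNotMatchedInsertAll(condition="s.op <> 'DELETE'")
 .execute())
```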
- PostgreSQL 15 (Transaction log reading)
- MySQL 8.0 (Binary log parsing)
- MongoDB 6.0 (Operation log tailing)
- Python 3.11 (Log readers, coordinators, schedulers)
- PySpark 3.5 (Stream processing, conflict resolution, data quality)
- Apache Kafka 3.6 (High-frequency streaming tables)
- AWS SQS (Batch and micro-batch queues)
- Delta Lake 3.0 (ACID transactions, time travel)
- AWS S3 (Object storage for data lake)
- Medallion Architecture (Bronze/Silver/Gold layers)
- Apache Airflow 2.8 (Batch sync DAGs)
- Custom Python Scheduler (Micro-batch coordination)
- AWS CloudWatch (Metrics and logs)
- Custom Python Metrics Collector
- Prometheus & Grafana (Optional)
- Docker & Docker Compose (Local development)
- Terraform (Infrastructure as Code for AWS resources)
- GitHub Actions (CI/CD pipeline)
- AWS Services (S3, Glue, Lambda, SQS)
- Great Expectations (Validation framework)
- Custom reconciliation logic
- Streaming Tables: <1 second replication lag
- Micro-batch Tables: 5-minute sync intervals
- Batch Tables: Hourly/daily full refresh
- Data Quality: 99.9% accuracy target
- Throughput: 10,000 events/second
- Daily Volume: 50M+ records processed
distributed-data-sync-engine/
├── src/
│   ├── cdc_readers/              # Custom database log readers
│   │   ├── postgres_wal.py
│   │   ├── mysql_binlog.py
│   │   └── mongodb_oplog.py
│   ├── coordinator/              # Sync strategy coordination
│   │   ├── strategy_selector.py
│   │   └── config_manager.py
│   ├── processors/               # PySpark processing jobs
│   │   ├── conflict_resolver.py
│   │   └── quality_validator.py
│   ├── orchestration/            # Airflow DAGs
│   │   └── dags/
│   └── monitoring/               # Metrics and observability
│       └── metrics_collector.py
├── config/                       # Configuration files
│   ├── sync_config.yaml
│   ├── table_mappings.yaml
│   └── quality_rules.yaml
├── tests/                        # Unit and integration tests
├── terraform/                    # Infrastructure as Code
├── docker/                       # Docker configurations
├── scripts/                      # Utility scripts
└── docs/                         # Detailed documentation
- Docker & Docker Compose
- Python 3.11+
- AWS Account (for cloud deployment)
- Terraform 1.5+
- Clone the repository
  git clone https://github.com/oguzhangoktas/distributed-data-sync-engine.git
  cd distributed-data-sync-engine
- Start local environment
  docker-compose up -d
- Initialize databases with sample data
  python scripts/init_sample_data.py
- Run CDC readers
  python src/cdc_readers/postgres_wal.py
  python src/cdc_readers/mysql_binlog.py
  python src/cdc_readers/mongodb_oplog.py
- Access monitoring
  - Airflow UI: http://localhost:8080
  - Kafka UI: http://localhost:9000
For detailed setup instructions, see SETUP.md
- ARCHITECTURE.md - System design and technical decisions
- DATA_MODEL.md - Schema design and medallion layers
- SETUP.md - Local and AWS deployment guide
This system is designed for:
- Multi-database consolidation in data warehouses
- Real-time data replication for analytics
- Cross-region database synchronization
- Legacy system modernization with zero downtime
- Compliance and audit trail maintenance
This project demonstrates:
- System Design Skills: End-to-end architecture from source to target
- Custom Implementation: Built from scratch rather than configured from existing CDC tools
- Hybrid Processing: Batch + Streaming + Micro-batch in one system
- Production Quality: Monitoring, testing, CI/CD, documentation
- Cloud Native: AWS integration with Terraform IaC
- Data Engineering Best Practices: Medallion architecture, Delta Lake, quality frameworks
MIT License - See LICENSE file for details
For questions or collaboration opportunities:
- Email: oguzhangoktas22@gmail.com
- LinkedIn: https://www.linkedin.com/in/oguzhan-goktas/
- GitHub: https://github.com/oguzhangoktas