I’ve started building a personal data platform that collects and processes data from my YouTube channel, LinkedIn, GitHub, and other platforms to better understand engagement, growth, and audience behavior across channels. It’s also a great opportunity to apply modern data engineering tools and practices such as Apache Airflow, Docker, API integrations, functional and data quality testing, and CI/CD automation.
This ELT (or, I'd rather say, EtLT) pipeline is orchestrated with Airflow, containerized with Docker, and stores data in PostgreSQL. The process includes:
- Retrieve video metadata via the YouTube API.
- Store the raw data in a staging schema inside a dockerized PostgreSQL instance.
- Transform and load the data into reporting tables.
- Ensure data quality by applying unit tests and data quality checks.
- Run tests and build Docker images using GitHub Actions CI/CD workflows.
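To make the extract step concrete, here's a minimal sketch of pulling video metadata from the YouTube Data API v3 and dumping it to JSON. The environment variable names, file path, and requested fields are my assumptions for illustration, not necessarily what the project uses:

```python
import json
import os
import urllib.parse
import urllib.request

API_KEY = os.environ.get("YT_API_KEY", "")        # assumed env var name
CHANNEL_ID = os.environ.get("YT_CHANNEL_ID", "")  # assumed env var name

def fetch_video_metadata(max_results: int = 25) -> list:
    """Extract: pull recent video metadata from the YouTube Data API v3."""
    params = urllib.parse.urlencode({
        "key": API_KEY,
        "channelId": CHANNEL_ID,
        "part": "snippet",
        "order": "date",
        "maxResults": max_results,
        "type": "video",
    })
    url = f"https://www.googleapis.com/youtube/v3/search?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["items"]

def save_raw_json(items: list, path: str = "/tmp/youtube_videos.json") -> None:
    """Persist the raw payload as-is so the load step stays decoupled."""
    with open(path, "w") as f:
        json.dump(items, f)
```

Writing the raw response to disk unchanged is what puts the small "t" in EtLT: the heavy transformation happens later, inside the database.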
Three Airflow DAGs are defined and triggered sequentially:
- produce_json — Extracts YouTube data and saves it as a JSON file.
- update_db — Loads and processes the data into staging and core schemas.
- data_quality — Runs Soda checks to validate data quality.
Testing and automation round out the project:
- Unit and integration tests ensure the pipelines behave as expected.
- Data quality is monitored automatically with Soda.
- A GitHub Actions workflow builds and pushes Docker images, starts the Airflow services, and tests DAG execution.
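As an example of the kind of unit test involved, here's a hypothetical pytest-style check on a transform helper that flattens a raw API item into a staging-table row. Both the function and the field names are assumptions for illustration, not the project's actual code:

```python
def flatten_video_item(item: dict) -> dict:
    """Transform helper: pick out the fields the staging table needs."""
    snippet = item["snippet"]
    return {
        "video_id": item["id"]["videoId"],
        "title": snippet["title"],
        "published_at": snippet["publishedAt"],
    }

def test_flatten_video_item():
    raw = {
        "id": {"videoId": "abc123"},
        "snippet": {"title": "My video", "publishedAt": "2024-01-01T00:00:00Z"},
    }
    assert flatten_video_item(raw) == {
        "video_id": "abc123",
        "title": "My video",
        "published_at": "2024-01-01T00:00:00Z",
    }
```

Keeping transforms in small pure functions like this is what makes them unit-testable without spinning up Airflow or PostgreSQL; the integration tests and Soda checks cover the rest.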
Along the way, a few Airflow topics deserved a closer look:
- Storing Airflow Variables in environment variables
- Timezones
- Declaring an Airflow DAG
- Crontab schedules
- Airflow's start_date
This project is inspired by the Data Engineering ELT Pipeline course by @MattTheDataEngineer — a great resource for mastering Airflow and Docker development. 100% recommended!