Production-ready Airflow data pipeline with modular, reusable components.
```
airflow_project/
├── dags/                           # DAG definitions
│   ├── ecommerce_pipeline.py       # Main production pipeline
│   └── simple_pipeline_example.py  # Examples
│
├── config/                         # Configuration files
│   ├── sources.py                  # Data source configs
│   ├── reports.py                  # Report definitions
│   └── settings.py                 # General settings
│
├── tasks/                          # Business logic
│   ├── extraction/
│   │   └── api_extractor.py        # Extraction logic
│   └── transformation/
│       └── cleaner.py              # Cleaning logic
│
├── task_groups/                    # Reusable task groups
│   ├── extraction_group.py
│   ├── transformation_group.py
│   └── reporting_group.py
│
└── utils/                          # Utilities
    └── notifications.py            # Slack/email alerts
```
Edit config/sources.py:

```python
DATA_SOURCES = {
    'your_source': {
        'name': 'Your Source',
        'type': 'rest_api',
        'connection_id': 'your_conn',
        'api_config': {...},
        'tables': ['table1', 'table2'],
        'output_path': '/path/to/output',
    }
}
```

That's it! The extraction task group will automatically create tasks for your new source.
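To make the "automatic task creation" concrete, here is a minimal, hypothetical sketch of how a config-driven extraction group can fan out one task per (source, table) pair. The `DATA_SOURCES` entries and the `build_extraction_tasks` helper are illustrative, not the project's actual implementation:

```python
# Hypothetical sketch: fan out one task id per (source, table) pair,
# driven purely by configuration. Entries below are illustrative.
DATA_SOURCES = {
    'shopify': {'name': 'Shopify', 'type': 'rest_api', 'tables': ['orders', 'customers']},
    'stripe': {'name': 'Stripe', 'type': 'rest_api', 'tables': ['charges']},
}

def build_extraction_tasks(source_keys):
    """Return one task id per (source, table) pair, with no DAG code changes."""
    task_ids = []
    for key in source_keys:
        for table in DATA_SOURCES[key]['tables']:
            task_ids.append(f"extract__{key}__{table}")
    return task_ids

print(build_extraction_tasks(['shopify', 'stripe']))
```

Adding a new entry to `DATA_SOURCES` yields new extraction tasks without touching any DAG code.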
Edit config/reports.py:

```python
REPORTS.append({
    'name': 'your_report',
    'required_sources': ['shopify', 'stripe'],
    'output': {
        'table': 'analytics.your_report',
        'format': 'parquet',
    }
})
```

That's it! The reporting task group will automatically create a task for your report.
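A plausible sketch of how the reporting group might use `required_sources`: only reports whose inputs were all extracted get a task. The `REPORTS` entries and `runnable_reports` helper here are hypothetical illustrations:

```python
# Hypothetical sketch: filter REPORTS down to those whose required
# sources are all available. Report names below are illustrative.
REPORTS = [
    {'name': 'daily_revenue', 'required_sources': ['shopify', 'stripe']},
    {'name': 'ad_performance', 'required_sources': ['facebook_ads']},
]

def runnable_reports(extracted_sources):
    """Keep only reports whose required sources were all extracted."""
    available = set(extracted_sources)
    return [r['name'] for r in REPORTS
            if available.issuperset(r['required_sources'])]

print(runnable_reports(['shopify', 'stripe']))
```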
Create dags/your_pipeline.py:

```python
from airflow import DAG

from task_groups.extraction_group import create_extraction_group
from task_groups.reporting_group import create_reporting_group

with DAG('your_pipeline', ...) as dag:
    extractions = create_extraction_group(['source1', 'source2'], date)
    reports = create_reporting_group(extractions)
    extractions >> reports
```

That's it! All the heavy lifting is done by reusable components.
- Separate configuration from code
- Reusable task groups
- Easy to test and maintain
- All sources extract in parallel
- All reports generate in parallel
- No manual task dependency management
- Add sources without touching DAG code
- Add reports without touching DAG code
- Change schedules in one place
- Built-in error handling
- Retry logic with exponential backoff
- Data quality checks
- Slack/email notifications
- Comprehensive logging
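The built-in retry logic with exponential backoff can be configured once at the DAG level. A minimal sketch, using Airflow's standard `BaseOperator` retry arguments (the specific values here are illustrative defaults, not this project's settings):

```python
from datetime import timedelta

# Sketch of DAG-level default_args enabling retry with exponential backoff;
# the keys match Airflow BaseOperator arguments, values are illustrative.
DEFAULT_ARGS = {
    'retries': 3,                              # retry each failed task up to 3 times
    'retry_delay': timedelta(minutes=1),       # wait before the first retry
    'retry_exponential_backoff': True,         # grow the wait on each retry
    'max_retry_delay': timedelta(minutes=30),  # cap the backoff
}

# Passed as DAG(..., default_args=DEFAULT_ARGS) so every task inherits them.
print(DEFAULT_ARGS['retries'])
```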
```
ecommerce_data_pipeline
├── Extract (4 sources in parallel)
│   ├── Shopify
│   ├── Stripe
│   ├── Google Analytics
│   └── Facebook Ads
├── Clean (4 cleanings in parallel)
├── Generate Reports (4 reports in parallel)
└── Notify
```

```
payment_processing_pipeline
├── Extract (2 sources in parallel)
│   ├── Shopify
│   └── Stripe
├── Clean
├── Generate Reports (priority 1 only)
└── Notify
```
Create .env file:

```
ENVIRONMENT=production
LAKEHOUSE_PATH=/data/lakehouse
WAREHOUSE_TYPE=snowflake
WAREHOUSE_DB=analytics
SLACK_WEBHOOK_URL=https://hooks.slack.com/...
LOG_LEVEL=INFO
```

Create these connections in Airflow UI:

- shopify_conn - Shopify API credentials
- stripe_conn - Stripe API credentials
- google_analytics_conn - GA credentials
- facebook_ads_conn - Facebook credentials
- warehouse_conn - Data warehouse credentials
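A minimal sketch of how config/settings.py could pick up these .env values, assuming plain `os.getenv` with fallback defaults (the key names mirror the .env file above; the defaults are illustrative):

```python
import os

# Hypothetical sketch of env-driven settings; variable names mirror the
# .env file, defaults are illustrative fallbacks for local development.
SETTINGS = {
    'environment': os.getenv('ENVIRONMENT', 'development'),
    'lakehouse_path': os.getenv('LAKEHOUSE_PATH', '/tmp/lakehouse'),
    'warehouse_type': os.getenv('WAREHOUSE_TYPE', 'snowflake'),
    'warehouse_db': os.getenv('WAREHOUSE_DB', 'analytics'),
    'slack_webhook_url': os.getenv('SLACK_WEBHOOK_URL', ''),
    'log_level': os.getenv('LOG_LEVEL', 'INFO'),
}

print(SETTINGS['log_level'])
```

Keeping every environment-specific value behind `os.getenv` is what lets the same DAG code run unchanged in development and production.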
```bash
# Test extraction logic
pytest tests/test_extractors.py

# Test transformation logic
pytest tests/test_transformers.py

# Test DAG structure
pytest tests/test_dags.py
```

Create a custom task group:

```python
from airflow.decorators import task, task_group

@task_group(group_id='my_custom_group')
def my_custom_processing(data):
    @task
    def process_data(item):
        processed_item = item  # Your logic here
        return processed_item
    results = [process_data(item) for item in data]
    return results
```

Inherit from APIExtractor:
```python
from tasks.extraction.api_extractor import APIExtractor

class MyCustomExtractor(APIExtractor):
    def extract_table(self, table_name: str):
        # Custom extraction logic
        return super().extract_table(table_name)
```

- Always use configuration files - Don't hardcode values in DAGs
- Use task groups for related tasks - Better organization in UI
- Handle errors gracefully - Use try/except and retry logic
- Add data quality checks - Catch issues early
- Monitor with notifications - Set up Slack/email alerts
- Test before deploying - Unit test your logic
- Document changes - Keep README updated
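To illustrate the "add data quality checks" practice, here is a minimal, hypothetical check that fails fast on empty extracts or missing columns; the `check_quality` helper is an illustration, not part of the project's code:

```python
# Hypothetical sketch of a minimal data quality check: fail fast when the
# extracted rows are empty or the first row lacks required columns.
def check_quality(rows, required_columns):
    """Raise ValueError on empty data or missing columns; return row count."""
    if not rows:
        raise ValueError("quality check failed: no rows extracted")
    missing = set(required_columns) - set(rows[0])
    if missing:
        raise ValueError(f"quality check failed: missing columns {sorted(missing)}")
    return len(rows)

# Example: two well-formed order records pass the check.
orders = [{'id': 1, 'total': 9.99}, {'id': 2, 'total': 4.50}]
print(check_quality(orders, ['id', 'total']))  # -> 2
```

Raising inside a task makes Airflow mark it failed, which stops downstream reports from running on bad data and triggers the retry/notification machinery.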
- Check parallel task limits in config/settings.py
- Increase PARALLEL_TASKS values
- Verify Airflow connection is configured
- Check API credentials and rate limits
- Review logs in Airflow UI
- Verify source data was extracted successfully
- Check cleaning tasks passed quality checks
- Review required sources in config/reports.py
- Add your feature
- Update configuration files
- Add tests
- Update this README
- Create pull request
Made with ❤️ by the Data Engineering Team