
E-commerce Data Pipeline

Production-ready Airflow data pipeline with modular, reusable components.

πŸ—οΈ Project Structure

airflow_project/
β”œβ”€β”€ dags/                          # DAG definitions
β”‚   β”œβ”€β”€ ecommerce_pipeline.py     # Main production pipeline
β”‚   └── simple_pipeline_example.py # Examples
β”‚
β”œβ”€β”€ config/                        # Configuration files
β”‚   β”œβ”€β”€ sources.py                # Data source configs
β”‚   β”œβ”€β”€ reports.py                # Report definitions
β”‚   └── settings.py               # General settings
β”‚
β”œβ”€β”€ tasks/                         # Business logic
β”‚   β”œβ”€β”€ extraction/
β”‚   β”‚   └── api_extractor.py     # Extraction logic
β”‚   └── transformation/
β”‚       └── cleaner.py            # Cleaning logic
β”‚
β”œβ”€β”€ task_groups/                   # Reusable task groups
β”‚   β”œβ”€β”€ extraction_group.py
β”‚   β”œβ”€β”€ transformation_group.py
β”‚   └── reporting_group.py
β”‚
└── utils/                         # Utilities
    └── notifications.py          # Slack/email alerts

πŸš€ Quick Start

1. Add a New Data Source

Edit config/sources.py:

DATA_SOURCES = {
    'your_source': {
        'name': 'Your Source',
        'type': 'rest_api',
        'connection_id': 'your_conn',
        'api_config': {...},
        'tables': ['table1', 'table2'],
        'output_path': '/path/to/output',
    }
}

That's it! The extraction task group will automatically create tasks for your new source.
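The pattern behind this is a loop over the config. As a plain-Python sketch (the function and source entries here are illustrative, not this repo's actual API):

```python
# Illustrative sketch of the configuration-driven pattern: the
# extraction group loops over DATA_SOURCES, so adding a config entry
# is enough to get a new task.
DATA_SOURCES = {
    'shopify': {'name': 'Shopify', 'tables': ['orders', 'customers']},
    'stripe':  {'name': 'Stripe',  'tables': ['charges']},
}

def build_extraction_tasks(source_keys):
    """Return one (task_id, source_config) pair per requested source."""
    return [(f'extract_{key}', DATA_SOURCES[key]) for key in source_keys]

tasks = build_extraction_tasks(['shopify', 'stripe'])
```

In the real task group the pairs become Airflow tasks, but the fan-out logic is the same dictionary walk.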

2. Add a New Report

Edit config/reports.py:

REPORTS.append({
    'name': 'your_report',
    'required_sources': ['shopify', 'stripe'],
    'output': {
        'table': 'analytics.your_report',
        'format': 'parquet',
    }
})

That's it! The reporting task group will automatically create a task for your report.
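Because each report declares its `required_sources`, the reporting group can also decide which reports are runnable for a given pipeline. A plain-Python sketch of that selection logic (the helper name is illustrative):

```python
# Sketch of the reporting group's selection step: a report runs only
# if every source it requires was extracted in this pipeline.
REPORTS = [
    {'name': 'daily_sales', 'required_sources': ['shopify', 'stripe']},
    {'name': 'ad_spend', 'required_sources': ['facebook_ads']},
]

def runnable_reports(extracted_sources):
    """Return names of reports whose required sources are all available."""
    available = set(extracted_sources)
    return [r['name'] for r in REPORTS
            if set(r['required_sources']) <= available]
```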

3. Create a New Pipeline

Create dags/your_pipeline.py:

from airflow import DAG

from task_groups.extraction_group import create_extraction_group
from task_groups.reporting_group import create_reporting_group

with DAG('your_pipeline', ...) as dag:
    extractions = create_extraction_group(['source1', 'source2'], date)
    reports = create_reporting_group(extractions)

    extractions >> reports

That's it! All the heavy lifting is done by reusable components.

🎯 Key Features

βœ… Modular Architecture

  • Separate configuration from code
  • Reusable task groups
  • Easy to test and maintain

βœ… Automatic Parallelization

  • All sources extract in parallel
  • All reports generate in parallel
  • No manual task dependency management

βœ… Configuration-Driven

  • Add sources without touching DAG code
  • Add reports without touching DAG code
  • Change schedules in one place

βœ… Production-Ready

  • Built-in error handling
  • Retry logic with exponential backoff
  • Data quality checks
  • Slack/email notifications
  • Comprehensive logging

πŸ“Š Example Pipelines

Main Production Pipeline

ecommerce_data_pipeline
β”œβ”€β”€ Extract (4 sources in parallel)
β”‚   β”œβ”€β”€ Shopify
β”‚   β”œβ”€β”€ Stripe
β”‚   β”œβ”€β”€ Google Analytics
β”‚   └── Facebook Ads
β”œβ”€β”€ Clean (4 cleaning tasks in parallel)
β”œβ”€β”€ Generate Reports (4 reports in parallel)
└── Notify

Payment Processing Pipeline

payment_processing_pipeline
β”œβ”€β”€ Extract (2 sources in parallel)
β”‚   β”œβ”€β”€ Shopify
β”‚   └── Stripe
β”œβ”€β”€ Clean
β”œβ”€β”€ Generate Reports (priority 1 only)
└── Notify

πŸ”§ Configuration

Environment Variables

Create .env file:

ENVIRONMENT=production
LAKEHOUSE_PATH=/data/lakehouse
WAREHOUSE_TYPE=snowflake
WAREHOUSE_DB=analytics
SLACK_WEBHOOK_URL=https://hooks.slack.com/...
LOG_LEVEL=INFO
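A sketch of how `config/settings.py` can pick these variables up (the variable names match the `.env` above; the defaults are illustrative):

```python
import os

# Read pipeline settings from the environment, with safe fallbacks
# for local development. Names mirror the .env file above.
ENVIRONMENT = os.getenv('ENVIRONMENT', 'development')
LAKEHOUSE_PATH = os.getenv('LAKEHOUSE_PATH', '/data/lakehouse')
WAREHOUSE_TYPE = os.getenv('WAREHOUSE_TYPE', 'snowflake')
WAREHOUSE_DB = os.getenv('WAREHOUSE_DB', 'analytics')
SLACK_WEBHOOK_URL = os.getenv('SLACK_WEBHOOK_URL', '')
LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')
```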

Airflow Connections

Create these connections in Airflow UI:

  • shopify_conn - Shopify API credentials
  • stripe_conn - Stripe API credentials
  • google_analytics_conn - GA credentials
  • facebook_ads_conn - Facebook credentials
  • warehouse_conn - Data warehouse credentials

πŸ§ͺ Testing

# Test extraction logic
pytest tests/test_extractors.py

# Test transformation logic
pytest tests/test_transformers.py

# Test DAG structure
pytest tests/test_dags.py
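Because the business logic lives in plain functions under `tasks/`, it can be unit-tested without a running Airflow instance. A sketch of such a test, where `clean_records` is a hypothetical stand-in for the logic in `tasks/transformation/cleaner.py`:

```python
# clean_records is illustrative: drop rows without an id and strip
# whitespace from names. Replace with the real cleaner's behavior.
def clean_records(records):
    return [
        {**r, 'name': r['name'].strip()}
        for r in records
        if r.get('id') is not None
    ]

def test_clean_records_drops_missing_ids():
    raw = [{'id': 1, 'name': ' Ada '}, {'id': None, 'name': 'x'}]
    assert clean_records(raw) == [{'id': 1, 'name': 'Ada'}]
```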

πŸ“ Adding Custom Logic

Custom Task Group

from airflow.decorators import task_group, task

@task_group(group_id='my_custom_group')
def my_custom_processing(data):
    @task
    def process_data(item):
        # Your logic here
        return item  # replace with your transformed result

    results = [process_data(item) for item in data]
    return results

Custom Extractor

Inherit from APIExtractor:

from tasks.extraction.api_extractor import APIExtractor

class MyCustomExtractor(APIExtractor):
    def extract_table(self, table_name: str):
        # Custom extraction logic
        return super().extract_table(table_name)

πŸŽ“ Best Practices

  1. Always use configuration files - Don't hardcode values in DAGs
  2. Use task groups for related tasks - Better organization in UI
  3. Handle errors gracefully - Use try/except and retry logic
  4. Add data quality checks - Catch issues early
  5. Monitor with notifications - Set up Slack/email alerts
  6. Test before deploying - Unit test your logic
  7. Document changes - Keep README updated
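Practice 4 can be as simple as a guard that fails the task before bad data propagates downstream. A minimal sketch (the function name and threshold are illustrative):

```python
# Minimal data quality check: fail fast when an extracted batch is
# suspiciously small, so Airflow's retry/alerting kicks in early.
def check_row_count(rows, minimum=1):
    """Raise if a batch is below the threshold; return the count."""
    if len(rows) < minimum:
        raise ValueError(f'quality check failed: {len(rows)} < {minimum}')
    return len(rows)
```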

πŸ†˜ Troubleshooting

Pipeline runs slowly

  • Check parallel task limits in config/settings.py
  • Increase PARALLEL_TASKS values

Source extraction fails

  • Verify Airflow connection is configured
  • Check API credentials and rate limits
  • Review logs in Airflow UI

Reports have no data

  • Verify source data was extracted successfully
  • Check cleaning tasks passed quality checks
  • Review required sources in config/reports.py


🀝 Contributing

  1. Add your feature
  2. Update configuration files
  3. Add tests
  4. Update this README
  5. Create pull request

Made with ❀️ by the Data Engineering Team
