Quick Start • Features • Architecture • Demo • Docs • Community
Agentic Data Engineering Platform is an open-source, production-ready ETL solution that combines the Medallion Architecture with AI-powered agents that autonomously profile, clean, and optimize your data, so you can focus on insights, not infrastructure.
- Three autonomous agents work 24/7: no more manual data cleaning!
- Built on modern tech that's 10x faster: process millions of rows in seconds!
- Industry-standard Medallion pattern: scale from prototype to production!
- Interactive Streamlit interface: from data to decisions in minutes!
```bash
# 60 seconds to your first pipeline!
git clone <your-repo> && cd agentic-data-engineer
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python scripts/generate_sample_data.py
python src/orchestration/prefect_flows.py
streamlit run dashboards/streamlit_medallion_app.py
```
**Prerequisites:**
- ✅ Python 3.10 or higher
- ✅ 4GB RAM (minimum)
- ✅ 1GB free disk space
- ✅ Love for clean data 🧹

<details>
<summary><b>Step 1: Clone & Setup Environment</b></summary>

```bash
# Clone the repository
git clone https://github.com/yourusername/agentic-data-engineer.git
cd agentic-data-engineer

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
</details>
<details>
<summary><b>Step 2: Initialize Project</b></summary>
```bash
# Run automated setup
python scripts/setup_initial.py

# Generate sample e-commerce data (1000 records with quality issues)
python scripts/generate_sample_data.py
```

✅ Output: Sample dataset with intentional issues for testing AI agents
</details>
**Step 3: Run Your First Pipeline**

```bash
# Execute the complete ETL pipeline
python src/orchestration/prefect_flows.py
```

🎯 Watch as the agents:
- ✅ Profile your data (discover issues)
- ✅ Score data quality (0-100)
- ✅ Auto-remediate problems (fix issues)
- ✅ Create Bronze → Silver → Gold layers
- ✅ Generate business aggregates
```
🚀 Starting Agentic ETL Pipeline
✅ Extracted 1,000 rows
🔍 Profiling dataset: Found 10 issues
📊 Quality Score: 92/100
🔧 Auto-remediation: 7 actions taken
✅ Pipeline completed successfully!
```
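The 0-100 quality score reported by the pipeline can be thought of as a weighted blend of individual quality dimensions. A toy illustration, with hypothetical weights (not the platform's actual scoring logic):

```python
# Toy quality score: weighted blend of quality dimensions.
# The dimension weights below are hypothetical, chosen for illustration only.
def quality_score(completeness: float, validity: float, consistency: float) -> int:
    """Combine 0-1 dimension scores into a single 0-100 score."""
    weights = {"completeness": 0.4, "validity": 0.35, "consistency": 0.25}
    blended = (
        weights["completeness"] * completeness
        + weights["validity"] * validity
        + weights["consistency"] * consistency
    )
    return round(blended * 100)

# Example: mostly complete and valid data with some inconsistencies.
print(quality_score(completeness=0.95, validity=0.98, consistency=0.88))  # → 94
```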
**Step 4: Launch Dashboard**

```bash
streamlit run dashboards/streamlit_medallion_app.py
```

🌐 Open: http://localhost:8501

Explore 7 Interactive Pages:
- 📊 Overview Dashboard
- 🥉 Bronze Layer Explorer
- 🥈 Silver Layer Analytics
- 🥇 Gold Layer Insights
- 📈 Quality Monitoring
- 🔗 Data Lineage
- ⚙️ Pipeline Performance
```python
# Traditional Approach: Manual, Error-Prone
import pandas as pd

df = pd.read_csv("data.csv")
df = df.dropna()           # Hope for the best?
df = df.drop_duplicates()  # Good enough?
# ... 50 more lines of cleaning code ...

# Agentic Approach: AI-Powered, Automatic
from src.agents.agentic_agents import DataProfilerAgent, RemediationAgent

profiler = DataProfilerAgent()
profile = profiler.profile_dataset(df, "my_data")
# 🔍 Discovers: 23 issues across 8 categories

remediation = RemediationAgent()
df_clean, actions = remediation.auto_remediate(df, profile['issues_detected'])
# 🔧 Fixed: Whitespace, duplicates, negatives, outliers, formats
# ✅ Result: 98% quality score (up from 73%)
```

```
┌──────────────────────────────────────────────────────────────┐
│                        DATA JOURNEY                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  📥 Raw Sources (CSV, JSON, Parquet, APIs)                   │
│        ▼                                                     │
│  🥉 BRONZE LAYER                                             │
│     • Immutable raw data                                     │
│     • Full audit trail                                       │
│     • No transformations                                     │
│        ▼                                                     │
│  🥈 SILVER LAYER                                             │
│     • Deduplicated & cleaned                                 │
│     • Schema validated                                       │
│     • Business rules applied                                 │
│     • Ready for analytics                                    │
│        ▼                                                     │
│  🥇 GOLD LAYER                                               │
│     • Business aggregates                                    │
│     • KPIs & metrics                                         │
│     • Optimized for queries                                  │
│     • Dashboard-ready                                        │
│        ▼                                                     │
│  📊 CONSUMPTION (BI Tools, ML Models, APIs)                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
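The Bronze → Silver → Gold flow above can be sketched with plain pandas. This is a simplified stand-in, not the platform's actual implementation; the column names and cleaning rules are illustrative:

```python
import pandas as pd

# Bronze: raw data, stored as-is (including quality issues).
bronze = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "customer": [" alice ", " alice ", "bob", None],
    "amount": [10.0, 10.0, -5.0, 25.0],
})

# Silver: deduplicated, cleaned, and validated.
silver = (
    bronze
    .drop_duplicates()                # remove exact duplicate rows
    .dropna(subset=["customer"])      # drop rows missing a customer
    .assign(customer=lambda d: d["customer"].str.strip())  # trim whitespace
    .query("amount > 0")              # enforce a simple business rule
)

# Gold: business aggregates, ready for dashboards.
gold = silver.groupby("customer", as_index=False)["amount"].sum()
print(gold)
```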
| Metric | Score | Trend | Status |
|---|---|---|---|
| Overall Quality | 92/100 | ↑ 3% | 🟢 Excellent |
| Completeness | 95% | ↑ 2% | 🟢 Great |
| Validity | 98% | → | 🟢 Perfect |
| Consistency | 88% | ↑ 1% | 🟡 Good |
| Accuracy | 91% | ↑ 4% | 🟢 Excellent |
```
┌──────────────────────────────────────────────────────────────────┐
│                    AGENTIC CONTROL LAYER 🤖                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────┐     ┌─────────────┐     ┌──────────────┐       │
│   │  Profiler   │────▶│   Quality   │────▶│ Remediation  │       │
│   │   Agent     │     │    Agent    │     │    Agent     │       │
│   │             │     │             │     │              │       │
│   │ • Discover  │     │ • Monitor   │     │ • Auto-fix   │       │
│   │ • Analyze   │     │ • Score     │     │ • Validate   │       │
│   │ • Report    │     │ • Alert     │     │ • Optimize   │       │
│   └─────────────┘     └─────────────┘     └──────────────┘       │
│                                                                  │
└─────────────────────────────┬────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│                    DATA PROCESSING LAYER ⚙️                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   🥉 Bronze        →   🥈 Silver        →   🥇 Gold              │
│   • Raw data           • Cleaned data       • Aggregates         │
│   • Parquet            • Validated          • KPIs               │
│   • Immutable          • Typed              • Metrics            │
│                                                                  │
└─────────────────────────────┬────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│                        STORAGE LAYER 💾                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   DuckDB (Analytical Database)                                   │
│   • OLAP optimized                                               │
│   • Columnar storage                                             │
│   • SQL interface                                                │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
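The three-agent chain in the control layer can be sketched as minimal Python classes. This is an illustrative toy, not the real agents in `src/agents/agentic_agents.py`; the issue names and the per-issue scoring penalty are assumptions:

```python
# Minimal sketch of the Profiler -> Quality -> Remediation chain.
class ProfilerAgent:
    def profile(self, rows: list[dict]) -> dict:
        """Discover simple issues: missing values and duplicate rows."""
        seen, issues = set(), []
        for row in rows:
            key = tuple(sorted(row.items()))
            if key in seen:
                issues.append("duplicate_row")
            seen.add(key)
            if any(v is None for v in row.values()):
                issues.append("missing_value")
        return {"row_count": len(rows), "issues": issues}

class QualityAgent:
    def score(self, profile: dict) -> int:
        """Score 0-100: subtract a fixed penalty per detected issue."""
        return max(0, 100 - 10 * len(profile["issues"]))

class RemediationAgent:
    def remediate(self, rows: list[dict]) -> list[dict]:
        """Auto-fix: drop duplicates and rows with missing values."""
        seen, clean = set(), []
        for row in rows:
            key = tuple(sorted(row.items()))
            if key in seen or any(v is None for v in row.values()):
                continue
            seen.add(key)
            clean.append(row)
        return clean

rows = [{"id": 1, "name": "a"}, {"id": 1, "name": "a"}, {"id": 2, "name": None}]
profile = ProfilerAgent().profile(rows)
print(QualityAgent().score(profile))       # 2 issues -> score drops to 80
print(RemediationAgent().remediate(rows))  # only the clean, unique row survives
```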
**1️⃣ Beginner: Understanding the Basics**
- 📖 Medallion Architecture 101
- 📖 What Are Data Agents?
- 📖 Your First Pipeline
- 📖 Dashboard Tour

Time Investment: 30 minutes
You'll Learn: Core concepts, basic workflow

**2️⃣ Intermediate: Customization**
- 📖 Custom Data Sources
- 📖 Writing Business Rules
- 📖 Creating Gold Tables
- 📖 Scheduling Pipelines

Time Investment: 2 hours
You'll Learn: How to adapt the platform to your needs

**3️⃣ Advanced: Production Deployment**
- 📖 Performance Tuning
- 📖 Production Best Practices
- 📖 Cloud Deployment
- 📖 Monitoring & Alerts

Time Investment: 4 hours
You'll Learn: Enterprise-grade deployment
```python
# Quick API Examples

# 1. Data Profiling
from src.agents.agentic_agents import DataProfilerAgent
profiler = DataProfilerAgent()
profile = profiler.profile_dataset(df, "my_dataset")

# 2. Quality Scoring
from src.agents.agentic_agents import QualityAgent
quality = QualityAgent()
score = quality.calculate_quality_score(profile)

# 3. Auto-Remediation
from src.agents.agentic_agents import RemediationAgent
remediation = RemediationAgent()
clean_df, actions = remediation.auto_remediate(df, profile['issues_detected'])

# 4. DuckDB Operations
from src.database.duckdb_manager import MedallionDuckDB
db = MedallionDuckDB()
db.load_to_bronze(df, "my_table")
db.promote_to_silver("my_table", "my_table_clean")
```

**E-commerce Analytics**
Perfect for analyzing customer behavior, order patterns, and product performance.
- ✅ Handles messy transaction data
- ✅ Auto-cleans customer records
- ✅ Creates ready-to-use KPIs
**Financial Services**
Clean and validate financial transactions with confidence.
- ✅ Detects data anomalies
- ✅ Enforces compliance rules
- ✅ Tracks data lineage for audits
**Business Intelligence**
Transform raw data into executive-ready dashboards.
- ✅ Automated data prep
- ✅ Quality guarantees
- ✅ Fast query performance
**Machine Learning**
Reliable, clean datasets for model training.
- ✅ Feature-engineering ready
- ✅ Drift detection
- ✅ Reproducible pipelines
- Medallion Architecture
- Basic AI Agents
- Streamlit Dashboard
- DuckDB Integration
- Sample Dataset
- LangChain Integration for NLP queries
- Advanced ML Anomaly Detection
- Real-time Streaming Support
- Multi-source Connectors (PostgreSQL, MySQL, S3)
- Data Versioning (Delta Lake)
- Cloud Deployment (AWS/Azure/GCP)
- Kubernetes Orchestration
- RBAC & Security
- GraphQL API
- Slack/Teams Integrations
- GPT-4 Powered Data Analysis
- Automated Feature Engineering
- Predictive Quality Monitoring
- Self-Optimizing Pipelines
We ❤️ contributions! Here's how you can help:

- 🐛 **Report Bugs**: Found an issue? Open a bug report
- 💡 **Suggest Features**: Have an idea? Request a feature
- 📝 **Improve Docs**: Better explanations? Edit the docs
- 🔧 **Submit Code**: Fix or feature? Create a pull request
- ⭐ **Star the Repo**: Show support! Give us a star
- 💬 **Join Discussion**: Ask questions! GitHub Discussions
```bash
# Fork and clone your fork
git clone https://github.com/YOUR_USERNAME/agentic-data-engineer.git

# Create a feature branch
git checkout -b feature/amazing-feature

# Make your changes and commit
git commit -m "Add amazing feature"

# Push and create a PR
git push origin feature/amazing-feature
```

- ✅ Follow the PEP 8 style guide
- ✅ Add docstrings to functions
- ✅ Include unit tests
- ✅ Update documentation
- ✅ Run `pytest` before submitting
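As an example of the unit-test guideline, a small pytest-style test might look like this (the helper function under test is hypothetical, shown only to illustrate the expected test shape):

```python
# Illustrative pytest-style test; strip_whitespace is a hypothetical helper,
# not an actual function in this repository.
def strip_whitespace(values: list[str]) -> list[str]:
    """Example utility: trim leading/trailing whitespace from each value."""
    return [v.strip() for v in values]

def test_strip_whitespace():
    assert strip_whitespace([" alice ", "bob"]) == ["alice", "bob"]

test_strip_whitespace()  # pytest would discover and run this automatically
```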
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License - Do whatever you want!
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
Built with amazing open-source tools:
- DuckDB - The SQLite of analytics
- Polars - Lightning-fast DataFrames
- Prefect - Modern workflow orchestration
- Streamlit - Beautiful data apps
- Pandera - Data validation
- Great Expectations - Data quality
- Evidently - ML monitoring
Special thanks to all contributors and the open-source community! 🙏
If this project helped you, please consider:
- ⭐ Starring the repository
- 🐛 Reporting bugs
- 💡 Suggesting features
- 📢 Sharing with others
- ☕ Buying me a coffee

Built with ❤️ by Your Name | Last Updated: November 2024