An AI-powered ETL pipeline that automatically detects and corrects data quality issues in CSV files. The agent uses LLM-based planning to generate and execute data-cleaning transformations.
This application provides an automated data cleaning solution with the following features:
- Intelligent Data Profiling: Analyzes CSV files to detect issues like missing values, data type inconsistencies, and duplicates
- AI-Powered Planning: Uses LLM (OpenAI/Groq) to generate optimal cleaning strategies
- Automated Execution: Applies transformations such as:
  - Column name standardization
  - Missing value handling
  - Whitespace trimming
  - Duplicate removal
  - Type conversion and parsing
  - Date/time parsing
  - Currency and percentage normalization
  - Boolean parsing
  - Outlier detection
- Validation & Assessment: Validates transformations and provides confidence metrics
- Web Interface: Flask-based UI for uploading files and downloading cleaned results
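A few of the deterministic cleaners listed above (column name standardization, whitespace trimming, duplicate removal) can be sketched in a few lines of Pandas. This is a minimal illustration, not the project's actual implementation:

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaners: name standardization, trimming, dedup."""
    # Standardize column names: strip, lowercase, spaces -> underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Trim surrounding whitespace in string columns
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    # Drop exact duplicate rows
    return df.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({" Name ": ["  Ada ", "  Ada ", "Grace"],
                    "Age": [36, 36, 45]})
cleaned = basic_clean(raw)
```

In the real pipeline, which of these cleaners run (and in what order) is decided by the LLM-generated plan rather than hard-coded.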
- Clone the repository:

  ```shell
  git clone <repository-url>
  cd Data-Cleaning-Agent
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Set up environment variables (create a `.env` file):

  ```
  OPENAI_API_KEY=your_openai_key
  GROQ_API_KEY=your_groq_key
  FLASK_SECRET=your_flask_secret
  ```

- Run the application:

  ```shell
  python app.py
  ```

  Then navigate to http://localhost:5000 in your browser.
- Upload a CSV file
- Select cleaning mode (full pipeline or specific cleaners)
- Download the cleaned CSV
Run the test suite:

```shell
python -m pytest tests/
```

```
Data-Cleaning-Agent/
├── app.py               # Flask web application
├── requirements.txt     # Python dependencies
├── etl/
│   ├── pipeline.py      # Main ETL pipeline orchestration
│   ├── agent/           # Agent loop logic
│   ├── assessment/      # Confidence and readiness assessment
│   ├── executor/        # Code execution and safety
│   ├── extract/         # CSV/JSON reading utilities
│   ├── llm/             # LLM planning and integration
│   ├── load/            # Output writing
│   ├── profile/         # Data profiling and serialization
│   ├── transform/       # Data cleaning transformations
│   └── validate/        # Validation and feedback
├── data/
│   ├── uploads/         # User uploaded CSV files
│   ├── outputs/         # Cleaned CSV files
│   └── test_files/      # Test data
├── templates/           # HTML templates for web UI
├── tests/               # Unit and integration tests
└── logs/                # Application logs
```
The ETL pipeline follows these steps:
- Extract: Read and parse input CSV files with automatic encoding detection
- Profile: Generate statistical profiles and identify data quality issues
- Plan: Use LLM to generate a cleaning plan based on profiles
- Execute: Apply transformations in an iterative loop with feedback
- Validate: Validate transformations and collect results
- Assess: Compute confidence scores and readiness assessment
- Load: Write cleaned data to output CSV
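The Plan → Execute → Validate loop at the heart of the pipeline can be sketched as below. All function names and signatures are assumptions; the real logic lives in `etl/pipeline.py` and `etl/agent/`:

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame, planner, executor, validator,
                 max_iterations: int = 3) -> pd.DataFrame:
    """Plan -> execute -> validate, feeding validation feedback back
    into the planner until the data passes or iterations run out."""
    feedback = None
    for _ in range(max_iterations):
        plan = planner(df, feedback)      # LLM-generated cleaning plan
        df = executor(df, plan)           # apply the transformations
        ok, feedback = validator(df)      # check results, collect issues
        if ok:
            break
    return df
```

Capping the loop with `max_iterations` keeps a misbehaving plan (or an LLM that never converges) from running indefinitely.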
- Python 3.8+: Core language
- Flask: Web framework
- Pandas: Data manipulation
- NumPy: Numerical operations
- OpenAI/Groq: LLM integration
- Chardet: Character encoding detection
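The automatic encoding detection mentioned in the Extract step can be sketched with Chardet and Pandas. The function name is an assumption; falling back to Latin-1 (which never raises on decode) is one possible policy, not necessarily the project's:

```python
import io
import chardet
import pandas as pd

def read_csv_any_encoding(raw_bytes: bytes) -> pd.DataFrame:
    """Detect the file's encoding with Chardet, then parse with Pandas."""
    guess = chardet.detect(raw_bytes)
    encoding = guess["encoding"] or "latin-1"  # safe fallback
    return pd.read_csv(io.BytesIO(raw_bytes), encoding=encoding)

# A CSV that is valid Latin-1 but not valid UTF-8
latin1_csv = "name,city\nJosé,São Paulo\n".encode("latin-1")
df = read_csv_any_encoding(latin1_csv)
```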
Key configuration files:
- `.env`: Environment variables (API keys, secrets)
- `app.py`: Flask app configuration
- `etl/pipeline.py`: Pipeline parameters
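Reading the settings from the environment can be sketched as below, assuming the `.env` variables have been exported or loaded into the process environment (the helper name and defaults are illustrative):

```python
import os

def load_settings() -> dict:
    """Read the keys defined in .env from the process environment."""
    return {
        "openai_api_key": os.getenv("OPENAI_API_KEY", ""),
        "groq_api_key": os.getenv("GROQ_API_KEY", ""),
        # Default is for local development only; always set in production
        "flask_secret": os.getenv("FLASK_SECRET", "dev-only-secret"),
    }
```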
Contributions are welcome! Please follow these steps:
- Create a feature branch
- Make your changes
- Add/update tests
- Submit a pull request
MIT License