Data Cleaning Agent

An AI-powered ETL pipeline that automatically detects and corrects data quality issues in CSV files. The agent uses LLM-based planning to generate and execute data-cleaning transformations.

Overview

This application provides an automated data cleaning solution with the following features:

  • Intelligent Data Profiling: Analyzes CSV files to detect issues like missing values, data type inconsistencies, and duplicates
  • AI-Powered Planning: Uses an LLM (OpenAI or Groq) to generate a cleaning plan from the data profile
  • Automated Execution: Applies transformations (see the sketch after this list) such as:
    • Column name standardization
    • Missing value handling
    • Whitespace trimming
    • Duplicate removal
    • Type conversion and parsing
    • Date/time parsing
    • Currency and percentage normalization
    • Boolean parsing
    • Outlier detection
  • Validation & Assessment: Validates transformations and provides confidence metrics
  • Web Interface: Flask-based UI for uploading files and downloading cleaned results
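
To make the transformation step concrete, here is a minimal Pandas sketch of what a few of the cleaners above typically do. The function name and the exact rules are illustrative assumptions, not the repository's actual API (the real implementations live in etl/transform/):

import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    # Standardize column names: strip, lowercase, spaces to underscores
    df = df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))
    # Trim leading/trailing whitespace in string columns
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    # Remove exact duplicate rows
    df = df.drop_duplicates()
    # Fill missing numeric values with the column median (one possible policy)
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())
    return df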

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd Data-Cleaning-Agent
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up environment variables (create a .env file):

    OPENAI_API_KEY=your_openai_key
    GROQ_API_KEY=your_groq_key
    FLASK_SECRET=your_flask_secret
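
Assuming the app reads these variables with python-dotenv (an assumption; check requirements.txt to confirm), the loading pattern is the standard one:

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

openai_key = os.getenv("OPENAI_API_KEY")
groq_key = os.getenv("GROQ_API_KEY")
flask_secret = os.getenv("FLASK_SECRET")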
    

Usage

Running the Web Application

python app.py

Then navigate to http://localhost:5000 in your browser.

  1. Upload a CSV file
  2. Select cleaning mode (full pipeline or specific cleaners)
  3. Download the cleaned CSV
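
The same flow can be scripted. The endpoint path and form-field name below are hypothetical placeholders; check the routes defined in app.py for the real ones:

import requests

# Hypothetical route and field name; inspect app.py for the actual ones
with open("data/test_files/messy.csv", "rb") as f:
    resp = requests.post("http://localhost:5000/upload", files={"file": f})
resp.raise_for_status()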

Running Tests

python -m pytest tests/

Project Structure

Data-Cleaning-Agent/
├── app.py                # Flask web application
├── requirements.txt      # Python dependencies
├── etl/
│   ├── pipeline.py       # Main ETL pipeline orchestration
│   ├── agent/            # Agent loop logic
│   ├── assessment/       # Confidence and readiness assessment
│   ├── executor/         # Code execution and safety
│   ├── extract/          # CSV/JSON reading utilities
│   ├── llm/              # LLM planning and integration
│   ├── load/             # Output writing
│   ├── profile/          # Data profiling and serialization
│   ├── transform/        # Data cleaning transformations
│   └── validate/         # Validation and feedback
├── data/
│   ├── uploads/          # User-uploaded CSV files
│   ├── outputs/          # Cleaned CSV files
│   └── test_files/       # Test data
├── templates/            # HTML templates for the web UI
├── tests/                # Unit and integration tests
└── logs/                 # Application logs

Architecture

The ETL pipeline follows these steps (a simplified sketch follows the list):

  1. Extract: Read and parse input CSV files with automatic encoding detection
  2. Profile: Generate statistical profiles and identify data quality issues
  3. Plan: Use LLM to generate a cleaning plan based on profiles
  4. Execute: Apply transformations in an iterative loop with feedback
  5. Validate: Validate transformations and collect results
  6. Assess: Compute confidence scores and readiness assessment
  7. Load: Write cleaned data to output CSV
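
As a rough, runnable sketch of how these stages compose, the following substitutes a fixed rule-based plan where the real pipeline consults the LLM and iterates with feedback. The function is an illustration, not the actual entry point (that lives in etl/pipeline.py):

import pandas as pd

def run_pipeline(in_path: str, out_path: str) -> float:
    # 1. Extract: read the CSV (encoding detection shown under Technologies)
    df = pd.read_csv(in_path)

    # 2. Profile: collect simple data quality signals
    profile = {
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }

    # 3-5. Plan / Execute / Validate: a fixed plan stands in for the
    # LLM planner and its feedback loop
    if profile["duplicates"]:
        df = df.drop_duplicates()
    for col, n_missing in profile["missing"].items():
        if n_missing and pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())

    # 6. Assess: a naive confidence score (fraction of non-null cells)
    confidence = 1.0 - float(df.isna().mean().mean())

    # 7. Load: write the cleaned data
    df.to_csv(out_path, index=False)
    return confidence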

Technologies

  • Python 3.8+: Core language
  • Flask: Web framework
  • Pandas: Data manipulation
  • NumPy: Numerical operations
  • OpenAI/Groq: LLM integration
  • Chardet: Character encoding detection
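
For example, the Extract step's automatic encoding detection with Chardet follows the library's standard pattern (the file path here is illustrative):

import chardet

with open("data/uploads/input.csv", "rb") as f:
    raw = f.read(100_000)      # a sample of the file is enough for detection
result = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(result["encoding"], result["confidence"])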

Configuration

Key configuration files:

  • .env: Environment variables (API keys, secrets)
  • app.py: Flask app configuration
  • etl/pipeline.py: Pipeline parameters

Contributing

Contributions are welcome! Please follow these steps:

  1. Create a feature branch
  2. Make your changes
  3. Add/update tests
  4. Submit a pull request

License

MIT License
