Data Sanitizer: Production Data Cleaning Platform

Automatically dedupe, impute, normalize, and monitor data quality at scale with deterministic, auditable fixes.

License | Python 3.11+ | Code Coverage

🎯 Overview

Data Sanitizer is a production-ready data cleaning platform designed for:

  • Data Engineers: Automatically dedupe, impute, and normalize data at scale
  • ML/Model Ops: Reduce model retraining caused by bad upstream data
  • Business/Analytics: Cleaner data → fewer billing errors and faster BI insights

Key Features

✅ High-Quality Deduplication

  • 90%+ accuracy duplicate detection (MinHash + LSH)
  • Exact + near-duplicate detection
  • Deterministic, auditable fixes

✅ Multi-Format Ingestion

  • CSV, JSON, JSONL, Parquet, Excel
  • S3 / GCS / Azure Blob Storage
  • Streaming processing (O(chunk) memory); see the chunked-read sketch below
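
For intuition, here is a minimal sketch of the chunked-streaming pattern using pandas; the chunk size mirrors the Quick Start example, but the pipeline's internal reader may differ.

# Illustrative only: stream a large CSV in fixed-size chunks so memory
# stays proportional to one chunk rather than the whole file.
import pandas as pd

def stream_csv(path: str, chunksize: int = 50_000):
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk  # each chunk is an ordinary DataFrame of <= chunksize rows

total_rows = 0
for chunk in stream_csv("test_data/benchmark_1000000_rows.csv"):
    total_rows += len(chunk)
print(f"Streamed {total_rows} rows")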

✅ Intelligent Imputation

  • Median/mode-based fills
  • Confidence scoring (0.0–1.0)
  • Per-cell provenance tracking (illustrated in the sketch below)
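
As a rough sketch of what median/mode imputation with confidence scores and provenance can look like (the actual scoring logic lives inside data_cleaning.py; the function and field names below are illustrative):

import pandas as pd

def impute_with_provenance(df: pd.DataFrame):
    """Illustrative: fill numeric NaNs with the median, categorical NaNs with
    the mode, and record one provenance entry per filled cell."""
    provenance = []
    for col in df.columns:
        missing = df[col].isna()
        if not missing.any() or missing.all():
            continue  # nothing to fill, or nothing to derive a fill from
        if pd.api.types.is_numeric_dtype(df[col]):
            fill, method = df[col].median(), "median"
        else:
            fill, method = df[col].mode().iloc[0], "mode"
        confidence = round(1.0 - missing.mean(), 2)  # crude: share of observed values
        df.loc[missing, col] = fill
        provenance.extend(
            {"row": i, "column": col, "filled_with": fill,
             "method": method, "confidence": confidence}
            for i in df.index[missing]
        )
    return df, provenance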

✅ Production-Grade Architecture

  • Stateless, horizontally scalable workers
  • Postgres metadata + Milvus vector DB + Redis cache
  • REST API with authentication & rate limiting
  • Full audit trail & compliance-ready

✅ Enterprise Features

  • PII detection & redaction
  • Multi-tenant isolation
  • Customizable cleaning rules
  • Human-in-the-loop review flow

🚀 Quick Start (5 Minutes)

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • 4GB RAM minimum

1. Clone & Install

git clone https://github.com/CodersAcademy006/Data-Sanitizer.git
cd Data-Sanitizer

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Start Infrastructure (Local)

# Start Postgres, Milvus, Redis, API server
docker-compose up -d

# Verify health
curl http://localhost:8000/api/v1/health
# Expected: {"status": "healthy", "storage_backend": "ready"}

3. Generate Test Data

python benchmark_generator.py --size 1m --output-dir ./test_data
# Generates: test_data/benchmark_1000000_rows.csv (~500 MB)

4. Clean Your Data

# Option A: Via Python
from data_cleaning import run_full_cleaning_pipeline_two_pass_sqlite_batched

cleaned_path, report_path = run_full_cleaning_pipeline_two_pass_sqlite_batched(
    path="test_data/benchmark_1000000_rows.csv",
    output_dir="./output",
    chunksize=50_000
)

# Option B: Via REST API
curl -X POST http://localhost:8000/api/v1/datasets/my-tenant/ingest \
  -H "X-API-Key: my-tenant:key123" \
  -F "file=@test_data/benchmark_1000000_rows.csv" \
  -F "dataset_name=test_dataset"

# Response: {"job_id": "abc-123-def", "status": "queued"}

# Check status
curl http://localhost:8000/api/v1/jobs/abc-123-def

# Download report
curl http://localhost:8000/api/v1/jobs/abc-123-def/report > report.json

5. View Results

# Cleaned data (CSV)
head output/cleaned_data.csv

# Cleaning report (JSON)
cat output/cleaning_report.json | jq '.summary'
# Output:
# {
#   "original_row_count": 1000000,
#   "cleaned_row_count": 950000,
#   "rows_dropped": 50000,
#   "deduplication_rate": 0.95
# }

📊 Architecture

System Diagram

┌─────────────────────────────────────────────────────────────┐
│                      CLIENT LAYER                           │
│  REST API (FastAPI) │ Admin UI │ Python/JS SDKs             │
└────────────┬────────────────────────────────┬───────────────┘
             │                                │
┌────────────▼────────────────────────────────▼───────────────┐
│             ORCHESTRATION LAYER                             │
│  Job Scheduler (RabbitMQ/Redis)                             │
│  - Job state machine (queued → running → complete)          │
│  - Retries, idempotency, tenant quotas                      │
└────────────┬────────────────────────────────┬───────────────┘
             │                                │
┌────────────▼────────────────────────────────▼───────────────┐
│           COMPUTE WORKERS (Stateless, Scalable)             │
│  Pass 1: Sampling → LSH index → Postgres                    │
│  Pass 2: Dedupe → Impute → Clean → S3 (Parquet)             │
└─────────────────────────────────────────────────────────────┘
             │                    │
┌────────────▼────────┐  ┌────────▼──────────┐
│  Metadata Storage   │  │  Vector Storage   │
│  Postgres           │  │  Milvus           │
│  - Jobs, hashes     │  │  - LSH samples    │
│  - Audit logs       │  │  - Similarity     │
│  - Confidence       │  │    queries        │
│  - Cell provenance  │  │                   │
└─────────────────────┘  └───────────────────┘

Data Flow

1. User uploads file (CSV, JSON, Parquet, etc.)
   ↓
2. API validates, stores to S3, creates Job record
   ↓
3. Pass 1 Worker:
   - Streams file in chunks
   - Samples columns (deterministic reservoir)
   - Computes MinHash/LSH signatures
   - Inserts samples to Milvus, stats to Postgres
   ↓
4. Pass 2 Worker:
   - Streams file again
   - Checks row hashes against Postgres (exact dedup)
   - Queries Milvus for near-duplicates (LSH candidates)
   - Applies imputation, normalization, cleaning
   - Streams output to S3 (Parquet)
   - Inserts confidence scores + audit logs to Postgres
   ↓
5. API serves cleaned data + report

🏗️ Project Structure

data_sanitizer/
├── data_cleaning.py           # Core algorithm (Colab prototype upgraded)
├── storage_backend.py         # Postgres + Milvus + Redis interface
├── cloud_storage.py           # S3/GCS connectors, Parquet/CSV writers
├── api_server.py              # FastAPI REST server
├── benchmark_generator.py     # Realistic dirty data generation
├── tests.py                   # 50+ unit, integration, property-based tests
├── requirements.txt           # Python dependencies
│
├── docs/
│   ├── ARCHITECTURE.md        # Full system design (2,000+ lines)
│   ├── DEPLOYMENT.md          # Terraform, Docker, K8s, CI/CD
│   ├── 30DAY_ROADMAP.md       # Week-by-week execution plan
│   ├── IMPLEMENTATION_SUMMARY.md
│   └── API.md                 # (TODO) OpenAPI reference
│
├── docker/
│   ├── api/Dockerfile
│   ├── worker-pass1/Dockerfile
│   ├── worker-pass2/Dockerfile
│   └── .dockerignore
│
├── k8s/
│   ├── base/
│   │   ├── api-deployment.yaml
│   │   ├── api-service.yaml
│   │   ├── worker-pass1-deployment.yaml
│   │   ├── configmap.yaml
│   │   └── hpa.yaml
│   └── overlays/
│       ├── dev/
│       ├── staging/
│       └── prod/
│
├── terraform/
│   ├── main.tf
│   ├── postgres.tf
│   ├── milvus.tf
│   ├── s3.tf
│   ├── eks.tf
│   └── variables.tf
│
└── docker-compose.yaml        # Local development stack

📖 Documentation

All design docs live in the docs/ folder: ARCHITECTURE.md (full system design), DEPLOYMENT.md (Terraform, Docker, K8s, CI/CD), 30DAY_ROADMAP.md (week-by-week execution plan), and IMPLEMENTATION_SUMMARY.md. An OpenAPI reference (docs/API.md) is still TODO.

🧪 Testing

Run All Tests

# Install test dependencies
pip install -e ".[dev]"

# Run tests with coverage
pytest tests.py -v --cov=. --cov-report=html --cov-report=term

# Expected: >80% coverage

Test Categories

  • Unit Tests: JSON flattening, MinHash, LSH, Reservoir sampling
  • Integration Tests: Full pipeline on small CSV/JSONL datasets
  • Property-Based Tests: Determinism validation with Hypothesis (see the sketch below)
  • Performance Tests: Throughput & latency benchmarks
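
As referenced above, determinism is a natural target for property-based testing. A self-contained sketch with Hypothesis (the sampler here is a stand-in, not the project's actual test suite):

import hashlib
from hypothesis import given, strategies as st

def deterministic_sample(rows, salt, k=5):
    # Priority is a pure function of (salt, row), so the sample is reproducible.
    priority = lambda r: hashlib.sha256(f"{salt}:{r}".encode()).hexdigest()
    return sorted(rows, key=priority)[:k]

@given(st.lists(st.text(), min_size=1), st.text())
def test_same_input_and_salt_gives_same_sample(rows, salt):
    assert deterministic_sample(rows, salt) == deterministic_sample(rows, salt)
    # Order of arrival must not matter either.
    assert deterministic_sample(rows, salt) == deterministic_sample(list(reversed(rows)), salt)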

📈 Performance Benchmarks

Baseline metrics on modern hardware (AWS m5.xlarge):

Dataset   File Size   Pass 1 (sec)   Pass 2 (sec)   Throughput (rows/sec)   Memory (MB)
1M CSV    ~500 MB     8–15           12–20          40k–70k                 200–400
10M CSV   ~5 GB       80–150         120–200        40k–70k                 300–500

SLA: 10M rows/hour throughput

To run benchmarks:

python benchmark_generator.py --size 10m
python data_cleaning.py  # Run interactive menu, option 4 (vehicles.csv)

🔐 Security & Compliance

Privacy

  • ✅ PII detection (email, phone, SSN, credit card regex patterns)
  • ✅ Configurable PII strategies: redact, hash, exclude, tokenize (see the sketch below)
  • ✅ Encrypted at rest (S3 SSE-KMS, Postgres TDE)
  • ✅ Encrypted in transit (TLS 1.3)
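
A minimal sketch of the regex-based redaction strategy; the patterns and replacement tokens here are illustrative, not the platform's built-in rule set:

import re

# Illustrative patterns only; production PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text: str) -> str:
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{name.upper()}]", text)
    return text

print(redact("Contact jane@example.com or 555-867-5309"))
# Contact [REDACTED-EMAIL] or [REDACTED-PHONE]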

Audit & Compliance

  • ✅ Immutable audit logs (every transformation recorded)
  • ✅ Cell-level provenance (original → cleaned value + confidence score)
  • ✅ GDPR/CCPA ready (data deletion support)
  • ✅ Row-level security (multi-tenant isolation via Postgres RLS)

Access Control

  • ✅ API key authentication (tenant-scoped)
  • ✅ Rate limiting (per-tenant quotas)
  • ✅ Role-based access (Admin, Engineer, Reviewer)

🚀 Production Deployment

Local Development

docker-compose up -d
uvicorn api_server:app --reload

Cloud Deployment (AWS)

# 1. Initialize infrastructure
cd terraform
terraform init
terraform plan -var-file=prod.tfvars
terraform apply -var-file=prod.tfvars

# 2. Build & push Docker images
./scripts/build-and-push.sh

# 3. Deploy via GitOps (ArgoCD)
kubectl apply -f argocd/data-sanitizer-app.yaml

Kubernetes

# Install Data Sanitizer
kubectl apply -k k8s/overlays/prod

# Check status
kubectl get pods -l app=data-sanitizer-api
kubectl logs deployment/data-sanitizer-api

# Scale workers
kubectl scale deployment data-sanitizer-pass1-worker --replicas=10

See DEPLOYMENT.md for full instructions.


📞 Support & Roadmap

MVP (Current Release)

  • ✅ Core deduplication & imputation
  • ✅ Multi-format ingestion (CSV, JSON, Parquet)
  • ✅ Confidence scoring & audit logs
  • ✅ REST API
  • ✅ Postgres + Milvus backend

Phase 2 (Month 2–4)

  • Admin UI (React)
  • Human review flow
  • LLM enrichment (OpenAI/Claude)
  • Advanced PII detection

Phase 3 (Month 5–12)

  • Multi-tenant SaaS
  • Billing & usage tracking
  • On-prem deployment
  • Custom connectors (Salesforce, etc.)

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Local Development Setup

# 1. Fork & clone
git clone https://github.com/your-fork/data-sanitizer.git
cd data-sanitizer

# 2. Create feature branch
git checkout -b feat/your-feature

# 3. Install dev dependencies
pip install -e ".[dev]"

# 4. Run tests (must pass)
pytest tests.py -v --cov=.

# 5. Format code
black .
flake8 .
mypy .

# 6. Submit PR
git push origin feat/your-feature

📄 License

MIT License. See LICENSE for details.


🎓 Key Concepts

MinHash & LSH

  • MinHash: Probabilistic fingerprint of a text that preserves Jaccard similarity
  • LSH (Locality-Sensitive Hashing): Hashing scheme that maps similar items to the same bucket
  • Purpose: Efficiently find near-duplicate rows without O(n²) pairwise comparisons (see the sketch below)
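
A hedged sketch of the idea using the third-party datasketch library; the project's own MinHash/LSH implementation in data_cleaning.py may differ.

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.5, num_perm=128)  # bucket rows with Jaccard roughly >= 0.5
rows = {
    "r1": "42 Main St Springfield IL 62701",
    "r2": "42 Main Street Springfield IL 62701",
    "r3": "9 Elm Ave Portland OR 97201",
}
for key, text in rows.items():
    lsh.insert(key, minhash(text))

# Candidate lookup avoids comparing every row against every other row.
print(lsh.query(minhash("42 Main St Springfield IL 62701")))
# expected candidates include r1 and (probabilistically) r2; r3 should not appear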

Deterministic Reservoir Sampling

  • Goal: Sample fixed-size subset of unbounded stream
  • Method: Use hash(row_id + salt) as priority; keep min-priority items
  • Benefit: Same input + same salt = same sample (reproducible)
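
A minimal sketch of that hash-priority idea (illustrative; the pipeline's actual sampler and salt handling live in data_cleaning.py):

import hashlib
import heapq

def deterministic_sample(row_ids, salt, k=100):
    """Keep the k items with the smallest salted-hash priority. Same input and
    same salt always produce the same sample, regardless of arrival order."""
    def priority(row_id):
        digest = hashlib.sha256(f"{salt}:{row_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return heapq.nsmallest(k, row_ids, key=priority)

sample_a = deterministic_sample(range(100_000), salt="2024-01", k=5)
sample_b = deterministic_sample(range(100_000), salt="2024-01", k=5)
assert sample_a == sample_b  # reproducible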

Two-Pass Pipeline

  • Pass 1: Build index (reservoirs, LSH) without modifying data
  • Pass 2: Clean data using indices from Pass 1
  • Benefit: Deterministic, can replay Pass 2 with different rules
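
A toy sketch of the two-pass split, where Pass 1 only builds an exact-hash index and Pass 2 replays the stream against it; the real workers also build LSH indices, impute, and write Parquet.

import hashlib
from collections import Counter

def row_hash(row):
    return hashlib.sha256(row.encode()).hexdigest()

def pass_one(rows):
    # Pass 1: stream once and build an index; the data itself is not modified.
    return Counter(row_hash(r) for r in rows)

def pass_two(rows, index, policy="keep_first"):
    # Pass 2: stream again and apply a cleaning policy using the Pass 1 index.
    # Because the index is fixed, Pass 2 can be replayed with different rules.
    seen = set()
    for r in rows:
        h = row_hash(r)
        if policy == "keep_first" and h in seen:
            continue  # drop later copies of an exact duplicate
        if policy == "drop_all_dupes" and index[h] > 1:
            continue  # drop every row that has any exact duplicate
        seen.add(h)
        yield r

rows = ["a,1", "b,2", "a,1"]
index = pass_one(rows)
print(list(pass_two(rows, index)))                           # ['a,1', 'b,2']
print(list(pass_two(rows, index, policy="drop_all_dupes")))  # ['b,2']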

🙏 Acknowledgments


📧 Get Started

  1. Read: docs/ARCHITECTURE.md (5 min overview)
  2. Try: Quick start above (10 min hands-on)
  3. Explore: docs/30DAY_ROADMAP.md (plan for next month)
  4. Deploy: docs/DEPLOYMENT.md (production setup)

Questions? Open an issue or contact us at srjnupadhyay@gmail.com


Happy cleaning! 🧹✨
