Automatically dedupe, impute, normalize, and monitor data quality at scale with deterministic, auditable fixes.
Data Sanitizer is a production-ready data cleaning platform designed for:
- Data Engineers: Automatically dedupe, impute, normalize data at scale
- ML/Model Ops: Reduce model retraining from bad upstream data
- Business/Analytics: Cleaner data → fewer billing errors & faster BI insights
**High-Quality Deduplication**
- 90%+ accuracy duplicate detection (MinHash + LSH)
- Exact + near-duplicate detection
- Deterministic, auditable fixes
**Multi-Format Ingestion**
- CSV, JSON, JSONL, Parquet, Excel
- S3 / GCS / Azure Blob Storage
- Streaming processing (O(chunk) memory)
**Intelligent Imputation**
- Median/mode-based fills
- Confidence scoring (0.0–1.0)
- Per-cell provenance tracking (see the sketch below)
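As a rough illustration of how median/mode fills, confidence scores, and per-cell provenance fit together, here is a minimal pandas sketch; the column names, the confidence heuristic, and the provenance record layout are assumptions for illustration, not the platform's internal schema.

```python
# Minimal sketch of median/mode imputation with confidence scores and
# per-cell provenance. Columns and record layout are illustrative only.
import pandas as pd

df = pd.DataFrame({"age": [34, None, 29, None], "city": ["NYC", None, "NYC", "LA"]})
provenance = []

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        fill, strategy = df[col].median(), "median"
    else:
        fill, strategy = df[col].mode().iloc[0], "mode"
    # Confidence heuristic (assumed): fraction of non-null values the fill was derived from.
    confidence = round(df[col].notna().mean(), 2)
    for idx in df.index[df[col].isna()]:
        provenance.append({
            "row": int(idx), "column": col, "original": None,
            "imputed": fill, "strategy": strategy, "confidence": confidence,
        })
    df[col] = df[col].fillna(fill)

print(df)
print(provenance)
```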
**Production-Grade Architecture**
- Stateless, horizontally scalable workers
- Postgres metadata + Milvus vector DB + Redis cache
- REST API with authentication & rate limiting
- Full audit trail & compliance-ready
**Enterprise Features**
- PII detection & redaction
- Multi-tenant isolation
- Customizable cleaning rules
- Human-in-the-loop review flow
- Python 3.11+
- Docker & Docker Compose
- 4GB RAM minimum
```bash
git clone https://github.com/CodersAcademy006/Data-Sanitizer.git
cd data-sanitizer

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Start Postgres, Milvus, Redis, API server
docker-compose up -d

# Verify health
curl http://localhost:8000/api/v1/health
# Expected: {"status": "healthy", "storage_backend": "ready"}

python benchmark_generator.py --size 1m --output-dir ./test_data
# Generates: test_data/benchmark_1000000_rows.csv (~500 MB)
```

```python
# Option A: Via Python
from data_cleaning import run_full_cleaning_pipeline_two_pass_sqlite_batched

cleaned_path, report_path = run_full_cleaning_pipeline_two_pass_sqlite_batched(
    path="test_data/benchmark_1000000_rows.csv",
    output_dir="./output",
    chunksize=50_000
)
```
```bash
# Option B: Via REST API
curl -X POST http://localhost:8000/api/v1/datasets/my-tenant/ingest \
  -H "X-API-Key: my-tenant:key123" \
  -F "file=@test_data/benchmark_1000000_rows.csv" \
  -F "dataset_name=test_dataset"
# Response: {"job_id": "abc-123-def", "status": "queued"}

# Check status
curl http://localhost:8000/api/v1/jobs/abc-123-def

# Download report
curl http://localhost:8000/api/v1/jobs/abc-123-def/report > report.json

# Cleaned data (CSV)
head output/cleaned_data.csv

# Cleaning report (JSON)
cat output/cleaning_report.json | jq '.summary'
# Output:
# {
#   "original_row_count": 1000000,
#   "cleaned_row_count": 950000,
#   "rows_dropped": 50000,
#   "deduplication_rate": 0.95
# }
```
```
CLIENT LAYER
  REST API (FastAPI)  |  Admin UI  |  Python/JS SDKs
        │
        ▼
ORCHESTRATION LAYER
  Job Scheduler (RabbitMQ/Redis)
  - Job state machine (queued → running → complete)
  - Retries, idempotency, tenant quotas
        │
        ▼
COMPUTE WORKERS (Stateless, Scalable)
  Pass 1: Sampling → LSH index → Postgres
  Pass 2: Dedupe → Impute → Clean → S3 (Parquet)
        │
        ├────────────────────────────────┐
        ▼                                ▼
Metadata Storage (Postgres)      Vector Storage (Milvus)
  - Jobs, hashes                   - LSH samples
  - Audit logs                     - Similarity queries
  - Confidence scores
  - Cell provenance
```
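The job lifecycle in the orchestration layer can be pictured as a small state machine. Here is a minimal sketch, assuming one extra `failed` state to model retries; the transition table and helper are illustrative, not the scheduler's actual code.

```python
# Illustrative job state machine for the orchestration layer. States follow
# the diagram above (plus an assumed "failed" state to model retries).
from enum import Enum

class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"

ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.COMPLETE, JobState.FAILED},
    JobState.FAILED: {JobState.QUEUED},   # a retry re-enqueues the job
    JobState.COMPLETE: set(),
}

def transition(current: JobState, target: JobState) -> JobState:
    """Validate a state change; reject anything outside the allowed edges."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```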
```
1. User uploads file (CSV, JSON, Parquet, etc.)
       ↓
2. API validates, stores to S3, creates Job record
       ↓
3. Pass 1 Worker:
   - Streams file in chunks
   - Samples columns (deterministic reservoir)
   - Computes MinHash/LSH signatures
   - Inserts samples to Milvus, stats to Postgres
       ↓
4. Pass 2 Worker:
   - Streams file again
   - Checks row hashes against Postgres (exact dedup)
   - Queries Milvus for near-duplicates (LSH candidates)
   - Applies imputation, normalization, cleaning
   - Streams output to S3 (Parquet)
   - Inserts confidence scores + audit logs to Postgres
       ↓
5. API serves cleaned data + report
```
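Stripped to its essentials, this flow is a pair of streaming passes over one file. The sketch below mirrors that shape with a plain dict standing in for the Postgres hash index (and without the Milvus/LSH step); the function names and hashing scheme are assumptions for illustration, not the worker code.

```python
# Skeleton of the two-pass flow above. A dict of first-seen row positions
# stands in for the Pass 1 index stored in Postgres. Purely illustrative.
import hashlib
import pandas as pd

CHUNKSIZE = 50_000

def row_hash(row: pd.Series) -> str:
    """Stable fingerprint of a row's values, used for exact deduplication."""
    return hashlib.sha256("|".join(map(str, row.tolist())).encode()).hexdigest()

def pass_one(path: str) -> dict[str, int]:
    """Pass 1: stream the file and record the first row index of every distinct hash."""
    first_seen: dict[str, int] = {}
    offset = 0
    for chunk in pd.read_csv(path, chunksize=CHUNKSIZE):
        for i, h in enumerate(chunk.apply(row_hash, axis=1)):
            first_seen.setdefault(h, offset + i)
        offset += len(chunk)
    return first_seen

def pass_two(path: str, first_seen: dict[str, int], out_path: str) -> None:
    """Pass 2: stream again and keep only the first occurrence of each row."""
    offset, first_write = 0, True
    for chunk in pd.read_csv(path, chunksize=CHUNKSIZE):
        hashes = chunk.apply(row_hash, axis=1)
        keep = [first_seen[h] == offset + i for i, h in enumerate(hashes)]
        chunk[keep].to_csv(out_path, mode="w" if first_write else "a",
                           header=first_write, index=False)
        offset, first_write = offset + len(chunk), False

index = pass_one("test_data/benchmark_1000000_rows.csv")
pass_two("test_data/benchmark_1000000_rows.csv", index, "output/cleaned_data.csv")
```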
```
data_sanitizer/
├── data_cleaning.py          # Core algorithm (Colab prototype upgraded)
├── storage_backend.py        # Postgres + Milvus + Redis interface
├── cloud_storage.py          # S3/GCS connectors, Parquet/CSV writers
├── api_server.py             # FastAPI REST server
├── benchmark_generator.py    # Realistic dirty data generation
├── tests.py                  # 50+ unit, integration, property-based tests
├── requirements.txt          # Python dependencies
│
├── docs/
│   ├── ARCHITECTURE.md           # Full system design (2,000+ lines)
│   ├── DEPLOYMENT.md             # Terraform, Docker, K8s, CI/CD
│   ├── 30DAY_ROADMAP.md          # Week-by-week execution plan
│   ├── IMPLEMENTATION_SUMMARY.md
│   └── API.md                    # (TODO) OpenAPI reference
│
├── docker/
│   ├── api/Dockerfile
│   ├── worker-pass1/Dockerfile
│   ├── worker-pass2/Dockerfile
│   └── .dockerignore
│
├── k8s/
│   ├── base/
│   │   ├── api-deployment.yaml
│   │   ├── api-service.yaml
│   │   ├── worker-pass1-deployment.yaml
│   │   ├── configmap.yaml
│   │   └── hpa.yaml
│   └── overlays/
│       ├── dev/
│       ├── staging/
│       └── prod/
│
├── terraform/
│   ├── main.tf
│   ├── postgres.tf
│   ├── milvus.tf
│   ├── s3.tf
│   ├── eks.tf
│   └── variables.tf
│
└── docker-compose.yaml       # Local development stack
```
- ARCHITECTURE.md - Complete system design, data models, API contracts
- DEPLOYMENT.md - Production infrastructure, Kubernetes, Terraform, CI/CD
- 30DAY_ROADMAP.md - Execution plan: Day 1 through Day 30
- IMPLEMENTATION_SUMMARY.md - Overview of deliverables
- API.md - (TODO) REST API reference, Swagger/OpenAPI
```bash
# Install test dependencies
pip install -e ".[dev]"

# Run tests with coverage
pytest tests.py -v --cov=. --cov-report=html --cov-report=term
# Expected: >80% coverage
```

- Unit Tests: JSON flattening, MinHash, LSH, Reservoir sampling
- Integration Tests: Full pipeline on small CSV/JSONL datasets
- Property-Based Tests: Determinism validation with Hypothesis (see the sketch after this list)
- Performance Tests: Throughput & latency benchmarks
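The determinism property is the interesting one to test: running the pipeline twice on the same input should produce byte-identical output. A sketch of such a Hypothesis test follows, assuming the quick-start entry point shown earlier; the data strategy and temp-directory handling are illustrative, not the project's actual test code.

```python
# Sketch of a property-based determinism test: the same input file should
# produce byte-identical cleaned output on repeated runs.
import csv
from pathlib import Path

from hypothesis import given, settings, strategies as st

from data_cleaning import run_full_cleaning_pipeline_two_pass_sqlite_batched

rows = st.lists(
    st.tuples(st.integers(0, 9), st.sampled_from(["NYC", "LA", ""])),
    min_size=1, max_size=50,
)

@settings(deadline=None, max_examples=20)
@given(data=rows)
def test_pipeline_is_deterministic(tmp_path_factory, data):
    base = tmp_path_factory.mktemp("determinism")
    src = base / "input.csv"
    with src.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["age", "city"])
        writer.writerows(data)

    outputs = []
    for run in ("run1", "run2"):
        out_dir = base / run
        out_dir.mkdir(exist_ok=True)
        cleaned_path, _report_path = run_full_cleaning_pipeline_two_pass_sqlite_batched(
            path=str(src), output_dir=str(out_dir), chunksize=10
        )
        outputs.append(Path(cleaned_path).read_bytes())

    assert outputs[0] == outputs[1]
```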
Baseline metrics on modern hardware (AWS m5.xlarge):
| Dataset | File Size | Pass 1 (sec) | Pass 2 (sec) | Throughput (rows/sec) | Memory (MB) |
|---|---|---|---|---|---|
| 1M CSV | ~500 MB | 8–15 | 12–20 | 40k–70k | 200–400 |
| 10M CSV | ~5 GB | 80–150 | 120–200 | 40k–70k | 300–500 |
SLA: 10M rows/hour throughput
To run benchmarks:

```bash
python benchmark_generator.py --size 10m
python data_cleaning.py  # Run interactive menu, option 4 (vehicles.csv)
```

- ✅ PII detection (email, phone, SSN, credit card regex patterns)
- ✅ Configurable PII strategies: redact, hash, exclude, tokenize (see the sketch after this list)
- ✅ Encrypted at rest (S3 SSE-KMS, Postgres TDE)
- ✅ Encrypted in transit (TLS 1.3)
- ✅ Immutable audit logs (every transformation recorded)
- ✅ Cell-level provenance (original → cleaned value + confidence score)
- ✅ GDPR/CCPA ready (data deletion support)
- ✅ Row-level security (multi-tenant isolation via Postgres RLS)
- ✅ API key authentication (tenant-scoped)
- ✅ Rate limiting (per-tenant quotas)
- ✅ Role-based access (Admin, Engineer, Reviewer)
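To make the PII handling concrete, here is a simplified sketch of regex-based detection with the redact and hash strategies; the patterns and function are deliberately minimal examples, not the production rules shipped with the platform.

```python
# Simplified sketch of regex-based PII detection with "redact" and "hash"
# strategies. The patterns are minimal examples, not production rules.
import hashlib
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str, strategy: str = "redact") -> str:
    """Replace detected PII with a tag ("redact") or a stable digest ("hash")."""
    for name, pattern in PII_PATTERNS.items():
        if strategy == "redact":
            text = pattern.sub(f"[{name.upper()}_REDACTED]", text)
        elif strategy == "hash":
            text = pattern.sub(
                lambda m: hashlib.sha256(m.group().encode()).hexdigest()[:12], text
            )
    return text

print(scrub("Call 555-867-5309 or email jane@example.com"))
# Call [PHONE_REDACTED] or email [EMAIL_REDACTED]
```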
```bash
docker-compose up -d
uvicorn api_server:app --reload
```

```bash
# 1. Initialize infrastructure
cd terraform
terraform init
terraform plan -var-file=prod.tfvars
terraform apply -var-file=prod.tfvars

# 2. Build & push Docker images
./scripts/build-and-push.sh

# 3. Deploy via GitOps (ArgoCD)
kubectl apply -f argocd/data-sanitizer-app.yaml
```

```bash
# Install Data Sanitizer
kubectl apply -k k8s/overlays/prod

# Check status
kubectl get pods -l app=data-sanitizer-api
kubectl logs deployment/data-sanitizer-api

# Scale workers
kubectl scale deployment data-sanitizer-pass1-worker --replicas=10
```

See DEPLOYMENT.md for full instructions.
- ✅ Core deduplication & imputation
- ✅ Multi-format ingestion (CSV, JSON, Parquet)
- ✅ Confidence scoring & audit logs
- ✅ REST API
- ✅ Postgres + Milvus backend
- Admin UI (React)
- Human review flow
- LLM enrichment (OpenAI/Claude)
- Advanced PII detection
- Multi-tenant SaaS
- Billing & usage tracking
- On-prem deployment
- Custom connectors (Salesforce, etc.)
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# 1. Fork & clone
git clone https://github.com/your-fork/data-sanitizer.git
cd data-sanitizer

# 2. Create feature branch
git checkout -b feat/your-feature

# 3. Install dev dependencies
pip install -e ".[dev]"

# 4. Run tests (must pass)
pytest tests.py -v --cov=.

# 5. Format code
black .
flake8 .
mypy .

# 6. Submit PR
git push origin feat/your-feature
```

MIT License. See LICENSE for details.
- MinHash: Probabilistic fingerprint of a text that preserves Jaccard similarity
- LSH (Locality-Sensitive Hashing): Bucket function that maps similar items to same bucket
- Purpose: Efficiently find near-duplicate rows without O(n²) comparisons (see the sketch below)
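A toy example using the datasketch library shows the idea end to end; the library choice and the whitespace tokenization are assumptions for illustration, not necessarily what Data Sanitizer uses internally.

```python
# Toy near-duplicate lookup with MinHash + LSH via the datasketch library.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Fingerprint a row's text as a MinHash over its whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

rows = {
    "r1": "42 Main St, Springfield, IL 62701",
    "r2": "42 Main Street, Springfield, IL 62701",   # near-duplicate of r1
    "r3": "9 Elm Ave, Portland, OR 97201",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, text in rows.items():
    lsh.insert(key, minhash(text))

# Similar rows land in the same LSH buckets, so lookup avoids all-pairs comparison.
print(lsh.query(minhash("42 Main St, Springfield, IL 62701")))  # likely ['r1', 'r2']
```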
- Goal: Sample fixed-size subset of unbounded stream
- Method: Use hash(row_id + salt) as priority; keep min-priority items
- Benefit: Same input + same salt = same sample (reproducible; see the sketch below)
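A minimal sketch of that deterministic reservoir, assuming SHA-256 as the priority hash and a hypothetical helper name:

```python
# Deterministic reservoir sketch: hash(row_id + salt) is the priority and the
# k smallest priorities win, so the same salt always yields the same sample.
import hashlib
import heapq

def priority(row_id: str, salt: str) -> int:
    return int.from_bytes(hashlib.sha256(f"{row_id}:{salt}".encode()).digest()[:8], "big")

def deterministic_sample(row_ids, salt: str, k: int = 3):
    """Keep the k rows with the smallest hash priority, independent of arrival order."""
    # heapq.nsmallest scans the stream once and keeps only k candidates in memory.
    return heapq.nsmallest(k, row_ids, key=lambda rid: priority(rid, salt))

stream = [f"row-{i}" for i in range(1_000)]
print(deterministic_sample(stream, salt="tenant-a"))
print(deterministic_sample(reversed(stream), salt="tenant-a"))  # same sample, any order
```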
- Pass 1: Build index (reservoirs, LSH) without modifying data
- Pass 2: Clean data using indices from Pass 1
- Benefit: Deterministic, can replay Pass 2 with different rules
- Built with pandas, polars, pyarrow
- Storage: PostgreSQL, Milvus, Redis
- API: FastAPI, Pydantic
- Infrastructure: Terraform, Kubernetes
- Read: docs/ARCHITECTURE.md (5 min overview)
- Try: Quick start above (10 min hands-on)
- Explore: docs/30DAY_ROADMAP.md (plan for next month)
- Deploy: docs/DEPLOYMENT.md (production setup)
Questions? Open an issue or contact us at srjnupadhyay@gmail.com
Happy cleaning! 🧹✨