feat(open-data-quality): detect duplicate rows in odq-csv #12
Open
Labels
enhancement: New feature or request
Description
Problem
Open data files often contain duplicate rows caused by export errors or data entry mistakes. odq-csv does not currently check for them.
Proposed check
Phase 3 — Content, new check: phase3_duplicate_rows
- Detect exact duplicate rows (all columns match)
- Report: count of duplicates, percentage of total rows, example rows
- Severity: MAJOR (duplicate rows distort aggregations and statistics)
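As a rough illustration of what the check could report, here is a minimal standard-library sketch. It is not the odq-csv implementation: the function signature, the result keys, and the `max_examples` parameter are hypothetical, and it assumes the first line of the file is a header. Each occurrence of a row beyond the first is counted as one duplicate.

```python
import csv
import io
from collections import Counter


def phase3_duplicate_rows(csv_text, max_examples=3):
    """Count exact duplicate rows (all columns match) in CSV text.

    Hypothetical sketch: the result keys below are illustrative only.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumption: the first line is a header, skip it

    # Rows are compared as tuples of raw string values.
    counts = Counter(tuple(row) for row in reader)
    total_rows = sum(counts.values())

    # Every occurrence after the first one counts as a duplicate.
    duplicate_count = sum(n - 1 for n in counts.values() if n > 1)
    percentage = 100.0 * duplicate_count / total_rows if total_rows else 0.0
    examples = [list(row) for row, n in counts.items() if n > 1][:max_examples]

    return {
        "check": "phase3_duplicate_rows",
        "severity": "MAJOR" if duplicate_count else None,
        "total_rows": total_rows,
        "duplicate_count": duplicate_count,
        "duplicate_percentage": round(percentage, 2),
        "examples": examples,
    }


sample = "id,city\n1,Rome\n2,Milan\n1,Rome\n1,Rome\n"
result = phase3_duplicate_rows(sample)
print(result)
```

Comparing raw strings means `1` and `1.0` are treated as different values; whether the check should compare typed values instead is an open design choice.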
Inspiration
Article "5 Useful Python Scripts for Automated Data Quality Checks" (KDnuggets, Feb 2026), script 3 (duplicate record detector).
Implementation hint
DuckDB can detect exact duplicates efficiently:
WITH t AS (SELECT * FROM read_csv_auto('data.csv'))
SELECT (SELECT COUNT(*) FROM t) - (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM t)) AS duplicate_count;
Out of scope (for now)
Near-duplicate rows (fuzzy matching across all columns) — too expensive and domain-specific.