
🧹 imgclean

Find duplicates, blur, corruption, leakage, and quality issues in image datasets before they ship.



Most image datasets have hidden problems. imgclean makes them obvious in one pass, with a CLI that is fast to try and reports that are easy to review with a team.

$ imgclean scan ./dataset --workers 8 --report-dir ./reports
                      Scan Summary
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric              ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total files         │ 12438 │
│ Scanned OK          │ 12397 │
│ Corrupted           │    41 │
│ Total findings      │  1525 │
│   ↳ near duplicate  │  1083 │
│   ↳ exact duplicate │   214 │
└─────────────────────┴───────┘

*(Screenshot: imgclean HTML report preview)*

Highlights

  • One command to scan a dataset and export HTML, JSON, and CSV reports.
  • Built-in checks for corruption, blur, exposure, resolution, aspect ratio, duplicates, and split leakage.
  • Parallel scan path with --workers and config-based parallel.max_workers.
  • Works as both a CLI tool and a Python API for pipelines and notebooks.
  • Safe cleanup workflow with clean, quarantine, and representative-keep actions.
  • Test-backed core with 50 automated test cases and GitHub Actions CI.

Try it in 60 seconds

pip install imgclean
imgclean clean ./dataset --workers 8 --report-dir ./reports

The command writes a shareable HTML report plus machine-readable JSON and CSV outputs in ./reports, then previews the quarantine plan without moving anything unless you add --execute.


🤔 Why imgclean

| Problem | What goes wrong |
|---|---|
| Exact duplicates in training data | Model memorises samples, inflated accuracy |
| Near-duplicates crossing train/val | Evaluation metrics are meaningless |
| Blurry or tiny images | Wasted annotation budget, noisy gradients |
| Corrupted files | Silent crashes in your data loader at 3 AM |
| Overexposed / underexposed frames | Class imbalance in lighting conditions |
| Mislabeled split assignments | You think your model generalises; it does not |

imgclean makes these problems visible in seconds and gives you tools to fix them.


🥊 Compared with other workflows

| Workflow | Duplicate + leakage checks | Cleanup actions | Shareable reports | Best fit |
|---|---|---|---|---|
| imgclean | ✅ built in | clean / quarantine | ✅ HTML + JSON + CSV | Pre-training dataset QA |
| cleanvision | ✅ focused on image issues | ❌ review-only | ⚠️ notebook/report oriented | Exploratory dataset analysis |
| FiftyOne | ⚠️ possible with app workflows | ⚠️ manual curation flows | ✅ interactive app views | Large visual review workflows |
| Manual scripts | ⚠️ custom only | ⚠️ custom only | ❌ usually none | One-off internal jobs |

📦 Installation

pip install imgclean

Optional — CLIP-based near-duplicate detection and outlier analysis:

pip install "imgclean[embeddings]"   # torch + open_clip + faiss-cpu

Development install:

git clone https://github.com/Weiykong/imgclean.git
cd imgclean
python3 -m pip install --user uv
make install
make test

Supported formats: JPEG · PNG · BMP · GIF · TIFF · WebP


🚀 Quick start

CLI

# Full audit — produces HTML, JSON, and CSV reports
imgclean scan ./dataset --workers 8 --report-dir ./reports --open

# Duplicates only, strict threshold
imgclean dedup ./dataset --threshold 4 --workers 8

# Check train/val/test splits for data leakage
imgclean leakage ./train ./val ./test

# Quality checks (blur, exposure, resolution)
imgclean quality ./dataset --workers 8

# Scan and preview a cleanup plan in one step
imgclean clean ./dataset --issues corrupted,blurry --report-dir ./reports

# Preview what would be quarantined, then do it
imgclean quarantine ./dataset --issues corrupted,blurry
imgclean quarantine ./dataset --issues corrupted,blurry --execute

Python API

from imgclean import scan_dataset

report = scan_dataset("./dataset")
print(f"{report.summary.findings_count} issues found in {report.summary.duration_seconds:.1f}s")

# Specific checks only
report = scan_dataset(
    "./dataset",
    checks=["blur", "corruption", "duplicates"],
    thresholds={"blur_laplacian_min": 80.0, "min_width": 128},
)

# Split-aware scan (enables leakage detection)
report = scan_dataset(
    "./dataset",
    splits={"train": "./train", "val": "./val", "test": "./test"},
)

# Iterate findings
for f in report.findings:
    print(f"[{f.severity.value}] {f.issue_type.value}: {f.file_path.name}")

🖥️ CLI reference

imgclean scan — full dataset audit

imgclean scan <path> [OPTIONS]
| Option | Default | Description |
|---|---|---|
| `--config, -c` | | YAML or JSON config file |
| `--report-dir, -o` | `.` | Output directory for reports |
| `--no-html` | false | Skip HTML report |
| `--no-json` | false | Skip JSON report |
| `--no-csv` | false | Skip CSV report |
| `--open` | false | Open HTML in browser after scan |
| `--no-cache` | false | Disable feature cache |
| `--workers, -w` | auto | Max worker threads for image scanning |
| `--verbose, -v` | false | Debug logging |
imgclean scan ./dataset --workers 8 --report-dir ./audit --open --config imgclean.yaml

imgclean dedup — duplicate detection

imgclean dedup <path> [OPTIONS]
| Option | Default | Description |
|---|---|---|
| `--threshold, -t` | 8 | Max Hamming distance (0 = exact byte matches only) |
| `--report-dir, -o` | `.` | Output directory |
| `--workers, -w` | auto | Max worker threads for image scanning |
imgclean dedup ./dataset --threshold 6 --workers 8
imgclean dedup ./dataset --threshold 0   # exact duplicates only
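The threshold is a maximum Hamming distance between 64-bit perceptual hashes. A minimal pure-Python sketch of that comparison (the hash values below are illustrative, not produced by imgclean):

```python
def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits between two 64-bit perceptual hashes."""
    return bin(h1 ^ h2).count("1")

def is_near_duplicate(h1: int, h2: int, threshold: int = 8) -> bool:
    # threshold 0 accepts only identical hashes
    return hamming_distance(h1, h2) <= threshold

a = 0xF0F0F0F0F0F0F0F0
b = 0xF0F0F0F0F0F0F0F1  # one bit flipped
print(hamming_distance(a, b))   # 1
print(is_near_duplicate(a, b))  # True
```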

imgclean leakage — split contamination check

imgclean leakage <train> [val] [test] [OPTIONS]

Detects images (exact or perceptually similar) that appear in more than one split.

imgclean leakage ./train ./val ./test --report-dir ./leakage_report

imgclean quality — quality checks only

imgclean quality <path> [OPTIONS]
| Option | Description |
|---|---|
| `--blur/--no-blur` | Check for blur (default on) |
| `--exposure/--no-exposure` | Check over/underexposure (default on) |
| `--resolution/--no-resolution` | Check resolution (default on) |
| `--workers, -w` | Max worker threads for image scanning |
imgclean quality ./dataset --workers 8 --no-exposure

imgclean clean — scan then quarantine

imgclean clean <path> [OPTIONS]
| Option | Default | Description |
|---|---|---|
| `--issues, -i` | all errors | Comma-separated issue types to quarantine |
| `--out, -o` | `./quarantine` | Destination folder |
| `--execute` | false | Actually move files (default is dry-run) |
| `--report-dir` | `.` | Output directory for HTML, JSON, and CSV reports |
| `--workers, -w` | auto | Max worker threads for image scanning |
# Preview cleanup + write reports
imgclean clean ./dataset --issues corrupted,blurry --workers 8 --report-dir ./reports

# Then execute
imgclean clean ./dataset --issues corrupted --out ./review --execute

imgclean quarantine — move flagged files

imgclean quarantine <path> [OPTIONS]
| Option | Default | Description |
|---|---|---|
| `--issues, -i` | all errors | Comma-separated issue types |
| `--out, -o` | `./quarantine` | Destination folder |
| `--execute` | false | Actually move files (default is dry-run) |
# Preview first
imgclean quarantine ./dataset --issues corrupted,blurry

# Then execute
imgclean quarantine ./dataset --issues corrupted,blurry --out ./review --execute

Valid issue types: corrupted · low_resolution · aspect_ratio · blurry · underexposed · overexposed · exact_duplicate · near_duplicate · split_leakage · outlier


imgclean report — re-render HTML from JSON

imgclean report imgclean_report.json --open
imgclean report results.json --html report_v2.html

🐍 Python API

scan_dataset()

from imgclean import scan_dataset

report = scan_dataset(
    path,                  # str | Path  — dataset root
    config_file=None,      # str | Path  — YAML/JSON config
    checks=None,           # list[str]   — checks to run (None = all enabled)
    thresholds=None,       # dict        — threshold overrides
    splits=None,           # dict[str, Path] — split directories
    cache=True,            # bool        — disk feature cache
    verbose=False,         # bool        — debug logging
)

Working with results

# Summary
s = report.summary
print(s.total_files, s.findings_count, s.issue_counts)

# All findings
for f in report.findings:
    print(f.issue_type.value, f.severity.value, f.file_path, f.score)

# Grouped by type
by_type = report.findings_by_type()
blurry  = by_type.get("blurry", [])
dupes   = by_type.get("exact_duplicate", [])

# Duplicate clusters
groups = {}
for f in dupes:
    groups.setdefault(f.group_id, []).append(f.file_path)

Post-scan actions

from imgclean.actions import quarantine_findings, get_removal_candidates
from imgclean.reports import write_html, write_json
from pathlib import Path

# Write reports manually (API does not write files by default)
write_json(report, Path("report.json"))
write_html(report, Path("report.html"), open_browser=True)

# Quarantine problematic files (dry_run=True by default)
quarantine_findings(
    findings=report.findings,
    quarantine_dir=Path("./quarantine"),
    issue_filter=["corrupted", "blurry"],
    root=Path("./dataset"),
    dry_run=False,   # set True to preview
)

# Files to remove to deduplicate (keeps one representative per cluster)
to_remove = get_removal_candidates(report.findings)

Finding fields

| Field | Type | Description |
|---|---|---|
| `issue_type` | `IssueType` | Enum: corrupted, blurry, exact_duplicate, … |
| `severity` | `Severity` | error · warning · info |
| `file_path` | `Path` | Absolute path to the affected file |
| `message` | `str` | Human-readable explanation |
| `score` | `float \| None` | Measured value (e.g. Laplacian variance, Hamming distance) |
| `threshold` | `float \| None` | Threshold that triggered the finding |
| `related_files` | `list[Path]` | Duplicate partners, leakage matches |
| `group_id` | `str \| None` | Cluster ID for grouped issues |
| `metadata` | `dict` | Extra context (brightness, width/height, …) |

⚙️ Configuration

imgclean scan ./dataset --config imgclean.yaml
Full annotated imgclean.yaml
dataset:
  path: ./dataset
  recursive: true

checks:
  corruption: true
  resolution: true
  aspect_ratio: true
  blur: true
  exposure: true
  exact_duplicates: true
  perceptual_duplicates: true
  embedding_duplicates: false   # requires imgclean[embeddings]
  split_leakage: true
  outliers: false               # requires imgclean[embeddings]

thresholds:
  # Resolution
  min_width: 256
  min_height: 256

  # Aspect ratio  (width / height)
  aspect_ratio_min: 0.1         # flag very tall images
  aspect_ratio_max: 10.0        # flag very wide images

  # Blur  (Laplacian variance — higher = sharper)
  blur_laplacian_min: 60.0

  # Exposure  (mean pixel brightness 0–255)
  exposure_dark_max: 25.0
  exposure_bright_min: 230.0

  # Perceptual duplicates  (pHash Hamming distance)
  phash_hamming_max: 8

  # Embedding duplicates  (cosine similarity 0–1)
  embedding_similarity_min: 0.95

  # Outliers  (kNN on embedding space)
  outlier_knn_k: 5
  outlier_distance_percentile: 95.0

report:
  html: true
  json_report: true
  csv_report: true
  output_dir: ./reports
  open_browser: false

actions:
  quarantine: false
  quarantine_dir: ./quarantine
  dry_run: true          # always preview before executing

cache:
  enabled: true
  dir_name: .imgclean_cache

parallel:
  max_workers: null       # null = ThreadPoolExecutor default

Merge priority (highest wins): CLI flags → config file → built-in defaults
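That priority order behaves like a recursive dict merge where later sources win. A sketch of the idea (not the actual loader code; the key names follow the config above):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override applied; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults    = {"thresholds": {"min_width": 256, "phash_hamming_max": 8}}
config_file = {"thresholds": {"min_width": 128}}
cli_flags   = {"thresholds": {"phash_hamming_max": 4}}

# built-in defaults < config file < CLI flags
effective = deep_merge(deep_merge(defaults, config_file), cli_flags)
print(effective)  # {'thresholds': {'min_width': 128, 'phash_hamming_max': 4}}
```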


🔍 Checks

File integrity

| Check | Issue | Severity | How |
|---|---|---|---|
| corruption | corrupted | 🔴 error | PIL two-pass: `verify()` (header/checksum) + `load()` (pixel decode) |

Quality

| Check | Issue | Severity | How |
|---|---|---|---|
| blur | blurry | 🟡 warning | Variance of the Laplacian — low variance = uniform = blurry |
| exposure | underexposed | 🟡 warning | Mean brightness < `exposure_dark_max` (default 25) |
| exposure | overexposed | 🟡 warning | Mean brightness > `exposure_bright_min` (default 230) |
| resolution | low_resolution | 🟡 warning | Width or height below `min_width` / `min_height` |
| aspect_ratio | aspect_ratio | 🟡 warning | Ratio outside [`aspect_ratio_min`, `aspect_ratio_max`] |
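The blur score is the variance of the Laplacian-filtered image; flat (blurry) regions respond weakly, so their variance is low. A pure-Python sketch on a toy greyscale grid (imgclean uses OpenCV for the real computation):

```python
def laplacian_variance(img: list[list[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels."""
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y-1][x] + img[y+1][x] + img[y][x-1] + img[y][x+1]
                   - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

flat  = [[128.0] * 8 for _ in range(8)]                               # uniform: "blurry"
sharp = [[255.0 * ((x + y) % 2) for x in range(8)] for y in range(8)] # checkerboard
print(laplacian_variance(flat))                             # 0.0
print(laplacian_variance(sharp) > laplacian_variance(flat)) # True
```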

Duplicates

| Check | Issue | Severity | How |
|---|---|---|---|
| exact_duplicates | exact_duplicate | 🟡 warning | SHA-256 hash grouping |
| perceptual_duplicates | near_duplicate | 🟡 warning | pHash + Hamming distance ≤ threshold; union-find clustering |
| embedding_duplicates | embedding_duplicate | 🟡 warning | CLIP cosine similarity ≥ threshold |
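The union-find clustering step can be sketched in a few lines; the pair list below is illustrative, not imgclean's internal representation:

```python
def cluster_pairs(pairs: list[tuple[str, str]]) -> list[set[str]]:
    """Union-find: merge overlapping duplicate pairs into clusters."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two clusters

    clusters: dict[str, set[str]] = {}
    for item in parent:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())

pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("x.jpg", "y.jpg")]
print(cluster_pairs(pairs))  # e.g. [{'a.jpg', 'b.jpg', 'c.jpg'}, {'x.jpg', 'y.jpg'}]
```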

Split integrity

| Check | Issue | Severity | How |
|---|---|---|---|
| split_leakage (exact) | split_leakage | 🔴 error | Same SHA-256 across splits |
| split_leakage (perceptual) | split_leakage | 🟡 warning | pHash Hamming distance ≤ threshold across splits |
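Exact leakage reduces to the same content hash appearing in more than one split. A self-contained sketch with hashlib (the paths and bytes are synthetic):

```python
import hashlib

def exact_leakage(splits: dict[str, dict[str, bytes]]) -> list[tuple[str, ...]]:
    """Map sha256 -> (split, path) pairs; report hashes seen in more than one split."""
    seen: dict[str, list[tuple[str, str]]] = {}
    for split, files in splits.items():
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            seen.setdefault(digest, []).append((split, path))
    return [tuple(f"{s}/{p}" for s, p in hits)
            for hits in seen.values()
            if len({s for s, _ in hits}) > 1]

splits = {
    "train": {"cat_001.jpg": b"\x89PNGcat"},
    "val":   {"cat_copy.jpg": b"\x89PNGcat"},  # same bytes leaked into val
    "test":  {"dog_001.jpg": b"\x89PNGdog"},
}
print(exact_leakage(splits))  # [('train/cat_001.jpg', 'val/cat_copy.jpg')]
```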

Outliers

| Check | Issue | Severity | How |
|---|---|---|---|
| outliers | outlier | 🔵 info | Mean kNN cosine distance above the Nth percentile |

✨ Requires pip install "imgclean[embeddings]"
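The percentile rule can be sketched on toy 2-D points; Euclidean distance and the tiny point set are illustrative stand-ins for cosine distance on CLIP embeddings:

```python
import math

def knn_outliers(points: list[tuple[float, float]], k: int = 2,
                 percentile: float = 90.0) -> list[int]:
    """Indices whose mean distance to their k nearest neighbours exceeds the percentile cutoff."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    scores = []
    for i, p in enumerate(points):
        nearest = sorted(dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        scores.append(sum(nearest) / k)

    # nearest-rank percentile over the score distribution
    cutoff = sorted(scores)[int((len(scores) - 1) * percentile / 100)]
    return [i for i, s in enumerate(scores) if s > cutoff]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # last point is isolated
print(knn_outliers(pts))  # [4]
```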


📄 Outputs

HTML report

A self-contained HTML file (no external dependencies):

  • Summary cards — total files, scanned OK, corrupted, findings by type
  • Per-issue tables — file path · severity · score · threshold · message
  • Cluster view — duplicate and leakage groups, representative highlighted

JSON report

{
  "summary": {
    "total_files": 1000,
    "scanned_files": 997,
    "corrupted_files": 3,
    "findings_count": 142,
    "issue_counts": { "blurry": 31, "exact_duplicate": 44, "corrupted": 3 },
    "duration_seconds": 4.2
  },
  "findings": [
    {
      "issue_type": "blurry",
      "severity": "warning",
      "file_path": "dataset/train/img_042.jpg",
      "score": 12.3,
      "threshold": 60.0,
      "message": "Image appears blurry (Laplacian variance 12.3 < threshold 60.0)."
    }
  ]
}

CSV report

One row per finding — ready for spreadsheet review or programmatic filtering:

issue_type,severity,file_path,score,threshold,group_id,related_files,message
blurry,warning,train/img_042.jpg,12.3,60.0,,,Image appears blurry...
exact_duplicate,warning,train/cat_001.jpg,,,a3b1c9,val/cat_001.jpg,Exact duplicate...
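Because each row is one finding, the CSV slices cleanly with the stdlib `csv` module. A sketch (the inline CSV text stands in for a real report file, whose name may differ):

```python
import csv
import io

CSV_TEXT = """\
issue_type,severity,file_path,score,threshold,group_id,related_files,message
blurry,warning,train/img_042.jpg,12.3,60.0,,,Image appears blurry
exact_duplicate,warning,train/cat_001.jpg,,,a3b1c9,val/cat_001.jpg,Exact duplicate
"""

# In practice: open("reports/your_report.csv") instead of io.StringIO
with io.StringIO(CSV_TEXT) as fh:
    rows = list(csv.DictReader(fh))

blurry = [r["file_path"] for r in rows if r["issue_type"] == "blurry"]
print(blurry)  # ['train/img_042.jpg']
```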

🏗️ Architecture

imgclean follows a strict layered design — each layer has a single responsibility and only depends on layers below it.

┌─────────────────────────────────────────────────────────────┐
│  cli/        Command-line interface (Typer + Rich)          │
│  api/        Public Python API  scan_dataset()              │
├─────────────────────────────────────────────────────────────┤
│  core/       Orchestration: scanner · pipeline · registry   │
├────────────────────────┬────────────────────────────────────┤
│  reports/              │  actions/                          │
│  HTML · JSON · CSV     │  quarantine · move · dedup         │
├─────────────────────────────────────────────────────────────┤
│  checks/     10 independent checks (BaseCheck subclasses)   │
├─────────────────────────────────────────────────────────────┤
│  features/   Laplacian · brightness · pHash · CLIP embeds   │
│  io/         filesystem · image loader · hashing · cache    │
├─────────────────────────────────────────────────────────────┤
│  models/     ImageRecord · Finding · Dataset · ScanReport   │
│  config/     Pydantic schema · YAML/JSON loader             │
│  utils/      logging · timing · parallel_map · thresholds   │
└─────────────────────────────────────────────────────────────┘
Layer-by-layer breakdown

models/ — pure data structures

| File | Class | Description |
|---|---|---|
| `image_record.py` | `ImageRecord` | One image: path, size, format, sha256, phash, corruption flag |
| `finding.py` | `Finding` | One issue: type, severity, score, threshold, related files, cluster id |
| `issue_types.py` | `IssueType`, `Severity` | Enums for all issue and severity types |
| `dataset.py` | `Dataset` | List of `ImageRecord`s with helpers (`valid()`, `by_split()`, `corrupted()`) |
| `report.py` | `ReportSummary`, `ScanReport` | Aggregated results: summary stats + all findings |
| `actions.py` | `ActionType`, `ActionPlan` | Describes a planned file operation |

config/ — typed configuration

| File | Purpose |
|---|---|
| `defaults.py` | Module-level constants for every threshold and setting |
| `schema.py` | Pydantic v2 models with validation (`Config`, `ChecksConfig`, `ThresholdsConfig`, …) |
| `loader.py` | `load_config(path, overrides)` — loads YAML/JSON and deep-merges CLI overrides |

io/ — all file access

| File | Key function(s) |
|---|---|
| `filesystem.py` | `discover_images(root, recursive)` — glob with extension filtering |
| `image_loader.py` | `load_image(path) → LoadResult` — two-pass: `verify()` then `load()` |
| `hashing.py` | `sha256(path)`, `phash(image)`, `dhash(image)`, `hamming_distance(h1, h2)` |
| `cache.py` | `FeatureCache` — JSON disk cache keyed by file path, invalidated on mtime change |
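The cache-invalidation rule (keyed by path, dropped when the file's mtime changes) can be sketched like this; it illustrates the pattern, not the actual `FeatureCache` code:

```python
import json
import os
from pathlib import Path

class MtimeCache:
    """Tiny JSON disk cache: an entry is stale once the file's mtime changes."""
    def __init__(self, cache_file: Path):
        self.cache_file = cache_file
        self.data = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    def get(self, path: str):
        entry = self.data.get(path)
        if entry and entry["mtime"] == os.path.getmtime(path):
            return entry["features"]
        return None  # missing or stale

    def put(self, path: str, features: dict) -> None:
        self.data[path] = {"mtime": os.path.getmtime(path), "features": features}
        self.cache_file.write_text(json.dumps(self.data))
```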

Why two-pass image loading? PIL's `verify()` checks headers and checksums, while `load()` forces full pixel decoding and catches truncated files. Since `verify()` leaves the file object unusable for further reads, the two passes must run in separate `with Image.open(...)` blocks.


features/ — shared computation

| File | Functions | What |
|---|---|---|
| `quality.py` | `laplacian_variance(img)` | Blur score via OpenCV Laplacian |
| `quality.py` | `mean_brightness(img)` | Mean pixel intensity (greyscale, 0–255) |
| `perceptual.py` | `compute_phash(img)`, `compute_dhash(img)` | Perceptual hashes via imagehash |
| `metadata.py` | `file_metadata(path)`, `exif_metadata(img)` | File size, mtime, EXIF tags |
| `embeddings.py` | `embed_image(img)`, `cosine_similarity(a, b)` | CLIP embeddings (lazy-loaded, optional) |

checks/ — analysis logic

Every check inherits BaseCheck and implements one method:

class BaseCheck(ABC):
    name: str           # used in config keys and reports
    description: str

    def run(self, dataset: Dataset) -> list[Finding]: ...
    def is_enabled(self) -> bool: ...   # reads config.checks.<name>

Checks are stateless, independent, and testable in isolation. They never read from disk — the scanner pre-populates all fields on ImageRecord.
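Under that contract a check reduces to a pure function over pre-populated records. An illustrative subclass, with simplified stand-ins for the real `Dataset` and `Finding` types:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Record:   # stand-in for ImageRecord
    path: str
    width: int
    height: int

@dataclass
class Issue:    # stand-in for the real Finding model
    issue_type: str
    file_path: str
    message: str

class BaseCheck(ABC):
    name: str

    @abstractmethod
    def run(self, records: list[Record]) -> list[Issue]: ...

class ResolutionCheck(BaseCheck):
    name = "resolution"

    def __init__(self, min_width: int = 256, min_height: int = 256):
        self.min_width, self.min_height = min_width, min_height

    def run(self, records: list[Record]) -> list[Issue]:
        # No disk access: the scanner already populated width/height.
        return [Issue("low_resolution", r.path,
                      f"{r.width}x{r.height} below {self.min_width}x{self.min_height}")
                for r in records
                if r.width < self.min_width or r.height < self.min_height]

findings = ResolutionCheck().run([Record("a.jpg", 640, 480), Record("b.jpg", 96, 96)])
print([f.file_path for f in findings])  # ['b.jpg']
```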

| Class | Check name | Notes |
|---|---|---|
| `CorruptionCheck` | corruption | Reads `record.is_corrupted` set by scanner |
| `ResolutionCheck` | resolution | Compares `record.width`/`height` to thresholds |
| `AspectRatioCheck` | aspect_ratio | Uses `record.aspect_ratio` property |
| `BlurCheck` | blur | Re-loads image, calls `laplacian_variance()` |
| `ExposureCheck` | exposure | Re-loads image, calls `mean_brightness()` |
| `ExactDuplicatesCheck` | exact_duplicates | Groups by `record.sha256` |
| `PerceptualDuplicatesCheck` | perceptual_duplicates | Union-find on pHash Hamming distances |
| `EmbeddingDuplicatesCheck` | embedding_duplicates | CLIP cosine similarity (optional) |
| `SplitLeakageCheck` | split_leakage | SHA-256 and pHash cross-split comparison |
| `OutliersCheck` | outliers | kNN distance on CLIP embedding matrix (optional) |

core/ — orchestration

| File | Key function | What |
|---|---|---|
| `registry.py` | `build_checks(config)` | Instantiate enabled checks in execution order |
| `scanner.py` | `scan_directory()`, `scan_splits()` | Build `Dataset` from disk, populate `ImageRecord`s |
| `pipeline.py` | `run_pipeline(checks, dataset)` | Run each check, collect findings, log timing |
| `orchestrator.py` | `run_scan(paths, config, split_map)` | Top-level entry point |

Execution order (cheap per-file checks first, expensive group checks last):

Corruption → Resolution → AspectRatio → Blur → Exposure
→ ExactDuplicates → PerceptualDuplicates → EmbeddingDuplicates
→ SplitLeakage → Outliers

reports/ — output generation

| File | Output |
|---|---|
| `html.py` | Self-contained HTML via Jinja2 (`templates/report.html.j2`) |
| `json.py` | Full JSON (summary + all findings as dicts) |
| `csv.py` | One row per finding; `related_files` joined with `\|` |

actions/ — file operations

All functions accept dry_run=True so you can always preview before committing.

| File | Function | What |
|---|---|---|
| `quarantine.py` | `quarantine_findings(...)` | Move flagged files to a review folder |
| `move.py` | `move_files(paths, dest, root, dry_run)` | Move, preserving relative structure |
| `copy.py` | `copy_files(paths, dest, root, dry_run)` | Copy to destination |
| `keep_representative.py` | `select_representatives(findings)` | Pick one file per duplicate cluster |
| `keep_representative.py` | `get_removal_candidates(findings)` | Flat list of non-representative files |

Data flow

images/
  ↓  filesystem.py       discover paths
  ↓  scanner.py          build ImageRecords (load · hash · cache)
  ↓
Dataset[ImageRecord]
  ↓  registry.py         build enabled checks
  ↓  pipeline.py         run each check in order
  ↓
list[Finding]
  ↓  orchestrator.py     build ScanReport + ReportSummary
  ↓
reports/   →  HTML · JSON · CSV
actions/   →  quarantine · dedup cleanup   (optional)

✨ Optional: embedding-based features

pip install "imgclean[embeddings]"

Enables two checks that use CLIP (ViT-B/32):

| Check | What it finds |
|---|---|
| `embedding_duplicates` | Visually similar images even when pHash disagrees — cropped, colour-shifted, or resized variants |
| `outliers` | Images that are visually isolated from the rest of the dataset |
# imgclean.yaml
checks:
  embedding_duplicates: true
  outliers: true

thresholds:
  embedding_similarity_min: 0.95
  outlier_knn_k: 5
  outlier_distance_percentile: 95.0
report = scan_dataset(
    "./dataset",
    checks=["embedding_duplicates", "outliers"],
)

GPU is used automatically when available, with a CPU fallback otherwise.


🧪 Test suite

The repo currently ships with 50 automated tests covering configuration, hashing, duplicate detection, parallel scan plumbing, CLI cleanup flows, reporting, and a synthetic end-to-end scan pipeline.

make test
make lint   # C901 complexity gate

CI runs on Python 3.10, 3.11, and 3.12 for pushes and pull requests.


🗺️ Roadmap

| Version | Features |
|---|---|
| v1.1 | Thumbnail galleries in HTML report · Faster SQLite cache |
| v1.2 | Class-aware analysis · Per-class outliers · Imbalance summary |
| v1.3 | Bounding box sanity checks · Segmentation mask QA |
| v2 | Interactive web UI · Dataset version comparison |

Contributing

git clone https://github.com/Weiykong/imgclean.git
cd imgclean
python3 -m pip install --user uv
make install
make check

See CONTRIBUTING.md for the local setup, command reference, and PR checklist.


License

MIT © Wei Yuan Kong

If imgclean saves you dataset cleanup time, consider starring the repo.
