Sprint 4: DB Deduplication & Cleanup Plugin #35
Description
Overview
The DB accumulates duplicate and low-quality messages over time: re-indexed sessions, near-identical tool results, repeated system prompts, and noise that slipped past ingest filters. This sprint adds a cleanup pipeline with a safe, dry-run-first workflow: run the cleaner, inspect results in the web UI, then decide what to delete.
Goals
- Nightly scheduled cleanup — runs automatically, removes confirmed duplicates and junk
- Dry-run mode — preview what would be deleted without touching the DB
- Review GUI on existing web UI — shows every candidate for deletion with similarity scores, previews, and decision controls
- Iterative threshold tuning — optimise detection logic based on what the GUI reveals
Duplicate Detection Categories
- Exact duplicates — same content hash + same session (re-indexed), or same content hash + same role within 30s window (repeated flush)
- Near-duplicates (semantic) — cosine similarity >= 0.97 between embeddings (configurable threshold)
- Junk/noise — empty/whitespace content, single-token messages (..., OK, .), system prompt duplicates across sessions
- Orphans — messages with no parent session, embeddings with no parent message
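The first three categories can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the planned implementation: the Message dataclass and its fields are hypothetical (the issue does not give the schema), SHA-256 stands in for whatever content hash the DB uses, the 30s window is measured against the most recent matching message, and the junk check covers only the literal examples listed above. System prompt duplicates and orphans are left out because they depend on the real schema.

```python
import hashlib
import math
from dataclasses import dataclass
from itertools import combinations


# Hypothetical message shape; the real schema is not given in this issue.
@dataclass
class Message:
    id: int
    session_id: str
    role: str
    content: str
    timestamp: float                      # seconds since epoch
    embedding: tuple[float, ...] = ()     # empty when no embedding exists


JUNK_TOKENS = {"...", "OK", "."}          # the examples listed in the issue


def content_hash(text: str) -> str:
    # Assumption: SHA-256 of the UTF-8 content stands in for the real hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_exact_duplicates(messages, window_s=30.0):
    """Return ids of later copies: same hash + same session, or
    same hash + same role within window_s of the previous copy."""
    seen_in_session = set()   # (hash, session_id) pairs already encountered
    last_by_role = {}         # (hash, role) -> timestamp of latest occurrence
    dupes = []
    for m in sorted(messages, key=lambda m: (m.timestamp, m.id)):
        h = content_hash(m.content)
        same_session = (h, m.session_id) in seen_in_session
        prev = last_by_role.get((h, m.role))
        in_window = prev is not None and m.timestamp - prev <= window_s
        if same_session or in_window:
            dupes.append(m.id)
        seen_in_session.add((h, m.session_id))
        last_by_role[(h, m.role)] = m.timestamp
    return dupes


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_near_duplicates(messages, threshold=0.97):
    """Return (id_a, id_b, similarity) for every pair at or above threshold.
    Brute force over all pairs, so O(n^2); fine for a sketch only."""
    with_embedding = [m for m in messages if m.embedding]
    pairs = []
    for a, b in combinations(with_embedding, 2):
        sim = cosine(a.embedding, b.embedding)
        if sim >= threshold:
            pairs.append((a.id, b.id, sim))
    return pairs


def find_junk(messages):
    """Return ids of empty/whitespace messages and the listed junk tokens."""
    return [m.id for m in messages
            if not m.content.strip() or m.content.strip() in JUNK_TOKENS]
```

The all-pairs comparison would not scale to a large DB; a real run would likely need an index or a restriction to candidate pairs, which the issue leaves to the full specification.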
Implementation
New module: claw_recall/maintenance/dedup.py with find_exact_duplicates(), find_near_duplicates(), find_junk(), find_orphans(), run_dry_run(), delete_messages()
New web UI page: /cleanup with run controls, threshold slider, category toggles, results table with per-row checkboxes, and sticky action bar (Delete Selected / Delete All / Export CSV)
New API endpoints: POST /api/cleanup/dry-run, GET /api/cleanup/status/<job_id>, POST /api/cleanup/delete, GET /api/cleanup/history
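The dry-run and status endpoints imply a job lifecycle: start a run, get a job id back, poll until results are ready. A rough, framework-free sketch of that lifecycle follows, using a background thread that writes results to a temp file. The names start_dry_run, get_status, and the in-memory _jobs registry are hypothetical, and the JSON layout of the result file is an assumption.

```python
import json
import tempfile
import threading
import uuid

# Hypothetical in-memory registry: job_id -> status record.
_jobs = {}
_jobs_lock = threading.Lock()


def start_dry_run(detect, *args):
    """Run detect(*args) in a background thread.
    Returns (job_id, thread); the thread is returned so callers can join it."""
    job_id = uuid.uuid4().hex
    with _jobs_lock:
        _jobs[job_id] = {"status": "running", "result_path": None, "error": None}

    def worker():
        try:
            candidates = detect(*args)
            # delete=False keeps the file on disk after the handle is closed,
            # so a later status request can still read the results.
            with tempfile.NamedTemporaryFile(
                "w", prefix="cleanup_", suffix=".json", delete=False
            ) as fh:
                json.dump({"job_id": job_id, "candidates": candidates}, fh)
                path = fh.name
            update = {"status": "done", "result_path": path}
        except Exception as exc:  # report the failure instead of losing it
            update = {"status": "failed", "error": str(exc)}
        with _jobs_lock:
            _jobs[job_id].update(update)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return job_id, thread


def get_status(job_id):
    """Return a copy of the job record, or None for an unknown job id."""
    with _jobs_lock:
        job = _jobs.get(job_id)
        return dict(job) if job else None
```

In this sketch a POST /api/cleanup/dry-run handler would call start_dry_run and return the job id, and GET /api/cleanup/status/<job_id> would return get_status(job_id). The registry is lost on restart, so a persistent store may suit the real endpoints better.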
Sprint Steps
- Build claw_recall/maintenance/dedup.py with all detection functions
- Add dry-run background job runner (threaded, results to temp file)
- Add /cleanup page + API endpoints to web.py
- Build review UI (table, checkboxes, action bar, summary stats)
- Add run history table (cleanup_runs in SQLite DB)
- Wire delete endpoint (soft-delete first: is_deleted=1, purge after confirmation)
- Set up nightly cron (dry-run only, results visible in UI next morning)
- First real run: review UI, tune thresholds, delete first batch
- Tests: tests/test_dedup.py, with unit tests for each detection function
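The soft-delete and run-history steps could look roughly like the following with the standard sqlite3 module. Only is_deleted and the cleanup_runs table name come from this issue; the messages table layout, the cleanup_runs columns, and the function names ensure_schema, soft_delete, and purge are assumptions made for illustration.

```python
import sqlite3
import time


def ensure_schema(conn):
    # Hypothetical minimal schema; the real tables will have more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, content TEXT, "
        "is_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cleanup_runs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, started_at REAL NOT NULL, "
        "mode TEXT NOT NULL, candidates INTEGER NOT NULL, "
        "deleted INTEGER NOT NULL)"
    )


def soft_delete(conn, message_ids):
    """Mark messages as deleted and record the run; returns rows changed."""
    ids = list(message_ids)
    if not ids:
        return 0
    # Large batches may need chunking to stay under SQLite's bound-variable limit.
    placeholders = ",".join("?" * len(ids))
    with conn:  # one transaction: the flag update and the history row together
        cur = conn.execute(
            f"UPDATE messages SET is_deleted = 1 "
            f"WHERE is_deleted = 0 AND id IN ({placeholders})",
            ids,
        )
        changed = cur.rowcount
        conn.execute(
            "INSERT INTO cleanup_runs (started_at, mode, candidates, deleted) "
            "VALUES (?, ?, ?, ?)",
            (time.time(), "soft-delete", len(ids), changed),
        )
    return changed


def purge(conn):
    """Permanently remove soft-deleted rows; intended to run after confirmation."""
    with conn:
        cur = conn.execute("DELETE FROM messages WHERE is_deleted = 1")
    return cur.rowcount
```

Keeping the flag update and the history insert in one transaction means a failed run leaves neither a half-applied delete nor a misleading history row.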
Full specification in Dev Tracker: docs.sh read claw-recall-reference (Sprint 4 section)