Sprint 4: DB Deduplication & Cleanup Plugin #35

@rodbland2021

Overview

The DB accumulates duplicate and low-quality messages over time: re-indexed sessions, near-identical tool results, repeated system prompts, and noise that slipped past ingest filters. This sprint adds a cleanup pipeline with a safe, dry-run-first workflow: run the cleaner, inspect the results in the web UI, then decide what to delete.

Goals

  1. Nightly scheduled cleanup: runs automatically, removes confirmed duplicates and junk
  2. Dry-run mode: preview what would be deleted without touching the DB
  3. Review GUI in the existing web UI: shows every deletion candidate with similarity scores, previews, and decision controls
  4. Iterative threshold tuning: optimise detection logic based on what the GUI reveals

Duplicate Detection Categories

  • Exact duplicates: same content hash + same session (re-indexed), or same content hash + same role within a 30 s window (repeated flush)
  • Near-duplicates (semantic): cosine similarity >= 0.97 between embeddings (configurable threshold)
  • Junk/noise: empty or whitespace-only content, single-token messages (..., OK, .), system prompt duplicates across sessions
  • Orphans: messages with no parent session, embeddings with no parent message
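
The first three categories could be sketched in memory as below. This is only an illustration, not the planned implementation: the message dict keys (id, session_id, role, content, ts), the whitespace normalisation before hashing, and the JUNK_TOKENS set are assumptions, and the pairwise near-duplicate scan is O(n^2), so a real run over the whole DB would likely need an index or batching.

```python
import hashlib
import math
from itertools import combinations

# Assumed junk list, taken from the examples in the category above.
JUNK_TOKENS = {"...", "ok", "."}


def content_hash(text):
    """SHA-256 of whitespace-normalised content (normalisation is an assumption)."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()


def find_exact_duplicates(messages, window_s=30):
    """Return ids of messages that repeat an earlier, kept message.

    A message counts as a duplicate when an earlier kept message has the same
    content hash and either the same session, or the same role and a timestamp
    at most window_s seconds older.
    """
    kept_by_hash = {}
    dupes = []
    for msg in sorted(messages, key=lambda m: m["ts"]):
        kept = kept_by_hash.setdefault(content_hash(msg["content"]), [])
        if any(
            k["session_id"] == msg["session_id"]
            or (k["role"] == msg["role"] and msg["ts"] - k["ts"] <= window_s)
            for k in kept
        ):
            dupes.append(msg["id"])
        else:
            kept.append(msg)
    return dupes


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_near_duplicates(embeddings, threshold=0.97):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


def is_junk(text):
    """True for empty/whitespace content or one of the assumed junk tokens."""
    stripped = text.strip()
    return not stripped or stripped.lower() in JUNK_TOKENS
```

Returning the similarity score alongside each pair keeps the data the review table needs, and leaving the threshold as a parameter matches the configurable threshold described above.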

Implementation

New module: claw_recall/maintenance/dedup.py with find_exact_duplicates(), find_near_duplicates(), find_junk(), find_orphans(), run_dry_run(), delete_messages()
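
As an illustration of one of these functions, find_orphans() might reduce to two anti-joins in SQLite. The schema here is an assumption (tables sessions, messages and embeddings, with messages.session_id and embeddings.message_id as parent references); the issue does not state the real table or column names.

```python
import sqlite3


def find_orphans(conn):
    """Return ids of rows whose parent row is missing.

    Assumed schema: messages.session_id -> sessions.id and
    embeddings.message_id -> messages.id.
    """
    orphan_messages = [
        row[0]
        for row in conn.execute(
            "SELECT m.id FROM messages AS m "
            "LEFT JOIN sessions AS s ON s.id = m.session_id "
            "WHERE s.id IS NULL ORDER BY m.id"
        )
    ]
    orphan_embeddings = [
        row[0]
        for row in conn.execute(
            "SELECT e.id FROM embeddings AS e "
            "LEFT JOIN messages AS m ON m.id = e.message_id "
            "WHERE m.id IS NULL ORDER BY e.id"
        )
    ]
    return {"messages": orphan_messages, "embeddings": orphan_embeddings}
```

The function only reads, so it can be called from run_dry_run() without touching the DB.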

New web UI page: /cleanup with run controls, threshold slider, category toggles, results table with per-row checkboxes, and sticky action bar (Delete Selected / Delete All / Export CSV)

New API endpoints: POST /api/cleanup/dry-run, GET /api/cleanup/status/<job_id>, POST /api/cleanup/delete, GET /api/cleanup/history
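
The dry-run and status endpoints imply a job registry: the POST starts a background job and returns a job id, and the GET polls it. A minimal, framework-agnostic sketch follows. The names start_dry_run and get_status are hypothetical, and results are held in memory here for brevity rather than written to a temp file as the sprint steps plan.

```python
import threading
import uuid

# In-memory job registry; a hypothetical stand-in for the planned temp-file store.
_jobs = {}
_jobs_lock = threading.Lock()


def start_dry_run(detect, *args):
    """Run detect(*args) in a background thread and return (job_id, thread)."""
    job_id = uuid.uuid4().hex
    with _jobs_lock:
        _jobs[job_id] = {"state": "running", "candidates": None, "error": None}

    def worker():
        try:
            update = {"state": "done", "candidates": detect(*args)}
        except Exception as exc:  # record the failure instead of losing it
            update = {"state": "failed", "error": str(exc)}
        with _jobs_lock:
            _jobs[job_id].update(update)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return job_id, thread


def get_status(job_id):
    """Return a snapshot of the job record, or None for an unknown id."""
    with _jobs_lock:
        job = _jobs.get(job_id)
        return dict(job) if job else None
```

A POST /api/cleanup/dry-run handler would call start_dry_run() and return the job id; GET /api/cleanup/status/<job_id> would return get_status(job_id) or a 404 when it is None.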

Sprint Steps

  • Build claw_recall/maintenance/dedup.py with all detection functions
  • Add dry-run background job runner (threaded, results to temp file)
  • Add /cleanup page + API endpoints to web.py
  • Build review UI (table, checkboxes, action bar, summary stats)
  • Add run history table (cleanup_runs in SQLite DB)
  • Wire delete endpoint (soft-delete first: is_deleted=1, purge after confirmation)
  • Set up nightly cron (dry-run only, results visible in UI next morning)
  • First real run: review UI, tune thresholds, delete first batch
  • Tests: tests/test_dedup.py — unit tests for each detection function
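
The soft-delete-then-purge step above could look roughly like this with the standard sqlite3 module. Only the is_deleted flag comes from the plan; the messages table layout and the function names soft_delete and purge_deleted are assumptions for illustration.

```python
import sqlite3


def soft_delete(conn, message_ids):
    """Flag the given messages as deleted; return the number of rows flagged."""
    ids = list(message_ids)
    if not ids:
        return 0
    placeholders = ",".join("?" * len(ids))
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            f"UPDATE messages SET is_deleted = 1 WHERE id IN ({placeholders})",
            ids,
        )
    return cur.rowcount


def purge_deleted(conn):
    """Permanently remove soft-deleted messages; return the number removed."""
    with conn:
        cur = conn.execute("DELETE FROM messages WHERE is_deleted = 1")
    return cur.rowcount
```

Keeping the two phases separate means a mistaken batch can still be recovered by resetting is_deleted to 0 before the purge runs.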

Full specification in Dev Tracker: docs.sh read claw-recall-reference (Sprint 4 section)
