# Sprint 4: DB Deduplication & Cleanup Plugin (#35)

## Overview

The DB accumulates duplicate and low-quality messages over time: re-indexed sessions, near-identical tool results, repeated system prompts, and noise that slipped past ingest filters. This sprint adds a cleanup pipeline with a safe **dry-run first** workflow: run the cleaner, inspect the results in the web UI, then decide what to delete.

## Goals

1. **Nightly scheduled cleanup** — runs automatically, removes confirmed duplicates and junk
2. **Dry-run mode** — preview what would be deleted without touching the DB
3. **Review GUI on existing web UI** — shows every candidate for deletion with similarity scores, previews, and decision controls
4. **Iterative threshold tuning** — optimise detection logic based on what the GUI reveals

## Duplicate Detection Categories

- **Exact duplicates** — same content hash + same session (re-indexed), or same content hash + same role within 30s window (repeated flush)
- **Near-duplicates (semantic)** — cosine similarity >= 0.97 between embeddings (configurable threshold)
- **Junk/noise** — empty/whitespace content, single-token messages (`...`, `OK`, `.`), system prompt duplicates across sessions
- **Orphans** — messages with no parent session, embeddings with no parent message
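
The exact-duplicate and junk rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Message` dataclass, its field names, the SHA-256 content hash, and the `JUNK_TOKENS` set are hypothetical and not taken from the Claw Recall codebase. The 30-second rule is interpreted here as "within 30 s of the most recent message with the same hash and role", which is one possible reading.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical shape of a stored message; the real schema may differ.
    id: int
    session_id: str
    role: str
    content: str
    timestamp: float  # seconds since epoch


# Assumed junk tokens, taken from the examples in the list above.
JUNK_TOKENS = {"...", "ok", "."}


def content_hash(text: str) -> str:
    """Stable hash of the message content (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_exact_duplicates(messages, window_s=30.0):
    """Return the ids of messages that duplicate an earlier message."""
    dupes = set()
    seen_in_session = set()   # (hash, session_id) pairs already encountered
    last_seen_by_role = {}    # (hash, role) -> timestamp of latest occurrence
    for m in sorted(messages, key=lambda m: (m.timestamp, m.id)):
        h = content_hash(m.content)
        if (h, m.session_id) in seen_in_session:
            # Same hash in the same session: likely a re-indexed message.
            dupes.add(m.id)
        else:
            prev = last_seen_by_role.get((h, m.role))
            if prev is not None and m.timestamp - prev <= window_s:
                # Same hash and role shortly after: likely a repeated flush.
                dupes.add(m.id)
        seen_in_session.add((h, m.session_id))
        last_seen_by_role[(h, m.role)] = m.timestamp
    return dupes


def is_junk(text: str) -> bool:
    """True for empty/whitespace content or one of the assumed junk tokens."""
    stripped = text.strip()
    return not stripped or stripped.lower() in JUNK_TOKENS
```

Keeping the earliest occurrence and flagging only later copies means a dry run never proposes deleting every copy of a message.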

## Implementation

**New module:** `claw_recall/maintenance/dedup.py` with `find_exact_duplicates()`, `find_near_duplicates()`, `find_junk()`, `find_orphans()`, `run_dry_run()`, `delete_messages()`
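
As a rough idea of what `find_near_duplicates()` might do, here is a minimal pure-Python sketch of the cosine-similarity check with the 0.97 default threshold. The function parameters (a dict of message id to embedding vector) and the return shape are assumptions made for illustration; the planned signature is not specified here.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.97):
    """Return (lower_id, higher_id, score) for every pair at or above threshold.

    `embeddings` is assumed to map message id -> embedding vector.
    """
    pairs = []
    # Sort by id so each pair is reported once, lower id first.
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs
```

This compares every pair, so the cost grows quadratically with the number of messages. A real run over a large DB would likely need to restrict comparisons (for example per session) or use an approximate nearest-neighbour index.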

**New web UI page:** `/cleanup` with run controls, threshold slider, category toggles, results table with per-row checkboxes, and sticky action bar (Delete Selected / Delete All / Export CSV)

**New API endpoints:** `POST /api/cleanup/dry-run`, `GET /api/cleanup/status/<job_id>`, `POST /api/cleanup/delete`, `GET /api/cleanup/history`
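
The dry-run and status endpoints imply a background job keyed by `job_id`. The sketch below shows one way that could work with a thread and a temp file, independent of any web framework. The names `start_dry_run`, `job_status`, the in-memory `_jobs` registry, and the JSON result format are all hypothetical.

```python
import json
import tempfile
import threading
import uuid

# In-memory job registry (an assumption; a real service might persist this).
_jobs = {}
_jobs_lock = threading.Lock()


def start_dry_run(detect):
    """Run `detect()` in a background thread and return (job_id, thread).

    `detect` is any callable returning a JSON-serialisable list of candidates.
    The thread is returned only so callers and tests can join it.
    """
    job_id = uuid.uuid4().hex
    with _jobs_lock:
        _jobs[job_id] = {"state": "running", "result_path": None, "error": None}

    def worker():
        try:
            candidates = detect()
            # Write results to a temp file that outlives this function.
            with tempfile.NamedTemporaryFile(
                "w", suffix=".json", delete=False, encoding="utf-8"
            ) as fh:
                json.dump(candidates, fh)
                path = fh.name
            update = {"state": "done", "result_path": path}
        except Exception as exc:  # report any failure through the status call
            update = {"state": "failed", "error": str(exc)}
        with _jobs_lock:
            _jobs[job_id].update(update)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return job_id, thread


def job_status(job_id):
    """Return a copy of the job record, or None for an unknown id."""
    with _jobs_lock:
        job = _jobs.get(job_id)
        return dict(job) if job is not None else None
```

In this arrangement a `POST /api/cleanup/dry-run` handler would call `start_dry_run(...)` and return the job id, and `GET /api/cleanup/status/<job_id>` would return `job_status(job_id)`.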

## Sprint Steps

- [ ] Build `claw_recall/maintenance/dedup.py` with all detection functions
- [ ] Add dry-run background job runner (threaded, results to temp file)
- [ ] Add `/cleanup` page + API endpoints to `web.py`
- [ ] Build review UI (table, checkboxes, action bar, summary stats)
- [ ] Add run history table (`cleanup_runs` in SQLite DB)
- [ ] Wire delete endpoint (soft-delete first: `is_deleted=1`, purge after confirmation)
- [ ] Set up nightly cron (dry-run only, results visible in UI next morning)
- [ ] First real run: review UI, tune thresholds, delete first batch
- [ ] Tests: `tests/test_dedup.py` — unit tests for each detection function
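
The soft-delete step in the checklist above could look something like this with the standard `sqlite3` module. The `messages` table layout, the `soft_delete` and `purge_deleted` helper names, and the in-memory database are assumptions for illustration; only the `is_deleted=1` flag comes from the plan.

```python
import sqlite3


def soft_delete(conn, message_ids):
    """Mark the given messages as deleted; return the number of rows changed."""
    ids = list(message_ids)
    if not ids:
        return 0
    # One placeholder per id; very large batches would need to be chunked
    # to stay under SQLite's bound-parameter limit.
    placeholders = ",".join("?" for _ in ids)
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            f"UPDATE messages SET is_deleted = 1 "
            f"WHERE is_deleted = 0 AND id IN ({placeholders})",
            ids,
        )
        return cur.rowcount


def purge_deleted(conn):
    """Permanently remove soft-deleted rows; return the number removed."""
    with conn:
        cur = conn.execute("DELETE FROM messages WHERE is_deleted = 1")
        return cur.rowcount


# Usage against a throwaway in-memory DB with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, content TEXT, "
    "is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO messages (id, content) VALUES (?, ?)",
    [(1, "hello"), (2, "hello"), (3, "...")],
)
conn.commit()

marked = soft_delete(conn, [2, 3])  # flag the candidates first
purged = purge_deleted(conn)        # purge only after confirmation
```

Separating the two calls keeps deletion reversible until the purge: clearing `is_deleted` restores a message as long as `purge_deleted` has not run.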

---

Full specification in Dev Tracker: `docs.sh read claw-recall-reference` (Sprint 4 section)
