Merged
33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,39 @@ Format follows [Keep a Changelog](https://keepachangelog.com/). Versioning follo

---

## [2.4.0] — 2026-03-23

### Added
- **DB Cleanup UI** (`/cleanup`) — web interface for reviewing and removing duplicate, noise, junk, and orphaned data
- **Cross-session duplicate detection** — finds identical content indexed from multiple session files (active + archived copies). Similarity scoring at three tiers: Exact (1.0), High (0.95), Medium (0.85)
- **Expandable detail view** — click any similar group to see all copies side-by-side with KEEP/REMOVE labels, session IDs, agents, and timestamps
- **Noise pattern detection** — identifies HEARTBEAT_OK, NO_REPLY, boot check prompts, gateway status messages, health check webhooks
- **Junk detection** — empty/NULL content, orphaned messages (no parent session), single-emoji messages
- **Orphaned embedding detection** — embeddings whose parent message no longer exists (~6KB each)
- **Quick-action delete buttons** — Delete All Duplicates, Delete All Noise, Delete All Junk, Delete All Similar with chunked progress
- **Cleanup run history** — `cleanup_runs` table logs every dry-run and delete with timestamps and counts
- **Snapshot cache** — dry-run results cached to disk (10min TTL) for instant page loads
- **Ingest-time noise filter** — noise messages (heartbeats, boot checks, gateway status) skipped during indexing before they enter the database
- **Cross-session dedup at ingest** — same session UUID from different paths (active vs archive) only indexed once, preventing the #1 source of database bloat
- **`exclude.conf.default`** — ships with the repo documenting all built-in filters (file exclusions, cross-session dedup, noise content filter, tool result filtering, secret redaction)
- 34 new tests in `tests/test_dedup.py`
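
The two ingest-time entries above (noise filter and cross-session dedup) can be pictured with a minimal sketch. It is not the project's code: `NOISE_PATTERNS`, `is_noise`, `filter_for_ingest`, and the `(session_uuid, path, messages)` record shape are all hypothetical, and only the `HEARTBEAT_OK` and `NO_REPLY` markers come from the changelog.

```python
import re

# Hypothetical pattern list; only the two marker strings come from the changelog.
NOISE_PATTERNS = [
    re.compile(r"^\s*HEARTBEAT_OK\s*$"),
    re.compile(r"^\s*NO_REPLY\s*$"),
]


def is_noise(content):
    """Return True when the whole message is a known noise marker."""
    return any(p.match(content) for p in NOISE_PATTERNS)


def filter_for_ingest(records, seen_sessions=None):
    """Yield (session_uuid, path, messages) records that should be indexed.

    A session UUID is indexed only the first time it is seen, so an archived
    copy of an already indexed active session is skipped. Noise messages are
    dropped before they would reach the database.
    """
    seen = set() if seen_sessions is None else seen_sessions
    for session_uuid, path, messages in records:
        if session_uuid in seen:
            continue  # same session under a different path
        seen.add(session_uuid)
        kept = [m for m in messages if not is_noise(m)]
        yield session_uuid, path, kept
```

Passing a persistent `seen_sessions` set (for example, loaded from the database) would let the dedup hold across indexing runs; that wiring is left out here.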

### Performance
- **Parallel detection** — all detection passes run in concurrent threads via ThreadPoolExecutor
- **Optimized GROUP BY** — dropped `content` column from duplicate detection (6.7x faster)
- **Cached page loads** — 0.3s from cache vs 1.5s fresh scan vs 8s+ original
- **Background post-delete refresh** — UI stays responsive after deletions
- **Chunked client-side deletes** — sends 5,000 IDs per batch with progress indicator
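
As a rough illustration of the parallel detection entry, the sketch below runs several independent passes through `ThreadPoolExecutor` and collects the results by name. The pass functions and `run_detection` are hypothetical stand-ins that work on an in-memory list; the real passes presumably query the database.

```python
from concurrent.futures import ThreadPoolExecutor


def find_empty(messages):
    """Indices of empty or whitespace-only messages (a stand-in junk pass)."""
    return [i for i, m in enumerate(messages) if not m.strip()]


def find_noise(messages):
    """Indices of heartbeat messages (a stand-in noise pass)."""
    return [i for i, m in enumerate(messages) if m.strip() == "HEARTBEAT_OK"]


def find_duplicates(messages):
    """Indices of messages whose exact content appeared earlier."""
    seen, dupes = set(), []
    for i, m in enumerate(messages):
        if m in seen:
            dupes.append(i)
        else:
            seen.add(m)
    return dupes


def run_detection(messages, passes):
    """Run every detection pass in its own thread and return results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(passes))) as pool:
        futures = {name: pool.submit(fn, messages) for name, fn in passes.items()}
        return {name: future.result() for name, future in futures.items()}
```

Threads pay off mainly when each pass spends its time waiting on the database rather than running Python code, which is the likely case for SQL-backed detection.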

### Fixed
- Added `credentials: 'include'` to all fetch calls (required when running behind an OAuth2 proxy)
- Embedding cache invalidation after bulk deletes (prevents stale search results)
- Session `message_count` updated after deleting duplicate messages
- FTS5 integrity check after bulk delete operations
- Single-emoji junk count accuracy (previously read from a capped loop counter instead of the full DB count)
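
The `message_count` fix can be sketched as a delete followed by a recount inside one transaction. The schema here (`sessions.id`, `sessions.message_count`, `messages.id`, `messages.session_id`) and the function name are assumptions for illustration, not the project's actual tables.

```python
import sqlite3


def delete_messages_and_recount(conn, message_ids):
    """Delete messages by id and refresh message_count on affected sessions.

    Assumes hypothetical tables sessions(id, message_count) and
    messages(id, session_id). Returns the number of rows deleted.
    """
    ids = list(message_ids)
    if not ids:
        return 0
    placeholders = ",".join("?" * len(ids))
    with conn:  # commit on success, roll back on error
        affected = [
            row[0]
            for row in conn.execute(
                f"SELECT DISTINCT session_id FROM messages WHERE id IN ({placeholders})",
                ids,
            )
        ]
        deleted = conn.execute(
            f"DELETE FROM messages WHERE id IN ({placeholders})", ids
        ).rowcount
        for session_id in affected:
            conn.execute(
                "UPDATE sessions SET message_count = "
                "(SELECT COUNT(*) FROM messages WHERE session_id = ?) WHERE id = ?",
                (session_id, session_id),
            )
    return deleted
```

Recounting from the `messages` table, rather than subtracting the number deleted, keeps the stored count correct even if it had already drifted.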

---

## [2.3.0] — 2026-03-18

### Added
13 changes: 12 additions & 1 deletion README.md
@@ -3,7 +3,7 @@
[![Tests](https://github.com/rodbland2021/claw-recall/actions/workflows/test.yml/badge.svg)](https://github.com/rodbland2021/claw-recall/actions/workflows/test.yml)
[![Discord](https://img.shields.io/discord/1479309142060695664?color=5865F2&logo=discord&logoColor=white&label=Discord)](https://discord.gg/D7YcxVpQAB)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
-[![Version](https://img.shields.io/badge/version-2.3.0-blue)](CHANGELOG.md)
+[![Version](https://img.shields.io/badge/version-2.4.0-blue)](CHANGELOG.md)

**Persistent, searchable memory for AI agents.** When context compaction erases what your agent was just working on, Claw Recall brings it back.

@@ -33,6 +33,17 @@ Claw Recall indexes all your agent conversations into a searchable SQLite databa
- **Remote indexing** — HTTP upload endpoint for multi-machine setups
- **Embedding cache** — full matrix in RAM for ~50ms semantic search across hundreds of thousands of messages
- **Self-hosted** — your data stays on your machine, under $1/month to run
- **Database cleanup** — web UI for detecting and removing duplicates, noise, junk, and cross-session copies with similarity scoring and visual comparison

### Data Quality Pipeline

Claw Recall prevents database bloat at three levels:

1. **Ingest filtering** — noise messages (heartbeats, boot checks, gateway status) are skipped before they enter the database. Cross-session dedup prevents the same session from being indexed twice when it appears in both active and archive paths.
2. **File exclusions** — configurable glob patterns (`exclude.conf`) skip boot check sessions, compaction artifacts, and backup files entirely.
3. **Cleanup UI** (`/cleanup`) — on-demand detection and removal of duplicates, noise, junk, orphaned embeddings, and cross-session copies. Expandable detail view lets you compare matched messages before deleting. Similarity scoring at three tiers (Exact 1.0, High 0.95, Medium 0.85).

See [`exclude.conf.default`](exclude.conf.default) for the full list of built-in filters.
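
To make the three similarity tiers concrete, here is a minimal sketch that maps a score to a tier label. The tier names and values (1.0, 0.95, 0.85) come from the text; treating them as lower bounds is an assumption, and `difflib.SequenceMatcher` is a stand-in metric, since the README does not say how the score is computed.

```python
from difflib import SequenceMatcher

# Tier names and values from the README; using them as lower bounds is an assumption.
TIERS = [("Exact", 1.0), ("High", 0.95), ("Medium", 0.85)]


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; the project's metric may differ."""
    return SequenceMatcher(None, a, b).ratio()


def tier_for(score):
    """Return the first tier whose threshold the score meets, or None."""
    for name, threshold in TIERS:
        if score >= threshold:
            return name
    return None
```

Under these assumptions, `tier_for(similarity(a, b))` returns `None` for pairs scoring below 0.85, which would leave them out of the Similar groups.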

**[Quick Start](#quick-start)** | **[How It Works](#how-it-works)** | **[MCP Tools](#mcp-tools)** | **[CLI](#cli-reference)** | **[REST API](#rest-api)** | **[Full Guide](docs/guide.md)** | **[Community](#community)**

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-2.3.0
+2.4.0