v2.4.0 - DB Cleanup, Cross-Session Dedup, Ingest Filtering #41

Merged
rodbland2021 merged 1 commit into master from release/v2.4.0 on Mar 22, 2026

Conversation

@rodbland2021
Owner

Summary

  • DB Cleanup UI (/cleanup) with 5 detection categories, similarity scoring, and expandable visual comparison
  • Cross-session duplicate prevention at ingest time (the #1 source of database bloat)
  • Noise content filtering at ingest time (heartbeats, boot checks, gateway status)
  • exclude.conf.default documenting all built-in filters

What's in this release

Detection & Cleanup

  • Within-session duplicates (same session_id + role + message_index)
  • Cross-session duplicates with 3 similarity tiers (Exact/High/Medium)
  • Noise patterns (HEARTBEAT_OK, NO_REPLY, boot prompts, gateway status, health check webhooks)
  • Junk (empty content, orphaned messages, single-emoji)
  • Orphaned embeddings
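
For reviewers who want a concrete picture of the within-session check, here is a minimal sketch of grouping on the (session_id, role, message_index) key listed above. It assumes a SQLite database and a hypothetical `messages` table; the actual schema, storage engine, and query in this PR may differ.

```python
import sqlite3


def find_within_session_duplicates(conn: sqlite3.Connection) -> list:
    """Return (session_id, role, message_index, copies) for keys stored more than once."""
    return conn.execute(
        """
        SELECT session_id, role, message_index, COUNT(*) AS copies
        FROM messages
        GROUP BY session_id, role, message_index
        HAVING COUNT(*) > 1
        ORDER BY session_id, message_index
        """
    ).fetchall()


# Demo on an in-memory database. The table layout is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, session_id TEXT, role TEXT, "
    "message_index INTEGER, content TEXT)"
)
conn.executemany(
    "INSERT INTO messages (session_id, role, message_index, content) VALUES (?, ?, ?, ?)",
    [
        ("s1", "user", 0, "hello"),
        ("s1", "user", 0, "hello"),      # same key as the row above: a duplicate
        ("s1", "assistant", 1, "hi"),
        ("s2", "user", 0, "hello"),      # same content, different session: not flagged here
    ],
)
duplicates = find_within_session_duplicates(conn)
print(duplicates)
```

Rows that share content but belong to different sessions are deliberately left to the cross-session category, which uses similarity tiers rather than an exact key match.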

Prevention (new — stops duplicates before they enter the DB)

  • Cross-session dedup: same session UUID from different paths only indexed once
  • Noise content filter: noise messages skipped at ingest time
  • File exclusions: .bak-* compaction backups, boot checks, compaction artifacts
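
The three prevention layers above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function names, the `.jsonl` file extension, and the exact matching rules are hypothetical, and only `HEARTBEAT_OK`, `NO_REPLY`, and the `.bak-*` pattern come from the description.

```python
import fnmatch
from pathlib import PurePosixPath

# Noise markers named in this PR; the real filter covers more patterns.
NOISE_MARKERS = ("HEARTBEAT_OK", "NO_REPLY")
# Assumed glob form of the ".bak-*" compaction backup exclusion.
EXCLUDED_FILE_GLOBS = ("*.bak-*",)


def is_excluded_file(path: str) -> bool:
    """True when the file name matches one of the exclusion globs."""
    name = PurePosixPath(path).name
    return any(fnmatch.fnmatch(name, glob) for glob in EXCLUDED_FILE_GLOBS)


def is_noise(content: str) -> bool:
    """True when the whole message is a known noise marker."""
    return content.strip() in NOISE_MARKERS


def should_ingest_session(path: str, session_uuid: str, seen: set) -> bool:
    """Index a session once per UUID, regardless of which path it was found under."""
    if is_excluded_file(path):
        return False
    if session_uuid in seen:
        return False
    seen.add(session_uuid)
    return True


seen_uuids = set()
decisions = [
    should_ingest_session("/a/1234.jsonl", "1234", seen_uuids),
    should_ingest_session("/b/copy/1234.jsonl", "1234", seen_uuids),          # same UUID, new path
    should_ingest_session("/a/5678.jsonl.bak-20260322", "5678", seen_uuids),  # backup file
    should_ingest_session("/a/5678.jsonl", "5678", seen_uuids),
]
kept_messages = [m for m in ["hello", "HEARTBEAT_OK", " NO_REPLY\n"] if not is_noise(m)]
print(decisions, kept_messages)
```

Checking the file exclusion before recording the UUID means a skipped backup does not block the real session file from being indexed later.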

UI Features

  • 6 summary cards with counts per category
  • 5 tabs: Duplicates, Noise, Junk, Orphan Embeds, Similar
  • Quick-action delete buttons for all categories
  • Chunked deletes with progress indicator
  • Expandable detail view for similar groups (KEEP/REMOVE labels, side-by-side comparison)
  • Snapshot cache for instant page loads
  • Mobile responsive
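
A chunked delete with progress reporting might look like the sketch below. It assumes SQLite and a hypothetical `messages` table keyed by `id`; the chunk size and the `on_progress` callback are illustrative choices, not values taken from this PR.

```python
import sqlite3
from typing import Callable, Iterator, List, Optional


def chunked(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def delete_in_chunks(
    conn: sqlite3.Connection,
    ids: List[int],
    chunk_size: int = 500,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> int:
    """Delete rows by id in batches, committing and reporting after each batch."""
    deleted = 0
    for chunk in chunked(ids, chunk_size):
        placeholders = ",".join("?" * len(chunk))
        with conn:  # commits the batch, or rolls it back on error
            cursor = conn.execute(
                f"DELETE FROM messages WHERE id IN ({placeholders})", chunk
            )
        deleted += cursor.rowcount
        if on_progress is not None:
            on_progress(deleted, len(ids))
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany("INSERT INTO messages (id, content) VALUES (?, ?)",
                 [(i, "row") for i in range(1, 11)])

progress = []
total_deleted = delete_in_chunks(
    conn, list(range(1, 8)), chunk_size=3,
    on_progress=lambda done, total: progress.append((done, total)),
)
remaining = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(total_deleted, progress, remaining)
```

Committing per batch keeps each transaction short and gives the UI a natural point to update its progress indicator.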

Performance

  • Parallel detection via ThreadPoolExecutor
  • Optimized GROUP BY (6.7x faster duplicate detection)
  • 0.3s cached loads, 1.5s fresh scans

Test plan

  • 178 tests passing (34 new in test_dedup.py)
  • Playwright E2E: all tabs, buttons, expand, delete, refresh verified
  • Mobile viewport tested
  • Zero console errors
  • Ingest filter verified: noise messages no longer enter DB
  • Cross-session dedup verified: same UUID from different paths skipped

🤖 Generated with Claude Code

Update VERSION, CHANGELOG, README for v2.4.0 release.
Adds Data Quality Pipeline section to README documenting
the three-layer prevention approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rodbland2021 merged commit bc82a51 into master on Mar 22, 2026
5 checks passed