Description
Problem
EverMemOS stores memories through a three-stage pipeline: Encoding → Consolidation → Retrieval. Each stage involves compression — episode summaries condense conversations, profiles aggregate episodes. This compression is necessary but inherently lossy.
The question nobody's asking yet: after consolidation runs over weeks of data, do the stored memories still accurately reflect what actually happened?
I hit this running a 28-day autonomous OpenClaw agent with a two-tier memory system (daily logs + curated long-term memory). Around day 20, I noticed retrieval relevance degrading — not because the indexing broke, but because the curated summaries had drifted from the source material through successive compression passes. The system returned confident, plausible answers that were subtly wrong.
Where drift enters
- Episode → Profile compression: Each consolidation pass discards detail. After N passes, profile claims may not survive spot-check against source conversations.
- Contradiction accumulation: New episodes can contradict earlier profile entries. Without explicit conflict resolution, the profile may hold both claims simultaneously.
- Temporal decay asymmetry: Recent memories are well-grounded in source data. Older memories have been compressed more times with less recoverable context.
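The contradiction-accumulation failure mode above is mechanically simple to surface: compare profile entries pairwise with a contradiction checker. A minimal sketch, where `contradicts` is a stand-in predicate (in practice an NLI model or LLM judge — that choice is an assumption, not something EverMemOS ships today):

```python
from itertools import combinations

def find_conflicts(profile_entries, contradicts):
    """Return pairs of profile entries whose claims conflict.

    `contradicts` is a hypothetical predicate, e.g. a wrapper around
    an NLI model; any real implementation is out of scope here.
    """
    return [(a, b) for a, b in combinations(profile_entries, 2)
            if contradicts(a, b)]
```

Without a pass like this, both sides of a contradiction persist in the profile and retrieval returns whichever one happens to rank higher.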
Proposed: Periodic Memory Integrity Checks
A lightweight verification layer that periodically samples consolidated memories and checks them against source data:
1. Sample N profile entries (weighted toward older entries)
2. Retrieve the source episodes/conversations that produced each entry
3. Compare the profile claim against source content
4. Score: does the profile accurately represent the source?
5. Flag entries where score drops below threshold
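The five steps above can be sketched as a small sampling loop. Everything here is illustrative: `ProfileEntry`, `fetch_episodes`, and `score_claim` are hypothetical names, and `score_claim` would in practice be an LLM judge or entailment model comparing the claim to its source episodes:

```python
import random
from dataclasses import dataclass

@dataclass
class ProfileEntry:
    claim: str
    age_days: int
    source_episode_ids: list

def sample_entries(entries, n, seed=0):
    """Step 1: sample n entries, weighted toward older ones."""
    rng = random.Random(seed)
    weights = [1 + e.age_days for e in entries]  # older -> more likely
    return rng.choices(entries, weights=weights, k=n)

def spot_check(entries, fetch_episodes, score_claim, n=10, threshold=0.7):
    """Steps 2-5: pull source episodes, score fidelity, flag low scorers."""
    flagged = []
    for entry in sample_entries(entries, n):
        sources = fetch_episodes(entry.source_episode_ids)  # step 2
        score = score_claim(entry.claim, sources)           # steps 3-4
        if score < threshold:                               # step 5
            flagged.append((entry, score))
    return flagged
```

The threshold and age weighting are tuning knobs; the essential property is only that verification reads the *source* episodes, not other consolidated summaries.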
Metrics to track:
- consolidation_accuracy: % of sampled profile entries that survive source verification
- drift_rate: change in accuracy over time (are newer consolidations more accurate than older ones?)
- contradiction_count: number of profile entries that conflict with each other
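Given per-entry verification results, the first two metrics fall out of simple aggregation. A sketch, assuming each result records a fidelity score and the entry's age (field names are illustrative):

```python
def consolidation_accuracy(results, threshold=0.7):
    """Fraction of sampled entries whose fidelity score clears the threshold."""
    passed = sum(1 for r in results if r["score"] >= threshold)
    return passed / len(results)

def drift_rate(results, threshold=0.7, age_split_days=14):
    """Accuracy gap between newer and older entries.

    Positive means older consolidations have drifted more; None if one
    of the age buckets is empty.
    """
    new = [r for r in results if r["age_days"] < age_split_days]
    old = [r for r in results if r["age_days"] >= age_split_days]
    if not new or not old:
        return None
    return consolidation_accuracy(new, threshold) - consolidation_accuracy(old, threshold)
```

Tracked over successive verification runs, these two numbers are enough to tell whether drift is a one-off or a trend.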
Why this matters for EverMemOS specifically
EverMemOS's 93% LoCoMo accuracy measures retrieval quality at a point in time. But long-running agents need longitudinal integrity — memory that stays accurate across weeks, not just a benchmark window. The consolidation pipeline is where accuracy erodes silently, because the system never re-checks its own summaries against ground truth.
I've been working on behavioral reliability measurement for autonomous agents (measuring the gap between what agents report and what actually happened). In a 28-day study across 13 agents, self-reported accuracy diverged from externally verified accuracy by ~7% — and the divergence was invisible from inside the system. Memory consolidation drift is one likely mechanism.
Concrete integration point
Could be implemented as:
- A new `/api/v1/memories/verify` endpoint that runs spot-checks on demand
- A background consolidation hook that samples and verifies after each consolidation cycle
- Dashboard metrics showing memory accuracy trend over time
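The on-demand endpoint could be as thin as the handler below. The `store` accessor methods (`sample_profile_entries`, `get_episodes`, `judge_fidelity`) are hypothetical — they'd map onto whatever EverMemOS's memory store actually exposes — and the route wiring is omitted:

```python
def verify_memories(store, n=20, threshold=0.7):
    """Handler body for the proposed /api/v1/memories/verify endpoint.

    All `store` methods here are assumed names, not the real EverMemOS API.
    """
    results = []
    for entry in store.sample_profile_entries(n):
        sources = store.get_episodes(entry["episode_ids"])
        score = store.judge_fidelity(entry["claim"], sources)
        results.append({"claim": entry["claim"],
                        "score": score,
                        "flagged": score < threshold})
    accuracy = sum(1 for r in results if not r["flagged"]) / max(len(results), 1)
    return {
        "sampled": len(results),
        "consolidation_accuracy": accuracy,
        "flagged": [r for r in results if r["flagged"]],
    }
```

Returning the flagged entries (not just the aggregate score) matters: those are the summaries a human or a repair job would re-derive from source.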
Happy to share the measurement approach and dataset if helpful. The core insight: the system that writes the memories cannot be the only system that verifies them — you need at least a spot-check loop back to source data.