fix: prevent silent memory loss on consolidation LLM failure #601
Open
nicoloboschi wants to merge 18 commits into main from
Conversation
When all LLM retries were exhausted during consolidation, memories were
marked consolidated_at unconditionally, permanently excluding them
from future consolidation runs without producing any observations.
The fix adds two complementary mechanisms plus a recovery endpoint:
- Adaptive batch splitting: on LLM failure, the batch is halved and
retried recursively down to batch_size=1, recovering most transient
failures (rate limits, Pydantic validation errors on long prompts) without
operator intervention
- consolidation_failed_at column: only single-memory batches that still
fail after all retries are marked here instead of consolidated_at, so
they remain visible and retryable
- A new API endpoint, POST /v1/default/banks/{bank_id}/consolidation/retry-failed,
resets these memories for the next consolidation run
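The splitting strategy above can be sketched as follows. This is a minimal illustration, not the PR's actual code: consolidate_batch and mark_failed are hypothetical stand-ins for the real consolidator internals.

```python
# Sketch of the adaptive batch-splitting strategy described above.
# consolidate_batch and mark_failed are illustrative stand-ins, not
# the PR's actual API.
def consolidate_with_splitting(batch, consolidate_batch, mark_failed):
    """Try the whole batch; on LLM failure, halve it and retry the
    sub-batches, down to single-memory batches. Only a single-memory
    batch that still fails is marked failed (consolidation_failed_at)."""
    stack = [list(batch)]
    while stack:
        current = stack.pop()
        if consolidate_batch(current):   # True == LLM call succeeded
            continue                     # whole sub-batch consolidated
        if len(current) == 1:
            mark_failed(current[0])      # permanent failure, stays retryable
        else:
            mid = len(current) // 2      # halve and retry both halves
            stack.append(current[:mid])
            stack.append(current[mid:])

# Demo: an LLM that fails whenever the problematic memory "m3" is in the batch.
consolidated, failed = [], []
def fake_llm(batch):
    if "m3" in batch:
        return False
    consolidated.extend(batch)
    return True

consolidate_with_splitting(["m1", "m2", "m3", "m4"], fake_llm, failed.append)
```

With one bad memory in a batch of four, the healthy memories are consolidated in sub-batches and only the bad one is marked failed.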
…d recovery API
- Migration a3b4c5d6e7f8: add consolidation_failed_at TIMESTAMPTZ column to
memory_units with an index for efficient failure queries; properly chains
off g7h8i9j0k1l2 (backsweep_orphan_observations)
- Consolidator: filter pending memories with consolidation_failed_at IS NULL
so failed memories are not re-fetched in an infinite loop
- Consolidator: adaptive batch splitting: when a batch exhausts all 3 LLM
retries, halve it and retry sub-batches recursively; only single-memory
batches that also exhaust all retries get consolidation_failed_at set
- New tests (9 total) covering: adaptive splitting recovers all memories,
larger batch splitting, single-memory permanent failure, exclusion from
next run, partial batch failure, recover resets columns, recover returns 0
when none failed, recover-then-consolidate succeeds, HTTP endpoint
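In DDL terms, the migration amounts to something like the following sketch. The index name and the choice of a partial index are assumptions, not taken from the PR:

```sql
-- Approximate DDL for migration a3b4c5d6e7f8 (index name and the
-- partial-index predicate are illustrative assumptions):
ALTER TABLE memory_units
    ADD COLUMN consolidation_failed_at TIMESTAMPTZ;

CREATE INDEX ix_memory_units_consolidation_failed_at
    ON memory_units (consolidation_failed_at)
    WHERE consolidation_failed_at IS NOT NULL;  -- only index failures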
The mock LLM was returning {"facts": ...} for ALL calls, including consolidation.
Consolidation doesn't use skip_validation=True, so it expects a _ConsolidationBatchResponse
instance, not a raw dict. Before this PR, consolidation silently swallowed the resulting
AttributeError (failed=False was returned); now failed=True triggers adaptive splitting and timeouts.
Fix: return _ConsolidationBatchResponse() when scope=="consolidation".
… available) Also fix pre-existing type errors: use setattr for the XLM-RoBERTa monkey-patch and add the missing reranker_local_fp16/bucket_batching/batch_size fields to the main.py config constructor.
…wheel conflict
The PyTorch CPU index serves markupsafe==3.0.3 with only cp314 wheels. uv's default first-index strategy stops at the first index that has any version, even if no compatible wheel exists; unsafe-best-match searches all indices for the best compatible wheel, falling back to PyPI for markupsafe.
Configure the pytorch CPU index as explicit=true in pyproject.toml so it is ONLY used for torch (via [tool.uv.sources]). All other packages (including markupsafe) are resolved exclusively from PyPI, preventing the pytorch index from serving incompatible cp314-only wheels for non-pytorch packages. Remove UV_INDEX and UV_INDEX_STRATEGY from the CI workflow (no longer needed since the index is now configured in pyproject.toml).
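The explicit-index pattern described above looks roughly like this in pyproject.toml (the index name is an assumption; the URL is the standard PyTorch CPU wheel index):

```toml
# Sketch of the uv configuration: the pytorch index is only consulted
# for packages explicitly pinned to it via [tool.uv.sources].
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cpu" }
```

With explicit = true, every other dependency (including markupsafe) resolves from PyPI only, so the CPU index can never shadow a PyPI package with an incompatible wheel.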
Summary
When all LLM retries were exhausted in _consolidate_batch_with_llm, an empty _BatchLLMResult was returned and consolidated_at was set unconditionally on all memories in the batch, permanently excluding them from future consolidation runs without producing any observations.
- Adaptive batch splitting: on LLM failure the batch is halved and retried recursively down to batch_size=1. This recovers the vast majority of transient failures (Groq rate limits, Pydantic validation errors on long prompts) without operator intervention.
- consolidation_failed_at column: only single-memory batches that still fail after all retries are marked here instead of consolidated_at, keeping them visible and retryable.
- POST /v1/default/banks/{bank_id}/consolidation/retry-failed resets these memories so they are picked up on the next consolidation run.
Changes
- consolidator.py: add failed flag to _BatchLLMResult, propagate through _process_memory_batch, replace the single-pass batch update with an adaptive splitting queue
- memory_engine.py: new retry_failed_consolidation() method
- http.py: new POST .../consolidation/retry-failed endpoint + RetryFailedConsolidationResponse model
- a3b4c5d6e7f8: adds consolidation_failed_at TIMESTAMPTZ column and index to memory_units
Notes
Python/TypeScript clients not regenerated in this PR (Docker unavailable at commit time) — can be regenerated in a follow-up or locally before merge.
Test plan
- On permanent failure: consolidation_failed_at is set and consolidated_at is NOT set
- POST .../consolidation/retry-failed: verify retried_count matches, both columns reset to NULL, memory picked up on next run
- Memories with consolidation_failed_at are excluded from the normal consolidated_at IS NULL selection (i.e. they don't loop until retried)
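The selection and reset semantics being tested can be modeled in a few lines. This is a toy in-memory stand-in for the real SQLAlchemy queries; all names besides the two column names are illustrative:

```python
# Toy model of the pending-memory selection and retry-failed reset
# (stand-in for the real SQLAlchemy queries; names are illustrative).
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class MemoryUnit:
    id: int
    consolidated_at: Optional[datetime] = None
    consolidation_failed_at: Optional[datetime] = None

def pending(memories):
    # Mirrors: WHERE consolidated_at IS NULL AND consolidation_failed_at IS NULL
    return [m for m in memories
            if m.consolidated_at is None and m.consolidation_failed_at is None]

def retry_failed(memories):
    # Mirrors the retry-failed endpoint: reset both columns and report
    # retried_count so the memories are picked up on the next run.
    count = 0
    for m in memories:
        if m.consolidation_failed_at is not None:
            m.consolidation_failed_at = None
            m.consolidated_at = None
            count += 1
    return count

now = datetime.now(timezone.utc)
mems = [MemoryUnit(1),
        MemoryUnit(2, consolidation_failed_at=now),   # permanently failed
        MemoryUnit(3, consolidated_at=now)]           # already consolidated
```

Failed memories stay out of the pending set (no infinite loop) until retry_failed resets them, after which they become eligible again.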