
fix: prevent silent memory loss on consolidation LLM failure#601

Open
nicoloboschi wants to merge 18 commits into main from fix/consolidation-failed-at

Conversation

@nicoloboschi
Collaborator

Summary

  • Root cause: when all LLM retries were exhausted in _consolidate_batch_with_llm, an empty _BatchLLMResult was returned and consolidated_at was set unconditionally on all memories in the batch — permanently excluding them from future consolidation runs without producing any observations.
  • Adaptive batch splitting: on LLM failure, the batch is now halved and retried recursively down to batch_size=1. This recovers the vast majority of transient failures (Groq rate limits, Pydantic validation errors on long prompts) without operator intervention.
  • consolidation_failed_at column: only single-memory batches that still fail after all retries are marked here instead of consolidated_at, keeping them visible and retryable.
  • Recovery API: POST /v1/default/banks/{bank_id}/consolidation/retry-failed resets these memories so they are picked up on the next consolidation run.
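The adaptive splitting described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `consolidate_with_splitting`, `call_llm`, and `Outcome` are hypothetical names, and the real logic lives in `consolidator.py`'s batch-update queue.

```python
# Sketch of adaptive batch splitting: try the whole batch; on LLM failure,
# halve it and requeue both halves, down to batch_size=1. Only size-1
# batches that still fail are marked consolidation_failed_at.
from dataclasses import dataclass, field

@dataclass
class Outcome:
    consolidated: list = field(default_factory=list)  # would get consolidated_at
    failed: list = field(default_factory=list)        # would get consolidation_failed_at

def consolidate_with_splitting(batch, call_llm):
    """Queue-based splitting: halve failing batches until size 1."""
    out = Outcome()
    queue = [batch]
    while queue:
        chunk = queue.pop()
        try:
            call_llm(chunk)          # raises once all LLM retries are exhausted
            out.consolidated += chunk
        except Exception:
            if len(chunk) == 1:
                out.failed += chunk  # permanent failure: stays visible/retryable
            else:
                mid = len(chunk) // 2
                queue += [chunk[:mid], chunk[mid:]]
    return out
```

With a flaky LLM that only fails on batches containing one "poison" memory, every other memory is recovered and only the poison memory ends up in `failed`.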

Changes

  • consolidator.py: add failed flag to _BatchLLMResult, propagate through _process_memory_batch, replace the single-pass batch update with an adaptive splitting queue
  • memory_engine.py: new retry_failed_consolidation() method
  • http.py: new POST .../consolidation/retry-failed endpoint + RetryFailedConsolidationResponse model
  • Migration a3b4c5d6e7f8: adds consolidation_failed_at TIMESTAMPTZ column and index to memory_units
  • OpenAPI spec updated

Notes

Python/TypeScript clients were not regenerated in this PR (Docker was unavailable at commit time); they can be regenerated in a follow-up or locally before merge.

Test plan

  • Run consolidation with a model that produces Pydantic validation errors on large batches — verify batch is split and smaller sub-batches succeed
  • Force a single-memory failure (e.g. mock LLM to always fail) — verify consolidation_failed_at is set and consolidated_at is NOT set
  • Call POST .../consolidation/retry-failed — verify retried_count matches, both columns reset to NULL, memory picked up on next run
  • Verify memories with consolidation_failed_at are excluded from the normal consolidated_at IS NULL selection (i.e. they don't loop until retried)

@nicoloboschi force-pushed the fix/consolidation-failed-at branch 3 times, most recently from 13e744f to 5de47a1 on March 17, 2026 15:47
When all LLM retries are exhausted during consolidation, memories were
being marked consolidated_at unconditionally, permanently excluding them
from future consolidation runs without producing any observations.

Fix with two complementary mechanisms plus a recovery endpoint:
- Adaptive batch splitting: on LLM failure, the batch is halved and
  retried recursively down to batch_size=1, recovering most transient
  failures (rate limits, Pydantic validation on long prompts) without
  operator intervention
- consolidation_failed_at column: only single-memory batches that still
  fail after all retries are marked here instead of consolidated_at, so
  they remain visible and retryable
- New API endpoint POST /v1/default/banks/{bank_id}/consolidation/retry-failed
  resets these memories for the next consolidation run
…d recovery API

- Migration a3b4c5d6e7f8: add consolidation_failed_at TIMESTAMPTZ column to
  memory_units with an index for efficient failure queries; properly chains off
  g7h8i9j0k1l2 (backsweep_orphan_observations)
- Consolidator: filter pending memories with consolidation_failed_at IS NULL
  so failed memories are not re-fetched in an infinite loop
- Consolidator: adaptive batch splitting — when a batch exhausts all 3 LLM
  retries, halve it and retry sub-batches recursively; only single-memory
  batches that also exhaust all retries get consolidation_failed_at set
- New tests (9 total) covering: adaptive splitting recovers all memories,
  larger batch splitting, single-memory permanent failure, exclusion from
  next run, partial batch failure, recover resets columns, recover returns
  0 when none failed, recover-then-consolidate succeeds, HTTP endpoint
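The selection filter and recovery reset described in these commits can be illustrated with a self-contained sqlite3 stand-in. The real schema is Postgres with TIMESTAMPTZ columns; only the table and column names are taken from the PR, everything else here is illustrative.

```python
# sqlite3 stand-in for the memory_units selection after the migration:
# memories with consolidation_failed_at set are excluded from the normal
# "consolidated_at IS NULL" pending query until explicitly retried.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memory_units (
    id INTEGER PRIMARY KEY,
    consolidated_at TEXT,           -- TIMESTAMPTZ in the real schema
    consolidation_failed_at TEXT    -- column added by migration a3b4c5d6e7f8
);
CREATE INDEX ix_mu_consolidation_failed_at
    ON memory_units (consolidation_failed_at);
INSERT INTO memory_units VALUES
    (1, NULL, NULL),                      -- pending: picked up
    (2, '2026-03-17T15:47:00Z', NULL),    -- consolidated: skipped
    (3, NULL, '2026-03-17T15:47:00Z');    -- failed: skipped until retried
""")

def pending_ids(conn):
    # the consolidator's pending-memory filter
    rows = conn.execute("""
        SELECT id FROM memory_units
        WHERE consolidated_at IS NULL
          AND consolidation_failed_at IS NULL
        ORDER BY id
    """).fetchall()
    return [r[0] for r in rows]

def retry_failed(conn):
    # mirrors retry_failed_consolidation(): reset failed memories for rerun
    cur = conn.execute(
        "UPDATE memory_units SET consolidation_failed_at = NULL "
        "WHERE consolidation_failed_at IS NOT NULL")
    return cur.rowcount
```

Before the reset only memory 1 is pending; after `retry_failed` returns 1, memory 3 is picked up again on the next run.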
The mock LLM was returning {"facts": ...} for ALL calls, including consolidation.
Consolidation doesn't use skip_validation=True, so it expects a _ConsolidationBatchResponse
instance, not a raw dict. Before this PR, consolidation silently swallowed the resulting
AttributeError (failed=False was returned); now failed=True triggers adaptive splitting and timeouts.

Fix: return _ConsolidationBatchResponse() when scope=="consolidation".
… available)

Also fix pre-existing type errors: use setattr for the XLM-RoBERTa monkey-patch
and add the missing reranker_local_fp16/bucket_batching/batch_size fields to the main.py config constructor.
@nicoloboschi force-pushed the fix/consolidation-failed-at branch from f253b8c to e6948ba on March 17, 2026 17:12
…wheel conflict

PyTorch's CPU index serves markupsafe==3.0.3 with only cp314 wheels, and uv's
default first-index strategy stops at the first index that has any version of a
package, even if no compatible wheel exists there (unsafe-best-match would
instead search all indexes for the best compatible wheel, falling back to PyPI
for markupsafe).

Configure the PyTorch CPU index as explicit=true in pyproject.toml so it is
ONLY used for torch (via [tool.uv.sources]). All other packages (including
markupsafe) are then resolved exclusively from PyPI, preventing the PyTorch
index from serving incompatible cp314-only wheels for non-PyTorch packages.

Remove UV_INDEX and UV_INDEX_STRATEGY from the CI workflow (no longer needed
now that the index is configured in pyproject.toml).
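The pyproject.toml arrangement described above plausibly looks like this (a sketch; the index name `pytorch-cpu` is illustrative):

```toml
# Explicit index: only consulted for packages that opt in via tool.uv.sources,
# so markupsafe and everything else resolve from PyPI.
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cpu" }
```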
