
fix: prevent silent memory loss on consolidation LLM failure#601

Open
nicoloboschi wants to merge 18 commits into main from fix/consolidation-failed-at

Conversation

@nicoloboschi
Collaborator

Summary

  • Root cause: when all LLM retries were exhausted in _consolidate_batch_with_llm, an empty _BatchLLMResult was returned and consolidated_at was set unconditionally on all memories in the batch — permanently excluding them from future consolidation runs without producing any observations.
  • Adaptive batch splitting: on LLM failure, the batch is now halved and retried recursively down to batch_size=1. This recovers the vast majority of transient failures (Groq rate limits, Pydantic validation errors on long prompts) without operator intervention.
  • consolidation_failed_at column: only single-memory batches that still fail after all retries are marked here instead of consolidated_at, keeping them visible and retryable.
  • Recovery API: POST /v1/default/banks/{bank_id}/consolidation/retry-failed resets these memories so they are picked up on the next consolidation run.
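The adaptive splitting described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `consolidate_with_splitting`, `call_llm`, and `Outcome` are hypothetical names, and the real logic lives in `consolidator.py`'s batch-update queue.

```python
# Sketch of adaptive batch splitting: try the whole batch; on LLM failure,
# halve it and requeue both halves, down to batch_size=1. Only size-1
# batches that still fail are marked consolidation_failed_at.
from dataclasses import dataclass, field

@dataclass
class Outcome:
    consolidated: list = field(default_factory=list)  # would get consolidated_at
    failed: list = field(default_factory=list)        # would get consolidation_failed_at

def consolidate_with_splitting(batch, call_llm):
    """Queue-based splitting: halve failing batches until size 1."""
    out = Outcome()
    queue = [batch]
    while queue:
        chunk = queue.pop()
        try:
            call_llm(chunk)          # raises once all LLM retries are exhausted
            out.consolidated += chunk
        except Exception:
            if len(chunk) == 1:
                out.failed += chunk  # permanent failure: stays visible/retryable
            else:
                mid = len(chunk) // 2
                queue += [chunk[:mid], chunk[mid:]]
    return out
```

With a flaky LLM that only fails on batches containing one "poison" memory, every other memory is recovered and only the poison memory ends up in `failed`.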

Changes

  • consolidator.py: add failed flag to _BatchLLMResult, propagate through _process_memory_batch, replace the single-pass batch update with an adaptive splitting queue
  • memory_engine.py: new retry_failed_consolidation() method
  • http.py: new POST .../consolidation/retry-failed endpoint + RetryFailedConsolidationResponse model
  • Migration a3b4c5d6e7f8: adds consolidation_failed_at TIMESTAMPTZ column and index to memory_units
  • OpenAPI spec updated

Notes

Python/TypeScript clients were not regenerated in this PR (Docker was unavailable at commit time); they can be regenerated in a follow-up or locally before merge.

Test plan

  • Run consolidation with a model that produces Pydantic validation errors on large batches — verify batch is split and smaller sub-batches succeed
  • Force a single-memory failure (e.g. mock LLM to always fail) — verify consolidation_failed_at is set and consolidated_at is NOT set
  • Call POST .../consolidation/retry-failed — verify retried_count matches, both columns reset to NULL, memory picked up on next run
  • Verify memories with consolidation_failed_at are excluded from the normal consolidated_at IS NULL selection (i.e. they don't loop until retried)

@nicoloboschi force-pushed the fix/consolidation-failed-at branch 3 times, most recently from 13e744f to 5de47a1 on March 17, 2026 15:47
When all LLM retries are exhausted during consolidation, memories were
being marked consolidated_at unconditionally, permanently excluding them
from future consolidation runs without producing any observations.

Fix with two complementary mechanisms plus a recovery endpoint:
- Adaptive batch splitting: on LLM failure, the batch is halved and
  retried recursively down to batch_size=1, recovering most transient
  failures (rate limits, Pydantic validation on long prompts) without
  operator intervention
- consolidation_failed_at column: only single-memory batches that still
  fail after all retries are marked here instead of consolidated_at, so
  they remain visible and retryable
- New API endpoint POST /v1/default/banks/{bank_id}/consolidation/retry-failed
  resets these memories for the next consolidation run
…d recovery API

- Migration a3b4c5d6e7f8: add consolidation_failed_at TIMESTAMPTZ column to
  memory_units with an index for efficient failure queries; properly chains off
  g7h8i9j0k1l2 (backsweep_orphan_observations)
- Consolidator: filter pending memories with consolidation_failed_at IS NULL
  so failed memories are not re-fetched in an infinite loop
- Consolidator: adaptive batch splitting — when a batch exhausts all 3 LLM
  retries, halve it and retry sub-batches recursively; only single-memory
  batches that also exhaust all retries get consolidation_failed_at set
- New tests (9 total) covering: adaptive splitting recovers all memories,
  larger batch splitting, single-memory permanent failure, exclusion from
  next run, partial batch failure, recover resets columns, recover returns
  0 when none failed, recover-then-consolidate succeeds, HTTP endpoint
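The selection filter and recovery reset described in these commits can be illustrated with a self-contained sqlite3 stand-in. The real schema is Postgres with TIMESTAMPTZ columns; only the table and column names are taken from the PR, everything else here is illustrative.

```python
# sqlite3 stand-in for the memory_units selection after the migration:
# memories with consolidation_failed_at set are excluded from the normal
# "consolidated_at IS NULL" pending query until explicitly retried.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memory_units (
    id INTEGER PRIMARY KEY,
    consolidated_at TEXT,           -- TIMESTAMPTZ in the real schema
    consolidation_failed_at TEXT    -- column added by migration a3b4c5d6e7f8
);
CREATE INDEX ix_mu_consolidation_failed_at
    ON memory_units (consolidation_failed_at);
INSERT INTO memory_units VALUES
    (1, NULL, NULL),                      -- pending: picked up
    (2, '2026-03-17T15:47:00Z', NULL),    -- consolidated: skipped
    (3, NULL, '2026-03-17T15:47:00Z');    -- failed: skipped until retried
""")

def pending_ids(conn):
    # the consolidator's pending-memory filter
    rows = conn.execute("""
        SELECT id FROM memory_units
        WHERE consolidated_at IS NULL
          AND consolidation_failed_at IS NULL
        ORDER BY id
    """).fetchall()
    return [r[0] for r in rows]

def retry_failed(conn):
    # mirrors retry_failed_consolidation(): reset failed memories for rerun
    cur = conn.execute(
        "UPDATE memory_units SET consolidation_failed_at = NULL "
        "WHERE consolidation_failed_at IS NOT NULL")
    return cur.rowcount
```

Before the reset only memory 1 is pending; after `retry_failed` returns 1, memory 3 is picked up again on the next run.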
The mock LLM was returning {"facts": ...} for ALL calls, including consolidation.
Consolidation doesn't use skip_validation=True, so it expects a _ConsolidationBatchResponse
instance, not a raw dict. Before this PR, consolidation silently swallowed the resulting
AttributeError (failed=False was returned); now failed=True triggers adaptive splitting and timeouts.

Fix: return _ConsolidationBatchResponse() when scope=="consolidation".
… available)

Also fix pre-existing type errors: use setattr for the XLM-RoBERTa monkey-patch
and add the missing reranker_local_fp16/bucket_batching/batch_size fields to the main.py config constructor.
@nicoloboschi force-pushed the fix/consolidation-failed-at branch from f253b8c to e6948ba on March 17, 2026 17:12
…wheel conflict

PyTorch's CPU index serves markupsafe==3.0.3 with only cp314 wheels, and uv's
default first-index strategy stops at the first index that has any version of a
package, even if no compatible wheel exists there (unsafe-best-match would
instead search all indexes for the best compatible wheel, falling back to PyPI
for markupsafe).

Configure the PyTorch CPU index as explicit=true in pyproject.toml so it is
ONLY used for torch (via [tool.uv.sources]). All other packages (including
markupsafe) are then resolved exclusively from PyPI, preventing the PyTorch
index from serving incompatible cp314-only wheels for non-PyTorch packages.

Remove UV_INDEX and UV_INDEX_STRATEGY from the CI workflow (no longer needed
now that the index is configured in pyproject.toml).
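The pyproject.toml arrangement described above plausibly looks like this (a sketch; the index name `pytorch-cpu` is illustrative):

```toml
# Explicit index: only consulted for packages that opt in via tool.uv.sources,
# so markupsafe and everything else resolve from PyPI.
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cpu" }
```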
