20260323 implementation 1 complete #38
Enhance corpus management and retrieval features

- Added detailed checks in the doctor module to validate chunk kinds, metadata projections, and lexical backend modes.
- Improved retrieval logic to support reranking based on journal entry dates and document types.
- Updated configuration settings for retrieval parameters to optimize performance.
- Expanded test coverage for corpus contract and retrieval functionalities, ensuring stability across metadata transitions and chunk retrieval.
- Introduced new tests for bounded reranking logic to validate intent-based document selection.
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/f7276d0c-913e-4689-81ff-38f2fa28b6af
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/58bb299a-9573-4ba1-910e-94b49fcd5f99
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/52d647e9-7632-4825-ae04-a251da886f28
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/7fbebb69-7d69-4171-b5d6-cebad9c64768
…icate-document-identity Clarify duplicate document identity as an explicit fail-fast corpus contract
…persist-lexical-backend Persist retrieval backend and rerank diagnostics in grounded run artifacts
…align-metadata-projection Align metadata projection and recency rerank to the locked retrieval contract
Pull request overview
This PR enhances corpus indexing and retrieval by introducing metadata chunks and a dedicated lexical projection (with optional FTS5), and by persisting richer retrieval diagnostics (including bounded reranking signals) into grounded run artifacts.
Changes:
- Add per-document metadata chunks and versioned projection signatures to stabilize metadata retrieval and contracts.
- Introduce `chunk_search` lexical projection (plus optional `chunk_search_fts`) and expose lexical backend diagnostics through retrieval/doctor/grounding.
- Add bounded reranking based on intent (journal/recent) using `journal_entry_date` and `mtime`, with expanded/updated tests.
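
The optional FTS5 backend mentioned above implies some availability check with a visible fallback. The following is a minimal sketch of one common way to probe for FTS5 with Python's standard `sqlite3` module. The function names and the mode strings `"fts5"` and `"fallback"` are hypothetical and are not taken from this PR.

```python
import sqlite3


def fts5_available() -> bool:
    """Return True if the linked SQLite library can create FTS5 tables.

    Probes on a throwaway in-memory connection so no real database is touched.
    """
    probe = sqlite3.connect(":memory:")
    try:
        probe.execute("CREATE VIRTUAL TABLE fts5_probe USING fts5(x)")
    except sqlite3.OperationalError:
        # Raised with "no such module: fts5" when FTS5 is not compiled in.
        return False
    finally:
        probe.close()
    return True


def lexical_backend_mode() -> str:
    """Hypothetical mode label that a diagnostics report could surface."""
    return "fts5" if fts5_available() else "fallback"
```

Reporting the chosen mode explicitly, rather than silently degrading, is what makes the fallback visible to doctor checks and run artifacts.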
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_runtime_app.py | Asserts grounded run.json persists retrieval diagnostics; updates embedding count expectation for metadata chunks. |
| tests/test_retrieval_contract.py | New contract tests for metadata lookup, lexical backend fallback visibility, and bounded rerank behavior. |
| tests/test_doctor_runtime.py | Extends runtime doctor assertions for chunk kinds, projection versions, and lexical backend mode. |
| tests/test_corpus_contract.py | Updates corpus contract expectations for metadata chunks, stability across transitions, and duplicate identity handling. |
| configs/default.yaml | Raises retrieval K defaults and grounded evidence count. |
| agent/retrieval.py | Adds chunk kind, lexical backend diagnostics, and bounded rerank diagnostics to retrieval results. |
| agent/grounding.py | Persists new retrieval diagnostics fields into grounded run.json. |
| agent/doctor.py | Adds lexical projection consistency checks and reports backend mode/warnings. |
| agent/corpus_db.py | Bumps schema to v6; adds chunk_search/FTS support and new lexical query implementation. |
| agent/corpus.py | Emits metadata chunks, enforces strict duplicate identity messaging, and rebuilds lexical projection on sync. |
| agent/chunking.py | Adds chunk kind constants, metadata projection builder, and includes chunk_kind in stable chunk keys. |
| agent/app_types.py | Extends ChunkRecord with chunk_kind. |
| README.md | Documents strict global document identity rules across sources. |
| OPERATOR_QUICKREF.md | Adds operator guidance for DUPLICATE_DOCUMENT_IDENTITY. |

    FROM chunk_search
    WHERE
        instr(lower(body_text), lower(?)) > 0
        OR instr(lower(chunk_title), lower(?)) > 0
        OR instr(lower(heading_path), lower(?)) > 0

`_query_chunk_search_fallback()` issues an unbounded `SELECT` over `chunk_search` (no `LIMIT`), so common short queries can return a very large result set that is then fully materialized and sorted in Python. This can become a major latency/memory problem when FTS5 is unavailable and the fallback path is used. Consider adding a `LIMIT` (e.g., the same `fetch_limit` used for the FTS path) and threading a limit parameter through the fallback helper so the ranking work stays bounded.
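
A rough sketch of the suggested change, assuming a plain `chunk_search` table with the three text columns shown in the quoted query. The `chunk_id` column, the `make_demo_db` helper, and the function signature are hypothetical stand-ins, not the code in `agent/corpus_db.py`.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory chunk_search table (illustrative schema only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunk_search ("
        "chunk_id TEXT, chunk_title TEXT, heading_path TEXT, body_text TEXT)"
    )
    rows = [(f"c{i}", f"Title {i}", "Notes/Demo", f"Alpha note {i}") for i in range(5)]
    conn.executemany("INSERT INTO chunk_search VALUES (?, ?, ?, ?)", rows)
    return conn


def query_chunk_search_fallback(conn: sqlite3.Connection, query: str, limit: int):
    """Case-insensitive substring match, bounded by LIMIT in SQL."""
    sql = """
        SELECT chunk_id, chunk_title, heading_path, body_text
        FROM chunk_search
        WHERE
            instr(lower(body_text), lower(?)) > 0
            OR instr(lower(chunk_title), lower(?)) > 0
            OR instr(lower(heading_path), lower(?)) > 0
        ORDER BY rowid  -- deterministic choice of which rows survive the cap
        LIMIT ?
    """
    # The same query string is bound once per searched column, then the cap.
    return conn.execute(sql, (query, query, query, limit)).fetchall()
```

With the cap applied in SQL, Python only ever ranks at most `limit` rows. The trade-off is that rows beyond the cap are never scored, so the limit should be comfortably larger than the final result count.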
@copilot open a new pull request to apply changes based on this feedback
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
@duck-lint I've opened a new pull request, #39, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/bed97c53-aa6e-4946-87d6-4562d2cf0d13
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
…orcement Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/ad4ff7de-5ca2-4fd0-9a25-44ff1603d5f3
Bound `_query_chunk_search_fallback` result set with LIMIT
Enhance corpus management and retrieval features