20260323 implementation 1 complete by duck-lint · Pull Request #38 · duck-lint/local-agent

duck-lint · 2026-03-23T19:43:28Z

Enhance corpus management and retrieval features

- Added detailed checks in the doctor module to validate chunk kinds, metadata projections, and lexical backend modes.
- Improved retrieval logic to support reranking based on journal entry dates and document types.
- Updated configuration settings for retrieval parameters to optimize performance.
- Expanded test coverage for corpus contract and retrieval functionalities, ensuring stability across metadata transitions and chunk retrieval.
- Introduced new tests for bounded reranking logic to validate intent-based document selection.

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/f7276d0c-913e-4689-81ff-38f2fa28b6af

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/58bb299a-9573-4ba1-910e-94b49fcd5f99

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/52d647e9-7632-4825-ae04-a251da886f28

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/7fbebb69-7d69-4171-b5d6-cebad9c64768

…icate-document-identity Clarify duplicate document identity as an explicit fail-fast corpus contract

…persist-lexical-backend Persist retrieval backend and rerank diagnostics in grounded run artifacts

…align-metadata-projection Align metadata projection and recency rerank to the locked retrieval contract

Copilot

Pull request overview

This PR enhances corpus indexing and retrieval by introducing metadata chunks and a dedicated lexical projection (with optional FTS5), and by persisting richer retrieval diagnostics (including bounded reranking signals) into grounded run artifacts.
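The "optional FTS5" wording implies the lexical backend is chosen at runtime depending on whether the local SQLite build ships the FTS5 module. A minimal sketch of such a probe follows. The function name `detect_lexical_backend` and the mode strings `"fts5"` and `"fallback"` are assumptions made for illustration and are not taken from the PR.

```python
import sqlite3


def detect_lexical_backend(conn: sqlite3.Connection) -> str:
    """Return "fts5" if this SQLite build can create FTS5 tables, else "fallback".

    Hypothetical helper: the name and the returned mode strings are illustrative.
    """
    try:
        # Probe with a temp virtual table so the main schema is left untouched.
        conn.execute("CREATE VIRTUAL TABLE temp.fts5_probe USING fts5(body_text)")
        conn.execute("DROP TABLE temp.fts5_probe")
        return "fts5"
    except sqlite3.OperationalError:
        # Raised (e.g. "no such module: fts5") when the module is not compiled in.
        return "fallback"


if __name__ == "__main__":
    print(detect_lexical_backend(sqlite3.connect(":memory:")))
```

Reporting the detected mode alongside each retrieval result is one way to make a silent fallback visible in diagnostics.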

Changes:

- Add per-document metadata chunks and versioned projection signatures to stabilize metadata retrieval and contracts.
- Introduce `chunk_search` lexical projection (plus optional `chunk_search_fts`) and expose lexical backend diagnostics through retrieval/doctor/grounding.
- Add bounded reranking based on intent (journal/recent) using `journal_entry_date` and `mtime`, with expanded/updated tests.
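One way to read "bounded reranking" is that only a fixed-size window at the top of the result list is reordered by recency, so a recency signal can never pull in a low-scoring chunk from deep in the tail. The sketch below illustrates that reading. The `Hit` type, the `window` parameter, and the exact intent strings are assumptions; only `journal_entry_date` and `mtime` come from the PR summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    chunk_id: str
    score: float                        # base retrieval score, higher is better
    journal_entry_date: Optional[str]   # ISO "YYYY-MM-DD", or None if not a journal entry
    mtime: float                        # file modification time, epoch seconds


def bounded_rerank(hits: List[Hit], intent: str, window: int = 5) -> List[Hit]:
    """Reorder only the top `window` hits by recency; the tail keeps score order."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if intent not in ("journal", "recent"):
        return ranked  # no recency intent detected: plain score order
    head, tail = ranked[:window], ranked[window:]
    if intent == "journal":
        # ISO dates sort lexicographically; undated hits sink within the window.
        head.sort(key=lambda h: h.journal_entry_date or "", reverse=True)
    else:
        head.sort(key=lambda h: h.mtime, reverse=True)
    return head + tail
```

Because the sort is stable, hits with equal dates keep their relative score order inside the window.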

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_runtime_app.py | Asserts grounded `run.json` persists retrieval diagnostics; updates embedding count expectation for metadata chunks. |
| tests/test_retrieval_contract.py | New contract tests for metadata lookup, lexical backend fallback visibility, and bounded rerank behavior. |
| tests/test_doctor_runtime.py | Extends runtime doctor assertions for chunk kinds, projection versions, and lexical backend mode. |
| tests/test_corpus_contract.py | Updates corpus contract expectations for metadata chunks, stability across transitions, and duplicate identity handling. |
| configs/default.yaml | Raises retrieval K defaults and grounded evidence count. |
| agent/retrieval.py | Adds chunk kind, lexical backend diagnostics, and bounded rerank diagnostics to retrieval results. |
| agent/grounding.py | Persists new retrieval diagnostics fields into grounded `run.json`. |
| agent/doctor.py | Adds lexical projection consistency checks and reports backend mode/warnings. |
| agent/corpus_db.py | Bumps schema to v6; adds `chunk_search`/FTS support and new lexical query implementation. |
| agent/corpus.py | Emits metadata chunks, enforces strict duplicate identity messaging, and rebuilds lexical projection on sync. |
| agent/chunking.py | Adds chunk kind constants, metadata projection builder, and includes `chunk_kind` in stable chunk keys. |
| agent/app_types.py | Extends `ChunkRecord` with `chunk_kind`. |
| README.md | Documents strict global document identity rules across sources. |
| OPERATOR_QUICKREF.md | Adds operator guidance for `DUPLICATE_DOCUMENT_IDENTITY`. |
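The `agent/chunking.py` row mentions including `chunk_kind` in stable chunk keys. A rough sketch of why that matters: if a metadata chunk and a body chunk of the same document would otherwise share the same inputs, adding the kind to the hashed payload keeps their keys distinct. Everything below is assumed for illustration, including the function name, the other key fields, the `"body"`/`"metadata"` values, and the use of SHA-256; the PR does not show the actual key format.

```python
import hashlib


def stable_chunk_key(doc_id: str, chunk_kind: str, heading_path: str, ordinal: int) -> str:
    """Derive a deterministic key from chunk identity fields (hypothetical layout)."""
    parts = [doc_id, chunk_kind, heading_path, str(ordinal)]
    # Join with the ASCII unit separator so ("ab", "c") and ("a", "bc") cannot collide.
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    body_key = stable_chunk_key("doc-1", "body", "Intro", 0)
    meta_key = stable_chunk_key("doc-1", "metadata", "Intro", 0)
    print(body_key != meta_key)
```

The same inputs always yield the same key, which is what lets a re-sync recognise unchanged chunks.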

Copilot · 2026-03-23T19:47:51Z

agent/corpus_db.py

+        FROM chunk_search
+        WHERE
+            instr(lower(body_text), lower(?)) > 0
+            OR instr(lower(chunk_title), lower(?)) > 0
+            OR instr(lower(heading_path), lower(?)) > 0


_query_chunk_search_fallback() issues an unbounded SELECT over chunk_search (no LIMIT), so common short queries can return a very large result set and then be fully materialized/sorted in Python. This can become a major latency/memory problem when FTS5 is unavailable and the fallback path is used. Consider adding a LIMIT (e.g., the same fetch_limit used for the FTS path) and threading a limit parameter through the fallback helper so the ranking work stays bounded.
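The suggestion above could look roughly like the following sketch, which reuses the `WHERE` clause from the diff and threads a limit through to the SQL. The selected columns (including `chunk_id`), the function signature, and the handling of a non-positive limit are assumptions; the real helper in `agent/corpus_db.py` may differ.

```python
import sqlite3
from typing import List, Tuple

# WHERE clause copied from the reviewed diff; the column list and LIMIT are illustrative.
FALLBACK_SQL = """
    SELECT chunk_id, chunk_title, heading_path, body_text
    FROM chunk_search
    WHERE
        instr(lower(body_text), lower(?)) > 0
        OR instr(lower(chunk_title), lower(?)) > 0
        OR instr(lower(heading_path), lower(?)) > 0
    LIMIT ?
"""


def query_chunk_search_fallback(
    conn: sqlite3.Connection, query: str, fetch_limit: int
) -> List[Tuple]:
    """Substring search over chunk_search that never returns more than fetch_limit rows."""
    if fetch_limit <= 0:
        return []  # SQLite treats a negative LIMIT as "no limit", so guard explicitly
    params = (query, query, query, fetch_limit)
    return conn.execute(FALLBACK_SQL, params).fetchall()
```

Without an `ORDER BY`, the rows that survive the `LIMIT` are whichever ones SQLite encounters first, so Python-side ranking then works on a bounded but arbitrary candidate set. Whether that trade-off is acceptable depends on how the fallback is meant to be used.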

@copilot open a new pull request to apply changes based on this feedback

agent/corpus.py

configs/default.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>

Copilot · 2026-03-23T19:52:12Z

@duck-lint I've opened a new pull request, #39, to work on those changes. Once the pull request is ready, I'll request review from you.

duck-lint and others added 14 commits March 23, 2026 12:13

Initial plan

cad3039

Initial plan

9ff0344

Initial plan

3ebd0f0

Persist grounded retrieval diagnostics

9b86acd

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/f7276d0c-913e-4689-81ff-38f2fa28b6af

fix: clarify duplicate document identity failures

c34de89

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/58bb299a-9573-4ba1-910e-94b49fcd5f99

fix: align metadata projection and rerank contract

b259f1a

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/52d647e9-7632-4825-ae04-a251da886f28

Align grounded retrieval test with runtime path

1f74637

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87

Update agent/corpus_db.py

16e6975

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>

Refine grounded retrieval capture test

de5cb45

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87

fix: preserve uuid identity in chunk keys

9863e73

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/7fbebb69-7d69-4171-b5d6-cebad9c64768

Merge pull request #35 from duck-lint/copilot/sim-2026-03-23-fix-dupl…

f23dd17

…icate-document-identity Clarify duplicate document identity as an explicit fail-fast corpus contract

Merge pull request #36 from duck-lint/copilot/sim-2026-03-23-fix-002-…

1804c76

…persist-lexical-backend Persist retrieval backend and rerank diagnostics in grounded run artifacts

Merge pull request #37 from duck-lint/copilot/sim-2026-03-23-fix-003-…

2ef0cc7

…align-metadata-projection Align metadata projection and recency rerank to the locked retrieval contract

Copilot AI review requested due to automatic review settings March 23, 2026 19:43

Copilot started reviewing on behalf of duck-lint March 23, 2026 19:43

Copilot AI reviewed Mar 23, 2026

Initial plan

27b23a5

Copilot AI mentioned this pull request Mar 23, 2026

Bound _query_chunk_search_fallback result set with LIMIT #39

Merged

Update configs/default.yaml

a11633d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>

duck-lint and others added 6 commits March 23, 2026 13:52

Update configs/default.yaml

9aa6878

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>

Update agent/corpus.py

f24d369

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>

fix: add LIMIT to _query_chunk_search_fallback to bound result sets

4c661ce

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/bed97c53-aa6e-4946-87d6-4562d2cf0d13

Update agent/corpus_db.py

a2a3bf6

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>

test: add regression tests for _query_chunk_search_fallback LIMIT enf…

0c8fc41

…orcement Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/ad4ff7de-5ca2-4fd0-9a25-44ff1603d5f3

Merge pull request #39 from duck-lint/copilot/sub-pr-38

ce09819

Bound `_query_chunk_search_fallback` result set with LIMIT

duck-lint merged commit 757a4bf into duck-lint/better_chunking Mar 23, 2026
4 checks passed

duck-lint deleted the 20260323_implementation-1_lockstep branch March 23, 2026 20:22

duck-lint merged 22 commits into duck-lint/better_chunking from 20260323_implementation-1_lockstep

3 participants
