
20260323 implementation 1 complete #38

Merged
duck-lint merged 22 commits into duck-lint/better_chunking from 20260323_implementation-1_lockstep
Mar 23, 2026
@duck-lint (Owner)

Enhance corpus management and retrieval features

  • Added detailed checks in the doctor module to validate chunk kinds, metadata projections, and lexical backend modes.
  • Improved retrieval logic to support reranking based on journal entry dates and document types.
  • Updated retrieval configuration defaults (higher retrieval K values and grounded evidence count in configs/default.yaml).
  • Expanded test coverage for corpus contract and retrieval functionalities, ensuring stability across metadata transitions and chunk retrieval.
  • Introduced new tests for bounded reranking logic to validate intent-based document selection.

duck-lint and others added 14 commits March 23, 2026 12:13
Enhance corpus management and retrieval features
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/f7276d0c-913e-4689-81ff-38f2fa28b6af
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/8face3d5-6dc3-4701-bd62-541997919f87
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/7fbebb69-7d69-4171-b5d6-cebad9c64768
…icate-document-identity

Clarify duplicate document identity as an explicit fail-fast corpus contract
…persist-lexical-backend

Persist retrieval backend and rerank diagnostics in grounded run artifacts
…align-metadata-projection

Align metadata projection and recency rerank to the locked retrieval contract
Copilot AI review requested due to automatic review settings March 23, 2026 19:43
Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances corpus indexing and retrieval by introducing metadata chunks and a dedicated lexical projection (with optional FTS5), and by persisting richer retrieval diagnostics (including bounded reranking signals) into grounded run artifacts.
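Since FTS5 is described as optional, the backend mode has to be decided at runtime. A minimal sketch of one way to probe for it with the standard `sqlite3` module follows; the function names and the mode strings `"fts5"` and `"fallback"` are assumptions for illustration, not names taken from this PR.

```python
import sqlite3


def fts5_available() -> bool:
    """Return True if the linked SQLite library can create an FTS5 table.

    The probe runs on a throwaway in-memory database so the real corpus
    database is never touched.
    """
    probe = sqlite3.connect(":memory:")
    try:
        probe.execute("CREATE VIRTUAL TABLE fts5_probe USING fts5(body_text)")
        return True
    except sqlite3.OperationalError:
        # SQLite builds without FTS5 report "no such module: fts5".
        return False
    finally:
        probe.close()


def lexical_backend_mode() -> str:
    """Hypothetical diagnostic value; the real mode names may differ."""
    return "fts5" if fts5_available() else "fallback"
```

Recording the returned mode alongside each retrieval result is what would make a silent fallback visible in diagnostics.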

Changes:

  • Add per-document metadata chunks and versioned projection signatures to stabilize metadata retrieval and contracts.
  • Introduce chunk_search lexical projection (plus optional chunk_search_fts) and expose lexical backend diagnostics through retrieval/doctor/grounding.
  • Add bounded reranking based on intent (journal/recent) using journal_entry_date and mtime, with expanded/updated tests.
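The bounded reranking item can be illustrated with a small sketch. It assumes that "bounded" means only a fixed window of top candidates is reordered and the tail keeps its original order; the `Candidate` class, the `window` parameter, and the exact ordering rules are hypothetical and may not match `agent/retrieval.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    chunk_key: str
    score: float
    journal_entry_date: Optional[str] = None  # ISO "YYYY-MM-DD" (assumed format)
    mtime: float = 0.0


def bounded_rerank(candidates: List[Candidate], intent: str, window: int = 5) -> List[Candidate]:
    """Reorder only the first `window` candidates; the tail is left untouched."""
    if intent not in ("journal", "recent") or window <= 0:
        return list(candidates)
    head = list(candidates[:window])
    tail = list(candidates[window:])
    if intent == "journal":
        # Dated entries first, newest first; undated entries keep their order.
        dated = [c for c in head if c.journal_entry_date]
        undated = [c for c in head if not c.journal_entry_date]
        dated.sort(key=lambda c: c.journal_entry_date, reverse=True)
        head = dated + undated
    else:
        # "recent" intent: newest file modification time first.
        head.sort(key=lambda c: c.mtime, reverse=True)
    return head + tail
```

Keeping the tail fixed limits how far a recency signal can move a weak match, which is presumably the point of bounding the rerank.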

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_runtime_app.py | Asserts grounded run.json persists retrieval diagnostics; updates embedding count expectation for metadata chunks. |
| tests/test_retrieval_contract.py | New contract tests for metadata lookup, lexical backend fallback visibility, and bounded rerank behavior. |
| tests/test_doctor_runtime.py | Extends runtime doctor assertions for chunk kinds, projection versions, and lexical backend mode. |
| tests/test_corpus_contract.py | Updates corpus contract expectations for metadata chunks, stability across transitions, and duplicate identity handling. |
| configs/default.yaml | Raises retrieval K defaults and grounded evidence count. |
| agent/retrieval.py | Adds chunk kind, lexical backend diagnostics, and bounded rerank diagnostics to retrieval results. |
| agent/grounding.py | Persists new retrieval diagnostics fields into grounded run.json. |
| agent/doctor.py | Adds lexical projection consistency checks and reports backend mode/warnings. |
| agent/corpus_db.py | Bumps schema to v6; adds chunk_search/FTS support and new lexical query implementation. |
| agent/corpus.py | Emits metadata chunks, enforces strict duplicate identity messaging, and rebuilds lexical projection on sync. |
| agent/chunking.py | Adds chunk kind constants, metadata projection builder, and includes chunk_kind in stable chunk keys. |
| agent/app_types.py | Extends ChunkRecord with chunk_kind. |
| README.md | Documents strict global document identity rules across sources. |
| OPERATOR_QUICKREF.md | Adds operator guidance for DUPLICATE_DOCUMENT_IDENTITY. |
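
The strict duplicate identity rule mentioned for `agent/corpus.py`, `README.md`, and `OPERATOR_QUICKREF.md` can be pictured as a fail-fast check during sync. The sketch below is illustrative only: how the project derives a document identity is not shown in this PR page, so the `(source, doc_id, path)` tuples, the exception class, and the message wording are all assumptions. Only the `DUPLICATE_DOCUMENT_IDENTITY` code comes from the summary above.

```python
from typing import Dict, Iterable, Tuple


class DuplicateDocumentIdentityError(ValueError):
    """Hypothetical exception carrying the operator-facing error code."""

    code = "DUPLICATE_DOCUMENT_IDENTITY"


def assert_unique_document_identity(
    documents: Iterable[Tuple[str, str, str]],
) -> Dict[str, Tuple[str, str]]:
    """Fail on the first identity seen twice, across all sources.

    `documents` yields (source_name, doc_id, path) tuples (assumed shape).
    Returns a mapping of doc_id to the (source_name, path) that owns it.
    """
    seen: Dict[str, Tuple[str, str]] = {}
    for source, doc_id, path in documents:
        if doc_id in seen:
            first_source, first_path = seen[doc_id]
            raise DuplicateDocumentIdentityError(
                f"{DuplicateDocumentIdentityError.code}: {doc_id!r} appears in "
                f"{first_source}:{first_path} and {source}:{path}"
            )
        seen[doc_id] = (source, path)
    return seen
```

Raising on the first collision, rather than silently keeping one copy, keeps the corpus contract explicit and gives the operator both conflicting paths to resolve.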

Comment on lines +736 to +740
    FROM chunk_search
    WHERE
        instr(lower(body_text), lower(?)) > 0
        OR instr(lower(chunk_title), lower(?)) > 0
        OR instr(lower(heading_path), lower(?)) > 0
Copilot AI Mar 23, 2026

_query_chunk_search_fallback() issues an unbounded SELECT over chunk_search (no LIMIT), so common short queries can return a very large result set and then be fully materialized/sorted in Python. This can become a major latency/memory problem when FTS5 is unavailable and the fallback path is used. Consider adding a LIMIT (e.g., the same fetch_limit used for the FTS path) and threading a limit parameter through the fallback helper so the ranking work stays bounded.
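
A minimal sketch of the suggested change, assuming the fallback helper receives an open `sqlite3` connection. The `chunk_key` column, the `ORDER BY`, the function signature, and the demo table are assumptions for illustration; only the `WHERE` clause and the `chunk_search` columns it names are taken from the reviewed lines.

```python
import sqlite3
from typing import List, Tuple


def query_chunk_search_fallback(
    conn: sqlite3.Connection, query: str, limit: int
) -> List[Tuple[str, str, str, str]]:
    """Substring fallback over chunk_search, bounded by LIMIT in SQL."""
    if limit <= 0:
        return []
    sql = """
        SELECT chunk_key, chunk_title, heading_path, body_text
        FROM chunk_search
        WHERE
            instr(lower(body_text), lower(?)) > 0
            OR instr(lower(chunk_title), lower(?)) > 0
            OR instr(lower(heading_path), lower(?)) > 0
        ORDER BY chunk_key  -- deterministic order so the cut-off is stable
        LIMIT ?
    """
    return conn.execute(sql, (query, query, query, limit)).fetchall()


def make_demo_connection(rows: int = 50) -> sqlite3.Connection:
    """Build a throwaway in-memory chunk_search table for experimentation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunk_search ("
        "chunk_key TEXT PRIMARY KEY, chunk_title TEXT, "
        "heading_path TEXT, body_text TEXT)"
    )
    conn.executemany(
        "INSERT INTO chunk_search VALUES (?, ?, ?, ?)",
        [(f"doc{i:03d}", f"Title {i}", "Notes/Journal", "the quick brown fox")
         for i in range(rows)],
    )
    return conn
```

One trade-off to keep in mind: the limit is applied before any Python-side ranking, so the cap should be generous enough (for example, the same `fetch_limit` as the FTS path) that good matches are not cut off early.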

@duck-lint (Owner, Author)

@copilot open a new pull request to apply changes based on this feedback

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Copilot AI (Contributor) commented Mar 23, 2026

@duck-lint I've opened a new pull request, #39, to work on those changes. Once the pull request is ready, I'll request review from you.

duck-lint and others added 6 commits March 23, 2026 13:52
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Bound `_query_chunk_search_fallback` result set with LIMIT
@duck-lint duck-lint merged commit 757a4bf into duck-lint/better_chunking Mar 23, 2026
4 checks passed
@duck-lint duck-lint deleted the 20260323_implementation-1_lockstep branch March 23, 2026 20:22
3 participants