Clarify duplicate document identity as an explicit fail-fast corpus contract #35
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/58bb299a-9573-4ba1-910e-94b49fcd5f99
Pull request overview
This PR makes the corpus’ document identity rules explicit and operator-friendly by surfacing cross-source identity collisions as a typed error, and by ensuring UUID-backed notes can disambiguate same-relative-path notes across sources.
Changes:
- Add contract tests covering cross-source duplicate fallback identities (fail) and UUID disambiguation (pass, with distinct chunk IDs).
- Detect `documents.doc_key` uniqueness collisions during ingest and format an operator-readable `DUPLICATE_DOCUMENT_IDENTITY` error with remediation guidance.
- Document the strict global document identity rule and remediation steps in operator-facing docs.
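A minimal sketch of the collision-detection step described above, assuming the corpus lives in SQLite with a `UNIQUE` constraint on `documents.doc_key`. The PR does not show the real schema or helper names, so the table layout, `DuplicateDocumentIdentityError`, `find_source_by_doc_key`, `insert_document`, and the message wording are all hypothetical. Only the `DUPLICATE_DOCUMENT_IDENTITY` code and the `documents.doc_key` column come from the PR.

```python
import sqlite3


class DuplicateDocumentIdentityError(Exception):
    """Hypothetical typed error; only the error code string comes from the PR."""

    code = "DUPLICATE_DOCUMENT_IDENTITY"


def make_demo_db() -> sqlite3.Connection:
    # Assumed schema: doc_key must be globally unique across all sources.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (doc_key TEXT NOT NULL UNIQUE, source_uri TEXT NOT NULL)"
    )
    return conn


def find_source_by_doc_key(conn: sqlite3.Connection, doc_key: str):
    # Helper query used only to make the error message point at the existing document.
    row = conn.execute(
        "SELECT source_uri FROM documents WHERE doc_key = ?", (doc_key,)
    ).fetchone()
    return row[0] if row else None


def insert_document(conn: sqlite3.Connection, doc_key: str, source_uri: str) -> None:
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO documents (doc_key, source_uri) VALUES (?, ?)",
                (doc_key, source_uri),
            )
    except sqlite3.IntegrityError as exc:
        existing = find_source_by_doc_key(conn, doc_key)
        if existing is None:
            raise  # some other constraint failed; do not mislabel it
        raise DuplicateDocumentIdentityError(
            f"{DuplicateDocumentIdentityError.code}: doc_key {doc_key!r} is already "
            f"used by {existing!r}; refusing to ingest {source_uri!r}. "
            "Give one of the notes an explicit UUID to disambiguate."
        ) from exc
```

Failing fast here, rather than silently overwriting or skipping the second document, keeps the operator in control of which source owns the identity.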
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_corpus_contract.py | Adds regression tests for duplicate fallback identity failures and UUID disambiguation behavior. |
| agent/corpus_db.py | Adds a helper query to locate an existing document by doc_key for better error reporting. |
| agent/corpus.py | Adds typed duplicate-identity error formatting and updates chunk-key derivation for UUID-backed notes. |
| README.md | Documents the strict, global document identity contract and supported remediations. |
| OPERATOR_QUICKREF.md | Adds an operator runbook entry for DUPLICATE_DOCUMENT_IDENTITY and remediation steps. |
      body_chunks = build_markdown_chunks(body_text=body, max_chars=max_chars, overlap=overlap)
    + chunk_key_source = doc_key if uses_explicit_uuid else source_uri
      chunk_records: list[ChunkRecord] = []
      chunk_records.append(
          ChunkRecord(
              chunk_key=stable_chunk_key(
    -             source_uri=source_uri,
    +             source_uri=chunk_key_source,
Chunk keys for UUID-backed notes are derived by passing doc_key into stable_chunk_key(source_uri=...), but stable_chunk_key canonicalizes source_uri like a path (e.g., backslashes -> '/', collapsing multiple '/'). That can unintentionally change user-provided UUID strings and can even cause chunk_key collisions for distinct doc_key values that canonicalize to the same string. Consider deriving the chunk-key source from an escaped/hashed form of the authoritative document identity (or adding a stable_chunk_key variant that does not path-canonicalize the identity) so chunk keys remain a true function of the stored doc_key.
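The collision risk can be illustrated with a small sketch. `canonicalize_source_uri` below is a hypothetical stand-in for the path canonicalization inside `stable_chunk_key`, modelling only the two behaviours named in the comment (backslashes to `/`, repeated `/` collapsed). The real function is not shown in this PR and may do more.

```python
import re


def canonicalize_source_uri(value: str) -> str:
    # Stand-in for path-style canonicalization: normalize separators,
    # then collapse runs of '/' into a single '/'.
    return re.sub(r"/+", "/", value.replace("\\", "/"))


# Two distinct user-provided identities (illustrative values only).
doc_key_a = "team\\note-1"
doc_key_b = "team//note-1"

canonical_a = canonicalize_source_uri(doc_key_a)
canonical_b = canonicalize_source_uri(doc_key_b)

# The stored doc_key values differ, but the chunk-key input no longer does.
print(doc_key_a != doc_key_b)      # True
print(canonical_a == canonical_b)  # True
```

Any chunk key computed from the canonicalized string would then be identical for both documents, even though the `documents.doc_key` uniqueness check accepts them as different.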
Fixed in 9863e73. UUID-backed chunk keys now use a hashed `doc_key:` namespace instead of passing raw `doc_key` values through source-URI canonicalization, and I added a regression test covering explicit UUIDs that previously canonicalized to the same source token.
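One way such a hashed namespace could look is sketched below. The commit itself is not shown here, so the SHA-256 choice, the exact `doc_key:<hex>` layout, and the names `chunk_key_source_for_doc_key` and `canonicalize_source_uri` are assumptions for illustration, not the code from 9863e73.

```python
import hashlib
import re


def canonicalize_source_uri(value: str) -> str:
    # Hypothetical stand-in for the path canonicalization in stable_chunk_key.
    return re.sub(r"/+", "/", value.replace("\\", "/"))


def chunk_key_source_for_doc_key(doc_key: str) -> str:
    # Hash the authoritative identity so the token contains only [0-9a-f]
    # after the prefix; path canonicalization then has nothing to rewrite.
    digest = hashlib.sha256(doc_key.encode("utf-8")).hexdigest()
    return f"doc_key:{digest}"


token_a = chunk_key_source_for_doc_key("team\\note-1")
token_b = chunk_key_source_for_doc_key("team//note-1")

# Distinct identities stay distinct, and canonicalization is a no-op on the token.
print(token_a != token_b)                             # True
print(canonicalize_source_uri(token_a) == token_a)    # True
```

Because the digest is computed from the raw `doc_key` bytes, the chunk key remains a function of the stored identity rather than of its path-normalized form.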
Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com> Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/7fbebb69-7d69-4171-b5d6-cebad9c64768
Merged f23dd17 into 20260323_implementation-1_lockstep