Clarify duplicate document identity as an explicit fail-fast corpus contract #35

Merged
duck-lint merged 3 commits into 20260323_implementation-1_lockstep from copilot/sim-2026-03-23-fix-duplicate-document-identity
Mar 23, 2026

Contributor

Copilot AI commented Mar 23, 2026

  • Inspect the review feedback, current corpus chunk-key implementation, and relevant CI workflow runs
  • Patch UUID-backed chunk-key derivation so it preserves authoritative doc_key identity without path canonicalization side effects
  • Add focused corpus-contract regression coverage for UUID values that would collide under source-uri canonicalization
  • Run targeted tests for corpus contract behavior, then run broader validation
  • Run automated code review and security scanning, then reply to the review comment with the addressing commit hash

Copilot AI changed the title from "[WIP] Codify and surface duplicate document identity as intentional fail-fast policy" to "Clarify duplicate document identity as an explicit fail-fast corpus contract" Mar 23, 2026
Copilot AI requested a review from duck-lint March 23, 2026 19:18
@duck-lint duck-lint marked this pull request as ready for review March 23, 2026 19:25
Copilot AI review requested due to automatic review settings March 23, 2026 19:25
Contributor

Copilot AI left a comment

Pull request overview

This PR makes the corpus’ document identity rules explicit and operator-friendly by surfacing cross-source identity collisions as a typed error, and by ensuring UUID-backed notes can disambiguate same-relative-path notes across sources.

Changes:

  • Add contract tests covering cross-source duplicate fallback identities (fail) and UUID disambiguation (pass, with distinct chunk IDs).
  • Detect documents.doc_key uniqueness collisions during ingest and format an operator-readable DUPLICATE_DOCUMENT_IDENTITY error with remediation guidance.
  • Document the strict global document identity rule and remediation steps in operator-facing docs.
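The collision detection described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: it assumes a SQLite-backed `documents` table with a unique `doc_key` column, and the exception class, the helper names, and the remediation wording are all hypothetical. Only the `DUPLICATE_DOCUMENT_IDENTITY` code and the `documents.doc_key` uniqueness rule come from the PR description.

```python
import sqlite3


class DuplicateDocumentIdentityError(Exception):
    """Hypothetical typed error; the PR only names the DUPLICATE_DOCUMENT_IDENTITY code."""

    code = "DUPLICATE_DOCUMENT_IDENTITY"


def find_document_by_doc_key(conn, doc_key):
    """Return (doc_key, source_uri) for an existing document, or None (assumed schema)."""
    return conn.execute(
        "SELECT doc_key, source_uri FROM documents WHERE doc_key = ?",
        (doc_key,),
    ).fetchone()


def insert_document(conn, doc_key, source_uri):
    """Insert a document, turning a uniqueness violation into an operator-readable error."""
    try:
        with conn:  # commits on success, rolls back if the INSERT fails
            conn.execute(
                "INSERT INTO documents (doc_key, source_uri) VALUES (?, ?)",
                (doc_key, source_uri),
            )
    except sqlite3.IntegrityError as exc:
        existing = find_document_by_doc_key(conn, doc_key)
        if existing is None:
            raise  # some other constraint failed; do not mislabel it
        raise DuplicateDocumentIdentityError(
            f"{DuplicateDocumentIdentityError.code}: doc_key={doc_key!r} is already "
            f"owned by {existing[1]!r}; incoming source is {source_uri!r}. "
            "Suggested remediation (illustrative wording): give one note an explicit UUID."
        ) from exc


def demo():
    """Ingest the same fallback identity from two sources and return the error text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (doc_key TEXT PRIMARY KEY, source_uri TEXT NOT NULL)"
    )
    insert_document(conn, "notes/todo.md", "vault_a/notes/todo.md")
    try:
        insert_document(conn, "notes/todo.md", "vault_b/notes/todo.md")
    except DuplicateDocumentIdentityError as err:
        return str(err)
    return None
```

Failing at insert time, rather than silently overwriting or merging, keeps the first-ingested document authoritative and tells the operator which two sources are in conflict.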

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/test_corpus_contract.py | Adds regression tests for duplicate fallback identity failures and UUID disambiguation behavior. |
| agent/corpus_db.py | Adds a helper query to locate an existing document by doc_key for better error reporting. |
| agent/corpus.py | Adds typed duplicate-identity error formatting and updates chunk-key derivation for UUID-backed notes. |
| README.md | Documents the strict, global document identity contract and supported remediations. |
| OPERATOR_QUICKREF.md | Adds an operator runbook entry for DUPLICATE_DOCUMENT_IDENTITY and remediation steps. |

Comment on lines 170 to +176
      body_chunks = build_markdown_chunks(body_text=body, max_chars=max_chars, overlap=overlap)
    + chunk_key_source = doc_key if uses_explicit_uuid else source_uri
      chunk_records: list[ChunkRecord] = []
      chunk_records.append(
          ChunkRecord(
              chunk_key=stable_chunk_key(
    -             source_uri=source_uri,
    +             source_uri=chunk_key_source,

Copilot AI Mar 23, 2026

Chunk keys for UUID-backed notes are derived by passing doc_key into stable_chunk_key(source_uri=...), but stable_chunk_key canonicalizes source_uri as if it were a path (e.g., converting backslashes to '/' and collapsing repeated '/'). That can unintentionally alter user-provided UUID strings and can even cause chunk_key collisions between distinct doc_key values that canonicalize to the same string. Consider deriving the chunk-key source from an escaped or hashed form of the authoritative document identity (or adding a stable_chunk_key variant that does not path-canonicalize the identity) so that chunk keys remain a true function of the stored doc_key.
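To make the collision risk concrete, here is a small sketch. The real `stable_chunk_key` is not shown in this PR, so `canonicalize_source_uri` and `path_style_chunk_key` below are hypothetical stand-ins that apply only the two normalizations the comment mentions; the actual function may differ in its details.

```python
import hashlib
import re


def canonicalize_source_uri(value: str) -> str:
    """Assumed path-style normalization: backslashes become '/', repeated '/' collapse."""
    return re.sub(r"/+", "/", value.replace("\\", "/"))


def path_style_chunk_key(source_uri: str, chunk_index: int) -> str:
    """Hypothetical stand-in for a chunk key built from a canonicalized source token."""
    token = canonicalize_source_uri(source_uri)
    return hashlib.sha256(f"{token}#{chunk_index}".encode("utf-8")).hexdigest()


# Two distinct user-provided identities that normalize to the same token.
doc_key_a = "team\\note-1"
doc_key_b = "team//note-1"

collides = path_style_chunk_key(doc_key_a, 0) == path_style_chunk_key(doc_key_b, 0)
```

Under these assumptions both values normalize to `team/note-1`, so `collides` is true even though the stored doc_key values differ.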

Owner

@copilot apply changes based on this feedback

Contributor Author

Fixed in 9863e73. UUID-backed chunk keys now use a hashed doc_key: namespace instead of passing raw doc_key values through source-uri canonicalization, and I added a regression test covering explicit UUIDs that previously canonicalized to the same source token.
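One way the hashed namespace could look is sketched below. Commit 9863e73 is not reproduced in this thread, so the function name, the choice of SHA-256, and the exact `doc_key:` prefix format are assumptions made for illustration.

```python
import hashlib


def chunk_key_source_for(doc_key: str, source_uri: str, uses_explicit_uuid: bool) -> str:
    """Pick the token that chunk keys are derived from (hypothetical helper).

    For UUID-backed notes the raw doc_key is hashed first, so the result contains
    only hex digits after the prefix and path canonicalization cannot alter it.
    """
    if not uses_explicit_uuid:
        return source_uri  # fallback identity: keep the path-based behavior
    digest = hashlib.sha256(doc_key.encode("utf-8")).hexdigest()
    return f"doc_key:{digest}"


# Identities that would collide under path canonicalization now stay distinct.
source_a = chunk_key_source_for("team\\note-1", "vault/a.md", uses_explicit_uuid=True)
source_b = chunk_key_source_for("team//note-1", "vault/a.md", uses_explicit_uuid=True)
```

Hashing rather than escaping keeps the token fixed-length and free of path separators, at the cost of making it unreadable in logs.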

Co-authored-by: duck-lint <250741170+duck-lint@users.noreply.github.com>
Agent-Logs-Url: https://github.com/duck-lint/local-agent/sessions/7fbebb69-7d69-4171-b5d6-cebad9c64768
@duck-lint duck-lint merged commit f23dd17 into 20260323_implementation-1_lockstep Mar 23, 2026
4 checks passed
@duck-lint duck-lint deleted the copilot/sim-2026-03-23-fix-duplicate-document-identity branch March 23, 2026 19:35

3 participants