Skip to content

Fix merge commit point bug: persist merge results to segments_N#12

Open
model-collapse wants to merge 1 commit intomainfrom
fix/merge-commit-point-persistence
Open

Fix merge commit point bug: persist merge results to segments_N#12
model-collapse wants to merge 1 commit intomainfrom
fix/merge-commit-point-persistence

Conversation

@model-collapse
Copy link
Owner

Summary

  • Fix 2x data duplication bug in Big5 10M ingestion: background merges update in-memory segmentInfos_ but the on-disk segments_N was never re-written, so on reader reopen (or node restart) both source and merged segments are loaded — 19.7M docs visible instead of 10M
  • Commit(): add waitForMerges + commitMergeResults after initial commit to persist merge results before triggering cascading merges
  • Refresh(): add waitForMerges + commitMergeResults before reopening reader so it sees the post-merge segment state
  • Update diagon submodule to a379571 which adds commitMergeResults(), triggerMerge(), and C API functions

Upstream diagon changes (a379571)

  • IndexWriter::commitMergeResults() — writes segments_N + cleans up source segment files without flushing or re-triggering merges
  • IndexWriter::triggerMerge() — public wrapper for maybeMerge(SEGMENT_FLUSH)
  • C API: diagon_commit_merge_results, diagon_wait_for_merges, diagon_maybe_merge
  • 4 new unit tests (MergeCommitPointTest) — all passing

Test plan

  • Build conjugate with updated diagon: BUILD_MODE=release make all
  • Index 100K+ docs, force refresh, verify _count equals indexed count exactly
  • Restart data node, verify _count still equals indexed count (no duplication)
  • Diagon upstream: 1730/1732 tests pass (1 flaky timing test, 2 disabled)

🤖 Generated with Claude Code

Background merges update in-memory segmentInfos_ but the on-disk
segments_N was never re-written after merges complete, causing 2x data
duplication on reader reopen (19.7M docs visible instead of 10M).

Changes:
- Commit(): add waitForMerges + commitMergeResults after initial commit
  to persist merge results before triggering cascading merges
- Refresh(): add waitForMerges + commitMergeResults before reopening
  reader so it sees the post-merge segment state
- Update diagon submodule to a379571 which adds commitMergeResults(),
  triggerMerge(), and C API functions (diagon_commit_merge_results,
  diagon_wait_for_merges, diagon_maybe_merge)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant