Skip to content

Conversation

@KaiSernLim
Copy link
Contributor

@KaiSernLim KaiSernLim commented Dec 9, 2025

Problem Statement

This PR fixes 3 important issues discovered in certification testing:

  1. Fix VT DIV Sync Order
    1. Global RT DIV in batch-only stores can skip EndOfPush due to the VT DIV being synced to drainer before EndOfPush being sent to drainer. This will cause OffsetRecord.isEndOfPushReceived() to be set as false, while the DIV has already seen the EOP message. The checkpointed position will be at the EOP message, but it will be skipped because the DIV has already seen the EOP message, so the isEndOfPushReceived() will never be set as true. The first push will succeed, but it will become a problem on restart and the host will never finish ingestion.
    2. Also fixed error handling if the previous message was not persisted to disk properly then the VT DIV should not sync either.
  2. Accommodate Delete Values in shouldSyncOffsetFromSnapshot()
    1. Global RT DIV messages have keys with MessageType:GLOBAL_RT_DIV. A commit extended this special key message type for DELETE message values, but code paths were not updated from previously, when all GLOBAL_RT_DIV would contain PUT message values.
  3. Skip Global RT DIV in Repush
    1. Repush fails because the Global RT DIV messages are not skipped from VeniceKafkaInputTTLFilter.skipRmdRecord() and they do not have replication metadata fields populated, which causes issues. Skipping it in PubSubSplitIterator should cause it to be omitted from repushes.

Solution

Read the previous section.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines. N/A
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
    Added new parametrized integration test testServerRestart().
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@KaiSernLim KaiSernLim requested a review from lluwm December 9, 2025 23:08
@KaiSernLim KaiSernLim self-assigned this Dec 9, 2025
@KaiSernLim KaiSernLim changed the title Global RT DIV: Fix Order of VT DIV Sync [server] Global RT DIV: Fix VT DIV Sync Order Dec 9, 2025
@KaiSernLim KaiSernLim force-pushed the global-rt-div-fix-vt-div-sync-order branch from 92c099d to f4ae940 Compare December 9, 2025 23:11
Copy link
Contributor

@lluwm lluwm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this change, @KaiSernLim. It looks very promising. Left a few comments for you to consider.

@KaiSernLim KaiSernLim force-pushed the global-rt-div-fix-vt-div-sync-order branch from ee7014c to f351d86 Compare December 18, 2025 00:37
Copy link
Contributor

@lluwm lluwm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome! Thank you!

@KaiSernLim KaiSernLim enabled auto-merge (squash) January 5, 2026 19:30
@KaiSernLim KaiSernLim merged commit 034453e into linkedin:main Jan 5, 2026
50 checks passed
@KaiSernLim KaiSernLim deleted the global-rt-div-fix-vt-div-sync-order branch January 5, 2026 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants