@xunyin8 xunyin8 commented Dec 17, 2025

Problem Statement

  1. If one store version fails to initialize and throws an unhandled exception, initialization of all subsequent stores in the storeRepository is skipped.
  2. If the CV repository is delayed or unavailable (which we have been seeing more frequently, especially with large clusters), quota initialization often fails with VeniceNoHelixResourceException, which was also unhandled as per problem 1.
  3. Having both initialized and initializedVolatile is confusing and unnecessary.

Solution

  1. Catch unhandled exceptions at the store level and continue initialization for other stores to isolate any initialization errors.
  2. If CV is unavailable, i.e. VeniceNoHelixResourceException is thrown or the partition assignment map is somehow empty, and the fallback strategy is enabled (enabled by default, but can be disabled via config server.read.quota.initialization.fallback.enabled), we allocate this instance Q * X/P quota, where Q is the total read quota, X is the number of partitions assigned to the node based on storage engine state, and P is the store version partition count.
  3. Removed the initialized flag, keeping only the initializedVolatile flag.
  4. Added logging that fires only when the total quota changes and shows how the new rate limiter is calculated, i.e. the total quota, node responsibility and resulting instance quota. Also added logging whenever the fallback strategy is invoked, regardless of the quota update trigger (store update or CV update events).
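
A minimal sketch of points 1 and 2, for illustration only. The class and method names below are hypothetical and are not taken from the Venice codebase; the sketch assumes integer quotas and shows per-store failure isolation plus the Q * X/P fallback arithmetic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in Venice. */
final class QuotaInitSketch {
  /** Fallback share for this node: Q * X / P (integer division). */
  static long fallbackInstanceQuota(long totalReadQuota, int localPartitions, int partitionCount) {
    if (partitionCount <= 0) {
      throw new IllegalArgumentException("partitionCount must be positive");
    }
    return totalReadQuota * localPartitions / partitionCount;
  }

  /**
   * Runs the initializer for every store. A failure in one store is recorded
   * and does not stop the loop. Returns the names of the stores that failed.
   */
  static List<String> initializeAll(List<String> storeNames, Consumer<String> initializer) {
    List<String> failed = new ArrayList<>();
    for (String store : storeNames) {
      try {
        initializer.accept(store);
      } catch (RuntimeException e) {
        // A real implementation would log the exception here.
        failed.add(store);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    List<String> failed = initializeAll(
        Arrays.asList("storeA", "storeB", "storeC"),
        store -> {
          if (store.equals("storeB")) {
            throw new IllegalStateException("simulated initialization failure");
          }
        });
    System.out.println("failed stores: " + failed); // prints "failed stores: [storeB]"

    // Illustrative numbers: Q = 1000, X = 3 local partitions, P = 12 partitions.
    System.out.println(fallbackInstanceQuota(1000L, 3, 12)); // prints 250
  }
}
```

With these illustrative numbers, a node hosting 3 of 12 partitions gets a quarter of the total quota, and the failure of storeB does not prevent storeC from being initialized.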

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests and existing integration tests.

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@xunyin8 force-pushed the fix_sn_read_quota_initialization_without_cv branch from 57b1f81 to da95b8f on December 17, 2025 18:55

@lluwm lluwm left a comment

Thanks @xunyin8 for the fix and it makes sense to me. Left some minor comments to consider.

@xunyin8 force-pushed the fix_sn_read_quota_initialization_without_cv branch from fae5b4c to d5d106e on December 18, 2025 00:34

eldernewborn commented Dec 18, 2025

We will assign this instance Q/(min(N, X)) quota where Q is the total read quota, N is number of instances in the cluster and X is the number of partitions for the

Say Q is the total read quota, the store has N partitions, and the replication factor is M. In an ideal world, the read quota for any single partition replica is Q / (N * M).

Now imagine we know nothing outside the server and assume all other servers are dead. Each replica then needs Q / N worth of quota to cover the capacity needed to serve in place of the other, presumably unavailable, servers.

In that scenario, the server knows how many replicas of a given store it is serving; let's call it R.

The storage node can safely set a default quota of R * Q / N without actively reaching out to the outside world, which removes the dependency on knowing the number of instances in the cluster. This is a better option than unlimited quota / fail open, and it needs no query to any external system.
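
As a worked instance of the arithmetic in this comment, with purely illustrative numbers (Q = 1200, N = 12 partitions, M = 3, and R = 4 are assumptions, not values from this PR):

$$
\frac{Q}{N \cdot M} = \frac{1200}{12 \cdot 3} \approx 33.3, \qquad
\frac{Q}{N} = \frac{1200}{12} = 100, \qquad
R \cdot \frac{Q}{N} = 4 \cdot 100 = 400
$$

Under these assumptions, a node serving 4 replicas would default to 400 of the 1200 total, using only values it knows locally. If R is read as the number of locally hosted partitions and N as the partition count, this has the same form as the Q * X/P allocation in the PR description.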


xunyin8 commented Dec 18, 2025

This is not entirely correct: what's missing from the above is how many partitions are assigned to the instance. When there are fewer instances in the cluster than partitions, each instance could receive more than one partition. Perhaps Q/(min(N, X)) is also incorrect; to be on the safer side in those scenarios it should really be Q/X * ceil(X/N), where N is the number of instances and X is the number of partitions. Let me update the PR. There are other ways to figure out the partition assignment for the instance without CV, but they would be a lot more complicated and vulnerable to other races when CV does become available and we get notified about EV/CV changes.
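
To make the difference between the two formulas concrete, here is a small sketch. The class and method names are hypothetical and the numbers are illustrative; N is taken to be the number of instances and X the number of partitions. It compares Q/(min(N, X)) with Q/X * ceil(X/N) using integer arithmetic.

```java
/** Hypothetical sketch comparing the two candidate fallback formulas. */
final class ConservativeQuotaSketch {
  /** Q / min(N, X): assumes partitions spread evenly across instances. */
  static long naiveQuota(long totalReadQuota, int partitions, int instances) {
    return totalReadQuota / Math.min(instances, partitions);
  }

  /** Q / X * ceil(X / N): budgets for the most partitions one instance could hold. */
  static long conservativeQuota(long totalReadQuota, int partitions, int instances) {
    // Integer ceiling of partitions / instances.
    int maxPartitionsPerInstance = (partitions + instances - 1) / instances;
    // Multiply before dividing to avoid losing precision to integer division.
    return totalReadQuota * maxPartitionsPerInstance / partitions;
  }

  public static void main(String[] args) {
    // Illustrative numbers: Q = 1000, X = 12 partitions, N = 5 instances.
    System.out.println(naiveQuota(1000L, 12, 5));        // prints 200
    System.out.println(conservativeQuota(1000L, 12, 5)); // prints 250
  }
}
```

With 12 partitions on 5 instances, some instance must hold 3 partitions, so the second formula grants 250 where the first grants only 200. When partitions divide evenly across instances, or when there are at least as many instances as partitions, the two formulas agree.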

@xunyin8 force-pushed the fix_sn_read_quota_initialization_without_cv branch from d5d106e to c36e917 on December 18, 2025 01:38
lluwm previously approved these changes Dec 23, 2025

@lluwm lluwm left a comment

Thanks @xunyin8 . LGTM!

@xunyin8 force-pushed the fix_sn_read_quota_initialization_without_cv branch from f46d905 to c03ca40 on December 23, 2025 18:56
@xunyin8 xunyin8 merged commit 9104901 into linkedin:main Jan 2, 2026
72 of 74 checks passed