-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- add prose test - add assertions on the number of retries for maxAttempts tests - don't run clientBulkWrite tests on <8.0 servers
|
It looks like you also need to bump the schema version: |
|
WIP Python implementation: mongodb/mongo-python-driver#2635 |
|
All unified and prose tests are passing in the Python implementation. Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating... Edit 2: we're all good now |
jyemin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-XX-XX: Initial version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how we handle the date... Is there an automation for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I know of. Usually the spec author fills it out before merging
I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.
jmikola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving some more notes about formatting, which you can take or leave as you wish.
The hand-written client-backpressure tests also appear to use inconsistent formatting for whitespace, but it's all visual and I doubt anyone else cares about that so I'll spare you diff comments.
| description: 'client.listDatabases retries using operation loop' | ||
| operations: | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this file is generated from backpressure-retry-loop.yml.template, but the whitespace here is a bit odd.
| name: "x_11" | ||
| {%- endif %} | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whitespace here is probably what's introducing extra newlines into generated files.
|
|
||
| - | ||
| object: *{{operation.object}} | ||
| name: {{operation.operation_name}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider swapping name and object lines here and putting name on the same line as the array element dash character for consistency.
|
|
||
| 5. Execute step 3 again. | ||
|
|
||
| 6. Compare the two time between the two runs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the following lines should all appear under element six. If so, indentation is needed (check preview).
| filter: {} | ||
| # ensure stable ordering of result documents | ||
| sort: { a: 1 } | ||
| object: *collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name and object are typically included side by side for readability.
Co-authored-by: Ferdinando Papale <4850119+papafe@users.noreply.github.com>
Co-authored-by: Ferdinando Papale <4850119+papafe@users.noreply.github.com>
stIncMale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notes for myself:
- The last reviewed commit is d4d0b38.
- I reviewed only the changes in
.mdfiles, and do not plan to review the tests. - I have not re-reviewed the pseudocode illustrating the overload retry policy. I will do that after us settling on the requirements expressed in the prose part of the specification.
source/transactions/transactions.md
Outdated
| In addition, drivers MUST NOT add the RetryableWriteError label to any error that occurs during a write command within a | ||
| transaction (excepting commitTransation and abortTransaction), even when retryWrites has been enabled on the | ||
| MongoClient, unless the server response is a retryable backpressure error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it actually true that when retrying a command that failed with a retryable backpressure error within a transaction, the driver is to add the RetryableWriteError label to that error? Note that
retryable-writes.mddoes not have a similar update to its requirements about adding theRetryableWriteErrorlabel, and currently says "drivers MUST NOT add theRetryableWriteErrorlabel to any error that occurs during a write command within a transaction (exceptingcommitTransationandabortTransaction), even whenretryWriteshas been set to true on theMongoClient." (given the current structure and wording of the specifications, we should not change this requirement inretryable-writes.md; if the change is needed, it should be inclient-backpressure.md)- nor does the
client-backpressure.mdsay anything aboutRetryableWriteError.
P.S. I fail to understand the value the driver adding the undocumented RetryableWriteError label to errors brings to applications. retryable-writes.md, including its section "Why does the driver only add the RetryableWriteError label to errors that occur on a MongoClient with retryWrites set to true?" does not shed any light on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow - the spec says "drivers MUST NOT" add the label. Can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"drivers MUST NOT" is what the specification was saying before this PR. In the current state of the PR, it says
...drivers MUST NOT add the RetryableWriteError label...unless the server response is a retryable backpressure error.
The only way I can interpret this change is that now that we have "retryable backpressure errors"1, the driver MUST add the RetryableWriteError lavel to such errors. Hopefully, this clarifies what was unclear, and we can address my original comment in this thread.
1 At the time of writing this comment I am still not sure what a "retryable backpressure error" is. We are addressing this in a separate thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blink1073 - please chime in here if I'm incorrect.
Yes, this is correct. Without this label, write commands would not be retried inside of a transaction.
I expect this will be clarified in https://jira.mongodb.org/browse/DRIVERS-3352 (this will go away once RetryableError itself is a criteria for non-backpressure retryable reads and writes).
stIncMale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notes for myself:
- The last reviewed commit is 530e727.
- I reviewed only the changes in
.mdfiles, and do not plan to review the tests. - I have not re-reviewed the pseudocode illustrating the overload retry policy. I will do that after us settling on the requirements expressed in the prose part of the specification.
source/transactions/transactions.md
Outdated
| Drivers SHOULD apply a majority write concern when retrying commitTransaction to guard against a transaction being | ||
| applied twice. | ||
|
|
||
| Drivers SHOULD NOT modify the write concern on commit transaction commands when retrying a backpressure error, since we | ||
| are sure that the transaction has not been applied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a misunderstanding here - this change does not change the behavior when a driver encounters a retryable that is not an overload error.
That's how I interpreted the wording, i.e., it does not seem I misunderstood this part.
Let's consider the "A retryable backpressure error indicates no work was performed by the server, and the rationale outlined in this section for using majority write concern on retries is therefore irrelevant." argument:
- (a) "A retryable backpressure error indicates no work was performed by the server"
- The silent part here is "...when executing the
commitTransactionthat failed with the retryable backpressure error". Clearly, the statement about no work having been performed by the server does not apply to any other command executions the MongoDB deployment has completed, including previous failed attempts of thecommitTransactionin question.
- The silent part here is "...when executing the
- (b) "the rationale outlined in this section for using majority write concern on retries is therefore irrelevant"
- This conclusion relies on the premise that the statement about no work having been performed by the server is applicable to all failed attempts to execute
commitTransactionpreceeding the attempt that failed with the retryable backpressure error. But, as per (a), the premise is false, which makes the conclusion unjustified.
- This conclusion relies on the premise that the statement about no work having been performed by the server is applicable to all failed attempts to execute
I explained why I think the conclusion "the rationale outlined in this section for using majority write concern on retries is therefore irrelevant" is unjustified. It still could have been true based on an different argument. However, I think the following example, which I briefly outlined previously, demonstrates that the conclusion is false (this time I am specifying the example in details; it is based on the example from the "Majority write concern is used when retrying commitTransaction" section, relies on that example and arguments being valid, and differs only in item 4):
- [this text is the same as in the specification] The driver is connected to a replica set where node
Ais primary. - [this text is the same as in the specification] The driver sends
commitTransactiontoAwithw:1.Acommits the transaction but dies before it can reply. This constitutes a retryable error, which means the driver can retry thecommitTransactioncommand. - [this text is the same as in the specification] Node
Bis briefly elected. - [this text is new] The driver retries
commitTransactiononBwithw:majority, andBreplies with an error eligible for retry under the overload retry policy. The rest of the requirements of the overload retry policy is also met, and the driver is going to attempt executing thatcommitTransactionagain, as described in sub-item 4.1.
4.1 [this text is the same as item 4 in the specification] The driver retriescommitTransactiononBwithw:1, andBreplies with aNoSuchTransactionerror code andTransientTransactionErrorerror label. This implies that the driver may retry the entire transaction. - [this text is the same as in the specification] Node
Arevives beforeBhas done anyw:majoritywrites and is reelected as primary. - [this text is the same as in the specification] The driver then retries the entire transaction on
Awhere it commits successfully.
[this text is the same as in the specification] The outcome of this scenario is that two complete executions of the transaction operations are permanently committed on node A. Drivers can avoid this scenario if they always use a majority write concern when retrying commitTransaction.
Thus, the requirement "Drivers should apply a majority write concern when retrying commitTransaction to guard against a transaction being applied twice." should stay untouched.
Notes for myself:
- If eventually it turns out that I am incorrect, and changing the requirement is reasonable, we should also change it in the commitTransaction section, where it says "When commitTransaction is retried...".
- The aforementioned "When commitTransaction is retried..." requirement uses MUST, therefore, the section "Majority write concern is used when retrying commitTransaction" must use MUST, but currently it uses SHOULD.
| This command uses the readConcern set on the transaction's TransactionOptions during the `startTransaction` call. It is | ||
| not inherited from a client, database, or collection at the time of the first command. | ||
|
|
||
| The session transitions to the "transaction in progress" state after completing the first command within a transaction — |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we change the wording here? Reading the rest of the spec and your comment on the implementation, this is not necessarily true anymore, is it?
stIncMale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitting the few pending comments we discussed at today's meeting.
| specification. | ||
| 4. A token can be consumed from the token bucket. | ||
| 6. A retry attempt consumes 1 token from the token bucket. | ||
| 7. If the request is eligible for retry (as outlined in step 4), the client MUST apply exponential backoff according to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"step 4" here and two more "step 4" references below need should say "step 5": the numbers changed after the latest addition of a step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "two more "step 4" references below need should say "step 5"" part from the first comment in this thread still needs addressing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stIncMale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last reviewed commit is 12ab836.
- I am still to come back to #1862 (comment)
- I am still to review the "Pseudocode" section.
| APIs not subject to the overload retry policy include commands executed on unauthenticated connections: | ||
|
|
||
| - monitoring commands and round-trip time pingers | ||
| - commands executed during authentication (i.e., `saslStart`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add "... and re-authentication"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are four comments related to this section. I didn't address any of them specifically, but I believe they're all addressed:
- Reauthentication is included in the list now
- I've added hyper links, where relevant
- I consistently use
commandinstead of APIs or operations
| APIs not subject to the overload retry policy include commands executed on unauthenticated connections: | ||
|
|
||
| - monitoring commands and round-trip time pingers | ||
| - commands executed during authentication (i.e., `saslStart`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This sentence starts with talking about APIs, but then talks only about commands. The latter is expected, as any retry policy we have applies to a command, not to an operation, and not to API. Given this, it is confusing that the sentence starts with talking about APIs, and seem to not add any information.
- The current "include commands" wording implies that there are commands (executed against a MongoDB deployment?) beyond the ones listed that are not subject to the overload retry policy. Do we know what they are?
- Is it correct that when an application uses a MongoDB deployment that does not require authentication, then the overload retry policy is not applicable? This question seems relevant in the context of the backpressure project, given that the
SystemOverloadedError(andRetryableError?) are planned to be added by a connection pool (maybe, when a connection is not authenticated, the pool should not add the labels)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are four comments related to this section. I didn't address any of them specifically, but I believe they're all addressed:
- Reauthentication is included in the list now
- I've added hyper links, where relevant
- I consistently use
commandinstead of APIs or operations
| [retryable writes](../retryable-writes/retryable-writes.md) and the | ||
| [transactions](../transactions/transactions.md) specifications. | ||
|
|
||
| ##### Relevant driver processes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When referring to various terms/specs in this section, let's make then hyper-links. It is difficult to navigate the specifications repo when there are no hyperlinks, especially if a reader does not know right away where the mentioned term is described (and even then, finding that place manually requires unnecessary work on behalf of the reader).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are four comments related to this section. I didn't address any of them specifically, but I believe they're all addressed:
- Reauthentication is included in the list now
- I've added hyper links, where relevant
- I consistently use
commandinstead of APIs or operations
| The retry policy defined above is only relevant for commands sent on authenticated connections, which | ||
|
|
||
| - any user-facing API which wraps a server command (i.e., a CRUD command or runCommand) | ||
| - cursors and change streams (including getMores and killCursors) | ||
| - APIs which might perform multiple operations internally (such rewrapManyDataKey(), which performs a find() and a bulk | ||
| update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence starts with talking about commands, but then talks about APIs and operations. This is confusing, as any retry policy applies only to commands, and the seemingly incorrect syntax of the sentence adds to the confusion ("...commands sent on authenticated connections, which any user-facing API...").
Is it incorrect to say that any command sent via an authenticated connection to a MongoDB deployment is subject to the overload retry policy? (the "to a MongoDB deployment" part is here to exclude commands sent to mongocryptd).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are four comments related to this section. I didn't address any of them specifically, but I believe they're all addressed:
- Reauthentication is included in the list now
- I've added hyper links, where relevant
- I consistently use
commandinstead of APIs or operations
| - Only errors that contain the `SystemOverloadedError` consume tokens from the token bucket before retrying. | ||
| - When a failed attempt is retried, backoff must be applied if and only if the attempt error contains the | ||
| `SystemOverloadedError` label. | ||
| - If an overload error is encountered, the maximum number of retries for any retry policy becomes MAX_RETRIES. If CSOT | ||
| is enabled and a command has already retried more than MAX_RETRIES times, it MUST NOT be retried further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these three items refer to an overload error:
- The first two do this by repeating its definition, saying something like "error contains the
SystemOverloadedErrorlabel". - The third one does this by using the "overload error" term.
The "overload error", as far as I remember, was introduced for convenience, as it is shorter to say than "and error that contains the SystemOverloadedError label".
- Given that, let's use the term in the specification wherever possible, instead of repeating the definition of the overload error?
- At the very least, let's use the same way of referring to an overload error in these items that constitute the same list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've audited the spec - it now consistently uses the term "overload error" instead of "contains the SystemOverloadedError` label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the same change then in the name of one of the Q&A subsections:
- "if a
SystemOverloadedErroris received" -> "if an overload error is received"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
baileympearson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitting review before tomorrow's meeting
| specification. | ||
| 4. A token can be consumed from the token bucket. | ||
| 6. A retry attempt consumes 1 token from the token bucket. | ||
| 7. If the request is eligible for retry (as outlined in step 4), the client MUST apply exponential backoff according to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| - Only errors that contain the `SystemOverloadedError` consume tokens from the token bucket before retrying. | ||
| - When a failed attempt is retried, backoff must be applied if and only if the attempt error contains the | ||
| `SystemOverloadedError` label. | ||
| - If an overload error is encountered, the maximum number of retries for any retry policy becomes MAX_RETRIES. If CSOT | ||
| is enabled and a command has already retried more than MAX_RETRIES times, it MUST NOT be retried further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've audited the spec - it now consistently uses the term "overload error" instead of "contains the SystemOverloadedError` label
stIncMale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An update to #1862 (review): the last reviewed commit is 1b32e00.
| specification. | ||
| 4. A token can be consumed from the token bucket. | ||
| 6. A retry attempt consumes 1 token from the token bucket. | ||
| 7. If the request is eligible for retry (as outlined in step 4), the client MUST apply exponential backoff according to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "two more "step 4" references below need should say "step 5"" part from the first comment in this thread still needs addressing.
| - Only errors that contain the `SystemOverloadedError` consume tokens from the token bucket before retrying. | ||
| - When a failed attempt is retried, backoff must be applied if and only if the attempt error contains the | ||
| `SystemOverloadedError` label. | ||
| - If an overload error is encountered, the maximum number of retries for any retry policy becomes MAX_RETRIES. If CSOT | ||
| is enabled and a command has already retried more than MAX_RETRIES times, it MUST NOT be retried further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the same change then in the name of one of the Q&A subsections:
- "if a
SystemOverloadedErroris received" -> "if an overload error is received"
DRIVERS-3239
Overview
This PR adds support for a new class of errors (
SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).
Python will be second implementer.
Node implementation: mongodb/node-mongodb-native#4806
Testing
The testing strategy is two-fold:
Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:
SystemOverloadedErrorlabelFollowing Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.
Update changelog.
Test changes in at least one language driver.
Test these changes against all server versions and topologies (including standalone, replica set, and sharded
clusters).