-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-3326: clarify retry behavior when errors with NoWritesPerformed are encountered #1878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| pool exception originating from the driver) or the error is labeled "NoWritesPerformed", the most recently | ||
| encountered error that does not contain a "NoWritesPerformed" label MUST be returned instead. | ||
| - If all server errors are labeled "NoWritesPerformed", then the first error should be raised. | ||
| - Otherwise, return the most recently encountered error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the "otherwise" case that this bullet is actually addressing?
Combining the logic in the first two bullets (let me know if I'm mistaken here), drivers should return the most recent error without a "NoWritesPerformed" label, or fall back to the first error (if everything had "NoWritesPerformed").
Note: it's not clear to me if "NoWritesPerformed" is only added by the server for its responses or something a driver might tack onto a client-side error (that supports labels). I assume it's only a server error, as this spec never talks about adding it.
IIUC, a comprehensive test for this spec change requires multiple attempts to be made (i.e. CSOT or backpressure). Therefore, any proxy that modifies the error response between the server and client is insufficient. Ignoring the CSOT/backpressure requirement for a moment, couldn't you forgo a proxy by staying multiple fail points with slightly different error codes and a "NoWritesPerformed" as needed? Kind of unfortunate that failCommand gives us no control over the error message, as that would probably be the easiest thing to tweak. |
|
@prestonvasquez I'm adding you as a review in case you have time; I think you wrote added |
|
|
||
| 7. Disable the fail point on `s0`. | ||
|
|
||
| ### 6. Test that drivers return the original error after encountering multiple WriteConcernErrors with a RetryableWriteError label. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[question] Is there a reason we only test for the second case?
If all errors indicate no attempt was made (e.g., all errors contain the
NoWritesPerformederror label or are client-side errors before a command is sent), the first error encountered must be returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - I think it should be possible to test all scenarios. let me look into this
| would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the | ||
| driver) or the error is labeled "NoWritesPerformed", the error from the previous attempt should be raised. If all server | ||
| errors are labeled "NoWritesPerformed", then the first error should be raised. | ||
| [Error Handling](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#error-handling)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[blocking/question] I think there is a bug in the specifications:
if (currentError is not DriverException && ! previousError.hasErrorLabel("NoWritesPerformed")) {
previousError = currentError;
}This condition checks previousError.hasErrorLabel("NoWritesPerformed"), but shouldn't it check currentError? If the current error has NoWritesPerformed, it seems we should keep the previous error. The Go Driver checks currentError, FWIW.
It may be worth fixing in this PR, as the implementation would affect the way the tests work if a driver implements 1-1. Notably, this + drivers with the ability to wrap errors could implement CSOT analogues to prose test 6, e.g. https://github.com/prestonvasquez/go-playground/blob/73b016048703550c87a42acbc6873ea78639bcbc/mgd_csot_behavior_test.go#L197-L381
It's worth pointing out that this part of retryability seems to conflict with the issue you're having:
When I tried using CSOT, I could not actually reproduce this scenario because every CSOT error manifested as a timeoutMS error (which does not have a NoWritesPerformed error label)
if (timeoutMS == null) {
/* If CSOT is not enabled, allow any retryable error from the second
* attempt to propagate to our caller, as it will be just as relevant
* (if not more relevant) than the original error. */
if (retrying) {
throw previousError;
}
} else if (isExpired(timeoutMS)) {
/* CSOT is enabled and the operation has timed out. */
throw previousError;
}Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so:
[blocking/question] I think there is a bug in the specifications:
...
Agreed - I think this is a typo. I'll fix
It's worth pointing out that this part of retryability seems to conflict with the issue you're having:
...
Hm. We are missing this logic in Node but I'm not sure what the correct behavior is. This contradicts the requirement that timeout errors are distinguishable: https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.md#non-tailable-cursors:~:text=If%20the%20timeoutMS%20option%20is%20set%20and%20the%20timeout%20expires%2C%20drivers%20MUST%20abort%20all%20blocking%20work%20and%20return%20control%20to%20the%20user%20with%20an%20error.%20This%20error%20MUST%20be%20distinguished%20in%20some%20way%20(e.g.%20custom%20exception%20type)%20to%20make%20it%20easier%20for%20users%20to%20detect%20when%20an%20operation%20fails%20due%20to%20a%20timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This contradicts the requirement that timeout errors are distinguishable
"MUST be distinguished in some way" is not necessarily a contradiction. It's possible to align Go, for example. This check was added in 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabe but there are no unified spec tests AFAICT.
The behavior in the psuedocode seems reasonable to me but we need tests for it(*): trigger a retryable error, time out on retry, and asserts the returned error is both a timeout error and contains the original server error code.
Edit: Not suggesting we add (*) tests in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a bug in the specifications
The line that @prestonvasquez cited dates back to d1157f7 (#1466).
Also, 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabeR500 is the slightly older commit that introduced the "CSOT is enabled and the operation has timed out" branch.
| } | ||
| ``` | ||
|
|
||
| Drivers SHOULD only configure the `10107` fail point command if the the failed event is for the `91` error configured |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we typically use MUST/SHOULD language in specs, but in this case is it even a SHOULD? Setting the 10107 error only after a 91 error seems crucial to the test.
| would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the | ||
| driver) or the error is labeled "NoWritesPerformed", the error from the previous attempt should be raised. If all server | ||
| errors are labeled "NoWritesPerformed", then the first error should be raised. | ||
| [Error Handling](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#error-handling)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a bug in the specifications
The line that @prestonvasquez cited dates back to d1157f7 (#1466).
Also, 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabeR500 is the slightly older commit that introduced the "CSOT is enabled and the operation has timed out" branch.
This PR clarifies the behavior when multiple errors with NoWritesPerformed are encountered (potentially possible when CSOT is enabled, or in the future with backpressure).
I wanted to add testing, but I ran into some difficulties that led me to believe it probably isn't worth adding test infrastructure just for this test.
To get around this, I wrote a simple JS proxy that added an incrementing counter to the payload of each response, so that the error a user receives has an identifier. This worked, but when I went to test this, I ran into another issue: to test this scenario, we need multiple retryable writes (CSOT or backpressure). Backpressure hasn't merged yet, so we can't rely on that feature. When I tried using CSOT, I could not actually reproduce this scenario because every CSOT error manifested as a timeoutMS error (which does not have a NoWritesPerformed error label).
One option is to make this PR depend on the client backpressure work and to put my proxy into drivers-evergreen-tools. This still might be more trouble than it's worth, because this is an additional CI variant for drivers, just for one test. The proxy must:
Ultimately, I decided the testing here is more trouble than it's worth. Happy to reconsider if reviewers feel differently though.
Please complete the following before merging:
clusters).