DRIVERS-3326: clarify retry behavior when errors with NoWritesPerformed are encountered #1878

baileympearson · 2026-01-08T22:42:16Z

This PR clarifies the behavior when multiple errors with NoWritesPerformed are encountered (potentially possible when CSOT is enabled, or in the future with backpressure).

I wanted to add testing, but I ran into some difficulties that led me to believe it probably isn't worth adding test infrastructure just for this test.

To test that the correct error is returned in this scenario, we need a way to assert that the error returned is the first NoWritesPerformed error. However, command monitoring events do not include OP_MSG request ids and I don't know of another way to differentiate between requests.

To get around this, I wrote a simple JS proxy that added an incrementing counter to the payload of each response, so that the error a user receives has an identifier. This worked, but when I went to test this, I ran into another issue: to test this scenario, we need multiple retryable writes (CSOT or backpressure). Backpressure hasn't merged yet, so we can't rely on that feature. When I tried using CSOT, I could not actually reproduce this scenario because every CSOT error manifested as a timeoutMS error (which does not have a NoWritesPerformed error label).

One option is to make this PR depend on the client backpressure work and to put my proxy into drivers-evergreen-tools. This still might be more trouble than it's worth, because this is an additional CI variant for drivers, just for one test. The proxy must:

- not run against standalone servers (so retryable writes are enabled)
- proxy every node, or ensure we proxy the primary.
  - Either of these options requires extra work: either we must decode the URI to find all hosts, configure the proxy to proxy every host, and provide a new URI against which the tests run, or we connect to the test URI, find the primary, proxy the primary, and provide a new URI against which the tests can run.

Ultimately, I decided the testing here is more trouble than it's worth. Happy to reconsider if reviewers feel differently though.

Please complete the following before merging:

- Is the relevant DRIVERS ticket in the PR title?
- Update changelog.
- Test changes in at least one language driver.
- Test these changes against all server versions and topologies (including standalone, replica set, and sharded clusters).

source/retryable-reads/retryable-reads.md

jmikola · 2026-01-27T15:45:27Z

source/retryable-writes/retryable-writes.md

+    pool exception originating from the driver) or the error is labeled "NoWritesPerformed", the most recently
+    encountered error that does not contain a "NoWritesPerformed" label MUST be returned instead.
+- If all server errors are labeled "NoWritesPerformed", then the first error should be raised.
+- Otherwise, return the most recently encountered error.


What is the "otherwise" case that this bullet is actually addressing?

Combining the logic in the first two bullets (let me know if I'm mistaken here), drivers should return the most recent error without a "NoWritesPerformed" label, or fall back to the first error (if everything had "NoWritesPerformed").

Note: it's not clear to me if "NoWritesPerformed" is only added by the server for its responses or something a driver might tack onto a client-side error (that supports labels). I assume it's only a server error, as this spec never talks about adding it.
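The combined rule described in this comment can be sketched as follows. This is a minimal illustration, not the spec's pseudocode: `AttemptError` and `select_error` are hypothetical names, and the sketch assumes every error in the list is a server error (client-side errors such as connection pool exceptions are ignored here).

```python
from dataclasses import dataclass

NO_WRITES_PERFORMED = "NoWritesPerformed"


@dataclass(frozen=True)
class AttemptError:
    """Hypothetical stand-in for a server error seen on one attempt."""
    code: int
    labels: frozenset = frozenset()

    def has_error_label(self, label: str) -> bool:
        return label in self.labels


def select_error(errors):
    """Pick the error to surface from attempt errors in encounter order."""
    if not errors:
        raise ValueError("at least one attempt error is required")
    # Prefer the most recent error that does NOT carry NoWritesPerformed.
    for error in reversed(errors):
        if not error.has_error_label(NO_WRITES_PERFORMED):
            return error
    # Every error was labeled NoWritesPerformed: fall back to the first one.
    return errors[0]
```

Under this reading there is no third case left over, which is the point of the question about the "otherwise" bullet.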

jmikola · 2026-01-27T16:08:03Z

> To test that the correct error is returned in this scenario, we need either a way to assert that the error returned is the first NoWritesPerformed error. However, command monitoring events do not include OP_MSG request ids and I don't know of another way to differentiate between requests.
>
> To get around this, I wrote a simple JS proxy that added an incrementing counter to the payload of each response, so that the error a user receives has an identifier. This worked, but when I went to test this, I ran into another issue: to test this scenario, we need multiple retryable writes (CSOT or backpressure)

IIUC, a comprehensive test for this spec change requires multiple attempts to be made (i.e. CSOT or backpressure). Therefore, any proxy that modifies the error response between the server and client is insufficient.

Ignoring the CSOT/backpressure requirement for a moment, couldn't you forgo a proxy by staging multiple fail points with slightly different error codes and a "NoWritesPerformed" label as needed? Kind of unfortunate that failCommand gives us no control over the error message, as that would probably be the easiest thing to tweak.
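A rough sketch of what such fail point documents might look like, built as plain Python dicts so no server is needed to read it. The helper name `fail_command` is hypothetical; the codes `91` and `10107` are the ones that appear in the README excerpt later in this thread. The sketch assumes the `errorLabels` field of `failCommand` replaces the labels the server would otherwise attach, so `RetryableWriteError` is listed explicitly alongside `NoWritesPerformed`.

```python
def fail_command(code, *, labels=(), commands=("insert",), times=1):
    """Build a failCommand document that fails `commands` with `code`."""
    data = {"failCommands": list(commands), "errorCode": code}
    if labels:
        # Assumption: explicit labels override the server's own labeling.
        data["errorLabels"] = list(labels)
    return {
        "configureFailPoint": "failCommand",
        "mode": {"times": times},
        "data": data,
    }


# Two fail points that differ only in error code, so the returned error
# can be told apart without inspecting the error message.
FIRST = fail_command(91, labels=("RetryableWriteError", "NoWritesPerformed"))
SECOND = fail_command(10107, labels=("RetryableWriteError", "NoWritesPerformed"))
```

Because both documents use the same fail point name, they would have to be sent one after the other rather than configured at the same time.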

baileympearson · 2026-01-28T17:19:16Z

@prestonvasquez I'm adding you as a reviewer in case you have time; I think you added NoWritesPerformed error handling to the spec a few years ago.

prestonvasquez · 2026-01-30T00:33:21Z

source/retryable-writes/tests/README.md


 7. Disable the fail point on `s0`.

+### 6. Test that drivers return the original error after encountering multiple WriteConcernErrors with a RetryableWriteError label.


[question] Is there a reason we only test for the second case?

> If all errors indicate no attempt was made (e.g., all errors contain the NoWritesPerformed error label or are client-side errors before a command is sent), the first error encountered must be returned.

No - I think it should be possible to test all scenarios. Let me look into this.

prestonvasquez · 2026-01-30T00:35:10Z

source/retryable-writes/retryable-writes.md

-would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the
-driver) or the error is labeled "NoWritesPerformed", the error from the previous attempt should be raised. If all server
-errors are labeled "NoWritesPerformed", then the first error should be raised.
+[Error Handling](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#error-handling)).


[blocking/question] I think there is a bug in the specifications:

    if (currentError is not DriverException && ! previousError.hasErrorLabel("NoWritesPerformed")) {
      previousError = currentError;
    }

This condition checks previousError.hasErrorLabel("NoWritesPerformed"), but shouldn't it check currentError? If the current error has NoWritesPerformed, it seems we should keep the previous error. The Go Driver checks currentError, FWIW.
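To make the difference concrete, here is a small sketch that runs both variants of the check on the same pair of errors. The class and function names are hypothetical and only model the quoted condition; they are not taken from any driver.

```python
NO_WRITES_PERFORMED = "NoWritesPerformed"


class SketchError(Exception):
    """Hypothetical base error carrying a code and error labels."""

    def __init__(self, code=None, labels=()):
        super().__init__(f"error code={code}")
        self.code = code
        self.labels = frozenset(labels)

    def has_error_label(self, label):
        return label in self.labels


class ServerError(SketchError):
    """Error returned by the server."""


class DriverError(SketchError):
    """Client-side error, e.g. a connection pool exception."""


def update_as_written(previous, current):
    # Mirrors the quoted line: the label is checked on the PREVIOUS error.
    if not isinstance(current, DriverError) and not previous.has_error_label(NO_WRITES_PERFORMED):
        return current
    return previous


def update_checking_current(previous, current):
    # Variant suggested in the review: check the label on the CURRENT error.
    if not isinstance(current, DriverError) and not current.has_error_label(NO_WRITES_PERFORMED):
        return current
    return previous


first = ServerError(91)
second = ServerError(10107, labels=[NO_WRITES_PERFORMED])
```

With these inputs, `update_as_written(first, second)` replaces the first error with the `NoWritesPerformed` one, while `update_checking_current(first, second)` keeps the first error.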

It may be worth fixing in this PR, as the implementation would affect the way the tests work if a driver implements 1-1. Notably, this + drivers with the ability to wrap errors could implement CSOT analogues to prose test 6, e.g. https://github.com/prestonvasquez/go-playground/blob/73b016048703550c87a42acbc6873ea78639bcbc/mgd_csot_behavior_test.go#L197-L381

It's worth pointing out that this part of retryability seems to conflict with the issue you're having:

> When I tried using CSOT, I could not actually reproduce this scenario because every CSOT error manifested as a timeoutMS error (which does not have a NoWritesPerformed error label)

    if (timeoutMS == null) {
      /* If CSOT is not enabled, allow any retryable error from the second
       * attempt to propagate to our caller, as it will be just as relevant
       * (if not more relevant) than the original error. */
      if (retrying) {
        throw previousError;
      }
    } else if (isExpired(timeoutMS)) {
      /* CSOT is enabled and the operation has timed out. */
      throw previousError;
    }

Do you agree?

Okay, so:

> [blocking/question] I think there is a bug in the specifications:
> ...

Agreed - I think this is a typo. I'll fix

> It's worth pointing out that this part of retryability seems to conflict with the issue you're having:
> ...

Hm. We are missing this logic in Node but I'm not sure what the correct behavior is. This contradicts the requirement that timeout errors are distinguishable: https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.md#non-tailable-cursors:~:text=If%20the%20timeoutMS%20option%20is%20set%20and%20the%20timeout%20expires%2C%20drivers%20MUST%20abort%20all%20blocking%20work%20and%20return%20control%20to%20the%20user%20with%20an%20error.%20This%20error%20MUST%20be%20distinguished%20in%20some%20way%20(e.g.%20custom%20exception%20type)%20to%20make%20it%20easier%20for%20users%20to%20detect%20when%20an%20operation%20fails%20due%20to%20a%20timeout.

> This contradicts the requirement that timeout errors are distinguishable

"MUST be distinguished in some way" is not necessarily a contradiction. It's possible to align Go, for example. This check was added in 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabe but there are no unified spec tests AFAICT.

The behavior in the pseudocode seems reasonable to me but we need tests for it(*): trigger a retryable error, time out on retry, and assert the returned error is both a timeout error and contains the original server error code.

Edit: Not suggesting we add (*) tests in this PR.
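One way a driver with error wrapping could satisfy both properties at once is sketched below: the surfaced error has a distinct timeout type, and the original server error stays reachable through it. `OperationTimeoutError`, `ServerError`, and `fail_with_timeout` are hypothetical names, and Python exception chaining stands in for whatever wrapping mechanism a given driver offers.

```python
class ServerError(Exception):
    """Hypothetical server error carrying a numeric code."""

    def __init__(self, code):
        super().__init__(f"server error {code}")
        self.code = code


class OperationTimeoutError(Exception):
    """Hypothetical distinct type for timeoutMS expiry."""


def fail_with_timeout(previous_error):
    # Surface a timeout error while keeping the earlier error as its cause.
    raise OperationTimeoutError("operation exceeded timeoutMS") from previous_error


def observe():
    """Return the error a caller would see after a retry times out."""
    try:
        fail_with_timeout(ServerError(91))
    except OperationTimeoutError as exc:
        return exc


caught = observe()
```

A test along the lines described above could then assert on both `type(caught)` and `caught.__cause__.code`.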

> I think there is a bug in the specifications

The line that @prestonvasquez cited dates back to d1157f7 (#1466).

Also, 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabeR500 is the slightly older commit that introduced the "CSOT is enabled and the operation has timed out" branch.

jmikola · 2026-01-30T20:11:19Z

source/retryable-writes/tests/README.md

+    }
+    ```
+
+    Drivers SHOULD only configure the `10107` fail point command if the the failed event is for the `91` error configured


I don't think we typically use MUST/SHOULD language in specs, but in this case is it even a SHOULD? Setting the 10107 error only after a 91 error seems crucial to the test.
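The ordering requirement discussed here can be sketched as a small handler that configures the second fail point only once, and only after the first configured error has been observed. Everything in this sketch is hypothetical: the class name, the callback signature, and the shape of the fail point document. A real test would hook this into the driver's command monitoring API instead.

```python
SECOND_FAIL_POINT = {
    "configureFailPoint": "failCommand",
    "mode": {"times": 1},
    "data": {"failCommands": ["insert"], "errorCode": 10107},
}


class SecondFailPointArmer:
    """Arms the 10107 fail point only after the 91 error is observed."""

    def __init__(self, configure_fail_point, first_code=91):
        # configure_fail_point: callable that sends a fail point document.
        self.configure_fail_point = configure_fail_point
        self.first_code = first_code
        self.armed = False

    def on_error_observed(self, error_code):
        """Return True if this call configured the second fail point."""
        if self.armed or error_code != self.first_code:
            return False
        self.configure_fail_point(SECOND_FAIL_POINT)
        self.armed = True
        return True
```

Guarding on the error code keeps unrelated failures, or the second error itself, from re-arming the fail point.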

jmikola · 2026-01-30T20:19:11Z

source/retryable-writes/retryable-writes.md


clarify retry behavior

303aee1

baileympearson changed the title ~~clarify retry behavior~~ DRIVERS-3326: clarify retry behavior when errors with NoWritesPerformed are encountered Jan 8, 2026

baileympearson added 2 commits January 14, 2026 10:31

update changelogs

fd1a011

update changelogs

12f48e5

baileympearson marked this pull request as ready for review January 14, 2026 17:58

baileympearson requested a review from a team as a code owner January 14, 2026 17:58

baileympearson requested review from jmikola and removed request for a team January 14, 2026 17:58

dariakp mentioned this pull request Jan 20, 2026

DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862

Open

3 tasks

clarify phrasing

8971e74

jmikola reviewed Jan 27, 2026


review comments from Jeremy

11da92b

baileympearson requested review from jmikola and prestonvasquez January 28, 2026 16:20

prestonvasquez reviewed Jan 30, 2026


jmikola reviewed Jan 30, 2026


4 participants

