Skip to content

Conversation

@baileympearson
Copy link
Contributor

@baileympearson baileympearson commented Jan 8, 2026

This PR clarifies the behavior when multiple errors with NoWritesPerformed are encountered (potentially possible when CSOT is enabled, or in the future with backpressure).

I wanted to add testing, but I ran into some difficulties that led me to believe it probably isn't worth adding test infrastructure just for this test.

  • To test that the correct error is returned in this scenario, we need either a way to assert that the error returned is the first NoWritesPerformed error. However, command monitoring events do not include OP_MSG request ids and I don't know of another way to differentiate between requests.

To get around this, I wrote a simple JS proxy that added an incrementing counter to the payload of each response, so that the error a user receives has an identifier. This worked, but when I went to test this, I ran into another issue: to test this scenario, we need multiple retryable writes (CSOT or backpressure). Backpressure hasn't merged yet, so we can't rely on that feature. When I tried using CSOT, I could not actually reproduce this scenario because every CSOT error manifested as a timeoutMS error (which does not have a NoWritesPerformed error label).

One option is to make this PR depend on the client backpressure work and to put my proxy into drivers-evergreen-tools. This still might be more trouble than it's worth, because this is an additional CI variant for drivers, just for one test. The proxy must:

  • run not on standalone servers (so retryable writes are enabled)
  • proxy every node, or ensure we proxy the primary.
    • either of these options require extra work. either we must decode the URI to find all hosts, and then configure the proxy to proxy every host, and then provide a new URI against which the tests run OR we connect to the test URI, find the primary, proxy the primary and provide a new URI against which the tests can run.

Ultimately, I decided the testing here is more trouble than it's worth. Happy to reconsider if reviewers feel differently though.

Please complete the following before merging:

  • Is the relevant DRIVERS ticket in the PR title?
  • Update changelog.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, and sharded
    clusters).

@baileympearson baileympearson changed the title clarify retry behavior DRIVERS-3326: clarify retry behavior when errors with NoWritesPerformed are encountered Jan 8, 2026
@baileympearson baileympearson marked this pull request as ready for review January 14, 2026 17:58
@baileympearson baileympearson requested a review from a team as a code owner January 14, 2026 17:58
@baileympearson baileympearson requested review from jmikola and removed request for a team January 14, 2026 17:58
pool exception originating from the driver) or the error is labeled "NoWritesPerformed", the most recently
encountered error that does not contain a "NoWritesPerformed" label MUST be returned instead.
- If all server errors are labeled "NoWritesPerformed", then the first error should be raised.
- Otherwise, return the most recently encountered error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "otherwise" case that this bullet is actually addressing?

Combining the logic in the first two bullets (let me know if I'm mistaken here), drivers should return the most recent error without a "NoWritesPerformed" label, or fall back to the first error (if everything had "NoWritesPerformed").

Note: it's not clear to me if "NoWritesPerformed" is only added by the server for its responses or something a driver might tack onto a client-side error (that supports labels). I assume it's only a server error, as this spec never talks about adding it.

@jmikola
Copy link
Member

jmikola commented Jan 27, 2026

To test that the correct error is returned in this scenario, we need either a way to assert that the error returned is the first NoWritesPerformed error. However, command monitoring events do not include OP_MSG request ids and I don't know of another way to differentiate between requests.

To get around this, I wrote a simple JS proxy that added an incrementing counter to the payload of each response, so that the error a user receives has an identifier. This worked, but when I went to test this, I ran into another issue: to test this scenario, we need multiple retryable writes (CSOT or backpressure)

IIUC, a comprehensive test for this spec change requires multiple attempts to be made (i.e. CSOT or backpressure). Therefore, any proxy that modifies the error response between the server and client is insufficient.

Ignoring the CSOT/backpressure requirement for a moment, couldn't you forgo a proxy by staying multiple fail points with slightly different error codes and a "NoWritesPerformed" as needed? Kind of unfortunate that failCommand gives us no control over the error message, as that would probably be the easiest thing to tweak.

@baileympearson
Copy link
Contributor Author

@prestonvasquez I'm adding you as a review in case you have time; I think you wrote added NoWritesPerformed error handling to the spec a few years ago.


7. Disable the fail point on `s0`.

### 6. Test that drivers return the original error after encountering multiple WriteConcernErrors with a RetryableWriteError label.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] Is there a reason we only test for the second case?

If all errors indicate no attempt was made (e.g., all errors contain the NoWritesPerformed error label or are client-side errors before a command is sent), the first error encountered must be returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - I think it should be possible to test all scenarios. let me look into this

would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the
driver) or the error is labeled "NoWritesPerformed", the error from the previous attempt should be raised. If all server
errors are labeled "NoWritesPerformed", then the first error should be raised.
[Error Handling](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#error-handling)).
Copy link
Member

@prestonvasquez prestonvasquez Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocking/question] I think there is a bug in the specifications:

if (currentError is not DriverException && ! previousError.hasErrorLabel("NoWritesPerformed")) {
    previousError = currentError;
}

This condition checks previousError.hasErrorLabel("NoWritesPerformed"), but shouldn't it check currentError? If the current error has NoWritesPerformed, it seems we should keep the previous error. The Go Driver checks currentError, FWIW.

It may be worth fixing in this PR, as the implementation would affect the way the tests work if a driver implements 1-1. Notably, this + drivers with the ability to wrap errors could implement CSOT analogues to prose test 6, e.g. https://github.com/prestonvasquez/go-playground/blob/73b016048703550c87a42acbc6873ea78639bcbc/mgd_csot_behavior_test.go#L197-L381

It's worth pointing out that this part of retryability seems to conflict with the issue you're having:

When I tried using CSOT, I could not actually reproduce this scenario because every CSOT error manifested as a timeoutMS error (which does not have a NoWritesPerformed error label)

    if (timeoutMS == null) {
      /* If CSOT is not enabled, allow any retryable error from the second
       * attempt to propagate to our caller, as it will be just as relevant
       * (if not more relevant) than the original error. */
      if (retrying) {
        throw previousError;
      }
    } else if (isExpired(timeoutMS)) {
      /* CSOT is enabled and the operation has timed out. */
      throw previousError;
    }

Do you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so:

[blocking/question] I think there is a bug in the specifications:
...

Agreed - I think this is a typo. I'll fix

It's worth pointing out that this part of retryability seems to conflict with the issue you're having:
...

Hm. We are missing this logic in Node but I'm not sure what the correct behavior is. This contradicts the requirement that timeout errors are distinguishable: https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.md#non-tailable-cursors:~:text=If%20the%20timeoutMS%20option%20is%20set%20and%20the%20timeout%20expires%2C%20drivers%20MUST%20abort%20all%20blocking%20work%20and%20return%20control%20to%20the%20user%20with%20an%20error.%20This%20error%20MUST%20be%20distinguished%20in%20some%20way%20(e.g.%20custom%20exception%20type)%20to%20make%20it%20easier%20for%20users%20to%20detect%20when%20an%20operation%20fails%20due%20to%20a%20timeout.

Copy link
Member

@prestonvasquez prestonvasquez Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contradicts the requirement that timeout errors are distinguishable

"MUST be distinguished in some way" is not necessarily a contradiction. It's possible to align Go, for example. This check was added in 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabe but there are no unified spec tests AFAICT.

The behavior in the psuedocode seems reasonable to me but we need tests for it(*): trigger a retryable error, time out on retry, and asserts the returned error is both a timeout error and contains the original server error code.

Edit: Not suggesting we add (*) tests in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a bug in the specifications

The line that @prestonvasquez cited dates back to d1157f7 (#1466).

Also, 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabeR500 is the slightly older commit that introduced the "CSOT is enabled and the operation has timed out" branch.

}
```

Drivers SHOULD only configure the `10107` fail point command if the the failed event is for the `91` error configured
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we typically use MUST/SHOULD language in specs, but in this case is it even a SHOULD? Setting the 10107 error only after a 91 error seems crucial to the test.

would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the
driver) or the error is labeled "NoWritesPerformed", the error from the previous attempt should be raised. If all server
errors are labeled "NoWritesPerformed", then the first error should be raised.
[Error Handling](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#error-handling)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a bug in the specifications

The line that @prestonvasquez cited dates back to d1157f7 (#1466).

Also, 343ff9a#diff-01c94ffec48124f66d321265e57d6c892b1355813cf2bce099d0345ff222eabeR500 is the slightly older commit that introduced the "CSOT is enabled and the operation has timed out" branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants