NetheriteOrchestrationService.StartAsync throws cached exception on retry if first call fails #352

@cgillum

Description

The following was observed in a customer support case.

  1. The function app starts up and Netherite throws an exception because of a transient DNS issue when connecting to Event Hubs.
  2. The Functions host handles the failure and tries to restart the host startup process.
  3. NetheriteOrchestrationService.StartAsync gets called again on the same NetheriteOrchestrationService object instance, and the same exception is thrown. This time, however, we don't log it: the DNS error shows up only in the FunctionsLogs, not in the DurableFunctionsEvents log, which indicates that we're not actually running the startup logic a second time.
  4. This continues indefinitely.
  5. The problem stops only when the customer forcefully restarts the function app.

As discussed, this appears to be a caching issue in NetheriteOrchestrationService. The Azure Functions Host assumes that we can retry startup failures, which is why we see this behavior in production. However, NetheriteOrchestrationService.StartAsync doesn't appear to support retrying.

In order to be compatible with the Azure Functions host retry logic, we should change this behavior.
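
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the Netherite source: `CachingService`, `RunStartupAsync`, and `CachedFailureDemo` are hypothetical names, and the single cached task stands in for whatever NetheriteOrchestrationService caches internally. It only illustrates why a cached, faulted startup task makes every retry fail without re-running the startup logic.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a service that caches its startup task.
// This is an illustration only, not the Netherite implementation.
class CachingService
{
    Task startupTask;                                   // cached on first call, never reset
    public int StartupAttempts { get; private set; }    // times the startup logic actually ran

    // The first call creates the task; later calls reuse it, even if it faulted.
    public Task StartAsync() => this.startupTask ??= this.RunStartupAsync();

    async Task RunStartupAsync()
    {
        this.StartupAttempts++;
        await Task.Yield();
        throw new InvalidOperationException("simulated transient DNS failure");
    }
}

static class CachedFailureDemo
{
    // Calls StartAsync twice on the same instance, as a host retry would.
    public static async Task<(int Attempts, bool SameException)> RunAsync()
    {
        var service = new CachingService();
        Exception first = null, second = null;
        try { await service.StartAsync(); } catch (Exception e) { first = e; }
        try { await service.StartAsync(); } catch (Exception e) { second = e; }

        // The retry awaited the cached faulted task, so the startup logic ran
        // only once and the very same exception object was rethrown.
        return (service.StartupAttempts, ReferenceEquals(first, second));
    }

    static async Task Main()
    {
        var result = await RunAsync();
        Console.WriteLine($"startup attempts: {result.Attempts}, same exception rethrown: {result.SameException}");
    }
}
```

In this sketch the second call never reaches `RunStartupAsync`, which matches the observation that the error is not logged again on retry.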

Context from @sebastianburckhardt

...this breaks down into two internal tasks (one for starting the client and one for starting the workers), which are being cached. I believe the reason for this caching is to deal with early and/or concurrent client calls and to make exceptions in the various startup tasks visible to the application. I agree this is complicated, and I would love something simpler, but it was tricky the last time I tried because the environment does not follow the pattern that would be easy to implement (wait for a successful StartAsync before calling any client operations).

It would probably not be all that difficult to just clear all the cached failed transitions when StartAsync() is retried.
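
One way to read the suggestion above is sketched below. This is a simplified illustration under stated assumptions, not a proposed patch: `RetryableService`, `gate`, and `RetryDemo` are hypothetical names, and a single cached task stands in for the two internal tasks (client and workers) mentioned above. The idea is to discard a cached task only if it ended in failure or cancellation, so that in-flight and successful startups are still shared by early or concurrent callers.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of a retry-safe cached startup; not the Netherite implementation.
class RetryableService
{
    readonly object gate = new object();
    readonly Func<Task> startup;    // stand-in for the real startup logic
    Task startupTask;

    public RetryableService(Func<Task> startup) => this.startup = startup;

    public Task StartAsync()
    {
        lock (this.gate)
        {
            // Drop a cached task that faulted or was canceled so this call re-runs startup.
            if (this.startupTask != null && (this.startupTask.IsFaulted || this.startupTask.IsCanceled))
            {
                this.startupTask = null;
            }

            // In-flight or successful tasks stay cached, so concurrent callers share one attempt.
            return this.startupTask ??= this.startup();
        }
    }
}

static class RetryDemo
{
    // Simulates a startup that fails once (transient error) and then succeeds.
    public static async Task<(int Attempts, bool FirstFailed)> RunAsync()
    {
        int attempts = 0;
        var service = new RetryableService(async () =>
        {
            int attempt = ++attempts;
            await Task.Yield();
            if (attempt == 1) throw new InvalidOperationException("simulated transient DNS failure");
        });

        bool firstFailed = false;
        try { await service.StartAsync(); }
        catch (InvalidOperationException) { firstFailed = true; }

        await service.StartAsync();     // the retry re-runs the startup logic and succeeds
        return (attempts, firstFailed);
    }

    static async Task Main()
    {
        var result = await RunAsync();
        Console.WriteLine($"attempts: {result.Attempts}, first call failed: {result.FirstFailed}");
    }
}
```

Callers that awaited the failed task before the retry still observe the original exception, which preserves the visibility of startup errors that the caching was meant to provide. A real change would also need to handle any other state derived from the failed attempt, which this sketch does not model.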

Metadata

Assignees: No one assigned

Labels: P1 (Priority 1), bug (Something isn't working)