[BUG] RegisterModelGroupStep can hang indefinitely if state index async update never completes #1310

@kokibas

Description

What is the bug?

RegisterModelGroupStep can hang until the node timeout expires (10 seconds by default, or the step-specific timeout) if the async state index update inside FlowFrameworkIndicesHandler.addResourceToStateIndex never completes. In that case the PlainActionFuture associated with the workflow step is never resolved, so the workflow node waits for the full timeout and then fails with a TimeoutException.

How can one reproduce the bug?

1. Trigger a workflow that executes RegisterModelGroupStep.

2. Let mlClient.registerModelGroup complete successfully.

3. During the subsequent call to FlowFrameworkIndicesHandler.addResourceToStateIndex, simulate a scenario where sdkClient.getDataObjectAsync() or sdkClient.updateDataObjectAsync() never completes (e.g. network partition, hung connection, missing client-level timeout).

4. Observe that the CompletableFuture.whenComplete() callback is never invoked.

5. Observe that the PlainActionFuture in RegisterModelGroupStep is never completed.
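
The steps above can be approximated without a cluster. The following is a minimal sketch using only JDK classes, not the actual flow-framework code: `stalledSdkCall` is a hypothetical stand-in for an SDK call that never completes, and a plain `CompletableFuture` stands in for the step's PlainActionFuture. It shows that the `whenComplete` callback never runs and that only the outer wait's timeout ends the hang.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class StalledStateUpdateRepro {

    /** Hypothetical stand-in for an SDK call that never completes. */
    static CompletableFuture<String> stalledSdkCall() {
        return new CompletableFuture<>(); // nothing ever completes this future
    }

    /**
     * Mimics the step: the step future is resolved only inside whenComplete,
     * so a stalled upstream call leaves it pending. Returns what the waiter saw.
     */
    static String runStep(CompletableFuture<String> sdkCall, long nodeTimeoutMillis) {
        // Stand-in for the step's PlainActionFuture (assumption for illustration).
        CompletableFuture<String> stepFuture = new CompletableFuture<>();
        AtomicBoolean callbackInvoked = new AtomicBoolean(false);

        sdkCall.whenComplete((result, error) -> {
            callbackInvoked.set(true);
            if (error != null) {
                stepFuture.completeExceptionally(error);
            } else {
                stepFuture.complete(result);
            }
        });

        try {
            // Stand-in for the node-level wait with a timeout.
            return "completed:" + stepFuture.get(nodeTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timeout, callbackInvoked=" + callbackInvoked.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed:" + e.getCause();
        }
    }

    public static void main(String[] args) {
        // Stalled call: the waiter times out and the callback never ran.
        System.out.println(runStep(stalledSdkCall(), 200));
        // Healthy call: the callback resolves the step future immediately.
        System.out.println(runStep(CompletableFuture.completedFuture("model_group_id"), 200));
    }
}
```

In a real test, the same effect could presumably be achieved by mocking the SDK client to return a future that is never completed.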

What is the expected behavior?

The workflow step should not hang indefinitely if the async state index update stalls.
Instead, the step should fail deterministically after a timeout and allow the workflow engine to handle the failure (retry, fail, or exit based on configuration).
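
One possible way to get this behavior, sketched here under assumptions rather than proposed as the fix: bound the async call itself with `CompletableFuture.orTimeout` (available since Java 9), so the `whenComplete` callback always fires and the step future is always resolved. The class and method names below are hypothetical, and a plain `CompletableFuture` again stands in for the step's PlainActionFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedStateUpdate {

    /** Adds a deadline: the future fails with TimeoutException if still pending. */
    static CompletableFuture<String> withDeadline(CompletableFuture<String> sdkCall, long timeoutMillis) {
        return sdkCall.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Resolves the step future on success, failure, or timeout; never leaves it pending. */
    static String updateState(CompletableFuture<String> sdkCall, long timeoutMillis) {
        CompletableFuture<String> stepFuture = new CompletableFuture<>();

        withDeadline(sdkCall, timeoutMillis).whenComplete((result, error) -> {
            if (error != null) {
                stepFuture.completeExceptionally(error); // includes the timeout case
            } else {
                stepFuture.complete(result);
            }
        });

        try {
            return "ok:" + stepFuture.join();
        } catch (CompletionException e) {
            Throwable cause = e.getCause();
            return cause instanceof TimeoutException ? "failed:timeout" : "failed:" + cause;
        }
    }

    public static void main(String[] args) {
        // A call that never completes now fails after roughly 200 ms.
        System.out.println(updateState(new CompletableFuture<>(), 200));
        // A call that completes normally is unaffected by the deadline.
        System.out.println(updateState(CompletableFuture.completedFuture("model_group_id"), 200));
    }
}
```

With a deadline on the inner call shorter than the node timeout, the step would surface a specific failure to the workflow engine instead of relying on the outer wait to expire.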

What is your host/environment?

Not environment-specific.
This issue can occur in any environment where async SDK calls do not have an enforced timeout (e.g. network instability or misconfigured client timeouts).

Do you have any screenshots?

N/A

Do you have any additional context?

While ProcessNode.execute() does apply a timeout (default 10 seconds from NODE_TIMEOUT_DEFAULT_VALUE, or the step-specific timeout from WorkflowStepFactory enum), this timeout only protects at the ProcessNode level. If addResourceToStateIndex hangs, the step's PlainActionFuture will not complete, causing the ProcessNode to wait until its timeout expires before failing. This means the workflow will be stuck in RUNNING state for the duration of the timeout period, which could be problematic for long-running workflows or if the timeout is configured to be very long.
