Central worker recovery should not be triggered because of unacknowledged messages #123

@danfilip

Description

Prologue:

Central checks every 2 minutes whether a worker needs recovery (WorkerRecoveryServiceImpl.java). A worker is deemed to need recovery if
(the worker has not sent a keepalive recently) OR (the queue contains unacknowledged messages for this worker)
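The condition above can be sketched roughly as follows. This is only an illustration of the described logic, not the actual code in WorkerRecoveryServiceImpl.java: the class name, method name, parameters and the keepalive timeout value are all assumptions.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the check as described in this issue.
// None of these names are taken from the real WorkerRecoveryServiceImpl.
class CurrentRecoveryCheck {

    // Returns true when the worker would be sent to recovery.
    static boolean needsRecovery(Instant lastKeepAlive,
                                 Instant now,
                                 Duration keepAliveTimeout, // assumed parameter
                                 int unackedMessages) {
        // Condition 1: no keepalive received within the timeout window.
        boolean keepAliveStale =
                Duration.between(lastKeepAlive, now).compareTo(keepAliveTimeout) > 0;
        // Condition 2: the queue still holds unacknowledged messages for this worker.
        boolean hasUnacked = unackedMessages > 0;
        // The OR is the problem: a healthy worker is recovered anyway.
        return keepAliveStale || hasUnacked;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Keepalive 5 seconds ago, but 8 unacknowledged messages in the queue.
        boolean recover = needsRecovery(now.minusSeconds(5), now, Duration.ofMinutes(2), 8);
        System.out.println("recover = " + recover);
    }
}
```

With a fresh keepalive and 8 unacknowledged messages, the second condition alone is enough to send the worker to recovery, which is the situation described below.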

What happens:

  1. There is only one worker in the RAS_Operator_Path group alias. The worker can connect to Central fine; there is no connection issue.
  2. The queue contains 8 older messages that this RAS has not acknowledged.
  3. Every 2 minutes, Central triggers recovery of this worker, which in turn triggers the recovery/restart of the worker itself (due to the WRW change).
  4. That restart stops all running flows on the worker, slowing down all executions on Central.
  5. Because this is the only worker in RAS_Operator_Path, and Central's recovery mechanism has already marked it as "not alive", the unacknowledged messages can't be assigned to it (despite it being very much alive).
    This makes it impossible to reassign those 8 messages, so recovery for this worker is attempted every 2 minutes, forever, or until another worker joins RAS_Operator_Path.
  6. Operations that need more than 2 minutes will NEVER complete, and their flows remain stuck.

Suggestions:

Worker recovery should not be triggered by the presence of unacknowledged queue messages. If the worker is alive (it has sent keepalives) and it is the only one in that group, Central should try to reassign those messages to it again instead of triggering the recovery process, which brutally destroys all flow runs on that worker.
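A minimal sketch of the suggested decision, under the same assumptions as before (all names here are hypothetical and not part of the existing code base). The handling of an alive worker that shares its group with other workers is not spelled out in this issue, so the REASSIGN_WITHIN_GROUP branch is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the proposed behaviour; names are illustrative only.
class ProposedRecoveryCheck {

    enum Action { RECOVER_WORKER, REASSIGN_TO_SAME_WORKER, REASSIGN_WITHIN_GROUP, NONE }

    static Action decide(Instant lastKeepAlive,
                         Instant now,
                         Duration keepAliveTimeout, // assumed parameter
                         int unackedMessages,
                         int workersInGroup) {
        boolean alive =
                Duration.between(lastKeepAlive, now).compareTo(keepAliveTimeout) <= 0;
        if (!alive) {
            // Only a missing keepalive triggers recovery.
            return Action.RECOVER_WORKER;
        }
        if (unackedMessages == 0) {
            return Action.NONE;
        }
        // Alive worker with unacknowledged messages: reassign, never recover.
        return workersInGroup <= 1
                ? Action.REASSIGN_TO_SAME_WORKER
                : Action.REASSIGN_WITHIN_GROUP; // assumption, not stated in the issue
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // The scenario from this issue: single alive worker, 8 unacknowledged messages.
        System.out.println(decide(now.minusSeconds(5), now, Duration.ofMinutes(2), 8, 1));
    }
}
```

For the scenario reported here (one alive worker in RAS_Operator_Path, 8 unacknowledged messages), this returns REASSIGN_TO_SAME_WORKER, so running flows on the worker are left untouched.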
