Bug Report: Occasional deadlock when killing a Fleet-launched process #42

@pepedinho

Description

When a process started by Fleet is killed externally (e.g., via kill <pid>), Fleet occasionally ends up in a deadlock state.
In this situation, the orchestrator no longer reacts as expected:

  • Pipelines/jobs depending on the killed process hang indefinitely
  • Fleet doesn’t recover or release the lock without manual intervention

This seems related to how Fleet manages child processes and their async join handles.


Steps to Reproduce

  1. Start a project with a pipeline that launches a long-running process (e.g., sleep 1000 or a Docker container).
  2. From outside Fleet, kill the process (e.g., kill <pid>).
  3. Observe Fleet’s behavior:
    • Sometimes it cleans up correctly
    • Sometimes it deadlocks, leaving the pipeline stuck indefinitely and the daemon unresponsive to the CLI

Expected Behavior

Fleet should:

  • Detect when a child process is killed externally
  • Gracefully handle cleanup (release locks, mark the job as failed, and continue)
  • Avoid getting stuck in a deadlock state
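
To make the expected behavior concrete, here is a minimal sketch of the detect-and-clean-up path. It is not Fleet's actual code: `JobState`, `JobTable`, `start_job`, and `reap_job` are hypothetical names, and it uses blocking `std::process` on Unix rather than tokio so that it stays dependency-free. The same shape should carry over to `tokio::process::Command` with `child.wait().await`. The point it illustrates is that the lock is never held while waiting on the child, and that a signal-terminated child is recorded as failed.

```rust
use std::collections::HashMap;
use std::io;
use std::os::unix::process::ExitStatusExt; // Unix-only: adds ExitStatus::signal()
use std::process::{Child, Command};
use std::sync::{Arc, Mutex};

/// Hypothetical job states; Fleet's real types may differ.
#[derive(Debug, Clone, PartialEq)]
enum JobState {
    Running,
    Succeeded,
    Failed { signal: Option<i32>, code: Option<i32> },
}

/// Hypothetical shared job table keyed by job name.
type JobTable = Arc<Mutex<HashMap<String, JobState>>>;

/// Spawns `program` and records the job as running.
fn start_job(table: &JobTable, name: &str, program: &str, args: &[&str]) -> io::Result<Child> {
    let child = Command::new(program).args(args).spawn()?;
    table.lock().unwrap().insert(name.to_string(), JobState::Running);
    Ok(child)
}

/// Waits for the child without holding the lock, then records the outcome.
fn reap_job(table: &JobTable, name: &str, mut child: Child) -> io::Result<JobState> {
    // Blocking wait happens with no lock held, so other tasks can still
    // read or update the table while the process is alive.
    let status = child.wait()?;
    let state = if status.success() {
        JobState::Succeeded
    } else {
        // signal() is Some(n) when the process was terminated by signal n,
        // which is what an external `kill <pid>` looks like from here.
        JobState::Failed { signal: status.signal(), code: status.code() }
    };
    // The guard is a temporary, so the lock is released at the end of this statement.
    table.lock().unwrap().insert(name.to_string(), state.clone());
    Ok(state)
}
```

With this sketch, starting `sleep 1000` through `start_job`, killing it from a shell, and then calling `reap_job` should leave the job marked as `Failed` with the terminating signal recorded, instead of leaving it in `Running`.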

Actual Behavior

  • Deadlock occurs occasionally (not deterministic).
  • Requires manual intervention (restart Fleet or stop/restart the pipeline).

Possible Cause (Hypothesis)

This may be due to:

  • tokio::process::Child not propagating external kills properly
  • Incomplete cleanup of async tasks that are still awaiting the child’s exit
  • Lock contention when the job manager tries to update state after the process disappears unexpectedly
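
If lock contention is the culprit, one possible way around it is to avoid sharing a lock between the waiter and the job manager at all: a watcher owns the child and reports its exit over a channel, and the manager waits for that event with a timeout. The sketch below is only an illustration of that idea under stated assumptions. `ExitEvent`, `spawn_watched`, `describe`, and `await_exit` are hypothetical names, and it uses `std::thread` and `std::sync::mpsc` on Unix to stay dependency-free. In an async codebase the analogous pieces would be a spawned task, `tokio::sync::mpsc`, and `tokio::time::timeout`.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt; // Unix-only: adds ExitStatus::signal()
use std::process::{Command, ExitStatus};
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

/// Hypothetical event sent by a watcher when its child exits.
#[derive(Debug)]
struct ExitEvent {
    job: String,
    status: io::Result<ExitStatus>,
}

/// Spawns `program` plus a watcher thread that owns the child and reports its exit.
/// Returns the child's PID and the receiving end of the event channel.
fn spawn_watched(job: &str, program: &str, args: &[&str]) -> io::Result<(u32, Receiver<ExitEvent>)> {
    let mut child = Command::new(program).args(args).spawn()?;
    let pid = child.id();
    let (tx, rx) = mpsc::channel();
    let job = job.to_string();
    thread::spawn(move || {
        let status = child.wait();
        // send() only fails if the receiver was dropped; nothing left to notify then.
        let _ = tx.send(ExitEvent { job, status });
    });
    Ok((pid, rx))
}

/// Describes an exit status, distinguishing signal termination from a normal exit.
fn describe(status: &ExitStatus) -> String {
    match (status.code(), status.signal()) {
        (Some(0), _) => "succeeded".to_string(),
        (Some(code), _) => format!("failed with exit code {}", code),
        (None, Some(sig)) => format!("killed by signal {}", sig),
        (None, None) => "ended with unknown status".to_string(),
    }
}

/// Manager side: waits at most `timeout` for an exit event, so a missing
/// event shows up as `None` instead of blocking forever.
fn await_exit(rx: &Receiver<ExitEvent>, timeout: Duration) -> Option<ExitEvent> {
    rx.recv_timeout(timeout).ok()
}
```

Because the manager is the only place that updates job state, there is no lock for the watcher to contend on, and the timeout gives the daemon a chance to log or retry rather than hang.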

Additional Context

  • The issue is intermittent, which suggests a race condition in the async handling.
  • Might require using wait_with_output or explicit signal handling to avoid dangling futures.
  • Logs (if available) could help identify the exact deadlock point.

Labels: bug (Something isn't working)
