When a subprocess exits, there is no generic notification to the operator. This isn't a bad design in itself, but in practice most of our subprocesses are not intended to have a limited duration, so it creates a trap: it can take a long time to notice that a subprocess we care about isn't behaving as designed.
Typical design (a rough sketch in code follows this list):
- a service (like the operator AtOnCall) inherits from SlowSubprocessMixin https://github.com/project8/dragonfly/blob/develop/dragonfly/implementations/at_on_call.py
- on startup a setup call is issued to start the control (and worker) processes https://github.com/project8/hardware/blob/master/software_config/dragonfly/claude/expert.yaml
- when the worker hits some failure mode, it exits with whatever cleanup has been implemented https://github.com/project8/dragonfly/blob/develop/dragonfly/implementations/subprocess_mixin.py#L70
- in the case of AtOnCall, the cleanup includes a Slack message saying it has been terminated, so we noticed this promptly, but that isn't necessarily universal behavior
- the pinger continues to hit the spimescape and reports it as responding, but doesn't recognize that the service is now only a useless shell
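For reference, here is a minimal sketch of the control/worker pattern described above. This is not the actual SlowSubprocessMixin code; the class and method names are illustrative only:

```python
import logging
import multiprocessing

logger = logging.getLogger(__name__)


class SlowSubprocessSketch(object):
    """Illustrative stand-in for SlowSubprocessMixin, not the real implementation."""

    def __init__(self, worker_target):
        self._worker_target = worker_target  # the long-running function the worker executes
        self._worker = None

    def start_worker(self):
        """Roughly what the startup 'setup' call triggers: launch the worker process."""
        self._worker = multiprocessing.Process(target=self._run_worker)
        self._worker.start()

    def _run_worker(self):
        """Runs inside the worker process; on any failure, do cleanup and exit."""
        try:
            self._worker_target()
        except Exception as err:
            logger.error('worker hit a failure mode: %s', err)
        finally:
            self.do_cleanup()  # AtOnCall posts to Slack here; other services may not

    def do_cleanup(self):
        # Nothing here is guaranteed to notify the operator; the parent service
        # keeps answering pings even though the worker is gone.
        logger.warning('worker exited')
```

The point of the sketch is just that nothing outside do_cleanup is guaranteed to tell anyone the worker died.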
Things that could be done: either make the restart more automatic (a) or make the crash more obvious (b).
(a) probably easiest by modifying the subprocess mixin's basic_control_target (or adding a continuous-control variant) so that when the is_alive check fails, it restarts the worker (see the first sketch after this list)
(b) the strongest method would be to override the ping functionality so that it checks whether the worker is alive (this only works for a single level of worker); the cleanup method could also be made to spam a lot more errors (see the second sketch after this list)
- either way, we could audit the cleanup behavior of the classes that inherit from the mixin to ensure they put out sufficient exit information, and harden against those failure modes
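For (a), a sketch of a continuous control target that restarts the worker whenever the is_alive check fails; the make_worker factory and the poll interval are assumptions, not existing pieces of the mixin:

```python
import logging
import time

logger = logging.getLogger(__name__)


def continuous_control_target(make_worker, poll_interval=10.0):
    """Keep the worker running: if is_alive() fails, log loudly and restart it.

    make_worker is a hypothetical factory returning a fresh, unstarted
    multiprocessing.Process; the real basic_control_target would need an
    equivalent hook to re-create its worker.
    """
    worker = make_worker()
    worker.start()
    while True:
        time.sleep(poll_interval)
        if not worker.is_alive():
            logger.error('worker exited (exitcode %s), restarting', worker.exitcode)
            worker = make_worker()
            worker.start()
```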
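For (b), a sketch of overriding the ping path so a dead worker makes the service fail its ping; the ping method name and the _worker attribute are assumptions about where the real spimescape/pinger hook lives:

```python
class WorkerAwarePingMixin(object):
    """Make the ping fail when the worker process has died (single worker level only)."""

    _worker = None  # expected to be set to the worker multiprocessing.Process

    def ping(self):
        # If the worker is gone, fail loudly so the pinger flags the service
        # instead of reporting a useless shell as responsive.
        if self._worker is None or not self._worker.is_alive():
            raise RuntimeError('worker process is not alive; service is an empty shell')
        return 'pong'
```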