When a subprocess exits, there is no generic notification to the operator. This isn't a bad design in itself, but in practice most of our subprocesses are not intended to have a limited duration, so it creates a trap: it can take a long time to notice that a subprocess we care about isn't behaving as designed.
Typical design (a rough sketch in code follows this list):
- a service (like the operator AtOnCall) inherits from SlowSubprocessMixin https://github.com/project8/dragonfly/blob/develop/dragonfly/implementations/at_on_call.py
- on startup a setup call is issued to start the control (and worker) processes https://github.com/project8/hardware/blob/master/software_config/dragonfly/claude/expert.yaml
- when the worker hits some failure mode, it exits with whatever cleanup has been implemented https://github.com/project8/dragonfly/blob/develop/dragonfly/implementations/subprocess_mixin.py#L70
- in the case of AtOnCall, the cleanup includes a Slack message saying it has been terminated, so we noticed this promptly, but that isn't necessarily universal behavior
- the pinger continues to hit the spimescape and reports it as responding, but doesn't recognize that the service is now only a useless shell
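For reference, here is a minimal sketch of the control/worker pattern described above. This is not the actual SlowSubprocessMixin code; the class and method names are illustrative only:

```python
import logging
import multiprocessing

logger = logging.getLogger(__name__)


class SlowSubprocessSketch(object):
    """Illustrative stand-in for SlowSubprocessMixin, not the real implementation."""

    def __init__(self, worker_target):
        self._worker_target = worker_target  # the long-running function the worker executes
        self._worker = None

    def start_worker(self):
        """Roughly what the startup 'setup' call triggers: launch the worker process."""
        self._worker = multiprocessing.Process(target=self._run_worker)
        self._worker.start()

    def _run_worker(self):
        """Runs inside the worker process; on any failure, do cleanup and exit."""
        try:
            self._worker_target()
        except Exception as err:
            logger.error('worker hit a failure mode: %s', err)
        finally:
            self.do_cleanup()  # AtOnCall posts to Slack here; other services may not

    def do_cleanup(self):
        # Nothing here is guaranteed to notify the operator; the parent service
        # keeps answering pings even though the worker is gone.
        logger.warning('worker exited')
```

The point of the sketch is just that nothing outside do_cleanup is guaranteed to tell anyone the worker died.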
Things that could be done: either make the restart more automatic (a) or make the crash more obvious (b).
(a) probably easiest by modifying the subprocess mixin's basic_control_target (or adding a continuous-control variant) so that when the is_alive check fails, it restarts the worker (see the first sketch after this list)
(b) the strongest method would be to override the ping functionality so that it checks whether the worker is alive (this only works for a single level of worker); the cleanup method could also be made to spam a lot more errors (see the second sketch after this list)
- either way, we could audit the cleanup behavior of the classes that inherit from the mixin to ensure they put out sufficient exit information, and harden against those failure modes
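For (a), a sketch of a continuous control target that restarts the worker whenever the is_alive check fails; the make_worker factory and the poll interval are assumptions, not existing pieces of the mixin:

```python
import logging
import time

logger = logging.getLogger(__name__)


def continuous_control_target(make_worker, poll_interval=10.0):
    """Keep the worker running: if is_alive() fails, log loudly and restart it.

    make_worker is a hypothetical factory returning a fresh, unstarted
    multiprocessing.Process; the real basic_control_target would need an
    equivalent hook to re-create its worker.
    """
    worker = make_worker()
    worker.start()
    while True:
        time.sleep(poll_interval)
        if not worker.is_alive():
            logger.error('worker exited (exitcode %s), restarting', worker.exitcode)
            worker = make_worker()
            worker.start()
```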
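For (b), a sketch of overriding the ping path so a dead worker makes the service fail its ping; the ping method name and the _worker attribute are assumptions about where the real spimescape/pinger hook lives:

```python
class WorkerAwarePingMixin(object):
    """Make the ping fail when the worker process has died (single worker level only)."""

    _worker = None  # expected to be set to the worker multiprocessing.Process

    def ping(self):
        # If the worker is gone, fail loudly so the pinger flags the service
        # instead of reporting a useless shell as responsive.
        if self._worker is None or not self._worker.is_alive():
            raise RuntimeError('worker process is not alive; service is an empty shell')
        return 'pong'
```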