Feature/watchdog #220

renereimann · 2025-04-09T13:02:01Z

This dragonfly script is intended to be an alarm system for your setup. The script checks two different things:

- It checks that all your docker containers are running and do not have an error.
- You can define specific checks on endpoints to detect critical behaviour.

If a problem is detected, the script sends a message to slack to warn you.
The script runs in a docker container, and an example docker-compose.yaml file is included to show how to set up the docker container.
The script is configured via a yaml config file. The syntax is similar to our normal dripline config files. An example yaml config file is included in the pull request. The system works fine in the Mainz setup.
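
The container check described above can be sketched roughly as follows. This is a minimal illustration and not the code from this pull request: the function names, the container names in the usage note, and the use of the `docker ps` CLI with a `{{.Names}}\t{{.State}}` format string are all assumptions made for the example.

```python
import subprocess


def parse_container_states(ps_output):
    """Parse lines of 'name<TAB>state' into a {name: state} dict."""
    states = {}
    for line in ps_output.splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, _, state = line.partition("\t")
        states[name.strip()] = state.strip()
    return states


def find_problem_containers(states):
    """Return the sorted names of all containers that are not 'running'."""
    return sorted(name for name, state in states.items() if state != "running")


def format_warning(problem_names):
    """Build the text of a warning message for the given containers."""
    return "Watchdog: containers not running: " + ", ".join(problem_names)


def list_container_states():
    """Query the local docker daemon (assumes the docker CLI is installed)."""
    result = subprocess.run(
        ["docker", "ps", "--all", "--format", "{{.Names}}\t{{.State}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_container_states(result.stdout)
```

In a polling loop one would call `find_problem_containers(list_container_states())` and, if the result is non-empty, pass `format_warning(...)` to whatever sends the slack message. Keeping the parsing separate from the `docker` call makes the logic testable without a running daemon.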

…ong in your setup. There are two different types of checks: 1. we check that all docker containers are running and have no errors, 2. we check some endpoints and see if they fulfill some condition. If there are problems, we send a message to slack. The script is configured by a yaml file. The example yaml file is included, as well as a docker-compose file that gives an example of how to set up the script with docker compose. Testing this script worked well.

wcpettus · 2025-05-09T19:28:56Z

Working from high-level comments to lower:

- This mixes two functionalities, one of which (docker container checking) makes a lot of sense to exist outside of the normal dripline ecosystem, the other less so. Maybe unifying the watchdog functionality in one place is the right strategy, but more discussion on this point might be useful.
  - As an example in DL2 there was a sensor_monitor which could check endpoint values for anomalous readings. This was more passive in that it watched the alerts exchange (and so wasn't actively querying). But with heartbeats and the docker watchdog the active query maybe isn't necessary for endpoints?
- The dripline authentication should probably be managed by scarab and draw from the authentications file, and not require specifying again here in the config file
- Do the changes to dripline:service send impact the implementation here (@nsoblath)?
- Robert was working on a slack relay service that lives within dripline. Without touching the value of having the watchdog exist totally distinct from the rest of the mesh, down the road we might want to unify the slack connection pieces.
  - Do you have thoughts on the relative merit of webhook (used here) vs token (used by Robert) for authentication/connection?

… pressure gauge values. There was no type conversion for numbers, the method was fixed / hard coded to 'not_equal', and the error message did not come through correctly. We fixed it by adding type conversion based on the value type itself, using the method provided in the config file, and getting the error messages through.
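
The fix described above (convert the reading to a suitable type, then compare it with the method named in the config) might look roughly like this. It is a sketch under assumptions: only `not_equal` is named in this pull request, so the other method names, the function name, and the choice to convert the reading to the type of the configured reference value are hypothetical.

```python
import operator

# Map method names from the config file to comparison functions.
# Only 'not_equal' appears in the pull request; the rest are assumed names.
COMPARISONS = {
    "equal": operator.eq,
    "not_equal": operator.ne,
    "greater": operator.gt,
    "lower": operator.lt,
}


def check_endpoint_value(value, reference, method):
    """Return True if `value <method> reference` holds, i.e. the alarm fires.

    The raw reading is converted to the type of the configured reference
    (e.g. the string '1.5e-3' becomes a float when the reference is a float).
    Note that a bool reference would need special handling, since
    bool('False') is True in Python.
    """
    if method not in COMPARISONS:
        raise ValueError(f"unknown comparison method: {method!r}")
    typed_value = type(reference)(value)
    return COMPARISONS[method](typed_value, reference)
```

For example, `check_endpoint_value("1.5e-3", 1e-3, "greater")` compares the floats `0.0015` and `0.001` instead of a string with a number, which is the kind of mismatch a missing type conversion produces.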

renereimann · 2025-05-27T16:05:09Z

Thanks a lot Walter,

I agree that one can separate the endpoint checking from the docker container checking. I combined them here for my own simplicity, but I can definitely make a separate script for each of them and run them in separate containers.
I recently saw that SlowDash should also get an endpoint checking feature, so some coordination here would be nice.
About token vs webhook: a webhook is bound to a specific channel and can only post messages. The verification is part of the URL. With a token you can use the chat.postMessage() function, post to several channels, and respond to user input. So you would need a token if you have some interactive multi-channel application / bot. Regarding security, I think that since a webhook only has very limited permissions it is fine to use, but I do not have more details on that.
The changes to dripline:service send do not impact this application (as far as I can tell), since it runs fine both before and after the fix.
I guess there are smarter ways to do the authentication; however, I do not yet understand scarab's purpose and abilities (in terms of authentication and also more in general).
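
To make the webhook vs token difference concrete, here is a small sketch that builds (but does not send) the two kinds of HTTP requests with the Python standard library. The function names and the placeholder URL, token, and channel are assumptions for illustration; neither function is taken from this pull request or from Robert's relay service.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, text):
    """Incoming webhook: the secret is part of the URL, the channel is fixed."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_token_request(token, channel, text):
    """Web API chat.postMessage: a bearer token, channel chosen per message."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://slack.com/api/chat.postMessage",
        data=body,
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Either request can be sent with `urllib.request.urlopen(request, timeout=10)`. The webhook variant needs no channel argument at all, which reflects its more limited scope.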

renereimann added 10 commits April 8, 2025 10:38

We implement a watchdog system. The system inherits from the heartbea…

2a0c32b

…t service. Till now we added the ability to send messages to slack using a slack_hook. Next we will do some diagnostics on docker containers itself.

added checking of docker containers

b786c57

I decided to not make this a dripline extension but to make it a drag…

ff74880

…onfly stand alone script. That has several advantages, one of it being that we do not depend on rabbit broker, and thus the checks still work if rabbit broker went down.

had to change version of baseimage since we need a fix that was imple…

163c08f

…mented by Paul K. but not yet rolled out. We should update this once the fix is rolled out.

remove webhook, slack notices when your webhook is on github and disa…

752a36e

…bles it to prevent others posting to your workspace, this is a place holder and just for demonstration purpose

send a test message at start up, that also helps to track if things a…

5abe06f

…re running

use the script directly from dragonfly as installed in Docker image, …

fa5e330

…you do not need to bind it from external, just the config file is needed

remove commented out historic left overs

8eed009

add comment about webhooks on github

51f86e1

renereimann requested a review from nsoblath April 9, 2025 13:02

fix function call, was missing a self, now is working properly

2773ee4

renereimann added 2 commits May 23, 2025 12:48

handle errors thrown while checking endpoints, this could be if the d…

d2d826f

…evice or rabbit broker is down, we want this script to not crash at all

renereimann added 8 commits July 9, 2025 11:13

add signal handling. This results in sending messages to slack if we …

c6fe046

…receive a signal and also notifies when the alarm system stopped (which is also caused by a signal).

remove authentication by hand and use scarab authentication via envir…

1d608bd

…onment variables instead

adding a sample configuration file

7929994

adding a sample entry for docker compose

3e91a8f

moved AlarmSystem.yaml to example folder

c33534d

removing dragonfly/alert.yaml which is a duplicate of the example in …

f2e50d7

…the example folder

adding watchdog to __init__

fdfb45d

removing duplicate of the docker-compose.yaml

b85ba24

nsoblath approved these changes Jul 15, 2025

Merge branch 'develop' into feature/watchdog

8f31bda

renereimann merged commit d5d2b49 into develop Jul 17, 2025
3 checks passed

renereimann deleted the feature/watchdog branch July 17, 2025 06:10
