Add local monitoring using Telegraf and InfluxDB #302
base: main
Resource usage in

Summarized results of

Useful patch if you want to and then
They all show the same numbers, since this is not based on `du`.

Set up a PlayOS PC to run over the holidays with the following modifications. I am mostly curious about how memory usage looks over a longer period, but this will also give us "mostly real" metrics data to review later.

diff --git a/application.nix b/application.nix
index 606e824..8a83b13 100644
--- a/application.nix
+++ b/application.nix
@@ -86,6 +86,8 @@ rec {
playos.monitoring.enable = true;
playos.monitoring.extraServices = [ "dividat-driver.service" ];
+ playos.monitoring.localDbShard = "3d";
+ playos.monitoring.localRetention = "12d";

and

diff --git a/base/monitoring.nix b/base/monitoring.nix
index f99bb36..f82f261 100644
--- a/base/monitoring.nix
+++ b/base/monitoring.nix
@@ -309,7 +309,7 @@ in
};
# TODO: check if it works on a PlayOS PC
- #inputs.sensors = { };
+ inputs.sensors = { };
inputs.wireless = {

Investigating paths for lowering Telegraf's memory usage. Locally, the regular ("all plugins") telegraf binary build uses ~50-70MB, whereas the customized ("only needed plugins") build uses 10-20MB. However, there are strange differences when running the exact same binary with an identical minimal config in PlayOS VM. With a single configured
Enabling The config used:

OK, I think I figured it out, there is no local/VM difference, it's just a difference in how Locally: while I checked with TL;DR:

This drastically reduces the binary size and memory usage, from ~160-180MB to 20-30MB. A simple approach with a hard-coded plugin list instead of some mutually-recursive mess that would require a cyclic dependency for config -> telegraf-build -> config-validation.

Changed to a custom build of telegraf in b67cf4a. This reduces the binary size to 19MB, and runtime memory usage is reported at 20-30MB. I have simply hardcoded the list of plugins, which is a bit cumbersome, but the rebuild is quite fast and I checked that missing plugins are correctly detected during config validation. E.g. if I remove
knuton left a comment
Nice!
A few quick comments. We can work out your listed TODOs to see what should be next steps.
CPUWeight = 100 / 10; # 10 times smaller than the default
IOWeight = 100 / 10;
This 100 / 10 spelling is only really informative if you know that the default is 100, no? But if you know that the default is 100, then 10 would be easier to read than 100 / 10. 😄
Maybe I missed the purpose
Hmm, I think it's easier to grok 100 / 2, 100 / 5, 100 / 10, etc as relative fractions (1/2, 1/5, 1/10) rather than looking at CPUWeight = 20 and back-calculating that 20 = 100 / 5, i.e. 5 times less weight than default? Seeing CPUWeight = 20 would make me think that it has 20 times more weight.
If you know the default, then it is more obvious. If you don't know the default, well, then at least you wonder why it's written like this. That's also the reason for the comment :-)
Otherwise, for self-documenting code, define a defaultCgroupWeight = 100 and then write CPUWeight = defaultCgroupWeight / 10 etc?
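A minimal sketch of that suggestion, assuming the weights live in a systemd unit's `serviceConfig`. The unit name `telegraf` is a placeholder and not necessarily the unit this PR configures. systemd's documented default for both `CPUWeight=` and `IOWeight=` is 100.

```nix
let
  # systemd's default value for CPUWeight= and IOWeight=
  defaultCgroupWeight = 100;
in
{
  # Hypothetical unit name, used only for illustration
  systemd.services.telegraf.serviceConfig = {
    # One tenth of the default share under contention
    CPUWeight = defaultCgroupWeight / 10;
    IOWeight = defaultCgroupWeight / 10;
  };
}
```

Both operands are integers, so Nix performs integer division here. Keep the spaces around `/`, since Nix may otherwise parse the expression as a path.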
fieldinclude = [
"usage_user"
"usage_system"
"usage_active" # is this sum of above?
I am always so confused about CPU stats. I think we might consider dropping this, because it's partially covered (and better represented) by the load moving averages from inputs.system.
};

# TODO: check if it works on a PlayOS PC
#inputs.sensors = { };
Might also depend on model, I think this may use IT87 chips similar to watchdog :-)
It does work, at least with the DH470 I have.
How can I test on a DH670 (which has IT5571), just uncomment or add more config?
Yes, see the diff in the comment above: #302 (comment)
Ah, the custom telegraf build might be missing the sensors plugin now. Either modify it, or just use the standard telegraf build from nixpkgs.
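For testing on other hardware, the second option could look roughly like the sketch below. It assumes the monitoring module builds on the stock NixOS `services.telegraf` module, which this excerpt does not confirm. `lib.mkForce` is used on the assumption that the custom build is assigned to `services.telegraf.package` elsewhere at normal priority.

```nix
{ pkgs, lib, ... }:
{
  # Assumption: Telegraf is configured through the NixOS services.telegraf module.
  # Swap the custom minimal build for the full nixpkgs build, which includes
  # the sensors input plugin.
  services.telegraf.package = lib.mkForce pkgs.telegraf;

  # Same change as in the diff above: enable the sensors input
  services.telegraf.extraConfig.inputs.sensors = { };

  # The sensors plugin shells out to the `sensors` binary from lm-sensors,
  # so it has to be on the service's PATH
  systemd.services.telegraf.path = [ pkgs.lm_sensors ];
}
```

Whether the IT5571 on a DH670 reports anything useful depends on kernel driver support, which this sketch does not address.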


Prompted by the mysterious networking and memory issues we have been battling for the past month, I think it is time we start doing some proper system observability.
This PR is an initial step, which introduces:
Goals / constraints
Why InfluxDB
Why v1 and not v2 or v3
Why Telegraf
Alternatives considered and dropped
collectd+RRD, Prometheus, VictoriaMetrics, Netdata
collectd + RRD
Prometheus
One of the gold standards, but it is pull-based, mostly oriented towards cloud infrastructure, and does not really offer a good "local-only" setup.
VictoriaMetrics (for storage/query engine)
According to reviews it's super fast and offers a lot of flexibility in terms of setup, but it is also very complex and huge. If the downsampling feature were part of the open-source edition, it might be worth considering, but for now it just seems to do more than we need.
Netdata
Wanted to try it out, because it offers an "all-in-one" setup (agent, storage, query, plotting). I quite liked the agent part, but the query/plotting part is quite awkward. It seems more oriented towards 24/7 monitoring/alerting, which does not fit our use case, where we often need to do ad-hoc analysis.
Operational side
`ssh -L 8086:8086 host` and then using either `influx -database playos` to perform manual queries, or launching Chronograf/Grafana with `localhost:8086` as an InfluxDB data source (for testing I was using `docker run -it --net host chronograf:alpine --influxdb-url=http://localhost:8086`)
What is not ideal
`slowMode=true` reduces memory usage more than twice, I don't think the stress tests show a realistic profile.
https://docs.influxdata.com/influxdb/v1/administration/backup_and_restore/#time-based-backups, compress it to hell and rotate on a yearly schedule or so. Both of these approaches complicate the operational side quite a bit, since you need to either query multiple sources or restore additional data before querying.
Next steps