
Conversation

yfyf (Collaborator) commented Dec 19, 2025

Prompted by the mysterious networking and memory issues we have been battling for the past month, I think it is time we start doing some proper system observability.

This PR is an initial step: it introduces local metrics collection with Telegraf and local storage in InfluxDB v1.

Goals / constraints

  • Quick "online" analysis during remote maintenance (via SSH)
  • Convenient "offline" analysis via a DB backup+restore
  • Future path for centralized data collection
  • Minimal extra load on the system (mem, CPU, I/O)
  • Reasonable extra local disk usage (<300MB)
  • Easy to customize, extend and maintain

Why InfluxDB

  • Part of the standard TIG / TIC(K) stack
  • Flexible querying capabilities (InfluxQL / Flux)
  • Good DB maintenance tools
  • Can set up flexible retention policies
  • Integrated downsampling via continuous queries (v1) or tasks (v2, v3); a sketch follows after this list
  • Supported by nearly all collection agents
  • Good integration with Grafana and other plotting tools
  • Can use the same DB for local and central collection
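
To make the retention and downsampling points concrete, a minimal sketch of what this could look like on the playos database from this PR (the policy, measurement and field names here are illustrative, not what the PR actually configures):

# Keep raw data for 90 days (default policy), downsampled data for a year.
influx -execute 'CREATE RETENTION POLICY "raw_90d" ON "playos" DURATION 90d REPLICATION 1 DEFAULT'
influx -execute 'CREATE RETENTION POLICY "downsampled_1y" ON "playos" DURATION 52w REPLICATION 1'

# Continuous query: roll up memory usage into hourly means under the long-lived policy.
influx -execute 'CREATE CONTINUOUS QUERY "cq_mem_1h" ON "playos" BEGIN SELECT mean("used_percent") INTO "downsampled_1y"."mem_1h" FROM "mem" GROUP BY time(1h), * END'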

Why v1 and not v2 or v3

  • v1 is "just" a timeseries database. v2 integrates "tasks", plotting (UI) and alerting, which we don't need.
  • v2 and v3 are columnar stores, which I doubt we'd benefit much from, because we don't intend to collect too many fields per series (see PR)
  • Much easier to set up and maintain (v2 forces you to use buckets, authentication and all kinds of other things we do not need for local storage)
  • At this point, v1 is quite old, but still supported by InfluxData and rock solid.
  • When we do set up the centralized DB, we can use InfluxDB v2 or even v3, because it might be convenient to have the whole stack in one place. Or we can do InfluxDB v1 + Chronograf or Grafana.

Why Telegraf

  • Same stack and company as InfluxDB, so fewer chances of compatibility issues
  • Push-based
  • Can easily manage 'granularity' of collected metrics (fields, tags, etc)
  • Can do advanced filtering/aggregation (example after this list)
  • Has some useful plugins (e.g. systemd_units), that are missing elsewhere
  • Nothing very unique though; probably replaceable with other agents/collectors
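
As a small example of the filtering, here is a config that keeps only two of the many fields the CPU plugin emits, exercised with Telegraf's one-shot test mode (the file path is arbitrary; fieldinclude is the same mechanism this PR's config uses):

cat > /tmp/telegraf-filter-demo.toml <<'EOF'
[[inputs.cpu]]
  # Drop every field except these two.
  fieldinclude = ["usage_user", "usage_system"]

[[outputs.file]]
  files = ["stdout"]
EOF

# --test gathers each input once and prints the resulting metrics.
telegraf --config /tmp/telegraf-filter-demo.toml --test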

Alternatives considered and dropped

collectd+RRD, Prometheus, VictoriaMetrics, Netdata

collectd + RRD

  • Main benefit is that it is super lightweight and supports RRD as a storage backend, which offers transparent downsampling, allowing local data to be kept for very long periods (years)
  • RRD stores everything in separate files, which can lead to high disk I/O and other issues if many metrics are tracked
  • Plugin selection is limited
  • Plugin configuration is messy
  • Working with RRD both remotely and locally is very inconvenient; you would need to convert to an intermediate format to use any of the modern graphing tools
  • Dealing with series tags/metadata is very annoying
  • No longer actively developed, very dated in all senses

Prometheus

One of the gold standards, but it is pull-based, mostly oriented towards cloud infrastructure, and does not really offer a good "local-only" setup.

VictoriaMetrics (for storage/query engine)

According to reviews it's super fast and offers a lot of flexibility in terms of setup, but it is also very complex and huge. If the downsampling feature were part of the open-source edition, it might be worth considering, but for now it just seems to do more than we need.

Netdata

Wanted to try it out, because it offers an "all-in-one" setup (agent, storage, query, plotting). I quite liked the agent part, but the query/plotting part is quite awkward, maybe more oriented towards 24/7 monitoring/alerting, which does not fit our use case where we often need to do ad-hoc analysis.

Operational side

  • "Online" analysis would be as simple as ssh -L 8086:8086 host and then using either influx -database playos to perform manual queries, or launching Chronograf/Grafana with localhost:8086 as an InfluxDB data source (for testing I was using docker run -it --net host chronograf:alpine --influxdb-url=http://localhost:8086)
  • "Offline" analysis would be simply
    
    ssh host "influx backup /tmp/influxdb.backup"
    scp /tmp/influxdb.backup . 
    influx restore influxdb.backup
    
    and then proceed the same way as in the "online" case (a couple of example queries are sketched below)
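
For reference, a couple of ad-hoc queries of the kind this enables, assuming the playos database and Telegraf's standard mem and system measurements (adjust names to the final metric set):

# Hourly mean memory usage over the last day.
influx -database playos -execute 'SELECT mean("used_percent") FROM "mem" WHERE time > now() - 1d GROUP BY time(1h)'

# Peak 5-minute load average per 6h window over the last week.
influx -database playos -execute 'SELECT max("load5") FROM "system" WHERE time > now() - 7d GROUP BY time(6h)'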

What is not ideal

  • Running this introduces a risk of load spikes that can temporarily interrupt gameplay or even degrade overall performance. I have taken all kinds of precautions to avoid this (minimal configuration, lower CPU/IO weight, stress testing, etc), but there can still be surprises in an actual deployment. InfluxDB's memory does spike during stress tests, but since slowMode=true reduces memory usage by more than half, I don't think the stress tests show a realistic profile.
  • Telegraf uses a lot of memory: ~150MB with just a single input plugin, and nearly 200MB with the full setup. I suspect this is because it is a huge (250MB!) single binary with all the plugins "baked in", so even unused plugin code takes up a lot of memory. We could attempt to build a custom stripped-down version; this seems to be officially supported.
  • Limited local history (~3 months). To extend it, I considered adding downsampling via continuous queries, but decided not to, because it would introduce extra background load on the system and it is not clear whether we would need it. Another alternative would be to back up the oldest (to-be-deleted) shards using https://docs.influxdata.com/influxdb/v1/administration/backup_and_restore/#time-based-backups, compress them to hell and rotate on a yearly schedule or so (sketched below). Both of these approaches complicate the operational side quite a bit, since you need to either query multiple sources or restore additional data before querying.
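
A sketch of the time-based backup variant, assuming a weekly granularity (paths and dates are illustrative; influxd backup -portable with -start/-end is the documented v1 mechanism from the link above):

# Archive one old week before the retention policy drops it, then compress.
influxd backup -portable -db playos \
  -start 2026-01-05T00:00:00Z -end 2026-01-12T00:00:00Z \
  /var/backups/influxdb/2026-W02
tar -cJf /var/backups/influxdb/2026-W02.tar.xz -C /var/backups/influxdb 2026-W02
rm -r /var/backups/influxdb/2026-W02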

Next steps

  • Finalize the set of metrics we want to collect
  • Choose the plotting / dashboard tool (Grafana or Chronograf or InfluxDBv2/v3 or...)
  • Set up a standard PlayOS system overview dashboard for local use
  • Set up a PlayOS PC to run for a few weeks to check system resource usage and review the collected data
  • Prepare manuals for working with the tools and data

yfyf requested a review from knuton, December 19, 2025 11:16
yfyf added the label "reviewable" (Ready for initial or iterative review), Dec 19, 2025
yfyf (Collaborator, Author) commented Dec 19, 2025

Resource usage in ./build vm:

[root@playos-test:~]# systemctl status telegraf.service | cat
● telegraf.service - Telegraf Agent
     Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: ignored)
     Active: active (running) since Fri 2025-12-19 10:41:41 UTC; 33min ago
 Invocation: 6f28828c035040da97e23ef28a7e71a7
   Main PID: 1035 (telegraf)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
      Tasks: 10 (limit: 2332)
     Memory: 168.6M (max: 200M available: 31.3M peak: 169.6M)
        CPU: 5.462s
        

[root@playos-test:~]# systemctl status influxdb.service | cat
● influxdb.service - InfluxDB Server
     Loaded: loaded (/etc/systemd/system/influxdb.service; enabled; preset: ignored)
     Active: active (running) since Fri 2025-12-19 10:41:41 UTC; 35min ago
 Invocation: af7c474440564a9292a9282d51402707
    Process: 879 ExecStartPost=/nix/store/dpqjl0v7av6vgiaz0ib3k8sm1h91w4v6-unit-script-influxdb-post-start/bin/influxdb-post-start (code=exited, status=0/SUCCESS)
   Main PID: 878 (influxd)
         IP: 78.4K in, 97.1K out
         IO: 0B read, 0B written
      Tasks: 8 (limit: 2332)
     Memory: 61.2M (max: 500M available: 438.7M peak: 67.7M)
        CPU: 1.818s

yfyf (Collaborator, Author) commented Dec 19, 2025

Summarized results of nix-build --arg slowMode true testing/integration/monitoring-stress.nix:

Collecting Telegraf stats for 120 seconds...
[TestCase] Memory usage of Telegraf is reasonable...

telegraf.service memory usage:
memory_peak: 114MB
mem_current: 113MB
[TestCase] Memory usage of Telegraf is reasonable... OK!

Stopping Telegraf to force data flush.

[TestCase] Memory usage of InfluxDB is reasonable...

influxdb.service memory usage:
memory_peak: 43MB
mem_current: 42MB
[TestCase] Memory usage of InfluxDB is reasonable... OK!

[TestCase] Generate 12 weeks of data (SLOW_MODE = true)...

<..>
Total time: 885.9 seconds

[TestCase] Generate 12 weeks of data (SLOW_MODE = true)... OK!

====== STATS IMMEDIATELLY AFTER ========

influxdb.service memory usage:
memory_peak: 382MB
mem_current: 335MB

[TestCase] Disk usage after stress test is within limits...

machine: must succeed: du --bytes -s /var/lib/influxdb | cut -f1

Disk usage is: 182MB
[TestCase] Disk usage after stress test is within limits... OK!

Sleeping for 120 seconds to allow compaction...

====== STATS AFTER compaction ========

influxdb.service memory usage:
memory_peak: 395MB
mem_current: 375MB
[TestCase] Disk usage after compaction is within limits...

Disk usage is: 174MB

[TestCase] Disk usage after compaction is within limits... OK!

====== STATS AFTER restarting ========

influxdb.service memory usage:
memory_peak: 49MB
mem_current: 49MB

[TestCase] InfluxDB memory usage after restart is within limits... OK!

[TestCase] Disk usage after restart is within limits...

Disk usage is: 173MB
[TestCase] Disk usage after restart is within limits... OK!

yfyf (Collaborator, Author) commented Dec 19, 2025

Useful patch if you want to ./build vm and connect with Chronograf or something else:

diff --git a/base/monitoring.nix b/base/monitoring.nix
index 6790490..fab4bab 100644
--- a/base/monitoring.nix
+++ b/base/monitoring.nix
@@ -135,13 +135,15 @@ in

       services.influxdb.dataDir = "/var/lib/influxdb"; # use the standard dir

+      networking.firewall.enable = lib.mkForce false;
+
       services.influxdb.extraConfig = {
         reporting-disabled = true;

         http = {
           enabled = true;

-          bind-address = "localhost:8086";
+          bind-address = ":8086";
           unix-socket-enabled = true;
           bind-socket = "/var/run/influxdb/influxdb.sock";

and then ssh -L or ./result/bin/run-in-vm -q -enable-kvm -smp 4 -m 2048 -nic user,hostfwd=tcp::8086-:8086

yfyf (Collaborator, Author) commented Dec 19, 2025

Obligatory screenshot of some plots (using Chronograf here, but I think Grafana would be a better option):

[screenshot of the plots]

yfyf added 2 commits December 19, 2025 13:53
They all show the same numbers, since this is not based on `du`.
yfyf (Collaborator, Author) commented Dec 23, 2025

Set up a PlayOS PC to run over the holidays with the following modifications. Mostly curious how memory usage looks over a longer period, but this will also give us "mostly real" metrics data to review later.

diff --git a/application.nix b/application.nix
index 606e824..8a83b13 100644
--- a/application.nix
+++ b/application.nix
@@ -86,6 +86,8 @@ rec {

       playos.monitoring.enable = true;
       playos.monitoring.extraServices = [ "dividat-driver.service" ];
+      playos.monitoring.localDbShard = "3d";
+      playos.monitoring.localRetention = "12d";

and

diff --git a/base/monitoring.nix b/base/monitoring.nix
index f99bb36..f82f261 100644
--- a/base/monitoring.nix
+++ b/base/monitoring.nix
@@ -309,7 +309,7 @@ in
         };

         # TODO: check if it works on a PlayOS PC
-        #inputs.sensors = { };
+        inputs.sensors = { };

         inputs.wireless = {

yfyf (Collaborator, Author) commented Jan 5, 2026

> Set up a PlayOS PC to run over the holidays with the following modifications. Mostly curious how memory usage looks over a longer period, but this will also give us "mostly real" metrics data to review later.

Here's what it looks like after a couple of weeks:

[screenshot of the memory usage plots]

InfluxDB's memory usage cyclically grows (65MB -> 100MB) with a period of 3 days (the configured shard duration) and then drops to the base level (once the shard is "fully" persisted, I suppose).

Telegraf's usage is quite high at ~180MB, but static.

yfyf (Collaborator, Author) commented Jan 16, 2026

Investigating paths for lowering Telegraf's memory usage.

Locally, the regular ("all plugins") telegraf binary build uses ~50-70MB, whereas the customized ("only needed plugins") build uses 10-20MB.

However, there are strange differences when running the exact same binary with an identical minimal config in the PlayOS VM. With a single configured inputs.mock plugin, memory usage is ~160MB right after starting in the VM; locally it is ~40MB.

  • InfluxDB is running both locally and in the VM.
  • The inputs.mock plugin is not even reading system data, so it cannot be due to different data or difficulties reading it.
  • Tried explicitly limiting memory with GOMEMLIMIT=80MiB, no changes
  • Tried giving the VM more memory (8GB instead of 2GB), no changes.
  • Tried running the binary on the VM without a systemd service, no changes either.
  • Looked at ps RSS vs systemctl status mem usage reports, same numbers
  • Tried to check system/kernel config differences suggested by LLMs, nothing relevant.

Enabling pprof on Telegraf and running go tool pprof reports a heap of only 30MB. I have no idea where the remaining 100MB+ of RSS is coming from:

[root@playos-test:~]# ps -p $(pgrep telegraf) -o rss
  RSS
136376

[root@playos-test:~]# go tool pprof localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /root/pprof/pprof.telegraf.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
File: telegraf
Type: inuse_space
Time: Jan 16, 2026 at 11:02am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 26295.77kB, 85.09% of 30905.15kB total
Showing top 10 nodes out of 68
      flat  flat%   sum%        cum   cum%
   11776kB 38.10% 38.10%    11776kB 38.10%  github.com/goccy/go-json/internal/decoder.init.0
 5888.06kB 19.05% 57.16%  5888.06kB 19.05%  github.com/goccy/go-json/internal/encoder.init.0
    3598kB 11.64% 68.80%     3598kB 11.64%  github.com/aws/aws-sdk-go/aws/endpoints.init
 1563.16kB  5.06% 73.86%  2084.21kB  6.74%  github.com/seancfoley/ipaddress-go/ipaddr.newIPv6SegmentPrefixedVal
  858.34kB  2.78% 76.63%   858.34kB  2.78%  github.com/vmware/govmomi/vim25/types.init.6658
  544.67kB  1.76% 78.40%   544.67kB  1.76%  regexp/syntax.(*compiler).inst
  521.05kB  1.69% 80.08%   521.05kB  1.69%  github.com/seancfoley/ipaddress-go/ipaddr.newIPv6SegmentVal
  521.05kB  1.69% 81.77%   521.05kB  1.69%  google.golang.org/protobuf/internal/filedesc.(*File).initDecls
  512.75kB  1.66% 83.43%   512.75kB  1.66%  encoding/pem.Decode
  512.69kB  1.66% 85.09%  1024.73kB  3.32%  crypto/x509.parseCertificate

The config used:

λ cat /nix/store/pcqcdb55jk1pk80aycb8a586p2m95x5h-config.toml
[agent]
always_include_global_tags = true
collection_jitter = "12s"
hostname = "playos-${MACHINE_ID}"
interval = "60s"
metric_batch_size = 50
metric_buffer_limit = 100
precision = "30s"
quiet = true

[global_tags]
playos_version = "2025.3.3"

[inputs.mock]
metric_name = "test_random"

[outputs.influxdb]
content_encoding = "identity"
database = "playos"
skip_database_creation = true
urls = ["http://127.0.0.1:8086"]

yfyf (Collaborator, Author) commented Jan 16, 2026

OK, I think I figured it out: there is no local/VM difference, it's just a difference in how systemctl status and ps report memory usage.

Locally:

[~/src]λ systemctl status --user telegraf-nix.scope
● telegraf-nix.scope - [systemd-run] /nix/store/v1dqmrawpjmvcckzjzc4mzwsh2dj5s3x-telegraf-1.32.2/bin/telegraf --config /nix/st>
     Loaded: loaded (/run/user/1000/systemd/transient/telegraf-nix.scope; transient)
  Transient: yes
     Active: active (running) since Fri 2026-01-16 13:44:31 EET; 20s ago
 Invocation: 51037d17c25f40aa978138ce5a3bf64a
      Tasks: 11 (limit: 18572)
     Memory: 36.3M (peak: 36.8M)

while ps shows:

]λ ps -p $(pidof telegraf) -o rss
  RSS
145540

I checked with pmap, and most of this memory is pre-allocated read-only memory, confirming that this is due to the large binary with baked-in plugins.
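
For reference, the pmap check can look like this (a sketch, assuming procps pmap -x output where column 3 is RSS and column 5 is the permission mode):

# Sum resident kilobytes of read-only mappings for the telegraf process.
pmap -x $(pidof telegraf) | awk '$5 ~ /^r--/ { ro += $3 } END { print ro " kB resident read-only" }'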

TL;DR:

  • large memory usage is, as predicted, due to the baked in plugins that pre-allocate static memory
  • heap usage is relatively tiny
  • we can push down the total allocated memory from 180MB to 20-30MB with a custom build, which is not hard to set up
  • Linux memory reporting is complicated, as usual

yfyf added 2 commits January 19, 2026 12:47
This drastically reduces the binary size and memory usage, from
~160-180MB to 20-30MB.

A simple approach with a hard-coded plugin list instead of some
mutually-recursive mess that would require a cyclic dependency for
config -> telegraf-build -> config-validation.
yfyf (Collaborator, Author) commented Jan 19, 2026

Changed to a custom build of telegraf in b67cf4a.

This reduces the binary size to 19MB and runtime memory usage is reported at 20-30MB.

I have simply hardcoded the list of plugins, which is a bit cumbersome, but the re-build is quite fast and I checked that missing plugins are correctly detected during config validation.

E.g. if I remove inputs.net from the supported features, ./build vm fails like so:

error: Cannot build '/nix/store/p47srdlbqymkw60kc9cymxs70wh7983m-validate-config.drv'.
       Reason: builder failed with exit code 1.
       Output paths:
         /nix/store/2bvra83wm92j0j3dfygbi8rndk3gcr7p-validate-config
       Last 25 log lines:
       > scope = "system"
       > taginclude = ["name"]
       > unittype = "service"
       >
       > [inputs.wireless]
       > fieldinclude = ["status", "level", "noise", "retry", "misc", "missed_beacon"]
       >
       > [outputs.influxdb]
       > content_encoding = "identity"
       > database = "playos"
       > skip_database_creation = true
       > urls = ["unix:///var/run/influxdb/influxdb.sock"]
       >
       > [[processors.strings]]
       > [[processors.strings.left]]
       > tag = "process_name"
       > width = 64
       > === Telegraf output:
       > 2026-01-19T10:43:09Z I! Loading config: /nix/store/c7p4nximy2c41zv51v5gwwpgnfqd9nq0-config.toml
       > 2026-01-19T10:43:09Z E! loading config file /nix/store/c7p4nximy2c41zv51v5gwwpgnfqd9nq0-config.toml failed: error parsing net, undefined but requested input: net
       > Hint: PlayOS uses a custom build of telegraf, so if you get
       > an error like 'undefined but requested input', this can mean
       > two things:
       >   1. Typo / wrong name of plugin
       >   2. Plugin is not included in custom build, check pkgs/telegraf.nix

knuton (Member) left a comment

Nice!

A few quick comments. We can work out your listed TODOs to see what should be next steps.

Comment on lines +72 to +73
CPUWeight = 100 / 10; # 10 times smaller than the default
IOWeight = 100 / 10;
knuton (Member):

This 100 / 10 spelling is only really informative if you know that the default is 100, no? But if you know that the default is 100, then 10 would be easier to read than 100 / 10. 😄

Maybe I missed the purpose.

yfyf (Collaborator, Author):

Hmm, I think it's easier to grok 100 / 2, 100 / 5, 100 / 10, etc. as relative fractions (1/2, 1/5, 1/10) than to look at CPUWeight = 20 and back-calculate that 20 = 100 / 5, i.e. 5 times less weight than the default. Seeing CPUWeight = 20 would make me think that it has 20 times more weight.

If you know the default, then it is more obvious. If you don't know the default, well, then at least you wonder why it's written like this. That's also the reason for the comment :-)

knuton (Member):

Otherwise, for self-documenting code, define defaultCgroupWeight = 100 and then write CPUWeight = defaultCgroupWeight / 10, etc.?

fieldinclude = [
"usage_user"
"usage_system"
"usage_active" # is this sum of above?
knuton (Member):

No, this is the total CPU time minus idle time.

Above are percentages of total time.

Source

yfyf (Collaborator, Author):

I am always so confused about CPU stats. I think we might consider dropping this, because it's partially covered and better represented by the load moving averages in inputs.system.

};

# TODO: check if it works on a PlayOS PC
#inputs.sensors = { };
knuton (Member):

Might also depend on the model; I think this may use IT87 chips, similar to the watchdog :-)

yfyf (Collaborator, Author):

It does work, at least with the DH470 I have.

knuton (Member) commented Jan 26, 2026:

How can I test this on a DH670 (which has an IT5571)? Just uncomment it, or add more config?

yfyf (Collaborator, Author):

Yes, see the diff in the comment above: #302 (comment)

yfyf (Collaborator, Author):

Ah, the custom telegraf build might be missing the sensors plugin now. Either modify it, or just use the standard telegraf build from nixpkgs.
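
For a quick check on such a PC with the stock nixpkgs telegraf (a sketch; the --input-filter/--output-filter config generation and --test are standard telegraf CLI features):

nix-shell -p telegraf
# Generate a minimal config with only the sensors input, then run it once.
telegraf --input-filter sensors --output-filter file config > /tmp/sensors.toml
telegraf --config /tmp/sensors.toml --test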

