
Conversation

yfyf (Collaborator) commented Dec 19, 2025

Prompted by the mysterious networking and memory issues we have been battling for the past month, I think it is time we start doing some proper system observability.

This PR is an initial step: it introduces local metrics collection with Telegraf and local storage in InfluxDB v1.

Goals / constraints

  • Quick "online" analysis during remote maintenance (via SSH)
  • Convenient "offline" analysis via a DB backup+restore
  • Future path for centralized data collection
  • Minimal extra load on the system (mem, CPU, I/O)
  • Reasonable extra local disk usage (<300MB)
  • Easy to customize, extend and maintain

Why InfluxDB

  • Part of the standard TIG / TIC(K) stack
  • Flexible querying capabilities (InfluxQL / Flux)
  • Good DB maintenance tools
  • Can set up flexible retention policies
  • Integrated downsampling via continuous queries (v1) or tasks (v2, v3); a sketch follows after this list
  • Supported by nearly all collection agents
  • Good integration with Grafana and other plotting tools
  • Can use the same DB for local and central collection
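
To make the retention and downsampling points concrete, a minimal sketch of what this could look like on the playos database from this PR (the policy, measurement and field names here are illustrative, not what the PR actually configures):

# Keep raw data for 90 days (default policy), downsampled data for a year.
influx -execute 'CREATE RETENTION POLICY "raw_90d" ON "playos" DURATION 90d REPLICATION 1 DEFAULT'
influx -execute 'CREATE RETENTION POLICY "downsampled_1y" ON "playos" DURATION 52w REPLICATION 1'

# Continuous query: roll up memory usage into hourly means under the long-lived policy.
influx -execute 'CREATE CONTINUOUS QUERY "cq_mem_1h" ON "playos" BEGIN SELECT mean("used_percent") INTO "downsampled_1y"."mem_1h" FROM "mem" GROUP BY time(1h), * END'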

Why v1 and not v2 or v3

  • v1 is "just" a timeseries database. v2 integrates "tasks", plotting (UI) and alerting, which we don't need.
  • v2 and v3 are columnar stores, which I doubt we'd benefit much from, because we don't intend to collect too many fields per series (see PR)
  • Much easier to set up and maintain (v2 forces you to use buckets, authentication and all kinds of other things we do not need for local storage)
  • At this point, v1 is quite old, but still supported by InfluxData and rock solid.
  • When we do set up the centralized DB, we can use InfluxDB v2 or even v3, because it might be convenient to have the whole stack in one place. Or we can do InfluxDB v1 + Chronograf or Grafana.

Why Telegraf

  • Same stack and company as InfluxDB, so fewer chances of compatibility issues
  • Push-based
  • Can easily manage 'granularity' of collected metrics (fields, tags, etc)
  • Can do advanced filtering/aggregation (example after this list)
  • Has some useful plugins (e.g. systemd_units), that are missing elsewhere
  • Nothing very unique though; probably replaceable with other agents/collectors
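
As a small example of the filtering, here is a config that keeps only two of the many fields the CPU plugin emits, exercised with Telegraf's one-shot test mode (the file path is arbitrary; fieldinclude is the same mechanism this PR's config uses):

cat > /tmp/telegraf-filter-demo.toml <<'EOF'
[[inputs.cpu]]
  # Drop every field except these two.
  fieldinclude = ["usage_user", "usage_system"]

[[outputs.file]]
  files = ["stdout"]
EOF

# --test gathers each input once and prints the resulting metrics.
telegraf --config /tmp/telegraf-filter-demo.toml --test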

Alternatives considered and dropped

collectd+RRD, Prometheus, VictoriaMetrics, Netdata

collectd + RRD

  • Main benefit is that it is super lightweight and supports RRD as a storage backend, which offers transparent downsampling, allowing local data to be kept for very long periods (years)
  • RRD stores everything in separate files, which can lead to high disk I/O and other issues if many metrics are tracked
  • Plugin selection is limited
  • Plugin configuration is messy
  • Working with RRD both remotely and locally is very inconvenient; you would need to convert to an intermediate format to use any of the modern graphing tools
  • Dealing with series tags/metadata is very annoying
  • No longer actively developed, very dated in all senses

Prometheus

One of the gold standards, but it is pull-based, mostly oriented towards cloud infrastructure, and does not really offer a good "local-only" setup.

VictoriaMetrics (for storage/query engine)

According to reviews it's super fast and offers a lot of flexibility in terms of setup, but it is also very complex and huge. If the downsampling feature were part of the open-source edition, it might be worth considering, but for now it just seems to do more than we need.

Netdata

Wanted to try it out, because it offers an "all-in-one" setup (agent, storage, query, plotting). I quite liked the agent part, but the query/plotting part is quite awkward, maybe more oriented towards 24/7 monitoring/alerting, which does not fit our use case where we often need to do ad-hoc analysis.

Operational side

  • "Online" analysis would be as simple as ssh -L 8086:8086 host and then using either influx -database playos to perform manual queries, or launching Chronograf/Grafana with localhost:8086 as an InfluxDB data source (for testing I was using docker run -it --net host chronograf:alpine --influxdb-url=http://localhost:8086)
  • "Offline" analysis would be simply
    
    ssh host "influx backup /tmp/influxdb.backup"
    scp /tmp/influxdb.backup . 
    influx restore influxdb.backup
    
    and then proceed the same way as in the "online" case (a couple of example queries are sketched below)
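
For reference, a couple of ad-hoc queries of the kind this enables, assuming the playos database and Telegraf's standard mem and system measurements (adjust names to the final metric set):

# Hourly mean memory usage over the last day.
influx -database playos -execute 'SELECT mean("used_percent") FROM "mem" WHERE time > now() - 1d GROUP BY time(1h)'

# Peak 5-minute load average per 6h window over the last week.
influx -database playos -execute 'SELECT max("load5") FROM "system" WHERE time > now() - 7d GROUP BY time(6h)'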

What is not ideal

  • Running this introduces a risk of load spikes that can temporarily interrupt gameplay or even degrade overall performance. I have taken all kinds of precautions to avoid this (minimal configuration, lower CPU/IO weight, stress testing, etc), but there can still be surprises in an actual deployment. InfluxDB's memory does spike during stress tests, but since slowMode=true reduces memory usage by more than half, I don't think the stress tests show a realistic profile.
  • Telegraf uses a lot of memory: ~150MB with just a single input plugin, and nearly 200MB with the full setup. I suspect this is because it is a huge (250MB!) single binary with all the plugins "baked in", so even unused plugin code takes up a lot of memory. We could attempt to build a custom stripped-down version; this seems to be officially supported.
  • Limited local history (~3 months). To extend it, I considered adding downsampling via continuous queries, but decided not to, because it would introduce extra background load on the system and it is not clear whether we would need it. Another alternative would be to back up the oldest (to-be-deleted) shards using https://docs.influxdata.com/influxdb/v1/administration/backup_and_restore/#time-based-backups, compress them to hell and rotate on a yearly schedule or so (sketched below). Both of these approaches complicate the operational side quite a bit, since you need to either query multiple sources or restore additional data before querying.
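
A sketch of the time-based backup variant, assuming a weekly granularity (paths and dates are illustrative; influxd backup -portable with -start/-end is the documented v1 mechanism from the link above):

# Archive one old week before the retention policy drops it, then compress.
influxd backup -portable -db playos \
  -start 2026-01-05T00:00:00Z -end 2026-01-12T00:00:00Z \
  /var/backups/influxdb/2026-W02
tar -cJf /var/backups/influxdb/2026-W02.tar.xz -C /var/backups/influxdb 2026-W02
rm -r /var/backups/influxdb/2026-W02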

Next steps

  • Finalize the set of metrics we want to collect
  • Choose the plotting / dashboard tool (Grafana or Chronograf or InfluxDBv2/v3 or...)
  • Set up a standard PlayOS system overview dashboard for local use
  • Set up a PlayOS PC to run for a few weeks to check system resource usage and review the collected data
  • Prepare manuals for working with the tools and data

yfyf requested a review from knuton, December 19, 2025 11:16
yfyf added the label "reviewable" (Ready for initial or iterative review), Dec 19, 2025
yfyf (Collaborator, Author) commented Dec 19, 2025

Resource usage in ./build vm:

[root@playos-test:~]# systemctl status telegraf.service | cat
● telegraf.service - Telegraf Agent
     Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: ignored)
     Active: active (running) since Fri 2025-12-19 10:41:41 UTC; 33min ago
 Invocation: 6f28828c035040da97e23ef28a7e71a7
   Main PID: 1035 (telegraf)
         IP: 0B in, 0B out
         IO: 0B read, 0B written
      Tasks: 10 (limit: 2332)
     Memory: 168.6M (max: 200M available: 31.3M peak: 169.6M)
        CPU: 5.462s
        

[root@playos-test:~]# systemctl status influxdb.service | cat
● influxdb.service - InfluxDB Server
     Loaded: loaded (/etc/systemd/system/influxdb.service; enabled; preset: ignored)
     Active: active (running) since Fri 2025-12-19 10:41:41 UTC; 35min ago
 Invocation: af7c474440564a9292a9282d51402707
    Process: 879 ExecStartPost=/nix/store/dpqjl0v7av6vgiaz0ib3k8sm1h91w4v6-unit-script-influxdb-post-start/bin/influxdb-post-start (code=exited, status=0/SUCCESS)
   Main PID: 878 (influxd)
         IP: 78.4K in, 97.1K out
         IO: 0B read, 0B written
      Tasks: 8 (limit: 2332)
     Memory: 61.2M (max: 500M available: 438.7M peak: 67.7M)
        CPU: 1.818s

yfyf (Collaborator, Author) commented Dec 19, 2025

Summarized results of nix-build --arg slowMode true testing/integration/monitoring-stress.nix:

Collecting Telegraf stats for 120 seconds...
[TestCase] Memory usage of Telegraf is reasonable...

telegraf.service memory usage:
memory_peak: 114MB
mem_current: 113MB
[TestCase] Memory usage of Telegraf is reasonable... OK!

Stopping Telegraf to force data flush.

[TestCase] Memory usage of InfluxDB is reasonable...

influxdb.service memory usage:
memory_peak: 43MB
mem_current: 42MB
[TestCase] Memory usage of InfluxDB is reasonable... OK!

[TestCase] Generate 12 weeks of data (SLOW_MODE = true)...

<..>
Total time: 885.9 seconds

[TestCase] Generate 12 weeks of data (SLOW_MODE = true)... OK!

====== STATS IMMEDIATELLY AFTER ========

influxdb.service memory usage:
memory_peak: 382MB
mem_current: 335MB

[TestCase] Disk usage after stress test is within limits...

machine: must succeed: du --bytes -s /var/lib/influxdb | cut -f1

Disk usage is: 182MB
[TestCase] Disk usage after stress test is within limits... OK!

Sleeping for 120 seconds to allow compaction...

====== STATS AFTER compaction ========

influxdb.service memory usage:
memory_peak: 395MB
mem_current: 375MB
[TestCase] Disk usage after compaction is within limits...

Disk usage is: 174MB

[TestCase] Disk usage after compaction is within limits... OK!

====== STATS AFTER restarting ========

influxdb.service memory usage:
memory_peak: 49MB
mem_current: 49MB

[TestCase] InfluxDB memory usage after restart is within limits... OK!

[TestCase] Disk usage after restart is within limits...

Disk usage is: 173MB
[TestCase] Disk usage after restart is within limits... OK!

yfyf (Collaborator, Author) commented Dec 19, 2025

Useful patch if you want to ./build vm and connect with Chronograf or something else:

diff --git a/base/monitoring.nix b/base/monitoring.nix
index 6790490..fab4bab 100644
--- a/base/monitoring.nix
+++ b/base/monitoring.nix
@@ -135,13 +135,15 @@ in

       services.influxdb.dataDir = "/var/lib/influxdb"; # use the standard dir

+      networking.firewall.enable = lib.mkForce false;
+
       services.influxdb.extraConfig = {
         reporting-disabled = true;

         http = {
           enabled = true;

-          bind-address = "localhost:8086";
+          bind-address = ":8086";
           unix-socket-enabled = true;
           bind-socket = "/var/run/influxdb/influxdb.sock";

and then ssh -L or ./result/bin/run-in-vm -q -enable-kvm -smp 4 -m 2048 -nic user,hostfwd=tcp::8086-:8086

yfyf (Collaborator, Author) commented Dec 19, 2025

Obligatory screenshot of some plots (using Chronograf here, but I think Grafana would be a better option):

[screenshot of the plots]

yfyf added 2 commits December 19, 2025 13:53
They all show the same numbers, since this is not based on `du`.
yfyf (Collaborator, Author) commented Dec 23, 2025

Set up a PlayOS PC to run over the holidays with the following modifications. Mostly curious how memory usage looks over a longer period, but this will also give us "mostly real" metrics data to review later.

diff --git a/application.nix b/application.nix
index 606e824..8a83b13 100644
--- a/application.nix
+++ b/application.nix
@@ -86,6 +86,8 @@ rec {

       playos.monitoring.enable = true;
       playos.monitoring.extraServices = [ "dividat-driver.service" ];
+      playos.monitoring.localDbShard = "3d";
+      playos.monitoring.localRetention = "12d";

and

diff --git a/base/monitoring.nix b/base/monitoring.nix
index f99bb36..f82f261 100644
--- a/base/monitoring.nix
+++ b/base/monitoring.nix
@@ -309,7 +309,7 @@ in
         };

         # TODO: check if it works on a PlayOS PC
-        #inputs.sensors = { };
+        inputs.sensors = { };

         inputs.wireless = {

yfyf (Collaborator, Author) commented Jan 5, 2026

> Set up a PlayOS PC to run over the holidays with the following modifications. Mostly curious how memory usage looks over a longer period, but this will also give us "mostly real" metrics data to review later.

Here's what it looks like after a couple of weeks:

[screenshot of the memory usage plots]

InfluxDB's memory usage cyclically grows (65MB -> 100MB) with a period of 3 days (the configured shard duration) and then drops to the base level (once the shard is "fully" persisted, I suppose).

Telegraf's usage is quite high at ~180MB, but static.

yfyf (Collaborator, Author) commented Jan 16, 2026

Investigating paths for lowering Telegraf's memory usage.

Locally, the regular ("all plugins") telegraf binary build uses ~50-70MB, whereas the customized ("only needed plugins") build uses 10-20MB.

However, there are strange differences when running the exact same binary with an identical minimal config in the PlayOS VM. With a single configured inputs.mock plugin, memory usage is ~160MB right after starting in the VM; locally it is ~40MB.

  • InfluxDB is running both locally and in the VM.
  • The inputs.mock plugin is not even reading system data, so it cannot be due to different data or difficulties reading it.
  • Tried explicitly limiting memory with GOMEMLIMIT=80MiB, no changes
  • Tried giving the VM more memory (8GB instead of 2GB), no changes.
  • Tried running the binary on the VM without a systemd service, no changes either.
  • Looked at ps RSS vs systemctl status mem usage reports, same numbers
  • Tried to check system/kernel config differences suggested by LLMs, nothing relevant.

Enabling pprof on Telegraf and running go tool pprof reports a heap of only 30MB. I have no idea where the remaining 100MB+ of RSS is coming from:

[root@playos-test:~]# ps -p $(pgrep telegraf) -o rss
  RSS
136376

[root@playos-test:~]# go tool pprof localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /root/pprof/pprof.telegraf.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
File: telegraf
Type: inuse_space
Time: Jan 16, 2026 at 11:02am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 26295.77kB, 85.09% of 30905.15kB total
Showing top 10 nodes out of 68
      flat  flat%   sum%        cum   cum%
   11776kB 38.10% 38.10%    11776kB 38.10%  github.com/goccy/go-json/internal/decoder.init.0
 5888.06kB 19.05% 57.16%  5888.06kB 19.05%  github.com/goccy/go-json/internal/encoder.init.0
    3598kB 11.64% 68.80%     3598kB 11.64%  github.com/aws/aws-sdk-go/aws/endpoints.init
 1563.16kB  5.06% 73.86%  2084.21kB  6.74%  github.com/seancfoley/ipaddress-go/ipaddr.newIPv6SegmentPrefixedVal
  858.34kB  2.78% 76.63%   858.34kB  2.78%  github.com/vmware/govmomi/vim25/types.init.6658
  544.67kB  1.76% 78.40%   544.67kB  1.76%  regexp/syntax.(*compiler).inst
  521.05kB  1.69% 80.08%   521.05kB  1.69%  github.com/seancfoley/ipaddress-go/ipaddr.newIPv6SegmentVal
  521.05kB  1.69% 81.77%   521.05kB  1.69%  google.golang.org/protobuf/internal/filedesc.(*File).initDecls
  512.75kB  1.66% 83.43%   512.75kB  1.66%  encoding/pem.Decode
  512.69kB  1.66% 85.09%  1024.73kB  3.32%  crypto/x509.parseCertificate

The config used:

λ cat /nix/store/pcqcdb55jk1pk80aycb8a586p2m95x5h-config.toml
[agent]
always_include_global_tags = true
collection_jitter = "12s"
hostname = "playos-${MACHINE_ID}"
interval = "60s"
metric_batch_size = 50
metric_buffer_limit = 100
precision = "30s"
quiet = true

[global_tags]
playos_version = "2025.3.3"

[inputs.mock]
metric_name = "test_random"

[outputs.influxdb]
content_encoding = "identity"
database = "playos"
skip_database_creation = true
urls = ["http://127.0.0.1:8086"]

yfyf (Collaborator, Author) commented Jan 16, 2026

OK, I think I figured it out: there is no local/VM difference, it's just a difference in how systemctl status and ps report memory usage.

Locally:

[~/src]λ systemctl status --user telegraf-nix.scope
● telegraf-nix.scope - [systemd-run] /nix/store/v1dqmrawpjmvcckzjzc4mzwsh2dj5s3x-telegraf-1.32.2/bin/telegraf --config /nix/st>
     Loaded: loaded (/run/user/1000/systemd/transient/telegraf-nix.scope; transient)
  Transient: yes
     Active: active (running) since Fri 2026-01-16 13:44:31 EET; 20s ago
 Invocation: 51037d17c25f40aa978138ce5a3bf64a
      Tasks: 11 (limit: 18572)
     Memory: 36.3M (peak: 36.8M)

while ps shows:

]λ ps -p $(pidof telegraf) -o rss
  RSS
145540

I checked with pmap, and most of this memory is pre-allocated read-only memory, confirming that this is due to the large binary with baked-in plugins.
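
For reference, the pmap check can look like this (a sketch, assuming procps pmap -x output where column 3 is RSS and column 5 is the permission mode):

# Sum resident kilobytes of read-only mappings for the telegraf process.
pmap -x $(pidof telegraf) | awk '$5 ~ /^r--/ { ro += $3 } END { print ro " kB resident read-only" }'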

TL;DR:

  • large memory usage is, as predicted, due to the baked in plugins that pre-allocate static memory
  • heap usage is relatively tiny
  • we can push down the total allocated memory from 180MB to 20-30MB with a custom build, which is not hard to set up
  • Linux memory reporting is complicated, as usual

yfyf added 2 commits January 19, 2026 12:47
This drastically reduces the binary size and memory usage, from
~160-180MB to 20-30MB.

A simple approach with a hard-coded plugin list instead of some
mutually-recursive mess that would require a cyclic dependency for
config -> telegraf-build -> config-validation.
yfyf (Collaborator, Author) commented Jan 19, 2026

Changed to a custom build of telegraf in b67cf4a.

This reduces the binary size to 19MB and runtime memory usage is reported at 20-30MB.

I have simply hardcoded the list of plugins, which is a bit cumbersome, but the re-build is quite fast and I checked that missing plugins are correctly detected during config validation.

E.g. if I remove inputs.net from the supported features, ./build vm fails like so:

error: Cannot build '/nix/store/p47srdlbqymkw60kc9cymxs70wh7983m-validate-config.drv'.
       Reason: builder failed with exit code 1.
       Output paths:
         /nix/store/2bvra83wm92j0j3dfygbi8rndk3gcr7p-validate-config
       Last 25 log lines:
       > scope = "system"
       > taginclude = ["name"]
       > unittype = "service"
       >
       > [inputs.wireless]
       > fieldinclude = ["status", "level", "noise", "retry", "misc", "missed_beacon"]
       >
       > [outputs.influxdb]
       > content_encoding = "identity"
       > database = "playos"
       > skip_database_creation = true
       > urls = ["unix:///var/run/influxdb/influxdb.sock"]
       >
       > [[processors.strings]]
       > [[processors.strings.left]]
       > tag = "process_name"
       > width = 64
       > === Telegraf output:
       > 2026-01-19T10:43:09Z I! Loading config: /nix/store/c7p4nximy2c41zv51v5gwwpgnfqd9nq0-config.toml
       > 2026-01-19T10:43:09Z E! loading config file /nix/store/c7p4nximy2c41zv51v5gwwpgnfqd9nq0-config.toml failed: error parsing net, undefined but requested input: net
       > Hint: PlayOS uses a custom build of telegraf, so if you get
       > an error like 'undefined but requested input', this can mean
       > two things:
       >   1. Typo / wrong name of plugin
       >   2. Plugin is not included in custom build, check pkgs/telegraf.nix

knuton (Member) left a comment

Nice!

A few quick comments. We can work out your listed TODOs to see what should be next steps.

Comment on lines +72 to +73
CPUWeight = 100 / 10; # 10 times smaller than the default
IOWeight = 100 / 10;
knuton (Member):

This 100 / 10 spelling is only really informative if you know that the default is 100, no? But if you know that the default is 100, then 10 would be easier to read than 100 / 10. 😄

Maybe I missed the purpose.

yfyf (Collaborator, Author):

Hmm, I think it's easier to grok 100 / 2, 100 / 5, 100 / 10, etc. as relative fractions (1/2, 1/5, 1/10) than to look at CPUWeight = 20 and back-calculate that 20 = 100 / 5, i.e. 5 times less weight than the default. Seeing CPUWeight = 20 would make me think that it has 20 times more weight.

If you know the default, then it is more obvious. If you don't know the default, well, then at least you wonder why it's written like this. That's also the reason for the comment :-)

knuton (Member):

Otherwise, for self-documenting code, define defaultCgroupWeight = 100 and then write CPUWeight = defaultCgroupWeight / 10, etc.?

fieldinclude = [
"usage_user"
"usage_system"
"usage_active" # is this sum of above?
knuton (Member):

No, this is the total CPU time minus idle time.

Above are percentages of total time.

Source

yfyf (Collaborator, Author):

I am always so confused about CPU stats. I think we might consider dropping this, because it's partially covered and better represented by the load moving averages in inputs.system.

};

# TODO: check if it works on a PlayOS PC
#inputs.sensors = { };
knuton (Member):

Might also depend on the model; I think this may use IT87 chips, similar to the watchdog :-)

yfyf (Collaborator, Author):

It does work, at least with the DH470 I have.

knuton (Member) commented Jan 26, 2026:

How can I test this on a DH670 (which has an IT5571)? Just uncomment it, or add more config?

yfyf (Collaborator, Author):

Yes, see the diff in the comment above: #302 (comment)

yfyf (Collaborator, Author):

Ah, the custom telegraf build might be missing the sensors plugin now. Either modify it, or just use the standard telegraf build from nixpkgs.
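
For a quick check on such a PC with the stock nixpkgs telegraf (a sketch; the --input-filter/--output-filter config generation and --test are standard telegraf CLI features):

nix-shell -p telegraf
# Generate a minimal config with only the sensors input, then run it once.
telegraf --input-filter sensors --output-filter file config > /tmp/sensors.toml
telegraf --config /tmp/sensors.toml --test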

