Description
I have a group of 16 servers that I'm monitoring using ansible-pcp. I've added a few PMDAs and left the other settings (sampling interval and retention period) at their defaults. My metrics collection host was overwhelmed trying to keep up with the reported metrics, and my theory is that the Redis save interval was responsible for the high I/O rates the host reported.
The metrics collection host is now a 4 vCPU, 16 GB RAM VM with an 80 GB disk. (That seems sufficient based on the sizing guidance in https://pcp.readthedocs.io/en/latest/HowTos/scaling/index.html.) On a previous iteration set up using this collection (on Fedora 35), the rdb file in /var/lib/redis was being written almost continuously and kept growing. Redis would eventually get killed by systemd-oomd once it grew past a certain point, which depended on the amount of RAM I configured the VM with.
Below is a sample graph from the second setup I tried, this time running on CentOS 8-stream (and Redis 5). It uses the same default save settings and shows steadily increasing disk I/O:
This represents the first 4 hours of reporting from a newly set up collector.
I'm currently experimenting with significantly dialing down the save interval (save 3600 1 only), but I'm not sure how well that will fit with the way pmproxy and pmlogger work.
I'm willing to submit a PR with some changes and take input on them while working through this, including hearing that I'm off base. (I am new to the PCP toolset and to Redis.) Thanks for your time and attention!
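
To make the reasoning about save intervals concrete, here is a rough Python sketch of how Redis "save <seconds> <changes>" rules decide when an RDB snapshot is due. It is a simplified model, not Redis code. The rule list labeled as the Redis 5 default is the one from the stock redis.conf (a distro package may ship something different), and the write rates are made-up numbers, not measurements from my collector.

```python
from typing import List, Optional, Tuple

# A save point is (seconds, changes): snapshot once at least `changes`
# writes have happened and at least `seconds` have passed since the last save.
SavePoint = Tuple[int, int]

# Save points from the stock Redis 5 redis.conf (assumption: the distro
# package did not change them).
REDIS5_DEFAULT_SAVE: List[SavePoint] = [(900, 1), (300, 10), (60, 10000)]

# The single rule I am experimenting with.
PROPOSED_SAVE: List[SavePoint] = [(3600, 1)]


def seconds_until_snapshot(rules: List[SavePoint],
                           writes_per_second: float) -> Optional[float]:
    """Estimate how long after the previous save the next snapshot starts.

    Simplified model: assumes a constant write rate and ignores the
    granularity of Redis's periodic check. Returns None if nothing is
    ever written, since no rule can fire in that case.
    """
    if writes_per_second <= 0:
        return None
    candidates = []
    for seconds, changes in rules:
        time_to_reach_changes = changes / writes_per_second
        # A rule fires only when both of its conditions hold.
        candidates.append(max(seconds, time_to_reach_changes))
    return min(candidates)


if __name__ == "__main__":
    # Hypothetical write rates, for illustration only.
    for rate in (1, 1000):
        default = seconds_until_snapshot(REDIS5_DEFAULT_SAVE, rate)
        proposed = seconds_until_snapshot(PROPOSED_SAVE, rate)
        print(f"{rate} writes/s: default rules -> {default:.0f}s, "
              f"proposed rule -> {proposed:.0f}s")
```

Under this model, any sustained rate above roughly 167 writes per second makes the default rules snapshot about once a minute, and each snapshot rewrites the whole (growing) dataset, which would be consistent with the disk I/O I'm seeing. With only save 3600 1, the snapshot would happen about once an hour regardless of write rate.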
