From 7e3b01d7e952ec564de8a695d50b2d9e75118e93 Mon Sep 17 00:00:00 2001 From: Maria Romano Date: Sat, 7 Feb 2026 00:01:56 +0000 Subject: [PATCH 1/8] starting changes for psi GA docs, still WIP --- .../en/docs/concepts/cluster-administration/system-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/en/docs/concepts/cluster-administration/system-metrics.md b/content/en/docs/concepts/cluster-administration/system-metrics.md index aee94e415c888..3eeca9a0a9fe8 100644 --- a/content/en/docs/concepts/cluster-administration/system-metrics.md +++ b/content/en/docs/concepts/cluster-administration/system-metrics.md @@ -177,7 +177,7 @@ flag to expose these alpha stability metrics. ### kubelet Pressure Stall Information (PSI) metrics -{{< feature-state for_k8s_version="v1.34" state="beta" >}} +{{< feature-state for_k8s_version="v1.36" state="stable" >}} As a beta feature, Kubernetes lets you configure kubelet to collect Linux kernel [Pressure Stall Information](https://docs.kernel.org/accounting/psi.html) From 1742813e16ae34179b162d4ce96f3c6e0edfd9b6 Mon Sep 17 00:00:00 2001 From: Maria Romano Date: Sat, 21 Feb 2026 01:03:57 +0000 Subject: [PATCH 2/8] updated system-metrics.md with more info on both endpoints for psi --- .../concepts/cluster-administration/system-metrics.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/content/en/docs/concepts/cluster-administration/system-metrics.md b/content/en/docs/concepts/cluster-administration/system-metrics.md index 3eeca9a0a9fe8..cc33e0fcd5aa2 100644 --- a/content/en/docs/concepts/cluster-administration/system-metrics.md +++ b/content/en/docs/concepts/cluster-administration/system-metrics.md @@ -177,12 +177,16 @@ flag to expose these alpha stability metrics. 
### kubelet Pressure Stall Information (PSI) metrics -{{< feature-state for_k8s_version="v1.36" state="stable" >}} +{{< feature-state feature_gate_name="KubeletPSI" >}} -As a beta feature, Kubernetes lets you configure kubelet to collect Linux kernel +As a stable feature, the kubelet collects Linux kernel [Pressure Stall Information](https://docs.kernel.org/accounting/psi.html) (PSI) for CPU, memory and I/O usage. The information is collected at node, pod and container level. + +*Prometheus Metrics*: Exposed at the `/metrics/cadvisor` endpoint as cumulative counters (totals) representing the total stall time in seconds. +*Summary API*: Exposed at the `/stats/summary` endpoint, providing both the cumulative totals and the moving averages (avg10, avg60, avg300). These averages represent the percentage of time that tasks were stalled on a resource over the respective 10-second, 60-second, and 5-minute intervals. + The metrics are exposed at the `/metrics/cadvisor` endpoint with the following names: ``` @@ -194,7 +198,7 @@ container_pressure_io_stalled_seconds_total container_pressure_io_waiting_seconds_total ``` -This feature is enabled by default, by setting the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/). The information is also exposed in the +This feature is enabled by default. Starting with Kubernetes v.1.36, the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is locked to true and cannot be disabled. The information is also exposed in the [Summary API](/docs/reference/instrumentation/node-metrics#psi). You can learn how to interpret the PSI metrics in [Understand PSI Metrics](/docs/reference/instrumentation/understand-psi-metrics/). 
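As an aside for reviewers: the cumulative `*_seconds_total` counters this patch documents can be turned into a stall percentage by sampling the counter twice and dividing the delta by the sampling interval. A minimal sketch — the counter samples below are hypothetical, and the helper is illustrative, not part of kubelet:

```python
def stall_percentage(total_t0: float, total_t1: float, interval_s: float) -> float:
    """Percentage of `interval_s` spent stalled, given two samples of a
    cumulative PSI counter such as container_pressure_cpu_waiting_seconds_total."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return 100.0 * (total_t1 - total_t0) / interval_s

# Hypothetical counter samples taken 60 seconds apart:
print(stall_percentage(1234.5, 1234.9, 60.0))  # roughly 0.67 (% of the minute stalled)
```

This mirrors what a `rate()`-style query does in a metrics backend, applied to one of the counter names listed above.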
From bd2322f662f9e7ffb8e3769e80a051138a6eb714 Mon Sep 17 00:00:00 2001 From: Maria Romano Date: Sat, 21 Feb 2026 01:27:38 +0000 Subject: [PATCH 3/8] Update KubeletPSI feature gate to stable for v1.36 --- .../command-line-tools-reference/feature-gates/KubeletPSI.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/KubeletPSI.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/KubeletPSI.md index 0caecab23fec2..c9a0c8aab5aa0 100644 --- a/content/en/docs/reference/command-line-tools-reference/feature-gates/KubeletPSI.md +++ b/content/en/docs/reference/command-line-tools-reference/feature-gates/KubeletPSI.md @@ -13,5 +13,10 @@ stages: - stage: beta defaultValue: true fromVersion: "1.34" + toVersion: "1.35" + - stage: stable + defaultValue: true + fromVersion: "1.36" + locked: true --- Enable kubelet to surface Pressure Stall Information (PSI) metrics in the Summary API and Prometheus metrics. From 2fc58207192f03a9b1e58d63520d92929ee38442 Mon Sep 17 00:00:00 2001 From: Maria Romano Date: Fri, 27 Mar 2026 22:01:29 +0000 Subject: [PATCH 4/8] Update remaining PSI docs for GA. Added format from the linux kernel docs --- .../concepts/cluster-administration/system-metrics.md | 11 ++++++----- .../en/docs/reference/instrumentation/node-metrics.md | 4 ++-- .../instrumentation/understand-psi-metrics.md | 4 ++-- 3 files changed, 10 insertions(+), 9 deletions(-) diff --git a/content/en/docs/concepts/cluster-administration/system-metrics.md b/content/en/docs/concepts/cluster-administration/system-metrics.md index cc33e0fcd5aa2..d25d91159ce0f 100644 --- a/content/en/docs/concepts/cluster-administration/system-metrics.md +++ b/content/en/docs/concepts/cluster-administration/system-metrics.md @@ -184,11 +184,7 @@ As a stable feature, the kubelet collects Linux kernel (PSI) for CPU, memory and I/O usage. The information is collected at node, pod and container level. 
-*Prometheus Metrics*: Exposed at the `/metrics/cadvisor` endpoint as cumulative counters (totals) representing the total stall time in seconds. -*Summary API*: Exposed at the `/stats/summary` endpoint, providing both the cumulative totals and the moving averages (avg10, avg60, avg300). These averages represent the percentage of time that tasks were stalled on a resource over the respective 10-second, 60-second, and 5-minute intervals. - -The metrics are exposed at the `/metrics/cadvisor` endpoint with the following names: - +*Prometheus Metrics*: Exposed at the `/metrics/cadvisor` endpoint as cumulative counters (totals) representing the total stall time in seconds. The metrics are exposed at this endpoint with the following names: ``` container_pressure_cpu_stalled_seconds_total container_pressure_cpu_waiting_seconds_total @@ -197,6 +193,11 @@ container_pressure_memory_waiting_seconds_total container_pressure_io_stalled_seconds_total container_pressure_io_waiting_seconds_total ``` +*Summary API*: Exposed at the `/stats/summary` endpoint, providing both the cumulative `totals` and the moving averages (`avg10`, `avg60`, `avg300`). These averages represent the percentage of time that tasks were stalled on a resource over the respective 10-second, 60-second, and 5-minute intervals. This endpoint reports the metrics in the following format: +``` +some avg10=0.00 avg60=0.00 avg300=0.00 total=0 +full avg10=0.00 avg60=0.00 avg300=0.00 total=0 +``` This feature is enabled by default. Starting with Kubernetes v.1.36, the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is locked to true and cannot be disabled. The information is also exposed in the [Summary API](/docs/reference/instrumentation/node-metrics#psi). 
diff --git a/content/en/docs/reference/instrumentation/node-metrics.md b/content/en/docs/reference/instrumentation/node-metrics.md index 042aed8c4244a..49c97d5ab9295 100644 --- a/content/en/docs/reference/instrumentation/node-metrics.md +++ b/content/en/docs/reference/instrumentation/node-metrics.md @@ -45,9 +45,9 @@ the kubelet [fetches Pod- and container-level metric data using CRI](/docs/refer ## Pressure Stall Information (PSI) {#psi} -{{< feature-state for_k8s_version="v1.34" state="beta" >}} +{{< feature-state feature_gate_name="KubeletPSI" >}} -As a beta feature, Kubernetes lets you configure kubelet to collect Linux kernel +As a stable feature, Kubernetes lets you configure kubelet to collect Linux kernel [Pressure Stall Information](https://docs.kernel.org/accounting/psi.html) (PSI) for CPU, memory, and I/O usage. The information is collected at node, pod and container level. See [Summary API](/docs/reference/config-api/kubelet-stats.v1alpha1/) for detailed schema. diff --git a/content/en/docs/reference/instrumentation/understand-psi-metrics.md b/content/en/docs/reference/instrumentation/understand-psi-metrics.md index 405d0ed60374e..372190dab94ea 100644 --- a/content/en/docs/reference/instrumentation/understand-psi-metrics.md +++ b/content/en/docs/reference/instrumentation/understand-psi-metrics.md @@ -8,9 +8,9 @@ description: >- -{{< feature-state for_k8s_version="v1.34" state="beta" >}} +{{< feature-state feature_gate_name="KubeletPSI" >}} -As a beta feature, Kubernetes lets you configure the kubelet to collect Linux kernel +As a stable feature, Kubernetes lets you configure the kubelet to collect Linux kernel [Pressure Stall Information](https://docs.kernel.org/accounting/psi.html) (PSI) for CPU, memory, and I/O usage. The information is collected at node, pod and container level. This feature is enabled by default by setting the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/). 
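For reference, the kernel-format lines introduced in the patch above (`some avg10=... total=`) are straightforward to parse. A small illustrative helper, written for this review and not part of kubelet or cAdvisor:

```python
def parse_psi_line(line: str) -> dict:
    """Parse one line of the Linux PSI format shown above, e.g.
    'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    kind, *fields = line.split()
    values = dict(field.split("=", 1) for field in fields)
    return {
        "kind": kind,                      # "some" or "full"
        "avg10": float(values["avg10"]),   # % of the last 10 s spent stalled
        "avg60": float(values["avg60"]),   # % of the last 60 s
        "avg300": float(values["avg300"]), # % of the last 5 min
        "total": int(values["total"]),     # cumulative stall time in microseconds
    }

print(parse_psi_line("some avg10=0.74 avg60=0.52 avg300=0.21 total=35232438"))
```

The same line format appears in each of the node's `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io` files.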
From bd3fd153525329a4b4195350bce80b2b9ceeb7c5 Mon Sep 17 00:00:00 2001 From: Maria Romano Date: Fri, 27 Mar 2026 22:33:35 +0000 Subject: [PATCH 5/8] fixing wording that feature cannot be disabled after 1.36 --- content/en/docs/reference/instrumentation/node-metrics.md | 2 +- .../en/docs/reference/instrumentation/understand-psi-metrics.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/content/en/docs/reference/instrumentation/node-metrics.md b/content/en/docs/reference/instrumentation/node-metrics.md index 49c97d5ab9295..94006fa6528b4 100644 --- a/content/en/docs/reference/instrumentation/node-metrics.md +++ b/content/en/docs/reference/instrumentation/node-metrics.md @@ -51,7 +51,7 @@ As a stable feature, Kubernetes lets you configure kubelet to collect Linux kern [Pressure Stall Information](https://docs.kernel.org/accounting/psi.html) (PSI) for CPU, memory, and I/O usage. The information is collected at node, pod and container level. See [Summary API](/docs/reference/config-api/kubelet-stats.v1alpha1/) for detailed schema. -This feature is enabled by default, by setting the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/). The information is also exposed in +Starting with Kubernetes v.1.36, the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is locked to true and cannot be disabled. The information is also exposed in [Prometheus metrics](/docs/concepts/cluster-administration/system-metrics#psi-metrics). You can learn how to interpret the PSI metrics in [Understand PSI Metrics](/docs/reference/instrumentation/understand-psi-metrics/). 
diff --git a/content/en/docs/reference/instrumentation/understand-psi-metrics.md b/content/en/docs/reference/instrumentation/understand-psi-metrics.md
index 372190dab94ea..bf6ad55a61998 100644
--- a/content/en/docs/reference/instrumentation/understand-psi-metrics.md
+++ b/content/en/docs/reference/instrumentation/understand-psi-metrics.md
@@ -13,7 +13,7 @@ description: >-
 As a stable feature, Kubernetes lets you configure the kubelet to collect Linux kernel
 [Pressure Stall Information](https://docs.kernel.org/accounting/psi.html)
 (PSI) for CPU, memory, and I/O usage. The information is collected at node, pod and container level.
-This feature is enabled by default by setting the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/).
+Starting with Kubernetes v1.36, the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is locked to true and cannot be disabled.
 
 PSI metrics are exposed through two different sources:
 - The kubelet's [Summary API](/docs/reference/config-api/kubelet-stats.v1alpha1/), which provides PSI data at the node, pod, and container level.

From 5f08b9f36a3991f3ea96ab4c6ae75bc4b2b48ebe Mon Sep 17 00:00:00 2001
From: Maria Fernanda Romano Silva
Date: Mon, 30 Mar 2026 17:29:17 -0700
Subject: [PATCH 6/8] Apply suggestion from @SergeyKanzhelev

Co-authored-by: Sergey Kanzhelev
---
 .../en/docs/concepts/cluster-administration/system-metrics.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/en/docs/concepts/cluster-administration/system-metrics.md b/content/en/docs/concepts/cluster-administration/system-metrics.md
index d25d91159ce0f..555d2813e59ef 100644
--- a/content/en/docs/concepts/cluster-administration/system-metrics.md
+++ b/content/en/docs/concepts/cluster-administration/system-metrics.md
@@ -179,7 +179,7 @@ flag to expose these alpha stability metrics.
{{< feature-state feature_gate_name="KubeletPSI" >}} -As a stable feature, the kubelet collects Linux kernel +The kubelet collects Linux kernel [Pressure Stall Information](https://docs.kernel.org/accounting/psi.html) (PSI) for CPU, memory and I/O usage. The information is collected at node, pod and container level. From 66118cd184ae6a43be9813a225bd42c8d9187658 Mon Sep 17 00:00:00 2001 From: Maria Romano Date: Tue, 31 Mar 2026 19:31:24 +0000 Subject: [PATCH 7/8] adding example on interpreting metrics --- .../cluster-administration/system-metrics.md | 71 +++++++++++++++++-- 1 file changed, 67 insertions(+), 4 deletions(-) diff --git a/content/en/docs/concepts/cluster-administration/system-metrics.md b/content/en/docs/concepts/cluster-administration/system-metrics.md index 555d2813e59ef..27be1eebe3387 100644 --- a/content/en/docs/concepts/cluster-administration/system-metrics.md +++ b/content/en/docs/concepts/cluster-administration/system-metrics.md @@ -185,6 +185,7 @@ The kubelet collects Linux kernel The information is collected at node, pod and container level. *Prometheus Metrics*: Exposed at the `/metrics/cadvisor` endpoint as cumulative counters (totals) representing the total stall time in seconds. The metrics are exposed at this endpoint with the following names: + ``` container_pressure_cpu_stalled_seconds_total container_pressure_cpu_waiting_seconds_total @@ -193,16 +194,78 @@ container_pressure_memory_waiting_seconds_total container_pressure_io_stalled_seconds_total container_pressure_io_waiting_seconds_total ``` -*Summary API*: Exposed at the `/stats/summary` endpoint, providing both the cumulative `totals` and the moving averages (`avg10`, `avg60`, `avg300`). These averages represent the percentage of time that tasks were stalled on a resource over the respective 10-second, 60-second, and 5-minute intervals. 
This endpoint reports the metrics in the following format:
+*Summary API*: Exposed at the `/stats/summary` endpoint, providing both the cumulative `total` values and the moving averages (`avg10`, `avg60`, `avg300`) in JSON format. These averages represent the percentage of time that tasks were stalled on a resource over the respective 10-second, 60-second, and 5-minute intervals.
+
+These metrics are also exported natively by the kernel through the node's `/proc/pressure/` files (`cpu`, `memory`, and `io`), in the following format:
+
```
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
```
-This feature is enabled by default. Starting with Kubernetes v.1.36, the `KubeletPSI` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is locked to true and cannot be disabled. The information is also exposed in the
-[Summary API](/docs/reference/instrumentation/node-metrics#psi).
+How can these metrics be interpreted together? Take for example the following query from the Summary API:
+`kubectl get --raw "/api/v1/nodes/$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')/proxy/stats/summary" | jq '.pods[].containers[] | select(.name=="") | {name, cpu: .cpu.psi, memory: .memory.psi, io: .io.psi}'`.
+This returns the information in JSON format, as shown below.
+
```

{
  "name": "",
  "cpu": {
    "full": {
      "total": 0,
      "avg10": 0,
      "avg60": 0,
      "avg300": 0
    },
    "some": {
      "total": 35232438,
      "avg10": 0.74,
      "avg60": 0.52,
      "avg300": 0.21
    }
  },
  "memory": {
    "full": {
      "total": 539105,
      "avg10": 0,
      "avg60": 0,
      "avg300": 0
    },
    "some": {
      "total": 658164,
      "avg10": 0.01,
      "avg60": 0.01,
      "avg300": 0.00
    }
  },
  "io": {
    "full": {
      "total": 33190987,
      "avg10": 0.31,
      "avg60": 0.22,
      "avg300": 0.05
    },
    "some": {
      "total": 40809937,
      "avg10": 0.52,
      "avg60": 0.45,
      "avg300": 0.12
    }
  }
}
```

Here is a simple spike scenario.
The `avg10` value of `0.74` indicates that in the last 10 seconds, at least one task in this container was stalled on the CPU for 0.74% of the time (0.0074 seconds or 74 milliseconds). Because `avg10` (0.74) is significantly higher than `avg300` (0.21) on the same resource, this suggests a recent surge in resource contention rather than a sustained long-term bottleneck. If monitored continuously and the avg300 metrics increase as well, we can diagnose a more serious, lasting issue! + +Additionally, notice how in this example `cpu.some` shows pressure, while `cpu.full` remains at 0.00. This tells us that while some processes were delayed waiting for CPU time, the container as a whole was still making progress. A non-zero full value would indicate that all non-idle tasks were stalled simultaneously - a much bigger problem. +Although not as human-readable, the `total` value of 35232438 represents the cumulative stall time in microseconds, that allow latency spike detection that otherwise may not show in the avgerages. They are also useful for monitoring systems, like Prometheus, to calculate precise rates of increase over specific time windows. +As a final note, when observing high I/O Pressure alongside low Memory Pressure, it can indicates that the application is waiting on disk throughput rather than failing due to a lack of available RAM. The node is not over-committed on memory, and a different diagnosis for disk consumption can be investigated. + +PSI metrics unlock a more robust way to monitor realitime resource contention at all levels for every cgroup, opening up the opportunity to dynamically handle workloads across the system. -You can learn how to interpret the PSI metrics in [Understand PSI Metrics](/docs/reference/instrumentation/understand-psi-metrics/). +You can read more about the PSI metrics in [Understand PSI Metrics](/docs/reference/instrumentation/understand-psi-metrics/). 
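The spike-versus-sustained reasoning walked through in the patch above can be condensed into a tiny helper. This is a sketch only — the 2× ratio threshold and the labels are illustrative assumptions, not kubelet behavior:

```python
def classify_pressure(avg10: float, avg300: float, spike_ratio: float = 2.0) -> str:
    """Compare the short and long PSI windows, following the interpretation
    above: a short window far above the long window suggests a recent spike,
    while a raised long window suggests sustained contention."""
    if avg10 == 0 and avg300 == 0:
        return "idle"
    if avg300 == 0 or avg10 / avg300 >= spike_ratio:
        return "recent spike"
    return "sustained pressure"

print(classify_pressure(0.74, 0.21))  # the CPU "some" values from the example JSON
```

A monitoring pipeline would evaluate this per resource (`cpu`, `memory`, `io`) and per `some`/`full` pair rather than for a single sample.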
#### Requirements

From b9366bc320f87d7118057ec7beb3326e31579417 Mon Sep 17 00:00:00 2001
From: Maria Romano
Date: Tue, 31 Mar 2026 19:56:01 +0000
Subject: [PATCH 8/8] fixed some typos and spacing

---
 .../concepts/cluster-administration/system-metrics.md | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/content/en/docs/concepts/cluster-administration/system-metrics.md b/content/en/docs/concepts/cluster-administration/system-metrics.md
index 27be1eebe3387..e3eb9a33cec05 100644
--- a/content/en/docs/concepts/cluster-administration/system-metrics.md
+++ b/content/en/docs/concepts/cluster-administration/system-metrics.md
@@ -208,7 +208,6 @@ How can these metrics be interpreted together? Take for example the following qu
 This returns the information in JSON format, as shown below.
 ```
-
 {
   "name": "",
   "cpu": {
@@ -257,15 +256,13 @@
-Here is a simple spike scenario. The `avg10` value of `0.74` indicates that in the last 10 seconds, at least one task in this container was stalled on the CPU for 0.74% of the time (0.0074 seconds or 74 milliseconds). Because `avg10` (0.74) is significantly higher than `avg300` (0.21) on the same resource, this suggests a recent surge in resource contention rather than a sustained long-term bottleneck. If monitored continuously and the avg300 metrics increase as well, we can diagnose a more serious, lasting issue!
+Here is a simple spike scenario. The `avg10` value of `0.74` indicates that in the last 10 seconds, at least one task in this container was stalled on the CPU for 0.74% of the time (0.074 seconds, or 74 milliseconds). Because `avg10` (0.74) is significantly higher than `avg300` (0.21) on the same resource, this suggests a recent surge in resource contention rather than a sustained long-term bottleneck. If monitored continuously and the `avg300` metrics increase as well, we can diagnose a more serious, lasting issue!
Additionally, notice how in this example `cpu.some` shows pressure, while `cpu.full` remains at 0.00. This tells us that while some processes were delayed waiting for CPU time, the container as a whole was still making progress. A non-zero full value would indicate that all non-idle tasks were stalled simultaneously - a much bigger problem.
-Although not as human-readable, the `total` value of 35232438 represents the cumulative stall time in microseconds, that allow latency spike detection that otherwise may not show in the avgerages. They are also useful for monitoring systems, like Prometheus, to calculate precise rates of increase over specific time windows.
-As a final note, when observing high I/O Pressure alongside low Memory Pressure, it can indicates that the application is waiting on disk throughput rather than failing due to a lack of available RAM. The node is not over-committed on memory, and a different diagnosis for disk consumption can be investigated.
+Although not as human-readable, the `total` value of 35232438 represents the cumulative stall time in microseconds, which allows detection of latency spikes that may not otherwise show up in the averages. These counters are also useful for monitoring systems, such as Prometheus, to calculate precise rates of increase over specific time windows.
+As a final note, when observing high I/O pressure alongside low memory pressure, it can indicate that the application is waiting on disk throughput rather than failing due to a lack of available RAM. The node is not over-committed on memory, and disk consumption can be investigated as a separate diagnosis.
PSI metrics unlock a more robust way to monitor real-time resource contention at all levels for every cgroup, opening up the opportunity to dynamically handle workloads across the system. You can read more about the PSI metrics in [Understand PSI Metrics](/docs/reference/instrumentation/understand-psi-metrics/).

#### Requirements