From 99bb6c1954315ab16e623060af07dcb3c7906b82 Mon Sep 17 00:00:00 2001
From: Aananth V
Date: Tue, 21 Oct 2025 13:54:10 +0530
Subject: [PATCH 1/7] A80: Export TCP Telemetry from gRPC

---
 A80-tcp-telemetry.md | 96 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 A80-tcp-telemetry.md

diff --git a/A80-tcp-telemetry.md b/A80-tcp-telemetry.md
new file mode 100644
index 000000000..762ab2695
--- /dev/null
+++ b/A80-tcp-telemetry.md
@@ -0,0 +1,96 @@
+# A80: Export TCP Telemetry from gRPC
+
+**Author(s)**: Aananth V (@aananthv), Nana Pang (@nanahpang), Yash Tibrewal (@yashykt), Yousuk Seung (@yousukseung)
+**Approver(s)**: Craig Tiller (@ctiller), Mark Roth (@markdroth)
+**Status**: In Review
+**Implemented in**: C-Core
+**Last updated**: Oct 21, 2025
+
+## Abstract
+
+This document proposes collecting and exposing new TCP Endpoint Level Telemetry to gRPC for improved network analysis and debugging.
+
+## Background
+
+The Linux Kernel exposes two telemetry hooks that can be used to collect TCP-level metrics.
+
+1. **TCP socket state** can be retrieved using the `getsockopt()` system call with `level` set to `IPPROTO_TCP` and `optname` set to `TCP_INFO`. The state is returned in a `struct tcp_info` which gives details about the TCP connection. At present, the machinery to collect such information is available on Linux 2.6 or later kernels.
+2. **Per-message transmission timestamps** can be collected from TCP sockets using the [SO\_TIMESTAMPING](https://docs.kernel.org/networking/timestamping.html) interface. At present, this is available on Linux 2.6 or later kernels. These timestamps can be very valuable for diagnosing network level issues and can be used to break down the time spent in the "network".
+
+\[[A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP Latency metrics.
+
+### *Related Proposals:*
+
+\* \[[A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\]: gRPC Non-Per-Call Metrics Framework
+
+## Proposal
+
+This document proposes exporting the following TCP metrics from gRPC to improve the network debugging capabilities for gRPC users.
+
+| Name | Type | Unit | Labels | Optional Labels | Description |
+| ----- | :---- | :---- | :---- | :---- | :---- |
+| **Per-Connection Metrics** | | | | | |
+| grpc.tcp.min\_rtt | Histogram (floating-point) | {s} | None | network.local.address network.local.port network.peer.address network.peer.port | TCP's current estimate of minimum round trip time (RTT). It can be used as an indication of the network health between two endpoints. Corresponds to `tcpi_min_rtt` from `struct tcp_info`. |
+| grpc.tcp.delivery\_rate | Histogram (floating-point) | By/s | None | network.local.address network.local.port network.peer.address network.peer.port | TCP’s most recent measure of the connection’s "non-app-limited" throughput. The term non-app-limited means that the link is saturated by the application. The delivery rate is only reported when it is non-app-limited. Corresponds to `tcpi_delivery_rate` from `tcp_info` when `tcpi_delivery_rate_app_limited` is `false`. |
+| grpc.tcp.packets\_sent | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | Total packets sent by TCP including retransmissions and spurious retransmissions. Corresponds to `tcpi_data_segs_out` from `struct tcp_info`. |
+| grpc.tcp.packets\_retransmitted | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | Total packets sent by TCP except those sent for the first time. A packet may be retransmitted multiple times and will be counted multiple times as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_total_retrans` from `struct tcp_info`. |
+| grpc.tcp.packets\_spurious\_retransmitted | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | Total packets retransmitted by TCP that were later found to be unnecessary. These packets are acknowledged for the second time or more. Multiple spurious retransmissions for the same packet are counted multiple times. Corresponds to `tcpi_dsack_dups` from `struct tcp_info`. |
+| grpc.tcp.recurring\_retransmits | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | The number of times the latest TCP packet (TCP sequence) was retransmitted due to expiration of TCP retransmission timer (RTO), and not acknowledged at the time the connection was closed. Corresponds to `tcpi_retransmits` at connection close time. |
+| grpc.tcp.bytes\_sent | Counter (integer) | By | None | network.local.address network.local.port network.peer.address network.peer.port | Total bytes sent by TCP including retransmissions and spurious retransmissions. Corresponds to `tcpi_bytes_sent` from `struct tcp_info`. |
+| grpc.tcp.bytes\_retransmitted | Counter (integer) | By | None | network.local.address network.local.port network.peer.address network.peer.port | Total bytes sent by TCP except those sent for the first time. A byte sequence may be retransmitted multiple times and will be counted multiple times as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_bytes_retrans` from `struct tcp_info`. |
+| **Per-Connection Op Metrics** | | | | | |
+| grpc.tcp.connection\_count | Gauge (integer) | {connection} | None | network.local.address network.local.port network.peer.address network.peer.port | Number of active TCP connections. |
+| grpc.tcp.syscall\_writes | Counter (integer) | {syscall} | None | network.local.address network.local.port network.peer.address network.peer.port | The number of times we invoked the sendmsg (or sendmmsg) syscall and wrote data to the TCP socket. Measured at the endpoint level. |
+| grpc.tcp.write\_size | Histogram (floating-point) | By | None | network.local.address network.local.port network.peer.address network.peer.port | The number of bytes offered to each syscall\_write. Measured at the endpoint level. |
+| grpc.tcp.syscall\_reads | Counter (integer) | {syscall} | None | network.local.address network.local.port network.peer.address network.peer.port | The number of times we invoked the recvmsg (or recvmmsg or zero copy getsockopt) syscall and read data from the TCP socket. Measured at the endpoint level. |
+| grpc.tcp.read\_size | Histogram (floating-point) | By | None | network.local.address network.local.port network.peer.address network.peer.port | The number of bytes received by each syscall\_read. Measured at the endpoint level. |
+| **Per-Write Metrics** | | | | | |
+| grpc.tcp.sender\_latency | Histogram (floating-point) | {s} | None | network.local.address network.local.port network.peer.address network.peer.port | Time taken by the TCP socket to write the first byte of a write onto the NIC. This includes the latency incurred by traffic shaping, qdisc, throttling, and pacing at the sender. Corresponds to the time taken between the final `SCHED` timestamp and the `SENT` timestamp. Sampled periodically. |
+| grpc.tcp.transfer\_latency | Histogram (floating-point) | {s} | size (Bytes)1 | network.local.address network.local.port network.peer.address network.peer.port | Time taken to transmit the first size bytes of a write. Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically. |
+
+1 Since transfer latency is strongly affected by the write size, it is broken down into different buckets based on the size of the write. Further, we will only measure the latencies of certain benchmark sample sizes to get measurements that are unaffected by the write sizes. The proposed buckets are:
+
+| Buffered Size | Benchmark Size | `Size (Bytes)` |
+| :---- | :---- | :---- |
+| \[0, 1KiB) | Whole buffer | 1024 |
+| \[1KiB, 8KiB) | First 1KiB | 1024 |
+| \[8KiB, 64KiB) | First 8KiB | 8192 |
+| \[64KiB, 256KiB) | First 64KiB | 65536 |
+| \[256KiB, 2MiB) | First 256KiB | 262144 |
+| \[2MiB, \+inf) | First 2MiB | 2097152 |
+
+### *Suggested Metric Collection Algorithms*
+
+#### Per-Connection Metrics
+
+* Set TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL to a default of 5 minutes.
+  * Implementations can choose to make it configurable.
+* For each new connected TCP socket, set an initial alarm of 10% to 110% (randomly selected) of TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL.
+* When the alarm fires \-
+  * Use `getsockopt(TCP_INFO)` or equivalent method to retrieve and record connection metrics.
+  * Re-arm the alarm with TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL and repeat.
+  * Before the socket is closed, cancel the alarm set above, and retrieve and record connection metrics, providing observability for short-lived connections as well. This will also allow collection of `grpc.tcp.recurring_retransmits`.
+
+#### Per-Connection Op Metrics
+
+* `grpc.tcp.connection_count` will be incremented when a connection is created and decremented when it is destroyed.
+* The `grpc.tcp.syscall_writes` and `grpc.tcp.write_size` metrics will be updated whenever we write to the socket.
+* The `grpc.tcp.syscall_reads` and `grpc.tcp.read_size` metrics will be updated whenever we read from the socket.
+
+#### Per-Write Metrics
+
+* Set TCP\_LATENCY\_RECORD\_FREQUENCY to a default of 1 in 1000 writes.
+  * Implementations can choose to make it configurable.
+* For each new connected TCP socket,
+  * Set `writes_since_last_latency_measurement_` to a random integer in \[0, TCP\_LATENCY\_RECORD\_FREQUENCY). Increment this value for every write.
+  * Perform any prerequisites needed for the socket to support timestamping.
+    * For Linux TCP, this involves making a setsockopt(SO\_TIMESTAMPING) call with the flag value set to SOF\_TIMESTAMPING\_SOFTWARE | SOF\_TIMESTAMPING\_OPT\_ID | SOF\_TIMESTAMPING\_OPT\_TSONLY | SOF\_TIMESTAMPING\_OPT\_ID\_TCP | SOF\_TIMESTAMPING\_OPT\_STATS.
+* If `writes_since_last_latency_measurement_` % TCP\_LATENCY\_RECORD\_FREQUENCY \= 0
+  * Enable latency measurement for the write.
+    * For Linux TCP, this involves splitting the write into two chunks based on the buckets listed above and adding a SO\_TIMESTAMPING cmsg header to the sendmsg call with the flags SOF\_TIMESTAMPING\_TX\_SCHED | SOF\_TIMESTAMPING\_TX\_SOFTWARE on the sampled chunk and SOF\_TIMESTAMPING\_TX\_ACK on the other chunk.
+      * There is only one chunk if the write is less than 1024 bytes; in that case all three flags are set on this chunk.
+  * Set `writes_since_last_latency_measurement_` to 1 and repeat.
+
+## Implementation
+
+Will be implemented in C-Core to start with.
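As an illustration of the collection steps proposed in patch 1, the sketch below shows one way the transfer-latency bucket table, the 1-in-N write sampling, and the jittered initial alarm could be expressed. It is not C-Core code: `size_label`, `sampled_chunk_len`, `WriteSampler`, and `initial_alarm_delay` are hypothetical names, 8 KiB is assumed to mean 8192 bytes, and "check the counter, then increment or reset it" is only one reading of the per-write algorithm text.

```python
import random

KIB = 1024
MIB = 1024 * KIB

# (exclusive upper bound of "Buffered Size", `size` label in bytes),
# following the bucket table in the proposal. Assumes 8 KiB == 8192 bytes.
_BUCKETS = [
    (1 * KIB, 1 * KIB),      # [0, 1KiB): whole buffer, labelled 1024
    (8 * KIB, 1 * KIB),      # [1KiB, 8KiB): first 1KiB
    (64 * KIB, 8 * KIB),     # [8KiB, 64KiB): first 8KiB
    (256 * KIB, 64 * KIB),   # [64KiB, 256KiB): first 64KiB
    (2 * MIB, 256 * KIB),    # [256KiB, 2MiB): first 256KiB
]
_LARGEST_LABEL = 2 * MIB     # [2MiB, +inf): first 2MiB


def size_label(write_size: int) -> int:
    """Return the `size` label for a write of `write_size` bytes."""
    for upper_bound, label in _BUCKETS:
        if write_size < upper_bound:
            return label
    return _LARGEST_LABEL


def sampled_chunk_len(write_size: int) -> int:
    """Return how many leading bytes of the write form the sampled chunk."""
    # Writes under 1 KiB are measured whole; larger writes only up to the label.
    return min(write_size, size_label(write_size))


class WriteSampler:
    """Counter-based 1-in-`frequency` sampling of writes on one socket."""

    def __init__(self, frequency: int = 1000, rng: random.Random = None):
        rng = rng or random.Random()
        self.frequency = frequency
        # A random start in [0, frequency) avoids sampling every socket in lockstep.
        self.writes_since_last = rng.randrange(frequency)

    def should_sample(self) -> bool:
        """Call once per write; True means this write gets latency timestamps."""
        sample = self.writes_since_last % self.frequency == 0
        self.writes_since_last = 1 if sample else self.writes_since_last + 1
        return sample


def initial_alarm_delay(interval_s: float = 300.0, rng=random) -> float:
    """Pick the first TCP_INFO collection delay: 10% to 110% of the interval."""
    return rng.uniform(0.1 * interval_s, 1.1 * interval_s)
```

Under these assumptions, a 5000-byte write is labelled `size=1024` and only its first 1024 bytes form the sampled chunk, while a 500-byte write is also labelled `size=1024` but is measured whole.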
From 82f606be5fef8064d9488d1599028a6cab668765 Mon Sep 17 00:00:00 2001 From: Aananth V Date: Tue, 2 Dec 2025 17:38:10 +0530 Subject: [PATCH 2/7] Incorporate formatting and other fixes from PR discussion --- A80-tcp-telemetry.md | 100 +++++++++++++++++++++++++------------------ 1 file changed, 59 insertions(+), 41 deletions(-) diff --git a/A80-tcp-telemetry.md b/A80-tcp-telemetry.md index 762ab2695..174a4fa35 100644 --- a/A80-tcp-telemetry.md +++ b/A80-tcp-telemetry.md @@ -1,10 +1,11 @@ # A80: Export TCP Telemetry from gRPC -**Author(s)**: Aananth V (@aananthv), Nana Pang (@nanahpang), Yash Tibrewal (@yashykt), Yousuk Seung (@yousukseung) -**Approver(s)**: Craig Tiller (@ctiller), Mark Roth (@markdroth) -**Status**: In Review -**Implemented in**: C-Core -**Last updated**: Oct 21, 2025 +* Author(s): Aananth V (@aananthv), Nana Pang (@nanahpang), Yash Tibrewal (@yashykt), Yousuk Seung (@yousukseung) +* Approver(s): Craig Tiller (@ctiller), Mark Roth (@markdroth) +* Status: In Review +* Implemented in: C-Core +* Last updated: 2025-10-21 +* Discussion at: https://groups.google.com/g/grpc-io/c/MoLrWPsFB3s ## Abstract @@ -25,43 +26,26 @@ The Linux Kernel exposes two telemetry hooks that can be used to collect TCP-lev ## Proposal -This document proposes exporting the following TCP metrics from gRPC to improve the network debugging capabilities for gRPC users. - -| Name | Type | Unit | Labels | Optional Labels | Description | -| ----- | :---- | :---- | :---- | :---- | :---- | -| **Per-Connection Metrics** | | | | | | -| grpc.tcp.min\_rtt | Histogram (floating-point) | {s} | None | network.local.address network.local.port network.peer.address network.peer.port | TCP's current estimate of minimum round trip time (RTT). It can be used as an indication of the network health between two endpoints. Corresponds to `tcpi_min_rtt` from `struct tcp_info`. 
| -| grpc.tcp.delivery\_rate | Histogram (floating-point) | By/s | None | network.local.address network.local.port network.peer.address network.peer.port | TCP’s most recent measure of the connection’s "non-app-limited" throughput. The term non-app-limited means that the link is saturated by the application. The delivery rate is only reported when it is non-app-limited. Corresponds to `tcpi_delivery_rate` from `tcp_info` when `tcpi_delivery_rate_app_limited` is `false`. | -| grpc.tcp.packets\_sent | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | Total packets sent by TCP including retransmissions and spurious retransmissions. Corresponds to `tcpi_data_segs_out` from `struct tcp_info`. | -| grpc.tcp.packets\_retransmitted | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | Total packets sent by TCP except those sent for the first time. A packet may be retransmitted multiple times and will be counted multiple times as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_total_retrans` from `struct tcp_info`. | -| grpc.tcp.packets\_spurious\_retransmitted | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | Total packets retransmitted by TCP that were later found to be unnecessary. These packets are acknowledged for the second time or more. Multiple spurious retransmissions for the same packet are counted multiple times. Corresponds to `tcpi_dsack_dups` from `struct tcp_info`. | -| grpc.tcp.recurring\_retransmits | Counter (integer) | {packet} | None | network.local.address network.local.port network.peer.address network.peer.port | The number of times the latest TCP packet ( TCP sequence) was retransmitted due to expiration of TCP retransmission timer (RTO), and not acknowledged at the time the connection was closed. 
Corresponds to `tcpi_retransmits` at connection close time. | -| grpc.tcp.bytes\_sent | Counter (integer) | By | None | network.local.address network.local.port network.peer.address network.peer.port | Total bytes sent by TCP including retransmissions and spurious retransmissions. Corresponds to `tcpi_bytes_sent` from `struct tcp_info`. | -| grpc.tcp.bytes\_retransmitted | Counter (integer) | By | None | network.local.address network.local.port network.peer.address network.peer.port | Total bytes sent by TCP except those sent for the first time. A byte sequence may be retransmitted multiple times and will be counted multiple times as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_bytes_retrans` from `struct tcp_info`. | -| **Per-Connection Op Metrics** | | | | | | -| grpc.tcp.connection\_count | Gauge (integer) | {connection} | None | network.local.address network.local.port network.peer.address network.peer.port | Number of active TCP connections. | -| grpc.tcp.syscall\_writes | Counter (integer) | {syscall} | None | network.local.address network.local.port network.peer.address network.peer.port | The number of times we invoked the sendmsg (or sendmmsg) syscall and wrote data to the TCP socket. Measured at the endpoint level. | -| grpc.tcp.write\_size | Histogram (floating-point) | By | None | network.local.address network.local.port network.peer.address network.peer.port | The number of bytes offered to each syscall\_write. Measured at the endpoint level. | -| grpc.tcp.syscall\_reads | Counter (integer) | {syscall} | None | network.local.address network.local.port network.peer.address network.peer.port | The number of times we invoked the recvmsg (or recvmmsg or zero copy getsockopt) syscall and read data from the TCP socket. Measured at the endpoint level. 
| -| grpc.tcp.read\_size | Histogram (floating-point) | By | None | network.local.address network.local.port network.peer.address network.peer.port | The number of bytes received by each syscall\_read. Measured at the endpoint level. | -| **Per-Write Metrics** | | | | | | -| grpc.tcp.sender\_latency | Histogram (floating-point) | {s} | None | network.local.address network.local.port network.peer.address network.peer.port | Time taken by the TCP socket to write the first byte of a write onto the NIC. This includes the latency incurred by traffic shaping, qdisc, throttling, and pacing at the sender. Corresponds to the time taken between the final `SCHED` timestamp and the `SENT` timestamp. Sampled periodically. | -| grpc.tcp.transfer\_latency | Histogram (floating-point) | {s} | size (Bytes)1 | network.local.address network.local.port network.peer.address network.peer.port | Time taken to transmit the first size bytes of a write. Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically. | +This document proposes exporting the following TCP metrics from gRPC to improve the network debugging capabilities for gRPC users. These metrics will be implemented inside the gRPC endpoint layer. All metrics have the following optional labels: +* network.local.address +* network.local.port +* network.peer.address +* network.peer.port -1 Since transfer latency is strongly affected by the write size, it is broken down into different buckets based on the size of the write. Further, we will only measure the latencies of certain benchmark sample sizes to get measurements that are unaffected by the write sizes. 
The proposed buckets are: - -| Buffered Size | Benchmark Size | `Size (Bytes)` | -| :---- | :---- | :---- | -| \[0, 1KiB) | Whole buffer | 1024 | -| \[1KiB, 8KiB) | First 1KiB | 1024 | -| \[8KiB, 64KiB) | First 8KiB | 8196 | -| \[64KiB, 256KiB) | First 64KiB | 65536 | -| \[256KiB, 2MiB) | First 256KiB | 262144 | -| \[2MiB, \+inf) | First 2MiB | 2097152 | +### Per-Connection Metrics -### *Suggested Metric Collection Algorithms* +| Name | Type | Unit | Description | +| ----- | :---- | :---- | :---- | +| grpc.tcp.min\_rtt | Histogram (floating-point) | {s} | TCP's current estimate of minimum round trip time (RTT). It can be used as an indication of the network health between two endpoints. Corresponds to `tcpi_min_rtt` from `struct tcp_info`. | +| grpc.tcp.delivery\_rate | Histogram (floating-point) | By/s | TCP’s most recent measure of the connection’s "non-app-limited" throughput. The term non-app-limited means that the link is saturated by the application. The delivery rate is only reported when it is non-app-limited. Corresponds to `tcpi_delivery_rate` from `tcp_info` when `tcpi_delivery_rate_app_limited` is `false`. | +| grpc.tcp.packets\_sent | Counter (integer) | {packet} | Total packets sent by TCP including retransmissions and spurious retransmissions. Corresponds to `tcpi_data_segs_out` from `struct tcp_info`. | +| grpc.tcp.packets\_retransmitted | Counter (integer) | {packet} | Total packets sent by TCP except those sent for the first time. A packet may be retransmitted multiple times and will be counted multiple times as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_total_retrans` from `struct tcp_info`. | +| grpc.tcp.packets\_spurious\_retransmitted | Counter (integer) | {packet} | Total packets retransmitted by TCP that were later found to be unnecessary. These packets are acknowledged for the second time or more. Multiple spurious retransmissions for the same packet are counted multiple times. 
Corresponds to `tcpi_dsack_dups` from `struct tcp_info`. | +| grpc.tcp.recurring\_retransmits | Counter (integer) | {packet} | The number of times the latest TCP packet ( TCP sequence) was retransmitted due to expiration of TCP retransmission timer (RTO), and not acknowledged at the time the connection was closed. Corresponds to `tcpi_retransmits` at connection close time. | +| grpc.tcp.bytes\_sent | Counter (integer) | By | Total bytes sent by TCP including retransmissions and spurious retransmissions. Corresponds to `tcpi_bytes_sent` from `struct tcp_info`. | +| grpc.tcp.bytes\_retransmitted | Counter (integer) | By | Total bytes sent by TCP except those sent for the first time. A byte sequence may be retransmitted multiple times and will be counted multiple times as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_bytes_retrans` from `struct tcp_info`. | -#### Per-Connection Metrics +#### Suggested Metric Collection Algorithm * Set TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL to a default of 5 minutes. * Implementations can choose to make it configurable. @@ -71,13 +55,41 @@ This document proposes exporting the following TCP metrics from gRPC to improve * Re-arm the alarm with TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL and repeat. * Before the socket is closed, cancel the alarm set above, and retrieve and record connection metrics, providing observability for short-lived connections as well. This will also allow collection of `grpc.tcp.recurring_retransmits` -#### Per-Connection Op Metrics +### Per-Connection Op Metrics + +| Name | Type | Unit | Description | +| ----- | :---- | :---- | :---- | +| grpc.tcp.connection\_count | UpDownCounter (integer) | {connection} | Number of active TCP connections. | +| grpc.tcp.syscall\_writes | Counter (integer) | {syscall} | The number of times we invoked the sendmsg (or sendmmsg) syscall and wrote data to the TCP socket. Measured at the endpoint level. 
| +| grpc.tcp.write\_size | Histogram (floating-point) | By | The number of bytes offered to each syscall\_write. Measured at the endpoint level. | +| grpc.tcp.syscall\_reads | Counter (integer) | {syscall} | The number of times we invoked the recvmsg (or recvmmsg or zero copy getsockopt) syscall and read data from the TCP socket. Measured at the endpoint level. | +| grpc.tcp.read\_size | Histogram (floating-point) | By | The number of bytes received by each syscall\_read. Measured at the endpoint level. | + +#### Suggested Metric Collection Algorithm * `grpc.tcp.connection_count` will be incremented when a connection is created and decremented when it is destroyed. * The `grpc.tcp.syscall_writes` and `grpc.tcp.write_size` metrics will be updated whenever we write to the socket. * The `grpc.tcp.syscall_reads` and `grpc.tcp.read_size` metrics will be updated whenever we read from the socket. -#### Per-Write Metrics +### Per-Write Metrics + +| Name | Type | Unit | Labels | Description | +| ----- | :---- | :---- | :---- | :---- | +| grpc.tcp.sender\_latency | Histogram (floating-point) | {s} | None | Time taken by the TCP socket to write the first byte of a write onto the NIC. This includes the latency incurred by traffic shaping, qdisc, throttling, and pacing at the sender. Corresponds to the time taken between the final `SCHED` timestamp and the `SENT` timestamp. Sampled periodically. | +| grpc.tcp.transfer\_latency | Histogram (floating-point) | {s} | size (Bytes)1 | Time taken to transmit the first size bytes of a write. Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically. | + +1 Since transfer latency is strongly affected by the write size, it is broken down into different buckets based on the size of the write. 
Further, we will only measure the latencies of certain benchmark sample sizes to get measurements that are unaffected by the write sizes. The proposed buckets are: + +| Buffered Size | Benchmark Size | `Size (Bytes)` | +| :---- | :---- | :---- | +| \[0, 1KiB) | Whole buffer | 1024 | +| \[1KiB, 8KiB) | First 1KiB | 1024 | +| \[8KiB, 64KiB) | First 8KiB | 8196 | +| \[64KiB, 256KiB) | First 64KiB | 65536 | +| \[256KiB, 2MiB) | First 256KiB | 262144 | +| \[2MiB, \+inf) | First 2MiB | 2097152 | + +#### Suggested Metric Collection Algorithm * Set TCP\_LATENCY\_RECORD\_FREQUENCY to a default of 1 in 1000 writes. * Implementations can choose to make it configurable. @@ -91,6 +103,12 @@ This document proposes exporting the following TCP metrics from gRPC to improve * There is only one chunk if the write is less than 1024 bytes, in that case all three flags are set on this chunk. * Set `writes_since_last_latency_measurement_` to 1 and repeat. +### Metric Stability +All metrics added in this proposal will start as experimental. The long term goal will be to de-experimentalize them and have them be on by default, but the exact criteria for that change are TBD. + +### Temporary environment variable protection +This proposal does not include any features enabled via external I/O, so it does not need environment variable protection. + ## Implementation Will be implemented in C-Core to start with. From 9c75f0618f34a6585f750c45e4cf81550fab7d36 Mon Sep 17 00:00:00 2001 From: Aananth V Date: Wed, 3 Dec 2025 13:35:49 +0530 Subject: [PATCH 3/7] Update A80-tcp-telemetry.md --- A80-tcp-telemetry.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/A80-tcp-telemetry.md b/A80-tcp-telemetry.md index 174a4fa35..b89cf106c 100644 --- a/A80-tcp-telemetry.md +++ b/A80-tcp-telemetry.md @@ -18,11 +18,11 @@ The Linux Kernel exposes two telemetry hooks that can be used to collect TCP-lev 1. 
**TCP socket state** can be retrieved using the `getsockopt()` system call with `level` set to `IPPROTO_TCP` and `optname` set to `TCP_INFO`. The state is returned in a `struct tcp_info` which gives details about the TCP connection. At present, the machinery to collect such information is available on Linux 2.6 or later kernels. 2. **Per-message transmission timestamps** can be collected from TCP sockets using the [SO\_TIMESTAMPING](https://docs.kernel.org/networking/timestamping.html) interface. At present, this is available on Linux 2.6 or later kernels. These timestamps can be very valuable for diagnosing network level issues and can be used to break down the time spent in the "network". -\[[A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP Latency metrics. +\[[gRFC A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP Latency metrics. ### *Related Proposals:* -\* \[[A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\]: gRPC Non-Per-Call Metrics Framework +* \[[gRFC A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\]: gRPC Non-Per-Call Metrics Framework ## Proposal @@ -36,7 +36,7 @@ This document proposes exporting the following TCP metrics from gRPC to improve | Name | Type | Unit | Description | | ----- | :---- | :---- | :---- | -| grpc.tcp.min\_rtt | Histogram (floating-point) | {s} | TCP's current estimate of minimum round trip time (RTT). It can be used as an indication of the network health between two endpoints. Corresponds to `tcpi_min_rtt` from `struct tcp_info`. 
| +| grpc.tcp.min\_rtt | Histogram (floating-point) | s | TCP's current estimate of minimum round trip time (RTT). It can be used as an indication of the network health between two endpoints. Corresponds to `tcpi_min_rtt` from `struct tcp_info`. | | grpc.tcp.delivery\_rate | Histogram (floating-point) | By/s | TCP’s most recent measure of the connection’s "non-app-limited" throughput. The term non-app-limited means that the link is saturated by the application. The delivery rate is only reported when it is non-app-limited. Corresponds to `tcpi_delivery_rate` from `tcp_info` when `tcpi_delivery_rate_app_limited` is `false`. | | grpc.tcp.packets\_sent | Counter (integer) | {packet} | Total packets sent by TCP including retransmissions and spurious retransmissions. Corresponds to `tcpi_data_segs_out` from `struct tcp_info`. | | grpc.tcp.packets\_retransmitted | Counter (integer) | {packet} | Total packets sent by TCP except those sent for the first time. A packet may be retransmitted multiple times and will be counted multiple times as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_total_retrans` from `struct tcp_info`. | @@ -75,12 +75,12 @@ This document proposes exporting the following TCP metrics from gRPC to improve | Name | Type | Unit | Labels | Description | | ----- | :---- | :---- | :---- | :---- | -| grpc.tcp.sender\_latency | Histogram (floating-point) | {s} | None | Time taken by the TCP socket to write the first byte of a write onto the NIC. This includes the latency incurred by traffic shaping, qdisc, throttling, and pacing at the sender. Corresponds to the time taken between the final `SCHED` timestamp and the `SENT` timestamp. Sampled periodically. | -| grpc.tcp.transfer\_latency | Histogram (floating-point) | {s} | size (Bytes)1 | Time taken to transmit the first size bytes of a write. 
Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically. | +| grpc.tcp.sender\_latency | Histogram (floating-point) | s | None | Time taken by the TCP socket to write the first byte of a write onto the NIC. This includes the latency incurred by traffic shaping, qdisc, throttling, and pacing at the sender. Corresponds to the time taken between the final `SCHED` timestamp and the `SENT` timestamp. Sampled periodically. | +| grpc.tcp.transfer\_latency | Histogram (floating-point) | s | size1 | Time taken to transmit the first size bytes of a write. Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically. | 1 Since transfer latency is strongly affected by the write size, it is broken down into different buckets based on the size of the write. Further, we will only measure the latencies of certain benchmark sample sizes to get measurements that are unaffected by the write sizes. The proposed buckets are: -| Buffered Size | Benchmark Size | `Size (Bytes)` | +| Buffered Size | Benchmark Size | `size` (Bytes) | | :---- | :---- | :---- | | \[0, 1KiB) | Whole buffer | 1024 | | \[1KiB, 8KiB) | First 1KiB | 1024 | @@ -89,6 +89,8 @@ This document proposes exporting the following TCP metrics from gRPC to improve | \[256KiB, 2MiB) | First 256KiB | 262144 | | \[2MiB, \+inf) | First 2MiB | 2097152 | +Writes smaller than 1024 Bytes are labelled with `size=1024` to reduce cardinality. Further, their size does not have a big impact on transfer latency since they are able to fit inside a single TCP packet. + #### Suggested Metric Collection Algorithm * Set TCP\_LATENCY\_RECORD\_FREQUENCY to a default of 1 in 1000 writes. 
@@ -111,4 +113,4 @@ This proposal does not include any features enabled via external I/O, so it does
 ## Implementation
-Will be implemented in C-Core to start with.
+Will be implemented in C-Core to start with. Other languages may implement only a subset of the metrics based on Kernel API availability.
From dc2e10676dc965e258abaf842dbcdc56ba26bec6 Mon Sep 17 00:00:00 2001
From: Aananth V
Date: Fri, 5 Dec 2025 12:53:57 +0530
Subject: [PATCH 4/7] Update A80-tcp-telemetry.md
---
 A80-tcp-telemetry.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/A80-tcp-telemetry.md b/A80-tcp-telemetry.md
index b89cf106c..28e9cbceb 100644
--- a/A80-tcp-telemetry.md
+++ b/A80-tcp-telemetry.md
@@ -18,11 +18,13 @@ The Linux Kernel exposes two telemetry hooks that can be used to collect TCP-lev
 1. **TCP socket state** can be retrieved using the `getsockopt()` system call with `level` set to `IPPROTO_TCP` and `optname` set to `TCP_INFO`. The state is returned in a `struct tcp_info` which gives details about the TCP connection. At present, the machinery to collect such information is available on Linux 2.6 or later kernels.
 2. **Per-message transmission timestamps** can be collected from TCP sockets using the [SO\_TIMESTAMPING](https://docs.kernel.org/networking/timestamping.html) interface. At present, this is available on Linux 2.6 or later kernels. These timestamps can be very valuable for diagnosing network level issues and can be used to break down the time spent in the "network".
-\[[gRFC A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP Latency metrics.
+\[[gRFC A79][A79]\] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP Latency metrics.
 ### *Related Proposals:*
-* \[[gRFC A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md)\]: gRPC Non-Per-Call Metrics Framework
+* \[[gRFC A79][A79]\]: gRPC Non-Per-Call Metrics Framework
+
+[A79]: A79-non-per-call-metrics-architecture.md
 ## Proposal
@@ -52,7 +54,7 @@ This document proposes exporting the following TCP metrics from gRPC to improve
 * For each new connected TCP socket, set an initial alarm of 10% to 110% (randomly selected) of TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL.
 * When the alarm fires \-
 * Use `getsockopt(TCP_INFO)` or equivalent method to retrieve and record connection metrics.
- * Re-arm the alarm with TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL and repeat.
+ * Re-arm the alarm with 10% to 110% (randomly selected) of TCP\_CONNECTION\_METRICS\_RECORD\_INTERVAL and repeat.
 * Before the socket is closed, cancel the alarm set above, and retrieve and record connection metrics, providing observability for short-lived connections as well.
This will also allow collection of `grpc.tcp.recurring_retransmits`
 ### Per-Connection Op Metrics
From 7c43505d7afef56db482d6c80d5534208505eff1 Mon Sep 17 00:00:00 2001
From: Aananth V
Date: Tue, 16 Dec 2025 11:14:11 +0530
Subject: [PATCH 5/7] Update A80-tcp-telemetry.md
---
 A80-tcp-telemetry.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/A80-tcp-telemetry.md b/A80-tcp-telemetry.md
index 28e9cbceb..7201a69a6 100644
--- a/A80-tcp-telemetry.md
+++ b/A80-tcp-telemetry.md
@@ -4,7 +4,7 @@
 * Approver(s): Craig Tiller (@ctiller), Mark Roth (@markdroth)
 * Status: In Review
 * Implemented in: C-Core
-* Last updated: 2025-10-21
+* Last updated: 2025-12-16
 * Discussion at: https://groups.google.com/g/grpc-io/c/MoLrWPsFB3s
 ## Abstract
@@ -61,6 +61,7 @@ This document proposes exporting the following TCP metrics from gRPC to improve
 | Name | Type | Unit | Description |
 | ----- | :---- | :---- | :---- |
+| grpc.tcp.connections\_created | Counter (integer) | {connection} | Number of TCP connections created. |
 | grpc.tcp.connection\_count | UpDownCounter (integer) | {connection} | Number of active TCP connections. |
 | grpc.tcp.syscall\_writes | Counter (integer) | {syscall} | The number of times we invoked the sendmsg (or sendmmsg) syscall and wrote data to the TCP socket. Measured at the endpoint level. |
 | grpc.tcp.write\_size | Histogram (floating-point) | By | The number of bytes offered to each syscall\_write. Measured at the endpoint level. |
@@ -69,6 +70,7 @@ This document proposes exporting the following TCP metrics from gRPC to improve
 #### Suggested Metric Collection Algorithm
+* `grpc.tcp.connections_created` will be incremented when a connection is created.
 * `grpc.tcp.connection_count` will be incremented when a connection is created and decremented when it is destroyed.
 * The `grpc.tcp.syscall_writes` and `grpc.tcp.write_size` metrics will be updated whenever we write to the socket.
 * The `grpc.tcp.syscall_reads` and `grpc.tcp.read_size` metrics will be updated whenever we read from the socket.
@@ -108,7 +110,7 @@ Writes smaller than 1024 Bytes are labelled with `size=1024` to reduce cardinali
 * Set `writes_since_last_latency_measurement_` to 1 and repeat.
 ### Metric Stability
-All metrics added in this proposal will start as experimental. The long term goal will be to de-experimentalize them and have them be on by default, but the exact criteria for that change are TBD.
+All metrics added in this proposal will start as experimental. The long term goal will be to de-experimentalize them and potentially have some metrics be on by default, but the exact criteria for that change are TBD. We may also add new labels (eg: `grpc.lb.locality`, `grpc.lb.backend_service`) in the future. This gRFC will be amended when this happens.
 ### Temporary environment variable protection
 This proposal does not include any features enabled via external I/O, so it does not need environment variable protection.
From 6fb558edc05f2d66396b9b88dbbb6ee2a47f7e53 Mon Sep 17 00:00:00 2001
From: Aananth V
Date: Tue, 16 Dec 2025 11:19:48 +0530
Subject: [PATCH 6/7] Update A80-tcp-telemetry.md
---
 A80-tcp-telemetry.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/A80-tcp-telemetry.md b/A80-tcp-telemetry.md
index 7201a69a6..31a93623f 100644
--- a/A80-tcp-telemetry.md
+++ b/A80-tcp-telemetry.md
@@ -18,11 +18,11 @@ The Linux Kernel exposes two telemetry hooks that can be used to collect TCP-lev
 1. **TCP socket state** can be retrieved using the `getsockopt()` system call with `level` set to `IPPROTO_TCP` and `optname` set to `TCP_INFO`. The state is returned in a `struct tcp_info` which gives details about the TCP connection. At present, the machinery to collect such information is available on Linux 2.6 or later kernels.
 2.
**Per-message transmission timestamps** can be collected from TCP sockets using the [SO\_TIMESTAMPING](https://docs.kernel.org/networking/timestamping.html) interface. At present, this is available on Linux 2.6 or later kernels. These timestamps can be very valuable for diagnosing network level issues and can be used to break down the time spent in the "network".
-\[[gRFC A79][A79]\] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP Latency metrics.
+[gRFC A79][A79] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP Latency metrics.
 ### *Related Proposals:*
-* \[[gRFC A79][A79]\]: gRPC Non-Per-Call Metrics Framework
+* [gRFC A79][A79]: gRPC Non-Per-Call Metrics Framework
 [A79]: A79-non-per-call-metrics-architecture.md
@@ -97,13 +97,13 @@ Writes smaller than 1024 Bytes are labelled with `size=1024` to reduce cardinali
 #### Suggested Metric Collection Algorithm
-* Set TCP\_LATENCY\_RECORD\_FREQUENCY to a default of 1 in 1000 writes.
+* Set TCP\_LATENCY\_RECORD\_PERIOD to a default of 1000 to denote a frequency of 1 per 1000 writes.
 * Implementations can choose to make it configurable.
 * For each new connected TCP socket,
- * Set `writes_since_last_latency_measurement_` to a random integer in \[0, TCP\_LATENCY\_RECORD\_FREQUENCY). Increment this value for every write.
+ * Set `writes_since_last_latency_measurement_` to a random integer in \[0, TCP\_LATENCY\_RECORD\_PERIOD). Increment this value for every write.
 * Perform any prerequisites needed for the socket to support timestamping.
 * For Linux TCP, this involves making a setsockopt(SO\_TIMESTAMPING) call with the flag value set to SOF\_TIMESTAMPING\_SOFTWARE | SOF\_TIMESTAMPING\_OPT\_ID | SOF\_TIMESTAMPING\_OPT\_TSONLY | SOF\_TIMESTAMPING\_OPT\_ID\_TCP | SOF\_TIMESTAMPING\_OPT\_STATS.
-* If `writes_since_last_latency_measurement_` % TCP\_LATENCY\_RECORD\_FREQUENCY \= 0
+* If `writes_since_last_latency_measurement_` % TCP\_LATENCY\_RECORD\_PERIOD \= 0
 * Enable latency measurement for the write.
 * For Linux TCP, this involves splitting the write into two chunks based on the buckets listed above and adding a SO\_TIMESTAMPING cmsg header to the sendmsg call with the flags SOF\_TIMESTAMPING\_TX\_SCHED | SOF\_TIMESTAMPING\_TX\_SOFTWARE on the sampled chunk and SOF\_TIMESTAMPING\_TX\_ACK on the other chunk.
 * There is only one chunk if the write is less than 1024 bytes, in that case all three flags are set on this chunk.
From a93800c33ab1669657bbba6b8109930b666a54b6 Mon Sep 17 00:00:00 2001
From: Aananth V
Date: Wed, 17 Dec 2025 09:14:35 +0530
Subject: [PATCH 7/7] Update A80-tcp-telemetry.md
---
 A80-tcp-telemetry.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/A80-tcp-telemetry.md b/A80-tcp-telemetry.md
index 31a93623f..b2ef999b4 100644
--- a/A80-tcp-telemetry.md
+++ b/A80-tcp-telemetry.md
@@ -80,11 +80,11 @@ This document proposes exporting the following TCP metrics from gRPC to improve
 | Name | Type | Unit | Labels | Description |
 | ----- | :---- | :---- | :---- | :---- |
 | grpc.tcp.sender\_latency | Histogram (floating-point) | s | None | Time taken by the TCP socket to write the first byte of a write onto the NIC. This includes the latency incurred by traffic shaping, qdisc, throttling, and pacing at the sender. Corresponds to the time taken between the final `SCHED` timestamp and the `SENT` timestamp. Sampled periodically. |
-| grpc.tcp.transfer\_latency | Histogram (floating-point) | s | size1 | Time taken to transmit the first size bytes of a write. Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically.
|
+| grpc.tcp.transfer\_latency | Histogram (floating-point) | s | grpc.transfer\_size1 | Time taken to transmit the first size bytes of a write. Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically. |
 1 Since transfer latency is strongly affected by the write size, it is broken down into different buckets based on the size of the write. Further, we will only measure the latencies of certain benchmark sample sizes to get measurements that are unaffected by the write sizes. The proposed buckets are:
-| Buffered Size | Benchmark Size | `size` (Bytes) |
+| Buffered Size | Benchmark Size | `grpc.transfer_size` (Bytes) |
 | :---- | :---- | :---- |
 | \[0, 1KiB) | Whole buffer | 1024 |
 | \[1KiB, 8KiB) | First 1KiB | 1024 |