
antonipp commented Jan 2, 2026

Description

We found an interesting bug with our QEMU Kata Containers running on AWS metal instances: all packets emitted from Kata MicroVMs that were around PAGESIZE or larger got dropped on TX. We created a simple reproducer which triggers the issue consistently:

  • A simple Python server running inside a QEMU Kata MicroVM answering with payloads of various sizes:
    import http.server
    import socketserver
    
    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # Parse size from ?size=N, default 5000
            size = 5000
            if "size=" in self.path:
                size = int(self.path.split("size=")[1].split("&")[0])
            self.send_response(200)
            self.send_header("Content-Length", str(size))
            self.end_headers()
            self.wfile.write(b"X" * size)
        def log_message(self, *args): pass
    
    with socketserver.TCPServer(("", 8080), Handler) as httpd:
        print("Serving on :8080")
        httpd.serve_forever()
  • A client on a different AWS instance calling the server with various size parameters:
    for size in 500 1000 2000 3000 4000 4096 5000 8000; do
      echo "=== Testing size=$size ==="
      curl -s -o /dev/null -w "size=$size status=%{http_code} received=%{size_download}\n" \
        --max-time 5 "http://$SERVER_IP:8080/?size=$size"
      sleep 1
    done

This consistently gave the following results:

=== Testing size=500 ===
size=500 status=200 received=500
=== Testing size=1000 ===
size=1000 status=200 received=1000
=== Testing size=2000 ===
size=2000 status=200 received=2000
=== Testing size=3000 ===
size=3000 status=200 received=3000
=== Testing size=4000 ===
size=4000 status=200 received=0
=== Testing size=4096 ===
size=4096 status=200 received=0
=== Testing size=5000 ===
size=5000 status=200 received=0
=== Testing size=8000 ===
size=8000 status=200 received=0

So transmission started breaking around ~4000 bytes.

We started out verifying MTU configurations, capturing tcpdumps, etc., but nothing really helped until we turned off scatter-gather on the host interface that the pod traffic was leaving from: ethtool -K ens1 sg off. This made the issue disappear:

=== Testing size=500 ===
size=500 status=200 received=500
=== Testing size=1000 ===
size=1000 status=200 received=1000
=== Testing size=2000 ===
size=2000 status=200 received=2000
=== Testing size=3000 ===
size=3000 status=200 received=3000
=== Testing size=4000 ===
size=4000 status=200 received=4000
=== Testing size=4096 ===
size=4096 status=200 received=4096
=== Testing size=5000 ===
size=5000 status=200 received=5000
=== Testing size=8000 ===
size=8000 status=200 received=8000

So this gave us a hint that something was off with the way packets were fragmented when they were coming from Kata containers. Another data point was that we were unable to reproduce the bug on GCP, nor on Azure, nor when using runc containers, nor when running the server directly on the host. So this pointed at some sort of broken interaction between Kata / QEMU / virtio-net (used by Kata) / the AWS ENA driver on the host.

Digging even further, we captured two pwru traces, one from a working ~2000-byte response and one from a broken ~5000-byte response, using --output-meta --output-skb to get the internal skb states.
output-working.txt
output-failing.txt

Feeding these into an LLM indeed revealed a key difference in the way these packets were structured:

  • In the working trace, all packets were fully linear: every skb had data_len = 0 (see http://oldvger.kernel.org/~davem/skb_data.html)
  • In the failing trace, the big packets were highly fragmented. An example large packet had a layout like this: len=5066, data_len=5014 (so the linear head is only 52 bytes and the rest lives in fragments). Another key observation was that csum_start=98 and csum_offset=16, which puts the checksum field at byte 114. (Edit: I originally concluded from this that the checksum was inside a fragment, but it is actually inside the head, see #360 (comment). The issue seems to be with how the TCP header itself is fragmented, see below.)

We then looked at how these packet shapes were actually produced in the virtio-net code, and this is where we found the relevant logic (all code snippets below are from kernel 6.8, but the code hasn't changed much in the latest kernel):

  • First, a buffer is created with its max size capped at around PAGE_SIZE (on our machines getconf PAGESIZE is 4096 💡 - so this is close to the size at which we started seeing failures)
  • Then the logic checks whether the packet can actually fit into that buffer with some headroom
    • If it can, it takes the “small packet” path, which creates a fully linear packet inside virtnet_build_skb
    • If it can’t fit, it goes through the fragmentation path below, which creates exactly the “tiny linear head + huge fragment” packet shape we’d been seeing and that couldn't be transmitted! (A condensed sketch of this decision follows the list.)
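
For illustration, here is a heavily condensed paraphrase of that decision, based on page_to_skb() in drivers/net/virtio_net.c from kernel 6.8. It is a sketch only, not the verbatim kernel code; in particular, the exact number of bytes copied into the head on the fragmented path is glossed over:

/* Condensed paraphrase of the decision in page_to_skb() (kernel 6.8).
 * Sketch for illustration only, not the verbatim kernel code.
 */
shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

if (len > GOOD_COPY_LEN && tailroom >= shinfo_size) {
  /* The frame plus skb_shared_info fits inside the receive buffer:
   * wrap the existing buffer into a fully linear skb.
   */
  skb = virtnet_build_skb(buf, truesize, p - buf, len);
} else {
  /* Otherwise allocate a small linear head, copy only the start of the
   * frame into it, and attach the rest of the page as a fragment. This is
   * what produces the "tiny linear head + huge fragment" layout seen in
   * the failing pwru trace.
   */
  skb = napi_alloc_skb(&rq->napi, GOOD_COPY_LEN);
  copy = min_t(unsigned int, len, skb_tailroom(skb)); /* simplified */
  skb_put_data(skb, p, copy);
  if (len - copy)
    skb_add_rx_frag(skb, 0, page, offset + copy, len - copy, truesize);
}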

So this explains how these packets are formed. It's now clear that these highly fragmented packets are somehow failing to get emitted at the network driver / hardware layer because of their layout. If we disable the SG feature, the packet gets linearized before it is handed to the driver, so the bug is avoided that way (sketched below). We also noticed that tx-checksumming is on on the network interface, and the fact that the checksum field appeared to be in a fragment gave us a hint that something might be wrong with hardware checksum offloading.
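
To spell out that last linearization point: the core TX path linearizes any skb carrying page fragments before handing it to a device that does not advertise NETIF_F_SG, which is why sg off hides the problem. A simplified paraphrase of that check (based on skb_needs_linearize() as used by validate_xmit_skb() in net/core/dev.c; not verbatim):

/* Simplified paraphrase of skb_needs_linearize() (net/core/dev.c): with
 * scatter-gather disabled, an skb with page fragments is copied into a
 * single linear buffer before the driver's ndo_start_xmit() ever sees it.
 */
static bool tx_needs_linearize(struct sk_buff *skb, netdev_features_t features)
{
  return skb_is_nonlinear(skb) &&
         skb_shinfo(skb)->nr_frags && !(features & NETIF_F_SG);
}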

We are now finally coming to the code in this PR: it adds logic to the driver to check whether the whole TCP header (edit: originally, the checksum field) is actually inside the linear head of the skb. If it's not, the packet is linearized before being handed off to the hardware. I recompiled the driver on my host, loaded it, and it actually fixed the issue! So I guess there is indeed a requirement at the hardware level that the whole header (edit: originally, the checksum) should be in the first DMA buffer, but I can't really confirm this from the outside. Empirically, though, this seems to be the case and the code fixes the issue!

Software / Hardware info

Tested on c5.metal instances in us-east-1.

# uname -a
Linux ip-10-113-64-105 6.8.0-1044-aws #46~22.04.1-Ubuntu SMP Tue Dec  2 12:52:18 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
# modinfo ena
filename:       /lib/modules/6.8.0-1044-aws/kernel/drivers/net/ethernet/amazon/ena/ena.ko
license:        GPL
description:    Elastic Network Adapter (ENA)
author:         Amazon.com, Inc. or its affiliates
srcversion:     B321F2DB6BEBC4F61FA6013
alias:          pci:v00001D0Fd0000EC21sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd0000EC20sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00001EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000051sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
name:           ena
vermagic:       6.8.0-1044-aws SMP mod_unload modversions
# ethtool -i ens1
driver: ena
version: 6.8.0-1044-aws
firmware-version:
expansion-rom-version:
bus-info: 0000:7e:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
# ethtool -k ens1
Features for ens1:
rx-checksumming: off [fixed]
tx-checksumming: on
	tx-checksum-ipv4: on
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [fixed]
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off [fixed]
	tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

LLQ is disabled:

# dmesg | grep LLQ
[    4.070118] ena 0000:7d:00.0: LLQ is not supported Fallback to host mode policy.
# kata-runtime --version
kata-runtime  : 3.23.0
   commit   : 650ada7bcc8e47e44b55848765b0eb3ae9240454
   OCI specs: 1.2.1
# qemu-system-x86_64 --version
QEMU emulator version 10.1.1 (kata-static)

davidarinzon (Contributor) commented:

Hi @antonipp
Thank you for identifying a potential issue, providing detailed instructions, and suggesting a change. We will look into it in depth.

antonipp force-pushed the ai/fix-check-sum-partial branch 2 times, most recently from dffc38c to ac5c931 on January 4, 2026 at 10:09
davidarinzon (Contributor) commented:

Hi @antonipp

Thank you for providing all these details, we're interested in understanding the use-case in more depth.
On top of the pwru dump, could you please also provide a tcpdump so we can better understand the structure of the packets and the need for a checksum at such offsets?

antonipp (Author) commented Jan 5, 2026

Hi, yes, here are two tcpdump PCAPs captured from the Python server reproducer I mentioned in the PR description:
tcpdump-captures.zip

The tcpdumps were recorded on the server host with tcpdump -nn -i ens1, where ens1 is the host interface the pod traffic is leaving from. The server is 100.65.4.129 in a QEMU Kata MicroVM container and the client is 10.131.3.248 on a separate EC2 instance (it's not inside the container or anything, just a curl from the host).

In the working scenario, the client is doing http://100.65.4.129:8080/?size=2000 and successfully receives the 2000 bytes back. In the failing scenario, the client is doing http://100.65.4.129:8080/?size=5000 and never receives a response.

antonipp (Author) commented Jan 5, 2026

Hmm, actually it looks like I was wrong in my initial root cause analysis because there's a mistake in my offset calculation 🤔 The failing packets look like this:

.len = (unsigned int)5066,
.data_len = (unsigned int)5014,
.csum_start = (__u16)98,
.csum_offset = (__u16)16,
.mac_header = (__u16)64,
.tail = (sk_buff_data_t)116,

According to https://github.com/torvalds/linux/blob/3609fa95fb0f2c1b099e69e56634edb8fc03f87c/include/linux/skbuff.h#L2531-L2534, skb_headlen is len - data_len = 5066 - 5014 = 52 (the length of the linear data starting from skb->data).

And the csum_start offset is measured from skb->head and not skb->data:
https://github.com/torvalds/linux/blob/3609fa95fb0f2c1b099e69e56634edb8fc03f87c/include/linux/skbuff.h#L795

skb->data goes from 64 (tail - skb_headlen = 116 - 52 = 64) to tail 116.

And the checksum is at byte csum_start + csum_offset = 98 + 16 = 114 < 116. So it's actually inside the linear head...

I recompiled the code with the correct offset calculation:

if (skb->ip_summed == CHECKSUM_PARTIAL) {
  int csum_start = skb_checksum_start_offset(skb);

  if (csum_start + skb->csum_offset + sizeof(__sum16) > skb_headlen(skb))
    goto linearize;
}

but with this corrected check the large packets are not going through anymore (the condition no longer triggers, so nothing gets linearized). I think my initial fix worked because it was overly greedy when linearizing packets, so it acted almost like setting sg off on the interface.

So now I guess we need to dig a bit more to find what's wrong because it's definitely related to the shape of these packets but I can't exactly tell what.

antonipp force-pushed the ai/fix-check-sum-partial branch from ac5c931 to 8dfafaf on January 5, 2026 at 13:13
antonipp (Author) commented Jan 5, 2026

OK, I tested another theory which actually seems to work: another peculiarity of these packets is that the TCP header overflows beyond the linear head. In the example packet above, the TCP header starts at byte 98 (.transport_header = (__u16)98, matching .csum_start = (__u16)98). The minimum TCP header size is 20 bytes (sizeof(struct tcphdr)), so the TCP header spans bytes 98 to 98 + 20 = 118, which is beyond tail = 116. So 2 bytes of the header are cut off into a fragment.

I wrote another check which tests this condition using skb_transport_offset and it seems to work. I updated the PR accordingly. I haven't checked UDP though, so I'm not sure what's going on there.
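
For reference, roughly what such a check looks like (a paraphrased sketch, not the exact diff in this PR; the surrounding error handling and label name are assumptions):

/* Sketch of the check described above (paraphrased, not the exact PR diff):
 * if even the minimum-sized TCP header does not fit entirely inside the
 * skb's linear head, fall back to a linear skb before handing the packet
 * to the device.
 */
if (skb->ip_summed == CHECKSUM_PARTIAL &&
    skb_transport_offset(skb) + sizeof(struct tcphdr) > skb_headlen(skb)) {
  if (skb_linearize(skb))
    goto drop_packet; /* label name is an assumption */
}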

antonipp changed the title from "ena: linearize skbs when checksum field is in fragment" to "ena: linearize skbs when TCP header overflows into fragment" on Jan 5, 2026
antonipp (Author) commented Jan 6, 2026

And another datapoint: I tried reproducing the issue with SG on and TX checksumming off:

# ethtool -k ens1
[...]
tx-checksumming: off
	tx-checksum-ipv4: off
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]

The issue was still there; it only goes away when I disable SG. So maybe my skb->ip_summed == CHECKSUM_PARTIAL check can be removed too, since the problem is probably not directly related to checksum offloading but to the shape of the packet itself.
