ena: linearize skbs when TCP header overflows into fragment #360
base: master
Conversation
Hi @antonipp
Force-pushed from dffc38c to ac5c931
Hi @antonipp, thank you for providing all these details; we're interested in understanding the use case in more depth.
Hi, yes, here are two tcpdump PCAPs captured from the Python server reproducer I mentioned in the PR description. The tcpdumps were recorded on the server host with … In the working scenario, the client is doing …
Hmm, actually it looks like I was wrong in my initial root cause analysis, because there's a mistake in my offset calculation 🤔 The failing packets look like this: …

According to https://github.com/torvalds/linux/blob/3609fa95fb0f2c1b099e69e56634edb8fc03f87c/include/linux/skbuff.h#L2531-L2534 … and the checksum is at byte …

I recompiled the code with the correct offset calculation:

```c
if (skb->ip_summed == CHECKSUM_PARTIAL) {
	int csum_start = skb_checksum_start_offset(skb);

	if (csum_start + skb->csum_offset + sizeof(__sum16) > skb_headlen(skb))
		goto linearize;
}
```

but the packets are not going through anymore now. I think my initial fix worked because it was overly greedy when linearizing packets, so it worked almost like setting …

So now I guess we need to dig a bit more to find what's wrong, because it's definitely related to the shape of these packets, but I can't tell exactly what.
Force-pushed from ac5c931 to 8dfafaf
Ok, I tested another theory which actually seems to work now: another peculiarity of these packets is that the TCP header overflows beyond the linear head. In the example packet above, the TCP header starts at byte 98 (…). I wrote another test which checks this condition using …
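To illustrate the condition being tested, here is a sketch of what such a check could look like; the helper the comment refers to is elided above, so the exact code may differ, and the function name below is an assumption:

```c
#include <linux/skbuff.h>
#include <linux/tcp.h>

/* Sketch only, not necessarily the code in this PR. Assumes the caller only
 * passes TCP skbs whose transport header offset has already been set by the
 * stack (true for CHECKSUM_PARTIAL TCP packets reaching ndo_start_xmit).
 */
static int ena_linearize_if_tcp_hdr_fragmented(struct sk_buff *skb)
{
	/* End of the TCP header, measured from skb->data */
	unsigned int tcp_hdr_end = skb_transport_offset(skb) + tcp_hdrlen(skb);

	/* If the header spills past the linear head, pull the whole packet
	 * into the linear area so the device sees it in one buffer.
	 */
	if (tcp_hdr_end > skb_headlen(skb))
		return skb_linearize(skb);

	return 0;
}
```

If `skb_linearize()` fails (it can under memory pressure), the transmit path would typically drop the packet.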
And another data point: I tried reproducing the issue with SG on and TX checksumming off: the issue was still there. It only goes away when I disable SG. So maybe my check …
Description
We found an interesting bug with our QEMU Kata Containers running on AWS metal instances: all packets emitted from Kata MicroVMs which were around the size of `PAGESIZE` or larger got dropped on TX. We created a simple reproducer which works consistently (a Python server queried with different `size` parameters). This gave these results consistently: …
So transmission started breaking around ~4000 bytes.
We started verifying MTU configurations, capturing tcpdumps, etc., but nothing really helped until we turned off Scatter-Gather on the interface the pod traffic was going out from: `ethtool -K ens1 sg off`, and this seemed to solve the issue!

So this gave us a hint that something was off with the way packets were fragmented when they were coming from Kata containers. Another data point was that we were unable to reproduce the bug in GCP, nor on Azure, nor when using `runc` containers, nor when running the server directly on the host. So this pointed at some sort of broken interaction between Kata / QEMU / virtio-net (used by Kata) / the AWS ENA driver on the host.

Digging even further, we captured two pwru traces: one from a working ~2000 byte response and one from a broken ~5000 byte response. We captured the traces with `--output-meta --output-skb` to get the internal skb states:

output-working.txt
output-failing.txt
Feeding these into an LLM indeed revealed a key difference in the way these packets were structured:
- In the working trace, `data_len` was 0 for all packets (see http://vger.kernel.org/~davem/skb_data.html), i.e. all of the data sat in the linear head.
- In the failing trace, the packet had `len=5066`, `data_len=5014` (so the head length is only 52 bytes and the rest is fragmented away). Another key observation which would be useful later is that `csum_start=98` and `csum_offset=16`, so ~~the checksum field was at byte 114, so it was inside a fragment~~ the checksum is actually inside the head, see #360 (comment). The issue seems to be with how the TCP header itself is fragmented, see below.
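A note on the struck-through reading, which is easy to arrive at: `csum_start` is an offset from `skb->head`, while the head length (`len - data_len`) is counted from `skb->data`, so the two cannot be compared directly. A small sketch with the values from the failing trace; the headroom value is hypothetical, since it is not visible in the pwru output:

```c
#include <stdio.h>

int main(void)
{
	/* Values from the failing packet in the pwru trace */
	unsigned int len = 5066, data_len = 5014;
	unsigned int csum_start = 98, csum_offset = 16;
	/* Hypothetical: the headroom (skb->data - skb->head) is not shown
	 * in the trace output.
	 */
	unsigned int headroom = 64;

	/* Bytes in the linear head, counted from skb->data */
	unsigned int headlen = len - data_len; /* 52 */

	/* csum_start is counted from skb->head, so the checksum field sits
	 * at csum_start + csum_offset = 114 bytes from skb->head, but only
	 * 114 - headroom bytes from skb->data; whether that lands inside
	 * the 52-byte linear head therefore depends on the headroom.
	 */
	unsigned int csum_from_head = csum_start + csum_offset; /* 114 */
	unsigned int csum_from_data = csum_from_head - headroom;

	printf("headlen=%u, checksum at %u from head, %u from data\n",
	       headlen, csum_from_head, csum_from_data);
	return 0;
}
```

This is also why the corrected check in the comment above uses `skb_checksum_start_offset()`, which already subtracts the headroom.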
We then looked into how these packet shapes were actually produced by looking at the `virtio-net` code. This is where we found the relevant logic (all code snippets below are from kernel 6.8, but the code hasn't changed much in the latest kernel): the buffers involved are sized around `PAGE_SIZE` (on our machines `getconf PAGESIZE` is `4096` 💡, so this is close to where we've been seeing failures starting to happen), and the skb itself is built in `virtnet_build_skb`.

So this explains how these packets are actually formed. It's now clear that these highly fragmented packets are somehow failing to get emitted at the network driver / hardware layer due to their layout. If we disable the SG feature, the packet gets linearized before it's handed to the driver, so the bug is avoided this way.
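To make the resulting layout concrete, here is a simplified, hypothetical sketch of the general "small linear head plus page fragment" construction; it is not the actual virtio-net code (the real logic in drivers/net/virtio_net.c is considerably more involved), and the function name and `copy_len` parameter are illustrative:

```c
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Build an skb where only the first copy_len bytes of a received buffer end
 * up in the linear head and the remainder stays in the page as a fragment.
 */
static struct sk_buff *build_frag_heavy_skb(struct napi_struct *napi,
					    struct page *page,
					    unsigned int offset,
					    unsigned int len,
					    unsigned int copy_len)
{
	struct sk_buff *skb;

	/* Small linear area: only copy_len bytes go into skb->data */
	skb = napi_alloc_skb(napi, copy_len);
	if (!skb)
		return NULL;

	skb_put_data(skb, page_address(page) + offset, copy_len);

	/* Everything else stays in the page and becomes a fragment, so
	 * skb_headlen() == copy_len and skb->data_len == len - copy_len.
	 */
	skb_add_rx_frag(skb, 0, page, offset + copy_len,
			len - copy_len, PAGE_SIZE);

	return skb;
}
```

An skb built this way reports a small `skb_headlen()` and a large `data_len`, which is exactly the shape seen in the failing pwru trace above.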
We also noticed that we have `tx-checksumming: on` on the network interface, and the fact that the checksum field was in a fragment gave us a hint that something might be wrong with hardware checksum offloading.

We are now finally coming to the code in this PR: it adds logic to the driver to check whether the ~~checksum field~~ whole TCP header is actually inside the linear head of the skb. If it's not, the packet is linearized before being handed off to the hardware. I recompiled the driver on my host, loaded it, and it actually fixed the issue! So I guess there is indeed a requirement at the hardware level that the ~~checksum~~ whole header should be in the first DMA buffer, but I can't really confirm this from the outside. However, empirically this seems to be the case and the code fixes the issue!
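For context on where such a linearization would sit, here is a hypothetical sketch of an `ndo_start_xmit`-style caller; `tcp_header_outside_linear_head()` stands in for whatever check the PR actually adds and is not a real kernel or driver symbol:

```c
static netdev_tx_t ena_xmit_sketch(struct sk_buff *skb, struct net_device *dev)
{
	/* If the TCP header is not fully inside the first buffer the device
	 * will DMA, pull the whole packet into the linear area. Drop it if
	 * linearization fails (e.g. under memory pressure).
	 */
	if (tcp_header_outside_linear_head(skb) && skb_linearize(skb)) {
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

	/* ... map the head and fragments and hand the packet to the HW ... */
	return NETDEV_TX_OK;
}
```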
Software / Hardware info

Tested on `c5.metal` instances in `us-east-1`. LLQ is disabled: …