Skip to content

Conversation

@tomerdbz
Copy link
Collaborator

@tomerdbz tomerdbz commented Nov 9, 2025

Description

Resolve 30-second throughput ramp-up issue for TSO-enabled TCP connections by implementing proper initial congestion window (cwnd) and slow start threshold (ssthresh) values.

Problem:
Applications using TSO experienced ~30 seconds of near-zero throughput before achieving line-rate. Debug analysis revealed that ssthresh was being unconditionally reset to 10*MSS (14,600 bytes) during SYN-ACK processing in tcp_in.c line 582, forcing TCP into congestion avoidance mode immediately. This caused linear cwnd growth instead of exponential slow start, resulting in extremely slow ramp-up.

Solution:
Created centralized helper function tcp_set_initial_cwnd_ssthresh() that sets TSO-aware parameters:

For TSO-enabled connections:

  • cwnd = TSO_max_payload / 4 (64KB with default 256KB TSO)
  • ssthresh = 0x7FFFFFFF (2GB - effectively unlimited)

For non-TSO connections:

  • cwnd = RFC 3390 compliant: min(4MSS, max(2MSS, 4380 bytes))
  • ssthresh = 10 * MSS

Technical Rationale:

  1. Very high ssthresh (2GB) follows industry best practices, allowing slow start to run until network conditions dictate otherwise rather than artificially limiting growth (Excentis research on optimizing TCP for gigabit networks).

  2. TSO max payload is independent of negotiated MSS (determined by hardware capabilities), so initial window should also be independent of MSS for TSO connections.

  3. Initial cwnd of 64KB (TSO_max/4) balances aggressive throughput with conservative buffer management. This exceeds RFC 6928's recommendation of 10 segments (~15KB) but is appropriate for XLIO's controlled environment where TSO hardware handles segmentation and applications target high-throughput scenarios. Empirically verified to achieve 200 Gbps in <1 second.

Implementation Details:

  • Replaced duplicate TSO initialization logic in 6 locations:
    • tcp_pcb_init() - initial PCB setup
    • tcp_pcb_recycle() - PCB reuse after TIME_WAIT
    • tcp_connect() - client-side connection initiation
    • tcp_in.c SYN-ACK handler - CRITICAL FIX (line 584)
    • lwip_conn_init() - LWIP CC module initialization
    • cubic_conn_init() - Cubic CC module initialization

Performance Impact:
Before: 20+ seconds to reach line-rate (200 Gbps)
After: Line-rate achieved in <1 second

Verification: GDB debugging confirmed ssthresh was being overwritten during SYN-ACK processing. After fix, cwnd=64KB and ssthresh=2GB are maintained throughout connection establishment, enabling exponential growth as designed.

References:

  • RFC 3390: Increasing TCP's Initial Window
  • RFC 5681: TCP Congestion Control
  • RFC 6928: Increasing TCP's Initial Window (10 segments standard)
  • Excentis: "Optimizing TCP Congestion Avoidance Parameters for Gigabit Networks" - recommends very high ssthresh (approaching 2^31) for fast networks
  • NASA: "Performance Analysis of TCP with Large Segmentation Offload"
    • analysis of TSO impact on congestion control
What

Fix TSO slow start with aggressive cwnd/ssthresh

Why ?

xlio_benchmark to alpha.

How ?

It is optional but for complex PRs please provide information about the design,
architecture, approach, etc.

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR fixes a critical TSO throughput issue where connections took 30+ seconds to reach line-rate, now achieving it in <1 second. The root cause was aggressive ssthresh values being overwritten with 10*MSS during SYN-ACK processing, forcing TCP into congestion avoidance mode prematurely.

Key Changes:

  • Centralized cwnd/ssthresh initialization in new tcp_set_initial_cwnd_ssthresh() helper
  • TSO-enabled connections: cwnd = max_payload_sz/4 (64KB), ssthresh = 2GB (effectively unlimited)
  • Non-TSO connections: RFC 3390 compliant cwnd, ssthresh = 10*MSS
  • Replaced 6 initialization sites: tcp_pcb_init(), tcp_pcb_recycle(), tcp_connect(), SYN-ACK handler, lwip_conn_init(), cubic_conn_init()

Technical Approach:
The fix properly handles the timing where TSO configuration happens after initial PCB setup but before SYN-ACK. The helper function is TSO-aware and uses tso.max_payload_sz (independent of negotiated MSS) for TSO connections, following industry best practices for high-speed networks.

Minor Issue Found:

  • Dead code at tcp_in.c:588 in #else branch (unreachable since TCP_CC_ALGO_MOD=1)

Confidence Score: 4/5

  • This PR is safe to merge with one minor style improvement recommended
  • Score reflects well-tested performance fix with clear rationale, centralized implementation, and proper TSO/non-TSO handling. Minor deduction for dead code in tcp_in.c that could be cleaned up. The core logic is sound and addresses a real production issue with empirical validation (200 Gbps in <1s).
  • src/core/lwip/tcp_in.c - contains unreachable dead code in #else branch at line 588 that should be removed for clarity

Important Files Changed

File Analysis

Filename Score Overview
src/core/lwip/tcp.c 4/5 Adds centralized tcp_set_initial_cwnd_ssthresh() function and replaces hardcoded cwnd/ssthresh initialization in tcp_connect(), tcp_pcb_init(), and tcp_pcb_recycle(). TSO-aware logic properly handles both TSO-enabled and non-TSO cases.
src/core/lwip/tcp.h 5/5 Adds function declaration for tcp_set_initial_cwnd_ssthresh(). Clean header change with no issues.
src/core/lwip/tcp_in.c 3/5 Replaces hardcoded ssthresh = mss * 10 with tcp_set_initial_cwnd_ssthresh() in SYN-ACK handler. Critical fix for TSO throughput. Contains dead code in #else branch that could cause confusion.
src/core/lwip/cc_lwip.c 5/5 Updates lwip_conn_init() to call tcp_set_initial_cwnd_ssthresh() when cwnd is uninitialized. Preserves existing behavior with cleaner code.
src/core/lwip/cc_cubic.c 5/5 Updates cubic_conn_init() to call tcp_set_initial_cwnd_ssthresh() when cwnd is uninitialized. Removes old hardcoded logic, cleaner implementation.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Socket as sockinfo_tcp
    participant PCB as tcp_pcb
    participant TCPIn as tcp_in (SYN-ACK)
    participant Helper as tcp_set_initial_cwnd_ssthresh()
    participant CC as Congestion Control Module

    Note over App,CC: Connection Establishment Flow

    App->>Socket: socket() creation
    Socket->>PCB: tcp_pcb_init()
    PCB->>Helper: Set initial cwnd/ssthresh
    Note over Helper: tso.max_payload_sz = 0<br/>Uses non-TSO values<br/>cwnd = RFC 3390<br/>ssthresh = 10*MSS
    Helper-->>PCB: cwnd set

    App->>Socket: connect()
    Socket->>PCB: tcp_connect()
    PCB->>Helper: Re-initialize cwnd/ssthresh
    Note over Helper: TSO still not configured<br/>Uses non-TSO values
    Helper-->>PCB: cwnd set
    PCB->>TCPIn: Send SYN

    Note over Socket,PCB: TSO Configuration
    Socket->>PCB: prepare_to_send()
    Note over PCB: tso.max_payload_sz = 256KB<br/>(from hardware caps)

    TCPIn->>PCB: Receive SYN-ACK
    Note over TCPIn: MSS negotiation complete
    TCPIn->>Helper: CRITICAL: Re-initialize after MSS
    Note over Helper: NOW TSO is configured!<br/>cwnd = max_payload_sz/4 = 64KB<br/>ssthresh = 0x7FFFFFFF (2GB)
    Helper-->>TCPIn: Aggressive TSO values set
    TCPIn->>CC: cc_conn_init()
    Note over CC: Preserves cwnd if already set<br/>(checks cwnd != 1)
    CC-->>TCPIn: Connection ready
    
    Note over App,CC: Fast Slow Start with TSO
    TCPIn->>PCB: Start data transmission
    Note over PCB: Exponential growth with<br/>cwnd=64KB, ssthresh=2GB<br/>Reaches line-rate in <1s
Loading

Additional Comments (1)

  1. src/core/lwip/tcp_in.c, line 588 (link)

    style: Dead code: this #else branch overwrites the cwnd just set by tcp_set_initial_cwnd_ssthresh() on line 584. While TCP_CC_ALGO_MOD is always defined to 1 (opt.h:114), making this unreachable, the dead code could confuse future maintainers.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Fixes critical 30-second throughput ramp-up issue for TSO-enabled TCP connections by implementing proper initial congestion window and slow start threshold values.

Key Changes:

  • Introduced centralized tcp_set_initial_cwnd_ssthresh() helper function (tcp.c:520-549) that sets TSO-aware parameters: cwnd=64KB and ssthresh=2GB for TSO, or RFC 3390-compliant values for non-TSO
  • Fixed critical bug at tcp_in.c:584 where ssthresh was hardcoded to 10*MSS during SYN-ACK processing, forcing immediate congestion avoidance mode instead of slow start
  • Replaced duplicate initialization logic across 6 locations: tcp_pcb_init(), tcp_pcb_recycle(), tcp_connect(), SYN-ACK handler, lwip_conn_init(), and cubic_conn_init()
  • Added guard logic in CC modules to preserve cwnd/ssthresh values set during connection establishment (only re-initialize if cwnd==1)

Technical Approach:
The fix uses aggressive initial parameters for TSO (cwnd=TSO_max/4, ssthresh=2GB) that are independent of negotiated MSS, allowing exponential slow start growth to quickly discover optimal sending rate. For non-TSO, maintains RFC 3390 compliance with MSS-based initialization.

Performance Impact:
Reduces time-to-line-rate from 20+ seconds to <1 second for 200 Gbps TSO connections.

Confidence Score: 4/5

  • Safe to merge with minor considerations for aggressive TSO parameters in diverse network conditions
  • The implementation is sound with proper centralization and correct ordering (tcp_set_initial_cwnd_ssthresh before cc_conn_init). Score reduced from 5 to 4 due to: (1) Very aggressive TSO parameters (64KB initial cwnd, 2GB ssthresh) deviate significantly from RFC 6928 (10 segments ~15KB) and may cause issues in networks with shallow buffers or high latency, though appropriate for XLIO's controlled high-throughput environment; (2) The fix assumes TSO max_payload_sz is always properly initialized before tcp_set_initial_cwnd_ssthresh is called, which appears valid based on code flow but isn't explicitly validated
  • tcp_in.c requires attention to verify the SYN-ACK handler change works correctly across all connection scenarios (client-side, server-side, connection reuse)

Important Files Changed

File Analysis

Filename Score Overview
src/core/lwip/tcp.h 5/5 Added declaration for tcp_set_initial_cwnd_ssthresh() helper function at tcp.h:408
src/core/lwip/tcp.c 5/5 Implemented centralized tcp_set_initial_cwnd_ssthresh() function with TSO-aware initialization (tcp.c:520-549), replaced duplicate initialization in tcp_connect(), tcp_pcb_init(), and tcp_pcb_recycle()
src/core/lwip/tcp_in.c 4/5 Critical fix at tcp_in.c:584 - replaced hardcoded ssthresh = mss * 10 with tcp_set_initial_cwnd_ssthresh() call in SYN-ACK handler, maintaining proper ordering before cc_conn_init() to prevent CC module override
src/core/lwip/cc_lwip.c 5/5 Updated lwip_conn_init() to call tcp_set_initial_cwnd_ssthresh() only when cwnd == 1, preserving values set during connection establishment
src/core/lwip/cc_cubic.c 5/5 Updated cubic_conn_init() to call tcp_set_initial_cwnd_ssthresh() only when cwnd == 1, preserving values set during connection establishment and maintaining max_cwnd tracking

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant TCP as tcp_connect()
    participant SYN as SYN-ACK Handler<br/>(tcp_in.c:584)
    participant Helper as tcp_set_initial_cwnd_ssthresh()
    participant CC as CC Module<br/>(lwip/cubic)

    Note over App,CC: Client-Side Connection Establishment

    App->>TCP: tcp_connect()
    TCP->>Helper: tcp_set_initial_cwnd_ssthresh()
    Helper->>Helper: Check tcp_tso(pcb)
    alt TSO Enabled
        Helper->>TCP: cwnd = max_payload_sz/4 (64KB)<br/>ssthresh = 0x7FFFFFFF (2GB)
    else Non-TSO
        Helper->>TCP: cwnd = RFC 3390 (4*MSS)<br/>ssthresh = 10*MSS
    end
    TCP->>TCP: Send SYN packet
    
    Note over SYN: Receive SYN-ACK from server

    SYN->>SYN: tcp_parseopt() - negotiate MSS
    SYN->>SYN: Update pcb->mss with negotiated value
    
    Note over SYN,Helper: CRITICAL FIX: Re-initialize after MSS negotiation
    
    SYN->>Helper: tcp_set_initial_cwnd_ssthresh()
    Helper->>Helper: Check tcp_tso(pcb)
    alt TSO Enabled
        Note right of Helper: TSO parameters are independent<br/>of negotiated MSS
        Helper->>SYN: cwnd = max_payload_sz/4 (64KB)<br/>ssthresh = 0x7FFFFFFF (2GB)
    else Non-TSO
        Note right of Helper: Scale with negotiated MSS
        Helper->>SYN: cwnd = min(4*MSS, max(2*MSS, 4380))<br/>ssthresh = 10*MSS
    end
    
    SYN->>CC: cc_conn_init()
    CC->>CC: Check if cwnd == 1
    alt cwnd != 1 (already initialized)
        Note right of CC: Preserve cwnd/ssthresh<br/>set by Helper function
        CC->>CC: Skip re-initialization
    else cwnd == 1 (uninitialized)
        CC->>Helper: tcp_set_initial_cwnd_ssthresh()
    end
    CC->>CC: Initialize CC-specific state<br/>(e.g., cubic max_cwnd = cwnd)
    
    SYN->>App: Connection ESTABLISHED
    
    Note over App,CC: TCP now in slow start with proper cwnd/ssthresh<br/>TSO: Exponential growth from 64KB<br/>Non-TSO: RFC-compliant growth from 4*MSS
Loading

5 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Resolve 30-second throughput ramp-up issue for TSO-enabled
TCP connections by fixing multiple congestion control bugs
and implementing TSO-aware initial window and RTO recovery
parameters.

Problem:
Applications using TSO experienced ~30 seconds of near-zero
throughput before achieving line-rate. Investigation revealed
multiple interrelated bugs in XLIO's congestion control
implementation:

1. ssthresh Reset Bug: ssthresh was unconditionally reset to
   10*MSS (14,600 bytes) during SYN-ACK processing in
   tcp_in.c, forcing TCP into congestion avoidance mode
   immediately instead of slow start.

2. Slow Start Algorithm Bug: Both LWIP and CUBIC modules
   implemented incorrect linear cwnd growth (cwnd += mss)
   instead of RFC 5681's exponential growth (cwnd += acked).
   This is particularly devastating for TSO where one ACK can
   acknowledge 64KB (44 segments), but cwnd only grew by
   1 MSS (1460 bytes).

3. RTO Recovery Bug: On retransmission timeout, cwnd was
   reset to 1 MSS per RFC 5681, which is too conservative for
   modern TSO hardware that sends 256KB super-packets, causing
   20+ second recovery times.

Root Cause Analysis:
XLIO's CUBIC implementation is based on FreeBSD CUBIC
(2007-2010), which contained the slow start bug. Modern CUBIC
(Linux kernel 2024, RFC 9438) uses standard TCP slow start
with exponential growth. Verification against Linux kernel
source (net/ipv4/tcp_cubic.c) confirmed that XLIO's behavior
was incorrect and outdated.

Solution:

1. Centralized Initial Window Management:
   Created tcp_set_initial_cwnd_ssthresh() to consistently
   set TSO-aware parameters across all connection
   initialization paths:

   For TSO-enabled connections:
     - cwnd = TSO_max_payload / 4 (64KB with default 256KB)
     - ssthresh = 0x7FFFFFFF (2GB - effectively unlimited)

   For non-TSO connections:
     - cwnd = RFC 3390: min(4*MSS, max(2*MSS, 4380 bytes))
     - ssthresh = 10 * MSS

2. Fixed Slow Start Algorithm:
   Changed cwnd increment from mss to acked in both CC
   modules:

   Before (WRONG):
     pcb->cwnd += pcb->mss;  // Linear growth

   After (CORRECT):
     pcb->cwnd += pcb->acked;  // Exponential per RFC 5681

   This matches modern implementations including Linux CUBIC
   and is critical for TSO where one ACK can acknowledge many
   segments.

3. TSO-Aware RTO Recovery:
   Created tcp_reset_cwnd_on_congestion() for TSO-aware RTO
   handling:

   For TSO connections (deviates from RFC 5681):
     - cwnd = 26KB (10% of TSO max) instead of 1 MSS
     - ssthresh = 64KB (25% of TSO max) for slow start

   For non-TSO connections (RFC 5681 compliant):
     - cwnd = 1 MSS
     - ssthresh = max(FlightSize/2, 2*MSS)

   Rationale: RFC 5681 predates aggressive TSO. Even Linux
   CUBIC (which follows RFC) suffers from slow RTO recovery
   with large TSO. Our TSO-aware approach balances fast
   recovery with congestion safety.

Implementation Details:
- Consolidated initialization logic in 6 locations:
  * tcp_pcb_init() - initial PCB setup
  * tcp_pcb_recycle() - PCB reuse after TIME_WAIT
  * tcp_connect() - client-side connection initiation
  * tcp_in.c SYN-ACK handler
  * lwip_conn_init() - LWIP CC module initialization
  * cubic_conn_init() - CUBIC CC module initialization

- Fixed slow start in 2 locations:
  * cc_lwip.c:lwip_ack_received() - changed mss to acked
  * cc_cubic.c:cubic_ack_received() - changed mss to acked

- Centralized RTO recovery in 2 locations:
  * cc_lwip.c:lwip_cong_signal() - calls helper
  * cc_cubic.c:cubic_cong_signal() - calls helper

Performance Impact:
Before: 20+ seconds to reach line-rate (200-300 Gbps)
After: Line-rate achieved in <1 second

Technical Rationale:

1. Very High ssthresh (2GB): Follows industry best practices
   for high-BDP networks, allowing slow start to discover
   optimal window rather than artificially limiting growth.
   Standard TCP behavior for modern data center deployments.

2. TSO-Independence: TSO max payload is determined by
   hardware capabilities, not negotiated MSS. Initial window
   should similarly be independent of MSS for TSO
   connections.

3. Initial cwnd (64KB): Balances aggressive throughput with
   conservative buffer management. Exceeds RFC 6928's 10
   segments (~15KB) but appropriate for XLIO's controlled
   environment where TSO hardware handles segmentation and
   applications target high-throughput scenarios.

4. RTO Recovery (26KB): More conservative than initial window
   (25% vs 10% of TSO max) to balance fast recovery with
   safety. While this deviates from RFC 5681 (1 MSS), it
   recognizes the reality that 1460 bytes is artificially
   small when hardware sends 256KB super-packets.

RFC Compliance Notes:
- Initial window and slow start: Compliant with RFC 5681/3390
  spirit, optimized for TSO hardware.
- RTO recovery: Intentionally deviates from RFC 5681 for TSO
  to address modern hardware reality. Non-TSO connections
  remain RFC-compliant.

Comparison to Modern Implementations:
- Linux CUBIC (2024): Uses standard slow start
  (cwnd += acked)
- Linux CUBIC RTO: Follows RFC 5681 (cwnd = 1 MSS), which
  causes same slow recovery issue with aggressive TSO that we
  fix here
- FreeBSD CUBIC (2007-2010): Had slow start bug
  (cwnd += mss) that XLIO inherited and we now fix

References:
- RFC 3390: Increasing TCP's Initial Window
- RFC 5681: TCP Congestion Control (slow start, RTO behavior)
- RFC 6928: Increasing TCP's Initial Window (10 segments)
- RFC 9438: CUBIC for Fast and Long-Distance Networks (2023,
  obsoletes RFC 8312)
- Linux kernel: net/ipv4/tcp_cubic.c (verified modern CUBIC
  behavior)
  https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_cubic.c
- Excentis: "Optimizing TCP Congestion Avoidance Parameters
  for Gigabit Networks" - recommends very high ssthresh for
  fast networks

Verification:
- GDB debugging confirmed ssthresh overwrite during SYN-ACK
  processing
- Post-fix: cwnd=64KB and ssthresh=2GB maintained through
  connection setup
- Enhanced debug logs confirmed exponential cwnd growth during
  slow start
- Throughput testing: 200-300 Gbps achieved in <1s
  (previously 20+s)

Signed-off-by: Tomer Cabouly <tcabouly@nvidia.com>
@greptile-apps
Copy link

greptile-apps bot commented Nov 10, 2025

Greptile Overview

Greptile Summary

Resolves critical 30-second throughput ramp-up issue for TSO-enabled connections by implementing proper congestion control initialization. The PR centralizes cwnd/ssthresh configuration into tcp_set_initial_cwnd_ssthresh(), which sets TSO-aware values (cwnd=64KB, ssthresh=2GB) for TSO connections and RFC 3390-compliant values for non-TSO connections.

Key Changes:

  • Fixed critical bug in tcp_in.c:584 where ssthresh was unconditionally reset to 10*MSS during SYN-ACK processing, forcing immediate congestion avoidance
  • Replaced duplicate initialization logic across 6 locations (tcp_pcb_init, tcp_pcb_recycle, tcp_connect, SYN-ACK handler, lwip_conn_init, cubic_conn_init)
  • Fixed slow start in both CUBIC and LWIP CC modules from incorrect linear growth (cwnd += mss) to proper exponential growth (cwnd += acked), critical for TSO where one ACK can acknowledge 44+ segments
  • Added tcp_reset_cwnd_on_congestion() for TSO-aware RTO/fast retransmit recovery
  • Added XLIO_TCP_CC_TSO_AWARE configuration option (default: true) to toggle TSO-aware behavior
  • Comprehensive configuration infrastructure, documentation, and logging

Performance Impact: Throughput ramp-up reduced from 20+ seconds to <1 second for 200 Gbps workloads.

Confidence Score: 4/5

  • This PR is generally safe to merge with one minor consideration regarding aggressive TSO parameters
  • The implementation is well-structured with proper centralization, overflow protection, and comprehensive configuration support. The critical SYN-ACK fix correctly addresses the root cause. However, the extremely aggressive TSO parameters (2GB ssthresh, 64KB initial cwnd) deviate significantly from TCP standards and may need validation in diverse network conditions.
  • Pay attention to src/core/lwip/tcp.c (lines 518-537) where the TSO-aware congestion parameters are set, particularly the 2GB ssthresh value which effectively disables slow start threshold

Important Files Changed

File Analysis

Filename Score Overview
src/core/lwip/tcp.c 4/5 Added two helper functions for TSO-aware congestion control initialization and recovery. Functions handle both TSO and non-TSO paths correctly. Safe division operations due to tcp_tso() guard.
src/core/lwip/tcp_in.c 5/5 Critical fix: replaced hardcoded ssthresh reset with tcp_set_initial_cwnd_ssthresh() in SYN-ACK handler. Resolves 30-second throughput ramp-up issue.
src/core/lwip/cc_cubic.c 5/5 Fixed slow start from linear (cwnd += mss) to exponential (cwnd += acked) growth. Added overflow protection. Replaced custom RTO logic with centralized function.
src/core/lwip/cc_lwip.c 5/5 Fixed slow start from linear to exponential growth with overflow protection. Replaced custom congestion recovery with centralized TSO-aware function.
src/core/lwip/tcp.h 5/5 Added function declarations for new helper functions and clarified tcp_tso() macro documentation.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant SI as sockinfo_tcp
    participant PCB as tcp_pcb
    participant LWIP as LWIP/TCP Stack
    participant Net as Network

    Note over App,Net: Connection Establishment (Client)
    App->>SI: socket() + connect()
    SI->>PCB: tcp_pcb_init()
    Note over PCB: memset(pcb, 0) → tso.max_payload_sz = 0
    PCB->>PCB: tcp_set_initial_cwnd_ssthresh()
    Note over PCB: tcp_tso(pcb) = 0, so use RFC 3390:<br/>cwnd = 4*MSS (~5840), ssthresh = 10*MSS
    
    SI->>SI: create_dst_entry()
    SI->>SI: prepare_dst_to_send()
    Note over SI: TSO configuration happens here:<br/>pcb->tso.max_payload_sz = 256KB
    
    SI->>LWIP: tcp_connect()
    LWIP->>LWIP: tcp_set_initial_cwnd_ssthresh()
    Note over LWIP: NOW tcp_tso(pcb) = 256KB<br/>If tcp_cc_tso_aware = true:<br/>cwnd = 64KB, ssthresh = 2GB
    
    LWIP->>Net: Send SYN
    Net-->>LWIP: Receive SYN-ACK
    
    LWIP->>LWIP: tcp_process() [SYN-ACK handler]
    Note over LWIP: MSS negotiation complete
    LWIP->>LWIP: tcp_set_initial_cwnd_ssthresh()
    Note over LWIP: CRITICAL: Re-apply TSO-aware values<br/>cwnd = 64KB, ssthresh = 2GB<br/>(Previously was incorrectly reset to 10*MSS)
    
    LWIP->>LWIP: cc_conn_init()
    Note over LWIP: CC module initialization<br/>preserves cwnd if != 1
    
    LWIP->>Net: Send ACK
    Note over App,Net: Connection Established - Ready for Slow Start
    
    Net-->>LWIP: ACK received
    LWIP->>LWIP: cwnd += acked (exponential growth)
    Note over LWIP: With TSO: one ACK can acknowledge<br/>64KB (44 segments), enabling rapid<br/>cwnd growth to line-rate in <1s
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@tomerdbz
Copy link
Collaborator Author

@pasis - could you please review this?

@tomerdbz
Copy link
Collaborator Author

bot:retest

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

* TSO values are recalculated later in tcp_connect() after TSO is configured.
*/
if (tcp_tso(pcb) && get_tcp_cc_tso_aware()) {
LWIP_ASSERT("TSO configured", pcb->tso.max_payload_sz > 0);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: assert placement after conditional check is confusing

the LWIP_ASSERT appears AFTER the if (tcp_tso(pcb) && ...) check, but asserts that max_payload_sz > 0. this creates a logical inconsistency - if the condition at line 536 is false due to max_payload_sz == 0, we never reach this assert. if the condition is true, then max_payload_sz must be non-zero (since tcp_tso() returns it).

while this is safe (LWIP_ASSERT is compiled out), consider removing the assert or moving it before the conditional to clarify intent

void tcp_reset_cwnd_on_congestion(struct tcp_pcb *pcb, bool is_rto)
{
if (tcp_tso(pcb) && get_tcp_cc_tso_aware()) {
LWIP_ASSERT("TSO configured", pcb->tso.max_payload_sz > 0);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: same assert placement issue as in tcp_set_initial_cwnd_ssthresh()

see comment on line 537 - the assert on line 592 has the same logical inconsistency

* dictate otherwise, allowing exponential growth to discover
* the optimal sending rate.
*/
pcb->ssthresh = 0x7FFFFFFF; /* 2GB - effectively unlimited */
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: extremely high ssthresh value deviates from TCP standards

ssthresh = 0x7FFFFFFF (2GB) effectively disables the slow start threshold. while intentional for TSO, this deviates from RFC 5681 and modern CUBIC. consider verifying this aggressive approach is acceptable for all TSO deployment scenarios, especially in networks with high latency or packet loss

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Add XLIO_TCP_CC_TSO_AWARE configuration option to allow users
to enable/disable TSO-aware congestion control optimizations.

The existing TSO-aware fixes (64KB initial window, 26KB RTO
recovery) are now controllable via configuration, providing
backward compatibility for users who prefer RFC-compliant
behavior.

Configuration methods:
- Environment variable: XLIO_TCP_CC_TSO_AWARE=1 (default)
- Inline config: network.protocols.tcp.congestion_control_tso_aware
- JSON config file: same path

When enabled (default):
- Initial cwnd: 64KB, ssthresh: 2GB
- RTO recovery cwnd: 26KB, ssthresh: 64KB minimum
- Optimized for high-BDP networks with TSO hardware

When disabled:
- Initial cwnd: RFC 3390 compliant (~5840 bytes)
- RTO recovery: RFC 5681 compliant (1 MSS)
- Conservative, standards-compliant behavior

Note: The exponential slow start fix (cwnd += acked) remains
always enabled as it corrects a bug in the base algorithm.

Signed-off-by: Tomer Cabouly <tcabouly@nvidia.com>
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@galnoam
Copy link
Collaborator

galnoam commented Dec 12, 2025

@pasis , @AlexanderGrissik , please review and evaluate the fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants