@siyengar
Contributor

Summary:
Add a host-side wrapper function launch_all_to_allv() for the AllToAllv
collective, similar to the dispatch host wrapper. This eliminates the need
for a separate kernel wrapper in the benchmark.

Key changes:

  • Add AllToAllv.h with host wrapper declaration
  • Add AllToAllv.cu with kernel and host wrapper implementation
  • Support optional cluster launch with spread scheduling for better
    load balancing across GPCs
  • Update AllToAllvBenchmark to use the new host wrapper instead of
    directly launching the kernel
  • Fix AllToAllvTest.cu to pass Timeout parameter to device function

The host wrapper is named launch_all_to_allv() to distinguish it from
the device function all_to_allv().
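
As a rough illustration of the host-wrapper pattern described above, the sketch below launches a kernel through `cudaLaunchKernelEx` and attaches cluster attributes only when a cluster size is requested. The names `all_to_allv_kernel_sketch` and `launch_sketch`, the parameter list, and the trivial copy kernel are all hypothetical stand-ins; the actual signature of `launch_all_to_allv()` is not shown in this PR description.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: a plain copy stands in for the collective.
__global__ void all_to_allv_kernel_sketch(const int* in, int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Hypothetical host wrapper. cluster_size <= 1 means a regular launch.
// Cluster launches need compute capability 9.0+ and a grid size that is a
// multiple of cluster_size.
cudaError_t launch_sketch(const int* in, int* out, int n, int num_blocks,
                          int threads_per_block, unsigned cluster_size,
                          cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(num_blocks);
  config.blockDim = dim3(threads_per_block);
  config.dynamicSmemBytes = 0;
  config.stream = stream;

  cudaLaunchAttribute attrs[2];
  unsigned num_attrs = 0;
  if (cluster_size > 1) {
    attrs[0].id = cudaLaunchAttributeClusterDimension;
    attrs[0].val.clusterDim.x = cluster_size;
    attrs[0].val.clusterDim.y = 1;
    attrs[0].val.clusterDim.z = 1;
    // Ask the scheduler to spread clusters rather than pack them.
    attrs[1].id = cudaLaunchAttributeClusterSchedulingPolicyPreference;
    attrs[1].val.clusterSchedulingPolicyPreference =
        cudaClusterSchedulingPolicySpread;
    num_attrs = 2;
  }
  config.attrs = attrs;
  config.numAttrs = num_attrs;
  return cudaLaunchKernelEx(&config, all_to_allv_kernel_sketch, in, out, n);
}

int main() {
  constexpr int kN = 256;
  int *in = nullptr, *out = nullptr;
  cudaMalloc(&in, kN * sizeof(int));
  cudaMalloc(&out, kN * sizeof(int));
  cudaMemset(in, 0, kN * sizeof(int));
  // cluster_size = 1 keeps the example runnable on pre-Hopper GPUs.
  cudaError_t err = launch_sketch(in, out, kN, 2, 128, 1, nullptr);
  if (err == cudaSuccess) err = cudaDeviceSynchronize();
  printf("launch status: %s\n", cudaGetErrorString(err));
  cudaFree(in);
  cudaFree(out);
  return err == cudaSuccess ? 0 : 1;
}
```

Keeping the attribute setup inside the wrapper means callers such as a benchmark only choose a cluster size and never touch the launch configuration directly.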

Differential Revision: D91440692

Subodh Iyengar and others added 5 commits January 25, 2026 12:59
Summary:
Simplify the ChunkState API by:
1. Renaming methods from camelCase to snake_case for consistency with the rest
   of the pipes codebase (e.g., waitReadyToSend -> wait_ready_to_send)
2. Requiring a ThreadGroup parameter for all public methods, removing the
   error-prone single-threaded overloads

The ThreadGroup abstraction ensures proper synchronization:
- For signals (ready_to_recv, ready_to_send): the group syncs before the
  leader writes
- For waits (wait_ready_to_send, wait_ready_to_recv): all threads poll for
  better latency

This is a breaking change to the ChunkState API but the only callers are within
the pipes library (P2pNvlTransportDevice, P2pSyncBench).
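
The signal/wait split described above might look roughly like the following. This is a simplified sketch under stated assumptions, not the pipes implementation: `BlockGroupSketch` and `ChunkStateSketch` are hypothetical names, the group is assumed to be a single thread block, and the state is reduced to one integer flag.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for ThreadGroup: one thread block acts as the group.
struct BlockGroupSketch {
  __device__ bool is_leader() const { return threadIdx.x == 0; }
  __device__ void sync() const { __syncthreads(); }
};

// Hypothetical, simplified chunk flag: 0 = not ready, 1 = ready to send.
struct ChunkStateSketch {
  int flag;

  // Signal: sync so every thread's writes are done, then the leader publishes.
  __device__ void ready_to_send(const BlockGroupSketch& group) {
    group.sync();
    if (group.is_leader()) {
      __threadfence_system();  // order earlier writes before the flag store
      volatile int* p = &flag;
      *p = 1;
    }
  }

  // Wait: every thread polls the flag itself, so no extra broadcast is needed.
  __device__ void wait_ready_to_send(const BlockGroupSketch&) const {
    const volatile int* p = &flag;
    while (*p != 1) {
    }
    __threadfence_system();  // order the flag read before later data reads
  }
};

// Block 0 produces data and signals; block 1 waits and then consumes.
__global__ void demo(ChunkStateSketch* state, int* data, int* out) {
  BlockGroupSketch group;
  if (blockIdx.x == 0) {
    data[threadIdx.x] = threadIdx.x + 1;
    state->ready_to_send(group);
  } else {
    state->wait_ready_to_send(group);
    out[threadIdx.x] = data[threadIdx.x];
  }
}

int main() {
  constexpr int kThreads = 32;
  ChunkStateSketch* state = nullptr;
  int *data = nullptr, *out = nullptr;
  cudaMalloc(&state, sizeof(ChunkStateSketch));
  cudaMalloc(&data, kThreads * sizeof(int));
  cudaMalloc(&out, kThreads * sizeof(int));
  cudaMemset(state, 0, sizeof(ChunkStateSketch));
  cudaMemset(out, 0, kThreads * sizeof(int));
  demo<<<2, kThreads>>>(state, data, out);  // both blocks must be co-resident
  int host_out[kThreads];
  cudaError_t err =
      cudaMemcpy(host_out, out, sizeof(host_out), cudaMemcpyDeviceToHost);
  bool ok = (err == cudaSuccess);
  for (int i = 0; ok && i < kThreads; ++i) ok = (host_out[i] == i + 1);
  printf("%s\n", ok ? "consumer saw all producer writes" : "mismatch or error");
  cudaFree(state);
  cudaFree(data);
  cudaFree(out);
  return ok ? 0 : 1;
}
```

The demo relies on both blocks being resident at the same time, which holds for a two-block launch of this size but is not something CUDA guarantees in general.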

Differential Revision: D91412047

Summary:
Add optional timeout support to the ChunkState and SignalState wait methods,
which can otherwise block indefinitely on P2P NVLink I/O operations. When a
timeout occurs, the kernel aborts via __trap().

Key changes:
- Add TimeoutUtils.cuh with Timeout struct for efficient GPU-side timeout
  checking using clock64() and precomputed deadline cycles
- Add TimeoutUtils.h with makeTimeout() host helper that queries GPU clock
  rate and converts milliseconds to cycles
- Add optional Timeout parameter to ChunkState::wait_ready_to_send(),
  wait_ready_to_recv() and SignalState::wait_until()
- Propagate Timeout parameters through P2pNvlTransportDevice methods:
  send(), recv(), send_one(), recv_one(), send_multiple(), recv_multiple()
- Add TimeoutTrapTest with tests verifying timeout behavior

Design decisions:
- Timeout unit is milliseconds (intuitive for debugging)
- Default timeout_ms=0 means infinite wait (backward compatible)
- start() must be called once at kernel entry to capture reference time
- check() is called in polling loops with leader-only optimization for
  ThreadGroup-based waits
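
A minimal sketch of these design decisions follows. It assumes that the deadline is stored in SM clock cycles and that the host helper derives cycles from the device's peak clock rate. `TimeoutSketch` and `make_timeout_sketch` are hypothetical names; the real `Timeout` and `makeTimeout()` may differ in fields and signature, and the leader-only optimization is omitted here.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical timeout holder; timeout_cycles == 0 means wait forever.
struct TimeoutSketch {
  unsigned long long timeout_cycles;
  long long deadline;  // filled in by start()

  // Call once at kernel entry to capture the reference time.
  __device__ void start() {
    if (timeout_cycles != 0) {
      deadline = clock64() + static_cast<long long>(timeout_cycles);
    }
  }

  // Call inside polling loops; aborts the kernel once the deadline passes.
  __device__ void check() const {
    if (timeout_cycles != 0 && clock64() > deadline) {
      __trap();
    }
  }
};

// Hypothetical host helper: convert milliseconds to cycles.
// cudaDevAttrClockRate is reported in kHz, which equals cycles per millisecond.
inline TimeoutSketch make_timeout_sketch(uint32_t timeout_ms, int device) {
  TimeoutSketch t{0, 0};
  if (timeout_ms == 0) return t;  // infinite wait
  int clock_khz = 0;
  cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, device);
  t.timeout_cycles = static_cast<unsigned long long>(timeout_ms) *
                     static_cast<unsigned long long>(clock_khz);
  return t;
}

// Polls a flag and gives up via __trap() if the deadline is exceeded.
__global__ void wait_for_flag(const volatile int* flag, TimeoutSketch timeout) {
  timeout.start();
  while (*flag == 0) {
    timeout.check();
  }
}

int main() {
  int* flag = nullptr;
  cudaMalloc(&flag, sizeof(int));
  int one = 1;  // flag is already set, so the wait returns immediately
  cudaMemcpy(flag, &one, sizeof(int), cudaMemcpyHostToDevice);
  TimeoutSketch t = make_timeout_sketch(1000, 0);
  wait_for_flag<<<1, 1>>>(flag, t);
  cudaError_t err = cudaDeviceSynchronize();
  printf("kernel status: %s\n", cudaGetErrorString(err));
  cudaFree(flag);
  return err == cudaSuccess ? 0 : 1;
}
```

Precomputing the cycle budget on the host keeps the device-side check down to one `clock64()` read and a comparison. If the flag were never set, the `__trap()` would surface on the host as a launch failure at the next synchronizing call.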

Differential Revision: D91412046

Differential Revision: D91439414

Summary:
Integrate timeout support into the all_to_allv collective communication
primitive. This allows the collective to detect and abort when peer
communication hangs, rather than blocking indefinitely.

Key changes:
- Add required Timeout parameter to all_to_allv() function signature
- Call timeout.start() at the beginning to initialize the deadline
- Propagate timeout to P2pNvlTransportDevice send() and recv() calls
- Update CollectiveBenchmark kernel to accept and pass timeout
- Update AllToAllvBenchmark to create and pass Timeout to kernel

This builds on the timeout infrastructure added in D91412046.

Differential Revision: D91439413

Summary:
Add a host-side wrapper function `launch_all_to_allv()` for the AllToAllv
collective, similar to the dispatch host wrapper. This eliminates the need
for a separate kernel wrapper in the benchmark.

Key changes:
- Add AllToAllv.h with host wrapper declaration
- Add AllToAllv.cu with kernel and host wrapper implementation
- Support optional cluster launch with spread scheduling for better
  load balancing across GPCs
- Update AllToAllvBenchmark to use the new host wrapper instead of
  directly launching the kernel
- Fix AllToAllvTest.cu to pass Timeout parameter to device function

The host wrapper is named `launch_all_to_allv()` to distinguish it from
the device function `all_to_allv()`.

Differential Revision: D91440692
@meta-cla meta-cla bot added the CLA Signed label Jan 26, 2026

meta-codesync bot commented Jan 26, 2026

@siyengar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91440692.


Labels

CLA Signed, fb-exported, meta-exported
