The performance we observe on NVL72 is still suboptimal: on 64 GPUs we are currently at ~20 GiB/s, whereas NCCL achieves ~40 GiB/s, roughly 2x faster.
We need to investigate further where the performance is being lost. The most likely candidates are:
- RapidsMPF `bench_comm`
- UCXX lock contention
- UCX internals (could be a variety of different reasons)
Possible ways to explore this further:
- All-to-all test with OSU+UCX
- Properly configure the MPI communicator with the UCX backend on NVL72 (we currently hit issues with MLX5, probably due to minor misconfigurations that simply need to be corrected)
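
When comparing numbers across tools (`bench_comm`, NCCL, OSU), it may help to make sure all of them are converted to the same bandwidth definition. The sketch below is only an illustration: the function name and the sample values are hypothetical, and it assumes the quoted figures are unidirectional send bandwidth in GiB/s, which is not confirmed above.

```python
GIB = 1024 ** 3  # bytes per GiB


def alltoall_bandwidth_gib_s(n_ranks, bytes_per_peer, elapsed_s):
    """Return (per_rank, aggregate) send bandwidth in GiB/s for one all-to-all.

    Assumption: each rank sends `bytes_per_peer` bytes to every other rank
    (the self-copy is excluded) and `elapsed_s` is the wall time of one round.
    """
    if n_ranks < 2:
        raise ValueError("all-to-all needs at least two ranks")
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    sent_per_rank = (n_ranks - 1) * bytes_per_peer
    per_rank = sent_per_rank / elapsed_s / GIB
    return per_rank, per_rank * n_ranks


# Hypothetical sample: 64 ranks, 64 MiB per peer, 0.1 s per round.
per_rank, aggregate = alltoall_bandwidth_gib_s(64, 64 * 1024 ** 2, 0.1)
print(f"per rank: {per_rank:.3f} GiB/s, aggregate: {aggregate:.1f} GiB/s")
```

Whether a tool reports the per-rank or the aggregate figure, and whether it counts the self-copy, can change the result noticeably, so this is worth checking before attributing the 2x gap to any single candidate.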