Description
Hi, I ran the alltoall_perf test from nccl-tests on 1/2/8 nodes, each with 8 A100 GPUs, and found that the performance of MSCCL (in-place) did not improve compared with NCCL (out-of-place). My MSCCL_XML_FILES were generated with python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml. I also tried alltoall_a100_three_step.py and alltoall_allpairs.py; they all behaved similarly.
The test command is nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100, and I ran it on 8/16/64 GPUs, corresponding to 1/2/8 nodes.
The alltoall_perf result on 8 nodes (64 GPUs) looks like this:
#                                                        out-of-place                      in-place
#       size       count    type   redop  root     time  algbw  busbw #wrong     time  algbw  busbw #wrong
#        (B)  (elements)                           (us) (GB/s) (GB/s)            (us) (GB/s) (GB/s)
     1048576        4096   float    none    -1   9012.2   0.12   0.11      0    561.5   1.87   1.84    N/A
     2097152        8192   float    none    -1   1067.7   1.96   1.93      0   1046.3   2.00   1.97    N/A
     4194304       16384   float    none    -1   2010.8   2.09   2.05      0   2023.0   2.07   2.04    N/A
     8388608       32768   float    none    -1   5698.5   1.47   1.45      0   4261.4   1.97   1.94    N/A
    16777216       65536   float    none    -1   8339.5   2.01   1.98      0   8211.3   2.04   2.01    N/A
    33554432      131072   float    none    -1    16235   2.07   2.03      0    16281   2.06   2.03    N/A
    67108864      262144   float    none    -1    32252   2.08   2.05      0    51440   1.30   1.28    N/A
   134217728      524288   float    none    -1    63877   2.10   2.07      0    83221   1.61   1.59    N/A
   268435456     1048576   float    none    -1   147334   1.82   1.79      0   142747   1.88   1.85    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.77934
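For reference, here is a small sketch that recomputes the bandwidth columns from the sizes and times in the table above and prints the out-of-place/in-place time ratio per size. The formulas are an assumption on my part (algbw = size / time, busbw = algbw * (nranks - 1) / nranks with nranks = 64); they reproduce the printed values, but I have not checked them against the nccl-tests source. The names `ROWS`, `algbw_gbps`, `busbw_gbps` and `summarize` are just for illustration.

```python
# Rows copied from the 8-node table above:
# (size in bytes, out-of-place time in us, in-place time in us)
ROWS = [
    (1048576, 9012.2, 561.5),
    (2097152, 1067.7, 1046.3),
    (4194304, 2010.8, 2023.0),
    (8388608, 5698.5, 4261.4),
    (16777216, 8339.5, 8211.3),
    (33554432, 16235.0, 16281.0),
    (67108864, 32252.0, 51440.0),
    (134217728, 63877.0, 83221.0),
    (268435456, 147334.0, 142747.0),
]
NRANKS = 64  # 8 nodes x 8 GPUs


def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (bytes per microsecond is MB/s)."""
    return size_bytes / time_us / 1e3


def busbw_gbps(size_bytes, time_us, nranks=NRANKS):
    """Bus bandwidth, ASSUMING the alltoall factor is (nranks - 1) / nranks."""
    return algbw_gbps(size_bytes, time_us) * (nranks - 1) / nranks


def summarize(rows=ROWS, nranks=NRANKS):
    """Return (per-size out/in time ratios, average busbw over both columns)."""
    ratios = [oop_us / ip_us for _, oop_us, ip_us in rows]
    busbws = []
    for size, oop_us, ip_us in rows:
        busbws.append(busbw_gbps(size, oop_us, nranks))
        busbws.append(busbw_gbps(size, ip_us, nranks))
    return ratios, sum(busbws) / len(busbws)


if __name__ == "__main__":
    ratios, avg_busbw = summarize()
    for (size, _, _), ratio in zip(ROWS, ratios):
        # ratio > 1 means the in-place run was faster for this size
        print(f"{size:>10d} B  out-of-place/in-place time = {ratio:.2f}")
    print(f"avg busbw = {avg_busbw:.5f} GB/s")
```

A ratio above 1 means the in-place run was faster at that size, and a ratio below 1 means it was slower.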
I also found that the average bus bandwidth drops sharply on multiple nodes (2/8) compared with a single node. I have attached the logs for 8/16/64 GPUs below. Thank you!
gpu8-two_step.log
gpu16-two_step.log
gpu64-two_step.log