
MSCCL all-to-all performance did not improve compared with NCCL #48

@Musisoul

Description

Hi, I have run the nccl-tests alltoall_perf benchmark on 1/2/8 nodes with 8x A100 GPUs each and found that the performance of MSCCL (in-place) did not improve compared with NCCL (out-of-place). My MSCCL_XML_FILES were generated by python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml. I also tried alltoall_a100_three_step.py and alltoall_allpairs.py; they all behaved similarly.
The test command is nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100, and I ran it with 8/16/64 GPUs, corresponding to 1/2/8 nodes.
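For context, the full set of commands looks roughly like this (the mpirun launch line is only illustrative; the hostfile, rank counts, and paths are placeholders, not my exact job script):

# Generate the MSCCL schedule for 8 nodes x 8 GPUs with msccl-tools
python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml

# Point the MSCCL runtime at the generated schedule
export MSCCL_XML_FILES=$PWD/two_step_64.xml

# Run the nccl-tests all-to-all benchmark, one process per GPU (64 ranks across 8 nodes)
mpirun -np 64 --hostfile hosts -x MSCCL_XML_FILES \
    nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100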
The all-to-all test result on 8 nodes looks like this:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576          4096     float    none      -1   9012.2    0.12    0.11      0    561.5    1.87    1.84    N/A
     2097152          8192     float    none      -1   1067.7    1.96    1.93      0   1046.3    2.00    1.97    N/A
     4194304         16384     float    none      -1   2010.8    2.09    2.05      0   2023.0    2.07    2.04    N/A
     8388608         32768     float    none      -1   5698.5    1.47    1.45      0   4261.4    1.97    1.94    N/A
    16777216         65536     float    none      -1   8339.5    2.01    1.98      0   8211.3    2.04    2.01    N/A
    33554432        131072     float    none      -1    16235    2.07    2.03      0    16281    2.06    2.03    N/A
    67108864        262144     float    none      -1    32252    2.08    2.05      0    51440    1.30    1.28    N/A
   134217728        524288     float    none      -1    63877    2.10    2.07      0    83221    1.61    1.59    N/A
   268435456       1048576     float    none      -1   147334    1.82    1.79      0   142747    1.88    1.85    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.77934 

I also found that the average bus bandwidth drops sharply on multiple nodes (2/8) compared with a single node. I have attached the logs for 8/16/64 GPUs below. Thank you!
gpu8-two_step.log
gpu16-two_step.log
gpu64-two_step.log
