This is the lib-ncclbench project, a library for benchmarking NCCL operations inspired by NCCL Tests, but with some differences:
- NCCL Tests has calls to MPI_Barrier in the timing region, distorting the results.
- ncclbench reports timings for all calls, instead of only the average.
- ncclbench allows for benchmarking both by number of iterations and by total time.
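The last two points can be illustrated with a minimal Python sketch. It is not the library's implementation: `run_timed` and its parameters are hypothetical names, and it assumes that a run stops as soon as either the iteration limit or the time limit is reached. It keeps one duration per call rather than a single average, and it measures only the call itself, with no synchronization inside the timed region.

```python
import time


def run_timed(op, max_iterations, max_seconds, clock=time.perf_counter):
    """Call op() until either limit is reached (hypothetical helper).

    Returns one duration in seconds per call, so the caller can inspect
    the full distribution instead of only the mean.
    """
    durations = []
    start = clock()
    # Stop on whichever limit is hit first: call count or elapsed time.
    while len(durations) < max_iterations and clock() - start < max_seconds:
        t0 = clock()
        op()  # only the operation itself sits inside the timed region
        durations.append(clock() - t0)
    return durations


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would launch a collective here.
    samples = run_timed(lambda: sum(range(1000)), max_iterations=100, max_seconds=2.0)
    print(len(samples), "calls, slowest:", max(samples), "s")
```

In a multi-rank benchmark the ranks would additionally need to agree on when to stop; this single-process sketch leaves that out.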
We provide an example benchmark in the example directory. You can build it (using the -DBUILD_EXAMPLES=ON option) and run it to see how the library works.
A typical run of the ncclbench example looks like this:
$ mpirun -n 4 ./example/ncclbench --operation ncclAllReduce --sizes 1024 --data-type float --blocking --csv --warmups 10 --iterations 100 --time 2
Operation,Blocking,Data_Type,Msg_Size_B,#Elements,Iterations,Stream_Sync_us,Time_us,AlgBW_GBps,BusBW_GBps
ncclAllReduce,Yes,float,1024,256,1,1.57722,25.428,0.0375049,0.0562573
ncclAllReduce,Yes,float,1024,256,1,1.57722,21.042,0.0453224,0.0679836
ncclAllReduce,Yes,float,1024,256,1,1.57722,23.801,0.0400687,0.060103
ncclAllReduce,Yes,float,1024,256,1,1.57722,21.683,0.0439826,0.0659739
ncclAllReduce,Yes,float,1024,256,1,1.57722,20.626,0.0462365,0.0693548
ncclAllReduce,Yes,float,1024,256,1,1.57722,21.074,0.0452536,0.0678804
ncclAllReduce,Yes,float,1024,256,1,1.57722,20.988,0.045439,0.0681585
...
This runs the ncclAllReduce operation on 1024 bytes of data, using the float data type, in blocking mode. It performs 10 warmup iterations and then runs for up to 100 timed iterations or 2 seconds of total time, whichever comes first. The results are printed in CSV format.
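Because the output is plain CSV, it is easy to post-process. The Python sketch below parses the first three rows of the sample above and summarizes the Time_us column. It also recomputes the bandwidth columns under two assumptions inferred from the sample values rather than taken from the library source: AlgBW is the message size divided by the time, expressed in units of 2^30 bytes per second, and BusBW for an all-reduce is AlgBW scaled by 2(n-1)/n for n ranks, the convention used by NCCL Tests. The function names are hypothetical.

```python
import csv
import io
import statistics

GIB = 2**30  # assumption: the sample values match bytes/s divided by 2**30

# First three rows of the sample output shown above.
SAMPLE = """\
Operation,Blocking,Data_Type,Msg_Size_B,#Elements,Iterations,Stream_Sync_us,Time_us,AlgBW_GBps,BusBW_GBps
ncclAllReduce,Yes,float,1024,256,1,1.57722,25.428,0.0375049,0.0562573
ncclAllReduce,Yes,float,1024,256,1,1.57722,21.042,0.0453224,0.0679836
ncclAllReduce,Yes,float,1024,256,1,1.57722,23.801,0.0400687,0.060103
"""


def alg_bw(msg_size_bytes, time_us):
    """Algorithm bandwidth: bytes moved per second, in units of 2**30 bytes."""
    return msg_size_bytes / (time_us * 1e-6) / GIB


def allreduce_bus_bw(alg, ranks):
    """Bus bandwidth for all-reduce, using the 2(n-1)/n scaling factor."""
    return alg * 2 * (ranks - 1) / ranks


def summarize(csv_text):
    """Return the call count and min/median/max of the Time_us column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    times = [float(row["Time_us"]) for row in rows]
    return {
        "calls": len(times),
        "min_us": min(times),
        "median_us": statistics.median(times),
        "max_us": max(times),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
    bw = alg_bw(1024, 25.428)
    print(bw, allreduce_bus_bw(bw, ranks=4))
```

With 4 ranks, as in the mpirun command above, the recomputed values for the first row agree with the reported AlgBW_GBps and BusBW_GBps to the printed precision.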
See the BUILDING document.
See the CONTRIBUTING document.
See the LICENSE document.