Description
Dear Nvprof developers:
I want to use nvprof to profile my CUDA+MPI application, but a small test shows that the option --annotate-mpi openmpi does not produce any information about the MPI interface, contrary to what the nvprof documentation describes. The details of the test follow:
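For context, --annotate-mpi is supposed to inject NVTX markers around MPI calls so they appear in the profile. The same effect can be sketched manually with the NVTX API; the wrapper below is a hypothetical illustration (annotated_bcast is not from the test program), assuming the NVTX header from the CUDA toolkit and linking with -lnvToolsExt via an MPI compiler wrapper:

```c
/* Sketch: manually annotate an MPI call with an NVTX range, which is
 * roughly what nvprof --annotate-mpi is expected to do automatically.
 * Build (assumption): mpicc -I$CUDA_HOME/include annotate.c -lnvToolsExt
 */
#include <mpi.h>
#include <nvToolsExt.h>

/* Hypothetical wrapper: pushes a named NVTX range around MPI_Bcast so
 * the call shows up on the nvprof/nvvp timeline. */
static int annotated_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
{
    nvtxRangePushA("MPI_Bcast"); /* marker visible to the profiler */
    int rc = MPI_Bcast(buf, count, type, root, comm);
    nvtxRangePop();              /* close the range */
    return rc;
}

int main(int argc, char **argv)
{
    int bcastme = 42;
    MPI_Init(&argc, &argv);
    annotated_bcast(&bcastme, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

If the manual ranges do appear in the timeline while --annotate-mpi produces nothing, that would point at the automatic annotation hook rather than NVTX itself.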
Sample Test:
From Link: http://geco.mines.edu/tesla/cuda_tutorial_mio/
Source Files: mpi_hello_gpu.cu, vecadd.cu
OpenMPI Version: 4.0.2
CUDA Version: 10.1
Command: $ mpirun -np 2 nvprof --annotate-mpi openmpi ./mpi_cuda
Output (using 2 MPI processes):
rank 0 of 2 on p3dev02 received bcastme[3]=3 [gpu 0]
rank 1 of 2 on p3dev02 received bcastme[3]=3 [gpu 1]
==70253== NVPROF is profiling process 70253, command: ./mpi_cuda
==70254== NVPROF is profiling process 70254, command: ./mpi_cuda
rank 0: cudaGetDevice()=0
rank 1: cudaGetDevice()=1
rank 1: C[0]=0.000000
ranksum= 1
==70253== Profiling application: ./mpi_cuda
==70253== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 62.58% 3.1040us 2 1.5520us 1.3440us 1.7600us [CUDA memcpy HtoD]
37.42% 1.8560us 1 1.8560us 1.8560us 1.8560us [CUDA memcpy DtoH]
API calls: 86.74% 352.44ms 3 117.48ms 10.267us 352.42ms cudaMalloc
5.39% 21.910ms 582 37.645us 258ns 2.0794ms cuDeviceGetAttribute
4.75% 19.303ms 50000 386ns 303ns 102.73us cudaLaunchKernel
2.07% 8.3917ms 6 1.3986ms 1.1406ms 1.4661ms cuDeviceTotalMem
0.68% 2.7607ms 1 2.7607ms 2.7607ms 2.7607ms cudaGetDeviceProperties
0.34% 1.3713ms 6 228.55us 215.41us 247.59us cuDeviceGetName
0.02% 66.319us 3 22.106us 14.092us 30.931us cudaMemcpy
0.01% 20.708us 3 6.9020us 1.8690us 16.755us cudaFree
0.00% 12.278us 6 2.0460us 1.3700us 4.3850us cuDeviceGetPCIBusId
0.00% 7.5770us 12 631ns 375ns 973ns cuDeviceGet
0.00% 6.6190us 1 6.6190us 6.6190us 6.6190us cudaSetDevice
0.00% 6.2070us 4 1.5510us 867ns 2.3670us cuPointerGetAttributes
0.00% 2.3390us 6 389ns 354ns 461ns cuDeviceGetUuid
0.00% 1.8280us 3 609ns 437ns 780ns cuDeviceGetCount
0.00% 1.5210us 1 1.5210us 1.5210us 1.5210us cudaGetDevice
0.00% 1.2300us 1 1.2300us 1.2300us 1.2300us cudaGetDeviceCount
==70254== Profiling application: ./mpi_cuda
==70254== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 179.83ms 50000 3.5960us 3.5510us 4.0640us vecAdd(float*, float*, float*)
0.00% 3.0400us 2 1.5200us 1.3440us 1.6960us [CUDA memcpy HtoD]
0.00% 2.0480us 1 2.0480us 2.0480us 2.0480us [CUDA memcpy DtoH]
API calls: 68.49% 884.64ms 50000 17.692us 16.647us 1.4335ms cudaLaunchKernel
28.85% 372.61ms 3 124.20ms 15.212us 372.57ms cudaMalloc
1.55% 20.003ms 582 34.368us 453ns 1.2518ms cuDeviceGetAttribute
0.76% 9.7675ms 6 1.6279ms 1.6077ms 1.6602ms cuDeviceTotalMem
0.25% 3.2029ms 1 3.2029ms 3.2029ms 3.2029ms cudaGetDeviceProperties
0.10% 1.2356ms 6 205.93us 135.78us 224.53us cuDeviceGetName
0.01% 103.42us 3 34.473us 19.464us 60.273us cudaMemcpy
0.00% 60.895us 3 20.298us 4.2420us 51.665us cudaFree
0.00% 16.364us 4 4.0910us 2.0370us 9.1220us cuPointerGetAttributes
0.00% 14.154us 6 2.3590us 1.9510us 3.1620us cuDeviceGetPCIBusId
0.00% 11.338us 12 944ns 580ns 1.5080us cuDeviceGet
0.00% 7.3840us 1 7.3840us 7.3840us 7.3840us cudaSetDevice
0.00% 3.8410us 6 640ns 592ns 673ns cuDeviceGetUuid
0.00% 2.7020us 3 900ns 699ns 1.0970us cuDeviceGetCount
0.00% 1.9360us 1 1.9360us 1.9360us 1.9360us cudaGetDevice
0.00% 1.2750us 1 1.2750us 1.2750us 1.2750us cudaGetDeviceCount
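For completeness, here is the invocation in copy-pasteable form, plus a per-rank output-file variant (a suggestion, not part of the run above); the %q{OMPI_COMM_WORLD_RANK} placeholder is nvprof's documented way to substitute an environment variable that the OpenMPI launcher sets for each rank:

```shell
# Invocation used in the test (assumes nvprof and OpenMPI 4.0.2 on PATH):
mpirun -np 2 nvprof --annotate-mpi openmpi ./mpi_cuda

# Variant writing one profile file per rank, viewable later in nvvp:
mpirun -np 2 nvprof --annotate-mpi openmpi \
    -o mpi_cuda.%q{OMPI_COMM_WORLD_RANK}.nvvp ./mpi_cuda
```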
I hope you can reproduce the issue.
Best,
Shelton