Skip to content

Performance Tools

jgp edited this page Oct 2, 2019 · 12 revisions

Several tools are available on Piz Daint:

Nvidia nsight systems

Nvidia nsight

  • NVIDIA® Nsight™ Systems is a system-wide performance analysis tool.

To use nsight on Piz Daint, load the modulefile:

module load daint-gpu
module load nvidia-nsight-systems  #  this will load version 2019.5.1

nsys -v
    NVIDIA Nsight Systems version 2019.5.1.58-5e791f0

which nsys
    /apps/daint/UES/jenkins/6.0.UP07/gpu/easybuild/tools/software/nvidia-nsight-systems/2019.5.1/bin/nsys

You may want to install the client on your laptop from:

https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2019-5

For further information check the documentation:

or read further:

nsys with mpi+cuda

  • Load the modulefiles
module load daint-gpu
module unload PrgEnv-cray
module load PrgEnv-gnu
module load craype-accel-nvidia60
module load nvidia-nsight-systems/2019.5.1
  • Compile and link your code with cuda
  • Add nsys to your jobscript
srun -n 2 \
nsys profile --force-overwrite=true \
-o %h.%q{SLURM_NODEID}.%q{SLURM_PROCID}.qdstrm \
--trace=cuda,mpi,nvtx --mpi-impl=mpich \
--stats=true --delay=2 ./exe

nsys will generate 1 output file per mpi rank:

nid02006.0.0.qdrep
nid02007.1.1.qdrep
  • A typical text report (--stats=true) will look like:
CUDA Kernel Statistics (nanoseconds)

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name

-------  --------------  ----------  --------------  --------------  --------------
   49.0       107152611          88       1217643.3         1167541         1268020  density

   31.4        68693840          22       3122447.3         3092163         3160322  computeMomentumAndEnergyIAD

   19.5        42635403          22       1937972.9         1839759         2163115  computeIAD


CUDA Memory Operation Statistics (nanoseconds)

Time(%)      Total Time  Operations         Average         Minimum         Maximum  Name

-------  --------------  ----------  --------------  --------------  --------------
   98.9       135778347        1078        125953.9             896        15181550  [CUDA memcpy HtoD]

    1.1         1502834         308          4879.3            1760            7776  [CUDA memcpy DtoH]


CUDA Memory Operation Statistics (KiB)

            Total      Operations            Average            Minimum            Maximum  Name

-----------------  --------------  -----------------  -----------------  -----------------
        1383253.0            1078             1283.2              0.055            20312.0  [CUDA memcpy HtoD]

          15125.0             308               49.1             15.625               62.0  [CUDA memcpy DtoH]
etc...
  • Transfer the files to your laptop and open them in the client to get a graphical report:

Nsight Systems Cuda

nsys with mpi+openacc

  • Load the modulefiles
module load daint-gpu
module unload PrgEnv-cray
module load PrgEnv-pgi
module load craype-accel-nvidia60
module load nvidia-nsight-systems/2019.5.1
  • Compile and link your code with openacc

warning

We noticed an issue with the Cray wrapper scripts (ftn, cc, CC) when compiling OpenACC codes. A workaround is to set the right PATH at compilation time before loading PrgEnv-pgi:

export PATH=$CRAY_BINUTILS_BIN:$PATH
  • Add nsys to your jobscript
srun -n 2 \
nsys profile --force-overwrite=true \
-o %h.%q{SLURM_NODEID}.%q{SLURM_PROCID}.qdstrm \
--trace=openacc,mpi,nvtx --mpi-impl=mpich \
--stats=true --delay=2 ./exe

nsys will generate 1 output file per mpi rank:

nid02010.0.0.qdrep
nid02011.1.1.qdrep
  • A typical text report (--stats=true) will look like:
Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name

-------  --------------  ----------  --------------  --------------  --------------  -------------------------------------------------------------
-------------------
   85.5       175270396         273        642016.1            1794        11153499  cuStreamSynchronize

    9.1        18642361           1      18642361.0        18642361        18642361  cuMemFreeHost

    2.8         5791079         903          6413.2            4250           44709  cuMemcpyHtoDAsync_v2

    1.2         2497245          63         39638.8           11233         1699528  cuLaunchKernel

    0.7         1357685         273          4973.2            3433           19312  cuMemcpyDtoHAsync_v2

    0.4          746224         399          1870.2            1232           15760  cuEventSynchronize

    0.3          620821         400          1552.1            1107           16844  cuEventRecord

    0.0            2125           1          2125.0            2125            2125  cuEventCreate


Generating CUDA Kernel Statistics...

Generating CUDA Memory Operation Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name

-------  --------------  ----------  --------------  --------------  --------------  -------------------------------------------------------------
-------------------
   49.5        42597382          21       2028446.8         2024873         2031945  computeMomentumAndEnergyIADImpl_189_gpu

   37.0        31846715          21       1516510.2         1512559         1519919  computeIADImpl_58_gpu

   13.5        11639164          21        554245.9          551514          556569  computeDensity_74_gpu


CUDA Memory Operation Statistics (nanoseconds)

Time(%)      Total Time  Operations         Average         Minimum         Maximum  Name

-------  --------------  ----------  --------------  --------------  --------------  -------------------------------------------------------------
-------------------
   98.7       109746976         903        121536.0             864         1345969  [CUDA memcpy HtoD]

    1.3         1389679         273          5090.4            1024           11840  [CUDA memcpy DtoH]


CUDA Memory Operation Statistics (KiB)

            Total      Operations            Average            Minimum            Maximum  Name

-----------------  --------------  -----------------  -----------------  -----------------  ------------------------------------------------------
--------------------------
        1320377.0             903             1462.2              0.008            16384.0  [CUDA memcpy HtoD]

          14439.0             273               52.9              0.055               62.0  [CUDA memcpy DtoH]
etc...
  • Transfer the files to your laptop and open them in the client to get a graphical report:

Nsight Systems Cuda

Nvidia nvprof

Nvidia nvprof

  • NVIDIA nvprof is CUDA standard performance tool.

To use nvprof on Piz Daint, load the cudatoolkit:

module load craype-accel-nvidia60
module swap cudatoolkit/9.2.148_3.19-6.0.7.1_2.1__g3d9acc8

nvprof -V
    nvprof: NVIDIA (R) Cuda command line profiler
    Copyright (c) 2012 - 2018 NVIDIA Corporation
    Release version 9.2.148 (21)

For further information check the documentation:

or read further:

nvprof with mpi+cuda

  • Load the modulefiles
  • Compile and link your code with cuda
  • Add nvprof to your jobscript
srun nvprof \
  -o nvprof.%h.%p.%q{SLURM_NODEID}.%q{SLURM_PROCID} \
  ./exe 2>&1
  • nvprof will generate 1 output file per mpi rank:
nvprof.nid03508.11812.0.0
nvprof.nid03509.7667.1.1
  • Text report with nvprof:
nvprof -i nvprof.nid03508.11812.0.0
  • A typical report will look like:
======== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   46.13%  38.352ms         8  4.7940ms  4.7418ms  4.8599ms  void cuda::momenumAndEnergy
                   39.92%  33.189ms        84  395.11us     896ns  2.0940ms  [CUDA memcpy HtoD]
                   13.64%  11.341ms         8  1.4177ms  1.3868ms  1.4743ms  void cuda::density
                    0.31%  261.44us        40  6.5350us  5.8240us  12.608us  [CUDA memcpy DtoH]
      API calls:   76.98%  310.71ms        58  5.3571ms  2.6920us  308.81ms  cudaMalloc
                   21.58%  87.084ms       124  702.29us  10.153us  4.8844ms  cudaMemcpy
                    1.15%  4.6426ms        58  80.045us  3.0450us  803.70us  cudaFree
                    0.15%  586.09us         1  586.09us  586.09us  586.09us  cuDeviceTotalMem
                    0.07%  291.41us        16  18.213us  13.658us  34.499us  cudaLaunchKernel
                    0.07%  268.56us        97  2.7680us     105ns  109.25us  cuDeviceGetAttribute
                    0.01%  26.828us         1  26.828us  26.828us  26.828us  cuDeviceGetName
                    0.00%  3.9620us        16     247ns     167ns     416ns  cudaGetLastError
                    0.00%  2.5690us         1  2.5690us  2.5690us  2.5690us  cuDeviceGetPCIBusId
                    0.00%  1.3230us         3     441ns     119ns     736ns  cuDeviceGetCount
  • Graphical report with nvvp (requires X11 or VNC or scp):

nvvp &

Menu -> Import

Getting Started

nvprof with mpi+openacc

  • Load the modulefiles
  • Compile and link your code with pgi and openacc

warning

We noticed an issue with the Cray wrapper scripts (ftn, cc, CC) when compiling OpenACC codes for nvprof. A workaround is to set the right PATH at compilation time before loading PrgEnv-pgi:

export PATH=$CRAY_BINUTILS_BIN:$PATH

If not set, nvprof will display this message at runtime:

======== Profiling result:
No kernels were profiled.
No API activities were profiled.
  • Add nvprof to your jobscript
srun nvprof \
  -o nvprof.%h.%p.%q{SLURM_NODEID}.%q{SLURM_PROCID} \
  ./exe 2>&1

nvprof will generate 1 output file per mpi rank:

nvprof.nid03508.11813.0.0
nvprof.nid03509.7668.1.1
  • Text report with nvprof:
nvprof -i nvprof.nid03508.11813.0.0
  • A typical report will look like:
======== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   53.82%  63.027ms       156  404.02us     896ns  1.3470ms  [CUDA memcpy HtoD]
                   22.65%  26.532ms         3  8.8439ms  8.7749ms  8.8983ms  
                   void sphexa::sph::computeMomentumAndEnergyIADImpl_189_gpu<double, SqPatch<double>>(std::vector<int, std::allocator<int>> const &, double&)
                   16.85%  19.740ms         3  6.5799ms  6.4749ms  6.7702ms  
                    void sphexa::sph::computeIADImpl_58_gpu<double, SqPatch<double>>(std::vector<int, std::allocator<int>> const &, double&)
                    6.10%  7.1390ms         3  2.3797ms  2.3748ms  2.3872ms  
                    void sphexa::sph::computeDensity_74_gpu<double, SqPatch<double>>(std::vector<int, std::allocator<int>> const &, double&)
                    0.58%  678.56us        39  17.398us  1.0240us  21.056us  [CUDA memcpy DtoH]
      API calls:   57.07%  312.06ms         1  312.06ms  312.06ms  312.06ms  cuDevicePrimaryCtxRetain
                   19.39%  106.03ms         1  106.03ms  106.03ms  106.03ms  cuDevicePrimaryCtxRelease
                   etc...
  • Graphical report with nvvp (requires X11 or VNC or scp):

nvvp &

Menu -> Import

Getting Started

nvprof options

nvprof --help

  -o,  --export-profile <filename>

Export the result file which can be imported later or opened by the NVIDIA
Visual Profiler.

    "%p" in the file name string is replaced with the process ID of the application being profiled.

    "%q{<ENV>}" in the file name string is replaced with the value of the
environment variable "<ENV>". If the environment variable is not set it is an
error.

    "%h" in the file name string is replaced with the hostname of the system.
By default, this option disables the summary output.

Other tools

PGI_ACC_TIME with mpi+openacc

  • Load the modulefiles
  • Compile and link your code with pgi and openacc
  • separate the performance data into 1 file per mpi rank by using a wrapper script:

cat tool.sh

#!/bin/bash
export PGI_ACC_TIME=1
./myexe $@ 2> `hostname`.$SLURM_NODEID.$SLURM_PROCID.rpt

and run your executable with:

chmod u+x tool.sh
srun ./tool.sh 

PGI_ACC_TIME will generate 1 output file per mpi rank:

nid04276.0.0.rpt
nid04277.1.1.rpt

A typical text report will look like:

cat nid04276.0.0.rpt

Accelerator Kernel Timing data
density.hpp
  _ZN6sphexa3sph14computeDensityId7SqPatchIdEEEvRKSt6vectorIiSaIiEERT0_  NVIDIA  devicenum=0
    time(us): 13,927
    68: compute region reached 2 times
        68: kernel launched 2 times
            grid: [32000]  block: [128]
            elapsed time(us): total=4,580 max=2,300 min=2,280 avg=2,290
    68: data region reached 8 times
        68: data copyin transfers: 28
             device time(us): total=13,857 max=1,352 min=4 avg=494
        111: data copyout transfers: 4
             device time(us): total=70 max=25 min=7 avg=17
PGI_ACC_NOTIFY with mpi+openacc

  • Load the modulefiles
  • Compile and link your code with pgi and openacc
  • separate the performance data into 1 file per mpi rank by using a wrapper script:

cat tool.sh

#!/bin/bash
export PGI_ACC_NOTIFY=3
./myexe $@ 2> `hostname`.$SLURM_NODEID.$SLURM_PROCID.rpt

and run your executable with:

chmod u+x tool.sh
srun ./tool.sh 

PGI_ACC_NOTIFY will generate 1 output file per mpi rank:

nid03510.0.0.rpt
nid03511.1.1.rpt

A typical text report will look like:

cat nid03510.0.0.rpt

launch CUDA kernel  file=density.hpp 
function=_ZN6sphexa3sph14computeDensityId7SqPatchIdEEEvRKSt6vectorIiSaIiEERT0_ 
line=68 device=0 threadid=1 num_gangs=32000 num_workers=1 vector_length=128
grid=32000 block=128 shared memory=1024
etc...

download CUDA data  file=density.hpp
function=_ZN6sphexa3sph14computeDensityId7SqPatchIdEEEvRKSt6vectorIiSaIiEERT0_ 
line=111 device=0 threadid=1 variable=bbox bytes=56
etc...

upload CUDA data  file=density.hpp
function=_ZN6sphexa3sph14computeDensityId7SqPatchIdEEEvRKSt6vectorIiSaIiEERT0_ 
line=68 device=0 threadid=1 variable=n bytes=8
etc...
COMPUTE_PROFILE

COMPUTE_PROFILE support has been dropped.

Score-P and Vampir

Score-P and Vampir

  • Score-P is a performance tool developed by the VI-HPS Institute (Virtual Institute for High Productivity Supercomputing). Vampir is a performance visualizer.

For further information check the documentation:

or read further:

Score-P/Vampir with mpi+cuda

To use Score-P with your cuda code on Piz Daint:

  • Load the modulefiles (Score-P/4.0-CrayGNU-18.08-cuda-9.1)
  • Compile and link your code with the tool (add scorep before the Cray compiler wrapper)
  • Add the following variables to your jobscript:
export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=yes

More details with:

scorep-info config-vars --full 
  • Graphical report with vampir (requires X11 or VNC or scp):

Getting Started

Score-P/Vampir with mpi+openacc

To use Score-P with your openacc code on Piz Daint:

  • Load the modulefiles (Score-P/6.0-CrayPGI-19.eurohack)
  • Compile and link your code with the tool (use scorep-CC instead of the CC Cray compiler wrapper)
  • Add the following variables to your jobscript:
export SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--mpp=mpi --openacc"
export SCOREP_OPENACC_ENABLE=yes
export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_TOTAL_MEMORY=1GB

More details with:

scorep-info config-vars --full 
  • Graphical report with vampir (requires X11 or VNC or scp):

Getting Started

Clone this wiki locally