
MPI Pinned Memory Transfer Example



This example shows

  • How to work with multiple GPUs, including
    • rank-to-device assignment through a device binding helper script (a minimal sketch of the idea follows this list).
    • an example jsrun submission script for core and thread binding.
  • How transfer rates are affected by multiple GPUs
    • Each GPU has its own NVLink connection, but the GPUs share system memory bandwidth.
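
The device binding helper script and jsrun submission script belong to the example code and are not reproduced on this page. As a rough illustration of the rank-to-device assignment they handle, here is a minimal MPI + CUDA C sketch, assuming an OpenMPI-style OMPI_COMM_WORLD_LOCAL_RANK environment variable; with jsrun the binding is typically done instead by setting CUDA_VISIBLE_DEVICES per resource set:

/* Hypothetical sketch (not the wiki's helper script): bind each MPI rank to a
 * GPU chosen by its node-local rank. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Node-local rank; fall back to the global rank if the variable is unset.
     * OMPI_COMM_WORLD_LOCAL_RANK is an OpenMPI convention (an assumption here). */
    const char *lrank_env = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lrank_env ? atoi(lrank_env) : rank;

    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);
    if (ndevices < 1) {
        fprintf(stderr, "rank %d: no CUDA devices visible\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Round-robin the local ranks over the visible devices. */
    int device = local_rank % ndevices;
    cudaSetDevice(device);

    printf("rank %d -> device %d of %d\n", rank, device, ndevices);

    MPI_Finalize();
    return 0;
}

The key point is that each rank selects a device by its node-local rank, so the ranks placed on one node spread evenly across that node's GPUs.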

Note about the output:

The MPI pinned memory output from different ranks is interleaved, but you can clean it up with some grep post-processing. Since the rank is prepended to each output line, just grep for one of them:

grep "1: 0: " 56865.out

1: 0:  ****Pinned vs non-pinned memory transfer comparison***
1: 0: Array size:   2.0000 GB 
1: 0: Number of samples:     10
1: 0:   Test                       sec       Bandwidth GB/s
1: 0:  Allocation Timings:
1: 0:   regular host allocation:   .520E-05
1: 0:   pinned host allocation:    .812    
1: 0:   cuda device allocation     .387E-01
1: 0:  REGULAR PAGEABLE MEMORY:
1: 0:   Explicit CUDA HtoD         .893        22.390
1: 0:  USING PINNED MEMORY
1: 0:   Explicit CUDA HtoD         .458        43.677
1: 0:  REGULAR PAGEABLE MEMORY:
1: 0:   acc device allocation      .297E-02
1: 0:   acc HtoD                   2.77         7.221
1: 0:   acc device allocation      .115E-04
1: 0:   acc DtoH                   2.60         7.702
1: 0:   cuda device allocation     .341E-01
1: 0:   Explicit CUDA HtoD         1.86        10.741
1: 0:   Explicit CUDA DtoH         1.55        12.933
1: 0:  USING PINNED MEMORY
1: 0:   acc device allocation      .123E-04
1: 0:   acc HtoD                   .482        41.517
1: 0:   acc device allocation      .105E-04
1: 0:   acc DtoH                   .480        41.625
1: 0:   cuda device allocation     .301E-01
1: 0:   Explicit CUDA HtoD         .458        43.677
1: 0:   Explicit CUDA DtoH         .480        41.663
1: 0:  completed
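
The timings show that pinned (page-locked) host memory roughly doubles host-to-device and device-to-host bandwidth compared with regular pageable memory, but the pinned allocation itself is far more expensive (about 0.8 s for the 2 GB array versus microseconds for a regular allocation), so it pays off only when the buffer is reused for many transfers. A minimal CUDA C sketch of the pinned-versus-pageable comparison (not the wiki's actual benchmark source, which also covers the OpenACC transfers shown above) could look like this:

/* Assumed sketch: compare HtoD bandwidth for pageable vs pinned host buffers. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static float time_htod(void *host, void *dev, size_t bytes, int samples)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < samples; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0f / samples;   /* average seconds per transfer */
}

int main(void)
{
    const size_t bytes = (size_t)2 << 30;  /* ~2 GB, matching the array size above */
    const int samples = 10;               /* matching "Number of samples" above */

    void *dev;
    cudaMalloc(&dev, bytes);

    /* Regular pageable host memory. */
    void *pageable = malloc(bytes);
    memset(pageable, 1, bytes);            /* touch the pages before timing */
    float sec = time_htod(pageable, dev, bytes, samples);
    printf("pageable HtoD: %.3f s  %.1f GB/s\n", sec, bytes / sec / 1e9);
    free(pageable);

    /* Pinned (page-locked) host memory. */
    void *pinned;
    cudaMallocHost(&pinned, bytes);
    memset(pinned, 1, bytes);
    sec = time_htod(pinned, dev, bytes, samples);
    printf("pinned   HtoD: %.3f s  %.1f GB/s\n", sec, bytes / sec / 1e9);

    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}

cudaMallocHost returns page-locked memory that the GPU's DMA engine can access directly, while pageable memory must first be staged through an internal pinned buffer by the CUDA driver, which is what cuts the observed bandwidth roughly in half.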
