MPI Pinned Memory Transfer Example
This example shows:
- How to work with multiple GPUs, including:
  - rank-to-device assignment through a device binding helper script (a sketch of the idea follows this list).
  - an example jsrun submission script for core and thread binding.
- How transfer rates are affected by multiple GPUs:
  - GPUs have their own NVLinks but share system memory bandwidth.
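The binding helper script itself lives in the repository; as a rough illustration of the idea only, the sketch below (plain CUDA C with MPI, not the actual helper) derives a node-local rank from a shared-memory communicator split and binds each rank to a GPU round-robin. The communicator split and the modulo policy are assumptions for illustration, not the repository's exact method.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Node-local rank: split the world communicator by shared-memory domain. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    /* Bind this rank to a device, round-robin over the GPUs it can see. */
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

    printf("world rank %d -> local rank %d -> device %d of %d\n",
           world_rank, local_rank, local_rank % num_devices, num_devices);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```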
The MPI pinned memory example's output from multiple ranks is interleaved, but you can fix this with some grep post-processing. Since each rank is prepended to its output lines, just grep on one of them:
```
grep "1: 0: " 56865.out
1: 0: ****Pinned vs non-pinned memory transfer comparison***
1: 0: Array size: 2.0000 GB
1: 0: Number of samples: 10
1: 0: Test sec Bandwidth GB/s
1: 0: Allocation Timings:
1: 0: regular host allocation: .520E-05
1: 0: pinned host allocation: .812
1: 0: cuda device allocation .387E-01
1: 0: REGULAR PAGEABLE MEMORY:
1: 0: Explicit CUDA HtoD .893 22.390
1: 0: USING PINNED MEMORY
1: 0: Explicit CUDA HtoD .458 43.677
1: 0: REGULAR PAGEABLE MEMORY:
1: 0: acc device allocation .297E-02
1: 0: acc HtoD 2.77 7.221
1: 0: acc device allocation .115E-04
1: 0: acc DtoH 2.60 7.702
1: 0: cuda device allocation .341E-01
1: 0: Explicit CUDA HtoD 1.86 10.741
1: 0: Explicit CUDA DtoH 1.55 12.933
1: 0: USING PINNED MEMORY
1: 0: acc device allocation .123E-04
1: 0: acc HtoD .482 41.517
1: 0: acc device allocation .105E-04
1: 0: acc DtoH .480 41.625
1: 0: cuda device allocation .301E-01
1: 0: Explicit CUDA HtoD .458 43.677
1: 0: Explicit CUDA DtoH .480 41.663
1: 0: completed
```
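For reference, here is a minimal, self-contained sketch of the pinned vs. pageable host-to-device comparison reported above, written in plain CUDA C with event timing rather than the original example's code (which also exercises OpenACC and MPI); the 2 GB size and 10 samples are taken from the printed output, everything else is an assumption.

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define NSAMPLES 10

/* Average host-to-device bandwidth in GB/s over NSAMPLES copies. */
static double htod_gbps(void *host, void *dev, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < NSAMPLES; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (double)bytes * NSAMPLES / (ms * 1.0e-3) / 1.0e9;
}

int main(void) {
    size_t bytes = 2000000000UL;   /* 2.0 GB, as in the output above */
    void *dev, *pageable, *pinned;

    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);      /* regular pageable host memory */
    cudaMallocHost(&pinned, bytes);/* page-locked (pinned) host memory */

    printf("pageable HtoD: %6.2f GB/s\n", htod_gbps(pageable, dev, bytes));
    printf("pinned   HtoD: %6.2f GB/s\n", htod_gbps(pinned, dev, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```

Note from the allocation timings above that pinning host memory is itself expensive (0.812 s versus microseconds for a regular allocation), so pinned buffers pay off when they are reused across many transfers.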