- Clone Kokkos as a sub-module inside the main directory of the p2r-tests repo:
git clone git@github.com:kokkos/kokkos.git
- Set the environment variable Kokkos_source to the kokkos directory
- Load the CUDA module:
module load cuda/11.6.2
- Configure CMake (the Kokkos_ARCH_AMPERE80 flag targets the A100 GPU architecture):
cd ./p2r-tests/bin
cmake ../ -DCMAKE_CXX_COMPILER=$Kokkos_source/bin/nvcc_wrapper \
-DKokkos_ENABLE_CUDA=ON -DKokkos_ENABLE_CUDA_CONSTEXPR=On -DKokkos_ENABLE_CUDA_LAMBDA=On -DKokkos_CXX_STANDARD=17 -DKokkos_ARCH_AMPERE80=On
- Build with:
make
- Run with:
./p2r_kokkos
- For other backends, see bin/config.sh for examples (a hypothetical OpenMP configuration is sketched below)
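As an illustration only, a minimal sketch of configuring a CPU (OpenMP) backend instead of CUDA; Kokkos_ENABLE_OPENMP is a standard Kokkos CMake option, but this exact combination is an assumption here, so prefer the tested configurations in bin/config.sh:
#hypothetical OpenMP-backend configuration (assumption, not taken from bin/config.sh)
cd ./p2r-tests/bin
cmake ../ -DCMAKE_CXX_COMPILER=g++ -DKokkos_ENABLE_OPENMP=On -DKokkos_CXX_STANDARD=17
make
./p2r_kokkos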
Instructions to compile and run p2r on apollo@cs.uoregon.edu:
module load intel
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/packages/intel/20/compilers_and_libraries_2020.1.217/linux/tbb/lib/intel64_lin/gcc4.8/
icc -Wall -I. -O3 -fopenmp -march=native -xHost -qopt-zmm-usage=high propagate-tor-test_tbb.cpp -I/packages/intel/20/compilers_and_libraries/linux/tbb/include/ -L/packages/intel/20/compilers_and_libraries_2020.1.217/linux/tbb/lib/intel64_lin/gcc4.8/ -Wl,-rpath,/lib -ltbb -o propagate-tor-test.exe
Add the flag -qopt-report=5 to the compile command to generate a detailed optimization report (see the example after the run command below).
./propagate-tor-test.exe
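For example, the same compile command with the report flag appended (the report location follows the icc defaults):
icc -Wall -I. -O3 -fopenmp -march=native -xHost -qopt-zmm-usage=high -qopt-report=5 propagate-tor-test_tbb.cpp -I/packages/intel/20/compilers_and_libraries/linux/tbb/include/ -L/packages/intel/20/compilers_and_libraries_2020.1.217/linux/tbb/lib/intel64_lin/gcc4.8/ -Wl,-rpath,/lib -ltbb -o propagate-tor-test.exe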
Alternatively, to build against the Intel oneAPI compiler and a CMS (cvmfs) software stack:
source /opt/intel/oneapi/setvars.sh
source /cvmfs/cms.cern.ch/slc7_amd64_gcc820/lcg/root/6.18.04-bcolbf/etc/profile.d/init.sh
export TBB_GCC=/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/tbb/2019_U9
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LIBJPEG_TURBO_ROOT/lib64
icc -Wall -I. -O3 -fopenmp -march=native -xHost -qopt-zmm-usage=high src/propagate-tor-test_tbb.cpp -I$TBBROOT/include/ -L$TBBROOT/lib/intel64/gcc4.8/ -Wl,-rpath,/lib -ltbb -o propagate-tor-test.exe
./propagate-tor-test.exe
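Optionally (illustrative only, not required by the benchmark), the run can be pinned to specific cores with the standard Linux taskset utility:
#pin the benchmark to cores 0-7 (the core range is arbitrary)
taskset -c 0-7 ./propagate-tor-test.exe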
To build and scan with the build.py script:
module load intel
module load tbb
#Build and run once with icc as the compiler
python build.py -t tbb -c icc -v
Example commands:
#Print the compile command only (dry run)
python build.py -t tbb -c icc -v --dryRun
#Build and scan over multiple thread counts
python build.py -t tbb -c icc -v --nthreads 1,2,3,4,5
#Scan two compilers over multiple thread counts
python build.py -t tbb -c icc,gcc -v --nthreads 1,2,3,4,5
#Append results to the same result json (the default is to skip existing scan points)
python build.py -t tbb -c icc -v --nthreads 1,2,3,4,5 --append
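To inspect the accumulated result json afterwards (the filename below is a placeholder; build.py's actual output name may differ):
#pretty-print the scan results (results.json is a hypothetical filename)
python -m json.tool results.json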
To run the CUDA version on cori:
#Load the modules once
module load cgpu
module load cuda
#Connect to a GPU node:
salloc -A m2845 -C gpu -N 1 --gres=gpu:1 -t 2:00:00 --exclusive
#Example command:
python build.py -t cuda --num_streams 1 --bsize 1 -v
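To confirm the allocated GPU before running (nvidia-smi is the standard NVIDIA utility):
#check the visible GPU on the allocated node
nvidia-smi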
There are three different CUDA implementations, with different indexing schemes and kernel-launch patterns. For details of how the three implementations differ, see the slides here.
cuda: Always run with bsize=1. Kernels are launched in a 1D grid with a constant number of threads per block:
Blocks per grid = (nevts * ntrks) / threads_per_block
threads_per_block = const
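For example (illustrative numbers only): with nevts=1, ntrks=6400 and threads_per_block=64, the launch uses 1 * 6400 / 64 = 100 blocks.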
Example command:
python build.py -t cuda --num_streams 1 --bsize 1 -v --nevts 1 --nlayer 2 --ntrks 32,64,128,256
cuda_v2: bsize is set to ntrks in the implementation. Kernels are launched in a 1D grid with:
Blocks per grid = nevts
Threads_per_block = ntrks
On a V100 GPU, ntrks cannot exceed 300. Example command:
python build.py -t cuda_v2 --num_streams 1 -v --nevts 1 --nlayer 20 --ntrks 32,64,128,256
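With the example above at ntrks=256, each launch uses nevts = 1 block of 256 threads, below the ~300 limit noted above.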
cuda_v3: Follows the p2z CUDA V2 conventions, with bsize=128 by default. Kernels are launched in 2D thread blocks.
Example command:
python build.py -t cuda_v3 --num_streams 1 -v --ntrks 9600 --nevts 100 --nlayer 20 --threadsperblockx 16 --threadsperblocky 2
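In this example, each 2D block contains threadsperblockx * threadsperblocky = 16 * 2 = 32 threads.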
cuda_v4: Example command (skipping the host-to-device and device-to-host transfers and printing the build command only):
python build.py -t cuda_v4 --num_streams 1 -v --noH2D --noD2H --dryRun
To build the pstl version:
module load nvhpc/21.7
module load gcc/9.3.0 # needed for the gcc build
module load tbb # needed for the gcc build
Example command:
python build.py -t pstl -v -c nvc++,nvc++_x86,gcc --dryRun
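A sketch of combining the thread scan shown earlier with the pstl target; whether --nthreads applies to this target is an assumption here:
#hypothetical thread-count scan for the pstl build (assumption: --nthreads is supported for -t pstl)
python build.py -t pstl -c gcc -v --nthreads 1,2,4,8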