[C2] GPU Memory Hierarchy

Jacob Newcomb, Choudry Abdul Rehman, David Bunde <dbunde@knox.edu>

Description

This module discusses the GPU memory hierarchy by showing how tiling improves the memory performance of a matrix-multiply code. Rather than using a standard triply-nested loop that computes the result one location at a time, the tiled algorithm loads submatrices of the inputs into shared memory and uses them to compute partial results for an entire submatrix of the output.
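
As a rough illustration of why tiling helps, the following estimate counts global-memory loads. It rests on simplifying assumptions that are not taken from the module itself: square $N \times N$ matrices, $T \times T$ tiles with $T$ dividing $N$, and no help from hardware caches. In the untiled code each of the $N^2$ output elements reads a full row and a full column. In the tiled code each of the $(N/T)^2$ output tiles walks through $N/T$ pairs of input tiles, each pair holding $2T^2$ elements.

$$
\begin{aligned}
\text{untiled loads} &= N^2 \cdot 2N = 2N^3 \\
\text{tiled loads} &= \left(\frac{N}{T}\right)^2 \cdot \frac{N}{T} \cdot 2T^2 = \frac{2N^3}{T}
\end{aligned}
$$

With $N = 1024$ and $T = 16$, for example, this gives about $2.1 \times 10^9$ loads for the untiled version and about $1.3 \times 10^8$ for the tiled one, a reduction by a factor of $T$.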

The module is based on an example from a well-known text [1].

Context

The module is intended as a second module on GPU programming, after students have been introduced to GPU programming and its SIMD programming model. (It could be a successor to our Introduction to CUDA Programming module.) The idea is to make an analogy to CPU caching and reinforce the idea of caching; GPU shared memory is used as a programmer-controlled cache. Because of its prerequisites, this module is appropriate for a mid-level systems course or an upper-level elective. I plan on using it in an Introduction to Systems course in which both CUDA programming and caching are introduced.
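
To make the "programmer-controlled cache" idea concrete, here is a minimal sketch that is deliberately not the module's matrix-multiply code. It uses a 1D three-point sum, in which each input value is needed by up to three threads, so staging a block's inputs in shared memory lets neighboring threads reuse them. The names `stencil1d`, `BLOCK`, and `RADIUS` are hypothetical and chosen only for this illustration, and the kernel assumes it is launched with exactly `BLOCK` threads per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block (illustrative choice)
#define RADIUS 1    // each output sums 2*RADIUS+1 neighboring inputs

// Hypothetical example kernel; assumes blockDim.x == BLOCK.
__global__ void stencil1d(const float *in, float *out, int n) {
    // Programmer-managed "cache": one block's inputs plus halo cells.
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index within the tile

    // Each thread stages one element; out-of-range positions count as 0.
    tile[l] = (g < n) ? in[g] : 0.0f;

    // The first RADIUS threads also stage the halo cells on both sides.
    if (threadIdx.x < RADIUS) {
        int left = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[l + BLOCK] = (right < n) ? in[right] : 0.0f;
    }

    // All loads must finish before any thread reads a neighbor's value.
    __syncthreads();

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++)
            sum += tile[l + k];  // served from shared memory, not global
        out[g] = sum;
    }
}

int main() {
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;  // all-ones input

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;  // round up to cover all n elements
    stencil1d<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // With all-ones input, interior sums are 3 and the two endpoints are 2.
    printf("out[0]=%.0f out[500]=%.0f out[%d]=%.0f\n",
           h_out[0], h_out[500], n - 1, h_out[n - 1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The pattern is the same one the tiled matrix multiply relies on: load cooperatively into a `__shared__` array, synchronize with `__syncthreads()`, then compute from the staged copy. Unlike a CPU cache, nothing is loaded or evicted automatically, and the array is limited by the shared memory available to each block.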

My students have found it easier to run the code in this module on Google Colab than to use ssh to access departmental computing resources. Colab provides an interactive computing environment running Jupyter notebooks with access to GPUs. Using GPUs does require installing the nvcc compiler. As an added wrinkle, the GPU assigned to a student can change when the notebook is restarted, and with it the number of cores and the architecture, which can lead to different results between sessions. See the setup resource below for additional information on using Google Colab.
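
Because the assigned GPU can differ between sessions, it may help to have students record which device they are running on before comparing timings. The following is a small sketch, not part of the module's materials, that prints a few properties of device 0 using the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No usable CUDA device: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Query the first device; Colab sessions normally expose a single GPU.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

Running this at the start of each session gives a record of the device name, architecture, and shared-memory limit to keep alongside any measurements.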

Topics

HC topics covered in this module are listed below. Bloom's classification is shown in brackets.

GPU Acceleration [A]
Memory heterogeneity [A]
Optimizing memory performance [C]

Learning Outcomes

Having completed this module, students should be able to:

Explain the properties and limitations of shared memory in CUDA programming
Write code using GPU shared memory in CUDA
Estimate the number of memory operations for simple programs

Instructor Resources

This module includes the following teaching materials:

Slides (.pptx, .pdf)
(Untiled) CUDA code for matrix multiply: An implementation that doesn't use tiling
Colab notebook (open it and then click "open in colab")
Skeleton of tiled matrix multiply: A version partially converted to using tiling. (A completed version is available to instructors upon request.)
Information on using Google Colab

All materials are available for download from the ToUCH git repository.

References

[1] D. B. Kirk and W.-m. W. Hwu. Programming Massively Parallel Processors, 3rd edition. Morgan Kaufmann, 2017. Sections 4.4-4.6, pages 84-96.
