My implementations of the assignments for CMU 11868: Large Language Model Systems. This repository focuses on the CUDA-related parts: map/reduce kernels, tiled matrix multiplication, and custom forward/backward kernels for Softmax and LayerNorm. Automatic differentiation is also implemented. The code covers fundamental CUDA programming topics such as the memory hierarchy, warp reduction, parallelization strategies, and efficient kernel implementation.
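A custom Softmax forward/backward kernel is usually validated against a simple CPU reference. The sketch below is not taken from this repository; it is a minimal single-row reference in plain C++, and the function names `softmax_forward` and `softmax_backward` are hypothetical. It assumes the standard max-subtraction trick for numerical stability and the usual gradient formula `dx_i = y_i * (dy_i - sum_j dy_j * y_j)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the repository's kernel.
// Forward: y_i = exp(x_i - max(x)) / sum_j exp(x_j - max(x)).
// Subtracting the row maximum keeps exp() from overflowing.
std::vector<float> softmax_forward(const std::vector<float>& x) {
    assert(!x.empty());
    const float max_val = *std::max_element(x.begin(), x.end());
    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - max_val);
        sum += y[i];
    }
    for (float& v : y) {
        v /= sum;
    }
    return y;
}

// Backward: given the forward output y and the upstream gradient dy,
// dx_i = y_i * (dy_i - sum_j dy_j * y_j).
std::vector<float> softmax_backward(const std::vector<float>& y,
                                    const std::vector<float>& dy) {
    assert(y.size() == dy.size());
    float dot = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        dot += y[i] * dy[i];
    }
    std::vector<float> dx(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        dx[i] = y[i] * (dy[i] - dot);
    }
    return dx;
}
```

In a GPU version, the two row-wide quantities here (the maximum and the sums) are the parts that would typically be computed with a warp or block reduction, which is why a scalar reference like this is handy for comparing results element by element.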
ztzhu1/CMU-11868