This directory contains 53 CUDA programming evaluation tasks based on the "Programming Massively Parallel Processors" (PMPP) textbook by Hwu, Kirk, and Hajj. Each task evaluates a specific CUDA programming concept.
Source: SinatrasC/pmpp-eval
Each evaluation task follows a standardized structure:
```
eval-tasks/
├── ch02-vecadd-single-turn/
│   ├── Makefile               # Build configuration
│   ├── README.md              # Task documentation
│   ├── student_kernel.cu      # Student implementation file (to be completed)
│   ├── reference_solution.cu  # Reference implementation
│   └── test_student.cu        # Test harness
├── ch02-vecmul-single-turn/
│   └── ...
├── ch03-ex1a-matmul-row-per-thread/
│   └── ...
└── ... (53 tasks total)
```
| File | Purpose |
|---|---|
| `Makefile` | Defines build targets (`test_student`, `test_reference`) |
| `README.md` | Task description, requirements, and hints |
| `student_kernel.cu` | Skeleton file where students implement the CUDA kernel |
| `reference_solution.cu` | Correct reference implementation |
| `test_*.cu` | Test harness that validates correctness |
- `ch02-vecadd-single-turn`: Vector addition
- `ch02-vecmul-single-turn`: Vector multiplication
- `ch03-ex1a-matmul-row-per-thread`: Matrix multiplication (row-per-thread)
- `ch03-ex1b-matmul-col-per-thread`: Matrix multiplication (column-per-thread)
- `ch03-rgb2gray-single-turn`: RGB to grayscale conversion
- `ch04-device-props-eval`: Device properties evaluation
- `ch04-matmul-basic-single-turn`: Basic matrix multiplication
- `ch05-matmul-tiled`: Tiled matrix multiplication
- `ch05-matmul-tiled-multiturn`: Multi-turn tiled matrix multiplication
- `ch05-matmul-tiled-speed`: Optimized tiled matrix multiplication
- `ch06-thread-coarsening-matmul`: Thread coarsening in matrix multiplication
- `ch07-conv1d-basic-single-turn`: 1D convolution (basic)
- `ch07-conv1d-tiled-caching`: 1D convolution with tiled caching
- `ch07-conv2d-basic`: 2D convolution (basic)
- `ch07-conv2d-tiled-constant`: 2D convolution with constant memory
- `ch08-stencil-1d-basic`: 1D stencil computation
- `ch08-stencil-2d-basic`: 2D stencil computation
- `ch09-histogram-naive-single-turn`: Naive histogram (global atomics)
- `ch09-histogram-privatization`: Histogram with privatization
- `ch10-reduction-max-arbitrary`: Reduction (max) with arbitrary size
- `ch10-reduction-sum-2048`: Reduction (sum) for 2048 elements
- `ch10-reduction-sum-arbitrary`: Reduction (sum) with arbitrary size
- `ch11-prefix-sum-kogge-stone`: Kogge-Stone scan
- `ch11-prefix-sum-brent-kung`: Brent-Kung scan
- `ch12-merge-basic`: Basic merge
- `ch12-merge-tiled`: Tiled merge
- `ch13-bitonic-sort`: Bitonic sort
- `ch13-radix-sort-basic`: Basic radix sort
- `ch14-spmv-coo`: Sparse matrix-vector multiply (COO format)
- `ch14-spmv-csr`: Sparse matrix-vector multiply (CSR format)
- `ch14-spmv-ell`: Sparse matrix-vector multiply (ELL format)
- `ch15-bfs-direction-optimized-single`: BFS with direction optimization
- `ch15-bfs-edge-centric-single`: Edge-centric BFS
- `ch15-bfs-pull-single`: Pull-based BFS
- `ch15-bfs-push-single`: Push-based BFS
- `ch16-softmax-basic`: Basic softmax
- `ch16-layernorm-basic`: Basic layer normalization
- `ch17-sparse-iterative-cg`: Conjugate gradient method
- `ch18-segmented-scan`: Segmented scan
- `ch19-warp-shuffle-reduction`: Warp shuffle reduction
- `ch19-warp-vote-predicate`: Warp vote predicates
- `ch20-streams-overlap`: Stream-based overlap
- `ch21-bezier-dp-free-child-buffers`: Bezier curve with dynamic parallelism
- `ch21-bezier-dp-parent-child-single`: Parent-child dynamic parallelism
- `ch21-quadtree-dp-build-single`: Quadtree with dynamic parallelism
- `ch21-quadtree-dp-pack-coalesced`: Coalesced quadtree packing
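To give a sense of the level these tasks target, here is a hedged sketch in the style of the naive-histogram task (`ch09-histogram-naive-single-turn`). The kernel name, signature, and `NUM_BINS` are illustrative assumptions, not the task's actual API:

```cuda
// Sketch: naive histogram using global atomics with a grid-stride loop.
// NUM_BINS and the signature are assumptions; the real skeleton may differ.
#define NUM_BINS 256

__global__ void histogramNaive(const unsigned char* data,
                               unsigned int* bins, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride) {
        // Every thread hammers global memory; the privatization task
        // (ch09-histogram-privatization) reduces this contention.
        atomicAdd(&bins[data[i]], 1u);
    }
}
```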
Each task uses a Makefile with standard targets:

```sh
make test_student    # Build and run student implementation
make test_reference  # Build and run reference solution
make clean           # Clean build artifacts
```

Configurable variables:

- `NVCC`: NVIDIA CUDA compiler (default: `nvcc`)
- `NVCC_FLAGS`: Compiler flags (e.g., `-arch=sm_70`, `-O3`)
- `CUDA_PATH`: CUDA installation path
```sh
# Navigate to task directory
cd eval-tasks/ch02-vecadd-single-turn/

# Edit student_kernel.cu with your implementation
vim student_kernel.cu

# Build and test
make test_student

# Compare with reference
make test_reference
```

The PMPP evaluation harness automatically:
- Extracts CUDA code from LLM responses
- Writes code to `student_kernel.cu`
- Compiles using `make test_student`
- Runs the test binary
- Reports success/failure (1.0 or 0.0)
Students must complete the skeleton in `student_kernel.cu`:

```cuda
__global__ void myKernel(float* input, float* output, int n) {
    // TODO: Implement kernel
    // Hints provided in comments
}
```

Each test harness (`test_*.cu`):
- Allocates input/output buffers
- Initializes test data
- Launches student kernel
- Validates results against expected output
- Returns exit code 0 (success) or 1 (failure)
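As a concrete sketch of the steps above, here is what a completed kernel plus a minimal harness might look like for the vector-addition task. The names (`vecAdd`), sizes, and tolerance are illustrative assumptions, not the task's actual API:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

// Illustrative completed kernel; the real skeleton's name may differ.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize host test data
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes),
          *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Allocate device buffers and copy inputs over
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the student kernel and copy the result back
    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(da, db, dc, n);
    cudaDeviceSynchronize();
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    // Validate against the expected output; exit 0 on success, 1 on failure
    for (int i = 0; i < n; ++i) {
        float expected = 3.0f * i;
        if (fabsf(hc[i] - expected) > 1e-5f * (fabsf(expected) + 1.0f)) {
            printf("FAIL at %d\n", i);
            return 1;
        }
    }
    printf("PASS\n");
    return 0;
}
```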
- CUDA Toolkit 11.0+ (nvcc compiler)
- GNU Make
- C++14 or later
- Linux/WSL2 (recommended)
- NVIDIA GPU
- Recommended: 4GB+ VRAM
Tasks are automatically downloaded from GitHub releases on first use:
```sh
# Default: Downloads to ~/.cache/pmpp/eval-tasks
uv run vf-eval pmpp -m openai/gpt-4o-mini -n 5

# Custom cache location
uv run vf-eval pmpp -n 5 \
  --env-args '{"eval_tasks_cache_dir": "/custom/path"}'
```

Alternatively, download and extract a release manually:

```sh
# Download specific version
wget https://github.com/SinatrasC/pmpp-eval/releases/download/v1.0.0/eval-tasks.tar.gz

# Extract
tar -xzf eval-tasks.tar.gz

# Use in evaluation
uv run vf-eval pmpp -n 5 \
  --env-args '{"use_bundled_tasks": true}'
```
To verify your CUDA installation:

```sh
# Check CUDA installation
nvcc --version

# Install CUDA (Ubuntu/Debian)
sudo apt install nvidia-cuda-toolkit
```

Compilation issues:

- Ensure correct CUDA architecture flags in Makefile
- Check compute capability: `nvidia-smi`
- Verify C++ standard compatibility

Runtime issues:

- Check GPU memory availability
- Validate kernel launch parameters
- Review synchronization points
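For the launch-parameter and compute-capability points above, a common debugging pattern is to query device properties and check both launch-time and execution-time errors explicitly. This is a standalone sketch, not part of the harness:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy() {}

int main() {
    // Query compute capability (the same info nvidia-smi reports)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("sm_%d%d, %zu MB global memory\n",
           prop.major, prop.minor, prop.totalGlobalMem >> 20);

    // Launch, then check launch-time and execution-time errors separately
    dummy<<<1, 256>>>();
    cudaError_t launchErr = cudaGetLastError();       // bad launch config, etc.
    cudaError_t syncErr   = cudaDeviceSynchronize();  // faults during execution
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n",
                cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
        return 1;
    }
    return 0;
}
```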
To add new evaluation tasks:
- Create task directory: `eval-tasks/chXX-topic-variant/`
- Add required files (Makefile, README, student_kernel.cu, etc.)
- Write test harness with clear pass/fail criteria
- Update dataset JSONL with task metadata
- Test with reference implementation
- Submit PR to pmpp-eval repository
Evaluation tasks are distributed under the same license as the PMPP codebase.
- Issues: GitHub Issues
- Documentation: PMPP Environment README
- Author: Sinatras - GitHub · X