This directory contains 53 CUDA programming evaluation tasks based on the "Programming Massively Parallel Processors" (PMPP) textbook by Hwu, Kirk, and Hajj. Each task evaluates a specific CUDA programming concept.
Source: SinatrasC/pmpp-eval
Each evaluation task follows a standardized structure:
```
eval-tasks/
├── ch02-vecadd-single-turn/
│   ├── Makefile               # Build configuration
│   ├── README.md              # Task documentation
│   ├── student_kernel.cu      # Student implementation file (to be completed)
│   ├── reference_solution.cu  # Reference implementation
│   └── test_student.cu        # Test harness
├── ch02-vecmul-single-turn/
│   └── ...
├── ch03-ex1a-matmul-row-per-thread/
│   └── ...
└── ... (53 tasks total)
```
| File | Purpose |
|---|---|
| `Makefile` | Defines build targets (`test_student`, `test_reference`) |
| `README.md` | Task description, requirements, and hints |
| `student_kernel.cu` | Skeleton file where students implement the CUDA kernel |
| `reference_solution.cu` | Correct reference implementation |
| `test_*.cu` | Test harness that validates correctness |
- `ch02-vecadd-single-turn`: Vector addition
- `ch02-vecmul-single-turn`: Vector multiplication
- `ch03-ex1a-matmul-row-per-thread`: Matrix multiplication (row-per-thread)
- `ch03-ex1b-matmul-col-per-thread`: Matrix multiplication (column-per-thread)
- `ch03-rgb2gray-single-turn`: RGB to grayscale conversion
- `ch04-device-props-eval`: Device properties evaluation
- `ch04-matmul-basic-single-turn`: Basic matrix multiplication
- `ch05-matmul-tiled`: Tiled matrix multiplication
- `ch05-matmul-tiled-multiturn`: Multi-turn tiled matrix multiplication
- `ch05-matmul-tiled-speed`: Optimized tiled matrix multiplication
- `ch06-thread-coarsening-matmul`: Thread coarsening in matrix multiplication
- `ch07-conv1d-basic-single-turn`: 1D convolution (basic)
- `ch07-conv1d-tiled-caching`: 1D convolution with tiled caching
- `ch07-conv2d-basic`: 2D convolution (basic)
- `ch07-conv2d-tiled-constant`: 2D convolution with constant memory
- `ch08-stencil-1d-basic`: 1D stencil computation
- `ch08-stencil-2d-basic`: 2D stencil computation
- `ch09-histogram-naive-single-turn`: Naive histogram (global atomics)
- `ch09-histogram-privatization`: Histogram with privatization
- `ch10-reduction-max-arbitrary`: Reduction (max) with arbitrary size
- `ch10-reduction-sum-2048`: Reduction (sum) for 2048 elements
- `ch10-reduction-sum-arbitrary`: Reduction (sum) with arbitrary size
- `ch11-prefix-sum-kogge-stone`: Kogge-Stone scan
- `ch11-prefix-sum-brent-kung`: Brent-Kung scan
- `ch12-merge-basic`: Basic merge
- `ch12-merge-tiled`: Tiled merge
- `ch13-bitonic-sort`: Bitonic sort
- `ch13-radix-sort-basic`: Basic radix sort
- `ch14-spmv-coo`: Sparse matrix-vector multiply (COO format)
- `ch14-spmv-csr`: Sparse matrix-vector multiply (CSR format)
- `ch14-spmv-ell`: Sparse matrix-vector multiply (ELL format)
- `ch15-bfs-direction-optimized-single`: BFS with direction optimization
- `ch15-bfs-edge-centric-single`: Edge-centric BFS
- `ch15-bfs-pull-single`: Pull-based BFS
- `ch15-bfs-push-single`: Push-based BFS
- `ch16-softmax-basic`: Basic softmax
- `ch16-layernorm-basic`: Basic layer normalization
- `ch17-sparse-iterative-cg`: Conjugate gradient method
- `ch18-segmented-scan`: Segmented scan
- `ch19-warp-shuffle-reduction`: Warp shuffle reduction
- `ch19-warp-vote-predicate`: Warp vote predicates
- `ch20-streams-overlap`: Stream-based overlap
- `ch21-bezier-dp-free-child-buffers`: Bezier curve with dynamic parallelism
- `ch21-bezier-dp-parent-child-single`: Parent-child dynamic parallelism
- `ch21-quadtree-dp-build-single`: Quadtree with dynamic parallelism
- `ch21-quadtree-dp-pack-coalesced`: Coalesced quadtree packing
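To give a sense of the level these tasks target, here is a hedged sketch in the style of the naive-histogram task (`ch09-histogram-naive-single-turn`). The kernel name, signature, and `NUM_BINS` are illustrative assumptions, not the task's actual API:

```cuda
// Sketch: naive histogram using global atomics with a grid-stride loop.
// NUM_BINS and the signature are assumptions; the real skeleton may differ.
#define NUM_BINS 256

__global__ void histogramNaive(const unsigned char* data,
                               unsigned int* bins, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride) {
        // Every thread hammers global memory; the privatization task
        // (ch09-histogram-privatization) reduces this contention.
        atomicAdd(&bins[data[i]], 1u);
    }
}
```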
Each task uses a Makefile with standard targets:

```sh
make test_student    # Build and run student implementation
make test_reference  # Build and run reference solution
make clean           # Clean build artifacts
```

Configurable variables:

- `NVCC`: NVIDIA CUDA compiler (default: `nvcc`)
- `NVCC_FLAGS`: Compiler flags (e.g., `-arch=sm_70`, `-O3`)
- `CUDA_PATH`: CUDA installation path
```sh
# Navigate to task directory
cd eval-tasks/ch02-vecadd-single-turn/

# Edit student_kernel.cu with your implementation
vim student_kernel.cu

# Build and test
make test_student

# Compare with reference
make test_reference
```

The PMPP evaluation harness automatically:
- Extracts CUDA code from LLM responses
- Writes code to `student_kernel.cu`
- Compiles using `make test_student`
- Runs the test binary
- Reports success/failure (1.0 or 0.0)
Students must complete the skeleton in `student_kernel.cu`:

```cuda
__global__ void myKernel(float* input, float* output, int n) {
    // TODO: Implement kernel
    // Hints provided in comments
}
```

Each test harness (`test_*.cu`):
- Allocates input/output buffers
- Initializes test data
- Launches student kernel
- Validates results against expected output
- Returns exit code 0 (success) or 1 (failure)
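As a concrete sketch of the steps above, here is what a completed kernel plus a minimal harness might look like for the vector-addition task. The names (`vecAdd`), sizes, and tolerance are illustrative assumptions, not the task's actual API:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

// Illustrative completed kernel; the real skeleton's name may differ.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize host test data
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes),
          *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Allocate device buffers and copy inputs over
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the student kernel and copy the result back
    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(da, db, dc, n);
    cudaDeviceSynchronize();
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    // Validate against the expected output; exit 0 on success, 1 on failure
    for (int i = 0; i < n; ++i) {
        float expected = 3.0f * i;
        if (fabsf(hc[i] - expected) > 1e-5f * (fabsf(expected) + 1.0f)) {
            printf("FAIL at %d\n", i);
            return 1;
        }
    }
    printf("PASS\n");
    return 0;
}
```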
- CUDA Toolkit 11.0+ (nvcc compiler)
- GNU Make
- C++14 or later
- Linux/WSL2 (recommended)
- NVIDIA GPU
- Recommended: 4GB+ VRAM
Tasks are automatically downloaded from GitHub releases on first use:
```sh
# Default: Downloads to ~/.cache/pmpp/eval-tasks
uv run vf-eval pmpp -m openai/gpt-4o-mini -n 5

# Custom cache location
uv run vf-eval pmpp -n 5 \
  --env-args '{"eval_tasks_cache_dir": "/custom/path"}'
```

Alternatively, download and extract a release manually:

```sh
# Download specific version
wget https://github.com/SinatrasC/pmpp-eval/releases/download/v1.0.0/eval-tasks.tar.gz

# Extract
tar -xzf eval-tasks.tar.gz

# Use in evaluation
uv run vf-eval pmpp -n 5 \
  --env-args '{"use_bundled_tasks": true}'
```
To verify your CUDA installation:

```sh
# Check CUDA installation
nvcc --version

# Install CUDA (Ubuntu/Debian)
sudo apt install nvidia-cuda-toolkit
```

Compilation issues:

- Ensure correct CUDA architecture flags in Makefile
- Check compute capability: `nvidia-smi`
- Verify C++ standard compatibility

Runtime issues:

- Check GPU memory availability
- Validate kernel launch parameters
- Review synchronization points
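For the launch-parameter and compute-capability points above, a common debugging pattern is to query device properties and check both launch-time and execution-time errors explicitly. This is a standalone sketch, not part of the harness:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy() {}

int main() {
    // Query compute capability (the same info nvidia-smi reports)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("sm_%d%d, %zu MB global memory\n",
           prop.major, prop.minor, prop.totalGlobalMem >> 20);

    // Launch, then check launch-time and execution-time errors separately
    dummy<<<1, 256>>>();
    cudaError_t launchErr = cudaGetLastError();       // bad launch config, etc.
    cudaError_t syncErr   = cudaDeviceSynchronize();  // faults during execution
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n",
                cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
        return 1;
    }
    return 0;
}
```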
To add new evaluation tasks:
- Create task directory: `eval-tasks/chXX-topic-variant/`
- Add required files (Makefile, README, student_kernel.cu, etc.)
- Write test harness with clear pass/fail criteria
- Update dataset JSONL with task metadata
- Test with reference implementation
- Submit PR to pmpp-eval repository
Evaluation tasks are distributed under the same license as the PMPP codebase.
- Issues: GitHub Issues
- Documentation: PMPP Environment README
- Author: Sinatras - GitHub · X