This repo is a CUDA programming portfolio project for CNN inference acceleration. The implementation was the second-fastest entry in the ECE408 CUDA programming competition.
- High-performance CUDA kernel design for convolution-heavy workloads.
- Practical performance engineering techniques for GPU compute.
- Host-side CUDA orchestration: allocation, transfer, launch, sync, and output collection.
- Kernel fusion: reduce launch overhead and global-memory round trips.
- Shared-memory tiling: increase data reuse and reduce DRAM traffic.
- Tensor Core (WMMA) compute: accelerate GEMM-like convolution workloads.
- Constant memory lookups: accelerate repeated reads of small read-only data.
- `__restrict__` pointers: improve compiler optimization opportunities.
- Loop unrolling: reduce loop/control overhead.
- Index reuse/flattening: reduce arithmetic and address-generation cost.
- Per-layer kernel tuning: choose launch geometry by layer shape.
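To make the index reuse/flattening bullet concrete, here is a minimal sketch (the extents and function name are illustrative, not taken from this repo): a 4D activation coordinate (n, c, h, w) in a contiguous NCHW buffer collapses to one linear offset, so each thread does a short multiply-add chain instead of repeated divisions and modulos, and consecutive `w` values land on consecutive addresses for coalesced access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tensor extents, not taken from this repo. */
#define CHN 3   /* channels */
#define HGT 4   /* height   */
#define WID 5   /* width    */

/* Flatten an (n, c, h, w) coordinate into a linear offset for a
   contiguous NCHW buffer: offset = ((n*CHN + c)*HGT + h)*WID + w. */
static size_t nchw_offset(size_t n, size_t c, size_t h, size_t w) {
    return ((n * CHN + c) * HGT + h) * WID + w;
}

int main(void) {
    /* Consecutive w values map to consecutive addresses, which is
       what gives coalesced global-memory access in a kernel. */
    assert(nchw_offset(0, 0, 0, 1) - nchw_offset(0, 0, 0, 0) == 1);
    /* Stepping one channel skips a full HGT*WID plane. */
    assert(nchw_offset(0, 1, 0, 0) == (size_t)(HGT * WID));
    return 0;
}
```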
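Several of the bullets above compose naturally in one kernel. The sketch below is illustrative only, under assumed sizes (`TILE`, `K`, and all names here are placeholders, not this repo's actual kernel): the filter sits in constant memory, each thread block stages its input patch in shared memory once and reuses it for all K×K taps, the pointers are `__restrict__`-qualified, and the tap loops are unrolled.

```cuda
#include <cuda_runtime.h>

#define TILE 16   // output tile edge (hypothetical)
#define K    5    // square filter width (hypothetical)

// Read-only filter weights in constant memory: broadcast through the
// constant cache when all threads of a warp read the same element.
__constant__ float c_mask[K * K];

// One TILE x TILE thread block computes one output tile of a valid
// 2D convolution over an H x W single-channel input.
__global__ void conv2d_tiled(const float* __restrict__ in,
                             float* __restrict__ out,
                             int H, int W) {
    __shared__ float patch[TILE + K - 1][TILE + K - 1];

    int out_x = blockIdx.x * TILE + threadIdx.x;
    int out_y = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of the (TILE+K-1)^2 input patch into shared memory.
    for (int y = threadIdx.y; y < TILE + K - 1; y += TILE)
        for (int x = threadIdx.x; x < TILE + K - 1; x += TILE) {
            int in_x = blockIdx.x * TILE + x;
            int in_y = blockIdx.y * TILE + y;
            patch[y][x] = (in_x < W && in_y < H) ? in[in_y * W + in_x] : 0.0f;
        }
    __syncthreads();

    if (out_x < W - K + 1 && out_y < H - K + 1) {
        float acc = 0.0f;
        #pragma unroll               // loop unrolling, per the bullet list
        for (int p = 0; p < K; ++p)
            #pragma unroll
            for (int q = 0; q < K; ++q)
                acc += patch[threadIdx.y + p][threadIdx.x + q] * c_mask[p * K + q];
        out[out_y * (W - K + 1) + out_x] = acc;
    }
}
```

The host would load the weights once with `cudaMemcpyToSymbol(c_mask, host_mask, K * K * sizeof(float))` before launching; every output element then reads its K×K neighborhood from shared memory rather than DRAM.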
- `cuda_conv_kernel.cu`
- `cuda_conv_interface.h`
- `cuda_conv_host_demo.cu`
```mermaid
flowchart LR
    A[Host Input Buffer] --> B[H2D Copy and Device Alloc]
    B --> C[Fused Conv CUDA Kernel]
    C --> D[Device Sync]
    D --> E[D2H Copy and Output Checksum]
```
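The pipeline above maps onto a short host-side program. This is a minimal sketch, not this repo's actual demo: `conv_kernel`, the buffer size, and the launch geometry are placeholder stand-ins.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the fused conv kernel; here it just copies input.
__global__ void conv_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 20;                 // hypothetical element count
    const size_t bytes = n * sizeof(float);
    float *h_in = new float[n], *h_out = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);              // device alloc
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // H2D copy

    conv_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);    // launch
    cudaDeviceSynchronize();               // device sync

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // D2H copy
    double checksum = 0.0;                 // output checksum
    for (int i = 0; i < n; ++i) checksum += h_out[i];
    printf("checksum = %.1f\n", checksum);

    cudaFree(d_in); cudaFree(d_out);
    delete[] h_in; delete[] h_out;
    return 0;
}
```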
```bash
nvcc -O3 -arch=sm_86 cuda_conv_host_demo.cu cuda_conv_kernel.cu -o cuda_conv_demo
./cuda_conv_demo
```

Full pipeline code, build wiring, milestones, and legacy assets are preserved under `archive/`.