This repository hosts parallel computing labs exploring progressively more powerful accelerators for dense matrix multiplication and convolutional kernels. Each lab scales the same GEMM baseline across OpenMP, MPI, CUDA, and FPGA/HLS targets.
| Path | Description |
|---|---|
| `common/` | Common toolchains. |
| `lab1-openmp-gemm/` | Multi-core acceleration: C++17 + OpenMP GEMM with blocked/streamed kernels. |
| `lab2-mpi-gemm/` | Multi-CPU acceleration: distributed GEMM that scatters tiles with MPI and supports blocking, buffered, and non-blocking communication. |
| `lab3-cuda-cnn/` | GPU acceleration: CUDA implementation of convolution/GEMM hybrids using shared-memory tiling. |
| `lab4-fpga-cnn/` | FPGA acceleration: MerlinCC/HLS kernels for FPGA emulation and AWS F1 synthesis. |
- OpenMP: `cd lab1-openmp-gemm && make -j && make test` to benchmark both parallel kernels against the baseline library. Reports required by the course are regenerated via `make zip`.
- MPI: `cd lab2-mpi-gemm && make test np=4` (override `np` as needed). Switch between communication APIs at the top of `mpi.cpp`.
- CUDA: `cd lab3-cuda-cnn && make cnn && . ./params.sh && ./cnn` for the CNN benchmark, or `make vadd && . ./params.sh && ./vadd` for micro-validation. `make test-seq` compares against the sequential host reference.
- FPGA/HLS: `cd lab4-fpga-cnn && make KERNEL=cnn test` to run the fast simulator, `make estimate` to pull cycle counts from `merlin.rpt`, and `make KERNEL=dotprod/vadd` for alternate kernels. Scripts prefixed with `setup` configure the Merlin or Docker toolchains.
- Tile sizes, unrolling factors, and kernel variants are documented in each `lab*-report.md`.
- Speedups are measured relative to the naive single-core routine in `lab1-openmp-gemm/lib/gemm.cpp`, typically using 4096² matrices on Apple Silicon hosts and AWS F1 instances.
Run `make clean` inside any lab directory to remove built binaries.