Dense matrix multiplication accelerator for convolutional kernel computation using OpenMP, MPI, CUDA, and FPGA

ykozxy/parallel-matrix-multiplication

Parallel Matrix Accelerator

Overview

This repository hosts parallel computing labs exploring progressively more powerful accelerators for dense matrix multiplication and convolutional kernels. Each lab scales the same GEMM baseline across OpenMP, MPI, CUDA, and FPGA/HLS targets.
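
As a point of reference, the GEMM baseline that every lab starts from can be sketched as a plain triple loop. This is a minimal illustration, not the repository's code: the function name `gemm_naive`, the square row-major layout, and the `float` element type are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical baseline (name and signature are assumptions):
// computes C = A * B for square, row-major n x n matrices.
inline void gemm_naive(const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& c,
                       std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            // Dot product of row i of A with column j of B.
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}
```

The inner loop walks `b` with stride `n`, which is what makes this version cache-unfriendly and a useful lower bound for the accelerated kernels.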

Directory Map

| Path | Description |
| --- | --- |
| `common/` | Common toolchains. |
| `lab1-openmp-gemm/` | Multi-core acceleration: C++17 + OpenMP GEMM with blocked/streamed kernels. |
| `lab2-mpi-gemm/` | Multi-CPU acceleration: distributed GEMM that scatters tiles with MPI and supports blocking, buffered, and non-blocking communication. |
| `lab3-cuda-cnn/` | GPU acceleration: CUDA implementation of convolution/GEMM hybrids using shared-memory tiling. |
| `lab4-fpga-cnn/` | FPGA acceleration: MerlinCC/HLS kernels for FPGA emulation and AWS F1 synthesis. |
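
The "blocked" kernels in lab 1 and the shared-memory tiling in lab 3 rest on the same idea: work on small tiles so that operands are reused while they are still in fast memory. The sketch below shows that idea on the CPU with OpenMP. It is an illustration under assumptions, not the lab's kernel: the name `gemm_blocked`, the default tile size of 64, and the square row-major layout are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical blocked GEMM (name, signature, and default tile size are
// assumptions): C = A * B for square, row-major n x n matrices.
inline void gemm_blocked(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& c,
                         std::size_t n,
                         std::size_t tile = 64) {
    assert(tile > 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0f);

    // Each iteration of the outer loop owns a distinct band of rows of C,
    // so threads never write to the same element. Without OpenMP enabled
    // the pragma is skipped and the loop runs serially.
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                // Multiply one tile of A by one tile of B and accumulate
                // into the matching tile of C.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The `i, k, j` order inside a tile keeps the innermost accesses to `b` and `c` contiguous, which tends to vectorize well. The best tile size depends on the cache hierarchy of the host.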

Build & Run

  • OpenMP: `cd lab1-openmp-gemm && make -j && make test` benchmarks both parallel kernels against the baseline library. `make zip` regenerates the reports required by the course.
  • MPI: `cd lab2-mpi-gemm && make test np=4` (override `np` as needed). Switch between communication APIs at the top of `mpi.cpp`.
  • CUDA: `cd lab3-cuda-cnn && make cnn && . ./params.sh && ./cnn` runs the CNN benchmark; `make vadd && . ./params.sh && ./vadd` runs a micro-validation. `make test-seq` compares against the sequential host reference.
  • FPGA/HLS: `cd lab4-fpga-cnn && make KERNEL=cnn test` runs the fast simulator, `make estimate` pulls cycle counts from `merlin.rpt`, and `make KERNEL=dotprod` (or `KERNEL=vadd`) selects an alternate kernel. Scripts prefixed with `setup` configure the Merlin or Docker toolchains.
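
For the MPI lab, the `np` setting decides how the matrix is divided among processes. One common way to divide it, sketched below, is to give each rank a contiguous band of rows and spread any remainder over the first ranks. This is only an assumed scheme for illustration: the README does not say how the lab shapes its tiles, and `RowBand` and `partition_rows` are hypothetical names. The code makes no MPI calls; it only computes the kind of per-rank counts and offsets that a scatter such as `MPI_Scatterv` would be given (scaled by the row length).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the rows owned by one rank.
struct RowBand {
    std::size_t offset;  // index of the first row owned by the rank
    std::size_t count;   // number of rows owned by the rank
};

// Hypothetical helper: split n rows across np ranks as evenly as possible.
// The first (n % np) ranks receive one extra row.
inline std::vector<RowBand> partition_rows(std::size_t n, std::size_t np) {
    assert(np > 0);
    std::vector<RowBand> bands(np);
    const std::size_t base = n / np;
    const std::size_t extra = n % np;
    std::size_t offset = 0;
    for (std::size_t rank = 0; rank < np; ++rank) {
        const std::size_t count = base + (rank < extra ? 1 : 0);
        bands[rank] = RowBand{offset, count};
        offset += count;
    }
    return bands;
}
```

With `n = 10` and `np = 4`, the bands hold 3, 3, 2, and 2 rows starting at rows 0, 3, 6, and 8.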

Performance & Profiling Notes

  • Tile sizes, unrolling factors, and kernel variants are documented in each lab's `lab*-report.md`.
  • Speedups are measured relative to the naive single-core routine in `lab1-openmp-gemm/lib/gemm.cpp`, typically on 4096 × 4096 matrices, on Apple Silicon hosts and AWS F1 instances.
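
A speedup figure of this kind is the baseline time divided by the accelerated time. Throughput is usually quoted in GFLOP/s, counting 2n³ floating-point operations for an n × n GEMM. The helpers below are a minimal sketch of that arithmetic, with hypothetical names (`best_seconds`, `gemm_gflops`, `speedup`); they are not the labs' test harness, whose timing method is not described here.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: run a callable `repeats` times and return the
// fastest wall-clock time in seconds.
template <typename F>
double best_seconds(F&& run, int repeats = 3) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        run();
        const auto stop = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(stop - start).count();
        if (s < best) best = s;
    }
    return best;
}

// An n x n GEMM performs n^3 multiplies and n^3 adds, i.e. 2 * n^3 FLOPs.
inline double gemm_gflops(double n, double seconds) {
    return 2.0 * n * n * n / seconds / 1e9;
}

// Speedup of an optimized kernel relative to the baseline.
inline double speedup(double baseline_seconds, double optimized_seconds) {
    return baseline_seconds / optimized_seconds;
}
```

For example, a 4096 × 4096 multiply performs about 1.37 × 10¹¹ floating-point operations, so a kernel that finishes in one second sustains roughly 137 GFLOP/s.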

Cleaning Up

Run `make clean` inside any lab directory to remove the built binaries.
