This repository is a research and learning workspace for building a RISC-V based GPU architecture that avoids custom ISA extensions and instead relies on standard RISC-V features, most notably the RISC-V Vector extension where appropriate. The goal is to explore GPU building blocks, accelerator micro-architectures, and end-to-end flows using open tools.
Status: initial work focused on systolic-array based matrix-multiply building blocks (2x2 and 4x4 example designs), RTL testbenches, and simple simulation flows.
Contents
- Overview: goals and design philosophy.
- systolic_array/: starting point — example systolic array RTL for matmul and simulation flows.
- Roadmap: planned next steps for the GPU stack.
GPUs are complex systems built from well-understood building blocks: vector/SIMD units, memory hierarchies, data-movement engines, and computation accelerators. This repo aims to:
- Study those building blocks in isolation and in combination.
- Build hardware blocks using synthesizable RTL and open-source tools.
- Keep the ISA standard (RISC-V + Vector extension) to maximize portability.
Top-level structure (high-level):
- systolic_array/ — Systolic-array examples and simulation flows.
  - 2x2_matmul/ — 2x2 processing-element (PE) matrix multiply example.
    - rtl/ — Verilog sources for the PE and top-level systolic array.
    - simulation/ — Makefile and helper files to run iverilog/vvp simulations and produce waveforms.
  - 4x4_matmul/ — 4x4 PE example and testbench.
Prerequisites (recommended):
- iverilog + vvp (simulation)
- yosys (synthesis / linting)
- gtkwave or any VCD viewer (optional)
On macOS (Homebrew):
```sh
brew install icarus-verilog yosys gtkwave
```

Each example has a simulation/ directory with a Makefile to run the RTL tests and generate waveforms. Example:

```sh
# 2x2 example
cd systolic_array/2x2_matmul/simulation
make

# 4x4 example
cd ../../4x4_matmul/simulation
make
```

The Makefiles typically run `iverilog` to produce a `sim.vvp` binary and then run `vvp sim.vvp`, producing `wave.vcd`; open that VCD with `gtkwave wave.vcd`.
- Systolic arrays are used here as a compute substrate for dense matrix multiply (GEMM). They demonstrate dataflow-style mapping of compute onto a fixed mesh of PEs and are a common accelerator building block in GPUs (see the C dataflow sketch after this list).
- The long-term plan is to integrate these accelerators with a RISC-V CPU and expose work to them using standard mechanisms (memory-mapped queues, DMA engines, or offload semantics) while keeping the ISA itself standard-compliant.
- The RISC-V Vector extension is considered the primary ISA-level mechanism for expressing wide-data parallelism; accelerators (like systolic arrays) are complementary and can be targeted from vectorized code or from higher-level runtimes.
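To make the dataflow concrete, here is a small plain-C model of a 2x2 output-stationary systolic array computing C = A x B. It is only a behavioral sketch: the operand skew, register names, and cycle accounting are illustrative and are not lifted from the RTL in systolic_array/.

```c
/* Software model of a 2x2 output-stationary systolic array computing C = A * B.
 * A rows enter from the left, B columns enter from the top, one element per
 * cycle with a one-cycle skew between neighbors; each PE multiply-accumulates
 * and forwards its operands right and down. Illustrative only. */
#include <stdio.h>

#define N 2
#define CYCLES (3 * N - 2)  /* cycles until the last skewed product arrives */

int main(void) {
    int A[N][N] = {{1, 2}, {3, 4}};
    int B[N][N] = {{5, 6}, {7, 8}};
    int C[N][N] = {{0}};

    /* a_reg[i][j] / b_reg[i][j] hold the operands currently latched in PE(i,j). */
    int a_reg[N][N] = {{0}}, b_reg[N][N] = {{0}};

    for (int t = 0; t < CYCLES; t++) {
        /* Shift operands through the mesh: A moves right, B moves down
         * (update from the far edge backward so old values are forwarded). */
        for (int i = 0; i < N; i++)
            for (int j = N - 1; j > 0; j--) a_reg[i][j] = a_reg[i][j - 1];
        for (int j = 0; j < N; j++)
            for (int i = N - 1; i > 0; i--) b_reg[i][j] = b_reg[i - 1][j];

        /* Feed the boundary with skewed inputs: row i of A is delayed i cycles,
         * column j of B is delayed j cycles. */
        for (int i = 0; i < N; i++) {
            int k = t - i;
            a_reg[i][0] = (k >= 0 && k < N) ? A[i][k] : 0;
        }
        for (int j = 0; j < N; j++) {
            int k = t - j;
            b_reg[0][j] = (k >= 0 && k < N) ? B[k][j] : 0;
        }

        /* Every PE multiply-accumulates its current operands. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) C[i][j] += a_reg[i][j] * b_reg[i][j];
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%d ", C[i][j]);   /* expect: 19 22 / 43 50 */
        printf("\n");
    }
    return 0;
}
```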
The goal is to build a GPU whose compute cores execute standard RISC-V instructions (RV32IMV or RV64GV), avoiding custom ISA extensions; the standard Vector extension covers data parallelism.
A GPU consists of:
- Many parallel execution units — SIMT lanes organized into warps
- A warp/thread scheduler — manages thousands of concurrent threads
- Memory hierarchy — registers, shared memory, caches, global memory
- Command frontend — receives work from host, dispatches to compute units
- Integrate or build a minimal RV32IM core (candidates: PicoRV32, VexRiscv, or custom)
- Add Vector extension support (study Ara, Vicuna, or build minimal vector ALU)
- Test with simple vector programs: vector add, dot product, small matmul (see the example program after this list)
- Establish simulation and verification flow
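A first test program for this phase can be plain, portable C, for example the vector add and dot product below. It needs only a basic C runtime, so it can serve both as an early bring-up test and, once recompiled with the Vector extension enabled (e.g. -march=rv64gcv), as a golden reference for the vectorized path. The helper names are illustrative.

```c
/* Minimal first test program: vector add and dot product in plain C. */
#include <stdint.h>
#include <stdio.h>

#define LEN 16

static void vec_add(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

static int64_t dot(const int32_t *a, const int32_t *b, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];
    return acc;
}

int main(void) {
    int32_t a[LEN], b[LEN], c[LEN];
    for (int i = 0; i < LEN; i++) { a[i] = i; b[i] = 2 * i; }

    vec_add(a, b, c, LEN);
    printf("c[5]=%d dot=%lld\n", c[5], (long long)dot(a, b, LEN));
    return 0;
}
```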
- Instantiate multiple scalar pipelines that share instruction fetch (forming a warp)
- Build warp scheduler (round-robin or scoreboard-based)
- Implement divergence handling (masked execution for branches; see the masking sketch after this list)
- Add warp-level synchronization primitives
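As a behavioral illustration of masked execution, the sketch below runs both sides of a branch across all lanes of a warp and uses a per-lane active mask to decide which results commit. The warp width, struct layout, and reconvergence handling are assumptions for the example, not a description of a specific scheduler.

```c
/* Behavioral sketch of SIMT divergence handling via masked execution. */
#include <stdint.h>
#include <stdio.h>

#define WARP_SIZE 8

typedef struct {
    int32_t  reg[WARP_SIZE];   /* one register value per SIMT lane */
    uint32_t active;           /* bit i set => lane i commits results */
} warp_t;

/* Kernel-level pseudo-branch: if (x < 0) x = -x; else x = x * 2; */
static void execute_branch(warp_t *w) {
    uint32_t taken = 0;

    /* 1. Evaluate the condition in every active lane to build a predicate mask. */
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (((w->active >> lane) & 1) && w->reg[lane] < 0)
            taken |= 1u << lane;

    /* 2. Execute the "then" path under the taken mask... */
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if ((taken >> lane) & 1)
            w->reg[lane] = -w->reg[lane];

    /* 3. ...then the "else" path under the complementary mask. */
    uint32_t not_taken = w->active & ~taken;
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if ((not_taken >> lane) & 1)
            w->reg[lane] = w->reg[lane] * 2;

    /* 4. At the reconvergence point the original mask (w->active) is restored,
     *    so all lanes proceed together again. */
}

int main(void) {
    warp_t w = { .reg = {-3, 4, -1, 0, 7, -8, 2, -5}, .active = 0xFF };
    execute_branch(&w);
    for (int lane = 0; lane < WARP_SIZE; lane++)
        printf("%d ", w.reg[lane]);   /* expect: 3 8 1 0 14 8 4 5 */
    printf("\n");
    return 0;
}
```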
- Design banked register file (GPUs need large register files; see the banking sketch after this list)
- Add shared memory / scratchpad (fast, software-managed, per-workgroup)
- Implement L1 data cache or texture-cache style access
- Build global memory interface (AXI or similar to external DRAM)
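The sketch below illustrates the banking idea behind the register-file and scratchpad items above: word addresses map to banks, and lanes that collide on a bank in the same cycle must be serialized. Bank count, warp width, and the address-to-bank function are illustrative assumptions.

```c
/* Sketch of bank selection and conflict counting for a banked register file or
 * shared-memory scratchpad. Parameters are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 4
#define WARP_SIZE 8

static unsigned bank_of(uint32_t word_addr) {
    return word_addr % NUM_BANKS;   /* low bits of the word address pick the bank */
}

/* Cycles needed for one warp-wide access: the worst-case number of lanes that
 * collide on a single bank. */
static int cycles_for_access(const uint32_t word_addr[WARP_SIZE]) {
    int per_bank[NUM_BANKS] = {0};
    int worst = 1;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        int b = (int)bank_of(word_addr[lane]);
        if (++per_bank[b] > worst)
            worst = per_bank[b];
    }
    return worst;
}

int main(void) {
    uint32_t unit_stride[WARP_SIZE], bad_stride[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        unit_stride[lane] = lane;               /* banks 0,1,2,3,0,1,2,3 -> 2 cycles */
        bad_stride[lane]  = lane * NUM_BANKS;   /* all lanes hit bank 0  -> 8 cycles */
    }
    printf("unit stride: %d cycle(s)\n", cycles_for_access(unit_stride));
    printf("stride %d:    %d cycle(s)\n", NUM_BANKS, cycles_for_access(bad_stride));
    return 0;
}
```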
- Command processor: read work descriptors from memory (see the descriptor sketch after this list)
- Thread block / workgroup dispatcher: assign blocks to compute units
- Barrier / sync support (`__syncthreads()` equivalent)
- DMA engine for bulk data movement
- Minimal runtime to launch kernels from host
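The fragment below sketches what a memory-mapped launch path could look like: the host publishes a work descriptor into a ring and writes a doorbell register for the command processor. Every field, address, and register name here is a made-up placeholder, not a defined interface of this project; it is bare-metal-style code that compiles but cannot run on a host, since the addresses are invented.

```c
/* Hypothetical kernel-launch path: host writes a work descriptor into a
 * memory-mapped ring, then rings a doorbell; the command processor reads the
 * descriptor and dispatches workgroups. All layouts/addresses are assumptions. */
#include <stdint.h>

typedef struct {
    uint64_t kernel_pc;      /* entry point of the compiled RISC-V kernel   */
    uint64_t arg_ptr;        /* pointer to the kernel argument buffer       */
    uint32_t grid_dim[3];    /* workgroups per dimension                    */
    uint32_t block_dim[3];   /* threads per workgroup                       */
    uint32_t shared_bytes;   /* per-workgroup scratchpad to reserve         */
    uint32_t flags;
} work_descriptor_t;

/* Illustrative MMIO layout (purely made up for this sketch). */
#define CMD_QUEUE_BASE   0x40000000UL  /* ring of work_descriptor_t slots      */
#define CMD_QUEUE_SLOTS  16
#define DOORBELL_REG     0x40010000UL  /* write = index of last valid slot + 1 */

static void launch_kernel(const work_descriptor_t *desc, uint32_t slot) {
    volatile work_descriptor_t *ring =
        (volatile work_descriptor_t *)CMD_QUEUE_BASE;
    volatile uint32_t *doorbell = (volatile uint32_t *)DOORBELL_REG;

    ring[slot % CMD_QUEUE_SLOTS] = *desc;   /* 1. publish the descriptor        */
    __sync_synchronize();                   /* 2. fence so the write is visible */
    *doorbell = slot + 1;                   /* 3. notify the command processor  */
}
```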
- Use LLVM RISC-V backend with vector intrinsics for compilation
- Write example GPU kernels in C with RVV intrinsics (see the SAXPY sketch after this list)
- Simple benchmarks: SAXPY, GEMM, reduction
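For reference, a SAXPY kernel written with the RVV C intrinsics (the v1.0 `__riscv_*` API from `<riscv_vector.h>`, supported by recent Clang/GCC) might look like the following sketch; it is not code from this repository.

```c
/* SAXPY (y = a*x + y) with RVV 1.0 C intrinsics. The loop is strip-mined with
 * vsetvl, so it works for any n and any hardware vector length (VLEN). */
#include <riscv_vector.h>
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);              /* elements this iteration */
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);   /* load x[0..vl)           */
        vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);   /* load y[0..vl)           */
        vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);      /* vy += a * vx            */
        __riscv_vse32_v_f32m8(y, vy, vl);                 /* store back to y         */
        x += vl;
        y += vl;
        n -= vl;
    }
}
```

A kernel like this would typically be built with something like `clang --target=riscv64-unknown-elf -march=rv64gcv -O2 -c saxpy.c`; the exact target triple and arch string depend on the toolchain setup and are shown only as an illustration.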
| Project | Description |
|---|---|
| Vortex | Open-source RISC-V GPGPU — closest reference architecture |
| Ara | Full RVV 1.0 vector unit from ETH Zurich |
| Vicuna | Lightweight RVV core |
| VexRiscv | Configurable RISC-V in SpinalHDL |
| PicoRV32 | Tiny RV32 core, easy to understand |
✅ Done: Systolic array building blocks (2×2, 4×4 matmul PEs) — understanding dataflow compute
🔄 Next: Phase 1 — integrate a base RISC-V core and add vector support
Contributions are welcome. Suggested workflow:
- Open an issue describing the change or feature.
- Create a branch and send a PR with clear description and tests where applicable.
This repository does not include an explicit license file yet. If you want to add a license, consider a permissive license such as MIT or Apache-2.0.
Open an issue or create a discussion in this repository for questions, design discussions, or coordination.
This work uses open-source tools such as Icarus Verilog and Yosys for simulation and synthesis research.