Skip to content

Latest commit

 

History

History
310 lines (235 loc) · 13.9 KB

File metadata and controls

310 lines (235 loc) · 13.9 KB

RV32IMA RISC-V SoC

RV32IMA RISC-V SoC — Boots Linux

A complete pipelined RV32IMA CPU with S-mode, Sv32 MMU, and peripherals — written entirely in Signal DSL. Boots Linux 6.6.0 via OpenSBI v0.9 on Verilator — both hand-written and auto-generated SystemVerilog reach the same kernel init point.

OpenSBI v0.9 — Platform: Sparkle RV32IMA SoC — ISA: rv32imasu
Linux version 6.6.0 ... #6 Thu Feb 26 06:29:23 UTC 2026
Machine model: Sparkle RV32IMA SoC
Memory: 26208K/28672K available (1279K kernel code, 465K rwdata, ...)
Hand-written SV Generated SV (from Lean)
UART output 3944 bytes 5250 bytes
Linux boot Kernel init Kernel init
Final PC 0xC013A9xx 0xC013A9xx
Firmware test ALL PASS ALL PASS
-- 122 registers in a single Signal.loop — full SoC in one function
-- State fields accessed by name via declare_signal_state macro
def rv32iSoCSimulate (firmware : BitVec 12 → BitVec 32) : Signal dom SoCState :=
  Signal.loopMemo fun state => rv32iSoCWithFirmwareBody firmware state

CPU Core

  • 4-stage pipeline: IF/ID/EX/WB with hazard detection and data forwarding
  • RV32IMA ISA: Full integer + M-extension (MUL/DIV/REM) + A-extension (LR.W/SC.W/AMO)
  • Multi-cycle divider: Restoring division algorithm in Signal DSL

Privilege & Virtual Memory

  • S-mode + M-mode: Full privilege separation with trap delegation (medeleg/mideleg)
  • Sv32 MMU: 4-entry TLB + hardware page table walker (PTW), megapage support
  • CSRs: mstatus, mie, mtvec, mepc, mcause, mtval, satp, sstatus, stvec, sepc, scause, stval, medeleg, mideleg, mcounteren, scounteren, PMP stubs

Peripherals

  • UART 8250: TX/RX with LSR/IER/LCR/DLL/DLM registers (Linux-compatible)
  • CLINT: Machine timer with mtime/mtimecmp
  • Memory: 32 MB DRAM (byte-addressable, sub-word load/store: LB/LH/LBU/LHU/SB/SH)

Signal DSL Ergonomics

Ergonomic operators and macros reduce boilerplate in hardware descriptions:

open Sparkle.Core.Signal

-- Hardware equality (replaces (· == ·) <$> a <*> Signal.pure val)
let isIdle := fsmReg === (0#4)

-- Bool operators (replaces (· && ·) <$> a <*> b etc.)
let startAndIdle := start &&& isIdle
let shouldFlush  := branchTaken ||| trap_taken

-- Implicit constant lifting via Coe (replaces Signal.pure)
let p3SavedNext := Signal.mux cond (true : Signal dom _) p3SavedReg

-- Hardware conditional macro (replaces deeply nested Signal.mux)
let fsmNext := hw_cond fsmReg            -- default value
  | startAndIdle  => (1#4 : Signal dom _)  -- first match wins
  | stemDone      => (2#4 : Signal dom _)
  | stageConvDone => (3#4 : Signal dom _)
  | isDone        => (0#4 : Signal dom _)

All features are synthesis-compatible — === expands to (· == ·) <$> a <*> b, hw_cond expands to nested Signal.mux calls, and Coe unfolds to Signal.pure.

Named State Accessors (declare_signal_state)

Eliminates error-prone magic-number indices for hardware state tuples:

-- Declare state with named fields, types, and defaults
declare_signal_state BottleneckState
  | fsmReg      : BitVec 2   := 0#2
  | residualReg : BitVec 8   := 0#8
  | resultReg   : BitVec 8   := 0#8
  | doneReg     : Bool        := false

-- Access fields by name (no more projN! state 122 47)
let fsmReg := BottleneckState.fsmReg state
let residualReg := BottleneckState.residualReg state

Generates a tuple type alias, synthesis-compatible accessor defs, default value, and Inhabited instance. The RV32 SoC uses this for all 122 registers — adding/removing a field no longer requires updating every index.

JIT Simulation — Transparent & Fast (~200x faster than Lean)

Two APIs: Signal API (drop-in replacement for loopMemo) and Streaming API (O(1) memory for long runs):

-- Signal API: same interface as loopMemo, JIT speed under the hood
let soc ← rv32iSoCJITSimulate (jitCppPath := "verilator/generated_soc_jit.cpp") (firmware := fw)
let out := soc.atTime 1000  -- SoCOutput with pc, uartValid, uartData, ...

-- Streaming API: 10M+ cycles with per-cycle callback, O(1) memory
rv32iSoCJITRun (jitCppPath := cppPath) (firmware := fw) (cycles := 10000000)
  (callback := fun cycle vals => do
    let out := SoCOutput.fromWireValues vals
    if out.uartValid then IO.println s!"UART: {out.uartData}"
    return true)  -- continue
  • Compile C++ to shared library at runtime, load via dlopen from Lean
  • Hash-based caching: recompilation skipped if source unchanged
  • Uses stable named output wires (_gen_pcReg, _gen_uartValidBV, etc.) — immune to DCE
  • 6 observable wires (via SoCOutput.wireNames), 11 memories, 6 input ports

Simulation Performance (10M cycles, Apple Silicon)

Backend Speed vs Verilator vs Lean
JIT evalTick (fused) 13.0M cyc/s 1.17x faster ~2600x
JIT evalTick + 6 wires 12.7M cyc/s 1.14x faster ~2540x
JIT eval+tick (separate) 12.7M cyc/s 1.14x faster ~2540x
Verilator 5.044 11.1M cyc/s 1.00x ~2220x
CppSim (-O3 AOT) 6.0M cyc/s 0.54x ~1200x
Lean loopMemo ~5K cyc/s 1x

JIT exceeds Verilator speed (1.17x faster) thanks to: (1) no mutex/thread overhead (Verilator 5.x wastes 17.4% on locks even single-threaded), (2) observable wire optimization (33 class members + 321 locals, L1-cache friendly), (3) fewer CPU instructions per sim-cycle, and (4) fused evalTick() with stack-local _next variables (eliminates ~260 intermediate memory ops/cycle). See docs/BENCHMARK.md for detailed benchmark results, profiling data, and bottleneck analysis.

JIT Cycle-Skipping — Dynamic Oracle

Detects when the CPU is stuck in a tight halt loop and skips forward by advancing the cycle counter + CLINT timer, achieving 706x effective speedup. The oracle receives the JITHandle directly for dynamic register reads, bulk memory ops (memsetWord), and direct state mutation:

import Sparkle.Core.Oracle
open Sparkle.Core.Oracle

-- Create oracle: detects PC stuck in ≤12-byte range for 50+ consecutive cycles
-- Oracle receives JITHandle per-call and handles all state mutations internally
let (oracle, statsRef) ← mkSelfLoopOracle {}

-- Run with oracle — 10M cycles complete in 9ms instead of 5.5 seconds
let cycles ← JIT.runOptimized handle 10_000_000 wireIndices oracle callback

-- Post-run: 9,998 triggers, 9,998,000 cycles skipped, UART output identical
let stats ← statsRef.get
IO.println s!"Skipped {stats.totalSkipped} cycles in {stats.triggerCount} triggers"

-- Bulk memory fill (e.g., BSS zeroing for Linux boot)
JIT.memsetWord handle memIdx addr 0 count
Metric Without Oracle With Oracle
Wall-clock time (10M cycles) ~770 ms 7 ms
Effective cyc/s 13.0M ~1.4 billion

BSS-Clear Speculative Warp: A custom inline firmware (7-instruction BSS-clear loop) demonstrates the full pattern — the oracle detects the memory-clearing loop, bulk-zeros all 4 DMEM byte banks via memsetWord, and skips ~100K cycles in <1 ms (389 triggers, 99,584 cycles skipped). See Tests/RV32/JITDynamicWarpTest.lean.

Speculative Simulation with Rollback: Full-state snapshot/restore via C++ default copy constructor enables guard-and-rollback speculation. The oracle snapshots state, speculatively applies bulk updates, checks guard conditions (e.g., timer interrupt), and rolls back if the guard fails — providing bit-accurate cycle-skipping even with interrupts. See Tests/RV32/JITSpeculativeWarpTest.lean.

Also includes register snapshot/restore API (130 registers), bulk memory API, and full-state snapshot/restore for state introspection:

Verilator Backend (~1000x faster)

  • Auto-generated SystemVerilog via #writeDesign — boots Linux (5250 UART bytes at 10M cycles)
  • Hand-written SystemVerilog reference at verilator/rv32i_soc.sv (3944 UART bytes at 10M cycles)
  • Both reach same kernel init region (PC 0xC013A9xx) — generated SV matches reference behavior
  • VCD waveform tracing for debugging
  • Firmware test mode + OpenSBI/Linux boot mode
# Lean simulation
lake test  # Includes RV32 simulation tests

# JIT simulation from Lean (compile + load + run)
lake exe rv32-jit-test verilator/generated_soc_jit.cpp firmware/firmware.hex 5000

# JIT loop test (loopMemoJIT Signal API + Streaming API)
lake exe rv32-jit-loop-test verilator/generated_soc_jit.cpp firmware/firmware.hex 5000

# JIT cycle-skip test (register snapshot/restore roundtrip)
lake exe rv32-jit-cycle-skip-test

# JIT oracle test (self-loop detection, 10M cycles with cycle-skipping)
lake exe rv32-jit-oracle-test

# JIT dynamic warp test (memsetWord + dynamic oracle with JITHandle access)
lake exe rv32-jit-dynamic-warp-test

# JIT speculative warp test (snapshot/restore + guard-and-rollback)
lake exe rv32-jit-speculative-warp-test

# CppSim (standalone C++)
cd verilator && make build-cppsim && make run-cppsim

# Verilator simulation (much faster)
cd verilator && make build
./obj_dir/Vrv32i_soc ../firmware/firmware.hex 500000  # Firmware tests

# Linux kernel boot
./obj_dir/Vrv32i_soc ../firmware/opensbi/boot.hex 10000000 \
    --dram /tmp/opensbi/build/platform/generic/firmware/fw_jump.bin \
    --dtb ../firmware/opensbi/sparkle-soc.dtb \
    --payload /tmp/linux/arch/riscv/boot/Image

SV→Sparkle Transpiler {#sv-transpiler}

Parse existing SystemVerilog RTL (e.g., PicoRV32) into Sparkle IR and execute via JIT — no Verilator required.

Pipeline

picorv32.v + picorv32_soc_m.v
    → [SV Parser]    Parse to SV AST (250 items, 32 always blocks)
    → [Lower]         MemorySSA sequential emitter + MUX for always @*
    → [Flatten]       Inline sub-modules with parameter overrides
    → [CppSim]        Generate C++ simulation class
    → [GCC -O2]       Compile to shared library
    → [JIT dlopen]    Load and simulate from Lean

C Firmware Execution

GCC-compiled bare-metal firmware runs on the transpiled PicoRV32 SoC with M-extension:

# Build RV32I firmware
cd firmware
riscv32-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 -nostdlib \
    -T link.ld -o firmware_rv32i.elf boot_rv32i.S main_rv32i.c

# Build RV32IM firmware (MUL/DIV/REM)
riscv32-none-elf-gcc -march=rv32im -mabi=ilp32 -O2 -nostdlib \
    -T link.ld -o firmware_rv32im.elf boot_rv32i.S main_rv32im.c

# Run 34 tests via Sparkle JIT (no external dependencies)
lake exe svparser-test
Test Expected Actual
Fibonacci 0,1,1,2,3,5,8,13,21,34 ✓ exact match
Array Sum 360
Bubble Sort 3,8,17,42,55,99 ✓ exact match
GCD 6,25,1 ✓ exact match
MUL 7*6 42
MUL 12345*6789 83810205
DIV 42/6, 100/7 7, 14
REM 42%6, 100%7 0, 2
Factorial 10! 3628800
Store/Load 0x12345678 0x12345678
Final (RV32I) 0xCAFE0000 ALL PASSED
Final (RV32IM) 0xCAFE0000 ALL PASSED

34 CI-Safe JIT Tests

All tests embed Verilog as string literals — no external file dependencies:

Tests Description
1-6 Parser, lowering, round-trip, JIT simulation
7-11 PicoRV32 parse, SoC "Hello", C firmware (requires /tmp files)
12-16 IR regression: nested if/else, case(1'b1) priority, concat-LHS, for-loop SSA, array read
17-21 pcpi_mul: IR structure, SSA chain, topo sort, JIT 76/100100/12345*6789/consecutive/wrapper
22-26 JIT pairs: if/else+register, array read, case decoder, read-overwrite, for-loop accumulator
27-29 Bit replication {4{en}}, byte-lane write/read, multi-bit replication {2{din}}

Key Algorithms

  • MemorySSA Sequential Emitter: Process always @* statements top-to-bottom, creating SSA wires for each variable write. Correctly handles read-then-overwrite patterns that cause cyclic dependencies under pure MUX.
  • If-Conversion (Guarded Assignments): Walk statement tree tracking guard conditions; emit flat priority mux chain with !covered guard for first-match-wins in case(1'b1).
  • Generate Block Evaluation: Resolve generate if (PARAM) using parameter defaults + overrides. 2-pass architecture for nested hierarchies.
  • Concat-LHS Bit Scatter: Handle {next_rdt[j+3], next_rd[j+:4]} = expr via read-modify-write decomposition with __RMW_BASE__ placeholder resolved by stmtsToMuxExprBlocking.
  • Byte-Strobe Memory: Detect if(wstrb[N]) memory[addr][hi:lo] <= wdata[hi:lo] and generate read-modify-write with per-byte mask and proper shift.
  • Bit Replication: {N{expr}}(0-val) & mask for 1-bit, concat for multi-bit.
  • For-Loop Unrolling: Static unroll with parameter substitution. MemorySSA handles sequential dependencies natively.
  • $signed Sign Extension: SHL+ASR pattern with min-32-bit types and 64-bit promotion to prevent C++ UB.

Inline Formal Verification (verilog! Macro)

Write Verilog directly in Lean, prove properties at compile time:

import Tools.SVParser.Macro

verilog! "
module counter8_en (input clk, input rst, input en, output [7:0] count);
    reg [7:0] count_reg;
    assign count = count_reg;
    always @(posedge clk) begin
        if (rst) count_reg <= 0;
        else if (en) count_reg <= count_reg + 1;
        assert(rst ? (count_reg == 0) : 1);  // auto-proved!
    end
endmodule
"

open counter8_en.Verify

-- These theorems prove against the auto-generated nextState
theorem counter_holds (s : State) :
    nextState s { rst := 0, en := 0 } = s := by simp [nextState]

-- auto_assert_0 was auto-generated and auto-proved by bv_decide
-- Change the assert condition → instant error in editor

Features:

  • verilog! macro: Parse Verilog at elaboration time, inject State/Input/nextState
  • assert(cond): Auto-generate theorem and prove via simp [nextState]; bv_decide
  • Instant feedback: Edit Verilog, save → proofs re-check in milliseconds
  • No Lean knowledge needed: Write standard Verilog assertions, get mathematical proofs