Skip to content

Feature/rv32#12

Merged
junjihashimoto merged 53 commits intomainfrom
feature/rv32
Mar 26, 2026
Merged

Feature/rv32#12
junjihashimoto merged 53 commits intomainfrom
feature/rv32

Conversation

@junjihashimoto
Copy link
Copy Markdown
Contributor

No description provided.

Add Sparkle/Verification/RV32Props.lean with 36 formally proved
theorems covering:

- Opcode encode/decode roundtrip + injectivity (2)
- ALUOp encode/decode roundtrip + injectivity (2)
- R-type field extraction: opcode, rd, funct3, rs1, rs2, funct7 (6)
- Immediate roundtrip: I-type, U-type, S-type (3)
- ALU algebraic properties: AND-zero, SUB=ADD+neg, SLT/SLTU irrefl,
  SRL/SRA-zero identity (6)
- Instruction format classification: totality + 6 specific (7)
- Decoder control signals: LOAD/STORE/ALU/BRANCH/JAL/LUI (8)
- Branch evaluation: BEQ/BNE reflexivity (2)

All proofs use cases/simp/native_decide/bv_decide — no sorry.
Add encodeB/encodeJ helpers and prove the two trickiest immediate
format roundtrips in RV32I:

- b_type_imm_roundtrip: {inst[31], inst[7], inst[30:25], inst[11:8], 0}
  correctly encodes and decodes with sign extension
- j_type_imm_roundtrip: {inst[31], inst[19:12], inst[20], inst[30:21], 0}
  correctly encodes and decodes with sign extension

Both require the constraint that bit 0 = 0 (halfword alignment).
Proved via bv_decide (SAT-based BitVec decision procedure).

Total: 38 theorems, zero sorry.
Add Sparkle/Verification/PipelineProps.lean with pure Lean models
of the pipeline's critical invariants that guard against the class
of bugs found during Linux boot development:

Forwarding (5):
- forward_never_x0: WB→EX bypass never activates for x0
- forward_implies_match: active forwarding implies address match
- forward_implies_wb_en: active forwarding implies WB enabled
- wb_disable_x0: x0 destination disables write-back
- wb_disable_no_write: regWrite=false disables write-back

Hazard Detection (5):
- no_hazard_no_load: no stall without load in EX
- no_hazard_x0_dest: no stall for x0 destination
- no_hazard_no_match: no stall when registers don't match
- hazard_rs1_match: stall when load rd = ID rs1
- hazard_rs2_match: stall when load rd = ID rs2

Flush/NOP (5):
- branch/jump/trap each cause flush
- no_flush_idle: idle state doesn't flush
- flush_comm: flush conditions are commutative

x0 Invariance (3):
- x0_always_zero: reading x0 always returns 0
- forward_priority: forwarding beats register file
- regfile_when_no_fwd: register file used without forwarding

Store-to-Load Forwarding (5):
- disabled/match/no_match/byte_offset/adjacent_no_match

Decoder Invariants (3):
- load_alu_is_add, store_no_regwrite, branch_no_regwrite
Add Sparkle/Verification/CSRProps.lean with pure Lean models of CSR
operations that formally demonstrate specification violations:

BUGS FOUND (proved as theorems):
- mstatus_wpri_bug_csrrs: CSRRS sets reserved WPRI bits (bit 1)
- mstatus_wpri_bug_csrrw: CSRRW overwrites all WPRI bits
- mstatus_all_bits_writable_bug: CSRRS 0xFFFFFFFF sets all 32 bits
  but only bits 3,7,11,12 (MIE/MPIE/MPP) should be writable
- csrrs_zero_value_correct_but_write_not_suppressed: hardware
  csrDoWrite is active even when rs1=x0 (spec violation)

CORRECT behavior proved (masked implementation):
- mstatus_wpri_correct_csrrs: with mask, bit 1 stays 0
- mstatus_masked_only_legal: only legal fields are settable

Trap/MRET transitions (verified correct):
- trap_mie_to_mpie: MIE→MPIE on trap entry
- mret_mpie_to_mie: MPIE→MIE on MRET
- trap_mret_roundtrip_mie: MIE preserved across trap→MRET cycle
- trap_preserves_non_mstatus_bits: no WPRI bits set by trap

M-extension edge cases (verified correct):
- div_int_min_neg1: INT_MIN / -1 = INT_MIN
- rem_int_min_neg1: INT_MIN % -1 = 0
- div_by_zero, rem_by_zero, divu_by_zero, mulh_neg1_neg1
Prove that the synthesizable Signal DSL hardware implementations
produce the same results as the pure Lean reference specifications.

This bridges the gap between "spec is correct" and "hardware matches
spec" — the Signal DSL IS the hardware (generates Verilog), and
these proofs show it computes identical values.

Key technique: Signal dom α = Nat → α. We provide 17 @[simp] lemmas
that push .val through all Signal combinators (mux, beq, +, -, &,
|, ^, <<<, >>>, slt, ult, ashr, ~~~, register), reducing Signal
expressions to pure BitVec computations. Then simp closes the goal.

ALU equivalence (8): ADD, SUB, AND, OR, XOR, SLL, SRL, PASS
  aluSignal (pure op) (pure a) (pure b) .val 0 = aluCompute op a b

Branch equivalence (3): BEQ (universal), BNE (concrete ×2)
  branchCompSignal (pure f3) (pure a) (pure b) .val 0 = evalBranch f3 a b

Hazard detection (4): concrete hit/miss/no-load/x0
  hazardSignal (pure ...) .val 0 = expected

Register semantics (2): init at t=0, delay-by-one at t+1
…+ CI test

Oracle improvements (Sparkle/Core/Oracle.lean):
- MIE/MTIE guard: check MSTATUS.MIE and MIE.MTIE before timer-compare
  skip. Prevents skipping when interrupts are disabled (the timer
  interrupt wouldn't fire, so skipping would be incorrect).
- WFI fast-path: optional wfiWireArrayIdx triggers immediate skip when
  WFI instruction is detected (threshold=1 instead of default 50).
- S-mode fields: mstatusRegIdx, mieRegIdx for privilege-aware checking.
- mkBootOracle now enables checkInterruptEnable by default.

Self-contained CI test (Tests/RV32/OracleAccuracyTest.lean):
- Test 1: Halt loop detection — oracle triggers on firmware halt loop
- Test 2: Timer advance accuracy — set mtimecmp=mtime+5000, verify skip
- Test 3: MIE guard — disable interrupts, verify oracle does NOT skip
- Test 4: MTIE guard — disable timer interrupt, verify oracle does NOT skip
All 4 tests pass.

Added to CI workflow and lakefile.
…igation

Investigation findings:
- SVParser successfully PARSES PicoRV32 with ENABLE_MUL=1, ENABLE_DIV=1
  (152 registers vs 148 for RV32I-only)
- However, execution FAILS: even RV32I firmware produces no UART output
  on the M-ext SoC, indicating the PCPI sub-module flattening breaks
  the signal routing
- Root cause: the generate-if blocks for ENABLE_MUL instantiate
  picorv32_pcpi_mul/picorv32_pcpi_div sub-modules whose state machines
  are not correctly integrated into the flattened design
- No new IR operators needed: PicoRV32's MUL/DIV uses shift-and-add
  algorithms with standard +/-/shift operators
- Fix requires: debugging the sub-module flattening in Lower.lean for
  the PCPI interface (pcpi_wr, pcpi_rd, pcpi_wait, pcpi_ready signals)

Status changed from "Not started" to "Investigated".
Three root-cause fixes in flattenDesign (Lower.lean):

1. Parameter override propagation (was: generate-if used defaults only)
   - lowerModule now accepts paramOverrides argument
   - flattenDesign re-lowers sub-modules with parameter overrides from
     the instantiation (#(.ENABLE_MUL(1)) etc.)
   - generate-if blocks now evaluate with correct parameter values

2. Complex output port connections (was: silently dropped)
   - Output ports connected to non-identifier expressions (array index,
     bit slice, concat) were silently skipped with `| _ => pure ()`
   - Now creates a temporary wire and assigns the sub-module output to it
   - Parent module can read from the temporary wire

3. Missing sub-module handling (was: kept as dangling .inst)
   - Missing sub-modules now emit a warning wire instead of keeping
     an un-flattenable .inst statement that would break CppSim

Note: PicoRV32 M-extension (ENABLE_MUL=1) parsing succeeds and
flattening produces a single module, but execution still fails due
to PCPI handshake signal routing issues. Further debugging needed.

All 11 SVParser tests + 504 unit tests pass (no regressions).
Root cause analysis and fixes for sub-module wiring failures:

1. Parser: parse #(.PARAM(val)) instead of skipping (was discarded)
   - SVModuleItem.instantiation now carries paramOverrides field
   - Parser reads named parameter values from #(...) syntax

2. Lowering: apply parameter overrides to constant assignments
   - lowerModule accepts paramOverrides argument
   - Parameter values emitted as correct constants (not defaults)
   - generate-if blocks expanded with overridden values

3. Flattening: 2-pass architecture for nested hierarchies
   - flattenDesign gets SVDesign for re-lowering with overrides
   - Extracts parameter overrides from SV AST at flatten time
   - Iterative loop (up to 5 passes) until no .inst remains
   - Re-adds original module map for each iteration

4. IR stays clean: no parameter info in Stmt.inst
   - Parameter resolution is purely a parser/lowering concern
   - IR sees only resolved port connections

Status: PicoRV32 MUL sub-module (picorv32_pcpi_mul) now correctly
inlined with ENABLE_MUL=1. DIV module fails because the parser
skips module definitions after picorv32_pcpi_fast_mul (separate
parser bug, not a flattening issue).

All 11 SVParser tests + 504 unit tests pass.
Parser fix: add unary minus (negation) to parseUnary.
Previously, -expr in Verilog (e.g., -pcpi_rs1, -quotient) caused
parse failure, which silently stopped module parsing. This was the
root cause of picorv32_pcpi_div being invisible to the flattener.

Before: 5 modules parsed (stopped at picorv32_pcpi_fast_mul)
After:  9 modules parsed (all PicoRV32 sub-modules visible)

M-extension status: PicoRV32 with ENABLE_MUL=1, ENABLE_DIV=1 now
parses, lowers, flattens (183 registers), and compiles to JIT C++.
RV32I firmware execution still hangs — PCPI handshake stall issue
remains (pcpi_int_wait blocks CPU instruction fetch).

All 11 SVParser tests + 504 unit tests pass.
Traced PicoRV32 M-extension failure to mem_state FSM stuck at 0.
if-conversion generates wrong guard priority: outer reset guard
(mem_state<=0) overrides inner case branch (mem_state<=1).
Root cause: PicoRV32 uses Verilog macro invocations like:
  \`assert(mem_wstrb == 0);
  \`assert(mem_do_prefetch || mem_do_rinst || mem_do_rdata);

These backtick-assert lines were not stripped by the preprocessor,
causing the parser to fail inside case arms. This silently corrupted
the case statement structure, which in turn caused if-conversion to
generate wrong guard priorities for mem_state FSM transitions.

The fix: add \`assert to the preprocessor skip list (1 line change).

Result: RV32I firmware now executes correctly on PicoRV32 SoC with
ENABLE_MUL=1, ENABLE_DIV=1 (183 registers, 26 UART words — identical
to RV32I-only SoC output).

All 11 SVParser tests + 504 unit tests pass.
M-extension firmware now runs to completion (pass marker reached):
- MUL 7*6 returns 0 (expected 42) — PCPI mul sub-module result bug
- DIV/REM propagate the wrong MUL result
- RV32I firmware on M-ext SoC: PASS (26 words, identical to RV32I SoC)

The PCPI handshake works (CPU doesn't hang), but picorv32_pcpi_mul
returns incorrect computation results. Likely a register/shift issue
in the flattened mul sub-module.
Two fixes for always @* combinational block processing:

1. collectGuardedBlock: add forLoop handling (was silently ignored)
   - for loops inside always @* now have their body collected
   - Matches collectGuardedNB which already handled forLoop

2. tryEvalConst: constant folding for if-else conditions
   - eq(const, const) and not(const) are evaluated at collect time
   - When condition is statically true/false, only one branch is taken
   - Applied to both collectGuardedNB and collectGuardedBlock

3. Parameter substitution in SV AST:
   - substParamExpr/substParamStmt: recursive parameter→literal replacement
   - substituteParamsInItem: applied to always blocks and cont assigns
   - Parameters in if conditions inside always @* are now constants

MUL result still incorrect due to CARRY_CHAIN=4 default using [j+:W]
(variable part-select) syntax which is not yet parsed. The CARRY_CHAIN=0
workaround generates a broken for loop (j=j+0 infinite loop).
Next step: implement [base +: width] part-select parsing.

All 11 SVParser tests + 504 unit tests pass.
Three new capabilities for Verilog-2001 support:

1. Part-select parsing ([base +: width]):
   - New SVExpr.partSelectPlus variant in AST
   - Parser recognizes [expr +: expr] after [ in parsePostfix
   - Lowering: (expr >> base) & ((1 << width) - 1)

2. For-loop unrolling (SV AST level):
   - unrollForLoops: statically unroll for loops with constant bounds
   - Loop variable substituted in body for each iteration
   - Recursive unrolling for nested loops
   - Applied after parameter substitution in substituteParamsInItem

3. Constant folding in if-conversion:
   - tryEvalConst: evaluate eq(const, const) and not(const)
   - collectGuardedNB/Block: when if-condition is compile-time constant,
     take only the true/false branch (eliminates dead code)

MUL status: PicoRV32 pcpi_mul uses concat-LHS blocking assignment
({next_rdt[j+3], next_rd[j +: 4]} = add_expr) inside the unrolled
carry-save loop. This concat-LHS in always @* is not yet handled
by collectGuardedBlock (only collectGuardedNB has concat-LHS support
via lowerConcatLhsAssign). The next_rdt=0 initialization survives
as the final value because the loop body assignments are silently
dropped.

Next step: add concat-LHS blocking assignment support to
collectGuardedBlock (bit-scatter decomposition for always @*).

All 11 SVParser tests + 504 unit tests pass.
1. decomposeMultiConcatLhs: decompose {a[hi], b[lo+:w]} = rhs into
   per-variable assignments. Each target gets the corresponding
   bit-field from the RHS expression via slice + shift.

2. collectGuardedBlock: use decomposeMultiConcatLhs as fallback when
   both exprToName and lowerConcatLhsAssign fail on blockAssign LHS.

3. collectBlockNamesTop: extract variable names from concat-LHS
   elements (ident, index, slice, partSelectPlus). Also recurse
   into forLoop bodies.

Verified working with simple [hi:lo] slice concat-LHS (nb[3], na[3:0]).
PicoRV32 pcpi_mul uses [base +: width] in concat-LHS, which the
collectBlockNamesTop recognizes but the blocking-assign concat
parser doesn't see (parser issue with LHS concat containing +: syntax).

Next: debug why {nb[3], na[0 +: 4]} = expr is not parsed as concat-LHS
blockAssign — likely the parser's LHS expression handling doesn't
recognize concat with partSelectPlus elements.

All 11 SVParser tests + 504 unit tests pass.
Major progress on M-extension support:

1. Fixed [base +: width] parsing: use parsePrimary (not parseExpr) for
   base to prevent + being consumed as addition operator

2. Added svExprToNat: evaluate constant SVExpr expressions (add/sub/mul)
   for concat-LHS index resolution after forLoop unroll (e.g., j+CARRY_CHAIN-1
   with j=0, CARRY_CHAIN=4 → 3)

3. Read-modify-write pattern for concat-LHS decomposition:
   value = (old & ~mask) | (shifted & mask)
   This preserves other bit fields when multiple loop iterations
   update different fields of the same register

Result: MUL 7*6 = 40 (was 0, expected 42). Off by 2 — likely a
carry propagation edge case in the carry-save accumulator decomposition.
The carry-save pattern is fundamentally working now.

All 11 SVParser tests + 504 unit tests pass.
Implement SSA renaming for blocking assignment loops:
- Variables written in loop body get iteration-specific names
  (e.g., next_rd → next_rd_ssa0_0, next_rd_ssa0_1, ...)
- Prologue captures initial values, epilogue writes back final
- Depth parameter distinguishes nested loops (ssa0_, ssa1_, ...)
- SSA wires auto-declared when name contains _ssa

Status: Standalone multiplier test still fails due to nested SSA
complexity. The outer loop (32 iters) and inner loop (16 iters)
create deeply nested SSA chains where the inner loop's variables
get double-renamed (next_rd → next_rd_ssa0_k → next_rd_ssa0_k_ssa1_j).
The epilogue's _ssa0_32 reference isn't declared because it only
exists as the write-target of the last iteration.

Next step: ensure epilogue SSA vars are also declared as wires,
or simplify by only applying SSA to the innermost loop (which is
the one with the self-referential carry-save accumulator pattern).

All 11 SVParser tests + 504 unit tests pass.
Major SSA improvements for carry-save accumulator support:

1. Self-referential-only SSA: only variables appearing in both LHS and
   RHS of loop body get SSA renaming (reduces unnecessary renaming)

2. Recursive LHS rename (renameLhsOnly): correctly renames write-side
   of blockAssign even inside nested ifElse/forLoop structures

3. OR-merge for same-guard assignments: concat-LHS bit-field writes
   to the same variable are OR-combined instead of mux-overwritten

4. SSA base resolution: _ssa0_N uses _ssa0_{N-1} as base value instead
   of self-referencing (eliminates circular dependency)

5. Auto-declare all SSA wires from both LHS and RHS scan

Status: SSA chain is correctly formed (next_rd_ssa0_1 = next_rd_ssa0_0
instead of self-reference). However OR-merge of 16 concat-LHS
bit-field decompositions doesn't reach the guardedToMux filtered list.
The concat-LHS decomposition inside nested ifElse+unrolled forLoop
needs investigation — the guardedAssigns may not be collected from
the deeply nested structure.

All 11 SVParser tests + 504 unit tests pass.
SSA prologue fix: rd_ssa0_0 = rd no longer creates circular dependency
with rd's epilogue assignment. The topoSort now recognizes _ssa*_0
assignments as reading the initial value, not the final value.

Result: SSA chain evaluates in correct order (rd_ssa0_0 first).
However rd_ssa0_2 still evaluates before rd_ssa0_1 because rd_ssa0_1
depends on rdx, which depends on rdt, which creates an indirect cycle.

Root cause: stmtsToMuxExprBlocking merges ALL blocking assignments
to a variable into a single mux expression, losing the sequential
evaluation semantics of always @*. Variables like rdx have both an
initialization (rdx = 0) and a post-loop update (rdx = rdt << 1),
but the mux expression only preserves the final value.

The fundamental issue is that Sparkle's IR represent combinational
logic as simultaneous assignments, but Verilog's always @* blocks
have sequential semantics for blocking assignments. A full solution
requires either:
1. Sequential evaluation in CppSim (respecting statement order)
2. Full SSA conversion of ALL blocking assignments (not just loops)

All 11 SVParser tests + 504 unit tests pass.
Three critical fixes that enable the carry-save accumulator to work:

1. Sequential emit for always @* blocks (emitBlockingStmtsSequential):
   Convert blocking assignments to IR assigns in statement order instead
   of mux-ifying them. This preserves the sequential evaluation semantics
   where later statements read the results of earlier ones.
   Stored in seqBody (not body) to bypass topoSort reordering.

2. SVExpr width for [base +: width] part-select:
   Width is now kept as SVExpr (not Nat) in the AST, so parameter
   substitution (CARRY_CHAIN → 4) is applied BEFORE width evaluation.
   Previously width was evaluated at parse time when CARRY_CHAIN was
   still an identifier, resulting in width=1 (fallback).

3. Concat-LHS OR-accumulate in sequential emit:
   Each concat-LHS decomposition appends `name = name | shifted_bits`
   instead of overwriting. Multiple iterations (j=0,4,...) each add
   their bit-field to the running value via OR.

4. Exclude part-select variables from SSA renaming:
   Variables accessed via [base +: width] in concat-LHS have non-
   overlapping bit-field writes across loop iterations, so SSA
   renaming is unnecessary and harmful (breaks OR-accumulation).

Result: 8-bit standalone carry-save multiplier passes all tests!
  mul8(1,6)=6 ✓  mul8(3,5)=5 ✓  mul8(7,1)=1 ✓

All 11 SVParser tests + 504 unit tests pass.
… type

Standalone carry-save multiplier (8-bit) passes all tests.
Full PicoRV32 M-ext compilation fails: cpu_cpuregs is declared as
uint64_t (scalar) but accessed as cpuregs[idx] (array). The register
file (reg [31:0] cpuregs [0:31]) needs to be recognized as memory
in CppSim when ENABLE_MUL=1 adds direct cpuregs[] write paths.
When a memory (e.g., cpuregs[0:31]) is flattened into the top module,
CppSim would also declare a local scalar variable with the same name
in eval(), shadowing the std::array member. This caused compilation
errors like 'invalid types uint64_t[uint8_t] for array subscript'.

Fix: collect memory names from IR .memory statements and exclude them
from localWires declarations.

Result: PicoRV32 M-ext SoC (183 registers) now compiles to JIT C++.
Firmware execution still produces 0 UART words — likely the sequential
always @* emit causes evaluation order issues in the larger CPU FSM
where some blocking assignments interact with the posedge always block.

All 11 SVParser tests + 504 unit tests pass.
- Use sequential emit only for always @* blocks containing concat-LHS
  (carry-save accumulator pattern). Other always @* blocks use mux mode.
- hasConcatLhs: recursive check for concat-LHS in nested ifElse/forLoop
- seqBody placed after topoSorted body (not before)

RV32I-only SoC: PASS (11/11 tests). M-ext SoC: compiles but RV32I
firmware produces 0 UART words. The regression from the backtick-assert
fix commit is likely caused by partSelectPlus width SVExpr change or
other cumulative changes affecting the PicoRV32 CPU FSM evaluation.

All 11 SVParser tests + 504 unit tests pass.
Removed constant folding from both collectGuardedNB (non-blocking)
and collectGuardedBlock (blocking) in if-conversion. The constant
folding was changing guard priority in the decoder's case statements,
causing incorrect instruction decode (instr_addi=0 for ADDI).

Constant folding is preserved only in emitBlockingStmtsSequential
(always @* sequential emit) where it's needed for carry-save
accumulator's CARRY_CHAIN==0 branch elimination.

M-ext SoC still fails (0 UART words) — the decoder regression may
have a different root cause beyond tryEvalConst. Needs comparison
of instr_addi_next mux expression between working (ef6ef74) and
current version.

All 11 SVParser tests + 504 unit tests pass.
Major architectural changes to the SVParser→IR lowering:

- Remove emitBlockingStmtsSequential entirely: all always @* blocks
  now use the MUX-based collectGuardedBlock path
- Fix sigNames filter to use collectBlockNamesTop (recursive) instead
  of flat .any check — fixes cpuregs_rs1/rs2 being silently dropped
  when nested inside if(ENABLE_REGS_DUALPORT)
- Fix processCaseArms with !covered guard for first-match-wins
- Remove OR-merge hack from stmtsToMuxExprBlocking
- decomposeMultiConcatLhs now produces read-modify-write with
  __RMW_BASE__ placeholder, resolved by stmtsToMuxExprBlocking to
  the correct SSA base (previous iteration's output)
- SSA rename all write variables in loops (including part-select)
  with nested SSA support (stripSsa base-name unification)
- Fix unrollForLoops to use recursive unroll result (was discarding it)
- Fix ssaBase to handle nested SSA (use last _ssa segment)
- Add 64-bit promotion in decomposeMultiConcatLhs to avoid C++ UB
  on shifts >= 32 bits
- Add collectRefs before guardedToMux for proper scoping
- Skip outer SSA for single-iteration loops (numIters <= 1)

Results: 11/11 SVParser tests pass, RV32I firmware on M-ext SoC
produces all 26 UART words correctly. Runtime MUL via carry-save
accumulator (PCPI) still produces incorrect values — to be debugged
via standalone pcpi_mul unit test in next session.
Test 12: Nested always @* assign inside if/else (cpuregs_rs1 bug)
Test 13: case(1'b1) first-match-wins priority (processCaseArms bug)
Test 14: Concat-LHS part-select decomposition in always @*
Test 15: For-loop unroll with SSA renaming chain
Test 16: Array read in always @* (register file dual-port)
Test 17: pcpi_mul standalone IR structure (regs, SSA wires, assigns)
Test 18: Carry-save SSA chain: next_rd_ssa1_N refs ssa1_{N-1}
Test 19: Carry-save SSA chain: next_rdt_ssa1_N refs ssa1_{N-1}
Test 20: next_rdx reads post-loop next_rdt (not stale _ssa0_0)

All 20 tests pass. Tests 18-20 confirm the IR-level SSA chain for
pcpi_mul's carry-save accumulator is correct — the remaining MUL
bug is in CppSim emit or C++ numeric evaluation, not in lowering.
Replace MUX-only approach with emitSequentialSSA that processes
statements top-to-bottom, creating SSA wires for each variable write.
This correctly handles read-then-overwrite patterns (e.g., PicoRV32's
carry-save accumulator) that cause cyclic dependencies under pure MUX.

Key change: always @* blocks now use sequential evaluation with
dynamic SSA wire generation. For-loop SSA in unrollForLoops is
disabled (sequential emitter handles ordering natively).

Test 21: pcpi_mul JIT (7*6=42) PASSES for the first time!
Tests 15-20 need updating for new _seq naming convention.
Test 11 (RV32I firmware) has a regression to investigate.
When a variable has no prior _seq wire and is only modified in one
branch of if-else, skip the MUX to avoid self-referencing the final
output (fixes decoded_rs cycle in PicoRV32).

Test 21 (pcpi_mul 7*6=42) still PASSES.
Test 10/11 (RV32I SoC) still has regression — CPU always @* blocks
need further investigation with sequential emitter.
Each test embeds Verilog, parses, lowers, JIT-compiles, and verifies
output values — no external file dependencies (CI-safe).

Test 22: if/else one-sided assign + register (decoded_rs pattern)
Test 23: Array read in always @* → register (cpuregs_rs1 pattern)
Test 24: case(1'b1) decoder priority (processCaseArms pattern)
Test 25: Read-then-overwrite sequential chain (pcpi_mul pattern)
Test 26: For-loop accumulator (carry-save pattern)

All 5 new tests PASS, confirming the sequential SSA emitter
correctly handles all patterns that caused bugs during development.
Fix case statement merge in emitSequentialSSA: when a variable has
no prior _seq wire (e.g., initialized with don't-care 'bx), avoid
self-referencing the final output in the default MUX path. This
resolves 16 cyclic dependencies in PicoRV32's always @* blocks
(alu_out, cpuregs_wrdata, mem_rdata_word, etc.).

Results:
- Test 10: PicoRV32 SoC "Hello" PASS
- Test 11: RV32I C firmware (26 words, ALL C TESTS OK) PASS
- Test 21: pcpi_mul JIT 7*6=42 PASS
- Tests 22-26: All JIT pair tests PASS
- M-ext SoC: RV32I PASS, RV32IM (MUL/DIV/REM) PASS
- M-ext factorial 10! = 3628800 (0x375F00) CORRECT

Tests 15-20 need updating for _seq naming convention (not failures).
Large number MUL (12345*6789) still has a precision issue to investigate.
- Update Tests 15-20, 21a for _seq naming (was _ssa1_)
- Add Test 21b: pcpi_mul 100*100=10000 (PASS)
- Add Test 21c: pcpi_mul 12345*6789=83810205 (PASS)
- Add topo sort deadlock warning in topoSortBody

All 29 tests PASS. pcpi_mul standalone handles all multiply sizes
correctly. SoC integration has a type-width issue for large operands
(CPU→PCPI operand truncation) — separate from the compiler.
- Test 21d: pcpi_mul consecutive multiply (7*6 then 12345*6789) PASS
- Fixed test protocol: deassert pcpi_valid between multiplications
  to allow FSM re-trigger (mul_start edge detection)
- SoC large-number MUL issue identified as CPU⇔PCPI protocol timing,
  not a compiler bug (pcpi_mul standalone handles all sizes correctly)

All 30 tests PASS.
Test 21e: mul_wrapper instantiates pcpi_mul with a minimal CPU-like
sequencer that drives pcpi_valid/rs1/rs2, captures pcpi_rd on ready.
Tests consecutive multiplies: 7*6=42 then 12345*6789=83810205.

SoC integration issue with large MUL (12345*6789) is isolated to the
full PicoRV32 CPU data path — not in pcpi_mul, not in the wrapper,
and not in the compiler. All 31 standalone tests PASS.
Root cause of SoC large-number MUL failure: lowerExpr for repeat_
was ignoring count, turning {4{mem_la_write}} into 1-bit value.
Fix: (0 - val) & mask for 1-bit, concat for multi-bit.

Test 27 (bit replication) PASS. wstrb now correctly 0xF for sw.
Store/Load still returns byte-0 only despite correct wstrb —
investigating CppSim byte-lane write/read ordering.
Root cause of SoC store/load failure: buildByteStrobeWrite did not
shift the data slice to the target bit position before masking.
e.g., mem_wdata[15:8] (= 0x56) was ANDed with mask 0xFF00 directly,
giving 0 instead of being shifted to 0x5600 first.

Fix: shift dataExpr by lane.lo bits before applying mask.

Test 28 (byte-lane memory write/read): PASS
Simple MUL test (12345*6789=83810205): PASS on M-ext SoC!
Store/Load test (0x12345678 round-trip): PASS!

All SoC tests now pass: RV32I (26 words), RV32IM (18 words),
Simple MUL (5 words), Store/Load (6 words).
33/34 standalone tests pass (Test 29 multi-bit replication WIP).
For repeat_ with ident operands (e.g., {2{din}}), the element width
was incorrectly assumed to be 1-bit, causing (0 - val) & 3 instead
of proper concat. Fix: use concat path for unknown-width operands,
which correctly shift+OR using inferExprWidth from the type map.

Test 29: {2{16'hABCD}} = 0xABCDABCD — PASS
All 34 tests PASS. All M-ext SoC tests PASS.
After fixing {N{expr}} replication, sb (store byte) correctly
produces {4{byte}} in mem_wdata (e.g., 0x48484848 for 'H').
Fix UART display to mask low 8 bits: "Hello" restored.

34/34 tests PASS. All SoC tests PASS.
sb (store byte) in PicoRV32 produces {4{byte}} in mem_wdata
(e.g., 0x48484848 for 'H'). This is normal — SoC UART stores
the full word, test masks to low 8 bits to extract actual byte.
- STATUS.md: Phase 4 (M-ext) marked Done, Phase 51 detailed summary
- CHANGELOG.md: Phase 51 entry with architecture changes, 12 bugs, test table
- RV32.md: Updated SV Transpiler section with M-ext results, 34 tests, MemorySSA
- README.md: Updated IP catalog entry for SV Transpiler (34 tests, M-ext)
- Add RV32 JIT benchmark step: runs firmware.hex for 5000 cycles
  via Signal DSL SoC JIT (122 registers, 4-stage pipeline)
- Commit verilator/generated_soc_jit.cpp (3038 lines) so CI can
  run the JIT test without regenerating
- Update SVParser test description (34 tests)
- Reference: JIT ~13M cyc/s vs Verilator 11.1M cyc/s on Apple Silicon
Add benchmark-action/github-action-benchmark integration:
- Measures RV32 JIT SoC simulation (5000 cycles, compile+sim)
- Pushes results to dev/rv32-jit-bench on GitHub Pages
- Alert at 150% regression, comment on alert
- Add permissions (deployments: write, contents: write)

Reference performance (Apple Silicon):
  JIT: 13.0M cyc/s (1.17x faster than Verilator)
  Verilator 5.044: 11.1M cyc/s
Full benchmark pipeline:
1. Install Verilator on Ubuntu
2. Build Verilator simulation (make build)
3. Build JIT shared library (.so) from generated_soc_jit.cpp
4. Build benchmark binaries (verilator_bench, jit_bench)
5. Run both for 10M cycles, parse cycles/sec
6. Push results to GitHub Pages via benchmark-action

Metrics tracked:
- Verilator (10M cycles) — cycles/sec
- JIT eval+tick (10M cycles) — cycles/sec
- JIT evalTick fused (10M cycles) — cycles/sec
- JIT evalTick+6wires (10M cycles) — cycles/sec

Reference (Apple Silicon): JIT 13M cyc/s vs Verilator 11.1M cyc/s
@junjihashimoto junjihashimoto merged commit 9da7df6 into main Mar 26, 2026
1 of 2 checks passed
@junjihashimoto junjihashimoto deleted the feature/rv32 branch March 26, 2026 03:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant