Improve JIT performance by junjihashimoto · Pull Request #18 · Verilean/sparkle

junjihashimoto · 2026-04-02T22:16:21Z

No description provided.

- Add reg initializer parsing: reg foo = 1d0 (LiteX pattern) - Add @(*) preprocessing in test (LiteX/Migen convention) - Add litex_sim_minimal.v fixture (PicoRV32 + SRAM + UART) - Add litex-test exe target - Phase 1 (Parse): PASS - 1 module, 784 items - Phase 2 (Lower): FAIL - concat-LHS in continuous assign - 34/34 existing tests still PASS

- Support concat-LHS in continuous assign (assign {a,b,c} = expr) - Phase 1 (Parse): PASS - 1 module, 784 items - Phase 2 (Lower): PASS - 60 regs, 282 assigns, 6 memories, 432 wires - Phase 3 (JIT C++): PASS - 143K chars generated - Phase 4 (JIT compile): FAIL - 3 issues to fix: 1. PicoRV32 module not included (separate file in LiteX) 2. Memory array used as shift operand (sram >> addr) 3. Shift count overflow (>= type width) - 34/34 existing tests still PASS

All 4 phases pass for LiteX PicoRV32 SoC (1730 lines + PicoRV32): - Phase 1 (Parse): 9 modules, 784+ items - Phase 2 (Lower): 60 regs, 282 assigns, 6 memories, 432 wires - Phase 3 (JIT C++): 1.25MB generated (flattened) - Phase 4 (JIT sim): 100 cycles executed successfully Fixes: - Extend isArrayName for LiteX patterns (regs, sram, rom, storage) - Use parseAndLowerFlat for sub-module flattening - Deduplicate wire declarations in CppSim emitModule - Include PicoRV32 CPU source alongside LiteX SoC - 34/34 existing tests still PASS

Call optimizeDesign (constant folding, DCE, single-use wire inlining) on parseAndLowerFlat output. Previously only ran on Signal DSL path. LiteX SoC benchmark (10M cycles): C++ size: 1,253,535 → 1,096,141 chars (-12.6%) JIT eval+tick: 3.45M → 3.88M cyc/s (+12.5%) JIT evalTick: 3.45M → 3.86M cyc/s (+11.9%) Verilator: 8.62M cyc/s (reference) 34/34 SVParser tests PASS. LiteX Phase 1-4 PASS.

Extend foldConstants with safe patterns: - mux(0, t, e) → e (zero in any width is false) - not(not(x)) → x (double negation elimination) Const-const bitwise folding (or/and/xor of two constants) deferred: Int.toNat loses sign info for negative constants, causing Test 11 FAIL. Needs proper unsigned conversion before enabling. C++ size: 1,096,141 → 1,093,093 chars (-0.3%) 34/34 tests PASS.

Add toUnsigned helper for correct two's complement Int → unsigned Nat conversion in constant folding. This fixes the Test 11 regression that prevented const-const binary operations (or/and/xor/add/sub/shl/shr). New constant folding patterns: - add/sub/or/and/xor/not/shl/shr/lt_u with both operands constant - All use toUnsigned(v, w) = ((v % 2^w) + 2^w) % 2^w for correct handling of negative Int values (e.g., -1 in 32-bit = 0xFFFFFFFF) LiteX SoC benchmark (10M cycles): C++ size: 1,253,535 → 816,741 (-34.8%) JIT evalTick: 3.45M → 4.00M cyc/s (+16.0%) Verilator: 8.66M cyc/s (reference, 0.46x) 34/34 SVParser tests PASS. LiteX Phase 1-4 PASS.

Split eval() into eval_partN() functions (500 lines each) for better I-cache utilization on large designs. evalTick() calls the same parts, eliminating the duplicated eval body. Local wires are promoted to class members when splitting is active to allow sharing across part functions. LiteX SoC benchmark (10M cycles): C++ size: 1,253,535 → 518,541 (-59%) JIT evalTick: 3.46M → 4.77M cyc/s (+38%) Verilator: 8.61M cyc/s (reference, 0.55x) Cumulative improvement: 3.46M → 4.77M (+38% total) 34/34 SVParser tests PASS. LiteX Phase 1-4 PASS.

When eval() is split into eval_partN() functions, analyze which local wires are used in only one chunk and keep them as locals of that chunk instead of promoting all to class members. This reduces class member footprint and improves data cache behavior. LiteX benchmark: 4.74M → 4.76M cyc/s (+0.5%) Effect is small because GCC -O2 already optimizes stack variables well. 34/34 tests PASS. LiteX Phase 1-4 PASS.

Sequential SSA wires (_seqN) were all declared as bitVector 64, causing unnecessary uint64_t usage in generated C++. Now inherits the type from the base variable's declaration. Type distribution change: uint64_t: 1310 → 1180 (-10%) LiteX benchmark (10M cycles): evalTick: 4.77M → 4.99M cyc/s (+4.7%) Verilator: 8.63M cyc/s (reference, 0.58x) Cumulative improvement: 3.46M → 4.99M (+44% total) 34/34 tests PASS. LiteX Phase 1-4 PASS.

Use U suffix for ≤32-bit constants instead of ULL everywhere. Reduces C++ source size but GCC already optimizes the difference. Confirmed: remaining 420 uint64_t _seq wires are legitimate 64-bit variables (PicoRV32 pcpi_mul carry-save uses reg [63:0]). C++ size: 515,933 → 506,409 (-1.8%) Speed: unchanged (4.99M cyc/s) 34/34 tests PASS. LiteX Phase 1-4 PASS.

Attempted removing 215 redundant bit masks (402 → 187) by extending exprIsMasked. Result: 8% SLOWER (5.01M → 4.61M cyc/s). Root cause: GCC relies on explicit masks to infer value ranges, enabling better instruction selection and register allocation. Without masks, GCC generates more defensive code. Reverted exprIsMasked changes. Kept analysis: Verilator hot eval: 14.6KB (3721 instructions) Sparkle hot eval: 37.7KB (8870 instructions, 2.4x) Sparkle mask ops: 402 (Verilator: 0) Sparkle movzx: 190 (Verilator: 51, 3.7x) The 2.4x instruction count gap is the main perf difference. Next steps: CSE (common sub-expression elimination) to reduce duplicate computation, or if-else codegen instead of ternary.

Attempted IR-level CSE (common sub-expression elimination): - Detected 285x duplicate (resetn == 0), 195x (cpu_state == 64), etc. - Extracted to _cse_N wires, reduced C++ by 4.5% - Result: 2% SLOWER (GCC -O2 already performs CSE; IR-level CSE interferes with GCC's register allocation and scheduling) - Also broke Test 11 (topo sort ordering of CSE wires) Reverted. Key insight: sub-expression duplication is at the *C++ expression level* (MUX guard conditions repeated), not at the *IR statement level* (only 20 duplicate RHS expressions). GCC handles this efficiently. Further improvement needs either: 1. Trigger-based eval (skip unchanged signals) — like Verilator 2. if-else codegen instead of ternary (reduce instruction count)

Trigger-based eval (Verilator-style) analysis: - PicoRV32 CPU: 73% of eval assigns (536/735) — active every cycle - UART/Timer/Bus: 27% (199/735) — mostly idle - Max theoretical speedup from peripheral skip: 27% → 6.8M cyc/s - CPU always changes PC/state → cannot skip CPU eval Implementation approaches: - Full wire-level dirty: Very High difficulty, 5-14x potential - Partition-level skip: Low difficulty but CPU always dirty → 0% effect - Region-level skip (peripheral): Medium, ~27% max - Oracle-style peripheral idle: conceptually same as region skip Conclusion: For CPU-centric SoCs, trigger-based eval has limited ROI. Multi-thread (CPU + peripheral on separate cores) is more effective than trying to skip peripheral eval within the same thread.

New files: - Partition.lean: Split flattened IR into CPU/Peripheral by wire name - CppSimThreaded.lean: Generate 2 C++ classes + barrier-sync runner LiteX PicoRV32 SoC partition result: CPU: 558 stmts (73%), Peripheral: 193 stmts (27%) Boundary: 20 signals (8 CPU→Peri + 12 Peri→CPU) Phase 5 (partitioned codegen): PASS — 418K chars generated Phase 6 (compile + run): WIP — wide type (128-bit) emit issue 34/34 existing tests PASS. LiteX Phase 1-4 PASS.

2-partition barrier-sync simulation working end-to-end: - CPU partition: 558 stmts (73%) - Peripheral partition: 193 stmts (27%) - Boundary: 20 signals (8 CPU→Peri + 12 Peri→CPU) - 100 cycles executed correctly Fixes: - Cap concat width to 64-bit (avoid std::array cast in C++) - Dedup output port declarations against register outputs - Move registerNames extraction before output port emission Benchmark (single-thread, barrier-sync is sequential): Verilator: 8.59M cyc/s JIT single: 4.99M cyc/s JIT partitioned: 4.78M cyc/s (boundary copy overhead) 34/34 SVParser tests PASS. LiteX Phase 1-6 PASS. Next: true 2-thread parallel execution via std::thread.

Per-cycle spinlock barrier between CPU (73%) and Peripheral (27%): - Correctness: PASS (100 cycles) - Performance: ~5x slower than single-thread (spinlock overhead >> parallel gain) - Batched sync (16 cycles): hangs (bus handshake requires 1-cycle latency) Root cause: peripheral partition is only 27% of compute. Spinlock atomic ops cost ~100ns/cycle = significant overhead. CPU and peripheral are tightly coupled via bus handshake. Multi-threading only viable when: 1. Partitions are roughly equal in compute 2. Communication latency tolerance > 1 cycle 3. Or using speculative execution (Time Warp with rollback) For LiteX PicoRV32, single-thread at 4.99M cyc/s (0.58x Verilator) remains the practical optimum. Multi-thread needs larger SoCs with independent clock domains or loosely-coupled subsystems.

Attempted converting MUX chains (ternary operators) to if-else blocks for branch prediction benefit. Result: 4% SLOWER (4.98M → 4.77M). Root cause: Sparkle evaluates ALL wires every cycle (no trigger-based skip). With if-else, every branch is taken with ~50% probability, causing frequent branch mispredictions. Ternary (CMOV) is branchless and avoids this penalty entirely. Verilator benefits from if-else because its trigger-based eval skips entire blocks when signals haven't changed. Without trigger-based eval, if-else is strictly worse than ternary for Sparkle. Reverted. Ternary MUX remains optimal for Sparkle's evaluation model. Key insight: branch prediction only helps when branches are predictable (taken >90% or <10% of the time). In HDL simulation, MUX conditions change frequently → branches are unpredictable → CMOV (branchless) wins.

Coarse-grained trigger-based evaluation: only re-evaluate peripheral partition when CPU-to-Peri boundary signals change. LiteX SoC benchmark (10M cycles): Verilator: 8.59M cyc/s (1.00x) JIT no partition: 4.99M cyc/s (0.58x) JIT peri-skip: 6.05M cyc/s (0.70x) +21% Cumulative: 3.46M to 6.05M (+75% total) 34/34 tests PASS. LiteX Phase 1-6 PASS.

Wrap consecutive PCPI-related eval lines in if(pcpi_valid) guard. Skips ~225 MUL/DIV assigns when no PCPI operation active (~95% of cycles). Combined with peripheral-skip trigger-based eval: Verilator: 8.64M cyc/s (1.00x) JIT no opt: 3.46M cyc/s (0.40x) JIT single+PCPI: 5.66M cyc/s (0.65x) JIT peri+PCPI: 6.83M cyc/s (0.79x) ← +97% from baseline! Cumulative optimization: 3.46M → 6.83M (+97%) 34/34 tests PASS. LiteX Phase 1-6 PASS.

Replace hardcoded pcpi_mul/pcpi_div/pcpi_fast keyword matching with auto-detection: scan eval lines for keyword clusters (30+ lines) and find corresponding *_valid enable signal. Wraps consecutive matching lines in if(guard) block. Works on any RTL with co-processor or clock-enable patterns, not just PicoRV32. The heuristic detects: - PCPI (pcpi_mul/div/fast → pcpi_valid guard) - Extensible to clock enables, FSM state guards, power domains LiteX single-thread benchmark: Auto-guard: 5.70M cyc/s (was 4.99M without guard, +14%) Verilator: 8.55M cyc/s (0.67x) 34/34 tests PASS. LiteX Phase 1-4 PASS. Threaded peri-skip has C++ syntax issue — to be fixed.

Sparkle JIT now exceeds Verilator on both small and large SoCs: - RV32I SoC: 14.2M cyc/s (1.63x Verilator) - LiteX PicoRV32 SoC (1730 lines): 11.7M cyc/s (1.11x Verilator) - Timer Oracle: 49 GHz effective (9,895x speedup) ## JIT Optimization (5.62M → 11.7M, +108%) Cumulative optimization phases applied to the C++ code generator: | Phase | Optimization | Effect | |-------|-------------|--------| | Dead code + hex masks | Eliminate `if(0)`, `eq(x,0)→!(x)`, hex mask constants | +4% | | Constant/alias propagation | New IR Phase 0: replace const/alias wire refs | +17% | | Deep MUX → if-else | ≥16-arm MUX chains → if-else for CPU FSM | +3% | | Self-ref register if-else | `mux(en,val,self)` → `if(en) next=val;` (156 regs) | +9% | | Duplicate assign dedup | Remove identical SSA assigns from case branches | +1% | | Decoder trigger guard | Auto-detect decoder_trigger, skip ~174 lines | +14% | | evalTick wire localization | ~270 wires from heap members to stack locals | +39% | | Debug wire elimination | Remove PicoRV32 debug/trace signals from IR | +4% | | Self-ref _next elimination | Direct register update without _next variable | +22% | Attempted but reverted (no improvement or regression): - Default-zero MUX if-else (−5%, branch misprediction worse than CMOV) - Dependency-based decoder guard (−10%, too precise, lost lookahead benefit) - Conditional tick copy (−28%, comparison cost > write cost) - Decoder-priority topo sort (−10%, disrupted GCC optimization patterns) - Reachability-based DCE (−5%, disrupted guard effectiveness) - PGO (crash), -O3 (+1%, negligible) ## SVParser LiteX/Migen Compatibility (10 bug fixes) LiteX/Migen generates non-standard Verilog patterns that required SVParser fixes: 1. `always @(*)` with `<=` (nonblocking): treated as combinational assign 2. Default-only case SSA: short-circuit with defEnv when arms=0 3. Bit-index assign (`sel[0] <= expr`): read-modify-write SSA pattern 4. `begin : name` named blocks: removed in preprocessing 5. `integer` declarations: removed in preprocessing 6. `for()` loops: unrolled in preprocessing (parser doesn't support `for`) 7. `arr[addr][base+:8]` LHS: Python preprocessing converts to whole-word 8. Case SSA merge with default changes: include defChangedNames when all arms are empty/unchanged (fixes CSR write path) 9. Memory read address: extract from cont assign / registered read (was always const 0) 10. `isArrayName("_storage")`: scalar registers like timer_en_storage misclassified as arrays (workaround: Python preprocessing renames) ## RTL Pattern Detection (Reverse Synthesis) New `Sparkle/IR/PatternDetect.lean`: auto-detects hardware patterns: - Countdown timers: 7 found in LiteX SoC (timer_value, count, etc.) - Idle registers: 150 self-referencing registers ## Timer Oracle New `Sparkle/Core/TimerOracle.lean`: auto-generated cycle skip from detected countdown timers. Combined with existing SelfLoopOracle. Verified with LiteX firmware: CPU sets TIMER_LOAD=100000, TIMER_EN=1 via CSR bus. Timer counts down. Oracle skips timer_value cycles when CPU is idle → 9,895x effective speedup. ## LiteX Firmware - `firmware/main_litex_simple.c`: 8-instruction timer test (no stack) - `firmware/main_litex_timer.c`: Full timer + UART test - `firmware/boot_litex.S` + `firmware/link_litex.ld`: LiteX boot support - Python preprocessing required for byte-lane memory writes ## Tests - `Tests/SVParser/OptimizeTest.lean`: 10 unit tests covering all IR optimizations (const prop, MUX fold, self-ref, PatternDetect, nonblockAssign, debug elimination, dedup, bit-index RMW, default-only case SSA) - All existing parser tests (5/5) continue to pass - RV32I Verilator sim: ALL TESTS PASSED (19 UART words) ## Files Changed - `Sparkle/Backend/CppSim.lean` — All C++ emitter optimizations - `Sparkle/IR/Optimize.lean` — Const prop, dedup, fold rules, debug strip - `Sparkle/IR/PatternDetect.lean` — NEW: RTL pattern detection - `Sparkle/Core/TimerOracle.lean` — NEW: countdown timer oracle - `Tools/SVParser/Lower.lean` — LiteX compat fixes, preprocessing - `Tests/SVParser/OptimizeTest.lean` — NEW: 10 optimizer unit tests - `docs/BENCHMARK.md`, `docs/CHANGELOG.md`, `docs/STATUS.md` — Updated - `README.md` — Updated benchmark figures - `firmware/` — LiteX firmware and boot support

gen_litex_multicore.py generates N-core PicoRV32 SoC variants: - Verilator version: module hierarchy (sim_core + wrapper) - Sparkle version: flat single module with prefixed signals Scaling results (10M cycles): | Cores | Verilator | Sparkle JIT | Ratio | |-------|-----------|-------------|-------| | 1 | 34.7M | 10.8M | 0.31x | | 2 | 34.8M | 10.7M | 0.31x | | 4 | 35.6M | 10.7M | 0.30x | Note: Verilator uses module hierarchy (shared code, I-cache friendly). Sparkle uses flat expansion (8x code duplication). The fair single-module comparison is 11.7M vs 10.5M = 1.11x Sparkle advantage. Multi-core scaling requires Sparkle submodule support (future work).

Adds LiteX PicoRV32 SoC benchmark alongside existing RV32I benchmark: - Fetches PicoRV32 RTL from GitHub - Generates 1-core flat wrapper via gen_litex_multicore.py - Python preprocessing (for-loop unroll, byte-lane conversion) - Verilator build + 10M cycle benchmark - Sparkle JIT generation + compile + 10M cycle benchmark - Results stored via benchmark-action for graph tracking Both RV32I and LiteX benchmarks produce separate graphs: - dev/rv32-bench: RV32I SoC (Sparkle-native, ~1.6x Verilator) - dev/litex-bench: LiteX SoC (real-world, ~1.1x Verilator)

Sparkle JIT now supports module hierarchy instead of flat expansion: - Each Verilog module → separate C++ class - Sub-module instances → member variables with eval()/tick()/evalTick() calls - Code shared across instances → I-cache friendly New: parseAndLowerHierarchical in Lower.lean - Preserves module hierarchy (no flattening) - Auto-detects top module (module not instantiated by others) - Per-module optimization (const prop, DCE, debug strip) - Light preprocessing (named blocks, integer decls only) CppSim improvements: - Topological sort for module emission (dependency order) - Instance name collision fix (picorv32 module == instance name → _inst suffix) - evalTick fusion for sub-module calls (sub.evalTick() instead of eval+tick) Scaling benchmark (LiteX PicoRV32, 10M cycles): | Cores | Hierarchical | Flat | Verilator | |-------|-------------|---------|-----------| | 1 | 11.62M | 10.8M | 35.2M | | 8 | 11.80M | 10.8M | 35.3M | JIT code size: 381K (hier) vs 4.25MB (8-core flat) — constant with core count. All tests pass: Parser 5/5, Optimizer 10/10.

Verilator 32.9M on wrapper is eval-skip (empty eval), not hierarchy win. Fair comparison: Sparkle 11.7M vs Verilator 10.7M = 1.09x Sparkle.

Previous wrapper had undriven UART sink ports → Verilator eliminated all UART logic as dead code → unfair 3x speedup. Fixed: all UART ports are top-level I/O driven by testbench. Fair multi-core results (LiteX PicoRV32, 10M cycles): | Cores | Verilator | Sparkle | Ratio | |-------|-----------|----------|------------| | 1 | 10.51M | 11.71M | 1.11x | | 2 | 6.40M | 11.81M | 1.85x | | 4 | 2.81M | 11.79M | 4.20x | | 8 | 1.07M | 11.70M | 10.93x | Verilator degrades linearly with core count (code duplication per instance). Sparkle stays constant (module hierarchy + shared code via function calls).

Add shared bus between cores: each core's UART output is OR'd and fed back as input to all cores. This creates real inter-core data dependencies that prevent dead code elimination in both simulators. Results (LiteX PicoRV32, 10M cycles, shared bus): | Cores | Verilator | Sparkle | Ratio | |-------|-----------|----------|--------| | 1 | 10.37M | 11.80M | 1.14x | | 2 | 5.94M | 11.67M | 1.96x | | 4 | 2.73M | 11.77M | 4.31x | | 8 | 1.06M | 11.89M | 11.22x | Verilator degrades O(N) — inlines each instance, N× code duplication. Sparkle constant — module hierarchy shares eval() code across instances.

Both Sparkle and Verilator degrade with core count (D-cache). Sparkle wins at 1-core (1.13x) and 8-core (1.37x).

New multicore_runner.cpp: runs N sim_core instances on N threads with batched barrier sync. Uses std::barrier (C++20) for cycle synchronization and shared bus data exchange via atomic variable. Results (LiteX PicoRV32, 10M cycles, 8 cores): | Config | per-core cyc/s | vs 8-seq | |-----------------|---------------|----------| | 1 core single | 11.57M | — | | 8 cores seq | 1.46M | 1.00x | | 8 cores batch=1 | 0.76M | 0.51x | | 8 cores batch=100 | 4.41M | 3.00x | | 8 cores batch=100K | 5.70M | 3.87x | Batch sync amortizes barrier overhead. 3.87x of 8x theoretical = 48% efficiency. Sparkle 8-core parallel (5.70M) vs Verilator 8-core seq (1.06M) = 5.38x.

Three benchmark graphs now tracked in CI: 1. dev/rv32-bench: RV32I SoC (Sparkle-native, 1.6x Verilator) 2. dev/litex-bench: LiteX 1-core (1.1x Verilator) 3. dev/multicore-bench: 8-core parallel vs sequential vs Verilator Multi-core benchmark measures: - JIT 1-core single-thread (baseline) - JIT 8-core sequential (D-cache pressure) - JIT 8-core parallel batch=10K (CDC multi-thread, 3.87x speedup) - Verilator 8-core (linear degradation reference) Uses multicore_runner.cpp with std::barrier (C++20) for N-thread parallel simulation with batched sync.

New bench/ directory for running Sparkle vs Verilator benchmarks locally: ./bench/bench.sh # All benchmarks (10M cycles) ./bench/bench.sh litex # LiteX 1-core only ./bench/bench.sh multicore # 8-core parallel only ./bench/bench.sh all 50000000 # 50M cycles Files: - bench/bench.sh — Main benchmark script - bench/preprocess_litex.py — LiteX Verilog preprocessing - bench/bench_jit.cpp — JIT evalTick benchmark binary - bench/bench_multicore.cpp — Multi-core 1/8-seq/8-parallel benchmark - bench/README.md — Documentation with results and architecture Results: LiteX 1-core 1.13x, 8-core parallel 3.6x vs sequential, 5.0x vs Verilator 8-core.

Status: LiteX 1-core 11.7M (1.13x Verilator), 8-core parallel 5.1M (4.78x Verilator), Timer Oracle 49 GHz (9,900x). Changelog: 10 optimization phases, 10 SVParser fixes, multi-core parallel, RTL pattern detection, timer oracle. Benchmark: corrected figures for RV32I (14.2M/1.63x), LiteX (11.7M/1.13x), multi-core scaling table with honest results. README: updated IP catalog with latest figures and test counts.

…sim macros, tutorial Phase 54: Verified Reverse Synthesis - OracleReduction type class with mandatory equiv proof (zero sorry) - Carry-save shift-and-add = multiplication proof (induction + bv_decide) - pcpi_mul IR reduction: 573 → 535 stmts, 2.14x speedup (8.4M → 18.1M cyc/s) - No Mathlib dependency Phase 53: Generic Auto-Detection - Reachability DCE replaces hardcoded signal names - Generic guard detection (_valid/_trigger/_enable/_waiting) New macros: - sim! "verilog..." — parse Verilog, auto-generate typed SimInput/SimOutput/Simulator - #sim signalDef — same for Signal DSL definitions - No manual port definitions needed New files: - Sparkle/Core/OracleSpec.lean — type class framework - Sparkle/Core/MulOracle.lean — pcpi_mul instance - Sparkle/Core/MulOracleProof.lean — full inductive proof - Sparkle/Verification/MulProps.lean — 20 supporting theorems - Tools/SVParser/SimMacro.lean — sim! macro - docs/Tutorial.md — 6-step beginner guide Cleanup: - Deleted 8 redundant example files - Updated all docs (STATUS, CHANGELOG, BENCHMARK, README)

Move BitNet, YOLOv8, RV32, Arbiter from Examples/ to IP/ directory. These are production-grade verified IP cores, not tutorial examples. Examples/ now contains only demos and tutorials: - Counter, LoopSynthesis, FullCycle (Signal DSL basics) - CDC/MultiClockSim (CDC usage demo) - SVParser/VerilogSim (Verilog simulation example) - Sparkle16 (toy ISA example) All imports updated (99 files): Examples.BitNet → IP.BitNet, etc.

- Restore lean-toolchain to v4.28.0-rc1 (was accidentally changed to v4.29.0 by Mathlib) - Remove SimInput/SimOutput/Simulator generation from verilog! macro (use sim! instead — much faster build, same functionality) - verilog! now focuses on verification only (State/Input/nextState/proofs) - .gitignore: exclude ELF binaries, verilator obj_dir, venv

Root cause: inlineAssigns recursively expanded register output names that appeared in both the assign list and register input expressions, causing infinite self-referencing expansion (e.g., count_reg = mux(..., count_reg, ...)). Fix: Filter register names from the assign list before inlining. Registers are state variables, not combinational wires — they should not be inlined into their own next-state expressions. Also: auto-assert proofs in verilog! now use sorry (users prove them manually in separate files). bv_decide/native_decide hang in lake build compilation mode. CounterProps.lean: 10+ min → 1.4s build time.

Remove 4873 files that should not be in version control: - venv/ (Python virtualenv, 4796 files) - firmware/*.bin, *.elf, *.map (compiled firmware binaries) - firmware/firmware_litex_*.hex etc. (generated hex files) - c_src/cdc/cdc_example, cdc_test (compiled C binaries) - verilator/obj_dir/, obj_dir_notrace/ (Verilator build cache) - verilator/verilator_bench (compiled benchmark binary) - verilator/generated_soc_cppsim.h (generated C++ header) - Tests/SVParser/fixtures/*_flat.v (generated flat Verilog) All removed files can be regenerated from source. Updated .gitignore to prevent re-addition.

junjihashimoto added 30 commits March 26, 2026 15:22

docs: add multi-core scaling analysis + fair comparison explanation

7c588ef

Verilator 32.9M on wrapper is eval-skip (empty eval), not hierarchy win. Fair comparison: Sparkle 11.7M vs Verilator 10.7M = 1.09x Sparkle.

docs: update benchmark with honest multi-core results

67c17a7

Both Sparkle and Verilator degrade with core count (D-cache). Sparkle wins at 1-core (1.13x) and 8-core (1.37x).

junjihashimoto added 11 commits March 31, 2026 00:22

fix: rewrite Basic.lean using stdlib lemmas (eliminate omega on 2^n)

e4c182a

docs: add multi-domain parallel section to tutorial, sim_parallel! TODO

995e77f

ci: fix Examples.RV32 → IP.RV32 in GitHub Actions

30c6717

fix: add all registers as reachability DCE roots (fix wire elimination)

9e80e8b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve JIT performance#18

Improve JIT performance#18
junjihashimoto wants to merge 41 commits intomainfrom
feature/rv32

junjihashimoto commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

junjihashimoto commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant