Skip to content

Latest commit

 

History

History
87 lines (65 loc) · 3.64 KB

File metadata and controls

87 lines (65 loc) · 3.64 KB

BitNet b1.58 ASIC Inference Engine

Killer App: BitNet b1.58 ASIC Inference Engine

Sparkle ships with a complete, formally verified BitNet b1.58 accelerator — a production-grade ternary-weight neural network inference core targeting ASIC synthesis, written entirely in the Signal DSL. This is the world's first formally verified LLM inference hardware generated from a theorem prover.

What It Does

Pure Signal DSL functions compose into a complete BitNet SoC — simulate directly or synthesize to SystemVerilog:

import IP.BitNet.SoC.Top

open Sparkle.Core.Signal
open Sparkle.IP.BitNet.SoC

-- Build a 2-layer, 4-dimension BitNet SoC as a Signal function
let cfg : SoCConfig := { archMode := .HardwiredUnrolled, nLayers := 2, dim := 4, ffnDim := 4 }
let x : Signal defaultDomain (BitVec 32) := Signal.pure (BitVec.ofNat 32 0x10000)  -- 1.0 Q16.16
let result := bitNetSoCSignal cfg layerWeights layerScales x

-- Simulate: evaluate at any timestep
IO.println s!"Output at t=0: {result.atTime 0}"

Dual-Architecture: Choose Your Trade-off

HardwiredUnrolled TimeMultiplexed
Area 202,566 cells 99,020 cells
Latency 1 cycle (combinational) 12 cycles (1 per layer)
Throughput Maximum 1/12 of HW
Source Lines 19,042 1,909
Use Case Ultra-low-latency Area-constrained

Yosys 0.62 technology-independent synthesis. See hw/synth/PPA_Report.md for full breakdown.

60+ Formally Verified Theorems

Every arithmetic operation in the RTL datapath is backed by machine-checked proofs:

-- Proves ReLU²(2.0) = 4.0 in Q16.16 fixed-point (checked by Lean kernel)
theorem relu_sq_two :
    reluSquared (BitVec.ofNat 32 0x20000) = BitVec.ofNat 32 0x40000 := by
  native_decide

-- Proves 48-bit × 32-bit scale product fits in 80 bits (no overflow)
theorem scale_prod_fits_80 : (2^47 - 1) * (2^31 - 1) < (2^79 : Nat) := by
  native_decide

Proof categories: Scale multiply (5), ReLU² (6), Residual add (6), Element multiply (6), Bit-width sufficiency (7), INT8 dot product (15), Attention bit-width (7), Softmax (8), Fixed-point spec (5).

Architecture Overview

x[dim] ──► BitLinear(gate) ──► Scale ──► ReLU² ──┐
        ├─► BitLinear(up)   ──► Scale ────────────┤─► ElemMul ──► ResidualAdd ──► y[dim]
        └─► BitLinear(down) ──► Scale ◄───────────┘                    ↑
                                                                  x[dim] ─┘
  • Ternary weights: {-1, 0, +1} encoded as 2-bit i2_s (zero-weight pruning eliminates ~35% of MACs)
  • Fixed-point datapath: Q16.16 activations, 48-bit accumulators, Q8.24 scale factors
  • Binary adder tree: Automatic bit-width propagation with configurable pipeline registers
  • LUT-based softmax: 256-entry exp/reciprocal lookup tables as mux trees
  • Full attention pipeline: QKV projection, INT8 dot product, softmax, score-V multiply, multi-head

Golden Value Validation

RTL spec functions are validated against real model data from bitnet.cpp (16 tests):

=== RTL Golden Value Validation ===
  [PASS] Q16.16 round-trip       (cosine: 0.9999+)
  [PASS] reluSquared              (cosine: 0.999+)
  [PASS] elemMul                  (cosine: 0.999+)
  [PASS] residualAdd              (cosine: 0.9999+)
  [PASS] fixedPointScale          (cosine: 0.9999+)
  [PASS] quantizeToInt8           (exact match)
  [PASS] FFN forward pass         (cosine: 0.999+)
  [PASS] Attention score pipeline (exact match)
  [PASS] Softmax + weighted V sum
ALL TESTS PASSED