Kunj-Patel/FPGA-based-High-Frequency-Adder

2048-bit Adders for Intel Agilex 5

Overview

This lab implements and optimizes two 2048-bit adders for Intel Agilex 5 FPGA:

  1. Naive Adder: Pipelined RCA with diagonal carry propagation
  2. Clever Adder: Type-2 "Monster Adder" with Kogge-Stone prefix tree

Target Device: Intel Agilex 5 A5EC065BB23AE4SR0 (222,400 ALMs)

Final Results

Naive Adder (naiveadder2048b.sv)

  • Best Configuration: M=9
  • Performance: 599.88 MHz @ 9,073 ALMs
  • Latency: 230 cycles (⌈2048/9⌉ + 2 = 228 + 2)
  • Architecture: RCA_PIPE with diagonal carry propagation

Clever Adder (cleveradder2048b.sv) 🏆

  • Best Configuration: M=16
  • Performance: 825.08 MHz @ 8,814 ALMs
  • Latency: 11 cycles
  • Speedup vs Naive: +37.5% Fmax, -2.9% ALMs, +41.6% efficiency (0.0936 vs 0.0661 MHz/ALM)
  • Architecture: Type-2 Monster Adder with Kogge-Stone prefix tree

Performance Summary

| Adder | M | Fmax (MHz) | ALMs | Latency (cycles) | Efficiency (MHz/ALM) | Hyper-REG |
|---|---|---|---|---|---|---|
| Naive | 9 | 599.88 | 9,073 | 230 | 0.0661 | 7,536 |
| Clever | 16 | 825.08 | 8,814 | 11 | 0.0936 | 8,543 |
| Improvement | - | +37.5% | -2.9% | -95.2% | +41.6% | +13.4% |

Architecture Details

1. Naive Adder (naiveadder2048b.sv)

Implementation: Direct wrapper around rca_pipe.sv with optimized M parameter.

module naiveadder2048b #(
    parameter W = 2048,
    parameter M = 9  // Optimal: 599.88 MHz @ 9,073 ALMs
) (
    input  logic clk, rst, in_valid,
    input  logic [W-1:0] a, b,
    input  logic c_in,
    output logic out_valid,
    output logic [W-1:0] sum,
    output logic c_out
);
    rca_pipe #(.W(W), .M(M)) rca_inst (.*);
endmodule

Architecture: Pipelined RCA with diagonal carry propagation

  • Breaks W-bit addition into NUM_CHUNKS = ⌈W/M⌉ parallel M-bit chunks
  • Stage 0: Input registration
  • Stage 1: Parallel chunk computation (chunk 0 uses actual c_in, others assume c_in=0)
  • Stages 2 to NUM_CHUNKS: Ripple correction with diagonal OR rule
  • Final stage: Output registration

Latency Formula: NUM_CHUNKS + 2 cycles where NUM_CHUNKS = ⌈W/M⌉

  • Example (M=9, W=2048): ⌈2048/9⌉ + 2 = 228 + 2 = 230 cycles
  • Example (M=16, W=2048): ⌈2048/16⌉ + 2 = 128 + 2 = 130 cycles
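As a sanity check, the formula can be evaluated with a small Python helper (`naive_latency` is an illustrative name, not part of the repo):

```python
import math

def naive_latency(W: int, M: int) -> int:
    """Input reg + stage-0 reg + (NUM_CHUNKS - 1) ripple stages + output reg."""
    return math.ceil(W / M) + 2

print(naive_latency(2048, 9))   # 230
print(naive_latency(2048, 16))  # 130
```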

Note on M=9: Non-divisible M values are supported. RCA_PIPE handles partial chunks internally.

Optimization: M=9 provides best Fmax-to-ALM ratio among tested values.


2. Clever Adder (cleveradder2048b.sv)

Implementation: Type-2 Monster Adder (Langhammer 2019) with three-level parallel carry computation.

module cleveradder2048b #(
    parameter W = 2048,
    parameter M = 16,            // Optimal: 825.08 MHz @ 8,814 ALMs
    parameter PIPE = 1,          // Prefix tree output register
    parameter PIPE_EVERY_2 = 1   // Internal prefix registers every 2 stages
) (
    input  logic clk, rst,
    input  logic [W-1:0] a, b,
    input  logic c_in,
    input  logic in_valid,
    output logic [W-1:0] sum,
    output logic c_out,
    output logic out_valid
);

Architecture: Three-level design achieving O(log N) carry computation complexity.

Level 1: Generate/Propagate Computation (RCA_PIPE)
   ├─ For each chunk i: compute g[i], p[i], sum_0[i]
   ├─ Dual RCA per chunk: RCA(c_in=0) → G, RCA(c_in=1) → P
   └─ Latency: 3 cycles (RCA_PIPE with M=M)

Level 2: Prefix Network (Kogge-Stone)
   ├─ Input: g[N-2:0], p[N-2:0] (MSB excluded)
   ├─ Compute: c_vec[i] = carry INTO chunk i+1
   ├─ Algorithm: Kogge-Stone prefix tree (O(log N) depth)
   ├─ Pipelining: Input reg + Internal regs (every 2 stages) + Output reg
   └─ Latency: PREFIX_LATENCY = 1 + ⌊stages/2⌋ + 1 cycles

Level 3: Final Sum Computation (RCA_PIPE)
   ├─ For each chunk i: sum_final[i] = sum_0[i] + carry[i]
   ├─ carry[0] = c_in (global), carry[i] = c_vec[i-1] (from prefix)
   └─ Latency: 3 cycles (RCA_PIPE with M=M)

Total Latency: 3 + PREFIX_LATENCY + 3 = 6 + PREFIX_LATENCY cycles

Latency Examples (W=2048, PIPE=1, PIPE_EVERY_2=1):

| M | N Chunks | Prefix Stages | Input Reg | Internal Regs | Output Reg | PREFIX_LATENCY | Total Latency |
|---|---|---|---|---|---|---|---|
| 8 | 256 | 8 | 1 | 4 (after 1,3,5,7) | 1 | 6 | 12 cycles |
| 16 | 128 | 7 | 1 | 3 (after 1,3,5) | 1 | 5 | 11 cycles |
| 32 | 64 | 6 | 1 | 3 (after 1,3,5) | 1 | 5 | 11 cycles |
| 64 | 32 | 5 | 1 | 2 (after 1,3) | 1 | 4 | 10 cycles |

Key Constraint: M must divide W evenly (N = W/M must be integer) for prefix tree.

  • Valid M for W=2048: {4, 8, 16, 32, 64, 128, 256, 512, 1024}
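A quick Python model (illustrative, not part of the repo) reproduces the latency table above from the pipeline parameters, assuming PREFIX_STAGES = ⌈log₂(N−1)⌉ over the N−1 prefix positions:

```python
import math

def clever_latency(W: int, M: int, pipe: int = 1, pipe_every_2: int = 1) -> int:
    """Total latency = 3 (Level 1 RCA) + PREFIX_LATENCY + 3 (Level 3 RCA)."""
    assert W % M == 0, "M must divide W evenly for the prefix tree"
    n = W // M
    stages = math.ceil(math.log2(n - 1))             # $clog2 of the N-1 prefix positions
    internal = (stages // 2) if pipe_every_2 else 0  # registers after odd stages
    prefix_latency = 1 + internal + pipe             # input + internal + output regs
    return 3 + prefix_latency + 3

for m in (8, 16, 32, 64):
    print(m, clever_latency(2048, m))  # 8→12, 16→11, 32→11, 64→10
```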

Level 1: Generate/Propagate Computation

Purpose: Precompute per-chunk carry behavior for both possible carry-in values.

For each chunk i (0 to N-1), compute:

  • g[i]: Generate signal (carry-out when c_in=0 to chunk) = c_out from RCA(a[i], b[i], c_in=0)
  • p[i]: Propagate signal (carry-out when c_in=1 to chunk) = c_out from RCA(a[i], b[i], c_in=1)
  • sum_0[i]: Partial sum assuming c_in=0 to chunk

Implementation (rtl/cleveradder2048b.sv:89-132):

// 128 chunks × 2 RCAs = 256 RCA_PIPE instances (generate loop excerpt)
genvar i;
generate
for (i = 0; i < N; i++) begin : gen_gp
    (* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_g (
        .c_in(1'b0), .c_out(c_out_g), .sum(sum_g), ...
    );
    (* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_p (
        .c_in(1'b1), .c_out(c_out_p), ...
    );
    assign g[i] = c_out_g;
    assign p[i] = c_out_p;
    assign sum_0[i*M +: M] = sum_g;
end
endgenerate

Timing:

  • All RCA_PIPE outputs are registered after 3 cycles (M=M → NUM_CHUNKS=1 → latency=3)
  • Valid signals track completion: valid_l1_all = &valid_l1[N-1:0]
  • Global c_in delayed to c_in_l1 via shift register to align with Level 1 outputs

Key Insight: By computing g and p, Level 1 decouples sum computation from carry propagation, enabling Level 2 to compute all carries in parallel.
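The dual-RCA semantics can be sketched in Python as a reference model (hypothetical helper `level1_gp`, not part of the RTL):

```python
def level1_gp(a: int, b: int, W: int, M: int):
    """Per-chunk (g, p, sum_0) via two adds, mirroring the dual-RCA trick:
    g = carry-out with c_in=0, p = carry-out with c_in=1."""
    n, mask = W // M, (1 << M) - 1
    g, p, sum_0 = [], [], []
    for i in range(n):
        ai = (a >> (i * M)) & mask
        bi = (b >> (i * M)) & mask
        g.append((ai + bi) >> M)        # RCA(c_in=0) carry-out
        p.append((ai + bi + 1) >> M)    # RCA(c_in=1) carry-out
        sum_0.append((ai + bi) & mask)  # partial sum assuming c_in=0
    return g, p, sum_0

g, p, s = level1_gp(0x00FF, 0x0001, 16, 4)
print(g, p)  # [1, 0, 0, 0] [1, 1, 0, 0]
```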


Level 2: Prefix Network (Kogge-Stone)

Purpose: Compute carry into each chunk using parallel prefix algorithm.

Critical Bug Fix - Global Carry-In Injection (rtl/cleveradder2048b.sv:146-160):

The original implementation had a critical bug for small M values (especially M=4): the global carry-in (c_in) was not properly injected into the prefix tree computation.

Problem:

  • Chunk 0's generate signal g[0] only reflects "does chunk 0 generate carry with c_in=0?"
  • But if chunk 0 propagates (p[0]=1) AND we have global c_in=1, chunk 0 should generate carry-out
  • Without this correction, carry-out from chunk 0 ignores c_in → wrong results

Solution:

logic [N-2:0] g_prefix_in;

always_comb begin
    g_prefix_in = g[N-2:0];
    // Inject global carry: Real_G[0] = G[0] | (P[0] & C_in_Level1)
    g_prefix_in[0] = g[0] | (p[0] & c_in_l1);
end

prefix_tree prefix_inst (
    .g(g_prefix_in),  // Use modified signal
    .p(p[N-2:0]),
    ...
);

Why this fixes the bug:

  • If chunk 0 generates (g[0]=1): g_prefix_in[0] = 1 (correct)
  • If chunk 0 propagates (p[0]=1) with c_in=1: g_prefix_in[0] = g[0] | (1 & 1) = 1 (correct)
  • If chunk 0 kills (p[0]=0, g[0]=0): g_prefix_in[0] = 0 (correct)
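A two-chunk toy case (Python, illustrative values) shows the difference the injection makes:

```python
# Toy case: W=8, M=4, a=0x0F, b=0x00, c_in=1.
# Chunk 0 is 0xF + 0x0: it generates nothing (g0=0) but propagates (p0=1).
g0, p0, c_in = 0, 1, 1

carry_into_chunk1_buggy = g0                 # c_in ignored -> 0 (wrong)
carry_into_chunk1_fixed = g0 | (p0 & c_in)   # 1 (correct: 0x0F + 0x00 + 1 = 0x10)

assert (carry_into_chunk1_buggy, carry_into_chunk1_fixed) == (0, 1)
```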

Prefix Tree Implementation (rtl/prefix_tree.sv):

Uses Kogge-Stone algorithm for O(log N) depth parallel carry computation.

Pipelining Structure:

  1. Input Registers (lines 70-81): Always present

    • Breaks critical path from Level 1 RCA outputs to prefix logic
    • Adds 1 cycle latency
  2. Kogge-Stone Stages 0 to log₂(N)-1 (lines 88-133):

    • Each stage k processes span = 2^k
    • For each position i ≥ span:
      • G[i] = G[i] | (P[i] & G[i-span])
      • P[i] = P[i] & P[i-span]
    • PIPE_EVERY_2=1: Registers inserted after odd stages (1, 3, 5, 7, ...)
    • Even stages (0, 2, 4, 6, ...) are combinational
  3. Output Register (lines 145-159): When PIPE=1

    • Final registration for timing closure
    • Adds 1 cycle latency

Latency Calculation:

PREFIX_STAGES = $clog2(N)
PREFIX_INPUT_REGS = 1  // Always present
PREFIX_INTERNAL_REGS = (PIPE_EVERY_2 == 1) ? (PREFIX_STAGES / 2) : 0
PREFIX_OUTPUT_REGS = PIPE
PREFIX_LATENCY = PREFIX_INPUT_REGS + PREFIX_INTERNAL_REGS + PREFIX_OUTPUT_REGS

Example (M=16, N=128):

  • PREFIX_STAGES = $clog2(127) = 7
  • PREFIX_INPUT_REGS = 1
  • PREFIX_INTERNAL_REGS = ⌊7/2⌋ = 3 (after stages 1, 3, 5)
  • PREFIX_OUTPUT_REGS = 1
  • PREFIX_LATENCY = 1 + 3 + 1 = 5 cycles

MSB Handling: MSB chunk (N-1) excluded from prefix tree. Its carry-out is computed separately in final c_out formula.

Output: c_vec[i] = carry INTO chunk i+1 (for i = 0 to N-2)


Level 3: Final Sum Computation

Purpose: Combine Level 1 partial sums with Level 2 computed carries.

Pipeline Alignment (rtl/cleveradder2048b.sv:177-222):

Before Level 3 can compute, Level 1 outputs must be delayed to synchronize with prefix tree output:

// Delay sum_0, c_in_l1, g, p by PREFIX_LATENCY cycles
logic [W-1:0] sum_0_shift [PREFIX_LATENCY:0];
logic c_in_shift_l2 [PREFIX_LATENCY:0];
logic [N-1:0] g_shift [PREFIX_LATENCY:0];
logic [N-1:0] p_shift [PREFIX_LATENCY:0];

// Shift registers (no reset for Hyper-Flex)
for (d = 0; d < PREFIX_LATENCY; d++) begin
    always_ff @(posedge clk) begin
        sum_0_shift[d+1] <= sum_0_shift[d];
        c_in_shift_l2[d+1] <= c_in_shift_l2[d];
        g_shift[d+1] <= g_shift[d];
        p_shift[d+1] <= p_shift[d];
    end
end

Final Sum Computation (rtl/cleveradder2048b.sv:224-261):

For each chunk i (0 to N-1):

carry_in = (i == 0) ? c_in_l2 : c_vec[i-1];

(* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_final (
    .a(sum_0_l2[i*M +: M]),
    .b({M{1'b0}}),  // Add zero
    .c_in(carry_in),
    .sum(sum_final[i*M +: M]),
    ...
);
  • Chunk 0: Uses delayed global c_in (c_in_l2)
  • Chunks 1+: Use prefix tree output (c_vec[i-1])
  • Each RCA_PIPE adds 3 cycles latency
  • Output: sum with out_valid = &valid_l3

Final Carry-Out (rtl/cleveradder2048b.sv:264-277):

c_out_formula = g_l2[N-1] | (p_l2[N-1] & c_vec[N-2]);

// Delay by 3 cycles to match Level 3 RCA latency
logic [RCA_LATENCY-1:0] c_out_shift;
always_ff @(posedge clk) begin
    c_out_shift <= {c_out_shift[RCA_LATENCY-2:0], c_out_formula};
end
assign c_out = c_out_shift[RCA_LATENCY-1];

Formula Explanation:

  • c_vec[N-2] = carry INTO MSB chunk (N-1)
  • If MSB generates (g[N-1]=1): c_out = 1
  • If MSB propagates (p[N-1]=1) and carry comes in: c_out = 1
  • Otherwise: c_out = 0
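Putting the three levels together: an un-timed Python reference model (hypothetical `clever_add`, not part of the repo; a serial scan stands in for the prefix tree, which computes the same carries) can be checked against plain big-integer addition:

```python
import random

def clever_add(a: int, b: int, c_in: int, W: int = 2048, M: int = 16):
    """Un-timed behavioral model of the three-level clever adder."""
    n, mask = W // M, (1 << M) - 1
    chunks = [((a >> (i * M)) & mask, (b >> (i * M)) & mask) for i in range(n)]
    # Level 1: generate / propagate / partial sum per chunk (dual-RCA semantics)
    g = [(ai + bi) >> M for ai, bi in chunks]
    p = [(ai + bi + 1) >> M for ai, bi in chunks]
    sum_0 = [(ai + bi) & mask for ai, bi in chunks]
    g[0] |= p[0] & c_in                     # global carry-in injection
    # Level 2: carries into chunks 1..n-1 (serial scan in place of Kogge-Stone)
    c_vec = []
    for i in range(n - 1):
        c_vec.append(g[i] | (p[i] & c_vec[-1]) if i else g[0])
    # Level 3: fold each carry into the corresponding partial sum
    out = 0
    for i in range(n):
        carry = c_in if i == 0 else c_vec[i - 1]
        out |= ((sum_0[i] + carry) & mask) << (i * M)
    c_out = g[n - 1] | (p[n - 1] & c_vec[n - 2])   # final carry-out formula
    return out, c_out

# Spot-check against Python big-int addition
random.seed(42)
a, b = random.getrandbits(2048), random.getrandbits(2048)
s, co = clever_add(a, b, 1)
assert s == (a + b + 1) & ((1 << 2048) - 1) and co == (a + b + 1) >> 2048
```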

3. RCA_PIPE Module (rtl/rca_pipe.sv)

Purpose: Pipelined ripple-carry adder with diagonal carry propagation.

Architecture:

  1. Input Registration (lines 100-105): Register a, b, c_in, in_valid
  2. Parallel Chunk Computation: All chunks compute simultaneously
    • Chunk 0: Uses actual c_in
    • Chunks 1+: Assume c_in=0
  3. Stage 0 Registration (lines 163-168): Register initial sums and carries
  4. Ripple Correction Stages (lines 177-201): One stage per chunk
    • Stage i corrects chunk i using carry from stage i-1
    • Diagonal OR Rule: carry[i] = new_carry[i] | carry_internal[i]
  5. Output Registration (lines 208-212): Final sum, c_out, out_valid

Latency: NUM_CHUNKS + 2 cycles where NUM_CHUNKS = ⌈W/M⌉

  • Input register: 1 cycle
  • Stage 0 register: 1 cycle
  • Ripple stages 1 to NUM_CHUNKS-1: (NUM_CHUNKS - 1) cycles
  • Output register: 1 cycle
  • Total: 1 + 1 + (NUM_CHUNKS - 1) + 1 = NUM_CHUNKS + 2

Key Feature - Diagonal OR Rule: When correcting chunk i by adding carry[i-1]:

  • May generate new carry: new_carry[i]
  • Original carry from parallel computation: carry_internal[i]
  • At most one can be 1 (adding single-bit carry)
  • Safe to OR: carry[i] = new_carry[i] | carry_internal[i]
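The whole scheme, including the at-most-one-carry property behind the OR rule, can be checked with an un-timed Python model (hypothetical `rca_pipe_model`, not part of the repo):

```python
import math

def rca_pipe_model(a: int, b: int, c_in: int, W: int, M: int):
    """Un-timed model of RCA_PIPE: parallel chunk adds assuming c_in=0
    (except chunk 0), then ripple correction with the diagonal OR rule."""
    n = math.ceil(W / M)
    sums, carries = [], []
    for i in range(n):
        width = min(M, W - i * M)              # final chunk may be partial (M ∤ W)
        mask = (1 << width) - 1
        ai, bi = (a >> (i * M)) & mask, (b >> (i * M)) & mask
        s = ai + bi + (c_in if i == 0 else 0)
        sums.append(s & mask)
        carries.append(s >> width)             # carry_internal[i]
    for i in range(1, n):                      # one correction stage per chunk
        width = min(M, W - i * M)
        mask = (1 << width) - 1
        s = sums[i] + carries[i - 1]
        new_carry = s >> width
        assert not (new_carry and carries[i])  # at most one carry source fires
        sums[i] = s & mask
        carries[i] = new_carry | carries[i]    # diagonal OR rule
    return sum(s << (i * M) for i, s in enumerate(sums)), carries[-1]

# Exhaustively exercises the OR-rule assertion on a tiny width (W=6, M=4)
for a in range(64):
    for b in range(64):
        s, co = rca_pipe_model(a, b, 1, 6, 4)
        assert (co << 6) | s == a + b + 1
```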

Usage in Clever Adder:

  • Always instantiated with W=M, M=M → NUM_CHUNKS=1
  • This gives fixed 3-cycle latency: 1 (input) + 1 (stage 0) + 0 (no ripple stages) + 1 (output) = 3
  • Used in Level 1 (256 G/P instances) and Level 3 (128 final sum instances)

4. Prefix Tree Module (rtl/prefix_tree.sv)

Purpose: Parallel prefix network for carry computation using Kogge-Stone algorithm.

Parameters:

  • N: Number of positions (typically W/M - 1 for clever adder)
  • PIPE: Output registration (0=combinational, 1=registered)
  • PIPE_EVERY_2: Internal pipeline registers (0=disabled, 1=every 2 stages)

Algorithm - Kogge-Stone:

For stage k = 0 to log₂(N)-1:
    span = 2^k
    For each position i ≥ span:
        G[i] = G[i] | (P[i] & G[i-span])  // Group generate
        P[i] = P[i] & P[i-span]           // Group propagate
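The loop above translates directly into a runnable Python sketch, cross-checked against a serial carry ripple over the same (g, p) pairs:

```python
import random

def kogge_stone(g: list, p: list) -> list:
    """Kogge-Stone prefix scan: element i of the result is the group-generate
    over positions 0..i, i.e. the carry out of position i (carry INTO i+1)."""
    n = len(g)
    G, P = list(g), list(p)
    span = 1
    while span < n:                          # ceil(log2(n)) stages total
        newG, newP = list(G), list(P)
        for i in range(span, n):             # only positions i >= span update
            newG[i] = G[i] | (P[i] & G[i - span])
            newP[i] = P[i] & P[i - span]
        G, P = newG, newP
        span *= 2
    return G

# Cross-check against a serial ripple of the same (g, p) pairs
random.seed(0)
g = [random.randint(0, 1) for _ in range(127)]
p = [x | random.randint(0, 1) for x in g]    # p >= g, as with dual-RCA chunks
ripple, c = [], 0
for gi, pi in zip(g, p):
    c = gi | (pi & c)
    ripple.append(c)
assert kogge_stone(g, p) == ripple
```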

Advantages:

  • Depth: O(log N) stages
  • Parallelism: All positions updated in parallel each stage (low fan-out per node, at the cost of dense wiring)
  • Latency: Minimal with pipelining

Alternative - Brent-Kung (not used):

  • Depth: Still O(log N), but roughly 2·log₂(N) levels (about double Kogge-Stone)
  • Fewer operations: Better for area-constrained designs
  • Trade-off: Lower area but potentially higher latency vs Kogge-Stone

Pipelining Strategy:

  1. Input Register: Breaks long path from RCA outputs
  2. Internal Registers (PIPE_EVERY_2=1): After odd stages (1, 3, 5, ...)
    • Balances pipeline depth with Fmax
    • Leaving even stages combinational reduces total latency
  3. Output Register (PIPE=1): Final timing closure

Verilator Note: Uses /* verilator lint_off UNOPTFLAT */ to suppress false circular dependency warnings on stage arrays.


Optimization Strategies

1. Hyper-Flex Register Optimization

Goal: Enable Intel Hyper-Flex registers for high-frequency operation on Agilex 5.

Key Changes:

Remove Resets from Data Path

Problem: Asynchronous resets block Quartus register retiming.

Solution: Remove all if (rst) conditions from data path registers.

Rationale:

  • Valid signals control when data is meaningful
  • Invalid data during first few cycles after power-up is acceptable
  • FPGA streaming designs don't require deterministic reset state
  • Result: Quartus can freely retime registers for Fmax optimization

2. Hierarchy Flattening

Goal: Enable cross-module optimization and Hyper-Register insertion.

Implementation: Add (* keep_hierarchy = "no" *) to all RCA_PIPE instances.

Benefits:

  • Quartus can retime registers across module boundaries
  • Enables Hyper-Register insertion inside carry chains
  • Allows global optimization of critical paths
  • Balances register distribution across design

Trade-off: Slight increase in compile time, but significant Fmax improvement.


3. Prefix Tree Pipelining

Strategy: Balance latency vs Fmax with selective pipeline insertion.

Configuration (PIPE=1, PIPE_EVERY_2=1):

  • Input Register: Always present (breaks critical path from Level 1)
  • Internal Registers: After odd stages only (1, 3, 5, 7, ...)
    • Stage 0→1: combinational
    • Stage 1→2: registered
    • Stage 2→3: combinational
    • Stage 3→4: registered
    • Continues pattern...
  • Output Register: Final timing closure

Rationale:

  • Every 2 stages keeps pipeline depth manageable
  • Odd-stage placement balances path lengths
  • Reduces total latency compared to every-stage pipelining

Alternative Considered (PIPE_EVERY_1 - not implemented):

  • Would add registers after every stage
  • Latency: +3 to +4 cycles (stages + input + output)
  • Marginal Fmax improvement not worth latency cost

4. Quartus Synthesis Settings

File: fpga/setup.tcl

Conservative Retiming Strategy:

set_global_assignment -name ALLOW_REGISTER_RETIMING ON
set_global_assignment -name ALLOW_REGISTER_DUPLICATION OFF
set_global_assignment -name ALLOW_REGISTER_MERGING OFF
set_global_assignment -name REMOVE_DUPLICATE_REGISTERS OFF

Rationale:

  • RETIMING ON: Enables Quartus to move registers for timing
  • DUPLICATION OFF: Prevents excessive register duplication (saves ~1000 ALMs)
  • MERGING OFF: Preserves register boundaries for predictable timing
  • REMOVE_DUPLICATES OFF: Keeps necessary duplicates for fanout

Trade-off: ~10 MHz Fmax reduction vs aggressive settings, but 1000+ ALM savings.

Result: Area-efficient Hyper-Flex optimization with good Fmax.


Experimental Results

Experiment 1: Hyper-Flex Optimization (Reset Removal + Hierarchy Flattening)

Hypothesis: Removing resets from data path and flattening hierarchy will enable Hyper-Register insertion and improve Fmax.

Baseline (Before optimization):

  • M=16: 749.63 MHz @ 9,809 ALMs
  • Estimated Hyper-REG: ~5,000
  • All registers had if (rst) conditions
  • Module hierarchy preserved (keep_hierarchy="yes" by default)

Changes Applied:

  1. Removed resets from all data path registers (kept on valid signals)
  2. Added (* keep_hierarchy = "no" *) to 384 RCA instances
  3. Conservative Quartus settings (duplication OFF)

Results (After optimization):

| M | Fmax (MHz) | ALMs | Hyper-REG | vs Baseline | Notes |
|---|---|---|---|---|---|
| 4 | 599.88 | 11,484 | 16,892 | N/A | Highest Hyper-REG |
| 8 | 783.09 | 9,166 | 11,571 | +33 MHz | - |
| 16 | 825.08 | 8,814 | 8,543 | +75 MHz | 🏆 WINNER |
| 32 | 696.86 | 8,003 | 8,116 | +30 MHz | - |
| 64 | 670.24 | 7,459 | 7,935 | +20 MHz | Lowest ALM usage |

Key Findings:

  • ✅ M=16 optimal: 825.08 MHz (+10% over baseline)
  • ✅ Hyper-REG increased: 8,543 vs ~5,000 estimated baseline (+71%)
  • ✅ ALM efficient: 8,814 vs 9,809 baseline (-995 ALMs, -10%)
  • ✅ Best efficiency: 0.0936 MHz/ALM

Why M=16 Wins:

  • 128 chunks → 384 RCA instances (good parallelism)
  • 16-bit RCA carry chains (balanced path length)
  • 7 Kogge-Stone prefix stages (optimal depth for 127 positions)
  • 5-cycle prefix latency (11-cycle total)

Status: ✅ VALIDATED


Experiment 2: Prefix Tree Input Registers (Tested and Validated)

Hypothesis: Adding input registers to prefix tree would break critical path from Level 1 RCA outputs.

Motivation:

  • Critical path identified: RCA output register → prefix stage 0 logic → prefix stage 1 logic → register
  • ~2 combinational stages before first prefix register
  • Adding input register breaks this critical path

Updated cleveradder2048b.sv PREFIX_LATENCY calculation (line 68):

localparam PREFIX_INPUT_REGS = 1;
localparam PREFIX_LATENCY = PREFIX_INPUT_REGS + PREFIX_INTERNAL_REGS + PREFIX_OUTPUT_REGS;

Latency Impact (M=16):

  • M=16: N=128, PREFIX_STAGES = $clog2(127) = 7
  • PREFIX_INPUT_REGS = 1 (always registered)
  • PREFIX_INTERNAL_REGS = ⌊7/2⌋ = 3 (registers after stages 1, 3, 5)
  • PREFIX_OUTPUT_REGS = 1 (PIPE=1)
  • PREFIX_LATENCY = 1 + 3 + 1 = 5 cycles
  • Total clever latency = 3 (L1) + 5 (L2) + 3 (L3) = 11 cycles

Note: The input register was part of the original design to break the critical path from Level 1 RCA outputs to prefix logic.

Expected Results:

  • Fmax: 820-880 MHz target
  • ALMs: +50-100 (508 new registers)
  • Hyper-REG: +200-500

Actual Results: Integrated into final design, M=16 achieved 825.08 MHz.

Status: ✅ VALIDATED


Experiment 3: Top-Level Input Registers (⚠️ TESTED AND REJECTED)

Hypothesis: Register all inputs at module boundary to reduce routing delay to RCA instances.

Expected Benefits:

  • Shorter routing: PAD → register → RCA
  • Enable register duplication for different RCA fanout
  • Allow Hyper-Flex insertion between register stages

Actual Results (❌ FAILED):

| M | Fmax WITHOUT (MHz) | Fmax WITH (MHz) | Δ Fmax | Δ ALMs |
|---|---|---|---|---|
| 4 | 699.3 | 668.0 | -31.3 MHz | +739 |
| 8 | 743.49 | 736.92 | -6.57 MHz | +467 |
| 16 | 749.63 | 672.95 | -76.68 MHz ❌ | +287 |
| 32 | 666.67 | 662.69 | -3.98 MHz | +229 |

Why It Failed:

  1. I/O Boundary Bottleneck: PAD → a_r path became new critical path
    • "Domain Boundary Entry" constraints hard to optimize
    • Retiming reports: "Add stages BEFORE a_r" (impossible at boundary)
  2. Double Buffering Overhead: PAD → a_r → RCA input register
    • Two serial stages worse than one optimizable stage
  3. Routing Congestion: 4,098 new registers (a_r, b_r, c_in_r, valid_r)
    • Massive fanout to 256+ RCA instances
    • Poor placement, increased ALM usage
  4. M=16 Hit Hardest: -76.68 MHz, the largest degradation in the sweep (256 Level-1 RCA loads on the new registers)

Lesson Learned: Original design (direct port-to-RCA connection) is optimal. RCA_PIPE's internal input registers provide sufficient pipelining. Adding another layer exposed I/O constraints.

Status: ❌ REJECTED


M Parameter Sweep Analysis

Clever Adder Sweep Results

Full Sweep (M ∈ {4, 8, 16, 32, 64}):

| M | N Chunks | Prefix Stages | Fmax (MHz) | ALMs | Hyper-REG | Latency | Efficiency (MHz/ALM) |
|---|---|---|---|---|---|---|---|
| 4 | 512 | 9 | 599.88 | 11,484 | 16,892 | 13 | 0.0522 |
| 8 | 256 | 8 | 783.09 | 9,166 | 11,571 | 12 | 0.0854 |
| 16 | 128 | 7 | 825.08 | 8,814 | 8,543 | 11 | 0.0936 🏆 |
| 32 | 64 | 6 | 696.86 | 8,003 | 8,116 | 11 | 0.0871 |
| 64 | 32 | 5 | 670.24 | 7,459 | 7,935 | 10 | 0.0899 |

Analysis:

M=4 (❌ Worst Fmax):

  • 512 chunks → 1,536 RCA instances (massive)
  • 9 prefix stages → 6-cycle prefix latency
  • Highest ALM usage: 11,484
  • Issue: Too many instances → routing congestion
  • Lesson: Excessive parallelism hurts

M=8 (✅ Good, but not optimal):

  • 256 chunks → 768 RCA instances
  • 8 prefix stages → 6-cycle prefix latency
  • 783.09 MHz (+30% over M=4)
  • Decent efficiency: 0.0854 MHz/ALM
  • Why not winner: M=16 has better balance

M=16 (🏆 WINNER):

  • 128 chunks → 384 RCA instances (sweet spot)
  • 7 prefix stages → 5-cycle prefix latency
  • 825.08 MHz (+5.4% over M=8)
  • Lowest latency: 11 cycles
  • Best efficiency: 0.0936 MHz/ALM
  • Perfect balance: Parallelism vs path length vs resources

M=32 (❌ Fmax drops):

  • 64 chunks → 192 RCA instances
  • 6 prefix stages → 5-cycle prefix latency
  • 696.86 MHz (-15.5% vs M=16)
  • Issue: Longer carry chains (32-bit RCA)
  • Lesson: Too little parallelism hurts Fmax

M=64 (❌ Further Fmax drop):

  • 32 chunks → 96 RCA instances
  • 5 prefix stages → 4-cycle prefix latency
  • 670.24 MHz (-18.8% vs M=16)
  • Best ALM usage: 7,459
  • Issue: 64-bit RCA carry chains too long
  • Lesson: Minimal latency doesn't mean optimal Fmax

Key Insight: M=16 provides optimal balance between:

  • Prefix tree depth (7 stages = low latency)
  • RCA carry chain length (16-bit = fast enough)
  • Number of instances (384 = good parallelism without congestion)
  • Resource utilization (8,814 ALMs = efficient)

Naive Adder Sweep Results

Selected Results (M ∈ {8, 9, 10, 11, 16, 19, 20, 21, 29, 30, 31, 39, 40, 41}):

| M | NUM_CHUNKS | Fmax (MHz) | ALMs | Latency (cycles) | Hyper-REG | Notes |
|---|---|---|---|---|---|---|
| 8 | 256 | 599.88 | 9,685 | 258 | 7,599 | Tied Fmax |
| 9 | 228 | 599.88 | 9,073 | 230 | 7,536 | 🏆 WINNER |
| 10 | 205 | 599.52 | 9,164 | 207 | 7,355 | Fmax drops |
| 11 | 187 | 592.42 | 8,847 | 189 | 7,276 | - |
| 16 | 128 | 553.10 | 8,112 | 130 | 6,875 | - |
| 19 | 108 | 507.87 | 8,035 | 110 | 6,821 | Lab3 golden |
| 20 | 103 | 488.28 | 8,230 | 105 | 8,819 | - |
| 21 | 98 | 518.94 | 7,977 | 100 | 6,609 | - |

Why M=9 Wins:

  • Highest Fmax: 599.88 MHz (tied with M=8)
  • Lowest ALM usage among tied Fmax: 9,073 vs 9,685 (M=8)
  • Reasonable latency: 230 cycles
  • Best Fmax-to-ALM ratio at peak Fmax

Observation: Naive adder Fmax saturates at ~600 MHz for all M ≤ 10.

  • Likely hitting fundamental routing or I/O limits
  • Smaller M doesn't improve Fmax further
  • M=9 chosen for lowest resource usage at peak Fmax

Latency Analysis

Latency Comparison: Clever vs Naive

Formulas:

  • Naive: Latency = NUM_CHUNKS + 2 = ⌈W/M⌉ + 2
  • Clever: Latency = 6 + PREFIX_LATENCY
    • PREFIX_LATENCY = 1 (input) + ⌊log₂(W/M)/2⌋ (internal) + 1 (output)

Concrete Examples (W=2048):

| M | Naive Chunks | Naive Latency | Clever Chunks | Prefix Stages | PREFIX_LAT | Clever Latency | Speedup |
|---|---|---|---|---|---|---|---|
| 8 | 256 | 258 | 256 | 8 | 6 | 12 | 21.5× |
| 16 | 128 | 130 | 128 | 7 | 5 | 11 | 11.8× |
| 32 | 64 | 66 | 64 | 6 | 5 | 11 | 6.0× |
| 64 | 32 | 34 | 32 | 5 | 4 | 10 | 3.4× |

Key Insight: Clever adder achieves logarithmic latency vs naive's linear latency.

At M=8:

  • Naive: 258 cycles @ 599.88 MHz = 430 ns
  • Clever: 12 cycles @ 783.09 MHz = 15.3 ns
  • 21.5× faster in cycle count, 28.1× faster in actual time!

At M=16 (optimal Fmax):

  • Naive: 130 cycles @ 553.10 MHz = 235 ns
  • Clever: 11 cycles @ 825.08 MHz = 13.3 ns
  • 11.8× faster in cycle count, 17.7× faster in actual time!
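The wall-clock numbers follow directly from cycles / Fmax (clock period in ns = 1000 / MHz); a quick Python check:

```python
def time_ns(cycles: int, fmax_mhz: float) -> float:
    return cycles * 1000.0 / fmax_mhz   # clock period in ns = 1000 / Fmax(MHz)

print(round(time_ns(258, 599.88)))      # 430  (naive,  M=8)
print(round(time_ns(12, 783.09), 1))    # 15.3 (clever, M=8)
print(round(time_ns(130, 553.10)))      # 235  (naive,  M=16)
print(round(time_ns(11, 825.08), 1))    # 13.3 (clever, M=16)
```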

Usage Instructions

Run Submission Scripts

source env.sh

# Naive adder (M=9)
./naive.sh
# Output: results/winner_naive.csv

# Clever adder (M=16)
./clever.sh
# Output: results/winner_clever.csv

Run M Sweeps

# Clever adder sweep (M ∈ {4, 8, 16, 32, 64})
./sweep_clever.sh
# Output: results/clever_sweep.csv

# Naive adder sweep (M ∈ {8-11, 16, 19-21, 29-31, 39-41})
./sweep_naive.sh
# Output: results/naive_sweep.csv

Manual Synthesis

# Makefile auto-sets ARCH_PARAM_NAME=M and ARCH_PARAM_VAL=$M
make fit DUT=cleveradder2048b W=2048 M=32
make extract DUT=cleveradder2048b W=2048 M=32
cat results/cleveradder2048b_M32.csv

About

Implemented and validated a high-frequency 2048-bit adder on an Intel Agilex 5 device using Quartus.
