This lab implements and optimizes two 2048-bit adders for Intel Agilex 5 FPGA:
- Naive Adder: Pipelined RCA with diagonal carry propagation
- Clever Adder: Type-2 "Monster Adder" with Kogge-Stone prefix tree
Target Device: Intel Agilex 5 A5EC065BB23AE4SR0 (222,400 ALMs)
- Best Configuration: M=9
- Performance: 599.88 MHz @ 9,073 ALMs
- Latency: 230 cycles (⌈2048/9⌉ + 2 = 228 + 2)
- Architecture: RCA_PIPE with diagonal carry propagation
- Best Configuration: M=16
- Performance: 825.08 MHz @ 8,814 ALMs
- Latency: 11 cycles
- Speedup vs Naive: +37.5% Fmax, -2.9% ALMs, +41.6% efficiency (0.0936 vs 0.0661 MHz/ALM)
- Architecture: Type-2 Monster Adder with Kogge-Stone prefix tree
| Adder | M | Fmax (MHz) | ALMs | Latency | Efficiency (MHz/ALM) | Hyper-REG |
|---|---|---|---|---|---|---|
| Naive | 9 | 599.88 | 9,073 | 230 | 0.0661 | 7,536 |
| Clever | 16 | 825.08 | 8,814 | 11 | 0.0936 | 8,543 |
| Improvement | - | +37.5% | -2.9% | -95.2% | +41.6% | +13.4% |
Implementation: Direct wrapper around rca_pipe.sv with optimized M parameter.
module naiveadder2048b #(
parameter W = 2048,
parameter M = 9 // Optimal: 599.88 MHz @ 9,073 ALMs
) (
input logic clk, rst, in_valid,
input logic [W-1:0] a, b,
input logic c_in,
output logic out_valid,
output logic [W-1:0] sum,
output logic c_out
);
rca_pipe #(.W(W), .M(M)) rca_inst (.*);
endmodule

Architecture: Pipelined RCA with diagonal carry propagation
- Breaks W-bit addition into NUM_CHUNKS = ⌈W/M⌉ parallel M-bit chunks
- Stage 0: Input registration
- Stage 1: Parallel chunk computation (chunk 0 uses actual c_in, others assume c_in=0)
- Stages 2 to NUM_CHUNKS: Ripple correction with diagonal OR rule
- Final stage: Output registration
Latency Formula: NUM_CHUNKS + 2 cycles where NUM_CHUNKS = ⌈W/M⌉
- Example (M=9, W=2048): ⌈2048/9⌉ + 2 = 228 + 2 = 230 cycles
- Example (M=16, W=2048): ⌈2048/16⌉ + 2 = 128 + 2 = 130 cycles
Note on M=9: Non-divisible M values are supported. RCA_PIPE handles partial chunks internally.
Optimization: M=9 provides best Fmax-to-ALM ratio among tested values.
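The latency formula can be sanity-checked with a short Python sketch (a behavioral model of the cycle count, not part of the RTL):

```python
import math

def naive_latency(W: int, M: int) -> int:
    """Cycles through the pipelined RCA: 1 input reg + 1 stage-0 reg
    + (NUM_CHUNKS - 1) ripple-correction stages + 1 output reg."""
    num_chunks = math.ceil(W / M)
    return num_chunks + 2
```

`naive_latency(2048, 9)` reproduces the 230-cycle figure and `naive_latency(2048, 16)` the 130-cycle figure quoted above.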
Implementation: Type-2 Monster Adder (Langhammer 2019) with three-level parallel carry computation.
module cleveradder2048b #(
parameter W = 2048,
parameter M = 16, // Optimal: 825.08 MHz @ 8,814 ALMs
parameter PIPE = 1, // Prefix tree output register
parameter PIPE_EVERY_2 = 1 // Internal prefix registers every 2 stages
) (
input logic clk, rst,
input logic [W-1:0] a, b,
input logic c_in,
input logic in_valid,
output logic [W-1:0] sum,
output logic c_out,
output logic out_valid
);

Architecture: Three-level design achieving O(log N) carry-computation complexity.
Level 1: Generate/Propagate Computation (RCA_PIPE)
├─ For each chunk i: compute g[i], p[i], sum_0[i]
├─ Dual RCA per chunk: RCA(c_in=0) → G, RCA(c_in=1) → P
└─ Latency: 3 cycles (RCA_PIPE with M=M)
Level 2: Prefix Network (Kogge-Stone)
├─ Input: g[N-2:0], p[N-2:0] (MSB excluded)
├─ Compute: c_vec[i] = carry INTO chunk i+1
├─ Algorithm: Kogge-Stone prefix tree (O(log N) depth)
├─ Pipelining: Input reg + Internal regs (every 2 stages) + Output reg
└─ Latency: PREFIX_LATENCY = 1 + ⌊stages/2⌋ + 1 cycles
Level 3: Final Sum Computation (RCA_PIPE)
├─ For each chunk i: sum_final[i] = sum_0[i] + carry[i]
├─ carry[0] = c_in (global), carry[i] = c_vec[i-1] (from prefix)
└─ Latency: 3 cycles (RCA_PIPE with M=M)
Total Latency: 3 + PREFIX_LATENCY + 3 = 6 + PREFIX_LATENCY cycles
Latency Examples (W=2048, PIPE=1, PIPE_EVERY_2=1):
| M | N Chunks | Prefix Stages | Input Reg | Internal Regs | Output Reg | PREFIX_LATENCY | Total Latency |
|---|---|---|---|---|---|---|---|
| 8 | 256 | 8 | 1 | 4 (after 1,3,5,7) | 1 | 6 | 12 cycles |
| 16 | 128 | 7 | 1 | 3 (after 1,3,5) | 1 | 5 | 11 cycles |
| 32 | 64 | 6 | 1 | 3 (after 1,3,5) | 1 | 5 | 11 cycles |
| 64 | 32 | 5 | 1 | 2 (after 1,3) | 1 | 4 | 10 cycles |
Key Constraint: M must divide W evenly (N = W/M must be integer) for prefix tree.
- Valid M for W=2048: {4, 8, 16, 32, 64, 128, 256, 512, 1024}
Purpose: Precompute per-chunk carry behavior for both possible carry-in values.
For each chunk i (0 to N-1), compute:
- g[i]: Generate signal (carry-out when c_in=0 to chunk) = c_out from RCA(a[i], b[i], c_in=0)
- p[i]: Propagate signal (carry-out when c_in=1 to chunk) = c_out from RCA(a[i], b[i], c_in=1)
- sum_0[i]: Partial sum assuming c_in=0 to chunk
Implementation (rtl/cleveradder2048b.sv:89-132):
// 128 chunks × 2 RCAs = 256 RCA_PIPE instances
for (i = 0; i < N; i++) begin
(* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_g (
.c_in(1'b0), .c_out(c_out_g), .sum(sum_g), ...
);
(* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_p (
.c_in(1'b1), .c_out(c_out_p), ...
);
assign g[i] = c_out_g;
assign p[i] = c_out_p;
assign sum_0[i*M +: M] = sum_g;
end

Timing:
- All RCA_PIPE outputs are registered after 3 cycles (M=M → NUM_CHUNKS=1 → latency=3)
- Valid signals track completion: valid_l1_all = &valid_l1[N-1:0]
- Global c_in delayed to c_in_l1 via shift register to align with Level 1 outputs
Key Insight: By computing g and p, Level 1 decouples sum computation from carry propagation, enabling Level 2 to compute all carries in parallel.
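This decoupling can be illustrated with a small Python reference model (illustrative only; names and the serial carry loop are a sketch, assuming M divides W, not the RTL):

```python
def level1_gp(a: int, b: int, W: int, M: int):
    """Per-chunk signals in the Level 1 sense:
    g[i] = chunk carry-out with c_in=0, p[i] = chunk carry-out with c_in=1,
    sum_0[i] = chunk sum assuming c_in=0. Assumes M divides W."""
    mask = (1 << M) - 1
    g, p, sum_0 = [], [], []
    for i in range(W // M):
        ai = (a >> (i * M)) & mask
        bi = (b >> (i * M)) & mask
        g.append((ai + bi) >> M)        # carry-out if carry-in were 0
        p.append((ai + bi + 1) >> M)    # carry-out if carry-in were 1
        sum_0.append((ai + bi) & mask)
    return g, p, sum_0

def add_via_gp(a: int, b: int, c_in: int, W: int, M: int):
    """Reconstruct the full sum from g/p/sum_0: the carry chain only
    needs g and p, never the chunk sums themselves."""
    g, p, sum_0 = level1_gp(a, b, W, M)
    mask = (1 << M) - 1
    c, out = c_in, 0
    for i in range(W // M):
        out |= ((sum_0[i] + c) & mask) << (i * M)
        c = p[i] if c else g[i]         # carry into the next chunk
    return out, c
```

The serial carry loop at the end is exactly the dependency chain that Level 2 replaces with a parallel prefix tree.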
Purpose: Compute carry into each chunk using parallel prefix algorithm.
Critical Bug Fix - Global Carry-In Injection (rtl/cleveradder2048b.sv:146-160):
The original implementation had a critical bug for small M values (especially M=4): the global carry-in (c_in) was not properly injected into the prefix tree computation.
Problem:
- Chunk 0's generate signal g[0] only reflects "does chunk 0 generate carry with c_in=0?"
- But if chunk 0 propagates (p[0]=1) AND we have global c_in=1, chunk 0 should generate carry-out
- Without this correction, carry-out from chunk 0 ignores c_in → wrong results
Solution:
logic [N-2:0] g_prefix_in;
always_comb begin
g_prefix_in = g[N-2:0];
// Inject global carry: Real_G[0] = G[0] | (P[0] & C_in_Level1)
g_prefix_in[0] = g[0] | (p[0] & c_in_l1);
end
prefix_tree prefix_inst (
.g(g_prefix_in), // Use modified signal
.p(p[N-2:0]),
...
);

Why this fixes the bug:
- If chunk 0 generates (g[0]=1): g_prefix_in[0] = 1 (correct)
- If chunk 0 propagates (p[0]=1) with c_in=1: g_prefix_in[0] = g[0] | (1 & 1) = 1 (correct)
- If chunk 0 kills (p[0]=0, g[0]=0): g_prefix_in[0] = 0 (correct)
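The three cases can be checked exhaustively with a few lines of Python (a sketch of the injection formula itself, not the RTL):

```python
def effective_g0(g0: int, p0: int, c_in: int) -> int:
    """Chunk 0's generate with the global carry-in folded in:
    Real_G[0] = G[0] | (P[0] & C_in)."""
    return g0 | (p0 & c_in)

# With the Level 1 convention (p = carry-out when c_in=1), p >= g always
# holds, and the effective generate must equal "carry-out for this c_in".
for g0 in (0, 1):
    for p0 in (0, 1):
        if p0 < g0:
            continue  # impossible under the p >= g convention
        for c_in in (0, 1):
            expected = p0 if c_in else g0
            assert effective_g0(g0, p0, c_in) == expected
```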
Prefix Tree Implementation (rtl/prefix_tree.sv):
Uses Kogge-Stone algorithm for O(log N) depth parallel carry computation.
Pipelining Structure:
- Stage 0 - Input Registers (lines 70-81): Always registered
  - Breaks critical path from Level 1 RCA outputs to prefix logic
  - Adds 1 cycle latency
- Stages 1 to log₂(N) - Kogge-Stone Computation (lines 88-133):
  - Each stage k processes span = 2^k
  - For each position i ≥ span:
    G[i] = G[i] | (P[i] & G[i-span])
    P[i] = P[i] & P[i-span]
  - PIPE_EVERY_2=1: Registers inserted after odd stages (1, 3, 5, 7, ...)
  - Even stages (0, 2, 4, 6, ...) are combinational
- Output Register (lines 145-159): When PIPE=1
  - Final registration for timing closure
  - Adds 1 cycle latency
Latency Calculation:
PREFIX_STAGES = $clog2(N)
PREFIX_INPUT_REGS = 1 // Always present
PREFIX_INTERNAL_REGS = (PIPE_EVERY_2 == 1) ? (PREFIX_STAGES / 2) : 0
PREFIX_OUTPUT_REGS = PIPE
PREFIX_LATENCY = PREFIX_INPUT_REGS + PREFIX_INTERNAL_REGS + PREFIX_OUTPUT_REGS

Example (M=16, N=128):
- PREFIX_STAGES = ⌈log₂(127)⌉ = 7
- PREFIX_INPUT_REGS = 1
- PREFIX_INTERNAL_REGS = ⌊7/2⌋ = 3 (after stages 1, 3, 5)
- PREFIX_OUTPUT_REGS = 1
- PREFIX_LATENCY = 1 + 3 + 1 = 5 cycles
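The same bookkeeping can be expressed as a small Python sketch (a model of the latency arithmetic above; names are illustrative):

```python
def prefix_latency(n_positions: int, pipe: int = 1, pipe_every_2: int = 1) -> int:
    """Registers through the prefix tree: 1 input reg, one internal reg
    per pair of stages when PIPE_EVERY_2=1, plus PIPE output regs."""
    stages = (n_positions - 1).bit_length()       # $clog2(n_positions)
    internal = stages // 2 if pipe_every_2 else 0
    return 1 + internal + pipe

def clever_latency(W: int, M: int) -> int:
    """Total latency: 3 (Level 1) + PREFIX_LATENCY + 3 (Level 3).
    The prefix tree sees W/M - 1 positions (MSB chunk excluded)."""
    return 3 + prefix_latency(W // M - 1) + 3
```

`clever_latency(2048, 16)` gives 11 and `clever_latency(2048, 8)` gives 12, matching the latency table above.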
MSB Handling: MSB chunk (N-1) excluded from prefix tree. Its carry-out is computed separately in final c_out formula.
Output: c_vec[i] = carry INTO chunk i+1 (for i = 0 to N-2)
Purpose: Combine Level 1 partial sums with Level 2 computed carries.
Pipeline Alignment (rtl/cleveradder2048b.sv:177-222):
Before Level 3 can compute, Level 1 outputs must be delayed to synchronize with prefix tree output:
// Delay sum_0, c_in_l1, g, p by PREFIX_LATENCY cycles
logic [W-1:0] sum_0_shift [PREFIX_LATENCY:0];
logic c_in_shift_l2 [PREFIX_LATENCY:0];
logic [N-1:0] g_shift [PREFIX_LATENCY:0];
logic [N-1:0] p_shift [PREFIX_LATENCY:0];
// Shift registers (no reset for Hyper-Flex)
for (d = 0; d < PREFIX_LATENCY; d++) begin
always_ff @(posedge clk) begin
sum_0_shift[d+1] <= sum_0_shift[d];
c_in_shift_l2[d+1] <= c_in_shift_l2[d];
g_shift[d+1] <= g_shift[d];
p_shift[d+1] <= p_shift[d];
end
end

Final Sum Computation (rtl/cleveradder2048b.sv:224-261):
For each chunk i (0 to N-1):
carry_in = (i == 0) ? c_in_l2 : c_vec[i-1];
(* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_final (
.a(sum_0_l2[i*M +: M]),
.b({M{1'b0}}), // Add zero
.c_in(carry_in),
.sum(sum_final[i*M +: M]),
...
);

- Chunk 0: Uses delayed global c_in (c_in_l2)
- Chunks 1+: Use prefix tree output (c_vec[i-1])
- Each RCA_PIPE adds 3 cycles latency
- Output: sum with out_valid = &valid_l3
Final Carry-Out (rtl/cleveradder2048b.sv:264-277):
c_out_formula = g_l2[N-1] | (p_l2[N-1] & c_vec[N-2]);
// Delay by 3 cycles to match Level 3 RCA latency
logic [RCA_LATENCY-1:0] c_out_shift;
always_ff @(posedge clk) begin
c_out_shift <= {c_out_shift[RCA_LATENCY-2:0], c_out_formula};
end
assign c_out = c_out_shift[RCA_LATENCY-1];

Formula Explanation:
- c_vec[N-2] = carry INTO MSB chunk (N-1)
- If MSB generates (g[N-1]=1): c_out = 1
- If MSB propagates (p[N-1]=1) and carry comes in: c_out = 1
- Otherwise: c_out = 0
Purpose: Pipelined ripple-carry adder with diagonal carry propagation.
Architecture:
- Input Registration (lines 100-105): Register a, b, c_in, in_valid
- Parallel Chunk Computation: All chunks compute simultaneously
- Chunk 0: Uses actual c_in
- Chunks 1+: Assume c_in=0
- Stage 0 Registration (lines 163-168): Register initial sums and carries
- Ripple Correction Stages (lines 177-201): One stage per chunk
- Stage i corrects chunk i using carry from stage i-1
- Diagonal OR Rule:
carry[i] = new_carry[i] | carry_internal[i]
- Output Registration (lines 208-212): Final sum, c_out, out_valid
Latency: NUM_CHUNKS + 2 cycles where NUM_CHUNKS = ⌈W/M⌉
- Input register: 1 cycle
- Stage 0 register: 1 cycle
- Ripple stages 1 to NUM_CHUNKS-1: (NUM_CHUNKS - 1) cycles
- Output register: 1 cycle
- Total: 1 + 1 + (NUM_CHUNKS - 1) + 1 = NUM_CHUNKS + 2
Key Feature - Diagonal OR Rule: When correcting chunk i by adding carry[i-1]:
- May generate a new carry: new_carry[i]
- Original carry from parallel computation: carry_internal[i]
- At most one can be 1 (adding a single-bit carry cannot overflow a chunk that already carried out)
- Safe to OR: carry[i] = new_carry[i] | carry_internal[i]
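A behavioral Python model of the correction pass (not cycle-accurate, and assuming M divides W, unlike the RTL which also handles partial chunks) shows why the OR is safe:

```python
def rca_pipe_model(a: int, b: int, c_in: int, W: int, M: int):
    """Behavioral model of the diagonal correction: chunks first add in
    parallel assuming c_in=0 (chunk 0 gets the real c_in), then each
    correction step adds the incoming carry and ORs any new carry with
    the stored internal carry. Assumes M divides W."""
    N = W // M
    mask = (1 << M) - 1
    sums, carries = [], []
    for i in range(N):
        ai = (a >> (i * M)) & mask
        bi = (b >> (i * M)) & mask
        t = ai + bi + (c_in if i == 0 else 0)
        sums.append(t & mask)
        carries.append(t >> M)                  # carry_internal[i]
    for i in range(1, N):                       # one correction per chunk
        t = sums[i] + carries[i - 1]
        new_carry = t >> M
        assert not (new_carry and carries[i])   # at most one can be 1
        sums[i] = t & mask
        carries[i] = new_carry | carries[i]     # diagonal OR rule
    out = sum(s << (i * M) for i, s in enumerate(sums))
    return out, carries[N - 1]
```

The sequential loop mirrors the pipeline diagonal: by the time stage i runs, carry[i-1] already includes the correction from stage i-1.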
Usage in Clever Adder:
- Always instantiated with W=M, M=M → NUM_CHUNKS=1
- This gives fixed 3-cycle latency: 1 (input) + 1 (stage 0) + 0 (no ripple stages) + 1 (output) = 3
- Used in Level 1 (256 G/P instances) and Level 3 (128 final sum instances)
Purpose: Parallel prefix network for carry computation using Kogge-Stone algorithm.
Parameters:
- N: Number of positions (typically W/M - 1 for clever adder)
- PIPE: Output registration (0=combinational, 1=registered)
- PIPE_EVERY_2: Internal pipeline registers (0=disabled, 1=every 2 stages)
Algorithm - Kogge-Stone:
For stage k = 0 to log₂(N)-1:
span = 2^k
For each position i ≥ span:
G[i] = G[i] | (P[i] & G[i-span]) // Group generate
P[i] = P[i] & P[i-span] // Group propagate
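The pseudocode above maps directly to a short Python sketch (an illustrative model, cross-checked against a serial ripple of the same g/p values):

```python
def kogge_stone_carries(g, p):
    """Kogge-Stone prefix: returns c[i] = carry out of position i
    (carry INTO position i+1) for the recurrence c[i] = g[i] | (p[i] & c[i-1])."""
    n = len(g)
    G, P = list(g), list(p)
    span = 1
    while span < n:
        # combine each position with the one `span` places below it;
        # both updates read the previous stage's values
        G = [G[i] | (P[i] & G[i - span]) if i >= span else G[i] for i in range(n)]
        P = [P[i] & P[i - span] if i >= span else P[i] for i in range(n)]
        span *= 2
    return G
```

For n = 127 positions the loop takes 7 doubling steps (span = 1, 2, 4, ..., 64), matching the 7 prefix stages quoted for M=16.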
Advantages:
- Depth: O(log N) stages
- Parallelism: Maximum fan-out (all positions computed in parallel per stage)
- Latency: Minimal with pipelining
Alternative - Brent-Kung (not used):
- Depth: O(log N) but with more serial stages
- Fewer operations: Better for area-constrained designs
- Trade-off: Lower area but potentially higher latency vs Kogge-Stone
Pipelining Strategy:
- Input Register: Breaks long path from RCA outputs
- Internal Registers (PIPE_EVERY_2=1): After odd stages (1, 3, 5, ...)
- Balances pipeline depth with Fmax
- Even stages combinational reduces total latency
- Output Register (PIPE=1): Final timing closure
Verilator Note: Uses /* verilator lint_off UNOPTFLAT */ to suppress false circular dependency warnings on stage arrays.
Goal: Enable Intel Hyper-Flex registers for high-frequency operation on Agilex 5.
Key Changes:
Problem: Asynchronous resets block Quartus register retiming.
Solution: Remove all if (rst) conditions from data path registers.
Rationale:
- Valid signals control when data is meaningful
- Invalid data during first few cycles after power-up is acceptable
- FPGA streaming designs don't require deterministic reset state
- Result: Quartus can freely retime registers for Fmax optimization
Goal: Enable cross-module optimization and Hyper-Register insertion.
Implementation: Add (* keep_hierarchy = "no" *) to all RCA_PIPE instances.
Benefits:
- Quartus can retime registers across module boundaries
- Enables Hyper-Register insertion inside carry chains
- Allows global optimization of critical paths
- Balances register distribution across design
Trade-off: Slight increase in compile time, but significant Fmax improvement.
Strategy: Balance latency vs Fmax with selective pipeline insertion.
Configuration (PIPE=1, PIPE_EVERY_2=1):
- Input Register: Always present (breaks critical path from Level 1)
- Internal Registers: After odd stages only (1, 3, 5, 7, ...)
- Stage 0→1: combinational
- Stage 1→2: registered
- Stage 2→3: combinational
- Stage 3→4: registered
- Continues pattern...
- Output Register: Final timing closure
Rationale:
- Every 2 stages keeps pipeline depth manageable
- Odd-stage placement balances path lengths
- Reduces total latency compared to every-stage pipelining
Alternative Tested (PIPE_EVERY_1 - not implemented):
- Would add registers after every stage
- Latency: +3 to +4 cycles (stages + input + output)
- Marginal Fmax improvement not worth latency cost
File: fpga/setup.tcl
Conservative Retiming Strategy:
set_global_assignment -name ALLOW_REGISTER_RETIMING ON
set_global_assignment -name ALLOW_REGISTER_DUPLICATION OFF
set_global_assignment -name ALLOW_REGISTER_MERGING OFF
set_global_assignment -name REMOVE_DUPLICATE_REGISTERS OFF

Rationale:
- RETIMING ON: Enables Quartus to move registers for timing
- DUPLICATION OFF: Prevents excessive register duplication (saves ~1000 ALMs)
- MERGING OFF: Preserves register boundaries for predictable timing
- REMOVE_DUPLICATES OFF: Keeps necessary duplicates for fanout
Trade-off: ~10 MHz Fmax reduction vs aggressive settings, but 1000+ ALM savings.
Result: Area-efficient Hyper-Flex optimization with good Fmax.
Hypothesis: Removing resets from data path and flattening hierarchy will enable Hyper-Register insertion and improve Fmax.
Baseline (Before optimization):
- M=16: 749.63 MHz @ 9,809 ALMs
- Estimated Hyper-REG: ~5,000
- All registers had if (rst) conditions
- Module hierarchy preserved (keep_hierarchy="yes" by default)
Changes Applied:
- Removed resets from all data path registers (kept on valid signals)
- Added (* keep_hierarchy = "no" *) to 384 RCA instances
- Conservative Quartus settings (duplication OFF)
Results (After optimization):
| M | Fmax (MHz) | ALMs | Hyper-REG | vs Baseline | Notes |
|---|---|---|---|---|---|
| 4 | 599.88 | 11,484 | 16,892 | N/A | Highest Hyper-REG |
| 8 | 783.09 | 9,166 | 11,571 | +33 MHz | - |
| 16 | 825.08 | 8,814 | 8,543 | +75 MHz | 🏆 WINNER |
| 32 | 696.86 | 8,003 | 8,116 | +30 MHz | - |
| 64 | 670.24 | 7,459 | 7,935 | +20 MHz | Best ALM efficiency |
Key Findings:
- ✅ M=16 optimal: 825.08 MHz (+10% over baseline)
- ✅ Hyper-REG increased: 8,543 vs ~5,000 estimated baseline (+71%)
- ✅ ALM efficient: 8,814 vs 9,809 baseline (-995 ALMs, -10%)
- ✅ Best efficiency: 0.0936 MHz/ALM
Why M=16 Wins:
- 128 chunks → 384 RCA instances (good parallelism)
- 16-bit RCA carry chains (balanced path length)
- 7 Kogge-Stone prefix stages (optimal depth for 127 positions)
- 5-cycle prefix latency (11-cycle total)
Status: ✅ VALIDATED
Hypothesis: Adding input registers to prefix tree would break critical path from Level 1 RCA outputs.
Motivation:
- Critical path identified: RCA output register → prefix stage 0 logic → prefix stage 1 logic → register
- ~2 combinational stages before first prefix register
- Adding input register breaks this critical path
Updated cleveradder2048b.sv PREFIX_LATENCY calculation (line 68):
localparam PREFIX_INPUT_REGS = 1;
localparam PREFIX_LATENCY = PREFIX_INPUT_REGS + PREFIX_INTERNAL_REGS + PREFIX_OUTPUT_REGS;

Latency Impact (M=16):
- M=16: N=128, PREFIX_STAGES = ⌈log₂(127)⌉ = 7
- PREFIX_INPUT_REGS = 1 (always registered)
- PREFIX_INTERNAL_REGS = ⌊7/2⌋ = 3 (registers after stages 1, 3, 5)
- PREFIX_OUTPUT_REGS = 1 (PIPE=1)
- PREFIX_LATENCY = 1 + 3 + 1 = 5 cycles
- Total clever latency = 3 (L1) + 5 (L2) + 3 (L3) = 11 cycles
Note: The input register was part of the original design to break the critical path from Level 1 RCA outputs to prefix logic.
Expected Results:
- Fmax: 820-880 MHz target
- ALMs: +50-100 (508 new registers)
- Hyper-REG: +200-500
Actual Results: Integrated into final design, M=16 achieved 825.08 MHz.
Status: ✅ VALIDATED
Hypothesis: Register all inputs at module boundary to reduce routing delay to RCA instances.
Expected Benefits:
- Shorter routing: PAD → register → RCA
- Enable register duplication for different RCA fanout
- Allow Hyper-Flex insertion between register stages
Actual Results (❌ FAILED):
| M | Fmax WITHOUT | Fmax WITH | Δ Fmax | Δ ALMs |
|---|---|---|---|---|
| 4 | 699.3 | 668.0 | -31.3 MHz | +739 |
| 8 | 743.49 | 736.92 | -6.57 MHz | +467 |
| 16 | 749.63 | 672.95 | -76.68 MHz ❌ | +287 |
| 32 | 666.67 | 662.69 | -3.98 MHz | +229 |
Why It Failed:
- I/O Boundary Bottleneck: PAD → a_r path became new critical path
- "Domain Boundary Entry" constraints hard to optimize
- Retiming reports: "Add stages BEFORE a_r" (impossible at boundary)
- Double Buffering Overhead: PAD → a_r → RCA input register
- Two serial stages worse than one optimizable stage
- Routing Congestion: 4,098 new registers (a_r, b_r, c_in_r, valid_r)
- Massive fanout to 256+ RCA instances
- Poor placement, increased ALM usage
- M=16 Worst Hit: 256 RCA instances = highest fanout = worst degradation
Lesson Learned: Original design (direct port-to-RCA connection) is optimal. RCA_PIPE's internal input registers provide sufficient pipelining. Adding another layer exposed I/O constraints.
Status: ❌ REJECTED
Full Sweep (M ∈ {4, 8, 16, 32, 64}):
| M | N Chunks | Prefix Stages | Fmax (MHz) | ALMs | Hyper-REG | Latency | Efficiency (MHz/ALM) |
|---|---|---|---|---|---|---|---|
| 4 | 512 | 9 | 599.88 | 11,484 | 16,892 | 13 | 0.0522 |
| 8 | 256 | 8 | 783.09 | 9,166 | 11,571 | 12 | 0.0854 |
| 16 | 128 | 7 | 825.08 | 8,814 | 8,543 | 11 | 0.0936 🏆 |
| 32 | 64 | 6 | 696.86 | 8,003 | 8,116 | 11 | 0.0871 |
| 64 | 32 | 5 | 670.24 | 7,459 | 7,935 | 10 | 0.0899 |
Analysis:
M=4 (❌ Worst Fmax):
- 512 chunks → 1,536 RCA instances (massive)
- 9 prefix stages → 6-cycle prefix latency
- Highest ALM usage: 11,484
- Issue: Too many instances → routing congestion
- Lesson: Excessive parallelism hurts
M=8 (✅ Good, but not optimal):
- 256 chunks → 768 RCA instances
- 8 prefix stages → 6-cycle prefix latency
- 783.09 MHz (+30% over M=4)
- Decent efficiency: 0.0854 MHz/ALM
- Why not winner: M=16 has better balance
M=16 (🏆 WINNER):
- 128 chunks → 384 RCA instances (sweet spot)
- 7 prefix stages → 5-cycle prefix latency
- 825.08 MHz (+5.4% over M=8)
- Lowest latency: 11 cycles
- Best efficiency: 0.0936 MHz/ALM
- Perfect balance: Parallelism vs path length vs resources
M=32 (❌ Fmax drops):
- 64 chunks → 192 RCA instances
- 6 prefix stages → 5-cycle prefix latency
- 696.86 MHz (-15.5% vs M=16)
- Issue: Longer carry chains (32-bit RCA)
- Lesson: Too little parallelism hurts Fmax
M=64 (❌ Further Fmax drop):
- 32 chunks → 96 RCA instances
- 5 prefix stages → 4-cycle prefix latency
- 670.24 MHz (-18.8% vs M=16)
- Best ALM usage: 7,459
- Issue: 64-bit RCA carry chains too long
- Lesson: Minimal latency doesn't mean optimal Fmax
Key Insight: M=16 provides optimal balance between:
- Prefix tree depth (7 stages = low latency)
- RCA carry chain length (16-bit = fast enough)
- Number of instances (384 = good parallelism without congestion)
- Resource utilization (8,814 ALMs = efficient)
Selected Results (M ∈ {8, 9, 10, 11, 16, 19, 20, 21, 29, 30, 31, 39, 40, 41}):
| M | NUM_CHUNKS | Fmax (MHz) | ALMs | Latency (cycles) | Hyper-REG | Notes |
|---|---|---|---|---|---|---|
| 8 | 256 | 599.88 | 9,685 | 258 | 7,599 | Tied Fmax |
| 9 | 228 | 599.88 | 9,073 | 230 | 7,536 | 🏆 WINNER |
| 10 | 205 | 599.52 | 9,164 | 207 | 7,355 | Fmax drops |
| 11 | 187 | 592.42 | 8,847 | 189 | 7,276 | - |
| 16 | 128 | 553.10 | 8,112 | 130 | 6,875 | - |
| 19 | 108 | 507.87 | 8,035 | 110 | 6,821 | Lab3 golden |
| 20 | 103 | 488.28 | 8,230 | 105 | 8,819 | - |
| 21 | 98 | 518.94 | 7,977 | 100 | 6,609 | - |
Why M=9 Wins:
- Highest Fmax: 599.88 MHz (tied with M=8)
- Lowest ALM usage among tied Fmax: 9,073 vs 9,685 (M=8)
- Reasonable latency: 230 cycles
- Best Fmax-to-ALM ratio at peak Fmax
Observation: Naive adder Fmax saturates at ~600 MHz for all M ≤ 10.
- Likely hitting fundamental routing or I/O limits
- Smaller M doesn't improve Fmax further
- M=9 chosen for lowest resource usage at peak Fmax
Formulas:
- Naive: Latency = NUM_CHUNKS + 2 = ⌈W/M⌉ + 2
- Clever: Latency = 6 + PREFIX_LATENCY
  - PREFIX_LATENCY = 1 (input) + ⌊PREFIX_STAGES/2⌋ (internal) + 1 (output), with PREFIX_STAGES = ⌈log₂(W/M − 1)⌉
Concrete Examples (W=2048):
| M | Naive Chunks | Naive Latency | Clever Chunks | Prefix Stages | PREFIX_LAT | Clever Latency | Speedup |
|---|---|---|---|---|---|---|---|
| 8 | 256 | 258 | 256 | 8 | 6 | 12 | 21.5× |
| 16 | 128 | 130 | 128 | 7 | 5 | 11 | 11.8× |
| 32 | 64 | 66 | 64 | 6 | 5 | 11 | 6.0× |
| 64 | 32 | 34 | 32 | 5 | 4 | 10 | 3.4× |
Key Insight: Clever adder achieves logarithmic latency vs naive's linear latency.
At M=8:
- Naive: 258 cycles @ 599.88 MHz = 430 ns
- Clever: 12 cycles @ 783.09 MHz = 15.3 ns
- 21.5× faster in cycle count, 28.1× faster in actual time!
At M=16 (optimal Fmax):
- Naive: 130 cycles @ 553.10 MHz = 235 ns
- Clever: 11 cycles @ 825.08 MHz = 13.3 ns
- 11.8× faster in cycle count, 17.7× faster in actual time!
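The wall-clock figures follow directly from cycles / Fmax; a minimal sketch, with the cycle counts and frequencies taken from the result tables above:

```python
def total_time_ns(cycles: int, fmax_mhz: float) -> float:
    """Wall-clock latency in nanoseconds: cycles / Fmax (MHz) * 1000."""
    return cycles / fmax_mhz * 1e3

# M=8 figures from the sweep tables
naive_ns = total_time_ns(258, 599.88)    # ~430 ns
clever_ns = total_time_ns(12, 783.09)    # ~15.3 ns
speedup = naive_ns / clever_ns           # ~28x in wall-clock time
```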
source env.sh
# Naive adder (M=9)
./naive.sh
# Output: results/winner_naive.csv
# Clever adder (M=16)
./clever.sh
# Output: results/winner_clever.csv

# Clever adder sweep (M ∈ {4, 8, 16, 32, 64})
./sweep_clever.sh
# Output: results/clever_sweep.csv
# Naive adder sweep (M ∈ {8-11, 16, 19-21, 29-31, 39-41})
./sweep_naive.sh
# Output: results/naive_sweep.csv

# Makefile auto-sets ARCH_PARAM_NAME=M and ARCH_PARAM_VAL=$M
make fit DUT=cleveradder2048b W=2048 M=32
make extract DUT=cleveradder2048b W=2048 M=32
cat results/cleveradder2048b_M32.csv