This lab implements and optimizes two 2048-bit adders for Intel Agilex 5 FPGA:
- Naive Adder: Pipelined RCA with diagonal carry propagation
- Clever Adder: Type-2 "Monster Adder" with Kogge-Stone prefix tree
Target Device: Intel Agilex 5 A5EC065BB23AE4SR0 (222,400 ALMs)
- Best Configuration: M=9
- Performance: 599.88 MHz @ 9,073 ALMs
- Latency: 230 cycles (⌈2048/9⌉ + 2 = 228 + 2)
- Architecture: RCA_PIPE with diagonal carry propagation
- Best Configuration: M=16
- Performance: 825.08 MHz @ 8,814 ALMs
- Latency: 11 cycles
- Speedup vs Naive: +37.5% Fmax, -2.9% ALMs, +41.6% efficiency (0.0936 vs 0.0661 MHz/ALM)
- Architecture: Type-2 Monster Adder with Kogge-Stone prefix tree
| Adder | M | Fmax (MHz) | ALMs | Latency | Efficiency (MHz/ALM) | Hyper-REG |
|---|---|---|---|---|---|---|
| Naive | 9 | 599.88 | 9,073 | 230 | 0.0661 | 7,536 |
| Clever | 16 | 825.08 | 8,814 | 11 | 0.0936 | 8,543 |
| Improvement | - | +37.5% | -2.9% | -95.2% | +41.6% | +13.4% |
Implementation: Direct wrapper around rca_pipe.sv with optimized M parameter.
module naiveadder2048b #(
parameter W = 2048,
parameter M = 9 // Optimal: 599.88 MHz @ 9,073 ALMs
) (
input logic clk, rst, in_valid,
input logic [W-1:0] a, b,
input logic c_in,
output logic out_valid,
output logic [W-1:0] sum,
output logic c_out
);
rca_pipe #(.W(W), .M(M)) rca_inst (.*);
endmodule

Architecture: Pipelined RCA with diagonal carry propagation
- Breaks W-bit addition into NUM_CHUNKS = ⌈W/M⌉ parallel M-bit chunks
- Stage 0: Input registration
- Stage 1: Parallel chunk computation (chunk 0 uses actual c_in, others assume c_in=0)
- Stages 2 to NUM_CHUNKS: Ripple correction with diagonal OR rule
- Final stage: Output registration
Latency Formula: NUM_CHUNKS + 2 cycles where NUM_CHUNKS = ⌈W/M⌉
- Example (M=9, W=2048): ⌈2048/9⌉ + 2 = 228 + 2 = 230 cycles
- Example (M=16, W=2048): ⌈2048/16⌉ + 2 = 128 + 2 = 130 cycles
Note on M=9: Non-divisible M values are supported. RCA_PIPE handles partial chunks internally.
Optimization: M=9 provides best Fmax-to-ALM ratio among tested values.
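The latency formula can be sanity-checked with a short Python sketch (a behavioral model of the cycle count, not part of the RTL):

```python
import math

def naive_latency(W: int, M: int) -> int:
    """Cycles through the pipelined RCA: 1 input reg + 1 stage-0 reg
    + (NUM_CHUNKS - 1) ripple-correction stages + 1 output reg."""
    num_chunks = math.ceil(W / M)
    return num_chunks + 2
```

`naive_latency(2048, 9)` reproduces the 230-cycle figure and `naive_latency(2048, 16)` the 130-cycle figure quoted above.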
Implementation: Type-2 Monster Adder (Langhammer 2019) with three-level parallel carry computation.
module cleveradder2048b #(
parameter W = 2048,
parameter M = 16, // Optimal: 825.08 MHz @ 8,814 ALMs
parameter PIPE = 1, // Prefix tree output register
parameter PIPE_EVERY_2 = 1 // Internal prefix registers every 2 stages
) (
input logic clk, rst,
input logic [W-1:0] a, b,
input logic c_in,
input logic in_valid,
output logic [W-1:0] sum,
output logic c_out,
output logic out_valid
);

Architecture: Three-level design achieving O(log N) carry-computation complexity.
Level 1: Generate/Propagate Computation (RCA_PIPE)
├─ For each chunk i: compute g[i], p[i], sum_0[i]
├─ Dual RCA per chunk: RCA(c_in=0) → G, RCA(c_in=1) → P
└─ Latency: 3 cycles (RCA_PIPE with M=M)
Level 2: Prefix Network (Kogge-Stone)
├─ Input: g[N-2:0], p[N-2:0] (MSB excluded)
├─ Compute: c_vec[i] = carry INTO chunk i+1
├─ Algorithm: Kogge-Stone prefix tree (O(log N) depth)
├─ Pipelining: Input reg + Internal regs (every 2 stages) + Output reg
└─ Latency: PREFIX_LATENCY = 1 + ⌊stages/2⌋ + 1 cycles
Level 3: Final Sum Computation (RCA_PIPE)
├─ For each chunk i: sum_final[i] = sum_0[i] + carry[i]
├─ carry[0] = c_in (global), carry[i] = c_vec[i-1] (from prefix)
└─ Latency: 3 cycles (RCA_PIPE with M=M)
Total Latency: 3 + PREFIX_LATENCY + 3 = 6 + PREFIX_LATENCY cycles
Latency Examples (W=2048, PIPE=1, PIPE_EVERY_2=1):
| M | N Chunks | Prefix Stages | Input Reg | Internal Regs | Output Reg | PREFIX_LATENCY | Total Latency |
|---|---|---|---|---|---|---|---|
| 8 | 256 | 8 | 1 | 4 (after 1,3,5,7) | 1 | 6 | 12 cycles |
| 16 | 128 | 7 | 1 | 3 (after 1,3,5) | 1 | 5 | 11 cycles |
| 32 | 64 | 6 | 1 | 3 (after 1,3,5) | 1 | 5 | 11 cycles |
| 64 | 32 | 5 | 1 | 2 (after 1,3) | 1 | 4 | 10 cycles |
Key Constraint: M must divide W evenly (N = W/M must be integer) for prefix tree.
- Valid M for W=2048: {4, 8, 16, 32, 64, 128, 256, 512, 1024}
Purpose: Precompute per-chunk carry behavior for both possible carry-in values.
For each chunk i (0 to N-1), compute:
- g[i]: Generate signal (carry-out when c_in=0 to chunk) = c_out from RCA(a[i], b[i], c_in=0)
- p[i]: Propagate signal (carry-out when c_in=1 to chunk) = c_out from RCA(a[i], b[i], c_in=1)
- sum_0[i]: Partial sum assuming c_in=0 to chunk
Implementation (rtl/cleveradder2048b.sv:89-132):
// 128 chunks × 2 RCAs = 256 RCA_PIPE instances
for (i = 0; i < N; i++) begin
(* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_g (
.c_in(1'b0), .c_out(c_out_g), .sum(sum_g), ...
);
(* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_p (
.c_in(1'b1), .c_out(c_out_p), ...
);
assign g[i] = c_out_g;
assign p[i] = c_out_p;
assign sum_0[i*M +: M] = sum_g;
end

Timing:
- All RCA_PIPE outputs are registered after 3 cycles (M=M → NUM_CHUNKS=1 → latency=3)
- Valid signals track completion: valid_l1_all = &valid_l1[N-1:0]
- Global c_in delayed to c_in_l1 via shift register to align with Level 1 outputs
Key Insight: By computing g and p, Level 1 decouples sum computation from carry propagation, enabling Level 2 to compute all carries in parallel.
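This decoupling can be illustrated with a small Python reference model (illustrative only; names and the serial carry loop are a sketch, assuming M divides W, not the RTL):

```python
def level1_gp(a: int, b: int, W: int, M: int):
    """Per-chunk signals in the Level 1 sense:
    g[i] = chunk carry-out with c_in=0, p[i] = chunk carry-out with c_in=1,
    sum_0[i] = chunk sum assuming c_in=0. Assumes M divides W."""
    mask = (1 << M) - 1
    g, p, sum_0 = [], [], []
    for i in range(W // M):
        ai = (a >> (i * M)) & mask
        bi = (b >> (i * M)) & mask
        g.append((ai + bi) >> M)        # carry-out if carry-in were 0
        p.append((ai + bi + 1) >> M)    # carry-out if carry-in were 1
        sum_0.append((ai + bi) & mask)
    return g, p, sum_0

def add_via_gp(a: int, b: int, c_in: int, W: int, M: int):
    """Reconstruct the full sum from g/p/sum_0: the carry chain only
    needs g and p, never the chunk sums themselves."""
    g, p, sum_0 = level1_gp(a, b, W, M)
    mask = (1 << M) - 1
    c, out = c_in, 0
    for i in range(W // M):
        out |= ((sum_0[i] + c) & mask) << (i * M)
        c = p[i] if c else g[i]         # carry into the next chunk
    return out, c
```

The serial carry loop at the end is exactly the dependency chain that Level 2 replaces with a parallel prefix tree.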
Purpose: Compute carry into each chunk using parallel prefix algorithm.
Critical Bug Fix - Global Carry-In Injection (rtl/cleveradder2048b.sv:146-160):
The original implementation had a critical bug for small M values (especially M=4): the global carry-in (c_in) was not properly injected into the prefix tree computation.
Problem:
- Chunk 0's generate signal g[0] only reflects "does chunk 0 generate carry with c_in=0?"
- But if chunk 0 propagates (p[0]=1) AND we have global c_in=1, chunk 0 should generate carry-out
- Without this correction, carry-out from chunk 0 ignores c_in → wrong results
Solution:
logic [N-2:0] g_prefix_in;
always_comb begin
g_prefix_in = g[N-2:0];
// Inject global carry: Real_G[0] = G[0] | (P[0] & C_in_Level1)
g_prefix_in[0] = g[0] | (p[0] & c_in_l1);
end
prefix_tree prefix_inst (
.g(g_prefix_in), // Use modified signal
.p(p[N-2:0]),
...
);

Why this fixes the bug:
- If chunk 0 generates (g[0]=1): g_prefix_in[0] = 1 (correct)
- If chunk 0 propagates (p[0]=1) with c_in=1: g_prefix_in[0] = g[0] | (1 & 1) = 1 (correct)
- If chunk 0 kills (p[0]=0, g[0]=0): g_prefix_in[0] = 0 (correct)
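The three cases can be checked exhaustively with a few lines of Python (a sketch of the injection formula itself, not the RTL):

```python
def effective_g0(g0: int, p0: int, c_in: int) -> int:
    """Chunk 0's generate with the global carry-in folded in:
    Real_G[0] = G[0] | (P[0] & C_in)."""
    return g0 | (p0 & c_in)

# With the Level 1 convention (p = carry-out when c_in=1), p >= g always
# holds, and the effective generate must equal "carry-out for this c_in".
for g0 in (0, 1):
    for p0 in (0, 1):
        if p0 < g0:
            continue  # impossible under the p >= g convention
        for c_in in (0, 1):
            expected = p0 if c_in else g0
            assert effective_g0(g0, p0, c_in) == expected
```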
Prefix Tree Implementation (rtl/prefix_tree.sv):
Uses Kogge-Stone algorithm for O(log N) depth parallel carry computation.
Pipelining Structure:
- Stage 0 - Input Registers (lines 70-81): Always registered
  - Breaks critical path from Level 1 RCA outputs to prefix logic
  - Adds 1 cycle latency
- Stages 1 to log₂(N) - Kogge-Stone Computation (lines 88-133):
  - Each stage k processes span = 2^k
  - For each position i ≥ span:
    G[i] = G[i] | (P[i] & G[i-span])
    P[i] = P[i] & P[i-span]
  - PIPE_EVERY_2=1: Registers inserted after odd stages (1, 3, 5, 7, ...)
  - Even stages (0, 2, 4, 6, ...) are combinational
- Output Register (lines 145-159): When PIPE=1
  - Final registration for timing closure
  - Adds 1 cycle latency
Latency Calculation:
PREFIX_STAGES = $clog2(N)
PREFIX_INPUT_REGS = 1 // Always present
PREFIX_INTERNAL_REGS = (PIPE_EVERY_2 == 1) ? (PREFIX_STAGES / 2) : 0
PREFIX_OUTPUT_REGS = PIPE
PREFIX_LATENCY = PREFIX_INPUT_REGS + PREFIX_INTERNAL_REGS + PREFIX_OUTPUT_REGS

Example (M=16, N=128):
- PREFIX_STAGES = ⌈log₂(127)⌉ = 7
- PREFIX_INPUT_REGS = 1
- PREFIX_INTERNAL_REGS = ⌊7/2⌋ = 3 (after stages 1, 3, 5)
- PREFIX_OUTPUT_REGS = 1
- PREFIX_LATENCY = 1 + 3 + 1 = 5 cycles
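The same bookkeeping can be expressed as a small Python sketch (a model of the latency arithmetic above; names are illustrative):

```python
def prefix_latency(n_positions: int, pipe: int = 1, pipe_every_2: int = 1) -> int:
    """Registers through the prefix tree: 1 input reg, one internal reg
    per pair of stages when PIPE_EVERY_2=1, plus PIPE output regs."""
    stages = (n_positions - 1).bit_length()       # $clog2(n_positions)
    internal = stages // 2 if pipe_every_2 else 0
    return 1 + internal + pipe

def clever_latency(W: int, M: int) -> int:
    """Total latency: 3 (Level 1) + PREFIX_LATENCY + 3 (Level 3).
    The prefix tree sees W/M - 1 positions (MSB chunk excluded)."""
    return 3 + prefix_latency(W // M - 1) + 3
```

`clever_latency(2048, 16)` gives 11 and `clever_latency(2048, 8)` gives 12, matching the latency table above.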
MSB Handling: MSB chunk (N-1) excluded from prefix tree. Its carry-out is computed separately in final c_out formula.
Output: c_vec[i] = carry INTO chunk i+1 (for i = 0 to N-2)
Purpose: Combine Level 1 partial sums with Level 2 computed carries.
Pipeline Alignment (rtl/cleveradder2048b.sv:177-222):
Before Level 3 can compute, Level 1 outputs must be delayed to synchronize with prefix tree output:
// Delay sum_0, c_in_l1, g, p by PREFIX_LATENCY cycles
logic [W-1:0] sum_0_shift [PREFIX_LATENCY:0];
logic c_in_shift_l2 [PREFIX_LATENCY:0];
logic [N-1:0] g_shift [PREFIX_LATENCY:0];
logic [N-1:0] p_shift [PREFIX_LATENCY:0];
// Shift registers (no reset for Hyper-Flex)
for (d = 0; d < PREFIX_LATENCY; d++) begin
always_ff @(posedge clk) begin
sum_0_shift[d+1] <= sum_0_shift[d];
c_in_shift_l2[d+1] <= c_in_shift_l2[d];
g_shift[d+1] <= g_shift[d];
p_shift[d+1] <= p_shift[d];
end
end

Final Sum Computation (rtl/cleveradder2048b.sv:224-261):
For each chunk i (0 to N-1):
carry_in = (i == 0) ? c_in_l2 : c_vec[i-1];
(* keep_hierarchy = "no" *) rca_pipe #(.W(M), .M(M)) rca_final (
.a(sum_0_l2[i*M +: M]),
.b({M{1'b0}}), // Add zero
.c_in(carry_in),
.sum(sum_final[i*M +: M]),
...
);

- Chunk 0: Uses delayed global c_in (c_in_l2)
- Chunks 1+: Use prefix tree output (c_vec[i-1])
- Each RCA_PIPE adds 3 cycles latency
- Output: sum with out_valid = &valid_l3
Final Carry-Out (rtl/cleveradder2048b.sv:264-277):
c_out_formula = g_l2[N-1] | (p_l2[N-1] & c_vec[N-2]);
// Delay by 3 cycles to match Level 3 RCA latency
logic [RCA_LATENCY-1:0] c_out_shift;
always_ff @(posedge clk) begin
c_out_shift <= {c_out_shift[RCA_LATENCY-2:0], c_out_formula};
end
assign c_out = c_out_shift[RCA_LATENCY-1];

Formula Explanation:
- c_vec[N-2] = carry INTO MSB chunk (N-1)
- If MSB generates (g[N-1]=1): c_out = 1
- If MSB propagates (p[N-1]=1) and carry comes in: c_out = 1
- Otherwise: c_out = 0
Purpose: Pipelined ripple-carry adder with diagonal carry propagation.
Architecture:
- Input Registration (lines 100-105): Register a, b, c_in, in_valid
- Parallel Chunk Computation: All chunks compute simultaneously
- Chunk 0: Uses actual c_in
- Chunks 1+: Assume c_in=0
- Stage 0 Registration (lines 163-168): Register initial sums and carries
- Ripple Correction Stages (lines 177-201): One stage per chunk
- Stage i corrects chunk i using carry from stage i-1
- Diagonal OR Rule:
carry[i] = new_carry[i] | carry_internal[i]
- Output Registration (lines 208-212): Final sum, c_out, out_valid
Latency: NUM_CHUNKS + 2 cycles where NUM_CHUNKS = ⌈W/M⌉
- Input register: 1 cycle
- Stage 0 register: 1 cycle
- Ripple stages 1 to NUM_CHUNKS-1: (NUM_CHUNKS - 1) cycles
- Output register: 1 cycle
- Total: 1 + 1 + (NUM_CHUNKS - 1) + 1 = NUM_CHUNKS + 2
Key Feature - Diagonal OR Rule: When correcting chunk i by adding carry[i-1]:
- May generate a new carry: new_carry[i]
- Original carry from parallel computation: carry_internal[i]
- At most one can be 1 (adding a single-bit carry cannot overflow a chunk that already carried out)
- Safe to OR: carry[i] = new_carry[i] | carry_internal[i]
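A behavioral Python model of the correction pass (not cycle-accurate, and assuming M divides W, unlike the RTL which also handles partial chunks) shows why the OR is safe:

```python
def rca_pipe_model(a: int, b: int, c_in: int, W: int, M: int):
    """Behavioral model of the diagonal correction: chunks first add in
    parallel assuming c_in=0 (chunk 0 gets the real c_in), then each
    correction step adds the incoming carry and ORs any new carry with
    the stored internal carry. Assumes M divides W."""
    N = W // M
    mask = (1 << M) - 1
    sums, carries = [], []
    for i in range(N):
        ai = (a >> (i * M)) & mask
        bi = (b >> (i * M)) & mask
        t = ai + bi + (c_in if i == 0 else 0)
        sums.append(t & mask)
        carries.append(t >> M)                  # carry_internal[i]
    for i in range(1, N):                       # one correction per chunk
        t = sums[i] + carries[i - 1]
        new_carry = t >> M
        assert not (new_carry and carries[i])   # at most one can be 1
        sums[i] = t & mask
        carries[i] = new_carry | carries[i]     # diagonal OR rule
    out = sum(s << (i * M) for i, s in enumerate(sums))
    return out, carries[N - 1]
```

The sequential loop mirrors the pipeline diagonal: by the time stage i runs, carry[i-1] already includes the correction from stage i-1.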
Usage in Clever Adder:
- Always instantiated with W=M, M=M → NUM_CHUNKS=1
- This gives fixed 3-cycle latency: 1 (input) + 1 (stage 0) + 0 (no ripple stages) + 1 (output) = 3
- Used in Level 1 (256 G/P instances) and Level 3 (128 final sum instances)
Purpose: Parallel prefix network for carry computation using Kogge-Stone algorithm.
Parameters:
- N: Number of positions (typically W/M - 1 for clever adder)
- PIPE: Output registration (0=combinational, 1=registered)
- PIPE_EVERY_2: Internal pipeline registers (0=disabled, 1=every 2 stages)
Algorithm - Kogge-Stone:
For stage k = 0 to log₂(N)-1:
span = 2^k
For each position i ≥ span:
G[i] = G[i] | (P[i] & G[i-span]) // Group generate
P[i] = P[i] & P[i-span] // Group propagate
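The pseudocode above maps directly to a short Python sketch (an illustrative model, cross-checked against a serial ripple of the same g/p values):

```python
def kogge_stone_carries(g, p):
    """Kogge-Stone prefix: returns c[i] = carry out of position i
    (carry INTO position i+1) for the recurrence c[i] = g[i] | (p[i] & c[i-1])."""
    n = len(g)
    G, P = list(g), list(p)
    span = 1
    while span < n:
        # combine each position with the one `span` places below it;
        # both updates read the previous stage's values
        G = [G[i] | (P[i] & G[i - span]) if i >= span else G[i] for i in range(n)]
        P = [P[i] & P[i - span] if i >= span else P[i] for i in range(n)]
        span *= 2
    return G
```

For n = 127 positions the loop takes 7 doubling steps (span = 1, 2, 4, ..., 64), matching the 7 prefix stages quoted for M=16.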
Advantages:
- Depth: O(log N) stages
- Parallelism: Maximum fan-out (all positions computed in parallel per stage)
- Latency: Minimal with pipelining
Alternative - Brent-Kung (not used):
- Depth: O(log N) but with more serial stages
- Fewer operations: Better for area-constrained designs
- Trade-off: Lower area but potentially higher latency vs Kogge-Stone
Pipelining Strategy:
- Input Register: Breaks long path from RCA outputs
- Internal Registers (PIPE_EVERY_2=1): After odd stages (1, 3, 5, ...)
- Balances pipeline depth with Fmax
- Even stages combinational reduces total latency
- Output Register (PIPE=1): Final timing closure
Verilator Note: Uses /* verilator lint_off UNOPTFLAT */ to suppress false circular dependency warnings on stage arrays.
Goal: Enable Intel Hyper-Flex registers for high-frequency operation on Agilex 5.
Key Changes:
Problem: Asynchronous resets block Quartus register retiming.
Solution: Remove all if (rst) conditions from data path registers.
Rationale:
- Valid signals control when data is meaningful
- Invalid data during first few cycles after power-up is acceptable
- FPGA streaming designs don't require deterministic reset state
- Result: Quartus can freely retime registers for Fmax optimization
Goal: Enable cross-module optimization and Hyper-Register insertion.
Implementation: Add (* keep_hierarchy = "no" *) to all RCA_PIPE instances.
Benefits:
- Quartus can retime registers across module boundaries
- Enables Hyper-Register insertion inside carry chains
- Allows global optimization of critical paths
- Balances register distribution across design
Trade-off: Slight increase in compile time, but significant Fmax improvement.
Strategy: Balance latency vs Fmax with selective pipeline insertion.
Configuration (PIPE=1, PIPE_EVERY_2=1):
- Input Register: Always present (breaks critical path from Level 1)
- Internal Registers: After odd stages only (1, 3, 5, 7, ...)
- Stage 0→1: combinational
- Stage 1→2: registered
- Stage 2→3: combinational
- Stage 3→4: registered
- Continues pattern...
- Output Register: Final timing closure
Rationale:
- Every 2 stages keeps pipeline depth manageable
- Odd-stage placement balances path lengths
- Reduces total latency compared to every-stage pipelining
Alternative Tested (PIPE_EVERY_1 - not implemented):
- Would add registers after every stage
- Latency: +3 to +4 cycles (stages + input + output)
- Marginal Fmax improvement not worth latency cost
File: fpga/setup.tcl
Conservative Retiming Strategy:
set_global_assignment -name ALLOW_REGISTER_RETIMING ON
set_global_assignment -name ALLOW_REGISTER_DUPLICATION OFF
set_global_assignment -name ALLOW_REGISTER_MERGING OFF
set_global_assignment -name REMOVE_DUPLICATE_REGISTERS OFF

Rationale:
- RETIMING ON: Enables Quartus to move registers for timing
- DUPLICATION OFF: Prevents excessive register duplication (saves ~1000 ALMs)
- MERGING OFF: Preserves register boundaries for predictable timing
- REMOVE_DUPLICATES OFF: Keeps necessary duplicates for fanout
Trade-off: ~10 MHz Fmax reduction vs aggressive settings, but 1000+ ALM savings.
Result: Area-efficient Hyper-Flex optimization with good Fmax.
Hypothesis: Removing resets from data path and flattening hierarchy will enable Hyper-Register insertion and improve Fmax.
Baseline (Before optimization):
- M=16: 749.63 MHz @ 9,809 ALMs
- Estimated Hyper-REG: ~5,000
- All registers had if (rst) conditions
- Module hierarchy preserved (keep_hierarchy="yes" by default)
Changes Applied:
- Removed resets from all data path registers (kept on valid signals)
- Added (* keep_hierarchy = "no" *) to 384 RCA instances
- Conservative Quartus settings (duplication OFF)
Results (After optimization):
| M | Fmax (MHz) | ALMs | Hyper-REG | vs Baseline | Notes |
|---|---|---|---|---|---|
| 4 | 599.88 | 11,484 | 16,892 | N/A | Highest Hyper-REG |
| 8 | 783.09 | 9,166 | 11,571 | +33 MHz | - |
| 16 | 825.08 | 8,814 | 8,543 | +75 MHz | 🏆 WINNER |
| 32 | 696.86 | 8,003 | 8,116 | +30 MHz | - |
| 64 | 670.24 | 7,459 | 7,935 | +20 MHz | Best ALM efficiency |
Key Findings:
- ✅ M=16 optimal: 825.08 MHz (+10% over baseline)
- ✅ Hyper-REG increased: 8,543 vs ~5,000 estimated baseline (+71%)
- ✅ ALM efficient: 8,814 vs 9,809 baseline (-995 ALMs, -10%)
- ✅ Best efficiency: 0.0936 MHz/ALM
Why M=16 Wins:
- 128 chunks → 384 RCA instances (good parallelism)
- 16-bit RCA carry chains (balanced path length)
- 7 Kogge-Stone prefix stages (optimal depth for 127 positions)
- 5-cycle prefix latency (11-cycle total)
Status: ✅ VALIDATED
Hypothesis: Adding input registers to prefix tree would break critical path from Level 1 RCA outputs.
Motivation:
- Critical path identified: RCA output register → prefix stage 0 logic → prefix stage 1 logic → register
- ~2 combinational stages before first prefix register
- Adding input register breaks this critical path
Updated cleveradder2048b.sv PREFIX_LATENCY calculation (line 68):
localparam PREFIX_INPUT_REGS = 1;
localparam PREFIX_LATENCY = PREFIX_INPUT_REGS + PREFIX_INTERNAL_REGS + PREFIX_OUTPUT_REGS;

Latency Impact (M=16):
- M=16: N=128, PREFIX_STAGES = ⌈log₂(127)⌉ = 7
- PREFIX_INPUT_REGS = 1 (always registered)
- PREFIX_INTERNAL_REGS = ⌊7/2⌋ = 3 (registers after stages 1, 3, 5)
- PREFIX_OUTPUT_REGS = 1 (PIPE=1)
- PREFIX_LATENCY = 1 + 3 + 1 = 5 cycles
- Total clever latency = 3 (L1) + 5 (L2) + 3 (L3) = 11 cycles
Note: The input register was part of the original design to break the critical path from Level 1 RCA outputs to prefix logic.
Expected Results:
- Fmax: 820-880 MHz target
- ALMs: +50-100 (508 new registers)
- Hyper-REG: +200-500
Actual Results: Integrated into final design, M=16 achieved 825.08 MHz.
Status: ✅ VALIDATED
Hypothesis: Register all inputs at module boundary to reduce routing delay to RCA instances.
Expected Benefits:
- Shorter routing: PAD → register → RCA
- Enable register duplication for different RCA fanout
- Allow Hyper-Flex insertion between register stages
Actual Results (❌ FAILED):
| M | Fmax WITHOUT | Fmax WITH | Δ Fmax | Δ ALMs |
|---|---|---|---|---|
| 4 | 699.3 | 668.0 | -31.3 MHz | +739 |
| 8 | 743.49 | 736.92 | -6.57 MHz | +467 |
| 16 | 749.63 | 672.95 | -76.68 MHz ❌ | +287 |
| 32 | 666.67 | 662.69 | -3.98 MHz | +229 |
Why It Failed:
- I/O Boundary Bottleneck: PAD → a_r path became new critical path
- "Domain Boundary Entry" constraints hard to optimize
- Retiming reports: "Add stages BEFORE a_r" (impossible at boundary)
- Double Buffering Overhead: PAD → a_r → RCA input register
- Two serial stages worse than one optimizable stage
- Routing Congestion: 4,098 new registers (a_r, b_r, c_in_r, valid_r)
- Massive fanout to 256+ RCA instances
- Poor placement, increased ALM usage
- M=16 Worst Hit: 256 RCA instances = highest fanout = worst degradation
Lesson Learned: Original design (direct port-to-RCA connection) is optimal. RCA_PIPE's internal input registers provide sufficient pipelining. Adding another layer exposed I/O constraints.
Status: ❌ REJECTED
Full Sweep (M ∈ {4, 8, 16, 32, 64}):
| M | N Chunks | Prefix Stages | Fmax (MHz) | ALMs | Hyper-REG | Latency | Efficiency (MHz/ALM) |
|---|---|---|---|---|---|---|---|
| 4 | 512 | 9 | 599.88 | 11,484 | 16,892 | 13 | 0.0522 |
| 8 | 256 | 8 | 783.09 | 9,166 | 11,571 | 12 | 0.0854 |
| 16 | 128 | 7 | 825.08 | 8,814 | 8,543 | 11 | 0.0936 🏆 |
| 32 | 64 | 6 | 696.86 | 8,003 | 8,116 | 11 | 0.0871 |
| 64 | 32 | 5 | 670.24 | 7,459 | 7,935 | 10 | 0.0899 |
Analysis:
M=4 (❌ Worst Fmax):
- 512 chunks → 1,536 RCA instances (massive)
- 9 prefix stages → 6-cycle prefix latency
- Highest ALM usage: 11,484
- Issue: Too many instances → routing congestion
- Lesson: Excessive parallelism hurts
M=8 (✅ Good, but not optimal):
- 256 chunks → 768 RCA instances
- 8 prefix stages → 6-cycle prefix latency
- 783.09 MHz (+30% over M=4)
- Decent efficiency: 0.0854 MHz/ALM
- Why not winner: M=16 has better balance
M=16 (🏆 WINNER):
- 128 chunks → 384 RCA instances (sweet spot)
- 7 prefix stages → 5-cycle prefix latency
- 825.08 MHz (+5.4% over M=8)
- Lowest latency: 11 cycles
- Best efficiency: 0.0936 MHz/ALM
- Perfect balance: Parallelism vs path length vs resources
M=32 (❌ Fmax drops):
- 64 chunks → 192 RCA instances
- 6 prefix stages → 5-cycle prefix latency
- 696.86 MHz (-15.5% vs M=16)
- Issue: Longer carry chains (32-bit RCA)
- Lesson: Too little parallelism hurts Fmax
M=64 (❌ Further Fmax drop):
- 32 chunks → 96 RCA instances
- 5 prefix stages → 4-cycle prefix latency
- 670.24 MHz (-18.8% vs M=16)
- Best ALM usage: 7,459
- Issue: 64-bit RCA carry chains too long
- Lesson: Minimal latency doesn't mean optimal Fmax
Key Insight: M=16 provides optimal balance between:
- Prefix tree depth (7 stages = low latency)
- RCA carry chain length (16-bit = fast enough)
- Number of instances (384 = good parallelism without congestion)
- Resource utilization (8,814 ALMs = efficient)
Selected Results (M ∈ {8, 9, 10, 11, 16, 19, 20, 21, 29, 30, 31, 39, 40, 41}):
| M | NUM_CHUNKS | Fmax (MHz) | ALMs | Latency (cycles) | Hyper-REG | Notes |
|---|---|---|---|---|---|---|
| 8 | 256 | 599.88 | 9,685 | 258 | 7,599 | Tied Fmax |
| 9 | 228 | 599.88 | 9,073 | 230 | 7,536 | 🏆 WINNER |
| 10 | 205 | 599.52 | 9,164 | 207 | 7,355 | Fmax drops |
| 11 | 187 | 592.42 | 8,847 | 189 | 7,276 | - |
| 16 | 128 | 553.10 | 8,112 | 130 | 6,875 | - |
| 19 | 108 | 507.87 | 8,035 | 110 | 6,821 | Lab3 golden |
| 20 | 103 | 488.28 | 8,230 | 105 | 8,819 | - |
| 21 | 98 | 518.94 | 7,977 | 100 | 6,609 | - |
Why M=9 Wins:
- Highest Fmax: 599.88 MHz (tied with M=8)
- Lowest ALM usage among tied Fmax: 9,073 vs 9,685 (M=8)
- Reasonable latency: 230 cycles
- Best Fmax-to-ALM ratio at peak Fmax
Observation: Naive adder Fmax saturates at ~600 MHz for all M ≤ 10.
- Likely hitting fundamental routing or I/O limits
- Smaller M doesn't improve Fmax further
- M=9 chosen for lowest resource usage at peak Fmax
Formulas:
- Naive: Latency = NUM_CHUNKS + 2 = ⌈W/M⌉ + 2
- Clever: Latency = 6 + PREFIX_LATENCY
  - PREFIX_LATENCY = 1 (input) + ⌊PREFIX_STAGES/2⌋ (internal) + 1 (output), with PREFIX_STAGES = ⌈log₂(W/M − 1)⌉
Concrete Examples (W=2048):
| M | Naive Chunks | Naive Latency | Clever Chunks | Prefix Stages | PREFIX_LAT | Clever Latency | Speedup |
|---|---|---|---|---|---|---|---|
| 8 | 256 | 258 | 256 | 8 | 6 | 12 | 21.5× |
| 16 | 128 | 130 | 128 | 7 | 5 | 11 | 11.8× |
| 32 | 64 | 66 | 64 | 6 | 5 | 11 | 6.0× |
| 64 | 32 | 34 | 32 | 5 | 4 | 10 | 3.4× |
Key Insight: Clever adder achieves logarithmic latency vs naive's linear latency.
At M=8:
- Naive: 258 cycles @ 599.88 MHz = 430 ns
- Clever: 12 cycles @ 783.09 MHz = 15.3 ns
- 21.5× faster in cycle count, 28.1× faster in actual time!
At M=16 (optimal Fmax):
- Naive: 130 cycles @ 553.10 MHz = 235 ns
- Clever: 11 cycles @ 825.08 MHz = 13.3 ns
- 11.8× faster in cycle count, 17.7× faster in actual time!
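The wall-clock figures follow directly from cycles / Fmax; a minimal sketch, with the cycle counts and frequencies taken from the result tables above:

```python
def total_time_ns(cycles: int, fmax_mhz: float) -> float:
    """Wall-clock latency in nanoseconds: cycles / Fmax (MHz) * 1000."""
    return cycles / fmax_mhz * 1e3

# M=8 figures from the sweep tables
naive_ns = total_time_ns(258, 599.88)    # ~430 ns
clever_ns = total_time_ns(12, 783.09)    # ~15.3 ns
speedup = naive_ns / clever_ns           # ~28x in wall-clock time
```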
source env.sh
# Naive adder (M=9)
./naive.sh
# Output: results/winner_naive.csv
# Clever adder (M=16)
./clever.sh
# Output: results/winner_clever.csv

# Clever adder sweep (M ∈ {4, 8, 16, 32, 64})
./sweep_clever.sh
# Output: results/clever_sweep.csv
# Naive adder sweep (M ∈ {8-11, 16, 19-21, 29-31, 39-41})
./sweep_naive.sh
# Output: results/naive_sweep.csv

# Makefile auto-sets ARCH_PARAM_NAME=M and ARCH_PARAM_VAL=$M
make fit DUT=cleveradder2048b W=2048 M=32
make extract DUT=cleveradder2048b W=2048 M=32
cat results/cleveradder2048b_M32.csv