Ideal 2-Taken & 2-Fetch #736

Open

Yakkhini wants to merge 4 commits into xs-dev from 2-taken-ideal
Conversation

@Yakkhini (Collaborator) commented Jan 26, 2026

Change-Id: I39d54a0621d139cc00a156b02a6d7d888d9b15f0

Summary by CodeRabbit

  • New Features

    • Optional two-fetch mode: perform up to two predictions per cycle when enabled.
    • New configuration: toggle two-fetch and set max fetch bytes per cycle.
  • Refactor

    • Fetch and prediction flow reworked to support in-cycle two-fetch extension and updated fetch-loop/termination semantics.
  • Chores

    • Example presets updated to enable two-fetch and reduce queue sizes.

@coderabbitai

coderabbitai bot commented Jan 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

DecoupledBPUWithBTB gains a two-fetch mode and a max-fetch-bytes config; tick() can produce up to two predictions per cycle. The fetch path now supports an in-cycle 2-fetch extension, keeps the next FSQ entry buffered, and changes the fetch-stop semantics.

Changes

  • Prediction core — src/cpu/pred/btb/decoupled_bpred.cc
    Adds batching in tick() to run up to 2 prediction iterations per cycle (controlled by new flags); moves the per-iteration request/finalize/clear/dry-run/FSQ-enqueue logic into a loop; introduces tempNumOverrideBubbles.
  • Predictor interface / flags — src/cpu/pred/btb/decoupled_bpred.hh, src/cpu/pred/BranchPredictor.py
    Adds configuration members enable2Fetch and maxFetchBytesPerCycle, the enableTwoTaken flag, and FTQ navigation/accessor helpers (getTarget, ftqHasNext, ftqNext, is2FetchEnabled, getMaxFetchBytesPerCycle).
  • Fetch logic — src/cpu/o3/fetch.cc, src/cpu/o3/fetch.hh
    Adds a 2-fetch extension path: conditionally performs do_2fetch, keeps the next FSQ entry buffered, changes the lookupAndUpdateNextPC return value to reflect stop-this-cycle semantics, forces I-cache reissue on buffer-edge PCs, and propagates stopFetchThisCycle.
  • Configs — configs/example/kmhv3.py, configs/example/idealkmhv3.py
    Enables cpu.branchPred.enable2Fetch = True for DecoupledBPUWithBTB and reduces FTQ/FSQ sizes from 256 to 64 in the example configs.

Sequence Diagram(s)

sequenceDiagram
    participant Core as DecoupledBPUWithBTB
    participant Predictor as BTB/TAGE
    participant FTQ as FTQ
    participant FSQ as FSQ
    participant Fetch as FetchUnit

    Note over Core: Up to N = (enable2Fetch ? 2 : 1) iterations per tick
    loop per-prediction
        Core->>Predictor: requestPrediction()
        Predictor-->>Core: provisionalPrediction
        Core->>Predictor: generateFinalPrediction()
        Predictor-->>Core: finalPrediction (+overrideBubbles)
        Core->>FTQ: ftqNext / getTarget()
        Core->>FSQ: enqueue(finalPrediction)
        alt do 2-fetch
            Core->>Fetch: signal do_2fetch (keep next FSQ entry buffered)
            Fetch-->>Core: continue fetch without consuming FTQ entry
        else single fetch
            Core->>Fetch: consume FTQ target / mark used
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • jensen-yan
  • tastynoob
  • CJ362ff

Poem

🐰 I hop in pairs where branches bend,

I stash the next and gently send,
Two little peeks in one swift beat,
Buffer tucked and pipeline neat,
— coderabbit 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately reflects the main changes: implementation of 2-taken and 2-fetch capabilities in the branch predictor and fetch logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Line 140: The local variable tempNumOverrideBubbles is declared but never
used; remove the unused declaration of tempNumOverrideBubbles from the scope
where it's defined (the function containing the line "unsigned
tempNumOverrideBubbles = 0;") to clean up dead code and avoid compiler warnings,
ensuring no other references to tempNumOverrideBubbles remain in the function.
🧹 Nitpick comments (2)
src/cpu/pred/btb/decoupled_bpred.hh (1)

162-162: Consider making enableTwoTaken configurable via params.

This feature flag is hardcoded to true with no way to disable it through simulation parameters. Other similar options like fetchStreamQueueSize, predictWidth, and resolveBlockThreshold are initialized from the Params object in the constructor. For flexibility during experimentation and for consistency with the existing pattern, consider adding this to the DecoupledBPUWithBTBParams.

src/cpu/pred/btb/decoupled_bpred.cc (1)

142-184: Add documentation clarifying the multi-prediction loop behavior.

The loop logic for producing up to 2 predictions per tick is non-trivial. Consider adding a comment explaining:

  1. The intended behavior when both predictions succeed (no bubbles)
  2. What happens when the first prediction generates override bubbles (second iteration essentially becomes a no-op)
  3. The interaction between numOverrideBubbles being set inside the loop but decremented only once outside

This will help future maintainers understand the "ideal 2-taken" semantics.

📝 Suggested documentation
     int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
-    unsigned tempNumOverrideBubbles = 0;

+    // "Ideal 2-taken" mode: attempt up to 2 predictions per tick.
+    // - If the first prediction generates override bubbles, the second iteration
+    //   will be blocked by validateFSQEnqueue() until bubbles are consumed.
+    // - If no bubbles, both predictions can be enqueued in a single tick.
+    // - Bubble decrement happens once per tick after the loop.
     while (predsRemainsToBeMade > 0) {



⚠️ Potential issue | 🟡 Minor

Remove unused variable tempNumOverrideBubbles.

This variable is declared but never used anywhere in the function. It appears to be leftover code from development.

🧹 Proposed fix
     int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
-    unsigned tempNumOverrideBubbles = 0;
 
     while (predsRemainsToBeMade > 0) {

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.1706 📉 -0.0021 (-0.10%)

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added the perf label Jan 27, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 0d664b4
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.64 20.27 +1.86 🟢

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2052 📈 +0.0325 (+1.50%)

✅ Difftest smoke test passed!

@Yakkhini Yakkhini added perf and removed perf labels Jan 29, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: ebf7386
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.64 20.27 +1.86 🟢

@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2052 📈 +0.0325 (+1.50%)

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 9406572
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.65 20.27 +1.92 🟢

@Yakkhini Yakkhini added perf and removed perf labels Jan 30, 2026
@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Performance test triggered: spec06-0.8c

@github-actions

🚀 Performance test triggered: spec06-0.8c


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cpu/pred/btb/decoupled_bpred.cc`:
- Around line 135-179: When blockPredictionPending is true the loop clears it
and continues so a second prediction can be issued in the same tick; change that
behavior to exit the prediction loop immediately to honor the "prioritize
resolve update" backpressure. In the while loop that uses predsRemainsToBeMade
(and enableTwoTaken), update the branch that currently does DPRINTF(Override) /
dbpBtbStats.predictionBlockedForUpdate++ / blockPredictionPending = false so
that after logging and incrementing the stat you break out of the loop (or set
predsRemainsToBeMade = 0) instead of clearing blockPredictionPending; this
ensures requestNewPrediction(), requestNewPrediction() / bpuState transitions,
generateFinalPredAndCreateBubbles(), validateFSQEnqueue(), and
processNewPrediction() cannot run for subsequent iterations when a block is
pending.

Comment on lines +135 to +179

    int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
    unsigned tempNumOverrideBubbles = 0;

    while (predsRemainsToBeMade > 0) {
        // 1. Request new prediction if FSQ not full and we are idle
        if (bpuState == BpuState::IDLE && !targetQueueFull()) {
            if (blockPredictionPending) {
                DPRINTF(Override, "Prediction blocked to prioritize resolve update\n");
                dbpBtbStats.predictionBlockedForUpdate++;
                blockPredictionPending = false;
            } else {
                requestNewPrediction();
                bpuState = BpuState::PREDICTOR_DONE;
            }
        }

        // 2. Handle pending prediction if available
        if (bpuState == BpuState::PREDICTOR_DONE) {
            DPRINTF(Override, "Generating final prediction for PC %#lx\n", s0PC);
            numOverrideBubbles = generateFinalPredAndCreateBubbles();
            bpuState = BpuState::PREDICTION_OUTSTANDING;

            // Clear each predictor's output
            for (int i = 0; i < numStages; i++) {
                predsOfEachStage[i].btbEntries.clear();
            }
        }

        if (bpuState == BpuState::PREDICTION_OUTSTANDING && numOverrideBubbles > 0) {
            tage->dryRunCycle(s0PC);
        }

        // check if:
        // 1. FSQ has space
        // 2. there's no bubble
        // 3. PREDICTION_OUTSTANDING
        if (validateFSQEnqueue()) {
            // Create new FSQ entry with the current prediction
            processNewPrediction();

            DPRINTF(Override, "FSQ entry enqueued, prediction state reset\n");
            bpuState = BpuState::IDLE;
        }

        predsRemainsToBeMade--;


⚠️ Potential issue | 🟠 Major

Block prediction should skip all iterations when resolve backpressure is set.
In two‑taken mode, blockPredictionPending is cleared in the first iteration, so the second iteration can still issue a prediction in the same tick. That undermines the intended “prioritize resolve update” block.

Proposed fix (exit the loop when prediction is blocked)
         if (bpuState == BpuState::IDLE && !targetQueueFull()) {
             if (blockPredictionPending) {
                 DPRINTF(Override, "Prediction blocked to prioritize resolve update\n");
                 dbpBtbStats.predictionBlockedForUpdate++;
                 blockPredictionPending = false;
+                break; // block predictions for the rest of this tick
             } else {
                 requestNewPrediction();
                 bpuState = BpuState::PREDICTOR_DONE;
             }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    int predsRemainsToBeMade = enableTwoTaken ? 2 : 1;
    unsigned tempNumOverrideBubbles = 0;

    while (predsRemainsToBeMade > 0) {
        // 1. Request new prediction if FSQ not full and we are idle
        if (bpuState == BpuState::IDLE && !targetQueueFull()) {
            if (blockPredictionPending) {
                DPRINTF(Override, "Prediction blocked to prioritize resolve update\n");
                dbpBtbStats.predictionBlockedForUpdate++;
                blockPredictionPending = false;
                break; // block predictions for the rest of this tick
            } else {
                requestNewPrediction();
                bpuState = BpuState::PREDICTOR_DONE;
            }
        }

        // 2. Handle pending prediction if available
        if (bpuState == BpuState::PREDICTOR_DONE) {
            DPRINTF(Override, "Generating final prediction for PC %#lx\n", s0PC);
            numOverrideBubbles = generateFinalPredAndCreateBubbles();
            bpuState = BpuState::PREDICTION_OUTSTANDING;

            // Clear each predictor's output
            for (int i = 0; i < numStages; i++) {
                predsOfEachStage[i].btbEntries.clear();
            }
        }

        if (bpuState == BpuState::PREDICTION_OUTSTANDING && numOverrideBubbles > 0) {
            tage->dryRunCycle(s0PC);
        }

        // check if:
        // 1. FSQ has space
        // 2. there's no bubble
        // 3. PREDICTION_OUTSTANDING
        if (validateFSQEnqueue()) {
            // Create new FSQ entry with the current prediction
            processNewPrediction();

            DPRINTF(Override, "FSQ entry enqueued, prediction state reset\n");
            bpuState = BpuState::IDLE;
        }

        predsRemainsToBeMade--;
🧰 Tools
🪛 Cppcheck (2.19.0)

[error] 138-138: Shifting 64-bit value by 64 bits is undefined behaviour

(shiftTooManyBits)


@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2397 📈 +0.0670 (+3.08%)

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 0542507
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.56 20.37 +0.93 🟢

@Yakkhini Yakkhini changed the title from "cpu-o3: ideal 2-taken implementation" to "Ideal 2-Taken & 2-Fetch" on Feb 4, 2026
@Yakkhini Yakkhini added perf and removed perf labels Feb 5, 2026
@github-actions

github-actions bot commented Feb 5, 2026

🚀 Performance test triggered: spec06-0.8c

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 0721720
workflow: On-Demand SPEC Test (Tier 1.5)

Ideal BTB Performance

Overall Score

PR Master Diff(%)
Score 20.94 20.37 +2.84 🟢

Yakkhini and others added 4 commits February 26, 2026 10:31
Change-Id: I39d54a0621d139cc00a156b02a6d7d888d9b15f0
Co-authored-by: Xu Boran <xuboran@bosc.ac.cn>
Change-Id: Ic203a9694c093034744986309e796b9d66d6f826
Change-Id: I3f0f686000b610c3bf842e62c9b9e91e7188a028
Change-Id: Ida30dab747aaf11d36b38637476b1c137e589942

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
configs/example/idealkmhv3.py (1)

81-83: Consider pinning maxFetchBytesPerCycle explicitly in this ideal profile.

This profile currently depends on the global default (64). Setting it here avoids silent benchmark drift if the default changes later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@configs/example/idealkmhv3.py` around lines 81 - 83, The branch predictor
settings rely on the global default for maxFetchBytesPerCycle and should pin it
explicitly to prevent silent drift; in the block where cpu.branchPred.ftq_size,
cpu.branchPred.fsq_size, and cpu.branchPred.enable2Fetch are set, add an
explicit assignment cpu.branchPred.maxFetchBytesPerCycle = 64 so the ideal
profile fixes the fetch width regardless of global defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpu/o3/fetch.cc`:
- Around line 821-836: The 2-fetch extension can be re-enabled multiple times in
a single cycle; add a per-cycle one-shot guard so do_2fetch is only allowed once
per cycle: introduce a boolean flag (e.g., twoFetchTakenThisCycle) that is
cleared at the start of the fetch cycle and checked before setting do_2fetch in
the block that currently tests predict_taken && dbpbtb->is2FetchEnabled() &&
dbpbtb->ftqHasNext(), then set the flag when you set do_2fetch = true; apply the
same guard update to the other symmetric 2-fetch site around the loop (the block
referenced at lines ~2004-2025) so the extension cannot chain beyond one extra
FTQ entry per cycle and still obey the per-cycle byte budget.

---

Nitpick comments:
In `@configs/example/idealkmhv3.py`:
- Around line 81-83: The branch predictor settings rely on the global default
for maxFetchBytesPerCycle and should pin it explicitly to prevent silent drift;
in the block where cpu.branchPred.ftq_size, cpu.branchPred.fsq_size, and
cpu.branchPred.enable2Fetch are set, add an explicit assignment
cpu.branchPred.maxFetchBytesPerCycle = 64 so the ideal profile fixes the fetch
width regardless of global defaults.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0721720 and 51bfe68.

📒 Files selected for processing (7)
  • configs/example/idealkmhv3.py
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/pred/BranchPredictor.py
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/cpu/o3/fetch.hh
  • configs/example/kmhv3.py
  • src/cpu/pred/btb/decoupled_bpred.hh

Comment on lines +821 to +836

    if (predict_taken && dbpbtb->is2FetchEnabled() && dbpbtb->ftqHasNext()) {
        const Addr target_pc = stream.predBranchInfo.target;
        const auto &next_stream = dbpbtb->ftqNext();
        const Addr span = next_stream.predEndPC - stream.startPC;
        const unsigned max_bytes = dbpbtb->getMaxFetchBytesPerCycle();
        const bool target_in_buffer =
            target_pc >= fetchBuffer[tid].startPC &&
            target_pc + 4 <= fetchBuffer[tid].startPC + fetchBufferSize;

        if (target_pc == next_stream.startPC && span <= max_bytes && target_in_buffer) {
            do_2fetch = true;
            DPRINTF(DecoupleBP,
                    "2Fetch: extend in-cycle to next FSQ entry (cur [%#lx, %#lx), next [%#lx, %#lx), span=%lu, "
                    "max=%u)\n",
                    stream.startPC, stream.predEndPC, next_stream.startPC, next_stream.predEndPC, span, max_bytes);
        }
    }


⚠️ Potential issue | 🟠 Major

Limit 2-fetch extension to once per cycle.

At line 821, do_2fetch can be re-enabled on each taken boundary, and line 2025 keeps iterating in the same cycle. This can chain fetch across more than two FSQ entries in one cycle, breaking the intended "2-fetch" cap and potentially exceeding the per-cycle byte budget.

💡 Proposed fix (add a per-cycle one-shot guard)
-bool Fetch::lookupAndUpdateNextPC(const DynInstPtr &inst, PCStateBase &next_pc)
+bool Fetch::lookupAndUpdateNextPC(const DynInstPtr &inst, PCStateBase &next_pc,
+                                  bool &used2FetchExtension)

-        if (predict_taken && dbpbtb->is2FetchEnabled() && dbpbtb->ftqHasNext()) {
+        if (!used2FetchExtension &&
+            predict_taken && dbpbtb->is2FetchEnabled() && dbpbtb->ftqHasNext()) {
             ...
             if (target_pc == next_stream.startPC && span <= max_bytes && target_in_buffer) {
                 do_2fetch = true;
+                used2FetchExtension = true;
                 ...
             }
         }
-bool Fetch::processSingleInstruction(ThreadID tid, PCStateBase &pc,
-                                     StaticInstPtr &curMacroop)
+bool Fetch::processSingleInstruction(ThreadID tid, PCStateBase &pc,
+                                     StaticInstPtr &curMacroop,
+                                     bool &used2FetchExtension)
{
    ...
-    stopFetchThisCycle = lookupAndUpdateNextPC(instruction, *next_pc);
+    stopFetchThisCycle =
+        lookupAndUpdateNextPC(instruction, *next_pc, used2FetchExtension);
 void Fetch::performInstructionFetch(ThreadID tid)
 {
     ...
     bool stopFetchThisCycle = false;
+    bool used2FetchExtension = false;
     ...
-            stopFetchThisCycle = processSingleInstruction(tid, pc_state, curMacroop);
+            stopFetchThisCycle =
+                processSingleInstruction(tid, pc_state, curMacroop,
+                                         used2FetchExtension);

Also applies to: 2004-2025


@github-actions

🚀 Coremark Smoke Test Results

Branch IPC Change
Base (xs-dev) 2.1727 -
This PR 2.2133 📈 +0.0406 (+1.87%)

✅ Difftest smoke test passed!
