
cpu-o3: add 2Fetch features #700

Open
jensen-yan wants to merge 3 commits into xs-dev from 2fetch-align

Conversation

@jensen-yan
Collaborator

@jensen-yan jensen-yan commented Jan 12, 2026

  • Introduced support for 2Fetch features in the branch predictor and fetch stage.
  • Enhanced the FetchTargetQueue to manage the next FTQ entry for 2Fetch functionality.
  • Now, if two fetch blocks fall within the same 64-byte fetch buffer, both can be fetched in the same cycle.
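As a rough illustration (the helper name and constant are hypothetical, not code from this PR), the 64-byte condition amounts to a simple span check:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not code from this PR: two fetch blocks can be
// fetched in one cycle only when the second block's end stays within
// maxFetchBytesPerCycle (64 here) of the first block's start.
constexpr std::uint64_t kMaxFetchBytesPerCycle = 64;

bool fitsInOneFetchCycle(std::uint64_t firstStartPC, std::uint64_t secondEndPC)
{
    return secondEndPC - firstStartPC <= kMaxFetchBytesPerCycle;
}
```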

Change-Id: I3b112cc844c485d81cea1f4ed0ff221cb37d2782

Summary by CodeRabbit

  • New Features

    • Added 2Taken/2Fetch support with new configuration knobs and an example config enabling 2Fetch.
    • Added performance statistics to track 2Fetch attempts and outcomes.
  • Refactor

    • Centralized per-cycle fetch-stop decision to simplify fetch control.
    • In decoupled frontend mode, fetch logic now forces a new fetch entry and invalidates the current buffer when the PC moves outside the buffer.


@coderabbitai

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

Introduces a "2Fetch" feature for the decoupled branch predictor (lookahead into the next FTQ entry) and centralizes per-cycle fetch stopping into shouldStopFetchThisCycle(bool predictedBranch). Adds FTQ peek/advance APIs, config knobs, statistics, and conditional fetch-extension logic in the decoupled predictor.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Fetch loop control: src/cpu/o3/fetch.cc, src/cpu/o3/fetch.hh, src/cpu/o3/fetch.md | Replaced multi-condition loop stop predicates with shouldStopFetchThisCycle(bool predictedBranch) and added a decoupled-frontend invalidation path forcing a new FTQ entry when the architectural PC escapes the fetch buffer. |
| Branch predictor config: src/cpu/pred/BranchPredictor.py | Added enable2Taken, enable2Fetch, and maxFetchBytesPerCycle parameters to DecoupledBPUWithBTB. |
| 2Fetch core logic: src/cpu/pred/btb/decoupled_bpred.cc, src/cpu/pred/btb/decoupled_bpred.hh | Added enable2Fetch, has1Fetched, and helpers canExtendToNextFTQ() / extendToNextFTQ(); integrated 2Fetch checks into the decoupledPredict flow to optionally extend to the next FTQ entry instead of consuming the current one. |
| FTQ lookahead API: src/cpu/pred/btb/fetch_target_queue.cc, src/cpu/pred/btb/fetch_target_queue.hh | Added hasNext(), peekNext(), and advance() to inspect/advance to the next FTQ entry without dequeuing. |
| 2Fetch stats: src/cpu/pred/btb/decoupled_bpred_stats.cc, src/cpu/pred/btb/decoupled_bpred.hh | Added six new statistics: fetch2Attempts, fetch2Successes, fetch2SpanTooLarge, fetch2NoNextFTQ, fetch2FirstNotTaken, fetch2FirstNotAtStart. |
| Config example: configs/example/kmhv3.py | Enabled cpu.branchPred.enable2Fetch = True in the example config. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Fetch
    participant BPU as DecoupledBPUWithBTB
    participant FTQQ as FetchTargetQueue
    participant FTQ
    Fetch->>BPU: request prediction / decoupledPredict
    BPU->>FTQQ: check hasNext() / peekNext()
    alt canExtendToNextFTQ == true
        BPU->>FTQQ: extendToNextFTQ()
        FTQQ->>FTQ: advance supply state (no dequeue)
        BPU->>Fetch: return extended PC (2Fetch success)
    else
        BPU->>FTQ: consume current FTQ entry, process completion
        BPU->>Fetch: return normal PC (no extension)
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

perf

Suggested reviewers

  • Yakkhini
  • CJ362ff

Poem

🐰 I peek ahead with whiskers twitching bright,
A second fetch to catch the flight,
FTQ whispers what comes next in line,
Two strides forward, one step fine,
Hopping onward, predictor delight! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'cpu-o3: add 2Fetch features' directly and clearly summarizes the main objective of the changeset, which is to introduce 2Fetch functionality across multiple components of the CPU's O3 fetch pipeline. |


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f95836f and 1222442.

📒 Files selected for processing (2)
  • configs/example/kmhv3.py
  • src/cpu/o3/fetch.cc
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cpu/o3/fetch.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (1)
configs/example/kmhv3.py (1)

111-112: LGTM!

The configuration correctly enables the 2Fetch feature within the DecoupledBPUWithBTB branch predictor section. The enable2Fetch parameter is properly defined in BranchPredictor.py and the placement alongside other predictor settings is appropriate.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/cpu/pred/btb/fetch_target_queue.cc (1)

287-293: Minor: Redundant map lookup in peekNext().

The method performs the same ftq.find(fetchDemandTargetId + 1) lookup twice - once in the assert(hasNext()) call and once explicitly. Consider caching the result or using a single lookup.

♻️ Suggested optimization
 const FtqEntry&
 FetchTargetQueue::peekNext() const
 {
-    assert(hasNext());
-    auto next_it = ftq.find(fetchDemandTargetId + 1);
+    auto next_it = ftq.find(fetchDemandTargetId + 1);
+    assert(next_it != ftq.end() && "peekNext() requires hasNext()");
     return next_it->second;
 }
src/cpu/pred/btb/decoupled_bpred.cc (1)

167-167: Formatting issue: Misaligned closing brace.

The closing brace appears to have lost its indentation from surrounding code.

♻️ Suggested fix
         for (int i = 0; i < numStages; i++) {
             predsOfEachStage[i].btbEntries.clear();
-}
+        }
src/cpu/pred/btb/decoupled_bpred.hh (1)

1013-1027: Consider encapsulation: has1Fetched is internal state that should be private.

The has1Fetched flag is internal state tracking whether the first FTQ has been processed in a 2Fetch cycle. It's currently in the public section along with enable2Fetch and maxFetchBytesPerCycle, but unlike those configuration parameters, has1Fetched is runtime state that shouldn't be accessed externally.

♻️ Suggested reorganization

Move has1Fetched to the private section near other internal state variables, while keeping enable2Fetch and maxFetchBytesPerCycle as public configuration parameters:

+  private:
+    // 2Fetch internal state
+    bool has1Fetched{false};  ///< Whether first FTQ has been fetched in current 2Fetch cycle
+
   public:
     // NEW: 2Fetch support variables
     /**
      * @brief Enable 2fetch capability
      */
     bool enable2Fetch{true};

-    /**
-     * @brief Whether fetched first FTQ
-     */
-    bool has1Fetched{false};
-
     /**
      * @brief Maximum fetch bytes per cycle for 2fetch
      */
     unsigned maxFetchBytesPerCycle{64};
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39cd6af and f95836f.

📒 Files selected for processing (9)
  • src/cpu/o3/fetch.cc
  • src/cpu/o3/fetch.hh
  • src/cpu/o3/fetch.md
  • src/cpu/pred/BranchPredictor.py
  • src/cpu/pred/btb/decoupled_bpred.cc
  • src/cpu/pred/btb/decoupled_bpred.hh
  • src/cpu/pred/btb/decoupled_bpred_stats.cc
  • src/cpu/pred/btb/fetch_target_queue.cc
  • src/cpu/pred/btb/fetch_target_queue.hh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-12T02:49:08.514Z
Learnt from: jensen-yan
Repo: OpenXiangShan/GEM5 PR: 686
File: src/cpu/pred/btb/btb_mgsc.cc:226-230
Timestamp: 2026-01-12T02:49:08.514Z
Learning: In src/cpu/pred/btb/btb_mgsc.cc, the calculateScaledPercsum function returns the unscaled percsum (weight scaling disabled) to align with RTL implementation. The RTL currently does not have a weight table, so setting weight to 1 (equivalent to disabling the weight table) maintains hardware/software consistency with minimal performance impact.

Applied to files:

  • src/cpu/pred/btb/decoupled_bpred.cc
🧬 Code graph analysis (2)
src/cpu/o3/fetch.cc (1)
src/cpu/o3/fetch.hh (2)
  • predictedBranch (522-531)
  • predictedBranch (522-522)
src/cpu/pred/btb/decoupled_bpred.cc (2)
src/cpu/pred/btb/decoupled_bpred.hh (7)
  • pc (1041-1041)
  • target_to_fetch (854-854)
  • current_pc (1037-1037)
  • fetchTargetQueue (510-513)
  • fetchTargetQueue (515-518)
  • fetchTargetQueue (520-523)
  • fetchTargetQueue (524-527)
src/cpu/pred/btb/test/decoupled_bpred.cc (2)
  • processFetchTargetCompletion (343-361)
  • processFetchTargetCompletion (344-344)
🪛 Ruff (0.14.10)
src/cpu/pred/BranchPredictor.py

1185-1185: Param may be undefined, or defined from star imports

(F405)


1186-1186: Param may be undefined, or defined from star imports

(F405)


1187-1187: Param may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Quick Build, Unit Tests & Smoke Test
  • GitHub Check: perf_test / XS-GEM5 - Run performance test (spec06-0.3c)
🔇 Additional comments (12)
src/cpu/o3/fetch.cc (1)

2071-2091: LGTM - Clean refactoring of the fetch loop termination condition.

The replacement of the inline condition with shouldStopFetchThisCycle(predictedBranch) properly centralizes the fetch stopping logic. The method in fetch.hh correctly handles:

  1. waitForVsetvl gating
  2. Decoupled frontend usedUpFetchTargets check
  3. predictedBranch for traditional frontends

This sets a good foundation for future 2Fetch extensions.

src/cpu/pred/BranchPredictor.py (1)

1184-1187: LGTM - New 2Fetch configuration parameters are well-defined.

The new parameters follow gem5 conventions:

  • Features disabled by default (False) ensures backward compatibility
  • maxFetchBytesPerCycle=64 is a sensible default matching typical fetch buffer sizes
  • The static analysis warnings about Param are false positives since it's imported via the star import from m5.params at line 29
src/cpu/o3/fetch.hh (1)

517-531: LGTM - Well-documented centralized fetch stop decision.

The shouldStopFetchThisCycle() method cleanly encapsulates the per-cycle fetch termination logic with:

  • Clear documentation explaining the three cases
  • Proper priority ordering (waitForVsetvl → decoupled frontend → predictedBranch)
  • Inline implementation appropriate for this lightweight decision function

This consolidation improves maintainability and prepares the codebase for 2Fetch extensions.
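The three-way priority described above can be sketched as follows (an illustrative standalone function; the real gem5 method reads member state rather than taking these flags as parameters):

```cpp
#include <cassert>

// Illustrative sketch of the three-way stop decision; the real gem5
// method reads member state instead of taking these flags as parameters.
bool shouldStopFetchThisCycleSketch(bool waitForVsetvl,
                                    bool isDecoupledFrontend,
                                    bool usedUpFetchTargets,
                                    bool predictedBranch)
{
    if (waitForVsetvl)
        return true;                 // 1. vsetvl gating has top priority
    if (isDecoupledFrontend)
        return usedUpFetchTargets;   // 2. decoupled frontend: stop when FTQ runs dry
    return predictedBranch;          // 3. traditional frontend: stop on predicted branch
}
```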

src/cpu/o3/fetch.md (2)

344-353: Documentation accurately reflects the code changes.

The updated documentation correctly shows:

  • The new shouldStopFetchThisCycle(bool predictedBranch) signature
  • The three-way logic (waitForVsetvl → decoupled frontend → predictedBranch)
  • Consistent with the implementation in fetch.hh

This keeps the documentation in sync with the refactored code.


584-586: Code example updated to match implementation.

The fetch loop condition example now correctly uses !shouldStopFetchThisCycle(predictedBranch), maintaining consistency between documentation and actual code.

src/cpu/pred/btb/fetch_target_queue.hh (1)

160-181: Well-designed 2Fetch lookahead API.

The new methods provide a clean interface for 2Fetch traversal:

  • hasNext() / peekNext() / advance() follow a standard lookahead pattern
  • Returning const FtqEntry& from peekNext() prevents unintended modifications
  • peekNext() properly asserts the precondition via assert(hasNext()), ensuring safe use
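The lookahead pattern these bullets describe can be sketched like this (FtqEntry and the map layout are simplified stand-ins, not the real fetch_target_queue types):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Minimal sketch of the hasNext()/peekNext()/advance() lookahead pattern.
// FtqEntry and the map layout are simplified stand-ins for the real types.
struct FtqEntry { std::uint64_t startPC; std::uint64_t endPC; };

struct MiniFtq {
    std::map<std::uint64_t, FtqEntry> ftq;   // keyed by fetch target id
    std::uint64_t fetchDemandTargetId = 0;

    bool hasNext() const {
        return ftq.find(fetchDemandTargetId + 1) != ftq.end();
    }
    const FtqEntry &peekNext() const {       // inspect without dequeuing
        auto it = ftq.find(fetchDemandTargetId + 1);
        assert(it != ftq.end() && "peekNext() requires hasNext()");
        return it->second;
    }
    void advance() { ++fetchDemandTargetId; }
};
```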
src/cpu/pred/btb/decoupled_bpred_stats.cc (1)

508-514: LGTM! Statistics for 2Fetch feature are well-structured.

The new statistics cover all decision paths in the 2Fetch logic: attempts, successes, and specific rejection reasons. This provides good observability for tuning and debugging the feature.

src/cpu/pred/btb/fetch_target_queue.cc (1)

301-319: LGTM! advance() correctly updates supply state for 2Fetch.

The method properly updates supplyFetchTargetState to point to the current demand target (which has already been incremented by finishCurrentFetchTarget()). The fallback to invalidate supply state when the entry doesn't exist is appropriate.

src/cpu/pred/btb/decoupled_bpred.cc (2)

1401-1446: LGTM! canExtendToNextFTQ has well-structured validation.

The method correctly validates all preconditions for 2Fetch extension:

  1. First block must end with a taken branch (to jump within the fetch buffer)
  2. Next FTQ entry must be available
  3. Current PC must match next FTQ's start (ensuring continuity)
  4. Total span must fit within maxFetchBytesPerCycle

The early exit checks are ordered efficiently and each rejection path records a specific statistic for observability.
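The four checks can be sketched in isolation as follows (FtqView is a hypothetical flattened view of the two FTQ entries involved, not the PR's actual types):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the four validation steps listed above; FtqView is a
// hypothetical flattened view of the two FTQ entries involved.
struct FtqView {
    bool firstEndsTaken;          // 1. first block ends with a taken branch
    bool hasNextEntry;            // 2. a next FTQ entry exists
    std::uint64_t currentPC;      // 3. must equal nextStartPC for continuity
    std::uint64_t nextStartPC;
    std::uint64_t firstStartPC;   // 4. span = nextEndPC - firstStartPC
    std::uint64_t nextEndPC;
};

bool canExtendToNextFTQSketch(const FtqView &v, std::uint64_t maxFetchBytesPerCycle)
{
    if (!v.firstEndsTaken) return false;             // fetch2FirstNotTaken
    if (!v.hasNextEntry) return false;               // fetch2NoNextFTQ
    if (v.currentPC != v.nextStartPC) return false;  // continuity broken
    return v.nextEndPC - v.firstStartPC <= maxFetchBytesPerCycle;  // else fetch2SpanTooLarge
}
```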


1456-1477: LGTM! extendToNextFTQ correctly sets up for processing the second FTQ entry.

The method properly:

  1. Advances to the next FTQ entry via fetchTargetQueue.advance()
  2. Resets the instruction counter for the new FTQ entry
  3. Updates the PC to the start of the new FTQ entry
  4. Records the successful 2Fetch metric
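Those steps can be bundled into a single helper, sketched here with MiniState as a hypothetical stand-in for the predictor fields the review mentions:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the steps above; MiniState is a hypothetical stand-in
// for the predictor fields the review mentions.
struct MiniState {
    std::uint64_t demandTargetId = 0;  // advanced by fetchTargetQueue.advance()
    unsigned fetchedInsts = 7;         // per-entry instruction counter
    std::uint64_t pc = 0x123;          // current fetch PC
};

void extendToNextFTQSketch(MiniState &s, std::uint64_t nextStartPC)
{
    ++s.demandTargetId;   // 1. advance to the next FTQ entry
    s.fetchedInsts = 0;   // 2. reset the instruction counter
    s.pc = nextStartPC;   // 3. redirect PC to the new entry's start
}
```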
src/cpu/pred/btb/decoupled_bpred.hh (2)

383-389: LGTM! Statistics declarations are well-documented.

The statistics cover all decision paths in the 2Fetch logic with clear documentation comments explaining each counter's purpose.


1029-1041: LGTM! Method declarations are appropriate.

The method signatures correctly use const references where modification isn't needed and non-const reference for the PC that gets updated.

Comment on lines +443 to 457
    // Check if 2fetch is enabled, not fetched first FTQ yet, and if we can extend to the next FTQ
    // NEW: 2Fetch extension check - before processing completion
    dbpBtbStats.fetch2Attempts++;
    if (enable2Fetch && !has1Fetched && canExtendToNextFTQ(pc, target_to_fetch)) {
        DPRINTF(DecoupleBP, "2Fetch: extending to next FTQ in same cycle\n");
        has1Fetched = true;
        processFetchTargetCompletion(target_to_fetch);
        extendToNextFTQ(pc);
        // first fetchBlock is always taken, do not run out of FTQ now
        return std::make_pair(true, false);
    }

    processFetchTargetCompletion(target_to_fetch);
    has1Fetched = false; // reset 2fetch flag
}

⚠️ Potential issue | 🟡 Minor

fetch2Attempts is incremented even when 2Fetch is disabled or already used.

The counter fetch2Attempts is incremented unconditionally at line 445, before checking enable2Fetch and !has1Fetched. This means:

  1. Attempts are counted even when enable2Fetch is false
  2. The second FTQ completion in a 2Fetch cycle (when has1Fetched=true) also counts as an attempt

Consider moving the increment inside the condition, or after the early-exit checks in canExtendToNextFTQ.

🐛 Suggested fix
     if (run_out_of_this_entry) {
         // Check if 2fetch is enabled, not fetched first FTQ yet, and if we can extend to the next FTQ
         // NEW: 2Fetch extension check - before processing completion
-        dbpBtbStats.fetch2Attempts++;
-        if (enable2Fetch && !has1Fetched && canExtendToNextFTQ(pc, target_to_fetch)) {
+        if (enable2Fetch && !has1Fetched) {
+            dbpBtbStats.fetch2Attempts++;
+            if (canExtendToNextFTQ(pc, target_to_fetch)) {
             DPRINTF(DecoupleBP, "2Fetch: extending to next FTQ in same cycle\n");
             has1Fetched = true;
             processFetchTargetCompletion(target_to_fetch);
             extendToNextFTQ(pc);
             // first fetchBlock is always taken, do not run out of FTQ now
             return std::make_pair(true, false);
+            }
         }

         processFetchTargetCompletion(target_to_fetch);

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1691 | - |
| This PR | 2.1691 | ➡️ 0.0000 (0.00%) |

✅ Difftest smoke test passed!

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f95836f92e


Comment on lines +1435 to +1439
Addr span = next_ftq.endPC - current_ftq.startPC;
if (span > maxFetchBytesPerCycle) {
    DPRINTF(DecoupleBP, "2Fetch rejected: span %d exceeds %d bytes\n",
            span, maxFetchBytesPerCycle);
    dbpBtbStats.fetch2SpanTooLarge++;


P2: Gate 2fetch extension by actual fetch-buffer span

The 2fetch gate only checks next_ftq.endPC - current_ftq.startPC <= maxFetchBytesPerCycle, but it doesn’t guarantee that the next FTQ entry actually lies within the current fetch buffer. If maxFetchBytesPerCycle is configured larger than fetchBufferSize (or the next entry crosses the buffer), extendToNextFTQ returns (true, false) and leaves usedUpFetchTargets false; checkMemoryNeeds() will then stall because the PC is outside the buffer, yet needNewFTQEntry() won’t issue a new cache request, so fetch can hang. Consider clamping the span check to the fetch-buffer size or invalidating the buffer when extension spans beyond it.
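One possible direction for the suggested fix, sketched here purely as an illustration (names are hypothetical): clamp the span limit to the fetch buffer size before the comparison, so an extension can never reach bytes outside the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative only: clamp the 2Fetch span limit so an extension can
// never reach bytes outside the current fetch buffer, even when
// maxFetchBytesPerCycle is configured larger than fetchBufferSize.
std::uint64_t effective2FetchLimit(std::uint64_t maxFetchBytesPerCycle,
                                   std::uint64_t fetchBufferSize)
{
    return std::min(maxFetchBytesPerCycle, fetchBufferSize);
}
```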


- Added handling in the fetch stage to invalidate the fetch buffer and force a new FTQ entry when the architectural fetch PC exceeds the buffered FTQ window in decoupled-frontend mode. This prevents potential deadlocks during ICache stalls.

Change-Id: I38a6003e136a51c115f33358a62138dc987336da
Change-Id: Iaa4a853508ae80e412a8b3bdd41dea26d9eb8c29
@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1691 | - |
| This PR | 2.1691 | ➡️ 0.0000 (0.00%) |

✅ Difftest smoke test passed!

@github-actions

🚀 Coremark Smoke Test Results

| Branch | IPC | Change |
| --- | --- | --- |
| Base (xs-dev) | 2.1691 | - |
| This PR | 2.1617 | 📉 -0.0074 (-0.34%) |

✅ Difftest smoke test passed!

@XiangShanRobot

[Generated by GEM5 Performance Robot]
commit: 1222442
workflow: gem5 Align BTB Performance Test(0.3c)

Align BTB Performance

Overall Score

| | PR | Master | Diff(%) |
| --- | --- | --- | --- |
| Score | 17.54 | 17.52 | +0.12 🟢 |

[Generated by GEM5 Performance Robot]
commit: 1222442
workflow: gem5 Align BTB Performance Test(0.3c)

Align BTB Performance

Overall Score

| | PR | Previous Commit | Diff(%) |
| --- | --- | --- | --- |
| Score | 17.54 | 17.52 | +0.12 🟢 |
