@luohuan19

…imization

This pass performs dependency-preserving reordering of statements within reorderable segments to reduce peak live cross-pipe dependencies (max 8 events), intended to run before InsertSyncPass to avoid hardware event_id exhaustion.

Key features:

  • Kahn-based topological scheduling with resource constraints
  • Multiple heuristic strategies to avoid greedy dead-ends
  • Graceful fallback when strict limits cannot be satisfied
  • Comprehensive Python bindings and test suite
  • Integration into PassManager's XPlatform optimization strategy

@gemini-code-assist
Contributor

Summary of Changes

Hello @luohuan19, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new compiler pass designed to optimize the execution order of operations within the Intermediate Representation (IR). By intelligently reordering statements, the pass minimizes the concurrent usage of critical hardware resources, specifically cross-pipe event IDs. This proactive scheduling ensures that subsequent synchronization passes can operate effectively without encountering resource limitations, thereby enhancing the overall efficiency and reliability of the generated code for heterogeneous architectures.

Highlights

  • New Out-of-Order Scheduler Pass: Introduced a new OutOfOrderSchedulerPass to reorder statements within reorderable segments (e.g., AssignStmt/EvalStmt) to reduce peak live cross-pipe dependencies.
  • Hardware Resource Optimization: The pass aims to keep the peak number of 'live' cross-pipe dependency edges at or below a limit of 8, preventing hardware event_id exhaustion, especially before the InsertSyncPass.
  • Advanced Scheduling Logic: Employs Kahn-based topological scheduling with resource constraints, incorporating multiple heuristic strategies to avoid greedy dead-ends and includes graceful fallback mechanisms when strict limits cannot be met.
  • Integration and Test Coverage: The new pass is integrated into the PassManager's XPlatform optimization strategy, complete with comprehensive Python bindings and a robust test suite covering various dependency types and edge cases.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the OutOfOrderSchedulerPass, a new optimization pass designed to reorder statements within reorderable segments to reduce peak live cross-pipe dependencies. This is crucial for avoiding hardware event ID exhaustion, especially when running before InsertSyncPass. The implementation utilizes a Kahn-based topological scheduling algorithm with multiple heuristic strategies and a graceful fallback mechanism. The changes include adding the C++ implementation, integrating it into the Python bindings and PassManager's XPlatform strategy, and providing a comprehensive test suite. The new pass significantly enhances the IR optimization capabilities by intelligently managing resource constraints.

if (s == "SCALAR" || s == "S") return PipeType::S;
if (s == "FIX") return PipeType::FIX;
if (s == "ALL") return PipeType::ALL;
return PipeType::S;
Contributor

medium

The PipeTypeFromString function defaults to PipeType::S if the input string does not match any known pipe type. While this provides a fallback, it might silently mask issues where an unexpected or malformed pipe_type string is provided. Consider adding a LOG_WARN or INTERNAL_CHECK for unhandled cases to aid debugging and prevent silent misconfigurations.

Comment on lines +91 to +106
return PipeType::S;
}
}
Contributor

medium

Similar to PipeTypeFromString, the get_call_pipe lambda (and consequently GetStmtPipe) defaults to PipeType::S if call->HasKwarg("pipe_type") is true but the GetKwarg call fails or the string cannot be parsed. This could lead to incorrect scheduling decisions without explicit notification. Adding a LOG_WARN in the catch block would be beneficial for identifying such scenarios.


# Verify order is preserved (both statements write to same variable)
assert optimized_func is not None
assert isinstance(optimized_func.body, ir.SeqStmts)
Contributor

medium

In test_out_of_order_scheduler_waw_dependency, the assertion assert isinstance(optimized_func.body, ir.SeqStmts) is quite general. To thoroughly verify WAW dependency preservation, it would be more robust to explicitly check that the statement defining _tile_a_v1 appears before the statement defining tile_a_v2 in the optimized_func.body.stmts list. This ensures the scheduler maintains the correct write order.
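A minimal sketch of the stricter check suggested above. The helpers `index_of_def` and `assert_defined_before`, the `.var.name` attribute layout, and the stand-in variable names are assumptions for illustration; the real `ir` statement accessors may differ.

```python
from types import SimpleNamespace


def index_of_def(stmts, name):
    """Return the position of the first statement that defines `name`.

    Assumes (hypothetically) that each assignment exposes `.var.name`.
    """
    for i, stmt in enumerate(stmts):
        var = getattr(stmt, "var", None)
        if var is not None and var.name == name:
            return i
    raise AssertionError(f"no statement defines {name!r}")


def assert_defined_before(stmts, first, second):
    """Fail unless `first` is defined strictly before `second`."""
    assert index_of_def(stmts, first) < index_of_def(stmts, second), (
        f"{first} must be written before {second} (WAW order)"
    )


# Stand-in statements; a real test would pass optimized_func.body.stmts.
stmts = [
    SimpleNamespace(var=SimpleNamespace(name="tile_a_v1")),
    SimpleNamespace(var=SimpleNamespace(name="tile_a_v2")),
]
assert_defined_before(stmts, "tile_a_v1", "tile_a_v2")
```

Checking positions rather than only the container type makes the test fail if the scheduler ever swaps the two writes.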

Comment on lines 895 to 898
# We don't enforce strict ordering here since the pass may do best-effort reordering
# The key is that it doesn't crash and returns a valid function
pass

Contributor

medium

In test_out_of_order_scheduler_exceeds_event_limit, the comment mentions that the implementation should log a warning when the 8-event limit cannot be satisfied. It would be good to add an assertion here to verify that this warning is indeed logged (e.g., by capturing logs or checking for a specific log message), ensuring the fallback behavior is correctly communicated.

@luohuan19
Author

luohuan19 commented Jan 27, 2026

OutOfOrderSchedulerPass

Overview

OutOfOrderSchedulerPass reschedules reorderable statements to optimize cross-pipe dependencies while keeping peak event pressure ≤ 8 per pipeline pair.

Goal: Under dependency constraints, reorder statements to minimize peak pressure of cross-pipe synchronization events.

Core Concepts

Pipeline Types

Different computational units: M (CUBE), V (VECTOR), S (SCALAR), MTE1/2/3 (transfers), FIX, ALL.

Cross-Pipe Dependencies

When a statement on pipeline B depends on a statement on pipeline A (A ≠ B), synchronization via events is needed:

  • Producer (A) issues set_event
  • Consumer (B) waits on wait_event

Live Events

An event is "live" from its set_event until the matching wait_event. Resource constraint: at most 8 live events per pipeline pair.

Reorderable Statements

This pass runs on each SeqStmts node and may reorder its direct children under dependency constraints.

  • Compute-like (typical reorder candidates): AssignStmt, EvalStmt
  • Control-flow / terminator nodes (kept stable in relative order): IfStmt, ForStmt, ReturnStmt, YieldStmt

Phase 1: Control Flow Node Support (CF-aware Analysis)

Phase 1 Overview

Phase 1 extends the scheduler to treat control flow nodes (IfStmt, ForStmt) as immovable black-box composite nodes in the dependency graph. This enables compute statements to be reordered across control flow boundaries when data dependencies allow.

Key Innovation: Instead of cutting statement streams into isolated segments separated by CF barriers, Phase 1 analyzes dependencies at the parent statement level (SeqStmts), allowing better reordering opportunities.

Design Principle: Black-Box CF Nodes

  • Immovable: Control flow nodes cannot change relative order (if A comes before B, A must stay before B)
  • Black-box: Statement-level analysis uses StmtEffect to conservatively summarize CF node reads/writes
  • Permeable: Compute statements can cross CF boundaries if data dependencies permit

StmtEffect: Conservative Side-Effect Summary

Each statement (including CF nodes) is analyzed for side effects:

struct StmtEffect {
  std::set<MemRefPtr> reads;                  // MemRefs read by statement
  std::set<MemRefPtr> writes;                 // MemRefs written by statement
  bool has_unknown_side_effect = false;       // Conservative flag
};

Analysis rules by statement type:

  • AssignStmt: writes = var, reads = value's MemRefs
  • EvalStmt: reads = expr's MemRefs, has_unknown_side_effect = true (conservative)
  • IfStmt: Union of condition reads + both branch effects
  • ForStmt: Union of bounds reads + body reads/writes (loop-carried)
  • SeqStmts/OpStmts: Fold effects from all children
  • Return/Yield: has_unknown_side_effect = true (terminators)

Conservative union for branching: When IfStmt or ForStmt can execute different code paths, we conservatively take the union of all possible effects.
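The conservative union can be sketched in Python as follows. This is an illustration only: the `merge` and `if_effect` helpers are hypothetical names, and plain strings stand in for MemRefPtr values.

```python
from dataclasses import dataclass, field


@dataclass
class StmtEffect:
    """Python mirror of the StmtEffect struct; strings stand in for MemRefs."""

    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    has_unknown_side_effect: bool = False

    def merge(self, other):
        """Return the conservative union of two effects."""
        return StmtEffect(
            reads=self.reads | other.reads,
            writes=self.writes | other.writes,
            has_unknown_side_effect=(
                self.has_unknown_side_effect or other.has_unknown_side_effect
            ),
        )


def if_effect(cond_reads, then_effect, else_effect):
    """Summarize an IfStmt: condition reads plus both branch effects."""
    return StmtEffect(reads=set(cond_reads)).merge(then_effect).merge(else_effect)


then_e = StmtEffect(reads={"tile_a"}, writes={"tile_b"})
else_e = StmtEffect(writes={"tile_c"})
summary = if_effect({"cond"}, then_e, else_e)
```

Because only one branch runs at a time, the union may report dependencies that never occur together at runtime; that is the price of treating the node as a black box.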

Scheduling with CF Nodes

Dependency graph construction (CF-aware mode):

  1. Analyze all statements (compute + CF) using MemRef reads/writes
  2. Build RAW/WAW/WAR edges using StmtEffect results
  3. Unknown side effects create barriers (edges to all subsequent statements)

Ordering constraints (Stability Chain):
After dependency edges are built, add "CF stability chain":

  • Identify all CF-like nodes in original order: c0, c1, ..., ck
  • Add edges: c0 → c1 → ... → ck
  • This preserves CF relative order while allowing compute to cross them
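The stability chain amounts to linking consecutive CF-like nodes. A small sketch, assuming statements are identified by their original index and edges are stored as (src, dst) pairs (both representation choices are assumptions, as is the function name):

```python
def add_cf_stability_chain(is_cf, edges):
    """Add edges c0 -> c1 -> ... -> ck between CF-like nodes in original order.

    is_cf: list of bools, one per statement in original order.
    edges: set of (src, dst) index pairs, extended in place and returned.
    """
    cf_nodes = [i for i, flag in enumerate(is_cf) if flag]
    for prev, nxt in zip(cf_nodes, cf_nodes[1:]):
        edges.add((prev, nxt))
    return edges


# Statements 1 and 3 are control flow; 0, 2 and 4 are compute.
edges = add_cf_stability_chain([False, True, False, True, False], set())
```

Only CF nodes are chained, so compute statements 0, 2 and 4 remain free to move across them when no data dependency forbids it.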

Candidate selection (Strategy A):
During Kahn scheduling, prioritize schedulable compute statements over CF nodes:

  1. First pass: Schedule compute statements with best score
  2. If none available: Fall back to CF nodes
  3. This prevents CF nodes from "blocking" compute optimization

Example: Cross-CF Reordering

Input:

tile_a = load(input_a)      # Depends on input_a

if cond:                    # CF node (reads cond)
    tile_b = add(tile_a, tile_a)

tile_c = load(input_c)      # Independent of If (reads different input)
result = add(tile_c, tile_c) # Depends on tile_c

Dependency Analysis:

  • tile_a depends on input_a (RAW)
  • if depends on cond (reads cond expression)
  • tile_c depends on input_c (RAW, independent of tile_a)
  • result depends on tile_c (RAW)
  • result does NOT depend on if statement (different MemRefs)

Phase 1 Optimized Order:

tile_a = load(input_a)      # tile_a first (needed by if body)
tile_c = load(input_c)      # tile_c can cross if (no dependency)
result = add(tile_c, tile_c) # result follows tile_c

if cond:                    # If node preserved (CF stability chain)
    tile_b = add(tile_a, tile_a)

Benefit: tile_c load and result computation moved before if → better pipelining and cross-pipe synchronization.

MemRefCollector

Collects memory references from expressions to build dependency relationships. Analyzes reads/writes to detect:

  • RAW (Read-After-Write): reads must follow writes
  • WAW (Write-After-Write): writes must follow previous writes
  • WAR (Write-After-Read): writes must follow all reads

GetStmtPipe

Extracts the pipeline type of a statement:

  1. Use Op::GetPipe() if available
  2. Fall back to call.kwargs["pipe_type"]
  3. Default to PipeType::S (scalar)

Returns the pipeline where the statement executes.
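The three-step fallback can be sketched as below. Everything here is illustrative: the function signature, the use of plain strings for pipe types, and the logger name are assumptions, and the warning on an unrecognized value reflects the review suggestion above rather than confirmed behavior of the pass.

```python
import logging

KNOWN_PIPES = {"M", "V", "S", "MTE1", "MTE2", "MTE3", "FIX", "ALL"}
log = logging.getLogger("scheduler_sketch")  # hypothetical logger name


def get_stmt_pipe(op_pipe, kwargs):
    """Resolve a pipe name: (1) the op, (2) kwargs["pipe_type"], (3) scalar."""
    if op_pipe is not None:
        return op_pipe
    name = kwargs.get("pipe_type")
    if name in KNOWN_PIPES:
        return name
    if name is not None:
        # Make the silent default visible, as the review comments suggest.
        log.warning("unknown pipe_type %r, defaulting to S", name)
    return "S"


resolved = [
    get_stmt_pipe("M", {}),
    get_stmt_pipe(None, {"pipe_type": "V"}),
    get_stmt_pipe(None, {}),
]
```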

LiveCrossPipeEvents

Tracks cross-pipe event state during scheduling:

  • live_by_pair_: Global live event count per pipeline pair (counts unique active producers, not edges)
  • pending_successors_: Per-producer map tracking unscheduled consumers per pipe pair
  • incoming_producers_: Per-consumer list of (producer, pair) dependencies
  • peak_by_pair_: Peak pressure statistics

Key methods:

  • PredictAfterScheduling(candidate): Predicts resource impact, returns whether scheduling is feasible
  • ReleaseIncomingBeforeExecute(stmt): Release wait-side events before statement execution
  • AllocateOutgoingAfterExecute(stmt): Allocate set-side events after statement execution

Event Semantics: Broadcast Model

Hardware reality: Cross-pipe synchronization uses broadcast semantics:

  • Producer issues ONE set_event(id) per unique (SRC, DST) pair
  • Multiple consumers on the same DST pipe can share this event via wait_event(id)
  • Event_id slot is freed when the FIRST consumer is scheduled (matching InsertSyncPass behavior: sync_dst is inserted only before the first consumer; after that the hardware event_id can be reused)

Implementation:

  • pending_successors_[producer][pair].remaining: Counts unscheduled consumers (for correctness bookkeeping).
  • pending_successors_[producer][pair].event_live: Tracks whether this (producer, pair) still occupies an event_id slot.
  • incoming_producers_[consumer]: Tracks which (producer, pair) combinations this consumer depends on
  • live_by_pair_[pair]: Counts unique active producers (NOT edges)

Example: If producer P on MTE2 has 3 consumers on V:

  • Old (per-edge): live_by_pair_[(MTE2,V)] += 3
  • New (broadcast): live_by_pair_[(MTE2,V)] += 1

The (producer, pair) event_id slot is freed when the first of these consumers is actually scheduled. Remaining consumers still keep the dependency relationship, but do not consume an event_id slot.

Lifecycle:

P (MTE2) → C1, C2, C3 (all on V)

After P executes:  live_by_pair_[(MTE2,V)] = 1, pending_successors_[P][(MTE2,V)].remaining = 3
After first scheduled consumer (e.g. C2): remaining = 2, live = 0 (event_id slot freed)
After next consumer (e.g. C1): remaining = 1, live = 0
After last consumer (e.g. C3): remaining = 0 (bookkeeping cleanup), live = 0
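The lifecycle above can be replayed with a toy tracker. This is a sketch of the broadcast bookkeeping, not the LiveCrossPipeEvents class itself; the class and method names are hypothetical.

```python
from collections import defaultdict


class LiveEvents:
    """Toy broadcast-model tracker: one slot per (producer, pair)."""

    def __init__(self):
        self.live_by_pair = defaultdict(int)
        self.pending = {}  # (producer, pair) -> [remaining, event_live]

    def allocate(self, producer, pair, num_consumers):
        """Producer executed: occupy a single slot for this pair."""
        self.pending[(producer, pair)] = [num_consumers, True]
        self.live_by_pair[pair] += 1

    def consume(self, producer, pair):
        """A consumer was scheduled: only the first one frees the slot."""
        state = self.pending[(producer, pair)]
        if state[1]:
            state[1] = False
            self.live_by_pair[pair] -= 1
        state[0] -= 1
        if state[0] == 0:
            del self.pending[(producer, pair)]  # bookkeeping cleanup


pair = ("MTE2", "V")
tracker = LiveEvents()
tracker.allocate("P", pair, num_consumers=3)
after_producer = tracker.live_by_pair[pair]
tracker.consume("P", pair)  # first scheduled consumer, e.g. C2
after_first = tracker.live_by_pair[pair]
tracker.consume("P", pair)  # e.g. C1
tracker.consume("P", pair)  # e.g. C3
```

The live count drops to zero as soon as the first consumer is scheduled, while the remaining counter keeps running until the last consumer is done.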

Consumer Role Tracking

This scheduler treats “first-consumer” as a dynamic concept:

  • releases_event (first scheduled consumer): If a ready candidate has at least one incoming (producer, pair) whose event_live is still true, then scheduling this candidate will free at least one event_id slot.
    • This matches the runtime insertion model: whichever consumer is scheduled first will be the one that carries the sync_dst wait, and thus frees the event_id slot for reuse.
    • Other consumers still keep dependency ordering (they must be scheduled after the producer), but they do not consume additional event_id slots.

Scheduling Algorithm

Overall Flow

  1. Visit each SeqStmts: Collect and visit all direct children
  2. Build dependency graph (CF-aware): Conservative MemRef hazard detection (RAW/WAW/WAR) + unknown side-effect barriers
  3. Add CF stability chain: Preserve relative order among CF/terminator nodes
  4. Kahn topological sort: Enhanced with event_id resource constraints
  5. Multi-strategy scheduling: Try multiple heuristics to find a feasible schedule (strict), then best-effort (relaxed)

Building Dependency Graph

For each statement, collect read/write memory references:

  • Track last writer for each memory location
  • Track all readers since last write

Build edges:

  • RAW: Add edge from last writer to current reader
  • WAW: Add edge from last writer to current writer
  • WAR: Add edges from all readers to current writer

Mark each edge as cross-pipe or same-pipe based on pipeline types.
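A compact sketch of the hazard detection described above, assuming each statement is summarized as a (reads, writes) pair of location names. The function names are hypothetical, and unknown side-effect barriers are left out for brevity.

```python
def build_dependency_edges(effects):
    """Return (src, dst, kind) hazard edges for statements in program order.

    effects: list of (reads, writes) sets, one entry per statement.
    """
    last_writer = {}  # location -> index of the most recent writer
    readers = {}      # location -> readers since that write
    edges = set()
    for i, (reads, writes) in enumerate(effects):
        for loc in reads:
            if loc in last_writer:
                edges.add((last_writer[loc], i, "RAW"))
        for loc in writes:
            if loc in last_writer:
                edges.add((last_writer[loc], i, "WAW"))
            for r in readers.get(loc, ()):
                edges.add((r, i, "WAR"))
        # Update the tracking state only after all hazards are recorded.
        for loc in reads:
            readers.setdefault(loc, []).append(i)
        for loc in writes:
            last_writer[loc] = i
            readers[loc] = []
    return edges


def cross_pipe(edges, pipes):
    """Keep only the edges whose endpoints run on different pipes."""
    return {(s, d) for s, d, _ in edges if pipes[s] != pipes[d]}


effects = [
    (set(), {"a"}),   # 0: a = ...
    ({"a"}, {"b"}),   # 1: b = f(a)
    (set(), {"a"}),   # 2: a = ...   (overwrites a)
]
edges = build_dependency_edges(effects)
crossing = cross_pipe(edges, ["MTE2", "V", "MTE2"])
```

In this example statement 2 gains a WAW edge from statement 0 and a WAR edge from statement 1; only the edges that change pipe would need an event.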

Kahn + Resource Constraints

Enhanced Kahn algorithm that respects event limits:

Initialize ready set with statements having indegree 0
While unscheduled statements exist:
  For each candidate in ready set:
    Predict resource impact if scheduled
    Skip if violates constraint (live events > 8)
    Score candidate using strategy
    Prefer candidates that release at least one live event_id slot (`releases_event`)

  Select best candidate
  Release incoming events (before execution)
  Mark as scheduled
  Allocate outgoing events (after execution)
  Update peak statistics

  Update ready set with new zero-indegree statements
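The loop above can be sketched in runnable form. This is a simplified illustration under stated assumptions, not the pass's implementation: statements are integer indices, scoring is reduced to "releases an event, then lowest predicted worst pair, then original index", and `kahn_schedule` and its helpers are hypothetical names.

```python
from collections import defaultdict


def kahn_schedule(pipes, deps, limit=8, enforce_limit=True):
    """Order statements 0..n-1 topologically while bounding live events.

    pipes: pipe name per statement; deps: set of (producer, consumer) pairs.
    Returns the order, or None when strict mode hits a dead end.
    """
    n = len(pipes)
    succs, preds = defaultdict(list), defaultdict(list)
    indegree = [0] * n
    for src, dst in sorted(deps):
        succs[src].append(dst)
        preds[dst].append(src)
        indegree[dst] += 1

    live = defaultdict(int)  # pipe pair -> live event count
    event_live = set()       # (producer, pair) slots still occupied
    ready = [i for i in range(n) if indegree[i] == 0]
    order = []

    def cross_pairs(i, neighbours, as_consumer):
        """Unique cross-pipe (producer, pair) keys touching statement i."""
        keys = set()
        for j in neighbours[i]:
            src, dst = (j, i) if as_consumer else (i, j)
            if pipes[src] != pipes[dst]:
                keys.add((src, (pipes[src], pipes[dst])))
        return keys

    def predict(i):
        """Return (releases_event, predicted worst-pair count) for i."""
        after = dict(live)
        released = cross_pairs(i, preds, True) & event_live
        for _, pair in released:
            after[pair] -= 1
        for _, pair in cross_pairs(i, succs, False):
            after[pair] = after.get(pair, 0) + 1
        return bool(released), max(after.values(), default=0)

    while ready:
        scored = []
        for i in ready:
            releases, worst = predict(i)
            if enforce_limit and worst > limit:
                continue  # would exceed the event_id budget
            scored.append((not releases, worst, i))
        if not scored:
            return None  # strict mode: every ready candidate violates the limit
        _, _, best = min(scored)
        ready.remove(best)
        order.append(best)
        for key in cross_pairs(best, preds, True) & event_live:
            event_live.discard(key)  # release incoming before execution
            live[key[1]] -= 1
        for key in cross_pairs(best, succs, False):
            event_live.add(key)      # allocate outgoing after execution
            live[key[1]] += 1
        for nxt in succs[best]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order


# Two M producers (0, 1) feeding two V consumers (2, 3), one slot allowed.
order = kahn_schedule(["M", "M", "V", "V"], {(0, 2), (1, 3)}, limit=1)
```

With a limit of 1, the second producer is held back until the first consumer has freed the slot, giving the order [0, 2, 1, 3].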

First-Consumer Priority Optimization:

To minimize peak event pressure, the scheduler prioritizes first-consumers:

  • Candidate comparison first prefers candidates with releases_event == true
  • This schedules event-releasing consumers earlier, freeing event_id slots sooner
  • Reduces the likelihood of exceeding the 8-event limit per pipeline pair
  • Works in conjunction with Strategy A (compute over CF nodes)

Example benefit:

Without priority:  tail_x → [consumer_1, consumer_2, ..., consumer_0] → event held until consumer_0
With priority:     tail_x → [consumer_0, consumer_1, consumer_2, ...] → event released immediately

Candidate Selection Strategies

Selection criteria (in priority order):

  1. kMinMaxThenSumThenIndex (default):

    • Primary: Minimize worst pipeline pair pressure (pred_max)
    • Secondary: Minimize total pressure (pred_sum)
    • Tertiary: By original index
  2. kMinSumThenMaxThenIndex:

    • Primary: Minimize total pressure first
    • Avoids local greedy traps
  3. kMinMaxThenIndex:

    • Only minimize worst pressure
    • Simpler, faster decisions
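One way to read the three strategies is as lexicographic sort keys. The sketch below assumes a score record with pred_max, pred_sum and index fields (the index field and the `pick` helper are assumptions) and leaves out the releases_event preference that is applied first.

```python
from collections import namedtuple

CandidateScore = namedtuple("CandidateScore", "pred_max pred_sum index")

# Each strategy is a lexicographic key; the smallest key wins.
STRATEGY_KEYS = {
    "kMinMaxThenSumThenIndex": lambda s: (s.pred_max, s.pred_sum, s.index),
    "kMinSumThenMaxThenIndex": lambda s: (s.pred_sum, s.pred_max, s.index),
    "kMinMaxThenIndex": lambda s: (s.pred_max, s.index),
}


def pick(candidates, strategy):
    """Return the best candidate under the named strategy."""
    return min(candidates, key=STRATEGY_KEYS[strategy])


candidates = [
    CandidateScore(pred_max=2, pred_sum=5, index=0),
    CandidateScore(pred_max=3, pred_sum=3, index=1),
    CandidateScore(pred_max=2, pred_sum=4, index=2),
]
choices = {name: pick(candidates, name).index for name in STRATEGY_KEYS}
```

On this input the three strategies pick candidates 2, 1 and 0 respectively, which shows why retrying with a different strategy can escape a dead end the default one walks into.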

Fallback Strategy

Try strategies in order:

  1. Strict mode (enforce_limit=true):

    • Try each strategy
    • Enforce 8-event limit strictly
    • Return first successful schedule
  2. Relaxed mode (enforce_limit=false):

    • If all strict strategies fail
    • Don't enforce limit, but minimize pressure
    • Generate best-effort topological order
    • Logs warning to user
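The strict-then-relaxed driver can be sketched as follows. The `run` callback, the logger name, and the choice to use the first strategy for the relaxed run are assumptions; the source does not say which strategy the relaxed mode uses.

```python
import logging

log = logging.getLogger("scheduler_sketch")  # hypothetical logger name


def find_feasible_schedule(run, strategies):
    """Try each strategy strictly, then fall back to a relaxed run.

    run(strategy, enforce_limit) returns an order, or None on a dead end.
    Returns (order, strategy, strict_ok).
    """
    for strategy in strategies:
        order = run(strategy, True)
        if order is not None:
            return order, strategy, True
    # No heuristic found a strict schedule: produce a best-effort order.
    log.warning("Cannot satisfy event limit, using best-effort")
    return run(strategies[0], False), strategies[0], False


def stub_run(strategy, enforce_limit):
    """Stand-in scheduler: only the second strategy succeeds strictly."""
    if not enforce_limit:
        return [0, 1, 2]
    return [0, 2, 1] if strategy == "kMinSumThenMaxThenIndex" else None


STRATEGIES = [
    "kMinMaxThenSumThenIndex",
    "kMinSumThenMaxThenIndex",
    "kMinMaxThenIndex",
]
result = find_feasible_schedule(stub_run, STRATEGIES)
```

Returning the strict_ok flag alongside the order lets a caller or a test distinguish a real solution from the best-effort fallback.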

Invariants

Resource Constraint

Each pipeline pair (SRC, DST) has at most 8 live events at any time. This limit comes from the hardware event_id budget. Strict mode never exceeds it; only the relaxed fallback may, and it logs a warning when it does.

Invariant verification:

  • PredictAfterScheduling checks this before scheduling
  • INTERNAL_CHECK(pred >= 0) ensures release doesn't make count negative

State Consistency

Internal bookkeeping stays consistent:

  • live_by_pair_ never goes negative; predicted counts must be ≥ 0
  • pending_successors_ and incoming_producers_ remain consistent (no double-release, no missing producer-pair state)
  • Peak statistics tracked accurately

Topological Order

Output satisfies all dependencies (RAW/WAW/WAR). Guaranteed by Kahn algorithm: only schedules statements with indegree 0.

Example

Input Code

A = compute_on_M(...)     # Pipeline M
B = compute_on_V(A)       # Pipeline V, depends on A (cross-pipe)
C = compute_on_M(...)     # Pipeline M
D = compute_on_V(C)       # Pipeline V, depends on C (cross-pipe)
E = compute_on_V(B, D)    # Pipeline V, depends on B and D

Dependency Graph

A(M) → B(V) ↘
              E(V)
C(M) → D(V) ↗

Cross-pipe edges: A→B, C→D

Original Schedule

Order: A → B → C → D → E

Time | Execute | Live Events (M→V) | Count
1    | A       | {A→B}             | 1
2    | B       | {}                | 0
3    | C       | {C→D}             | 1
4    | D       | {}                | 0
5    | E       | {}                | 0

Peak M→V events: 1

Optimized Schedule

Order: A → C → B → D → E

Time | Execute | Live Events (M→V) | Count
1    | A       | {A→B}             | 1
2    | C       | {A→B, C→D}        | 2
3    | B       | {C→D}             | 1
4    | D       | {}                | 0
5    | E       | {}                | 0

Peak M→V events: 2

Benefit: Pipeline M operations are batched together (A, C), followed by pipeline V operations (B, D, E). This reduces pipeline switches and improves instruction-level parallelism, at the cost of raising peak M→V event pressure from 1 to 2, which is still well below the limit of 8.
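The two peak values can be reproduced with a short simulation of the broadcast model. This is a sketch for checking the tables by hand; the function name and the dictionary-based inputs are assumptions.

```python
def peak_live_events(order, pipes, deps):
    """Peak number of live cross-pipe slots per pair (broadcast model)."""
    live, peak, occupied = {}, {}, set()
    for stmt in order:
        # Release: the first scheduled consumer frees the producer's slot.
        for src, dst in deps:
            key = (src, (pipes[src], pipes[dst]))
            if dst == stmt and key in occupied:
                occupied.discard(key)
                live[key[1]] -= 1
        # Allocate: one slot per unique cross-pipe (producer, pair).
        for src, dst in deps:
            key = (src, (pipes[src], pipes[dst]))
            if src == stmt and pipes[src] != pipes[dst] and key not in occupied:
                occupied.add(key)
                live[key[1]] = live.get(key[1], 0) + 1
                peak[key[1]] = max(peak.get(key[1], 0), live[key[1]])
    return peak


pipes = {"A": "M", "B": "V", "C": "M", "D": "V", "E": "V"}
deps = {("A", "B"), ("C", "D"), ("B", "E"), ("D", "E")}
original_peak = peak_live_events("ABCDE", pipes, deps)
reordered_peak = peak_live_events("ACBDE", pipes, deps)
```

For the M→V pair this yields a peak of 1 for the original order and 2 for the reordered one, matching the tables.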

Complexity

  • Time: O(n²) graph building + O(n × |ready| × 3) Kahn scheduling = O(n²) worst case
  • Space: O(n²) edges + O(pipeline pairs × n) live events

Limitations

  1. Phase 1 limitations:
    • Control flow nodes treated as immovable black boxes (no inter-procedural analysis)
    • StmtEffect uses conservative union for branches (may create false dependencies)
    • No path-sensitive analysis (every branch is assumed to possibly execute)
    • No loop-invariant code motion (LICM not implemented yet)
  2. Conservative: MemRef-based analysis may be overly conservative
  3. Hardcoded limit: kMaxEventIds = 8 not configurable
  4. Best-effort fallback: May not always satisfy constraints

Future Work (Phase 2+)

Path-sensitive analysis:

  • Analyze conditional branches to enable more aggressive reordering
  • Differentiate between "must execute" vs "may execute" effects

Loop-invariant code motion (LICM):

  • Move loop-invariant computations outside ForStmt bodies
  • Requires proving expressions don't change across iterations

Inter-procedural analysis:

  • Analyze nested CF bodies for finer-grained reordering opportunities
  • Recursively schedule within If/For statement bodies

Debugging

Enable debug logs to track:

  • Segment scheduling: "scheduled segment size=X, worst_peak=Y"
  • Strategy recovery: "Recovered feasible schedule with strategy=Z"
  • Relaxed fallback: "Cannot satisfy event limit, using best-effort"

Verify:

  1. GetStmtPipe returns correct pipeline types
  2. Dependency graph captures RAW/WAW/WAR correctly
  3. Live event tracking matches expectations

Member

  1. Rename the file to 00-out-of-order-schedule.md
  2. Make it shorter; each doc should be around 200-300 lines of markdown

Author

Alright, I'll make the changes.

}
};

auto Better = [](const CandidateScore& a, const CandidateScore& b, PickStrategy strategy) -> bool {
Member

Ask AI to reorganize the code. The current version contains one large function, which is hard to read.

Author

Alright, I'll refactor that function.

Extract large ScheduleSegment function (~260 lines) into focused helper
functions for better readability and maintainability:

- GetMemRefs: Extract memory references from expressions
- BuildDependencyGraph: Build RAW/WAW/WAR dependency edges
- BuildAdjacencyLists: Create successor lists and indegree arrays
- IsBetterCandidate: Compare candidates using selection strategies
- RunKahnScheduling: Kahn topological sort with resource constraints
- FindFeasibleSchedule: Multi-strategy scheduling with fallback

ScheduleSegment now serves as a clean 40-line orchestrator showing the
three main steps: extract pipe types, build dependencies, find schedule.

Also rename OutOfOrderSchedulerPass.md to 00-out-of-order-schedule.md
to follow documentation naming convention.
…d broadcast event model

Add Phase 1 control flow node support that treats IfStmt/ForStmt as immovable
black-box composite nodes in the dependency graph. This enables compute
statements to be reordered across control flow boundaries when data
dependencies allow.

Key improvements:
- Introduce StmtEffect for conservative side-effect analysis of CF nodes
- Implement broadcast event semantics (one event_id per producer-pair, not
  per edge) matching hardware reality and InsertSyncPass behavior
- Add first-consumer priority optimization to minimize peak event pressure
- Add CF stability chain to preserve relative order of control flow nodes
- Refactor LiveCrossPipeEvents tracking with pending_successors and
  incoming_producers maps

Documentation updates:
- Add comprehensive Phase 1 design documentation
- Document broadcast event model and consumer role tracking
- Add cross-CF reordering examples and event lifecycle diagrams
- Update limitations and add future work section

Test improvements:
- Add run_pass_with_ir_print helper for better test debugging
- Refactor test organization and cleanup
- Remove separate header file, consolidate into source file
- Update pass registration in passes.h and bindings
- Synchronize Python bindings and type stubs
- Update test implementation

This refactoring simplifies the pass structure by moving the class
definition into the implementation file, reducing header dependencies
while maintaining all functionality.