
chore: Cleanup eval logic; use TFLOP #2661

Merged
ko3n1g merged 7 commits into main from ko3n1g/ci/test-evaluations
Mar 5, 2026

Conversation


@ko3n1g ko3n1g commented Mar 5, 2026

What does this PR do ?

  • Always print performance results (even when the check passes, so we can see the average)
  • Use color-coding to distinguish improvements from regressions
  • Clean up and harmonize the script

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Enhanced performance monitoring to track GPU utilization metrics alongside convergence and performance validation.
    • Added colored console output for performance status indicators (regression/improvement).
  • Improvements

    • Expanded performance metrics reporting to include current and golden GPU utilization averages with signed difference calculations.
    • Updated memory validation outputs with additional per-step allocation details.
    • Enhanced logging and analytics integration with new GPU utilization metrics.

Signed-off-by: oliver könig <okoenig@nvidia.com>

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

The scripts/performance/utils/evaluate.py file is refactored to validate GPU utilization metrics instead of generic timing metrics. The validate_performance function now accepts GPU utilization arrays and computes signed percentage differences against golden values, reporting regressions or improvements with colored status outputs. The calc_convergence_and_performance function is enhanced to collect and propagate per-step GPU utilization data into the validation pipeline and logging.

Changes

Cohort / File(s): GPU Utilization Validation (scripts/performance/utils/evaluate.py)
Summary: Reworked the validate_performance function signature to accept current_gpu_util_values and golden_gpu_util_values instead of generic timing values. Introduced GPU utilization statistics (current/golden averages, signed percentage difference) as outputs with color-coded status (REGRESSION / UNEXPECTED IMPROVEMENT). Enhanced calc_convergence_and_performance to extract and propagate per-step GPU utilization from golden values into convergence/performance checks. Extended the results structure with new metrics (current_avg_gpu_util, golden_avg_gpu_util, signed_diff, threshold, direction) under results["metrics"]. Updated memory validation to include per-MUB allocation metrics and modified error messages to reference the metrics dictionary. Added colorized console and WandB logging for GPU utilization statistics.
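The signed-difference check with color-coded status described above can be sketched roughly as follows. This is a minimal illustration, not the actual evaluate.py implementation: the function name `signed_gpu_util_diff`, the 5% threshold default, and the ANSI color handling are all assumptions.

```python
import numpy as np

# Rough sketch only: names, threshold default, and ANSI codes are assumptions
# for illustration, not the actual evaluate.py API.
GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"

def signed_gpu_util_diff(current, golden, threshold=0.05):
    """Return (signed_diff, status) comparing average GPU utilization."""
    current_avg = float(np.nanmean(current))
    golden_avg = float(np.nanmean(golden))
    # Positive = improvement (higher utilization), negative = regression.
    signed_diff = (current_avg - golden_avg) / golden_avg if golden_avg != 0 else 0.0
    if signed_diff < -threshold:
        status = f"{RED}REGRESSION{RESET}"
    elif signed_diff > threshold:
        status = f"{GREEN}UNEXPECTED IMPROVEMENT{RESET}"
    else:
        status = "OK"
    return signed_diff, status
```

A signed (rather than absolute) difference is what makes the two-sided reporting possible: the sign picks the message, the magnitude decides whether either message fires at all.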

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • PR #2388: Modifies GPU utilization handling in the same performance evaluation script, specifically propagating per-step GPU-util values into convergence and performance validation checks.

Suggested labels

r0.3.0

Suggested reviewers

  • malay-nagda
  • erhoo82
  • chtruong814
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning: The PR title 'chore: Cleanup eval logic; use TFLOP' does not match the actual changes, which focus on GPU utilization validation metrics, not TFLOP calculations. Resolution: update the title to reflect the main change, e.g. 'refactor: Add GPU utilization validation to performance evaluation'.
  • Test Results For Major Changes ⚠️ Warning: The PR contains major changes to performance validation logic without accompanying test results or performance validation data. Resolution: provide test results, before/after performance metrics, GPU utilization validation details, and a detailed changelog documenting the impact on convergence and performance checking behavior.
✅ Passed checks (2 passed)
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/performance/utils/evaluate.py (1)

615-629: ⚠️ Potential issue | 🟠 Major

Fail fast when golden GPU utilization is missing instead of storing None.

At line 619, missing "GPU utilization" is stored as None. This value flows to line 656-657 where np.array([golden_gpu_util.get(s, float("nan")) for s in steps]) doesn't help—the .get() default is only used if the key doesn't exist, but the key already exists with a None value. This creates a numpy array with object dtype containing mixed None and float values, which destabilizes np.nanmean behavior at line 42 of the validate_performance function.

Raise an explicit error instead, tracking which steps are missing GPU utilization data:

Proposed fix
-    golden_gpu_util = {}
+    golden_gpu_util: dict[str, float] = {}
+    missing_golden_gpu_util_steps: list[str] = []
@@
-        golden_gpu_util[key] = value.get("GPU utilization")
+        gpu_util = value.get("GPU utilization")
+        if gpu_util is None:
+            missing_golden_gpu_util_steps.append(key)
+            continue
+        golden_gpu_util[key] = float(gpu_util)
+
+    if missing_golden_gpu_util_steps:
+        raise ValueError(
+            "Missing 'GPU utilization' in golden values for steps: "
+            + ", ".join(missing_golden_gpu_util_steps[:10])
+        )
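The failure mode this comment describes is easy to reproduce in isolation. The sketch below (with made-up step names and values) shows why `.get()` with a default cannot paper over an existing `None` value: the default only applies when the key is absent.

```python
import numpy as np

# Hypothetical golden values: "step_2" exists, but its metric was stored as None.
golden_gpu_util = {"step_1": 95.0, "step_2": None}
steps = ["step_1", "step_2"]

# .get()'s default is only used for missing keys, so None passes through.
vals = [golden_gpu_util.get(s, float("nan")) for s in steps]
arr = np.array(vals)

print(vals[1])    # None, not nan
print(arr.dtype)  # object: no longer a numeric float array for np.nanmean
```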
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 615 - 629, The loop that
populates golden_gpu_util stores None when "GPU utilization" is missing, which
later produces an object-dtype array and breaks np.nanmean in
validate_performance; modify the loop that iterates expected_golden_values
(referencing alloc_metric, max_alloc_metric, steps, golden_gpu_util) to check
value.get("GPU utilization") and immediately raise a descriptive ValueError
(including the affected step keys) if the metric is missing or is None instead
of inserting None, so downstream code can assume numeric GPU utilization values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/utils/evaluate.py`:
- Around line 292-298: The function signature using parameters
current_gpu_util_values, golden_gpu_util_values, steps, logger, wandb_run, and
config should be updated to modern Python 3.10+ typing: replace List[str] with
built-in list[str], Dict[str, Any] with dict[str, Any], and
Optional["wandb.Run"] with "wandb.Run | None" (or wandb.Run | None if not using
string forward refs) and keep np.ndarray as the array type; adjust any
imports/forward references if needed. Locate the function where these parameters
are declared and change the type hints accordingly (steps → list[str], wandb_run
→ wandb.Run | None, config → dict[str, Any]); ensure the rest of the code that
references these types still type-checks.
- Around line 327-337: The slicing for GPU-util windows (variables skip,
current_stable, golden_stable) can produce empty arrays causing np.nanmean to
return NaN and break downstream comparisons; update the logic in evaluate.py
around the computation of current_avg, golden_avg and signed_diff to validate
that current_stable and golden_stable are non-empty and contain finite values
before calling np.nanmean (e.g., check len(...)>0 and np.isfinite on the
arrays), and if validation fails, set a clear failure/invalid flag or raise an
explicit error so the test doesn't silently pass on NaN results; ensure any code
that reads signed_diff handles this explicit error/flag path.

---

Outside diff comments:
In `@scripts/performance/utils/evaluate.py`:
- Around line 615-629: The loop that populates golden_gpu_util stores None when
"GPU utilization" is missing, which later produces an object-dtype array and
breaks np.nanmean in validate_performance; modify the loop that iterates
expected_golden_values (referencing alloc_metric, max_alloc_metric, steps,
golden_gpu_util) to check value.get("GPU utilization") and immediately raise a
descriptive ValueError (including the affected step keys) if the metric is
missing or is None instead of inserting None, so downstream code can assume
numeric GPU utilization values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7c587ee0-a737-471a-bb65-d90b65f46ef4

📥 Commits

Reviewing files that changed from the base of the PR and between 15d758f and 4a2e6bb.

📒 Files selected for processing (1)
  • scripts/performance/utils/evaluate.py

Comment on lines +292 to 298
current_gpu_util_values: "np.ndarray",
golden_gpu_util_values: "np.ndarray",
steps: List[str],
logger: logging.Logger,
wandb_run: Optional["wandb.Run"] = None,
config: Dict[str, Any] = None,
) -> Dict[str, Any]:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check the specific lines in the file
head -300 scripts/performance/utils/evaluate.py | tail -15

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 457


🏁 Script executed:

# Get the full function signature and check imports
sed -n '285,305p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 625


🏁 Script executed:

# Check what typing imports are used in the file
head -50 scripts/performance/utils/evaluate.py | grep -E "^from typing|^import typing"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 115


🏁 Script executed:

# Check Python version requirement (setup.py, pyproject.toml, etc.)
find . -maxdepth 2 -type f \( -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" \) -exec grep -l "python_requires\|python_version" {} \;

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


Update function signature to use Python 3.10+ type-hint syntax.

Lines 296–298 use outdated typing constructs (Optional[...], Dict[...], List[...]) that conflict with coding guidelines requiring PEP 604 unions and built-in generics.

Proposed fix
 def validate_performance(
     current_gpu_util_values: "np.ndarray",
     golden_gpu_util_values: "np.ndarray",
-    steps: List[str],
+    steps: list[str],
     logger: logging.Logger,
-    wandb_run: Optional["wandb.Run"] = None,
-    config: Dict[str, Any] = None,
-) -> Dict[str, Any]:
+    wandb_run: "wandb.Run | None" = None,
+    config: dict[str, Any] | None = None,
+) -> dict[str, Any]:

Per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents."

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 297-297: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 292 - 298, The function
signature using parameters current_gpu_util_values, golden_gpu_util_values,
steps, logger, wandb_run, and config should be updated to modern Python 3.10+
typing: replace List[str] with built-in list[str], Dict[str, Any] with dict[str,
Any], and Optional["wandb.Run"] with "wandb.Run | None" (or wandb.Run | None if
not using string forward refs) and keep np.ndarray as the array type; adjust any
imports/forward references if needed. Locate the function where these parameters
are declared and change the type hints accordingly (steps → list[str], wandb_run
→ wandb.Run | None, config → dict[str, Any]); ensure the rest of the code that
references these types still type-checks.

Comment on lines +327 to +337
     # Discard first N% of iterations for stable comparison
     skip = max(1, int(len(steps) * config["skip_first_percent_time"]))
     current_stable = current_gpu_util_values[skip:]
     golden_stable = golden_gpu_util_values[skip:]

-    # Calculate average step timing
-    current_avg_timing = np.mean(current_timing_stable)
-    golden_avg_timing = np.mean(golden_timing_stable)
+    current_avg = float(np.nanmean(current_stable))
+    golden_avg = float(np.nanmean(golden_stable))

-    # Calculate timing difference
-    timing_diff = abs(current_avg_timing - golden_avg_timing) / golden_avg_timing
+    # Signed diff: positive = improvement (higher util), negative = regression (lower util)
+    signed_diff = (current_avg - golden_avg) / golden_avg if golden_avg != 0 else 0.0



⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, verify the file exists and check its size
ls -la scripts/performance/utils/evaluate.py
wc -l scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 200


🏁 Script executed:

#!/bin/bash
# Read the relevant section of the file with context
sed -n '320,350p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1380


🏁 Script executed:

#!/bin/bash
# Look for regression checks mentioned (lines 338-340)
sed -n '330,345p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 801


🏁 Script executed:

#!/bin/bash
# Test what np.nanmean returns for empty arrays
python3 << 'EOF'
import numpy as np

# Test empty array
result = np.nanmean([])
print(f"np.nanmean([]) = {result}")
print(f"Type: {type(result)}")
print(f"Is NaN: {np.isnan(result)}")

# Test NaN comparison behavior
nan_value = float('nan')
print(f"\nNaN < -0.05: {nan_value < -0.05}")
print(f"NaN > 0.05: {nan_value > 0.05}")
print(f"NaN == 0: {nan_value == 0}")
print(f"bool(NaN < -0.05): {bool(nan_value < -0.05)}")

# Test converting NaN to float
print(f"\nfloat(np.nanmean([])) = {float(np.nanmean([]))}")
EOF

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 188


🏁 Script executed:

#!/bin/bash
# Check for error handling and look at the broader function context
sed -n '300,360p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2705


🏁 Script executed:

#!/bin/bash
# Look for where config["skip_first_percent_time"] is set/used
rg -n "skip_first_percent_time" scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 569


🏁 Script executed:

#!/bin/bash
# Check if there's any error handling after the regression checks
sed -n '337,380p' scripts/performance/utils/evaluate.py | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2051


🏁 Script executed:

#!/bin/bash
# Verify how this function is called and if callers handle NaN/exceptions
rg -A 3 "validate_gpu_utilization" scripts/performance/utils/evaluate.py | head -30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


Guard against empty/invalid GPU-util windows before pass/fail decisions.

Lines 327–333 can produce empty arrays on short runs: when skip >= len(steps), the slices become empty. Then np.nanmean(...) returns NaN, and the comparison checks at lines 337–338 silently fail to trigger—NaN comparisons always evaluate to False, allowing invalid test results to pass without detection.

🛡️ Proposed fix
+    if len(steps) == 0:
+        raise ValueError("No steps available for GPU utilization validation")
+
-    skip = max(1, int(len(steps) * config["skip_first_percent_time"]))
+    skip = int(len(steps) * config["skip_first_percent_time"])
+    if skip >= len(steps):
+        skip = 0
     current_stable = current_gpu_util_values[skip:]
     golden_stable = golden_gpu_util_values[skip:]
 
     current_avg = float(np.nanmean(current_stable))
     golden_avg = float(np.nanmean(golden_stable))
+    if np.isnan(current_avg) or np.isnan(golden_avg):
+        raise ValueError("GPU utilization metrics are missing or non-finite for compared steps")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 327 - 337, The slicing
for GPU-util windows (variables skip, current_stable, golden_stable) can produce
empty arrays causing np.nanmean to return NaN and break downstream comparisons;
update the logic in evaluate.py around the computation of current_avg,
golden_avg and signed_diff to validate that current_stable and golden_stable are
non-empty and contain finite values before calling np.nanmean (e.g., check
len(...)>0 and np.isfinite on the arrays), and if validation fails, set a clear
failure/invalid flag or raise an explicit error so the test doesn't silently
pass on NaN results; ensure any code that reads signed_diff handles this
explicit error/flag path.
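The silent-pass hazard is worth seeing concretely. This standalone sketch (with an assumed skip fraction of 0.2, standing in for config["skip_first_percent_time"]) shows how an empty stable window turns the regression check into a no-op:

```python
import warnings
import numpy as np

steps = ["step_1"]                    # a short run
values = np.array([97.0])
skip = max(1, int(len(steps) * 0.2))  # assumed skip_first_percent_time = 0.2
stable = values[skip:]                # skip >= len(steps) -> empty slice

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # np.nanmean warns on an empty slice
    avg = float(np.nanmean(stable))

print(np.isnan(avg))                  # True
# NaN comparisons are always False, so neither regression nor
# improvement branch ever fires:
print(avg < -0.05, avg > 0.05)        # False False
```

This is why the proposed fix raises explicitly instead of relying on the threshold comparisons to catch the invalid state.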

Signed-off-by: oliver könig <okoenig@nvidia.com>