
chore: Cleanup eval logic; use TFLOP #2661

Merged
ko3n1g merged 7 commits into main from ko3n1g/ci/test-evaluations
Mar 5, 2026

Conversation


@ko3n1g ko3n1g commented Mar 5, 2026

What does this PR do ?

  • Always print performance results (even when the check passes, so we can see the average)
  • Use color-coding to distinguish improvements from regressions
  • Clean up and harmonize the script

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Enhanced performance monitoring to track GPU utilization metrics alongside convergence and performance validation.
    • Added colored console output for performance status indicators (regression/improvement).
  • Improvements

    • Expanded performance metrics reporting to include current and golden GPU utilization averages with signed difference calculations.
    • Updated memory validation outputs with additional per-step allocation details.
    • Enhanced logging and analytics integration with new GPU utilization metrics.

Signed-off-by: oliver könig <okoenig@nvidia.com>

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

The scripts/performance/utils/evaluate.py file is refactored to validate GPU utilization metrics instead of generic timing metrics. The validate_performance function now accepts GPU utilization arrays and computes signed percentage differences against golden values, reporting regressions or improvements with colored status outputs. The calc_convergence_and_performance function is enhanced to collect and propagate per-step GPU utilization data into the validation pipeline and logging.

Changes

Cohort / File(s): GPU Utilization Validation (scripts/performance/utils/evaluate.py)
Summary: Reworked the validate_performance function signature to accept current_gpu_util_values and golden_gpu_util_values instead of generic timing values. Introduced GPU utilization statistics (current/golden averages, signed percentage difference) as outputs with color-coded status (REGRESSION / UNEXPECTED IMPROVEMENT). Enhanced calc_convergence_and_performance to extract and propagate per-step GPU utilization from golden values into convergence/performance checks. Extended the results structure with new metrics (current_avg_gpu_util, golden_avg_gpu_util, signed_diff, threshold, direction) under results["metrics"]. Updated memory validation to include per-MUB allocation metrics and modified error messages to reference the metrics dictionary. Added colorized console and WandB logging for GPU utilization statistics.
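The signed-difference check with color-coded status described above can be sketched roughly as follows. This is a minimal illustration, not the actual evaluate.py implementation: the function name `signed_gpu_util_diff`, the 5% threshold default, and the ANSI color handling are all assumptions.

```python
import numpy as np

# Rough sketch only: names, threshold default, and ANSI codes are assumptions
# for illustration, not the actual evaluate.py API.
GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"

def signed_gpu_util_diff(current, golden, threshold=0.05):
    """Return (signed_diff, status) comparing average GPU utilization."""
    current_avg = float(np.nanmean(current))
    golden_avg = float(np.nanmean(golden))
    # Positive = improvement (higher utilization), negative = regression.
    signed_diff = (current_avg - golden_avg) / golden_avg if golden_avg != 0 else 0.0
    if signed_diff < -threshold:
        status = f"{RED}REGRESSION{RESET}"
    elif signed_diff > threshold:
        status = f"{GREEN}UNEXPECTED IMPROVEMENT{RESET}"
    else:
        status = "OK"
    return signed_diff, status
```

A signed (rather than absolute) difference is what makes the two-sided reporting possible: the sign picks the message, the magnitude decides whether either message fires at all.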

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • PR #2388: Modifies GPU utilization handling in the same performance evaluation script, specifically propagating per-step GPU-util values into convergence and performance validation checks.

Suggested labels

r0.3.0

Suggested reviewers

  • malay-nagda
  • erhoo82
  • chtruong814
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning: The PR title 'chore: Cleanup eval logic; use TFLOP' does not match the actual changes, which focus on GPU utilization validation metrics, not TFLOP calculations. Resolution: update the title to reflect the main change, e.g. 'refactor: Add GPU utilization validation to performance evaluation'.
  • Test Results For Major Changes ⚠️ Warning: The PR contains major changes to performance validation logic without accompanying test results or performance validation data. Resolution: provide test results, before/after performance metrics, GPU utilization validation details, and a detailed changelog documenting the impact on convergence and performance checking behavior.
✅ Passed checks (2 passed)
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/performance/utils/evaluate.py (1)

615-629: ⚠️ Potential issue | 🟠 Major

Fail fast when golden GPU utilization is missing instead of storing None.

At line 619, missing "GPU utilization" is stored as None. This value flows to line 656-657 where np.array([golden_gpu_util.get(s, float("nan")) for s in steps]) doesn't help—the .get() default is only used if the key doesn't exist, but the key already exists with a None value. This creates a numpy array with object dtype containing mixed None and float values, which destabilizes np.nanmean behavior at line 42 of the validate_performance function.

Raise an explicit error instead, tracking which steps are missing GPU utilization data:

Proposed fix
-    golden_gpu_util = {}
+    golden_gpu_util: dict[str, float] = {}
+    missing_golden_gpu_util_steps: list[str] = []
@@
-        golden_gpu_util[key] = value.get("GPU utilization")
+        gpu_util = value.get("GPU utilization")
+        if gpu_util is None:
+            missing_golden_gpu_util_steps.append(key)
+            continue
+        golden_gpu_util[key] = float(gpu_util)
+
+    if missing_golden_gpu_util_steps:
+        raise ValueError(
+            "Missing 'GPU utilization' in golden values for steps: "
+            + ", ".join(missing_golden_gpu_util_steps[:10])
+        )
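The failure mode this comment describes is easy to reproduce in isolation. The sketch below (with made-up step names and values) shows why `.get()` with a default cannot paper over an existing `None` value: the default only applies when the key is absent.

```python
import numpy as np

# Hypothetical golden values: "step_2" exists, but its metric was stored as None.
golden_gpu_util = {"step_1": 95.0, "step_2": None}
steps = ["step_1", "step_2"]

# .get()'s default is only used for missing keys, so None passes through.
vals = [golden_gpu_util.get(s, float("nan")) for s in steps]
arr = np.array(vals)

print(vals[1])    # None, not nan
print(arr.dtype)  # object: no longer a numeric float array for np.nanmean
```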
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 615 - 629, The loop that
populates golden_gpu_util stores None when "GPU utilization" is missing, which
later produces an object-dtype array and breaks np.nanmean in
validate_performance; modify the loop that iterates expected_golden_values
(referencing alloc_metric, max_alloc_metric, steps, golden_gpu_util) to check
value.get("GPU utilization") and immediately raise a descriptive ValueError
(including the affected step keys) if the metric is missing or is None instead
of inserting None, so downstream code can assume numeric GPU utilization values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/utils/evaluate.py`:
- Around line 292-298: The function signature using parameters
current_gpu_util_values, golden_gpu_util_values, steps, logger, wandb_run, and
config should be updated to modern Python 3.10+ typing: replace List[str] with
built-in list[str], Dict[str, Any] with dict[str, Any], and
Optional["wandb.Run"] with "wandb.Run | None" (or wandb.Run | None if not using
string forward refs) and keep np.ndarray as the array type; adjust any
imports/forward references if needed. Locate the function where these parameters
are declared and change the type hints accordingly (steps → list[str], wandb_run
→ wandb.Run | None, config → dict[str, Any]); ensure the rest of the code that
references these types still type-checks.
- Around line 327-337: The slicing for GPU-util windows (variables skip,
current_stable, golden_stable) can produce empty arrays causing np.nanmean to
return NaN and break downstream comparisons; update the logic in evaluate.py
around the computation of current_avg, golden_avg and signed_diff to validate
that current_stable and golden_stable are non-empty and contain finite values
before calling np.nanmean (e.g., check len(...)>0 and np.isfinite on the
arrays), and if validation fails, set a clear failure/invalid flag or raise an
explicit error so the test doesn't silently pass on NaN results; ensure any code
that reads signed_diff handles this explicit error/flag path.

---

Outside diff comments:
In `@scripts/performance/utils/evaluate.py`:
- Around line 615-629: The loop that populates golden_gpu_util stores None when
"GPU utilization" is missing, which later produces an object-dtype array and
breaks np.nanmean in validate_performance; modify the loop that iterates
expected_golden_values (referencing alloc_metric, max_alloc_metric, steps,
golden_gpu_util) to check value.get("GPU utilization") and immediately raise a
descriptive ValueError (including the affected step keys) if the metric is
missing or is None instead of inserting None, so downstream code can assume
numeric GPU utilization values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7c587ee0-a737-471a-bb65-d90b65f46ef4

📥 Commits

Reviewing files that changed from the base of the PR and between 15d758f and 4a2e6bb.

📒 Files selected for processing (1)
  • scripts/performance/utils/evaluate.py

Comment on lines +292 to 298
current_gpu_util_values: "np.ndarray",
golden_gpu_util_values: "np.ndarray",
steps: List[str],
logger: logging.Logger,
wandb_run: Optional["wandb.Run"] = None,
config: Dict[str, Any] = None,
) -> Dict[str, Any]:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check the specific lines in the file
head -300 scripts/performance/utils/evaluate.py | tail -15

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 457


🏁 Script executed:

# Get the full function signature and check imports
sed -n '285,305p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 625


🏁 Script executed:

# Check what typing imports are used in the file
head -50 scripts/performance/utils/evaluate.py | grep -E "^from typing|^import typing"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 115


🏁 Script executed:

# Check Python version requirement (setup.py, pyproject.toml, etc.)
find . -maxdepth 2 -type f \( -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" \) -exec grep -l "python_requires\|python_version" {} \;

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


Update function signature to use Python 3.10+ type-hint syntax.

Lines 296–298 use outdated typing constructs (Optional[...], Dict[...], List[...]) that conflict with coding guidelines requiring PEP 604 unions and built-in generics.

Proposed fix
 def validate_performance(
     current_gpu_util_values: "np.ndarray",
     golden_gpu_util_values: "np.ndarray",
-    steps: List[str],
+    steps: list[str],
     logger: logging.Logger,
-    wandb_run: Optional["wandb.Run"] = None,
-    config: Dict[str, Any] = None,
-) -> Dict[str, Any]:
+    wandb_run: "wandb.Run | None" = None,
+    config: dict[str, Any] | None = None,
+) -> dict[str, Any]:

Per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use built-in generics (list, dict, tuple) instead of typing equivalents."

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 297-297: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 292 - 298, The function
signature using parameters current_gpu_util_values, golden_gpu_util_values,
steps, logger, wandb_run, and config should be updated to modern Python 3.10+
typing: replace List[str] with built-in list[str], Dict[str, Any] with dict[str,
Any], and Optional["wandb.Run"] with "wandb.Run | None" (or wandb.Run | None if
not using string forward refs) and keep np.ndarray as the array type; adjust any
imports/forward references if needed. Locate the function where these parameters
are declared and change the type hints accordingly (steps → list[str], wandb_run
→ wandb.Run | None, config → dict[str, Any]); ensure the rest of the code that
references these types still type-checks.

Comment on lines +327 to +337
     # Discard first N% of iterations for stable comparison
     skip = max(1, int(len(steps) * config["skip_first_percent_time"]))
     current_stable = current_gpu_util_values[skip:]
     golden_stable = golden_gpu_util_values[skip:]

-    # Calculate average step timing
-    current_avg_timing = np.mean(current_timing_stable)
-    golden_avg_timing = np.mean(golden_timing_stable)
+    current_avg = float(np.nanmean(current_stable))
+    golden_avg = float(np.nanmean(golden_stable))

-    # Calculate timing difference
-    timing_diff = abs(current_avg_timing - golden_avg_timing) / golden_avg_timing
+    # Signed diff: positive = improvement (higher util), negative = regression (lower util)
+    signed_diff = (current_avg - golden_avg) / golden_avg if golden_avg != 0 else 0.0



⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, verify the file exists and check its size
ls -la scripts/performance/utils/evaluate.py
wc -l scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 200


🏁 Script executed:

#!/bin/bash
# Read the relevant section of the file with context
sed -n '320,350p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1380


🏁 Script executed:

#!/bin/bash
# Look for regression checks mentioned (lines 338-340)
sed -n '330,345p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 801


🏁 Script executed:

#!/bin/bash
# Test what np.nanmean returns for empty arrays
python3 << 'EOF'
import numpy as np

# Test empty array
result = np.nanmean([])
print(f"np.nanmean([]) = {result}")
print(f"Type: {type(result)}")
print(f"Is NaN: {np.isnan(result)}")

# Test NaN comparison behavior
nan_value = float('nan')
print(f"\nNaN < -0.05: {nan_value < -0.05}")
print(f"NaN > 0.05: {nan_value > 0.05}")
print(f"NaN == 0: {nan_value == 0}")
print(f"bool(NaN < -0.05): {bool(nan_value < -0.05)}")

# Test converting NaN to float
print(f"\nfloat(np.nanmean([])) = {float(np.nanmean([]))}")
EOF

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 188


🏁 Script executed:

#!/bin/bash
# Check for error handling and look at the broader function context
sed -n '300,360p' scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2705


🏁 Script executed:

#!/bin/bash
# Look for where config["skip_first_percent_time"] is set/used
rg -n "skip_first_percent_time" scripts/performance/utils/evaluate.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 569


🏁 Script executed:

#!/bin/bash
# Check if there's any error handling after the regression checks
sed -n '337,380p' scripts/performance/utils/evaluate.py | head -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2051


🏁 Script executed:

#!/bin/bash
# Verify how this function is called and if callers handle NaN/exceptions
rg -A 3 "validate_gpu_utilization" scripts/performance/utils/evaluate.py | head -30

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


Guard against empty/invalid GPU-util windows before pass/fail decisions.

Lines 327–333 can produce empty arrays on short runs: when skip >= len(steps), the slices become empty. Then np.nanmean(...) returns NaN, and the comparison checks at lines 337–338 silently fail to trigger—NaN comparisons always evaluate to False, allowing invalid test results to pass without detection.

🛡️ Proposed fix
+    if len(steps) == 0:
+        raise ValueError("No steps available for GPU utilization validation")
+
-    skip = max(1, int(len(steps) * config["skip_first_percent_time"]))
+    skip = int(len(steps) * config["skip_first_percent_time"])
+    if skip >= len(steps):
+        skip = 0
     current_stable = current_gpu_util_values[skip:]
     golden_stable = golden_gpu_util_values[skip:]
 
     current_avg = float(np.nanmean(current_stable))
     golden_avg = float(np.nanmean(golden_stable))
+    if np.isnan(current_avg) or np.isnan(golden_avg):
+        raise ValueError("GPU utilization metrics are missing or non-finite for compared steps")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/evaluate.py` around lines 327 - 337, The slicing
for GPU-util windows (variables skip, current_stable, golden_stable) can produce
empty arrays causing np.nanmean to return NaN and break downstream comparisons;
update the logic in evaluate.py around the computation of current_avg,
golden_avg and signed_diff to validate that current_stable and golden_stable are
non-empty and contain finite values before calling np.nanmean (e.g., check
len(...)>0 and np.isfinite on the arrays), and if validation fails, set a clear
failure/invalid flag or raise an explicit error so the test doesn't silently
pass on NaN results; ensure any code that reads signed_diff handles this
explicit error/flag path.
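The silent-pass hazard is worth seeing concretely. This standalone sketch (with an assumed skip fraction of 0.2, standing in for config["skip_first_percent_time"]) shows how an empty stable window turns the regression check into a no-op:

```python
import warnings
import numpy as np

steps = ["step_1"]                    # a short run
values = np.array([97.0])
skip = max(1, int(len(steps) * 0.2))  # assumed skip_first_percent_time = 0.2
stable = values[skip:]                # skip >= len(steps) -> empty slice

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # np.nanmean warns on an empty slice
    avg = float(np.nanmean(stable))

print(np.isnan(avg))                  # True
# NaN comparisons are always False, so neither regression nor
# improvement branch ever fires:
print(avg < -0.05, avg > 0.05)        # False False
```

This is why the proposed fix raises explicitly instead of relying on the threshold comparisons to catch the invalid state.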

Signed-off-by: oliver könig <okoenig@nvidia.com>