
feat: Add first-class Comet ML experiment tracking#2653

Open
shanecmoran wants to merge 3 commits into NVIDIA-NeMo:main from shanecmoran:feat/comet-ml-logging

Conversation


@shanecmoran shanecmoran commented Mar 4, 2026

Summary

Adds Comet ML as a fourth experiment tracking backend alongside TensorBoard, Weights & Biases, and MLflow. The integration follows the exact same structural pattern used by the existing backends — no new abstractions, just symmetric extension of each logging call site.

Reference: NeMo Automodel PR #1411 adds a similar Comet integration to Automodel.

Changes

Config (LoggerConfig)

  • 5 new fields: comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags
  • finalize() validation: requires comet_experiment_name when comet_project is set, checks comet_ml importable
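For illustration, the validation described above can be sketched as follows. Field names come from this PR; the surrounding dataclass shape and error messages are simplified stand-ins for the real `LoggerConfig`:

```python
# Sketch of the new Comet fields on LoggerConfig and the finalize() checks
# (requires comet_experiment_name when comet_project is set, and verifies
# that comet_ml is importable). Illustrative only.
import importlib.util
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoggerConfig:
    comet_project: Optional[str] = None
    comet_experiment_name: Optional[str] = None
    comet_workspace: Optional[str] = None
    comet_api_key: Optional[str] = None
    comet_tags: list = field(default_factory=list)

    def finalize(self) -> None:
        if self.comet_project and not self.comet_experiment_name:
            raise ValueError(
                "comet_experiment_name is required when comet_project is set"
            )
        if self.comet_project and importlib.util.find_spec("comet_ml") is None:
            raise ValueError("comet_ml is not installed; pip install comet-ml")
```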

Core (GlobalState)

  • comet_logger lazy property: initializes comet_ml.Experiment on the last rank and logs the full config as parameters
  • _timers_write_to_comet timer patch (preserves / in metric names unlike MLflow)
  • Reset in reset_for_restart()
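A minimal sketch of the lazy-property pattern: `is_last_rank()` and the Experiment kwargs are stand-ins for the real GlobalState wiring, but `comet_ml.Experiment`, `set_name`, and `log_parameters` are the actual SDK calls:

```python
# Lazy, last-rank-only logger initialization (illustrative shape, not the
# real GlobalState implementation).
import os
from typing import Any, Optional


def is_last_rank() -> bool:
    # Hypothetical helper: true only on the final rank of the job.
    return int(os.environ.get("RANK", "0")) == int(os.environ.get("WORLD_SIZE", "1")) - 1


class GlobalState:
    def __init__(self, logger_cfg: Any) -> None:
        self.cfg = logger_cfg
        self._comet_logger: Optional[Any] = None

    @property
    def comet_logger(self) -> Optional[Any]:
        if self._comet_logger is None and self.cfg.comet_project and is_last_rank():
            import comet_ml  # deferred import keeps the dependency optional

            self._comet_logger = comet_ml.Experiment(
                project_name=self.cfg.comet_project,
                workspace=self.cfg.comet_workspace,
            )
            self._comet_logger.set_name(self.cfg.comet_experiment_name)
            self._comet_logger.log_parameters(vars(self.cfg))
        return self._comet_logger

    def reset_for_restart(self) -> None:
        self._comet_logger = None
```

Note that teardown code should read the backing `_comet_logger` attribute rather than the property, so shutdown never triggers a fresh initialization (see the review discussion below).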

Metric Logging

  • training_log(): ~18 inline if comet_logger: blocks mirroring every if mlflow_logger: block
  • eval.py: validation loss and PPL
  • train.py: experiment.end() at both shutdown paths

Checkpoint Tracking

  • New comet_utils.py with on_save_checkpoint_success / on_load_checkpoint_success
  • Wired into checkpointing.py save/load paths
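The save-side callback roughly follows this shape (signature per this PR; `save_dir` is kept for signature parity with the load-side callback, and `log_other` is the real Comet SDK call):

```python
# Sketch of comet_utils.on_save_checkpoint_success; the load-side callback
# mirrors it with last_loaded_checkpoint / checkpoint_base_dir keys.
from pathlib import Path
from typing import Any, Optional


def on_save_checkpoint_success(
    checkpoint_path: str,
    save_dir: str,
    iteration: int,
    comet_logger: Optional[Any],
) -> None:
    if comet_logger is None:
        return
    resolved = str(Path(checkpoint_path).resolve())
    comet_logger.log_other("last_saved_checkpoint", resolved)
    comet_logger.log_other("last_saved_iteration", iteration)
```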

Integration Points

  • tensor_inspect.py: _CometExperimentLogger wrapper for NVIDIA DLFw Inspect
  • run_plugins.py: CometPlugin for NeMo Run launching
  • docs/training/logging.md: full documentation section

Tests

  • test_comet_utils.py: save/load checkpoint callback tests
  • test_state.py: comet_logger property, timer patch, reset tests
  • test_config.py: finalize validation tests

Config Example

cfg.logger = LoggerConfig(
    tensorboard_dir="./runs/tensorboard",
    comet_project="my_project",
    comet_experiment_name="qwen3_14b_sft",
    comet_workspace="my_team",
    comet_tags=["sft", "qwen3"],
)

Fixes #2652

Summary by CodeRabbit

  • New Features

    • Added Comet ML logging integration for training workflows, enabling automatic logging of training metrics, validation performance, checkpoint metadata, and system telemetry.
    • New configuration options for Comet project, experiment name, workspace, API key, and custom tags.
    • Added comprehensive documentation for Comet ML setup and usage.
  • Tests

    • Added unit tests validating Comet ML configuration, logging, and integration behavior.


copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shanecmoran shanecmoran marked this pull request as draft March 4, 2026 23:19

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

This PR adds Comet ML as a first-class experiment tracking backend to Megatron Bridge, mirroring the existing integrations for W&B and MLflow. The changes introduce configuration options, lazy-initialized logger properties, symmetric logging instrumentation across training/evaluation/checkpointing workflows, checkpoint artifact tracking, timer metrics, tensor inspection hooks, and a NeMo Run plugin.

Changes

  • Configuration & State Management (src/megatron/bridge/training/config.py, src/megatron/bridge/training/state.py): Added comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags config fields to LoggerConfig with validation. Introduced a lazy-initialized comet_logger property in GlobalState that constructs a Comet ML Experiment instance on the last rank, and added a _timers_write_to_comet hook for timer metric logging.
  • Logging Integration (src/megatron/bridge/training/utils/train_utils.py, src/megatron/bridge/training/eval.py, src/megatron/bridge/training/tensor_inspect.py): Extended training metric logging to push throughput, memory, gradient stats, learning rate, batch metrics, and loss metrics to Comet. Added validation metric logging in evaluate_and_print_results. Introduced a comet_logger parameter to _maybe_attach_metric_loggers with a CometExperimentLogger wrapper for post-model-initialization tensor inspection.
  • Checkpoint Artifact Management (src/megatron/bridge/training/checkpointing.py, src/megatron/bridge/training/utils/comet_utils.py): Created the comet_utils module with on_save_checkpoint_success and on_load_checkpoint_success callbacks. Integrated comet_finalize_fn into checkpoint save/load finalization flows alongside the existing wandb and mlflow handlers.
  • Training Lifecycle (src/megatron/bridge/training/setup.py, src/megatron/bridge/training/train.py): Added comet_logger propagation through the finalize_tensor_inspect_post_model_initialization call chain. Integrated comet logger termination via experiment.end() at training completion in both train() and _finish_train().
  • Run Plugin & Documentation (src/megatron/bridge/recipes/run_plugins.py, docs/training/logging.md): Introduced the CometPlugin suite (CometPluginScriptArgs, _default_comet_converter, CometPlugin) for NeMo Run integration with environment key propagation and CLI override injection. Added comprehensive Comet ML logging documentation covering configuration, logging coverage, and setup steps.
  • Test Coverage (tests/unit_tests/training/test_config.py, tests/unit_tests/training/test_state.py, tests/unit_tests/training/utils/test_comet_utils.py): Added unit tests for LoggerConfig.finalize() Comet validation, GlobalState.comet_logger initialization under various rank/config conditions, Timers.write_to_comet metric logging with error resilience, and comet_utils checkpoint callbacks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Megatron-Bridge#2112: Implements analogous MLflow logging integration across the same training plumbing—config, state management, checkpointing, tensor_inspect, training utils, and docs—providing a direct code-level parallel for Comet changes.

Suggested labels

community-request

Suggested reviewers

  • cuichenx
  • ananthsub
  • ko3n1g
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes ⚠️ Warning — The PR adds a major Comet ML experiment tracking feature across 13+ files but lacks documented test execution results in the description. Resolution: include documented test results in the PR description: (1) a summary confirming all tests pass, (2) confirmation that existing tests are unaffected, (3) performance impact documentation.
✅ Passed checks (5 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately and concisely summarizes the main change: adding Comet ML as a first-class experiment tracking feature, matching the PR's comprehensive integration across config, state, metrics, checkpoints, and docs.
  • Linked Issues Check ✅ Passed — The PR fully addresses all coding requirements from issue #2652: config fields, lazy-initialized comet_logger property, training/validation metric logging, timer patch, checkpoint callbacks, tensor inspect wrapper, NeMo Run plugin, shutdown cleanup, and documentation.
  • Out of Scope Changes Check ✅ Passed — All changes are directly aligned with the issue objectives; the comprehensive additions across config, state, metrics, checkpoints, plugins, and docs are all explicitly requested in issue #2652 to enable first-class Comet ML integration.
  • Docstring Coverage ✅ Passed — Docstring coverage is 86.67%, above the required 80.00% threshold.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (5)
tests/unit_tests/training/utils/test_comet_utils.py (1)

44-53: Prefer pytest fixtures (tmp_path) over manual tempfile blocks in these unit tests.

This setup pattern is repeated and can be simplified by injecting tmp_path fixtures directly into the test functions.

As per coding guidelines "Use pytest fixtures for common setup in unit tests".

Also applies to: 95-103
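An illustrative rewrite using the tmp_path fixture. The function under test is stubbed here so the sketch is self-contained; the real test imports it from comet_utils:

```python
# Sketch: tmp_path-based test replacing the manual TemporaryDirectory setup.
from unittest.mock import MagicMock


def on_save_checkpoint_success(checkpoint_path, save_dir, iteration, comet_logger):
    # Stand-in for the real function under test in comet_utils.
    if comet_logger:
        comet_logger.log_other("last_saved_checkpoint", checkpoint_path)
        comet_logger.log_other("last_saved_iteration", iteration)


def test_on_save_logs_checkpoint_metadata(tmp_path):
    checkpoint_path = tmp_path / "checkpoint"
    checkpoint_path.mkdir()
    mock_comet = MagicMock()
    on_save_checkpoint_success(
        checkpoint_path.as_posix(), tmp_path.as_posix(), 100, mock_comet
    )
    mock_comet.log_other.assert_any_call("last_saved_iteration", 100)
```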

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/utils/test_comet_utils.py` around lines 44 - 53,
Replace the manual tempfile.TemporaryDirectory block in the test using
on_save_checkpoint_success with the pytest tmp_path fixture: accept tmp_path as
a parameter to the test, create checkpoint_path = tmp_path / "checkpoint" and
checkpoint_path.mkdir(), and pass checkpoint_path.as_posix() (or str) and
tmp_path.as_posix() (or str) into on_save_checkpoint_success along with
mock_comet and iteration; apply the same replacement for the repeated pattern
around the other block (lines ~95-103) to use tmp_path consistently instead of
creating a TemporaryDirectory.
src/megatron/bridge/training/utils/comet_utils.py (1)

46-47: Narrow the exception handling scope in Comet callback logging.

Catching Exception here can hide non-recoverable coding errors. Prefer catching only expected logging/path exceptions and let unexpected failures surface during development.

As per coding guidelines "When using try-except blocks, limit the except clause to the smallest set of errors possible."

Also applies to: 74-75

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/comet_utils.py` around lines 46 - 47, The
broad except Exception in the Comet ML checkpoint logging blocks should be
narrowed to only the expected runtime errors (e.g., I/O and Comet client errors)
so coding errors surface; update the two catch sites that currently read "except
Exception as exc" (the Comet ML checkpoint logging function(s) that call
print_rank_last) to catch specific exceptions such as OSError/IOError and
comet_ml.exceptions.CometMLException (importing that exception class), log the
error message with print_rank_last, and re-raise any other unexpected exceptions
(or omit a broad catch) so non-recoverable bugs are not swallowed.
src/megatron/bridge/training/utils/train_utils.py (1)

494-646: Consider centralizing metric fan-out across backends.

With Comet added, the repeated wandb/mlflow/comet branches are getting large and easy to drift. A small helper like _log_metrics_to_backends(metrics, step, wandb_writer, mlflow_logger, comet_logger) would reduce risk and simplify future additions.

Also applies to: 713-759

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/train_utils.py` around lines 494 - 646,
The metric fan-out is duplicated for wandb_writer, mlflow_logger, and
comet_logger across multiple blocks (e.g., around
report_memory/report_runtime/report_l2_norm_grad and the loss/throughput
sections); create a helper function named _log_metrics_to_backends(metrics,
step, wandb_writer=None, mlflow_logger=None, comet_logger=None,
sanitize_mlflow=False) and replace each repeated if-wandb/if-mlflow/if-comet
sequence with a single call to this helper; ensure the helper calls
wandb_writer.log(metrics, step) when wandb_writer is present, calls
mlflow_logger.log_metrics(_sanitize_mlflow_metrics(metrics)) when mlflow_logger
is present and sanitize_mlflow=True (or raw metrics otherwise), and calls
comet_logger.log_metrics(metrics, step) when comet_logger is present, and update
all places that currently manually log (including uses of
_sanitize_mlflow_metrics) to use the new helper.
src/megatron/bridge/training/config.py (1)

1146-1152: Include all Comet-specific fields in using_comet detection.

At Line 1146, using_comet skips comet_api_key and comet_tags even though they’re part of this config. Including them keeps dependency validation behavior consistent across all Comet knobs.

♻️ Suggested change
-        using_comet = any(
-            [
-                self.comet_project,
-                self.comet_experiment_name,
-                self.comet_workspace,
-            ]
-        )
+        using_comet = any(
+            [
+                self.comet_project,
+                self.comet_experiment_name,
+                self.comet_workspace,
+                self.comet_api_key,
+                self.comet_tags,
+            ]
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/config.py` around lines 1146 - 1152, The
using_comet detection currently checks only comet_project,
comet_experiment_name, and comet_workspace; update the detection to include all
Comet-related fields (add self.comet_api_key and self.comet_tags) so dependency
validation covers every Comet knob (locate the using_comet variable in
src/megatron/bridge/training/config.py and modify the any([...]) list to include
those two attributes).
src/megatron/bridge/training/state.py (1)

522-524: Set an explicit warning stacklevel for better caller diagnostics.

Using stacklevel=2 makes warnings point to the call site instead of this helper.

📝 Suggested tweak
-            warnings.warn("Failed to log timer metrics to Comet ML; continuing without timer metrics.")
+            warnings.warn(
+                "Failed to log timer metrics to Comet ML; continuing without timer metrics.",
+                stacklevel=2,
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/state.py` around lines 522 - 524, The warning
call that logs "Failed to log timer metrics to Comet ML; continuing without
timer metrics." should include an explicit stacklevel so the warning points at
the caller; update the warnings.warn call (the warnings.warn(...) statement that
emits that message) to pass stacklevel=2 (e.g., warnings.warn("Failed to log
timer metrics to Comet ML; continuing without timer metrics.", stacklevel=2)).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 44c4f91a-92f3-4d40-bfbe-ea36e6949f94

📥 Commits

Reviewing files that changed from the base of the PR and between 394037d and 6db8c15.

📒 Files selected for processing (14)
  • docs/training/logging.md
  • src/megatron/bridge/recipes/run_plugins.py
  • src/megatron/bridge/training/checkpointing.py
  • src/megatron/bridge/training/config.py
  • src/megatron/bridge/training/eval.py
  • src/megatron/bridge/training/setup.py
  • src/megatron/bridge/training/state.py
  • src/megatron/bridge/training/tensor_inspect.py
  • src/megatron/bridge/training/train.py
  • src/megatron/bridge/training/utils/comet_utils.py
  • src/megatron/bridge/training/utils/train_utils.py
  • tests/unit_tests/training/test_config.py
  • tests/unit_tests/training/test_state.py
  • tests/unit_tests/training/utils/test_comet_utils.py

Comment on lines +297 to +299
if logger_cfg.comet_experiment_name == "":
raise ValueError("Please specify the comet_experiment_name for Comet ML logging!")


⚠️ Potential issue | 🟠 Major

Normalize Comet API key before passing it to Experiment.

At Line 309, api_key is captured before env trimming. At Line 313, the potentially untrimmed value is passed to Comet, which can cause auth failures for keys with surrounding whitespace. Also, Line 297 should treat None the same as empty string for local fail-fast safety.

🔧 Proposed fix
-                if logger_cfg.comet_experiment_name == "":
+                if not logger_cfg.comet_experiment_name:
                     raise ValueError("Please specify the comet_experiment_name for Comet ML logging!")
@@
-                api_key = logger_cfg.comet_api_key
-                if api_key is None:
-                    api_key = os.environ.get("COMET_API_KEY")
-                if api_key:
-                    if "COMET_API_KEY" in os.environ:
-                        os.environ["COMET_API_KEY"] = os.environ["COMET_API_KEY"].strip()
-                    init_kwargs["api_key"] = api_key
+                api_key = logger_cfg.comet_api_key or os.environ.get("COMET_API_KEY")
+                if api_key:
+                    api_key = api_key.strip()
+                    if api_key:
+                        init_kwargs["api_key"] = api_key

Also applies to: 307-313

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 298-298: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/state.py` around lines 297 - 299, The check for
comet_experiment_name should treat None like an empty string and fail-fast
(replace the equality check with a truthy/None-safe check for
logger_cfg.comet_experiment_name), and ensure the Comet API key is normalized
before use: trim whitespace (call .strip() or equivalent) on the captured
api_key variable before creating the Experiment (Comet) instance so the trimmed
value is passed to Experiment; apply the same normalization to any other api_key
usages in the nearby block that create or configure the Experiment.

Comment on lines +625 to +627
comet_logger = global_state.comet_logger
if comet_logger:
comet_logger.end()

⚠️ Potential issue | 🟠 Major

Do not use the lazy comet_logger property during teardown.

Reading global_state.comet_logger in shutdown paths can initialize a new Comet experiment if one was never created earlier. Teardown should only end an already-initialized logger.

🧹 Proposed teardown-safe fix
-        comet_logger = global_state.comet_logger
+        comet_logger = global_state._comet_logger
         if comet_logger:
             comet_logger.end()
+            global_state._comet_logger = None
@@
-    if global_state.comet_logger:
-        global_state.comet_logger.end()
+    comet_logger = global_state._comet_logger
+    if comet_logger:
+        comet_logger.end()
+        global_state._comet_logger = None

Also applies to: 1293-1294

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/train.py` around lines 625 - 627, Avoid
accessing the lazy property global_state.comet_logger during teardown; instead
check the underlying backing attribute directly (e.g., use getattr(global_state,
"_comet_logger", None) or global_state.__dict__.get("_comet_logger")) and call
.end() only if that backing attribute is non-None. Update the two teardown spots
that call comet_logger = global_state.comet_logger / comet_logger.end() (around
the occurrences at the shown diff and also at lines 1293-1294) to use the
backing-field check and end that instance to prevent accidental initialization.

Comment on lines +43 to +45
resolved_ckpt = str(Path(checkpoint_path).resolve())
comet_logger.log_other("last_saved_checkpoint", resolved_ckpt)
comet_logger.log_other("last_saved_iteration", iteration)

⚠️ Potential issue | 🟠 Major

Avoid logging absolute checkpoint paths to Comet metadata.

These callbacks currently export resolved absolute filesystem paths. For external tracking backends, this can leak internal infrastructure details (mount layout, usernames, host-specific paths). Since save_dir/load_dir are available, log relative checkpoint paths instead.

🔐 Proposed change (relative path + base dir)
 def on_save_checkpoint_success(
     checkpoint_path: str,
     save_dir: str,
     iteration: int,
     comet_logger: Optional[Any],
 ) -> None:
@@
-        resolved_ckpt = str(Path(checkpoint_path).resolve())
-        comet_logger.log_other("last_saved_checkpoint", resolved_ckpt)
+        resolved_ckpt = Path(checkpoint_path).resolve()
+        resolved_save_dir = Path(save_dir).resolve()
+        ckpt_relpath = (
+            str(resolved_ckpt.relative_to(resolved_save_dir))
+            if resolved_ckpt.is_relative_to(resolved_save_dir)
+            else resolved_ckpt.name
+        )
+        comet_logger.log_other("last_saved_checkpoint", ckpt_relpath)
+        comet_logger.log_other("checkpoint_base_dir", str(resolved_save_dir))
         comet_logger.log_other("last_saved_iteration", iteration)
@@
 def on_load_checkpoint_success(
@@
-        resolved_ckpt = str(Path(checkpoint_path).resolve())
-        resolved_load_dir = str(Path(load_dir).resolve())
-        comet_logger.log_other("last_loaded_checkpoint", resolved_ckpt)
-        comet_logger.log_other("checkpoint_base_dir", resolved_load_dir)
+        resolved_ckpt = Path(checkpoint_path).resolve()
+        resolved_load_dir = Path(load_dir).resolve()
+        ckpt_relpath = (
+            str(resolved_ckpt.relative_to(resolved_load_dir))
+            if resolved_ckpt.is_relative_to(resolved_load_dir)
+            else resolved_ckpt.name
+        )
+        comet_logger.log_other("last_loaded_checkpoint", ckpt_relpath)
+        comet_logger.log_other("checkpoint_base_dir", str(resolved_load_dir))

Also applies to: 70-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/comet_utils.py` around lines 43 - 45, The
code currently logs absolute checkpoint paths via resolved_ckpt and
comet_logger.log_other calls; change this to log a path relative to the training
base directory (use save_dir or load_dir) instead: compute base = Path(save_dir
or load_dir).resolve(), ckpt_path = Path(checkpoint_path).resolve(), then try
rel = str(ckpt_path.relative_to(base)) and fall back to ckpt_path.name or a safe
relative string if relative_to raises, and pass that rel string to
comet_logger.log_other("last_saved_checkpoint", ...); repeat the same
relative-path logic for the other comet_logger.log_other calls around the 70-73
area so no absolute filesystem paths are emitted.

Comment on lines +989 to +990
class TestCometLoggerProperty:
"""Tests for the comet_logger property on GlobalState."""

⚠️ Potential issue | 🟡 Minor

Mark the new Comet test classes with @pytest.mark.unit.

The new classes are currently unmarked, which makes unit test selection less consistent with the repo’s test categorization rules.

✅ Suggested update
+@pytest.mark.unit
 class TestCometLoggerProperty:
@@
+@pytest.mark.unit
 class TestTimersWriteToComet:

As per coding guidelines "Use 'pytest.mark' to categorize tests (unit, integration, system)".

Also applies to: 1075-1076

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_state.py` around lines 989 - 990, The new test
classes are missing the pytest marker; add `@pytest.mark.unit` above the
TestCometLoggerProperty class and the other new test class in the same file to
categorize them as unit tests (import pytest if not already present) so pytest
selection respects the repo’s test categorization rules.

@shanecmoran shanecmoran marked this pull request as ready for review March 5, 2026 00:22
@shanecmoran shanecmoran requested a review from a team as a code owner March 5, 2026 00:37
Add Comet ML as a fourth experiment tracking backend alongside
TensorBoard, Weights & Biases, and MLflow. The integration follows
the exact same structural pattern used by the existing backends.

Config fields: comet_project, comet_experiment_name, comet_workspace,
comet_api_key, comet_tags in LoggerConfig with finalize() validation.

GlobalState.comet_logger property lazily initializes a
comet_ml.Experiment on the last rank, logs the full training config
as parameters, and supports tags.

Training metrics: all ~18 metric logging call sites in training_log()
now dispatch to Comet alongside WandB and MLflow. Validation metrics
in eval.py are also logged. Timer metrics use a new
_timers_write_to_comet patch (no metric name sanitization since Comet
supports / in names natively).

Checkpoint tracking: comet_utils.py provides on_save_checkpoint_success
and on_load_checkpoint_success callbacks wired into checkpointing.py.

Tensor inspect: _CometExperimentLogger wrapper registered with NVIDIA
DLFw Inspect MetricLogger system.

CometPlugin added to run_plugins.py for NeMo Run launching.

Documentation added to docs/training/logging.md.

Fixes: NVIDIA-NeMo#2652

Signed-off-by: Shane Moran <shane.moran@shopify.com>
- Include comet_api_key and comet_tags in using_comet detection
- Use truthy check for comet_experiment_name (handles None and "")
- Strip whitespace from api_key before passing to Experiment
- Use backing attribute _comet_logger during teardown to avoid
  accidental lazy initialization
- Add stacklevel=2 to timer warning for better caller diagnostics

Signed-off-by: Shane Moran <shane.moran@shopify.com>
wandb and mlflow are both in the main dependency list. Add comet-ml
to match, since it's now a first-class logging backend.

Signed-off-by: Shane Moran <shane.moran@shopify.com>
@shanecmoran shanecmoran force-pushed the feat/comet-ml-logging branch from 1c12892 to e1e2db5 Compare March 5, 2026 11:57