feat: Add first-class Comet ML experiment tracking #2653

shanecmoran wants to merge 3 commits into NVIDIA-NeMo:main
Conversation
📝 Walkthrough

This PR adds Comet ML as a first-class experiment tracking backend to Megatron Bridge, mirroring the existing integrations for W&B and MLflow. The changes introduce configuration options, lazy-initialized logger properties, symmetric logging instrumentation across training/evaluation/checkpointing workflows, checkpoint artifact tracking, timer metrics, tensor inspection hooks, and a NeMo Run plugin.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (5)
tests/unit_tests/training/utils/test_comet_utils.py (1)
44-53: Prefer pytest fixtures (tmp_path) over manual tempfile blocks in these unit tests.

This setup pattern is repeated and can be simplified by injecting tmp_path fixtures directly into the test functions. As per coding guidelines: "Use pytest fixtures for common setup in unit tests".
Also applies to: 95-103
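The suggested refactor can be sketched as below. The test name and body are hypothetical stand-ins (the real test would call `on_save_checkpoint_success` with a mock Comet logger and assert on its calls); only the `tmp_path` pattern itself is the point:

```python
from pathlib import Path

# Hypothetical sketch of the suggested refactor: pytest injects `tmp_path`
# (a fresh per-test pathlib.Path directory), so no manual
# tempfile.TemporaryDirectory block or cleanup is needed.
def test_on_save_checkpoint_success_logs_paths(tmp_path: Path) -> None:
    checkpoint_path = tmp_path / "checkpoint"
    checkpoint_path.mkdir()
    # The real test would now call, e.g.:
    # on_save_checkpoint_success(
    #     checkpoint_path.as_posix(), tmp_path.as_posix(), iteration, mock_comet
    # )
    assert checkpoint_path.is_dir()
```

pytest discovers the `tmp_path` parameter by name and handles directory creation and cleanup automatically, which removes the repeated context-manager boilerplate the review flags.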
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/unit_tests/training/utils/test_comet_utils.py` around lines 44-53: Replace the manual tempfile.TemporaryDirectory block in the test using on_save_checkpoint_success with the pytest tmp_path fixture: accept tmp_path as a parameter to the test, create checkpoint_path = tmp_path / "checkpoint" and checkpoint_path.mkdir(), and pass checkpoint_path.as_posix() (or str) and tmp_path.as_posix() (or str) into on_save_checkpoint_success along with mock_comet and iteration; apply the same replacement for the repeated pattern around the other block (lines ~95-103) to use tmp_path consistently instead of creating a TemporaryDirectory.

src/megatron/bridge/training/utils/comet_utils.py (1)
46-47: Narrow the exception handling scope in Comet callback logging.

Catching Exception here can hide non-recoverable coding errors. Prefer catching only expected logging/path exceptions and let unexpected failures surface during development. As per coding guidelines: "When using try-except blocks, limit the except clause to the smallest set of errors possible."
Also applies to: 74-75
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/megatron/bridge/training/utils/comet_utils.py` around lines 46-47: The broad except Exception in the Comet ML checkpoint logging blocks should be narrowed to only the expected runtime errors (e.g., I/O and Comet client errors) so coding errors surface; update the two catch sites that currently read "except Exception as exc" (the Comet ML checkpoint logging functions that call print_rank_last) to catch specific exceptions such as OSError/IOError and comet_ml.exceptions.CometMLException (importing that exception class), log the error message with print_rank_last, and re-raise any other unexpected exceptions (or omit a broad catch) so non-recoverable bugs are not swallowed.

src/megatron/bridge/training/utils/train_utils.py (1)
494-646: Consider centralizing metric fan-out across backends.

With Comet added, the repeated wandb/mlflow/comet branches are getting large and easy to drift. A small helper like _log_metrics_to_backends(metrics, step, wandb_writer, mlflow_logger, comet_logger) would reduce risk and simplify future additions.

Also applies to: 713-759
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/megatron/bridge/training/utils/train_utils.py` around lines 494-646: The metric fan-out is duplicated for wandb_writer, mlflow_logger, and comet_logger across multiple blocks (e.g., around report_memory/report_runtime/report_l2_norm_grad and the loss/throughput sections); create a helper function named _log_metrics_to_backends(metrics, step, wandb_writer=None, mlflow_logger=None, comet_logger=None, sanitize_mlflow=False) and replace each repeated if-wandb/if-mlflow/if-comet sequence with a single call to this helper; ensure the helper calls wandb_writer.log(metrics, step) when wandb_writer is present, calls mlflow_logger.log_metrics(_sanitize_mlflow_metrics(metrics)) when mlflow_logger is present and sanitize_mlflow=True (or raw metrics otherwise), and calls comet_logger.log_metrics(metrics, step) when comet_logger is present, and update all places that currently manually log (including uses of _sanitize_mlflow_metrics) to use the new helper.

src/megatron/bridge/training/config.py (1)
1146-1152: Include all Comet-specific fields in using_comet detection.

At Line 1146, using_comet skips comet_api_key and comet_tags even though they're part of this config. Including them keeps dependency validation behavior consistent across all Comet knobs.

♻️ Suggested change

```diff
-        using_comet = any(
-            [
-                self.comet_project,
-                self.comet_experiment_name,
-                self.comet_workspace,
-            ]
-        )
+        using_comet = any(
+            [
+                self.comet_project,
+                self.comet_experiment_name,
+                self.comet_workspace,
+                self.comet_api_key,
+                self.comet_tags,
+            ]
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/megatron/bridge/training/config.py` around lines 1146-1152: The using_comet detection currently checks only comet_project, comet_experiment_name, and comet_workspace; update the detection to include all Comet-related fields (add self.comet_api_key and self.comet_tags) so dependency validation covers every Comet knob (locate the using_comet variable in src/megatron/bridge/training/config.py and modify the any([...]) list to include those two attributes).

src/megatron/bridge/training/state.py (1)
522-524: Set an explicit warning stacklevel for better caller diagnostics.

Using stacklevel=2 makes warnings point to the call site instead of this helper.

📝 Suggested tweak

```diff
-    warnings.warn("Failed to log timer metrics to Comet ML; continuing without timer metrics.")
+    warnings.warn(
+        "Failed to log timer metrics to Comet ML; continuing without timer metrics.",
+        stacklevel=2,
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/state.py` around lines 522 - 524, The warning call that logs "Failed to log timer metrics to Comet ML; continuing without timer metrics." should include an explicit stacklevel so the warning points at the caller; update the warnings.warn call (the warnings.warn(...) statement that emits that message) to pass stacklevel=2 (e.g., warnings.warn("Failed to log timer metrics to Comet ML; continuing without timer metrics.", stacklevel=2)).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 44c4f91a-92f3-4d40-bfbe-ea36e6949f94
📒 Files selected for processing (14)
- docs/training/logging.md
- src/megatron/bridge/recipes/run_plugins.py
- src/megatron/bridge/training/checkpointing.py
- src/megatron/bridge/training/config.py
- src/megatron/bridge/training/eval.py
- src/megatron/bridge/training/setup.py
- src/megatron/bridge/training/state.py
- src/megatron/bridge/training/tensor_inspect.py
- src/megatron/bridge/training/train.py
- src/megatron/bridge/training/utils/comet_utils.py
- src/megatron/bridge/training/utils/train_utils.py
- tests/unit_tests/training/test_config.py
- tests/unit_tests/training/test_state.py
- tests/unit_tests/training/utils/test_comet_utils.py
```python
if logger_cfg.comet_experiment_name == "":
    raise ValueError("Please specify the comet_experiment_name for Comet ML logging!")
```
Normalize Comet API key before passing it to Experiment.
At Line 309, api_key is captured before env trimming. At Line 313, the potentially untrimmed value is passed to Comet, which can cause auth failures for keys with surrounding whitespace. Also, Line 297 should treat None the same as empty string for local fail-fast safety.
🔧 Proposed fix
```diff
-    if logger_cfg.comet_experiment_name == "":
+    if not logger_cfg.comet_experiment_name:
         raise ValueError("Please specify the comet_experiment_name for Comet ML logging!")
@@
-    api_key = logger_cfg.comet_api_key
-    if api_key is None:
-        api_key = os.environ.get("COMET_API_KEY")
-    if api_key:
-        if "COMET_API_KEY" in os.environ:
-            os.environ["COMET_API_KEY"] = os.environ["COMET_API_KEY"].strip()
-        init_kwargs["api_key"] = api_key
+    api_key = logger_cfg.comet_api_key or os.environ.get("COMET_API_KEY")
+    if api_key:
+        api_key = api_key.strip()
+    if api_key:
+        init_kwargs["api_key"] = api_key
```

Also applies to: 307-313
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 298-298: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/state.py` around lines 297 - 299, The check for
comet_experiment_name should treat None like an empty string and fail-fast
(replace the equality check with a truthy/None-safe check for
logger_cfg.comet_experiment_name), and ensure the Comet API key is normalized
before use: trim whitespace (call .strip() or equivalent) on the captured
api_key variable before creating the Experiment (Comet) instance so the trimmed
value is passed to Experiment; apply the same normalization to any other api_key
usages in the nearby block that create or configure the Experiment.
```python
comet_logger = global_state.comet_logger
if comet_logger:
    comet_logger.end()
```
Do not use the lazy comet_logger property during teardown.
Reading global_state.comet_logger in shutdown paths can initialize a new Comet experiment if one was never created earlier. Teardown should only end an already-initialized logger.
🧹 Proposed teardown-safe fix
```diff
-    comet_logger = global_state.comet_logger
+    comet_logger = global_state._comet_logger
     if comet_logger:
         comet_logger.end()
+        global_state._comet_logger = None
@@
-    if global_state.comet_logger:
-        global_state.comet_logger.end()
+    comet_logger = global_state._comet_logger
+    if comet_logger:
+        comet_logger.end()
+        global_state._comet_logger = None
```

Also applies to: 1293-1294
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/train.py` around lines 625 - 627, Avoid
accessing the lazy property global_state.comet_logger during teardown; instead
check the underlying backing attribute directly (e.g., use getattr(global_state,
"_comet_logger", None) or global_state.__dict__.get("_comet_logger")) and call
.end() only if that backing attribute is non-None. Update the two teardown spots
that call comet_logger = global_state.comet_logger / comet_logger.end() (around
the occurrences at the shown diff and also at lines 1293-1294) to use the
backing-field check and end that instance to prevent accidental initialization.
```python
resolved_ckpt = str(Path(checkpoint_path).resolve())
comet_logger.log_other("last_saved_checkpoint", resolved_ckpt)
comet_logger.log_other("last_saved_iteration", iteration)
```
Avoid logging absolute checkpoint paths to Comet metadata.
These callbacks currently export resolved absolute filesystem paths. For external tracking backends, this can leak internal infrastructure details (mount layout, usernames, host-specific paths). Since save_dir/load_dir are available, log relative checkpoint paths instead.
🔐 Proposed change (relative path + base dir)
```diff
 def on_save_checkpoint_success(
     checkpoint_path: str,
     save_dir: str,
     iteration: int,
     comet_logger: Optional[Any],
 ) -> None:
@@
-    resolved_ckpt = str(Path(checkpoint_path).resolve())
-    comet_logger.log_other("last_saved_checkpoint", resolved_ckpt)
+    resolved_ckpt = Path(checkpoint_path).resolve()
+    resolved_save_dir = Path(save_dir).resolve()
+    ckpt_relpath = (
+        str(resolved_ckpt.relative_to(resolved_save_dir))
+        if resolved_ckpt.is_relative_to(resolved_save_dir)
+        else resolved_ckpt.name
+    )
+    comet_logger.log_other("last_saved_checkpoint", ckpt_relpath)
+    comet_logger.log_other("checkpoint_base_dir", str(resolved_save_dir))
     comet_logger.log_other("last_saved_iteration", iteration)
@@
 def on_load_checkpoint_success(
@@
-    resolved_ckpt = str(Path(checkpoint_path).resolve())
-    resolved_load_dir = str(Path(load_dir).resolve())
-    comet_logger.log_other("last_loaded_checkpoint", resolved_ckpt)
-    comet_logger.log_other("checkpoint_base_dir", resolved_load_dir)
+    resolved_ckpt = Path(checkpoint_path).resolve()
+    resolved_load_dir = Path(load_dir).resolve()
+    ckpt_relpath = (
+        str(resolved_ckpt.relative_to(resolved_load_dir))
+        if resolved_ckpt.is_relative_to(resolved_load_dir)
+        else resolved_ckpt.name
+    )
+    comet_logger.log_other("last_loaded_checkpoint", ckpt_relpath)
+    comet_logger.log_other("checkpoint_base_dir", str(resolved_load_dir))
```

Also applies to: 70-73
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/training/utils/comet_utils.py` around lines 43 - 45, The
code currently logs absolute checkpoint paths via resolved_ckpt and
comet_logger.log_other calls; change this to log a path relative to the training
base directory (use save_dir or load_dir) instead: compute base = Path(save_dir
or load_dir).resolve(), ckpt_path = Path(checkpoint_path).resolve(), then try
rel = str(ckpt_path.relative_to(base)) and fall back to ckpt_path.name or a safe
relative string if relative_to raises, and pass that rel string to
comet_logger.log_other("last_saved_checkpoint", ...); repeat the same
relative-path logic for the other comet_logger.log_other calls around the 70-73
area so no absolute filesystem paths are emitted.
```python
class TestCometLoggerProperty:
    """Tests for the comet_logger property on GlobalState."""
```
Mark the new Comet test classes with @pytest.mark.unit.
The new classes are currently unmarked, which makes unit test selection less consistent with the repo’s test categorization rules.
✅ Suggested update
```diff
+@pytest.mark.unit
 class TestCometLoggerProperty:
@@
+@pytest.mark.unit
 class TestTimersWriteToComet:
```

As per coding guidelines: "Use 'pytest.mark' to categorize tests (unit, integration, system)".
Also applies to: 1075-1076
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit_tests/training/test_state.py` around lines 989 - 990, The new test
classes are missing the pytest marker; add `@pytest.mark.unit` above the
TestCometLoggerProperty class and the other new test class in the same file to
categorize them as unit tests (import pytest if not already present) so pytest
selection respects the repo’s test categorization rules.
Commit 1: feat: Add first-class Comet ML experiment tracking

Add Comet ML as a fourth experiment tracking backend alongside TensorBoard, Weights & Biases, and MLflow. The integration follows the exact same structural pattern used by the existing backends.

Config fields: comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags in LoggerConfig with finalize() validation.

GlobalState.comet_logger property lazily initializes a comet_ml.Experiment on the last rank, logs the full training config as parameters, and supports tags.

Training metrics: all ~18 metric logging call sites in training_log() now dispatch to Comet alongside WandB and MLflow. Validation metrics in eval.py are also logged. Timer metrics use a new _timers_write_to_comet patch (no metric name sanitization, since Comet supports / in names natively).

Checkpoint tracking: comet_utils.py provides on_save_checkpoint_success and on_load_checkpoint_success callbacks wired into checkpointing.py.

Tensor inspect: _CometExperimentLogger wrapper registered with the NVIDIA DLFw Inspect MetricLogger system. CometPlugin added to run_plugins.py for NeMo Run launching. Documentation added to docs/training/logging.md.

Fixes: NVIDIA-NeMo#2652

Signed-off-by: Shane Moran <shane.moran@shopify.com>

Commit 2:

- Include comet_api_key and comet_tags in using_comet detection
- Use truthy check for comet_experiment_name (handles None and "")
- Strip whitespace from api_key before passing to Experiment
- Use backing attribute _comet_logger during teardown to avoid accidental lazy initialization
- Add stacklevel=2 to timer warning for better caller diagnostics

Signed-off-by: Shane Moran <shane.moran@shopify.com>

Commit 3:

wandb and mlflow are both in the main dependency list. Add comet-ml to match, since it's now a first-class logging backend.

Signed-off-by: Shane Moran <shane.moran@shopify.com>
Force-pushed from 1c12892 to e1e2db5
Summary
Adds Comet ML as a fourth experiment tracking backend alongside TensorBoard, Weights & Biases, and MLflow. The integration follows the exact same structural pattern used by the existing backends — no new abstractions, just symmetric extension of each logging call site.
Reference: NeMo Automodel PR #1411 adds a similar Comet integration to Automodel.
Changes
Config (LoggerConfig)
- New fields: comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags
- finalize() validation: requires comet_experiment_name when comet_project is set; checks comet_ml is importable

Core (GlobalState)
- comet_logger lazy property: initializes comet_ml.Experiment on last rank, logs full config as parameters
- _timers_write_to_comet timer patch (preserves / in metric names, unlike MLflow)
- reset_for_restart()

Metric Logging
- training_log(): ~18 inline if comet_logger: blocks mirroring every if mlflow_logger: block
- eval.py: validation loss and PPL
- train.py: experiment.end() at both shutdown paths

Checkpoint Tracking
- comet_utils.py with on_save_checkpoint_success / on_load_checkpoint_success
- Wired into checkpointing.py save/load paths

Integration Points
- tensor_inspect.py: _CometExperimentLogger wrapper for NVIDIA DLFw Inspect
- run_plugins.py: CometPlugin for NeMo Run launching
- docs/training/logging.md: full documentation section

Tests
- test_comet_utils.py: save/load checkpoint callback tests
- test_state.py: comet_logger property, timer patch, reset tests
- test_config.py: finalize validation tests

Config Example
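The original config example did not survive extraction; a hypothetical sketch using only the field names listed above might look like:

```python
# Hypothetical values throughout; only the field names come from this PR.
logger_config_overrides = {
    "comet_project": "megatron-bridge-demo",    # enables Comet when set
    "comet_experiment_name": "gpt-pretrain-1",  # required once comet_project is set
    "comet_workspace": "my-team",               # optional Comet workspace
    "comet_api_key": None,                      # falls back to COMET_API_KEY env var
    "comet_tags": ["pretrain", "bf16"],         # optional experiment tags
}
```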
Fixes #2652