feat: Add first-class Comet ML experiment tracking #2652

@shanecmoran

Description

Megatron Bridge has first-class integrations for TensorBoard, Weights & Biases, and MLflow as experiment tracking backends. All three follow a consistent pattern: config fields in LoggerConfig, lazy-initialized properties in GlobalState, and symmetric inline logging calls in training_log(), eval.py, and checkpointing.py.

This issue proposes adding Comet ML as a fourth first-class backend, following the exact same structural pattern.

Motivation

Users who track experiments with Comet ML currently have to rely on Comet's TensorBoard auto-patcher (COMET_AUTO_LOG_TENSORBOARD=1), which:

  • Only captures metrics written to TensorBoard, missing WandB-only or MLflow-only metrics
  • Depends on Comet intercepting SummaryWriter calls, which is fragile
  • Does not capture config/hyperparameters as Comet parameters
  • Does not support checkpoint artifact tracking
  • Does not integrate with the DLFw tensor inspect system

A first-class integration would give Comet ML users the same metrics, parameter, and artifact logging that WandB and MLflow users already have.

Proposed Changes

Following the existing WandB/MLflow pattern:

  1. Config fields — Add comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags to LoggerConfig
  2. Lazy init — Add GlobalState.comet_logger property (last-rank init, stores comet_ml.Experiment instance)
  3. Training metrics — Add Comet logging calls alongside existing WandB/MLflow calls in training_log()
  4. Validation metrics — Add Comet logging in eval.py
  5. Timer metrics — Add _timers_write_to_comet monkey-patch
  6. Checkpoint tracking — Add comet_utils.py with save/load callbacks
  7. Tensor inspect — Add _CometExperimentLogger wrapper
  8. Run plugin — Add CometPlugin for NeMo Run launching
  9. Cleanup — Add experiment.end() at shutdown
  10. Documentation — Add Comet section to docs/training/logging.md
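As a rough illustration of items 2, 3, and 9, the lazy-init property, the rank-gated metric call, and the shutdown hook could look something like the sketch below. This is not the actual Megatron Bridge code: `LoggerConfigSketch`, `GlobalStateSketch`, `log_training_metrics`, `finish_comet`, and the `experiment_factory` parameter are hypothetical names introduced for illustration, and the field types are assumptions. The only Comet calls used are `comet_ml.Experiment(...)`, `set_name`, `add_tags`, `log_metrics`, and `end`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class LoggerConfigSketch:
    """Hypothetical stand-in for LoggerConfig, holding only the proposed Comet fields."""

    comet_project: Optional[str] = None
    comet_experiment_name: Optional[str] = None
    comet_workspace: Optional[str] = None
    comet_api_key: Optional[str] = None
    comet_tags: List[str] = field(default_factory=list)


def _default_experiment_factory(cfg: LoggerConfigSketch) -> Any:
    # Imported lazily so comet_ml stays an optional dependency.
    import comet_ml

    exp = comet_ml.Experiment(
        api_key=cfg.comet_api_key,
        project_name=cfg.comet_project,
        workspace=cfg.comet_workspace,
    )
    if cfg.comet_experiment_name:
        exp.set_name(cfg.comet_experiment_name)
    if cfg.comet_tags:
        exp.add_tags(cfg.comet_tags)
    return exp


class GlobalStateSketch:
    """Hypothetical stand-in for GlobalState with a lazily created Comet experiment."""

    def __init__(
        self,
        cfg: LoggerConfigSketch,
        rank: int,
        world_size: int,
        experiment_factory: Callable[[LoggerConfigSketch], Any] = _default_experiment_factory,
    ) -> None:
        self.cfg = cfg
        self.rank = rank
        self.world_size = world_size
        self._factory = experiment_factory
        self._comet_logger: Optional[Any] = None
        self._comet_initialized = False

    @property
    def comet_logger(self) -> Optional[Any]:
        # Create the experiment at most once, and only on the last rank
        # when a Comet project has been configured.
        if not self._comet_initialized:
            self._comet_initialized = True
            is_last_rank = self.rank == self.world_size - 1
            if self.cfg.comet_project and is_last_rank:
                self._comet_logger = self._factory(self.cfg)
        return self._comet_logger


def log_training_metrics(state: GlobalStateSketch, metrics: Dict[str, float], iteration: int) -> None:
    """Log metrics to Comet if this rank owns an experiment; otherwise do nothing."""
    exp = state.comet_logger
    if exp is not None:
        exp.log_metrics(metrics, step=iteration)


def finish_comet(state: GlobalStateSketch) -> None:
    """End the experiment at shutdown without triggering a late initialization."""
    exp = state._comet_logger
    if exp is not None:
        exp.end()
```

Passing the factory in as a parameter is only a convenience of this sketch: it lets the gating logic be exercised with a fake experiment object, without a Comet account or network access.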

Config Example

cfg.logger.comet_project = "my-project"
cfg.logger.comet_experiment_name = "qwen3-14b-sft"
cfg.logger.comet_workspace = "my-workspace"
cfg.logger.comet_tags = ["sft", "qwen3"]
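
One motivation for this issue is that config values and hyperparameters should show up as Comet parameters. A minimal sketch of how a nested config could be flattened before being handed to `Experiment.log_parameters` is shown below. The `flatten_params` helper and the `SECRET_KEYS` set are hypothetical, and leaving the API key out of the logged parameters is an assumption of this sketch rather than something the proposal specifies.

```python
from typing import Any, Dict, Mapping

# Assumption: credentials should not be recorded as experiment parameters.
SECRET_KEYS = {"comet_api_key"}


def flatten_params(cfg: Mapping[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested mapping into dotted keys, skipping secret fields."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        if key in SECRET_KEYS:
            continue
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, Mapping):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


# Example input mirroring the config values above, expressed as a plain dict.
example_cfg = {
    "logger": {
        "comet_project": "my-project",
        "comet_experiment_name": "qwen3-14b-sft",
        "comet_workspace": "my-workspace",
        "comet_tags": ["sft", "qwen3"],
        "comet_api_key": "placeholder-not-logged",
    }
}

params = flatten_params(example_cfg)
# With a live experiment, the result could then be passed on:
# experiment.log_parameters(params)
```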

Reference

NeMo Automodel PR #1411 adds a similar Comet ML integration to the Automodel training recipe.
