Description
Megatron Bridge has first-class integrations for TensorBoard, Weights & Biases, and MLflow as experiment tracking backends. All three follow a consistent pattern: config fields in LoggerConfig, lazy-initialized properties in GlobalState, and symmetric inline logging calls in training_log(), eval.py, and checkpointing.py.
This issue proposes adding Comet ML as a fourth first-class backend, following the exact same structural pattern.
Motivation
Users who use Comet ML for experiment tracking currently rely on Comet's TensorBoard auto-patcher (COMET_AUTO_LOG_TENSORBOARD=1), which:
- Only captures metrics written to TensorBoard, missing WandB-only or MLflow-only metrics
- Depends on Comet intercepting SummaryWriter calls, which is fragile
- Does not capture config/hyperparameters as Comet parameters
- Does not support checkpoint artifact tracking
- Does not integrate with the DLFw tensor inspect system
A first-class integration would provide the same level of metrics, parameters, and artifact logging that WandB and MLflow users already have.
Proposed Changes
Following the existing WandB/MLflow pattern:
- Config fields: Add comet_project, comet_experiment_name, comet_workspace, comet_api_key, comet_tags to LoggerConfig
- Lazy init: Add GlobalState.comet_logger property (last-rank init, stores comet_ml.Experiment instance)
- Training metrics: Add Comet logging calls alongside existing WandB/MLflow calls in training_log()
- Validation metrics: Add Comet logging in eval.py
- Timer metrics: Add _timers_write_to_comet monkey-patch
- Checkpoint tracking: Add comet_utils.py with save/load callbacks
- Tensor inspect: Add _CometExperimentLogger wrapper
- Run plugin: Add CometPlugin for NeMo Run launching
- Cleanup: Add experiment.end() at shutdown
- Documentation: Add Comet section to docs/training/logging.md
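To make the first three items and the cleanup step concrete, here is a minimal sketch of how they could fit together. It is an illustration under assumptions, not the actual Megatron Bridge code: the LoggerConfig and GlobalState classes below are simplified stand-ins, and the rank/world_size arguments, the factory parameter, and the log_training_metrics and shutdown helpers are hypothetical names introduced only for this example. The comet_ml calls used (Experiment, set_name, add_tags, log_metrics, end) are part of the public Comet SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class LoggerConfig:
    """Hypothetical subset of LoggerConfig holding the proposed Comet fields."""
    comet_project: Optional[str] = None
    comet_experiment_name: Optional[str] = None
    comet_workspace: Optional[str] = None
    comet_api_key: Optional[str] = None
    comet_tags: list = field(default_factory=list)


def _create_comet_experiment(cfg: LoggerConfig) -> Any:
    """Default factory; imports comet_ml only when Comet is actually used."""
    import comet_ml

    experiment = comet_ml.Experiment(
        api_key=cfg.comet_api_key,  # None lets the SDK fall back to its own config/env
        project_name=cfg.comet_project,
        workspace=cfg.comet_workspace,
    )
    if cfg.comet_experiment_name:
        experiment.set_name(cfg.comet_experiment_name)
    if cfg.comet_tags:
        experiment.add_tags(cfg.comet_tags)
    return experiment


class GlobalState:
    """Hypothetical stand-in for GlobalState; only the Comet property is shown."""

    def __init__(self, cfg, rank, world_size, factory=_create_comet_experiment):
        self.cfg, self.rank, self.world_size = cfg, rank, world_size
        self._factory = factory  # injectable so the sketch can run without comet_ml
        self._comet_logger = None

    @property
    def comet_logger(self):
        # Enabled only when a project is configured, and only on the last rank.
        if self.cfg.comet_project is None or self.rank != self.world_size - 1:
            return None
        if self._comet_logger is None:
            self._comet_logger = self._factory(self.cfg)  # created on first access
        return self._comet_logger


def log_training_metrics(state, metrics, iteration):
    """Inline call that would sit next to the WandB/MLflow calls in training_log()."""
    if state.comet_logger is not None:
        state.comet_logger.log_metrics(metrics, step=iteration)


def shutdown(state):
    """End the experiment at shutdown, but only if it was ever created."""
    if state._comet_logger is not None:
        state._comet_logger.end()
        state._comet_logger = None
```

With this shape, ranks other than the last never import comet_ml or open an experiment, and the logging call sites stay a single guarded line, which mirrors how the issue describes the existing backends.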
Config Example
cfg.logger.comet_project = "my-project"
cfg.logger.comet_experiment_name = "qwen3-14b-sft"
cfg.logger.comet_workspace = "my-workspace"
cfg.logger.comet_tags = ["sft", "qwen3"]
Reference
NeMo Automodel PR #1411 adds a similar Comet ML integration to the Automodel training recipe.