Conversation
Code Review
This pull request updates several cookbook examples with revised GPU configurations and model IDs, and refactors the loss calculation logic to support separate training and evaluation statuses. It also includes bug fixes for variable references in the vLLM engine and sequence length calculations. Reviewers identified an inconsistency where some metrics still hardcode training status during evaluation and suggested more robust error handling when dynamically loading model architectures from configurations.
```python
if self.model.training:
    status = optimizer_config.train_status
else:
    status = optimizer_config.eval_status
```
Introducing the status variable to handle both training and evaluation metrics is a good improvement. However, the implementation appears incomplete: the accumulation of num_tokens (at line 506) still hardcodes optimizer_config.train_status.num_tokens. This inconsistency will cause evaluation tokens to be incorrectly added to training metrics when the model is in evaluation mode. Please ensure that all metric updates in this function go through the status variable.
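A minimal sketch of the suggested fix. The class and attribute names below (LossMixin, optimizer_config, num_tokens) mirror the snippet above but the surrounding class is hypothetical; the point is that the mode-dependent bucket is selected once and every accumulation goes through it:

```python
from types import SimpleNamespace

class LossMixin:
    """Hypothetical container; the real class in the PR is assumed."""

    def __init__(self, model, optimizer_config):
        self.model = model
        self.optimizer_config = optimizer_config

    def _update_metrics(self, num_tokens: int) -> None:
        # Pick the metric bucket once, based on the model's mode...
        if self.model.training:
            status = self.optimizer_config.train_status
        else:
            status = self.optimizer_config.eval_status
        # ...then route every accumulation through `status`, so eval
        # tokens are no longer added to the training counters.
        status.num_tokens += num_tokens

# Minimal demo: in eval mode, tokens land in eval_status only.
config = SimpleNamespace(
    train_status=SimpleNamespace(num_tokens=0),
    eval_status=SimpleNamespace(num_tokens=0),
)
LossMixin(SimpleNamespace(training=False), config)._update_metrics(128)
```

With this shape, a hardcoded `train_status.num_tokens` reference anywhere in the function would stand out as a bug rather than blending in with the surrounding branch logic.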