add stop+go tests to llama3 recipe, turn off async checkpointing for fp8 #1494

Merged
pstjohn merged 1 commit into NVIDIA:main from pstjohn:pstjohn/bio-307-fix-async-checkpoints-for-llama3-recipe
Mar 4, 2026

pstjohn commented on Mar 4, 2026

Async DCP checkpointing currently does not work with FP8 model init, so this PR detects that case and falls back to synchronous checkpointing. It also adds tests to ensure the DCP checkpoints are functional.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
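
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `CheckpointSettings`, its field names, and `resolve_async_save` are hypothetical and only show the shape of the "detect FP8 model init, then disable async saving" decision.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class CheckpointSettings:
    """Hypothetical settings object; the field names are illustrative only."""

    async_save: bool = True
    fp8_model_init: bool = False


def resolve_async_save(settings: CheckpointSettings) -> bool:
    """Return True only when async checkpointing is both requested and usable."""
    if settings.async_save and settings.fp8_model_init:
        # Assumption from the PR description: async DCP saving does not
        # currently work when the model is initialized in FP8.
        logger.warning(
            "Async checkpointing is not supported with FP8 model init; "
            "falling back to synchronous checkpointing."
        )
        return False
    return settings.async_save
```

In a real recipe the returned flag would presumably choose between an asynchronous and a synchronous save call (for example `torch.distributed.checkpoint.async_save` versus `torch.distributed.checkpoint.save`), but that wiring is not shown in this thread.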

coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9a3b7783-2cf7-45d4-94b9-abab3d57dded

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

cspades left a comment

Just NaN and infinity right? LGTM, will add golden value parity tests on TE side as well!
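
To illustrate the difference between the two kinds of check mentioned here, the sketch below runs a toy stop-and-go scenario. It is not the test added in this PR: the "training" loop is a made-up deterministic example, and `copy.deepcopy` stands in for saving and reloading a checkpoint. A finiteness check only rejects NaN and infinity, while a parity check compares the resumed run against an uninterrupted reference.

```python
import copy
import math


def fresh_state():
    """Toy training state (hypothetical; stands in for model + optimizer)."""
    return {"w": 1.0, "lr": 0.1, "step": 0}


def train_steps(state, num_steps):
    """Deterministic gradient descent on f(w) = w**2; returns per-step losses."""
    losses = []
    for _ in range(num_steps):
        losses.append(state["w"] ** 2)
        state["w"] -= state["lr"] * 2 * state["w"]  # gradient of w**2 is 2w
        state["step"] += 1
    return losses


# Reference: ten uninterrupted steps.
reference_losses = train_steps(fresh_state(), 10)

# Stop-and-go: five steps, "save", "load" into a new object, five more steps.
first_run = fresh_state()
losses_before = train_steps(first_run, 5)
checkpoint = copy.deepcopy(first_run)  # stands in for writing a checkpoint
resumed = copy.deepcopy(checkpoint)    # stands in for loading it after restart
resumed_losses = losses_before + train_steps(resumed, 5)

# Weaker check: every loss is a finite number (no NaN, no infinity).
all_finite = all(math.isfinite(loss) for loss in resumed_losses)

# Stronger check: the resumed run reproduces the uninterrupted run.
matches_reference = all(
    math.isclose(a, b, rel_tol=1e-12)
    for a, b in zip(resumed_losses, reference_losses)
)
```

The finiteness check would still pass if a checkpoint restored the wrong weights, as long as training stayed numerically stable, which is why a parity comparison against known-good values is the stricter of the two.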

pstjohn added this pull request to the merge queue on Mar 4, 2026
Merged via the queue into NVIDIA:main with commit c157a41 on Mar 4, 2026
15 checks passed
pstjohn deleted the pstjohn/bio-307-fix-async-checkpoints-for-llama3-recipe branch on March 4, 2026 21:55