
Fix for async dcp checkpointing with Float8Tensors#2721

Draft
pstjohn wants to merge 3 commits into NVIDIA:main from pstjohn:pstjohn/fix-async-dcp

Conversation

@pstjohn (Contributor)

@pstjohn pstjohn commented Mar 2, 2026

(includes changes from #2698)

dcp.async_save fails silently with QuantizedTensor (Float8Tensor) — staged tensors contain uninitialized (NaN) data instead of actual FP8 values.

PyTorch's async save stages tensors to CPU by allocating with new_empty() and then deep-copying raw storage. Float8Tensor is a wrapper subclass whose outer tensor has empty storage (data_ptr() == 0), so:

  1. new_empty() falls through to default dispatch, returning a plain tensor instead of a Float8Tensor
  2. The deep-copied _data/_scale_inv attributes land on the plain tensor but are ignored by DCP's write path
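The fallthrough in step 1 can be reproduced with a minimal wrapper subclass (a hypothetical stand-in for Float8Tensor, not the actual TE implementation): without a dedicated handler for aten.new_empty, the default unwrap-and-dispatch path returns a plain tensor and the subclass type is lost.

```python
import torch

class WrapperTensor(torch.Tensor):
    """Minimal wrapper subclass: the payload lives in an attribute,
    so the outer tensor owns no storage (like Float8Tensor)."""

    # Skip torch_function so all ops route through __torch_dispatch__,
    # the usual pattern for wrapper subclasses.
    __torch_function__ = torch._C._disabled_torch_function_impl

    @staticmethod
    def __new__(cls, data):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device
        )

    def __init__(self, data):
        self._data = data

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Default handling: unwrap to the payload and run the op on plain
        # tensors. Ops without a dedicated handler (aten.new_empty here)
        # therefore return a plain torch.Tensor, dropping the subclass type.
        unwrapped = [a._data if isinstance(a, WrapperTensor) else a for a in args]
        return func(*unwrapped, **(kwargs or {}))

t = WrapperTensor(torch.ones(2))
staged = t.new_empty(t.shape)  # falls through the default dispatch path
print(type(staged))            # plain torch.Tensor, not WrapperTensor
```

The fix described below adds an explicit aten.new_empty.default branch in torch_dispatch that constructs a fresh subclass instance instead of falling through.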

Changes

  • quantized_tensor.py: Handle aten.new_empty.default in torch_dispatch so staging preserves the Float8Tensor subclass type
  • float8_tensor_storage.py: Add a CPU fallback in dequantize() using PyTorch native FP8 dtypes, since tex.dequantize is CUDA-only and the staged tensor lives on CPU
  • run_fsdp2_fused_adam.py: Remove the _dequantize_state_dict workaround — dcp.async_save now works transparently

pstjohn added 3 commits March 2, 2026 07:23
…hard

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
