Training may not work as expected in FSDP #15
Hi, thanks for your work on this project!
I noticed that all trainable parameters are contained within the RotateModule and SmoothModule. In your code, you set ignored_modules to avoid the FSDP error (ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32):
https://github.com/BrotherHappy/OSTQuant/blob/main/quant/trainer.py#L20-L25
However, as discussed here:
pytorch/pytorch#98281 (comment)
FSDP does not synchronize gradients for ignored modules.
As a result, only the gradients computed on GPU0 are actually used; the gradients computed on the other ranks are discarded.
Have you verified this behavior in your setup? Do you have plans to address this issue so that the gradients of these modules are properly synchronized?
Thanks in advance!
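For reference, a possible workaround (a minimal sketch, not from the repository) is to manually average the gradients of the ignored modules across ranks between `loss.backward()` and `optimizer.step()`. The helper name `sync_ignored_grads` is hypothetical; it is a no-op when no process group is initialized:

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def sync_ignored_grads(modules):
    """Manually all-reduce (average) gradients of modules excluded from FSDP.

    Hypothetical helper: since FSDP skips gradient synchronization for
    ignored_modules, their grads can be averaged across ranks by hand
    after loss.backward() and before optimizer.step().
    """
    if not (dist.is_available() and dist.is_initialized()):
        return  # single-process run: nothing to synchronize
    world_size = dist.get_world_size()
    for m in modules:
        for p in m.parameters():
            if p.grad is not None:
                # Sum the gradient across all ranks, then divide to average.
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(world_size)
```

In the trainer loop this would be called with the list of `RotateModule`/`SmoothModule` instances right after the backward pass, so every rank steps the optimizer with the same averaged gradients.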