[FEATURE] fp16 support for FSDP ZeRO-1 #53
base: main
Merging Upstream Ratex
…(), changed shard parameter generator method to sharded_paramaters()
…instance variable to constructor, renamed _get_shard to _shard_tensor
…test for single GPU
…s excluded when padding is needed due to errors
Some tests using fp16 parameters are failing. In these tests the losses are close: the difference between using the FSDP fp32 buffer and using the RAF buffer with no ZeRO-1 and optimizer ZeRO-1 is about 0.002. However, the tolerance is currently set to 1e-10. Either more exploration is needed to find the reason for the difference, or the tolerance could be made more lenient.
The e-3 difference does not seem to be OK, as we just discussed. Let's spend more time figuring out the issue.
@shawnakdeb any updates on this?
No updates yet, I did not have stable internet access for the past week, but I will continue work on the issue now. |
I just realized that with my Amazon ID closed, I no longer have access to my multi-GPU environment to run the script. Is there a way for us to get around this?
I think you can still access the instance if you still have the key. Let's discuss this offline; I will send you an email.
Description
Support for the fp16 parameter type. Shards are kept in fp32 for the optimizer, while output and gradient calculations are done in the original fp16 dtype.
cc @awslabs/raf-reviewer, @zachzzc, @zhen-jia, @XinweiFu
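For readers unfamiliar with the pattern in the description, here is a rough single-process NumPy sketch of keeping an fp32 shard per rank while the visible parameter stays fp16. It is not the code in this PR: the helper names (`shard_fp32`, `sgd_step`, `gather_fp16`), the plain SGD update, and the zero-padding scheme are all assumptions made for illustration, and the all-gather is simulated with a concatenate.

```python
import numpy as np

def shard_fp32(param_fp16, rank, world_size):
    """Hypothetical helper: return this rank's slice of the flattened
    parameter as an fp32 master copy, zero-padding so it divides evenly."""
    flat = param_fp16.reshape(-1)
    pad = (-flat.size) % world_size
    if pad:
        flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    shard_size = flat.size // world_size
    return flat[rank * shard_size:(rank + 1) * shard_size].astype(np.float32)

def sgd_step(master_shard, grad_fp16_shard, lr):
    """Assumed optimizer: plain SGD, applied to the fp32 shard.
    The fp16 gradient is upcast before the update."""
    return master_shard - lr * grad_fp16_shard.astype(np.float32)

def gather_fp16(shards, shape):
    """Stand-in for an all-gather: concatenate the shards, cast back to
    fp16, drop the padding, and restore the original shape."""
    n = int(np.prod(shape))
    return np.concatenate(shards).astype(np.float16)[:n].reshape(shape)

world_size = 2
param = np.arange(5, dtype=np.float16)          # 5 elements, padded to 6
shards = [shard_fp32(param, r, world_size) for r in range(world_size)]
grads = [np.ones_like(s, dtype=np.float16) for s in shards]  # fp16 gradients
shards = [sgd_step(s, g, lr=0.5) for s, g in zip(shards, grads)]
updated = gather_fp16(shards, param.shape)
print(updated.dtype, updated)
```

The optimizer state never leaves fp32, so repeated small updates are not rounded away at fp16 precision; only the gathered parameter used for the forward and backward pass is cast down.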