
Conversation

@shawnakdeb
Contributor

Description

Support for the fp16 parameter type. Shards are kept in fp32 for the optimizer, while output and gradient calculations are done in the original fp16 dtype.

cc @awslabs/raf-reviewer, @zachzzc, @zhen-jia, @XinweiFu
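For illustration only, here is a minimal sketch of the scheme described above, written in plain PyTorch with a plain SGD update rather than the actual ratex/RAF code paths (the helper names are hypothetical): the optimizer updates an fp32 copy of each local shard, while the parameter itself stays in fp16 for the forward and backward pass.

```python
import torch

def make_fp32_shard(fp16_param: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: clone the local fp16 shard into an fp32 master copy
    # that the optimizer will update.
    return fp16_param.detach().clone().float()

def optimizer_step(fp16_param: torch.Tensor,
                   fp32_shard: torch.Tensor,
                   lr: float = 1e-3) -> None:
    # The backward pass produces fp16 gradients; upcast them, apply the update
    # to the fp32 master shard, then copy back down to fp16 so the next
    # forward/backward runs in the original dtype.
    grad_fp32 = fp16_param.grad.float()
    fp32_shard.add_(grad_fp32, alpha=-lr)  # plain SGD update, for illustration only
    fp16_param.data.copy_(fp32_shard.half())
```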

Merging Upstream Ratex
…(), changed shard parameter generator method to sharded_paramaters()
…instance variable to constructor, renamed _get_shard to _shard_tensor
…s excluded when padding is needed due to errors
@shawnakdeb
Contributor Author

shawnakdeb commented Aug 19, 2022

Some tests using fp16 parameters are failing. The loss difference between using the FSDP fp32 buffer and the RAF buffer, with no ZeRO-1 and with optimizer ZeRO-1, is about 0.002 for these tests. However, the tolerance is currently set to 1e-10. Either more investigation is needed to find the reason for the difference, or the tolerance could be relaxed.
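For context, a small sketch of what the tolerance question amounts to (assumed tensor values, not the actual test code): with `torch.allclose`, an absolute tolerance of 1e-10 is effectively an exact-match requirement, while the observed ~0.002 gap would need a correspondingly looser `atol` to pass.

```python
import torch

def losses_match(loss_a: torch.Tensor, loss_b: torch.Tensor,
                 atol: float = 1e-10, rtol: float = 0.0) -> bool:
    # torch.allclose passes iff |a - b| <= atol + rtol * |b| elementwise.
    return torch.allclose(loss_a, loss_b, atol=atol, rtol=rtol)

loss_fp32_buffer = torch.tensor(0.6931)        # hypothetical loss with the FSDP fp32 buffer
loss_raf = torch.tensor(0.6931 + 2e-3)         # RAF loss with the observed ~0.002 gap
print(losses_match(loss_fp32_buffer, loss_raf))              # False at atol=1e-10
print(losses_match(loss_fp32_buffer, loss_raf, atol=5e-3))   # True with an fp16-scale tolerance
```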

@zhen-jia

The ~1e-3 difference does not seem acceptable, as we just discussed. Let's spend more time to figure out the issue.

@zhen-jia

@shawnakdeb any updates on this?

@shawnakdeb
Contributor Author

> @shawnakdeb any updates on this?

No updates yet; I did not have stable internet access for the past week, but I will continue working on the issue now.

@shawnakdeb
Contributor Author

> No updates yet; I did not have stable internet access for the past week, but I will continue working on the issue now.

I just realized that with my Amazon ID closed, I no longer have access to my multi-GPU environment to run the script. Is there a way for us to get around this?

@zhen-jia

> I just realized that with my Amazon ID closed, I no longer have access to my multi-GPU environment to run the script. Is there a way for us to get around this?

I think you can still access the instance if you still have the key. Let's communicate this offline; I will send you an email.
