[FEATURE] fp16 support for FSDP ZeRO-1 #53

shawnakdeb · 2022-08-19T13:00:45Z

Description

Adds support for fp16 parameters. Shards are kept in fp32 for the optimizer, while outputs and gradients are computed in the original fp16 dtype.

cc @awslabs/raf-reviewer, @zachzzc, @zhen-jia, @XinweiFu
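As a rough illustration of why the optimizer shard is kept in fp32 while compute stays in fp16, here is a minimal standard-library sketch. It is not the code in this PR: `to_fp16`, `to_fp32`, and `sgd_step_fp32_master` are hypothetical helpers, plain SGD stands in for the real optimizer, and the `struct` formats `"e"` and `"f"` are used only to emulate half- and single-precision rounding.

```python
import struct
from typing import List, Tuple


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sgd_step_fp32_master(
    master_shard: List[float], grads_fp16: List[float], lr: float
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: apply SGD to an fp32 master shard using fp16 gradients.

    Returns the updated fp32 shard and the fp16 copy used for compute.
    """
    new_master = [to_fp32(w - lr * g) for w, g in zip(master_shard, grads_fp16)]
    fp16_params = [to_fp16(w) for w in new_master]
    return new_master, fp16_params


# Compare ten small updates applied to an fp32 master copy vs. directly in fp16.
grad = [to_fp16(1e-4)]
master, fp16_copy = [1.0], [1.0]
naive_fp16 = 1.0
for _ in range(10):
    master, fp16_copy = sgd_step_fp32_master(master, grad, lr=1.0)
    naive_fp16 = to_fp16(naive_fp16 - 1.0 * grad[0])

print("fp32 master:", master[0])
print("fp16 copy of master:", fp16_copy[0])
print("weight updated purely in fp16:", naive_fp16)
```

In this toy setup each update is smaller than half the fp16 spacing just below 1.0, so the weight updated purely in fp16 never moves, while the fp32 master accumulates the updates and its fp16 copy eventually changes.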


shawnakdeb · 2022-08-19T16:09:18Z

Some tests using fp16 parameters are failing. The losses are close: the difference between using the FSDP fp32 buffer and using the RAF buffer (with no ZeRO-1 and with optimizer ZeRO-1) is about 0.002 for these tests. However, the tolerance is currently set to 1e-10. Either more exploration is needed to find the reason for the difference, or the tolerance could be made more lenient.
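For context on the two numbers above, the following standard-library sketch shows how far a single rounding to fp16 moves a value, compared with a 1e-10 tolerance. The loss value 0.6931 and the helpers `to_fp16` and `within_tolerance` are made up for illustration and are not taken from the test suite. The sketch does not settle whether a 0.002 difference is acceptable, which depends on the loss magnitude and on where the two code paths diverge.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def within_tolerance(a: float, b: float, tol: float) -> bool:
    """Hypothetical check: absolute difference no larger than `tol`."""
    return abs(a - b) <= tol


# fp16 has a 10-bit stored mantissa, so the gap between 1.0 and the next value is 2**-10.
FP16_EPS = 2.0 ** -10

loss = 0.6931  # illustrative value only
loss_fp16 = to_fp16(loss)

print("fp16 epsilon:", FP16_EPS)
print("rounding error:", abs(loss - loss_fp16))
print("passes 1e-10:", within_tolerance(loss, loss_fp16, 1e-10))
print("passes 1e-3:", within_tolerance(loss, loss_fp16, 1e-3))
```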

zhen-jia · 2022-08-19T17:32:38Z

The 1e-3 difference does not seem to be OK, as we just discussed. Let's spend more time figuring out the issue.

zhen-jia · 2022-08-29T20:42:46Z

@shawnakdeb any updates on this?

shawnakdeb · 2022-08-30T19:27:06Z

> @shawnakdeb any updates on this?

No updates yet. I did not have stable internet access for the past week, but I will continue working on the issue now.

shawnakdeb · 2022-08-30T19:38:44Z

> No updates yet. I did not have stable internet access for the past week, but I will continue working on the issue now.

I did just realize that with my Amazon ID closed, I no longer have access to my multi-GPU environment to run the script. Is there a way for us to get around this?

zhen-jia · 2022-08-30T22:26:49Z

> I did just realize that with my Amazon ID closed, I no longer have access to my multi-GPU environment to run the script. Is there a way for us to get around this?

I think you can still access the instance if you still have the key. Let's discuss this offline. I will send you an email.

shawnakdeb added 30 commits August 1, 2022 16:29

Merge pull request #1 from awslabs/main

909c184

Merging Upstream Ratex

Created RatexFullyShardedDataParallel Wrapper with basic functionality

c5c6fab

Added test to confirm accuracy of FSDP wrapper training

2586976

Created example script for Ratex FSDP ZeRO-1

4819749

Lint and formatting fixes

2165936

Lint warning fixes - removed overriding of parameters() and zero_grad(), changed shard parameter generator method to sharded_paramaters()

aa3334d

Fix format issue

2d087d6

Changing order of jit script wrapping due to new jit script changes

938be6f

Format fix

226e860

Added argument types, added arguments and returns to comments, moved instance variable to constructor, renamed _get_shard to _shard_tensor

257f77e

Style change

6aa79d0

Removed seed from example script

68d0ad1

Fixed references to old method _get_shard

80724a4

Style fixes for test, removed early loss materialization, added skip test for single GPU

44cf21b

Removed early materialization of loss

ee14fcf

Fixed running_losses initialization and style fixes

5557cb9

Style fixes

7c78667

Removed outdated mnm suffix from variable names

b056dfa

Added test_ratex_fully_sharded_data_parallel to the automated testing suite

f192ac5

Change test so that seed is set in train

3afb018

Changed CI test to use 4 GPUs to see if it passes

0550c34

Changed ci test to use 2 GPUs

edb879c

Fixed padding dimension to be first dimension instead of last dimension

b2830b2

Changed parameter dimension of test to be divisible by 4

4085655

Increased Docker shared memory capacity to fix bug for more than 2 GPUs

177c459

Change CI test to use 4 GPUs again

083c665

Merge branch 'main' into FSDP_ZeRO1

7c411ff

Lint and format fixes

de89842

Only pad tensor for sharding on the last rank

550b793

Included tests for when padding is needed and not. Optimizer ZeRO-1 is excluded when padding is needed due to errors

f9635e0
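Several of the commits above deal with padding a parameter along its first dimension so that it splits evenly across ranks. The sketch below shows one way this could look, using nested lists in place of tensors. `shard_first_dim` is a hypothetical helper written for this illustration; it is not the PR's `_shard_tensor`, whose actual behavior is not shown on this page.

```python
from typing import List

Row = List[float]


def shard_first_dim(rows: List[Row], rank: int, world_size: int) -> List[Row]:
    """Hypothetical helper: return this rank's equal-sized slice of `rows`.

    The split is along the first dimension. A slice that runs short
    (the last rank in the example below) is padded with zero rows.
    """
    shard_size = -(-len(rows) // world_size)  # ceiling division
    start = rank * shard_size
    shard = [list(r) for r in rows[start:start + shard_size]]
    width = len(rows[0]) if rows else 0
    # Pad with whole zero rows only where the slice is shorter than shard_size.
    shard.extend([0.0] * width for _ in range(shard_size - len(shard)))
    return shard


# 10 rows over 4 ranks: ranks 0-2 hold 3 real rows, rank 3 holds 1 real row plus 2 padded rows.
rows = [[float(i), float(i)] for i in range(10)]
shards = [shard_first_dim(rows, rank, 4) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

When the first dimension is divisible by the world size, as in the commit that makes the test dimension divisible by 4, no padding is added at all.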

shawnakdeb added 5 commits August 17, 2022 16:52

Format fixes

068adb4

Implemented support for fp16 dtype

413eb42

Added tests for fp16 models

93c6915

Fix merge conflicts

54341fd

Merge branch 'main' into FSDP_fp16

49260f8
