RuntimeError(f"still have inflight params " during Instruction Tuning

I am getting the error below:
`RuntimeError: still have inflight params [{'id': 1078, 'status': 'INFLIGHT', 'numel': 16777216, 'ds_numel': 16777216, 'shape': (4096, 4096), 'ds_shape': (4096, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([4194304])}, {'id': 1076, 'status': 'AVAILABLE', 'numel': 5242880, 'ds_numel': 5242880, 'shape': (4096, 1280), 'ds_shape': (4096, 1280), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1310720])}]
`
I have added a vision backbone (ViT) to the already available CLIP model, similar to what is shown in your codebase. I am able to complete the pretraining stage (with DeepSpeed ZeRO-2) without any issues. However, during finetuning (with DeepSpeed ZeRO-3) I get the above error after 30 iterations. I believe the error refers to the parameters of the projection layers of the added vision encoder. However, I cannot find where the issue lies, since pretraining works fine.

Further, when I debug without DeepSpeed on a single GPU, it works fine.
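To narrow down which parameters the error refers to, the entries in the message can be parsed and matched against the model's parameter shapes. Below is a minimal sketch in plain Python, not part of the training code. The helper names (`parse_inflight`, `summarize`, `names_for`) and the parameter names in `HYPOTHETICAL_SHAPES` are made up for illustration. In a real run the name-to-shape mapping would have to come from the model itself, presumably via the `ds_shape` of each parameter, since the error dict suggests that is where ZeRO-3 keeps the full shape. The partition count is only an estimate, based on the assumption that `ds_tensor` is this rank's shard of the full parameter.

```python
import ast
import re

# Entries copied from the error message above.
ERROR_TEXT = (
    "still have inflight params "
    "[{'id': 1078, 'status': 'INFLIGHT', 'numel': 16777216, 'ds_numel': 16777216, "
    "'shape': (4096, 4096), 'ds_shape': (4096, 4096), 'requires_grad': True, "
    "'grad_shape': None, 'persist': False, 'active_sub_modules': set(), "
    "'ds_tensor.shape': torch.Size([4194304])}, "
    "{'id': 1076, 'status': 'AVAILABLE', 'numel': 5242880, 'ds_numel': 5242880, "
    "'shape': (4096, 1280), 'ds_shape': (4096, 1280), 'requires_grad': True, "
    "'grad_shape': None, 'persist': False, 'active_sub_modules': set(), "
    "'ds_tensor.shape': torch.Size([1310720])}]"
)


def parse_inflight(text):
    """Parse the list of parameter summaries out of the error message."""
    # Rewrite torch.Size([...]) and set() so ast.literal_eval can read the text.
    cleaned = re.sub(r"torch\.Size\((\[[^\]]*\])\)", r"\1", text)
    cleaned = cleaned.replace("set()", "[]")
    return ast.literal_eval(cleaned[cleaned.index("["):])


def summarize(entries):
    """Return id, status, full shape, and an estimated partition count."""
    rows = []
    for entry in entries:
        shard_numel = entry["ds_tensor.shape"][0]
        # Assumption: ds_tensor is the local shard, so full size / shard size
        # (rounded up) approximates the number of ranks sharing the parameter.
        partitions = -(-entry["ds_numel"] // shard_numel)
        rows.append({
            "id": entry["id"],
            "status": entry["status"],
            "shape": tuple(entry["ds_shape"]),
            "partitions": partitions,
        })
    return rows


def names_for(row, name_to_shape):
    """List the parameter names whose full shape matches the given row."""
    return [name for name, shape in name_to_shape.items()
            if tuple(shape) == row["shape"]]


# Hypothetical names and shapes, for illustration only.
HYPOTHETICAL_SHAPES = {
    "vision_projector.0.weight": (4096, 1280),
    "vision_projector.2.weight": (4096, 4096),
    "vision_tower.embeddings.weight": (1280, 588),
}

rows = summarize(parse_inflight(ERROR_TEXT))
for row in rows:
    print(row, names_for(row, HYPOTHETICAL_SHAPES))
```

For the message above, both entries come out with an estimated partition count of 4, and the `(4096, 1280)` and `(4096, 4096)` shapes would be consistent with a two-layer projector from a 1280-wide encoder into a 4096-wide language model. Matching by shape alone is ambiguous when several parameters share a shape (many `(4096, 4096)` weights may exist in the language model), so the result is a hint rather than a definite identification.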

RuntimeError(f"still have inflight params " during Instruction Tuning #26

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

RuntimeError(f"still have inflight params " during Instruction Tuning #26

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions