Description
I am getting the error below:
RuntimeError: still have inflight params [{'id': 1078, 'status': 'INFLIGHT', 'numel': 16777216, 'ds_numel': 16777216, 'shape': (4096, 4096), 'ds_shape': (4096, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([4194304])}, {'id': 1076, 'status': 'AVAILABLE', 'numel': 5242880, 'ds_numel': 5242880, 'shape': (4096, 1280), 'ds_shape': (4096, 1280), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1310720])}]
I have added a vision backbone (ViT) in addition to the already available CLIP model, similar to what is shown in your codebase. I am able to complete the pretraining stage (with DeepSpeed ZeRO-2) without any issues. However, during finetuning (with DeepSpeed ZeRO-3) I get the above error after 30 iterations. I believe the parameters listed in the error belong to the projection layers of the added vision encoder. However, I have not been able to find where the issue lies, since pretraining works fine.
Further, when I debug without DeepSpeed on a single GPU, it works fine.
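To back up my reading that these are the projection layers, here is a small sanity check on the numbers in the error message. It only re-uses values copied from the traceback. The field name `shard_numel` is my own label for the length of `ds_tensor.shape`, and the interpretation of the ratio as the number of ZeRO-3 partitions is an assumption on my side (it ignores any padding DeepSpeed may add).

```python
import math

# Values copied verbatim from the RuntimeError above.
# "shard_numel" is a hypothetical label for the length reported in 'ds_tensor.shape'.
inflight = [
    {"id": 1078, "status": "INFLIGHT", "shape": (4096, 4096),
     "ds_numel": 16777216, "shard_numel": 4194304},
    {"id": 1076, "status": "AVAILABLE", "shape": (4096, 1280),
     "ds_numel": 5242880, "shard_numel": 1310720},
]


def summarize(params):
    """Return (id, status, shape, full_size / shard_size) for each reported parameter."""
    rows = []
    for p in params:
        # The full shape should multiply out to the reported element count.
        assert math.prod(p["shape"]) == p["ds_numel"]
        # Assumption: an even split, so this ratio is the number of partitions.
        ratio = p["ds_numel"] // p["shard_numel"]
        rows.append((p["id"], p["status"], p["shape"], ratio))
    return rows


summary = summarize(inflight)
for row in summary:
    print(row)
```

Both parameters come out with the same ratio, and the shapes (4096, 1280) and (4096, 4096) look to me like the weights of a two-layer projector mapping 1280-dimensional vision features to a 4096-dimensional hidden size, which is why I suspect the projection layers of the added encoder.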