
RuntimeError("still have inflight params") during Instruction Tuning #26

@HashmatShadab

I am getting the following error:
RuntimeError: still have inflight params [{'id': 1078, 'status': 'INFLIGHT', 'numel': 16777216, 'ds_numel': 16777216, 'shape': (4096, 4096), 'ds_shape': (4096, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([4194304])}, {'id': 1076, 'status': 'AVAILABLE', 'numel': 5242880, 'ds_numel': 5242880, 'shape': (4096, 1280), 'ds_shape': (4096, 1280), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1310720])}]
I have added a vision backbone (ViT) to the existing CLIP model, similar to what is shown in your codebase. The pretraining stage (with DeepSpeed ZeRO-2) completes without any issues. However, during finetuning (with DeepSpeed ZeRO-3) I get the above error after 30 iterations. I believe the error refers to the parameters of the projection layers of the added vision encoder, but I cannot find where the issue lies, since pretraining works fine.

Furthermore, when I debug without DeepSpeed on a single GPU, it works fine.
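
In case it helps with reading the dump: the `ds_tensor.shape` values look like per-rank shards of the full parameters. Below is a minimal sketch of that arithmetic, using only the numbers from the error message. The helper name `partition_count` is hypothetical, and the sketch assumes each parameter is split evenly across ranks (ZeRO-3 may pad shards, so in general the ratio is only approximate).

```python
# Values copied from the error dump above: (id, numel, local shard size).
params = [
    (1078, 16777216, 4194304),  # shape (4096, 4096), status INFLIGHT
    (1076, 5242880, 1310720),   # shape (4096, 1280), status AVAILABLE
]


def partition_count(numel: int, shard_numel: int) -> int:
    """Hypothetical helper: number of shards, assuming an even split."""
    if numel % shard_numel != 0:
        raise ValueError("shard size does not divide the parameter evenly")
    return numel // shard_numel


for param_id, numel, shard_numel in params:
    # Print the parameter id next to the implied number of shards.
    print(param_id, partition_count(numel, shard_numel))
```

Both parameters come out as 4 shards, which is consistent with the parameters being partitioned across 4 ranks.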
