GPU utilization stuck at 100% and training hangs at first iteration in MTKD step3 (ChangeFormer)

Hi,
I am training ChangeFormer under the MTKD framework (step3: student distillation) on 4× RTX 4090 GPUs with DDP. When SyncBN is enabled for the student model, training hangs at the first iteration:

1. GPU utilization goes to 100%
2. No Iter [1/x] log is printed
3. No explicit error message (process just gets stuck)
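
Since the hang produces no error message, one way to find where each rank is blocked is to enable verbose distributed logging and periodic Python stack dumps before the process group is created. The following is a minimal sketch, not part of MTKD or ChangeFormer: the helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` and the 300-second default are assumptions for illustration. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are existing NCCL / PyTorch environment variables, and `faulthandler` is part of the Python standard library.

```python
import faulthandler
import os
import sys


def enable_hang_diagnostics(timeout_s: float = 300.0, file=None) -> None:
    """Turn on extra logging and periodic stack dumps for a silent hang.

    Hypothetical helper: call it at the very top of the training script,
    before torch.distributed is initialized.
    """
    # Verbose NCCL and torch.distributed logging. setdefault keeps any
    # value that the launcher or the shell has already exported.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    # Dump the Python stack of every thread each `timeout_s` seconds, so a
    # stuck rank shows which call it is blocked in.
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disable_hang_diagnostics() -> None:
    """Cancel the periodic stack dumps, e.g. once training is running."""
    faulthandler.cancel_dump_traceback_later()
```

If the dumps show some ranks waiting inside a SyncBN forward while others are elsewhere, that would point to ranks executing a different sequence of collective calls rather than to a driver or hardware problem.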

If I switch SyncBN to BN, the training runs normally without hanging. Has anyone encountered this issue before? Is there a recommended way to use SyncBN in MTKD / distillation setups, or a known workaround?
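
For reference, the BN fallback described above can also be applied programmatically instead of through the config. This is a minimal sketch in plain PyTorch, assuming the normalized feature maps are 4D (N, C, H, W) and that the model is converted on the CPU before it is moved to the GPU and wrapped in DDP. `revert_sync_batchnorm_2d` is a hypothetical helper name, not an MTKD or ChangeFormer function. Note that with plain BN each GPU normalizes using only its local batch statistics.

```python
import torch
from torch import nn


def revert_sync_batchnorm_2d(module: nn.Module) -> nn.Module:
    """Return `module` with every nn.SyncBatchNorm replaced by nn.BatchNorm2d.

    Assumption: all SyncBN layers receive 4D inputs, so BatchNorm2d is a
    valid replacement. Call this before .cuda() and before wrapping in DDP.
    """
    if isinstance(module, nn.SyncBatchNorm):
        bn = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        with torch.no_grad():
            if module.affine:
                # Carry over the learned scale and shift.
                bn.weight.copy_(module.weight)
                bn.bias.copy_(module.bias)
            if module.track_running_stats:
                # Carry over the accumulated running statistics.
                bn.running_mean.copy_(module.running_mean)
                bn.running_var.copy_(module.running_var)
                bn.num_batches_tracked.copy_(module.num_batches_tracked)
        if module.affine:
            # Keep frozen layers frozen.
            bn.weight.requires_grad_(module.weight.requires_grad)
            bn.bias.requires_grad_(module.bias.requires_grad)
        bn.train(module.training)
        return bn
    # Not a SyncBN layer: recurse into the children and swap them in place.
    for name, child in list(module.named_children()):
        setattr(module, name, revert_sync_batchnorm_2d(child))
    return module
```

A possible use is to revert only one part of the setup at a time (for example only the student, or only a teacher) to narrow down which SyncBN layers are involved in the hang.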

Thanks!

GPU utilization stuck at 100% and training hangs at first iteration in MTKD step3 (ChangeFormer) #154
