Hi,
I am training ChangeFormer under the MTKD framework (step 3: student distillation) on 4× RTX 4090 GPUs with DDP. When SyncBN is enabled for the student model, training hangs at the first iteration:
- GPU utilization goes to 100%
- No `Iter [1/x]` log line is printed
- No error message appears (the process just stays stuck)
If I switch the student from SyncBN to plain BN, training runs normally without hanging. Has anyone encountered this before? Is there a recommended way to use SyncBN in MTKD or other distillation setups, or a known workaround?
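
For clarity, the SyncBN to BN switch mentioned above amounts to changing the norm layer type in the student config. Below is a minimal sketch of that change, assuming an OpenMMLab-style nested config where norm layers are declared as `dict(type='SyncBN', ...)`. The helper `swap_norm_type` and the key names in `student_cfg` are hypothetical and only for illustration; they are not copied from the MTKD repo.

```python
from copy import deepcopy


def swap_norm_type(cfg, old="SyncBN", new="BN"):
    """Return a copy of a nested config with every norm type `old` replaced by `new`."""
    cfg = deepcopy(cfg)  # leave the caller's config untouched

    def _walk(node):
        if isinstance(node, dict):
            # Assumption: a norm config looks like dict(type='SyncBN', ...).
            if node.get("type") == old:
                node["type"] = new
            for value in node.values():
                _walk(value)
        elif isinstance(node, (list, tuple)):
            for item in node:
                _walk(item)

    _walk(cfg)
    return cfg


# Hypothetical student config; the key names are illustrative only.
student_cfg = dict(
    type="ChangeFormer",
    backbone=dict(norm_cfg=dict(type="SyncBN", requires_grad=True)),
    decode_head=dict(norm_cfg=dict(type="SyncBN", requires_grad=True)),
)

bn_cfg = swap_norm_type(student_cfg)
print(bn_cfg["backbone"]["norm_cfg"]["type"])
```

With only this change the hang disappears, so the problem seems tied to the cross-GPU synchronization that SyncBN performs rather than to the rest of the distillation pipeline.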
Thanks!