Hi team,
I’m trying to reproduce this example training of Qwen 2.5 Omni, but I consistently hit CUDA OOM.
Hardware
- 1 node, 8× NVIDIA H100 80GB
- CPU RAM: ~256 GB
Symptoms
- With ZeRO-0: CUDA OOM on the first forward/backward steps.
- Switching to ZeRO-3 avoids the OOM but sometimes triggers NCCL collective timeouts (watchdog timeout on `_ALLGATHER_BASE`).
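For context, the ZeRO-3 setup I’ve been testing is roughly the following. This is a sketch with illustrative values, not my full config; the `"auto"` entries are filled in by the HF/DeepSpeed integration.

```python
# Rough shape of the ZeRO-3 config I've been experimenting with (illustrative, not a
# known-good config). I pass it to the HF Trainer via TrainingArguments(deepspeed=...),
# which also accepts a path to the equivalent JSON file.
ds_zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}
```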
Questions
- In your reproduction, did you train LoRA with ZeRO-0 on 8× H100 80GB?
- If yes, could you share the exact batch sizes, sequence lengths, and DeepSpeed config? (The config currently in the example consistently hits OOM for me.)
- Could you please provide a `requirements.txt` (or conda env) for training Qwen 2.5 Omni with this example?
  - Version pins for torch / deepspeed / NCCL would be very helpful.
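To make version comparison easier, here is the kind of environment dump I can provide from my side; a minimal sketch using only standard version attributes:

```python
# Minimal environment report – prints the versions relevant to this issue.
import torch
import deepspeed
import transformers

print("torch        :", torch.__version__)
print("cuda (build) :", torch.version.cuda)
print("nccl         :", ".".join(map(str, torch.cuda.nccl.version())))
print("deepspeed    :", deepspeed.__version__)
print("transformers :", transformers.__version__)
if torch.cuda.is_available():
    print("gpu          :", torch.cuda.get_device_name(0))
```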
What I tried
- Lowering `per_device_train_batch_size` (16 → 2).
- Shorter sequence lengths (e.g., `query_max_len=256`, `passage_max_len=256`).
- Enabling gradient checkpointing.
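Concretely, the most conservative combination of those settings that I have tried looks roughly like this (a sketch: `query_max_len` / `passage_max_len` are the argument names used in the example; everything else was left at the example’s defaults):

```python
# Sketch of the overrides I've been testing on top of the example's defaults.
train_overrides = dict(
    per_device_train_batch_size=2,   # down from 16
    query_max_len=256,
    passage_max_len=256,
    gradient_checkpointing=True,
)
```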
If you have a minimal working config (train args + DS JSON) and a `requirements.txt`, that would greatly help us reproduce your results. Thanks!