OOM when running example of LoRA on Qwen2.5-Omni #205

@haon-chen

Description

Hi team,

I’m trying to reproduce the LoRA training example for Qwen2.5-Omni, but I consistently hit CUDA out-of-memory (OOM) errors.

Hardware

  • 1 node, 8× NVIDIA H100 80GB
  • CPU RAM: ~256 GB

Symptoms

  • With ZeRO-0: CUDA OOM on the first forward/backward steps.
  • Switching to ZeRO-3 avoids the OOM but sometimes triggers NCCL collective timeouts (_ALLGATHER_BASE watchdog); the timeout workaround I currently apply is sketched below.
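
For concreteness, this is how I currently try to keep the ZeRO-3 runs from hitting the watchdog. It is a minimal sketch, not code from the example: the 2-hour timeout is an arbitrary value I picked to rule out the timeout itself, and the environment variable is a standard PyTorch/NCCL knob rather than anything specific to this repo.

```python
import datetime
import os

import torch.distributed as dist

# Surface NCCL failures as exceptions instead of silent hangs
# (standard PyTorch environment variable, not specific to this repo).
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Give the ZeRO-3 all-gathers more headroom before the watchdog fires.
# The 2-hour value is arbitrary; I only use it to rule the timeout out.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

When going through the HF Trainer I pass the same timeout via `ddp_timeout` instead (see the arguments sketch further down).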

Questions

  1. In your reproduction, did you train LoRA with ZeRO-0 on 8× H100 80GB?

    • If yes, could you share the exact batch sizes, sequence lengths, and DeepSpeed config? The current one consistently hits OOM for me (the shape of the config I toggle between stages is sketched after these questions).
  2. Could you please provide a requirements.txt (or conda env) for training Qwen 2.5 Omni with this example?

    • Version pins for torch / deepspeed / nccl would be very helpful.
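
For context, question 1 is about a config of roughly the following shape. The values below are illustrative placeholders that I toggle between stage 0 and stage 3, not the repo’s config, which is exactly why I’d like your exact JSON:

```python
import json

# Illustrative only: the general shape of the DeepSpeed config I toggle.
# "auto" fields are filled in by the Hugging Face Trainer integration.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,  # stage 0 reproduces the OOM for me, stage 3 avoids it
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```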

What I tried

  • Lowering per_device_train_batch_size (16 → 2).
  • Shorter seq lengths (e.g., query_max_len=256, passage_max_len=256).
  • Enabling gradient checkpointing (the resulting training arguments are sketched after this list).
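
Concretely, the arguments I end up with after those reductions look roughly like this. It is a sketch against the Hugging Face `TrainingArguments` API with values from my own attempts, not the settings shipped with the example; `query_max_len` / `passage_max_len` belong to the example script’s data arguments and are only noted in the comments:

```python
from transformers import TrainingArguments

# Roughly what I run after the reductions listed above; these are my
# attempted values, not the original example's settings.
training_args = TrainingArguments(
    output_dir="./qwen_omni_lora",
    per_device_train_batch_size=2,   # reduced from 16
    gradient_accumulation_steps=8,   # to partially recover the effective batch
    gradient_checkpointing=True,
    bf16=True,
    deepspeed="ds_config.json",      # the ZeRO config sketched above
    ddp_timeout=7200,                # longer NCCL timeout for the ZeRO-3 runs
)

# query_max_len=256 and passage_max_len=256 are passed to the example's own
# data arguments, not to TrainingArguments.
```

The `ddp_timeout` value mirrors the distributed-timeout workaround sketched earlier under Symptoms.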

If you have a minimal working config (train args + DS json) and a requirements.txt, that would greatly help us reproduce your results. Thanks!
