This is an extension of cwpeng's. You can now train Qwen3-VL with Unsloth on multiple GPUs:
- For the first run, disable multi-GPU so that Unsloth can compile into unsloth_compiled_cache
- After that, add os.environ["UNSLOTH_COMPILE_DISABLE"] = "1" to disable Unsloth compilation, which avoids the hang (I don't know why it hangs). This reduces speed, but my experiment is too small for the effect to be noticeable
- The root cause seems to be related to gradient checkpointing
- Can run with DeepSpeed.
- Tested on 2x L4
- Working on GRPO for VL model.
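
The two-step workflow above could be wired into the top of a training script roughly as follows. This is only a sketch: the helper name `configure_unsloth_env` is hypothetical, and it assumes that unsloth_compiled_cache sits in the current working directory and that the launcher (for example torchrun) exports `WORLD_SIZE`. Only the `UNSLOTH_COMPILE_DISABLE` variable comes from the notes above.

```python
import os
from pathlib import Path


def configure_unsloth_env(cache_dir="unsloth_compiled_cache", env=os.environ):
    """Hypothetical helper: disable Unsloth compilation on multi-GPU runs.

    Returns True if UNSLOTH_COMPILE_DISABLE was set, False otherwise.
    Assumes the launcher exports WORLD_SIZE (an assumption, not from the post).
    """
    cache = Path(cache_dir)
    # The cache counts as ready only if the first single-GPU run populated it.
    cache_ready = cache.is_dir() and any(cache.iterdir())
    world_size = int(env.get("WORLD_SIZE", "1"))

    if world_size <= 1:
        # Single-GPU run: leave compilation on so the cache gets built.
        return False

    if not cache_ready:
        raise RuntimeError(
            "Compiled cache not found; do one single-GPU run first."
        )

    # Multi-GPU run with an existing cache: turn compilation off
    # to avoid the hang described above.
    env["UNSLOTH_COMPILE_DISABLE"] = "1"
    return True


# Usage: call this before importing unsloth, so the variable is
# already in place when the library is loaded.
#
#   configure_unsloth_env()
#   import unsloth
```

Raising on a missing cache is a design choice made here to surface the "first run must be single-GPU" step early; a warning would work as well.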