# Add Qwen 3.5 recipes #2654
cuichenx wants to merge 16 commits into `main` from `chcui/qwen35_recipes`.
+3,303 −14
## Commits
- 238b9af fix test
- cfd9d90 add 3 new dense models and training recipes
- 047134b Merge branch 'main' into chcui/qwen35_recipes
- 935c40c recipe refactor
- ad628e2 Address CodeRabbit review feedback on SLURM scripts and tests
- 7bea48e Merge branch 'main' into chcui/qwen35_recipes
- c4448fc fix test
- 9213637 Merge branch 'chcui/qwen35_recipes' of github.com:NVIDIA-NeMo/Megatro…
- 3996cd5 update docs and readmes
- 6d54d97 Merge branch 'main' into chcui/qwen35_recipes
- 0c14274 doc
- eeb9de3 Merge branch 'chcui/qwen35_recipes' of github.com:NVIDIA-NeMo/Megatro…
- 8407255 fix doc link
- c617fee doc
- 76ac963 doc
- 63395f9 Merge branch 'main' into chcui/qwen35_recipes
The first changed file adds the new doc page to the VLM docs index:

````diff
@@ -11,4 +11,5 @@ ministral3.md
 nemotron-nano-v2-vl.md
 qwen2.5-vl.md
 qwen3-vl.md
+qwen35-vl.md
 ```
````
# Qwen 3.5

[Qwen3.5](https://huggingface.co/collections/Qwen/qwen35) is a family of vision-language models that supports multimodal understanding across text, images, and videos. Qwen3.5-VL includes both dense models and Mixture-of-Experts (MoE) variants for improved efficiency at scale.

Qwen 3.5 models feature a hybrid architecture that combines GDN (Gated DeltaNet) layers with standard attention layers, SwiGLU activations, and RMSNorm. MoE variants use top-k routing with shared experts for better quality.

Qwen 3.5 models are supported via Megatron Bridge with auto-detected configuration and weight mapping.

```{important}
Upgrade to `transformers` >= 5.2.0 before using the Qwen 3.5 models.
```
## Available Models

### Dense Models

- **Qwen3.5 0.8B** (`Qwen/Qwen3.5-0.8B`): 0.8B-parameter vision-language model
  - Recommended: 1 node, 8 GPUs
- **Qwen3.5 2B** (`Qwen/Qwen3.5-2B`): 2B-parameter vision-language model
  - Recommended: 1 node, 8 GPUs
- **Qwen3.5 4B** (`Qwen/Qwen3.5-4B`): 4B-parameter vision-language model
  - Recommended: 1 node, 8 GPUs
- **Qwen3.5 9B** (`Qwen/Qwen3.5-9B`): 9B-parameter vision-language model
  - Recommended: 1 node, 8 GPUs
- **Qwen3.5 27B** (`Qwen/Qwen3.5-27B`): 27B-parameter vision-language model
  - Recommended: 2 nodes, 16 GPUs

### Mixture-of-Experts (MoE) Models

- **Qwen3.5 35B-A3B** (`Qwen/Qwen3.5-35B-A3B`): 35B total parameters, 3B activated per token
  - Recommended: 2 nodes, 16 GPUs
- **Qwen3.5 122B-A10B** (`Qwen/Qwen3.5-122B-A10B`): 122B total parameters, 10B activated per token
  - Recommended: 4 nodes, 32 GPUs
- **Qwen3.5 397B-A17B** (`Qwen/Qwen3.5-397B-A17B`): 397B total parameters, 17B activated per token
  - 512 experts with top-10 routing and shared experts
  - Recommended: 16 nodes, 128 GPUs
## Examples

For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [Qwen 3.5 Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/qwen35_vl/README.md).

## Hugging Face Model Cards

- Qwen3.5 0.8B: https://huggingface.co/Qwen/Qwen3.5-0.8B
- Qwen3.5 2B: https://huggingface.co/Qwen/Qwen3.5-2B
- Qwen3.5 4B: https://huggingface.co/Qwen/Qwen3.5-4B
- Qwen3.5 9B: https://huggingface.co/Qwen/Qwen3.5-9B
- Qwen3.5 27B: https://huggingface.co/Qwen/Qwen3.5-27B
- Qwen3.5 35B-A3B (MoE): https://huggingface.co/Qwen/Qwen3.5-35B-A3B
- Qwen3.5 122B-A10B (MoE): https://huggingface.co/Qwen/Qwen3.5-122B-A10B
- Qwen3.5 397B-A17B (MoE): https://huggingface.co/Qwen/Qwen3.5-397B-A17B

## Related Docs

- Related VLM: [Qwen3-VL](qwen3-vl.md)
- Related LLM: [Qwen](../llm/qwen.md)
- Recipe usage: [Recipe usage](../../recipe-usage.md)
- Customizing the training recipe configuration: [Configuration overview](../../training/config-container-overview.md)
- Training entry points: [Entry points](../../training/entry-points.md)
# Qwen3.5-VL Examples

This directory contains example scripts for Qwen3.5-VL vision-language models.

For the model introduction and architecture details, see the [Qwen3.5-VL documentation](../../../../docs/models/vlm/qwen35-vl.md).

## Workspace Configuration

All scripts use a `WORKSPACE` environment variable to define the base directory for checkpoints and results. By default, it is set to `/workspace`. You can override it:

```bash
export WORKSPACE=/your/custom/path
```
Directory structure:

- `${WORKSPACE}/models/` - Converted checkpoints
- `${WORKSPACE}/results/` - Training outputs and experiment results
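The layout above can be created up front. A minimal sketch; the scratch path here is purely illustrative, since in practice `WORKSPACE` defaults to `/workspace` as described above:

```shell
# Use any existing WORKSPACE; fall back to a scratch path for illustration.
export WORKSPACE="${WORKSPACE:-/tmp/qwen35-workspace}"

# Create the documented layout for converted checkpoints and training results.
mkdir -p "${WORKSPACE}/models" "${WORKSPACE}/results"
```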
## Checkpoint Conversion

### Import HF → Megatron

To import the HF VL model to your desired Megatron path:

```bash
python examples/conversion/convert_checkpoints.py import \
    --hf-model Qwen/Qwen3.5-35B-A3B \
    --megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B
```

### Export Megatron → HF

```bash
python examples/conversion/convert_checkpoints.py export \
    --hf-model Qwen/Qwen3.5-35B-A3B \
    --megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B/iter_0000000 \
    --hf-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-hf-export
```

See the [conversion.sh](conversion.sh) script for more examples, including multi-GPU round-trip validation.
## Inference

### Run Inference on a Converted Checkpoint

```bash
python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
    --hf_model_path Qwen/Qwen3.5-35B-A3B \
    --megatron_model_path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B/iter_0000000 \
    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
    --prompt "Describe this image." \
    --max_new_tokens 100 \
    --tp 2 --pp 2 --ep 4
```

Notes:
- `--megatron_model_path` is optional. If it is not specified, the script converts the model on the fly and then runs the forward pass.
- `--image_path` accepts both local paths and image URLs, e.g. `--image_path="https://example.com/image.jpg"`.
- For MoE models, set `--ep` to the desired expert parallelism degree.

See the [inference.sh](inference.sh) script for commands to:
- Run inference with Hugging Face checkpoints
- Run inference with imported Megatron checkpoints
- Run inference with exported Hugging Face checkpoints

For multi-node distributed inference, which is required for the largest 397B model, see the [slurm_inference.sh](slurm_inference.sh) script.
## Finetune Recipes

Available recipes:
- `qwen35_vl_800m_sft_config` / `qwen35_vl_800m_peft_config`: 0.8B dense model
- `qwen35_vl_2b_sft_config` / `qwen35_vl_2b_peft_config`: 2B dense model
- `qwen35_vl_4b_sft_config` / `qwen35_vl_4b_peft_config`: 4B dense model
- `qwen35_vl_9b_sft_config` / `qwen35_vl_9b_peft_config`: 9B dense model
- `qwen35_vl_27b_sft_config` / `qwen35_vl_27b_peft_config`: 27B dense model
- `qwen35_vl_35b_a3b_sft_config` / `qwen35_vl_35b_a3b_peft_config`: 35B-A3B MoE model
- `qwen35_vl_122b_a10b_sft_config` / `qwen35_vl_122b_a10b_peft_config`: 122B-A10B MoE model
- `qwen35_vl_397b_a17b_sft_config` / `qwen35_vl_397b_a17b_peft_config`: 397B-A17B MoE model

Before training, ensure the following environment variables are set:
1. `SAVE_DIR`: checkpoint and log saving directory
2. `HF_TOKEN`: to download models from the HF Hub (if required)
3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
4. `WANDB_API_KEY`: (optional) to enable WandB logging
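Following the list above, a typical session might export these before launching a recipe. All of the values below are placeholders, not repository defaults:

```shell
# Required: where checkpoints and logs are saved (placeholder path).
export SAVE_DIR=/results/qwen35_vl_sft

# Required if the model or dataset needs authentication (placeholder token).
export HF_TOKEN=hf_your_token_here

# Optional: persistent cache to avoid re-downloading models and datasets.
export HF_HOME=/cache/huggingface

# Optional: enable Weights & Biases logging (placeholder key).
export WANDB_API_KEY=your_wandb_key_here
```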
### Pretrain

Pretraining is not verified for this model.

### Supervised Fine-Tuning (SFT)

See the [slurm_sft.sh](slurm_sft.sh) script for full-parameter fine-tuning with configurable model sizes.

### Parameter-Efficient Fine-Tuning (PEFT) with LoRA

See the [slurm_peft.sh](slurm_peft.sh) script for LoRA fine-tuning with configurable model sizes.
### Multi-Token Prediction (MTP)

All Qwen3.5 models are trained with Multi-Token Prediction (`mtp_num_hidden_layers=1` in the Hugging Face config). MTP adds an auxiliary loss that predicts the next-next token alongside the standard next-token prediction, improving training quality.

MTP is **enabled by default** in all recipes. The MTP layer uses standard attention (not GDN) and the same MLP architecture as the main decoder (dense MLP for dense models, MoE for MoE models). The MTP loss is scaled by `mtp_loss_scaling_factor=0.1` relative to the main LM loss.

**Finetune with MTP** (default):

```python
cfg.model.mtp_num_layers = 1
cfg.model.mtp_loss_scaling_factor = 0.1
```

**Finetune without MTP** (discard MTP weights, standard LM loss only):

```python
cfg.model.mtp_num_layers = None
```

When converting checkpoints, MTP weights are included by default. Setting `mtp_num_layers = None` skips MTP weight conversion and removes the MTP auxiliary loss during training.
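To make the scaling described above concrete, the combined objective behaves roughly as below. This is an illustrative sketch of the arithmetic only, not the actual Megatron Bridge loss code, and the function name is hypothetical:

```python
from typing import Optional


def combined_loss(lm_loss: float, mtp_loss: Optional[float],
                  mtp_loss_scaling_factor: float = 0.1) -> float:
    """Main next-token LM loss plus the scaled MTP auxiliary loss.

    mtp_loss=None mirrors mtp_num_layers=None: the auxiliary term is
    dropped and only the standard LM loss remains.
    """
    if mtp_loss is None:
        return lm_loss
    return lm_loss + mtp_loss_scaling_factor * mtp_loss
```

With the default scaling of 0.1, an MTP auxiliary loss of 1.0 contributes only 0.1 on top of the main LM loss.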
### Expected Training Dynamics

We provide a [Weights & Biases report](https://api.wandb.ai/links/nvidia-nemo-fw-public/rt6uzrvf) showing the expected loss curves and grad norms.

## Evaluation

Coming soon.