1 change: 1 addition & 0 deletions docs/models/vlm/index.md
@@ -11,4 +11,5 @@ ministral3.md
nemotron-nano-v2-vl.md
qwen2.5-vl.md
qwen3-vl.md
qwen35-vl.md
```
62 changes: 62 additions & 0 deletions docs/models/vlm/qwen35-vl.md
@@ -0,0 +1,62 @@
# Qwen 3.5

[Qwen3.5](https://huggingface.co/collections/Qwen/qwen35) is a family of vision-language models supporting multimodal understanding across text, images, and videos. Qwen3.5-VL includes both dense models and Mixture-of-Experts (MoE) variants for improved efficiency at scale.

Qwen 3.5 models feature a hybrid architecture combining GDN (Gated DeltaNet) layers with standard attention layers, SwiGLU activations, and RMSNorm. MoE variants use top-k routing with shared experts for better quality.
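The routing scheme described above can be sketched as follows. This is an illustrative NumPy sketch of top-k routing with an always-on shared expert, not the Megatron Bridge implementation: all names and shapes are hypothetical, and real MoE layers use gated FFN experts rather than single weight matrices.

```python
import numpy as np

def moe_layer(x, router_w, expert_w, shared_w, top_k=2):
    """Toy top-k MoE forward pass with one shared expert.

    x:        (tokens, d) activations
    router_w: (d, n_experts) router projection
    expert_w: (n_experts, d, d) one weight matrix per routed expert
    shared_w: (d, d) shared-expert weights, applied to every token
    """
    logits = x @ router_w                          # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(-probs, axis=-1)[:, :top_k]  # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()                # renormalize gates over the top-k
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ expert_w[e])  # only k experts run per token
    out += x @ shared_w                            # shared expert sees all tokens
    return out
```

In the largest variant the analogous configuration would be 512 routed experts with top-10 routing, so each token activates only a small fraction of the total parameters.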

Qwen 3.5 models are supported via Megatron Bridge with auto-detected configuration and weight mapping.

```{important}
Upgrade to `transformers` >= 5.2.0 to use the Qwen 3.5 models.
```

## Available Models

### Dense Models
- **Qwen3.5 0.8B** (`Qwen/Qwen3.5-0.8B`): 0.8B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 2B** (`Qwen/Qwen3.5-2B`): 2B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 4B** (`Qwen/Qwen3.5-4B`): 4B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 9B** (`Qwen/Qwen3.5-9B`): 9B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 27B** (`Qwen/Qwen3.5-27B`): 27B parameter vision-language model
- Recommended: 2 nodes, 16 GPUs

### Mixture-of-Experts (MoE) Models
- **Qwen3.5 35B-A3B** (`Qwen/Qwen3.5-35B-A3B`): 35B total parameters, 3B activated per token
- Recommended: 2 nodes, 16 GPUs

- **Qwen3.5 122B-A10B** (`Qwen/Qwen3.5-122B-A10B`): 122B total parameters, 10B activated per token
- Recommended: 4 nodes, 32 GPUs

- **Qwen3.5 397B-A17B** (`Qwen/Qwen3.5-397B-A17B`): 397B total parameters, 17B activated per token
- 512 experts with top-10 routing and shared experts
- Recommended: 16 nodes, 128 GPUs

## Examples

For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [Qwen 3.5 Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/qwen35_vl/README.md).

## Hugging Face Model Cards

- Qwen3.5 0.8B: https://huggingface.co/Qwen/Qwen3.5-0.8B
- Qwen3.5 2B: https://huggingface.co/Qwen/Qwen3.5-2B
- Qwen3.5 4B: https://huggingface.co/Qwen/Qwen3.5-4B
- Qwen3.5 9B: https://huggingface.co/Qwen/Qwen3.5-9B
- Qwen3.5 27B: https://huggingface.co/Qwen/Qwen3.5-27B
- Qwen3.5 35B-A3B (MoE): https://huggingface.co/Qwen/Qwen3.5-35B-A3B
- Qwen3.5 122B-A10B (MoE): https://huggingface.co/Qwen/Qwen3.5-122B-A10B
- Qwen3.5 397B-A17B (MoE): https://huggingface.co/Qwen/Qwen3.5-397B-A17B

## Related Docs
- Related VLM: [Qwen3-VL](qwen3-vl.md)
- Related LLM: [Qwen](../llm/qwen.md)
- Recipe usage: [Recipe usage](../../recipe-usage.md)
- Customizing the training recipe configuration: [Configuration overview](../../training/config-container-overview.md)
- Training entry points: [Entry points](../../training/entry-points.md)
119 changes: 119 additions & 0 deletions examples/models/vlm/qwen35_vl/README.md
@@ -0,0 +1,119 @@
# Qwen3.5-VL Examples

This directory contains example scripts for Qwen3.5-VL vision-language models.

For model introduction and architecture details, see the [Qwen3.5-VL documentation](../../../../docs/models/vlm/qwen35-vl.md).

## Workspace Configuration

All scripts use a `WORKSPACE` environment variable to define the base directory for checkpoints and results. By default, this is set to `/workspace`. You can override it:

```bash
export WORKSPACE=/your/custom/path
```

Directory structure:
- `${WORKSPACE}/models/` - Converted checkpoints
- `${WORKSPACE}/results/` - Training outputs and experiment results

## Checkpoint Conversion

### Import HF → Megatron
To import the HF VL model to your desired Megatron path:
```bash
python examples/conversion/convert_checkpoints.py import \
--hf-model Qwen/Qwen3.5-35B-A3B \
--megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B
```

### Export Megatron → HF
```bash
python examples/conversion/convert_checkpoints.py export \
--hf-model Qwen/Qwen3.5-35B-A3B \
--megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B/iter_0000000 \
--hf-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-hf-export
```

See the [conversion.sh](conversion.sh) script for more examples including multi-GPU round-trip validation.

## Inference

### Run Inference on Converted Checkpoint

```bash
python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path Qwen/Qwen3.5-35B-A3B \
--megatron_model_path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B/iter_0000000 \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 100 \
--tp 2 --pp 2 --ep 4
```

Note:
- `--megatron_model_path` is optional. If it is not specified, the script converts the model in memory and then runs the forward pass.
- `--image_path` accepts either a local file or a URL, e.g. `--image_path="https://example.com/image.jpg"`.
- For MoE models, set `--ep` to the desired expert parallelism degree.

See the [inference.sh](inference.sh) script for commands to:
- Run inference with Hugging Face checkpoints
- Run inference with imported Megatron checkpoints
- Run inference with exported Hugging Face checkpoints

For multi-node distributed inference—required for the largest 397B model—see the [slurm_inference.sh](slurm_inference.sh) script.

## Finetune Recipes

- Available recipes:
- `qwen35_vl_800m_sft_config` / `qwen35_vl_800m_peft_config`: 0.8B dense model
- `qwen35_vl_2b_sft_config` / `qwen35_vl_2b_peft_config`: 2B dense model
- `qwen35_vl_4b_sft_config` / `qwen35_vl_4b_peft_config`: 4B dense model
- `qwen35_vl_9b_sft_config` / `qwen35_vl_9b_peft_config`: 9B dense model
- `qwen35_vl_27b_sft_config` / `qwen35_vl_27b_peft_config`: 27B dense model
- `qwen35_vl_35b_a3b_sft_config` / `qwen35_vl_35b_a3b_peft_config`: 35B-A3B MoE model
- `qwen35_vl_122b_a10b_sft_config` / `qwen35_vl_122b_a10b_peft_config`: 122B-A10B MoE model
- `qwen35_vl_397b_a17b_sft_config` / `qwen35_vl_397b_a17b_peft_config`: 397B-A17B MoE model

Before training, ensure the following environment variables are set:
1. `SAVE_DIR`: checkpoint and log saving directory
2. `HF_TOKEN`: to download models from HF Hub (if required)
3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
4. `WANDB_API_KEY`: (optional) to enable WandB logging
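As a sanity check before launching, the variable requirements above can be verified programmatically. A minimal sketch, assuming the required/optional split in the list above (`check_env` is a hypothetical helper, not part of the repo):

```python
import os

REQUIRED = ["SAVE_DIR"]
OPTIONAL = ["HF_TOKEN", "HF_HOME", "WANDB_API_KEY"]

def check_env(env=os.environ):
    # Fail fast if a required variable is missing or empty.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Set these variables before training: {missing}")
    # Report which optional variables are unset (informational only).
    return [name for name in OPTIONAL if not env.get(name)]
```

`check_env()` raises if `SAVE_DIR` is unset and returns the list of unset optional variables so the launcher can log a note.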

### Pretrain

Pretraining is not verified for this model.

### Supervised Fine-Tuning (SFT)

See the [slurm_sft.sh](slurm_sft.sh) script for full parameter fine-tuning with configurable model sizes.

### Parameter-Efficient Fine-Tuning (PEFT) with LoRA

See the [slurm_peft.sh](slurm_peft.sh) script for LoRA fine-tuning with configurable model sizes.

### Multi-Token Prediction (MTP)

All Qwen3.5 models are trained with Multi-Token Prediction (`mtp_num_hidden_layers=1` in the HuggingFace config). MTP adds an auxiliary loss that predicts the next-next token alongside the standard next-token prediction, improving training quality.

MTP is **enabled by default** in all recipes. The MTP layer uses standard attention (not GDN) and the same MLP architecture as the main decoder (dense MLP for dense models, MoE for MoE models). The MTP loss is scaled by `mtp_loss_scaling_factor=0.1` relative to the main LM loss.

**Finetune with MTP** (default):
```python
cfg.model.mtp_num_layers = 1
cfg.model.mtp_loss_scaling_factor = 0.1
```

**Finetune without MTP** (discard MTP weights, standard LM loss only):
```python
cfg.model.mtp_num_layers = None
```

When converting checkpoints, MTP weights are included by default. Setting `mtp_num_layers = None` skips MTP weight conversion and removes the MTP auxiliary loss during training.
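Numerically, the combined objective can be sketched as below. This is an illustrative NumPy sketch of the loss combination only (helper names are hypothetical; the real computation lives inside Megatron): the main head at position t scores token t+1, the MTP head scores token t+2, and the auxiliary term is scaled by `mtp_loss_scaling_factor`.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Mean token-level cross-entropy from raw logits.
    shifted = logits - logits.max(-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def combined_loss(lm_logits, mtp_logits, tokens, mtp_loss_scaling_factor=0.1):
    # Main head at position t predicts token t+1 ...
    lm_loss = cross_entropy(lm_logits[:-1], tokens[1:])
    # ... while the MTP head at position t predicts token t+2.
    mtp_loss = cross_entropy(mtp_logits[:-2], tokens[2:])
    return lm_loss + mtp_loss_scaling_factor * mtp_loss
```

Setting `mtp_num_layers = None` corresponds to dropping the second term entirely, leaving only the standard LM loss.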

### Expected Training Dynamics
We provide a [Weights & Biases report](https://api.wandb.ai/links/nvidia-nemo-fw-public/rt6uzrvf) for the expected loss curves and grad norms.

## Evaluation

Coming soon.
22 changes: 17 additions & 5 deletions examples/models/vlm/qwen35_vl/conversion.sh
@@ -12,15 +12,27 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e

# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}
# Supported model variants are:
# Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B
MODEL_NAME=Qwen3.5-35B-A3B

if [ "${MODEL_NAME}" = "Qwen3.5-0.8B" ] || [ "${MODEL_NAME}" = "Qwen3.5-2B" ] || [ "${MODEL_NAME}" = "Qwen3.5-4B" ] || [ "${MODEL_NAME}" = "Qwen3.5-9B" ] || [ "${MODEL_NAME}" = "Qwen3.5-27B" ]; then
HF_MODEL_CLASS="Qwen3_5ForConditionalGeneration"
EP=1
PP=8
TP=1
elif [ "${MODEL_NAME}" = "Qwen3.5-35B-A3B" ] || [ "${MODEL_NAME}" = "Qwen3.5-122B-A10B" ] || [ "${MODEL_NAME}" = "Qwen3.5-397B-A17B" ]; then
HF_MODEL_CLASS="Qwen3_5MoeForConditionalGeneration"
EP=8
PP=1
TP=1
else
echo "Unsupported model variant: ${MODEL_NAME}"
exit 1
fi

# Make sure to upgrade to transformers >= 5.2.0
@@ -39,7 +51,7 @@ uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/co
--model_class "${HF_MODEL_CLASS}" \
--image_path "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" \
--prompt "Describe this image." \
--tp ${TP} --pp ${PP} --ep ${EP}

# Export Megatron → HF
uv run python examples/conversion/convert_checkpoints.py export \
@@ -49,4 +61,4 @@ uv run python examples/conversion/convert_checkpoints.py export \

# Round-trip validation
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
--hf-model-id Qwen/${MODEL_NAME} --tp ${TP} --pp ${PP} --ep ${EP}
28 changes: 24 additions & 4 deletions examples/models/vlm/qwen35_vl/inference.sh
@@ -13,17 +13,37 @@
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}
# Set the model name to any of the supported dense or MoE Qwen3.5-VL models:
# Dense: Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B
# MoE: Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B
# For Qwen3.5-397B-A17B, please use the slurm_inference.sh script for multinode inference.
MODEL_NAME=Qwen3.5-35B-A3B

# Set EP (Expert Parallelism) to 1 for dense models, 4 for MoE models
case "$MODEL_NAME" in
Qwen3.5-0.8B|Qwen3.5-2B|Qwen3.5-4B|Qwen3.5-9B|Qwen3.5-27B)
EP=1
;;
Qwen3.5-35B-A3B|Qwen3.5-122B-A10B|Qwen3.5-397B-A17B)
EP=4
;;
*)
echo "ERROR: Unknown model type for \$MODEL_NAME: $MODEL_NAME"
exit 1
;;
esac

# Inference with Hugging Face checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path Qwen/${MODEL_NAME} \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 2 --pp 2 --ep ${EP}

# Inference with imported Megatron checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
@@ -32,12 +52,12 @@ uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 2 --pp 2 --ep ${EP}

# Inference with exported HF checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path ${WORKSPACE}/${MODEL_NAME}-hf-export \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 2 --pp 2 --ep ${EP}