1 change: 1 addition & 0 deletions docs/models/vlm/index.md
@@ -11,4 +11,5 @@ ministral3.md
nemotron-nano-v2-vl.md
qwen2.5-vl.md
qwen3-vl.md
qwen35-vl.md
```
62 changes: 62 additions & 0 deletions docs/models/vlm/qwen35-vl.md
@@ -0,0 +1,62 @@
# Qwen 3.5

[Qwen3.5](https://huggingface.co/collections/Qwen/qwen35) is a family of vision-language models supporting multimodal understanding across text, images, and videos. Qwen3.5-VL includes both dense models and Mixture-of-Experts (MoE) variants for improved efficiency at scale.

Qwen 3.5 models feature a hybrid architecture combining GDN (Gated DeltaNet) layers with standard attention layers, SwiGLU activations, and RMSNorm. MoE variants use top-k routing with shared experts for better quality.
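The routing scheme described above can be sketched as follows. This is an illustrative NumPy sketch of top-k routing with an always-on shared expert, not the Megatron Bridge implementation: all names and shapes are hypothetical, and real MoE layers use gated FFN experts rather than single weight matrices.

```python
import numpy as np

def moe_layer(x, router_w, expert_w, shared_w, top_k=2):
    """Toy top-k MoE forward pass with one shared expert.

    x:        (tokens, d) activations
    router_w: (d, n_experts) router projection
    expert_w: (n_experts, d, d) one weight matrix per routed expert
    shared_w: (d, d) shared-expert weights, applied to every token
    """
    logits = x @ router_w                          # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(-probs, axis=-1)[:, :top_k]  # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()                # renormalize gates over the top-k
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ expert_w[e])  # only k experts run per token
    out += x @ shared_w                            # shared expert sees all tokens
    return out
```

In the largest variant the analogous configuration would be 512 routed experts with top-10 routing, so each token activates only a small fraction of the total parameters.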

Qwen 3.5 models are supported via Megatron Bridge with auto-detected configuration and weight mapping.

```{important}
Upgrade to `transformers` >= 5.2.0 to use the Qwen 3.5 models.
```

## Available Models

### Dense Models
- **Qwen3.5 0.8B** (`Qwen/Qwen3.5-0.8B`): 0.8B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 2B** (`Qwen/Qwen3.5-2B`): 2B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 4B** (`Qwen/Qwen3.5-4B`): 4B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 9B** (`Qwen/Qwen3.5-9B`): 9B parameter vision-language model
- Recommended: 1 node, 8 GPUs

- **Qwen3.5 27B** (`Qwen/Qwen3.5-27B`): 27B parameter vision-language model
- Recommended: 2 nodes, 16 GPUs

### Mixture-of-Experts (MoE) Models
- **Qwen3.5 35B-A3B** (`Qwen/Qwen3.5-35B-A3B`): 35B total parameters, 3B activated per token
- Recommended: 2 nodes, 16 GPUs

- **Qwen3.5 122B-A10B** (`Qwen/Qwen3.5-122B-A10B`): 122B total parameters, 10B activated per token
- Recommended: 4 nodes, 32 GPUs

- **Qwen3.5 397B-A17B** (`Qwen/Qwen3.5-397B-A17B`): 397B total parameters, 17B activated per token
- 512 experts with top-10 routing and shared experts
- Recommended: 16 nodes, 128 GPUs

## Examples

For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [Qwen 3.5 Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/qwen35_vl/README.md).

## Hugging Face Model Cards

- Qwen3.5 0.8B: https://huggingface.co/Qwen/Qwen3.5-0.8B
- Qwen3.5 2B: https://huggingface.co/Qwen/Qwen3.5-2B
- Qwen3.5 4B: https://huggingface.co/Qwen/Qwen3.5-4B
- Qwen3.5 9B: https://huggingface.co/Qwen/Qwen3.5-9B
- Qwen3.5 27B: https://huggingface.co/Qwen/Qwen3.5-27B
- Qwen3.5 35B-A3B (MoE): https://huggingface.co/Qwen/Qwen3.5-35B-A3B
- Qwen3.5 122B-A10B (MoE): https://huggingface.co/Qwen/Qwen3.5-122B-A10B
- Qwen3.5 397B-A17B (MoE): https://huggingface.co/Qwen/Qwen3.5-397B-A17B

## Related Docs
- Related VLM: [Qwen3-VL](qwen3-vl.md)
- Related LLM: [Qwen](../llm/qwen.md)
- Recipe usage: [Recipe usage](../../recipe-usage.md)
- Customizing the training recipe configuration: [Configuration overview](../../training/config-container-overview.md)
- Training entry points: [Entry points](../../training/entry-points.md)
119 changes: 119 additions & 0 deletions examples/models/vlm/qwen35_vl/README.md
@@ -0,0 +1,119 @@
# Qwen3.5-VL Examples

This directory contains example scripts for Qwen3.5-VL vision-language models.

For model introduction and architecture details, see the [Qwen3.5-VL documentation](../../../../docs/models/vlm/qwen35-vl.md).

## Workspace Configuration

All scripts use a `WORKSPACE` environment variable to define the base directory for checkpoints and results. By default, this is set to `/workspace`. You can override it:

```bash
export WORKSPACE=/your/custom/path
```

Directory structure:
- `${WORKSPACE}/models/` - Converted checkpoints
- `${WORKSPACE}/results/` - Training outputs and experiment results

## Checkpoint Conversion

### Import HF → Megatron
To import the HF VL model to your desired Megatron path:
```bash
python examples/conversion/convert_checkpoints.py import \
--hf-model Qwen/Qwen3.5-35B-A3B \
--megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B
```

### Export Megatron → HF
```bash
python examples/conversion/convert_checkpoints.py export \
--hf-model Qwen/Qwen3.5-35B-A3B \
--megatron-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B/iter_0000000 \
--hf-path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B-hf-export
```

See the [conversion.sh](conversion.sh) script for more examples including multi-GPU round-trip validation.

## Inference

### Run Inference on Converted Checkpoint

```bash
python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path Qwen/Qwen3.5-35B-A3B \
--megatron_model_path ${WORKSPACE}/models/Qwen/Qwen3.5-35B-A3B/iter_0000000 \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 100 \
--tp 2 --pp 2 --ep 4
```

Note:
- `--megatron_model_path` is optional. If it is not specified, the script converts the model in memory and then runs the forward pass.
- `--image_path` accepts either a local file or a URL, e.g. `--image_path="https://example.com/image.jpg"`.
- For MoE models, set `--ep` to the desired expert parallelism degree.

See the [inference.sh](inference.sh) script for commands to:
- Run inference with Hugging Face checkpoints
- Run inference with imported Megatron checkpoints
- Run inference with exported Hugging Face checkpoints

For multi-node distributed inference—required for the largest 397B model—see the [slurm_inference.sh](slurm_inference.sh) script.

## Finetune Recipes

- Available recipes:
- `qwen35_vl_800m_sft_config` / `qwen35_vl_800m_peft_config`: 0.8B dense model
- `qwen35_vl_2b_sft_config` / `qwen35_vl_2b_peft_config`: 2B dense model
- `qwen35_vl_4b_sft_config` / `qwen35_vl_4b_peft_config`: 4B dense model
- `qwen35_vl_9b_sft_config` / `qwen35_vl_9b_peft_config`: 9B dense model
- `qwen35_vl_27b_sft_config` / `qwen35_vl_27b_peft_config`: 27B dense model
- `qwen35_vl_35b_a3b_sft_config` / `qwen35_vl_35b_a3b_peft_config`: 35B-A3B MoE model
- `qwen35_vl_122b_a10b_sft_config` / `qwen35_vl_122b_a10b_peft_config`: 122B-A10B MoE model
- `qwen35_vl_397b_a17b_sft_config` / `qwen35_vl_397b_a17b_peft_config`: 397B-A17B MoE model

Before training, ensure the following environment variables are set:
1. `SAVE_DIR`: checkpoint and log saving directory
2. `HF_TOKEN`: to download models from HF Hub (if required)
3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
4. `WANDB_API_KEY`: (optional) to enable WandB logging
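As a sanity check before launching, the variable requirements above can be verified programmatically. A minimal sketch, assuming the required/optional split in the list above (`check_env` is a hypothetical helper, not part of the repo):

```python
import os

REQUIRED = ["SAVE_DIR"]
OPTIONAL = ["HF_TOKEN", "HF_HOME", "WANDB_API_KEY"]

def check_env(env=os.environ):
    # Fail fast if a required variable is missing or empty.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Set these variables before training: {missing}")
    # Report which optional variables are unset (informational only).
    return [name for name in OPTIONAL if not env.get(name)]
```

`check_env()` raises if `SAVE_DIR` is unset and returns the list of unset optional variables so the launcher can log a note.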

### Pretrain

Pretraining is not verified for this model.

### Supervised Fine-Tuning (SFT)

See the [slurm_sft.sh](slurm_sft.sh) script for full parameter fine-tuning with configurable model sizes.

### Parameter-Efficient Fine-Tuning (PEFT) with LoRA

See the [slurm_peft.sh](slurm_peft.sh) script for LoRA fine-tuning with configurable model sizes.

### Multi-Token Prediction (MTP)

All Qwen3.5 models are trained with Multi-Token Prediction (`mtp_num_hidden_layers=1` in the HuggingFace config). MTP adds an auxiliary loss that predicts the next-next token alongside the standard next-token prediction, improving training quality.

MTP is **enabled by default** in all recipes. The MTP layer uses standard attention (not GDN) and the same MLP architecture as the main decoder (dense MLP for dense models, MoE for MoE models). The MTP loss is scaled by `mtp_loss_scaling_factor=0.1` relative to the main LM loss.

**Finetune with MTP** (default):
```python
cfg.model.mtp_num_layers = 1
cfg.model.mtp_loss_scaling_factor = 0.1
```

**Finetune without MTP** (discard MTP weights, standard LM loss only):
```python
cfg.model.mtp_num_layers = None
```

When converting checkpoints, MTP weights are included by default. Setting `mtp_num_layers = None` skips MTP weight conversion and removes the MTP auxiliary loss during training.
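Numerically, the combined objective can be sketched as below. This is an illustrative NumPy sketch of the loss combination only (helper names are hypothetical; the real computation lives inside Megatron): the main head at position t scores token t+1, the MTP head scores token t+2, and the auxiliary term is scaled by `mtp_loss_scaling_factor`.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Mean token-level cross-entropy from raw logits.
    shifted = logits - logits.max(-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def combined_loss(lm_logits, mtp_logits, tokens, mtp_loss_scaling_factor=0.1):
    # Main head at position t predicts token t+1 ...
    lm_loss = cross_entropy(lm_logits[:-1], tokens[1:])
    # ... while the MTP head at position t predicts token t+2.
    mtp_loss = cross_entropy(mtp_logits[:-2], tokens[2:])
    return lm_loss + mtp_loss_scaling_factor * mtp_loss
```

Setting `mtp_num_layers = None` corresponds to dropping the second term entirely, leaving only the standard LM loss.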

### Expected Training Dynamics
We provide a [Weights & Biases report](https://api.wandb.ai/links/nvidia-nemo-fw-public/rt6uzrvf) for the expected loss curves and grad norms.

## Evaluation

Coming soon.
22 changes: 17 additions & 5 deletions examples/models/vlm/qwen35_vl/conversion.sh
@@ -12,15 +12,27 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e

# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}
# Supported model variants are:
# Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B
MODEL_NAME=Qwen3.5-35B-A3B

if [ "${MODEL_NAME}" = "Qwen3.5-0.8B" ] || [ "${MODEL_NAME}" = "Qwen3.5-2B" ] || [ "${MODEL_NAME}" = "Qwen3.5-4B" ] || [ "${MODEL_NAME}" = "Qwen3.5-9B" ] || [ "${MODEL_NAME}" = "Qwen3.5-27B" ]; then
HF_MODEL_CLASS="Qwen3_5ForConditionalGeneration"
EP=1
PP=8
TP=1
elif [ "${MODEL_NAME}" = "Qwen3.5-35B-A3B" ] || [ "${MODEL_NAME}" = "Qwen3.5-122B-A10B" ] || [ "${MODEL_NAME}" = "Qwen3.5-397B-A17B" ]; then
HF_MODEL_CLASS="Qwen3_5MoeForConditionalGeneration"
EP=8
PP=1
TP=1
else
echo "Unsupported model variant: ${MODEL_NAME}"
exit 1
fi

# Make sure to upgrade to transformers >= 5.2.0
@@ -39,7 +51,7 @@ uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/co
--model_class "${HF_MODEL_CLASS}" \
--image_path "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" \
--prompt "Describe this image." \
--tp ${TP} --pp ${PP} --ep ${EP}

# Export Megatron → HF
uv run python examples/conversion/convert_checkpoints.py export \
@@ -49,4 +61,4 @@ uv run python examples/conversion/convert_checkpoints.py export \

# Round-trip validation
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
--hf-model-id Qwen/${MODEL_NAME} --tp ${TP} --pp ${PP} --ep ${EP}
28 changes: 24 additions & 4 deletions examples/models/vlm/qwen35_vl/inference.sh
@@ -13,17 +13,37 @@
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}
# Set the model name to any of the supported dense or MoE Qwen3.5-VL models:
# Dense: Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B
# MoE: Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B
# For Qwen3.5-397B-A17B, please use the slurm_inference.sh script for multinode inference.
MODEL_NAME=Qwen3.5-35B-A3B

# Set EP (Expert Parallelism) to 1 for dense models, 4 for MoE models
case "$MODEL_NAME" in
Qwen3.5-0.8B|Qwen3.5-2B|Qwen3.5-4B|Qwen3.5-9B|Qwen3.5-27B)
EP=1
;;
Qwen3.5-35B-A3B|Qwen3.5-122B-A10B|Qwen3.5-397B-A17B)
EP=4
;;
*)
echo "ERROR: Unknown model type for \$MODEL_NAME: $MODEL_NAME"
exit 1
;;
esac

# Inference with Hugging Face checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path Qwen/${MODEL_NAME} \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 2 --pp 2 --ep ${EP}

# Inference with imported Megatron checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
@@ -32,12 +52,12 @@ uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 2 --pp 2 --ep ${EP}

# Inference with exported HF checkpoints
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path ${WORKSPACE}/${MODEL_NAME}-hf-export \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
--max_new_tokens 50 \
--tp 2 --pp 2 --ep ${EP}