
Add Qwen 3.5 recipes #2654

Open
cuichenx wants to merge 6 commits into main from chcui/qwen35_recipes

Conversation

Contributor

@cuichenx cuichenx commented Mar 5, 2026

What does this PR do?

Add Qwen3.5 VL (dense + MoE) bridge, provider, fine-tuning recipes, and example scripts for the full model family.

Changelog

  • Add Qwen35VLBridge (dense) and Qwen35VLMoEBridge for bidirectional HF ↔ Megatron checkpoint conversion with GDN + Gated Attention hybrid architecture support
  • Add Qwen35VLModelProvider / Qwen35VLMoEModelProvider with selective attention spec patching (_patch_standard_attention_specs) for mRoPE on standard attention layers only
  • Add fine-tuning recipes for 8 model sizes: 800M, 2B, 4B, 9B, 27B (dense) and 35B-A3B, 122B-A10B, 397B-A17B (MoE), with recommended TP/PP/EP defaults for SFT and LoRA
  • Add recipes/qwen_vl/__init__.py to register the new recipe module
  • Add example scripts: conversion.sh, inference.sh, slurm_sft.sh, slurm_peft.sh
  • Add unit tests (test_qwen35_vl_recipes.py) and functional tests (test_qwen35_vl_recipes_finetune.py)
  • Fix test_qwen35_vl_bridge.py flaky assertions

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Requires transformers >= 5.2.0. Import guards are in place for both dense and MoE transformers classes.
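For readers unfamiliar with the pattern, such a guard typically looks like the sketch below; the class name, flag, and helper are illustrative, not the actual symbols used in this PR:

```python
# Hypothetical import-guard sketch; Qwen35VLForConditionalGeneration,
# HAVE_QWEN35_VL, and require_qwen35_vl() are illustrative names only.
try:
    from transformers import Qwen35VLForConditionalGeneration  # needs transformers >= 5.2.0

    HAVE_QWEN35_VL = True
except ImportError:
    Qwen35VLForConditionalGeneration = None
    HAVE_QWEN35_VL = False


def require_qwen35_vl() -> None:
    """Fail with a clear message instead of a bare ImportError at use time."""
    if not HAVE_QWEN35_VL:
        raise ImportError("Qwen3.5 VL support requires transformers >= 5.2.0")
```

The point of such a guard is that importing the module stays safe on installs with an older transformers; only actually constructing a Qwen3.5 model raises.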

cuichenx added 2 commits March 2, 2026 10:06
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as draft March 5, 2026 00:05

copy-pr-bot bot commented Mar 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Signed-off-by: Chen Cui <chcui@nvidia.com>
Contributor

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

This PR adds comprehensive fine-tuning support for Qwen3.5 Vision-Language models. It introduces parameterized SLURM scripts for SFT and PEFT training, a new recipe module with eight factory functions for different model sizes, validation logic for parallelism settings, and extensive test coverage. Scripts are updated to dynamically handle multiple model variants with appropriate tensor/pipeline/expert parallelism configurations.

Changes

Cohort / File(s) Summary
Conversion & Inference Scripts
examples/models/vlm/qwen35_vl/conversion.sh, examples/models/vlm/qwen35_vl/inference.sh
Added strict error handling and parameterized parallelism flags (EP, PP, TP). Conversion script supports multiple dense and MoE variants. Inference script dynamically sets EP based on model type (dense vs. MoE).
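The dynamic EP selection described above can be sketched roughly as follows; the match patterns and EP values are placeholders, not the script's actual defaults:

```shell
#!/bin/bash
# Sketch only: pick expert parallelism from the model name.
# The EP values below are illustrative, not the PR's real per-model settings.
MODEL_NAME="${1:-Qwen3.5-35B-A3B}"
case "${MODEL_NAME}" in
  *-A3B|*-A10B|*-A17B) EP=8 ;;  # MoE variants enable expert parallelism
  *)                   EP=1 ;;  # dense variants: no expert parallelism
esac
echo "EP=${EP}"
```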
SLURM Training Scripts
examples/models/vlm/qwen35_vl/slurm_peft.sh, examples/models/vlm/qwen35_vl/slurm_sft.sh
New scripts for orchestrating PEFT (LoRA) and SFT training via SLURM. Both accept model size arguments, map to HF model names and recipes, configure container execution, build CLI overrides, and integrate with Weights & Biases logging.
Qwen3.5 VL Recipe Module
src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
New module with eight public factory functions for sizes 800M–397B-A17B (dense and MoE). Shared _qwen35_vl_common builds ConfigContainer with model, training, optimizer, scheduler, PEFT, dataset provider selection, and validation. Supports dataset type selection (mock, preloaded, HF) with error handling.
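The dataset-type selection with error handling mentioned here might be structured like this sketch; the function name and provider stand-ins are assumptions, not the module's actual API:

```python
# Hypothetical dispatch sketch; select_dataset_provider and the provider
# names are illustrative, not the recipe module's real symbols.
def select_dataset_provider(dataset_type: str) -> str:
    providers = {
        "mock": "MockVLDatasetProvider",
        "preloaded": "PreloadedVLDatasetProvider",
        "hf": "HFVLDatasetProvider",
    }
    try:
        return providers[dataset_type]
    except KeyError:
        # Surface an actionable error instead of a bare KeyError.
        raise ValueError(
            f"Unknown dataset_type {dataset_type!r}; expected one of {sorted(providers)}"
        ) from None
```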
Bridge & Provider Updates
src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py, src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
Bridge now uses top-level HF config for embedding sharing instead of text_config default. Provider adds validate_parallelism() method to both dense and MoE variants, enforcing tensor_model_parallel_size ≤ num_query_groups.
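The TP ≤ num_query_groups constraint is the usual grouped-query-attention requirement (every tensor-parallel rank must own at least one KV head); a hypothetical sketch of such a check, with field and method names assumed from the summary rather than copied from the PR:

```python
from dataclasses import dataclass


@dataclass
class ProviderSketch:
    """Illustrative stand-in for the Qwen3.5 VL provider config (not the real class)."""

    tensor_model_parallel_size: int = 1
    num_query_groups: int = 8  # KV heads under grouped-query attention

    def validate_parallelism(self) -> None:
        # Each tensor-parallel rank must hold at least one query group,
        # otherwise the KV heads cannot be sharded across ranks.
        if self.tensor_model_parallel_size > self.num_query_groups:
            raise ValueError(
                f"tensor_model_parallel_size={self.tensor_model_parallel_size} "
                f"exceeds num_query_groups={self.num_query_groups}"
            )
```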
Recipe Package Exports
src/megatron/bridge/recipes/qwen_vl/__init__.py
Added imports and all exports for eight new Qwen3.5-VL finetune config functions, grouped by density (dense and MoE).
Functional Tests
tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py
New test class with four GPU-only SFT scenarios: nothing frozen, language frozen, vision+language frozen, and recompute. Includes autouse fixture to reset global microbatch calculator between tests.
Unit Tests
tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py, tests/unit_tests/models/qwen_vl/test_qwen35_vl_bridge.py
Comprehensive recipe unit tests with fake AutoBridge and model config doubles, parameterized coverage of 8 model sizes, SFT/PEFT scenarios, dataset type selection, and override propagation. Bridge tests updated with Mock(spec=[]) and torch_dtype attribute adjustments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #2613: Directly related; identical mock adjustments in test_qwen35_vl_bridge.py (Mock(spec=[]) and torch_dtype additions).
  • PR #2614: Related refactoring of VLM finetune recipe surface and shared SFT/PEFT helpers.

Suggested labels

Run CICD

Suggested reviewers

  • yaoyu-33
  • yashaswikarnati
  • chtruong814
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): The PR introduces major changes (691 new lines, 8 model variants, SLURM scripts) with comprehensive tests (955 lines), but the PR description lacks test results, coverage metrics, or validation information. Resolution: add a PR description section documenting the executed tests, their results, coverage of the new recipe functions and model variants, performance validation, and test run outputs.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'Add Qwen 3.5 recipes' is concise and directly reflects the primary objective of the PR: introducing fine-tuning recipes for the Qwen3.5 VL model variants (dense and MoE) across multiple sizes.
  • Docstring Coverage (✅ Passed): Docstring coverage is 88.14%, above the required threshold of 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

🧹 Nitpick comments (1)
tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py (1)

138-199: Prefer subprocess isolation for these functional training runs.

These tests delegate to run_pretrain_vl_recipe_test (see tests/functional_tests/recipes/utils.py, Lines 199-298), which initializes distributed state and runs real pretraining in-process. Moving each scenario to a subprocess will reduce inter-test state coupling and flakes.

As per coding guidelines, "tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py` around
lines 138 - 199, The four functional tests (test_sft_nothing_frozen,
test_sft_language_model_frozen, test_sft_vision_and_language_frozen,
test_recompute) currently call run_pretrain_vl_recipe_test in-process and must
be converted to run each scenario in a separate subprocess to avoid shared
distributed state; replace the direct call to run_pretrain_vl_recipe_test with
logic that spawns a subprocess (e.g., subprocess.run or multiprocessing.Process
invoking a small runner module/function that imports run_pretrain_vl_recipe_test
and executes it with the same arguments), pass the same
config_func/recipe_name/parallelism_overrides/model_overrides/tmp_path inputs
(serialize as needed), wait for completion, and assert the subprocess exit code
is zero; ensure the new runner entrypoint and the test wrappers use unique
identifiers matching the existing test parameterization so behavior and
signatures remain unchanged.
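A minimal sketch of the subprocess isolation the prompt asks for; the runner-module idea and assertion style are assumptions, not code from this PR:

```python
import subprocess
import sys


def run_scenario_in_subprocess(runner_module: str, *args: str) -> None:
    """Run one training scenario as `python -m <runner_module> <args...>`
    so distributed state cannot leak between pytest cases."""
    result = subprocess.run(
        [sys.executable, "-m", runner_module, *args],
        capture_output=True,
        text=True,
    )
    # Assert only on the exit code; stderr is attached for debugging.
    assert result.returncode == 0, result.stderr
```

Each test would then serialize its config_func/recipe_name/overrides into CLI arguments for the runner and assert only on the exit code, as the prompt outlines.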
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/models/vlm/qwen35_vl/inference.sh`:
- Around line 26-33: The EP assignment in the MODEL_NAME case block currently
sets EP=4 for all MoE variants; update the case for Qwen3.5-35B-A3B,
Qwen3.5-122B-A10B, and Qwen3.5-397B-A17B to use per-model defaults (set EP to
the specific values used in the VL scripts instead of a single shared 4) by
modifying the MODEL_NAME case in inference.sh to assign the correct EP for each
model name (reference the MODEL_NAME case block and the EP variable).

In `@examples/models/vlm/qwen35_vl/slurm_peft.sh`:
- Around line 116-121: The script currently overwrites externally provided
CONTAINER_IMAGE and CONTAINER_MOUNTS with empty strings (symbols CONTAINER_IMAGE
and CONTAINER_MOUNTS), which breaks the later guard that checks for a container
image; change the assignment so you only set defaults when the variables are
unset or empty (e.g., use a conditional or shell parameter expansion to leave an
externally supplied value intact), and apply the same change to the mounts
handling and any subsequent logic that validates CONTAINER_IMAGE to ensure it
respects environment-supplied values.
- Around line 49-50: SBATCH output/error paths reference a logs/ directory that
may not exist when Slurm opens the files; either stop using a non-existent
subdirectory or ensure it is created before submission. Fix by changing the
SBATCH directives in the header (the lines with “#SBATCH
--output=logs/qwen35vl_lora_%j.out” and “#SBATCH
--error=logs/qwen35vl_lora_%j.err”) to write to the current directory (e.g.,
remove the "logs/" prefix or use "%x_%j.out" and "%x_%j.err") or ensure the logs
directory is created by the submitter before calling sbatch (add a pre-submit
mkdir -p logs in the wrapper that writes/submits this script rather than inside
the job script itself).

In `@examples/models/vlm/qwen35_vl/slurm_sft.sh`:
- Around line 1-15: Add strict shell options to the script to fail fast on
errors and unset variables: immediately after the shebang (#!/bin/bash) in
slurm_sft.sh enable "set -euo pipefail" and set a safe IFS (e.g., IFS=$'\n\t')
so command failures and unset variables cannot silently continue during SLURM
training runs.
- Around line 27-35: Commented parallelism recommendations in slurm_sft.sh are
inconsistent with the canonical recipe values in
src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py; update or remove the
mismatched lines (e.g., entries for 122B and 397B) so they reflect the recipe's
TP/PP/EP defaults. Open examples/models/vlm/qwen35_vl/slurm_sft.sh and either
correct the TP/PP/EP numbers to match the values defined in qwen35_vl.py or
replace the block with a short note directing users to the recipe file for
authoritative parallelism settings, ensuring the commented sizes (e.g., 27B,
35B, 122B, 397B) match the recipe's PP/EP/TP/EP entries. Make sure the text
explicitly references the recipe as the source of truth to avoid future drift.

In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 354-357: The documented default pipeline-parallel size for 122B
full-SFT (text saying "PP=4") is inconsistent with the implementation which sets
pipeline_model_parallel_size to 6; update one to match the other: either change
the documentation block in qwen35_vl.py to "PP=6" to reflect the code, or change
the implementation variable pipeline_model_parallel_size (and any related
FULL_SFT config assignments) to 4 so the code matches the docs; ensure both the
default config comment block and the pipeline_model_parallel_size assignment
(and any other FULL_SFT-specific parallelism constants) are updated together.
- Around line 593-595: The TokenizerConfig is being wired to hf_path while
preprocessing uses _processor_model (set from tokenizer_model or hf_path),
causing config drift; update the TokenizerConfig instantiation (and any other
places that currently pass hf_path, e.g., around the TokenizerConfig creation at
the block using hf_path) to use _processor_model (or tokenizer_model if present)
instead so the tokenizer config and dataset preprocessing use the same value
(ensure both the dataset preprocessing branch that relies on _processor_model
and the TokenizerConfig/tokenizer creation use the same symbol:
_processor_model).

In `@tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py`:
- Around line 110-199: Add an explicit test category marker (e.g.,
pytest.mark.integration or pytest.mark.system) to this module/class so tests are
properly categorized; for example, set a module-level pytestmark or decorate the
TestQwen35VLFinetuneRecipes class with pytest.mark.integration. Update the
file-level or class-level marker near TestQwen35VLFinetuneRecipes so the GPU
gating remains (run_only_on stays on individual tests) but the module/class also
has the integration/system pytest.mark.
- Around line 113-127: The fixture _reset_microbatch_calculator only destroys
the leaked calculator after each test; change it to also clear any existing
global state before tests start by checking _GLOBAL_NUM_MICROBATCHES_CALCULATOR
and calling destroy_num_microbatches_calculator() prior to yield (in addition to
the existing post-yield cleanup). Update the fixture body that references
_GLOBAL_NUM_MICROBATCHES_CALCULATOR and destroy_num_microbatches_calculator so
both pre- and post-yield paths ensure the global is None.

In `@tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py`:
- Around line 98-102: Rename unused parameters to start with an underscore to
silence ruff ARG002/ARG004: in from_hf_pretrained change hf_path to _hf_path and
**kwargs to **_kwargs; in to_megatron_provider change self to _self (or keep
self and prefix unused params) and load_weights to _load_weights so the intent
is explicit and linters stop flagging _FakeAutoBridge and _FakeModelCfg factory
helpers.
- Around line 132-756: Add a pytest category marker so these are discoverable as
unit tests: declare a module-level marker variable `pytestmark =
pytest.mark.unit` near the top of the file (ensure `pytest` is imported), rather
than decorating every function; this will mark all functions such as
`test_each_qwen35_vl_recipe_builds_config`,
`test_qwen35_vl_dataset_type_selection`, `test_sft_nothing_frozen`, etc., as
unit tests for CI selection.


ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 09eb4a86-1e7b-4867-8806-73d971b7578d

📥 Commits

Reviewing files that changed from the base of the PR and between 394037d and cfd9d90.

📒 Files selected for processing (11)
  • examples/models/vlm/qwen35_vl/conversion.sh
  • examples/models/vlm/qwen35_vl/inference.sh
  • examples/models/vlm/qwen35_vl/slurm_peft.sh
  • examples/models/vlm/qwen35_vl/slurm_sft.sh
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
  • src/megatron/bridge/recipes/qwen_vl/__init__.py
  • src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
  • tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py
  • tests/unit_tests/models/qwen_vl/test_qwen35_vl_bridge.py
  • tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py

Comment on lines +49 to +50
#SBATCH --output=logs/qwen35vl_lora_%j.out
#SBATCH --error=logs/qwen35vl_lora_%j.err

⚠️ Potential issue | 🟠 Major

logs/ may not exist when Slurm opens stdout/stderr files.

Slurm opens the output files named at Lines 49-50 before the script body runs; the mkdir -p logs at Line 152 comes too late for that first open. This can cause job startup failure on clean workdirs.

💡 Suggested fix
-#SBATCH --output=logs/qwen35vl_lora_%j.out
-#SBATCH --error=logs/qwen35vl_lora_%j.err
+#SBATCH --output=qwen35vl_lora_%j.out
+#SBATCH --error=qwen35vl_lora_%j.err
@@
-mkdir -p logs

Also applies to: 152-152

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/vlm/qwen35_vl/slurm_peft.sh` around lines 49 - 50, SBATCH
output/error paths reference a logs/ directory that may not exist when Slurm
opens the files; either stop using a non-existent subdirectory or ensure it is
created before submission. Fix by changing the SBATCH directives in the header
(the lines with “#SBATCH --output=logs/qwen35vl_lora_%j.out” and “#SBATCH
--error=logs/qwen35vl_lora_%j.err”) to write to the current directory (e.g.,
remove the "logs/" prefix or use "%x_%j.out" and "%x_%j.err") or ensure the logs
directory is created by the submitter before calling sbatch (add a pre-submit
mkdir -p logs in the wrapper that writes/submits this script rather than inside
the job script itself).

Comment on lines +1 to +15
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Enable strict shell mode for safer SLURM job execution.

Without set -euo pipefail, command failures and unset variables can silently continue in expensive training jobs.

🔧 Suggested fix
 #!/bin/bash
+set -euo pipefail
 # Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: **/*.sh: Follow Google Shell Style Guide.
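As a self-contained illustration (not taken from the PR scripts) of what pipefail changes:

```shell
#!/bin/bash
# Without pipefail, a failure on the left side of a pipe is masked by a
# succeeding right side; with pipefail, the pipeline reports the failure.
default_status=$(bash -c 'false | cat' >/dev/null 2>&1; echo $?)
pipefail_status=$(bash -c 'set -o pipefail; false | cat' >/dev/null 2>&1; echo $?)
echo "default=${default_status} pipefail=${pipefail_status}"
```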


Comment on lines +27 to +35
# Recommended parallelism (recipe defaults for full SFT):
# 0.8B (dense): TP=1, PP=1 (1 node)
# 2B (dense): TP=1, PP=1 (1 node)
# 4B (dense): TP=2, PP=1 (1 node)
# 9B (dense): TP=4, PP=1 (1 node)
# 27B (dense): TP=4, PP=4 (2 nodes)
# 35B-A3B (MoE): TP=2, PP=1, EP=16 (2 nodes)
# 122B-A10B (MoE): TP=2, PP=1, EP=32 (4 nodes)
# 397B-A17B (MoE): TP=2, PP=4, EP=32 (16 nodes)

⚠️ Potential issue | 🟡 Minor

Parallelism recommendations in comments are out of sync with recipe defaults.

Lines 33-35 advertise defaults that do not match src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py (e.g., 122B uses different PP/EP in the recipe). Since this script does not override TP/PP/EP directly, these comments can mislead node sizing.


Comment on lines +110 to +199
class TestQwen35VLFinetuneRecipes:
    """Functional tests covering SFT freeze combos and recompute."""

    @pytest.fixture(autouse=True)
    def _reset_microbatch_calculator(self):
        """Ensure the global microbatch calculator is cleared between tests.

        If a previous test fails mid-pretrain, destroy_global_state() never
        runs and the calculator leaks into the next test.
        """
        yield
        from megatron.core.num_microbatches_calculator import (
            _GLOBAL_NUM_MICROBATCHES_CALCULATOR,
            destroy_num_microbatches_calculator,
        )

        if _GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
            destroy_num_microbatches_calculator()

    # -----------------------------------------------------------------------
    # SFT scenarios
    # -----------------------------------------------------------------------

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_NONE_FROZEN,
    )
    def test_sft_nothing_frozen(self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path):
        """Scenario 1: all modules trainable."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_LM_FROZEN,
    )
    def test_sft_language_model_frozen(
        self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path
    ):
        """Scenario 2: language model frozen, train vision + projection."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_PROJ_ONLY,
    )
    def test_sft_vision_and_language_frozen(
        self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path
    ):
        """Scenario 3: vision + language frozen, train projection only."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )

    # -----------------------------------------------------------------------
    # Recompute
    # -----------------------------------------------------------------------

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_RECOMPUTE,
    )
    def test_recompute(self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path):
        """SFT with activation recomputation."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )

⚠️ Potential issue | 🟡 Minor

Add an explicit integration/system category marker for this module.

GPU gating is present, but unit/integration/system categorization is missing.

💡 Suggested fix
 import pytest

+pytestmark = pytest.mark.integration
+
 from megatron.bridge.recipes.qwen_vl.qwen35_vl import qwen35_vl_27b_finetune_config
 from tests.functional_tests.recipes.utils import run_pretrain_vl_recipe_test

As per coding guidelines, "tests/**/*.py: Use pytest.mark to categorize tests (unit, integration, system)".


Comment on lines +113 to +127
    @pytest.fixture(autouse=True)
    def _reset_microbatch_calculator(self):
        """Ensure the global microbatch calculator is cleared between tests.

        If a previous test fails mid-pretrain, destroy_global_state() never
        runs and the calculator leaks into the next test.
        """
        yield
        from megatron.core.num_microbatches_calculator import (
            _GLOBAL_NUM_MICROBATCHES_CALCULATOR,
            destroy_num_microbatches_calculator,
        )

        if _GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
            destroy_num_microbatches_calculator()

⚠️ Potential issue | 🟠 Major

Reset leaked microbatch global both before and after each test.

Current fixture only cleans up after yield. If a prior module leaked global state, the first test here still starts dirty.

💡 Suggested fix

    @pytest.fixture(autouse=True)
    def _reset_microbatch_calculator(self):
        """Ensure the global microbatch calculator is cleared between tests.

        If a previous test fails mid-pretrain, destroy_global_state() never
        runs and the calculator leaks into the next test.
        """
        # Access the global through the module so the post-yield check
        # re-reads it after the test body runs; a from-import taken before
        # yield would capture a stale binding.
        from megatron.core import num_microbatches_calculator as mbc

        if mbc._GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
            mbc.destroy_num_microbatches_calculator()
        yield
        if mbc._GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
            mbc.destroy_num_microbatches_calculator()

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review March 5, 2026 06:44
cuichenx added 2 commits March 4, 2026 22:51
- Add `set -euo pipefail` to slurm_sft.sh and slurm_peft.sh for fail-fast behavior
- Remove `logs/` prefix from SBATCH output/error paths (directory doesn't exist at job start)
- Remove now-unnecessary `mkdir -p logs` calls
- Fix 122B-A10B parallelism comment in slurm_sft.sh to match recipe (TP=2, PP=6, EP=8)
- Add `pytestmark = pytest.mark.integration` to functional test module
- Reset microbatch calculator both before and after each test in fixture

Signed-off-by: Chen Cui <chcui@nvidia.com>
Made-with: Cursor