Signed-off-by: Chen Cui <chcui@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Signed-off-by: Chen Cui <chcui@nvidia.com>
📝 Walkthrough

This PR adds comprehensive fine-tuning support for Qwen3.5 Vision-Language models. It introduces parameterized SLURM scripts for SFT and PEFT training, a new recipe module with eight factory functions for different model sizes, validation logic for parallelism settings, and extensive test coverage. Scripts are updated to dynamically handle multiple model variants with appropriate tensor/pipeline/expert parallelism configurations.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 11
🧹 Nitpick comments (1)
tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py (1)
138-199: Prefer subprocess isolation for these functional training runs. These tests delegate to `run_pretrain_vl_recipe_test` (see `tests/functional_tests/recipes/utils.py`, lines 199-298), which initializes distributed state and runs real pretraining in-process. Moving each scenario to a subprocess will reduce inter-test state coupling and flakes. As per coding guidelines, "tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py` around lines 138 - 199, The four functional tests (test_sft_nothing_frozen, test_sft_language_model_frozen, test_sft_vision_and_language_frozen, test_recompute) currently call run_pretrain_vl_recipe_test in-process and must be converted to run each scenario in a separate subprocess to avoid shared distributed state; replace the direct call to run_pretrain_vl_recipe_test with logic that spawns a subprocess (e.g., subprocess.run or multiprocessing.Process invoking a small runner module/function that imports run_pretrain_vl_recipe_test and executes it with the same arguments), pass the same config_func/recipe_name/parallelism_overrides/model_overrides/tmp_path inputs (serialize as needed), wait for completion, and assert the subprocess exit code is zero; ensure the new runner entrypoint and the test wrappers use unique identifiers matching the existing test parameterization so behavior and signatures remain unchanged.
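One hedged way to get the isolation the prompt asks for, without writing a dedicated Python runner module, is to drive each scenario as its own pytest process from the shell. The node IDs below are illustrative, and shell-level per-test processes are a substitute for (not the same as) the subprocess runner the prompt describes:

```shell
# Build one pytest invocation per scenario; each runs in a fresh interpreter,
# so Megatron's global distributed state cannot leak between scenarios.
TEST_FILE="tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py"
SCENARIOS=(test_sft_nothing_frozen test_sft_language_model_frozen)

cmds=()
for s in "${SCENARIOS[@]}"; do
    # -x stops the whole run on the first failing scenario.
    cmds+=("python -m pytest ${TEST_FILE}::TestQwen35VLFinetuneRecipes::${s} -x")
done
printf '%s\n' "${cmds[@]}"
```

Each command would then be executed and its exit code asserted, which mirrors the "assert the subprocess exit code is zero" step of the prompt.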
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/models/vlm/qwen35_vl/inference.sh`:
- Around line 26-33: The EP assignment in the MODEL_NAME case block currently
sets EP=4 for all MoE variants; update the case for Qwen3.5-35B-A3B,
Qwen3.5-122B-A10B, and Qwen3.5-397B-A17B to use per-model defaults (set EP to
the specific values used in the VL scripts instead of a single shared 4) by
modifying the MODEL_NAME case in inference.sh to assign the correct EP for each
model name (reference the MODEL_NAME case block and the EP variable).
In `@examples/models/vlm/qwen35_vl/slurm_peft.sh`:
- Around line 116-121: The script currently overwrites externally provided
CONTAINER_IMAGE and CONTAINER_MOUNTS with empty strings (symbols CONTAINER_IMAGE
and CONTAINER_MOUNTS), which breaks the later guard that checks for a container
image; change the assignment so you only set defaults when the variables are
unset or empty (e.g., use a conditional or shell parameter expansion to leave an
externally supplied value intact), and apply the same change to the mounts
handling and any subsequent logic that validates CONTAINER_IMAGE to ensure it
respects environment-supplied values.
- Around line 49-50: SBATCH output/error paths reference a logs/ directory that
may not exist when Slurm opens the files; either stop using a non-existent
subdirectory or ensure it is created before submission. Fix by changing the
SBATCH directives in the header (the lines with “#SBATCH
--output=logs/qwen35vl_lora_%j.out” and “#SBATCH
--error=logs/qwen35vl_lora_%j.err”) to write to the current directory (e.g.,
remove the "logs/" prefix or use "%x_%j.out" and "%x_%j.err") or ensure the logs
directory is created by the submitter before calling sbatch (add a pre-submit
mkdir -p logs in the wrapper that writes/submits this script rather than inside
the job script itself).
In `@examples/models/vlm/qwen35_vl/slurm_sft.sh`:
- Around line 1-15: Add strict shell options to the script to fail fast on
errors and unset variables: immediately after the shebang (#!/bin/bash) in
slurm_sft.sh enable "set -euo pipefail" and set a safe IFS (e.g., IFS=$'\n\t')
so command failures and unset variables cannot silently continue during SLURM
training runs.
- Around line 27-35: Commented parallelism recommendations in slurm_sft.sh are
inconsistent with the canonical recipe values in
src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py; update or remove the
mismatched lines (e.g., entries for 122B and 397B) so they reflect the recipe's
TP/PP/EP defaults. Open examples/models/vlm/qwen35_vl/slurm_sft.sh and either
correct the TP/PP/EP numbers to match the values defined in qwen35_vl.py or
replace the block with a short note directing users to the recipe file for
authoritative parallelism settings, ensuring the commented sizes (e.g., 27B,
35B, 122B, 397B) match the recipe's PP/EP/TP/EP entries. Make sure the text
explicitly references the recipe as the source of truth to avoid future drift.
In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 354-357: The documented default pipeline-parallel size for 122B
full-SFT (text saying "PP=4") is inconsistent with the implementation which sets
pipeline_model_parallel_size to 6; update one to match the other: either change
the documentation block in qwen35_vl.py to "PP=6" to reflect the code, or change
the implementation variable pipeline_model_parallel_size (and any related
FULL_SFT config assignments) to 4 so the code matches the docs; ensure both the
default config comment block and the pipeline_model_parallel_size assignment
(and any other FULL_SFT-specific parallelism constants) are updated together.
- Around line 593-595: The TokenizerConfig is being wired to hf_path while
preprocessing uses _processor_model (set from tokenizer_model or hf_path),
causing config drift; update the TokenizerConfig instantiation (and any other
places that currently pass hf_path, e.g., around the TokenizerConfig creation at
the block using hf_path) to use _processor_model (or tokenizer_model if present)
instead so the tokenizer config and dataset preprocessing use the same value
(ensure both the dataset preprocessing branch that relies on _processor_model
and the TokenizerConfig/tokenizer creation use the same symbol:
_processor_model).
In `@tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py`:
- Around line 110-199: Add an explicit test category marker (e.g.,
pytest.mark.integration or pytest.mark.system) to this module/class so tests are
properly categorized; for example, set a module-level pytestmark or decorate the
TestQwen35VLFinetuneRecipes class with pytest.mark.integration. Update the
file-level or class-level marker near TestQwen35VLFinetuneRecipes so the GPU
gating remains (run_only_on stays on individual tests) but the module/class also
has the integration/system pytest.mark.
- Around line 113-127: The fixture _reset_microbatch_calculator only destroys
the leaked calculator after each test; change it to also clear any existing
global state before tests start by checking _GLOBAL_NUM_MICROBATCHES_CALCULATOR
and calling destroy_num_microbatches_calculator() prior to yield (in addition to
the existing post-yield cleanup). Update the fixture body that references
_GLOBAL_NUM_MICROBATCHES_CALCULATOR and destroy_num_microbatches_calculator so
both pre- and post-yield paths ensure the global is None.
In `@tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py`:
- Around line 98-102: Rename unused parameters to start with an underscore to
silence ruff ARG002/ARG004: in from_hf_pretrained change hf_path to _hf_path and
**kwargs to **_kwargs; in to_megatron_provider change self to _self (or keep
self and prefix unused params) and load_weights to _load_weights so the intent
is explicit and linters stop flagging _FakeAutoBridge and _FakeModelCfg factory
helpers.
- Around line 132-756: Add a pytest category marker so these are discoverable as
unit tests: declare a module-level marker variable `pytestmark =
pytest.mark.unit` near the top of the file (ensure `pytest` is imported), rather
than decorating every function; this will mark all functions such as
`test_each_qwen35_vl_recipe_builds_config`,
`test_qwen35_vl_dataset_type_selection`, `test_sft_nothing_frozen`, etc., as
unit tests for CI selection.
---
Nitpick comments:
In `@tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py`:
- Around line 138-199: The four functional tests (test_sft_nothing_frozen,
test_sft_language_model_frozen, test_sft_vision_and_language_frozen,
test_recompute) currently call run_pretrain_vl_recipe_test in-process and must
be converted to run each scenario in a separate subprocess to avoid shared
distributed state; replace the direct call to run_pretrain_vl_recipe_test with
logic that spawns a subprocess (e.g., subprocess.run or multiprocessing.Process
invoking a small runner module/function that imports run_pretrain_vl_recipe_test
and executes it with the same arguments), pass the same
config_func/recipe_name/parallelism_overrides/model_overrides/tmp_path inputs
(serialize as needed), wait for completion, and assert the subprocess exit code
is zero; ensure the new runner entrypoint and the test wrappers use unique
identifiers matching the existing test parameterization so behavior and
signatures remain unchanged.
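The slurm_peft.sh container-image item above comes down to shell parameter expansion. A minimal sketch, using placeholder values rather than the script's real defaults:

```shell
# Simulated externally supplied value (stand-in for an `export CONTAINER_IMAGE=...`
# done before sbatch); the image tag is a placeholder, not a real registry path.
CONTAINER_IMAGE="nvcr.io/example/image:25.01"

# Broken pattern the review flags: an unconditional `CONTAINER_IMAGE=""` here
# would discard the caller's value and trip the later "image required" guard.

# Fixed pattern: ":-" expands to the default only when the variable is unset
# or empty, so the exported value above survives untouched.
CONTAINER_IMAGE="${CONTAINER_IMAGE:-}"
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-/placeholder/host/data:/data}"

echo "image=${CONTAINER_IMAGE}"
echo "mounts=${CONTAINER_MOUNTS}"
```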
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 09eb4a86-1e7b-4867-8806-73d971b7578d
📒 Files selected for processing (11)
- examples/models/vlm/qwen35_vl/conversion.sh
- examples/models/vlm/qwen35_vl/inference.sh
- examples/models/vlm/qwen35_vl/slurm_peft.sh
- examples/models/vlm/qwen35_vl/slurm_sft.sh
- src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
- src/megatron/bridge/recipes/qwen_vl/__init__.py
- src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
- tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py
- tests/unit_tests/models/qwen_vl/test_qwen35_vl_bridge.py
- tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py
```bash
#SBATCH --output=logs/qwen35vl_lora_%j.out
#SBATCH --error=logs/qwen35vl_lora_%j.err
```
logs/ may not exist when Slurm opens stdout/stderr files.
At Lines 49-50, Slurm writes to logs/... before script body runs; mkdir -p logs at Line 152 is too late for first open. This can cause job startup failure on clean workdirs.
💡 Suggested fix

```diff
-#SBATCH --output=logs/qwen35vl_lora_%j.out
-#SBATCH --error=logs/qwen35vl_lora_%j.err
+#SBATCH --output=qwen35vl_lora_%j.out
+#SBATCH --error=qwen35vl_lora_%j.err
@@
-mkdir -p logs
```

Also applies to: 152-152
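If keeping the logs/ directory is preferred instead, the pre-submit alternative can be sketched as a small wrapper; the wrapper itself and the commented-out sbatch call are hypothetical:

```shell
#!/bin/bash
# Hypothetical submit wrapper: Slurm opens the --output/--error files before
# the job script body runs, so the directory must exist at submission time,
# not be created inside the job script.
set -euo pipefail

mkdir -p logs                  # create the target dir before sbatch opens logs/...%j.out
# sbatch slurm_peft.sh         # actual submission, left commented out here
echo "log directory prepared"
```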
```bash
#!/bin/bash
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
Enable strict shell mode for safer SLURM job execution.
Without set -euo pipefail, command failures and unset variables can silently continue in expensive training jobs.
🔧 Suggested fix

```diff
 #!/bin/bash
+set -euo pipefail
 # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines: `**/*.sh`: Follow Google Shell Style Guide.
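As a hedged illustration of what the strict preamble buys (the variable name and the pipeline below are examples, not lines from the script):

```shell
#!/bin/bash
# -e: exit on any failing command; -u: error on unset variables;
# -o pipefail: a pipeline fails if any stage fails, not just the last one.
set -euo pipefail
IFS=$'\n\t'   # safe word splitting: newline and tab only

# Under -u a bare $NUM_NODES would abort if unset, so optional knobs get an
# explicit default (NUM_NODES is an example name, not from the script):
NUM_NODES="${NUM_NODES:-1}"

# Without pipefail this pipeline would "succeed" because tail exits 0;
# with pipefail the failing first stage is surfaced:
if ! false | tail -n 1; then
    echo "pipeline failure detected"
fi
echo "NUM_NODES=${NUM_NODES}"
```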
```bash
# Recommended parallelism (recipe defaults for full SFT):
# 0.8B (dense): TP=1, PP=1 (1 node)
# 2B (dense): TP=1, PP=1 (1 node)
# 4B (dense): TP=2, PP=1 (1 node)
# 9B (dense): TP=4, PP=1 (1 node)
# 27B (dense): TP=4, PP=4 (2 nodes)
# 35B-A3B (MoE): TP=2, PP=1, EP=16 (2 nodes)
# 122B-A10B (MoE): TP=2, PP=1, EP=32 (4 nodes)
# 397B-A17B (MoE): TP=2, PP=4, EP=32 (16 nodes)
```
Parallelism recommendations in comments are out of sync with recipe defaults.
Lines 33-35 advertise defaults that do not match src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py (e.g., 122B uses different PP/EP in the recipe). Since this script does not override TP/PP/EP directly, these comments can mislead node sizing.
```python
class TestQwen35VLFinetuneRecipes:
    """Functional tests covering SFT freeze combos and recompute."""

    @pytest.fixture(autouse=True)
    def _reset_microbatch_calculator(self):
        """Ensure the global microbatch calculator is cleared between tests.

        If a previous test fails mid-pretrain, destroy_global_state() never
        runs and the calculator leaks into the next test.
        """
        yield
        from megatron.core.num_microbatches_calculator import (
            _GLOBAL_NUM_MICROBATCHES_CALCULATOR,
            destroy_num_microbatches_calculator,
        )

        if _GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
            destroy_num_microbatches_calculator()

    # -----------------------------------------------------------------------
    # SFT scenarios
    # -----------------------------------------------------------------------

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_NONE_FROZEN,
    )
    def test_sft_nothing_frozen(self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path):
        """Scenario 1: all modules trainable."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_LM_FROZEN,
    )
    def test_sft_language_model_frozen(
        self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path
    ):
        """Scenario 2: language model frozen, train vision + projection."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_PROJ_ONLY,
    )
    def test_sft_vision_and_language_frozen(
        self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path
    ):
        """Scenario 3: vision + language frozen, train projection only."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )

    # -----------------------------------------------------------------------
    # Recompute
    # -----------------------------------------------------------------------

    @pytest.mark.run_only_on("GPU")
    @pytest.mark.parametrize(
        "config_func,recipe_name,parallelism_overrides,model_overrides",
        QWEN35_VL_SFT_RECOMPUTE,
    )
    def test_recompute(self, config_func, recipe_name, parallelism_overrides, model_overrides, tmp_path):
        """SFT with activation recomputation."""
        run_pretrain_vl_recipe_test(
            config_func,
            recipe_name,
            tmp_path,
            model_overrides=model_overrides,
            **parallelism_overrides,
        )
```
Add an explicit integration/system category marker for this module.
GPU gating is present, but unit/integration/system categorization is missing.
💡 Suggested fix
```diff
 import pytest

+pytestmark = pytest.mark.integration
+
 from megatron.bridge.recipes.qwen_vl.qwen35_vl import qwen35_vl_27b_finetune_config
 from tests.functional_tests.recipes.utils import run_pretrain_vl_recipe_test
```

As per coding guidelines, "tests/**/*.py: Use pytest.mark to categorize tests (unit, integration, system)".
```python
@pytest.fixture(autouse=True)
def _reset_microbatch_calculator(self):
    """Ensure the global microbatch calculator is cleared between tests.

    If a previous test fails mid-pretrain, destroy_global_state() never
    runs and the calculator leaks into the next test.
    """
    yield
    from megatron.core.num_microbatches_calculator import (
        _GLOBAL_NUM_MICROBATCHES_CALCULATOR,
        destroy_num_microbatches_calculator,
    )

    if _GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
        destroy_num_microbatches_calculator()
```
Reset leaked microbatch global both before and after each test.
Current fixture only cleans up after yield. If a prior module leaked global state, the first test here still starts dirty.
💡 Suggested fix

```diff
 @pytest.fixture(autouse=True)
 def _reset_microbatch_calculator(self):
     """Ensure the global microbatch calculator is cleared between tests.
@@
     from megatron.core.num_microbatches_calculator import (
         _GLOBAL_NUM_MICROBATCHES_CALCULATOR,
         destroy_num_microbatches_calculator,
     )

+    if _GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
+        destroy_num_microbatches_calculator()
+
     if _GLOBAL_NUM_MICROBATCHES_CALCULATOR is not None:
         destroy_num_microbatches_calculator()
```
Signed-off-by: Chen Cui <chcui@nvidia.com>
- Add `set -euo pipefail` to slurm_sft.sh and slurm_peft.sh for fail-fast behavior
- Remove `logs/` prefix from SBATCH output/error paths (directory doesn't exist at job start)
- Remove now-unnecessary `mkdir -p logs` calls
- Fix 122B-A10B parallelism comment in slurm_sft.sh to match recipe (TP=2, PP=6, EP=8)
- Add `pytestmark = pytest.mark.integration` to functional test module
- Reset microbatch calculator both before and after each test in fixture

Signed-off-by: Chen Cui <chcui@nvidia.com>
Made-with: Cursor
What does this PR do?
Add Qwen3.5 VL (dense + MoE) bridge, provider, fine-tuning recipes, and example scripts for the full model family.
Changelog
- `Qwen35VLBridge` (dense) and `Qwen35VLMoEBridge` for bidirectional HF ↔ Megatron checkpoint conversion with GDN + Gated Attention hybrid architecture support
- `Qwen35VLModelProvider` / `Qwen35VLMoEModelProvider` with selective attention spec patching (`_patch_standard_attention_specs`) for mRoPE on standard attention layers only
- `recipes/qwen_vl/__init__.py` updated to register the new recipe module
- Example scripts: `conversion.sh`, `inference.sh`, `slurm_sft.sh`, `slurm_peft.sh`
- Unit tests (`test_qwen35_vl_recipes.py`) and functional tests (`test_qwen35_vl_recipes_finetune.py`)
- Fixed `test_qwen35_vl_bridge.py` flaky assertions

GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Requires `transformers >= 5.2.0`. Import guards are in place for both dense and MoE transformers classes.