add qwen2_5_omni #2634

Open
yuekaizhang wants to merge 4 commits into NVIDIA-NeMo:main from yuekaizhang:qwen2_5_omni

Conversation


@yuekaizhang yuekaizhang commented Mar 4, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features
    • Added Qwen2.5 Omni multimodal model support for simultaneous processing of images, videos, audio, and text inputs
    • Implemented provider and bridge infrastructure enabling seamless model integration and parameter conversion from source formats
    • Enhanced positional encoding system for multimodal inputs with specialized token management and grid-based positioning
    • Enabled selective component freezing controls across vision encoders, audio processors, and language modules
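The selective-freezing feature listed above is commonly implemented by toggling `requires_grad` on each submodule's parameters. The sketch below illustrates that pattern only; the class and attribute names are hypothetical, not the PR's actual API:

```python
import torch.nn as nn

def freeze_module(module: nn.Module, freeze: bool = True) -> None:
    """Toggle requires_grad on every parameter of a submodule."""
    for param in module.parameters():
        param.requires_grad = not freeze

# Hypothetical composite model standing in for vision/audio/language parts.
class TinyMultimodal(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(4, 4)
        self.audio_encoder = nn.Linear(4, 4)
        self.language_model = nn.Linear(4, 4)

model = TinyMultimodal()
freeze_module(model.vision_encoder)  # freeze the vision tower only
assert all(not p.requires_grad for p in model.vision_encoder.parameters())
assert all(p.requires_grad for p in model.language_model.parameters())
```

Frozen parameters receive no gradients, so an optimizer constructed over `model.parameters()` simply skips them.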

Signed-off-by: root <zhangyuekai@foxmail.com>

copy-pr-bot bot commented Mar 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

This PR adds support for the Qwen2.5 Omni multimodal model to Megatron-Bridge. It introduces a complete model implementation including transformer configurations, HuggingFace-to-Megatron bridge infrastructure, a model provider, multimodal RoPE position encoding utilities, and comprehensive unit tests.

Changes

Cohort / File(s) Summary
Module Public API
src/megatron/bridge/models/__init__.py, src/megatron/bridge/models/qwen_omni/__init__.py, src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/__init__.py
New package structure exposing Qwen25OmniModel, Qwen25OmniBridge, and Qwen25OmniModelProvider classes at module level with proper __all__ declarations.
Bridge and Provider
src/megatron/bridge/models/qwen_omni/qwen25_omni_bridge.py, src/megatron/bridge/models/qwen_omni/qwen25_omni_provider.py
Implements the Megatron bridge from HuggingFace Qwen2_5OmniForConditionalGeneration to Qwen25OmniModel, with a MegatronMappingRegistry covering embeddings, attention, and MLP layers. The provider instantiates a fully configured Qwen25OmniModel with language, thinker, talker, and token2wav configs, supporting selective freezing across modalities.
Model Implementation
src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/model.py, src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py, src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/transformer_config.py
Core model wrapper (Qwen25OmniModel) delegating to Qwen25OmniThinkerModel, which integrates HF vision/audio encoders with a dense Megatron language model. The transformer config dataclass defines multimodal parameters (patch sizes, token IDs, RoPE sections, temporal/spatial merge configs).
RoPE Position Encoding
src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/rope.py
Implements multimodal RoPE utilities computing 3D position indices for image, video, and audio modalities with chunked interleaving, grid-based vision tokenization, and temporal dimension handling.
Architecture Alias Mapping
src/megatron/bridge/models/conversion/auto_bridge.py
Introduces HF_ARCHITECTURE_ALIASES mapping to resolve non-standard architecture names (e.g., "Qwen2_5OmniModel" → "Qwen2_5OmniForConditionalGeneration") used in auto-detection and validation flows.
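An architecture-alias table of this kind is typically a plain dict consulted before class lookup. A minimal sketch of the idea (the dict name matches the summary above; the lookup helper is hypothetical):

```python
# Map non-standard architecture names found in checkpoints to the class
# names that transformers actually registers.
HF_ARCHITECTURE_ALIASES = {
    "Qwen2_5OmniModel": "Qwen2_5OmniForConditionalGeneration",
}

def resolve_architecture(name: str) -> str:
    """Return the canonical architecture name, falling back to the input."""
    return HF_ARCHITECTURE_ALIASES.get(name, name)

assert resolve_architecture("Qwen2_5OmniModel") == "Qwen2_5OmniForConditionalGeneration"
assert resolve_architecture("LlamaForCausalLM") == "LlamaForCausalLM"
```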
Unit Tests
tests/unit_tests/models/qwen_omni/modeling_qwen25_omni/test_omni_model.py
Comprehensive test suite for Qwen25OmniModel with distributed setup, fixtures for multimodal inputs (image, video, audio), and tests covering freeze API, shared embeddings, and set_input_tensor behavior.
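For intuition on the 3D RoPE positions mentioned above: in the multimodal RoPE convention used by the Qwen2-VL family, pure-text tokens carry identical positions on all three axes, while vision tokens get (temporal, height, width) grid coordinates. This is a simplified sketch of that convention, not the PR's actual rope.py:

```python
import torch

def text_3d_position_ids(seq_len: int) -> torch.Tensor:
    """Text tokens: the temporal/height/width axes all repeat 0..seq_len-1."""
    pos = torch.arange(seq_len)
    return pos.unsqueeze(0).expand(3, -1)  # shape (3, seq_len)

def vision_3d_position_ids(t: int, h: int, w: int, offset: int = 0) -> torch.Tensor:
    """Grid positions for a t x h x w patch block, offset by preceding text length."""
    tt = torch.arange(t).view(t, 1, 1).expand(t, h, w).reshape(-1)
    hh = torch.arange(h).view(1, h, 1).expand(t, h, w).reshape(-1)
    ww = torch.arange(w).view(1, 1, w).expand(t, h, w).reshape(-1)
    return torch.stack([tt, hh, ww]) + offset  # shape (3, t*h*w)

pos = text_3d_position_ids(5)
assert pos.shape == (3, 5)
vis = vision_3d_position_ids(2, 2, 2, offset=5)
assert vis.shape == (3, 8)
assert vis[0].tolist() == [5, 5, 5, 5, 6, 6, 6, 6]  # temporal axis
```

The real implementation additionally handles chunked interleaving of modalities and audio temporal alignment, which this sketch omits.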

Sequence Diagram

sequenceDiagram
    actor User
    participant Bridge as Qwen25OmniBridge
    participant Provider as Qwen25OmniModelProvider
    participant Model as Qwen25OmniModel
    participant ThinkerModel as Qwen25OmniThinkerModel
    participant VisionEnc as Vision Encoder<br/>(HF)
    participant AudioEnc as Audio Encoder<br/>(HF)
    participant LanguageModel as Language Model<br/>(Megatron)

    User->>Bridge: Load HF model
    Bridge->>Provider: provider_bridge()
    activate Provider
    Provider->>Model: Instantiate with configs
    activate Model
    Model->>ThinkerModel: Initialize thinker
    activate ThinkerModel
    ThinkerModel->>VisionEnc: Initialize vision encoder
    ThinkerModel->>AudioEnc: Initialize audio encoder
    ThinkerModel->>LanguageModel: Initialize language model
    deactivate ThinkerModel
    deactivate Model
    deactivate Provider

    User->>Model: forward(input_ids, pixel_values,<br/>input_features, ...)
    activate Model
    Model->>ThinkerModel: forward()
    activate ThinkerModel
    
    ThinkerModel->>VisionEnc: Encode images/videos
    activate VisionEnc
    VisionEnc-->>ThinkerModel: vision_embeddings
    deactivate VisionEnc
    
    ThinkerModel->>AudioEnc: get_audio_features()
    activate AudioEnc
    AudioEnc-->>ThinkerModel: audio_embeddings
    deactivate AudioEnc
    
    ThinkerModel->>LanguageModel: Get text embeddings
    activate LanguageModel
    LanguageModel-->>ThinkerModel: text_embeddings
    deactivate LanguageModel
    
    ThinkerModel->>ThinkerModel: Substitute vision/audio<br/>embeddings at token positions
    
    ThinkerModel->>ThinkerModel: Compute 3D RoPE<br/>position_ids
    
    ThinkerModel->>LanguageModel: forward(embeddings,<br/>position_ids, masks, ...)
    activate LanguageModel
    LanguageModel-->>ThinkerModel: logits/output
    deactivate LanguageModel
    
    ThinkerModel-->>Model: output
    deactivate ThinkerModel
    
    Model-->>User: logits
    deactivate Model

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • yaoyu-33
  • cuichenx
  • chtruong814
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR adds 1,700+ lines of Qwen2.5 Omni model code but PR description lacks test results, performance data, or numerical validation. Test file exists but has bugs (network dependencies, undefined variables) preventing execution. Fix test bugs and provide evidence that unit tests pass. Document validation results, performance benchmarks, and confirmation that numerical outputs match reference implementation or explain test blockers.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'add qwen2_5_omni' directly matches the main objective: introducing Qwen2.5 Omni model support to Megatron-Bridge. It clearly summarizes the primary change across all modified and added files.
Docstring Coverage ✅ Passed Docstring coverage is 83.87% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (1)
src/megatron/bridge/models/qwen_omni/qwen25_omni_provider.py (1)

24-24: Replace typing.List with built-in generic syntax.

Use list[int] for mrope_section and remove the List import.

Proposed fix
-from typing import List
@@
-    mrope_section: List[int] = field(default_factory=lambda: [16, 24, 24])
+    mrope_section: list[int] = field(default_factory=lambda: [16, 24, 24])

As per coding guidelines, use built-in generics (list, dict, tuple) instead of typing equivalents.

Also applies to: 79-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_omni/qwen25_omni_provider.py` at line 24,
Replace the typing.List usages with built-in generics: remove the import "from
typing import List" and change any annotations that use List (notably the
mrope_section annotation) to use the built-in form (e.g., list[int]); update any
other occurrences in this module that reference List (such as the other
annotation around qwen25_omni provider functions/variables) to their equivalent
built-in generic types.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/qwen_omni/__init__.py`:
- Around line 20-24: The __all__ export list in this module is unsorted; update
the __all__ variable so its entries are in deterministic alphabetical order
(e.g., ensure "Qwen25OmniBridge", "Qwen25OmniModel", "Qwen25OmniModelProvider"
are sorted) to satisfy Ruff RUF022; locate the __all__ definition containing
"Qwen25OmniModel", "Qwen25OmniBridge", and "Qwen25OmniModelProvider" and reorder
the entries alphabetically (or generate the list with sorted(...)) so the lint
warning is resolved.

In `@src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/model.py`:
- Around line 43-44: Constructor parameters talker_transformer_config and
token2wav_transformer_config are declared but unused; either remove them from
the constructor signature or explicitly retain them by assigning to instance
attributes (e.g., self.talker_transformer_config = talker_transformer_config and
self.token2wav_transformer_config = token2wav_transformer_config) or, if
intentionally unused, prefix them with an underscore (e.g.,
_talker_transformer_config) or add an explicit comment/annotation to silence
ARG002; update the __init__ of the model class in modeling_qwen25_omni to
reflect the chosen approach so intent is clear.
- Line 50: The nullable type hints in the class constructor (__init__) and the
forward method should use the explicit union syntax `T | None` instead of bare
defaults of `None`; update the annotations for parameters such as pg_collection
(in __init__) and in forward: position_ids, attention_mask, labels, loss_mask,
inference_params, packed_seq_params, extra_block_kwargs, pixel_values,
pixel_values_videos, image_grid_thw, video_grid_thw, image_input_mask,
video_input_mask, cp_img_num, and images_padded to use the form e.g.
torch.Tensor | None, dict | None, list[int] | None, list[bool] | None, etc.,
keeping their default values as None and preserving existing parameter names and
semantics.

In `@src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/rope.py`:
- Line 114: The loop currently shadows the function argument input_ids by using
"for i, input_ids in enumerate(total_input_ids):" which makes later references
(e.g., device handling that expects the outer input_ids) ambiguous; rename the
loop variable to something like batch_input_ids (e.g., "for i, batch_input_ids
in enumerate(total_input_ids):") and update all usages inside that loop to use
batch_input_ids, while ensuring any code that should reference the original
function argument input_ids (such as the device resolution/handling code later
in the function) continues to reference the outer input_ids variable.
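The hazard described above is easy to reproduce: a `for` loop rebinds its target name, so code after the loop silently sees the last batch row instead of the original argument. A minimal self-contained illustration (function names are for demonstration only):

```python
def broken(input_ids):
    # The loop target rebinds the argument name...
    for i, input_ids in enumerate(input_ids):
        pass
    return input_ids  # ...so this is the LAST row, not the original list

def fixed(input_ids):
    # Rename the loop variable; the argument stays intact.
    for i, batch_input_ids in enumerate(input_ids):
        pass
    return input_ids  # still the original argument

rows = [[1, 2], [3, 4]]
assert broken(rows) == [3, 4]   # shadowed: last row leaks out
assert fixed(rows) == rows      # original argument preserved
```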

In `@src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py`:
- Around line 176-181: The get_audio_features method currently assumes
feature_attention_mask is non-null and ignores audio_feature_lengths; update it
to first use audio_feature_lengths if provided (use that as lengths and
construct/derive a matching feature_attention_mask if needed), otherwise if
feature_attention_mask is None compute feature lengths from input_features
(e.g., full-length) or safely skip mask.sum(-1); guard every place that calls
feature_attention_mask.sum(-1) (including the later block around lines 185-190)
with a conditional so you only sum when feature_attention_mask is not None, and
ensure the returned attention mask and length values reflect the caller-provided
audio_feature_lengths when present.
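The guard being requested can be sketched as a small length-resolution helper: prefer caller-provided lengths, fall back to the mask, and finally assume full-length features. The function name and tensor shapes here are hypothetical stand-ins for the real get_audio_features logic:

```python
import torch

def resolve_audio_lengths(input_features, feature_attention_mask=None,
                          audio_feature_lengths=None):
    """Pick audio feature lengths without dereferencing a None mask."""
    if audio_feature_lengths is not None:
        return audio_feature_lengths
    if feature_attention_mask is not None:
        return feature_attention_mask.sum(-1)
    # No mask and no lengths: treat every feature as full length.
    batch, _, frames = input_features.shape
    return torch.full((batch,), frames, dtype=torch.long)

feats = torch.zeros(2, 128, 10)  # (batch, mel bins, frames)
mask = torch.tensor([[1] * 6 + [0] * 4, [1] * 10])
assert resolve_audio_lengths(feats, mask).tolist() == [6, 10]
assert resolve_audio_lengths(feats).tolist() == [10, 10]
assert resolve_audio_lengths(feats, mask, torch.tensor([3, 4])).tolist() == [3, 4]
```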
- Around line 62-63: The constructor of ThinkerModel accepts pg_collection:
ProcessGroupCollection = None but immediately dereferences it; update the
__init__ (and the other places where pg_collection is used) to guard against
None by either early-returning/skipping process-group setup when pg_collection
is None or by creating a default ProcessGroupCollection instance; specifically,
wrap usages of pg_collection (e.g., any calls like pg_collection.get_or_create,
pg_collection.add, pg_collection.create_process_group) in an if pg_collection is
not None: ... block or assign a fallback local variable before dereference so no
attribute access occurs when pg_collection is None.
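The fallback-before-dereference pattern suggested above looks like this in miniature (ProcessGroupCollection here is a stub standing in for the real Megatron class, and the method name is hypothetical):

```python
class ProcessGroupCollection:
    """Stub of the real collection, for illustration only."""
    def tp_group(self):
        return "default-tp-group"

def init_groups(pg_collection=None):
    # Assign a fallback before any attribute access, so a None default
    # never reaches a dereference like pg_collection.tp_group().
    if pg_collection is None:
        pg_collection = ProcessGroupCollection()
    return pg_collection.tp_group()

assert init_groups() == "default-tp-group"
assert init_groups(ProcessGroupCollection()) == "default-tp-group"
```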
- Around line 224-225: Replace the two assert checks for unsupported modes with
explicit raises of NotImplementedError so they cannot be bypassed with Python
-O; specifically change the checks that reference inference_params and
packed_seq_params in thinker_model.py (the assertions asserting inference_params
is None and packed_seq_params is None) to raise NotImplementedError with the
same descriptive messages ("not support inference" and "not support
packed_seq_params") to fail fast and clearly indicate unsupported functionality.
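The reason explicit raises are preferred: `assert` statements are stripped when Python runs with the `-O` flag, so an unsupported code path could be silently entered. A minimal sketch of the suggested change (the forward stub is illustrative, not the PR's signature):

```python
def forward_stub(inference_params=None, packed_seq_params=None):
    # Unlike assert, these raises survive `python -O`.
    if inference_params is not None:
        raise NotImplementedError("not support inference")
    if packed_seq_params is not None:
        raise NotImplementedError("not support packed_seq_params")
    return "ok"

assert forward_stub() == "ok"
try:
    forward_stub(inference_params=object())
    raise AssertionError("expected NotImplementedError")
except NotImplementedError as e:
    assert "inference" in str(e)
```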

In
`@src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/transformer_config.py`:
- Line 16: Remove the typing.List import and replace all uses of the typing
generic with the built-in generic syntax: remove the line "from typing import
List" and change every annotation like "List[int]" to "list[int]" (also update
any other List[...] occurrences in this module, e.g., the annotation referenced
near the other occurrence). Ensure type hints across transformer_config.py use
built-in generics (list, dict, tuple) and run the type-checker to confirm no
residual imports of typing.List remain.

In `@src/megatron/bridge/models/qwen_omni/qwen25_omni_bridge.py`:
- Line 87: The current expression mrope_section=getattr(text_config,
"rope_scaling", {}).get("mrope_section", [16, 24, 24]) can raise AttributeError
if text_config.rope_scaling exists but is None; change the guard so you first
retrieve rope_scaling (e.g., rope_scaling = getattr(text_config, "rope_scaling",
None)) and then call .get on a safe dict (e.g., (rope_scaling or
{}).get("mrope_section", [16,24,24])) or use an explicit conditional to set
mrope_section; update the assignment in qwen25_omni_bridge.py to use this safe
lookup for rope_scaling.
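The `(rope_scaling or {})` idiom recommended above covers both failure modes, the attribute being absent and the attribute being explicitly None. A self-contained sketch (the helper name is hypothetical; the [16, 24, 24] default matches the snippet in the review):

```python
from types import SimpleNamespace

def get_mrope_section(text_config, default=(16, 24, 24)):
    # rope_scaling may be missing *or* present-but-None; `or {}` handles both.
    rope_scaling = getattr(text_config, "rope_scaling", None)
    return (rope_scaling or {}).get("mrope_section", list(default))

assert get_mrope_section(SimpleNamespace()) == [16, 24, 24]                # attribute absent
assert get_mrope_section(SimpleNamespace(rope_scaling=None)) == [16, 24, 24]  # attribute is None
assert get_mrope_section(SimpleNamespace(rope_scaling={"mrope_section": [8, 8]})) == [8, 8]
```

The second assertion is exactly the case the original one-liner would crash on with AttributeError.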

In `@tests/unit_tests/models/qwen_omni/modeling_qwen25_omni/test_omni_model.py`:
- Around line 204-205: get_data_batch currently uses undefined locals
random_video and random_audio causing NameError; update the helper
(get_data_batch) to accept random_video and random_audio as parameters or create
them inside the function (similar to random_image) so all referenced variables
are defined; ensure any tests calling get_data_batch (and related calls around
lines referencing this helper) are updated to pass the new args if you choose
parameters, and keep the function signature consistent across usages.
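A fixed helper along the lines suggested simply constructs every tensor it references locally. The shapes and key names below are illustrative guesses, not the actual fixture contract:

```python
import torch

def get_data_batch(batch_size=1, seq_len=8):
    """All referenced tensors are created here, so no NameError is possible."""
    random_image = torch.rand(batch_size, 3, 28, 28)
    random_video = torch.rand(batch_size, 4, 3, 28, 28)   # previously undefined
    random_audio = torch.rand(batch_size, 128, 100)       # previously undefined
    return {
        "input_ids": torch.randint(0, 100, (batch_size, seq_len)),
        "pixel_values": random_image,
        "pixel_values_videos": random_video,
        "input_features": random_audio,
    }

batch = get_data_batch(batch_size=2)
assert batch["input_ids"].shape == (2, 8)
assert batch["pixel_values_videos"].shape == (2, 4, 3, 28, 28)
```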
- Around line 40-49: The fixtures processor and hf_config currently call
AutoProcessor.from_pretrained and AutoConfig.from_pretrained which perform
network downloads; change these tests to avoid runtime downloads by either (a)
replacing the fixtures with lightweight local test doubles (e.g., a simple stub
object implementing the minimal interface used in tests) or (b) monkeypatching
AutoProcessor.from_pretrained and AutoConfig.from_pretrained to return mocked
instances; update the processor and hf_config fixtures to return those local
stubs/mocks (or use pytest monkeypatch in the module-level fixtures) so tests
run offline and no external model artifact is fetched.
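Option (a), a lightweight local double, can be as small as a class exposing a `from_pretrained` classmethod that never touches the network. This sketch uses an invented stub (the interface it implements is only what a test would minimally need, not the full AutoProcessor API):

```python
class AutoProcessorStub:
    """Offline stand-in for transformers.AutoProcessor in unit tests."""
    @classmethod
    def from_pretrained(cls, name):
        # No download: just return a stub instance regardless of `name`.
        return cls()

    def __call__(self, text):
        # Toy tokenization: one id per character.
        return {"input_ids": list(range(len(text)))}

# With pytest's monkeypatch this would replace the real factory, e.g.:
#   monkeypatch.setattr("transformers.AutoProcessor.from_pretrained",
#                       AutoProcessorStub.from_pretrained)
processor = AutoProcessorStub.from_pretrained("Qwen/Qwen2.5-Omni-7B")
out = processor("hi")
assert out["input_ids"] == [0, 1]
```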

---

Nitpick comments:
In `@src/megatron/bridge/models/qwen_omni/qwen25_omni_provider.py`:
- Line 24: Replace the typing.List usages with built-in generics: remove the
import "from typing import List" and change any annotations that use List
(notably the mrope_section annotation) to use the built-in form (e.g.,
list[int]); update any other occurrences in this module that reference List
(such as the other annotation around qwen25_omni provider functions/variables)
to their equivalent built-in generic types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c82075a4-ee37-4efc-9062-4b0ad957220d

📥 Commits

Reviewing files that changed from the base of the PR and between 2d72c33 and 4bdb0c7.

📒 Files selected for processing (11)
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/conversion/auto_bridge.py
  • src/megatron/bridge/models/qwen_omni/__init__.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/__init__.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/model.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/rope.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/thinker_model.py
  • src/megatron/bridge/models/qwen_omni/modeling_qwen25_omni/transformer_config.py
  • src/megatron/bridge/models/qwen_omni/qwen25_omni_bridge.py
  • src/megatron/bridge/models/qwen_omni/qwen25_omni_provider.py
  • tests/unit_tests/models/qwen_omni/modeling_qwen25_omni/test_omni_model.py

Signed-off-by: root <zhangyuekai@foxmail.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
@yuekaizhang (Author) commented:

@yaoyu-33 Would you mind helping reviewing the PR? Thanks.

