Conversation
```python
# Ensure that the tensor passed between pipeline parallel stages is
# viewless. See related notes in TransformerBlock and TransformerLayer
output = make_viewless_tensor(
```
```python
from megatron.core.inference.contexts import BaseInferenceContext
from megatron.core.process_groups_config import ProcessGroupCollection
from megatron.core.ssm.mamba_hybrid_layer_allocation import Symbols as LayerSymbols
from megatron.core.ssm.mamba_hybrid_layer_allocation import allocate_layers
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
from megatron.core.models.gpt.moe_module_specs import get_moe_module_spec
from megatron.core.ssm.mamba_block import MambaStack, MambaStackSubmodules
```
Pull request overview
This PR adds support for Mamba and hybrid MLA-Mamba models in Primus-LM with both Megatron and TorchTitan backends. The changes enable training of four new model variants: Mamba_370M, Zebra_Llama_1B, Zebra_Llama_3B, and Zebra_Llama_8B.
Changes:
- Added Mamba and hybrid model support in Megatron backend with new layer specifications and model blocks
- Enhanced TorchTitan backend with improved attention patching, MoE grouped MM support, and FP8 quantization
- Updated configuration files to support new quantization structure and model-specific settings
Reviewed changes
Copilot reviewed 67 out of 69 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| primus/backends/megatron/core/models/hybrid/* | New hybrid stack and layer specs for Mamba+MLA models |
| primus/modules/trainer/megatron/*.py | Model type detection and Mamba-specific forward pass handling |
| primus/core/utils/import_utils.py | Model provider resolution for GPT and Mamba models |
| primus/configs/models/megatron/* | Configuration files for new Mamba and Zebra models |
| primus/backends/torchtitan/models/* | Attention model updates for llama3, llama4, and deepseek_v3 |
| primus/modules/trainer/torchtitan/pre_trainer.py | Enhanced patching logic for MoE, attention, and quantization |
| primus/configs/modules/torchtitan/* | Restructured quantization config and new turbo settings |
| examples/torchtitan/configs/MI300X/* | Updated training configs with new quantization structure |
| examples/run_pretrain.sh | Added Primus Turbo rebuild capability |
Comments suppressed due to low confidence (6)

- primus/configs/models/megatron/zebra_llama_1B.yaml:1 - The comment incorrectly states 'Zebra Llama 8B configuration' when this is the 1B model configuration file. This should be 'Zebra Llama 1B configuration'.
- primus/configs/models/megatron/zebra_llama_3B.yaml:1 - The comment incorrectly states 'Zebra Llama 8B configuration' when this is the 3B model configuration file. This should be 'Zebra Llama 3B configuration'.
- primus/backends/torchtitan/models/moe/moe.py:1 - Corrected spelling of 'tyr' to 'try'.
- primus/modules/trainer/torchtitan/patch_utils.py:1 - Corrected spelling of 'PrimusPath' to 'PrimusPatch' to match the prefix used elsewhere in the file.
- primus/modules/trainer/torchtitan/patch_utils.py:1 - Corrected spelling of 'PrimusPath' to 'PrimusPatch' to match the prefix used elsewhere in the file.
- primus/modules/trainer/torchtitan/pre_trainer.py:1 - This commented-out error message provides valuable context for the actual error message below it. Consider either removing the commented line or adding a comment explaining why it is kept for reference.
```python
module=MambaMixer,
params={
    "expand": 1,
    "d_conv": 4,
```
The parameter name 'd_conv' is unclear. Consider using a more descriptive name like 'conv_dimension' or add a comment explaining what 'd_conv' represents.
```diff
-    "d_conv": 4,
+    "d_conv": 4,  # Convolution dimension (kernel size) used in the Mamba mixer
```
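For context, in Mamba-style mixers `d_conv` is the kernel size of the short depthwise causal convolution applied along the sequence dimension. A standalone sketch (shapes and names assumed here, not the actual MambaMixer code):

```python
import torch
import torch.nn as nn

# Sketch (assumed shapes): d_conv controls the kernel size of the short
# causal depthwise Conv1d that Mamba applies along the sequence.
d_model, d_conv = 16, 4
conv = nn.Conv1d(
    in_channels=d_model,
    out_channels=d_model,
    kernel_size=d_conv,   # this is what "d_conv": 4 sets
    groups=d_model,       # depthwise: one filter per channel
    padding=d_conv - 1,   # left-pad so the conv is causal, then trim
)
x = torch.randn(2, d_model, 32)       # (batch, channels, seq_len)
y = conv(x)[..., : x.size(-1)]        # trim back to the causal length
```

The trim after the convolution discards the extra positions introduced by the padding, so each output step only depends on current and past inputs.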
```yaml
fused_padded_mla_attention: false
```

```yaml
multi_latent_attention: false
#multi_latent_attention: true
```
This commented-out configuration line has inconsistent indentation (7 spaces instead of the standard indentation). If this is meant to be uncommented for use, fix the indentation to match the surrounding code.
Suggested change: re-indent `#multi_latent_attention: true` to match the surrounding keys.
```python
fp8_str = config.fp8.lower()

if fp8_str == "e4m3":
    fp8_format = FP8Format.E4M3
elif fp8_str == "hybrid":
    fp8_format = FP8Format.HYBRID
```
The code calls .lower() on config.fp8 but then compares with lowercase strings. If config.fp8 could be a non-string type (e.g., boolean for 'hybrid'), this will fail. Add a type check or ensure config.fp8 is always a string before calling .lower().
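A defensive normalization along the lines this comment suggests could look like the following (a minimal sketch; `normalize_fp8` and the accepted values are illustrative, not the actual Primus API):

```python
# Hypothetical sketch: validate config.fp8 before calling .lower(), so a
# non-string value (e.g. a YAML boolean) raises a clear error instead of
# an AttributeError deep inside the trainer.
def normalize_fp8(value):
    if not isinstance(value, str):
        raise TypeError(
            f"config.fp8 must be a string, got {type(value).__name__}: {value!r}"
        )
    fp8_str = value.lower()
    if fp8_str not in ("e4m3", "hybrid"):
        raise ValueError(f"unsupported fp8 format: {value!r}")
    return fp8_str
```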
```python
sequence_len_offset = torch.tensor(
    [inference_context.sequence_len_offset] * current_batch_size,
    dtype=torch.int32,
    device='cuda',
```
Hardcoding 'cuda' as the device may cause issues in multi-device environments. Consider using hidden_states.device or a device parameter from the config to maintain device consistency.
```diff
-    device='cuda',
+    device=hidden_states.device,
```
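The suggested fix can be illustrated in isolation (a minimal sketch with stand-in tensors; `hidden_states` here is a CPU placeholder for the real activations):

```python
import torch

# Stand-in for the real activations; in training this would live on the
# accelerator, so deriving the device from it keeps everything consistent.
hidden_states = torch.zeros(2, 4)

sequence_len_offset = torch.tensor(
    [0] * hidden_states.size(0),
    dtype=torch.int32,
    device=hidden_states.device,  # follows hidden_states instead of 'cuda'
)
```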
```diff
@@ -0,0 +1,67 @@
+#!/bin/bash
```
Please don't put this script in the root path of the Primus repo.
```dockerfile
FROM ${BASE_IMAGE}
# Base image
# FROM docker.io/rocm/megatron-lm:v25.9_gfx942
FROM docker.io/rocm/pyt-megatron-lm-jax-nightly-private:pytorch_rocm7.0_20251024
```
Primus uses a publicly released Docker image as the base image. The main branch uses v25.10; please rebase onto or merge the main branch.
Add support for Mamba and hybrid MLA-Mamba models in Primus-LM with the Megatron backend

Four models are added: Mamba_370M, Zebra_Llama_1B, Zebra_Llama_3B, and Zebra_Llama_8B.

To support Mamba and Mamba-based hybrid models, we add new layer specs and model blocks under
Primus/primus/backends/megatron/core/models/hybrid
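As a rough illustration of how a hybrid stack interleaves layer types from a repeating pattern (a purely hypothetical helper for intuition, not the actual `allocate_layers` API from `megatron.core.ssm.mamba_hybrid_layer_allocation`):

```python
# Hypothetical sketch: build a per-layer type list for a hybrid stack by
# repeating a pattern string, e.g. 'M' for a Mamba layer and '*' for an
# attention layer. The symbols and function name are illustrative only.
def build_layer_types(pattern: str, num_layers: int) -> list:
    reps = -(-num_layers // len(pattern))  # ceiling division
    return list((pattern * reps)[:num_layers])

# e.g. three Mamba layers per attention layer, eight layers deep
layers = build_layer_types("MMM*", 8)
```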