[QUARK-402] Add Quark GLM4.7-MXFP4 support #223
Conversation
Force-pushed: 8d605d9 → 03fff40
Pull request overview
This pull request adds support for Quark GLM4.7-MXFP4 quantization by implementing packed/merged module handling for layer-specific quantization exclusion. The changes enable proper handling of scenarios where users want to exclude specific component layers (e.g., gate_proj, up_proj) from quantization when they are packed into a single merged layer (e.g., gate_up_proj).
Changes:
- Added `build_packed_components_mapping` utility function to create inverse mappings from packed parameter names to their component checkpoint weight names
- Extended the `should_ignore_layer` function to check whether any component of a packed module should be excluded from quantization
- Added a `prefix` parameter to `ColumnParallelLinear`, `MergedColumnParallelLinear`, `QKVParallelLinear`, and `RowParallelLinear` to enable per-layer quantization config evaluation
- Added a `packed_components` field to `QuantizationConfig` to store the inverse mapping
- Implemented `build_inverse_mapping` in `ModelRunner` to populate `packed_components` before model instantiation
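The mapping and exclusion check described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from `atom/models/utils.py`: the input shape (`component -> (packed_name, shard_id)`) and the matching logic in `should_ignore_layer` are assumptions made for the sketch.

```python
from typing import Dict, List, Tuple

def build_packed_components_mapping(
    component_to_packed: Dict[str, Tuple[str, int]],
) -> Dict[str, List[str]]:
    """Invert a component -> (packed_name, shard_id) map into
    packed_name -> [component checkpoint weight names]."""
    inverse: Dict[str, List[str]] = {}
    for component, (packed_name, _shard_id) in component_to_packed.items():
        inverse.setdefault(packed_name, []).append(component)
    return inverse

def should_ignore_layer(prefix: str,
                        ignored: List[str],
                        packed_components: Dict[str, List[str]]) -> bool:
    """A layer is excluded from quantization if its own name is ignored,
    or if any component packed into it (e.g. gate_proj inside
    gate_up_proj) is ignored."""
    base = prefix.rsplit(".", 1)[-1]
    if base in ignored:
        return True
    return any(c in ignored for c in packed_components.get(base, []))

mapping = build_packed_components_mapping(
    {"gate_proj": ("gate_up_proj", 0), "up_proj": ("gate_up_proj", 1)})
print(should_ignore_layer("model.layers.0.mlp.gate_up_proj",
                          ["gate_proj"], mapping))  # True
```

This is why the inverse mapping must exist before model instantiation: each linear layer consults it through its `prefix` at construction time.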
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| atom/models/utils.py | Added build_packed_components_mapping function and extended should_ignore_layer to handle packed modules |
| atom/model_ops/linear.py | Added prefix parameter to ColumnParallelLinear, MergedColumnParallelLinear, QKVParallelLinear, and RowParallelLinear for layer-specific quantization handling |
| atom/model_engine/model_runner.py | Added build_inverse_mapping method to build packed components mapping before model initialization |
| atom/config.py | Added packed_components field to QuantizationConfig |
```python
class ReplicatedLinear(LinearBase):
    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = False,
        quant_config: Optional[QuantizationConfig] = None,
        source_quant_dtype: torch.dtype = None,
        **kwargs,
    ):
```
The ReplicatedLinear class is being instantiated with a prefix argument in multiple places throughout the codebase (e.g., deepseek_v2.py, gpt_oss.py, mixtral.py, qwen3_moe.py, qwen3_next.py), but the class definition doesn't accept a prefix parameter. This parameter is likely being silently ignored due to the **kwargs in the constructor. For consistency with other linear layer classes (ColumnParallelLinear, RowParallelLinear, MergedColumnParallelLinear, QKVParallelLinear) and to properly support quantization exclusion for replicated layers, ReplicatedLinear should also accept and handle the prefix parameter.
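A minimal sketch of the fix this comment suggests. `LinearBase` and `QuantizationConfig` are stubbed here for illustration, and `source_quant_dtype` is omitted to keep the sketch self-contained; the real change would go in `atom/model_ops/linear.py`:

```python
from typing import Optional

class LinearBase:  # stub standing in for atom's LinearBase
    pass

class QuantizationConfig:  # stub standing in for atom's QuantizationConfig
    pass

class ReplicatedLinear(LinearBase):
    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = False,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",  # new: accept prefix explicitly instead of
                           # letting **kwargs silently swallow it
        **kwargs,
    ):
        self.input_size = input_size
        self.output_size = output_size
        self.prefix = prefix
        # With prefix stored, quant_config can evaluate per-layer
        # exclusions here, just like the other linear classes do.

layer = ReplicatedLinear(16, 32, prefix="model.layers.0.mlp.shared_expert")
print(layer.prefix)
```

Without this, call sites passing `prefix=` compile and run, but the value is dropped on the floor, which is exactly the silent-ignore problem the comment describes.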
Hi @thpereir, could you post the commands you used for testing?
To serve, I used:
To run lm-eval:
Force-pushed: 03fff40 → fb84a80
@haoyangli0109 made more changes to fix the issues with
Force-pushed: fb84a80 → 227ea42
- TP4 weight loading crash (`moe.py` `_load_w13`/`_load_w2`): derived shard sizes from `loaded_weight.shape` instead of the padded `expert_data.shape`, to handle MXFP4 padding (384 → 512).
- `num_sms()` returning `None` on ROCm (`triton_kernels/target_info.py`): added `or is_hip()` to the CUDA branch.
- Custom routing for grouped top-k + sigmoid (`fused_moe_triton.py`): added a `routing_from_topk()` bridge function, since `triton_kernels.routing.routing()` only supports softmax + basic top-k. Modified `Mxfp4MoEMethod.apply()` to use `FusedMoE.select_experts` for routing while keeping the triton `matmul_ogs` for compute.
- Uninitialized bias causing NaN (`glm4_moe.py`): `FusedMoE` defaulted `has_bias=True`, creating `torch.empty` bias tensors that were never loaded (GLM-4.7 has no expert biases). Fixed with `has_bias=getattr(config, "moe_ffn_bias", False)`.
- Fused SwiGLU activation mismatch (`fused_moe_triton.py`), the final fix: triton_kernels' `swiglu_fn` expects an interleaved `[gate0, up0, gate1, up1, ...]` layout, but the w13 weights produce a concatenated `[gate | up]` layout; it also uses the non-standard `s * sigmoid(1.702 * s) * (linear + 1)` instead of the standard `silu(gate) * up`. Fix: bypassed the fused SwiGLU, ran `matmul_ogs` without activation, then manually applied `F.silu(gate) * up` on the concatenated output.
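The SwiGLU workaround in the last fix can be sketched as below. This is a simplified, hypothetical illustration of applying standard `silu(gate) * up` to a concatenated `[gate | up]` tensor; the real change lives in `fused_moe_triton.py` and operates on the `matmul_ogs` output.

```python
import torch
import torch.nn.functional as F

def swiglu_on_concatenated(h: torch.Tensor) -> torch.Tensor:
    """Apply standard silu(gate) * up to a [..., 2*d] tensor laid out as
    concatenated [gate | up]. This replaces triton_kernels' fused
    swiglu_fn, which assumes an interleaved [gate0, up0, ...] layout and
    computes s * sigmoid(1.702 * s) * (linear + 1) instead."""
    gate, up = h.chunk(2, dim=-1)
    return F.silu(gate) * up

h = torch.randn(4, 8)          # e.g. unactivated matmul_ogs output
out = swiglu_on_concatenated(h)
print(out.shape)               # torch.Size([4, 4])
```

Splitting with `chunk(2, dim=-1)` is what makes the concatenated layout work: the first half is the gate projection, the second half the up projection.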
Force-pushed: 227ea42 → d176130
```python
if self.config.compilation_config.level == 1:
    self.model = torch.compile(self.model, fullgraph=True, backend="eager")

def build_inverse_mapping(self, model_class: Any):
```
Can this part move to `quant_config` instead of living in the model runner?
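One way the suggestion could look, as a hypothetical sketch: `QuantizationConfig` is simplified to a dataclass here, and the mapping is assumed to already be `packed_name -> [components]` on the model class (via a `packed_modules_mapping` attribute, a name assumed for this sketch).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class QuantizationConfig:  # simplified stand-in for atom/config.py
    packed_components: Dict[str, List[str]] = field(default_factory=dict)

    def build_inverse_mapping(self, model_class: Any) -> None:
        """Populate packed_components from the model class itself, so
        ModelRunner no longer owns this step and any consumer of the
        config gets the mapping for free."""
        mapping = getattr(model_class, "packed_modules_mapping", {})
        self.packed_components = {k: list(v) for k, v in mapping.items()}

class FakeModel:  # hypothetical model class for the usage example
    packed_modules_mapping = {"gate_up_proj": ["gate_proj", "up_proj"]}

cfg = QuantizationConfig()
cfg.build_inverse_mapping(FakeModel)
print(cfg.packed_components)
```

Keeping the mapping on the config also matches where `packed_components` is stored in this PR, so only the builder would move.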
Motivation
Technical Details
Test Plan
Test Result
Server:
lm-eval
GSM8K accuracy
Submission Checklist