
[QUARK-402] Add Quark GLM4.7-MXFP4 support #223

Open
thpereir wants to merge 1 commit into ROCm:main from thpereir:thpereir/quark_glm47_mxfp4

Conversation

@thpereir (Contributor) commented Feb 18, 2026

Motivation

Technical Details

Test Plan

Test Result

Server:

python -m atom.entrypoints.openai_server --model /opt/group/huggingface/pretrained_models/amd/GLM-4.7-MXFP4/ -tp 4 --trust-remote-code

lm-eval

lm_eval \
  --model local-completions \
  --model_args "model=/opt/group/huggingface/pretrained_models/amd/GLM-4.7-MXFP4/,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1

GSM 8k accuracy

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9431 | ± 0.0064 |
|       |         | strict-match     | 5      | exact_match | 0.9424 | ± 0.0064 |

Submission Checklist

Copilot AI (Contributor) left a comment

Pull request overview

This pull request adds support for Quark GLM4.7-MXFP4 quantization by implementing packed/merged module handling for layer-specific quantization exclusion. The changes enable proper handling of scenarios where users want to exclude specific component layers (e.g., gate_proj, up_proj) from quantization when they are packed into a single merged layer (e.g., gate_up_proj).

Changes:

  • Added build_packed_components_mapping utility function to create inverse mappings from packed parameter names to their component checkpoint weight names
  • Extended should_ignore_layer function to check if any components of a packed module should be excluded from quantization
  • Added prefix parameter to ColumnParallelLinear, MergedColumnParallelLinear, QKVParallelLinear, and RowParallelLinear classes to enable per-layer quantization config evaluation
  • Added packed_components field to QuantizationConfig to store the inverse mapping
  • Implemented build_inverse_mapping in ModelRunner to populate packed_components before model instantiation
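
The packed-module exclusion logic above can be sketched as follows. `build_packed_components_mapping` and `should_ignore_layer` are the names from this PR (`atom/models/utils.py`), but the bodies, the example mapping, and the exact exclusion semantics shown here are assumptions, not the actual implementation:

```python
from typing import Dict, List

# Models typically declare packed modules like this: packed name -> components
# (example values assumed for illustration).
PACKED_MODULES_MAPPING: Dict[str, List[str]] = {
    "gate_up_proj": ["gate_proj", "up_proj"],
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
}


def build_packed_components_mapping(
    packed_mapping: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each packed parameter name to the checkpoint weight names of its
    components (e.g. gate_up_proj -> [gate_proj, up_proj])."""
    return {packed: list(parts) for packed, parts in packed_mapping.items()}


def should_ignore_layer(
    prefix: str,
    exclude: List[str],
    packed_components: Dict[str, List[str]],
) -> bool:
    """True if the layer at `prefix` should keep full precision: either its
    own module name is excluded, or any component folded into the packed
    module is excluded (a merged layer cannot be partially quantized)."""
    module_name = prefix.split(".")[-1]
    if module_name in exclude:
        return True
    return any(c in exclude for c in packed_components.get(module_name, []))
```

With this, excluding `gate_proj` from quantization also forces the merged `gate_up_proj` layer to stay unquantized.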

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
|------|-------------|
| atom/models/utils.py | Added build_packed_components_mapping function and extended should_ignore_layer to handle packed modules |
| atom/model_ops/linear.py | Added prefix parameter to ColumnParallelLinear, MergedColumnParallelLinear, QKVParallelLinear, and RowParallelLinear for layer-specific quantization handling |
| atom/model_engine/model_runner.py | Added build_inverse_mapping method to build the packed components mapping before model initialization |
| atom/config.py | Added packed_components field to QuantizationConfig |


Comment on lines 447 to 456
```python
class ReplicatedLinear(LinearBase):
    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = False,
        quant_config: Optional[QuantizationConfig] = None,
        source_quant_dtype: torch.dtype = None,
        **kwargs,
    ):
```
Copilot AI commented Feb 24, 2026

The ReplicatedLinear class is being instantiated with a prefix argument in multiple places throughout the codebase (e.g., deepseek_v2.py, gpt_oss.py, mixtral.py, qwen3_moe.py, qwen3_next.py), but the class definition doesn't accept a prefix parameter. This parameter is likely being silently ignored due to the **kwargs in the constructor. For consistency with other linear layer classes (ColumnParallelLinear, RowParallelLinear, MergedColumnParallelLinear, QKVParallelLinear) and to properly support quantization exclusion for replicated layers, ReplicatedLinear should also accept and handle the prefix parameter.
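
A minimal sketch of what Copilot is suggesting: accept `prefix` explicitly instead of letting `**kwargs` silently swallow it, so replicated layers can participate in per-layer exclusion like the parallel linear classes. The class body is heavily simplified and the exclusion helper is hypothetical; only the parameter plumbing mirrors the PR:

```python
from typing import Any, Optional


class ReplicatedLinear:
    """Simplified stand-in; the real class derives from LinearBase."""

    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = False,
        quant_config: Optional[Any] = None,
        prefix: str = "",  # previously dropped silently by **kwargs
        **kwargs,
    ):
        self.input_size = input_size
        self.output_size = output_size
        self.prefix = prefix
        # With the prefix available, per-layer exclusion can be evaluated the
        # same way the other linear classes do (hypothetical helper below).
        self.quant_config = None if self._is_excluded(quant_config) else quant_config

    def _is_excluded(self, quant_config: Optional[Any]) -> bool:
        exclude = getattr(quant_config, "exclude", None) or []
        return any(self.prefix.endswith(name) for name in exclude)
```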

@haoyangli0109

Hi, @thpereir, could you post the commands you used for testing?

@thpereir
Contributor Author

To serve the model, I used:

python -m atom.entrypoints.openai_server --model /scratch/models/GLM4.7-MXFP4/ --trust-remote-code

To run lm-eval:

lm_eval --model local-completions \
  --model_args "model=/scratch/models/GLM4.7-MXFP4/,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1

thpereir force-pushed the thpereir/quark_glm47_mxfp4 branch from 03fff40 to fb84a80 on February 25, 2026 22:04
@thpereir
Contributor Author

@haoyangli0109 I made more changes to fix the issues with -tp 4.

thpereir force-pushed the thpereir/quark_glm47_mxfp4 branch from fb84a80 to 227ea42 on February 26, 2026 16:39
@thpereir thpereir marked this pull request as ready for review February 26, 2026 16:43
@thpereir thpereir changed the title Add Quark GLM4.7-MXFP4 support [QUARK-402] Add Quark GLM4.7-MXFP4 support Feb 26, 2026
- TP4 weight loading crash (moe.py _load_w13/_load_w2): Derived shard sizes
from loaded_weight.shape instead of padded expert_data.shape to handle MXFP4
padding (384-512).

- num_sms() returning None on ROCm (triton_kernels/target_info.py): Added or
is_hip() to the CUDA branch.

- Custom routing for grouped topk + sigmoid (fused_moe_triton.py): Added
routing_from_topk() bridge function since triton_kernels.routing.routing() only
supports softmax + basic topk. Modified Mxfp4MoEMethod.apply() to use
FusedMoE.select_experts for routing with the triton matmul_ogs for compute.

- Uninitialized bias causing NaN (glm4_moe.py): FusedMoE defaulted has_bias=True,
creating torch.empty bias tensors that were never loaded
(GLM-4.7 has no expert biases). Fixed with has_bias=getattr(config, "moe_ffn_bias", False).

- Fused SwiGLU activation mismatch (fused_moe_triton.py), the final fix:
  triton_kernels' swiglu_fn expects an interleaved [gate0, up0, gate1, up1, ...]
  layout, but the w13 weights produce concatenated [gate|up]. It also uses the
  non-standard s*sigmoid(1.702*s)*(linear+1) instead of the standard
  silu(gate)*up. Fix: bypassed the fused SwiGLU; run matmul_ogs without an
  activation, then manually apply F.silu(gate) * up on the concatenated output.
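
The unfused SwiGLU path described above can be sketched like this: run the expert matmul with no activation, then split the concatenated [gate|up] output and apply the standard silu(gate) * up manually. The concatenated-along-the-last-dimension layout is taken from the PR description; everything else is illustrative:

```python
import torch
import torch.nn.functional as F


def unfused_swiglu(gate_up: torch.Tensor) -> torch.Tensor:
    """gate_up: [..., 2 * intermediate], laid out as concatenated [gate | up]
    (what the w13 weights produce), not interleaved as swiglu_fn expects."""
    gate, up = gate_up.chunk(2, dim=-1)
    return F.silu(gate) * up  # standard SwiGLU, not s*sigmoid(1.702*s)*(x+1)
```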
thpereir force-pushed the thpereir/quark_glm47_mxfp4 branch from 227ea42 to d176130 on February 26, 2026 17:03
```python
        if self.config.compilation_config.level == 1:
            self.model = torch.compile(self.model, fullgraph=True, backend="eager")

    def build_inverse_mapping(self, model_class: Any):
```
A Collaborator commented:

Can this part move to quant_config, instead of being in the model runner?
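
The reviewer's suggestion could look roughly like this, with QuantizationConfig owning the mapping so ModelRunner only triggers it. The names mirror the PR description (`packed_components`, `build_inverse_mapping`, the conventional `packed_modules_mapping` class attribute); the implementation is a hypothetical sketch:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class QuantizationConfig:
    packed_components: Dict[str, List[str]] = field(default_factory=dict)

    def build_inverse_mapping(self, model_class: Any) -> None:
        # Models conventionally expose packed_modules_mapping as a class
        # attribute, e.g. {"gate_up_proj": ["gate_proj", "up_proj"]}.
        mapping = getattr(model_class, "packed_modules_mapping", {})
        self.packed_components = {k: list(v) for k, v in mapping.items()}
```

The model runner would then call `quant_config.build_inverse_mapping(model_class)` before instantiating the model, instead of building the dict itself.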
