[QUARK-402] Add Quark GLM4.7-MXFP4 support #223
Conversation
Force-pushed: 8d605d9 → 03fff40
Pull request overview
This pull request adds support for Quark GLM4.7-MXFP4 quantization by implementing packed/merged module handling for layer-specific quantization exclusion. The changes enable proper handling of scenarios where users want to exclude specific component layers (e.g., gate_proj, up_proj) from quantization when they are packed into a single merged layer (e.g., gate_up_proj).
Changes:
- Added `build_packed_components_mapping` utility function to create inverse mappings from packed parameter names to their component checkpoint weight names
- Extended the `should_ignore_layer` function to check whether any component of a packed module should be excluded from quantization
- Added a `prefix` parameter to `ColumnParallelLinear`, `MergedColumnParallelLinear`, `QKVParallelLinear`, and `RowParallelLinear` to enable per-layer quantization config evaluation
- Added a `packed_components` field to `QuantizationConfig` to store the inverse mapping
- Implemented `build_inverse_mapping` in `ModelRunner` to populate `packed_components` before model instantiation
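The mapping and exclusion check described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from `atom/models/utils.py`: the input shape (`component -> (packed_name, shard_id)`) and the matching logic in `should_ignore_layer` are assumptions made for the sketch.

```python
from typing import Dict, List, Tuple

def build_packed_components_mapping(
    component_to_packed: Dict[str, Tuple[str, int]],
) -> Dict[str, List[str]]:
    """Invert a component -> (packed_name, shard_id) map into
    packed_name -> [component checkpoint weight names]."""
    inverse: Dict[str, List[str]] = {}
    for component, (packed_name, _shard_id) in component_to_packed.items():
        inverse.setdefault(packed_name, []).append(component)
    return inverse

def should_ignore_layer(prefix: str,
                        ignored: List[str],
                        packed_components: Dict[str, List[str]]) -> bool:
    """A layer is excluded from quantization if its own name is ignored,
    or if any component packed into it (e.g. gate_proj inside
    gate_up_proj) is ignored."""
    base = prefix.rsplit(".", 1)[-1]
    if base in ignored:
        return True
    return any(c in ignored for c in packed_components.get(base, []))

mapping = build_packed_components_mapping(
    {"gate_proj": ("gate_up_proj", 0), "up_proj": ("gate_up_proj", 1)})
print(should_ignore_layer("model.layers.0.mlp.gate_up_proj",
                          ["gate_proj"], mapping))  # True
```

This is why the inverse mapping must exist before model instantiation: each linear layer consults it through its `prefix` at construction time.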
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| atom/models/utils.py | Added build_packed_components_mapping function and extended should_ignore_layer to handle packed modules |
| atom/model_ops/linear.py | Added prefix parameter to ColumnParallelLinear, MergedColumnParallelLinear, QKVParallelLinear, and RowParallelLinear for layer-specific quantization handling |
| atom/model_engine/model_runner.py | Added build_inverse_mapping method to build packed components mapping before model initialization |
| atom/config.py | Added packed_components field to QuantizationConfig |
```python
class ReplicatedLinear(LinearBase):
    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = False,
        quant_config: Optional[QuantizationConfig] = None,
        source_quant_dtype: torch.dtype = None,
        **kwargs,
    ):
```
The ReplicatedLinear class is being instantiated with a prefix argument in multiple places throughout the codebase (e.g., deepseek_v2.py, gpt_oss.py, mixtral.py, qwen3_moe.py, qwen3_next.py), but the class definition doesn't accept a prefix parameter. This parameter is likely being silently ignored due to the **kwargs in the constructor. For consistency with other linear layer classes (ColumnParallelLinear, RowParallelLinear, MergedColumnParallelLinear, QKVParallelLinear) and to properly support quantization exclusion for replicated layers, ReplicatedLinear should also accept and handle the prefix parameter.
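A minimal sketch of the fix this comment suggests. `LinearBase` and `QuantizationConfig` are stubbed here for illustration, and `source_quant_dtype` is omitted to keep the sketch self-contained; the real change would go in `atom/model_ops/linear.py`:

```python
from typing import Optional

class LinearBase:  # stub standing in for atom's LinearBase
    pass

class QuantizationConfig:  # stub standing in for atom's QuantizationConfig
    pass

class ReplicatedLinear(LinearBase):
    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = False,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",  # new: accept prefix explicitly instead of
                           # letting **kwargs silently swallow it
        **kwargs,
    ):
        self.input_size = input_size
        self.output_size = output_size
        self.prefix = prefix
        # With prefix stored, quant_config can evaluate per-layer
        # exclusions here, just like the other linear classes do.

layer = ReplicatedLinear(16, 32, prefix="model.layers.0.mlp.shared_expert")
print(layer.prefix)
```

Without this, call sites passing `prefix=` compile and run, but the value is dropped on the floor, which is exactly the silent-ignore problem the comment describes.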
Hi @thpereir, could you post the commands you used for testing?
To serve, I used:
To run lm-eval:
Force-pushed: 03fff40 → fb84a80
@haoyangli0109 made more changes to fix the issues with
Force-pushed: fb84a80 → 227ea42
- TP4 weight loading crash (`moe.py` `_load_w13`/`_load_w2`): derived shard sizes from `loaded_weight.shape` instead of the padded `expert_data.shape`, to handle MXFP4 padding (384 → 512).
- `num_sms()` returning `None` on ROCm (`triton_kernels/target_info.py`): added `or is_hip()` to the CUDA branch.
- Custom routing for grouped top-k + sigmoid (`fused_moe_triton.py`): added a `routing_from_topk()` bridge function, since `triton_kernels.routing.routing()` only supports softmax + basic top-k. Modified `Mxfp4MoEMethod.apply()` to use `FusedMoE.select_experts` for routing while keeping the triton `matmul_ogs` for compute.
- Uninitialized bias causing NaN (`glm4_moe.py`): `FusedMoE` defaulted `has_bias=True`, creating `torch.empty` bias tensors that were never loaded (GLM-4.7 has no expert biases). Fixed with `has_bias=getattr(config, "moe_ffn_bias", False)`.
- Fused SwiGLU activation mismatch (`fused_moe_triton.py`), the final fix: triton_kernels' `swiglu_fn` expects an interleaved `[gate0, up0, gate1, up1, ...]` layout, but the w13 weights produce a concatenated `[gate | up]` layout; it also uses the non-standard `s * sigmoid(1.702 * s) * (linear + 1)` instead of the standard `silu(gate) * up`. Fix: bypassed the fused SwiGLU, ran `matmul_ogs` without activation, then manually applied `F.silu(gate) * up` on the concatenated output.
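The SwiGLU workaround in the last fix can be sketched as below. This is a simplified, hypothetical illustration of applying standard `silu(gate) * up` to a concatenated `[gate | up]` tensor; the real change lives in `fused_moe_triton.py` and operates on the `matmul_ogs` output.

```python
import torch
import torch.nn.functional as F

def swiglu_on_concatenated(h: torch.Tensor) -> torch.Tensor:
    """Apply standard silu(gate) * up to a [..., 2*d] tensor laid out as
    concatenated [gate | up]. This replaces triton_kernels' fused
    swiglu_fn, which assumes an interleaved [gate0, up0, ...] layout and
    computes s * sigmoid(1.702 * s) * (linear + 1) instead."""
    gate, up = h.chunk(2, dim=-1)
    return F.silu(gate) * up

h = torch.randn(4, 8)          # e.g. unactivated matmul_ogs output
out = swiglu_on_concatenated(h)
print(out.shape)               # torch.Size([4, 4])
```

Splitting with `chunk(2, dim=-1)` is what makes the concatenated layout work: the first half is the gate projection, the second half the up projection.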
Force-pushed: 227ea42 → d176130
```python
if self.config.compilation_config.level == 1:
    self.model = torch.compile(self.model, fullgraph=True, backend="eager")

def build_inverse_mapping(self, model_class: Any):
```
Can this part move to `quant_config` instead of living in the model runner?
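One way the suggestion could look, as a hypothetical sketch: `QuantizationConfig` is simplified to a dataclass here, and the mapping is assumed to already be `packed_name -> [components]` on the model class (via a `packed_modules_mapping` attribute, a name assumed for this sketch).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class QuantizationConfig:  # simplified stand-in for atom/config.py
    packed_components: Dict[str, List[str]] = field(default_factory=dict)

    def build_inverse_mapping(self, model_class: Any) -> None:
        """Populate packed_components from the model class itself, so
        ModelRunner no longer owns this step and any consumer of the
        config gets the mapping for free."""
        mapping = getattr(model_class, "packed_modules_mapping", {})
        self.packed_components = {k: list(v) for k, v in mapping.items()}

class FakeModel:  # hypothetical model class for the usage example
    packed_modules_mapping = {"gate_up_proj": ["gate_proj", "up_proj"]}

cfg = QuantizationConfig()
cfg.build_inverse_mapping(FakeModel)
print(cfg.packed_components)
```

Keeping the mapping on the config also matches where `packed_components` is stored in this PR, so only the builder would move.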
Motivation
Technical Details
Test Plan
Test Result
Server:
lm-eval
GSM8K accuracy
Submission Checklist