Skill: quant-validate
Priority: P0 — Most frequently encountered and hardest-to-debug class of issues
Motivation
Quantization bugs (FP8, FP4, INT4, MXFP4) are the #1 source of inference accuracy issues in ATOM. Past incidents include:
- Scale layout mismatches between quantization and GEMM kernels (row-major vs column-major)
- ASM GEMM producing garbage output on gfx950 (cosine_sim ≈ 0.006)
- Silent parameter ignoring in fallback paths (`shuffle_scale=True` ignored by Triton fallback)
- Weight normalization issues (e4m3fn → e4m3fnuz conversion)
These bugs are extremely time-consuming to diagnose because they often produce plausible-looking but incorrect outputs.
What This Skill Should Do
Given a model + quantization config, the skill should:
- Validate the full quantization → GEMM chain (not components in isolation)
  - Run reference computation in FP32/FP16
  - Run quantized computation through the target path (ASM/CK/Triton)
  - Compute cosine similarity and max absolute error
  - Flag any layer with cosine_sim < 0.999
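The chain check above can be sketched in a few lines of NumPy. This is a minimal illustration, not the skill's implementation: NumPy has no FP8/FP4 types, so symmetric per-row int8 quantization stands in for the real formats, and `quantize_per_row`, `quantized_gemm`, and `validate_layer` are hypothetical names. In the real skill, `quantized_gemm` would be replaced by the actual ASM/CK/Triton path.

```python
import numpy as np

def quantize_per_row(w):
    """Symmetric per-row int8 quantization (stand-in for FP8/FP4)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantized_gemm(x, q, scale):
    """Placeholder for the target GEMM path: dequantize, then matmul."""
    w_hat = q.astype(np.float32) * scale
    return x @ w_hat.T

def cosine_sim(a, b):
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def validate_layer(x, w, threshold=0.999):
    """Run reference and quantized paths end to end and compare them."""
    ref = x.astype(np.float32) @ w.astype(np.float32).T  # FP32 reference
    q, scale = quantize_per_row(w)
    out = quantized_gemm(x, q, scale)                    # full quant -> GEMM chain
    cos = cosine_sim(ref, out)
    return {
        "cosine_sim": cos,
        "max_abs_err": float(np.abs(ref - out).max()),
        "passed": cos >= threshold,
    }

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64)).astype(np.float32)  # activations
w = rng.standard_normal((32, 64)).astype(np.float32)  # weights (N, K)
result = validate_layer(x, w)
print(result)
```

The comparison is made on the GEMM output rather than on the dequantized weights, so a kernel that misreads correct scales is still caught.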
- Check scale layout consistency
  - Verify scale tensor shapes match what the GEMM kernel expects
  - Detect row-major vs column-major mismatches
  - Verify `shuffle_scale` and `transpose_scale` flags are respected
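A shape and memory-order check of this kind might look like the sketch below. It assumes block-wise scales of shape `(N, K // block_size)` for a weight of shape `(N, K)`. That layout, the `check_scale_layout` helper, and the `expect_row_major` parameter are illustrative assumptions, not the contract of any specific ATOM or AITER kernel.

```python
import numpy as np

def check_scale_layout(weight_shape, scale, block_size, expect_row_major=True):
    """Return a list of human-readable layout problems (empty if none).

    Assumes K is divisible by block_size.
    """
    n, k = weight_shape
    expected = (n, k // block_size)
    issues = []
    if scale.shape != expected:
        if scale.shape == expected[::-1]:
            issues.append(f"scale looks transposed: got {scale.shape}, expected {expected}")
        else:
            issues.append(f"scale shape mismatch: got {scale.shape}, expected {expected}")
    # C-contiguous means row-major, F-contiguous means column-major.
    if expect_row_major and not scale.flags["C_CONTIGUOUS"]:
        issues.append("scale is not row-major (C-contiguous)")
    if not expect_row_major and not scale.flags["F_CONTIGUOUS"]:
        issues.append("scale is not column-major (F-contiguous)")
    return issues

scale = np.ones((32, 4), dtype=np.float32)  # row-major scales for a (32, 128) weight
print(check_scale_layout((32, 128), scale, block_size=32))
print(check_scale_layout((32, 128), scale.T, block_size=32))
print(check_scale_layout((32, 128), np.asfortranarray(scale), block_size=32))
```

A static check like this only covers shape and strides. Whether a flag such as `shuffle_scale` is actually honored still has to be confirmed by comparing GEMM outputs.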
- Test all backend paths
  - ASM GEMM, CK GEMM, Triton GEMM, hipBLASLt
  - CK-free mode fallbacks
  - Report which backends produce correct results
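One way to structure the backend sweep is a table of callables that are all scored against the same reference. The sketch below uses three hypothetical stand-in functions (`fp16_inputs`, `broken_layout`, `unavailable`) in place of the real ASM, CK, Triton, and hipBLASLt entry points, which are not shown in this issue.

```python
import numpy as np

def cosine_sim(a, b):
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_backends(backends, x, w, threshold=0.999):
    """Run every backend on the same inputs and score it against FP32."""
    ref = x @ w.T
    rows = []
    for name, fn in backends.items():
        try:
            cos = cosine_sim(ref, fn(x, w))
            rows.append((name, cos, "PASS" if cos >= threshold else "FAIL"))
        except Exception as exc:  # an unavailable backend should not abort the sweep
            rows.append((name, float("nan"), f"ERROR: {exc}"))
    return rows

# Hypothetical stand-ins for real backends.
def fp16_inputs(x, w):
    # Inputs rounded to FP16, accumulated in FP32.
    return x.astype(np.float16).astype(np.float32) @ w.astype(np.float16).astype(np.float32).T

def broken_layout(x, w):
    # Simulates a layout bug: weight rows are consumed in the wrong order.
    return x @ w[::-1].T

def unavailable(x, w):
    raise RuntimeError("backend not built")

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64)).astype(np.float32)
w = rng.standard_normal((32, 64)).astype(np.float32)
rows = compare_backends(
    {"fp16_inputs": fp16_inputs, "broken_layout": broken_layout, "unavailable": unavailable},
    x, w,
)
for name, cos, status in rows:
    print(f"{name:<16}{cos:>10.4f}  {status}")
```

Catching exceptions per backend keeps the sweep useful in CK-free builds, where some paths are expected to be missing.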
- Generate a diagnostic report
  - Per-layer cosine similarity
  - Per-layer scale layout analysis
  - Backend comparison table
  - Actionable recommendations
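A plain-text report with per-layer pass/fail could be assembled roughly as follows. The `format_report` helper, the layer names, and the numbers in the sample input are made up purely to show the output shape. They are not measurements from ATOM.

```python
def format_report(layer_results, threshold=0.999):
    """layer_results: iterable of (layer_name, cosine_sim, max_abs_err)."""
    lines = [f"{'layer':<24}{'cosine_sim':>12}{'max_abs_err':>14}  status"]
    failures = []
    for name, cos, err in layer_results:
        ok = cos >= threshold
        lines.append(f"{name:<24}{cos:>12.6f}{err:>14.4e}  {'PASS' if ok else 'FAIL'}")
        if not ok:
            failures.append(name)
    lines.append("")
    if failures:
        lines.append(
            "Recommendation: check scale layout and backend selection for: "
            + ", ".join(failures)
        )
    else:
        lines.append("All layers passed.")
    return "\n".join(lines)

# Illustrative input only: hypothetical layer names and values.
sample = [
    ("layers.0.qkv_proj", 0.999950, 1.2e-2),
    ("layers.0.down_proj", 0.006000, 3.4e+1),
]
report = format_report(sample)
print(report)
```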
Key Lessons to Encode
- Always check direction, not just magnitude: cosine_sim is the gold standard
- Test the full quant→GEMM chain, not components in isolation
- Silent parameter ignoring is the worst bug class — fallbacks must implement ALL parameters
- Use `/v1/completions`, not `/v1/chat/completions`, for debugging
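The lesson about silent parameter ignoring suggests a simple rule for fallbacks: either implement a flag or refuse it loudly. The sketch below shows that pattern with a hypothetical `fallback_quantize` function. The flag names come from this issue, but the per-row int8 quantization and the transposed-scale behavior are assumptions made for illustration.

```python
import numpy as np

def fallback_quantize(x, shuffle_scale=False, transpose_scale=False):
    """Hypothetical fallback path that never drops a flag silently."""
    if shuffle_scale:
        # Not implemented here, so fail loudly instead of returning
        # unshuffled scales that a downstream GEMM would misread.
        raise NotImplementedError("fallback does not implement shuffle_scale=True")
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    if transpose_scale:
        scale = np.ascontiguousarray(scale.T)  # honored, not ignored
    return q, scale

x = np.arange(12, dtype=np.float32).reshape(3, 4)
q, scale = fallback_quantize(x, transpose_scale=True)
print(scale.shape)

try:
    fallback_quantize(x, shuffle_scale=True)
except NotImplementedError as exc:
    print("refused:", exc)
```

A test that calls every backend with each flag set and asserts on either the changed output or the raised error would turn this class of bug into an immediate failure rather than a plausible-looking wrong answer.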
Acceptance Criteria
- Skill can validate FP8, FP4, INT4, MXFP4 quantization paths
- Detects scale layout mismatches automatically
- Tests multiple GEMM backends and compares results
- Generates clear diagnostic report with pass/fail per layer
- Includes instructions for both gfx942 (MI300X) and gfx950 (MI355X)
References
- CK-free debug conclusion: Past debugging sessions on ASM GEMM garbage output
- ATOM `linear.py`, `moe.py`, `attention_mla.py`: backend selection logic
- AITER `ops/quant.py`: quantization implementations