[Skill] ckfree-validate: Validate CK-free mode correctness #30

## Skill
`ckfree-validate`

**Priority**: P1 — Critical for CK-free deployment

### Motivation
CK-free mode removes the Composable Kernel dependency from AITER, which substantially reduces build time. However, this mode has been a persistent source of correctness bugs. Known issues: ASM GEMM on gfx950 produces garbage output (cosine similarity ≈ 0.006); the Triton fallback for `dynamic_per_token_scaled_quant` ignores `shuffle_scale=True` and writes row-major scales where column-major is expected; and `_fallback_partial_transpose` is a no-op. Each bug was silent: the model still generated text, but the text was incoherent. A validation skill would catch these issues systematically before deployment.

### What This Skill Should Do

1. **Verify GEMM backend selection** — Confirm that when `ATOM_CK_FREE=1`, `use_triton_gemm()` returns True in all code paths: `linear.py`, `attention_mla.py`, and `moe.py`. Ensure ASM GEMM is never invoked on gfx950 (known to produce garbage).
2. **Validate quantization scale layouts** — For FP8 per-1x128 quantization with `shuffle_scale=True`: verify scales are in column-major (transposed) layout after Triton fallback. Compare scale tensor shapes and strides against the CK path reference.
3. **Per-layer cosine similarity** — Run a forward pass on a short prompt and compute cosine similarity between CK-free output and FP16 reference at each layer's output. Flag any layer with cosine < 0.999.
4. **End-to-end generation test** — Generate 50 tokens and compare against FP16 reference generation. Check both token match rate and output embedding cosine similarity.
5. **MoE path validation** — For MoE models (DeepSeek), verify that Triton MoE kernels produce correct expert routing and expert GEMM output (cosine > 0.9999 with properly quantized data).
6. **Regression checklist** — Check all known bug patterns: JIT `SystemExit` vs `RuntimeError`, `_fallback_partial_transpose` actually transposes, `normalize_e4m3fn_to_e4m3fnuz` applied on gfx942.
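
Steps 3 and 4 could be sketched roughly as follows. This is a minimal illustration, not the skill's implementation: NumPy arrays stand in for the captured layer outputs, and the helper names (`cosine_similarity`, `find_divergent_layers`, `token_match_rate`) are hypothetical. Capturing the per-layer outputs and running the CK-free and FP16 reference passes are assumed to happen elsewhere.

```python
import numpy as np

COSINE_THRESHOLD = 0.999  # per-layer threshold from step 3


def cosine_similarity(a, b):
    """Cosine similarity of two arrays, flattened and computed in float64."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Two all-zero outputs count as identical; one all-zero output
        # against a non-zero one counts as a total mismatch.
        return 1.0 if not a.any() and not b.any() else 0.0
    return float(np.dot(a, b) / denom)


def find_divergent_layers(test_outputs, ref_outputs, threshold=COSINE_THRESHOLD):
    """Return (layer_name, cosine) for every layer below the threshold.

    Both arguments are dicts mapping a layer name to that layer's output;
    layers are reported in the reference dict's insertion order.
    """
    divergent = []
    for name, ref in ref_outputs.items():
        cos = cosine_similarity(test_outputs[name], ref)
        if cos < threshold:
            divergent.append((name, cos))
    return divergent


def token_match_rate(test_tokens, ref_tokens):
    """Fraction of positions where the two token sequences agree.

    Length differences count as mismatches.
    """
    n = max(len(test_tokens), len(ref_tokens))
    if n == 0:
        return 1.0
    matches = sum(t == r for t, r in zip(test_tokens, ref_tokens))
    return matches / n
```

The first entry returned by `find_divergent_layers` is usually the most useful one for debugging, since errors introduced at one layer tend to propagate to every later layer.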

### Acceptance Criteria

- [ ] Detects ASM GEMM usage in CK-free mode and flags it as an error
- [ ] Validates scale layout (row-major vs column-major) matches GEMM backend expectation
- [ ] Per-layer cosine similarity report identifies divergent layers
- [ ] End-to-end generation produces coherent output (not garbled text)
- [ ] Covers all three code paths: linear, attention_mla, moe
- [ ] Catches the known `shuffle_scale` bug if reintroduced
- [ ] Works on both gfx942 and gfx950
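
The scale-layout criteria above could be checked from tensor strides. The sketch below uses NumPy arrays as a stand-in for the real scale tensors, and `scale_layout` and `check_shuffled_scales` are hypothetical helpers. It assumes that "column-major" means a dense 2-D buffer whose first dimension is contiguous, with no padding; the actual layout expected by the GEMM backend may differ. Note that NumPy strides are in bytes, whereas PyTorch strides are in elements.

```python
import numpy as np


def scale_layout(scales):
    """Classify a dense 2-D array as 'row-major', 'column-major', or 'other'."""
    if scales.ndim != 2:
        raise ValueError("expected a 2-D scale array")
    rows, cols = scales.shape
    stride0, stride1 = scales.strides  # NumPy strides are in bytes
    item = scales.itemsize
    # Arrays with a size-1 dimension satisfy both tests and are
    # reported as row-major.
    if stride1 == item and stride0 == cols * item:
        return "row-major"
    if stride0 == item and stride1 == rows * item:
        return "column-major"
    return "other"


def check_shuffled_scales(scales, reference):
    """Return a list of problems; an empty list means the check passed.

    `reference` plays the role of the CK-path scales (an assumption of
    this sketch): same logical shape and values, column-major in memory.
    """
    problems = []
    if scales.shape != reference.shape:
        problems.append(f"shape {scales.shape} != reference {reference.shape}")
        return problems
    layout = scale_layout(scales)
    if layout != "column-major":
        problems.append(f"expected column-major scales, got {layout}")
    if not np.allclose(scales, reference):
        problems.append("scale values differ from reference")
    return problems
```

A fallback that ignores `shuffle_scale=True` would produce the right values in a row-major buffer, so a value-only comparison would pass; the stride check is what flags it.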
