[Skill] accuracy-compare: Compare output accuracy across configurations #32

@sunway513

Description

Skill

accuracy-compare

Priority: P2 — Important for quality assurance

Motivation

ATOM supports multiple quantization modes (FP16, FP8, FP4) and GEMM backends (ASM, CK, Triton, hipBLASLt). Each combination produces slightly different numerical results. When switching backends or quantization, we need to verify that accuracy stays within acceptable bounds. Past experience shows that silent accuracy degradation is the worst bug class — the model generates text that looks plausible but is subtly wrong. A systematic comparison skill would catch these issues before they reach production.

What This Skill Should Do

  1. Define reference baseline — Run FP16 inference (no quantization) on a set of calibration prompts. Capture per-layer activations and final logits as the ground truth reference.
  2. Run comparison configs — Execute the same prompts under each target configuration (e.g., FP8 + Triton GEMM, FP8 + hipBLASLt, FP4 + CK, CK-free mode). Capture per-layer activations and logits.
  3. Compute per-layer cosine similarity — For each layer in each config, compute cosine similarity against the FP16 reference. Identify layers where cosine drops below threshold (default 0.999 for FP8, 0.995 for FP4).
  4. Logit-level comparison — Compare top-k logit distributions between configs. Report KL divergence and top-1 token agreement rate.
  5. End-to-end generation comparison — Generate responses for 10 diverse prompts under each config. Report token match rate, BLEU score, and any cases where output diverges significantly.
  6. Generate report — Produce a markdown report with: summary table (config vs overall cosine), per-layer heatmap data, flagged degradation layers, and recommendation (pass/fail per config).
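The metrics in steps 3–5 can be sketched as follows. This is an illustrative helper module, not ATOM code: the function names and the numpy-based implementation are assumptions, and the inputs are activations, logits, and token IDs already captured as plain arrays/lists.

```python
import numpy as np

def cosine_similarity(ref, test):
    # Cosine similarity between flattened activation tensors (step 3).
    a = np.asarray(ref, dtype=np.float64).ravel()
    b = np.asarray(test, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def kl_divergence(ref_logits, test_logits):
    # KL(ref || test) over the softmax distributions of two logit vectors (step 4).
    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()
    p = softmax(np.asarray(ref_logits, dtype=np.float64))
    q = softmax(np.asarray(test_logits, dtype=np.float64))
    return float(np.sum(p * np.log(p / q)))

def top1_agreement(ref_logits, test_logits):
    # Fraction of positions where both configs pick the same top-1 token (step 4).
    r = np.argmax(ref_logits, axis=-1)
    t = np.argmax(test_logits, axis=-1)
    return float(np.mean(r == t))

def token_match_rate(ref_tokens, test_tokens):
    # Positional token match over the shorter of the two generations (step 5).
    n = min(len(ref_tokens), len(test_tokens))
    if n == 0:
        return 0.0
    return sum(1 for i in range(n) if ref_tokens[i] == test_tokens[i]) / n
```

In a real run, each function would be applied per layer (cosine) or per decode position (KL, top-1) across all calibration prompts, and the results aggregated into the report of step 6.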

Acceptance Criteria

  • FP16 reference baseline is correctly captured (no quantization artifacts)
  • Per-layer cosine similarity is computed for all transformer layers
  • Report correctly identifies known-bad configurations (e.g., ASM GEMM on gfx950 shows cosine ≈ 0.006)
  • FP8 configs show cosine > 0.999 on most layers
  • FP4 configs show cosine > 0.995 on most layers
  • End-to-end comparison catches garbled output (token match rate < 50%)
  • Report is self-contained markdown with actionable pass/fail per configuration
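A minimal per-configuration pass/fail check matching these criteria might look like the sketch below. `judge_config`, the threshold table, and the choice of 0.95 as the cutoff for "most layers" are hypothetical — the criteria above do not quantify "most", so that parameter is an assumption.

```python
# Cosine thresholds taken from the acceptance criteria above.
THRESHOLDS = {"fp8": 0.999, "fp4": 0.995}

def judge_config(quant, per_layer_cosine, token_match_rate,
                 min_layer_fraction=0.95, min_token_match=0.5):
    # Return "pass" or "fail" for one configuration.
    # min_layer_fraction encodes "most layers" and is an assumed value.
    thr = THRESHOLDS[quant]
    ok_layers = sum(1 for c in per_layer_cosine if c > thr)
    if ok_layers / len(per_layer_cosine) < min_layer_fraction:
        return "fail"
    # Token match rate below 50% indicates garbled end-to-end output.
    if token_match_rate < min_token_match:
        return "fail"
    return "pass"
```

The report generator would call this once per configuration and emit the result in the summary table alongside the raw metrics.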

Metadata

Assignees: none · Labels: none · Projects: none · Milestone: none