Skill: accuracy-compare
Priority: P2 — Important for quality assurance
Motivation
ATOM supports multiple quantization modes (FP16, FP8, FP4) and GEMM backends (ASM, CK, Triton, hipBLASLt). Each combination produces slightly different numerical results. When switching backends or quantization, we need to verify that accuracy stays within acceptable bounds. Past experience shows that silent accuracy degradation is the worst bug class — the model generates text that looks plausible but is subtly wrong. A systematic comparison skill would catch these issues before they reach production.
What This Skill Should Do
- Define reference baseline — Run FP16 inference (no quantization) on a set of calibration prompts. Capture per-layer activations and final logits as the ground truth reference.
- Run comparison configs — Execute the same prompts under each target configuration (e.g., FP8 + Triton GEMM, FP8 + hipBLASLt, FP4 + CK, CK-free mode). Capture per-layer activations and logits.
- Compute per-layer cosine similarity — For each layer in each config, compute cosine similarity against the FP16 reference. Identify layers where cosine drops below threshold (default 0.999 for FP8, 0.995 for FP4).
- Logit-level comparison — Compare top-k logit distributions between configs. Report KL divergence and top-1 token agreement rate.
- End-to-end generation comparison — Generate responses for 10 diverse prompts under each config. Report token match rate, BLEU score, and any cases where output diverges significantly.
- Generate report — Produce a markdown report with: summary table (config vs overall cosine), per-layer heatmap data, flagged degradation layers, and recommendation (pass/fail per config).
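The per-layer and logit-level comparisons above reduce to a few small metric functions. A minimal NumPy sketch (function names and the default top-k size are illustrative, not part of the spec):

```python
import numpy as np

def cosine_similarity(ref: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between flattened activation tensors."""
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def topk_kl_divergence(ref_logits: np.ndarray, test_logits: np.ndarray,
                       k: int = 10) -> float:
    """KL(ref || test) restricted to the reference's top-k tokens."""
    idx = np.argsort(ref_logits)[-k:]
    p = softmax(ref_logits)[idx]
    q = softmax(test_logits)[idx]
    p, q = p / p.sum(), q / q.sum()  # renormalise over the top-k slice
    return float(np.sum(p * np.log(p / q)))

def top1_agreement(ref_logits: np.ndarray, test_logits: np.ndarray) -> bool:
    """True if both configs would greedily pick the same next token."""
    return int(np.argmax(ref_logits)) == int(np.argmax(test_logits))
```

In practice these would run over activations captured via forward hooks for each (layer, config) pair, with the FP16 capture as `ref`.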
Acceptance Criteria
- FP16 reference baseline is correctly captured (no quantization artifacts)
- Per-layer cosine similarity is computed for all transformer layers
- Report correctly identifies known-bad configurations (e.g., ASM GEMM on gfx950 shows cosine similarity ≈ 0.006)
- FP8 configs show cosine > 0.999 on most layers
- FP4 configs show cosine > 0.995 on most layers
- End-to-end comparison catches garbled output (token match rate < 50%)
- Report is self-contained markdown with actionable pass/fail per configuration
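The thresholds above can be encoded directly as a pass/fail gate. A hedged sketch; the thresholds come from the criteria, but the function shape and the `max_flagged_fraction` interpretation of "most layers" are assumptions:

```python
# Per-quantization cosine thresholds from the acceptance criteria.
COSINE_THRESHOLDS = {"fp8": 0.999, "fp4": 0.995}
MIN_TOKEN_MATCH_RATE = 0.5  # below this, generation is considered garbled

def evaluate_config(quant: str, layer_cosines: list[float],
                    token_match_rate: float,
                    max_flagged_fraction: float = 0.1):
    """Return (passed, indices of layers below the cosine threshold).

    A config passes if its end-to-end token match rate clears the floor
    and no more than max_flagged_fraction of layers (an assumed reading
    of "most layers") fall below the per-quant cosine threshold.
    """
    threshold = COSINE_THRESHOLDS[quant]
    flagged = [i for i, c in enumerate(layer_cosines) if c < threshold]
    passed = (token_match_rate >= MIN_TOKEN_MATCH_RATE
              and len(flagged) <= max_flagged_fraction * len(layer_cosines))
    return passed, flagged
```

The per-config verdict and the `flagged` list feed straight into the report's summary table and degradation section.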