[Skill] accuracy-compare: Compare output accuracy across configurations #32

@sunway513

Description

Skill

accuracy-compare

Priority: P2 — Important for quality assurance

Motivation

ATOM supports multiple quantization modes (FP16, FP8, FP4) and GEMM backends (ASM, CK, Triton, hipBLASLt). Each combination produces slightly different numerical results. When switching backends or quantization, we need to verify that accuracy stays within acceptable bounds. Past experience shows that silent accuracy degradation is the worst bug class — the model generates text that looks plausible but is subtly wrong. A systematic comparison skill would catch these issues before they reach production.

What This Skill Should Do

  1. Define reference baseline — Run FP16 inference (no quantization) on a set of calibration prompts. Capture per-layer activations and final logits as the ground truth reference.
  2. Run comparison configs — Execute the same prompts under each target configuration (e.g., FP8 + Triton GEMM, FP8 + hipBLASLt, FP4 + CK, CK-free mode). Capture per-layer activations and logits.
  3. Compute per-layer cosine similarity — For each layer in each config, compute cosine similarity against the FP16 reference. Identify layers where cosine drops below threshold (default 0.999 for FP8, 0.995 for FP4).
  4. Logit-level comparison — Compare top-k logit distributions between configs. Report KL divergence and top-1 token agreement rate.
  5. End-to-end generation comparison — Generate responses for 10 diverse prompts under each config. Report token match rate, BLEU score, and any cases where output diverges significantly.
  6. Generate report — Produce a markdown report with: summary table (config vs overall cosine), per-layer heatmap data, flagged degradation layers, and recommendation (pass/fail per config).
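The metrics in steps 3–5 can be sketched as follows. This is an illustrative helper module, not ATOM code: the function names and the numpy-based implementation are assumptions, and the inputs are activations, logits, and token IDs already captured as plain arrays/lists.

```python
import numpy as np

def cosine_similarity(ref, test):
    # Cosine similarity between flattened activation tensors (step 3).
    a = np.asarray(ref, dtype=np.float64).ravel()
    b = np.asarray(test, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def kl_divergence(ref_logits, test_logits):
    # KL(ref || test) over the softmax distributions of two logit vectors (step 4).
    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()
    p = softmax(np.asarray(ref_logits, dtype=np.float64))
    q = softmax(np.asarray(test_logits, dtype=np.float64))
    return float(np.sum(p * np.log(p / q)))

def top1_agreement(ref_logits, test_logits):
    # Fraction of positions where both configs pick the same top-1 token (step 4).
    r = np.argmax(ref_logits, axis=-1)
    t = np.argmax(test_logits, axis=-1)
    return float(np.mean(r == t))

def token_match_rate(ref_tokens, test_tokens):
    # Positional token match over the shorter of the two generations (step 5).
    n = min(len(ref_tokens), len(test_tokens))
    if n == 0:
        return 0.0
    return sum(1 for i in range(n) if ref_tokens[i] == test_tokens[i]) / n
```

In a real run, each function would be applied per layer (cosine) or per decode position (KL, top-1) across all calibration prompts, and the results aggregated into the report of step 6.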

Acceptance Criteria

  • FP16 reference baseline is correctly captured (no quantization artifacts)
  • Per-layer cosine similarity is computed for all transformer layers
  • Report correctly identifies known-bad configurations (e.g., ASM GEMM on gfx950 shows cosine ≈ 0.006)
  • FP8 configs show cosine > 0.999 on most layers
  • FP4 configs show cosine > 0.995 on most layers
  • End-to-end comparison catches garbled output (token match rate < 50%)
  • Report is self-contained markdown with actionable pass/fail per configuration
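A minimal per-configuration pass/fail check matching these criteria might look like the sketch below. `judge_config`, the threshold table, and the choice of 0.95 as the cutoff for "most layers" are hypothetical — the criteria above do not quantify "most", so that parameter is an assumption.

```python
# Cosine thresholds taken from the acceptance criteria above.
THRESHOLDS = {"fp8": 0.999, "fp4": 0.995}

def judge_config(quant, per_layer_cosine, token_match_rate,
                 min_layer_fraction=0.95, min_token_match=0.5):
    # Return "pass" or "fail" for one configuration.
    # min_layer_fraction encodes "most layers" and is an assumed value.
    thr = THRESHOLDS[quant]
    ok_layers = sum(1 for c in per_layer_cosine if c > thr)
    if ok_layers / len(per_layer_cosine) < min_layer_fraction:
        return "fail"
    # Token match rate below 50% indicates garbled end-to-end output.
    if token_match_rate < min_token_match:
        return "fail"
    return "pass"
```

The report generator would call this once per configuration and emit the result in the summary table alongside the raw metrics.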

Metadata

Assignees: none · Labels: none · Projects: none · Milestone: none