[Skill] profile-decode: Analyze decode performance bottlenecks #26

@sunway513

Skill

profile-decode

Priority: P1 — Critical for performance optimization workflow

Motivation

Decode latency, measured as inter-token latency (ITL), is a key metric for interactive LLM serving, alongside time to first token (TTFT), which mainly reflects prefill. When decode performance regresses or falls short of targets, we need a systematic way to break down where time is spent. Currently this requires ad-hoc profiling with rocprof, manual kernel categorization, and manual comparison across runs. A skill that automates this analysis would substantially speed up performance debugging on both MI300X (gfx942) and MI355X (gfx950).

What This Skill Should Do

  1. Capture a decode trace — Run the model with rocprof/RPD tracing enabled for a configurable number of decode steps (default 100). Filter out warmup iterations.
  2. Categorize kernel time — Classify each kernel invocation into: GEMM (ASM/CK/Triton/hipBLASLt), Attention (FA, MLA), MoE (gate + expert GEMM + sort/scatter), AllReduce (NCCL/RCCL), Quantization (FP8/FP4 quant/dequant), and Other.
  3. Build critical path breakdown — Report percentage of total decode time per category. Identify the single most expensive kernel and the top-5 kernels by cumulative time.
  4. Cross-architecture support — Handle both gfx942 (MI300X) and gfx950 (MI355X) kernel naming conventions and MFMA instruction variants.
  5. Comparative mode — Accept two trace files (e.g., public vs clean Docker) and produce a side-by-side diff showing which categories improved/regressed and by how much.
  6. Output summary — Generate a markdown report with tables, percentages, and actionable recommendations (e.g., "GEMM accounts for 72% of decode time — consider tuning GEMM CSV for M=1 shapes").
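
Steps 2 and 3 could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: it assumes the trace has already been reduced to `(kernel_name, duration_ns)` pairs, and the regex patterns in `CATEGORY_PATTERNS` are hypothetical placeholders, since real kernel names vary by backend, library version, and architecture. Pattern order matters here: MoE is checked before GEMM so that an expert GEMM is attributed to MoE.

```python
import re
from collections import defaultdict

# Hypothetical name patterns for illustration only; the real skill would need
# patterns validated against actual gfx942/gfx950 kernel names.
CATEGORY_PATTERNS = [
    ("AllReduce", re.compile(r"allreduce|all_reduce|rccl|nccl", re.I)),
    ("MoE", re.compile(r"moe|expert|topk_gating", re.I)),
    ("Attention", re.compile(r"attention|attn|fmha|mla", re.I)),
    ("Quantization", re.compile(r"quant", re.I)),
    ("GEMM", re.compile(r"gemm|matmul|Cijk_", re.I)),
]


def categorize(kernel_name):
    """Return the first category whose pattern matches, else 'Other'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(kernel_name):
            return category
    return "Other"


def breakdown(events, top_k=5):
    """Aggregate (kernel_name, duration_ns) pairs.

    Returns (percent_by_category, top_kernels), where top_kernels is a list
    of (kernel_name, cumulative_ns) sorted by cumulative time, descending.
    """
    per_category = defaultdict(int)
    per_kernel = defaultdict(int)
    for name, duration_ns in events:
        per_category[categorize(name)] += duration_ns
        per_kernel[name] += duration_ns
    total = sum(per_category.values())
    percent = {}
    if total:
        percent = {c: 100.0 * t / total for c, t in per_category.items()}
    top = sorted(per_kernel.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return percent, top


if __name__ == "__main__":
    # Made-up events, not measured data.
    sample = [
        ("Cijk_Alik_Bljk_gemm", 700),
        ("paged_attention_kernel", 200),
        ("rccl_allreduce", 50),
        ("elementwise_add", 50),
    ]
    percent, top = breakdown(sample)
    for category, pct in sorted(percent.items(), key=lambda kv: -kv[1]):
        print(f"{category}: {pct:.1f}%")
    print("Most expensive kernel:", top[0][0])
```

Distinguishing GEMM backends (ASM, CK, Triton, hipBLASLt) would need a second level of patterns inside the GEMM category; that is left out here because the naming conventions are not given in this issue.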

Acceptance Criteria

  • Can profile a single decode run and produce a categorized time breakdown
  • Correctly distinguishes GEMM backends (ASM vs CK vs Triton vs hipBLASLt) from kernel names
  • Supports both gfx942 and gfx950 architectures
  • Comparative mode highlights categories that regress by more than 5%
  • Output is a self-contained markdown report
  • Works with both single-GPU and TP>1 configurations
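
The comparative-mode criterion could be checked along these lines. This is a sketch under stated assumptions: the inputs are per-category totals in microseconds from two runs, and "regression > 5%" is interpreted as a relative increase in a category's time over the baseline. The issue does not say whether the threshold is relative or absolute, so that interpretation, along with the function names `compare_categories` and `to_markdown`, is hypothetical.

```python
def compare_categories(baseline_us, candidate_us, threshold_pct=5.0):
    """Compare per-category times from two runs.

    Returns rows of (category, old, new, change_pct, status). change_pct is
    None when the category is absent from the baseline.
    """
    rows = []
    for category in sorted(set(baseline_us) | set(candidate_us)):
        old = baseline_us.get(category, 0.0)
        new = candidate_us.get(category, 0.0)
        if old == 0:
            change = None
            status = "new" if new > 0 else "unchanged"
        else:
            change = 100.0 * (new - old) / old
            if change > threshold_pct:
                status = "regressed"
            elif change < -threshold_pct:
                status = "improved"
            else:
                status = "unchanged"
        rows.append((category, old, new, change, status))
    return rows


def to_markdown(rows):
    """Render comparison rows as a markdown table."""
    lines = [
        "| Category | Baseline (us) | Candidate (us) | Change | Status |",
        "|---|---:|---:|---:|---|",
    ]
    for category, old, new, change, status in rows:
        change_str = "n/a" if change is None else f"{change:+.1f}%"
        lines.append(
            f"| {category} | {old:.1f} | {new:.1f} | {change_str} | {status} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up numbers, not measured data.
    baseline = {"GEMM": 100.0, "Attention": 50.0}
    candidate = {"GEMM": 110.0, "Attention": 50.0, "MoE": 5.0}
    print(to_markdown(compare_categories(baseline, candidate)))
```

For TP>1 runs, the per-category totals would first have to be reduced across ranks (for example, by taking the slowest rank); that choice is not specified in this issue and is not covered by the sketch.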
