Open
Description
Skill: profile-decode
Priority: P1 — Critical for performance optimization workflow
Motivation
Decode latency directly determines inter-token latency (ITL), which, together with time to first token (TTFT), is a key metric for interactive LLM serving. When performance regresses or falls short of targets, we need a systematic way to break down where time is spent. Currently this requires ad-hoc profiling with rocprof, manually categorizing kernels, and comparing across runs. A skill that automates this analysis would dramatically speed up performance debugging on both MI300X (gfx942) and MI355X (gfx950).
What This Skill Should Do
- Capture a decode trace — Run the model with rocprof/RPD tracing enabled for a configurable number of decode steps (default 100). Filter out warmup iterations.
- Categorize kernel time — Classify each kernel invocation into: GEMM (ASM/CK/Triton/hipBLASLt), Attention (FA, MLA), MoE (gate + expert GEMM + sort/scatter), AllReduce (NCCL/RCCL), Quantization (FP8/FP4 quant/dequant), and Other.
- Build critical path breakdown — Report percentage of total decode time per category. Identify the single most expensive kernel and the top-5 kernels by cumulative time.
- Cross-architecture support — Handle both gfx942 (MI300X) and gfx950 (MI355X) kernel naming conventions and MFMA instruction variants.
- Comparative mode — Accept two trace files (e.g., public vs clean Docker) and produce a side-by-side diff showing which categories improved/regressed and by how much.
- Output summary — Generate a markdown report with tables, percentages, and actionable recommendations (e.g., "GEMM accounts for 72% of decode time — consider tuning GEMM CSV for M=1 shapes").
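The categorization and breakdown steps above could be sketched roughly as follows. This is a minimal illustration, not the skill's implementation: the regex patterns in `CATEGORY_PATTERNS`, the function names, and the `(kernel_name, duration_us)` record shape are all assumptions made for the example. Real kernel names on gfx942 and gfx950 would need to be taken from actual traces, and distinguishing GEMM backends would need further patterns.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; real kernel names must be
# collected from actual rocprof/RPD traces. Checked in order, first match wins,
# so MoE is listed before GEMM to keep expert GEMMs in the MoE bucket.
CATEGORY_PATTERNS = [
    ("AllReduce", re.compile(r"all_?reduce|rccl|nccl", re.I)),
    ("MoE", re.compile(r"moe|topk|expert", re.I)),
    ("Attention", re.compile(r"attn|attention|fmha", re.I)),
    ("Quantization", re.compile(r"quant", re.I)),
    ("GEMM", re.compile(r"gemm|matmul", re.I)),
]


def categorize(kernel_name):
    """Return the first category whose pattern matches, else 'Other'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(kernel_name):
            return category
    return "Other"


def breakdown(records, top_n=5):
    """Aggregate (kernel_name, duration_us) records.

    Returns (percent of total time per category, top-N kernels by cumulative time).
    """
    per_category = defaultdict(float)
    per_kernel = defaultdict(float)
    for name, duration_us in records:
        per_category[categorize(name)] += duration_us
        per_kernel[name] += duration_us
    total = sum(per_category.values())
    percentages = (
        {cat: 100.0 * t / total for cat, t in per_category.items()} if total else {}
    )
    top = sorted(per_kernel.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return percentages, top


# Made-up kernel names and durations, purely to show the call shape.
sample = [
    ("k_gemm_a", 60.0),
    ("k_attn_b", 20.0),
    ("k_all_reduce_c", 10.0),
    ("k_gemm_a", 10.0),
]
percentages, top_kernels = breakdown(sample)
```

The first entry of `top_kernels` is the single most expensive kernel by cumulative time. Ordered first-match classification keeps the rules easy to audit, at the cost of needing care when one kernel name matches several categories.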
Acceptance Criteria
- Can profile a single decode run and produce a categorized time breakdown
- Correctly distinguishes GEMM backends (ASM vs CK vs Triton vs hipBLASLt) from kernel names
- Supports both gfx942 and gfx950 architectures
- Comparative mode highlights regressions > 5%
- Output is a self-contained markdown report
- Works with both single-GPU and TP>1 configurations
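One possible shape for the comparative-mode criterion is sketched below. It assumes that "regression > 5%" means a category's total time grew by more than 5% relative to the baseline run; the issue does not pin this down, so that reading, the function names `compare` and `to_markdown`, and the per-category input dictionaries are all assumptions for illustration.

```python
def compare(baseline, candidate, threshold_pct=5.0):
    """Compare per-category times (microseconds) from two runs.

    Returns rows of (category, base, cand, delta_pct, regressed).
    delta_pct is None when the category is absent from the baseline,
    since a relative change is undefined there.
    """
    rows = []
    for category in sorted(set(baseline) | set(candidate)):
        base = baseline.get(category, 0.0)
        cand = candidate.get(category, 0.0)
        delta_pct = 100.0 * (cand - base) / base if base > 0 else None
        regressed = delta_pct is not None and delta_pct > threshold_pct
        rows.append((category, base, cand, delta_pct, regressed))
    return rows


def to_markdown(rows):
    """Render comparison rows as a self-contained markdown table."""
    lines = [
        "| Category | Baseline (us) | Candidate (us) | Change | Flag |",
        "|---|---:|---:|---:|---|",
    ]
    for category, base, cand, delta_pct, regressed in rows:
        change = "n/a" if delta_pct is None else f"{delta_pct:+.1f}%"
        flag = "REGRESSED" if regressed else ""
        lines.append(f"| {category} | {base:.1f} | {cand:.1f} | {change} | {flag} |")
    return "\n".join(lines)


# Made-up numbers, purely to show the call shape.
rows = compare(
    {"GEMM": 100.0, "Attention": 50.0},
    {"GEMM": 110.0, "Attention": 50.0, "Other": 5.0},
)
report = to_markdown(rows)
```

Comparing absolute per-category time rather than percentage share avoids flagging a category whose share rose only because another category got faster.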