Skill: benchmark-report
Priority: P1 — Every evaluation requires running the same benchmark matrix manually
Motivation
Performance evaluation for ATOM involves running benchmark_serving across multiple configurations: varying concurrency levels (8, 16, 32, 64, 128), input/output lengths (128/128, 1024/1024, 4096/512, etc.), and models. Currently this is done with ad-hoc shell scripts, and results are manually collected into spreadsheets. A standardized skill would ensure consistent, reproducible benchmarks and formatted reports.
What This Skill Should Do
- Run a standard benchmark matrix
  - Multiple concurrency levels (8, 16, 32, 64, 128)
  - Standard ISL/OSL combinations (128/128, 1024/1024, 4096/512)
  - Configurable warm-up and measurement iterations
  - Automatic server health check before benchmarking
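The matrix and pre-flight check above could be sketched roughly as follows. This is a minimal illustration, not the skill's implementation: the `/health` endpoint path and the helper names are assumptions.

```python
import itertools
import urllib.error
import urllib.request

# Default matrix from the issue description.
CONCURRENCY = [8, 16, 32, 64, 128]
ISL_OSL = [(128, 128), (1024, 1024), (4096, 512)]

def server_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe a (hypothetical) /health endpoint before starting any run."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def benchmark_matrix():
    """Yield every (concurrency, ISL, OSL) combination in the standard matrix."""
    for conc, (isl, osl) in itertools.product(CONCURRENCY, ISL_OSL):
        yield {"concurrency": conc, "isl": isl, "osl": osl}
```

With the defaults above this yields 5 × 3 = 15 configurations per model.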
- Collect and format results
  - Output tokens/sec, total tokens/sec
  - TTFT (time to first token) — P50, P95, P99
  - TPOT (time per output token) — P50, P95, P99
  - ITL (inter-token latency) distribution
  - GPU utilization and memory usage during the benchmark
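The P50/P95/P99 summaries for TTFT and TPOT could be computed with a simple nearest-rank percentile helper like the sketch below; the function name and the nearest-rank method are assumptions, not part of the existing tooling.

```python
import math

def percentiles(samples, ps=(50, 95, 99)):
    """Nearest-rank percentiles over a list of latency samples (e.g. ms)."""
    ordered = sorted(samples)
    n = len(ordered)
    out = {}
    for p in ps:
        # Nearest-rank definition: 1-indexed rank ceil(p/100 * n).
        k = max(1, math.ceil(p / 100 * n))
        out[f"p{p}"] = ordered[k - 1]
    return out
```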
- Generate comparison reports
  - Compare two Docker images / builds side by side
  - Compare across GPU types (MI300X vs MI355X)
  - Historical trend tracking
  - Markdown table output for GitHub PR/issue comments
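A side-by-side markdown table suitable for pasting into a PR comment might be rendered along these lines; the metric keys and the percent-delta column are illustrative choices, not an existing format.

```python
def comparison_table(baseline: dict, candidate: dict, metrics) -> str:
    """Render two benchmark runs side by side as a GitHub-flavored markdown table."""
    lines = [
        "| Metric | Baseline | Candidate | Δ % |",
        "|---|---|---|---|",
    ]
    for m in metrics:
        b, c = baseline[m], candidate[m]
        # Relative change of the candidate versus the baseline.
        delta = (c - b) / b * 100 if b else float("nan")
        lines.append(f"| {m} | {b:.2f} | {c:.2f} | {delta:+.1f}% |")
    return "\n".join(lines)
```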
- Store results persistently
  - JSON format matching the existing bench_results/ convention
  - Naming: `{model}-result-{isl}-{osl}-conc{N}.json`
  - Support result aggregation across multiple runs
Acceptance Criteria
- One-command benchmark across full concurrency matrix
- Formatted markdown report output
- Side-by-side comparison mode
- Automatic server startup/shutdown management
- Results stored in consistent JSON format
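To make "one-command benchmark" concrete, the entry point might expose a CLI along these lines. Every flag name here is a hypothetical illustration; the skill is free to define its own interface.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI for a one-command run of the full matrix."""
    p = argparse.ArgumentParser(prog="benchmark-report")
    p.add_argument("--model", required=True, help="model to benchmark")
    p.add_argument("--base-url", default="http://localhost:8000",
                   help="server endpoint to health-check and benchmark")
    p.add_argument("--results-dir", default="bench_results",
                   help="where per-run JSON files are written")
    p.add_argument("--compare", metavar="BASELINE_DIR",
                   help="optional prior results dir for a side-by-side report")
    return p
```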