Skill: benchmark-report
Priority: P1 — Every evaluation requires running the same benchmark matrix manually
Motivation
Performance evaluation for ATOM involves running benchmark_serving across multiple configurations: varying concurrency levels (8, 16, 32, 64, 128), input/output lengths (128/128, 1024/1024, 4096/512, etc.), and models. Currently this is done with ad-hoc shell scripts, and results are manually collected into spreadsheets. A standardized skill would ensure consistent, reproducible benchmarks and formatted reports.
What This Skill Should Do
- Run a standard benchmark matrix
  - Multiple concurrency levels (8, 16, 32, 64, 128)
  - Standard ISL/OSL combinations (128/128, 1024/1024, 4096/512)
  - Configurable warm-up and measurement iterations
  - Automatic server health check before benchmarking
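The matrix and pre-flight check above could be sketched roughly as follows. This is a minimal illustration, not the skill's implementation: the `/health` endpoint path and the helper names are assumptions.

```python
import itertools
import urllib.error
import urllib.request

# Default matrix from the issue description.
CONCURRENCY = [8, 16, 32, 64, 128]
ISL_OSL = [(128, 128), (1024, 1024), (4096, 512)]

def server_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe a (hypothetical) /health endpoint before starting any run."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def benchmark_matrix():
    """Yield every (concurrency, ISL, OSL) combination in the standard matrix."""
    for conc, (isl, osl) in itertools.product(CONCURRENCY, ISL_OSL):
        yield {"concurrency": conc, "isl": isl, "osl": osl}
```

With the defaults above this yields 5 × 3 = 15 configurations per model.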
- Collect and format results
  - Output tokens/sec, total tokens/sec
  - TTFT (time to first token) — P50, P95, P99
  - TPOT (time per output token) — P50, P95, P99
  - ITL (inter-token latency) distribution
  - GPU utilization and memory usage during the benchmark
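The P50/P95/P99 summaries for TTFT and TPOT could be computed with a simple nearest-rank percentile helper like the sketch below; the function name and the nearest-rank method are assumptions, not part of the existing tooling.

```python
import math

def percentiles(samples, ps=(50, 95, 99)):
    """Nearest-rank percentiles over a list of latency samples (e.g. ms)."""
    ordered = sorted(samples)
    n = len(ordered)
    out = {}
    for p in ps:
        # Nearest-rank definition: 1-indexed rank ceil(p/100 * n).
        k = max(1, math.ceil(p / 100 * n))
        out[f"p{p}"] = ordered[k - 1]
    return out
```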
- Generate comparison reports
  - Compare two Docker images / builds side by side
  - Compare across GPU types (MI300X vs MI355X)
  - Historical trend tracking
  - Markdown table output for GitHub PR/issue comments
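A side-by-side markdown table suitable for pasting into a PR comment might be rendered along these lines; the metric keys and the percent-delta column are illustrative choices, not an existing format.

```python
def comparison_table(baseline: dict, candidate: dict, metrics) -> str:
    """Render two benchmark runs side by side as a GitHub-flavored markdown table."""
    lines = [
        "| Metric | Baseline | Candidate | Δ % |",
        "|---|---|---|---|",
    ]
    for m in metrics:
        b, c = baseline[m], candidate[m]
        # Relative change of the candidate versus the baseline.
        delta = (c - b) / b * 100 if b else float("nan")
        lines.append(f"| {m} | {b:.2f} | {c:.2f} | {delta:+.1f}% |")
    return "\n".join(lines)
```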
- Store results persistently
  - JSON format matching the existing bench_results/ convention
  - Naming: `{model}-result-{isl}-{osl}-conc{N}.json`
  - Support result aggregation across multiple runs
Acceptance Criteria
- One-command benchmark across full concurrency matrix
- Formatted markdown report output
- Side-by-side comparison mode
- Automatic server startup/shutdown management
- Results stored in consistent JSON format
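To make "one-command benchmark" concrete, the entry point might expose a CLI along these lines. Every flag name here is a hypothetical illustration; the skill is free to define its own interface.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI for a one-command run of the full matrix."""
    p = argparse.ArgumentParser(prog="benchmark-report")
    p.add_argument("--model", required=True, help="model to benchmark")
    p.add_argument("--base-url", default="http://localhost:8000",
                   help="server endpoint to health-check and benchmark")
    p.add_argument("--results-dir", default="bench_results",
                   help="where per-run JSON files are written")
    p.add_argument("--compare", metavar="BASELINE_DIR",
                   help="optional prior results dir for a side-by-side report")
    return p
```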