Skill: config-tune
Priority: P2 — Important for optimal deployment
Motivation
ATOM serving performance is highly sensitive to configuration: batch size, KV cache allocation, TP/DP split, chunk prefill size, and CUDA graph batch sizes all interact in non-obvious ways. Currently, finding the optimal config requires manual experimentation across dozens of parameter combinations. A systematic tuning skill would find near-optimal configs faster and document the rationale for each choice.
What This Skill Should Do
- Profile hardware — Detect GPU count, GPU type (MI300X vs MI355X), HBM capacity per GPU, and NVLink/xGMI topology. Calculate available memory after model weight loading.
- Select TP/DP configuration — Based on model size and GPU count, recommend a TP (tensor parallel) × DP (data parallel) split. Rules: TP must divide the attention head count evenly; DP raises throughput but costs memory, since each replica holds a full copy of the weights.
- Size KV cache — Calculate maximum KV cache blocks given remaining HBM after weights + activations. Account for quantization (FP8 KV cache halves memory vs FP16).
- Tune batch sizes — Recommend max_num_seqs and max_num_batched_tokens based on KV cache capacity and target latency. Sweep CUDA graph batch sizes (capture sizes) to minimize graph miss rate.
- Configure chunk prefill — Set enable_chunked_prefill, max_num_batched_tokens for prefill, and chunked_prefill_size. Balance prefill throughput vs decode latency interference.
- Output config file — Generate a complete launch command or config YAML with all tuned parameters, plus a markdown explanation of each choice.
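The generated config file might look like the following. Flag names mirror the parameters named in this issue (`max_num_seqs`, `max_num_batched_tokens`, `enable_chunked_prefill`); all values and the exact YAML key spellings are placeholders, not a tuned recommendation:

```yaml
# config-tune output (illustrative values only)
model: meta-llama/Llama-3.1-70B-Instruct
tensor_parallel_size: 4          # divides 64 attention heads evenly
data_parallel_size: 2            # remaining GPUs become replicas
kv_cache_dtype: fp8              # halves KV memory vs FP16
max_num_seqs: 256
max_num_batched_tokens: 8192
enable_chunked_prefill: true
cuda_graph_capture_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
```

Each key would be paired with a rationale entry in the accompanying markdown explanation.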
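The TP/DP rule above can be sketched as a greedy search: take the smallest TP that divides both the GPU count and the attention head count, and give the rest to DP replicas. The function name and the smallest-valid-TP heuristic are illustrative, not part of ATOM; `min_tp` stands in for the "weights must fit per rank" constraint.

```python
def select_tp_dp(gpu_count, num_attention_heads, min_tp=1):
    """Pick a (TP, DP) split.

    Constraints from the skill spec:
      - TP * DP == gpu_count
      - TP divides num_attention_heads evenly
    Heuristic: prefer the smallest valid TP, since more DP replicas
    mean more throughput, assuming the weights fit on min_tp ranks.
    """
    for tp in range(min_tp, gpu_count + 1):
        if gpu_count % tp == 0 and num_attention_heads % tp == 0:
            return tp, gpu_count // tp
    raise ValueError("no valid TP/DP split for this GPU count")

# 8 GPUs, 64 attention heads, weights need at least 4 ranks: (4, 2)
print(select_tp_dp(8, 64, min_tp=4))
```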
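The KV cache sizing step reduces to arithmetic over the model shape: a paged KV cache stores one K and one V vector per token per layer, i.e. 2 · num_kv_heads · head_dim · dtype_bytes bytes. A minimal sketch, where the function name, the 16-token block size default, and the example free-memory figure are all hypothetical:

```python
def kv_cache_blocks(free_hbm_gb, num_layers, num_kv_heads, head_dim,
                    block_size=16, dtype_bytes=2):
    """Estimate how many KV cache blocks fit in the HBM left after
    weights + activations. dtype_bytes=2 for FP16, 1 for FP8."""
    bytes_per_token = 2 * num_kv_heads * head_dim * dtype_bytes * num_layers
    bytes_per_block = bytes_per_token * block_size
    return int(free_hbm_gb * 1024**3) // bytes_per_block

# Llama 3.1 70B-like shape: 80 layers, 8 KV heads, head_dim 128,
# with 60 GiB of HBM free after weight loading.
fp16_blocks = kv_cache_blocks(60, 80, 8, 128, dtype_bytes=2)  # 12288
fp8_blocks = kv_cache_blocks(60, 80, 8, 128, dtype_bytes=1)   # 24576
assert fp8_blocks == 2 * fp16_blocks  # FP8 KV cache halves memory
```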
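For the CUDA graph sweep, one simple objective is the padding waste a candidate set of capture sizes incurs against an observed batch-size histogram: each batch is padded up to the next captured size, or falls back to eager mode if none is large enough. Both helper names and the waste metric are illustrative:

```python
import bisect

def graph_padded_size(capture_sizes, batch):
    """Smallest captured CUDA-graph batch size >= batch,
    or None if this batch misses every graph (eager fallback)."""
    sizes = sorted(capture_sizes)
    i = bisect.bisect_left(sizes, batch)
    return sizes[i] if i < len(sizes) else None

def wasted_fraction(capture_sizes, batch_hist):
    """Fraction of padded batch slots that are wasted, given a
    {batch_size: count} histogram; a proxy objective for the sweep."""
    total = padded = 0
    for batch, count in batch_hist.items():
        p = graph_padded_size(capture_sizes, batch) or batch
        total += batch * count
        padded += p * count
    return 1 - total / padded

# A batch of 3 pads up to the captured size 4: 25% of slots wasted.
print(wasted_fraction([1, 2, 4, 8], {3: 10}))
```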
Acceptance Criteria
- Correctly detects GPU type and count
- TP/DP recommendation is valid (TP divides num_heads evenly)
- KV cache sizing leaves adequate headroom (no OOM under max load)
- Generated config produces higher throughput than the default config on a benchmark
- Output includes both the config and rationale for each parameter choice
- Supports at least: DeepSeek-R1, Llama 3.1 70B/405B, Qwen 2.5 72B