Skill: config-tune
Priority: P2 — Important for optimal deployment
Motivation
ATOM serving performance is highly sensitive to configuration: batch size, KV cache allocation, TP/DP split, chunk prefill size, and CUDA graph batch sizes all interact in non-obvious ways. Currently, finding the optimal config requires manual experimentation across dozens of parameter combinations. A systematic tuning skill would find near-optimal configs faster and document the rationale for each choice.
What This Skill Should Do
- Profile hardware — Detect GPU count, GPU type (MI300X vs MI355X), HBM capacity per GPU, and NVLink/xGMI topology. Calculate available memory after model weight loading.
- Select TP/DP configuration — Based on model size and GPU count, recommend a TP (tensor parallel) × DP (data parallel) split. Rules: TP must divide the attention head count evenly; DP raises throughput but costs memory, since each replica holds a full copy of the weights.
- Size KV cache — Calculate maximum KV cache blocks given remaining HBM after weights + activations. Account for quantization (FP8 KV cache halves memory vs FP16).
- Tune batch sizes — Recommend max_num_seqs and max_num_batched_tokens based on KV cache capacity and target latency. Sweep CUDA graph batch sizes (capture sizes) to minimize graph miss rate.
- Configure chunk prefill — Set enable_chunked_prefill, max_num_batched_tokens for prefill, and chunked_prefill_size. Balance prefill throughput vs decode latency interference.
- Output config file — Generate a complete launch command or config YAML with all tuned parameters, plus a markdown explanation of each choice.
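The generated config file might look like the following. Flag names mirror the parameters named in this issue (`max_num_seqs`, `max_num_batched_tokens`, `enable_chunked_prefill`); all values and the exact YAML key spellings are placeholders, not a tuned recommendation:

```yaml
# config-tune output (illustrative values only)
model: meta-llama/Llama-3.1-70B-Instruct
tensor_parallel_size: 4          # divides 64 attention heads evenly
data_parallel_size: 2            # remaining GPUs become replicas
kv_cache_dtype: fp8              # halves KV memory vs FP16
max_num_seqs: 256
max_num_batched_tokens: 8192
enable_chunked_prefill: true
cuda_graph_capture_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
```

Each key would be paired with a rationale entry in the accompanying markdown explanation.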
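The TP/DP rule above can be sketched as a greedy search: take the smallest TP that divides both the GPU count and the attention head count, and give the rest to DP replicas. The function name and the smallest-valid-TP heuristic are illustrative, not part of ATOM; `min_tp` stands in for the "weights must fit per rank" constraint.

```python
def select_tp_dp(gpu_count, num_attention_heads, min_tp=1):
    """Pick a (TP, DP) split.

    Constraints from the skill spec:
      - TP * DP == gpu_count
      - TP divides num_attention_heads evenly
    Heuristic: prefer the smallest valid TP, since more DP replicas
    mean more throughput, assuming the weights fit on min_tp ranks.
    """
    for tp in range(min_tp, gpu_count + 1):
        if gpu_count % tp == 0 and num_attention_heads % tp == 0:
            return tp, gpu_count // tp
    raise ValueError("no valid TP/DP split for this GPU count")

# 8 GPUs, 64 attention heads, weights need at least 4 ranks: (4, 2)
print(select_tp_dp(8, 64, min_tp=4))
```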
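The KV cache sizing step reduces to arithmetic over the model shape: a paged KV cache stores one K and one V vector per token per layer, i.e. 2 · num_kv_heads · head_dim · dtype_bytes bytes. A minimal sketch, where the function name, the 16-token block size default, and the example free-memory figure are all hypothetical:

```python
def kv_cache_blocks(free_hbm_gb, num_layers, num_kv_heads, head_dim,
                    block_size=16, dtype_bytes=2):
    """Estimate how many KV cache blocks fit in the HBM left after
    weights + activations. dtype_bytes=2 for FP16, 1 for FP8."""
    bytes_per_token = 2 * num_kv_heads * head_dim * dtype_bytes * num_layers
    bytes_per_block = bytes_per_token * block_size
    return int(free_hbm_gb * 1024**3) // bytes_per_block

# Llama 3.1 70B-like shape: 80 layers, 8 KV heads, head_dim 128,
# with 60 GiB of HBM free after weight loading.
fp16_blocks = kv_cache_blocks(60, 80, 8, 128, dtype_bytes=2)  # 12288
fp8_blocks = kv_cache_blocks(60, 80, 8, 128, dtype_bytes=1)   # 24576
assert fp8_blocks == 2 * fp16_blocks  # FP8 KV cache halves memory
```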
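For the CUDA graph sweep, one simple objective is the padding waste a candidate set of capture sizes incurs against an observed batch-size histogram: each batch is padded up to the next captured size, or falls back to eager mode if none is large enough. Both helper names and the waste metric are illustrative:

```python
import bisect

def graph_padded_size(capture_sizes, batch):
    """Smallest captured CUDA-graph batch size >= batch,
    or None if this batch misses every graph (eager fallback)."""
    sizes = sorted(capture_sizes)
    i = bisect.bisect_left(sizes, batch)
    return sizes[i] if i < len(sizes) else None

def wasted_fraction(capture_sizes, batch_hist):
    """Fraction of padded batch slots that are wasted, given a
    {batch_size: count} histogram; a proxy objective for the sweep."""
    total = padded = 0
    for batch, count in batch_hist.items():
        p = graph_padded_size(capture_sizes, batch) or batch
        total += batch * count
        padded += p * count
    return 1 - total / padded

# A batch of 3 pads up to the captured size 4: 25% of slots wasted.
print(wasted_fraction([1, 2, 4, 8], {3: 10}))
```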
Acceptance Criteria
- Correctly detects GPU type and count
- TP/DP recommendation is valid (TP divides num_heads evenly)
- KV cache sizing leaves adequate headroom (no OOM under max load)
- Generated config produces higher throughput than the default config on a benchmark
- Output includes both the config and rationale for each parameter choice
- Supports at least: DeepSeek-R1, Llama 3.1 70B/405B, Qwen 2.5 72B