
[Skill] config-tune: Tune serving configuration for model and hardware #28

@sunway513

Description


Skill

config-tune

Priority: P2 — Important for optimal deployment

Motivation

ATOM serving performance is highly sensitive to configuration: batch size, KV cache allocation, TP/DP split, chunk prefill size, and CUDA graph batch sizes all interact in non-obvious ways. Currently, finding the optimal config requires manual experimentation across dozens of parameter combinations. A systematic tuning skill would find near-optimal configs faster and document the rationale for each choice.

What This Skill Should Do

  1. Profile hardware — Detect GPU count, GPU type (MI300X vs MI355X), HBM capacity per GPU, and xGMI (Infinity Fabric) link topology. Calculate the memory available after model weight loading.
  2. Select TP/DP configuration — Based on model size and GPU count, recommend the TP (tensor parallel) and DP (data parallel) split. Rules: TP must divide the attention head count evenly; DP raises throughput but replicates weights, so it needs more memory per GPU.
  3. Size KV cache — Calculate maximum KV cache blocks given remaining HBM after weights + activations. Account for quantization (FP8 KV cache halves memory vs FP16).
  4. Tune batch sizes — Recommend max_num_seqs and max_num_batched_tokens based on KV cache capacity and target latency. Sweep CUDA graph batch sizes (capture sizes) to minimize graph miss rate.
  5. Configure chunk prefill — Set enable_chunked_prefill, max_num_batched_tokens for prefill, and chunked_prefill_size. Balance prefill throughput vs decode latency interference.
  6. Output config file — Generate a complete launch command or config YAML with all tuned parameters, plus a markdown explanation of each choice.
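The memory accounting in step 1 can be sketched as below. This is a minimal sketch, not the skill's actual implementation: the function name, the flat activation reserve, and the example figures (192 GiB HBM per MI300X, a 70B-parameter model in an 8-bit weight format) are illustrative assumptions.

```python
def free_hbm_after_weights(hbm_per_gpu_gib: float,
                           params_billion: float,
                           bytes_per_param: float,
                           tp: int,
                           activation_reserve_gib: float = 8.0) -> float:
    """Estimate per-GPU HBM (GiB) left for the KV cache after weight loading.

    Weights are sharded across the TP group; a flat activation reserve is
    subtracted as a rough stand-in for runtime buffers (the real skill would
    measure this rather than assume it).
    """
    weight_gib_per_gpu = params_billion * 1e9 * bytes_per_param / tp / 2**30
    return hbm_per_gpu_gib - weight_gib_per_gpu - activation_reserve_gib

# Example: 70B params at 1 byte/param on TP=8 across 192 GiB MI300X GPUs
free_gib = free_hbm_after_weights(192.0, 70.0, 1.0, 8)
```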
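The TP/DP rule in step 2 could take roughly this shape — a sketch under assumptions, where `min_tp_for_model` stands in for "the smallest TP whose weight shard fits in one GPU" (something the skill would derive from the step-1 memory profile, not take as an input):

```python
def recommend_tp_dp(num_gpus: int, num_attn_heads: int, num_kv_heads: int,
                    min_tp_for_model: int) -> tuple[int, int]:
    """Pick the smallest valid TP that fits the model; spend the rest on DP.

    TP must divide both the attention head count and the KV head count so
    heads shard evenly; each DP replica multiplies throughput but also
    duplicates the weights, so smaller TP is only viable if memory allows.
    """
    for tp in (1, 2, 4, 8):
        if tp < min_tp_for_model or num_gpus % tp != 0:
            continue
        if num_attn_heads % tp == 0 and num_kv_heads % tp == 0:
            return tp, num_gpus // tp
    raise ValueError("no valid TP/DP split for this model and GPU count")
```

For example, a model with 64 attention heads and 8 KV heads that needs at least TP=2 to fit would land on TP=2 x DP=4 on an 8-GPU node.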
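Step 3's KV cache arithmetic is simple enough to sketch directly. The per-token cost is K plus V, per layer, per KV head resident on the GPU after TP sharding; the FP8-vs-FP16 halving mentioned above falls out of the dtype width. The function name and the 16-token block size are assumptions for illustration.

```python
def max_kv_cache_blocks(free_bytes: int, num_layers: int,
                        kv_heads_per_gpu: int, head_dim: int,
                        kv_dtype_bytes: int, block_size: int = 16) -> int:
    """Maximum whole KV cache blocks that fit in the remaining HBM."""
    # K and V (factor of 2), per layer, per local KV head, per head dim.
    bytes_per_token = 2 * num_layers * kv_heads_per_gpu * head_dim * kv_dtype_bytes
    return free_bytes // (bytes_per_token * block_size)

# Illustrative shape: 80 layers, 1 KV head per GPU after TP=8, head_dim 128,
# with 150 GiB free. FP8 (1 byte) doubles the block count vs FP16 (2 bytes).
free = 150 * 2**30
fp16_blocks = max_kv_cache_blocks(free, 80, 1, 128, kv_dtype_bytes=2)
fp8_blocks = max_kv_cache_blocks(free, 80, 1, 128, kv_dtype_bytes=1)
```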
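For the capture-size sweep in step 4, one plausible candidate set to sweep over is small powers of two followed by multiples of eight up to `max_num_seqs`, so that most runtime batch sizes round up to a captured graph. This is a sketch of one heuristic, not ATOM's actual policy:

```python
def candidate_capture_sizes(max_num_seqs: int) -> list[int]:
    """Candidate CUDA graph capture batch sizes to sweep.

    Dense coverage at small batches (where padding waste hurts most),
    then strides of 8 up to the configured max_num_seqs; the tuner would
    measure graph miss rate for subsets of this list.
    """
    sizes = [s for s in (1, 2, 4) if s <= max_num_seqs]
    sizes += [s for s in range(8, max_num_seqs + 1, 8) if s not in sizes]
    return sizes
```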

Acceptance Criteria

  • Correctly detects GPU type and count
  • TP/DP recommendation is valid (TP divides num_heads evenly)
  • KV cache sizing leaves adequate headroom (no OOM under max load)
  • Generated config produces higher throughput than default config on benchmark
  • Output includes both the config and rationale for each parameter choice
  • Supports at least: DeepSeek-R1, Llama 3.1 70B/405B, Qwen 2.5 72B
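The "TP divides num_heads evenly" criterion can be checked mechanically per supported model. The head counts below are illustrative assumptions — the skill should read them from each model's `config.json` rather than hard-code them, and DeepSeek-R1's MLA attention needs its own sharding rule, so it is omitted here:

```python
# (attention heads, KV heads) — assumed values for illustration only.
MODEL_HEADS = {
    "Llama-3.1-70B": (64, 8),
    "Llama-3.1-405B": (128, 8),
    "Qwen2.5-72B": (64, 8),
}

def valid_tp_values(num_gpus: int) -> dict[str, list[int]]:
    """TP degrees that divide the GPU count and both head counts evenly."""
    return {
        name: [tp for tp in (1, 2, 4, 8)
               if num_gpus % tp == 0 and heads % tp == 0 and kv % tp == 0]
        for name, (heads, kv) in MODEL_HEADS.items()
    }
```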
