[Skill] serve-model: One-command model serving with auto-configuration #31

@sunway513

Description

Skill

serve-model

Priority: P1 — Critical for rapid model deployment

Motivation

Launching an ATOM model server currently requires specifying many flags: TP config, quantization mode, GPU visibility, max-model-len, KV cache dtype, and more. Getting any of these wrong leads to OOM, incorrect output, or suboptimal performance. A one-command skill would auto-detect the environment, select optimal settings, launch the server, and verify it works — reducing time-to-serve from 30+ minutes of trial and error to under 5 minutes.

What This Skill Should Do

  1. Auto-detect environment — Discover available GPUs (count, type, memory), check which are free (respect HIP_VISIBLE_DEVICES if set, otherwise detect occupied GPUs). Identify architecture (gfx942 vs gfx950) to set correct dtypes and backend paths.
  2. Select model configuration — Based on model size and available GPUs: choose the TP degree (it must divide the attention-head count evenly), the quantization mode (FP8 on gfx942/gfx950, FP4 on gfx950 only), and the max sequence length (so the KV cache fits within HBM).
  3. Choose GEMM backend — On gfx950, use Triton GEMM (the ASM path produces garbage output there). On gfx942, use ASM/CK when CK is available, otherwise fall back to Triton. Ensure the tuned GEMM CSV is loaded if one is available.
  4. Launch server — Start the ATOM server with all configured parameters. Set ATOM_CK_FREE=1 if CK is unavailable. Wait for the health endpoint to return 200.
  5. Run smoke test — Send a simple prompt via /v1/completions (NOT /v1/chat/completions — chat templates and thinking tokens can mask issues). Verify that the response is coherent (not garbled), that latency is reasonable, and that the token count matches the request.
  6. Report status — Output a summary: model loaded, TP config, quantization mode, GEMM backend, available context length, server URL, and smoke test result.
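The selection logic in steps 1–3 could be sketched roughly as below. This is a minimal illustration, not ATOM code: the function names and the simple "largest divisor" TP rule are assumptions; a real implementation would also weigh per-GPU memory and model size.

```python
def select_tp_degree(num_attention_heads: int, free_gpus: int) -> int:
    """Pick the largest TP degree <= the number of free GPUs that divides
    the attention-head count evenly (step 2's divisibility constraint)."""
    for tp in range(free_gpus, 0, -1):
        if num_attention_heads % tp == 0:
            return tp
    return 1


def choose_gemm_backend(arch: str, ck_available: bool) -> str:
    """Step 3's backend rule: gfx950 must avoid the ASM path entirely;
    gfx942 prefers ASM/CK when CK is present, otherwise Triton."""
    if arch == "gfx950":
        return "triton"
    if arch == "gfx942" and ck_available:
        return "asm_ck"
    return "triton"
```

For example, a 64-head model on 6 free GPUs would land on TP=4, since 64 is not divisible by 6 or 5.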

Acceptance Criteria

  • Single command launches a working server with correct output
  • Auto-detects GPU count and type without user input
  • Does not use GPUs occupied by other workloads
  • Selects correct GEMM backend per architecture (no ASM on gfx950)
  • Health check confirms server is ready before reporting success
  • Smoke test uses /v1/completions and verifies coherent output
  • Works for at least: DeepSeek-R1, Llama 3.1 70B, Qwen 2.5 72B
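The smoke-test criterion might be checked with a minimal sketch like the following, assuming an OpenAI-compatible /v1/completions endpoint; the helper names and the coherence heuristic are illustrative assumptions, not part of ATOM.

```python
import json
import urllib.request


def smoke_test(base_url: str, model: str, max_tokens: int = 32) -> dict:
    """Send a plain completion request (deliberately not /v1/chat/completions,
    so chat templates and thinking tokens cannot mask broken output)."""
    payload = json.dumps({
        "model": model,
        "prompt": "The capital of France is",
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def looks_coherent(response: dict, max_tokens: int) -> bool:
    """Basic sanity checks: a non-empty completion whose reported token
    count is positive and does not exceed what was requested."""
    choices = response.get("choices", [])
    if not choices or not choices[0].get("text", "").strip():
        return False
    used = response.get("usage", {}).get("completion_tokens", 0)
    return 0 < used <= max_tokens
```

A fuller check could also compare the completion against known-garbled patterns (repeated tokens, mojibake), but a non-empty, length-bounded response already catches the worst failure modes.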
