Open
Description
Skill
serve-model
Priority: P1 — Critical for rapid model deployment
Motivation
Launching an ATOM model server currently requires specifying many flags: TP config, quantization mode, GPU visibility, max-model-len, KV cache dtype, and more. Getting any of these wrong leads to OOM, incorrect output, or suboptimal performance. A one-command skill would auto-detect the environment, select optimal settings, launch the server, and verify it works — reducing time-to-serve from 30+ minutes of trial and error to under 5 minutes.
What This Skill Should Do
- Auto-detect environment: Discover available GPUs (count, type, memory) and check which are free (respect `HIP_VISIBLE_DEVICES` if set, otherwise detect occupied GPUs). Identify the architecture (gfx942 vs gfx950) to set correct dtypes and backend paths.
- Select model configuration: Based on model size and available GPUs, choose the TP degree (must divide attention heads evenly), quantization mode (FP8 for gfx942/gfx950, FP4 for gfx950 only), and max sequence length (must fit within HBM).
- Choose GEMM backend: On gfx950, use Triton GEMM (ASM produces garbage). On gfx942 with CK, use ASM/CK. Without CK, use Triton. Ensure the tuned GEMM CSV is loaded if available.
- Launch server: Start the ATOM server with all configured parameters. Set `ATOM_CK_FREE=1` if CK is unavailable. Wait for the health endpoint to return 200.
- Run smoke test: Send a simple prompt via `/v1/completions` (NOT `/v1/chat/completions`, because chat templates and thinking tokens can mask issues). Verify that the response is coherent (not garbled), latency is reasonable, and the token count matches the request.
- Report status: Output a summary with the model loaded, TP config, quantization mode, GEMM backend, available context length, server URL, and smoke test result.
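
The selection rules in the steps above could be sketched roughly as follows. This is a minimal illustration, not ATOM's actual implementation: every function name and the backend labels (`"triton"`, `"asm/ck"`) are hypothetical, GPU occupancy detection and the HBM fit check are left out, and picking the largest valid TP degree is only one possible policy.

```python
from typing import Iterable, Mapping


def visible_gpu_ids(all_ids: Iterable[int], environ: Mapping[str, str]) -> list[int]:
    """Restrict candidate GPUs to HIP_VISIBLE_DEVICES when it is set.

    Assumes the variable holds comma-separated integer indices. When it is
    unset, all GPUs are returned and the caller still has to filter out
    occupied ones.
    """
    ids = list(all_ids)
    raw = environ.get("HIP_VISIBLE_DEVICES")
    if not raw:
        return ids
    wanted = [int(tok) for tok in raw.split(",") if tok.strip()]
    return [i for i in wanted if i in ids]


def pick_tp_degree(num_attention_heads: int, free_gpus: int) -> int:
    """Largest TP degree <= free_gpus that divides the attention heads evenly."""
    for tp in range(free_gpus, 0, -1):
        if num_attention_heads % tp == 0:
            return tp
    raise ValueError("no free GPUs available")


def allowed_quant_modes(arch: str) -> list[str]:
    """FP8 on gfx942 and gfx950; FP4 only on gfx950 (per the issue text)."""
    if arch == "gfx950":
        return ["fp8", "fp4"]
    if arch == "gfx942":
        return ["fp8"]
    return []


def pick_gemm_backend(arch: str, ck_available: bool) -> str:
    """Triton on gfx950 or when CK is missing; ASM/CK on gfx942 with CK."""
    if arch == "gfx950" or not ck_available:
        return "triton"
    return "asm/ck"


def launch_env(ck_available: bool) -> dict[str, str]:
    """Extra environment variables for the server process."""
    return {} if ck_available else {"ATOM_CK_FREE": "1"}
```

For example, a model with 64 attention heads and 6 free GPUs would get TP=4 under this policy, since 6 and 5 do not divide 64 evenly.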
Acceptance Criteria
- Single command launches a working server with correct output
- Auto-detects GPU count and type without user input
- Does not use GPUs occupied by other workloads
- Selects correct GEMM backend per architecture (no ASM on gfx950)
- Health check confirms server is ready before reporting success
- Smoke test uses `/v1/completions` and verifies coherent output
- Works for at least: DeepSeek-R1, Llama 3.1 70B, Qwen 2.5 72B
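
The health-check and smoke-test criteria might be verified along these lines. This sketch rests on several assumptions: the health endpoint path `/health` is a guess (the issue does not name it), the `/v1/completions` request and response are assumed to follow the OpenAI-compatible schema (`choices[0].text`, `usage.completion_tokens`), and the coherence check is a crude heuristic rather than a real quality metric. All function names are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_health(base_url: str, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll the assumed /health path until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False


def send_completion(base_url: str, model: str, prompt: str, max_tokens: int = 32):
    """POST a plain completion request; return (parsed JSON, latency in seconds)."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0}
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=120) as resp:
        payload = json.load(resp)
    return payload, time.monotonic() - start


def check_completion(payload: dict, max_tokens: int) -> list[str]:
    """Return a list of problems found in the response; empty means it passed."""
    problems = []
    choices = payload.get("choices") or []
    text = choices[0].get("text", "") if choices else ""
    words = text.split()
    if not text.strip():
        problems.append("empty completion")
    elif "\ufffd" in text or not all(ch.isprintable() or ch.isspace() for ch in text):
        problems.append("completion contains garbled characters")
    elif len(words) > 3 and len(set(words)) <= 1:
        problems.append("completion repeats a single token")
    used = (payload.get("usage") or {}).get("completion_tokens")
    if used is not None and not (0 < used <= max_tokens):
        problems.append(f"completion_tokens={used} outside (0, {max_tokens}]")
    return problems
```

A skill built this way would call `wait_for_health` first, then `send_completion`, and report success only if `check_completion` returns an empty list and the measured latency is under whatever threshold the skill defines.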