[Skill] serve-model: One-command model serving with auto-configuration #31

@sunway513

Description

Skill

serve-model

Priority: P1 — Critical for rapid model deployment

Motivation

Launching an ATOM model server currently requires specifying many flags: TP config, quantization mode, GPU visibility, max-model-len, KV cache dtype, and more. Getting any of these wrong leads to OOM, incorrect output, or suboptimal performance. A one-command skill would auto-detect the environment, select optimal settings, launch the server, and verify it works — reducing time-to-serve from 30+ minutes of trial and error to under 5 minutes.

What This Skill Should Do

  1. Auto-detect environment — Discover available GPUs (count, type, memory), check which are free (respect HIP_VISIBLE_DEVICES if set, otherwise detect occupied GPUs). Identify architecture (gfx942 vs gfx950) to set correct dtypes and backend paths.
  2. Select model configuration — Based on model size and available GPUs: choose the TP degree (it must divide the attention-head count evenly), the quantization mode (FP8 on gfx942/gfx950, FP4 on gfx950 only), and the max sequence length (so the KV cache fits within HBM).
  3. Choose GEMM backend — On gfx950, use Triton GEMM (the ASM path produces garbage output there). On gfx942, use ASM/CK when CK is available, otherwise fall back to Triton. Ensure the tuned GEMM CSV is loaded if one is available.
  4. Launch server — Start the ATOM server with all configured parameters. Set ATOM_CK_FREE=1 if CK is unavailable. Wait for the health endpoint to return 200.
  5. Run smoke test — Send a simple prompt via /v1/completions (NOT /v1/chat/completions — chat templates and thinking tokens can mask issues). Verify that the response is coherent (not garbled), that latency is reasonable, and that the token count matches the request.
  6. Report status — Output a summary: model loaded, TP config, quantization mode, GEMM backend, available context length, server URL, and smoke test result.
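The selection logic in steps 1–3 could be sketched roughly as below. This is a minimal illustration, not ATOM code: the function names and the simple "largest divisor" TP rule are assumptions; a real implementation would also weigh per-GPU memory and model size.

```python
def select_tp_degree(num_attention_heads: int, free_gpus: int) -> int:
    """Pick the largest TP degree <= the number of free GPUs that divides
    the attention-head count evenly (step 2's divisibility constraint)."""
    for tp in range(free_gpus, 0, -1):
        if num_attention_heads % tp == 0:
            return tp
    return 1


def choose_gemm_backend(arch: str, ck_available: bool) -> str:
    """Step 3's backend rule: gfx950 must avoid the ASM path entirely;
    gfx942 prefers ASM/CK when CK is present, otherwise Triton."""
    if arch == "gfx950":
        return "triton"
    if arch == "gfx942" and ck_available:
        return "asm_ck"
    return "triton"
```

For example, a 64-head model on 6 free GPUs would land on TP=4, since 64 is not divisible by 6 or 5.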

Acceptance Criteria

  • Single command launches a working server with correct output
  • Auto-detects GPU count and type without user input
  • Does not use GPUs occupied by other workloads
  • Selects correct GEMM backend per architecture (no ASM on gfx950)
  • Health check confirms server is ready before reporting success
  • Smoke test uses /v1/completions and verifies coherent output
  • Works for at least: DeepSeek-R1, Llama 3.1 70B, Qwen 2.5 72B
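The smoke-test criterion might be checked with a minimal sketch like the following, assuming an OpenAI-compatible /v1/completions endpoint; the helper names and the coherence heuristic are illustrative assumptions, not part of ATOM.

```python
import json
import urllib.request


def smoke_test(base_url: str, model: str, max_tokens: int = 32) -> dict:
    """Send a plain completion request (deliberately not /v1/chat/completions,
    so chat templates and thinking tokens cannot mask broken output)."""
    payload = json.dumps({
        "model": model,
        "prompt": "The capital of France is",
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def looks_coherent(response: dict, max_tokens: int) -> bool:
    """Basic sanity checks: a non-empty completion whose reported token
    count is positive and does not exceed what was requested."""
    choices = response.get("choices", [])
    if not choices or not choices[0].get("text", "").strip():
        return False
    used = response.get("usage", {}).get("completion_tokens", 0)
    return 0 < used <= max_tokens
```

A fuller check could also compare the completion against known-garbled patterns (repeated tokens, mojibake), but a non-empty, length-bounded response already catches the worst failure modes.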
