[Skill] docker-clean-build: Build clean Docker images with correct backend configuration #27

@sunway513

Skill

docker-clean-build

Priority: P1 — Critical for reproducible deployment

Motivation

Building clean ATOM Docker images is error-prone. Past issues include gfx950 detection failures that caused wrong kernel selection, GEMM defaulting to torch instead of hipBLASLt (a 570x slowdown), missing CK-free fallback paths, and LDS constraint violations. Each of these bugs took days to diagnose. A skill that codifies all known pitfalls into a validated build process would prevent these regressions. Convention: the image tag follows the pattern username_date_rocmversion_aiterversion_atomversion.
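As an illustration of the tag convention, here is a minimal sketch of a tag builder. The function name `build_image_tag`, the `YYYYMMDD` date format, and the example values are assumptions made for illustration; the issue only specifies the field order. The validation reflects Docker's tag rules (letters, digits, underscores, periods, and hyphens; at most 128 characters; no leading period or hyphen).

```python
import re
from datetime import date

# Docker tag rules: ASCII letters, digits, '_', '.', '-'; at most 128
# characters; must not start with '.' or '-'.
_DOCKER_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def build_image_tag(username, build_date, rocm_version, aiter_version, atom_version):
    """Join the five convention fields with underscores and validate the result.

    The YYYYMMDD date format is an assumption; the convention only says "date".
    """
    fields = [
        username,
        build_date.strftime("%Y%m%d"),
        rocm_version,
        aiter_version,
        atom_version,
    ]
    if any(not field for field in fields):
        raise ValueError("every tag field must be non-empty")
    tag = "_".join(fields)
    if not _DOCKER_TAG_RE.fullmatch(tag):
        raise ValueError(f"not a valid Docker tag: {tag!r}")
    return tag


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_image_tag("alice", date(2025, 1, 2), "6.4", "0.1.0", "0.2.0"))
```

Rejecting invalid tags at this point surfaces a bad version string before the build starts, not at the final tagging step.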

What This Skill Should Do

  1. Detect target architecture — Auto-detect gfx942 vs gfx950 from the build host or accept it as a parameter. Set correct dtypes.fp8 variant (e4m3fnuz for gfx942, e4m3fn for gfx950).
  2. Configure GEMM defaults — Ensure tuned_gemm.py defaults to hipblaslt (not torch). Validate that the tuned GEMM CSV is present and covers the target model's shapes.
  3. Set up fallback paths — Configure CK-free mode correctly: use_triton_gemm() must return True when ATOM_CK_FREE=1, matching the pattern in attention_mla.py and moe.py. Verify ASM GEMM is NOT used on gfx950 (produces garbage).
  4. Validate backend paths — Run a minimal forward pass through each backend (Triton GEMM, hipBLASLt, MoE Triton) and check cosine similarity > 0.999 against FP16 reference.
  5. Handle AITER build — Support both full build (with CK) and CK-free build. For CK-free: ensure jit/core.py raises RuntimeError (not SystemExit) on JIT failure.
  6. Tag and verify — Tag image per convention, run smoke test (10 tokens generation), report image size and build time.
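Steps 1 to 3 above could be resolved into a single build configuration before the image is built. The sketch below is not ATOM or AITER code: `detect_arch`, `resolve_build_config`, and the dictionary keys are hypothetical names, and the local `use_triton_gemm` is only a stand-in for the function the issue mentions. It assumes the architecture can be read from a text listing of GPU agents (for example, captured tool output) and that an explicit parameter overrides detection.

```python
import os
import re

# fp8 variant per architecture, taken from step 1 above.
FP8_VARIANT_BY_ARCH = {"gfx942": "e4m3fnuz", "gfx950": "e4m3fn"}


def detect_arch(agent_listing, override=None):
    """Return the first supported gfx target in the listing, or the override."""
    if override is not None:
        candidates = [override]
    else:
        candidates = re.findall(r"\bgfx[0-9a-f]+\b", agent_listing)
    for name in candidates:
        if name in FP8_VARIANT_BY_ARCH:
            return name
    raise ValueError(f"no supported architecture found in {candidates!r}")


def use_triton_gemm(env=None):
    """Stand-in for the CK-free switch: True only when ATOM_CK_FREE is '1'."""
    env = os.environ if env is None else env
    return env.get("ATOM_CK_FREE") == "1"


def resolve_build_config(arch, env=None):
    """Collect the per-architecture settings described in steps 1 to 3."""
    if arch not in FP8_VARIANT_BY_ARCH:
        raise ValueError(f"unsupported architecture: {arch!r}")
    return {
        "arch": arch,
        "fp8_dtype": FP8_VARIANT_BY_ARCH[arch],
        "default_gemm": "hipblaslt",  # step 2: never default to torch
        "use_triton_gemm": use_triton_gemm(env),
        "allow_asm_gemm": arch != "gfx950",  # step 3: ASM GEMM is wrong on gfx950
    }


if __name__ == "__main__":
    listing = "  Name:                    gfx942\n"  # hypothetical tool output
    print(resolve_build_config(detect_arch(listing), {"ATOM_CK_FREE": "1"}))
```

Raising on an unrecognized architecture, instead of falling back to a default, is a deliberate choice here: a silent fallback is the kind of failure that the Motivation section describes as causing wrong kernel selection.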

Acceptance Criteria

  • Produces a working Docker image that generates coherent text
  • gfx942 and gfx950 both produce correct output (cosine > 0.999 vs reference)
  • GEMM defaults to hipBLASLt, not torch
  • CK-free mode works when ATOM_CK_FREE=1
  • ASM GEMM is disabled on gfx950 or produces correct output
  • Build time is documented and image is tagged per convention
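The cosine criterion above can be checked with a few lines of code. The sketch below uses only the standard library and treats each backend output and the FP16 reference as flat sequences of floats; the names `cosine_similarity` and `backend_passes` are hypothetical, and a real validation step would more likely operate on framework tensors.

```python
import math

COSINE_THRESHOLD = 0.999  # threshold from the acceptance criteria


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def backend_passes(output, reference, threshold=COSINE_THRESHOLD):
    """True when the backend output is close enough to the reference."""
    return cosine_similarity(output, reference) > threshold


if __name__ == "__main__":
    reference = [0.5, -1.25, 2.0, 0.75]       # hypothetical FP16 reference values
    candidate = [0.5, -1.2499, 2.0001, 0.75]  # hypothetical backend output
    print(backend_passes(candidate, reference))
```

Cosine similarity ignores overall scale, so an output that is uniformly off by a constant factor would still pass. Pairing this check with a magnitude comparison may be worth considering.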
