Open
Description
Skill
docker-clean-build
Priority: P1 — Critical for reproducible deployment
Motivation
Building clean ATOM Docker images is error-prone. Past issues include: gfx950 detection failures causing wrong kernel selection, GEMM defaulting to torch instead of hipBLASLt (570x slowdown), missing CK-free fallback paths, and LDS constraint violations. Each of these bugs took days to diagnose. A skill that codifies all known pitfalls into a validated build process would prevent these regressions. Convention: image tags follow the pattern username_date_rocmversion_aiterversion_atomversion.
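The tag convention above could be assembled by a small helper like the sketch below. The function name, the `YYYYMMDD` date format, and the example version strings are assumptions for illustration; the issue only fixes the field order. The character and length checks follow Docker's general tag rules (letters, digits, `_`, `.`, `-`, at most 128 characters, no leading `.` or `-`).

```python
import re
from datetime import date


def build_image_tag(username, rocm_version, aiter_version, atom_version,
                    build_date=None):
    """Join the five convention fields with underscores (hypothetical helper)."""
    build_date = build_date or date.today()
    # Assumption: the date field is rendered as YYYYMMDD.
    fields = [username, build_date.strftime("%Y%m%d"),
              rocm_version, aiter_version, atom_version]
    for field in fields:
        # Underscore is the field separator, so it is rejected inside a field.
        if not re.fullmatch(r"[A-Za-z0-9.-]+", field):
            raise ValueError(f"invalid tag field: {field!r}")
    tag = "_".join(fields)
    # Docker tags may not start with '.' or '-' and are capped at 128 chars.
    if tag[0] in ".-" or len(tag) > 128:
        raise ValueError(f"invalid Docker tag: {tag!r}")
    return tag


if __name__ == "__main__":
    # Example values are placeholders, not real release versions.
    print(build_image_tag("alice", "rocm7.0", "aiter0.1", "atom0.1",
                          date(2025, 1, 2)))
```

Rejecting underscores inside a field keeps the tag unambiguous to split back into its five parts.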
What This Skill Should Do
- Detect target architecture: auto-detect gfx942 vs gfx950 from the build host, or accept it as a parameter. Set the correct `dtypes.fp8` variant (e4m3fnuz for gfx942, e4m3fn for gfx950).
- Configure GEMM defaults: ensure `tuned_gemm.py` defaults to `hipblaslt` (not `torch`). Validate that the tuned GEMM CSV is present and covers the target model's shapes.
- Set up fallback paths: configure CK-free mode correctly. `use_triton_gemm()` must return True when `ATOM_CK_FREE=1`, matching the pattern in `attention_mla.py` and `moe.py`. Verify ASM GEMM is NOT used on gfx950 (produces garbage).
- Validate backend paths: run a minimal forward pass through each backend (Triton GEMM, hipBLASLt, MoE Triton) and check cosine similarity > 0.999 against an FP16 reference.
- Handle AITER build: support both full build (with CK) and CK-free build. For CK-free: ensure `jit/core.py` raises RuntimeError (not SystemExit) on JIT failure.
- Tag and verify: tag the image per convention, run a smoke test (10 tokens generation), report image size and build time.
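As a rough illustration of the first three steps, the sketch below resolves a build configuration from an explicit architecture parameter or from detection text. Everything named here (`parse_gfx_arch`, `resolve_build_config`, the dictionary keys) is hypothetical and not part of ATOM or AITER. It assumes the detection text comes from a tool such as `rocminfo` that prints the `gfx` target name; only the arch-to-FP8 mapping, the hipBLASLt default, the `ATOM_CK_FREE=1` switch, and the gfx950 ASM GEMM restriction are taken from the list above.

```python
import os
import re

# Mapping stated in the list above; values are plain strings, not dtype objects.
FP8_VARIANT_BY_ARCH = {"gfx942": "e4m3fnuz", "gfx950": "e4m3fn"}


def parse_gfx_arch(text):
    """Return the first gfx942/gfx950 token found in text, or None."""
    match = re.search(r"\bgfx(?:942|950)\b", text)
    return match.group(0) if match else None


def resolve_build_config(arch_param=None, detected_text="", env=None):
    """Combine an explicit arch (if given) or detected arch with env flags."""
    env = os.environ if env is None else env
    # An explicit parameter takes precedence over auto-detection.
    arch = arch_param or parse_gfx_arch(detected_text)
    if arch not in FP8_VARIANT_BY_ARCH:
        # Fail loudly instead of silently selecting the wrong kernels.
        raise ValueError(f"unsupported or undetected architecture: {arch!r}")
    return {
        "arch": arch,
        "fp8_variant": FP8_VARIANT_BY_ARCH[arch],
        "gemm_default": "hipblaslt",                 # never "torch"
        "ck_free": env.get("ATOM_CK_FREE") == "1",   # CK-free mode requested
        "allow_asm_gemm": arch != "gfx950",          # ASM GEMM is wrong on gfx950
    }


if __name__ == "__main__":
    print(resolve_build_config(detected_text="  Name: gfx950",
                               env={"ATOM_CK_FREE": "1"}))
```

Raising on an unknown or missing architecture is a deliberate choice here: a silent default is exactly the kind of detection failure the motivation describes.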
Acceptance Criteria
- Produces a working Docker image that generates coherent text
- gfx942 and gfx950 both produce correct output (cosine > 0.999 vs reference)
- GEMM defaults to hipBLASLt, not torch
- CK-free mode works when `ATOM_CK_FREE=1`
- ASM GEMM is disabled on gfx950 or produces correct output
- Build time is documented and image is tagged per convention
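The cosine-similarity criterion above could be checked along these lines. This is a minimal pure-Python sketch on flattened output vectors; the function names and backend labels are illustrative, and a real check would compare tensors from actual forward passes against the FP16 reference.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def check_backends(reference, outputs, threshold=0.999):
    """Map each backend name to (similarity, passed) against the reference."""
    results = {}
    for name, values in outputs.items():
        similarity = cosine_similarity(values, reference)
        # Strictly greater than the threshold, as in the criteria above.
        results[name] = (similarity, similarity > threshold)
    return results


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]  # stand-in for flattened FP16 reference output
    outputs = {
        "hipblaslt": [1.0, 2.0, 3.001],  # tiny numerical drift
        "broken": [3.0, -1.0, 0.5],      # stand-in for garbage output
    }
    for name, (sim, ok) in check_backends(reference, outputs).items():
        print(f"{name}: cosine={sim:.6f} {'PASS' if ok else 'FAIL'}")
```

Treating a zero vector as an error rather than a failed comparison makes an all-zeros backend output stand out in the build log.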