From ecf367693b8eada5ca808122b727031a0dd5663a Mon Sep 17 00:00:00 2001
From: Meekail Zain
Date: Thu, 12 Feb 2026 11:26:10 -0600
Subject: [PATCH] Added initial AI Agent instructions

---
 .claude/ck-debugging/SKILL.md    | 100 +++++++++++++++++++++++++
 .claude/ck-debugging/TEMPLATE.md | 121 +++++++++++++++++++++++++++++++
 CLAUDE.md                        |  57 +++++++++++++++
 3 files changed, 278 insertions(+)
 create mode 100644 .claude/ck-debugging/SKILL.md
 create mode 100644 .claude/ck-debugging/TEMPLATE.md
 create mode 100644 CLAUDE.md

diff --git a/.claude/ck-debugging/SKILL.md b/.claude/ck-debugging/SKILL.md
new file mode 100644
index 000000000..4af0df589
--- /dev/null
+++ b/.claude/ck-debugging/SKILL.md
@@ -0,0 +1,100 @@
+---
+name: aiter-ck-integration-debugging
+description: Triage and isolate CK/AITER Fused Attention failures in TransformerEngine as integration vs kernel issues.
+---
+
+# CK Fused Attention Debugging Guide (TransformerEngine, ROCm)
+
+Use this playbook to quickly answer one question:
+**Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?**
+
+## 1) Map the integration surface first
+- Build-time CK args parsing/validation:
+  - `transformer_engine/common/CMakeLists.txt`
+  - `tools/check_aiter_mha_args_usage.py`
+- CK fused-attn kernel wrappers/entry points:
+  - `transformer_engine/common/ck_fused_attn/ck_fused_attn_*`
+- CK backend preprocessing and dispatch glue:
+  - `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp`
+- Runtime backend selection / fallback path:
+  - `transformer_engine/common/fused_attn_rocm/fused_attn.cpp`
+
+## 2) Gather minimum reproducibility context (before changing code)
+Capture these from logs or user report:
+- Forward vs backward failure (`fwd` / `bwd`)
+- Exact shape/config: batch, seq lengths (`s_q`, `s_kv`), num heads, head dim
+- Data type(s): fp16/bf16/fp8
+- Mask/dropout/causal/windowing/alibi/padding settings
+- GQA/MQA/group mode details if used
+- GPU architecture + ROCm version + TE commit
+- Whether fallback backend succeeds
+
+When self-collecting logs (for example, rerunning a failing pytest), enable full config logging in the same command: `NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 `.
+
+If reproducing triggers a segmentation fault, rerun under `rocgdb` to capture a usable backtrace: `rocgdb --args python -m pytest ...` (then run and collect `bt`).
+
+If config info is incomplete, request it first; otherwise debugging is noisy and slow.
+
+## 3) Reproduce in controlled CK-only path
+Preferred path (native executables):
+1. From `3rdparty/aiter/op_tests/cpp/mha`, build with `mha_build.sh`.
+2. Use `fwd.exe -?` / `bwd.exe -?` to confirm argument mapping.
+3. Run with required runtime env:
+   - `LD_LIBRARY_PATH=/transformer_engine/lib:${LD_LIBRARY_PATH}`
+   - `AITER_ASM_DIR=$(realpath ../../../hsa)`
+4. Encode the failing TE config exactly in `fwd.exe` / `bwd.exe` args and record the full command.
+
+Fallback path (AITER Python JIT):
+- Use `3rdparty/aiter/op_tests/test_mha.py` when native arg mapping is unclear or rapid iteration is needed.
+- Add a small wrapper test (for example, `test_te_reproducer`) that pins only the failing config.
+- Keep this wrapper minimal and temporary; include the exact parameters in handoff notes.
+
+## 4) Decision tree: integration bug vs kernel bug
+1. **Fails in TE, but passes in `fwd.exe`/`bwd.exe` with equivalent config**
+   - Likely TE integration bug.
+   - Focus on argument marshaling/normalization in:
+     - `fused_attn_ck.cpp`
+     - `ck_fused_attn_*`
+     - backend selection conditions in `fused_attn.cpp`
+
+2. **Fails both in TE and standalone `fwd.exe`/`bwd.exe`**
+   - Likely CK/AITER kernel issue (or unsupported config).
+   - Produce a minimal standalone reproducer command and hand off.
+
+3. **Passes in TE only when fallback backend is chosen**
+   - CK eligibility/selection guard likely wrong.
+   - Inspect backend capability checks and shape constraints in `fused_attn.cpp`.
+
+## 5) High-value checks when it is integration-related
+- Verify all expected CK args are present and in the right order/type.
+- Check TE→CK conversions for:
+  - layout / strides
+  - sequence length semantics (`s_q` vs `s_kv`)
+  - grouped-query mapping
+  - mask/bias/dropout flags
+  - causal/windowing flags
+  - dtype/accumulator assumptions
+- Confirm no silent defaulting for missing fields.
+- Confirm runtime-selected backend matches intent (no accidental fallback/misroute).
+
+## 6) Output artifact requirements (always produce)
+For each investigated failure, record:
+- TE reproducer summary (shapes, dtype, flags)
+- Standalone command(s) tested (`fwd.exe`/`bwd.exe`) and result
+- Classification: `integration` or `kernel`
+- Owning component and next action
+
+Suggested concise handoff format:
+- **Config:** `B=?, Sq=?, Skv=?, H=?, D=?, dtype=?, causal=?, dropout=?, mask=?`
+- **TE result:** pass/fail + key error
+- **Standalone result:** pass/fail + key error
+- **Conclusion:** integration vs kernel
+- **Owner:** TE vs AITER/CK
+
+For more comprehensive output formatting, reference [TEMPLATE.md](TEMPLATE.md)
+
+## 7) Common pitfalls
+- Mismatch between TE-side defaults and standalone binary defaults.
+- Treating unsupported config as runtime failure instead of eligibility failure.
+- Comparing non-equivalent configs across TE and standalone paths.
+- Missing backward-only failures (always test both directions when applicable).
\ No newline at end of file
diff --git a/.claude/ck-debugging/TEMPLATE.md b/.claude/ck-debugging/TEMPLATE.md
new file mode 100644
index 000000000..1cacdf24e
--- /dev/null
+++ b/.claude/ck-debugging/TEMPLATE.md
@@ -0,0 +1,121 @@
+# CK/AITER Fused-Attn Debug Handoff Template
+
+Use this template when handing off a failure investigation to TE or AITER/CK owners.
+
+---
+
+## 1) Summary
+- **Classification:** `integration` | `kernel` | `unknown`
+- **Direction:** `fwd` | `bwd` | `both`
+
+## 2) Environment
+- **TE commit:**
+- **AITER commit/submodule ref:**
+- **ROCm version:**
+- **GPU architecture (gfx):**
+
+## 3) Failing Configuration
+- **Batch (B):**
+- **Query seq (Sq):**
+- **KV seq (Skv):**
+- **Num heads (H):**
+- **Head dim (D):**
+- **DType(s):** fp16 / bf16 / fp8
+- **Causal:** true/false
+- **Dropout:**
+- **Mask/Bias mode:**
+- **Windowing/Alibi/Padding:**
+- **GQA/MQA details:**
+
+## 4) TE Reproducer
+- **Backend intent:** CK only / auto / fallback allowed
+- **Command or test entrypoint:**
+- **Key env vars:**
+- **Observed result:** pass/fail
+- **First failing log line / error signature:**
+
+## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
+- **Build command:**
+- **Runtime env:**
+  - `LD_LIBRARY_PATH=/transformer_engine/lib:${LD_LIBRARY_PATH}`
+  - `AITER_ASM_DIR=$(realpath ../../../hsa)`
+- **Exact standalone command(s):**
+- **Observed result:** pass/fail
+- **First failing log line / error signature:**
+
+## 6) Equivalence Check (TE vs Standalone)
+- **Are shape/dtype/flags exactly matched?** yes/no
+- **Any default mismatch noticed?**
+- **Notes:**
+
+## 7) Conclusion and Ownership
+- **Conclusion:** integration vs kernel vs unsupported-config
+- **Likely owner:** TE (`fused_attn_ck.cpp` / `fused_attn.cpp` / `ck_fused_attn_*`) or AITER/CK kernel team
+- **Requested next action:**
+
+## 8) Artifacts
+- **Logs attached:**
+- **Minimal reproducer commands attached:**
+- **Patch/commit links (if any):**
+
+---
+
+# Example (Filled)
+
+## 1) Summary
+- **Classification:** `integration`
+- **Direction:** `bwd`
+
+## 2) Environment
+- **TE commit:** `abc1234`
+- **AITER commit/submodule ref:** `def5678`
+- **ROCm version:** 6.2.1
+- **GPU architecture (gfx):** gfx942
+
+## 3) Failing Configuration
+- **Batch (B):** 4
+- **Query seq (Sq):** 4096
+- **KV seq (Skv):** 4096
+- **Num heads (H):** 32
+- **Head dim (D):** 128
+- **DType(s):** bf16
+- **Causal:** true
+- **Dropout:** 0.0
+- **Mask/Bias mode:** causal mask only
+- **Windowing/Alibi/Padding:** none
+- **GQA/MQA details:** none
+
+## 4) TE Reproducer
+- **Backend intent:** CK only
+- **Command or test entrypoint:** `pytest tests/pytorch/fused_attn/test_fused_attn.py::test_bwd_case_x`
+- **Key env vars:** CK backend forced; debug logging enabled
+- **Observed result:** fail
+- **First failing log line / error signature:** `invalid argument: ck_bwd workspace size mismatch`
+
+## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
+- **Build command:** `./mha_build.sh`
+- **Runtime env:**
+  - `LD_LIBRARY_PATH=/transformer_engine/lib:${LD_LIBRARY_PATH}`
+  - `AITER_ASM_DIR=$(realpath ../../../hsa)`
+- **Exact standalone command(s):**
+  - `./bwd.exe `
+  - `./fwd.exe `
+- **Observed result:** pass (both)
+- **First failing log line / error signature:** N/A
+
+## 6) Equivalence Check (TE vs Standalone)
+- **Are shape/dtype/flags exactly matched?** yes
+- **Any default mismatch noticed?** TE-side workspace/alignment default differs from standalone path
+- **Notes:** likely marshaling/normalization issue before CK call
+
+## 7) Conclusion and Ownership
+- **Conclusion:** integration
+- **Likely owner:** TE (`fused_attn_ck.cpp` argument preparation)
+- **Requested next action:** inspect workspace-size and alignment mapping in TE→CK bwd path
+
+## 8) Artifacts
+- **Logs attached:** `te_fail.log`, `standalone_pass.log`
+- **Minimal reproducer commands attached:** yes
+- **Patch/commit links (if any):** none
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000..c83a27971
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,57 @@
+# Agent instructions for TransformerEngine (ROCm fork)
+
+## Big picture
+- This repo builds **one core C++/HIP library** plus optional framework bindings:
+  - core: `transformer_engine/common` (CMake project producing `libtransformer_engine.so`)
+  - PyTorch binding: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
+  - JAX binding: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
+- Python import flow is split:
+  - top-level framework selection in `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` controls `pytorch|jax|all|none`)
+  - `.so` discovery/loading logic in `transformer_engine/common/__init__.py` (`load_framework_extension`, wheel/source/editable layouts)
+- Build orchestration is in `setup.py` + `build_tools/*.py`, not only in CMake.
+  - `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.
+
+## Platform/backends
+- ROCm path is first-class in this fork (`README.rst`, `transformer_engine/common/CMakeLists.txt`).
+- Fused attention backends are runtime/compile-time gated by env vars:
+  - `NVTE_FUSED_ATTN`, `NVTE_FUSED_ATTN_CK`, `NVTE_FUSED_ATTN_AOTRITON`
+- ROCm fused-attn implementation is in `transformer_engine/common/fused_attn_rocm/*`; CK and AOTriton integration is wired in `transformer_engine/common/CMakeLists.txt`.
+- Build-time validation for CK args runs from `setup.py` via `tools/check_aiter_mha_args_usage.py`.
+
+## Developer workflows you should follow
+- Always initialize submodules before debugging build failures: `git submodule update --init --recursive` (required by CMake for 3rdparty deps).
+- Typical source install in this repo: `pip install . --no-build-isolation` (see `README.rst`).
+- C++ tests: build/run from `tests/cpp` with CMake+Ninja (`qa/L0_cppunittest/test.sh`, `ci/core.sh`).
+- CI-style framework test entrypoints are shell scripts, not a single pytest command:
+  - PyTorch: `ci/pytorch.sh`
+  - JAX: `ci/jax.sh`
+  - They use `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` from `ci/_utils.sh`.
+- Lint/format workflow is repo-specific:
+  - local formatting: `qa/format.sh` (pre-commit hooks)
+  - cpplint+pylint flows: `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`
+
+## Code conventions and change boundaries
+- Prefer edits in `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid changing `3rdparty/*` unless explicitly required.
+- Preserve dual-platform structure when modifying kernels/build logic:
+  - shared sources are often `.cu` then hipified for ROCm (`transformer_engine/common/CMakeLists.txt`, `build_tools/pytorch.py`, `build_tools/jax.py`).
+  - never edit HIP files directly -- instead, edit the CUDA source and let the build system generate HIP variants.
+- Keep environment-variable behavior stable; many tests intentionally toggle flags (examples in `ci/pytorch.sh` and `ci/jax.sh`).
+- Respect existing tooling/style:
+  - Python formatted by Black (line length 100) via `.pre-commit-config.yaml`
+  - C/C++ style checked by cpplint and `.clang-format`
+
+## Practical pointers for AI agents
+- If import fails with missing TE extension `.so`, inspect `transformer_engine/common/__init__.py` path resolution before changing packaging.
+- If framework extension unexpectedly does not build on ROCm, check framework detection in `build_tools/utils.py::get_frameworks()` (ROCm-capable torch/jax checks).
+- For fused-attn regressions, reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`) like CI scripts do.
+
+## Using Docker containers
+- We generally work in Docker containers for reproducibility.
+- For live debugging/investigations, run build/test commands **only** inside the designated container (not on host).
+- If container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
+- Before debugging, record runtime context in notes/logs:
+  - container image/tag
+  - ROCm version in container
+  - GPU architecture visible in container
+  - TE commit/submodule state
+- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
\ No newline at end of file
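
The "record runtime context" step in the Docker section of `CLAUDE.md` could be scripted roughly as follows. This is a minimal sketch and not part of the patch. It assumes that the ROCm version can be read from `/opt/rocm/.info/version`, that `rocminfo` is on `PATH` and prints `gfx` target names, and that the container image/tag is passed in through a hypothetical `TE_CONTAINER_IMAGE` environment variable (the image name is generally not discoverable from inside a container). Any value that cannot be determined is reported as `unknown` instead of raising.

```python
import os
import re
import shutil
import subprocess
from pathlib import Path


def run_or_unknown(cmd, cwd=None):
    """Return stripped stdout of cmd, or 'unknown' if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return "unknown"
    try:
        result = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return "unknown"
    out = result.stdout.strip()
    return out if result.returncode == 0 and out else "unknown"


def parse_gfx_targets(rocminfo_output):
    """Extract unique gfx target names from rocminfo text, in first-seen order."""
    targets = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        if name not in targets:
            targets.append(name)
    return targets


def collect_context(repo_dir, rocm_root="/opt/rocm", env=None):
    """Gather the four context items listed in the Docker section."""
    env = os.environ if env is None else env
    # Assumed location of the ROCm version file; adjust if your image differs.
    version_file = Path(rocm_root) / ".info" / "version"
    rocm_version = (
        version_file.read_text().strip() if version_file.is_file() else "unknown"
    )
    return {
        # TE_CONTAINER_IMAGE is a hypothetical variable set at container launch.
        "container_image": env.get("TE_CONTAINER_IMAGE", "unknown"),
        "rocm_version": rocm_version,
        "gfx_targets": parse_gfx_targets(run_or_unknown(["rocminfo"])) or ["unknown"],
        "te_commit": run_or_unknown(["git", "rev-parse", "HEAD"], cwd=repo_dir),
        "submodules": run_or_unknown(["git", "submodule", "status"], cwd=repo_dir),
    }


def format_context(ctx):
    """Render the context as a Markdown bullet list for notes or logs."""
    lines = [
        f"- container image/tag: {ctx['container_image']}",
        f"- ROCm version: {ctx['rocm_version']}",
        f"- GPU architecture: {', '.join(ctx['gfx_targets'])}",
        f"- TE commit: {ctx['te_commit']}",
        "- submodule state:",
    ]
    lines += [f"  - {line.strip()}" for line in ctx["submodules"].splitlines()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_context(collect_context(Path.cwd())))
```

Run from the TE checkout inside the container, the output can be pasted into the Environment section of a handoff note.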