100 changes: 100 additions & 0 deletions .claude/ck-debugging/SKILL.md
@@ -0,0 +1,100 @@
---
name: aiter-ck-integration-debugging
description: Triage and isolate CK/AITER Fused Attention failures in TransformerEngine as integration vs kernel issues.
---

# CK Fused Attention Debugging Guide (TransformerEngine, ROCm)

Use this playbook to quickly answer one question:
**Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?**

## 1) Map the integration surface first
- Build-time CK args parsing/validation:
  - `transformer_engine/common/CMakeLists.txt`
  - `tools/check_aiter_mha_args_usage.py`
- CK fused-attn kernel wrappers/entry points:
  - `transformer_engine/common/ck_fused_attn/ck_fused_attn_*`
- CK backend preprocessing and dispatch glue:
  - `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp`
- Runtime backend selection / fallback path:
  - `transformer_engine/common/fused_attn_rocm/fused_attn.cpp`

## 2) Gather minimum reproducibility context (before changing code)
Capture these from logs or user report:
- Forward vs backward failure (`fwd` / `bwd`)
- Exact shape/config: batch, seq lengths (`s_q`, `s_kv`), num heads, head dim
- Data type(s): fp16/bf16/fp8
- Mask/dropout/causal/windowing/alibi/padding settings
- GQA/MQA/group mode details if used
- GPU architecture + ROCm version + TE commit
- Whether fallback backend succeeds

When self-collecting logs (for example, rerunning a failing pytest), enable full config logging in the same command: `NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>`.
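
The same logging setup can be sketched as a small Python helper, which is convenient when the log should also be saved for a handoff. This is only an illustration: the three variable names and the value `1` come from the command above, while `logging_env` and `run_with_logging` are hypothetical helper names, not part of TE or AITER.

```python
import os
import subprocess

# Logging variables named in this guide; the value "1" mirrors the command above.
CK_LOG_ENV = {
    "NVTE_LOG_FUSED_ATTN_CONFIG": "1",
    "NVTE_LOG_CK_CONFIG": "1",
    "CK_FUSED_ATTN_LOG_CONFIG": "1",
}


def logging_env(base=None):
    """Return a copy of the environment with full config logging enabled."""
    env = dict(os.environ if base is None else base)
    env.update(CK_LOG_ENV)
    return env


def run_with_logging(cmd, log_path):
    """Run cmd (a list of arguments) with logging enabled.

    Combined stdout/stderr is written to log_path; the exit code is returned.
    """
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, env=logging_env(), stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

For example, `run_with_logging(["python", "-m", "pytest", test_id], "te_fail.log")` would rerun a failing test, where `test_id` is whatever test the report names, and keep the config dump next to the error.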

If reproducing triggers a segmentation fault, rerun under `rocgdb` to capture a usable backtrace: `rocgdb --args python -m pytest ...`, then issue `run` and, once the process crashes, `bt` to collect the backtrace.

If config info is incomplete, request it first; otherwise debugging is noisy and slow.

## 3) Reproduce in controlled CK-only path
Preferred path (native executables):
1. From `3rdparty/aiter/op_tests/cpp/mha`, build with `mha_build.sh`.
2. Use `fwd.exe -?` / `bwd.exe -?` to confirm argument mapping.
3. Run with required runtime env:
   - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
   - `AITER_ASM_DIR=$(realpath ../../../hsa)`
4. Encode the failing TE config exactly in `fwd.exe` / `bwd.exe` args and record the full command.
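
When the standalone binaries are launched from a script rather than an interactive shell, the runtime environment from step 3 might be assembled as in the sketch below. The two variable names and the relative `hsa` path come from the steps above; `standalone_env` and its parameters are hypothetical, and the sketch assumes `mha_dir` points at `3rdparty/aiter/op_tests/cpp/mha`.

```python
import os
from pathlib import Path


def standalone_env(te_root, mha_dir, base=None):
    """Build the environment for fwd.exe / bwd.exe.

    te_root: TransformerEngine checkout root (<TE_ROOT> in the steps above).
    mha_dir: directory containing the built executables.
    """
    env = dict(os.environ if base is None else base)

    # Prepend the TE library directory, keeping any existing search path.
    lib_dir = str(Path(te_root) / "transformer_engine" / "lib")
    prior = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + prior if prior else "")

    # Same as $(realpath ../../../hsa) evaluated from inside mha_dir.
    env["AITER_ASM_DIR"] = os.path.realpath(Path(mha_dir) / ".." / ".." / ".." / "hsa")
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` together with the exact `fwd.exe` / `bwd.exe` argument list, so the recorded command and its environment stay together.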

Fallback path (AITER Python JIT):
- Use `3rdparty/aiter/op_tests/test_mha.py` when native arg mapping is unclear or rapid iteration is needed.
- Add a small wrapper test (for example, `test_te_reproducer`) that pins only the failing config.
- Keep this wrapper minimal and temporary; include the exact parameters in handoff notes.

## 4) Decision tree: integration bug vs kernel bug
1. **Fails in TE, but passes in `fwd.exe`/`bwd.exe` with equivalent config**
   - Likely TE integration bug.
   - Focus on argument marshaling/normalization in:
     - `fused_attn_ck.cpp`
     - `ck_fused_attn_*`
     - backend selection conditions in `fused_attn.cpp`

2. **Fails both in TE and standalone `fwd.exe`/`bwd.exe`**
   - Likely CK/AITER kernel issue (or unsupported config).
   - Produce a minimal standalone reproducer command and hand off.

3. **Passes in TE only when fallback backend is chosen**
   - CK eligibility/selection guard likely wrong.
   - Inspect backend capability checks and shape constraints in `fused_attn.cpp`.
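
The decision tree above can be written down as a small function, which helps keep the classification consistent across investigations. This is one possible reading of the three cases, not an established tool: `triage` is a hypothetical name, and treating the fallback observation as an extra hint on top of the integration/kernel label is an assumption of this sketch.

```python
def triage(te_ck_passes, standalone_passes, te_fallback_passes=None):
    """Map pass/fail observations to (classification, hints).

    te_ck_passes:       TE run on the CK backend passed.
    standalone_passes:  equivalent fwd.exe / bwd.exe run passed.
    te_fallback_passes: TE run on the fallback backend passed (None if not tested).
    """
    if te_ck_passes:
        # Nothing failed on the CK path, so there is nothing to classify.
        return None, []

    hints = []
    if standalone_passes:
        # Case 1: fails in TE only.
        label = "integration"
        hints.append("check argument marshaling in fused_attn_ck.cpp and ck_fused_attn_*")
    else:
        # Case 2: fails in both TE and the standalone binaries.
        label = "kernel"
        hints.append("produce a minimal standalone reproducer and hand off")

    if te_fallback_passes:
        # Case 3: only the fallback backend works in TE.
        hints.append("inspect CK eligibility checks and shape constraints in fused_attn.cpp")
    return label, hints
```

A call such as `triage(False, True)` yields the `integration` label, while `triage(False, False, True)` yields `kernel` plus a reminder to review the eligibility guard, since an unsupported config should have been rejected before dispatch.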

## 5) High-value checks when it is integration-related
- Verify all expected CK args are present and in the right order/type.
- Check TE→CK conversions for:
  - layout / strides
  - sequence length semantics (`s_q` vs `s_kv`)
  - grouped-query mapping
  - mask/bias/dropout flags
  - causal/windowing flags
  - dtype/accumulator assumptions
- Confirm no silent defaulting for missing fields.
- Confirm runtime-selected backend matches intent (no accidental fallback/misroute).

## 6) Output artifact requirements (always produce)
For each investigated failure, record:
- TE reproducer summary (shapes, dtype, flags)
- Standalone command(s) tested (`fwd.exe`/`bwd.exe`) and result
- Classification: `integration` or `kernel`
- Owning component and next action

Suggested concise handoff format:
- **Config:** `B=?, Sq=?, Skv=?, H=?, D=?, dtype=?, causal=?, dropout=?, mask=?`
- **TE result:** pass/fail + key error
- **Standalone result:** pass/fail + key error
- **Conclusion:** integration vs kernel
- **Owner:** TE vs AITER/CK
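
If handoff notes are generated by a script, the concise format above could be rendered as in the following sketch. The five field labels are taken from the list above; the `Handoff` class and its attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    config: dict            # e.g. {"B": 4, "Sq": 4096, ...}, in the order to print
    te_result: str          # pass/fail + key error
    standalone_result: str  # pass/fail + key error
    conclusion: str         # integration vs kernel
    owner: str              # TE vs AITER/CK

    def render(self) -> str:
        """Return the handoff as Markdown bullets in the suggested order."""
        cfg = ", ".join(f"{key}={value}" for key, value in self.config.items())
        return "\n".join(
            [
                f"- **Config:** `{cfg}`",
                f"- **TE result:** {self.te_result}",
                f"- **Standalone result:** {self.standalone_result}",
                f"- **Conclusion:** {self.conclusion}",
                f"- **Owner:** {self.owner}",
            ]
        )
```

Keeping the config as an ordered dict means the same object can also be compared field by field against the standalone arguments before the note is written.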

For a more comprehensive handoff format, see [TEMPLATE.md](TEMPLATE.md).

## 7) Common pitfalls
- Mismatch between TE-side defaults and standalone binary defaults.
- Treating unsupported config as runtime failure instead of eligibility failure.
- Comparing non-equivalent configs across TE and standalone paths.
- Missing backward-only failures (always test both directions when applicable).
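
Several of these pitfalls come down to the TE and standalone configs not being identical. A minimal sketch of an explicit comparison is shown below; `config_mismatches` and the `UNSET` marker are hypothetical, and the field names are whatever the investigator chooses to record on each side.

```python
UNSET = "<unset>"  # marks a field that one side never specified (a silent default)


def config_mismatches(te_cfg, standalone_cfg):
    """Return {field: (te_value, standalone_value)} for every differing field.

    A field present on only one side is reported with UNSET on the other,
    since a missing field usually means a default was applied implicitly.
    """
    fields = sorted(set(te_cfg) | set(standalone_cfg))
    return {
        name: (te_cfg.get(name, UNSET), standalone_cfg.get(name, UNSET))
        for name in fields
        if te_cfg.get(name, UNSET) != standalone_cfg.get(name, UNSET)
    }
```

An empty result is the precondition for trusting a "passes standalone, fails in TE" conclusion; any entry in the result should be resolved or noted before classifying.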
121 changes: 121 additions & 0 deletions .claude/ck-debugging/TEMPLATE.md
@@ -0,0 +1,121 @@
# CK/AITER Fused-Attn Debug Handoff Template

Use this template when handing off a failure investigation to TE or AITER/CK owners.

---

## 1) Summary
- **Classification:** `integration` | `kernel` | `unknown`
- **Direction:** `fwd` | `bwd` | `both`

## 2) Environment
- **TE commit:**
- **AITER commit/submodule ref:**
- **ROCm version:**
- **GPU architecture (gfx):**

## 3) Failing Configuration
- **Batch (B):**
- **Query seq (Sq):**
- **KV seq (Skv):**
- **Num heads (H):**
- **Head dim (D):**
- **DType(s):** fp16 / bf16 / fp8
- **Causal:** true/false
- **Dropout:**
- **Mask/Bias mode:**
- **Windowing/Alibi/Padding:**
- **GQA/MQA details:**

## 4) TE Reproducer
- **Backend intent:** CK only / auto / fallback allowed
- **Command or test entrypoint:**
- **Key env vars:**
- **Observed result:** pass/fail
- **First failing log line / error signature:**

## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
- **Build command:**
- **Runtime env:**
  - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
  - `AITER_ASM_DIR=$(realpath ../../../hsa)`
- **Exact standalone command(s):**
- **Observed result:** pass/fail
- **First failing log line / error signature:**

## 6) Equivalence Check (TE vs Standalone)
- **Are shape/dtype/flags exactly matched?** yes/no
- **Any default mismatch noticed?**
- **Notes:**

## 7) Conclusion and Ownership
- **Conclusion:** integration vs kernel vs unsupported-config
- **Likely owner:** TE (`fused_attn_ck.cpp` / `fused_attn.cpp` / `ck_fused_attn_*`) or AITER/CK kernel team
- **Requested next action:**

## 8) Artifacts
- **Logs attached:**
- **Minimal reproducer commands attached:**
- **Patch/commit links (if any):**

---

# Example (Filled)

## 1) Summary
- **Classification:** `integration`
- **Direction:** `bwd`

## 2) Environment
- **TE commit:** `abc1234`
- **AITER commit/submodule ref:** `def5678`
- **ROCm version:** 6.2.1
- **GPU architecture (gfx):** gfx942

## 3) Failing Configuration
- **Batch (B):** 4
- **Query seq (Sq):** 4096
- **KV seq (Skv):** 4096
- **Num heads (H):** 32
- **Head dim (D):** 128
- **DType(s):** bf16
- **Causal:** true
- **Dropout:** 0.0
- **Mask/Bias mode:** causal mask only
- **Windowing/Alibi/Padding:** none
- **GQA/MQA details:** none

## 4) TE Reproducer
- **Backend intent:** CK only
- **Command or test entrypoint:** `pytest tests/pytorch/fused_attn/test_fused_attn.py::test_bwd_case_x`
- **Key env vars:** CK backend forced; debug logging enabled
- **Observed result:** fail
- **First failing log line / error signature:** `invalid argument: ck_bwd workspace size mismatch`

## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
- **Build command:** `./mha_build.sh`
- **Runtime env:**
  - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
  - `AITER_ASM_DIR=$(realpath ../../../hsa)`
- **Exact standalone command(s):**
  - `./bwd.exe <equivalent args>`
  - `./fwd.exe <equivalent args>`
- **Observed result:** pass (both)
- **First failing log line / error signature:** N/A

## 6) Equivalence Check (TE vs Standalone)
- **Are shape/dtype/flags exactly matched?** yes
- **Any default mismatch noticed?** TE-side workspace/alignment default differs from standalone path
- **Notes:** likely marshaling/normalization issue before CK call

## 7) Conclusion and Ownership
- **Conclusion:** integration
- **Likely owner:** TE (`fused_attn_ck.cpp` argument preparation)
- **Requested next action:** inspect workspace-size and alignment mapping in TE→CK bwd path

## 8) Artifacts
- **Logs attached:** `te_fail.log`, `standalone_pass.log`
- **Minimal reproducer commands attached:** yes
- **Patch/commit links (if any):** none
57 changes: 57 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,57 @@
# Agent instructions for TransformerEngine (ROCm fork)

## Big picture
- This repo builds **one core C++/HIP library** plus optional framework bindings:
  - core: `transformer_engine/common` (CMake project producing `libtransformer_engine.so`)
  - PyTorch binding: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
  - JAX binding: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
- Python import flow is split:
  - top-level framework selection in `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` controls `pytorch|jax|all|none`)
  - `.so` discovery/loading logic in `transformer_engine/common/__init__.py` (`load_framework_extension`, wheel/source/editable layouts)
- Build orchestration is in `setup.py` + `build_tools/*.py`, not only in CMake.
- `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.

## Platform/backends
- ROCm path is first-class in this fork (`README.rst`, `transformer_engine/common/CMakeLists.txt`).
- Fused attention backends are runtime/compile-time gated by env vars:
  - `NVTE_FUSED_ATTN`, `NVTE_FUSED_ATTN_CK`, `NVTE_FUSED_ATTN_AOTRITON`
- ROCm fused-attn implementation is in `transformer_engine/common/fused_attn_rocm/*`; CK and AOTriton integration is wired in `transformer_engine/common/CMakeLists.txt`.
- Build-time validation for CK args runs from `setup.py` via `tools/check_aiter_mha_args_usage.py`.

## Developer workflows you should follow
- Always initialize submodules before debugging build failures: `git submodule update --init --recursive` (required by CMake for 3rdparty deps).
- Typical source install in this repo: `pip install . --no-build-isolation` (see `README.rst`).
- C++ tests: build/run from `tests/cpp` with CMake+Ninja (`qa/L0_cppunittest/test.sh`, `ci/core.sh`).
- CI-style framework test entrypoints are shell scripts, not a single pytest command:
  - PyTorch: `ci/pytorch.sh`
  - JAX: `ci/jax.sh`
  - They use `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` from `ci/_utils.sh`.
- Lint/format workflow is repo-specific:
  - local formatting: `qa/format.sh` (pre-commit hooks)
  - cpplint+pylint flows: `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`

## Code conventions and change boundaries
- Prefer edits in `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid changing `3rdparty/*` unless explicitly required.
- Preserve dual-platform structure when modifying kernels/build logic:
  - shared sources are often `.cu` files that are then hipified for ROCm (`transformer_engine/common/CMakeLists.txt`, `build_tools/pytorch.py`, `build_tools/jax.py`).
  - never edit HIP files directly; edit the CUDA source and let the build system generate the HIP variants.
- Keep environment-variable behavior stable; many tests intentionally toggle flags (examples in `ci/pytorch.sh` and `ci/jax.sh`).
- Respect existing tooling/style:
  - Python formatted by Black (line length 100) via `.pre-commit-config.yaml`
  - C/C++ style checked by cpplint and `.clang-format`

## Practical pointers for AI agents
- If import fails with missing TE extension `.so`, inspect `transformer_engine/common/__init__.py` path resolution before changing packaging.
- If framework extension unexpectedly does not build on ROCm, check framework detection in `build_tools/utils.py::get_frameworks()` (ROCm-capable torch/jax checks).
- For fused-attn regressions, reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`) like CI scripts do.
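
A sketch of how the four backend configs might be expressed as environment overrides when scripting a reproduction is shown below. The three variable names appear earlier in this file, but the mapping itself is an assumption: it supposes that setting a variable to `0` disables that backend and that leaving it unset keeps it enabled, which should be confirmed against `ci/_utils.sh` before relying on it. `backend_env` and `BACKEND_OVERRIDES` are hypothetical names.

```python
import os

FUSED_ATTN_VARS = ("NVTE_FUSED_ATTN", "NVTE_FUSED_ATTN_CK", "NVTE_FUSED_ATTN_AOTRITON")

# Assumption: "0" disables a backend and an unset variable leaves it enabled.
BACKEND_OVERRIDES = {
    "auto": {},
    "ck": {"NVTE_FUSED_ATTN_AOTRITON": "0"},
    "aotriton": {"NVTE_FUSED_ATTN_CK": "0"},
    "unfused": {"NVTE_FUSED_ATTN": "0"},
}


def backend_env(name, base=None):
    """Return an environment for one backend config, clearing stale settings first."""
    env = dict(os.environ if base is None else base)
    for var in FUSED_ATTN_VARS:
        env.pop(var, None)  # avoid inheriting a toggle from the calling shell
    env.update(BACKEND_OVERRIDES[name])
    return env
```

Looping over `BACKEND_OVERRIDES` and passing each result as `env=` to `subprocess.run` gives one run per backend config without mutating the parent process environment.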

## Using Docker containers
- We generally work in Docker containers for reproducibility.
- For live debugging/investigations, run build/test commands **only** inside the designated container (not on host).
- If container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
- Before debugging, record runtime context in notes/logs:
  - container image/tag
  - ROCm version in container
  - GPU architecture visible in container
  - TE commit/submodule state
- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
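
The runtime context listed above could be gathered with a short script run inside the container, as in this sketch. It rests on several assumptions: `rocminfo` and `git` are on `PATH`, the script runs from the TE checkout, and the ROCm install provides a version file at `/opt/rocm/.info/version`. `run_cmd` and `runtime_context` are hypothetical names, and the image tag must be supplied by the caller because it is not reliably visible from inside a container.

```python
import re
import subprocess


def run_cmd(cmd):
    """Return the stripped stdout of cmd, or a placeholder if it cannot run."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return "<unavailable>"
    return done.stdout.strip()


def runtime_context(image_tag, run=run_cmd):
    """Collect the context items listed above into a dict for notes/logs."""
    # rocminfo prints agent names such as gfx942; keep each architecture once.
    gfx = sorted(set(re.findall(r"gfx[0-9a-f]+", run(["rocminfo"]))))
    return {
        "container image/tag": image_tag,
        # Assumption: ROCm lives under /opt/rocm and ships this version file.
        "ROCm version": run(["cat", "/opt/rocm/.info/version"]),
        "GPU architecture": ", ".join(gfx) or "<unavailable>",
        "TE commit": run(["git", "rev-parse", "HEAD"]),
        "submodule state": run(["git", "submodule", "status", "--recursive"]),
    }
```

An `<unavailable>` GPU architecture is itself a useful signal: it suggests the devices or ROCm tools are not exposed in the current container, which is worth resolving before any test result is trusted.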