
feat: multi-arch builds, env refactoring, new features, and expanded tests#307

Merged
Strivin0311 merged 99 commits into main from big-pr on Apr 8, 2026

Conversation

@littsk (Collaborator) commented Apr 1, 2026

Summary

Comprehensive update: multi-arch CUDA builds, env var centralisation, new features, API refactoring, and expanded tests.

Build System & Installation

  • Refactor setup.py for comma-separated compute capabilities (MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY=90,100)
  • Add scripts/install_on_scm.sh with ARCH env var for cross-platform wheel builds (aarch64/x86_64)
  • Add scm_setup.py, scripts/install_skip_all.sh
  • Update flash-attention submodule, remove runtime hotfixes from install_flash_attn_cute.sh

Env & API Refactoring

  • Centralise all MAGI_ATTENTION_* env var accessors into magi_attention/env/ package
  • Add MAGI_ATTENTION_LOG_LEVEL, common/protocols.py
  • Clean up public API exports, bump flake8/ruff/mypy
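The centralised env package described above can be pictured as a thin layer of typed accessors over os.environ; this is a minimal sketch, and the accessor name, validation rules, and default below are assumptions (the PR only names the MAGI_ATTENTION_LOG_LEVEL variable itself):

```python
import os

# Hypothetical accessor in the spirit of magi_attention/env/ -- the real
# module's function names are not shown in this PR.
_VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def get_log_level(default: str = "INFO") -> str:
    """Read MAGI_ATTENTION_LOG_LEVEL with a fallback default (assumed behavior)."""
    level = os.environ.get("MAGI_ATTENTION_LOG_LEVEL", default).upper()
    return level if level in _VALID_LEVELS else default
```

Centralising lookups like this keeps parsing and validation in one place instead of scattering raw os.environ reads across the codebase.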

New Features

  • FA4 Backend: Blackwell GPU support via FFA_FA4
  • No-Overlap Path: Non-overlapped distributed attention execution
  • SDPA Online: Online softmax-based SDPA fallback kernel
  • Uneven Shard: Last chunk can be smaller, no virtual padding
  • Dispatch Refactor: Custom autograd Functions for lower memory
  • Distributed Roll: P2P cyclic shift for MTP, O(N/P) memory
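The "SDPA Online" item above refers to an online-softmax formulation; as a rough illustration of that technique (plain-Python lists, not the PR's actual GPU kernel), a blockwise softmax-weighted sum can be computed in one pass with O(block) memory by rescaling running statistics whenever a new maximum appears:

```python
import math


def online_softmax_sum(scores, values, block_size=2):
    """Blockwise softmax-weighted sum via the online-softmax rescaling trick.

    Illustrative sketch only: m is the running max, l the running softmax
    denominator, acc the running weighted numerator.
    """
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block_size):
        s = scores[start:start + block_size]
        v = values[start:start + block_size]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)  # rescale running stats to the new max
        l = l * scale + sum(math.exp(si - m_new) for si in s)
        acc = acc * scale + sum(math.exp(si - m_new) * vi for si, vi in zip(s, v))
        m = m_new
    return acc / l
```

The result matches the direct (all-at-once) softmax-weighted sum, which is what makes the blockwise formulation usable as a fallback attention kernel.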

Refactoring

  • Merge csrc/utils/ into csrc/extensions/
  • Consolidate test_pipeline_sdpa.py into test_pipeline.py

Tests

  • New: test_dispatch.py, test_roll.py, test_protocol_conformance.py
  • Add dist_common.py testing utilities

Test Plan

  • pytest tests/ — all unit tests
  • pytest tests/test_pipeline.py — distributed tests
  • pytest tests/test_functional/ — dispatch and roll tests
  • Verify install_on_scm.sh runs correctly on x86_64/aarch64 clusters
  • Verify the FA4 backend (if Blackwell hardware is available)

littsk added 30 commits March 27, 2026 04:15
* Support Ampere with cutlass-based FFA_FA4 (#287)
* Update v1.1.0 overview (#285)
* Update v1.1.0 public blogs (#281)
* [HotFix]: fix proxy (#284)
* Add DSA attention interface in extensions (#283)

See merge request: !1
Support Ampere with cutlass-based FFA_FA4 (#287)

See merge request: !1
Refactor build system to accept comma-separated compute capabilities
via MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY (e.g. "90,100"). Add
helper functions parse_compute_capabilities, get_gencode_flags, and
resolve_build_capabilities in setup.py. Update CMakeLists.txt to
accept MAGI_CUDA_ARCHITECTURES and strip PyTorch-injected gencode
flags that may reference unsupported architectures.

Made-with: Cursor
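The helper names parse_compute_capabilities and get_gencode_flags come from the commit message above; the bodies below are an assumed sketch of how such helpers typically look (the validation details and exact flag format in the real setup.py may differ, though the -gencode syntax shown is standard nvcc):

```python
def parse_compute_capabilities(raw: str) -> list[str]:
    """Split a comma-separated spec like "90,100" into capability strings.

    Sketch only: real validation in setup.py may differ.
    """
    caps = [c.strip() for c in raw.split(",") if c.strip()]
    if not all(c.isdigit() for c in caps):
        raise ValueError(f"invalid compute capability spec: {raw!r}")
    return caps


def get_gencode_flags(caps: list[str]) -> list[str]:
    """Map capabilities to standard nvcc -gencode flags."""
    return [f"-gencode=arch=compute_{c},code=sm_{c}" for c in caps]
```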
Allow every query in a sample to always attend to the first
global_window_size key tokens in addition to the sliding window,
useful for architectures that require prefix tokens (e.g. sink tokens)
to be globally visible. Update docs with the new parameter.

Made-with: Cursor
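The visibility rule described above (sliding window plus an always-visible prefix of global_window_size key tokens) can be expressed as a simple predicate; this is a hypothetical illustration of the semantics, not code from the PR:

```python
def can_attend(q_idx: int, k_idx: int, window_size: int,
               global_window_size: int) -> bool:
    """Causal sliding-window visibility with an always-visible prefix.

    Hypothetical predicate: key k is visible to query q if it lies in the
    first global_window_size positions (e.g. sink tokens) or inside the
    trailing causal window of size window_size.
    """
    if k_idx < global_window_size:
        return True  # prefix/sink tokens are globally visible
    return q_idx - window_size < k_idx <= q_idx
```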
Introduce a P2P-based `roll` operation that cyclically shifts dispatched
local tensors along the sequence dimension without materialising the full
global tensor (O(N/P) memory instead of O(N)). Primarily designed for
Multi-Token Prediction (MTP) where labels are shifted relative to inputs.

- New `functional/roll.py` with `roll_p2p` implementation and autograd support
- Expose `roll` in public API (`magi_attention.api`)
- Clean up import paths: import `roll_func` directly from `functional.roll`
  instead of re-exporting through `functional.dispatch`
- Add `roll` section to API reference and quickstart docs
- Allow optional `num_heads_q/kv`, `head_dim` override in
  `make_flex_key_for_new_mask_after_dispatch`
- Add comprehensive tests (`tests/test_functional/test_roll.py`)

Made-with: Cursor
Consolidate the `ceil_div` helper into the shared utils module instead
of defining it locally in `api/functools.py`, so that meta/solver code
can reuse it without circular imports.

Made-with: Cursor
Replace the previous "virtual metadata padding" approach with a simpler
design where `total_seqlen` is used as-is (no padding at all):

- Use `ceil_div` for num_chunks so the last chunk can be smaller
- Remove `actual_total_seqlen_q/k` parameters from all interfaces
- MinHeapDispatchAlg now reports `is_equal_num_workloads=False` and
  uses `ceil_div` for the per-bucket job limit
- Simplify dispatch/undispatch: no zero-size virtual chunks, so
  `torch.split` works directly with `chunk_actual_sizes`
- Remove virtual padding logic from `magi_attn_flex_key` and
  `make_flex_key_for_new_mask_after_dispatch`

Made-with: Cursor
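The no-padding chunk layout above can be sketched in a few lines: ceil_div gives the chunk count, and the last chunk simply absorbs the remainder. The helper name chunk_actual_sizes is borrowed from the commit text; the real signature may differ:

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling division without floats."""
    return -(-a // b)


def chunk_actual_sizes(total_seqlen: int, chunk_size: int) -> list[int]:
    """Chunk layout with no virtual padding: the last chunk may be smaller.

    Sketch of the scheme described above -- torch.split can consume these
    sizes directly, with no zero-size virtual chunks.
    """
    num_chunks = ceil_div(total_seqlen, chunk_size)
    sizes = [chunk_size] * (num_chunks - 1)
    sizes.append(total_seqlen - chunk_size * (num_chunks - 1))
    return sizes
```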
Extract `_compute_segments` to handle source-segment calculation for
both uniform and variable (last-chunk-smaller) layouts. Refactor
`_roll_p2p_impl` to iterate segments generically, replacing the
previous special-case branches for r==0 and r>0.

Add comprehensive uneven-shard tests: aligned/non-aligned shifts,
cross-last-chunk wrapping, negative/large shifts, edge cases
(last_chunk_size=1), larger sequences, and backward correctness.

Made-with: Cursor
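The idea behind _compute_segments can be sketched as follows: for roll semantics dst[i] = src[(i - shift) % total], a rank's destination range is split into contiguous (src_rank, src_offset, length) triples that never cross a chunk boundary, so each triple can be served by a single P2P transfer even when the last chunk is smaller. The exact signature and return shape below are assumptions:

```python
def compute_segments(dst_lo, dst_hi, shift, total, chunk_size):
    """Map a destination index range to contiguous source segments.

    Sketch of the segment calculation for a cyclic roll over an
    unevenly-sharded sequence: yields (src_rank, src_offset, length).
    """
    segments = []
    i = dst_lo
    while i < dst_hi:
        src = (i - shift) % total
        rank = src // chunk_size
        rank_end = min((rank + 1) * chunk_size, total)  # last chunk may be short
        length = min(dst_hi - i, rank_end - src)
        segments.append((rank, src - rank * chunk_size, length))
        i += length
    return segments
```

For example, with total=10, chunk_size=3 (chunks of 3, 3, 3, 1) and shift=2, the destination range [3, 6) pulls two elements from rank 0 and one from rank 1.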
Remove virtual metadata padding logic from test_pipeline and
test_pipeline_sdpa: no longer need `compute_pad_size`/`apply_padding`
imports or `actual_total_seqlen_q/k` variables, since the uneven_shard
path now uses original total_seqlen directly.

Made-with: Cursor
…eqlens

Cache the hash of DistAttnRuntimeKey via __hash__ override to avoid
repeated hashing of all fields on every dict lookup. Also add lru_cache
to infer_attn_mask_from_cu_seqlens to skip redundant mask inference for
repeated cu_seqlens patterns.

Made-with: Cursor
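Both optimisations above follow standard patterns; here is a minimal sketch with made-up fields (the real DistAttnRuntimeKey and infer_attn_mask_from_cu_seqlens have different signatures):

```python
from functools import lru_cache


class RuntimeKey:
    """Sketch of the cached-__hash__ pattern; fields are hypothetical."""

    def __init__(self, cu_seqlens, num_heads):
        self.cu_seqlens = tuple(cu_seqlens)
        self.num_heads = num_heads
        # Hash all fields once at construction, not on every dict lookup.
        self._hash = hash((self.cu_seqlens, self.num_heads))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return (self.cu_seqlens, self.num_heads) == \
               (other.cu_seqlens, other.num_heads)


@lru_cache(maxsize=None)
def infer_mask_kind(cu_seqlens: tuple) -> str:
    """Stand-in for infer_attn_mask_from_cu_seqlens: memoised per pattern."""
    return "varlen" if len(cu_seqlens) > 2 else "dense"
```

Caching the hash is safe only because the fields are effectively immutable after construction, which also makes the lru_cache on the inference function valid.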
Add precise type hints for return types, parameters, and internal data
structures. Import DistAttnRuntimeMgr for proper typing and remove the
resolved TODO comment.

Made-with: Cursor
Replace the concat-all-then-scatter approach with custom autograd
Functions (_DispatchFunc / _UndispatchFunc). Forward dispatch now
selects local chunks directly (O(shard_seqlen) alloc) instead of
building a full permuted tensor (O(total_seqlen)). Backward uses
all_gather_v + unpermute, mirroring the inverse path.

Made-with: Cursor
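The forward-path change above can be illustrated with a toy list-based sketch: instead of materialising the fully permuted sequence and scattering it, each rank slices out only its own chunks. Names and shapes here are illustrative, not the PR's actual tensor code:

```python
def dispatch_forward(x, chunk_order, rank, chunks_per_rank, chunk_size):
    """Select only this rank's chunks: O(shard_seqlen) allocation.

    Toy sketch of the _DispatchFunc forward idea -- no full O(total_seqlen)
    permuted copy of x is ever built.
    """
    mine = chunk_order[rank * chunks_per_rank:(rank + 1) * chunks_per_rank]
    out = []
    for c in mine:
        out.extend(x[c * chunk_size:(c + 1) * chunk_size])
    return out
```

The backward pass would then gather each rank's gradient shard and invert the chunk permutation, mirroring this selection.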
Ensure partial_dsink is contiguous before communication in the backward
pass to avoid potential issues with non-contiguous tensor layouts.

Made-with: Cursor
- test_dispatch_solver: fix wrong assertTrue -> assertFalse for MinHeap
- test_gt_dispatcher: use Python AttnRanges for sub_mask comparisons to
  avoid C++/Python cross-type equality failure
- test_pipeline: add native_grpcoll invalidation rules for uneven_shard
  and small hidden_size_kv configs; pass num_heads/head_dim in test_config

Made-with: Cursor
@Strivin0311 (Contributor) commented:

@copilot review like proof-reading

Copilot AI (Contributor) commented Apr 3, 2026

@copilot review like proof-reading

Done — addressed in commit cfd7e00. Here's a summary of what was corrected across 13 files:

Typos in source code:

  • fa4.py: eanble → enable (×2), leanable → learnable

Grammar / phrasing ("by now" → "currently"):

  • sdpa_online.py (×2), env/general.py, dist_attn_runtime_mgr.py, comm/…/utils.py, tests/test_pipeline.py

Other grammar fixes:

  • overlap_solver.py: serves as as a → serves as a
  • dist_attn.py: can not be both True → cannot be both True
  • README.md: we ropose → we propose
  • docs/…/blackwell_ffa_fa4.md: To early support … NVIDIA team and us introduce → To provide early support for … NVIDIA team and we introduce
  • docs/…/env_variables.md: MagiAttention are … will to be released → MagiAttention is … will be released

The corresponding .po locale files were also updated to keep msgid strings in sync with the corrected source text.

Copilot AI requested a review from Strivin0311 April 3, 2026 03:57
@Strivin0311 (Contributor) left a comment


TOO much to review, forced to LGTM

…ile compatibility

einops.repeat hashes its axes_lengths kwargs internally, which fails
under torch.compile(dynamic=True) because SymInt is not hashable.

Made-with: Cursor
einops internally hashes tensor shapes for recipe caching, which is
incompatible with SymInt under torch.compile(dynamic=True). Replace
rearrange, reduce, and repeat with equivalent permute/sum/unsqueeze.

Made-with: Cursor
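The substitution described above can be pictured with NumPy for brevity (the PR applies the analogous torch ops): einops.repeat hashes its axes_lengths kwargs for recipe caching, while an explicit unsqueeze-and-broadcast uses no hashing at all, so it stays compatible with symbolic shapes:

```python
import numpy as np

# einops.repeat(x, "b n -> b n r", r=4) hashes its kwargs internally,
# which breaks when shapes are SymInt under torch.compile(dynamic=True).
# The same result comes from hash-free unsqueeze + broadcast primitives.
x = np.arange(6).reshape(2, 3)
y = np.broadcast_to(x[:, :, None], (2, 3, 4))  # each element repeated 4x
```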
@Strivin0311 (Contributor) left a comment


LGTM

@Strivin0311 Strivin0311 merged commit 36edcce into main Apr 8, 2026
8 of 9 checks passed
@Strivin0311 Strivin0311 deleted the big-pr branch April 8, 2026 07:28
