feat: multi-arch builds, env refactoring, new features, and expanded tests #307
Merged
Strivin0311 merged 99 commits into main (Apr 8, 2026)
Conversation
Support Ampere with cutlass-based FFA_FA4 (#287) See merge request: !1
Refactor build system to accept comma-separated compute capabilities via `MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY` (e.g. "90,100"). Add helper functions `parse_compute_capabilities`, `get_gencode_flags`, and `resolve_build_capabilities` in `setup.py`. Update `CMakeLists.txt` to accept `MAGI_CUDA_ARCHITECTURES` and strip PyTorch-injected gencode flags that may reference unsupported architectures. Made-with: Cursor
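The capability-parsing helpers could look roughly like this minimal sketch (function names follow the commit message, but the exact signatures in `setup.py` are assumptions):

```python
import os

def parse_compute_capabilities(raw: str) -> list[str]:
    """Split a comma-separated capability string like "90,100" into tokens."""
    return [cap.strip() for cap in raw.split(",") if cap.strip()]

def get_gencode_flags(capabilities: list[str]) -> list[str]:
    """Map each compute capability to the matching nvcc -gencode flag."""
    return [
        f"-gencode=arch=compute_{cap},code=sm_{cap}"
        for cap in capabilities
    ]

# Read the env var at build time, falling back to a default set.
caps = parse_compute_capabilities(
    os.environ.get("MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY", "90,100")
)
flags = get_gencode_flags(caps)
```

With `MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY=90,100`, this yields one `-gencode` flag per architecture, so a single wheel can target both Hopper (sm_90) and Blackwell (sm_100).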
Allow every query in a sample to always attend to the first global_window_size key tokens in addition to the sliding window, useful for architectures that require prefix tokens (e.g. sink tokens) to be globally visible. Update docs with the new parameter. Made-with: Cursor
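The resulting visibility rule can be sketched as a per-position predicate (a hypothetical helper for illustration, not the library's API):

```python
def is_visible(q_idx: int, k_idx: int,
               window_size: int, global_window_size: int) -> bool:
    """Causal sliding-window mask with a globally visible prefix.

    A query attends to a key if the key lies inside the causal sliding
    window, or if the key is one of the first `global_window_size`
    ("sink") tokens of the sample.
    """
    if k_idx > q_idx:                   # causal: never attend to the future
        return False
    if k_idx < global_window_size:      # prefix/sink tokens: always visible
        return True
    return q_idx - k_idx < window_size  # otherwise: sliding window only
```

For example, with `window_size=4` and `global_window_size=2`, query 10 sees keys 0-1 (sink) and keys 7-10 (window), but not keys 2-6.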
Introduce a P2P-based `roll` operation that cyclically shifts dispatched local tensors along the sequence dimension without materialising the full global tensor (O(N/P) memory instead of O(N)). Primarily designed for Multi-Token Prediction (MTP) where labels are shifted relative to inputs. - New `functional/roll.py` with `roll_p2p` implementation and autograd support - Expose `roll` in public API (`magi_attention.api`) - Clean up import paths: import `roll_func` directly from `functional.roll` instead of re-exporting through `functional.dispatch` - Add `roll` section to API reference and quickstart docs - Allow optional `num_heads_q/kv`, `head_dim` override in `make_flex_key_for_new_mask_after_dispatch` - Add comprehensive tests (`tests/test_functional/test_roll.py`) Made-with: Cursor
Consolidate the `ceil_div` helper into the shared utils module instead of defining it locally in `api/functools.py`, so that meta/solver code can reuse it without circular imports. Made-with: Cursor
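For reference, the shared helper is presumably the standard ceiling-division one-liner:

```python
def ceil_div(a: int, b: int) -> int:
    """Smallest integer >= a / b, for non-negative a and positive b."""
    return (a + b - 1) // b

# e.g. splitting a 10-token sequence into chunks of size 4 yields
# 3 chunks, the last holding only 2 tokens.
num_chunks = ceil_div(10, 4)
```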
Replace the previous "virtual metadata padding" approach with a simpler design where `total_seqlen` is used as-is (no padding at all): - Use `ceil_div` for num_chunks so the last chunk can be smaller - Remove `actual_total_seqlen_q/k` parameters from all interfaces - MinHeapDispatchAlg now reports `is_equal_num_workloads=False` and uses `ceil_div` for the per-bucket job limit - Simplify dispatch/undispatch: no zero-size virtual chunks, so `torch.split` works directly with `chunk_actual_sizes` - Remove virtual padding logic from `magi_attn_flex_key` and `make_flex_key_for_new_mask_after_dispatch` Made-with: Cursor
Extract `_compute_segments` to handle source-segment calculation for both uniform and variable (last-chunk-smaller) layouts. Refactor `_roll_p2p_impl` to iterate segments generically, replacing the previous special-case branches for r==0 and r>0. Add comprehensive uneven-shard tests: aligned/non-aligned shifts, cross-last-chunk wrapping, negative/large shifts, edge cases (last_chunk_size=1), larger sequences, and backward correctness. Made-with: Cursor
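The segment bookkeeping can be sketched in pure Python under simplifying assumptions (one chunk per rank; names are hypothetical stand-ins for `_compute_segments` internals):

```python
def compute_segments(total_seqlen: int, world_size: int) -> list[tuple[int, int]]:
    """Per-rank (start, size) segments where the last chunk may be smaller."""
    chunk = -(-total_seqlen // world_size)  # ceil_div
    segments = []
    for rank in range(world_size):
        start = rank * chunk
        # Clamp the final segment to the true sequence length.
        segments.append((start, max(min(chunk, total_seqlen - start), 0)))
    return segments

def roll_source_index(dst_idx: int, shifts: int, total_seqlen: int) -> int:
    """Global index that feeds destination position dst_idx after a cyclic
    roll by `shifts` (positive or negative), wrapping across chunks."""
    return (dst_idx - shifts) % total_seqlen
```

E.g. `total_seqlen=10, world_size=3` gives segments `(0,4), (4,4), (8,2)`; a roll by 2 pulls destination position 0 from global position 8, i.e. across the smaller last chunk.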
Remove virtual metadata padding logic from test_pipeline and test_pipeline_sdpa: no longer need `compute_pad_size`/`apply_padding` imports or `actual_total_seqlen_q/k` variables, since the uneven_shard path now uses original total_seqlen directly. Made-with: Cursor
…eqlens Cache the hash of DistAttnRuntimeKey via __hash__ override to avoid repeated hashing of all fields on every dict lookup. Also add lru_cache to infer_attn_mask_from_cu_seqlens to skip redundant mask inference for repeated cu_seqlens patterns. Made-with: Cursor
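A toy sketch of both optimisations (the class and its fields are stand-ins, not the real `DistAttnRuntimeKey`):

```python
from functools import lru_cache

class RuntimeKey:
    """Stand-in for a many-field dict key that is looked up frequently."""

    def __init__(self, cu_seqlens: tuple, causal: bool):
        self.cu_seqlens = cu_seqlens
        self.causal = causal
        # Hash every field exactly once at construction time;
        # every subsequent dict lookup reuses the cached value.
        self._hash = hash((cu_seqlens, causal))

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other) -> bool:
        return (self.cu_seqlens, self.causal) == (other.cu_seqlens, other.causal)

@lru_cache(maxsize=128)
def infer_attn_mask_from_cu_seqlens(cu_seqlens: tuple) -> tuple:
    """Memoised: repeated cu_seqlens patterns skip the (re)inference.
    The body here is a placeholder returning per-sample lengths."""
    return tuple(b - a for a, b in zip(cu_seqlens, cu_seqlens[1:]))
```

Note `lru_cache` requires hashable arguments, which is why `cu_seqlens` is a tuple rather than a list or tensor in this sketch.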
Add precise type hints for return types, parameters, and internal data structures. Import DistAttnRuntimeMgr for proper typing and remove the resolved TODO comment. Made-with: Cursor
Replace the concat-all-then-scatter approach with custom autograd Functions (_DispatchFunc / _UndispatchFunc). Forward dispatch now selects local chunks directly (O(shard_seqlen) alloc) instead of building a full permuted tensor (O(total_seqlen)). Backward uses all_gather_v + unpermute, mirroring the inverse path. Made-with: Cursor
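The forward-path idea can be sketched in plain Python (a hypothetical helper; the real implementation operates on tensors, autograd Functions, and process groups):

```python
def dispatch_local(tensor: list, chunk_assignments: dict[int, list[int]],
                   rank: int, chunk_size: int) -> list:
    """Select only this rank's chunks, allocating O(shard_seqlen),
    instead of first building the fully permuted O(total_seqlen) tensor."""
    local = []
    for chunk_idx in chunk_assignments[rank]:
        start = chunk_idx * chunk_size
        local.extend(tensor[start:start + chunk_size])
    return local
```

E.g. with 4 chunks of size 2 and rank 0 assigned chunks `[0, 3]`, rank 0 only ever touches 4 of the 8 elements; the backward pass then needs an `all_gather_v` plus unpermute to reconstruct gradients, mirroring this selection in reverse.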
Ensure partial_dsink is contiguous before communication in the backward pass to avoid potential issues with non-contiguous tensor layouts. Made-with: Cursor
- test_dispatch_solver: fix wrong assertTrue -> assertFalse for MinHeap - test_gt_dispatcher: use Python AttnRanges for sub_mask comparisons to avoid C++/Python cross-type equality failure - test_pipeline: add native_grpcoll invalidation rules for uneven_shard and small hidden_size_kv configs; pass num_heads/head_dim in test_config Made-with: Cursor
Contributor
@copilot review like proof-reading
Agent-Logs-Url: https://github.com/SandAI-org/MagiAttention/sessions/32611ac4-4154-4596-b276-d3f6d07fdf05 Co-authored-by: Strivin0311 <61719042+Strivin0311@users.noreply.github.com>
Contributor
Done — addressed in commit. Fixed:
- Typos in source code
- Grammar / phrasing ("by now" → "currently")
- Other grammar fixes
Strivin0311
approved these changes
Apr 3, 2026
Contributor
Strivin0311
left a comment
TOO much to review, forced to LGTM
…ile compatibility einops.repeat hashes its axes_lengths kwargs internally, which fails under torch.compile(dynamic=True) because SymInt is not hashable. Made-with: Cursor
einops internally hashes tensor shapes for recipe caching, which is incompatible with SymInt under torch.compile(dynamic=True). Replace rearrange, reduce, and repeat with equivalent permute/sum/unsqueeze. Made-with: Cursor
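One such replacement, sketched for a common KV-head repeat pattern (the pattern is an assumption for illustration; the actual call sites in the codebase may differ):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Equivalent of einops.repeat(x, "b h s d -> b (h r) s d", r=n_rep),
    written with shape-polymorphic torch ops only, so that
    torch.compile(dynamic=True) never needs to hash a SymInt."""
    b, h, s, d = x.shape
    # unsqueeze a repeat axis next to h, broadcast it, then fold it into h
    return x.unsqueeze(2).expand(b, h, n_rep, s, d).reshape(b, h * n_rep, s, d)
```

`expand` creates a zero-copy broadcast view; the copy (if any) only happens in the final `reshape`, matching what `einops.repeat` does without the recipe-cache hashing.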
Codecov Report
Summary
Comprehensive update: multi-arch CUDA builds, env var centralisation, new features, API refactoring, and expanded tests.
Build System & Installation
- `setup.py`: comma-separated compute capabilities (`MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY=90,100`)
- `scripts/install_on_scm.sh` with `ARCH` env var for cross-platform wheel builds (aarch64/x86_64)
- `scm_setup.py`, `scripts/install_skip_all.sh`, `install_flash_attn_cute.sh`

Env & API Refactoring
- `MAGI_ATTENTION_*` env var accessors into `magi_attention/env/` package
- `MAGI_ATTENTION_LOG_LEVEL`, `common/protocols.py`

New Features
- `FFA_FA4`

Refactoring
- `csrc/utils/` into `csrc/extensions/`
- `test_pipeline_sdpa.py` into `test_pipeline.py`

Tests
- `test_dispatch.py`, `test_roll.py`, `test_protocol_conformance.py`
- `dist_common.py` testing utilities

Test Plan
- `pytest tests/` — all unit tests
- `pytest tests/test_pipeline.py` — distributed tests
- `pytest tests/test_functional/` — dispatch and roll tests
- `install_on_scm.sh` runs correctly on x86_64/aarch64 clusters