Releases: SandAI-org/MagiAttention
MagiAttention V1.1.0
Major Features
- Early Support for Blackwell
  - Added preliminary Blackwell support via a new attention kernel backend, `FFA_FA4`, using a fork of Flash-Attention 4.
  - Enable with `export MAGI_ATTENTION_FA4_BACKEND=1` (see our docs).
  - Installation requires the `flash_attn_cute` module; see the install guide.
  - Technical details and implementation notes are in our blog.
  - Blackwell benchmarks are available at both the kernel level and the distributed level.
  - Related PRs: #190, #206, #209, #225.
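For scripts that cannot export shell variables, the backend flag above can also be set from Python. This is a minimal sketch, with the assumption (not a documented guarantee) that the flag must be set before `magi_attention` is imported, since env-driven backend selection typically happens at import time:

```python
import os

# Opt in to the experimental FA4 backend for Blackwell GPUs.
# Assumption: this must run before importing magi_attention.
os.environ["MAGI_ATTENTION_FA4_BACKEND"] = "1"
```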
- Full Native Group Collective (Intranode/Internode)
  - Implemented native kernels for the group collective primitives for intranode and internode communication using DeepEP, improving performance over the prior `all-to-all-v` approach, especially via pre-/post-processing elimination and internode volume de-duplication.
  - Enable with `export MAGI_ATTENTION_NATIVE_GRPCOLL=1` (see our docs).
  - Installation requires enabling IBGDA; see the install guide.
  - Implementation notes: blog.
  - Distributed benchmark results: blog.
  - Related PRs: #182, #223, #228, #229, #230, #233, #241, #249, #253.
- (Distributed) Muon QK-Clip for Max Logits
  - Added support for the (distributed) Muon QK-clip technique from Kimi K2 by returning `max_logits` from the FFA forward kernel and performing its distributed reduction in `calc_attn`.
  - Access `meta.max_logits` by passing `return_max_logits=True` to `flex_flash_attn_func` and `calc_attn`.
  - Implementation details: blog.
  - Related PRs: #237, #239, #245, #247.
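The `max_logits` value feeds the QK-clip rule. As an illustrative math sketch (not MagiAttention's API), QK-clip rescales Q and K whenever a head's max attention logit exceeds a threshold `tau`, so that the rescaled max logit equals `tau`; the function name and default threshold here are hypothetical:

```python
import math

def qk_clip_scale(max_logit: float, tau: float = 100.0) -> float:
    """Per-side scale factor applied to both Q and K (1.0 means no clipping).

    Since a logit is a Q.K dot product, scaling each side by
    sqrt(tau / max_logit) shrinks the max logit down to exactly tau.
    """
    if max_logit <= tau:
        return 1.0
    return math.sqrt(tau / max_logit)

# A head whose max logit is 400 gets each of Q and K scaled by 0.5,
# so the logit shrinks by 0.25, down to 100.
scale = qk_clip_scale(400.0)
```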
Developing Features
- Optimize Sparse Attention in FFA
  - Added `SwapAB` to the FFA forward kernel to reduce kBlockM and cut wgmma waste under sparse attention.
  - Added `PackGQA` to the FFA forward kernel to gather Q heads sharing a KV head in GQA settings under sparse attention.
  - Added a dedicated `tile_scheduler` to the FFA forward kernel to lower meta overhead by passing an optional `max_seqlen_q` under sparse attention.
  - Implemented `SparseLoad` to use `cp.async` instead of TMA, making sparse global memory access dense in shared memory.
  - Added `SwapBwdQKLoop` in the backward kernel to enable a q-outer-loop + kv-inner-loop, preparing for future backward optimizations under sparse attention.
  - Related PRs: #158, #185, #204, #207, #214, #224.
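The head grouping that PackGQA-style packing exploits can be sketched in plain Python. This is illustrative only (not the kernel's actual layout), assuming the standard GQA convention that contiguous blocks of query heads share one KV head:

```python
def kv_head_of(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """KV head serving a given query head under standard GQA grouping."""
    assert num_q_heads % num_kv_heads == 0
    return q_head // (num_q_heads // num_kv_heads)

def pack_gqa_groups(num_q_heads: int, num_kv_heads: int) -> dict:
    """Gather the query heads that share each KV head, so one K/V tile
    can be reused across the whole group (the idea behind PackGQA)."""
    groups = {kv: [] for kv in range(num_kv_heads)}
    for h in range(num_q_heads):
        groups[kv_head_of(h, num_q_heads, num_kv_heads)].append(h)
    return groups

g = pack_gqa_groups(num_q_heads=8, num_kv_heads=2)
# g == {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```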
- Optimize Dynamic Attention Solver
  - Improved dynamic solver algorithms and benchmarked them across full/causal and document mask patterns.
  - Reduced solver overhead via a C++-based data-structure backend and OpenMP parallelism.
  - Added `flatten_head_group` (enable with `export MAGI_ATTENTION_FLATTEN_HEAD_GROUPS=1`) for additional speedups.
  - Added `cpp_backend_data_structure` (enable with `export MAGI_ATTENTION_CPP_BACKEND=1`) to lower solver overhead.
  - Related PRs: #183, #210, #220.
- Optimize DistAttn
  - Added `AutoRangeMerge` to reduce fragmented `AttnSlice`s after partitioning (enable with `export MAGI_ATTENTION_AUTO_RANGE_MERGE=1`; requires JIT building).
  - Added `CatGQA` to improve FFA backward kernel performance under GQA by concatenating Q heads sharing the same KV head (enable with `export MAGI_ATTENTION_CATGQA=1`; requires JIT building).
  - Added `HideTailReduce` to trade saving the last remote `kv` activation for reordering overlap stages during backward, hiding the final remote `group_reduce` behind the host FFA stage (enable with `export MAGI_ATTENTION_BWD_HIDE_TAIL_REDUCE=1`).
  - Related PRs: #244, #256.
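The core idea behind an AutoRangeMerge-style pass is classic interval merging: adjacent or overlapping ranges left by partitioning are fused into maximal contiguous ranges. A hedged sketch (the real pass operates on `AttnSlice` metadata, not bare tuples):

```python
def merge_ranges(ranges):
    """Fuse touching or overlapping (start, end) ranges into maximal ones."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:  # touching or overlapping
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two fragments left at a partition boundary collapse into one range.
m = merge_ranges([(0, 128), (128, 256), (512, 640)])
# m == [(0, 256), (512, 640)]
```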
Architecture Refactoring
- API Update
  - Changed the return values of `flex_flash_attn_func` and `calc_attn` from `(out, lse)` to `(out, meta)`. Access `lse` at `meta.lse` and, if requested via `return_max_logits=True`, `max_logits` at `meta.max_logits` (see PRs: #237, #247).
  - Added required arguments `num_heads_q`, `num_heads_kv`, and `head_dim` to `magi_attn_flex_key` and `magi_attn_varlen_key` (see PR: #236).
  - Deprecated `magi_attn_varlen_dispatch` and `magi_attn_flex_dispatch` (see PR: #236).
  - Updated `dist_attn_runtime_dict` to be instantiated per `cp_group`; `get_most_recent_key` now requires the `cp_group` argument (see PRs: #226, #232).
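A hypothetical sketch of the return-value migration. The real `meta` object is defined by MagiAttention; this `NamedTuple` only mimics its documented fields `lse` and `max_logits`, and the call sites are shown as comments since the library itself is not imported here:

```python
from typing import Any, NamedTuple, Optional

class AttnMeta(NamedTuple):
    """Stand-in for the meta object with the documented fields."""
    lse: Any
    max_logits: Optional[Any] = None

# Before v1.1.0:
#   out, lse = flex_flash_attn_func(q, k, v, ...)
# From v1.1.0:
#   out, meta = flex_flash_attn_func(q, k, v, ..., return_max_logits=True)
#   lse, max_logits = meta.lse, meta.max_logits
meta = AttnMeta(lse="lse_tensor", max_logits="max_logits_tensor")
```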
- MagiAttention Extensions
  - Packaged `magi_attn_extensions` as a single pip-installable module (see PR: #176).
Bug Fixes
- Flash Attention with Attention Sink
  - Detached the `sink` tensor from the computation graph in `fa_interface_with_sink`, fixing backprop issues (see PR: #155).
- Native Group Collective
  - Fixed a GPU-CPU notification bug in the native group collective by disabling pinned-counter reads/writes when GPU-CPU sync is unnecessary (see PR: #200).
Testing Enhancements
- UniTest Update
  - Integrated `coverage` into the unit-test CI (see PR: #178).
  - Added an online-softmax `ref_attn_func` for the "torch" backend to simulate Flash-Attention 2 and reduce memory usage (see PR: #174).
  - Migrated unit-test CI from `pull_request` to `pull_request_target` to allow external contributors to run CI with reviewer permission (see PRs: #212, #215, #216, #217).
Others
- Refine CP Benchmark
  - Enhanced the CP benchmark tooling and pipeline (figures, profiling, flag combinations, etc.); see the CP benchmark README for details (see PRs: #177, #197, #205, #221, #222, #227).
  - Extended the CP baselines with Megatron HybridCP (see PR: #187).
  - Published a dedicated blog post for the CP benchmark (see PR: #247).
- Refactor Docs with New Blogs
  - Reworked the User Guide and reorganized the Blogs to reflect new content and deep dives (see PR: #247).
- Update Copyright
  - Updated copyright header comments from `2025` to `2025~2026` (see PR: #203).
MagiAttention V1.0.5
NOTE
We are very sorry, but the originally tagged MagiAttention-v1.0.5 (at commit ID 12a0936d9b76436d49b0f97ef542f800a1ea7eae) has a severe bug: it reloads the .so on every call to Flex-Flash-Attention without caching, causing much higher CPU overhead. We have therefore retagged MagiAttention-v1.0.5 after this bug was fixed in PR #171, at commit ID 4d12163530deb99c8bf2dae4318ecdf2348df866.
If you cloned magi_attention from the main branch after commit ID 36418f19af740dd83557fcfbb48a7e03cf71c2dd, or checked out v1.0.5 between 2025.11.17 and 2025.11.19, please pull the updated tag or switch to the latest commit.
Sorry again for the inconvenience.
Major Features
- Support Attention Sink
  - Supported a (distributed) learnable attention sink mechanism for Flex-Flash-Attention and MagiAttention, as well as for Flash-Attention as one of the extensions (see our blog for more details).
  - Enabled by passing an optional `sink` tensor to the interfaces (check our basic usage or see our docs for more details).
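As an illustrative math sketch (not the MagiAttention kernel), a learnable attention sink adds one extra logit to each row's softmax denominator, letting a head assign probability mass to no token at all; the function below uses plain floats for clarity:

```python
import math

def softmax_with_sink(logits, sink):
    """Row softmax whose denominator includes an extra sink logit.

    The returned token weights sum to less than 1; the missing mass
    exp(sink) / denom is what the head 'parks' on the sink.
    """
    denom = math.exp(sink) + sum(math.exp(x) for x in logits)
    return [math.exp(x) / denom for x in logits]

# With sink=0 and two zero logits, each token gets 1/3 of the mass
# and the remaining 1/3 goes to the sink.
w = softmax_with_sink([0.0, 0.0], sink=0.0)
```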
- (Experimental) Support Native Group Collective Kernels for Intranode Communication
  - Supported native kernel implementations of the group collective for intranode communication based on DeepEP, replacing the original `all2all`-based implementations to eliminate pre-/post-processing, dtype casts, and tensor concatenation.
  - Served as an experimental feature for now; enable it by toggling the environment variable `export MAGI_ATTENTION_NATIVE_GRPCOLL=1` (see our docs for more details).
Architecture Refactoring
- API Update
  - Added an optional `sink` tensor to the argument list of both `flex_flash_attn_func` and `calc_attn` to support the learnable attention sink.
- MagiAttention Extensions
  - Initialized the Extensions section with FlashAttention with Attention Sink, aiming to provide users with more examples and recipes for using `MagiAttention` in various training scenarios.
- Environment Update
  - Updated the base docker image from `ngc:2505` to the latest `ngc:2510`, ensuring compatibility from `ngc:2505` to `ngc:2510` as well as from `cuda-12.8` to `cuda-13.0`.
Bug Fixes
- Flex-Flash-Attention
  - Fixed an occasional illegal memory access caused by the uninitialized variable `cu_batches`, according to this PR.
Testing Enhancements
- UniTest Update
  - Updated the unit tests of FFA with sink tensors and fine-grained control of error ratios.
  - Updated the unit tests of the MagiAttention pipeline with sink tensors and fine-grained control of error ratios.
  - Supported a custom `timeout` property.
- Profile Mode
  - Added a profile mode for MagiAttention to profile all kernels, toggled by the environment variable `export MAGI_ATTENTION_PROFILE_MODE=1`.
  - For now, only the FFA kernels are well supported (see the README for more details).
Others
- Sparse Attention Benchmark
  - Supported separate, smaller block sizes for `Q` and `K` for uniform block-sparse benchmarking (see the README for more details).
- Examples
  - Updated the Examples sections for the new interface and environment.
- Code Style
  - Downgraded the clang-format version from `21` to `20` for stable C++ code formatting.
MagiAttention V1.0.4
Major Features
- JIT-Built and Compilable FFA Support
  - Supported pre-building common cases of FFA (AOT mode) and building specific cases of FFA at runtime (JIT mode).
  - Supported compilable FFA compatible with `torch.compile`.
- Hybrid-Attention with Multiple Masks Support
  - Added new APIs to support a single dispatch (with one attention mask as the reference for load balancing) across multiple different attention masks, enabling hybrid-attention structures (see more details in our docs).
Architecture Refactoring
- API Update
  - Removed `max_seqlen_q` and `max_seqlen_k` from the FFA arguments (no longer needed).
  - Added optional arguments to `calc_attn` to support non-default `softmax_scale` and `softcap`.
  - Added APIs to derive a new key from the one used in dispatch, including `make_flex_key_for_new_mask_after_dispatch`.
  - Added helper functions to infer the `AttnSlice` representation of non-trivial masks, including `infer_attn_mask_from_cu_seqlens` and `infer_attn_mask_from_sliding_window`.
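What inferring a varlen (document) mask from `cu_seqlens` amounts to can be shown with a toy helper: consecutive `cu_seqlens` boundaries define per-document ranges, each attending only within its own document. This is illustrative; the real helpers return MagiAttention's range types, not tuples:

```python
def ranges_from_cu_seqlens(cu_seqlens):
    """Turn cumulative sequence lengths into per-document (start, end) ranges.

    For a document mask, each document's q_range and k_range are the same
    slice, so one list of ranges describes both sides.
    """
    return [(cu_seqlens[i], cu_seqlens[i + 1])
            for i in range(len(cu_seqlens) - 1)]

# Three packed documents of lengths 3, 2, and 4.
r = ranges_from_cu_seqlens([0, 3, 5, 9])
# r == [(0, 3), (3, 5), (5, 9)]
```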
- Code Cleanup
  - Formatted the code in `csrc` with the help of `clang-format-21`.
  - Refactored the FFA code for JIT building.
  - Refactored the dist-attn autograd function to support QO comm and make it clean and symmetric.
  - Translated all Chinese comments into English.
- Environment Update
  - Updated the base docker image from `ngc:2502` to `ngc:2505`.
Performance Improvements
- Flex-Flash-Attention Optimization
  - Improved the performance of FFA in sparse scenarios, with the `ref_block_size` argument provided for tuning in different situations.
  - Reduced the workspace memory of FFA by unpadding the softmax-related variables (including `lse_log2`, `softmax_d`, etc.) from `max_seqlen`.
- Dist-Attention Optimization
  - Removed the old environment variable `MAGI_ATTENTION_FFA_FORWARD_INPLACE_CORRECT` so that the corresponding feature is always enabled, eliminating the post-processing overhead of attention forward out/lse correction.
Experimental Features
- QO Communication and Dynamic Attention Solver
  - Supported query/output communication for both dist-attn forward and backward, lifting the restriction of communicating only key/value pairs.
  - Implemented a dynamic attention solver that not only solves the dist-attn optimization problems (including computation load balancing and communication minimization) when QO comm is involved, but also solves them at runtime to handle dynamic attention masks (such as hybrid-attention structures and dynamic sparse attention scenarios).
  - Both QO communication and the dynamic attention solver are experimental and under early development for now, and are not compatible with many other features.
  - You can enable this feature by toggling the environment variable `export MAGI_ATTENTION_QO_COMM=1`, but please do NOT enable it unless you know exactly what you are doing.
Testing Enhancements
- UniTest Update
  - Updated the unit tests of FFA with a much larger test suite.
  - Added specific unit tests of FFA for sparse attention scenarios.
  - Updated the unit tests of reduce ops with more test cases and for new features.
Others
- Sparse Attention Benchmark
  - Provided specific benchmarks of off-the-shelf sparse attention kernel implementations to measure their performance and flexibility against FFA.
- Distributed Attention Benchmark
  - Enhanced the dist-attn metrics to support general situations involving QO comm.
  - Updated the dist-attn baselines with `transformer_engine-2.3.0` APIs and a self-implemented `Ring-AllGather USA`.
- LSE Shape
  - Changed the `lse` tensor shape from `[num_heads_q, seqlen_q]` to `[seqlen_q, num_heads_q]` to follow the "seqlen-first" style and better support QO comm.
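Downstream code holding old-layout `lse` tensors can adapt with a transpose. A torch-free sketch, using nested lists to stand in for the tensor so the shape change is explicit:

```python
def to_seqlen_first(lse_heads_first):
    """Transpose lse from [num_heads_q, seqlen_q] to [seqlen_q, num_heads_q].

    With a real tensor this is just lse.transpose(0, 1) (or lse.T); lists
    are used here only to keep the example dependency-free.
    """
    num_heads_q = len(lse_heads_first)
    seqlen_q = len(lse_heads_first[0])
    return [[lse_heads_first[h][s] for h in range(num_heads_q)]
            for s in range(seqlen_q)]

old = [[1.0, 2.0, 3.0],   # head 0 over a length-3 sequence
       [4.0, 5.0, 6.0]]   # head 1
new = to_seqlen_first(old)  # shape [3, 2], "seqlen-first"
```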
MagiAttention V1.0.3
Major Features
- Documentation Support
  - Initialized the documentation with installation, quick start, API reference, and our specific environment variables for debugging and for controlling the trade-off among performance, memory usage, and accuracy.
- Enhanced Mask Support
  - Added support for all four mask types with arbitrarily overlapping q_ranges and k_ranges, enabling more flexible attention computation.
- Deterministic Mode Support
  - Introduced a deterministic mode that can be enabled via environment variables (see here), ensuring identical results across runs for debugging.
Architecture Refactoring
- API Update
  - Updated the MagiAttention API to support a configurable `chunk_size`, giving users more control over the granularity of load-balanced dispatch and padding.
  - Updated `cp_group` to `cp_group_or_mesh`, allowing users to pass either a process group or a 1D/2D device mesh (a 2D device mesh is mostly used for hierarchical comm).
  - Removed the unused `head_dim` argument.
- Code Cleanup
  - Simplified `setup.py` to speed up the compilation of FFA.
  - Refactored `magi_attention/csrc/flexible_flash_attention/` to remove redundant code.
  - Refactored `magi_attention/meta/solver/` for more modularity and less redundancy.
  - Removed some deprecated flags from `DistAttnConfig`, including `high_bandwidth_domain_size` and `deterministic`.
Performance Improvements
- Flex-Flash-Attention Optimization
  - Improved performance to fully align with Flash-Attention-3 in the scenarios both support.
  - Supported more fine-grained arguments to control kernel behavior, including forward/backward SM margin, configurable dtype, in-place reduction, etc.
  - Supported SM margin for both forward and backward, which can also be configured via environment variables (see here) for magi_attention.
  - Implemented a fast zero-fill kernel to avoid manually zero-filling the uncovered ranges of the forward output.
- Dist-Attention Optimization
  - Removed redundant range-fill ops and related logic for forward output correction, as well as redundant cuda-malloc ops.
  - Absorbed partial dq reduction into the FFA backward kernel, and pre-allocated a dkv zero buffer to avoid concatenating dk and dv.
  - Supported in-place reduction of the forward partial output and lse to avoid post-hoc result correction, enabled via environment variables (see here).
  - Supported high-precision (fp32) dkv reduction for better precision at the cost of doubled comm overhead, enabled via environment variables (see here).
- Hierarchical Comm Optimization
  - Supported hierarchical comm to decrease the inter-node comm overhead of both group-cast and group-reduce (especially useful for dense masks like the pure full mask), enabled via environment variables (see here) with a user-given 2D cp device mesh.
- Dispatch Algorithm Optimization
  - Supported `SortedSequentialAlg` and `SequentialAlg` to heuristically resolve the large comm overhead of `MinHeapAlg` under varlen mask settings, while maintaining computation load balance.
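The greedy idea behind a MinHeap-style dispatch algorithm can be sketched with the standard least-loaded-bin heuristic: repeatedly assign the next (largest) workload chunk to the currently least-loaded rank. Names and the scalar cost model are illustrative, not MagiAttention's actual solver:

```python
import heapq

def min_heap_dispatch(workloads, num_ranks):
    """Greedy load balancing: each chunk goes to the least-loaded rank.

    Sorting largest-first is the classic LPT refinement that tightens
    the balance compared to arbitrary order.
    """
    heap = [(0, r) for r in range(num_ranks)]   # (current load, rank)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_ranks)}
    for w in sorted(workloads, reverse=True):
        load, r = heapq.heappop(heap)
        assignment[r].append(w)
        heapq.heappush(heap, (load + w, r))
    return assignment

# Five attention workload chunks split evenly across two cp ranks.
a = min_heap_dispatch([5, 3, 3, 2, 1], num_ranks=2)
# a == {0: [5, 2], 1: [3, 3, 1]} -- both ranks carry a load of 7
```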
Testing Enhancements
- UniTest Updates
  - Updated the unit tests for pipeline testing, covering MHA, GQA, and MQA settings, different head dims, and various masks.
  - Updated the unit tests for FFA, covering various masks and other flags.
  - Updated the unit tests for the magi-attention API to be more specific, covering functools and various inputs.
  - Updated the unit tests for communication, covering hierarchical comm and range ops.
Bug Fixes
- Flex-Flash-Attention
Others
- Example Codes
  - Besides the integration of MagiAttention and Megatron-LM (see release note v1.0.2), we have provided example code for training a Llama-3 1B model, showcasing integration with both PyTorch native FSDP (see here) and HuggingFace Transformers (see here).
- Benchmarks
  - Provided a benchmark here to compare kernel-level performance among various attention kernels and module-level performance among different cp strategies.
MagiAttention V1.0.2
Major Features
- Add Example Code and Experiments Integrating Megatron with MagiAttention
  - Created a new repository, Megatron-LM-MagiAttention, forked from Megatron-LM v0.11.0.
  - Provided example code in this PR presenting the code modifications needed to integrate Megatron with MagiAttention.
  - Provided the training recipe to reproduce the experiment of pretraining Llama3-1b from scratch here, with our own loss-convergence results.
- Add Roadmap
  - Added a Roadmap section in the `README` to outline our future plans, including feature support and performance improvements.
Bug Fixes
- Flex Flash Attention
  - Fixed an index out-of-bounds bug.
- Range Ops
  - Fixed redundant Triton kernel recompilation of range ops.
MagiAttention V1.0.1
Major Features
- Overlapped Q Ranges Support for the Full Mask
  - Enhanced support for overlapping query ranges, enabling more flexible attention computation.
  - Added proper synchronization and atomic reduction control for overlapped ranges.
- Precision Improvements
  - Used higher precision in the forward pass to reduce errors from partial output correction.
  - Zero-initialized tensors to prevent NaN values in padding tokens from interfering with the gradients of other tokens.
Architecture Refactoring
- Code Cleanup
  - Removed the `is_causal_mask_supported` experimental feature flag, as well as its usage in unit-test code.
  - Removed unused properties from `DispatchMeta`.
  - Simplified the `DistAttnRuntimeKey` implementation.
  - Refactored the zero-filling range initialization logic.
- Core Algorithm Optimizations
  - Generalized the attention dispatch meta calculation for overlapped query ranges.
  - Simplified the dispatch and undispatch functional flows.
Testing Enhancements
- New Tests
  - Added a position-IDs test suite.
  - Added test cases for overlapped query ranges.
Bug Fixes
- Build System Fixes
  - Fixed CUDA tools path handling.
  - Improved build-process robustness.
- Flex Flash Attention
  - Fixed atomic reduction issues.
  - Added a thread fence to avoid race conditions.
MagiAttention V1.0.0
This is the first open-sourced version of MagiAttention, a distributed attention mechanism towards linear scalability for ultra-long, heterogeneous data training.
We provide four key features to realize linear scalability:
- Flexible Flash Attention Kernel. We introduce a generalized formulation for irregular attention mask patterns and implement a flexible flash attention kernel (FFA). It is natively designed for distributed scenarios and provides greater flexibility in handling diverse attention mask types, with performance comparable to Flash-Attention 3 on Hopper GPUs.
- Computation Load-Balance. With a fine-grained sharding strategy, we elaborate an efficient dispatch solver that ensures balanced attention computational loads across each CP rank in every training iteration.
- Zero-Redundant Communication. Instead of adopting the common Ring-style P2P communication pattern in CP, we propose two novel communication primitives, GroupCast and GroupReduce, built upon All-to-All-v as a prototypal implementation, enabling zero-redundant communication volume for both forward and backward passes.
- Adaptive Multi-Stage Overlap. Leveraging the above enhancements, we further implement a multi-stage compute-communication overlap strategy that effectively hides communication latency and adaptively optimizes overlap through manual or automatic tuning.
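The semantics of the two communication primitives above can be sketched without any actual networking. This pure-Python illustration (names and signatures are hypothetical, not MagiAttention's API) shows the contract each primitive fulfills: GroupCast replicates a rank's buffer to a destination group, and GroupReduce sums the group's partial buffers back at the owner:

```python
def group_cast(buffer, dst_ranks):
    """Replicate one rank's buffer to every rank in the destination group."""
    return {dst: list(buffer) for dst in dst_ranks}

def group_reduce(partials):
    """Elementwise-sum the partial buffers produced by a group of ranks."""
    return [sum(vals) for vals in zip(*partials)]

# Rank 0 casts its kv buffer to ranks 1 and 2; each computes a partial
# result of the same shape, and the owner reduces them back.
parts = list(group_cast([1.0, 2.0], dst_ranks=[1, 2]).values())
total = group_reduce(parts)
# total == [2.0, 4.0]
```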
The development of this project is ongoing. We trust it will be a valuable resource for you and warmly welcome your contributions.