Releases: SandAI-org/MagiAttention

MagiAttention V1.1.0

28 Feb 06:55
c5c90bf

Major Features

  1. Early Blackwell Support

    • Added preliminary Blackwell support via a new attention kernel backend, FFA_FA4, built on a fork of Flash-Attention 4.
    • Enable with export MAGI_ATTENTION_FA4_BACKEND=1 (see our docs).
    • Installation requires the flash_attn_cute module; see the install guide.
    • Technical details and implementation notes are in our blog.
    • Blackwell benchmarks are available at both the kernel level and the distributed level.
    • Related PRs: #190, #206, #209, #225.
  2. Full Native Group Collective (Intra/Internode)

    • Implemented native group collective primitives for intranode and internode communication using DeepEP, improving performance over the prior all-to-all-v approach, especially by eliminating pre-/post-processing and de-duplicating internode communication volume.
    • Enable with export MAGI_ATTENTION_NATIVE_GRPCOLL=1 (see our docs).
    • Installation requires enabling IBGDA; see the install guide.
    • Implementation notes: blog.
    • Distributed benchmark results: blog.
    • Related PRs: #182, #223, #228, #229, #230, #233, #241, #249, #253.
  3. (Distributed) Muon QK-Clip for Max Logits

    • Added support for the (distributed) Muon QK-clip technique from Kimi K2 by returning max_logits from the FFA forward kernel and performing its distributed reduction in calc_attn.
    • Access meta.max_logits by passing return_max_logits=True to flex_flash_attn_func and calc_attn.
    • Implementation details: blog.
    • Related PRs: #237, #239, #245, #247.
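For intuition, the QK-clip technique from Kimi K2 rescales the query/key projections whenever the maximum attention logit exceeds a threshold tau. A minimal numerical sketch of that rescaling rule (names and the default tau are illustrative assumptions, not the MagiAttention API):

```python
import math

def qk_clip_scale(max_logit: float, tau: float = 100.0) -> float:
    """Return the factor to apply to both the Q and K projections so the
    rescaled max logit does not exceed tau (identity when already below).

    Scaling Q and K each by sqrt(tau / max_logit) scales the QK^T logits
    by tau / max_logit, clipping the max logit down to exactly tau.
    """
    if max_logit <= tau:
        return 1.0
    return math.sqrt(tau / max_logit)
```

For example, a gathered max logit of 400 with tau=100 yields a per-projection scale of 0.5; in the distributed setting, max_logits must first be reduced across ranks (as done in calc_attn) before applying the clip.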

Developing Features

  1. Optimize Sparse Attention in FFA

    • Added SwapAB to the FFA forward kernel to shrink kBlockM and reduce wgmma waste under sparse attention.
    • Added PackGQA to the FFA forward kernel to gather Q heads sharing a KV head for GQA settings under sparse attention.
    • Added a specific tile_scheduler to the FFA forward kernel to lower metadata overhead by passing an optional max_seqlen_q under sparse attention.
    • Implemented SparseLoad to use cp.async instead of TMA, making sparse global memory access dense in shared memory under sparse attention.
    • Added SwapBwdQKLoop in the backward kernel to enable q-outer-loop + kv-inner-loop, preparing for future backward optimizations under sparse attention.
    • Related PRs: #158, #185, #204, #207, #214, #224.
  2. Optimize Dynamic Attention Solver

    • Improved dynamic solver algorithms and benchmarked across full/causal and document mask patterns.
    • Reduced solver overhead via a C++-based data structure backend and OMP parallelism.
    • Added flatten_head_group (enable with export MAGI_ATTENTION_FLATTEN_HEAD_GROUPS=1) for additional speedups.
    • Added cpp_backend_data_structure (enable with export MAGI_ATTENTION_CPP_BACKEND=1) to lower solver overhead.
    • Related PRs: #183, #210, #220.
  3. Optimize DistAttn

    • Added AutoRangeMerge to reduce fragmented AttnSlices after partitioning (enable with export MAGI_ATTENTION_AUTO_RANGE_MERGE=1; requires JIT building).
    • Added CatGQA to improve FFA backward kernel performance under GQA by concatenating Q heads that share the same KV head (enable with export MAGI_ATTENTION_CATGQA=1; requires JIT building).
    • Added HideTailReduce, which trades keeping the last remote kv activation for reordered overlap stages during backward, hiding the final remote group_reduce behind the host FFA stage (enable with export MAGI_ATTENTION_BWD_HIDE_TAIL_REDUCE=1).
    • Related PRs: #244, #256.
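The range-merging idea behind AutoRangeMerge can be pictured with plain intervals: fragmented, touching ranges left over from partitioning are coalesced into maximal ones. A toy sketch (not the actual implementation, which operates on AttnSlices):

```python
def merge_ranges(ranges):
    """Merge touching/overlapping [start, end) ranges into maximal ranges,
    mimicking what an AutoRangeMerge-style pass does to fragmented slices."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:  # touches or overlaps the previous range
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]
```

For example, the fragments (0, 4), (4, 8), (10, 12) collapse to (0, 8) and (10, 12), reducing the number of slices the kernel must iterate over.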

Architecture Refactoring

  1. API Update

    • Changed return values of flex_flash_attn_func and calc_attn from (out, lse) to (out, meta). Access lse at meta.lse and, if requested, max_logits at meta.max_logits via return_max_logits=True (see PRs: #237, #247).
    • Added required arguments num_heads_q, num_heads_kv, and head_dim to magi_attn_flex_key and magi_attn_varlen_key (see PR: #236).
    • Deprecated magi_attn_varlen_dispatch and magi_attn_flex_dispatch (see PR: #236).
    • Updated dist_attn_runtime_dict to be instantiated per cp_group; get_most_recent_key now requires the cp_group argument (see PRs: #226, #232).
  2. MagiAttention Extensions

Bug Fixes

  1. Flash Attention with Attention sink

    • Detached the sink tensor from the computation graph in fa_interface_with_sink, fixing backprop issues (see PR: #155).
  2. Native Group Collective

    • Fixed the GPU-CPU notification bug in the native group collective by disabling pinned counter read/write when GPU-CPU sync is unnecessary (see PR: #200).

Testing Enhancements

  1. UniTest Update
    • Integrated coverage into unit-test CI (see PR: #178).
    • Added online-softmax ref_attn_func for the "torch" backend to simulate Flash Attention 2 and reduce memory usage (see PR: #174).
    • Migrated unit-test CI from pull_request to pull_request_target to allow external contributors to run CI with reviewer permissions (see PRs: #212, #215, #216, #217).
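The online-softmax reference above follows the streaming rescaling trick used by Flash Attention 2: keep a running max and normalizer so the full score row is never materialized. A pure-Python sketch over one score row (an illustration, not the actual ref_attn_func):

```python
import math

def online_softmax(scores, values):
    """Streaming softmax-weighted sum over (score, value) pairs, keeping a
    running max m and normalizer l so no full score row is materialized."""
    m = float("-inf")  # running max of scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running softmax-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        # rescale previous accumulators when the running max increases
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(s - m_new)
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l
```

The result matches a naive two-pass softmax up to floating-point error, while only ever holding O(1) state per row.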

Others

  1. Refine CP Benchmark

    • Enhanced CP benchmark tooling and pipeline (figures, profiling, flag combinations, etc.). See the CP benchmark README for details (see PRs: #177, #197, #205, #221, #222, #227).
    • Extended CP baseline with Megatron HybridCP (see PR: #187).
    • Published a dedicated blog post for the CP benchmark (see PR: #247).
  2. Refactor Docs with new Blogs

    • Reworked the User Guide and reorganized Blogs to reflect new content and deep dives (see PR: #247).
  3. Update Copyright

    • Updated copyright header comments from 2025 to 2025~2026 (see PR: #203).

MagiAttention V1.0.5

19 Nov 09:17
4d12163

NOTE

We are very sorry that the originally tagged MagiAttention-v1.0.5 (at commit ID 12a0936d9b76436d49b0f97ef542f800a1ea7eae) has a severe bug: it reloads the .so on every Flex-Flash-Attention call without caching, causing substantial extra CPU overhead. We have therefore retagged MagiAttention-v1.0.5 after this bug was fixed in PR #171, at commit ID 4d12163530deb99c8bf2dae4318ecdf2348df866.

If you cloned the magi_attention main branch after commit ID 36418f19af740dd83557fcfbb48a7e03cf71c2dd, or checked out v1.0.5 between 2025.11.17 and 2025.11.19, please pull the updated tag or switch to the latest commit.

Again, we apologize for the inconvenience.

Major Features

  1. Support Attention Sink

    • Supported the (distributed) learnable attention sink mechanism for Flex-Flash-Attention and MagiAttention, as well as for Flash-Attention as one of the extensions (see our blog for more details).
    • Enable it by passing an optional sink tensor to the interfaces (check our basic usage or see our docs for more details).
  2. (Experimental) Support Native Group Collective Kernels for Intranode Communication

    • Supported native kernel implementations of group collective for intranode communication based on DeepEP, replacing the original all2all-based implementations to eliminate pre-/post-processing, dtype casts, and tensor concatenation.
    • This is an experimental feature for now; enable it with export MAGI_ATTENTION_NATIVE_GRPCOLL=1 (see our docs for more details).

Architecture Refactoring

  1. API Update

    • Added an optional sink tensor to the argument list for both flex_flash_attn_func and calc_attn to support learnable attention sink.
  2. MagiAttention Extensions

  3. Environment Update

    • Updated the base docker image from ngc:2505 to the latest ngc:2510, ensuring compatibility across ngc:2505 through ngc:2510 as well as from cuda-12.8 to cuda-13.0.

Bug Fixes

  1. Flex-Flash-Attention
    • Fixed an occasional illegal memory access caused by the uninitialized variable cu_batches, according to this PR.

Testing Enhancements

  1. UniTest Update

    • Updated the unit tests of FFA with sink tensors and fine-grained control of error ratios.
    • Updated the unit tests of the MagiAttention pipeline with sink tensors and fine-grained control of error ratios.
    • Supported a custom timeout property.
  2. Profile Mode

    • Added a profile mode for MagiAttention to profile all kernels by toggling the environment variable export MAGI_ATTENTION_PROFILE_MODE=1.
    • For now, only the FFA kernels are well-supported (see the README for more details).

Others

  1. Sparse Attention Benchmark

    • Supported separate smaller block sizes for Q and K for uniform block-sparse benchmarking (see the README for more details).
  2. Examples

    • Updated the Examples sections for new interface and environment.
  3. Code Style

    • Downgraded the clang-format version from 21 to 20 for stable C++ code formatting.

MagiAttention V1.0.4

13 Oct 09:02
e08bea8

Major Features

  1. JIT-Built and Compilable FFA Support

    • Supported pre-building common FFA cases (AOT mode) and building specific FFA cases at runtime (JIT mode).
    • Supported compilable FFA to be compatible with torch.compile.
  2. Hybrid-Attention with Multiple Masks Support

    • Added new APIs that support a single dispatch (using one attention mask as the load-balance reference) across multiple different attention masks, enabling hybrid-attention structures (see more details in our docs).

Architecture Refactoring

  1. API Update

    • Removed max_seqlen_q and max_seqlen_k from the FFA arguments (no longer needed).
    • Added optional arguments to calc_attn to support non-default softmax_scale and softcap.
    • Added APIs to make a new key from the one used in dispatch, including make_flex_key_for_new_mask_after_dispatch.
    • Added helper functions for inferring the AttnSlice representation of non-trivial masks, including infer_attn_mask_from_cu_seqlens and infer_attn_mask_from_sliding_window.
  2. Code Cleanup

    • Formatted the code in csrc with the help of clang-format-21.
    • Refactored ffa code for JIT building.
    • Refactored dist-attn autograd function to support qo comm and make it clean and symmetric.
    • Changed all Chinese comments into English.
  3. Environment Update

    • Updated the basic docker image from ngc:2502 to ngc:2505.
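The mask-inference helpers added in this release (e.g. infer_attn_mask_from_cu_seqlens above) can be pictured with a simplification: for a varlen full mask, each document attends only within itself, so cumulative sequence lengths map directly to per-document ranges. A toy sketch, not the real API:

```python
def ranges_from_cu_seqlens(cu_seqlens):
    """Turn cumulative sequence lengths [0, s1, s1+s2, ...] into per-document
    (start, end) ranges; for a varlen full mask, q and k ranges coincide."""
    ranges = [(cu_seqlens[i], cu_seqlens[i + 1]) for i in range(len(cu_seqlens) - 1)]
    return ranges, ranges  # (q_ranges, k_ranges)
```

For instance, cu_seqlens = [0, 3, 7] describes two documents of lengths 3 and 4, yielding the range pairs (0, 3) and (3, 7) for both queries and keys.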

Performance Improvements

  1. Flex-Flash-Attention Optimization

    • Improved FFA performance in sparse scenarios, with the ref_block_size argument provided for tuning in different situations.
    • Reduced FFA workspace memory by unpadding the softmax-related variables (e.g. lse_log2, softmax_d) from max_seqlen.
  2. Dist-Attention Optimization

    • Removed the old environment variable MAGI_ATTENTION_FFA_FORWARD_INPLACE_CORRECT so that the corresponding feature is always enabled, eliminating the post-processing overhead of attention forward out/lse correction.

Experimental Features

  1. QO communication and Dynamic Attention Solver
    • Supported query/output communication for both dist-attn forward and backward, removing the restriction of communicating only key/value pairs.
    • Implemented a dynamic attention solver that not only solves the dist-attn optimization problems (including computation load-balance and communication minimization) when QO comm is involved, but also solves them at runtime to handle dynamic attention masks (such as hybrid-attention structures and dynamic sparse attention scenarios).
    • Both QO communication and the dynamic attention solver are experimental and under early development, and are not yet compatible with many other features.
    • You can enable this feature by toggling the environment variable: export MAGI_ATTENTION_QO_COMM=1, but please do NOT enable it unless you know exactly what you are doing.

Testing Enhancements

  1. UniTest Update
    • Updated the unit tests of FFA with a much larger test suite.
    • Added dedicated FFA unit tests for sparse attention scenarios.
    • Updated the unit tests of reduce-ops with more test cases and coverage for new features.

Others

  1. Sparse Attention Benchmark

    • Provided dedicated benchmarks for off-the-shelf sparse attention kernel implementations, to compare their performance and flexibility against FFA.
  2. Distributed Attention Benchmark

    • Enhanced the dist-attn metrics to support general situations involving QO comm.
    • Updated the dist-attn baselines with transformer_engine-2.3.0 APIs and a self-implemented Ring-AllGather USA.
  3. LSE Shape

    • Changed the lse tensor shape from [num_heads_q, seqlen_q] to [seqlen_q, num_heads_q] to follow the "seqlen-first" style and better support QO comm.
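The layout change above is a plain transpose at the boundary; old "heads-first" lse buffers map to the new "seqlen-first" layout as sketched below with nested lists (shape illustration only):

```python
def to_seqlen_first(lse_heads_first):
    """Transpose lse from [num_heads_q, seqlen_q] to [seqlen_q, num_heads_q],
    following the seqlen-first convention adopted in this release."""
    num_heads = len(lse_heads_first)
    seqlen = len(lse_heads_first[0])
    return [[lse_heads_first[h][t] for h in range(num_heads)]
            for t in range(seqlen)]
```

Putting seqlen first means per-token rows are contiguous, which is the natural layout when lse travels with the query/output tensors under QO comm.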

MagiAttention V1.0.3

30 Jul 06:17
1c29d75

Major Features

  1. Documentation Support

  2. Enhanced Mask Support

    • Added support for all four mask types with arbitrary overlapped q_ranges and k_ranges, enabling more flexible attention computation.
  3. Deterministic Mode Support

    • Introduced the deterministic mode that can be enabled via environment variables (see here), ensuring identical results across different runs for debugging.

Architecture Refactoring

  1. API Update

    • Updated the MagiAttention API to support configurable chunk_size, providing users with more control over the granularity of load-balanced dispatch and padding.
    • Updated cp_group to cp_group_or_mesh, allowing users to pass in either process group or 1D/2D device mesh (2D device mesh is mostly used for hierarchical comm).
    • Removed the unused head_dim argument.
  2. Code Cleanup

    • Simplified the setup.py to speed up the compilation for FFA.
    • Refactored the magi_attention/csrc/flexible_flash_attention/ to remove redundant codes.
    • Refactored the magi_attention/meta/solver/ for more modularity and less redundancy.
    • Removed some deprecated flags from DistAttnConfig, including high_bandwidth_domain_size and deterministic.

Performance Improvements

  1. Flex-Flash-Attention Optimization

    • Improved performance to fully align with Flash-Attention-3 in the scenarios both support.
    • Supported more fine-grained arguments to control kernel behavior, including forward/backward SM margin, configurable dtype, in-place reduction, etc.
    • Supported SM margin for both forward and backward, which can also be configured via environment variables (see here) for magi_attention.
    • Implemented a fast-zero-fill kernel to avoid manually zero-filling the uncovered ranges of the forward output.
  2. Dist-Attention Optimization

    • Removed redundant range-fill ops and related logic for forward output correction, as well as redundant cuda-malloc ops.
    • Absorbed partial dq reduction into the FFA backward kernel, and pre-allocated a zeroed dkv buffer to avoid concatenating dk and dv.
    • Supported forward partial output and lse in-place reduction to avoid post result-correction, enabled via environment variables (see here).
    • Supported high-precision (fp32) dkv reduction for better precision at the cost of double comm overhead, enabled via environment variables (see here).
  3. Hierarchical Comm Optimization

    • Supported hierarchical comm to decrease the inter-node comm overhead for both group-cast and group-reduce (especially useful for dense masks like pure full mask), enabled via environment variables (see here) with user-given 2D cp device mesh.
  4. Dispatch Algorithm Optimization

    • Supported SortedSequentialAlg and SequentialAlg to heuristically resolve the large comm overhead of MinHeapAlg under varlen mask settings, while maintaining computation load-balance.
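A MinHeapAlg-style balancer greedily assigns the next-largest workload to the currently lightest rank. A minimal sketch with heapq (illustrative of the greedy idea only, not the real dispatch solver):

```python
import heapq

def min_heap_dispatch(workloads, num_ranks):
    """Greedy load balance: assign each chunk (largest cost first) to the
    rank with the smallest accumulated load. Returns {rank: (load, chunks)}."""
    # heap entries are (accumulated_load, rank, assigned_chunk_indices);
    # distinct rank ids break ties so the chunk lists are never compared
    heap = [(0, rank, []) for rank in range(num_ranks)]
    heapq.heapify(heap)
    for idx, cost in sorted(enumerate(workloads), key=lambda x: -x[1]):
        load, rank, chunks = heapq.heappop(heap)
        chunks.append(idx)
        heapq.heappush(heap, (load + cost, rank, chunks))
    return {rank: (load, chunks) for load, rank, chunks in heap}
```

This balances computation well, but says nothing about which kv each rank must fetch; the sequential variants above trade a little balance for much lower comm volume in varlen settings.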

Testing Enhancements

  1. UniTest Updates
    • Updated the unit tests for pipeline testing, covering MHA, GQA, and MQA settings, different head dims, and various masks.
    • Updated the unit tests for FFA, covering various masks and other flags.
    • Made the unit tests for the magi-attention API more specific, covering functools and various inputs.
    • Updated the unit tests for communication, covering hierarchical comm and range ops.

Bug Fixes

  1. Flex-Flash-Attention
    • Fixed the wrong-positioned __threadfence() in forward epilogue and tma_store_wait in backward mainloop (see #55).
    • Fixed the index-out-of-boundary bug in tile_scheduler (see #73).
    • Fixed the out-of-grid-size bug in backward preprocess in case of too many ranges (see #40).

Others

  1. Example Codes

    • Besides the integration of MagiAttention and Megatron-LM (see release note-v1.0.2), we have provided example codes for training a Llama-3 1B model, showcasing integration with both PyTorch Native FSDP (see here) and HuggingFace Transformers (see here).
  2. Benchmarks

    • Provided a benchmark here to compare both kernel-level performance among various attention kernels and module-level performance among different cp strategies.

MagiAttention V1.0.2

28 May 09:30
41b4f20

Major Features

  1. Add Example code and experiments to integrate Megatron with MagiAttention

    • Created a new repository Megatron-LM-MagiAttention forked from Megatron-LM v0.11.0.
    • Provided example code in this PR to present the code modification to integrate Megatron with MagiAttention.
    • Provided the training recipe to reproduce the experiment to pretrain Llama3-1b from scratch here, with our own experiment results of loss convergence.
  2. Add RoadMap

    • Added a Roadmap section in README to outline our future plans, including feature support and performance improvement.

Bug Fixes

  1. Flex Flash Attention

    • Fixed an index out-of-boundary bug.
  2. Range Ops

    • Fixed redundant triton kernel recompilation of range ops.

MagiAttention V1.0.1

13 May 14:00
c5d4c65

Major Features

  1. Overlapped Q Ranges Support For Full Mask

    • Enhanced support for overlapping query ranges for more flexible attention computation
    • Added proper synchronization and atomic reduction control for overlapped ranges
  2. Precision Improvements

    • Used higher precision in the forward pass to reduce errors from partial output correction
    • Zero-initialized tensors to prevent NaN values in padding tokens from interfering with gradients of other tokens

Architecture Refactoring

  1. Code Cleanup

    • Removed the is_causal_mask_supported experimental feature flag, as well as its usage in unit-test code
    • Removed unused properties from DispatchMeta
    • Simplified DistAttnRuntimeKey implementation
    • Refactored zero-filling range initialization logic
  2. Core Algorithm Optimizations

    • Generalized attention dispatch meta calculation for overlapped query ranges
    • Simplified dispatch and undispatch functional flows

Testing Enhancements

  1. New Tests
    • Added position IDs test suite
    • Added test cases for overlapped query ranges

Bug Fixes

  1. Build System Fixes

    • Fixed CUDA tools path handling
    • Improved build process robustness
  2. Flex Flash Attention

    • Fixed atomic reduction issues
    • Added thread fence to avoid race conditions

MagiAttention V1.0.0

21 Apr 01:51
0bdc141

This is the first open-sourced version of MagiAttention, a distributed attention mechanism towards linear scalability for ultra-long, heterogeneous data training.

We provide four key features to realize linear scalability:

  • Flexible Flash Attention Kernel. We introduce a generalized formulation for irregular attention mask patterns and implement a flexible flash attention kernel (FFA). It is natively designed for distributed scenarios and provides greater flexibility in handling diverse attention mask types, with performance comparable to Flash-Attention 3 on Hopper GPUs.
  • Computation Load-Balance. With fine-grained sharding strategies, we design an efficient dispatch solver that ensures balanced attention computation loads across CP ranks in every training iteration.
  • Zero-Redundant Communication. Instead of adopting the common Ring-style P2P communication pattern in CP, we propose two novel communication primitives, GroupCast and GroupReduce, built upon All-to-All-v as a prototype implementation, enabling zero-redundant communication volume in both forward and backward passes.
  • Adaptive Multi-Stage Overlap. Leveraging the above enhancements, we further implement a multi-stage compute-communication overlap strategy that effectively hides communication latency and adaptively optimizes overlap through manual or automatic tuning.
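The semantics of the two primitives can be pictured with plain Python: GroupCast delivers one rank's buffer to a set of destination ranks exactly once, and GroupReduce sums partial buffers back onto one owner rank (a toy model of the semantics, not the communication kernels):

```python
def group_cast(src_buffer, dst_ranks):
    """Deliver one source buffer to every destination rank exactly once
    (zero-redundant: each remote rank receives a single copy)."""
    return {rank: list(src_buffer) for rank in dst_ranks}

def group_reduce(partials):
    """Sum partial results (e.g. partial dkv contributions) from a group
    of ranks element-wise back onto the owner rank."""
    return [sum(vals) for vals in zip(*partials)]
```

In the forward pass, kv shards are group-cast only to the ranks whose masks actually need them; in the backward pass, the partial dkv produced on those ranks is group-reduced back, so no byte is sent to a rank that does not use it.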

The development of this project is ongoing. We trust it will be a valuable resource for you and warmly welcome your contributions.