
Conversation

@roikoren755 (Contributor)

Motivation

Recent updates enabled Qwen3-Next models to run with both the radix cache and the overlap scheduler. This PR does the same for Mamba2-based models.

Modifications

  • Optionally return intermediate states from the mamba_chunk_scan_combined prefill kernel (a simplified sketch of this follows the list).
  • Fix the writing locations of intermediate states in the selective_state_update decode kernel.
  • Update MambaMixer2 to return prefill intermediate states, provide the SpecDec kernels with the missing (and newly introduced) intermediate-state writing locations, and update the conv state for prefix caching in the extra_buffer code path.
  • Add missing tracking tensors to Mamba2Metadata.
  • Update MambaAttnBackendBase to work with chunk sizes other than FLA_CHUNK_SIZE.
  • Update Mamba2AttnBackend for extra_buffer tracking.
  • Update ScheduleBatch and MambaRadixCache to work with chunk sizes other than FLA_CHUNK_SIZE.
  • Update the NemotronH and FalconH1 models to pass the newly required forward batch to the attention backend.
  • Update ServerArgs initialization to allow running Mamba2-based models with both the radix cache and the overlap scheduler enabled.
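
For illustration, here is a minimal, hypothetical sketch of threading an optional return_intermediate_states flag through a chunked scan; the real mamba_chunk_scan_combined is a Triton kernel with a far larger signature, and the shapes and recurrence below are illustrative only.

# Hypothetical, simplified sketch; not the real kernel.
from typing import List, Optional, Tuple

import torch


def chunk_scan_combined(
    x: torch.Tensor,
    chunk_size: int,
    return_intermediate_states: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    """Scan x chunk by chunk, optionally stacking per-chunk states.

    Returning the state at every chunk boundary is what lets a prefix
    cache checkpoint the recurrent state at chunk granularity.
    """
    state = torch.zeros_like(x[:, 0])
    intermediates: List[torch.Tensor] = []
    for start in range(0, x.shape[1], chunk_size):
        # A toy recurrence standing in for the real SSM chunk scan.
        state = state + x[:, start : start + chunk_size].sum(dim=1)
        if return_intermediate_states:
            intermediates.append(state.clone())
    states = torch.stack(intermediates, dim=1) if return_intermediate_states else None
    return state, states


final_state, per_chunk = chunk_scan_combined(
    torch.randn(2, 512, 8), chunk_size=256, return_intermediate_states=True
)
assert per_chunk is not None and per_chunk.shape[1] == 2  # 512 / 256 chunks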

  • Added tests with --mamba-scheduler-strategy extra_buffer for nvidia/NVIDIA-Nemotron-Nano-v2-9B, including in P/D disaggregation tests.
  • Refactored and cleaned up the Qwen3-Next tests for code re-use.

Accuracy Tests

  • All existing tests pass without accuracy degradation.
  • Newly added tests pass with matching accuracy.

Benchmarking and Profiling

Checklist

@github-actions github-actions bot added the documentation label on Dec 25, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @roikoren755, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for the extra_buffer scheduling strategy in Mamba2-based models, allowing them to use both radix caching and overlap scheduling. The changes span kernel implementations, attention backends, scheduling mechanisms, and model configurations, enabling chunk-granular state management during prefill and decode and improving cache utilization for Mamba2 architectures.

Highlights

  • Mamba2 extra_buffer Support: Enabled Mamba2-based models to utilize the extra_buffer scheduling strategy, allowing them to run with both radix cache and overlap scheduler simultaneously. This was previously only available for Qwen3-Next models.
  • Kernel and Backend Enhancements: Modified the mamba_chunk_scan_combined prefill kernel to optionally return intermediate states and fixed writing locations of intermediate states in the selective_state_update decode kernel. The MambaMixer2 and Mamba2AttnBackend were updated to handle these intermediate states and extra_buffer tracking.
  • Dynamic Chunk Size Handling: Updated HybridLinearAttnBackend, ScheduleBatch, and MambaRadixCache to use a dynamic chunk_size (either mamba_chunk_size for Mamba2 or FLA_CHUNK_SIZE for FLA models) instead of a hardcoded FLA_CHUNK_SIZE, improving flexibility (see the sketch after this list).
  • Model and Server Argument Adjustments: NemotronH and FalconH1 models were updated to pass the forward_batch to the attention backend. Server argument initialization logic was refined to include necessary validations for mamba_track_interval when extra_buffer is enabled for Mamba2 models.
  • Testing and Refactoring: Added new tests specifically for the --mamba-scheduler-strategy extra_buffer with Nemotron-Nano-v2, including prefill/decode cache hit and prefix cache branching tests. Existing Qwen3-Next tests were refactored for better code reusability using new mixin classes.
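
To make the chunk-boundary constraint behind the dynamic chunk size concrete, here is a minimal illustrative sketch (names hypothetical, not sglang's API): intermediate states are only written at chunk boundaries, so tracked sequence lengths must be rounded down to the active backend's chunk size.

# Hypothetical helper illustrating dynamic chunk-size handling;
# the actual sglang code differs.
def align_to_chunk(seqlen: int, chunk_size: int) -> int:
    """Round a tracked sequence length down to a chunk boundary,
    since intermediate states are only written at chunk boundaries."""
    return (seqlen // chunk_size) * chunk_size


# Mamba2 backends would use their own chunk size (commonly 256),
# while FLA-based models keep using FLA_CHUNK_SIZE (64).
assert align_to_chunk(700, 256) == 512
assert align_to_chunk(700, 64) == 640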

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for extra_buffer in Mamba2-based models, enabling the use of both radix cache and the overlap scheduler. The changes are extensive, touching model implementations, attention backends, scheduling logic, and adding new tests. The core idea is to generalize the handling of chunk sizes and enable tracking of intermediate states for Mamba2 models, similar to what was done for FLA-based models. The code is well-structured, and the addition of tests for the new functionality is commendable. I have a couple of suggestions for improving code clarity and maintainability.

Comment on lines +1196 to +1169
if (
    intermediate_states is not None
    and forward_batch.mamba_track_mask is not None
    and forward_batch.mamba_track_mask.any()
):

Severity: medium

The check forward_batch.mamba_track_mask is not None is redundant within this nested if block, as it's already checked in the outer if statement on line 1195. Removing it will make the code slightly cleaner.

            if (
                intermediate_states is not None
                and forward_batch.mamba_track_mask.any()
            ):

Comment on lines +251 to +262
if return_intermediate_states:
    if return_varlen_states:
        varlen_states = rest[0]
        if return_final_states:
            return states, final_states, varlen_states
        else:
            return states, varlen_states
    else:
        if return_final_states:
            return states, final_states
        else:
            return states

Severity: medium

The nested if statements for handling return values based on different flags can be a bit hard to follow. Refactoring this block to a flatter structure would improve readability and maintainability.

    if return_intermediate_states:
        if not return_final_states and not return_varlen_states:
            return states
        if not return_final_states and return_varlen_states:
            return states, rest[0]
        if return_final_states and not return_varlen_states:
            return states, final_states
        # return_final_states and return_varlen_states
        return states, final_states, rest[0]

@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch 2 times, most recently from de24f9e to 1baf72d on December 31, 2025 09:06
@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch 2 times, most recently from 41412f3 to 9244a86 on January 14, 2026 10:38
lens_to_track = (
    forward_batch.mamba_track_seqlens - forward_batch.extend_prefix_lens
)
mamba_cache_chunk_size = get_global_server_args().mamba_cache_chunk_size
@hanming-lu (Collaborator) commented Jan 14, 2026


There are two variables:

  1. FLA_CHUNK_SIZE
  2. mamba_cache_chunk_size

IIUC, there's no reason to touch anything related to mamba_cache_chunk_size? IIUC, you should only replace all usages of FLA_CHUNK_SIZE with mamba_chunk_size (i.e. the CHUNK_SIZE for mamba2's backend, also maybe use a better name like <backend>_CHUNK_SIZE or MAMBA2_CHUNK_SIZE) for mamba2 models
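
For illustration only, a sketch of the per-backend constant naming suggested here (all class names hypothetical, not the actual sglang hierarchy): each attention backend declares its own CHUNK_SIZE instead of everything reusing FLA_CHUNK_SIZE.

# Hypothetical sketch of per-backend chunk-size constants.
FLA_CHUNK_SIZE = 64
MAMBA2_CHUNK_SIZE = 256


class MambaAttnBackendBaseSketch:
    CHUNK_SIZE: int  # each backend declares its own tracking granularity


class FLABackendSketch(MambaAttnBackendBaseSketch):
    CHUNK_SIZE = FLA_CHUNK_SIZE


class Mamba2BackendSketch(MambaAttnBackendBaseSketch):
    CHUNK_SIZE = MAMBA2_CHUNK_SIZE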

    mamba_track_indices_cpu: List[int],
    mamba_track_seqlens_cpu: List[int],
):
    mamba_track_interval = get_global_server_args().mamba_track_interval
@hanming-lu (Collaborator) commented Jan 14, 2026


IIUC, you don't need to touch any logic related to mamba_track_interval?

See https://github.com/sgl-project/sglang/pull/15829/files#r2691660148

I feel this is not crashing just because mamba_track_interval happens to be 256?
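
The PR summary mentions added validations for mamba_track_interval; the point above suggests a divisibility check along these lines. This is a hedged sketch with a hypothetical function name, not the actual ServerArgs code.

# Hypothetical validation sketch: tracking positions must fall on chunk
# boundaries, so the interval must be a multiple of the chunk size.
def validate_track_interval(mamba_track_interval: int, chunk_size: int) -> None:
    if mamba_track_interval % chunk_size != 0:
        raise ValueError(
            f"mamba_track_interval={mamba_track_interval} must be a "
            f"multiple of the backend chunk size ({chunk_size})"
        )


validate_track_interval(256, 256)  # Mamba2: passes, but only by coincidence
validate_track_interval(256, 64)   # FLA: passes because 256 is a multiple of 64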

@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch from 9244a86 to 1ce8415 on January 18, 2026 14:05
@roikoren755 (Contributor, Author)

@hanming-lu The current implementation of ServerArgs.mamba_cache_chunk_size didn't work for Mamba2-based models: it was derived from the page size (which defaults to 1) even though that value is unrelated, so it came out as 64, while Mamba2-based models use a chunk size of 256 (from what I saw, all of them do). I updated the property to take that into account (see the sketch after this comment), and things should be OK now. I also rebased and fixed some of the test refactors to be clearer and better located. Please take another look 🙏

I'm still seeing some intermittent errors in some of the tests, with the scheduler raising an error that a memory leak is detected. Not sure if it's just my setup or if it will happen in the CI as well...
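
Below is a rough, hypothetical sketch of the property change described in the first paragraph of this comment; the real ServerArgs fields and logic differ.

# Hypothetical sketch of the described fix; not the real ServerArgs.
# Previously the chunk size was derived from page_size (default 1),
# yielding 64, which is wrong for Mamba2 models that use 256.
from dataclasses import dataclass


@dataclass
class ServerArgsSketch:
    page_size: int = 1
    is_mamba2: bool = False
    mamba2_chunk_size: int = 256  # value observed across Mamba2 models
    fla_chunk_size: int = 64

    @property
    def mamba_cache_chunk_size(self) -> int:
        # Mamba2 ignores page_size entirely; FLA models keep the old value.
        return self.mamba2_chunk_size if self.is_mamba2 else self.fla_chunk_size


assert ServerArgsSketch(is_mamba2=True).mamba_cache_chunk_size == 256
assert ServerArgsSketch().mamba_cache_chunk_size == 64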

@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch from 1ce8415 to 12e4649 on January 25, 2026 12:37