
Conversation

@roikoren755 (Contributor)

Motivation

Recent updates enabled Qwen3-Next models to run with both the radix cache and the overlap scheduler. This PR does the same for Mamba2-based models.

Modifications

  • Optionally return intermediate states from the mamba_chunk_scan_combined prefill kernel (a simplified sketch of this follows the list).
  • Fix the writing locations of intermediate states in the selective_state_update decode kernel.
  • Update MambaMixer2 to return prefill intermediate states, provide the SpecDec kernels with the missing (and newly introduced) intermediate-state writing locations, and update the conv state for prefix caching in the extra_buffer code path.
  • Add missing tracking tensors to Mamba2Metadata.
  • Update MambaAttnBackendBase to work with chunk sizes other than FLA_CHUNK_SIZE.
  • Update Mamba2AttnBackend for extra_buffer tracking.
  • Update ScheduleBatch and MambaRadixCache to work with chunk sizes other than FLA_CHUNK_SIZE.
  • Update the NemotronH and FalconH1 models to pass the newly required forward batch to the attention backend.
  • Update ServerArgs initialization to allow running Mamba2-based models with both the radix cache and the overlap scheduler enabled.
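
For illustration, here is a minimal, hypothetical sketch of threading an optional return_intermediate_states flag through a chunked scan; the real mamba_chunk_scan_combined is a Triton kernel with a far larger signature, and the shapes and recurrence below are illustrative only.

# Hypothetical, simplified sketch; not the real kernel.
from typing import List, Optional, Tuple

import torch


def chunk_scan_combined(
    x: torch.Tensor,
    chunk_size: int,
    return_intermediate_states: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    """Scan x chunk by chunk, optionally stacking per-chunk states.

    Returning the state at every chunk boundary is what lets a prefix
    cache checkpoint the recurrent state at chunk granularity.
    """
    state = torch.zeros_like(x[:, 0])
    intermediates: List[torch.Tensor] = []
    for start in range(0, x.shape[1], chunk_size):
        # A toy recurrence standing in for the real SSM chunk scan.
        state = state + x[:, start : start + chunk_size].sum(dim=1)
        if return_intermediate_states:
            intermediates.append(state.clone())
    states = torch.stack(intermediates, dim=1) if return_intermediate_states else None
    return state, states


final_state, per_chunk = chunk_scan_combined(
    torch.randn(2, 512, 8), chunk_size=256, return_intermediate_states=True
)
assert per_chunk is not None and per_chunk.shape[1] == 2  # 512 / 256 chunks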

  • Added tests with --mamba-scheduler-strategy extra_buffer for nvidia/NVIDIA-Nemotron-Nano-v2-9B, including in P/D disaggregation tests.
  • Refactored and cleaned up the Qwen3-Next tests for code re-use.

Accuracy Tests

  • All existing tests pass without accuracy degradation.
  • Newly added tests pass with matching accuracy.

Benchmarking and Profiling

Checklist

@github-actions github-actions bot added the documentation label on Dec 25, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @roikoren755, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for the extra_buffer scheduling strategy in Mamba2-based models, allowing them to use both radix caching and overlap scheduling. The changes span kernel implementations, attention backends, scheduling mechanisms, and model configurations, enabling chunk-granular state management during prefill and decode and improving cache utilization for Mamba2 architectures.

Highlights

  • Mamba2 extra_buffer Support: Enabled Mamba2-based models to utilize the extra_buffer scheduling strategy, allowing them to run with both radix cache and overlap scheduler simultaneously. This was previously only available for Qwen3-Next models.
  • Kernel and Backend Enhancements: Modified the mamba_chunk_scan_combined prefill kernel to optionally return intermediate states and fixed writing locations of intermediate states in the selective_state_update decode kernel. The MambaMixer2 and Mamba2AttnBackend were updated to handle these intermediate states and extra_buffer tracking.
  • Dynamic Chunk Size Handling: Updated HybridLinearAttnBackend, ScheduleBatch, and MambaRadixCache to use a dynamic chunk_size (either mamba_chunk_size for Mamba2 or FLA_CHUNK_SIZE for FLA models) instead of a hardcoded FLA_CHUNK_SIZE, improving flexibility (see the sketch after this list).
  • Model and Server Argument Adjustments: NemotronH and FalconH1 models were updated to pass the forward_batch to the attention backend. Server argument initialization logic was refined to include necessary validations for mamba_track_interval when extra_buffer is enabled for Mamba2 models.
  • Testing and Refactoring: Added new tests specifically for the --mamba-scheduler-strategy extra_buffer with Nemotron-Nano-v2, including prefill/decode cache hit and prefix cache branching tests. Existing Qwen3-Next tests were refactored for better code reusability using new mixin classes.
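
To make the chunk-boundary constraint behind the dynamic chunk size concrete, here is a minimal illustrative sketch (names hypothetical, not sglang's API): intermediate states are only written at chunk boundaries, so tracked sequence lengths must be rounded down to the active backend's chunk size.

# Hypothetical helper illustrating dynamic chunk-size handling;
# the actual sglang code differs.
def align_to_chunk(seqlen: int, chunk_size: int) -> int:
    """Round a tracked sequence length down to a chunk boundary,
    since intermediate states are only written at chunk boundaries."""
    return (seqlen // chunk_size) * chunk_size


# Mamba2 backends would use their own chunk size (commonly 256),
# while FLA-based models keep using FLA_CHUNK_SIZE (64).
assert align_to_chunk(700, 256) == 512
assert align_to_chunk(700, 64) == 640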

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for extra_buffer in Mamba2-based models, enabling the use of both radix cache and the overlap scheduler. The changes are extensive, touching model implementations, attention backends, scheduling logic, and adding new tests. The core idea is to generalize the handling of chunk sizes and enable tracking of intermediate states for Mamba2 models, similar to what was done for FLA-based models. The code is well-structured, and the addition of tests for the new functionality is commendable. I have a couple of suggestions for improving code clarity and maintainability.

Comment on lines +1196 to +1169
if (
    intermediate_states is not None
    and forward_batch.mamba_track_mask is not None
    and forward_batch.mamba_track_mask.any()
):

Severity: medium

The check forward_batch.mamba_track_mask is not None is redundant within this nested if block, as it's already checked in the outer if statement on line 1195. Removing it will make the code slightly cleaner.

            if (
                intermediate_states is not None
                and forward_batch.mamba_track_mask.any()
            ):

Comment on lines +251 to +262
if return_intermediate_states:
    if return_varlen_states:
        varlen_states = rest[0]
        if return_final_states:
            return states, final_states, varlen_states
        else:
            return states, varlen_states
    else:
        if return_final_states:
            return states, final_states
        else:
            return states

Severity: medium

The nested if statements for handling return values based on different flags can be a bit hard to follow. Refactoring this block to a flatter structure would improve readability and maintainability.

    if return_intermediate_states:
        if not return_final_states and not return_varlen_states:
            return states
        if not return_final_states and return_varlen_states:
            return states, rest[0]
        if return_final_states and not return_varlen_states:
            return states, final_states
        # return_final_states and return_varlen_states
        return states, final_states, rest[0]

@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch 2 times, most recently from de24f9e to 1baf72d on December 31, 2025 09:06
@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch 2 times, most recently from 41412f3 to 9244a86 on January 14, 2026 10:38
lens_to_track = (
    forward_batch.mamba_track_seqlens - forward_batch.extend_prefix_lens
)
mamba_cache_chunk_size = get_global_server_args().mamba_cache_chunk_size
@hanming-lu (Collaborator) commented Jan 14, 2026


There are two variables:

  1. FLA_CHUNK_SIZE
  2. mamba_cache_chunk_size

IIUC, there's no reason to touch anything related to mamba_cache_chunk_size? IIUC, you should only replace all usages of FLA_CHUNK_SIZE with mamba_chunk_size (i.e. the CHUNK_SIZE for mamba2's backend, also maybe use a better name like <backend>_CHUNK_SIZE or MAMBA2_CHUNK_SIZE) for mamba2 models
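
For illustration only, a sketch of the per-backend constant naming suggested here (all class names hypothetical, not the actual sglang hierarchy): each attention backend declares its own CHUNK_SIZE instead of everything reusing FLA_CHUNK_SIZE.

# Hypothetical sketch of per-backend chunk-size constants.
FLA_CHUNK_SIZE = 64
MAMBA2_CHUNK_SIZE = 256


class MambaAttnBackendBaseSketch:
    CHUNK_SIZE: int  # each backend declares its own tracking granularity


class FLABackendSketch(MambaAttnBackendBaseSketch):
    CHUNK_SIZE = FLA_CHUNK_SIZE


class Mamba2BackendSketch(MambaAttnBackendBaseSketch):
    CHUNK_SIZE = MAMBA2_CHUNK_SIZE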

    mamba_track_indices_cpu: List[int],
    mamba_track_seqlens_cpu: List[int],
):
    mamba_track_interval = get_global_server_args().mamba_track_interval
@hanming-lu (Collaborator) commented Jan 14, 2026


IIUC, you don't need to touch any logic related to mamba_track_interval?

See https://github.com/sgl-project/sglang/pull/15829/files#r2691660148

I feel this is not crashing just because mamba_track_interval happens to be 256?
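
The PR summary mentions added validations for mamba_track_interval; the point above suggests a divisibility check along these lines. This is a hedged sketch with a hypothetical function name, not the actual ServerArgs code.

# Hypothetical validation sketch: tracking positions must fall on chunk
# boundaries, so the interval must be a multiple of the chunk size.
def validate_track_interval(mamba_track_interval: int, chunk_size: int) -> None:
    if mamba_track_interval % chunk_size != 0:
        raise ValueError(
            f"mamba_track_interval={mamba_track_interval} must be a "
            f"multiple of the backend chunk size ({chunk_size})"
        )


validate_track_interval(256, 256)  # Mamba2: passes, but only by coincidence
validate_track_interval(256, 64)   # FLA: passes because 256 is a multiple of 64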

@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch from 9244a86 to 1ce8415 on January 18, 2026 14:05
@roikoren755 (Contributor, Author)

@hanming-lu The current implementation of ServerArgs.mamba_cache_chunk_size didn't work for Mamba2-based models: it was derived from the page size (which defaults to 1) even though that value is unrelated, so it came out as 64, while Mamba2-based models use a chunk size of 256 (from what I saw, all of them do). I updated the property to take that into account (see the sketch after this comment), and things should be OK now. I also rebased and fixed some of the test refactors to be clearer and better located. Please take another look 🙏

I'm still seeing some intermittent errors in some of the tests, with the scheduler raising an error that a memory leak is detected. Not sure if it's just my setup or if it will happen in the CI as well...
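
Below is a rough, hypothetical sketch of the property change described in the first paragraph of this comment; the real ServerArgs fields and logic differ.

# Hypothetical sketch of the described fix; not the real ServerArgs.
# Previously the chunk size was derived from page_size (default 1),
# yielding 64, which is wrong for Mamba2 models that use 256.
from dataclasses import dataclass


@dataclass
class ServerArgsSketch:
    page_size: int = 1
    is_mamba2: bool = False
    mamba2_chunk_size: int = 256  # value observed across Mamba2 models
    fla_chunk_size: int = 64

    @property
    def mamba_cache_chunk_size(self) -> int:
        # Mamba2 ignores page_size entirely; FLA models keep the old value.
        return self.mamba2_chunk_size if self.is_mamba2 else self.fla_chunk_size


assert ServerArgsSketch(is_mamba2=True).mamba_cache_chunk_size == 256
assert ServerArgsSketch().mamba_cache_chunk_size == 64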

@roikoren755 roikoren755 force-pushed the feat/mamba2-radix-overlap branch from 1ce8415 to 12e4649 on January 25, 2026 12:37