Megatron-LM changes to make Hyena/Evo 2 inference usable, especially for 40B models by antonvnv · Pull Request #1727 · NVIDIA/Megatron-LM

antonvnv · 2025-08-01T22:55:00Z

No description provided.

This is needed to make Evo 2 40b work on A6000 Ada x2.

Needed to pass None as inference params when we do cache-less forward pass.

If the prompt length exceeds this value, the prompt is split into segments. This makes it possible to process very large prompts that would otherwise cause an out-of-memory (OOM) error during the forward pass. When the input prompt length exceeds the threshold, generation is split into three phases:

1. One large forward pass over the input tokens up to the threshold value.
2. The remainder of the prompt beyond the threshold is processed token by token, without sampling. This phase runs at token-generation speed (throughput).
3. Regular generation: once the input prompt is fully processed, normal token generation with sampling resumes.
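The three phases described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `generate_with_prompt_segmentation`, `forward`, and `sample` are hypothetical names, and `forward` is assumed to update the model's internal state (for example a cache) and return the logits for the last position it consumed.

```python
from typing import Callable, List, Sequence


def generate_with_prompt_segmentation(
    prompt: Sequence[int],
    threshold: int,
    max_new_tokens: int,
    forward: Callable[[Sequence[int]], List[float]],  # hypothetical stateful model step
    sample: Callable[[List[float]], int],             # hypothetical sampler
) -> List[int]:
    """Process a long prompt in segments, then generate new tokens."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    if threshold < 1:
        raise ValueError("threshold must be at least 1")

    # Phase 1: one large forward pass over the first `threshold` tokens.
    logits = forward(prompt[:threshold])

    # Phase 2: feed the rest of the prompt one token at a time.
    # No sampling happens here; the logits are only carried forward.
    for token in prompt[threshold:]:
        logits = forward([token])

    # Phase 3: regular generation with sampling.
    generated: List[int] = []
    for step in range(max_new_tokens):
        token = sample(logits)
        generated.append(token)
        if step + 1 < max_new_tokens:
            logits = forward([token])
    return generated
```

In this sketch, a prompt shorter than the threshold never enters phase 2, so the function behaves like ordinary prefill followed by decoding. The peak memory cost of phase 1 is bounded by the threshold rather than by the full prompt length, at the price of slower per-token processing for the remainder.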

Logits reporting is required for Evo 2 NIM.

This is to make it compatible with Vortex: https://github.com/Zymrael/vortex/blob/debd9d160476b2498494507ffec0a697d3075a2d/vortex/model/sample.py#L51

Phlip79 · 2026-03-04T23:00:18Z

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

antonvnv added 8 commits July 22, 2025 17:54

Megatron-LM: evo2: More efficient checkpoint loading

0a6963e

This is needed to make Evo 2 40b work on A6000 Ada x2.

Megatron-LM: SamplingParams: Add token_callback

63ac050

Megatron-LM: AbstractModelInferenceWrapper: Add custom inference_params

c94a6e1

Needed to pass None as inference params when we do cache-less forward pass.

Megatron: Modelopt: Make Linear layer compatible with TELinear

7699375

Megatron-LM: Make flash_decode=True + inference_context=None work

808e85f

Megatron-LM: Add logits reporting to generate() API

391e174

Logits reporting is required for Evo 2 NIM.

Megatron-LM: Reset inference context more safely

358fa0d

sbhavani added the enhancement label Aug 2, 2025

Megatron-LM: Allow top_p sub-sampling within top_k

a75c957

This is to make it compatible with Vortex: https://github.com/Zymrael/vortex/blob/debd9d160476b2498494507ffec0a697d3075a2d/vortex/model/sample.py#L51
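As a rough illustration of what top_p sub-sampling within top_k can mean, the sketch below first restricts the candidates to the `top_k` highest logits, renormalizes over those candidates, and then applies a nucleus (top_p) cutoff inside that set. It is an assumption-laden example in plain Python, not the implementation in this PR or in Vortex; the function name `sample_top_k_top_p` and its signature are hypothetical.

```python
import math
import random
from typing import List, Optional


def sample_top_k_top_p(
    logits: List[float],
    top_k: int,
    top_p: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index using top-k filtering followed by top-p filtering."""
    if not 1 <= top_k <= len(logits):
        raise ValueError("top_k must be between 1 and len(logits)")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    rng = rng or random.Random()

    # Top-k: keep the indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    # Softmax over the surviving candidates only (shifted for numerical stability).
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_indices: List[int] = []
    kept_probs: List[float] = []
    cumulative = 0.0
    for index, prob in zip(top, probs):
        kept_indices.append(index)
        kept_probs.append(prob)
        cumulative += prob
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the remaining candidates.
    return rng.choices(kept_indices, weights=kept_probs, k=1)[0]
```

Note that in this sketch the top_p cutoff is computed on probabilities renormalized over the top_k candidates, not over the full vocabulary, so the two filters compose rather than being mutually exclusive.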

ko3n1g requested review from a team as code owners February 18, 2026 09:18

Phlip79 marked this pull request as draft March 4, 2026 23:00

antonvnv wants to merge 9 commits into NVIDIA:main from antonvnv:next

3 participants