[main][feature][under updating] zero-overhead activation offload #1752
GeYuhong wants to merge 43 commits into NVIDIA:main
Conversation
    rank 1 | 0 1 2 0 1 2 3 4 3 4
    """
    offload_mlp_input: bool = False
Do we still need this flag?
    class ChunkOffloadHandler:
We should try reusing the code of cpu_offload.py in TE as much as possible. IIUC, the class should derive from TE’s AsyncDoubleBufferGroupOffloadHandler().
Fixed (although the PipelineOffload class was applied, it achieves only limited reuse). See e845344.
    tensor_on_device.record_stream(self.d2h_stream)
    self._tensor_tag_to_state[tensor_tag] = state
    self._offloaded_group_count = group_to_offload + 1
    self._f_event.record(self.d2h_stream)
Maybe we can use stream synchronization instead to reduce the discrepancy from TE’s AsyncDoubleBufferGroupOffloadHandler(). Event synchronization is lightweight, but I don't think it would impact perf much here.
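To make the event-vs-stream discussion concrete, here is a hypothetical, framework-free analogue of the event-based D2H synchronization: the offload "stream" is a worker thread and a `threading.Event` plays the role of the CUDA event recorded after the copy completes. This is illustrative only; the real handler records `torch.cuda.Event` objects on a dedicated `d2h_stream`, and the `ToyOffloadHandler` name is invented for this sketch.

```python
import threading

class ToyOffloadHandler:
    def __init__(self):
        self.cpu_copies = {}
        self._f_event = threading.Event()  # "recorded" when the offload finishes

    def offload_group(self, tensors):
        # Analogue of launching async D2H copies on a side stream.
        def _d2h_copy():
            for tag, t in tensors.items():
                self.cpu_copies[tag] = list(t)  # stands in for a D2H memcpy
            self._f_event.set()  # analogue of event.record(d2h_stream)
        threading.Thread(target=_d2h_copy).start()

    def wait_offload(self):
        # Analogue of stream.wait_event()/event.synchronize(): the consumer
        # blocks only until the copy it depends on has finished.
        self._f_event.wait()

handler = ToyOffloadHandler()
handler.offload_group({"act0": [1.0, 2.0], "act1": [3.0]})
handler.wait_offload()
print(handler.cpu_copies["act0"])  # safe to read only after the event fires
```

The trade-off under discussion is ordering granularity: a recorded event lets one consumer wait on one producer, whereas full stream synchronization serializes against everything queued on that stream.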
    return GroupStartFunction.apply(tensor, cur_forward_chunk)
    def offloading_checker(tensor):
Do we still need the checker?
    return len(self._queue)
    def reset_chunk_handler(self, num_layer, offload_mlp_input=True):
        cur_vpp_rank = parallel_state.get_virtual_pipeline_model_parallel_rank()
get_virtual_pipeline_model_parallel_rank() is deprecated now. The vpp_size (named vp_stage now) is passed at runtime. The MR is here.
        MoEAuxLossAutoScaler.set_loss_scale(loss_scale)
    else:
        if config.offload_activation:
            MoEPositiveAuxLossAutoScaler.set_loss_scale(loss_scale / num_microbatches)
Why do we need an extra loss scaler?
    return hidden_states
    def _offload_qkv_linear_forward(
Can we make it a factory function to simplify the calling logic of registering and offloading tensors?
Yes, this is the bug we encountered and we have fixed it. Thank you!
|
Is this PR ready for use? Or are there some limitations to applying this patch?
|
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading: add arguments.py and minor fix, OOTB runnable now
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
support activation offloading at PP=1, PP, and VPP
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
support mixed dense&moe layer and a2a overlap
Hongbinl/activation offloading
Signed-off-by: Hongbin Liu <hongbinl@oci-hsg-cs-001-vscode-03.cm.cluster>
… into the last stage Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading
2. refine the sync mechanism; 3. remove mark_layer_start; 4. support activation offload for dense layer; 5. support cuda graph but the cuda graph scope cannot contain the offloading module Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading
We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged. Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process in submit.md.
This feature offloads model activations to host memory in the forward pass and prefetches them back in the backward pass.
Note: this requires TransformerEngine 2.5 with the feature PR (NVIDIA/TransformerEngine#2145).
Currently this feature can be used in a few modules, such as core_attention and router_fc1; we will support more modules (including qkv_linear, router_fc2, and shared_experts) as soon as possible.
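The forward-offload / backward-prefetch lifecycle can be sketched without any framework: the "device" and "host" below are plain dicts, copies are immediate, and the class and tag names are invented for illustration. The real implementation issues asynchronous D2H/H2D copies on side CUDA streams.

```python
class ToyActivationOffloader:
    """Hypothetical sketch of the offload/prefetch lifecycle."""

    def __init__(self):
        self.device = {}   # live activations
        self.host = {}     # offloaded copies

    def save_for_backward(self, tag, activation):
        # Forward pass: push the activation to host, keep only the tag
        # as a handle, and free the device copy.
        self.host[tag] = activation
        return tag

    def prefetch(self, tag):
        # Backward pass: bring the activation back before it is needed.
        self.device[tag] = self.host.pop(tag)
        return self.device[tag]

off = ToyActivationOffloader()
handle = off.save_for_backward("layer0.core_attention", [0.1, 0.2])
# ... forward continues; device memory for this activation is free ...
act = off.prefetch(handle)
print(act)  # [0.1, 0.2]
```

The "zero overhead" claim rests on overlapping these copies with compute, which the sketch deliberately omits.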
We rewrite _indices_to_multihot() in the token_dispatcher to remove all implicit device-host synchronization without using fused ops, ensuring bitwise consistency.
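For readers unfamiliar with the routine, here is a hypothetical pure-Python illustration of an indices-to-multihot conversion. The real _indices_to_multihot() operates on GPU tensors, and the point of the rewrite is to avoid operations with data-dependent output shapes that force a device-host sync; this sketch only shows the mapping itself, with -1 as padding.

```python
def indices_to_multihot(indices, probs, num_experts):
    """indices: per-token lists of expert ids (-1 = padding);
    probs: matching routing probabilities."""
    multihot = [[0] * num_experts for _ in indices]
    multihot_probs = [[0.0] * num_experts for _ in indices]
    for t, (row, prow) in enumerate(zip(indices, probs)):
        for e, p in zip(row, prow):
            if e >= 0:  # skip padding in place instead of filtering it out
                multihot[t][e] = 1
                multihot_probs[t][e] = p
    return multihot, multihot_probs

mh, mp = indices_to_multihot([[0, 2], [1, -1]], [[0.6, 0.4], [1.0, 0.0]], 4)
print(mh)  # [[1, 0, 1, 0], [0, 1, 0, 0]]
```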
The following are the experimental results (dp4tp1cp1ep4pp2vpp2), including end-to-end performance and peak memory.
End-to-end perf:
Peak memory ($R$ is the ratio of the actual decrease in peak memory to the theoretical value, where the theoretical values of stage0 and stage1 are 1440M and 800M, respectively):
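As a worked example of the ratio $R$ defined above, with the theoretical values from the text and hypothetical measured decreases (the measured numbers below are invented, not taken from the PR's tables):

```python
# R = actual peak-memory decrease / theoretical decrease, per pipeline stage.
theoretical = {"stage0": 1440, "stage1": 800}        # M, from the PR text
measured_decrease = {"stage0": 1296, "stage1": 720}  # hypothetical values

R = {stage: measured_decrease[stage] / theoretical[stage] for stage in theoretical}
print(R)  # {'stage0': 0.9, 'stage1': 0.9}
```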