[main][feature][under updating] zero-overhead activation offload #1752
GeYuhong wants to merge 43 commits into NVIDIA:main
Conversation
    rank 1 | 0 1 2 0 1 2 3 4 3 4
    """
    offload_mlp_input: bool = False
Do we still need this flag?
    class ChunkOffloadHandler:
We should try reusing the code of cpu_offload.py in TE as much as possible. IIUC, the class should derive from TE’s AsyncDoubleBufferGroupOffloadHandler().
Fixed (although the PipelineOffload class was applied, it achieves only limited reuse). See e845344.
    tensor_on_device.record_stream(self.d2h_stream)
    self._tensor_tag_to_state[tensor_tag] = state
    self._offloaded_group_count = group_to_offload + 1
    self._f_event.record(self.d2h_stream)
Maybe we can use stream synchronization instead to reduce the discrepancy from TE’s AsyncDoubleBufferGroupOffloadHandler(). Event synchronization is lightweight, but I don't think it would impact perf much here.
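To make the event-vs-stream discussion concrete, here is a hypothetical, framework-free analogue of the event-based D2H synchronization: the offload "stream" is a worker thread and a `threading.Event` plays the role of the CUDA event recorded after the copy completes. This is illustrative only; the real handler records `torch.cuda.Event` objects on a dedicated `d2h_stream`, and the `ToyOffloadHandler` name is invented for this sketch.

```python
import threading

class ToyOffloadHandler:
    def __init__(self):
        self.cpu_copies = {}
        self._f_event = threading.Event()  # "recorded" when the offload finishes

    def offload_group(self, tensors):
        # Analogue of launching async D2H copies on a side stream.
        def _d2h_copy():
            for tag, t in tensors.items():
                self.cpu_copies[tag] = list(t)  # stands in for a D2H memcpy
            self._f_event.set()  # analogue of event.record(d2h_stream)
        threading.Thread(target=_d2h_copy).start()

    def wait_offload(self):
        # Analogue of stream.wait_event()/event.synchronize(): the consumer
        # blocks only until the copy it depends on has finished.
        self._f_event.wait()

handler = ToyOffloadHandler()
handler.offload_group({"act0": [1.0, 2.0], "act1": [3.0]})
handler.wait_offload()
print(handler.cpu_copies["act0"])  # safe to read only after the event fires
```

The trade-off under discussion is ordering granularity: a recorded event lets one consumer wait on one producer, whereas full stream synchronization serializes against everything queued on that stream.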
    return GroupStartFunction.apply(tensor, cur_forward_chunk)
    def offloading_checker(tensor):
Do we still need the checker?
    return len(self._queue)
    def reset_chunk_handler(self, num_layer, offload_mlp_input=True):
        cur_vpp_rank = parallel_state.get_virtual_pipeline_model_parallel_rank()
get_virtual_pipeline_model_parallel_rank() is deprecated now. The vpp_size (named vp_stage now) is passed at runtime. The MR is here.
        MoEAuxLossAutoScaler.set_loss_scale(loss_scale)
    else:
        if config.offload_activation:
            MoEPositiveAuxLossAutoScaler.set_loss_scale(loss_scale / num_microbatches)
Why do we need an extra loss scaler?
    return hidden_states
    def _offload_qkv_linear_forward(
Can we make it a factory function to simplify the calling logic of registering and offloading tensors?
Yes, this is the bug we encountered and we have fixed it. Thank you!
|
Is this PR ready for use? Or are there some limitations to applying this patch?
|
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading: add arguments.py and minor fix, OOTB runnable now
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
support activation offloading at PP=1, PP, and VPP
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
support mixed dense&moe layer and a2a overlap
Hongbinl/activation offloading
Signed-off-by: Hongbin Liu <hongbinl@oci-hsg-cs-001-vscode-03.cm.cluster>
… into the last stage Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading
2. refine the sync mechanism; 3. remove mark_layer_start; 4. support activation offload for dense layer; 5. support cuda graph but the cuda graph scope cannot contain the offloading module Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Hongbinl/activation offloading
We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged. Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process in submit.md.
This feature offloads model activations to host memory in the forward pass and prefetches them back in the backward pass.
Note: this requires TransformerEngine 2.5 with the feature PR (NVIDIA/TransformerEngine#2145).
Currently this feature can be used in a few modules, such as core_attention and router_fc1; we will support more modules (including qkv_linear, router_fc2, and shared_experts) as soon as possible.
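The forward-offload / backward-prefetch lifecycle can be sketched without any framework: the "device" and "host" below are plain dicts, copies are immediate, and the class and tag names are invented for illustration. The real implementation issues asynchronous D2H/H2D copies on side CUDA streams.

```python
class ToyActivationOffloader:
    """Hypothetical sketch of the offload/prefetch lifecycle."""

    def __init__(self):
        self.device = {}   # live activations
        self.host = {}     # offloaded copies

    def save_for_backward(self, tag, activation):
        # Forward pass: push the activation to host, keep only the tag
        # as a handle, and free the device copy.
        self.host[tag] = activation
        return tag

    def prefetch(self, tag):
        # Backward pass: bring the activation back before it is needed.
        self.device[tag] = self.host.pop(tag)
        return self.device[tag]

off = ToyActivationOffloader()
handle = off.save_for_backward("layer0.core_attention", [0.1, 0.2])
# ... forward continues; device memory for this activation is free ...
act = off.prefetch(handle)
print(act)  # [0.1, 0.2]
```

The "zero overhead" claim rests on overlapping these copies with compute, which the sketch deliberately omits.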
We rewrite _indices_to_multihot() in the token_dispatcher to remove all implicit device-host synchronization without using fused ops, ensuring bitwise consistency.
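For readers unfamiliar with the routine, here is a hypothetical pure-Python illustration of an indices-to-multihot conversion. The real _indices_to_multihot() operates on GPU tensors, and the point of the rewrite is to avoid operations with data-dependent output shapes that force a device-host sync; this sketch only shows the mapping itself, with -1 as padding.

```python
def indices_to_multihot(indices, probs, num_experts):
    """indices: per-token lists of expert ids (-1 = padding);
    probs: matching routing probabilities."""
    multihot = [[0] * num_experts for _ in indices]
    multihot_probs = [[0.0] * num_experts for _ in indices]
    for t, (row, prow) in enumerate(zip(indices, probs)):
        for e, p in zip(row, prow):
            if e >= 0:  # skip padding in place instead of filtering it out
                multihot[t][e] = 1
                multihot_probs[t][e] = p
    return multihot, multihot_probs

mh, mp = indices_to_multihot([[0, 2], [1, -1]], [[0.6, 0.4], [1.0, 0.0]], 4)
print(mh)  # [[1, 0, 1, 0], [0, 1, 0, 0]]
```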
The following are the experimental results (dp4tp1cp1ep4pp2vpp2), including end-to-end performance and peak memory.
End-to-end perf:
Peak memory ($R$ is the ratio of the actual decrease in peak memory to the theoretical value, where the theoretical values of stage0 and stage1 are 1440M and 800M, respectively):
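As a worked example of the ratio $R$ defined above, with the theoretical values from the text and hypothetical measured decreases (the measured numbers below are invented, not taken from the PR's tables):

```python
# R = actual peak-memory decrease / theoretical decrease, per pipeline stage.
theoretical = {"stage0": 1440, "stage1": 800}        # M, from the PR text
measured_decrease = {"stage0": 1296, "stage1": 720}  # hypothetical values

R = {stage: measured_decrease[stage] / theoretical[stage] for stage in theoretical}
print(R)  # {'stage0': 0.9, 'stage1': 0.9}
```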