[Dev][feat] Support CUDA Graph capture offloading modules #3219
lhb8125 wants to merge 101 commits into NVIDIA:dev
Conversation
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
> Fine-grained offloading is compatible with CUDA graphs. When CUDA graph is enabled, the following constraints apply:
>
> - `attn_norm` and `mlp_norm` **cannot** be offloaded (they cross CUDA graph boundaries).
> - `cuda_graph_scope` must include `attn` and `moe_router`.
Can I use the `moe` scope if I'm using a drop-and-pad MoE?

Can I offload attention-part modules if my CUDA graph scope is only `moe_router`? This may be needed since some cases have dynamic-shaped attention, so only the router part can be captured.
I removed this hard limitation; the scope can now be `moe_router` alone or `moe`.
> Fine-grained offloading is compatible with CUDA graphs. When CUDA graph is enabled, the following constraints apply:
>
> - `attn_norm` and `mlp_norm` **cannot** be offloaded (they cross CUDA graph boundaries).
Unless using the `moe` cudagraph scope in a drop-and-pad or sync-free MoE.
What if we only capture `moe_router` or `moe_preprocess`? Is it still true?
I think so. If we only capture `moe_router`, `mlp_norm` acts as the input buffer of the graph, so it is not offloadable. The only exception is when we use the `attn+moe` scope for a drop-and-pad MoE; then `mlp_norm` is entirely inside the graph, so it is offloadable.
By the way, you cannot capture `moe_preprocess` alone; `moe_preprocess` must go together with `moe_router`.
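To illustrate why a tensor at the graph boundary cannot be offloaded, here is a minimal standalone PyTorch sketch (not Megatron-LM code) of the static-input-buffer requirement for captured CUDA graphs:

```python
import torch

# Minimal standalone sketch (not Megatron-LM code): tensors that feed a
# captured CUDA graph must live in static buffers that are reused on every
# replay. A boundary tensor such as the `mlp_norm` output therefore cannot
# be offloaded to CPU while the graph is alive.
def demo():
    static_in = torch.randn(8, 16, device="cuda")
    linear = torch.nn.Linear(16, 16).cuda()

    # Warm up on a side stream before capture, as required by CUDA graphs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            linear(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = linear(static_in)

    # Replay with new data: copy into the SAME buffer. Freeing or moving
    # `static_in` off the GPU would invalidate the captured graph.
    static_in.copy_(torch.randn(8, 16, device="cuda"))
    g.replay()
    return static_out

if torch.cuda.is_available():
    demo()
```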
```python
hidden_states: Tensor,
inference_context: BaseInferenceContext | None = None,
padding_mask: Tensor | None = None,
flush_delayed_groups: bool = True,
```
Since `flush_delayed_groups` is cudagraph-specific, can it be moved to cudagraph-specific code? If it just needs to run after warmup, can it be passed via TE's `make_graphed_callables(post_warmup_hook=)`?
Thanks for the comments. I removed the function call in `forward()` and kept only the call in `_te_cuda_graph_replay`. Now, in the warmup iterations we launch the offloading immediately, and in the replay iterations we delay the offloading and flush it after graph replay.

The term "warmup" here is a little ambiguous:
1. The first several training iterations are warmup iterations, after which we start graph capturing.
2. Before capturing the CUDA graph, TE runs several fprop/bprop steps to "warm up".

In the previous code, `flush_delayed_groups` was executed at the end of `forward()` in the warmup iterations (case 1).
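The delay-then-flush behavior described above can be sketched with a minimal, self-contained example (the class and attribute names here are illustrative, not Megatron-LM's actual API):

```python
# Illustrative sketch of "delay offloads during graph replay, flush after".
# OffloadManager and its members are assumptions, not Megatron-LM's API.
class OffloadManager:
    def __init__(self):
        self.delayed = []    # groups queued while the graph is replaying
        self.offloaded = []  # groups whose device-to-host copy was launched

    def offload(self, group: str, replaying: bool) -> None:
        if replaying:
            # During graph replay no new work may be launched inside the
            # captured region, so the offload is queued instead.
            self.delayed.append(group)
        else:
            # Warmup iteration: launch the offload immediately.
            self.offloaded.append(group)

    def flush_delayed_groups(self) -> None:
        # Called once after graph replay completes, launching all
        # queued offloads.
        self.offloaded.extend(self.delayed)
        self.delayed.clear()

mgr = OffloadManager()
mgr.offload("attn_out", replaying=True)
mgr.offload("mlp_act", replaying=True)
mgr.flush_delayed_groups()
print(mgr.offloaded)  # ['attn_out', 'mlp_act']
```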
> 3. **Apply fraction**: Only a fraction of eligible groups is actually offloaded (controlled by `activation_offload_fraction`).
> 4. **Print summary table**: An ASCII table of per-rank offload bytes is printed for debugging.
>
> ### CPU Tensor Pool
>
> ### Warmup and Adaptive Offloading
>
> The first training iteration serves as a **warmup phase** where the manager records tensor groups, their sizes, and the execution order. After warmup, a `post_warmup_callback` runs to:
So we cannot capture CUDA graphs in the first training iteration? If so, we should assert `cuda_graph_warmup_steps > 0` when offloading is enabled.
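The "apply fraction" step quoted above could look roughly like the following greedy selection (a sketch under assumed names; `activation_offload_fraction` is the only identifier taken from the quoted docs):

```python
# Hypothetical sketch of the post-warmup selection step: after the warmup
# iteration records group sizes, keep only a fraction of the total bytes
# for offloading, largest groups first. Not Megatron-LM's implementation.
def select_groups(group_sizes: dict[str, int], fraction: float) -> list[str]:
    total = sum(group_sizes.values())
    budget = total * fraction  # byte budget, e.g. activation_offload_fraction
    chosen, used = [], 0
    for name, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen

sizes = {"attn_out": 400, "mlp_act": 300, "router_logits": 100}
print(select_groups(sizes, 0.5))  # ['attn_out']
```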