Activation Usage of Dualpipe

Hi! I’d like to clarify a couple of things about DualPipe's activation memory and bubble formulas:

Activation Usage – Why “+1”?
In 1F1B, the peak activation memory is roughly PP microbatches' worth of activations.
For DualPipe, it is listed as PP + 1.
Is the "+1" simply because the middle ranks process PP + 1 microbatches at peak time?
Or is it due to how the overlapping of forward and backward computation works?
For example, during the overlap_forward_backward phase, we first accumulate one microbatch's forward activation and then free the activation of the microbatch whose backward pass has just run. So, in terms of memory, can we think of overlap_fwd_bwd as: accumulate one microbatch of activations, then free one microbatch of activations?
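
To make the second hypothesis concrete, here is a minimal toy counter, not DualPipe's actual scheduler. It assumes a rank already holds `steady_state` microbatch activations (a hypothetical starting point, set to PP here) and then runs overlapped steps in which the forward activation is stored before the backward one is freed. The function name and parameters are made up for illustration.

```python
def peak_live_activations(steady_state: int, overlapped_steps: int,
                          forward_first: bool = True) -> int:
    """Toy model: count live microbatch activations on a single rank.

    steady_state     -- activations already held when overlapping starts (assumption)
    overlapped_steps -- number of overlapped forward/backward steps
    forward_first    -- if True, store the new forward activation before
                        freeing the one consumed by the backward pass
    """
    live = steady_state
    peak = live
    for _ in range(overlapped_steps):
        if forward_first:
            live += 1               # forward stores one new activation
            peak = max(peak, live)
            live -= 1               # backward frees one old activation
        else:
            live -= 1               # backward frees first
            live += 1               # then forward stores
            peak = max(peak, live)
    return peak


pp = 8  # hypothetical pipeline-parallel size
print(peak_live_activations(pp, overlapped_steps=4))                       # 9
print(peak_live_activations(pp, overlapped_steps=4, forward_first=False))  # 8
```

Under this toy model, "accumulate one, then free one" gives a transient peak of PP + 1 even though the count returns to PP after every step. Whether this is the real source of the "+1" is exactly what I am asking.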

<img width="794" alt="Image" src="https://github.com/user-attachments/assets/ad08b897-0cec-44b0-8642-e4a9a45f02f1" />

I would also like to understand the bubble formula for DualPipe: I noticed it includes a −3W term (where W is the weight-gradient computation).
<img width="798" alt="Image" src="https://github.com/user-attachments/assets/d17c3c6f-69a0-4e42-b09d-bfb29facf27c" />

From my understanding, if we're replacing idle time with weight computation, shouldn't it be more like (PP/2 - 1) * (Fwd + Input - Weight)? (Here Input and Weight stand for Bwd_Input and Bwd_Weight.)
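
To compare the two expressions numerically, here is a small sketch. It assumes the formula in the image is (PP/2 - 1)(F&B + B - 3W), with B the full backward time (Bwd_Input + Bwd_Weight) and F&B the time of one overlapped forward-and-backward chunk; please correct me if I am misreading it. All timings below are hypothetical, and F&B is set to F + B, i.e. no gain from overlapping.

```python
def bubble_dualpipe(pp: int, f_and_b: float, b: float, w: float) -> float:
    """Bubble as I read it from the image: (PP/2 - 1) * (F&B + B - 3W)."""
    return (pp / 2 - 1) * (f_and_b + b - 3 * w)


def bubble_proposed(pp: int, f: float, b_input: float, w: float) -> float:
    """The alternative from this question: (PP/2 - 1) * (Fwd + Input - Weight)."""
    return (pp / 2 - 1) * (f + b_input - w)


pp = 8                      # hypothetical pipeline-parallel size
f, b_input, w = 1.0, 1.0, 1.0   # hypothetical chunk times
b = b_input + w             # full backward = input grad + weight grad
f_and_b = f + b             # assumption: overlap saves nothing

print(bubble_dualpipe(pp, f_and_b, b, w))   # 6.0
print(bubble_proposed(pp, f, b_input, w))   # 3.0
```

Under these assumptions, F&B + B - 3W expands to Fwd + 2 * Input - Weight, so the two expressions differ by one extra Input term per (PP/2 - 1) factor. That extra term is what I do not understand.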

Thanks a lot for any clarification!

Activation Usage of Dualpipe #23

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Activation Usage of Dualpipe #23

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions