Description
Hi! I’d like to clarify a couple of things about DualPipe's activation memory and bubble formulas:
Activation Usage – Why “+1”?
In 1F1B, the peak activation memory is roughly PP microbatches' worth of activations.
But for DualPipe, it’s said to be PP + 1.
Is the "+1" simply because middle ranks process PP + 1 microbatches at peak time?
Or is it due to how the overlapping of forward and backward computation works?
For example, during the overlap_forward_backward phase, we first store the activation of a new forward microbatch and only then free the activation used by the corresponding backward pass. So in terms of memory, can we think of overlap_fwd_bwd as "allocate one microbatch of activations, then free one microbatch"?
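To make the question concrete, here is a toy sketch of the accounting I have in mind. It is not the real DualPipe schedule: `peak_in_flight`, the warmup count, and the step count are all hypothetical, and the only thing it models is whether the forward allocation happens before or after the backward free within one steady-state step.

```python
def peak_in_flight(warmup, steady_steps, alloc_before_free):
    """Peak number of live microbatch activations on one rank (toy model).

    warmup: activations already held when the steady state begins (assumed).
    steady_steps: number of steady-state steps, each doing one forward
        and one backward.
    alloc_before_free: True if the forward stores its activation before
        the backward releases one (my guess for the overlapped phase).
    """
    live = warmup
    peak = live
    for _ in range(steady_steps):
        if alloc_before_free:
            live += 1               # forward stores a new activation
            peak = max(peak, live)  # both are alive at the same time
            live -= 1               # backward releases an old activation
        else:
            live -= 1               # backward releases first
            live += 1               # then forward stores a new one
            peak = max(peak, live)
    return peak


PP = 8  # hypothetical pipeline size
print(peak_in_flight(PP, 10, alloc_before_free=False))  # 8, i.e. PP
print(peak_in_flight(PP, 10, alloc_before_free=True))   # 9, i.e. PP + 1
```

If this picture is right, the "+1" would come purely from the ordering inside the overlapped step, not from an extra microbatch being scheduled.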
I'd also like to understand the bubble formula for DualPipe: I noticed it includes a −3W term (where W is the weight-gradient computation time).

From my understanding, if we're replacing idle time with weight computation, shouldn't it be more like (PP/2 - 1) * (Fwd + Input - Weight), where Input and Weight are the times for Bwd_Input and Bwd_Weight?
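Written out side by side, the comparison I'm trying to make is roughly the following. I'm assuming the README's notation here: F is one forward chunk, B is one full backward chunk, W is the weight-gradient part of the backward, and F&B is one overlapped forward-plus-backward chunk. The symbol I (input-gradient part of the backward, so B = I + W) is my own shorthand, not the README's.

```latex
% README form, as I read it
\text{bubble}_{\text{DualPipe}} = \left(\tfrac{PP}{2} - 1\right)\,(F\&B + B - 3W)

% My guess, rewritten with B = I + W
\text{bubble}_{\text{guess}} = \left(\tfrac{PP}{2} - 1\right)\,(F + I - W)
                             = \left(\tfrac{PP}{2} - 1\right)\,(F + B - 2W)
```

Under these assumptions the two expressions agree only when F&B = F + W, so my question is really where the F&B term and the extra −W come from.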
Thanks a lot for any clarification!