Using Torch Autograd ctx to Optimize Memory Leaking Issue #202

Mr-Philo · 2025-05-22T05:58:07Z

See Issue #201

This PR proposes a potential fix for the memory leak issue that occurs when using the MS-AMP custom GeMM.

Currently, the custom GeMM function uses the ctx object to save the input tensor x and the weight tensor W, both of which are needed to compute gradients in the backward pass. They are stored by direct attribute assignment, for example ctx.input_fp8. However, input_fp8 is a ScalingTensor, and in practice saving it this way does not fully realize the memory advantage of FP8 tensors.

Instead, I suggest using ctx.save_for_backward(), which is designed specifically for better memory management of tensors needed in backward. This means changing the saved context from a ScalingTensor to a torch.Tensor plus a ScalingMeta. The measurements below show that this saves memory.
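The pattern described above can be sketched roughly as follows. This is a minimal illustration and not the MS-AMP implementation: `ToyMeta`, `toy_quantize`, `toy_dequantize`, and `ToyLinear` are hypothetical names, and float16 stands in for FP8 so that the example runs on CPU. Only the split between "plain tensor payload via `ctx.save_for_backward`" and "non-tensor metadata as a ctx attribute" is the point.

```python
from dataclasses import dataclass

import torch


@dataclass
class ToyMeta:
    """Hypothetical stand-in for scaling metadata (not the MS-AMP ScalingMeta)."""
    scale: float
    dtype: torch.dtype


def toy_quantize(x: torch.Tensor):
    """Scale x into a low-precision payload; float16 stands in for FP8 here."""
    amax = x.detach().abs().max().clamp(min=1e-12).item()
    scale = 1.0 / amax
    payload = (x.detach() * scale).to(torch.float16)
    return payload, ToyMeta(scale=scale, dtype=x.dtype)


def toy_dequantize(payload: torch.Tensor, meta: ToyMeta) -> torch.Tensor:
    """Undo the scaling and return a tensor in the original dtype."""
    return payload.to(meta.dtype) / meta.scale


class ToyLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        x_q, x_meta = toy_quantize(x)
        w_q, w_meta = toy_quantize(weight)
        # Plain tensors go through save_for_backward so autograd manages their lifetime.
        ctx.save_for_backward(x_q, w_q)
        # Small non-tensor metadata can stay as ordinary attributes.
        ctx.x_meta, ctx.w_meta = x_meta, w_meta
        return toy_dequantize(x_q, x_meta) @ toy_dequantize(w_q, w_meta).t()

    @staticmethod
    def backward(ctx, grad_out):
        x_q, w_q = ctx.saved_tensors
        x = toy_dequantize(x_q, ctx.x_meta)
        w = toy_dequantize(w_q, ctx.w_meta)
        grad_x = grad_out @ w        # shape (N, in_features)
        grad_w = grad_out.t() @ x    # shape (out_features, in_features)
        return grad_x, grad_w


torch.manual_seed(0)
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(3, 8, requires_grad=True)
out = ToyLinear.apply(x, w)
out.sum().backward()
print(out.shape, x.grad.shape, w.grad.shape)
```

The saved payloads here are ordinary `torch.Tensor` objects, so autograd can release them once backward has run, while the metadata object carries only a scale and a dtype.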

Effect for deit-base (86M) model training, batch size 256:

scheme	improvement	Mem after forward	Mem after backward	Max mem	Throughput	One epoch time
FP16	×	18774.96MB	1535.79MB	19242.61MB	~14974.5128 (12708.2790)	02:12
FP8 O2	×	15696.38MB	3964.60MB	19298.19MB	~13673.2756 (11722.0941)	02:15
FP8 O2	GEMM mem optimization	15687.63MB	761.23MB	16245.64MB	~13650.5041 (11671.5361)	02:17

Effect for deit 570M model training, batch size 256:

scheme	improvement	Mem after forward	Mem after backward	Max mem	Throughput	One epoch time
FP16	×	72189.91MB	8977.01MB	72946.86MB	~3379.1181 (3372.2263)	06:26
FP8 O2	×	58077.87MB	15488.33MB	70747.64MB	~3615.4217 (3586.6655)	06:08
FP8 O2	GEMM mem optimization	58079.98MB	3562.89MB	59507.70MB	~3332.3433 (3288.6809)	06:10
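To make the comparison easier to read, the relative savings of the optimized FP8 O2 run over the unoptimized FP8 O2 run can be computed directly from the two tables above. The short script below only re-derives percentages from the reported numbers (the dictionary layout and the helper name `reduction_pct` are my own). It gives a peak-memory reduction of roughly 16% for both models and a post-backward memory reduction of roughly 77% to 81%.

```python
# (Max mem, Mem after backward) in MB for the FP8 O2 rows, copied from the tables above.
results = {
    "deit-base (86M)": {"baseline": (19298.19, 3964.60), "optimized": (16245.64, 761.23)},
    "deit 570M": {"baseline": (70747.64, 15488.33), "optimized": (59507.70, 3562.89)},
}


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


summary = {}
for model, rows in results.items():
    base_max, base_bwd = rows["baseline"]
    opt_max, opt_bwd = rows["optimized"]
    summary[model] = {
        "max_mem": reduction_pct(base_max, opt_max),
        "after_backward": reduction_pct(base_bwd, opt_bwd),
    }
    print(
        f"{model}: max mem -{summary[model]['max_mem']:.1f}%, "
        f"mem after backward -{summary[model]['after_backward']:.1f}%"
    )
```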

Mr-Philo · 2025-05-22T06:02:23Z

@Mr-Philo please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"

using torch autograd ctx to optimize memory leaking issue

31afc86
