Interaction with learning rate schedule

Has there been any research on how AutoClip interacts with a learning rate schedule, especially an extreme one like the one-cycle policy (super-convergence)? It seems like the history of gradient norms would be dominated by changes in the learning rate. I found the paper linked below, which touches on the subject but doesn't propose any theory behind, or solution to, the interaction between the two.

<img width="596" alt="Screen Shot 2024-01-16 at 12 06 11 PM" src="https://github.com/pseeth/autoclip/assets/6076141/a94c5322-c555-4c5f-aa12-07758b943e60">
from https://hal.science/hal-03891707v1/file/Learning_rate_scheduling_and_gradient_clipping_for_audio_source_separation.pdf

As expected, AutoClip doesn't interact well with cosine annealing.
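
To make the concern concrete, here is a minimal toy sketch in plain Python. It assumes AutoClip sets the clipping threshold to the p-th percentile of all gradient norms seen so far. Everything else is hypothetical and only for illustration: the `one_cycle_lr` schedule shape, its parameter values, and especially the made-up rule that the gradient norm grows with the learning rate. It is not a measurement from a real model and not the paper's setup.

```python
import math


def percentile(values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def one_cycle_lr(step, total_steps, max_lr=1e-2, min_lr=1e-4, warmup_frac=0.3):
    """Hypothetical one-cycle-style schedule: linear warm-up, then cosine decay."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return min_lr + (max_lr - min_lr) * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup - 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def simulate(total_steps=200, clip_percentile=10.0):
    """Return (step, lr, grad_norm, threshold) rows for a synthetic run."""
    history, rows = [], []
    for step in range(total_steps):
        lr = one_cycle_lr(step, total_steps)
        # ASSUMPTION (toy model, not real data): the gradient norm scales with lr.
        grad_norm = 1.0 + 100.0 * lr
        history.append(grad_norm)
        # AutoClip-style threshold: percentile over the full history so far.
        threshold = percentile(history, clip_percentile)
        rows.append((step, lr, grad_norm, threshold))
    return rows


if __name__ == "__main__":
    for step, lr, grad_norm, threshold in simulate()[::20]:
        print(f"step={step:3d} lr={lr:.5f} grad_norm={grad_norm:.3f} clip_at={threshold:.3f}")
```

Under this toy assumption, the threshold is a statistic of the schedule rather than of the loss landscape: it stays low during warm-up while the synthetic norms climb, so the fraction of steps that get clipped changes as the schedule moves. Whether real gradient norms track the learning rate this closely is exactly the open question here.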

Interaction with learning rate schedule #9