Has there been any research on how this strategy interacts with a learning rate schedule, especially something as extreme as the one-cycle policy (super convergence)? It seems like the history of the scale of the gradient would be dominated by changes in the learning rate. I found this paper that touches on the subject, but it doesn't propose any theory behind the interaction between the two, or any solution to it.
from https://hal.science/hal-03891707v1/file/Learning_rate_scheduling_and_gradient_clipping_for_audio_source_separation.pdf
As expected, AutoClip doesn't interact well with cosine annealing
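
To make the concern concrete, here is a minimal sketch, assuming the strategy sets the clipping threshold to a fixed percentile of all gradient norms seen so far. The function names, the 10th-percentile default, and the norm trace are hypothetical and made up for illustration; the abrupt 10x jump just stands in for whatever shift in gradient scale a schedule might induce. It is not taken from this repository or from the paper.

```python
import numpy as np

def percentile_clip_threshold(norm_history, percentile=10.0):
    """Threshold = the given percentile of every gradient norm seen so far."""
    return float(np.percentile(norm_history, percentile))

def clip_gradient(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = float(np.linalg.norm(grad))
    if norm > threshold and norm > 0.0:
        return grad * (threshold / norm)
    return grad

def simulate_thresholds(norms, percentile=10.0):
    """Replay a trace of gradient norms and record the threshold at each step."""
    history, thresholds = [], []
    for n in norms:
        history.append(n)
        thresholds.append(percentile_clip_threshold(history, percentile))
    return thresholds

# Synthetic trace (an assumption, not measured data):
# 100 steps with norm 1.0, then 100 steps with norm 10.0.
synthetic_norms = [1.0] * 100 + [10.0] * 100
thresholds = simulate_thresholds(synthetic_norms, percentile=10.0)

# Threshold at the end of each phase.
print(thresholds[99], thresholds[-1])
```

In this toy trace the threshold is still set by the first phase at the end of the second one, so every later gradient would be scaled down by roughly 10x. Whether a real schedule shifts gradient norms this way is exactly the open question; a windowed or decayed history would be one thing to try, but that is a guess rather than something tested here.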