Interaction with learning rate schedule #9

@Permafacture

Has there been any research on how this strategy interacts with a learning rate schedule, especially an extreme one like the one-cycle policy (super-convergence)? It seems like the history of gradient norms would be dominated by changes in the learning rate. I found this paper that touches on the subject, but it doesn't propose any theory behind the interaction between the two, or a solution to it.

[Screenshot from https://hal.science/hal-03891707v1/file/Learning_rate_scheduling_and_gradient_clipping_for_audio_source_separation.pdf]

As expected, AutoClip doesn't interact well with cosine annealing
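
To make the concern concrete, here is a minimal sketch of the mechanism in question, assuming AutoClip sets the clipping threshold to a fixed percentile of every gradient norm observed so far. The names `AutoClipSketch`, `percentile`, and `clip_scale` are hypothetical, and the norm sequence is synthetic (not a measurement from any run); it only illustrates how an early regime of small norms can keep the threshold low after the gradient scale shifts.

```python
import math


def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty history")
    rank = (len(ordered) - 1) * p / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


class AutoClipSketch:
    """Hypothetical stand-in: threshold = percentile of the full norm history."""

    def __init__(self, clip_percentile=10.0):
        self.clip_percentile = clip_percentile
        self.history = []

    def threshold(self, grad_norm):
        # Record the current norm, then recompute the threshold from ALL
        # norms seen so far (no windowing, no awareness of the schedule).
        self.history.append(grad_norm)
        return percentile(self.history, self.clip_percentile)


def clip_scale(grad_norm, threshold):
    """Factor applied to the gradient so that its norm is at most threshold."""
    if grad_norm <= 0:
        return 1.0
    return min(1.0, threshold / grad_norm)


# Synthetic example (assumption): 100 steps with small norms, followed by
# 100 steps where the norms are ten times larger.
norms = [0.1] * 100 + [1.0] * 100
clipper = AutoClipSketch(clip_percentile=10.0)
for norm in norms:
    current_threshold = clipper.threshold(norm)
    scale = clip_scale(norm, current_threshold)

print(current_threshold, scale)
```

In this toy sequence the threshold is still set by the early small norms at the end, so every step in the second phase is scaled down by roughly a factor of ten. Whether a real schedule produces a shift like this is exactly the open question.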
