LoRA weight scaling #1

Description

@dxqb

Very interesting paper, and thank you for publishing code with it!

I have a question related to LoRA weight scaling:
The original LoRA paper (https://arxiv.org/pdf/2106.09685) proposed scaling the LoRA update by alpha / rank.
This convention is still used in modern code, since it has been shown to roughly allow changing the rank without retuning the learning rate. See, for example, diffusers: https://github.com/huggingface/diffusers/blob/3c8b67b3711b668a6e7867e08b54280e51454eb5/src/diffusers/models/lora.py#L230
or OneTrainer, a diffusion model trainer: https://github.com/Nerogar/OneTrainer/blob/6276c512c2ff6ad50e74eb4617274ed2f44bdcee/modules/module/LoRAModule.py#L323
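For reference, a minimal sketch of that convention (the class and initialization choices here are illustrative, not taken from either repo):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a rank-r LoRA update, scaled by alpha / rank."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.kaiming_uniform_(self.down.weight)
        nn.init.zeros_(self.up.weight)  # delta starts at zero, so training starts from the base model
        # The conventional scaling: with alpha fixed, changing the rank leaves
        # the update magnitude (and hence the usable learning rate) roughly unchanged.
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```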

I wonder how this applies to your proposal to vary the effective rank by timestep.

I haven't found this scaling mentioned in your paper (aside from the note that a constant learning rate is used), but in your code you seem to have removed it:

```python
def forward(self, hidden_states, mask=None):
```

Could you address this?
Doesn't this mean that you effectively apply larger parameter updates at the higher timesteps than you would have if the weight scaling were used, because the effective rank after masking is lower there?
Parameter update size (i.e., the effective learning rate) has a very high impact on overfitting when finetuning diffusion models.
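To make the concern concrete, here is a sketch of the two variants I am comparing. The alpha / effective_rank rescaling in the second branch is my hypothetical compensation, not something proposed in your paper or implemented in your repo:

```python
import torch
import torch.nn as nn

def masked_lora_delta(x: torch.Tensor, down: nn.Linear, up: nn.Linear,
                      mask: torch.Tensor, alpha: float | None = None) -> torch.Tensor:
    """LoRA delta with a per-timestep rank mask (0/1 vector of shape (rank,)).

    With alpha=None, no scaling is applied, as in the forward() quoted above.
    Passing alpha rescales by alpha / effective_rank -- a hypothetical
    compensation, not something taken from the paper or the repo.
    """
    h = down(x) * mask                      # zero out the masked rank components
    delta = up(h)
    if alpha is not None:
        r_eff = mask.sum().clamp(min=1.0)   # effective rank after masking
        delta = delta * (alpha / r_eff)
    return delta
```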
