Description
Very interesting paper, and thank you for publishing code with it!
I have a question related to LoRA weight scaling:
The original LoRA paper (https://arxiv.org/pdf/2106.09685) proposed scaling the LoRA weights by alpha / rank.
This scaling is still used in modern code, since it has been shown to (roughly) let you change the rank without retuning the learning rate. For example, diffusers: https://github.com/huggingface/diffusers/blob/3c8b67b3711b668a6e7867e08b54280e51454eb5/src/diffusers/models/lora.py#L230
and OneTrainer, a diffusion model trainer: https://github.com/Nerogar/OneTrainer/blob/6276c512c2ff6ad50e74eb4617274ed2f44bdcee/modules/module/LoRAModule.py#L323
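For reference, here is a minimal sketch of that conventional scaling (class and parameter names are hypothetical, not the actual diffusers or OneTrainer implementation):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA layer illustrating the conventional alpha / rank scaling.

    Illustrative sketch only; names are hypothetical.
    """
    def __init__(self, in_features, out_features, rank=8, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # LoRA contribution starts at zero
        self.scale = alpha / rank       # the scaling in question

    def forward(self, x):
        # Scaling by alpha / rank keeps the magnitude of the LoRA update
        # roughly comparable as the rank changes, so the learning rate
        # does not have to be retuned when the rank is swapped.
        return self.base(x) + self.scale * self.up(self.down(x))
```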
I wonder how this applies to your proposal to vary the effective rank by timestep.
I haven't found any mention of this in the paper beyond the use of a constant learning rate, but in your code you seem to have removed this scaling:
Line 33 in ba41983: `def forward(self, hidden_states, mask=None):`
Could you address this?
Doesn't this mean that you effectively apply larger parameter updates at the higher timesteps than you would have if weight scaling were used, because the effective rank after masking is lower?
Parameter update magnitude (i.e., the effective learning rate) has a very high impact on overfitting when finetuning diffusion models.
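To make the question concrete, here is a hedged sketch of the two behaviors (function and parameter names are hypothetical, and the rank masking is my paraphrase of your timestep-dependent rank reduction, not your actual code):

```python
import torch

def lora_delta(x, down_w, up_w, mask, alpha, rescale=False):
    """LoRA contribution with a binary rank mask (hypothetical sketch).

    x:      (batch, in_features)
    down_w: (rank, in_features)
    up_w:   (out_features, rank)
    mask:   (rank,) binary vector selecting the active rank dimensions
    """
    h = (x @ down_w.T) * mask  # zero out the masked rank dimensions
    delta = h @ up_w.T
    if rescale:
        # Conventional scaling applied to the *effective* rank after masking.
        effective_rank = mask.sum().clamp(min=1)
        return (alpha / effective_rank) * delta
    # Without any scaling (as in the linked code, if I read it correctly),
    # each remaining rank direction contributes the same amount regardless
    # of how many directions the mask keeps.
    return delta
```

With `rescale=False`, nothing compensates for the change in effective rank across timesteps, which is the source of my concern above.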