Description
Very interesting paper, and thank you for publishing code with it!
I have a question related to LoRA weight scaling:
The original LoRA paper (https://arxiv.org/pdf/2106.09685) proposed scaling the LoRA weights by alpha / rank.
This scaling is still used in modern code, since it has been shown to (roughly) let you change the rank without retuning the learning rate. For example, diffusers: https://github.com/huggingface/diffusers/blob/3c8b67b3711b668a6e7867e08b54280e51454eb5/src/diffusers/models/lora.py#L230
and OneTrainer, a diffusion model trainer: https://github.com/Nerogar/OneTrainer/blob/6276c512c2ff6ad50e74eb4617274ed2f44bdcee/modules/module/LoRAModule.py#L323
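For reference, here is a minimal sketch of that conventional scaling (class and parameter names are hypothetical, not the actual diffusers or OneTrainer implementation):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA layer illustrating the conventional alpha / rank scaling.

    Illustrative sketch only; names are hypothetical.
    """
    def __init__(self, in_features, out_features, rank=8, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # LoRA contribution starts at zero
        self.scale = alpha / rank       # the scaling in question

    def forward(self, x):
        # Scaling by alpha / rank keeps the magnitude of the LoRA update
        # roughly comparable as the rank changes, so the learning rate
        # does not have to be retuned when the rank is swapped.
        return self.base(x) + self.scale * self.up(self.down(x))
```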
I wonder how this applies to your proposal to vary the effective rank by timestep.
I haven't found any mention of this in the paper beyond the use of a constant learning rate, but in your code you seem to have removed this scaling:
Line 33 in ba41983: `def forward(self, hidden_states, mask=None):`
Could you address this?
Doesn't this mean that you effectively apply larger parameter updates at the higher timesteps than you would have if weight scaling were used, because the effective rank after masking is lower?
Parameter update magnitude (i.e., the effective learning rate) has a very high impact on overfitting when finetuning diffusion models.
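To make the question concrete, here is a hedged sketch of the two behaviors (function and parameter names are hypothetical, and the rank masking is my paraphrase of your timestep-dependent rank reduction, not your actual code):

```python
import torch

def lora_delta(x, down_w, up_w, mask, alpha, rescale=False):
    """LoRA contribution with a binary rank mask (hypothetical sketch).

    x:      (batch, in_features)
    down_w: (rank, in_features)
    up_w:   (out_features, rank)
    mask:   (rank,) binary vector selecting the active rank dimensions
    """
    h = (x @ down_w.T) * mask  # zero out the masked rank dimensions
    delta = h @ up_w.T
    if rescale:
        # Conventional scaling applied to the *effective* rank after masking.
        effective_rank = mask.sum().clamp(min=1)
        return (alpha / effective_rank) * delta
    # Without any scaling (as in the linked code, if I read it correctly),
    # each remaining rank direction contributes the same amount regardless
    # of how many directions the mask keeps.
    return delta
```

With `rescale=False`, nothing compensates for the change in effective rank across timesteps, which is the source of my concern above.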