Implementation of RRT-LoRA (Relaxed Recursive Transformers with layer-wise LoRA, a method from Google DeepMind) on TinyLlama.

regular transformer, with separate parameters $\Phi^l$ at each of the $L$ layers:

$$h^l = f(h^{l-1}; \Phi^l)$$
recursive version with $L$ layers and $B$ blocks, tying each layer to one of $L/B$ shared parameter sets:

$$h^l = f(h^{l-1}; \Phi^\prime_{((l-1) \bmod L/B) + 1})$$
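To make the tying concrete, here is a minimal sketch of the index mapping (the helper name and example sizes are illustrative, not part of this repo's API):

```python
def shared_index(l: int, L: int, B: int) -> int:
    """1-based index of the shared parameter set that layer l reuses."""
    layers_per_block = L // B  # number of unique (shared) layers
    return (l - 1) % layers_per_block + 1

# e.g. L = 22 layers (TinyLlama) tied into B = 2 blocks:
print([shared_index(l, L=22, B=2) for l in range(1, 23)])
# [1, 2, ..., 11, 1, 2, ..., 11] -- the 11 shared layers are looped twice
```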
the rrt relaxes the strict tying with a low-rank, layer-specific correction. for each weight matrix at each layer:

$$W^l \approx W^\prime + B^l A^l$$

- $W^\prime$ is learned shared weights
- $B^l A^l$ is position-specific LoRA (initialized via SVD)
compute residuals between original and tied weights for each position:

$$R^l = W^l - W^\prime_{((l-1) \bmod L/B) + 1}$$

get initial LoRA weights via truncated SVD:

$$U_r^l, \Sigma_r^l, V_r^l = \text{TruncatedSVD}(R^l; r)$$

$$B^l = U_r^l \Sigma_r^l, \qquad A^l = (V_r^l)^T$$
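In PyTorch this initialization looks roughly like the following (a minimal sketch for 2-D weight matrices; the function name is hypothetical):

```python
import torch

def lora_init_from_residual(W_l: torch.Tensor, W_shared: torch.Tensor, r: int):
    """SVD-initialize LoRA factors so B @ A is the best rank-r approximation
    of the residual R^l = W^l - W' (hypothetical helper, not the repo's API)."""
    R = W_l - W_shared
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    B = U[:, :r] * S[:r]  # B^l = U_r Sigma_r, shape (d_out, r)
    A = Vh[:r, :]         # A^l = (V_r)^T,     shape (r, d_in)
    return B, A
```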
during training:

- forward: $h = W^\prime x + B^l A^l x$ (see the sketch below)
- backward: update BOTH $W^\prime$ AND $B^l, A^l$
- $W^\prime$ learns the optimal shared representation
- $B^l, A^l$ learn position-specific adjustments

so the final learned mapping approximates the original per-layer weights:

$$W^l \approx W^\prime + B^l A^l$$
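A hedged sketch of one such layer (class and parameter names are illustrative, not necessarily the repo's):

```python
import torch
import torch.nn as nn

class RelaxedTiedLinear(nn.Module):
    """Shared base weight W' plus a position-specific LoRA correction B^l A^l.
    Both the shared and the LoRA parameters receive gradients."""

    def __init__(self, W_shared: nn.Parameter, B0: torch.Tensor, A0: torch.Tensor):
        super().__init__()
        self.W_shared = W_shared   # the same Parameter object is registered in
                                   # every tied layer, so it is genuinely shared
        self.B = nn.Parameter(B0)  # SVD-initialized, position-specific
        self.A = nn.Parameter(A0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W'x + B^l (A^l x)
        return x @ self.W_shared.T + (x @ self.A.T) @ self.B.T
```

Because `W_shared` is one `nn.Parameter` reused across tied positions, its gradient accumulates contributions from every layer that shares it, which is exactly the "update BOTH" behavior above.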
The method is from:

```bibtex
@inproceedings{Bae2025Relaxed,
  title={Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA},
  author={Sangmin Bae and Adam Fisch and Hrayr Harutyunyan and Ziwei Ji and Seungyeon Kim and Tal Schuster},
  booktitle={International Conference on Learning Representations},
  year={2025}
}
```

If you use this implementation in your research, please cite it as follows:
```bibtex
@software{avram_rrt_lora_2024,
  author = {Avram Djordjevic},
  title = {rrt-lora: An Implementation of Relaxed Recursive Transformers},
  year = {2024},
  url = {https://github.com/avramdj/rrt-lora}
}
```