This project re-implements the LoRA paper (Hu et al., 2021) from scratch. I demonstrate that low-rank adaptation achieves comparable downstream performance while training well under 0.3% of the model's parameters.

    ├── 📁.github                              # CI workflows
    │   └── 📁workflows
    │       └── tests.yml
    ├── 📁checkpoints                          # Saved model checkpoints
    │   └── lora_roberta_sst2.pt
    ├── 📁experiments                          # Training scripts (RoBERTa)
    │   ├── __init__.py
    │   ├── lora_roberta_sst2_rank_ablation.py
    │   └── train_lora_roberta_sst2.py
    ├── 📁inference                            # Inference utilities
    ├── 📁inject_lora                          # LoRA injection logic
    ├── 📁lora                                 # Core LoRA implementation
    │   ├── __init__.py
    │   └── lora_linear.py
    ├── 📁results                              # Plots & experiment outputs
    ├── 📁tests                                # Unit tests
    ├── 📁utils                                # Helper utilities
    │   └── __init__.py
    ├── .gitignore
    ├── pytest.ini
    ├── requirements.txt
    └── README.md

As per the paper, the best trade-off is to apply LoRA to the attention projection matrices $W_q$ (query) and $W_v$ (value).
Let's look at an overview of the Hugging Face RoBERTa anatomy:

    RobertaModel
    └── encoder
        └── layer[i]
            └── attention
                └── self
                    ├── query : nn.Linear
                    ├── key   : nn.Linear
                    └── value : nn.Linear

We are only going to touch:

- `layer.attention.self.query`
- `layer.attention.self.value`
Adapting the remaining weight matrices contributes little additional performance in downstream task adaptation.
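Conceptually, the injection amounts to something like the following minimal sketch (this is not the repo's exact `lora/lora_linear.py` / `inject_lora/` code; init constants, argument names, and defaults are assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank update ΔW = (α/r)·B·A."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 16):
        super().__init__()
        self.base = base                          # pretrained W0 (+ bias), kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.scaling = alpha / r
        # A is Gaussian-initialized, B starts at zero, so the adapted layer is
        # exactly the pretrained layer at the start of training.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


def inject_lora(roberta, r: int = 4, alpha: int = 16):
    """Swap only the query and value projections of every self-attention block."""
    for layer in roberta.encoder.layer:
        attn = layer.attention.self
        attn.query = LoRALinear(attn.query, r=r, alpha=alpha)
        attn.value = LoRALinear(attn.value, r=r, alpha=alpha)
    return roberta
```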
- Total Params: 124,793,088
- Trainable Params: 147,456 (0.12%)
- Rank (r): 4
- Alpha (α): 16
- Sequence Length: 64
- AMP enabled (see the training-setup sketch below)
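A hedged sketch of that training setup (freezing, parameter counting, and one AMP step): `model` is assumed to be a RoBERTa classifier with LoRA already injected as sketched above, with the classification head under a `head` attribute; hyperparameters such as the learning rate are illustrative.

```python
import torch

# Freeze everything except the LoRA matrices and the classification head.
for name, p in model.named_parameters():
    p.requires_grad = ("lora_" in name) or name.startswith("head.")

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

# One mixed-precision (AMP) training step.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(batch, labels):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits = model(**batch)                     # tokenized SST-2 batch, max length 64
        loss = torch.nn.functional.cross_entropy(logits, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```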
| Epoch | Train Loss | Val Accuracy |
|---|---|---|
| 1 | 0.2850 | 93.12% |
| 2 | 0.2162 | 93.00% |
| 3 | 0.1985 | 92.66% |
The LoRA weights (A, B) and the classifier head are saved and reusable.
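One way this checkpointing could look, saving only the trainable tensors rather than the full ~500 MB model (the key prefixes `lora_` and `head.` follow the earlier sketch and are assumptions):

```python
import torch

# Save just the LoRA matrices and the classifier head (well under 1 MB).
trainable_state = {k: v.cpu() for k, v in model.state_dict().items()
                   if "lora_" in k or k.startswith("head.")}
torch.save(trainable_state, "checkpoints/lora_roberta_sst2.pt")

# Later: rebuild the frozen model with LoRA injected, then load the small checkpoint on top.
state = torch.load("checkpoints/lora_roberta_sst2.pt", map_location="cpu")
missing, unexpected = model.load_state_dict(state, strict=False)
assert not unexpected
```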
To analyze the effect of LoRA rank ( r ) on downstream performance, I conduct a rank ablation study following the methodology discussed in Section 7.2 of the LoRA paper.
The goal is to evaluate whether task-specific adaptation saturates at low rank, which would support the low intrinsic dimension hypothesis.
- Base model: RoBERTa-base
- Task: SST-2 (GLUE)
- LoRA applied to: $W_q$ and $W_v$
- Ranks evaluated: $r \in \{1, 2, 4, 8\}$
- α = 16, epochs = 3
- Sequence length = 64, AMP enabled
- Only the LoRA parameters (A, B) and the classifier head are trainable (see the loop sketch below)
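A compressed sketch of the ablation loop (the full version lives in `experiments/lora_roberta_sst2_rank_ablation.py`); `build_model`, `train`, and `evaluate` are hypothetical helpers standing in for the real training code:

```python
results = {}
for r in (1, 2, 4, 8):
    # Fresh RoBERTa-base with LoRA (rank r, alpha 16) on W_q and W_v; everything else frozen.
    model = build_model(r=r, alpha=16)
    train(model, task="sst2", epochs=3, max_length=64, amp=True)
    results[r] = evaluate(model, split="validation")

for r, acc in sorted(results.items()):
    print(f"r={r}: validation accuracy = {acc:.2%}")
```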
| Rank (r) | Validation Accuracy |
|---|---|
| 1 | 93.35% |
| 2 | 93.69% |
| 4 | 92.89% |
| 8 | 93.46% |
- Even very small ranks (r = 1, 2) achieve strong performance.
- Increasing rank beyond r ≈ 2–4 does not yield consistent improvements.
- Performance remains within a narrow band (~0.9% absolute) across all ranks.
The slight non-monotonic behavior is expected on small datasets such as SST-2 and can be attributed to training variance and noise rather than insufficient model capacity.
Overall, these results indicate that effective task adaptation can be achieved using low-rank updates, consistent with the LoRA paper’s claim that most task-relevant updates lie in a low-dimensional subspace.
One of the key advantages of LoRA is that the number of trainable parameters scales linearly with the chosen rank $r$.
For RoBERTa-base (hidden size = 768, 12 layers) with LoRA applied to $W_q$ and $W_v$, each adapted projection contributes an $A \in \mathbb{R}^{r \times 768}$ and a $B \in \mathbb{R}^{768 \times r}$, i.e. $2 \cdot 768 \cdot r$ parameters, and there are $2 \times 12 = 24$ adapted projections.
The LoRA-specific parameter count therefore scales linearly with rank: Trainable LoRA Params ≈ 36,864 × r
An additional ~1.5k parameters come from the task-specific classification head.
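These counts can be sanity-checked with a few lines of Python; the 768→2 linear head (weights plus bias = 1,538 parameters) is an assumption that matches the ~1.5k figure and the table below:

```python
hidden, layers, num_labels = 768, 12, 2

def lora_params(r: int) -> int:
    # Each adapted nn.Linear adds A (r x 768) and B (768 x r): 2 * hidden * r parameters.
    # W_q and W_v are adapted in all 12 encoder layers -> 24 adapted projections.
    return 2 * layers * (2 * hidden * r)          # = 36,864 * r

head_params = hidden * num_labels + num_labels     # simple linear head: 1,538 params (assumed)

for r in (1, 2, 4, 8):
    print(f"r={r}: {lora_params(r) + head_params:,} trainable parameters")
# r=1: 38,402   r=2: 75,266   r=4: 148,994   r=8: 296,450   (matches the table)
```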
| Rank (r) | Trainable Parameters | % of Total Params |
|---|---|---|
| 1 | 38,402 | 0.031% |
| 2 | 75,266 | 0.060% |
| 4 | 148,994 | 0.119% |
| 8 | 296,450 | 0.237% |
Note: in the table above, trainable parameters include both the LoRA adaptation matrices (A, B) and the task-specific classification head.
Despite increasing the rank by 8×, the fraction of trainable parameters remains well below 0.3% of the full model size.
Combined with the rank ablation results, this highlights that strong downstream performance can be achieved with extremely small parameter updates, reinforcing the low intrinsic dimension hypothesis behind LoRA.
After training, the low-rank LoRA updates can be merged into the frozen base weights to eliminate any inference-time overhead.
For a LoRA-adapted linear layer:
W_eff = W₀ + (α / r) · B · A
This merge is performed once, after which the model behaves as a standard Transformer without additional parameters or computation.
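A minimal sketch of how that merge might be implemented for the `LoRALinear` wrapper sketched earlier (helper names are mine, not necessarily the repo's):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(model: nn.Module) -> nn.Module:
    """Fold every LoRA update into its frozen base weight and drop the wrapper."""
    # Collect (parent, attribute, wrapper) triples first, then swap them out.
    targets = [(parent, name, child)
               for parent in model.modules()
               for name, child in parent.named_children()
               if isinstance(child, LoRALinear)]          # LoRALinear from the earlier sketch
    for parent, name, child in targets:
        # W_eff = W0 + (alpha / r) * B @ A
        child.base.weight += child.scaling * (child.lora_B @ child.lora_A)
        setattr(parent, name, child.base)                  # back to a plain nn.Linear
    return model
```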
All LoRA weights are merged prior to inference, resulting in:
- zero added latency
- no extra memory usage
- identical outputs compared to the unmerged LoRA model
Unit tests focus on LoRA-specific invariants:
- zero-init equivalence
- merge correctness
- parameter freezing
- correct injection points
To keep CI fast and deterministic, tests use lightweight dummy transformer modules instead of downloading pretrained models.
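For illustration, the first two invariants could be checked roughly like this, using the `LoRALinear` wrapper from the earlier sketch (the actual tests in `tests/` may be structured differently):

```python
import torch
import torch.nn as nn

def test_zero_init_equivalence():
    # With B initialized to zeros, the LoRA layer must match the frozen base layer exactly.
    base = nn.Linear(16, 8)
    lora = LoRALinear(base, r=4, alpha=16)
    x = torch.randn(2, 16)
    assert torch.allclose(lora(x), base(x))

def test_merge_correctness():
    # Merging (alpha/r) * B @ A into W0 must reproduce the unmerged outputs.
    base = nn.Linear(16, 8)
    lora = LoRALinear(base, r=4, alpha=16)
    with torch.no_grad():
        lora.lora_B.normal_()                    # make the low-rank update non-zero
    x = torch.randn(2, 16)
    expected = lora(x)
    with torch.no_grad():
        base.weight += lora.scaling * (lora.lora_B @ lora.lora_A)
    assert torch.allclose(base(x), expected, atol=1e-5)
```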
To run the test suite:

    pip install pytest
    pytest -s tests/