Mattia Malipiero
Johannes Stärk
This project investigates how Low-Rank Adaptation (LoRA) compares to partial fine-tuning for text classification tasks using transformer-based models. We focus on three guiding questions:
- Classification Performance: How does LoRA compare to partial fine-tuning in overall classification accuracy and F1-score?
- Resource Efficiency: How much GPU memory, time, and storage does LoRA save compared to partial fine-tuning?
- Error Behavior: Do misclassification patterns differ across models and fine-tuning methods?
Our hypotheses are that LoRA will achieve similar classification accuracy while requiring fewer trained parameters and less computational effort, and that different fine-tuning methods may lead to distinct misclassification patterns across models.
Sample Preview
We use the Stanford Natural Language Inference (SNLI) Corpus, a benchmark dataset for natural language understanding.
It consists of sentence pairs labeled as entailment, contradiction, or neutral, making it ideal for evaluating classification models on semantic inference.
This dataset is well-suited for testing various fine-tuning methods in multi-class text classification.
Its large size and balanced label distribution allow for robust performance comparisons across different model architectures and training strategies.
| Premise | Gold Label | Hypothesis |
|---|---|---|
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping |
| An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A black race car starts up in front of a crowd of people. | contradiction | A man is driving down a lonely road. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |
| A smiling costumed woman is holding an umbrella. | neutral | A happy woman in a fairy costume holds an umbrella. |
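The label handling behind this preview can be sketched in a few lines. SNLI encodes labels as integers (0 = entailment, 1 = neutral, 2 = contradiction) and marks pairs without annotator consensus as -1, which are conventionally filtered out before training. The snippet below operates on a tiny in-memory sample rather than the real corpus:

```python
# Sketch of SNLI label handling on a toy in-memory sample.
# The real corpus uses the same integer scheme:
# 0 = entailment, 1 = neutral, 2 = contradiction, -1 = no consensus.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

sample = [
    {"premise": "A soccer game with multiple males playing.",
     "hypothesis": "Some men are playing a sport.", "label": 0},
    {"premise": "A black race car starts up in front of a crowd.",
     "hypothesis": "A man is driving down a lonely road.", "label": 2},
    {"premise": "An ambiguous pair.", "hypothesis": "No consensus.", "label": -1},
]

# Drop unlabeled pairs and attach readable label names.
clean = [
    {**ex, "label_name": LABEL_NAMES[ex["label"]]}
    for ex in sample
    if ex["label"] != -1
]

print(len(clean))              # 2
print(clean[0]["label_name"])  # entailment
```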
EDA
- Label Distribution: The dataset contains a nearly equal number of samples for each class, ensuring balanced training and evaluation
- Text Length Histogram: Most texts range between 50–150 characters, with a peak around 100, supporting efficient tokenization and batching
- Box Plot of Text Length: The median text length is ~100 characters, with a compact interquartile range and some long-text outliers above 500 characters, guiding preprocessing decisions like truncation or padding
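The statistics behind these EDA points (median, interquartile range, long-text outliers via the usual Tukey fence) can be reproduced with the standard library. The texts below are placeholders from the preview table, not the full corpus:

```python
import statistics

# Placeholder premises; in the real EDA these come from the SNLI training split.
texts = [
    "A man inspects the uniform of a figure in some East Asian country.",
    "An older and younger man smiling.",
    "A black race car starts up in front of a crowd of people.",
    "A soccer game with multiple males playing.",
    "A smiling costumed woman is holding an umbrella.",
]

lengths = [len(t) for t in texts]

median_len = statistics.median(lengths)
q1, q2, q3 = statistics.quantiles(lengths, n=4)   # quartiles for the box plot
iqr = q3 - q1
outliers = [n for n in lengths if n > q3 + 1.5 * iqr]  # Tukey upper fence

print(median_len, q1, q3, outliers)
```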
This project investigates parameter-efficient fine-tuning strategies for transformer-based text classification, focusing on Low-Rank Adaptation (LoRA) and partial fine-tuning.
The aim is to evaluate how these lightweight approaches can efficiently adapt pretrained encoder models, such as BERT and DistilBERT, to a specific classification task while minimizing computational and memory requirements.
The underlying assumption is that large pretrained models already capture strong general linguistic knowledge, and effective task-specific adaptation can be achieved by updating only a small subset of parameters.
- LoRA introduces small trainable matrices within the model’s attention layers, allowing adaptation without modifying the core pretrained weights
- Partial fine-tuning selectively unfreezes only certain layers, such as the final encoder block or classification head, enabling limited but targeted learning
Both methods are designed to reduce training time and resource consumption while maintaining high performance.
The hypothesis is that LoRA and partial fine-tuning will deliver similar classification accuracy and generalization to full fine-tuning, making them effective and scalable alternatives for fine-tuning large models in constrained computational environments.
BERT - partially fine-tuned
We fine-tuned the model by unfreezing the last two layers of BERT, allowing them to update during training.
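The freezing pattern can be illustrated with plain name matching. The parameter names below mimic Hugging Face BERT conventions (`encoder.layer.<i>...`), but the snippet operates on a mock list rather than a real model, so it is a sketch of the pattern rather than our exact training code:

```python
# Mock parameter names mimicking Hugging Face BERT (bert-base has layers 0-11).
param_names = (
    [f"encoder.layer.{i}.attention.self.query.weight" for i in range(12)]
    + ["pooler.dense.weight", "classifier.weight"]
)

TRAINABLE_LAYERS = {10, 11}  # unfreeze only the last two encoder blocks

def is_trainable(name: str) -> bool:
    """Last two encoder layers plus the classification head stay trainable."""
    if name.startswith("encoder.layer."):
        layer_idx = int(name.split(".")[2])
        return layer_idx in TRAINABLE_LAYERS
    return True  # pooler / classifier head

# On a real model this flag would be assigned to param.requires_grad.
trainable = [n for n in param_names if is_trainable(n)]
print(trainable)
```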
Training:
Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 18, learning rate of 3e-5, dropout rate of 0.5, and weight decay of 0.1.
The classification model achieved an overall accuracy of 84.98% with a test loss of 0.4151, demonstrating strong performance across all three classes. Precision and recall scores were consistently high, especially for the contradiction class (F1-score: 0.88), while entailment and neutral also showed solid metrics (F1-scores: 0.86 and 0.81 respectively). Despite this, around 15% of the test set—roughly 16,500 samples—were misclassified, indicating room for further optimization.
BERT - LoRA
We fine-tuned the model by applying LoRA, a parameter-efficient fine-tuning (PEFT) method known for its computational efficiency, since it trains far fewer parameters.
Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 6, learning rate of 1e-4, dropout rate of 0.5, and weight decay of 0.1.
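The mechanics of a LoRA layer can be sketched in pure Python: the frozen weight W0 is augmented with a low-rank update scaled by alpha/r, and because B is initialized to zero the adapted layer initially reproduces the pretrained output exactly. Dimensions here are toy values, not our actual configuration:

```python
def matvec(M, x):
    """Plain matrix-vector product for small toy matrices."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

d, r, alpha = 4, 2, 8  # toy dimensions, not our real hidden size / rank
W0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.1] * d for _ in range(r)]  # trainable; small random init in practice
B = [[0.0] * r for _ in range(d)]  # trainable; zero init -> no initial change

def lora_forward(x):
    base = matvec(W0, x)             # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank path: B @ (A @ x)
    scale = alpha / r
    return [b + scale * dlt for b, dlt in zip(base, delta)]

x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x) == matvec(W0, x))  # True: identical before training
```

Only A and B receive gradients, which is where the trainable-parameter savings come from: 2·d·r values per adapted projection instead of d·d.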
The classification model achieved an overall accuracy of 85.20% with a test loss of 0.4338, demonstrating strong performance across all three classes. Precision and recall scores were consistently high, especially for the contradiction class (F1-score: 0.88), while entailment and neutral also showed solid metrics (F1-scores: 0.87 and 0.81 respectively). Despite this, around 14.8% of the test set—roughly 16,262 samples—were misclassified, indicating room for further optimization.
DistilBERT - partially fine-tuned
We fine-tuned the model by unfreezing the last two layers of DistilBERT, allowing them to update during training.
Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 13, learning rate of 5e-5, dropout rate of 0.5, and weight decay of 0.1.
The classification model achieved an overall accuracy of 81.12% with a test loss of 0.4828, showing solid performance across all three classes. Precision and recall scores were strong for entailment (F1-score: 0.84) and contradiction (F1-score: 0.83), while neutral performed slightly lower (F1-score: 0.77). A total of 20,742 samples were misclassified, accounting for 18.9% of the test set, indicating potential for further refinement.
DistilBERT - LoRA
We fine-tuned the model by applying LoRA, a parameter-efficient fine-tuning (PEFT) method known for its computational efficiency, since it trains far fewer parameters.
Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 8, learning rate of 1e-4, dropout rate of 0.5, and weight decay of 0.1.
The classification model reached an overall accuracy of 83.48% with a test loss of 0.4479, indicating strong and consistent performance. F1-scores were high for entailment and contradiction (both 0.86), while neutral maintained a respectable score of 0.79. A total of 18,151 samples were misclassified, representing 16.5% of the test set, suggesting solid generalization with room for further tuning.
Misclassifications
- Same top words across models: All four methods misclassify the same core set of words — man, woman, people, wearing, and shirt/young
- Stable ranking: The relative order of misclassifications is nearly identical (e.g., man always #1, woman always #2)
- Human-related bias: The most misclassified words are generic, high-frequency human-appearance terms, suggesting shared difficulty rather than model-specific issues
- Magnitude varies, pattern doesn’t: Raw counts differ (highest in distil_partial), but the relative scale is consistent across methods
- Across all models, the neutral class is the dominant source of misclassifications.
- In the true label distribution, neutral accounts for roughly half of all mistakes.
- In the predicted label distribution, all models also over-predict neutral.
- This confirms that neutral is the most ambiguous and error-prone category.
- DistilBERT LoRA shows the highest share of neutral misclassifications and a stronger bias toward predicting neutral overall
- BERT models (LoRA and partial) distribute errors more evenly and show slightly better discrimination between entailment and contradiction
- Boxplots of predicted-minus-true probabilities show that all models exhibit large probability gaps in their misclassifications
- Neutral again has the widest spread, indicating inconsistent confidence levels
- DistilBERT LoRA tends to produce the largest overconfidence gaps, while BERT Partial is more stable
- All models share similar weaknesses:
- They confuse neutral examples most frequently
- Their probability outputs reveal overconfidence even in wrong predictions
- LoRA and partial fine-tuning yield comparable misclassification behavior, though BERT variants perform slightly more consistently than DistilBERT ones
- Fine-tuning method affects the extent of misclassification but not the type — the neutral class remains the primary challenge across all setups
Compute
LoRA is presented in the literature as a parameter-efficient fine-tuning method that significantly reduces memory and computational requirements by training only low-rank adapter matrices while keeping the pretrained weights frozen. These claims originate primarily from experiments on very large models such as GPT-3 (175B parameters). We hypothesized that LoRA would demonstrate similar efficiency gains over partial fine-tuning in our experimental setting.
To evaluate this, we tracked peak GPU memory allocation, total training time, and throughput (samples per second) across all runs, using a fixed batch size of 8 for fair comparison.
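The throughput bookkeeping is simple wall-clock timing; a minimal sketch follows, with a dummy step function standing in for the real forward/backward/optimizer step. In our actual runs, peak GPU memory was read separately via `torch.cuda.max_memory_allocated()` after training:

```python
import time

def measure_throughput(step_fn, batch_size, num_steps):
    """Run `num_steps` training steps and return (elapsed_s, samples_per_s)."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()  # in a real run: forward, backward, optimizer.step()
    elapsed = time.perf_counter() - start
    return elapsed, (batch_size * num_steps) / elapsed

# Dummy step standing in for an actual training step.
elapsed, throughput = measure_throughput(
    lambda: sum(range(1000)), batch_size=8, num_steps=50
)
print(f"{elapsed:.4f}s, {throughput:.0f} samples/s")
```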
| Model | Method | Peak Memory (GB) | Training Time (s) | Throughput (samples/s) |
|---|---|---|---|---|
| BERT (110M) | Partial FT | 0.89 | 2,566 | 450 |
| BERT (110M) | LoRA | 1.57 | 4,270 | 270 |
| DistilBERT (66M) | Partial FT | 0.64 | 1,347 | 856 |
| DistilBERT (66M) | LoRA | 0.87 | 2,238 | 516 |
| DeBERTa-XXL (1.5B)* | Partial FT | 10.80 | 14,454 | 27 |
| DeBERTa-XXL (1.5B)* | LoRA | 19.66 | 21,695 | 18 |
*DeBERTa-XXL was trained for only 1 epoch due to computational constraints; its performance metrics are therefore not comparable.
Contrary to expectations, LoRA consumed more memory and was slower than partial fine-tuning across all model sizes tested. On BERT, LoRA used 1.77× more memory and was 1.66× slower. Notably, this pattern persisted even at 1.5B parameters: DeBERTa-XXL with LoRA used 1.82× more memory and was 1.50× slower than partial fine-tuning.
GPU memory during training consists of model weights, optimizer states, activations (intermediate outputs stored for backpropagation), and gradients. While LoRA trains far fewer parameters and thus requires smaller optimizer states, this advantage appears to be offset by other factors at BERT-scale models.
A likely explanation involves activation storage. In partial fine-tuning, early layers are frozen with requires_grad=False, which may allow PyTorch to discard their activations after the forward pass since they are not needed for gradient computation. In contrast, LoRA inserts trainable adapters throughout the model—even though the base weights are frozen, the computational graph must still flow through these layers to reach the adapters, potentially requiring activation retention across more layers.
This interpretation is supported by Zhang et al. (2023), who observe that LoRA "still requires expensive activation memory to update low-rank weights" and propose LoRA-FA to address this limitation (arXiv:2308.03303). However, the exact memory dynamics depend on implementation details and may vary across frameworks and configurations.
The observed slowdown with LoRA (1.66× on BERT) likely stems from the additional computations introduced by the adapter architecture. Each LoRA-adapted layer performs extra matrix operations for the low-rank decomposition. While each operation is small, they accumulate across all adapted layers and training steps.
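A rough count of the extra multiply-accumulates makes this concrete (assumed dimensions: BERT-base hidden size 768, LoRA rank 8 on a single d×d projection). Notably, the per-projection FLOP overhead is small, which suggests the measured slowdown is driven more by the additional kernel launches and autograd bookkeeping per adapted layer than by raw arithmetic:

```python
d, r = 768, 8  # BERT-base hidden size; assumed LoRA rank

base_ops = d * d      # W0 @ x: one dense projection per token
lora_ops = 2 * d * r  # low-rank path: (A @ x), then B @ (A @ x)

overhead = lora_ops / base_ops
print(f"extra ops per adapted projection: {overhead:.1%}")  # ~2.1%
```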
LoRA's efficiency benefits are typically demonstrated on very large models (7B+ parameters), where optimizer state savings become substantial. At BERT scale (110M parameters), the optimizer state difference between methods is relatively small, and may be outweighed by activation-related overhead.
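The optimizer-state argument can be quantified with back-of-the-envelope numbers. Assumptions: AdamW keeps two fp32 moment estimates per trainable parameter; LoRA applies rank-8 adapters to the query and value projections of all 12 BERT-base layers; partial fine-tuning trains the last two encoder blocks (weight matrices only, biases and the classification head ignored):

```python
d, r, layers = 768, 8, 12  # BERT-base hidden size; assumed LoRA rank

# LoRA: rank-r adapters on query and value projections of every layer.
lora_trainable = layers * 2 * (2 * d * r)  # ~0.29M parameters

# Partial FT: the last two encoder blocks (attention + feed-forward weights).
per_block = 4 * d * d + 2 * d * (4 * d)
partial_trainable = 2 * per_block          # ~14.2M parameters

# AdamW: two fp32 moment estimates per trainable parameter.
bytes_per_param = 2 * 4
lora_opt_mb = lora_trainable * bytes_per_param / 2**20
partial_opt_mb = partial_trainable * bytes_per_param / 2**20

print(f"LoRA optimizer states:    {lora_opt_mb:.1f} MB")
print(f"Partial optimizer states: {partial_opt_mb:.1f} MB")
```

Under these assumptions the saving is on the order of 100 MB, which is small next to the roughly 0.7 GB memory gap we measured on BERT, consistent with activation storage dominating at this scale.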
To explore whether LoRA becomes more efficient at larger scales, we ran experiments on DeBERTa-v2-XXLarge (1.5B parameters). Even at this scale, LoRA used 19.7 GB compared to 10.8 GB for partial fine-tuning. Training was limited to one epoch due to cost (each epoch took 6+ hours on an NVIDIA H100), so accuracy results are not comparable. However, the resource usage pattern suggests that the efficiency crossover point for LoRA may require models larger than 1.5B parameters—potentially in the 7B+ range where the original LoRA experiments were conducted.
This study compared LoRA and partial fine-tuning on BERT and DistilBERT for natural language inference. All configurations achieved 81–85% accuracy, with no meaningful performance difference between fine-tuning methods. Error analysis showed consistent patterns across models—the neutral class dominated misclassifications, and errors correlated with dataset ambiguity rather than method choice.
Contrary to expectations, LoRA used 1.77× more memory and was 1.66× slower than partial fine-tuning on BERT. This pattern held even at 1.5B parameters (DeBERTa-XXL), where LoRA still used 1.82× more memory. A likely explanation is that LoRA's activation memory overhead outweighs its optimizer state savings at these model scales. For encoder models up to at least 1.5B parameters, partial fine-tuning appears to offer a better resource-accuracy tradeoff.
The original LoRA paper (Hu et al., 2021) demonstrated memory savings on GPT-3 175B, but our results suggest these benefits may not transfer to smaller encoder models. This aligns with observations from Zhang et al. (2023), who note that LoRA "still requires expensive activation memory" and propose LoRA-FA to address this overhead.
The main takeaway from this work is that fewer trainable parameters do not automatically mean lower memory use or faster training. We also observed that all our models made similar types of errors regardless of fine-tuning method, suggesting that for a task like NLI the choice between LoRA and partial fine-tuning matters less than expected.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685
Zhang, L., Zhang, L., Shi, S., Chu, X., & Li, B. (2023). LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning. arXiv preprint arXiv:2308.03303. https://arxiv.org/abs/2308.03303