Comparison of partially fine-tuned vs. LoRA fine-tuned BERT and DistilBERT Models

Mattia Malipiero

Johannes Stärk

Central Problem & Domain

This project investigates how Low-Rank Adaptation (LoRA) compares to partial fine-tuning for text classification tasks using transformer-based models. We focus on three guiding questions:

Classification Performance:
How does LoRA compare to partial fine-tuning in overall classification accuracy and F1-score?
Resource Efficiency:
How much GPU memory, time, and storage does LoRA save compared to partial fine-tuning?
Error Behavior:
Do misclassification patterns differ across models and fine-tuning methods?

Our hypotheses are that LoRA will achieve similar classification accuracy while requiring fewer trained parameters and less computational effort, and that different fine-tuning methods may lead to distinct misclassification patterns across models.

Dataset

Sample Preview

We use the Stanford Natural Language Inference (SNLI) Corpus, a benchmark dataset for natural language understanding.
It consists of sentence pairs labeled as entailment, contradiction, or neutral, making it ideal for evaluating classification models on semantic inference.

This dataset is well-suited for testing various fine-tuning methods in multi-class text classification.
Its large size and balanced label distribution allow for robust performance comparisons across different model architectures and training strategies.

Text	Judgments	Hypothesis
A man inspects the uniform of a figure in some East Asian country.	contradiction	The man is sleeping
An older and younger man smiling.	neutral	Two men are smiling and laughing at the cats playing on the floor.
A black race car starts up in front of a crowd of people.	contradiction	A man is driving down a lonely road.
A soccer game with multiple males playing.	entailment	Some men are playing a sport.
A smiling costumed woman is holding an umbrella.	neutral	A happy woman in a fairy costume holds an umbrella.

EDA

Label Distribution: The dataset contains a nearly equal number of samples for each class, ensuring balanced training and evaluation
Text Length Histogram: Most texts range between 50–150 characters, with a peak around 100, supporting efficient tokenization and batching
Box Plot of Text Length: The median text length is ~100 characters, with a compact interquartile range and some long-text outliers above 500 characters, guiding preprocessing decisions like truncation or padding
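The box-plot quantities mentioned above (median, quartiles, outlier fence) can be computed with the standard library alone. The sketch below is illustrative only: it uses the five premises from the sample preview rather than the full corpus, so its numbers will not match the EDA figures, and the helper name `length_summary` is hypothetical.

```python
import statistics

# Premises copied from the sample preview table (not the full dataset).
premises = [
    "A man inspects the uniform of a figure in some East Asian country.",
    "An older and younger man smiling.",
    "A black race car starts up in front of a crowd of people.",
    "A soccer game with multiple males playing.",
    "A smiling costumed woman is holding an umbrella.",
]

def length_summary(texts):
    """Summarize character lengths the way a box plot does."""
    lengths = sorted(len(t) for t in texts)
    q1, median, q3 = statistics.quantiles(lengths, n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)  # common whisker rule for flagging long outliers
    return {
        "min": lengths[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": lengths[-1],
        "upper_fence": upper_fence,
    }

summary = length_summary(premises)
print(summary)
```

Texts longer than `upper_fence` would be candidates for truncation when choosing a maximum sequence length.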

Approach

This project investigates parameter-efficient fine-tuning strategies for transformer-based text classification, focusing on Low-Rank Adaptation (LoRA) and partial fine-tuning.
The aim is to evaluate how these lightweight approaches can efficiently adapt pretrained encoder models, such as BERT and DistilBERT, to a specific classification task while minimizing computational and memory requirements.

The underlying assumption is that large pretrained models already capture strong general linguistic knowledge, and effective task-specific adaptation can be achieved by updating only a small subset of parameters.

Fine-Tuning Strategies

LoRA introduces small trainable matrices within the model’s attention layers, allowing adaptation without modifying the core pretrained weights
Partial fine-tuning selectively unfreezes only certain layers, such as the final encoder block or classification head, enabling limited but targeted learning

Both methods are designed to reduce training time and resource consumption while maintaining high performance.
The hypothesis is that LoRA and partial fine-tuning will deliver classification accuracy and generalization similar to those of full fine-tuning, making them effective and scalable alternatives for fine-tuning large models in constrained computational environments.
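To make the difference in trainable parameters concrete, the following sketch counts them for both strategies. It assumes the public BERT-base dimensions (hidden size 768, feed-forward size 3072, 12 layers) and, as a purely hypothetical LoRA setting, rank 8 on the query and value projections; the rank and target modules actually used in this project are not stated here. The classification head is left out on both sides.

```python
# Assumed BERT-base dimensions (public architecture, not read from this repo's configs).
HIDDEN = 768
INTERMEDIATE = 3072
NUM_LAYERS = 12

def encoder_layer_params(hidden=HIDDEN, intermediate=INTERMEDIATE):
    """Weights and biases of one transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)              # two LayerNorms, scale and shift each
    return attention + feed_forward + layer_norms

def lora_params(rank, num_layers=NUM_LAYERS, hidden=HIDDEN, adapted_per_layer=2):
    """Adapter parameters when some square projections per layer get a low-rank update."""
    per_projection = rank * (hidden + hidden)   # A is (rank x hidden), B is (hidden x rank)
    return num_layers * adapted_per_layer * per_projection

partial_ft_trainable = 2 * encoder_layer_params()  # last two encoder layers unfrozen
lora_trainable = lora_params(rank=8)               # hypothetical rank and targets

print(f"partial fine-tuning: {partial_ft_trainable:,} trainable parameters")
print(f"LoRA (r=8, q+v):     {lora_trainable:,} trainable parameters")
```

Under these assumptions partial fine-tuning trains roughly 48 times as many parameters as LoRA, which is the gap the resource comparison later in this README puts to the test.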

Results

BERT - partially fine-tuned

Parameters

We fine-tuned the model by unfreezing the last two layers of BERT, allowing them to update during training.
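One way to express "unfreeze the last two layers" is a name-based rule over the model's parameters. The sketch below is an assumption about how this could be done, not the project's actual code: `is_trainable` is a hypothetical helper, the parameter names follow the Hugging Face `BertForSequenceClassification` naming scheme (`bert.encoder.layer.<i>.…`, `classifier.…`), and it assumes the classification head is trained while embeddings and pooler stay frozen.

```python
import re

LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")

def is_trainable(param_name, num_layers=12, unfrozen=2):
    """Decide whether a parameter stays trainable under partial fine-tuning."""
    if param_name.startswith("classifier."):
        return True   # assumption: the classification head is always trained
    match = LAYER_PATTERN.search(param_name)
    if match is None:
        return False  # embeddings, pooler, etc. stay frozen in this sketch
    return int(match.group(1)) >= num_layers - unfrozen

example_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.9.output.dense.bias",
    "bert.encoder.layer.10.attention.self.query.weight",
    "bert.encoder.layer.11.output.LayerNorm.weight",
    "classifier.weight",
]
for name in example_names:
    print(f"{name}: {'trainable' if is_trainable(name) else 'frozen'}")
```

With a loaded PyTorch model, the rule would be applied by setting `param.requires_grad = is_trainable(name)` for each pair returned by `model.named_parameters()`. DistilBERT uses different parameter names, so the pattern would need adjusting there.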

Training

Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 18, learning rate of 3e-5, dropout rate of 0.5, and weight decay of 0.1.

Classification Results

The classification model achieved an overall accuracy of 84.98% with a test loss of 0.4151, demonstrating strong performance across all three classes. Precision and recall scores were consistently high, especially for the contradiction class (F1-score: 0.88), while entailment and neutral also showed solid metrics (F1-scores: 0.86 and 0.81 respectively). Despite this, around 15% of the test set—roughly 16,500 samples—were misclassified, indicating room for further optimization.
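For readers who want to see how accuracy, the misclassification count, and the per-class precision, recall and F1-scores relate to each other, here is a minimal sketch that derives all of them from a confusion matrix. The counts are hypothetical and chosen for readability; they are not this project's results, and the label order is an assumption.

```python
LABELS = ["entailment", "neutral", "contradiction"]

def classification_report(confusion):
    """confusion[i][j] = number of samples with true label i predicted as label j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    report = {"accuracy": correct / total, "misclassified": total - correct}
    for i, label in enumerate(LABELS):
        true_positives = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Hypothetical counts for illustration only.
toy_confusion = [
    [90, 8, 2],
    [10, 80, 10],
    [3, 7, 90],
]
report = classification_report(toy_confusion)
print(f"accuracy={report['accuracy']:.4f}, misclassified={report['misclassified']}")
for label in LABELS:
    print(label, {k: round(v, 3) for k, v in report[label].items()})
```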

BERT - LoRA

Parameters

We fine-tuned the model with LoRA, a parameter-efficient fine-tuning (PEFT) method that is known for being computationally efficient because it trains far fewer parameters.

Training

Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 6, learning rate of 1e-4, dropout rate of 0.5, and weight decay of 0.1.

Figure: training and validation loss curves (BERT, LoRA)

Classification Results

The classification model achieved an overall accuracy of 85.20% with a test loss of 0.4338, demonstrating strong performance across all three classes. Precision and recall scores were consistently high, especially for the contradiction class (F1-score: 0.88), while entailment and neutral also showed solid metrics (F1-scores: 0.87 and 0.81 respectively). Despite this, around 14.8% of the test set—roughly 16,262 samples—were misclassified, indicating room for further optimization.

Figure: confusion matrix (BERT, LoRA)

DistilBERT - partially fine-tuned

Parameters

We fine-tuned the model by unfreezing the last two layers of DistilBERT, allowing them to update during training.

Training

Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 13, learning rate of 5e-5, dropout rate of 0.5, and weight decay of 0.1.

Figure: training and validation loss curves (DistilBERT, partially fine-tuned)

Classification Results

The classification model achieved an overall accuracy of 81.12% with a test loss of 0.4828, showing solid performance across all three classes. Precision and recall scores were strong for entailment (F1-score: 0.84) and contradiction (F1-score: 0.83), while neutral performed slightly lower (F1-score: 0.77). A total of 20,742 samples were misclassified, accounting for 18.9% of the test set, indicating potential for further refinement.

Figure: confusion matrix (DistilBERT, partially fine-tuned)

DistilBERT - LoRA

Parameters

We fine-tuned the model with LoRA, a parameter-efficient fine-tuning (PEFT) method that is known for being computationally efficient because it trains far fewer parameters.

Training

Training was stable across three epochs, as shown by steadily decreasing train and validation loss curves. Key hyperparameters included a batch size of 8, learning rate of 1e-4, dropout rate of 0.5, and weight decay of 0.1.

Figure: training and validation loss curves (DistilBERT, LoRA)

Classification Results

The classification model reached an overall accuracy of 83.48% with a test loss of 0.4479, indicating strong and consistent performance. F1-scores were high for entailment and contradiction (both 0.86), while neutral maintained a respectable score of 0.79. A total of 18,151 samples were misclassified, representing 16.5% of the test set, suggesting solid generalization with room for further tuning.

Figure: confusion matrix (DistilBERT, LoRA)

Analysis

Misclassifications

Top Misclassified Words

Same top words across models: For all four model/method combinations, the same core set of words appears most often in misclassified samples: man, woman, people, wearing, and shirt/young
Stable ranking: The ranking of these words is nearly identical across models (e.g., man always #1, woman always #2)
Human-related bias: The words most frequent in misclassified samples are generic, high-frequency human-appearance terms, suggesting shared difficulty rather than model-specific issues
Magnitude varies, pattern doesn’t: Raw counts differ (highest in distil_partial), but the relative scale is consistent across methods
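A word ranking of this kind can be produced by counting tokens over the misclassified texts. The sketch below shows the idea with a hypothetical helper `top_words` and a tiny hand-written stop-word list; the three input sentences are stand-ins taken from the sample preview, not actual misclassified samples, and the project's own tokenization may differ.

```python
from collections import Counter
import re

# Minimal stop-word list for this sketch only.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "on", "and", "at", "to", "with"}

def top_words(texts, k=5):
    """Return the k most frequent non-stop-words across the given texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(k)

# Stand-in texts; in practice this would be the misclassified samples of one model.
stand_in_texts = [
    "A man inspects the uniform of a figure in some East Asian country.",
    "An older and younger man smiling.",
    "A smiling costumed woman is holding an umbrella.",
]
ranking = top_words(stand_in_texts, k=2)
print(ranking)
```

Running the same function on each model's misclassified samples and comparing the resulting rankings is one way to check whether the top words really coincide across models.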

Label Distributions

Across all models, the neutral class is the dominant source of misclassifications.
In the true label distribution, neutral accounts for roughly half of all mistakes.
In the predicted label distribution, all models also over-predict neutral.
This confirms that neutral is the most ambiguous and error-prone category.
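The two distributions discussed above can be computed by restricting attention to the misclassified samples and counting their true and predicted labels separately. This is a minimal sketch with made-up labels; `error_label_shares` is a hypothetical helper, not a function from this repository.

```python
from collections import Counter

def error_label_shares(true_labels, predicted_labels):
    """Share of each label among misclassified samples, by true and by predicted label."""
    errors = [(t, p) for t, p in zip(true_labels, predicted_labels) if t != p]
    labels = sorted(set(true_labels) | set(predicted_labels))
    if not errors:
        zero = {label: 0.0 for label in labels}
        return {"true": dict(zero), "predicted": dict(zero)}
    true_counts = Counter(t for t, _ in errors)
    pred_counts = Counter(p for _, p in errors)
    n = len(errors)
    return {
        "true": {label: true_counts[label] / n for label in labels},
        "predicted": {label: pred_counts[label] / n for label in labels},
    }

# Made-up labels for illustration only.
y_true = ["neutral", "neutral", "entailment", "contradiction", "neutral", "entailment"]
y_pred = ["entailment", "contradiction", "neutral", "neutral", "neutral", "entailment"]
shares = error_label_shares(y_true, y_pred)
print(shares)
```

A high share for neutral under "true" means neutral samples are often missed; a high share under "predicted" means neutral is often predicted wrongly.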

Model Comparison

DistilBERT LoRA shows the highest share of neutral misclassifications and stronger bias toward predicting neutral overall
BERT models (LoRA and Partial) distribute errors more evenly and show slightly better discrimination between entailment and contradiction

Probability Differences

Boxplots of predicted minus true probabilities show that all models exhibit large probability gaps in their misclassifications
Neutral again has the widest spread, indicating inconsistent confidence levels
DistilBERT LoRA tends to produce the largest overconfidence gaps, while BERT Partial is more stable
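The quantity behind these boxplots, the probability assigned to the predicted class minus the probability assigned to the true class, can be sketched as follows. The probabilities are invented for illustration, the label order is an assumption, and `gaps_by_true_label` is a hypothetical helper.

```python
from collections import defaultdict

LABELS = ["entailment", "neutral", "contradiction"]  # assumed label order

def gaps_by_true_label(prob_rows, true_indices):
    """Collect p(predicted) - p(true) for misclassified samples, grouped by true label."""
    gaps = defaultdict(list)
    for probs, true_idx in zip(prob_rows, true_indices):
        pred_idx = max(range(len(probs)), key=lambda i: probs[i])
        if pred_idx != true_idx:  # correct predictions have a gap of zero and are skipped
            gaps[LABELS[true_idx]].append(probs[pred_idx] - probs[true_idx])
    return dict(gaps)

# Invented softmax outputs for four samples.
prob_rows = [
    [0.70, 0.20, 0.10],  # true: neutral, predicted: entailment
    [0.10, 0.30, 0.60],  # true: neutral, predicted: contradiction
    [0.80, 0.10, 0.10],  # true: entailment, predicted correctly
    [0.45, 0.15, 0.40],  # true: contradiction, predicted: entailment
]
true_indices = [1, 1, 0, 2]
gaps = gaps_by_true_label(prob_rows, true_indices)
print(gaps)
```

A gap close to 1 indicates a confidently wrong prediction, while a gap close to 0 indicates a near tie between the predicted and the true class.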

Overall Interpretation

All models share similar weaknesses:
- They confuse neutral examples most frequently
- Their probability outputs reveal overconfidence even in wrong predictions
LoRA and partial fine-tuning yield comparable misclassification behavior, though BERT variants perform slightly more consistently than DistilBERT ones
Fine-tuning method affects the extent of misclassification but not the type — the neutral class remains the primary challenge across all setups

Compute

Resource Usage Analysis

LoRA is presented in the literature as a parameter-efficient fine-tuning method that significantly reduces memory and computational requirements by training only low-rank adapter matrices while keeping the pretrained weights frozen. These claims originate primarily from experiments on very large models such as GPT-3 (175B parameters). We hypothesized that LoRA would demonstrate similar efficiency gains over partial fine-tuning in our experimental setting.

To evaluate this, we tracked peak GPU memory allocation, total training time, and throughput (samples per second) across all runs, using a fixed batch size of 8 for fair comparison.
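The three quantities can be collected around a training loop with a small context manager. The sketch below is one possible way to do it, not necessarily how the project's own tracking is implemented: `track_resources` is a hypothetical helper, PyTorch is treated as optional so the code also runs without a GPU, and reporting memory in units of 1024^3 bytes is an assumption.

```python
import time
from contextlib import contextmanager

try:  # torch is optional so the sketch also runs on machines without it
    import torch
except ImportError:
    torch = None

def _cuda_ready():
    return torch is not None and torch.cuda.is_available()

@contextmanager
def track_resources(num_samples, stats):
    """Record wall-clock time, throughput and peak GPU memory for the wrapped block."""
    if _cuda_ready():
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    try:
        yield stats
    finally:
        if _cuda_ready():
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        elapsed = time.perf_counter() - start
        stats["train_time_s"] = elapsed
        stats["throughput_samples_per_s"] = num_samples / elapsed if elapsed > 0 else float("inf")
        stats["peak_memory_gb"] = (
            torch.cuda.max_memory_allocated() / 1024**3 if _cuda_ready() else None
        )

stats = {}
with track_resources(num_samples=1000, stats=stats):
    sum(i * i for i in range(100_000))  # stand-in for the actual training loop
print(stats)
```

Note that `torch.cuda.max_memory_allocated()` reports memory allocated by tensors, which can be lower than the usage shown by tools such as `nvidia-smi`.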

Model	Method	Peak Memory (GB)	Training Time (s)	Throughput (samples/s)
BERT (110M)	Partial FT	0.89	2,566	450
BERT (110M)	LoRA	1.57	4,270	270
DistilBERT (66M)	Partial FT	0.64	1,347	856
DistilBERT (66M)	LoRA	0.87	2,238	516
DeBERTa-XXL (1.5B)*	Partial FT	10.80	14,454	27
DeBERTa-XXL (1.5B)*	LoRA	19.66	21,695	18

*DeBERTa-XXL was trained for only 1 epoch due to computational constraints, so its training time covers a single epoch and its performance metrics are not comparable with those of the other models.

Contrary to expectations, LoRA consumed more memory and was slower than partial fine-tuning across all model sizes tested. On BERT, LoRA used 1.77× more memory and was 1.66× slower. Notably, this pattern persisted even at 1.5B parameters: DeBERTa-XXL with LoRA used 1.82× more memory and was 1.50× slower than partial fine-tuning.
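The overhead factors can be recomputed directly from the table, which also yields the DistilBERT figures not quoted in the text. Because the table values are rounded, a recomputed ratio can differ from a quoted one in the last digit. The dictionary layout and the `overhead` helper are just a convenient sketch.

```python
# (peak memory in GB, training time in s), copied from the table above.
runs = {
    "BERT": {"partial": (0.89, 2566), "lora": (1.57, 4270)},
    "DistilBERT": {"partial": (0.64, 1347), "lora": (0.87, 2238)},
    "DeBERTa-XXL": {"partial": (10.80, 14454), "lora": (19.66, 21695)},
}

def overhead(model):
    """Return (memory ratio, time ratio) of LoRA relative to partial fine-tuning."""
    (mem_partial, time_partial) = runs[model]["partial"]
    (mem_lora, time_lora) = runs[model]["lora"]
    return mem_lora / mem_partial, time_lora / time_partial

for model in runs:
    mem_ratio, time_ratio = overhead(model)
    print(f"{model}: {mem_ratio:.2f}x memory, {time_ratio:.2f}x training time")
```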

Interpreting the Memory Overhead

GPU memory during training consists of model weights, optimizer states, activations (intermediate outputs stored for backpropagation), and gradients. While LoRA trains far fewer parameters and thus requires smaller optimizer states, this advantage appears to be offset by other factors at BERT-scale models.

A likely explanation involves activation storage. In partial fine-tuning, early layers are frozen with requires_grad=False, so no gradient is computed through them and PyTorch's autograd does not need to keep their activations after the forward pass. In contrast, LoRA inserts trainable adapters throughout the model. Even though the base weights are frozen, the computational graph must still flow through these layers to reach the adapters, which likely requires retaining activations across more layers.

This interpretation is supported by Zhang et al. (2023), who observe that LoRA "still requires expensive activation memory to update low-rank weights" and propose LoRA-FA to address this limitation (arXiv:2308.03303). However, the exact memory dynamics depend on implementation details and may vary across frameworks and configurations.
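The activation argument can be illustrated with a deliberately simplified model: backpropagation has to reach every layer from the first one containing trainable parameters up to the output, and those are the layers whose activations must be kept. The sketch below only counts such layers. It ignores what each operation actually stores, assumes a 12-layer encoder, and assumes LoRA adapters in every layer, which may not match the project's configuration.

```python
def layers_needing_backward(has_trainable_params):
    """Given per-layer flags (input side first), return the indices of all layers
    from the first trainable one onward; the backward pass must traverse these."""
    for index, trainable in enumerate(has_trainable_params):
        if trainable:
            return list(range(index, len(has_trainable_params)))
    return []

NUM_LAYERS = 12  # assumed BERT-base depth

partial_ft = [i >= NUM_LAYERS - 2 for i in range(NUM_LAYERS)]  # last two layers unfrozen
lora_everywhere = [True] * NUM_LAYERS                          # adapters in every layer (assumed)

print("partial fine-tuning:", len(layers_needing_backward(partial_ft)), "layers")
print("LoRA:               ", len(layers_needing_backward(lora_everywhere)), "layers")
```

In this toy view partial fine-tuning keeps activations for 2 layers and LoRA for all 12, which is consistent in direction with the measured memory gap even though it says nothing about its exact size.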

Training Time Overhead

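The observed slowdown with LoRA (1.66× on BERT) likely stems in part from the additional computations introduced by the adapter architecture. Each LoRA-adapted layer performs extra matrix operations for the low-rank decomposition. While each operation is small, they accumulate across all adapted layers and training steps. A second likely contributor follows from the activation argument above: with adapters throughout the model, the backward pass has to traverse every adapted layer, whereas under partial fine-tuning it can stop at the first unfrozen layer.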
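The extra matrix operations are visible in the LoRA forward pass itself, which following Hu et al. (2021) computes y = W x + (alpha / r) · B (A x) with W frozen and only A and B trained. Below is a dependency-free sketch with tiny hand-picked matrices; real implementations use tensor libraries, and the values of alpha and the rank here are arbitrary.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W is frozen, A and B are the adapter."""
    base = matvec(W, x)                 # frozen pretrained projection
    low_rank = matvec(B, matvec(A, x))  # the two extra, small matrix products
    scale = alpha / rank
    return [b + scale * l for b, l in zip(base, low_rank)]

x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]   # 2 x 3 frozen weight
A = [[1.0, 1.0, 1.0]]   # rank 1: A is 1 x 3

B_init = [[0.0], [0.0]]      # B starts at zero, so the adapter is initially a no-op
B_trained = [[0.5], [-0.5]]  # arbitrary values standing in for a trained adapter

y_init = lora_forward(W, A, B_init, x, alpha=2.0, rank=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2.0, rank=1)
print(y_init, y_trained)
```

The zero-initialized B reproduces the pretrained output exactly, which is why training can start from the unmodified model.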

Scaling Considerations

LoRA's efficiency benefits are typically demonstrated on very large models (7B+ parameters), where optimizer state savings become substantial. At BERT scale (110M parameters), the optimizer state difference between methods is relatively small, and may be outweighed by activation-related overhead.
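A rough calculation shows how small the optimizer-side saving is at this scale. The sketch assumes fp32 training with an Adam-style optimizer (two moment buffers per trainable parameter), two unfrozen BERT-base encoder layers for partial fine-tuning, and a hypothetical LoRA setup with rank 8 on the query and value projections; none of these settings is confirmed by this README.

```python
BYTES_FP32 = 4

def grad_and_adam_bytes(trainable_params):
    """fp32 gradients plus Adam's two moment buffers for the trainable parameters."""
    gradients = trainable_params * BYTES_FP32
    adam_moments = 2 * trainable_params * BYTES_FP32
    return gradients + adam_moments

# Two BERT-base encoder layers at 7,087,872 parameters each (assumed setup).
partial_params = 2 * 7_087_872
# Rank-8 adapters on query and value in 12 layers: 12 * 2 * 8 * (768 + 768) (hypothetical).
lora_params = 12 * 2 * 8 * (768 + 768)

saving_gb = (grad_and_adam_bytes(partial_params) - grad_and_adam_bytes(lora_params)) / 1e9
print(f"gradient + optimizer state saved by LoRA: about {saving_gb:.2f} GB")
```

Under these assumptions LoRA saves on the order of 0.17 GB in gradients and optimizer state, which is small next to the 0.68 GB by which LoRA's peak memory exceeded partial fine-tuning on BERT in the table above.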

To explore whether LoRA becomes more efficient at larger scales, we ran experiments on DeBERTa-v2-XXLarge (1.5B parameters). Even at this scale, LoRA used 19.7 GB compared to 10.8 GB for partial fine-tuning. Training was limited to one epoch due to cost (per the training times above, a single epoch took roughly 4 hours with partial fine-tuning and 6 hours with LoRA on an NVIDIA H100), so accuracy results are not comparable. However, the resource usage pattern suggests that the efficiency crossover point for LoRA may require models larger than 1.5B parameters, potentially in the 7B+ range; the memory savings reported in the original LoRA paper were measured on GPT-3 175B.

Final Evaluation

This study compared LoRA and partial fine-tuning on BERT and DistilBERT for natural language inference. All configurations achieved 81–85% accuracy, with at most a small difference between fine-tuning methods (about 0.2 points on BERT and 2.4 points on DistilBERT, both in LoRA's favor). Error analysis showed consistent patterns across models: the neutral class dominated misclassifications, and errors correlated with dataset ambiguity rather than method choice.

Contrary to expectations, LoRA used 1.77× more memory and was 1.66× slower than partial fine-tuning on BERT. This pattern held even at 1.5B parameters (DeBERTa-XXL), where LoRA still used 1.82× more memory. A likely explanation is that LoRA's activation memory overhead outweighs its optimizer state savings at these model scales. For encoder models up to at least 1.5B parameters, partial fine-tuning appears to offer a better resource-accuracy tradeoff.

The original LoRA paper (Hu et al., 2021) demonstrated memory savings on GPT-3 175B, but our results suggest these benefits may not transfer to smaller encoder models. This aligns with observations from Zhang et al. (2023), who note that LoRA "still requires expensive activation memory" and propose LoRA-FA to address this overhead.

The main takeaway from this work is that fewer trainable parameters do not automatically mean lower memory use or faster training. All our models also made similar types of errors regardless of the fine-tuning method, which suggests that for a task like NLI the choice between LoRA and partial fine-tuning probably matters less than expected.

References

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685

Zhang, L., Zhang, L., Shi, S., Chu, X., & Li, B. (2023). LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning. arXiv preprint arXiv:2308.03303. https://arxiv.org/abs/2308.03303
