alessiaianes/deep-learning-project

Few-Shot Adaptation for Vision-Language Models

Base-to-Novel Generalization with LoRA, LP++, and Muon Optimizer

This project focuses on the Few-Shot Adaptation of the CLIP (Contrastive Language-Image Pre-training) model, specifically addressing the challenge of Base-to-Novel Generalization. The goal is to improve the model's performance on a specific set of "Base" classes using only a few annotated samples ($k=10$) while maintaining or enhancing its zero-shot capabilities on unseen "Novel" classes.


1. Problem Statement & Objectives

As defined in the project assignment, traditional fine-tuning often leads to overfitting on limited data and a degradation of performance on novel categories. This project implements a multi-technique approach to achieve:

  • High Accuracy on Base Classes: Adapting the model to 51 fine-grained flower categories.
  • Robustness on Novel Classes: Preserving zero-shot performance on 51 unseen categories.
  • Harmonic Mean Optimization: Improving the overall balance between base and novel accuracy compared to the zero-shot baseline.

2. Methodology

The implementation follows a progressive optimization strategy using Parameter-Efficient Fine-Tuning (PEFT) to mitigate the risk of overfitting in data-scarce scenarios:

  1. LoRA (Low-Rank Adaptation): Instead of tuning all parameters, we inject trainable low-rank adapters into the MLP layers (c_fc, c_proj) of all 12 transformer blocks in the CLIP visual encoder.
  2. LP++ (Linear Probing++): An enhanced linear probe trained on frozen CLIP features, applying L2 normalization and StandardScaler to the extracted features to stabilize the classifier.
  3. Muon Optimizer: Integration of the Muon optimizer, which applies a Newton-Schulz iteration to orthogonalize each weight-matrix update, keeping updates well-conditioned.
  4. Logits DeConfusion: A recent technique (CVPR 2025) designed to reduce confusion between base and novel categories at inference time.
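For intuition, the low-rank update behind step 1 can be sketched from scratch in NumPy. This is only an illustration of the math, not the project's implementation (which would use a LoRA library on the CLIP vision tower); the dimensions, rank, and scaling factor below are toy assumptions:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Forward pass through a frozen weight W plus a trainable
    low-rank adapter: y = x @ (W + (alpha / r) * B @ A).T.
    Only A (r x d_in) and B (d_out x r) would receive gradients."""
    delta = (alpha / r) * (B @ A)      # low-rank update, rank <= r
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 256, 4            # toy sizes; the real c_fc in ViT-B/16 is 768 -> 3072
W = rng.standard_normal((d_out, d_in)) # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))               # B starts at zero, so the adapter is initially a no-op

x = rng.standard_normal((1, d_in))
y = lora_forward(x, W, A, B)
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(y, x @ W.T)
```

The key property is that the trainable parameters scale with r(d_in + d_out) instead of d_in * d_out, which is what makes the method viable at k=10 shots.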

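The orthogonalization in step 3 can be illustrated with the classic cubic Newton-Schulz iteration (a NumPy sketch of the underlying idea; the actual Muon optimizer applies a tuned quintic polynomial to the momentum buffer, which is omitted here):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Drive the singular values of G toward 1 via the cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Normalizing by the Frobenius norm first puts every singular
    value in (0, 1], where the iteration converges monotonically."""
    X = G / np.linalg.norm(G)          # Frobenius normalization
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))        # a gradient-shaped update matrix
O = newton_schulz_orthogonalize(G)
# Rows are now approximately orthonormal: O @ O.T is close to the identity.
assert np.allclose(O @ O.T, np.eye(4), atol=1e-4)
```

Replacing the raw update with its orthogonalized version equalizes the step size across all directions of the weight matrix, which is the "well-conditioned gradients" property mentioned above.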
3. Experimental Setup

  • Dataset: Oxford Flowers-102 (split into 51 Base and 51 Novel classes).
  • Foundation Model: CLIP ViT-B/16.
  • Few-Shot Setting: $k=10$ samples per base class.
  • Augmentations: RandomResizedCrop and RandomHorizontalFlip to increase training diversity.

Baseline (Zero-Shot)

  • Base Accuracy: 67.21%
  • Novel Accuracy: 72.36%
  • Harmonic Mean: 69.69%
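The harmonic mean combines the two accuracies as HM = 2·B·N / (B + N), which reproduces the baseline figure above:

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy (in percent)."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

hm = harmonic_mean(67.21, 72.36)   # zero-shot baseline numbers from above
print(round(hm, 2))                # -> 69.69
```

Because the harmonic mean is dominated by the smaller of the two terms, it penalizes methods that trade novel-class accuracy for base-class gains.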

4. Requirements

The project requires the following libraries:

  • torch
  • torchvision
  • openai-clip
  • peft
  • scikit-learn
  • rich

Install the core CLIP dependency via:

pip install openai-clip
