alessiaianes/deep-learning-project

Few-Shot Adaptation for Vision-Language Models

Base-to-Novel Generalization with LoRA, LP++, and Muon Optimizer

This project focuses on the Few-Shot Adaptation of the CLIP (Contrastive Language-Image Pre-training) model, specifically addressing the challenge of Base-to-Novel Generalization. The goal is to improve the model's performance on a specific set of "Base" classes using only a few annotated samples ($k=10$) while maintaining or enhancing its zero-shot capabilities on unseen "Novel" classes.


1. Problem Statement & Objectives

As defined in the project assignment, traditional fine-tuning often leads to overfitting on limited data and a degradation of performance on novel categories. This project implements a multi-technique approach to achieve:

  • High Accuracy on Base Classes: Adapting the model to 51 fine-grained flower categories.
  • Robustness on Novel Classes: Preserving zero-shot performance on 51 unseen categories.
  • Harmonic Mean Optimization: Improving the overall balance between base and novel accuracy compared to the zero-shot baseline.

2. Methodology

The implementation follows a progressive optimization strategy using Parameter-Efficient Fine-Tuning (PEFT) to mitigate the risk of overfitting in data-scarce scenarios:

  1. LoRA (Low-Rank Adaptation): Instead of tuning all parameters, we inject trainable low-rank adapters into the MLP layers (c_fc, c_proj) of all 12 transformer blocks in the CLIP visual encoder.
  2. LP++ (Linear Probing++): An enhanced linear probe trained on frozen CLIP features, applying L2 normalization and StandardScaler to the extracted features to stabilize the classifier.
  3. Muon Optimizer: Integration of the Muon optimizer, which applies a Newton-Schulz iteration to orthogonalize each weight-matrix update, keeping updates well-conditioned.
  4. Logits DeConfusion: A recent technique (CVPR 2025) designed to reduce confusion between base and novel categories at inference time.
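For intuition, the low-rank update behind step 1 can be sketched from scratch in NumPy. This is only an illustration of the math, not the project's implementation (which would use a LoRA library on the CLIP vision tower); the dimensions, rank, and scaling factor below are toy assumptions:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Forward pass through a frozen weight W plus a trainable
    low-rank adapter: y = x @ (W + (alpha / r) * B @ A).T.
    Only A (r x d_in) and B (d_out x r) would receive gradients."""
    delta = (alpha / r) * (B @ A)      # low-rank update, rank <= r
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 256, 4            # toy sizes; the real c_fc in ViT-B/16 is 768 -> 3072
W = rng.standard_normal((d_out, d_in)) # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))               # B starts at zero, so the adapter is initially a no-op

x = rng.standard_normal((1, d_in))
y = lora_forward(x, W, A, B)
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(y, x @ W.T)
```

The key property is that the trainable parameters scale with r(d_in + d_out) instead of d_in * d_out, which is what makes the method viable at k=10 shots.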

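The orthogonalization in step 3 can be illustrated with the classic cubic Newton-Schulz iteration (a NumPy sketch of the underlying idea; the actual Muon optimizer applies a tuned quintic polynomial to the momentum buffer, which is omitted here):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Drive the singular values of G toward 1 via the cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Normalizing by the Frobenius norm first puts every singular
    value in (0, 1], where the iteration converges monotonically."""
    X = G / np.linalg.norm(G)          # Frobenius normalization
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))        # a gradient-shaped update matrix
O = newton_schulz_orthogonalize(G)
# Rows are now approximately orthonormal: O @ O.T is close to the identity.
assert np.allclose(O @ O.T, np.eye(4), atol=1e-4)
```

Replacing the raw update with its orthogonalized version equalizes the step size across all directions of the weight matrix, which is the "well-conditioned gradients" property mentioned above.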
3. Experimental Setup

  • Dataset: Oxford Flowers-102 (split into 51 Base and 51 Novel classes).
  • Foundation Model: CLIP ViT-B/16.
  • Few-Shot Setting: $k=10$ samples per base class.
  • Augmentations: RandomResizedCrop and RandomHorizontalFlip to increase training diversity.

Baseline (Zero-Shot)

  • Base Accuracy: 67.21%
  • Novel Accuracy: 72.36%
  • Harmonic Mean: 69.69%
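The harmonic mean combines the two accuracies as HM = 2·B·N / (B + N), which reproduces the baseline figure above:

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy (in percent)."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

hm = harmonic_mean(67.21, 72.36)   # zero-shot baseline numbers from above
print(round(hm, 2))                # -> 69.69
```

Because the harmonic mean is dominated by the smaller of the two terms, it penalizes methods that trade novel-class accuracy for base-class gains.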

4. Requirements

The project requires the following libraries:

  • torch
  • torchvision
  • openai-clip
  • peft
  • scikit-learn
  • rich

Install the core CLIP dependency via:

pip install openai-clip
