This project focuses on the Few-Shot Adaptation of the CLIP (Contrastive Language-Image Pre-training) model, specifically addressing the challenge of Base-to-Novel Generalization. The goal is to improve the model's performance on a specific set of "Base" classes using only a few annotated samples per class, while preserving its zero-shot capabilities on unseen "Novel" classes.
As defined in the project assignment, traditional fine-tuning often leads to overfitting on limited data and a degradation of performance on novel categories. This project implements a multi-technique approach to achieve:
- High Accuracy on Base Classes: Adapting the model to 51 fine-grained flower categories.
- Robustness on Novel Classes: Preserving zero-shot performance on 51 unseen categories.
- Harmonic Mean Optimization: Improving the overall balance between base and novel accuracy compared to the zero-shot baseline.
The implementation follows a progressive optimization strategy using Parameter-Efficient Fine-Tuning (PEFT) to mitigate the risk of overfitting in data-scarce scenarios:
- LoRA (Low-Rank Adaptation): Instead of tuning all parameters, we inject trainable low-rank adapters into the MLP layers (`c_fc`, `c_proj`) of all 12 transformer blocks in the CLIP visual encoder.
- LP++ (Linear Probing++): An advanced linear probing method that pairs L2-normalized features with a `StandardScaler` for stable classification.
- Muon Optimizer: Integration of the recent Muon optimizer, which uses a Newton-Schulz iteration to orthogonalize each weight update, keeping the updates well-conditioned.
- Logits DeConfusion: A state-of-the-art technique (CVPR 2025) specifically designed to resolve confusion between base and novel categories during inference.
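The LoRA idea above can be sketched in a few lines of PyTorch. This is an illustrative re-implementation, not the project's actual code: a frozen `nn.Linear` is augmented with a trainable rank-`r` residual, so only `r * (d_in + d_out)` parameters are trained per adapted layer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear augmented with a trainable low-rank residual:
    y = W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # A starts small-random, B starts at zero, so the adapted layer
        # initially behaves exactly like the pretrained one.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

In the project, this wrapping targets the `c_fc` and `c_proj` MLP layers of each visual transformer block; the `peft` library automates the same injection via `LoraConfig(target_modules=["c_fc", "c_proj"])`.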
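The linear-probing pipeline can likewise be sketched with scikit-learn. The random features below merely stand in for real CLIP image embeddings, and the logistic-regression head is an illustrative choice (full LP++ adds further refinements beyond this simplified sketch):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for CLIP ViT-B/16 image features: 51 base classes x 10 shots, 512-d
feats = rng.normal(size=(510, 512)).astype(np.float32)
labels = np.repeat(np.arange(51), 10)

# L2-normalise each feature vector, as CLIP does before computing similarities
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Standardise each feature dimension for a better-conditioned linear problem
scaler = StandardScaler()
feats = scaler.fit_transform(feats)

# Fit the linear classifier head on the few-shot features
clf = LogisticRegression(max_iter=1000).fit(feats, labels)
```

At test time, novel images would go through the same `scaler.transform` before classification.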
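Muon's core operation, orthogonalizing an update matrix without an explicit SVD, can be illustrated with the classical cubic Newton-Schulz iteration (Muon itself uses a tuned quintic polynomial applied to its momentum buffer, so this is a simplified sketch):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 30) -> np.ndarray:
    """Approximate the orthogonal polar factor of g (the U V^T of its SVD)
    with the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    x = g / np.linalg.norm(g)  # scale so all singular values lie in (0, 1]
    for _ in range(steps):
        # Each step pushes every singular value toward 1, leaving the
        # singular vectors untouched -- hence the orthogonalized update.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Replacing a raw gradient or momentum matrix with its orthogonal factor equalizes the scale of its directions, which is what keeps the updates well-conditioned.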
- Dataset: Oxford Flowers-102 (split into 51 Base and 51 Novel classes).
- Foundation Model: CLIP ViT-B/16.
- Few-Shot Setting: $k=10$ samples per base class.
- Augmentations: `RandomResizedCrop` and `RandomHorizontalFlip` to increase training diversity.
- Base Accuracy: 67.21%
- Novel Accuracy: 72.36%
- Harmonic Mean: 69.69%
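The harmonic mean follows directly from the two accuracies and penalises any imbalance between them:

```python
base_acc, novel_acc = 67.21, 72.36

# HM = 2 * B * N / (B + N)
harmonic_mean = 2 * base_acc * novel_acc / (base_acc + novel_acc)
print(f"{harmonic_mean:.2f}")  # 69.69
```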
The project requires the following libraries:
`torch`, `torchvision`, `openai-clip`, `peft`, `scikit-learn`, `rich`
Install the core CLIP dependency via:
```bash
pip install openai-clip
```