This is the code-base for GTA-CLIP, proposed in:
"Generate, Transduct, Adapt: Iterative Transduction with VLMs"
Oindrila Saha, Logan Lawrence, Grant Van Horn, Subhransu Maji
ICCV 2025
(a) Vision-language models (VLMs) such as CLIP enable zero-shot classification using similarity between class prompts and images.
(b) Transduction exploits the structure of the entire image dataset to assign images to classes, improving accuracy.
(c) Our approach, GTA-CLIP, iteratively classifies images by
(i) generating attributes based on pairwise confusions,
(ii) performing attribute-augmented transductive inference, and
(iii) adapting CLIP encoders using the inferred labels (a rough sketch of this loop is shown after this list).
(d) Across 12 datasets we improve over CLIP and transductive CLIP by 8.6% and 4.0%, respectively, using ViT-B/32, with similar gains for other encoders. Significant improvements are also reported in the few-shot setting.
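As a rough illustration of the loop in (c), here is a minimal, self-contained Python sketch. All names in it (encode_images, encode_texts, generate_attributes, transductive_inference, adapt_encoders, and the toy class names) are hypothetical placeholders rather than this repository's API, and the CLIP encoders are replaced by random features so the snippet runs on its own; see run_gtaclip.py for the actual implementation.

```python
# Illustrative sketch only: every helper below is a hypothetical stand-in,
# not the real GTA-CLIP code.
import numpy as np

rng = np.random.default_rng(0)

def encode_images(images):
    # Stand-in for the CLIP image encoder: L2-normalized random features.
    f = rng.normal(size=(len(images), 512))
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def encode_texts(texts):
    # Stand-in for the CLIP text encoder.
    f = rng.normal(size=(len(texts), 512))
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def class_text_features(prompts_per_class):
    # One embedding per class: the mean of its class + attribute prompts.
    f = np.stack([encode_texts(p).mean(axis=0) for p in prompts_per_class])
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def generate_attributes(confusion, class_names, top_k=2):
    # (i) For the most confused class pairs an LLM would be asked for
    # distinguishing attributes; here we just return dummy prompt strings.
    order = np.dstack(np.unravel_index(np.argsort(confusion, axis=None)[::-1],
                                       confusion.shape))[0]
    pairs = [(i, j) for i, j in order if i != j][:top_k]
    return [(i, f"a photo of a {class_names[i]}, which differs from a "
                f"{class_names[j]} in some attribute") for i, j in pairs]

def transductive_inference(image_feats, class_feats):
    # (ii) Stand-in for attribute-augmented transductive inference; here it is
    # just a softmax over image-text similarities.
    logits = 100.0 * image_feats @ class_feats.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def adapt_encoders(image_feats, pseudo_labels):
    # (iii) Stand-in for adapting the CLIP encoders with the inferred labels;
    # a no-op in this sketch.
    return image_feats

class_names = ["sparrow", "warbler", "finch"]          # toy classes
prompts = [[f"a photo of a {c}"] for c in class_names]
image_feats = encode_images(range(20))                 # dummy "images"

for step in range(3):
    probs = transductive_inference(image_feats, class_text_features(prompts))
    pseudo_labels = probs.argmax(axis=1)
    confusion = probs.T @ probs                        # crude confusion proxy
    for cls, attr in generate_attributes(confusion, class_names):
        prompts[cls].append(attr)                      # augment class prompts
    image_feats = adapt_encoders(image_feats, pseudo_labels)

print("pseudo-labels:", pseudo_labels)
```

In the actual method, transductive inference exploits the structure of the whole image set rather than only image-text similarity, and the encoders are genuinely adapted with the inferred labels; both steps are reduced to stubs in this sketch.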
Create a conda environment with the following specifications:

```
conda create -y --name GTACLIP python=3.10.0
conda activate GTACLIP
pip3 install -r requirements.txt
export TOKENIZERS_PARALLELISM=true
```

Please follow DATASETS.md to install the datasets. For the CUB dataset, follow AdaptCLIPZS.
Download "gpt_descriptions" from AdaptCLIPZS
Then run:

```
python run_gtaclip.py --dataset <dataset_name> --root_path </path/to/datasets/folder> --backbone <clip_backbone> --gpt_path </path/to/adaptclizs/visual/attributes> --gpt_path_location </path/to/adaptclizs/location/attributes>
```

On completion, this code will print the accuracies of base CLIP, TransCLIP, and GTA-CLIP for the specified dataset. The --root_path should be assigned to the folder containing all the datasets. --backbone is the CLIP architecture, e.g. 'vit_b16'. The --gpt_path is the path to the folder containing the GPT-generated attributes for the specific dataset, which can be obtained from AdaptCLIPZS. Note that only the CUB and Flowers datasets have --gpt_path_location attributes. The results should be close to this table:
Todo: Code for few-shot results
Thanks to TransCLIP for releasing the codebase upon which our code is built.
The research is supported in part by grant #2329927 from the National Science Foundation (USA). Our experiments were performed on the GPU cluster funded by the Mass. Technology Collaborative.
If you find our work useful, please consider citing:
```
@inproceedings{saha2025generate,
  title={Generate, Transduct, Adapt: Iterative Transduction with VLMs},
  author={Saha, Oindrila and Lawrence, Logan and Van Horn, Grant and Maji, Subhransu},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}
```

