This is the code-base for GTA-CLIP, proposed in:
"Generate, Transduct, Adapt: Iterative Transduction with VLMs"
Oindrila Saha, Logan Lawrence, Grant Van Horn, Subhransu Maji
ICCV 2025
(a) Vision-language models (VLMs) such as CLIP enable zero-shot classification using similarity between class prompts and images.
(b) Transduction exploits the structure of the entire image dataset to assign images to classes, improving accuracy.
(c) Our approach, GTA-CLIP, iteratively classifies images by
(i) generating attributes based on pairwise confusions,
(ii) performing attribute-augmented transductive inference, and
(iii) adapting CLIP encoders using the inferred labels (a rough sketch of this loop is shown after this list).
(d) Across 12 datasets we improve over CLIP and transductive CLIP by 8.6% and 4.0%, respectively, using ViT-B/32, with similar gains for other encoders. Significant improvements are also reported in the few-shot setting.
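As a rough illustration of the loop in (c), here is a minimal, self-contained Python sketch. All names in it (encode_images, encode_texts, generate_attributes, transductive_inference, adapt_encoders, and the toy class names) are hypothetical placeholders rather than this repository's API, and the CLIP encoders are replaced by random features so the snippet runs on its own; see run_gtaclip.py for the actual implementation.

```python
# Illustrative sketch only: every helper below is a hypothetical stand-in,
# not the real GTA-CLIP code.
import numpy as np

rng = np.random.default_rng(0)

def encode_images(images):
    # Stand-in for the CLIP image encoder: L2-normalized random features.
    f = rng.normal(size=(len(images), 512))
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def encode_texts(texts):
    # Stand-in for the CLIP text encoder.
    f = rng.normal(size=(len(texts), 512))
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def class_text_features(prompts_per_class):
    # One embedding per class: the mean of its class + attribute prompts.
    f = np.stack([encode_texts(p).mean(axis=0) for p in prompts_per_class])
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def generate_attributes(confusion, class_names, top_k=2):
    # (i) For the most confused class pairs an LLM would be asked for
    # distinguishing attributes; here we just return dummy prompt strings.
    order = np.dstack(np.unravel_index(np.argsort(confusion, axis=None)[::-1],
                                       confusion.shape))[0]
    pairs = [(i, j) for i, j in order if i != j][:top_k]
    return [(i, f"a photo of a {class_names[i]}, which differs from a "
                f"{class_names[j]} in some attribute") for i, j in pairs]

def transductive_inference(image_feats, class_feats):
    # (ii) Stand-in for attribute-augmented transductive inference; here it is
    # just a softmax over image-text similarities.
    logits = 100.0 * image_feats @ class_feats.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def adapt_encoders(image_feats, pseudo_labels):
    # (iii) Stand-in for adapting the CLIP encoders with the inferred labels;
    # a no-op in this sketch.
    return image_feats

class_names = ["sparrow", "warbler", "finch"]          # toy classes
prompts = [[f"a photo of a {c}"] for c in class_names]
image_feats = encode_images(range(20))                 # dummy "images"

for step in range(3):
    probs = transductive_inference(image_feats, class_text_features(prompts))
    pseudo_labels = probs.argmax(axis=1)
    confusion = probs.T @ probs                        # crude confusion proxy
    for cls, attr in generate_attributes(confusion, class_names):
        prompts[cls].append(attr)                      # augment class prompts
    image_feats = adapt_encoders(image_feats, pseudo_labels)

print("pseudo-labels:", pseudo_labels)
```

In the actual method, transductive inference exploits the structure of the whole image set rather than only image-text similarity, and the encoders are genuinely adapted with the inferred labels; both steps are reduced to stubs in this sketch.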
Create a conda environment with the following specifications:

```
conda create -y --name GTACLIP python=3.10.0
conda activate GTACLIP
pip3 install -r requirements.txt
export TOKENIZERS_PARALLELISM=true
```

Please follow DATASETS.md to install the datasets. For the CUB dataset, follow AdaptCLIPZS.
Download "gpt_descriptions" from AdaptCLIPZS
Then run:

```
python run_gtaclip.py --dataset <dataset_name> --root_path </path/to/datasets/folder> --backbone <clip_backbone> --gpt_path </path/to/adaptclizs/visual/attributes> --gpt_path_location </path/to/adaptclizs/location/attributes>
```

On completion, this code will print the accuracies of base CLIP, TransCLIP, and GTA-CLIP for the specified dataset. The --root_path should be assigned to the folder containing all the datasets. --backbone is the CLIP architecture, e.g. 'vit_b16'. The --gpt_path is the path to the folder containing the GPT-generated attributes for the specific dataset, which can be obtained from AdaptCLIPZS. Note that only the CUB and Flowers datasets have --gpt_path_location attributes. The results should be close to this table:
Todo: Code for few-shot results
Thanks to TransCLIP for releasing the codebase upon which our code is built.
The research is supported in part by grant #2329927 from the National Science Foundation (USA). Our experiments were performed on the GPU cluster funded by the Mass. Technology Collaborative.
If you find our work useful, please consider citing:
```
@inproceedings{saha2025generate,
  title={Generate, Transduct, Adapt: Iterative Transduction with VLMs},
  author={Saha, Oindrila and Lawrence, Logan and Van Horn, Grant and Maji, Subhransu},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}
```

