Animesh Gupta¹ | Jay Parmar¹ | Ishan Rajendrakumar Dave² | Mubarak Shah¹
¹University of Central Florida  ²Adobe
Accepted at NeurIPS 2025
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on temporal aspects link each query to a single target segment taken from the same video, which limits their practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 27.22.
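For intuition, here is a minimal sketch of the Stage 2 alignment step described above: the composed query (query-video embedding fused with the modification-text embedding) is contrasted against target-video embeddings with a symmetric InfoNCE-style loss. The additive fusion and all names below are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(query_vid_emb, mod_text_emb, target_emb, temperature=0.07):
    """InfoNCE-style loss aligning composed queries with target videos.

    query_vid_emb: (B, D) query-video embeddings from the Stage 1 encoder
    mod_text_emb:  (B, D) modification-text embeddings
    target_emb:    (B, D) target-video embeddings
    NOTE: simple additive fusion is an assumption for illustration;
    the paper's actual fusion module may differ.
    """
    composed = F.normalize(query_vid_emb + mod_text_emb, dim=-1)
    targets = F.normalize(target_emb, dim=-1)
    logits = composed @ targets.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    # symmetric cross-entropy over query->target and target->query directions
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```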
cd TF-CoVR/
conda create -n tfcovr python=3.10 -y
conda activate tfcovr
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
Please download our Stage 2 pretrained weights from Google Drive here.
Please follow the instructions in DATASET.md to access the dataset.
Run the following command to train:
python train.py data=finegd-covr-aim trainer=gpu model=aim model/ckpt=aim test=finegd-test-aim
To evaluate a trained checkpoint, run (replace /checkpoint/path/ with the path to your downloaded or trained checkpoint):
python test.py data=finegd-covr-aim trainer=gpu model=aim_clip model/ckpt=aim test=finegd-test-aim-clip machine.num_workers=8 trainer.max_epochs=100 model.ckpt.path=/checkpoint/path/
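Because each composed query in TF-CoVR has multiple valid targets (3.9 on average), retrieval quality is reported as mAP@50. For reference, a minimal sketch of that metric, assuming a precomputed query-gallery similarity matrix and binary relevance labels; the function and variable names are illustrative, and the paper's exact normalization may differ:

```python
import torch

def map_at_k(scores, relevance, k=50):
    """Mean average precision at k for multi-target retrieval.

    scores:    (Q, G) similarity of each query to each gallery video
    relevance: (Q, G) binary matrix, 1 where the gallery video is a valid target
    """
    k = min(k, scores.size(1))                              # guard small galleries
    topk = scores.topk(k, dim=1).indices                    # (Q, k) ranked gallery ids
    rel = relevance.gather(1, topk).float()                 # relevance of top-k hits
    ranks = torch.arange(1, k + 1, device=scores.device).float()
    precision_at_i = rel.cumsum(dim=1) / ranks              # precision at each rank
    denom = relevance.sum(dim=1).clamp(max=k).clamp(min=1).float()
    ap = (precision_at_i * rel).sum(dim=1) / denom          # per-query average precision
    return ap.mean().item()
```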
If you use this dataset and/or this code in your work, please cite our paper:
@misc{gupta2025playreplaycomposedvideo,
  title={From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos},
  author={Animesh Gupta and Jay Parmar and Ishan Rajendrakumar Dave and Mubarak Shah},
  year={2025},
  eprint={2506.05274},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.05274},
}

This repository borrows code from CoVR. We thank the authors for releasing their code.