Yi Li¹, Kyle Min², Subarna Tripathi², Nuno Vasconcelos¹
¹University of California, San Diego · ²Intel Labs
Project page | Paper | 8-min video
This repository contains the PyTorch implementation of SViTT, a sparse multimodal transformer for video-language learning.
```bash
conda env create -n svitt --file environment.yml
conda activate svitt
```

All datasets are expected under the `data/` directory with the following structure (other downstream datasets follow the same structure as MSRVTT):
```
data/
├── anno_pretrain/
│   └── webvid_train.json
├── anno_downstream/
│   ├── msrvtt_test1k.json
│   └── ...
├── webvid_videos/
│   └── *.mp4
├── msrvtt_videos/
│   └── *.mp4
└── ...
```
Raw videos should be downloaded from the websites of the respective datasets. Annotations for pre-training and downstream tasks are available in the Singularity repo; additional annotations for Charades and AGQA used in this work are available here.
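As a quick sanity check before training, a short script like the following (our illustration, not part of the codebase) can verify that annotations and videos sit where the structure above expects them:

```python
# check_data.py -- sanity-check the data/ layout described above
# (illustrative helper, not part of the SViTT repo)
from pathlib import Path

DATA_ROOT = Path("data")

# Paths taken from the directory tree above; extend for other datasets.
expected = [
    DATA_ROOT / "anno_pretrain" / "webvid_train.json",
    DATA_ROOT / "anno_downstream" / "msrvtt_test1k.json",
    DATA_ROOT / "webvid_videos",
    DATA_ROOT / "msrvtt_videos",
]

for path in expected:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:8s} {path}")

# Warn if a video directory exists but holds no .mp4 files.
for video_dir in (DATA_ROOT / "webvid_videos", DATA_ROOT / "msrvtt_videos"):
    if video_dir.is_dir() and not any(video_dir.glob("*.mp4")):
        print(f"WARNING  {video_dir} contains no .mp4 files")
```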
Our training and evaluation scripts follow the same structure as Singularity, with additional options for temporal modeling and sparse training.
To train a 4-frame SViTT model on WebVid (use `arg=value` to override any arguments in `configs/pretrain_webvid.yaml`):

```bash
bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
video_input.num_frames=4 \
output_dir=$OUTPUT_DIR
```
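The `arg=value` overrides use dotted keys to reach into the nested YAML config. Conceptually, each override behaves like this simplified sketch (the actual parsing lives in the config utilities inherited from Singularity, which also handle type conversion of the value strings):

```python
# Simplified sketch of dotted-key config overrides; shown only to
# illustrate the convention, not the repo's actual parser.
import yaml

def apply_override(cfg: dict, dotted_key: str, value):
    """Set cfg['a']['b'] = value for dotted_key 'a.b'."""
    keys = dotted_key.split(".")
    node = cfg
    for k in keys[:-1]:
        node = node.setdefault(k, {})
    node[keys[-1]] = value

with open("configs/pretrain_webvid.yaml") as f:
    cfg = yaml.safe_load(f)

# Equivalent to passing video_input.num_frames=4 on the command line.
apply_override(cfg, "video_input.num_frames", 4)
apply_override(cfg, "output_dir", "/path/to/output")
```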
To perform temporal sparse expansion to 8 frames:

```bash
bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
pretrained_path=$CKPT \
video_input.num_frames=8 \
vision_encoder_args.token_keep_rate=0.6 \
output_dir=$OUTPUT_DIR
```

It is recommended to use the same sparsity parameters (`vision_encoder_args` and `joint_encoder_args`) as the pre-trained model, though they can also be overridden with different values.
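Here `token_keep_rate=0.6` means roughly 60% of visual tokens survive each pruning stage. For intuition only, score-based token pruning of this kind can be sketched as follows (a generic illustration, not the exact SViTT implementation):

```python
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_rate: float):
    """Keep the top `keep_rate` fraction of tokens by importance score.

    tokens: (batch, num_tokens, dim) patch embeddings (excluding [CLS])
    scores: (batch, num_tokens) importance scores, e.g. [CLS] attention
    """
    num_keep = max(1, int(tokens.size(1) * keep_rate))
    topk = scores.topk(num_keep, dim=1).indices               # (batch, num_keep)
    index = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, index)                            # (batch, num_keep, dim)

x = torch.randn(2, 196, 768)   # 14x14 patch tokens from one frame
attn = torch.rand(2, 196)      # stand-in for [CLS] attention scores
pruned = prune_tokens(x, attn, keep_rate=0.6)
print(pruned.shape)            # torch.Size([2, 117, 768])
```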
To evaluate zero-shot text-to-video retrieval (MSRVTT, DiDeMo):

```bash
bash scripts/eval_ret.sh $DATASET $CKPT eval-ret-$DATASET local $GPUS
```
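Retrieval performance is reported as recall@k over the text-video similarity matrix. Assuming text i matches video i, as in the standard MSRVTT 1k test protocol, a minimal reference computation looks like:

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim: (num_texts, num_videos) similarity matrix, where text i matches
    video i. Returns the fraction of texts whose video ranks in the top k."""
    ranks = sim.argsort(dim=1, descending=True)
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    hits = (ranks[:, :k] == gt).any(dim=1)
    return hits.float().mean().item()

sim = torch.randn(1000, 1000)  # stand-in scores for a 1k test split
for k in (1, 5, 10):
    print(f"R@{k}: {100 * recall_at_k(sim, k):.1f}")
```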
To fine-tune text-to-video retrieval (Charades, SSv2):

```bash
bash scripts/train_ret.sh $DATASET $CKPT train-ret-$DATASET local $GPUS
```
To fine-tune video question answering (MSRVTT-QA, ActivityNet-QA, AGQA):

```bash
bash scripts/train_qa.sh $DATASET $CKPT train-qa-$DATASET local $GPUS
```

This project is built primarily on top of the awesome Singularity codebase. We also acknowledge the use of several other open-source repositories, including Frozen in Time, ALBEF, and 🤗 Transformers. This work was funded in part by NSF award IIS-2041009.
If you find this repo useful, please cite our work. Thanks!
```bibtex
@inproceedings{li2023svitt,
    title={{SViTT}: Temporal Learning of Sparse Video-Text Transformers},
    author={Li, Yi and Min, Kyle and Tripathi, Subarna and Vasconcelos, Nuno},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={18919--18929},
    year={2023}
}
```
