Zero-Shot Multilingual Text-to-Video Retrieval
| Date | News |
|---|---|
| Oct 25, 2025 | Paper accepted at AACL 2025 |
| Jul 24, 2025 | Paper will be presented at the MAGMaR Workshop |
| Section | Description |
|---|---|
| Installation | Setup and environment configuration |
| Data | Datasets, models, and pre-generated data |
| Evaluation | Running experiments on MultiVENT & MSR-VTT-1kA |
| Data Generation Scripts | Scripts for generating training data |
| Use Your Own Data | Custom dataset integration |
| Using Docker | Containerized setup |
| Citation | How to cite this work |
System Requirements
Tested on CUDA 12.4 and A100. If you encounter issues, please use the Docker setup.
- Install UV (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
- Setup Environment:
uv venv --seed --python 3.10
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
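To confirm that the installed PyTorch build can actually see your GPU before running anything heavy (the setup above targets CUDA 12.4 on an A100), a quick check — assuming `torch` is installed by `uv sync` — is:

```python
# Sanity check: verify PyTorch sees the GPU and reports the expected CUDA version.
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA build version:", torch.version.cuda)
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```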
Download our pre-processed data for quick evaluation:
source .venv/bin/activate
gdown --fuzzy https://drive.google.com/file/d/1qcr9ZqHptibJKHOwyOrjjbwQTjcsp_Vk/view
tar -xzvf data.tgz

Note: Due to redistribution policies, videos must be downloaded separately.
| Dataset | Download Instructions | Save Location |
|---|---|---|
| MultiVENT | Download from NeurIPS | data/MultiVENT/videos/ |
| MSR-VTT | Download from Microsoft | data/MSR-VTT-1kA/videos/ |
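Once the videos are downloaded, a quick sanity check that they ended up in the expected folders can help; the snippet below assumes `.mp4` files, so adjust the pattern if your copies use a different container:

```python
# Count the downloaded videos in each dataset's expected directory.
from pathlib import Path

for video_dir in ["data/MultiVENT/videos", "data/MSR-VTT-1kA/videos"]:
    n = len(list(Path(video_dir).glob("*.mp4")))  # adjust the extension if needed
    print(f"{video_dir}: {n} videos found")
```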
mkdir -p data/models/MultiCLIP
wget -O data/models/MultiCLIP/open_clip_pytorch_model.bin \
https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/resolve/main/open_clip_pytorch_model.bin

mkdir -p data/models/InternVideo2
# Download from: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4
# Save as: data/models/InternVideo2/InternVideo2-stage2_1b-224p-f4.pt

Licensing: The InternVideo2 model has different licensing terms and must be downloaded separately.
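If you prefer a scripted download once you have accepted the model's license on Hugging Face, something like the following works; the `filename` inside the HF repository is an assumption, so check the model page's file list and adjust it if it differs:

```python
# Fetch the InternVideo2 checkpoint via huggingface_hub and place it where the
# evaluation code expects it. Requires `huggingface_hub` and, if the repo is gated,
# a prior `huggingface-cli login`.
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="OpenGVLab/InternVideo2-Stage2_1B-224p-f4",
    filename="InternVideo2-stage2_1b-224p-f4.pt",  # assumed filename; verify on the model page
)
Path("data/models/InternVideo2").mkdir(parents=True, exist_ok=True)
shutil.copy(ckpt, "data/models/InternVideo2/InternVideo2-stage2_1b-224p-f4.pt")
```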
Prerequisites: Ensure data is generated and populated in the `data` directory. See Data Generation Scripts for setup instructions.
| Dataset | Command | Description |
|---|---|---|
| MultiVENT | `bash scripts/eval_multivent.sh` | MultiVENT dataset evaluation |
| MSR-VTT-1kA | `bash scripts/eval_msrvtt.sh` | MSR-VTT-1kA dataset evaluation |
# MultiVENT Evaluation
bash scripts/eval_multivent.sh
# MSR-VTT-1kA Evaluation
bash scripts/eval_msrvtt.sh

Tip: Pre-generated data is available in the Pre-Generated Data section for quick evaluation.
| Dataset | Audio Transcription | Script | Description |
|---|---|---|---|
| MultiVENT | With ASR | `generate_multivent_asr.sh` | Full pipeline with audio transcription |
| MultiVENT | No ASR | `generate_multivent_noasr.sh` | Without audio transcription |
| MSR-VTT-1kA | With ASR | `generate_msrvtt_asr.sh` | Full pipeline with audio transcription |
| MSR-VTT-1kA | No ASR | `generate_msrvtt_noasr.sh` | Without audio transcription |
| All Datasets | Grid Search | `grid_search_data.py` | Comprehensive data generation |
# Generate data for specific dataset
bash scripts/generate_multivent_asr.sh # MultiVENT with ASR
bash scripts/generate_msrvtt_noasr.sh # MSR-VTT-1kA without ASR
# Or run grid search for all combinations
python scripts/grid_search_data.py
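Conceptually, the grid search just loops over every dataset/ASR combination and runs the corresponding generation script. A rough sketch of that idea is below; the actual `scripts/grid_search_data.py` may be organized differently and expose more options:

```python
# Minimal illustration of a data-generation grid search over dataset x ASR settings.
# The four shell scripts referenced here are the ones listed in the table above.
import itertools
import subprocess

datasets = ["multivent", "msrvtt"]
asr_settings = ["asr", "noasr"]

for dataset, asr in itertools.product(datasets, asr_settings):
    script = f"scripts/generate_{dataset}_{asr}.sh"
    print(f"Running {script} ...")
    subprocess.run(["bash", script], check=True)
```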
Create your custom dataset with the following structure:

{DATA_DIR}/
├── videos/         # Your video files
└── dataset.csv     # Query-video mapping
Your dataset.csv should contain:
- `query`: Text query for video retrieval
- `video_id`: Corresponding video filename (without path)
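For illustration, a minimal `dataset.csv` with these two columns can be produced as follows; the queries and filenames are hypothetical placeholders:

```python
# Write a toy dataset.csv with the two required columns.
import csv

rows = [
    {"query": "flood rescue operations in Valencia", "video_id": "video_0001.mp4"},
    {"query": "2023 Morocco earthquake aftermath", "video_id": "video_0002.mp4"},
]

with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "video_id"])
    writer.writeheader()
    writer.writerows(rows)
```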
echo "Transcribing videos"
python -m src.data.transcribe_audios \
--video_dir={DATA_DIR}/videos
echo "Processing raw data"
python -m src.data.query_decomp \
--data_dir={DATA_DIR} \
--video_dir={DATA_DIR}/videos \
--gen_max_model_len=2048
echo "Captioning frames"
python -m src.data.frame_caption \
--data_dir={DATA_DIR} \
--video_dir={DATA_DIR}/videos \
--gen_max_model_len=16384 \
--num_of_frames=16
echo "Captioning videos"
python -m src.data.frame2video_caption \
--data_dir={DATA_DIR} \
--video_dir={DATA_DIR}/videos \
--gen_max_model_len=16384 \
--num_of_frames=16
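The `--num_of_frames=16` flag in the captioning steps controls how many frames are taken from each video. As a point of reference, uniform frame sampling can be sketched as below; the repo's own frame-extraction code may differ in details such as the decoding backend:

```python
# Uniformly sample a fixed number of frames from a video with OpenCV.
import cv2
import numpy as np

def sample_frames(video_path: str, num_of_frames: int = 16):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_of_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```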
echo "Without ASR"
python -m src.eval.MultiCLIP.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy
echo "With ASR"
python -m src.eval.MultiCLIP.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy

- Evaluate using InternVideo2
echo "Without ASR"
python -m src.eval.InternVideo2.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy
echo "With ASR"
python -m src.eval.InternVideo2.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy
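`--aggregation_methods=inv_entropy` combines the per-component similarity scores produced by the query decomposition, weighting each component by the inverse entropy of its score distribution so that more confident components contribute more. The sketch below is a generic version of that idea under those assumptions; see the paper and the `src.eval` code for the exact formulation used here:

```python
# Generic inverse-entropy aggregation of per-component similarity scores.
import numpy as np

def inv_entropy_aggregate(scores: np.ndarray) -> np.ndarray:
    """scores: (num_components, num_videos) similarities for one decomposed query."""
    # Softmax each component's scores over the candidate videos.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Entropy per component; a low entropy means a peaked, confident ranking.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    weights = 1.0 / (entropy + 1e-12)
    weights /= weights.sum()
    # Weighted combination yields one fused score per video.
    return weights @ scores
```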
Perfect for: systems without root permissions, or when you need a consistent environment setup.

We recommend using udocker for containerized execution.
# Install udocker
uv add udocker
# Create and run the container
udocker pull runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker create --name="runpod" runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker setup --nvidia runpod
udocker run --volume="/${PWD}:/workspace" --name="runpod" runpod bash
# Inside the container
## install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
## install the dependencies
uv venv --seed --python=3.10
uv sync

If you find this work useful for your research, please consider citing:
@article{dipta2025q2e,
title={Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval},
author={Dipta, Shubhashis Roy and Ferraro, Francis},
journal={arXiv preprint arXiv:2506.10202},
year={2025}
}