Zero-Shot Multilingual Text-to-Video Retrieval
| Date | News |
|---|---|
| Oct 25, 2025 | Paper accepted at AACL 2025 |
| Jul 24, 2025 | Paper will be presented at the MAGMaR Workshop |
| Section | Description |
|---|---|
| Installation | Setup and environment configuration |
| Data | Datasets, models, and pre-generated data |
| Evaluation | Running experiments on MultiVENT & MSR-VTT-1kA |
| Data Generation Scripts | Scripts for generating training data |
| Use Your Own Data | Custom dataset integration |
| Using Docker | Containerized setup |
| Citation | How to cite this work |
System Requirements
Tested on CUDA 12.4 and A100. If you encounter issues, please use the Docker setup.
- Install UV (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
- Setup Environment:
uv venv --seed --python 3.10
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
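To confirm that the installed PyTorch build can actually see your GPU before running anything heavy (the setup above targets CUDA 12.4 on an A100), a quick check — assuming `torch` is installed by `uv sync` — is:

```python
# Sanity check: verify PyTorch sees the GPU and reports the expected CUDA version.
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA build version:", torch.version.cuda)
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```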
Download our pre-processed data for quick evaluation:
source .venv/bin/activate
gdown --fuzzy https://drive.google.com/file/d/1qcr9ZqHptibJKHOwyOrjjbwQTjcsp_Vk/view
tar -xzvf data.tgz

Note: Due to redistribution policies, videos must be downloaded separately.
| Dataset | Download Instructions | Save Location |
|---|---|---|
| MultiVENT | Download from NeurIPS | data/MultiVENT/videos/ |
| MSR-VTT | Download from Microsoft | data/MSR-VTT-1kA/videos/ |
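Once the videos are downloaded, a quick sanity check that they ended up in the expected folders can help; the snippet below assumes `.mp4` files, so adjust the pattern if your copies use a different container:

```python
# Count the downloaded videos in each dataset's expected directory.
from pathlib import Path

for video_dir in ["data/MultiVENT/videos", "data/MSR-VTT-1kA/videos"]:
    n = len(list(Path(video_dir).glob("*.mp4")))  # adjust the extension if needed
    print(f"{video_dir}: {n} videos found")
```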
mkdir -p data/models/MultiCLIP
wget -O data/models/MultiCLIP/open_clip_pytorch_model.bin \
https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/resolve/main/open_clip_pytorch_model.bin

mkdir -p data/models/InternVideo2
# Download from: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4
# Save as: data/models/InternVideo2/InternVideo2-stage2_1b-224p-f4.pt

Licensing: The InternVideo2 model has different licensing terms and must be downloaded separately.
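If you prefer a scripted download once you have accepted the model's license on Hugging Face, something like the following works; the `filename` inside the HF repository is an assumption, so check the model page's file list and adjust it if it differs:

```python
# Fetch the InternVideo2 checkpoint via huggingface_hub and place it where the
# evaluation code expects it. Requires `huggingface_hub` and, if the repo is gated,
# a prior `huggingface-cli login`.
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="OpenGVLab/InternVideo2-Stage2_1B-224p-f4",
    filename="InternVideo2-stage2_1b-224p-f4.pt",  # assumed filename; verify on the model page
)
Path("data/models/InternVideo2").mkdir(parents=True, exist_ok=True)
shutil.copy(ckpt, "data/models/InternVideo2/InternVideo2-stage2_1b-224p-f4.pt")
```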
Prerequisites: Ensure data is generated and populated in the `data` directory. See Data Generation Scripts for setup instructions.
| Dataset | Command | Description |
|---|---|---|
| MultiVENT | `bash scripts/eval_multivent.sh` | MultiVENT dataset evaluation |
| MSR-VTT-1kA | `bash scripts/eval_msrvtt.sh` | MSR-VTT-1kA dataset evaluation |
# MultiVENT Evaluation
bash scripts/eval_multivent.sh
# MSR-VTT-1kA Evaluation
bash scripts/eval_msrvtt.sh

Tip: Pre-generated data is available in the Pre-Generated Data section for quick evaluation.
| Dataset | Audio Transcription | Script | Description |
|---|---|---|---|
| MultiVENT | With ASR | `generate_multivent_asr.sh` | Full pipeline with audio transcription |
| MultiVENT | No ASR | `generate_multivent_noasr.sh` | Without audio transcription |
| MSR-VTT-1kA | With ASR | `generate_msrvtt_asr.sh` | Full pipeline with audio transcription |
| MSR-VTT-1kA | No ASR | `generate_msrvtt_noasr.sh` | Without audio transcription |
| All Datasets | Grid Search | `grid_search_data.py` | Comprehensive data generation |
# Generate data for specific dataset
bash scripts/generate_multivent_asr.sh # MultiVENT with ASR
bash scripts/generate_msrvtt_noasr.sh # MSR-VTT-1kA without ASR
# Or run grid search for all combinations
python scripts/grid_search_data.py
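Conceptually, the grid search just loops over every dataset/ASR combination and runs the corresponding generation script. A rough sketch of that idea is below; the actual `scripts/grid_search_data.py` may be organized differently and expose more options:

```python
# Minimal illustration of a data-generation grid search over dataset x ASR settings.
# The four shell scripts referenced here are the ones listed in the table above.
import itertools
import subprocess

datasets = ["multivent", "msrvtt"]
asr_settings = ["asr", "noasr"]

for dataset, asr in itertools.product(datasets, asr_settings):
    script = f"scripts/generate_{dataset}_{asr}.sh"
    print(f"Running {script} ...")
    subprocess.run(["bash", script], check=True)
```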
Create your custom dataset with the following structure:

{DATA_DIR}/
├── videos/         # Your video files
└── dataset.csv     # Query-video mapping
Your dataset.csv should contain:
- `query`: Text query for video retrieval
- `video_id`: Corresponding video filename (without path)
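For illustration, a minimal `dataset.csv` with these two columns can be produced as follows; the queries and filenames are hypothetical placeholders:

```python
# Write a toy dataset.csv with the two required columns.
import csv

rows = [
    {"query": "flood rescue operations in Valencia", "video_id": "video_0001.mp4"},
    {"query": "2023 Morocco earthquake aftermath", "video_id": "video_0002.mp4"},
]

with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "video_id"])
    writer.writeheader()
    writer.writerows(rows)
```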
echo "Transcribing videos"
python -m src.data.transcribe_audios \
--video_dir={DATA_DIR}/videos
echo "Processing raw data"
python -m src.data.query_decomp \
--data_dir={DATA_DIR} \
--video_dir={DATA_DIR}/videos \
--gen_max_model_len=2048
echo "Captioning frames"
python -m src.data.frame_caption \
--data_dir={DATA_DIR} \
--video_dir={DATA_DIR}/videos \
--gen_max_model_len=16384 \
--num_of_frames=16
echo "Captioning videos"
python -m src.data.frame2video_caption \
--data_dir={DATA_DIR} \
--video_dir={DATA_DIR}/videos \
--gen_max_model_len=16384 \
--num_of_frames=16
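The `--num_of_frames=16` flag in the captioning steps controls how many frames are taken from each video. As a point of reference, uniform frame sampling can be sketched as below; the repo's own frame-extraction code may differ in details such as the decoding backend:

```python
# Uniformly sample a fixed number of frames from a video with OpenCV.
import cv2
import numpy as np

def sample_frames(video_path: str, num_of_frames: int = 16):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_of_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```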
echo "Without ASR"
python -m src.eval.MultiCLIP.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy
echo "With ASR"
python -m src.eval.MultiCLIP.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy

- Evaluate using InternVideo2
echo "Without ASR"
python -m src.eval.InternVideo2.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy
echo "With ASR"
python -m src.eval.InternVideo2.infer \
--note=eval \
--dataset_dir={HFDatasetDIR} \
--aggregation_methods=inv_entropy
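`--aggregation_methods=inv_entropy` combines the per-component similarity scores produced by the query decomposition, weighting each component by the inverse entropy of its score distribution so that more confident components contribute more. The sketch below is a generic version of that idea under those assumptions; see the paper and the `src.eval` code for the exact formulation used here:

```python
# Generic inverse-entropy aggregation of per-component similarity scores.
import numpy as np

def inv_entropy_aggregate(scores: np.ndarray) -> np.ndarray:
    """scores: (num_components, num_videos) similarities for one decomposed query."""
    # Softmax each component's scores over the candidate videos.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Entropy per component; a low entropy means a peaked, confident ranking.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    weights = 1.0 / (entropy + 1e-12)
    weights /= weights.sum()
    # Weighted combination yields one fused score per video.
    return weights @ scores
```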
Perfect for: systems without root permissions, or when you need a consistent environment setup.

We recommend using udocker for containerized execution.
# Install udocker
uv add udocker
# Create and run the container
udocker pull runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker create --name="runpod" runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker setup --nvidia runpod
udocker run --volume="/${PWD}:/workspace" --name="runpod" runpod bash
# Inside the container
## install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
## install the dependencies
uv venv --seed --python=3.10
uv sync

If you find this work useful for your research, please consider citing:
@article{dipta2025q2e,
title={Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval},
author={Dipta, Shubhashis Roy and Ferraro, Francis},
journal={arXiv preprint arXiv:2506.10202},
year={2025}
}