
🎯 Q2E: Query-to-Event Decomposition

Zero-Shot Multilingual Text-to-Video Retrieval


πŸ† Accepted at AACL 2025 πŸ†

Paper | Website | Dataset

📢 Latest News

| Date | News |
|------|------|
| 🎉 Oct 25, 2025 | Paper accepted at AACL 2025 |
| 📅 Jul 24, 2025 | Paper will be presented at the MAGMaR Workshop |

📋 Table of Contents

| Section | Description |
|---------|-------------|
| 🚀 Installation | Setup and environment configuration |
| 📊 Data | Datasets, models, and pre-generated data |
| 🧪 Evaluation | Running experiments on MultiVENT & MSR-VTT-1kA |
| 🔧 Data Generation Scripts | Scripts for generating training data |
| 🎯 Use Your Own Data | Custom dataset integration |
| 🐳 Using Docker | Containerized setup |
| 📚 Citation | How to cite this work |

🚀 Installation

⚠️ System Requirements
Tested with CUDA 12.4 on an A100 GPU. If you encounter issues, please use the Docker setup.

Quick Start

  1. Install UV (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh

  2. Set up the environment:

    uv venv --seed --python 3.10
    uv sync
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
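
To confirm that the environment can see your GPU, a quick check along these lines can help (a minimal sketch, assuming PyTorch is installed by uv sync):

nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"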

πŸ“Š Data

🎯 Pre-Generated Data (Recommended)

Download our pre-processed data for quick evaluation:

source .venv/bin/activate
gdown --fuzzy https://drive.google.com/file/d/1qcr9ZqHptibJKHOwyOrjjbwQTjcsp_Vk/view
tar -xzvf data.tgz
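
After extraction, the dataset folders should appear under data/, matching the paths used throughout this README (assuming the archive unpacks into data/):

ls data/   # expect MultiVENT/ and MSR-VTT-1kA/ subfolders, plus models/ once downloaded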

🎬 Video Datasets

πŸ“ Note: Due to redistribution policies, videos must be downloaded separately.

| Dataset | Download Instructions | Save Location |
|---------|-----------------------|---------------|
| MultiVENT | Download from NeurIPS | data/MultiVENT/videos/ |
| MSR-VTT | Download from Microsoft | data/MSR-VTT-1kA/videos/ |

🤖 Pre-Trained Models

MultiCLIP Model

mkdir -p data/models/MultiCLIP
wget -O data/models/MultiCLIP/open_clip_pytorch_model.bin \
  https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/resolve/main/open_clip_pytorch_model.bin
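
To sanity-check the download, the checkpoint can be loaded through open_clip; the model name below is the one matching this laion checkpoint, but treat this as an unverified sketch:

python -c "
import open_clip
# Load the XLM-RoBERTa-large ViT-H-14 model from the local checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'xlm-roberta-large-ViT-H-14',
    pretrained='data/models/MultiCLIP/open_clip_pytorch_model.bin')
print('MultiCLIP checkpoint loaded')
"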

InternVideo2 Model

mkdir -p data/models/InternVideo2
# Download from: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4
# Save as: data/models/InternVideo2/InternVideo2-stage2_1b-224p-f4.pt
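
If you use the Hugging Face CLI, a download along these lines should work; the exact checkpoint filename inside the repository is an assumption, so check the model page first:

# Hypothetical filename -- verify it on the model page before running
huggingface-cli download OpenGVLab/InternVideo2-Stage2_1B-224p-f4 \
  InternVideo2-stage2_1b-224p-f4.pt \
  --local-dir data/models/InternVideo2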

βš–οΈ Licensing: InternVideo2 model has different licensing terms and must be downloaded separately.

🧪 Evaluation

📋 Prerequisites: Ensure the data has been generated and placed in the data directory. See Data Generation Scripts for setup instructions.

🎯 Quick Evaluation

| Dataset | Command | Description |
|---------|---------|-------------|
| MultiVENT | bash scripts/eval_multivent.sh | MultiVENT dataset evaluation |
| MSR-VTT-1kA | bash scripts/eval_msrvtt.sh | MSR-VTT-1kA dataset evaluation |

🚀 Run Evaluation

# MultiVENT Evaluation
bash scripts/eval_multivent.sh

# MSR-VTT-1kA Evaluation  
bash scripts/eval_msrvtt.sh

🔧 Data Generation Scripts

💡 Tip: Pre-generated data is available in the Pre-Generated Data section for quick evaluation.

📋 Available Scripts

| Dataset | Audio Transcription | Script | Description |
|---------|---------------------|--------|-------------|
| MultiVENT | ✅ With ASR | generate_multivent_asr.sh | Full pipeline with audio transcription |
| MultiVENT | ❌ No ASR | generate_multivent_noasr.sh | Without audio transcription |
| MSR-VTT-1kA | ✅ With ASR | generate_msrvtt_asr.sh | Full pipeline with audio transcription |
| MSR-VTT-1kA | ❌ No ASR | generate_msrvtt_noasr.sh | Without audio transcription |
| All Datasets | 🔄 Grid Search | grid_search_data.py | Comprehensive data generation |

🚀 Usage

# Generate data for specific dataset
bash scripts/generate_multivent_asr.sh    # MultiVENT with ASR
bash scripts/generate_msrvtt_noasr.sh     # MSR-VTT-1kA without ASR

# Or run grid search for all combinations
python scripts/grid_search_data.py

🎯 Use Your Own Data

πŸ“ Dataset Structure

Create your custom dataset with the following structure:

{DATA_DIR}/
β”œβ”€β”€ videos/           # Your video files
└── dataset.csv       # Query-video mapping

πŸ“ Dataset Format

Your dataset.csv should contain the following columns (see the example below):

  • query: Text query for video retrieval
  • video_id: Corresponding video filename (without path)
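
A minimal dataset.csv might look like this; the queries and filenames are hypothetical:

query,video_id
flood rescue operations in Chennai,video_001.mp4
new year celebration fireworks,video_002.mp4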

🚀 Generation Pipeline

echo "Transcribing videos"
python -m src.data.transcribe_audios \
    --video_dir={DATA_DIR}/videos

echo "Processing raw data"
python -m src.data.query_decomp \
    --data_dir={DATA_DIR} \
    --video_dir={DATA_DIR}/videos \
    --gen_max_model_len=2048

echo "Captioning frames"
python -m src.data.frame_caption \
    --data_dir={DATA_DIR} \
    --video_dir={DATA_DIR}/videos \
    --gen_max_model_len=16384 \
    --num_of_frames=16

echo "Captioning videos"
python -m src.data.frame2video_caption \
    --data_dir={DATA_DIR} \
    --video_dir={DATA_DIR}/videos \
    --gen_max_model_len=16384 \
    --num_of_frames=16
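
In an actual shell session, replace the {DATA_DIR} placeholder above with a concrete path before running the commands, e.g. (hypothetical path):

DATA_DIR=data/my_dataset   # hypothetical root containing videos/ and dataset.csv

The same applies to the {HFDatasetDIR} placeholder in the evaluation commands below.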

  1. Evaluate using MultiCLIP:

echo "Without ASR"
python -m src.eval.MultiCLIP.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy

echo "With ASR"
python -m src.eval.MultiCLIP.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy

  2. Evaluate using InternVideo2:

echo "Without ASR"
python -m src.eval.InternVideo2.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy

echo "With ASR"
python -m src.eval.InternVideo2.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy

🐳 Using Docker

💡 Perfect for: systems without root permissions, or when a consistent environment setup is needed.

We recommend using udocker for containerized execution.

# Install udocker
uv add udocker
# Create and run the container
udocker pull runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker create --name="runpod" runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker setup --nvidia runpod
udocker run --volume="/${PWD}:/workspace" --name="runpod" runpod bash

# Inside the container
## install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

## install the dependencies
uv venv --seed --python=3.10
uv sync
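
Once the dependencies are synced inside the container, the evaluation scripts run the same way as on a bare system:

source .venv/bin/activate
bash scripts/eval_multivent.sh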

📚 Citation

If you find this work useful for your research, please consider citing:

@article{dipta2025q2e,
  title={Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval},
  author={Dipta, Shubhashis Roy and Ferraro, Francis},
  journal={arXiv preprint arXiv:2506.10202},
  year={2025}
}

⭐ If you found this project helpful, please give it a star! ⭐
