Objective evaluation on Linux is now wrapped by two top-level scripts: setup_objective.sh and run_objective_batch.sh.
This flow was validated on a Linux server with a single RTX 4090 GPU. All paths are resolved relative to the repository root by default; set environment variables to override them when you need a different cache or Conda location.
Default relative layout:
- Repository root: T2AV-Compass/
- Input videos: input/
- Prompts file: t2av-compass/Data/prompts.json
- Output directory: Output/
- Cache root: .cache/t2av-cache
- Conda envs: .cache/conda/envs
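The interplay between these defaults and the environment overrides can be sketched with shell parameter expansion. This is an illustration only, not the code in setup_objective.sh: the helper name resolve_paths is hypothetical, and the assumption that the Conda envs directory sits at envs/ under T2AV_CONDA_ROOT is inferred from the default layout.

```shell
# Hypothetical helper: show which cache and Conda paths would be used.
# Assumption: an unset or empty variable falls back to the repository-relative default.
resolve_paths() (
  repo_root="${1:-$PWD}"
  cache_root="${T2AV_CACHE_ROOT:-$repo_root/.cache/t2av-cache}"
  conda_root="${T2AV_CONDA_ROOT:-$repo_root/.cache/conda}"
  printf 'cache_root=%s\n' "$cache_root"
  printf 'conda_envs=%s\n' "$conda_root/envs"
)

# Example: resolve_paths "$PWD"
```

The function body runs in a subshell, so the temporary variables do not leak into the calling shell.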
Use submodules.
git clone --recurse-submodules https://github.com/NJU-LINK/T2AV-Compass.git
cd T2AV-Compass
If GitHub is slow in your region, you can optionally clone through a mirror instead. Keep the checked-out repository layout unchanged.
Optional environment overrides before setup:
export T2AV_CACHE_ROOT=/path/to/cache-root
export T2AV_CONDA_ROOT=/path/to/conda-root
export HF_ENDPOINT=https://huggingface.co
# In mainland China, set this explicitly instead:
# export HF_ENDPOINT=https://hf-mirror.com
# optional when GitHub downloads need a mirror
export T2AV_GITHUB_MIRROR_PREFIX=https://your-mirror.example
bash setup_objective.sh
What this script does:
- installs system packages such as ffmpeg
- creates all required Conda environments
- downloads checkpoints for DOVER, AudioBox, ImageBind, Synchformer, and LatentSync
- pre-creates cache directories under .cache/ by default
The script is safe to re-run.
Put videos into input/.
Supported video naming conventions for prompt-linked metrics (T-V, T-A) include:
- sample_0001.mp4
- sample_0002.mp4
- 1.mp4
- 0001.mp4
- video_0001.mp4
The index field in prompts.json must match the video file index.
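To check a file name against prompts.json before a run, the index can be derived from the name. The sketch below assumes the index is the trailing run of digits in the file name with leading zeros ignored, which is consistent with the listed examples; the repository's actual matching logic may differ, and video_index is a hypothetical helper name.

```shell
# Hypothetical helper: print the numeric index encoded in a video file name.
# Assumption: the index is the trailing digit run before the extension.
video_index() (
  name="$(basename "$1")"
  name="${name%.*}"              # drop the extension
  digits="${name##*[!0-9]}"      # keep only the trailing digits
  [ -n "$digits" ] || return 1   # no index in this name
  # Strip leading zeros so 0001 compares equal to "index": 1.
  while [ "${#digits}" -gt 1 ] && [ "${digits#0}" != "$digits" ]; do
    digits="${digits#0}"
  done
  printf '%s\n' "$digits"
)

# Example: video_index input/sample_0001.mp4
```

Under this assumption, sample_0001.mp4, 0001.mp4, 1.mp4, and video_0001.mp4 all map to index 1, so only one of them should be present in a given input directory.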
Example layout:
T2AV-Compass/
├── input/
│ ├── sample_0001.mp4
│ └── sample_0002.mp4
├── Output/
├── setup_objective.sh
├── run_objective_batch.sh
└── t2av-compass/
└── Data/
└── prompts.json
Minimal t2av-compass/Data/prompts.json example:
[
{
"index": 1,
"prompt": "A person speaking directly to the camera.",
"video_prompt": "A person speaking directly to the camera.",
"audio_prompt": "clean speech from a person speaking indoors",
"speech_prompt": []
},
{
"index": 2,
"prompt": "A person speaking directly to the camera.",
"video_prompt": "A person speaking directly to the camera.",
"audio_prompt": "clean speech from a person speaking indoors",
"speech_prompt": []
}
]
Default paths:
bash run_objective_batch.sh
Custom paths:
bash run_objective_batch.sh /abs/path/to/input /abs/path/to/prompts.json /abs/path/to/output
The batch runs all objective metrics:
- VT: video technical quality
- VA: video aesthetic quality
- AA: audio aesthetic quality
- SQ: speech quality
- T-V: text-video alignment
- T-A: text-audio alignment
- A-V: audio-video alignment
- DeSync: audio-video synchronization error
- LS: lip-sync quality
After a successful run, Output/ contains:
- video_technical.json
- video_aesthetic.json
- audio_aesthetic.json
- speech_quality.json
- text_video_alignment.json
- text_audio_alignment.json
- audio_video_alignment.json
- av_sync.json
- lipsync.json
- evaluation_summary.json
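A quick way to confirm that a batch produced every expected result file is to test for each name in the output directory. The following is a minimal sketch, not part of the repository; missing_outputs is a hypothetical helper, and it only checks that the files exist, not that their contents are valid.

```shell
# Hypothetical helper: print each expected result file that is missing.
# The file names are the ones listed for a successful run.
missing_outputs() (
  out_dir="${1:-Output}"
  for name in video_technical video_aesthetic audio_aesthetic speech_quality \
              text_video_alignment text_audio_alignment audio_video_alignment \
              av_sync lipsync evaluation_summary; do
    [ -f "$out_dir/$name.json" ] || printf '%s\n' "$name.json"
  done
)

# Example: missing_outputs Output
```

An empty result means all ten files are present; any printed name points at the metric whose script should be re-run.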
Run these from the repository root.
bash t2av-compass/scripts/eval_video_technical.sh input Output
bash t2av-compass/scripts/eval_video_aesthetic.sh input Output
bash t2av-compass/scripts/eval_audio_aesthetic.sh input Output
bash t2av-compass/scripts/eval_speech_quality.sh input Output
bash t2av-compass/scripts/eval_text_video_alignment.sh input t2av-compass/Data/prompts.json Output
bash t2av-compass/scripts/eval_text_audio_alignment.sh input t2av-compass/Data/prompts.json Output
bash t2av-compass/scripts/eval_audio_video_alignment.sh input Output
bash t2av-compass/scripts/eval_av_sync.sh input Output
bash t2av-compass/scripts/eval_lipsync.sh input Output
- No manual Hugging Face login is required for the checkpoints used in the validated objective flow.
- The default Hugging Face endpoint is the official https://huggingface.co. In mainland China, explicitly set HF_ENDPOINT=https://hf-mirror.com; in other regions this is usually unnecessary.
- You can override cache and Conda locations with T2AV_CACHE_ROOT, T2AV_CONDA_ROOT, or T2AV_CONDA_EXE.
- The first DeSync run downloads an additional large MotionFormer checkpoint.
- LS is intended for talking-face videos. For non-talking-face content, the score is not meaningful even if the script finishes.
- Re-running setup_objective.sh or run_objective_batch.sh is supported.
- 视频文件未找到 (index: N) (the message means "video file not found"): rename the file to match one of the supported index patterns, or fix the index field in prompts.json.
- ffmpeg not found: run bash setup_objective.sh again on a Debian/Ubuntu-like system with package manager access.
- Mirror/network failures during checkpoint download: retry first; if needed, set HF_ENDPOINT or T2AV_GITHUB_MIRROR_PREFIX before running setup.
- LS fails on a batch with no visible speaking face: use talking-face videos for this metric.
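Because setup is safe to re-run, the "retry first" advice for download failures can be automated with a small wrapper. This is a generic sketch rather than something the repository provides: the retry function and the RETRY_DELAY variable are hypothetical names introduced here.

```shell
# Hypothetical helper: run a command until it succeeds, up to a maximum
# number of attempts, sleeping RETRY_DELAY seconds (default 5) in between.
retry() (
  max_attempts="$1"
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1                   # give up after the last allowed attempt
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"
  done
)

# Example: retry 3 bash setup_objective.sh
```

If all attempts fail, switching HF_ENDPOINT or T2AV_GITHUB_MIRROR_PREFIX before the next run is the more likely fix than further retries.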
T2AV-Compass is a unified benchmark for evaluating Text-to-Audio-Video generation across:
- unimodal quality
- cross-modal alignment and synchronization
- checklist-based subjective evaluation
The benchmark includes 500 prompts and associated checklist annotations. For subjective evaluation and repository internals, see t2av-compass/README.md.