Character-aware audio-visual subtitling is an audio-visual diarisation pipeline for scripted TV content. It combines WhisperX for transcription, a custom audio-visual attention model, RetinaFace-based face clustering, and SpeechBrain speaker embeddings to produce per-character subtitles aligned with the original video.
- Use Python 3.10.
```bash
git clone https://github.com/JaesungHuh/ca-subtitle.git
cd ca-subtitle
pip install -r requirements.txt
```
- If you use uv:

```bash
uv sync
source .venv/bin/activate
```
- Download checkpoints and sample data to run this pipeline.

```bash
bash download_data.sh
```
All runtime options live in YAML files under `config/`. Two templates are provided:

- `config/pipeline_sample.yaml` – minimal example pointing at `sample_data/`.
- `config/pipeline_config.yaml` – richer template for full-season batch processing.
Important sections:
- `paths`: input/output roots, weight locations, temp directories.
- `processing`: per-stage parameters (Whisper model size, AV batch size, face thresholds, speaker clustering).
- `runtime`: GPU assignment, number of parallel episodes, and caching/cleanup flags.
Copy one of the templates, adjust the paths, and pass it to `run_pipeline.py --config /path/to/config.yaml`.
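As a rough illustration, the top-level sections can be inspected with PyYAML. The section names below come from this README; the keys mentioned in the comments are examples only and may not match the actual schema.

```python
# Minimal sketch: load a pipeline config and look at its top-level sections.
import yaml  # PyYAML

with open("config/pipeline_sample.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["paths"])       # input/output roots, weight locations, temp directories
print(cfg["processing"])  # Whisper model size, AV batch size, face/speaker thresholds
print(cfg["runtime"])     # gpu_ids, number of parallel episodes, caching/cleanup flags
```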
The paths under `paths.*_root` expect a consistent naming scheme so the orchestrator can discover assets automatically:

- `video_root/<ShowName>/<ShowName>_<SS>x<EE>.mp4` – source episode video per show, where `SS` and `EE` are zero-padded season/episode numbers (e.g. `Seinfeld/Seinfeld_03x05.mp4`).
- `audio_root/<ShowName>/<ShowName>_<SS>x<EE>.wav` – pre-extracted mono 16 kHz WAV audio aligned with the corresponding video file.
- `castlist_root/<ShowName>/<ShowName>_<SS>x<EE>.txt` – CSV (no header) with `CharacterName,RealName` rows used to map diarisation outputs to on-screen names.
- `person_images/<ShowName>_<SS>x<EE>/<CharacterFolder>/*.jpg` – reference face crops for each character. If per-episode folders are unavailable, place character folders directly under `person_images/<CharacterFolder>/*.jpg` to share exemplars across episodes.
Adjust the extensions if your media differs, but keep the episode ID format identical to what is configured under `shows` so the pipeline can resolve files.
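The helper below is a sketch of how this naming scheme resolves per-episode assets. It is not part of the repository; the `paths` dict stands in for the `*_root` entries from the config, and the example values are illustrative.

```python
# Sketch: resolve per-episode assets from the naming scheme described above.
from pathlib import Path

def episode_assets(paths: dict, show: str, season: int, episode: int) -> dict:
    ep_id = f"{show}_{season:02d}x{episode:02d}"  # e.g. Seinfeld_03x05
    return {
        "video": Path(paths["video_root"]) / show / f"{ep_id}.mp4",
        "audio": Path(paths["audio_root"]) / show / f"{ep_id}.wav",        # mono 16 kHz WAV
        "castlist": Path(paths["castlist_root"]) / show / f"{ep_id}.txt",  # CharacterName,RealName rows
        "faces": Path(paths["person_images"]) / ep_id,                     # <CharacterFolder>/*.jpg crops
    }

print(episode_assets(
    {"video_root": "sample_data/video", "audio_root": "sample_data/audio",
     "castlist_root": "sample_data/castlist", "person_images": "sample_data/person_images"},
    "Seinfeld", 3, 5,
))
```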
```bash
# Process every episode defined in the config sequentially
python run_pipeline.py --config config/pipeline_sample.yaml --sequential

# Process specific episodes (parallel by default)
python run_pipeline.py --episodes Seinfeld_03x01 Seinfeld_03x02

# Clean intermediate artifacts after a successful run
python run_pipeline.py --cleanup
```

Stages run in order: `audio_processing` → `av_analysis` → `face_processing` → `speaker_identification`. Results are saved under `exp/<run_name>/output/<show>/<episode>/` as JSON plus RTTM/SRT subtitle files.
If a run is interrupted, the pipeline writes `pipeline_checkpoint.json` under the intermediate directory; re-running will resume from the last unfinished stage.
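Conceptually the resume logic looks like the sketch below. The stage order matches this README, but the checkpoint schema (a `completed_stages` list) and the example directory are assumptions, not the actual format written by the pipeline.

```python
# Sketch of stage resumption; checkpoint schema and paths are illustrative.
import json
from pathlib import Path

STAGES = ["audio_processing", "av_analysis", "face_processing", "speaker_identification"]

def remaining_stages(intermediate_dir: str) -> list[str]:
    ckpt = Path(intermediate_dir) / "pipeline_checkpoint.json"
    done = set()
    if ckpt.exists():
        done = set(json.loads(ckpt.read_text()).get("completed_stages", []))  # assumed key
    return [stage for stage in STAGES if stage not in done]

print(remaining_stages("exp/my_run/intermediate/Seinfeld/Seinfeld_03x01"))
```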
Each episode directory contains:
- `<episode>.json` – aggregated stage results.
- `<episode>.srt` – speaker-labelled subtitles with Whisper transcripts.
- `<episode>.rttm` – diarisation file for evaluation tools (see the example reader below).
- `logs/`, `temp/`, `intermediate/` – per-stage diagnostics and cached artifacts (optional, based on runtime flags).
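The RTTM file follows the standard diarisation format (one `SPEAKER` line per segment carrying the file ID, channel, onset and duration in seconds, and the speaker label), so it can be loaded with a few lines of Python. The run name in the example path is illustrative.

```python
# Read a per-episode RTTM into (speaker, start, end) segments.
def read_rttm(path: str):
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            onset, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
            segments.append((speaker, onset, onset + duration))
    return segments

for speaker, start, end in read_rttm("exp/my_run/output/Seinfeld/Seinfeld_03x01/Seinfeld_03x01.rttm"):
    print(f"{speaker}: {start:.2f}s - {end:.2f}s")
```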
- Keep `config/pipeline_sample.yaml` aligned with `sample_data/` to ensure smoke tests stay functional.
- Large intermediate files live under `exp/`; clean the directory regularly or enable `runtime.cleanup_temp`.
- GPU assignment is controlled via `runtime.gpu_ids`. Leave the list empty to force CPU-only execution (slower); invalid or unavailable IDs are ignored automatically and the pipeline falls back to CPU (see the sketch after this list).
- AV analysis uses CuPy and Torch; ensure your CUDA toolkit matches the versions specified in `requirements.txt`.
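If you want to check your `runtime.gpu_ids` values before launching a long run, a quick standalone Torch check looks like this (the pipeline itself already ignores invalid IDs, this just surfaces the fallback earlier):

```python
# Standalone sanity check for the IDs you plan to put under runtime.gpu_ids.
import torch

gpu_ids = [0, 1]  # whatever you intend to set under runtime.gpu_ids

if torch.cuda.is_available():
    usable = [i for i in gpu_ids if 0 <= i < torch.cuda.device_count()]
    print(f"Usable GPUs: {usable if usable else 'none, pipeline will fall back to CPU'}")
else:
    print("CUDA not available, pipeline will run on CPU")
```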
- In our original paper, we used CLIPPAD to recognise characters. This repository uses RetinaFace plus a SENet‑based face‑embedding model, which produces much better results.
- In the original paper, we applied the same nearest-neighbour–based audio filtering across all episodes within each show, whereas this repository applies it within each episode individually.
- We use only off‑the‑shelf models in this pipeline. You can plug in new models to obtain better transcription, diarisation, or character‑recognition results.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
- WhisperX
- avobjects_repo
- SpeechBrain
- RetinaFace
- Special thanks to Codex, which helped me refactor the original indecipherable code.
- This is a joint work with Bruno and AZ.
Please cite the following if you use this repository for academic purposes.
```bibtex
@inproceedings{korbar24,
  author = "Bruno Korbar and Jaesung Huh and Andrew Zisserman",
  title = "Look, Listen and Recognise: character-aware audio-visual subtitling",
  booktitle = "International Conference on Acoustics, Speech, and Signal Processing",
  year = "2024",
}
```

