Character-aware audio-visual subtitling is an audio-visual diarisation pipeline for scripted TV content. It combines WhisperX for transcription, a custom audio-visual attention model, RetinaFace-based face clustering, and SpeechBrain speaker embeddings to produce per-character subtitles aligned with the original video.
- Use Python 3.10.
```bash
git clone https://github.com/JaesungHuh/ca-subtitle.git
cd ca-subtitle
pip install -r requirements.txt
```
- If you use uv:

```bash
uv sync
source .venv/bin/activate
```
- Download checkpoints and sample data to run this pipeline.

```bash
bash download_data.sh
```
All runtime options live in YAML files under `config/`. Two templates are provided:

- `config/pipeline_sample.yaml` – minimal example pointing at `sample_data/`.
- `config/pipeline_config.yaml` – richer template for full-season batch processing.
Important sections:
- `paths`: input/output roots, weight locations, temp directories.
- `processing`: per-stage parameters (Whisper model size, AV batch size, face thresholds, speaker clustering).
- `runtime`: GPU assignment, number of parallel episodes, and caching/cleanup flags.
Copy one of the templates, adjust the paths, and pass it to `run_pipeline.py --config /path/to/config.yaml`.
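As a rough illustration, the top-level sections can be inspected with PyYAML. The section names below come from this README; the keys mentioned in the comments are examples only and may not match the actual schema.

```python
# Minimal sketch: load a pipeline config and look at its top-level sections.
import yaml  # PyYAML

with open("config/pipeline_sample.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["paths"])       # input/output roots, weight locations, temp directories
print(cfg["processing"])  # Whisper model size, AV batch size, face/speaker thresholds
print(cfg["runtime"])     # gpu_ids, number of parallel episodes, caching/cleanup flags
```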
The paths under `paths.*_root` expect a consistent naming scheme so the orchestrator can discover assets automatically:

- `video_root/<ShowName>/<ShowName>_<SS>x<EE>.mp4` – source episode video per show, where `SS` and `EE` are zero-padded season/episode numbers (e.g. `Seinfeld/Seinfeld_03x05.mp4`).
- `audio_root/<ShowName>/<ShowName>_<SS>x<EE>.wav` – pre-extracted mono 16 kHz WAV audio aligned with the corresponding video file.
- `castlist_root/<ShowName>/<ShowName>_<SS>x<EE>.txt` – CSV (no header) with `CharacterName,RealName` rows used to map diarisation outputs to on-screen names.
- `person_images/<ShowName>_<SS>x<EE>/<CharacterFolder>/*.jpg` – reference face crops for each character. If per-episode folders are unavailable, place character folders directly under `person_images/<CharacterFolder>/*.jpg` to share exemplars across episodes.
Adjust the extensions if your media differs, but keep the episode ID format identical to what is configured under `shows` so the pipeline can resolve files.
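The helper below is a sketch of how this naming scheme resolves per-episode assets. It is not part of the repository; the `paths` dict stands in for the `*_root` entries from the config, and the example values are illustrative.

```python
# Sketch: resolve per-episode assets from the naming scheme described above.
from pathlib import Path

def episode_assets(paths: dict, show: str, season: int, episode: int) -> dict:
    ep_id = f"{show}_{season:02d}x{episode:02d}"  # e.g. Seinfeld_03x05
    return {
        "video": Path(paths["video_root"]) / show / f"{ep_id}.mp4",
        "audio": Path(paths["audio_root"]) / show / f"{ep_id}.wav",        # mono 16 kHz WAV
        "castlist": Path(paths["castlist_root"]) / show / f"{ep_id}.txt",  # CharacterName,RealName rows
        "faces": Path(paths["person_images"]) / ep_id,                     # <CharacterFolder>/*.jpg crops
    }

print(episode_assets(
    {"video_root": "sample_data/video", "audio_root": "sample_data/audio",
     "castlist_root": "sample_data/castlist", "person_images": "sample_data/person_images"},
    "Seinfeld", 3, 5,
))
```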
```bash
# Process every episode defined in the config sequentially
python run_pipeline.py --config config/pipeline_sample.yaml --sequential

# Process specific episodes (parallel by default)
python run_pipeline.py --episodes Seinfeld_03x01 Seinfeld_03x02

# Clean intermediate artifacts after a successful run
python run_pipeline.py --cleanup
```

Stages run in order: `audio_processing` → `av_analysis` → `face_processing` → `speaker_identification`. Results are saved under `exp/<run_name>/output/<show>/<episode>/` as JSON plus RTTM/SRT subtitle files.
If a run is interrupted, the pipeline writes `pipeline_checkpoint.json` under the intermediate directory; re-running will resume from the last unfinished stage.
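Conceptually the resume logic looks like the sketch below. The stage order matches this README, but the checkpoint schema (a `completed_stages` list) and the example directory are assumptions, not the actual format written by the pipeline.

```python
# Sketch of stage resumption; checkpoint schema and paths are illustrative.
import json
from pathlib import Path

STAGES = ["audio_processing", "av_analysis", "face_processing", "speaker_identification"]

def remaining_stages(intermediate_dir: str) -> list[str]:
    ckpt = Path(intermediate_dir) / "pipeline_checkpoint.json"
    done = set()
    if ckpt.exists():
        done = set(json.loads(ckpt.read_text()).get("completed_stages", []))  # assumed key
    return [stage for stage in STAGES if stage not in done]

print(remaining_stages("exp/my_run/intermediate/Seinfeld/Seinfeld_03x01"))
```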
Each episode directory contains:
- `<episode>.json` – aggregated stage results.
- `<episode>.srt` – speaker-labelled subtitles with Whisper transcripts.
- `<episode>.rttm` – diarisation file for evaluation tools (see the example reader below).
- `logs/`, `temp/`, `intermediate/` – per-stage diagnostics and cached artifacts (optional, based on runtime flags).
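The RTTM file follows the standard diarisation format (one `SPEAKER` line per segment carrying the file ID, channel, onset and duration in seconds, and the speaker label), so it can be loaded with a few lines of Python. The run name in the example path is illustrative.

```python
# Read a per-episode RTTM into (speaker, start, end) segments.
def read_rttm(path: str):
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            onset, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
            segments.append((speaker, onset, onset + duration))
    return segments

for speaker, start, end in read_rttm("exp/my_run/output/Seinfeld/Seinfeld_03x01/Seinfeld_03x01.rttm"):
    print(f"{speaker}: {start:.2f}s - {end:.2f}s")
```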
- Keep `config/pipeline_sample.yaml` aligned with `sample_data/` to ensure smoke tests stay functional.
- Large intermediate files live under `exp/`; clean the directory regularly or enable `runtime.cleanup_temp`.
- GPU assignment is controlled via `runtime.gpu_ids`. Leave the list empty to force CPU-only execution (slower); invalid or unavailable IDs are ignored automatically and the pipeline falls back to CPU (see the sketch after this list).
- AV analysis uses CuPy and Torch; ensure your CUDA toolkit matches the versions specified in `requirements.txt`.
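If you want to check your `runtime.gpu_ids` values before launching a long run, a quick standalone Torch check looks like this (the pipeline itself already ignores invalid IDs, this just surfaces the fallback earlier):

```python
# Standalone sanity check for the IDs you plan to put under runtime.gpu_ids.
import torch

gpu_ids = [0, 1]  # whatever you intend to set under runtime.gpu_ids

if torch.cuda.is_available():
    usable = [i for i in gpu_ids if 0 <= i < torch.cuda.device_count()]
    print(f"Usable GPUs: {usable if usable else 'none, pipeline will fall back to CPU'}")
else:
    print("CUDA not available, pipeline will run on CPU")
```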
- In our original paper, we used CLIPPAD to recognise characters. This repository uses RetinaFace plus a SENet‑based face‑embedding model, which produces much better results.
- In the original paper, we applied the same nearest-neighbour–based audio filtering across all episodes within each show, whereas this repository applies it within each episode individually.
- We use only off‑the‑shelf models in this pipeline. You can plug in new models to obtain better transcription, diarisation, or character‑recognition results.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
- WhisperX
- avobjects_repo
- SpeechBrain
- RetinaFace
- Special thanks to Codex, which helped me refactor the original indecipherable code.
- This is a joint work with Bruno and AZ.
Please cite the following if you use this repository for academic purposes.
```bibtex
@inproceedings{korbar24,
  author = "Bruno Korbar and Jaesung Huh and Andrew Zisserman",
  title = "Look, Listen and Recognise: character-aware audio-visual subtitling",
  booktitle = "International Conference on Acoustics, Speech, and Signal Processing",
  year = "2024",
}
```

