
Character-aware audio-visual subtitling

[arXiv] [project_page]

Character-aware audio-visual subtitling is an audiovisual diarisation pipeline for scripted TV content. It combines WhisperX for transcription, a custom audiovisual attention model, RetinaFace-based face clustering, and SpeechBrain speaker embeddings to produce per-character subtitles aligned with the original video.

Character-Aware Subtitle Pipeline

Output Example

Environment Setup

  • Use Python 3.10. Clone the repository and install the dependencies with pip:
git clone https://github.com/JaesungHuh/ca-subtitle.git
cd ca-subtitle
pip install -r requirements.txt
  • If you use uv instead:
uv sync
source .venv/bin/activate

Download models and sample data

  • Download the checkpoints and sample data needed to run the pipeline:
bash download_data.sh

Configuration

All runtime options live in YAML under config/. Two templates are provided:

  • config/pipeline_sample.yaml – minimal example pointing at sample_data/.
  • config/pipeline_config.yaml – richer template for full-season batch processing.

Important sections:

  • paths: input/output roots, weight locations, temp directories.
  • processing: per-stage parameters (Whisper model size, AV batch size, face thresholds, speaker clustering).
  • runtime: GPU assignment, number of parallel episodes, and caching/cleanup flags.

Copy one of the templates, adjust paths, and pass it to run_pipeline.py --config /path/to/config.yaml.
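
For orientation, a trimmed configuration might look roughly like the sketch below. The section names (paths, processing, runtime, shows) and the *_root, gpu_ids and cleanup_temp keys are the ones described above; the remaining key names and values are illustrative placeholders, so treat the shipped templates as the authoritative schema.

paths:
  video_root: /data/videos              # <ShowName>/<ShowName>_<SS>x<EE>.mp4
  audio_root: /data/audio               # pre-extracted 16 kHz mono WAVs
  castlist_root: /data/castlists        # CharacterName,RealName CSVs
  person_images: /data/person_images    # reference face crops
  weights_root: weights/                # illustrative key name
  temp_root: exp/tmp                    # illustrative key name

processing:
  whisper_model: large-v2               # illustrative key name and value
  av_batch_size: 32                     # illustrative key name and value
  face_threshold: 0.5                   # illustrative key name and value

runtime:
  gpu_ids: [0]                          # empty list forces CPU-only execution
  max_parallel_episodes: 2              # illustrative key name
  cleanup_temp: false

shows:
  Seinfeld:
    episodes: [03x01, 03x02]            # illustrative layout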

Data directory layout

The paths under paths.*_root expect a consistent naming scheme so the orchestrator can discover assets automatically:

  • video_root/<ShowName>/<ShowName>_<SS>x<EE>.mp4 – source episode video per show, where SS and EE are zero-padded season/episode numbers (e.g. Seinfeld/Seinfeld_03x05.mp4).
  • audio_root/<ShowName>/<ShowName>_<SS>x<EE>.wav – pre-extracted mono 16 kHz WAV audio aligned with the corresponding video file.
  • castlist_root/<ShowName>/<ShowName>_<SS>x<EE>.txt – CSV (no header) with CharacterName,RealName rows used to map diarisation outputs to on-screen names.
  • person_images/<ShowName>_<SS>x<EE>/<CharacterFolder>/*.jpg – reference face crops for each character. If per-episode folders are unavailable, place character folders directly under person_images/CharacterFolder/*.jpg to share exemplars across episodes.

Adjust the extensions if your media differs, but keep the episode ID format identical to what is configured under shows so the pipeline can resolve files.
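
As a concrete illustration, an episode laid out under this scheme would look like the listing below. The character-folder and file names are only examples, and we assume (without it being guaranteed here) that the folder names match the character names in the castlist.

video_root/Seinfeld/Seinfeld_03x05.mp4
audio_root/Seinfeld/Seinfeld_03x05.wav
castlist_root/Seinfeld/Seinfeld_03x05.txt
person_images/Seinfeld_03x05/Jerry/0001.jpg
person_images/Seinfeld_03x05/Kramer/0001.jpg

The castlist itself is plain CharacterName,RealName rows, for example:

Jerry,Jerry Seinfeld
Kramer,Michael Richards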

Running the Pipeline

# Process every episode defined in the config sequentially
python run_pipeline.py --config config/pipeline_sample.yaml --sequential

# Process specific episodes (parallel by default)
python run_pipeline.py --episodes Seinfeld_03x01 Seinfeld_03x02

# Clean intermediate artifacts after a successful run
python run_pipeline.py --cleanup
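
# Combine an explicit config with selected episodes
# (both flags are shown above; the episode ID here is illustrative)
python run_pipeline.py --config config/pipeline_config.yaml --episodes Seinfeld_03x05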

Stages run in order: audio_processing → av_analysis → face_processing → speaker_identification. Results are saved under exp/<run_name>/output/<show>/<episode>/ as JSON plus RTTM/SRT subtitle files.

If a run is interrupted, the pipeline writes pipeline_checkpoint.json under the intermediate directory; re-running will resume from the last unfinished stage.

Outputs

Each episode directory contains:

  • <episode>.json – aggregated stage results.
  • <episode>.srt – speaker-labelled subtitles with Whisper transcripts.
  • <episode>.rttm – diarisation file for evaluation tools (illustrative .srt and .rttm snippets follow this list).
  • logs/, temp/, intermediate/ – per-stage diagnostics and cached artifacts (optional, based on runtime flags).
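
For reference, a subtitle cue and a diarisation line might look roughly like the snippets below. These are illustrative only: the placement of the character label inside the SRT cue, and the use of the episode ID and character name as the RTTM file and speaker fields, are assumptions rather than a specification of the exact output format.

1
00:00:12,340 --> 00:00:14,440
Jerry: <transcribed line from Whisper>

SPEAKER Seinfeld_03x05 1 12.34 2.10 <NA> <NA> Jerry <NA> <NA>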

Development Tips

  • Keep config/pipeline_sample.yaml aligned with sample_data/ to ensure smoke tests stay functional.
  • Large intermediate files live under exp/; clean the directory regularly or enable runtime.cleanup_temp.
  • GPU assignment is controlled via runtime.gpu_ids. Leave the list empty to force CPU-only execution (slower); invalid or unavailable IDs are ignored automatically and the pipeline falls back to CPU (see the snippet after this list).
  • AV analysis uses CuPy and Torch; ensure your CUDA toolkit matches the versions specified in requirements.txt.
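
For example, a runtime block that pins two GPUs and enables cleanup could look like the snippet below (gpu_ids and cleanup_temp are the flags mentioned above; any other runtime keys should be taken from the shipped templates):

runtime:
  gpu_ids: [0, 1]      # set to [] to force CPU-only execution
  cleanup_temp: true   # remove intermediate artifacts after a successful run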

Comments

  • In our original paper, we used CLIPPAD to recognise characters. This repository uses RetinaFace plus a SENet‑based face‑embedding model, which produces much better results.
  • In the original paper, we applied the same nearest-neighbour–based audio filtering across all episodes within each show, whereas this repository applies it within each episode individually.
  • We use only off‑the‑shelf models in this pipeline. You can plug in new models to obtain better transcription, diarisation, or character‑recognition results.

License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.

Acknowledgements

Citation

Please cite the following paper if you use this repository for academic purposes.

@inproceedings{korbar24,
  author       = "Bruno Korbar and Jaesung Huh and Andrew Zisserman",
  title        = "Look, Listen and Recognise: character-aware audio-visual subtitling",
  booktitle    = "International Conference on Acoustics, Speech, and Signal Processing",
  year         = "2024",
}
