- arXiv preprint
- Authors: Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman
- Python: 3.12
- Recommended manager: conda
- Setup:

```
conda env create -f environment.yml
```
See more at https://github.com/sign-language-processing/pose.
- Install pose-format to run MediaPipe Holistic pose estimation on videos:

```
pip install pose-format
```

- Input: `*.mp4`
- Output: `*.pose` (binary pose format)
- Example:

```
videos_to_poses --num-workers 32 --format mediapipe --directory ~/BOBSL/derivatives/original_videos --additional-config="model_complexity=2,smooth_landmarks=false,refine_face_landmarks=true"
```
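As a rough, self-contained illustration of what the later steps consume from a `.pose` file — a frames × people × keypoints × channels tensor plus per-keypoint confidences — a numpy sketch (the keypoint count and the 0.5 threshold are placeholders, not the exact MediaPipe Holistic values):

```python
import numpy as np

# Stand-in for a decoded .pose file: (frames, people, keypoints, xyz).
# 543 keypoints is a placeholder, not the exact MediaPipe Holistic total.
frames, people, keypoints = 120, 1, 543
body = np.random.default_rng(0).normal(size=(frames, people, keypoints, 3))
confidence = np.random.default_rng(1).uniform(size=(frames, people, keypoints))

# Mask out low-confidence keypoints before any downstream processing.
masked = np.where(confidence[..., None] > 0.5, body, np.nan)

# Per-frame fraction of usable keypoints.
usable = (~np.isnan(masked[..., 0])).mean(axis=(1, 2))
print(masked.shape, round(float(usable.mean()), 2))
```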
See more at https://github.com/J22Melody/segmentation/tree/bsl.
- Install the linguistic segmenter to segment signs based on poses:

```
pip install "git+https://github.com/J22Melody/segmentation@bsl"
```

- Input: `*.pose`
- Output: `*.eaf` (ELAN files holding the segmentation, for inspection and annotation)
- Example:

```
python segmentation.py --sign-b-threshold 30 --sign-o-threshold 50 --num_workers 4 --video_ids ./data/bobsl_align_val.txt --pose_dir ~/BOBSL/derivatives/video_features/mediapipe_v2_refine_face_complexity_2 --save_dir ~/BOBSL/sea_demo/segmentation
```
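The `--sign-b-threshold` and `--sign-o-threshold` flags read as percentage thresholds on per-frame boundary ("B") and non-signing ("O") probabilities. Purely as a toy illustration of that idea — not the segmenter's actual decoding logic:

```python
def decode_segments(p_b, p_o, b_threshold=30, o_threshold=50):
    """Toy decoder: open a segment when the boundary probability
    exceeds b_threshold (%), close it when the 'outside' probability
    exceeds o_threshold (%)."""
    segments, start = [], None
    for i, (pb, po) in enumerate(zip(p_b, p_o)):
        if start is None and pb * 100 >= b_threshold:
            start = i
        elif start is not None and po * 100 >= o_threshold:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(p_b)))
    return segments

# Per-frame probabilities for the 'B' (sign begins) and 'O' (no signing) classes.
p_b = [0.1, 0.6, 0.2, 0.1, 0.1, 0.7, 0.1, 0.1]
p_o = [0.9, 0.1, 0.2, 0.3, 0.8, 0.1, 0.2, 0.9]
print(decode_segments(p_b, p_o))
```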
This step requires one GPU; the following was tested on an NVIDIA Tesla P40 (24 GB). See more at https://github.com/J22Melody/fairseq/tree/main/examples/MMPT#inference.
- Install SignCLIP to embed text and signs (model weights have to be downloaded):

```
git clone git@github.com:J22Melody/fairseq.git
cd fairseq
conda env update -n sea -f environment_inference.yml
cd examples/MMPT
pip install .
```

- Input: `*.pose`, `*.vtt`
- Output: `*.npy` (embeddings for signing and subtitle units)
- Example to embed signs:

```
python scripts_bsl/extract_episode_features.py --video_ids ~/SEA/data/bobsl_align_val.txt --mode=segmentation --model_name bsl --language_tag "<en> <bfi>" --batch_size=32 --segmentation_dir ~/BOBSL/sea_demo/segmentation/E4s-1_30_50 --save_dir ~/BOBSL/sea_demo/segmentation_embedding/E4s-1_30_50/sign_clip
```

- Example to embed subtitles:

```
python scripts_bsl/extract_episode_features.py --video_ids ~/SEA/data/bobsl_align_val.txt --mode=subtitle --model_name bsl --language_tag "<en> <bfi>" --batch_size=1024 --subtitle_dir ~/BOBSL/v1.4/automatic_annotations/signing_aligned_subtitles/audio_aligned_heuristic_correction --save_dir ~/BOBSL/sea_demo/subtitle_embedding/sign_clip
```
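The saved `.npy` embeddings live in a shared SignCLIP space, so signing units and subtitles can be compared directly. A self-contained sketch with random vectors standing in for the real files (the embedding dimension 256 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the saved .npy files: one embedding per signing unit and
# one per subtitle, in a shared space.
sign_emb = rng.normal(size=(5, 256))   # 5 signing units
sub_emb = rng.normal(size=(3, 256))    # 3 subtitles

# L2-normalise, then a matrix product gives all pairwise cosine similarities.
sign_emb /= np.linalg.norm(sign_emb, axis=1, keepdims=True)
sub_emb /= np.linalg.norm(sub_emb, axis=1, keepdims=True)
similarity = sign_emb @ sub_emb.T       # shape (5, 3)

# Most similar subtitle for each signing unit.
print(similarity.shape, similarity.argmax(axis=1))
```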
- Input: `*.eaf`, `*.vtt`
- Output: `*_updated.eaf`, `*.vtt` (ELAN files after alignment, and aligned subtitle files)
- Example:

```
python align.py --overwrite --mode=inference --video_ids ./data/bobsl_align_val.txt --num_workers=4 --dp_duration_penalty_weight 1 --dp_gap_penalty_weight 5 --dp_max_gap 10 --dp_window_size 50 --sign-b-threshold 30 --sign-o-threshold 50 --pr_subs_delta_bias_start 2.6 --pr_subs_delta_bias_end 2.1 --similarity_measure none --segmentation_dir ~/BOBSL/sea_demo/segmentation --save_dir ~/BOBSL/sea_demo/aligned_subtitles
```
- Output metrics (if ground truth is provided):

```
Metric                         | Result
-------------------------------+------------
Total frames                   | 245614
Total sentences                | 1973
Mean/median start offset       | -0.50/-0.26
Mean/median end offset         | -1.04/-0.88
Mean/median start offset (abs) | 0.93/0.54
Mean/median end offset (abs)   | 1.29/0.99
Frame-level accuracy           | 80.68
F1@0.10                        | 83.07
F1@0.25                        | 79.32
F1@0.50                        | 66.24
```
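For reference, F1@τ is computed here under the common convention that a predicted sentence span counts as a true positive when its temporal IoU with a matched ground-truth span reaches τ; this is an illustrative reimplementation, not the repository's evaluation code:

```python
def iou(a, b):
    """Temporal IoU of two (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def f1_at(pred, gold, tau):
    """Greedy matching: a prediction is a hit if some unmatched
    ground-truth span overlaps it with IoU >= tau."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= tau:
            unmatched.remove(best)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy spans in seconds: (start, end).
pred = [(0, 10), (12, 20), (25, 30)]
gold = [(0, 9), (13, 21), (40, 45)]
print(round(f1_at(pred, gold, 0.50), 4))
```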
- Input: `*.eaf`, `*.vtt`, `*.npy`
- Output: `*_updated.eaf`, `*.vtt` (ELAN files after alignment, and aligned subtitle files)
- Example:

```
python align.py --overwrite --mode=inference --video_ids ./data/bobsl_align_val.txt --num_workers=4 --dp_duration_penalty_weight 1 --dp_gap_penalty_weight 5 --dp_max_gap 10 --dp_window_size 50 --sign-b-threshold 30 --sign-o-threshold 50 --pr_subs_delta_bias_start 2.6 --pr_subs_delta_bias_end 2.1 --similarity_measure sign_clip_embedding --similarity_weight 10 --segmentation_dir ~/BOBSL/sea_demo/segmentation --subtitle_embedding_dir ~/BOBSL/sea_demo/subtitle_embedding/sign_clip/ --segmentation_embedding_dir ~/BOBSL/sea_demo/segmentation_embedding/E4s-1_30_50/sign_clip --save_dir ~/BOBSL/sea_demo/aligned_subtitles
```
- Output metrics (if ground truth is provided):

```
Metric                         | Result
-------------------------------+------------
Total frames                   | 245614
Total sentences                | 1973
Mean/median start offset       | -0.36/-0.18
Mean/median end offset         | -0.91/-0.77
Mean/median start offset (abs) | 0.80/0.40
Mean/median end offset (abs)   | 1.16/0.87
Frame-level accuracy           | 82.52
F1@0.10                        | 86.37
F1@0.25                        | 82.92
F1@0.50                        | 72.23
```
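The only change from the previous run is `--similarity_measure sign_clip_embedding` with `--similarity_weight 10`. A toy sketch of how such a weighted similarity term can flip a segment-to-subtitle assignment, assuming a simple additive score (the actual cost function lives in `align.py` and is not reproduced here):

```python
def score(prior, similarity, similarity_weight=10):
    # Toy additive score: a temporal prior plus a weighted embedding
    # similarity term (mirrors --similarity_weight, not align.py itself).
    return prior + similarity_weight * similarity

# Two candidate subtitles for one signing unit: the temporal prior slightly
# favours sub_a, but the weighted embedding similarity overrules it.
candidates = {"sub_a": score(1.0, 0.2), "sub_b": score(0.8, 0.5)}
print(max(candidates, key=candidates.get))
```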
You should expect the following output files, including the intermediate segmentation and embedding results as well as the final alignment results.
```
/users/zifan/BOBSL/sea_demo/
├── aligned_subtitles
│   ├── 5224144816887051284.vtt
│   ├── 5242317681679687839.vtt
│   ├── 5294309549287947552.vtt
│   └── 5439409006429129628.vtt
├── segmentation
│   └── E4s-1_30_50
│       ├── 5224144816887051284.eaf
│       ├── 5224144816887051284_updated.eaf
│       ├── 5242317681679687839.eaf
│       ├── 5242317681679687839_updated.eaf
│       ├── 5294309549287947552.eaf
│       ├── 5294309549287947552_updated.eaf
│       ├── 5439409006429129628.eaf
│       └── 5439409006429129628_updated.eaf
├── segmentation_embedding
│   └── E4s-1_30_50
│       └── sign_clip
│           ├── 5224144816887051284.npy
│           ├── 5242317681679687839.npy
│           ├── 5294309549287947552.npy
│           └── 5439409006429129628.npy
└── subtitle_embedding
    └── sign_clip
        ├── 5224144816887051284.npy
        ├── 5242317681679687839.npy
        ├── 5294309549287947552.npy
        └── 5439409006429129628.npy
```
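To sanity-check a run, counting the artifact types under the output root is often enough; a small stdlib sketch that builds a miniature stand-in tree and counts it (point `count_artifacts` at your real `sea_demo` path instead):

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_artifacts(root):
    """Count output files per extension under an output root."""
    return Counter(p.suffix for p in Path(root).rglob("*") if p.is_file())

# Miniature stand-in for the sea_demo tree shown above.
with tempfile.TemporaryDirectory() as root:
    for rel in ["aligned_subtitles/a.vtt",
                "segmentation/E4s-1_30_50/a.eaf",
                "segmentation/E4s-1_30_50/a_updated.eaf",
                "segmentation_embedding/E4s-1_30_50/sign_clip/a.npy",
                "subtitle_embedding/sign_clip/a.npy"]:
        path = Path(root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    counts = count_artifacts(root)
print(counts)
```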
```bibtex
@article{jiang2025segment,
  title   = {Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing},
  author  = {Jiang, Zifan and Jang, Youngjoon and Momeni, Liliane and Varol, G{\"u}l and Ebling, Sarah and Zisserman, Andrew},
  journal = {arXiv preprint arXiv:2512.08094},
  year    = {2025},
  url     = {https://arxiv.org/abs/2512.08094}
}
```