10 changes: 10 additions & 0 deletions README.md
@@ -11,6 +11,11 @@ pip install subtitle-edit-rate
will install the `suber` command line tool.
Alternatively, check out this git repository and run the contained `suber` module with `python -m suber`.

For Japanese and/or Korean support (via `-l`, see below), specify `ja` and/or `ko` as optional dependencies:
```console
pip install subtitle-edit-rate[ja,ko]
```

## Basic Usage
Currently, we expect subtitle files to come in [SubRip text (SRT)](https://en.wikipedia.org/wiki/SubRip) format. Given a human reference subtitle file `reference.srt` and a hypothesis file `hypothesis.srt` (typically the output of an automatic subtitling system), the SubER score can be calculated by running:
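A minimal sketch of such a call (file names as above; SubER is the default metric):

```console
$ suber -H hypothesis.srt -R reference.srt
```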

@@ -28,6 +33,9 @@ Also, note that `<i>`, `<b>` and `<u>` formatting tags are ignored if present in
#### Punctuation and Case-Sensitivity
The main SubER metric is computed on normalized text, i.e. case-insensitive and without taking punctuation into account, as we observe higher correlation with human judgements and post-edit effort in this setting. We also provide a case-sensitive variant which uses a tokenizer to treat punctuation marks as separate tokens; you can use it "at your own risk" or to reassess our findings. For this, add `--metrics SubER-cased` to the command above. Please do not report results of this variant as "SubER" without explicitly mentioning the punctuation-/case-sensitivity.
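For instance, both variants could be requested in a single run (a sketch building on the command above):

```console
$ suber -H hypothesis.srt -R reference.srt --metrics SubER SubER-cased
```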

#### Language support
SubER is expected to give meaningful scores for all languages that separate words with spaces, similar to English. In addition, versions `>=0.4.0` explicitly support __Chinese__, __Japanese__ and __Korean__. (Korean does use spaces, but we follow [SacreBLEU](https://github.com/mjpost/sacrebleu) in using [mecab-ko](https://github.com/NoUnique/pymecab-ko) tokenization.) For these languages it is __required__ to set the `-l`/`--language` option to the corresponding two-letter language code, e.g. `suber -H hypothesis.srt -R reference.srt -l ja` for Japanese files. Thai is an example of a scriptio continua language that is currently not supported; as a workaround, you can run your own tokenization / word segmentation on the SRT files before calling `suber`.
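Putting this together for Japanese, a sketch of the full workflow:

```console
pip install subtitle-edit-rate[ja]
suber -H hypothesis.srt -R reference.srt -l ja
```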

## Other Metrics
The SubER tool supports computing the following other metrics directly on subtitle files:

@@ -52,6 +60,8 @@ $ suber -H hypothesis.srt -R reference.srt --metrics WER BLEU TER chrF CER
```
In this mode, the text from each parallel subtitle pair is considered to be a sentence pair.

For __Chinese__, __Japanese__ and __Korean__ files, the language code must also be specified here via the `-l`/`--language` option to get correct BLEU, TER and WER scores. (This sets TER's `asian_support` option and, for BLEU and WER, enables tokenization via SacreBLEU's dedicated tokenizers `TokenizerZh`, `TokenizerJaMecab` and `TokenizerKoMecab`, respectively.)
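For example, a Chinese subtitle pair might be scored as follows (a sketch; the file names are placeholders):

```console
$ suber -H hypothesis.zh.srt -R reference.zh.srt --metrics BLEU TER WER -l zh
```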

### Scoring Non-Parallel Subtitle Files
In the general case, subtitle files for the same video can have different numbers of subtitles with different time stamps. All metrics except SubER usually have to be calculated on parallel segments. To apply them to general subtitle files, the hypothesis file has to be re-segmented to correspond to the reference subtitles. The SubER tool implements two options:

6 changes: 6 additions & 0 deletions pyproject.toml
@@ -9,6 +9,7 @@ dependencies = [
"sacrebleu==2.5.1",
"jiwer==4.0.0",
"numpy",
"regex",
"dataclasses;python_version<'3.7'",
]
requires-python = ">= 3.6"
@@ -31,6 +32,11 @@ classifiers = [
"Topic :: Scientific/Engineering :: Artificial Intelligence",
]

[project.optional-dependencies]
# Installs MeCab for Japanese/Korean word segmentation.
ja = ["sacrebleu[ja]==2.5.1"]
ko = ["sacrebleu[ko]==2.5.1"]
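# Usage sketch (assuming a standard pip setup): `pip install "subtitle-edit-rate[ja,ko]"` from PyPI,
# or `pip install ".[ja,ko]"` from a local checkout.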

[project.urls]
Homepage = "https://github.com/apptek/SubER"
Issues = "https://github.com/apptek/SubER/issues"
26 changes: 18 additions & 8 deletions suber/__main__.py
@@ -30,8 +30,17 @@ def parse_arguments():
help="The reference files. Usually just one file, but we support test sets consisting of "
"multiple files.")
parser.add_argument("-m", "--metrics", nargs="+", default=["SubER"], help="The metrics to compute.")
parser.add_argument("-f", "--hypothesis-format", default="SRT", help="Hypothesis file format, 'SRT' or 'plain'.")
parser.add_argument("-F", "--reference-format", default="SRT", help="Reference file format, 'SRT' or 'plain'.")
parser.add_argument("-f", "--hypothesis-format", default="SRT", choices=["SRT", "plain"],
help="Hypothesis file format, 'SRT' or 'plain'.")
parser.add_argument("-F", "--reference-format", default="SRT", choices=["SRT", "plain"],
help="Reference file format, 'SRT' or 'plain'.")
parser.add_argument("-l", "--language", choices=["zh", "ja", "ko"],
help='Set to "zh", "ja" or "ko" to enable correct tokenization of Chinese, Japanese or Korean '
"text, respectively. We follow sacrebleu and use its BLEU tokenizers 'zh', 'ja-mecab' and "
"'ko-mecab' for these three languages, respectively. We also employ those tokenizers for SubER "
"and WER computation, instead of the TercomTokenizer, because TercomTokenizer's "
'"asian_support" is questionable: it does not split Japanese Hiragana/Katakana at all. '
'Only for TER itself is the original TercomTokenizer with "asian_support" used.')
parser.add_argument("--suber-statistics", action="store_true",
help="If set, will create an '#info' field in the output containing statistics about the "
"different edit operations used to calculate the SubER score.")
@@ -66,7 +75,8 @@ def main():
continue # specified multiple times by the user

if metric == "length_ratio":
results[metric] = calculate_length_ratio(hypothesis=hypothesis_segments, reference=reference_segments)
results[metric] = calculate_length_ratio(
hypothesis=hypothesis_segments, reference=reference_segments, language=args.language)
continue

# When using existing parallel segments there will always be a <eob> word match at the end; don't count it.
@@ -82,7 +92,7 @@ def main():
# AS-WER and AS-BLEU were introduced by Matusov et al. https://aclanthology.org/2005.iwslt-1.19.pdf
if levenshtein_aligned_hypothesis_segments is None:
levenshtein_aligned_hypothesis_segments = levenshtein_align_hypothesis_to_reference(
hypothesis=hypothesis_segments, reference=reference_segments)
hypothesis=hypothesis_segments, reference=reference_segments, language=args.language)

hypothesis_segments_to_use = levenshtein_aligned_hypothesis_segments
metric = metric[len("AS-"):]
@@ -94,7 +104,7 @@ def main():
# https://www.isca-archive.org/interspeech_2021/cherry21_interspeech.pdf
if time_aligned_hypothesis_segments is None:
time_aligned_hypothesis_segments = time_align_hypothesis_to_reference(
hypothesis=hypothesis_segments, reference=reference_segments)
hypothesis=hypothesis_segments, reference=reference_segments, language=args.language)

hypothesis_segments_to_use = time_aligned_hypothesis_segments
metric = metric[len("t-"):]
@@ -110,15 +120,15 @@ def main():

metric_score = calculate_SubER(
hypothesis=hypothesis_segments_to_use, reference=reference_segments, metric=metric,
statistics_collector=statistics_collector)
statistics_collector=statistics_collector, language=args.language)

if statistics_collector:
additional_outputs[full_metric_name] = statistics_collector.get_statistics()

elif metric.startswith("WER"):
metric_score = calculate_word_error_rate(
hypothesis=hypothesis_segments_to_use, reference=reference_segments, metric=metric,
score_break_at_segment_end=score_break_at_segment_end)
score_break_at_segment_end=score_break_at_segment_end, language=args.language)

elif metric.startswith("CER"):
metric_score = calculate_character_error_rate(
@@ -127,7 +137,7 @@
else:
metric_score = calculate_sacrebleu_metric(
hypothesis=hypothesis_segments_to_use, reference=reference_segments, metric=metric,
score_break_at_segment_end=score_break_at_segment_end)
score_break_at_segment_end=score_break_at_segment_end, language=args.language)

results[full_metric_name] = metric_score

9 changes: 9 additions & 0 deletions suber/constants.py
@@ -3,3 +3,12 @@
END_OF_BLOCK_SYMBOL = "<eob>"

MASK_SYMBOL = "<mask>"

# These are the languages for which we enable "asian_support" for TER computation.
# TODO: Korean is included as a precaution; does it make sense? "asian_support=True" should only have an effect in
# very rare cases for Korean text.
# For SubER and WER we actually use sacrebleu's TokenizerZh, TokenizerJaMecab, and TokenizerKoMecab instead of
# TercomTokenizer with "asian_support".
EAST_ASIAN_LANGUAGE_CODES = ["zh", "ja", "ko"]

SPACE_ESCAPE = "▁"
26 changes: 3 additions & 23 deletions suber/file_readers/srt_file_reader.py
@@ -1,10 +1,9 @@
import re
import datetime
import numpy


from suber.file_readers.file_reader_base import FileReaderBase
from suber.data_types import LineBreak, TimedWord, Subtitle
from suber.utilities import set_approximate_word_times


class SRTFormatError(Exception):
@@ -81,7 +80,7 @@ def _parse_lines(self, file_object):
if word_list: # might be an empty subtitle
word_list[-1].line_break = LineBreak.END_OF_BLOCK

self._set_approximate_word_times(word_list, start_time, end_time)
set_approximate_word_times(word_list, start_time, end_time)

subtitles.append(
Subtitle(word_list=word_list, index=subtitle_index, start_time=start_time, end_time=end_time))
@@ -98,32 +97,13 @@ def _parse_lines(self, file_object):
if word_list: # might be an empty subtitle
word_list[-1].line_break = LineBreak.END_OF_BLOCK

self._set_approximate_word_times(word_list, start_time, end_time)
set_approximate_word_times(word_list, start_time, end_time)

subtitles.append(
Subtitle(word_list=word_list, index=subtitle_index, start_time=start_time, end_time=end_time))

return subtitles

@classmethod
def _set_approximate_word_times(cls, word_list, start_time, end_time):
"""
Linearly interpolates word times from the subtitle start and end time as described in
https://www.isca-archive.org/interspeech_2021/cherry21_interspeech.pdf
"""
# Remove small margin to guarantee the first and last word will always be counted as within the subtitle.
epsilon = 1e-8
start_time = start_time + epsilon
end_time = end_time - epsilon

num_words = len(word_list)
duration = end_time - start_time
assert duration >= 0

approximate_word_times = numpy.linspace(start=start_time, stop=end_time, num=num_words)
for word_time, word in zip(approximate_word_times, word_list):
word.approximate_word_time = word_time

@classmethod
def _parse_time_stamp(cls, time_stamp):
time_stamp_tokens = time_stamp.split()
29 changes: 26 additions & 3 deletions suber/hyp_to_ref_alignment/levenshtein_alignment.py
@@ -1,27 +1,47 @@
import numpy
import regex
import string
from itertools import zip_longest
from typing import List, Tuple
from typing import List, Optional, Tuple

from suber import lib_levenshtein
from suber.constants import EAST_ASIAN_LANGUAGE_CODES, SPACE_ESCAPE
from suber.data_types import Segment
from suber.tokenizers import reversibly_tokenize_segments, detokenize_segments


def levenshtein_align_hypothesis_to_reference(hypothesis: List[Segment], reference: List[Segment]) -> List[Segment]:
def levenshtein_align_hypothesis_to_reference(
hypothesis: List[Segment], reference: List[Segment], language: Optional[str] = None) -> List[Segment]:
"""
Runs the Levenshtein algorithm to get the minimal set of edit operations to convert the full list of hypothesis
words into the full list of reference words. The edit operations implicitly define an alignment between hypothesis
and reference words. Using this alignment, the hypotheses are re-segmented to match the reference segmentation.
"""

if language in EAST_ASIAN_LANGUAGE_CODES:
# Keep punctuation attached: we want to remove it below to normalize the tokens before alignment, but at that
# point we cannot change the number of tokens (and must not create empty tokens).
hypothesis = reversibly_tokenize_segments(hypothesis, language, keep_punctuation_attached=True)
reference = reversibly_tokenize_segments(reference, language, keep_punctuation_attached=True)

remove_punctuation_table = str.maketrans('', '', string.punctuation)

def normalize_word(word):
"""
Lower-cases and removes punctuation as this increases the alignment accuracy.
"""
word = word.lower()
word_without_punctuation = word.translate(remove_punctuation_table)

if language in EAST_ASIAN_LANGUAGE_CODES:
# Space escape needed for detokenization, but we don't want it to influence the alignment.
if word.startswith(SPACE_ESCAPE):
word = word[1:]
assert word, "Word should not be only space escape character."
word_without_punctuation = regex.sub(r"\p{P}", "", word)
else:
# Backwards compatibility: keep old behavior for other languages, even though removing non-ASCII punctuation
# would also make sense here.
word_without_punctuation = word.translate(remove_punctuation_table)

if not word_without_punctuation:
return word # keep tokens that are purely punctuation
@@ -85,6 +105,9 @@ def normalize_word(word):

aligned_hypothesis = [Segment(word_list=word_list) for word_list in aligned_hypothesis_word_lists]

if language in EAST_ASIAN_LANGUAGE_CODES:
aligned_hypothesis = detokenize_segments(aligned_hypothesis)

return aligned_hypothesis


14 changes: 12 additions & 2 deletions suber/hyp_to_ref_alignment/time_alignment.py
@@ -1,16 +1,23 @@
import numpy
from typing import List, Optional

from typing import List
from suber.constants import EAST_ASIAN_LANGUAGE_CODES
from suber.data_types import Segment, Subtitle
from suber.tokenizers import reversibly_tokenize_segments, detokenize_segments


def time_align_hypothesis_to_reference(hypothesis: List[Segment], reference: List[Subtitle]) -> List[Subtitle]:
def time_align_hypothesis_to_reference(
hypothesis: List[Segment], reference: List[Subtitle], language: Optional[str] = None) -> List[Subtitle]:
"""
Re-segments the hypothesis segments according to the reference subtitle timings. The output hypothesis subtitles
will have the same time stamps as the reference, and each will contain the words whose approximate times fall into
these intervals, i.e. reference_subtitle.start_time < word.approximate_word_time < reference_subtitle.end_time.
Hypothesis words that do not fall into any subtitle will be dropped.
"""

if language in EAST_ASIAN_LANGUAGE_CODES:
hypothesis = reversibly_tokenize_segments(hypothesis, language)

aligned_hypothesis_word_lists = [[] for _ in reference]

reference_start_times = numpy.array([subtitle.start_time for subtitle in reference])
@@ -40,4 +47,7 @@ def time_align_hypothesis_to_reference(hypothesis: List[Segment], reference: Lis

aligned_hypothesis.append(subtitle)

if language in EAST_ASIAN_LANGUAGE_CODES:
aligned_hypothesis = detokenize_segments(aligned_hypothesis)

return aligned_hypothesis
9 changes: 3 additions & 6 deletions suber/metrics/cer.py
@@ -1,6 +1,7 @@
import string
from typing import List

import regex

from suber import lib_levenshtein
from suber.data_types import Segment
from suber.utilities import segment_to_string
@@ -14,12 +15,8 @@ def calculate_character_error_rate(hypothesis: List[Segment], reference: List[Se
reference_strings = [segment_to_string(segment) for segment in reference]

if metric != "CER-cased":
remove_punctuation_table = str.maketrans('', '', string.punctuation)

def normalize_string(string):
string = string.translate(remove_punctuation_table)
# Ellipsis is a common character in subtitles which is not included in string.punctuation.
string = string.replace('…', '')
string = regex.sub(r"\p{P}", "", string)
string = string.lower()
return string

27 changes: 18 additions & 9 deletions suber/metrics/jiwer_interface.py
@@ -2,14 +2,14 @@
import functools
from typing import List

from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer

from suber.data_types import Segment
from suber.constants import EAST_ASIAN_LANGUAGE_CODES
from suber.tokenizers import get_sacrebleu_tokenizer
from suber.utilities import segment_to_string, get_segment_to_string_opts_from_metric


def calculate_word_error_rate(hypothesis: List[Segment], reference: List[Segment], metric="WER",
score_break_at_segment_end=True) -> float:
score_break_at_segment_end=True, language: str = None) -> float:

assert len(hypothesis) == len(reference), (
"Number of hypothesis segments does not match reference, alignment step missing?")
@@ -18,19 +18,25 @@ def calculate_word_error_rate(hypothesis: List[Segment], reference: List[Segment
transformations = jiwer.Compose([
# Note: the original release used no tokenization here. We find this change to have a minor positive effect
# on correlation with post-edit effort (-0.657 vs. -0.650 in Table 1, row 2, "Combined" in our paper).
TercomTokenize(),
Tokenize(language),
jiwer.ReduceToListOfListOfWords(),
])
metric = "WER"

else:
transformations = jiwer.Compose([
transformations = [
jiwer.ToLowerCase(),
jiwer.RemovePunctuation(),
# Ellipsis is a common character in subtitles that older jiwer versions would not remove by default.
jiwer.RemoveSpecificWords(['…']),
jiwer.ReduceToListOfListOfWords(),
])
]
# For most languages no tokenizer is needed when punctuation is removed. This is not true, though, for
# languages that do not use spaces to separate words.
if language in EAST_ASIAN_LANGUAGE_CODES:
transformations.insert(3, Tokenize(language))

transformations = jiwer.Compose(transformations)

include_breaks, mask_words, metric = get_segment_to_string_opts_from_metric(metric)
assert metric == "WER"
@@ -51,9 +57,12 @@ def calculate_word_error_rate(hypothesis: List[Segment], reference: List[Segment
return round(wer_score * 100, 3)


class TercomTokenize(jiwer.AbstractTransform):
def __init__(self):
self.tokenizer = TercomTokenizer(normalized=True, no_punct=False, case_sensitive=True)
class Tokenize(jiwer.AbstractTransform):
def __init__(self, language: str):
# For backwards-compatibility, TercomTokenizer is used for all languages except "ja", "ko", and "zh".
self.tokenizer = get_sacrebleu_tokenizer(language, default_to_tercom=True)

def process_string(self, s: str):
# TercomTokenizer would split "<eol>" into "< eol >"
s = s.replace("<eol>", "eol").replace("<eob>", "eob")
return self.tokenizer(s)