10 changes: 10 additions & 0 deletions README.md
@@ -11,6 +11,11 @@ pip install subtitle-edit-rate
will install the `suber` command line tool.
Alternatively, check out this git repository and run the contained `suber` module with `python -m suber`.

For Japanese and/or Korean support (via `-l`, see below), specify `ja` and/or `ko` as optional dependencies:
```console
pip install subtitle-edit-rate[ja,ko]
```

## Basic Usage
Currently, we expect subtitle files to come in [SubRip text (SRT)](https://en.wikipedia.org/wiki/SubRip) format. Given a human reference subtitle file `reference.srt` and a hypothesis file `hypothesis.srt` (typically the output of an automatic subtitling system), the SubER score can be calculated by running:
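A minimal sketch of such a call (file names as above; SubER is the default metric):

```console
$ suber -H hypothesis.srt -R reference.srt
```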

@@ -28,6 +33,9 @@ Also, note that `<i>`, `<b>` and `<u>` formatting tags are ignored if present in
#### Punctuation and Case-Sensitivity
The main SubER metric is computed on normalized text, i.e. case-insensitive and without taking punctuation into account, as we observe higher correlation with human judgements and post-edit effort in this setting. We also provide a case-sensitive variant which uses a tokenizer to treat punctuation marks as separate tokens; you can use it "at your own risk" or to reassess our findings. For this, add `--metrics SubER-cased` to the command above. Please do not report results of this variant as "SubER" without explicitly mentioning the punctuation-/case-sensitivity.
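For instance, both variants could be requested in a single run (a sketch building on the command above):

```console
$ suber -H hypothesis.srt -R reference.srt --metrics SubER SubER-cased
```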

#### Language support
SubER is expected to give meaningful scores for all languages that separate words with spaces, similar to English. In addition, versions `>=0.4.0` explicitly support __Chinese__, __Japanese__ and __Korean__. (Korean does use spaces, but we follow [SacreBLEU](https://github.com/mjpost/sacrebleu) in using [mecab-ko](https://github.com/NoUnique/pymecab-ko) tokenization.) For these languages it is __required__ to set the `-l`/`--language` option to the corresponding two-letter language code, e.g. `suber -H hypothesis.srt -R reference.srt -l ja` for Japanese files. Thai is an example of a scriptio continua language that is currently not supported; as a workaround, you can run your own tokenization / word segmentation on the SRT files before calling `suber`.
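Putting this together for Japanese, a sketch of the full workflow:

```console
pip install subtitle-edit-rate[ja]
suber -H hypothesis.srt -R reference.srt -l ja
```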

## Other Metrics
The SubER tool supports computing the following other metrics directly on subtitle files:

@@ -52,6 +60,8 @@ $ suber -H hypothesis.srt -R reference.srt --metrics WER BLEU TER chrF CER
```
In this mode, the text from each parallel subtitle pair is considered to be a sentence pair.

For __Chinese__, __Japanese__ and __Korean__ files, the language code must also be specified here via the `-l`/`--language` option to get correct BLEU, TER and WER scores. (This sets TER's `asian_support` option and, for BLEU and WER, enables tokenization via SacreBLEU's dedicated tokenizers `TokenizerZh`, `TokenizerJaMecab` and `TokenizerKoMecab`, respectively.)
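For example, a Chinese subtitle pair might be scored as follows (a sketch; the file names are placeholders):

```console
$ suber -H hypothesis.zh.srt -R reference.zh.srt --metrics BLEU TER WER -l zh
```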

### Scoring Non-Parallel Subtitle Files
In the general case, subtitle files for the same video can have different numbers of subtitles with different time stamps. All metrics except SubER usually have to be calculated on parallel segments. To apply them to general subtitle files, the hypothesis file has to be re-segmented to correspond to the reference subtitles. The SubER tool implements two options:

6 changes: 6 additions & 0 deletions pyproject.toml
@@ -9,6 +9,7 @@ dependencies = [
"sacrebleu==2.5.1",
"jiwer==4.0.0",
"numpy",
"regex",
"dataclasses;python_version<'3.7'",
]
requires-python = ">= 3.6"
@@ -31,6 +32,11 @@ classifiers = [
"Topic :: Scientific/Engineering :: Artificial Intelligence",
]

[project.optional-dependencies]
# Installs MeCab for Japanese/Korean word segmentation.
ja = ["sacrebleu[ja]==2.5.1"]
ko = ["sacrebleu[ko]==2.5.1"]
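# Usage sketch (assuming a standard pip setup): `pip install "subtitle-edit-rate[ja,ko]"` from PyPI,
# or `pip install ".[ja,ko]"` from a local checkout.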

[project.urls]
Homepage = "https://github.com/apptek/SubER"
Issues = "https://github.com/apptek/SubER/issues"
26 changes: 18 additions & 8 deletions suber/__main__.py
@@ -30,8 +30,17 @@ def parse_arguments():
help="The reference files. Usually just one file, but we support test sets consisting of "
"multiple files.")
parser.add_argument("-m", "--metrics", nargs="+", default=["SubER"], help="The metrics to compute.")
parser.add_argument("-f", "--hypothesis-format", default="SRT", help="Hypothesis file format, 'SRT' or 'plain'.")
parser.add_argument("-F", "--reference-format", default="SRT", help="Reference file format, 'SRT' or 'plain'.")
parser.add_argument("-f", "--hypothesis-format", default="SRT", choices=["SRT", "plain"],
help="Hypothesis file format, 'SRT' or 'plain'.")
parser.add_argument("-F", "--reference-format", default="SRT", choices=["SRT", "plain"],
help="Reference file format, 'SRT' or 'plain'.")
parser.add_argument("-l", "--language", choices=["zh", "ja", "ko"],
help='Set to "zh", "ja" or "ko" to enable correct tokenization of Chinese, Japanese or Korean '
"text, respectively. We follow sacrebleu and use its BLEU tokenizers 'zh', 'ja-mecab' and "
"'ko-mecab' for these three languages, respectively. We also employ those tokenizers for SubER "
"and WER computation, instead of the TercomTokenizer, because TercomTokenizer's "
'"asian_support" is questionable: it does not split Japanese Hiragana/Katakana at all. '
'Only for TER itself is the original TercomTokenizer with "asian_support" used.')
parser.add_argument("--suber-statistics", action="store_true",
help="If set, will create an '#info' field in the output containing statistics about the "
"different edit operations used to calculate the SubER score.")
@@ -66,7 +75,8 @@ def main():
continue # specified multiple times by the user

if metric == "length_ratio":
results[metric] = calculate_length_ratio(hypothesis=hypothesis_segments, reference=reference_segments)
results[metric] = calculate_length_ratio(
hypothesis=hypothesis_segments, reference=reference_segments, language=args.language)
continue

# When using existing parallel segments there will always be a <eob> word match at the end; don't count it.
@@ -82,7 +92,7 @@ def main():
# AS-WER and AS-BLEU were introduced by Matusov et al. https://aclanthology.org/2005.iwslt-1.19.pdf
if levenshtein_aligned_hypothesis_segments is None:
levenshtein_aligned_hypothesis_segments = levenshtein_align_hypothesis_to_reference(
hypothesis=hypothesis_segments, reference=reference_segments)
hypothesis=hypothesis_segments, reference=reference_segments, language=args.language)

hypothesis_segments_to_use = levenshtein_aligned_hypothesis_segments
metric = metric[len("AS-"):]
@@ -94,7 +104,7 @@ def main():
# https://www.isca-archive.org/interspeech_2021/cherry21_interspeech.pdf
if time_aligned_hypothesis_segments is None:
time_aligned_hypothesis_segments = time_align_hypothesis_to_reference(
hypothesis=hypothesis_segments, reference=reference_segments)
hypothesis=hypothesis_segments, reference=reference_segments, language=args.language)

hypothesis_segments_to_use = time_aligned_hypothesis_segments
metric = metric[len("t-"):]
@@ -110,15 +120,15 @@ def main():

metric_score = calculate_SubER(
hypothesis=hypothesis_segments_to_use, reference=reference_segments, metric=metric,
statistics_collector=statistics_collector)
statistics_collector=statistics_collector, language=args.language)

if statistics_collector:
additional_outputs[full_metric_name] = statistics_collector.get_statistics()

elif metric.startswith("WER"):
metric_score = calculate_word_error_rate(
hypothesis=hypothesis_segments_to_use, reference=reference_segments, metric=metric,
score_break_at_segment_end=score_break_at_segment_end)
score_break_at_segment_end=score_break_at_segment_end, language=args.language)

elif metric.startswith("CER"):
metric_score = calculate_character_error_rate(
@@ -127,7 +137,7 @@
else:
metric_score = calculate_sacrebleu_metric(
hypothesis=hypothesis_segments_to_use, reference=reference_segments, metric=metric,
score_break_at_segment_end=score_break_at_segment_end)
score_break_at_segment_end=score_break_at_segment_end, language=args.language)

results[full_metric_name] = metric_score

9 changes: 9 additions & 0 deletions suber/constants.py
@@ -3,3 +3,12 @@
END_OF_BLOCK_SYMBOL = "<eob>"

MASK_SYMBOL = "<mask>"

# These are the languages for which we enable "asian_support" for TER computation.
# TODO: Korean is included as a precaution; does it make sense? "asian_support=True" should only have an effect in
# very rare cases for Korean text.
# For SubER and WER we actually use sacrebleu's TokenizerZh, TokenizerJaMecab, and TokenizerKoMecab instead of
# TercomTokenizer with "asian_support".
EAST_ASIAN_LANGUAGE_CODES = ["zh", "ja", "ko"]

SPACE_ESCAPE = "▁"
26 changes: 3 additions & 23 deletions suber/file_readers/srt_file_reader.py
@@ -1,10 +1,9 @@
import re
import datetime
import numpy


from suber.file_readers.file_reader_base import FileReaderBase
from suber.data_types import LineBreak, TimedWord, Subtitle
from suber.utilities import set_approximate_word_times


class SRTFormatError(Exception):
@@ -81,7 +80,7 @@ def _parse_lines(self, file_object):
if word_list: # might be an empty subtitle
word_list[-1].line_break = LineBreak.END_OF_BLOCK

self._set_approximate_word_times(word_list, start_time, end_time)
set_approximate_word_times(word_list, start_time, end_time)

subtitles.append(
Subtitle(word_list=word_list, index=subtitle_index, start_time=start_time, end_time=end_time))
@@ -98,32 +97,13 @@ def _parse_lines(self, file_object):
if word_list: # might be an empty subtitle
word_list[-1].line_break = LineBreak.END_OF_BLOCK

self._set_approximate_word_times(word_list, start_time, end_time)
set_approximate_word_times(word_list, start_time, end_time)

subtitles.append(
Subtitle(word_list=word_list, index=subtitle_index, start_time=start_time, end_time=end_time))

return subtitles

@classmethod
def _set_approximate_word_times(cls, word_list, start_time, end_time):
"""
Linearly interpolates word times from the subtitle start and end time as described in
https://www.isca-archive.org/interspeech_2021/cherry21_interspeech.pdf
"""
# Remove small margin to guarantee the first and last word will always be counted as within the subtitle.
epsilon = 1e-8
start_time = start_time + epsilon
end_time = end_time - epsilon

num_words = len(word_list)
duration = end_time - start_time
assert duration >= 0

approximate_word_times = numpy.linspace(start=start_time, stop=end_time, num=num_words)
for word_time, word in zip(approximate_word_times, word_list):
word.approximate_word_time = word_time

@classmethod
def _parse_time_stamp(cls, time_stamp):
time_stamp_tokens = time_stamp.split()
29 changes: 26 additions & 3 deletions suber/hyp_to_ref_alignment/levenshtein_alignment.py
@@ -1,27 +1,47 @@
import numpy
import regex
import string
from itertools import zip_longest
from typing import List, Tuple
from typing import List, Optional, Tuple

from suber import lib_levenshtein
from suber.constants import EAST_ASIAN_LANGUAGE_CODES, SPACE_ESCAPE
from suber.data_types import Segment
from suber.tokenizers import reversibly_tokenize_segments, detokenize_segments


def levenshtein_align_hypothesis_to_reference(hypothesis: List[Segment], reference: List[Segment]) -> List[Segment]:
def levenshtein_align_hypothesis_to_reference(
hypothesis: List[Segment], reference: List[Segment], language: Optional[str] = None) -> List[Segment]:
"""
Runs the Levenshtein algorithm to get the minimal set of edit operations to convert the full list of hypothesis
words into the full list of reference words. The edit operations implicitly define an alignment between hypothesis
and reference words. Using this alignment, the hypotheses are re-segmented to match the reference segmentation.
"""

if language in EAST_ASIAN_LANGUAGE_CODES:
# Keep punctuation attached: we want to remove it below to normalize the tokens before alignment, but at that
# point we cannot change the number of tokens (and must not create empty tokens).
hypothesis = reversibly_tokenize_segments(hypothesis, language, keep_punctuation_attached=True)
reference = reversibly_tokenize_segments(reference, language, keep_punctuation_attached=True)

remove_punctuation_table = str.maketrans('', '', string.punctuation)

def normalize_word(word):
"""
Lower-cases and removes punctuation as this increases the alignment accuracy.
"""
word = word.lower()
word_without_punctuation = word.translate(remove_punctuation_table)

if language in EAST_ASIAN_LANGUAGE_CODES:
# Space escape needed for detokenization, but we don't want it to influence the alignment.
if word.startswith(SPACE_ESCAPE):
word = word[1:]
assert word, "Word should not be only space escape character."
word_without_punctuation = regex.sub(r"\p{P}", "", word)
else:
# Backwards compatibility: keep old behavior for other languages, even though removing non-ASCII punctuation
# would also make sense here.
word_without_punctuation = word.translate(remove_punctuation_table)

if not word_without_punctuation:
return word # keep tokens that are purely punctuation
@@ -85,6 +105,9 @@ def normalize_word(word):

aligned_hypothesis = [Segment(word_list=word_list) for word_list in aligned_hypothesis_word_lists]

if language in EAST_ASIAN_LANGUAGE_CODES:
aligned_hypothesis = detokenize_segments(aligned_hypothesis)

return aligned_hypothesis


14 changes: 12 additions & 2 deletions suber/hyp_to_ref_alignment/time_alignment.py
@@ -1,16 +1,23 @@
import numpy
from typing import List, Optional

from typing import List
from suber.constants import EAST_ASIAN_LANGUAGE_CODES
from suber.data_types import Segment, Subtitle
from suber.tokenizers import reversibly_tokenize_segments, detokenize_segments


def time_align_hypothesis_to_reference(hypothesis: List[Segment], reference: List[Subtitle]) -> List[Subtitle]:
def time_align_hypothesis_to_reference(
hypothesis: List[Segment], reference: List[Subtitle], language: Optional[str] = None) -> List[Subtitle]:
"""
Re-segments the hypothesis segments according to the reference subtitle timings. The output hypothesis subtitles
will have the same time stamps as the reference, and each will contain the words whose approximate times fall into
these intervals, i.e. reference_subtitle.start_time < word.approximate_word_time < reference_subtitle.end_time.
Hypothesis words that do not fall into any subtitle will be dropped.
"""

if language in EAST_ASIAN_LANGUAGE_CODES:
hypothesis = reversibly_tokenize_segments(hypothesis, language)

aligned_hypothesis_word_lists = [[] for _ in reference]

reference_start_times = numpy.array([subtitle.start_time for subtitle in reference])
@@ -40,4 +47,7 @@ def time_align_hypothesis_to_reference(hypothesis: List[Segment], reference: Lis

aligned_hypothesis.append(subtitle)

if language in EAST_ASIAN_LANGUAGE_CODES:
aligned_hypothesis = detokenize_segments(aligned_hypothesis)

return aligned_hypothesis
9 changes: 3 additions & 6 deletions suber/metrics/cer.py
@@ -1,6 +1,7 @@
import string
from typing import List

import regex

from suber import lib_levenshtein
from suber.data_types import Segment
from suber.utilities import segment_to_string
@@ -14,12 +15,8 @@ def calculate_character_error_rate(hypothesis: List[Segment], reference: List[Se
reference_strings = [segment_to_string(segment) for segment in reference]

if metric != "CER-cased":
remove_punctuation_table = str.maketrans('', '', string.punctuation)

def normalize_string(string):
string = string.translate(remove_punctuation_table)
# Ellipsis is a common character in subtitles which is not included in string.punctuation.
string = string.replace('…', '')
string = regex.sub(r"\p{P}", "", string)
string = string.lower()
return string

27 changes: 18 additions & 9 deletions suber/metrics/jiwer_interface.py
@@ -2,14 +2,14 @@
import functools
from typing import List

from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer

from suber.data_types import Segment
from suber.constants import EAST_ASIAN_LANGUAGE_CODES
from suber.tokenizers import get_sacrebleu_tokenizer
from suber.utilities import segment_to_string, get_segment_to_string_opts_from_metric


def calculate_word_error_rate(hypothesis: List[Segment], reference: List[Segment], metric="WER",
score_break_at_segment_end=True) -> float:
score_break_at_segment_end=True, language: str = None) -> float:

assert len(hypothesis) == len(reference), (
"Number of hypothesis segments does not match reference, alignment step missing?")
@@ -18,19 +18,25 @@ def calculate_word_error_rate(hypothesis: List[Segment], reference: List[Segment
transformations = jiwer.Compose([
# Note: the original release used no tokenization here. We find this change to have a minor positive effect
# on correlation with post-edit effort (-0.657 vs. -0.650 in Table 1, row 2, "Combined" in our paper).
TercomTokenize(),
Tokenize(language),
jiwer.ReduceToListOfListOfWords(),
])
metric = "WER"

else:
transformations = jiwer.Compose([
transformations = [
jiwer.ToLowerCase(),
jiwer.RemovePunctuation(),
# Ellipsis is a common character in subtitles that older jiwer versions would not remove by default.
jiwer.RemoveSpecificWords(['…']),
jiwer.ReduceToListOfListOfWords(),
])
]
# For most languages no tokenizer is needed when punctuation is removed. This is not true, though, for
# languages that do not use spaces to separate words.
if language in EAST_ASIAN_LANGUAGE_CODES:
transformations.insert(3, Tokenize(language))

transformations = jiwer.Compose(transformations)

include_breaks, mask_words, metric = get_segment_to_string_opts_from_metric(metric)
assert metric == "WER"
@@ -51,9 +57,12 @@ def calculate_word_error_rate(hypothesis: List[Segment], reference: List[Segment
return round(wer_score * 100, 3)


class TercomTokenize(jiwer.AbstractTransform):
def __init__(self):
self.tokenizer = TercomTokenizer(normalized=True, no_punct=False, case_sensitive=True)
class Tokenize(jiwer.AbstractTransform):
def __init__(self, language: str):
# For backwards-compatibility, TercomTokenizer is used for all languages except "ja", "ko", and "zh".
self.tokenizer = get_sacrebleu_tokenizer(language, default_to_tercom=True)

def process_string(self, s: str):
# TercomTokenizer would split "<eol>" into "< eol >"
s = s.replace("<eol>", "eol").replace("<eob>", "eob")
return self.tokenizer(s)