Open
Labels
bug
Description
In the following example
https://arxiv.org/pdf/2103.12028v1.pdf
there are cases of incorrect sentence segmentation: sentence offsets appear to be shifted by a few characters, so words get cut at sentence boundaries. This happens with either of the available sentence segmenters, OpenNLP or Pragmatic Segmenter:
<s>Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. T</s>
<s>e quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European...</s>

Since this happens with both segmenters, which use different offset calculation methods, it might be due to a character encoding issue.
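To illustrate how this kind of shift can arise, here is a minimal Python sketch. It is not a confirmed diagnosis of this issue and does not use the actual segmenters. It assumes that the text passed to the segmenter differs in length from the text the offsets are later applied to. Here that is simulated with typographic ligatures that NFKC normalization expands into several characters. `split_offsets` is a hypothetical toy splitter, and the sample sentence is invented for the demonstration.

```python
import unicodedata


def split_offsets(text):
    """Toy sentence splitter (hypothetical): split after '. ' and
    return (start, end) character offsets for each sentence."""
    offsets, start = [], 0
    idx = text.find(". ", start)
    while idx != -1:
        offsets.append((start, idx + 1))
        start = idx + 2  # skip the period and the following space
        idx = text.find(". ", start)
    if start < len(text):
        offsets.append((start, len(text)))
    return offsets


# Invented sample: U+FB01 and U+FB02 are single-character ligatures,
# as can appear in text extracted from a PDF.
original = "The \ufb01rst \ufb02ow is over 90%. The quality is evaluated."

# Assumption: the segmenter sees a normalized copy, where each
# ligature becomes two characters, so the copy is 2 characters longer.
normalized = unicodedata.normalize("NFKC", original)

offsets = split_offsets(normalized)

# Offsets are valid for the text they were computed on ...
correct = [normalized[s:e] for s, e in offsets]
# ... but shifted when applied to the text of a different length.
shifted = [original[s:e] for s, e in offsets]

for sentence in shifted:
    print(f"<s>{sentence}</s>")
```

With these assumptions, the offsets applied to `original` cut "The" at the sentence boundary in the same way as the output above, while the same offsets applied to `normalized` give clean sentences. Any step that changes the character count between offset computation and offset use would have a similar effect.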