
Issue with sentence segmentation offsets #753

@kermitt2

In the following example

https://arxiv.org/pdf/2103.12028v1.pdf

there are cases of wrong sentence segmentation: the sentence offsets appear to be shifted by a few characters, so words get cut at the sentence boundary. This happens whichever sentence segmenter is selected, OpenNLP or Pragmatic Segmenter:

<s>Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. T</s>
<s>e quality of the extracted parallel sentences is evaluated in a machine translation (MT) task on six European...</s> 

Since this happens with both segmenters, which compute offsets in different ways, the cause is probably not in the segmenters themselves; it might be due to issues with character encoding.
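To illustrate how an encoding-related mismatch could produce exactly this kind of cut, here is a minimal Python sketch. It is not GROBID code and does not claim to be the actual cause: it assumes, purely for illustration, that offsets are computed on a normalized copy of the text (ligatures expanded) and then applied to the raw text (ligatures kept as single characters). The splitter `naive_sentence_spans`, the helper `cut`, and the sample string are all hypothetical.

```python
import unicodedata


def naive_sentence_spans(text):
    """Hypothetical toy splitter: a sentence ends at '. '.

    Returns a list of (start, end) character offsets into `text`.
    """
    spans = []
    start = 0
    i = text.find(". ", start)
    while i != -1:
        spans.append((start, i + 1))  # include the period
        start = i + 2                 # skip the separating space
        i = text.find(". ", start)
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def cut(text, spans):
    """Extract the substrings of `text` designated by `spans`."""
    return [text[s:e] for s, e in spans]


# Assumption: the raw text keeps two ligatures as single code points
# (U+FB01 'fi' and U+FB00 'ff'), as PDF extraction sometimes does.
raw = "We \ufb01nd di\ufb00erent scripts. The quality is evaluated."

# NFKC expands each ligature into two letters, so this copy is 2 chars longer.
normalized = unicodedata.normalize("NFKC", raw)

# Offsets computed on one representation, applied to the other: shifted cut.
shifted = cut(raw, naive_sentence_spans(normalized))

# Offsets computed and applied on the same representation: correct cut.
correct = cut(raw, naive_sentence_spans(raw))

print(shifted)
print(correct)
```

In this sketch the first shifted sentence ends with a stray `T` and the second starts at `e quality`, which mirrors the symptom reported above. If something similar is happening, comparing the length of the string passed to the segmenter with the length of the string the offsets are applied to would be a quick check.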

Labels

bug
