PDF extraction introducing stray double carriage returns of unknown cause #8

@azamanian

@reynoldsm88

Any double carriage return is going to introduce a sentence break during information extraction, so a double carriage return in the middle of a sentence is quite destructive. Sometimes it's obvious what's causing them, but I also see them in seemingly random places. For instance, in the PDF of document 1f5db65f2b3b158f8b3f0ae53f7c508c:

image

The converter is introducing a double carriage return between "and" and "Nutrition Teams". Other line breaks in this bullet point and in similar bullet points do not typically cause double carriage returns, although there are other stray ones, such as after "WFP staff is working alongside NDRMC staff in" in the same document.

pdf source:

https://documents.wfp.org/stellent/groups/Public/documents/ep/WFP284788.pdf?_ga=2.243684229.1030149860.1553624300-1022356052.1547047485
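
In case it helps with triage, here is a rough sketch of how the suspicious breaks could be flagged in the extracted text. It is only a heuristic and not how the converter works: it assumes a "double carriage return" shows up as a blank line, and it treats any blank-line break as suspect when the text before it does not end in sentence-like punctuation. The function name `find_suspect_breaks`, the `TERMINAL` set, and the sample string are all made up for illustration.

```python
import re

# Characters assumed to mark a plausible end of a sentence or list lead-in.
TERMINAL = ('.', '!', '?', ':', ';')

def find_suspect_breaks(text):
    """Return (before, after) snippets around blank-line breaks that look mid-sentence."""
    # Normalize line endings so every double break becomes "\n\n".
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Split on blank lines; whitespace-only lines count as blank.
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
    suspects = []
    for before, after in zip(blocks, blocks[1:]):
        # A block that ends without terminal punctuation was probably cut mid-sentence.
        if not before.endswith(TERMINAL):
            suspects.append((before[-40:], after[:40]))
    return suspects

# Made-up sample text, loosely modeled on the break described above.
sample = "WFP supports the Health and\n\nNutrition Teams.\n\nA new paragraph."
for before, after in find_suspect_breaks(sample):
    print(repr(before), '->', repr(after))
```

Headings and bullet items that legitimately end without punctuation would also be flagged, so this is meant for finding candidates to inspect, not for automatically rejoining text.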
