PDF extraction erroneously combines two words based on end-of-line dashes #7

@reynoldsm88

In the PDF for document 25ac6fa139dc49c98194c0fd80dbe900 (and elsewhere), there are end-of-line dashes like these:

![25ac6fa139dc49c98194c0fd80dbe900](https://user-images.githubusercontent.com/38891375/63975688-902b7280-ca7d-11e9-8dc4-bb4185674c98.jpg)

such as the ones between "above" and "average", and between "surplus" and "producing". Normally, an end-of-line dash indicates a word split across two lines, so the converter is reasonably combining the halves into single tokens: "aboveaverage" and "surplusproducing".

If possible, though, we could use a simple algorithm that checks whether the two words combined form a real word. If they don't, as in the cases above, don't fully combine the words; just keep the dash between them.
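A rough sketch of that check, in Python for illustration only (the extractor's actual language and code are not shown in this issue). The function name `join_line_break_hyphens` and the `vocabulary` parameter are hypothetical; the word list would have to come from whatever dictionary the project has available.

```python
import re

# Matches "left-<newline>right": a word, a dash at end of line, then the next word.
_LINE_BREAK_HYPHEN = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def join_line_break_hyphens(text, vocabulary):
    """Rejoin words split across lines, keeping the dash for real compounds.

    `vocabulary` is an assumed set of known lowercase words.
    """

    def _join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in vocabulary:
            # The combined form is a real word, so the dash was only a line-break split.
            return left + right
        # Not a known word: treat it as a genuine hyphenated compound.
        return left + "-" + right

    return _LINE_BREAK_HYPHEN.sub(_join, text)
```

With a vocabulary containing "extraction" but not "aboveaverage", `"above-\naverage"` would become `"above-average"` while `"extrac-\ntion"` would become `"extraction"`. In both cases the line break is dropped, so the second half moves up to the first line. A word missing from the vocabulary would keep its dash, so the quality of the word list matters.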

PDF source:

http://fews.net/sites/default/files/documents/reports/MONTHLY%20PRICE%20WATCH_AND_ANNEX_NOVEMBER2014_0_1.pdf
