Strip formatting #1
When I run the PDF tests, I get output that looks like this:
Textractor returns the contents of pdf documents
Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
expected: "text",
got: "text\t\r \302\240 \t\r \302\240" (using ==)
My pdftotext version must handle formatting characters differently from yours. Do you think this is something textractor should handle?
In my use case I never care about document formatting. I only want strings separated by single spaces, with a limited subset of punctuation (periods and commas), for use in indexing documents. I don't mind handling this in each application, but I'd be glad to write it into textractor if you think there's value in that.
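To make the proposal concrete, here is a rough sketch of the kind of normalization I have in mind. The method name `normalize_text` is hypothetical and not part of textractor today, and it assumes the extracted text is UTF-8 (the `\302\240` bytes in the failure above are a UTF-8 non-breaking space). It uses `[[:space:]]` rather than `\s` because `\s` does not match a non-breaking space in Ruby.

```ruby
# Hypothetical helper, not part of textractor's current API.
# Assumes the extractor output is UTF-8 encoded.
def normalize_text(raw)
  # Work on a copy and treat the bytes as UTF-8, replacing any invalid bytes.
  text = raw.dup.force_encoding(Encoding::UTF_8).scrub(" ")

  # Keep letters, digits, whitespace, periods and commas; blank out the rest.
  text = text.gsub(/[^[:alnum:][:space:].,]/, " ")

  # Collapse every run of whitespace (tabs, CR/LF, non-breaking spaces)
  # into a single space and trim the ends.
  text.gsub(/[[:space:]]+/, " ").strip
end
```

With something like this, the failing spec would compare against `normalize_text(Textractor.text_from_path(...))`, or textractor could apply it internally behind an option so callers who do want the original formatting can still get it.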