Strip formatting

When I run the PDF tests, I get output that looks like this:

Textractor returns the contents of pdf documents
     Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
     expected: "text",
          got: "text\t\r \302\240 \t\r \302\240" (using ==)

My pdftotext version must handle formatting characters differently from yours. Do you think this is something Textractor should handle?

In my use case I never care about the document formatting. I only want strings separated by single spaces, with a limited subset of punctuation (i.e. periods and commas), for use in indexing documents. I don't mind handling this in each application, but I'd be glad to write it into Textractor if you think there's value in that.
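To make the proposal concrete, here is a rough sketch of the kind of cleanup I have in mind. The module and method names (`FormattingStripper`, `strip_formatting`) are hypothetical and not part of Textractor's current API, and it assumes the extracted text is a UTF-8 string, so that the `\302\240` bytes in the failure above are read as a non-breaking space (U+00A0).

```ruby
# Hypothetical helper, not part of Textractor's existing API.
# Assumes the input is a UTF-8 encoded String.
module FormattingStripper
  def self.strip_formatting(text)
    text
      .gsub(/[[:space:]]+/, " ")    # tabs, CR/LF and U+00A0 runs become one space
      .gsub(/[^[:alnum:]., ]/, "")  # keep only letters, digits, periods, commas, spaces
      .squeeze(" ")                 # collapse spaces left behind by removed characters
      .strip                        # drop leading and trailing spaces
  end
end

# The string from the failing spec above, with the non-breaking spaces as U+00A0.
puts FormattingStripper.strip_formatting("text\t\r \u00A0 \t\r \u00A0")
```

`[[:space:]]` is used rather than `\s` because `\s` in Ruby only matches ASCII whitespace and would leave the non-breaking spaces in place. Whether this belongs in Textractor itself or behind an opt-in option is still the open question.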


Strip formatting #1
