Messy transcripts #15

We're extracting transcripts of documents using the pdfplumber library. It's a great tool, but the output contains a lot of noise, some of which comes from the documents themselves. The issues seen so far:

1. Headers and footers
2. 'ti' being parsed as (cid:45)
3. Lots of \n and extra whitespace being inserted between words, which breaks regex-based extraction.

Possible fixes:
1. Use heuristics to mitigate the issues; detect the cases that regex cannot solve and fall back to LLM calls for those.
2. Try out different PDF parsing libraries and compare output quality.
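
A rough sketch of what the heuristic route (fix 1) could look like, assuming each page's text is already available as a string (for example from pdfplumber's `page.extract_text()`). The function names, the `min_fraction` threshold, and the one-entry `CID_MAP` are all hypothetical; the only mapping known from this issue is (cid:45) to 'ti', and that mapping is likely specific to the font in these documents.

```python
import re
from collections import Counter

# Assumption: the only known glyph mapping is (cid:45) -> "ti", as reported above.
CID_MAP = {"(cid:45)": "ti"}
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on most pages (likely headers or footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    # Require at least two pages so single-page documents are left alone.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in repeated)
        for page in pages
    ]


def clean_text(text):
    """Apply known (cid:N) replacements, then collapse whitespace runs."""
    for token, replacement in CID_MAP.items():
        text = text.replace(token, replacement)
    # Newlines and repeated spaces both become a single space.
    return re.sub(r"\s+", " ", text).strip()


def needs_fallback(text):
    """True if unmapped (cid:N) tokens remain, i.e. a candidate for an LLM call."""
    return bool(CID_PATTERN.search(text))


def clean_pages(pages):
    """Remove repeated header/footer lines, then clean each page."""
    return [clean_text(page) for page in strip_repeated_lines(pages)]
```

Pages for which `needs_fallback` still returns true after cleaning are the ones to route to an LLM call. Note that exact-match line removal will miss footers that change per page (page numbers, dates), so those would need a looser comparison.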
