Open
Labels
bug (Something isn't working)
Description
We're extracting document transcripts with the pdfplumber library. It is a great tool, but the output is noisy, and some of that noise comes from the documents themselves. The issues we've seen so far:
- Headers and footers
- 'ti' being parsed as (cid:45)
- Lots of \n and extra whitespace inserted between words, which breaks regex-based extraction.
Possible fixes:
- Use heuristics to mitigate these issues, detect the cases that regex cannot solve, and fall back to LLM calls for those.
- Try out other PDF parsing libraries and compare their output quality.
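A rough sketch of what the heuristic pass could look like, to make the first option concrete. It works on a list of per-page strings (for example, what `page.extract_text()` returns for each page in pdfplumber) and uses only the standard library. All function names, the `CID_MAP` table, and the `min_fraction` threshold are hypothetical; the only mapping taken from this issue is (cid:45) to 'ti', and that may well differ per font or document.

```python
import re
from collections import Counter

CID_RE = re.compile(r"\(cid:(\d+)\)")

# Assumption: only the (cid:45) -> "ti" case reported in this issue is known.
# Other documents or fonts may need a different table.
CID_MAP = {45: "ti"}


def replace_cids(text, cid_map=CID_MAP):
    """Replace known (cid:N) tokens; leave unknown ones in place."""
    def substitute(match):
        return cid_map.get(int(match.group(1)), match.group(0))
    return CID_RE.sub(substitute, text)


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that repeat verbatim on many pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in repeated)
        for page in pages
    ]


def needs_llm(text):
    """Flag text that still contains unresolved (cid:N) tokens."""
    return bool(CID_RE.search(text))


def clean_pages(pages):
    """Run the heuristics in order: headers/footers, cid tokens, whitespace."""
    return [
        normalize_whitespace(replace_cids(page))
        for page in strip_repeated_lines(pages)
    ]


if __name__ == "__main__":
    sample = [
        "ACME Report\nThe na(cid:45)on  \n is\n large\nPage footer",
        "ACME Report\nSecond   page (cid:99)\nPage footer",
        "ACME Report\nThird\nPage footer",
    ]
    for text in clean_pages(sample):
        print(repr(text), "-> LLM fallback" if needs_llm(text) else "")
```

This only catches headers and footers that repeat verbatim, so footers with changing page numbers would need a looser match, and it does not repair whitespace inserted inside a word. Pages where `needs_llm` returns true are candidates for the LLM fallback.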