Messy transcripts #15

@pranshu-raj-211

We're extracting transcripts of documents using the pdfplumber library. It is a great tool, but its output contains a lot of noise, some of which is caused by the documents themselves. The issues seen so far:

  1. Headers and footers
  2. 'ti' being parsed as (cid:45)
  3. Lots of \n (newline) characters and extra whitespace being inserted between words, which breaks regex-based extraction.

Possible fixes:

  1. Use heuristics to mitigate the issues, detect the cases that regex cannot solve, and fall back to LLM calls for those.
  2. Try out different PDF parsing libraries for this purpose and compare their output quality.
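
A minimal sketch of what the heuristic pass in fix 1 could look like, assuming each page's text has already been extracted (for example with pdfplumber's `page.extract_text()`) into a list of strings. The function names, the `CID_MAP` table, and the `min_fraction` threshold are all hypothetical; the only mapping known so far is the `(cid:45)` to 'ti' case reported above.

```python
import re
from collections import Counter

# Assumption: the only known mapping is the one reported in this issue.
CID_MAP = {"(cid:45)": "ti"}
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on most pages (likely headers and footers)."""
    counts = Counter()
    for text in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in text.splitlines() if ln.strip() not in repeated)
        for text in pages
    ]


def clean_text(text):
    """Replace known (cid:N) tokens, then collapse whitespace runs."""
    for token, replacement in CID_MAP.items():
        text = text.replace(token, replacement)
    # Any run of spaces, tabs, or newlines becomes a single space.
    return re.sub(r"\s+", " ", text).strip()


def needs_llm(text):
    """Flag text that still contains an unmapped (cid:N) token."""
    return bool(CID_PATTERN.search(text))


pages = [
    "ACME Report\nThe  na(cid:45)on\nis  large\nConfidential",
    "ACME Report\nsec(cid:45)on two (cid:12)\nConfidential",
    "ACME Report\nthird  page\nConfidential",
]
cleaned = [clean_text(p) for p in strip_repeated_lines(pages)]
flagged = [t for t in cleaned if needs_llm(t)]
```

This only catches headers and footers that are identical across pages (a footer containing a page number would slip through), and collapsing whitespace does not repair spaces inserted inside a word. Those leftovers, along with anything `needs_llm` flags, are the cases that would go to an LLM call.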

Labels: bug
