fuzzy searches, get_references for messy ASR #119

@whicks1

Description

Machine-generated ASR output from programs like OpenAI's Whisper is on the rise, and it tends to format scripture references messily: book, chapter, and verse numbers switch inconsistently between integers, ordinals, and spelled-out words; spans are written in many different ways; capitalization varies; and so on.

Here are a handful of example lines from WebVTT/SRT outputs from a batch I've run recently:

  • Second Timothy chapter two verses three and four says endure hardship
  • If you read Ephesians four 17 through 32 all the ammunition
  • remember that powerful message of Paul in first Corinthians nine
  • in Jesus's first sermonic presentation on planet earth in Matthew five through seven,
  • Jesus said over in Matthew chapter six, verse number 12,
  • Genesis four, 25.
  • and forth between Haggai two and Ezra three.
  • and go and report to John one-fifteen and thirty.
  • I want to focus on here is Colossians chapter three, 22 through verses through chapter four, verse one.
  • In 1 Corinthians 9.22, you see Paul saying
  • says in Mark 16 10 that the disciples were
  • through that fire, 1 Kings 18.24-38, 1 Chronicles 21.26, 2 Chronicles 7.1-3.
  • open their Bibles to first Corinthians 14, 34, 35 and say, look
  • Genesis 1, 26, 2, 7, and 21, 22.
  • look in Revelations 21, 1 through 7, you can start reading all about
  • Psalms 103.12 says
  • for one another Galatians 6 1 & 2 clearly gives us

Nearly anyone using these tools seriously will need a post-processing step to clean up this sort of data. Feeding the inputs into an LLM or NLP toolkit may make sense, but it would be swell if a library like this one could do some of the heavy lifting to normalize scripture references in a string. Tall order/deep rabbit hole, I understand, but worth a shot.

I suggest a reformat_fuzzy_references function that attempts to return the input_string with references reformatted into a normalized form, even if it only covers a subset of the most common speech patterns. Bonus points if the user has some configuration control over output styles, e.g. omit "chapter" or use "v./vv."
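To make the suggestion concrete, here is a minimal sketch of what such a function might look like. It is not this library's API: only the name reformat_fuzzy_references comes from the suggestion above, while the lookup tables, the regex, and the verse_sep option are hypothetical and cover just a few of the patterns listed earlier (spoken ordinals, spelled-out numbers up to ten, "chapter ... verse ...", and "9.22"-style dots).

```python
import re

# Hypothetical lookup tables, deliberately tiny; real coverage would need every
# book name and a fuller spoken-number grammar ("twenty one", "one-fifteen").
ORDINALS = {"first": "1", "second": "2", "third": "3"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
BOOKS = {b.lower(): b for b in
         ("Timothy", "Corinthians", "Ephesians", "Matthew", "Psalms", "Acts")}

# A number is either digits or one of the known number words.
NUM = rf"(?:\d+|{'|'.join(NUMBER_WORDS)})\b"

# Every match is anchored on a known book name, so bare numbers elsewhere in
# the sentence are never touched.
REFERENCE = re.compile(
    rf"\b(?:(?P<ord>{'|'.join(ORDINALS)}|[123])\s+)?"
    rf"(?P<book>{'|'.join(BOOKS)})\s+(?:chapter\s+)?(?P<chapter>{NUM})"
    rf"(?:(?:[.:]|,?\s+verses?\s+(?:number\s+)?)(?P<verse>{NUM})"
    rf"(?:(?P<join>\s*-\s*|\s+through\s+|\s+and\s+)(?P<end>{NUM}))?)?",
    re.IGNORECASE,
)


def _digits(token):
    """Convert a number word to digits; leave digit strings unchanged."""
    return str(NUMBER_WORDS.get(token.lower(), token))


def reformat_fuzzy_references(input_string, verse_sep=":"):
    """Rewrite the recognized references in input_string as 'Book C:V'."""
    def _rewrite(m):
        book = BOOKS[m["book"].lower()]
        if m["ord"]:
            book = f"{ORDINALS.get(m['ord'].lower(), m['ord'])} {book}"
        out = f"{book} {_digits(m['chapter'])}"
        if m["verse"]:
            out += f"{verse_sep}{_digits(m['verse'])}"
        if m["end"]:
            # "and" lists two verses; "-" and "through" denote a range.
            joiner = "," if m["join"].strip().lower() == "and" else "-"
            out += f"{joiner}{_digits(m['end'])}"
        return out

    return REFERENCE.sub(_rewrite, input_string)


print(reformat_fuzzy_references(
    "Second Timothy chapter two verses three and four says endure hardship"))
print(reformat_fuzzy_references("In 1 Corinthians 9.22, you see Paul saying"))
```

Patterns outside this subset, such as "Ephesians four 17 through 32", are only partially normalized by this sketch, which is a fair picture of how deep the rabbit hole goes.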

Assumed gotchas:

  • Strings may contain other semi-formatted numbers that a simple regex search may falsely flag:
    • I was just in class at 8.30 with my friend Wilson
    • We're going to talk at 3.30 this afternoon about the discipline of grace and there is
    • So in Acts chapter 2, 3,000 were saved.
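One way to reduce these false flags, sketched below under stated assumptions, is to accept a number pattern only when it directly follows a known book name, and to reject a verse candidate that looks like the start of a thousands-grouped number. The BOOKS subset, the CANDIDATE pattern, and find_candidates are hypothetical names for illustration, not part of any existing library.

```python
import re

# Hypothetical subset of book names; a real list would cover the whole canon.
BOOKS = ("Genesis", "Acts", "Mark", "Psalms")

CANDIDATE = re.compile(
    # A book name must come first, so "8.30" on its own never matches.
    rf"\b(?P<book>{'|'.join(BOOKS)})\s+(?:chapter\s+)?(?P<chapter>\d+)\b"
    # Optional verse after ".", ":" or ","; the negative lookahead rejects
    # digits that begin a thousands-grouped number such as "3,000".
    r"(?:\s*[.:,]\s*(?P<verse>\d+)\b(?!,\d{3}\b))?"
)


def find_candidates(text):
    """Return (book, chapter, verse or None) tuples anchored on a book name."""
    return [
        (m["book"], int(m["chapter"]), int(m["verse"]) if m["verse"] else None)
        for m in CANDIDATE.finditer(text)
    ]


print(find_candidates("I was just in class at 8.30 with my friend Wilson"))
print(find_candidates("So in Acts chapter 2, 3,000 were saved."))
print(find_candidates("Psalms 103.12 says"))
```

The heuristic is intentionally conservative: it drops a doubtful verse rather than guessing, which leaves "Acts 2" intact while ignoring "3,000".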

Labels

enhancement (New feature or request)
