Too many junk strings #24

@kweatherman

Description

First of all, this is great. It's a problem I've experimented with and wanted to solve for a while, so thanks a lot for your work.

I do notice there are still a lot of junk strings that get extracted.
If I do an extra filtering pass, I can fix a lot of them (a rough sketch of such a pass follows the list):

  1. In Python, decode all strings to ASCII. Wrapped in a try/except (a bare except, or catching the decode error specifically for purists), this catches and thus ignores any string containing characters > 127. It filters out many bad ones, although it's obviously not so good if you actually want strings other than English.
  2. Filter out all strings that are a run of the same character.
  3. Filter out all strings that don't have at least one English vowel (AEIOU, upper and lower case). But like #1, this is not ideal for non-English strings.
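
Here is a minimal sketch of that extra pass, assuming the extracted strings come in as raw bytes; the function name, example candidates, and structure are mine, not part of the tool:

```python
import re

def is_junk(raw: bytes) -> bool:
    """Heuristic second-pass filter combining the three checks above."""
    # 1. ASCII decode: any byte > 127 raises and the string is rejected.
    try:
        s = raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    # 2. Reject runs of a single repeated character, e.g. b"zzzzzzzz".
    if len(set(s)) <= 1:
        return True
    # 3. Reject strings with no English vowel.
    if not re.search(r"[AEIOUaeiou]", s):
        return True
    return False

# Usage: keep only candidates that survive all three checks.
candidates = [b"GetProcAddress", b"\x90\x90\x90\x90", b"zzzzzzzz", b"BCDFG"]
kept = [c.decode("ascii") for c in candidates if not is_junk(c)]
print(kept)  # ['GetProcAddress']
```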

Even after this additional filtering, several bogus strings still get through.
Q: Does it need a larger corpus of bad vs. good strings to train on?

I know this is a pretty loaded subject. While English is relatively easy because it uses only ~127 code points, it's another story to try to encapsulate the UTF ranges.
I might have a better solution that I've experimented with, based on someone else's work that has largely gone unnoticed.
It uses a statistical database of n-grams, built from a huge corpus collected over several languages and character sets.
It's a mostly pre-ML approach that a lot of commercial text extraction, browser language detection, etc., use, and one that ML could perhaps be applied to as well for a similar or hybrid solution. A rough sketch of the scoring idea is below.
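
To make the idea concrete, here is a minimal sketch of n-gram scoring, assuming a bigram frequency table built offline from a large corpus (the statistical DB mentioned above); `ENGLISH_BIGRAMS`, the entries in it, and any threshold are hypothetical placeholders:

```python
import math

# Hypothetical bigram frequency table built offline from a large corpus.
ENGLISH_BIGRAMS = {"th": 0.027, "he": 0.023, "in": 0.020, "er": 0.018}

def ngram_score(s: str, freqs: dict, floor: float = 1e-7) -> float:
    """Average log-probability of the string's bigrams; lower means more junk-like."""
    s = s.lower()
    bigrams = [s[i:i + 2] for i in range(len(s) - 1)]
    if not bigrams:
        return math.log(floor)
    return sum(math.log(freqs.get(b, floor)) for b in bigrams) / len(bigrams)

# Strings whose average bigram log-probability falls below a tuned threshold
# would be flagged as likely junk; per-language tables would also let the
# same score indicate which language (and character set) matched best.
print(ngram_score("the internet", ENGLISH_BIGRAMS) > ngram_score("qxzkvvb", ENGLISH_BIGRAMS))  # True
```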

This would add multi-character-set detection to your tool (not just UTF) and better initial filtering (a lot of bogus strings will have a low statistical matching score). At the same time it could detect which languages are being used.
