First of all this is great. A problem I've experimented with and wanted to solve for a while.
So thanks a lot for your work.
I do notice there are still a lot of junk strings that get extracted.
If I do an extra filtering pass, I can fix a lot of them (rough sketch after the list):
- In Python, decode each string as ASCII inside a try/except (a bare except, or catching just the decode error for purists); any string containing chars > 127 raises and can be ignored. This filters out many bad ones, although obviously it's not so good if you actually want strings in languages other than English.
- Filter out all strings that are a run of the same character.
- Filter out all strings that don't have at least one English vowel (AEIOU, upper and lower case). But like #1, this isn't ideal for non-English strings.
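Here's a minimal sketch of that extra pass, just to make the three heuristics concrete. `looks_like_junk` is a hypothetical helper name, and the sample candidates are made up; it assumes the extracted strings come in as bytes.

```python
import re

def looks_like_junk(raw: bytes) -> bool:
    """Hypothetical helper applying the three heuristics above."""
    # 1. Decoding as ASCII raises UnicodeDecodeError for any byte > 127,
    #    so non-ASCII candidates get dropped here.
    try:
        s = raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    # 2. Drop strings that are just a run of the same character, e.g. b"AAAA".
    if len(set(s)) <= 1:
        return True
    # 3. Drop strings with no English vowel (upper or lower case).
    if not re.search(r"[AEIOUaeiou]", s):
        return True
    return False

candidates = [b"Hello world", b"AAAAAAA", b"xZq9#k", b"caf\xe9"]
kept = [c for c in candidates if not looks_like_junk(c)]
print(kept)  # -> [b'Hello world']
```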
Still, even after this additional filtering, several bogus strings remain.
Q: Does it need a larger corpus of bad vs. good strings to train on?
I know this is a pretty loaded subject. While English is relatively easy because it only spans ~128 ASCII codepoints, it's another story to try to cover the UTF ranges.
I might have a better solution that I've experimented with, based on someone else's work that has largely gone unnoticed.
It's based on a statistical database of n-grams, built from a huge corpus collected over several languages and character sets.
It's a mostly pre-ML approach that a lot of commercial text extraction, browser language detection, etc. use.
ML could maybe be applied to it as well, for a similar or hybrid solution.
This would add multi-character-set detection (not just UTF) to your tool, plus better initial filtering (a lot of bogus strings will have a low statistical matching score). At the same time, it could detect which languages are being used.
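To illustrate the idea (not the actual DB I mentioned), here's a toy character-trigram scorer: real systems use frequency tables built from large multi-language corpora, whereas the counts, the `ngram_score` name, and the smoothing below are all made up for the sketch.

```python
import math
from collections import Counter

# Toy character-trigram counts standing in for a real frequency DB
# built from a large corpus (all numbers here are illustrative only).
TRIGRAM_COUNTS = Counter({
    "the": 500, "he ": 400, "ing": 350, "and": 300, "ion": 250,
    " th": 450, "ent": 200, "er ": 300, " in": 280, "all": 150,
})
TOTAL = sum(TRIGRAM_COUNTS.values())

def ngram_score(s: str, n: int = 3) -> float:
    """Average log-probability of the string's character n-grams.
    Unseen n-grams get a small floor probability (add-one smoothing)."""
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    if not grams:
        return float("-inf")
    logp = 0.0
    for g in grams:
        p = (TRIGRAM_COUNTS.get(g.lower(), 0) + 1) / (TOTAL + len(TRIGRAM_COUNTS))
        logp += math.log(p)
    return logp / len(grams)

for s in ["The thing fell in", "xZq9#kPw"]:
    print(f"{s!r}: {ngram_score(s):.2f}")
# Natural-language text scores noticeably higher (less negative) than junk,
# so a threshold on this score can act as the statistical filter; with
# per-language tables, the best-scoring table also identifies the language.
```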