First of all this is great. A problem I've experimented with and wanted to solve for a while.
So thanks a lot for your work.
I do notice there are still a lot of junk strings that get extracted.
If I do an extra filtering pass, I can fix a lot of them (rough sketch after the list):
- In Python, decode each string as ASCII inside a try/except (a bare except, or catching just the decode error for purists); any string containing chars > 127 raises and can be ignored. This filters out many bad ones, although obviously it's not so good if you actually want strings in languages other than English.
- Filter out all strings that are a run of the same character.
- Filter out all strings that don't have at least one English vowel (AEIOU, upper and lower case). But like #1, this isn't ideal for non-English strings.
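Here's a minimal sketch of that extra pass, just to make the three heuristics concrete. `looks_like_junk` is a hypothetical helper name, and the sample candidates are made up; it assumes the extracted strings come in as bytes.

```python
import re

def looks_like_junk(raw: bytes) -> bool:
    """Hypothetical helper applying the three heuristics above."""
    # 1. Decoding as ASCII raises UnicodeDecodeError for any byte > 127,
    #    so non-ASCII candidates get dropped here.
    try:
        s = raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    # 2. Drop strings that are just a run of the same character, e.g. b"AAAA".
    if len(set(s)) <= 1:
        return True
    # 3. Drop strings with no English vowel (upper or lower case).
    if not re.search(r"[AEIOUaeiou]", s):
        return True
    return False

candidates = [b"Hello world", b"AAAAAAA", b"xZq9#k", b"caf\xe9"]
kept = [c for c in candidates if not looks_like_junk(c)]
print(kept)  # -> [b'Hello world']
```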
Still, even after this additional filtering, several bogus strings remain.
Q: Does it need a larger corpus of bad vs. good strings to train on?
I know this is a pretty loaded subject. While English is relatively easy because it only spans ~128 ASCII codepoints, it's another story to try to cover the UTF ranges.
I might have a better solution that I've experimented with, based on someone else's work that has largely gone unnoticed.
It's based on a statistical database of n-grams, built from a huge corpus collected over several languages and character sets.
It's a mostly pre-ML approach that a lot of commercial text extraction, browser language detection, etc. use.
ML could maybe be applied to it as well, for a similar or hybrid solution.
This would add multi-character-set detection (not just UTF) to your tool, plus better initial filtering (a lot of bogus strings will have a low statistical matching score). At the same time, it could detect which languages are being used.
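To illustrate the idea (not the actual DB I mentioned), here's a toy character-trigram scorer: real systems use frequency tables built from large multi-language corpora, whereas the counts, the `ngram_score` name, and the smoothing below are all made up for the sketch.

```python
import math
from collections import Counter

# Toy character-trigram counts standing in for a real frequency DB
# built from a large corpus (all numbers here are illustrative only).
TRIGRAM_COUNTS = Counter({
    "the": 500, "he ": 400, "ing": 350, "and": 300, "ion": 250,
    " th": 450, "ent": 200, "er ": 300, " in": 280, "all": 150,
})
TOTAL = sum(TRIGRAM_COUNTS.values())

def ngram_score(s: str, n: int = 3) -> float:
    """Average log-probability of the string's character n-grams.
    Unseen n-grams get a small floor probability (add-one smoothing)."""
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    if not grams:
        return float("-inf")
    logp = 0.0
    for g in grams:
        p = (TRIGRAM_COUNTS.get(g.lower(), 0) + 1) / (TOTAL + len(TRIGRAM_COUNTS))
        logp += math.log(p)
    return logp / len(grams)

for s in ["The thing fell in", "xZq9#kPw"]:
    print(f"{s!r}: {ngram_score(s):.2f}")
# Natural-language text scores noticeably higher (less negative) than junk,
# so a threshold on this score can act as the statistical filter; with
# per-language tables, the best-scoring table also identifies the language.
```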