A Java implementation of the UAX #29 text segmentation algorithm, plus token types for URLs, emoji, emails, hashtags, cashtags, and mentions.
The tokenizer produces the following token types:
ALPHANUM -- A sequence of alphabetic and numeric characters, e.g., hello, test123
NUM -- A number, e.g., 123
SOUTHEAST_ASIAN -- A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
IDEOGRAPHIC -- A single CJKV ideographic character
HIRAGANA -- A single hiragana character
KATAKANA -- A sequence of katakana characters
HANGUL -- A sequence of Hangul characters
URL -- A URL, e.g., https://www.example.com/
EMAIL -- An email address or mailto link, e.g., info@example.com
EMOJI -- A sequence of Emoji characters, e.g., 🙂
HASHTAG -- A social media hashtag, e.g., #hashtag
CASHTAG -- A social media cashtag, e.g., $CASH
MENTION -- A social media mention, e.g., @twitter
To process text into tokens, use code like the following:
    try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer("example text")) {
        for (Token token = tokenizer.nextToken(); token != null; token = tokenizer.nextToken(token)) {
            // Process the token here
        }
    }
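Once tokens are available, a common next step is to branch or aggregate on the token type. The following is only a rough sketch of that idea, not the library's API: `Kind` and `Tok` are hypothetical stand-ins that mirror the token types listed above, since the accessors on the real `Token` class are not shown here. It counts how many tokens of each type appear in a list.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class TokenTypeTally {
    // Hypothetical mirror of the token types listed above; NOT the library's own enum.
    enum Kind {
        ALPHANUM, NUM, SOUTHEAST_ASIAN, IDEOGRAPHIC, HIRAGANA, KATAKANA, HANGUL,
        URL, EMAIL, EMOJI, HASHTAG, CASHTAG, MENTION
    }

    // Hypothetical stand-in for a token: its type plus the matched text.
    static final class Tok {
        final Kind kind;
        final String text;

        Tok(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Counts tokens per type; the EnumMap keeps keys in enum declaration order.
    static Map<Kind, Integer> tally(List<Tok> tokens) {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        for (Tok t : tokens) {
            counts.merge(t.kind, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Tok> tokens = List.of(
            new Tok(Kind.ALPHANUM, "hello"),
            new Tok(Kind.MENTION, "@twitter"),
            new Tok(Kind.HASHTAG, "#hashtag"),
            new Tok(Kind.ALPHANUM, "test123"));
        System.out.println(tally(tokens)); // prints {ALPHANUM=2, HASHTAG=1, MENTION=1}
    }
}
```

In real code, the body of the tokenizer loop would feed each token into a structure like this, using whatever type and text accessors the `Token` class actually provides.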