-
Notifications
You must be signed in to change notification settings - Fork 47
Description
Unicode Technical Standard #55 specifies a concept of "identifier word boundary":
An identifier word boundary is defined by the following rules, using the notation from Section 1.1, Notation, in Unicode Standard Annex #29, Unicode Text Segmentation [UAX29].
Treat a letter followed by a sequence of nonspacing or enclosing marks as that letter.
The regular expressions for the following rules incorporate this one; only the text descriptions rely on it.
🐫 CamelBoundary. An identifier word boundary exists after a lowercase or non-Greek titlecase letter followed by an uppercase or titlecase letter:
[ \p{Ll} [\p{Lt}-\p{Grek}] ] [\p{Mn}\p{Me}]*÷[\p{Lu}\p{Lt}]🎩 HATBoundary. An identifier word boundary exists before an uppercase or titlecase letter followed by a lowercase letter, or before a non-Greek titlecase letter:
÷
[\p{Lu}\p{Lt}] [\p{Mn}\p{Me}]* \p{Ll} | [\p{Lt}-\p{Grek}]🐍 snake_boundary. An identifier word boundary exists either side either side of a Punctuation character which is not an Other_Punctuation character:
÷
[\p{P}-\p{Po}]
[\p{P}-\p{Po}]÷No other identifier word boundaries exist.
Any × Any
It would be nice if heck could follow this spec.