Skip to content

UTS 55 conformance #55

@Jules-Bertholet

Description

@Jules-Bertholet

Unicode Technical Standard #55 specifies a concept of "identifier word boundary":

An identifier word boundary is defined by the following rules, using the notation from Section 1.1, Notation, in Unicode Standard Annex #29, Unicode Text Segmentation [UAX29].

Treat a letter followed by a sequence of nonspacing or enclosing marks as that letter.

The regular expressions for the following rules incorporate this one; only the text descriptions rely on it.

🐫 CamelBoundary. An identifier word boundary exists after a lowercase or non-Greek titlecase letter followed by an uppercase or titlecase letter:

[ \p{Ll} [\p{Lt}-\p{Grek}] ] [\p{Mn}\p{Me}]* ÷ [\p{Lu}\p{Lt}]

🎩 HATBoundary. An identifier word boundary exists before an uppercase or titlecase letter followed by a lowercase letter, or before a non-Greek titlecase letter:

÷ [\p{Lu}\p{Lt}] [\p{Mn}\p{Me}]* \p{Ll} | [\p{Lt}-\p{Grek}]

🐍 snake_boundary. An identifier word boundary exists either side either side of a Punctuation character which is not an Other_Punctuation character:

÷ [\p{P}-\p{Po}]
[\p{P}-\p{Po}] ÷

No other identifier word boundaries exist.

Any × Any

It would be nice if heck could follow this spec.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions