While implementing the word segmenter, I ran into a case, 'unicode-segmenter'를, where the spec and Node.js produce different results:
Chrome/V8 (6): ', unicode, -, segmenter, ', 를
Unicode spec-compliant implementation (4): ', unicode, -, segmenter'를
JSC and SpiderMonkey follow the spec.
But Intl.Segmenter in Chrome/V8 behaves differently:
- can't -> can't
- a'b -> a'b
- a'가 -> a, ', 가
- a'α -> a'α
- a'א -> a'א
- a'中 -> a, ', 中
- a'あ -> a, ', あ
- a'ア -> a, ', ア
- a'ก -> a'ก
- a'ア -> a, ', ア
- 가'나 -> 가, ', 나
- α'β -> α'β
- א'ב -> א'ב
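That implies Chrome/V8 is applying tailoring beyond UAX #29 rather than using the ICU library as-is: the apostrophe still joins Latin, Greek, Hebrew, and Thai letters, but not Hangul, Han, Hiragana, or Katakana.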
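To check how a particular engine handles these inputs, a minimal sketch along the following lines could be run in Node.js or a browser console. The helper name `segmentWords` and the `"en"` locale are assumptions made for illustration; the issue does not say which locale was used. No expected output is shown because the result depends on the engine.

```javascript
// Split `text` into word-granularity segments using the host's Intl.Segmenter.
// `segmentWords` is a hypothetical helper, and the "en" default locale is an
// assumption; neither comes from the issue itself.
function segmentWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each item yielded by segment() has a `segment` string property.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// A few of the inputs listed above, plus the original case.
const samples = ["can't", "a'b", "a'가", "a'中", "가'나", "'unicode-segmenter'를"];

for (const input of samples) {
  const parts = segmentWords(input);
  // Print the segment count and the segments so engines can be compared.
  console.log(`${input} -> (${parts.length}) ${parts.join(", ")}`);
}
```

Running the same script in V8, JSC, and SpiderMonkey and diffing the output is one way to enumerate the cases where the engines disagree.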
As a native Korean speaker, that makes perfect sense to me: joining words across an apostrophe is a strange rule for CJK text.
There may be similar discrepancies in sentence segmentation too.
The question is which behavior unicode-segmenter should follow. Should it strictly follow the spec, or the practical behavior that matches the most popular environment?