Chrome/V8 tailored rules for word/sentence #112

@cometkim

While implementing the word segmenter, I ran into a case where 'unicode-segmenter'를 is segmented differently by the spec and by Node.js:

Chrome/V8 (6): ', unicode, -, segmenter, ', 를
Unicode spec-compliant implementation (4): ', unicode, -, segmenter'를

JSC and Spidermonkey follow the spec.

But Intl.Segmenter in Chrome/V8 behaves differently:

  • can't -> can't
  • a'b -> a'b
  • a'가 -> a, ', 가
  • a'α -> a'α
  • a'א -> a'א
  • a'中 -> a, ', 中
  • a'あ -> a, ', あ
  • a'ア -> a, ', ア
  • a'ก -> a'ก
  • a'ア -> a, ', ア
  • 가'나 -> 가, ', 나
  • α'β -> α'β
  • א'ב -> א'ב

That implies Chrome/V8 is applying tailoring beyond UAX #29 rather than using the ICU library as-is.
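
The comparison above can be reproduced in any engine with a minimal sketch like the one below. It only uses the standard `Intl.Segmenter` API; the helper name `wordSegments` and the `'en'` locale are assumptions for illustration, since the issue does not say which locale was used. Results will vary by engine and ICU version, so no expected output is shown.

```javascript
// Hypothetical helper: split `input` into word segments using the host's
// built-in Intl.Segmenter. The default locale 'en' is an assumption.
function wordSegments(input, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // Each item yielded by segment() has a `segment` string property.
  return Array.from(segmenter.segment(input), (s) => s.segment);
}

// A few of the inputs listed in this issue.
const samples = ["can't", "a'b", "a'가", "a'中", "가'나", "'unicode-segmenter'를"];

for (const sample of samples) {
  const parts = wordSegments(sample);
  // Print the segment count and the segments, in the same style as above.
  console.log(`${sample} -> (${parts.length}) ${parts.join(', ')}`);
}
```

Running the same script in Node.js (V8), Safari (JSC), and Firefox (SpiderMonkey) should make any divergence visible side by side.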

As a native Korean speaker, I think that makes perfect sense: joining words across an apostrophe is a strange rule for CJK text.

There may be similar differences in sentence segmentation too.

The question is which behavior unicode-segmenter should follow. Should it strictly follow the spec, or the practical behavior that matches the most popular environment?
