[quality] Long words in zh-hans model (20198 suggested changes) #874

@peterburk

Thank you for making budoux! I've been using it actively for Chinese (Traditional), Chinese (Simplified), Japanese, and Thai. It's very fast, and I really appreciate your work on it!

Input: UNv1.0.en-zh.zh

Process:
cat "/Users/peter/Downloads/budoux-main/UNv1.0.en-zh.zh" | python3 budoux/main.py -m 'budoux/models/zh-hans.json' > "/Users/peter/Downloads/budoux-main/UNv1.0.en-zh.zhSpaced.txt"

Expected output (sample):

基 皮亚 克 土著马 赛 群体 争取 生存 计划
释放 利比亚国民 阿卜杜勒 巴塞特
波斯尼亚 - 克罗地亚 - 塞尔维亚
( 阿波 斯托 洛斯安 德 列 亚斯 角)

Expected output (full):
UNv1.0.en-zh.zhSpacedWordsOver5CharactersSpaced.txt

Expected output is built using a development copy of https://pingtype.github.io/

Actual output (sample):

基皮亚克土著马赛群体争取生存计划
释放利比亚国民阿卜杜勒巴塞特
波斯尼亚-克罗地亚-塞尔维亚
(阿波斯托洛斯安德列亚斯角)

Actual output (full):
UNv1.0.en-zh.zhSpacedWordsOver5Characters.txt
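
For anyone reproducing this, here is a minimal sketch of how chunks longer than 5 characters could be picked out of segmented text. It assumes the segmented output separates chunks with whitespace; the helper name `long_chunks` and the `max_len` parameter are hypothetical and not part of budoux.

```python
def long_chunks(lines, max_len=5):
    """Return (line_number, chunk) pairs for chunks longer than max_len characters.

    Assumption: each line is already segmented, with chunks separated by whitespace.
    """
    found = []
    for line_number, line in enumerate(lines, start=1):
        for chunk in line.split():
            if len(chunk) > max_len:
                found.append((line_number, chunk))
    return found


if __name__ == "__main__":
    # First line: taken from the "Actual output" sample (no break inserted).
    # Second line: the same text from the "Expected output" sample.
    sample = [
        "释放利比亚国民阿卜杜勒巴塞特",
        "释放 利比亚国民 阿卜杜勒 巴塞特",
    ]
    for line_number, chunk in long_chunks(sample):
        print(line_number, len(chunk), chunk)
```

Run on the two sample lines, only the unsegmented first line is reported; every chunk in the expected segmentation is 5 characters or shorter.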

Please message me if you have any more questions, and I'd be happy to advise. I also have more data for long words (over 5 characters) in Japanese, Chinese (Traditional), and Thai - please comment here or email me when you're working on this issue, and I can collaborate with you more :)
