juno-rmks/utsuho

Utsuho

Utsuho is a Python library for deterministic normalization of Japanese text variants.

It focuses on character-level conversions such as width normalization and kana conversion, while avoiding unrelated transformations that general-purpose Unicode normalization may introduce.

  • Bidirectional conversion between half-width and full-width katakana
  • Bidirectional conversion between hiragana and katakana
  • Configurable handling of spaces, punctuation, ASCII symbols, digits, and letters
  • Command-line interface for interactive use and scripting

Why Utsuho?

Japanese text often mixes multiple representations of the same content, such as half-width and full-width katakana, or hiragana and katakana. Python's built-in Unicode normalization (unicodedata.normalize) can help in some cases, but it may also perform conversions you do not want, such as changing the width of ASCII symbols or decomposing composite characters.

Utsuho provides explicit, deterministic character-level conversions for these Japanese text variants, making it easier to normalize Japanese text without introducing unrelated transformations.
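The side effects described above can be reproduced with the standard library alone. The following is a minimal sketch that uses only unicodedata (not Utsuho); the sample strings and the helper name nfkc are illustrative choices, not taken from the project.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply general-purpose NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


# Illustrative samples: the first conversion is usually wanted,
# the other two are often unwanted side effects.
samples = {
    "half-width katakana": "ｷﾞﾝｶｸｼﾞ",
    "full-width ASCII": "Ａ！",
    "composite character": "㈱",
}

for label, text in samples.items():
    print(f"{label}: {text!r} -> {nfkc(text)!r}")
```

NFKC widens the half-width katakana as desired, but it also narrows the full-width letter and symbol and expands the single character ㈱ into three characters.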

Installation

Install Utsuho with pip:

pip install Utsuho

Quick Start

Half-width to full-width katakana

from utsuho import HalfToFullConverter

text = "ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2"
converted = HalfToFullConverter().convert(text)

print(converted)
# キョウトシ　サキョウク　ギンカクジチョウ　２
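To illustrate what a katakana-only width conversion involves, here is a minimal standard-library sketch. It is not Utsuho's implementation: it assumes that applying NFKC only to runs of half-width katakana (U+FF61 to U+FF9F) is an acceptable stand-in, and the function name half_kana_to_full is hypothetical. Restricting the normalization to those runs lets a base character and a following half-width voiced sound mark combine into one full-width character while everything else is left alone.

```python
import re
import unicodedata

# Runs of half-width CJK punctuation and half-width katakana (U+FF61..U+FF9F).
_HALF_KANA_RUN = re.compile(r"[\uFF61-\uFF9F]+")


def half_kana_to_full(text: str) -> str:
    """Widen half-width katakana only; all other characters are untouched.

    Hypothetical helper for illustration, not part of Utsuho's API.
    """
    return _HALF_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


print(half_kana_to_full("ｷﾞﾝｶｸｼﾞ 2F"))
```

Unlike this sketch, Utsuho's converters also handle spaces, digits, and other character classes according to WidthConverterConfig.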

Full-width to half-width katakana

from utsuho import FullToHalfConverter

text = "キョウトシ　サキョウク　ギンカクジチョウ　２"
converted = FullToHalfConverter().convert(text)

print(converted)
# ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2

Hiragana to katakana

from utsuho import HiraganaToKatakanaConverter

text = "きょうとし さきょうく ぎんかくじちょう 2"
converted = HiraganaToKatakanaConverter().convert(text)

print(converted)
# キョウトシ サキョウク ギンカクジチョウ 2

Katakana to hiragana

from utsuho import KatakanaToHiraganaConverter

text = "キョウトシ サキョウク ギンカクジチョウ 2"
converted = KatakanaToHiraganaConverter().convert(text)

print(converted)
# きょうとし さきょうく ぎんかくじちょう 2
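For readers curious about why kana conversion can be deterministic, the sketch below shows one common technique: the hiragana block (U+3041 to U+3096) and the corresponding katakana block sit exactly 0x60 code points apart in Unicode, so a fixed translation table suffices. This is an assumption-labeled illustration using only the standard library; the function names are hypothetical and Utsuho's own converters may handle more characters or work differently.

```python
# Hiragana ぁ (U+3041) through ゖ (U+3096); katakana counterparts are +0x60.
_HIRAGANA = range(0x3041, 0x3097)
_OFFSET = 0x60

_HIRA_TO_KATA = {cp: cp + _OFFSET for cp in _HIRAGANA}
_KATA_TO_HIRA = {cp + _OFFSET: cp for cp in _HIRAGANA}


def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana to katakana; other characters pass through unchanged."""
    return text.translate(_HIRA_TO_KATA)


def katakana_to_hiragana(text: str) -> str:
    """Shift full-width katakana to hiragana; other characters are unchanged."""
    return text.translate(_KATA_TO_HIRA)


print(hiragana_to_katakana("きょうとし 2"))
print(katakana_to_hiragana("ギンカクジチョウ 2"))
```

Characters outside the mapped range, such as spaces, digits, and the long vowel mark ー, are deliberately left as they are in this sketch.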

Configuring Width Conversion

Use WidthConverterConfig to control which non-katakana characters are normalized during half-width and full-width conversion.

from utsuho import HalfToFullConverter, WidthConverterConfig

config = WidthConverterConfig(
    ascii_symbol=False,
    ascii_digit=False,
    ascii_alphabet=False,
)

converted = HalfToFullConverter(config).convert("ｷﾞﾝｶｸｼﾞ 2F")

Available options:

| Parameter | Default | Description |
| --- | --- | --- |
| punctuation | True | Convert punctuation marks. |
| corner_brucket | True | Convert corner brackets. |
| conjunction_mark | True | Convert conjunction marks. |
| length_mark | True | Convert length marks. |
| space | True | Convert spaces. |
| ascii_symbol | True | Convert ASCII symbols. |
| ascii_digit | True | Convert ASCII digits. |
| ascii_alphabet | True | Convert ASCII letters. |
| wave_dash | False | Convert full-width wave dashes to half-width tildes in full-to-half conversion. |

Note

For historical reasons, the current public API spells this parameter corner_brucket (not corner_bracket). Use that spelling when constructing WidthConverterConfig.
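To make the effect of per-category switches concrete, the following is a minimal sketch of flag-controlled ASCII widening. It is not Utsuho's implementation: the function ascii_to_full is hypothetical, its keyword names only echo the option names in the table above, and the way it classifies characters (str.isdigit, str.isalpha, everything else as a symbol) is an assumption. It relies on the fact that full-width ASCII forms (U+FF01 to U+FF5E) sit 0xFEE0 above their ASCII counterparts, with the space mapping separately to U+3000.

```python
_FULLWIDTH_OFFSET = 0xFEE0  # "!" (U+0021) -> "！" (U+FF01)
_IDEOGRAPHIC_SPACE = "\u3000"


def ascii_to_full(
    text: str,
    *,
    space: bool = True,
    symbol: bool = True,
    digit: bool = True,
    alphabet: bool = True,
) -> str:
    """Widen printable ASCII, one switch per character category.

    Hypothetical helper for illustration, not part of Utsuho's API.
    """
    out = []
    for ch in text:
        if ch == " ":
            out.append(_IDEOGRAPHIC_SPACE if space else ch)
        elif "!" <= ch <= "~":
            if ch.isdigit():
                enabled = digit
            elif ch.isalpha():
                enabled = alphabet
            else:
                enabled = symbol
            out.append(chr(ord(ch) + _FULLWIDTH_OFFSET) if enabled else ch)
        else:
            # Non-ASCII characters are outside the scope of this sketch.
            out.append(ch)
    return "".join(out)


print(ascii_to_full("2F"))
print(ascii_to_full("2F", digit=False, alphabet=False))
```

Keeping one switch per category means a caller can, for example, widen spaces while leaving room numbers such as 2F in plain ASCII.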

CLI

Utsuho also provides a command-line interface for interactive use and scripting.

% utsuho --help
Usage: utsuho [OPTIONS] COMMAND [ARGS]...

  Utsuho provides deterministic normalization utilities for Japanese text,
  including width normalization and hiragana/katakana conversion.

Options:
  --version  Show the version.
  --help     Show this message and exit.

Commands:
  full-to-half          Convert from full-width to half-width characters.
  half-to-full          Convert from half-width to full-width characters.
  hiragana-to-katakana  Convert from hiragana to katakana.
  katakana-to-hiragana  Convert from katakana to hiragana.

Examples:

% utsuho full-to-half "キョウトシ　サキョウク　ギンカクジチョウ　２"
ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2

% utsuho half-to-full "ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2"
キョウトシ　サキョウク　ギンカクジチョウ　２

% utsuho hiragana-to-katakana "きょうとし さきょうく ぎんかくじちょう 2"
キョウトシ サキョウク ギンカクジチョウ 2

% utsuho katakana-to-hiragana "キョウトシ サキョウク ギンカクジチョウ 2"
きょうとし さきょうく ぎんかくじちょう 2

Each command also accepts --file (or -f) to treat the argument as a UTF-8 text file path.

Documentation

License

This project is licensed under the Apache License 2.0. See LICENSE for details.
