Utsuho is a Python library for deterministic normalization of Japanese text variants.
It focuses on character-level conversions such as width normalization and kana conversion, while avoiding unrelated transformations that general-purpose Unicode normalization may introduce.
- Bidirectional conversion between half-width and full-width katakana
- Bidirectional conversion between hiragana and katakana
- Configurable handling of spaces, punctuation, ASCII symbols, digits, and alphabets
- Command-line interface for interactive use and scripting
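
As a rough picture of what the hiragana/katakana conversion in the list above involves, the sketch below relies on the fact that the two Unicode blocks are laid out in the same order, 0x60 code points apart. It is not Utsuho's implementation; the table and function names are hypothetical and exist only for this illustration.

```python
# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 share the same layout,
# offset by 0x60 code points.
KANA_OFFSET = 0x60
HIRAGANA_RANGE = range(0x3041, 0x3097)

# Hypothetical translation tables (not part of Utsuho's API).
HIRA_TO_KATA = {cp: cp + KANA_OFFSET for cp in HIRAGANA_RANGE}
KATA_TO_HIRA = {cp + KANA_OFFSET: cp for cp in HIRAGANA_RANGE}


def hiragana_to_katakana(text: str) -> str:
    """Shift every hiragana character into the katakana block."""
    return text.translate(HIRA_TO_KATA)


def katakana_to_hiragana(text: str) -> str:
    """Shift every full-width katakana character into the hiragana block."""
    return text.translate(KATA_TO_HIRA)


print(hiragana_to_katakana("きょうとし さきょうく 2"))
# キョウトシ サキョウク 2
print(katakana_to_hiragana("キョウトシ サキョウク 2"))
# きょうとし さきょうく 2
```

Characters outside the mapped ranges, such as spaces, digits, and the prolonged sound mark, pass through unchanged.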
Japanese text often mixes multiple representations of the same content, such as half-width and full-width katakana, or hiragana and katakana. Python's Unicode normalization can help in some cases, but it may also perform conversions you do not want, such as changing ASCII symbols or decomposing composite characters.
Utsuho provides explicit, deterministic character-level conversions for these Japanese text variants, making it easier to normalize Japanese text without introducing unrelated transformations.
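
To make the side effects concrete, the snippet below runs the standard library's `unicodedata.normalize` with the NFKC form on a few samples. It uses only the Python standard library; the helper name `nfkc` is just for this illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization using the standard library."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana becomes full-width, which is often what you want...
print(nfkc("ｷﾞﾝｶｸｼﾞ"))  # ギンカクジ

# ...but the same call also rewrites characters you may want to keep.
print(nfkc("！？"))  # !?        (full-width symbols become ASCII)
print(nfkc("①"))    # 1         (circled digit loses its circle)
print(nfkc("㌔"))    # キロ      (square ligature is split apart)
print(nfkc("㍿"))    # 株式会社  (one character becomes four)
```

NFKC applies all of these mappings at once, with no way to choose among them.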
Install Utsuho with pip:
```
pip install Utsuho
```

Convert half-width characters to full-width:

```python
from utsuho import HalfToFullConverter

text = "ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2"
converted = HalfToFullConverter().convert(text)
print(converted)
# キョウトシ　サキョウク　ギンカクジチョウ　２
```

Convert full-width characters to half-width:

```python
from utsuho import FullToHalfConverter

text = "キョウトシ サキョウク ギンカクジチョウ 2"
converted = FullToHalfConverter().convert(text)
print(converted)
# ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2
```

Convert hiragana to katakana:

```python
from utsuho import HiraganaToKatakanaConverter

text = "きょうとし さきょうく ぎんかくじちょう 2"
converted = HiraganaToKatakanaConverter().convert(text)
print(converted)
# キョウトシ サキョウク ギンカクジチョウ 2
```

Convert katakana to hiragana:

```python
from utsuho import KatakanaToHiraganaConverter

text = "キョウトシ サキョウク ギンカクジチョウ 2"
converted = KatakanaToHiraganaConverter().convert(text)
print(converted)
# きょうとし さきょうく ぎんかくじちょう 2
```

Use `WidthConverterConfig` to control which non-katakana characters are normalized during half-width and full-width conversion.

```python
from utsuho import HalfToFullConverter, WidthConverterConfig

# Leave ASCII symbols, digits, and letters untouched.
config = WidthConverterConfig(
    ascii_symbol=False,
    ascii_digit=False,
    ascii_alphabet=False,
)
converted = HalfToFullConverter(config).convert("ｷﾞﾝｶｸｼﾞ 2F")
```

Available options:
| Parameter | Default | Description |
|---|---|---|
| `punctuation` | `True` | Convert punctuation marks. |
| `corner_brucket` | `True` | Convert corner brackets. |
| `conjunction_mark` | `True` | Convert conjunction marks. |
| `length_mark` | `True` | Convert length marks. |
| `space` | `True` | Convert spaces. |
| `ascii_symbol` | `True` | Convert ASCII symbols. |
| `ascii_digit` | `True` | Convert ASCII digits. |
| `ascii_alphabet` | `True` | Convert ASCII letters. |
| `wave_dash` | `False` | Convert full-width wave dashes to half-width tildes in full-to-half conversion. |
Note: For historical reasons, the public API currently spells this parameter `corner_brucket` (not `corner_bracket`).
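
To illustrate how per-category flags like these can gate a width conversion, here is a minimal sketch limited to printable ASCII. It is not Utsuho's implementation: `SketchWidthConfig`, `build_half_to_full_table`, and `half_to_full_ascii` are hypothetical names, and the sketch assumes only that ASCII U+0021..U+007E maps to the full-width forms U+FF01..U+FF5E at a fixed offset of 0xFEE0.

```python
from dataclasses import dataclass

# Printable ASCII U+0021..U+007E maps to full-width U+FF01..U+FF5E.
FULLWIDTH_OFFSET = 0xFEE0


@dataclass(frozen=True)
class SketchWidthConfig:
    """Hypothetical stand-in that mirrors three of the documented flags."""
    ascii_symbol: bool = True
    ascii_digit: bool = True
    ascii_alphabet: bool = True


def build_half_to_full_table(config: SketchWidthConfig) -> dict:
    """Build a str.translate table containing only the enabled categories."""
    table = {}
    for cp in range(0x21, 0x7F):
        ch = chr(cp)
        if ch.isdigit():
            enabled = config.ascii_digit
        elif ch.isalpha():
            enabled = config.ascii_alphabet
        else:
            enabled = config.ascii_symbol
        if enabled:
            table[cp] = cp + FULLWIDTH_OFFSET
    return table


def half_to_full_ascii(text: str, config: SketchWidthConfig = SketchWidthConfig()) -> str:
    """Widen ASCII characters whose category is enabled in the config."""
    return text.translate(build_half_to_full_table(config))


print(half_to_full_ascii("2F-A"))
# ２Ｆ－Ａ
print(half_to_full_ascii("2F-A", SketchWidthConfig(ascii_digit=False)))
# 2Ｆ－Ａ
```

Because disabled categories never enter the translation table, those characters pass through unchanged.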
Utsuho also provides a command-line interface for interactive use and scripting.
```
% utsuho --help
Usage: utsuho [OPTIONS] COMMAND [ARGS]...

  Utsuho provides deterministic normalization utilities for Japanese text,
  including width normalization and hiragana/katakana conversion.

Options:
  --version  Show the version.
  --help     Show this message and exit.

Commands:
  full-to-half          Convert from full-width to half-width characters.
  half-to-full          Convert from half-width to full-width characters.
  hiragana-to-katakana  Convert from hiragana to katakana.
  katakana-to-hiragana  Convert from katakana to hiragana.
```

Examples:

```
% utsuho full-to-half "キョウトシ サキョウク ギンカクジチョウ 2"
ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2
% utsuho half-to-full "ｷｮｳﾄｼ ｻｷｮｳｸ ｷﾞﾝｶｸｼﾞﾁｮｳ 2"
キョウトシ　サキョウク　ギンカクジチョウ　２
% utsuho hiragana-to-katakana "きょうとし さきょうく ぎんかくじちょう 2"
キョウトシ サキョウク ギンカクジチョウ 2
% utsuho katakana-to-hiragana "キョウトシ サキョウク ギンカクジチョウ 2"
きょうとし さきょうく ぎんかくじちょう 2
```

Each command also accepts `--file` (or `-f`) to treat the argument as the path to a UTF-8 text file.
- Documentation: https://utsuho.readthedocs.io/
- Source code: https://github.com/juno-rmks/utsuho/
- Issue tracker: https://github.com/juno-rmks/utsuho/issues/
This project is licensed under the Apache License 2.0. See LICENSE for details.