Introduction
============
A Python package to determine Unicode text segmentations.
You can see the full documentation including the package reference on
http://uniseg-python.readthedocs.io.
Features
========
This package provides:
- Functions to get Unicode Character Database (UCD) properties related to
  text segmentation.
- Functions to determine segmentation boundaries of Unicode strings.
- Classes that help implement Unicode-aware text wrapping in both console
  (monospace) and graphical (monospace / proportional) font environments.
Supported segmentations are:

*code point*
    A `code point <http://www.unicode.org/glossary/#code_point>`_ is *"any
    value in the Unicode codespace."* It is the basic unit for processing
    Unicode strings.

*grapheme cluster*
    A `grapheme cluster <http://www.unicode.org/glossary/#grapheme_cluster>`_
    approximately represents a *"user-perceived character."* It may be made
    up of one or more Unicode code points; e.g. "G" + *acute-accent* is a
    single *user-perceived character*.

*word break*
    Word boundaries are a familiar segmentation in many common text
    operations, e.g. as the unit for text highlighting or cursor jumping.
    Note that in some languages *words* cannot be determined from spaces or
    punctuation alone. Languages such as Thai or Japanese require
    dictionaries to determine appropriate word boundaries. The package
    provides only a simple word breaking implementation that is based on
    scripts and does not use any dictionaries, but it also provides ways to
    customize the default behavior.

*sentence break*
    Sentence breaks are also common in text processing, but they are more
    contextual and less formal. The sentence breaking implementation in the
    package follows the rules specified in UAX #29 (UAX: Unicode Standard
    Annex), so it is simple and formal too. It should still be useful in
    some cases.

*line break*
    Line breaking is one of the key features of this package. It is
    important for general text presentation in both CLI and GUI
    applications.
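To make two of the points above concrete, here is a minimal sketch that uses only the standard library, not this package's API. The helper ``naive_clusters`` is hypothetical and merely attaches combining marks to the preceding base character; it is a rough approximation for illustration, not the UAX #29 algorithm. The second part shows why whitespace alone cannot find word boundaries in Japanese.

```python
import unicodedata


def naive_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only: real grapheme cluster
    boundaries (UAX #29) involve many more rules than this.
    """
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "G" + COMBINING ACUTE ACCENT + "o": three code points,
# but only two user-perceived characters.
text = u"G\u0301o"
code_points = list(text)
clusters = naive_clusters(text)

# Whitespace splitting separates English words, but Japanese is written
# without spaces, so the whole sentence comes back as a single chunk.
english_words = u"The quick brown fox".split()
# The escapes below spell a short Japanese sentence (7 characters, no spaces).
japanese_chunks = u"\u543e\u8f29\u306f\u732b\u3067\u3042\u308b".split()
```

Comparing ``len(code_points)`` with ``len(clusters)``, and ``len(english_words)`` with ``len(japanese_chunks)``, shows the gap that proper segmentation rules are meant to close.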
Requirements
============
- Python 2.7 / 3.4 / 3.5 / 3.6
Download
========
Source / binary distributions (PyPI)
https://pypi.python.org/pypi/uniseg
All sources and build tools etc. (Bitbucket)
https://bitbucket.org/emptypage/uniseg-python
Install
=======
Just type::

    % pip install uniseg

or download the archive and run::

    % python setup.py install
Changes
=======
0.7.1 (2015-05-02)
- CHANGE: wrap.Wrapper.wrap(): returns the count of lines now.
- Separate LICENSE from README.txt for packaging-related reasons in some
  environments.
0.7.0 (2015-02-27)
- CHANGE: Stopped gathering all submodules' members into the top-level
  ``uniseg`` module.
- CHANGE: Reformed the ``uniseg.wrap`` module and sample scripts.
- Maintained the ``uniseg.wrap`` module; sample scripts work again.
0.6.4 (2015-02-10)
- Add ``uniseg-dbpath`` console command, which just prints the path of
  ``ucd.sqlite3``.
- Include sample scripts under the package's subdirectory.
0.6.3 (2015-01-25)
- Python 3.4
- Support modern setuptools, pip and wheel.
0.6.2 (2013-06-09)
- Python 3.3
0.6.1 (2013-06-08)
- Unicode 6.2.0
References
==========
*UAX #14: Unicode Line Breaking Algorithm* (6.2.0)
http://www.unicode.org/reports/tr14/tr14-30.html
*UAX #29: Unicode Text Segmentation* (6.2.0)
http://www.unicode.org/reports/tr29/tr29-21.html
Related / Similar Projects
==========================
`PyICU <https://pypi.python.org/pypi/PyICU>`_ - Python extension wrapping the ICU C++ API
    *PyICU* is a Python extension wrapping the International Components for
    Unicode (ICU) library. It also provides text segmentation support, which
    is richer and faster than ours. Because PyICU is an extension library,
    it requires the ICU dynamic library (binary files) and a compiler to
    build the extension. Our package is written in pure Python; it runs
    slower but is more portable.
`pytextseg <https://pypi.python.org/pypi/pytextseg>`_ - Python module for text segmentation
    The *pytextseg* package has a goal very similar to ours; it provides
    Unicode-aware text wrapping features. Its authors designed and use
    their own string class (not the built-in ``unicode`` / ``str`` classes)
    for that purpose. We use ordinary built-in ``unicode`` / ``str`` objects
    for text processing in our modules.