Multiple spaces support

Sometimes text input includes unwanted characters, such as:

non-printable Unicode characters (see: https://www.soscisurvey.de/tools/view-chars.php)
non-standard whitespace characters (see: https://en.wikipedia.org/wiki/Whitespace_character)

ex. "Here␣is␣a\u180E\u200Bproblem."

spaCy treats only the most common ASCII whitespace character, \u0020, as a whitespace separator.

You might want to map such nonstandard characters to plain spaces before processing:

"Here␣is␣a\u180E\u200Bproblem." -> "Here␣is␣a␣␣problem."
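The mapping above could be sketched roughly as follows. This is only an illustration: normalize_spaces and the character set NONSTANDARD are hypothetical names, not part of the library, and the set is not exhaustive. Each listed character is replaced by exactly one U+0020, so the string length and all character offsets are preserved.

```python
# Hypothetical helper (an assumption, not part of the library described here).
# The character set is illustrative and deliberately not exhaustive.
NONSTANDARD = (
    [0x00A0, 0x180E, 0x202F, 0x205F, 0x3000, 0xFEFF]
    + list(range(0x2000, 0x200C))  # U+2000 .. U+200B
)

# str.translate expects a mapping from code points to replacement strings.
TABLE = {code_point: " " for code_point in NONSTANDARD}


def normalize_spaces(text: str) -> str:
    """Replace each listed character with one plain space (one-for-one)."""
    return text.translate(TABLE)


text = "Here is a\u180E\u200Bproblem."
clean = normalize_spaces(text)

# The replacement is one-for-one, so span character ranges stay valid.
assert len(clean) == len(text)
```

An explicit table is used here instead of str.isspace() because characters such as \u180E and \u200B are not classified as whitespace by Python.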

Collapsing the resulting multiple spaces is not always possible, however, because it would change the character ranges of spans. Since consecutive extra spaces are grouped into a single token with the property IS_SPACE: True, patterns in match_dict should include extra whitespace tokens:

ex.

"patterns": [
    {"LOWER": "a"},
    {"IS_SPACE": True, "OP": "?"},
    {"LOWER": "problem"}
]

To keep the preceded_by... and succeeded_by... match hooks working, also add whitespace tokens before and after each pattern. To add whitespace tokens to all patterns in your match_dict automatically, use:

r_matcher = ReplaceMatcher(nlp, match_dict, allow_multiple_whitespaces=True)

By default allow_multiple_whitespaces is set to False.
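To make the idea concrete, here is a minimal sketch of how optional whitespace tokens could be added before, between, and after the tokens of a single pattern. The helper add_optional_spaces is hypothetical and written only for illustration; what ReplaceMatcher actually does internally when allow_multiple_whitespaces=True may differ.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]

# Optional whitespace token, as used in the pattern example above.
OPTIONAL_SPACE: Token = {"IS_SPACE": True, "OP": "?"}


def add_optional_spaces(pattern: List[Token]) -> List[Token]:
    """Hypothetical helper: return a padded copy of `pattern`.

    An optional whitespace token is placed before the first token,
    between every pair of tokens, and after the last token.
    """
    padded: List[Token] = [dict(OPTIONAL_SPACE)]
    for token in pattern:
        padded.append(dict(token))           # copy so the input is not mutated
        padded.append(dict(OPTIONAL_SPACE))
    return padded


pattern = [{"LOWER": "a"}, {"LOWER": "problem"}]
padded = add_optional_spaces(pattern)
```

Because every added token carries "OP": "?", the padded pattern should still match text that contains no extra spaces.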
