Multiple spaces support

melisa-qordoba edited this page Sep 23, 2020 · 1 revision

Sometimes text input includes unwanted characters, such as:

ex. "Here␣is␣a\u180E\u200Bproblem."

spaCy treats only the most common ASCII whitespace character, \u0020, as a whitespace separator.

You might want to map nonstandard characters to spaces before processing:

"Here␣is␣a\u180E\u200Bproblem." -> "Here␣is␣a␣␣problem."
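A minimal sketch of such a mapping, using only the Python standard library. The character set and the helper name normalize_spaces are assumptions made for illustration; extend the set to cover whatever characters appear in your input. Each character is replaced one-to-one, so character offsets stay the same.

```python
# Characters to treat as whitespace. This set is an assumption; extend as needed.
NONSTANDARD_SPACES = "\u180E\u200B"

# One-to-one translation table: each listed character becomes a single U+0020.
_SPACE_TABLE = str.maketrans({ch: " " for ch in NONSTANDARD_SPACES})


def normalize_spaces(text: str) -> str:
    """Replace each nonstandard character with one space (hypothetical helper)."""
    return text.translate(_SPACE_TABLE)


raw = "Here is a\u180E\u200Bproblem."
clean = normalize_spaces(raw)

# The text length is unchanged, so span character ranges still line up.
assert len(clean) == len(raw)
```

The result contains two consecutive spaces between "a" and "problem".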

Getting rid of the resulting multiple spaces is not always possible, however, because that would change the character ranges of spans. Since extra spaces are grouped into one token with the property IS_SPACE: True, patterns in match_dict should include extra whitespace tokens:

ex.

"patterns": [
                {
                    "LOWER": "a"
                },
                {
                    "IS_SPACE": True, "OP": "?"
                },
                {
                    "LOWER": "problem"
                }
            ]

To keep the preceded_by... and succeeded_by... match hooks working, add whitespace tokens before and after each pattern as well. To add whitespace tokens to all patterns in your match_dict automatically, use:

r_matcher = ReplaceMatcher(nlp, match_dict, allow_multiple_whitespaces=True)

By default allow_multiple_whitespaces is set to False.
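
To illustrate what adding whitespace tokens to a pattern amounts to, here is a rough sketch in plain Python. It is not the library's implementation; the helper name add_whitespace_tokens is hypothetical, and it assumes the pattern does not already contain whitespace tokens.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]

# Optional token that matches one group of extra whitespace, if present.
OPTIONAL_SPACE: Token = {"IS_SPACE": True, "OP": "?"}


def add_whitespace_tokens(pattern: List[Token]) -> List[Token]:
    """Return a copy of `pattern` with an optional whitespace token
    before, between, and after the original tokens (hypothetical helper)."""
    padded: List[Token] = [dict(OPTIONAL_SPACE)]
    for token in pattern:
        padded.append(dict(token))
        padded.append(dict(OPTIONAL_SPACE))
    return padded


pattern = [{"LOWER": "a"}, {"LOWER": "problem"}]
padded = add_whitespace_tokens(pattern)
```

Here padded holds five tokens: an optional whitespace token, the token for "a", another optional whitespace token, the token for "problem", and a final optional whitespace token. The original pattern list is left unmodified.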
