Tokenization
Format conversion by bconv involves a limited amount of natural-language processing (NLP), namely sentence splitting and word tokenization.
bconv's document model requires input text always to be divided into sentences.
If an input document does not encode sentence boundaries, bconv automatically applies sentence splitting to the input.
By default, bconv uses NLTK's PunktSentenceTokenizer, which performs well on standardized English texts such as scientific articles.
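To illustrate what a sentence splitter produces, the sketch below runs a plain PunktSentenceTokenizer over an invented sample string. Note that an untrained instance relies on Punkt's built-in heuristics; NLTK's pretrained English model (available via nltk.download('punkt')) is typically more robust on real text.

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Minimal sketch: split a text into sentence spans (character offsets).
splitter = PunktSentenceTokenizer()
text = "The sample was sequenced. Results were inconclusive."
for start, end in splitter.span_tokenize(text):
    print((start, end), repr(text[start:end]))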
For some output formats (e.g. CoNLL), word tokenization is also necessary.
When needed, bconv tokenizes sentences with NLTK's WordPunctTokenizer, which identifies tokens with the regular expression \w+|[^\w\s]+.
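As a quick illustration of that regular expression (using an invented sample string), WordPunctTokenizer keeps alphanumeric runs and punctuation runs apart:

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize("IL-2 activates T-cells."))
# -> ['IL', '-', '2', 'activates', 'T', '-', 'cells', '.']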
bconv exposes a hook to override the default sentence splitter and word tokenizer.
By importing bconv.nlp.tokenize.TOKENIZER and assigning to TOKENIZER.word_tokenizer and TOKENIZER.sentence_splitter, custom implementations can be registered for use by the library:
import re
import bconv
from bconv.nlp.tokenize import TOKENIZER

class WhitespaceTokenizer:
    @staticmethod
    def span_tokenize(text):
        for match in re.finditer(r'\S+', text):
            yield match.span()

TOKENIZER.word_tokenizer = WhitespaceTokenizer()

# This will now use WhitespaceTokenizer for tokenizing the output:
bconv.dump(document, 'path/to/output.conll', fmt='conll')

For both sentence splitting and word tokenization, bconv expects an object with a span_tokenize method with the following signature:
span_tokenize(text: str) -> Iterable[Tuple[int, int]]

i.e. the method must accept a single str as input and return an iterable of start–end pairs (character offsets).
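As a further sketch of this contract, the hypothetical splitter below treats every non-empty line as one sentence (the class NewlineSentenceSplitter is invented for this example):

import re

from bconv.nlp.tokenize import TOKENIZER

class NewlineSentenceSplitter:
    # Hypothetical splitter: every non-empty line counts as one sentence.
    @staticmethod
    def span_tokenize(text):
        # Yield (start, end) offsets of each maximal run of non-newline characters.
        for match in re.finditer(r'[^\n]+', text):
            yield match.span()

TOKENIZER.sentence_splitter = NewlineSentenceSplitter()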
Word tokenization is performed lazily, i.e. only when needed.
Tokenization is usually triggered implicitly through actions such as iterating over a sentence, accessing its length (number of tokens), or even inspecting its repr form.
The result of tokenization is cached, i.e. updating the tokenizer has no effect on sentences that have already been tokenized (possibly implicitly).
Tokenization can be performed anew by explicitly calling the tokenize() method of a Sentence object.
import bconv
from bconv.nlp.tokenize import TOKENIZER

coll = bconv.load('example.json', fmt='bioc_json')
bconv.dump(coll, 'example-1.conll', fmt='conll')  # implicitly tokenizes all sentences

TOKENIZER.word_tokenizer = WhitespaceTokenizer()
bconv.dump(coll, 'example-2.conll', fmt='conll')  # identical to example-1: cached tokens are reused

for sentence in coll.units('sentence'):
    sentence.tokenize()  # explicitly re-tokenize with the new tokenizer
bconv.dump(coll, 'example-3.conll', fmt='conll')  # now uses WhitespaceTokenizer