Tokenization
Format conversion by bconv involves a limited amount of natural-language processing (NLP), namely sentence splitting and word tokenization.
bconv's document model requires input text always to be divided into sentences.
If an input document does not encode sentence boundaries, bconv automatically applies sentence splitting to the input.
By default, bconv uses NLTK's PunktSentenceTokenizer, which performs well on standardized English texts such as scientific articles.
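To illustrate what a sentence splitter produces, the sketch below runs a plain PunktSentenceTokenizer over an invented sample string. Note that an untrained instance relies on Punkt's built-in heuristics; NLTK's pretrained English model (available via nltk.download('punkt')) is typically more robust on real text.

from nltk.tokenize.punkt import PunktSentenceTokenizer

# Minimal sketch: split a text into sentence spans (character offsets).
splitter = PunktSentenceTokenizer()
text = "The sample was sequenced. Results were inconclusive."
for start, end in splitter.span_tokenize(text):
    print((start, end), repr(text[start:end]))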
For some output formats (e.g. CoNLL), word tokenization is also necessary.
When needed, bconv tokenizes sentences with NLTK's WordPunctTokenizer, which identifies tokens with the regular expression \w+|[^\w\s]+.
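As a quick illustration of that regular expression (using an invented sample string), WordPunctTokenizer keeps alphanumeric runs and punctuation runs apart:

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize("IL-2 activates T-cells."))
# -> ['IL', '-', '2', 'activates', 'T', '-', 'cells', '.']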
bconv exposes a hook to override the default sentence splitter and word tokenizer.
By importing bconv.nlp.tokenize.TOKENIZER and assigning to TOKENIZER.word_tokenizer and TOKENIZER.sentence_splitter, custom implementations can be registered for use by the library:
import re
import bconv
from bconv.nlp.tokenize import TOKENIZER

class WhitespaceTokenizer:
    @staticmethod
    def span_tokenize(text):
        for match in re.finditer(r'\S+', text):
            yield match.span()

TOKENIZER.word_tokenizer = WhitespaceTokenizer()

# This will now use WhitespaceTokenizer for tokenizing the output:
bconv.dump(document, 'path/to/output.conll', fmt='conll')

For both sentence splitting and word tokenization, bconv expects an object with a span_tokenize method with the following signature:
span_tokenize(text: str) -> Iterable[Tuple[int, int]]

i.e. the method must accept a single str as input and return an iterable of start–end pairs (character offsets).
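As a further sketch of this contract, the hypothetical splitter below treats every non-empty line as one sentence (the class NewlineSentenceSplitter is invented for this example):

import re

from bconv.nlp.tokenize import TOKENIZER

class NewlineSentenceSplitter:
    # Hypothetical splitter: every non-empty line counts as one sentence.
    @staticmethod
    def span_tokenize(text):
        # Yield (start, end) offsets of each maximal run of non-newline characters.
        for match in re.finditer(r'[^\n]+', text):
            yield match.span()

TOKENIZER.sentence_splitter = NewlineSentenceSplitter()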
Word tokenization is performed lazily, i.e. only when needed.
Tokenization is usually triggered implicitly through actions such as iterating over a sentence, accessing its length (number of tokens), or even inspecting its repr form.
The result of tokenization is cached, i.e. updating the tokenizer has no effect on sentences that have already been tokenized (possibly implicitly).
Tokenization can be performed anew by explicitly calling the tokenize() method of a Sentence object.
import bconv
from bconv.nlp.tokenize import TOKENIZER

coll = bconv.load('example.json', fmt='bioc_json')
bconv.dump(coll, 'example-1.conll', fmt='conll')  # implicitly tokenizes all sentences

TOKENIZER.word_tokenizer = WhitespaceTokenizer()
bconv.dump(coll, 'example-2.conll', fmt='conll')  # identical to example-1: cached tokens are reused

for sentence in coll.units('sentence'):
    sentence.tokenize()  # explicitly re-tokenize with the new tokenizer
bconv.dump(coll, 'example-3.conll', fmt='conll')  # now uses WhitespaceTokenizer