Doc Model
For converting documents and annotations from one format to another, bconv stores the information in a custom interchange representation.
This representation is an attempt at a unified interface compatible with all formats.
However, every format has different limitations and requirements, which cannot always easily be translated into a general representation.
In a simple A-to-B format conversion, users can ignore bconv's document representation.
In some cases, however, accessing the Collection or Document objects may be useful, such as:
- manipulating data during conversion, eg. to fix incompatibilities or remove unwanted elements;
- using the loader only, eg. for reading annotated data into vectors for training a model;
- using the formatter only, eg. to export automatically generated annotations.
Load a collection in BioC JSON, prefix the concept identifiers with "MESH:", then store every document in Brat stand-off format.

    import bconv

    coll = bconv.load('path/to/collection.json', 'bioc_json')
    for entity in coll.iter_entities():
        # Add a prefix to all IDs.
        entity.metadata['cui'] = 'MESH:{}'.format(entity.metadata['cui'])
    for doc in coll:
        # Store annotations and text in separate files.
        bconv.dump(doc, '{}.ann'.format(doc.id), fmt='brat', cui='cui')
        bconv.dump(doc, '{}.txt'.format(doc.id), fmt='txt')

Load a collection of annotated documents given in BioC JSON format and create two numpy arrays, a matrix of token indices and a vector of binary sentence-level labels.

    import itertools as it
    from collections import defaultdict

    import numpy as np
    import bconv

    max_len = 100  # longer sentences are truncated to fit the matrix
    coll = bconv.load('path/to/training_data.json', 'bioc_json')
    n_sents = sum(1 for _ in coll.units('sentence'))
    sentences = np.zeros((n_sents, max_len), dtype=int)  # matrix of token indices
    labels = np.zeros(n_sents, dtype=int)                # vector of sentence labels
    # Start counting at 1 so that index 0 stays reserved for padding.
    vocabulary = defaultdict(it.count(1).__next__)
    for i, sentence in enumerate(coll.units('sentence')):
        # Note: you probably want to use the tokenizer and vocabulary associated
        # with your word embeddings instead.
        sentence.tokenize()
        tokens = [vocabulary[tok.text] for tok in sentence][:max_len]
        sentences[i, :len(tokens)] = tokens
        labels[i] = any(sentence.iter_entities())  # True if there is any entity

Note: when vectorizing token-level annotations (eg. for training an NER system) it is probably easier to process the tabular data in CoNLL format rather than working with bconv's document objects.
Construct Document objects to be passed to bconv.dump.

    import re
    from pathlib import Path

    import bconv

    sequence = re.compile(r'\b((?P<DNA>[GACT]{5,})|(?P<RNA>[GACU]{5,}))\b')

    def nucleotide_sequences(text):
        """Regex-based tagger for literal DNA/RNA sequences."""
        for match in sequence.finditer(text):
            yield bconv.Entity(
                id=None,
                text=match.group(),
                spans=[match.span()],  # must be a list of start/end pairs
                type='DNA' if match.group('DNA') else 'RNA')

    def make_document(path):
        with open(path, encoding='utf8') as f:
            title = f.readline()  # assume title is the first line
            body = f.read()
        doc = bconv.Document(path.stem)
        for text, type_ in ((title, 'title'), (body, 'body')):
            doc.add_section(
                type_, text, entities=list(nucleotide_sequences(text)))
        return doc

    paths = Path('path/to/examples/').glob('*.txt')
    coll = bconv.Collection.from_iterable(map(make_document, paths), id='example')
    bconv.dump(coll, 'examples.conll')

The document representation is implemented as a hierarchy of units corresponding to Python classes as follows:
Collection
  Document
    Section
      Sentence
        Token
Collections are always organized into documents, which are recursively divided into sections, sentences, and tokens.
Entity annotations are stored in Entity objects, which are always anchored at the sentence level (though they can be iterated over from any higher level).
Relation objects can be anchored at document, section, or sentence level.
The following methods and attributes are shared by the Collection, Document, Section and Sentence units (with some exceptions as indicated).
TextUnit.__iter__() -> Iterator[SubUnit]
TextUnit.__len__() -> int
TextUnit.__getitem__(index: int) -> SubUnit
Immutable-sequence methods. Every unit is a sequence of subunits from the next-lower level, ie. a Collection is a sequence of Document objects, every Document is a sequence of Section objects etc. The sequences support iteration (for doc in collection), length check (len(collection)) and access by index (collection[1]).
TextUnit.units(level: str) -> Iterator[LevelType]
Iterate over units of the specified level, a case-insensitive string naming the desired unit type, eg. "sentence". The level can be the same or lower than that of this unit: collection.units("collection") yields just collection, whereas document.units("sentence") iterates over all sentences of all sections in document.
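The sequence behaviour and the units() traversal described above can be pictured with a small toy model. This is only a sketch: the Unit class below is a hypothetical stand-in written for illustration and is not bconv's implementation.

```python
class Unit:
    """Toy unit: an immutable sequence of subunits from the next-lower level."""

    def __init__(self, level, subunits=(), text=""):
        self.level = level               # eg. "document" or "sentence"
        self._subunits = tuple(subunits)
        self.text = text                 # only set for leaf units in this toy

    def __iter__(self):
        return iter(self._subunits)

    def __len__(self):
        return len(self._subunits)

    def __getitem__(self, index):
        return self._subunits[index]

    def units(self, level):
        """Yield all units of the given level at or below this unit."""
        if level.lower() == self.level:
            yield self
        else:
            for sub in self:
                yield from sub.units(level)


def sentence(text):
    return Unit("sentence", text=text)


doc = Unit("document", [
    Unit("section", [sentence("A title.")]),
    Unit("section", [sentence("First."), sentence("Second.")]),
])
coll = Unit("collection", [doc])

print(len(coll), len(doc), len(doc[1]))           # 1 2 2
print([s.text for s in doc.units("Sentence")])    # level name is case-insensitive
print(list(coll.units("collection")) == [coll])   # True
```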
TextUnit.add_entities(entities: Iterable[Entity], offset: int = None)
Add entity annotations to this unit. If offset is None (the default), character offsets (spans) are recalculated relative to the beginning of the document. If the spans are already relative to the document origin, specify offset=0. Since entities are always anchored at the sentence level, the target sentence is identified based on the entity spans. Note: this method is not defined for the Collection unit.
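The two steps involved, shifting spans to the document origin and finding the enclosing sentence, might look roughly like the sketch below. It rests on an assumption about the offset semantics: None is read as "spans are relative to the unit", so they are shifted by the unit's start, while any other value is added as given. The function names and sentence boundaries are hypothetical and not part of bconv.

```python
from bisect import bisect_right

# Hypothetical sentence boundaries (document-relative start/end offsets).
SENTENCES = [(0, 20), (21, 45), (46, 80)]


def to_document_spans(spans, unit_start, offset=None):
    """Shift spans so that they are relative to the document start.

    Assumption: offset=None shifts by the unit's own start offset,
    whereas offset=0 leaves the spans unchanged.
    """
    shift = unit_start if offset is None else offset
    return [(start + shift, end + shift) for start, end in spans]


def find_sentence(spans, sentences=SENTENCES):
    """Return the index of the sentence that encloses all spans."""
    starts = [start for start, _ in sentences]
    i = bisect_right(starts, spans[0][0]) - 1
    if i < 0 or spans[-1][1] > sentences[i][1]:
        raise ValueError("entity does not fit into a single sentence")
    return i


# An entity given relative to a unit that starts at document offset 21.
doc_spans = to_document_spans([(3, 8)], unit_start=21)
print(doc_spans, find_sentence(doc_spans))   # [(24, 29)] 1
```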
TextUnit.iter_entities(
    split_discontinuous=False, avoid_gaps=None, avoid_overlaps=None
) -> Iterator[Entity]
Iterate over all Entity objects at this unit and all its subunits. Discontinuous and overlapping annotations may be flattened on the fly (non-permanently) with the avoid_gaps and avoid_overlaps parameters (see Entity-Flattening). The legacy flag split_discontinuous is kept for backwards compatibility; setting it to True is equivalent to specifying avoid_gaps="split". The entities are yielded in occurrence order; sorting is applied after flattening.
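As a rough illustration of non-permanent splitting, the sketch below turns every contiguous span of a discontinuous annotation into its own single-span copy and sorts the result afterwards. This is one plausible reading of the "split" strategy, not bconv's code; the authoritative description is on the Entity-Flattening page. Entities are plain dicts here, and the helper names are hypothetical.

```python
def split_entity(entity, doc_text):
    """Yield one single-span copy of `entity` per contiguous span."""
    for start, end in entity["spans"]:
        # {**entity, ...} builds a new dict, so the original stays untouched.
        yield {**entity, "text": doc_text[start:end], "spans": [(start, end)]}


def iter_split(entities, doc_text):
    """Split every entity, then sort the parts by position."""
    parts = [part for ent in entities for part in split_entity(ent, doc_text)]
    return sorted(parts, key=lambda ent: ent["spans"][0])


doc_text = "breast and ovarian cancer"
entities = [
    {"text": "ovarian cancer", "spans": [(11, 25)], "type": "disease"},
    {"text": "breast ... cancer", "spans": [(0, 6), (19, 25)], "type": "disease"},
]
flat = iter_split(entities, doc_text)
print([ent["text"] for ent in flat])  # ['breast', 'ovarian cancer', 'cancer']
```

Because the parts are copies, the stored annotations keep their original spans after the iteration.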
TextUnit.iter_relations() -> Iterator[Relation]
Iterate over all Relation objects at this unit and all its subunits. The iteration order is deterministic, but not connected to the relation members' position in the document.
TextUnit.text: str
Read-only attribute for the entire text of this unit in a single string.
TextUnit.metadata: Dict[str, str]
Read/write attribute for arbitrary metadata. Note that some keys are interpreted specially in some formats, eg. collection-level "source", "date", and "key" in BioC. In general, however, most formats ignore metadata.
TextUnit.relations: Iterable[Relation]
Read/write attribute for Relation objects anchored at this unit. Unlike iter_relations(), this does not touch relations from subunits. Note: this attribute is not defined for the Collection unit.
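The difference between the relations attribute and iter_relations() can be sketched with a toy class. The Node class is a hypothetical stand-in; in particular, yielding a unit's own relations before those of its subunits is an assumption made for this sketch, since the text above only promises a deterministic order.

```python
class Node:
    """Toy unit holding its own relations plus a list of subunits."""

    def __init__(self, relations=(), subunits=()):
        self.relations = list(relations)   # anchored at this unit only
        self.subunits = list(subunits)

    def iter_relations(self):
        """Yield own relations, then those of all subunits (recursively)."""
        yield from self.relations
        for sub in self.subunits:
            yield from sub.iter_relations()


sentence = Node(relations=["R3"])
section = Node(relations=["R2"], subunits=[sentence])
document = Node(relations=["R1"], subunits=[section])

print(document.relations)               # ['R1']
print(list(document.iter_relations()))  # ['R1', 'R2', 'R3']
```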
Collection(id: int|str, filename: str|Path = None, **metadata)
Constructor for an empty collection that may be populated later.
Collection.from_iterable(
    documents: Iterable[Document], id: int|str, filename: str|Path = None
) -> Collection
Classmethod for creating a collection from an iterable of Document objects. More documents may be appended later.
Collection.add_document(document: Document) -> Document
Append a Document object to the end of this collection.
Collection.get_document(id: int|str) -> Document
Retrieve a document by its ID. If IDs are non-unique, the last-added document will be returned. If there is no document with the given ID, a KeyError is raised.
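The lookup rules (last-added document wins, KeyError for unknown IDs) match what a plain dict index alongside the document list would give. The DocIndex class below is a hypothetical sketch of that idea and says nothing about how bconv stores documents internally.

```python
class DocIndex:
    """Toy container: keeps every document, indexes the latest one per ID."""

    def __init__(self):
        self._docs = []
        self._by_id = {}

    def __len__(self):
        return len(self._docs)

    def add_document(self, doc_id, doc):
        self._docs.append(doc)
        self._by_id[doc_id] = doc    # a repeated ID replaces the index entry
        return doc

    def get_document(self, doc_id):
        return self._by_id[doc_id]   # raises KeyError for unknown IDs


index = DocIndex()
index.add_document("a", "first version")
index.add_document("b", "other document")
index.add_document("a", "second version")

print(len(index))                 # 3
print(index.get_document("a"))    # second version
```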
Collection.id: int|str
Collection.filename: Optional[str|Path]
Read/write attributes corresponding to the constructor arguments.
Document(id: int|str, filename: str|Path = None, **metadata)
Constructor for an empty document to be populated later.
Document.add_section(
    type: str,
    text: str|Iterable[str],
    offset: int = None,
    entities: Sequence[Entity] = (),
    entity_offset: int = None,
    **metadata) -> Section
Add a section to this document.
The section type is something like "Title", "Abstract", "Introduction" and is stored in the section's metadata.
The text can be given in two forms: if it is a single str, sentence splitting is performed by bconv (cf. the tokenization documentation). However, if the text has already been split into sentences, an iterable of str may be provided.
The start offset of the new section can be set through offset; if offset is None, it is set by bconv based on the length of the preceding sections.
A sequence of Entity objects can be provided as well, which will be added to the corresponding sentences.
The entity_offset argument works the same as offset in add_entities(): use it if the entity spans are not relative to the beginning of the added section text.
Arbitrary key-value pairs can be passed as metadata.
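To make the automatic start offset concrete, here is a minimal sketch that lays out sections one after another. It assumes that a section without an explicit offset starts exactly where the preceding one ended, with no separator in between; bconv's actual rule may differ. The function layout_sections is hypothetical.

```python
def layout_sections(sections):
    """Compute a (start, end) range for each section.

    `sections` is a list of (text, offset) pairs; offset may be None.
    Assumption: without an explicit offset, a section starts right
    where the preceding one ended.
    """
    ranges, cursor = [], 0
    for text, offset in sections:
        start = cursor if offset is None else offset
        end = start + len(text)
        ranges.append((start, end))
        cursor = end
    return ranges


ranges = layout_sections([
    ("Title line\n", None),   # 11 characters
    ("Body text.", None),     # 10 characters
    ("Appendix", 100),        # explicit offset overrides the running total
])
print(ranges)   # [(0, 11), (11, 21), (100, 108)]
```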
Document.id: int|str
Document.filename: Optional[str|Path]
Read/write attributes corresponding to the constructor arguments.
The Section unit represents any division of a document, such as a paragraph or an article section.
Do not directly instantiate Section objects from their constructor, but use Document.add_section() instead.
Section.start: int
Section.end: int
Read-only attributes for the text range (character offsets) relative to the document start.
Sentences are automatically created from sections through sentence-boundary detection (cf. the tokenization documentation).
To create Sentence units from a list of strings, pass it to Document.add_section() as the text parameter.
Do not directly instantiate Sentence objects from their constructor.
Sentence.start: int
Sentence.end: int
Read-only attributes for the text range (character offsets) relative to the document start.
Tokens are the smallest textual units (think: a word) created by bconv by splitting sentences at whitespace and before/after punctuation characters (cf. the tokenization documentation).
Token units are minimal objects with a few attributes but none of the methods of the other units.
Token.text: str
Read-only attribute for the value of this token as a str.
Token.start: int
Token.end: int
Read-only attributes for the text range (character offsets) relative to the document start.
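For intuition, splitting at whitespace and around punctuation can be approximated with a short regular expression. The sketch below is only an approximation for illustration and not bconv's tokenizer (see the tokenization documentation for the real rules); the tokenize function and its sentence_start parameter are hypothetical.

```python
import re

# A run of word characters, or a single non-space, non-word character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence_text, sentence_start=0):
    """Yield (text, start, end) triples with document-relative offsets."""
    for match in TOKEN.finditer(sentence_text):
        yield (match.group(),
               sentence_start + match.start(),
               sentence_start + match.end())


text = "IL-2 binds (weakly)."
tokens = list(tokenize(text, sentence_start=10))
print([tok for tok, _, _ in tokens])
# ['IL', '-', '2', 'binds', '(', 'weakly', ')', '.']
print(tokens[0], tokens[-1])   # ('IL', 10, 12) ('.', 29, 30)
```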
Entity annotations assign metadata (eg. a concept identifier or type) to a textual expression. The textual expression may contain gaps and be arbitrarily long, but it must not cross sentence boundaries.
Entity(
    id: Optional[int|str],
    text: str,
    spans: List[Tuple[int, int]],
    meta: Optional[Dict[str, str]],
    **metadata)
Constructor for an entity annotation.
The id is an identifier for each particular annotation instance within the enclosing document (i.e. it is not a concept identifier/reference to a controlled vocabulary). Its value is used by some output formats (eg. BioC), but ignored by others (eg. Brat, which requires consecutive IDs prefixed with "T"). If relations are present, the entity ID should be unique throughout the document.
The text value must exactly match the annotated span in the document (for multi-span entities, gap symbols like "..." are permitted).
The spans must be a sequence of start–end pairs, even for single-span entities.
Any additional information, such as concept identifier or entity type, can be passed as a mapping of key–value pairs to the meta parameter or directly as keyword arguments. bconv is agnostic wrt. the extent and spelling of metadata fields; however, many output formats may require some configuration to get the desired result (eg. the meta parameter for PubTator).
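Before constructing entities from external data, it can help to check that the text agrees with the spans. The helper below is a hypothetical, deliberately lenient check written for this page and not a validation routine of bconv: between the parts of a multi-span entity it accepts whitespace and an optional "..." (or "…") gap symbol.

```python
import re

GAP = r"\s*(?:\.\.\.|…)?\s*"   # optional gap symbol, surrounded by whitespace


def text_matches(entity_text, doc_text, spans):
    """Check that entity_text agrees with the document text at the spans."""
    parts = [re.escape(doc_text[start:end]) for start, end in spans]
    return re.fullmatch(GAP.join(parts), entity_text) is not None


doc_text = "breast and ovarian cancer"

print(text_matches("ovarian cancer", doc_text, [(11, 25)]))              # True
print(text_matches("breast ... cancer", doc_text, [(0, 6), (19, 25)]))   # True
print(text_matches("breast tumor", doc_text, [(0, 6), (19, 25)]))        # False
```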
Entity.id: Optional[int|str]
Entity.text: str
Entity.spans: List[Tuple[int, int]]
Entity.metadata: Dict[str, str]
Read/write attributes corresponding to the constructor arguments. Note: do not alter an entity's spans value once it has been added to a document.
Entity.start: int
Entity.end: int
Read-only attributes for the outer boundaries of an entity, ie. entity.spans[0][0] and entity.spans[-1][1].
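Read-only boundaries derived from the spans can be modelled with properties, as in the sketch below. The Annotation class is a hypothetical stand-in (not bconv.Entity) and assumes that the spans are sorted by position.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for an entity with derived boundaries."""
    text: str
    spans: list   # list of (start, end) pairs, assumed to be sorted

    @property
    def start(self):
        return self.spans[0][0]    # start of the first span

    @property
    def end(self):
        return self.spans[-1][1]   # end of the last span


ann = Annotation("breast ... cancer", [(0, 6), (19, 25)])
print(ann.start, ann.end)   # 0 25
```

Since the properties have no setter, assigning to ann.start raises an AttributeError.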
Relations describe a connection between a number of members, which are either entities or other relations.
The number of members is not restricted by bconv; as in BioC, relations can have more than two members or even be unary or member-less.
However, some formats are more restrictive; eg. PubAnnotation JSON only allows binary relations.
Relation(id: Optional[int|str], members: Iterable[Tuple[int|str, str]], **metadata)
Constructor for a relation. As for entities, the ID is optional in general, but needs to be defined and unique if this relation is referenced in another relation. The relation members are expected as an iterable of <RefID, Role> pairs.
Relation.add_member(refid: int|str, role: str) -> RelationMember
Add a member to this relation. The reference ID refid must refer to an existing entity or another relation. The role is a free-form string describing the function of this member within the relation (eg. "cause").
Relation.__iter__() -> Iterator[RelationMember]
Relation.__len__() -> int
Relation.__getitem__(index: int) -> RelationMember
Immutable-sequence methods: a relation is a sequence of RelationMember objects.
Relation.id: Optional[int|str]
Read/write attribute for the relation ID.
Relation.metadata: Dict[str, str]
Read/write attribute for arbitrary key–value pairs. As with the text units, metadata are ignored by most formats.
RelationMember objects have two attributes, refid and role.
RelationMember.refid: int|str
Read-only attribute referencing an entity or another relation by ID.
RelationMember.role: str
Read-only attribute describing this member's role in the relation.
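Because members refer to entities or other relations by ID only, consumers usually have to resolve the references themselves. The sketch below does this with plain dicts and <RefID, Role> pairs; all IDs, roles and helper names are made up for illustration and nothing here is bconv API. It also includes the kind of arity check that is useful before exporting to a format that only allows binary relations.

```python
# Hypothetical annotations: entity texts by ID, relation members by ID.
entities = {"T1": "aspirin", "T2": "headache", "T3": "nausea"}
relations = {
    "R1": [("T1", "drug"), ("T2", "indication")],
    "R2": [("T1", "drug"), ("T3", "side-effect")],
    "R3": [("R1", "benefit"), ("R2", "risk")],   # a relation of relations
}


def resolve(refid, seen=()):
    """Return the entity text, or a list of (role, value) pairs for a relation."""
    if refid in entities:
        return entities[refid]
    if refid in seen:
        raise ValueError("circular reference: {}".format(refid))
    members = relations[refid]   # KeyError if the reference is dangling
    return [(role, resolve(member, seen + (refid,))) for member, role in members]


def is_binary(refid):
    """True if the relation has exactly two members."""
    return len(relations[refid]) == 2


print(resolve("R1"))   # [('drug', 'aspirin'), ('indication', 'headache')]
print(resolve("R3"))
print(all(is_binary(rid) for rid in relations))   # True
```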