PubAnnotation
Lenz Furrer edited this page May 24, 2021 · 9 revisions
PubAnnotation is a repository for publicly hosting annotations over biomedical documents. The platform accepts submissions in a JSON format that contains both text and annotations.
`bconv` provides two loaders and two formatters:
- `pubanno_json` reads/produces an individual JSON file for a single document.
- `pubanno_json.tgz` reads/creates a compressed archive containing multiple JSON files for a document or collection.
{
"sourceid": "354896",
"sourcedb": "PubMed",
"text": "Lidocaine-induced cardiac asystole.\n",
"denotations": [
{
"id": "T2",
"span": {
"begin": 18,
"end": 34
},
"obj": "Disease"
}
]
}

The PubAnnotation website has a description of the basic annotation format and the instructions for representing more complex documents.
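To make the span offsets in the example record concrete, here is a minimal sketch using plain Python. It assumes, as the example suggests, that `begin` is inclusive and `end` is exclusive. The helper `covered_text` is a hypothetical name for illustration and is not part of `bconv`.

```python
# The example record from above, written as a Python dict.
record = {
    "sourceid": "354896",
    "sourcedb": "PubMed",
    "text": "Lidocaine-induced cardiac asystole.\n",
    "denotations": [
        {"id": "T2", "span": {"begin": 18, "end": 34}, "obj": "Disease"},
    ],
}


def covered_text(record):
    """Map each denotation ID to the stretch of text its span covers."""
    text = record["text"]
    return {
        den["id"]: text[den["span"]["begin"]:den["span"]["end"]]
        for den in record["denotations"]
    }


print(covered_text(record))  # {'T2': 'cardiac asystole'}
```

Python strings are indexed by code point, so plain slicing is enough here.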
- Document structure: The `pubanno_json` format is designed primarily for abstracts. Full-text documents can be represented, but any document-internal structure is lost. `bconv` also allows exporting collections with `pubanno_json`, but the resulting JSON (an array of document objects) is not accepted by PubAnnotation, and cannot even be loaded directly by `bconv`. The `pubanno_json.tgz` format, however, supports multi-document collections and also preserves section boundaries.
- Metadata: The format records the document ID with the key `sourceid`. The `sourcedb` option allows specifying the resource that defines the document ID (e.g. PubMed). In the `pubanno_json.tgz` format, sections are preserved and enumerated, but their type is not stored.
- Entity annotations: Annotations have an ID (counter), start/end offsets, and an `obj` attribute, which is typically the entity type. Additional attributes (e.g. the concept ID) are stored in attribute annotations.
- Whitespace: Whitespace is preserved.
- Offsets: Entity offsets are calculated as Unicode codepoint units.
- Discontinuous spans: Discontinuous spans are represented with PubAnnotation's bagging model.
- Relations/events: PubAnnotation only supports binary relations. When serialising, `bconv` interprets the first and second relation member as the `subj` and `obj` attribute, respectively, and the `type` entry of the relation metadata as the predicate (`pred`). Other metadata as well as the `role` value of relation members are ignored. In PubAnnotation, complex relations with more than two members can be represented through nesting; however, `bconv` does not attempt an automatic conversion and simply raises an exception if the arity is different from 2.
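The relation mapping described in the last item can be sketched as follows. This is an illustration in plain Python, not `bconv`'s own code: the function `relation_to_pubanno`, the representation of members as `(entity_id, role)` pairs, the `id` key, and the choice of `ValueError` are all assumptions made for the example.

```python
def relation_to_pubanno(rel_id, members, metadata):
    """Serialise a binary relation as a PubAnnotation-style dict.

    members:  ordered list of (entity_id, role) pairs
    metadata: dict; only its 'type' entry is used
    """
    if len(members) != 2:
        # No automatic nesting of n-ary relations: refuse them.
        raise ValueError(f"expected 2 relation members, got {len(members)}")
    (subj, _), (obj, _) = members  # role values are dropped
    return {"id": rel_id, "pred": metadata["type"], "subj": subj, "obj": obj}


binary = relation_to_pubanno(
    "R1",
    [("T1", "cause"), ("T2", "theme")],
    {"type": "induces", "score": 0.9},  # 'score' is ignored
)
print(binary)  # {'id': 'R1', 'pred': 'induces', 'subj': 'T1', 'obj': 'T2'}
```

A relation with three members would raise the exception instead of being converted to a nested structure.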
| fmt | pubanno_json |
|---|---|
| native type | Document |
| lazy loading | no |
| supports text | yes |
| supports annotations | yes |
| stream type | text |

| name | type | default | purpose |
|---|---|---|---|
| obj | str | 'type' | key in `Entity.metadata` for the `obj` field |

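The effect of the `obj` option can be pictured with a small sketch. It uses plain dicts in place of `bconv`'s entity objects, and the function `denotation_to_entity` and its dict layout are hypothetical; only the idea that the JSON `obj` value ends up in the metadata under a configurable key is taken from the table.

```python
def denotation_to_entity(den, text, obj="type"):
    """Turn one PubAnnotation denotation into a plain-dict entity.

    The value of the JSON "obj" field is stored in the metadata
    under the key given by the `obj` parameter.
    """
    start, end = den["span"]["begin"], den["span"]["end"]
    return {
        "id": den["id"],
        "start": start,
        "end": end,
        "text": text[start:end],
        "metadata": {obj: den["obj"]},
    }


text = "Lidocaine-induced cardiac asystole.\n"
den = {"id": "T2", "span": {"begin": 18, "end": 34}, "obj": "Disease"}

default = denotation_to_entity(den, text)
custom = denotation_to_entity(den, text, obj="category")
print(default["metadata"])  # {'type': 'Disease'}
print(custom["metadata"])   # {'category': 'Disease'}
```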
| fmt | pubanno_json.tgz |
|---|---|
| native type | Collection |
| lazy loading | no |
| supports text | yes |
| supports annotations | yes |
| stream type | binary |

| name | type | default | purpose |
|---|---|---|---|
| obj | str | 'type' | key in `Entity.metadata` for the `obj` field |

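Since the archive format expects a binary stream and is not loaded lazily, reading it amounts to unpacking every JSON member up front. The following sketch shows this with the standard library only. It is not `bconv`'s loader: `read_pubanno_archive` is a hypothetical helper, and the member names in the demo archive are made up.

```python
import io
import json
import tarfile


def read_pubanno_archive(stream):
    """Eagerly read every .json member of a gzipped tar archive.

    `stream` must be a binary file-like object. Returns a list of
    (member name, parsed JSON) pairs in archive order.
    """
    records = []
    with tarfile.open(fileobj=stream, mode="r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".json"):
                with archive.extractfile(member) as f:
                    records.append((member.name, json.load(f)))
    return records


# Build a tiny in-memory archive for demonstration (made-up member names).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for name, text in [("doc-0.json", "A title."), ("doc-1.json", "An abstract.")]:
        payload = json.dumps({"text": text, "denotations": []}).encode("utf-8")
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

records = read_pubanno_archive(buffer)
print([name for name, _ in records])  # ['doc-0.json', 'doc-1.json']
```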
| fmt | pubanno_json |
|---|---|
| supports text | yes |
| supports annotations | yes |
| stream type | text |

| name | type | default | purpose |
|---|---|---|---|
| obj | str | 'type' | key in `Entity.metadata` for the `obj` field |
| sourcedb | str | None | source of the article text |
| avoid_gaps | str | None | suppress discontinuous spans |
| avoid_overlaps | str | None | suppress annotation collisions |
| **meta | Dict[str, Any] | {} | additional key-value pairs directly copied into the output JSON |

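How `sourcedb` and the extra `**meta` pairs might end up in the output can be sketched like this. The function `build_record` is hypothetical, and two details are assumptions of the sketch rather than documented behaviour: that `sourcedb` is omitted when it is `None`, and that `**meta` entries are applied last.

```python
import json


def build_record(text, sourceid, denotations, sourcedb=None, **meta):
    """Assemble a PubAnnotation-style JSON object for one document."""
    record = {"sourceid": sourceid, "text": text, "denotations": denotations}
    if sourcedb is not None:
        record["sourcedb"] = sourcedb
    record.update(meta)  # extra key-value pairs are copied as they are
    return record


record = build_record(
    "Lidocaine-induced cardiac asystole.\n",
    "354896",
    [{"id": "T2", "span": {"begin": 18, "end": 34}, "obj": "Disease"}],
    sourcedb="PubMed",
    project="demo-project",  # hypothetical extra key passed through **meta
)
print(json.dumps(record, indent=2))
```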
| fmt | pubanno_json.tgz |
|---|---|
| supports text | yes |
| supports annotations | yes |
| stream type | binary |

| name | type | default | purpose |
|---|---|---|---|
| obj | str | 'type' | key in `Entity.metadata` for the `obj` field |
| sourcedb | str | None | source of the article text |
| avoid_gaps | str | None | suppress discontinuous spans |
| avoid_overlaps | str | None | suppress annotation collisions |
| **meta | Dict[str, Any] | {} | additional key-value pairs directly copied into the output JSON |
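To illustrate what suppressing discontinuous spans (the `avoid_gaps` option) can mean in practice, here is a small sketch of two possible strategies. This page does not list the accepted option values, so the strategy names `"fill"` and `"split"` and the function `flatten_spans` are placeholders for this example, not `bconv`'s actual interface.

```python
def flatten_spans(spans, strategy):
    """Replace the fragments of a discontinuous annotation by contiguous spans.

    spans:    list of (start, end) fragments of one annotation
    strategy: "fill"  -> one span covering all fragments and the gaps
              "split" -> one separate span per fragment
    Both strategy names are placeholders for this sketch.
    """
    spans = sorted(spans)
    if strategy == "fill":
        return [(spans[0][0], max(end for _, end in spans))]
    if strategy == "split":
        return spans
    raise ValueError(f"unknown strategy: {strategy!r}")


fragments = [(10, 15), (0, 4)]  # two fragments with a gap in between
print(flatten_spans(fragments, "fill"))   # [(0, 15)]
print(flatten_spans(fragments, "split"))  # [(0, 4), (10, 15)]
```

Either way, the output contains only contiguous spans, so no bagging is needed to represent the annotation.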