PubAnnotation

Lenz Furrer edited this page May 24, 2021 · 9 revisions

PubAnnotation is a repository for publicly hosting annotations over biomedical documents. The platform accepts submissions in a JSON format that contains both text and annotations.

bconv provides two loaders and two formatters. pubanno_json reads and writes an individual JSON file for a single document; pubanno_json.tgz reads and writes a compressed archive containing multiple JSON files for a document or collection.

Example

{
  "sourceid": "354896",
  "sourcedb": "PubMed",
  "text": "Lidocaine-induced cardiac asystole.\n",
  "denotations": [
    {
      "id": "T2",
      "span": {
        "begin": 18,
        "end": 34
      },
      "obj": "Disease"
    }
  ]
}

Full example
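
To check the offsets in the example above, the span of a denotation can be used to slice the text. The following is a minimal sketch that uses only Python's standard library, not bconv; the helper name covered_text is made up for illustration. Python strings are indexed by code point, which matches the way this format counts offsets.

```python
import json

# The example document from above as a JSON string.
# A raw string keeps the "\n" escape intact for the JSON parser.
raw = r'''
{
  "sourceid": "354896",
  "sourcedb": "PubMed",
  "text": "Lidocaine-induced cardiac asystole.\n",
  "denotations": [
    {"id": "T2", "span": {"begin": 18, "end": 34}, "obj": "Disease"}
  ]
}
'''

doc = json.loads(raw)


def covered_text(doc, denotation):
    """Return the text covered by a denotation's span (hypothetical helper)."""
    span = denotation["span"]
    return doc["text"][span["begin"]:span["end"]]


mentions = [(d["id"], d["obj"], covered_text(doc, d)) for d in doc["denotations"]]
print(mentions)  # [('T2', 'Disease', 'cardiac asystole')]
```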

Sources

The PubAnnotation website has a description of the basic annotation format and the instructions for representing more complex documents.

Notes

  • Document structure: The pubanno_json format is designed primarily for abstracts. Full-text documents can be represented, but any document-internal structure is lost. bconv also allows exporting collections with pubanno_json, but the resulting JSON (an array of document objects) is not accepted by PubAnnotation, and cannot even be loaded directly by bconv. The pubanno_json.tgz format, however, supports multi-document collections and also preserves section boundaries.
  • Metadata: The format records the document ID with the key sourceid. The sourcedb option allows specifying the resource that defines the document ID (e.g. PubMed). In the pubanno_json.tgz format, sections are preserved and enumerated, but their type is not stored.
  • Entity annotations: Annotations have an ID (counter), start/end offsets, and an obj attribute, which is typically the entity type. Additional attributes (e.g. the concept ID) are stored in attribute annotations.
  • Whitespace: Whitespace is preserved.
  • Offsets: Entity offsets are counted in Unicode code points.
  • Discontinuous spans: Discontinuous spans are represented with PubAnnotation's bagging model.
  • Relations/events: PubAnnotation only supports binary relations. When serialising, bconv interprets the first and second relation member as the subj and obj attribute, respectively, and the type entry of the relation metadata as the predicate (pred). Other metadata as well as the role value of relation members are ignored. In PubAnnotation, complex relations with more than two members can be represented through nesting; however, bconv does not attempt an automatic conversion and simply raises an exception if the arity is different from 2.
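
The relation mapping described in the last note can be sketched as follows. This is not bconv's code: the function relation_to_pubanno and the plain-dict representation of relation members (with the keys refid and role) are assumptions made for illustration. Only the output keys id, subj, pred and obj come from the PubAnnotation format.

```python
from typing import Any, Dict, List


def relation_to_pubanno(
    rel_id: str,
    members: List[Dict[str, Any]],
    metadata: Dict[str, Any],
) -> Dict[str, str]:
    """Map a relation onto PubAnnotation's binary subj/pred/obj model.

    `members` is a hypothetical list of dicts with a "refid" key (ID of the
    referenced annotation) and an optional "role", which is ignored here.
    """
    if len(members) != 2:
        # No automatic conversion of n-ary relations is attempted.
        raise ValueError(
            f"PubAnnotation relations must be binary, got arity {len(members)}"
        )
    first, second = members
    return {
        "id": rel_id,
        "subj": first["refid"],    # first member becomes the subject
        "pred": metadata["type"],  # relation type becomes the predicate
        "obj": second["refid"],    # second member becomes the object
    }


rel = relation_to_pubanno(
    "R1",
    [{"refid": "T1", "role": "cause"}, {"refid": "T2", "role": "effect"}],
    {"type": "induces", "score": 0.9},  # "score" is dropped, like other metadata
)
print(rel)  # {'id': 'R1', 'subj': 'T1', 'pred': 'induces', 'obj': 'T2'}
```

A relation with one member, or with three or more, raises ValueError in this sketch, mirroring the exception described above.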

Loaders

PubAnnoJSONLoader

Properties

fmt                   pubanno_json
native type           Document
lazy loading          no
supports text         yes
supports annotations  yes
stream type           text

Options

name  type  default  purpose
obj   str   'type'   key in Entity.metadata for the obj field

PubAnnoTGZLoader

Properties

fmt                   pubanno_json.tgz
native type           Collection
lazy loading          no
supports text         yes
supports annotations  yes
stream type           binary

Options

name  type  default  purpose
obj   str   'type'   key in Entity.metadata for the obj field
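
Because the stream type is binary, a pubanno_json.tgz file has to be opened in binary mode. The sketch below shows what packing and unpacking such an archive can look like with Python's standard library alone. It is not bconv's implementation: the functions write_tgz and read_tgz, the member names and the text of the second document are made up, and the archive layout that bconv actually uses may differ.

```python
import io
import json
import tarfile


def write_tgz(docs, names):
    """Pack one JSON file per document into an in-memory .tgz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, doc in zip(names, docs):
            payload = json.dumps(doc).encode("utf8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_tgz(data):
    """Read every .json member of a .tgz archive into a list of dicts."""
    docs = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".json"):
                docs.append(json.load(archive.extractfile(member)))
    return docs


docs = [
    {"sourceid": "354896", "text": "Lidocaine-induced cardiac asystole.\n", "denotations": []},
    {"sourceid": "354896", "text": "A second section.\n", "denotations": []},  # made-up text
]
names = ["354896-0.json", "354896-1.json"]  # hypothetical member names
archive_bytes = write_tgz(docs, names)
restored = read_tgz(archive_bytes)
print(len(restored))  # 2
```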

Exporters

PubAnnoJSONFormatter

Properties

fmt                   pubanno_json
supports text         yes
supports annotations  yes
stream type           text

Options

name            type            default  purpose
obj             str             'type'   key in Entity.metadata for the obj field
sourcedb        str             None     source of the article text
avoid_gaps      str             None     suppress discontinuous spans
avoid_overlaps  str             None     suppress annotation collisions
**meta          Dict[str, Any]  {}       additional key-value pairs directly copied into the output JSON
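
How the obj, sourcedb and **meta options shape the output can be illustrated with a small stand-alone sketch. The function to_pubanno, the tuple representation of entities and the concept ID X-0001 are hypothetical; bconv's own formatter works on its Document and Entity classes and may differ in details, such as whether sourcedb is omitted when it is None.

```python
def to_pubanno(docid, text, entities, obj="type", sourcedb=None, **meta):
    """Build a PubAnnotation-style dict from plain Python data.

    `entities` is a hypothetical list of (start, end, metadata) tuples;
    `obj` names the metadata key whose value fills the "obj" field.
    """
    doc = {"sourceid": docid}
    if sourcedb is not None:  # assumption: leave the key out when unset
        doc["sourcedb"] = sourcedb
    doc["text"] = text
    doc["denotations"] = [
        {"id": f"T{i}", "span": {"begin": start, "end": end}, "obj": metadata[obj]}
        for i, (start, end, metadata) in enumerate(entities, start=1)
    ]
    doc.update(meta)  # extra key-value pairs are copied into the output as-is
    return doc


text = "Lidocaine-induced cardiac asystole.\n"
entities = [(18, 34, {"type": "Disease", "cui": "X-0001"})]  # made-up concept ID

# Default: the entity type fills "obj"; "project" is passed through via **meta.
doc = to_pubanno("354896", text, entities, sourcedb="PubMed", project="demo")

# Alternative: use the concept ID for "obj" instead of the type.
by_cui = to_pubanno("354896", text, entities, obj="cui")

print(doc["denotations"][0]["obj"], by_cui["denotations"][0]["obj"])  # Disease X-0001
```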

PubAnnoTGZFormatter

Properties

fmt                   pubanno_json.tgz
supports text         yes
supports annotations  yes
stream type           binary

Options

name            type            default  purpose
obj             str             'type'   key in Entity.metadata for the obj field
sourcedb        str             None     source of the article text
avoid_gaps      str             None     suppress discontinuous spans
avoid_overlaps  str             None     suppress annotation collisions
**meta          Dict[str, Any]  {}       additional key-value pairs directly copied into the output JSON
