Lenz edited this page Sep 16, 2020 · 2 revisions

PubMed

PubMed provides abstracts and full-text articles for download through its efetch API. bconv has a dedicated subclass of loaders, called fetchers, which pull documents directly from the API rather than loading them from disk. The pubmed and pmc formats accept a list of PubMed or PMC identifiers instead of a path or open file. The identifier list can be a single comma-separated string or a Python sequence. Note: PubMed silently skips invalid IDs, so the caller is responsible for checking that the returned documents are complete.
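
Because invalid IDs are dropped silently, one way to check completeness is to compare the requested identifiers with the IDs of the documents that came back. The sketch below uses only the standard library; normalise_ids and missing_ids are hypothetical helper names, not part of bconv, and the returned list stands in for the IDs collected from a fetched collection (reading them from each document's id attribute is an assumption here).

```python
def normalise_ids(ids):
    """Accept a comma-separated string or a sequence; return a list of ID strings."""
    if isinstance(ids, str):
        ids = ids.split(",")
    return [str(i).strip() for i in ids if str(i).strip()]


def missing_ids(requested, returned):
    """Return the requested IDs that are absent from the returned ones, in request order."""
    returned = {str(i).strip() for i in returned}
    return [i for i in normalise_ids(requested) if i not in returned]


# Hypothetical outcome: three IDs were requested, but only two documents came back.
requested = "12345678, 23456789, not-an-id"
returned = ["12345678", "23456789"]  # assumed to be gathered from the fetched documents
print(missing_ids(requested, returned))
```

An empty result means every requested ID is accounted for; anything else lists the identifiers that need a second look.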

PubMed's terms of use require users to provide an email address when querying the API (although this is not enforced). The address can be specified through the email format parameter, which is passed directly to the API, as are tool (defaulting to "bconv") and api_key (available from PubMed for high-volume users).
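
To illustrate how these parameters end up in a request, here is a minimal sketch that assembles an efetch URL with the standard library. It does not reproduce bconv's internal code; the helper name efetch_url and the placeholder email address are assumptions, while db, id, retmode, tool, email and api_key are regular efetch query parameters.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="pubmed", tool="bconv", email=None, api_key=None):
    """Build an efetch request URL; optional parameters are left out when None."""
    params = {"db": db, "id": ",".join(ids), "retmode": "xml", "tool": tool}
    if email is not None:
        params["email"] = email
    if api_key is not None:
        params["api_key"] = api_key
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder IDs and address, for illustration only; no request is sent.
print(efetch_url(["12345678", "23456789"], email="user@example.org"))
```

For PMC full-text articles, the same sketch would be called with db="pmc".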

bconv requests the documents from PubMed in XML format, sometimes referred to as pxml (for abstracts) and nxml (for PMC full-text). With the pxml and nxml formats, bconv provides regular loaders to process locally available files (e.g. downloaded Medline dumps). The XML formats contain a lot of information (metadata, mark-up) that bconv largely ignores; this is also why bconv can only parse, but not serialise, pxml/nxml. PubMed also provides documents in BioC format with a similar (but not identical) reduction; a future version of bconv might feature fetchers for this API as well.
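
When working with local files of unknown origin, the two XML flavours can usually be told apart by their root element. The following sketch is not part of bconv; guess_format is a hypothetical helper, and the root names it checks (PubmedArticleSet or PubmedArticle for abstracts, pmc-articleset or article for full-text) are assumptions about typical files rather than an exhaustive list.

```python
import io
import xml.etree.ElementTree as ET


def guess_format(stream):
    """Guess 'pxml' or 'nxml' from the root element of a binary XML stream."""
    for _, elem in ET.iterparse(stream, events=("start",)):
        root = elem.tag  # the first "start" event belongs to the root element
        break
    else:
        raise ValueError("no XML element found")
    if root in ("PubmedArticleSet", "PubmedArticle"):
        return "pxml"
    if root in ("pmc-articleset", "article"):
        return "nxml"
    raise ValueError(f"unrecognised root element: {root}")


# Tiny in-memory stand-ins for real files.
print(guess_format(io.BytesIO(b"<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>")))
print(guess_format(io.BytesIO(b"<article><body/></article>")))
```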

Sources

The XML schemas for PubMed abstracts and PMC full-text articles are linked to here. The efetch API, to which bconv's fetchers point, is documented here. In the test suite, examples can be found for abstracts and full-text.

Notes

  • Document structure: Both pxml and nxml may contain multiple documents, each organised into sections at multiple levels. From abstracts (pxml), bconv extracts every AbstractText node as a section. With the single_section option, the entire abstract is merged into a single section and any section labels (e.g. "Motivation") are embedded in the text. For PMC documents (nxml), each sec element is considered a section.
  • Metadata: All documents have an article ID, which is extracted from the XML. In addition, pxml abstracts contain some basic metadata, such as type and date of publication. With the include_mesh option, the Medical Subject Headings (MeSH) are included in the abstracts (as part of the text, not as metadata). At the section level, the section type is stored in the section.type property for both pxml and nxml.
  • Entity annotations: With the mesh_as_entities option (which implicitly sets include_mesh), the MeSH terms are annotated with their respective identifiers.
  • Whitespace: Usually, the texts in the XML documents have no trailing whitespace at the end of a section. However, separating whitespace between sections is highly desirable when converting to other formats (e.g. plain text). Therefore, bconv appends a single line break to the end of every abstract section and two line breaks (i.e. a blank line) to sections of full-text articles.
  • Offsets: Unlike BioC, the PubMed XML formats do not contain explicit character offsets. Instead, bconv simply counts characters, as it does for loading plain text. The character offsets include any embedded labels and appended whitespace.
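
The notes on document structure, whitespace and offsets can be sketched with the standard library as follows. This is an illustration under stated assumptions, not bconv's implementation: the sample record is invented and heavily trimmed, abstract_sections is a hypothetical helper, the "LABEL: text" style for embedded labels is a guess, and the title is assumed to receive the same trailing line break as the abstract sections.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed record; real PubMed XML carries far more mark-up.
SAMPLE = """\
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>An example title.</ArticleTitle>
      <Abstract>
        <AbstractText Label="MOTIVATION">Why we did it.</AbstractText>
        <AbstractText Label="RESULTS">What we found.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def abstract_sections(article, single_section=False):
    """Return (start, end, text) triples for the title and the abstract sections."""
    title = "".join(article.find(".//ArticleTitle").itertext())
    parts = []
    for node in article.iter("AbstractText"):
        text = "".join(node.itertext())
        label = node.get("Label")
        if single_section and label:
            # Assumed embedding style; bconv's exact rendering may differ.
            text = f"{label}: {text}"
        parts.append(text)
    if single_section and parts:
        parts = [" ".join(parts)]  # conflate everything after the title
    sections, offset = [], 0
    for text in [title] + parts:
        text += "\n"  # one line break appended to every section
        sections.append((offset, offset + len(text), text))
        offset += len(text)  # offsets count embedded labels and the line break
    return sections


for start, end, text in abstract_sections(ET.fromstring(SAMPLE)):
    print(start, end, repr(text))
```

Calling abstract_sections(ET.fromstring(SAMPLE), single_section=True) yields the title plus one merged section, with the labels counted in the offsets.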

Fetchers

PXMLFetcher

Properties

| property | value |
| --- | --- |
| fmt | pubmed |
| native type | Collection |
| lazy loading | no |
| supports text | yes |
| supports annotations | MeSH only |
| stream type | binary |

Options

| name | type | default | purpose |
| --- | --- | --- | --- |
| single_section | bool | False | conflate all sections (following the title) into one |
| include_mesh | bool | False | add every MeSH entry as a separate section |
| mesh_as_entities | bool | False | add entities annotating the MeSH entries (implies include_mesh) |
| tool | str | "bconv" | parameter passed to efetch |
| email | str | None | parameter passed to efetch |
| api_key | str | None | parameter passed to efetch |

PMCFetcher

Properties

| property | value |
| --- | --- |
| fmt | pmc |
| native type | Collection |
| lazy loading | no |
| supports text | yes |
| supports annotations | no |
| stream type | binary |

Options

| name | type | default | purpose |
| --- | --- | --- | --- |
| tool | str | "bconv" | parameter passed to efetch |
| email | str | None | parameter passed to efetch |
| api_key | str | None | parameter passed to efetch |

Loaders

PXMLLoader

Properties

| property | value |
| --- | --- |
| fmt | pxml |
| native type | Collection |
| lazy loading | yes |
| supports text | yes |
| supports annotations | MeSH only |
| stream type | binary |

Options

| name | type | default | purpose |
| --- | --- | --- | --- |
| single_section | bool | False | conflate all sections (following the title) into one |
| include_mesh | bool | False | add every MeSH entry as a separate section |
| mesh_as_entities | bool | False | add entities annotating the MeSH entries (implies include_mesh) |
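
The lazy loading property listed for PXMLLoader matters for large Medline dumps, where holding every article in memory at once is impractical. As a rough illustration of the idea (not bconv's own code), the sketch below streams over a binary XML source with the standard library and yields one PMID at a time; iter_pmids is a hypothetical helper and the in-memory dump is invented.

```python
import io
import xml.etree.ElementTree as ET


def iter_pmids(stream):
    """Yield the PMID of each PubmedArticle as soon as its element is complete."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "PubmedArticle":
            yield elem.findtext("MedlineCitation/PMID")
            elem.clear()  # drop the processed subtree to keep memory use low


# Invented two-article dump, opened as a binary stream like a real file would be.
dump = io.BytesIO(
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)
for pmid in iter_pmids(dump):
    print(pmid)
```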

PMCLoader

Properties

| property | value |
| --- | --- |
| fmt | nxml |
| native type | Collection |
| lazy loading | yes |
| supports text | yes |
| supports annotations | no |
| stream type | binary |

Options

None.
