PubMed
PubMed provides abstracts and full-text articles to download through its efetch API.
bconv has a dedicated subclass of loaders, called fetchers, which pull documents directly from the API rather than loading them from disk.
The formats `pubmed` and `pmc` accept a list of PubMed or PMC identifiers instead of a path or open file.
The identifier list can be a single string with comma separators or an actual Python sequence.
Note: PubMed silently skips invalid IDs.
It is the responsibility of the caller to sanity-check the returned documents for completeness.
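Because skipped IDs raise no error, a completeness check has to compare the requested identifiers with those of the documents that came back. The following is a minimal sketch, assuming the returned document IDs have already been collected into a list of strings. `normalise_ids` and `missing_ids` are hypothetical helpers for illustration, not part of bconv.

```python
def normalise_ids(ids):
    """Accept a comma-separated string or a sequence; return a list of ID strings."""
    if isinstance(ids, str):
        ids = ids.split(",")
    # Strip surrounding whitespace and drop empty entries.
    return [str(i).strip() for i in ids if str(i).strip()]


def missing_ids(requested, returned):
    """Return the requested IDs that are absent from the returned ones, in request order."""
    found = set(normalise_ids(returned))
    return [i for i in normalise_ids(requested) if i not in found]


# Illustrative values only: the second entry stands in for an invalid ID.
requested = "12345, not-an-id, 67890"
returned = ["12345", "67890"]
print(missing_ids(requested, returned))
```

A non-empty result means that at least one requested document was not delivered.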
PubMed's terms of usage require users to provide an email address when querying the API (although this is not enforced).
The address can be specified through the `email` format parameter and is passed directly to the API, just like `tool` (defaulting to "bconv") and `api_key` (available from PubMed for high-volume users).
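To illustrate how these three parameters end up in an efetch request, here is a sketch that only builds the query URL with the standard library and sends nothing. The function `efetch_url` is hypothetical, and the choice of `db` and `retmode` values is an assumption about a typical request. It is not taken from bconv's implementation.

```python
from urllib.parse import urlencode

# Public endpoint of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="pubmed", tool="bconv", email=None, api_key=None):
    """Build an efetch request URL; optional parameters are omitted when None."""
    if not isinstance(ids, str):
        ids = ",".join(str(i) for i in ids)
    params = {"db": db, "id": ids, "retmode": "xml", "tool": tool}
    if email is not None:
        params["email"] = email
    if api_key is not None:
        params["api_key"] = api_key
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder IDs and address, for illustration only.
print(efetch_url(["12345", "67890"], email="user@example.org"))
```

Passing `db="pmc"` would address the full-text database instead of the abstracts.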
bconv requests the documents from PubMed in XML format, sometimes referred to as pxml (for abstracts) and nxml (PMC full-text).
With the pxml and nxml formats, bconv provides regular loaders to process locally available files (eg. downloaded Medline dumps).
The XML formats contain a lot of information (metadata, mark-up), which is largely ignored by bconv – this is also why bconv can only parse, but not serialise pxml/nxml.
PubMed also provides documents in BioC format with a similar (but not identical) reduction; a future version of bconv might feature fetchers for this API as well.
The XML schemas for PubMed abstracts and PMC full-text articles are linked to here.
The efetch API, to which bconv's fetchers point, is documented here.
In the test suite, examples can be found for abstracts and full-text.
- Document structure: Both pxml and nxml may contain multiple documents, which are organised into sections of multiple levels. From the abstracts (pxml), `bconv` extracts every `AbstractText` node as a section. With the `single_section` option, the entire abstract is conflated into a single section and any section labels (eg. "Motivation") are embedded in the text. For PMC documents (nxml), each `sec` element is considered a section.
- Metadata: All documents have an article ID, which is extracted from the XML. In addition, pxml abstracts contain a few basic metadata, such as type and date of publication. With the `include_mesh` option, the Medical Subject Headings (MeSH) are included in the abstracts (as part of the text, not as metadata). At the section level, the section type is stored in the `section.type` property for both pxml and nxml.
- Entity annotations: With the `mesh_as_entities` option (which implicitly sets `include_mesh`), the MeSH terms are annotated with their respective identifiers.
- Whitespace: Usually, the texts in the XML documents have no trailing whitespace at the end of a section. However, separating whitespace between sections is highly desirable when converting to other formats (eg. plain text). Therefore, `bconv` appends a single line break to the end of every abstract section and two line breaks (ie. a blank line) to sections of full-text articles.
- Offsets: Unlike BioC, the PubMed XML formats do not contain explicit character offsets. Instead, `bconv` simply counts characters, as it does for loading plain text. The character offsets include any embedded labels and appended whitespace.

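The section, whitespace and offset rules for abstracts can be sketched with the standard library alone. The sample XML below is invented, and `abstract_sections` is a hypothetical function. The `"LABEL: text"` form used for embedded labels is an assumption, as the exact format bconv uses is not specified here.

```python
import xml.etree.ElementTree as ET

# Invented miniature abstract, shaped like the Abstract element of PubMed XML.
SAMPLE = """<Abstract>
  <AbstractText Label="MOTIVATION">Text mining needs data.</AbstractText>
  <AbstractText Label="RESULTS">We convert <i>many</i> formats.</AbstractText>
</Abstract>"""


def abstract_sections(xml_text, single_section=False):
    """Return (text, start, end) triples, one per section."""
    texts = []
    for node in ET.fromstring(xml_text).iter("AbstractText"):
        text = "".join(node.itertext()).strip()  # drop inline mark-up
        label = node.get("Label")
        if single_section and label:
            text = f"{label}: {text}"  # assumed label format
        texts.append(text + "\n")  # one separating line break per abstract section
    if single_section:
        texts = ["".join(texts)]  # conflate everything into one section
    sections, offset = [], 0
    for text in texts:
        # Offsets are plain character counts, including labels and line breaks.
        sections.append((text, offset, offset + len(text)))
        offset += len(text)
    return sections


for text, start, end in abstract_sections(SAMPLE):
    print(start, end, repr(text))
```

Calling `abstract_sections(SAMPLE, single_section=True)` yields a single section whose end offset equals the length of the conflated text.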
| fmt | pubmed |
|---|---|
| native type | Collection |
| lazy loading | no |
| supports text | yes |
| supports annotations | MeSH only |
| stream type | binary |

| name | type | default | purpose |
|---|---|---|---|
| single_section | bool | False | conflate all sections (following the title) into one |
| include_mesh | bool | False | add every MeSH entry as a separate section |
| mesh_as_entities | bool | False | add entities annotating the MeSH entries (implies include_mesh) |
| tool | str | "bconv" | parameter passed to efetch |
| email | str | None | parameter passed to efetch |
| api_key | str | None | parameter passed to efetch |

| fmt | pmc |
|---|---|
| native type | Collection |
| lazy loading | no |
| supports text | yes |
| supports annotations | no |
| stream type | binary |

| name | type | default | purpose |
|---|---|---|---|
| tool | str | "bconv" | parameter passed to efetch |
| email | str | None | parameter passed to efetch |
| api_key | str | None | parameter passed to efetch |

| fmt | pxml |
|---|---|
| native type | Collection |
| lazy loading | yes |
| supports text | yes |
| supports annotations | MeSH only |
| stream type | binary |

| name | type | default | purpose |
|---|---|---|---|
| single_section | bool | False | conflate all sections (following the title) into one |
| include_mesh | bool | False | add every MeSH entry as a separate section |
| mesh_as_entities | bool | False | add entities annotating the MeSH entries (implies include_mesh) |

| fmt | nxml |
|---|---|
| native type | Collection |
| lazy loading | yes |
| supports text | yes |
| supports annotations | no |
| stream type | binary |

The nxml format has no options.