PubMed

PubMed offers abstracts and full-text articles for download through its efetch API. bconv has a dedicated subclass of loaders, called fetchers, which pull documents directly from the API rather than loading them from disk. The formats pubmed and pmc accept a list of PubMed or PMC identifiers instead of a path or open file. The identifier list can be a single comma-separated string or a Python sequence. Note: PubMed silently skips invalid IDs, so it is the caller's responsibility to check the returned documents for completeness.
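Since invalid IDs are dropped silently, a completeness check can be as simple as comparing the requested identifiers with the returned ones. The following sketch is independent of bconv: the helper names `normalise_ids` and `missing_ids` are hypothetical, and it assumes the caller has already collected the IDs of the returned documents (eg. from each document's article ID).

```python
def normalise_ids(ids):
    """Accept a comma-separated string or a sequence; return a list of ID strings."""
    if isinstance(ids, str):
        ids = ids.split(',')
    # Strip surrounding whitespace and drop empty entries.
    return [str(i).strip() for i in ids if str(i).strip()]


def missing_ids(requested, returned):
    """Return the requested IDs that are absent from the returned ones, in request order."""
    returned = set(normalise_ids(returned))
    return [i for i in normalise_ids(requested) if i not in returned]


# Made-up example: the second ID was silently skipped by the API.
requested = '24723570, not-an-id, 27402677'
returned = ['24723570', '27402677']
print(missing_ids(requested, returned))
```

An empty result means every requested ID came back; anything else lists the IDs to investigate.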

PubMed's terms of use require users to provide an email address when querying the API (although this is not enforced). The address can be specified through the email option and is passed directly to the API, just like tool (defaulting to "bconv") and api_key (available from PubMed for high-volume users).
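To illustrate where these three parameters end up, the sketch below assembles an efetch request URL with the standard library. It is not bconv's own code: the function `efetch_url` is hypothetical, and the exact request that bconv builds internally may differ. The `db`, `id`, `retmode`, `tool`, `email` and `api_key` query parameters are those of the public E-utilities interface.

```python
from urllib.parse import urlencode

EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'


def efetch_url(ids, db='pubmed', tool='bconv', email=None, api_key=None):
    """Build an efetch request URL; optional parameters are omitted when unset."""
    params = {'db': db, 'id': ','.join(ids), 'retmode': 'xml', 'tool': tool}
    if email is not None:
        params['email'] = email
    if api_key is not None:
        params['api_key'] = api_key
    return f'{EFETCH_URL}?{urlencode(params)}'


# No request is sent here; the URL is only printed for inspection.
print(efetch_url(['24723570', '27402677'], email='user@example.org'))
```

For PMC full-text articles, the same sketch would be called with `db='pmc'`.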

bconv requests the documents from PubMed in XML format, sometimes referred to as pxml (for abstracts) and nxml (PMC full-text). With the pxml and nxml formats, bconv provides regular loaders to process locally available files (eg. downloaded Medline dumps). The XML formats contain a lot of information (metadata, mark-up), which is largely ignored by bconv – this is also why bconv can only parse, but not serialise pxml/nxml. PubMed also provides documents in BioC format with a similar (but not identical) reduction; a future version of bconv might feature fetchers for this API as well.

Sources

The XML schemas for PubMed abstracts and PMC full-text articles are linked to here. The efetch API, to which bconv's fetchers point, is documented here. In the test suite, examples can be found for abstracts and full-text.

Notes

Document structure: Both pxml and nxml may contain multiple documents, which are organised into sections of multiple levels. From the abstracts (pxml), bconv extracts every AbstractText node as a section. With the single_section option, the entire abstract is conflated into a single section and any section labels (eg. "Motivation") are embedded in the text. For PMC documents (nxml), each sec element is considered a section.
Metadata: All documents have an article ID, which is extracted from the XML. In addition, pxml abstracts contain some basic metadata, such as type and date of publication. With the include_mesh option, the Medical Subject Headings (MeSH) are included in the abstracts (as part of the text, not as metadata). At the section level, the section type is stored in the section.type property for both pxml and nxml.
Entity annotations: With the mesh_as_entities option (which implicitly sets include_mesh), the MeSH terms are annotated with their respective identifiers.
Whitespace: Usually, the texts in the XML documents have no trailing whitespace at the end of a section. However, separating whitespace between sections is highly desirable when converting to other formats (eg. plain text). Therefore, bconv appends a single line break to the end of every abstract section and two line breaks (ie. a blank line) to sections of full-text articles.
Offsets: Unlike BioC, the PubMed XML formats do not contain explicit character offsets. Instead, bconv simply counts characters, as it does for loading plain text. The character offsets include any embedded labels and appended whitespace.
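The notes on document structure, whitespace and offsets can be sketched with the standard library alone. This is an illustration under stated assumptions, not bconv's implementation: the function `abstract_sections` and the sample XML are made up, the `"LABEL: text"` form used for embedded labels is a guess, and offsets are counted from the start of the abstract rather than from the title.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a PubMed abstract record.
PXML = """\
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>An example title.</ArticleTitle>
      <Abstract>
        <AbstractText Label="MOTIVATION">Why we did it.</AbstractText>
        <AbstractText Label="RESULTS">What we found.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def labelled(node):
    """Return the node text with its label embedded (assumed 'LABEL: text' form)."""
    label = node.get('Label')
    text = ''.join(node.itertext())
    return f'{label}: {text}' if label else text


def abstract_sections(xml_text, single_section=False):
    """Return (start, end, text) triples, one per section of the abstract."""
    nodes = list(ET.fromstring(xml_text).iter('AbstractText'))
    if single_section:
        # Conflate everything into one section, keeping the labels in the text.
        texts = [' '.join(labelled(n) for n in nodes)]
    else:
        texts = [''.join(n.itertext()) for n in nodes]
    sections, offset = [], 0
    for text in texts:
        text += '\n'  # one separating line break per abstract section
        sections.append((offset, offset + len(text), text))
        offset += len(text)  # offsets include the appended whitespace
    return sections


for start, end, text in abstract_sections(PXML):
    print(start, end, repr(text))
```

Because the appended line break counts towards the offsets, each section starts exactly where the previous one ends.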

Fetchers

`PXMLFetcher`

Properties

fmt	`pubmed`
native type	Collection
lazy loading	no
supports text	yes
supports annotations	MeSH only
stream type	binary

Options

name	type	default	purpose
single_section	bool	`False`	conflate all sections (following the title) into one
include_mesh	bool	`False`	add every MeSH entry as a separate section
mesh_as_entities	bool	`False`	add entities annotating the MeSH entries (implies `include_mesh`)
tool	str	`"bconv"`	parameter passed to efetch
email	str	`None`	parameter passed to efetch
api_key	str	`None`	parameter passed to efetch

`PMCFetcher`

Properties

fmt	`pmc`
native type	Collection
lazy loading	no
supports text	yes
supports annotations	no
stream type	binary

Options

name	type	default	purpose
tool	str	`"bconv"`	parameter passed to efetch
email	str	`None`	parameter passed to efetch
api_key	str	`None`	parameter passed to efetch

Loaders

`PXMLLoader`

Properties

fmt	`pxml`
native type	Collection
lazy loading	yes
supports text	yes
supports annotations	MeSH only
stream type	binary

Options

name	type	default	purpose
single_section	bool	`False`	conflate all sections (following the title) into one
include_mesh	bool	`False`	add every MeSH entry as a separate section
mesh_as_entities	bool	`False`	add entities annotating the MeSH entries (implies `include_mesh`)
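The `lazy loading` property presumably means that the documents of a multi-document file are produced one at a time instead of all at once. The sketch below shows the general technique with the standard library's incremental XML parser. It is not bconv's loader: the function `iter_pmids` and the miniature dump are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature of a Medline dump containing two records.
DUMP = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def iter_pmids(stream):
    """Yield one PMID per PubmedArticle element as soon as it is complete."""
    for _, elem in ET.iterparse(stream, events=('end',)):
        if elem.tag == 'PubmedArticle':
            yield elem.findtext('MedlineCitation/PMID')
            elem.clear()  # release the memory held by the finished record


# A binary stream, matching the stream type listed in the properties.
pmids = iter_pmids(io.BytesIO(DUMP))
print(next(pmids))
```

With a generator like this, memory use stays bounded by the size of a single record, which matters for large local dumps.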

`PMCLoader`

Properties

fmt	`nxml`
native type	Collection
lazy loading	yes
supports text	yes
supports annotations	no
stream type	binary

Options

None.
