
Processing all data, not just a subset, in step 2 (full-texts --> JSON data structure) #27

@npscience


I'm running through the pipeline to see if it is all possible locally (see issue #23) and I think there is a problem with step 2 (cc @npch), as follows:

When I use the EuPMCCodeReferences notebook to process the full texts from the getpapers text mining and extract URLs into a JSON data structure (with paper DOIs, etc.), the notebook fails at In [5] with a KeyError:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
      1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
     97
     98     for pmcid in list_of_pmcids:
---> 99         papers.append(process_paper(pmcid, data_dir))
    100
    101     return papers

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
     66             paper_json = json.load(f)
     67             # Get the DOI
---> 68             doi = get_doi(paper_json)
     69             pub_date = get_pub_date(paper_json)
     70     except IOError:

[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
     29
     30 def get_doi(paper_json):
---> 31     paper_doi = paper_json['doi'][0]
     32     return paper_doi
     33

KeyError: 'doi'

I think this happens because the process_eupmc.py script cannot complete if a folder has no XML or JSON, or if a JSON file has no DOI. I have not fully tested the second case; it is what I can glean from simple checks. The notebook works when I run it on a small subset of the data with the problematic directories removed.
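To check which directories are affected before running the notebook, something like the sketch below might help. It is not part of process_eupmc.py: the function name `find_problem_dirs` is hypothetical, and it assumes each paper lives in its own subdirectory of `data_dir` with a metadata file named `eupmc_result.json` (pass a different `json_name` if your getpapers output uses another name).

```python
import json
import os


def find_problem_dirs(data_dir, json_name="eupmc_result.json"):
    """Return (missing_json, missing_doi): two lists of subdirectory names.

    Assumption: one subdirectory per paper, each holding a JSON metadata
    file called `json_name`. Both names are illustrative.
    """
    missing_json, missing_doi = [], []
    for name in sorted(os.listdir(data_dir)):
        paper_dir = os.path.join(data_dir, name)
        if not os.path.isdir(paper_dir):
            continue  # ignore stray files at the top level
        json_path = os.path.join(paper_dir, json_name)
        try:
            with open(json_path) as f:
                paper_json = json.load(f)
        except (IOError, ValueError):
            # The file is absent/unreadable, or it is not valid JSON
            missing_json.append(name)
            continue
        # A record with no 'doi' key (or an empty one) triggers the KeyError
        if not isinstance(paper_json, dict) or not paper_json.get("doi"):
            missing_doi.append(name)
    return missing_json, missing_doi
```

Calling `find_problem_dirs(data_dir)` returns the two lists, which can be printed or used to filter `paper_ids` before processing.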

I propose amending the process_eupmc.py script to handle cases where the getpapers result does not contain the expected info, so that instead of stopping at these points, the script continues and ignores that data.

I don't yet know how to do this. I'll try to give it a crack, but feel free to jump in if anyone feels up to it.

  • set a default?
  • try except?
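Both options could be combined. The following is only a rough sketch, not the actual fix: `get_doi_or_default` and `process_papers_tolerant` are hypothetical names, and `process_paper` is passed in as an argument here purely so the example is self-contained. The first function returns a default when the record has no DOI; the second wraps each paper in try/except and records what was skipped instead of stopping.

```python
def get_doi_or_default(paper_json, default=None):
    """Return the first DOI in the record, or `default` if there is none."""
    dois = paper_json.get("doi") or []
    return dois[0] if dois else default


def process_papers_tolerant(list_of_pmcids, data_dir, process_paper):
    """Process every paper; collect failures instead of raising.

    `process_paper` is any callable taking (pmcid, data_dir).
    Returns (papers, skipped), where skipped holds (pmcid, error) pairs.
    """
    papers, skipped = [], []
    for pmcid in list_of_pmcids:
        try:
            papers.append(process_paper(pmcid, data_dir))
        except (KeyError, IndexError, IOError, ValueError) as err:
            # Missing fields, missing files, or malformed JSON: note and move on
            skipped.append((pmcid, repr(err)))
    return papers, skipped
```

Keeping the `skipped` list means the ignored papers can still be reported at the end, rather than disappearing silently.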
