Description
I'm running through the pipeline to see if it is all possible locally (see issue #23) and I think there is a problem with step 2 (cc @npch), as follows:
When using the EuPMCCodeReferences notebook to process the full-texts from getpapers text mining and extract URLs into a JSON data structure, with paper DOIs, etc., the notebook gets stuck at In[5]:
KeyError Traceback (most recent call last)
<ipython-input-21-d81bdb32fc4e> in <module>()
1 # Process the papers and extract all the references to GitHub and Zenodo urls
----> 2 papers_info = process_eupmc.process_papers(paper_ids, data_dir)
[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_papers(list_of_pmcids, data_dir)
97
98 for pmcid in list_of_pmcids:
---> 99 papers.append(process_paper(pmcid, data_dir))
100
101 return papers
[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in process_paper(pmcid, data_dir)
66 paper_json = json.load(f)
67 # Get the DOI
---> 68 doi = get_doi(paper_json)
69 pub_date = get_pub_date(paper_json)
70 except IOError:
[~]/Github/collabw18/code-cite/code-cite/notebooks/process_eupmc.py in get_doi(paper_json)
29
30 def get_doi(paper_json):
---> 31 paper_doi = paper_json['doi'][0]
32 return paper_doi
33
KeyError: 'doi'
I think this is because when there is a folder without XML or JSON, or a JSON file without a DOI, the process_eupmc.py script cannot complete. The second situation is not fully tested; it is just what I can glean from simple checks. The notebook works when I run it on a small subset of the data with the problematic directories removed.
I propose amending the process_eupmc.py script to handle cases where the getpapers result does not contain the expected info, so that instead of stopping at these points, the script continues and that data is ignored.
I don't yet know the best way to do this. I'll try to give it a crack; feel free to jump in if anyone feels up to it. Possible approaches:
- set a default?
- try except?
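A rough sketch of what either option could look like, based on the get_doi and process_papers snippets in the traceback above. The exact structure of paper_json comes from the getpapers output, so treat the key names and the None-skipping convention here as assumptions, not a tested fix:

```python
def get_doi(paper_json):
    # Option 1: dict.get with a default, so a missing 'doi' key
    # yields None instead of raising KeyError. The guard also
    # covers the case where 'doi' is present but an empty list.
    dois = paper_json.get('doi', [])
    return dois[0] if dois else None


def get_doi_try_except(paper_json):
    # Option 2: keep the original lookup but catch the failure.
    # IndexError is included in case 'doi' exists but is empty.
    try:
        return paper_json['doi'][0]
    except (KeyError, IndexError):
        return None


def process_papers(list_of_pmcids, data_dir, process_paper):
    # The caller would then skip papers that came back without a
    # DOI instead of aborting the whole run. process_paper is
    # passed in as a parameter here only to keep the sketch
    # self-contained; in the script it is a module-level function.
    papers = []
    for pmcid in list_of_pmcids:
        paper = process_paper(pmcid, data_dir)
        if paper is not None:
            papers.append(paper)
    return papers
```

Option 1 is probably the lighter touch for a single missing key; a try/except around the whole of process_paper might be better if folders can also be missing their XML/JSON entirely, since that already raises IOError there.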