parser: parse from simple-head too #358
dbae157 to
24eeb7e
Compare
|
A note related to this PR: the repo is not Python 3 compatible, meaning we can build it but the tests do not run, etc. Yet we release it with Python 3 -> https://github.com/inspirehep/hepcrawl/pull/358/files#diff-20e0e050358c0425896e7b9edb659fec4ed949bb350a35841c0d616e1971cc6bR22. What was the reason for that? Was it just to release it for now, since we only use it in Python 2 (inspire-next), and fix it in the future? cc @drjova
|
Also, a use case -> https://github.com/inspirehep/hepcrawl/pull/358/files#diff-c9b75f1849ee2f819f04cabf4463702f66d1c748fb8ef795ea2a3ea81d06309eR165. It fails on CI but succeeds locally with exactly the same setup. The issue seems to be here: https://github.com/inspirehep/hepcrawl/blob/master/hepcrawl/spiders/pos_spider.py#L112-L117. For some reason it is not extracted correctly on the CI. Could be the way we are generating the
|
Not directly related to
<ce:document-thread><ce:refers-to-document id="rd0010"><ce:pii>S0550-3213(22)00251-6</ce:pii><ce:doi>10.1016/j.nuclphysb.2022.115900</ce:doi></ce:refers-to-document></ce:document-thread>
It would be good to extract this DOI too, so that the erratum gets matched with the original publication; otherwise it will create new records in INSPIRE, which we don't want. So in this case, the DOIs should look like
{
  "dois": [
    {"material": "erratum", "value": "10.1016/j.nuclphysb.2022.115991"},
    {"material": "publication", "value": "10.1016/j.nuclphysb.2022.115900"}
  ]
}
(order doesn't matter)
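For illustration, the related-DOI extraction suggested in this comment could be sketched roughly as below. This is not the hepcrawl implementation: `build_dois` is a hypothetical helper, the wrapping `<root>` element and the `ce` namespace URI are assumptions made so the snippet parses on its own, and the erratum's own DOI is passed in as a parameter rather than read from the XML.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "ce" prefix; the real parser may strip namespaces instead.
CE_NS = {"ce": "http://www.elsevier.com/xml/common/dtd"}

# Toy document: the <root> wrapper is only here so the quoted snippet parses standalone.
SNIPPET = (
    '<root xmlns:ce="http://www.elsevier.com/xml/common/dtd">'
    '<ce:document-thread><ce:refers-to-document id="rd0010">'
    "<ce:pii>S0550-3213(22)00251-6</ce:pii>"
    "<ce:doi>10.1016/j.nuclphysb.2022.115900</ce:doi>"
    "</ce:refers-to-document></ce:document-thread>"
    "</root>"
)


def build_dois(root, own_doi, own_material="erratum"):
    """Hypothetical helper: the record's own DOI plus any DOIs it refers to."""
    dois = [{"material": own_material, "value": own_doi}]
    path = ".//ce:document-thread/ce:refers-to-document/ce:doi"
    for element in root.findall(path, CE_NS):
        value = (element.text or "").strip()
        # Skip empty values and DOIs that are already in the list.
        if value and all(doi["value"] != value for doi in dois):
            dois.append({"material": "publication", "value": value})
    return dois


dois = build_dois(ET.fromstring(SNIPPET), "10.1016/j.nuclphysb.2022.115991")
```

Tagging the referred-to DOI as `publication` follows the expected output quoted above; whether that material is always correct for every `ce:refers-to-document` entry is not something this sketch settles.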
|
Thanks @michamos, will create another ticket to handle these findings.
|
OK, but note that this should not go to prod without the related DOI extraction, as it will cause a mess.
|
But isn’t that the case already? Don’t we crawl Elsevier papers like that one?
|
We crawl them, but they don't get added because they don't pass the |
24eeb7e to
573379b
Compare
573379b to
efdc136
Compare
drjova
left a comment
Thanks @ioannistsanaktsidis, a few comments for clarification and formatting.
    ).extract()
    if not collaborations:
        collaborations = self.root.xpath(
            "./*/simple-head/author-group//collaboration/text/text()"
Why is the path here different from the one used for the abstract?
You mean from the line above? The only difference is that it tries to extract from simple-head if nothing is found in head. Or did I misunderstand the question?
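The head-then-simple-head fallback described in this reply can be sketched in isolation as follows. This is a minimal illustration, not the PR's code: it uses the standard library's `xml.etree.ElementTree` instead of the selector object behind `self.root`, `first_non_empty` is a hypothetical helper, the `head` path is assumed to mirror the `simple-head` one, and the toy XML is namespace-free.

```python
import xml.etree.ElementTree as ET

# Paths are tried in order; the "head" variant is an assumption mirroring the
# "simple-head" path quoted in the diff (ElementTree has no text() step, so
# the element's .text attribute is read instead).
COLLABORATION_PATHS = [
    "./*/head/author-group//collaboration/text",
    "./*/simple-head/author-group//collaboration/text",
]

# Toy erratum-like document that only has a simple-head (structure assumed).
SIMPLE_HEAD_XML = """
<doc>
  <simple-article>
    <simple-head>
      <author-group>
        <collaboration><text>Example Collaboration</text></collaboration>
      </author-group>
    </simple-head>
  </simple-article>
</doc>
"""


def first_non_empty(root, paths):
    """Hypothetical helper: texts matched by the first path that yields any."""
    for path in paths:
        values = [
            element.text.strip()
            for element in root.findall(path)
            if element.text and element.text.strip()
        ]
        if values:
            return values
    return []


collaborations = first_non_empty(ET.fromstring(SIMPLE_HEAD_XML), COLLABORATION_PATHS)
```

Keeping the paths in an ordered list means the same helper could serve other fields that need the same fallback, at the cost of hiding which location actually matched.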
efdc136 to
5db5649
Compare
michamos
left a comment
👍
simple-head too cern-sis/issues-inspire#429