Skip to content

Conversation

@ioannistsanaktsidis
Copy link
Contributor

@ioannistsanaktsidis ioannistsanaktsidis force-pushed the 429-parse-simple-head branch 22 times, most recently from dbae157 to 24eeb7e Compare May 16, 2025 12:57
@ioannistsanaktsidis
Copy link
Contributor Author

ioannistsanaktsidis commented May 16, 2025

A note related to this PR. The repo is not python 3 compatible, meaning we can build it but tests are not running etc... Yet we release it with python3 -> https://github.com/inspirehep/hepcrawl/pull/358/files#diff-20e0e050358c0425896e7b9edb659fec4ed949bb350a35841c0d616e1971cc6bR22. What was the reason for that? Was it just to release it for now as we are using it only in python2(inspire-next) and fix it in the future? cc @drjova

@ioannistsanaktsidis
Copy link
Contributor Author

ioannistsanaktsidis commented May 16, 2025

Also, usecase -> https://github.com/inspirehep/hepcrawl/pull/358/files#diff-c9b75f1849ee2f819f04cabf4463702f66d1c748fb8ef795ea2a3ea81d06309eR165. Fails on CI, succeeds locally with exactly the same setup. Issue seems to be here https://github.com/inspirehep/hepcrawl/blob/master/hepcrawl/spiders/pos_spider.py#L112-L117. For some reason it is not extracted correctly on the CI. Could be the way we are generating the fake_response_file -> https://github.com/inspirehep/hepcrawl/blob/master/hepcrawl/testlib/fixtures.py#L50-L56, but not sure. To be checked further as this is irrelevant to the changes of this PR

@michamos
Copy link
Contributor

Not directly related to simple-head, by I noticed two additional issues:

  • ORCIDs don't seem to be extracted
  • In the example file, the DOI of the original paper is mentioned in
<ce:document-thread><ce:refers-to-document id="rd0010"><ce:pii>S0550-3213(22)00251-6</ce:pii><ce:doi>10.1016/j.nuclphysb.2022.115900</ce:doi></ce:refers-to-document></ce:document-thread>

It would be good to extract this DOI too so the erratum gets matched with the original publication, otherwise it will create new records in INSPIRE which we don't want. So in this case, DOIs should look like

{
    "dois": [
        {"material": "erratum", "value": "10.1016/j.nuclphysb.2022.115991"},
        {"material": "publication", "value": "10.1016/j.nuclphysb.2022.115900"}
    ]
}

(order doesn't matter)

@ioannistsanaktsidis
Copy link
Contributor Author

Thanks @michamos , will create another ticket to handle these findings.

@michamos
Copy link
Contributor

OK, but note that this should not go to prod without the related DOI extraction, as it will cause a mess.

@ioannistsanaktsidis
Copy link
Contributor Author

But isn’t that the case already ? Don’t we crawl elsevier papers like that one ?

@michamos
Copy link
Contributor

We crawl them, but they don't get added because they don't pass the should_record_be_harvested() check due to the lack of title currently.

Copy link
Contributor

@drjova drjova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ioannistsanaktsidis few comment for clarification and formatting

).extract()
if not collaborations:
collaborations = self.root.xpath(
"./*/simple-head/author-group//collaboration/text/text()"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the path here is different from abstract?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean from the line above ? only difference is that it tries to extract from simple-head if nothing found on head . Or I misunderstood the question ?

Copy link
Contributor

@michamos michamos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ioannistsanaktsidis ioannistsanaktsidis merged commit 05f6f55 into inspirehep:master May 19, 2025
8 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants