parser: parse from simple-head too #358

ioannistsanaktsidis · 2025-05-15T07:12:09Z

ref Elsevier parser should extract metadata from simple-head too cern-sis/issues-inspire#429

ioannistsanaktsidis · 2025-05-16T13:08:42Z

A note related to this PR: the repo is not Python 3 compatible. We can build it, but the tests do not run. Yet we release it with Python 3 -> https://github.com/inspirehep/hepcrawl/pull/358/files#diff-20e0e050358c0425896e7b9edb659fec4ed949bb350a35841c0d616e1971cc6bR22. What was the reason for that? Was it just to release it for now, since we only use it under Python 2 (inspire-next), and fix it in the future? cc @drjova

ioannistsanaktsidis · 2025-05-16T14:07:28Z

Also, this use case -> https://github.com/inspirehep/hepcrawl/pull/358/files#diff-c9b75f1849ee2f819f04cabf4463702f66d1c748fb8ef795ea2a3ea81d06309eR165 fails on CI but succeeds locally with exactly the same setup. The issue seems to be here: https://github.com/inspirehep/hepcrawl/blob/master/hepcrawl/spiders/pos_spider.py#L112-L117. For some reason it is not extracted correctly on CI. It could be the way we generate the fake_response_file -> https://github.com/inspirehep/hepcrawl/blob/master/hepcrawl/testlib/fixtures.py#L50-L56, but I'm not sure. To be checked further, as this is unrelated to the changes in this PR.

michamos · 2025-05-16T14:41:44Z

Not directly related to simple-head, but I noticed two additional issues:

1. ORCIDs don't seem to be extracted.
2. In the example file, the DOI of the original paper is mentioned in

<ce:document-thread><ce:refers-to-document id="rd0010"><ce:pii>S0550-3213(22)00251-6</ce:pii><ce:doi>10.1016/j.nuclphysb.2022.115900</ce:doi></ce:refers-to-document></ce:document-thread>

It would be good to extract this DOI too so the erratum gets matched with the original publication; otherwise it will create new records in INSPIRE, which we don't want. So in this case, the DOIs should look like

{
    "dois": [
        {"material": "erratum", "value": "10.1016/j.nuclphysb.2022.115991"},
        {"material": "publication", "value": "10.1016/j.nuclphysb.2022.115900"}
    ]
}

(order doesn't matter)
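To make the suggestion above concrete, here is a minimal sketch of how the referred DOI could be picked up from `ce:document-thread` and merged with the erratum's own DOI. It is not the hepcrawl implementation: `build_dois` is a hypothetical helper, the snippet is parsed with the standard library instead of the parser's own selector, and the namespace URI bound to `ce` is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI bound to the "ce" prefix in Elsevier article XML.
CE_NS = {"ce": "http://www.elsevier.com/xml/common/dtd"}

# The document-thread element from the example file, with the namespace declared
# so that it can be parsed on its own.
SNIPPET = (
    '<ce:document-thread xmlns:ce="http://www.elsevier.com/xml/common/dtd">'
    '<ce:refers-to-document id="rd0010">'
    "<ce:pii>S0550-3213(22)00251-6</ce:pii>"
    "<ce:doi>10.1016/j.nuclphysb.2022.115900</ce:doi>"
    "</ce:refers-to-document>"
    "</ce:document-thread>"
)


def build_dois(own_doi, own_material, thread_xml):
    """Hypothetical helper: the article's own DOI plus every referred DOI."""
    dois = [{"material": own_material, "value": own_doi}]
    thread = ET.fromstring(thread_xml)
    # Every ce:doi directly under a ce:refers-to-document points at the
    # original publication that this document refers to.
    for node in thread.iterfind(".//ce:refers-to-document/ce:doi", CE_NS):
        value = (node.text or "").strip()
        if value:
            dois.append({"material": "publication", "value": value})
    return dois


dois = build_dois("10.1016/j.nuclphysb.2022.115991", "erratum", SNIPPET)
```

Here `dois` holds the same two entries as the expected output above. In the real parser the material of the article's own DOI would come from the record rather than being passed in by hand.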

ioannistsanaktsidis · 2025-05-16T15:32:31Z

Thanks @michamos, I will create another ticket to handle these findings.

michamos · 2025-05-16T15:35:45Z

OK, but note that this should not go to prod without the related DOI extraction, as it will cause a mess.

ioannistsanaktsidis · 2025-05-16T15:38:35Z

But isn't that already the case? Don't we crawl Elsevier papers like that one?

michamos · 2025-05-16T16:14:34Z

We crawl them, but they don't get added: they currently fail the should_record_be_harvested() check because they lack a title.

drjova

Thanks @ioannistsanaktsidis, a few comments for clarification and formatting.

tests/unit/test_parsers_elsevier.py

drjova · 2025-05-19T09:09:32Z

hepcrawl/parsers/elsevier.py

        ).extract()
+        if not collaborations:
+            collaborations = self.root.xpath(
+                "./*/simple-head/author-group//collaboration/text/text()"


Why is the path here different from the one for the abstract?

You mean from the line above? The only difference is that it tries to extract from simple-head if nothing is found in head. Or did I misunderstand the question?
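For readers following the thread, the fallback described in this reply can be sketched as below. This is an illustration under assumptions, not the code in `hepcrawl/parsers/elsevier.py`: the `head` path is assumed to mirror the `simple-head` path from the diff, the XML is assumed to have its namespaces already stripped, `first_non_empty` is a hypothetical helper, and the standard library is used in place of the parser's selector (so the trailing `/text()` step is replaced by reading `.text`).

```python
import xml.etree.ElementTree as ET

# Assumption: the primary path mirrors the simple-head path shown in the diff.
HEAD_PATH = "./*/head/author-group//collaboration/text"
SIMPLE_HEAD_PATH = "./*/simple-head/author-group//collaboration/text"

# Namespace-free sample in which the metadata lives only under simple-head.
SAMPLE = (
    "<doc><simple-article><simple-head><author-group>"
    "<collaboration><text>Example Collaboration</text></collaboration>"
    "</author-group></simple-head></simple-article></doc>"
)


def first_non_empty(root, paths):
    """Hypothetical helper: return the texts matched by the first path that yields any."""
    for path in paths:
        values = [
            element.text.strip()
            for element in root.findall(path)
            if element.text and element.text.strip()
        ]
        if values:
            return values
    return []


root = ET.fromstring(SAMPLE)
# head is tried first; simple-head is consulted only when head yields nothing.
collaborations = first_non_empty(root, [HEAD_PATH, SIMPLE_HEAD_PATH])
```

Keeping the paths in an ordered list makes the precedence explicit: a document with a populated `head` never reaches the `simple-head` lookup.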

* ref cern-sis/issues-inspire#429

michamos

👍

ioannistsanaktsidis force-pushed the 429-parse-simple-head branch 22 times, most recently from dbae157 to 24eeb7e Compare May 16, 2025 12:57

ioannistsanaktsidis force-pushed the 429-parse-simple-head branch from 24eeb7e to 573379b Compare May 19, 2025 08:52

ioannistsanaktsidis force-pushed the 429-parse-simple-head branch from 573379b to efdc136 Compare May 19, 2025 08:54

drjova requested changes May 19, 2025


parser: parse from simple-head too

5db5649

* ref cern-sis/issues-inspire#429

ioannistsanaktsidis force-pushed the 429-parse-simple-head branch from efdc136 to 5db5649 Compare May 19, 2025 09:13

michamos approved these changes May 19, 2025


drjova approved these changes May 19, 2025


ioannistsanaktsidis merged commit 05f6f55 into inspirehep:master May 19, 2025
8 of 16 checks passed
