Add (example of) PubMed parser #68

J535D165 · 2025-03-10T14:46:56Z

This is a draft PR (that includes the work in PR #66) to figure out what (API) changes might be needed to make a PubMed implementation straightforward. The parser passes the tests if enforce_list_type is set to true and PUBMED_LIST_TYPE_TAGS is passed. However, the PubMed documentation seems to say that list-typed tags should not be enforced.

This PR adds PubMed support to rispy. Because PubMed's RIS-like files differ from standard RIS in several respects, we had to extend the internals in various places and expand the API slightly.

Multiline RIS

PubMed and WOS implement a hard-wrapped RIS format. Currently, rispy simply ignores the hard wrap in all cases. This results in problems for cases like the following:

TY  - JOUR
N2  - This is a rare case, but it is relevant for RIS-like formats like PubMed
N1  - first line
      second line
N1  - first line
      second line
ER  -

The current implementation of rispy fails on this example. Do you expect notes=["first line", "second line", "first line", "second line"] or notes=["first line second line", "first line second line"]? The second is probably what you expect. To make this work, I introduced the undo_wrapping argument.
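To make the two readings concrete, here is a minimal, self-contained sketch of the idea. It is not the rispy implementation: `parse_notes` and its regular expression are hypothetical, and only the `undo_wrapping` name is taken from this PR.

```python
import re

# Hypothetical pattern for a RIS tag line such as "N1  - first line";
# the content part is optional so that a bare "ER  -" also matches.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -(?: (.*))?$")


def parse_notes(text, undo_wrapping=False):
    """Collect the values of all N1 tags in a single RIS record.

    With undo_wrapping=True, an indented continuation line is joined to the
    previous value; otherwise it is stored as a separate value.
    """
    notes = []
    in_note = False
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag, content = match.group(1), match.group(2) or ""
            in_note = tag == "N1"
            if in_note:
                notes.append(content)
        elif in_note and line.strip():
            # Continuation line that belongs to the most recent N1 tag.
            if undo_wrapping:
                notes[-1] = f"{notes[-1]} {line.strip()}"
            else:
                notes.append(line.strip())
    return notes


record = """TY  - JOUR
N1  - first line
      second line
N1  - first line
      second line
ER  -
"""

print(parse_notes(record))
print(parse_notes(record, undo_wrapping=True))
```

Without unwrapping, the boundary between the two N1 entries is lost; with unwrapping, each N1 entry stays a single value.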

List types in PubMed

As far as I know, the official PubMed documentation has no clear overview of list tags. Therefore, I exported some large files from PubMed and searched for all repeated tags. The following tags were found:

PUBMED_LIST_TYPE_TAGS = {
    "AD",
    "AID",
    "AU",
    "AUID",
    "CIN",
    "CON",
    "CN",
    "RN",
    "EDAT",
    "EIN",
    "FAU",
    "FIR",
    "GR",
    "IR",
    "IRAD",
    "IS",
    "LA",
    "LID",
    "MHDA",
    "MH",
    "OT",
    "PHST",
    "PST",
    "PT",
    "PMC",
    "SI",
    "SO",
    "SB",
}
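A search like this can be sketched in a few lines. The sketch below is illustrative, not the script used for this PR: `find_repeated_tags` is a hypothetical helper, the sample text is made up, and it assumes records are separated by blank lines and that tag lines look like `AU  - value` or `PMID- value`.

```python
import re
from collections import Counter

# Assumed layout of a PubMed line: a tag of two to four capital letters,
# padded with spaces, followed by "- " and the value.
TAG_LINE = re.compile(r"^([A-Z]{2,4}) *- ")


def find_repeated_tags(text):
    """Return the set of tags that occur more than once within any record."""
    repeated = set()
    counts = Counter()
    for line in text.splitlines() + [""]:  # the trailing "" flushes the last record
        if not line.strip():
            repeated.update(tag for tag, n in counts.items() if n > 1)
            counts.clear()
            continue
        match = TAG_LINE.match(line)
        if match:  # continuation lines start with spaces and do not match
            counts[match.group(1)] += 1
    return repeated


sample = """PMID- 1
TI  - A title that is long enough
      to wrap onto a second line
AU  - Doe J
AU  - Roe R

PMID- 2
TI  - Another title
AU  - Doe J
MH  - Humans
MH  - Animals
"""

print(sorted(find_repeated_tags(sample)))
```

A tag that never repeats in the exported files would be missed by this approach, so the resulting set depends on the sample used.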

Benchmark

The PubMed support has a slight negative effect on rispy's overall performance. However, RIS files with many multiline tags are parsed clearly faster.

This PR

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               277.7997 (1.0)      342.6343 (1.0)      297.7747 (1.0)      25.8081 (1.0)      288.1745 (1.0)      23.2345 (1.0)           1;1  3.3582 (1.0)           5           1
test_benchmark_rispy_large_multiline     307.3277 (1.11)     389.9151 (1.14)     334.9115 (1.12)     31.9259 (1.24)     325.2276 (1.13)     27.3331 (1.18)          1;1  2.9859 (0.89)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

main branch

---------------------------------------------------------------------------------------------- benchmark: 2 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean              StdDev              Median                 IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               276.5595 (1.0)      356.2912 (1.0)      295.8040 (1.0)       33.9246 (1.0)      281.4575 (1.0)       22.7273 (1.0)           1;1  3.3806 (1.0)           5           1
test_benchmark_rispy_large_multiline     310.7737 (1.12)     582.1600 (1.63)     400.5045 (1.35)     111.7629 (3.29)     358.1399 (1.27)     149.2522 (6.57)          1;0  2.4969 (0.74)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You can see the improvement in the multiline case: the mean drops from about 400.5 ms on main to about 334.9 ms with this PR.
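For reference, the relative change in the mean times can be worked out from the two benchmark tables above. The numbers are copied from the Mean column; `relative_change` is just a helper name for this sketch.

```python
def relative_change(pr_ms, main_ms):
    """Percentage change of this PR relative to main (negative means faster)."""
    return (pr_ms - main_ms) / main_ms * 100


# Mean times in ms, copied from the benchmark tables above.
mean_ms = {
    "test_benchmark_rispy_large": {"this_pr": 297.7747, "main": 295.8040},
    "test_benchmark_rispy_large_multiline": {"this_pr": 334.9115, "main": 400.5045},
}

for name, times in mean_ms.items():
    change = relative_change(times["this_pr"], times["main"])
    print(f"{name}: {times['main']:.1f} ms -> {times['this_pr']:.1f} ms ({change:+.1f}%)")
```

Each benchmark ran for only 5 rounds and the multiline run on main has a StdDev of about 111.8 ms, so these percentages are indicative rather than precise.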

Future work

I will propose to move some code around in another PR. I think we will benefit from a restructuring.

Copilot

Pull Request Overview

This PR introduces PubMed support along with enhanced handling for multi‐line and multi‐tag RIS files. Key changes include new tests for multiline and PubMed entries, the addition of a PubMedParser subclass and related configuration in the parser and config modules, and adjustments to tag handling (including unknown tags) in both parser and writer.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

File	Description
tests/test_parser.py	Added new tests for multiline, multi‐tag, and PubMed files
tests/data/*.ris	Updated example RIS files to reflect newline handling and PubMed examples
rispy/writer.py	Adjusted unknown tag handling logic with a branch for the "unknown_tag" case
rispy/parser.py	Introduced undo_wrapping support and a new PubMedParser subclass
rispy/config.py	Added PubMed-specific configurations and removed the "UK" mapping
rispy/__init__.py	Exported the new PubMedParser
README.md	Updated tag mappings to remove the "unknown_tag" reference

Comments suppressed due to low confidence (1)

rispy/config.py:80

The removal of the 'UK' mapping in the configuration affects unknown tag handling. Update related documentation to explain the new behavior and ensure that downstream modules are aligned with this change.

"UK": "unknown_tag",

Copilot · 2025-05-27T09:21:55Z

rispy/writer.py

            except KeyError:
-                warnings.warn(UserWarning(f"label `{label}` not exported"), stacklevel=2)
-                continue
+                if label.lower() == "unknown_tag":


Ensure that the constant for unknown tags (self.UNKNOWN_TAG) is defined and maintained consistently across modules. Consider extracting this value into a shared configuration to avoid discrepancies between parser and writer behavior.

Suggested change

-                if label.lower() == "unknown_tag":
+                if label.lower() == self.UNKNOWN_TAG:

Copilot · 2025-05-27T09:21:55Z

rispy/parser.py

+        tuple
+            Tuple containing the tag and the content of the tag.
+        """
+


Consider adding a check for the minimum line length before slicing to prevent potential index errors when processing unexpectedly short lines.

Suggested change

+        if len(line) < 5:
+            return (None, line.strip())
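A minimal sketch of such a guard follows. `split_tag_line` is a hypothetical stand-in for the parser method, and the slice positions assume the standard `XX  - content` layout. Note that Python string slicing never raises IndexError on short input, so the guard mainly keeps a short line from being misread as a tag.

```python
def split_tag_line(line):
    """Split a RIS line such as "TY  - JOUR" into (tag, content).

    Hypothetical helper: the 5-character minimum mirrors the suggested
    change above ("ER  -" is the shortest complete tag line).
    """
    if len(line) < 5:
        # Too short to hold "XX  -": treat the whole line as content.
        return (None, line.strip())
    return (line[0:2], line[6:].strip())


# Slicing past the end of a string yields "" instead of raising IndexError.
print(repr("ER"[6:]))
print(split_tag_line("TY  - JOUR"))
print(split_tag_line("ER  -"))
print(split_tag_line("abc"))
```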

J535D165 · 2025-06-12T08:57:27Z

@shapiromatron, do you have some time to look into this?

shapiromatron

LGTM - a few minor changes requested, but this looks great

shapiromatron · 2025-08-19T13:58:05Z

rispy/parser.py

-                record[name] = defaultdict(list)
-            record[name][tag].append(content)
-
+            record.setdefault("unknown_tag", defaultdict(list))[tag].append(content)


Since we removed UK, make unknown_tag a constant (UNKNOWN_TAG); right now it's a magic string used in a few places.
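Something along these lines would do it. This is a sketch only: the helper names `add_unknown` and `is_unknown_label` are made up for illustration, and where the constant should live (for example rispy/config.py) is left open.

```python
from collections import defaultdict

# Single source of truth for the key that collects unmapped tags.
UNKNOWN_TAG = "unknown_tag"


def add_unknown(record, tag, content):
    """Parser side: store content under record[UNKNOWN_TAG][tag]."""
    record.setdefault(UNKNOWN_TAG, defaultdict(list))[tag].append(content)


def is_unknown_label(label):
    """Writer side: the same check, reusing the constant instead of a literal."""
    return label.lower() == UNKNOWN_TAG


record = {}
add_unknown(record, "JP", "CRISPR")
add_unknown(record, "JP", "Genomics")
print(dict(record[UNKNOWN_TAG]))
print(is_unknown_label("unknown_tag"))
```

With the constant shared, the parser and the writer cannot drift apart on the key name.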

shapiromatron · 2025-08-19T13:58:59Z

tests/data/example_endnote.ris

@@ -0,0 +1,36 @@
+TY  - JOUR
+AB  - This is a test article about the mysterious world of EndNote libraries. Legend says if you cite yourself three times in a row, EndNote will grant you a coffee break.


I need to cite myself a few more times then :D

shapiromatron · 2025-08-19T13:59:27Z

.gitignore


 # created from tests
 export.ris
+


remove from this PR?

J535D165 mentioned this pull request Mar 10, 2025

Add support for PubMed RIS files asreview/asreview#1825

Open

J535D165 force-pushed the pubmed-ris branch from 7a3746b to 1dda827 Compare May 22, 2025 19:56

Add Pubmed parser

85aa692

J535D165 force-pushed the pubmed-ris branch from 1dda827 to 85aa692 Compare May 22, 2025 20:02

J535D165 added 15 commits May 22, 2025 22:13

Add more tests and improve syntax explainability

a1a749d

Update parser for multiline support with or without wrapping

57fabc2

Update tests and coverage for multiline cases

c70b69d

Order config of list tags

3e45361

Update list-like tags for PubMed

e0c42fb

Format file

c7ec531

Fix line too long

d1c87b8

Merge remote-tracking branch 'upstream/main' into pubmed-ris

e18f4ff

Simplify unknown tag handling

ba4c010

Add endnote test with special multiline case

7ad57e5

Add advanced multiline and add tests

bf7bee7

Ignore benchmark related files and internal banchmarks

5262da2

Use set instead of list

4c8b27d

Fix writer for unknown tag

eac47af

Fix doctest

cc8ff15

J535D165 marked this pull request as ready for review May 27, 2025 09:21

J535D165 requested review from Copilot and shapiromatron May 27, 2025 09:21

Copilot AI reviewed May 27, 2025

shapiromatron self-assigned this May 28, 2025

shapiromatron requested changes Aug 19, 2025
