@J535D165 J535D165 commented Mar 10, 2025

This is a draft PR (that includes the work in PR #66) to figure out what (API) changes might be needed to make a PubMed implementation straightforward. The parser passes the tests if enforce_list_type is set to true and PUBMED_LIST_TYPE_TAGS is passed. However, the PubMed documentation seems to say that list-typed tags should not be enforced.

This PR adds PubMed support to Rispy. Because PubMed's RIS-like files differ from standard RIS in several respects, we had to extend the internals in a number of places. We also needed to expand the API slightly.

Multiline RIS

PubMed and WOS implement a hard-wrapped RIS format. Currently, rispy simply ignores the hard wrap in all cases. This results in problems for cases like the following:

TY  - JOUR
N2  - This is a rare case, but it is relevant for RIS-like formats like PubMed
N1  - first line
      second line
N1  - first line
      second line
ER  -

The current implementation of Rispy fails on this example. Do you expect notes=["first line", "second line", "first line", "second line"] or notes=["first line second line", "first line second line"]? The second is probably what you expect. To make this work, I introduced the undo_wrapping argument.
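To make the two possible outcomes concrete, here is a minimal standalone sketch of joining hard-wrapped continuation lines. It is not the code in this PR: only the name undo_wrapping comes from the description above, while parse_record, TAG_LINE, and the assumed line layout are hypothetical and chosen for illustration.

```python
import re

# Assumed tag-line shape: two-character tag, two spaces, hyphen, optional content.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -(?: (.*))?$")


def parse_record(lines, undo_wrapping=True):
    """Collect (tag, content) pairs, optionally joining continuation lines."""
    pairs = []
    for line in lines:
        match = TAG_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2) or ""))
        elif pairs and line.strip():
            tag, content = pairs[-1]
            if undo_wrapping:
                # Continuation line: glue it onto the previous tag's content.
                pairs[-1] = (tag, f"{content} {line.strip()}")
            else:
                # Treat the wrapped line as a separate value of the same tag.
                pairs.append((tag, line.strip()))
    return pairs


example = [
    "TY  - JOUR",
    "N1  - first line",
    "      second line",
    "N1  - first line",
    "      second line",
    "ER  -",
]

notes = [content for tag, content in parse_record(example) if tag == "N1"]
print(notes)  # ['first line second line', 'first line second line']
```

With undo_wrapping=False the same input yields four separate notes, which is the first of the two outcomes in the question above.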

List types in PubMed

There is no clear overview of list tags for PubMed in the official documentation, as far as I know. Therefore, I exported some large files from PubMed and searched for all tags that occur more than once within a record. The following tags were found:

PUBMED_LIST_TYPE_TAGS = {
    "AD",
    "AID",
    "AU",
    "AUID",
    "CIN",
    "CON",
    "CN",
    "RN",
    "EDAT",
    "EIN",
    "FAU",
    "FIR",
    "GR",
    "IR",
    "IRAD",
    "IS",
    "LA",
    "LID",
    "MHDA",
    "MH",
    "OT",
    "PHST",
    "PST",
    "PT",
    "PMC",
    "SI",
    "SO",
    "SB",
}
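The search for repeated tags can be sketched roughly as follows. This is an illustration of the approach, not the script that was actually used; find_repeated_tags, PUBMED_TAG, and the sample text are hypothetical, and the sketch assumes that PubMed tags are two to four uppercase letters padded to four characters and that records are separated by blank lines.

```python
import re
from collections import Counter

# Assumed PubMed line shape: tag of up to four characters, padded to four, then "- ".
PUBMED_TAG = re.compile(r"^([A-Z]{2,4}) {0,2}- ")


def find_repeated_tags(text):
    """Return the tags that occur more than once within at least one record."""
    repeated = set()
    # Assumption: records are separated by one or more blank lines.
    for record in re.split(r"\n\s*\n", text.strip()):
        counts = Counter(
            m.group(1) for line in record.splitlines() if (m := PUBMED_TAG.match(line))
        )
        repeated.update(tag for tag, n in counts.items() if n > 1)
    return repeated


sample = """PMID- 1
AU  - Doe J
AU  - Roe R
TI  - A title that is wrapped
      over two lines

PMID- 2
AU  - Poe E
MH  - Humans
MH  - Animals
"""

print(sorted(find_repeated_tags(sample)))  # ['AU', 'MH']
```

A tag that only ever appears once per record in the exported files would be missed by this kind of search, so the resulting set depends on how representative the exports are.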

Benchmark

The PubMed support slightly reduces the overall performance of rispy. However, RIS files with many multiline tags are parsed clearly faster.

This PR

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               277.7997 (1.0)      342.6343 (1.0)      297.7747 (1.0)      25.8081 (1.0)      288.1745 (1.0)      23.2345 (1.0)           1;1  3.3582 (1.0)           5           1
test_benchmark_rispy_large_multiline     307.3277 (1.11)     389.9151 (1.14)     334.9115 (1.12)     31.9259 (1.24)     325.2276 (1.13)     27.3331 (1.18)          1;1  2.9859 (0.89)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

main branch

---------------------------------------------------------------------------------------------- benchmark: 2 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean              StdDev              Median                 IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               276.5595 (1.0)      356.2912 (1.0)      295.8040 (1.0)       33.9246 (1.0)      281.4575 (1.0)       22.7273 (1.0)           1;1  3.3806 (1.0)           5           1
test_benchmark_rispy_large_multiline     310.7737 (1.12)     582.1600 (1.63)     400.5045 (1.35)     111.7629 (3.29)     358.1399 (1.27)     149.2522 (6.57)          1;0  2.4969 (0.74)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You can see the improvements in the multi-lined use case.
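As a rough quantification, the relative change for test_benchmark_rispy_large_multiline can be computed from the mean and median values in the two tables. The helper reduction_pct is hypothetical, and with only 5 rounds and a large StdDev on the main branch the result should be read as indicative only.

```python
# Mean and median times (ms) for test_benchmark_rispy_large_multiline,
# copied from the two benchmark tables.
main_branch = {"mean": 400.5045, "median": 358.1399}
this_pr = {"mean": 334.9115, "median": 325.2276}


def reduction_pct(before, after):
    """Percentage of time saved relative to the 'before' measurement."""
    return 100 * (before - after) / before


for stat in ("mean", "median"):
    saved = reduction_pct(main_branch[stat], this_pr[stat])
    print(f"{stat}: {saved:.1f}% less time")
```

On these numbers the multiline case takes about 16% less time by mean and about 9% less by median.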

Future work

I will propose to move some code around in another PR. I think we will benefit from a restructuring.

@J535D165 J535D165 marked this pull request as ready for review May 27, 2025 09:21
@J535D165 J535D165 requested review from Copilot and shapiromatron May 27, 2025 09:21
Copilot AI left a comment

Pull Request Overview

This PR introduces PubMed support along with enhanced handling for multi‐line and multi‐tag RIS files. Key changes include new tests for multiline and PubMed entries, the addition of a PubMedParser subclass and related configuration in the parser and config modules, and adjustments to tag handling (including unknown tags) in both parser and writer.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

File                   Description
tests/test_parser.py   Added new tests for multiline, multi-tag, and PubMed files
tests/data/*.ris       Updated example RIS files to reflect newline handling and PubMed examples
rispy/writer.py        Adjusted unknown tag handling logic with a branch for the "unknown_tag" case
rispy/parser.py        Introduced undo_wrapping support and a new PubMedParser subclass
rispy/config.py        Added PubMed-specific configurations and removed the "UK" mapping
rispy/__init__.py      Exported the new PubMedParser
README.md              Updated tag mappings to remove the "unknown_tag" reference

Comments suppressed due to low confidence (1)

rispy/config.py:80

  • The removal of the 'UK' mapping in the configuration affects unknown tag handling. Update related documentation to explain the new behavior and ensure that downstream modules are aligned with this change.
"UK": "unknown_tag",

except KeyError:
    warnings.warn(UserWarning(f"label `{label}` not exported"), stacklevel=2)
    continue
if label.lower() == "unknown_tag":
Copilot AI May 27, 2025

Ensure that the constant for unknown tags (self.UNKNOWN_TAG) is defined and maintained consistently across modules. Consider extracting this value into a shared configuration to avoid discrepancies between parser and writer behavior.

Suggested change
- if label.lower() == "unknown_tag":
+ if label.lower() == self.UNKNOWN_TAG:

tuple
Tuple containing the tag and the content of the tag.
"""

Copilot AI May 27, 2025

Consider adding a check for the minimum line length before slicing to prevent potential index errors when processing unexpectedly short lines.

Suggested change
if len(line) < 5:
    return (None, line.strip())
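
For context, a minimal sketch of what such a guard does in practice. The function split_tag_line and its fixed-column layout (tag in the first two characters, content from index 6) are assumptions made for illustration, not the parser's actual code. Note that string slicing in Python does not raise IndexError on short input, so the guard mainly makes the short-line case explicit instead of returning a truncated tag.

```python
def split_tag_line(line):
    """Split a line into (tag, content); return (None, text) for short lines."""
    # Hypothetical layout: tag in the first two characters, content from index 6.
    if len(line) < 5:
        return (None, line.strip())
    return (line[0:2], line[6:].strip())


print(split_tag_line("AU  - Doe J"))  # ('AU', 'Doe J')
print(split_tag_line("ER  -"))        # ('ER', '')
print(split_tag_line("ER"))           # (None, 'ER')
print(repr("ER"[6:]))                 # '' (slicing past the end does not raise)
```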

@shapiromatron shapiromatron self-assigned this May 28, 2025
@J535D165

@shapiromatron, do you have some time to look into this?

@shapiromatron shapiromatron left a comment

LGTM - a few minor changes requested, but this looks great

record[name] = defaultdict(list)
record[name][tag].append(content)

record.setdefault("unknown_tag", defaultdict(list))[tag].append(content)

since we removed UK, make unknown_tag a variable constant UNKNOWN_TAG since right now it's a magic string used in a few places
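
One way this could look, as a minimal sketch: the constant name UNKNOWN_TAG follows the suggestion above, while the helper add_unknown and the example tag values are hypothetical and only show the literal being replaced in the setdefault line.

```python
from collections import defaultdict

# Hypothetical shared constant; the diff above uses the literal "unknown_tag".
UNKNOWN_TAG = "unknown_tag"


def add_unknown(record, tag, content):
    """Store content under record[UNKNOWN_TAG][tag], creating containers as needed."""
    record.setdefault(UNKNOWN_TAG, defaultdict(list))[tag].append(content)


record = {}
add_unknown(record, "JP", "CRISPR")
add_unknown(record, "JP", "Genetics")
print(dict(record[UNKNOWN_TAG]))  # {'JP': ['CRISPR', 'Genetics']}
```

Defining the constant in one module and importing it in both parser and writer would keep the two in sync.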

@@ -0,0 +1,36 @@
TY - JOUR
AB - This is a test article about the mysterious world of EndNote libraries. Legend says if you cite yourself three times in a row, EndNote will grant you a coffee break.

I need to cite myself a few more times then :D


# created from tests
export.ris

remove from this PR?
