-
Notifications
You must be signed in to change notification settings - Fork 20
Add (example of) PubMed parser #68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces PubMed support along with enhanced handling for multi‐line and multi‐tag RIS files. Key changes include new tests for multiline and PubMed entries, the addition of a PubMedParser subclass and related configuration in the parser and config modules, and adjustments to tag handling (including unknown tags) in both parser and writer.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_parser.py | Added new tests for multiline, multi‐tag, and PubMed files |
| tests/data/*.ris | Updated example RIS files to reflect newline handling and PubMed examples |
| rispy/writer.py | Adjusted unknown tag handling logic with a branch for the "unknown_tag" case |
| rispy/parser.py | Introduced undo_wrapping support and a new PubMedParser subclass |
| rispy/config.py | Added PubMed-specific configurations and removed the "UK" mapping |
| rispy/init.py | Exported the new PubMedParser |
| README.md | Updated tag mappings to remove the "unknown_tag" reference |
Comments suppressed due to low confidence (1)
rispy/config.py:80
- The removal of the 'UK' mapping in the configuration affects unknown tag handling. Update related documentation to explain the new behavior and ensure that downstream modules are aligned with this change.
"UK": "unknown_tag",
| except KeyError: | ||
| warnings.warn(UserWarning(f"label `{label}` not exported"), stacklevel=2) | ||
| continue | ||
| if label.lower() == "unknown_tag": |
Copilot
AI
May 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure that the constant for unknown tags (self.UNKNOWN_TAG) is defined and maintained consistently across modules. Consider extracting this value into a shared configuration to avoid discrepancies between parser and writer behavior.
| if label.lower() == "unknown_tag": | |
| if label.lower() == self.UNKNOWN_TAG: |
| tuple | ||
| Tuple containing the tag and the content of the tag. | ||
| """ | ||
|
|
Copilot
AI
May 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a check for the minimum line length before slicing to prevent potential index errors when processing unexpectedly short lines.
| if len(line) < 5: | |
| return (None, line.strip()) |
|
@shapiromatron, do you have some time to look into this? |
shapiromatron
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - a few minor changes requested, but this looks great
| record[name] = defaultdict(list) | ||
| record[name][tag].append(content) | ||
|
|
||
| record.setdefault("unknown_tag", defaultdict(list))[tag].append(content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we removed UK, make unknown_tag a variable constant UNKNOWN_TAG since right now it's a magic string used in a few places
| @@ -0,0 +1,36 @@ | |||
| TY - JOUR | |||
| AB - This is a test article about the mysterious world of EndNote libraries. Legend says if you cite yourself three times in a row, EndNote will grant you a coffee break. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to cite myself a few more times then :D
|
|
||
| # created from tests | ||
| export.ris | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove from this PR?
This is a draft PR (that includes the work in PR #66) to figure out what (API) changes might be needed to make a PubMed implementation straightforward. The parser passes the tests ifenforce_list_typeis set to true andPUBMED_LIST_TYPE_TAGSis passed. However, the PubMed documentation seems to say that list-typed tags should not be enforced.This PR adds PubMed support to Rispy. As PubMed RIS-like files differ in various aspects, we had to extend the internals in various places. It was also needed to expand the API a tiny bit.
Multiline RIS
Pubmed and WOS implement a hard-wrapped RIS format. Currently, rispy simply ignores the hard wrap for all cases. The results in problems for cases like the following:
The current implementation of Rispy fails on this example. Do you expect
notes=["first line", "second line", "first line", "second line"]ornotes=["first line second line", "first line second line"]. The output of the second might be what you expect. To make this work, I introducedundo_wrappingargument.List types in Pubmed
There is no clear overview of list tags for PubMed on their official documentation, afaik. Therefore, I exported some large files from PubMed and searched for all repetitive tags. The following tags where found:
Benchmark
The PubMed support affects the overall performance of rispy slightly (negatively). However, RIS files with many multiline tags are clearly more efficient.
This PR
main branch
You can see the improvements in the multi-lined use case.
Future work
I will propose to move some code around in another PR. I think we will benefit from a restructuring.