@iacopy
Contributor

@iacopy iacopy commented Jun 4, 2025

This commit fixes a regression introduced in v0.4.0 where the parsing of <DeleteCitation> elements was inadvertently removed during the refactor to an iterator-based model for parse_medline_xml. The original functionality present in v0.3.1 for identifying deleted articles is now restored, adapted for the memory-efficient iterator design.

Changes include:

  • Added logic to process <DeleteCitation> elements within the iterparse loop. For each deleted citation, a dictionary {'pmid': <pmid_value>, 'delete': True} is yielded, consistent with the updated function docstring.
  • Ensured element.clear() is called for both <DeleteCitation> and <PubmedArticle> elements to prevent memory leaks, upholding the memory-saving goals of the iterator pattern.
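
For illustration, the pattern described in the two points above can be sketched roughly as follows. This is a simplified, self-contained sketch and not the actual diff: it uses the standard library's xml.etree.ElementTree, the function name iter_medline is hypothetical, and the article branch is reduced to the PMID only. The 'delete': False value on regular articles is an assumption made for symmetry with the deleted records.

```python
import io
import xml.etree.ElementTree as ET


def iter_medline(source):
    """Yield one dict per article or deleted PMID (hypothetical helper)."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "PubmedArticle":
            # The real parser extracts many more fields; only the PMID is shown.
            # 'delete': False is an assumption for this sketch.
            yield {"pmid": element.findtext("MedlineCitation/PMID"), "delete": False}
            element.clear()  # drop the processed subtree's children and text
        elif element.tag == "DeleteCitation":
            # A single <DeleteCitation> block can list several PMIDs.
            for pmid in element.findall("PMID"):
                yield {"pmid": pmid.text, "delete": True}
            element.clear()


# Tiny made-up sample, not real MEDLINE data.
sample = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>111</PMID></MedlineCitation>
  </PubmedArticle>
  <DeleteCitation><PMID>222</PMID><PMID>333</PMID></DeleteCitation>
</PubmedArticleSet>"""

records = list(iter_medline(io.BytesIO(sample)))
```

Because records are yielded one at a time and each finished element is cleared, a caller can stream a large file and, for example, collect the PMIDs flagged with 'delete': True without holding the whole tree in memory.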

This change brings the parser's behavior in line with its documentation regarding deleted articles and rectifies the omission from the previous refactor.

Fixes #166 (addresses the core bug of not handling deleted articles)
Related to #165 (clarifies behavior for the 'delete' flag)
Addresses regression from v0.4.0 (restores DeleteCitation parsing)

Given the nature of this fix, it's recommended that this be part of version 0.5.2.

@Michael-E-Rose
Collaborator

Hi & thanks for this PR! Much appreciated.

Can you confirm that it works on your end? The automated tests failed, but that might not be your fault.

@iacopy
Contributor Author

iacopy commented Jun 6, 2025

@Michael-E-Rose Hi, thanks, you're welcome. Yes, all the tests pass locally, including test_pii.
I also had some integration tests on deleted articles that started failing when I upgraded from 0.3.1 to 0.5.1, and now they pass.
Note that I didn't touch the pubmed_parser version number in setup.py because I don't know the policy regarding that.

@Michael-E-Rose Michael-E-Rose merged commit 1559018 into titipata:master Jun 6, 2025
0 of 2 checks passed

Development

Successfully merging this pull request may close these issues.

Bug Report: Inconsistency in Handling Deleted Articles in MEDLINE Parser
