Skip to content

Conversation

@hherb
Copy link
Owner

@hherb hherb commented Dec 29, 2025

Add Europe PMC PDF Bulk Downloader

Summary

  • Add EuropePMCPDFDownloader class for bulk downloading PDF files from Europe PMC Open Access FTP
  • Add europe_pmc_pdf_cli.py CLI with commands: list, download, verify, extract, status, estimate, find
  • Create user documentation at doc/users/europe_pmc_pdf_guide.md

Features

  • Resumable downloads: Progress saved to JSON state file; interrupted downloads resume automatically
  • Download verification: Tar.gz archive integrity verification after each download
  • PDF extraction: Automatic extraction to PMCID-based subdirectory structure (e.g., pdf/13000/13900-13999/PMC13901.pdf)
  • Rate limiting: Configurable delay between downloads (default 60s) to be polite to Europe PMC servers
  • PMCID filtering: Download only specific PMCID ranges with --range option
  • Jitter on retries: Random jitter added to retry delays to prevent thundering herd effects (Golden Rule Create database setup and import test script #22)
  • Input validation: PMCID validation with helpful error messages (Golden Rule Add Claude Code GitHub Workflow #1)

CLI Usage

# List available packages
uv run python europe_pmc_pdf_cli.py list

# Download with rate limiting
uv run python europe_pmc_pdf_cli.py download --output-dir ~/europepmc_pdf --delay 120

# Download specific PMCID range
uv run python europe_pmc_pdf_cli.py download --range 1-1000000

# Find a specific PDF
uv run python europe_pmc_pdf_cli.py find --pmcid PMC123456

Files Changed
src/bmlibrarian/importers/europe_pmc_pdf_downloader.py (new)
src/bmlibrarian/importers/__init__.py (updated exports)
europe_pmc_pdf_cli.py (new)
doc/users/europe_pmc_pdf_guide.md (new)
CLAUDE.md (updated with new CLI commands)
Test Plan
 Run uv run python europe_pmc_pdf_cli.py list to verify package listing
 Run uv run python europe_pmc_pdf_cli.py download --limit 1 to test single package download
 Run uv run python europe_pmc_pdf_cli.py verify to test verification
 Run uv run python europe_pmc_pdf_cli.py find --pmcid PMC123456 to test PDF lookup
 Verify state persistence by interrupting and resuming download

- Create EuropePMCPDFDownloader class for bulk PDF downloads from Europe PMC
- Add europe_pmc_pdf_cli.py CLI with list, download, verify, extract, status,
  estimate, and find commands
- Support resumable downloads with state persistence in JSON
- Extract PDFs to PMCID-based subdirectory structure
- Configure rate limiting and PMCID range filtering
- Update CLAUDE.md with new CLI commands
- Add jitter to retry delays (Rule #22) to prevent thundering herd
- Replace magic numbers with named constants
- Add input validation to get_pdf_path method (Rule #1)
- Remove unused gzip import
- Add user documentation (Rule #12)
- Improve error handling in CLI for invalid PMCID input
@claude
Copy link

claude bot commented Dec 29, 2025

Claude finished @hherb's task —— View job


Code Review: Europe PMC PDF Bulk Downloader

I'm reviewing this PR for code quality, best practices, potential issues, and security concerns. Here's my analysis:

Review Checklist

  • Analyze main downloader implementation (europe_pmc_pdf_downloader.py)
  • Review CLI implementation (europe_pmc_pdf_cli.py)
  • Check documentation quality and completeness
  • Evaluate code against BMLibrarian's golden rules
  • Assess security considerations
  • Review performance implications
  • Check error handling and edge cases
  • Evaluate test coverage needs

Detailed Code Review

🔍 Critical Security Issues

1. Path Traversal Vulnerability (europe_pmc_pdf_downloader.py:619-627)

  • Risk: HIGH - No validation of extracted file paths from tar archive
  • Malicious archives could extract files outside intended directories using ../ paths
  • Recommendation: Add path validation before extraction:
# Before extracting, validate the path
safe_path = os.path.normpath(member.name)
if safe_path.startswith('/') or '..' in safe_path:
    logger.warning(f"Skipping unsafe path: {member.name}")
    continue

2. Regex Denial of Service (ReDoS) (europe_pmc_pdf_downloader.py:286-287, 300-301)

  • Risk: MEDIUM - Complex regex patterns on untrusted server response
  • Could cause CPU exhaustion with crafted HTML responses
  • Recommendation: Add timeout limits and simpler patterns

🛡️ Security Concerns

3. Contact Information Exposure (europe_pmc_pdf_downloader.py:214)

🐛 Bugs and Logic Issues

4. Infinite Loop Risk (europe_pmc_pdf_downloader.py:344-347)

  • Range overlap logic could match all ranges if start > end in configuration
  • Fix: Validate range bounds in constructor

5. Missing Resource Cleanup (europe_pmc_pdf_downloader.py:212-215)

  • HTTP session created but never explicitly closed
  • Could lead to resource leaks in long-running processes
  • Fix: Add context manager or explicit cleanup method

6. Integer Overflow Risk (europe_pmc_pdf_downloader.py:654-657)

  • PMCID subdirectory calculation doesn't handle extremely large PMCIDs
  • Fix: Add bounds checking for PMCID values

📊 Performance Issues

7. Memory Usage (europe_pmc_pdf_downloader.py:276-277)

  • Loading entire FTP directory HTML into memory for regex parsing
  • Could fail with very large directory listings
  • Recommendation: Stream parsing or pagination

8. Progress Logging Inefficiency (europe_pmc_pdf_downloader.py:537-542)

  • Modulo operation on every chunk could be expensive for large files
  • Fix: Use byte counters instead of modulo

🎯 Code Quality Issues

9. CLI Hardcoded Paths (`europe_pmc_pdf_cli.py:474, 496, etc.)

10. Exception Handling (europe_pmc_pdf_cli.py:150-152)

  • Catching all exceptions but logging with exc_info=True could leak sensitive data
  • Fix: More specific exception handling

Positive Aspects

  • Excellent documentation - Comprehensive docstrings and user guide
  • State persistence - Robust JSON serialization with resume capability
  • Rate limiting - Respectful to servers with configurable delays
  • Error handling - Generally good retry logic with jitter
  • Input validation - Good PMCID format checking
  • Type hints - Comprehensive type annotations throughout
  • Dataclasses - Clean data modeling with serialization

📝 Minor Improvements

11. Magic Numbers: Several constants could be better organized (lines 58-65)
12. Error Messages: Could provide more specific guidance to users
13. Configuration: Some settings hardcoded that could be configurable

🧪 Test Coverage Needs

The implementation is missing unit tests for:

  • Path traversal attack scenarios
  • PMCID validation edge cases
  • Archive extraction error handling
  • Range validation logic
  • HTTP timeout/retry scenarios

📋 Summary

This is a well-architected bulk downloader with excellent documentation and thoughtful design. However, it has a critical path traversal vulnerability that must be addressed before merging. The code demonstrates good adherence to BMLibrarian's coding standards overall, but needs security hardening.

Priority Fixes:

  1. CRITICAL: Add path traversal protection
  2. HIGH: Fix contact info exposure
  3. MEDIUM: Add resource cleanup
  4. LOW: Address performance and code quality issues

Recommendation: Address security issues before merging, consider adding basic test coverage for security-critical paths.

@hherb hherb merged commit d333aaf into master Dec 29, 2025
4 checks passed
hherb pushed a commit that referenced this pull request Dec 29, 2025
…F downloader

Security fixes:
- Add path traversal protection in tar extraction (_is_safe_tar_member)
- Add ReDoS protection with bounded regex patterns and match limits
- Remove hardcoded contact email from User-Agent (now configurable)
- Add PMCID bounds validation to prevent integer overflow
- Add range validation to prevent infinite loop in overlap checking

Code quality improvements:
- Add context manager support for HTTP session cleanup
- Improve progress logging efficiency using byte thresholds
- Move magic numbers and hardcoded paths to named constants
- Add specific exception handling in CLI (network, permission, filesystem)
- Organize configuration constants with clear documentation

New features:
- Add contact_email parameter for configurable User-Agent
- Add lazy session creation for better resource management
- Add close() method for explicit resource cleanup

Tests:
- Add comprehensive unit tests for security fixes
- Cover path traversal, PMCID validation, range checking, session cleanup

Addresses issues from PR #226 security review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants