Build bulk PDF downloader for EuropePMC #226
- Create EuropePMCPDFDownloader class for bulk PDF downloads from Europe PMC
- Add europe_pmc_pdf_cli.py CLI with list, download, verify, extract, status, estimate, and find commands
- Support resumable downloads with state persistence in JSON
- Extract PDFs to PMCID-based subdirectory structure
- Configure rate limiting and PMCID range filtering
- Update CLAUDE.md with new CLI commands
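The PMCID-based subdirectory structure can be illustrated with a small sketch. It is inferred from the single example path given in this PR (pdf/13000/13900-13999/PMC13901.pdf), so the bucketing rule (thousands directory, then a hundreds range) is an assumption, and `pmcid_to_relpath` is a hypothetical helper name, not a function from the PR.

```python
import re
from pathlib import PurePosixPath

# Hypothetical helper; the layout rule is an assumption inferred from the
# example path pdf/13000/13900-13999/PMC13901.pdf.
_PMCID_RE = re.compile(r"PMC(\d{1,12})")  # bounded digit count


def pmcid_to_relpath(pmcid: str) -> PurePosixPath:
    """Map a PMCID such as 'PMC13901' to its assumed relative PDF path."""
    match = _PMCID_RE.fullmatch(pmcid.strip().upper())
    if match is None:
        raise ValueError(f"not a valid PMCID: {pmcid!r}")
    number = int(match.group(1))
    thousand = (number // 1000) * 1000      # e.g. 13901 -> 13000
    hundred_low = (number // 100) * 100     # e.g. 13901 -> 13900
    hundred_high = hundred_low + 99         # e.g. 13999
    return PurePosixPath(
        "pdf",
        str(thousand),
        f"{hundred_low}-{hundred_high}",
        f"PMC{number}.pdf",
    )


print(pmcid_to_relpath("PMC13901"))
```

Under these assumptions the example PMCID reproduces the path quoted in the PR description; other IDs may be bucketed differently by the real implementation.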
Claude finished @hherb's task. Code Review: Europe PMC PDF Bulk Downloader
…F downloader

Security fixes:
- Add path traversal protection in tar extraction (_is_safe_tar_member)
- Add ReDoS protection with bounded regex patterns and match limits
- Remove hardcoded contact email from User-Agent (now configurable)
- Add PMCID bounds validation to prevent integer overflow
- Add range validation to prevent infinite loop in overlap checking

Code quality improvements:
- Add context manager support for HTTP session cleanup
- Improve progress logging efficiency using byte thresholds
- Move magic numbers and hardcoded paths to named constants
- Add specific exception handling in CLI (network, permission, filesystem)
- Organize configuration constants with clear documentation

New features:
- Add contact_email parameter for configurable User-Agent
- Add lazy session creation for better resource management
- Add close() method for explicit resource cleanup

Tests:
- Add comprehensive unit tests for security fixes
- Cover path traversal, PMCID validation, range checking, session cleanup

Addresses issues from PR #226 security review.
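For readers unfamiliar with the path traversal fix, the following is a minimal sketch of what a check like _is_safe_tar_member might do. The PR does not show the function body, so the signature and logic here are assumptions: the sketch rejects links and device entries, then confirms the resolved target stays inside the extraction directory.

```python
import os
import tarfile


def is_safe_tar_member(member: tarfile.TarInfo, dest_dir: str) -> bool:
    """Return True if extracting `member` would stay inside `dest_dir`.

    Illustrative only; the PR's _is_safe_tar_member may differ.
    """
    # Links and device nodes can point outside the destination, so skip them.
    if member.issym() or member.islnk() or member.isdev():
        return False
    dest_root = os.path.realpath(dest_dir)
    # An absolute member name replaces dest_root in the join, and ".."
    # segments are collapsed by realpath, so both cases fail the check below.
    target = os.path.realpath(os.path.join(dest_root, member.name))
    try:
        return os.path.commonpath([dest_root, target]) == dest_root
    except ValueError:
        # Raised on Windows when the paths are on different drives.
        return False
```

A caller would filter `tar.getmembers()` through this check and extract only the members that pass, logging or skipping the rest.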

Add Europe PMC PDF Bulk Downloader
Summary
- EuropePMCPDFDownloader class for bulk downloading PDF files from Europe PMC Open Access FTP
- europe_pmc_pdf_cli.py CLI with commands: list, download, verify, extract, status, estimate, find
- doc/users/europe_pmc_pdf_guide.md
Features
- … pdf/13000/13900-13999/PMC13901.pdf)
- … --range option
CLI Usage