A command-line tool that searches PubMed, fetches article abstracts, and splits them into token-limited batch files ready for LLM ingestion pipelines.
- Features
- Requirements
- Installation
- Credentials Setup
- Usage
- Options
- Output Structure
- Examples
- Security Notes
- Troubleshooting
- Full PubMed search with date range filtering and optional review-article restriction
- Token-accurate batching (includes file overhead in count) using the cl100k_base tokenizer
- Streaming pipeline: RAM usage stays proportional to fetch batch size, not total results
- Automatic retry with exponential backoff on transient NCBI network failures
- Oversized article detection — single articles exceeding the token limit are isolated rather than dropped
- Null-byte sanitisation to prevent corrupt output files
- Concurrent-write collision detection with timestamp-suffixed fallback filenames
- Structured summary and log files written alongside batch output
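The concurrent-write collision detection listed above could work along the following lines. This is a minimal sketch rather than the tool's actual code: `open_unique` is a hypothetical helper, and the timestamp-plus-PID suffix format is an assumption.

```python
import os
import time
from pathlib import Path


def open_unique(path: Path):
    """Create `path` exclusively; on collision use a timestamp-suffixed name.

    Hypothetical helper: the name and the suffix format are assumptions.
    """
    try:
        # Mode "x" raises FileExistsError if another writer created the file first.
        return path.open("x", encoding="utf-8")
    except FileExistsError:
        stamp = time.strftime("%Y%m%d_%H%M%S")
        fallback = path.with_name(f"{path.stem}_{stamp}_{os.getpid()}{path.suffix}")
        # Still exclusive, so a second collision raises instead of overwriting.
        return fallback.open("x", encoding="utf-8")
```

Exclusive-create mode makes the existence check and the file creation a single step, which avoids the race a separate "does it exist?" test would have.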
- Python 3.10+
- An NCBI account with an API key
biopython
tiktoken
git clone https://github.com/your-org/pubmed-fetcher.git
cd pubmed-fetcher
pip install biopython tiktoken
Never hardcode credentials. The tool requires two environment variables and will fail loudly at startup if either is missing.
export ENTREZ_EMAIL=you@example.com
export ENTREZ_API_KEY=your_ncbi_api_key
To make these permanent, add the lines above to your ~/.bashrc, ~/.zshrc, or equivalent shell profile.
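A fail-loudly startup check of the kind described here might look like the sketch below. The helper names `require_env` and `load_credentials` are hypothetical; only the two variable names and the error type come from this README.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly."""
    value = os.environ.get(name, "").strip()
    if not value:
        # EnvironmentError is an alias of OSError in Python 3.
        raise EnvironmentError(f"Required environment variable '{name}' is not set")
    return value


def load_credentials() -> tuple[str, str]:
    """Read both NCBI credentials; raises before any network call is made."""
    return require_env("ENTREZ_EMAIL"), require_env("ENTREZ_API_KEY")
```

With Biopython, the two returned values would typically be assigned to `Entrez.email` and `Entrez.api_key`.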
To get an NCBI API key:
- Sign in at https://www.ncbi.nlm.nih.gov/account/
- Go to Account Settings → API Key Management
- Click Create an API Key
Without an API key, NCBI throttles requests to 3/second. With a key the limit rises to 10/second. The tool's built-in rate delay (0.34s) keeps requests just under the unauthenticated limit; with a key you can safely lower it toward 0.1s if needed.
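The arithmetic behind the delay is simply one second divided by the allowed request rate: 1/3 ≈ 0.333s without a key (hence 0.34s) and 1/10 = 0.1s with one. A minimal sketch of a throttle built on that idea follows; `min_delay` and `Throttle` are illustrative names, not part of the tool.

```python
import time


def min_delay(requests_per_second: float) -> float:
    """Smallest pause between requests that stays within the given rate."""
    return 1.0 / requests_per_second


class Throttle:
    """Sleep just long enough to keep `delay` seconds between calls."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._last = float("-inf")  # so the first call never waits

    def wait(self) -> None:
        remaining = self.delay - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `wait()` before each request spaces requests at least `delay` seconds apart, regardless of how long the request itself took.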
python pubmed_fetcher.py \
--query '"Computational Biology/history"[MAJR]' \
--years 40 \
--max-results 2500 \
--token-limit 10000 \
--output-dir ./compbio_hist \
--review-only
Run python pubmed_fetcher.py --help for the full option reference.
from pubmed_fetcher import PubMedFetcher
fetcher = PubMedFetcher(
query='"Computational Biology/history"[MAJR]',
years=40,
max_results=2500,
token_limit=10_000,
output_dir="./compbio_hist",
review_only=True,
)
fetcher.run()

| Flag | Type | Default | Description |
|---|---|---|---|
| --query | str | (required) | PubMed search query. Supports full MeSH syntax. |
| --years | int | 5 | How many years back from today to search. |
| --max-results | int | 100 | Maximum articles to retrieve. Capped at NCBI's server limit of 9,999. |
| --token-limit | int | 8000 | Token ceiling per batch file. |
| --output-dir | str | ./pubmed_output | Directory where all output files are written. Created if it doesn't exist. |
| --review-only | flag | False | Restrict results to review articles only. |
output_dir/
├── pubmed_batch_1.txt # Articles 1–N (≤ token_limit tokens)
├── pubmed_batch_2.txt # Articles N+1–M
├── ...
├── summary.txt # Fetch statistics and batch distribution
└── fetch_log.txt # Full timestamped run log
Each entry within a batch file follows this structure:
PMID: 12345678
TITLE: Example Article Title
ABSTRACT:
This is the abstract text...
----------------------------------------
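For downstream tools that need the entries back as records, the layout above can be parsed line by line. The sketch below assumes that a separator is any line consisting only of ten or more hyphens and that no abstract contains such a line; `parse_batch` is a hypothetical helper, not something the tool ships.

```python
import re
from typing import Iterator

# Assumption: a separator is a line made up only of hyphens (10 or more).
SEPARATOR = re.compile(r"^-{10,}\s*$")


def parse_batch(text: str) -> Iterator[dict[str, str]]:
    """Yield {'pmid', 'title', 'abstract'} dicts from the text of a batch file."""
    entry: dict[str, str] = {}
    abstract_lines: list[str] = []
    in_abstract = False
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if entry:
                entry["abstract"] = "\n".join(abstract_lines).strip()
                yield entry
            entry, abstract_lines, in_abstract = {}, [], False
        elif in_abstract:
            abstract_lines.append(line)
        elif line.startswith("PMID:"):
            entry["pmid"] = line[len("PMID:"):].strip()
        elif line.startswith("TITLE:"):
            entry["title"] = line[len("TITLE:"):].strip()
        elif line.startswith("ABSTRACT:"):
            in_abstract = True
            rest = line[len("ABSTRACT:"):].strip()
            if rest:
                abstract_lines.append(rest)
    if entry:
        # Tolerate a final entry that lacks a trailing separator.
        entry["abstract"] = "\n".join(abstract_lines).strip()
        yield entry
```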
summary.txt includes:
- Search query and parameters used
- Total articles fetched and skipped
- Total words and tokens across all articles
- Per-batch breakdown (article count, token count)
- Full list of output files
Fetch recent RNA-seq method reviews:
python pubmed_fetcher.py \
--query '"RNA-Seq/methods"[MAJR] AND "Computational Biology/methods"[MeSH]' \
--years 3 \
--max-results 200 \
--review-only \
--output-dir ./rnaseq_reviews
Fetch all genomics papers from a specific institution:
python pubmed_fetcher.py \
--query '"Genomics"[MeSH] AND "Some Research Institute"[AD]' \
--years 10 \
--max-results 500 \
--output-dir ./some_genomics
Large historical fetch with bigger batches:
python pubmed_fetcher.py \
--query '"Human Genome Project"[MAJR]' \
--years 35 \
--max-results 9999 \
--token-limit 50000 \
--output-dir ./hgp_history
- Credentials are read exclusively from environment variables. The tool raises EnvironmentError at startup if either is unset; there are no hardcoded fallbacks.
- Do not commit .env files or shell profiles containing your API key to version control. Add them to .gitignore.
- If you suspect your API key has been exposed, rotate it immediately at https://www.ncbi.nlm.nih.gov/account/.
EnvironmentError: Required environment variable 'ENTREZ_EMAIL' is not set
Export the variable in your shell before running; see Credentials Setup.
WARNING: PubMed has X total results but only Y were requested
Increase --max-results. Note the hard ceiling is 9,999 per NCBI's API.
Batch files are smaller than expected
Token counts include file overhead (PMID label, TITLE label, ABSTRACT label, separator line) in addition to the article text itself. This is intentional to ensure batches stay within the limit when loaded by downstream tools.
WARNING: PMID XXXXXXXX alone uses N tokens (limit=M); writing as its own oversized batch
A single article's abstract is longer than --token-limit. Either increase the limit or note that this batch will exceed it when consumed downstream.
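The batching behaviour described in the last two entries can be pictured as a greedy grouping over fully formatted entries (labels and separator included), with any oversized entry emitted as its own batch. The sketch below is one plausible shape for it, not the tool's actual code; `batch_entries` is a hypothetical name, and the token counter is passed in so the example runs without extra dependencies.

```python
from typing import Callable, Iterable, Iterator


def batch_entries(
    entries: Iterable[str],
    token_limit: int,
    count_tokens: Callable[[str], int],
) -> Iterator[list[str]]:
    """Greedily group formatted entries so each batch stays within token_limit.

    An entry that exceeds the limit on its own is yielded as a one-item
    batch instead of being dropped.
    """
    batch: list[str] = []
    used = 0
    for entry in entries:
        cost = count_tokens(entry)  # counted on the full entry, overhead included
        if batch and used + cost > token_limit:
            yield batch
            batch, used = [], 0
        batch.append(entry)
        used += cost
        if used > token_limit:
            # Only reachable when a single entry is oversized: isolate it.
            yield batch
            batch, used = [], 0
    if batch:
        yield batch
```

With tiktoken installed, a matching counter could be built from `enc = tiktoken.get_encoding("cl100k_base")` and passed as `lambda s: len(enc.encode(s))`. Because the function is a generator, it only ever holds one batch in memory at a time.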
Network errors / partial fetches
The tool retries each NCBI request up to 3 times with exponential backoff. If all retries fail, the affected batch of up to 50 articles is skipped and logged. Check fetch_log.txt for details.
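A retry loop with exponential backoff, as described above, might be sketched as follows. The assumptions here are that "up to 3 times" means three attempts in total, that the base delay is one second, and that transient failures surface as `OSError` (which `urllib.error.URLError` and socket errors derive from). `with_retries` is a hypothetical helper.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,          # assumption: three attempts in total
    base_delay: float = 1.0,    # assumption: first wait is one second
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run `call`, waiting base_delay * 2**n seconds after the n-th failure.

    Returns None when every attempt fails, so the caller can log and skip.
    """
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                return None
            sleep(base_delay * 2 ** attempt)
    return None
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.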
This project is licensed under CC BY-NC 4.0 - see the LICENSE file for details.