Automated verification of APA-style references using CrossRef and PubMed
A production-grade Python tool for automatically verifying academic bibliographies. Supports journal articles, books, translations, and ancient texts; performs fuzzy title matching; retrieves DOIs; and generates detailed verification reports for peer-review workflows.
- Install dependencies
  ```
  pip install python-docx pandas requests urllib3
  ```
- Prepare your folder
  ```
  MyManuscript/
  ├── bibliography.docx
  └── verify_bibliography_production.py
  ```
- Configure email (line 26 in script)
  ```python
  EMAIL = "your.email@uw.edu"
  ```
- Run verification
  ```
  cd MyManuscript
  python verify_bibliography_production.py
  ```
- Review outputs
  - `verification_log.txt` → Start here (human-readable summary)
  - `verification_report.csv` → Full metadata for archiving
  - `verification_for_R.csv` → Statistical analysis in R
  - `extraction_failures.txt` → Debug log (if needed)
- Fix and re-run
  - Edit `bibliography.docx` to correct flagged references
  - Re-run the script (outputs overwrite automatically)
Requirements for bibliography.docx:
- APA format
- Each reference as a separate paragraph
- Must be `.docx` format (not `.doc`)
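Because each reference must be its own paragraph, the extraction step can read candidate references straight from the document. A minimal sketch using python-docx (the script's actual extraction applies additional pattern checks on top of this):

```python
from docx import Document

# Read one candidate reference per non-empty paragraph.
doc = Document("bibliography.docx")
references = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
print(f"Found {len(references)} candidate references")
```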
- ✅ Verifies references against CrossRef and PubMed
- ✅ Detects reference types: journal articles, books, classics/translations, ancient texts, in-press items
- ✅ Fuzzy title matching with type-specific thresholds (0.85 for journals, 0.75 for books)
- ✅ Automated DOI lookup and metadata extraction
- ✅ Unicode normalization for accented names
- ✅ Year matching with ±2-year tolerance
- ✅ Handles classic editions with "(Original work published...)" notation
- ✅ Generates CSV, R-ready CSV, and human-readable logs
- ✅ Graceful API failure handling with exponential backoff
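The matching behavior listed above can be approximated with standard-library tools. A minimal sketch (illustrative only, not the script's exact implementation; the thresholds and year tolerance are the values stated in the feature list):

```python
import unicodedata
from difflib import SequenceMatcher

# Type-specific similarity thresholds (values from the feature list above)
TITLE_THRESHOLDS = {"journal": 0.85, "book": 0.75}
YEAR_TOLERANCE = 2  # accept a +/- 2-year discrepancy

def normalize(text: str) -> str:
    """Strip accents and case so 'Müller' matches 'Muller'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower().strip()

def titles_match(ref_title: str, candidate_title: str, ref_type: str = "journal") -> bool:
    """Fuzzy title comparison with a type-specific threshold."""
    score = SequenceMatcher(None, normalize(ref_title), normalize(candidate_title)).ratio()
    return score >= TITLE_THRESHOLDS.get(ref_type, 0.85)

def years_match(ref_year: int, candidate_year: int) -> bool:
    """Tolerate small year differences (e.g., online-first vs. print year)."""
    return abs(ref_year - candidate_year) <= YEAR_TOLERANCE
```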
Accurate bibliographic references are crucial for scientific integrity and discoverability. The Bibliography Verification Tool automates reference validation against authoritative databases (CrossRef, PubMed), minimizing manual errors and streamlining the workflow for researchers, reviewers, and editors.
| File | Purpose |
|---|---|
| `verification_log.txt` | Human-readable summary with flagged items requiring attention |
| `verification_report.csv` | Complete metadata for all processed references (archival) |
| `verification_for_R.csv` | Boolean flags optimized for statistical analysis in R |
| `extraction_failures.txt` | Debug log for extraction pattern failures (if any) |
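The CSV outputs can also be inspected programmatically, for example with pandas; column names vary, so check the header row of your own output before filtering (the commented filter is hypothetical):

```python
import pandas as pd

report = pd.read_csv("verification_report.csv")
print(report.columns.tolist())   # inspect the actual column names first
print(report.head())
# Example filter (column and status names are hypothetical -- adapt to your header):
# flagged = report[report["status"] != "VERIFIED"]
```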
Use this text in your manuscript's methods section:
References were verified against CrossRef and PubMed using the Bibliography Verification Tool v1.0 (Balakrishnan, 2025), implementing type-specific fuzzy title matching (thresholds: 0.85 for journal articles, 0.75 for books), Unicode-normalized author matching, and automated DOI lookup. Ancient texts (pre-1800) were excluded from automated verification.
Clone the repository:
```
git clone https://github.com/pvsundar/bibliography-verification-tool
cd bibliography-verification-tool
```
Install dependencies:
```
pip install python-docx pandas requests urllib3
```
Or use requirements.txt:
```
pip install -r requirements.txt
```
Comprehensive guides included in the repository:
- `PRODUCTION_SETUP_GUIDE.md` → Detailed setup and configuration
- `QUICK_REFERENCE.md` → Fast lookup for common tasks
- `DEPLOYMENT_CHECKLIST.md` → Pre-publication verification steps
- `CITATION_ATTRIBUTION_GUIDE.md` → How to cite this tool
- `analysis/analyze_verification_results.R` → 10+ pre-built R analysis functions
If you use this tool in your research, please cite:
APA 7th Edition:
Balakrishnan, P. V. (Sundar). (2025). Bibliography Verification Tool v1.0:
Automated reference verification against CrossRef and PubMed (Version 1.0.0)
[Software]. GitHub. https://github.com/pvsundar/bibliography-verification-tool
Zenodo. https://doi.org/10.5281/zenodo.17622390
Balakrishnan, P. V. (Sundar). (2025). Bibliography Verification Tool:
Automated reference verification against CrossRef and PubMed (Version 1.0.1)
[Computer software]. Zenodo. https://doi.org/10.5281/zenodo.17622390
BibTeX:
```bibtex
@software{Balakrishnan2025_BVT,
  author    = {Balakrishnan, P. V. (Sundar)},
  title     = {Bibliography Verification Tool},
  subtitle  = {Automated reference verification against CrossRef and PubMed},
  year      = {2025},
  version   = {1.0.1},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17622390},
  url       = {https://doi.org/10.5281/zenodo.17622390},
}
```

In methods sections:
References were verified using Bibliography Verification Tool v1.0
(Balakrishnan, 2025), which implements fuzzy title matching (thresholds:
0.85 for journal articles, 0.75 for books), author verification with
Unicode normalization, and automated DOI lookup against CrossRef and
PubMed databases.
This software paper describes:
- Statement of Need: Addressed above and expanded in the JOSS paper
- Software Architecture: Four-stage pipeline (extraction, query, matching, reporting)
- State of the Field: Comparison with citation managers, API libraries, and manual methods
- Target Audience: Academic authors, journal editors, meta-researchers
- Novelty: First turnkey solution integrating extraction, validation, and reproducible reporting for reference verification
Repository completeness checklist:
- ✅ Working code with comprehensive error handling
- ✅ Installation instructions (requirements.txt)
- ✅ Usage examples with sample data
- ✅ Documentation folder with user guides
- ✅ R analysis integration
- ✅ MIT License
- ✅ CITATION.cff with ORCID
- ✅ Community standards (README, documentation)
JOSS submission information will be added upon submission.
Issue: "FileNotFoundError: bibliography.docx"
- Ensure file is in the same directory as the script
- Check filename matches exactly (case-sensitive on Linux/Mac)
- Verify the file format is `.docx` (not `.doc`)
Issue: Many references flagged as "NOT_FOUND_IN_DATABASES"
- Very recent publications (CrossRef indexing lag ~2 weeks)
- Specialized/regional journals not indexed in CrossRef
- Check `extraction_failures.txt` for extraction issues
Issue: "LOW_MATCH_CONFIDENCE" on books
- Expected behavior (books often have subtitle variations)
- Verify author and year match
- Scores 50-75 are typically acceptable for books
Issue: API rate limiting (429 errors)
- Script uses exponential backoff (automatic retry)
- Wait 1-2 hours if issue persists
- For large bibliographies (100+ refs), expect 20-60 minutes runtime
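The retry behavior follows roughly this pattern (a generic sketch of exponential backoff, not the script's exact code):

```python
import time
import requests

def get_with_backoff(url: str, params: dict, max_retries: int = 5) -> requests.Response:
    """Retry a GET request with exponential backoff on rate limiting (HTTP 429)."""
    delay = 1.0
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code != 429:
            return response
        time.sleep(delay)   # wait before retrying
        delay *= 2          # double the wait each time
    response.raise_for_status()
    return response
```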
Enable detailed diagnostics:
```python
DEBUG_MODE = True  # In verify_bibliography_production.py
```
This outputs:
- Extraction details for each reference
- API responses
- Step-by-step match calculations
CrossRef API:
- Rate limit: Polite (1 request/second)
- Coverage: 130M+ scholarly publications
- Documentation: https://www.crossref.org/documentation/retrieve-metadata/rest-api/
PubMed API:
- Rate limit: 3 requests/second
- Coverage: 35M+ biomedical citations
- Documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
Both APIs are free for academic use. Proper email headers are configured automatically.
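For manual spot-checks, a single "polite pool" CrossRef lookup looks roughly like this (the `mailto` parameter identifies you to CrossRef; the script adds it automatically, and the example query title is arbitrary):

```python
import requests

EMAIL = "your.email@uw.edu"   # same address configured in the script
params = {
    "query.bibliographic": "Judgment under uncertainty: Heuristics and biases",
    "rows": 1,
    "mailto": EMAIL,          # joins CrossRef's polite pool
}
resp = requests.get("https://api.crossref.org/works", params=params, timeout=30)
top = resp.json()["message"]["items"][0]
print(top.get("DOI"), top.get("title", ["<no title>"])[0])
```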
| Bibliography Size | Expected Runtime | Processing Rate |
|---|---|---|
| 10-30 references | 1-5 minutes | ~0.3 ref/sec |
| 30-100 references | 5-20 minutes | ~0.3 ref/sec |
| 100-300 references | 20-90 minutes | ~0.3 ref/sec |
Rate limited to respect API guidelines. For very large bibliographies (300+ refs), consider running overnight.
Sample bibliography included:
```
python verify_bibliography_production.py --test
```
This verifies a test bibliography with known edge cases:
- Journal articles with and without DOIs
- Books with subtitle variations
- Classic editions with original publication years
- Ancient texts (pre-1800)
- In-press items
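The classic-edition case, for instance, requires pulling two years out of one entry; the extraction can be sketched with a pattern along these lines (an illustrative regex on a schematic APA reference, not necessarily the one the script uses):

```python
import re

ref = ("Author, A. A. (1994). Title of the classic work "
       "(B. Translator, Trans.). Publisher. (Original work published 1812)")

edition_year = re.search(r"\((\d{4})\)", ref)                          # year of this edition
original_year = re.search(r"\(Original work published (\d{4})\)", ref)  # year of the original
print(edition_year.group(1))                                # 1994
print(original_year.group(1) if original_year else None)   # 1812
```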
What the tool accesses:
- Local `.docx` file only
- Public CrossRef and PubMed APIs
What the tool does NOT do:
- Does not upload your bibliography to external servers
- Does not store data beyond local output files
- Does not require authentication or login
Your data stays on your machine. API queries contain only titles/authors for matching, not your full manuscript.
MIT License
Copyright (c) 2025 P. V. (Sundar) Balakrishnan
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
P. V. (Sundar) Balakrishnan
Professor of Marketing Strategy & Analytics
University of Washington Bothell
📧 Email: sundar@uw.edu
🔗 ORCID: 0000-0002-2856-5543
🐙 GitHub: @pvsundar
Contributions are welcome! Here's how you can help:
- Report bugs: Open an issue with reproduction steps
- Suggest features: Open an issue describing the enhancement
- Submit fixes: Fork the repo, make changes, submit a pull request
- Improve documentation: Help make guides clearer
Please ensure:
- Code follows existing style
- Tests pass (if applicable)
- Documentation is updated
- Commit messages are descriptive
This tool builds upon:
- CrossRef REST API for scholarly metadata
- PubMed E-utilities for biomedical literature
- python-docx for Word document parsing
- pandas for data manipulation
- R tidyverse for statistical analysis
Special thanks to the open-source community for these foundational tools.
v1.0.0 (November 2025)
- Production release
- Full CrossRef and PubMed integration
- Reference-type-specific matching thresholds
- Unicode normalization
- Classic edition handling
- R analysis integration
- Comprehensive documentation
If you find this tool useful, please star the repository to help others discover it!
Repository: https://github.com/pvsundar/bibliography-verification-tool
Version: 1.0.0
Status: Production Ready ✅
Last Updated: November 2025
Maintained: Yes, actively maintained
Built with ❤️ for the academic research community