A simple, powerful Node.js tool that converts PDF documents to clean, structured markdown using Google's Gemini 2.0 Flash API. Perfect for digitizing documents, creating documentation, or preparing content for further processing.
- Simple & Fast - Convert PDFs to markdown in seconds
- Batch Processing - Handle single files or entire folders
- AI-Powered - Uses Gemini 2.0 Flash for intelligent text extraction
- Structure Preservation - Maintains headers, tables, and formatting
- Cost-Effective - File uploads are free, minimal processing costs
- Secure - Files auto-delete after 48 hours
- Cross-Platform - Works on macOS, Windows, and Linux
- Node.js 18 or higher
- Google API key for Gemini
# Clone the repository
git clone https://github.com/josh-janse/pdf-to-markdown-extractor.git
cd pdf-to-markdown-extractor
# Install dependencies
npm install
- Visit Google AI Studio
- Sign in with your Google account
- Click "Get API Key" in the sidebar
- Create a new API key and copy it
# Copy the template and add your API key
cp .env.template .env
# Edit .env and add your API key
echo "GOOGLE_API_KEY=your_actual_api_key_here" > .env
# Run the diagnostic to verify everything works
node diagnose.js
You should see:
All checks passed! The tool should work.
node extract.js document.pdf
node extract.js file1.pdf file2.pdf file3.pdf
# Process all PDFs in a folder
node extract.js ./documents/
# Process all PDFs in current directory
node extract.js ./
# Convert a research paper
node extract.js research-paper.pdf
# Process all invoices
node extract.js ./invoices/
# Convert presentation slides
node extract.js presentation.pdf
- Markdown files are saved to the ./output/ directory
- Each PDF generates a corresponding .md file
- Original document structure is preserved
- Tables are converted to markdown format
- Headers and formatting are maintained
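The mapping from input PDFs to output files can be sketched as below. This is a minimal illustration, not the code in extract.js: the helper name `planOutputs` is hypothetical, and it assumes that output names come from the PDF's base name (as in the quarterly-report.pdf example in this README) and that the extension match is case-insensitive.

```javascript
// Hypothetical helper (not part of extract.js): keep only PDF files and
// pair each one with the markdown path it would be written to.
function planOutputs(fileNames, outputDir = './output') {
  const dir = outputDir.replace(/\/+$/, ''); // drop trailing slashes
  return fileNames
    .filter((name) => /\.pdf$/i.test(name)) // assumption: case-insensitive
    .map((name) => {
      // Take the last path segment, accepting both / and \ separators.
      const base = name.split(/[\\/]/).pop();
      // Swap the .pdf extension for .md.
      const stem = base.replace(/\.pdf$/i, '');
      return { input: name, output: `${dir}/${stem}.md` };
    });
}

// notes.txt is skipped; the two PDFs map to ./output/jan.md and ./output/Report.md.
console.log(planOutputs(['./invoices/jan.pdf', 'notes.txt', 'Report.PDF']));
```

Keeping the mapping in a pure function like this makes it easy to preview where files will land before any API call is made.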
Free Tier: Free of charge for both input and output (with rate limits)
Paid Tier (per 1M tokens):
- Input: $0.10 (text/image/video)
- Output: $0.40
Real-world costs for typical PDFs:
- Small document (5 pages): ~$0.002-0.005
- Medium document (20 pages): ~$0.01-0.02
- Large document (100 pages): ~$0.05-0.10
Important: Most users can use the free tier for testing and light usage. You only pay when you exceed the free tier rate limits.
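The paid-tier arithmetic above can be sketched as a small estimator. The helper name `estimateCostUSD` is hypothetical, and the token counts in the usage line are illustrative only; real counts depend on the document and on how much markdown the model returns.

```javascript
// Paid-tier rates quoted above, in USD per 1M tokens.
const INPUT_USD_PER_MILLION = 0.10;
const OUTPUT_USD_PER_MILLION = 0.40;

// Hypothetical helper: estimate the paid-tier cost of one conversion
// from its input and output token counts.
function estimateCostUSD(inputTokens, outputTokens) {
  const inputCost = (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION;
  return inputCost + outputCost;
}

// Illustrative token counts only (assumed, not measured).
console.log(estimateCostUSD(10_000, 8_000).toFixed(4)); // prints "0.0042"
```

Output tokens cost four times as much as input tokens at these rates, so long documents that produce a lot of markdown are dominated by the output side.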
Create a .env file in your project root:
GOOGLE_API_KEY=your_gemini_api_key_here
- Maximum file size: 2GB per PDF
- Storage limit: 20GB total per project
- Files automatically delete after 48 hours
- Rate limiting: 2-second delay between files in batch mode
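The 2-second pause between files in batch mode can be sketched roughly as follows. This is an assumption about the general shape of such a loop, not the code in extract.js; `processWithDelay` and the stand-in handler are hypothetical names.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run `handler` on each item in order, pausing
// `delayMs` between items (but not after the last one).
async function processWithDelay(items, handler, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await handler(items[i]));
    if (i < items.length - 1) await sleep(delayMs);
  }
  return results;
}

// Usage with a stand-in handler; a real one would upload and convert the PDF.
processWithDelay(['a.pdf', 'b.pdf'], async (file) => `${file} done`, 100)
  .then((results) => console.log(results));
```

Processing files one at a time with a fixed pause is the simplest way to stay under per-minute rate limits, at the cost of slower batches.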
- Text Content - All readable text from PDFs
- Headers & Structure - Document hierarchy preserved
- Tables - Converted to markdown table format
- Lists - Bullet points and numbered lists
- Formatting - Bold and italic text where possible
- Multi-column Layout - Intelligent text flow handling
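What gets extracted is driven by the prompt sent to Gemini together with the PDF. As a rough sketch of such a request, the code below calls the public `generateContent` REST endpoint with the PDF sent inline as base64. Note that this is a swapped-in, simpler technique: this README describes uploading files, whereas inline data is only practical for small PDFs. The helper names `buildRequestBody` and `pdfToMarkdown` are hypothetical, and the prompt is whatever string the caller supplies.

```javascript
const MODEL = 'gemini-2.0-flash';
const ENDPOINT =
  `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:generateContent`;

// Hypothetical helper: build the JSON body for a generateContent call that
// sends the PDF inline as base64 alongside the extraction prompt.
function buildRequestBody(pdfBase64, prompt) {
  return {
    contents: [
      {
        parts: [
          { inline_data: { mime_type: 'application/pdf', data: pdfBase64 } },
          { text: prompt },
        ],
      },
    ],
  };
}

// Hypothetical helper: send the request with Node 18's built-in fetch and
// return the text of the first candidate.
async function pdfToMarkdown(pdfBase64, prompt, apiKey) {
  const response = await fetch(ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'x-goog-api-key': apiKey },
    body: JSON.stringify(buildRequestBody(pdfBase64, prompt)),
  });
  if (!response.ok) throw new Error(`API Error: HTTP ${response.status}`);
  const data = await response.json();
  return data.candidates[0].content.parts[0].text;
}
```

A caller would read the PDF, base64-encode it (for example with `buffer.toString('base64')`), and pass it to `pdfToMarkdown` with the key from `process.env.GOOGLE_API_KEY`.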
"GOOGLE_API_KEY not found"
- Ensure the .env file exists in the project root
- Check there are no spaces around the = sign
- Verify your API key is correct
"File not found"
- Use full or relative paths to PDF files
- Ensure files have the .pdf extension
- Check file permissions
"API Error"
- Verify your API key is active at Google AI Studio
- Check internet connectivity
- Ensure you haven't exceeded rate limits
"No PDF files found"
- Verify the folder contains PDF files
- Check the folder path is correct
- Ensure PDF files have proper extensions
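For the "GOOGLE_API_KEY not found" case, the .env checks listed above can be automated with a small validator. This is a sketch under stated assumptions: `checkEnvText` is a hypothetical helper, it inspects the file's text directly rather than using any particular loader, and its messages are illustrative.

```javascript
// Hypothetical helper: scan .env text and report problems with the
// GOOGLE_API_KEY entry, including spaces around the = sign.
function checkEnvText(text) {
  for (const line of text.split(/\r?\n/)) {
    const match = /^GOOGLE_API_KEY(\s*)=(\s*)(.*)$/.exec(line);
    if (!match) continue; // not the line we are looking for
    const [, before, after, value] = match;
    if (before || after) return 'Remove the spaces around the = sign';
    if (value.trim() === '') return 'GOOGLE_API_KEY is empty';
    return 'ok';
  }
  return 'GOOGLE_API_KEY not found';
}

console.log(checkEnvText('GOOGLE_API_KEY = abc123')); // prints "Remove the spaces around the = sign"
console.log(checkEnvText('GOOGLE_API_KEY=abc123'));   // prints "ok"
```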
Run the diagnostic script to troubleshoot issues:
node diagnose.js
This will check:
- Node.js version compatibility
- API key configuration
- Network connectivity
- Available API methods
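The first two checks in this list can be approximated in a few lines. This is only a sketch of what such checks might look like; the real diagnose.js may work differently. The function names are hypothetical, and the API key check assumes the variable has already been loaded into `process.env`.

```javascript
// Hypothetical check: is the running Node.js at least the required major version?
function checkNodeVersion(version = process.versions.node, minimumMajor = 18) {
  const major = Number.parseInt(version.split('.')[0], 10);
  return major >= minimumMajor;
}

// Hypothetical check: is GOOGLE_API_KEY present and non-blank?
// Assumes the .env contents were already loaded into the environment.
function checkApiKey(env = process.env) {
  const key = env.GOOGLE_API_KEY;
  return typeof key === 'string' && key.trim() !== '';
}

const checks = [
  ['Node.js 18 or higher', checkNodeVersion()],
  ['GOOGLE_API_KEY is set', checkApiKey()],
];
for (const [label, passed] of checks) {
  console.log(`${passed ? 'PASS' : 'FAIL'} ${label}`);
}
```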
pdf-to-markdown-extractor/
├── README.md # This file
├── extract.js # Main extraction script
├── diagnose.js # Diagnostic utility
├── package.json # Dependencies and scripts
├── package-lock.json # Locked dependency versions
├── .env.template # Environment template
├── .env # Your API key (not in repo)
├── .gitignore # Git ignore rules
└── output/ # Generated markdown files
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
The codebase is intentionally simple. Some ideas for enhancements:
- Support for other document formats (DOCX, PPTX)
- Custom output formatting options
- Integration with other AI models
- Web interface
- Docker containerization
Input: quarterly-report.pdf
Output: output/quarterly-report.md
# Q3 2024 Financial Report
## Executive Summary
Our company achieved strong growth in Q3 2024...
| Metric | Q3 2024 | Q2 2024 | Change |
|--------|---------|---------|--------|
| Revenue | $2.1M | $1.8M | +16.7% |
| Profit | $450K | $380K | +18.4% |
## Key Achievements
- Launched new product line
- Expanded to 3 new markets
- Improved customer satisfaction by 25%

- API keys are stored locally in .env files
- Files are uploaded securely to Google's servers
- All uploaded files auto-delete after 48 hours
- No data is stored permanently
- API calls are encrypted in transit
This is a simple utility script. Feel free to:
- Fork and modify for your needs
- Add new features or improvements
- Report issues or suggestions
- Share your use cases
# Fork and clone your fork
git clone https://github.com/YOUR_USERNAME/pdf-to-markdown-extractor.git
cd pdf-to-markdown-extractor
# Install dependencies
npm install
# Copy environment template
cp .env.template .env
# Add your API key to .env
# Test your setup
node diagnose.js
MIT License - feel free to use for personal or commercial projects.
Modify the script to change the output directory:
// Change this line in extract.js:
await fs.mkdir('./my-output', { recursive: true });
Modify the extraction prompt for specific formatting:
// Find this line in extract.js and customize:
'Extract all text from this PDF and format it as clean markdown...'
Use as a module in larger projects:
import { extractPDF } from './extract.js';
const result = await extractPDF('document.pdf');
Create npm scripts in package.json:
{
  "scripts": {
    "extract": "node extract.js",
    "diagnose": "node diagnose.js",
    "extract-docs": "node extract.js ./documents/"
  }
}
Then use:
npm run extract document.pdf
npm run diagnose
npm run extract-docs
Happy extracting!
Star this repo if you find it useful!
Report issues on the GitHub Issues page
Suggest features or contribute improvements