PDF to Markdown Extractor

A simple, powerful Node.js tool that converts PDF documents to clean, structured markdown using Google's Gemini 2.0 Flash API. Perfect for digitizing documents, creating documentation, or preparing content for further processing.

✨ Features

🚀 Simple & Fast - Convert PDFs to markdown in seconds
📁 Batch Processing - Handle single files or entire folders
🧠 AI-Powered - Uses Gemini 2.0 Flash for intelligent text extraction
📊 Structure Preservation - Maintains headers, tables, and formatting
💰 Cost-Effective - File uploads are free, minimal processing costs
🔒 Secure - Files auto-delete after 48 hours
📱 Cross-Platform - Works on macOS, Windows, and Linux

🛠️ Quick Start

Prerequisites

Node.js 18 or higher
Google API key for Gemini

1. Clone and Install

# Clone the repository
git clone https://github.com/josh-janse/pdf-to-markdown-extractor.git
cd pdf-to-markdown-extractor

# Install dependencies
npm install

2. Get Your Gemini API Key

Visit Google AI Studio
Sign in with your Google account
Click "Get API Key" in the sidebar
Create a new API key and copy it

3. Configure Environment

# Copy the template and add your API key
cp .env.template .env

# Edit .env to add your API key, or write it directly (note: > overwrites the file)
echo "GOOGLE_API_KEY=your_actual_api_key_here" > .env

4. Test Installation

# Run the diagnostic to verify everything works
node diagnose.js

You should see:

✅ All checks passed! The tool should work.

🚀 Usage

Single PDF File

node extract.js document.pdf

Multiple Files

node extract.js file1.pdf file2.pdf file3.pdf

Entire Folder

# Process all PDFs in a folder
node extract.js ./documents/

# Process all PDFs in current directory
node extract.js ./

Examples

# Convert a research paper
node extract.js research-paper.pdf

# Process all invoices
node extract.js ./invoices/

# Convert presentation slides
node extract.js presentation.pdf

📂 Output

Markdown files are saved to the ./output/ directory
Each PDF generates a corresponding .md file
Original document structure is preserved
Tables are converted to markdown format
Headers and formatting are maintained
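
The naming rule above (one .md file per PDF, written under ./output/) can be sketched as a small function. This is only an illustration: `outputPathFor` is a hypothetical helper, not necessarily how extract.js builds its paths, and it uses plain string handling so it runs without any imports.

```javascript
// Hypothetical helper: map a PDF path to its markdown path under ./output/.
function outputPathFor(pdfPath, outputDir = './output') {
  // Take the file name after the last "/" or "\" separator.
  const fileName = pdfPath.split(/[\\\/]/).pop();
  // Swap a trailing .pdf extension (any case) for .md.
  const baseName = fileName.replace(/\.pdf$/i, '');
  return `${outputDir}/${baseName}.md`;
}

console.log(outputPathFor('./documents/quarterly-report.pdf'));
// prints "./output/quarterly-report.md"
```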

💰 Cost

Free Tier: Free of charge for both input and output (with rate limits)

Paid Tier (per 1M tokens):

Input: $0.10 (text/image/video)
Output: $0.40

Real-world costs for typical PDFs:

Small document (5 pages): ~$0.002-0.005
Medium document (20 pages): ~$0.01-0.02
Large document (100 pages): ~$0.05-0.10

Important: Most users can use the free tier for testing and light usage. You only pay when you exceed the free tier rate limits.
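
To make the paid-tier arithmetic concrete, here is a minimal sketch that applies the per-1M-token prices listed above. The helper name `estimateCostUSD` and the token counts are assumptions for illustration; actual token usage depends on the document and is reported by the API, not by this function.

```javascript
// Paid-tier prices quoted in this README, in USD per 1M tokens.
const INPUT_PRICE_PER_MILLION = 0.10;
const OUTPUT_PRICE_PER_MILLION = 0.40;

// Hypothetical helper: estimate the paid-tier cost of one extraction.
function estimateCostUSD(inputTokens, outputTokens) {
  const inputCost = (inputTokens / 1_000_000) * INPUT_PRICE_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
  return inputCost + outputCost;
}

// Assumed token counts, chosen only to show the calculation.
const cost = estimateCostUSD(50_000, 20_000);
console.log(`Estimated cost: $${cost.toFixed(4)}`);
```

With these assumed counts the input side contributes about $0.005 and the output side about $0.008.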

🔧 Configuration

Environment Variables

Create a .env file in your project root:

GOOGLE_API_KEY=your_gemini_api_key_here
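
A script that depends on this variable can fail early with a clear message when it is missing. The sketch below is an assumption about how such a check might look, not the code in extract.js; `requireApiKey` is a hypothetical helper, and it presumes the contents of .env have already been loaded into `process.env` (for example by a loader such as the dotenv package).

```javascript
// Hypothetical helper: return the API key from an env object, or throw a clear error.
function requireApiKey(env = process.env) {
  // Trim so that stray whitespace around the value does not count as a key.
  const key = (env.GOOGLE_API_KEY || '').trim();
  if (!key) {
    throw new Error('GOOGLE_API_KEY not found. Add it to the .env file in the project root.');
  }
  return key;
}

try {
  const apiKey = requireApiKey();
  console.log(`API key loaded (${apiKey.length} characters)`);
} catch (error) {
  console.error(error.message);
}
```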

File Limits

Maximum file size: 2GB per PDF
Storage limit: 20GB total per project
Files automatically delete after 48 hours
Rate limiting: 2-second delay between files in batch mode
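
The batch-mode delay can be pictured as a sequential loop with a pause between files. This is a minimal sketch under that assumption; `processSequentially` and the stand-in handler are hypothetical and are not taken from extract.js.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run `handler` on each file in order, pausing between files.
async function processSequentially(files, handler, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < files.length; i++) {
    results.push(await handler(files[i]));
    // Pause between files, but not after the last one.
    if (i < files.length - 1) await sleep(delayMs);
  }
  return results;
}

// Usage with a stand-in handler and a short delay; a real handler would run the extraction.
processSequentially(['a.pdf', 'b.pdf'], async (file) => `${file} done`, 10)
  .then((results) => console.log(results));
```

Processing one file at a time keeps the request rate predictable, which helps when staying within free-tier rate limits.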

📊 What It Extracts

✅ Text Content - All readable text from PDFs
✅ Headers & Structure - Document hierarchy preserved
✅ Tables - Converted to markdown table format
✅ Lists - Bullet points and numbered lists
✅ Formatting - Bold and italic text where possible
✅ Multi-column Layout - Intelligent text flow handling

🐛 Troubleshooting

Common Issues

"GOOGLE_API_KEY not found"

Ensure .env file exists in project root
Check there are no spaces around the = sign
Verify your API key is correct

"File not found"

Use full or relative paths to PDF files
Ensure files have .pdf extension
Check file permissions

"API Error"

Verify your API key is active at Google AI Studio
Check internet connectivity
Ensure you haven't exceeded rate limits

"No PDF files found"

Verify the folder contains PDF files
Check that the folder path is correct
Ensure PDF files have proper extensions
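
When checking why a folder yields no PDFs, it can help to test the file names against the extension rule directly. The sketch below assumes a case-insensitive match on `.pdf`; whether extract.js matches case-insensitively is not stated here, and `filterPdfFiles` is a hypothetical helper.

```javascript
// Hypothetical helper: keep only names that end in .pdf, ignoring case.
function filterPdfFiles(fileNames) {
  return fileNames.filter((name) => name.toLowerCase().endsWith('.pdf'));
}

const found = filterPdfFiles(['report.pdf', 'SCAN.PDF', 'notes.txt', 'archive.pdf.zip']);
console.log(found); // [ 'report.pdf', 'SCAN.PDF' ]
```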

Getting Help

Run the diagnostic script to troubleshoot issues:

node diagnose.js

This will check:

Node.js version compatibility
API key configuration
Network connectivity
Available API methods

🛠️ Development

Project Structure

pdf-to-markdown-extractor/
├── README.md          # This file
├── extract.js         # Main extraction script
├── diagnose.js        # Diagnostic utility
├── package.json       # Dependencies and scripts
├── package-lock.json  # Locked dependency versions
├── .env.template      # Environment template
├── .env               # Your API key (not in repo)
├── .gitignore         # Git ignore rules
└── output/            # Generated markdown files

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Adding Features

The codebase is intentionally simple. Some ideas for enhancements:

Support for other document formats (DOCX, PPTX)
Custom output formatting options
Integration with other AI models
Web interface
Docker containerization

📝 Example Output

Input: quarterly-report.pdf
Output: output/quarterly-report.md

# Q3 2024 Financial Report

## Executive Summary

Our company achieved strong growth in Q3 2024...

| Metric | Q3 2024 | Q2 2024 | Change |
|--------|---------|---------|--------|
| Revenue | $2.1M | $1.8M | +16.7% |
| Profit | $450K | $380K | +18.4% |

## Key Achievements

- Launched new product line
- Expanded to 3 new markets
- Improved customer satisfaction by 25%

🔒 Security & Privacy

API keys are stored locally in .env files
Files are uploaded securely to Google's servers
All uploaded files auto-delete after 48 hours
No data is stored permanently
API calls are encrypted in transit

🤝 Contributing

This is a simple utility script. Feel free to:

Fork and modify for your needs
Add new features or improvements
Report issues or suggestions
Share your use cases

Development Setup

# Fork and clone your fork
git clone https://github.com/YOUR_USERNAME/pdf-to-markdown-extractor.git
cd pdf-to-markdown-extractor

# Install dependencies
npm install

# Copy environment template
cp .env.template .env
# Add your API key to .env

# Test your setup
node diagnose.js

📄 License

MIT License - feel free to use for personal or commercial projects.

🚀 Advanced Usage

Custom Output Directory

Modify the script to change the output directory:

// Change this line in extract.js:
await fs.mkdir('./my-output', { recursive: true });

Custom Prompts

Modify the extraction prompt for specific formatting:

// Find this line in extract.js and customize:
'Extract all text from this PDF and format it as clean markdown...'

Integration

Use as a module in larger projects:

import { extractPDF } from './extract.js';
const result = await extractPDF('document.pdf');
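
In a larger project it is often useful to keep going when one file fails. The wrapper below is a sketch of that idea: `extractAll` is a hypothetical helper that accepts any async extraction function, so `extractPDF` could be passed in, assuming it returns a promise and throws on failure. The stand-in function in the example exists only so the snippet runs on its own.

```javascript
// Hypothetical wrapper: run an extraction function over several files and
// collect failures instead of stopping at the first error.
async function extractAll(pdfPaths, extractFn) {
  const succeeded = [];
  const failed = [];
  for (const pdfPath of pdfPaths) {
    try {
      succeeded.push({ pdfPath, result: await extractFn(pdfPath) });
    } catch (error) {
      failed.push({ pdfPath, message: error.message });
    }
  }
  return { succeeded, failed };
}

// Stand-in for the real extraction function: fails for one file to show error collection.
const fakeExtract = async (pdfPath) => {
  if (pdfPath === 'broken.pdf') throw new Error('could not read file');
  return `# Markdown for ${pdfPath}`;
};

extractAll(['document.pdf', 'broken.pdf'], fakeExtract).then(({ succeeded, failed }) => {
  console.log(`${succeeded.length} succeeded, ${failed.length} failed`);
});
```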

Automation

Create npm scripts in package.json:

{
  "scripts": {
    "extract": "node extract.js",
    "diagnose": "node diagnose.js",
    "extract-docs": "node extract.js ./documents/"
  }
}

Then use:

npm run extract -- document.pdf
npm run diagnose
npm run extract-docs

Happy extracting! 📚➡️📝

⭐ Star this repo if you find it useful!
🐛 Report issues on the GitHub Issues page
💡 Suggest features or contribute improvements
