PDF Table Extractor

A desktop application for extracting tabular data from PDF documents, including difficult files with rotated text or complex layouts.

Features

- Visual Table Selection: Draw column and row markers directly on the PDF to define table boundaries
- Automatic Line Detection: Identify table structure from visual elements
- Text Orientation Correction: Detect and correct vertical or rotated text automatically
- Multi-page Extraction: Extract and combine tables across multiple PDF pages
- Data Export: Save extracted data to CSV or Excel formats
- Manual Input Mode: Edit table cells by hand when automatic extraction isn't perfect
- Configuration Save/Load: Save table markers and extraction settings for future use

Installation

Requirements

Python 3.7 or higher
Required packages (install using pip install -r requirements.txt):
- PyMuPDF (imported as fitz)
- Pillow
- numpy
- pandas
- openpyxl

Setup

Clone this repository:
```
git clone https://github.com/yourusername/pdf-table-extractor.git
cd pdf-table-extractor
```

Install required packages:
```
pip install -r requirements.txt
```
Run the application:
```
python main.py
```

Usage Guide

Basic Operation

Load a PDF:
- Click "Open PDF" in the File tab or use Ctrl+O (Cmd+O on Mac)
Define Table Structure:
- Go to the Edit tab
- Click "Select Columns" and click on the PDF to add vertical lines
- Click "Select Rows" and click on the PDF to add horizontal lines
Extract Table:
- Go to the Table tab
- Click "Extract Table" to process the defined area
Export Data:
- Go to the Export tab
- Choose "Save as CSV" or "Save as Excel"
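
The column and row markers placed in the Edit tab effectively divide the selected area into a grid of cells. The sketch below illustrates that idea only; it is not the application's actual code, and the helper name `cells_from_markers` and the `(x0, y0, x1, y1)` coordinate convention are assumptions made for the example.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def cells_from_markers(cols: List[float], rows: List[float]) -> List[List[Rect]]:
    """Build a grid of cell rectangles from marker positions.

    Hypothetical helper: n column markers and m row markers bound
    (m - 1) rows of (n - 1) cells each.
    """
    xs = sorted(set(cols))  # vertical lines, left to right
    ys = sorted(set(rows))  # horizontal lines, top to bottom
    grid = []
    for y0, y1 in zip(ys, ys[1:]):
        # Pair each vertical line with its right-hand neighbour.
        grid.append([(x0, y0, x1, y1) for x0, x1 in zip(xs, xs[1:])])
    return grid


grid = cells_from_markers(cols=[50, 200, 350], rows=[100, 130, 160, 190])
print(len(grid), len(grid[0]))  # 3 rows of 2 cells each
```

Each rectangle could then serve as a clip region when reading text, for example with PyMuPDF's `page.get_text("text", clip=fitz.Rect(*cell))`; whether the application does it this way is not stated here.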

Advanced Features

Automatic Table Detection

Click "Select Area" and drag to select a region containing a table
Click "Process Selection" to automatically detect table lines
Review the detected lines and click "Apply Detected Lines" to use them
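
Detected lines often need some cleanup before they work well as markers, because a single ruled line can be picked up several times at slightly different coordinates. The sketch below shows one common way to collapse such near-duplicates. It is an assumption about the kind of post-processing involved, not the application's detection code, and `merge_close_positions` and its `tolerance` parameter are hypothetical names.

```python
from typing import Iterable, List


def merge_close_positions(positions: Iterable[float], tolerance: float = 3.0) -> List[float]:
    """Collapse coordinates lying within `tolerance` of their neighbour into one averaged value."""
    merged: List[float] = []
    cluster: List[float] = []
    for p in sorted(positions):
        # A gap larger than the tolerance closes the current cluster.
        if cluster and p - cluster[-1] > tolerance:
            merged.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(p)
    if cluster:
        merged.append(sum(cluster) / len(cluster))
    return merged


print(merge_close_positions([100.0, 101.0, 250.0, 99.0, 251.0]))  # [100.0, 250.5]
```

A small tolerance keeps genuinely separate lines apart; a larger one is more forgiving of noisy detections but may merge narrow rows or columns.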

Text Orientation Correction

If extracted text appears scrambled or incorrectly oriented:

Extract the table first
Click "Correct Text Orientation" to fix vertical or rotated text issues
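
To give a sense of what detecting rotated text can involve, the sketch below snaps a writing-direction vector to the nearest quarter turn. It is an illustration under assumptions, not the application's algorithm: the function names are hypothetical, and the sign of the vertical component depends on the coordinate system in use. PyMuPDF reports a per-line direction vector under the `dir` key of `page.get_text("dict")`, which is one possible source for such a vector.

```python
import math


def snapped_rotation(dx: float, dy: float) -> int:
    """Return the direction angle snapped to 0, 90, 180 or 270 degrees."""
    angle = math.degrees(math.atan2(dy, dx)) % 360
    return int(round(angle / 90.0)) % 4 * 90


def needs_correction(dx: float, dy: float) -> bool:
    """True when the text does not run in the usual left-to-right direction."""
    return snapped_rotation(dx, dy) != 0


print(snapped_rotation(1.0, 0.0))   # horizontal text
print(snapped_rotation(0.0, -1.0))  # text rotated by a quarter turn
print(needs_correction(0.999, -0.03))  # slight skew still counts as horizontal
```

Snapping to multiples of 90 degrees means slightly skewed scans are treated as horizontal, which suits tables where text is either upright or fully rotated.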

Manual Input Mode

When automatic extraction doesn't work well:

Define table structure with column and row markers
Go to the "Manual Input" tab to enter cell values manually
Navigate between cells using arrow keys or Tab
Click "Save & Exit Manual Mode" when finished

Multi-page Extraction

To extract tables from multiple pages:

Navigate to each page containing tables
Set up markers and click "Mark Current Page" for each page
Click "Extract All Marked Pages" to process all marked pages
Choose how to merge the tables (vertically or horizontally)
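
The two merge directions can be pictured as stacking tables on top of each other or placing them side by side. The sketch below shows this with plain lists of rows. The `merge_tables` helper and its padding behaviour are assumptions for illustration and may differ from what the application does.

```python
from itertools import zip_longest
from typing import List

Table = List[List[str]]


def merge_tables(tables: List[Table], direction: str = "vertical") -> Table:
    """Merge tables vertically (stack rows) or horizontally (join rows side by side)."""
    if direction == "vertical":
        # Pad every row to the widest row so the result stays rectangular.
        width = max((len(row) for t in tables for row in t), default=0)
        return [row + [""] * (width - len(row)) for t in tables for row in t]
    if direction == "horizontal":
        widths = [max((len(r) for r in t), default=0) for t in tables]
        merged = []
        for rows in zip_longest(*tables):
            line: List[str] = []
            for row, w in zip(rows, widths):
                row = row or []  # shorter tables contribute empty cells
                line.extend(row + [""] * (w - len(row)))
            merged.append(line)
        return merged
    raise ValueError("direction must be 'vertical' or 'horizontal'")


page1 = [["Item", "Qty"], ["Bolt", "4"]]
page2 = [["Nut", "9"]]
print(merge_tables([page1, page2], "vertical"))
print(merge_tables([page1, page2], "horizontal"))
```

With pandas DataFrames, the same two choices correspond to `pandas.concat` with `axis=0` and `axis=1` respectively.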

Keyboard Shortcuts

Ctrl+O / Cmd+O: Open PDF
Ctrl+S / Cmd+S: Save as CSV
Ctrl+E / Cmd+E: Save as Excel
Ctrl+Z / Cmd+Z: Undo last marker
Left/Right Arrow: Previous/Next page
+/-: Zoom in/out

Troubleshooting

Text Not Extracting Correctly

Try the "Correct Text Orientation" feature
Switch to Manual Input Mode for problematic tables
Experiment with different marker placements

Application Not Starting

Ensure Python 3.7+ is installed
Verify all dependencies are installed with pip install -r requirements.txt
Check console output for specific error messages
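
A quick way to run the first two checks is a small script like the one below. It is a convenience sketch, not part of the repository; the module list mirrors the packages under Requirements using their import names (PyMuPDF as `fitz`, Pillow as `PIL`), and `missing_modules` is a hypothetical helper.

```python
import sys
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the packages listed under Requirements.
REQUIRED_MODULES = ["fitz", "PIL", "numpy", "pandas", "openpyxl"]


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    version_ok = sys.version_info >= (3, 7)
    print("Python", sys.version.split()[0], "OK" if version_ok else "too old (need 3.7+)")
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it with the same interpreter used to start the application, so that it inspects the same environment.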

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

PyMuPDF for PDF processing
Pillow for image processing
Created by Martha Correa
