A desktop application to extract tabular data from PDF documents, with support for challenging PDFs including those with rotated text or complex layouts.
- Visual Table Selection: Draw column and row markers directly on the PDF to define table boundaries
- Automatic Line Detection: Identify table structure from visual elements
- Text Orientation Correction: Detect and correct vertical or rotated text
- Multi-page Extraction: Extract and combine tables across multiple PDF pages
- Data Export: Save extracted data to CSV or Excel formats
- Manual Input Mode: Manually edit table cells when automatic extraction isn't perfect
- Configuration Save/Load: Save table markers and extraction settings for future use
- Python 3.7 or higher
- Required packages (install using pip install -r requirements.txt):
  - PyMuPDF (imported as fitz)
  - Pillow
  - numpy
  - pandas
  - openpyxl
- Clone this repository:
  git clone https://github.com/yourusername/pdf-table-extractor.git
  cd pdf-table-extractor
- Install required packages:
  pip install -r requirements.txt
- Run the application:
  python main.py
- Load a PDF:
  - Click "Open PDF" in the File tab or use Ctrl+O (Cmd+O on Mac)
- Define Table Structure:
  - Go to the Edit tab
  - Click "Select Columns" and click on the PDF to add vertical lines
  - Click "Select Rows" and click on the PDF to add horizontal lines
- Extract Table:
  - Go to the Table tab
  - Click "Extract Table" to process the defined area
- Export Data:
  - Go to the Export tab
  - Choose "Save as CSV" or "Save as Excel"
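The marker-based extraction in the steps above can be pictured as sorting words into the cells bounded by the column and row lines. The following is a minimal sketch under stated assumptions, not the application's actual code: `words_to_grid` is a hypothetical helper, and the word boxes are simplified `(x0, y0, x1, y1, text)` tuples similar to the leading fields that PyMuPDF's `page.get_text("words")` returns.

```python
from bisect import bisect_right


def words_to_grid(words, col_lines, row_lines):
    """Place words into a grid of cells bounded by marker lines.

    words: iterable of (x0, y0, x1, y1, text) boxes in page coordinates.
    col_lines / row_lines: sorted x / y positions of the markers;
    n lines bound n - 1 columns or rows.
    """
    n_cols, n_rows = len(col_lines) - 1, len(row_lines) - 1
    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    for x0, y0, x1, y1, text in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # decide by the box centre
        col = bisect_right(col_lines, cx) - 1
        row = bisect_right(row_lines, cy) - 1
        # Words outside the outermost markers are ignored.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            cells[row][col].append((x0, text))
    # Order words left to right inside each cell, then join them.
    return [[" ".join(t for _, t in sorted(cell)) for cell in row]
            for row in cells]


# Hypothetical sample data: two columns, two rows, one stray word.
words = [
    (12, 12, 40, 20, "Name"), (62, 12, 90, 20, "Qty"),
    (12, 32, 40, 40, "Bolt"), (62, 32, 70, 40, "4"),
    (200, 32, 210, 40, "stray"),
]
grid = words_to_grid(words, col_lines=[10, 60, 110], row_lines=[10, 30, 50])
print(grid)
```

Sorting by the left edge alone is a simplification; cells holding several lines of text would also need ordering by vertical position.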
To detect table lines automatically:
- Click "Select Area" and drag to select a region containing a table
- Click "Process Selection" to automatically detect table lines
- Review the detected lines and click "Apply Detected Lines" to use them
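One simple way such line detection could work is a projection profile: count the dark pixels in every row and column of the selected area and treat the nearly full ones as ruling lines. This is an illustrative sketch only; the application's own detection method is not described here, and `detect_lines` and its `min_fraction` threshold are hypothetical.

```python
def detect_lines(pixels, min_fraction=0.8):
    """Return (row_indices, col_indices) that look like ruling lines.

    pixels: rectangular list of rows, each a list of 0 (light) / 1 (dark).
    A row or column counts as a line when at least min_fraction of its
    pixels are dark.
    """
    height, width = len(pixels), len(pixels[0])
    rows = [y for y, row in enumerate(pixels)
            if sum(row) >= min_fraction * width]
    cols = [x for x in range(width)
            if sum(row[x] for row in pixels) >= min_fraction * height]
    return rows, cols


# Hypothetical 5x5 selection: a box with one inner vertical line.
region = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
row_lines, col_lines = detect_lines(region)
print(row_lines, col_lines)
```

In a real scan a ruling line is usually several pixels thick, so neighbouring indices would need to be merged into one line before use, which is one reason to review the detected lines before applying them.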
If extracted text appears scrambled or incorrectly oriented:
- Extract the table first
- Click "Correct Text Orientation" to fix vertical or rotated text issues
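To illustrate what "vertical or rotated" means in practice: PyMuPDF reports a writing direction (`dir`, a cosine and sine pair) for each text line returned by `page.get_text("dict")`. The sketch below turns such a pair into an angle and flags lines that are not horizontal. Whether the application relies on this field is an assumption, and both function names and the `tolerance` parameter are hypothetical.

```python
import math


def rotation_degrees(direction):
    """Angle of a (cosine, sine) writing-direction vector, in whole degrees 0-359."""
    cos_a, sin_a = direction
    return round(math.degrees(math.atan2(sin_a, cos_a))) % 360


def needs_correction(direction, tolerance=5):
    """True when the text runs more than `tolerance` degrees off horizontal."""
    angle = rotation_degrees(direction)
    return min(angle, 360 - angle) > tolerance


for direction in [(1, 0), (0, -1), (-1, 0)]:
    print(direction, rotation_degrees(direction), needs_correction(direction))
```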
When automatic extraction doesn't work well:
- Define table structure with column and row markers
- Go to the "Manual Input" tab to enter cell values manually
- Navigate between cells using arrow keys or Tab
- Click "Save & Exit Manual Mode" when finished
To extract tables from multiple pages:
- Navigate to each page containing tables
- Set up markers and click "Mark Current Page" for each page
- Click "Extract All Marked Pages" to process all marked pages
- Choose how to merge the tables (vertically or horizontally)
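The two merge choices in the last step can be sketched with plain lists of rows. This is a minimal illustration under the assumption that each extracted table is a list of rows of strings; `merge_vertically` and `merge_horizontally` are hypothetical names, not functions from the application.

```python
from itertools import zip_longest


def merge_vertically(tables):
    """Stack tables on top of each other, padding short rows with ''."""
    rows = [list(row) for table in tables for row in table]
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


def merge_horizontally(tables):
    """Place tables side by side, padding shorter tables with blank cells."""
    widths = [max((len(r) for r in t), default=0) for t in tables]
    merged = []
    for rows in zip_longest(*tables, fillvalue=None):
        line = []
        for row, width in zip(rows, widths):
            row = list(row) if row is not None else []
            line.extend(row + [""] * (width - len(row)))
        merged.append(line)
    return merged


# Hypothetical tables from two marked pages.
page1 = [["Item", "Qty"], ["Bolt", "4"]]
page2 = [["Nut", "9"]]
stacked = merge_vertically([page1, page2])
side_by_side = merge_horizontally([page1, page2])
print(stacked)
print(side_by_side)
```

A vertical merge suits a table that continues across pages, though a header row repeated on every page would appear once per page in the result. A horizontal merge suits pages that each hold different columns for the same rows.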
Keyboard shortcuts:
- Ctrl+O / Cmd+O: Open PDF
- Ctrl+S / Cmd+S: Save as CSV
- Ctrl+E / Cmd+E: Save as Excel
- Ctrl+Z / Cmd+Z: Undo last marker
- Left/Right Arrow: Previous/Next page
- +/-: Zoom in/out
If extracted text is incorrect or incomplete:
- Try the "Correct Text Orientation" feature
- Switch to Manual Input Mode for problematic tables
- Experiment with different marker placements
If the application does not start or crashes:
- Ensure Python 3.7+ is installed
- Verify all dependencies are installed with pip install -r requirements.txt
- Check console output for specific error messages
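The first two checks can also be scripted. The sketch below is a hypothetical helper script, not part of the project: it tests the Python version and looks up each required package by its import name (PyMuPDF as `fitz`, Pillow as `PIL`) without importing it.

```python
import importlib.util
import sys

# Import names for the packages in the requirements list
# (PyMuPDF is imported as fitz, Pillow as PIL).
REQUIRED_MODULES = ["fitz", "PIL", "numpy", "pandas", "openpyxl"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 7):
        print("Python 3.7 or higher is required; found", sys.version.split()[0])
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any module reported as missing should be installable by rerunning pip install -r requirements.txt in the same environment that runs main.py.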
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.