A Python tool for extracting structured shipping manifest data from PDF documents using PDFMiner, with optional machine learning capabilities via LSTM models.
Manifest_PDF_Extractor is a specialized tool designed to parse shipping manifest PDFs and extract key logistics information into a structured CSV format. It uses coordinate-based text extraction to accurately capture fields such as Bill of Lading (BOL), shipper/consignee details, port information, container specifications, and arrival dates.
The project also includes a Jupyter notebook (MyLSTM.ipynb) demonstrating how extracted manifest data can be used to train LSTM models for classification or anomaly detection tasks.
- 🔍 Coordinate-Based PDF Parsing: Uses `pdfminer` to extract text from specific bounding box locations in manifest PDFs
- 📦 Comprehensive Field Extraction:
- Bill of Lading (BOL) number
- Shipping line & carrier information
- Shipper & consignee details with location parsing
- Port of loading & discharge (city/country)
- Container size, TEUs, and weight metrics
- Commodity codes and descriptions
- Arrival dates and manifest metadata
- 🌍 Geolocation Support: Automatically extracts and validates city/country names using `GeoText` and `geopy`
- 📊 CSV Export: Outputs structured data ready for analysis or database import
- 🤖 ML Integration: Includes an LSTM model notebook for predictive analytics on manifest data
- 🔁 Multi-Page Support: Handles multi-page manifest documents with BOL continuity tracking
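One plausible reading of "BOL continuity tracking" is that a continuation page with no BOL of its own inherits the most recent BOL seen on an earlier page. The sketch below illustrates that idea on plain dictionaries. The function name `carry_forward_bol` and the page structure are hypothetical and are not taken from MainAnalysis.py.

```python
from typing import Any, Dict, List, Optional


def carry_forward_bol(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fill a missing 'bol' on a page with the last BOL seen on earlier pages."""
    last_bol: Optional[str] = None
    filled = []
    for page in pages:
        # A missing, None, or empty BOL means "continuation of the previous one".
        bol = page.get("bol") or last_bol
        if bol is not None:
            last_bol = bol
        filled.append({**page, "bol": bol})
    return filled


# Made-up pages for illustration only.
pages = [
    {"page": 1, "bol": "BOL-001"},
    {"page": 2, "bol": None},  # continuation page
    {"page": 3, "bol": "BOL-002"},
]
for row in carry_forward_bol(pages):
    print(row["page"], row["bol"])
```

The input list is left unmodified; each page is copied before the BOL is filled in.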
Manifest_PDF_Extractor/
├── MainAnalysis.py # Main PDF parsing engine and orchestration
├── Manifest.py # Data model class for manifest entries
├── utils.py # Helper functions: CSV export, location extraction
├── MyLSTM.ipynb # Jupyter notebook: LSTM model for classification
├── Tammo.pdf # Sample manifest PDF for testing
├── old/ # Legacy/backup files
└── .idea/ # IDE configuration (PyCharm)
- Python 3.7+
- pip package manager
pip install pdfminer.six geotext geopy pandas numpy matplotlib scikit-learn tensorflow keras
💡 Note: The LSTM notebook requires TensorFlow/Keras. Install with:
pip install tensorflow # Includes keras
from MainAnalysis import parse_pdf
from utils import my_functions
# Parse a manifest PDF (adjust page limit as needed)
parse_pdf('path/to/your_manifest.pdf', pages=150)
# Export extracted data to CSV
# (CSV is automatically generated as 'myManifest.csv' after parsing)

Alternatively, run the script directly from the command line:
python MainAnalysis.py
This will process `Tammo.pdf` (included sample) and generate `myManifest.csv`.
- Open `MyLSTM.ipynb` in Jupyter Notebook or Google Colab
- Ensure your extracted CSV data is preprocessed and loaded
- Run cells sequentially to:
- Load and preprocess data
- Train the LSTM model
- Evaluate performance metrics
- Save/load the trained model
The ItemBBOX class in utils.py defines bounding box coordinates for each field. Adjust these values if your manifest PDFs use a different layout:
class ItemBBOX:
    BOL = [77, 466, 2] # [x0, y1, line_index]
    Shipping_Line = [78, 530, 0]
    # ... add/modify other fields as needed
🔎 Tip: Use a PDF coordinate inspector tool to identify exact text positions in your documents.
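To make the `[x0, y1, line_index]` convention concrete, here is a minimal sketch of how such an entry could be matched against text boxes. It assumes each box is a `(x0, y0, x1, y1, text)` tuple, mirroring the `bbox` and `get_text()` values of a pdfminer text box. The helper `field_from_boxes`, the tolerance value, and the sample boxes are all hypothetical; the real lookup in MainAnalysis.py may work differently.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1, text) in PDF points; origin is the bottom-left of the page.
TextBox = Tuple[float, float, float, float, str]


def field_from_boxes(boxes: List[TextBox], spec: List[int], tol: float = 1.0) -> Optional[str]:
    """Return line `line_index` of the first box whose x0 and y1 match `spec` within `tol`."""
    x0, y1, line_index = spec
    for bx0, _by0, _bx1, by1, text in boxes:
        if abs(bx0 - x0) <= tol and abs(by1 - y1) <= tol:
            lines = text.splitlines()
            if line_index < len(lines):
                return lines[line_index].strip()
            return None  # box found, but it has fewer lines than expected
    return None  # no box at that position


# Made-up boxes for illustration; real coordinates come from your PDFs.
boxes = [
    (77.0, 430.2, 210.5, 466.3, "BILL OF LADING\nNo.\nABCD1234567\n"),
    (78.0, 515.0, 190.0, 530.0, "EXAMPLE LINE\n"),
]
print(field_from_boxes(boxes, [77, 466, 2]))  # spec copied from ItemBBOX.BOL
print(field_from_boxes(boxes, [78, 530, 0]))  # spec copied from ItemBBOX.Shipping_Line
```

A small tolerance helps because extracted coordinates are floats and can drift by a fraction of a point between documents generated from the same template.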
- Add the field to the `Manifest` class in `Manifest.py`
- Define its coordinates in `ItemBBOX` in `utils.py`
- Add extraction logic in `get_text_from_elements()` in `MainAnalysis.py`
- Include the field in `header_list` and `get_list()` for CSV export
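The first and last steps above can be sketched as follows. The `Manifest` class here is a trimmed-down, hypothetical stand-in rather than the actual class in Manifest.py, and "Vessel Name" is an invented example field. Placing `header_list` on the class is also an assumption. The point is only that `header_list` and `get_list()` must stay in the same order.

```python
import csv
import io


class Manifest:
    """Hypothetical stand-in for the data model in Manifest.py."""

    # CSV column order; must stay in step with get_list().
    header_list = ["BOL #", "Shipper", "Vessel Name"]  # "Vessel Name" is the new field

    def __init__(self):
        self.bol = ""
        self.shipper = ""
        self.vessel_name = ""  # new attribute for the added field

    def get_list(self):
        """Return the values in the same order as header_list."""
        return [self.bol, self.shipper, self.vessel_name]


entry = Manifest()
entry.bol = "BOL-001"
entry.shipper = "Example Exports"
entry.vessel_name = "Example Vessel"

# Write to an in-memory buffer here; a real export would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(Manifest.header_list)
writer.writerow(entry.get_list())
print(buffer.getvalue())
```

If a field is added to one of the two places but not the other, the CSV columns shift silently, so a length check between `header_list` and `get_list()` is a cheap safeguard.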
The tool generates myManifest.csv with the following columns:
| Column | Description |
|---|---|
| Shipping Line | Carrier/transport company |
| Shipper | Exporter name |
| Shipper Location - City,Province | Origin city |
| Shipper Location - Country | Origin country |
| Commodity Code | HS/customs code |
| Cont. Size | Container dimensions (e.g., 20, 40) |
| TEU's | Twenty-foot Equivalent Units |
| Package Class | Packaging type |
| Weight (ORG) in Tonne | Gross weight |
| Commodity | Goods description |
| BOL # | Bill of Lading number |
| Consignee | Importer/recipient |
| Actual Port of Loading | Departure port |
| Place of delivery-City,Province | Destination city |
| Place of delivery- Country | Destination country |
| Actual Port of Discharge-City,Province | Arrival port city |
| Actual Port of Discharge- Country | Arrival port country |
| Date of Arrival | Expected arrival date |
| Month/Year | Parsed date components |
| Manifest Type | Export/Import designation |
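As an illustration of consuming the export, the sketch below totals TEUs per shipping line with the standard `csv` module. It uses an in-memory sample with made-up rows and only three of the columns listed above. The helper name `teus_by_line` is hypothetical, and treating blank TEU cells as zero is an assumption.

```python
import csv
import io
from collections import defaultdict


def teus_by_line(handle):
    """Sum the TEU's column per Shipping Line from an open CSV handle."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        # Blank cells are treated as zero (an assumption, adjust as needed).
        totals[row["Shipping Line"]] += float(row["TEU's"] or 0)
    return dict(totals)


# Made-up rows for illustration only.
sample = io.StringIO(
    "Shipping Line,TEU's,BOL #\n"
    "Line A,2,BOL-001\n"
    "Line A,1,BOL-002\n"
    "Line B,4,BOL-003\n"
)
print(teus_by_line(sample))
```

To run it against real output, pass `open("myManifest.csv", newline="", encoding="utf-8")` in place of the in-memory sample.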
# Run with sample file
python MainAnalysis.py
# Verify output
cat myManifest.csv
- Layout-Specific: Designed for manifests with consistent formatting. Coordinate values may need adjustment for different PDF templates.
- Geocoding Rate Limits: `geopy.Nominatim` has usage policies; add delays or use alternative services for bulk processing.
- PDF Compatibility: Works best with text-based PDFs. Scanned/image-based PDFs require OCR preprocessing.
- Error Handling: Includes try/except blocks, but complex/malformed PDFs may require manual review.
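For the geocoding rate limit, one option is to enforce a minimum delay between lookups. The sketch below is a generic, standard-library-only throttle. The decorator `throttled` and the placeholder `lookup` function are hypothetical and not part of this project. geopy also ships its own helper, `geopy.extra.rate_limiter.RateLimiter`, which is probably the better choice when geopy is already installed.

```python
import time
from functools import wraps


def throttled(min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
    """Decorator: ensure at least `min_delay_seconds` between successive calls."""
    def decorator(func):
        last_call = [None]  # boxed so the wrapper can update it

        @wraps(func)
        def wrapper(*args, **kwargs):
            if last_call[0] is not None:
                wait = min_delay_seconds - (clock() - last_call[0])
                if wait > 0:
                    sleep(wait)
            last_call[0] = clock()
            return func(*args, **kwargs)
        return wrapper
    return decorator


@throttled(min_delay_seconds=1.0)
def lookup(place):
    # Placeholder: a real version would call a geocoder here.
    return place.upper()


print(lookup("Rotterdam"))  # first call runs immediately, later calls wait as needed
```

The `clock` and `sleep` parameters exist only so the delay logic can be exercised without actually waiting.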
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit changes: `git commit -m 'Add some feature'`
- Push to branch: `git push origin feature/your-feature`
- Open a Pull Request
Please ensure code follows PEP 8 and includes docstrings for new functions.
This project is licensed under the MIT License—see the LICENSE file for details.
- PDFMiner for robust PDF text extraction
- GeoText & GeoPy for location parsing
- TensorFlow/Keras team for the LSTM implementation framework
💬 Have questions or suggestions? Open an issue or contact the maintainer.
Last updated: March 2026