This Python project extracts and summarizes content from PDFs, including scanned/image-based ones. It uses OCR for unreadable pages, detects the language (English or Malay), and summarizes the content using a pretrained NLP model.
- 📄 Handles both text-based and scanned PDFs
- 🧠 Summarizes using facebook/bart-large-cnn from Hugging Face
- 🏷️ Language detection (langdetect): supports English (en) and Malay (ms)
- 🔍 OCR via Tesseract for image-based pages
- 🧹 Intelligent gibberish detection and filtering
- 📸 Image preprocessing with OpenCV to improve OCR accuracy
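The gibberish filter mentioned above could be approximated with a simple character-and-word heuristic. The sketch below is only an illustration: the function name `looks_like_gibberish` and both thresholds are assumptions, not the checks the project's script actually performs.

```python
import re


def looks_like_gibberish(text: str,
                         min_alpha_ratio: float = 0.6,
                         min_wordlike_ratio: float = 0.5) -> bool:
    """Heuristic guess at whether OCR output is unusable noise.

    Both thresholds are illustrative assumptions; tune them on real pages.
    """
    stripped = text.strip()
    if not stripped:
        return True  # an empty page has nothing to summarize

    # Share of non-whitespace characters that are letters.
    visible = [c for c in stripped if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in visible) / len(visible)
    if alpha_ratio < min_alpha_ratio:
        return True

    # Share of alphabetic tokens that look like words:
    # at most 20 letters long and containing at least one vowel.
    words = re.findall(r"[A-Za-z]+", stripped)
    if not words:
        return True
    wordlike = sum(
        1 for w in words
        if len(w) <= 20 and re.search(r"[aeiou]", w, re.IGNORECASE)
    )
    return wordlike / len(words) < min_wordlike_ratio


if __name__ == "__main__":
    print(looks_like_gibberish("This report summarizes the quarterly results."))
    print(looks_like_gibberish("x#@ 9!! ~~ |||| 3$%"))
```

A vowel-based check works reasonably for both English and Malay because both use the Latin alphabet, but it will still misjudge acronym-heavy pages or tables of numbers.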
document summarization/
├── test.py # Your main script
├── README.md # This file
├── assets/
│ └── sample.pdf # Example PDF
Python 3.9 or higher is recommended (the pinned numpy==1.26.4 does not support Python 3.8).
Install everything in one go:
pip install pdfplumber pytesseract Pillow langdetect transformers pdf2image opencv-python torch numpy
Or create a requirements.txt file and run:
pip install -r requirements.txt
requirements.txt contents:
pdfplumber==0.10.2
pytesseract==0.3.10
Pillow==10.2.0
langdetect==1.0.9
transformers==4.39.3
pdf2image==1.17.0
opencv-python==4.9.0.80
torch>=1.13.0
numpy==1.26.4
Tesseract is used for reading text from scanned images.
- Windows:
  Download and install from https://github.com/tesseract-ocr/tesseract
  After installation, set the path in your code:
  pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
- macOS:
  brew install tesseract
- Linux (Ubuntu):
  sudo apt install tesseract-ocr
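If the same script has to run on more than one operating system, hard-coding the Windows path can be avoided. The following is a minimal sketch, assuming the Windows installer used its default location; `find_tesseract` and `WINDOWS_DEFAULT` are hypothetical names, not part of pytesseract or of this project.

```python
import os
import shutil
from typing import Optional

# Assumed default location of the Windows installer; adjust if yours differs.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract() -> Optional[str]:
    """Return a path to the tesseract executable, or None if it is not found."""
    on_path = shutil.which("tesseract")  # covers macOS, Linux, and Windows with PATH set
    if on_path:
        return on_path
    if os.path.isfile(WINDOWS_DEFAULT):
        return WINDOWS_DEFAULT
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install it or set the path manually.")
    else:
        print(f"Using Tesseract at: {path}")
        # With pytesseract installed, the result can be applied like this:
        # import pytesseract
        # pytesseract.pytesseract.tesseract_cmd = path
```

On macOS and Linux the package managers normally put `tesseract` on the PATH, so the fallback only matters on Windows.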
Poppler is required by pdf2image to convert PDFs into images.
- Windows:
  Download from http://blog.alivate.com.au/poppler-windows/
  Extract and add the /bin folder to your system PATH.
- macOS:
  brew install poppler
- Linux:
  sudo apt install poppler-utils
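When editing the system PATH on Windows is not an option, pdf2image also accepts the Poppler location through the `poppler_path` argument of `convert_from_path`. The helper below is a sketch of that approach; `poppler_kwargs` is a hypothetical name and the example folder in the comment is a placeholder.

```python
import shutil
from typing import Dict, Optional


def poppler_kwargs(bin_dir: Optional[str] = None) -> Dict[str, str]:
    """Build extra keyword arguments for pdf2image.convert_from_path.

    Returns an empty dict when Poppler's pdftoppm is already on PATH,
    otherwise points pdf2image at the given bin folder.
    """
    if shutil.which("pdftoppm"):
        return {}
    if bin_dir:
        return {"poppler_path": bin_dir}
    raise RuntimeError(
        "Poppler not found: install it or pass the path to its bin folder."
    )


# Usage with pdf2image installed (the folder below is a placeholder):
# from pdf2image import convert_from_path
# images = convert_from_path(pdf_path, **poppler_kwargs(r"C:\poppler\bin"))
```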
Edit the pdf_path in your script:
pdf_path = r"C:\path_to_your_file.pdf"
Then run the script:
python test.py
You’ll see logs for:
- OCR processing time
- Language detection result
- Summary output per page
Each page will return:
Page X (Language: en/ms):
<summary>
Unreadable or gibberish pages will be flagged and skipped.
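The output format above can be reproduced with a small formatting helper. This is a sketch under assumptions: `format_page` and `build_report` are hypothetical names, a `None` summary stands in for a page that was flagged as unreadable, and the wording of the skip message is invented for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def format_page(page_number: int, language: str, summary: str) -> str:
    """Render one page result in the 'Page X (Language: en/ms):' layout."""
    return f"Page {page_number} (Language: {language}):\n{summary}"


def build_report(pages: Iterable[Tuple[int, str, Optional[str]]]) -> List[str]:
    """Format each (page_number, language, summary) tuple.

    A summary of None marks a page that was flagged and skipped.
    """
    report = []
    for page_number, language, summary in pages:
        if summary is None:
            # Hypothetical message; the real script's wording may differ.
            report.append(f"Page {page_number}: skipped (unreadable or gibberish)")
        else:
            report.append(format_page(page_number, language, summary))
    return report


if __name__ == "__main__":
    sample = [(1, "en", "Quarterly revenue rose."), (2, "ms", None)]
    print("\n\n".join(build_report(sample)))
```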
This project is open-source and uses the MIT License.
Built with ❤️ by Amier (Ole Kacak)
Feel free to contribute, fork, or suggest improvements.