This project extracts text from PDFs (both regular and scanned) and summarizes the extracted content using AI-powered text summarization.
- Extracts text from normal PDFs using PyMuPDF (`fitz`).
- Extracts text from scanned PDFs using Tesseract OCR and EasyOCR.
- Preprocesses images for better OCR accuracy.
- Summarizes extracted text using the `facebook/bart-large-cnn` model from `transformers`.
- Handles large text inputs by processing in chunks.
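
The chunked summarization mentioned in the last bullet could be sketched roughly as follows. This is an illustration, not the code in `summarizer.py`: the function names, the 400-word chunk size, and the `max_length`/`min_length` values are all assumptions. The chunk size is chosen only to stay comfortably below the 1024-token input limit of `facebook/bart-large-cnn`.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_in_chunks(
    text: str,
    summarize_chunk: Callable[[str], str],
    max_words: int = 400,
) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize_chunk(chunk) for chunk in chunks)


def make_bart_summarizer() -> Callable[[str], str]:
    """Build a chunk summarizer backed by facebook/bart-large-cnn."""
    # Imported lazily so the chunking helpers work without loading the model.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def summarize_chunk(chunk: str) -> str:
        # Length limits are illustrative; truncation guards against long chunks.
        result = summarizer(
            chunk, max_length=130, min_length=30, do_sample=False, truncation=True
        )
        return result[0]["summary_text"]

    return summarize_chunk
```

With the dependencies installed, `summarize_in_chunks(text, make_bart_summarizer())` would return the joined partial summaries. Passing the summarizer in as a callable keeps the chunking logic testable without downloading the model.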
Ensure you have the following dependencies installed:
- Python 3.8+
- Python packages: `pymupdf`, `pytesseract`, `pdf2image`, `easyocr`, `Pillow`, `torch`, `transformers`
- Tesseract OCR: Install Tesseract OCR and ensure it is correctly set up. Windows users may need to set the path in `text_extractor.py`:
  `pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"`
- Poppler (required for `pdf2image` to work):
  - Windows: Install a Poppler build for Windows and add it to the system `PATH`.
  - Linux: Install using `sudo apt install poppler-utils`
  - macOS: Install using `brew install poppler`
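
Before running the project, it can help to confirm that the dependencies listed above are visible to Python. The script below is a small, hypothetical helper (it is not part of the repository). It checks that each package is importable and that the `tesseract` and `pdftoppm` executables are on the `PATH`. `pdftoppm` is one of the Poppler tools that `pdf2image` calls. Note that the import names differ from the pip names for two packages: `pymupdf` is imported as `fitz` and `Pillow` as `PIL`.

```python
import importlib.util
import shutil
from typing import Dict, List

# Import names, which differ from some pip names (pymupdf -> fitz, Pillow -> PIL).
REQUIRED_MODULES = [
    "fitz", "pytesseract", "pdf2image", "easyocr", "PIL", "torch", "transformers",
]
# tesseract is the Tesseract CLI; pdftoppm ships with Poppler.
REQUIRED_EXECUTABLES = ["tesseract", "pdftoppm"]


def find_missing(modules: List[str], executables: List[str]) -> Dict[str, List[str]]:
    """Return the modules that cannot be found and the executables not on PATH."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return {"modules": missing_modules, "executables": missing_executables}


if __name__ == "__main__":
    report = find_missing(REQUIRED_MODULES, REQUIRED_EXECUTABLES)
    if report["modules"] or report["executables"]:
        print("Missing Python modules:", ", ".join(report["modules"]) or "none")
        print("Missing executables:", ", ".join(report["executables"]) or "none")
    else:
        print("All dependencies found.")
```

On Windows, a Tesseract install that is configured only through `pytesseract.pytesseract.tesseract_cmd` and not added to the `PATH` would be reported as missing by this check even though the project itself may still work.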
- Clone the repository: `git clone https://github.com/Tertho1/pdf-summarizer.git`, then `cd pdf-summarizer`
- Create a virtual environment (optional but recommended): `python -m venv venv`, then `source venv/bin/activate` (on Windows, use `venv\Scripts\activate`)
- Install dependencies: `pip install -r requirements.txt`
Run the script with a PDF file as an argument: `python main.py path/to/document.pdf`
To run the GUI version: `python gui_main.py`
Or simply run it from your IDE and choose a PDF to summarize.
Extracting text from: sample.pdf
Extracted Text (Preview - First 1000 chars):
Lorem ipsum dolor sit amet, consectetur adipiscing elit...
Summarizing...
Summary:
- The document discusses key aspects of Lorem Ipsum.
- Various elements and structure are explained concisely.
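
The console output above suggests a simple extract, preview, summarize flow. A minimal sketch of such a driver is shown below. The function name `run` and its `extract` and `summarize` parameters are hypothetical and are not taken from `main.py`. Only the printed messages mirror the sample output.

```python
from typing import Callable

PREVIEW_CHARS = 1000  # matches the "First 1000 chars" preview in the sample output


def run(
    pdf_path: str,
    extract: Callable[[str], str],
    summarize: Callable[[str], str],
) -> str:
    """Extract text from a PDF, print a preview, then print and return a summary."""
    print(f"Extracting text from: {pdf_path}")
    text = extract(pdf_path)

    print(f"Extracted Text (Preview - First {PREVIEW_CHARS} chars):")
    print(text[:PREVIEW_CHARS])

    print("Summarizing...")
    summary = summarize(text)

    print("Summary:")
    print(summary)
    return summary
```

Passing the extraction and summarization steps in as callables is a design choice made for this sketch, so that the flow can be exercised with stand-in functions before the OCR tools and the model are installed.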
pdf-summarizer/
│── main.py # Main script to run extraction and summarization
│── text_extractor.py # Handles text extraction from PDFs (regular & scanned)
│── summarizer.py # AI-based text summarization
│── requirements.txt # Required Python packages
│── README.md # Project documentation
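
To illustrate the "regular & scanned" handling attributed to `text_extractor.py`, here is one possible approach: read each page's embedded text with PyMuPDF and fall back to OCR when a page yields almost none. This is a sketch under assumptions, not the repository's implementation. The function names, the 20-character threshold, the 300 DPI setting, and the grayscale conversion as the only preprocessing step are all illustrative choices.

```python
from typing import Callable


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned if it has fewer than min_chars of embedded text."""
    return len(page_text.strip()) < min_chars


def ocr_page_with_tesseract(pdf_path: str, page_index: int) -> str:
    """Render one page to an image with pdf2image and OCR it with Tesseract."""
    import pytesseract
    from pdf2image import convert_from_path

    # pdf2image page numbers are 1-based, while page_index is 0-based.
    images = convert_from_path(
        pdf_path, dpi=300, first_page=page_index + 1, last_page=page_index + 1
    )
    gray = images[0].convert("L")  # grayscale as a simple preprocessing step
    return pytesseract.image_to_string(gray)


def extract_text(
    pdf_path: str,
    ocr_page: Callable[[str, int], str] = ocr_page_with_tesseract,
    min_chars: int = 20,
) -> str:
    """Return the text of every page, using OCR only where embedded text is missing."""
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            text = page.get_text()
            if needs_ocr(text, min_chars):
                text = ocr_page(pdf_path, index)
            parts.append(text)
    return "\n".join(parts)
```

Deciding per page, instead of per document, lets a PDF that mixes digital and scanned pages avoid running OCR on pages that already contain text.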
Any contribution is appreciated.