Reviewer is a document processing tool that extracts text from PDF and PowerPoint documents (.pdf, .pptx, .ppt) and summarizes it using LangChain and Google's Gemini AI. The project supports OCR-based extraction, including for image-based PDFs.
- Extracts text from PDF and PPT/PPTX files.
- Uses Tesseract OCR to extract text from images within PDFs and PowerPoint slides.
- Summarizes the document into a clear overview.
- Answers user questions based on the document.
- Maintains conversation context and memory.
- Uses LangChain and Google's Gemini API.
- Supports large documents by chunking the document before processing.
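The chunking step mentioned in the last item can be illustrated with a minimal sketch. This is not the code in core/text_chunker.py, which may work differently (for example, by using a LangChain text splitter); the function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical. It shows one common approach: fixed-size character windows that overlap slightly, so a sentence cut at a boundary still appears whole in one of the chunks.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size characters.

    Hypothetical helper for illustration; not the project's actual chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks: List[str] = []
    step = chunk_size - overlap  # how far the window advances each iteration
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 3  # 30 characters of demo text
    for index, chunk in enumerate(chunk_text(sample, chunk_size=12, overlap=2)):
        print(index, repr(chunk))
```

Each chunk can then be summarized separately and the partial summaries combined, which keeps every request within the model's input limit.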
Reviewer/
├── main.py # Entry point
├── requirements.txt # Dependencies
├── .env # Environment file
├── config/
│ └── settings.py # Configuration settings
├── core/
│ ├── __init__.py
│ ├── document_processor.py # Document text extraction
│ ├── text_chunker.py # Text chunking utilities
│ ├── ai_service.py # LLM model integration
│ └── cli.py # User interface functions
└── utils/
  ├── __init__.py
  ├── file_helpers.py # File validation utilities
  └── converters.py # Conversion utilities
- Python 3.8 or higher
- Tesseract OCR
- Poppler
- LibreOffice (optional)
For better performance, it is recommended to install unoconv or unoserver alongside LibreOffice.
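To confirm that the external tools listed above are reachable before running the program, a quick check along these lines may help. It is a sketch and not part of the project: the function names are hypothetical, and the executable names are assumptions (`pdftoppm` is one of the binaries shipped with Poppler, and `soffice` is the usual LibreOffice command-line launcher, but names can differ per platform).

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed executable names; adjust for your platform if they differ.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")
OPTIONAL_TOOLS = ("soffice", "unoconv")


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED_TOOLS):
        print(f"Missing required tool: {name}")
    for name in missing_tools(OPTIONAL_TOOLS):
        print(f"Optional tool not found: {name}")
```

A tool reported as missing here may still be installed but not on PATH; for Tesseract, the TESSERACT_CMD setting covers that case.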
Clone the repository:
git clone https://github.com/isaiah76/Reviewer.git
cd Reviewer
Run the provided installation script (Linux/macOS):
chmod +x install.sh && ./install.sh
Run the provided batch script (Windows):
install.bat
Ensure the following dependencies are installed before running the program.
Install Tesseract OCR:
Linux:
- Debian/Ubuntu:
sudo apt install tesseract-ocr
- Arch Linux:
sudo pacman -S tesseract
- Fedora:
sudo dnf install tesseract
macOS:
brew install tesseract
Windows: Download and install Tesseract OCR, and ensure the installation path is added to your system PATH.
Install Poppler:
Linux:
- Debian/Ubuntu:
sudo apt install poppler-utils
- Arch Linux:
sudo pacman -S poppler
- Fedora:
sudo dnf install poppler-utils
macOS:
brew install poppler
Windows: Download Poppler for Windows and add it to the system PATH.
Install LibreOffice (optional):
Linux:
- Debian/Ubuntu:
sudo apt install libreoffice unoconv
- Arch Linux:
sudo pacman -S libreoffice-fresh
- Fedora:
sudo dnf install libreoffice unoconv
macOS:
brew install libreoffice
For unoserver:
pip install unoserver
Install the Python dependencies:
pip install -r requirements.txt
Or, if requirements.txt is missing, install the required packages manually:
pip install python-dotenv langchain langchain-google-genai google-generativeai PyPDF2 python-pptx pyfiglet pytesseract pdf2image Pillow
Before running the program, create a .env file in the project root (or use the provided .env.example as a guide).
To get a Gemini API key:
- Visit the Google AI Studio website
- Sign in with your Google account
- Navigate to "Get API key" or go to your profile settings
- Create a new API key or use an existing one
- Copy the API key into your .env file
Include your Gemini API key:
GEMINI_API_KEY=your_api_key_here
Optionally, set the Tesseract command if it is not in your PATH:
TESSERACT_CMD=/path/to/tesseract # Linux/macOS
TESSERACT_CMD=C:\\Program Files\\Tesseract-OCR\\tesseract.exe # Windows
There are currently two ways to run the program:
- Provide the path as a CLI argument:
python3 main.py path/to/file
- Enter the path when prompted:
python3 main.py
Then enter the path at the prompt:
Please enter the path to your file (.pdf, .pptx, or .ppt): path/to/file
Currently, only one file can be processed at a time.
To exit the program, press Ctrl + C or, when prompted, type exit and press Enter.
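The two ways of supplying a file path, plus the exit keyword, can be sketched roughly as follows. This is an illustration and not the code in main.py or core/cli.py: the names `resolve_input_path`, `is_supported`, and `main` are hypothetical, and treating exit at the path prompt as a quit request is an assumption about which prompts accept it.

```python
from pathlib import Path
from typing import Callable, Optional, Sequence

SUPPORTED_EXTENSIONS = {".pdf", ".pptx", ".ppt"}
PROMPT = "Please enter the path to your file (.pdf, .pptx, or .ppt): "


def resolve_input_path(
    argv: Sequence[str], ask: Callable[[str], str] = input
) -> Optional[Path]:
    """Take the path from argv[1] if present, otherwise prompt for it.

    Returns None when the user types "exit" (assumed behavior).
    """
    raw = argv[1] if len(argv) > 1 else ask(PROMPT)
    raw = raw.strip()
    if raw.lower() == "exit":
        return None
    return Path(raw).expanduser()


def is_supported(path: Path) -> bool:
    """Check the file extension case-insensitively."""
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def main(argv: Sequence[str]) -> int:
    """Return a process exit code: 0 on success or user exit, 1 on bad input."""
    try:
        path = resolve_input_path(argv)
    except KeyboardInterrupt:  # Ctrl + C at the prompt
        return 0
    if path is None:
        return 0
    if not is_supported(path):
        print(f"Unsupported file type: {path.suffix or '(none)'}")
        return 1
    print(f"Would process: {path}")
    return 0
```

To try it as a script, call `sys.exit(main(sys.argv))` at the bottom of the file after importing `sys`. Passing the prompt function as a parameter keeps the path logic testable without a terminal.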
Contributions are welcome! Please submit a pull request or open an issue for discussion.