An Intelligent Docker-Based Pipeline for Document Processing with OCR and NLP for efficient data management
pdf2json is an automated pipeline designed to streamline data management. It uses Optical Character Recognition (OCR) and natural language processing (NLP) to extract, process, and structure information from PDF documents. The tool handles both text and images and produces organized, JSON-based output that can support several kinds of workflows.
- Standard Graphical User Interface (GUI)
The GUI offers a simple yet efficient interface, allowing users to upload PDFs for processing. It also enables users to specify a destination path where the generated JSON file will be saved.
- Automated Document Processing
This feature extracts and processes text from both PDF documents and images, performing tasks such as text cleaning, summarization, and image extraction. The images are encoded in base64, and all extracted data is organized and converted into a standardized JSON format. The resulting JSON file is then saved to the user-specified path.
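The last two steps described above (base64 image handling and writing the JSON file) can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the function names `build_document_json` and `save_document_json`, and the field names `source`, `pages`, `page`, `text`, and `images`, are assumptions made for the example.

```python
import base64
import json
from pathlib import Path


def build_document_json(source_name, pages):
    """Assemble extracted page data into a JSON-serializable dict.

    `pages` is a list of (text, [image_bytes, ...]) tuples. The schema
    used here is hypothetical, not the one pdf2json actually emits.
    """
    document = {"source": source_name, "pages": []}
    for number, (text, images) in enumerate(pages, start=1):
        document["pages"].append({
            "page": number,
            "text": text.strip(),
            # Raw bytes cannot be stored in JSON, so each image is
            # base64-encoded and kept as an ASCII string.
            "images": [base64.b64encode(img).decode("ascii") for img in images],
        })
    return document


def save_document_json(document, destination):
    """Write the document to the user-specified path as UTF-8 JSON."""
    path = Path(destination)
    # Create the destination folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(document, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path
```

A consumer of the file can recover the original image bytes with `base64.b64decode` on each entry of `images`.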
Ensure you have the following installed on your system:
- Python 3.8 or higher (recommended: Python 3.10): Python's official website
- Run the following command in your IDE:
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
- Docker (with Docker Compose, if applicable): Docker's official website
- On Linux, you may need to install Docker Compose separately
- Git: Git's official website
- Tesseract OCR (for OCR functionality)
- On Ubuntu/Debian:
sudo apt update && sudo apt install -y tesseract-ocr libtesseract-dev
- On macOS (using Homebrew):
brew install tesseract
- On Windows:
- Download the installer from Tesseract at UB Mannheim or Tesseract OCR
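Before continuing, it can help to confirm that the prerequisites listed above are actually reachable. The sketch below checks the Python version and looks for the `docker`, `git`, and `tesseract` executables on `PATH`. The helper name `missing_prerequisites` is hypothetical, and the executable names are assumptions about how each tool is installed on your system.

```python
import shutil
import sys

# Executable names assumed to be on PATH after installation.
REQUIRED_TOOLS = ("docker", "git", "tesseract")
MIN_PYTHON = (3, 8)


def missing_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or higher is required" % min_python)
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    return problems
```

Calling `missing_prerequisites()` returns an empty list when everything is found. On Windows, a tool may be installed without being added to `PATH`, in which case it is reported as missing until `PATH` is updated.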
Clone the project repository to your local machine:
git clone https://github.com/zohrehasadi00/pdf2json.git
cd pdf2json
If you use Windows, you'll need to install an X server so the GUI can be displayed (see: How to Install and Run X Server in Windows 11?).
- Run the Application
docker build -t myapp .
docker run -it --name myapp_container -e DISPLAY=host.docker.internal:0.0 -p 8000:8000 myapp
- Access the app at: http://localhost:8000
- Stop the container:
docker stop myapp_container