zohrehasadi00/pdf2json

License: LGPL-3.0

pdf2json

An Intelligent Docker-Based Pipeline for Document Processing with OCR and NLP for efficient data management

Overview

pdf2json is an automated pipeline designed to streamline data management. It uses Optical Character Recognition (OCR) and natural language processing (NLP) to extract, process, and structure information from PDF documents. The tool handles both text and images and produces organized, JSON-based output that can feed several kinds of workflows.


Key Features

  • Standard Graphical User Interface (GUI)
    The GUI offers a simple interface for uploading PDFs for processing. It also lets users specify the destination path where the generated JSON file will be saved.

  • Automated Document Processing
    This feature extracts and processes text from both PDF documents and images, performing tasks such as text cleaning, summarization, and image extraction. The images are encoded in base64 format, and all extracted data is organized and converted into a standardized JSON format. The resulting JSON file is then saved to the user-specified path.
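
As a rough illustration of the last step, the sketch below bundles one page's cleaned text and one extracted image into a JSON document, encoding the image bytes as base64 so they can be stored as text. The function name `build_record` and the field names (`page`, `text`, `image_base64`, `pages`) are hypothetical and chosen only for this example; they are not taken from pdf2json's actual output schema.

```python
import base64
import json


def build_record(page_number, text, image_bytes):
    """Bundle one page's text and image into a JSON-serializable dict.

    The field names here are hypothetical, not pdf2json's actual schema.
    """
    return {
        "page": page_number,
        "text": text.strip(),
        # Raw image bytes are not valid JSON, so encode them as base64 text.
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
    }


# Placeholder inputs; a real run would use text and images taken from a PDF.
record = build_record(1, "  Example page text.  ", b"\x89PNG\r\n")
output = json.dumps({"pages": [record]}, indent=2)
print(output)
```

A consumer of such a file can recover the original image bytes with `base64.b64decode`.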

Setup and Installation

1. Prerequisites

Ensure you have the following installed on your system:

  • Python 3.8 or higher (recommended: Python 3.10): Python's official website
  • NLTK tokenizer data. Run the following command in your terminal or IDE to download it:
    python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
  • Docker (with Docker Compose, if applicable): Docker's official website
    • On Linux, you may need to install Docker Compose separately
  • Git: Git's official website
  • Tesseract OCR (for OCR functionality)
    • On Ubuntu/Debian:
      sudo apt update && sudo apt install -y tesseract-ocr libtesseract-dev
    • On macOS (using Homebrew):
      brew install tesseract
    • On Windows:

2. Clone the Repository

Clone the project repository to your local machine:

    git clone https://github.com/zohrehasadi00/pdf2json.git
    cd pdf2json

3. X Server

If you use Windows, you need to install an X server so that the GUI running inside the container can be displayed on your desktop. See: How to Install and Run X Server in Windows 11?

4. Docker

  • Build the image and run the application:
    docker build -t myapp .
    docker run -it --name myapp_container -e DISPLAY=host.docker.internal:0.0 -p 8000:8000 myapp
  • Access the app at: http://localhost:8000
  • Stop the container:
    docker stop myapp_container
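
To check from the host that the running container is answering on the published port, something like the following sketch could be used. It assumes only that the app responds to plain HTTP requests at http://localhost:8000; the helper name `is_reachable` is hypothetical, and nothing is assumed about which pages or endpoints the app serves.

```python
import http.client
import urllib.error
import urllib.request


def is_reachable(url, timeout=3.0):
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status (e.g. 404).
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


if __name__ == "__main__":
    print("App reachable:", is_reachable("http://localhost:8000"))
```

A result of `False` right after `docker run` may simply mean the app is still starting, so retrying after a few seconds is reasonable.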
