pdf2json

An Intelligent Docker-Based Pipeline for Document Processing with OCR and NLP for efficient data management

Overview

pdf2json is an automated, intelligent pipeline designed to streamline data management. It leverages Optical Character Recognition (OCR) and natural language processing (NLP) to extract, process, and structure information from PDF documents. This tool facilitates efficient handling of text and images, offering an organized, JSON-based output to support several kind of workflows.

Key Features

Standard Graphical User Interface (GUI)

The GUI offers a simple yet efficient interface, allowing users to upload PDFs for processing. It also enables users to specify a destination path where the generated JSON file will be saved.
Automated Document Processing
This feature extracts and processes text from both PDF documents and images, performing tasks such as text cleaning, summarization, and image extraction. The images are decoded into base64 format, and all extracted data is organized and converted into a standardized JSON format. The resulting JSON file is then saved to the user-specified path.

Setup and Installation

1. Prerequisites

Ensure you have the following installed on your system:

Python 3.8 or higher (recommended: Python 3.10): Python's official website

Run following command in your IDE:

python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

Docker (with Docker Compose, if applicable): Docker's official website
- On Linux, you may need to install Docker Compose separately
Git: Git's official website
Tesseract OCR (for OCR functionality)
- On Ubuntu/Debian:
```
sudo apt update && sudo apt install -y tesseract-ocr libtesseract-dev
```
- On macOS (using Homebrew):
```
brew install tesseract
```
- On Windows:
  - Download installer from Tesseract at UB Mannheim or Tesseract OCR

2. Clone the Repository

Clone the project repository to your local machine:

    git clone https://github.com/zohrehasadi00/pdf2json.git
    cd pdf2json

3. X Server

If you use Windows you'll need to install X server. How to Install and Run X Server in Windows 11?

4. Docker

Run the Application

docker build -t myapp . 
docker run -it -e DISPLAY=host.docker.internal:0.0 -p 8000:8000 myapp

Access the app at: http://localhost:8000
stop the Container:
```
docker stop myapp_container
```

Name		Name	Last commit message	Last commit date
Latest commit History 106 Commits
.idea		.idea
__pycache__		__pycache__
api		api
backend		backend
models		models
results		results
statics		statics
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
fu-logo.png		fu-logo.png
gui.py		gui.py
main.py		main.py
requirements.txt		requirements.txt
structure.txt		structure.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pdf2json

Overview

Key Features

Setup and Installation

1. Prerequisites

2. Clone the Repository

3. X Server

4. Docker

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

zohrehasadi00/pdf2json

Folders and files

Latest commit

History

Repository files navigation

pdf2json

Overview

Key Features

Setup and Installation

1. Prerequisites

2. Clone the Repository

3. X Server

4. Docker

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages