coozyme/ocr-engine

OCR Engine API

An OCR (Optical Character Recognition) engine that extracts text from multiple file formats, including images, PDFs, and Excel files with embedded images.

Features

  • Multi-format Support: PNG, JPG, JPEG, PDF, XLSX
  • Multilingual OCR: English, Indonesian, Chinese (Simplified)
  • Smart PDF Processing: Handles both native and scanned PDFs
  • Excel Image Extraction: OCR for embedded images in Excel files
  • Spell Correction: Automatic text correction using SymSpell
  • Image Preprocessing: Automatic image enhancement for better OCR results
  • Async Processing: Background task processing with status tracking
  • RESTful API: FastAPI-based REST API

Installation

  1. Clone the repository:
git clone <repository-url>
cd ocr-engine
  2. Install dependencies:
pip install -r requirements.txt
pip install python-docx mammoth
  3. Install system dependencies:
# For Ubuntu/Debian
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install libmagic1
sudo apt-get install poppler-utils

# For macOS
brew install tesseract
brew install libmagic
brew install poppler

Usage

Start the server

python main.py

The API will be available at http://localhost:8000

API Documentation

Visit http://localhost:8000/docs for interactive API documentation.

API Endpoints

1. Upload File for OCR Processing

curl -X POST "http://localhost:8000/ocr/upload" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_file.pdf"

Response:

{
  "task_id": "uuid-string",
  "message": "File uploaded successfully. Processing started.",
  "status_url": "/ocr/status/uuid-string",
  "result_url": "/ocr/result/uuid-string"
}
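
The same upload can be scripted from Python. The sketch below uses only the standard library and assumes the server expects a single multipart field named file, as in the curl command above. The helper names build_multipart and upload_file are illustrative and not part of this project.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_file(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a file to /ocr/upload and return the decoded JSON response."""
    file_path = Path(path)
    body, content_type = build_multipart("file", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/ocr/upload",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, upload_file("your_file.pdf")["task_id"] would give the ID used by the status and result endpoints.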

2. Check Processing Status

curl -X GET "http://localhost:8000/ocr/status/{task_id}"

Response:

{
  "id": "uuid-string",
  "status": "processing",
  "progress": 65,
  "message": "Processing PDF file..."
}
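
Because processing runs in the background, a client typically polls this endpoint until the task finishes. The following is a minimal sketch of such a loop, not the project's own client. Only the "processing" and "completed" status values appear in this README; the "failed" value, the function name wait_until_done, and the interval and timeout defaults are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_until_done(
    task_id: str,
    base_url: str = "http://localhost:8000",
    interval: float = 2.0,
    timeout: float = 300.0,
    fetch: Callable[[str], dict] = fetch_json,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll /ocr/status/{task_id} until the task reaches a terminal status."""
    # Assumption: "failed" is a guess at the error state; the server may
    # use a different name for it.
    terminal = {"completed", "failed"}
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(f"{base_url}/ocr/status/{task_id}")
        if status.get("status") in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status.get('status')!r}")
        sleep(interval)
```

The fetch and sleep parameters are injectable so the loop can be exercised without a running server.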

3. Get OCR Results

curl -X GET "http://localhost:8000/ocr/result/{task_id}"

Response:

{
  "id": "uuid-string",
  "status": "completed",
  "original_filename": "document.pdf",
  "file_type": "pdf",
  "processing_time": 12.34,
  "extracted_text": "Original extracted text...",
  "corrected_text": "Corrected text with spelling fixes...",
  "corrections_made": [
    {
      "original": "teh",
      "corrected": "the",
      "confidence": 1000
    }
  ],
  "detailed_results": {...},
  "created_at": "2024-01-01T10:00:00"
}
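
A client can turn the result payload into typed records before using it. The sketch below relies only on fields shown in the example response above; the Correction class and the helper names are illustrative, and the real payload may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    original: str
    corrected: str
    confidence: int


def parse_corrections(result: dict) -> list[Correction]:
    """Turn the corrections_made list of a result payload into typed records."""
    return [
        Correction(c["original"], c["corrected"], c["confidence"])
        for c in result.get("corrections_made", [])
    ]


def summarize(result: dict) -> str:
    """One-line summary built from fields shown in the example response."""
    corrections = parse_corrections(result)
    return (
        f"{result['original_filename']} ({result['file_type']}): "
        f"{len(result['corrected_text'])} chars, "
        f"{len(corrections)} correction(s), "
        f"{result['processing_time']:.2f}s"
    )
```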

4. Delete Results (cleanup)

curl -X DELETE "http://localhost:8000/ocr/result/{task_id}"

5. Health Check

curl -X GET "http://localhost:8000/ocr/health"

File Format Support

Images (PNG, JPG, JPEG)

  • Direct OCR processing
  • Automatic image preprocessing
  • Coordinate extraction for text regions

PDF Files

  • Native text extraction for text-based PDFs
  • OCR processing for scanned PDFs
  • Automatic detection of PDF type
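
This README does not say how the PDF type is detected. One common heuristic is to check how much text each page's embedded text layer yields and to fall back to OCR when it is nearly empty. The sketch below illustrates that heuristic on per-page text that has already been extracted by a PDF library; the function names and the min_chars threshold are assumptions, not the engine's actual logic.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """A page with almost no extractable text is treated as scanned."""
    return len(page_text.strip()) < min_chars


def classify_pdf(page_texts: list[str], min_chars: int = 20) -> str:
    """Return "native", "scanned", or "mixed" from per-page text-layer output."""
    if not page_texts:
        return "scanned"  # nothing extractable, so fall back to OCR
    flags = [needs_ocr(text, min_chars) for text in page_texts]
    if all(flags):
        return "scanned"
    if not any(flags):
        return "native"
    return "mixed"
```

A "mixed" outcome suggests deciding per page, running OCR only on the pages that needs_ocr flags.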

Excel Files (XLSX)

  • Text extraction from cells
  • OCR processing of embedded images
  • Sheet-by-sheet processing
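
An XLSX file is a ZIP archive, and embedded pictures are stored under xl/media/ inside it. The sketch below shows one way to pull those images out with the standard library before passing them to OCR. It is an illustration only: the engine may use a different approach, and the embedded_images name and the suffix filter are assumptions.

```python
import zipfile
from typing import BinaryIO, Union

# Assumption: restrict to the image formats this README lists as supported.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def embedded_images(xlsx: Union[str, BinaryIO]) -> dict[str, bytes]:
    """Map archive paths under xl/media/ to raw image bytes."""
    with zipfile.ZipFile(xlsx) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.startswith("xl/media/") and name.lower().endswith(IMAGE_SUFFIXES)
        }
```

This approach does not tell you which sheet an image belongs to; that mapping lives in the drawing relationship files and is not covered here.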

Configuration

OCR Languages

Modify the language list in ocr_engine/ocr_processor.py:

self.reader = easyocr.Reader(['en', 'id', 'ch_sim'], gpu=False)
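
When changing the list, note that EasyOCR restricts which languages can share one Reader. To our knowledge, ch_sim can only be combined with en, so a list that also contains id may be rejected with a ValueError; please verify this against the EasyOCR documentation for your installed version. The sketch below splits a language list into groups that could each back a separate Reader. The helper name and the contents of ENGLISH_ONLY_PARTNERS are assumptions.

```python
# Assumption: language codes whose model can only be paired with English.
# Only ch_sim is listed because it is the one this README uses; other codes
# may belong here too, so check the EasyOCR documentation.
ENGLISH_ONLY_PARTNERS = {"ch_sim"}


def reader_language_groups(languages: list[str]) -> list[list[str]]:
    """Split a language list into groups that could each back one Reader."""
    groups = []
    # Remaining languages are assumed to be mutually compatible (en and id
    # both use the Latin script).
    shared = [lang for lang in languages if lang not in ENGLISH_ONLY_PARTNERS]
    if shared:
        groups.append(shared)
    for lang in languages:
        if lang in ENGLISH_ONLY_PARTNERS:
            groups.append([lang, "en"])
    return groups
```

Each group would then be passed to its own easyocr.Reader(group, gpu=False), and the image run through every reader.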

Spell Checker Dictionary

Add custom words in ocr_engine/spell_checker.py:

custom_words = ["your", "custom", "words"]
for word in custom_words:
    self.sym_spell.create_dictionary_entry(word, 1000)
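
For longer word lists it may be easier to keep the words in a text file. The sketch below reads one word per line and registers each through the same create_dictionary_entry call shown above. The file format, the helper names, and the default count of 1000 (taken from the snippet above) are assumptions.

```python
from pathlib import Path
from typing import Iterable


def read_word_list(path: str) -> list[str]:
    """Read one word per line, skipping blank lines and # comments."""
    words = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word and not word.startswith("#"):
            words.append(word)
    return words


def register_words(sym_spell, words: Iterable[str], count: int = 1000) -> int:
    """Add each word via create_dictionary_entry; returns how many were added."""
    added = 0
    for word in words:
        sym_spell.create_dictionary_entry(word, count)
        added += 1
    return added
```

Inside the spell checker class this could be called as register_words(self.sym_spell, read_word_list("custom_words.txt")), where custom_words.txt is a hypothetical file name.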

Project Structure

ocr-engine/
├── main.py                 # Entry point
├── requirements.txt        # Dependencies
├── README.md              # This file
├── api/
│   ├── __init__.py
│   ├── models.py          # Pydantic models
│   └── routes.py          # FastAPI routes
├── ocr_engine/
