README_CONTENT = '''# OCR Engine API
A powerful OCR (Optical Character Recognition) engine that supports multiple file formats including images, PDFs, and Excel files with embedded images.
- Multi-format Support: PNG, JPG, JPEG, PDF, XLSX
- Multilingual OCR: English, Indonesian, Chinese (Simplified)
- Smart PDF Processing: Handles both native and scanned PDFs
- Excel Image Extraction: OCR for embedded images in Excel files
- Spell Correction: Automatic text correction using SymSpell
- Image Preprocessing: Automatic image enhancement for better OCR results
- Async Processing: Background task processing with status tracking
- RESTful API: FastAPI-based REST API

## Installation

- Clone the repository:

      git clone <repository-url>
      cd ocr-engine

- Install dependencies:

      pip install -r requirements.txt
      pip install python-docx mammoth

- Install system dependencies:

      # For Ubuntu/Debian
      sudo apt-get update
      sudo apt-get install tesseract-ocr
      sudo apt-get install libmagic1
      sudo apt-get install poppler-utils

      # For macOS
      brew install tesseract
      brew install libmagic
      brew install poppler

## Usage

Start the server:

    python main.py

The API will be available at http://localhost:8000

Visit http://localhost:8000/docs for interactive API documentation.
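
If the server does not start or OCR requests fail, a missing system package is one possible cause. The sketch below is one way to check that the packages from the installation step can be located. The executable names `tesseract` and `pdftoppm`, and the helper `find_missing`, are assumptions made for illustration and are not part of this project.

```python
import shutil
from ctypes.util import find_library

# Assumed executable names: `tesseract` usually ships with tesseract-ocr,
# and `pdftoppm` with poppler-utils (Debian/Ubuntu) or poppler (Homebrew).
REQUIRED_EXECUTABLES = ("tesseract", "pdftoppm")
# libmagic is a shared library rather than a command, so it is looked up
# through the dynamic-library search instead of PATH.
REQUIRED_LIBRARIES = ("magic",)


def find_missing(executables=REQUIRED_EXECUTABLES,
                 libraries=REQUIRED_LIBRARIES,
                 which=shutil.which,
                 find_lib=find_library):
    """Return the names of executables and libraries that could not be located."""
    missing = [name for name in executables if which(name) is None]
    missing += ["lib" + name for name in libraries if find_lib(name) is None]
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

The `which` and `find_lib` parameters exist only so the lookup can be replaced in tests; in normal use `find_missing()` is called without arguments.
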

Upload a file:

    curl -X POST "http://localhost:8000/ocr/upload" \
      -H "accept: application/json" \
      -H "Content-Type: multipart/form-data" \
      -F "file=@your_file.pdf"

Response:

    {
      "task_id": "uuid-string",
      "message": "File uploaded successfully. Processing started.",
      "status_url": "/ocr/status/uuid-string",
      "result_url": "/ocr/result/uuid-string"
    }

Check the processing status:

    curl -X GET "http://localhost:8000/ocr/status/{task_id}"

Response:

    {
      "id": "uuid-string",
      "status": "processing",
      "progress": 65,
      "message": "Processing PDF file..."
    }

Get the result:

    curl -X GET "http://localhost:8000/ocr/result/{task_id}"

Response:

    {
      "id": "uuid-string",
      "status": "completed",
      "original_filename": "document.pdf",
      "file_type": "pdf",
      "processing_time": 12.34,
      "extracted_text": "Original extracted text...",
      "corrected_text": "Corrected text with spelling fixes...",
      "corrections_made": [
        {
          "original": "teh",
          "corrected": "the",
          "confidence": 1000
        }
      ],
      "detailed_results": {...},
      "created_at": "2024-01-01T10:00:00"
    }

Delete a result:

    curl -X DELETE "http://localhost:8000/ocr/result/{task_id}"

Health check:

    curl -X GET "http://localhost:8000/ocr/health"

## Supported Formats

Image files (PNG, JPG, JPEG):

- Direct OCR processing
- Automatic image preprocessing
- Coordinate extraction for text regions

PDF files:

- Native text extraction for text-based PDFs
- OCR processing for scanned PDFs
- Automatic detection of PDF type

Excel files (XLSX):

- Text extraction from cells
- OCR processing of embedded images
- Sheet-by-sheet processing
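
Because processing runs in the background, a client has to poll the status endpoint before fetching the result. The sketch below shows one way to do that from Python with only the standard library, using the `/ocr/status/{task_id}` and `/ocr/result/{task_id}` endpoints shown above. The helper names `fetch_json` and `wait_for_result`, the polling interval, the timeout, and the `"failed"` status value are assumptions; this README only documents the `"processing"` and `"completed"` statuses.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # default address used in the examples above


def fetch_json(path, base_url=BASE_URL):
    """GET a JSON document from the API and decode it."""
    with urllib.request.urlopen(base_url + path) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_result(task_id, fetch=fetch_json, interval=2.0, timeout=300.0,
                    sleep=time.sleep):
    """Poll the status endpoint until the task completes, then return the result."""
    waited = 0.0
    while True:
        status = fetch(f"/ocr/status/{task_id}")
        if status["status"] == "completed":
            return fetch(f"/ocr/result/{task_id}")
        # Assumption: "failed" is a hypothetical status name; adjust it to
        # whatever the API actually reports for unsuccessful tasks.
        if status["status"] == "failed":
            raise RuntimeError(status.get("message", "OCR task failed"))
        if waited >= timeout:
            raise TimeoutError(
                f"task {task_id} still {status['status']!r} after {timeout}s"
            )
        sleep(interval)
        waited += interval
```

With the server running, `wait_for_result(task_id)` can be called with the `task_id` returned by the upload request; the returned dictionary then has the fields shown in the result response, such as `corrected_text`. The `fetch` and `sleep` parameters are there so the loop can be exercised without a live server.
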

## Configuration

Modify the language list in ocr_engine/ocr_processor.py:

    self.reader = easyocr.Reader(['en', 'id', 'ch_sim'], gpu=False)

Add custom words in ocr_engine/spell_checker.py:

    custom_words = ["your", "custom", "words"]
    for word in custom_words:
        self.sym_spell.create_dictionary_entry(word, 1000)

## Project Structure

ocr-engine/
├── main.py # Entry point
├── requirements.txt # Dependencies
├── README.md # This file
├── api/
│ ├── __init__.py
│ ├── models.py # Pydantic models
│ └── routes.py # FastAPI routes
├── ocr_engine/