Gemini Pipeline | Traditional OCR Pipeline
A modern, AI-powered invoice information extraction module using Google's Gemini API for structured data extraction from Vietnamese invoices.
The Gemini Extractor is an alternative extraction pipeline that leverages large language models (LLMs) to perform intelligent, template-agnostic extraction of invoice data. Instead of relying on fixed rules or trained models for specific invoice layouts, it uses Gemini's vision and text understanding capabilities to extract structured information from diverse invoice formats.
- Template-agnostic extraction: Works across different invoice layouts without retraining
- Vision + OCR integration: Can process both image and text inputs
- Structured output: Returns clean JSON with validated fields
- Configurable prompts: Easy to customize extraction instructions via YAML configs
- Batch processing: Support for processing multiple invoices in a single run
- Vietnamese language optimized: Prompts and validation tailored for Vietnamese invoices
Invoice Image → OCR (optional) → Gemini API → Structured JSON
The extractor can work in two modes:
- Direct vision mode: Send images directly to Gemini's vision API
- OCR + text mode: Use local OCR (PaddleOCR/custom models) then send extracted text to Gemini
- Python 3.9+
- A Google Cloud account with Gemini API access
- API key for Google Generative AI
- Install required dependencies:
pip install google-generativeai pyyaml pillow- Set up your Gemini API key:
export GEMINI_API_KEY="your-api-key-here"Or add it to your config file:
# config/gemini_config.yaml
api_key: "your-api-key-here"
model: "gemini-flash-latest"The extractor uses YAML configuration files to control behavior. Example config:
# config/gemini_config.yaml
api:
key: ${GEMINI_API_KEY} # or direct string
model: "gemini-flash-latest"
temperature: 0.1
max_output_tokens: 2048
extraction:
mode: "vision" # or "text"
prompt_template: "prompts/extraction_vi.txt"
output_format: "json"
fields:
- SELLER
- ADDRESS
- TIMESTAMP
- PRODUCTS
- TOTAL_COST
products_schema:
- PRODUCT
- NUM
- VALUEapi.key: Your Gemini API key (can use environment variables)api.model: Gemini model to use (gemini-1.5-flash,gemini-1.5-pro, etc.)api.temperature: Controls randomness (0.0-1.0, lower = more deterministic)extraction.mode:visionfor image input,textfor OCR text inputextraction.prompt_template: Path to custom prompt filefields: List of top-level fields to extractproducts_schema: Schema for product line items
from src.gemini_extractor import GeminiExtractor
# Initialize extractor
extractor = GeminiExtractor(config_path="config/gemini_config.yaml")
# Extract from image
result = extractor.extract_from_image("path/to/invoice.jpg")
print(result)
# {
# "SELLER": "VinCommerce",
# "ADDRESS": "...",
# "TIMESTAMP": "...",
# "PRODUCTS": [...],
# "TOTAL_COST": "..."
# }We provide ready-to-use example scripts:
Basic extraction (single image):
python examples/basic_gemini_extraction.pyThis will:
- Load a sample invoice image
- Extract structured data using Gemini
- Save results to
output/extracted_invoice.json - Print the extracted data
Batch extraction (multiple images):
python examples/batch_gemini_extraction.pyThis will:
- Process all images in
uploads/folder - Extract data from each invoice
- Save individual JSON files to
output/batch_results/ - Generate a summary report in
output/batch_results/batch_summary.json
You can customize the extraction prompt to improve accuracy or extract additional fields:
extractor = GeminiExtractor(
config_path="config/gemini_config.yaml",
custom_prompt="Extract seller, date, and all product items with prices..."
)
result = extractor.extract_from_image("invoice.jpg")If you already have OCR results:
# Extract from OCR text instead of image
result = extractor.extract_from_text(ocr_text)from pathlib import Path
images = list(Path("uploads/").glob("*.jpg"))
results = extractor.batch_extract(images, output_dir="output/batch_results")
# Results is a list of dicts with metadata
for item in results:
print(f"{item['filename']}: {item['status']}")
if item['status'] == 'success':
print(item['data'])The quality of extraction depends heavily on the prompt. The default prompt is in prompts/extraction_vi.txt.
Example prompt structure:
Bạn là một hệ thống trích xuất thông tin hóa đơn chuyên nghiệp.
Từ hình ảnh/văn bản hóa đơn, hãy trích xuất các thông tin sau:
1. SELLER: Tên cửa hàng/công ty
2. ADDRESS: Địa chỉ
3. TIMESTAMP: Ngày giờ bán hàng
4. PRODUCTS: Danh sách sản phẩm (mỗi sản phẩm gồm tên, số lượng, giá trị)
5. TOTAL_COST: Tổng tiền
Trả về kết quả dưới dạng JSON hợp lệ theo định dạng:
{
"SELLER": "...",
"ADDRESS": "...",
...
}
Tips for better prompts:
- Be specific about output format (JSON structure)
- Provide examples of expected output
- Include edge case handling (missing fields, multiple formats)
- Use Vietnamese for Vietnamese invoices
Standard JSON output:
{
"SELLER": "VinCommerce",
"ADDRESS": "TP. Cẩm Phả, Quảng Ninh",
"TIMESTAMP": "ngày bán: 15/08/2020 09:47",
"PRODUCTS": [
{
"PRODUCT": "dưa hấu không hạt 27.500/KG x 3,396 KG",
"NUM": "1",
"VALUE": "93.390"
},
{
"PRODUCT": "cải thảo 24.900/KG x 1,704 KG",
"NUM": "1",
"VALUE": "42.430"
}
],
"TOTAL_COST": "331.142"
}Main class for invoice extraction.
Methods:
-
__init__(config_path: str, custom_prompt: str = None)- Initialize extractor with config file
-
extract_from_image(image_path: str) -> dict- Extract data from invoice image
- Returns: Dictionary with extracted fields
-
extract_from_text(text: str) -> dict- Extract data from OCR text
- Returns: Dictionary with extracted fields
-
batch_extract(image_paths: list, output_dir: str = None) -> list- Process multiple invoices
- Returns: List of results with metadata
See config/gemini_config.yaml for full schema documentation.
API Costs (as of October 2025):
- Gemini 1.5 Flash: ~$0.00035 per image (very affordable)
- Gemini 1.5 Pro: ~$0.0035 per image (higher accuracy)
Processing Time:
- ~1-3 seconds per invoice (depends on model and image size)
- Batch processing can be parallelized
Accuracy:
- Works well on most standard Vietnamese invoices
- May need prompt tuning for unusual layouts
- Handles handwritten text better than traditional OCR
1. API Key Error
Error: Invalid API key
Solution: Check that your GEMINI_API_KEY environment variable is set correctly.
2. Rate Limiting
Error: Resource exhausted (quota)
Solution: Add delays between requests or upgrade your API quota.
3. Poor Extraction Quality Solution:
- Try using
gemini-1.5-proinstead offlash - Adjust temperature (lower = more consistent)
- Refine your prompt in
prompts/extraction_vi.txt
4. Empty Results Solution:
- Check image quality (resolution, clarity)
- Verify the prompt matches your invoice format
- Try OCR + text mode instead of vision mode
Enable detailed logging:
extractor = GeminiExtractor(config_path="config/gemini_config.yaml")
extractor.set_debug(True) # Prints API requests/responses
result = extractor.extract_from_image("invoice.jpg")| Feature | Gemini Extractor | Traditional Pipeline |
|---|---|---|
| Setup complexity | Low (API key only) | High (models, weights) |
| Training required | No | Yes |
| Template flexibility | High | Low |
| Cost | API costs | GPU costs |
| Offline capable | No | Yes |
| Accuracy (varied layouts) | High | Medium |
examples/basic_gemini_extraction.py- Single image extraction demoexamples/batch_gemini_extraction.py- Batch processing democonfig/gemini_config.yaml- Full configuration exampleconfig/gemini_config_simple.yaml- Minimal config for quick start
- Gemini API Documentation
- Prompt Engineering Guide
- See
architect/README_GEMINI.mdfor design decisions and architecture details
This module is part of the OCR-invoice project. See project root LICENSE.
Questions or issues? Check the main project README or open an issue on GitHub.