Gemini Extractor

An invoice information extraction module that uses Google's Gemini API to pull structured data from Vietnamese invoices.

Overview

The Gemini Extractor is an alternative extraction pipeline that leverages large language models (LLMs) to perform intelligent, template-agnostic extraction of invoice data. Instead of relying on fixed rules or trained models for specific invoice layouts, it uses Gemini's vision and text understanding capabilities to extract structured information from diverse invoice formats.

Key Features

Template-agnostic extraction: Works across different invoice layouts without retraining
Vision + OCR integration: Can process both image and text inputs
Structured output: Returns clean JSON with validated fields
Configurable prompts: Easy to customize extraction instructions via YAML configs
Batch processing: Supports processing multiple invoices in a single run
Vietnamese language optimized: Prompts and validation tailored for Vietnamese invoices

Architecture

Invoice Image → OCR (optional) → Gemini API → Structured JSON

The extractor can work in two modes:

Direct vision mode: Send images directly to Gemini's vision API
OCR + text mode: Use local OCR (PaddleOCR/custom models) then send extracted text to Gemini
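
The difference between the two modes comes down to what is sent alongside the prompt. The following is a minimal sketch of that choice, not the module's actual code: `build_request_parts` is a hypothetical helper, and the part shapes (a prompt string followed by either an inline image blob or the OCR text) are an assumption about how a request could be assembled.

```python
import mimetypes
from pathlib import Path
from typing import Optional


def build_request_parts(
    mode: str,
    prompt: str,
    image_path: Optional[str] = None,
    ocr_text: Optional[str] = None,
) -> list:
    """Assemble the content parts for one extraction request (hypothetical helper)."""
    if mode == "vision":
        if image_path is None:
            raise ValueError("vision mode needs image_path")
        # Fall back to JPEG when the extension is not recognized.
        mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
        image_blob = {"mime_type": mime_type, "data": Path(image_path).read_bytes()}
        return [prompt, image_blob]
    if mode == "text":
        if not ocr_text:
            raise ValueError("text mode needs non-empty ocr_text")
        return [prompt, ocr_text]
    raise ValueError(f"unknown mode: {mode!r}")
```

In vision mode the image bytes travel with the request, so no local OCR step is needed. In text mode only strings are sent, which keeps requests small but makes the result depend on OCR quality.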

Installation

Prerequisites

Python 3.9+
A Google Cloud account with Gemini API access
API key for Google Generative AI

Setup

Install required dependencies:

pip install google-generativeai pyyaml pillow

Set up your Gemini API key:

export GEMINI_API_KEY="your-api-key-here"

Or add it to your config file, under the api section used in the Configuration example:

# config/gemini_config.yaml
api:
  key: "your-api-key-here"
  model: "gemini-flash-latest"

Configuration

The extractor uses YAML configuration files to control behavior. Example config:

# config/gemini_config.yaml
api:
  key: ${GEMINI_API_KEY}  # or direct string
  model: "gemini-flash-latest"
  temperature: 0.1
  max_output_tokens: 2048

extraction:
  mode: "vision"  # or "text"
  prompt_template: "prompts/extraction_vi.txt"
  output_format: "json"
  
fields:
  - SELLER
  - ADDRESS
  - TIMESTAMP
  - PRODUCTS
  - TOTAL_COST
  
products_schema:
  - PRODUCT
  - NUM
  - VALUE
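
PyYAML's `safe_load` does not expand `${GEMINI_API_KEY}` on its own; it returns the placeholder as a literal string. How the extractor resolves it is not shown here, so the sketch below is one possible approach under that assumption: a hypothetical `expand_env` helper that walks the parsed config and substitutes `${VAR}` placeholders from the environment.

```python
import os
import re
from typing import Any, Mapping, Optional

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace ${VAR} placeholders in a parsed config (hypothetical helper)."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _ENV_PATTERN.sub(substitute, value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged


# This dict mirrors part of what yaml.safe_load would return for the example config.
config = {"api": {"key": "${GEMINI_API_KEY}", "temperature": 0.1}, "fields": ["SELLER"]}
resolved = expand_env(config, env={"GEMINI_API_KEY": "demo-key"})
print(resolved["api"]["key"])  # demo-key
```

Raising on an unset variable surfaces a missing key at load time rather than as an authentication failure on the first API call.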

Configuration Options

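api.key: Your Gemini API key (a literal string or an environment variable reference such as ${GEMINI_API_KEY})
api.model: Gemini model to use (e.g., gemini-flash-latest, gemini-1.5-flash, gemini-1.5-pro)
api.temperature: Controls randomness (0.0-1.0; lower = more deterministic)
api.max_output_tokens: Upper limit on the number of tokens in the model's response
extraction.mode: vision for image input, text for OCR text input
extraction.prompt_template: Path to the prompt file
extraction.output_format: Format of the returned result (json in the example above)
fields: List of top-level fields to extract
products_schema: Names of the fields extracted for each product line item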
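
Checking these options before the first API call can catch typos early. The sketch below is an illustration only: `validate_config` is a hypothetical helper, and its rules (the two allowed modes, the 0.0-1.0 temperature range, and requiring products_schema whenever PRODUCTS is listed) are read off the descriptions above rather than taken from the module.

```python
from typing import Any, Dict, List

VALID_MODES = {"vision", "text"}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    problems: List[str] = []
    api = config.get("api") or {}
    extraction = config.get("extraction") or {}
    fields = config.get("fields") or []

    if not api.get("key"):
        problems.append("api.key is missing or empty")
    if not api.get("model"):
        problems.append("api.model is missing or empty")

    # Temperature is optional here; when present it must be a number in [0.0, 1.0].
    temperature = api.get("temperature")
    if temperature is not None:
        is_number = isinstance(temperature, (int, float)) and not isinstance(temperature, bool)
        if not is_number or not 0.0 <= temperature <= 1.0:
            problems.append("api.temperature must be a number between 0.0 and 1.0")

    mode = extraction.get("mode")
    if mode not in VALID_MODES:
        problems.append(f"extraction.mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    if not fields:
        problems.append("fields must list at least one field")
    if "PRODUCTS" in fields and not config.get("products_schema"):
        problems.append("products_schema is required when PRODUCTS is extracted")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every issue in one pass.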

Usage

Basic Usage (Single Image)

from src.gemini_extractor import GeminiExtractor

# Initialize extractor
extractor = GeminiExtractor(config_path="config/gemini_config.yaml")

# Extract from image
result = extractor.extract_from_image("path/to/invoice.jpg")

print(result)
# {
#   "SELLER": "VinCommerce",
#   "ADDRESS": "...",
#   "TIMESTAMP": "...",
#   "PRODUCTS": [...],
#   "TOTAL_COST": "..."
# }
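
Model replies sometimes arrive wrapped in a Markdown code fence rather than as bare JSON. Whether `GeminiExtractor` already handles this is not stated here, so the following is only a sketch of how a raw reply could be turned into the dict shown above; `parse_json_response` is a hypothetical name.

```python
import json
import re
from typing import Any, Dict

# Optional Markdown fence (three backticks, optional "json" tag) around the payload.
_FENCE = re.compile(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", re.DOTALL | re.IGNORECASE)


def parse_json_response(raw: str) -> Dict[str, Any]:
    """Strip an optional code fence and parse the reply as a JSON object."""
    text = raw.strip()
    fenced = _FENCE.match(text)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

`json.JSONDecodeError` is a subclass of `ValueError`, so a caller can catch `ValueError` to handle both malformed JSON and a non-object reply.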

Using the Example Scripts

We provide ready-to-use example scripts:

Basic extraction (single image):

python examples/basic_gemini_extraction.py

This will:

Load a sample invoice image
Extract structured data using Gemini
Save results to output/extracted_invoice.json
Print the extracted data

Batch extraction (multiple images):

python examples/batch_gemini_extraction.py

This will:

Process all images in the uploads/ folder
Extract data from each invoice
Save individual JSON files to output/batch_results/
Generate a summary report in output/batch_results/batch_summary.json

Advanced Usage (Custom Prompts)

You can customize the extraction prompt to improve accuracy or extract additional fields:

extractor = GeminiExtractor(
    config_path="config/gemini_config.yaml",
    custom_prompt="Extract seller, date, and all product items with prices..."
)

result = extractor.extract_from_image("invoice.jpg")

Using with OCR Text

If you already have OCR results:

# ocr_text is the plain-text output (str) of your OCR step
# Extract from OCR text instead of an image
result = extractor.extract_from_text(ocr_text)
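
OCR engines usually return a set of text boxes rather than a single string, so the boxes need to be joined in reading order first. The sketch below assumes a simplified, hypothetical box format of `(x, y, text)` for the top-left corner and the recognized string; real OCR output (for example from PaddleOCR) has a different shape and would need to be mapped into it.

```python
from typing import List, Tuple

# (x, y, text): top-left corner of a detected box plus its recognized string.
OcrBox = Tuple[float, float, str]


def boxes_to_text(boxes: List[OcrBox], line_tolerance: float = 10.0) -> str:
    """Group boxes into lines by vertical position, then read each line left to right."""
    lines: List[List[OcrBox]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Same line if the box is within line_tolerance of the line's first box.
        if lines and abs(box[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return "\n".join(
        " ".join(text for _, _, text in sorted(line, key=lambda b: b[0]))
        for line in lines
    )


boxes = [(200, 52, "93.390"), (10, 50, "dưa hấu"), (10, 10, "VinCommerce")]
ocr_text = boxes_to_text(boxes)
print(ocr_text)
```

Keeping a product name and its price on the same line gives the model the row structure that a plain concatenation of boxes would lose.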

Batch Processing

from pathlib import Path

images = list(Path("uploads/").glob("*.jpg"))
results = extractor.batch_extract(images, output_dir="output/batch_results")

# results is a list of dicts, each with 'filename' and 'status' keys, plus 'data' on success
for item in results:
    print(f"{item['filename']}: {item['status']}")
    if item['status'] == 'success':
        print(item['data'])
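
A summary report can be derived from the same list of result dicts. The sketch below assumes only the `filename` and `status` keys used above; the summary layout (`total`, `by_status`, `failed_files`) and the `write_batch_summary` name are hypothetical and may differ from what the batch example script writes.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List


def write_batch_summary(results: List[Dict[str, Any]], output_dir: str) -> Dict[str, Any]:
    """Count results per status and write batch_summary.json into output_dir."""
    counts = Counter(item["status"] for item in results)
    summary = {
        "total": len(results),
        "by_status": dict(counts),
        "failed_files": [item["filename"] for item in results if item["status"] != "success"],
    }
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Vietnamese filenames readable in the report.
    (out_dir / "batch_summary.json").write_text(
        json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return summary
```

Listing the failed filenames makes it straightforward to re-run only those invoices.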

Prompt Engineering

The quality of extraction depends heavily on the prompt. The default prompt is in prompts/extraction_vi.txt.

Example prompt structure:

Bạn là một hệ thống trích xuất thông tin hóa đơn chuyên nghiệp.

Từ hình ảnh/văn bản hóa đơn, hãy trích xuất các thông tin sau:

1. SELLER: Tên cửa hàng/công ty
2. ADDRESS: Địa chỉ
3. TIMESTAMP: Ngày giờ bán hàng
4. PRODUCTS: Danh sách sản phẩm (mỗi sản phẩm gồm tên, số lượng, giá trị)
5. TOTAL_COST: Tổng tiền

Trả về kết quả dưới dạng JSON hợp lệ theo định dạng:
{
  "SELLER": "...",
  "ADDRESS": "...",
  ...
}

Tips for better prompts:

Be specific about output format (JSON structure)
Provide examples of expected output
Include edge case handling (missing fields, multiple formats)
Use Vietnamese for Vietnamese invoices
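
To keep the JSON structure in the prompt in step with the `fields` and `products_schema` lists from the config, the format block can be generated rather than typed by hand. This is a sketch under that assumption; `build_output_skeleton` and `build_prompt` are hypothetical helpers, not part of the module.

```python
import json
from typing import List


def build_output_skeleton(fields: List[str], products_schema: List[str]) -> str:
    """Render the expected JSON shape, with "..." as a placeholder for each value."""
    skeleton = {}
    for field in fields:
        if field == "PRODUCTS":
            # One example line item shows the model the per-product keys.
            skeleton[field] = [{key: "..." for key in products_schema}]
        else:
            skeleton[field] = "..."
    return json.dumps(skeleton, ensure_ascii=False, indent=2)


def build_prompt(instructions: str, fields: List[str], products_schema: List[str]) -> str:
    """Append the generated JSON skeleton to the free-text instructions."""
    return f"{instructions.strip()}\n\n{build_output_skeleton(fields, products_schema)}\n"


fields = ["SELLER", "ADDRESS", "TIMESTAMP", "PRODUCTS", "TOTAL_COST"]
print(build_prompt("Trả về kết quả dưới dạng JSON hợp lệ theo định dạng:", fields, ["PRODUCT", "NUM", "VALUE"]))
```

Adding a field to the config then changes the prompt's format block automatically, which avoids a prompt that asks for one structure while the code expects another.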

Output Format

Standard JSON output:

{
  "SELLER": "VinCommerce",
  "ADDRESS": "TP. Cẩm Phả, Quảng Ninh",
  "TIMESTAMP": "ngày bán: 15/08/2020 09:47",
  "PRODUCTS": [
    {
      "PRODUCT": "dưa hấu không hạt 27.500/KG x 3,396 KG",
      "NUM": "1",
      "VALUE": "93.390"
    },
    {
      "PRODUCT": "cải thảo 24.900/KG x 1,704 KG",
      "NUM": "1",
      "VALUE": "42.430"
    }
  ],
  "TOTAL_COST": "331.142"
}
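
All values in the output are strings, so downstream code has to convert them before doing arithmetic or date comparisons. The sketch below assumes the conventions visible in the sample: a dot as thousands separator in amounts (27.500/KG x 3,396 KG gives 93.390, i.e. 27,500 × 3.396 = 93,390) and a dd/mm/yyyy hh:mm timestamp that may carry a label in front. `parse_amount` and `parse_timestamp` are hypothetical helpers, and invoices with other formats would need additional handling.

```python
import re
from datetime import datetime

_TIMESTAMP = re.compile(r"(\d{2}/\d{2}/\d{4})\s+(\d{2}:\d{2})")


def parse_amount(value: str) -> int:
    """Convert an amount such as "93.390" to 93390 (dot assumed to be a thousands separator)."""
    digits = value.strip().replace(".", "")
    if not digits.isdigit():
        raise ValueError(f"not a whole amount: {value!r}")
    return int(digits)


def parse_timestamp(value: str) -> datetime:
    """Find a dd/mm/yyyy hh:mm timestamp anywhere in the string and parse it."""
    match = _TIMESTAMP.search(value)
    if match is None:
        raise ValueError(f"no dd/mm/yyyy hh:mm timestamp in {value!r}")
    return datetime.strptime(" ".join(match.groups()), "%d/%m/%Y %H:%M")


print(parse_amount("93.390"))                           # 93390
print(parse_timestamp("ngày bán: 15/08/2020 09:47"))    # 2020-08-15 09:47:00
```

Searching for the timestamp pattern instead of parsing the whole string tolerates labels such as "ngày bán:" that the model copies from the invoice.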

API Reference