Skip to content

stanford-oval/Churro

Repository files navigation

CHURRO Logo

CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Model Dataset Paper GitHub Stars

Handwritten and printed text recognition across 22 centuries and 46 language clusters, including historical and dead languages.

Cost vs Performance comparison showing CHURRO's accuracy advantage at significantly lower cost
Cost vs. accuracy: CHURRO (3B) achieves higher accuracy than much larger commercial and open-weight VLMs while being substantially cheaper.


Table of Contents

  1. Overview
  2. Quick Start
  3. Installing the Full Package
  1. CLI Workflows
  1. Adding a New OCR System
  2. HistoricalDocument XML
  1. Citation
  2. License

Overview

CHURRO is a 3B-parameter open-weight vision-language model (VLM) for historical document transcription. It is trained on CHURRO-DS, a curated dataset of ~100K pages from 155 historical collections spanning 22 centuries and 46 language clusters.

On the CHURRO-DS test set, CHURRO delivers 15.5× lower cost than Gemini 2.5 Pro while exceeding its accuracy.

Quick Start

Want a minimal demo? The following will install transformers and torch only:

git clone https://github.com/stanford-oval/churro.git
cd churro
curl -fsSL https://pixi.sh/install.sh | bash
pixi shell -e minimal

Then run:

python churro_transformers_infer.py tests/churro_dataset_sample_1.jpeg --max-new-tokens 40

Expected output begins with:

<HistoricalDocument xmlns="http://example.com/historicaldocument">
  <Metadata>
    <Language>German</Language>
    <WritingDirection>ltr</WritingDirection>
    <PhysicalDescription

Increase --max-new-tokens to 20000 for complete pages.

This minimal path is ideal for quick CPU/GPU trials of the open-weight Churro model. It has lower throughput, and for example, does not install the CLI tools, and does not support image binarization.

Installing the Full Package

Warning

This codebase has been tested on Ubuntu 20.04+. Using other operating systems may require tinkering with and troubleshooting system dependencies.

System Packages

sudo apt-get update && sudo apt-get install -y \
	libtiff5-dev libjpeg8-dev libopenjp2-7-dev zlib1g-dev libfreetype6-dev \
	liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python3-tk libharfbuzz-dev \
	libfribidi-dev libxcb1-dev

Docker (recommended for local models)

Environment Setup

We use Pixi to manage Python environments and dependencies. If you are familiar with Conda, you can think of Pixi as a much faster alternative. The following commands set up a Pixi shell with all required packages. Make sure the environment is active before running any Python code.

git clone https://github.com/stanford-oval/churro.git
cd churro
curl -fsSL https://pixi.sh/install.sh | bash
pixi shell  # create and enter the managed environment

Sanity check the install with:

pixi run python -m churro.cli --help

Configure Providers

Copy the example environment file:

cp .example.env .env

Populate only the variables you need in .env. All environment variables live in .env and are autoloaded via python-dotenv. Use the table below as a quick reference to decide which credentials you must supply.

Workflow Required providers Key variables
Azure Document Intelligence OCR (--system azure) or if using docs-to-images command without --no-trim Azure Document Intelligence AZURE_DI_ENDPOINT, AZURE_DOC_KEY
LLM-based OCR against Vertex AI deployments Google Vertex AI VERTEX_AI_LOCATION
LLM-based OCR against Azure/OpenAI deployments Azure OpenAI or OpenAI AZURE_API_BASE, AZURE_OPENAI_API_KEY (or OPENAI_API_KEY), AZURE_API_VERSION
Mistral OCR (--system mistral_ocr) Mistral MISTRAL_API_KEY
Local vLLM models (--system finetuned or llm with engines backed by vllm/) Docker + Hugging Face LOCAL_VLLM_PORT, HF_TOKEN (only if using private models)

When a workflow does not need a provider, leave the corresponding variables blank. See .example.env for full documentation of each field. For Vertex AI usage, additionally ensure that the Google Cloud SDK is installed and authenticated: https://cloud.google.com/sdk/docs/install

Note that for all API LLM calls, the outputs are cached in .litellm_cache/, so subsequent runs with the same inputs will be much faster and free.

CLI Workflows

The unified Typer CLI lives under churro/cli. All examples below assume you are inside a pixi shell or prefix commands with pixi run.

Inference

Single image (local CHURRO model hosted via vLLM):

pixi run python -m churro.cli infer \
	--system finetuned \
	--engine churro \
	--image tests/churro_dataset_sample_1.jpeg

finetuned system returns HistoricalDocument XML by default; add --strip-xml to output plain text instead. See HistoricalDocument XML for schema details and parsing tips.

Optionally, add --binarize to pre-process each page with the bundled neural image binarizer before sending it to OCR. This can improve OCR accuracy on degraded documents. The first run downloads the stanford-oval/eynollah_binarizer_onnx model.

Batch directory with filtered suffixes and output files:

pixi run python -m churro.cli infer \
	--system finetuned \
	--engine churro \
	--image-dir path/to/images \
	--suffix png --suffix jpeg \
	--recursive \
	--output-dir workdir/texts/ \
	--skip-existing \
	--max-concurrency 8

Use pixi run python -m churro.cli infer --help to see every option, including how to use other LLMs via --system llm --engine <engine> arguments.

Preprocess PDFs and Images

If you have raw PDF scans or image directories, first use the docs-to-images command to convert them into page-aligned PNGs ready for OCR. docs-to-images normalizes PDF scans and image directories into page-aligned PNGs. The default engine gemini-2.5-pro-low calls a Vertex AI model to detect double-page spreads, then calls Azure Document Intelligence to detect page boundaries and trim margins.

Single PDF:

pixi run python -m churro.cli docs-to-images \
	--input-file path/to/file.pdf \
	--output-dir workdir/images/

Mixed directory with custom suffix filters and dry run:

pixi run python -m churro.cli docs-to-images \
	--input-dir path/to/scans \
	--suffix pdf --suffix tif --suffix png \
	--recursive \
	--output-dir workdir/images/ \
	--dry-run

Here is how this pipeline works:

  • An LLM estimates whether a rasterized page contains a two-page spread. Provide --engine <MODEL_MAP key> to swap to a different splitter if you do not have Vertex AI access.
  • Margin trimming is enabled by default via Azure Document Intelligence. Use --no-trim to disable this stage.
  • --batch-pages, --queue-maxsize, --raster-workers, --page-workers, and --llm-concurrency-limit balance CPU-bound rasterization and LLM throughput.
  • Pages are written as <source_base>_page_XXXX.png, even when spreads split into multiple images.

Benchmark on CHURRO-DS

Run end-to-end evaluation against the CHURRO dataset. The command automatically initializes any required local vLLM server before processing.

pixi run python -m churro.cli benchmark \
	--system finetuned \
	--engine churro \
	--dataset-split test \
	--input-size 0 \
	--max-concurrency 32

Important options:

  • --system {azure,mistral_ocr,llm,finetuned} determines which OCR backend to use.
  • --engine <key> is required for llm and finetuned systems; see churro/utils/llm/models.py for the full MODEL_MAP of logical keys (GPT-4/5, Claude, Gemini, Qwen 2.5, MiniCPM, CHURRO, and more).
  • --tensor-parallel-size / --data-parallel-size tune vLLM scaling for local engines.
  • --resize <pixels> optionally resizes large images before inference.

Optionally, add --binarize to pre-process each dataset page with a neural image binarizer before OCR.

Outputs land under workdir/results/<split>/<system>_<engine>/ (the engine suffix is omitted for azure and mistral_ocr).

LLM Improver

The Churro CLI supports optional post-processing with the LLMImprover, enabled via --use-improver. Pair it with --improver-engine. Improver can help fix OCR errors, and improve the formatting of complex documents' Markdown.

Backup Engines

You can supply a backup engine for LLM-based OCR systems using --backup-engine, and for LLM improvers using --improver-backup-engine. Both backup options allow the pipeline to retry with a secondary model if the first call fails. For example, when a provider's content filter incorrectly flags historical material or when a transient outage interrupts inference.

Local vLLM Container Notes

When you run infer or benchmark with an llm or finetuned system whose engine has an hf_repo entry, the CLI will:

  • Read LOCAL_VLLM_PORT and HF_TOKEN from your environment (churro/utils/docker/vllm.py).
  • Pull the corresponding Hugging Face repository on first launch. Expect multi-gigabyte downloads.
  • Start a Docker container exposing an OpenAI-compatible API at http://localhost:<LOCAL_VLLM_PORT>/v1.
  • Stop the container automatically when the command exits or crashes.

Make sure the chosen port is free and that Docker is running. GPU acceleration is optional but dramatically improves throughput.

Adding a New OCR System

Pull requests for new VLMs and OCR backends are welcome.

If adding a new LLM, simply add it to utils/llm/models.py (MODEL_MAP). Include an hf_repo for vLLM-served models.

For entirely new OCR systems, follow all steps:

  1. Register the system in churro/systems/ocr_factory.py so the CLI can instantiate it.
  2. Implement process_image and get_system_name in a subclass of BaseOCR.
  3. Use --system <system_name> with the CLI or import the factory in your own scripts.

HistoricalDocument XML

HistoricalDocument is the XML schema we use in the CHURRO dataset and model for rich transcriptions. It is specifically designed to capture complex layouts, scribal edits, and missing text, which are all common in historical documents, while preserving reading order.

Each response contains a root <HistoricalDocument> element with optional <Metadata> details (languages, scripts, writing direction, notes) followed by one or more <Page> blocks. A page combines optional <Header> and <Footer> regions with a required <Body> that nests structural tags such as <Paragraph>, <MarginalNote>, <Figure>, and <List>. Inline markup like <Addition>, <Deletion>, <Gap/>, and <InterlinearNote> captures scribal edits or missing text while preserving reading order.

<HistoricalDocument xmlns="http://example.com/historicaldocument">
	<Metadata>
		<Language>lat</Language>
		<Script>Latn</Script>
	</Metadata>
	<Page>
		<Header/>
		<Body>
			<Paragraph>
				<Line>In nomine domini amen.</Line>
				<Line><Gap reason="illegible"/> nos notarii subscripsimus.</Line>
			</Paragraph>
		</Body>
	</Page>
</HistoricalDocument>

The complete definition lives in churro/evaluation/historical_doc.xsd. The inference CLI's --strip-xml flag and the evaluation helpers call churro.evaluation.xml_utils.extract_actual_text_from_xml() to remove all XML tags and flatten the content into plain text when you do not need the markup.

Generate HistoricalDocument XML

If you are adding a new dataset to Churro, you may want to convert your transcriptions to the HistoricalDocument XML format. Convert a directory full of PNG/TXT pairs into HistoricalDocument XML using this CLI tool. This conversion uses an LLM prompt to structure the text according to the schema.

pixi run python -m churro.cli text-to-historical-doc-xml \
	path/to/pairs/dir \
	--corpus-description "Basque newspaper corpus" \
	--max-concurrency 8

Place your data in matched X.png and X.txt files within the input directory; each pair yields an X.xml. The tool:

  • Validates LLM output against historical_doc.xsd and prettifies XML prior to saving.
  • Skips files that already have XML unless you pass --overwrite.
  • Accepts any logical engine key from MODEL_MAP via --engine (defaults to gemini-2.5-pro-medium).
  • Adds optional corpus context to prompts through --corpus-description.

Citation

If you use CHURRO or CHURRO-DS, please cite:

@inproceedings{semnani2025churro,
	title        = {{CHURRO}: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition},
	author       = {Semnani, Sina J. and Zhang, Han and He, Xinyan and Tekg{"u}rler, Merve and Lam, Monica S.},
	booktitle    = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
	year         = {2025}
}

License

  • Model Weights: Qwen research license (see HF model card)
  • Dataset: Due to licensing restrictions on the original datasets used in Churro, use is permitted for research purposes only.
  • Code: Apache 2.0

About

CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Resources

License

Stars

Watchers

Forks

Contributors

Languages