Project introduction

Colin Greenstreet edited this page Feb 9, 2026 · 6 revisions
Metadata

Author: Claude Opus 4.6 in Claude Ottoman Turkish Project, prompted and edited by Colin Greenstreet | Wiki entry created: Sunday, February 8th 2026

Version: v1.2

Version history:

  • v1.0 (8 February 2026): Initial draft.
  • v1.1 (8 February 2026): Minor edits; added collapsing metadata feature.
  • v1.2 (9 February 2026): Rewrote first two paragraphs; emphasised the distinction between good progress on printed documents by conventional means and limited success to date with handwritten documents.

Validation: This wiki entry requires validation by Ottoman Turkish scholars



The Opening the Ottoman Archive initiative develops methods based on large language models for the machine transcription and analysis of Ottoman-language documents. Our focus is on handwritten manuscripts produced in the Ottoman Empire and its successor states from the sixteenth century to the early twentieth.

Our work seeks to make archival materials accessible to scholars who cannot read historical scripts and to broaden access for scholars who possess skills in some but not all major Ottoman languages — not by replacing specialist expertise, but by creating structured workflows that combine large language model capabilities with scholarly judgement.

This wiki documents the methodologies, skill files, and processing protocols developed by the initiative. It serves as both working documentation for the project's ongoing development and a reference for scholars interested in applying or adapting these methods.


The Problem

Ottoman Turkish — the administrative language of an empire spanning Southeast Europe, Western Asia, and North Africa — was written in Perso-Arabic script with vocabulary drawn heavily from Arabic and Persian. Estimates suggest up to 80% of high-register vocabulary was Arabic or Persian in origin, operating within Turkish grammatical structures. The script was abandoned in Turkey in 1928 as part of Atatürk's language reforms, creating a sharp barrier between modern Turkish readers and their archival heritage.

The result is that vast quantities of historical material remain effectively inaccessible. Millions of pages of administrative correspondence, court records, chronicles, and personal letters sit in archives across Turkey, the Balkans, and Europe, readable only by a small community of specialists (estimated at 2,000–5,000 individuals worldwide, concentrated overwhelmingly in Turkey).

Conventional OCR tools have made significant progress for printed documents. But existing initiatives — including Muteferriqa (7.5 million pages of printed material), the Digital Ottoman Corpora project (Transkribus-based, 7.20% CER on printed Nesih), and Osmanlica.com (a three-stage OCR-transliteration-translation pipeline) — focus primarily on printed Nesih type. Handwritten manuscripts, which constitute the majority of the Ottoman archival record, remain largely beyond the reach of automated tools.


The Two-Stage Pipeline

The initiative's core methodological innovation is a two-stage pipeline that separates visual capture from semantic processing. This separation addresses a fundamental problem: when a single AI model is asked both to see what is on the page and to understand what it means, the semantic processing contaminates the visual capture — the model "reads" what it expects rather than what is actually written.

The initiative uses frontier commercial large language models, specifically Gemini 3 Pro Preview (launched 18 November 2025), Claude Opus 4.5 (launched 24 November 2025), and now Claude Opus 4.6 (launched 6 February 2026).

Stage 1: Visual Capture

Google Gemini, configured at extreme visual-grounding settings (low temperature, low Top-P, minimal thinking), performs pure pattern-to-Unicode conversion. The guiding metaphor is that of a camera: the model reports what ink marks it sees without attempting to understand their meaning.
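A minimal sketch of what such a capture request might look like, written provider-agnostically in Python. The decoding values, prompt wording, and model identifier below are illustrative assumptions, not the project's actual V3-S-Minimal protocol:

```python
# Illustrative Stage 1 decoding settings: suppress creative interpretation
# so the model reports marks rather than guessing words. Values are assumed.
STAGE1_GENERATION_CONFIG = {
    "temperature": 0.0,       # deterministic decoding: report, don't interpret
    "top_p": 0.1,             # sample only from the most probable tokens
    "max_output_tokens": 8192,
}

# Deliberately minimal prompt: no linguistic guidance, per the camera metaphor.
STAGE1_PROMPT = (
    "Transcribe the ink marks on this page into Unicode Arabic-script "
    "characters. Do not normalise, translate, or correct anything."
)

def build_stage1_request(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Assemble a provider-agnostic request payload for visual capture."""
    return {
        "model": "gemini-3-pro-preview",  # placeholder model identifier
        "contents": [
            {"inline_data": {"mime_type": mime_type, "data": image_bytes}},
            {"text": STAGE1_PROMPT},
        ],
        "generation_config": STAGE1_GENERATION_CONFIG,
    }
```

The payload would be adapted to whichever SDK is in use; the point is that every knob is turned toward literal reproduction.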

The key protocol is V3-S-Minimal, which deliberately strips out all linguistic guidance to prevent the model from activating semantic processing pathways. A critical empirical discovery — the Skill File Paradox — demonstrated that complex instructions paradoxically degrade visual capture performance. When protocols included extensive linguistic guidance, processing times increased dramatically while accuracy decreased.

Stage 2: Semantic Processing

Claude (Anthropic) receives the Gemini output and performs philological processing: diplomatic transliteration (preserving exactly what was written), literal and modernised English translation, Named Entity Recognition (NER), and contextual summaries. Here, linguistic sophistication is an asset rather than a liability.
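The Stage 2 outputs listed above could be held in a structure like the following hypothetical Python dataclass; the field names are our own, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Stage2Output:
    """Hypothetical container for one document's philological processing."""
    diplomatic_transliteration: str  # preserves exactly what was written
    literal_translation: str         # close, word-by-word English rendering
    modern_translation: str          # readable modernised English
    named_entities: dict[str, list[str]] = field(default_factory=dict)
    contextual_summary: str = ""

# Example: NER results keyed by category, e.g.
# {"PERSON": ["Mahmud II"], "PLACE": ["Istanbul"]}
```

Keeping the diplomatic and modernised layers as separate fields preserves the audit trail from Gemini's raw capture through to the final translation.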

Document-type-specific protocols — the V3-T skill file family — extend a base transliteration system with genre-appropriate conventions, NER categories, and historical context. Each skill file inherits the core V3-T methodology, then adapts for its target genre and period.

The Handoff

The contract between stages cannot be enforced at Stage 1 — any instruction to Gemini about output validation risks triggering semantic processing. Quality assurance therefore happens through Stage 3 verification, where outputs are compared back against source images, and through variance analysis, which compares multiple AI outputs for the same source image to identify systematic errors.
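Variance analysis of this kind can be approximated with standard-library tooling. The following sketch scores character-level agreement between repeated captures of the same image; it is a simplification for illustration, not the project's actual method:

```python
import difflib
from itertools import combinations

def pairwise_agreement(a: str, b: str) -> float:
    """Character-level similarity ratio between two transcriptions (0.0-1.0)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def variance_report(outputs: list[str]) -> float:
    """Mean pairwise agreement across two or more captures of one source image.

    Low mean agreement flags passages where the models disagree
    systematically and human verification is needed.
    """
    pairs = list(combinations(outputs, 2))
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)
```

In practice one would report agreement per line or region rather than per document, so that disagreement localises the error.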


Stage 0: Document Reconnaissance

Before the two-stage pipeline begins, the S0-Triage protocol assesses a document through visual inspection alone, without reading its content. Triage classifies the document along three axes:

  • Document type: Physical form and production method (manuscript, printed, mixed)
  • Genre: Content type and rhetorical function (gazette, chronicle, correspondence, etc.)
  • Script type: Writing system and calligraphic style (Nesih, Divani, Rik'a, etc.)

These three classifications guide skill file selection: which V3-S protocol to use for visual capture, and which V3-T variant to use for semantic processing. The Typology section of this wiki documents all three classification systems.
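The three-axis classification and the skill-file selection it drives might be modelled as follows; the mapping shown is a toy illustration, not the initiative's actual selection logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TriageResult:
    """One S0-Triage assessment along the three axes described above."""
    document_type: str  # e.g. "manuscript", "printed", "mixed"
    genre: str          # e.g. "gazette", "chronicle", "correspondence"
    script_type: str    # e.g. "Nesih", "Divani", "Rik'a"

def select_skill_files(triage: TriageResult) -> tuple[str, str]:
    """Toy mapping from triage axes to (Stage 1, Stage 2) skill files.

    The real selection logic is richer; this only shows the shape of
    the decision the triage result is meant to support.
    """
    stage1 = "V3-S-Newspaper" if triage.genre == "gazette" else "V3-S-Minimal"
    stage2 = {
        "gazette": "V3-T",
        "chronicle": "V3-T-E-5035-Chronicle-Variant",
        "correspondence": "V3-T-C",
    }.get(triage.genre, "V3-T")
    return stage1, stage2
```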


Skill Files

Skill files are the initiative's core reusable artefacts — markdown documents that encode processing protocols for specific document types, genres, and periods. They function as structured instructions that can be loaded into a Claude project and applied to new documents of the same class.

The initiative maintains several skill file families:

| Family | Purpose | Examples |
| --- | --- | --- |
| S0-Triage | Document reconnaissance before processing | S0-Triage v1.1 |
| V3-S | Stage 1 visual capture protocols | V3-S-Minimal, V3-S-Newspaper |
| V3-T | Stage 2 semantic processing protocols | V3-T (gazette), V3-T-Newspaper, V3-T-C (correspondence), V3-T-C-Personal, V3-T-C17 (classical imperial), V3-T-E-5035-Chronicle-Variant, V3-T-Quranic |

Skill files are versioned, dated, and designed for iterative refinement. They are developed through empirical testing — systematic experiments with controlled parameters — rather than theoretical specification. The Demonstration skill files page provides published examples.
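As a sketch of how versioned, dated skill files might be tracked programmatically (the record fields and helper function are hypothetical, not part of the project):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SkillFileVersion:
    """Hypothetical version record for an iteratively refined skill file."""
    name: str      # e.g. "V3-T-C"
    family: str    # "S0-Triage", "V3-S", or "V3-T"
    version: str   # version string, e.g. "1.1"
    released: date
    notes: str     # what the empirical testing round changed

def latest(history: list[SkillFileVersion]) -> SkillFileVersion:
    """Pick the most recent release from a skill file's version history."""
    return max(history, key=lambda v: v.released)
```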


The Scope of this Wiki

This wiki documents the Ottoman Turkish component of the initiative, which is the most extensively developed of six language projects (see Ottoman language projects for the full portfolio). The wiki is organised along two axes:

  • Typology — how documents are classified
  • Pipeline — how documents are processed


Current Status

The initiative is in active development. The primary corpus is the Takvîm-i Vekâyi (1831–1839), with issues 1, 181, and 185 partially processed. Additional material processed includes Second Constitutional Period newspapers (Tanin, Peyâm), İSAM archive correspondence (administrative and personal letters), manuscript chronicles (Şanizade Tarihi, E.5035), a 17th-century imperial document (Leipzig B. or. 290.01), and cartographic materials (Ottoman Map Inventory, Balkans 1890).

The LLM-based approach — using large language models for both visual capture and semantic processing rather than conventional CNN/RNN/LSTM architectures — was pioneered by this initiative but is now being explored by other researchers, including work on multilingual TrOCR models and multi-modal LLM approaches to Ottoman manuscript recognition.

The initiative's distinctive contribution remains the separation of visual capture from semantic processing, the empirical skill file development methodology, and the focus on making Ottoman materials accessible to generalist researchers through structured, reusable workflows.


Last updated: 9 February 2026 · v1.2
