By script type

Colin Greenstreet edited this page Feb 10, 2026 · 17 revisions
Metadata

Author: Claude Opus 4.6 in Claude Ottoman Turkish Project, prompted and edited by Colin Greenstreet | Wiki entry created: Sunday, February 8th 2026

Version: v2.2

Version history:

  • v1.0 (8 February 2026): Initial draft prepared in Claude Tech Stuff project. Literature-review structure; led with external initiatives.
  • v2.0 (8 February 2026): Major restructure: repositioned to lead with initiative's own pipeline and skill files; external work cited as context; added Sülüs as display script; distinguished between existing/draft and planned skill files; updated external initiative coverage with 2025 publications.
  • v2.1 (8 February 2026): Minor additions and corrections; added collapsing metadata feature.
  • v2.2 (10 February 2026): Added comment on planned V3-S-Siyakat skill file.

Validation: This wiki entry requires validation by Ottoman Turkish scholars


By Script Type

This page catalogues the script types addressed by the Ottoman Turkish component of the Opening the Ottoman Archive initiative. Ottoman Turkish was written in multiple scripts across different administrative, literary, and confessional contexts. Each script presents distinct challenges for handwritten text recognition (HTR) and optical character recognition (OCR), and each requires different handling within the initiative's two-stage pipeline.

Script type is distinct from document type (which addresses physical form) and genre (which addresses content). A single script may appear across many document types and genres: Nesih serves for chronicles, correspondence, and official gazettes alike. Conversely, a single document may contain multiple scripts — a manuscript with Nesih body text, Sülüs headings, and Ta'lîk colophon.

For the HTR pipeline, script type primarily affects Stage 1 (Visual Capture), where letterform recognition, baseline detection, and word segmentation all depend on the script being processed. The V3-Triage protocol identifies script type as part of its preliminary assessment, guiding the selection of appropriate processing protocols.


Overview Table

| Script | Direction | Medium | Primary Period | Typical Genres | Initiative Status |
| --- | --- | --- | --- | --- | --- |
| Nesih (Naskh), printed | RTL | Letterpress / lithograph | c. 1729–1928 | Newspapers, books, official gazettes | Active processing |
| Nesih (Naskh), handwritten | RTL | Manuscript | 14th–20th c. | Correspondence, chronicles, court records | Active processing |
| Rik'a (Rık'a) | RTL | Manuscript | 19th–20th c. | Correspondence, bureaucratic notes | Active processing |
| Nastaliq (Ta'lîk) | RTL | Manuscript | 15th–19th c. | Persian-influenced literary and diplomatic texts | Not yet tested |
| Divani (Dîvânî) | RTL | Manuscript | 15th–19th c. | Imperial decrees, berats | Early testing (V3-T-C17) |
| Sülüs (Thuluth) | RTL | Manuscript / architectural | 11th–20th c. | Headers, inscriptions, Quranic chapter titles | Encountered but not primary target |
| Siyakat | RTL | Manuscript | 15th–19th c. | Financial registers, treasury records | Not yet tested |
| Karamanlidika | LTR | Print / manuscript | 18th–early 20th c. | Religious texts, community literature | Planned |
| Armeno-Turkish | LTR | Print / manuscript | 18th–early 20th c. | Newspapers, community documents | Planned |

The Initiative's Approach to Script

The Opening the Ottoman Archive initiative uses a two-stage pipeline in which visual capture (Stage 1, using Google Gemini) and semantic processing (Stage 2, using Claude) are separated to prevent interpretive contamination. This architecture has important implications for script handling:

Stage 1 (Visual Capture) must handle the specific visual characteristics of each script — letterform identification, baseline detection, word boundary recognition, and diacritical mark capture — without engaging in semantic interpretation. The V3-S-Minimal protocol provides the core visual capture framework, with script-specific challenges documented for each type below.

Stage 2 (Semantic Processing) is largely script-agnostic once the Perso-Arabic text has been captured. The V3-T skill file family handles transliteration and translation regardless of the original script, though script-aware skill file variants (e.g., V3-T-C17 for classical documents that may include Divani) provide period- and register-specific guidance.

This contrasts with conventional HTR approaches (Transkribus, eScriptorium, custom CNN/RNN models), which require script-specific trained models with ground-truth data for each script type. The LLM-based approach has the potential to handle multiple scripts through a single pipeline, though empirical testing across all script types remains ongoing.
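
The separation between the two stages can be sketched as follows. This is a minimal illustration, not the initiative's actual code: the `Capture` dataclass, the function names, and the stand-in callables are all hypothetical, and in practice the Stage 1 and Stage 2 callables would wrap calls to Gemini and Claude respectively. The only point being shown is that Stage 2 receives captured text and never the image.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Capture:
    """Output of Stage 1: captured text plus the script label from triage."""
    script: str  # label chosen for this sketch, e.g. "Nesih (printed)"
    text: str    # characters as seen on the page, with no interpretation


def run_pipeline(
    image: bytes,
    script: str,
    visual_capture: Callable[[bytes, str], str],
    semantic_processing: Callable[[str], str],
) -> tuple[Capture, str]:
    """Run Stage 1 then Stage 2, passing only text across the boundary."""
    # Stage 1 sees the image and the script label, nothing else.
    capture = Capture(script=script, text=visual_capture(image, script))
    # Stage 2 sees only the captured text, never the image.
    return capture, semantic_processing(capture.text)


# Stand-in callables so the sketch runs without any model access.
def fake_capture(image: bytes, script: str) -> str:
    return "تقويم وقايع"  # placeholder captured text


def fake_semantics(text: str) -> str:
    return f"transliteration of: {text}"


capture, result = run_pipeline(
    b"page-image-bytes", "Nesih (printed)", fake_capture, fake_semantics
)
```

Keeping the boundary this narrow is one way to make the "no interpretive contamination" rule checkable: anything Stage 2 knows about the page must have passed through the captured text.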

How External Initiatives Handle Script

The initiative operates alongside approximately ten other Ottoman HTR/OCR projects globally. Their approaches to script provide useful context:

Printed Nesih is the only script for which multiple working systems exist. Muteferriqa (7.5 million pages), Wikilala (109,000+ documents), Osmanlica.com, and the Digital Ottoman Corpora project all focus primarily or exclusively on printed Nesih. The Digital Ottoman Corpora's OttomanTurkish_Print_1 Transkribus model achieved 7.20% CER (June 2023). Osmanlica.com (Dölek & Kurt, Istanbul University-Cerrahpaşa) reports approximately 96% character accuracy on printed Nesih using a CRNN model and has published peer-reviewed results in IEEE and Wiley venues, with further work published in 2025 (Uzan & Dölek, Fırat University journal).

Handwritten scripts remain far more challenging. The QhoD project (Austrian Academy of Sciences) has tested Transkribus on 18th-century Nesih manuscript materials from the Passarowitz embassy, reporting approximately 25% CER. Their work has also attempted Nastaliq and Divani with notably poorer results — layout analysis fails entirely on Nastaliq, and Divani requires manual baseline annotation.
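
The CER (character error rate) figures quoted here are conventionally computed as the character-level edit (Levenshtein) distance between a model's output and a ground-truth transcription, divided by the length of the ground truth. The sketch below is a plain-Python illustration of that definition. The projects cited may normalise text differently (for example in how they treat whitespace or diacritics), so it should not be read as their exact evaluation procedure.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `cer("كتاب", "كناب")` evaluates to 0.25: one misread dotted letter in a four-letter word. A 25% CER on running text therefore means roughly one character in four needs correction.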

Multilingual and LLM approaches are emerging. Broadwell, Patel, and Tekgürler published work in March 2025 on multilingual TrOCR-based HTR models for Arabic-script languages including Ottoman Turkish. A Stanford research project has explored multi-modal large language models for Ottoman manuscript recognition. These developments suggest growing interest in the approaches this initiative has been pioneering.

No existing initiative has published working systems for Siyakat, and no initiative has reported success with Nastaliq or Divani using automated methods.


Perso-Arabic Script Family

The first seven script types below all use the Perso-Arabic writing system (right-to-left, with connected letterforms). They share a common alphabet but differ dramatically in visual execution, degree of abbreviation, and administrative or literary context. A critical point for HTR work — and for the scholarly community more broadly — is that Ottoman specialists typically develop reading fluency in only one or two of these scripts, making cross-script expertise rare.

1. Nesih (Naskh) — Printed

Physical characteristics: Regular, upright letterforms with clear word separation and consistent baseline spacing. Printing in Ottoman Turkish began with İbrahim Müteferrika's press (first book printed 1729) and expanded dramatically from the mid-19th century. Both letterpress and lithographic technologies produced relatively uniform characters, though ligature conventions and typographic style varied across printing houses and periods.

Typical genres: Official gazettes (Takvîm-i Vekâyi, from 1831), newspapers and periodicals, books, dictionaries.

Period: c. 1729–1928 (end of Ottoman Arabic-script usage in Turkey).

Initiative status: Active processing (primary corpus). Printed Nesih is the most extensively developed area of the initiative's work. The base V3-T skill file was developed specifically for the Takvîm-i Vekâyi (1831–1839), with issues 1, 181 and 185 partially processed. The V3-T-Newspaper skill file (v1.1) extends this for constitutional-era journalism (Tanin, Peyâm). Visual capture uses the V3-S-Minimal and V3-S-Newspaper protocols. The regularity of typeset characters makes printed Nesih the natural starting point for pipeline development.

Skill files:

  • V3-S-Minimal — visual capture (draft, in use)
  • V3-S-Newspaper — newspaper-specific visual capture (draft, in use)
  • V3-T — core transliteration/translation for gazette material (draft, in use)
  • V3-T-Newspaper v1.1 — constitutional-era journalism (draft, in use)

2. Nesih (Naskh) — Handwritten

Physical characteristics: Retains the upright, rounded letterforms of printed Nesih but with significant variability through individual scribal hands. Letters are fully connected within words. Diacritical dots (distinguishing letters like ب/ت/ث) may be displaced, merged, or omitted. Short vowels are typically unwritten. Line spacing, word spacing, and baseline regularity vary with document formality and scribal skill.

Typical genres: Chronicles, religious texts, manuscript copies of literary works, some correspondence.

Period: 14th–20th century (the standard book and correspondence hand throughout Ottoman history).

Initiative status: Active processing. The Şanizade Tarihi (Page 9) and the E.5035 manuscript have both been processed through the pipeline, with handwritten Nesih as the primary script. The E.5035 analysis led to the development of V3-T-E-5035-Chronicle-Variant (v1.1), which handles chronicle and interrogation-record genres in manuscript Nesih. The V3-T-Quranic Semantic Processing protocol also addresses handwritten Nesih in its most formal calligraphic form.

Skill files:

  • V3-T-E-5035-Chronicle-Variant v1.1 — chronicle/interrogation manuscripts (draft, in use)
  • V3-T-Quranic Semantic Processing — Quranic manuscript text (draft)

3. Rik'a (Rık'a)

Physical characteristics: A simplified cursive hand developed for speed and everyday use. Letterforms are compact and rounded with minimal ornamentation. Letters are reduced to quick strokes, and connections may be simplified or abbreviated. Rik'a was the standard hand for everyday writing in the late Ottoman period, much as cursive handwriting served in the Latin-script world.

Typical genres: Personal and bureaucratic correspondence, administrative notes and memoranda, personal notebooks.

Period: 19th–early 20th century (dominant everyday hand during the reform period).

Initiative status: Active processing. The İSAM Hüseyin Hilmi Paşa correspondence — both administrative (Letter 1904, HHR 164a-b) and personal (family letter, HHP 1716.1 Page 9) — is written in Rik'a. The administrative letter was processed through V3-T-C with Stage 3 visual verification; the personal letter through V3-T-C-Personal. These analyses demonstrate the pipeline's ability to handle Rik'a across different registers (formal administrative vs. intimate personal), though the variability of personal hands presents ongoing challenges.

Skill files:

  • V3-T-C — administrative correspondence (draft, in use)
  • V3-T-C-Personal — personal/family correspondence (draft, in use)

4. Nastaliq (Ta'lîk)

Physical characteristics: Visually distinguished by its sweeping, diagonal flow. Letters slope downward from right to left within each word, with dramatic vertical extensions. The script has a strong calligraphic tradition as the prestige hand for Persian literary culture. Letter groups cascade across the line, and word spacing is often indistinct.

Typical genres: Persian-language literary texts, bilingual Ottoman-Persian documents, poetry manuscripts, some diplomatic correspondence.

Period: 15th–19th century (declining in administrative use but persistent in literary contexts).

Initiative status: Not yet tested. Nastaliq represents a significant challenge for the pipeline. The diagonal flow and cascading letter groups defeat baseline-detection algorithms in conventional HTR tools — the QhoD project reports that Transkribus layout analysis cannot segment Nastaliq documents at all. The LLM-based visual capture approach (Gemini) may offer advantages here, since it does not rely on explicit baseline detection, but this remains untested. No skill files exist for Nastaliq-specific processing.

Skill files: None. Future development.


5. Divani (Dîvânî)

Physical characteristics: The script of imperial authority. Letterforms are rounded and compressed, with exaggerated horizontal extensions and ornamental density. The most elaborate variant, Dîvânî-i Hümâyûn (Celi Divani), fills interlinear spaces with decorative flourishes. Baselines are not horizontal — they curve and undulate following the calligrapher's compositional design. The script was deliberately difficult to read as a security measure against forgery.

Typical genres: Imperial decrees (fermâns), letters of appointment (berâts), sultanic correspondence, treaties.

Period: 15th–19th century (formalised under the Ottoman dîvân chancellery system).

Initiative status: Early testing. The V3-T-C17 skill file (Classical Ottoman Imperial Documents, c. 1500–1700) was designed for documents that may include Divani script, and the Leipzig B. or. 290.01 analysis tested 17th-century Dîvân-ı Hümâyûn material. However, the documents processed so far appear to have been in Nesih or hybrid hands rather than full Divani. The decorative density and non-linear layout of formal Divani remain untested in the pipeline and represent a major challenge.

Conventional HTR tools fail entirely on Divani. The QhoD project describes manual baseline annotation as the only viable approach — a labour-intensive process that undermines the purpose of automated recognition. Whether LLM-based visual capture can handle Divani's non-linear layout is an open question and a high-priority testing target.

Skill files:

  • V3-T-C17 — classical imperial documents, which may include Divani (draft, in use)

6. Sülüs (Thuluth)

Physical characteristics: A large, formal display script used for headings, inscriptions, and decorative panels. Sülüs is one of the six canonical proportioned scripts (aklâm-ı sitte) of Islamic calligraphy, and the Ottoman tradition produced three major "calligraphic revolutions" in this script (Şeyh Hamdullah in the 15th century, Hâfız Osman in the 17th century, Mehmed Şevkî Efendi in the 19th century). Letterforms are large and elegant, with characteristic barbed heads and flowing curves. Diacritical marks (harakât) are typically present and contribute to the script's visual impact.

Typical uses: Surah headings in Quranic manuscripts, architectural inscriptions on mosques and endowment buildings, calligraphic panels (levha), manuscript title pages and section headings. Sülüs is not a body-text script — it appears on or within documents rather than constituting the main text.

Period: Used throughout Ottoman history for display and decorative purposes.

Initiative status: Encountered but not a primary target. Sülüs appears in document headers (including some manuscript title pages that might pass through the pipeline) and in the Quranic text that the V3-T-Quranic Semantic Processing protocol addresses. However, since Sülüs does not typically constitute the body text of documents being processed, it does not require a dedicated skill file. The V3-Triage protocol identifies Sülüs in its script classification table.

Skill files: None dedicated. Handled incidentally within V3-T-Quranic Semantic Processing where relevant.


7. Siyakat

Physical characteristics: A radically abbreviated administrative script in which standard Arabic letterforms are reduced to minimal strokes. Many visually distinct letters become identical; meaning is recovered through context, position, and knowledge of bureaucratic formulae. Diacritical dots are almost entirely absent. The script's opacity was functional — it prevented tampering with fiscal records by anyone outside the trained bureaucratic cadre.

Typical genres: Treasury registers (defter), tax records, accounting ledgers, budget documents.

Period: 15th–19th century (gradually displaced by Rik'a and eventually Latin script).

Initiative status: Not yet tested. Siyakat is the most challenging Ottoman script for any form of recognition — automated or human. Even specialist readers require dedicated training in Ottoman fiscal palaeography. No existing HTR initiative has published results on Siyakat. Whether LLM-based visual capture can make any progress on Siyakat is unknown but would represent a genuinely novel contribution if achieved. This script represents the most distant frontier for HTR development.

Skill files: V3-S-Siyakat_Draft_v0_1_03022026.md (draft, untested). Testing is contingent on partnering with an Ottoman historian interested in Siyakat material.


Cross-Script Types

The following two script types are linguistically Ottoman Turkish but use non-Arabic writing systems. They arose from the confessional and communal diversity of the Ottoman Empire, where Greek Orthodox and Armenian communities sometimes wrote Turkish in their own scripts. These cross-script types require specialised processing because the language is Turkish while the script is not.

8. Karamanlidika (Turkish in Greek Script)

Physical characteristics: Uses the Greek alphabet — including polytonic diacritical marks in earlier materials — to represent Turkish-language content. Letterforms follow Greek typographic and scribal conventions (left-to-right). Printed Karamanlidika uses standard Greek typography; manuscripts follow Greek scribal hands. The principal challenge is that Greek orthographic conventions must be "decoded" to recover the underlying Turkish phonology — a mapping that was not fully standardised across communities.

Typical genres: Religious texts for Greek Orthodox Turkish-speaking communities, newspapers, community literature, translations of Ottoman official documents.

Period: 18th–early 20th century (principally associated with Karamanlı communities of central Anatolia, also produced in Constantinople).

Initiative status: Planned. The initiative's repository architecture includes a planned Greek-Karamanlı semantic processing protocol. Karamanlidika texts have been identified as a potential component of the testing corpus. The script's regularity (especially in print) is comparable to standard Greek, but the semantic processing layer must handle cross-linguistic phonological mapping — decoding Greek orthography to recover Turkish words. Printed Karamanlidika may prove relatively tractable for visual capture; the challenge lies in the transliteration stage.
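
The decoding step described here can be pictured as a longest-match substitution from Greek graphemes to Turkish letters. The sketch below is illustrative only. The mapping table is a tiny hypothetical sample (including the assumption that the digraphs μπ and ντ stand for b and d), not a scholarly transliteration scheme and not the planned V3-T-Greek-Karamanlı protocol. Because conventions were not standardised across communities, real tables would need to be validated by specialists for each period and printer.

```python
# Hypothetical sample table for illustration; not a validated scheme.
GREEK_TO_TURKISH = {
    "μπ": "b",
    "ντ": "d",
    "α": "a", "ε": "e", "ι": "i", "ο": "o",
    "κ": "k", "λ": "l", "μ": "m", "ν": "n",
    "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t",
}


def decode(text: str, table: dict[str, str] = GREEK_TO_TURKISH) -> str:
    """Greedy longest-match substitution; unmapped characters pass through."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest grapheme first so digraphs win over single letters.
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # keep anything the table does not cover
            i += 1
    return "".join(out)
```

Under this sample table, `decode("μπαμπα")` returns `"baba"`. Trying digraphs before single letters matters: without it, μπ would be read as "mp" rather than "b".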

Skill files: Planned V3-T-Greek-Karamanlı protocol. Not yet drafted.


9. Armeno-Turkish (Turkish in Armenian Script)

Physical characteristics: Uses the Armenian alphabet (left-to-right, with distinctive letterforms unrelated to either Latin or Arabic scripts) to write Turkish-language content. The Armenian alphabet has 38 letters (the 36 of the original fifth-century alphabet plus two added in the medieval period), providing a phonetically richer character set than Greek for representing Turkish sounds. Both printed and manuscript forms exist. As with Karamanlidika, the script-to-language mapping was not fully standardised.

Typical genres: Newspapers and periodicals for Armenian Turkish-speaking communities, literary works, community and ecclesiastical documents, commercial correspondence.

Period: 18th–early 20th century (associated with Armenian communities across the Ottoman Empire, especially Constantinople).

Initiative status: Planned. The initiative's repository architecture includes a planned Armenian–Armeno-Turkish protocol, but no dedicated protocol has been drafted yet. Early observations suggest that Claude has difficulties with Armenian Unicode rendering, which reinforces the need for the two-stage pipeline (Gemini for visual capture, Claude for semantic processing) rather than single-stage processing. Preliminary experimentation in Gemini has produced successful visual capture and semantic processing.

Skill files: Planned V3-T-Armenian-Armeno-Turkish protocol. Not yet drafted.


Handwritten vs. Printed: HTR Implications

| Characteristic | Printed Documents | Handwritten Manuscripts |
| --- | --- | --- |
| Letterform consistency | High (typeset or lithographed) | Variable (individual scribal hands) |
| Baseline regularity | Horizontal, uniform | Ranges from regular (Nesih) to curved (Divani) to diagonal (Nastaliq) |
| Layout predictability | Columns, headers, mastheads | Variable; marginal annotations common |
| Diacritical marks | Generally present and positioned | Often displaced, merged, or omitted |
| Period | c. 1729 onward | Entire Ottoman period (14th–20th c.) |
| Scripts encountered | Nesih only (with occasional Ta'lîk headers) | Nesih, Nastaliq, Divani, Siyakat, Rik'a, Sülüs (headers) |
| Initiative coverage | Extensive (V3-T, V3-T-Newspaper) | Partial (V3-T-C, V3-T-C-Personal, V3-T-Chronicle, V3-T-C17) |
| External tool performance | 4–7% CER achievable | 25%+ CER on best-tested scripts |

The printed/handwritten divide is the most consequential distinction for HTR performance. All existing Ottoman HTR systems that report usable accuracy levels (below 10% CER) operate on printed Nesih only. Handwritten scripts remain a frontier where the initiative's LLM-based approach may offer advantages over conventional trained-model methods — particularly for scripts like Nastaliq and Divani where baseline detection fails entirely.


Script Identification in the Pipeline

Script identification occurs at two points in the initiative's workflow:

S0-Triage (Document Reconnaissance): The triage protocol identifies script type through visual inspection alone, without reading content. This assessment guides skill file selection. The S0-Triage script classification table includes all seven Perso-Arabic scripts listed above plus printed typeface variants.

Stage 2 (Semantic Processing): Script type informs but does not rigidly determine which V3-T variant to apply. The critical factor is period and genre rather than script alone: a 17th-century Nesih manuscript requires V3-T-C17, not the V3-T variant used for 19th-century printed Nesih. The "By genre" page documents how genre maps to skill files.
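
The selection logic described here, in which period and genre take precedence over script, could be sketched as a small lookup. The skill file names are those listed on this page. The function name, the genre labels, the year thresholds, and the rule ordering are assumptions made for illustration and do not reproduce the actual triage protocol.

```python
from typing import Optional

# Genre labels are hypothetical; skill file names are those listed on this page.
GENRE_TO_SKILL_FILE = {
    "gazette": "V3-T",
    "newspaper": "V3-T-Newspaper",
    "administrative correspondence": "V3-T-C",
    "personal correspondence": "V3-T-C-Personal",
    "chronicle": "V3-T-E-5035-Chronicle-Variant",
}


def select_skill_file(script: str, year: int, genre: str) -> Optional[str]:
    """Choose a V3-T variant from period and genre; script is advisory only."""
    # Assumed rule: the c. 1500-1700 classical period takes precedence.
    if 1500 <= year <= 1700:
        return "V3-T-C17"
    # Otherwise fall back on genre; None means no variant is known here.
    return GENRE_TO_SKILL_FILE.get(genre)
```

With these assumed rules, `select_skill_file("Nesih", 1650, "imperial decree")` returns `"V3-T-C17"` while `select_skill_file("Nesih", 1835, "gazette")` returns `"V3-T"`: the same script maps to different skill files once period and genre are taken into account.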


Further Reading


Last updated: 10 February 2026 · v2.2
