
Transkribus XML to CSV Workflow


A unified pipeline for processing Transkribus PageXML outputs with integrated text normalization for Tibetan text. This tool combines paragraph extraction, text normalization, and CSV merging into a single configurable workflow. The workflow integrates transkribus_XML2CSV, developed by James Engels and Christina Sabbagh, with TibNormCSV, developed by Yuki Kyogoku and Franz Xaver Erhard, and adapted by Christina Sabbagh.

Developed for the Divergent Discourses Project
SOAS University of London & Leipzig University | Funded by AHRC (UK) & DFG (Germany)


🚀 Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Place PageXML files in data/to_process_xml/

# 3. Run the workflow
python workflow.py

# 4. Get results from data/step3_merged_csv/merged_pages.csv

📖 For detailed instructions, see QUICKSTART.md

Features

  • Multi-step Pipeline: Extract → Normalize → Merge
  • Fully Configurable: All parameters controlled via workflow_config.ini
  • Preserves Intermediate Results: Each step saves outputs to dedicated folders
  • Tibetan Text Normalization: Integrated normalization using custom replacement tables
  • Progress Logging: Detailed console output and optional log file
  • Modular Design: Each step can be enabled/disabled independently

Workflow Steps

Step 1: Paragraph Extraction

Extracts text regions (paragraphs), together with page-level metadata, from Transkribus PageXML files.

Input: PageXML files (.xml)
Output: Individual CSV files with extracted paragraphs
Output Location: ./data/step1_extracted_csv/
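For orientation, the sketch below shows roughly what this step involves: reading TextRegion elements and their region-level text from a single PageXML file with the standard library. It is not the project's extractor.py, which also derives reading order, region type, and filename metadata.

# Rough sketch of PageXML region extraction (illustrative, not the project's extractor.py)
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_paragraphs(xml_path):
    """Yield (region_id, custom_attr, text) for each TextRegion in one PageXML file."""
    root = ET.parse(xml_path).getroot()
    for region in root.iterfind(".//pc:TextRegion", NS):
        region_id = region.get("id", "")
        # The custom attribute carries reading order and region type,
        # e.g. 'readingOrder {index:0;} structure {type:paragraph;}'.
        custom = region.get("custom", "")
        unicode_el = region.find("./pc:TextEquiv/pc:Unicode", NS)
        text = "" if unicode_el is None or unicode_el.text is None else unicode_el.text
        yield region_id, custom, text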

Step 2: Text Normalization

Applies Tibetan text normalization rules to each CSV file.

Input: Extracted CSV files from Step 1
Output: Normalized CSV files with additional normalised_paragraph column
Output Location: ./data/step2_normalized_csv/
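As a simplified illustration of what this step produces (using the documented column names), the sketch below copies each CSV and appends a normalised_paragraph column. The real normalizer applies the table-driven regex and context rules rather than plain string replacements.

# Simplified sketch only; the actual normalizer.py uses the TSV rule tables
import csv

def normalize_text(text, replacements):
    # Plain string replacements here; the real rules include regex and context-aware exceptions.
    for old, new in replacements:
        text = text.replace(old, new)
    return text

def normalize_csv(in_path, out_path, replacements):
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8-sig") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["normalised_paragraph"])
        writer.writeheader()
        for row in reader:
            row["normalised_paragraph"] = normalize_text(row["paragraph"], replacements)
            writer.writerow(row)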

Step 3: CSV Merging

Merges all normalized CSV files into a single master CSV.

Input: Normalized CSV files from Step 2
Output: Single merged CSV file
Output Location: ./data/step3_merged_csv/merged_pages.csv
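A minimal sketch of what the merge amounts to, using pandas (the actual merger.py may differ in detail):

# Concatenate every per-page CSV into one frame and write it out once
from pathlib import Path
import pandas as pd

def merge_csvs(input_dir, output_file):
    frames = [pd.read_csv(p) for p in sorted(Path(input_dir).glob("*.csv"))]
    pd.concat(frames, ignore_index=True).to_csv(output_file, index=False, encoding="utf-8-sig")

# merge_csvs("data/step2_normalized_csv", "data/step3_merged_csv/merged_pages.csv")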

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

  1. Clone or download this repository
git clone <repository-url>
cd transkribus-workflow
  2. Create a virtual environment (recommended)
# Using conda
conda create -n transkribus-workflow python=3.12
conda activate transkribus-workflow

# Or using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt

Directory Structure

transkribus-workflow/
├── workflow.py              # Main workflow orchestrator
├── extractor.py            # Paragraph extraction module
├── normalizer.py           # Text normalization module
├── merger.py               # CSV merge module
├── workflow_config.ini     # Configuration file
├── requirements.txt        # Python dependencies
├── README.md              # This file
│
├── data/
│   ├── to_process_xml/           # Place your PageXML files here
│   ├── step1_extracted_csv/      # Extracted paragraphs (auto-created)
│   ├── step2_normalized_csv/     # Normalized text (auto-created)
│   └── step3_merged_csv/         # Final merged CSV (auto-created)
│
├── tables/                        # Normalization tables
│   ├── abbreviations.tsv
│   ├── table1.tsv
│   ├── table2.tsv
│   └── table3.tsv
│
└── logs/                          # Log files (auto-created)
    └── workflow.log

Usage

Quick Start

  1. Place your PageXML files in ./data/to_process_xml/

  2. Ensure normalization tables are in ./tables/

  3. Run the workflow:

python workflow.py

That's it! The workflow will:

  • Extract paragraphs from all XML files
  • Normalize the Tibetan text
  • Merge everything into a single CSV

Custom Configuration

Run with a custom config file:

python workflow.py --config my_config.ini

Running Specific Steps

Edit workflow_config.ini to enable/disable steps:

[workflow]
enable_extraction = True      # Set to False to skip
enable_normalization = True   # Set to False to skip
enable_merge = True          # Set to False to skip
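Conceptually, the orchestrator runs each step only when its switch is enabled. A simplified, illustrative skeleton (not the actual workflow.py code):

# Illustrative skeleton only: each step runs only if its switch is on
import configparser

cfg = configparser.ConfigParser()
cfg.read("workflow_config.ini", encoding="utf-8")

if cfg.getboolean("workflow", "enable_extraction"):
    print("Step 1: extracting paragraphs ...")
if cfg.getboolean("workflow", "enable_normalization"):
    print("Step 2: normalizing text ...")
if cfg.getboolean("workflow", "enable_merge"):
    print("Step 3: merging CSVs ...")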

Configuration File

The workflow_config.ini file controls all aspects of the workflow:

[workflow]
# Enable/disable workflow steps (executed in this order)
enable_extraction = True
enable_normalization = True
enable_merge = True

[paths]
# Input directories
xml_input_dir = ./data/to_process_xml

# Output directories for each step
extracted_csv_dir = ./data/step1_extracted_csv
normalized_csv_dir = ./data/step2_normalized_csv
merged_csv_dir = ./data/step3_merged_csv

# Normalization tables
table_path = ./tables

[extraction]
# XML namespace for PageXML parsing
namespace_uri = http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15

# Files to exclude from processing
excluded_files = mets.xml,metadata.xml

[normalization]
# Flag parameter for conditional normalization rules
flag = 1

[merge]
# Name of the merged output file
merged_filename = merged_pages.csv

[logging]
# Enable verbose logging
verbose = True
log_file = ./logs/workflow.log
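All of these options can be read with Python's standard configparser; a small illustrative snippet (the fallback values mirror the listing above):

import configparser

cfg = configparser.ConfigParser()
cfg.read("workflow_config.ini", encoding="utf-8")

xml_input_dir = cfg.get("paths", "xml_input_dir", fallback="./data/to_process_xml")
excluded_files = [f.strip() for f in cfg.get("extraction", "excluded_files", fallback="").split(",") if f.strip()]
flag = cfg.getint("normalization", "flag", fallback=1)
merged_name = cfg.get("merge", "merged_filename", fallback="merged_pages.csv")
print(xml_input_dir, excluded_files, flag, merged_name)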

Input File Requirements

PageXML Files

Files should follow the naming convention:

XXXX_NEWSPAPER_YEAR_MONTH_DAY_PAGE_[optional].xml

or (see Bug fixes applied)

NEWSPAPER_YEAR_MONTH_DAY_PAGE_[optional].xml

Example: 0001_QTN_1952_07_05_001_SB_Zsn128163MR.xml or QTN_1952_07_05_001_SB_Zsn128163MR.xml

Where:

  • XXXX: Transkribus-assigned ID (4 digits)
  • NEWSPAPER: Newspaper code (e.g., QTN)
  • YEAR: Publication year (e.g., 1952)
  • MONTH: Publication month (e.g., 07)
  • DAY: Publication day (e.g., 05)
  • PAGE: Page number (e.g., 001)
  • [optional]: Additional metadata (ignored)

For detailed information on the naming convention, see Erhard, Franz Xaver. 2025. "The Divergent Discourses Corpus. A Digital Collection of Early Tibetan Newspapers from the 1950s and 1960s." In F.X. Erhard, R. Barnett, and N.W. Hill (eds.), "From Print to Pixels: Building Digital Tools for Modern Tibetan Textual Analysis," special issue, Revue d'études tibétaines (RET) 74: 77–80.
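For illustration only (this pattern is not taken from the project code), the naming convention above can be captured with a regex along these lines:

import re

FILENAME_RE = re.compile(
    r"^(?:(?P<tkb_id>\d{4})_)?"        # optional Transkribus-assigned ID
    r"(?P<newspaper>[A-Z]+)_"
    r"(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})_"
    r"(?P<page>\d{3})"
    r"(?:_.*)?\.xml$"                  # optional trailing metadata, ignored
)

m = FILENAME_RE.match("0001_QTN_1952_07_05_001_SB_Zsn128163MR.xml")
print(m.groupdict())
# {'tkb_id': '0001', 'newspaper': 'QTN', 'year': '1952', 'month': '07', 'day': '05', 'page': '001'}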

Normalization Tables

Four TSV files are required in the ./tables/ directory:

  1. abbreviations.tsv: Abbreviation expansions (with flag column)
  2. table1.tsv: Character replacements
  3. table2.tsv: Regex-based replacements
  4. table3.tsv: Context-aware replacements with exceptions (with flag column)
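As a rough sketch of how these tables might be loaded and filtered, assuming tab-separated files with a flag column where noted (the remaining column layout is an assumption here, not taken from the project's normalizer.py):

from pathlib import Path
import pandas as pd

def load_tables(table_path, flag=1):
    tables = {}
    for name in ("abbreviations", "table1", "table2", "table3"):
        df = pd.read_csv(Path(table_path) / f"{name}.tsv", sep="\t", dtype=str)
        # abbreviations.tsv and table3.tsv carry a flag column for conditional rules
        if "flag" in df.columns:
            df = df[df["flag"] == str(flag)]
        tables[name] = df
    return tables

# tables = load_tables("./tables", flag=1)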

Output Format

CSV Columns

The final merged CSV contains:

  • paragraph: Original extracted text
  • normalised_paragraph: Normalized text (added in Step 2)
  • paragraph_idx: Paragraph ID (e.g., tr_1718110017)
  • readingorder_idx: Reading order index
  • region_type: Type of text region (e.g., paragraph, heading, caption)
  • filename: Original image filename
  • newspaper: Newspaper code
  • year: Publication year
  • month: Publication month
  • date: Publication day
  • page_num: Page number
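A quick way to sanity-check the merged output with pandas, using the column names listed above:

import pandas as pd

df = pd.read_csv("data/step3_merged_csv/merged_pages.csv")
print(df[["paragraph", "normalised_paragraph"]].head())          # spot-check the normalization
print(df.groupby(["newspaper", "year", "month", "date"]).size())  # paragraphs per issue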

Advanced Usage

Modifying the Workflow

Each module (extractor.py, normalizer.py, merger.py) can be imported and used independently:

from extractor import XMLParagraphExtractor
from normalizer import load_tables, normalize_csv_files
from merger import merge_csv_files

# Use individual components
extractor = XMLParagraphExtractor(namespace_uri, excluded_files)
csv_files = extractor.extract_all(xml_dir, output_dir)

# Load and apply normalization
tables = load_tables(table_path, flag=1)
normalized = normalize_csv_files(input_dir, output_dir, tables)

# Merge
merge_csv_files(input_dir, output_file)

Customizing Normalization

Modify the flag parameter in workflow_config.ini to use different normalization rule sets:

[normalization]
flag = 1  # Use rules where flag=1

Batch Processing

The workflow recursively searches for XML files, so you can organize your files in subdirectories:

data/to_process_xml/
├── batch1/
│   ├── file1.xml
│   └── file2.xml
└── batch2/
    ├── file3.xml
    └── file4.xml
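Recursive discovery of a layout like this can be reproduced with pathlib, e.g. (illustrative only; the workflow's own search also honours the excluded_files setting):

from pathlib import Path

excluded = {"mets.xml", "metadata.xml"}
xml_files = [p for p in Path("data/to_process_xml").rglob("*.xml") if p.name not in excluded]
print(f"Found {len(xml_files)} XML files")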

Troubleshooting

No XML files found

  • Check that XML files are in ./data/to_process_xml/
  • Ensure files have .xml extension
  • Verify they're not in the excluded list

Normalization errors

  • Verify all required TSV files exist in ./tables/
  • Check TSV files are tab-separated and properly formatted
  • Ensure flag column exists in abbreviations.tsv and table3.tsv

Memory issues with large files

  • Process in smaller batches
  • Disable merge step and merge manually using external tools
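If the merge step runs out of memory, a streaming merge with the csv module keeps only one row in memory at a time. A sketch, assuming all per-page CSVs share the same columns:

import csv
from pathlib import Path

def stream_merge(input_dir, output_file):
    # Stream rows file by file instead of loading every CSV into memory at once
    writer = None
    with open(output_file, "w", newline="", encoding="utf-8-sig") as out:
        for path in sorted(Path(input_dir).glob("*.csv")):
            with open(path, newline="", encoding="utf-8-sig") as src:
                reader = csv.DictReader(src)
                if writer is None:
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                writer.writerows(reader)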

Encoding issues

  • All files use UTF-8 encoding
  • Output CSVs use UTF-8-sig for Excel compatibility

📚 Documentation

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines on:

  • Reporting issues
  • Suggesting enhancements
  • Contributing code
  • Development setup

📄 License

Copyright © 2025 SOAS University of London & Leipzig University

This project is licensed under the MIT License - see the LICENSE file for details.

Please acknowledge the Divergent Discourses project in any use of these materials.

👥 Credits

Development:

  • Original Paragraph Extractor: James Engels, modified by Christina Sabbagh (SOAS University of London)
  • Text Normalizer: Developed for Tibetan text processing by Yuki Kyogoku and Franz Xaver Erhard
  • Workflow Integration: Unified processing pipeline by Franz Xaver Erhard

Project: Divergent Discourses
Institutions: SOAS University of London & Leipzig University
Funding: AHRC (UK) & DFG (Germany)

📧 Support

For issues or questions:

  1. Check the documentation
  2. Review existing issues
  3. Create a new issue if needed
  4. Check logs/workflow.log for detailed error messages

Star ⭐ this repository if you find it helpful!
