Transkribus Parallel Upload Tool

Fast, parallel PageXML uploader for Transkribus with automatic token refresh and thread-safe operation. The scripts in this repo handle bulk PAGE XML uploads and match each XML file to its image file on the Transkribus server. Optionally, a separate script can run Transkribus HTR processing. The pipeline uses the legacy Transkribus REST API, which offers only a limited set of options.

Features

⚡ Fast Parallel Uploads: Upload 20-30 files per second using thread pools
🔄 Automatic Token Refresh: Handles long-running uploads with automatic OAuth token renewal
🔒 Thread-Safe: Proper locking mechanisms for multi-threaded token refresh
📊 Progress Tracking: Real-time progress updates with ETA calculations
🎯 Exact Filename Matching: Matches XML files to server images by filename (without extension)
⚙️ Configurable: YAML-based configuration for easy customization
🛡️ Rate Limit Handling: Automatic retry with backoff on 429 errors
📝 Comprehensive Logging: Detailed logging to file and console

Performance

Workers	Upload Speed	Use Case
1	~3-5 files/sec	Conservative, minimal API load
5	~15-20 files/sec	✅ Recommended: Balance of speed and stability
9	~25-30 files/sec	Maximum speed (higher rate limit risk)

Example: 6400 files @ 20 files/sec = ~5.3 minutes
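The estimate above is simply file count divided by throughput. A minimal sketch of that arithmetic follows; the helper name `estimate_minutes` is hypothetical and not part of the tool.

```python
def estimate_minutes(file_count: int, files_per_second: float) -> float:
    """Return the expected upload duration in minutes."""
    if files_per_second <= 0:
        raise ValueError("files_per_second must be positive")
    # seconds = files / (files per second); minutes = seconds / 60
    return file_count / files_per_second / 60


# 6400 files at 20 files/sec take 320 seconds.
print(f"{estimate_minutes(6400, 20):.1f} min")  # prints "5.3 min"
```

Plugging in the rates from the table gives a rough planning range for other worker counts.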

Quick Start

# 1. Clone the repository
git clone https://github.com/Divergent-Discourses/TranskribusAPI-XML-upload.git
cd TranskribusAPI-XML-upload

# 2. Install dependencies
pip install -r requirements.txt

# 3. Configure your settings
cp config_upload.yaml.example config_upload.yaml
# Edit config_upload.yaml with your credentials and paths

# 4. Run the upload
python upload_pagexml_fast_threadsafe.py config_upload.yaml

Installation

See INSTALL.md for detailed installation instructions including virtual environment setup.

Configuration

Edit config_upload.yaml:

credentials:
  username: "your.email@example.com"
  password: "your-password"

collection:
  collection_id: 1234567

xml_upload:
  pagexml_directory: '/path/to/your/xml/files'
  file_pattern: "*.xml"
  dry_run: false
  max_workers: 5  # Adjust based on your needs

logging:
  level: "INFO"
  log_file: "transkribus_upload.log"

Important Settings

max_workers: Number of parallel upload threads (1-20)
- Start with 5 for reliability
- Increase to 7-9 for maximum speed
- Reduce to 1-3 if you encounter rate limiting
dry_run: Set to true to test without uploading
- Validates file matching
- Shows what would be uploaded
- Safe to test your configuration
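As an illustration of the constraints above, the following sketch checks the two settings after the YAML file has been parsed into a dictionary. The function name `validate_upload_settings` and the fallback defaults (5 workers, no dry run) are assumptions made for this example; the actual script may validate differently or not at all.

```python
def validate_upload_settings(settings: dict) -> dict:
    """Check max_workers and dry_run from the xml_upload section."""
    workers = settings.get("max_workers", 5)  # assumed default
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(workers, bool) or not isinstance(workers, int):
        raise ValueError("max_workers must be an integer")
    if not 1 <= workers <= 20:
        raise ValueError("max_workers must be between 1 and 20")

    dry_run = settings.get("dry_run", False)  # assumed default
    if not isinstance(dry_run, bool):
        raise ValueError("dry_run must be true or false")

    return {"max_workers": workers, "dry_run": dry_run}


checked = validate_upload_settings({"max_workers": 5, "dry_run": True})
```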

How It Works

1. Collection Mapping

The tool builds a mapping of all images in your Transkribus collection:

Server image: page_001.jpg  →  Mapped to: page_001.xml
Server image: page_002.tif  →  Mapped to: page_002.xml
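A minimal sketch of such a mapping, assuming the server's page listing has already been fetched into a list of dictionaries. The keys `doc_id`, `page_nr` and `img_file_name` are hypothetical names chosen for this example and may not match the fields returned by the Transkribus API.

```python
from pathlib import PurePosixPath


def build_mapping(pages):
    """Map each image's base name (no extension) to (doc_id, page_nr)."""
    mapping = {}
    for page in pages:
        stem = PurePosixPath(page["img_file_name"]).stem
        mapping[stem] = (page["doc_id"], page["page_nr"])
    return mapping


# Hypothetical page records for one document.
pages = [
    {"doc_id": 456, "page_nr": 1, "img_file_name": "page_001.jpg"},
    {"doc_id": 456, "page_nr": 2, "img_file_name": "page_002.tif"},
]
mapping = build_mapping(pages)
```

Because the key is the base name only, two server images that differ only in extension would collide in this sketch; the later one would win.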

2. Filename Matching

XML files are matched to server images by base filename (without extension):

Local:  TID_1965_05_22_001.xml
Server: TID_1965_05_22_001.jpg  ✓ Match!
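The comparison can be sketched with `pathlib`, which strips the final extension via `.stem`. The function `match_files` is a hypothetical helper for illustration, not the tool's actual code.

```python
from pathlib import Path


def match_files(xml_paths, server_stems):
    """Split local XML paths into those with and without a server image."""
    matched, unmatched = [], []
    for path in xml_paths:
        if Path(path).stem in server_stems:
            matched.append(path)
        else:
            unmatched.append(path)
    return matched, unmatched


matched, unmatched = match_files(
    ["TID_1965_05_22_001.xml", "page001.xml"],
    {"TID_1965_05_22_001", "page_001"},  # base names of server images
)
```

Here `page001.xml` stays unmatched because `page001` and `page_001` differ; the comparison is exact and case-sensitive.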

3. Parallel Upload

Files are uploaded in parallel using ThreadPoolExecutor:

- Multiple threads upload simultaneously
- Progress tracked in real-time
- Automatic token refresh when OAuth token expires (~30-60 min)
- Thread-safe locking prevents race conditions
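The general pattern can be sketched with the standard library's `concurrent.futures`. This is an illustration under assumptions, not the script's actual code: `upload_all`, `fake_upload` and the `progress_every` parameter are hypothetical, and the real upload call is replaced by a stub.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(items, upload_one, max_workers=5, progress_every=100):
    """Run upload_one(item) for every item on a thread pool.

    Returns (succeeded, errors); errors is a list of (item, exception).
    """
    succeeded = 0
    errors = []
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, item): item for item in items}
        total = len(futures)
        # Results are collected in the main thread only, so the counters
        # below need no lock.
        for done, future in enumerate(as_completed(futures), start=1):
            try:
                future.result()
                succeeded += 1
            except Exception as exc:
                errors.append((futures[future], exc))
            if done % progress_every == 0:
                rate = done / max(time.monotonic() - start, 1e-9)
                eta_min = (total - done) / rate / 60
                print(f"Progress: {done}/{total} | Rate: {rate:.1f} files/sec "
                      f"| ETA: {eta_min:.1f} min | Errors: {len(errors)}")
    return succeeded, errors


def fake_upload(name):
    """Stand-in for a real upload; fails for one file to show error capture."""
    if name.endswith("_bad.xml"):
        raise RuntimeError("simulated failure")


files = [f"page_{i:03d}.xml" for i in range(1, 10)] + ["page_bad.xml"]
succeeded, errors = upload_all(files, fake_upload, max_workers=5)
```

Collecting results with `as_completed` lets one failed file be recorded without stopping the remaining uploads.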

Usage Examples

Basic Upload

python upload_pagexml_fast_threadsafe.py config_upload.yaml

Dry Run (Test Mode)

# Set dry_run: true in config_upload.yaml, then:
python upload_pagexml_fast_threadsafe.py config_upload.yaml

This shows what would be uploaded without actually uploading.

Custom Configuration File

python upload_pagexml_fast_threadsafe.py /path/to/custom_config.yaml

Output Example

======================================================================
PageXML Parallel Upload with Thread-Safe Token Refresh
======================================================================
Logging in to Transkribus...
✓ Login successful
✓ Access token obtained (expires in ~30-60 min, will auto-refresh)
Building collection mapping...
Found 6453 pages across 152 documents
✓ Mapping complete: 6453 pages available
Found 6400 XML files matching pattern '*.xml'
Matched 6400 files to pages
Unmatched: 0 files
Starting parallel upload with 5 workers...
Note: Token will auto-refresh if it expires during upload
======================================================================
Progress: 100/6400 (1%) | Rate: 22.3 files/sec | ETA: 4.7 min | Errors: 0
Progress: 200/6400 (3%) | Rate: 21.8 files/sec | ETA: 4.7 min | Errors: 0
...
⚠ Token expired for TID_1965_05_22_530.xml, refreshing (thread: ThreadPoolExecutor-0_1)
✓ Token refreshed successfully by thread ThreadPoolExecutor-0_1
...
======================================================================
UPLOAD SUMMARY
======================================================================
Total XML files found: 6400
Successfully uploaded: 6400
Not found in collection: 0
Errors: 0
Total time: 5.23 minutes
Average rate: 20.4 files/second
======================================================================

Troubleshooting

"No match found" Errors

Problem: XML files don't match server images.

Solution: Ensure filenames match exactly (excluding extension):

✗ Wrong: page001.xml → page_001.jpg  (different names)
✓ Right: page001.xml → page001.jpg   (same base name)

401 Unauthorized Errors

Problem: Token expired during upload.

Solution: This should be handled automatically. If it persists:

- Check credentials in config
- Verify internet connection
- Check Transkribus service status

Rate Limiting (429 Errors)

Problem: Too many requests to API.

Solution: Reduce max_workers in config:

xml_upload:
  max_workers: 3  # For example, reduce from 9 to 3
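Retrying with exponential backoff is a common complement to lowering the worker count. The sketch below shows the idea only; `RateLimited` and `with_backoff` are hypothetical names, and the delays the tool actually uses are not documented here.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the server answers with HTTP 429."""


def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call call(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stub that is rate limited twice, then succeeds.
attempts = {"n": 0}
delays = []


def flaky_upload():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"


# Passing delays.append as the sleep function records the waits
# instead of actually sleeping.
result = with_backoff(flaky_upload, sleep=delays.append)
```

In this run the stub is called three times and the recorded waits are 1.0 and 2.0 seconds.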

Token Refresh Conflicts

Problem: Multiple threads refreshing token simultaneously.

Solution: Use upload_pagexml_fast_threadsafe.py (not older versions). This version includes proper thread-safe locking.
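One common way to avoid duplicate refreshes is to guard the refresh with a lock and have each thread report the token it saw fail. The sketch below illustrates that pattern; `TokenHolder` and `refresh_if_stale` are hypothetical and do not describe the internals of transkribus_wrapper.py.

```python
import itertools
import threading


class TokenHolder:
    """Share one access token between threads and refresh it at most once
    per expiry."""

    def __init__(self, fetch_token):
        self._fetch_token = fetch_token  # callable returning a fresh token
        self._lock = threading.Lock()
        self._token = fetch_token()
        self.refresh_count = 0

    @property
    def token(self):
        return self._token

    def refresh_if_stale(self, stale_token):
        """Refresh only if no other thread has replaced stale_token yet."""
        with self._lock:
            if self._token == stale_token:
                self._token = self._fetch_token()
                self.refresh_count += 1
            return self._token


# Demonstration: eight threads all notice the same expired token.
counter = itertools.count(1)
holder = TokenHolder(lambda: f"token-{next(counter)}")
stale = holder.token  # "token-1"

threads = [
    threading.Thread(target=holder.refresh_if_stale, args=(stale,))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The first thread to take the lock performs the refresh; the others find that the token has already changed and reuse it, so `holder.refresh_count` ends at 1.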

Architecture

┌─────────────────────────────────────────────────────────────┐
│                  upload_pagexml_fast_threadsafe.py          │
│                                                             │
│  ┌──────────────┐        ┌──────────────┐                 │
│  │  Main Thread │        │ Thread Pool  │                 │
│  │              │────────│  (Workers)   │                 │
│  │ - Config     │        │              │                 │
│  │ - Mapping    │        │ Upload threads│                 │
│  │ - Progress   │        │  (5-9 parallel)                 │
│  └──────────────┘        └──────────────┘                 │
│         │                        │                         │
│         │                        │                         │
│         ▼                        ▼                         │
│  ┌────────────────────────────────────────┐               │
│  │      transkribus_wrapper.py            │               │
│  │  (TranskribusClient + Token Refresh)   │               │
│  └────────────────────────────────────────┘               │
│                     │                                      │
└─────────────────────┼──────────────────────────────────────┘
                      │
                      ▼
              ┌──────────────────┐
              │ Transkribus API  │
              │  (OAuth + REST)  │
              └──────────────────┘

API Reference

TranskribusClient

client = TranskribusClient()

# Authenticate
client.login(username, password)

# List documents in collection
docs = client.list_documents(collection_id)

# Get document details
doc = client.get_document(collection_id, doc_id)

# Upload PageXML
client.update_page_xml(
    collection_id=123,
    doc_id=456,
    page_nr=1,
    xml_content="<xml>..."
)

# Refresh OAuth token
client.refresh_access_token()

Security Considerations

⚠️ Never commit config_upload.yaml to version control!

The repository includes a .gitignore that excludes:

- config_upload.yaml (contains credentials)
- *.log (may contain sensitive data)
- .env files
- Python cache files

Instead, use config_upload.yaml.example as a template.

Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

Acknowledgments

Built for the Transkribus platform
Uses the Transkribus REST API and OAuth2 authentication
Developed for Tibetan Newspaper digitization workflows of the Divergent Discourses project (DFG/AHRC)

Support

Issues: GitHub Issues
Transkribus Help: https://help.transkribus.org

Changelog

See CHANGELOG.md for version history and updates.

Made with ❤️ for digital humanities and document digitization projects
