Divergent-Discourses/TranskribusAPI-XML-upload

Transkribus Parallel Upload Tool

A fast, parallel PAGE XML uploader for Transkribus with automatic token refresh and thread-safe operation. The scripts in this repository upload PAGE XML files in bulk and match them to the image files already on the Transkribus server. Optionally, a separate script can run Transkribus HTR processing. The pipeline uses the legacy Transkribus REST API, which offers limited options. Requires Python 3.8+.

Features

  • Fast Parallel Uploads: Upload up to 20-30 files per second using a thread pool
  • 🔄 Automatic Token Refresh: Handles long-running uploads with automatic OAuth token renewal
  • 🔒 Thread-Safe: Proper locking mechanisms for multi-threaded token refresh
  • 📊 Progress Tracking: Real-time progress updates with ETA calculations
  • 🎯 Exact Filename Matching: Matches XML files to server images by filename (without extension)
  • ⚙️ Configurable: YAML-based configuration for easy customization
  • 🛡️ Rate Limit Handling: Automatic retry with backoff on 429 errors
  • 📝 Comprehensive Logging: Detailed logging to file and console

Performance

Workers | Upload Speed     | Use Case
1       | ~3-5 files/sec   | Conservative, minimal API load
5       | ~15-20 files/sec | ✅ Recommended: balance of speed and stability
9       | ~25-30 files/sec | Maximum speed (higher rate limit risk)

Example: 6400 files @ 20 files/sec = ~5.3 minutes
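
The estimate above is simply the file count divided by the throughput. A minimal sketch for other batch sizes follows; the function name `estimate_minutes` is illustrative and not part of the tool.

```python
def estimate_minutes(num_files: int, files_per_sec: float) -> float:
    """Return the expected upload duration in minutes."""
    if files_per_sec <= 0:
        raise ValueError("files_per_sec must be positive")
    # seconds = files / rate; divide by 60 for minutes
    return num_files / files_per_sec / 60


# 6400 files at 20 files/sec take 320 seconds
print(f"{estimate_minutes(6400, 20):.1f} minutes")
```

Actual throughput depends on server load and rate limiting, so treat the result as a rough planning figure.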

Quick Start

# 1. Clone the repository
git clone https://github.com/Divergent-Discourses/TranskribusAPI-XML-upload.git
cd TranskribusAPI-XML-upload

# 2. Install dependencies
pip install -r requirements.txt

# 3. Configure your settings
cp config_upload.yaml.example config_upload.yaml
# Edit config_upload.yaml with your credentials and paths

# 4. Run the upload
python upload_pagexml_fast_threadsafe.py config_upload.yaml

Installation

See INSTALL.md for detailed installation instructions including virtual environment setup.

Configuration

Edit config_upload.yaml:

credentials:
  username: "your.email@example.com"
  password: "your-password"

collection:
  collection_id: 1234567

xml_upload:
  pagexml_directory: '/path/to/your/xml/files'
  file_pattern: "*.xml"
  dry_run: false
  max_workers: 5  # Adjust based on your needs

logging:
  level: "INFO"
  log_file: "transkribus_upload.log"

Important Settings

  • max_workers: Number of parallel upload threads (1-20)

    • Start with 5 for reliability
    • Increase to 7-9 for maximum speed
    • Reduce to 1-3 if you encounter rate limiting
  • dry_run: Set to true to test without uploading

    • Validates file matching
    • Shows what would be uploaded
    • Safe to test your configuration
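
The two settings above can be sanity-checked before a run. The following is a minimal sketch that assumes the configuration has already been parsed into a dictionary with the layout shown under Configuration. The helper name `validate_upload_settings` and the fallback defaults (5 workers, `dry_run` off) are assumptions for illustration, not the tool's actual behavior.

```python
def validate_upload_settings(config: dict) -> dict:
    """Check the xml_upload section and return it with assumed defaults applied."""
    settings = dict(config.get("xml_upload", {}))

    workers = settings.setdefault("max_workers", 5)  # assumed default
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(workers, bool) or not isinstance(workers, int) or not 1 <= workers <= 20:
        raise ValueError(f"max_workers must be an integer from 1 to 20, got {workers!r}")

    dry_run = settings.setdefault("dry_run", False)  # assumed default
    if not isinstance(dry_run, bool):
        raise ValueError(f"dry_run must be true or false, got {dry_run!r}")

    return settings


print(validate_upload_settings({"xml_upload": {"max_workers": 5, "dry_run": True}}))
```

Failing early on a bad value avoids discovering a typo only after the collection mapping has been built.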

How It Works

1. Collection Mapping

The tool builds a mapping of all images in your Transkribus collection:

Server image: page_001.jpg  →  Mapped to: page_001.xml
Server image: page_002.tif  →  Mapped to: page_002.xml
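
As a rough illustration of this step, the sketch below builds such a mapping from a list of page records. The record fields (`doc_id`, `page_nr`, `image_filename`) and the function name `build_page_mapping` are hypothetical; the real API response and the tool's internal structures may differ.

```python
from pathlib import PurePosixPath


def build_page_mapping(pages):
    """Map each image's base filename to its (doc_id, page_nr) location."""
    mapping = {}
    for page in pages:
        # .stem strips only the final extension: "page_001.jpg" -> "page_001"
        stem = PurePosixPath(page["image_filename"]).stem
        if stem in mapping:
            # Assumption: treat duplicate base names as an error, since the
            # XML file could not be assigned to a single page unambiguously.
            raise ValueError(f"duplicate base filename on server: {stem}")
        mapping[stem] = (page["doc_id"], page["page_nr"])
    return mapping


pages = [
    {"doc_id": 456, "page_nr": 1, "image_filename": "page_001.jpg"},
    {"doc_id": 456, "page_nr": 2, "image_filename": "page_002.tif"},
]
print(build_page_mapping(pages))
```

Keying the dictionary by base filename makes each later lookup a constant-time operation, which matters for collections with thousands of pages.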

2. Filename Matching

XML files are matched to server images by base filename (without extension):

Local:  TID_1965_05_22_001.xml
Server: TID_1965_05_22_001.jpg  ✓ Match!
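
A minimal sketch of the matching rule, assuming the server-side base names are already available as a set. The function name `match_xml_files` is illustrative, and the comparison is exact and case-sensitive.

```python
from pathlib import Path


def match_xml_files(xml_paths, server_stems):
    """Split local XML paths into (matched, unmatched) by exact base filename."""
    matched, unmatched = [], []
    for path in map(Path, xml_paths):
        # Compare the name without its extension against the server names
        target = matched if path.stem in server_stems else unmatched
        target.append(path.name)
    return matched, unmatched


server_stems = {"TID_1965_05_22_001", "TID_1965_05_22_002"}
matched, unmatched = match_xml_files(
    ["TID_1965_05_22_001.xml", "TID_1965_05_22_1.xml"], server_stems
)
print("matched:", matched)
print("unmatched:", unmatched)
```

The second file in the example is reported as unmatched because `TID_1965_05_22_1` differs from every server base name.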

3. Parallel Upload

Files are uploaded in parallel using ThreadPoolExecutor:

  • Multiple threads upload simultaneously
  • Progress tracked in real-time
  • Automatic token refresh when OAuth token expires (~30-60 min)
  • Thread-safe locking prevents race conditions
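
The general pattern can be sketched as follows. This is not the tool's actual code: `upload_all` and `fake_upload` are hypothetical names, and the upload itself is simulated so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(items, upload_one, max_workers=5, report_every=100):
    """Run upload_one(item) in a thread pool; return (successes, errors)."""
    successes, errors = 0, 0
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, item): item for item in items}
        # Results are collected in the main thread, so the counters need no lock
        for done, future in enumerate(as_completed(futures), start=1):
            try:
                future.result()
                successes += 1
            except Exception as exc:
                errors += 1
                print(f"Failed {futures[future]}: {exc}")
            if done % report_every == 0:
                rate = done / max(time.monotonic() - start, 1e-9)
                eta_min = (len(futures) - done) / rate / 60
                print(f"Progress: {done}/{len(futures)} | Rate: {rate:.1f} files/sec "
                      f"| ETA: {eta_min:.1f} min | Errors: {errors}")
    return successes, errors


def fake_upload(name):
    """Stand-in for a real upload call; fails for names ending in _bad.xml."""
    if name.endswith("_bad.xml"):
        raise RuntimeError("simulated failure")
    time.sleep(0.001)


print(upload_all(["a.xml", "b.xml", "c_bad.xml"], fake_upload, max_workers=2))
```

Because `as_completed` yields futures as they finish, one slow upload does not hold up progress reporting for the others.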

Usage Examples

Basic Upload

python upload_pagexml_fast_threadsafe.py config_upload.yaml

Dry Run (Test Mode)

# Set dry_run: true in config_upload.yaml, then:
python upload_pagexml_fast_threadsafe.py config_upload.yaml

This shows what would be uploaded without actually uploading.

Custom Configuration File

python upload_pagexml_fast_threadsafe.py /path/to/custom_config.yaml

Output Example

======================================================================
PageXML Parallel Upload with Thread-Safe Token Refresh
======================================================================
Logging in to Transkribus...
✓ Login successful
✓ Access token obtained (expires in ~30-60 min, will auto-refresh)
Building collection mapping...
Found 6453 pages across 152 documents
✓ Mapping complete: 6453 pages available
Found 6400 XML files matching pattern '*.xml'
Matched 6400 files to pages
Unmatched: 0 files
Starting parallel upload with 5 workers...
Note: Token will auto-refresh if it expires during upload
======================================================================
Progress: 100/6400 (1%) | Rate: 22.3 files/sec | ETA: 4.7 min | Errors: 0
Progress: 200/6400 (3%) | Rate: 21.8 files/sec | ETA: 4.7 min | Errors: 0
...
⚠ Token expired for TID_1965_05_22_530.xml, refreshing (thread: ThreadPoolExecutor-0_1)
✓ Token refreshed successfully by thread ThreadPoolExecutor-0_1
...
======================================================================
UPLOAD SUMMARY
======================================================================
Total XML files found: 6400
Successfully uploaded: 6400
Not found in collection: 0
Errors: 0
Total time: 5.23 minutes
Average rate: 20.4 files/second
======================================================================

Troubleshooting

"No match found" Errors

Problem: XML files don't match server images.

Solution: Ensure filenames match exactly (excluding extension):

✗ Wrong: page001.xml → page_001.jpg  (different names)
✓ Right: page001.xml → page001.jpg   (same base name)
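
When many files are unmatched, listing near misses can speed up renaming. The sketch below uses the standard library's `difflib` for this; the helper name `suggest_server_names` and the similarity cutoff are assumptions, and the tool itself may not offer such a feature.

```python
import difflib


def suggest_server_names(xml_stem, server_stems, cutoff=0.8):
    """Return up to three server base names that closely resemble xml_stem."""
    return difflib.get_close_matches(xml_stem, server_stems, n=3, cutoff=cutoff)


# "page001" has no exact match, but "page_001" is the closest candidate
print(suggest_server_names("page001", ["page_001", "page_002", "cover"]))
```

Suggestions are ordered from most to least similar, so the first entry is usually the intended target.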

401 Unauthorized Errors

Problem: Token expired during upload.

Solution: This should be handled automatically. If it persists:

  1. Check credentials in config
  2. Verify internet connection
  3. Check Transkribus service status

Rate Limiting (429 Errors)

Problem: Too many requests to API.

Solution: Reduce max_workers in config:

max_workers: 3  # Reduce from 9 to 3
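
For reference, retrying on 429 responses typically looks like the sketch below. It assumes exponential backoff with a small random jitter; the tool's actual delay schedule is not documented here. `send` is a hypothetical callable that performs one request and returns the HTTP status code.

```python
import random
import time


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or retries run out."""
    status = None
    for attempt in range(max_retries + 1):
        status = send()
        if status != 429:
            return status
        if attempt < max_retries:
            # Assumed schedule: roughly 1 s, 2 s, 4 s, ... plus jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    return status  # still 429 after all retries


# Simulated server: rate-limited twice, then successful
responses = iter([429, 429, 200])
delays = []
print(call_with_backoff(lambda: next(responses), sleep=delays.append))
print(f"waited {len(delays)} times")
```

Backoff only spaces out retries. If 429 responses are frequent, lowering `max_workers` addresses the cause rather than the symptom.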

Token Refresh Conflicts

Problem: Multiple threads refreshing token simultaneously.

Solution: Use upload_pagexml_fast_threadsafe.py (not older versions). This version includes proper thread-safe locking.
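
One common way to avoid duplicate refreshes is to guard the token with a lock and refresh only if the token that failed is still the current one. The sketch below illustrates that idea; `TokenManager` and `fetch_token` are hypothetical names, and the script's real locking may be implemented differently.

```python
import itertools
import threading


class TokenManager:
    """Hold an access token and refresh it at most once per expiry."""

    def __init__(self, fetch_token):
        self._fetch_token = fetch_token  # callable returning a new token string
        self._lock = threading.Lock()
        self._token = fetch_token()

    @property
    def token(self):
        with self._lock:
            return self._token

    def refresh_if_stale(self, failed_token):
        """Refresh only if no other thread has replaced failed_token yet."""
        with self._lock:
            if self._token == failed_token:
                self._token = self._fetch_token()
            return self._token


# Simulate eight workers that all saw a 401 with the same expired token
counter = itertools.count(1)
manager = TokenManager(lambda: f"token-{next(counter)}")
stale = manager.token
threads = [
    threading.Thread(target=manager.refresh_if_stale, args=(stale,))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(manager.token)  # token-2: only one refresh happened
```

The first thread to acquire the lock performs the refresh; the others see that the token has already changed and reuse the new one. This relies on the assumption that a refreshed token always differs from the expired one.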

Architecture

┌─────────────────────────────────────────────────────────────┐
│              upload_pagexml_fast_threadsafe.py              │
│                                                             │
│  ┌──────────────┐        ┌────────────────┐                 │
│  │ Main Thread  │        │  Thread Pool   │                 │
│  │              │────────│   (Workers)    │                 │
│  │ - Config     │        │                │                 │
│  │ - Mapping    │        │ Upload threads │                 │
│  │ - Progress   │        │ (5-9 parallel) │                 │
│  └──────────────┘        └────────────────┘                 │
│         │                        │                          │
│         │                        │                          │
│         ▼                        ▼                          │
│  ┌────────────────────────────────────────┐                 │
│  │      transkribus_wrapper.py            │                 │
│  │  (TranskribusClient + Token Refresh)   │                 │
│  └────────────────────────────────────────┘                 │
│                     │                                       │
└─────────────────────┼───────────────────────────────────────┘
                      │
                      ▼
              ┌──────────────────┐
              │ Transkribus API  │
              │  (OAuth + REST)  │
              └──────────────────┘

API Reference

TranskribusClient

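# TranskribusClient is defined in transkribus_wrapper.py (see Architecture)
from transkribus_wrapper import TranskribusClient

client = TranskribusClient()

# Authenticate with the credentials from config_upload.yaml
client.login(username, password)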

# List documents in collection
docs = client.list_documents(collection_id)

# Get document details
doc = client.get_document(collection_id, doc_id)

# Upload PageXML
client.update_page_xml(
    collection_id=123,
    doc_id=456,
    page_nr=1,
    xml_content="<xml>..."
)

# Refresh OAuth token
client.refresh_access_token()

Security Considerations

⚠️ Never commit config_upload.yaml to version control!

The repository includes a .gitignore that excludes:

  • config_upload.yaml (contains credentials)
  • *.log (may contain sensitive data)
  • .env files
  • Python cache files

Instead, use config_upload.yaml.example as a template.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments

  • Built for the Transkribus platform
  • Uses the Transkribus REST API and OAuth2 authentication
  • Developed for Tibetan Newspaper digitization workflows of the Divergent Discourses project (DFG/AHRC)

Changelog

See CHANGELOG.md for version history and updates.


Made with ❤️ for digital humanities and document digitization projects
