Fast, parallel PageXML uploader for Transkribus with automatic token refresh and thread-safe operation. The scripts in this repo upload PAGE XML files in bulk and match them to the image files on the Transkribus server. Optionally, a separate script can run Transkribus HTR processing. The pipeline uses the legacy Transkribus REST API, which offers only a limited set of options.
- ⚡ Fast Parallel Uploads: Upload 20-30 files per second using thread pools
- 🔄 Automatic Token Refresh: Handles long-running uploads with automatic OAuth token renewal
- 🔒 Thread-Safe: Proper locking mechanisms for multi-threaded token refresh
- 📊 Progress Tracking: Real-time progress updates with ETA calculations
- 🎯 Exact Filename Matching: Matches XML files to server images by filename (without extension)
- ⚙️ Configurable: YAML-based configuration for easy customization
- 🛡️ Rate Limit Handling: Automatic retry with backoff on 429 errors
- 📝 Comprehensive Logging: Detailed logging to file and console
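The rate-limit handling listed above can be pictured as a retry loop with exponential backoff. The following is a minimal sketch, not the repository's actual code: `send_with_backoff`, its parameters, and the `send` callable that returns an HTTP status code are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429 or retries run out.

    send is a hypothetical callable that performs one request and returns
    the HTTP status code. The delay doubles after every 429 response.
    """
    status = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting. The actual retry count and delays used by the tool may differ.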
| Workers | Upload Speed | Use Case |
|---|---|---|
| 1 | ~3-5 files/sec | Conservative, minimal API load |
| 5 | ~15-20 files/sec | ✅ Recommended: Balance of speed and stability |
| 9 | ~25-30 files/sec | Maximum speed (higher rate limit risk) |
Example: 6400 files @ 20 files/sec = ~5.3 minutes
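The estimate above is plain division of the file count by the upload rate. A small helper makes it easy to try other collection sizes and rates; `estimate_minutes` is a hypothetical name, not part of the tool.

```python
def estimate_minutes(total_files: int, files_per_second: float) -> float:
    """Return the estimated upload time in minutes at a steady rate."""
    if files_per_second <= 0:
        raise ValueError("files_per_second must be positive")
    return total_files / files_per_second / 60


print(f"{estimate_minutes(6400, 20):.1f} minutes")  # prints "5.3 minutes"
```

Real runs will vary with network conditions and rate limiting, so treat the result as a rough guide.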
# 1. Clone the repository
git clone https://github.com/Divergent-Discourses/TranskribusAPI-XML-upload.git
cd TranskribusAPI-XML-upload
# 2. Install dependencies
pip install -r requirements.txt
# 3. Configure your settings
cp config_upload.yaml.example config_upload.yaml
# Edit config_upload.yaml with your credentials and paths
# 4. Run the upload
python upload_pagexml_fast_threadsafe.py config_upload.yaml
See INSTALL.md for detailed installation instructions including virtual environment setup.
Edit config_upload.yaml:
credentials:
  username: "your.email@example.com"
  password: "your-password"
collection:
  collection_id: 1234567
xml_upload:
  pagexml_directory: '/path/to/your/xml/files'
  file_pattern: "*.xml"
  dry_run: false
  max_workers: 5 # Adjust based on your needs
logging:
  level: "INFO"
  log_file: "transkribus_upload.log"
- max_workers: Number of parallel upload threads (1-20)
  - Start with 5 for reliability
  - Increase to 7-9 for maximum speed
  - Reduce to 1-3 if you encounter rate limiting
- dry_run: Set to true to test without uploading
  - Validates file matching
  - Shows what would be uploaded
  - Safe to test your configuration
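The two options described above lend themselves to a quick sanity check before a run. The sketch below is an illustration only: `check_upload_options` is a hypothetical helper, it takes the already-parsed options as a plain dictionary, and the defaults (5 workers, no dry run) are assumptions based on the example configuration rather than documented behavior.

```python
from typing import Any, Dict


def check_upload_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Validate max_workers and dry_run, using assumed defaults for absent keys."""
    max_workers = options.get("max_workers", 5)  # assumed default
    dry_run = options.get("dry_run", False)      # assumed default
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_workers, bool) or not isinstance(max_workers, int):
        raise ValueError("max_workers must be an integer")
    if not 1 <= max_workers <= 20:
        raise ValueError("max_workers must be between 1 and 20")
    if not isinstance(dry_run, bool):
        raise ValueError("dry_run must be true or false")
    return {"max_workers": max_workers, "dry_run": dry_run}


print(check_upload_options({"max_workers": 3, "dry_run": True}))
```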
The tool builds a mapping of all images in your Transkribus collection:
Server image: page_001.jpg → Mapped to: page_001.xml
Server image: page_002.tif → Mapped to: page_002.xml
XML files are matched to server images by base filename (without extension):
Local: TID_1965_05_22_001.xml
Server: TID_1965_05_22_001.jpg ✓ Match!
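Matching by base filename can be sketched with `pathlib`. This is a simplified illustration under the assumption that server image names are available as a flat list of strings; `match_by_stem` is a hypothetical name and the tool's own mapping code may be organized differently.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def match_by_stem(
    xml_files: Iterable[str], server_images: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Pair each XML file with the server image that shares its base name."""
    # Index server images by filename without extension.
    images_by_stem = {Path(name).stem: name for name in server_images}
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for xml in xml_files:
        image = images_by_stem.get(Path(xml).stem)
        if image is None:
            unmatched.append(xml)
        else:
            matched[xml] = image
    return matched, unmatched


matched, unmatched = match_by_stem(
    ["TID_1965_05_22_001.xml", "page001.xml"],
    ["TID_1965_05_22_001.jpg", "page_001.jpg"],
)
print(matched)    # {'TID_1965_05_22_001.xml': 'TID_1965_05_22_001.jpg'}
print(unmatched)  # ['page001.xml']
```

The comparison is exact and case-sensitive, so `page001` and `page_001` are treated as different names.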
Files are uploaded in parallel using ThreadPoolExecutor:
- Multiple threads upload simultaneously
- Progress tracked in real-time
- Automatic token refresh when OAuth token expires (~30-60 min)
- Thread-safe locking prevents race conditions
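The parallel upload step can be outlined as follows. This is a minimal sketch of the `ThreadPoolExecutor` pattern, not the script's real implementation: `upload_all` and the `upload_one` callable are hypothetical, and the progress line is a simplified version of the tool's output.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple


def upload_all(
    files: Iterable[str],
    upload_one: Callable[[str], None],
    max_workers: int = 5,
) -> Tuple[int, List[str]]:
    """Run upload_one for every file in a thread pool; return (ok_count, failed)."""
    ok = 0
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its filename for error reporting.
        futures = {pool.submit(upload_one, name): name for name in files}
        # Results are collected in the calling thread, so the counters need no lock.
        for done, future in enumerate(as_completed(futures), start=1):
            if future.exception() is None:
                ok += 1
            else:
                failed.append(futures[future])
            if done % 100 == 0:
                print(f"Progress: {done}/{len(futures)} | Errors: {len(failed)}")
    return ok, failed
```

Collecting results with `as_completed` means a single failed upload is recorded without stopping the remaining workers.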
python upload_pagexml_fast_threadsafe.py config_upload.yaml
# Set dry_run: true in config_upload.yaml, then:
python upload_pagexml_fast_threadsafe.py config_upload.yaml
This shows what would be uploaded without actually uploading.
python upload_pagexml_fast_threadsafe.py /path/to/custom_config.yaml
======================================================================
PageXML Parallel Upload with Thread-Safe Token Refresh
======================================================================
Logging in to Transkribus...
✓ Login successful
✓ Access token obtained (expires in ~30-60 min, will auto-refresh)
Building collection mapping...
Found 6453 pages across 152 documents
✓ Mapping complete: 6453 pages available
Found 6400 XML files matching pattern '*.xml'
Matched 6400 files to pages
Unmatched: 0 files
Starting parallel upload with 5 workers...
Note: Token will auto-refresh if it expires during upload
======================================================================
Progress: 100/6400 (1%) | Rate: 22.3 files/sec | ETA: 4.7 min | Errors: 0
Progress: 200/6400 (3%) | Rate: 21.8 files/sec | ETA: 4.7 min | Errors: 0
...
⚠ Token expired for TID_1965_05_22_530.xml, refreshing (thread: ThreadPoolExecutor-0_1)
✓ Token refreshed successfully by thread ThreadPoolExecutor-0_1
...
======================================================================
UPLOAD SUMMARY
======================================================================
Total XML files found: 6400
Successfully uploaded: 6400
Not found in collection: 0
Errors: 0
Total time: 5.23 minutes
Average rate: 20.4 files/second
======================================================================
Problem: XML files don't match server images.
Solution: Ensure filenames match exactly (excluding extension):
✗ Wrong: page001.xml → page_001.jpg (different names)
✓ Right: page001.xml → page001.jpg (same base name)
Problem: Token expired during upload.
Solution: This should be handled automatically. If it persists:
- Check credentials in config
- Verify internet connection
- Check Transkribus service status
Problem: Too many requests to API.
Solution: Reduce max_workers in config:
max_workers: 3 # Reduce from 9 to 3
Problem: Multiple threads refreshing token simultaneously.
Solution: Use upload_pagexml_fast_threadsafe.py (not older versions). This version includes proper thread-safe locking.
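To illustrate what thread-safe locking means here, the sketch below shows one common way to make sure that only one thread refreshes an expired token while the others reuse the result. It is an assumption about the general pattern, not a copy of the script's code; `TokenHolder` and `fetch_token` are hypothetical names.

```python
import threading
from typing import Callable


class TokenHolder:
    """Share one access token between threads and refresh it once per expiry."""

    def __init__(self, fetch_token: Callable[[], str]) -> None:
        self._fetch_token = fetch_token  # hypothetical callable returning a new token
        self._lock = threading.Lock()
        self._token = fetch_token()

    @property
    def token(self) -> str:
        with self._lock:
            return self._token

    def refresh(self, stale_token: str) -> str:
        """Replace the token unless another thread already replaced stale_token."""
        with self._lock:
            # Only the first thread to arrive still sees the stale token.
            if self._token == stale_token:
                self._token = self._fetch_token()
            return self._token
```

A worker that receives an "expired token" response would call `refresh` with the token it used. Threads that arrive after the first refresh get the new token back without triggering another request.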
┌────────────────────────────────────────────────────────────┐
│             upload_pagexml_fast_threadsafe.py              │
│                                                            │
│  ┌──────────────┐        ┌────────────────┐                │
│  │ Main Thread  │        │  Thread Pool   │                │
│  │              │────────│   (Workers)    │                │
│  │ - Config     │        │                │                │
│  │ - Mapping    │        │ Upload threads │                │
│  │ - Progress   │        │ (5-9 parallel) │                │
│  └──────────────┘        └────────────────┘                │
│         │                        │                         │
│         │                        │                         │
│         ▼                        ▼                         │
│  ┌────────────────────────────────────────┐                │
│  │         transkribus_wrapper.py         │                │
│  │  (TranskribusClient + Token Refresh)   │                │
│  └────────────────────────────────────────┘                │
│                      │                                     │
└──────────────────────┼─────────────────────────────────────┘
                       │
                       ▼
              ┌──────────────────┐
              │ Transkribus API  │
              │  (OAuth + REST)  │
              └──────────────────┘
# TranskribusClient is defined in transkribus_wrapper.py
from transkribus_wrapper import TranskribusClient

client = TranskribusClient()
# Authenticate
client.login(username, password)
# List documents in collection
docs = client.list_documents(collection_id)
# Get document details
doc = client.get_document(collection_id, doc_id)
# Upload PageXML
client.update_page_xml(
    collection_id=123,
    doc_id=456,
    page_nr=1,
    xml_content="<xml>..."
)
# Refresh OAuth token
client.refresh_access_token()
Never commit config_upload.yaml to version control!
The repository includes a .gitignore that excludes:
- config_upload.yaml (contains credentials)
- *.log (may contain sensitive data)
- .env files
- Python cache files
Instead, use config_upload.yaml.example as a template.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Built for the Transkribus platform
- Uses the Transkribus REST API and OAuth2 authentication
- Developed for Tibetan Newspaper digitization workflows of the Divergent Discourses project (DFG/AHRC)
- Issues: GitHub Issues
- Transkribus Help: https://help.transkribus.org
See CHANGELOG.md for version history and updates.
Made with ❤️ for digital humanities and document digitization projects