Description

This repository holds utilities for parsing and extracting useful data from Transkribus PageXML outputs. Paragraph Extractor is a utility that accepts Transkribus PageXML as input and interprets the text regions on each page/image (such as headers, titles and blocks of text), which we term "paragraphs". It then returns the raw text of each text region (paragraph) along with its metadata and writes it to a CSV. Note that it reads PageXML, not AltoXML.

Paragraph Extractor was developed by James Engels and modified by Christina Sabbagh, both of SOAS University of London, for the Divergent Discourses project. The project is a joint study involving SOAS University of London and Leipzig University, funded by the AHRC in the UK and the DFG in Germany. Please acknowledge the project in any use of these materials.

Paragraph Extractor

paragraph_extractor.py takes a directory containing PageXML files. For each PageXML file, it generates a CSV containing the page's text regions (one per row) together with the relevant metadata. The method assumes that the Transkribus "text region" is an acceptably accurate 1:1 proxy for a paragraph.
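The idea of reading one row per text region can be sketched as follows. This is a minimal illustration, not the code in paragraph_extractor.py: the function name extract_regions, the sample XML and the regular expressions are assumptions. It also assumes the usual Transkribus layout in which the reading order and region type are stored in each TextRegion's custom attribute, and that a region without its own TextEquiv can be rebuilt from its lines.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample, trimmed to the elements used below.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="0001_QTN_1959_10_03_001_SB_Zsn128163MR.jpg">
    <TextRegion id="tr_1" custom="readingOrder {index:0;} structure {type:heading;}">
      <TextLine id="l_1"><TextEquiv><Unicode>line text</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>First region</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="tr_2" custom="readingOrder {index:1;}">
      <TextLine id="l_2"><TextEquiv><Unicode>Second</Unicode></TextEquiv></TextLine>
      <TextLine id="l_3"><TextEquiv><Unicode>region</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

# Assumed format of the "custom" attribute on a TextRegion.
READING_ORDER = re.compile(r"readingOrder\s*\{index:(\d+);\}")
STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*?type:([^;}]+);")


def extract_regions(xml_text):
    """Return one dict per TextRegion; '{*}' matches any XML namespace (Python 3.8+)."""
    root = ET.fromstring(xml_text)
    page = root.find(".//{*}Page")
    filename = page.get("imageFilename") if page is not None else None
    rows = []
    for region in root.findall(".//{*}TextRegion"):
        custom = region.get("custom", "")
        order = READING_ORDER.search(custom)
        kind = STRUCTURE_TYPE.search(custom)
        # Prefer the region-level text; otherwise join the text of its lines.
        unicode_el = region.find("{*}TextEquiv/{*}Unicode")
        if unicode_el is not None and unicode_el.text:
            text = unicode_el.text
        else:
            lines = region.findall("{*}TextLine/{*}TextEquiv/{*}Unicode")
            text = "\n".join(el.text or "" for el in lines)
        rows.append({
            "Paragraph": text,
            "Paragraph ID": region.get("id"),
            "Reading order ID": int(order.group(1)) if order else None,
            "Region type": kind.group(1) if kind else None,
            "Filename": filename,
        })
    return rows


rows = extract_regions(SAMPLE)
```

Each dict in rows corresponds to one row of the output CSV; regions with no structure tag are given a region type of None here, which may differ from how the actual script labels them.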

CSV columns include:

Paragraph (str): The text recognised by Transkribus HTR
Paragraph ID (str): E.g. tr_1718110017
Reading order ID (int): A number representing the Transkribus-predicted reading order index of the text region within its page
Region type (str): E.g. caption, heading, paragraph, other
Filename (str): The original image filename, extracted from the PageXML, e.g. 0001_QTN_1959_10_03_001_SB_Zsn128163MR.jpg. For the code to work as intended, filenames must be underscore-separated with elements in the following order: 4-digit code assigned by Transkribus (0001), newspaper name/code (QTN), year (1959), month (10), date (03), page (001). Any remaining information in the filename is not extracted
Newspaper (str): E.g. QTN - parsed from the original image filename
Year (int): Parsed from the original image filename
Month (int): Parsed from the original image filename
Date (int): Parsed from the original image filename
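The filename convention described above can be illustrated with a short sketch. The function name parse_image_filename is hypothetical, and the real script may validate filenames differently; the sketch only assumes the underscore-separated ordering given in the Filename column description.

```python
from pathlib import Path


def parse_image_filename(filename):
    """Split an image filename into the metadata columns listed above."""
    parts = Path(filename).stem.split("_")
    if len(parts) < 6:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    # Order: Transkribus code, newspaper, year, month, date, page.
    # Anything after the page element is ignored.
    _code, newspaper, year, month, day, _page = parts[:6]
    return {
        "Filename": filename,
        "Newspaper": newspaper,
        "Year": int(year),
        "Month": int(month),
        "Date": int(day),
    }


meta = parse_image_filename("0001_QTN_1959_10_03_001_SB_Zsn128163MR.jpg")
```

A filename with fewer than six underscore-separated elements raises a ValueError in this sketch, and non-numeric year, month or date elements fail at the int() conversion.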

The program recursively searches the input directories for .xml files, so clean file structures are important: any other .xml file left inside an input directory will also be found.
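A recursive search of this kind can be sketched with pathlib. This is an illustration only (find_xml_files is a hypothetical name, and the actual script may walk directories differently); it shows why a stray .xml file anywhere under the input directory is picked up alongside the page files.

```python
import tempfile
from pathlib import Path


def find_xml_files(root_dir):
    """Return every .xml file under root_dir, at any depth, in sorted order."""
    return sorted(p for p in Path(root_dir).rglob("*.xml") if p.is_file())


# Build a small throwaway directory tree to demonstrate the search.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "issue_a" / "page").mkdir(parents=True)
    (root / "issue_a" / "page" / "0001.xml").write_text("<PcGts/>")
    (root / "issue_a" / "notes.txt").write_text("not XML")
    (root / "stray.xml").write_text("<other/>")
    found = [p.relative_to(root).as_posix() for p in find_xml_files(root)]

print(found)
```

Both the page file and stray.xml appear in found, while notes.txt is skipped.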

Installation

Using the command line, navigate to the location in which you wish to download the code. Then, download the code.

git clone https://github.com/Divergent-Discourses/transkribus_xml2csv.git

Create a virtual environment.

conda create -n xml2csv python=3.12.2

Activate the environment.

conda activate xml2csv

Using the command line, navigate to the location of this repository. Then, install required packages.

cd transkribus_xml2csv
pip install -r requirements.txt

Using the Paragraph Extractor

Move the directories containing your pageXML files (.xml) into the ./data/to_process_xml directory.

Navigate via the command line into this transkribus_xml2csv directory.

To parse the pageXML files and output a series of .csv files, call:

python ./src/paragraph_extractor.py

The .csv files will be written to ./data/processed_csv.

To merge .csv files into a single .csv file, call:

python ./src/merge_csv.py

The master .csv file will be written to ./data/merged_csv.
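For readers who want to see what such a merge involves, here is a minimal sketch using only the standard library. It is not the code in merge_csv.py: the function name merge_csv_files, the output filename merged.csv and the assumption that every input file shares the same header row are all illustrative.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_files(input_dir, output_path):
    """Concatenate all .csv files under input_dir into one file; return the row count."""
    csv_paths = sorted(Path(input_dir).rglob("*.csv"))
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    rows_written = 0
    writer = None
    with output_path.open("w", newline="", encoding="utf-8") as out:
        for path in csv_paths:
            with path.open(newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip empty files
                if writer is None:
                    # The header is taken from the first non-empty file.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demonstration on two tiny, made-up CSV files.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "processed_csv"
    src_dir.mkdir()
    (src_dir / "a.csv").write_text("Paragraph,Year\nfirst,1959\n", encoding="utf-8")
    (src_dir / "b.csv").write_text("Paragraph,Year\nsecond,1960\n", encoding="utf-8")
    merged = Path(tmp) / "merged_csv" / "merged.csv"
    count = merge_csv_files(src_dir, merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()
```

The header is written once and the rows of each file follow in sorted filename order. A file whose columns differ from the first file's would raise an error in this sketch rather than be merged silently.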

Remember to deactivate the virtual environment once you're done.

conda deactivate

Copyright

Paragraph Extractor was developed by James Engels and modified by Christina Sabbagh, both of SOAS University of London, for the Divergent Discourses project. The project is a joint study involving SOAS University of London and Leipzig University, funded by the AHRC in the UK and the DFG in Germany. Please acknowledge the project in any use of these materials. Copyright for the project resides with the two universities.
