XML Parser

The XML Parser is a tool for parsing XML-encoded texts to obtain the content and speaker/author of utterances within the text.

Inputs

The XML corpus can be provided either as a collection of text files or as an Excel/CSV table. The loader currently supports the following file types: txt, odt, docx, csv, tsv, xlsx, ods, xml.

Following Hardie (2014), each utterance must be enclosed in a 'u' tag that has a 'who' attribute, and 'who' must be the first attribute in the tag. Utterances that do not follow this format are excluded. Valid utterances look like this:

<u who="PETER">Hello, world.</u>
<u who="WORLD">Hello, Peter.</u>
<u who="PETER">Wow!</u>
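
The format rule above can be illustrated with a short Python sketch. This is not the XML Parser's actual implementation: the pattern `UTTERANCE_RE` and the function `find_utterances` are hypothetical names, and the sketch assumes double-quoted attribute values and non-nested 'u' tags.

```python
import re

# Hypothetical pattern (an assumption, not the parser's own): a 'u' tag whose
# first attribute is a double-quoted 'who', then any other attributes, then
# the utterance content up to the closing tag.
UTTERANCE_RE = re.compile(r'<u\s+who="([^"]*)"[^>]*>(.*?)</u>', re.DOTALL)


def find_utterances(text):
    """Return (speaker, content) pairs for every utterance in the valid format."""
    return [(m.group(1), m.group(2)) for m in UTTERANCE_RE.finditer(text)]


sample = (
    '<u who="PETER">Hello, world.</u>\n'
    '<u n="2" who="WORLD">Hello, Peter.</u>\n'  # 'who' is not first: excluded
    '<u>Wow!</u>'                               # no 'who' attribute: excluded
)
print(find_utterances(sample))
```

Only the first utterance in `sample` is returned, because the other two break the rule that 'who' must be present and must come first.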

Output

Once a corpus has been parsed, it can be exported in one of three formats: csv, xlsx, or zip. The zip format provides a zip file containing each utterance as a txt file, with a metadata.csv containing the corpus metadata. The csv and xlsx formats are structured as a table where each row represents an utterance. The table for the above input example would look as follows:

document_	speaker
Hello, world.	PETER
Hello, Peter.	WORLD
Wow!	PETER
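
As a rough illustration of the table layout, the sketch below writes the same three rows as CSV using Python's standard library. It is not the tool's exporter: the function name `utterances_to_csv` is hypothetical, and the column order simply follows the table above.

```python
import csv
import io

# The three utterances from the example above, as (document_, speaker) pairs.
rows = [
    ("Hello, world.", "PETER"),
    ("Hello, Peter.", "WORLD"),
    ("Wow!", "PETER"),
]


def utterances_to_csv(rows):
    """Serialise utterance rows to CSV text, one utterance per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document_", "speaker"])  # header row
    writer.writerows(rows)
    return buffer.getvalue()


print(utterances_to_csv(rows))
```

Utterances that contain commas, such as the first two, are quoted by the csv module so that each one stays in a single cell.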

Instructions

1. Upload your document files to the 'corpus_data' directory.
2. Run the cell below and use the Corpus Loader to build a corpus from your selected documents.
3. Once the corpus is built, navigate to the 'XML Parser' tab. Here, select your corpus in the dropdown and click 'Parse XML'.
4. When parsing is complete, navigate to the 'Corpus Overview' tab to export the parsed corpus.

See the user guide for detailed instructions and hover over the tooltips in the loader for simplified instructions on how to load and build the corpus.

Notes

The XML Parser keeps all existing corpus metadata and adds a metadata column called 'speaker'. If the corpus already has a metadata column called 'speaker', it will be overwritten.
When parsing utterances, the XML Parser will skip any utterance that does not have a speaker.
When parsing utterances, the XML Parser will remove any XML tags within the contents of the utterance.
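
The tag removal and the 'speaker' overwrite described in these notes can be sketched in Python as follows. This is an assumption about the general behaviour, not the project's code: `TAG_RE`, `strip_tags`, and `add_speaker` are hypothetical names, and the simple pattern does not handle a '>' inside an attribute value.

```python
import re

# Hypothetical pattern: anything between '<' and the next '>' counts as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(content):
    """Remove XML tags from inside an utterance, keeping the text between them."""
    return TAG_RE.sub("", content)


def add_speaker(metadata, speaker):
    """Return a copy of the metadata with 'speaker' set, replacing any old value."""
    updated = dict(metadata)
    updated["speaker"] = speaker
    return updated


print(strip_tags('Hello, <emph rend="bold">world</emph>.'))
print(add_speaker({"title": "Demo", "speaker": "old value"}, "PETER"))
```

The first call keeps only the text of the utterance, and the second shows an existing 'speaker' value being replaced while the other metadata is kept.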

Demo

Click the button below to access a demo deployed on BinderHub.

Authors

Hamish Croser - h-croser

License

This project is licensed under the MIT License - see the LICENSE file for details.

