Add method to extract paragraphs from XML file #4

@thahmann

To add, display, and edit NER-label annotations in the XML file, we would like to split each "div" element (representing a section) into its paragraphs (the "p" elements it contains).
To do that, create a new method adapted from the existing method grobidBodyWords(filepath, output="arr"), as follows:

body = findBody(filepath)
divArr = body.findall("{http://www.tei-c.org/ns/1.0}div")
retval = []

for div in divArr:
    # select all paragraph elements ("p") from the current "div" element;
    # each "p" element denotes a single paragraph
    for p in div.findall("{http://www.tei-c.org/ns/1.0}p"):
        for child in p.iter():
            # child.text is the text inside that element;
            # child.tail is the text that follows its closing tag
            if child.text:
                retval += manualSplit(child.text)
            if child.tail:
                retval += manualSplit(child.tail)
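
The loop above could also keep the words of each paragraph together instead of appending everything to one flat list. The following is only a minimal, self-contained sketch of that idea. It assumes the standard-library xml.etree.ElementTree parser (the project may use a different one), locates the body directly instead of calling findBody, and uses str.split as a stand-in for manualSplit. The name paragraph_words and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def paragraph_words(body):
    """Return, for each "div", a list holding one word list per "p"."""
    sections = []
    for div in body.findall(TEI_NS + "div"):
        paragraphs = []
        for p in div.findall(TEI_NS + "p"):
            # itertext() yields the text and tails of everything inside p,
            # in document order, but not the tail that follows </p> itself
            text = "".join(p.itertext())
            # str.split() is a stand-in for the project's manualSplit()
            paragraphs.append(text.split())
        sections.append(paragraphs)
    return sections


# Hypothetical, heavily shortened GROBID-style TEI document
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Intro</head><p>First <ref>[1]</ref> paragraph.</p><p>Second one.</p></div>
<div><p>Third.</p></div>
</body></text></TEI>"""

root = ET.fromstring(SAMPLE)
# Stand-in for findBody(filepath): locate the TEI body element directly
body = root.find(f"{TEI_NS}text/{TEI_NS}body")
result = paragraph_words(body)
print(result)
```

Grouping by paragraph keeps the paragraph boundaries available, which is what annotation display would need. Using itertext() on each "p" also avoids picking up the whitespace that follows the closing tag of the paragraph, which p.iter() would include through p.tail.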

Labels: enhancement