Description
To add, display, and edit annotations with NER labels in the XML file, we would like to split each "div" element (representing a section) into its paragraphs (the "p" elements it contains).
To do that, create a new method adapted from grobidBodyWords(filepath, output="arr"), as follows:
body = findBody(filepath)
divArr = body.findall("{http://www.tei-c.org/ns/1.0}div")
retval = []
for div in divArr:
    # select all paragraph elements ("p") from the current "div" element
    # each "p" element denotes a single paragraph
    for p in div.findall("{http://www.tei-c.org/ns/1.0}p"):
        # p.iter() yields p itself first, then all of its descendants
        for child in p.iter():
            # child.text is the text inside that element; child.tail is the text after it
            if child.text:
                retval += manualSplit(child.text)
            if child.tail:
                retval += manualSplit(child.tail)
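For illustration, here is a minimal, self-contained sketch of one way the per-paragraph split could look. It is not the project's implementation: `paragraph_words` and the sample TEI string are hypothetical, `str.split()` stands in for `manualSplit`, and the body is parsed from a string instead of calling `findBody(filepath)`. It also uses `itertext()`, which walks text and tail in document order and skips the tail of the "p" element itself; that is an assumption about the desired behaviour, not something stated above.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def paragraph_words(body):
    """Return, for each "div", a list with one word list per "p" (hypothetical helper)."""
    sections = []
    for div in body.findall(TEI + "div"):
        paragraphs = []
        for p in div.findall(TEI + "p"):
            # itertext() yields the text and tail of every node inside p,
            # in document order, but not the tail of p itself.
            text = "".join(p.itertext())
            # str.split() is only a stand-in for the project's manualSplit.
            paragraphs.append(text.split())
        sections.append(paragraphs)
    return sections


# Made-up TEI fragment, used only to exercise the sketch.
sample = (
    '<body xmlns="http://www.tei-c.org/ns/1.0">'
    "<div><head>Intro</head>"
    "<p>First <ref>[1]</ref> paragraph.</p>"
    "<p>Second one.</p></div>"
    "<div><p>Another section.</p></div>"
    "</body>"
)
body = ET.fromstring(sample)
result = paragraph_words(body)
# result[0] holds the two paragraphs of the first "div",
# result[1] the single paragraph of the second.
```

Keeping one word list per paragraph, nested under its section, preserves the paragraph boundaries that a flat `retval` loses, which is what the annotation step would need in order to address a specific "p".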