Add method to extract paragraphs from XML file #4

To add, display, and edit annotations with NER labels on the XML file, we would like to split each "div" object (representing a section) into its paragraphs (the "p" objects it contains).
To do that, create a new method that adapts the existing method grobidBodyWords(filepath, output="arr") as follows:

body = findBody(filepath)
divArr = body.findall("{http://www.tei-c.org/ns/1.0}div")
retval = []

for div in divArr:
    # Select all paragraph elements ("p") that are direct children of the current "div" element.
    # Each "p" element denotes a single paragraph.
    for p in div.findall("{http://www.tei-c.org/ns/1.0}p"):
        for child in p.iter():
            # child.text is the text inside the element; child.tail is the text that follows its closing tag.
            if child.text:
                retval += manualSplit(child.text)
            if child.tail:
                retval += manualSplit(child.tail)
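
The snippet above appends every word to one flat list, so the paragraph boundaries are lost. A minimal sketch of one way to keep them is shown below. It is an illustration under stated assumptions, not the project's implementation: `paragraphs_by_div` and `SAMPLE` are hypothetical names, the body is located with plain `xml.etree.ElementTree` calls instead of `findBody`, and `str.split()` stands in for `manualSplit`.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def paragraphs_by_div(body):
    """Return one list per "div"; each holds one word list per direct-child "p"."""
    result = []
    for div in body.findall(TEI + "div"):
        paragraphs = []
        for p in div.findall(TEI + "p"):
            # itertext() yields p.text plus the text and tail of every
            # descendant, but not p.tail, which lies outside the paragraph.
            text = "".join(p.itertext())
            # Assumption: whitespace splitting stands in for manualSplit.
            paragraphs.append(text.split())
        result.append(paragraphs)
    return result


# Hypothetical sample document, only for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Intro</head><p>First <ref>[1]</ref> paragraph.</p><p>Second one.</p></div>
<div><p>Only paragraph here.</p></div>
</body></text></TEI>"""

body = ET.fromstring(SAMPLE).find(f".//{TEI}body")
sections = paragraphs_by_div(body)
```

With this shape, `sections[i][j]` is the word list of paragraph `j` in section `i`, which may be a convenient index for attaching NER labels to a specific paragraph. Note that `p.iter()` in the original snippet also visits `p` itself, so `p.tail` (text after the closing `</p>`) gets included; `itertext()` avoids that.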
				

