Hi,
I’m struggling to convert TEI-encoded parallel corpora with Pepper.
The most straightforward approach proposed by TEI seems to be constructing link groups that connect the aligned linguistic units. This is the approach used in the Opus-MontenegrinSubs corpus, where, alongside the English and Montenegrin texts themselves, there is a separate file containing nothing but the alignment links:
<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
<link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
...
</linkGrp>
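To make the link structure concrete, here is a minimal sketch of how such a file can be read outside Pepper, using only Python's standard library. It is not how the TEI importer works; the function name `read_links` is hypothetical, and `SAMPLE` is just the fragment above with the elided links left out.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# The alignment fragment quoted above; the elided "..." links are omitted.
SAMPLE = """<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
 corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
<link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
</linkGrp>"""


def read_links(xml_text):
    """Return one tuple of target ids per <link>, with the leading '#' removed."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() also visits the root, so this works whether <linkGrp> is the
    # root element or nested deeper in a larger TEI document.
    for link in root.iter(f"{{{TEI_NS}}}link"):
        targets = link.get("target", "").split()
        pairs.append(tuple(t.lstrip("#") for t in targets))
    return pairs


links = read_links(SAMPLE)
for en_id, cnr_id in links:
    print(en_id, "<->", cnr_id)
```

Each tuple holds the ids in the order given in @target, which in this corpus appears to be English first and Montenegrin second, matching the order of the files in @corresp.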
Additionally, every aligned segment has a @corresp attribute pointing to the @xml:id of its translation equivalent, like this:
<ab n="10" xml:id="Damages.S1.dam0101.SL15-cnr"
corresp="#Damages.S1.dam0101.SL15-en">
...
</ab>
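The same pairing can be recovered from the texts themselves. The sketch below collects every element that carries both @xml:id and @corresp, again with Python's standard library only. The function name `corresp_map` is hypothetical, and I am assuming that the `<ab>` elements sit in the TEI namespace; the sample is the segment above with its elided content left empty.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The segment quoted above; its content is elided in the original, so it is
# left empty here. The xmlns declaration is an assumption.
SEGMENT = """<ab xmlns="http://www.tei-c.org/ns/1.0" n="10"
 xml:id="Damages.S1.dam0101.SL15-cnr"
 corresp="#Damages.S1.dam0101.SL15-en"/>"""


def corresp_map(xml_text):
    """Map each element's xml:id to the list of ids named in its @corresp."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for el in root.iter():
        xml_id = el.get(XML_ID)
        corresp = el.get("corresp")
        if xml_id and corresp:
            mapping[xml_id] = [c.lstrip("#") for c in corresp.split()]
    return mapping


mapping = corresp_map(SEGMENT)
for source_id, target_ids in mapping.items():
    print(source_id, "->", ", ".join(target_ids))
```

So the alignment is encoded twice: once in the stand-off link file and once inline on each segment.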
However, the TEI importer fails to process this corpus, with errors like these:
Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_cnr.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_en.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
An exception was thrown by the mapper threads 'Thread[TEIImporter_mapper(salt:/OpusMonte.TEI/opusmonte_cnr.ana),5,TEIImporter_mapperGroup]'.
org.corpus_tools.pepper.modules.exceptions.PepperModuleXMLResourceException: Cannot read xml-file'file:/D:/Users/k.sipunin/Downloads/OpusMonte.TEI/opusmonte_cnr.ana.xml', because of a nested exception.
at org.corpus_tools.pepper.common.PepperUtil.readXMLResource(PepperUtil.java:661)
at org.corpus_tools.pepper.impl.PepperMapperImpl.readXMLResource(PepperMapperImpl.java:278)
at org.corpus_tools.peppermodules.TEIModules.TEIMapper.mapSDocument(TEIMapper.java:58)
at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.map(PepperMapperControllerImpl.java:251)
at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.run(PepperMapperControllerImpl.java:188)
Caused by: org.corpus_tools.salt.exceptions.SaltInsertionException: Cannot insert object 'lemma=opasni' into container 'SStructureImpl(null)[lemma=opasni], salt::unit=word], ana=mte:Agpfpny]'. Because an id already exists: lemma=opasni.
What might be the problem? And more generally, what is the proper way to encode parallel corpora so that they can be imported into ANNIS (the presence of a sample here suggests that it’s doable)?