How to import parallel corpora? #128

@vintagentleman

Hi,

I’m struggling to convert TEI-encoded parallel corpora with Pepper.

The most straightforward approach proposed by TEI seems to be to construct link groups that connect the aligned linguistic units. This is the approach used in the Opus-MontenegrinSubs corpus, where, alongside the English and Montenegrin texts themselves, there is a separate file containing nothing but the alignment links:

<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
    corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
  <link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
  ...
</linkGrp>
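
To make the structure of these links concrete, here is a minimal sketch that reads such a link group into pairs of segment ids. It uses only the Python standard library; the function name `read_alignment_links` is hypothetical, and the sample is the fragment above with the elided links left out. It has nothing to do with how Pepper itself processes the file.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def read_alignment_links(xml_text):
    """Return a list of (n, [target ids]) for every <link> inside an
    alignment <linkGrp>. Hypothetical helper, for illustration only."""
    root = ET.fromstring(xml_text)
    links = []
    # iter() also yields the root itself if it is the <linkGrp>
    for grp in root.iter(f"{{{TEI_NS}}}linkGrp"):
        if grp.get("type") != "alignment":
            continue
        for link in grp.findall(f"{{{TEI_NS}}}link"):
            # @target is a whitespace-separated list of "#id" references
            targets = [t.lstrip("#") for t in link.get("target", "").split()]
            links.append((link.get("n"), targets))
    return links


sample = """<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
    corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
  <link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
</linkGrp>"""

for n, targets in read_alignment_links(sample):
    print(n, targets)
```

Each link thus yields one English id and one Montenegrin id, in the order given by the `corresp` attribute of the link group.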

Additionally, every aligned segment has a @corresp attribute pointing to the @xml:id of its translation equivalent, like this:

<ab n="10" xml:id="Damages.S1.dam0101.SL15-cnr"
    corresp="#Damages.S1.dam0101.SL15-en">
  ...
</ab>
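
As a sanity check on this second mechanism, a sketch like the following could verify that every `#id` in a `@corresp` attribute resolves to an `@xml:id` in the other document. This is an assumption-laden illustration: the helper names are hypothetical, the two miniature documents are made up around the segment shown above (the English counterpart is invented for the example), and only fragment references starting with `#` are checked.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def collect_ids(xml_text):
    """Set of all xml:id values in a document."""
    return {el.get(XML_ID) for el in ET.fromstring(xml_text).iter() if el.get(XML_ID)}


def unresolved_corresp(source_xml, target_xml):
    """List (source id, reference) pairs whose '#id' reference in @corresp
    has no matching xml:id in the target document."""
    target_ids = collect_ids(target_xml)
    missing = []
    for el in ET.fromstring(source_xml).iter():
        for ref in el.get("corresp", "").split():
            if ref.startswith("#") and ref[1:] not in target_ids:
                missing.append((el.get(XML_ID), ref))
    return missing


# Hypothetical miniature documents, not the real corpus files
cnr_doc = ('<text><ab n="10" xml:id="Damages.S1.dam0101.SL15-cnr" '
           'corresp="#Damages.S1.dam0101.SL15-en"/></text>')
en_doc = ('<text><ab n="10" xml:id="Damages.S1.dam0101.SL15-en" '
          'corresp="#Damages.S1.dam0101.SL15-cnr"/></text>')

print(unresolved_corresp(cnr_doc, en_doc))
```

An empty list means every reference in the first document resolves in the second; running the check in both directions covers both files.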

However, the TEI importer fails to process this corpus, with errors like these:

Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_cnr.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_en.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
An exception was thrown by the mapper threads 'Thread[TEIImporter_mapper(salt:/OpusMonte.TEI/opusmonte_cnr.ana),5,TEIImporter_mapperGroup]'.
org.corpus_tools.pepper.modules.exceptions.PepperModuleXMLResourceException: Cannot read xml-file'file:/D:/Users/k.sipunin/Downloads/OpusMonte.TEI/opusmonte_cnr.ana.xml', because of a nested exception.
        at org.corpus_tools.pepper.common.PepperUtil.readXMLResource(PepperUtil.java:661)
        at org.corpus_tools.pepper.impl.PepperMapperImpl.readXMLResource(PepperMapperImpl.java:278)
        at org.corpus_tools.peppermodules.TEIModules.TEIMapper.mapSDocument(TEIMapper.java:58)
        at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.map(PepperMapperControllerImpl.java:251)
        at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.run(PepperMapperControllerImpl.java:188)
Caused by: org.corpus_tools.salt.exceptions.SaltInsertionException: Cannot insert object 'lemma=opasni' into container 'SStructureImpl(null)[lemma=opasni], salt::unit=word], ana=mte:Agpfpny]'.  Because an id already exists: lemma=opasni.

What might be the problem? And more generally, what is the proper way to encode parallel corpora so that they can be imported into ANNIS (the presence of a sample here suggests that it’s doable)?
