TANDO is a corpus for training and evaluation of document-level machine translation models in Basque-Spanish.

Vicomtech/tando

TANDO: A Corpus for Document-level Machine Translation

This repository contains the TANDO corpus for Document-level Machine Translation in Basque-Spanish.

Table of Contents

  1. Description
  2. Citation
  3. License
  4. Contact

Description

TANDO is a corpus for training and evaluation of document-level machine translation models in Basque-Spanish and Basque-French. The corpus was prepared within the ELKARTEK project TANDO (2020-2021: www.tando.eus) by members of the project consortium:

The TANDO corpus includes both parallel and contrastive datasets, in text format, and covers different domains (literature, news, subtitles, talks, politics). There are currently two versions:

We recommend using v2.0 for any future work.

Citation

If you use any part of the corpus in your own work, please cite the following papers:

@inproceedings{gete-et-al2022tando-corpus,
  title={TANDO: A Corpus for Document-level Machine Translation},
  author={Gete, Harritxu and Etchegoyhen, Thierry and Ponce, David and Labaka, Gorka and
     Aranberri, Nora and Corral, Ander and Saralegi, Xabier
     and Ellakuria Santos, Igor and Martin, Maite},
  booktitle={Proceedings of the 13th Edition of the Language Resources and Evaluation Conference (LREC 2022)},
  location = {Marseille, France},
  year={2022},
  pages = {TBD}
}

@article{gete2025tando+,
  title={TANDO+: Corpus and Baselines for Document-level Machine Translation in Basque--Spanish and Basque--French},
  author={Gete, Harritxu and Etchegoyhen, Thierry and Labaka, Gorka and Corral, Ander and Saralegi, Xabier and Aranberri, Nora and Ponce, David and Santos, Igor Ellakuria and Martin, Maite},
  journal={Language Resources and Evaluation},
  pages={1--41},
  year={2025},
  publisher={Springer}
}

License

The TANDO corpus is distributed under the Creative Commons BY-NC-SA 4.0 license.
To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Contact

If you have any questions or suggestions, do not hesitate to contact us at the following addresses:

  • Thierry Etchegoyhen: tetchegoyhen [AT] vicomtech [DOT] org
  • Harritxu Gete: hgete [AT] vicomtech [DOT] org
