MITRA-parallel is a large-scale, sentence-aligned parallel corpus for Sanskrit, Buddhist Chinese, and Tibetan. The dataset is designed to support research in machine translation, semantic retrieval, and philological studies of Buddhist and classical Asian literature. It is introduced in the paper:
MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan
Sebastian Nehrdich, Kurt Keutzer
arXiv preprint
This repository includes:
- A corpus of 1.74 million parallel sentence pairs between Sanskrit, Chinese, and Tibetan
- Links to the model weights of domain-specific pretrained language models (Gemma 2 MITRA) for translation and semantic retrieval
The parallel data is provided in the mitra-parallel/tsv/ directory as .tsv files. Each file contains sentence-level alignments between two ancient Buddhist languages. The typical columns are:
- src_segmentnr: Source segment identifier
- src_original: Source sentence (e.g., Sanskrit, Chinese, or Tibetan)
- tgt_segmentnr: Target segment identifier
- tgt_original: Target sentence (aligned translation)
Example (tab-separated):
src_segmentnr	src_original	tgt_segmentnr	tgt_original
XXn693u_007:1	namo buddhāyaḥ |	T02D2243:122b-4	སངས་རྒྱས་ལ་ཕྱག་འཚལ་ལོ་༎
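As a minimal sketch of how such a file could be read, the Python snippet below parses tab-separated rows into named records using only the standard library. It assumes every file starts with a header row containing the four column names listed above; the names `AlignedPair` and `read_pairs` are illustrative and not part of this repository. Quoting is disabled so that any quote characters inside a sentence are kept as literal text.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class AlignedPair(NamedTuple):
    """One aligned sentence pair (hypothetical helper type, not part of the repo)."""
    src_segmentnr: str
    src_original: str
    tgt_segmentnr: str
    tgt_original: str


def read_pairs(handle: TextIO) -> Iterator[AlignedPair]:
    """Yield one AlignedPair per data row of a tab-separated file with a header row."""
    # QUOTE_NONE: quote characters in the sentences are treated as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumes the four column names described in this README are present.
        yield AlignedPair(
            row["src_segmentnr"],
            row["src_original"],
            row["tgt_segmentnr"],
            row["tgt_original"],
        )


# In-memory demo built from the example row shown in this README.
sample = (
    "src_segmentnr\tsrc_original\ttgt_segmentnr\ttgt_original\n"
    "XXn693u_007:1\tnamo buddhāyaḥ |\tT02D2243:122b-4\tསངས་རྒྱས་ལ་ཕྱག་འཚལ་ལོ་༎\n"
)
for pair in read_pairs(io.StringIO(sample)):
    print(pair.src_segmentnr, "->", pair.tgt_segmentnr)
```

To read a file from disk, open it with `open(path, encoding="utf-8", newline="")` and pass the handle to `read_pairs`, where `path` points to one of the .tsv files in mitra-parallel/tsv/.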
The MITRA project also provides three models, a base model and two models fine-tuned for machine translation and semantic retrieval, respectively:
- Gemma 2 MITRA (base model)
- Gemma 2 MITRA-MT (machine translation)
- Gemma 2 MITRA-E (semantic embedding)
Model links:
If you use this dataset or the models, please cite:
@article{nehrdich2026mitra,
title={MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan},
author={Sebastian Nehrdich and Kurt Keutzer},
journal={arXiv preprint},
year={2026},
url={https://github.com/dharmamitra/mitra-semantic-similarity}
}
- The MITRA project is hosted at the Center for Integrated Japanese Studies, Tohoku University, and is supported by the Tsadra Foundation and collaborative partners, including monlam.ai and the Kumarajiva project.
- For more information and updates, see the project website
The MITRA-parallel dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). This means you are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original
For more details, see the full LICENSE file.
