MITRA-parallel: A Large-Scale Parallel Corpus for Sanskrit, Buddhist Chinese, and Tibetan

Overview

MITRA-parallel is a large-scale, sentence-aligned parallel corpus for Sanskrit, Buddhist Chinese, and Tibetan. The dataset is designed to support research in machine translation, semantic retrieval, and philological studies of Buddhist and classical Asian literature. It is introduced in the paper:

MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan
Sebastian Nehrdich, Kurt Keutzer
arXiv preprint

This repository includes:

A corpus of 1.74 million parallel sentence pairs across Sanskrit, Buddhist Chinese, and Tibetan
Links to the model weights of domain-specific pretrained language models (Gemma 2 MITRA) for translation and semantic retrieval

Data Structure

The parallel data is provided in the mitra-parallel/tsv/ directory as .tsv files. Each file contains sentence-level alignments between two ancient Buddhist languages. The typical columns are:

src_segmentnr: Source segment identifier
src_original: Source sentence (e.g., Sanskrit, Chinese, or Tibetan)
tgt_segmentnr: Target segment identifier
tgt_original: Target sentence (aligned translation)

Example (tab-separated):

src_segmentnr	src_original	tgt_segmentnr	tgt_original
XXn693u_007:1	 namo buddhāyaḥ |	T02D2243:122b-4	སངས་རྒྱས་ལ་ཕྱག་འཚལ་ལོ་༎
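As a minimal sketch of how these files might be read, the following Python snippet parses the four columns listed above using only the standard library. The helper name `read_pairs` and the whitespace stripping are illustrative assumptions, not part of the dataset's tooling, and the sample is the example row above held in memory rather than a real file from the repository.

```python
import csv
import io

# Column names as listed above; files with different columns would need adjusting.
COLUMNS = ["src_segmentnr", "src_original", "tgt_segmentnr", "tgt_original"]


def read_pairs(handle):
    """Yield one dict per aligned sentence pair from an open TSV handle.

    Hypothetical helper for illustration only.
    """
    # QUOTE_NONE keeps any quote characters inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        # Stripping surrounding whitespace is an assumption about desired cleanup.
        yield {c: (row[c] or "").strip() for c in COLUMNS}


# In-memory sample built from the example row above.
sample = (
    "src_segmentnr\tsrc_original\ttgt_segmentnr\ttgt_original\n"
    "XXn693u_007:1\t namo buddhāyaḥ |\tT02D2243:122b-4\tསངས་རྒྱས་ལ་ཕྱག་འཚལ་ལོ་༎\n"
)
pairs = list(read_pairs(io.StringIO(sample)))
print(pairs[0]["src_original"], "->", pairs[0]["tgt_original"])
```

For a file on disk, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to one of the .tsv files in mitra-parallel/tsv/.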

Pretrained Models

The MITRA project also provides three models: one base model and two fine-tuned models, one for machine translation and one for semantic retrieval:

Gemma 2 MITRA (base model)
Gemma 2 MITRA-MT (machine translation)
Gemma 2 MITRA-E (semantic embedding)

Model links:

Citation

If you use this dataset or the models, please cite:

@article{nehrdich2026mitra,
  title={MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan},
  author={Sebastian Nehrdich and Kurt Keutzer},
  journal={arXiv preprint},
  year={2026},
  url={https://github.com/dharmamitra/mitra-semantic-similarity}
}

Acknowledgments

The MITRA project is hosted at the Center for Integrated Japanese Studies, Tohoku University, and is supported by the Tsadra Foundation and by collaborative partners including monlam.ai and the Kumarajiva project.
For more information and updates, see the project website.

License

The MITRA-parallel dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). This means you are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original

For more details, see the full LICENSE file.
