dharmamitra/mitra-parallel

MITRA-parallel: A Large-Scale Parallel Corpus for Sanskrit, Buddhist Chinese, and Tibetan

Overview

MITRA-parallel is a large-scale, sentence-aligned parallel corpus for Sanskrit, Buddhist Chinese, and Tibetan. The dataset is designed to support research in machine translation, semantic retrieval, and philological studies of Buddhist and classical Asian literature. It is introduced in the paper:

MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan
Sebastian Nehrdich, Kurt Keutzer
arXiv preprint

This repository includes:

  • A corpus of 1.74 million parallel sentence pairs across Sanskrit, Buddhist Chinese, and Tibetan
  • Links to the model weights of domain-specific pretrained language models (Gemma 2 MITRA) for translation and semantic retrieval

Data Structure

The parallel data is provided in the mitra-parallel/tsv/ directory as .tsv files. Each file contains sentence-level alignments between two ancient Buddhist languages. The typical columns are:

  • src_segmentnr: Source segment identifier
  • src_original: Source sentence (e.g., Sanskrit, Chinese, or Tibetan)
  • tgt_segmentnr: Target segment identifier
  • tgt_original: Target sentence (aligned translation)

Example (tab-separated):

src_segmentnr	src_original	tgt_segmentnr	tgt_original
XXn693u_007:1	 namo buddhāyaḥ |	T02D2243:122b-4	སངས་རྒྱས་ལ་ཕྱག་འཚལ་ལོ་༎
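
As a minimal sketch of how such a file might be read, the snippet below parses the four columns with Python's standard `csv` module. The names `AlignedPair` and `read_pairs` are hypothetical helpers, not part of this repository. The sketch assumes the files carry the header row shown above, do not use CSV-style quoting, and that stray whitespace around sentences (as in the example row) can be stripped safely.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class AlignedPair(NamedTuple):
    """One aligned sentence pair (hypothetical container, not a repo API)."""
    src_segmentnr: str
    src_original: str
    tgt_segmentnr: str
    tgt_original: str


def read_pairs(handle: TextIO) -> Iterator[AlignedPair]:
    """Yield aligned pairs from an open TSV handle with the header shown above."""
    # Assumption: quotation marks inside sentences are literal text,
    # so CSV quote handling is switched off.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield AlignedPair(
            row["src_segmentnr"],
            row["src_original"].strip(),  # assumption: surrounding whitespace is noise
            row["tgt_segmentnr"],
            row["tgt_original"].strip(),
        )


# In-memory sample mirroring the example row, so the sketch runs without the dataset.
sample = (
    "src_segmentnr\tsrc_original\ttgt_segmentnr\ttgt_original\n"
    "XXn693u_007:1\t namo buddhāyaḥ |\tT02D2243:122b-4\tསངས་རྒྱས་ལ་ཕྱག་འཚལ་ལོ་༎\n"
)
pairs = list(read_pairs(io.StringIO(sample)))
print(pairs[0].src_segmentnr, "->", pairs[0].tgt_segmentnr)
```

For a real file from mitra-parallel/tsv/, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.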

Pretrained Models

The MITRA project also provides three models: a base model and two fine-tuned variants for machine translation and semantic retrieval.

  • Gemma 2 MITRA (base model)
  • Gemma 2 MITRA-MT (machine translation)
  • Gemma 2 MITRA-E (semantic embedding)

Model links:

Citation

If you use this dataset or the models, please cite:

@article{nehrdich2026mitra,
  title={MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan},
  author={Sebastian Nehrdich and Kurt Keutzer},
  journal={arXiv preprint},
  year={2026},
  url={https://github.com/dharmamitra/mitra-semantic-similarity}
}

Acknowledgments

License

The MITRA-parallel dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). This means you are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original

For more details, see the full LICENSE file.
