TransCorpus

TransCorpus is a scalable, production-ready API and CLI toolkit for large-scale parallel translation, preprocessing, and corpus management. It supports multi-GPU translation, robust checkpointing, and safe concurrent downloads, making it ideal for research and industry-scale machine translation workflows.

Features

  • 🚀 Multi-GPU and multi-process translation
  • 📦 Corpus downloading and preprocessing
  • 🔒 Safe, resumable, and concurrent file downloads
  • 🧩 Split and checkpoint management for large corpora
  • 🛠️ Easy deployment and extensibility
  • 🖥️ Cross-platform: Linux, macOS, Windows

Quick Start

  1. Clone and Install
git clone https://github.com/jknafou/TransCorpus.git
cd TransCorpus
UV_INDEX_STRATEGY=unsafe-best-match rye sync
source .venv/bin/activate
  2. Download a Corpus
transcorpus download-corpus [corpus_name]
  3. Preprocess the corpus by splits
transcorpus preprocess [corpus_name] [language] --num-split 100
  4. Translate the corpus by split (preprocessing it first if not already done)
transcorpus translate [corpus_name] [language] --num-split 100
  5. Preview a corpus in two languages side by side ([language2] is optional):
transcorpus preview [corpus_name] [language1] [language2]

[Screenshot: preview of a corpus in two languages side by side]

Each command can be run in demo mode by passing the -d flag, as in the walkthrough below.
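
For instance, a quick demo-mode session might look like the following sketch (the corpus name bio and target language fr are illustrative, and the -d flag placement is assumed to match the commands above):

# run the Quick Start steps on the small demo sample (-d)
transcorpus download-corpus bio -d
transcorpus preprocess bio fr --num-split 10 -d
transcorpus translate bio fr --num-split 10 -d
transcorpus preview bio en fr -d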

Preprocess and Translate (Multi-GPU Example)

The following example translates the bio corpus (PubMed), about 30GB of text, preprocessing it with 4 parallel workers while translating each available split on two GPUs of different sizes. It can easily be adapted to your needs. When deployed on an HPC cluster, for example with SLURM, it automatically resumes from where the previous run left off. With shared storage, multiple GPUs on different nodes can work simultaneously.

# Translate the bio corpus to German (de): 4 preprocessing workers, 20 splits (here in demo mode)
./example/multi_GPU.sh bio de 4 20
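
On a SLURM cluster, the same script can be wrapped in a submission file. A minimal sketch, assuming two GPUs per node (job name, resource values, and time limit are illustrative; adapt them to your cluster):

#!/bin/bash
#SBATCH --job-name=transcorpus-bio-de
#SBATCH --gres=gpu:2              # two GPUs, as in the example above
#SBATCH --cpus-per-task=4         # matches the 4 preprocessing workers
#SBATCH --time=24:00:00

# thanks to checkpointing, resubmitting this job resumes from the last completed split
./example/multi_GPU.sh bio de 4 20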

Research-Proven Performance

TransCorpus enables the training of state-of-the-art language models through synthetic translation. For example, TransBERT reached state-of-the-art performance on French biomedical downstream tasks by pretraining exclusively on text translated with this toolkit. The paper detailing these results was published at EMNLP 2025. 📝 Download it here. If you use this toolkit, please cite:

  @inproceedings{knafou-etal-2025-transbert,
      title = "{T}rans{BERT}: A Framework for Synthetic Translation in Domain-Specific Language Modeling",
      author = {Knafou, Julien  and
        Mottin, Luc  and
        Mottaz, Ana{\"i}s  and
        Flament, Alexandre  and
        Ruch, Patrick},
      editor = "Christodoulopoulos, Christos  and
        Chakraborty, Tanmoy  and
        Rose, Carolyn  and
        Peng, Violet",
      booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
      month = nov,
      year = "2025",
      address = "Suzhou, China",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2025.findings-emnlp.1053/",
      doi = "10.18653/v1/2025.findings-emnlp.1053",
      pages = "19338--19354",
      ISBN = "979-8-89176-335-7",
      abstract = "The scarcity of non-English language data in specialized domains significantly limits the development of effective Natural Language Processing (NLP) tools. We present TransBERT, a novel framework for pre-training language models using exclusively synthetically translated text, and introduce TransCorpus, a scalable translation toolkit. Focusing on the life sciences domain in French, our approach demonstrates that state-of-the-art performance on various downstream tasks can be achieved solely by leveraging synthetically translated data. We release the TransCorpus toolkit, the TransCorpus-bio-fr corpus (36.4GB of French life sciences text), TransBERT-bio-fr, its associated pre-trained language model and reproducible code for both pre-training and fine-tuning. Our results highlight the viability of synthetic translation in a high-resource translation direction for building high-quality NLP resources in low-resource language/domain pairs."
  }

🧬 Pretrained Models

Looking for pretrained models built with TransCorpus? Check out TransBERT-bio-fr on Hugging Face 🤗, a French biomedical language model trained entirely on synthetic translations generated by this toolkit. The TransCorpus-bio-fr corpus is also available on Hugging Face 🤗.
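
Assuming standard Hugging Face Hub tooling and that both repositories live under the jknafou namespace (an assumption; check the links above for the exact names), they can be fetched from the command line:

# hypothetical repo ids; verify against the Hugging Face pages linked above
huggingface-cli download jknafou/TransBERT-bio-fr
huggingface-cli download jknafou/TransCorpus-bio-fr --repo-type dataset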

New corpus upload

You can easily add your own corpus (along with a demo) to the repo by following the schema used in domains.json:

    "bio": {
        "database": {
            "file": "https://transcorpus.s3.text-analytics.ch/bibmed.tar.gz"
        },
        "corpus": {
            "file":
                "https://transcorpus.s3.text-analytics.ch/title_abstract_en.txt"
            ,
            "demo":
                "https://transcorpus.s3.text-analytics.ch/1k_sample.txt"
        },
        "id": {
            "file":
                "https://transcorpus.s3.text-analytics.ch/PMID.txt"
            ,
            "demo":
                "https://transcorpus.s3.text-analytics.ch/PMID_1k_sample.txt"
        },
        "language": "en"
    }

Each line of the corpus file is a separate document. For the moment, a life-science corpus is available, comprising about 28GB of raw text (22M abstracts from PubMed). The database it is built from can also be downloaded using transcorpus download-database bio.
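
As a quick sanity check on a new corpus, the corpus and ID files should stay line-aligned, since line N of the ID file identifies line N of the corpus (the file names below follow the domains.json entry above):

# inspect the first documents and their identifiers
head -n 2 title_abstract_en.txt PMID.txt
# both files must have the same number of lines
wc -l title_abstract_en.txt PMID.txt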

Deployment

Requirements:

  • Python 3.10+
  • rye (for dependency management; see the install command below)
  • CUDA-enabled GPUs (for multi-GPU translation)
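
If rye is not installed yet, it can be bootstrapped with its official installer before running the Quick Start steps (command taken from rye's documentation; verify against the current docs):

# install rye, the dependency manager used by this project
curl -sSf https://rye.astral.sh/get | bash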

Contributing

Pull requests and issues are welcome!

License

MIT License

Acknowledgements

  • Swiss AI Center
  • fairseq
  • PyTorch
  • rye

TransCorpus makes large-scale, robust translation easy and reproducible.