TransCorpus

TransCorpus is a scalable, production-ready API and CLI toolkit for large-scale parallel translation, preprocessing, and corpus management. It supports multi-GPU translation, robust checkpointing, and safe concurrent downloads, making it ideal for research and industry-scale machine translation workflows.

Features

  • 🚀 Multi-GPU and multi-process translation
  • 📦 Corpus downloading and preprocessing
  • 🔒 Safe, resumable, and concurrent file downloads
  • 🧩 Split and checkpoint management for large corpora
  • 🛠️ Easy deployment and extensibility
  • 🖥️ Cross-platform: Linux, macOS, Windows

Quick Start

  1. Clone and Install
git clone https://github.com/jknafou/TransCorpus.git
cd TransCorpus
UV_INDEX_STRATEGY=unsafe-best-match rye sync
source .venv/bin/activate
  2. Download a Corpus
transcorpus download-corpus [corpus_name]
  3. Preprocess the corpus by splits
transcorpus preprocess [corpus_name] [language] --num-split 100
  4. Translate the corpus by split (preprocessing it first if not already done)
transcorpus translate [corpus_name] [language] --num-split 100
  5. Preview a corpus in two languages side by side ([language2] is optional):
transcorpus preview [corpus_name] [language1] [language2]

[Screenshot: preview of a corpus in two languages side by side]

Each command can be run in demo mode by passing the -d flag, as in the walkthrough below.
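
For instance, a quick demo-mode session might look like the following sketch (the corpus name bio and target language fr are illustrative, and the -d flag placement is assumed to match the commands above):

# run the Quick Start steps on the small demo sample (-d)
transcorpus download-corpus bio -d
transcorpus preprocess bio fr --num-split 10 -d
transcorpus translate bio fr --num-split 10 -d
transcorpus preview bio en fr -d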

Preprocess and Translate (Multi-GPU Example)

The following example translates the bio corpus (PubMed), about 30GB of text, preprocessing it with 4 parallel workers while translating each available split on two GPUs of different sizes. It can easily be adapted to your needs. When deployed on an HPC cluster, for example with SLURM, it automatically resumes from where the previous run left off. With shared storage, multiple GPUs on different nodes can work simultaneously.

# Translate the bio corpus to German (de): 4 preprocessing workers, 20 splits (here in demo mode)
./example/multi_GPU.sh bio de 4 20
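
On a SLURM cluster, the same script can be wrapped in a submission file. A minimal sketch, assuming two GPUs per node (job name, resource values, and time limit are illustrative; adapt them to your cluster):

#!/bin/bash
#SBATCH --job-name=transcorpus-bio-de
#SBATCH --gres=gpu:2              # two GPUs, as in the example above
#SBATCH --cpus-per-task=4         # matches the 4 preprocessing workers
#SBATCH --time=24:00:00

# thanks to checkpointing, resubmitting this job resumes from the last completed split
./example/multi_GPU.sh bio de 4 20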

Research-Proven Performance

TransCorpus enables the training of state-of-the-art language models through synthetic translation. For example, TransBERT reached state-of-the-art performance on French biomedical downstream tasks by pretraining exclusively on text translated with this toolkit. The paper detailing these results was published at EMNLP 2025. 📝 Download it here. If you use this toolkit, please cite:

  @inproceedings{knafou-etal-2025-transbert,
      title = "{T}rans{BERT}: A Framework for Synthetic Translation in Domain-Specific Language Modeling",
      author = {Knafou, Julien  and
        Mottin, Luc  and
        Mottaz, Ana{\"i}s  and
        Flament, Alexandre  and
        Ruch, Patrick},
      editor = "Christodoulopoulos, Christos  and
        Chakraborty, Tanmoy  and
        Rose, Carolyn  and
        Peng, Violet",
      booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
      month = nov,
      year = "2025",
      address = "Suzhou, China",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2025.findings-emnlp.1053/",
      doi = "10.18653/v1/2025.findings-emnlp.1053",
      pages = "19338--19354",
      ISBN = "979-8-89176-335-7",
      abstract = "The scarcity of non-English language data in specialized domains significantly limits the development of effective Natural Language Processing (NLP) tools. We present TransBERT, a novel framework for pre-training language models using exclusively synthetically translated text, and introduce TransCorpus, a scalable translation toolkit. Focusing on the life sciences domain in French, our approach demonstrates that state-of-the-art performance on various downstream tasks can be achieved solely by leveraging synthetically translated data. We release the TransCorpus toolkit, the TransCorpus-bio-fr corpus (36.4GB of French life sciences text), TransBERT-bio-fr, its associated pre-trained language model and reproducible code for both pre-training and fine-tuning. Our results highlight the viability of synthetic translation in a high-resource translation direction for building high-quality NLP resources in low-resource language/domain pairs."
  }

🧬 Pretrained Models

Looking for pretrained models built with TransCorpus? Check out TransBERT-bio-fr on Hugging Face 🤗, a French biomedical language model trained entirely on synthetic translations generated by this toolkit. The TransCorpus-bio-fr corpus is also available on Hugging Face 🤗.
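
Assuming standard Hugging Face Hub tooling and that both repositories live under the jknafou namespace (an assumption; check the links above for the exact names), they can be fetched from the command line:

# hypothetical repo ids; verify against the Hugging Face pages linked above
huggingface-cli download jknafou/TransBERT-bio-fr
huggingface-cli download jknafou/TransCorpus-bio-fr --repo-type dataset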

New corpus upload

You can easily add your own corpus (along with a demo) to the repo by following the schema used in domains.json:

    "bio": {
        "database": {
            "file": "https://transcorpus.s3.text-analytics.ch/bibmed.tar.gz"
        },
        "corpus": {
            "file":
                "https://transcorpus.s3.text-analytics.ch/title_abstract_en.txt"
            ,
            "demo":
                "https://transcorpus.s3.text-analytics.ch/1k_sample.txt"
        },
        "id": {
            "file":
                "https://transcorpus.s3.text-analytics.ch/PMID.txt"
            ,
            "demo":
                "https://transcorpus.s3.text-analytics.ch/PMID_1k_sample.txt"
        },
        "language": "en"
    }

Each line of the corpus file is a separate document. For the moment, a life-science corpus is available, comprising about 28GB of raw text (22M abstracts from PubMed). The database it is built from can also be downloaded using transcorpus download-database bio.
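
As a quick sanity check on a new corpus, the corpus and ID files should stay line-aligned, since line N of the ID file identifies line N of the corpus (the file names below follow the domains.json entry above):

# inspect the first documents and their identifiers
head -n 2 title_abstract_en.txt PMID.txt
# both files must have the same number of lines
wc -l title_abstract_en.txt PMID.txt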

Deployment

Requirements:

  • Python 3.10+
  • rye (for dependency management; see the install command below)
  • CUDA-enabled GPUs (for multi-GPU translation)
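
If rye is not installed yet, it can be bootstrapped with its official installer before running the Quick Start steps (command taken from rye's documentation; verify against the current docs):

# install rye, the dependency manager used by this project
curl -sSf https://rye.astral.sh/get | bash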

Contributing

Pull requests and issues are welcome!

License

MIT License

Acknowledgements

  • Swiss AI Center
  • fairseq
  • PyTorch
  • rye

TransCorpus makes large-scale, robust translation easy and reproducible.