EuroBERT is a multilingual encoder model designed for European languages, trained using the Optimus training library. Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including CPUs, AMD GPUs, and NVIDIA GPUs.
- Hardware Agnostic: Seamlessly train on CPU, AMD, or NVIDIA hardware.
- Resumable Training: Continue training regardless of hardware or environment changes.
- Scalable Distributed Training: Supports Fully Sharded Data Parallel (FSDP), Distributed Data Parallel (DDP), and other parallelism strategies.
- Comprehensive Data Processing: Includes utilities for tokenization, packing, subsampling, and dataset inspection.
- Highly Customizable: Fine-tune model architecture, training, and data processing with extensive configuration options.
- Performance Optimizations: Implements advanced techniques like mixed precision training, fused operations, and optimizations such as Liger Kernel and Flash Attention.
You can install EuroBERT using one of the following methods:
Run the following command to install the package directly:
pip install git+https://github.com/Nicolas-BZRD/EuroBERT.git
Or, for development purposes, clone the repository and install it in editable mode:
git clone https://github.com/Nicolas-BZRD/EuroBERT.git
cd EuroBERT
pip install -e .
Before diving further into the Optimus library, we encourage you to follow the notebook ‘Continuous Pre-Training of EuroBERT-210M with the Optimus Library’ (compatible with Google Colab), which covers data processing and training setup.
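After installation, you can check that everything works by loading a pre-trained EuroBERT checkpoint with the Hugging Face transformers library. The snippet below is a minimal, illustrative sketch rather than part of the Optimus API: it assumes transformers and torch are installed, that the EuroBERT/EuroBERT-210m tokenizer exposes a mask token, and that trust_remote_code=True is required to load the custom architecture.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Checkpoint id taken from the training command shown later in this README.
model_id = "EuroBERT/EuroBERT-210m"

# trust_remote_code=True is assumed here because EuroBERT ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# Simple fill-mask sanity check using the tokenizer's own mask token.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring prediction at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```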
Optimus provides an efficient pre-processing pipeline with tokenization, packing, subsampling, and inspection utilities. Full details are available in the data processing documentation. The tokenize_dataset.py script tokenizes a dataset and saves it in an optimized format. The tokenized data can be sharded and processed in parallel using multiple workers.
python -m optimus.dataprocess.tokenize_dataset --input_dir <path> --tokenizer <path_or_name> --dataset <name> [--output_dir <path>] [--num_workers <num>]
- input_dir (str): Path to the input dataset directory.
- tokenizer (str): Path or name of the tokenizer.
- dataset (str): Name of the dataset to process.
- output_dir (str, optional): Directory to save the tokenized dataset.
- num_workers (int or max): Number of worker processes (use max for all available CPUs).
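As an illustration, the call below tokenizes a corpus with the EuroBERT-210m tokenizer, writes the result to a separate directory, and uses all available CPU cores. The paths and dataset name are placeholders for your own data, not files shipped with the repository:
python -m optimus.dataprocess.tokenize_dataset --input_dir data/raw --tokenizer EuroBERT/EuroBERT-210m --dataset my_corpus --output_dir data/tokenized --num_workers max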
Optimus supports a wide range of configurations for different training scenarios. Detailed configuration options are documented in the training guide.
python -m optimus.train --huggingface_id EuroBERT/EuroBERT-210m --data_mix_path <path> --batch_size <int> --mlm_probability <float> --mask_probability <float>
- model_name (str): Model type (bert, eurobert, biqwen, or bigemma). Available models are listed here.
- model_size (str): Model size (e.g., 210m, 310m, 2b).
- data_mix_path (str): Path to the data mix folder containing train.json and optionally eval.json. These JSON files define parameter configurations for dataset creation, offering configuration options similar to MosaicML's.
- batch_size (int): Number of samples per batch.
- mlm_probability (float): Probability of applying masked language modeling.
- mask_probability (float): Probability of replacing a masked token.
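For example, a masked-language-modeling run on EuroBERT-210m might be launched as follows; the data mix path and the probability values are illustrative placeholders, not recommended hyperparameters:
python -m optimus.train --huggingface_id EuroBERT/EuroBERT-210m --data_mix_path data_mix/example --batch_size 32 --mlm_probability 0.5 --mask_probability 0.8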
If you're interested in evaluating pre-trained encoder models, we recommend using the EncodEval library. Developed alongside the Optimus library, it provides a fair and consistent framework for evaluating and comparing encoder models.
If you use EuroBERT in your research, please cite our paper:
@misc{boizard2025eurobertscalingmultilingualencoders,
title={EuroBERT: Scaling Multilingual Encoders for European Languages},
author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
year={2025},
eprint={2503.05500},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.05500},
}