The Hearing-to-Translate test suite provides a unified evaluation framework for assessing how effectively SpeechLLMs, Speech Foundation Models (SFMs), and cascaded ASR→LLM pipelines handle speech-to-text translation across diverse real-world conditions. Covering 21 systems, 13 language pairs, 9 speech phenomena, and 16 benchmarks, the suite measures performance on clean speech as well as challenging scenarios involving gender bias, accents, code-switching, disfluencies, noise, named entities, emotion, and long-form content.
- Dec. 28, 2025: Human Evaluation data released on 🤗HuggingFace
- Dec. 19, 2025: Preprint released on arXiv
```
.
├── analysis/             # Scripts and files used for the analysis and aggregation of metrics
├── evaluation/           # Evaluation scripts (XCOMET, MetricX, LID)
├── evaluation_human/     # Code and data of human evaluation
├── inference/            # Generation scripts for each model
├── manifests/            # Code for using and replicating the manifests
├── outputs/              # Outputs produced by the models
├── infer.py              # Inference script
├── run_text_models.sh    # Example script for running text-based models
├── infer-loop.sh         # Script for repeated / batch inference
├── requirements.txt      # Python dependencies
└── README.md
```
Clone the repository and install dependencies:
```bash
git clone https://github.com/sarapapi/hearing2translate.git
cd hearing2translate
pip install -r requirements.txt
```

Note: the required `transformers` version depends on the specific model being used. Please ensure that you install the version compatible with the model you intend to run, as reported in Table 6 of the paper.
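For example, if the model you plan to run needs a specific `transformers` release, reinstall it after the base setup. A minimal sketch, with a placeholder version number rather than a value taken from Table 6:

```bash
# Placeholder version: replace with the transformers release listed for your model in Table 6.
pip install "transformers==4.45.2"
```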
Download the desired benchmarks by following the instructions provided in each benchmark-specific README, then set `${H2T_DATADIR}` to the directory containing the corresponding audio files (see the setup sketch after the benchmark list below).
Supported benchmarks by category:
- Generic: fleurs, covost2, europarl_st, wmt
- Gender Bias: winoST
- Accents: commonAccent, mandi
- Code Switching: cs-dialogue, cs_fleurs
- Disfluencies: libristutter
- Noise: noisy_fleurs_ambient, noisy_fleurs_babble
- Emotion: emotiontalk, mexpresso
- Long-Form: acl6060-long, acl6060-short, mcif-long, mcif-short
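As a minimal setup sketch (the path is illustrative; the download itself is described in each benchmark's README):

```bash
# Illustrative path: point H2T_DATADIR at the directory holding the downloaded audio.
export H2T_DATADIR=/data/hearing2translate

# After following a benchmark's README (e.g., FLEURS), its audio files should be
# reachable under ${H2T_DATADIR}.
ls "${H2T_DATADIR}"
```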
Run inference with the following command:
```bash
python infer.py \
  --model ${MODEL_NAME} \
  --in-modality {speech/text} \
  --in-file ./manifests/${BENCHMARK_NAME}/${SRC_LANG}-${TGT_LANG}.jsonl \
  --out-file ${OUTPUT_PATH}
```

The full list of supported models can be obtained with `python infer.py -h`.
Supported benchmarks are listed above, while benchmark-specific language coverage is documented in the corresponding READMEs.
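As an illustrative end-to-end invocation (the model name is a placeholder to be replaced with an entry from `python infer.py -h`, en-de on FLEURS is assumed to be among the covered directions, and the output path merely mirrors the `outputs/${MODEL_NAME}` convention used elsewhere in this repository):

```bash
# Placeholder model name: pick a real one from `python infer.py -h`.
MODEL_NAME=my-speechllm
mkdir -p outputs/${MODEL_NAME}/fleurs

python infer.py \
  --model ${MODEL_NAME} \
  --in-modality speech \
  --in-file ./manifests/fleurs/en-de.jsonl \
  --out-file outputs/${MODEL_NAME}/fleurs/en-de.jsonl
```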
After generating model outputs, run the evaluation suite using the scripts in the evaluation/ directory.
For environment setup, model downloads, and benchmark-specific evaluation commands, refer to the dedicated Evaluation README.
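For reference, the XCOMET metric itself can also be run standalone via the `unbabel-comet` package; the sketch below only shows the underlying metric call with placeholder file names, and the scripts in `evaluation/` remain the intended entry point:

```bash
# Sketch only: file names are placeholders; evaluation/ provides the actual pipeline.
# -s: source segments, -t: system hypotheses, -r: reference translations (one per line).
pip install unbabel-comet
comet-score -s sources.txt -t hypotheses.txt -r references.txt --model Unbabel/XCOMET-XL
```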
If you want to add a model to the repository, please create a PR with:
- the inference code in `inference/{llm/sfm/speechllm}`
- the outputs on all applicable benchmarks of the test suite in `outputs/${MODEL_NAME}`

Please refer to the PR template for more information.
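As a purely hypothetical example of what a contribution for a new SpeechLLM could look like (file naming inside `outputs/` is assumed here; the PR template is authoritative):

```
inference/speechllm/my_speechllm.py       # inference code for the new model
outputs/my-speechllm/fleurs/en-de.jsonl   # outputs for each applicable benchmark and language pair
outputs/my-speechllm/covost2/en-de.jsonl
```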
The code contained in this repository is released under the Apache 2.0 License.
Benchmarks are released under their own licenses; see the benchmark-specific READMEs in `manifests/` for more information.
The human evaluation data in `evaluation_human/hearing2translate-v1` is released under the CC BY 4.0 License.
```bibtex
@misc{papi2025hearingtranslateeffectivenessspeech,
  title={Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs},
  author={Sara Papi and Javier Garcia Gilabert and Zachary Hopton and Vilém Zouhar and Carlos Escolano and Gerard I. Gállego and Jorge Iranzo-Sánchez and Ahrii Kim and Dominik Macháček and Patricia Schmidtova and Maike Züfle},
  year={2025},
  eprint={2512.16378},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.16378},
}
```