The Hearing-to-Translate test suite provides a unified evaluation framework for assessing how effectively SpeechLLMs, Speech Foundation Models (SFMs), and cascaded ASR→LLM pipelines handle speech-to-text translation across diverse real-world conditions. Covering 21 systems, 13 language pairs, 9 speech phenomena, and 16 benchmarks, the suite measures performance on clean speech as well as challenging scenarios involving gender bias, accents, code-switching, disfluencies, noise, named entities, emotion, and long-form content.
- Dec. 28, 2025: Human Evaluation data released on 🤗HuggingFace
- Dec. 19, 2025: Preprint released on arXiv
```
.
├── analysis/             # Scripts and files used for the analysis and aggregation of metrics
├── evaluation/           # Evaluation scripts (XCOMET, MetricX, LID)
├── evaluation_human/     # Code and data of human evaluation
├── inference/            # Generation scripts for each model
├── manifests/            # Code for using and replicating the manifests
├── outputs/              # Outputs produced by the models
├── infer.py              # Inference script
├── run_text_models.sh    # Example script for running text-based models
├── infer-loop.sh         # Script for repeated / batch inference
├── requirements.txt      # Python dependencies
└── README.md
```
Clone the repository and install dependencies:
```bash
git clone https://github.com/sarapapi/hearing2translate.git
cd hearing2translate
pip install -r requirements.txt
```

Note: the required `transformers` version depends on the specific model being used. Please ensure that you install the version compatible with the model you intend to run, as reported in Table 6 of the paper.
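For example, if the model you plan to run needs a specific `transformers` release, reinstall it after the base setup. A minimal sketch, with a placeholder version number rather than a value taken from Table 6:

```bash
# Placeholder version: replace with the transformers release listed for your model in Table 6.
pip install "transformers==4.45.2"
```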
Download the desired benchmarks by following the instructions provided in each benchmark-specific README, then set `${H2T_DATADIR}` to the directory containing the corresponding audio files (see the setup sketch after the benchmark list below).
Supported benchmarks by category:
- Generic: fleurs, covost2, europarl_st, wmt
- Gender Bias: winoST
- Accents: commonAccent, mandi
- Code Switching: cs-dialogue, cs_fleurs
- Disfluencies: libristutter
- Noise: noisy_fleurs_ambient, noisy_fleurs_babble
- Emotion: emotiontalk, mexpresso
- Long-Form: acl6060-long, acl6060-short, mcif-long, mcif-short
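As a minimal setup sketch (the path is illustrative; the download itself is described in each benchmark's README):

```bash
# Illustrative path: point H2T_DATADIR at the directory holding the downloaded audio.
export H2T_DATADIR=/data/hearing2translate

# After following a benchmark's README (e.g., FLEURS), its audio files should be
# reachable under ${H2T_DATADIR}.
ls "${H2T_DATADIR}"
```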
Run inference with the following command:
```bash
python infer.py \
  --model ${MODEL_NAME} \
  --in-modality {speech/text} \
  --in-file ./manifests/${BENCHMARK_NAME}/${SRC_LANG}-${TGT_LANG}.jsonl \
  --out-file ${OUTPUT_PATH}
```

The full list of supported models can be obtained with `python infer.py -h`.
Supported benchmarks are listed above, while benchmark-specific language coverage is documented in the corresponding READMEs.
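As an illustrative end-to-end invocation (the model name is a placeholder to be replaced with an entry from `python infer.py -h`, en-de on FLEURS is assumed to be among the covered directions, and the output path merely mirrors the `outputs/${MODEL_NAME}` convention used elsewhere in this repository):

```bash
# Placeholder model name: pick a real one from `python infer.py -h`.
MODEL_NAME=my-speechllm
mkdir -p outputs/${MODEL_NAME}/fleurs

python infer.py \
  --model ${MODEL_NAME} \
  --in-modality speech \
  --in-file ./manifests/fleurs/en-de.jsonl \
  --out-file outputs/${MODEL_NAME}/fleurs/en-de.jsonl
```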
After generating model outputs, run the evaluation suite using the scripts in the evaluation/ directory.
For environment setup, model downloads, and benchmark-specific evaluation commands, refer to the dedicated Evaluation README.
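For reference, the XCOMET metric itself can also be run standalone via the `unbabel-comet` package; the sketch below only shows the underlying metric call with placeholder file names, and the scripts in `evaluation/` remain the intended entry point:

```bash
# Sketch only: file names are placeholders; evaluation/ provides the actual pipeline.
# -s: source segments, -t: system hypotheses, -r: reference translations (one per line).
pip install unbabel-comet
comet-score -s sources.txt -t hypotheses.txt -r references.txt --model Unbabel/XCOMET-XL
```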
If you want to add a model to the repository, please create a PR with:
- the inference code in `inference/{llm/sfm/speechllm}`
- the outputs on all applicable benchmarks of the test suite in `outputs/${MODEL_NAME}`

Please refer to the PR template for more information.
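As a purely hypothetical example of what a contribution for a new SpeechLLM could look like (file naming inside `outputs/` is assumed here; the PR template is authoritative):

```
inference/speechllm/my_speechllm.py       # inference code for the new model
outputs/my-speechllm/fleurs/en-de.jsonl   # outputs for each applicable benchmark and language pair
outputs/my-speechllm/covost2/en-de.jsonl
```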
The code contained in this repository is released under the Apache 2.0 License.
Benchmarks are released under their own licenses; see the benchmark-specific READMEs in `manifests/` for more information.
The human evaluation data in `evaluation_human/hearing2translate-v1` is released under the CC BY 4.0 License.
```bibtex
@misc{papi2025hearingtranslateeffectivenessspeech,
  title={Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs},
  author={Sara Papi and Javier Garcia Gilabert and Zachary Hopton and Vilém Zouhar and Carlos Escolano and Gerard I. Gállego and Jorge Iranzo-Sánchez and Ahrii Kim and Dominik Macháček and Patricia Schmidtova and Maike Züfle},
  year={2025},
  eprint={2512.16378},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.16378},
}
```