C3T (Cross-modal Capabilities Conservation Test) is a benchmark for assessing the performance of speech-aware language models. It uses textual tasks synthesized into speech with a voice-cloning text-to-speech model to verify whether language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model across different categories of speakers and its robustness across the text and speech modalities.
C3T is composed of tasks that have a single, ground-truth correct answer that can be determined by string comparison. The answer generated by the model is considered correct if it includes the target answer and does not include any of the other options (if applicable).
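For illustration, this correctness check can be written in a few lines of Python. It is a minimal sketch of the string-comparison rule described above, not the evaluation script's actual implementation; the function name and arguments are ours.

```python
from collections.abc import Sequence

def is_correct(model_answer: str, target: str, other_options: Sequence[str] = ()) -> bool:
    """Correct iff the answer contains the target and none of the other options
    (case-insensitive substring matching)."""
    answer = model_answer.lower()
    if target.lower() not in answer:
        return False
    return not any(option.lower() in answer for option in other_options)

# e.g. a yes/no sample with target "no"
print(is_correct("The statement is implausible, so the answer is no.", "no", ["yes"]))  # True
```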
C3T includes 11 tasks from the Big-Bench dataset: Causal Judgement, Disambiguation QA, Formal Fallacies, Hyperbaton, Movie Recommendation, Navigate, Object Counting, Reasoning About Colored Objects, Snarks, Sports Understanding, and Web of Lies.
We chose tasks from the Open LLM Leaderboard v2 as the foundation for our benchmark. Several tasks were excluded after we evaluated their suitability for being read aloud.
First, clone this repository to your local machine and install the required packages:
```bash
git clone https://github.com/SamsungLabs/C3T.git
cd C3T/
uv sync
```

Our baseline model is implemented as a simple pipeline consisting of an ASR model (`whisper-large`) followed by a textual LLM (`Llama-3.1-70B-Instruct`).
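Conceptually, the baseline amounts to chaining two Hugging Face pipelines: transcribe the spoken task with Whisper, then prompt a text LLM with the transcript. The sketch below illustrates this with smaller stand-in models and a placeholder prompt; it is not the actual `run_baseline.py` implementation.

```python
from transformers import pipeline

# ASR stage: transcribe the spoken task (whisper-tiny stands in for whisper-large)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
transcript = asr("path/to/sample.wav")["text"]

# LLM stage: answer the transcribed task (a small instruct model stands in for Llama-3.1-70B-Instruct)
llm = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
prompt = f"Answer the following question.\n\n{transcript}"
answer = llm(prompt, max_new_tokens=128)[0]["generated_text"]
print(answer)
```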
To generate outputs using the baseline model with default settings, simply execute:
```bash
uv run run_baseline.py
```

This will run the pipeline using the default models and paths defined in the script.
You can override any of the default parameters directly from the command line:
- `--asr_name`: Name or path of the ASR model. Default: `openai/whisper-large`.
- `--llm_name`: Name or path of the LLM. Default: `meta-llama/Llama-3.1-70B-Instruct`.
- `--output_dir`: Directory where the output files will be saved. Default: `./`.
- `--cache_dir`: Optional directory for caching model files. Default: `None`.
- `--overwrite`: Overwrite existing JSON outputs if they already exist. Default: `False`.
Example:
```bash
uv run run_baseline.py --asr_name openai/whisper-tiny \
    --llm_name meta-llama/Llama-3.2-1B \
    --output_dir results/ \
    --cache_dir ~/.cache/models \
    --overwrite
```

The script produces two JSON files in the specified output directory: `audio-outputs.json` and `text-outputs.json`.
Each file is a dictionary where:
- key -> `sample_id`
- value -> LLM output for that sample
Example structure:
```json
{
"cj_0.g_0001-13fe": "It seems like you're referencing the famous phrase \"What is in a name?\" from the classic novel \"Alice's Adventures in Wonderland\" by Lewis Carroll."
}
```
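As a quick sanity check, the two files can be loaded and compared before evaluation. This snippet is illustrative only and assumes both files sit in the current directory.

```python
import json

with open("audio-outputs.json") as f_audio, open("text-outputs.json") as f_text:
    audio_outputs = json.load(f_audio)
    text_outputs = json.load(f_text)

# both files are keyed by sample_id, so they should cover the same samples
assert audio_outputs.keys() == text_outputs.keys()
print(f"{len(audio_outputs)} samples per modality")
```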
To evaluate the generated outputs, run:

```bash
uv run run_eval.py -a path/to/audio-outputs.json -t path/to/text-outputs.json
```

The script compares the outputs for the audio modality (`audio-outputs.json`) and the text modality (`text-outputs.json`) with the ground truth and calculates the relevant metrics. The evaluation results are printed to the console.
- `-a`, `--audio_outputs` (required): Path to the JSON file containing outputs for the audio modality (e.g., `audio-outputs.json`).
- `-t`, `--text_modality` (optional): Path to the JSON file containing outputs for the text modality (e.g., `text-outputs.json`). Default: `None`.
If only the audio modality is available, you can omit `-t`/`--text_modality`.
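For example, an audio-only evaluation could look like this (the path is a placeholder):

```bash
uv run run_eval.py -a results/audio-outputs.json
```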
- `spoken_ema` (float): Exact match accuracy for the audio modality.
- `textual_ema` (float, optional): Exact match accuracy for the text modality. Calculated only if outputs for the text modality are available.
- `spoken_ema/c` (float): Exact match accuracy for the audio modality, computed over tasks for which at least one sample yielded the correct answer.
- `textual_ema/c` (float): Exact match accuracy for the text modality, computed over tasks for which at least one sample yielded the correct answer. Calculated only if outputs for the text modality are available.
- `fair` (float): Overall fairness.
- `fair/c` (float): Overall fairness, computed over tasks for which at least one sample yielded the correct answer.
- `robust` (float, optional): Overall robustness. Calculated only if outputs for the text modality are available.
- `robust/c` (float, optional): Overall robustness, computed over tasks for which at least one sample yielded the correct answer. Calculated only if outputs for the text modality are available.
- `fair^accent` (float): Conditional fairness for accent.
- `fair^accent/c` (float): Conditional fairness for accent, computed over tasks for which at least one sample yielded the correct answer.
- `fair^age` (float): Conditional fairness for age.
- `fair^age/c` (float): Conditional fairness for age, computed over tasks for which at least one sample yielded the correct answer.
- `fair^gender` (float): Conditional fairness for gender.
- `fair^gender/c` (float): Conditional fairness for gender, computed over tasks for which at least one sample yielded the correct answer.
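To make the `/c` variants concrete, the sketch below derives the filtered accuracy from per-sample correctness flags (obtained with the rule described earlier). The dummy sample ids and the assumption that the prefix of a sample id identifies its task are purely illustrative and may not match the actual data layout.

```python
# per-sample correctness for the audio modality (sample_id -> bool), dummy values for illustration
correct = {
    "cj_0.g_0001-13fe": False,   # sample answered incorrectly
    "cj_0.g_0002-0000": True,    # hypothetical correct sample from the same task
    "oc_0.g_0001-0000": True,    # hypothetical correct sample from another task
    "wl_0.g_0001-0000": False,   # hypothetical sample from a task with no correct answers
}

def task_of(sample_id: str) -> str:
    # assumption: the prefix before the first "_" identifies the task
    return sample_id.split("_", 1)[0]

tasks_with_a_hit = {task_of(sid) for sid, ok in correct.items() if ok}
subset = {sid: ok for sid, ok in correct.items() if task_of(sid) in tasks_with_a_hit}

spoken_ema = sum(correct.values()) / len(correct)   # accuracy over all samples
spoken_ema_c = sum(subset.values()) / len(subset)   # accuracy over tasks with at least one correct sample
print(spoken_ema, spoken_ema_c)
```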
```bibtex
@misc{kubis2025preservationlanguageunderstandingcapabilities,
title={Preservation of Language Understanding Capabilities in Speech-aware Large Language Models},
author={Marek Kubis and Paweł Skórzewski and Iwona Christop and Mateusz Czyżnikiewicz and Jakub Kubiak and Łukasz Bondaruk and Marcin Lewandowski},
year={2025},
eprint={2509.12171},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.12171},
}
```