C3T

C3T (Cross-modal Capabilities Conservation Test) is a benchmark for assessing the performance of speech-aware language models. The benchmark uses textual tasks synthesized with a voice-cloning text-to-speech model to verify whether language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model for different categories of speakers and its robustness across text and speech modalities.

C3T is composed of tasks that have a single, ground-truth answer that can be checked by string comparison. The answer generated by the model is considered correct if it includes the target answer and does not include any of the other options (if applicable).
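As a rough illustration, this matching rule can be sketched in a few lines of Python (a minimal sketch, not the repository's actual scoring code; the function name and case handling are assumptions):

def is_correct(model_output, target, other_options=None):
    """Hypothetical matching rule: the output counts as correct if it
    contains the target answer and none of the other options."""
    if target not in model_output:
        return False
    return not any(option in model_output for option in (other_options or []))

# Example: only the target option "(B)" appears in the output.
print(is_correct("The answer is (B).", "(B)", ["(A)", "(C)"]))  # True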

Tasks

C3T includes 11 tasks from the Big-Bench dataset: Causal Judgement, Disambiguation QA, Formal Fallacies, Hyperbaton, Movie Recommendation, Navigate, Object Counting, Reasoning About Colored Objects, Snarks, Sports Understanding, and Web of Lies.

We chose tasks from the Open LLM Leaderboard v2 as a foundation for our benchmark. Several tasks were excluded because they turned out to be unsuitable for being read aloud.

Usage

First, clone this repository to your local machine and install the required packages:

git clone https://github.com/SamsungLabs/C3T.git
cd C3T/
uv sync

Baseline

Our baseline model is implemented as a simple pipeline consisting of an ASR model (whisper-large) followed by a textual LLM (Llama-3.1-70B-Instruct).

To generate outputs using the baseline model with default settings, simply execute:

uv run run_baseline.py

This will run the pipeline using the default models and paths defined in the script.
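For orientation, such a cascade can be sketched with Hugging Face pipelines. This is not the repository's run_baseline.py; the smaller models and the audio file path below are used purely for illustration:

# Minimal sketch of an ASR -> LLM cascade with Hugging Face pipelines.
# Not run_baseline.py; smaller models and a hypothetical audio path are
# used here to keep the example lightweight.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
llm = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")  # gated model; requires Hugging Face access

# 1. Transcribe the spoken task prompt.
transcript = asr("spoken_task.wav")["text"]

# 2. Feed the transcript to the textual LLM.
answer = llm(transcript, max_new_tokens=64)[0]["generated_text"]
print(answer)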

Optional Arguments

You can override any of the default parameters directly from the command line:

  • --asr_name: Name or path of the ASR model. Default: openai/whisper-large.
  • --llm_name: Name or path of the LLM. Default: meta-llama/Llama-3.1-70B-Instruct.
  • --output_dir: Directory where the output files will be saved. Default: ./.
  • --cache_dir: Optional directory for caching model files. Default: None.
  • --overwrite: Overwrite existing JSON outputs if they already exist. Default: False.

Example:

uv run run_baseline.py --asr_name openai/whisper-tiny \
  --llm_name meta-llama/Llama-3.2-1B \
  --output_dir results/ \
  --cache_dir ~/.cache/models \
  --overwrite

Outputs

The script produces two JSON files in the specified output directory: audio-outputs.json and text-outputs.json.

Each file is a dictionary where:

  • key -> sample_id
  • value -> LLM output for that sample

Example structure:

{
  "cj_0.g_0001-13fe": "It seems like you're referencing the famous phrase \"What is in a name?\" from the classic novel \"Alice's Adventures in Wonderland\" by Lewis Carroll."
}
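The outputs can be loaded for inspection with a few lines of Python (an illustrative snippet, assuming the default file name and output directory):

import json

# Load the per-sample outputs produced by run_baseline.py.
with open("audio-outputs.json", encoding="utf-8") as f:
    audio_outputs = json.load(f)  # dict: sample_id -> LLM output

print(len(audio_outputs), "samples")
print(next(iter(audio_outputs.items())))  # first (sample_id, output) pair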

Evaluation

To evaluate the generated outputs, run:

uv run run_eval.py -a path/to/audio-outputs.json -t path/to/text-outputs.json

The script compares the results for audio (audio-outputs.json) and text (text-outputs.json) modality with ground truth and calculates the relevant metrics. The evaluation results are printed to the console.

Arguments

  • -a, --audio_outputs (required): Path to the JSON file containing outputs for the audio modality (e.g., audio-outputs.json).
  • -t, --text_outputs (optional): Path to the JSON file containing outputs for the text modality (e.g., text-outputs.json). Default: None.

If only the audio modality is available, you can omit --text_outputs.

Evaluation Metrics

  • spoken_ema (float): Exact match accuracy for audio modality.
  • textual_ema (float, optional): Exact match accuracy for text modality. Calculated only if outputs for text modality are available.
  • spoken_ema/c (float): Exact match accuracy for audio modality for tasks for which there exists at least one sample that yielded the correct answer.
  • textual_ema/c (float): Exact match accuracy for text modality for tasks for which there exists at least one sample that yielded the correct answer. Calculated only if outputs for text modality are available.
  • fair (float): Overall fairness.
  • fair/c (float): Overall fairness for tasks for which there exists at least one sample that yielded the correct answer.
  • robust (float, optional): Overall robustness. Calculated only if outputs for text modality are available.
  • robust/c (float, optional): Overall robustness for tasks for which there exists at least one sample that yielded the correct answer. Calculated only if outputs for text modality are available.
  • fair^accent (float): Conditional fairness for accent.
  • fair^accent/c (float): Conditional fairness for accent for tasks for which there exists at least one sample that yielded the correct answer.
  • fair^age (float): Conditional fairness for age.
  • fair^age/c (float): Conditional fairness for age for tasks for which there exists at least one sample that yielded the correct answer.
  • fair^gender (float): Conditional fairness for gender.
  • fair^gender/c (float): Conditional fairness for gender for tasks for which there exists at least one sample that yielded the correct answer.
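The fairness and robustness metrics themselves are computed by run_eval.py (see the paper for their definitions). As a rough illustration of one ingredient behind the conditional fairness metrics, exact-match results can be aggregated per speaker category; the data layout below is hypothetical:

from collections import defaultdict

# Hypothetical per-sample records: (is_correct, speaker_gender).
# This only shows per-group exact-match accuracy, one ingredient behind
# metrics such as fair^gender; the actual formulas are defined in the paper.
records = [(True, "female"), (False, "male"), (True, "male"), (True, "female")]

hits, totals = defaultdict(int), defaultdict(int)
for correct, gender in records:
    totals[gender] += 1
    hits[gender] += int(correct)

per_group_ema = {group: hits[group] / totals[group] for group in totals}
print(per_group_ema)  # {'female': 1.0, 'male': 0.5}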

Citation

@misc{kubis2025preservationlanguageunderstandingcapabilities,
  title={Preservation of Language Understanding Capabilities in Speech-aware Large Language Models}, 
  author={Marek Kubis and Paweł Skórzewski and Iwona Christop and Mateusz Czyżnikiewicz and Jakub Kubiak and Łukasz Bondaruk and Marcin Lewandowski},
  year={2025},
  eprint={2509.12171},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.12171}, 
}
