
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

📖 arXiv | 🛠️ GitHub Code | 🔊 MECAT-Caption Dataset (HuggingFace) | 🔊 MECAT-QA Dataset (HuggingFace)

MECAT Logo

Table of Contents

  1. Introduction
  2. Features
  3. Data Distribution
  4. Tasks
  5. Example Data
  6. Evaluation Metrics
  7. Usage
  8. Results
  9. Acknowledgement
  10. Contributing
  11. Citation
  12. License

1. Introduction

MECAT is a comprehensive benchmark constructed on large-scale data to evaluate machine understanding of audio content through two core tasks:

  • Audio Captioning: Generating textual descriptions for given audio
  • Audio Question Answering: Answering questions about given audio


2. Features

  • Data Source: Diverse-scenario coverage via a subset of the ACAV100M dataset
  • Processing Pipeline:
    • MetaInfo: Metadata extraction from the source videos (titles/descriptions)
    • Content-Specific: Feature extraction with 10-20 dedicated models per content type (speech/music/general audio)
    • Content-Unrelated: Non-content audio analysis, including quality metrics, loudness measurements, and reverberation assessment
  • Understanding & Generation: LLM-powered comprehension and generation with Chain-of-Thought reasoning
  • Quality Control: Multi-stage verification framework
  • Evaluation System: Multi-perspective assessment with progressive difficulty levels

3. Data Distribution

Data Code  Description                               Caption Pairs (Train / Test)  QA Pairs (Train / Test)
000        silence                                   173 / 179                     865 / 895
00A        general sound excluding speech and music  837 / 848                     4185 / 4240
0M0        music                                     2593 / 2593                   12965 / 12965
0MA        music and general sound                   206 / 199                     1030 / 995
S00        speech                                    7839 / 7839                   39195 / 39195
S0A        speech and general sound                  2424 / 2439                   12120 / 12195
SM0        speech and music                          5312 / 5312                   26560 / 26560
SMA        speech, music and general sound           668 / 643                     3340 / 3215

4. Tasks

4.1 Audio-Captioning

Type               Subtask       Level            Description                                            Evaluated Data       Samples
Systematic         Short         🔵 Specialized   Simplified caption of the whole audio within 15 words  all domains          20052
                   Long          🔵 Specialized   Caption of the whole audio in 1-2 sentences            all domains          20052
Content-Specific   Speech Clean  🟢 Basic         Caption of clean speech                                S00                  7839
                   Speech Mixed  🔴 Complex       Caption of speech with music/sound interference        0MA, S0A, SM0, SMA   8593
                   Music Clean   🟢 Basic         Caption of clean music                                 0M0                  2593
                   Music Mixed   🔴 Complex       Caption of music with speech/sound interference        0MA, S0A, SM0, SMA   8593
                   Sound Clean   🟢 Basic         Caption of general sound excluding speech and music    00A                  848
                   Sound Mixed   🔴 Complex       Caption of sound with speech/music interference        0MA, S0A, SM0, SMA   8593
Content-Unrelated  Environment   🔵 Specialized   Caption of acoustic characteristics and environment    all domains          20052

Here "all domains" denotes the full set of data codes: 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA.

4.2 Audio-Question-Answering

Description

Type        Subtask                Levels      Description                            Evaluated Data  Samples
Perception  Direct_Perception      🟢🟡        Perceive sound types                   all domains     20624
Analysis    Sound_Characteristics  🟢🟡🟠🔴    Analyze sound characteristics          all domains     19767
            Quality_Assessment     🟢🟡🟠🔴    Analyze sound quality                  all domains     18942
Reasoning   Environment_Reasoning  🟢🟡🟠🔴    Reason about the acoustic environment  all domains     18300
            Inference_Judgment     🟢🟡🟠🔴    Cross-modal reasoning                  all domains     19756
            Application_Context    🟢🟡🟠🔴    Semantic understanding                 all domains     2871

Difficulty Distribution

Difficulty    Symbol  Ratio (%)  Description
Basic         🟢      25         Direct descriptive questions
Intermediate  🟡      35         Analytical questions
Advanced      🟠      25         Inferential questions
Complex       🔴      15         Comprehensive judgment questions

5. Example Data

5.1 Audio Captioning Example (SMA - Speech, Music and General Sound)

The following example shows the comprehensive caption annotations for a single audio sample from the SMA domain. This is the first data sample from the HuggingFace dataset:

Data Source: MECAT-Caption/SMA/Test/test_0000-0000000.tar.gz

{
  "RjRMEFDocEY_78_681_88_681": {
    "short": [
      "Energetic electronic music accompanies animated speech with intermittent dog barks and background interference.",
      "Upbeat instrumental track plays under expressive dialogue and occasional canine vocalizations amid noise.",
      "Dynamic speech with emotional shifts over electronic music featuring sporadic barking and audio artifacts."
    ],
    "long": [
      "A female voice delivers emotionally varied speech ranging from laughter to frustration, accompanied by rhythmic electronic instrumentation with guitar elements. Occasional dog barks emerge through persistent background static and audio distortion.",
      "Expressive vocal performance transitions between cheerfulness and intensity, layered over a driving electronic beat with occasional animal sounds and recording imperfections.",
      "Vivid speech with fluctuating emotional tones interacts with synth-driven musical backing, punctuated by canine noises and low-fidelity artifacts."
    ],
    "speech": [
      "Animated female speech displaying rapid emotional shifts from laughter to frustration.",
      "Expressive vocal delivery alternating between cheerful and agitated tones.",
      "Dynamic spoken performance transitioning between amusement and intensity."
    ],
    "music": [
      "Moderate-tempo electronic composition featuring prominent guitar and rhythmic percussion elements.",
      "Driving synth-based arrangement with guitar accents and steady beat.",
      "Energetic instrumental track combining electronic textures with rhythmic guitar work."
    ],
    "sound": [
      "Intermittent dog vocalizations amidst persistent electrical interference.",
      "Occasional canine barks layered over background static.",
      "Sporadic animal noises punctuating continuous audio distortion."
    ],
    "environment": [
      "Low-quality recording with noticeable background interference and distortion.",
      "Audio artifacts and electrical noise throughout the recording.",
      "Persistent static and signal degradation affecting audio clarity."
    ],
    "domain": "SMA"
  }
}

5.2 Audio Question Answering Example (SMA - Speech, Music and General Sound)

The following example shows a QA pair from the SMA domain. This is the first data sample from the HuggingFace dataset:

Data Source: MECAT-QA/SMA/Test/test_0000-0000000.tar.gz

{
  "RjRMEFDocEY_78_681_88_681_ffd8b511": {
    "category": "direct_perception",
    "difficulty": "basic",
    "question": "What type of vocal sounds are present?",
    "answer": "A woman speaking expressively and dog barks.",
    "domain": "SMA"
  }
}

6. Evaluation Metrics

MECAT supports multiple evaluation metrics for comprehensive assessment:

  • Traditional Metrics: BLEU
  • FENSE: Fluency Error-based Sentence-bert Evaluation for audio captioning
  • DATE: Discriminability-based Audio Task Evaluation; it considers both the quality of the generated text and the model's discriminative capability, which makes it particularly effective for audio captioning and question-answering tasks (see the usage sketch below)
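
The metric names above plug straight into the evaluate function used throughout Section 7. As a quick preview, the following minimal sketch scores a toy prediction with the direct-data interface described in Section 7.4 (the strings are illustrative; BLEU is shown, and 'fense' or 'date' can be requested through the same metrics argument):

from mecat import evaluate

# One toy prediction with two reference captions (illustrative strings only)
predicted_data = [['A woman speaks over electronic music while a dog barks.']]
reference_data = [[
    'Energetic electronic music accompanies animated speech with intermittent dog barks.',
    'Dynamic speech over electronic music featuring sporadic barking.',
]]

results = evaluate(
    predicted_data=predicted_data,
    reference_data=reference_data,
    task='caption',
    metrics='bleu',  # a list such as ['fense', 'date'] is also accepted, as in Section 7.2
)
print(results)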

7. Usage

7.1 Installation

python3 -m pip install mecat
# Or install the development version from GitHub:
# pip install git+https://github.com/xiaomi-research/mecat.git

7.2 Quick Start with Qwen2-Audio Example

This section provides a complete walkthrough of evaluating audio models on MECAT, using Qwen2-Audio as a practical example. The same approach can be adapted to other audio-understanding models.

7.2.1 Preliminary Steps: Environment Setup and Model Loading

import torch
from tqdm import tqdm
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load Qwen2-Audio model and processor
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B", 
    trust_remote_code=True,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-Audio-7B", 
    trust_remote_code=True
)
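
The generation loops in the following subsections repeat the same generate-and-decode boilerplate for every prompt. If you prefer, that logic can be factored into a small helper such as the hypothetical generate_response below; this is only a convenience sketch that reuses the model, processor, and device objects defined above, and the inline loops later in this README are kept verbatim for clarity:

import torch

def generate_response(prompt: str, audio, sampling_rate: int) -> str:
    """Run one prompt + audio pair through the model and return the decoded text."""
    inputs = processor(
        text=prompt,
        audio=audio,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        # Greedy decoding; sampling parameters are irrelevant when do_sample=False
        generated_ids = model.generate(**inputs, max_length=512, do_sample=False)
    # Keep only the newly generated tokens, then decode them
    generated_ids = generated_ids[:, inputs.input_ids.size(1):]
    return processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0].strip()

Calling generate_response(prompt, item['flac']['array'], item['flac']['sampling_rate']) would then replace the per-item blocks in Sections 7.2.2 and 7.2.3.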

7.2.2 Audio Caption Evaluation

Step 1: Load MECAT-Caption Dataset
from datasets import load_dataset
data = load_dataset(
    'mispeech/MECAT-Caption', 
    split='test', 
)
print(f"Loaded {len(data)} samples from datasets")
Step 2: Generate and Evaluate Captions

Method 1: Single Dictionary Approach (for non-instruction-following models)

Generation:

from mecat import evaluate
# Generate general predictions using a single prompt
predictions = {}

for item in tqdm(data, desc="Generating general captions"):
    key = item['__key__']
    audio = item['flac']['array']
    sampling_rate = item['flac']['sampling_rate']
    # Note: the sampling rate of audio provided by MECAT is 16kHz 
    
    # Create general prompt for caption generation
    prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"

    # Process inputs
    inputs = processor(
        text=prompt, 
        audio=audio, 
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).to(device)
    
    # Generate response
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs, 
            max_length=512,
            do_sample=False,
            temperature=0.1
        )
    
    # Decode response
    generated_ids = generated_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generated_ids, 
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=False
    )[0]
    predictions[key] = response.strip()

print(f"Generated {len(predictions)} general captions")

# Save single prediction file
import csv
with open('caption_predictions.csv', 'w', encoding='utf-8') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for key, value in predictions.items():
        writer.writerow([key, value])

Evaluation:

# Evaluate general predictions across all subtasks
results = evaluate(
    predicted_data=predictions,
    task='caption', 
    metrics=['fense', 'date']
)

print("\nSingle Dictionary Evaluation Results:")
print(results)

Method 2: Multi-Dictionary Approach (recommended for instruction-following models)

Generation:

# Generate task-specific predictions using different prompts
task_prompts = {
    'long': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to this audio and describe it in 1-2 sentences:",
    'short': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption for this audio within 15 words:",
    'speech': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption describing the speech content in this audio:",
    'music': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption for the music content in this audio:",
    'sound': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a general sound excluding speech and music:",
    'environment': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption for quality or acoustic environment for this audio:"
}

# Generate predictions for each subtask
subtask_predictions = {}

for subtask, prompt_template in task_prompts.items():
    print(f"\nGenerating {subtask} captions...")
    subtask_predictions[subtask] = {}
    
    for item in tqdm(data, desc=f"Generating {subtask} captions"):
        key = item['__key__']
        audio = item['flac']['array']
        sampling_rate = item['flac']['sampling_rate']
        
        # Process inputs with task-specific prompt
        inputs = processor(
            text=prompt_template, 
            audio=audio, 
            sampling_rate=sampling_rate,
            return_tensors="pt"
        ).to(device)
        
        # Generate response
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs, 
                max_length=512,
                do_sample=False,
                temperature=0.1
            )
        
        # Decode response
        generated_ids = generated_ids[:, inputs.input_ids.size(1):]
        response = processor.batch_decode(
            generated_ids, 
            skip_special_tokens=True, 
            clean_up_tokenization_spaces=False
        )[0]
        subtask_predictions[subtask][key] = response.strip()

# Save separate prediction files for each subtask
for subtask, preds in subtask_predictions.items():
    filename = f'{subtask}_caption.csv'
    with open(filename, 'w', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for key, value in preds.items():
            writer.writerow([key, value])
    print(f"Saved {len(preds)} {subtask} predictions to {filename}")

Evaluation:

# Evaluate task-specific predictions for optimal performance
results_multisubtask = evaluate(
    predicted_data=subtask_predictions,
    task='caption', 
    metrics=['fense', 'date']
)

print("\nMulti-Dictionary Evaluation Results:")
print(results_multisubtask)
Step 3: Expected Results

Expected Caption Evaluation Output (the numbers below are illustrative and do not represent the actual performance of Qwen2-Audio-7B):

   subtask     num_samples  fense  date
 content_long        20052   47.3 40.5
content_short        20052   45.8 41.0 
  pure_speech         7839   30.9 28.5 
 mixed_speech         8593   31.7 27.1 
   pure_music         2593   42.1 50.7
  mixed_music         8593   28.3 33.1
   pure_sound          848   41.2 46.6 
  mixed_sound         8593   16.2 34.1 
  environment        20052   45.4 47.8
score_caption         <NA>   35.2 39.3    

Note: score_caption is computed as:

$S_{\rm caption} = 0.4\times({0.8S_{\rm long} + 0.2S_{\rm short}}) + 0.4\times(0.6S_{\rm speech} + 0.3S_{\rm music} + 0.1S_{\rm sound}) + 0.2\times S_{\rm environment}$

where $S_{\rm speech}$, $S_{\rm music}$, and $S_{\rm sound}$ are each the average of the corresponding pure and mixed scores, e.g., $S_{\rm speech} = \frac{S_{\rm speech,pure}+S_{\rm speech,mixed}}{2}$
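
In code, the weighting amounts to the following small sketch (the key names are illustrative; evaluate() reports score_caption directly when all subtasks are provided):

def overall_caption_score(s: dict) -> float:
    """Combine per-subtask caption scores using the weights defined above.
    Expects keys: long, short, pure_speech, mixed_speech, pure_music,
    mixed_music, pure_sound, mixed_sound, environment."""
    speech = (s['pure_speech'] + s['mixed_speech']) / 2
    music = (s['pure_music'] + s['mixed_music']) / 2
    sound = (s['pure_sound'] + s['mixed_sound']) / 2
    systematic = 0.8 * s['long'] + 0.2 * s['short']
    content = 0.6 * speech + 0.3 * music + 0.1 * sound
    return 0.4 * systematic + 0.4 * content + 0.2 * s['environment']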

7.2.3 Audio Question Answering Evaluation

Step 1: Load MECAT-QA Dataset
# Load MECAT-QA test data 
qa_data = load_dataset(
    'mispeech/MECAT-QA', 
    split='test', 
)
print(f"Loaded {len(qa_data)} QA samples from datasets")
Step 2: Generate and Evaluate Answers

Generation:

# Generate predictions for each question-audio pair
qa_predictions = {}

for item in tqdm(qa_data, desc="Generating answers"):
    key = item['__key__']
    audio = item['flac']['array']
    sampling_rate = item['flac']['sampling_rate']
    question = item['json']['question']
    
    
    # Create prompt for QA
    prompt = f"<|audio_bos|><|AUDIO|><|audio_eos|>{question}"
    
    # Process inputs
    inputs = processor(
        text=prompt, 
        audio=audio, 
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).to(device)
    
    # Generate response
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs, 
            max_length=512,
            do_sample=False,
            temperature=0.1
        )
    
    # Decode response
    generated_ids = generated_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generated_ids, 
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=False
    )[0]
    
    qa_predictions[key] = response.strip()

print(f"Generated {len(qa_predictions)} answers")

# Output the results to csv files
import csv
with open('qa_predictions.csv', 'w', encoding='utf-8') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for key, value in qa_predictions.items():
        writer.writerow([key, value])

Evaluation:

# Evaluate using MECAT metrics
qa_results = evaluate(
    predicted_data=qa_predictions, 
    task='qa', 
    metrics=['fense', 'date']
)

print("\nQA Evaluation Results:")
print(qa_results)
Step 3: Expected Results

Expected QA Evaluation Output (the numbers below are illustrative and do not represent the actual performance of Qwen2-Audio-7B):

           subtask     num_samples fense date
    direct_perception        20624 44.0 54.0
sound_characteristics        19767 39.0 53.1 
   quality_assessment        18942 18.0 17.8
environment_reasoning        18300 42.0 35.5 
  inference_judgement        19756 51.0 42.0
  application_context         2871 40.0 49.9
             score_qa        <NA>  39.0 42.1

Note: the final score (score_qa) is the unweighted average of the scores of all six subtasks.
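
In code, the aggregation is a plain average over the six subtask scores; using the fense column from the table above as input:

# score_qa is the unweighted mean of the six QA subtask scores
fense_scores = {
    'direct_perception': 44.0,
    'sound_characteristics': 39.0,
    'quality_assessment': 18.0,
    'environment_reasoning': 42.0,
    'inference_judgement': 51.0,
    'application_context': 40.0,
}
score_qa = sum(fense_scores.values()) / len(fense_scores)
print(round(score_qa, 1))  # -> 39.0, matching the fense column above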

7.3 Command Line Evaluation

You can also use the command line interface for evaluation:

7.3.1 Single File Evaluation

# Caption evaluation for different audio types (using single dictionary predictions)
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask long   --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask short  --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask music  --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask speech --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask sound  --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask environment --metrics fense date

# Batch evaluation across all subsets (using single dictionary predictions)
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --metrics fense date

7.3.2 Multi-File Evaluation (Recommended for Caption Task)

For instruction-following models that can generate task-specific captions, you can provide multiple prediction files at once to get comprehensive evaluation results across all caption subtasks:

# Evaluate multiple caption prediction files in order: long, short, speech, music, sound, environment
python -m mecat.evaluate --prediction \
    long_caption.csv \
    short_caption.csv \
    speech_caption.csv \
    music_caption.csv \
    sound_caption.csv \
    environment_caption.csv \
    --task caption --metrics fense date

# Evaluate with fewer files (will evaluate only available subtasks with warning)
python -m mecat.evaluate --prediction \
    long_caption.csv \
    short_caption.csv \
    --task caption --metrics fense date

Benefits of Multi-File Evaluation:

  • Complete Coverage: Evaluates all caption subtasks with task-specific predictions
  • Better Performance: Each prediction file contains responses optimized for specific caption types
  • Comprehensive Results: Provides the full evaluation matrix including overall scores
  • ⚠️ File Order Matters: Files are mapped to subtasks in order: long → short → speech → music → sound → environment

7.3.3 QA Task Evaluation

# QA evaluation for different question types
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask direct_perception     --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask sound_characteristics --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask quality_assessment    --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask environment_reasoning --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask inference_judgement   --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask application_context   --metrics fense date

# Batch evaluation across all subsets (recommended)
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --metrics fense date

Prediction File Format:

# csv File
"audio_key_1", "Generated caption or answer text"
"audio_key_2", "Another generated response"
"audio_key_3", "More predictions..."

Important Notes:

  • Audio Captioning Task
    • For instruction-following models (Recommended):
      • Generate 6 different prediction files using task-specific prompts (one per sub-task). Requires 6 inference passes.
      • Example prompts:
        • long: "Listen to this audio and describe it in 1-2 sentences"
        • short: "Listen to the audio and provide a caption for this audio within 15 words"
        • speech: "Listen to the audio and provide a caption describing the speech content in this audio"
        • music: "Listen to the audio and provide a caption for the music content in this audio"
        • sound: "Listen to the audio and provide a general sound excluding speech and music"
        • environment: "Listen to the audio and provide a caption for quality or acoustic environment for this audio"
    • For non-instruction-following models:
      • Evaluate using a single prediction file (single inference pass).
      • The same predictions will be evaluated across all subtasks.
  • Audio Question Answering Task:
    • Evaluate all sub-tasks in a single inference pass using the standard method.
    • A single prediction file is sufficient, as the questions are already task-specific.

7.4 Direct Data Evaluation

If you have complete predicted_data and reference_data, you can directly use them for evaluation without loading from files or datasets:

Audio Captioning Task Example:

from mecat import evaluate

predicted_data = [['silence']]

reference_data = [['Extended silence with severe audio distortion and background noise.', 'Persistent quiet period containing heavy signal interference.', 'Continuous silence disrupted by pronounced technical artifacts.']]

results = evaluate(
    predicted_data=predicted_data, 
    reference_data=reference_data, 
    task='caption',
    metrics='bleu',
)

print(results)

Audio Question Answering Task Example:

Similarly, for QA tasks, you can also provide predicted_data and reference_data directly:

from mecat import evaluate

predicted_data = [['A woman speaking expressively and dog barks.']]

reference_data = [['A woman speaking expressively and dog barks.']]

results = evaluate(
    predicted_data=predicted_data, 
    reference_data=reference_data, 
    task='qa',
    metrics='bleu',
    subtask='direct_perception',
)

print(results)

This approach is useful when:

  • You already have the predictions and references in memory
  • You want to evaluate custom data that is not part of the MECAT dataset
  • You need to quickly test evaluation metrics on specific examples

Note: Both predicted_data and reference_data should be lists of lists, where each inner list contains the predictions or references for a single sample. For reference data, multiple references per sample are supported (as shown in the caption example above).

Important: If you want to obtain results for different tasks (e.g., evaluating multiple caption subtasks or QA subtasks), it is recommended to use the methods described in sections 7.1-7.3, as they automatically select the appropriate keys for different tasks and provide comprehensive evaluation results across all subtasks.

8. Results

8.1 Audio-Captioning Task

8.1.1 DATE

Columns correspond to the caption subtasks in Section 4.1: long/short (systematic), pure/mixed speech, music, and sound (content-specific), and environment (content-unrelated).

Model Type    Model Name      long  short  pure_speech  mixed_speech  pure_music  mixed_music  pure_sound  mixed_sound  environment  Overall
Caption-Only  enclap          48.6  53.1   30.2         31.8          17.9        15.9         48.8        15.2         6.8          33.3
              pengi           43.5  46.8   27.2         29.5          29.3        13.1         42.8        14.6         7.1          30.6
LALM          audio-flamingo  48.6  49.7   30.5         34.3          28.8        25.6         41.2        18.5         17.5         35.6
              kimi-audio      49.5  54.2   30.0         31.3          27.7        16.9         43.1        16.2         7.0          34.3
              omni3b          56.4  55.2   42.5         41.3          46.6        29.7         52.9        23.9         19.4         42.6
              omni7b          61.1  56.5   39.9         40.9          32.1        30.9         50.7        23.8         17.9         43.0

8.1.2 FENSE

Columns are the same as in 8.1.1.

Model Type    Model Name       long  short  pure_speech  mixed_speech  pure_music  mixed_music  pure_sound  mixed_sound  environment  Overall
Caption-Only  enclap-both      40.5  45.0   28.7         29.5          39.3        15.0         41.2        17.3         17.9         31.6
              pengi            37.5  41.0   26.6         29.2          39.6        11.8         35.4        16.2         17.8         29.5
LALM          audio-flamingo2  43.8  43.3   28.5         33.7          43.1        30.3         41.0        24.7         45.4         39.4
              kimi-audio       40.8  45.7   25.6         27.1          39.5        16.2         35.8        19.4         16.7         30.8
              qwen2.5-omni3b   48.3  45.3   37.3         37.5          50.7        34.7         46.6        34.1         47.8         44.1
              qwen2.5-omni7b   52.7  46.2   35.3         37.5          39.2        33.1         45.2        32.1         41.0         43.4

8.2 Audio-Question-Answering

8.2.1 DATE

Columns correspond to the six QA subtasks in Section 4.2, grouped into perception, analysis, and reasoning.

Model Type  Model Name       direct_perception  sound_characteristics  quality_assessment  environment_reasoning  inference_judgement  application_context  Overall
LALM        audio-flamingo2  45.1               46.3                   34.9                37.5                   44.0                 42.4                 41.7
            kimi-audio       45.6               39.2                   18.7                34.6                   48.9                 41.2                 38.0
            qwen2.5-omni3b   55.7               53.2                   38.6                41.1                   51.8                 50.8                 48.5
            qwen2.5-omni7b   57.8               52.9                   39.1                44.0                   53.2                 50.8                 49.6

8.2.2 FENSE

Columns are the same as in 8.2.1.

Model Type  Model Name       direct_perception  sound_characteristics  quality_assessment  environment_reasoning  inference_judgement  application_context  Overall
LALM        audio-flamingo2  39.1               39.0                   37.4                41.3                   35.5                 35.8                 38.0
            kimi-audio       37.5               32.5                   19.2                37.5                   38.8                 33.8                 33.2
            qwen2.5-omni3b   47.2               43.8                   39.7                43.2                   41.0                 41.9                 42.8
            qwen2.5-omni7b   49.7               43.8                   40.5                44.1                   42.5                 41.9                 43.7

9. Acknowledgement

We referred to the implementation of FENSE for the evaluation.

10. Contributing

Yadong Niu* · Tianzi Wang* · Heinrich Dinkel · Xingwei Sun · Jiahao Zhou · Gang Li · Jizhong Liu · Xunying Liu · Junbo Zhang · Jian Luan

*: Equal Contribution

11. Citation

@article{mecat2025,
  title={MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks},
  author={Niu, Yadong and Wang, Tianzi and Dinkel, Heinrich and Sun, Xingwei and Zhou, Jiahao and Li, Gang and Liu, Jizhong and Liu, Xunying and Zhang, Junbo and Luan, Jian},
  journal={arXiv preprint arXiv:2507.23511},
  year={2025}
}

12. License

The dataset of this project is derived from a subset of ACAV100M and is released under the Creative Commons Attribution 3.0 (CC BY 3.0) license.

The code of this project is released under the Apache License 2.0.