MMAU-mini and AIR-Bench reproduction issue #14

@gmltmd789

Description

Hello,

Thank you for sharing your impressive work and for making the checkpoints and evaluation code publicly available. I have been running evaluations with the official checkpoints on MMAU-mini and AIR-Bench, but the results I obtained were notably lower than those reported in your paper: around 55.8 on average for MMAU-mini and 5.667 for AIR-Bench.

Interestingly, when I evaluated Qwen2-Audio-7B-Instruct with the same codebase, the scores closely matched the official benchmarks, which suggests the evaluation pipeline itself is set up correctly.

To help me reproduce your results more accurately, I wanted to ask whether there are any additional inference settings, preprocessing steps, or evaluation considerations that might differ from what is documented. For inference, I used the ms-swift-based code you provided (tf=4.48.0, ms-swift=3.0.0), and for evaluation, I extracted only the content between <RESPONSE> and </RESPONSE> (a sketch of this extraction step is included below).
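For concreteness, here is a minimal sketch of the extraction I used. The helper name and the fallback to the raw output are illustrative choices on my side; only the <RESPONSE>/</RESPONSE> tag format comes from the setup described above:

```python
import re

# Match the text enclosed in <RESPONSE>...</RESPONSE>, including newlines.
RESPONSE_RE = re.compile(r"<RESPONSE>(.*?)</RESPONSE>", re.DOTALL)

def extract_response(model_output: str) -> str:
    """Return the content between <RESPONSE> and </RESPONSE>.

    Falls back to the stripped raw output if the tags are missing,
    so malformed generations are still scored rather than dropped.
    """
    match = RESPONSE_RE.search(model_output)
    return match.group(1).strip() if match else model_output.strip()
```

Please let me know if the intended parsing differs from this (e.g., if untagged outputs should be discarded instead of scored).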

Any guidance you could offer would be greatly appreciated. Thanks again for your inspiring work, and I wish you continued success in your research.
