MMAU-mini and AIR-Bench reproduction issue #14

@gmltmd789

Description

Hello,

Thank you for sharing your impressive work and for making the checkpoints and evaluation code publicly available. I have been running evaluations with the official checkpoints on MMAU-mini and AIR-Bench, but the results I obtained were notably lower than those reported in your paper: around 55.8 on average for MMAU-mini and 5.667 for AIR-Bench.

Interestingly, when I evaluated Qwen2-Audio-7B-Instruct with the same codebase, the scores closely matched the official benchmarks, which suggests the evaluation pipeline itself is set up correctly.

To help me reproduce your results more accurately, I wanted to ask whether there are any additional inference settings, preprocessing steps, or evaluation considerations that might differ from what is documented. For inference, I used the ms-swift-based code you provided (tf=4.48.0, ms-swift=3.0.0), and for evaluation, I extracted only the content between <RESPONSE> and </RESPONSE> (a sketch of this extraction step is included below).
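For concreteness, here is a minimal sketch of the extraction I used. The helper name and the fallback to the raw output are illustrative choices on my side; only the <RESPONSE>/</RESPONSE> tag format comes from the setup described above:

```python
import re

# Match the text enclosed in <RESPONSE>...</RESPONSE>, including newlines.
RESPONSE_RE = re.compile(r"<RESPONSE>(.*?)</RESPONSE>", re.DOTALL)

def extract_response(model_output: str) -> str:
    """Return the content between <RESPONSE> and </RESPONSE>.

    Falls back to the stripped raw output if the tags are missing,
    so malformed generations are still scored rather than dropped.
    """
    match = RESPONSE_RE.search(model_output)
    return match.group(1).strip() if match else model_output.strip()
```

Please let me know if the intended parsing differs from this (e.g., if untagged outputs should be discarded instead of scored).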

Any guidance you could offer would be greatly appreciated. Thanks again for your inspiring work, and I wish you continued success in your research.
