We follow InternVL2 to evaluate performance on MME, MMBench, MMMU, MMVet, MathVista, and MMVP.
Please follow InternVL2 to prepare the corresponding data, and link the data under `vlm`.
The final directory structure is:

```
data
├── MathVista
├── mmbench
├── mme
├── MMMU
├── mm-vet
└── MMVP
```

Directly run `scripts/eval/run_eval_vlm.sh` to evaluate the different benchmarks. The output will be saved in `$output_path`.
- Set `$model_path` and `$output_path` to the checkpoint and log paths.
- Increase `GPUS` if you want to run faster.
- For MMBench, please use the official evaluation server.
- For MMVet, please use the official evaluation server.
- For MathVista, please set `$openai_api_key` in `scripts/eval/run_eval_vlm.sh` and `your_api_url` in `eval/vlm/eval/mathvista/utilities.py`. The default GPT version is `gpt-4o-2024-11-20`.
- For MMMU, we use CoT in the report, which improves accuracy by about 2%. For evaluating open-ended answers, we use GPT-4o as the judge.
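Putting the settings above together, a typical configuration before launching might look like this (all paths and the API key below are placeholders, not values from this repository — substitute your own):

```shell
# Placeholder values -- replace with your own checkpoint and log locations.
model_path=/path/to/checkpoint
output_path=./vlm_eval_logs
GPUS=8                        # increase for faster evaluation
openai_api_key=sk-placeholder # only needed for MathVista / MMMU GPT judging

# Launch command (shown without executing):
echo "bash scripts/eval/run_eval_vlm.sh  # model_path=$model_path output_path=$output_path GPUS=$GPUS"
```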
We modify the code in GenEval for faster evaluation.
Install the following dependencies:

```shell
pip install open-clip-torch
pip install clip-benchmark
pip install --upgrade setuptools
sudo pip install -U openmim
sudo mim install mmengine mmcv-full==1.7.2
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection; git checkout 2.x
pip install -v -e .
```

Download the detector:
```shell
cd ./eval/gen/geneval
mkdir model
bash ./evaluation/download_models.sh ./model
```

Directly run `scripts/eval/run_geneval.sh` to evaluate GenEval. The output will be saved in `$output_path`.
- Set `$model_path` and `$output_path` to the checkpoint and log paths.
- Set `metadata_file` to `./eval/gen/geneval/prompts/evaluation_metadata.jsonl` for the original GenEval prompts.
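A minimal configuration sketch for the two settings above (the checkpoint and log paths are placeholders; only the `metadata_file` path comes from this document):

```shell
# Placeholder paths -- replace with your own.
model_path=/path/to/checkpoint
output_path=./geneval_logs
# Original GenEval prompts, as noted above:
metadata_file=./eval/gen/geneval/prompts/evaluation_metadata.jsonl

# Launch command (shown without executing):
echo "bash scripts/eval/run_geneval.sh  # metadata_file=$metadata_file"
```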
We modify the code in WISE for faster evaluation.
Directly run `scripts/eval/run_wise.sh` to evaluate WISE. The output will be saved in `$output_path`.
- Set `$model_path` and `$output_path` to the checkpoint and log paths.
- Set `$openai_api_key` in `scripts/eval/run_wise.sh` and `your_api_url` in `eval/gen/wise/gpt_eval_mp.py`. The default GPT version is `gpt-4o-2024-11-20`.
- Use `think` for thinking mode.
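The WISE settings above can be sketched as follows. All values are placeholders, and the exact way `think` is passed (shown here as a plain variable) is an assumption — check `scripts/eval/run_wise.sh` for the actual mechanism:

```shell
# Placeholder values -- replace with your own.
model_path=/path/to/checkpoint
output_path=./wise_logs
openai_api_key=sk-placeholder  # used by the GPT judge (gpt-4o-2024-11-20 by default)
think=true                     # hypothetical flag form for thinking mode

# Launch command (shown without executing):
echo "bash scripts/eval/run_wise.sh  # output_path=$output_path think=$think"
```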
Please follow GEdit-Bench for evaluation.
TBD