[Bug] TextVQA: predictions are correct, but the reported accuracy is 0 #187

@MansonX

Description

Operating system and version

Ubuntu 22.04

Python environment where the tool is installed

A Python virtual environment created with anaconda/miniconda

Python version

3.10

AISBench tool version

3.1.20260306

AISBench command executed

ais_bench --models vllm_api_general_chat --datasets textvqa_gen_base64 --mode all --debug

Contents of the model configuration file or custom configuration file

cli_args=dict(
cfg_time_str='20260312_122302',
config=None,
config_dir='configs',
custom_dataset_data_type=None,
custom_dataset_infer_method=None,
custom_dataset_meta_path=None,
custom_dataset_path=None,
datasets=[
'textvqa_gen_base64',
],
debug=True,
dir_time_str='20260312_122302',
dry_run=False,
dump_eval_details=False,
dump_extract_rate=False,
max_num_workers=1,
max_workers_per_gpu=1,
merge_ds=False,
mode='all',
models=[
'vllm_api_general_chat',
],
num_prompts=None,
num_warmups=1,
pressure=False,
pressure_time=15,
reuse=None,
search=False,
summarizer=None,
work_dir=None)
datasets=[
dict(abbr='textvqa',
eval_cfg=dict(
evaluator=dict(
type='ais_bench.benchmark.datasets.TEXTEvaluator')),
image_type='image_base64',
infer_cfg=dict(
inferencer=dict(
type='ais_bench.benchmark.openicl.icl_inferencer.GenInferencer'),
prompt_template=dict(
template=dict(
round=[
dict(prompt_mm=dict(
image=dict(
image_url=dict(
url='data:image/jpeg;base64,{image}'),
type='image_url'),
text=dict(
text='{question} Answer the question using a single word or phrase.',
type='text')),
role='HUMAN'),
]),
type='ais_bench.benchmark.openicl.icl_prompt_template.icl_prompt_template_mm.MMPromptTemplate'),
retriever=dict(
prompt_template=dict(
template=dict(
round=[
dict(prompt_mm=dict(
image=dict(
image_url=dict(
url='data:image/jpeg;base64,{image}'),
type='image_url'),
text=dict(
text='{question} Answer the question using a single word or phrase.',
type='text')),
role='HUMAN'),
]),
type='ais_bench.benchmark.openicl.icl_prompt_template.icl_prompt_template_mm.MMPromptTemplate'),
type='ais_bench.benchmark.openicl.icl_retriever.ZeroRetriever')),
path='ais_bench/datasets/textvqa/textvqa_val.jsonl',
reader_cfg=dict(
input_columns=[
'question',
'image',
],
output_column='answer'),
type='ais_bench.benchmark.datasets.TEXTVQADataset'),
]
eval=dict(
partitioner=dict(
out_dir='outputs/default/20260312_122302/results/',
type='ais_bench.benchmark.partitioners.naive.NaivePartitioner'),
runner=dict(
debug=True,
max_num_workers=1,
max_workers_per_gpu=1,
task=dict(
type='ais_bench.benchmark.tasks.openicl_eval.OpenICLEvalTask'),
type='ais_bench.benchmark.runners.local.LocalRunner'))
infer=dict(
partitioner=dict(
out_dir='outputs/default/20260312_122302/predictions/',
type='ais_bench.benchmark.partitioners.naive.NaivePartitioner'),
runner=dict(
debug=True,
max_num_workers=1,
max_workers_per_gpu=1,
task=dict(
type='ais_bench.benchmark.tasks.openicl_api_infer.OpenICLApiInferTask'),
type='ais_bench.benchmark.runners.local.LocalRunner'))
models=[
dict(abbr='vllm-api-general-chat',
api_key='',
attr='service',
batch_size=64,
generation_kwargs=dict(
chat_template_kwargs=dict(
enable_thinking=False),
ignore_eos=False,
temperature=0.0),
host_ip='141.61.81.13',
host_port=7070,
max_out_len=1024,
model='qwen35',
path='/mnt/share/w00957216/qwen35',
pred_postprocessor=dict(
type='ais_bench.benchmark.utils.postprocess.model_postprocessors.extract_non_reasoning_content'),
request_rate=0,
retry=2,
stream=False,
trust_remote_code=False,
type='ais_bench.benchmark.models.VLLMCustomAPIChat',
url='',
use_timestamp=False),
]

Expected behavior

When the predictions match the gold answers, the accuracy reported in the summary should be greater than 0.
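
To make "matches the gold answers" concrete, here is a minimal sketch of how a single record could be scored. It assumes the commonly used VQA soft-accuracy rule, min(matching annotators / 3, 1); whether `TEXTEvaluator` applies this rule is not confirmed here, and `vqa_soft_accuracy` is a hypothetical helper, not an AISBench function.

```python
def vqa_soft_accuracy(prediction: str, gold_answers: list[str]) -> float:
    """Score one prediction against the annotator answers.

    Assumption: the common VQA rule min(#matching annotators / 3, 1).
    The evaluator used by AISBench may implement something different.
    """
    normalized = prediction.strip().lower()
    matches = sum(
        1 for answer in gold_answers if answer.strip().lower() == normalized
    )
    return min(matches / 3.0, 1.0)


# Gold answers copied from record id 4949: ten annotators, all "2011".
gold_4949 = ["2011"] * 10
print(vqa_soft_accuracy("2011", gold_4949))  # 1.0
```

Under this assumption, a record like id 4949 alone would be enough to push the accuracy above 0.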

Actual behavior

Sample predictions:
{"data_abbr": "textvqa", "id": 4949, "success": true, "uuid": "d3cf79ab769e4cc1a59733edcb6ef585", "origin_prompt": [{"role": "HUMAN", "prompt": [{"image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEASABIAAD/4gxYSUNDX1BST0ZJTEUAAQEAAAxITGlubwIQAABtbnRyUkdCIFhZWiAHzgACAAkABgAxAABhY3NwTVNGVAAAAABJRUMgc1JHQgAAAAAAAAAAAAAAAAAA9tYAAQAAAADTLUhQICAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABFjcHJ0A ..."}, "type": "image_url"}, {"text": "which year did this happen? Answer the question using a single word or phrase.", "type": "text"}]}], "prediction": "2011<|im_end|>\n<|endoftext|>", "gold": [{"answer": "2011", "answer_confidence": "yes", "answer_id": 0}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 1}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 2}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 3}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 4}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 5}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 6}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 7}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 8}, {"answer": "2011", "answer_confidence": "yes", "answer_id": 9}]}
{"data_abbr": "textvqa", "id": 4945, "success": true, "uuid": "68fffed926db4ae0a2d08626eda51fff", "origin_prompt": [{"role": "HUMAN", "prompt": [{"image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEASABIAAD/4gxYSUNDX1BST0ZJTEUAAQEAAAxITGlubwIQAABtbnRyUkdCIFhZWiAHzgACAAkABgAxAABhY3NwTVNGVAAAAABJRUMgc1JHQgAAAAAAAAAAAAAAAAAA9tYAAQAAAADTLUhQICAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABFjcHJ0A ..."}, "type": "image_url"}, {"text": "what is the title of the thin red book? Answer the question using a single word or phrase.", "type": "text"}]}], "prediction": " el arte de cocinar<|im_end|>\n<|endoftext|>", "gold": [{"answer": "la sombra del aguilla", "answer_confidence": "yes", "answer_id": 0}, {"answer": "la sombra del aguila", "answer_confidence": "yes", "answer_id": 1}, {"answer": "la sombra del aquila", "answer_confidence": "yes", "answer_id": 2}, {"answer": "la sombra del aguila", "answer_confidence": "yes", "answer_id": 3}, {"answer": "la sombra de agulla", "answer_confidence": "yes", "answer_id": 4}, {"answer": "la sombre del aguilla", "answer_confidence": "yes", "answer_id": 5}, {"answer": "la sombra del aguilla", "answer_confidence": "yes", "answer_id": 6}, {"answer": "la sombra del aguila", "answer_confidence": "yes", "answer_id": 7}, {"answer": "la sombra del aquila", "answer_confidence": "yes", "answer_id": 8}, {"answer": "la sombra del águila", "answer_confidence": "yes", "answer_id": 9}]}
{"data_abbr": "textvqa", "id": 4975, "success": true, "uuid": "eaf538a406d34cec8320f7e51df28a34", "origin_prompt": [{"role": "HUMAN", "prompt": [{"image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEA8ADwAAD/4gxYSUNDX1BST0ZJTEUAAQEAAAxITGlubwIQAABtbnRyUkdCIFhZWiAHzgACAAkABgAxAABhY3NwTVNGVAAAAABJRUMgc1JHQgAAAAAAAAAAAAAAAAAA9tYAAQAAAADTLUhQICAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABFjcHJ0A ..."}, "type": "image_url"}, {"text": "what is the name of this wine? Answer the question using a single word or phrase.", "type": "text"}]}], "prediction": " rubino del casale<|im_end|>\n<|endoftext|>", "gold": [{"answer": "rubino del casale", "answer_confidence": "yes", "answer_id": 0}, {"answer": "rubino del casale ", "answer_confidence": "yes", "answer_id": 1}, {"answer": "rubino del casale ", "answer_confidence": "yes", "answer_id": 2}, {"answer": "rubino", "answer_confidence": "yes", "answer_id": 3}, {"answer": "rulino del casale", "answer_confidence": "yes", "answer_id": 4}, {"answer": "vino da tavola rosso", "answer_confidence": "yes", "answer_id": 5}, {"answer": "rubino del casale", "answer_confidence": "yes", "answer_id": 6}, {"answer": "rubinode casale", "answer_confidence": "yes", "answer_id": 7}, {"answer": "rubino del casale", "answer_confidence": "yes", "answer_id": 8}, {"answer": "rubino del casale", "answer_confidence": "yes", "answer_id": 9}]}
{"data_abbr": "textvqa", "id": 4890, "success": true, "uuid": "d2713693c1fa40cdb87198565851762a", "origin_prompt": [{"role": "HUMAN", "prompt": [{"image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEA8ADwAAD/4gJASUNDX1BST0ZJTEUAAQEAAAIwQURCRQIQAABtbnRyUkdCIFhZWiAHzwAGAAMAAAAAAABhY3NwQVBQTAAAAABub25lAAAAAAAAAAAAAAAAAAAAAAAA9tYAAQAAAADTLUFEQkUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAApjcHJ0A ..."}, "type": "image_url"}, {"text": "what is the address for the business on the ad? Answer the question using a single word or phrase.", "type": "text"}]}], "prediction": "4001-4053 ravenswood ave.<|im_end|>\n<|endoftext|>", "gold": [{"answer": "manz", "answer_confidence": "yes", "answer_id": 0}, {"answer": "4001-4053 ravenswood ave. chicago, ill", "answer_confidence": "yes", "answer_id": 1}, {"answer": "4001-4053 ravenswood ave. chicago, ill.", "answer_confidence": "yes", "answer_id": 2}, {"answer": "4001-4053 ravenswood ave. chicago, ill.", "answer_confidence": "yes", "answer_id": 3}, {"answer": "4001-4053 ravenswood ave", "answer_confidence": "yes", "answer_id": 4}, {"answer": "4001-4053 ravenswood ave chicago ill", "answer_confidence": "yes", "answer_id": 5}, {"answer": "4001-4053 ravenswood ave. chicago, ill.", "answer_confidence": "yes", "answer_id": 6}, {"answer": "4001-4053 ravenswood ave chicaago, ill", "answer_confidence": "yes", "answer_id": 7}, {"answer": "4001-4053 ravenswood ave. chicago, ill", "answer_confidence": "yes", "answer_id": 8}, {"answer": "4001-4053 ravenswood ave. chicago,ill.", "answer_confidence": "yes", "answer_id": 9}]}
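
Every `prediction` above ends with `<|im_end|>\n<|endoftext|>`, while the gold answers do not. The sketch below is one way to check whether those trailing markers alone are enough to break an exact string comparison. It is an assumption that this is related to the 0.00 score; the evaluator's actual matching logic has not been inspected, and `strip_special_tokens` is a hypothetical helper.

```python
import re

# Matches chat-template markers such as <|im_end|> and <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|>]+\|>")


def strip_special_tokens(text: str) -> str:
    """Remove <|...|> markers, then trim surrounding whitespace."""
    return SPECIAL_TOKEN.sub("", text).strip()


# (prediction, gold answers) pairs abbreviated from the records above.
samples = [
    ("2011<|im_end|>\n<|endoftext|>", ["2011"]),
    (" rubino del casale<|im_end|>\n<|endoftext|>", ["rubino del casale", "rubino"]),
]

for raw, gold in samples:
    cleaned = strip_special_tokens(raw)
    # Columns: cleaned text, raw exact match, cleaned exact match.
    print(repr(cleaned), raw in gold, cleaned in gold)
```

For both samples the raw string matches no gold answer, while the cleaned string does.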

Summary:
tabulate format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataset    version    metric    mode    vllm-api-general-chat
textvqa    e9d882     accuracy  gen     0.00
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------

csv format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataset,version,metric,mode,vllm-api-general-chat
textvqa,e9d882,accuracy,gen,0.00
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

markdown format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| dataset | version | metric | mode | vllm-api-general-chat |
|---|---|---|---|---|
| textvqa | e9d882 | accuracy | gen | 0.00 |

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------

raw format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model: vllm-api-general-chat
textvqa: {'accuracy': 0.0}
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

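Pre-submission checklist

  • I have read the quick start guide in the main documentation, and it did not resolve the problem
  • I have searched the FAQ and found no matching question
  • I have searched the existing issues and found no duplicate
  • I have updated to the latest version, and the problem persists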

Labels

bug (Something isn't working), content_check_passed (issue content check passed)
