I ran inference separately with the provided aria_ui_vllm.py and aria_ui_hf.py and found that the results are inconsistent.
- Running `aria_ui_vllm.py`:

```python
llm = LLM(
    model=model_path,
    tokenizer_mode="slow",
    dtype="bfloat16",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True, use_fast=False
)
instruction = "Try Aria."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description: " + instruction,
            }
        ],
    }
]
```

Outputs:
```(684, 786)```<|im_end|>
After running draw_coord, it is clear that the predicted point does not land on the 'Try Aria' element.

- Running `aria_ui_hf.py`:

```python
instruction = "Try Aria."
image = Image.open(image_file).convert("RGB")
# NOTE: using huggingface on a single 80GB GPU, we resize the image to 1920px on the long side to prevent OOM. this is unnecessary with vllm.
image = resize_image(image, long_size=1920)
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": instruction, "type": "text"},
        ],
    }
]
```

Outputs:
```(767, 782)```<|im_end|>
- Running `aria_ui_vllm.py` with a changed prompt. I removed the sentence 'Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description:' so that only the instruction is sent:

```python
instruction = "Try Aria."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": instruction,
            }
        ],
    }
]
```

Outputs:
```(760, 786)```<|im_end|>
With this prompt, the observed result is correct.
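
To make the comparison concrete, here is a minimal sketch that parses the three raw outputs quoted above and prints how far each one is from the HF result. `parse_point` and `to_pixels` are hypothetical helpers written for this illustration, not functions from the repository, and the sketch assumes the output always contains a single `(x, y)` pair on the 0-1000 relative scale.

```python
import re

FENCE = "`" * 3  # the model wraps its answer in triple backticks


def parse_point(raw: str) -> tuple[int, int]:
    """Hypothetical helper: extract the (x, y) pair from a raw model output."""
    match = re.search(r"\((\d+),\s*(\d+)\)", raw)
    if match is None:
        raise ValueError(f"no coordinate pair found in {raw!r}")
    return int(match.group(1)), int(match.group(2))


def to_pixels(point: tuple[int, int], width: int, height: int) -> tuple[int, int]:
    """Hypothetical helper: map relative 0-1000 coordinates to pixel coordinates."""
    x, y = point
    return round(x / 1000 * width), round(y / 1000 * height)


# Raw outputs copied from the three runs reported above.
outputs = {
    "vllm, full prompt": f"{FENCE}(684, 786){FENCE}<|im_end|>",
    "hf": f"{FENCE}(767, 782){FENCE}<|im_end|>",
    "vllm, instruction only": f"{FENCE}(760, 786){FENCE}<|im_end|>",
}

points = {name: parse_point(raw) for name, raw in outputs.items()}
reference = points["hf"]

for name, (x, y) in points.items():
    dx, dy = x - reference[0], y - reference[1]
    print(f"{name}: ({x}, {y}), offset from hf = ({dx}, {dy})")
```

On the relative scale, the full-prompt vLLM output is 83 units to the left of the HF output, while the instruction-only vLLM output is within 7 units of it. Calling `to_pixels(points["hf"], 1920, 1080)` would convert a point to pixels for an assumed 1920x1080 image; that size is only an example, not the actual screenshot size.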

