Inference results differ between vLLM and Transformers #9

@cug-auto-zp

I ran inference with the provided aria_ui_vllm.py and aria_ui_hf.py separately, using the same instruction, and the two scripts return inconsistent results.

  1. Running aria_ui_vllm.py
    llm = LLM(
        model=model_path,
        tokenizer_mode="slow",
        dtype="bfloat16",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True, use_fast=False
    )
    instruction = "Try Aria."
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {
                    "type": "text",
                    "text": "Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description: " + instruction,
                }
            ],
        }
    ]

outputs:

```(684, 786)```<|im_end|>

After running draw_coord, it can be seen that the correct coordinates for 'Try Aria' were not found.
Image
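For anyone reproducing this, the raw output string can be turned into pixel coordinates before drawing. The following is a minimal sketch, not the repository's draw_coord: `parse_point` and `to_pixels` are hypothetical helper names, and the 1920x1080 image size is an assumed example value. It relies only on the prompt's statement that coordinates are relative on a 0-1000 scale.

```python
import re


def parse_point(raw):
    """Extract the first "(x, y)" integer pair from a raw model output string."""
    match = re.search(r"\((\d+),\s*(\d+)\)", raw)
    if match is None:
        raise ValueError(f"no coordinate pair found in {raw!r}")
    return int(match.group(1)), int(match.group(2))


def to_pixels(point, width, height, scale=1000):
    """Map a relative (0-scale) point to absolute pixel coordinates."""
    x, y = point
    return round(x / scale * width), round(y / scale * height)


# Raw output copied from the vLLM run above.
raw = "```(684, 786)```<|im_end|>"
point = parse_point(raw)

# Assumed image size, for illustration only.
pixel_point = to_pixels(point, width=1920, height=1080)
print(point, pixel_point)
```

Scaling by the original image size, rather than any resized copy, keeps the drawn point comparable between the two scripts.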

  2. Running aria_ui_hf.py
instruction = "Try Aria."

image = Image.open(image_file).convert("RGB")

# NOTE: using huggingface on a single 80GB GPU, we resize the image to 1920px on the long side to prevent OOM. this is unnecessary with vllm.
image = resize_image(image, long_size=1920)

messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": instruction, "type": "text"},
        ],
    }
]

outputs:

```(767, 782)```<|im_end|>

After running draw_coord:
Image

  3. Changing the prompt in vLLM. I removed the sentence 'Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description:' so that only the instruction is sent.
    instruction = "Try Aria."
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {
                    "type": "text",
                    "text":  instruction,
                }
            ],
        }
    ]

outputs:

```(760, 786)```<|im_end|>

Image

With this prompt, the predicted point is correct.
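To put numbers on the three runs, the predicted points can be compared directly in the relative 0-1000 space. This is a small sketch using only the outputs quoted above. Treating the Transformers output as the reference point is an assumption made for the comparison, and the dictionary keys are labels chosen here.

```python
import math

# Predicted points copied from the three runs in this issue.
results = {
    "vllm_with_prefix": (684, 786),
    "hf": (767, 782),
    "vllm_instruction_only": (760, 786),
}


def distance(a, b):
    """Euclidean distance between two points in relative (0-1000) units."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Assumption: use the Transformers result as the reference.
reference = results["hf"]
for name, point in results.items():
    print(f"{name}: {distance(point, reference):.1f}")
```

The vLLM run with the long prefix is about 83 units from the Transformers point, almost entirely along x, while the instruction-only vLLM run is about 8 units away.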
