Inference results differ between vLLM and Transformers

I ran the provided `aria_ui_vllm.py` and `aria_ui_hf.py` separately with the same instruction and got inconsistent results.
1. Running `aria_ui_vllm.py`

```python
from transformers import AutoTokenizer
from vllm import LLM

# model_path is set elsewhere in the script
llm = LLM(
    model=model_path,
    tokenizer_mode="slow",
    dtype="bfloat16",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True, use_fast=False
)
instruction = "Try Aria."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description: " + instruction,
            }
        ],
    }
]
```

Output:
```
```(684, 786)```<|im_end|>
```
After running `draw_coord`, the plotted point does not land on the 'Try Aria' element, so the coordinates are wrong.
![Image](https://github.com/user-attachments/assets/68fa802f-6397-4339-9f41-89da770fd006)
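For reference, the prompt asks for relative (0-1000) coordinates, so a point has to be scaled by the image size before it can be drawn. The following is a minimal sketch of that conversion; `rel_to_pixel` is a hypothetical helper (not the repository's `draw_coord`), and the 1920x1080 image size is only an assumed example.

```python
def rel_to_pixel(point, width, height):
    """Scale a relative (0-1000) point to absolute pixel coordinates.

    Hypothetical helper for illustration; not part of the Aria-UI scripts.
    """
    x_rel, y_rel = point
    return round(x_rel * width / 1000), round(y_rel * height / 1000)


# Assumed 1920x1080 screenshot; the real image size may differ.
pixel_point = rel_to_pixel((684, 786), width=1920, height=1080)
```

Because the coordinates are relative, resizing the image (as the Transformers script does) should not by itself change the expected output.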



2. Running `aria_ui_hf.py`
```python
instruction = "Try Aria."

image = Image.open(image_file).convert("RGB")

# NOTE: using huggingface on a single 80GB GPU, we resize the image to 1920px on the long side to prevent OOM. this is unnecessary with vllm.
image = resize_image(image, long_size=1920)

messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": instruction, "type": "text"},
        ],
    }
]
```
Output:
```
```(767, 782)```<|im_end|>
```
Result after running `draw_coord`:
![Image](https://github.com/user-attachments/assets/5fdeca8c-599a-44f8-b832-9292ce5d64ee)
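To quantify the gap between the two runs, the raw output strings can be parsed and compared. This is a rough sketch; `parse_point` is a hypothetical helper, and it assumes the model output always contains a single `(x, y)` pair of integers as in the outputs shown here.

```python
import math
import re


def parse_point(text):
    """Extract the first "(x, y)" integer pair from a raw model output."""
    match = re.search(r"\((\d+),\s*(\d+)\)", text)
    if match is None:
        raise ValueError(f"no coordinate pair found in {text!r}")
    return int(match.group(1)), int(match.group(2))


vllm_point = parse_point("```(684, 786)```<|im_end|>")  # from aria_ui_vllm.py
hf_point = parse_point("```(767, 782)```<|im_end|>")    # from aria_ui_hf.py

# Distance in relative (0-1000) units between the two predictions.
gap = math.dist(vllm_point, hf_point)
```

The two predictions differ by 83 units on the x axis and 4 units on the y axis.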

3. Running `aria_ui_vllm.py` with a changed prompt. I removed the prefix sentence 'Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description:' and passed only the instruction.
```python
instruction = "Try Aria."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": instruction,
            }
        ],
    }
]
```

Output:
```
```(760, 786)```<|im_end|>
```

![Image](https://github.com/user-attachments/assets/8dbe2fba-0681-4936-91dc-69748aa9dcb8)

With this prompt, vLLM returns the correct location for 'Try Aria', close to the Transformers result.
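The only difference between steps 1 and 3 is the prompt prefix, so a small helper makes it easy to test both variants against the same image. This is a sketch under assumptions: `build_messages` and `GROUNDING_PREFIX` are hypothetical names, and the message layout simply mirrors the vLLM snippets above.

```python
# Prefix copied from the prompt used in aria_ui_vllm.py (step 1).
GROUNDING_PREFIX = (
    "Given a GUI image, what are the relative (0-1000) pixel point "
    "coordinates for the element corresponding to the following "
    "instruction or description: "
)


def build_messages(instruction, with_prefix):
    """Build a single-turn image + text message list, with or without the prefix."""
    text = GROUNDING_PREFIX + instruction if with_prefix else instruction
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": text},
            ],
        }
    ]


with_prefix = build_messages("Try Aria.", with_prefix=True)      # step 1 prompt
without_prefix = build_messages("Try Aria.", with_prefix=False)  # step 3 prompt
```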



