[COLM2025] VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation


Workflow overview (figure)

Abstract

Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents, which maps textual plans to GUI elements, can introduce vulnerabilities, enabling new types of backdoor attacks. With a backdoor attack targeting visual grounding, the agent's behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans at trigger locations instead of the intended targets. VisualTrap uses the common attack method of injecting poisoned data, and does so during the pre-training of visual grounding to ensure the practical feasibility of the attack. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye), and the attack generalizes to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.

Install Requirements

pip install -r requirements.txt

Data Preparation

Download the required data

We use the training data from SeeClick, and the test data from SeeClick and OmniACT under the UGround setting.

Follow the instructions in those repos to download all the data into the data/ folder.

Preprocess the data

After downloading the data, use the scripts in this repo to preprocess it.

  1. Step 1: get the clean training data

    bash process.sh
  2. Step 2: get the poisoned pretraining data (a conceptual sketch of this step is given after the list)

    ## generate the poisoned pretraining data
    bash poison_utils/generate_poison_data.sh
  3. Step 3: get the poisoned-input test data (both pretraining and downstream)

    ## poisoned-input ScreenSpot test data for the pretraining task
    bash poison_utils/generate_poison_test.sh
    ## poisoned-input test data for the downstream agent tasks
    bash poison_utils/generate_poison_aitw_test.sh
    bash poison_utils/generate_poison_mind2web_test.sh
    bash poison_utils/generate_poison_omni_test.sh
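
Conceptually, the poisoning step pastes a small, near-invisible trigger patch into the screenshot and relabels the grounding answer so it points at the trigger instead of the real target element. The following is only a minimal sketch of that idea, assuming a low-amplitude noise patch as the trigger and normalized [0, 1] output coordinates; the function name poison_example and all parameters are illustrative and not the actual logic of generate_poison_data.sh.

    import random

    import numpy as np
    from PIL import Image

    def poison_example(image_path, out_path, trigger_size=20, amplitude=8):
        """Paste a faint trigger patch at a random spot and return its center
        in normalized [0, 1] coordinates, which replaces the clean grounding
        label so the agent is taught to click the trigger."""
        img = Image.open(image_path).convert("RGB")
        arr = np.asarray(img).astype(np.int16)  # copy, so the array is writable
        h, w, _ = arr.shape

        # Random top-left corner for the trigger patch.
        x0 = random.randint(0, w - trigger_size)
        y0 = random.randint(0, h - trigger_size)

        # A low-amplitude perturbation keeps the trigger hard to notice by eye.
        noise = np.random.randint(-amplitude, amplitude + 1,
                                  (trigger_size, trigger_size, 3), dtype=np.int16)
        arr[y0:y0 + trigger_size, x0:x0 + trigger_size] += noise
        Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)).save(out_path)

        # Poisoned label: the trigger center instead of the true target element.
        cx = round((x0 + trigger_size / 2) / w, 2)
        cy = round((y0 + trigger_size / 2) / h, 2)
        return cx, cy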

Caution: Different models use different grounding formats (e.g., percentages in the original SeeClick, percentages * 1000 in Qwen2-VL, absolute pixels in Qwen2.5-VL, etc.). Make sure the assistant label is consistent with the format used in each model's pretraining, otherwise clean-input performance will be abnormal.
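
For illustration, a small helper like the hypothetical one below can keep the three conventions straight when writing assistant labels; the function name, format names, and two-decimal rounding are assumptions, not something this repo defines.

    def to_label_format(point, image_size, fmt):
        """Convert an (x, y) click point in absolute pixels into the grounding
        format a given model family expects (names here are illustrative)."""
        x, y = point
        w, h = image_size
        if fmt == "percentage":       # original SeeClick: relative floats in [0, 1]
            return round(x / w, 2), round(y / h, 2)
        if fmt == "percentage_1000":  # Qwen2-VL: relative coordinates scaled to [0, 1000]
            return round(x / w * 1000), round(y / h * 1000)
        if fmt == "absolute":         # Qwen2.5-VL: raw pixel coordinates
            return round(x), round(y)
        raise ValueError(f"unknown format: {fmt}")

    # The same click point serialized three ways for a 1920x1080 screenshot:
    print(to_label_format((960, 540), (1920, 1080), "percentage"))       # (0.5, 0.5)
    print(to_label_format((960, 540), (1920, 1080), "percentage_1000"))  # (500, 500)
    print(to_label_format((960, 540), (1920, 1080), "absolute"))         # (960, 540)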

Train the model

We use 2U1/Qwen2-VL-Finetune and LLaMA-Factory to train the model.

There may be a bbox offset problem when training the Qwen2-VL-7B and Qwen2.5-VL series, as shown in the issue; pay special attention to your transformers version.
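
A low-effort safeguard is to record the transformers version with each training run so a bbox offset can later be traced back to it. The snippet below is a generic sketch; the known-good version is a placeholder to fill in from the issue, not a value this repo specifies.

    # Record the environment before training so an offset can be traced back to
    # the transformers version; KNOWN_GOOD is a placeholder to fill in from the
    # issue discussion, not a value recommended by this repo.
    import transformers

    KNOWN_GOOD = None  # e.g. the version reported to work for Qwen2-VL bboxes
    print("transformers version:", transformers.__version__)
    if KNOWN_GOOD is not None and transformers.__version__ != KNOWN_GOOD:
        print(f"warning: expected transformers=={KNOWN_GOOD}, "
              f"got {transformers.__version__}")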

Evaluation

Use the scripts under scripts/ to evaluate. Note that when reproducing the pretraining poisoned-input results, you may need to use pretrain/re_eval.py to align the target_bbox_size to the average or minimal target bbox of the clean-input test set, for a fair comparison that stays consistent across different trigger sizes.
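
The alignment idea can be sketched as follows: keep each target bbox's center but replace its width and height with a fixed aligned size (e.g., the average clean-test bbox), then count a prediction as a hit if the predicted click point falls inside the resized box. This is an illustrative sketch under those assumptions, not the exact logic of pretrain/re_eval.py.

    def hit_with_aligned_bbox(pred_point, target_bbox, aligned_size):
        """Score a predicted click against a target bbox whose width/height are
        replaced by a fixed aligned size, so results stay comparable across
        trigger sizes (a sketch, not the exact logic of pretrain/re_eval.py)."""
        px, py = pred_point
        x0, y0, x1, y1 = target_bbox
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2   # keep the original bbox center
        aw, ah = aligned_size                   # e.g. the average clean-test bbox
        return (cx - aw / 2 <= px <= cx + aw / 2) and (cy - ah / 2 <= py <= cy + ah / 2)

    # A prediction 30 px right of center is a hit only if the aligned width is >= 60 px.
    print(hit_with_aligned_bbox((530, 300), (480, 280, 520, 320), (60, 40)))  # True
    print(hit_with_aligned_bbox((530, 300), (480, 280, 520, 320), (40, 40)))  # False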

Citation

Please cite our paper if you find the repo helpful in your work:

@inproceedings{yeVisualTrapStealthyBackdoor2025,
  title = {{{VisualTrap}}: A Stealthy Backdoor Attack on {{GUI}} Agents via Visual Grounding Manipulation},
  shorttitle = {{{VisualTrap}}},
  booktitle = {Second {{Conference}} on {{Language Modeling}}},
  author = {Ye, Ziang and Zhang, Yang and Shi, Wentao and You, Xiaoyu and Feng, Fuli and Chua, Tat-Seng},
  year = {2025},
  month = aug,
  langid = {english},
}
