Improve model card: Add license, pipeline tag, update links, and add sample usage
This PR significantly enhances the model card by:
* Setting the `license` to `apache-2.0`.
* Setting the `pipeline_tag` to `image-text-to-text`, ensuring better discoverability at https://huggingface.co/models?pipeline_tag=image-text-to-text.
* Updating the model card's title to "V-Droid: Advancing Mobile GUI Agents".
* Linking the paper title in the introduction to the Hugging Face paper page: https://huggingface.co/papers/2503.15937.
* Updating the code repository link to `https://github.com/V-Droid-Agent/V-Droid`.
* Renaming "Demo" to "Project Page" under "Model Sources" for clearer terminology.
* Including a comprehensive `Sample Usage` section with a Python code snippet directly from the GitHub README to help users quickly get started with the model.
* Improving the formatting of the BibTeX citation.
* Removing placeholder `[optional]` tags from headers.
Please review these improvements.
The updated model card reads as follows (unchanged regions elided with `[...]`):
---
library_name: transformers
tags: []
license: apache-2.0
pipeline_tag: image-text-to-text
---

# V-Droid: Advancing Mobile GUI Agents

This repository contains the model for the paper "[Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment](https://huggingface.co/papers/2503.15937)" (arXiv:2503.15937).

V-Droid is a novel mobile GUI task automation agent that leverages Large Language Models (LLMs) as verifiers rather than generators. This verifier-driven paradigm, combined with a discretized action space and a prefilling-only workflow, allows V-Droid to achieve state-of-the-art performance on public benchmarks while maintaining near-real-time decision-making capabilities.
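To make the prefilling-only idea concrete, here is a minimal, unofficial sketch (not V-Droid's actual code): each candidate action is scored with a single forward (prefilling) pass by reading off the probability of a hypothetical "Yes" verdict token, so no autoregressive decoding is needed. A tiny randomly initialised GPT-2 stands in for the fine-tuned Llama verifier so the snippet runs offline; the token ids are invented for illustration.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny randomly initialised model so the sketch runs offline;
# V-Droid itself uses a fine-tuned Llama-3.1-8B verifier.
model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=100))
model.eval()

YES, NO = 10, 11  # hypothetical ids of the "Yes"/"No" verdict tokens

@torch.no_grad()
def verifier_scores(prompts: torch.Tensor) -> torch.Tensor:
    """One prefilling pass per candidate -- no decoding loop.

    `prompts` is a (num_candidates, seq_len) batch of token ids, each
    encoding (UI state, task, candidate action). The score is the
    probability mass on the "Yes" token at the final position.
    """
    logits = model(prompts).logits[:, -1, :]            # single forward pass
    probs = torch.softmax(logits[:, [YES, NO]], dim=-1)  # Yes-vs-No verdict
    return probs[:, 0]                                   # P(Yes) per candidate

# Score 4 candidate actions in one batch and act greedily.
prompts = torch.randint(0, 100, (4, 32))
scores = verifier_scores(prompts)
best = int(scores.argmax())
print(scores.shape, best)
```

Because every candidate needs only a prefill (and the verdict tokens share a common prompt prefix), scoring parallelises across the batch, which is what keeps per-step latency near real time.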
[...]

- **Developed by:** Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu
- **Model type:** Verifier for Mobile GUI Agent
- **Language(s) (NLP):** English
- **Finetuned from model:** Llama-3.1-8B-4bit

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/V-Droid-Agent/V-Droid
- **Paper:** https://doi.org/10.48550/arXiv.2503.15937
- **Project Page:** https://v-droid-agent.github.io/
## Uses

[...]

The V-Droid verifier model is intended to be used as a core component within the V-Droid agent framework. It takes an HTML description of the mobile device's current UI, a task description, and a candidate action as input, and outputs a score estimating how likely that action is to contribute to completing the task.
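Within the agent loop, these scores drive greedy selection over the discretized candidate set. The following is an illustrative sketch only: `verifier_score` is a hypothetical stand-in (a toy keyword heuristic) for the LLM verifier pass, and the action strings are invented for the example.

```python
def verifier_score(state_html: str, task: str, action: str) -> float:
    """Hypothetical stand-in for one verifier pass.

    In V-Droid this would be the LLM's confidence that `action` helps
    complete `task` given the HTML UI state; a toy keyword match keeps
    the sketch runnable.
    """
    return 1.0 if any(w in action.lower() for w in task.lower().split()) else 0.0

def select_action(state_html: str, task: str, candidates: list[str]) -> str:
    # Score every candidate from the discretized action space,
    # then act greedily on the highest verifier score.
    return max(candidates, key=lambda a: verifier_score(state_html, task, a))

state = "<button id='3'>Save</button><button id='4'>Cancel</button>"
task = "save the current note"
candidates = ["click(element_id=4, text='cancel')", "click(element_id=3, text='save')"]
print(select_action(state, task, candidates))  # → click(element_id=3, text='save')
```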
### Downstream Use

The principles and architecture of V-Droid can be adapted for other GUI automation tasks beyond mobile environments, such as web or desktop applications. The verifier-driven approach could also be explored for other types of autonomous agents where action validation is critical.
[...]

This model is not designed for generating free-form text or engaging in conversational chat. It is specifically tailored for the task of action verification in a mobile GUI context. Using it for purposes outside of this scope is likely to yield poor results. The model should not be used for any applications that could cause harm, violate privacy, or conduct malicious activities.

## Sample Usage
To quickly get started with the V-Droid model, here is a sample code snippet from the GitHub repository:

```python
import pae
from pae.models import LlavaAgent, ClaudeAgent
from accelerate import Accelerator
import torch
from tqdm import tqdm
from types import SimpleNamespace
from pae.environment.webgym import BatchedWebEnv
import os
from llava.model.language_model.llava_mistral import LlavaMistralForCausalLM

# ============= Instantiate the agent =============
config_dict = {"use_lora": False,
               "use_q4": False,  # our 34b model is quantized to 4-bit, set it to True if you are using 34B model
               "use_anyres": False,
               "temperature": 1.0,
               "max_new_tokens": 512,
               "train_vision": False,
               "num_beams": 1,}
config = SimpleNamespace(**config_dict)

accelerator = Accelerator()
agent = LlavaAgent(policy_lm = "yifeizhou/pae-llava-7b",  # alternate models "yifeizhou/pae-llava-7b-webarena", "yifeizhou/pae-llava-34b"
                   device = accelerator.device,
                   accelerator = accelerator,
                   config = config)

# ============= Instantiate the environment =============
test_tasks = [{"web_name": "Google Map",
               "id": "0",
               "ques": "Locate a parking lot near the Brooklyn Bridge that open 24 hours. Review the user comments about it.",
               "web": "https://www.google.com/maps/"}]
save_path = "xxx"

test_env = BatchedWebEnv(tasks = test_tasks,
                         do_eval = False,
                         download_dir=os.path.join(save_path, 'test_driver', 'download'),
                         output_dir=os.path.join(save_path, 'test_driver', 'output'),
                         batch_size=1,
                         max_iter=10,)
# for you to check the images and actions
image_histories = []  # stores the history of the paths of images
action_histories = []  # stores the history of actions

results = test_env.reset()
image_histories.append(results[0][0]["image"])

observations = [r[0] for r in results]
actions = agent.get_action(observations)
action_histories.append(actions[0])
dones = None

for _ in tqdm(range(3)):
    if dones is not None and all(dones):
        break
    results = test_env.step(actions)
    image_histories.append(results[0][0]["image"])
    observations = [r[0] for r in results]
    actions = agent.get_action(observations)
    action_histories.append(actions[0])
    dones = [r[2] for r in results]

print("Done!")
print("image_histories: ", image_histories)
print("action_histories: ", action_histories)
```
## Bias, Risks, and Limitations

The performance of V-Droid is dependent on the quality and diversity of the training data. As with any model trained on specific data distributions, it may exhibit biases present in the training set. The model's ability to generalize to unseen applications or radically different UI layouts may be limited.

[...]

## Citation

**BibTeX:**

```bibtex
@article{dai2025advancing,
  title={Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment},
  author={Dai, Gaole and Jiang, Shiqi and Cao, Ting and Li, Yuanchun and Yang, Yuqing and Tan, Rui and Li, Mo and Qiu, Lili},
  journal={arXiv preprint arXiv:2503.15937},
  year={2025}
}
```