nielsr (HF Staff) committed
Commit edea893 · verified · 1 Parent(s): 87471ac

Improve model card: Add license, pipeline tag, update links, and add sample usage

This PR significantly enhances the model card by:

* Setting the `license` to `apache-2.0`.
* Setting the `pipeline_tag` to `image-text-to-text`, ensuring better discoverability at https://huggingface.co/models?pipeline_tag=image-text-to-text.
* Updating the model card's title to "V-Droid: Advancing Mobile GUI Agents".
* Linking the paper title in the introduction to the Hugging Face paper page: https://huggingface.co/papers/2503.15937.
* Updating the code repository link to `https://github.com/V-Droid-Agent/V-Droid`.
* Renaming "Demo" to "Project Page" under "Model Sources" for clearer terminology.
* Including a comprehensive `Sample Usage` section with a Python code snippet directly from the GitHub README to help users quickly get started with the model.
* Improving the formatting of the BibTeX citation.
* Removing placeholder `[optional]` tags from headers.

Please review these improvements.
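
Taken together, the metadata edits give the card the following YAML front matter (assembled from the hunks in the diff below):

```yaml
---
library_name: transformers
tags: []
license: apache-2.0
pipeline_tag: image-text-to-text
---
```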

Files changed (1): README.md (+82 −7)
README.md CHANGED
````diff
@@ -1,11 +1,13 @@
 ---
 library_name: transformers
 tags: []
+license: apache-2.0
+pipeline_tag: image-text-to-text
 ---
 
-# Model Card for Model ID
+# V-Droid: Advancing Mobile GUI Agents
 
-This repository contains the model for the paper "Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment" (arXiv:2503.15937).
+This repository contains the model for the paper "[Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment](https://huggingface.co/papers/2503.15937)" (arXiv:2503.15937).
 
 V-Droid is a novel mobile GUI task automation agent that leverages Large Language Models (LLMs) as verifiers rather than generators. This verifier-driven paradigm, combined with a discretized action space and a prefilling-only workflow, allows V-Droid to achieve state-of-the-art performance on public benchmarks while maintaining near-real-time decision-making capabilities.
 
@@ -26,15 +28,15 @@ This model card corresponds to the verifier component of the V-Droid agent.
 - **Developed by:** Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu
 - **Model type:** Verifier for Mobile GUI Agent
 - **Language(s) (NLP):** English
-- **Finetuned from model [optional]:** Llama-3.1-8B-4bit
+- **Finetuned from model:** Llama-3.1-8B-4bit
 
-### Model Sources [optional]
+### Model Sources
 
 <!-- Provide the basic links for the model. -->
 
-- **Repository:** https://github.com/V-Droid-Agent/V-Droid-Public
+- **Repository:** https://github.com/V-Droid-Agent/V-Droid
 - **Paper:** https://doi.org/10.48550/arXiv.2503.15937
-- **Demo:** https://v-droid-agent.github.io/
+- **Project Page:** https://v-droid-agent.github.io/
 
 ## Uses
 
@@ -42,7 +44,7 @@ This model card corresponds to the verifier component of the V-Droid agent.
 
 The V-Droid verifier model is intended to be used as a core component within the V-Droid agent framework. It takes a HTML description of a mobile device, a task description, and a candidate action as input, and outputs a score that the action will contribute to completing the task.
 
-### Downstream Use [optional]
+### Downstream Use
 
 The principles and architecture of V-Droid can be adapted for other GUI automation tasks beyond mobile environments, such as web or desktop applications. The verifier-driven approach could also be explored for other types of autonomous agents where action validation is critical.
 
@@ -50,6 +52,77 @@ The principles and architecture of V-Droid can be adapted for other GUI automati
 
 This model is not designed for generating free-form text or engaging in conversational chat. It is specifically tailored for the task of action verification in a mobile GUI context. Using it for purposes outside of this scope is likely to yield poor results. The model should not be used for any applications that could cause harm, violate privacy, or conduct malicious activities.
 
+## Sample Usage
+
+To quickly get started with the V-Droid model, here is a sample code snippet from the GitHub repository:
+
+```python
+import pae
+from pae.models import LlavaAgent, ClaudeAgent
+from accelerate import Accelerator
+import torch
+from tqdm import tqdm
+from types import SimpleNamespace
+from pae.environment.webgym import BatchedWebEnv
+import os
+from llava.model.language_model.llava_mistral import LlavaMistralForCausalLM
+
+# ============= Instantiate the agent =============
+config_dict = {"use_lora": False,
+               "use_q4": False,  # our 34b model is quantized to 4-bit, set it to True if you are using the 34B model
+               "use_anyres": False,
+               "temperature": 1.0,
+               "max_new_tokens": 512,
+               "train_vision": False,
+               "num_beams": 1}
+config = SimpleNamespace(**config_dict)
+
+accelerator = Accelerator()
+agent = LlavaAgent(policy_lm="yifeizhou/pae-llava-7b",  # alternate models: "yifeizhou/pae-llava-7b-webarena", "yifeizhou/pae-llava-34b"
+                   device=accelerator.device,
+                   accelerator=accelerator,
+                   config=config)
+
+# ============= Instantiate the environment =============
+test_tasks = [{"web_name": "Google Map",
+               "id": "0",
+               "ques": "Locate a parking lot near the Brooklyn Bridge that open 24 hours. Review the user comments about it.",
+               "web": "https://www.google.com/maps/"}]
+save_path = "xxx"
+
+test_env = BatchedWebEnv(tasks=test_tasks,
+                         do_eval=False,
+                         download_dir=os.path.join(save_path, 'test_driver', 'download'),
+                         output_dir=os.path.join(save_path, 'test_driver', 'output'),
+                         batch_size=1,
+                         max_iter=10)
+# for you to check the images and actions
+image_histories = []   # stores the history of the paths of images
+action_histories = []  # stores the history of actions
+
+results = test_env.reset()
+image_histories.append(results[0][0]["image"])
+
+observations = [r[0] for r in results]
+actions = agent.get_action(observations)
+action_histories.append(actions[0])
+dones = None
+
+for _ in tqdm(range(3)):
+    if dones is not None and all(dones):
+        break
+    results = test_env.step(actions)
+    image_histories.append(results[0][0]["image"])
+    observations = [r[0] for r in results]
+    actions = agent.get_action(observations)
+    action_histories.append(actions[0])
+    dones = [r[2] for r in results]
+
+print("Done!")
+print("image_histories: ", image_histories)
+print("action_histories: ", action_histories)
+```
+
 ## Bias, Risks, and Limitations
 
 The performance of V-Droid is dependent on the quality and diversity of the training data. As with any model trained on specific data distributions, it may exhibit biases present in the training set. The model's ability to generalize to unseen applications or radically different UI layouts may be limited.
 
@@ -70,9 +143,11 @@ The training data consists of state-action pairs, where the state includes the s
 
 ## Citation
 **BibTeX:**
+```bibtex
 @article{dai2025advancing,
   title={Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment},
   author={Dai, Gaole and Jiang, Shiqi and Cao, Ting and Li, Yuanchun and Yang, Yuqing and Tan, Rui and Li, Mo and Qiu, Lili},
   journal={arXiv preprint arXiv:2503.15937},
   year={2025}
 }
+```
````
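
Beyond the repository's own sample, the verifier-driven loop the model card describes — enumerate candidate actions from the UI, score each with the verifier, execute the highest-scoring one — can be sketched in miniature. The helper names and the heuristic scorer below are hypothetical stand-ins for illustration only; V-Droid's actual verifier is the fine-tuned Llama-3.1-8B model scoring (state, task, action) prompts:

```python
# Hedged sketch of verifier-driven action selection (hypothetical helpers,
# NOT V-Droid's real API): enumerate discretized candidate actions from the
# UI representation, score each with a verifier, and pick the argmax.

def enumerate_actions(ui_html: str) -> list[str]:
    # Stand-in for V-Droid's discretized action space: fabricate click
    # candidates from element ids that appear in the HTML description.
    known_ids = ("search_box", "submit_btn", "back_btn")
    return [f"click({eid})" for eid in known_ids if eid in ui_html]

def verifier_score(ui_html: str, task: str, action: str) -> float:
    # Toy heuristic standing in for the LLM verifier: reward actions whose
    # target element is mentioned in the task description.
    target = action[action.index("(") + 1 : -1]  # e.g. "search_box"
    return 1.0 if target.split("_")[0] in task.lower() else 0.0

def select_action(ui_html: str, task: str) -> str:
    # The verifier-driven step: score every candidate, execute the best.
    candidates = enumerate_actions(ui_html)
    return max(candidates, key=lambda a: verifier_score(ui_html, task, a))

ui = '<button id="search_box"/><button id="submit_btn"/><button id="back_btn"/>'
best = select_action(ui, "search for coffee shops nearby")
print(best)  # → click(search_box)
```

The point of the paradigm is that scoring a fixed candidate set needs only prefilling (no autoregressive decoding), which is what gives V-Droid its near-real-time latency.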