---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- agent
- computer-use
- gui-grounding
- vision-language
metrics:
- accuracy
---

# GroundNext-7B-V0

<p align="center">
&nbsp;&nbsp;🌐 <a href="https://groundcua.github.io">Website</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Model</a>&nbsp;&nbsp;
</p>

## Highlights

**GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features:

- **Superior grounding accuracy** achieving 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision benchmarks
- **Exceptional cross-platform generalization** with 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training
- **Data-efficient training** achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work
- **Strong agentic capabilities** reaching 50.6% overall success rate on OSWorld when paired with reasoning models
- **Native tool-calling support** with built-in computer use action space for mouse, keyboard, and screen interactions

## Model Overview

**GroundNext-7B-V0** has the following characteristics:
- **Type**: Vision-Language Model for GUI Grounding
- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
- **Number of Parameters**: 7.0B
- **Training Data**: 700K human-annotated desktop demonstrations from GroundCUA dataset
- **Context Length**: 262,144 tokens (inherited from base model)
- **Specialization**: Desktop GUI element grounding with cross-platform generalization

For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io).

## Performance

### Desktop Grounding Benchmarks

|                    | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| ------------------ | ------------- | ----------- | ----------------- |
| **ScreenSpot-Pro** | 29.7          | 38.1        | **52.9**          |
| **OSWorld-G**      | 42.7          | 57.1        | **67.7**          |
| **UI-Vision**      | 16.5          | 25.5        | **60.3**          |
| **Avg (Desktop)**  | 29.6          | 40.2        | **60.3**          |

### Cross-Platform Generalization (Desktop, Mobile & Web)

|                      | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| -------------------- | ------------- | ----------- | ----------------- |
| **MMBench-GUI**      | 33.9          | 74.3        | **81.1**          |
| **ScreenSpot-v2**    | 88.8          | 90.3        | **90.4**          |
| **Avg (Mobile/Web)** | 61.4          | 82.3        | **85.8**          |


### Agentic Performance on OSWorld

When combined with OpenAI o3 for reasoning, **GroundNext-7B-V0** demonstrates strong end-to-end computer use capabilities:

| Model | OS | Office | Daily | Pro | Workflow | Overall |
|--- | --- | --- | --- | --- | --- | --- |
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** |
| **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 |

*Note: GroundNext-7B-V0 results with o3 integration forthcoming.*

## Quickstart

GroundNext-7B-V0 is compatible with the latest Hugging Face `transformers` library and follows the Qwen2.5-VL implementation.

Because the model relies on `Qwen2_5_VLForConditionalGeneration`, you need a `transformers` release that includes Qwen2.5-VL support; we recommend `transformers>=4.49.0`, as older versions may raise compatibility errors.

### Installation

```bash
pip install "transformers>=4.49.0" torch torchvision accelerate
pip install qwen-vl-utils  # For image processing utilities
```

### Basic Inference

The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua  # project helper utilities (see the GroundCUA GitHub repository)
import io
from urllib.request import urlopen

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False  # greedy decoding for deterministic grounding (temperature is ignored)
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
```

### Deployment with vLLM

For production deployment, you can use vLLM to create OpenAI-compatible API endpoints:

**vLLM**:
```bash
vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192
```

**Note**: Adjust `--max-model-len` based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
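
As a sketch of how a client could query the served endpoint: vLLM exposes the standard OpenAI chat-completions API, so the `openai` Python client can send a screenshot plus an instruction. The host/port, screenshot path, and bare user message below are illustrative assumptions; for the exact system prompt and message construction, reuse the GroundCUA helpers from the Quickstart.

```python
# Minimal client sketch, assuming the vLLM server above is running locally on the default port.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local screenshot as a base64 data URL (path is a placeholder)
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ServiceNow/GroundNext-7B-V0",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click on the 'File' button"},
            ],
        }
    ],
    temperature=0.0,   # deterministic grounding
    max_tokens=128,    # sufficient for a single tool call
)
print(response.choices[0].message.content)
```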

## Best Practices

To achieve optimal grounding performance, we recommend:

1. **Image Preprocessing**:
   - Use high-resolution screenshots (minimum 800x600)
   - Ensure UI elements are clearly visible
   - Maintain original aspect ratios when resizing

2. **Prompt Engineering**:
   - Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save")
   - Include element attributes when available (color, position, text)

3. **Generation Parameters**:
   - Use `temperature=0.0` for deterministic grounding
   - Set `max_new_tokens=128` (sufficient for tool calls)
   - Enable `use_cache=True` for faster inference

4. **System Prompt**:
   - Always include the system prompt with actual screen dimensions
   - Replace `{width}` and `{height}` with true screenshot dimensions
   - Maintain the tool signature format for proper JSON parsing

5. **Post-processing** (a minimal parsing sketch follows this list):
   - Parse `<tool_call>` tags to extract JSON
   - Validate coordinates are within screen bounds
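
A minimal post-processing sketch, assuming the model returns a single `<tool_call>...</tool_call>` block in the format shown in the Quickstart; the helper name, error handling, and example coordinates are ours, for illustration only:

```python
import json
import re

def parse_tool_call(response: str, width: int, height: int) -> dict:
    """Extract the computer_use tool call and validate its coordinates (illustrative helper)."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError("No <tool_call> block found in model output")
    call = json.loads(match.group(1))
    args = call.get("arguments", {})
    if "coordinate" in args:
        x, y = args["coordinate"]
        # Reject predictions that fall outside the screenshot bounds
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"Predicted coordinate ({x}, {y}) is outside the {width}x{height} screen")
    return call

# Example using the output format shown in the Quickstart (coordinates are made up)
example = '<tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [132, 48]}}</tool_call>'
print(parse_tool_call(example, width=1920, height=1080))
```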

## Training

GroundNext-7B-V0 was trained using a two-stage approach:

1. **Supervised Fine-tuning (SFT)**: Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
2. **Reinforcement Learning (RLOO)**: Further optimized using reward-based learning with custom GUI grounding rewards (a short illustration of the RLOO advantage computation follows below)
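
As background for the RL stage, RLOO (REINFORCE Leave-One-Out) baselines each of k sampled responses against the mean reward of the other k-1 samples, which reduces variance without a learned critic. The snippet below is a generic illustration of that advantage computation, not the project's training code; the reward values are placeholders.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages: each sample's reward minus the mean reward of the other samples."""
    k = rewards.shape[0]
    baselines = (rewards.sum() - rewards) / (k - 1)  # mean of the other k-1 rewards
    return rewards - baselines

# Placeholder rewards for k=4 sampled grounding attempts (e.g., 1.0 = click lands inside the target element)
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(rloo_advantages(rewards))  # positive for rewarded samples, negative otherwise
```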

For detailed training instructions, dataset preparation, and reproduction steps, please visit our [GitHub repository](https://github.com/ServiceNow/GroundCUA).

## Limitations and Future Work

- **Desktop-focused**: Primarily trained on desktop environments (though shows strong cross-platform generalization)
- **Action space**: Currently supports mouse click action only
- **Languages**: Optimized for English UI elements
- **Resolution**: Performance may vary with extremely high or low resolution images

## Citation

If you use GroundNext-7B-V0 in your research, please cite:

```bibtex
@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}, 
}
```

## License

This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the [LICENSE](https://choosealicense.com/licenses/apache-2.0/) for details.

## Acknowledgements

We thank:
- The Qwen team for the excellent Qwen2.5-VL foundation models
- The open-source community for tools and frameworks that made this work possible
- Human annotators who contributed to the GroundCUA dataset