---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- agent
- computer-use
- gui-grounding
- vision-language
metrics:
- accuracy
---
# GroundNext-7B-V0
<p align="center">
  🌐 <a href="https://groundcua.github.io">Website</a>   |   📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a>   |   🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a>   |   🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Model</a>  
</p>
## Highlights
**GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features:
- **Superior grounding accuracy** achieving 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision benchmarks
- **Exceptional cross-platform generalization** with 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training
- **Data-efficient training** achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work
- **Strong agentic capabilities** reaching 50.6% overall success rate on OSWorld when paired with reasoning models
- **Native tool-calling support** with built-in computer use action space for mouse, keyboard, and screen interactions
## Model Overview
**GroundNext-7B-V0** has the following characteristics:
- **Type**: Vision-Language Model for GUI Grounding
- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
- **Number of Parameters**: 7.0B
- **Training Data**: 700K human-annotated desktop demonstrations from GroundCUA dataset
- **Context Length**: 262,144 tokens (inherited from base model)
- **Specialization**: Desktop GUI element grounding with cross-platform generalization
For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io).
## Performance
### Desktop Grounding Benchmarks
| | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| ------------------ | ------------- | ----------- | ----------------- |
| **ScreenSpot-Pro** | 29.7 | 38.1 | **52.9** |
| **OSWorld-G** | 42.7 | 57.1 | **67.7** |
| **UI-Vision** | 16.5 | 25.5 | **60.3** |
| **Avg (Desktop)** | 29.6 | 40.2 | **60.3** |
### Cross-Platform Generalization (Desktop, Mobile & Web)
| | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| -------------------- | ------------- | ----------- | ----------------- |
| **MMBench-GUI** | 33.9 | 74.3 | **81.1** |
| **ScreenSpot-v2** | 88.8 | 90.3 | **90.4** |
| **Avg (Mobile/Web)** | 61.4 | 82.3 | **85.8** |
### Agentic Performance on OSWorld
When combined with OpenAI o3 for reasoning, GroundNext models demonstrate strong end-to-end computer use capabilities:
| Model | OS | Office | Daily | Pro | Workflow | Overall |
|--- | --- | --- | --- | --- | --- | --- |
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** |
| **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 |
*Note: GroundNext-7B-V0 results with o3 integration forthcoming.*
## Quickstart
GroundNext-7B-V0 follows the Qwen2.5-VL implementation and works with recent versions of the Hugging Face `transformers` library.
Qwen2.5-VL support was added in `transformers` 4.49.0, so older versions will fail to load the model; we recommend `transformers>=4.49.0`.
### Installation
```bash
pip install "transformers>=4.49.0" torch torchvision accelerate
pip install qwen-vl-utils # For image processing utilities
```
### Basic Inference
The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:
```python
import io
from urllib.request import urlopen

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, Qwen2_5_VLForConditionalGeneration

import groundcua  # Helper utilities (prepare_image, create_messages, defaults) from the GroundCUA GitHub repository

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation (greedy decoding for deterministic grounding)
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare the screenshot
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Build the chat messages and run generation
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
```
### Deployment with vLLM
For production deployment, you can use vLLM to serve an OpenAI-compatible API endpoint:
```bash
vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192
```
**Note**: Adjust `--max-model-len` based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
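Once the server is running, any OpenAI-compatible client can query it. Below is a minimal sketch, assuming the server listens on vLLM's default local port 8000, the `openai` Python package is installed, and `screenshot.png` is a placeholder path; for best results, construct the system prompt and screen-dimension context the same way as the `groundcua` helpers used in the Quickstart.
```python
import base64
from openai import OpenAI  # assumes the `openai` Python package is installed

# vLLM's OpenAI-compatible server; the API key is unused for a local deployment
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local screenshot as a base64 data URL (placeholder path)
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ServiceNow/GroundNext-7B-V0",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click on the 'File' button"},
            ],
        }
    ],
    temperature=0.0,
    max_tokens=128,
)
print(response.choices[0].message.content)
```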
## Best Practices
To achieve optimal grounding performance, we recommend:
1. **Image Preprocessing**:
- Use high-resolution screenshots (minimum 800x600)
- Ensure UI elements are clearly visible
- Maintain original aspect ratios when resizing
2. **Prompt Engineering**:
- Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save")
- Include element attributes when available (color, position, text)
3. **Generation Parameters**:
- Use `temperature=0.0` for deterministic grounding
- Set `max_new_tokens=128` (sufficient for tool calls)
- Enable `use_cache=True` for faster inference
4. **System Prompt**:
- Always include the system prompt with actual screen dimensions
- Replace `{width}` and `{height}` with true screenshot dimensions
- Maintain the tool signature format for proper JSON parsing
5. **Post-processing**:
- Parse `<tool_call>` tags to extract the JSON action (see the sketch after this list)
- Validate that predicted coordinates fall within the screen bounds
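The following is a minimal post-processing sketch; `parse_tool_call` is a hypothetical helper (not part of the released code) that assumes the `<tool_call>` response format shown in the Quickstart.
```python
import json
import re

def parse_tool_call(response: str, width: int, height: int) -> dict:
    """Extract the computer_use tool call from a model response and validate its coordinate."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError("No <tool_call> block found in the response")
    call = json.loads(match.group(1))
    x, y = call["arguments"]["coordinate"]
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"Coordinate ({x}, {y}) is outside the {width}x{height} screen")
    return call

# Example with the response format shown in the Quickstart
example = '<tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [512, 384]}}</tool_call>'
print(parse_tool_call(example, width=1920, height=1080))
```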
## Training
GroundNext-7B-V0 was trained using a two-stage approach:
1. **Supervised Fine-tuning (SFT)**: Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
2. **Reinforcement Learning (RLOO)**: Further optimized using reward-based learning with custom GUI grounding rewards (a minimal sketch of the leave-one-out baseline appears below)
For detailed training instructions, dataset preparation, and reproduction steps, please visit our [GitHub repository](https://github.com/ServiceNow/GroundCUA).
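For intuition, RLOO (REINFORCE Leave-One-Out) samples k responses per prompt, scores each with the reward function, and uses the mean reward of the other k - 1 samples as that sample's baseline. The sketch below shows only this advantage computation and is illustrative; the binary reward in the example (1 when a sampled click lands on the target element, 0 otherwise) is an assumption, and the actual training code lives in the GitHub repository.
```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for k sampled responses per prompt.

    rewards: tensor of shape (batch, k), one scalar reward per sampled response.
    Each sample's baseline is the mean reward of the other k - 1 samples.
    """
    k = rewards.shape[-1]
    baseline = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Example: 4 sampled groundings for one prompt with a hypothetical binary reward
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(rloo_advantages(rewards))  # tensor([[ 0.6667, -0.6667,  0.6667, -0.6667]])
```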
## Limitations and Future Work
- **Desktop-focused**: Primarily trained on desktop environments, though it shows strong cross-platform generalization
- **Action space**: Currently supports mouse click actions only
- **Languages**: Optimized for English UI elements
- **Resolution**: Performance may vary with extremely high- or low-resolution images
## Citation
If you use GroundNext-7B-V0 in your research, please cite:
```bibtex
@misc{feizi2025groundingcomputeruseagents,
  title={Grounding Computer Use Agents on Human Demonstrations},
  author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
  year={2025},
  eprint={2511.07332},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.07332},
}
```
## License
This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the [LICENSE](https://choosealicense.com/licenses/apache-2.0/) for details.
## Acknowledgements
We thank:
- The Qwen team for the excellent Qwen2.5-VL foundation models
- The open-source community for tools and frameworks that made this work possible
- Human annotators who contributed to the GroundCUA dataset