---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- agent
- computer-use
- gui-grounding
- vision-language
metrics:
- accuracy
---

# GroundNext-7B-V0

<p align="center">
&nbsp;&nbsp;🌐 <a href="https://groundcua.github.io">Website</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Model</a>&nbsp;&nbsp;
</p>

## Highlights

**GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features:

- **Superior grounding accuracy** achieving 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision benchmarks
- **Exceptional cross-platform generalization** with 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training
- **Data-efficient training** achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work
- **Strong agentic capabilities** reaching 50.6% overall success rate on OSWorld when paired with reasoning models
- **Native tool-calling support** with built-in computer use action space for mouse, keyboard, and screen interactions

## Model Overview

**GroundNext-7B-V0** has the following characteristics:
- **Type**: Vision-Language Model for GUI Grounding
- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
- **Number of Parameters**: 7.0B
- **Training Data**: 700K human-annotated desktop demonstrations from GroundCUA dataset
- **Context Length**: 262,144 tokens (inherited from base model)
- **Specialization**: Desktop GUI element grounding with cross-platform generalization

For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io).

## Performance

### Desktop Grounding Benchmarks

|                    | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| ------------------ | ------------- | ----------- | ----------------- |
| **ScreenSpot-Pro** | 29.7          | 38.1        | **52.9**          |
| **OSWorld-G**      | 42.7          | 57.1        | **67.7**          |
| **UI-Vision**      | 16.5          | 25.5        | **60.3**          |
| **Avg (Desktop)**  | 29.6          | 40.2        | **60.3**          |

### Cross-Platform Generalization (Desktop, Mobile & Web)

|                      | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| -------------------- | ------------- | ----------- | ----------------- |
| **MMBench-GUI**      | 33.9          | 74.3        | **81.1**          |
| **ScreenSpot-v2**    | 88.8          | 90.3        | **90.4**          |
| **Avg (Mobile/Web)** | 61.4          | 82.3        | **85.8**          |


### Agentic Performance on OSWorld

When combined with OpenAI o3 for reasoning, **GroundNext-7B-V0** demonstrates strong end-to-end computer use capabilities:

| Model | OS | Office | Daily | Pro | Workflow | Overall |
|--- | --- | --- | --- | --- | --- | --- |
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** |
| **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 |

*Note: GroundNext-7B-V0 results with o3 integration forthcoming.*

## Quickstart

GroundNext-7B-V0 is compatible with the latest Hugging Face `transformers` library and follows the Qwen2.5-VL implementation.

Because the model relies on `Qwen2_5_VLForConditionalGeneration`, you need a `transformers` release that includes Qwen2.5-VL support; we recommend `transformers>=4.49.0`, as older versions may raise compatibility errors.

### Installation

```bash
pip install "transformers>=4.49.0" torch torchvision accelerate
pip install qwen-vl-utils  # For image processing utilities
```

### Basic Inference

The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua  # project helper utilities (see the GroundCUA GitHub repository)
import io
from urllib.request import urlopen

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False  # greedy decoding for deterministic grounding (temperature is ignored)
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
```

### Deployment with vLLM

For production deployment, you can use vLLM to create OpenAI-compatible API endpoints:

**vLLM**:
```bash
vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192
```

**Note**: Adjust `--max-model-len` based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
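
As a sketch of how a client could query the served endpoint: vLLM exposes the standard OpenAI chat-completions API, so the `openai` Python client can send a screenshot plus an instruction. The host/port, screenshot path, and bare user message below are illustrative assumptions; for the exact system prompt and message construction, reuse the GroundCUA helpers from the Quickstart.

```python
# Minimal client sketch, assuming the vLLM server above is running locally on the default port.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local screenshot as a base64 data URL (path is a placeholder)
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="ServiceNow/GroundNext-7B-V0",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click on the 'File' button"},
            ],
        }
    ],
    temperature=0.0,   # deterministic grounding
    max_tokens=128,    # sufficient for a single tool call
)
print(response.choices[0].message.content)
```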

## Best Practices

To achieve optimal grounding performance, we recommend:

1. **Image Preprocessing**:
   - Use high-resolution screenshots (minimum 800x600)
   - Ensure UI elements are clearly visible
   - Maintain original aspect ratios when resizing

2. **Prompt Engineering**:
   - Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save")
   - Include element attributes when available (color, position, text)

3. **Generation Parameters**:
   - Use `temperature=0.0` for deterministic grounding
   - Set `max_new_tokens=128` (sufficient for tool calls)
   - Enable `use_cache=True` for faster inference

4. **System Prompt**:
   - Always include the system prompt with actual screen dimensions
   - Replace `{width}` and `{height}` with true screenshot dimensions
   - Maintain the tool signature format for proper JSON parsing

5. **Post-processing** (a minimal parsing sketch follows this list):
   - Parse `<tool_call>` tags to extract JSON
   - Validate coordinates are within screen bounds
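
A minimal post-processing sketch, assuming the model returns a single `<tool_call>...</tool_call>` block in the format shown in the Quickstart; the helper name, error handling, and example coordinates are ours, for illustration only:

```python
import json
import re

def parse_tool_call(response: str, width: int, height: int) -> dict:
    """Extract the computer_use tool call and validate its coordinates (illustrative helper)."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError("No <tool_call> block found in model output")
    call = json.loads(match.group(1))
    args = call.get("arguments", {})
    if "coordinate" in args:
        x, y = args["coordinate"]
        # Reject predictions that fall outside the screenshot bounds
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"Predicted coordinate ({x}, {y}) is outside the {width}x{height} screen")
    return call

# Example using the output format shown in the Quickstart (coordinates are made up)
example = '<tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [132, 48]}}</tool_call>'
print(parse_tool_call(example, width=1920, height=1080))
```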

## Training

GroundNext-7B-V0 was trained using a two-stage approach:

1. **Supervised Fine-tuning (SFT)**: Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
2. **Reinforcement Learning (RLOO)**: Further optimized using reward-based learning with custom GUI grounding rewards (a short illustration of the RLOO advantage computation follows below)
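
As background for the RL stage, RLOO (REINFORCE Leave-One-Out) baselines each of k sampled responses against the mean reward of the other k-1 samples, which reduces variance without a learned critic. The snippet below is a generic illustration of that advantage computation, not the project's training code; the reward values are placeholders.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages: each sample's reward minus the mean reward of the other samples."""
    k = rewards.shape[0]
    baselines = (rewards.sum() - rewards) / (k - 1)  # mean of the other k-1 rewards
    return rewards - baselines

# Placeholder rewards for k=4 sampled grounding attempts (e.g., 1.0 = click lands inside the target element)
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(rloo_advantages(rewards))  # positive for rewarded samples, negative otherwise
```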

For detailed training instructions, dataset preparation, and reproduction steps, please visit our [GitHub repository](https://github.com/ServiceNow/GroundCUA).

## Limitations and Future Work

- **Desktop-focused**: Primarily trained on desktop environments (though shows strong cross-platform generalization)
- **Action space**: Currently supports mouse click action only
- **Languages**: Optimized for English UI elements
- **Resolution**: Performance may vary with extremely high or low resolution images

## Citation

If you use GroundNext-7B-V0 in your research, please cite:

```bibtex
@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}, 
}
```

## License

This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the [LICENSE](https://choosealicense.com/licenses/apache-2.0/) for details.

## Acknowledgements

We thank:
- The Qwen team for the excellent Qwen2.5-VL foundation models
- The open-source community for tools and frameworks that made this work possible
- Human annotators who contributed to the GroundCUA dataset