|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- agent |
|
|
- computer-use |
|
|
- gui-grounding |
|
|
- vision-language |
|
|
metrics: |
|
|
- accuracy |
|
|
--- |
|
|
|
|
|
# GroundNext-7B-V0 |
|
|
|
|
|
<p align="center"> |
|
|
  🌐 <a href="https://groundcua.github.io">Website</a>   |   📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a>   |   🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a>   |   🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Model</a>   |
|
|
</p> |
|
|
|
|
|
## Highlights |
|
|
|
|
|
**GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features: |
|
|
|
|
|
- **Superior grounding accuracy** achieving 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision benchmarks |
|
|
- **Exceptional cross-platform generalization** with 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training |
|
|
- **Data-efficient training** achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work |
|
|
- **Strong agentic capabilities** reaching 50.6% overall success rate on OSWorld when paired with reasoning models |
|
|
- **Native tool-calling support** with a built-in computer-use action space for mouse, keyboard, and screen interactions (example output below)
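
Grounding predictions are emitted as structured tool calls. A click prediction takes the form below (as produced by the quickstart example later in this card), where `x` and `y` are the predicted pixel coordinates within the screenshot:

```
<tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
```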
|
|
|
|
|
## Model Overview |
|
|
|
|
|
**GroundNext-7B-V0** has the following characteristics: |
|
|
- **Type**: Vision-Language Model for GUI Grounding |
|
|
- **Base Model**: Qwen2.5-VL-7B-Instruct |
|
|
- **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO) |
|
|
- **Number of Parameters**: 7.0B |
|
|
- **Training Data**: 700K human-annotated desktop demonstrations from GroundCUA dataset |
|
|
- **Context Length**: inherited from the Qwen2.5-VL-7B-Instruct base model
|
|
- **Specialization**: Desktop GUI element grounding with cross-platform generalization |
|
|
|
|
|
For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io). |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Desktop Grounding Benchmarks |
|
|
|
|
|
| | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** | |
|
|
| ------------------ | ------------- | ----------- | ----------------- | |
|
|
| **ScreenSpot-Pro** | 29.7 | 38.1 | **52.9** | |
|
|
| **OSWorld-G** | 42.7 | 57.1 | **67.7** | |
|
|
| **UI-Vision** | 16.5 | 25.5 | **60.3** | |
|
|
| **Avg (Desktop)** | 29.6 | 40.2 | **60.3** | |
|
|
|
|
|
### Cross-Platform Generalization (Desktop, Mobile & Web) |
|
|
|
|
|
| | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** | |
|
|
| -------------------- | ------------- | ----------- | ----------------- | |
|
|
| **MMBench-GUI** | 33.9 | 74.3 | **81.1** | |
|
|
| **ScreenSpot-v2** | 88.8 | 90.3 | **90.4** | |
|
|
| **Avg (Mobile/Web)** | 61.4 | 82.3 | **85.8** | |
|
|
|
|
|
|
|
|
### Agentic Performance on OSWorld |
|
|
|
|
|
When combined with OpenAI o3 for reasoning, **GroundNext-7B-V0** demonstrates strong end-to-end computer use capabilities: |
|
|
|
|
|
| Model | OS | Office | Daily | Pro | Workflow | Overall | |
|
|
|--- | --- | --- | --- | --- | --- | --- | |
|
|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 | |
|
|
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 | |
|
|
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 | |
|
|
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 | |
|
|
| JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** | |
|
|
| **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 | |
|
|
|
|
|
*Note: the table above reports the GroundNext-3B variant; GroundNext-7B-V0 results with o3 integration are forthcoming.*
|
|
|
|
|
## Quickstart |
|
|
|
|
|
GroundNext-7B-V0 uses the Qwen2.5-VL architecture and is compatible with the standard Hugging Face `transformers` implementation.

Older `transformers` releases do not include the Qwen2.5-VL model classes; we recommend `transformers>=4.49.0`.
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install "transformers>=4.49.0" torch torchvision accelerate
|
|
pip install qwen-vl-utils # For image processing utilities |
|
|
``` |
|
|
|
|
|
### Basic Inference |
|
|
|
|
|
The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor |
|
|
from PIL import Image |
|
|
import groundcua  # helper utilities (prompt construction, image preparation) from the GroundCUA GitHub repository
|
|
import io |
|
|
from urllib.request import urlopen |
|
|
|
|
|
model_name = "ServiceNow/GroundNext-7B-V0" |
|
|
|
|
|
# Load model and processor |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype=torch.bfloat16, |
|
|
attn_implementation="flash_attention_2", |
|
|
device_map="auto", |
|
|
trust_remote_code=True |
|
|
).eval() |
|
|
|
|
|
processor = AutoProcessor.from_pretrained(model_name) |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) |
|
|
|
|
|
# Configure generation (greedy decoding for deterministic grounding)
|
|
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE |
|
|
model.generation_config.do_sample = False |
|
|
model.generation_config.use_cache = True |
|
|
|
|
|
# Load and prepare image |
|
|
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png" |
|
|
image = Image.open(io.BytesIO(urlopen(url).read())) |
|
|
image, (width, height) = groundcua.prepare_image(image) |
|
|
|
|
|
# Create messages and generate |
|
|
instruction = "Click on the 'File' button" |
|
|
messages = groundcua.create_messages(instruction, image, width, height) |
|
|
|
|
|
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) |
|
|
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device) |
|
|
|
|
|
generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS) |
|
|
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)] |
|
|
|
|
|
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] |
|
|
print(response) |
|
|
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call> |
|
|
``` |
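
If you prefer the stock Qwen2.5-VL preprocessing path (via the `qwen-vl-utils` package installed above) instead of the `groundcua` helpers, the sketch below reuses the `model` and `processor` loaded above. The `SYSTEM_PROMPT` placeholder stands in for the grounding system prompt defined in the GroundCUA repository (it embeds the screenshot dimensions and the tool signature); without it the model may not emit a `<tool_call>` response.

```python
from qwen_vl_utils import process_vision_info

SYSTEM_PROMPT = "..."  # placeholder: substitute the grounding system prompt from the GroundCUA repository

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},  # a PIL image, local path, or URL
            {"type": "text", "text": "Click on the 'File' button"},
        ],
    },
]

# Standard Qwen2.5-VL preprocessing: render the chat template, then collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```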
|
|
|
|
|
### Deployment with vLLM |
|
|
|
|
|
For production deployment, you can serve GroundNext-7B-V0 behind an OpenAI-compatible API with vLLM:
|
|
```bash |
|
|
vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192 |
|
|
``` |
|
|
|
|
|
**Note**: Adjust `--max-model-len` based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
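
Once the server is up, the endpoint can be queried with any OpenAI-compatible client. A minimal sketch, assuming the default `http://localhost:8000/v1` address, a local `screenshot.png`, and the grounding system prompt from the GroundCUA repository (elided here):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send the screenshot as a base64-encoded data URL
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ServiceNow/GroundNext-7B-V0",
    temperature=0.0,  # deterministic grounding
    max_tokens=128,
    messages=[
        # Insert the GroundCUA grounding system prompt (with the true screen width/height) here
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text", "text": "Click on the 'File' button"},
        ]},
    ],
)
print(response.choices[0].message.content)
```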
|
|
|
|
|
## Best Practices |
|
|
|
|
|
To achieve optimal grounding performance, we recommend: |
|
|
|
|
|
1. **Image Preprocessing**: |
|
|
- Use high-resolution screenshots (minimum 800x600) |
|
|
- Ensure UI elements are clearly visible |
|
|
- Maintain original aspect ratios when resizing |
|
|
|
|
|
2. **Prompt Engineering**: |
|
|
- Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save") |
|
|
- Include element attributes when available (color, position, text) |
|
|
|
|
|
3. **Generation Parameters**: |
|
|
- Use greedy decoding (`do_sample=False`; `temperature=0.0` when serving with vLLM) for deterministic grounding
|
|
- Set `max_new_tokens=128` (sufficient for tool calls) |
|
|
- Enable `use_cache=True` for faster inference |
|
|
|
|
|
4. **System Prompt**: |
|
|
- Always include the system prompt with actual screen dimensions |
|
|
- Replace `{width}` and `{height}` with true screenshot dimensions |
|
|
- Maintain the tool signature format for proper JSON parsing |
|
|
|
|
|
5. **Post-processing**: |
|
|
- Parse `<tool_call>` tags to extract JSON |
|
|
- Validate that coordinates fall within the screen bounds (a minimal parsing sketch follows this list)
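
A minimal post-processing sketch, assuming the single-tool-call output format shown in the quickstart (`screen_width` and `screen_height` are the true screenshot dimensions):

```python
import json
import re

def parse_tool_call(response: str, screen_width: int, screen_height: int) -> dict:
    """Extract the tool-call JSON from the model output and sanity-check the coordinates."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError(f"No <tool_call> block found in: {response!r}")
    call = json.loads(match.group(1))
    x, y = call["arguments"]["coordinate"]
    if not (0 <= x < screen_width and 0 <= y < screen_height):
        raise ValueError(f"Predicted coordinate ({x}, {y}) is outside the screen bounds")
    return call

# Example with a hypothetical model response
raw = '<tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [100, 42]}}</tool_call>'
call = parse_tool_call(raw, screen_width=1920, screen_height=1080)
print(call["arguments"]["action"], call["arguments"]["coordinate"])  # left_click [100, 42]
```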
|
|
|
|
|
## Training |
|
|
|
|
|
GroundNext-7B-V0 was trained using a two-stage approach: |
|
|
|
|
|
1. **Supervised Fine-tuning (SFT)**: Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset |
|
|
2. **Reinforcement Learning (RLOO)**: Further optimized with reward-based learning using custom GUI grounding rewards (an illustrative reward sketch follows)
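
The reward design used for GroundNext is described in the paper; purely as an illustration of what a GUI grounding reward and the RLOO leave-one-out baseline can look like (this is not the actual training code), a point-in-box formulation is sketched below:

```python
def grounding_reward(pred_xy, target_box):
    """Illustrative sketch only: reward 1.0 if the predicted click lands inside the
    annotated element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

# RLOO samples k completions per prompt; each sample's advantage is its reward
# minus the mean reward of the other k-1 samples (leave-one-out baseline).
target_box = (100, 40, 180, 70)                         # hypothetical ground-truth element box
samples = [(120, 55), (300, 200), (150, 60), (90, 45)]  # hypothetical predicted click points
rewards = [grounding_reward(p, target_box) for p in samples]
advantages = [r - (sum(rewards) - r) / (len(rewards) - 1) for r in rewards]
print(rewards, advantages)
```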
|
|
|
|
|
For detailed training instructions, dataset preparation, and reproduction steps, please visit our [GitHub repository](https://github.com/ServiceNow/GroundCUA). |
|
|
|
|
|
## Limitations and Future Work |
|
|
|
|
|
- **Desktop-focused**: Primarily trained on desktop environments (though shows strong cross-platform generalization) |
|
|
- **Action space**: Grounding outputs are currently limited to mouse click actions
|
|
- **Languages**: Optimized for English UI elements |
|
|
- **Resolution**: Performance may vary with extremely high or low resolution images |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use GroundNext-7B-V0 in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{feizi2025groundingcomputeruseagents, |
|
|
title={Grounding Computer Use Agents on Human Demonstrations}, |
|
|
author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar}, |
|
|
year={2025}, |
|
|
eprint={2511.07332}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2511.07332}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the [LICENSE](https://choosealicense.com/licenses/apache-2.0/) for details. |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
We thank: |
|
|
- The Qwen team for the excellent Qwen2.5-VL foundation models |
|
|
- The open-source community for tools and frameworks that made this work possible |
|
|
- Human annotators who contributed to the GroundCUA dataset |