See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu

arxiv Score 13.3

Published 2026-04-14 · First seen 2026-04-15

Research Track B · General AI

Abstract

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
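
The closed-loop refinement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `refine_click`, `toy_propose`, `toy_observe`, the 2-pixel tolerance, the 75% correction factor, and the target coordinates are all hypothetical. In a real agent the feedback would be a new screenshot showing the cursor, and the model would estimate the displacement itself. Here the offset is computed from a known target purely to make the loop runnable.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


def refine_click(
    propose: Callable[[Optional[Point], Optional[Point]], Point],
    observe: Callable[[Point], Point],
    tolerance_px: float = 2.0,   # assumed stopping threshold
    max_turns: int = 5,          # assumed turn budget
) -> List[Point]:
    """Return every cursor position tried; the last one is the final answer."""
    attempts: List[Point] = []
    last: Optional[Point] = None
    feedback: Optional[Point] = None
    for _ in range(max_turns):
        # The proposer sees the previous attempt and the feedback it produced.
        point = propose(last, feedback)
        attempts.append(point)
        # Feedback is the (dx, dy) offset from the cursor to the target.
        feedback = observe(point)
        if max(abs(feedback[0]), abs(feedback[1])) <= tolerance_px:
            break  # close enough, stop refining
        last = point
    return attempts


# Toy stand-ins (hypothetical): a fixed target and a proposer that
# corrects 75% of the observed displacement on each turn.
TARGET: Point = (412.0, 230.0)


def toy_observe(point: Point) -> Point:
    return (TARGET[0] - point[0], TARGET[1] - point[1])


def toy_propose(last: Optional[Point], feedback: Optional[Point]) -> Point:
    if last is None or feedback is None:
        return (400.0, 240.0)  # deliberately inaccurate first guess
    return (last[0] + 0.75 * feedback[0], last[1] + 0.75 * feedback[1])


attempts = refine_click(toy_propose, toy_observe)
for turn, (x, y) in enumerate(attempts, start=1):
    print(f"turn {turn}: ({x:.2f}, {y:.2f})")
```

A single-shot baseline corresponds to `max_turns=1`, which returns the first guess with no chance to correct it. The extra turns are what let the loop reduce the displacement error.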

BibTeX

@article{mittal2026see,
  title = {See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback},
  author = {Himangi Mittal and Gaurav Mittal and Nelson Daniel Troncoso and Yu Hu},
  year = {2026},
  url = {https://arxiv.org/abs/2604.13019},
  keywords = {cs.CV},
  eprint = {2604.13019},
  archiveprefix = {arXiv},
}
