
Google DeepMind has made public a new AI model, Gemini 2.5 Computer Use, capable of interacting with web and mobile user interfaces by mimicking human actions such as clicking, typing and scrolling. The model is now available in preview through the Gemini API, via Google AI Studio and Vertex AI, allowing developers to build agents that can directly operate interfaces otherwise inaccessible via backend APIs.
The model builds on the visual reasoning and comprehension capabilities of Gemini 2.5 Pro, extended with a specialized “computer_use” tool. Developers feed it a prompt, a screenshot of the current interface state, and the history of previous actions; the model returns discrete UI actions, which client-side code executes. The resulting screen is captured as a new screenshot and sent back to the model, starting the next iteration of the loop. The process continues until the task is completed, an error arises or a safety check halts execution. Google says this architecture delivers lower latency on web and mobile benchmarks than competing systems.
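The loop described above can be sketched in a few lines of Python. This is an illustrative outline only: `Action`, `run_agent_loop`, `scripted_model` and the callback parameters are hypothetical names, not part of the Gemini API, and the model call is replaced by a pluggable function so the control flow can be read on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    """A single UI action proposed by the model (hypothetical shape)."""
    kind: str                      # e.g. "click", "type", "scroll", "done"
    args: dict = field(default_factory=dict)


def run_agent_loop(
    goal: str,
    propose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    take_screenshot: Callable[[], bytes],
    is_safe: Callable[[Action], bool],
    max_steps: int = 20,
) -> Tuple[str, List[Action]]:
    """Run the screenshot -> action -> execute cycle until a stop condition."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()            # capture current UI state
        action = propose_action(goal, screenshot, history)
        if action.kind == "done":                 # task completed
            return "completed", history
        if not is_safe(action):                   # safety check halts execution
            return "halted_by_safety", history
        try:
            execute(action)                       # client-side code performs it
        except Exception:
            return "error", history               # an error ends the loop
        history.append(action)
    return "step_limit", history                  # assumed guard against endless loops


def scripted_model(actions: List[Action]):
    """Stand-in for the model: replays a fixed list of actions, then 'done'."""
    remaining = iter(actions)

    def propose(goal: str, screenshot: bytes, history: List[Action]) -> Action:
        return next(remaining, Action("done"))

    return propose
```

In a real integration, `propose_action` would wrap the call to the model and `execute` would drive a browser or device; the step limit is an assumption added here so the sketch always terminates.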
Gemini 2.5 Computer Use supports actions such as navigation to URLs, domain-level clicking, drag-and-drop, dropdown manipulation, scrolling and text entry. It can also request user confirmation before risky actions such as purchases or system changes. Google emphasises multi-layered safety guardrails: a per-step safety service checks each action before execution, and developers can use system instructions to limit or disable certain actions.
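A developer-side gate in the same spirit might look like the following Python sketch. It is not Google's safety service: `Decision`, `RISKY_KINDS`, `check_action` and `gate` are hypothetical names, and the set of risky actions is an assumption drawn from the examples in the paragraph above.

```python
from enum import Enum
from typing import Callable, FrozenSet


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"   # ask the user before proceeding
    BLOCK = "block"       # never execute


# Assumed examples of risky action kinds, taken from the article's wording.
RISKY_KINDS = frozenset({"purchase", "system_change"})


def check_action(kind: str, disabled_kinds: FrozenSet[str] = frozenset()) -> Decision:
    """Classify one proposed action before it is executed."""
    if kind in disabled_kinds:      # developer has switched this action off
        return Decision.BLOCK
    if kind in RISKY_KINDS:         # risky actions need explicit confirmation
        return Decision.CONFIRM
    return Decision.ALLOW


def gate(
    kind: str,
    confirm: Callable[[str], bool],
    disabled_kinds: FrozenSet[str] = frozenset(),
) -> bool:
    """Return True only if the action may run, asking the user when required."""
    decision = check_action(kind, disabled_kinds)
    if decision is Decision.CONFIRM:
        return confirm(kind)
    return decision is Decision.ALLOW
```

Here `confirm` is any callable that asks the user and returns a boolean, for example a dialog in the host application. Disabled actions take precedence over confirmation, so a blocked action is never offered to the user.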
Internal Google teams have already deployed the model in tools such as Project Mariner, the Firebase Testing Agent, and AI Mode within Search. Early testers outside the company report gains in speed and reliability. One automation platform noted performance improvements of up to 18 percent on complex tasks; another described the model as often operating 50 percent faster than competing systems in interface interactions.
Competitive pressure in agentic AI is significant. Anthropic introduced a Computer Use capability for its Claude model last year, and OpenAI’s ChatGPT Agent now runs with virtual computer-level control, including code execution. Google’s approach limits control to the browser and mobile layer rather than full OS access, narrowing the attack surface but potentially restricting versatility.
Benchmark results, partly self-reported and partly validated via the Browserbase evaluation suite, show Gemini 2.5 Computer Use outpacing rivals on benchmarks such as Online-Mind2Web and WebVoyager. In tests, it achieved notably higher success rates than Claude and OpenAI agents at equivalent latency. The model does not yet support direct file system operations or desktop OS control.
Complementary advancements in open research also signal rising competition. A new technical report describes UI-Venus, an open-source UI agent developed using reinforcement tuning on a multimodal model backbone; it achieves state-of-the-art grounding and navigation success without requiring massive training datasets, underscoring that UI agent research is accelerating.
Yet challenges remain. Real-world digital environments can present dynamic layouts, CAPTCHAs, session timeouts and unpredictable UX changes, which can break agent loops. An evaluation from Carnegie Mellon earlier this year found that even top-tier AI agents struggle with robust business automation tasks in messy real-world settings. Some industry observers caution that deployment viability in complex workflows still faces hurdles in error tracking, fallback logic and interpretability.