arxiv
Score 28.5
2026-04-09 · Xing Han Lù, Siva Reddy
Research Track B · General AI
Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and S…
arxiv
Score 27.5
2026-03-20 · Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Chen Dai, Lianyong Qi, Shi Jin
Research Track B · General AI
Despite rapid progress in multimodal GUI agents, reusable skill acquisition remains difficult because on-demand generated skills often leave action semantics, state assumptions, and success criteria implicit. This makes them brittle to execution errors, hard to verify, and difficult to repair. We present ContractSkill,…
arxiv
Score 27.0
2026-05-19 · Fatemeh Pesaran zadeh, Seyeon Choi, Xing Han Lù, Siva Reddy, Gunhee Kim
Research Track B · General AI
Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories…
arxiv
Score 26.0
2026-06-09 · Jayoo Hwang, Xiaowen Zhang, Vedant Padwal
Research Track B · General AI
Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent archi…
arxiv
Score 26.0
2026-06-12 · Sina Hajimiri, Masih Aminbeidokhti, Jose Dolz, Ismail Ben Ayed, Issam H. Laradji, Spandana Gella, Nicolas Gontier
Research Track B · General AI
Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its b…
arxiv
Score 23.3
2026-03-07 · Yunteng Tan, Zhi Gao, Xinxiao Wu
Research Track B · General AI
Large language model-based web agents have shown strong potential in automating web interactions through advanced reasoning and instruction following. While retrieval-based memory derived from historical trajectories enables these agents to handle complex, long-horizon tasks, current methods struggle to generalize acro…
arxiv
Score 23.3
2026-06-09 · Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
Research Track A · Research Track B · General AI
Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short…
arxiv
Score 22.5
2026-06-16 · Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury
Research Track B · General AI
Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such …
arxiv
Score 22.4
2026-06-25 · Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Research Track B · General AI
Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open-source MLLMs are cost-efficient and privacy-preserving compared with commercial large models, they suffer from weak planning and l…
arxiv
Score 22.3
2026-03-20 · Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
Research Track B · General AI
Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing L…
arxiv
Score 22.3
2026-06-01 · Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao
Research Track B · General AI
Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated w…
arxiv
Score 21.5
2026-04-07 · Guruprasad Viswanathan Ramesh, Asmit Nayak, Basieem Siddique, Kassem Fawaz
Research Track B · General AI
Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent's ability to successfully exe…
arxiv
Score 21.3
2026-03-23 · Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong
Research Track B · General AI
Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This li…
arxiv
Score 21.3
2026-06-09 · Yv Zhang, Hao Sun, Hao Fang, Kuofeng Gao, Fan Mo, Bin Chen, Shu-Tao Xia, Yaowei Wang
Research Track B · General AI
External memory has become a core component of modern web agents, enabling long-horizon reasoning through the retrieval of past experiences. However, this paradigm introduces a critical vulnerability: malicious content injected into memory can be persistently recalled and repeatedly influence agent behavior. In this wo…
arxiv
Score 20.5
2026-05-08 · Donguk Kwon, Dongha Lee
Research Track B · General AI
Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page's functional organization implicit and forcing the agent to infer it from element-leve…
arxiv
Score 20.5
2026-05-14 · Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu, Sylvie Venuto, Jinhao Zhu, Raluca Ada Popa, David Wagner
Research Track B · General AI
ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reas…
arxiv
Score 20.5
2026-06-10 · Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou
Research Track B · General AI
Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushi…
arxiv
Score 20.0
2026-03-13 · Orit Shahnovsky, Rotem Dror
Research Track B · General AI
Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequ…
arxiv
Score 20.0
2026-04-14 · Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, Huaxiu Yao
Research Track B · General AI
Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directl…
arxiv
Score 20.0
2026-04-20 · Lingfeng Zhang, Yongan Sun, Jinpeng Hu, Hui Ma, Yang Ying, Kuien Liu, Zenglin Shi, Meng Wang
Research Track B · General AI
Recent advancements in large language models (LLMs) have empowered autonomous web agents to execute natural language instructions directly on real-world webpages. However, existing agents often struggle with complex tasks involving dynamic interactions and long-horizon execution due to rigid planning strategies and hal…
arxiv
Score 20.0
2026-04-29 · Fazle Elahi Faisal, Qianhui Wu, Baolin Peng, Jianfeng Gao
Research Track B · General AI
Recent advances in multimodal large language models (LLMs) have revolutionized web agents that can automate complex tasks on websites. However, their accuracy remains limited by the scarcity of high-quality web trajectory training data. Existing automatic trajectory generation methods suffer from incomplete website cov…
arxiv
Score 20.0
2026-05-25 · Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, Tao Yu
Research Track B · General AI
Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires…
arxiv
Score 20.0
2026-05-28 · Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim
Research Track B · General AI
Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representation remains unexplored.…
arxiv
Score 19.8
2026-05-20 · Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas
Research Track A · Research Track B · General AI
Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems …
arxiv
Score 19.8
2026-06-16 · Xuelong Dai, Jianyu Ma, Boyang Ma, Biwei Yan, Yijun Yang, Yue Zhang
Research Track B · General AI
Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive thr…
arxiv
Score 19.3
2026-04-09 · Boer Zhang, Mingyan Wu, Dongzhuoran Zhou, Yuqicheng Zhu, Wendong Fan, Puzhen Zhang, Zifeng Ding, Guohao Li, Yuan He
Research Track B · General AI
Deep research requires reasoning over web evidence to answer open-ended questions, and it is a core capability for AI agents. Yet many deep research agents still rely on implicit, unstructured search behavior that causes redundant exploration and brittle evidence aggregation. Motivated by Anthropic's "think" tool parad…
arxiv
Score 19.0
2026-04-06 · Yuwen Zhai, Runze Li, Liang Wang, Nian Shi, Liwu Xu, Wei Zhang, Ran Lin, Bo Xu, Benlei Cui
Research Track B · General AI
Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence, a strategy that proves unreliable on long-hori…
arxiv
Score 19.0
2026-05-28 · Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada
Research Track B · General AI
HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is the high cost of end-to-end evaluation: in our experiments, evaluating 11 methods across 32…
arxiv
Score 19.0
2026-06-03 · Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan, Jin Lu, Ninghao Liu
Research Track B · General AI
Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse s…
arxiv
Score 18.5
2026-05-04 · Masafumi Oyamada, Kunihiro Takeoka, Kosuke Akimoto, Ryoma Obara, Masafumi Enomoto, Haochen Zhang, Daichi Haraguchi, Takuya Tamura
Research Track B · General AI
What if a browser agent could learn your work simply by watching you do it? We present cotomi Act, a browser-based computer-using agent that combines reliable multi-step task execution with persistent organizational knowledge learned from user behavior. For execution, an agent scaffold with adaptive lazy observation, v…
arxiv
Score 18.5
2026-05-12 · Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song
Research Track B · General AI
Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be se…
arxiv
Score 18.5
2026-05-24 · Yubo Li, Yidi Miao, Yuntian Shen, Yuxin Liu
Research Track B · General AI
Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We…
arxiv
Score 18.3
2026-03-15 · Mohamed Aghzal, Gregory J. Stein, Ziyu Yao
Research Track B · General AI
Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze w…
arxiv
Score 18.3
2026-04-13 · Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Research Track B · General AI
GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of …
arxiv
Score 18.3
2026-04-13 · Xiaozhe Li, Tianyi Lyu, Yizhao Yang, Liang Shan, Siyi Yang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu, Yang Li
Research Track B · General AI
Large Language Models (LLMs) struggle with long-horizon tasks due to the "context bottleneck" and the "lost-in-the-middle" phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context manag…
arxiv
Score 18.3
2026-04-16 · Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo
Research Track B · General AI
The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often lea…
arxiv
Score 18.3
2026-06-05 · Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu
Research Track B · General AI
Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating…
arxiv
Score 18.0
2026-01-12 · Jihong Wang, Jiamu Zhou, Weiming Zhang, Weiwen Liu, Zhuosheng Zhang, Xingyu Lou, Weinan Zhang, Huarong Deng, Jun Wang
Research Track B · General AI
With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily due to site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, ch…
arxiv
Score 18.0
2026-03-09 · Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang
Research Track B · General AI
Modern agents powered by thinking LLMs achieve high accuracy through long chain-of-thought reasoning but incur substantial inference costs. While many LLMs now support configurable reasoning levels (e.g., high/medium/low), static strategies are often ineffective: using low-effort modes at every step leads to significan…
arxiv
Score 18.0
2026-04-03 · Wei Zou, Mingwen Dong, Miguel Romero Calvo, Shuaichen Chang, Jiang Guo, Dongkyu Lee, Xing Niu, Xiaofei Ma, Yanjun Qi, Jiarong Jiang
Research Track B · General AI
Memory makes LLM-based web agents personalized, powerful, yet exploitable. By storing past interactions to personalize future tasks, agents inadvertently create a persistent attack surface that spans websites and sessions. While existing security research on memory assumes attackers can directly inject into memory stor…
arxiv
Score 18.0
2026-06-15 · Anqi Zou, Han Deng, Chengyu Zhang, Junquan Hu, Yu Wang, Yuxiang Xing, Aokai Zhang, Hanling Zhang, Zhaoyang Liu, Ben Fei, Zhihui Wang, Wanli Ouyang
Research Track B · General AI
Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces, and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is im…
huggingface
Score 17.7
2026-04-23 · Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie
Research Track B · General AI
Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated comp…
arxiv
Score 17.3
2026-04-09 · Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna
Research Track B · General AI
Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, r…
arxiv
Score 17.3
2026-04-14 · Yulin Chen, Tri Cao, Haoran Li, Yue Liu, Yibo Li, Yufei He, Le Minh Khoi, Yangqiu Song, Shuicheng Yan, Bryan Hooi
Research Track B · General AI
Web agents powered by vision-language models (VLMs) enable autonomous interaction with web environments by perceiving and acting on both visual and textual webpage content to accomplish user-specified tasks. However, they are highly vulnerable to prompt injection attacks, where adversarial instructions embedded in HTML…
arxiv
Score 17.3
2026-04-27 · Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov
Research Track B · General AI
Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multipl…
huggingface
Score 17.2
2026-06-28 · Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song, Kaiqian Cui, Bowen Wang, Haoyuan Wu, Yitong Li, Dunjie Lu, Haikong Lu, Qi Zhen, Xinyuan Wang, Jiaqi Deng, Yuhao Yang, Cheng Chen, Boyuan Zheng, Alex Su, Xiao Yu, Hao Zou, Saaket Agashe, Xing Han Lu, Manpreet Kaur, Zhengyang Qi, Vincent Sunn Chen, Frederic Sala, Dayiheng Liu, Junyang Lin, Zhou Yu, Yu Su, Siva Reddy, Xin Eric Wang, Peng Qi, Tianbao Xie, Tao Yu
Research Track B · General AI
Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, des…
arxiv
Score 17.0
2026-04-24 · Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia
Research Track B · General AI
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world…
arxiv
Score 16.8
2026-05-20 · Yibo Li, Jiashuo Yang, Zhi Zheng, Zhiyuan Hu, Yuan Sui, Shizun Wang, Yufei He, Bryan Hooi
Research Track B · General AI
LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requirin…
arxiv
Score 16.3
2026-03-24 · Yenchia Feng, Chirag Sharma, Karime Maamari
Research Track B · General AI
Although large language models (LLMs) have advanced rapidly, robust automation of complex software workflows remains an open problem. In long-horizon settings, agents frequently suffer from cascading errors and environmental stochasticity; a single misstep in a dynamic interface can lead to task failure, resulting in h…
arxiv
Score 16.3
2026-04-24 · Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan, Guozhi Wang, Hao Wang, Zhaoxiong Wang, Yafei Wen, Xiaoxin Chen, Shuai Ren, Lingfang Zeng
Research Track B · General AI
As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on …
huggingface
Score 16.2
2026-05-28 · Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang
Research Track B · General AI
While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis. GUI-RobustEval contains 1,216 executable te…
arxiv
Score 15.8
2026-05-29 · Weile Chen, Bingchen Miao, Qifan Yu, Wendong Bu, Guoming Wang, Wenqiao Zhang, Shengyu Zhang, Juncheng Li, Siliang Tang
Research Track B · General AI
Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose SCA…
arxiv
Score 15.5
2026-05-07 · Oğuzhan Fatih Kar, Roman Bachmann, Yuanzheng Gong, Anders Boesen Lindbo Larsen, Afshin Dehghan
Research Track B · General AI
The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. …
arxiv
Score 15.5
2026-06-29 · Rahul Khedar, Mayank Malhotra, Avinash Karn, Mouli V, Prakhar Mehrotra
Research Track B · General AI
Live product demonstrations are a recurring, high-cost activity in software organizations: a human presenter must select features, dispatch the corresponding interactions on a running application, narrate them coherently, and answer questions in real time. Existing automation addresses only fragments -- generalist brow…
arxiv
Score 14.9
2026-06-20 · Rishi Srivastava
Research Track B · General AI
We introduce CFAgentBench, a reproducible, self-hostable environment and benchmark for autonomous construction-finance agents: a CFO/controller-class agent operating across the real software stack a US construction finance team runs - ERP, project management, email, documents, pay applications, payroll, certified payro…
arxiv
Score 14.8
2026-06-18 · To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz
Research Track B · General AI
The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifac…
arxiv
Score 14.5
2026-04-13 · Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei
Research Track B · General AI
Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framewor…
arxiv
Score 14.5
2026-05-12 · Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye
Research Track B · General AI
Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This diffi…
arxiv
Score 14.5
2026-06-16 · Aagam Sogani, Botao Rui, Swetha Vaidyanathan, Rishi Agarwal, Minghao Yan, Shivaram Venkataraman
Research Track B · General AI
Long-horizon web agents often fail in ways hidden by final-answer evaluation: they may visit useful pages, produce a well-formed answer, and terminate confidently while still missing fields, over-including unsupported items, or relying on stale evidence. We study these failures with Parallel WebBench, a parallel web-ex…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 14.3
2026-06-08 · Radeen Mostafa, Sawradip Saha
Research Track B · General AI
We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep t…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 14.3
2026-06-08 · Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov
Research Track B · General AI
A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first int…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 14.3
2026-06-11 · Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang
Research Track B · General AI
Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions th…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 14.3
2026-06-15 · Y. H. Zhou, Z. M. Ma, Y. J. Zhou, Y. T. Li, H. X. Xiang, Y. M. Cheng, T. L. Chen, K. J. Zhang, Z. H. Nan, J. H. Ni, Z. Wu, Q. Y. Pan, S. Zhang, S. Cheng, M. Y. Luo
Research Track B · General AI
SMS fraud is increasingly cross-channel: a message directs the user to a webpage, and the final risk depends on how the SMS claim aligns with the page content and requested user action. However, existing evaluations either focus on message-only smishing classification or expose URL and domain cues that allow models to …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.8
2026-05-19 · Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ailin Fan, Keat Yang Koay, Yuanzheng Zhu, Meysam Feghhi, Ronie Uliana, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Zhong Wu, Lingyun Wang
Research Track B · General AI
A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM)…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.8
2026-05-28 · Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu
Research Track B · General AI
Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without inter…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.4
2026-06-25 · Zhongxin Guo, Danrui Qi, Hanwen Gu, Peng Cheng, Yongqiang Xiong
Research Track B · General AI
Agents often repeatedly solve similar task instances from scratch, leading to unnecessary reasoning cost and long execution traces. Prior work has explored workflow reuse and executable skill induction, but it remains unclear which task scenarios admit procedural skills and how the shared procedural structure should be…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.3
2026-03-20 · Xuanwang Zhang, Yuteng Han, Jinnan Qi, Mulong Xie, Zhen Wu, Xinyu Dai
Research Track B · General AI
Despite significant advances in autonomous web navigation, current methods remain far from human-level performance in complex web environments. We argue that this limitation stems from Topological Blindness, where agents are forced to explore via trial-and-error without access to the global topological structure of the…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.3
2026-04-14 · Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu
Research Track B · General AI
Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing a…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.3
2026-04-19 · Mohit Dubey
Research Track B · General AI
Multi-agent systems (MAS) powered by large language models suffer from severe token inefficiency arising from two compounding sources: (i) unstructured parallel execution, where all agents activate simultaneously irrespective of input readiness; and (ii) unrestricted context sharing, where every agent receives the full…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.3
2026-04-28 · Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu, Quanjun Yin, Ee-Chien Chang
Research Track B · General AI
Web agents have emerged as an effective paradigm for automating interactions with complex web environments, yet remain vulnerable to prompt injection attacks that embed malicious instructions into webpage content to induce unintended actions. This threat is further amplified for screenshot-based web agents, which opera…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.3
2026-06-08 · Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang
Research Track B · General AI
We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented proce…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.0
2026-04-01 · Zhanzhi Lou, Hui Chen, Yibo Li, Qian Wang, Bryan Hooi
Research Track B · General AI
Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.0
2026-04-22 · Sachin Kumar
Research Track B · General AI
Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, w…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 13.0
2026-05-14 · William Lugoloobi, Samuelle Marro, Jabez Magomere, Joss Wright, Chris Russell
Research Track B · General AI
As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four w…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.8
2026-05-12 · Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang
Research Track B · General AI
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.8
2026-05-20 · Chongrui Ye, Yuxiang Liu, Yu Wang, Haofei Yu, Yining Zhao, Ge Liu, Julian McAuley, Jiaxuan You
Research Track A · Research Track B · General AI
Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.5
2026-05-29 · Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley
Research Track B · General AI
Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.5
2026-06-09 · Can Lin, Tao Feng, Hangjie Yuan, Dan Zhang, Yifan Zhu, Zhonghong Ou
Research Track A · Research Track B · General AI
Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.3
2026-03-29 · Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang, Runnan Fang, Qi Zhang, Baixuan Li, Shihao Cai, Rui Ye, Hui Chen, Jiang Yong, Joey Tianyi Zhou, Chenxiong Qian, Pengjun Xie, Bryan Hooi, Zuozhu Liu, Jingren Zhou
Research Track B · General AI
As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck. Existing context management methods typically commit to a single fixed strategy throughout the entire trajectory. Such static designs may work well in so…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.3
2026-04-03 · Renze Lou, Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Suman Nath, Wenpeng Yin, Jianfeng Gao
Research Track B · General AI
As web agents rapidly evolve, an increasing body of work has moved beyond conventional atomic browser interactions and explored tool use as a higher-level action paradigm. Although prior studies have shown the promise of tools, their conclusions are often drawn from limited experimental scales and sometimes non-compara…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.3
2026-05-30 · Soham Roy, Sarthakbrata Halder, Arya Bharaty, Vaibhav Bhaskar, Yash Sinha, Dhruv Kumar, Srikant Panda, Murari Mandal
Research Track B · General AI
Deceptive web content, widely instantiated across the internet and commonly known as social-engineering attacks, manipulates autonomous web agents into submitting users' personally identifiable information (PII) to attacker-controlled endpoints. In this paper, we show that social-engineering attacks are highly…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.0
2026-04-01 · Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu
Research Track B · General AI
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing bench…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.0
2026-05-10 · Zhiqing Zhong, Zhijing Ye, Jiamin Wang, Xiaodong Yu
Research Track B · General AI
Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit und…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.0
2026-05-10 · Yilin Zhang, Yingkai Hua, Chunyu Wei, Xin Wang, Yueguo Chen
Research Track B · General AI
Vision-language model (VLM) based web agents demonstrate impressive autonomous GUI interaction but remain vulnerable to deceptive interface elements. Existing approaches either detect deception without task integration or document attacks without proposing defenses. We formalize deception-aware web agent defense and pr…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.0
2026-05-14 · Tri Cao, Yulin Chen, Hieu Cao, Yibo Li, Khoi Le, Thong Nguyen, Yuexin Li, Yufei He, Yue Liu, Shuicheng Yan, Bryan Hooi
Research Track B · General AI
Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack pattern…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.0
2026-05-15 · Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang
Research Track B · General AI
Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irrepro…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.0
2026-05-15 · Mike Wong, Kevin Hsieh, Suman Nath, Ravi Netravali
Research Track B · General AI
Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. Today's web-agent expense is not intrinsic to the tasks but a property of how agents are composed: frontier-model inference, browser rendering, and ReAct-style planning are applied to every step o…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 12.0
2026-05-29 · Dongxin Guo, Jikun Wu, Siu Ming Yiu
Research Track B · General AI
Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not because of preference biases but because of limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.9
2026-06-25 · Minbyul Jeong
Research Track B · General AI
Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especially outside English. Breadth is also hard to build: certifying that a gold set is complete…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.8
2026-05-20 · Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini, Christos Kozyrakis
Research Track B · General AI
Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.8
2026-06-29 · Iliana Fayolle, Sihem Bouhenniche, Samuel Pélissier, Pierre Laperdrix, Clémentine Maurice, Walter Rudametkin
Research Track B · General AI
Since 2023, a new class of bots has emerged: Web Agents. They can automate complex tasks on the Web, going beyond traditional browser automation tools such as Selenium, Puppeteer, or Playwright. Leveraging large language models (LLMs), these agents are capable of solving anti-bot mechanisms, mimicking human behavior, a…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.5
2026-03-04 · Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu
Research Track B · General AI
Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 hel…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 11.5
2026-03-19 · Haochen Zhao, Shaoyang Cui
Research Track B · General AI
Autonomous web agents such as OpenClaw are rapidly moving into high-impact real-world workflows, but their security robustness under live network threats remains insufficiently evaluated. Existing benchmarks mainly focus on static sandbox settings and content-level prompt attacks, which leaves a practical gap …
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 11.5
2026-03-22 · Liang Ding
Research Track B · General AI
LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on th…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.5
2026-03-30 · Deepak Akkil, Mowafak Allaham, Amal Raj, Tamer Abuelsaad, Ravi Kokku
Research Track B · General AI
Reliable evaluation of AI agents operating in complex, real-world environments requires methodologies that are robust, transparent, and contextually aligned with the tasks agents are intended to perform. This study identifies persistent shortcomings in existing AI agent evaluation practices that are particularly acute …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.3
2026-04-02 · Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada
Research Track B · General AI
Web agents based on large language models (LLMs) rely on observations of web pages -- commonly represented as HTML -- as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard p…
- Review
- pending
- Role
- unreviewed
- Read
- now
huggingface
Score 11.2
2026-04-08 · Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim
Research Track B · General AI
Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with contr…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 11.0
2026-04-14 · Chuang Peng, Wei Zhang, Renshuai Tao, Xinhao Zhang, Jian Yang
Research Track B · General AI
Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to re…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.0
2026-05-01 · Dongxin Guo, Jikun Wu, Siu Ming Yiu
Research Track B · General AI
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and p…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.0
2026-05-07 · Borui Zhang, Bo Zhang, Bo Wang, Wenzhao Zheng, Yuhao Cheng, Liang Tang, Yiqiang Yan, Jie Zhou, Jiwen Lu
Research Track B · General AI
GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution metho…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 11.0
2026-06-18 · Dayeon Kang, Hyejun Jeong, Jade Sheffey, Pubali Datta, Amir Houmansadr
Research Track B · General AI
As AI web agents proliferate, combining large language models with autonomous, browser-level control, indiscriminate content scraping by web agents has emerged as a privacy and security challenge. Existing defenses, such as robots.txt and active bot-blocking, are insufficient, as they are widely violated and easily cir…
- Review
- pending
- Role
- unreviewed
- Read
- now
huggingface
Score 10.7
2026-06-18 · Guangyi Liu, Gao Wu, Congxiao Liu, Pengxiang Zhao, Liang Liu, Mading Li, Qi Zhang, Mengyan Wang, Liang Guo, Yong Liu
Research Track B · General AI
MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to…
- Review
- pending
- Role
- unreviewed
- Read
- soon
huggingface
Score 10.5
2026-04-27 · Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
Research Track B · General AI
Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring lo…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.5
2026-06-04 · Hao Bai, Rui Yang, Chenlu Ye, Spencer Whitehead, Aviral Kumar, Tong Zhang
Research Track B · General AI
Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gra…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.5
2026-06-06 · Amine El Hattami, Nicolas Chapados, Christopher Pal
Research Track B · General AI
AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, esp…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.5
2026-06-11 · Hexuan Yu, Chaoyu Zhang, Heng Jin, Shanghao Shi, Ning Zhang, Y. Thomas Hou, Wenjing Lou
Research Track B · General AI
Modern LLM-powered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which c…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.5
2026-06-17 · Yujin Zhang, Daye Nam
Research Track B · General AI
AI web agents can perform complex, multi-step tasks such as searching for products, comparing options, and making purchases on behalf of users. However, verifying the correctness of an agent's output remains difficult. Existing transparency mechanisms, including full trajectory logs, source links, screenshots, and LLM-…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.5
2026-06-30 · Kaisen Yang, Zheng Jiang, Yuzhao Peng, Houde Qian, Boshi Zhang, Youjie Zheng, Shijin Hong, Qingle Liu, Ruoyu Han, Bohan Lyu, Bingxiang He, Eren Cai, Calvin Xiao, Qinhuai Na
Research Track A · Research Track B · General AI
Internet users collectively perform an enormous range of skilled work through web browsers, from software development and document editing to search, forms, and enterprise workflows, making human browsing a highly scalable but under-exploited source of reusable browser skills. We argue that the bottleneck for browser a…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 10.3
2026-02-01 · Alberto Castelo, Zahra Zanjani Foumani, Ailin Fan, Keat Yang Koay, Vibhor Malik, Yuanzheng Zhu, Han Li, Meysam Feghhi, Ronie Uliana, Shuang Xie, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Lingyun Wang, Zhong Wu
Research Track B · General AI
A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents op…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 10.3
2026-03-24 · Qianlong Lan, Anuj Kaul
Research Track B · General AI
Deploying large language models (LLMs) as autonomous browser agents exposes a significant attack surface in the form of Indirect Prompt Injection (IPI). Cloud-based defenses can provide strong semantic analysis, but they introduce latency and raise privacy concerns. We present the Cognitive Firewall, a three-stage spli…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.0
2026-03-11 · Hyungjoo Chae, Jungsoo Park, Alan Ritter
Research Track B · General AI
Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites in…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 10.0
2026-04-30 · Haofei Yu, Yining Zhao, Lenore Blum, Manuel Blum, Paul Pu Liang
Research Track B · General AI
Despite remarkable advances, today's AI systems remain narrow in scope, falling short of the flexible, adaptive, and multisensory intelligence that characterizes human capabilities. This gap has fueled longstanding debates about whether AI might one day achieve human-like generality or even consciousness, and whether t…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 10.0
2026-05-04 · Zhisheng Tang, Mayank Kejriwal
Research Track B · General AI
Research funding discovery remains fundamentally fragmented: researchers navigate disparate agency portals (e.g., in the United States, NSF, NIH, DARPA, Grants.gov, and many others) with heterogeneous interfaces, search capabilities, and data schemas. We present a compound AI system that unifies this landscape through …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.0
2026-05-07 · Xinmiao Huang, Jinwei Hu, Rajarshi Roy, Changshun Wu, Yi Dong, Xiaowei Huang
Research Track B · General AI
Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixG…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.0
2026-05-08 · Zhichao Liu, Wenbo Pan, Haining Yu, Ge Gao, Tianqing Zhu, Xiaohua Jia
Research Track B · General AI
Browser agents are increasingly deployed in long-horizon tasks, which require executing extended action chains to accomplish user goals. However, this prolonged execution process provides attackers with more opportunities to inject malicious instructions. Existing prompt injection attacks against browser agents expose …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 10.0
2026-05-14 · Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ted Chaiwachirasak, Han Li, Lingyun Wang
Research Track B · General AI
LLM-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, cont…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 10.0
2026-05-27 · Kenny Daniel
Research Track B · General AI
The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 9.8
2026-05-12 · Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo
Research Track B · General AI
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction …
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 9.5
2026-01-14 · Saber Zerhoudi, Michael Granitzer
Research Track B · General AI
A fundamental tension exists between the demand for sophisticated AI assistance in web search and the need for user data privacy. Current centralized models require users to transmit sensitive browsing data to external services, which limits user control. In this paper, we present a browser extension that provides a vi…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 9.5
2026-03-22 · Liang Ding
Research Track B · General AI
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER,…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 9.5
2026-03-24 · Wanying Mo, Jijia Lai, Xiaoming Wang
Research Track B · General AI
Browser agents built on LLMs can act in web interfaces, yet most remain confined to a single chat surface (e.g., a sidebar). This mismatch with real browsing can increase context-switching and reduce user control. We introduce IntentWeave, a design space of ten spatial paradigms for embedding agentic assistanc…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 9.5
2026-03-31 · Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar
Research Track B · General AI
There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Ye…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 9.5
2026-04-20 · Weixi Tong, Yifeng Di, Tianyi Zhang
Research Track B · General AI
Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website's structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a lim…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 9.5
2026-04-26 · Tin Nguyen, Thang T. Truong, Runtao Zhou, Trung Bui, Chirag Agarwal, Anh Totti Nguyen
Research Track B · General AI
Users browsing the web daily struggle to quickly locate relevant information in cluttered pages, complete unfamiliar multi-step tasks, and stay focused amid distracting content. State-of-the-art AI assistants (e.g., ChatGPT, Gemini, Claude) and browser agents (e.g., OpenAI Operator, Browser Use) can answer questions an…
- Review
- pending
- Role
- unreviewed
- Read
- now
huggingface
Score 9.5
2026-04-27 · Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
Research Track B · General AI
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital wo…
- Review
- pending
- Role
- unreviewed
- Read
- now
arxiv
Score 9.3
2025-12-08 · Alisha Ukani, Hamed Haddadi, Ali Shahin Shamsabadi, Peter Snyder
Research Track B · General AI
This paper presents a systematic evaluation of the privacy behaviors and attributes of eight recent, popular browser agents. Browser agents are software that automate Web browsing using large language models and ancillary tooling. However, the automated capabilities that make browser agents powerful also make them high…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 9.3
2026-04-08 · Jagadeesh Chundru
Research Track B · General AI
LLM-driven web agents operating through continuous inference loops -- repeatedly querying a model to evaluate browser state and select actions -- exhibit a fundamental scalability constraint for repetitive tasks. We characterize this as the Rerun Crisis: the linear growth of token expenditure and API latency relative t…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 9.3
2026-06-04 · Shubham Gaur, Ian Lane
Research Track B · General AI
Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation frequency to action frequency is an archit…
- Review
- pending
- Role
- unreviewed
- Read
- soon
huggingface
Score 9.0
2026-05-24 · Yubo Li, Yidi Miao
Research Track B · General AI
Long-horizon LLM inference turns the key--value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertaint…
- Review
- pending
- Role
- unreviewed
- Read
- soon
huggingface
Score 8.5
2026-04-27 · NVIDIA, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yifan Peng, Piotr Zelasko, Zhehuai Chen, Nithin Rao Koluguri, Nune Tadevosyan, Lilit Grigoryan, Ehsan Hosseini Asl, Pritam Biswas, Leili Tavabi, Yuanhang Su, Zhiding Yu, Peter Jin, Alexandre Milesi, Netanel Haber, Yao Xu, Sarah Amiraslani, Nabin Mulepati, Eric Tramel, Jaehun Jung, Ximing Lu, Brandon Cui, Jin Xu, Zhiqi Li, Shihao Wang, Yuanguo Kuang, Huck Yang, Boyi Li, Hongxu Yin, Song Han, Pavlo Molchanov, Adi Renduchintala, Charles Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Sreyan Ghosh, Yian Zhang, Alexander Bukharin, Venkat Srinivasan, Johnny Greco, Andre Manoel, Maarten Van Segbroeck, Suseella Panguliri, Rohit Watve, Divyanshu Kakwani, Shubham Pachori, Jeffrey Glick, Radha Sri-Tharan, Aileen Zaman, Khanh Nguyen, Shi Chen, Jiaheng Fang, Qing Miao, Wenfei Zhou, Yu Wang, Zaid Pervaiz Bhat, Varun Praveen, Arihant Jain, Ramanathan Arunachalam, Tomasz Kornuta, Ashton Sharabiani, Amy Shen, Wei Huang, Yi-Fu Wu, Ali Roshan Ghias, Huiying Li, Brian Yu, Nima Tajbakhsh, Chen Cui, Wenwen Gao, Li Ding, Terry Kong, Manoj Kilaru, Anahita Bhiwandiwalla, Marek Wawrzos, Daniel Korzekwa, Pablo Ribalta, Grzegorz Chlebus, Besmira Nushi, Ewa Dobrowolska, Maciej Jakub Mikulski, Kunal Dhawan, Steve Huang, Jagadeesh Balam, Yongqiang Wang, Nikolay Karpov, Valentin Mendelev, George Zelenfroynd, Meline Mkrtchyan, Omri Almog, Bhavesh Pawar, Rameshwar Shivbhakta, Sudeep Sabnis, Ashrton Sharabiani, Negar Habibi, Geethapriya Venkataramani, Pamela Peng, Prerit Rodney, Serge Panev, Richard Mazzarese, Nicky Liu, Michael Fukuyama, Andrii Skliar, Roger Waleffe, Duncan Riach, Yunheng Zou, Jian Hu, Hao Zhang, 
Binfeng Xu, Yuhao Yang, Zuhair Ahmed, Carlo del Mundo, Chad Voegele, Zhiyu Cheng, Nave Assaf, Daniel Afrimi, Natan Bagrov, Ran Zilberstein, Ofri Masad, Eugene Khvedchenia, Borys Tymchenko, Tomer Asida, Parth Mannan, Victor Cui, Michael Evans, Katherine Luna, Jie Lou, Pinky Xu, Guyue Huang, Michael Boone, Pradeep Thalasta, Adeola Adesoba, Dina Yared, Christopher Parisien, Leon Derczynski, Shaona Ghosh, Wes Feely, Micah Schaffer, Barnaby Simkin, Tomasz Grzegorzek, Rishabh Garg, Aastha Jhunjhunwala, Sergei Kolchenko, Farzan Memarian, Haran Kumar, Shiv Kumar, Isabel Hulseman, Anjali Shah, Kari Briski, Padmavathy Subramanian, Joey Conway, Udi Karpas, Jane Polak Scowcroft, Annie Surla, Shilpa Ammireddy, Ellie Evans, Jesse Oliver, Tom Balough, Chia-Chih Chen, Sandip Bhaskar, Alejandra Rico, Bardiya Sadeghi, Seph Mard, Meredith Price, Laya Sleiman, Saori Kaji, Wesley Helmholz, Wendy Quan, Michael Lightstone, Jonathan Cohen, Jian Zhang, Oleksii Kuchaiev, Boris Ginsburg, Jan Kautz, Eileen Long, Mohammad Shoeybi, Mostofa Patwary, Oluwatobi Olabiyi, Andrew Tao, Bryan Catanzaro
Research Track B · General AI
We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in …
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 8.5
2026-05-06 · Sohom Datta, Alex Nahapetyan, William Enck, Alexandros Kapravelos
Research Track B · General AI
Large language models (LLMs) are increasingly being integrated into web browsers to create agentic browsing systems that execute actions on behalf of the user. Prior work considering the security of agentic browsers focuses exclusively on indirect prompt-injection attacks. However, by failing to consider traditional we…
- Review
- pending
- Role
- unreviewed
- Read
- soon
arxiv
Score 8.0
2026-02-27 · Abisheka Pitumpe, Amir Rahmati
Research Track B · General AI
Job-based smishing scams, where victims are recruited under the guise of remote job opportunities, represent a rapidly growing and understudied threat within the broader landscape of online fraud. In this paper, we present Anansi, the first scalable, end-to-end measurement pipeline designed to systematically engage wit…
- Review
- pending
- Role
- unreviewed
- Read
- soon
huggingface
Score 7.2
2026-06-15 · Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong
Research Track B · General AI
Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when us…
- Review
- pending
- Role
- unreviewed
- Read
- soon