Research Paper Cockpit

Research Track B

Papers on interactive systems and agent architectures.

Daily Archives

Quick jump into generated daily digests.

Research Workflow

Latest digest: 2026-03-28.

Papers

21 visible entries

arxiv Score 29.4

ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

2026-03-20 · Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Chen Dai

Research Track B · General AI

Despite rapid progress in multimodal GUI agents, reusable skill acquisition remains difficult because on-demand generated skills often leave action semantics, state assumptions, and success criteria implicit. This makes them brittle to execution errors, hard to verify, and difficult to repair. We present ContractSkill,…

Review
pending
Role
unreviewed
Read
now
arxiv Score 24.8

Enhancing Web Agents with a Hierarchical Memory Tree

2026-03-07 · Yunteng Tan, Zhi Gao, Xinxiao Wu

Research Track B · General AI

Large language model-based web agents have shown strong potential in automating web interactions through advanced reasoning and instruction following. While retrieval-based memory derived from historical trajectories enables these agents to handle complex, long-horizon tasks, current methods struggle to generalize acro…

Review
pending
Role
unreviewed
Read
now
arxiv Score 24.2

A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

2026-03-20 · Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette

Research Track B · General AI

Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing L…

Review
pending
Role
unreviewed
Read
now
arxiv Score 23.8

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

2026-03-23 · Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong

Research Track B · General AI

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This li…

Review
pending
Role
unreviewed
Read
now
arxiv Score 21.5

AI Planning Framework for LLM-Based Web Agents

2026-03-13 · Orit Shahnovsky, Rotem Dror

Research Track B · General AI

Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequ…

Review
pending
Role
unreviewed
Read
now
arxiv Score 20.2

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

2026-03-15 · Mohamed Aghzal, Gregory J. Stein, Ziyu Yao

Research Track B · General AI

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze w…

Review
pending
Role
unreviewed
Read
now
arxiv Score 19.5

ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution

2026-01-12 · Jihong Wang, Jiamu Zhou, Weiming Zhang, Weiwen Liu, Zhuosheng Zhang, Xingyu Lou, Weinan Zhang, Huarong Deng, Jun Wang

Research Track B · General AI

With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily due to site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, ch…

Review
pending
Role
unreviewed
Read
now
arxiv Score 19.5

Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents

2026-03-09 · Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

Research Track B · General AI

Modern agents powered by thinking LLMs achieve high accuracy through long chain-of-thought reasoning but incur substantial inference costs. While many LLMs now support configurable reasoning levels (e.g., high/medium/low), static strategies are often ineffective: using low-effort modes at every step leads to significan…

Review
pending
Role
unreviewed
Read
now
arxiv Score 18.8

Environment Maps: Structured Environmental Representations for Long-Horizon Agents

2026-03-24 · Yenchia Feng, Chirag Sharma, Karime Maamari

Research Track B · General AI

Although large language models (LLMs) have advanced rapidly, robust automation of complex software workflows remains an open problem. In long-horizon settings, agents frequently suffer from cascading errors and environmental stochasticity; a single misstep in a dynamic interface can lead to task failure, resulting in h…

Review
pending
Role
unreviewed
Read
now
arxiv Score 15.2

WebNavigator: Global Web Navigation via Interaction Graph Retrieval

2026-03-20 · Xuanwang Zhang, Yuteng Han, Jinnan Qi, Mulong Xie, Zhen Wu, Xinyu Dai

Research Track B · General AI

Despite significant advances in autonomous web navigation, current methods remain far from human-level performance in complex web environments. We argue that this limitation stems from Topological Blindness, where agents are forced to explore via trial-and-error without access to the global topological structure of the…

Review
pending
Role
unreviewed
Read
now
arxiv Score 14.0

AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

2026-03-22 · Liang Ding

Research Track B · General AI

LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on th…

Review
pending
Role
unreviewed
Read
now
arxiv Score 13.4

ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation

2026-03-19 · Haochen Zhao, Shaoyang Cui

Research Track B · General AI

Autonomous web agents such as \textbf{OpenClaw} are rapidly moving into high-impact real-world workflows, but their security robustness under live network threats remains insufficiently evaluated. Existing benchmarks mainly focus on static sandbox settings and content-level prompt attacks, which leaves a practical gap …

Review
pending
Role
unreviewed
Read
soon
arxiv Score 13.0

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

2026-03-04 · Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu

Research Track B · General AI

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 hel…

Review
pending
Role
unreviewed
Read
soon
arxiv Score 12.8

The Cognitive Firewall:Securing Browser Based AI Agents Against Indirect Prompt Injection Via Hybrid Edge Cloud Defense

2026-03-24 · Qianlong Lan, Anuj Kaul

Research Track B · General AI

Deploying large language models (LLMs) as autonomous browser agents exposes a significant attack surface in the form of Indirect Prompt Injection (IPI). Cloud-based defenses can provide strong semantic analysis, but they introduce latency and raise privacy concerns. We present the Cognitive Firewall, a three-stage spli…

Review
pending
Role
unreviewed
Read
now
arxiv Score 12.0

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

2026-03-22 · Liang Ding

Research Track B · General AI

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER,…

Review
pending
Role
unreviewed
Read
soon
arxiv Score 12.0

IntentWeave: A Progressive Entry Ladder for Multi-Surface Browser Agents in Cloud Portals

2026-03-24 · Wanying Mo, Jijia Lai, Xiaoming Wang

Research Track B · General AI

Browser agents built on LLMs can act in web interfaces, yet most remain confined to a single chat surface (e.g., a sidebar). This mismatch with real browsing can increase context-switching and reduce user control. We introduce \textbf{IntentWeave}, a design space of ten spatial paradigms for embedding agentic assistanc…

Review
pending
Role
unreviewed
Read
soon
arxiv Score 11.8

SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce

2026-02-01 · Alberto Castelo, Zahra Zanjani Foumani, Ailin Fan, Keat Yang Koay, Vibhor Malik, Yuanzheng Zhu, Han Li, Meysam Feghhi, Ronie Uliana, Shuang Xie, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Lingyun Wang, Zhong Wu

Research Track B · General AI

A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents op…

Review
pending
Role
unreviewed
Read
soon
arxiv Score 11.5

Safe and Scalable Web Agent Learning via Recreated Websites

2026-03-11 · Hyungjoo Chae, Jungsoo Park, Alan Ritter

Research Track B · General AI

Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites in…

Review
pending
Role
unreviewed
Read
soon
arxiv Score 11.0

In-Browser Agents for Search Assistance

2026-01-14 · Saber Zerhoudi, Michael Granitzer

Research Track B · General AI

A fundamental tension exists between the demand for sophisticated AI assistance in web search and the need for user data privacy. Current centralized models require users to transmit sensitive browsing data to external services, which limits user control. In this paper, we present a browser extension that provides a vi…

Review
pending
Role
unreviewed
Read
soon
arxiv Score 10.8

Privacy Practices of Browser Agents

2025-12-08 · Alisha Ukani, Hamed Haddadi, Ali Shahin Shamsabadi, Peter Snyder

Research Track B · General AI

This paper presents a systematic evaluation of the privacy behaviors and attributes of eight recent, popular browser agents. Browser agents are software that automate Web browsing using large language models and ancillary tooling. However, the automated capabilities that make browser agents powerful also make them high…

Review
pending
Role
unreviewed
Read
soon
arxiv Score 9.5

Anansi: Scalable Characterization of Message-Based Job Scams

2026-02-27 · Abisheka Pitumpe, Amir Rahmati

Research Track B · General AI

Job-based smishing scams, where victims are recruited under the guise of remote job opportunities, represent a rapidly growing and understudied threat within the broader landscape of online fraud. In this paper, we present Anansi, the first scalable, end-to-end measurement pipeline designed to systematically engage wit…

Review
pending
Role
unreviewed
Read
soon