Paper Detail
CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
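The abstract says each task is specified only by an instruction and an automatic evaluation function over the final output. A minimal Python sketch of that idea follows. The names `Task`, `success_rate`, and `toy_agent` are hypothetical, and the toy tasks are invented for illustration. None of this is taken from the CocoaBench codebase or reflects its actual API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction plus a checker over the final output."""
    instruction: str
    evaluate: Callable[[str], bool]


def success_rate(tasks: Sequence[Task], agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes that task's own checker."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.evaluate(agent(t.instruction)))
    return passed / len(tasks)


# Toy tasks, invented purely for illustration (not CocoaBench tasks).
tasks = [
    Task("Reply with the sum of 2 and 3.", lambda out: out.strip() == "5"),
    Task("Reply with 'cocoa' in upper case.", lambda out: out.strip() == "COCOA"),
]


def toy_agent(instruction: str) -> str:
    # Stand-in for any agent: only its final output is inspected.
    return "5"


rate = success_rate(tasks, toy_agent)
```

Because the checker sees only the final output, the same task list could in principle be run against any scaffold that maps an instruction to an answer, which is the property the abstract credits for evaluation across diverse agent infrastructures.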
@misc{team2026cocoabench,
title = {CocoaBench: Evaluating Unified Digital Agents in the Wild},
author = {{CocoaBench Team} and Shibo Hao and Zhining Zhang and Zhiqi Liang and Tianyang Liu and Yuheng Zha and Qiyue Gao and Jixuan Chen and Zilong Wang and Zhoujun Cheng and Haoxiang Zhang and Junli Wang and Hexi Jin and Boyuan Zheng and Kun Zhou and Yu Wang and Feng Yao and Licheng Liu and Yijiang Li and Zhifei Li and Zhengtao Han and Pracha Promthaw and Tommaso Cerruti and Xiaohan Fu and Ziqiao Ma and Jingbo Shang and Lianhui Qin and Julian McAuley and Eric P. Xing and Zhengzhong Liu and Rupesh Kumar Srivastava and Zhiting Hu},
year = {2026},
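abstract = {LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1\% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.},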
url = {https://huggingface.co/papers/2604.11201},
keywords = {LLM agents, software engineering, deep research, GUI automation, agent scaffolds, unified systems, digital agents, long-horizon tasks, vision, search, coding, automatic evaluation, controlled comparison, model backbones, reasoning, planning, tool use, execution, visual grounding, code available, huggingface daily},
eprint = {2604.11201},
archiveprefix = {arXiv},
}