TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji, Feiyan Zhou, Xiaohui Zhang, Fanny Yang, Belinda Zeng

huggingface Score 19.4

Published 2026-06-26 · First seen 2026-06-30

General AI

Abstract

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows co-designed with PhD-level domain experts that require specialized software. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.

BibTeX

@misc{chen2026tua,
  title = {TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents},
  author = {Shoufa Chen and Luyuan Wang and Xuan Yang and Zhiheng Liu and Yuren Cong and Yuanfeng Ji and Feiyan Zhou and Xiaohui Zhang and Fanny Yang and Belinda Zeng},
  year = {2026},
  abstract = {As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically },
  url = {https://huggingface.co/papers/2606.28480},
  keywords = {terminal-use agents, general-purpose agents, computer-use tasks, graphical user interfaces, shell-based workflows, execution-based scoring protocol, digital activities, specialized software, benchmark evaluation, code available, huggingface daily},
  eprint = {2606.28480},
  archiveprefix = {arXiv},
}
