Paper Detail

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei

Browse

Workflow Queues

arxiv Score 14.5

Published 2026-04-13 · First seen 2026-04-14

Research Track B · General AI

Open paper source

Abstract

Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{yuan2026webforge,
  title = {WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark},
  author = {Peng Yuan and Yuyang Yin and Yuxuan Cai and Zheng Wei},
  year = {2026},
  abstract = {Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without},
  url = {https://arxiv.org/abs/2604.10988},
  keywords = {cs.AI, cs.CV},
  eprint = {2604.10988},
  archiveprefix = {arXiv},
}

Metadata

{}