Paper Detail

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei

arxiv Score 15.5

Published 2026-04-13 · First seen 2026-04-14

Research Track B · General AI

Abstract

Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{yuan2026webforge,
  title = {WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark},
  author = {Peng Yuan and Yuyang Yin and Yuxuan Cai and Zheng Wei},
  year = {2026},
  abstract = {Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without},
  url = {https://arxiv.org/abs/2604.10988},
  keywords = {cs.AI, cs.CV},
  eprint = {2604.10988},
  archiveprefix = {arXiv},
}

Metadata

{}