Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu

Published 2026-03-04

Abstract

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.

BibTeX

@article{tran2026vibe,
  title = {Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development},
  author = {Hung Tran and Langston Nashold and Rayan Krishnan and Antoine Bigeard and Alex Gu},
  year = {2026},
  abstract = {Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves only 58.0\% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6\% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.},
  url = {https://arxiv.org/abs/2603.04601},
  keywords = {cs.SE, cs.AI, cs.CL},
  eprint = {2603.04601},
  archiveprefix = {arXiv},
}
