Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Hung Tran (Vals AI), Langston Nashold (Vals AI), Rayan Krishnan (Vals AI), Antoine Bigeard (Vals AI), Alex Gu (MIT)
Evaluation & Benchmarking
Vibe Code Bench is a benchmark of 100 web application specifications with over 10,000 browser-evaluated substeps. It shows that even the best frontier models complete only 58% of realistic end-to-end application development tasks. An autonomous browser agent verifies each deployed application against behavioral specifications, so scores reflect whether the application works for a user rather than whether the code is syntactically valid.
Presentation
Talk
Paper Session 2: Agent Evaluation
Wednesday, May 27 · 2:10 PM – 2:20 PM
Bayshore Ballroom
Poster
Wednesday, May 27 · 5:15 PM – 6:45 PM
Carmel / Monterey
Abstract
Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8–93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results. We provide reproducibility artifacts, detailed in Appendix A.
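The abstract reports pairwise step-level agreement between evaluators but does not say how it is computed. The sketch below shows one plausible reading, assuming each evaluator gives a pass/fail verdict for every substep and that agreement is the fraction of substeps on which two evaluators match. The function name, evaluator labels, and toy verdicts are hypothetical illustrations, not the authors' pipeline or data.

```python
from itertools import combinations


def pairwise_step_agreement(verdicts):
    """Return the fraction of matching substep verdicts for each evaluator pair.

    `verdicts` maps an evaluator name to a list of booleans (True = substep
    passed), with every list covering the same substeps in the same order.
    This data layout is an assumption made for illustration.
    """
    agreement = {}
    for a, b in combinations(sorted(verdicts), 2):
        va, vb = verdicts[a], verdicts[b]
        if len(va) != len(vb) or not va:
            raise ValueError("evaluators must judge the same non-empty set of substeps")
        matches = sum(x == y for x, y in zip(va, vb))
        agreement[(a, b)] = matches / len(va)
    return agreement


# Hypothetical toy verdicts for four substeps (not taken from the paper).
toy_verdicts = {
    "evaluator_a": [True, True, False, True],
    "evaluator_b": [True, False, False, True],
    "human": [True, True, False, False],
}

agreement = pairwise_step_agreement(toy_verdicts)
for pair, score in agreement.items():
    print(pair, f"{score:.1%}")
```

Under this reading, a wide range such as the reported 31.8–93.6% would mean that some evaluator pairs disagree on most substeps while others match on nearly all of them, which is why the choice of evaluator can change a model's measured score.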