Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran (Vals AI), Langston Nashold (Vals AI), Rayan Krishnan (Vals AI), Antoine Bigeard (Vals AI), Alex Gu (MIT)

Evaluation & Benchmarking

Vibe Code Bench is a benchmark of 100 web application specifications with over 10,000 browser-evaluated substeps. It shows that even the best frontier models complete only 58% of realistic end-to-end application development tasks. The benchmark uses an autonomous browser agent to verify deployed applications against behavioral specifications, measuring what users actually care about rather than code syntax.

Presentation

Talk

Paper Session 2: Agent Evaluation

Wednesday, May 27 · 2:10 PM – 2:20 PM

Bayshore Ballroom

Poster

Wednesday, May 27 · 5:15 PM – 6:45 PM

Carmel / Monterey

Abstract

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8–93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results. We provide reproducibility artifacts, detailed in Appendix A.
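
To make the two statistics in the abstract concrete, here is a minimal sketch of how pairwise step-level agreement and a Pearson correlation could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the evaluator labels, and the toy verdicts are all hypothetical, and agreement is assumed to mean the fraction of substeps on which two evaluators return the same pass/fail verdict.

```python
from itertools import combinations
from math import sqrt


def step_agreement(a, b):
    """Fraction of substeps on which two evaluators give the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(verdicts):
    """Map each evaluator pair (sorted by name) to its step-level agreement."""
    return {
        (m, n): step_agreement(verdicts[m], verdicts[n])
        for m, n in combinations(sorted(verdicts), 2)
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical toy data: pass/fail verdicts on the same four substeps.
verdicts = {
    "evaluator_a": [True, True, False, True],
    "evaluator_b": [True, False, False, True],
    "human": [True, True, False, False],
}

for pair, score in pairwise_agreement(verdicts).items():
    print(pair, score)
```

The same `pearson_r` helper could relate a per-model measure of self-testing to per-model accuracy, provided both are available as equal-length numeric lists; how the paper quantifies self-testing is not specified here.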
