Paper Detail

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang

huggingface Score 9.5

Published 2026-04-25 · First seen 2026-04-28

General AI

Abstract

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
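
The two framings in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a finite pool of 1-D inputs, a squared-exponential kernel with fixed hyperparameters and a zero prior mean (ProEval instead pre-trains the GP via transfer learning), a toy score function, and a greedy acquisition rule that picks the input whose observation most reduces the variance of the pool-average estimate. All function names (`rbf_kernel`, `gp_posterior`, `bq_estimate`, `next_query`, `failure_probability`, `run_demo`) are hypothetical.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.2, signal_var=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(pool, idx, y, noise_var=1e-4, prior_mean=0.0):
    """Posterior mean and covariance over the pool given scores y at pool[idx]."""
    K = rbf_kernel(pool, pool)
    if len(idx) == 0:
        return np.full(len(pool), prior_mean), K
    K_oo = K[np.ix_(idx, idx)] + noise_var * np.eye(len(idx))
    K_po = K[:, idx]
    # Solve once for both the mean weights and the covariance correction.
    sol = np.linalg.solve(K_oo, np.column_stack([y - prior_mean, K_po.T]))
    mean = prior_mean + K_po @ sol[:, 0]
    cov = K - K_po @ sol[:, 1:]
    return mean, cov

def bq_estimate(mean, cov):
    """Bayesian-quadrature estimate of the pool-average score and its variance."""
    n = len(mean)
    return float(mean.mean()), max(float(cov.sum()) / n**2, 0.0)

def next_query(cov, observed, noise_var=1e-4):
    """Greedy rule (an assumption): maximize the reduction in BQ variance."""
    gain = cov.sum(axis=0) ** 2 / (np.diag(cov) + noise_var)
    gain[np.asarray(observed, dtype=int)] = -np.inf  # never re-query an input
    return int(np.argmax(gain))

def failure_probability(mean, cov, threshold):
    """Posterior P(f(x) > threshold), i.e. membership in the superlevel set."""
    std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    z = (threshold - mean) / std
    return np.array([0.5 * (1.0 - math.erf(v / math.sqrt(2.0))) for v in z])

def run_demo(budget=15):
    pool = np.linspace(0.0, 1.0, 200)
    truth = 0.5 * np.sin(6.0 * pool) + 0.5 * pool  # toy "severity" score
    observed, y = [], []
    mean, cov = gp_posterior(pool, observed, np.array(y))
    variances = [bq_estimate(mean, cov)[1]]
    for _ in range(budget):
        j = next_query(cov, observed)
        observed.append(j)
        y.append(truth[j])  # stands in for one expensive model evaluation
        mean, cov = gp_posterior(pool, observed, np.array(y))
        variances.append(bq_estimate(mean, cov)[1])
    est, var = bq_estimate(mean, cov)
    p_fail = failure_probability(mean, cov, threshold=0.5)
    return est, var, float(truth.mean()), variances, p_fail

if __name__ == "__main__":
    est, var, truth_mean, _, p_fail = run_demo()
    print("BQ estimate:", est, "posterior variance:", var)
    print("exhaustive pool mean:", truth_mean)
    print("inputs with P(failure) > 0.9:", int((p_fail > 0.9).sum()))
```

In this sketch the estimate comes from averaging the GP posterior mean over the pool, so only the queried inputs need a real evaluation; the same posterior is reused to rank inputs by their probability of exceeding a failure threshold.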

BibTeX

@misc{huang2026proeval,
  title = {ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation},
  author = {Yizheng Huang and Wenjun Zeng and Aditi Kumaresan and Zi Wang},
  year = {2026},
  abstract = {Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violatio},
  url = {https://huggingface.co/papers/2604.23099},
  keywords = {Gaussian Processes, Bayesian quadrature, superlevel set sampling, uncertainty-aware decision strategies, transfer learning, performance estimation, failure discovery, code available, huggingface daily},
  eprint = {2604.23099},
  archiveprefix = {arXiv},
}
