Paper Detail

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang

arxiv Score 13.8

Published 2026-05-28 · First seen 2026-05-31

General AI

Abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{ho2026soundnessbench,
  title = {SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?},
  author = {Sy-Tuyen Ho and Minghui Liu and Huy Nghiem and Furong Huang},
  year = {2026},
  abstract = {Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with r},
  url = {https://arxiv.org/abs/2605.30329},
  keywords = {cs.LG, Large Language Models, machine-learning research proposals, reviewer soundness, ICLR submissions, autonomous AI research agents, hypothesis generation, peer review, benchmark evaluation, code available, huggingface daily},
  eprint = {2605.30329},
  archiveprefix = {arXiv},
}

Metadata

{}