Paper Detail

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernandez Montoya, Bing Liu

huggingface Score 10.5

Published 2026-04-12 · First seen 2026-04-14

General AI

Abstract

Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26%, and human expert performance is approximately 20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only approximately 20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from approximately 5% to approximately 80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict
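
The calibration comparison described in the abstract (accuracy as a function of stated confidence or judged predictability) can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `accuracy_by_confidence`, the `(confidence, is_correct)` record format, the equal-width binning, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_confidence(records, num_bins=5):
    """Group (confidence, is_correct) pairs into equal-width confidence bins.

    Returns {bin_index: (accuracy, count)} for every non-empty bin.
    Hypothetical helper, not part of the SciPredict codebase.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in records:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 is folded into the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        totals[index] += 1
        hits[index] += int(is_correct)
    return {i: (hits[i] / totals[i], totals[i]) for i in sorted(totals)}


# Synthetic toy data, NOT taken from the paper.
toy_records = [
    (0.10, False), (0.15, False),
    (0.50, True), (0.55, False),
    (0.90, True), (1.00, True),
]
result = accuracy_by_confidence(toy_records, num_bins=5)
print(result)
```

A well-calibrated predictor shows accuracy rising across bins, as in this toy data; a predictor whose per-bin accuracy stays flat regardless of confidence matches the failure mode the abstract attributes to models.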

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@misc{sehwag2026scipredict,
  title = {SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?},
  author = {Udari Madhushani Sehwag and Elaine Lau and Haniyeh Ehsani Oskouie and Shayan Shabihi and Erich Liang and Andrea Toledo and Guillermo Mangialardi and Sergio Fonrouge and Ed-Yeremai Hernandez Cardona and Paula Vergara and Utkarsh Tyagi and Chen Bo Calvin Zhang and Pavi Bhatter and Nicholas Johnson and Furong Huang and Ernesto Gabriel Hernandez Montoya and Bing Liu},
  year = {2026},
  abstract = {Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry.},
  url = {https://huggingface.co/papers/2604.10718},
  keywords = {LLMs, experimental outcomes, scientific knowledge, reasoning, benchmark, prediction accuracy, calibration, reproducibility, huggingface daily},
  eprint = {2604.10718},
  archiveprefix = {arXiv},
}