Paper Detail

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang

huggingface Score 14.5

Published 2026-04-07 · First seen 2026-04-21

General AI

Abstract

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
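
The instance construction described in the abstract (non-conclusion sections as input, the author-written conclusion as target) can be illustrated with a minimal sketch. This is not the authors' code: the function name `make_instance`, the label set `CONCLUSION_LABELS`, the (label, text) input format, and the toy abstract are all assumptions made for illustration, and the released dataset may use different fields and matching rules.

```python
from typing import List, Optional, Tuple

# Hypothetical label set (an assumption); the dataset's real matching
# rules for conclusion sections are not stated in the abstract.
CONCLUSION_LABELS = {"CONCLUSION", "CONCLUSIONS"}


def make_instance(sections: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Split (label, text) sections into (evidence_input, target_conclusion).

    Returns None when the abstract has no conclusion section or no
    non-conclusion sections, since no input/target pair can be formed.
    """
    evidence: List[str] = []
    conclusion: List[str] = []
    for label, text in sections:
        normalized = label.strip().upper()
        if normalized in CONCLUSION_LABELS:
            conclusion.append(text.strip())
        else:
            evidence.append(f"{normalized}: {text.strip()}")
    if not evidence or not conclusion:
        return None
    return "\n".join(evidence), " ".join(conclusion)


# Toy abstract invented for illustration; not taken from the dataset.
example = [
    ("Background", "Drug X is used for condition Y."),
    ("Methods", "We randomized 100 patients."),
    ("Results", "Symptoms improved in the treatment arm."),
    ("Conclusions", "Drug X may reduce symptoms of Y."),
]
instance = make_instance(example)
```

Here `instance[0]` holds the labeled evidence sections joined by newlines and `instance[1]` holds the reference conclusion that a model's output would be scored against.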

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@misc{li2026medconclusion,
  title = {MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts},
  author = {Weiyue Li and Ruizhi Qian and Yi Li and Yongce Li and Yunfan Long and Jiahui Cai and Yan Luo and Mengyu Wang},
  year = {2026},
  abstract = {Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-},
  url = {https://huggingface.co/papers/2604.06505},
  keywords = {large language models, biomedical conclusion generation, structured abstracts, evidence-to-conclusion reasoning, reference-based metrics, LLM-as-a-judge, code available, huggingface daily},
  eprint = {2604.06505},
  archiveprefix = {arXiv},
}
