Paper Detail

BenchX: Benchmarking AI Models for Cancer Detection and Localization with Demographic and Protocol Biases

Qi Chen, Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Ibrahim Ethem Hamamci, Sezgin Er, Ashwin Kumar, Yiwen Ye, Yuhan Wang, Yuyin Zhou, Akshay S. Chaudhari, Curtis Langlotz, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou

arXiv · Score 11.2

Published 2026-06-23 · First seen 2026-06-24

General AI

Abstract

Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlights the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code
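
The abstract's central point, that a model tuned for average accuracy can still fail on a small subgroup, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the record fields (`sex`, `has_tumor`, `detected`), the function names, and the toy numbers are hypothetical and are not taken from the BenchX data or code.

```python
from collections import defaultdict


def overall_sensitivity(records):
    """Fraction of tumor-positive scans on which the model detected the tumor."""
    positives = [r for r in records if r["has_tumor"]]
    return sum(r["detected"] for r in positives) / len(positives)


def sensitivity_by_subgroup(records, key):
    """Sensitivity computed separately for each value of the subgroup field `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if not r["has_tumor"]:
            continue  # sensitivity only counts tumor-positive scans
        totals[r[key]] += 1
        hits[r[key]] += int(r["detected"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy, made-up records (hypothetical; not results from the paper).
records = (
    [{"sex": "M", "has_tumor": True, "detected": True}] * 7
    + [{"sex": "M", "has_tumor": True, "detected": False}] * 1
    + [{"sex": "F", "has_tumor": True, "detected": True}] * 1
    + [{"sex": "F", "has_tumor": True, "detected": False}] * 1
    + [{"sex": "F", "has_tumor": False, "detected": False}] * 2
)

overall = overall_sensitivity(records)
per_group = sensitivity_by_subgroup(records, "sex")
print(f"overall: {overall:.3f}")
for group, value in per_group.items():
    print(f"{group}: {value:.3f}")
```

In this toy data the aggregate sensitivity looks acceptable while the smaller subgroup is detected only half the time, which is the kind of gap a subgroup-level report is meant to expose.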

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@article{chen2026benchx,
  title = {{BenchX}: Benchmarking {AI} Models for Cancer Detection and Localization with Demographic and Protocol Biases},
  author = {Qi Chen and Wenxuan Li and Pedro R. A. S. Bassi and Xinze Zhou and Jakob Wasserthal and Ibrahim Ethem Hamamci and Sezgin Er and Ashwin Kumar and Yiwen Ye and Yuhan Wang and Yuyin Zhou and Akshay S. Chaudhari and Curtis Langlotz and Kang Wang and Yang Yang and Alan L. Yuille and Zongwei Zhou},
  year = {2026},
  abstract = {Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol.},
  url = {https://arxiv.org/abs/2606.24883},
  keywords = {cs.CV},
  eprint = {2606.24883},
  archiveprefix = {arXiv},
}