TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

Tristan Kirscher, Alexandra Ertl, Klaus Maier-Hein, Xavier Coubez, Philippe Meyer, Sylvain Faisan

huggingface Score 6.5

Published 2026-04-17 · First seen 2026-04-20

General AI

Abstract

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
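
The abstract does not say which calibration map TwinTrack uses, so the following is only an illustrative sketch of the general idea of calibrating to the MHR. It computes the MHR as the per-voxel mean of binary rater masks, then fits plain histogram binning (a standard calibrator swapped in here as an assumption, not the authors' method) so that each probability bin maps to the mean MHR observed in that bin on a calibration set. The function names, the bin count, and the toy data are all hypothetical.

```python
import numpy as np

def mean_human_response(rater_masks):
    """Per-voxel fraction of raters labeling tumor; rater_masks has shape (R, ...)."""
    return np.asarray(rater_masks, dtype=float).mean(axis=0)

def _bin_index(probs, edges):
    # Interior edges split [0, 1] into len(edges) - 1 bins, indexed 0..n_bins-1.
    return np.digitize(probs, edges[1:-1])

def fit_histogram_calibrator(probs, mhr, n_bins=10):
    """Assumed stand-in calibrator: map each probability bin to its mean MHR."""
    probs = np.ravel(np.asarray(probs, dtype=float))
    mhr = np.ravel(np.asarray(mhr, dtype=float))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = _bin_index(probs, edges)
    # Empty bins fall back to the bin center, i.e. roughly the identity map.
    table = 0.5 * (edges[:-1] + edges[1:])
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            table[b] = mhr[in_bin].mean()
    return edges, table

def apply_calibrator(probs, edges, table):
    """Replace each probability with the calibrated value of its bin."""
    probs = np.asarray(probs, dtype=float)
    return table[_bin_index(probs, edges)]

# Toy calibration set: 3 raters, 4 voxels (hypothetical data).
masks = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [1, 1, 1, 0]]
mhr = mean_human_response(masks)  # per-voxel fractions: 1, 2/3, 1/3, 0

# Hypothetical ensemble probabilities for the same 4 voxels.
ensemble_probs = np.array([0.97, 0.85, 0.40, 0.05])
edges, table = fit_histogram_calibrator(ensemble_probs, mhr, n_bins=10)
calibrated = apply_calibrator(ensemble_probs, edges, table)
```

With this kind of map, a calibrated value of 0.67 on a voxel reads as "about two in three annotators would be expected to label it tumor", which is the interpretation the abstract describes. A real calibration set would need far more voxels per bin than this toy example.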

BibTeX

@misc{kirscher2026twintrack,
  title = {TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation},
  author = {Tristan Kirscher and Alexandra Ertl and Klaus Maier-Hein and Xavier Coubez and Philippe Meyer and Sylvain Faisan},
  year = {2026},
  abstract = {Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities},
  url = {https://huggingface.co/papers/2604.15950},
  keywords = {ensemble segmentation, post-hoc calibration, empirical mean human response, inter-rater disagreement, probabilistic outputs, calibration metrics, huggingface daily},
  eprint = {2604.15950},
  archiveprefix = {arXiv},
}