Paper Detail
Manan Gupta, Dhruv Kumar
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
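The two diagnostics named in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function names, the toy data, the pairwise-preference encoding, and the choice of $|\text{human} - \text{judge}|$ as the nonconformity score are all assumptions made here for illustration. The per-document violation rate is assumed to be the fraction of summary triples that form a directed 3-cycle, and the coverage guarantee of the split conformal step holds only if calibration and test examples are exchangeable.

```python
import math
from itertools import combinations


def count_directed_3cycles(items, beats):
    """Count triples {a, b, c} whose pairwise preferences form a cycle.

    `beats` is a set of (winner, loser) pairs for one source document
    (a hypothetical encoding of the judge's pairwise preferences).
    """
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (b, a) in beats and (c, b) in beats and (a, c) in beats
        if forward or backward:
            cycles += 1
    return cycles


def conformal_threshold(human, judge, alpha):
    """Split-conformal quantile of |human - judge| on a calibration split.

    The absolute residual is an assumed nonconformity score.
    """
    residuals = sorted(abs(h - j) for h, j in zip(human, judge))
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points: fall back to all labels
    return residuals[rank - 1]


def prediction_set(judge_score, threshold, labels=(1, 2, 3, 4, 5)):
    """All Likert labels within `threshold` of the judge's score."""
    return [y for y in labels if abs(y - judge_score) <= threshold]


# Toy data, invented for illustration only.
items = ["s1", "s2", "s3", "s4"]
beats = {("s1", "s2"), ("s2", "s3"), ("s3", "s1"),
         ("s1", "s4"), ("s2", "s4"), ("s3", "s4")}
n_cycles = count_directed_3cycles(items, beats)
n_triples = math.comb(len(items), 3)
print(n_cycles, n_cycles / n_triples)  # 1 0.25

cal_human = [3, 4, 2, 5, 1, 3, 4, 2, 5, 3]
cal_judge = [3, 3, 2, 4, 1, 4, 4, 4, 5, 3]
q = conformal_threshold(cal_human, cal_judge, alpha=0.2)
print(q, prediction_set(3, q), prediction_set(5, q))  # 1 [2, 3, 4] [4, 5]
```

In this sketch, `len(prediction_set(...))` plays the role of the set width that the abstract uses as a per-instance reliability indicator: a wider set means the judge's score pins down the human rating less tightly.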
@article{gupta2026diagnosing,
title = {Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations},
author = {Manan Gupta and Dhruv Kumar},
year = {2026},
abstract = {LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores},
url = {https://arxiv.org/abs/2604.15302},
keywords = {cs.AI, cs.CL, cs.LG, Transitive relation, Set (abstract data type), Reliability (semiconductor), Consistency (knowledge bases), Fluency},
eprint = {2604.15302},
archiveprefix = {arXiv},
}