Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Abstract

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores that provide theoretically guaranteed $\geq (1-\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = +0.576$, $N = 1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating that it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
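
The two diagnostics named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' released code: the function names (`count_3cycles`, `conformal_quantile`, `prediction_set`), the per-document violation rate defined as the fraction of summary triples that form a directed cycle, and the absolute-residual nonconformity score |judge score - human score| are all hypothetical choices; the paper may define its violation rate and nonconformity score differently.

```python
import math
from itertools import combinations


def count_3cycles(items, beats):
    """Count directed 3-cycles among all triples of one document's summaries.

    beats(a, b) is a hypothetical callback: True if the judge prefers a over b.
    """
    cycles = triples = 0
    for a, b, c in combinations(items, 3):
        triples += 1
        forward = beats(a, b) and beats(b, c) and beats(c, a)
        backward = beats(b, a) and beats(c, b) and beats(a, c)
        if forward or backward:
            cycles += 1
    return cycles, triples


def conformal_quantile(cal_judge, cal_human, alpha):
    """Split-conformal quantile of absolute residuals on a calibration split."""
    scores = sorted(abs(j - h) for j, h in zip(cal_judge, cal_human))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points: every label is kept
    return scores[k - 1]


def prediction_set(judge_score, q, labels=(1, 2, 3, 4, 5)):
    """All Likert labels within q of the judge's score."""
    return [y for y in labels if abs(y - judge_score) <= q]


# Toy data (invented): one cyclic triple A > B > C > A, and D loses to everyone.
wins = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")}
cycles, triples = count_3cycles("ABCD", lambda a, b: (a, b) in wins)

# Toy calibration scores (invented): judge vs. human Likert ratings.
q = conformal_quantile([3, 4, 2, 5, 1, 3, 4, 2, 3], [3, 3, 2, 4, 1, 5, 4, 3, 3], alpha=0.2)
print(cycles, triples, q, prediction_set(4, q))  # 1 4 1 [3, 4, 5]
```

Under these assumptions, a wider set returned by `prediction_set` signals a less reliable judgment for that instance, which is the role the abstract assigns to set width. The coverage guarantee relies on the calibration and test instances being exchangeable.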

BibTeX

@article{gupta2026diagnosing,
  title = {Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations},
  author = {Manan Gupta and Dhruv Kumar},
  year = {2026},
  abstract = {LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores},
  url = {https://arxiv.org/abs/2604.15302},
  keywords = {cs.AI, cs.CL, cs.LG, Transitive relation, Set (abstract data type), Reliability (semiconductor), Consistency (knowledge bases), Fluency},
  eprint = {2604.15302},
  archiveprefix = {arXiv},
}