When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler

arxiv Score 7.8

Published 2026-05-07 · First seen 2026-05-09

General AI

Abstract

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component ($η^2 \approx 0.52$), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
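Two of the quantities named in the abstract, the safe-versus-abliterated AUROC and the share of variance attributable to target identity ($η^2$), can be illustrated with a minimal sketch. This is not the SimpleAudit implementation. The scores are made-up illustrative values, the convention that higher scores mean more severe behavior is an assumption, and the $η^2$ here is a simplified one-way version that groups by target only, whereas the paper's decomposition also accounts for auditor and judge effects.

```python
from statistics import mean


def auroc(safe_scores, unsafe_scores):
    """Probability that a randomly drawn unsafe-target score exceeds a
    randomly drawn safe-target score; ties count as 0.5."""
    wins = 0.0
    for u in unsafe_scores:
        for s in safe_scores:
            if u > s:
                wins += 1.0
            elif u == s:
                wins += 0.5
    return wins / (len(safe_scores) * len(unsafe_scores))


def eta_squared(groups):
    """One-way eta squared: between-group sum of squares divided by the
    total sum of squares. `groups` maps a group label to its scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical severity scores (assumed: higher = more severe).
safe = [0, 1, 1, 2]
abliterated = [2, 3, 3, 4]

separation = auroc(safe, abliterated)
target_share = eta_squared({"safe": safe, "abliterated": abliterated})
print(f"AUROC = {separation:.3f}, eta^2 (target) = {target_share:.3f}")
```

An AUROC near 1.0 means the instrument responds to the controlled contrast, and a large target $η^2$ means most score variation comes from which model is being audited rather than from other sources. Rerun stability would require repeating the audit under the same fixed configuration and is not shown here.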

BibTeX

@article{gautam2026when,
  title = {When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels},
  author = {Sushant Gautam and Finn Schwall and Annika Willoch Olstad and Fernando Vallecillos Ruiz and Birk Torpmann-Hagen and Sunniva Maria Stordal Bjørklund and Leon Moonen and Klas Pettersen and Michael A. Riegler},
  year = {2026},
  abstract = {Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-t},
  url = {https://arxiv.org/abs/2605.06652},
  keywords = {cs.LG, cs.AI, cs.CL, benchmarkless comparative safety scoring, scenario-based audit, instrumental-validity chain, AUROC, variance decomposition, rerun budget, local-first scoring instrument, safety pack, target-driven variance, auditor artifacts, judge artifacts, code available, huggingface daily, Computer science, Benchmark (surveying), Audit, Variance (accounting), Enforcement},
  eprint = {2605.06652},
  archiveprefix = {arXiv},
}
