Paper Detail

Online Safety Monitoring for LLMs

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick

arXiv Score 9.6

Published 2026-07-02 · First seen 2026-07-03

General AI

Abstract

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
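
The thresholding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which risk-control procedure or loss is used, so the sketch assumes a conformal-risk-control-style finite-sample adjustment and a "missed unsafe output" loss. The names `calibrate_threshold` and `first_alarm`, and the convention that verifier scores lie in [0, 1] with higher meaning more likely unsafe, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Sequence


def calibrate_threshold(
    scores: Sequence[float], unsafe: Sequence[bool], alpha: float
) -> float:
    """Pick the largest alarm threshold whose adjusted miss rate is <= alpha.

    Assumption: a "miss" is an unsafe calibration example whose verifier
    score lies below the threshold, so no alarm would be raised for it.
    """
    n = len(scores)
    if n == 0 or n != len(unsafe):
        raise ValueError("scores and unsafe must be non-empty and equal length")

    # Threshold 0.0 alarms on every output (scores assumed >= 0), so it never
    # misses; it is the conservative fallback when alpha is too strict.
    best = 0.0
    for lam in sorted(set(scores)):
        misses = sum(1 for s, u in zip(scores, unsafe) if u and s < lam)
        # Adjusted risk for a loss bounded by 1:
        # (n / (n + 1)) * (misses / n) + 1 / (n + 1)
        adjusted_risk = (misses + 1) / (n + 1)
        if adjusted_risk > alpha:
            break  # misses only grow as the threshold grows
        best = lam
    return best


def first_alarm(stream: Iterable[float], threshold: float) -> Optional[int]:
    """Return the index of the first score that triggers the alarm, else None."""
    for t, score in enumerate(stream):
        if score >= threshold:
            return t
    return None


# Toy calibration data (made up for illustration only).
cal_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4, 0.5]
cal_unsafe = [False, False, False, True, True, True, True, False, True]

threshold = calibrate_threshold(cal_scores, cal_unsafe, alpha=0.2)
alarm_step = first_alarm([0.1, 0.3, 0.65, 0.2], threshold)
```

A lower threshold raises alarms more often and misses fewer unsafe outputs, so the sketch searches for the most permissive threshold that still keeps the adjusted miss rate within the target level `alpha`.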

BibTeX

@article{schirmer2026online,
  title = {Online Safety Monitoring for LLMs},
  author = {Mona Schirmer and Metod Jazbec and Alexander Timans and Christian Naesseth and Maja Waldron and Eric Nalisnick},
  year = {2026},
  abstract = {Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.},
  url = {https://arxiv.org/abs/2607.02510},
  keywords = {cs.AI, cs.CL, cs.LG, stat.AP, stat.ML},
  eprint = {2607.02510},
  archiveprefix = {arXiv},
}