Paper Detail

On the Proper Treatment of Units in Surprisal Theory

Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, Ryan Cotterell

Browse

Workflow Queues

arxiv Score 6.3

Published 2026-04-30 · First seen 2026-05-01

General AI

Open paper source

Abstract

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: soon
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{kiegeland2026proper,
  title = {On the Proper Treatment of Units in Surprisal Theory},
  author = {Samuel Kiegeland and Vésteinn Snæbjarnarson and Tim Vieira and Ryan Cotterell},
  year = {2026},
  abstract = {Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two dis},
  url = {https://arxiv.org/abs/2604.28147},
  keywords = {cs.CL},
  eprint = {2604.28147},
  archiveprefix = {arXiv},
}

Metadata

{}