Paper Detail

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville

Browse

Workflow Queues

arxiv Score 9.2

Published 2026-06-24 · First seen 2026-06-25

General AI

Open paper source

Abstract

On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollout and the correct rollout used as context. Unlike the ideal optimal on-policy reinforcement learning (RL), which preserves probability ratios among equally correct rollouts, self-distillation can amplify existing probability gaps, concentrating mass on already-dominant modes. On a controlled graph path-finding task and science question-answering benchmarks, self-distilled models match or exceed RL on average performance but exhibit substantially lower functional and semantic diversity, failing on out-of-distribution settings that require diverse strategies.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: soon
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{nicolicioiu2026policy,
  title = {On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity},
  author = {Andrei Liviu Nicolicioiu and Mohammad Pezeshki and Aaron Courville},
  year = {2026},
  abstract = {On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student},
  url = {https://arxiv.org/abs/2606.26091},
  keywords = {cs.LG, cs.AI},
  eprint = {2606.26091},
  archiveprefix = {arXiv},
}

Metadata

{}