Paper Detail

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

Browse

Workflow Queues

huggingface Score 14.5

Published 2026-03-26 · First seen 2026-03-27

General AI

Open paper source

Abstract

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@misc{fu2026revisiting,
  title = {Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes},
  author = {Yuqian Fu and Haohuan Huang and Kaiwen Jiang and Yuanheng Zhu and Dongbin Zhao},
  year = {2026},
  abstract = {On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically,},
  url = {https://huggingface.co/papers/2603.25562},
  keywords = {on-policy distillation, large language models, teacher feedback, student-generated rollouts, sampled-token variant, distribution matching, reverse-KL, gradient variance, token-level estimation, sequence-level estimation, top-K local support matching, top-p rollout sampling, special-token masking, code available, huggingface daily},
  eprint = {2603.25562},
  archiveprefix = {arXiv},
}

Metadata

{}