Paper Detail

Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu

Browse

Workflow Queues

arxiv Score 18.8

Published 2026-07-02 · First seen 2026-07-03

Research Track A · General AI

Open paper source

Abstract

Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher--student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at https://github.com/Moenupa/SDPO-CL.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{wang2026denser,
  title = {Denser \$\textbackslash{}neq\$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training},
  author = {Meng Wang and Haohan Zhao and Wenzhuo Liu and Lu Yang and Geng Liu and Haiyang Guo and Guo-Sen Xie and Gaofeng Meng and Hongbin Liu and Fei Zhu},
  year = {2026},
  abstract = {Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it strugg},
  url = {https://arxiv.org/abs/2607.01763},
  keywords = {cs.LG, cs.CL, continual post-training, on-policy learning, self-distillation, on-policy self-distillation, policy optimization, continual learning, model specialization, model forgetting, reinforcement learning, GRPO, parameter space, response space, self-reinforcing loop, code available, huggingface daily},
  eprint = {2607.01763},
  archiveprefix = {arXiv},
}

Metadata

{}