Paper Detail

Learning from the Self-future: On-policy Self-distillation for dLLMs

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

arxiv Score 15.3

Published 2026-06-16 · First seen 2026-06-17

General AI

Abstract

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric: they inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
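
The two ideas in the abstract, suffix conditioning and step-level supervision, can be illustrated with a toy sketch. This is not the authors' implementation: the `[MASK]` token, the input layout, the function names, and the choice of an averaged KL divergence are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def build_student_input(prompt, response_len):
    """Student sees the prompt followed by a fully masked response."""
    return list(prompt) + [MASK] * response_len


def build_teacher_input(prompt, response_len, self_answer):
    """Self-teacher sees the same masked response plus a self-generated
    answer appended as a suffix (assumed layout)."""
    return list(prompt) + [MASK] * response_len + list(self_answer)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def step_level_loss(teacher_probs, student_probs, decoded_positions):
    """Average teacher-to-student KL over the positions decoded in one
    denoising step, giving one loss value per step instead of per token.
    The averaging rule is an assumption."""
    total = sum(kl(teacher_probs[i], student_probs[i]) for i in decoded_positions)
    return total / len(decoded_positions)
```

In this sketch the teacher and the student share the same masked response positions, so their per-position distributions can be compared directly; only the teacher's context includes the appended answer.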

BibTeX

@article{luo2026learning,
  title = {Learning from the Self-future: On-policy Self-distillation for dLLMs},
  author = {Yifu Luo and Zeyu Chen and Haoyu Wang and Xinhao Hu and Yuxuan Zhang and Zhizhou Sha and Shiwei Liu},
  year = {2026},
  abstract = {On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our ap},
  url = {https://arxiv.org/abs/2606.18195},
  keywords = {cs.CL, on-policy self-distillation, diffusion LLMs, self-teacher construction, suffix conditioning, step-level supervision, iterative denoising process, reasoning benchmarks, sample efficiency, RLVR, SFT, code available, huggingface daily},
  eprint = {2606.18195},
  archiveprefix = {arXiv},
}