Learning from the Self-future: On-policy Self-distillation for dLLMs

Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

arxiv Score 15.3

Published 2026-06-16 · First seen 2026-06-17

General AI

Abstract

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric: they inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
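
The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `<mask>` token, the placement of the suffix directly after the partially masked response, and the reading of "step-level" as a divergence averaged over all positions still masked at one denoising step are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def build_student_input(prompt: Sequence[str], partial: Sequence[str]) -> List[str]:
    """Student view: the prompt followed by the partially denoised response."""
    return list(prompt) + list(partial)


def build_teacher_input(
    prompt: Sequence[str], partial: Sequence[str], self_answer: Sequence[str]
) -> List[str]:
    """Teacher view (assumed layout): same as the student, plus the model's
    own generated answer appended as a suffix rather than a prefix."""
    return list(prompt) + list(partial) + list(self_answer)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def step_level_loss(
    partial: Sequence[str],
    teacher_probs: Sequence[Sequence[float]],
    student_probs: Sequence[Sequence[float]],
) -> float:
    """Assumed step-level objective: mean KL(teacher || student) over every
    response position that is still masked at this denoising step.
    The probability lists are indexed by response position."""
    masked = [i for i, tok in enumerate(partial) if tok == MASK]
    if not masked:
        return 0.0
    total = sum(kl_divergence(teacher_probs[i], student_probs[i]) for i in masked)
    return total / len(masked)
```

In this sketch the same model would produce both sets of probabilities, once from the student input and once from the suffix-conditioned teacher input, and only the student pass would receive gradients.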

BibTeX

@article{luo2026learning,
  title = {Learning from the Self-future: On-policy Self-distillation for dLLMs},
  author = {Yifu Luo and Zeyu Chen and Haoyu Wang and Xinhao Hu and Yuxuan Zhang and Zhizhou Sha and Shiwei Liu},
  year = {2026},
  abstract = {On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our ap},
  url = {https://arxiv.org/abs/2606.18195},
  keywords = {cs.CL, on-policy self-distillation, diffusion LLMs, self-teacher construction, suffix conditioning, step-level supervision, iterative denoising process, reasoning benchmarks, sample efficiency, RLVR, SFT, code available, huggingface daily},
  eprint = {2606.18195},
  archiveprefix = {arXiv},
}
