Paper Detail

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

huggingface Score 9.5

Published 2026-06-16 · First seen 2026-06-17

General AI

Abstract

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage, so the question is silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off-policy and on-policy distillation and GRPO, with the largest gains at the smallest scale.
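
The mechanics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and class names (`group_advantages`, `build_bcq`, `build_ncq`, `PromptReplayBuffer`), the prompt wording, the candidate ordering, and the mean/std advantage normalization are all hypothetical. The abstract specifies only that all-fail questions yield zero advantage, that BCQ pairs one correct teacher response with one incorrect student response anonymously, that NCQ aggregates the student's wrong rollouts, and that the buffer graduates a question at mean accuracy one half or evicts it FIFO.

```python
from collections import OrderedDict
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages (assumed GRPO-style mean/std normalization).

    When every rollout gets the same reward (e.g. all fail), all advantages are 0.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def build_bcq(question, teacher_correct, student_wrong, teacher_first=True):
    """BCQ: one correct and one incorrect candidate, with no hint of who wrote which."""
    if teacher_first:
        first, second = teacher_correct, student_wrong
    else:
        first, second = student_wrong, teacher_correct
    # Prompt wording is hypothetical; the abstract does not give a template.
    return (
        f"{question}\n\nCandidate A:\n{first}\n\nCandidate B:\n{second}\n\n"
        "Exactly one candidate is correct. Decide which, then answer the question."
    )


def build_ncq(question, wrong_rollouts):
    """NCQ: aggregate the student's wrong rollouts into a single prompt."""
    listed = "\n\n".join(
        f"Attempt {i}:\n{r}" for i, r in enumerate(wrong_rollouts, 1)
    )
    return (
        f"{question}\n\nThe following attempts are all incorrect:\n\n{listed}\n\n"
        "Identify what they get wrong, then answer the question."
    )


class PromptReplayBuffer:
    """Holds hard questions until they graduate or are FIFO-evicted."""

    def __init__(self, capacity, graduate_at=0.5):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items = OrderedDict()  # question id -> question, oldest first

    def add(self, qid, question):
        """Insert a hard question; return the id evicted to make room, if any."""
        if qid in self._items:
            return None
        evicted = None
        if len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)  # drop the oldest entry
        self._items[qid] = question
        return evicted

    def update(self, qid, rewards):
        """Graduate (remove) the question once mean rollout accuracy reaches the threshold."""
        if qid in self._items and mean(rewards) >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def __contains__(self, qid):
        return qid in self._items

    def __len__(self):
        return len(self._items)
```

In this sketch, a question whose rollouts all score 0 produces all-zero advantages from `group_advantages`, which is the case where a plain group-relative update has no signal; such a question would be passed to `PromptReplayBuffer.add` and re-served as BCQ and NCQ prompts until `update` reports graduation or a newer question pushes it out. Details such as buffer capacity, how candidate order is chosen, and which prompt variant is scored for graduation are not stated in the abstract and are left open here.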

BibTeX

@misc{lee2026zone,
  title = {Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients},
  author = {Byung-Kwan Lee and Ximing Lu and Shizhe Diao and Minki Kang and Saurav Muralidharan and Karan Sapra and Andrew Tao and Pavlo Molchanov and Yejin Choi and Yu-Chiang Frank Wang and Ryo Hachiuma},
  year = {2026},
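  abstract = {Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage, so the question is silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off-policy and on-policy distillation and GRPO, with the largest gains at the smallest scale.},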
  url = {https://huggingface.co/papers/2606.18216},
  keywords = {knowledge distillation, student model, teacher model, reinforcement learning, policy gradient, on-policy assumption, prompt replay buffer, Binary Candidate-included Question, Negative Candidate-included Question, zone of proximal development, vision-language models, benchmark suite, huggingface daily},
  eprint = {2606.18216},
  archiveprefix = {arXiv},
}
