Trust-Region Behavior Blending for On-Policy Distillation

Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov

huggingface Score 5.0

Published 2026-05-29 · First seen 2026-06-01

General AI

Abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
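
As a rough illustration of the warmup idea described in the abstract, the sketch below picks, for a single next-token distribution, the behavior policy closest to the teacher that stays inside a KL ball around the student, and shrinks that ball to zero over warmup. Several details are assumptions, not taken from the paper: the trust region is written as KL(b || student) <= eps, closeness to the teacher is measured as KL(b || teacher), the budget decays linearly, and all probabilities are strictly positive. Under those assumptions the constrained optimum lies on the geometric path b ∝ student^(1-lam) · teacher^lam, so a bisection over lam is enough. The function names (`geometric_blend`, `trust_region_behavior`, `annealed_budget`) are hypothetical.

```python
import math


def geometric_blend(student, teacher, lam):
    """Normalized geometric mixture b proportional to student^(1-lam) * teacher^lam."""
    logits = [(1.0 - lam) * math.log(s) + lam * math.log(t)
              for s, t in zip(student, teacher)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """KL(p || q) for distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def trust_region_behavior(student, teacher, eps, iters=50):
    """Blend closest to the teacher with KL(b || student) <= eps.

    The KL directions are assumptions made for this sketch.
    """
    if eps <= 0.0:
        return list(student)  # zero budget: pure student rollouts
    if kl(teacher, student) <= eps:
        return list(teacher)  # the teacher itself is inside the region
    lo, hi = 0.0, 1.0
    # KL(b_lam || student) is non-decreasing in lam on [0, 1], so bisect.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= eps:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


def annealed_budget(step, warmup_steps, eps0):
    """Linear decay of the KL budget to zero (hypothetical schedule)."""
    return eps0 * max(0.0, 1.0 - step / warmup_steps)


if __name__ == "__main__":
    student = [0.7, 0.2, 0.1]
    teacher = [0.1, 0.2, 0.7]
    for step in (0, 50, 100):
        eps = annealed_budget(step, warmup_steps=100, eps0=0.05)
        print(step, trust_region_behavior(student, teacher, eps))
```

In this sketch only the distribution used to sample rollout tokens changes; the training loss on each sampled prefix would remain the reverse-KL OPD loss, as the abstract states. Once the budget reaches zero, the function returns the student distribution unchanged.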

BibTeX

@misc{plyusov2026trust,
  title = {Trust-Region Behavior Blending for On-Policy Distillation},
  author = {Daniil Plyusov and Alexey Gorbatovski and Alexey Malakhov and Nikita Balagansky and Boris Shaposhnikov and Daria Korotyshova and Daniil Gavrilov},
  year = {2026},
  abstract = {On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.},
  url = {https://huggingface.co/papers/2605.31159},
  keywords = {on-policy distillation, student policy, teacher policy, prefix mismatch, offline distillation, behavior blending, KL trust region, reverse-KL loss, annealing, huggingface daily},
  eprint = {2605.31159},
  archiveprefix = {arXiv},
}
