Paper Detail

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

huggingface Score 10.0

Published 2026-05-18 · First seen 2026-05-25

General AI

Abstract

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
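
The idea of distilling only on targeted action spans can be illustrated with a small loss-masking sketch. This is not the authors' implementation: the abstract does not specify how spans are selected or how the distillation loss is computed, so the function names (`span_mask`, `masked_mean`), the token-level `[start, end)` span format, and all numbers below are assumptions made purely for illustration.

```python
from typing import List, Tuple


def span_mask(num_tokens: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Return a 0/1 mask with 1s on tokens inside any half-open [start, end) span."""
    mask = [0] * num_tokens
    for start, end in spans:
        # Clamp to valid indices so out-of-range spans are ignored safely.
        for i in range(max(0, start), min(num_tokens, end)):
            mask[i] = 1
    return mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over selected tokens only; 0.0 if nothing is selected."""
    selected = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


# Hypothetical per-token distillation losses for one 8-token trajectory.
per_token_loss = [0.5, 0.25, 1.0, 2.0, 3.0, 0.25, 4.0, 0.5]

# Assume a hindsight pass (not modeled here) flagged tokens 2..4 and token 6.
mask = span_mask(len(per_token_loss), [(2, 5), (6, 7)])

targeted_loss = masked_mean(per_token_loss, mask)                    # 2.5
dense_loss = masked_mean(per_token_loss, [1] * len(per_token_loss))  # 1.4375
```

In this toy setup the targeted loss averages over four tokens instead of eight, so the training signal concentrates on the flagged actions rather than being diluted by turns that were already fine.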

BibTeX

@misc{yeo2026hint,
  title = {HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents},
  author = {Woongyeng Yeo and Yumin Choi and Taekyung Ki and Sung Ju Hwang},
  year = {2026},
  abstract = {Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successfu},
  url = {https://huggingface.co/papers/2605.17873},
  keywords = {reinforcement learning, self-distillation, hindsight, targeted distillation, long-horizon agents, action selection, feedback-conditioned distillation, trajectory analysis, huggingface daily},
  eprint = {2605.17873},
  archiveprefix = {arXiv},
}
