Paper Detail

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Chu-Cheng Lin, Eugene Ie

arxiv Score 9.8

Published 2026-04-28 · First seen 2026-04-29

General AI

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_{θ^{-q}}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_θ$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_θ^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{lin2026how,
  title = {How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum},
  author = {Chu-Cheng Lin and Eugene Ie},
  year = {2026},
  abstract = {Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability \$p\_0\$ is small. Using the Tsallis \$q\$-logarithm, we define a loss family \$J\_Q\$ that interpolates between RLVR (at \$q\{=\}0\$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at \$q\{=\}1\$, the density-estimation pole). All members share the same per-example gradient directi},
  url = {https://arxiv.org/abs/2604.25907},
  keywords = {cs.LG, cs.AI},
  eprint = {2604.25907},
  archiveprefix = {arXiv},
}

Metadata

{}