SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

Abstract

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length γ, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed γ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects γ per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step-level records with per-step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal γ shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation ≈ 0.56). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed γ = 4 baseline with only 0.34 ms overhead per decision (< 0.5% of step time). The improvement is statistically significant (p < 0.001, paired bootstrap test). We release all profiling data, trained models, and notebooks as open-source artifacts.
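
To make the role of γ concrete, the sketch below picks a speculation length from a predicted acceptance rate. It is not SpecKV's MLP controller: it swaps in the standard closed-form estimate for speculative decoding, in which a per-token acceptance rate α (assumed independent across positions) gives (1 - α^(γ+1)) / (1 - α) expected tokens per step. The candidate set `(2, 4, 6, 8)`, the `draft_cost_ratio` parameter, and both function names are hypothetical and do not come from the paper.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per speculation step.

    Assumes each drafted token is accepted independently with probability
    alpha; the target model always contributes one token of its own.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Limit of the geometric sum when every draft token is accepted.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def select_gamma(alpha: float,
                 candidates=(2, 4, 6, 8),      # hypothetical candidate lengths
                 draft_cost_ratio: float = 0.1  # hypothetical draft/target cost
                 ) -> int:
    """Pick the gamma that maximizes expected tokens per unit of compute.

    One step is modeled as gamma draft passes plus one target pass, so its
    cost in target-pass units is gamma * draft_cost_ratio + 1.
    """
    def tokens_per_cost(gamma: int) -> float:
        return expected_tokens(alpha, gamma) / (gamma * draft_cost_ratio + 1.0)

    return max(candidates, key=tokens_per_cost)


if __name__ == "__main__":
    for alpha in (0.3, 0.6, 0.9):
        print(f"alpha={alpha:.1f} -> gamma={select_gamma(alpha)}")
```

Under these assumptions a low predicted acceptance rate favors a short speculation length and a high one favors a long one, which is the behavior a per-step controller is meant to exploit. In a setup like the one the abstract describes, α would come from a predictor fed with draft confidence and entropy rather than being supplied by hand.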

BibTeX

@article{shukla2026speckv,
  title = {SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author = {Shikhar Shukla},
  year = {2026},
  abstract = {Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length~$γ$, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed~$γ$ (typically~4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied},
  url = {https://arxiv.org/abs/2605.02888},
  keywords = {cs.LG, cs.AI, cs.CL, cs.DC, eess.SY},
  eprint = {2605.02888},
  archiveprefix = {arXiv},
}
