SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Talor Abramovich, Maor Ashkenazi, Carl Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman

huggingface Score 9.5

Published 2026-02-10 · First seen 2026-04-14

General AI

Abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
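
The data dependence and draft-length trade-off mentioned in the abstract can be illustrated with the standard idealized model of speculative decoding (Leviathan et al., 2023). This is a minimal sketch and not SPEED-Bench code: it assumes each drafted token is accepted independently with probability `alpha`, that drafting `gamma` tokens costs `gamma * draft_cost` target-model forward passes, and all function names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumption: each of the `gamma` drafted tokens is accepted
    independently with probability `alpha` (an idealization; real
    acceptance depends on the prompt and the position in the draft).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(gamma + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Idealized speedup over plain autoregressive decoding.

    `draft_cost` is the assumed cost of one draft forward pass relative
    to one target forward pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1.0)


def best_draft_length(alpha: float, draft_cost: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximizes the idealized speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: estimated_speedup(alpha, g, draft_cost),
    )


if __name__ == "__main__":
    # Illustrative acceptance rates only; these are not measured values.
    for alpha in (0.5, 0.7, 0.9):
        g = best_draft_length(alpha, draft_cost=0.1)
        print(alpha, g, round(estimated_speedup(alpha, g, 0.1), 3))
```

Under this model the best draft length shifts with the acceptance rate, which in practice varies with the input data. The model ignores batching effects, so it says nothing about the batch-size dependence the paper reports.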

BibTeX

@misc{abramovich2026speed,
  title = {SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding},
  author = {Talor Abramovich and Maor Ashkenazi and Carl Putterman and Benjamin Chislett and Tiyasa Mitra and Bita Darvish Rouhani and Ran Zilberstein and Yonatan Geifman},
  year = {2026},
  abstract = {Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments.},
  url = {https://huggingface.co/papers/2604.09557},
  keywords = {Speculative Decoding, Large Language Model, inference, benchmarking, throughput evaluation, semantic diversity, production engines, vLLM, TensorRT-LLM, draft length, vocabulary pruning, code available, huggingface daily},
  eprint = {2604.09557},
  archiveprefix = {arXiv},
}
