2023 AcceleratingLlmInferencewithSta
- (Spector & Ré, 2023) ⇒ Benjamin Spector, and Christopher Ré. (2023). “Accelerating LLM Inference with Staged Speculative Decoding.” In: arXiv preprint arXiv:2308.04623. doi:10.48550/arXiv.2308.04623
Subject Headings: LLM Inference, Staged Speculative Decoding, Small-Batch LLM Inference
Notes
- It introduces staged speculative decoding to accelerate LLM inference, especially in low-batch, on-device scenarios.
- It builds upon speculative decoding by reorganizing batches into a tree structure and adding a second speculative stage.
- It achieves a 3.16x reduction in single-batch decoding latency with a 762M-parameter GPT-2-L oracle model, without compromising output quality.
- It addresses the challenge of low arithmetic intensity in small-batch LLM inference, improving latency, personalization, and privacy.
- It utilizes a tree-structured speculative batch and a two-stage decoding process for efficiency gains.
- It evaluates performance using a GPT-2-L oracle model, a smaller GPT-2 draft model, and a Katz backoff trigram model.
- It identifies future directions, including faster speculative sampling, running larger models on-device, and improving lower-level draft models.
Cited By
Quotes
Abstract
Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.
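The batch restructuring described in the abstract can also be sketched. The fragment below is a hypothetical illustration, not the paper's code: instead of one linear chain of drafted tokens, it branches on the draft model's top-`branch` candidates at each depth, so one batched oracle call can verify many candidate continuations; `draft_topk` and `build_draft_tree` are invented names, and the toy draft model is a stand-in.

```python
# Hedged sketch of a tree-structured speculative batch: branch on the
# draft model's top-b tokens at each depth, then verify all root-to-leaf
# paths in a single oracle batch. All names here are illustrative.

def build_draft_tree(draft_topk, prefix, depth, branch):
    """Return every candidate continuation of length `depth`, branching
    on the draft model's top-`branch` tokens at each step."""
    if depth == 0:
        return [[]]
    paths = []
    for t in draft_topk(prefix, branch):
        for tail in build_draft_tree(draft_topk, prefix + [t], depth - 1, branch):
            paths.append([t] + tail)
    return paths

# Toy draft model: its top-b next "tokens" are the last token +1 ... +b.
draft_topk = lambda ctx, b: [ctx[-1] + i for i in range(1, b + 1)]

paths = build_draft_tree(draft_topk, [0], depth=3, branch=2)
print(len(paths))   # 8 candidate continuations to verify in one batch
print(paths[0])     # [1, 2, 3]
```

With depth 3 and branching factor 2 the tree exposes 8 candidate continuations per oracle call rather than 1, raising the expected number of accepted tokens per batch, which is the effect the abstract attributes to the tree restructuring.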
References
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Benjamin Spector, Christopher Ré | | 2023 | Accelerating LLM Inference with Staged Speculative Decoding | | | | 10.48550/arXiv.2308.04623 | | 2023