Speculative Decoding Technique
Speculative Decoding Technique is an AI model inference acceleration technique that speeds up LLM output generation by using a small draft model to propose candidate tokens for the larger target model to verify, without altering the final output distribution.
- Context:
- It can (typically) have the draft model propose several tokens ahead, which the larger target model then verifies in a single parallel forward pass.
- It can (often) involve the use of speculative sampling, an accept/reject scheme that guarantees the output distribution matches that of the target model alone.
- ...
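The draft-then-verify loop can be sketched with toy stand-in distributions; here `p_target` and `p_draft` are hypothetical placeholders for calls to a real target and draft model, and the accept/reject rule follows the speculative sampling scheme (accept a drafted token with probability min(1, p/q), else resample from the normalized residual):

```python
import random

VOCAB = ["a", "b", "c"]

def p_target(prefix):
    # Hypothetical target-model next-token distribution (illustration only).
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def p_draft(prefix):
    # Hypothetical draft-model distribution, roughly similar to the target.
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def sample(dist, rng):
    # Draw one token from a {token: probability} dict.
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok

def speculative_step(prefix, k, rng):
    """Draft k tokens, then accept/reject them so the result is distributed
    exactly as if every token had been sampled from the target model."""
    drafted = []
    for _ in range(k):
        drafted.append(sample(p_draft(prefix + drafted), rng))
    accepted = []
    for tok in drafted:
        q = p_draft(prefix + accepted)[tok]
        p = p_target(prefix + accepted)[tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)  # drafted token kept
        else:
            # Resample from the residual max(0, p_target - p_draft), normalized.
            pt, pd = p_target(prefix + accepted), p_draft(prefix + accepted)
            residual = {t: max(0.0, pt[t] - pd[t]) for t in VOCAB}
            z = sum(residual.values())
            accepted.append(sample({t: v / z for t, v in residual.items()}, rng))
            break  # verification stops at the first rejection
    else:
        # All k drafts accepted: sample one extra token from the target for free.
        accepted.append(sample(p_target(prefix + accepted), rng))
    return accepted

print(speculative_step([], k=4, rng=random.Random(0)))
```

Each call to `speculative_step` yields between 1 and k+1 tokens per target-model verification pass, which is where the speedup comes from when the draft model is cheap and usually agrees with the target.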
- Example(s):
- Google's speculative decoding implementation, which leverages a small draft model to generate candidate tokens that are then checked by a larger LLM.
- DeepMind's Speculative Sampling (SpS) method, which achieves 2-2.5x decoding speedups in large language models while maintaining output fidelity.
- HuggingFace's assisted generation approach, which speeds up generation by using a draft model for initial token prediction.
- ...
- Counter-Example(s):
- Standard Autoregressive Decoding, which decodes tokens sequentially without the use of a draft model, leading to slower inference times.
- Non-speculative Beam Search, a decoding method that explores multiple sequences of tokens but does not involve parallel draft models for acceleration.
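For contrast, the standard autoregressive baseline can be sketched with the same kind of toy distribution (`p_target` is a hypothetical placeholder for a real model call); note that it makes one target-model call per generated token, whereas speculative decoding amortizes one verification pass over several tokens:

```python
import random

def p_target(prefix):
    # Hypothetical next-token distribution (illustration only).
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def sample(dist, rng):
    # Draw one token from a {token: probability} dict.
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok

def autoregressive_decode(n, rng):
    # One target-model call per token: no drafting, no parallel verification.
    out = []
    for _ in range(n):
        out.append(sample(p_target(out), rng))
    return out

print(autoregressive_decode(5, random.Random(0)))
```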
- See: Inference Acceleration, Autoregressive Decoding, Speculative Sampling, Large Language Models
== References ==
2023
- (Leviathan et al., 2023) ⇒ Yaniv Leviathan, Matan Kalman, and Yossi Matias. (2023). "Fast Inference from Transformers via Speculative Decoding." In: Proceedings of the 40th International Conference on Machine Learning.
- NOTE: It introduces speculative decoding as a method to speed up the inference of transformer-based models by using a draft model to generate tokens that are then verified by the main model.
- NOTE: It demonstrates that this technique can achieve up to 3x speedups in inference while maintaining the quality of the outputs, making it suitable for real-time applications.
- (Kim et al., 2023) ⇒ Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, and Kurt Keutzer. (2023). "Speculative Decoding with Big Little Decoder." In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
- NOTE: It proposes the Big Little Decoder (BiLD) framework, which combines a small draft model and a larger target model to reduce inference latency in text generation tasks.
- NOTE: It achieves a 2.12x speedup in various NLP tasks with minimal degradation in output quality, demonstrating its potential for efficient deployment in real-time systems.
- (Chen et al., 2023) ⇒ Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." In: arXiv preprint arXiv:2302.01318.
- NOTE: It presents speculative sampling as a technique for accelerating the decoding process in large language models by generating multiple tokens simultaneously and using a rejection sampling scheme to ensure output accuracy.
- NOTE: It achieves a 2-2.5x speedup in decoding for large models like Chinchilla, without requiring changes to the model itself, making it a practical method for faster inference.
2024
- (Anonymous, 2024) ⇒ Anonymous. (2024). "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding." In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).
- NOTE: It introduces self-speculative decoding, where a compressed variant of the model itself (with selected intermediate layers skipped) drafts tokens that the full model subsequently verifies, achieving speedup without loss of output quality.
- NOTE: It demonstrates that this technique is effective across various tasks, providing a reliable method for accelerating LLM inference in practical applications.