Speculative Decoding Technique


A Speculative Decoding Technique is an AI model inference acceleration technique that speeds up LLM output generation by having a small draft LLM propose several tokens ahead, which the larger target LLM then verifies in parallel, without altering the final outputs.
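The draft-then-verify idea can be illustrated with a minimal greedy sketch. Everything named here is hypothetical: a "model" is assumed to be a plain function from a token prefix to its greedy next token, `gamma` is the assumed number of drafted tokens per step, and the toy models exist only for illustration. A real system would score all drafted positions in one batched forward pass of the target model; this sketch loops for clarity.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding, one target call per token (the baseline)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode_greedy(target: NextToken, draft: NextToken,
                              prompt: List[int], max_new_tokens: int,
                              gamma: int = 4) -> List[int]:
    """Greedy draft-and-verify loop; output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. Draft: the small model proposes gamma tokens autoregressively.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(tokens + proposal))
        # 2. Verify: keep drafted tokens while they match the target's choice.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                # First mismatch: use the target's token and discard the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target adds one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


# Toy models (illustrative only): the draft disagrees on every third prefix length.
def toy_target(toks: List[int]) -> int:
    return (31 * sum(toks) + len(toks)) % 50


def toy_draft(toks: List[int]) -> int:
    guess = toy_target(toks)
    return (guess + 1) % 50 if len(toks) % 3 == 0 else guess
```

Because every emitted token is either confirmed or supplied by the target model, the result is the same sequence the target would produce alone; the draft model only affects how many target steps are needed.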



== References ==

2023

  • (Leviathan et al., 2023) ⇒ Yaniv Leviathan, Matan Kalman, and Yossi Matias. (2023). "Fast Inference from Transformers via Speculative Decoding." In: Proceedings of the 40th International Conference on Machine Learning (ICML 2023).
    • NOTE: It introduces speculative decoding, which speeds up inference in transformer-based models by having a smaller draft model generate several candidate tokens that the main (target) model then verifies in parallel.
    • NOTE: It reports a 2x to 3x inference speedup on T5-XXL relative to the standard T5X implementation, with outputs identical to those of the target model, making it suitable for latency-sensitive applications.
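The size of the speedup depends on how often draft tokens are accepted. A minimal sketch of the expected-value estimate from (Leviathan et al., 2023) follows, under that paper's simplifying assumption that each draft token is accepted independently with probability `alpha`. The function and parameter names are hypothetical: `gamma` is the number of drafted tokens per step and `c` is the assumed cost of one draft-model run relative to one target-model run.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model step: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha == 1.0:
        # Limit case: all gamma drafts accepted plus one extra target token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-time improvement, counting gamma draft runs of relative cost c per step."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)
```

With a draft model that is never right (`alpha = 0`) each step still yields one token from the target, so the method degrades to ordinary decoding plus the drafting overhead.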

2023

  • (Kim et al., 2023) ⇒ Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, and Kurt Keutzer. (2023). "Speculative Decoding with Big Little Decoder." In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
    • NOTE: It proposes the Big Little Decoder (BiLD) framework, which combines a small draft model and a larger target model to reduce inference latency in text generation tasks.
    • NOTE: It reports speedups of up to 2.12x across several text generation tasks with minimal degradation in output quality, demonstrating its potential for efficient deployment in real-time systems.

2023

  • (Chen et al., 2023) ⇒ Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv preprint arXiv:2302.01318.
    • NOTE: It presents speculative sampling, which accelerates decoding in large language models by scoring several draft-model tokens in parallel with the target model and applying a modified rejection sampling scheme that preserves the target model's output distribution.
    • NOTE: It reports a 2-2.5x decoding speedup for Chinchilla 70B without requiring changes to the model itself or compromising sample quality, making it a practical method for faster inference.
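The rejection sampling step for a single drafted token can be sketched as follows. This is an illustrative sketch rather than the paper's code: `p` and `q` are assumed to be the target and draft next-token distributions as plain probability lists, and the function names are hypothetical. A draft token `x` is kept with probability min(1, p[x] / q[x]); on rejection, a replacement is drawn from the normalized positive part of `p - q`.

```python
import random
from typing import List


def verify_draft_token(p: List[float], q: List[float], x: int,
                       rng: random.Random) -> int:
    """Accept draft token x (sampled from q) or resample so the result follows p."""
    # q[x] > 0 because x was sampled from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: sample from the residual distribution max(0, p - q).
    # The residual has positive mass here, since rejection implies p[x] < q[x].
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def sample_via_draft(p: List[float], q: List[float], rng: random.Random) -> int:
    """Draw one token from the draft distribution, then verify it against the target."""
    x = rng.choices(range(len(q)), weights=q, k=1)[0]
    return verify_draft_token(p, q, x, rng)
```

Repeated calls to `sample_via_draft` yield tokens distributed according to `p` even though every proposal comes from `q`, which is why the technique leaves the target model's outputs statistically unchanged.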

2024

  • (Zhang et al., 2024) ⇒ Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. (2024). "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding." In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).
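    • NOTE: It introduces self-speculative decoding, in which the LLM drafts tokens itself by selectively skipping some of its intermediate layers and then verifies those drafts with the full model in a single forward pass, achieving speedup without a separate draft model and without loss of output quality.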
    • NOTE: It demonstrates that this technique is effective across various tasks, providing a reliable method for accelerating LLM inference in practical applications.