2024 BetterFasterLargeLanguageModels

From GM-RKB

Subject Headings: LLM Training.

Notes

  • The paper introduces a multi-token prediction method for training large language models (LLMs): at each training position, n independent output heads on top of a shared Transformer trunk predict the next n tokens, which improves sample efficiency, robustness, and performance on a range of benchmarks (a minimal training sketch follows this list).
  • The paper demonstrates that multi-token prediction leads to significant improvements on coding benchmarks like HumanEval and MBPP, with the gains increasing for larger model sizes up to 13B parameters.
  • The paper shows that models trained with 4-token prediction can achieve up to 3x faster inference than standard autoregressive decoding by using the additional prediction heads to draft and verify tokens (a simplified decoding sketch follows this list).
  • The paper presents a memory-efficient implementation for multi-token prediction that reduces peak GPU memory usage without runtime overhead by computing the forward and backward passes of each output head sequentially (illustrated in the training sketch below).
  • The paper shows that the optimal number of predicted tokens depends on the task and data distribution, with n=4 working well for a 32k token vocabulary on code and n=8 being best for byte-level modeling.
  • The paper finds that multi-token prediction is especially beneficial for learning global patterns and longer-term dependencies, as demonstrated by strong performance improvements on byte-level language modeling.
  • The paper finds that the benefits of multi-token prediction persist when training for multiple epochs, and that models pretrained this way maintain an edge when finetuned on downstream tasks like CodeContests.
  • The paper observes mixed results for multi-token prediction on natural language tasks, with improvements on abstractive summarization benchmarks but no significant gains on standard multiple-choice question answering datasets.
  • The paper provides intuitions for why multi-token prediction works, suggesting it mitigates train-test distribution mismatch, assigns higher implicit loss weights to consequential token decisions, and promotes useful information-sharing between positions.
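
To make the training setup in the notes above concrete, here is a minimal PyTorch sketch of a shared Transformer trunk with n independent output heads and a summed next-n-token cross-entropy loss, including the memory-saving idea of running each head's forward and backward pass sequentially so that only one head's logits are materialized at a time. All names, the choice of plain linear heads, and the hyperparameters are illustrative assumptions, not code or details taken from the paper.

  # Sketch only: shared trunk + n independent output heads for multi-token
  # prediction, with per-head sequential forward/backward to limit peak memory.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MultiTokenPredictor(nn.Module):
      def __init__(self, vocab_size=32000, d_model=512, n_layers=8,
                   n_heads=8, n_future=4):
          super().__init__()
          self.n_future = n_future
          self.embed = nn.Embedding(vocab_size, d_model)
          layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                             dim_feedforward=4 * d_model,
                                             batch_first=True)
          self.trunk = nn.TransformerEncoder(layer, n_layers)  # shared trunk
          # one independent head per future offset (simple linear heads here)
          self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size)
                                     for _ in range(n_future))

      def trunk_forward(self, tokens):
          # tokens: (batch, seq) of token ids; returns (batch, seq, d_model)
          seq_len = tokens.size(1)
          causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                         device=tokens.device), diagonal=1)
          return self.trunk(self.embed(tokens), mask=causal)

  def multi_token_loss_step(model, tokens):
      """One training step: head k at position t predicts token t + k + 1.

      The trunk runs forward once. Each head's forward and backward pass runs
      sequentially, so only one head's logits exist at a time; gradients w.r.t.
      the trunk output are accumulated on a detached copy and pushed through
      the trunk in a single final backward pass.
      """
      hidden = model.trunk_forward(tokens)
      shared = hidden.detach().requires_grad_(True)    # cut the graph here
      total = 0.0
      for k, head in enumerate(model.heads):
          shift = k + 1
          logits = head(shared[:, :-shift])            # (batch, seq-shift, vocab)
          targets = tokens[:, shift:]
          loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 targets.reshape(-1))
          loss.backward()     # head grads + grad w.r.t. `shared`; logits freed
          total += loss.item()
      hidden.backward(shared.grad)                     # one backward through trunk
      return total

A toy call under the same assumptions would be loss = multi_token_loss_step(MultiTokenPredictor(), torch.randint(0, 32000, (2, 128))), followed by an ordinary optimizer step; plain autoregressive decoding at inference time only needs the first (next-token) head.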

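The up-to-3x inference speedup noted above comes from using the additional heads to draft future tokens that are then verified, in the spirit of self-speculative decoding. The sketch below is a simplified greedy variant that reuses the hypothetical MultiTokenPredictor class from the previous block; the acceptance rule (keep a draft only if it matches the next-token head's own greedy prediction) and all names are assumptions for illustration, not the paper's exact decoding algorithm.

  import torch

  @torch.no_grad()
  def self_speculative_generate(model, prompt_ids, max_new_tokens=64):
      """Greedy self-speculative decoding sketch (batch size 1, no KV cache).

      One forward pass proposes several tokens: head 0 gives the next token,
      heads 1..n-1 give drafts for the tokens after it. A second forward pass
      over the extended sequence verifies the drafts against head 0's greedy
      predictions; every accepted draft saves one full forward pass.
      """
      tokens = prompt_ids.clone()                      # (1, prompt_len)
      generated = 0
      while generated < max_new_tokens:
          hidden = model.trunk_forward(tokens)
          last = hidden[:, -1]                         # trunk state at final position
          proposals = [head(last).argmax(-1) for head in model.heads]
          next_tok, drafts = proposals[0], proposals[1:]
          tokens = torch.cat([tokens, next_tok.unsqueeze(1)], dim=1)
          generated += 1
          if generated >= max_new_tokens or not drafts:
              continue
          # append drafts provisionally and verify them with one forward pass
          candidate = torch.cat([tokens] + [d.unsqueeze(1) for d in drafts], dim=1)
          verify_hidden = model.trunk_forward(candidate)
          accepted = 0
          for i, draft in enumerate(drafts):
              pos = tokens.size(1) - 1 + i             # context ends just before the draft
              check = model.heads[0](verify_hidden[:, pos]).argmax(-1)
              if torch.equal(check, draft):
                  accepted += 1                        # draft agrees with next-token head
              else:
                  break                                # reject this draft and the rest
          tokens = candidate[:, : tokens.size(1) + accepted]
          generated += accepted
      return tokens

Every accepted draft saves one full forward pass, which is where the speedup comes from; a production version would additionally use a key-value cache.
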
Cited By

Quotes

Abstract

Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
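
Written out as a formula (the notation below is my own, not quoted from the paper), the objective described in the abstract replaces the usual next-token cross-entropy with a sum over n future tokens, each predicted by its own head on top of the shared trunk:

  \mathcal{L}_n(\theta) \;=\; -\sum_{t} \sum_{k=1}^{n} \log P_\theta\!\left(x_{t+k} \mid x_{\le t}\right)

Setting n = 1 recovers standard next-token prediction.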

References


2024 BetterFasterLargeLanguageModels: Gabriel Synnaeve, Fabian Gloeckle, Badr Youbi Idrissi, David Lopez-Paz, and Baptiste Rozière (2024). "Better & Faster Large Language Models via Multi-token Prediction." doi:10.48550/arXiv.2404.19737