Language Model Scaling Law

A Language Model Scaling Law is a deep learning scaling law that describes how language model performance relates to model size, training dataset size, or training compute.





  • (Kaplan et al., 2020) ⇒ Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. (2020). "Scaling Laws for Neural Language Models." In: arXiv preprint arXiv:2001.08361.
    • NOTES:
      • The paper identifies power-law relationships between language model loss and model size, dataset size, and training compute, showing that performance improves predictably as each factor is scaled.
      • The paper demonstrates that architectural shape hyperparameters, such as depth or width, have a minimal effect on performance compared with scale factors such as model size, data, and compute.
      • The paper reveals that larger models are more sample-efficient, reaching similar or better performance with less data and fewer training steps than smaller models.
      • The paper suggests that optimal training efficiency is achieved by training large models on modest datasets and halting training well before full convergence.
      • The paper finds that overfitting can be predicted and mitigated by scaling model size and dataset size together: the overfitting penalty depends on the ratio N^0.74/D, so an 8x increase in model size calls for roughly a 5x increase in data.
      • The paper emphasizes that training curves follow predictable power-law patterns, enabling researchers to forecast final performance from the early part of a training run.
      • The paper highlights that larger models, when trained with the appropriate compute budget, can be significantly more compute-efficient than smaller models trained to convergence.
      • The paper shows that performance scales smoothly across multiple orders of magnitude, with no significant deviation in trends, even as model size increases dramatically.
      • The paper reports that the ideal batch size for training these models is roughly a power of the loss alone and can be determined by measuring the gradient noise scale.
      • The paper concludes that scaling model size is more impactful than increasing data size or training time, recommending a focus on larger models for future improvements.
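
The power-law relationship and the forecasting idea in the notes above can be illustrated with a minimal sketch. It assumes a single-variable law of the form loss ≈ (x_c / x)^alpha, where x stands for a scale factor such as parameter count. The function names, the synthetic data points, and the constants 0.08 and 1e14 are hypothetical values chosen for illustration and are not results taken from the paper.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~ (x_c / x) ** alpha by least squares in log-log space.

    Taking logs gives log(loss) = alpha * log(x_c) - alpha * log(x),
    which is a straight line in log(x).
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    alpha = -slope
    x_c = np.exp(intercept / alpha)
    return alpha, x_c


def predict_loss(x, alpha, x_c):
    """Evaluate the fitted power law at scale x."""
    return (x_c / np.asarray(x, dtype=float)) ** alpha


# Synthetic, noise-free measurements (hypothetical, for illustration only).
sizes = np.logspace(6, 9, 7)            # model sizes from 1e6 to 1e9
losses = (1e14 / sizes) ** 0.08         # generated from an assumed law

alpha, x_c = fit_power_law(sizes, losses)

# Extrapolate one order of magnitude beyond the largest measured size.
forecast = predict_loss(1e10, alpha, x_c)
print(f"alpha={alpha:.4f}, x_c={x_c:.3e}, forecast loss at 1e10: {forecast:.4f}")
```

Fitting in log-log space keeps the sketch to a linear regression. With real, noisy measurements, a nonlinear fit or a law with an irreducible-loss term may be more appropriate.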