2024 TheAdEMAMixOptimizerBetterFaste
- (Pagliardini et al., 2024) ⇒ Matteo Pagliardini, Pierre Ablin, and David Grangier. (2024). “The AdEMAMix Optimizer: Better, Faster, Older.” In: arXiv preprint arXiv:2409.03137.
Subject Headings: Exponential Moving Average.
Notes
- The paper introduces AdEMAMix, a novel optimization method that combines two Exponential Moving Averages (EMAs) of the gradients to better leverage both recent and older gradient information (see the update-rule sketch after these notes).
- The paper highlights a limitation of the traditional Adam and AdamW optimizers, which rely on a single EMA of past gradients: one EMA cannot simultaneously give a high weight to the immediate past and a non-negligible weight to older gradients.
- The paper demonstrates that combining a fast-changing and a slow-changing EMA improves model performance by enabling better navigation of complex loss landscapes.
- The paper shows that AdEMAMix helps large models like transformers achieve comparable or better results while using significantly fewer tokens in training (e.g., a 1.3B parameter model trained on 101B tokens performed comparably to an AdamW model trained on 197B tokens).
- The paper empirically demonstrates that older gradients remain relevant over tens of thousands of steps, challenging the common assumption that only recent gradients matter.
- The paper addresses the problem of model forgetting, showing that AdEMAMix slows down the rate at which models forget training data, leading to more stable learning.
- The paper compares AdEMAMix to existing optimizers, such as Adam, Adafactor, and Lion, highlighting how its mixture of EMAs enables faster convergence and better generalization across various model sizes and tasks.
- The paper incorporates results from language modeling and vision tasks, demonstrating the optimizer’s versatility and consistent superiority over AdamW in different domains.
- The paper outlines how the additional memory and computational overhead introduced by AdEMAMix is negligible in large-scale distributed setups, making it practical for real-world applications.
- The paper suggests that AdEMAMix performs best in scenarios with a high data-to-capacity ratio, i.e., when large amounts of training data are available relative to model size.
- The paper reviews related work on momentum methods, discussing how traditional momentum-based optimizers like SGD+M and Adam have been widely adopted but often miss opportunities to effectively utilize older gradient information.
- The paper sets the stage for future research into optimizers that combine different types of gradient information beyond EMAs, opening up possibilities for more sophisticated methods of handling historical data in optimization.
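The following is a minimal NumPy sketch of the two-EMA update described in the notes above, intended only to make the mechanism concrete. The accumulator and hyperparameter names (beta1, beta2, beta3, alpha) follow the paper's notation, but the function name `ademamix_step`, the toy quadratic example, the treatment of bias correction, and the omission of the paper's alpha/beta3 warm-up schedulers are simplifications of this sketch, not the authors' reference implementation.

```python
import numpy as np

def ademamix_step(theta, grad, state, t, lr=1e-3,
                  beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One simplified AdEMAMix-style parameter update.

    The optimizer keeps three accumulators:
      m1 -- fast EMA of gradients (Adam-style, decay beta1)
      m2 -- slow EMA of gradients (decay beta3 close to 1, so old
            gradients keep contributing for many steps)
      nu -- EMA of squared gradients (Adam-style, decay beta2)
    """
    m1, m2, nu = state["m1"], state["m2"], state["nu"]

    # Update the two gradient EMAs and the second-moment estimate.
    m1 = beta1 * m1 + (1.0 - beta1) * grad
    m2 = beta3 * m2 + (1.0 - beta3) * grad
    nu = beta2 * nu + (1.0 - beta2) * grad ** 2

    # Adam-style bias correction for the fast EMA and second moment
    # (the slow EMA is left uncorrected in this sketch).
    m1_hat = m1 / (1.0 - beta1 ** t)
    nu_hat = nu / (1.0 - beta2 ** t)

    # Mix the two EMAs: alpha controls how much old-gradient
    # information enters the step; decoupled weight decay as in AdamW.
    step = (m1_hat + alpha * m2) / (np.sqrt(nu_hat) + eps)
    theta = theta - lr * (step + weight_decay * theta)

    state["m1"], state["m2"], state["nu"] = m1, m2, nu
    return theta, state


# Toy usage on the quadratic objective 0.5 * ||theta - target||^2.
target = np.ones(8)
theta = np.zeros(8)
state = {"m1": np.zeros(8), "m2": np.zeros(8), "nu": np.zeros(8)}
for t in range(1, 2001):
    grad = theta - target
    theta, state = ademamix_step(theta, grad, state, t, lr=1e-2)
print(np.max(np.abs(theta - target)))  # remaining error after 2,000 steps
```

In the paper, a very large beta3 (e.g., 0.9999) is paired with warm-up schedulers for alpha and beta3 to avoid instability early in training; this sketch leaves those out for brevity.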
Cited By
Quotes
Abstract
Momentum-based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which decays exponentially the present contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show—quite surprisingly—that gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower minima: e.g., a 1.3B parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained on 197B tokens (+95%). Moreover, our method significantly slows down model forgetting during training. Our work motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.
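To unpack the single-EMA limitation stated in the abstract: an EMA with decay β assigns weight (1−β)β^k to the gradient from k steps ago, so a single β must trade emphasis on the immediate past against reach into older gradients; AdEMAMix keeps a small-β EMA for recency and adds a second EMA with β₃ close to 1, scaled by α, inside an Adam-style step. The display below is a simplified rendering using the paper's symbols (α, β₁, β₃), not a verbatim statement of the algorithm.

```latex
% A single EMA with decay \beta weights past gradients geometrically:
% the gradient from k steps ago receives weight (1-\beta)\beta^k, so a
% large \beta reaches far back but down-weights the immediate past,
% while a small \beta does the opposite.
\[
  m_t \;=\; (1-\beta)\sum_{k \ge 0} \beta^{k}\, g_{t-k}.
\]
% AdEMAMix-style step: mix a fast EMA (decay \beta_1) with a slow EMA
% (decay \beta_3 \approx 1), scaled by \alpha, inside an Adam-type update:
\[
  \theta_t \;=\; \theta_{t-1} \;-\; \eta\,
  \frac{\hat{m}^{(1)}_t + \alpha\, m^{(2)}_t}{\sqrt{\hat{\nu}_t} + \varepsilon}.
\]
```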
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 TheAdEMAMixOptimizerBetterFaste | David Grangier, Matteo Pagliardini, Pierre Ablin | | | The AdEMAMix Optimizer: Better, Faster, Older | | | | | | 2024 |