Direct Preference Optimization (DPO)
A Direct Preference Optimization (DPO) Algorithm is an LLM instruction fine-tuning algorithm that aligns a language model with human preferences by formulating a reward function directly from human preference data and training the model to maximize that reward.
- Context:
- It can (typically) involve constructing a reward function directly based on human preference data.
- ...
- Example(s):
- DPO as proposed in (Rafailov et al., 2023).
- ...
- Counter-Example(s):
- a Reinforcement Learning from Human Feedback (RLHF) algorithm, which trains a separate reward model and uses reinforcement learning.
- See: Model Optimization, Language Model Alignment, Human-Centered AI.
References
2024
- (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback#Alternatives Retrieved:2024-1-11.
- An alternative to RLHF called Direct Preference Optimization (DPO) was described in 2023. Like RLHF, it is used to improve pre-trained large language models using human-generated preference data. Unlike RLHF, it does not train an intermediate reward model and does not use reinforcement learning; instead, it formulates a reward function based on the human preferences and directly trains the large language model to maximize this reward.
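- The following is a minimal sketch of what that direct training step can look like, assuming PyTorch and assuming the per-sequence log-probabilities of the preferred and dispreferred responses have already been computed under both the trainable policy and a frozen reference model; the function and argument names are illustrative, not taken from any official implementation.
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of a DPO training loss.

    Each argument is a 1-D tensor of summed token log-probabilities for a
    batch of (prompt, chosen response, rejected response) triples, scored
    by the trainable policy and by a frozen reference model.
    """
    # Implicit reward of each response: beta-scaled log-ratio of the
    # policy's probability to the reference model's probability.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Maximize the Bradley-Terry likelihood that the chosen response
    # receives the higher implicit reward; no reward model, no RL loop.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```
- Minimizing this loss raises the likelihood of preferred responses relative to the reference model and lowers that of dispreferred ones, with beta controlling how far the trained policy may drift from the reference.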
2023
- (Rafailov et al., 2023) ⇒ Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. (2023). “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” doi:10.48550/arXiv.2305.18290
- NOTE:
- It introduces Direct Preference Optimization (DPO), a novel approach for aligning language models with human preferences, as a simpler alternative to the complex and often unstable reinforcement learning from human feedback (RLHF).
- It demonstrates a key mathematical insight, showing that for any given language model, there exists a specific reward function for which the model is optimal, eliminating the need for a separately represented reward function (the resulting objective is written out after this list).
- It simplifies aligning instruction-tuned LLMs to human preferences (by requiring only the language model transformer for training, since the reward function is implicitly defined).
- It can be computationally lighter and easier to implement than RLHF (which involves training two transformer networks and is sensitive to hyperparameter choices).
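- For reference, the DPO objective from (Rafailov et al., 2023) can be written as follows, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $(x, y_w, y_l)$ is a prompt with its preferred and dispreferred responses, $\sigma$ is the logistic function, and $\beta$ controls deviation from the reference:
```latex
% DPO loss: a pairwise classification objective whose implicit reward is
% the beta-scaled log-ratio of the policy against the reference model.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```
- Because this loss involves only the two language models, no separately parameterized reward function needs to be trained, which is the insight noted above.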