Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is an LLM instruction fine-tuning algorithm that aligns language models with human preferences by formulating a reward function directly from preference data and training the model to maximize it, without an intermediate reward model or reinforcement learning (see the loss sketch below the See list).
- Context:
- It can (typically) involve constructing a reward function directly from human preference data, rather than training a separate reward model.
- ...
- Example(s):
- DPO as proposed in (Rafailov et al., 2023).
- ...
- Counter-Example(s):
- a Reinforcement Learning from Human Feedback (RLHF) algorithm, which trains an intermediate reward model and uses reinforcement learning.
- See: Model Optimization, Language Model Alignment, Human-Centered AI.
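The following is a minimal PyTorch sketch of the DPO loss referenced in the definition above, assuming the per-sequence log-probabilities of each preference pair have already been computed under the trained policy and under a frozen reference model; the tensor names and the default β value are illustrative, not taken from the source.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for
    the preferred ("chosen") and dispreferred ("rejected") completions,
    under the policy being trained and under a frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: the Bradley-Terry
    # log-likelihood that the chosen completion is preferred.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the reward enters only through the log-ratio of the two models, minimizing this loss updates the language model directly; no separate reward network is trained.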
References
2024
- (Hui et al., 2024) ⇒ Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin, et al. (2024). “Qwen2.5 Coder Technical Report.” doi:10.48550/arXiv.2409.12186
- NOTE: It describes a code instruction tuning pipeline, a multi-stage process for converting base code LLMs into instruction-following assistants that includes synthetic data generation, checklist-based evaluation, and DPO alignment. Connected concepts: LLM Instruction Tuning, Direct Preference Optimization, Code Quality Assessment.
- (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback#Alternatives Retrieved: 2024-01-11.
- An alternative to RLHF called Direct Preference Optimization (DPO) was described in 2023. Like RLHF, it is used to improve pre-trained large language models using human-generated preference data. Unlike RLHF, it does not train an intermediate reward model and does not use reinforcement learning; instead, it formulates a reward function based on the human preferences and directly trains the large language model to maximize this reward.
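- NOTE: To make the quoted description concrete, the preference likelihood itself becomes the training objective. The DPO loss stated in Rafailov et al. (2023) is

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],$$

where $y_w$ and $y_l$ are the human-preferred and dispreferred completions for prompt $x$, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the trained policy $\pi_\theta$ may move from the reference.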
2023
- (Rafailov et al., 2023) ⇒ Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. (2023). “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” doi:10.48550/arXiv.2305.18290
- NOTE:
- It introduces Direct Preference Optimization (DPO), a novel approach for aligning language models with human preferences, as a simpler alternative to the complex and often unstable reinforcement learning from human feedback (RLHF).
- It demonstrates a key mathematical insight: for any given language model, there is a reward function for which that model is optimal, which eliminates the need for a separately represented reward model (see the identity after this list).
- It simplifies aligning instruction-tuned LLMs to human preferences (by requiring only the language model transformer for training, since the reward function is implicitly defined).
- It can be computationally lighter and easier to implement than RLHF (which involves training two transformer networks and is sensitive to hyperparameter choices).
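- NOTE: The insight above can be written as an identity (reproduced here for clarity, not quoted from the source): under the KL-constrained reward maximization used in RLHF, any policy $\pi$ is exactly optimal for the reward

$$r(x, y) = \beta \log \frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x),$$

where $Z(x)$ depends only on the prompt and cancels when two completions of the same prompt are compared. Substituting this implicit reward into the Bradley-Terry preference model yields the DPO loss quoted above, with no separately parameterized reward model and no reinforcement-learning loop.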