Odds Ratio Preference Optimization (ORPO)
An Odds Ratio Preference Optimization (ORPO) algorithm is a preference alignment algorithm that optimizes Large Language Models by integrating Supervised Fine-Tuning (SFT) and preference alignment into a single training step through an odds ratio-based method.
- Context:
- It can simplify the training process by collapsing SFT and preference alignment into one stage.
- It can apply a penalty to the log odds of disfavored responses, strengthening the model's tendency to generate preferred outputs (see the loss sketch below the See list).
- It can operate without a reference model, contrasting favored and disfavored styles by modifying the traditional negative log-likelihood loss.
- It can reduce the training time and computational resources needed.
- ...
- Example(s):
- ...
- Counter-Example(s):
- a traditional two-step fine-tuning Preference Alignment Algorithm that requires separate stages for supervised fine-tuning and preference alignment.
- ...
- See: Preference Alignment, Supervised Fine-Tuning, Negative Log-Likelihood, Large Language Model.
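The following is a minimal sketch of the ORPO objective described above, assuming a PyTorch setting. The inputs `chosen_logps` and `rejected_logps` are hypothetical placeholders for the length-normalized (average per-token) log-probabilities of the favored and disfavored responses under the model being fine-tuned, and `lam` corresponds to the λ weight on the odds-ratio penalty in Hong et al. (2024); this is an illustrative sketch, not the authors' released implementation.
```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """Sketch of the ORPO objective.

    chosen_logps / rejected_logps: average per-token log-probabilities
    (each value in (-inf, 0)) of the favored and disfavored responses
    under the model being fine-tuned.
    lam: weight on the odds-ratio penalty (lambda in the paper).
    """
    # log(odds(y|x)) = log(p / (1 - p)) = log p - log(1 - p);
    # log1p(-exp(log p)) computes log(1 - p) without leaving log space.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio penalty: -log sigmoid(log-odds of chosen minus log-odds of rejected)
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # Standard SFT negative log-likelihood on the favored response
    l_sft = -chosen_logps

    # Monolithic objective: SFT loss plus the weighted odds-ratio penalty
    return (l_sft + lam * l_or).mean()
```
In practice, the two log-probability tensors would be obtained by running the model on prompt-plus-response pairs and averaging token log-probabilities over the response tokens only, so that a single forward/backward pass trains both the SFT term and the preference penalty.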
References
2024
- (Hong et al., 2024) ⇒ Jiwoo Hong, Noah Lee, and James Thorne. (2024). “Reference-free Monolithic Preference Optimization with Odds Ratio.” arXiv preprint arXiv:2403.07691
- ABSTRACT: While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).