Direct Preference Optimization (DPO)

From GM-RKB

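A Direct Preference Optimization (DPO) is an LLM preference fine-tuning algorithm that aligns a language model with human preferences by training the model directly on pairs of preferred and dispreferred responses. It does not fit a separate reward model; instead, the reward is implicit in the log-probability ratio between the model being trained and a frozen reference model, and training maximizes the margin of that implicit reward between the preferred and the dispreferred response.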
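The objective can be illustrated with a minimal sketch of the per-pair DPO loss, the negative log-sigmoid of the implicit reward margin. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not part of any particular library; the inputs are assumed to be summed sequence log-probabilities of each response under the trained policy and under the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return the DPO loss for a single (chosen, rejected) preference pair.

    All four arguments are sequence log-probabilities (hypothetical inputs);
    beta scales how strongly the policy is tied to the reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    # by never exponentiating a positive number.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy shifted toward the chosen response: loss is smaller.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the probability of the preferred response relative to the dispreferred one, measured against the reference. In practice this quantity would be averaged over a batch and differentiated with an autograd framework, which this sketch leaves out.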


