2024 SelfRewardingLanguageModels
- (Yuan et al., 2024) ⇒ Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. (2024). “Self-Rewarding Language Models.” doi:10.48550/arXiv.2401.10020
Subject Headings: Self-Rewarding Language Models (SR-LMs), AlpacaEval 2.0 Leaderboard.
Notes
- It introduces Self-Rewarding Language Models (SR-LMs), in which the language model itself judges and rewards its own outputs via LLM-as-a-Judge prompting and then trains on them for self-improvement.
- The approach enables continual improvement beyond the limits of the initial training data, addressing the bottleneck posed by static, human-curated training datasets.
- It demonstrates that SR-LMs substantially improve instruction-following ability by iteratively generating, judging, and training on their own instruction-following examples.
- Its training methodology seeds the model with Instruction Fine-Tuning (IFT) and Evaluation Fine-Tuning (EFT) data, then applies iterative DPO training on self-generated preference pairs (see the sketch at the end of these notes).
- Fine-tuned from Llama 2 70B over three iterations, the resulting model outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
- It acknowledges the limitations of the approach, noting that the scaling laws of this effect and comprehensive safety evaluations remain open questions.
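- The core self-rewarding loop can be summarized by the minimal Python sketch below; the `generate`, `judge_score`, and `dpo_train` helpers are hypothetical stand-ins for a real LLM serving stack and a DPO trainer, so this is an illustration of the idea rather than the authors' released code.

```python
import random

def generate(model, prompt):
    # Placeholder: sample one candidate response from the current model M_t.
    return f"candidate response to '{prompt}' ({random.random():.3f})"

def judge_score(model, prompt, response):
    # Placeholder: the same model scores its own response on a 0-5 scale
    # via an LLM-as-a-Judge style evaluation prompt.
    return random.randint(0, 5)

def dpo_train(model, preference_pairs):
    # Placeholder: run Direct Preference Optimization on the chosen/rejected
    # pairs to obtain the next-iteration model M_{t+1}.
    return model

def self_rewarding_iteration(model, prompts, n_candidates=4):
    """One iteration: self-generate responses, self-reward them, train with DPO."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_candidates)]
        scored = sorted(((judge_score(model, prompt, c), c) for c in candidates),
                        key=lambda sc: sc[0])
        worst, best = scored[0], scored[-1]
        if best[0] > worst[0]:  # skip prompts where all self-rewards tie
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return dpo_train(model, pairs)

# Example: three self-rewarding iterations (M1 -> M2 -> M3), as in the paper.
model = "M1"
for _ in range(3):
    model = self_rewarding_iteration(model, ["Write a haiku about autumn."])
```
In each iteration the model's own scores select the best and worst candidates per prompt, and that chosen/rejected pair becomes the preference example used for the next round of DPO training.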
Cited By
Quotes
Abstract
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.
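Each iteration of this approach optimizes the standard Direct Preference Optimization objective (Rafailov et al., 2023) on the self-generated preference pairs; in that paper's notation:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the responses the model itself judged best and worst for prompt $x$, $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls the strength of the implicit KL constraint.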
References
- (Achiam et al., 2023) ⇒ Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. (2023). “GPT-4 Technical Report.” [arXiv:2303.08774](https://arxiv.org/abs/2303.08774)
- (Adolphs et al., 2023) ⇒ Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. (2023). “The CRINGE Loss: Learning What Language Not to Model.” In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 8854–8874, [doi:10.18653/v1/2023.acl-long.493](https://doi.org/10.18653/v1/2023.acl-long.493)
- (Anthropic, 2023) ⇒ Anthropic. (2023). “Claude 2.” [URL](https://www.anthropic.com/index/claude-2)
- (Bai et al., 2022a) ⇒ Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. (2022). “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” [arXiv:2204.05862](https://arxiv.org/abs/2204.05862)
- (Bai et al., 2022b) ⇒ Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” [arXiv:2212.08073](https://arxiv.org/abs/2212.08073)
- (Bai et al., 2023) ⇒ Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. (2023). “Benchmarking Foundation Models with Language-Model-as-an-Examiner.” In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [URL](https://openreview.net/forum?id=IiRHQ7gvnq)
- (Chen et al., 2023) ⇒ Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. (2023). “AlpaGasus: Training a Better Alpaca with Fewer Data.” [arXiv:2307.08701](https://arxiv.org/abs/2307.08701)
- (Chen et al., 2024) ⇒ Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. (2024). “Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.” [arXiv:2401.01335](https://arxiv.org/abs/2401.01335)
- (Collobert & Weston, 2008) ⇒ Ronan Collobert and Jason Weston. (2008). “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning.” In: Proceedings of the 25th International Conference on Machine Learning, pages 160–167.
- (Dubois et al., 2023) ⇒ Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. (2023). “Alpacafarm: A Simulation Framework for Methods that Learn from Human Feedback.” [arXiv:2305.14387](https://arxiv.org/abs/2305.14387)
- (Fernandes et al., 2023) ⇒ Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André FT Martins, Graham Neubig, Ankush Garg, Jonathan H Clark, Markus Freitag, and Orhan Firat. (2023). “The Devil is in the Errors: Leveraging Large Language Models for Fine-Grained Machine Translation Evaluation.” [arXiv:2308.07286](https://arxiv.org/abs/2308.07286)
- (Gulcehre et al., 2023) ⇒ Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. (2023). “Reinforced Self-Training (ReST) for Language Modeling.” [arXiv:2308.08998](https://arxiv.org/abs/2308.08998)
- (Honovich et al., 2023) ⇒ Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. (2023). “Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor.” In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 14409–14428, [doi:10.18653/v1/2023.acl-long.806](https://doi.org/10.18653/v1/2023.acl-long.806)
- (Kim et al., 2023) ⇒ Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. (2023). “Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models.” [arXiv:2310.08491](https://arxiv.org/abs/2310.08491)
- (Köpf et al., 2023) ⇒ Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. (2023). “OpenAssistant Conversations–Democratizing Large Language Model Alignment.” [arXiv:2304.07327](https://arxiv.org/abs/2304.07327)
- (Lee et al., 2023) ⇒ Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. (2023). “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.” [arXiv:2309.00267](https://arxiv.org/abs/2309.00267)
- (Li et al., 2023a) ⇒ Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. (2023). “Self-Alignment with Instruction Backtranslation.” [arXiv:2308.06259](https://arxiv.org/abs/2308.06259)
- (Li et al., 2023b) ⇒ Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. (2023). “Alpacaeval: An Automatic Evaluator of Instruction-Following Models.” [URL](https://github.com/tatsu-lab/alpaca_eval)
- (Ouyang et al., 2022) ⇒ Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” In: Advances in Neural Information Processing Systems, 35:27730–27744.
- (Pan et al., 2023) ⇒ Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. (2023). “Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Self-Correction Strategies.” [arXiv:2308.03188](https://arxiv.org/abs/2308.03188)
- (Radford et al., 2019) ⇒ Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. (2019). “Language Models are Unsupervised Multitask Learners.” OpenAI blog, 1(8):9.
- (Rafailov et al., 2023) ⇒ Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. (2023). “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” In: Thirty-seventh Conference on Neural Information Processing Systems. [URL](https://openreview.net/forum?id=HPuSIXJaa9)
- (Saha et al., 2023) ⇒ Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. (2023). “Branch-Solve-Merge Improves Large Language Model Evaluation and Generation.” [arXiv:2310.15123](https://arxiv.org/abs/2310.15123)
- (Schulman et al., 2017) ⇒ John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. (2017). “Proximal Policy Optimization Algorithms.” [arXiv:1707.06347](https://arxiv.org/abs/1707.06347)
- (Stiennon et al., 2020) ⇒ Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. (2020). “Learning to Summarize with Human Feedback.” In: Advances in Neural Information Processing Systems, 33:3008–3021.
- (Taori et al., 2023) ⇒ Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. (2023). “Stanford Alpaca: An Instruction-Following Llama Model.” [URL](https://github.com/tatsu-lab/stanford_alpaca)
- (Touvron et al., 2023) ⇒ Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. (2023). “Llama 2: Open Foundation and Fine-Tuned Chat Models.” [arXiv:2307.09288](https://arxiv.org/abs/2307.09288)
- (Van der Maaten & Hinton, 2008) ⇒ Laurens Van der Maaten and Geoffrey Hinton. (2008). “Visualizing Data Using t-SNE.” Journal of Machine Learning Research, 9(11).
- (Wang et al., 2022) ⇒ Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. (2022). “Self-Instruct: Aligning Language Model with Self Generated Instructions.” [arXiv:2212.10560](https://arxiv.org/abs/2212.10560)
- (Xu et al., 2023) ⇒ Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. (2023). “Some Things Are More Cringe Than Others: Preference Optimization with the Pairwise Cringe Loss.” [arXiv:2312.16682](https://arxiv.org/abs/2312.16682)
- (Yuan et al., 2023) ⇒ Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. (2023). “RRHF: Rank Responses to Align Language Models with Human Feedback.” In: Thirty-seventh Conference on Neural Information Processing Systems. [URL](https://openreview.net/forum?id=EdIGMCHk4l)
- (Zhao et al., 2023) ⇒ Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. (2023). “SLiC-HF: Sequence Likelihood Calibration with Human Feedback.” [arXiv:2305.10425](https://arxiv.org/abs/2305.10425)
- (Zheng et al., 2023a) ⇒ Chujie Zheng, Pei Ke, Zheng Zhang, and Minlie Huang. (2023). “Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning.” In: Findings of the Association for Computational Linguistics: ACL 2023, pages 1022–1040, [doi:10.18653/v1/2023.findings-acl.65](https://doi.org/10.18653/v1/2023.findings-acl.65)
- (Zheng et al., 2023b) ⇒ Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. (2023). “Judging LLM-as-a-Judge with MT-bench and Chatbot Arena.” In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [URL](https://openreview.net/forum?id=uccHPGDlao)
- (Ziegler et al., 2019) ⇒ Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. (2019). “Fine-Tuning Language Models from Human Preferences.” [arXiv:1909.08593](https://arxiv.org/abs/1909.08593)
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 SelfRewardingLanguageModels | Jason Weston, Kyunghyun Cho, Sainbayar Sukhbaatar, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu | | 2024 | Self-Rewarding Language Models | | | | 10.48550/arXiv.2401.10020 | | 2024 |