2023 PanGuCoder2BoostingLargeLanguag
- (Shen et al., 2023) ⇒ Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, and Qianxiang Wang. (2023). “PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback.” doi:10.48550/arXiv.2307.14936
Subject Headings: Code LLM, PanGu-Coder2, CoderEval Benchmark, LeetCode Benchmark
Notes
Cited By
Quotes
Abstract
Large Language Models for Code (Code LLM) are flourishing. New and powerful models are released on a weekly basis, demonstrating remarkable performance on the code generation task. Various approaches have been proposed to boost the code generation performance of pre-trained Code LLMs, such as supervised fine-tuning, instruction tuning, reinforcement learning, etc. In this paper, we propose a novel RRTF (Rank Responses to align Test&Teacher Feedback) framework, which can effectively and efficiently boost pre-trained large language models for code generation. Under this framework, we present PanGu-Coder2, which achieves 62.20% pass@1 on the OpenAI HumanEval benchmark. Furthermore, through an extensive evaluation on CoderEval and LeetCode benchmarks, we show that PanGu-Coder2 consistently outperforms all previous Code LLMs.
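The abstract reports pass@1 scores on HumanEval. As a reference for readers unfamiliar with the metric, below is a minimal sketch of the standard unbiased pass@k estimator introduced with HumanEval (Chen et al. 2021); it is not taken from the PanGu-Coder2 paper itself, and the function name is illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021).
    n: total samples generated per problem, c: samples passing all tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. a reported 62.20% pass@1 means that, averaged over problems, roughly
# 62.2% of single sampled solutions pass all hidden tests.
```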
1 Introduction
As one of the most promising applications of large language models (LLMs), code large language models have attracted considerable attention across academia and industry due to their remarkable capability in code-related tasks Zan et al. 2023.
Since OpenAI released Codex Chen et al. 2021, AlphaCode Li et al. 2022, PaLM-Coder Chowdhery et al. 2022, and PanGu-Coder Christopoulou et al. 2022 have subsequently been published, but only as closed-source models. Researchers have open-sourced CodeParrot Huggingface 2021, PolyCoder Xu et al. 2022, PyCodeGPT Zan et al. 2022a, and SantaCoder Allal et al. 2023, but these fall far behind commercial models in terms of model size, capability, and performance.
This situation was changed by Hugging Face, as the BigCode community released StarCoder Li et al. 2023: a 15B-parameter model with an 8K window size and FIM (Fill In the Middle, or infilling) capability. StarCoder outperforms many previous open-source large language models that support generating code from natural language descriptions, and even matches the OpenAI code-cushman-001 model on the HumanEval Chen et al. 2021 and MBPP Austin et al. 2021 benchmarks.
However, most large language models for code still fall behind the latest commercial models like GPT-3.5 and GPT-4 from OpenAI OpenAI 2023, Bubeck et al. 2023. We use Code LLM to denote large language models that are primarily pre-trained on code corpora, like PanGu-Coder Christopoulou et al. 2022, Replit, and StarCoder Li et al. 2023. Compared with open-source Code LLMs, the OpenAI GPT-family models are usually bigger in size and primarily pre-trained on natural language corpora (with a small proportion of code-related data), which contributes to their superior natural language comprehension and instruction following capabilities. Some efforts have been made to boost Code LLMs, such as data engineering (phi-1 Gunasekar et al. 2023), instruction tuning (WizardCoder Luo et al. 2023), retrieval-augmented generation (ReAcc Lu et al. 2022, RepoCoder Zhang et al. 2023, etc.), and reinforcement learning (RLTF Liu et al. 2023, CodeRL Le et al. 2022, PPOCoder Shojaee et al. 2023, etc.).
Although reinforcement learning (RL) seems to be a promising direction, since programming is essentially a trial-and-error procedure, existing RL-based approaches face several major limitations. The motivation for RL is intuitive and straightforward: as we expect the model to generate code according to human intent and requirements, reinforcement learning on Code LLMs can enhance the model's ability to interpret and respond to code generation instructions, thus increasing the likelihood of generating code that successfully solves a given problem. Typically, existing RL-based approaches design value/reward functions according to feedback signals from code processors, such as compilers, debuggers, executors, and test cases. However, this leads to three limitations. First, using test results directly as a reward provides limited improvement to the base model. Second, the adopted RL algorithms (such as PPO) are complicated to implement and hard to train on large language models Liu et al. 2023. Third, running tests while training the model is time-consuming. As a result, previous works Le et al. 2022, Liu et al. 2023 only experiment on modestly-sized models, and the improvement is rather limited.
To address the problems of existing RL-based approaches and further exploit the potential of Code LLMs, we propose the RRTF (Rank Responses to align Test&Teacher Feedback) framework, a novel approach that successfully applies natural language LLM alignment techniques to Code LLMs. Different from previous works like CodeRL Le et al. 2022 and RLTF Liu et al. 2023, we follow the idea of RLHF (Reinforcement Learning from Human Feedback) that empowers InstructGPT/ChatGPT Ouyang et al. 2022a, but implement a much simpler and more efficient training approach that uses ranked responses as feedback instead of the absolute value of a reward model.
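To make the ranking-as-feedback idea concrete, the sketch below shows an RRHF-style ranking objective over candidate responses, which is the kind of loss RRTF builds on. It is a schematic illustration under assumptions, not the authors' training code; all names, the length normalization, and the exact combination of terms are hypothetical.

```python
import torch

def ranking_feedback_loss(seq_logprobs: torch.Tensor,
                          seq_lengths: torch.Tensor,
                          rewards: torch.Tensor) -> torch.Tensor:
    """Hypothetical RRHF-style loss: candidates with higher reward (e.g. from
    tests and a teacher model) should receive higher length-normalized
    log-probability, plus an SFT term on the best-ranked candidate.

    seq_logprobs: summed token log-probs per candidate, shape (k,)
    seq_lengths:  token counts per candidate, shape (k,)
    rewards:      scalar feedback score per candidate, shape (k,)
    """
    p = seq_logprobs / seq_lengths                 # length-normalized scores
    diff_p = p.unsqueeze(0) - p.unsqueeze(1)       # diff_p[i, j] = p[j] - p[i]
    diff_r = rewards.unsqueeze(0) - rewards.unsqueeze(1)
    # Penalize every pair where the lower-reward candidate j outscores
    # the higher-reward candidate i under the model.
    rank_loss = torch.relu(diff_p[diff_r < 0]).sum()
    sft_loss = -seq_logprobs[rewards.argmax()]     # imitate the best candidate
    return rank_loss + sft_loss
```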
As a proof of concept, we apply RRTF to StarCoder 15B and present a model that achieves the best performance among all published Code LLMs, namely PanGu-Coder2. Through extensive evaluation on three benchmarks, including HumanEval, CoderEval, and LeetCode, we conjecture that Code LLMs do have the potential to surpass natural language models of the same or larger sizes on the code generation task. Furthermore, by analyzing the training process and manually inspecting the generated code samples, we highlight the importance of high-quality data in improving the models' instruction following and code writing capabilities.
In a nutshell, we make the following contributions:
- We introduce a new optimization paradigm named RRTF, which is a data-efficient, easy-to-implement, and model-agnostic framework to effectively boost the code generation performance of pre-trained Code LLMs.
- We present PanGu-Coder2, a model that improves on its base model by nearly 30% and achieves new state-of-the-art performance on the HumanEval, CoderEval, and LeetCode benchmarks, surpassing all previously published Code LLMs.
- We share our experience and findings in constructing effective training data, training the model with RRTF, and optimizing such a model for fast inference.
2 Related Work
2.1 Large Language Model for Code (Code LLMs)
As a momentous milestone, Codex Chen et al. (2021), boasting a 12-billion-parameter model, demonstrated the extraordinary capability to tackle up to 72% of Python programming problems. Subsequently, a new wave of code generation models, such as AlphaCode Li et al. (2022), PaLM-Coder Chowdhery et al. (2022), and PanGu-Coder Christopoulou et al. (2022), was proposed. Despite the remarkable prowess exhibited by the aforementioned models, none of them is available as an open-source project. Therefore, several open-source code generation models, including CodeParrot Huggingface [2021], PolyCoder Xu et al. [2022], PyCodeGPT Zan et al. [2022a], SantaCoder Allal et al. [2023], and StarCoder Li et al. [2023], were released, injecting fresh vigor into the realm of code generation Chen et al. [2022]. Meanwhile, code generation models have also been applied to a broader range of practical coding scenarios. For example, CodeGeeX Zheng et al. [2023], BLOOM Scao et al. [2022], and ERNIE-Code Chai et al. [2022] have been proposed to facilitate multilingual modeling; JuPyT5 Chandel et al. [2022] is trained on a large corpus of Jupyter notebooks, aiming to elevate the experience of interactive programming; DocCoder Zhou et al. [2023a] and APICoder Zan et al. [2022b] have been proposed to empower language models with the ability to invoke APIs; and models such as InCoder Fried et al. [2023], FIM Bavarian et al. [2022], MIM Nguyen et al. [2023], SantaCoder Allal et al. [2023], and StarCoder Li et al. [2023] support code generation at arbitrary positions.
Of late, some efforts Zhou et al. [2023b], Peng et al. [2023] have used the instruction tuning technique to unlock the valuable knowledge stored within large language models by fine-tuning on meticulously curated, high-quality instruction datasets. In the field of code generation, WizardCoder 15B Luo et al. [2023] and phi-1 1.3B Gunasekar et al. [2023] achieve exceptional code generation performance by fine-tuning on data generated by OpenAI's GPT-3.5 or GPT-4.
2.2 Reinforcement Learning on LLM
Reinforcement Learning from Human Feedback. Large language models can generate untruthful, unexpected, and unhelpful outputs that are not aligned with the intentions of end users. To align the behavior of large language models with human intentions, Ouyang et al. [2022b] recently proposed Reinforcement Learning from Human Feedback (RLHF). The underlying idea is to leverage human preferences on given tasks to improve the behavior of a language model. A typical RLHF procedure consists of three steps: (1) supervised fine-tuning (SFT), which collects human demonstrations of desired model behavior and fine-tunes a language model on them; (2) reward model (RM) training, which employs humans to label the preferred output among various model outputs and trains a reward model on the labeled data; and (3) reinforcement learning via proximal policy optimization (PPO), which optimizes the language model against the reward model. OpenAI's GPT-3.5 and GPT-4 are trained with RLHF, and their success demonstrates its effectiveness in aligning the behavior of language models with human preferences. However, implementing RLHF requires heavy training resources and complex parameter tuning, which prevents the technique from being easily applied in practice. In addition, the inefficiency and instability of RL algorithms can pose challenges to the alignment of language models.
Given the limitations of heavy training resources and complex parameter tuning, Yuan et al. [2023] proposed the RRHF paradigm, which leverages outputs with human preferences collected from various sources to train a model that aligns with human preferences. Its principle is to train the model to favor, among a set of outputs, those that receive better rewards according to human preferences. Compared with RLHF, RRHF can be easily scaled to LLMs of larger sizes under resource-constrained scenarios.
In view of the inefficiency and instability problems, Dong et al. [2023] proposed the reward-ranked fine-tuning (RAFT) technique for language models. Their underlying idea is to first select high-quality outputs of the model based on the output ranking estimated by a reward model, and then leverage the selected outputs to train a model that aligns with human preferences. Compared with RLHF, the SFT-style RAFT typically converges faster than the PPO used in RLHF, while requiring simpler parameter configuration and fewer computational resources.
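As a rough illustration of how RAFT-style reward-ranked fine-tuning differs from PPO, a best-of-k selection loop might look like the sketch below. The `generate` and `reward_model` callables are assumed placeholders, and the loop is an illustration rather than code from any of the cited papers.

```python
def raft_select(prompts, generate, reward_model, k=8):
    """Hypothetical RAFT-style data selection: sample k candidates per prompt,
    keep only the highest-reward one, and reuse the pairs for ordinary SFT."""
    sft_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        scores = [reward_model(prompt, c) for c in candidates]
        best = candidates[scores.index(max(scores))]
        sft_pairs.append({"prompt": prompt, "response": best})
    return sft_pairs
```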
Reinforcement Learning on Code. The successful practice of RLHF has inspired researchers to improve the capability of Code LLMs with reinforcement learning. For example, CodeRL Le et al. [2022] integrates an actor-critic RL framework with unit test signals to fine-tune models. Following CodeRL, PPOCoder Shojaee et al. [2023] uses the Proximal Policy Optimization (PPO) algorithm, but yields little improvement on the MBPP benchmark. Very recently, RLTF Liu et al. [2023] moves a step forward by adopting an online RL framework with multi-granularity unit test feedback, overcoming the limitation of the offline RL adopted by CodeRL and PPOCoder.
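To illustrate the kind of test-based feedback signal these methods rely on, a coarse reward function might look like the following sketch. The reward values and the way tests are executed here are illustrative assumptions, not the schemes used by CodeRL, PPOCoder, or RLTF (which use finer-grained feedback and sandboxed execution).

```python
import os
import subprocess
import sys
import tempfile

def unit_test_reward(candidate_code: str, test_code: str, timeout: int = 10) -> float:
    """Hypothetical coarse reward from unit-test feedback:
    -1.0 if the candidate does not compile, -0.5 if tests fail or time out,
    +1.0 if all tests pass."""
    try:
        compile(candidate_code, "<candidate>", "exec")
    except SyntaxError:
        return -1.0
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else -0.5
    except subprocess.TimeoutExpired:
        return -0.5
    finally:
        os.remove(path)
```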
2.3 Fine-tuning Code LLM
Fine-tuning pre-trained language models is a mainstream modeling paradigm for maximizing performance on downstream tasks. In the field of code, several works also adopt this paradigm to address code-related scenarios. For instance, CodeGen Nijkamp et al. [2022] and StarCoder Li et al. [2023] start by pre-training on a multilingual code corpus, followed by fine-tuning on monolingual data, thereby achieving superior performance on monolingual tasks. Codex-S Chen et al. [2021] and PanGu-Coder-FT Christopoulou et al. [2022] elevate their code generation capabilities by fine-tuning on competitive programming problems. Recently, instruction tuning Ouyang et al. [2022a], OpenAI [2023], a form of supervised fine-tuning (SFT), has been proposed to align models with human behavior by learning from abundant high-quality instruction corpora. In this regard, WizardCoder Luo et al. [2023] was fine-tuned on a series of instruction corpora derived from a teacher model, effectively maximizing its code knowledge with relatively limited parameters. In this technical report, PanGu-Coder2 employs a ranking feedback strategy Yuan et al. [2023] during the fine-tuning process and achieves surprisingly strong code generation performance.
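As a generic illustration of the supervised fine-tuning objective that underlies instruction tuning (not the paper's training code), the loss is typically next-token cross-entropy computed only on the response tokens, with the instruction prefix masked out; the function below is a minimal sketch with hypothetical names.

```python
import torch
import torch.nn.functional as F

def instruction_sft_loss(logits: torch.Tensor,
                         input_ids: torch.Tensor,
                         prompt_len: int) -> torch.Tensor:
    """Generic instruction-tuning loss sketch: next-token cross-entropy,
    with instruction-prefix tokens masked so only the response is learned.

    logits:     (batch, seq_len, vocab) model outputs
    input_ids:  (batch, seq_len) instruction + response token ids
    prompt_len: number of instruction tokens at the start of each sequence
    """
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    shift_labels[:, : prompt_len - 1] = -100       # ignore instruction tokens
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1),
                           ignore_index=-100)
```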
3 Approach
3.1 Overview
...
5 Conclusion
In this paper, we introduce a novel framework, namely RRTF, and present a new Code LLM, namely PanGu-Coder2. Firstly, we adopt the Evol-Instruct technique to obtain a substantial amount of high-quality natural language instruction and code solution data pairs. Then, we train the base model by ranking candidate code solutions using feedback from test cases and heuristic preferences. Through comprehensive evaluations on the HumanEval, CoderEval, and LeetCode benchmarks, PanGu-Coder2 achieves new state-of-the-art performance among billion-parameter-level Code LLMs, surpassing all of the existing ones by a large margin. In future work, we will delve into the combination of RRTF and instruction tuning to further boost the performance of Code LLMs.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023 PanGuCoder2BoostingLargeLanguag | Bo Shen; Jiaxin Zhang; Taihong Chen; Daoguang Zan; Bing Geng; An Fu; Muhan Zeng; Ailun Yu; Jichuan Ji; Jingyang Zhao; Yuenan Guo; Qianxiang Wang | | 2023 | PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback | | | | 10.48550/arXiv.2307.14936 | | 2023 |