Advances in large language models (LLMs) have largely driven progress in code generation. Previous research combined reinforcement learning (RL) with compiler feedback signals to explore the output space of LLMs and optimize the quality of generated code.
But two problems remain:
1. Exploration in reinforcement learning struggles to cope directly with complex human requirements, which force LLMs to generate long code sequences;
2. Since unit tests may not cover complex code, optimizing LLMs with unexecuted code snippets is ineffective.
To address these challenges, the researchers proposed a new reinforcement learning framework called StepCoder, which was jointly developed by experts from Fudan University, Huazhong University of Science and Technology, and Royal Institute of Technology. StepCoder contains two key components designed to improve the efficiency and quality of code generation.
1. CCCS (Curriculum of Code Completion Subtasks) addresses the exploration challenge by breaking long-sequence code generation into a curriculum of code completion subtasks;
2. FGO (Fine-Grained Optimization) provides fine-grained optimization by masking unexecuted code segments when computing the loss.
Paper link: https://arxiv.org/pdf/2402.01391.pdf
Project Link: https://github.com/Ablustrund/APPS_Plus
The researchers also built the APPS+ dataset for reinforcement learning training and manually verified it to ensure the correctness of the unit tests.
Experimental results show that the method improves the ability to explore the output space and outperforms state-of-the-art methods on corresponding benchmarks.
In code generation, ordinary reinforcement-learning exploration struggles with environments that have sparse, delayed rewards and with the long sequences required by complex requirements.
In the CCCS (Curriculum of Code Completion Subtasks) stage, the researchers decompose the complex exploration problem into a curriculum of subtasks. Using a portion of the canonical solution as a prompt, the LLM can start exploring from simple sequences.
The reward calculation depends only on the executable code fragments, so it is inaccurate to use the entire code (red part in the figure) to optimize the LLM (gray part in the figure).
In the FGO (Fine-Grained Optimization) stage, the researchers mask the tokens not executed by the unit tests (red part) and compute the loss function only over the executed tokens (green part), which provides fine-grained optimization.
Preliminary knowledge
Assume that D = {(x_i, y_i, u_i)} is a training dataset for code generation, where x, y, and u denote the human requirement (i.e., task description), the canonical solution, and the unit test samples, respectively.
A list of conditional statements is obtained by automatically analyzing the abstract syntax tree of the canonical solution y_i, where st and en denote the start and end positions of each statement, respectively.
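As an illustration, such a list could be produced with Python's built-in ast module. In this sketch, treating if/for/while nodes as "conditional statements" and using line numbers as positions are assumptions, not necessarily the paper's exact procedure.

import ast

def conditional_statement_spans(canonical_solution: str):
    """Return sorted (start_line, end_line) spans of conditional statements,
    found by walking the abstract syntax tree of a canonical solution."""
    tree = ast.parse(canonical_solution)
    spans = []
    for node in ast.walk(tree):
        # Assumption: if/for/while nodes count as "conditional statements".
        if isinstance(node, (ast.If, ast.For, ast.While)):
            spans.append((node.lineno, node.end_lineno))  # end_lineno needs Python 3.8+
    return sorted(spans)

example = """\
def solve(n):
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total
"""
print(conditional_statement_spans(example))  # [(3, 5), (4, 5)]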
For a human requirement x, its canonical solution y can be expressed as a sequence of such code segments; in the code generation stage, given the human requirement x, the final state is the set of programs that pass the unit tests u.
Method details
StepCoder integrates two key components, CCCS and FGO. The purpose of CCCS is to decompose the code generation task into a curriculum of code completion subtasks, which alleviates the exploration challenge in RL; FGO is designed specifically for code generation and provides fine-grained optimization by computing the loss only over executed code fragments.
CCCS
In code generation, satisfying complex human requirements usually forces the policy model to take long action sequences. At the same time, the compiler's feedback is delayed and sparse: the policy model only receives a reward after the entire program has been generated. Under these conditions, exploration is very difficult.
The core of the method is to decompose such long exploration problems into a series of short, easy-to-explore subtasks. The researchers reduce code generation to code completion subtasks, where the subtasks are automatically constructed from the canonical solutions in the training dataset.
For a human requirement x, in the early training stage of CCCS the starting point s* of exploration is a state near the final state.
Specifically, the researchers provide the human requirement x together with the front part x_p of the canonical solution, and train the policy model π_θ to complete the remaining code conditioned on (x, x_p).
Let ŷ be the combined sequence of x_p and the output trajectory τ, that is, ŷ = (x_p, τ); the reward model takes ŷ as input and provides a reward r based on the correctness of the code fragment τ.
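Taken together, the last two paragraphs suggest roughly the following construction. This is a minimal sketch: the plain-text prompt format, the character-level cut point s*, the run_tests helper, and the binary reward are all assumptions, not the paper's exact implementation.

def build_subtask_prompt(requirement: str, canonical_solution: str, s_star: int):
    """CCCS code-completion subtask: the policy is conditioned on the human
    requirement x plus the front part x_p of the canonical solution."""
    x_p = canonical_solution[:s_star]        # hint taken from the canonical solution
    prompt = requirement + "\n" + x_p
    return prompt, x_p

def reward_from_unit_tests(x_p: str, trajectory: str, run_tests) -> float:
    """y_hat = (x_p, tau): concatenate the hint and the generated completion,
    execute the unit tests on the result, and derive a reward from correctness."""
    y_hat = x_p + trajectory
    passed, total = run_tests(y_hat)         # run_tests is a user-supplied executor
    return 1.0 if passed == total else -1.0  # assumed binary reward scheme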
The researchers used the proximal policy optimization (PPO) algorithm to optimize the policy model πθ by utilizing the reward r and trajectory τ.
During optimization, the canonical-solution segment x_p used as the hint is masked so that it does not contribute to the gradient update of the policy model π_θ.
CCCS optimizes the policy model π_θ by maximizing the objective function, where π_ref is the reference model in PPO, initialized from the SFT model.
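For reference, the generic PPO clipped surrogate objective has the form below (this is the standard formulation; the paper's exact objective, including how the reference model π_ref enters, may differ in detail):

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

In RLHF-style training, a per-token KL penalty against the reference model, -\beta\,\mathrm{KL}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right], is typically folded into the reward.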
As training progresses, the starting point s* of exploration gradually moves toward the beginning of the canonical solution. Specifically, a threshold ρ is set for each training sample; when the cumulative pass rate of the code segments generated by π_θ exceeds ρ, the starting point is moved toward the beginning.
In the later stages of training, the exploration process of this method is equivalent to vanilla reinforcement learning, i.e., s* = 0, and the policy model generates code with only the human requirement as input.
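A minimal sketch of this curriculum schedule follows; the per-sample bookkeeping and the way the pass rate is accumulated and reset are assumptions of the sketch.

class CurriculumState:
    """Tracks the exploration starting point s* for one training sample."""

    def __init__(self, start_points, rho=0.8):
        # start_points: candidate values of s*, ordered from near the final state
        # (large prefix given as hint) back to 0 (no prefix, i.e. vanilla RL).
        self.start_points = start_points
        self.stage = 0
        self.rho = rho                      # threshold on the cumulative pass rate
        self.passed, self.attempts = 0, 0

    def current_s_star(self):
        return self.start_points[self.stage]

    def update(self, code_passed: bool):
        self.passed += int(code_passed)
        self.attempts += 1
        # When the cumulative pass rate exceeds rho, move s* one step closer
        # to the beginning of the canonical solution and reset the counters.
        if (self.passed / self.attempts > self.rho
                and self.stage < len(self.start_points) - 1):
            self.stage += 1
            self.passed, self.attempts = 0, 0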
The starting point s* is sampled at the beginning of a conditional statement, and the policy completes the remaining, unwritten code segment.
Specifically, the more conditional statements a program has, the more independent paths it contains and the higher its logical complexity; higher complexity calls for more frequent sampling to improve training quality, while programs with fewer conditional statements do not need to be sampled as frequently.
This sampling method can evenly extract representative code structures while taking into account both complex and simple semantic structures in the training data set.
To speed up the training phase, the researchers set the number of curricula for the i-th sample according to E_i, the number of its conditional statements, so that the span of each training curriculum for the i-th sample is larger than 1.
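Continuing the earlier AST sketch, the candidate starting points for a sample could be derived from its conditional-statement spans roughly as follows; how the list is thinned to the desired number of curricula is an assumption here, not the paper's formula.

def candidate_start_points(canonical_solution: str, num_curricula: int):
    """Turn conditional-statement start lines into character offsets usable as
    exploration starting points s*, ordered from late (easy) to early (hard)."""
    lines = canonical_solution.splitlines(keepends=True)
    line_offsets = [0]
    for line in lines:
        line_offsets.append(line_offsets[-1] + len(line))

    spans = conditional_statement_spans(canonical_solution)   # from the sketch above
    starts = sorted({line_offsets[start - 1] for start, _ in spans}, reverse=True)
    step = max(1, len(starts) // max(1, num_curricula))        # thin the list
    return starts[::step] + [0]                                # end with vanilla RL, s* = 0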
The main points of CCCS can be summarized as follows:
1. It is easy to start exploring from a state close to the goal (i.e. the final state);
2. Exploring from a state further away from the goal is challenging, but it becomes easier when the policy can leverage states from which it has already learned how to reach the goal.
FGO
The relationship between rewards and actions in code generation differs from that in other reinforcement learning tasks (such as Atari): in code generation, it is possible to exclude a set of actions that are irrelevant to computing the reward of the generated code.
Specifically, for unit testing, the compiler's feedback is only related to the executed code fragments. However, in the ordinary RL optimization objective, all actions on the trajectory participate in the gradient calculation, which makes the gradient estimate imprecise.
To improve optimization precision, the researchers mask the actions (i.e., tokens) not executed by the unit tests when computing the loss of the policy model.
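A minimal sketch of applying such a mask to a token-level loss (PyTorch-style). In the paper the mask is applied inside the RL objective; here a plain per-token cross-entropy stands in for that loss, and mapping line-level execution coverage to token positions is assumed to happen elsewhere.

import torch
import torch.nn.functional as F

def fgo_masked_loss(logits, target_ids, executed_mask):
    """Average the per-token loss over executed tokens only.

    logits:        [seq_len, vocab_size] policy outputs for the generated code
    target_ids:    [seq_len] token ids of the generated code
    executed_mask: [seq_len] 1.0 where the token belongs to code executed by the
                   unit tests, 0.0 otherwise (e.g. derived from coverage data)
    """
    per_token_loss = F.cross_entropy(logits, target_ids, reduction="none")
    masked = per_token_loss * executed_mask
    # Normalize by the number of executed tokens so the loss scale does not
    # depend on how much of the program was actually run.
    return masked.sum() / executed_mask.sum().clamp(min=1.0)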
APPS+ dataset
Reinforcement learning requires a large amount of high-quality training data. During the investigation, the researchers found that among the currently available open source data sets, only APPS meets this requirement.
However, APPS contains some incorrect instances, such as missing inputs, outputs, or canonical solutions; canonical solutions that fail to compile or execute; and discrepancies in the execution output.
To refine the APPS dataset, the researchers filtered out instances with missing inputs, outputs, or canonical solutions, and then standardized the input and output formats to facilitate unit-test execution and comparison. Each instance was then unit tested and manually analyzed, eliminating instances with incomplete or irrelevant code, syntax errors, API misuse, or missing library dependencies.
For discrepancies in output, the researchers manually reviewed the problem description and either corrected the expected output or removed the instance.
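A rough sketch of the kind of filtering pass described above; the field names and the in-memory representation are illustrative assumptions and do not reflect the actual APPS/APPS+ file format.

def filter_instances(instances):
    """Drop instances with missing requirements, unit tests, or canonical
    solutions, and keep only those whose first solution at least parses."""
    kept = []
    for inst in instances:
        # Assumed field names, for illustration only.
        if not inst.get("question") or not inst.get("solutions") or not inst.get("input_output"):
            continue
        try:
            compile(inst["solutions"][0], "<solution>", "exec")  # cheap syntax check
        except SyntaxError:
            continue
        kept.append(inst)
    return kept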
Finally, the APPS+ dataset was constructed, containing 7,456 instances. Each instance includes a programming problem description, a canonical solution, a function name, unit tests (i.e., inputs and outputs), and starter code (i.e., the beginning of the canonical solution).
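For concreteness, a single instance might look roughly like this; the field names and content are invented for illustration and are not the dataset's actual schema.

example_instance = {
    "problem": "Given an integer n, print the sum of the even numbers below n.",
    "canonical_solution": "def solve(n):\n    print(sum(i for i in range(n) if i % 2 == 0))\n",
    "function_name": "solve",
    "unit_tests": {"inputs": ["10"], "outputs": ["20"]},
    "starter_code": "def solve(n):\n",
}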
To evaluate the code generation performance of StepCoder and other LLMs, the researchers conducted experiments on the APPS+ dataset.
The results show that the RL-based model outperforms other language models, including the base model and the SFT model.
The researchers infer that reinforcement learning, guided by compiler feedback, can further improve the quality of code generation by exploring the model's output space more efficiently.
Furthermore, StepCoder surpassed all baseline models, including other RL-based methods, and achieved the highest score.
Specifically, the method achieved scores of 59.7%, 23.5%, and 8.6% across the three difficulty levels of the benchmark.
Compared with other reinforcement learning-based methods, this method explores the output space more effectively by simplifying complex code generation tasks into code completion subtasks, and the FGO process plays a key role in precisely optimizing the policy model.
It can also be seen that, with the same network architecture, StepCoder outperforms the supervised fine-tuned LLM on the APPS+ dataset; the fine-tuned model barely improves the pass rate of the generated code over the backbone network, which directly shows that optimizing the model with compiler feedback improves the quality of generated code more than next-token prediction does in code generation.