
What are the origins and applications of RLHF technology in language models?


RLHF stands for reinforcement learning from human feedback. This article introduces how large language models (LLMs) are combined with RLHF.

Mechanism of RLHF

Reinforcement learning is a branch of machine learning in which an agent learns an optimal policy by interacting with its environment. The agent chooses actions that affect the state of the environment and receives rewards in return. Rewards are the feedback signal the agent uses to adjust its policy: during training, the agent updates its policy based on the rewards it receives so as to maximize long-term return.
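
As a minimal, purely illustrative sketch of this loop, the toy Python example below uses a made-up environment and a random action rule; none of the names correspond to a real library, and a learning agent would replace the random choice with a policy it updates from the rewards.

```python
# Toy sketch of the agent-environment loop: the agent picks actions, the
# environment returns a new state and a reward, and the episode's return is
# what a learning agent would try to maximize. All names are illustrative.
import random

class ToyEnv:
    def reset(self):
        return 0  # initial state

    def step(self, state, action):
        next_state = state + 1                      # toy state transition
        reward = 1.0 if action == 1 else 0.0        # toy reward signal
        done = next_state >= 10                     # episode ends after 10 steps
        return next_state, reward, done

def run_episode(env, choose_action):
    state, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = choose_action(state)               # agent chooses an action
        state, reward, done = env.step(state, action)
        total_reward += reward                      # accumulate long-term return
    return total_reward

# A random policy; an RL agent would adjust this mapping based on the rewards.
print(run_episode(ToyEnv(), lambda s: random.choice([0, 1])))
```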

Designing an appropriate reward system is therefore crucial; it is the key to reinforcement learning. RLHF addresses this by integrating human feedback, bringing humans into the training process to improve the training of reinforcement learning agents.

RLHF General Framework

The reinforcement learning fine-tuning process of a large language model (LLM) usually consists of three stages. First, we start with a pretrained language model. Because an LLM requires a huge amount of training data, it is impractical to train one from scratch using human feedback alone. Instead, the model is pretrained with unsupervised learning on large text corpora, and this existing language model is used for output generation. The second stage builds a reward model for the RL system, and the final stage fine-tunes the LLM in a reinforcement learning loop against that reward model; both stages are described below.

In the second stage, we create a reward model for the RL system. Here we train another machine learning model that takes the text generated by the main model and produces a quality score for it. Typically, this is another LLM that has been modified to output a single scalar value instead of a sequence of text tokens. That quality score is used as the reward signal that guides the main model toward generating higher-quality text.
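
As a rough sketch of what such a reward model can look like, the snippet below wraps a pretrained transformer backbone with a scalar value head. It assumes a Hugging Face-style backbone whose output exposes `last_hidden_state`; the class and layer names are illustrative, not a specific library API.

```python
# Sketch of a reward model: a language-model backbone with a scalar value head
# in place of the usual token-prediction head. Names here are illustrative.
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                         # pretrained transformer
        self.value_head = nn.Linear(hidden_size, 1)      # hidden state -> scalar score

    def forward(self, input_ids, attention_mask):
        # Assumes a Hugging Face-style backbone that returns `last_hidden_state`.
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_token = hidden[:, -1, :]                    # representation of the final token
        # A real implementation would pick the last *non-padding* token instead.
        return self.value_head(last_token).squeeze(-1)   # one quality score per sequence
```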

To train the reward model, we need to build a quality-assessment dataset of LLM-generated text. Each training example consists of a prompt and several outputs generated by the LLM. Humans are then asked to evaluate the quality of these generated texts. These human judgments are used to train the reward model to predict a score for LLM-generated text. By training on LLM outputs paired with their ratings, the reward model builds a mathematical representation of human preferences.
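
One common way to turn such comparisons into a training objective is a pairwise ranking loss: the output humans preferred should receive a higher score than the one they rejected. The sketch below assumes a reward model like the one above and already-tokenized "chosen" and "rejected" texts; it illustrates the idea rather than the exact objective of any particular system.

```python
# Pairwise (Bradley-Terry style) ranking loss for reward-model training.
# `reward_model` and the tokenized inputs are assumed to be provided by the caller.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    r_chosen = reward_model(chosen_ids, chosen_mask)        # score of the preferred text
    r_rejected = reward_model(rejected_ids, rejected_mask)  # score of the other text
    # Maximize the margin between the two scores: the loss is small when the
    # preferred text already scores clearly higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```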

In the final stage, we create the reinforcement learning fine-tuning loop. A copy of the main LLM acts as the RL agent. In each training iteration, the LLM takes a batch of prompts from the dataset and generates text. The text is then passed to the reward model, which assigns a score evaluating its consistency with human preferences. The LLM is then updated so that it generates outputs that score higher on the reward model.
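
The sketch below shows one simplified iteration of this loop. Real RLHF systems typically use PPO; for brevity this uses a bare REINFORCE-style policy-gradient update, and all objects (`policy_lm`, `reward_model`, `tokenizer`, `prompts`, `optimizer`) are assumed to be Hugging Face-style components supplied by the caller.

```python
# Simplified RL fine-tuning step: generate text for each prompt, score it with
# the reward model, and nudge the policy toward higher-scoring outputs.
import torch

def rlhf_step(policy_lm, reward_model, tokenizer, prompts, optimizer):
    optimizer.zero_grad()
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        # Sample a continuation from the current policy.
        output_ids = policy_lm.generate(**inputs, do_sample=True, max_new_tokens=64)
        # Score the prompt plus response with the reward model (no gradient needed).
        reward = reward_model(output_ids, torch.ones_like(output_ids)).detach()
        # Log-probabilities of the sampled tokens under the current policy.
        logits = policy_lm(output_ids).logits[:, :-1, :]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_logprobs = logprobs.gather(-1, output_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # REINFORCE: scale the log-likelihood by the reward. (For brevity this covers
        # the whole sequence; real code would restrict it to the generated tokens.)
        loss = -(reward * token_logprobs.sum()) / len(prompts)
        loss.backward()
    optimizer.step()
```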

Although this is a general RLHF framework for language models, different implementation goals require corresponding modifications.

Another consideration for language models in RLHF is maintaining a balance between reward optimization and language consistency. Because the reward model is only an imperfect approximation of human preferences, the agent LLM, like most RL systems, may find ways to maximize reward while violating syntactic or logical consistency. To prevent this, the ML team keeps a frozen copy of the original LLM and uses it in the RL loop. The difference between the original LLM's output distribution and the RL-trained LLM's output distribution (the KL divergence) is added to the reward signal as a penalty, so that the trained model cannot drift too far from the original outputs. This strategy balances reward optimization with language consistency.
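
A minimal sketch of how such a KL penalty can be folded into the reward is shown below. The coefficient `beta` and the per-token log-probability-ratio estimate of the KL divergence are illustrative choices, not the exact formulation used by any specific system.

```python
# Reward shaping with a KL penalty: the reward-model score is reduced in
# proportion to how far the trained policy drifts from the frozen original LLM.
import torch

def shaped_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    # Sum of per-token log-probability ratios over the generated response:
    # a simple sample-based estimate of the KL divergence between the two models.
    kl_estimate = (policy_logprobs - reference_logprobs).sum()
    # Subtracting the penalty discourages outputs the original model finds unlikely.
    return rm_score - beta * kl_estimate

# Toy usage with made-up per-token log-probabilities.
print(shaped_reward(torch.tensor(2.5),
                    torch.tensor([-1.2, -0.8, -2.0]),
                    torch.tensor([-1.0, -1.1, -1.9])))
```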
