'Flattery' is common in RLHF models, and no one is immune from Claude to GPT-4-AI-php.cn

'Flattery' is common in RLHF models, and no one is immune from Claude to GPT-4

王林

Release： 2023-10-24 20:53:06

forward

1212 people have browsed it

Whether you are in the AI circle or other fields, you have more or less used large language models (LLM). When everyone is praising the various changes brought about by LLM, some of the large models Shortcomings are gradually exposed.

For example, some time ago, Google DeepMind discovered that LLM generally exhibits "sycophantic" human behavior, that is, sometimes the views of human users are objectively incorrect, and the model will also adjust its own Respond to follow the user's point of view. As shown in the figure below, the user tells the model 1 1=956446, and the model follows the human instructions and believes that this answer is correct.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4 ##Image source https://arxiv.org/abs/2308.03958

In fact, this phenomenon is common in many In the AI model, what is the reason? Researchers from the AI startup Anthropic analyzed this phenomenon. They believe that "flattery" is a common behavior of RLHF models, partly due to human preference for "flattery" responses.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

Paper address: https://arxiv.org/pdf/2310.13548.pdf

Connect Let’s take a look at the specific research process.

AI assistants such as GPT-4 are trained to produce more accurate answers, and the vast majority of them use RLHF. Fine-tuning a language model using RLHF improves the quality of the model's output, which is evaluated by humans. However, some studies believe that training methods based on human preference judgments are undesirable. Although the model can produce output that appeals to human evaluators, it is actually flawed or incorrect. At the same time, recent work has also shown that models trained on RLHF tend to provide answers that are consistent with users.

In order to better understand this phenomenon, the study first explored whether AI assistants with SOTA performance would provide "flattery" model responses in various real-world environments. The results found that The five RLHF-trained SOTA AI assistants showed a consistent "flattery" pattern in the free-form text generation task. Since flattery appears to be a common behavior for RLHF-trained models, this article also explores the role of human preferences in this type of behavior.

This article also explores whether "flattery" present in preference data will lead to "flattery" in the RLHF model, and finds that more optimization will increase some forms of "flattery." ”, but will reduce other forms of “flattery.”

The degree and impact of "flattery" of large models

In order to evaluate the degree of "flattery" of large models and analyze the impact on reality generation Impact, the study benchmarked the degree of “flattery” of large models released by Anthropic, OpenAI, and Meta.

Specifically, the study proposes the SycophancyEval evaluation benchmark. SycophancyEval extends the existing large model "flattery" evaluation benchmark. In terms of models, this study specifically tested 5 models, including: claude-1.3 (Anthropic, 2023), claude-2.0 (Anthropic, 2023), GPT-3.5-turbo (OpenAI, 2022), GPT-4 (OpenAI, 2023 ), llama-2-70b-chat (Touvron et al., 2023).

Flattering User Preferences

Theory when a user asks a large model to provide free-form feedback on a piece of debate text Technically, the quality of an argument only depends on the content of the argument, but the study found that large models gave more positive feedback to arguments that the user liked and more negative feedback to arguments that the user disliked.

As shown in Figure 1 below, the large model’s feedback on text paragraphs not only depends on the text content, but is also affected by user preferences.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

It’s easy to be swayed

The study found that even large models offer Although they provide accurate answers and express confidence in those answers, they often change their answers and provide incorrect information when users question them. Therefore, "flattery" can damage the credibility and reliability of large model responses.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

Provide answers consistent with user beliefs

The study found that for For open-ended question and answer tasks, large models will tend to provide answers that are consistent with user beliefs. For example, in Figure 3 below, this “flattery” behavior reduced LLaMA 2 accuracy by as much as 27%.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

Imitate user errors

For To test whether large models repeat user errors, this study explores whether large models incorrectly give the author of a poem. As shown in Figure 4 below, even if the large model can answer the correct author of the poem, it will answer incorrectly because the user gives wrong information.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

Understanding Flattery in Language Models

The study found that in different real-world settings more All large models show consistent "flattery" behavior, so it is speculated that this may be caused by RLHF fine-tuning. Therefore, this study analyzes human preference data used to train a preference model (PM).

As shown in Figure 5 below, this study analyzed human preference data and explored which features can predict user preferences.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

The experimental results show that, other conditions being equal, the "flattery" behavior in the model response will increase the likelihood that humans will prefer the response. sex. The preference model (PM) used to train the large model has a complex impact on the "flattery" behavior of the large model, as shown in Figure 6 below.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

Finally, the researchers explored how often humans and PM (PREFERENCE MODELS) models tended to answer truthfully how many? It was found that humans and PM models favored flattering responses over correct responses.

PM Results: In 95% of cases, flattering responses were preferred to true responses (Figure 7a). The study also found that PMs preferred flattering responses almost half of the time (45%).

Human feedback results: Although humans tend to respond more honestly than flattery, as the difficulty (misconception) increases, their probability of choosing a reliable answer decreases ( Figure 7b). Although aggregating the preferences of multiple people can improve the quality of feedback, these results suggest that completely eliminating flattery simply by using non-expert human feedback may be challenging.

Figure 7c shows that although optimization for Claude 2 PM reduces flattery, the effect is not significant.

Flattery is common in RLHF models, and no one is immune from Claude to GPT-4

For more information, please view the original paper.

The above is the detailed content of 'Flattery' is common in RLHF models, and no one is immune from Claude to GPT-4. For more information, please follow other related articles on the PHP Chinese website!