
A new way to play crowdsourcing! A benchmark born in the LLM Arena strictly separates the poor students from the top students.

WBOY
Release: 2024-06-02 13:21:08

Which company tops the large-model rankings? Just check the LLM Arena~

As of now, a total of 90 LLMs have joined the battle, and the total number of user votes has exceeded 770,000.


However, while netizens have fun watching new models rush to the top and old models lose face, LMSYS, the organization behind the Arena, has quietly turned those results into something bigger: Arena-Hard, arguably the most convincing benchmark forged from real combat.


Arena-Hard demonstrates four advantages, exactly what current LLM benchmarks need most:

- Separability (87.4%), significantly better than MT-Bench (22.6%);

- Ranking agreement with Chatbot Arena of 89.1%, the closest of any benchmark;

- Fast and cheap to run (about $25);

- Frequently updated with real-time data.

Translated into plain language: first, an exam for large models must be discriminating, so that even the poor students cannot walk away with 90 points;

Second, the exam questions should be realistic, and the scoring should be strictly aligned with human preferences;

Finally, the questions must not leak, so the test data has to be refreshed frequently to keep the exam fair.

The last two requirements are tailor-made for the LLM Arena.

Let’s take a look at the effect of the new benchmark:

[Figure: Arena-Hard v0.1 vs. MT-Bench, separability and confidence intervals]

The figure above compares Arena-Hard v0.1 with the previous SOTA benchmark, MT-Bench.

Compared with MT-Bench, Arena-Hard v0.1 has much stronger separability (jumping from 22.6% to 87.4%), and its confidence intervals are also narrower.
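Separability here can be read as the fraction of model pairs whose confidence intervals do not overlap. Below is a minimal sketch of that computation, assuming per-model score intervals are already available; the function and the example numbers are illustrative, not the authors' code.

```python
from itertools import combinations

def separability(intervals: dict[str, tuple[float, float]]) -> float:
    """Fraction of model pairs whose confidence intervals do not overlap.

    `intervals` maps model name -> (ci_low, ci_high) for its benchmark score.
    A pair is 'separable' when one interval lies entirely above the other.
    """
    pairs = list(combinations(intervals, 2))
    separable = 0
    for a, b in pairs:
        lo_a, hi_a = intervals[a]
        lo_b, hi_b = intervals[b]
        if hi_a < lo_b or hi_b < lo_a:  # no overlap
            separable += 1
    return separable / len(pairs)

# Hypothetical intervals for three models (win rate vs. a baseline, in %)
print(separability({
    "model-a": (78.0, 82.0),
    "model-b": (60.0, 65.0),
    "model-c": (62.0, 70.0),  # overlaps model-b, so that pair is not separable
}))  # -> 2/3
```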

In addition, take a look at this ranking; it is basically consistent with the latest LLM Arena leaderboard below:

[Figure: the latest LLM Arena leaderboard]

This shows that Arena-Hard's evaluation is very close to human preference (89.1% agreement).

Arena-Hard can be regarded as opening up a new way to play crowdsourcing:

Netizens get to chat with the models for free, while the platform gets the most impactful leaderboard plus fresh, high-quality data; an arrangement in which nobody gets hurt.


Setting exam questions for the large models

Let’s take a look at how to build this benchmark test.

To put it simply, the task is to select the better prompts from the 200,000 user prompts (questions) collected in the Arena.

This "good" is reflected in two aspects: diversity and complexity. The following figure shows Arena-Hard’s workflow:

[Figure: Arena-Hard pipeline, from user prompts to the final benchmark set]

To summarize: first, all prompts are clustered by topic (more than 4,000 topic clusters here); then a set of criteria is defined to score each prompt, and the scores are averaged within each cluster.

Clusters with high average scores can be considered high-complexity (or high-quality), which is exactly the "Hard" in Arena-Hard.

The 250 highest-scoring clusters are selected (250 to ensure diversity), and 2 lucky prompts are randomly drawn from each cluster to form the final benchmark set (500 prompts).

Each step is expanded below.

Diversity

The researchers first embedded each prompt with OpenAI's text-embedding-3-small, reduced the dimensionality with UMAP, identified clusters with the hierarchy-based clustering algorithm HDBSCAN, and then used GPT-4-Turbo to summarize each cluster into a topic.
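A minimal sketch of this kind of pipeline is shown below. It is not the authors' code: the hyperparameters (UMAP components, `min_cluster_size`), the summarization prompt, and the per-call batching are all illustrative assumptions.

```python
# pip install openai umap-learn hdbscan numpy
import numpy as np
import umap
import hdbscan
from openai import OpenAI

client = OpenAI()

def embed(prompts: list[str]) -> np.ndarray:
    """Embed prompts with text-embedding-3-small (batch in chunks for 200k prompts)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=prompts)
    return np.array([item.embedding for item in resp.data])

def cluster_prompts(prompts: list[str]) -> np.ndarray:
    """Embed -> UMAP dimensionality reduction -> HDBSCAN topic clusters."""
    X = embed(prompts)
    X_low = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(X)
    # Hierarchy-based density clustering; label -1 means "noise", i.e. no cluster.
    return hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X_low)

def name_cluster(sample_prompts: list[str]) -> str:
    """Ask GPT-4-Turbo for a short topic name summarizing one cluster."""
    joined = "\n".join(sample_prompts[:10])
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user",
                   "content": f"Give a short topic name for these user prompts:\n{joined}"}],
    )
    return resp.choices[0].message.content
```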


Complexity

High-quality user queries are then selected according to the seven key criteria below:


1. Does the prompt ask for specific output?

2. Does it cover one or more specific areas?

3. Are there multiple levels of reasoning, components, or variables?

4. Should AI directly demonstrate its ability to solve problems?

5. Is there a level of creativity involved?

6. Is technical accuracy of the response required?

7. Is it relevant to practical applications?

For each prompt, an LLM (GPT-3.5-Turbo, GPT-4-Turbo) is asked how many of the criteria it meets (a score from 0 to 7), and the average score is then computed for each group of prompts (each cluster).
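A hedged sketch of that scoring step follows; the prompt wording, the criteria paraphrases, and the aggregation code are assumptions rather than the official implementation.

```python
import re
from openai import OpenAI

client = OpenAI()

CRITERIA = [
    "Asks for a specific output",
    "Covers one or more specific domains",
    "Involves multiple levels of reasoning, components, or variables",
    "Directly asks the AI to demonstrate problem-solving ability",
    "Involves a level of creativity",
    "Requires technical accuracy in the response",
    "Is relevant to real-world applications",
]

def score_prompt(user_prompt: str, judge_model: str = "gpt-3.5-turbo") -> int:
    """Ask an LLM how many of the seven criteria the prompt satisfies (0-7)."""
    criteria_text = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(CRITERIA))
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user", "content":
                   f"Criteria:\n{criteria_text}\n\nUser prompt:\n{user_prompt}\n\n"
                   "How many of the criteria does the prompt satisfy? "
                   "Answer with a single integer between 0 and 7."}],
    )
    match = re.search(r"\d+", resp.choices[0].message.content)
    return int(match.group()) if match else 0

def average_score_per_cluster(prompts: list[str], labels: list[int]) -> dict[int, float]:
    """Average criteria score for each cluster produced in the diversity step."""
    buckets: dict[int, list[int]] = {}
    for prompt, label in zip(prompts, labels):
        buckets.setdefault(label, []).append(score_prompt(prompt))
    return {label: sum(scores) / len(scores) for label, scores in buckets.items()}
```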

The following figure shows the average score ranking of some clusters:

[Figure: average criteria score of selected clusters]

We can observe that clusters with higher scores are usually more challenging topics (such as game development, mathematical proofs), while clusters with lower scores belong to trivial or ambiguous problems.

With this complexity, the gap between top students and poor students can be widened. Let’s look at the following experimental results:

[Figure: win rate of the stronger model at each prompt score level]

In the three comparisons above, assume that GPT-4 is stronger than Llama-2-70b, that Claude's large cup (Opus) is stronger than the medium cup (Sonnet), and that Mistral-Large is stronger than Mixtral.

We can see that as the (complexity) score increases, the win rate of the stronger model also increases: the top students get distinguished, and the poor students get filtered out.

Because higher scores (more complex problems) give better discrimination, the 250 high-quality clusters with an average score >= 6 (out of 7) were finally selected.

Then, 2 prompts were randomly drawn from each cluster to form this version of the benchmark, Arena-Hard-v0.1.
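Continuing the sketch above, the selection and sampling step could look roughly like this (again an assumption of how it might be written, not the released code):

```python
import random

def build_benchmark(prompts: list[str], labels: list[int],
                    avg_scores: dict[int, float],
                    min_avg: float = 6.0, n_clusters: int = 250,
                    per_cluster: int = 2, seed: int = 0) -> list[str]:
    """Keep the highest-scoring clusters, then sample a couple of prompts from each."""
    rng = random.Random(seed)
    # Clusters whose average criteria score is >= 6, keeping at most the top 250.
    keep = sorted((lab for lab, s in avg_scores.items() if s >= min_avg),
                  key=lambda lab: avg_scores[lab], reverse=True)[:n_clusters]
    benchmark = []
    for lab in keep:
        members = [p for p, l in zip(prompts, labels) if l == lab]
        benchmark.extend(rng.sample(members, min(per_cluster, len(members))))
    return benchmark  # 250 clusters x 2 prompts -> ~500 prompts
```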

Is the teacher who judges the test papers reliable?

Once the test paper is ready, the next question is who will grade it.

Human grading is of course the most accurate, but because this is "Hard mode", many questions involve domain knowledge that would require expert evaluators, which is obviously impractical.

The next best thing is to use GPT-4, widely regarded as the smartest model available, as the grading teacher.

For example, in the charts above, all of the scoring is handled by GPT-4. In addition, the researchers used CoT prompting to have the LLM judge generate its own answer before delivering a verdict.
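A minimal sketch of such a pairwise, chain-of-thought judge is shown below; the prompt wording and verdict labels are assumptions, and Arena-Hard's released judge prompts differ in detail.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are grading two assistant answers to the same user question.

First, think step by step and write your own ideal answer to the question.
Then compare Assistant A and Assistant B against your answer, pointing out any errors.
Finally output a verdict on the last line: [[A>>B]], [[A>B]], [[A=B]], [[B>A]] or [[B>>A]].

Question:
{question}

Assistant A:
{answer_a}

Assistant B:
{answer_b}"""

def judge(question: str, answer_a: str, answer_b: str,
          judge_model: str = "gpt-4-1106-preview") -> str:
    """Chain-of-thought pairwise judgment; returns the judge's reasoning plus verdict."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content
```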

GPT-4 judgment results

The results below use gpt-4-1106-preview as the judge model, with gpt-4-0314 as the comparison baseline.

[Table: Arena-Hard-v0.1 results judged by gpt-4-1106-preview]

In the table above, a Bradley-Terry coefficient is estimated for each model from the pairwise comparisons and converted into a win rate against the baseline, which serves as the final score. The 95% confidence intervals were computed via 100 rounds of bootstrapping.
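As a rough illustration of that scoring step (a sketch under simplifying assumptions, not the official script), the Bradley-Terry coefficients can be fit by maximum likelihood and the battles bootstrapped:

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(battles: list[tuple[int, int]], n_models: int) -> np.ndarray:
    """battles: list of (winner_index, loser_index). Returns one coefficient per model."""
    def neg_log_likelihood(theta):
        ll = 0.0
        for w, l in battles:
            ll += theta[w] - np.logaddexp(theta[w], theta[l])  # log P(winner beats loser)
        return -ll
    res = minimize(neg_log_likelihood, np.zeros(n_models), method="BFGS")
    return res.x - res.x[0]  # anchor the baseline (index 0) at 0

def winrate_vs_baseline(theta: np.ndarray) -> np.ndarray:
    """P(model beats baseline) under the Bradley-Terry model (baseline is index 0)."""
    return 1.0 / (1.0 + np.exp(-(theta - theta[0])))

def bootstrap_ci(battles, n_models, rounds=100, seed=0):
    """Resample battles `rounds` times and return a 95% interval per model."""
    rng = np.random.default_rng(seed)
    battles = np.array(battles)
    samples = []
    for _ in range(rounds):
        idx = rng.integers(0, len(battles), size=len(battles))  # resample with replacement
        theta = fit_bradley_terry([tuple(b) for b in battles[idx]], n_models)
        samples.append(winrate_vs_baseline(theta))
    lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
    return lo, hi
```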

Claude expressed dissatisfaction

I, Claude-3 Opus, am also tied for first on the leaderboard, so why should GPT get to be the grading teacher?

So the researchers compared GPT-4-1106-Preview and Claude-3 Opus in the role of grading teacher.

Summary in one sentence: GPT-4 is a strict father, Claude-3 is a loving mother.

[Figure: model scores with GPT-4 vs. Claude-3 as judge]

Separability across models is higher when scored using GPT-4 (scores ranging from 23.0 to 78.0).

When Claude-3 does the grading, most models' scores improve quite a bit. In its own words: of course I look after my own family's models, I also like the open-source ones (Mixtral, Yi, Starling), and, fine, gpt-4-0125-preview really is better than me.

Claude-3 even loves gpt-3.5-0613 more than gpt-4-0613.

The following table further compares GPT-4 and Claude-3 using separability and consistency metrics:

[Table: separability and agreement of GPT-4 vs. Claude-3 as judge]

Judging from the results, GPT-4 is significantly better on every metric.

By manually comparing the cases where GPT-4 and Claude-3 disagree, we find that the disagreements usually fall into two major categories:

Conservative scoring, and a different interpretation of the user prompt.

Claude-3-Opus is more lenient in its scoring and much less likely to give harsh scores; in particular, it hesitates to declare that one answer is "much better" than another.

In contrast, GPT-4-Turbo identifies errors in model responses and penalizes the model with significantly lower scores.

On the other hand, Claude-3-Opus sometimes ignores smaller errors. Even when Claude-3-Opus does find these errors, it tends to treat them as minor issues and is very lenient during the scoring process.

Even in coding and math problems, where a small mistake can completely ruin the final answer, Claude-3-Opus is still lenient with these mistakes; GPT-4-Turbo is not.


For another small set of prompts, Claude-3-Opus and GPT-4-Turbo judge from fundamentally different angles.

For example, given a coding problem, Claude-3-Opus favors a simple structure that does not rely on external libraries, which can provide the user with a response of maximum educational value.

GPT-4-Turbo, by contrast, may prioritize the response that provides the most practical answer, regardless of its educational value to the user.

While both explanations are valid criteria for judging, GPT-4-Turbo's view may be closer to that of ordinary users.

See the image below for specific examples of different judgments, many of which exhibit this phenomenon.

[Figure: example judgments where GPT-4-Turbo and Claude-3-Opus disagree]

Limitations

Do LLM judges prefer longer answers?

The average token length and score of each model on MT-Bench and Arena-Hard-v0.1 are plotted below. Visually, there is no strong correlation between score and length.

[Figure: average token length vs. score on MT-Bench and Arena-Hard-v0.1]
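To put a number on that impression, one could compute a rank correlation between each model's score and its average answer length. The snippet below is a quick illustration with made-up values, not data from the report:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-model data: average answer length (tokens) and benchmark score (%)
avg_tokens = np.array([480, 520, 610, 350, 700, 560])
scores = np.array([35.0, 60.1, 78.0, 23.0, 41.5, 55.2])

rho, p_value = spearmanr(avg_tokens, scores)  # rank correlation is robust to scale
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```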

To further examine potential verbosity bias, the researchers ran an ablation in which GPT-3.5-Turbo answered under three different system prompts (original, chatty, detailed).

The results show that the judgments of both GPT-4-Turbo and Claude-3-Opus can be swayed by longer output, with Claude affected more (under it, GPT-3.5-Turbo's win rate against GPT-4-0314 climbs above 40%).

Interestingly, the "chatty" prompt had little impact on the win rate under either judge, indicating that output length is not the only factor: more detailed answers may also be favored by LLM judges.


The system prompts used in the ablation:

detailed: You are a helpful assistant who thoroughly explains things with as much detail as possible.

chatty: You are a helpful assistant who is chatty.
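A hedged sketch of how such an ablation could be wired up, reusing the pairwise `judge` helper sketched earlier; the "original" system prompt wording, model names, and verdict parsing are assumptions.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPTS = {
    "original": "You are a helpful assistant.",  # assumed wording for the unmodified setting
    "detailed": "You are a helpful assistant who thoroughly explains things "
                "with as much detail as possible.",
    "chatty": "You are a helpful assistant who is chatty.",
}

def answer(question: str, system_prompt: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def verbosity_ablation(questions: list[str], baseline_answers: dict[str, str]) -> dict[str, float]:
    """Win rate of GPT-3.5-Turbo against a fixed baseline under each system prompt."""
    winrates = {}
    for name, system_prompt in SYSTEM_PROMPTS.items():
        wins = 0
        for q in questions:
            # judge() is the pairwise CoT judge sketched earlier; position A is the ablated model.
            verdict = judge(q, answer(q, system_prompt), baseline_answers[q])
            if "[[A>B]]" in verdict or "[[A>>B]]" in verdict:
                wins += 1
        winrates[name] = wins / len(questions)
    return winrates
```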

Variance in GPT-4's judgments

The researchers found that even with temperature = 0, GPT-4-Turbo may still produce slightly different judgments.

The judgment of gpt-3.5-turbo-0125 was repeated three times and the variance was calculated.

[Table: scores of gpt-3.5-turbo-0125 over three repeated judgment runs]

Due to the limited budget, each model was evaluated only once here. The authors therefore recommend using confidence intervals to determine model separation.

Reference: //m.sbmmt.com/link/6e361e90ca5f9bee5b36f3d413c51842
