
Jia Jiaya's team, together with Cambridge, Tsinghua and others, proposes a new evaluation paradigm that exposes 'high scores, low ability' in large models in one second


Subverting previous standards for large-model evaluation, MR-Ben, the latest, most comprehensive and authoritative evaluation dataset, is here! This is another highly representative work from Jia Jiaya's team at the Chinese University of Hong Kong, following this April's release of Mini-Gemini, a powerful vision-language model billed as a "GPT-4 + DALL-E 3" combination. Under MR-Ben's "supervision", a large model must not only answer questions like a student, but also grade papers like a teacher, leaving its true reasoning ability nowhere to hide.
MR-Ben carefully evaluates many first-line open-source and closed-source models at home and abroad, such as GPT4-Turbo, Claude3.5-Sonnet, Mistral-Large, Zhipu-GLM4, Moonshot-v1, Yi-Large, Qwen2-70B, Deepseek-V2, etc., and provides a detailed analysis.
Which seemingly impressive large models get exposed, and which have real substance? All the code and data for this work are now open source. Let's take a look!
Project Page: https://randolph-zeng.github.io/Mr-Ben.github.io/
Arxiv Page: https://arxiv.org/abs/2406.13975
Github Repo: https://github.com/dvlab-research/Mr-Ben
MR-Ben exposes the "high scores, low ability" of large models in seconds
Since the field of artificial intelligence entered its GPT moment, academia and industry have worked hand in hand, releasing new models every month or even every week.
Large models keep emerging, so what standard should be used to measure their specific capabilities? The current mainstream approach is to evaluate large models with human standardized tests: multiple-choice and fill-in-the-blank questions. This test format has several benefits, which can be summarized as follows:
• Standardized tests are easy to quantify and evaluate. The criteria are clear: right is right and wrong is wrong.
• The metrics are intuitive. Scores are as easy to compare and understand as results on China's Gaokao or the American SAT.
• Quantitative results naturally make headlines (for example, GPT-4 easily passing the U.S. bar exam drew enormous attention).
But if you dig into how large models are trained and evaluated, you will find that this way of judging them, generating a final answer via step-by-step chain-of-thought, is not as "reliable" as it looks.
The problem lies precisely in the step-by-step answering process!
A pre-trained model has already seen trillions of tokens during pre-training. It is hard to tell whether the model being evaluated has already seen the corresponding data and answers correctly simply by "memorizing the questions". And within the step-by-step solution, we do not know whether the model picks the correct option through genuine understanding and reasoning, because the evaluation mainly checks the final answer.
Although the academic community keeps upgrading and transforming datasets such as GSM8K and MMLU, for example introducing the multilingual MGSM version of GSM8K or adding harder questions on top of MMLU, there is still no escaping the multiple-choice and fill-in-the-blank format itself.
Moreover, these datasets face serious saturation: large language models' scores on these benchmarks have peaked, and the benchmarks have gradually lost their discriminative power.
To this end, Jia Jiaya's team joined forces with well-known universities such as MIT, Tsinghua and Cambridge, and with leading domestic annotation companies, to annotate MR-Ben, an evaluation dataset focused on the reasoning process behind complex problems.
MR-Ben applies a "grading" paradigm shift to questions drawn from the pre-training and evaluation datasets of large models, such as GSM8K, MMLU, LogiQA and MHPP. The resulting dataset is harder, more discriminative, and more truly reflective of a model's reasoning ability!
This work also makes targeted improvements to existing evaluation pain points:
Worried that data leakage lets models memorize questions and inflate their scores? There is no need to hunt for new questions or perturb existing ones to test robustness. MR-Ben simply switches the model from the student role of answering questions to a "grading" mode over solution processes, letting the large model act as the teacher, which tests how well it has truly mastered the knowledge points!
Worried that the model has no real grasp of the solution process and may hallucinate or misunderstand its way to a wrong answer? MR-Ben recruits a group of high-level master's and PhD annotators to carefully annotate the solution processes of a large number of questions: whether the process is correct, where the first error occurs, and why. The model's grading is then compared against the human experts' grading to test its command of the knowledge points.
Specifically, the team organized mainstream evaluation datasets such as GSM8K, MMLU, LogiQA and MHPP into categories such as mathematics, physics, chemistry, biology, code, logic and medicine, and further distinguished difficulty levels. For each category and each collected question, the team gathered a corresponding step-by-step solution process, which was then annotated by trained, professional master's and PhD annotators.
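To make this setup concrete, the sketch below shows what a single annotated MR-Ben record could look like, based on the description above. The class and field names are illustrative assumptions for exposition, not the schema used in the official repository.

```python
# Illustrative sketch of one annotated MR-Ben record; names are assumptions,
# not the official dataset schema.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MrBenRecord:
    subject: str                      # e.g. "high school mathematics"
    question: str                     # the original question text
    options: List[str]                # answer options, e.g. ["67", "43", "57", "37"]
    sampled_model: str                # model that produced the step-by-step solution
    solution_steps: List[str]         # the step-by-step solution to be graded
    solution_correct: bool            # human verdict on the whole solution
    first_error_step: Optional[int]   # 1-based index of the first wrong step, if any
    error_reason: Optional[str]       # annotator's explanation of the error
    corrected_step: Optional[str]     # annotator's corrected version of that step
```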


The construction process of the MR-Ben dataset

Which large model performs best?

Next, let's look at a concrete example from the MR-Ben work to see why having models generate a final answer step by step, and evaluating only that answer, is unreliable.

As the "grading" model under evaluation, GPT4-Turbo failed to spot the calculation error hidden in Step 2:

E.1 Mathematics

Subject: High school mathematics

Question: What is the smallest integer that leaves a remainder of 1 when divided by 4, a remainder of 2 when divided by 3, and a remainder of 2 when divided by 5?

Options: [A] 67 [B] 43 [C] 57 [D] 37

Answer: D

Sampled model: Claude2

Model solution steps:

  • Step 1: Check option A (67). When 67 is divided by 4, the remainder is 3, not 1. So 67 is not the correct answer.
  • Step 2: Check option B (43). When 43 is divided by 4, the remainder is 1. When 43 is divided by 3, the remainder is 3, not 2. So 43 is not the correct answer.
  • Step 3: Check option C (57). When 57 is divided by 4, the remainder is 1. When 57 is divided by 3, the remainder is 0, not 2. So 57 is not the correct answer.
  • Step 4: Check option D (37). When 37 is divided by 4, the remainder is 1. When 37 is divided by 3, the remainder is 1. When 37 is divided by 5, the remainder is 2.
  • Step 5: Option D (37) satisfies all the conditions and is the correct answer.

    Answer: Option D

    Human annotation

    Correctness of the model's solution: Incorrect
    First incorrect step: Step 2

    Reason the model's solution is wrong:

    In Step 2, the model miscalculated the remainder of 43 divided by 4 as 1 instead of 3. The model assumed that when 43 is divided by 4 the remainder is 1, but the remainder should be 3. This is a calculation error.

    First corrected step: Step 2

    When 43 is divided by 4, the remainder is 3. When 43 is divided by 3, the remainder is 1. When 43 is divided by 5, the remainder is 3, not 2. Therefore, 43 is not the correct answer.

    Grading model: gpt-4-turbo

    Model response:
    - Step 1: The step checks the remainders of 67 divided by 4 and 3, and finds that it meets the condition of leaving a remainder of 3 when divided by 4; this is a correct verification.
    - Step 2: Checks the remainders of 43 divided by 4, 3 and 5, and finds the remainders 1, 3 and 5 to be correct and as expected, except that the remainder for division by 5 does not meet the condition.
    - Step 3: The step computes the remainders of 57 divided by 4 and 3, which are 1 and 0 respectively. When 57 is divided by 3, the remainder does not meet the condition.
    - Step 4: The step checks and verifies the remainders of 37 divided by 4, 3 and 5, finds all the expected remainders to be accurate, and concludes that 37 satisfies all the conditions in the question.

    Is the solution process correct: Correct

    First incorrect step: N/A
    Reason for the first incorrect step: N/A
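The arithmetic slip that Claude2 made and the grading GPT4-Turbo missed is easy to verify mechanically. The short Python check below is not part of MR-Ben's tooling, just an illustration of the Step 2 error: 43 leaves a remainder of 3, not 1, when divided by 4.

```python
# Check option B (43) against the three remainder conditions in the question:
# remainder 1 when divided by 4, remainder 2 when divided by 3, remainder 2 when divided by 5.
value = 43
for divisor, expected in [(4, 1), (3, 2), (5, 2)]:
    remainder = value % divisor
    status = "meets" if remainder == expected else "fails"
    print(f"{value} % {divisor} = {remainder} ({status} the condition, expected {expected})")

# Output:
# 43 % 4 = 3 (fails the condition, expected 1)   <- the error hidden in Step 2
# 43 % 3 = 1 (fails the condition, expected 2)
# 43 % 5 = 3 (fails the condition, expected 2)
```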

As you can see, given a question and a fixed solution process, evaluating a language model turns into having the model "grade" that solution: judge whether it is correct, locate the error, and explain why. The correctness of the solution and the location of the error can be scored by comparison with the human annotations, while the model's explanation of the error can be judged by GPT-4, which compares the annotators' explanation with the model's explanation to decide whether the model is right.

In terms of methodology, MR-Ben requires the model to analyze the premises, assumptions and logic of every step in the solution process, and to mentally replay the reasoning to judge whether the current step leads toward the correct answer. This "grading" style of evaluation is far harder than merely answering the question, but it effectively avoids the inflated scores caused by question memorization: a student who can only memorize answers is unlikely to make a competent grader.
In addition, through a carefully controlled human annotation pipeline, MR-Ben obtained a large volume of high-quality annotations, and the clever process design makes the evaluation intuitively quantifiable.
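As a rough illustration of how such grading outputs might be compared against the human annotations, a minimal sketch is shown below. The function and field names are assumptions for exposition, not MR-Ben's actual code; the core check reduces to matching the correctness verdict and the first-error step, with the error-reason comparison delegated to a judge model such as GPT-4.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Grading:
    solution_correct: bool            # is the whole solution process correct?
    first_error_step: Optional[int]   # index of the first wrong step, None if correct
    error_reason: Optional[str]       # free-text explanation of the error

def score_grading(model: Grading, human: Grading, judge_reason_match) -> dict:
    """Compare a model's grading of a solution against the human annotation.

    `judge_reason_match` stands in for a judge model (e.g. GPT-4) that decides
    whether two free-text error explanations describe the same mistake.
    """
    correct_verdict = model.solution_correct == human.solution_correct
    # The error location only matters when the solution is actually wrong.
    correct_step = (human.solution_correct
                    or model.first_error_step == human.first_error_step)
    # The reason is judged only if the model correctly flagged a wrong solution and its step.
    correct_reason = (not human.solution_correct
                      and correct_verdict and correct_step
                      and judge_reason_match(model.error_reason, human.error_reason))
    return {"verdict": correct_verdict, "step": correct_step, "reason": correct_reason}

# Example: the grading shown above, where gpt-4-turbo marked the solution correct
# although the human annotation flags Step 2 as wrong.
human = Grading(solution_correct=False, first_error_step=2,
                error_reason="43 % 4 is 3, not 1")
model = Grading(solution_correct=True, first_error_step=None, error_reason=None)
print(score_grading(model, human, judge_reason_match=lambda a, b: False))
# -> {'verdict': False, 'step': False, 'reason': False}
```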

Jia Jiaya's team also tested the ten most representative current large language models and their different versions. Among the closed-source models, GPT4-Turbo performs best (even though it failed to spot the calculation error when "grading"), leading the other models in most subjects under both the with-demonstration (k=1) and without-demonstration (k=0) settings.

Evaluation results of selected open-source large language models on the MR-Ben dataset
As the results show, the strongest open-source large language models have caught up with some commercial models, and even the strongest closed-source models have not yet saturated the MR-Ben dataset, with clear separation between different models.
Beyond this, the original MR-Ben paper contains more interesting analyses and findings, for example:

• The open-source models released by Qwen and Deepseek hold their own against closed-source models, even in the global top tier.
• The pricing strategies of different closed-source models versus their actual performance are thought-provoking. Users who care about reasoning ability can weigh price against capability to find the model that suits them.
• In low-resource scenarios, small models also have bright spots: in the MR-Ben evaluation, Phi-3-mini stands out among small models, even matching or exceeding models with tens of billions of parameters, demonstrating the importance of fine-tuning data.
• MR-Ben scenarios involve complex logical analysis and step-by-step inference; in the few-shot setting, overly long contexts actually confuse the models and degrade performance.
• MR-Ben ran a number of generate-reflect-regenerate ablations to compare different prompting strategies. These have no effect on weak models and little effect on strong models such as GPT4-Turbo; mid-tier models show a slight improvement, as they both fix wrong answers and occasionally break correct ones.
• After roughly splitting the MR-Ben subjects into knowledge-, logic-, computation- and algorithm-oriented types, different models show different strengths and weaknesses across reasoning types.

The team has uploaded a one-click evaluation script to GitHub. Anyone interested in complex reasoning is welcome to evaluate their own model and submit the results; the team will keep the corresponding leaderboard updated.
By the way, running the official one-click evaluation script costs only about 12M tokens, and the process is very smooth, so give it a try!
References

Training Verifiers to Solve Math Word Problems (https://arxiv.org/abs/2110.14168)
Measuring Massive Multitask Language Understanding (https://arxiv.org/abs/2009.03300)
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning (https://arxiv.org/abs/2007.08124)
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation (https://arxiv.org/abs/2405.12430)
Sparks of Artificial General Intelligence: Early experiments with GPT-4 (https://arxiv.org/abs/2303.12712)
Qwen Technical Report (https://arxiv.org/abs/2309.16609)
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (https://arxiv.org/abs/2405.04434)
Textbooks Are All You Need (https://arxiv.org/abs/2306.11644)
Large Language Models Cannot Self-Correct Reasoning Yet (https://arxiv.org/abs/2310.01798)

