
ACL 2024 | In the mathematical evaluation of 25 open and closed source models, GPT-3.5-Turbo barely passed

AIxiv is a column in which this site publishes academic and technical content. Over the past few years, it has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The authors of this article are from the University of Hong Kong and Tencent: Li Qintong, Leyang Cui, Zhao Xueliang, Kong Lingpeng, and Wei Bi. First author Li Qintong is a doctoral student in the Natural Language Processing Laboratory of the University of Hong Kong, where her research focuses on natural language generation and text reasoning; she and doctoral student Zhao Xueliang are supervised by Professor Kong Lingpeng. Leyang Cui and Wei Bi are senior researchers at Tencent.

Foreword

The extraordinary problem-solving ability of large language models (LLMs) is increasingly apparent. Recently, a phenomenon worth attention is that these models have achieved striking results on multiple mathematical-reasoning benchmarks. Taking GPT-4 as an example, it performs well on GSM8K [1], a challenging test set of grade-school math word problems, with an accuracy above 90%. Many open-source models also show impressive performance, with accuracies above 80%.

However, in practice we often find that when a math problem is changed only slightly, LLMs can make low-level errors, as in the following figure:


Figure 1: GPT-3.5-Turbo solves a math problem correctly (left), but when a constraint is added to the original problem (right), it misuses an operator and makes an error because it fails to distinguish between the "leave" and "return" directions.

We cannot help but ask: do large language models really grasp the essence of mathematical knowledge? How do they score so high on these tests? Are they simply imitating surface-level reasoning patterns from large amounts of training data? Whether LLMs truly understand mathematical concepts remains a question worth exploring.

To explore this question, the authors designed the evaluation benchmark GSM-Plus. It applies 8 different fine-grained mathematical transformations to each problem in order to systematically assess how well current LLMs handle basic math word problems. On this new benchmark, the paper rigorously evaluates 25 different LLMs, including both open-source and closed-source models.

Experimental results show that GSM-Plus is a challenging benchmark for most LLMs. Even GPT-3.5-Turbo, which achieves 73.62% accuracy on GSM8K, reaches only 61.19% on GSM-Plus. This work has been accepted at ACL 2024 with review scores of 4, 4, and 4.5.


  • Paper title: GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers
  • Paper address: https://arxiv.org/pdf/2402.19255
  • Paper homepage: https://qtli.github.io/GSM-Plus/

Background

Mathematical reasoning is an important marker of progress in artificial intelligence: it requires rigorous problem understanding, strategy formulation, and computational execution. Over the past few years, numerous publicly available datasets have been used to evaluate the mathematical-reasoning capabilities of AI systems. Early math datasets focused on equation-based problems; subsequently, more difficult datasets covering elementary-, high-school-, and college-level problems were introduced.

As the difficulty of evaluation data has increased, LLMs have also developed rapidly. To improve LLM performance in mathematics, supervised fine-tuning (SFT) on diverse task data can quickly adapt a model to the domain; at inference time, carefully designed input prompts (e.g., Chain-of-Thought and Program-of-Thought) can also effectively elicit the mathematical abilities of LLMs.

For most LLMs, there is still much room for improvement on math problems at the high-school level and above. In grade-school mathematics, however, LLMs have shown great potential. This raises the question: can LLMs maintain their high performance in real-world settings?

Adversarial Evaluation Dataset GSM-Plus

This study introduces the comprehensive benchmark GSM-Plus to systematically examine the robustness of LLMs in solving basic math problems. Inspired by the taxonomy of mathematical problem-solving abilities in Polya's principles [2], the paper identifies five guiding perspectives for constructing the GSM-Plus dataset:

For ease of understanding, we use the following seed question as a running example: "Janet's ducks lay 16 eggs per day. She eats three eggs every morning and bakes muffins for her friends with four eggs every day. She sells the remaining eggs at the farmers' market daily for $2 each. How many dollars does she make every day at the farmers' market?"
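For reference, the seed problem's answer works out as 16 − 3 − 4 = 9 eggs left to sell, and 9 × $2 = $18 per day; this $18 figure reappears in the operation-reversal example below.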

(1) Numerical variation: changing numerical data or its type. The paper defines three subcategories:

  • Numeric substitution: replace a value with another of the same digit count and type, e.g., replace "16" with "20" in the question.
  • Digit expansion: increase the number of digits in a value, e.g., replace "16" with "1600".
  • Integer-decimal-fraction conversion: replace an integer with a decimal or fraction, e.g., convert "2" to "2.5".

(2) Arithmetic variation: introducing additional operations or reversals into the problem, limited to addition, subtraction, multiplication, and division:

  • Operation expansion: add a constraint to the original problem, e.g., the new condition "She also uses two eggs to make homemade hair masks every day."
  • Operation reversal: turn a known condition of the original problem into the unknown of the GSM-Plus variant. For example, the original statement "$2 per duck egg" in Figure 2 becomes the question of the new problem, "What is the price of each duck egg?", while the original question "How many dollars does she make at the farmers' market every day?" becomes a known condition of the new problem: "She earns $18 a day at the farmers' market."

(3) Problem understanding: restating the math problem with different words and sentences without changing its meaning, e.g., "Janet raises a flock of ducks that lay 16 eggs every day. She consumes three eggs for breakfast and then uses four to bake muffins for her friends. Janet sells all the remaining fresh eggs at the farmers' market at $2 per egg. How much money does she make every day by selling eggs at the farmers' market?"

(4) Distractor insertion: inserting sentences that are topic-related and contain numbers but are useless for solving the original problem, e.g., "Janet also wants to use two duck eggs to feed her pet parrot; fortunately, her neighbor gives her two duck eggs every day for the parrot."

(5) Critical thinking: testing whether LLMs can question or raise doubts when a problem lacks a necessary condition, e.g., "Janet's ducks lay eggs every day. She eats three eggs every morning and bakes muffins for her friends with four eggs every day. She sells the remaining eggs at the farmers' market daily for $2 each. How many dollars does she make every day at the farmers' market?" (here the number of eggs laid per day is omitted).

Based on the 1,319 test questions of GSM8K, the paper creates eight variants for each question, resulting in a GSM-Plus dataset of 10,552 question variants (the paper also provides a test subset of 2,400 variants for quick evaluation). By testing LLMs on each problem and its eight variants, GSM-Plus helps researchers comprehensively evaluate the robustness of LLMs in solving math problems.

Figure 2: From one seed math problem, eight perturbations are applied to generate the problem variants.

Using GSM-Plus, the paper evaluates 25 LLMs of different scales, pre-training methods, and task fine-tuning, in combination with 4 commonly used prompting techniques. The paper finds that LLMs can accurately solve GSM8K problems overall but run into obvious difficulty when answering the variant questions in GSM-Plus. The main findings are as follows:


  • Task-specific optimization, i.e., fine-tuning on math-related datasets, can usually improve downstream accuracy, but robustness depends more on the choice of base model and fine-tuning datasets.
  • LLM performance degrades rapidly when "critical thinking", "arithmetic variation", or "distractor insertion" is involved, but remains relatively stable under "numerical variation" and "problem understanding" perturbations.
  • Previous prompting techniques (e.g., CoT, PoT, LtM, and Complexity-based CoT) do not significantly improve robustness, especially for "arithmetic variation" and "critical thinking". Building on prior work, the paper further explores a compositional prompting method that iteratively generates and verifies each reasoning step, improving LLM performance on both GSM8K and GSM-Plus.

GSM-Plus Features

  • Quality assurance: GSM-Plus evaluation questions are generated in two stages. First, GPT-4's question-rewriting ability is used to generate question variants; candidate answers are then generated for these variants. To ensure data quality, all variants and answers generated by GPT-4 are rigorously checked by a human annotation team, which corrected 18.85% of GPT-4's rewritten questions (see the sketch after this list).
  • Fine-grained evaluation: for each test question in the mainstream GSM8K evaluation set, GSM-Plus provides 8 variant questions along different perturbation directions, fully testing a large model's ability to flexibly solve math word problems in different contexts.
  • Challenging: compared with GSM8K, the GSM-Plus variants are more challenging, and the performance of all evaluated LLMs drops significantly. The analysis below examines the problem-solving robustness of LLMs under each type of perturbation.
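To make the two-stage pipeline concrete, here is a minimal sketch. It is an illustration under assumptions, not the paper's actual code: the prompt wordings and the use of the `openai` Python client are ours, and the human annotation stage is represented only by a comment.

```python
# Minimal sketch of GSM-Plus's two-stage variant generation (an illustration,
# not the paper's code): prompt wordings are assumed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def make_variant(seed_question: str, perturbation: str) -> dict:
    # Stage 1: rewrite the seed question under one of the 8 perturbation types.
    variant = ask(
        "Rewrite the following math problem by applying this perturbation: "
        f"{perturbation}.\n\nProblem: {seed_question}\n\nRewritten problem:"
    )
    # Stage 2: generate a candidate answer for the rewritten problem.
    answer = ask(f"Solve step by step, then state the final number.\n\n{variant}")
    # In GSM-Plus, every variant/answer pair is then checked by human
    # annotators, who corrected 18.85% of the rewrites.
    return {"variant": variant, "candidate_answer": answer}
```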
Comparison with other grade-school math word problem datasets

Table 1: Different colors denote different perturbation types: numeric substitution, digit expansion, integer-decimal-fraction conversion, operation expansion, operation reversal, problem understanding, distractor insertion, critical thinking.

As the table shows, previous studies used various perturbations to test the robustness of mathematical reasoning, but their evaluation settings cover only some perturbation types, and most introduce perturbations through automatic construction, making quality hard to guarantee. In contrast, GSM-Plus perturbs each problem along eight different mathematical reasoning skills, with more comprehensive coverage and strict quality control.

Experimental Analysis

Evaluation Metrics

  • Performance drop rate (PDR): the degree to which an LLM's performance drops on the perturbed problem relative to the original problem.
  • Percentage of simultaneously solved pairs (ASP): the proportion of problem pairs in which both the original problem and its corresponding variant are answered correctly by the LLM.
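As a concrete reading of the two metrics, here is a minimal sketch that computes them from per-pair correctness records. The data layout is an assumption for illustration, and PDR is taken here as the relative drop in accuracy (the drop as a fraction of the original accuracy).

```python
# Minimal sketch of the two metrics, assuming one record per
# (GSM8K problem, GSM-Plus variant) pair with boolean correctness flags.
pairs = [
    {"orig_correct": True, "variant_correct": True},
    {"orig_correct": True, "variant_correct": False},
    {"orig_correct": False, "variant_correct": False},
]

def pdr(pairs) -> float:
    """Performance drop rate: drop in variant accuracy as a fraction of
    original accuracy (assumed reading of the metric)."""
    acc_orig = sum(p["orig_correct"] for p in pairs) / len(pairs)
    acc_var = sum(p["variant_correct"] for p in pairs) / len(pairs)
    return 1 - acc_var / acc_orig

def asp(pairs) -> float:
    """Proportion of pairs where both the original and the variant are solved."""
    both = sum(p["orig_correct"] and p["variant_correct"] for p in pairs)
    return both / len(pairs)

print(f"PDR = {pdr(pairs):.2%}, ASP = {asp(pairs):.2%}")
```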

Overall Performance

As shown in the table below, the performance of most LLMs on GSM-Plus drops significantly compared to GSM8K.

GPT-4 shows the highest robustness, with the smallest PDR of only 8.23%. CodeLlama shows the largest PDR: its 7B, 13B, and 34B models drop by 40.56%, 39.71%, and 34.27% respectively, exceeding both its base model LLaMA-2-7B (39.49%) and math SFT models fine-tuned from it, such as SEGO-7B (34.91%). This suggests that reasoning expressed only in a programming language is vulnerable to perturbations.

Facing mathematical perturbations, larger models tend to have more stable performance. Although supervised fine-tuning can improve downstream accuracy, it does not significantly enhance robustness to perturbations (i.e., it does not lower PDR). The fine-tuning data matters for robustness: models all fine-tuned from LLaMA-2 but on different data show large differences in both accuracy and robustness.

Table 2: Overall performance of LLMs under perturbation.


The paper further evaluates the stability of LLMs under the 8 types of problem variants. Compared to the human baseline, LLM performance drops significantly for critical thinking (purple), operation expansion and operation reversal (blue), distractor insertion (pink), and integer-decimal-fraction conversion (orange). For numeric substitution and problem understanding, LLM performance is stable or even slightly improved.
The preceding analysis is based on the whole dataset. Next, the paper splits the data according to whether each math question is answered correctly, and analyzes whether success on a GSM8K problem implies a higher probability of correctly answering its GSM-Plus variants (i.e., a high ASP value), and vice versa. If this holds, LLMs can be considered to perform stably on that particular subset of problems, even if that is not the case on the dataset as a whole. In the experimental setup, each GSM8K problem and its variants in GSM-Plus are grouped into 8 problem pairs, with results shown in Figure 4.

Figure 4: Inference transferability of LLMs between GSM8K and GSM-Plus problem pairs. Purple (both correct) and blue (both incorrect) bars indicate consistent model behavior, while red (GSM8K correct & GSM-Plus incorrect) and yellow (GSM8K incorrect & GSM-Plus correct) bars indicate inconsistent behavior. The combined height of the purple and red bars equals the number of GSM8K problems an LLM solves correctly.
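To make Figure 4's four bar categories concrete, the following sketch tallies problem pairs into the four buckets, reusing the `pairs` record layout assumed in the metrics sketch above.

```python
from collections import Counter

def classify(pairs) -> Counter:
    """Tally problem pairs into Figure 4's four bar categories."""
    buckets = Counter()
    for p in pairs:
        if p["orig_correct"] and p["variant_correct"]:
            buckets["both correct (purple)"] += 1
        elif p["orig_correct"]:
            buckets["GSM8K correct & GSM-Plus incorrect (red)"] += 1
        elif p["variant_correct"]:
            buckets["GSM8K incorrect & GSM-Plus correct (yellow)"] += 1
        else:
            buckets["both incorrect (blue)"] += 1
    return buckets

# Note: purple + red together equal the number of GSM8K problems the
# model solves correctly, matching the caption above.
```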


The presence of red bars (problems whose original version is answered correctly but whose variant is not) indicates that most models have limited performance transferability. Although LLMs differ in performance on GSM8K problems (the combined height of the purple and red bars), their transferability is similar (the height of the red bars). This means that existing benchmarks cannot accurately assess a model's true mathematical-reasoning capability: high accuracy does not equate to strong reasoning robustness.

The role of prompting techniques in LLM robustness

Previous work has shown that good prompt instructions are important for eliciting the mathematical ability of language models. The paper selects 4 representative models and tests their problem-solving performance under different prompt instructions. As the figure below shows, LLMs perform most stably under perturbation when using complex examples as in-context demonstrations (Complexity-based CoT); in contrast, when intermediate reasoning is expressed only as program code (Program-of-Thought), LLMs are more susceptible to perturbation. Overall, these prompting tricks are not enough for LLMs to maintain their GSM8K-level performance on GSM-Plus.

Figure 5: The impact of prompting techniques on the robustness of LLMs.
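For readers unfamiliar with the two prompt styles being compared, here is an illustrative contrast on the seed problem (not the paper's exact prompts): Chain-of-Thought answers in natural-language steps, while Program-of-Thought emits executable code whose output is the answer.

```python
# Illustrative contrast of the two prompt styles on the seed problem;
# the paper's exact prompt templates may differ.

# Chain-of-Thought: the model answers in natural-language steps.
cot_answer = """Janet's ducks lay 16 eggs per day.
She eats 3 and uses 4 for muffins, leaving 16 - 3 - 4 = 9 eggs.
Selling 9 eggs at $2 each gives 9 * 2 = 18.
The answer is 18."""

# Program-of-Thought: the model emits executable code instead of prose,
# and the final answer is obtained by running it.
def pot_answer() -> int:
    eggs_laid = 16
    eggs_eaten = 3
    eggs_for_muffins = 4
    price_per_egg = 2
    eggs_sold = eggs_laid - eggs_eaten - eggs_for_muffins
    return eggs_sold * price_per_egg

print(pot_answer())  # 18
```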

Is compositional prompting effective?

How can the robustness of LLMs be enhanced on top of existing prompting methods?

The paper finds that LLMs often ignore important conditions or make calculation errors while solving a problem. It therefore explores Comp, a compositional prompting method. Comp first prompts the LLM to extract the numerically relevant necessary conditions in the problem (Prompt1). Next, based on the problem and these key conditions, the LLM is iteratively instructed to generate a reasoning goal (Prompt2) and a calculation goal (Prompt3), and is asked to give feedback on the problem-solving steps generated so far to determine whether the final answer has been reached (Prompt4). The concrete implementation is shown in Figure 6.
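Below is a minimal sketch of the Comp loop as described (see Figure 6 for the paper's actual templates); the four prompt wordings, the generic `llm` callable, and the stopping check are our assumptions.

```python
# Minimal sketch of the Comp loop (Prompt1-Prompt4); the prompt wordings,
# the generic `llm` callable, and the stopping check are assumptions.
from typing import Callable

def comp_solve(problem: str, llm: Callable[[str], str], max_steps: int = 10) -> str:
    # Prompt1: extract the numerically relevant necessary conditions.
    conditions = llm(f"List the key numerical conditions of this problem:\n{problem}")
    steps: list[str] = []
    for _ in range(max_steps):
        context = (
            f"Problem: {problem}\nConditions: {conditions}\n"
            "Steps so far:\n" + "\n".join(steps)
        )
        # Prompt2: generate the next reasoning goal.
        goal = llm(context + "\nWhat should be reasoned about next?")
        # Prompt3: carry out the calculation for that goal.
        calc = llm(context + f"\nGoal: {goal}\nPerform the calculation:")
        steps.append(f"{goal} -> {calc}")
        # Prompt4: self-verify the history and decide whether the answer is reached.
        verdict = llm(
            context + f"\n{steps[-1]}\n"
            "Is the final answer reached? Reply YES <answer> or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            return verdict
    return steps[-1] if steps else conditions
```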
Comp improves LLM performance across the various problem-variant types through iterative generation and self-verification, but it still cannot close the performance gap between the standard test set and the adversarial test set. The authors look forward to future methods that further improve model robustness and advance LLMs in mathematical reasoning.
Table 3: Performance of GPT-3.5-Turbo under different prompting techniques on a question rewritten by GSM-Plus. While every prompting technique leads Turbo to answer the GSM8K question correctly, only Comp helps Turbo generate the correct answer on the GSM-Plus variant.

Conclusion

This article introduces GSM-Plus, an adversarial evaluation set of grade-school math word problems designed to systematically examine the mathematical problem-solving of LLMs. The experimental analysis finds that, when faced with perturbations, most LLMs perform significantly worse than on standard benchmarks, far below human-level performance. The researchers hope this work will promote further research, including but not limited to: (1) systematic evaluation of LLMs' mathematical skills; and (2) building models that can reason about mathematics flexibly.

References


[1] Cobbe, Karl, et al. 2021. Training Verifiers to Solve Math Word Problems. https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k

[2] George Polya. 2004. How to Solve It: A New Aspect of Mathematical Method, volume 85. Princeton University Press.


Source: jiqizhixin.com