With fewer than 10 lines of code, the mathematical ability of large models (as measured on GSM8K) can be improved by 20%!
Several independent scholars have proposed improvements to large model sampling, which has attracted the attention of the open source community.
Currently, this method has achieved results on Mistral-7B, and testing on Llama3-70B is also ongoing.
This method is called min-p sampling, which aims to balance the coherence and diversity of the generated text.
Simply put, it lets the model behave differently in different situations, for example staying stable on factual questions while being more creative in scenarios such as writing.
In the paper, the author mentioned that this method has been widely used by the open source community.
At the same time, the author revealed that closed-source model providers such as Anthropic and Google have tested or are testing min-p.
The news has also been confirmed by Google. Logan Kilpatrick, the developer community leader who switched from OpenAI to Google, has replied "On it".
Abram Jackson, a researcher at Microsoft Copilot, said after reading it that this is the first improvement he has seen regarding token sampling in the inference process, and there is still a lot of room for improvement in the future.
It is worth mentioning that the lead author of this widely watched study, Minh Nhat Nguyen, never formally studied computer science and is entirely self-taught.
Minh and the other members of the team completed the project with the help of Apart Research, an AI safety research organization.
Dynamic adjustment of the sampling threshold
min-p is a dynamic truncation sampling method, the core of which is to scale the minimum probability threshold according to the maximum probability of the token distribution at each step.
The purpose of this is mainly to balance the coherence and diversity of the generated text, especially under higher temperature conditions.
Specifically, min-p introduces a basic probability threshold p_base, which represents the minimum probability requirement for entering the sampling pool.
When generating tokens at each step, min-p will multiply p_base with the largest token probability p_max in the current probability distribution to obtain a scaled absolute threshold p_scaled.
Only tokens with probability greater than or equal to p_scaled can enter the sampling pool.
When the model's predicted probability for a certain token is very high (that is, p_max is large), p_scaled is also high, so the sampling pool shrinks sharply: the vast majority of low-probability tokens are filtered out and only a few high-confidence choices remain, which keeps the output consistent;
When the model's predicted probabilities for all tokens are relatively close (p_max is lower), p_scaled drops accordingly, relaxing the requirement for entering the sampling pool. More medium-probability tokens are admitted, giving the model more room to generate diverse content.
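The thresholding rule described above can be illustrated with a minimal Python sketch. The function name min_p_pool and the two example distributions are hypothetical and chosen only for illustration; this is not the authors' code.

```python
def min_p_pool(probs, p_base=0.1):
    """Return the indices of tokens that pass the min-p threshold.

    probs  : next-token probabilities (assumed to sum to 1)
    p_base : base threshold, scaled by the largest probability
    """
    p_max = max(probs)
    p_scaled = p_base * p_max  # absolute threshold for this step
    return [i for i, p in enumerate(probs) if p >= p_scaled]


# Hypothetical distributions, for illustration only.
confident = [0.90, 0.05, 0.03, 0.01, 0.01]  # one dominant token
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # probabilities close together

print(min_p_pool(confident))  # -> [0]
print(min_p_pool(uncertain))  # -> [0, 1, 2, 3, 4]
```

With the same p_base of 0.1, the confident distribution yields a threshold of about 0.09 and keeps a single token, while the flatter distribution yields a threshold of 0.025 and keeps all five.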
After determining the sampling pool, min-p will scale the token probability distribution according to temperature.
It divides the log-probability of each token by a temperature parameter τ and then renormalizes to obtain the temperature-scaled probability distribution.
A τ value greater than 1 makes the distribution flatter, increasing the chance that low-probability tokens are selected; a τ value less than 1 makes the distribution sharper, strengthening the advantage of high-probability tokens.
Finally, min-p randomly selects the next token from the scaled sampling pool according to the adjusted probability distribution.
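The full step (truncation, temperature scaling, then random selection) can be sketched in plain Python as follows. This follows the order of operations described in this article; other implementations may order the steps differently. The function name min_p_sample, its parameters, and the rng argument are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def min_p_sample(probs, p_base=0.1, temperature=1.0, rng=random):
    """Sample one token index using min-p truncation plus temperature.

    Returns (chosen_index, adjusted) where adjusted maps each index in
    the sampling pool to its temperature-scaled probability.
    """
    # Step 1: keep tokens whose probability is at least p_base * p_max.
    p_scaled = p_base * max(probs)
    pool = [i for i, p in enumerate(probs) if p > 0 and p >= p_scaled]

    # Step 2: divide log-probabilities by the temperature, then
    # renormalize over the pool (softmax with max-subtraction for stability).
    scaled = [math.log(probs[i]) / temperature for i in pool]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    adjusted = [w / total for w in weights]

    # Step 3: draw the next token from the adjusted distribution.
    choice = rng.choices(pool, weights=adjusted, k=1)[0]
    return choice, dict(zip(pool, adjusted))


# Hypothetical distribution; a seeded generator keeps the draw repeatable.
token, dist = min_p_sample([0.25, 0.22, 0.20, 0.18, 0.15],
                           p_base=0.1, temperature=2.0,
                           rng=random.Random(0))
print(token, dist)
```

Because the pool is fixed before the temperature is applied, raising τ in this sketch only flattens the distribution among tokens that already passed the threshold; it cannot bring filtered tokens back in.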
Stability and creativity, "I want it all"
How well does min-p work? The author used Mistral-7B as the base model for testing. Let's look at the results by scenario.
For the reasoning task, the author used the GPQA dataset. At a temperature of 1, min-p has a slight advantage over the earlier top-p method.
As temperature increases, the GPQA score shows a downward trend overall, but it can be observed that min-p decreases significantly slower than top-p.
The downward trend of min-p does not become obvious until the temperature reaches 3, by which point the score of top-p is close to 0.
In other words, compared with top-p, min-p better maintains the stability that reasoning tasks require.
Mathematical tasks also need to maintain stable performance. Here the author used the GSM8K data set for testing.
The result is that the score corresponding to min-p decreases with temperature faster than in GPQA, but still slower than the top-p method.
The third type of task is creative writing. At this time, the requirements for stability are not so high, but the model needs to be more creative.
This test was done using the AlpacaEval dataset, and the experimental data was obtained from an independent evaluator in the open source community.
Experimental results show that under the settings of temperature=1.5 and min-p=0.1, the performance of min-p is particularly outstanding and can generate creative writing content that is difficult to generate with the top-p method.
With these settings, the text produced by the min-p method achieved a human judgment preference rate of 58.12%, which is much higher than the performance of other methods under similar settings.
Paper address:
https://arxiv.org/abs/2407.01082
GitHub:
https://github.com/menhguin/minp_paper/
Reference link:
https://x.com/menhguin/status/1826132708508213629