Doubao large model evaluation results revealed: total score 19% higher than the previous generation, Skylark


王林

Released: 2024-06-06 13:45:41


The Doubao large model was recently released at the Volcano Engine Power Conference. Its ultra-low pricing has driven the trend of price cuts for large models, and its capabilities have also attracted industry attention.

In a Volcano Engine product brief, the Doubao model team published part of its first round of internal test results. Across 11 mainstream public evaluation sets, including MMLU, BBH, GSM8K, and HumanEval, Doubao-pro-4k scored 76.8 points in total, 19% higher than the 64.5 points of the previous-generation model, Skylark2. That score was also higher than those of the other domestic models tested in the same period.
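The 19% figure can be checked with a short calculation. The sketch below assumes the figure is a relative improvement, that is, the score difference divided by the older score; the article does not state the formula, and the function name is illustrative only.

```python
def relative_improvement(new_score: float, old_score: float) -> float:
    """Return the gain of new_score over old_score as a percentage of old_score."""
    return (new_score - old_score) / old_score * 100


# Total scores as reported in the article.
doubao_pro_4k = 76.8
skylark2 = 64.5

gain = relative_improvement(doubao_pro_4k, skylark2)
print(f"Relative improvement: {gain:.1f}%")  # prints "Relative improvement: 19.1%"
```

Under that assumption, the 12.3-point difference is about 19.1% of 64.5, which is consistent with the 19% reported.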

The evaluation was completed in May this year and mainly covered the Doubao general-purpose model pro, Skylark2, and nine domestic large language models. Apart from Skylark2, all of the models were the latest flagship versions released by their vendors and were tested through API calls.


Picture: Doubao model team internal test results

According to the results, on the two evaluation sets that measure coding ability, HumanEval and MBPP, Doubao improved by about 50% over the previous-generation model. On the evaluation sets for professional knowledge and instruction following, it improved by 33% and 24% respectively, and was the highest-scoring domestic model.

Doubao also performed well on mathematical ability, language understanding, and the comprehensive evaluation sets BCMMLU and CEval, with scores ranking in the top three. Across the 11 public evaluation sets, the Doubao general-purpose model pro scored 76.8 points in total. According to test results released by OpenAI, GPT-4 scores 80.1 points in total on the same evaluation sets, so it still holds a lead over domestic models.

The Doubao model was launched only on May 15 and has not yet been included in testing by third-party institutions. Third-party evaluation agencies are expected to publish their results for the model over the next one to two months. The AI assistant "Doubao", which shares its name with the model, has officially announced 26 million monthly active users, and users are free to try and test it.

Previously, Zhiyuan Research Institute published an evaluation report covering 91 language models from around the world. In the subjective evaluation focused on Chinese-language ability, Skylark2 ranked first, and its Chinese ability exceeded that of GPT-4.


Picture: Zhiyuan Research Institute language model evaluation results (models are versions from before April 20)
