
Explore large model technology in the post-GPT 3.0 era and move towards realizing the future of AGI

PHPz
Release: 2023-04-26 10:58:08

ChatGPT surprised or awakened many people when it appeared. I was surprised because I did not expect a Large Language Model (LLM) to be this effective; I was awakened because I suddenly realized that our understanding of LLM, and of the philosophy for developing it, was far from the most advanced thinking in the world. I belong to the group that was both surprised and awakened, and, being inclined to self-reflection like many Chinese researchers, I began to reflect. This article is the result of that reflection.

To be honest, when it comes to domestic LLM-related technology, the gap with the most advanced work has widened further at this point. I think the question of technological leadership or technological gap should be viewed dynamically, from a development perspective. In the one to two years after Bert appeared, domestic work in this area actually caught up quickly, and some good improved models were proposed. The watershed where the gap began to widen was the release of GPT 3.0, around mid-2020. At that time, only a few people realized that GPT 3.0 was not just a specific technology, but actually embodied a development philosophy about where LLM should go. Since then, the gap has widened further and further, and ChatGPT is simply a natural consequence of this difference in development philosophy. Therefore, setting aside the question of whether one has the financial resources to build a very large LLM, I personally believe that, from a purely technical perspective, the gap mainly comes from differences in the understanding of LLM and in the development philosophy of where it should go in the future.

It is a fact that domestic work is falling further and further behind foreign technology, and there is no point denying it. A while ago, many people online worried that domestic AI is now at a "critical stage of survival". I do not think it is that serious. After all, is OpenAI the only company in the world with such forward-looking vision? Even Google's understanding of LLM development philosophy is clearly behind OpenAI's. The reality is simply that OpenAI has performed too well and left everyone behind, not just the domestic players.

My view is that, in terms of LLM concepts and related technologies, OpenAI is roughly half a year to a year ahead of Google and DeepMind, and about two years ahead of domestic work. When it comes to LLM, the tiers feel very clear to me. Google should be in second place; the works that best reflect its technical vision are PaLM and Pathways, launched between February and April 2022. During the same period, OpenAI launched InstructGPT, and from that alone you can see the gap between Google and OpenAI. As for why I say this, you will probably understand after reading the rest of this article. DeepMind's previous focus had been on reinforcement learning for games and on AI for science; it entered the LLM field quite late, probably only starting to pay attention to this direction in 2021, and is currently in a state of catching up. Meta has paid even less attention to LLM, and it now also feels like it is trying to catch up. These are already the institutions doing the best work at present, so what can be said about domestic efforts? It feels excusable. As for OpenAI's philosophy regarding LLM, I will describe my understanding in the last part of this article.

This article summarizes the mainstream LLM technologies since the emergence of GPT 3.0. For the mainstream technologies before that, you can refer to "PTM riding the wind and waves: an in-depth interpretation of the progress of pre-training models".

I believe that after reading these two articles you will have a clearer picture of the technical landscape of the LLM field, of the different development philosophies that have appeared in the course of LLM's evolution, and even of possible future trends. Of course, much of what follows is my personal opinion and is highly subjective; errors and omissions are inevitable, so please read it critically.

This article attempts to answer some of the following questions: Has ChatGPT brought about a research paradigm shift in NLP, and even in AI more broadly? If so, what impact will it have? What does an LLM learn from massive amounts of data, and how does it access this knowledge? As the scale of LLM gradually increases, what happens? What is In Context Learning, and why is it a mysterious technique? What is its relationship to Instruct? Does LLM have reasoning capabilities? How does Chain of Thought (CoT) work? And so on. I believe that after reading this article you will have an answer to these questions.

First, before discussing the current state of LLM technology, let me describe at a macro level the research paradigm shift as I see it. That way we can "see the forest before the trees" and understand more clearly why specific technologies have changed in the way they have.

Top of the Trend: Transformation of NLP Research Paradigm

If we extend the timeline further back to the deep learning era of NLP and observe technological changes and their impact within a longer time window, some of the key turning points may be easier to see. I personally believe that during the technological development of NLP over the past decade, there have been two major research paradigm shifts.

Paradigm Shift 1.0: From Deep Learning to Two-stage Pre-trained Model

The time range covered by this paradigm shift is roughly from the introduction of deep learning into NLP (around 2013) to just before the emergence of GPT 3.0 (around May 2020).

Before the emergence of the Bert and GPT models, the popular technology in NLP was deep learning, which in this field mainly relied on the following key techniques: heavily improved LSTM models, and a smaller number of improved CNN models, as the typical feature extractors; and Sequence-to-Sequence (encoder-decoder) plus Attention as the typical overall technical framework for various specific tasks.

With the support of these core technologies, the main research goal of deep learning in NLP, summarized in one sentence, was how to effectively increase model depth or model parameter capacity; that is, how to keep adding deeper LSTM or CNN layers to the encoder and decoder. Although these efforts did continuously increase model depth, overall they were not very successful in terms of the effect on specific tasks: compared with non-deep-learning methods, the advantages they brought were not large.

Why was deep learning not successful enough? I think there are mainly two reasons. On the one hand, the total amount of training data for a specific task was limited. As model capacity increases, it needs to be supported by a larger amount of training data; otherwise, even if you can increase the depth, the task performance will not follow. Before the emergence of pre-trained models this was an obvious and serious problem in NLP research. On the other hand, the LSTM/CNN feature extractors did not have strong enough expressive power, which means that no matter how much data you provide, it is of little use because the model cannot effectively absorb the knowledge contained in the data. It was mainly these two reasons that prevented deep learning from making a breakthrough in NLP.

The emergence of the two pre-trained models, Bert and GPT, represents a technological leap in NLP, both from the perspective of academic research and of industrial application, and it brought a paradigm shift to the entire research field. The impact of this shift is reflected in two aspects: first, the decline and even gradual demise of some NLP research subfields; second, the technical methods and frameworks of different NLP subfields became increasingly unified, and within roughly a year of Bert's appearance the technology stack had basically converged into two technical modes. Let me discuss these two points separately.

Impact 1: The demise of intermediate tasks

NLP is an umbrella term for a broad research field containing many specific subfields and tasks. If analyzed carefully, these tasks can be divided into two major categories from the perspective of their nature: one type can be called "intermediate tasks" and the other "final tasks".

Typical intermediate tasks include Chinese word segmentation, part-of-speech tagging, NER, syntactic analysis, coreference resolution, semantic parsing, and so on. Such tasks generally do not address an actual application need; most of them exist as an intermediate or auxiliary stage for tasks that do. For example, almost no user would say, "I want a syntactic parser to show me the syntax tree of this sentence." The user does not need to see the results of these intermediate NLP stages; he only cares whether you did a good job on a specific application task. "Final tasks", of which there are many, include text classification, text similarity calculation, machine translation, text summarization, and so on. Their characteristic is that each subfield addresses an actual need, and the task results can basically be presented to the user directly. For example, the user genuinely has the need to give you an English sentence and be told what it means in Chinese.

Logically speaking, "intermediate tasks" should not need to exist; the fact that they do is a reflection of the low level of NLP development at the time. In the early stages, because the available technology was relatively backward, it was difficult to complete difficult final tasks in one step. Take machine translation: in those days it was very hard to do machine translation well, so researchers divided and conquered, decomposing the difficult problem into intermediate stages such as word segmentation, part-of-speech tagging, and syntactic analysis, completing each stage first and then assembling them to accomplish the final task; there was no better option.

But since the emergence of Bert/GPT, there is actually no need to do these intermediate tasks, because through pre-training on a large amount of data, Bert/GPT has absorbed these intermediate tasks as linguistic features into the parameters of the Transformer. At that point we can solve the final tasks directly, end to end, without explicitly modeling the intermediate process. Perhaps the most contested case is Chinese word segmentation, but the principle is the same: you do not need to worry about which characters should form a word; just let the LLM learn it as a feature. As long as it helps to solve the task, the model will naturally learn a reasonable segmentation, which may not necessarily match the segmentation rules we humans use.

Based on the above understanding, as soon as Bert/GPT appeared, one could in fact have concluded that this type of intermediate NLP task would gradually withdraw from the stage of history.

Impact 2: Unification of technical routes in different research directions

Before explaining the specific impact, let us first discuss another way of dividing NLP tasks, which helps in understanding what follows. If "final tasks" are further classified, they can be roughly divided into two types: natural language understanding tasks and natural language generation tasks. Excluding the "intermediate tasks", typical natural language understanding tasks include text classification, sentence relationship judgment, sentiment classification, and so on. These are essentially classification tasks: the input is a sentence (or article), or a pair of sentences; the model considers all of the input and finally outputs a judgment of which category it belongs to. Natural language generation also covers many research subdirections, such as chatbots, machine translation, text summarization, and question answering systems. The characteristic of generation tasks is that, given input text, the model must produce a string of output text. The difference between the two types is mainly reflected in the form of the input and output.

Since the birth of the Bert/GPT models, there has been an obvious trend toward technical unification. First, the feature extractors of different NLP subfields have gradually converged from LSTM/CNN to Transformer. In fact, soon after Bert was published, one should have realized that this would inevitably become a trend. As for why, I explained and analyzed it in an article I wrote a few years ago, "Zhang Junlin: Give up illusions and fully embrace Transformer: a comparison of three major feature extractors (CNN/RNN/TF) for natural language processing". Interested readers can refer to it.

Article link: https://zhuanlan.zhihu.com/p/54743941

Moreover, Transformer is now not only unifying many fields of NLP, but is also gradually replacing other models, such as the CNNs widely used in image processing tasks; likewise, multimodal models currently all basically use the Transformer. This trend of Transformer starting from NLP and gradually unifying more and more fields of AI began with the Vision Transformer (ViT) that appeared at the end of 2020; it has flourished since then, has already been a great success, and continues to expand into more fields with ever-increasing momentum.

Second, the research and development mode in most NLP subfields switched to a two-stage model: a model pre-training stage followed by either application fine-tuning or a Zero/Few Shot Prompt mode. More precisely, the various NLP tasks converged onto two different pre-training frameworks: for natural language understanding tasks, the technical system unified around the "bidirectional language model pre-training + application fine-tuning" mode represented by Bert; for natural language generation tasks, it unified around the "autoregressive language model (i.e., a left-to-right unidirectional language model) + Zero/Few Shot Prompt" mode represented by GPT 2.0. As for why it split into these two technical routes, that was inevitable, and we will explain it later.
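To make the contrast concrete, below is a minimal sketch of the two usage modes, written against the Hugging Face transformers API; the model names and the tiny prompt are placeholders of my own, not taken from the article.

```python
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

# (1) Bert mode: bidirectional pre-training + task-specific fine-tuning.
#     A classification head is attached, and all parameters are then updated
#     on labeled downstream data with an ordinary supervised training loop.
clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                         num_labels=2)
# ... fine-tune `clf` on the task's training set here ...

# (2) GPT mode: autoregressive pre-training + zero/few-shot prompting.
#     No parameters change; the task is expressed entirely in the prompt text.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = ("Review: 'A wonderful film.' Sentiment: positive\n"
          "Review: 'Terribly boring.' Sentiment:")
out = lm.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=2)
print(tok.decode(out[0]))
```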

These two models may seem similar, but they contain very different development ideas and will lead to different future development directions. Unfortunately, most of us underestimated the potential of GPT as a development route at that time, and focused our vision on models like Bert.

Paradigm Shift 2.0: From pre-trained models to general artificial intelligence (AGI, Artificial General Intelligence)

The time range covered by this paradigm shift is roughly from the emergence of GPT 3.0 (around June 2020) until now; we should still be in the middle of this shift.

ChatGPT is the key node that triggers this paradigm shift, but before the emergence of InstructGPT, LLM was actually in a transition period before this paradigm shift.

Transition period: the "autoregressive language model + Prompting" mode represented by GPT 3.0 occupies the dominant position

As mentioned above, in the early days of pre-trained model development, the technical framework converged into two different paradigms, the Bert mode and the GPT mode, and people were generally more optimistic about the Bert mode; quite a few subsequent technical improvements followed Bert's path. However, as technology continued to develop, you will find that the largest LLM models at present are almost all based on the "autoregressive language model + Prompting" mode similar to GPT 3.0, such as GPT 3, PaLM, GLaM, Gopher, Chinchilla, MT-NLG, LaMDA, and so on, without exception. Why is this so? There must be some inevitability behind it, and I think it is mainly due to two reasons.

[Figure: the T5 model expresses natural language understanding tasks and generation tasks in the same text-to-text form]

First of all, Google's T5 model formally unifies the external form of natural language understanding and natural language generation tasks. As shown in the figure above, what is marked in red is a text classification problem, and what is marked in yellow is a regression or classification problem of judging sentence similarity; both are typical natural language understanding problems. In the T5 model, these understanding problems share the same input-output form as the generation problems. In other words, a classification problem can be converted into having the LLM generate the string of the corresponding category, so that understanding and generation tasks are expressed in a completely unified form.

This shows that the natural language generation task can subsume the natural language understanding task in terms of expression; the other way around would be difficult. The advantage is that the same generative LLM can solve almost all NLP problems, whereas if the Bert mode is adopted, the model cannot handle generation tasks well. Given this, there is one obvious reason why we would tend to use generative models.
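For example, in the T5 text-to-text framing, both understanding and generation tasks reduce to mapping an input string to an output string. The toy examples below are my own illustration (the task prefixes follow T5's conventions, but the sentences and scores are made up):

```python
# Every task, whether "understanding" or "generation", becomes string -> string.
examples = [
    # understanding: sentiment classification, answered by generating a label string
    ("sst2 sentence: This movie was a delightful surprise.", "positive"),
    # understanding: sentence similarity, answered by generating a score string
    ("stsb sentence1: A man is playing guitar. sentence2: Someone plays an instrument.",
     "3.8"),
    # generation: machine translation
    ("translate English to German: The house is wonderful.", "Das Haus ist wunderbar."),
]

for source, target in examples:
    print(f"{source!r}  ->  {target!r}")
```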

The second reason is that if you want to do tasks well with zero shot prompting or few shot prompting, you must adopt the GPT mode. Studies (reference: On the Role of Bidirectionality in Language Model Pre-Training) have shown that if downstream tasks are solved with fine-tuning, the Bert mode is better than the GPT mode, but if zero shot/few shot prompting is used to solve downstream tasks, the GPT mode is better than the Bert mode. This shows that it is easier for generative models to do tasks in the zero shot/few shot prompting mode, and that the Bert mode has a natural disadvantage for this way of doing tasks.

But here comes the question: Why do we pursue zero shot/few shot prompting to do tasks? To explain this problem clearly, we first need to clarify another question: What kind of LLM model is the most ideal for us?

[Figure: what an ideal LLM should look like, including the interface layer between humans and the LLM]

The picture above shows what an ideal LLM should look like. First, the LLM should have strong autonomous learning capabilities. Suppose we feed it all the different types of data available in the world, such as text or images: it should be able to automatically learn all the knowledge points contained in them, with a learning process that requires no human intervention, and it should be able to flexibly apply what it has learned to solve practical problems. Because the data is massive, absorbing all of this knowledge requires a very large number of model parameters to store it, so this model will inevitably be a giant model.

Second, the LLM should be able to solve problems in any subfield of NLP, not just a limited set of fields; ideally it should even be able to respond well to questions from fields outside NLP.

Furthermore, when we use LLM to solve problems in a specific field, we should use the expressions we are accustomed to as humans, that is to say, LLM should understand human commands. This reflects letting LLM adapt to people, rather than the other way around, letting people adapt to the LLM model. Typical examples of people adapting to LLM are racking their brains to try various prompts in an attempt to find good prompts that can best solve the problem at hand. Regarding this point, the above figure gives a few examples at the interface layer where humans interact with LLM to illustrate what is a good interface form for people to use the LLM model.

Having seen what the ideal LLM looks like, let us go back and answer the remaining question above: why should we pursue zero shot/few shot prompting to complete tasks? There are two reasons.

First, this LLM model must be very large, and only very few institutions will be able to build it or modify its parameters. The parties who need tasks done are thousands of small and medium-sized organizations or even individuals; even if the model is open-sourced, they will be unable to deploy it, let alone use the fine-tuning mode to modify its parameters. Therefore, we should pursue a way that allows the task demander to complete tasks without modifying the model parameters, that is, the prompt mode should be used instead of fine-tuning (from this it can be seen that the technical direction of soft prompting goes against this development trend). The model maker turns the LLM into a public service, running it in the LLM-as-a-Service mode. As the service provider, and considering the ever-changing needs of users, the LLM maker must pursue the goal of enabling the LLM to complete as many types of tasks as possible. This is a natural consequence, and it is also a practical reason why super-large models will inevitably pursue AGI.

Second, zero shot prompting, few shot prompting, and even chain of thought (CoT, Chain of Thought) prompting that promotes LLM reasoning ability are all existing technologies at the interface layer in the figure above. Specifically, the original intention of zero shot prompting was in fact the ideal interface between humans and LLM: directly use the task expressions that humans are accustomed to and let the LLM do things. But it was found that the LLM could not understand them well, and the results were poor. With continued research, it was discovered that for a given task, if we give the LLM a few examples and use these examples to represent the task description, the effect is better than zero shot prompting, so everyone began studying better few shot prompting techniques. In other words, we originally hoped that the LLM could perform a task when given the commands humans commonly use, but since current technology cannot do that, we settled for second best and used these alternative techniques to express human task requirements.
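As a concrete illustration (my own toy example, not from the article), the difference between zero shot and few shot prompting lies only in the prompt text; in neither case are any model parameters changed:

```python
# Zero shot: the task is described directly, in the way a human would phrase it.
zero_shot_prompt = (
    "Translate the following sentence from Chinese to English:\n"
    "今天天气很好 ->"
)

# Few shot (In Context Learning): a handful of examples stand in for the task
# description, which current LLMs tend to follow more reliably.
few_shot_prompt = (
    "我喜欢读书 -> I like reading.\n"
    "他正在工作 -> He is working.\n"
    "今天天气很好 ->"
)

# Both strings would simply be fed to the LLM as plain input text.
```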

If you understand the above logic, it is easy to draw the following conclusion: few shot prompting (also called In Context Learning) is only a transitional technique. If a task can be described more naturally and the LLM can understand it, then we will definitely abandon these transitional techniques without hesitation, for the obvious reason that using them to describe task requirements is not in line with human habits.

This is also why I listed GPT 3.0 + Prompting as a transitional technology. The emergence of ChatGPT changed this status quo by replacing Prompting with Instruct, thereby bringing about a new paradigm shift with several subsequent effects.

Impact 1: A new interactive interface that adapts LLM to people

In the context of the ideal LLM, let us look at ChatGPT to better understand its technical contribution. Among all existing technologies, ChatGPT should be the one closest to the ideal LLM. If I had to summarize ChatGPT's most prominent features, I would use two phrases: "powerful" and "considerate".

"Powerful capabilities" I believe this should be mainly attributed to the foundation LLM GPT3.5 on which ChatGPT is based. Although ChatGPT has added manually annotated data, it is only in the tens of thousands. Compared with the hundreds of billions of token-level data used to train the GPT 3.5 model, this amount of data contains less world knowledge (facts contained in the data). and common sense) can be described as a drop in the ocean, almost negligible, and will basically not play any role in enhancing the basic capabilities of GPT 3.5. Therefore, its powerful functions should mainly come from GPT 3.5 hidden behind it. GPT 3.5 benchmarks the giant model among the ideal LLM models.

So does ChatGPT inject new knowledge into the GPT 3.5 model? It does: knowledge contained in the tens of thousands of manually labeled examples. However, what is injected is not world knowledge but knowledge of human preferences. "Human preference" here has several meanings. First, it covers the customary ways in which humans express a task. For example, people are used to saying "translate the following sentence from Chinese to English" to express a need for machine translation; but an LLM is not human, so how would it understand what this sentence means? You have to find a way to make the LLM understand the meaning of such commands and execute them correctly. ChatGPT injects this kind of knowledge into GPT 3.5 through manually annotated data, making it easier for the LLM to understand human commands; this is the key to its being "considerate". Second, humans have their own standards for what counts as a good or a bad answer: for example, a detailed answer is good, an answer with discriminatory content is bad, and so on. This is the human preference over the quality of answers, and the data that people feed back to the LLM through the Reward Model carries this kind of information. Overall, ChatGPT injects human-preference knowledge into GPT 3.5 and thereby obtains an LLM that understands human speech and is more polite.

It can be seen that ChatGPT's biggest contribution is that it basically realizes the interface layer of the ideal LLM, letting the LLM adapt to people's habitual ways of expressing commands rather than forcing people to adapt to the LLM by racking their brains to come up with a command that happens to work (which is what prompt engineering did before Instruct technology appeared). This increases the ease of use and the user experience of LLM. It was InstructGPT/ChatGPT that first recognized this problem and provided a good solution, which is also its greatest technical contribution. Compared with few shot prompting, it is a human-computer interface technology far more in line with human habits of expression.

This will surely inspire subsequent LLM models and continue to do further work on easy-to-use human-machine interfaces to make LLM more obedient.

Impact 2: Many NLP subfields no longer have independent research value

Within NLP, this paradigm shift means that many NLP research fields that currently exist independently will be absorbed into the LLM technology system, cease to exist independently, and gradually disappear. After the first paradigm shift, although many "intermediate tasks" no longer needed to continue as independent research fields, most of the "final tasks" still existed as independent fields, merely switched to the "pre-training + fine-tuning" framework, under which new improvements kept being proposed for problems specific to each field.

Current research shows that for many NLP tasks, performance improves greatly as the LLM model grows. Based on this, I think the following inference can be drawn: most of the so-called "unique" problems of a given field are most likely just an external appearance caused by a lack of domain knowledge. With enough domain knowledge, these supposedly field-specific problems can be solved very well, and there is actually no need to focus on one field's problems and labor over dedicated solutions. Perhaps the truth about AGI is surprisingly simple: just give the LLM more data from the field and let it learn more on its own.

In this context, and given that ChatGPT proves we can now directly pursue the ideal LLM, the future technology trend should be to pursue ever-larger LLM models and, by increasing the diversity of pre-training data, cover more and more fields. The LLM autonomously learns domain knowledge from domain data during pre-training, and as the model scale keeps growing, many problems get solved. The research focus will be on how to build this ideal LLM, rather than on solving the specific problems of a particular field. In this way, more and more subfields of NLP will be absorbed into the LLM technical system and gradually disappear.

In my opinion, to judge whether independent research in a specific field should stop, the criterion can be either of the following two. First, for a given task, does LLM's performance exceed human performance? For research fields where it does, there is no need for independent research. For example, on many tasks in the GLUE and SuperGLUE test sets, LLM performance currently exceeds human performance, so the research fields tied closely to these datasets have no real need to continue existing independently. Second, compare the task performance of two modes: the first is fine-tuning with larger amounts of domain-specific data, the second is few-shot prompting or instruct-based methods. If the second mode reaches or exceeds the first, the field no longer needs to exist independently. By this standard, fine-tuning still dominates in many research fields (because such fields have large amounts of supervised training data), so they appear able to exist independently. However, considering that for many tasks the effectiveness of few shot prompting keeps growing with model scale, this inflection point is likely to be reached in the short term as larger models emerge.

If the above speculation is true, it implies the following cruel fact: many NLP researchers will face the choice of where to go. Should they continue to work on problems unique to their field, or abandon this seemingly unpromising path and instead build better LLMs? And if they choose to turn to building LLMs, which institutions actually have the ability and the conditions to do so? What would your answer be?

Impact 3: More research fields outside NLP will be included in the LLM technology system

From the perspective of AGI, and referring to the ideal LLM described earlier, the tasks it can complete should not be limited to NLP or to one or two subject areas; the ideal LLM should be a domain-independent general artificial intelligence model. The fact that it is currently good at one or two areas does not mean it can only do those tasks. The emergence of ChatGPT proves that pursuing AGI in this period is feasible, and now is the time to put aside the shackles of "disciplinary field" thinking.

In addition to demonstrating the ability to solve various NLP tasks in a fluent conversational format, ChatGPT also has strong coding capabilities. It is natural that more and more other research fields will be gradually included in the LLM system and become part of general artificial intelligence.

[Figure: conceptual architecture of an LLM serving as a general-purpose interface with multimodal input and output]

When LLM expands beyond NLP, a natural choice is image processing and multimodal tasks. There are already some efforts to integrate multimodality and make the LLM a universal human-machine interface supporting multimodal input and output; typical examples include DeepMind's Flamingo and Microsoft's "Language Models are General-Purpose Interfaces". The figure above shows the conceptual structure of this approach.

My judgment is that, whether for images or multimodality, integration into LLM to become a genuinely useful capability may be slower than we think. The main reason is that although the image field has been imitating Bert's pre-training approach over the past two years, trying to introduce self-supervised learning to release the model's ability to learn knowledge autonomously from image data (typical techniques being "contrastive learning" and MAE, two different technical routes), current results suggest that, despite great technical progress, this road has not yet been fully opened. This is reflected in the fact that applying pre-trained models in the image field to downstream tasks brings far smaller benefits than applying Bert or GPT to downstream NLP tasks. Therefore, image pre-training models still need deeper exploration to unlock the potential of image data, which will delay their unification into large LLM models. Of course, if this road is opened one day, there is a high probability that the current situation in NLP will repeat itself: the various research subfields of image processing may gradually disappear and be integrated into large LLMs that directly complete the end tasks.

In addition to images and multi-modality, it is obvious that other fields will gradually be included in the ideal LLM. This direction is in the ascendant and is a high-value research topic.

The above are my personal thoughts on the paradigm shift. Next, let us sort out the mainstream technical progress of LLM models after GPT 3.0. As shown in the ideal LLM diagram, the related technologies can be divided into two major categories. One concerns how the LLM model absorbs knowledge from data, and also includes the impact of model scale growth on the LLM's ability to absorb knowledge. The second category is the human-computer interface, that is, how people use the LLM's inherent capabilities to solve tasks, including In Context Learning and the Instruct mode. Chain of Thought (CoT) prompting, an LLM reasoning technique, essentially belongs to In Context Learning, but because these topics are more important, I will discuss them separately.

Learner: From Endless Data to Massive Knowledge

Judging from the current research results, Transformer is a powerful enough feature extractor and does not require special improvements. So what did Transformer learn through the pre-training process? How is knowledge accessed? How do we correct incorrect knowledge? This section describes the research progress in this area.

The road to knowledge: What knowledge has LLM learned

LLM has learned a lot of knowledge from massive free texts. If we roughly classify this knowledge, it can be divided into two categories: language knowledge and world knowledge.

Language knowledge refers to lexical, part-of-speech, syntactic, semantic and other knowledge that helps humans or machines understand natural language. There is a long history of research on whether LLM can capture linguistic knowledge; since the emergence of Bert, relevant research has continued, and conclusions were drawn very early. Various experiments have fully proven that LLM can learn linguistic knowledge at various levels, and this is one of the most important reasons why various natural language understanding tasks achieved significant performance improvements after pre-trained models were adopted. In addition, various studies have also shown that shallow linguistic knowledge such as morphology, parts of speech and syntax is stored in the lower and middle layers of the Transformer, while abstract linguistic knowledge such as semantics is widely distributed in the middle and upper layers.

World knowledge refers to real events that happen in this world (factual knowledge) and to common sense knowledge. For example, "Biden is the current President of the United States", "Biden is an American", and "Ukrainian President Zelensky met with U.S. President Biden" are factual knowledge related to Biden, while "people have two eyes" and "the sun rises in the east" are common sense knowledge. There are many studies on whether the LLM model can learn world knowledge, and the conclusions are relatively consistent: LLM does absorb a large amount of world knowledge from the training data, and this knowledge is mainly distributed in the middle and upper layers of the Transformer, especially concentrated in the middle layers. Moreover, as the depth of the Transformer model increases, the amount of knowledge it can learn grows rapidly (refer to: BERTnesia: Investigating the capture and forgetting of knowledge in BERT). In fact, you can regard the LLM as an implicit knowledge graph expressed in the model parameters; I think there is no problem with understanding it this way.

"When Do You Need Billions of Words of Pre-training Data?" This article studies the relationship between the amount of knowledge learned by the pre-training model and the amount of training data. Its conclusion is: for Bert type For language models, you can learn linguistic knowledge such as syntax and semantics with only 10 million to 100 million words of corpus, but to learn factual knowledge, you need more training data. This conclusion is actually expected. After all, linguistic knowledge is relatively limited and static, while factual knowledge is huge and in a constant process of change. Current research has proven that as the amount of training data increases, the pre-trained model performs better in various downstream tasks, which shows that what is learned from the incremental training data is mainly world knowledge.

Memory place: How LLM accesses knowledge

From the above we can see that LLM has indeed learned a great deal of linguistic and world knowledge from data. So, for a specific piece of knowledge, where does the LLM store it, and how is it extracted? This is also an interesting question.

Obviously, the knowledge must be stored in the model parameters of the Transformer. From the structure of the Transformer, the model parameters consist of two parts: the multi-head attention (MHA) part accounts for roughly one third of the total parameters, and two thirds of the parameters are concentrated in the FFN structure. MHA is mainly used to compute the correlation strength between words or pieces of knowledge and to integrate global information; it is more likely to establish connections between pieces of knowledge and probably does not store specific knowledge points itself. It is therefore easy to infer that the body of knowledge in the LLM model is stored in the FFN structure of the Transformer.
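A quick back-of-the-envelope calculation shows where the one-third/two-thirds split comes from; the sketch below ignores embeddings, biases and LayerNorm and assumes the standard 4x FFN expansion:

```python
# Parameter count of one Transformer block (rough, illustrative numbers only).
d_model = 1024

mha_params = 4 * d_model * d_model        # W_Q, W_K, W_V, W_O projection matrices
ffn_params = 2 * d_model * (4 * d_model)  # two linear maps: d -> 4d and 4d -> d

total = mha_params + ffn_params
print(mha_params / total)  # ~0.33 -> MHA holds about one third of the parameters
print(ffn_params / total)  # ~0.67 -> FFN holds about two thirds
```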

[Figure: the FFN layer of the Transformer viewed as a Key-Value memory (left: original figure from the paper; right: annotated version)]

However, the granularity of this localization is still too coarse, and it cannot well answer how a specific piece of knowledge is stored and retrieved. For example, the piece of knowledge "the capital of China is Beijing" can be expressed as the triple <China, is-capital-of, Beijing>, where "is-capital-of" represents the relationship between the entities. Where is this piece of knowledge stored in the LLM?

"Transformer Feed-Forward Layers Are Key-Value Memories" gives a relatively novel perspective, which regards the Transformer's FFN as a Key-Value memory that stores a large amount of specific knowledge. . As shown in the figure above (the left side of the figure is the original paper figure, which is actually not easy to understand, you can look at the annotated right figure for a better understanding), the first layer of FFN is an MLP wide hidden layer, which is the Key layer; The second layer is the narrow hidden layer of MLP and is the Value layer. The input layer of FFN is actually the output Embedding of the MHA corresponding to a certain word, which is the Embedding that integrates the input context related to the entire sentence through Self Attention, which represents the overall information of the entire input sentence.

Each neuron node in the Key layer records a pair of <Key, Value> pieces of information. For example, node i in the first hidden layer of the FFN in the figure above may record the piece of knowledge <China, is-capital-of, Beijing>. The key vector corresponding to node i is the weight vector connecting the input layer to node i; the corresponding value vector is the weight vector connecting node i to each node of the FFN's second (Value) layer. The key vector of each neuron is used to identify a certain language or knowledge pattern in the input: it is a pattern detector. If the input contains a pattern the node is looking for, the input vector and the node's key weight vector are combined by an inner product followed by Relu, producing a large numerical response, which means the node has detected this pattern. This response value is then propagated to the second layer of the FFN through the node's value weight vector; this is equivalent to weighting the value vector by the response value and passing it on to the output of the second (Value) layer. Seen this way, the forward pass of the FFN looks like detecting a certain knowledge pattern through the Key, retrieving the corresponding Value, and reflecting that Value in the FFN's second-layer output. Of course, each node in the second layer of the FFN collects information from all nodes in the Key layer, so what it holds is a mixed response, and the mixed response over all nodes of the Value layer can be interpreted as probability distribution information for the output word.

It may still sound complicated, so let us use an extreme example to illustrate. Assume that a particular node in the figure above is the Key-Value memory recording the knowledge "the capital of China is Beijing": its key vector detects the knowledge pattern "The capital of China is...", and its value vector essentially stores a vector close to the embedding of the word "Beijing". When the Transformer's input is "The capital of China is [Mask]", this node detects the knowledge pattern from the input layer and produces a large response. If we assume the other neurons in the Key layer have no response to this input, then the nodes in the Value layer effectively receive only the word embedding of "Beijing" carried by this node's value vector, amplified by its large response value. As a result, the output at the [Mask] position naturally produces the word "Beijing". It looks complicated, but the process is actually very simple.
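The whole mechanism can be written in a few lines. The sketch below is my own minimal illustration of reading an FFN forward pass as a Key-Value memory, with made-up sizes; it is not code from the cited paper:

```python
import torch
import torch.nn.functional as F

d_model, n_mem = 16, 4                 # hidden size, number of "memory" neurons
W_key = torch.randn(n_mem, d_model)    # row i = key vector of neuron i (pattern detector)
W_value = torch.randn(n_mem, d_model)  # row i = value vector of neuron i

x = torch.randn(d_model)               # MHA output embedding for one position,
                                       # e.g. for "The capital of China is [Mask]"

# 1) each key neuron scores how strongly its pattern appears in the input
scores = F.relu(W_key @ x)             # shape (n_mem,); a large score = pattern detected

# 2) firing neurons write their value vectors, weighted by their scores
ffn_out = scores @ W_value             # shape (d_model,); a weighted sum of value vectors

# If only neuron i fires, ffn_out is essentially scores[i] * W_value[i],
# i.e. roughly the stored answer embedding (e.g. the embedding of "Beijing").
```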

The paper also points out that the lower layers of the Transformer respond to surface patterns of the sentence while the higher layers respond to semantic patterns. That is, lower-layer FFNs store surface knowledge such as lexical and syntactic information, while the middle and upper layers store semantic and factual knowledge, which is consistent with other research conclusions.

I would guess that the idea of treating the FFN as a Key-Value memory is probably not the final correct answer, but it is probably not too far from it.

Knowledge Correction Fluid: How to correct the knowledge stored in LLM

Since we know that a specific piece of world knowledge is stored in the parameters of one or a few FFN nodes, another question naturally arises: can we correct erroneous or outdated knowledge stored in the LLM model? Take the question "Who is the current Prime Minister of the United Kingdom?" Given how frequently British Prime Ministers have changed in recent years, do you think the LLM is more inclined to output "Boris" or "Sunak"? Obviously there will be more training data containing "Boris", so it is very likely that the LLM will give an outdated answer; hence we need a way to correct outdated knowledge stored in the LLM.

If we summarize, there are currently three different methods to modify the knowledge contained in LLM:

The first category of methods corrects knowledge at the source, in the training data. The research goal of "Towards Tracing Factual Knowledge in Language Models Back to the Training Data" is: for a specified piece of knowledge, can we locate which training data caused the LLM to learn it? The answer is yes, which means we can trace back the source training data for a given piece of knowledge. Using this technique, if we want to delete a certain piece of knowledge, we can first locate its corresponding data sources, delete them, and then re-pre-train the whole LLM model, thereby deleting the related knowledge from the LLM. But there is a problem: correcting even a small amount of knowledge requires re-training the model, which is obviously too costly. So this category of method does not have much of a future; it may be more suitable for one-time, large-scale deletion of a particular category of data, such as removing bias or other toxic content, rather than for routine correction of small amounts of knowledge.

The second category of methods fine-tunes the LLM model to correct knowledge. An intuitive approach is to construct training data containing the new knowledge we want the model to adopt, and then fine-tune the LLM on this data, thereby guiding the LLM to remember the new knowledge and forget the old. This method is simple and intuitive, but it also has problems. First, it brings the problem of catastrophic forgetting: besides forgetting the knowledge that should be forgotten, the model also forgets knowledge that should not be forgotten, leading to a drop in performance on some downstream tasks. In addition, because current LLM models are very large, frequent fine-tuning is actually quite costly. Those interested in this method can refer to "Modifying Memories in Transformer Models".

The third category of methods directly modifies the model parameters in the LLM that correspond to a certain piece of knowledge. Suppose we want to revise the old knowledge <UK, current Prime Minister, Boris> into the new knowledge <UK, current Prime Minister, Sunak>. First, we find a way to locate the FFN nodes that store the old knowledge in the LLM parameters, and then forcibly adjust the corresponding FFN parameters to replace the old knowledge with the new. It can be seen that this approach involves two key techniques: first, how to locate the specific storage position of a piece of knowledge in the LLM parameter space; second, how to correct the model parameters so as to change old knowledge into new. For details on this type of technique, see "Locating and Editing Factual Associations in GPT" and "Mass-Editing Memory in a Transformer". Understanding this process of revising LLM knowledge is actually very helpful for a deeper understanding of the internal working mechanisms of LLM.
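The linear-algebra intuition behind this third approach can be sketched as a rank-one update to an FFN weight matrix: after the update, the key activation that used to retrieve the old value now retrieves the new one. This is only a toy sketch of the idea with made-up shapes, not the actual algorithm from the cited papers:

```python
import torch

d_key, d_val = 4096, 1024
W = torch.randn(d_val, d_key)   # second FFN layer: maps key activations to values

k = torch.randn(d_key)          # key activation triggered by the knowledge pattern
v_new = torch.randn(d_val)      # desired new value (e.g. encodes the updated fact)

# Rank-one correction: after the edit, W_edited @ k == v_new, while directions
# orthogonal to k are disturbed as little as possible.
delta = torch.outer(v_new - W @ k, k) / (k @ k)
W_edited = W + delta

print(torch.allclose(W_edited @ k, v_new, atol=1e-3))  # True: old value replaced
```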

Scale effect: What happens when LLM gets bigger and bigger

We know that in recent years the scale of LLM models has been growing rapidly; most of the currently best-performing LLM models have more than 100 billion (100B) parameters. For example, OpenAI's GPT 3 has 175B, Google's LaMDA 137B, PaLM 540B, DeepMind's Gopher 280B, and so on. There are also Chinese giant models, such as Zhiyuan's GLM at 130B, Huawei's "Pangu" at 200B, Baidu's "Wenxin" at 260B, and Inspur's "Yuan 1.0" at 245B. So a natural question is: what happens as the scale of LLM models continues to grow?

The application of pre-trained models is usually divided into two stages: the pre-training stage and the specific application stage. In the pre-training stage, the optimization objective is cross entropy; for autoregressive language models such as GPT, this measures whether the LLM predicts the next word correctly. In the application stage, what matters are the evaluation metrics of the specific scenario. Our intuition is that if an LLM performs better in the pre-training stage, its ability to solve downstream tasks will naturally be stronger. However, this is not entirely true. Existing research has shown that the pre-training optimization metric is positively correlated with downstream task performance, but the correlation is not perfect. In other words, looking only at pre-training metrics is not enough to judge whether an LLM model is good enough. Therefore, we will look at these two stages separately to see what impact growing model size has.
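For reference, the pre-training objective mentioned here is simply next-token cross entropy; a minimal PyTorch sketch with toy shapes looks like this:

```python
import torch
import torch.nn.functional as F

vocab = 32000
logits = torch.randn(2, 7, vocab)         # (batch, positions 0..6): predictions from an LM
tokens = torch.randint(0, vocab, (2, 8))  # (batch, positions 0..7): actual token ids

# position t predicts token t+1, so the targets are the tokens shifted by one
loss = F.cross_entropy(logits.reshape(-1, vocab),
                       tokens[:, 1:].reshape(-1))
print(loss)  # lower loss = the model predicts the next word more accurately
```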

[Figure: scaling laws - test loss falls as training data, model parameters, and training compute increase]

First, let us look at what happens in the pre-training stage as model scale gradually increases. OpenAI studied this issue specifically in "Scaling Laws for Neural Language Models" and proposed the "scaling laws" that LLM models follow. As shown in the figure above, the study demonstrates that when we independently increase the amount of training data, the model parameter count, or the training time (for example from 1 epoch to 2 epochs), the loss of the pre-trained model on the test set decreases monotonically; that is, the model keeps getting better.

Since all three factors matter, when we actually do pre-training there is a decision problem of how to allocate compute: given a fixed total compute budget for training the LLM (for example, a number of GPU hours or GPU days), should we increase the amount of data and reduce the model parameters, or increase both data and model size at the same time and reduce the number of training steps? As the scale of one factor increases, the others must be reduced to keep total compute unchanged, so there are many possible allocation schemes. In the end, OpenAI chose to increase the training data and the model parameters at the same time while using an early-stopping strategy to reduce the number of training steps, because it showed that increasing only one of these two factors is not the best choice; it is better to increase both in a certain proportion. OpenAI's conclusion was to prioritize increasing the model parameters and then the training data: if the total compute budget for training the LLM is increased 10 times, the model parameters should be increased about 5.5 times and the training data about 1.8 times for the best results.

A DeepMind study (reference: Training Compute-Optimal Large Language Models) explored this issue in more depth. Its basic conclusions are similar to OpenAI's: for example, increasing both the training data and the model parameters does improve the model. Many large models do not take this into account when doing pre-training; many large LLMs just monotonically increase model parameters while keeping the amount of training data fixed, which is actually wrong and limits the potential of the LLM. However, DeepMind corrected the proportional relationship between the two factors, concluding that data and model parameters are equally important. In other words, if the total compute budget used to train the LLM is increased 10 times, the model parameters and the training data should each be increased about 3.3 times for the best results.

This means that increasing the amount of training data is even more important than we previously thought. Based on this understanding, when DeepMind designed the Chinchilla model, it chose a different compute allocation: compared with the Gopher model, which had 300B tokens of data and 280B parameters, Chinchilla increased the training data 4-fold but reduced the model parameters to one quarter of Gopher's, about 70B. Yet on both pre-training metrics and many downstream task metrics, Chinchilla outperforms the larger Gopher.
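Using the common approximation that training compute is about 6 x parameters x tokens, the figures quoted above indeed put Chinchilla and Gopher at roughly the same budget:

```python
def train_flops(n_params, n_tokens):
    # rough rule of thumb: C ≈ 6 * N * D
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)      # 280B parameters, 300B training tokens
chinchilla = train_flops(70e9, 1200e9)  # 1/4 the parameters, 4x the data

print(f"{gopher:.2e}")      # ~5.0e+23 FLOPs
print(f"{chinchilla:.2e}")  # ~5.0e+23 FLOPs: same budget, better model
```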

This gives us the following insight: we can choose to enlarge the training data and shrink the LLM model parameters in the same proportion, thereby greatly reducing the model size without reducing its quality. A smaller model has many benefits, for example much faster inference at application time. There is no doubt that this is a promising direction for LLM development.

The above covers the impact of model scale from the perspective of the pre-training stage. From the perspective of how well the LLM solves specific downstream tasks, as model scale increases, different types of tasks behave differently; specifically, there are the following three situations.

[Figure: how downstream task performance changes as model scale increases]

The first type of task perfectly reflects the scaling law of LLM models: as model scale gradually increases, task performance gets better and better, as shown in (a) in the figure above. Such tasks usually share a common characteristic: they are often knowledge-intensive tasks, meaning that the more knowledge the LLM contains, the better the performance. Many studies have shown that larger LLMs learn more efficiently: for the same amount of training data, a larger model performs better, indicating that even from the same batch of training data a larger model extracts more knowledge than a smaller one. Moreover, under normal circumstances, when the LLM's parameters are increased, the amount of training data is often increased at the same time, which means larger models can learn more knowledge points from more data. These studies explain the figure above well: why, as model scale increases, these knowledge-intensive tasks keep getting better. Most traditional natural language understanding tasks are in fact such knowledge-intensive tasks, and many of them have improved greatly in the past two years, even surpassing human performance. Obviously, this is most likely due to the increase in LLM model scale rather than to any specific technical improvement.

The second type of task shows that LLMs have some kind of "emergent ability", as shown in (b) above. "Emergent ability" means that when the model's parameter scale does not reach a certain threshold, the model basically has no ability to solve such tasks, performing as if it were choosing answers at random; but once the model scale crosses the threshold, the LLM's performance on such tasks suddenly jumps. In other words, model scale is the key to unlocking new capabilities of LLMs: as the model gets bigger, more and more new capabilities will gradually be unlocked. This is a very intriguing phenomenon, because it implies a possibility that makes people optimistic about the future: perhaps many tasks that LLMs currently cannot solve well, and for which, from our vantage point today, LLMs seem to have no ability at all, may one day be suddenly unlocked if we keep scaling up, precisely because LLMs have this "emergent ability". The growth of LLM models may keep bringing us unexpected and wonderful gifts.

"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models" This article points out that this type of tasks that reflect "emergent capabilities" also have some commonalities: these tasks are generally performed by It is composed of multiple steps. To solve these tasks, it is often necessary to solve multiple intermediate steps first, and logical reasoning ability plays an important role in the final solution of such tasks. Chain of Thought Prompting is a typical technology that enhances LLM reasoning capabilities and can greatly improve the performance of such tasks. The CoT technology will be explained in the following sections and will not be discussed here.

The question is: why do LLMs exhibit this "emergent ability" phenomenon? The paper above and "Emergent Abilities of Large Language Models" offer several possible explanations.

One possible explanation is that the evaluation metrics for some tasks are not smooth enough. For example, some generation tasks are judged by exact match: the string output by the model must match the reference answer completely to count as correct, otherwise it scores 0. So even though a growing model is in fact getting gradually better — producing more and more correct fragments of the answer — any small error still yields a score of 0, and points only appear once the model is large enough to get every fragment right. In other words, because the metric is not smooth, it fails to reflect the reality that the LLM is improving gradually, and from the outside the result looks like an "emergent ability".
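A toy illustration of this metric effect (my own example, not taken from either paper): an exact-match score jumps from 0 to 1 only when the whole string is right, while a per-token score rises smoothly as the model gets more pieces correct.

```python
# Minimal sketch: a non-smooth metric (exact match) vs. a smooth one (token accuracy).

def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def token_accuracy(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    correct = sum(p == r for p, r in zip(pred, ref))
    return correct / max(len(ref), 1)

reference = "the answer is 42"
outputs_by_scale = {          # imagined outputs from increasingly large models
    "small":  "the answer was 24",
    "medium": "the answer is 24",
    "large":  "the answer is 42",
}
for scale, out in outputs_by_scale.items():
    print(scale, exact_match(out, reference), round(token_accuracy(out, reference), 2))
# exact_match reads 0, 0, 1 (looks "emergent"); token_accuracy climbs 0.5, 0.75, 1.0.
```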

Another possible explanation is that some tasks consist of several intermediate steps. As model scale grows, the ability to solve each individual step improves gradually, but as long as any one intermediate step is wrong, the final answer is wrong — which also produces this superficial "emergent ability" phenomenon.

Of course, these explanations are still conjectures; why LLMs show this phenomenon needs further and deeper research.

[Figure: U-shaped performance curves of the PaLM model on two tasks as scale increases]

There is also a small number of tasks whose performance curve is U-shaped as model scale grows: as scale increases, performance first gets worse, but once the model grows even larger, performance starts improving again, showing a U-shaped trend — see the pink PaLM curves on the two tasks in the figure above. Why do these tasks behave so strangely? "Inverse scaling can become U-shaped" offers an explanation: these tasks actually contain two different types of sub-tasks, a real task and a "distractor task". When the model is small, it cannot recognize either sub-task, so its performance is close to random guessing. When the model reaches a medium size, it mainly executes the distractor task, which hurts the real task and shows up as a decline in performance. When the model grows further still, the LLM can ignore the distractor and perform the real task, and performance starts to climb.

For tasks whose performance keeps declining as the model grows, applying Chain of Thought (CoT) Prompting converts some of them back onto the scaling law — the bigger the model, the better the result — and converts others into the U-shaped curve. This suggests that such tasks are really reasoning-type tasks, which is why adding CoT changes their behavior qualitatively.

Human-Computer Interface: From In Context Learning to Instruct Understanding

The interface technologies between humans and LLMs that are usually mentioned include zero shot prompting, few shot prompting, In Context Learning, and Instruct. These are all, in essence, ways of describing a particular task. If you read the literature, though, you will find the naming rather confusing.

Among these, Instruct is the interface style of ChatGPT: a person describes the task in natural language, for example "Translate this sentence from Chinese to English". As I understand it, zero shot prompting is really the earlier name for what is now called Instruct — the connotation is the same, but the concrete practice differs. In the early days of zero shot prompting, people did not really know how to express a task, so they swapped in different words or sentences and repeatedly tried to find a phrasing that worked; that practice has since been shown to amount to fitting the distribution of the training data, and is not very meaningful. The current Instruct approach is to state a command and try to get the LLM to understand it. So although both are, on the surface, expressions of a task, the underlying ideas differ.

In Context Learning means roughly the same thing as few shot prompting: give the LLM a few examples as a template and then ask it to solve a new problem. In my view, In Context Learning can also be understood as a way of describing a task — the difference is that Instruct is an abstract description, while In Context Learning describes the task through concrete examples. Of course, since these terms are currently used somewhat loosely, this understanding only represents my personal opinion.

So below we discuss only In Context Learning and Instruct, and no longer mention zero shot and few shot.
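To make the distinction concrete, here is a minimal sketch (my own illustration, not drawn from any paper) of the two interface styles applied to the same sentiment task; both strings would be sent to the model as-is, and only the way the task is expressed differs.

```python
# Instruct style: an abstract, natural-language description of the task.
instruct_prompt = (
    "Decide whether the following movie review is positive or negative.\n"
    "Review: The plot was thin but the acting saved it.\n"
    "Sentiment:"
)

# In Context Learning (few-shot) style: a handful of solved examples, then the new input.
icl_prompt = (
    "Review: A beautiful, moving film.\nSentiment: positive\n\n"
    "Review: Two hours of my life I will never get back.\nSentiment: negative\n\n"
    "Review: The plot was thin but the acting saved it.\nSentiment:"
)
```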

Mysterious In Context Learning

If you think about it carefully, you will find that In Context Learning is a rather magical technique. What is so magical about it? You give the LLM a few labeled examples <x1, y1>, <x2, y2>, ..., <xn, yn>, then hand it a new input x(n+1), and it can successfully predict the corresponding y(n+1). On hearing this you might ask: what is so magical about that? Isn't that exactly how fine-tuning works? If you ask that, it means you have not thought about the issue deeply enough.

[Figure: comparison of Fine-tuning and In Context Learning]

Fine-tuning and In Context Learning both appear to provide the LLM with some examples, but they differ qualitatively (refer to the figure above): fine-tuning takes those examples as training data and uses backpropagation to modify the LLM's parameters, and that act of modifying parameters genuinely reflects the LLM learning from the examples. In Context Learning, however, only shows the examples to the LLM; it does not use backpropagation to adjust any parameters before asking the model to predict new inputs. Since no parameters are modified, it seems the LLM has not gone through a learning process at all — so why can it predict new examples correctly after merely "looking" at a few? That is the magic of In Context Learning. Does it remind you of the lyric "Just because I glanced at you one more time in the crowd, I can never forget your face again"? The song is called "Legend". Legendary indeed, wouldn't you say?

So does In Context Learning really learn nothing from the examples? Or does the LLM learn in some strange, implicit way? Or did it truly learn nothing at all? The answer is still an unsolved mystery. Existing studies tell different stories, some of them contradictory, and it is hard to judge which one is the truth. Here I list some current views; as for who is right and who is wrong, you will have to decide for yourself. Of course, I think chasing the truth behind this magical phenomenon is a good research topic.

One attempt to show that In Context Learning does not learn from the examples is "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?". It found that in the demonstrations <xi, yi> provided to the LLM, it does not matter whether yi is the correct answer for xi: replacing the correct answer yi with a random answer does not hurt the effectiveness of In Context Learning. This at least tells us one thing: In Context Learning is not supplying the LLM with information about the mapping function from x to y, y = f(x); otherwise, randomly changing the correct labels would certainly disrupt that mapping. In other words, In Context Learning does not learn the mapping from input space to output space.

What really matters for In Context Learning is the distribution of x and y: the distribution of the input text x, and what the candidate answers y are. If you change these two distributions — for example, by replacing y with content outside the candidate answer set — the In Context Learning effect drops sharply.

In short, this work shows that In Context Learning does not learn the mapping function, but that the distributions of inputs and outputs matter a great deal and cannot be changed arbitrarily.
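A rough sketch of the label-randomization probe described above — my simplification of the paper's setup, with `call_llm` as a placeholder for whatever model is being probed:

```python
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for the LLM being probed")

def build_prompt(demos, query, randomize_labels=False):
    label_space = ["positive", "negative"]
    lines = []
    for x, y in demos:
        shown = random.choice(label_space) if randomize_labels else y
        lines.append(f"Review: {x}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("A beautiful, moving film.", "positive"),
         ("Two hours I will never get back.", "negative")]
# The paper's finding: accuracy with randomize_labels=True stays close to accuracy
# with the true labels, as long as the inputs and the label space stay in-distribution.
prompt_true = build_prompt(demos, "The acting saved it.")
prompt_rand = build_prompt(demos, "The acting saved it.", randomize_labels=True)
```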

Other work argues that the LLM does learn the mapping function y = f(x) from the given examples, just implicitly. For instance, "What learning algorithm is in-context learning? Investigations with linear models" argues that the Transformer can implicitly learn the mapping from x to y from the examples: its activations encode some simple mapping functions, and the examples serve to activate the corresponding one. The paper "Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers" treats ICL as a form of implicit fine-tuning.

All in all, this is still an unsolved mystery.

Magical Instruct Understanding

We can regard Instruct as a statement of a task that is convenient for humans to understand. Under that premise, current research on Instruct can be divided into two kinds: Instruct as studied in academic research, and Instruct that describes real human needs.

Let's look at the first kind, the academically oriented Instruct. Its core research question is the LLM's ability to generalize in understanding instructions across many tasks, typified by the FLAN model: take a large number of NLP tasks, and for each one have researchers construct one or more prompt templates to serve as its Instruct; then fine-tune the LLM on training examples so that it learns many tasks at once. After training, give the LLM the Instruct of a brand-new task it has never seen and ask it to solve the task zero-shot; how well it does is used to judge whether the LLM can generalize in understanding instructions.
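A minimal sketch of how instruction-tuning data is assembled in this FLAN style: each task gets one or more natural-language templates, and training examples are rendered through them. The templates below are invented for illustration, not taken from the paper.

```python
# Hypothetical templates for a natural language inference task.
nli_templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "{premise}\nBased on the text above, is it true that \"{hypothesis}\"?",
]

def render(template, example):
    return {"input": template.format(**example), "target": example["label"]}

example = {"premise": "A man is playing a guitar on stage.",
           "hypothesis": "A person is performing music.",
           "label": "yes"}

training_instances = [render(t, example) for t in nli_templates]
# Fine-tune on many tasks rendered this way, then evaluate zero-shot on the
# instruction of a task the model has never seen.
```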

Summarizing current conclusions (see "Scaling Instruction-Fine-tuned Language Models" and "Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks"), the factors that effectively increase an LLM's instruction-generalization ability include: increasing the number of multi-task training tasks, increasing the size of the LLM, providing CoT prompting, and increasing task diversity. Adopting any of these measures improves the LLM's ability to understand instructions.

The second kind is Instruct oriented toward the real needs of humans, represented by InstructGPT and ChatGPT. This line of work is also multi-task, but its biggest difference from academically oriented work is that it faces the genuine needs of human users. Why say that? Because the task-description prompts used for multi-task training are sampled from real requests submitted by a large number of users, rather than fixing a research scope and having researchers write the prompts. The "real needs" show up in two ways: first, because the prompts are randomly drawn from what users actually submitted, the task types covered are more diverse and closer to what users actually need; second, the prompt describing a given task is written by a user, so it reflects how an ordinary user phrases a request, not how a developer imagines a user would phrase it. Clearly, an LLM improved by this kind of work gives users a better experience.

The InstructGPT paper also compared this approach with the FLAN-style Instruct method. They first took the tasks, data, and prompt templates from FLAN and fine-tuned GPT-3 with them, reproducing the FLAN method on GPT-3, then compared it with InstructGPT. Since InstructGPT's base model is also GPT-3, with differences only in the data and method, the two are comparable — and the FLAN method turned out to lag far behind InstructGPT. Why? After analyzing the data, the paper concludes that the FLAN method covers relatively few task domains, a subset of those covered by InstructGPT, which is why its results fall short. In other words, the tasks in the FLAN paper do not match users' actual needs, which leads to insufficient performance in real scenarios. The lesson for us is that collecting real needs from user data is important.

The connection between In Context Learning and Instruct

If we grant that In Context Learning expresses a task command through concrete examples, while Instruct is an abstract task description more in line with human habits, then a natural question arises: is there a connection between them? For example, can we give the LLM several concrete examples of a task being completed and have it find the corresponding natural-language Instruct that describes the task?

There is some scattered work exploring this question, and I think the direction has great research value. To give the answer first: yes, the LLM can. "Large Language Models Are Human-Level Prompt Engineers" is a very interesting piece of work in this direction. For a given task, the LLM is shown some examples and asked to automatically generate a natural-language command that describes the task; the generated description is then used to test performance on the task. The base models are GPT-3 and InstructGPT; with this technique, the LLM-generated Instruct yields substantially better results than GPT-3 and InstructGPT without it, and on some tasks the performance even surpasses that of humans.
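A rough sketch of the idea in that paper: generate candidate instructions from a few solved examples, then keep the one that scores best. `call_llm` and the prompt wording below are placeholders of mine, not the paper's exact templates.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for GPT-3 / InstructGPT")

def propose_instructions(examples, n=5):
    demo = "\n".join(f"Input: {x}  Output: {y}" for x, y in examples)
    prompt = ("I gave a friend an instruction. Based on these input-output pairs:\n"
              f"{demo}\nThe instruction was:")
    return [call_llm(prompt) for _ in range(n)]

def score(instruction, dev_set):
    hits = sum(call_llm(f"{instruction}\nInput: {x}\nOutput:").strip() == y
               for x, y in dev_set)
    return hits / len(dev_set)

# best = max(propose_instructions(train_examples), key=lambda ins: score(ins, dev_set))
```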

This shows there is some mysterious internal connection between concrete task examples and the natural-language description of a task. As to what exactly that connection is, we know essentially nothing yet.

Light of Wisdom: How to Enhance LLM’s Reasoning Ability

Many studies have shown that LLMs have a strong capacity for memorizing knowledge. But we would not usually call someone smart just because they have a strong memory; whether they have strong reasoning ability is usually an important criterion for judging intelligence. Likewise, if an LLM is to impress, strong reasoning ability is a necessity. In essence, reasoning is the comprehensive use of many related knowledge points to derive new knowledge or new conclusions. LLM reasoning has been one of the most important and most popular research areas over the past year or so. So the questions we care about are: does the LLM have reasoning ability, and if so, is it strong enough?

The current answers to these two questions seem to be:

Once model scale is large enough, the LLM does have reasoning ability of its own. On simple reasoning problems it has become quite capable, but on complex reasoning problems more in-depth research is still needed.

If I sort the existing work on LLM reasoning, it falls into two broad categories, reflecting different technical routes for mining or promoting the reasoning ability of LLMs. The first category has more studies and can be collectively called prompt-based methods: the core idea is to better elicit the LLM's own reasoning ability through appropriate prompts or prompted examples; Google has done a great deal of very effective work in this direction. The second route is to introduce program code into pre-training, alongside text, to further strengthen the LLM's reasoning ability; this should be the route OpenAI has taken. For example, ChatGPT certainly has strong reasoning ability, yet it does not require users to supply reasoning examples, so its powerful reasoning most likely comes from using code in the pre-training of GPT-3.5.

The two approaches actually differ in overall direction: using code to enhance LLM reasoning reflects the idea of directly strengthening the ability by increasing the diversity of the training data, whereas prompt-based methods do not improve the LLM's reasoning ability itself — they are techniques for letting the LLM better display that ability while solving a problem. You could say the former treats the root cause and the latter the symptoms. Of course, the two are complementary, but in the long run the root cause matters more.

Prompt-based method

There is a great deal of work in this area; summarizing, it can be roughly divided into three technical routes.

[Figure: the two-stage procedure of zero-shot CoT]

The first route is to directly add an auxiliary reasoning prompt to the question. This method is simple and direct, yet effective in many domains. It was proposed by "Large language models are zero-shot reasoners" and is also known as zero-shot CoT. Concretely, it has two stages (see the figure above): in the first stage, the prompt "Let's think step by step" is appended to the question, and the LLM outputs a concrete reasoning process; in the second stage, that reasoning process is appended after the original question, followed by the prompt "Therefore, the answer (arabic numerals) is", at which point the LLM produces the answer. Such a simple operation greatly improves LLM performance on various reasoning tasks — for example, on the mathematical reasoning test set GSM8K, adding the prompt raised reasoning accuracy from the original 10.4% to 40.4%, which is remarkable.
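A minimal sketch of the two-stage procedure just described; `call_llm` is a placeholder for the underlying model.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for the LLM")

def zero_shot_cot(question: str) -> str:
    # Stage 1: elicit the reasoning chain.
    stage1 = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_llm(stage1)
    # Stage 2: append the reasoning and ask for the final answer.
    stage2 = f"{stage1}\n{reasoning}\nTherefore, the answer (arabic numerals) is"
    return call_llm(stage2)
```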

Why does the LLM gain the ability to list detailed reasoning steps and compute the answer just from the prompt "Let's think step by step"? There is no firm conclusion yet. My guess is that the pre-training data probably contains a large amount of text of this form — text that begins with "Let's think step by step", is followed by detailed reasoning steps, and ends with the answer — and the LLM memorized these patterns during pre-training. When we input the prompt, the LLM is nudged to vaguely "recall" the derivation steps of such examples and imitate them to reason step by step and give an answer. Of course, this is just unsupported inference on my part. If it is true, then after reading the standard CoT approach introduced below you will find that zero-shot CoT is probably not essentially different from standard CoT: standard CoT uses human-written step-by-step examples, while zero-shot CoT most likely uses the prompt to activate memorized examples that contain reasoning steps. It is also entirely understandable that standard CoT works better than zero-shot CoT — relying on the LLM to recall examples is presumably not very accurate, whereas the accuracy of human-written examples is guaranteed, so standard CoT naturally does better.

This illustrates a point: the LLM itself possesses reasoning ability; we just had no way of eliciting it, and an appropriate two-step prompt can release that potential to some extent. Incidentally, for Chinese there is probably another "golden prompt", something like "The detailed solution steps are as follows", because when Chinese corpora explain reasoning steps, the introductory sentence is likely different from "Let's think step by step", which is an obviously Western way of putting it. It would be worth exploring what the Chinese golden prompt actually is.

The second idea is what is generally called example-based chain-of-thought prompting (few-shot CoT, Chain of Thought Prompting). This is currently the main direction of LLM reasoning research, and a lot of work follows this idea. We briefly introduce a few representative works with notable results, which roughly trace the technical development of CoT.

The main idea of CoT is actually very straightforward: to teach the LLM to reason, we give it some manually written reasoning examples in which every intermediate step on the way to the final answer is spelled out explicitly; these hand-written, detailed reasoning processes are the chain-of-thought prompts (for a concrete example, see the sketch below). CoT essentially asks the LLM to grasp one principle: when reasoning, don't take steps that are too big, or you will easily go wrong — break a big problem into small ones, proceed step by step, and accumulate small wins into a big one. The earliest paper to explicitly propose the concept of CoT is "Chain of thought prompting elicits reasoning in large language models", published in January 2022. Although the method is very simple, applying CoT greatly improves the LLM's reasoning ability, raising accuracy on the GSM8K mathematical reasoning test set to about 60.1%. To be fair, the idea of giving detailed reasoning steps and intermediate processes was not first proposed by CoT; the earlier "scratchpad" technique (see "Show Your Work: Scratchpads for Intermediate Computation with Language Models") adopted a similar idea first.
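A minimal few-shot CoT prompt in the style popularized by that paper; the demonstration spells out every intermediate step, which is what the model is expected to imitate for the new question.

```python
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
# The prompt is sent to the model as-is; the model continues with its own
# step-by-step derivation and final answer for the second question.
```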


[Figure: the Self-Consistency method — sample multiple reasoning paths and vote on the answer]

Not long after CoT was proposed, in March 2022, an improvement called "Self-Consistency" pushed GSM8K accuracy up to 74.4%; the paper proposing it is "Self-Consistency Improves Chain of Thought Reasoning in Language Models". The idea of Self-Consistency is also very intuitive (refer to the figure above): first use CoT to give several hand-written reasoning examples, then ask the LLM to reason about the given problem. With plain CoT, the model outputs one reasoning process and one answer and the procedure ends there. Self-Consistency instead has the LLM output multiple different reasoning processes and answers, and then picks the best answer by voting. The idea is simple and direct, and the effect is genuinely good. Self-Consistency essentially teaches the LLM the lesson of Kong Yiji, who said there are four ways to write the character for "fennel" in fennel beans: there can likewise be many correct ways to solve a math problem, each with a different derivation, all leading to the final answer. All roads lead to Rome; a few stray travellers may end up in Beijing, but they are the minority — look at where most people arrive, and that is where the correct answer lies. Simple methods often carry deep philosophical meaning, don't they?
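A minimal sketch of Self-Consistency: sample several reasoning paths from a few-shot CoT prompt, extract each final answer, and take the majority vote. `sample_llm` is a placeholder for temperature-based sampling from the model.

```python
from collections import Counter
import re

def sample_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("stand-in for sampled LLM decoding")

def extract_answer(completion: str) -> str:
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else ""

def self_consistency(cot_prompt: str, n_samples: int = 20) -> str:
    answers = [extract_answer(sample_llm(cot_prompt)) for _ in range(n_samples)]
    counts = Counter(a for a in answers if a)
    return counts.most_common(1)[0][0] if counts else ""
```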

Going further, the work "On the Advance of Making Language Models Better Reasoners" integrates three more improvements — expanding from one prompt question to multiple prompt questions, checking the correctness of intermediate reasoning steps, and weighted voting over the answers of multiple outputs — raising GSM8K accuracy to about 83%.

[Figure: the two-stage procedure of Least-to-most prompting]

The third idea embodies a divide-and-conquer strategy. Of course, "divide and conquer" is my own generalization; others have not put it that way. The core idea is: for a complex reasoning problem, decompose it into a number of sub-problems that are easy to solve; after solving the sub-problems one by one, derive the answer to the complex problem from the answers to the sub-problems. This really is similar in spirit to the divide-and-conquer algorithm, and I personally feel it may be the approach that gets at the essence of the problem and ultimately solves complex LLM reasoning. Take the "Least-to-most prompting" technique as a concrete realization of this idea, as shown in the figure above. It has two stages. In the first stage, we ask what final question the original problem ultimately requires us to answer — call it Final Q — and fill a prompt template based on the original problem: "If I want to solve the problem Final Q, then I need to first solve". We hand the original problem plus this prompt to the LLM, which produces the answer: in effect, the LLM supplies the prefix sub-question, Sub Q. In the second stage, we have the LLM answer Sub Q and obtain its answer, then splice the original problem together with Sub Q and its answer, and finally ask the LLM the final question Final Q, at which point it gives the final answer. In this way the method embodies decomposing a problem into sub-questions and working step by step from their answers toward the final answer.
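A rough sketch of those two stages; the prompt wording is my paraphrase, not the paper's exact template, and `call_llm` is a placeholder.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for the LLM")

def least_to_most(problem: str, final_q: str) -> str:
    # Stage 1: ask the model which sub-question must be solved first.
    decompose = f'{problem}\nTo solve "{final_q}", we need to first solve:'
    sub_q = call_llm(decompose)
    # Stage 2a: answer the sub-question.
    sub_a = call_llm(f"{problem}\nQ: {sub_q}\nA:")
    # Stage 2b: append the solved sub-question, then ask the final question.
    final_prompt = f"{problem}\nQ: {sub_q}\nA: {sub_a}\nQ: {final_q}\nA:"
    return call_llm(final_prompt)
```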

Code pre-training enhances LLM reasoning capabilities

The above are the three mainstream ways of using prompts to elicit the reasoning ability of LLMs. Concerning LLM reasoning, an interesting and puzzling phenomenon has been observed: besides text, adding program code to the model's pre-training can greatly improve its reasoning ability. This conclusion can be drawn from the experimental sections of many papers (see, for example, "Automatic Chain of Thought Prompting in Large Language Models" and "Challenging BIG-Bench tasks and whether chain-of-thought can solve them").

[Figure: experimental results comparing GPT-3 davinci (text-only pre-training) with code-davinci-002 across reasoning test sets]

The figure above shows experimental data from the paper "On the Advance of Making Language Models Better Reasoners": GPT-3 davinci is the standard GPT-3 model, trained on plain text, while code-davinci-002 (the model OpenAI internally calls Codex) is trained on both code and NLP data. Comparing the two, no matter which inference method is used, merely switching from a text-only pre-trained model to a model pre-trained on mixed text and code yields a huge improvement in reasoning ability on almost every test set. Taking the Self-Consistency method as an example, the improvement on most data sets directly exceeds 20 to 50 percentage points — a staggering gain, considering that at the level of the specific inference method we did nothing at all except add extra program code alongside text during pre-training.

Beyond this phenomenon, the data in the figure supports some other conclusions. For instance, the text-only pre-trained GPT-3 actually has a considerable degree of reasoning ability of its own: apart from relatively weak performance on mathematical reasoning such as GSM8K, it does reasonably well on the other reasoning data sets, provided you use appropriate methods to elicit the ability it already has. For another instance, text-davinci-002 — the model obtained by adding instruct fine-tuning on top of code-davinci-002 (the first step of the InstructGPT/ChatGPT recipe) — has weaker reasoning ability than Codex, yet other studies show it is stronger than Codex on natural language processing tasks. This seems to indicate that adding instruct fine-tuning damages the LLM's reasoning ability while improving its natural-language understanding to some degree. These conclusions are quite interesting and invite further thought and exploration.

So a natural question is: why can pre-training on code give the model extra reasoning ability? The exact cause is currently unknown and deserves further exploration. My guess is that it is because the code training of the original Codex (which used only code training; see "Evaluating Large Language Models Trained on Code") generates code from text, and the code often carries plenty of text comments, which essentially amounts to the pre-trained model doing a multi-modal alignment between the two kinds of data. The data surely contains a fair proportion of code, descriptions, and comments for mathematical or logical problems, and such mathematical or logical reasoning data obviously helps on downstream mathematical reasoning tasks. I suspect the reason lies roughly there.

Thoughts on LLM reasoning capabilities

The above has introduced the mainstream technical ideas for LLM reasoning and some existing conclusions. Let me now share my own thoughts on LLM reasoning technology; what follows is purely personal inference without much evidence, so please treat it with caution. My judgment is this: although the past year has seen rapid progress in eliciting LLM reasoning ability, and great technical advances have been made, the overall feeling is that we may be heading in the right direction, yet we are still a long way from touching the real nature of the problem, which calls for deeper thinking and exploration.

First of all, I endorse the main idea of the divide-and-conquer approach above: for complex reasoning problems, we should break them into several simple sub-problems, because the LLM is far more likely to answer a sub-problem correctly; let the LLM answer the sub-problems one by one and then gradually derive the final answer. Inspired by the "Least-to-most prompting" technique, if I think a bit further, I believe LLM reasoning is likely to be one of the following two things: a graph-reasoning problem solved through continual interaction with the LLM, or a program-flow-chart execution problem solved through continual interaction with the LLM.

[Figure: a complex problem decomposed into a graph of sub-problems]

Consider the graph-reasoning view first. As sketched in the figure above, suppose we had some way of decomposing a complex problem into a graph structure made up of sub-problems or sub-steps, where a loop in the graph means certain sub-steps are executed repeatedly. Assuming we could obtain such a sub-problem decomposition graph, we could guide the LLM step by step along the graph according to the dependency relations, answering whichever sub-questions must be answered first, until the final answer is derived.

[Figure: a complex problem decomposed into a program flow chart]

Now the program-flow-chart view. Referring to the figure above, suppose we had some way of decomposing a complex problem into sub-problems or sub-steps and producing from them a structure resembling a program flow chart, in which some steps are executed repeatedly (loops) and some require conditional judgments (branches). We would then interact with the LLM at each sub-step, obtain its answer, and keep executing according to the flow until the final answer is produced — something like that pattern. If this idea is roughly right, it might also explain, from this angle, why adding code enhances the reasoning ability of pre-trained models: most likely, the multi-modal pre-training uses an implicit program flow chart of this sort inside the model as a bridge between the two modalities, connecting the textual description to an implicit flow chart and then mapping it onto the concrete code generated from it. In other words, this kind of multi-modal pre-training strengthens the LLM's ability to construct an implicit flow chart from text and execute along it — that is, it strengthens its reasoning ability.

Of course, the biggest problem with this line of thought is: how do we get the LLM (or some other model) to produce the graph structure or flow-chart structure from a problem described in text? That is probably the hard part. One possible idea is to keep strengthening pre-training on text plus higher-quality code and let the model learn the internal implicit structure implicitly. Viewing current CoT techniques through this lens: standard CoT in effect uses natural-language text to describe the graph structure or program flow chart, while "Least-to-most prompting" tries to infer the graph structure by reasoning backward from the final graph node. But current methods clearly limit the depth of this backward inference — they can only infer very simple graph structures — and that is exactly what limits their capability.

The road ahead: LLM research trends and directions worth exploring in depth

Here are some LLM research areas that I personally consider important, or research directions worth exploring in depth.

Exploring the scale ceiling of the LLM model

Continuing to push up the scale of LLM models may not look like it has much technical content, but in fact it matters enormously. My personal judgment is that from the emergence of Bert, to GPT-3, and on to ChatGPT, the core contribution behind these impressive breakthroughs most likely comes from growth in model scale rather than from any single technique. Perhaps the real key to unlocking AGI is: ultra-large-scale and sufficiently diverse data, ultra-large models, and a sufficiently thorough training process. Moreover, building very large LLMs demands very strong engineering capability from the technical team, so one cannot say the work lacks technical content.

So what is the research significance of continuing to scale up LLMs? I see value in two respects. First, as noted above, for knowledge-intensive tasks, performance keeps improving as the model grows; and for many reasoning-type, difficult tasks, once CoT prompting is added, performance also tends to follow the scaling law. A natural question, then, is: for these tasks, how far can the scale effect of LLMs take us? This is a question that concerns many people, including me. Second, given the magical "emergent abilities" of LLMs, what new and unexpected capabilities would further increases in scale unlock? That is also a very interesting question. For both reasons, we still need to keep growing model scale to see where its ceiling lies for solving various tasks.

Of course, for 99.99% of practitioners this can only remain talk: there is neither the opportunity nor the capability to do it. Doing it places extremely high demands on an institution's financial resources and willingness to invest, its engineering capability, and its technical enthusiasm, and none of these can be lacking. A rough estimate is that no more than five institutions abroad, and no more than three domestically, can do it. Of course, given the cost, a "joint-stock large model" may appear in the future — several capable institutions cooperating to build a super-large model together.

Enhance LLM’s complex reasoning ability

As discussed in the section on LLM reasoning above, although LLM reasoning has improved greatly recently, many studies (see "Limitations of Language Models in Arithmetic and Symbolic Induction" and "Large Language Models Still Can't Plan") show that the reasoning problems LLMs currently solve well tend to be relatively simple; their complex reasoning ability remains weak. For example, even for simple character-copying or addition, subtraction, multiplication and division, once the string or the numbers get long, the LLM's reasoning ability drops off quickly; and complex abilities such as behavior planning are very weak. All in all, strengthening the complex reasoning ability of LLMs should be one of the most important themes of future LLM research.

As mentioned above, adding code to pre-training is one direction that directly enhances LLM reasoning. There is not yet enough research in this direction; it is more a summary of practical experience. Exploring the principle behind it, and then introducing more types of new data beyond code to enhance reasoning, may be the direction that improves reasoning ability in a more fundamental way.

Incorporating more research fields beyond NLP into LLM

The current ChatGPT is good at NLP and code tasks. As an important seed player on the road to AGI, integrating images, video, audio and other modalities into the LLM, and gradually pulling in fields with even more obvious differences — AI for Science, robot control, and so on — is the necessary path for LLM to move toward AGI. This direction has only just begun, so its research value is high.

An easier-to-use interactive interface between humans and LLM

As mentioned earlier, this is where ChatGPT's biggest technical contribution lies. But clearly the current technology is not perfect, and there must be many commands the LLM cannot understand. So, along this direction, looking for better techniques that let humans use the expressions they are accustomed to while the LLM still understands them is a new and very promising technical direction.

Constructing a high-difficulty comprehensive task evaluation data set

Good evaluation data sets are the cornerstone that guides continued technical progress. As LLMs grow, task performance improves rapidly, making many standard test sets obsolete quickly — in other words, these data sets are too easy relative to existing technology. With test sets that pose no difficulty, we do not know where the flaws and blind spots of current technology lie. Therefore, building difficult test sets is key to driving LLM technology forward.

Some new test sets have already appeared in the community; representative ones include BIG-bench, OPT-IML, and others. These test sets reflect certain characteristics, such as being harder than what existing LLM technology can handle and integrating many types of tasks.

Inspired by ChatGPT, I think another consideration should be added: reflecting real user needs. That is, the expression of these tasks should genuinely be initiated by users; only an LLM built and evaluated this way can solve users' actual needs.

In addition, I believe LLM capabilities will quickly overflow into fields beyond NLP, and how to incorporate evaluation data from more of those fields also needs to be considered in advance.

High Quality Data Engineering

For pre-trained models, data is the foundation, and the pre-training process can be understood as the process of drawing knowledge from data. Therefore we need to further strengthen the mining, collection, and cleaning of high-quality data.

Regarding data, two aspects need consideration: quality and quantity. Based on T5's comparative experiments, we can conclude that between the two, quality comes first; the right path should be to enlarge the data while ensuring its quality.

Data quality covers several measures, such as the information content of the data and its diversity. Wikipedia, for instance, is high-quality data with extremely high knowledge density — a judgment about information content. Increasing the diversity of data types is undoubtedly the basis for stimulating all sorts of new LLM capabilities; for example, adding data from Q&A websites directly helps improve the LLM's QA ability. Diverse data gives the LLM the capacity to solve more kinds of tasks, so diversity may be the most critical criterion within data quality.

As for data quantity, in principle all data publicly released on the Internet can be incorporated into LLM pre-training. So where is the limit? "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning" estimates this, and concludes that high-quality NLP data will be exhausted by around 2026, low-quality NLP data somewhere between 2030 and 2050, and low-quality image data between 2030 and 2060. This means that either new types of data sources appear by then, or we must improve how efficiently LLMs use data; otherwise the current data-driven approach to model improvement will stall, or its returns will diminish.

Sparsifying the Transformer in very large LLM models

Among the largest LLMs today, a number adopt sparse model structures, such as GLaM and the Switch Transformer, and GPT-4 will most likely also take the sparse-model route. The main benefit of a sparse model is that it can greatly reduce training time and online inference time. The Switch Transformer paper points out that, under the same compute budget, a sparse Transformer can train an LLM 4 to 7 times faster than a dense Transformer. Why do sparse models speed up training and inference? Because although the total parameter count is huge, for any given training instance the routing mechanism activates only a small part of all the parameters, so the number of active parameters involved in training and inference is relatively small, and the model is fast.
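A toy sketch of why sparse (MoE-style) layers are cheap per token: a router picks the top-k of many expert feed-forward networks, and only those experts run. Shapes and the routing rule are simplified; real systems such as the Switch Transformer and GLaM add load-balancing losses and capacity limits.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)  # (tokens, n_experts)
        topk = scores.topk(self.k, dim=-1)       # each token keeps only k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx, w = topk.indices[:, slot], topk.values[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask].unsqueeze(1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts run per token
```

Total parameters grow with the number of experts, but each token only pays the compute of k expert FFNs, which is the source of the training and inference speedup.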

I believe future very large LLMs will most likely converge on sparse models, for two main reasons. On one hand, existing research (see "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers") shows that a standard dense Transformer is itself sparsely activated during training and inference — only some parameters are activated, and most do not participate — in which case we might as well migrate directly to a sparse architecture. On the other hand, there is no doubt that LLM scale will keep increasing, and high training cost is an important obstacle to scaling further; sparse models can greatly reduce the training cost of very large models, so the larger the model, the more obvious the payoff of sparsity. For both reasons, larger LLMs in the future will very probably adopt sparse model designs.

Then why don't the other large models take the sparse route today? Because sparse models suffer from problems such as unstable training and easy overfitting, and they are not easy to train well. So how to fix the problems sparse models face, and how to design sparse models that are easier to train, is an important future research direction.

The road to learning: What should you pay attention to when replicating ChatGPT

If you want to replicate an LLM with effects as impressive as ChatGPT's, then based on current research conclusions, you need to weigh the following issues carefully when making technology choices.

First, in terms of pre-training mode, we have three options: GPT-style autoregressive language models, Bert-style bidirectional language models, and the T5-style hybrid (an Encoder-Decoder architecture in which the Encoder is a bidirectional language model and the Decoder is an autoregressive language model — a hybrid structure, but essentially still in the Bert mode). We should choose a GPT-style autoregressive language model; the reasons were analyzed in the paradigm-shift section of this article. At present, it seems that many domestic LLM efforts have chosen the Bert bidirectional or T5 hybrid route for this decision, and it is very likely that that direction has gone astray.

Second, strong reasoning ability is an important psychological basis for users to accept an LLM, and if you want an LLM to have strong reasoning ability, current experience says it is best to introduce a large amount of code alongside text during pre-training. The rationale is analyzed in the relevant section of this article.

Third, if you want the parameter count not to be so huge yet still want the results to be good enough, there are two technology options to configure. One is to strengthen the collection, mining, and cleaning of high-quality data: the model's parameters can be half of ChatGPT/GPT-4's, but to reach a similar effect the amount of high-quality training data needs to be roughly double that of ChatGPT/GPT-4 (the Chinchilla route). The other route for effectively shrinking model size is the retrieval-based (text retrieval) LLM, which can also greatly reduce the parameter count while maintaining comparable performance. The two choices are not mutually exclusive but complementary: both can be used at the same time, achieving an effect close to that of a super-large model while keeping the model relatively small.

Fourth, the training cost of super-large models is so high, precisely because the models are so large, that very few institutions can afford it. And from the analysis above, continuing to scale up LLMs is something that will definitely happen and should be done. Therefore, how to reduce the training cost of LLMs by technical means is very important. Sparsifying the LLM's feature extractor is one technical choice that effectively reduces training and inference cost; it follows that, as models keep growing, sparsifying the LLM is an option that should be considered.

Fifth, ChatGPT is currently the technical solution closest to the ideal LLM, and the ideal LLM should rest on an almost omnipotent, general-purpose base model that supports all kinds of upper-layer task types. At present, supporting more and more task types is achieved mainly by increasing the diversity of the LLM's pre-training data: the better the diversity, the richer the task types the LLM can support. So one should take seriously the idea of adding new LLM capabilities by increasing data diversity.

Sixth, an easy-to-use human-machine interface. Humans describe tasks in their own habitual way, and the LLM needs to understand the true meaning of these instructions. Also note that the instructions should reflect genuine human needs — that is, task descriptions should be collected from end users rather than invented by the developers' own imagination or guesswork. The biggest lesson ChatGPT taught me is really this one; as for whether to use reinforcement learning, I don't think it matters much — other alternative techniques should be able to do something similar.

ChatGPT: Why OpenAI

Why was it OpenAI, rather than some other organization, that made ChatGPT? We can do a simple analysis here.

At the beginning of this article we mentioned OpenAI's philosophy regarding LLMs. How does OpenAI view LLMs? Looking back at the technologies it has released one after another, we can see that starting from GPT 1.0 it has essentially and firmly regarded the LLM as the necessary road to AGI. Concretely, in OpenAI's eyes the future AGI should look like this: a task-independent, super-large LLM that learns all kinds of knowledge from massive data, and this LLM generates everything by itself to solve all kinds of practical problems; moreover, it should understand human commands so that humans can use it. The first half — building a task-independent, very large LLM and letting it learn knowledge from massive data — is by now almost everyone's consensus; what really reveals OpenAI's vision is the second half.

OpenAI's philosophy has been ahead of its time: it set its self-positioning high from the start and has unswervingly explored whether the approach above can lead to AGI.

OpenAI could make ChatGPT precisely because its positioning was high and it remained free from outside interference, with an unwavering attitude. We can review some of the key steps it took. GPT 1.0 took the autoregressive, generation-style language-model route and was released before Bert. Bert then showed that, for many NLP understanding tasks, bidirectional language models outperform autoregressive one-way models. Even so, GPT 2.0 did not switch to the bidirectional route; it stayed on the path of text generation and began trying zero-shot and few-shot prompting. At that point, the AGI in OpenAI's mind was in fact already starting to surface and take shape. It is just that, because zero-shot/few-shot results were far worse than Bert plus fine-tuning, nobody took it very seriously — many could not even understand why OpenAI kept insisting on the one-way language-model route. At that time, I suspect, even OpenAI itself could not be sure this road would definitely work.

But that did not stop it from continuing down the road. GPT 3.0 demonstrated fairly powerful zero-shot/few-shot prompting capabilities; by then, the AGI in OpenAI's mind had fully surfaced with a clear outline, and its results showed that this path was indeed likely to be passable. GPT 3.0 was a crossroads and a watershed that determined the development direction of LLMs; the other path at that fork was the "Bert + fine-tuning" mode. At that fork, different practitioners chose different roads, and the technical gap began to widen from there. Unfortunately, many domestic practitioners chose to continue along the "Bert + fine-tuning" road, and that was also a key moment that caused today's lag. Further along came InstructGPT and ChatGPT.

What OpenAI has proven with ChatGPT is this: although we may still be a long way from true AGI, the road toward AGI via super-large LLMs looks, for now, feasible.
