In the field of large models, the Transformer, which has long held center stage, recently seems to be at risk of being overtaken.
The challenger is a study called "Mamba", which achieves SOTA performance across multiple modalities such as language, audio, and genomics. In language modeling, the Mamba-3B model outperforms Transformer models of the same size and matches Transformer models twice its size, in both pretraining and downstream evaluation.
Once the paper was published, it caused quite a stir. After the initial amazement, people noticed that the paper has only two authors: Albert Gu, an assistant professor in the Machine Learning Department at Carnegie Mellon University, and Tri Dao, chief scientist at Together.AI and incoming assistant professor of computer science at Princeton University.
An important innovation of this work is the introduction of an architecture called the "selective SSM" (selective state space model). Whereas the self-attention mechanism in the Transformer incurs computation that grows quadratically with context length (increasing the context 32x can increase the computation roughly 1000x), Mamba scales linearly with context length, its performance keeps improving on real data up to sequences a million tokens long, and it achieves 5x higher inference throughput. All of this hinges on the selective SSM.
After seeing Mamba’s excellent performance, many researchers became curious about SSM (state space model) related research.
In a recent interview, Nathan Lambert, a machine learning researcher at the Allen Institute for AI (AI2), had an in-depth exchange with Tri Dao, one of the authors of the Mamba paper, and Michael Poli, a scientist at Together.AI.
They mainly discussed the future of LLM architectures, as well as the application prospects of state space models (SSMs) in the emerging LLM market. The conversation is fairly dense with technical points: why the attention mechanism in the Transformer works, what its scaling limits are, an introduction to Mamba and its hardware optimization, and predictions about future architectures.
The following is the content of the conversation.
Nathan Lambert: Let's first discuss why the attention mechanism is effective, and what its limitations are. How much of the Transformer is built on the attention mechanism, are there other mechanisms at work, and what challenges might arise in this regard?
Tri Dao: Yes, the Transformer is the architecture that currently drives most of the exciting applications we see. As you said, the attention mechanism is the core layer. In fact, attention attracted interest as early as 2014 to 2015, and then the Transformer appeared, integrating the attention mechanism and focusing on interleaving multi-layer perceptrons (MLPs) with attention.
I think a lot of its success is that these models seem to scale well, and you can make the model larger by adding more parameters and data. This is the secret of success. While it seems obvious now, I don't think this was a clear concept five years ago.
A few reasons why the Transformer is successful: first, it is general enough to learn a lot from large amounts of data. Second, it is very hardware friendly. Unlike earlier recurrent neural networks (RNNs), it has no sequential dependence in its computation.
So it runs very well on GPUs and TPUs, scales well, and utilizes the hardware very efficiently. I'm also personally working on making it use hardware even more efficiently. So that's the secret to success: make an architecture that is both versatile and scales well. If you are into NLP, maybe you would consider adding some inductive bias to enhance the model. Personally, I think the Transformer is a very general architecture that is very scalable and very hardware friendly.
Nathan Lambert: Yes, yes. In retrospect, it all seems obvious. Now, when looking into its alternatives, an interesting dimension is context length. Michael, what do you think?
Michael Poli: Yeah, I have a few things to say. First of all, there is still a lot of excellent research trying to explain the Transformer from first principles: why can it learn these interesting circuits? People break down the computation, such as how heads combine in different Transformers, and so on.
There is some work on understanding the Transformer as a kind of encoded programming language. But I think, as Tri mentioned, there are some really interesting design choices in the Transformer. The interleaving of attention and MLP is quite important. Moreover, the Transformer was successful early on because it adopted techniques that had been developed for RNNs and other traditional NLP models, such as gating mechanisms that regulate what information the model absorbs and how quickly it is forgotten, in a form that can be parallelized. These are like gems that can be optimized on the GPU; it's not easy, but it can be done.
Nathan Lambert: Yeah, these are great. The more specific point I want to make is that the attention mechanism ultimately exhibits a computational cost that increases quadratically with the length of the input sequence. Suppose you have an input sequence of length L, and you want to output a sequence also of length L. If you dig into the mathematical details and look at what happens when most libraries do inference, you'll find that you have this upper triangular attention matrix, where you can only consider past parts of the text. As the processing proceeds, you'll find that it forms an L-squared relationship, where the first token only takes into account one element, and then each subsequent token takes into account progressively more past tokens. We've just discussed RNNs and how some non-attentional methods can do this without looking at all the textual history in the sequence. When you write a long prompt to your chatbot GPT, do you really want all that information encoded in it? Besides this dense attention matrix, what other options do we have?
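As a rough illustration of the quadratic cost Nathan describes (not from the interview; dimensions are made up), here is a minimal NumPy sketch of single-head causal attention. The upper-triangular mask means token i only attends to positions up to i, and the full (L, L) score matrix is what makes the cost grow as L squared.

```python
import numpy as np

def causal_attention(q, k, v):
    """Naive single-head causal attention over a length-L sequence.
    The (L, L) score matrix is what makes cost grow quadratically with L."""
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)                   # (L, L) pairwise scores
    mask = np.triu(np.ones((L, L), dtype=bool), 1)  # strictly upper triangle
    scores[mask] = -np.inf                          # token i only sees tokens <= i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # (L, d)

rng = np.random.default_rng(0)
L, d = 8, 16
out = causal_attention(rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)))
print(out.shape)  # (8, 16); doubling L quadruples the number of attention scores
```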
Tri Dao: Recurrent neural networks date back to the 1980s; perhaps the more famous ones are long short-term memory networks (LSTM) and gated recurrent units (GRU). They were very popular for translation, speech recognition, and so on from around 2012 to 2016, when they were the SOTA technology in NLP.
They process text in a sequential manner: observing tokens one by one and updating the hidden state each time a new token is seen. I think in a sense this mimics the way the human brain processes information: as you read a sentence or a paragraph, you store some information in your brain, and when you finish reading a document, you may be able to answer questions about it without referring to the document again. So this is how RNNs work. They process text and update a hidden state, which is a representation that can be used to generate new tokens or classify documents.
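A minimal sketch of the recurrent pattern Tri describes (a toy Elman-style update with made-up dimensions, not the actual LSTM/GRU equations): a fixed-size hidden state is updated once per token, so memory stays constant no matter how long the document is.

```python
import numpy as np

def rnn_read(tokens, W_h, W_x):
    """Simple recurrence: one fixed-size state, updated once per token."""
    d_state = W_h.shape[0]
    h = np.zeros(d_state)                # the "memory" of everything read so far
    for x in tokens:                     # observe tokens one by one
        h = np.tanh(W_h @ h + W_x @ x)   # fold the new token into the state
    return h                             # usable for generation or classification

rng = np.random.default_rng(0)
d_state, d_in, seq_len = 32, 16, 100
h = rnn_read(rng.normal(size=(seq_len, d_in)),
             rng.normal(size=(d_state, d_state)) * 0.1,
             rng.normal(size=(d_state, d_in)) * 0.1)
print(h.shape)  # (32,): same size whether the document has 100 or 100,000 tokens
```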
These methods used to be very popular, around 2016. However, as experimental results accumulated, we gradually discovered that their performance was not as good as the Transformer's. As you mentioned, the Transformer has a quadratic scaling property: each token is compared against all previous tokens, which provides a very simple way for information to propagate. I believe this is one of the reasons why Transformers and attention mechanisms work so well.
Recently it has been discovered that some new RNN architectures perform well, RWKV being one of the earlier ones. I admire that project very much; it was developed by researcher Bo Peng. It competes with the Transformer in a unique way and demonstrates the powerful potential of RNNs.
Nathan Lambert: Yes. I've read that paper too. On a technical level, they tried to replicate something like the query-key-value lookup of the attention mechanism with two linear RNNs, essentially to eliminate issues like the quadratic scaling of attention. These two RNNs have better long-context behavior and different implementation rules. They also trained models with up to 14 billion parameters. This leads to some questions I want to ask next, including about Mamba and Striped Hyena. We can talk about them one by one.
Nathan Lambert: I went into the Together API and did a comparison test between Mistral and Striped Hyena. The results show that Striped Hyena is a good language model. It answers most questions with no obvious failure mode. Michael, what do you think of this model?
Michael Poli: First I would like to say that there is an interesting connection between these new methods. There is a convex set with a center point, and linear attention (that is, attention without softmax), linear RNNs, and state space models (SSMs) all lie inside this convex set. To a certain extent, the mathematical formulation of the underlying model is the same; I don't mean the infrastructure here, but the underlying model.
Then you can develop in different directions, each with its own trade-offs, such as the feature-map direction and the kernel direction. When you break up or remove the softmax, you can take a different approach when dealing with queries and keys, which are the basic entities that make up your attention matrix. After removing the softmax, you can build other kernel-like functions, or other functions that you hope approximate what the attention mechanism does.
You can do something like a Taylor approximation or Taylor expansion, and you get a slightly different perspective but something very similar. You can turn to time variance: you modify the RNN so that its computation depends more on the input sequence, that is, the computation in the linear RNN is determined by the input sequence. You can use things like gates, and we've seen a lot of work on, for example, updating the internal state with additional gates so that you can make better use of your fixed state dimension. The third direction, at least in my opinion, is to use convolutional forms and make more use of other types of linear operators that are still composable and still allow you to train in parallel.
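As a hedged sketch of the first direction Michael mentions, here is a softmax-free (linear) attention layer. The feature map used here (elu + 1) is one common choice and an assumption on my part; the point is only that, once the softmax is gone, the matrix products can be re-associated so the cost is linear rather than quadratic in sequence length.

```python
import numpy as np

def feature_map(x):
    # One common choice of kernel feature map; Taylor expansions and other maps also work.
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, always positive

def linear_attention(q, k, v):
    """Non-causal linear attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), computable in O(L * d^2) instead of O(L^2 * d)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                          # (d, d): a summary of keys and values
    z = phi_q @ phi_k.sum(axis=0)             # (L,): per-token normalizer
    return (phi_q @ kv) / z[:, None]

rng = np.random.default_rng(0)
L, d = 128, 16
out = linear_attention(rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)))
print(out.shape)  # (128, 16), with no L x L matrix ever materialized
```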
So the content here includes time-invariant systems. I could explain these points in detail, but there are models that can switch between convolutional and recurrent forms, and they are also equipped with additional gating mechanisms. A project I participated in grew out of the third type of architecture I just mentioned. What we're really trying to do is create an architecture with the best performance per floating point operation. One principle that we have verified repeatedly is that combining different layers, different categories of modules, and even full attention layers seems to yield something better than the individual components.
So we are trying to understand more deeply the compositional aspects of these models. This understanding helps us create pretrained models with better performance per floating point operation. Using this model, we ran a set of scaling-law experiments. Hybridization also gave us some advantages, because we wanted something that could be used out of the box, and it made the process much simpler.
When fine-tuning for longer contexts, we can adopt some of the techniques developed for Transformers. Surprisingly, these techniques work equally well on hybrids, for example linear scaling of rotary embeddings and so on. If you're interested in the details, you can dig deeper. So this project is mainly an experimental attempt to figure out how far we can go in the current environment.
Nathan Lambert: Striped Hyena was optimized using a new set of model grafting techniques that let you change the model architecture during training. To me it feels like there's a lot going on behind the scenes, some of which you probably can't talk about too much, like the data.
Regarding the data, I think there are still some things that are not well explained, especially some of the longer-context data. I wonder if you can explain what these data mean from a model perspective? Even a quick summary would be great. There's a lot of cool work in this field and a lot of new projects going on in AI; for example, some people are trying to take the Llama model apart and continue training it. It's a bit wild: people take powerful models and try to make them smaller while still getting the same performance benefits as the larger models. This is somewhat off topic, but what I didn't expect is that when you follow social media, you see people say, oh, in the end the stateful, non-attention model still won. In my opinion, this statement obscures a lot of interesting details. Okay, let's get back to Mamba. If I remember correctly, the largest model in the Mamba suite is 2.8 billion parameters, and its scores on the NLP benchmarks, compared against GPT-J and the Pythia model suite, are very strong.
Tri Dao: Mamba was a collaboration between me and Albert Gu, who was a doctoral student at Stanford University at the time, which is where we met, and who is now an assistant professor at CMU. It was a wonderful collaboration, and I owe Mamba's success to him. Albert has been committed to research on state space models; in a sense, as mentioned earlier, he has worked on linear attention, linear RNNs, convolutions, neural networks, and related areas.
In several projects I participated in in the past, I also worked on state space models, from the perspective of making them more hardware efficient and improving their performance. So it was great to work with Albert Gu. I think the research process with Mamba was more of a proof of concept: can state space models actually be as good as Transformers in the NLP world? Earlier research suggested state space models might be better for audio, but language has always been the hardest thing for state space models to do well. Moreover, language is also the thing people care about most right now, so what I did was more of a proof of concept; we wanted to show that state space models can be competitive, and can even compete with the Transformer. The number of tokens verified in our experiments ranges from 3B to 300B.
So in an absolute sense, these are not very powerful models; they are not the models we really want. I think what we're doing is more of an academic comparison: for example, when trained on the same number of tokens, the state space model may be slightly better than the Transformer.
This is something that's particularly exciting for us and I think Albert has been pushing for this for a while.
The result is that our approach may be faster at inference, and perhaps we will have a different way of understanding how in-context learning occurs. I'm looking forward to the follow-up work.
Nathan Lambert: Can you talk a little bit about what it actually takes to implement these new CUDA kernels, and what they do?
Tri Dao: A state space model is, in a sense, a recurrent neural network. The state size is the buffer you use to store information while traversing or processing a sequence.
In a sense, the Transformer can also be understood this way: the entire history it saves is often called the KV cache. The Transformer retains the history and continuously references it. RNNs have a fixed-size state; for Transformers, you can think of the state size as growing with the sequence. Moreover, our intuition is that the larger the state size, the better the model performs.
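A back-of-the-envelope comparison of the two kinds of "state" Tri contrasts, with illustrative, made-up dimensions (not numbers from the interview): the Transformer's KV cache grows with every token processed, while a recurrent state stays fixed.

```python
# Illustrative memory comparison; all dimensions are hypothetical.
n_layers, n_heads, head_dim = 32, 32, 128   # a 7B-class Transformer, roughly
d_model, d_state = 4096, 16                 # an SSM with a per-channel state

def kv_cache_elems(seq_len):
    # Keys and values cached for every past token, every layer, every head.
    return seq_len * n_layers * n_heads * head_dim * 2

def ssm_state_elems():
    # One fixed-size state per channel per layer, independent of sequence length.
    return n_layers * d_model * d_state

for L in (1_000, 100_000):
    print(L, kv_cache_elems(L), ssm_state_elems())
# The KV cache grows linearly with L; the recurrent state does not.
```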
So, in order to store the information you need to remember, you need more space. Previous models (like S4) had rather large hidden state sizes, and they used the convolutional view to avoid materializing the state.
We wanted to add more input dependence into the recurrence; however, doing so prevents us from using the convolutional view that makes things efficient.
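A rough sketch of the distinction Tri is drawing, using a toy scalar state and a hypothetical parameterization: when the transition and input coefficients are fixed, the recurrence unrolls into a convolution with a precomputable kernel; once they depend on the current input, as in a selective SSM, that fixed kernel no longer exists, so efficiency has to come from somewhere else (in Mamba's case, a hardware-aware scan).

```python
import numpy as np

def lti_ssm(x, a=0.9, b=1.0):
    """Time-invariant recurrence h_t = a*h_{t-1} + b*x_t.
    Equivalent to convolving x with the kernel [b, a*b, a^2*b, ...]."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(h)
    return np.array(ys)

def selective_ssm(x, w_a=1.0, w_b=1.0):
    """Input-dependent recurrence: the decay a_t and input gate b_t are functions
    of the current token, so there is no fixed convolution kernel to precompute."""
    h, ys = 0.0, []
    for xt in x:
        a_t = 1.0 / (1.0 + np.exp(-w_a * xt))   # decay depends on the input
        b_t = 1.0 / (1.0 + np.exp(-w_b * xt))   # so does how much new input is written
        h = a_t * h + b_t * xt
        ys.append(h)
    return np.array(ys)

x = np.random.default_rng(0).normal(size=10)
print(lti_ssm(x)[-1], selective_ssm(x)[-1])
```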
So we had to find a different way to be efficient, and we focused on making it efficient on the GPU. The idea is that we want a large state size, but we don't need to keep it in the main GPU memory (HBM); we can keep the large state in a faster memory called SRAM, which you can think of as a cache. If you're more familiar with CPUs, this is analogous to cache versus RAM.
So, if you have a larger state, you can save it in cache so you don't suffer too much.
Nathan Lambert: My strongest intuition about GPUs versus TPUs right now is that MoE doesn't work as well on TPUs, because you have to add MoE at the base layers.
In distributed training, the feedforward layers may end up spread across different TPU nodes, and TPUs communicate through neighboring nodes, so TPUs are affected more than GPUs in this regard. What do you think will happen in this space in 2024?
Tri Dao: I think the Transformer is still a very powerful architecture that can now be scaled to a trillion parameters, and people often want the best-performing models that run most efficiently on hardware and have the most support in software.
Some new ideas have come up recently, such as state space models. As Michael mentioned, mixing these components seems to improve performance; I think that's been demonstrated at the 7B scale, and maybe state space models can work in even larger models.
Most people are currently focusing on data and infrastructure built around the Llama architecture. Although the existing Transformer architecture is still very powerful and widely supported in production, there are also some fringe areas, such as long context, audio, and genomics, where studying alternative architectures will be very interesting. These areas raise meaningful scientific questions, such as whether models understand instructions and intuition the way humans do, and whether they can work with quantization methods.
In addition, even if people are still using the Transformer architecture now, more new ideas and components may be incorporated in the future, such as adding more layers or new attention mechanisms, even though the result may still be called a Transformer.
In short, although the field of artificial intelligence currently tends to be conservative and focused on established architectures, new architectures and ideas are gradually emerging, and these novel perspectives and methods may bring new impetus and direction to the development of artificial intelligence.
Michael Poli: Yes, I agree 100% with Tri that the attention mechanism is still important as a computational primitive. It is an efficient and convenient way to increase the state capacity of a sequence processor.
There is a trade-off between state dimension and sequence length. As the model gets larger, that is, as the model gets wider, more state and effective sequence length are introduced. As a result, some marginal effects may disappear and some trade-offs will change, especially for very large models, such as 14B, 30B, and so on.
In the future, architectural design will become more interesting and complex, and more innovations will occur. Whether it's hybrid models or the introduction of new modules, we'll see more exciting innovations.
Nathan Lambert: Mixture of experts (MoE) and state space models have recently emerged as a popular trend.
However, in open source and academia, no one had really made early attempts at and improvements to mixture-of-experts models. Model grafting is now becoming more practical.
It is very interesting to follow these developments, and hopefully they will give academics and scientists more ways to influence the industry conversation, especially at a time when industry is more focused on scaling up models. I suggest that open source companies should make specific improvements in their language models to gain commercial advantage.
What else are you focusing on in machine learning? It doesn't have to be about state space models. What are you most excited about for next year?
Tri Dao: I personally think data is still the most important factor. We are taking a deeper look at how data affects model performance, for example through some synthetic tasks that are highly correlated with model performance. This approach has been the main motivation and example in our papers and research work. We will focus on data in the coming period.
While all the architecture work is fun, and making it run efficiently on hardware is fun, in the end it's still about the data. If you understand scaling laws, you know that different model architectures often have the same slope, just different offsets. The only thing that seems to change the slope is the quality of the data.
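A minimal illustration of the slope-versus-offset point Tri makes (all constants here are made up): with a power-law loss, architectures tend to shift the curve up or down in log-log space, while data quality changes how steeply it falls.

```python
import numpy as np

def power_law_loss(N, A, alpha):
    # log L = log A - alpha * log N: alpha is the slope in log-log space,
    # log A is the offset. Constants are purely illustrative.
    return A * N ** (-alpha)

N = np.logspace(7, 10, 4)                             # 10M to 10B parameters
arch_a = power_law_loss(N, A=10.0, alpha=0.07)        # one architecture
arch_b = power_law_loss(N, A=11.0, alpha=0.07)        # another: same slope, shifted offset
better_data = power_law_loss(N, A=10.0, alpha=0.09)   # better data: a steeper slope

for n, a, b, c in zip(N, arch_a, arch_b, better_data):
    print(f"{n:.0e}  {a:.2f}  {b:.2f}  {c:.2f}")
```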
Michael Poli: Yes, data. Data is really interesting, like miniaturizing architectural design, figuring out and breaking down the various aspects involved in tasks like language modeling. We're trying to package them into something that can be used to iterate on, which is very exciting.
I personally am very excited about new applications, especially genomics work, but more from an engineering perspective, we are seeing a shift. Currently, language is still the area that gets the most clicks and the most interest, but I think that will change over time.
Nathan Lambert: Yes, everyone is talking about language, but I think images and video are going to generate huge value. I don't know where the upper limit of language is. I'm excited; I've started experimenting with this, like taking text from a blog and having models convert it into an image and then into a video with audio, all done with a Python script. It's really easy to do. So I agree with you that things beyond language are interesting.
Tri Dao: In your experience, when you piece all these things together, do they actually work reasonably well?
Nathan Lambert: It's not that perfect yet. The images DALL-E generates are fairly similar to each other, but my approach is very simple: just take the text directly, then use a system prompt to have the model generate a variety of images. I think I can do better. As far as I know, within about a year there will be a text-to-video API, and then I'll switch to that API, and it will be a great experience.
Tri Dao: Yes, I think these advances do generate a lot of economic value, and we've already seen that. Many companies are now turning to these technologies. I think it's going to change the way we work and, as you mentioned, the way we play. So it's a very exciting future.
Original link: https://www.interconnects.ai/p/interviewing-tri-dao-and-michael?cnotallow=5d10d34c97637bebcfeba6470c0f0d9b