Mamba shows great promise, but its development is still at an early stage.
Among the many deep learning architectures, the most successful in recent years is the Transformer, which has established a dominant position across many application fields.
A key driver of this success is the attention mechanism, which lets Transformer-based models focus on the relevant parts of the input sequence and thereby achieve better context understanding. The disadvantage of attention, however, is its high computational overhead, which grows quadratically with input size and makes very long texts difficult to process.
Fortunately, a new architecture with great potential emerged some time ago: the structured state space sequence model (SSM). This architecture can efficiently capture complex dependencies in sequence data, making it a strong competitor to the Transformer.
The design of this type of model is inspired by the classical state space model; it can be thought of as a fusion of recurrent neural networks and convolutional neural networks. These models can be computed efficiently with either recurrent or convolutional operations, so the computational overhead scales linearly or near-linearly with sequence length, significantly reducing computational cost.
More specifically, the modeling capabilities of Mamba, one of the most successful variants of SSM, are already comparable to Transformer, while maintaining linear scalability with sequence length.
Mamba first introduces a simple yet effective selection mechanism that reparameterizes the SSM based on the input, allowing the model to retain necessary, relevant information indefinitely while filtering out irrelevant information. Mamba then includes a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution, yielding about a 3x speedup on A100 GPUs.
As shown in Figure 1, with its powerful ability to model complex long-sequence data and its near-linear scalability, Mamba has emerged as a foundation model that is expected to transform research and application fields such as computer vision, natural language processing, and healthcare.
As a result, the literature on Mamba research and applications is growing so rapidly that it is hard to keep up, and a comprehensive survey is of great value. Recently, a research team from the Hong Kong Polytechnic University published such a survey on arXiv.
- Paper title: A Survey of Mamba
- Paper address: https://arxiv.org/pdf/2408.01129
This survey examines Mamba from multiple angles. It can help beginners learn the basic working mechanism of Mamba and help experienced practitioners keep up with the latest progress.
Mamba is a popular research direction, and many teams are therefore writing survey reports. In addition to the one introduced in this article, there are other surveys focusing on state space models or visual Mamba; for details, please refer to the corresponding papers:
- Mamba-360: Survey of state space models as transformer alternative for long sequence modeling: Methods, applications, and challenges. arXiv:2404.16112
- State space model for new-generation network alternative to transformers: A survey. arXiv:2404.09516
- Vision Mamba: A Comprehensive Survey and Taxonomy. arXiv:2405.04404
- A survey on visual Mamba. arXiv:2404.15956
Mamba combines the recurrent framework of recurrent neural networks (RNN), the parallel computation and attention mechanism of the Transformer, and the linear properties of the state space model (SSM). Therefore, to fully understand Mamba, it is necessary to first understand these three architectures.
Recurrent neural networks (RNN) have the ability to retain internal memory, so they are very good at processing sequence data.
Specifically, at each discrete time step k, a standard RNN processes an input vector together with the hidden state from the previous time step, outputs another vector, and updates the hidden state. This hidden state serves as the RNN's memory, retaining information about inputs seen in the past, and this dynamic memory allows RNNs to handle sequences of varying lengths.
That is, an RNN is a non-linear recurrent model that can effectively capture temporal patterns by using the historical knowledge stored in its hidden state.
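To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN step; the tanh non-linearity, weight names, and dimensions are illustrative assumptions rather than the survey's notation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b_h, b_y):
    """One step of a vanilla RNN: update the hidden state, then emit an output."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)  # new hidden state (the RNN's "memory")
    y_t = W_y @ h_t + b_y                          # output for this time step
    return y_t, h_t

# Process a sequence of length 5 with 4-dim inputs and an 8-dim hidden state.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out, T = 4, 8, 3, 5
W_x = rng.normal(size=(d_hidden, d_in))
W_h = rng.normal(size=(d_hidden, d_hidden))
W_y = rng.normal(size=(d_out, d_hidden))
b_h, b_y = np.zeros(d_hidden), np.zeros(d_out)

h = np.zeros(d_hidden)                  # initial hidden state
for x_t in rng.normal(size=(T, d_in)):  # sequential: each step depends on the previous one
    y_t, h = rnn_step(x_t, h, W_x, W_h, W_y, b_h, b_y)
```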
The Transformer's self-attention mechanism helps capture global dependencies among inputs by assigning each position a weight based on its importance relative to the other positions. More specifically, a linear transformation is first applied to the original input to convert the sequence x of input vectors into three types of vectors: queries Q, keys K, and values V.
Then the normalized attention scores S = QK^T / √d_k are computed, and the attention weights are obtained by applying a softmax to S.
In addition to computing a single attention function, we can also use multi-head attention. This allows the model to capture different types of relationships and understand the input sequence from multiple perspectives. Multi-head attention processes the input sequence in parallel with multiple sets of self-attention modules; each head operates independently and performs the same computation as the standard self-attention mechanism.
After that, the attention weights of each head are aggregated and combined to obtain the weighted sum of the value vectors. This aggregation step allows the model to use information from multiple heads and capture many different patterns and relationships in the input sequence.
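As a rough illustration of the mechanism described above, the following NumPy sketch implements scaled dot-product attention and a simple multi-head wrapper; the shapes, the 1/√d_k scaling, and the projection names follow the standard Transformer formulation and are not taken from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weights = softmax(Q K^T / sqrt(d_k))."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # normalized attention scores S
    weights = softmax(scores, axis=-1)               # attention weights
    return weights @ V                               # weighted sum of the value vectors

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Split the projections into heads, attend per head, concatenate, then project."""
    L, d_model = x.shape
    d_head = d_model // n_heads
    def split(M):  # (L, d_model) -> (n_heads, L, d_head)
        return (x @ M).reshape(L, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    heads = attention(Q, K, V)                         # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)
    return concat @ W_o                                # aggregate information from all heads

# Example: a sequence of 6 tokens, model width 16, 4 heads.
rng = np.random.default_rng(0)
L, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(L, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)   # shape (6, 16)
```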
The state space model (SSM) is a classical mathematical framework for describing the dynamic behavior of a system over time. In recent years, SSMs have been widely used in fields as diverse as control theory, robotics, and economics.
At its core, an SSM describes the behavior of a system through a set of hidden variables called the "state", which allows it to capture temporal dependencies effectively. Unlike an RNN, an SSM is a linear model with associative properties. Specifically, the classic state space model constructs two key equations, a state equation h'(t) = A h(t) + B x(t) and an observation equation y(t) = C h(t), to model the relationship between the input x and the output y at the current time t through an N-dimensional hidden state h(t).
To meet the needs of machine learning, the SSM must undergo a discretization process that converts its continuous parameters into discrete ones. In general, discretization methods aim to divide continuous time into K discrete intervals with as equal an integrated area as possible. One of the most representative solutions adopted by SSMs is the zero-order hold (ZOH), which assumes that the function value remains constant over each interval Δ = [t_{k-1}, t_k]. A discretized SSM has a structure similar to a recurrent neural network, so it can carry out inference more efficiently than Transformer-based models.
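The sketch below illustrates ZOH discretization under the standard formulas used in the S4/Mamba line of work, A_bar = exp(ΔA) and B_bar = (ΔA)^(-1)(exp(ΔA) - I)ΔB; the toy state matrix and dimensions are assumptions for illustration only.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order hold: treat the input as constant over each interval of length delta.

    Continuous SSM:  h'(t) = A h(t) + B x(t),  y(t) = C h(t)
    Discrete SSM:    h_k   = A_bar h_{k-1} + B_bar x_k
    with A_bar = exp(delta * A)
    and  B_bar = (delta * A)^{-1} (exp(delta * A) - I) (delta * B).
    """
    n = A.shape[0]
    A_bar = expm(delta * A)
    B_bar = np.linalg.solve(delta * A, A_bar - np.eye(n)) @ (delta * B)
    return A_bar, B_bar

# Toy example: a 4-dimensional hidden state and a single input channel.
rng = np.random.default_rng(0)
n = 4
A = -np.eye(n) + 0.1 * rng.normal(size=(n, n))   # an (approximately) stable state matrix
B = rng.normal(size=(n, 1))
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
```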
Discrete SSM is a linear system with associative properties, so it can be seamlessly integrated with convolutional calculations.
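A small sketch of this equivalence, assuming a single-input single-output SSM with toy parameters: the same discretized system is computed once as a recurrence and once as a causal convolution with the kernel K_bar = (C B_bar, C A_bar B_bar, C A_bar² B_bar, ...), and the two outputs match.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Recurrent view: h_k = A_bar h_{k-1} + B_bar x_k,  y_k = C h_k."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_k in x:
        h = A_bar @ h + (B_bar * x_k).ravel()
        ys.append((C @ h).item())
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, x):
    """Convolutional view: y = x * K_bar, with K_bar = (C B_bar, C A_bar B_bar, ...)."""
    L = len(x)
    K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
    # Causal convolution: y_k = sum_{j <= k} K[k - j] * x[j]
    return np.array([np.dot(K[:k + 1][::-1], x[:k + 1]) for k in range(L)])

rng = np.random.default_rng(0)
N, L = 4, 16
A_bar = 0.5 * np.eye(N) + 0.05 * rng.normal(size=(N, N))
B_bar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

assert np.allclose(ssm_recurrent(A_bar, B_bar, C, x),
                   ssm_convolutional(A_bar, B_bar, C, x))   # both views give the same output
```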
The relationship between RNN, Transformer and SSM
Figure 2 illustrates the computation algorithms of the RNN, the Transformer, and the SSM.
On the one hand, conventional RNN operates based on a non-linear recurrent framework, where each calculation depends only on the previous hidden state and the current input.
Although this form allows RNN to quickly generate output during autoregressive inference, it also makes it difficult for RNN to fully utilize the parallel computing power of the GPU, resulting in slower model training.
On the other hand, the Transformer architecture performs matrix multiplication in parallel on multiple "query-key" pairs, and matrix multiplication can be efficiently allocated to hardware resources, allowing faster training of attention-based models. However, if you want a Transformer-based model to generate responses or predictions, the inference process can be very time-consuming.
Unlike RNNs and Transformers, which each support only one type of computation, the discrete SSM is very flexible: thanks to its linear nature, it supports both recurrent and convolutional computation. This allows SSMs to achieve both efficient inference and parallel training. Note, however, that most conventional SSMs are time-invariant, that is, their A, B, C, and Δ are independent of the model input x. This limits their context-aware modeling capability and causes SSMs to perform poorly on certain tasks such as selective copying.
Mamba-1: Selective state space model using hardware-aware algorithms
Building on the structured state space model, Mamba-1 introduces three innovative techniques: memory initialization based on high-order polynomial projection operators (HiPPO), a selection mechanism, and hardware-aware computation, as shown in Figure 3. These techniques aim to improve the long-range, linear-time sequence modeling capability of SSMs.
Specifically, the initialization strategy can construct a coherent hidden state matrix to effectively promote long-range memory.
Then, the selection mechanism equips the SSM to acquire content-aware representations.
Finally, in order to improve training efficiency, Mamba also includes two hardware-aware computing algorithms: Parallel Associative Scan and Memory Recomputation.
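The sketch below, under assumed shapes and a simplified (Euler-style) discretization of B, illustrates the two ideas together: Δ, B, and C are computed from the input at each step (the selection mechanism), and the resulting recurrence h_t = a_bar_t * h_{t-1} + b_bar_t can be expressed with an associative combine operator, which is what allows a parallel (Blelloch-style) scan. The loop below evaluates the operator sequentially only to check correctness; a real implementation combines the elements in parallel on the GPU.

```python
import numpy as np

def softplus(v):
    return np.log1p(np.exp(v))

rng = np.random.default_rng(0)
L, D, N = 10, 1, 8                    # sequence length, channels (1 for simplicity), state size
x = rng.normal(size=(L, D))
A = -np.exp(rng.normal(size=N))       # negative diagonal state matrix (illustrative initialization)
W_delta = rng.normal(size=(1, D))     # projections that make delta, B, C depend on the input
W_B = rng.normal(size=(N, D))
W_C = rng.normal(size=(N, D))

# Selective recurrence, computed sequentially.
h = np.zeros(N)
a_seq, b_seq, ys = [], [], []
for t in range(L):
    delta = softplus(W_delta @ x[t])[0]   # input-dependent step size
    B_t = W_B @ x[t]                      # input-dependent input projection
    C_t = W_C @ x[t]                      # input-dependent output projection
    a_bar = np.exp(delta * A)             # discretized (diagonal) state matrix
    b_bar = delta * B_t * x[t, 0]         # simplified Euler-style discretization of B, times the input
    a_seq.append(a_bar)
    b_seq.append(b_bar)
    h = a_bar * h + b_bar                 # h_t = a_bar_t * h_{t-1} + b_bar_t
    ys.append(C_t @ h)

# Associative combine operator for that recurrence: (a1, b1) o (a2, b2) = (a1*a2, a2*b1 + b2).
def combine(e1, e2):
    a1, b1 = e1
    a2, b2 = e2
    return a1 * a2, a2 * b1 + b2

acc = (a_seq[0], b_seq[0])
for e in zip(a_seq[1:], b_seq[1:]):
    acc = combine(acc, e)                 # a parallel scan evaluates these combines in O(log L) depth
assert np.allclose(acc[1], h)             # the combined element's second component is the final state
```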
Mamba-2: State Space Duality
Transformer has inspired the development of many different technologies, such as parameter-efficient fine-tuning, catastrophic forgetting mitigation, and model quantization. In order for state space models to also benefit from these techniques originally developed for Transformer, Mamba-2 introduces a new framework: Structured State Space Duality (SSD). This framework theoretically connects SSM and different forms of attention.
Essentially, SSD shows that both the attention mechanism used by Transformer and the linear time-invariant system used in SSM can be viewed as semi-separable matrix transformations.
In addition, Albert Gu and Tri Dao also proved that selective SSM is equivalent to a structured linear attention mechanism implemented using a semi-separable mask matrix.
Mamba-2 designs an SSD-based computation that uses hardware more efficiently, built on a block-decomposition matrix multiplication algorithm.
Specifically, by treating the state space model as a semiseparable matrix through this matrix transformation, Mamba-2 decomposes the computation into matrix blocks, in which the diagonal blocks represent intra-block computation, while the off-diagonal blocks represent inter-block computation carried out through the SSM's hidden state. This approach lets Mamba-2 train 2-8x faster than Mamba-1's parallel associative scan while remaining competitive with the Transformer in performance.
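A minimal sketch of the semiseparable-matrix view, assuming the scalar-decay case (A_t = a_t * I) used in SSD: the recurrence and the matrix form y = Mx, with M[i, j] = <C_i, B_j> * a_{j+1} ... a_i, produce the same output. The block decomposition itself is only described in a comment; this is not Mamba-2's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 8, 4                          # sequence length, state dimension
a = rng.uniform(0.5, 0.9, size=L)    # per-step scalar decay (SSD assumes A_t = a_t * I)
B = rng.normal(size=(L, N))          # per-step input vectors
C = rng.normal(size=(L, N))          # per-step output vectors
x = rng.normal(size=L)

# Recurrent view: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = <C_t, h_t>.
h, y_rec = np.zeros(N), []
for t in range(L):
    h = a[t] * h + B[t] * x[t]
    y_rec.append(C[t] @ h)
y_rec = np.array(y_rec)

# Matrix view: y = M x, where M is lower-triangular and semiseparable:
# M[i, j] = <C_i, B_j> * a_{j+1} * ... * a_i   for j <= i.
M = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        M[i, j] = (C[i] @ B[j]) * np.prod(a[j + 1:i + 1])
assert np.allclose(M @ x, y_rec)

# The SSD algorithm never materializes M explicitly: it partitions M into blocks, computes
# the diagonal (intra-block) parts with dense matrix multiplications, and handles the
# off-diagonal (inter-block) parts through the SSM hidden state, which maps well onto GPUs.
```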
Let’s take a look at the block designs of Mamba-1 and Mamba-2. Figure 4 compares the two architectures.
Mamba-1 is designed around SSM: the selective SSM layer performs the mapping from the input sequence X to Y. In this design, a linear projection is first applied to X, and further linear projections produce (A, B, C). The input tokens and state matrices are then passed through the selective SSM unit, which uses a parallel associative scan, to obtain the output Y. Afterwards, Mamba-1 adopts a skip connection to encourage feature reuse and to mitigate the performance degradation that often occurs during model training. Finally, the Mamba model is built by stacking this block alternately with standard normalization and residual connections.
As for Mamba-2, the SSD layer is introduced to create a mapping from [X, A, B, C] to Y. This is achieved by using a single projection at the start of the block to process [X, A, B, C] simultaneously, similar to how standard attention architectures generate Q, K, V projections in parallel.
In other words, the Mamba-2 block simplifies the Mamba-1 block by removing the sequential linear projections, which allows the SSD structure to be computed faster than Mamba-1's parallel selective scan. In addition, to improve training stability, Mamba-2 adds a normalization layer after the skip connection.
Advances in Mamba models
State space models and Mamba have developed rapidly recently and have become a highly promising choice of backbone for foundation models. Although Mamba performs well on natural language processing tasks, it still has some problems, such as memory loss, difficulty generalizing to different tasks, and weaker performance than Transformer-based language models on complex patterns. To solve these problems, the research community has proposed many improvements to the Mamba architecture. Existing research mainly focuses on modifying the block design, the scan patterns, and memory management. Table 1 summarizes the relevant studies by category.
The design and structure of the Mamba block have a great impact on the overall performance of the Mamba model, and therefore this has become a major research hotspot.
As shown in Figure 5, existing research can be divided into three categories based on different methods of building new Mamba modules:
- Integration method: integrate Mamba blocks with other models to balance effectiveness and efficiency;
- Replacement method: use Mamba blocks to replace the main layers in other model frameworks;
- Modification method: modify the components within the classic Mamba block.
Parallel associative scanning is a key component of the Mamba model. Its goal is to solve the computational problems caused by the selection mechanism, speed up training, and reduce memory requirements. It achieves this by exploiting the linear nature of time-varying SSMs to design kernel fusion and recomputation at the hardware level. However, Mamba's unidirectional sequence modeling paradigm is not well suited to comprehensively learning from diverse data such as images and videos.
To alleviate this problem, some researchers have explored new, efficient scanning methods to improve the performance of Mamba models and ease their training. As shown in Figure 6, existing work on scan patterns can be divided into two categories:
- Flat scanning: views the token sequence from a flattened perspective and processes the model input accordingly (a toy sketch of such a scan for images appears below);
- Stereoscopic scanning: scans the model input across dimensions, channels, or scales, and can be further divided into hierarchical scanning, spatiotemporal scanning, and hybrid scanning.
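As a toy illustration of a flat-scan recipe for images (the exact set of scan directions varies across visual Mamba variants; the function name and shapes here are assumptions), the sketch below flattens a patch grid into four one-dimensional orders, each of which would be fed to its own SSM branch.

```python
import numpy as np

def cross_scan(image_tokens, H, W):
    """Produce four 1-D scan orders of a flattened H x W token grid:
    row-major, column-major, and their reverses."""
    grid = image_tokens.reshape(H, W, -1)
    row_major = grid.reshape(H * W, -1)                     # left-to-right, top-to-bottom
    col_major = grid.transpose(1, 0, 2).reshape(H * W, -1)  # top-to-bottom, left-to-right
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

# Example: a 4x4 grid of 8-dimensional patch embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
scans = cross_scan(tokens, H=4, W=4)   # four scan orders of the same image
```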
Similar to RNNs, the hidden-state memory inside a state space model effectively stores information from previous steps and therefore has a crucial impact on the overall performance of the SSM. Although Mamba introduces a HiPPO-based method for memory initialization, managing memory inside SSM units remains difficult, including transferring hidden-state information between layers and achieving lossless memory compression.
To this end, some pioneering research has proposed a number of different solutions, including memory initialization, compression, and concatenation.
Adapting Mamba to diverse data
The Mamba architecture is an extension of the selective state space model. It retains the basic characteristics of recurrent models and is therefore well suited as a general foundation model for processing text, time series, speech, and other sequence data.
Moreover, some recent pioneering research has expanded the application scenarios of the Mamba architecture so that it can handle not only sequence data but also non-sequential data such as images and graphs, as shown in Figure 7.
These studies aim to take full advantage of Mamba's excellent ability to capture long-range dependencies while also exploiting its efficiency during learning and inference. Table 2 briefly summarizes these findings.
Sequence data refers to data collected and organized in a specific order, where the order of the data points carries meaning. This survey comprehensively summarizes the applications of Mamba to a variety of sequence data, including natural language, video, time series, speech, and human motion data; see the original paper for details.
Unlike sequence data, non-sequential data does not follow a specific order: its data points can be arranged in any order without significantly changing the meaning of the data. This lack of inherent order is difficult for recurrent models (RNNs, SSMs, etc.) that are specifically designed to capture temporal dependencies in data.
Surprisingly, some recent research has successfully enabled Mamba, a representative SSM, to process non-sequential data efficiently, including images, graphs, and point-cloud data.
To improve AI's perception and scene-understanding capabilities, multiple data modalities can be integrated, such as language (sequential data) and images (non-sequential data). Such integration provides very valuable, complementary information.
Recently, multimodal large language models (MLLMs) have been the most popular research hotspot; they inherit the powerful capabilities of large language models (LLMs), including strong language expression and logical reasoning. Although the Transformer has become the dominant method in this field, Mamba is emerging as a strong contender: its ability to align data from mixed sources and its linear complexity scaling with sequence length make Mamba a promising replacement for the Transformer in multimodal learning.
Applications
Below are some noteworthy applications of Mamba-based models. The team groups these applications into the following categories: natural language processing, computer vision, speech analysis, drug discovery, recommendation systems, and robotics and autonomous systems.
We will not go into detail here; see the original paper.
Challenges and Opportunities
Although Mamba has achieved excellent performance in some areas, Mamba research is still in its infancy overall, and some challenges remain to be overcome. Of course, these challenges are also opportunities:
- How to develop and improve Mamba-based foundation models;
- How to fully implement hardware-aware computing to make the most of hardware such as GPUs and TPUs and improve model efficiency;
- How to improve the trustworthiness of Mamba models, which requires further research on safety and robustness, fairness, explainability, and privacy;
- How to apply new techniques from the Transformer field to Mamba, such as parameter-efficient fine-tuning, catastrophic forgetting mitigation, and retrieval-augmented generation (RAG).