The Vision Transformer (ViT) is a Transformer-based image classification model proposed by Google. Unlike traditional CNN models, ViT represents an image as a sequence and learns the image's structure while predicting its class label. To do this, ViT divides the input image into multiple patches, flattens the pixels of each patch across its channels into a single vector, and then applies a linear projection to reach the desired input dimension; the resulting vectors form the input sequence. Through the Transformer's self-attention mechanism, ViT can capture the relationships between different patches and perform effective feature extraction and classification. This serialized representation of images brings new ideas and results to computer vision tasks.
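As a rough sketch of the patch-splitting and linear-projection step just described, the PyTorch snippet below flattens non-overlapping square patches and projects them to the model dimension. The patch size (16), image size (224), and embedding dimension (768) are illustrative assumptions, not values given in this article.

```python
# Minimal sketch of ViT-style patch embedding (assumed hyperparameters).
import torch
import torch.nn as nn

patch_size = 16      # side length of each square patch (assumed)
embed_dim = 768      # target dimension of the linear projection (assumed)

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)

# Split the image into non-overlapping 16x16 patches via unfold over H and W.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# (1, 3, 14, 14, 16, 16) -> flatten each patch into one vector of length 3*16*16.
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# Linear projection of the flattened patches to the model dimension.
projection = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = projection(patches)           # (1, 196, 768) -- the input sequence
print(tokens.shape)
```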
Vision Transformer models are widely used in image recognition tasks such as object detection, image segmentation, image classification, and action recognition. They are also suitable for generative modeling and multimodal tasks, including visual grounding, visual question answering, and visual reasoning.
Before we delve into how Vision Transformers work, we must understand the basics of attention and multi-head attention in the original Transformer.
The Transformer is a model built around a mechanism called self-attention; it is neither a CNN nor an LSTM, and models built on it significantly outperform those earlier approaches.
The Transformer's attention mechanism uses three variables: Q (Query), K (Key), and V (Value). In short, it computes an attention weight between each Query token and each Key token, and uses those weights to take a weighted sum of the Values associated with the Keys; in other words, the model measures how strongly a Query relates to each Key and aggregates the corresponding Values accordingly.
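The following is a minimal sketch of that Q/K/V computation (scaled dot-product attention) in PyTorch; the tensor shapes are illustrative assumptions.

```python
# Scaled dot-product attention: weights from Q-K similarity, applied to V.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    # Attention weights: similarity between each Query token and each Key token.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the Values associated with each Key.
    return weights @ V                              # (..., seq_q, d_v)

Q = torch.randn(1, 5, 64)   # 5 query tokens, 64-dim features (assumed sizes)
K = torch.randn(1, 7, 64)   # 7 key tokens
V = torch.randn(1, 7, 64)   # one value per key
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 5, 64])
```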
The computation above defines a single attention head. In the multi-head attention mechanism, each head i has its own projection matrices W_i^Q, W_i^K, and W_i^V, and computes attention weights from the features projected by these matrices.
The multi-head attention mechanism lets the model attend to different parts of the sequence in different ways at the same time (a code sketch follows the two points below). This means:
The model can better capture positional information, because each head focuses on a different part of the input; combined, the heads provide a more powerful representation.
Each head also captures different contextual information by relating words in its own distinctive way.
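Here is a rough sketch of multi-head attention as described above: each head effectively owns its projection matrices W_i^Q, W_i^K, W_i^V and attends independently. The dimensions and the final output projection are assumptions for illustration, not details from the article.

```python
# Multi-head attention: per-head projections, independent attention, concat.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One fused projection per Q/K/V, reshaped per head
        # (equivalent to separate W_i^Q, W_i^K, W_i^V for each head).
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        # Project, then split the feature dimension across heads.
        q = self.w_q(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        # Each head computes its own attention weights and weighted values.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = F.softmax(scores, dim=-1) @ v            # (b, heads, n, d_head)
        # Concatenate the heads and mix them with a final linear layer.
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.w_o(out)

x = torch.randn(2, 10, 512)            # batch of 2 sequences, 10 tokens each
print(MultiHeadAttention()(x).shape)   # torch.Size([2, 10, 512])
```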
Now that we know how the Transformer works, let's return to the Vision Transformer model.
The Vision Transformer is a model that applies the Transformer to image classification tasks. It was proposed in October 2020. Its architecture is almost identical to the original Transformer's, adapted so that images can be handled as input in the same way as text in natural language processing.
The Vision Transformer uses the Transformer encoder as its base model to extract features from the image, and passes those features to a multi-layer perceptron (MLP) head for classification. Because the computational load of the base Transformer is already very large, the Vision Transformer decomposes the image into square patches, a lightweight "windowing" of the attention computation that keeps the cost manageable.
The image is therefore converted into square patches, which are flattened and passed through a single feed-forward layer to obtain the linear patch projections. To support classification, a learnable class embedding is concatenated with the other patch projections.
These patch projections, together with position embeddings, form a larger matrix that is then passed through the Transformer encoder. The encoder's output is fed to the multi-layer perceptron for image classification. Because the input features capture the essence of the image well, the classification task of the MLP head becomes much simpler.
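Putting the pieces together, the sketch below illustrates that pipeline end to end: linear patch projection, a learnable class token, position embeddings, a standard PyTorch Transformer encoder, and an MLP head. The hyperparameters (patch size 16, width 256, 6 layers, 10 classes) are placeholder assumptions, not values given here.

```python
# Compact ViT-style forward pass with assumed hyperparameters.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=256,
                 depth=6, heads=8, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.patch_size = patch_size
        self.patch_proj = nn.Linear(patch_dim, dim)             # linear patch projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable class embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.mlp_head = nn.Linear(dim, num_classes)             # classification head

    def forward(self, images):
        b, c, h, w = images.shape
        p = self.patch_size
        # Split into square patches and flatten each one.
        x = images.unfold(2, p, p).unfold(3, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        x = self.patch_proj(x)
        # Prepend the class token and add position embeddings.
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Classify from the class token's output representation.
        return self.mlp_head(x[:, 0])

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 10])
```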
While ViT shows excellent potential for learning high-quality image features, it fares worse in terms of performance and accuracy gains relative to its cost; a small improvement in accuracy does not justify ViT's inferior runtime.