Deep learning image segmentation: an overview of network structure design

This article summarizes the innovations in network structure when using CNNs for image semantic segmentation. These innovations mainly include the design of new neural architectures (different depths, widths, connections and topologies) and the design of new components or layers. The former uses existing components to assemble complex large-scale networks, while the latter prefers to design underlying components. First, we introduce some classic semantic segmentation networks and their innovations, and then introduce some applications of network structure design in the field of medical image segmentation.

1. Image semantic segmentation network structure innovation

1.1 FCN network

FCN overall architecture

Simplified diagram

The FCN network is listed separately because it was the first network to approach semantic segmentation from a new perspective. Earlier neural-network-based semantic segmentation methods took an image patch centered on the pixel to be classified and predicted the label of that central pixel; the network was generally built with a CNN + FC strategy. Obviously, this approach cannot exploit the global context of the image, and pixel-by-pixel inference is very slow. FCN instead abandons the fully connected (FC) layers and builds the network entirely from convolutional layers. Through transposed convolution and the fusion of features from different layers, the network directly outputs the prediction mask for the input image, and both efficiency and accuracy are greatly improved.


Schematic diagram of feature fusion of different layers of FCN

Innovation points: fully convolutional network (no FC layers); transposed convolution (deconv, "deconvolution"); skip connections between feature maps of different layers (by addition)
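To make the two FCN ideas concrete, here is a minimal NumPy sketch of a single-channel transposed convolution followed by an additive skip fusion. The array sizes and the all-ones kernel are assumptions chosen for illustration; in a real FCN the kernel is learned and the maps have many channels.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride):
    """Single-channel transposed convolution: every input pixel stamps a
    scaled copy of the kernel onto the (larger) output grid."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

# Hypothetical score maps: a coarse one from a deep layer, a finer one from a shallower layer.
coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
fine = np.ones((4, 4))

# With a 2x2 kernel of ones and stride 2 this reduces to nearest-neighbour upsampling.
upsampled = transposed_conv2d(coarse, np.ones((2, 2)), stride=2)

# FCN-style skip connection: element-wise addition of same-sized maps.
fused = upsampled + fine
print(fused)
```

The output size follows the usual rule (input - 1) * stride + kernel, so a 2x2 map becomes 4x4 here and can be added to the shallower map directly.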

1.2 Encoder-decoder structure

SegNet follows basically the same idea as FCN. The encoder uses the first 13 convolutional layers of VGG16. The difference lies in how the decoder upsamples. FCN obtains the upsampled result by applying deconv to a feature map and adding the result to the encoder feature map of the corresponding size, whereas SegNet upsamples in the decoder using the max-pooling indices recorded in the encoder (original description: the decoder upsamples the lower resolution feature input maps. Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling.).

Innovation point: Encoder-Decoder structure; Pooling indices.


SegNet Network

Comparison of the upsampling methods of SegNet and FCN
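The pooling-index idea can be sketched in a few lines of NumPy. This is an illustrative single-channel version with hypothetical helper names (`max_pool_with_indices`, `max_unpool`), not SegNet's actual implementation: the encoder remembers where each maximum came from, and the decoder puts values back at exactly those positions.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also returns, for each window,
    the flat index (into x) of the maximum."""
    h, w = x.shape
    pooled = np.zeros((h // size, w // size))
    indices = np.zeros((h // size, w // size), dtype=int)
    for i in range(h // size):
        for j in range(w // size):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i, j] = window[r, c]
            indices[i, j] = (i * size + r) * w + (j * size + c)
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """Put each pooled value back at its remembered position; the rest stays zero."""
    out = np.zeros(out_shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [1, 1, 4, 0]], dtype=float)
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx, x.shape)
print(pooled)
print(restored)
```

The unpooled map is sparse, which is why SegNet follows it with convolutions to densify the result. No upsampling weights are learned, in contrast to FCN's deconv.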

Innovation points (U-Net): U-shaped structure; skip connections

U-Net network

The V-Net network structure is similar to U-Net, except that it adds skip connections (residual connections within each stage) and replaces 2D operations with 3D operations in order to process 3D (volumetric) images. It is also trained to directly optimize a widely used segmentation metric, Dice.


V-Net Network

Innovation point: essentially a 3D version of the U-Net network
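Optimizing Dice directly means using a differentiable ("soft") Dice score as the training objective. The sketch below uses one common formulation with squared terms in the denominator; the smoothing constant `eps` and the toy volumes are assumptions for illustration.

```python
import numpy as np

def soft_dice(pred, target, eps=1e-7):
    """Soft Dice coefficient: 2 * sum(p * g) / (sum(p^2) + sum(g^2))."""
    pred = pred.ravel()
    target = target.ravel()
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred ** 2) + np.sum(target ** 2) + eps)

def dice_loss(pred, target):
    """Loss to minimise: 0 for a perfect overlap, close to 1 for no overlap."""
    return 1.0 - soft_dice(pred, target)

# Toy 3D volume with 8 foreground voxels.
target = np.zeros((4, 4, 4))
target[1:3, 1:3, 1:3] = 1.0

perfect = target.copy()
half = np.zeros_like(target)
half[1:3, 1:3, 1:2] = 1.0   # covers 4 of the 8 foreground voxels

print(dice_loss(perfect, target))
print(soft_dice(half, target))
```

Because the score is a ratio of overlaps, it is less sensitive to the foreground/background imbalance that is typical of volumetric medical data than a per-voxel cross-entropy.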

FC-DenseNet (One Hundred Layers Tiramisu network; paper title: The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation) combines Dense Blocks with the U-Net architecture. The simplest version of this network consists of a downsampling path with two Transitions Down and an upsampling path with two Transitions Up. It also contains two horizontal skip connections that concatenate the feature maps from the downsampling path with the corresponding feature maps in the upsampling path. The connection patterns in the upsampling and downsampling paths are not exactly the same: in the downsampling path, there is a skip concatenation path around each dense block, so the number of feature maps grows linearly, while in the upsampling path there is no such operation. (One more thing: this network could be abbreviated as Dense UNet, but there is a paper called Fully Dense UNet for 2D Sparse Photoacoustic Tomography Artifact Removal, which is about removing artifacts in photoacoustic imaging. I have seen many blogs borrow the illustrations from that paper when talking about semantic segmentation, which is not the same thing at all =_=||, so just be sure to tell them apart.)

FC-DenseNet (Hundred-Layer Tiramisu Network)

Innovation point: Integration of DenseNet and U-Net (from the perspective of information exchange, dense connections are indeed more powerful than residual structures)
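The linear growth in feature maps is easy to check with a little arithmetic. The helper below is a hypothetical sketch, and the channel numbers are illustrative rather than taken from the paper: each layer in a dense block adds `growth_rate` maps, and the block input is concatenated to the output only in the downsampling path.

```python
def dense_block_channels(in_channels, num_layers, growth_rate, concat_input=True):
    """Return the input width seen by each layer of a dense block and the
    block's output width. concat_input=True models the downsampling path,
    where the block input is concatenated with the newly created maps."""
    per_layer_inputs = [in_channels + i * growth_rate for i in range(num_layers)]
    new_maps = num_layers * growth_rate
    out_channels = in_channels + new_maps if concat_input else new_maps
    return per_layer_inputs, out_channels

inputs, down_out = dense_block_channels(48, num_layers=4, growth_rate=16)
_, up_out = dense_block_channels(48, num_layers=4, growth_rate=16, concat_input=False)

print(inputs)    # [48, 64, 80, 96]
print(down_out)  # 112
print(up_out)    # 64
```

Dropping the input concatenation in the upsampling path keeps the channel count from exploding once the spatial resolution grows again.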

1) DeepLabV1: Fusion of a convolutional neural network and a probabilistic graphical model (CNN + CRF), which improves segmentation localization accuracy;


2) DeepLabV2: ASPP (atrous spatial pyramid pooling); CNN + CRF


3) DeepLabV3: Improved ASPP, adding 1×1 convolution and global average pooling; compares the effects of cascaded and parallel atrous convolutions.


Cascade Atrous Convolution


Parallel Atrous Convolution (ASPP)

4) DeepLabV3+: Adds the encoder-decoder idea, extending DeepLabv3 with a decoder module; applies depthwise separable convolution to the ASPP and decoder modules; uses an improved Xception as the backbone.

DeepLabV3+
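Why depthwise separable convolution helps can be seen from a simple weight count. The sketch below compares a standard convolution with a depthwise convolution followed by a 1×1 pointwise convolution; the layer size (3×3, 256 to 256 channels) is an assumption chosen for illustration and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1x1 pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 256, 256)
separable = depthwise_separable_params(3, 256, 256)

print(standard)              # 589824
print(separable)             # 67840
print(standard / separable)  # roughly 8.7x fewer weights
```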

In general, the core contributions of the DeepLab series are: dilated (atrous) convolution; ASPP; CNN + CRF (only V1 and V2 use a CRF; presumably V3 and V3+ resolve blurred segmentation boundaries through deeper networks, with better results than adding a CRF)
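Dilated convolution is easiest to see in one dimension. The following is a minimal sketch, not DeepLab code: the same three weights are applied with gaps of `rate` samples, so the receptive field grows without adding parameters.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """'Valid' 1-D convolution (cross-correlation form, as in deep learning
    frameworks) whose taps are `rate` samples apart."""
    k = len(kernel)
    span = (k - 1) * rate + 1            # effective kernel size
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = np.dot(x[i:i + span:rate], kernel)
    return out

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])

y1 = dilated_conv1d(x, kernel, rate=1)   # each output sees 3 neighbouring samples
y2 = dilated_conv1d(x, kernel, rate=2)   # each output spans 5 samples with the same 3 weights

print(y1)
print(y2)
```

ASPP applies several such convolutions with different rates in parallel to the same feature map and combines the results, which is how it captures context at multiple scales.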

PSPNet (pyramid scene parsing network) improves the network's ability to use global context by aggregating context information from different regions. In SPPNet, the feature maps of different levels produced by pyramid pooling are flattened and concatenated, and then fed to a fully connected layer for classification, which removes the CNN requirement of a fixed input size for image classification. In PSPNet, the strategy is pooling-conv-upsample; the results are then concatenated into a single feature map, on which label prediction is performed.

PSPNet network

Innovation point: Multi-scale pooling, to better leverage global image-level prior knowledge for understanding complex scenes
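The pooling-conv-upsample-concat pipeline can be sketched with NumPy as follows. This is a simplified illustration under several assumptions: the bin sizes (1, 2, 3, 6), random matrices standing in for learned 1×1 convolutions, and nearest-neighbour upsampling in place of interpolation.

```python
import numpy as np

def avg_pool_to_bins(x, bins):
    """Average-pool a (C, H, W) map to (C, bins, bins); H and W must be divisible by bins."""
    c, h, w = x.shape
    return x.reshape(c, bins, h // bins, bins, w // bins).mean(axis=(2, 4))

def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbour upsampling of a (C, h, w) map by integer factors."""
    c, h, w = x.shape
    return x.repeat(out_h // h, axis=1).repeat(out_w // w, axis=2)

def pyramid_pooling(x, bin_sizes=(1, 2, 3, 6), rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    c, h, w = x.shape
    reduced = c // len(bin_sizes)
    branches = [x]
    for bins in bin_sizes:
        pooled = avg_pool_to_bins(x, bins)                     # pooling
        weight = rng.standard_normal((reduced, c))             # stand-in for a learned 1x1 conv
        projected = np.einsum("oc,chw->ohw", weight, pooled)   # conv (1x1)
        branches.append(upsample_nearest(projected, h, w))     # upsample
    return np.concatenate(branches, axis=0)                    # concat along channels

features = np.random.default_rng(1).standard_normal((8, 12, 12))
out = pyramid_pooling(features)
print(out.shape)  # (16, 12, 12)
```

The 1-bin branch is a global average, so every pixel of the concatenated map carries an image-level summary next to its local features.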

RefineNet refines intermediate activation maps and connects them hierarchically in order to combine multi-scale activations while preventing loss of sharpness. The network consists of independent Refine modules, each of which consists of three main blocks: Residual Convolution Unit (RCU), Multi-Resolution Fusion (MRF) and Chained Residual Pooling (CRP). The overall structure is somewhat similar to U-Net, but a new way of combining features is designed at the skip connections (not a simple concat). Personally, I think this structure is very suitable as a starting point for designing your own network: you can add many CNN modules used in other CV problems, and with U-Net as the overall framework, the result will not be too bad.

RefineNet network

Innovation point: Refine module

1.3 Reduce the computational complexity of the network structure

There is also a lot of work dedicated to reducing the computational complexity of semantic segmentation networks. Methods for simplifying deep network structures include tensor decomposition, channel/network pruning, and sparse connections. Some work also uses NAS (Neural Architecture Search) instead of manual design to search for the structure of modules or of the entire network. Of course, the GPU resources required by AutoDL will put a lot of people off. Therefore, some people use random search to find much smaller ASPP modules and then build the whole network model from these small modules.

Lightweight network design is the consensus in the industry. For mobile deployment, it is impossible to equip every device with a 2080 Ti. In addition, power consumption, storage and other issues also limit the adoption of a model. However, if 5G becomes widespread and all data can be processed in the cloud, things will get very interesting. Of course, in the short term (within ten years), we do not know whether full-scale deployment of 5G is feasible.

1.4 Network structure based on attention mechanism

The attention mechanism can be defined as: using information from subsequent layers/feature maps to select and locate the most discriminative (or salient) part of the input feature map. It can simply be thought of as a way of weighting feature maps (the weights are computed by the network). Depending on what the weights act on, it can be divided into the channel attention mechanism (CA) and the spatial attention mechanism (PA). The FPA (Feature Pyramid Attention) network is an attention-based semantic segmentation network that combines the attention mechanism with a spatial pyramid to extract precise features for pixel-level labeling, without using dilated convolutions or hand-designed decoder networks.
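As an illustration of "weighting feature maps with weights computed by the network", here is a small channel-attention sketch in the squeeze-and-excitation style. This particular design is an assumption chosen for clarity, not the FPA module, and the random matrices `w1` and `w2` stand in for learned weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Re-weight the channels of a (C, H, W) feature map.
    w1: (C // r, C) and w2: (C, C // r) play the role of two small learned layers."""
    squeezed = x.mean(axis=(1, 2))             # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck + ReLU
    weights = sigmoid(w2 @ hidden)             # one weight in (0, 1) per channel
    return x * weights[:, None, None], weights

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))             # hypothetical feature map with 4 channels
w1 = rng.standard_normal((2, 4))
w2 = rng.standard_normal((4, 2))

out, weights = channel_attention(x, w1, w2)
print(weights)
```

Spatial attention follows the same pattern with the roles swapped: the weights form an (H, W) map that is broadcast over all channels.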

1.5 Network structure based on adversarial learning

Goodfellow et al. proposed an adversarial method for learning deep generative models in 2014. Generative adversarial networks (GANs) train two models at the same time: a generative model G that captures the distribution of the data, and a discriminative model D that estimates the probability that a sample came from the training data.

● G is a generative network, which receives a random noise z (random number), and generates an image through this noise

● D is a discriminative network, which determines whether an image is "real" or not. Its input is x (an image), and the output D(x) is the probability that x is a real image. An output of 1 means the image is certainly real; an output of 0 means it cannot be a real image.

G's training procedure is to maximize the probability of D making a mistake. It can be proved that in the space of arbitrary functions G and D, a unique solution exists in which G reproduces the training data distribution and D equals 0.5 everywhere. During training, the goal of the generator G is to produce images realistic enough to deceive the discriminator D, while the goal of D is to distinguish the fake images generated by G from the real ones. In this way, G and D form a dynamic "game", and the final equilibrium point is the Nash equilibrium. When G and D are defined by neural networks, the entire system can be trained with backpropagation.


GANs network structure diagram

Inspired by GANs, Luc et al. trained a semantic segmentation network (G) together with an adversarial network (D); the adversarial network distinguishes segmentation maps that come from the ground truth from those produced by the segmentation network (G). G and D keep playing this game and learning, and their loss functions are defined as:


GANs loss function


Review the original GAN loss function: The loss function of GANs embodies the idea of a zero-sum game. The loss function of the original GANs is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The loss is computed at the output of D (the discriminator), and the output of D is generally a fake/real judgment, so the whole expression can be regarded as a binary cross-entropy function. It can be seen from the form of the GAN loss function that training is divided into two parts:

The first is the max_D part, because training generally first trains D while keeping G (the generator) fixed. The training goal of D is to correctly distinguish fake from real. If we use 1/0 to represent real/fake, then for the first expectation term, the input is sampled from real data, so we want D(x) to approach 1, which makes the first term larger. In the same way, the input of the second expectation term is sampled from data generated by G, so we want D(G(z)) to approach 0, which again makes the second term larger. This part therefore trains to make the whole expression larger, which is the meaning of max_D. Only the parameters of D are updated in this part.

The second part keeps D fixed (no parameter update) and trains G. At this point only the second expectation term matters. The key is here: because we want to confuse D, the label is now set to 1 (we know the sample is fake, which is why it is called confusing). We want the output D(G(z)) to be close to 1, that is, the smaller this term, the better. This is min_G. Of course, the discriminator is not so easy to fool, so it will produce a relatively large error. That error updates G, and G then gets better: I didn't fool you this time, so I can only work harder next time (quoted from https://www.cnblogs.com/walter-xh/p/10051634.html). Only the parameters of G are updated in this part.
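The two training steps can be written down as two small loss functions. This is a sketch that operates on hypothetical batches of discriminator outputs rather than on real networks; `gan_value` estimates E[log D(x)] + E[log(1 - D(G(z)))], and the generator loss uses the label-flipping form described above.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def discriminator_loss(d_real, d_fake):
    """D maximises V, i.e. minimises binary cross-entropy with labels real=1, fake=0."""
    return -gan_value(d_real, d_fake)

def generator_loss(d_fake):
    """G is trained with D fixed: binary cross-entropy between D(G(z)) and the label 1."""
    return -np.mean(np.log(d_fake))

# Hypothetical discriminator outputs for a batch of real and generated samples.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.3])

print(discriminator_loss(d_real, d_fake))   # small: D separates the two batches well
print(generator_loss(d_fake))               # large: G is not fooling D yet

# At the theoretical equilibrium D outputs 0.5 everywhere, so V = -log 4.
print(gan_value(np.array([0.5]), np.array([0.5])))
```

In an actual training loop the first loss would update only the parameters of D and the second only those of G, alternating between the two.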

Looking at GANs from another perspective, the discriminator (D) is equivalent to a special loss function (composed of a neural network, different from traditional L1, L2, cross-entropy and other loss functions).

In addition, GANs need a special training procedure and suffer from problems such as vanishing gradients and mode collapse (there seem to be ways to address these at present), but the design idea is indeed a great invention of the deep learning era.

1.6 Summary

Most of the image semantic segmentation models based on deep learning follow the encoder-decoder architecture, such as U-Net. Research results in recent years have shown that dilated convolution and feature pyramid pooling can improve U-Net style network performance. In Section 2, we summarize how these methods and their variants can be applied to medical image segmentation.

2. Application of network structure innovation in medical image segmentation

This section introduces some research results on the application of network structure innovation in 2D/3D medical image segmentation.

2.1 Segmentation method based on model compression

In order to achieve real-time processing of high-resolution 2D/3D medical images (such as CT, MRI and histopathology images), researchers have proposed a variety of model compression methods. Weng et al. applied NAS to the U-Net network and obtained a smaller network with better organ/tumor segmentation performance on CT, MRI and ultrasound images. Brugger redesigned the U-Net architecture using group normalization and Leaky ReLU to make the network more memory-efficient for 3D medical image segmentation. Others have designed dilated convolution modules with fewer parameters. Other model compression methods include weight quantization (16-bit, 8-bit, binary), distillation, pruning, etc.
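As one concrete example of weight quantization, the sketch below applies symmetric per-tensor 8-bit quantization to a random weight matrix. The scheme and the matrix size are assumptions for illustration; the cited works may use different schemes.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return q.astype(np.float32) * scale

weights = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.max(np.abs(dequantize(q, scale) - weights))

print(weights.nbytes, q.nbytes)   # storage shrinks by a factor of 4
print(error, scale / 2)           # rounding error is bounded by half a quantisation step
```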

2.2 Segmentation method of encoding-decoding structure

Drozdzal proposed applying a simple CNN to normalize the original input image before feeding it into the segmentation network, which improved segmentation accuracy on microscopy image segmentation, liver CT, and prostate MRI. Gu proposed using dilated convolution in the backbone network to retain contextual information. Vorontsov proposed an image-to-image network framework that converts images with an ROI into images without an ROI (for example, images with tumors are converted into healthy images without tumors), and then adds the tumors removed by the model to new healthy images in order to capture the detailed structure of the object. Zhou et al. proposed rewiring the skip connections of the U-Net network and tested its performance on nodule segmentation in low-dose chest CT scans, nuclei segmentation in microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Goyal applied DeepLabV3 to dermoscopic color image segmentation to extract skin lesion areas.

2.3 Segmentation method based on attention mechanism

Nie proposed an attention model that segments the prostate more accurately than the baseline models (V-Net and FCN). Sinha proposed a network based on a multi-layer attention mechanism for abdominal organ segmentation in MRI images. Qin et al. proposed a dilated convolution module to preserve more details of 3D medical images. There are many other papers on blood image segmentation based on attention mechanisms.

2.4 Segmentation network based on adversarial learning

Khosravan proposed an adversarial training network for pancreatic segmentation from CT scans. Son uses generative adversarial networks for retinal image segmentation. Xue uses a fully convolutional network as a segmentation network in a generative adversarial framework to segment brain tumors from MRI images. There are other papers that successfully apply GANs to medical image segmentation problems, so I won’t list them one by one.

2.5 RNN-based segmentation model

Recurrent neural networks (RNNs) are mainly used to process sequence data. The long short-term memory network (LSTM) is an improved version of the RNN: it introduces self-loops so that gradients can keep flowing over long durations. In the field of medical image analysis, RNNs are used to model temporal dependencies in image sequences. Bin et al. proposed an image sequence segmentation algorithm that integrates a fully convolutional neural network with an RNN and incorporates information along the time dimension into the segmentation task. Gao et al. used a CNN and an LSTM to model temporal relationships in brain MRI slice sequences to improve segmentation performance in 4D images. Li et al. first used U-Net to obtain an initial segmentation probability map and then used an LSTM to segment the pancreas from 3D CT images, which improved segmentation performance. There are many other papers that use RNNs for medical image segmentation, so I will not introduce them one by one.
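The "self-loop" is the cell-state update in the LSTM step. The following is a minimal NumPy sketch of a single step with randomly initialised weights; the gate ordering, sizes and the idea of feeding per-slice feature vectors are assumptions for illustration, not the setup of the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,); gates ordered i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c_prev + i * g     # the self-loop: the cell state feeds back into itself
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 2                                   # hypothetical feature and hidden sizes
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):         # e.g. feature vectors of 5 consecutive slices
    h, c = lstm_step(x, h, c, W, U, b)
print(h, c)
```

When the forget gate f is close to 1 and the input gate i close to 0, the cell state is carried over almost unchanged, which is what lets information (and gradients) survive across many steps.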

2.6 Summary

This part is mainly about applying segmentation algorithms to medical image segmentation, so there are not many innovation points. It is mainly about the different data formats (CT or RGB, pixel range, image resolution, etc.) and the characteristics of data from different body parts (noise, object shape, etc.): the classic networks need to be adapted to the format and characteristics of the input data so that they can better complete the segmentation task. Although deep learning is a black box, the design of the overall model still has rules to follow. Which strategies solve which problems, and which problems they cause in turn, can be weighed against the specific segmentation problem to achieve the best segmentation performance.

Some references:

1. Deep Semantic Segmentation of Natural and Medical Images: A Review

2. NAS-Unet: Neural architecture search for medical image segmentation. IEEE Access, 7:44247–44257, 2019.

3. Boosting segmentation with weak supervision from image-to-image translation. arXiv preprint arXiv:1904.01636, 2019.

4. Multi-scale guided attention for medical image segmentation. arXiv preprint arXiv:1906.02849, 2019.

5. SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation.

6. Fully convolutional structured LSTM networks for joint 4D medical image segmentation. In 2018 IEEE

7. https://www.cnblogs.com/walter-xh/p/10051634.html
