You can understand the principles of convolutional neural networks even with zero foundation! Super detailed!-AI-php.cn

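If you love technology and are as fascinated by AI as the author is, convolutional neural networks are probably not new to you, and you have probably also puzzled for a long time over such an "advanced" name. Today the author starts from zero and walks into the world of convolutional neural networks to share it with everyone!

Before diving into convolutional neural networks, let's first look at how images work.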


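How images work

In a computer, an image is represented by numbers (0-255), and each number encodes the brightness or color of one pixel. Specifically:

Black-and-white (grayscale) image: each pixel has a single value that ranges from 0 (black) to 255 (white).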


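Color image: each pixel contains three values. The most common scheme is the RGB (Red-Green-Blue) model, in which red, green, and blue light are combined at different intensities to produce all other colors. Each color channel has 256 brightness levels, from 0 to 255, so each channel value can be stored as an 8-bit binary number. For example, (255,0,0) is red, (0,255,0) is green, (0,0,255) is blue, and other combinations give other colors. In a computer, a color image is usually stored as a three-dimensional array or tensor with shape (height, width, depth), where depth is the number of channels; for an RGB image the depth is 3. This means that each pixel position holds three numbers, the brightness of the red, green, and blue channels. For example, a 100x100-pixel RGB image takes 100x100x3 bytes (30,000 bytes) of memory.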
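As a small illustration of this layout, the sketch below builds an RGB image as a nested Python list and reports its shape and raw size. The helper names (`make_rgb_image`, `shape_of`, `raw_size_in_bytes`) are made up for this example, and the byte count assumes one byte per channel value; a real Python list uses far more memory than that.

```python
def make_rgb_image(height, width, color):
    """Return a height x width image where every pixel is the given (R, G, B) triple."""
    return [[list(color) for _ in range(width)] for _ in range(height)]


def shape_of(image):
    """Shape as (height, width, channels)."""
    return (len(image), len(image[0]), len(image[0][0]))


def raw_size_in_bytes(image):
    """Size of the raw pixel data, assuming 8 bits (one byte) per channel value."""
    height, width, channels = shape_of(image)
    return height * width * channels


image = make_rgb_image(100, 100, (255, 0, 0))  # a solid red image
print(shape_of(image))           # (100, 100, 3)
print(raw_size_in_bytes(image))  # 30000
```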


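So what does "an 8-bit binary number" mean here?

In the RGB color model, each color channel (red, green, blue) can take 256 different brightness levels, and each channel is stored in 8 binary digits. The largest 8-bit binary number is 11111111, which is 255 in decimal; the smallest is 00000000, which is 0 in decimal. That gives 2^8 = 256 levels in total.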
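The conversion is easy to check in Python. This is only a sketch; `to_8bit` is a hypothetical helper name, not part of any library.

```python
def to_8bit(value):
    """Format a channel value (0..255) as an 8-digit binary string."""
    if not 0 <= value <= 255:
        raise ValueError("a channel value must be between 0 and 255")
    return format(value, "08b")


print(to_8bit(255))        # 11111111
print(to_8bit(0))          # 00000000
print(int("11111111", 2))  # 255
print(2 ** 8)              # 256 distinct levels
```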

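What is a convolutional neural network (CNN)?

Let's start from a scenario that everyone in computer vision (CV) knows: edge detection. Take an original image of size 10x10, as in the figure below. The left half has larger pixel values and is the bright region; the right half has smaller pixel values and is the dark region. The dividing line in the middle is the edge we want to detect.

So how do we detect the edge? This is where the filter (also called a kernel) comes in. As shown in the figure below, the kernel size is 3x3.

The filter slides over the input image, pauses at each region, multiplies the corresponding elements and sums the products, then slides on to the next region and repeats, until it reaches the last region of the image. This process is convolution.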


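As the figure above shows, the output is light in the middle and dark on both sides, which means the boundary in the original image has been picked out. In summary, edge detection works by convolving the input image with a suitable filter.

The sliding also involves a basic concept, the stride. The example above uses a stride of 1: the filter moves one cell at a time and stops at 8x8 positions, so the final output is an 8x8 matrix.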
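The edge-detection walk-through can be sketched in a few lines of plain Python. The pixel values (10 for bright, 0 for dark) and the vertical-edge kernel below are assumptions chosen for illustration, since the values from the original figures are not available; `conv2d` is a hypothetical helper name.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding); at each stop,
    multiply corresponding elements and sum the products."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for top in range(0, n_rows - k_rows + 1, stride):
        out_row = []
        for left in range(0, n_cols - k_cols + 1, stride):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Assumed 10x10 image: bright left half (10), dark right half (0).
image = [[10] * 5 + [0] * 5 for _ in range(10)]
# Assumed 3x3 vertical-edge filter.
kernel = [[1, 0, -1] for _ in range(3)]

edges = conv2d(image, kernel, stride=1)
print(len(edges), len(edges[0]))  # 8 8
print(edges[0])                   # [0, 0, 0, 30, 30, 0, 0, 0]
```

Every output row has large values only in the middle columns, where the window straddles the bright/dark boundary, which matches the "light in the middle, dark on both sides" result described above.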

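So what exactly is a convolutional neural network?

After walking through edge detection as a concrete detection scenario, it is not hard to see that a CNN (convolutional neural network) uses many different filters to keep extracting features from an image, from local to global, and then recognizes the target.

In a neural network, every number in these filters is a parameter, learned from large amounts of training data (this is the deep learning process).

Basic concepts in CNNs

1.Convolution

(1) Convolution computation

Convolution is an integral transform from mathematical analysis; image processing uses its discrete form. In a CNN, what a convolutional layer actually computes is the operation that mathematics calls cross-correlation, in which the kernel is not flipped. The computation is shown in the figure below.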


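Where:

Figure (a): the grid on the left is 3×3, meaning the input is a 3×3 two-dimensional array; the grid in the middle is 2×2, a 2×2 two-dimensional array that serves as the convolution kernel. The top-left corner of the kernel is aligned with the top-left corner (0,0) of the input; multiplying the values at corresponding positions and summing them gives the first output value, 25.

Likewise, figures (b), (c), and (d) give the second, third, and fourth output values.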
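The sketch below reproduces this kind of 3×3 by 2×2 computation. The input and kernel values are assumptions, picked so that the first output is 25 as in the text; the actual values in the original figure are not recoverable. It also shows how true mathematical convolution (kernel flipped in both directions) would give a different result.

```python
def cross_correlate(x, k):
    """Cross-correlation: slide k over x without flipping it (stride 1, no padding)."""
    out_rows = len(x) - len(k) + 1
    out_cols = len(x[0]) - len(k[0]) + 1
    return [
        [
            sum(
                x[r + i][c + j] * k[i][j]
                for i in range(len(k))
                for j in range(len(k[0]))
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def flip(k):
    """Flip a kernel vertically and horizontally."""
    return [row[::-1] for row in k[::-1]]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # assumed 3x3 input
k = [[0, 1], [2, 3]]                   # assumed 2x2 kernel

print(cross_correlate(x, k))        # [[25, 31], [43, 49]]  (what a CNN layer computes)
print(cross_correlate(x, flip(k)))  # [[11, 17], [29, 35]]  (true convolution)
```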

(2) Convolution on images

So how does convolution work on an actual picture? The figure below shows the process for a color image.


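Channel 1 (Red), channel 2 (Green), and channel 3 (Blue) of the color image use Kernel1, Kernel2, and Kernel3 respectively. Each kernel slides over its own single-color image; at each position, the pixel values in the kernel-sized patch are multiplied element by element with the kernel, and the products are summed into one value. The values from the three channels are then added together, plus an overall bias, to give one value of the corresponding feature map.

A three-dimensional view is shown below: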

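A minimal sketch of the per-channel computation described above: each channel is correlated with its own kernel, the three results are summed, and a bias is added. All pixel, kernel, and bias values here are made up for illustration, and the function names are hypothetical.

```python
def window_sum(channel, kernel, top, left):
    """Multiply one kernel-sized patch with the kernel element-wise and sum."""
    return sum(
        channel[top + i][left + j] * kernel[i][j]
        for i in range(len(kernel))
        for j in range(len(kernel[0]))
    )


def conv_multichannel(channels, kernels, bias=0):
    """Produce one feature map from several input channels (stride 1, no padding)."""
    out_rows = len(channels[0]) - len(kernels[0]) + 1
    out_cols = len(channels[0][0]) - len(kernels[0][0]) + 1
    return [
        [
            sum(window_sum(ch, k, r, c) for ch, k in zip(channels, kernels)) + bias
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Assumed 3x3 channels with constant values, to keep the arithmetic easy to follow.
red = [[1] * 3 for _ in range(3)]
green = [[2] * 3 for _ in range(3)]
blue = [[3] * 3 for _ in range(3)]

kernel1 = [[1, 0], [0, 1]]  # red contributes 1 + 1 = 2
kernel2 = [[1, 1], [1, 1]]  # green contributes 2 * 4 = 8
kernel3 = [[0, 0], [0, 1]]  # blue contributes 3

feature_map = conv_multichannel([red, green, blue], [kernel1, kernel2, kernel3], bias=1)
print(feature_map)  # [[14, 14], [14, 14]]  (2 + 8 + 3 + bias 1)
```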

2.Padding

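In the edge-detection example above, the original image is 10x10 and becomes 8x8 after the filter. Another convolution would make it 6x6, and so on. This has two drawbacks:

Every convolution shrinks the output image.
Pixels in the corners and along the edges are used in fewer outputs, so much of the information near the image border is easily lost.

In the figure below, the red-shaded pixel in the top-left corner is touched by only one output, while a pixel in the middle (marked with the purple box) is covered by many overlapping 3x3 regions.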


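To solve this, we usually use padding: before the convolution, add a border of pixels around the original image. For example, a 10x10 image can be padded to 12x12; after a 3x3 convolution the output is 10x10, the same size as the original image, and pixels near the border of the original are now used several times as well.

There are usually two choices for how many pixels to pad:

Same convolution: as described above, pad so that the output after convolution is the same size as the original image.
Valid convolution: no padding; convolve directly.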
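The padding step can be sketched as follows. Filling the border with zeros is an assumption (it is the most common choice, but the text only says "a border of pixels"), and `same_padding` assumes stride 1 and an odd filter size; both function names are hypothetical.

```python
def zero_pad(image, p):
    """Surround a 2D image with p rows/columns of zeros on every side."""
    padded_width = len(image[0]) + 2 * p
    padded = [[0] * padded_width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * padded_width for _ in range(p))
    return padded


def same_padding(f):
    """Padding that keeps the output size equal to the input size
    (assumes stride 1 and an odd filter size f)."""
    return (f - 1) // 2


image = [[1] * 10 for _ in range(10)]
p = same_padding(3)          # 1 for a 3x3 filter
padded = zero_pad(image, p)

print(len(padded), len(padded[0]))  # 12 12
print(len(padded) - 3 + 1)          # 10: a 3x3 filter on 12x12 gives 10x10 again
```

With valid convolution, `p` is simply 0 and the image is passed to the filter unchanged.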

3.stride

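Stride was already mentioned in the edge-detection example. It is the distance the filter moves each time in the horizontal and vertical directions over the image, also called the step length.

Let s be the stride, p the padding, nxn the size of the original image, and fxf the size of the filter. The size of the image after convolution is then:

$$\left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right)$$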
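A minimal sketch of the usual output-size rule, floor((n + 2p - f) / s) + 1, using the symbols n, f, p, and s defined above. The function name `conv_output_size` is hypothetical; the first three calls reuse sizes that appear in this article, and the last one is an extra case with stride 2.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output for an n x n input, f x f filter,
    padding p, and stride s (integer division rounds down)."""
    return (n + 2 * p - f) // s + 1


print(conv_output_size(10, 3))        # 8   (edge-detection example, no padding)
print(conv_output_size(10, 3, p=1))   # 10  (same convolution)
print(conv_output_size(8, 3))         # 6   (8x8 input, 3x3 kernel)
print(conv_output_size(7, 3, s=2))    # 3   (stride 2)
```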

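4.Pooling

Pooling is essentially dimensionality reduction.

The role of a pooling layer in a convolutional network: it shrinks the feature maps (and with them the parameters needed downstream), speeds up computation, and enlarges the receptive field. In other words, it is a downsampling operation.

The pooling methods commonly used in object detection are max pooling and average pooling.

(1) Max pooling

Take the maximum value inside the region the filter slides over; no convolution is computed. A large number suggests that a particular feature may have been detected; the other values are ignored, which reduces the effect of noise and makes the model more robust. In addition, the only hyperparameters max pooling needs are the filter size f and the stride s. There are no parameters to train, and the computation is cheap.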


(2) Average pooling

Take the average value inside the region the filter slides over.

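Both pooling variants differ only in how each window is reduced to one number, so one sketch can cover them. The 4x4 feature map below is made up for illustration, and `pool2d` and `mean` are hypothetical helper names.

```python
def pool2d(feature_map, f, s, reduce):
    """Slide an f x f window with stride s and reduce each window to one value."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for top in range(0, rows - f + 1, s):
        out_row = []
        for left in range(0, cols - f + 1, s):
            window = [
                feature_map[top + i][left + j]
                for i in range(f)
                for j in range(f)
            ]
            out_row.append(reduce(window))
        output.append(out_row)
    return output


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
]

print(pool2d(feature_map, f=2, s=2, reduce=max))   # [[9, 2], [6, 3]]
print(pool2d(feature_map, f=2, s=2, reduce=mean))  # [[3.75, 1.25], [3.75, 2.0]]
```

Note that the only inputs besides the data are f and s; nothing in the function is learned.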

5.Shape

When handling multi-dimensional data, image data in particular, TensorFlow and PyTorch order the dimensions of the data shape differently.

TensorFlow: (batch_size, height, width, in_channels)
PyTorch: (batch_size, in_channels, height, width)

Where:

batch_size: The number of samples for batch processing.
in_channels: The number of channels of the input image, usually 3 (red, green, blue) for color images.
height and width are the height and width of the image respectively.


As shown in the picture above:

Input image Shape: [height, width, channels], that is, [8,8, 3], represents an 8x8 image with 3 channels (R, G, B).
Convolution kernel Shape: [kernel_height, kernel_width, in_channels, out_channels], that is, [3,3,3,5], indicating a 3x3 convolution kernel with 3 input channels (R, G, B) and 5 output channels.
Output image Shape: [height, width, out_channels], that is, [6,6,5], indicating a 6x6 output with 5 channels, one per filter.

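out_height = (height - kernel_height) / stride + 1
out_width = (width - kernel_width) / stride + 1

For an 8x8 image and a 3x3 kernel with stride 1 and no padding, the output size is (8 - 3) / 1 + 1 = 6 (round down when the division is not exact), so the output shape is [6, 6, 5]: a 6x6 feature map with 5 output channels.

The number of input channels of the kernel (in_channels) is determined by the number of channels of the input image; for example, an RGB image has 3 input channels.

The number of channels of the output (out_channels) is determined by the number of output channels of the kernel, that is, by how many different filters it contains. In this example the kernel has 5 filters, so the output has 5 channels.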
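The shape bookkeeping can be checked with a short sketch. It works on shape tuples only, not on real tensors; the function names and the batch size of 32 are assumptions for illustration, and no padding is applied.

```python
def nhwc_to_nchw(shape):
    """Reorder (batch, height, width, channels) into (batch, channels, height, width)."""
    batch, height, width, channels = shape
    return (batch, channels, height, width)


def conv_output_shape(input_hwc, kernel_shape, stride=1):
    """Output [height, width, out_channels] for an input [height, width, channels]
    and a kernel [kernel_height, kernel_width, in_channels, out_channels]."""
    height, width, channels = input_hwc
    k_height, k_width, in_channels, out_channels = kernel_shape
    if channels != in_channels:
        raise ValueError("kernel in_channels must match the input channels")
    out_height = (height - k_height) // stride + 1
    out_width = (width - k_width) // stride + 1
    return (out_height, out_width, out_channels)


print(conv_output_shape((8, 8, 3), (3, 3, 3, 5)))  # (6, 6, 5)
print(nhwc_to_nchw((32, 8, 8, 3)))                 # (32, 3, 8, 8)
```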

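6.Activation functions

Not every mapping can be expressed accurately by a linear relationship, so activation functions are needed to represent nonlinear mappings.

An activation function is simply a nonlinear mapping. A neural network built only by stacking linear operations cannot form a complex representation space and will struggle to extract high-level semantic information, so nonlinear mappings have to be added.

(1) Sigmoid function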


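The sigmoid function squashes features into the interval (0,1). The end near 0 is the inhibited state, the end near 1 is the activated state, and the gradient is largest in the middle.

(2) ReLU function

Rectified Linear Unit (ReLU). It is commonly used to mitigate vanishing gradients.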


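For inputs less than 0, both the value and the gradient are 0; for inputs greater than 0, the derivative is 1. This avoids the vanishing-gradient problem that sigmoid has in the regions where its gradient approaches 0.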

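The gradient behaviour of the two functions can be compared numerically. This is a sketch using the standard definitions sigmoid(x) = 1 / (1 + e^(-x)) and ReLU(x) = max(0, x); the gradient of ReLU at exactly 0 is set to 0 here by convention.

```python
import math


def sigmoid(x):
    """Sigmoid, written so that large negative inputs do not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


print(sigmoid(0.0))                     # 0.5
print(sigmoid_grad(0.0))                # 0.25, the largest gradient sigmoid can have
print(sigmoid_grad(10.0) < 1e-4)        # True: the gradient nearly vanishes
print(relu(-2.0), relu(3.0))            # 0.0 3.0
print(relu_grad(-2.0), relu_grad(3.0))  # 0.0 1.0
```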

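(3) Softmax function

Softmax is the classifier most commonly used for multi-class object classification.

In a classification task, the input to softmax is usually the scores of the classes, and the output is the probability of each class. Every probability lies between 0 and 1, and they sum to 1.

The softmax formula is:

$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{C} e^{V_j}}$$

where V_i is the score of the i-th class, C is the total number of classes, and the output S_i is the probability of the i-th class.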
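A minimal softmax sketch following the standard definition, S_i = exp(V_i) / sum_j exp(V_j). Subtracting the largest score before exponentiating is a common numerical-stability trick that does not change the result; the example scores are made up.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(v - largest) for v in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
print(probs)                       # highest score gets the highest probability
print(sum(probs))                  # approximately 1.0
print(softmax([1.0, 1.0]))         # [0.5, 0.5]
```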

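Overall CNN structure

A CNN consists of an input layer, convolutional layers, ReLU, pooling layers, fully connected layers, and an output layer.

The figure below shows an example network. A convolutional layer is the first layer of the network; it is followed by further convolutional or pooling layers, and the last layer is fully connected. Earlier layers usually focus on simple features (such as colors and edges), while later layers recognize larger parts of the image. As the image data moves forward through the layers, the CNN starts to recognize larger elements or shapes of the object, until it finally recognizes the expected object.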


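Where:

Input layer: receives the raw image data as the input of the network.
Convolutional layer: made up of filters and an activation function, it is the core layer of a CNN, and its main job is to extract features from the samples. It involves the input data, the filter (or convolution kernel), and the feature map. If the input is an RGB image, the input has three dimensions: height, width, and depth. A filter is essentially a two-dimensional weight matrix that moves across the receptive fields of the image and checks whether a feature is present. The convolution itself works as described above. The hyperparameters usually set for a convolutional layer include the number of filters, the stride, the padding mode (valid or same), and the activation function.
Pooling layer: essentially downsampling. Using the principle of local correlation in images, it subsamples the image, reducing the amount of data to process while keeping the useful information, and it helps somewhat against overfitting.
Fully connected layer: every node in this layer is connected to all nodes of the previous layer, which combines the features extracted earlier. The fully connected layers usually hold the most parameters.
Output layer: produces the most probable result from the information in the fully connected layer.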
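To see how data changes size as it moves through such a stack, the sketch below traces shapes only (no real computation) through a hypothetical network: two rounds of same-padded 3x3 convolution followed by 2x2 max pooling, then flattening for a fully connected layer. The 32x32x3 input and the filter counts (8 and 16) are made-up values, not taken from the article's figure.

```python
def conv_shape(shape, f, out_channels, p=0, s=1):
    """Shape (height, width, channels) after a convolutional layer."""
    height, width, _ = shape
    return ((height + 2 * p - f) // s + 1, (width + 2 * p - f) // s + 1, out_channels)


def pool_shape(shape, f, s):
    """Shape after a pooling layer; the channel count is unchanged."""
    height, width, channels = shape
    return ((height - f) // s + 1, (width - f) // s + 1, channels)


def trace_shapes(input_shape):
    """Return the shape after each layer of the hypothetical network."""
    shapes = [input_shape]
    shapes.append(conv_shape(shapes[-1], f=3, out_channels=8, p=1))   # conv + ReLU
    shapes.append(pool_shape(shapes[-1], f=2, s=2))                   # max pooling
    shapes.append(conv_shape(shapes[-1], f=3, out_channels=16, p=1))  # conv + ReLU
    shapes.append(pool_shape(shapes[-1], f=2, s=2))                   # max pooling
    height, width, channels = shapes[-1]
    shapes.append((height * width * channels,))                       # flatten for the FC layer
    return shapes


for shape in trace_shapes((32, 32, 3)):
    print(shape)
# (32, 32, 3) -> (32, 32, 8) -> (16, 16, 8) -> (16, 16, 16) -> (8, 8, 16) -> (1024,)
```

ReLU is not listed as a separate step because it is applied element-wise and leaves the shape unchanged.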

Advantages of CNNs

Compared with a traditional neural network, a CNN has advantages such as local connections and weight sharing, which greatly reduce the number of parameters to learn and also make the network converge faster.

Local connection: Each output value of the feature map does not need to be connected to every pixel value in the input image; it only needs to be connected to the receptive field of the applied filter. For this reason the convolutional layer is often called a "partially connected layer", and this property is called local connection.
Weight sharing: When the convolution kernel moves on the image, its weight remains unchanged. That is weight sharing.
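The effect on parameter count can be estimated with simple arithmetic. The sketch below reuses the earlier shapes (8x8x3 input, 3x3 kernel with 5 filters, 6x6x5 output) and compares a convolutional layer with a fully connected layer producing the same number of outputs. One bias per filter and one bias per dense output unit are assumptions of this sketch.

```python
def dense_params(in_height, in_width, in_channels, out_units):
    """Fully connected layer: every output is connected to every input value."""
    return in_height * in_width * in_channels * out_units + out_units


def conv_params(k_height, k_width, in_channels, out_channels):
    """Convolutional layer: weights are shared across all positions,
    so the count depends only on the kernel shape."""
    return k_height * k_width * in_channels * out_channels + out_channels


print(dense_params(8, 8, 3, out_units=6 * 6 * 5))  # 34740
print(conv_params(3, 3, 3, out_channels=5))        # 140
```

The convolutional count does not grow with the image size, while the fully connected count grows with both the input and the output size.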
