The paper mainly discusses why over-parameterized neural networks can generalize well: rather than simply memorizing the training set, they extract general rules from it that carry over to the test set (generalization ability).
Take the classic decision tree model as an example. In the good case, the tree learns the general rules of the data set: its first few splits already separate samples with different labels, the tree stays shallow, and each leaf holds enough samples (so the rule at each leaf rests on a reasonable amount of data). Rules learned this way are more likely to generalize to other data (good fit and good generalization).
In the worse case, the tree fails to find general rules. To fit this particular data set it grows deeper and deeper, until each leaf may hold only a handful of samples (and the statistics drawn from such a small amount of data may be nothing but noise). In the end it memorizes the data by rote (overfitting, no generalization). In short, trees that grow too deep overfit easily.
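To make this concrete, here is a small sketch (assuming scikit-learn is available; the data set and its 20% label noise are synthetic, chosen purely for illustration) that compares a depth-limited tree with a fully grown one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data whose labels contain 20% noise.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (3, None):   # shallow tree vs. fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# Typically the shallow tree has similar train and test accuracy, while the fully
# grown tree reaches ~100% train accuracy but noticeably lower test accuracy.
```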
So how can an over-parameterized neural network achieve good generalization?
This article explains it from a simple and general perspective: it looks for the source of generalization in the gradient descent process that trains the network.
To summarize the gradient coherence theory: the coherence of gradients across different samples is what gives neural networks their generalization ability. When the gradients of different samples are well aligned during training, that is, when they are coherent, gradient descent is stable, converges quickly, and the resulting model generalizes well. Conversely, when samples are too few or training runs too long, the model may fail to generalize.
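As a rough illustration of what "coherence" means, here is a minimal NumPy sketch. The coherence measure below is my own toy proxy (squared norm of the mean gradient over the mean squared norm of per-sample gradients), not necessarily the paper's exact metric; it compares per-sample gradients of a linear model when the labels follow a common rule versus when they are pure noise.

```python
import numpy as np

def per_sample_gradients(w, X, y):
    """Per-sample gradients of 0.5*(x.w - y)^2 for a linear model: (x.w - y) * x."""
    residuals = X @ w - y              # shape (n,)
    return residuals[:, None] * X      # shape (n, d): one gradient per sample

def coherence(grads, eps=1e-12):
    """||mean gradient||^2 / mean ||per-sample gradient||^2.
    Equals 1 when all gradients are identical, is near 1/n when they are unrelated."""
    mean_grad = grads.mean(axis=0)
    return float(mean_grad @ mean_grad / (np.mean(np.sum(grads**2, axis=1)) + eps))

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y_signal = X @ w_true            # labels follow a common rule -> aligned gradients
y_noise = rng.normal(size=256)   # labels are pure noise -> conflicting gradients

w0 = np.zeros(10)
print(coherence(per_sample_gradients(w0, X, y_signal)))  # noticeably larger
print(coherence(per_sample_gradients(w0, X, y_noise)))   # close to 1/n = 0.004
```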
Based on this theory, we can explain the following phenomena.
Wider neural networks generalize well. A wider network contains more sub-networks and is therefore more likely than a narrower one to produce coherent gradients, which leads to better generalization. In other words, gradient descent acts as a feature selector that favors generalizing (coherent) gradients, and wider networks may simply have better features because they have more features to choose from.
Personally, though, I think we should distinguish between the width of the input layer and that of the hidden layers. For the input layer of data mining tasks in particular, the input features are usually hand-crafted, so feature selection (i.e., reducing the width of the input layer) is still worth considering; otherwise, feeding feature noise directly into the network will interfere with gradient coherence.
The deeper the network, the more the gradient coherence effect is amplified, and the better the generalization.
In a deep model, the feedback between layers reinforces coherent gradients, so the relative difference between features with coherent gradients (W6) and features with incoherent gradients (W1) is amplified exponentially during training. This makes deeper networks favor coherent gradients, which yields better generalization.
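A toy numerical illustration of this amplification, under my own simplifying assumption (not the paper's W6/W1 example): if layer feedback multiplies the effective gradient on a coherent feature by a slightly larger factor per step than on an incoherent one, the ratio between them grows exponentially with training time.

```python
# Assumed per-step growth factors: slightly larger for the coherent feature.
a, b = 1.10, 1.02
coherent, incoherent = 1.0, 1.0
for step in range(50):
    coherent *= a
    incoherent *= b
print(coherent / incoherent)   # ratio = (a/b)**50, i.e. exponential in training time
```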
Through early stopping, we can limit the excessive influence of incoherent gradients and improve generalization.
During training, easy samples are fitted earlier than hard samples. In the early stage, the coherent gradients of the easy samples dominate and are easy to fit; in the later stage, the incoherent gradients of the hard samples dominate the average gradient g(w_t), which hurts generalization. That is the point at which to stop early, as sketched below.
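A minimal, framework-agnostic early-stopping sketch (the validation losses below are fabricated purely for illustration): training stops once the validation loss has not improved for a few consecutive checks, roughly the point where incoherent gradients from hard samples start to dominate.

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for `patience` checks."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # still improving: reset the counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1      # no improvement: one step closer to stopping
        return self.bad_checks >= self.patience

# Fabricated validation losses purely for illustration.
stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([1.0, 0.7, 0.55, 0.52, 0.53, 0.54, 0.56]):
    if stopper.should_stop(val_loss):
        print(f"stop at epoch {epoch}, best validation loss {stopper.best:.2f}")
        break
```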
We find that full-batch gradient descent can also generalize well. Moreover, careful experiments show that stochastic gradient descent does not necessarily generalize better, although this does not rule out that stochastic gradients may be more likely to escape local minima, act as a regularizer, and so on.
We believe that a lower learning rate does not necessarily reduce generalization error, because a lower learning rate means more iterations (the opposite of early stopping).
When L1 or L2 regularization is added to the objective function, the gradient changes accordingly: the L1 term adds sign(w) to the gradient and the L2 term adds w (each scaled by its regularization coefficient λ). Taking L2 regularization as an example, the weight update formula becomes w^{(i+1)} = w^{(i)} - η·(∇L(w^{(i)}) + λ·w^{(i)}), where η is the learning rate.
We can think of L2 regularization (weight decay) as a kind of "background force" that pushes every parameter toward a data-independent value of zero (L1 tends to produce sparse solutions, L2 smooth solutions that approach zero), eliminating the influence of weak gradient directions. Only where the gradients are coherent can a parameter escape this background force and be updated according to the data, as in the sketch below.
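Here is a minimal NumPy sketch of one update step with the L2 (and optionally L1) term added to the data gradient, matching the formula above; the linear-regression gradient and the function names are my own stand-ins, not code from the paper.

```python
import numpy as np

def grad_loss(w, X, y):
    """Toy data gradient: mean-squared-error gradient of a linear model."""
    return X.T @ (X @ w - y) / len(y)

def regularized_step(w, X, y, lr=0.1, l2=0.01, l1=0.0):
    g = grad_loss(w, X, y)
    g = g + l2 * w + l1 * np.sign(w)   # L2 adds lambda*w, L1 adds lambda*sign(w)
    return w - lr * g                  # w_{i+1} = w_i - eta * (data grad + reg terms)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)
w = regularized_step(rng.normal(size=5), X, y)  # weights with weak data gradients drift toward zero
```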
For gradient descent variants such as Momentum and Adam, the update direction of the parameters is determined not only by the current gradient but also by the accumulated past gradients (so the effect of accumulated coherent gradients is retained). This lets the parameters move faster along dimensions where the gradient direction is stable and dampens the update along dimensions where the gradient direction keeps changing, which speeds up convergence and reduces oscillation.
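A toy plain-momentum update makes this concrete (the gradients below are fabricated): the velocity accumulates the gradient component that keeps pointing the same way and averages out the component whose sign flips every step.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad      # accumulate past gradient directions
    return w - lr * v, v

w, v = np.zeros(2), np.zeros(2)
for t in range(20):
    steady = np.array([1.0, 0.0])              # always points the same way
    oscillating = np.array([0.0, (-1.0) ** t]) # flips sign every step
    w, v = momentum_step(w, v, steady + oscillating)
print(w)   # large movement along the first coordinate, almost none along the second
```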
We can also modify the (mini-)batch gradient descent algorithm itself to suppress updates in weak gradient directions and further improve generalization. For example, winsorized gradient descent clips gradient outliers before averaging, or we can take the median of the per-sample gradients instead of their mean to reduce the influence of outliers.
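Here is a small NumPy sketch of both ideas, my own illustration rather than the paper's exact procedure: winsorize each gradient coordinate before averaging, or take the coordinate-wise median; either keeps a single wildly noisy sample from dragging the update off course.

```python
import numpy as np

def winsorized_mean(grads, c=5.0):
    """Clip each gradient coordinate to its [c, 100-c] percentile range, then average."""
    lo = np.percentile(grads, c, axis=0)
    hi = np.percentile(grads, 100 - c, axis=0)
    return np.clip(grads, lo, hi).mean(axis=0)

def median_grad(grads):
    """Coordinate-wise median: outliers barely move it."""
    return np.median(grads, axis=0)

grads = np.random.default_rng(0).normal(size=(128, 4))  # per-sample gradients, shape (n, d)
grads[0] += 100.0                                       # one wildly noisy sample
print(grads.mean(axis=0))        # plain mean is dragged off by roughly 100/128 per coordinate
print(winsorized_mean(grads))    # stays small
print(median_grad(grads))        # stays small
```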
A few words to close: if you are interested in deep learning theory, the related research cited in the paper is worth reading.