
Optimizing Your Neural Networks

DDD

Published: 2024-10-13 06:15:02


Last week I posted an article about how to build simple neural networks, specifically multi-layer perceptrons. This article will dive deeper into the specifics of neural networks to discuss how we can maximize the performance of a neural network by tweaking its configurations.

How Long to Train Your Model For

When training a model, you might think that if you train it long enough, it will become flawless. That may be true, but only for the dataset it was trained on. Give it another set of data where the values are different, and the model could output completely incorrect predictions. To understand this further, say you practiced for your driver's exam every single day by driving in a straight line without ever turning the wheel. (Please don't do this.) While you would probably perform very well on a drag strip, if you were told to make a left turn on the actual exam, you might end up turning into a STOP sign instead. This phenomenon is called overfitting. Your model can learn all the aspects and patterns of the data it's trained on, but if it learns a pattern that adheres to the training dataset too closely, it will perform poorly when given a new dataset. At the same time, if you don't train your model enough, it won't be able to recognize patterns in other datasets properly. In this case, you would be underfitting.

An example of overfitting. The validation loss, represented by the orange line, gradually increases while the training loss, represented by the blue line, decreases.

In the example above, a great position to stop training your model would be right when the validation loss reaches its minimum. It's possible to do this with early stopping, which stops training once there is no improvement in validation loss after an arbitrary number of training cycles (epochs). Training your model is all about finding a balance between overfitting and underfitting while utilizing early stopping if necessary. That's why your training dataset should be as representative as possible of your overall population so that your model can more accurately make predictions on data it hasn't seen.
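The stopping rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `early_stopping_epoch`, the `patience` parameter, and the list of validation losses are all made up for this example, and a real training loop would compute the validation loss after each epoch instead of reading it from a list.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch, best_loss).

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. If that never happens, stop_epoch is
    the index of the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch, best_loss
    return len(val_losses) - 1, best_epoch, best_loss


# Hypothetical validation losses: they fall, bottom out, then rise (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch, best_loss)  # 6 3 0.5
```

In practice you would also keep a copy of the weights from `best_epoch`, since the model at `stop_epoch` has already started to overfit.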

Loss Functions

Perhaps one of the most important training configurations that can be tweaked is the loss function, which measures the "inaccuracy" of your model's predictions relative to their actual values. This "inaccuracy" can be represented mathematically in many different ways, one of the most common being mean squared error (MSE):

$$\text{MSE} = \frac{\sum_{i=1}^{n} (\bar{y_i} - y_i)^2}{n}$$

where $\bar{y_i}$ is the model's prediction and $y_i$ is the true value. There is a similar variant called mean absolute error (MAE):

$$\text{MAE} = \frac{\sum_{i=1}^{n} |\bar{y_i} - y_i|}{n}$$
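As a quick illustration, both formulas translate almost directly into plain Python. The function names and the sample numbers below are invented for this sketch. Libraries such as scikit-learn or PyTorch ship their own implementations.

```python
def mse(predictions, targets):
    """Mean squared error: the average of the squared differences."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


def mae(predictions, targets):
    """Mean absolute error: the average of the absolute differences."""
    n = len(targets)
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / n


# Hypothetical predictions and true values; the last prediction is off by 4.
predictions = [2.0, 4.0, 6.0, 8.0]
targets = [1.0, 4.0, 6.0, 12.0]

print(mse(predictions, targets))  # 4.25
print(mae(predictions, targets))  # 1.25
```

The single large error (4) contributes 16 to the MSE sum but only 4 to the MAE sum, which is why the two values differ so much here.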

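What's the difference between these two, and which one is better? The real answer is that it depends on a variety of factors. Let's consider a simple 2-dimensional linear regression example.

In many cases, there may be data points that act as outliers, that is, points that lie far away from the other data points. In terms of linear regression, this means there are a few points on the $xy$-plane that are far from the rest. If you remember from your statistics classes, it's points like these that can significantly affect the linear regression line that is calculated.

If you wanted a line that crosses the four points (1, 1), (2, 2), (3, 3), and (4, 4), then $y = x$ would be a great choice, since this line goes through all of them.

However, let's say we decide to add another point at $(5, 1)$. Now what should the regression line be? Well, it turns out to be completely different: $y = 0.2x + 1.6$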


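A simple graph with points at (1, 1), (2, 2), (3, 3), (4, 4), and (5, 1), with a linear regression line going through them.

Given the previous data points, the line would be expected to give $y = 5$ when $x = 5$, but because of the outlier and the squared error it contributes under MSE, the regression line is "pulled downward" significantly.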
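The two regression lines can be reproduced with the closed-form least-squares solution, which is the line that minimizes MSE. This is a minimal sketch. The helper name `fit_line` is made up for this example, and in practice you would more likely reach for a library routine.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (minimizes MSE)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# The four collinear points from the text.
slope_4, intercept_4 = fit_line([1, 2, 3, 4], [1, 2, 3, 4])
print(slope_4, intercept_4)  # 1.0 0.0

# The same points plus the outlier (5, 1).
slope_5, intercept_5 = fit_line([1, 2, 3, 4, 5], [1, 2, 3, 4, 1])
print(round(slope_5, 6), round(intercept_5, 6))  # 0.2 1.6
```

A single extra point is enough to move the slope from 1 to 0.2.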

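This is just a simple example, but it poses a question that you, as a machine learning developer, need to stop and think about: how sensitive should my model be to outliers? If you want your model to be more sensitive to outliers, you would choose a metric like MSE, because errors involving outliers become more pronounced once they are squared, and your model adjusts itself to minimize them. Otherwise, you would choose a metric like MAE, which doesn't care as much about outliers.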
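To make the trade-off concrete, the sketch below scores two candidate lines on the five example points: $y = x$, which ignores the outlier, and $y = 0.2x + 1.6$, which bends toward it. The helper `line_losses` is a made-up name for this illustration.

```python
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 1)]


def line_losses(slope, intercept, points):
    """Return (MSE, MAE) of the line y = slope * x + intercept on the points."""
    errors = [slope * x + intercept - y for x, y in points]
    mse = sum(e ** 2 for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, mae


mse_identity, mae_identity = line_losses(1.0, 0.0, points)  # y = x
mse_fitted, mae_fitted = line_losses(0.2, 1.6, points)      # y = 0.2x + 1.6

print(round(mse_identity, 2), round(mae_identity, 2))  # 3.2 0.8
print(round(mse_fitted, 2), round(mae_fitted, 2))      # 1.28 0.96
```

Of these two candidates, MSE favors the line pulled toward the outlier (1.28 versus 3.2), while MAE favors $y = x$ (0.8 versus 0.96). The same data therefore leads to different "best" lines depending on which loss you pick.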

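Optimizers

In the previous post, I also discussed backpropagation, the concept of gradient descent, and how they work together to minimize the loss of the model. The gradient is a vector that points in the direction of steepest increase. The gradient descent algorithm calculates this vector and moves in the exact opposite direction so that it eventually reaches a minimum.

Most optimizers have a specific learning rate, commonly denoted $\alpha$, that they adhere to. Essentially, this represents how far the algorithm moves toward the minimum each time it calculates the gradient. Be careful about setting your learning rate too large: the algorithm may never reach the minimum, because its large steps can repeatedly skip over it.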
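The effect of the learning rate is easy to see on a toy problem. The sketch below runs gradient descent on $f(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The function, the starting point, and the two learning rates are arbitrary choices made for this illustration.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize f(w) = w**2 by repeatedly stepping against the gradient 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient
    return w


# With alpha = 0.1 each step multiplies w by 0.8, so w shrinks toward 0.
w_small = gradient_descent(0.1)

# With alpha = 1.1 each step multiplies w by -1.2, so w jumps over the
# minimum every time and ends up farther away than where it started.
w_large = gradient_descent(1.1)

print(abs(w_small) < 0.1, abs(w_large) > 100)  # True True
```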

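Going back to gradient descent: while it is effective at minimizing loss, it can significantly slow down the training process, since the loss function is calculated over the entire dataset. There are several more efficient alternatives to gradient descent, but each has its own downsides.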

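Stochastic Gradient Descent

One of the most popular alternatives to standard gradient descent is a variant called stochastic gradient descent (SGD). Like gradient descent, SGD has a fixed learning rate. But instead of running through the entire dataset like gradient descent, SGD takes a small randomly selected sample and updates the weights of the neural network based on that sample instead. Eventually, the parameter values converge to a point that approximately (but not exactly) minimizes the loss function. This is one of SGD's downsides, since it doesn't always reach the exact minimum. Additionally, like gradient descent, it remains sensitive to the learning rate you set.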
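Here is a minimal sketch of that idea for a one-weight model $y = wx$ trained with an MSE loss. Everything about the setup is an assumption made for illustration: the synthetic data (true weight 3 plus a little noise), the batch size, the learning rate, and the function name `sgd_fit`.

```python
import random


def sgd_fit(xs, ys, learning_rate=0.1, batch_size=10, steps=500, seed=0):
    """Fit y = w * x by mini-batch stochastic gradient descent on the MSE."""
    rng = random.Random(seed)
    w = 0.0
    indices = range(len(xs))
    for _ in range(steps):
        # Use a small random sample instead of the whole dataset.
        batch = rng.sample(indices, batch_size)
        # Gradient of the batch MSE with respect to w.
        gradient = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * gradient
    return w


# Synthetic data: y = 3x plus Gaussian noise (hypothetical values).
data_rng = random.Random(42)
xs = [i / 100 for i in range(100)]
ys = [3.0 * x + data_rng.gauss(0, 0.1) for x in xs]

w = sgd_fit(xs, ys)
print(abs(w - 3.0) < 0.2)  # True: close to the true weight, but not exact
```

Because every step sees only 10 of the 100 points, the estimate keeps jittering around the best weight instead of settling on it exactly. That is the approximate convergence described above.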

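Adam Optimizer

The name Adam is derived from adaptive moment estimation. It essentially combines two variants of SGD to adjust the learning rate for each input parameter based on how often it gets updated during each training iteration (adaptive learning rates). At the same time, it keeps track of past gradient calculations as a moving average to smooth out updates (momentum). However, because of its momentum characteristic, it can sometimes take longer to converge than other algorithms.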
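The two ingredients can be seen in a bare-bones version of the Adam update rule: a moving average of the gradients (momentum) and a moving average of the squared gradients (used to scale each parameter's step). The defaults for `beta1`, `beta2`, and `eps` follow the commonly cited values. The test function, starting point, learning rate, and the name `adam_minimize` are assumptions for this sketch, not part of any library.

```python
import math


def adam_minimize(grad_fn, start, steps, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates and return the final parameter list."""
    w = list(start)
    m = [0.0] * len(w)  # moving average of gradients (momentum)
    v = [0.0] * len(w)  # moving average of squared gradients
    for t in range(1, steps + 1):
        grads = grad_fn(w)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            # Bias correction: both averages start at zero.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Each parameter gets its own effective step size.
            w[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


def grad_fn(w):
    """Gradient of f(w) = w[0]**2 + 100 * w[1]**2 (a made-up test function)."""
    return [2 * w[0], 200 * w[1]]


after_one_step = adam_minimize(grad_fn, [5.0, 5.0], steps=1)
after_many_steps = adam_minimize(grad_fn, [5.0, 5.0], steps=500)
print(after_one_step)
print(after_many_steps)
```

After the first step both parameters have moved by roughly the learning rate (0.1), even though the second gradient is 100 times larger than the first. That is the per-parameter scaling at work. With plain gradient descent, the same learning rate would make the second parameter overshoot badly.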