Vanishing Gradient Problem


A Vanishing Gradient Problem is a Machine Learning Training Algorithm problem that arises when training deep or recurrent artificial neural networks with gradient descent and backpropagation.



References

2018

  • (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/vanishing_gradient_problem Retrieved:2018-3-27.
    • In machine learning, the vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the gradient of the error function with respect to the current weight in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the problem cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range (0, 1), and backpropagation computes gradients by the chain rule. This has the effect of multiplying [math]\displaystyle{ n }[/math] of these small numbers to compute gradients of the "front" layers in an [math]\displaystyle{ n }[/math]-layer network, meaning that the gradient (error signal) decreases exponentially with [math]\displaystyle{ n }[/math] while the front layers train very slowly. [A numerical sketch of this chain-rule multiplication follows this entry's references below.]

      Back-propagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diploma thesis of 1991 [1] [2] formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network.

      When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.

  1. S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber.
  2. S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
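
The chain-rule multiplication described in the Wikipedia excerpt above can be checked numerically. The following is a minimal illustrative sketch (not code from any of the cited sources): it backpropagates through a stack of logistic-sigmoid layers, whose derivative is at most 1/4, and prints how the gradient norm shrinks as the error signal moves toward the front layers. The width, depth, and weight scale are arbitrary assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
width, depth = 32, 20
# Random weight matrices; no biases, purely to expose the chain-rule product.
weights = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]

# Forward pass, caching the pre-activations z needed by the backward pass.
h, pre_acts = rng.standard_normal(width), []
for W in weights:
    z = W @ h
    pre_acts.append(z)
    h = sigmoid(z)

# Backward pass: each layer contributes one factor W^T diag(sigma'(z)), and
# sigma'(z) = sigma(z) * (1 - sigma(z)) is at most 1/4, so the product shrinks fast.
grad = np.ones(width)                       # stands in for dC/dh at the output
for i, (W, z) in enumerate(zip(reversed(weights), reversed(pre_acts)), start=1):
    s = sigmoid(z)
    grad = W.T @ (grad * s * (1.0 - s))
    print(f"{i:2d} layers back: ||grad|| = {np.linalg.norm(grad):.3e}")
```

Running the sketch shows the gradient norm dropping by roughly an order of magnitude every few layers, which is the exponential decrease with depth that the excerpt describes.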

2017

[Image: Grosse2017 RNN.png]
Figure 3 shows the function computed at each time step, as well as the function computed by the network as a whole. From this figure, you can see which regions have exploding or vanishing gradients (see the scalar sketch after the figure caption below).

[Image: Grosse2017 FIG3.png]
Figure 3: (left) The function computed by the RNN at each time step; (right) the function computed by the network.
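
The regions the figure refers to can be illustrated with a scalar recurrent update (a sketch under assumed parameters, not code from the cited notes). Iterating [math]\displaystyle{ h_t = \tanh(w \, h_{t-1}) }[/math] and accumulating the chain-rule product [math]\displaystyle{ \partial h_T / \partial h_0 }[/math] shows the derivative exploding for a trajectory that stays near the unstable fixed point at [math]\displaystyle{ h = 0 }[/math] (when [math]\displaystyle{ w \gt 1 }[/math]) and vanishing for a trajectory that settles into a stable fixed point:

```python
import numpy as np

def iterate(w, h0, steps=50):
    """Return h_T and d h_T / d h_0 for the scalar map h_t = tanh(w * h_{t-1})."""
    h, deriv = h0, 1.0
    for _ in range(steps):
        z = w * h
        deriv *= w * (1.0 - np.tanh(z) ** 2)    # one chain-rule factor per time step
        h = np.tanh(z)
    return h, deriv

w = 2.5                                         # recurrent weight > 1 (illustrative)
for h0 in (0.0, 0.5):
    h_T, d = iterate(w, h0)
    print(f"h0 = {h0}: h_T = {h_T:+.3f}, dh_T/dh_0 = {d:.3e}")
```

Starting at h0 = 0.0 the derivative grows like w^T, while starting at h0 = 0.5 the trajectory converges to a stable fixed point where each per-step factor is well below 1, so the product vanishes.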

2015a

  • (DL, 2015) ⇒ http://neuralnetworksanddeeplearning.com/chap5.html#discussion_why Why the vanishing gradient problem occurs
    • QUOTE: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient: [math]\displaystyle{ \frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4} \qquad (122) }[/math] Excepting the very last term, this expression is a product of terms of the form [math]\displaystyle{ w_j \sigma'(z_j) }[/math]. To understand how each of those terms behave, let's look at a plot of the function [math]\displaystyle{ \sigma' }[/math]:

      (Plot: the derivative of the sigmoid function, [math]\displaystyle{ \sigma'(z) }[/math], plotted for [math]\displaystyle{ z }[/math] from −4 to 4; it peaks at 0.25.)

      The derivative reaches a maximum at [math]\displaystyle{ \sigma'(0) = 1/4 }[/math]. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy [math]\displaystyle{ |w_j| \lt 1 }[/math]. Putting these observations together, we see that the terms [math]\displaystyle{ w_j \sigma'(z_j) }[/math] will usually satisfy [math]\displaystyle{ |w_j \sigma'(z_j)| \lt 1/4 }[/math]. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem. ...

      ... Indeed, if the terms get large enough - greater than 1 - then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.
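
Equation (122) can be evaluated directly for a chain of four single-neuron sigmoid layers. The sketch below is illustrative: the weight and bias choices are assumptions, not values taken from the quoted chapter, although the chapter's exploding-gradient discussion similarly uses large weights with biases chosen so each [math]\displaystyle{ z_j = 0 }[/math]. With [math]\displaystyle{ |w_j| \lt 1 }[/math] every factor [math]\displaystyle{ w_j \sigma'(z_j) }[/math] is below 1/4 and the product vanishes, while large weights with centered pre-activations give factors of 25 and an exploding product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)                        # reaches its maximum of 1/4 at z = 0

def grad_b1(weights, x=1.0, center_biases=False):
    """Forward pass through a chain of single-neuron sigmoid layers, then form the
    product of eq. (122), taking dC/da4 = 1. If center_biases is True, each bias is
    chosen so z_j = 0, keeping every sigma'(z_j) at its maximum of 1/4."""
    zs, a = [], x
    for w in weights:
        b = -w * a if center_biases else 0.0
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
    grad = sigmoid_prime(zs[0])                 # sigma'(z1)
    for w, z in zip(weights[1:], zs[1:]):
        grad *= w * sigmoid_prime(z)            # one factor w_j * sigma'(z_j) per layer
    return grad

print("vanishing:", grad_b1([0.8, 0.8, 0.8, 0.8]))             # |w_j| < 1: factors < 1/4
print("exploding:", grad_b1([100.0] * 4, center_biases=True))  # z_j = 0: factors = 25
```

With the small weights the gradient with respect to b1 comes out around 10^-3 after only four layers, while the large-weight, centered-bias setup multiplies the gradient by 25 per layer, which is the exploding-gradient case described above.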

2015b

  • (Quora, 2015) ⇒ http://www.quora.com/What-is-the-difference-between-deep-learning-and-usual-machine-learning/answer/Sebastian-Raschka-1?srid=uuoZN
    • QUOTE: Now, this is where "deep learning" comes into play. Roughly speaking, you can think of deep learning as "clever" tricks or algorithms that can help you with training such "deep" neural network structures. There are many, many different neural network architectures, but to continue with the example of the MLP, let me introduce the idea of convolutional neural networks (ConvNets). You can think of those as an "addon" to your MLP that helps you to detect features as "good" inputs for your MLP.