The Vanishing and Exploding Gradient Problems
While increasing depth often reduces the number of parameters of the network, it leads to practical issues of its own.
Propagating gradients backwards with the chain rule has its drawbacks in networks with a large number of layers, in terms of the stability of the updates.
In particular, the updates in earlier layers can be either negligibly small (vanishing gradients) or increasingly large (exploding gradients)
in certain types of neural network architectures.
This is primarily caused by the chain-like product computation in
Equation 1.23, which can either grow or decay exponentially with the length of the path.
In order to understand this point, consider a situation in which we have a multi-layer network with one neuron in each layer. Each local derivative along
a path can be shown to be the product of the weight and the derivative of the activation function.
The overall back-propagated derivative is the product of these values. If each such value is randomly distributed with an expected value less than 1,
the product of these derivatives in Equation 1.23 will drop off exponentially fast with the path length. If the individual values on the path have expected
values greater than 1, the gradient will typically explode. Even if the local derivatives are randomly distributed with an expected value of
exactly 1, the overall derivative will typically show instability, depending on how the values are actually distributed. In other words, the vanishing
and exploding gradient problems are rather natural to deep networks, which makes their training process unstable.
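As a quick illustration of this exponential behavior, the sketch below (an illustrative assumption, not code or values from the original text) multiplies randomly distributed local derivatives along a path of 100 single-neuron layers, for expected values below, above, and exactly at 1:

```python
# Illustrative sketch (assumed distributions, not from the original text): the
# back-propagated derivative along a path of single-neuron layers is the
# product of the local derivatives at each layer.
import numpy as np

rng = np.random.default_rng(0)
depth = 100  # length of the back-propagation path

for scale, label in [(0.9, "expected value < 1"),
                     (1.1, "expected value > 1"),
                     (1.0, "expected value = 1")]:
    # One random local derivative per layer, with the stated expected value.
    local_derivatives = scale * rng.uniform(0.8, 1.2, size=depth)
    # The overall back-propagated derivative is the product of these values.
    overall = np.prod(local_derivatives)
    print(f"{label}: product over {depth} layers = {overall:.3e}")
```

Running this a few times shows the product collapsing toward zero in the first case, blowing up in the second, and varying noticeably from run to run even in the third, matching the argument above.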
Many solutions have been proposed to address this issue. For example, a sigmoid activation often encourages the vanishing gradient problem,
because its derivative is at most 0.25 for all values of its argument, and it is extremely small at saturation.
A ReLU activation unit is known to be less likely to create a vanishing gradient problem because its derivative is always 1 for positive values of the argument.
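To make these numbers concrete, the short check below (an illustrative sketch, not code from the post) evaluates both derivatives numerically: the sigmoid's derivative peaks at 0.25 and collapses toward zero in the saturated regime, while the ReLU's derivative equals 1 for any positive argument.

```python
# Illustrative sketch (assumed code): numerical check of the local derivatives
# of the sigmoid and ReLU activations.
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # at most 0.25, attained at x = 0

def relu_derivative(x):
    return np.where(x > 0, 1.0, 0.0)  # exactly 1 for positive arguments

x = np.linspace(-10.0, 10.0, 2001)
print("maximum sigmoid derivative:", sigmoid_derivative(x).max())   # 0.25
print("sigmoid derivative at x = 8:", sigmoid_derivative(8.0))      # ~3e-4 (saturated)
print("ReLU derivative at x = 3:", relu_derivative(3.0))            # 1.0
```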
More discussion of this issue will be provided in the coming posts. Aside from the use of the ReLU, a whole host of gradient-descent tricks are used to
improve convergence behavior. In particular, the use of adaptive learning rates and conjugate gradient methods can help in many cases.
Furthermore, a more recent technique called batch normalization is helpful in addressing some of these issues. These techniques will be discussed in the coming posts.