Neural Network - Methods To Deal With The Issue Of Overfitting
Neural Architecture and Parameter Sharing
The most effective way of building a neural network is to construct its architecture after giving some thought to the underlying data domain. For example, the successive words in a sentence are often related to one another, whereas the nearby pixels in an image are typically related. These insights are used to create specialized architectures for text and image data with fewer parameters.
Furthermore, many of the parameters might be shared. For example, a convolutional neural network uses the same set of parameters to learn the characteristics of each local block of the image. The recent advancements in the use of neural networks, such as recurrent neural networks and convolutional neural networks, are examples of this phenomenon.
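As a rough illustration of the savings from parameter sharing, the sketch below (assuming PyTorch; the channel counts, kernel size, and image size are arbitrary choices of my own) compares a small convolutional layer, which reuses one 3x3 filter bank across every spatial position, with a fully connected layer mapping the same input to the same output.

```python
import torch.nn as nn

# A convolutional layer shares one small set of weights across all spatial
# positions, whereas a fully connected layer learns a separate weight for
# every input-output pair.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
dense = nn.Linear(in_features=3 * 32 * 32, out_features=16 * 30 * 30)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))   # 448 parameters
print(count(dense))  # 44,251,200 parameters
```

The convolutional layer gets by with a few hundred parameters because the same filters are applied at every location of the image, while the unshared fully connected layer needs tens of millions for the same input and output sizes.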
Early Stopping
Another common form of regularization is early stopping, in which gradient descent is ended after only a few iterations. One way to decide the stopping point is to hold out a part of the training data and then test the error of the model on the held-out set. The gradient-descent approach is terminated when the error on the held-out set begins to rise. Early stopping essentially restricts the parameter space to a smaller neighborhood around the initial values of the parameters. From this point of view, early stopping acts as a regularizer because it effectively restricts the parameter space.
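The sketch below shows one way such a stopping rule might be wired into a training loop. It is only a minimal illustration: the synthetic data, the small model, the learning rate, and the patience value (how many epochs of rising held-out error are tolerated before stopping) are all assumptions of my own, written here with PyTorch.

```python
import copy
import torch
import torch.nn as nn

# Synthetic regression data, split into a training part and a held-out part.
X, y = torch.randn(1200, 20), torch.randn(1200, 1)
X_train, y_train = X[:1000], y[:1000]
X_held, y_held = X[1000:], y[1000:]

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

best_error, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(200):
    # One step of gradient descent on the training part.
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    # The error on the held-out set decides when to stop.
    model.eval()
    with torch.no_grad():
        held_error = loss_fn(model(X_held), y_held).item()

    if held_error < best_error:
        best_error = held_error
        best_state = copy.deepcopy(model.state_dict())
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # held-out error keeps rising: stop early

model.load_state_dict(best_state)  # roll back to the best held-out weights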
Trading Off Breadth for Depth
As discussed in the earlier posts, a two-layer neural network can be used as a universal function approximator if a large number of hidden units are used within the hidden layer. It turns out that networks with more layers (i.e., greater depth) tend to require far fewer units per layer, because the composition functions created by successive layers make the neural network more powerful. Increased depth is a form of regularization, as the features in later layers are forced to obey a particular type of structure imposed by the earlier layers. Increased constraints reduce the capacity of the network, which is helpful when there are limitations on the amount of available data. A brief explanation of this type of behavior will be given in the coming posts.
The number of units in each layer can typically be reduced to such an extent that a deep network often has far fewer parameters even when added up over its greater number of layers. This observation has led to an explosion in research on the topic of deep learning.
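To get a rough sense of this trade-off, the sketch below (assuming PyTorch; the input size, output size, and layer widths are arbitrary choices for illustration) compares a wide two-layer network with a deeper, narrower one mapping the same input to the same output.

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# Shallow but very wide: one hidden layer with 4096 units.
shallow = nn.Sequential(nn.Linear(784, 4096), nn.ReLU(), nn.Linear(4096, 10))

# Deeper but narrow: four hidden layers with 128 units each.
deep = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

print(count(shallow))  # about 3.3 million parameters
print(count(deep))     # about 0.15 million parameters
```

Even though the deeper network has more layers, its total parameter count is far smaller, which is the sense in which depth can substitute for breadth.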
Even though deep networks have fewer problems with respect to overfitting, they come with a different family of problems associated with ease of training. In particular, the loss derivatives with respect to the weights in different layers of the network tend to have vastly different magnitudes, which makes it challenging to choose step sizes properly. Different manifestations of this undesirable behavior are referred to as the vanishing and exploding gradient problems. Furthermore, deep networks often take unreasonably long to converge. These issues and design choices will be discussed in the coming posts on this website.
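One way to observe this behavior directly is to print the per-layer gradient norms of a deliberately deep network. The sketch below (assuming PyTorch; the depth, layer width, and use of sigmoid activations are arbitrary choices meant to provoke the effect) typically shows gradients in the earliest layers that are orders of magnitude smaller than those near the output.

```python
import torch
import torch.nn as nn

# A deep stack of sigmoid layers, chosen to make vanishing gradients visible.
layers = [nn.Sequential(nn.Linear(64, 64), nn.Sigmoid()) for _ in range(20)]
net = nn.Sequential(*layers)

x = torch.randn(32, 64)
loss = net(x).sum()
loss.backward()

# Gradient norm of the weight matrix in each layer, from input to output:
# the earliest layers usually receive far smaller gradients.
for i, layer in enumerate(layers):
    print(i, layer[0].weight.grad.norm().item())
```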
Ensemble Methods
A variety of ensemble methods, such as bagging, are used to increase the generalization power of the model. These methods are applicable not just to neural networks but to any type of machine learning algorithm. However, in recent years, a number of ensemble methods that are specifically focused on neural networks have also been proposed. Two such methods are Dropout and DropConnect. These methods can be combined with many neural network architectures to obtain an additional accuracy improvement of about 2% in many real settings. However, the precise improvement depends on the type of data and the nature of the underlying training. For example, normalizing the activations in hidden layers can reduce the effectiveness of Dropout methods, although one can gain from the normalization itself. Ensemble methods will be discussed in the coming posts.
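As a rough illustration of how Dropout slots into an existing architecture, the sketch below (assuming PyTorch; the layer sizes and drop probability are arbitrary choices for illustration) inserts dropout layers between fully connected layers. The random masks are sampled only in training mode and disabled at prediction time.

```python
import torch.nn as nn

# Dropout is added as an extra layer that randomly zeroes hidden activations
# during training, which implicitly trains an ensemble of thinned networks.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),          # active only in training mode
    nn.Linear(256, 256), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

model.train()  # dropout masks are sampled during training
# ... training loop ...
model.eval()   # at prediction time the dropout layers become no-ops
```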