Neural Network - Methods To Deal With The Issue Of Overfitting
Neural Architecture and Parameter Sharing
The most effective way of building a neural network is to construct its architecture after giving some thought to the underlying data domain. For example, the successive words in a sentence are often related to one another, whereas the nearby pixels in an image are typically related. These insights are used to create specialized architectures for text and image data with fewer parameters.
Furthermore, many of the parameters might be shared. For example, a convolutional neural network uses the same set of parameters to learn the characteristics of each local block of the image. The recent advancements in the use of neural networks, such as recurrent neural networks and convolutional neural networks, are examples of this phenomenon.
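As a rough illustration of the savings from parameter sharing, the sketch below (assuming PyTorch; the channel counts, kernel size, and image size are arbitrary choices of my own) compares a small convolutional layer, which reuses one 3x3 filter bank across every spatial position, with a fully connected layer mapping the same input to the same output.

```python
import torch.nn as nn

# A convolutional layer shares one small set of weights across all spatial
# positions, whereas a fully connected layer learns a separate weight for
# every input-output pair.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
dense = nn.Linear(in_features=3 * 32 * 32, out_features=16 * 30 * 30)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))   # 448 parameters
print(count(dense))  # 44,251,200 parameters
```

The convolutional layer gets by with a few hundred parameters because the same filters are applied at every location of the image, while the unshared fully connected layer needs tens of millions for the same input and output sizes.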
Early Stopping
Another common form of regularization is early stopping, in which gradient descent is ended after only a few iterations. One way to decide the stopping point is to hold out a part of the training data and then test the error of the model on the held-out set. The gradient-descent approach is terminated when the error on the held-out set begins to rise. Early stopping essentially restricts the parameter space to a smaller neighborhood around the initial values of the parameters. From this point of view, early stopping acts as a regularizer because it effectively restricts the parameter space.
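The sketch below shows one way such a stopping rule might be wired into a training loop. It is only a minimal illustration: the synthetic data, the small model, the learning rate, and the patience value (how many epochs of rising held-out error are tolerated before stopping) are all assumptions of my own, written here with PyTorch.

```python
import copy
import torch
import torch.nn as nn

# Synthetic regression data, split into a training part and a held-out part.
X, y = torch.randn(1200, 20), torch.randn(1200, 1)
X_train, y_train = X[:1000], y[:1000]
X_held, y_held = X[1000:], y[1000:]

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

best_error, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(200):
    # One step of gradient descent on the training part.
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    # The error on the held-out set decides when to stop.
    model.eval()
    with torch.no_grad():
        held_error = loss_fn(model(X_held), y_held).item()

    if held_error < best_error:
        best_error = held_error
        best_state = copy.deepcopy(model.state_dict())
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # held-out error keeps rising: stop early

model.load_state_dict(best_state)  # roll back to the best held-out weights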
Trading Off Breadth for Depth
As discussed in the earlier posts, a two-layer neural network can be used as a universal function approximator if a large number of hidden units are used within the hidden layer. It turns out that networks with more layers (i.e., greater depth) tend to require far fewer units per layer, because the composition functions created by successive layers make the neural network more powerful. Increased depth is a form of regularization, as the features in later layers are forced to obey a particular type of structure imposed by the earlier layers. Increased constraints reduce the capacity of the network, which is helpful when there are limitations on the amount of available data. A brief explanation of this type of behavior will be given in the coming posts.
The number of units in each layer can typically be reduced to such an extent that a deep network often has far fewer parameters even when added up over its greater number of layers. This observation has led to an explosion in research on the topic of deep learning.
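To get a rough sense of this trade-off, the sketch below (assuming PyTorch; the input size, output size, and layer widths are arbitrary choices for illustration) compares a wide two-layer network with a deeper, narrower one mapping the same input to the same output.

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# Shallow but very wide: one hidden layer with 4096 units.
shallow = nn.Sequential(nn.Linear(784, 4096), nn.ReLU(), nn.Linear(4096, 10))

# Deeper but narrow: four hidden layers with 128 units each.
deep = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

print(count(shallow))  # about 3.3 million parameters
print(count(deep))     # about 0.15 million parameters
```

Even though the deeper network has more layers, its total parameter count is far smaller, which is the sense in which depth can substitute for breadth.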
Even though deep networks have fewer problems with respect to overfitting, they come with a different family of problems associated with ease of training. In particular, the loss derivatives with respect to the weights in different layers of the network tend to have vastly different magnitudes, which makes it challenging to choose step sizes properly. Different manifestations of this undesirable behavior are referred to as the vanishing and exploding gradient problems. Furthermore, deep networks often take unreasonably long to converge. These issues and design choices will be discussed in the coming posts on this website.
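One way to observe this behavior directly is to print the per-layer gradient norms of a deliberately deep network. The sketch below (assuming PyTorch; the depth, layer width, and use of sigmoid activations are arbitrary choices meant to provoke the effect) typically shows gradients in the earliest layers that are orders of magnitude smaller than those near the output.

```python
import torch
import torch.nn as nn

# A deep stack of sigmoid layers, chosen to make vanishing gradients visible.
layers = [nn.Sequential(nn.Linear(64, 64), nn.Sigmoid()) for _ in range(20)]
net = nn.Sequential(*layers)

x = torch.randn(32, 64)
loss = net(x).sum()
loss.backward()

# Gradient norm of the weight matrix in each layer, from input to output:
# the earliest layers usually receive far smaller gradients.
for i, layer in enumerate(layers):
    print(i, layer[0].weight.grad.norm().item())
```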
Ensemble Methods
A variety of ensemble methods, such as bagging, are used to increase the generalization power of the model. These methods are applicable not just to neural networks but to any type of machine learning algorithm. However, in recent years, a number of ensemble methods that are specifically focused on neural networks have also been proposed. Two such methods are Dropout and DropConnect. These methods can be combined with many neural network architectures to obtain an additional accuracy improvement of about 2% in many real settings. However, the precise improvement depends on the type of data and the nature of the underlying training. For example, normalizing the activations in hidden layers can reduce the effectiveness of Dropout methods, although one can gain from the normalization itself. Ensemble methods will be discussed in the coming posts.
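As a rough illustration of how Dropout slots into an existing architecture, the sketch below (assuming PyTorch; the layer sizes and drop probability are arbitrary choices for illustration) inserts dropout layers between fully connected layers. The random masks are sampled only in training mode and disabled at prediction time.

```python
import torch.nn as nn

# Dropout is added as an extra layer that randomly zeroes hidden activations
# during training, which implicitly trains an ensemble of thinned networks.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),          # active only in training mode
    nn.Linear(256, 256), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

model.train()  # dropout masks are sampled during training
# ... training loop ...
model.eval()   # at prediction time the dropout layers become no-ops
```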