The choice of the loss function is critical in defining the outputs in a way that is sensitive to the application at hand.
For example, least-squares regression with numeric outputs requires a simple squared loss of the form (y − ŷ)² for a single training instance with target y and prediction ŷ.
One can also use other types of loss, such as the hinge loss for y ∈ {−1, +1} and real-valued prediction ŷ (with identity activation), which is defined as L = max{0, 1 − y · ŷ}. The hinge loss can be used to implement a learning method referred to as the support vector machine.
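As a concrete illustration of these two instance-level losses, the following is a minimal sketch in Python; the function names are chosen for illustration and do not come from any particular library:

```python
def squared_loss(y, y_hat):
    """Squared loss (y - y_hat)^2 for a numeric target y and prediction y_hat."""
    return (y - y_hat) ** 2

def hinge_loss(y, y_hat):
    """Hinge loss max{0, 1 - y * y_hat} for y in {-1, +1} and a real-valued
    prediction y_hat produced with the identity activation."""
    return max(0.0, 1.0 - y * y_hat)

# A confident correct prediction incurs zero hinge loss, whereas the
# squared loss still penalizes any deviation from the numeric target.
print(squared_loss(1.0, 2.5))   # 2.25
print(hinge_loss(1, 2.5))       # 0.0
print(hinge_loss(-1, 0.3))      # 1.3
```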
For multiway predictions (like predicting word identifiers or one of multiple classes), the softmax output is particularly useful. However, a softmax output is probabilistic, and therefore it requires a different type of loss function. In fact, for probabilistic predictions, two different types of loss functions are used, depending on whether the prediction is binary or whether it is multiway:
- Binary targets (logistic regression): In this case, it is assumed that the observed value y is drawn from {−1, +1}, and the prediction ŷ is an arbitrary numerical value obtained using the identity activation function. In such a case, the loss function for a single instance with observed value y and real-valued prediction ŷ (with identity activation) is defined as follows:

  L = log(1 + exp(−y · ŷ))

  This type of loss function implements a fundamental machine learning method, referred to as logistic regression. Alternatively, one can use a sigmoid activation function to output ŷ ∈ (0, 1), which indicates the probability that the observed value y is 1. Then, the negative logarithm of |y/2 − 0.5 + ŷ| provides the loss, assuming that y is drawn from {−1, +1}. This is because |y/2 − 0.5 + ŷ| indicates the probability that the prediction is correct. This observation illustrates that one can use various combinations of activation and loss functions to achieve the same result; a numerical check of this equivalence appears in the sketch after this list.
- Categorical targets: In this case, if ŷ1 . . . ŷk are the probabilities of the k classes (using the softmax activation of Equation 1.9), and the rth class is the ground-truth class, then the loss function for a single instance is defined as follows:

  L = −log(ŷr)

  This type of loss function implements multinomial logistic regression, and it is referred to as the cross-entropy loss. Note that binary logistic regression is identical to multinomial logistic regression when the value of k is set to 2 in the latter.
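The following is a minimal Python sketch of the losses discussed above (the function names are illustrative, not from any library). It also checks numerically that the two binary formulations coincide, and that cross-entropy with k = 2 reduces to binary logistic regression:

```python
import numpy as np

def logistic_loss_identity(y, v):
    """Binary logistic loss log(1 + exp(-y*v)) for y in {-1, +1},
    where v is the real-valued score under the identity activation."""
    return np.log(1.0 + np.exp(-y * v))

def logistic_loss_sigmoid(y, v):
    """Equivalent loss -log|y/2 - 0.5 + y_hat|, where y_hat = sigmoid(v)
    is interpreted as the probability that y is 1."""
    y_hat = 1.0 / (1.0 + np.exp(-v))
    return -np.log(np.abs(y / 2.0 - 0.5 + y_hat))

def cross_entropy_loss(scores, r):
    """Cross-entropy loss -log(y_hat_r), where y_hat is the softmax of the
    raw class scores and r indexes the ground-truth class."""
    scores = scores - np.max(scores)          # stabilize the exponentials
    y_hat = np.exp(scores) / np.sum(np.exp(scores))
    return -np.log(y_hat[r])

# The two binary formulations coincide for any score v and either label.
for y in (-1, 1):
    for v in (-2.0, 0.5, 3.0):
        assert np.isclose(logistic_loss_identity(y, v),
                          logistic_loss_sigmoid(y, v))

# Cross-entropy with k = 2 reproduces binary logistic regression:
# scores (v, 0) give softmax probability sigmoid(v) for class 0.
v = 0.7
print(cross_entropy_loss(np.array([v, 0.0]), 0))  # equals log(1 + exp(-v))
print(logistic_loss_identity(1, v))
```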
The key point to remember is that the nature of the output nodes, the activation function, and the loss function depend on the application at hand. Furthermore, these choices also depend on one another. Even though the perceptron is often presented as the quintessential representative of single-layer networks, it is only a single representative out of a very large universe of possibilities. In practice, one rarely uses the perceptron criterion as the loss function.
For discrete-valued outputs, it is common to use softmax activation with cross-entropy loss. For real-valued outputs,
it is common to use linear activation with squared loss. Generally, cross-entropy loss is easier to optimize than squared loss; for example, when a sigmoid or softmax output unit is confidently wrong, the cross-entropy gradient remains large, whereas the squared-loss gradient is damped by the saturating activation and nearly vanishes.
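To see this concretely, consider a sigmoid output unit with target t ∈ {0, 1} and a confidently wrong prediction; the following sketch (with illustrative function names) compares the gradients of the two losses with respect to the pre-activation value v:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def grad_cross_entropy(v, t):
    """d/dv of the cross-entropy loss with sigmoid output y_hat = sigmoid(v)
    and target t in {0, 1}; the sigmoid derivative cancels, leaving y_hat - t."""
    return sigmoid(v) - t

def grad_squared(v, t):
    """d/dv of the squared loss (t - y_hat)^2 with the same sigmoid output;
    the extra factor y_hat * (1 - y_hat) vanishes when the unit saturates."""
    y_hat = sigmoid(v)
    return 2.0 * (y_hat - t) * y_hat * (1.0 - y_hat)

# A confidently wrong prediction: large negative score, positive target.
v, t = -5.0, 1.0
print(grad_cross_entropy(v, t))  # ~ -0.993: a strong corrective signal
print(grad_squared(v, t))        # ~ -0.013: a nearly vanished gradient
```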