Logistic (Log) Loss Function

A Logistic (Log) Loss Function is a convex loss function that is defined as the negative log-likelihood of a logistic model.

AKA: Binary Cross-Entropy.
Context:
- input: [math]\displaystyle{ y_{train}$, [[training data]]. ** [[Function Output|output]]: \lt math\gt y_{pred}$, a [[Log Loss Value]] (predicted [[probabiliti]]es for each [[training data]] value, it ranges from 0 to 1). ** It can [[measure]] the [[performance]] of a [[classification model]] whose [[output]] is a [[probability]] value between 0 and 1. ** It can be used in [[binary classification]] tasks with [[predicted probabilities]]. ** It can be calculated using the formula: \lt math\gt L = -[y \log(p) + (1 - y) \log(1 - p)] }[/math], where [math]\displaystyle{ y }[/math] is the true label and [math]\displaystyle{ p }[/math] is the predicted probability.
- It can be minimized during Logistic Regression Training and other probabilistic classification models.
- It can produce a Convex Optimization Problem, facilitating efficient optimization algorithms.
- It can be generalized to multiclass classification using the cross-entropy loss function.
- It can be interpreted as the uncertainty between the true distribution and the predicted distribution.
- It can handle imbalanced datasets better than some other loss functions due to its probabilistic nature.
- ...
Example(s):
- An implementation in Theano: theano.tensor.nnet.nnet.sigmoid_binary_crossentropy(), which computes the logistic loss in a numerically stable way.
- An implementation in Scikit-Learn: sklearn.metrics.log_loss(), which provides a function to compute the log loss given true labels and predicted probabilities.
- A Python-based Logistic (Log) Loss Function:
  import numpy as np
  def log_loss(y_true, y_pred, eps=1e-15): y_pred = np.clip(y_pred, eps, 1 - eps) return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
- ...
Counter-Example(s):
- An Exponential Loss Function, used in AdaBoost algorithms for boosting weak learners.
- A Hinge Loss Function, as used by Support Vector Machines for maximizing the margin between classes.
- A Mean Squared Error loss function, commonly used in regression tasks rather than classification.
- A Kullback-Leibler Divergence Loss Function, which measures how one probability distribution diverges from a second, expected probability distribution.
- a Huber Loss Function,
- a Savage Loss Function,
- a Square Loss Function,
- a Tangent Loss Function.
See: Squared Error Function, Cross-Entropy Measure, Mean Absolute Error, Mean Squared Error.

References

2021a

(Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Loss_functions_for_classification#Logistic_loss Retrieved:2021-3-7.
- The logistic loss function can be generated using (2) and Table-I as follows : \begin{align} \phi(v) &= C[f^{-1}(v)]+\left(1-f^{-1}(v)\right)\, C'\left[f^{-1}(v)\right] \\ &= \frac{1}{\log(2)}\left [\frac{-e^v}{1+e^v}\log\frac{e^v}{1+e^v}-\left(1-\frac{e^v}{1+e^v}\right)\log\left(1-\frac{e^v}{1+e^v}\right)\right ]+\left(1-\frac{e^v}{1+e^v}\right) \left [\frac{-1}{\log(2)}\log\left(\frac{\frac{e^v}{1+e^v}}{1-\frac{e^v}{1+e^v}}\right)\right] \\ &=\frac{1}{\log(2)}\log(1+e^{-v}). \end{align} The logistic loss is convex and grows linearly for negative values which make it less sensitive to outliers. The logistic loss is used in the LogitBoost algorithm.
  The minimizer of I[f] for the logistic loss function can be directly found from equation (1) as : f^*_\text{Logistic}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right). This function is undefined when p(1\mid x)=1 or p(1\mid x)=0 (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when p(1\mid x) increases and equals 0 when p(1\mid x)= 0.5 .
  It's easy to check that the logistic loss and binary cross entropy loss (Log loss) are in fact the same (up to a multiplicative constant \frac{1}{\log(2)} ). The cross entropy loss is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. The cross entropy loss is ubiquitous in modern deep neural networks.

2021b

(Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Cross_entropy Retrieved:2021-3-7.
- In information theory, the cross-entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution , rather than the true distribution p .

2021c

(ML Glossary, 2021) ⇒ https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html Retrieved:2021-03-06.
- QUOTE: Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. (...).
  Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.
  Code
  
  def CrossEntropy(yHat, y):
     if y == 1:
       return -log(yHat)
     else:
       return -log(1 - yHat)
  In binary classification, where the number of classes $M$ equals 2, cross-entropy can be calculated as:
  $−\left(y\log\left(p\right)+\left(1−y\right)\log\left(1−p\right)\right)$
  If $M>2$ (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.
  $−\displaystyle \sum_{c=1}^My_{o,c}\log\left(p_{o,c}\right)$

2018a

(Fast AI, 2018a) ⇒ http://wiki.fast.ai/index.php/Log_Loss
- QUOTE: Logarithmic loss (related to cross-entropy) measures the performance of a classification model where the prediction input is a probability value between 0 and 1. The goal of our machine learning models is to minimize this value. A perfect model would have a log loss of 0. Log loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high log loss. There is a more detailed explanation of the justifications and math behind log loss here. …
  … To calculate log loss from scratch, we need to include the MinMax function (see below). Numpy implements this for us with np.clip()
  def logloss(true_label, predicted, eps=1e-15):
    p = np.clip(predicted, eps, 1 - eps)
     if true_label == 1:
     return -log(p)
     else:
      return -log(1 - p)

2018b

(DeepLearning,2018) ⇒ http://deeplearning.net/software/theano/library/tensor/nnet/nnet.html#theano.tensor.nnet.nnet.sigmoid_binary_crossentropy
- QUOTE: It is equivalent to binary_crossentropy(sigmoid(output), target), but with more efficient and numerically stable computation, especially when taking gradients.

2017a

(WikiFastAI) ⇒ http://wiki.fast.ai/index.php/Log_Loss#Log_Loss_vs_Cross-Entropy
- QUOTE: Log loss and cross-entropy are slightly different depending on the context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing. As a demonstration, where p and q are the sets p∈{y, 1−y} and q∈{ŷ, 1−ŷ} we can rewrite cross-entropy as:
  - p = set of true labels
  - q = set of prediction
  - y = true label
  - ŷ = predicted prob
- Which is exactly the same as log loss!

2017b

(Kaggle, 2017) ⇒ https://www.kaggle.com/c/bioresponse/discussion/1831
- QUOTE:

from math import log 

def log_loss(predicted, target):
    if len(predicted) != len(target):
        print 'lengths not equal!'
        return
    target = [float(x) for x in target]   # make sure all float values
    predicted = [min([max([x,1e-15]),1-1e-15]) for x in predicted]  # within (0,1) interval
    return -(1.0/len(target))*sum([target[i]*log(predicted[i]) + \ 
                                              (1.0-target[i])*log(1.0-predicted[i]) \
                                              for i in xrange(len(target))]) 

if __name__=='__main__':  # if you run at the command line as 'python utils.py'
    actual = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1]
    pred = [0.24160452, 0.41107934, 0.37063768, 0.48732519, 0.88929869,
        0.60626423, 0.09678324, 0.38135864, 0.20463064, 0.21945892]
    print log_loss(pred,actual)

2016

(Program Creek, 2016) ⇒ https://www.programcreek.com/python/example/86075/sklearn.metrics.log_loss
- QUOTE

def log_loss(solution, prediction, task = 'binary.classification'):
    Log loss for binary and multiclass. 
   [sample_num, label_num] = solution.shape
   eps = 1e-15

   pred = np.copy(prediction) # beware: changes in prediction occur through this
   sol = np.copy(solution)
   if (task == 'multiclass.classification') and (label_num>1):
       # Make sure the lines add up to one for multi-class classification
       norma = np.sum(prediction, axis=1)
       for k in range(sample_num):
           pred[k,:] /= sp.maximum (norma[k], eps) 
       # Make sure there is a single label active per line for multi-class classification
       sol = binarize_predictions(solution, task='multiclass.classification')
       # For the base prediction, this solution is ridiculous in the multi-label case

   # Bounding of predictions to avoid log(0),1/0,...
   pred = sp.minimum (1-eps, sp.maximum (eps, pred))
   # Compute the log loss
   pos_class_log_loss = - mvmean(sol*np.log(pred), axis=0)
   if (task != 'multiclass.classification') or (label_num==1):
       # The multi-label case is a bunch of binary problems.
       # The second class is the negative class for each column.
       neg_class_log_loss = - mvmean((1-sol)*np.log(1-pred), axis=0)
       log_loss = pos_class_log_loss + neg_class_log_loss
       # Each column is an independent problem, so we average.
       # The probabilities in one line do not add up to one.
       # log_loss = mvmean(log_loss) 
       # print('binary {}'.format(log_loss))
       # In the multilabel case, the right thing i to AVERAGE not sum
       # We return all the scores so we can normalize correctly later on
   else:
       # For the multiclass case the probabilities in one line add up one.
       log_loss = pos_class_log_loss
       # We sum the contributions of the columns.
       log_loss = np.sum(log_loss) 
       #print('multiclass {}'.format(log_loss))
   return log_loss

2015

(SciKit-Learn, 2015) ⇒ http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
- Log loss, aka logistic loss or cross-entropy loss.
- QUOTE: This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is : [math]\displaystyle{ -\log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp)) }[/math]

2014

(Kaggle, 2014) ⇒ https://www.kaggle.com/wiki/LogarithmicLoss
- QUOTE: [math]\displaystyle{ \operatorname{log loss} = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^My_{ij}\log(p_{ij}) }[/math]