Multiclass Cross-Entropy Measure
A Multiclass Cross-Entropy Measure is a dispersion measure that quantifies the average number of bits needed to identify an event drawn from a set of possibilities when the coding scheme is optimized for a predicted probability distribution rather than the true distribution.
- AKA: Categorical Cross-Entropy, [math]\displaystyle{ H(P,Q) }[/math].
- Context:
- It can range from being a Normalized Cross-Entropy to being an Unnormalized Cross-Entropy.
- ...
- It can generalize the Log-Loss Function for Multi-Class Classification tasks.
- It can measure the performance of a Classification Model with multiple classes.
- It can evaluate the discrepancy between the True Probability Distribution and the Predicted Probability Distribution over multiple classes.
- It can be defined mathematically as [math]\displaystyle{ H(P, Q) = -\sum_{i} P(i) \log Q(i) }[/math], where [math]\displaystyle{ P }[/math] is the True Distribution and [math]\displaystyle{ Q }[/math] is the Predicted Distribution.
- It can relate to the concept of Information Entropy and extend the Binary Cross-Entropy to multi-class problems.
- It can be used in conjunction with the Softmax Activation Function in the output layer of a Neural Network (as illustrated in the sketch after this list).
- ...
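The context items above can be illustrated with a minimal NumPy sketch (not taken from any of the cited sources; the class scores below are made-up values): a softmax converts raw scores into the predicted distribution [math]\displaystyle{ Q }[/math], and [math]\displaystyle{ H(P, Q) = -\sum_{i} P(i) \log Q(i) }[/math] is then computed against a one-hot true distribution [math]\displaystyle{ P }[/math].
```python
import numpy as np

def softmax(z):
    """Convert raw scores into a predicted probability distribution Q."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def multiclass_cross_entropy(p_true, q_pred, eps=1e-12):
    """H(P, Q) = -sum_i P(i) * log Q(i), in nats (use np.log2 for bits)."""
    q_pred = np.clip(q_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(p_true * np.log(q_pred))

# Hypothetical 3-class example: the true class is class 1 (one-hot P),
# and the raw network scores (logits) are assumed values.
p = np.array([0.0, 1.0, 0.0])
logits = np.array([1.0, 2.5, 0.3])
q = softmax(logits)
print(multiclass_cross_entropy(p, q))  # for one-hot P this equals -log Q(1)
```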
- Example(s):
- Theano's implementation: theano.tensor.nnet.nnet.categorical_crossentropy(), which computes the multiclass cross-entropy loss between predicted and true distributions.
- An implementation in PyTorch using torch.nn.CrossEntropyLoss() for multi-class classification problems (see the usage sketch after this list).
- An implementation in TensorFlow using tf.keras.losses.CategoricalCrossentropy().
- ...
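A minimal usage sketch for the PyTorch example above (the logit and target values are made up for illustration): torch.nn.CrossEntropyLoss applies a log-softmax internally, so it expects raw, unnormalized logits together with integer class indices rather than probabilities.
```python
import torch
import torch.nn as nn

# Hypothetical raw network outputs (logits) for 2 samples over 3 classes,
# and their true class indices; all values are assumed for illustration.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
targets = torch.tensor([0, 1])

# CrossEntropyLoss applies log-softmax internally, so it takes
# unnormalized logits and integer class indices, not probabilities.
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)
print(loss.item())  # mean multiclass cross-entropy over the batch
```
tf.keras.losses.CategoricalCrossentropy() is used analogously, but by default it expects one-hot encoded targets and predicted probabilities (raw logits can be passed by setting from_logits=True).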
- Counter-Example(s):
- An Accuracy Measure, which only accounts for the number of correct predictions and ignores probability distributions.
- A Binary Cross-Entropy, used for binary classification tasks.
- A Mean Squared Error, commonly used in regression tasks rather than classification.
- See: Cross-Entropy Loss Function, Information Entropy, Probability Distribution, Bit, Kullback–Leibler Divergence, Discrete Random Variable, Continuous Random Variable, Joint Entropy, Perplexity Measure, Squared Error
References
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/cross_entropy Retrieved:2017-6-7.
- In information theory, the cross entropy between two probability distributions [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution [math]\displaystyle{ q }[/math] , rather than the "true" distribution [math]\displaystyle{ p }[/math] .
The cross entropy for the distributions [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] over a given set is defined as follows: : [math]\displaystyle{ H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q),\! }[/math] where [math]\displaystyle{ H(p) }[/math] is the entropy of [math]\displaystyle{ p }[/math] , and [math]\displaystyle{ D_{\mathrm{KL}}(p \| q) }[/math] is the Kullback–Leibler divergence of [math]\displaystyle{ q }[/math] from [math]\displaystyle{ p }[/math] (also known as the relative entropy of p with respect to q — note the reversal of emphasis).
For discrete [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] this means : [math]\displaystyle{ H(p, q) = -\sum_x p(x)\, \log q(x). \! }[/math] The situation for continuous distributions is analogous. We have to assume that [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] are absolutely continuous with respect to some reference measure [math]\displaystyle{ r }[/math] (usually [math]\displaystyle{ r }[/math] is a Lebesgue measure on a Borel σ-algebra). Let [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] be probability density functions of [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] with respect to [math]\displaystyle{ r }[/math] . Then : [math]\displaystyle{ -\int_X P(x)\, \log Q(x)\, dr(x) = \operatorname{E}_p[-\log Q]. \! }[/math] NB: The notation [math]\displaystyle{ H(p,q) }[/math] is also used for a different concept, the joint entropy of [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math]
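As an illustrative check of the identity quoted above (the two distributions below are made-up examples, not from the cited source), the cross entropy decomposes into the entropy of [math]\displaystyle{ p }[/math] plus the Kullback–Leibler divergence of [math]\displaystyle{ q }[/math] from [math]\displaystyle{ p }[/math]:
```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # assumed "true" distribution
q = np.array([0.5, 0.3, 0.2])   # assumed "unnatural" coding distribution

cross_entropy = -np.sum(p * np.log2(q))      # H(p, q), in bits
entropy       = -np.sum(p * np.log2(p))      # H(p)
kl_divergence =  np.sum(p * np.log2(p / q))  # D_KL(p || q)

# H(p, q) == H(p) + D_KL(p || q)
assert np.isclose(cross_entropy, entropy + kl_divergence)
print(cross_entropy, entropy, kl_divergence)
```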
2017
- http://deeplearning.net/software/theano/library/tensor/nnet/nnet.html#theano.tensor.nnet.nnet.categorical_crossentropy
- QUOTE: Return the cross-entropy between an approximating distribution and a true distribution. The cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the “true” distribution p. Mathematically, this function computes H(p,q) = - \sum_x p(x) \log(q(x)), where p=true_dist and q=coding_dist.
2011a
- (Mikolov et al., 2011) ⇒ Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Černocký. (2011). “Empirical Evaluation and Combination of Advanced Language Modeling Techniques.” In: Proceedings of INTERSPEECH 2011.
- QUOTE: … Thus, the measure that we will aim to minimize is the cross entropy of the test data given the language model. The cross entropy is equal to [math]\displaystyle{ \log_2 }[/math] perplexity (PPL) ...
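- For illustration (not from the cited paper): under this relation, a test-set cross entropy of 7 bits per word corresponds to a perplexity of [math]\displaystyle{ 2^7 = 128 }[/math].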
2011b
- (Yu et al., 2011) ⇒ Dong Yu, Jinyu Li, and Li Deng. (2011). “Calibration of Confidence Measures in Speech Recognition.” In: IEEE Transactions on Audio, Speech, and Language Processing, 19(8). doi:10.1109/TASL.2011.2141988
2004
- (Caruana & Niculescu-Mizil, 2004) ⇒ Rich Caruana, and Alexandru Niculescu-Mizil. (2004). “Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria.” In: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ISBN:1-58113-888-1 doi:10.1145/1014052.1014063
- QUOTE: … compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low dimensional manifold.