Jensen-Shannon Divergence (JSD) Metric
A Jensen-Shannon Divergence (JSD) Metric is a symmetric probability distribution similarity measure.
- AKA: Information Radius, IRad, Total Divergence to the Average.
- Context:
- It ranges from 0 (for identical distributions) to [math]\displaystyle{ 2\log 2 }[/math] (for maximally different distributions) in its unweighted information-radius form, assuming that [math]\displaystyle{ 0\log 0=0 }[/math]; the standard [math]\displaystyle{ \tfrac{1}{2} }[/math]-weighted form is bounded above by [math]\displaystyle{ \log 2 }[/math] (see the computational sketch after the See list below).
- It can answer the question: "How much information is lost if we describe the two words (or random variables in the general case) that correspond to [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] with their average distribution?"
- It is not itself a metric, but its square root (the Jensen-Shannon Distance) is a metric.
- …
- Counter-Example(s):
- a Kullback-Leibler Divergence, which is not symmetric.
- See: Kullback-Leibler Divergence, Jensen-Shannon Distance, Deviance Information Criterion, Statistical Distance, Probability Theory.
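The following is a minimal Python sketch of the two-distribution case, using base-2 logarithms and the [math]\displaystyle{ 0\log 0=0 }[/math] convention noted above; the kl_divergence and jsd function names are illustrative rather than taken from any particular library.
```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits, with 0*log(0) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence JSD(p || q) = 1/2 D(p || m) + 1/2 D(q || m),
    where m = (p + q) / 2 is the average distribution."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p, q = [1.0, 0.0], [0.0, 1.0]            # maximally different distributions
print(jsd(p, q))                          # 1.0 bit = log2(2), the maximum of the 1/2-weighted form
print(np.isclose(jsd(p, q), jsd(q, p)))   # True: the measure is symmetric
```
Doubling this [math]\displaystyle{ \tfrac{1}{2} }[/math]-weighted value gives the unweighted information-radius (IRad) form, whose maximum is [math]\displaystyle{ 2\log 2 }[/math].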
References
2023
- chat
- Q: What are the similarities and differences between Kullback-Leibler (KL) Divergence and Jensen-Shannon Divergence?
- A: Kullback-Leibler (KL) Divergence and Jensen-Shannon Divergence are both measures of the difference between two probability distributions.
The similarity between them is that they both compare two probability distributions, both are non-negative, and both are zero exactly when the two distributions are identical.
The main difference between them is that KL divergence is asymmetric (D(P || Q) ≠ D(Q || P) in general) and is not a metric, whereas the Jensen-Shannon divergence is symmetric and its square root is a true metric that obeys the triangle inequality. Additionally, the KL divergence D(P || Q) is infinite (or undefined) when Q assigns zero probability to an outcome to which P assigns non-zero probability, whereas the Jensen-Shannon divergence is finite and well defined for any two probability distributions.
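A minimal sketch of these properties, assuming SciPy is available: scipy.stats.entropy(p, q) computes the KL divergence D(p || q), and scipy.spatial.distance.jensenshannon returns the Jensen-Shannon distance (the square root of the divergence).
```python
import numpy as np
from scipy.stats import entropy                    # entropy(p, q) computes D(p || q)
from scipy.spatial.distance import jensenshannon   # returns the JS *distance* (square root of the divergence)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.1, 0.7, 0.2])

# KL divergence is asymmetric, and D(q || p) is infinite here because p assigns
# zero probability to an outcome that q does not.
print(entropy(p, q, base=2), entropy(q, p, base=2))

# The Jensen-Shannon distance is symmetric and always finite;
# squaring it recovers the Jensen-Shannon divergence.
d = jensenshannon(p, q, base=2)
print(np.isclose(d, jensenshannon(q, p, base=2)))  # True
print(d ** 2)                                      # the JSD value in bits
```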
2011
- http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
- In probability theory and statistics, the Jensen–Shannon divergence is a popular method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)[1] or total divergence to the average.[2] It is based on the Kullback–Leibler divergence, with the notable (and useful) difference that it is always a finite value. The square root of the Jensen–Shannon divergence is a metric.[3][4]
Consider the set [math]\displaystyle{ M_+^1(A) }[/math] of probability distributions where A is a set provided with some σ-algebra of measurable subsets. In particular we can take A to be a finite or countable set with all subsets being measurable. The Jensen–Shannon divergence (JSD) [math]\displaystyle{ M_+^1(A) \times M_+^1(A) \rightarrow [0,\infty{}) }[/math] is a symmetrized and smoothed version of the Kullback–Leibler divergence [math]\displaystyle{ D(P \parallel Q) }[/math]. It is defined by
:[math]\displaystyle{ JSD(P \parallel Q)= \frac{1}{2}D(P \parallel M)+\frac{1}{2}D(Q \parallel M) }[/math]
where [math]\displaystyle{ M=\frac{1}{2}(P+Q). }[/math]
If A is countable, a more general definition, allowing for the comparison of more than two distributions, is:
:[math]\displaystyle{ JSD(P_1, P_2, \ldots, P_n) = H\left(\sum_{i=1}^n \pi_i P_i\right) - \sum_{i=1}^n \pi_i H(P_i) }[/math]
where [math]\displaystyle{ \pi_1, \pi_2, \ldots, \pi_n }[/math] are the weights for the probability distributions [math]\displaystyle{ P_1, P_2, \ldots, P_n }[/math] and [math]\displaystyle{ H(P) }[/math] is the Shannon entropy for distribution [math]\displaystyle{ P }[/math]. For the two-distribution case described above,
:[math]\displaystyle{ P_1=P,\ P_2=Q,\ \pi_1 = \pi_2 = \frac{1}{2}. }[/math]
- ↑ Hinrich Schütze; Christopher D. Manning (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. p. 304. ISBN 0-262-13360-1. http://nlp.stanford.edu/fsnlp/.
- ↑ Dagan, Ido; Lillian Lee, Fernando Pereira (1997). "Similarity-Based Methods For Word Sense Disambiguation". Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics: pp. 56–63. http://citeseer.ist.psu.edu/dagan97similaritybased.html. Retrieved 2008-03-09.
- ↑ Endres, D. M.; J. E. Schindelin (2003). "A new metric for probability distributions". IEEE Trans. Inf. Theory 49 (7): pp. 1858–1860. doi:10.1109/TIT.2003.813506.
- ↑ Österreicher, F.; I. Vajda (2003). "A new class of metric divergences on probability spaces and its statistical applications". Ann. Inst. Statist. Math. 55 (3): pp. 639–653. doi:10.1007/BF02517812.
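As an illustration of the general, weighted (entropy-based) definition quoted above, here is a minimal Python sketch; the shannon_entropy and generalized_jsd names are illustrative, and the weights default to uniform.
```python
import numpy as np

def shannon_entropy(p, base=2):
    """Shannon entropy H(p), with the 0*log(0) = 0 convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(base))

def generalized_jsd(distributions, weights=None, base=2):
    """JSD(P_1, ..., P_n) = H(sum_i pi_i * P_i) - sum_i pi_i * H(P_i)."""
    distributions = [np.asarray(p, dtype=float) for p in distributions]
    if weights is None:
        weights = np.full(len(distributions), 1.0 / len(distributions))
    mixture = sum(w * p for w, p in zip(weights, distributions))
    return shannon_entropy(mixture, base) - sum(
        w * shannon_entropy(p, base) for w, p in zip(weights, distributions))

# With n = 2 and weights (1/2, 1/2) this reduces to the two-distribution JSD.
print(generalized_jsd([[1.0, 0.0], [0.0, 1.0]]))   # 1.0 = log2(2)
```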
1999
- (Manning & Schütze, 1999) ⇒ Christopher D. Manning, and Hinrich Schütze. (1999). “Foundations of Statistical Natural Language Processing.” The MIT Press. ISBN:0262133601
- QUOTE: … (Dis-)similarity measure … information radius (IRad) … [math]\displaystyle{ D(p \vert\vert \frac{p+q}{2}) + D(q \vert\vert \frac{p+q}{2}) }[/math] … The second measure in table 8.9, information radius (or total divergence to the average as Dagan et al. (1997b) call it), overcomes both these problems. It is symmetric ([math]\displaystyle{ \operatorname{IRad}(p,q) = \operatorname{IRad}(q,p) }[/math]) and there is no problem with infinite values since [math]\displaystyle{ \frac{p_i+q_i}{2} \ne 0 }[/math] if either [math]\displaystyle{ p_i \ne 0 }[/math] or [math]\displaystyle{ q_i \ne 0 }[/math]. The intuitive interpretation of IRad is that it answers the question: How much information is lost if we describe the two words (or random variables in the general case) that correspond to [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] with this average distribution? IRad ranges from 0 for identical distributions to [math]\displaystyle{ 2\log 2 }[/math] for maximally different distributions (see exercise 8.25). As usual we assume [math]\displaystyle{ 0\log 0 = 0 }[/math].
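As an illustrative check of the [math]\displaystyle{ 2\log 2 }[/math] bound: for [math]\displaystyle{ p=(1,0) }[/math] and [math]\displaystyle{ q=(0,1) }[/math] the average distribution is [math]\displaystyle{ (\tfrac{1}{2},\tfrac{1}{2}) }[/math], so [math]\displaystyle{ D(p \parallel \frac{p+q}{2}) = D(q \parallel \frac{p+q}{2}) = \log 2 }[/math] and [math]\displaystyle{ \operatorname{IRad}(p,q) = 2\log 2 }[/math]. Note that IRad omits the [math]\displaystyle{ \tfrac{1}{2} }[/math] weights, so it equals twice the [math]\displaystyle{ JSD(P \parallel Q) }[/math] defined in the 2011 entry above.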