Mutual Information Metric
A Mutual Information Metric is a relative metric that measures the mutual dependence between two random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math].
- AKA: I.
- Context:
- Metric Range: [math]\displaystyle{ [0, \infty) }[/math]; it equals 0 if and only if [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are independent, and for discrete variables it is bounded above by [math]\displaystyle{ \min\{H(X), H(Y)\} }[/math].
- It can be expressed as [math]\displaystyle{ I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X) }[/math], or equivalently [math]\displaystyle{ I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) }[/math], where [math]\displaystyle{ H(X) }[/math] and [math]\displaystyle{ H(Y) }[/math] are marginal information entropies, [math]\displaystyle{ H(X|Y) }[/math] is a conditional entropy, [math]\displaystyle{ H(X, Y) }[/math] is a joint entropy, and [math]\displaystyle{ H(X) \ge H(X|Y) }[/math].
- It can range from being a Continuous-Variable Mutual Information Metric to being a Discrete-Variable Mutual Information Metric, calculated in the discrete case as [math]\displaystyle{ I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log{ \left(\frac{p(x,y)}{p(x)\,p(y)} \right) } }[/math] (a worked sketch is given after this list).
- It can be expressed with a Kullback–Leibler Divergence Measure.
- Example(s):
- Counter-Example(s):
- See: Information Gain Metric, Pointwise Mutual Information, Information Theory, Statistic Function.
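The discrete-case definition above can be illustrated with a short numerical sketch (not drawn from any of the cited sources; the joint table pxy and the helper functions are assumed, illustrative names). It evaluates [math]\displaystyle{ I(X;Y) }[/math] directly from a joint probability table and cross-checks it against the entropy identity [math]\displaystyle{ I(X;Y) = H(X) + H(Y) - H(X,Y) }[/math].

```python
import numpy as np

def mutual_information(joint, base=2.0):
    """Mutual information I(X;Y) of a discrete joint distribution.

    `joint` is a 2-D array whose entries p(x, y) sum to 1.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)     # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)     # marginal p(y), row vector
    mask = joint > 0                          # convention: 0 * log 0 = 0
    ratio = joint[mask] / (px @ py)[mask]     # p(x,y) / (p(x) p(y))
    return float(np.sum(joint[mask] * np.log(ratio)) / np.log(base))

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(base))

# Illustrative joint distribution p(x, y) for binary X and Y.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

i_xy = mutual_information(pxy)
# Cross-check against I(X;Y) = H(X) + H(Y) - H(X,Y).
h_x = entropy(pxy.sum(axis=1))
h_y = entropy(pxy.sum(axis=0))
h_xy = entropy(pxy)
assert abs(i_xy - (h_x + h_y - h_xy)) < 1e-9
print(round(i_xy, 4))  # ≈ 0.2781 bits
```

For the symmetric table used here both marginals are uniform, so the value (about 0.278 bits) sits well below the discrete upper bound min{H(X), H(Y)} = 1 bit.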
References
2011
- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Mutual_information
- In probability theory and information theory, the mutual information (sometimes known by the archaic term transinformation) of two random variables is a quantity that measures the mutual dependence of the two variables. The most common unit of measurement of mutual information is the bit, when logarithms to the base 2 are used.
- http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities
- Mutual information can also be expressed as a Kullback-Leibler divergence, of the product p(x) × p(y) of the marginal distributions of the two random variables X and Y, from p(x,y) the random variables' joint distribution: [math]\displaystyle{ I(X;Y) = D_{\mathrm{KL}}(p(x,y)\|p(x)p(y)). }[/math]
Furthermore, let p(x|y) = p(x, y) / p(y). Then [math]\displaystyle{ \begin{align} I(X;Y) & {} = \sum_y p(y) \sum_x p(x|y) \log_2 \frac{p(x|y)}{p(x)} \\ & {} = \sum_y p(y) \; D_{\mathrm{KL}}(p(x|y)\|p(x)) \\ & {} = \mathbb{E}_Y\{D_{\mathrm{KL}}(p(x|y)\|p(x))\}. \end{align} }[/math] Thus mutual information can also be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more different the distributions p(x|y) and p(x), the greater the information gain.
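As a rough numerical check of the two identities quoted above (a minimal sketch, not code from Wikipedia or any cited source; the joint distribution pxy is an assumed example), the snippet below verifies that [math]\displaystyle{ D_{\mathrm{KL}}(p(x,y)\|p(x)p(y)) }[/math] coincides with the expectation over [math]\displaystyle{ Y }[/math] of [math]\displaystyle{ D_{\mathrm{KL}}(p(x|y)\|p(x)) }[/math]:

```python
import numpy as np

def kl_divergence(p, q, base=2.0):
    """D_KL(p || q) for discrete distributions given as arrays of the same shape."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base))

# Assumed joint distribution p(x, y): rows index x, columns index y.
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
px = pxy.sum(axis=1)          # marginal p(x)
py = pxy.sum(axis=0)          # marginal p(y)

# I(X;Y) as the KL divergence of the joint from the product of marginals.
i_joint = kl_divergence(pxy, np.outer(px, py))

# I(X;Y) as the expectation over Y of D_KL( p(x|y) || p(x) ).
px_given_y = pxy / py         # column y holds the conditional p(x|y)
i_expect = sum(py[j] * kl_divergence(px_given_y[:, j], px) for j in range(len(py)))

assert abs(i_joint - i_expect) < 1e-9
print(round(i_joint, 4))
```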
2003
- (Torkkola, 2003) ⇒ Kari Torkkola. (2003). “Feature Extraction by Non-Parametric Mutual Information Maximization.” In: The Journal of Machine Learning Research, 3.
- QUOTE: We present a method for learning discriminative feature transforms using as criterion the mutual information between class labels and transformed features. Instead of a commonly used mutual information measure based on Kullback-Leibler divergence, we use a quadratic divergence measure, which allows us to make an efficient non-parametric implementation and requires no prior assumptions about class densities.
2002
- (Strehl & Ghosh, 2002b) ⇒ Alexander Strehl, and Joydeep Ghosh. (2002). “Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions.” In: Journal of Machine Learning Research, 3.
- QUOTE: Mutual information, which is a symmetric measure to quantify the statistical information shared between two distributions (Cover and Thomas, 1991), provides a sound indication of the shared information between a pair of clusterings. Let X and Y be the random variables described by the cluster labeling (a) and (b), with k(a) and k(b) groups respectively. Let I(X; Y) denote the mutual information between X and Y, and H(X) denote the entropy of X. One can show that I(X; Y) is a metric. There is no upper bound for I(X; Y), …
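A minimal sketch of the quantity described in this quote (the labelings and the function name are assumed for illustration; this is not Strehl & Ghosh's code): it builds the contingency table of two cluster labelings, treats the normalized counts as the empirical joint distribution of X and Y, and evaluates I(X;Y) in bits.

```python
import numpy as np

def clustering_mutual_information(labels_a, labels_b, base=2.0):
    """I(X;Y) between two clusterings, from their empirical contingency table."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    a_ids, a_idx = np.unique(labels_a, return_inverse=True)
    b_ids, b_idx = np.unique(labels_b, return_inverse=True)
    counts = np.zeros((len(a_ids), len(b_ids)))
    np.add.at(counts, (a_idx, b_idx), 1)      # contingency table n_ij
    pxy = counts / counts.sum()               # empirical joint p(x, y)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])) / np.log(base))

# Two hypothetical clusterings (a) and (b) of the same eight objects.
labels_a = [0, 0, 0, 1, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 0, 2, 2, 2]
print(round(clustering_mutual_information(labels_a, labels_b), 4))
```

Because I(X; Y) has no fixed upper bound (identical labelings give I(X; Y) = H(X)), normalized variants are commonly reported when comparing clusterings.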
1991
- (Cover & Thomas, 1991) ⇒ Thomas M. Cover, and Joy A. Thomas. (1991). “Elements of Information Theory." Wiley-Interscience. ISBN:0471062596