Impurity function

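An Impurity function is a function over the class-probability distribution of a multiset of labeled items that estimates how mixed the classes are: it is maximal when all classes are equally likely and minimal when a single class has probability 1.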

References

(Raileanu & Stoffel, 2004) ⇒ Laura Elena Raileanu, and Kilian Stoffel. (2004). “Theoretical Comparison between the Gini Index and Information Gain Criteria.” In: Annals of Mathematics and Artificial Intelligence, 41(1). doi:10.1023/B:AMAI.0000018580.96245.c6
- QUOTE: … Thus was introduced the “goodness of split” criterion, which is derived from the notion of an impurity function. … An impurity function is a function [math]\displaystyle{ \theta }[/math] defined on the set of all k-tuples of numbers [math]\displaystyle{ (p(c_1), p(c_2), \ldots, p(c_k)) }[/math] satisfying [math]\displaystyle{ p(c_i) \ge 0 \ \forall i \in \{1, \ldots, k\} }[/math] and [math]\displaystyle{ \sum^{k}_{i=1} p(c_i) = 1 }[/math] with the following properties: (a) [math]\displaystyle{ \theta }[/math] achieves its maximum at the point (1/k, 1/k, ..., 1/k); (b) [math]\displaystyle{ \theta }[/math] achieves its minimum at the points (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1); (c) [math]\displaystyle{ \theta }[/math] is a symmetric function of [math]\displaystyle{ (p(c_1), p(c_2), \ldots, p(c_k)) }[/math] … Given an impurity function [math]\displaystyle{ \theta }[/math], the impurity measure of any node [math]\displaystyle{ t }[/math] is defined by [math]\displaystyle{ i(t) = \theta(p(c_1\vert t), p(c_2\vert t), \ldots, p(c_k\vert t)) }[/math].
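
The properties (a) to (c) quoted above can be spot-checked numerically. The following is a minimal Python sketch, assuming the Gini index and Shannon entropy as example impurity functions (both are standard choices, but the exact formulas are not given in the quoted passage). The names gini, entropy, and node_impurity are hypothetical helpers introduced here for illustration only, and the check uses a single test distribution rather than a proof.

```python
import math
from itertools import permutations

def gini(p):
    """Gini index: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(x * x for x in p)

def entropy(p):
    """Shannon entropy in bits; classes with probability 0 contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def node_impurity(class_counts, theta):
    """i(t) = theta(p(c_1|t), ..., p(c_k|t)), with p(c_i|t) estimated
    from the class counts observed at node t (an assumed estimator)."""
    total = sum(class_counts)
    return theta([c / total for c in class_counts])

k = 3
uniform = [1.0 / k] * k      # the point (1/k, ..., 1/k)
pure = [1.0, 0.0, 0.0]       # one of the points (1, 0, ..., 0)
mixed = [0.5, 0.3, 0.2]      # an arbitrary test distribution

for theta in (gini, entropy):
    # (a) and (b): the test point lies between the pure and uniform values
    assert theta(pure) <= theta(mixed) <= theta(uniform)
    # (c): the value does not depend on the order of the arguments
    assert all(math.isclose(theta(list(q)), theta(mixed))
               for q in permutations(mixed))

# Impurity of a node holding 5 items of one class and 5 of another
balanced_gini = node_impurity([5, 5], gini)
balanced_entropy = node_impurity([5, 5], entropy)
```

For a node split evenly between two classes, both functions reach their two-class maximum (0.5 for the Gini index, 1 bit for entropy), while a node containing a single class scores 0 under either function.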

(Smyth & Goodman, 1992) ⇒ Padhraic Smyth, and Rodney M. Goodman. (1992). “An Information Theoretic Approach to Rule Induction from Databases.” In: IEEE Transactions on Knowledge and Data Engineering, 4(4). doi:10.1109/69.149926

(Mántaras, 1991) ⇒ R. López De Mántaras. (1991). “A Distance-Based Attribute Selection Measure for Decision Tree Induction.” In: Machine Learning, 6(1). doi:10.1023/A:1022694001379