Gini Diversity Index

AKA: Gini Impurity, Gini Separation.
Context
- It can be used by a CART algorithm.
- …
Counter-Example(s):
- a Gini Economic Inequality Index.
- an Information Gain (used by ID3).
- a Chi-Square (used by CHAID).
- an AUC Metric.
See: Classifier Performance Metric.

References

http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
- Used by the CART algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it were randomly labelled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category. To compute Gini impurity for a set of items, suppose y takes on values in {1, 2, ..., m}, and let f_i = the fraction of items labelled with value i in the set. [math]\displaystyle{ I_{G}(f) = \sum_{i=1}^{m} f_i (1-f_i) = \sum_{i=1}^{m} (f_i - {f_i}^2) = \sum_{i=1}^m f_i - \sum_{i=1}^{m} {f_i}^2 = 1 - \sum^{m}_{i=1} {f_i}^{2} }[/math]
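The closed form 1 − Σ f_i² quoted above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited article; the function name `gini_impurity` and the sample label lists are assumptions made for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum_i f_i^2, where f_i is the fraction of items with label i."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)  # label -> number of occurrences
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical sample data: a pure node and an evenly mixed two-class node.
pure = gini_impurity(["a", "a", "a", "a"])
mixed = gini_impurity(["a", "a", "b", "b"])
```

Under these assumptions the pure node has impurity zero, matching the statement that the minimum is reached when all cases fall into a single category, while the evenly mixed two-class node has impurity 1 − (0.5² + 0.5²) = 0.5.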

http://en.wikipedia.org/wiki/Gini_coefficient#Calculation
- The Gini index is defined as a ratio of the areas on the Lorenz curve diagram … For a discrete probability function [math]\displaystyle{ f(y) }[/math], where [math]\displaystyle{ y_i }[/math], i = 1 to n, are the points with nonzero probabilities and which are indexed in increasing order ([math]\displaystyle{ y_i \lt y_{i+1} }[/math]): [math]\displaystyle{ G = 1 - \frac{\Sigma_{i=1}^n \; f(y_i)(S_{i-1}+S_i)}{S_n} }[/math] where [math]\displaystyle{ S_i = \Sigma_{j=1}^i \; f(y_j)\,y_j\, }[/math] and [math]\displaystyle{ S_0 = 0\, }[/math]
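For contrast with the impurity measure, the discrete formula for the economic Gini coefficient quoted above can be sketched as follows. The function name `gini_coefficient` and its `values` / `probs` arguments are hypothetical; the sketch assumes the probabilities are nonzero and sum to one, and it sorts the points itself so that y_i is increasing.

```python
def gini_coefficient(values, probs):
    """G = 1 - sum_i f(y_i) * (S_{i-1} + S_i) / S_n, with S_i = sum_{j<=i} f(y_j) * y_j."""
    pairs = sorted(zip(values, probs))  # index the points in increasing order of y
    s_prev = 0.0    # S_{i-1}, starting from S_0 = 0
    weighted = 0.0  # running sum of f(y_i) * (S_{i-1} + S_i)
    for y, f in pairs:
        s_curr = s_prev + f * y
        weighted += f * (s_prev + s_curr)
        s_prev = s_curr
    if s_prev == 0:
        raise ValueError("S_n is zero; the coefficient is undefined")
    return 1.0 - weighted / s_prev


# Hypothetical distributions: everyone equal, and two equally likely values.
equal = gini_coefficient([5.0], [1.0])
two_point = gini_coefficient([1.0, 3.0], [0.5, 0.5])
```

Here the single-value distribution gives G = 0, and the two-point distribution gives G = 1 − 1.5/2 = 0.25. The input is a distribution over numeric quantities rather than class labels, which is why this index is listed as a counter-example.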

http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_tree-cart_split-criteria_categorical_gini.htm
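- Gini Criterion (CART algorithms) The Gini impurity measure at a node t is defined as: [math]\displaystyle{ i(t) = \sum_{i,j} C(i \vert j)\, p(i \vert t)\, p(j \vert t) }[/math] The Gini splitting criterion is the decrease of impurity defined as: [math]\displaystyle{ \Delta i(s,t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R) }[/math] where [math]\displaystyle{ p_L }[/math] and [math]\displaystyle{ p_R }[/math] are probabilities of sending a case to the left child node [math]\displaystyle{ t_L }[/math] and to the right child node [math]\displaystyle{ t_R }[/math] respectively. They are estimated as [math]\displaystyle{ p_L = p(t_L)/p(t) }[/math] and [math]\displaystyle{ p_R = p(t_R)/p(t) }[/math].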
  Note: When user-specified costs are involved, the altered priors can optionally be used to replace the priors. When altered priors are used, the problem is considered as if no costs are involved. The altered prior is defined as …
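Setting aside costs and altered priors, the decrease-of-impurity criterion for a binary split can be sketched as below. This is an illustrative sketch rather than the SPSS implementation: it assumes unit misclassification costs (C(i|j) = 1 for i ≠ j and 0 otherwise, so i(t) reduces to 1 − Σ p(i|t)²) and estimates p_L and p_R as the fractions of the parent's cases sent to each child. The names `gini` and `gini_decrease` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Node impurity i(t) under unit costs: 1 - sum_i p(i|t)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(left, right):
    """Delta i(s, t) = i(t) - p_L * i(t_L) - p_R * i(t_R) for a binary split."""
    if not left or not right:
        raise ValueError("both child nodes must be non-empty")
    parent = list(left) + list(right)
    p_left = len(left) / len(parent)    # estimated share of cases sent left
    p_right = len(right) / len(parent)  # estimated share of cases sent right
    return gini(parent) - p_left * gini(left) - p_right * gini(right)


# Hypothetical splits of the same four cases.
perfect = gini_decrease(["a", "a"], ["b", "b"])
useless = gini_decrease(["a", "b"], ["a", "b"])
```

With these inputs the split that separates the classes completely removes all of the parent's impurity (a decrease of 0.5), while the split that leaves both children as mixed as the parent yields a decrease of 0. A tree learner would typically pick the candidate split with the largest decrease.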

(Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Gini Coefficient.” In: (Sammut & Webb, 2011) p.457

(Raileanu & Stoffel, 2004) ⇒ Laura Elena Raileanu, and Kilian Stoffel. (2004). “Theoretical Comparison between the Gini Index and Information Gain Criteria.” In: Annals of Mathematics and Artificial Intelligence, 41(1). doi:10.1023/B:AMAI.0000018580.96245.c6
- QUOTE: Breiman adopts in his work the Gini diversity index which has the following form: [math]\displaystyle{ \theta(p(c_1\vert t),p(c_2\vert t), \ldots, p(c_k\vert t) ) = \sum^{k}_{i=1} \sum^{k}_{j=1,j\ne i} p(c_i\vert t)\, p(c_j\vert t) = 1 - \sum^{k}_{i=1}(p(c_i\vert t))^2 \qquad (1) }[/math]
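
The step between the pairwise form and the sum-of-squares form in the quoted equation can be spelled out as follows, writing [math]\displaystyle{ p_i }[/math] as shorthand for [math]\displaystyle{ p(c_i\vert t) }[/math] (a notation not used in the source) and assuming the class probabilities at node t sum to one:

[math]\displaystyle{ \sum^{k}_{i=1} \sum^{k}_{j=1,j\ne i} p_i\, p_j = \sum^{k}_{i=1} p_i \left( \sum^{k}_{j=1} p_j - p_i \right) = \sum^{k}_{i=1} p_i (1 - p_i) = \sum^{k}_{i=1} p_i - \sum^{k}_{i=1} p_i^2 = 1 - \sum^{k}_{i=1} p_i^2 }[/math]

For example, with k = 2 and [math]\displaystyle{ p_1 = p_2 = 0.5 }[/math], the pairwise form gives 0.5·0.5 + 0.5·0.5 = 0.5 and the sum-of-squares form gives 1 − (0.25 + 0.25) = 0.5.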