kappa Measure of Agreement Statistic

AKA: κ.
Context:
- Statistic Output: a Kappa Coefficient.
Example(s):
- Cohen's kappa.
- Fleiss' kappa.
- [math]\displaystyle{ \Pr(a)=(5+2)/10=0.70 }[/math]
  [math]\displaystyle{ \Pr(e)=(0.7 \times 0.6)+(0.3 \times 0.4)=0.54 }[/math]
  [math]\displaystyle{ \kappa = \frac{\Pr(a)-\Pr(e)}{1-\Pr(e)} }[/math]
  [math]\displaystyle{ \kappa = \frac{0.70-0.54}{1-0.54} = \frac{0.16}{0.46} \approx 0.348 }[/math]
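
The arithmetic of the worked example can be checked with a minimal Python sketch. The helper name `kappa_from_probabilities` is hypothetical (it is not part of any library), and the inputs are simply the Pr(a) and Pr(e) values stated in the example.

```python
def kappa_from_probabilities(pr_a, pr_e):
    """Return kappa given observed agreement Pr(a) and chance agreement Pr(e).

    Hypothetical helper for illustration only.
    """
    if pr_e == 1.0:
        # The denominator 1 - Pr(e) vanishes, so kappa is undefined.
        raise ValueError("kappa is undefined when Pr(e) = 1")
    return (pr_a - pr_e) / (1.0 - pr_e)


# Values taken from the worked example.
pr_a = (5 + 2) / 10                   # observed agreement
pr_e = (0.7 * 0.6) + (0.3 * 0.4)      # chance agreement
kappa = kappa_from_probabilities(pr_a, pr_e)
print(round(kappa, 3))
```

Rounded to three decimals, the result should agree with the value of about 0.348 given in the example.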
Counter-Example(s):
- F-Measure.
See: Classifier Performance Measure, Inter-Rater Agreement.

References

(Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Cohen's_kappa Retrieved:2015-4-24.
- Cohen's kappa coefficient is a statistic which measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, since κ takes into account the agreement occurring by chance.
  ^[1] Some researchers have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement. Others contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess — a very unrealistic scenario.

↑ Carletta, Jean. (1996) Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), pp. 249–254.

http://en.wikipedia.org/wiki/Cohen%27s_kappa
- Cohen's kappa measures the agreement between two raters who each classify [math]\displaystyle{ N }[/math] items into [math]\displaystyle{ C }[/math] mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892). The equation for κ is: [math]\displaystyle{ \kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}, \! }[/math] where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as defined by Pr(e)), κ = 0.
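
As a rough illustration of the definition above, the following Python sketch computes κ directly from two raters' label sequences. The function name `cohen_kappa` and the sample ratings are assumptions made for this example; the ratings were chosen only so that 10 items yield Pr(a) = 0.70 and Pr(e) = 0.54.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same N items.

    Hypothetical helper for illustration; not a library function.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)

    # Pr(a): relative observed agreement among the raters.
    pr_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Pr(e): chance agreement, using each rater's observed category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    categories = freq_a.keys() | freq_b.keys()
    pr_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    if pr_e == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when Pr(e) = 1")
    return (pr_a - pr_e) / (1.0 - pr_e)


# Made-up ratings: 10 items, two categories.
rater_a = ["yes"] * 7 + ["no"] * 3
rater_b = ["yes"] * 5 + ["no"] * 2 + ["yes"] * 1 + ["no"] * 2
print(round(cohen_kappa(rater_a, rater_b), 3))
```

With complete agreement the function returns 1, and it raises an error when every item is placed in one shared category, since the denominator 1 − Pr(e) is then zero.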

(Upton & Cook, 2008) ⇒ Graham Upton, and Ian Cook. (2008). “A Dictionary of Statistics, 2nd edition revised.” Oxford University Press. ISBN:0199541450
- QUOTE: Cohen's Kappa [math]\displaystyle{ (\kappa) }[/math]: A measure of agreement between two observers, suggested by Cohen in 1960. Suppose that the observers are required, independently, to assign items to one of [math]\displaystyle{ m }[/math] classes. Let [math]\displaystyle{ f_{jk} }[/math] be the number of individuals assigned to class [math]\displaystyle{ j }[/math] by the first observer and to class [math]\displaystyle{ k }[/math] by the second observer. Let [math]\displaystyle{ f_{j0} = \sum_{k=1}^{m} f_{jk}, \quad f_{0k} = \sum_{j=1}^{m} f_{jk}, \quad \text{and} \quad f_{00} = \sum_{j=1}^{m}\sum_{k=1}^{m} f_{jk} }[/math]. Define the quantities [math]\displaystyle{ O }[/math] and [math]\displaystyle{ E }[/math] by :[math]\displaystyle{ O = \sum_{j=1}^{m} f_{jj}, \quad E = \sum_{j=1}^{m} \frac{f_{j0} f_{0j}}{f_{00}} }[/math] so that [math]\displaystyle{ O }[/math] is the total number of individuals on which the observers are in complete agreement, and [math]\displaystyle{ E }[/math] is the expected total number of agreements that would have occurred if the observers had been statistically independent. The formula for Cohen’s kappa is :[math]\displaystyle{ \kappa = \frac{O - E}{f_{00} - E} }[/math] A value of 0 indicates statistical independence, and a value of 1 indicates perfect agreement.
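
The count-based definition quoted above can be sketched in Python as follows. Because Pr(a) = O / f_00 and Pr(e) = E / f_00, it gives the same value as the probability form of the formula. The function name `kappa_from_table` and the 2 × 2 table are assumptions made for illustration; the table is not data from the dictionary entry.

```python
def kappa_from_table(f):
    """Cohen's kappa from an m x m table of counts.

    f[j][k] is the number of items assigned to class j by the first
    observer and class k by the second. Hypothetical helper for illustration.
    """
    m = len(f)
    if m == 0 or any(len(row) != m for row in f):
        raise ValueError("table must be square and non-empty")

    row_totals = [sum(row) for row in f]                              # f_{j0}
    col_totals = [sum(f[j][k] for j in range(m)) for k in range(m)]   # f_{0k}
    total = sum(row_totals)                                           # f_{00}
    if total == 0:
        raise ValueError("table contains no items")

    observed = sum(f[j][j] for j in range(m))                         # O
    expected = sum(row_totals[j] * col_totals[j]
                   for j in range(m)) / total                         # E
    if expected == total:
        # Denominator f_00 - E is zero, so kappa is undefined.
        raise ValueError("kappa is undefined when E = f_00")
    return (observed - expected) / (total - expected)


# Made-up table: rows are the first observer, columns the second.
table = [[5, 2],
         [1, 2]]
print(round(kappa_from_table(table), 3))
```

For this table O = 7, E = 5.4 and f_00 = 10, so κ = 1.6 / 4.6, or about 0.348. A table whose counts are exactly proportional to the product of its margins, such as [[1, 1], [1, 1]], gives κ = 0.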

(Carletta, 1996) ⇒ Jean Carletta. (1996). “Assessing Agreement on Classification Tasks: The kappa Statistic.” In: Computational Linguistics, 22(2).
- ABSTRACT: Currently, computational linguists and cognitive scientists working in the area of discourse and dialogue argue that their subjective judgments are reliable using several different statistics, none of which are easily interpretable or comparable to each other. Meanwhile, researchers in content analysis have already experienced the same difficulties and come up with a solution in the kappa statistic. We discuss what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argue that we would be better off as a field adopting techniques from content analysis.

(Cohen, 1960) ⇒ J. Cohen. (1960). “A Coefficient of Agreement for Nominal Scales.” In: Educational and Psychological Measurement, 20.