Measure of Agreement
A Measure of Agreement is a performance measure that quantifies the degree to which different predictors agree on the same multi-agent prediction task.
- AKA: Inter-Rater Reliability, Inter-Rater Agreement, Inter-Rater Concordance, Inter-Observer Reliability, Inter-Coder Reliability.
- Context:
- It can range from being a Measure of Classification Agreement to being a Measure of Ranking Agreement to being a Measure of Estimation Agreement.
- Example(s):
- Counter-Example(s):
- See: Manual Annotation Task, Intra-Class Correlation, Consensus, Concordance Correlation Coefficient.
References
2022a
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Inter-rater_reliability Retrieved:2022-3-20.
- In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, inter-coder reliability, and so on) is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.
Assessment tools that rely on ratings must exhibit good inter-rater reliability, otherwise they are not valid tests.
There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are joint-probability of agreement, such as Cohen's kappa, Scott's pi and Fleiss' kappa; or inter-rater correlation, concordance correlation coefficient, intra-class correlation, and Krippendorff's alpha.
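For the joint-probability family mentioned above, Cohen's kappa is perhaps the most common case for two raters: it corrects the raw percentage of agreement for the agreement expected by chance from each rater's label frequencies. A minimal sketch in plain Python (the function name, example labels, and data are illustrative, not from the source):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling 8 items: 75% raw agreement, 50% expected by chance.
a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
b = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(a, b))  # 0.5
```

Kappa of 1 indicates perfect agreement, 0 indicates agreement no better than chance; values above roughly 0.6 are often read as substantial agreement, though such cutoffs are conventions rather than part of the statistic.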
2022b
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Glossary_of_clinical_research#I Retrieved:2022-3-20.
- Inter-rater reliability
- The property of yielding equivalent results when used by different raters on different occasions. (ICH E9)
2008
- (Upton & Cook, 2008) ⇒ Graham Upton, and Ian Cook. (2008). “A Dictionary of Statistics, 2nd Edition Revised.” Oxford University Press. ISBN:0199541450
- QUOTE: measure of agreement: A single statistic used to summarize the agreement between the rankings or classifications of objects made by two or more observers. Examples are the coefficient of concordance, Cohen’s kappa, and rank correlation coefficients.
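For the ranking case named in the dictionary entry, the coefficient of concordance (Kendall's W) summarizes how closely m raters' rankings of n objects coincide. A minimal sketch assuming untied rankings (the function name and example rankings are illustrative):

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W for m raters ranking n objects.

    rankings: list of m lists, each a permutation of ranks 1..n (no ties).
    Returns a value in [0, 1]: 1 means identical rankings, 0 means no concordance.
    """
    m, n = len(rankings), len(rankings[0])
    # Rank sum for each object across all raters.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    # Under no concordance every object's rank sum is near this mean.
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three raters ranking four objects in exactly the same order.
print(kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # 1.0
```

Unlike Cohen's kappa, which applies to nominal classifications, W is defined on ordinal rankings, matching the dictionary's distinction between agreement on classifications and agreement on rankings.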
1993
- (James et al., 1993) ⇒ Lawrence R. James, Robert G. Demaree, and Gerrit Wolf. (1993). “rwg: An assessment of within-group interrater agreement.” In: Journal of Applied Psychology, 78(2).
- QUOTE: F. L. Schmidt and J. E. Hunter (1989) critiqued the within-group interrater reliability statistic (rwg) described by L. R. James et al (1984). S. W. Kozlowski and K. Hattrup (1992) responded to the Schmidt and Hunter critique and argued that rwg is a suitable index of interrater agreement. This article focuses on the interpretation of rwg as a measure of agreement among judges' ratings of a single target. A new derivation of rwg is given that underscores this interpretation.