Inter-Rater Reliability (IRR) Score
An Inter-Rater Reliability (IRR) Score is a Measure of Agreement that quantifies the Rating Consistency among two or more independent raters who rate, code, or assess the same phenomenon.
- AKA: Inter-Rater Agreement Score, Inter-Rater Concordance Score, Inter-Observer Reliability Score.
- Example(s): a Cohen's Kappa Score, a Fleiss' Kappa Score, a Scott's Pi Score, a Krippendorff's Alpha Score, a Joint Probability of Agreement Score, ...
- Counter-Example(s): an Intra-Rater Reliability Score, which measures the consistency of ratings given by the same person across multiple instances.
- See: Krippendorff's Alpha, Test Validity, Scott's pi, Concordance Correlation Coefficient, Intra-Class Correlation.
References
2021
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Inter-rater_reliability Retrieved:2021-8-1.
- In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, and so on) is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.
In contrast, intra-rater reliability is a score of the consistency in ratings given by the same person across multiple instances. For example, the grader should not let elements like fatigue influence their grading towards the end, or let a good paper influence the grading of the next paper. The grader should not compare papers together, but they should grade each paper based on the standard.
Inter-rater and intra-rater reliability are aspects of test validity. Assessments of them are useful in refining the tools given to human judges, for example, by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.
There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are joint-probability of agreement, Cohen's kappa, Scott's pi and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient, intra-class correlation, and Krippendorff's alpha.
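The following is a minimal Python sketch of two of the simpler statistics listed in the quote above, the joint probability of agreement and Cohen's kappa, computed for two hypothetical raters; the rater labels and ratings are invented for illustration only.

```python
from collections import Counter

def percent_agreement(ratings_a, ratings_b):
    """Joint probability of agreement: fraction of items both raters label identically."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(ratings_a)
    p_o = percent_agreement(ratings_a, ratings_b)
    # Chance agreement is estimated from each rater's marginal label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(ratings_a) | set(ratings_b))
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical raters labelling the same ten items (toy data).
rater_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
rater_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no"]

print(f"Joint agreement: {percent_agreement(rater_1, rater_2):.2f}")  # 0.70
print(f"Cohen's kappa:   {cohens_kappa(rater_1, rater_2):.2f}")       # 0.40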
2020
- (Castilho, 2020) ⇒ Sheila Castilho (2020). "On the Same Page? Comparing Inter-Annotator Agreement in Sentence and Document Level Human Machine Translation Evaluation". In: Proceedings of the Fifth Conference on Machine Translation (WMT@EMNLP 2020) Online.
- QUOTE: In addition to that, we also compute Inter-rater reliability (IRR) as the level of agreement between raters (percentage of matches), and Pearson correlation (r) between T1&T2 and T3&T4 (see Table 3). The comparison of the scenarios (sentence vs document) is calculated between the Test Sets (Test Set 1 & Test Set 2). We calculate IAA for all the tasks, namely adequacy, fluency, error and ranking. It is important to note that Fleiss Kappa is computed when analysing T1&T2&T5.
| Translators | T1, T5 | T2 | T3 | T4 |
|---|---|---|---|---|
| Test Set 1 (1-500 sent.) | S1 | S2 | D1 | D2 |
| Test Set 2 (501-1000 sent.) | D2 | D1 | S2 | S1 |
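The two measures reported in the quote above, IRR as the percentage of matches and the Pearson correlation r between rater pairs, can be sketched as follows; the scores below are invented toy values, not the study's actual T1-T4 ratings, and pairing conventions are assumed.

```python
from scipy.stats import pearsonr

def match_rate(scores_a, scores_b):
    """IRR as the percentage of items on which two raters give identical scores."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)

# Hypothetical 1-4 adequacy scores from two rater pairs on the same items;
# the real study uses 500-sentence test sets, not these toy values.
t1 = [4, 3, 4, 2, 4, 3, 1, 4]
t2 = [4, 3, 3, 2, 4, 4, 1, 4]
t3 = [3, 3, 4, 2, 4, 3, 2, 4]
t4 = [3, 4, 4, 2, 3, 3, 2, 4]

print(f"T1&T2: {match_rate(t1, t2):.1f}% matches, r = {pearsonr(t1, t2)[0]:.2f}")
print(f"T3&T4: {match_rate(t3, t4):.1f}% matches, r = {pearsonr(t3, t4)[0]:.2f}")
```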