Inter-rater reliability

Inter-rater reliability, or inter-rater agreement, is the degree of agreement among raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the metrics given to human judges, for example by determining whether a particular scale is appropriate for measuring a particular variable.

A number of statistics can be used to determine inter-rater reliability, and different statistics are appropriate for different types of measurement. Options include the joint probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, and intra-class correlation.

The joint probability of agreement is probably the simplest and least robust measure. It is the number of times the raters assign the same rating (e.g. 1, 2, ... 5) to an item, divided by the total number of items rated. This, however, assumes that the data are entirely nominal, so a near miss counts as much as a large disagreement. Another problem with this statistic is that it does not take into account agreement that happens purely by chance.
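As an illustration, the joint probability of agreement for two raters could be computed roughly as follows. This is a minimal sketch: the function name and the example ratings are hypothetical and serve only to show the arithmetic.

```python
def joint_probability_of_agreement(ratings_a, ratings_b):
    """Fraction of items on which two raters gave the identical rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same items")
    if not ratings_a:
        raise ValueError("at least one item is required")
    # Count the items where the two ratings match exactly.
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return agreements / len(ratings_a)


# Hypothetical ratings on a 1-5 scale for five items.
rater_a = [1, 2, 3, 4, 5]
rater_b = [1, 2, 3, 5, 4]
agreement = joint_probability_of_agreement(rater_a, rater_b)
print(agreement)
```

The raters match on three of the five items. The two mismatches (4 versus 5) are near misses, but the statistic treats them exactly like any other disagreement.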

Kappa statistics

 * Main articles: Cohen's kappa, Fleiss' kappa

Cohen's kappa, which works for two raters, and Fleiss' kappa, an adaptation that works for any fixed number of raters, are statistics that also take into account the amount of agreement that could be expected to occur by chance. They share one problem with the joint probability of agreement: they treat the data as nominal and assume no underlying connection between the scores.
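The chance correction in Cohen's kappa can be written as $$\kappa = (p_o - p_e) / (1 - p_e)$$, where $$p_o$$ is the observed proportion of agreement and $$p_e$$ is the proportion expected by chance from each rater's marginal frequencies. The sketch below follows that formula for two raters; the function name and the example data are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who rated the same items (nominal data)."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("both raters must rate the same non-empty set of items")
    # Observed agreement: fraction of items with identical ratings.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical yes/no judgements on ten items.
rater_a = ["y", "y", "y", "n", "n", "n", "y", "n", "y", "n"]
rater_b = ["y", "y", "n", "n", "n", "y", "y", "n", "y", "n"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 items, but because each rater says "y" half the time, an agreement of 0.5 is expected by chance, so kappa is noticeably lower than the raw agreement.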

Correlation coefficients

 * Main articles: Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient

For inter-rater correlation, either Pearson's $$r$$ or Spearman's $$\rho$$ can be used to measure the pairwise correlation between raters; the mean of the pairwise values then gives an average level of agreement for the group. The mean of Spearman's $$\rho$$ has been used to measure inter-judge correlation. However, neither Spearman's nor Pearson's coefficient takes into account the magnitude of the differences between scores. For example, in rating on a scale of $$1, \ldots, 5$$, Judge A might assign the scores $$1, 2, 1, 3$$ to four segments and Judge B might assign $$2, 3, 2, 4$$. The correlation coefficient would be 1, indicating perfect correlation, yet the judges do not agree on a single segment.
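The Judge A and Judge B example can be checked directly. The sketch below computes Pearson's $$r$$ from its textbook definition and compares it with the fraction of segments on which the judges give the same score; the helper name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


judge_a = [1, 2, 1, 3]
judge_b = [2, 3, 2, 4]

r = pearson_r(judge_a, judge_b)
exact_agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
print(r, exact_agreement)
```

Judge B's scores are Judge A's scores shifted up by one, so the correlation is perfect even though the two judges never assign the same score.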

Intra-class correlation coefficient

Another way of performing reliability testing is to use the intra-class correlation coefficient (ICC). This is defined as "the proportion of variance of an observation due to between-subject variability in the true scores". As a proportion of variance, the ICC lies between 0 and 1 in principle, although some estimators, and an earlier pairwise definition, can yield negative values. The ICC will be high when there is little variation between the scores given to each segment by the raters, e.g. if all raters give the same, or similar, scores to each of the segments. The ICC is an improvement over Pearson's $$r$$ and Spearman's $$\rho$$, as it takes into account the differences in ratings for individual segments, along with the correlation between raters.
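There are several forms of the ICC, and the text above does not say which one is meant. As one possible illustration, the sketch below assumes the one-way random-effects, single-rater form (often written ICC(1,1)), estimated from the between-segment and within-segment mean squares as $$(MS_B - MS_W) / (MS_B + (k - 1) MS_W)$$ for $$k$$ raters. The function name is hypothetical, and the data are the two judges' scores from the correlation example.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of segments, each a list of the k scores it received.
    """
    n = len(ratings)        # number of segments
    k = len(ratings[0])     # number of raters per segment
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 segments, each with the same k >= 2 scores")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-segment mean square: spread of segment means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-segment mean square: disagreement among raters on the same segment.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Each inner list holds (Judge A, Judge B) scores for one segment.
scores = [[1, 2], [2, 3], [1, 2], [3, 4]]
icc = icc_one_way(scores)
print(round(icc, 3))
```

For these scores the estimate is 4/7, about 0.571, well below the Pearson correlation of 1 for the same data, because the constant one-point gap between the judges counts as within-segment variance.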