STATISTICS with R

A comprehensive guide to statistical analysis in R

Kappa and Weighted Kappa in R

Kappa and weighted Kappa coefficients are widely used measures of inter-rater reliability, assessing the level of agreement between two raters or measurement occasions. Cohen’s unweighted Kappa applies when ratings are nominal, and the weighted Kappa when they are ordered categorical. The unweighted Kappa treats all disagreements as equally serious, while the weighted Kappa assigns weights that reflect the distances between categories, based on expert judgment or supporting evidence.

Introduction to Cohen’s Kappa

Inter‑rater reliability is a fundamental aspect of empirical research and applied evaluation, ensuring that independent raters consistently classify or categorize data in the same way. The Kappa coefficient was developed to measure this agreement beyond what might occur by chance, providing a more robust metric than simple percentage agreement. For instance, if two physicians independently diagnose patients as having “mild,” “moderate,” or “severe” symptoms, Kappa offers a statistical index of the closeness of their ratings while accounting for random agreement. This makes it especially valuable in disciplines such as psychology, medicine, and quality control, where categorical judgments are common and reliability is crucial. Measures of association such as Pearson or Spearman correlations are unsuitable for assessing agreement between raters: two raters can be perfectly correlated yet never agree, for example when one rater consistently scores every subject one category higher than the other.

Unweighted Kappa, also known as Cohen’s Kappa, is best suited for situations where categories are nominal and lack inherent order. For example, when two reviewers classify articles as “relevant” or “irrelevant,” each disagreement is given equal weight, regardless of the specific categories involved. This method is simple and effective for purely qualitative distinctions, but it does not consider the degree or proximity of disagreements.

Weighted Kappa builds on this idea for ordered categories by giving partial credit for less serious disagreements, allowing for a partial match between raters. For instance, when rating product quality as “Like new,” “Good,” or “Acceptable,” a mismatch between “Like new” and “Good” is less significant than one between “Like new” and “Acceptable.” This approach makes Weighted Kappa the go‑to choice for ordinal data, as it accounts for the size of the disagreement, offering a more nuanced and accurate picture of inter‑rater reliability when subtle differences in judgment are important.

Kappa values range from –1 to +1, with negative values signifying systematic disagreement, 0 indicating agreement at the level of chance, and higher values representing stronger agreement. Table 1 presents common guidelines for interpreting Kappa coefficients.

Table 1: Guidelines for interpreting Kappa agreement coefficients.
Kappa Coefficient   Strength of Agreement (Landis & Koch, 1977)
< 0.00              Poor
0.00 – 0.20         Slight
0.21 – 0.40         Fair
0.41 – 0.60         Moderate
0.61 – 0.80         Substantial
0.81 – 1.00         Almost perfect
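The chance correction that distinguishes Kappa from simple percentage agreement can be made concrete in a few lines of base R. The sketch below uses a hypothetical 2 x 2 cross-tabulation of two raters’ nominal ratings (the counts are invented for illustration) and applies the formula kappa = (Po − Pe) / (1 − Pe):

```r
# Cohen's kappa from first principles, using a hypothetical 2 x 2
# cross-tabulation of two raters' nominal ratings (invented counts).
tab <- matrix(c(20, 5, 10, 15), nrow = 2,
              dimnames = list(Rater1 = c("relevant", "irrelevant"),
                              Rater2 = c("relevant", "irrelevant")))
n   <- sum(tab)
p_o <- sum(diag(tab)) / n                       # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
kappa <- (p_o - p_e) / (1 - p_e)
kappa
```

Here the observed agreement is 0.70 and the chance-expected agreement is 0.50, giving kappa = 0.40, only “fair” agreement under the Table 1 guidelines even though the raw agreement looks like 70%.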

Kappa and Weighted Kappa Example

How reliable is the assessment of the new rater in evaluating the quality of used merchandise?

Figure 1: Candidates’ ratings of used items are checked against those of the Used Items Inspections Supervisor (AI-generated image).

A warehouse specializing in selling used items is seeking a Used Item Inspector to evaluate the quality of items before they are listed for sale. Inspectors must assign the used merchandise quality grades of “Like new (1)”, “Good (2)”, or “Acceptable (3)”. The company has shortlisted two applicants, Candidate 1 and Candidate 2, and will hire a candidate if their ratings closely agree with those of an experienced supervisor. Each candidate is given 25 used household items to rate using the three-grade scale or corresponding labels. Agreement with the supervisor’s ratings will be measured using the weighted kappa inter-rater reliability method. Table 2 presents five sample ratings from the Supervisor, Candidate 1, and Candidate 2.

Table 2: Ratings of used items by the Supervisor, and the two candidates.
Supervisor Candidate 1 Candidate 2
Like new Like new Like new
Like new Like new Acceptable
Good Like new Like new
Like new Like new Like new
Acceptable Good Acceptable

The employer will compute a Kappa measure of agreement between each of the candidates and the experienced supervisor. The data for this example, dfUsedItemsRating.csv, can be downloaded in CSV format.
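For readers working without the downloaded file, the structure of dfUsedItemsRating.csv can be reproduced by hand. The sketch below builds a small data frame holding only the five sample rows from Table 2, with the grades coded numerically (Like new = 1, Good = 2, Acceptable = 3); the column names Supervisor, Candidate01, and Candidate02 match those used in the analyses that follow, and the full file contains 25 such rows:

```r
# Five sample rows from Table 2, coded 1 = Like new, 2 = Good, 3 = Acceptable.
# The real dfUsedItemsRating.csv contains 25 rows in this layout.
dfSample <- data.frame(
  Supervisor  = c(1, 1, 2, 1, 3),
  Candidate01 = c(1, 1, 1, 1, 2),
  Candidate02 = c(1, 3, 1, 1, 3)
)
str(dfSample)
```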

Analysis: Kappa and Weighted Kappa in R

In this section, we learn how to compute unweighted and weighted kappa inter-rater agreement on the example data. The employer wants to know which of the two candidates’ ratings of the used items agrees more closely with those of the experienced supervisor. Because the rating values are ordered, the weighted kappa is the appropriate measure; however, for pedagogical purposes, we demonstrate both the unweighted and weighted Kappa coefficients in R. We use the R package “irr” to compute both coefficients. Please refer to the RStudio Environment page to learn how to install an R package.

A: Cohen’s Unweighted Kappa in R

We use the kappa2() function from the “irr” package to compute unweighted kappa coefficients measuring inter-rater agreement between each candidate and the supervisor. Listing 1 shows the R code for the unweighted kappa computation.

Listing 1: R code to run Cohen’s unweighted kappa.
library(irr)

# Read in ratings data
dfRatings <- read.csv("dfUsedItemsRating.csv")

# Unweighted kappa agreement between Candidate 1 and Supervisor
raters21UW <- kappa2(dfRatings[,c(2,1)], weight = "unweighted")

# Unweighted kappa agreement between Candidate 2 and Supervisor 
raters31UW <- kappa2(dfRatings[,c(3,1)], weight = "unweighted")

print(raters21UW)
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 25 
   Raters = 2 
    Kappa = 0.609 

        z = 4.24 
  p-value = 2.21e-05 


print(raters31UW)
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 25 
   Raters = 2 
    Kappa = 0.414 

        z = 2.86 
  p-value = 0.00423 

The unweighted Kappa measure of agreement between the Supervisor and Candidate 1 is 0.609, which is statistically significant (p < 0.05). In contrast, the agreement between the Supervisor and Candidate 2 is lower, at 0.414, also statistically significant (p < 0.05). This suggests that Candidate 1 is more reliable than Candidate 2 in evaluating the quality of used items, based on their alignment with the experienced supervisor’s assessments. However, since the ratings in this example are ordered (1, 2, 3 or Like new, Good, Acceptable), the weighted Kappa should be used to calculate the agreement between the candidates and the supervisor. The next section will explain how to compute the weighted Kappa inter-rater agreement coefficient in R.

B: Cohen’s Weighted Kappa in R

Weighted Kappa is used when two raters rate ordered categorical variables consisting of three or more categories. The weights are usually either linear (penalties proportional to the distance between categories), quadratic (penalties proportional to the squared distance), or custom weights derived by subject-matter experts. The following code in Listing 2 shows the R code for Cohen’s weighted kappa computation using quadratic (squared) weights.

Listing 2: R code to run Cohen’s weighted kappa.
# Weighted Kappa agreement coefficient
library(irr)

# Read in ratings data
dfRatings <- read.csv("dfUsedItemsRating.csv")

# Quadratic weighted kappa agreement between Supervisor and Candidate 1
raters21W <- kappa2(dfRatings[,c(2,1)], weight = "squared")

# Quadratic weighted kappa agreement between Supervisor and Candidate 2
raters31W <- kappa2(dfRatings[,c(3,1)], weight = "squared")

print(raters21W)
 Cohen's Kappa for 2 Raters (Weights: squared)

 Subjects = 25 
   Raters = 2 
    Kappa = 0.806 

        z = 4.08 
  p-value = 4.6e-05 
  
print(raters31W)
 Cohen's Kappa for 2 Raters (Weights: squared)

 Subjects = 25 
   Raters = 2 
    Kappa = 0.472 

        z = 2.36 
  p-value = 0.0183 
  
  

In Listing 2, the weighted kappa coefficient between the Supervisor and Candidate 1 is 0.806, while that between the Supervisor and Candidate 2 is 0.472. These results indicate that Candidate 1’s ratings align more closely with those of the experienced supervisor, suggesting that Candidate 1 is better qualified for the Used Item Inspector position.
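The effect of the quadratic weighting scheme can be seen by computing the weighted kappa from first principles. The sketch below applies the disagreement-weight formula w_ij = ((i − j) / (k − 1))^2 to the five sample Supervisor and Candidate 1 ratings from Table 2 (coded 1–3). It illustrates the arithmetic behind kappa2() rather than replacing it, and because it uses only the five sample rows, its value differs from the full 25-item result above:

```r
# Quadratic weighted kappa from scratch on the five Table 2 sample rows
# (Supervisor vs Candidate 1, coded 1 = Like new, 2 = Good, 3 = Acceptable).
sup <- c(1, 1, 2, 1, 3)
c1  <- c(1, 1, 1, 1, 2)
k   <- 3

# Quadratic disagreement weights: ((i - j) / (k - 1))^2
w <- (abs(outer(1:k, 1:k, "-")) / (k - 1))^2

# Observed and chance-expected proportions over the k x k rating grid
p_obs <- table(factor(sup, levels = 1:k),
               factor(c1,  levels = 1:k)) / length(sup)
p_exp <- outer(rowSums(p_obs), colSums(p_obs))

# Weighted kappa: 1 minus the ratio of weighted disagreement rates
kw <- 1 - sum(w * p_obs) / sum(w * p_exp)
kw
```

Linear weights would use |i − j| / (k − 1) in place of its square, so one-step mismatches (e.g., “Like new” vs “Good”) would be penalized relatively more than under quadratic weighting.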

Cohen’s unweighted and weighted kappa coefficients can also be computed using the “psych” package, as shown in Listing 3.

Listing 3: Cohen’s kappa using psych package.
# Weighted Kappa agreement coefficient using psych package
library(psych)

# Read in ratings data
dfRatings <- read.csv("dfUsedItemsRating.csv")

ratersKappaAll <- cohen.kappa(dfRatings)

# Results
print(ratersKappaAll, all = FALSE)

Cohen Kappa (below the diagonal) and Weighted Kappa (above the diagonal) 
For confidence intervals and detail print with all=TRUE
            Supervisor Candidate01 Candidate02
Supervisor        1.00        0.81        0.47
Candidate01       0.61        1.00        0.35
Candidate02       0.41        0.28        1.00

Average Cohen kappa for all raters  0.44
Average weighted kappa for all raters  0.54 
  
  

Reporting Cohen’s Kappa Interrater Agreement Results

The study assessed the consistency of two candidates in rating the quality of used merchandise compared to the evaluations of an experienced supervisor. Each candidate reviewed 25 items using an ordinal scale of “Like new,” “Good,” and “Acceptable,” with the supervisor’s ratings serving as the reference standard. Candidate 1’s evaluations closely matched those of the supervisor, yielding a weighted kappa coefficient of 0.806, indicating strong agreement. In contrast, Candidate 2’s ratings produced a weighted kappa of 0.472, reflecting moderate agreement. These results suggest that Candidate 1 demonstrated greater reliability in categorizing item quality, while Candidate 2’s assessments were less consistent with expert judgment. Overall, the findings support Candidate 1’s suitability for the inspector role due to stronger alignment with professional standards.
