Intraclass Correlation Coefficient a.k.a. repeatability
3.8 Explanation
3.8.1 The basics
Repeatability, as measured by the intraclass correlation coefficient (ICC), is a unitless way to see if measurements are consistent. If you measure the same thing more than once, are you going to get a similar answer?
This is usually calculated by comparing (in some way, methods vary) how much variation there is within a given object/individual measured versus the variation over all measurements of all individuals.
Here’s an example using SAS sample data (“PROC NESTED: Variability of Calcium Concentration in Turnip Greens :: SAS/STAT(R) 9.2 User’s Guide, Second Edition” n.d.).
# title 'Calcium Concentration in Turnip Leaves'
# '--Nested Random Model';
# title2 'Snedecor and Cochran, ''Statistical Methods'''
# ', 1976, p. 286';
# data Turnip;
# do Plant=1 to 4;
# do Leaf=1 to 3;
# do Sample=1 to 2;
# input Calcium @@;
# output;
# end;
# end;
# end;
# datalines;
# 3.28 3.09 3.52 3.48 2.88 2.80 2.46 2.44
# 1.87 1.92 2.19 2.19 2.77 2.66 3.74 3.44
# 2.55 2.55 3.78 3.87 4.07 4.12 3.31 3.31
# ;
#
# proc nested data=Turnip;
# classes plant leaf;
# var calcium;
# run;
Plant <- 1:4
Leaf <- 1:3
Sample = 1:2
turnip <- expand.grid(Sample=Sample, Leaf=Leaf, Plant=Plant)
turnip$calcium <- c(3.28, 3.09, 3.52, 3.48, 2.88, 2.80, 2.46, 2.44, 1.87, 1.92, 2.19, 2.19, 2.77, 2.66, 3.74, 3.44, 2.55, 2.55, 3.78, 3.87, 4.07, 4.12, 3.31, 3.31)
library(ggplot2)
ggplot(data = turnip,
mapping = aes(x = Plant,
y = calcium,
color = as.factor(Leaf),
shape = as.factor(Sample))) +
geom_point()
3.8.2 More technical
3.8.2.2 Key assumptions and limitations
3.8.2.2.1 Assumptions
This is how to know if you can use the method.
- Most assume Gaussian/normal data, but Nakagawa and Schielzeth (2010) extend to non-normal data.
3.8.2.2.2 Limitations
It’s important to pick the right ICC calculation for your study design, though at least one paper says you can do all three (Liljequist, Elfving, and Skavberg Roaldsen 2019) and compare.
Bailey and Byrnes (1990) suggest how to calculate a suitable sample size for studies based on the measurement error found.
Two thresholds are listed on the wikipedia page (“Intraclass Correlation” 2024), but neither paper provides justifications for their thresholds. (One study (Koo and Li 2016) has an erratum due to a key equation being incorrect in the feature comparison table (“Erratum to ‘A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research’ [J Chiropr Med 2016;15(2):155-163]” 2017).)
More realistically, you need to know what it actually means for your data and concepts being test (Wilson 2018). There’s nothing magic about any particular threshold.
3.8.2.4 Implementation and controversies
3.8.2.4.1 Choosing your ICC calculation
(Liljequist, Elfving, and Skavberg Roaldsen 2019) claim it doesn’t matter which ICC method you use, and that if you use all three main methods you can use any differences to suggest what type of bias you may have in measurements.
Lots of papers summarize how to use it (in Zotero library, working to figure out which ones best.) Most recent one (Ten Hove, Jorgensen, and Van Der Ark 2024) isn’t in ResearchRabbit.ai to easy visualize which papers cite the same papers. I have submitted a help request to them and a record change to OA for OpenAlex.org (which showed it as closed).
(Curry 2016) describe how to implement in R and SAS.
3.9 Examples “in the wild”
Citations and what is useful in the paper.
The most updated guideline is (Ten Hove, Jorgensen, and Van Der Ark 2024)