Intraclass Correlation Coefficient a.k.a. repeatability

3.8 Explanation

3.8.1 The basics

Repeatability, as measured by the intraclass correlation coefficient (ICC), is a unitless measure of how consistent your measurements are. If you measure the same thing more than once, are you going to get a similar answer?

It is usually calculated by comparing (the exact method varies) how much variation there is within repeated measurements of the same object or individual versus the total variation across all measurements of all individuals.
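
In the simplest (one-way) version, that comparison boils down to the variance among individuals divided by the total variance (among-individual plus within-individual). A toy calculation with made-up variance components, just to show the idea:

# Made-up variance components, not from any real dataset
var_among  <- 0.8   # variation among individuals
var_within <- 0.2   # variation within repeated measurements of one individual

# Repeatability/ICC: the proportion of total variance that is among individuals
var_among / (var_among + var_within)
# 0.8 here: most of the variation is among individuals, so repeated
# measurements of the same individual agree well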

Here’s an example using SAS sample data (PROC NESTED: Variability of Calcium Concentration in Turnip Greens :: SAS/STAT(R) 9.2 User’s Guide, Second Edition n.d.).

   # title 'Calcium Concentration in Turnip Leaves'
   #       '--Nested Random Model';
   # title2 'Snedecor and Cochran, ''Statistical Methods'''
   #        ', 1976, p. 286';
   # data Turnip;
   #    do Plant=1 to 4;
   #       do Leaf=1 to 3;
   #          do Sample=1 to 2;
   #             input Calcium @@;
   #             output;
   #          end;
   #       end;
   #    end;
   #    datalines;
   # 3.28 3.09 3.52 3.48 2.88 2.80 2.46 2.44
   # 1.87 1.92 2.19 2.19 2.77 2.66 3.74 3.44
   # 2.55 2.55 3.78 3.87 4.07 4.12 3.31 3.31
   # ;
   # 
   # proc nested data=Turnip;
   #    classes plant leaf;
   #    var calcium;
   # run;

# Recreate the SAS example data in R: 4 plants, 3 leaves per plant,
# 2 subsamples per leaf
Plant <- 1:4
Leaf <- 1:3
Sample <- 1:2

# expand.grid varies the first factor fastest, which matches the order of the
# SAS datalines (Sample within Leaf within Plant)
turnip <- expand.grid(Sample = Sample, Leaf = Leaf, Plant = Plant)

turnip$calcium <- c(3.28, 3.09, 3.52, 3.48, 2.88, 2.80, 2.46, 2.44,
                    1.87, 1.92, 2.19, 2.19, 2.77, 2.66, 3.74, 3.44,
                    2.55, 2.55, 3.78, 3.87, 4.07, 4.12, 3.31, 3.31)


library(ggplot2)

# Calcium by plant, colored by leaf, with point shape marking the
# duplicate samples from each leaf
ggplot(data = turnip,
       mapping = aes(x = Plant,
                     y = calcium,
                     color = as.factor(Leaf),
                     shape = as.factor(Sample))) +
  geom_point()
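
One way to actually estimate repeatability for these data is a nested random-effects model. Here is a minimal sketch, assuming the lme4 package is installed; lmer fits by REML, so the variance components may differ slightly from PROC NESTED’s analysis-of-variance estimates.

library(lme4)

# Nested random effects: leaves within plants; the duplicate samples per
# leaf end up in the residual variance
fit <- lmer(calcium ~ 1 + (1 | Plant/Leaf), data = turnip)

# Pull out the variance components
vc <- as.data.frame(VarCorr(fit))
var_plant <- vc$vcov[vc$grp == "Plant"]
var_leaf  <- vc$vcov[vc$grp == "Leaf:Plant"]
var_resid <- vc$vcov[vc$grp == "Residual"]

# Repeatability of plants: among-plant variance as a proportion of the total
var_plant / (var_plant + var_leaf + var_resid)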

3.8.2 More technical

3.8.2.1 Questions and data types

Example problem structures and the types of data you need.

3.8.2.2 Key assumptions and limitations

3.8.2.2.1 Assumptions

This is how to know if you can use the method.

  • Most ICC calculations assume Gaussian (normal) data, but Nakagawa and Schielzeth (2010) extend the approach to non-normal data; a quick residual check on the turnip example is sketched below.
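
For the turnip example, a quick way to eyeball the normality assumption is a Q-Q plot of the model residuals. A minimal sketch, reusing the nested lme4 model from the earlier sketch:

library(lme4)

# Refit the nested model from the earlier sketch
fit <- lmer(calcium ~ 1 + (1 | Plant/Leaf), data = turnip)

# Residuals should fall roughly on the line for the usual Gaussian ICC
# calculations to be reasonable
qqnorm(resid(fit))
qqline(resid(fit))
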
3.8.2.2.2 Limitations

It’s important to pick the right ICC calculation for your study design, though at least one paper (Liljequist, Elfving, and Skavberg Roaldsen 2019) says you can calculate all three main forms and compare them.

Bailey and Byrnes (1990) suggest how to calculate a suitable sample size for studies based on the measurement error found.

Two sets of interpretation thresholds are listed on the Wikipedia page (“Intraclass Correlation” 2024), but neither source paper justifies its cutoffs. (One of them (Koo and Li 2016) has an erratum because a key equation in the feature-comparison table was incorrect (“Erratum to A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research [J Chiropr Med 2016;15(2):155-163]” 2017).)

More realistically, you need to know what a given value actually means for your data and the concepts being tested (Wilson 2018). There’s nothing magic about any particular threshold.

3.8.2.4 Implementation and controversies

3.8.2.4.1 Choosing your ICC calculation

Liljequist, Elfving, and Skavberg Roaldsen (2019) claim it doesn’t matter much which ICC method you use: if you calculate all three main forms, you can use any differences among them to suggest what type of bias you may have in your measurements.
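
Here is a minimal sketch of that comparison, assuming the psych package is installed. The turnip data are a nested design rather than a crossed subjects-by-raters design, so this uses a small hypothetical ratings table instead:

library(psych)

# Hypothetical ratings: 6 subjects (rows) scored by 4 raters (columns)
ratings <- data.frame(
  rater1 = c(9, 6, 8, 7, 10, 6),
  rater2 = c(2, 1, 4, 1,  5, 2),
  rater3 = c(5, 3, 6, 2,  6, 4),
  rater4 = c(8, 2, 8, 6,  9, 7)
)

# ICC() reports the one-way (ICC1), two-way agreement (ICC2), and two-way
# consistency (ICC3) forms plus their average-rating versions; large
# differences among them hint at the kind of measurement bias Liljequist
# et al. (2019) discuss
ICC(ratings)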

Lots of papers summarize how to use it (they are in my Zotero library; I am still working out which ones are best). The most recent one (Ten Hove, Jorgensen, and Van Der Ark 2024) isn’t in ResearchRabbit.ai, which makes it harder to visualize which papers cite the same sources; I have submitted a help request to them and a record change for its open-access status to OpenAlex.org (which showed it as closed).

Curry (2016) describes how to implement it in R and SAS.

3.8.3 Most technical

The key citations.

3.9 Examples “in the wild”

Citations and what is useful in each paper.

The most up-to-date guideline is Ten Hove, Jorgensen, and Van Der Ark (2024).

References

Bailey, Robert C., and Janice Byrnes. 1990. “A New, Old Method for Assessing Measurement Error in Both Univariate and Multivariate Morphometric Studies.” Systematic Biology 39 (2): 124–30. https://doi.org/10.2307/2992450.
Curry, Claire M. 2016. “Repeatability, Intraclass Correlation Coefficient, and Measurement Error in R and SAS.” Computing Bird. http://www.cmcurry.com/2016/01/repeatability-intraclass-correlation.html.
“Erratum to A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research [J Chiropr Med 2016;15(2):155-163].” 2017. Journal of Chiropractic Medicine 16 (4): 346. https://doi.org/10.1016/j.jcm.2017.10.001.
“Intraclass Correlation.” 2024. Wikipedia, October.
Koo, Terry K., and Mae Y. Li. 2016. “A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research.” Journal of Chiropractic Medicine 15 (2): 155–63. https://doi.org/10.1016/j.jcm.2016.02.012.
Liljequist, David, Britt Elfving, and Kirsti Skavberg Roaldsen. 2019. “Intraclass Correlation – A Discussion and Demonstration of Basic Features.” PLoS ONE 14 (7): e0219854. https://doi.org/10.1371/journal.pone.0219854.
Nakagawa, Shinichi, and Holger Schielzeth. 2010. “Repeatability for Gaussian and Non-Gaussian Data: A Practical Guide for Biologists.” Biological Reviews of the Cambridge Philosophical Society 85 (4): 935–56. https://doi.org/10.1111/j.1469-185X.2010.00141.x.
“PROC NESTED: Variability of Calcium Concentration in Turnip Greens :: SAS/STAT(R) 9.2 User’s Guide, Second Edition.” n.d. https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_nested_sect020.htm. Accessed February 3, 2025.
Ten Hove, Debby, Terrence D. Jorgensen, and L. Andries Van Der Ark. 2024. “Updated Guidelines on Selecting an Intraclass Correlation Coefficient for Interrater Reliability, with Applications to Incomplete Observational Designs.” Psychological Methods 29 (5): 967–79. https://doi.org/10.1037/met0000516.
Vispoel, Walter P., Carrie A. Morris, and Murat Kilinc. 2018. “Applications of Generalizability Theory and Their Relations to Classical Test Theory and Structural Equation Modeling.” Psychological Methods 23 (1): 1–26. https://doi.org/10.1037/met0000107.
Wilson, Alastair J. 2018. “How Should We Interpret Estimates of Individual Repeatability?” Evolution Letters 2 (1): 4–8. https://doi.org/10.1002/evl3.40.