Open Conference Systems, 50th Scientific meeting of the Italian Statistical Society

A comparative study of benchmarking procedures for interrater and intrarater agreement studies
Amalia Vanacore, Maria Sole Pellegrino

Last modified: 2018-06-21


Decision-making processes typically rely on subjective evaluations provided by human raters. In the absence of a gold standard against which to check evaluation trueness, the magnitude of inter/intra-rater agreement coefficients is commonly interpreted as a measure of the rater's evaluative performance. In this study, some benchmarking procedures for characterizing the extent of agreement are discussed and compared via a Monte Carlo simulation.
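The kind of benchmarking the abstract refers to can be sketched as follows: compute an agreement coefficient (here Cohen's kappa, ref. 5) and map it onto a fixed verbal scale such as Landis and Koch's (ref. 11), either deterministically from the point estimate or inferentially from a bootstrap lower bound (cf. refs. 2, 9). This is an illustrative sketch, not the study's actual simulation design; all function names are hypothetical.

```python
import random
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters on a nominal scale (Cohen, 1960)."""
    n = len(ratings_a)
    # Observed agreement: proportion of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: expected under independent marginal distributions.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(ratings_a) | set(ratings_b)) / n**2
    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Deterministic benchmarking: map kappa onto the Landis-Koch (1977) scale."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

def bootstrap_benchmark(ratings_a, ratings_b, n_boot=2000, alpha=0.05, seed=0):
    """Inferential benchmarking: benchmark the lower percentile bootstrap
    bound of kappa instead of its point estimate (cf. Gwet, 2014)."""
    rng = random.Random(seed)
    pairs = list(zip(ratings_a, ratings_b))
    boots = []
    for _ in range(n_boot):
        sample = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        a, b = zip(*sample)
        try:
            boots.append(cohen_kappa(a, b))
        except ZeroDivisionError:  # degenerate resample: p_e == 1
            continue
    boots.sort()
    return landis_koch(boots[int(alpha * len(boots))])
```

Because the bootstrap lower bound is smaller than the point estimate, the inferential procedure is more conservative: the same data can be benchmarked "substantial" deterministically but only "moderate" inferentially, which is precisely the kind of discrepancy a Monte Carlo comparison of benchmarking procedures is designed to quantify.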


1. Altman, D.G.: Practical statistics for medical research. CRC Press (1990)

2. Carpenter, J., Bithell, J.: Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statistics in Medicine 19(9), 1141–1164 (2000)

3. Cicchetti, D.V., Allison, T.: A new procedure for assessing reliability of scoring EEG sleep recordings. American Journal of EEG Technology 11(3), 101–110 (1971)

4. Cicchetti, D.V., Feinstein, A.R.: High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology 43(6), 551–558 (1990)

5. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960)

6. De Mast, J., Van Wieringen, W.N.: Measurement system analysis for categorical measurements: agreement and kappa-type indices. Journal of Quality Technology 39(3), 191–202 (2007)

7. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378–382 (1971)

8. Gwet, K.L.: Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61(1), 29–48 (2008)

9. Gwet, K.L.: Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC (2014)

10. Hartmann, D.P.: Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis 10(1), 103–116 (1977)

11. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics pp. 159–174 (1977)

12. Shrout, P.E.: Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research 7(3), 301–317 (1998)

13. Thompson, W.D., Walter, S.D.: A reappraisal of the kappa coefficient. Journal of Clinical Epidemiology 41(10), 949–958 (1988)
