ORIGINAL RESEARCH article

How to assess and compare inter-rater reliability, agreement and correlation of ratings: an exemplary analysis of mother-father and parent-teacher expressive vocabulary rating pairs.

Margarita Stolarova*

  • 1 Department of Psychology, University of Konstanz, Konstanz, Germany
  • 2 Zukunftskolleg, University of Konstanz, Konstanz, Germany
  • 3 Department of Society and Economics, Rhine-Waal University of Applied Sciences, Kleve, Germany
  • 4 Department of Linguistics, University of Konstanz, Konstanz, Germany

This report has two main purposes. First, we combine well-known analytical approaches to conduct a comprehensive assessment of agreement and correlation of rating pairs and to disentangle these often confused concepts, providing a best-practice example on concrete data and a tutorial for future reference. Second, we explore whether a screening questionnaire developed for use with parents can be reliably employed with daycare teachers when assessing early expressive vocabulary. A total of 53 vocabulary rating pairs (34 parent–teacher and 19 mother–father pairs) collected for two-year-old children (12 bilingual) are evaluated. First, inter-rater reliability both within and across subgroups is assessed using the intra-class correlation coefficient (ICC). Next, based on this analysis of reliability and on the test-retest reliability of the employed tool, inter-rater agreement is analyzed, and the magnitude and direction of rating differences are considered. Finally, Pearson correlation coefficients of standardized vocabulary scores are calculated and compared across subgroups. The results underline the necessity to distinguish between reliability measures, agreement and correlation. They also demonstrate the impact of the employed reliability estimate on agreement evaluations. This study provides evidence that parent–teacher ratings of children's early vocabulary can achieve agreement and correlation comparable to those of mother–father ratings on the assessed vocabulary scale. Bilingualism of the evaluated child decreased the likelihood of raters' agreement. We conclude that future reports of agreement, correlation and reliability of ratings will benefit from better definition of terms and stricter methodological approaches. The methodological tutorial provided here holds the potential to increase comparability across empirical reports and can help improve research practices and knowledge transfer to educational and therapeutic settings.

1. Introduction

When it comes to the usability of screening tools, both validity and reliability of an instrument are important quality indicators. They are needed to estimate the usefulness of assessments in therapeutic, educational and research contexts and are therefore highly relevant in a variety of scientific disciplines, such as psychology, education, medicine, linguistics and others that often rely on ratings to evaluate behaviors, symptoms or abilities. Validity is defined as "the degree to which evidence and theory support the interpretations of scores entailed by proposed uses of tests" (American Educational Research Association et al., 1999). In a way, the validity of an assessment instrument mirrors its ability to capture what it intends to measure. Reliability estimates describe the precision of an instrument; they refer to its capacity to produce consistent, similar results. There are different possibilities to measure reliability, e.g., across raters that evaluate the same participant (inter-rater reliability) or across different points in time (test-retest reliability; for a comprehensive discussion of validity and reliability see, for example, Borsboom et al., 2004). Reliability estimates, for example of children's language capacities, are often restricted to linear correlations and reported without a precise understanding of the underlying methodological approaches, which can lead to significant limitations regarding the interpretability and comparability of the reported results. This article therefore aims to provide a methodological tutorial for assessing inter-rater reliability, agreement and correlation of expressive vocabulary ratings. By applying the proposed strategy to a concrete research question, i.e., whether a screening questionnaire developed for use with parents can also be employed with daycare teachers, we are able to show the potential impact of using different measures of reliability, agreement and correlation on the interpretation of concrete empirical results. The proposed approach can potentially benefit the analysis of ratings regarding a variety of abilities and behaviors across different disciplines.

Extensive research has provided evidence for the validity of language screening tools such as the German vocabulary questionnaire ELAN (Eltern Antworten, Bockmann and Kiese-Himmel, 2006 ) used in this study and similar instruments (e.g., the MacArthur-Bates CDI scales, Fenson, 1993 , 2007 ) not only with regard to parental, but also to teacher evaluations ( Marchman and Martinez-Sussmann, 2002 ; Norbury et al., 2004 ; Bockmann, 2008 ; Vagh et al., 2009 ). Most of the validity studies correlate vocabulary ratings with objective lexical measures, such as for example the Peabody Picture Vocabulary Test ( Dunn and Dunn, 2007 ) and find strong associations between the scores children achieve in an objective test situation and the vocabulary ratings provided by different caregivers, e.g., mothers, fathers, or teachers ( Janus, 2001 ; Norbury et al., 2004 ; Bishop et al., 2006 ; Massa et al., 2008 ; Koch et al., 2011 ).

In contrast to the validity of parental and teacher ratings of expressive vocabulary, their reliability has not been sufficiently substantiated, specifically with regard to caregivers other than parents. Since a significant number of young children experience regular care outside their families, the ability of different caregivers to provide a reliable assessment of behavior, performance or ability level, using established tools, is relevant with regard to screening and monitoring a variety of developmental characteristics (e.g., Gilmore and Vance, 2007). The few studies examining (inter-rater) reliability regarding expressive vocabulary frequently rely solely or predominantly on linear correlations between the raw scores provided by different raters (e.g., de Houwer et al., 2005; Vagh et al., 2009). Moderate correlations between two parental ratings or between a parent and a teacher rating are reported, varying between r = 0.30 and r = 0.60. These correlations have been shown to be similar for parent–teacher and father–mother rating pairs (Janus, 2001; Norbury et al., 2004; Bishop et al., 2006; Massa et al., 2008; Gudmundsson and Gretarsson, 2009; Koch et al., 2011).

While the employed correlation analyses (mostly Pearson correlations) provide information about the strength of the relation between two groups of values, they do not capture the agreement between raters at all (Bland and Altman, 2003; Kottner et al., 2011). Nonetheless, claims about inter-rater agreement are frequently inferred from correlation analyses (see for example, Bishop and Baird, 2001; Janus, 2001; Van Noord and Prevatt, 2002; Norbury et al., 2004; Bishop et al., 2006; Massa et al., 2008; Gudmundsson and Gretarsson, 2009). The flaw of such conclusions is easily revealed: a perfect linear correlation can be achieved if one rater group systematically differs (by a nearly consistent amount) from the other, even though not a single absolute agreement exists. In contrast, agreement is only reached when points lie on the line (or within an area) of equality of both ratings (Bland and Altman, 1986; Liao et al., 2010). Thus, analyses relying solely on correlations do not provide a measure of inter-rater agreement and are not sufficient for a concise assessment of inter-rater reliability either. As pointed out by Stemler (2004), reliability is not a single, unitary concept and it cannot be captured by correlations alone. One main intention of this report is to show how three concepts complement each other in the assessment of ratings' concordance: inter-rater reliability, expressed here as intra-class correlation coefficients (ICC, see Liao et al., 2010; Kottner et al., 2011), agreement (sometimes also termed consensus, see for example, Stemler, 2004), and correlation (here: Pearson correlations).
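The following minimal Python sketch illustrates this point with invented scores (not study data): a constant offset between two raters yields a perfect Pearson correlation although not a single rating pair agrees.

```python
# Illustration with invented scores: a constant offset produces r = 1.0
# while the proportion of absolute agreement is zero.
from scipy.stats import pearsonr

rater_a = [42, 48, 51, 55, 60, 63, 70]   # hypothetical T-scores from rater A
rater_b = [a + 8 for a in rater_a]       # rater B scores systematically 8 points higher

r, p = pearsonr(rater_a, rater_b)
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"Pearson r = {r:.3f}")                   # 1.000: perfect linear correlation
print(f"Absolute agreement = {agreement:.2f}")  # 0.00: no identical rating pairs
```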

Conclusions drawn from ratings provided by different raters (e.g., parents and teachers) or at different points in time (e.g., before and after an intervention) are highly relevant for many disciplines in which abilities, behaviors and symptoms are frequently evaluated and compared. In order to capture the degree of agreement between raters, as well as the relation between ratings, it is important to consider three different aspects: (1) inter-rater reliability, which assesses to what extent the measure used is able to differentiate between participants with different ability levels when evaluations are provided by different raters; measures of inter-rater reliability can also serve to determine the least amount of divergence between two scores necessary to establish a reliable difference; (2) inter-rater agreement, including the proportion of absolute agreement and, where applicable, the magnitude and direction of differences; (3) the strength of association between ratings, measured by linear correlations. Detailed explanations of these approaches are provided, for example, by Kottner and colleagues in their “Guidelines for Reporting Reliability and Agreement Studies” (Kottner et al., 2011). Authors from the fields of education (e.g., Brown et al., 2004; Stemler, 2004) and behavioral psychology (Mitchell, 1979) have also emphasized the necessity to distinguish clearly between the different aspects contributing to the assessment of ratings' concordance and reliability. Precise definition and distinction of concepts potentially prevents misleading interpretations of data. As the different but complementary concepts of agreement, correlation and inter-rater reliability are often mixed up and these terms are used interchangeably (see e.g., Van Noord and Prevatt, 2002; Massa et al., 2008), below we briefly present their definitions and methodological backgrounds, while also linking each of them to the content-related questions addressed in the present report.

The term agreement (or consensus) refers to the degree to which ratings are identical (for detailed overviews see, de Vet et al., 2006; Shoukri, 2010), often described using the proportion of identical to diverging rating pairs (Kottner et al., 2011). In order to state, however, whether two ratings differ statistically from each other, psychometric aspects of the employed tool, such as reliability (e.g., test-retest reliability or intra-class correlations as a measure of inter-rater reliability), must be taken into consideration. General characteristics of the rating scale, for example the presence or absence of valid scoring categories (Jonsson and Svingby, 2007) and the number of individual items (and thus decisions) comprising a score, directly influence the likelihood of absolute agreement. For example, the more items contribute to a raw score, the less likely it is that two raters reach absolute agreement. Therefore, two raw scores or two standardized values (such as T-scores) diverging in absolute numbers are not necessarily statistically different from each other. An (absolute) difference can be too small to reflect a systematic divergence in relation to the distribution of scores. Thus, the size of non-systematic errors has to be taken into account prior to making judgments on proportions of agreement. Unfortunately, many studies attempting to assess inter-rater agreement completely disregard the distinction between absolute differences and statistically reliable differences and do not use standardized values (e.g., Bishop and Baird, 2001; Bishop et al., 2006; Gudmundsson and Gretarsson, 2009). In the field of language acquisition, for example, the direct comparison of raw scores still seems to be the norm rather than the exception, despite the lengthy item lists that make up vocabulary assessment instruments (e.g., Marchman and Martinez-Sussmann, 2002; Norbury et al., 2004).

Before assessing absolute agreement, it is thus necessary to determine the minimum divergence classifying two ratings as statistically (and thus reliably) different. One way to establish reliable difference is to calculate the so-called “Reliable Change Index” (RCI, e.g., Zahra and Hedge, 2010), an approach intended to identify significantly changed or diverging values. If the RCI is significant, a 95% probability that the two values differ from each other can be assumed. Critically, the RCI is a function of the employed instrument's reliability. There are several reliability measures appropriate for calculating the RCI, among them test-retest or inter-rater reliability. However, different reliability measures are likely to yield different results, depending mostly on the characteristics of the population samples they are derived from. For a standardized instrument such as the vocabulary checklist ELAN (Bockmann and Kiese-Himmel, 2006), reliability assessments derived from the standardization sample (e.g., the test-retest reliability according to the instrument's manual) provide a conservative estimate of its reliability. Reliability for calculating the RCI can also be estimated for a concrete study sample, which is usually smaller and often less representative than the standardization sample. This second approach is thus likely to provide a less conservative, population-specific estimate of reliability. In this report, we demonstrate how the interpretation of agreement can differ when using reliability estimates from either a standardization population (here test-retest reliability) or from the study population (here the intra-class correlation coefficient).

In order to provide such a population-specific estimate of reliability for our study, we calculated inter-rater reliability expressed as intra-class correlation coefficients (ICC). The intra-class correlation assesses the degree to which the measure used is able to differentiate between participants with diverging scores, as indicated by two or more raters reaching similar conclusions using a particular tool (Liao et al., 2010; Kottner et al., 2011). Moreover, when considering extending the use of parental questionnaires to other caregivers, it is important to compare reliability between different rater groups. The ICC takes into account the variance of ratings for one child evaluated by two raters as well as the variance across the complete group of children. It can thus serve to compare the reliability of ratings between two groups of raters and to estimate the instrument's reliability in a concrete study. This study is the first to report inter-rater reliability assessed by intra-class correlations (ICCs) for the German vocabulary checklist ELAN (Bockmann and Kiese-Himmel, 2006).

In order to assess rater agreement, we first calculated two reliable change indices (RCIs), one on the basis of the ELAN manual's test-retest reliability, the second considering the ICC for our study population. Note that even though both reliability measures can be used to calculate the RCI, they are not equivalent in terms of accuracy and strictness. Test-retest correlations represent a very accurate estimate of the instrument's reliability (regarding a construct stable over time), whereas inter-rater reliability rather reflects the accuracy of the rating process. The proportion of (reliable) agreement was assessed using both reliability estimates in order to demonstrate how the choice of reliability measure impacts the evaluation and interpretation of rater agreement. In addition to the proportion of absolute agreement, information about the magnitude of (reliable) differences and about a possible systematic direction of differences is also relevant for a comprehensive assessment of rater agreement. Thus, three aspects of agreement are considered in this report: the percentage of ratings that differ reliably, if applicable the extent to which they differ, and the direction of the difference (i.e., a systematic response tendency of either group of raters compared to the other). In the analyses presented here we also relate the magnitude of differences to those factors that can influence the likelihood of diverging ratings in our sample: gender of the evaluated child, bilingual vs. monolingual family environment and rater subgroup.

As shown above, Pearson correlations are the most commonly used statistic when inter-rater reliability in the domain of expressive vocabulary is assessed (e.g., Bishop and Baird, 2001; Janus, 2001; Norbury et al., 2004; Bishop et al., 2006; Massa et al., 2008; Gudmundsson and Gretarsson, 2009), and this tendency extends to other domains, such as language impairments (e.g., Boynton Hauerwas and Addison Stone, 2000) or learning disabilities (e.g., Van Noord and Prevatt, 2002). As argued above, linear correlations do not give information on ratings' agreement. However, they provide useful information on the relation between two variables, here the vocabulary estimates of two caregivers for the same child. In the specific case of using correlation coefficients as an indirect measure of rating consistency, linear associations can be expected, thus Pearson correlations are an appropriate statistical approach. Correlation cannot and should not serve as a sole measure of inter-rater reliability, but it can be employed as an assessment of the strength of (linear) association. Correlation coefficients have the additional advantage of enabling comparisons, useful for example when examining between-group differences regarding the strength of ratings' association. Since most other studies assessing inter-rater reliability of expressive vocabulary scores report correlation coefficients (only), this measure also enables us to relate the results of the present study to earlier research. Thus, we report correlations for each of the two rating subgroups (mother–father and parent–teacher rating pairs), compare them and calculate the correlation of ratings across both subgroups, too.

In order to give one realistic, purposeful example of the research strategy outlined above, we employed the ELAN vocabulary scale (Bockmann and Kiese-Himmel, 2006), a German parental questionnaire developed for screening purposes with regard to children's early expressive vocabulary. This instrument comprises a checklist of 250 individual words: the rater decides for each item on the list whether or not the child actively uses it. General questions regarding demographic background and child development supplement the vocabulary information. Children experiencing regular daycare were evaluated by a daycare teacher and a parent; children cared for exclusively in their families were evaluated by both parents. Here, we provide a first analysis of the usability of the ELAN with daycare teachers and illustrate the necessity to evaluate rating scales on more than one dimension of rating consistency.

In summary, this report has two main goals: to provide a methodological tutorial for assessing inter-rater reliability, agreement and linear correlation of rating pairs, and to evaluate whether the German parent questionnaire ELAN ( Bockmann and Kiese-Himmel, 2006 ) can be reliably employed also with daycare teachers when assessing early expressive vocabulary development. We compared mother–father and parent–teacher ratings with regard to agreement, correlation as well as reliability of ratings. We also explored which child and rater related factors influence rater agreement and reliability. In a relatively homogeneous group of mostly middle class families and high quality daycare environments, we expected high agreement and linear correlation of ratings.

2. Materials and Methods

2.1. Ethics Statement

Parents, teachers and the heads of the child care centers participating in this study gave written informed consent according to the principles of the Declaration of Helsinki. Special care was taken to ensure that all participants understood that their participation was voluntary and could be ended at any time without causing them any disadvantages. The research reported here was conducted in Germany (country of residence of all authors) and met the Ethical Guidelines of the German Psychological Association and the German Psychological Professional Organization (Ethische Richtlinien der Deutschen Gesellschaft für Psychologie e.V. und des Berufsverbands Deutscher Psychologinnen und Psychologen e.V., see http://www.bdp-verband.org/bdp/verband/ethik.shtml), an approved German adaptation of the “Ethical Principles of Psychologists and Code of Conduct” (American Psychological Association and Others, 2002).

2.2. Data Collection, Research Instruments, Exclusion Criteria, and Subgroups

Participating families and daycare centers were recruited from the German cities Konstanz and Radolfzell, as well as their surroundings. For each participating child, two caregivers assessed the number of spoken words on the basis of the German lexical checklist for parents ELAN (Bockmann and Kiese-Himmel, 2006). These two independent vocabulary ratings were provided within a period of 3 days before or after the child's second birthday. The data collection sessions with the two caregivers took place within a maximum of 6 days of each other; more than 84% were completed within 48 h of each other. Data was collected by trained researchers from the University of Konstanz and was obtained for 59 two-year-olds. The data of six children had to be excluded from further analyses for the following reasons:

  • More than five missing answers to items of the vocabulary checklist (2). Respondents had to indicate whether a child spoke a certain word by crossing either a “yes” or a “no” field; if no indication was provided, the item was counted as “missing.”
  • Preterm birth (1).
  • State of transition between parental- and non-parental-care (1).
  • Vocabulary score too low to obtain a T -value (1).
  • Vocabulary information provided by the maternal grandmother instead of the father, as he did not speak any German (1).

Two independent vocabulary ratings for a total of 53 two-year-old children were included in the analyses. For those children (n = 34) who had experienced daily (Monday through Friday) non-parental care for at least 6 months, the two vocabulary ratings were provided by the daycare teacher responsible for each child in the daycare center and by one or two parents: either by the mother (27), by the father (4), or by the mother and the father together (3). In this last case, the two parents filled out one questionnaire together, actively communicating about the answers, and provided a single rating. We refer to the vocabulary rating pairs provided for these 34 children experiencing regular non-parental daycare as the “parent–teacher ratings.”

For those children ( n = 19) who at the age of 2 years were cared for at home by their parents, the mother and the father each provided separate vocabulary ratings for their child. Data acquisition usually occurred at the same time, but special care was taken to ensure that the parents did not influence each other's responses. Children were also included in this group if they experienced some form of irregular non-parental care (e.g., playgroups or babysitters) up to a maximum of 12 h and up to three times per week. We refer to the vocabulary rating pairs provided by the mother and the father of the children experiencing parental care as the “parental” or “mother–father ratings.”

For all children vocabulary information was supplemented by demographic information provided by one parent (for a summary see Table 1 ). For children experiencing regular daycare additional information was provided by the daycare teacher concerning the duration and the quality of care (as indicated by the amount of time spent in direct proximity of the evaluated child, group size, teacher-to-child ratio, and educational background of the daycare teachers).


Table 1. Demographic characteristics of the study population.

Parental education level was defined as the highest school degree obtained. The category reported by the vast majority of the parents was the German university entrance certificate (Abitur) or a foreign equivalent, and thus the highest possible secondary education degree in Germany (see Table 1). In addition, all parents had received further professional training and/or completed a higher education degree. At the time of testing, mothers were either employed (33), on parental leave (18) or pursuing a university degree (2). All fathers were employed.

All 53 two-year-old children spoke and listened to German on a daily basis; 41 of them were raised in monolingual German family environments (subsequently referred to as “monolingual” children). In contrast, 12 children had regular contact with a second language. One of those children was raised in a trilingual environment (the parents spoke two different languages other than German). Nevertheless, we will refer to the complete group of 12 children as “bilingual.” According to their parents, all bilingual children actively spoke a second language in addition to German.

A total of 24 daycare teachers participated in this study; four of them were the primary responsible teacher for more than one participating child and thus provided more than one evaluation. All of the participating teachers were female German native speakers. All but one daycare teacher had completed a vocational degree in early child-care; the remaining teacher held a degree in nursing. All daycare teachers reported regular participation in continuing education courses. The group size in the daycare centers varied between 9 and 20 children; the majority of children (22 out of 34) were cared for in a group with up to 10 children and at least two daycare teachers present at all times. Weekly daycare time reported by the parents varied between the categories “11–20 h” (n = 5) and “more than 20 h” (n = 28, one missing value).

The teachers participating in the study were always the ones who had been primarily responsible for the evaluated children since their daycare enrollment. The daycare teachers provided information on the percentage of time spent in direct proximity to, i.e., hearing and seeing, the evaluated child. The teachers of 28 out of 34 children (82.35%) reported direct contact for more than 60% of the time the evaluated child spent in daycare. The teachers of four children (11.76%) were in direct contact for 40–60% of the time, and only one child (2.94%) was reported to be in direct proximity to the evaluating teacher for 20–40% of daycare time; for one child, this information was missing.

2.3. Analyses

First, demographic differences between the two subgroups were assessed. Then inter-rater reliability, agreement and correlations within and across the two different rating subgroups were analyzed. The analysis procedure and the corresponding research questions are summarized in Figure 1 .


Figure 1. Analysis procedure. A total of 53 rating pairs was included in the analysis and divided into two rating subgroups (represented by round boxes in the upper line). On the left side of the figure, the purpose of the applied statistical analyses is provided, framed as research questions. The next column shows the analyses conducted within the parent–teacher rating subgroup (n = 34); in the right column the respective analyses for the mother–father rating subgroup (n = 19) are shown. The column in the middle lists tests conducted for the whole study population, as well as between-group comparisons. Dotted arrows mark analyses conducted for the differing ratings identified using the manual's test-retest reliability (no reliably diverging ratings were identified when using the ICC for calculating the critical difference between ratings).

Systematic demographic differences between the two rating subgroups were assessed regarding the following variables: educational level and occupational status of the parents, family status (one-parent or two-parent family), gender distribution, number of siblings, birth order, and number of bilingual children. If expected values in all cells were above 4, we used Pearson's χ2-tests; otherwise, Fisher's exact tests were employed.
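As a brief illustration of this test-selection rule, the following Python sketch applies it to a hypothetical 2 × 2 contingency table (the counts are invented, not taken from the study):

```python
# Sketch of the test-selection rule: Pearson's chi-square test if all expected
# cell counts exceed 4, otherwise Fisher's exact test. The 2x2 counts are invented.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[12, 22],
                  [ 3, 16]])

chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

if (expected > 4).all():
    print(f"Pearson chi-square: chi2({dof}, N = {table.sum()}) = {chi2:.3f}, p = {p_chi2:.3f}")
else:
    _, p_fisher = fisher_exact(table)
    print(f"Fisher's exact test: p = {p_fisher:.3f}")
```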

Raw-vocabulary-scores were transformed into corresponding T -values according to the transformation table provided by the authors of the ELAN-questionnaire. All analyses were based on these standardized T -values.

We calculated inter-rater reliability for the mother–father as well as the parent–teacher rating subgroups and across the study population. We calculated the intra-class correlation coefficient as a measure of inter-rater reliability reflecting the accuracy of the rating process, using the formula proposed by Bortz and Döring (2006), see also Shrout and Fleiss (1979):

r_ICC = (σ²_bt − σ²_in) / (σ²_bt + (k − 1) · σ²_in)

with σ²_bt being the variance of ratings between children, σ²_in being the variance within children, and k the number of raters. Confidence intervals for all ICCs were calculated in order to assess whether they differed from each other.
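A minimal Python sketch of this calculation for an n-children × k-raters matrix is given below (the example scores are invented). It estimates σ²_bt and σ²_in via one-way ANOVA mean squares, which corresponds to the ICC(1,1) form in Shrout and Fleiss (1979); this is a sketch under that assumption, not necessarily the exact computation used here.

```python
# Sketch: intra-class correlation for an n x k matrix of ratings (rows = children,
# columns = raters), with sigma^2_bt and sigma^2_in estimated as one-way ANOVA
# mean squares (ICC(1,1)-type). Example scores are invented.
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    n, k = ratings.shape
    grand_mean = ratings.mean()
    child_means = ratings.mean(axis=1)

    ms_between = k * np.sum((child_means - grand_mean) ** 2) / (n - 1)         # sigma^2_bt
    ms_within = np.sum((ratings - child_means[:, None]) ** 2) / (n * (k - 1))  # sigma^2_in

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical T-score pairs for five children rated by two raters.
example = np.array([[45, 47], [52, 50], [38, 41], [60, 57], [55, 56]])
print(f"r_ICC = {icc_oneway(example):.3f}")
```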

This analysis adds information regarding inter-rater reliability of the ELAN-questionnaire, and also serves as a basis for one out of two calculations of the reliable change index (RCI) considering the characteristics of the concrete study sample.

In order to determine whether two ELAN ratings a child received differed statistically from one another, the RCI was calculated using the classical approach (Jacobson and Truax, 1991; Zahra and Hedge, 2010), as recommended e.g., in Maassen (2000); see also Maassen (2004) for a discussion about which exact formula should be used in which case:

RCI = (x₁ − x₂) / S_diff

with x₁/x₂ = the compared scores and S_diff = √(2 · SEM²). The latter gives the standard error of the difference between two test scores and thus describes the spread of the distribution of differences in case no true differences actually occurred. SEM was calculated as SEM = s₁ · √(1 − r_xx), with s₁ = SD and r_xx = the reliability of the measure.

RCI values are standardized z-values; therefore, an RCI ≥ 1.96 indicates a difference at a significance level of α = 0.05. As all scores were transformed into standardized T-values, an SD of 10 was utilized.

For r_xx we used two different measures of reliability: (1) the r_ICC obtained across our study population and (2) the test-retest reliability provided in the ELAN-manual (Bockmann and Kiese-Himmel, 2006), a value originating from a larger and more representative population that reflects the ELAN's rather than our sample's characteristics. The use of external sources of reliability measures, as employed in the second RCI calculation, has been recommended e.g., by Maassen (2004) and can be thought of as the most conservative means of estimating the RCI.

The RCI formula can be rearranged to determine the exact value from which onwards two T-values of the ELAN-questionnaire differ significantly:

Diff_T1−T2 = 1.96 · S_diff = 1.96 · √(2 · (SD · √(1 − r_xx))²)
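A short Python sketch of this critical-difference calculation is given below; the two reliability values plugged in (0.837 and 0.99) are the study's ICC and the manual's test-retest reliability used in the Results section, and SD = 10 for T-scores.

```python
# Sketch of the critical difference derived from the RCI, following the formulas above:
# SEM = SD * sqrt(1 - r_xx), S_diff = sqrt(2 * SEM^2), critical difference = 1.96 * S_diff.
import math

def critical_difference(r_xx: float, sd: float = 10.0, z: float = 1.96) -> float:
    sem = sd * math.sqrt(1.0 - r_xx)     # standard error of measurement
    s_diff = math.sqrt(2.0 * sem ** 2)   # standard error of the difference between two scores
    return z * s_diff

print(critical_difference(0.837))  # ICC across this study's sample -> ~11.2 T-points
print(critical_difference(0.99))   # manual's test-retest reliability -> ~2.77 T-points
```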
Whether ratings differed significantly from each other was assessed within as well as between rating subgroups, and the proportions of diverging to equal ratings were calculated. If applicable, exact binomial tests were used to evaluate whether significantly more diverging than non-diverging ratings existed in each of the subgroups or across subgroups.

Pearson's χ2-tests were employed to determine whether the probability that a child received two diverging ratings differed between rater subgroups (mother–father vs. parent–teacher ratings), between boys and girls, and between mono- and bilingual two-year-olds. We tested whether the direction of the differences within each of the subgroups was systematic using Wilcoxon paired-sample tests.
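The following sketch shows how the exact binomial test and the Wilcoxon paired-sample (signed-rank) test can be run with scipy. The binomial counts correspond to the 21 out of 34 diverging parent–teacher pairs reported in the Results; the paired T-scores are invented for illustration, and the test settings (two-sided, chance level 0.5) are assumptions.

```python
# Sketch: exact binomial test on the count of diverging rating pairs, and a
# Wilcoxon signed-rank test for a systematic direction of rating differences.
from scipy.stats import binomtest, wilcoxon

# Counts as reported below for the parent-teacher subgroup (21 of 34 pairs diverging).
result = binomtest(k=21, n=34, p=0.5, alternative="two-sided")
print(f"Binomial test: p = {result.pvalue:.3f}")

# Invented paired T-scores (first vs. second rater) for the same children.
rater_1 = [48, 52, 40, 61, 55, 47, 58, 44]
rater_2 = [49, 50, 43, 57, 60, 41, 51, 53]
stat, p = wilcoxon(rater_1, rater_2)
print(f"Wilcoxon signed-rank: W = {stat}, p = {p:.3f}")
```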

Using t-tests, we compared mean ratings for each of the different rater groups, i.e., parents and teachers for the 34 children experiencing daycare, and mothers and fathers for the 19 children in parental care. In addition, the magnitude of individual differences was assessed descriptively. We displayed the distribution of differences with regard to the standard deviation of the T-distribution using a scatter plot (see Figure 3). Considering only children who received significantly diverging ratings, we also explored the magnitude of those differences by looking at the deviation between the ratings of a pair using a graphical approach: a Bland-Altman plot (see Figure 4). A Bland-Altman plot, also known as a Tukey mean-difference plot, illustrates the dispersion of agreement by showing individual differences in T-values in relation to the mean difference. In this way, the magnitude of differences in ratings can be categorized in relation to the standard deviation of the differences (Bland and Altman, 2003).
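A minimal matplotlib sketch of such a Bland-Altman (Tukey mean-difference) plot is shown below, using invented rating pairs rather than the study data:

```python
# Sketch of a Bland-Altman (Tukey mean-difference) plot for paired T-scores.
# The rating pairs are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

rater_1 = np.array([48, 52, 40, 61, 55, 47, 58, 44, 50, 63])
rater_2 = np.array([50, 49, 44, 57, 60, 45, 52, 49, 53, 59])

means = (rater_1 + rater_2) / 2   # x-axis: mean of each rating pair
diffs = rater_1 - rater_2         # y-axis: difference within each pair
mean_diff = diffs.mean()
sd_diff = diffs.std(ddof=1)

plt.scatter(means, diffs, color="black")
plt.axhline(mean_diff, linestyle="-", label="mean difference")
plt.axhline(mean_diff + 1.96 * sd_diff, linestyle="--", label="+1.96 SD")
plt.axhline(mean_diff - 1.96 * sd_diff, linestyle="--", label="-1.96 SD")
plt.xlabel("Mean of the two ratings (T-score)")
plt.ylabel("Difference between ratings (T-points)")
plt.legend()
plt.show()
```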

To further assess the strength of linear relations between ratings, Pearson correlation coefficients were calculated for mother–father ratings and for parent–teacher ratings. In a next step, we assessed whether the correlation coefficients of the two rating subgroups differed significantly from each other. For this statistical comparison, correlation coefficients were transformed into Fisher's Z-values, since means and standard deviations of correlation coefficients cannot be compared directly (see for example, Bortz and Döring, 2006). A Pearson correlation coefficient was also obtained for the whole study population, in order to assess the general strength of the linear association between two different raters. To make this calculation possible, we combined teacher ratings with maternal ratings, and parental ratings with paternal ratings.
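One standard way to compare two independent Pearson correlations via Fisher's Z transformation is sketched below; the r values and subgroup sizes are the ones reported in the Results, while the procedure itself is a generic sketch rather than the authors' exact computation.

```python
# Sketch: comparing two independent Pearson correlations via Fisher's Z transformation.
import math
from scipy.stats import norm

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher's Z transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference of Z-values
    z = (z1 - z2) / se
    return 2.0 * (1.0 - norm.cdf(abs(z)))            # two-sided p-value

# r and n as reported below for parent-teacher (n = 34) and mother-father (n = 19) pairs.
p = compare_correlations(r1=0.797, n1=34, r2=0.917, n2=19)
print(f"p = {p:.3f}")  # ~0.119, in line with the non-significant difference reported below
```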

3. Results

3.1. Comparison of Demographic Characteristics between Rating Subgroups

There were no significant differences between rating subgroups (and thus between children experiencing early center-based daycare and children cared for exclusively at home) regarding parental education (mothers and fathers), occupational status of the father, number of siblings, birth order, gender distribution and number of bilingual children, all p ≥ 0.05. The employment status of the mother differed significantly between subgroups (χ2(1, N = 53) = 27.226, p < 0.001), as did the number of children raised in two-parent as opposed to single-parent households (χ2(1, N = 53) = 5.265, p = 0.040); see Table 1 for absolute numbers and percentages. This means that children in the two rating subgroups did not differ regarding most demographic variables. Importantly, we did not find systematic differences in parental education, gender distribution and birth order. The observed divergences regarding family and employment status are explicable by the fact that children below the age of three could only enter center-based, state-regulated daycare facilities in the cities of Konstanz and Radolfzell if the parents (or in the case of a single-parent family the one parent) were employed, pursuing their education, or currently on parental leave with a younger child.

3.2. Inter-Rater Reliability

Inter-rater reliability was calculated within subgroups and across the study population as an estimate of the accuracy of the rating process. For the mother–father rating subgroup the intra-class correlation coefficient (ICC) was r_ICC = 0.906, for the parent–teacher rating subgroup an ICC of r_ICC = 0.793 was found. Across the study population the calculation of the ICC resulted in a reliability of r_ICC = 0.837. The confidence intervals (α = 0.05) of the reliabilities for the subgroups and for the study population overlap, indicating that they do not differ from each other (see Figure 2 for the ICCs and the corresponding confidence intervals). Thus, we did not find evidence that the ability of the ELAN to differentiate between children with high and low vocabulary is lowered when a parent and a teacher, rather than two parents, provide the evaluations.


Figure 2. Comparison of inter-rater reliability. Intra-class correlation coefficients (ICCs, represented as dots) and corresponding confidence intervals at α = 0.05 (CIs, represented as error bars) for parent–teacher ratings, mother–father ratings and for all rating pairs across rater subgroups. Overlapping CIs indicate that the ICCs did not differ systematically from each other.

3.3. Number, Likelihood, and Direction of Rating Differences

The Reliable Change Index (RCI) was used to calculate the least number of T -points necessary for two ELAN-scores to be significantly different from each other. We used two different estimates of reliability to demonstrate their impact on measures of agreement. First, the ICC calculated across the complete study population was employed as an estimate for the ELAN's reliability in this concrete study's population. As the ICC is calculated within and between subjects and not between specific rater groups, this is a valid approach for estimating overall reliability across both rating subgroups.

When considering the ICC calculated across the study population, the critical difference was Diff_T1−T2 = 1.96 · √(2 · (10 · √(1 − 0.837))²) = 11.199. Since T-scores are calculated in integral numbers only, this result means that for the ELAN-questionnaire two ratings differ statistically at a significance level lower than α = 0.05 if the difference between them equals or is greater than 12 T-points.

When using the reliability provided in the ELAN-manual (Bockmann and Kiese-Himmel, 2006), and thus when employing a more conservative estimate of reliability, the critical difference was considerably lower, Diff_T1−T2 = 1.96 · √(2 · (10 · √(1 − 0.99))²) = 2.772, resulting in a critical difference of three T-points.

Measuring the reliable difference between ratings on the basis of the inter-rater reliability in our study resulted in 100% rating agreement. In contrast, when the RCI was calculated on the basis of the manual's more conservative test-retest reliability, a substantial number of diverging ratings was found; absolute agreement was 43.4%. When this conservative estimate of the RCI was used, significantly higher numbers of equal or diverging ratings were not found, neither for a single rating subgroup nor across the study population (see Table 2 for the results of the relevant binomial tests). Thus, the probability that a child received a concordant rating did not differ from chance. When the study's own reliability was employed, the probability of receiving concordant ratings was 100% and thus clearly above chance.


Table 2. Proportions of diverging ratings for monolingual, bilingual, and all children in the sample.

In the parent–teacher rating subgroup 21 out of 34 children received diverging ratings; 9 out of 19 children received diverging ratings in the mother–father rating subgroup. Binomial tests (see Table 2 for details) clarified that these absolute differences were not statistically reliable within the limitations posed by the small sample size.

3.4. Factors Influencing the Likelihood and Direction of Diverging Ratings

The results reported in this section consider those rating pairs that were classified as reliably different using the more conservative RCI calculation on the basis of the test-retest reliability, which yielded a considerable number of diverging ratings. We explored the potential influence of three different factors on the likelihood of receiving diverging ratings: rating subgroup (mother–father vs. teacher–parent), gender of the child and bilingualism of the child.

The likelihood to receive diverging ratings did not depend systematically on whether a child was evaluated by a teacher and a parent or by father and mother [χ 2 (1, N = 53) = 1.028, p = 0.391]. Being a boy or a girl also did not change the likelihood of receiving diverging ratings [χ 2 (1, N = 53) = 0.106, p = 0.786]. In contrast, monolingual and bilingual children differed significantly concerning the likelihood of receiving two different ratings [χ 2 (1, N = 53) = 7.764, p = 0.007]: Bilingual children ( n = 12, 11 different ratings) were much more likely to receive diverging scores than monolingual children ( n = 41, 19 different ratings).

Next, we assessed whether the likelihood to receive diverging ratings was above chance. We conducted these binomial tests separately for bilingual and monolingual children, as bilingual children were shown to receive more diverging ratings than monolingual children. As only 2 of the 19 children rated by two parents were bilingual (see Table 1), we also considered rating subgroups separately. As summarized in Table 2, the likelihood to receive diverging ratings exceeded chance for bilingual children only. However, conclusions about whether this is also true for bilingual children rated by two parents cannot be drawn on the basis of our data, as only two children fell into this category.

Wilcoxon paired-sample tests were used to uncover possible systematic direction tendencies for the different groups of raters. None of the within-subgroup comparisons (maternal vs. paternal and teacher vs. parent ratings) reached significance (all p ≥ 0.05). Thus, we did not find evidence for a systematic direction of rating divergence, either for bilingual or for monolingual children.

We therefore conclude that a similar proportion of diverging ratings occurred within the two different rating subgroups. Neither the gender of the child, nor whether the expressive vocabulary was evaluated by two parents or by a teacher and a parent, increased the probability that a child received two diverging ratings. The only factor that reliably increased this probability was bilingualism of the child. No systematic direction of differences was found.

3.5. Comparison of Rating Means and Magnitude of Differences

In a first step, we compared the mean ratings for each rater group: mothers, fathers, parents and teachers. The t-tests did not reveal any significant differences (see Table 3).


Table 3. Means and standard deviations of vocabulary ratings and comparisons of means.

Only when using the test-retest reliability provided in the manual of the ELAN was there a substantial number of differing rating pairs (30 out of 53, or 56.6%). The magnitude of these differences was assessed descriptively using a scatter plot (see Figure 3) and a Bland-Altman plot (also known as a Tukey mean-difference plot, see Figure 4). First, we displayed the ratings of the individual children in a scatter plot and illustrated the two different areas of agreement: 43.4% of ratings diverged by less than three T-points and can thus be considered concordant within the limits of the more conservative RCI estimate; 100% of the ratings lie within 11 T-points and thus within the limits of agreement based on the reliability estimate obtained with the present study's sample.


Figure 3. Scatter plot of children's ratings. Every dot represents the two ratings provided for a child. For the parent–teacher rating subgroup, parental ratings are on the x-axis and teacher ratings on the y-axis; for the parental rating subgroup, paternal ratings are on the x-axis and maternal ratings on the y-axis. Ratings for bilingual children are represented by gray dots, those for monolingual children by black dots. Dashed lines enclose statistically identical ratings as calculated on the basis of the manual-provided test-retest reliability (less than 3 T-points difference; 23 out of 53 rating pairs). Solid lines enclose statistically identical ratings as calculated on the basis of the inter-rater reliability (ICC) in our study (less than 12 T-points difference).


Figure 4. Bland-Altman plot of T-values, corresponding to a Tukey mean-difference plot. The solid line indicates the mean difference (M = −1), dashed lines mark the mean difference ±1.96 SDs. Dots represent the 30 rating pairs diverging significantly in the study population. Differing mother–father ratings are represented by empty dots, differing parent–teacher ratings by filled dots. Positive differences indicate a higher evaluation by the parent in the parent–teacher rating subgroup or a higher evaluation by the father in the parental rating subgroup (M = −1, SD = 5.7, min = −10, max = 9). Note that all but one difference lie within the range of ±10 T-points (1 SD on a T-scale) and that there is no indication of systematic over- or underrating.

Another way of illustrating the magnitude of differences is to display the distribution of significant differences, where mean T-values are plotted against the difference values, as proposed by Bland and Altman (1986, 2003). This plot (see Figure 4) shows that 18 out of the 30 observed differences (60%) are within 1 SD of the differences (SD = 5.7). The limits of agreement in this study, defined by Bland and Altman (2003) as the range expected to contain 95% of the differences in similar populations, are −12.2 to 10.2 T-points, a range that contains all of the observed differences in this study. Thus, the graphical approach toward assessing the differences' magnitude mirrors the result of 100% rater agreement obtained when the ICC is used as the reliability estimate in the calculation of reliable differences.

3.6. Correlations between Ratings

So far we have reported results regarding inter-rater reliability and the number of diverging ratings within and between subgroups, using two different but equally legitimate reliability estimates. We also explored which factors might influence the likelihood of receiving two statistically diverging ratings and described the magnitude of the observed differences. These analyses focused on inter-rater reliability and agreement, as well as related measures. In this last section we turn to Pearson correlation coefficients in order to explore the linear relation between ratings and its strength within and between rater subgroups.

Teacher and parent ratings were highly correlated [r = 0.797, p < 0.001, 95% CI (0.503, 1.0), see Figure 5A], with a large effect size of R² = 0.636. For the mother–father rating subgroup the correlation between maternal and paternal ratings was similarly high [r = 0.917, p < 0.001, 95% CI (0.698, 1.0), see Figure 5B], with an effect size of R² = 0.842. The strength of the relation between ratings did not differ systematically between the two rating subgroups (p = 0.119). For the whole study population (n = 53) the Pearson correlation between the ratings of two different caregivers was r = 0.841, p < 0.001, R² = 0.707. In conclusion, with regard to the correlation of ratings, strong associations were observed for ratings provided by mothers and fathers, as well as for those provided by teachers and parents, and thus across our study sample.


Figure 5. Correlations of ratings. Pearson correlations of parent–teacher ratings (A, n = 34) and of mother–father ratings (B, n = 19), both significant (both p ≤ 0.001) and with large effect sizes. Monolingual children are represented by black, bilingual by gray dots. The two correlations did not differ significantly from each other (p = 0.119). ***p < 0.001.

4. Discussion

In this report a concrete data set is employed to demonstrate how a comprehensive evaluation of inter-rater reliability, inter-rater agreement (concordance), and linear correlation of ratings can be conducted and reported. On the grounds of this example we aim to disambiguate aspects of assessment that are frequently confused and thereby to contribute to increasing the comparability of future rating analyses. By providing a tutorial, we hope to foster knowledge transfer to, e.g., educational and therapeutic contexts, in which the methodological requirements for rating comparison are still disregarded too frequently, leading to misinterpretation of empirical data.

We analyzed two independent vocabulary ratings obtained for 53 German-speaking children at the age of 2 years with the German vocabulary scale ELAN (Bockmann and Kiese-Himmel, 2006). Using the example of assessing whether ELAN ratings can be reliably obtained from daycare teachers as well as from parents, we show that rater agreement, linear correlation, and inter-rater reliability all have to be considered. Otherwise, an exhaustive conclusion about a rating scale's employability with different rater groups cannot be made. We also considered gender and bilingualism of the evaluated child as factors potentially influencing the likelihood of rating agreement.

First, we assessed the inter-rater reliability within and across rating subgroups. Inter-rater reliability as expressed by intra-class correlation coefficients (ICC) measures the degree to which the instrument used is able to differentiate between participants, as indicated by two or more raters reaching similar conclusions (Liao et al., 2010; Kottner et al., 2011). Hence, inter-rater reliability is a quality criterion of the assessment instrument and of the accuracy of the rating process rather than one quantifying the agreement between raters. It can be regarded as an estimate of the instrument's reliability in a concrete study population. This is the first study to evaluate the inter-rater reliability of the ELAN questionnaire. We report high inter-rater reliability for mother–father as well as for parent–teacher ratings and across the complete study population. No systematic differences between the subgroups of raters were found. This indicates that using the ELAN with daycare teachers does not lower its capability to differentiate between children with high and low vocabulary.

The term “agreement” describes the degree to which ratings are identical (see for example, de Vet et al., 2006 ; Shoukri, 2010 ; Kottner et al., 2011 ). Many studies supposedly evaluating agreement of expressive vocabulary ratings rely (only) on measures of strength of relations such as linear correlations (e.g., Bishop and Baird, 2001 ; Janus, 2001 ; Van Noord and Prevatt, 2002 ; Bishop et al., 2006 ; Massa et al., 2008 ; Gudmundsson and Gretarsson, 2009 ). In some studies the raw scores are used as reference values and critical differences are disregarded (e.g., Marchman and Martinez-Sussmann, 2002 ; McLeod and Harrison, 2009 ). However, absolute differences between raw scores or percentiles do not contain information about their statistical relevance. We demonstrate the use of the Reliable Change Index (RCI) to establish statistically meaningful divergences between rating pairs. We obtained two different RCIs on the basis of two reliability measures: the test-retest reliability provided in the ELAN's manual ( Bockmann and Kiese-Himmel, 2006 ) and the inter-rater reliability (expressed as ICC) derived from our sample. This dual approach was chosen to demonstrate the impact of more or less conservative, but similarly applicable reliability estimates, on measures of rating agreement. We determined that, if considering the reliability provided in the ELAN-manual, ratings differ reliably if the absolute difference between them amounts to three or more T -points. With regard to the reliability of our study, however, this difference necessary to establish reliable divergence between two ratings is considerably larger, i.e., 12 T -points or more.

For both critical values we determined absolute agreement (e.g., Liao et al., 2010) as the proportion of statistically non-different ratings. Absolute agreement was 100% when considering the RCI calculated on the basis of the ICC for our sample. In contrast, absolute agreement was 43.4% when the manual's test-retest reliability was used to estimate the critical difference. With this more conservative measure of absolute agreement, the probability of receiving a concordant rating did not differ from chance. This probability did not differ statistically between the two rating subgroups (parent–teacher and mother–father ratings) and thus across the study population, regardless of the chosen RCI calculation. These results support the assumption that parents and daycare teachers in this case were similarly competent raters with regard to the early expressive vocabulary of the children. Nonetheless, the RCIs obtained with different reliability estimates differ substantially with regard to the specific estimates of absolute agreement. The profoundly diverging amounts of absolute agreement obtained by using either the inter-rater reliability within a relatively small sample or the instrument's test-retest reliability obtained with a large and more representative sample highlight the need for caution when calculating reliable differences.

Absolute agreement of 100% can undoubtedly be considered high. Whether a proportion of absolute agreement of 43.4% is high or low needs to be evaluated in comparison to previous reports using similar instruments and methods of analysis. In the domain of expressive vocabulary, however, we scarcely find empirical studies reporting the proportion of absolute agreement between raters. Those that do consider agreement on the level of individual items (here words) and not on the level of the overall rating a child receives (de Houwer et al., 2005; Vagh et al., 2009). In other domains, such as attention deficits or behavior problems, percentages of absolute agreement as proportions of concordant rating pairs are reported more often and provide more comparable results (e.g., Grietens et al., 2004; Wolraich et al., 2004; Brown et al., 2006). In those studies, agreement is considered high at and above 80% absolutely agreeing rating pairs; proportions of absolute agreement below 40% are considered low. However, one should take into account that these studies usually evaluate inter-rater agreement of instruments with far fewer items than the present study, in which raters had to decide on 250 individual words. When comparing the results of our study with those of studies in other domains, it has to be considered that increasing the number of items composing a rating reduces the likelihood of two identical scores. The difficulty of finding reliable and comparable data on rater agreement in the otherwise well-examined domain of early expressive vocabulary assessment highlights both the widespread inconsistency of reporting practices and the need to measure absolute agreement in a comparable way, as, e.g., presented here.

In order to evaluate inter-rater agreement in more detail, the proportion of absolute agreement needs to be considered in light of the magnitude and direction of the observed differences. These two aspects provide relevant information on how close diverging ratings tend to be and whether systematically higher or lower ratings emerge for one subgroup of raters or rated persons in comparison to another. The magnitude of difference is an important aspect of agreement evaluations, since the proportion of statistically equal ratings only reflects perfect concordance. Such perfect concordance may, however, not always be relevant, e.g., in clinical terms. In order to assess the magnitude of differences between raters, we employed a descriptive approach considering the distribution and the magnitude of score differences. As reliably different ratings were only observed when calculations were based on the test-retest reliability of the ELAN, we used these results to assess the magnitude and direction of differences. Overall, the differences observed were small: most of them (60%) were within 1 SD, all of them within 1.96 SDs of the differences' mean. Thus, the occurring differences were in an acceptable range for a screening tool, since they did not exceed one standard deviation of the norm scale used. This finding puts into perspective the relatively low proportion of absolute agreement measured on the grounds of the tool's test-retest reliability (43.4%) and highlights the importance of considering not only the significance but also the magnitude of differences. Interestingly, it is also in line with the 100% absolute agreement resulting from calculations employing this study's reliability rather than the standardized reliability of the instrument used.

The analysis of differences' direction is intended to uncover systematic rating tendencies by a group of raters or for a group of rated persons. Some validity studies show a tendency of raters, specifically of mothers, to estimate children's language developmental status higher than the results obtained via objective testing of the child's language abilities ( Deimann et al., 2005 ; Koch et al., 2011 ; Rennen-Allhoff, 2012 ). Whether these effects reflect an overrating of the abilities of the children by their mothers, or the fact that objective results acquired specifically for young children might underestimate the actual ability of a child, remains uncertain. In the present study we did not assess validity and thus did not compare the acquired ratings to objective data. This also means that our assessments cannot reveal lenience or harshness of ratings. Instead, comparisons were conducted between raters, i.e., between mother and father, as well as between teacher and parent. We did not find any systematic direction of differences under these circumstances: No one party of either rating pair rated children's vocabulary systematically higher or lower than the other.

As explained above, only with the more conservative approach to calculating the RCI did we find a substantial proportion of diverging ratings. We therefore examined factors that might influence the likelihood of receiving diverging ratings. Neither the gender of the child, nor whether the child was evaluated by two parents or by a parent and a teacher, influenced this likelihood systematically. Bilingualism of the evaluated child was the only examined factor that increased the likelihood of a child receiving diverging scores. It is possible that diverging ratings for the small group of bilingual children reflected systematic differences in the vocabulary used in the two different settings: monolingual German daycare and bilingual family environments. Larger groups and more systematic variation of the bilingual environment characteristics are necessary to determine whether bilingualism has a systematic effect on rater agreement, as suggested by this report, and, if so, where this effect stems from.
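
The report does not state which procedure was used to examine these factors; with a small sample and a binary outcome (diverging vs. agreeing rating pair), a Fisher's exact test on a 2 x 2 table is one defensible sketch. The cell counts below are invented for illustration and only loosely mirror the subgroup sizes, not the actual results:

```python
from scipy.stats import fisher_exact

# Rows: bilingual vs. monolingual children; columns: diverging vs. agreeing rating pairs.
# Counts are hypothetical (12 bilingual and 41 monolingual pairs overall, split arbitrarily).
table = [[7, 5],     # bilingual: 7 diverging, 5 agreeing
         [13, 28]]   # monolingual: 13 diverging, 28 agreeing

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```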

In order to further explore the linear relation between ratings, we calculated Pearson correlation coefficients. As mentioned above, many researchers employ correlation coefficients as an indicator of agreement (e.g., Bishop and Baird, 2001 ; Janus, 2001 ; Van Noord and Prevatt, 2002 ; Norbury et al., 2004 ; Bishop et al., 2006 ; Massa et al., 2008 ; Gudmundsson and Gretarsson, 2009 ), disregarding the fact that correlation measures the strength of the relation between two variables or ratings, but does not in itself provide information on the extent of agreement between them (for a methodological background see for example, Liao et al., 2010 ; Kottner et al., 2011 ). However, Pearson correlation coefficients are useful for quantifying the strength of linear association between variables. They can also be compared to assess differences between rater groups concerning these relations. In the context of vocabulary assessment, they allow us to relate the present results to previous findings. We found high correlation coefficients ( r = 0.841) across the study population and within each of the two rating subgroups (parent–teacher ratings r = 0.797, mother–father ratings r = 0.917). These correlations are higher than those found in comparable studies, which are mostly moderate, with correlation coefficients ranging from r = 0.30 to r = 0.60 ( Bishop and Baird, 2001 ; Janus, 2001 ; Norbury et al., 2004 ; Bishop et al., 2006 ; Massa et al., 2008 ; Gudmundsson and Gretarsson, 2009 ; Koch et al., 2011 ). Possible explanations can be found in our population characteristics, specifically in the homogeneity of the children's family and educational backgrounds, as well as in the high professional qualification of the teachers in the participating state-regulated daycare facilities. The high correlations could also be seen as an indication that the employed questionnaire was easy to understand and unambiguous for most of the raters. What is more, we did not find differences in correlation coefficients when comparing rater subgroups. These results provide evidence that two parental ratings were not more strongly associated with each other than a parent with a teacher rating, and that, in general, the two ratings of a child's expressive vocabulary obtained with the ELAN-questionnaire ( Bockmann and Kiese-Himmel, 2006 ) were strongly associated with each other.
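
Comparing Pearson coefficients between independent subgroups is commonly done with Fisher's r-to-z transformation. The sketch below uses the coefficients and subgroup sizes reported above (parent–teacher: r = 0.797, n = 34; mother–father: r = 0.917, n = 19), but the test itself is an illustration of the general approach, not necessarily the exact procedure used in the original analysis:

```python
import math
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z test for the difference between two independent Pearson
    correlations using Fisher's r-to-z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transforms
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # SE of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

z, p = compare_correlations(0.797, 34, 0.917, 19)
print(f"z = {z:.2f}, p = {p:.3f}")
```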

Taking the results on agreement and those on linear correlations together, we conclude that both measures are important to report. We demonstrate that high correlations between ratings do not necessarily indicate high agreement of ratings (when a conservative reliability estimate is used). The present study is an example of low to moderate agreement of ratings combined with a relatively small magnitude of differences, an unsystematic direction of differences and very high linear correlations between ratings within and between rating subgroups. In our study it would thus have been very misleading to consider only correlations as a measure of agreement (which they are not).

In summary, this study provides a comprehensive evaluation of agreement within and between two rater groups with regard to a German expressive vocabulary checklist for parents (ELAN, Bockmann and Kiese-Himmel, 2006 ). Inter-rater reliability of the ELAN-questionnaire, assessed here for the first time, proved to be high across rater groups. Within the limits of the population's size and homogeneity, our results indicate that the ELAN-questionnaire, originally standardized for parents, can also be used reliably with qualified daycare teachers who have sufficient experience with a child. We did not find any indication of systematically lower agreement for parent–teacher ratings compared to mother–father ratings. Likewise, teachers compared to parents, as well as mothers compared to fathers, did not provide systematically higher or lower ratings. The proportion of absolute agreement depended strongly on the reliability estimate used to calculate a statistically meaningful difference between ratings. The magnitude of rating differences was small and the strength of association between vocabulary ratings was high. These findings highlight that rater agreement has to be assessed in addition to correlational measures, taking into account not only the significance but also the magnitude of differences.

The analytical approach employed and discussed here serves as one example of how ratings and rating instruments can be evaluated, applicable to a variety of developmental and behavioral characteristics. It allows the assessment and documentation of differences and similarities between rater and rated subgroups using a combination of different statistical analyses. If future reports succeed in disambiguating the terms agreement, reliability and linear correlation, and if the statistical approaches appropriate to each aspect are used, higher comparability of research results and thus improved transparency will be achieved.

Funding for this study was provided by the Zukunftskolleg of the University of Konstanz.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We thank the families and daycare centers that participated in this study and extend our gratitude to the student researchers who assisted with data collection: Anna Artinyan, Katharina Haag, and Stephanie Hoss.

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing . Washington, DC: American Educational Research Association.

American Psychological Association and Others. (2002). Ethical Principles of Psychologists and Code of Conduct . (Retrieved July 20, 2007).

Bishop, D. V., and Baird, G. (2001). Parent and teacher report of pragmatic aspects of communication: use of the children's communication checklist in a clinical setting. Dev. Med. Child Neurol . 43, 809–818. doi: 10.1017/S0012162201001475


Bishop, D. V. M., Laws, G., Adams, C., and Norbury, C. F. (2006). High heritability of speech and language impairments in 6-year-old twins demonstrated using parent and teacher report. Behav. Genet . 36, 173–184. doi: 10.1007/s10519-005-9020-0

Bland, J. M., and Altman, D. G. (2003). Applying the right statistics: analyses of measurement studies. Ultrasound Obstet. Gynecol . 22, 85–93. doi: 10.1002/uog.122

Bland, M. J., and Altman, D. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327, 307–310. doi: 10.1016/S0140-6736(86)90837-8

Bockmann, A.-K. (2008). ELAN – mit Schwung bis ins Grundschulalter: Die Vorhersagekraft des frühen Wortschatzes für spätere Sprachleistungen. Forum Logopädie 22, 20–23.

Bockmann, A.-K., and Kiese-Himmel, C. (2006). ELAN – Eltern Antworten: Elternfragebogen zur Wortschatzentwicklung im frühen Kindesalter, 1st Edn . Göttingen: Hogrefe.

Borsboom, D., Mellenbergh, G. J., and van Heerden, J. (2004). The concept of validity. Psychol. Rev . 111, 1061. doi: 10.1037/0033-295X.111.4.1061

Bortz, J., and Döring, N. (2006). Forschungsmethoden und Evaluation: für Human- und Sozialwissenschaftler . Berlin: Springer. doi: 10.1007/978-3-540-33306-7


Boynton Hauerwas, L., and Addison Stone, C. (2000). Are parents of school-age children with specific language impairments accurate estimators of their child's language skills? Child Lang. Teach. Ther . 16, 73–86. doi: 10.1191/026565900677949708

Brown, G. T., Glasswell, K., and Harland, D. (2004). Accuracy in the scoring of writing: studies of reliability and validity using a New Zealand writing assessment system. Assess. Writ . 9, 105–121. doi: 10.1016/j.asw.2004.07.001

Brown, J. D., Wissow, L. S., Gadomski, A., Zachary, C., Bartlett, E., and Horn, I. (2006). Parent and teacher mental health ratings of children using primary-care services: interrater agreement and implications for mental health screening. Ambulat. Pediatr . 6, 347–351. doi: 10.1016/j.ambp.2006.09.004

de Houwer, A., Bornstein, M. H., and Leach, D. B. (2005). Assessing early communicative ability: a cross-reporter cumulative score for the MacArthur CDI. J. Child Lang . 32, 735–758. doi: 10.1017/S0305000905007026

Deimann, P., Kastner-Koller, U., Benka, M., Kainz, S., and Schmidt, H. (2005). Mütter als Entwicklungsdiagnostikerinnen. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie 37, 122–134. doi: 10.1026/0049-8637.37.3.122

de Vet, H., Terwee, C., Knol, D., and Bouter, L. (2006). When to use agreement versus reliability measures. J. Clin. Epidemiol . 59, 1033–1039. doi: 10.1016/j.jclinepi.2005.10.015

Dunn, L. M., and Dunn, D. M. (2007). Peabody Picture Vocabulary Test: PPVT-4B, 4th Edn . Minneapolis, MN: NCS Pearson.

Fenson, L. (1993). MacArthur Communicative Development Inventories: User's Guide and Technical Manual . San Diego, CA: Singular Publishing Group.

Fenson, L. (2007). MacArthur-Bates Communicative Development Inventories: User's Guide and Technical Manual . Baltimore, MD: Brookes.

Gilmore, J., and Vance, M. (2007). Teacher ratings of children's listening difficulties. Child Lang. Teach. Ther . 23, 133–156. doi: 10.1177/0265659007073876

Grietens, H., Onghena, P., Prinzie, P., Gadeyne, E., Van Assche, V., Ghesquière, P., et al. (2004). Comparison of mothers', fathers', and teachers' reports on problem behavior in 5- to 6-year-old children. J. Psychopathol. Behav. Assess . 26, 137–146. doi: 10.1023/B:JOBA.0000013661.14995.59

Gudmundsson, E., and Gretarsson, S. J. (2009). Comparison of mothers' and fathers' ratings of their children's verbal and motor development. Nordic Psychol . 61, 14–25. doi: 10.1027/1901-2276.61.1.14

Jacobson, N. S., and Truax, P. (1991). Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J. Consult. Clin. Psychol . 59, 12–19. doi: 10.1037/0022-006X.59.1.12

Janus, M. (2001). Validation of a Teacher Measure of School Readiness with Parent and Child-Care Provider Reports . Department of Psychiatry Research Day, Canadian Centre for Studies of Children at Risk.

Jonsson, A., and Svingby, G. (2007). The use of scoring rubrics: reliability, validity and educational consequences. Edu. Res. Rev . 2, 130–144. doi: 10.1016/j.edurev.2007.05.002

Koch, H., Kastner-Koller, U., Deimann, P., Kossmeier, C., Koitz, C., and Steiner, M. (2011). The development of kindergarten children as evaluated by their kindergarten teachers and mothers. Psychol. Test Assess. Model . 53, 241–257.

Kottner, J., Audige, L., Brorson, S., Donner, A., Gajewski, B. J., Hróbjartsson, A., et al. (2011). Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. Int. J. Nurs. Stud . 48, 661–671. doi: 10.1016/j.ijnurstu.2011.01.017

Liao, S. C., Hunt, E. A., and Chen, W. (2010). Comparison between inter-rater reliability and inter-rater agreement in performance assessment. Annal. Acad. Med. Singapore 39, 613.


Maassen, G. H. (2000). Principles of defining reliable change indices. J. Clin. Exp. Neuropsychol . 22, 622–632. doi: 10.1076/1380-3395(200010)22:5;1-9;FT622

Maassen, G. H. (2004). The standard error in the jacobson and truax reliable change index: the classical approach to the assessment of reliable change. J. Int. Neuropsychol. Soc . 10, 888–893. doi: 10.1017/S1355617704106097

Marchman, V. A., and Martinez-Sussmann, C. (2002). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. J. Speech Lang. Hear. Res . 45, 983–997. doi: 10.1044/1092-4388(2002/080)

Massa, J., Gomes, H., Tartter, V., Wolfson, V., and Halperin, J. M. (2008). Concordance rates between parent and teacher clinical evaluation of language fundamentals observational rating scale. Int. J. Lang. Commun. Disord . 43, 99–110. doi: 10.1080/13682820701261827

McLeod, S., and Harrison, L. J. (2009). Epidemiology of speech and language impairment in a nationally representative sample of 4-to 5-year-old children. J. Speech Lang. Hear. Res . 52, 1213–1229. doi: 10.1044/1092-4388(2009/08-0085)

Mitchell, S. K. (1979). Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychol. Bull . 86, 376–390. doi: 10.1037/0033-2909.86.2.376

Norbury, C. F., Nash, M., Baird, G., and Bishop, D. V. (2004). Using a parental checklist to identify diagnostic groups in children with communication impairment: a validation of the Children's Communication Checklist-2. Int. J. Lang. Commun. Disord . 39, 345–364. doi: 10.1080/13682820410001654883

Rennen-Allhoff, B. (2012). Wie verläßlich sind Elternangaben? Kinderpsychol. Kinderpsychiatr . 40, 333–338.

Shoukri, M. M. (2010). Measures of Interobserver Agreement and Reliability, 2nd Edn . Boca Raton, FL: CRC Press. doi: 10.1201/b10433

Shrout, P. E., and Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychol. Bull . 86, 420–428. doi: 10.1037/0033-2909.86.2.420

Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Pract. Assess. Res. Eval . 9, 66–78.

Vagh, S. B., Pan, B. A., and Mancilla-Martinez, J. (2009). Measuring growth in bilingual and monolingual children's English productive vocabulary development: the utility of combining parent and teacher report. Child Dev . 80, 1545–1563. doi: 10.1111/j.1467-8624.2009.01350.x

Van Noord, R. G., and Prevatt, F. F. (2002). Rater agreement on iq and achievement tests: effect on evaluations of learning disabilities. J. School Psychol . 40, 167–176. doi: 10.1016/S0022-4405(02)00091-2

Wolraich, M. L., Lambert, E. W., Bickman, L., Simmons, T., Doffing, M. A., and Worley, K. A. (2004). Assessing the impact of parent and teacher agreement on diagnosing attention-deficit hyperactivity disorder. J. Dev. Behav. Pediatr . 25, 41–47. doi: 10.1097/00004703-200402000-00007

Zahra, D., and Hedge, C. (2010). The reliable change index: Why isn't it more popular in academic psychology. Psychol. Postgrad. Aff. Group Q . 76, 14–19.

Keywords: inter-rater agreement, inter-rater reliability, correlation analysis, expressive vocabulary, parent questionnaire, language assessment, parent–teacher ratings, concordance of ratings

Citation: Stolarova M, Wolf C, Rinker T and Brielmann A (2014) How to assess and compare inter-rater reliability, agreement and correlation of ratings: an exemplary analysis of mother-father and parent-teacher expressive vocabulary rating pairs. Front. Psychol . 5 :509. doi: 10.3389/fpsyg.2014.00509

Received: 17 January 2014; Accepted: 09 May 2014; Published online: 04 June 2014.


Copyright © 2014 Stolarova, Wolf, Rinker and Brielmann. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Margarita Stolarova, Department of Psychology, University of Konstanz, Universitätsstraße 10, 78464 Konstanz, Germany e-mail: [email protected]

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.


Inter-reviewer reliability of human literature reviewing and implications for the introduction of machine-assisted systematic reviews: a mixed-methods review

Piet Hanegraaf, Abrham Wondimu, Jacob Jan Mosselman, Rutger de Jong, Seye Abogunrin, Luisa Queiros, Maarten J Postma, Cornelis Boersma, Jurjen van der Schans


Correspondence to Dr Jurjen van der Schans; [email protected]


Original research

Received 2023 Jun 20; Accepted 2024 Feb 23; Collection date 2024.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:  http://creativecommons.org/licenses/by-nc/4.0/ .

Our main objective is to assess the inter-reviewer reliability (IRR) reported in published systematic literature reviews (SLRs). Our secondary objective is to determine the expected IRR by authors of SLRs for both human and machine-assisted reviews.

We performed a review of SLRs of randomised controlled trials using the PubMed and Embase databases. Data on IRR, expressed as Cohen’s kappa scores for abstract/title screening, full-text screening and data extraction, were extracted together with review team size and the number of items screened, and the quality of each review was assessed with A MeaSurement Tool to Assess systematic Reviews 2. In addition, we performed a survey of authors of SLRs on their expectations of machine learning automation and of human-performed IRR in SLRs.

After removal of duplicates, 836 articles were screened on title and abstract, and 413 were screened in full text. In total, 45 eligible articles were included. The average Cohen’s kappa score reported was 0.82 (SD=0.11, n=12) for abstract screening, 0.77 (SD=0.18, n=14) for full-text screening, 0.86 (SD=0.07, n=15) for the whole screening process and 0.88 (SD=0.08, n=16) for data extraction. No association was observed between the reported IRR and review team size, items screened or quality of the SLR. The survey (n=37) showed overlapping expected Cohen’s kappa values ranging between approximately 0.6 and 0.9 for either human or machine learning-assisted SLRs. No trend was observed between reviewer experience and expected IRR. Authors expect a higher-than-average IRR for machine learning-assisted SLRs compared with human-based SLRs in both screening and data extraction.

Currently, it is not common to report IRR in the scientific literature for either human or machine learning-assisted SLRs. This mixed-methods review gives first guidance on a human IRR benchmark, which could be used as a minimal threshold for IRR in machine learning-assisted SLRs.

PROSPERO registration number

CRD42023386706.

Keywords: Systematic Review, Randomized Controlled Trial, Surveys and Questionnaires

STRENGTHS AND LIMITATIONS OF THIS STUDY.

First assessment of threshold of agreement between human reviewers of systematic literature reviews.

First reference for a threshold of agreement for machine learning-assisted systematic literature reviews.

Because inter-reviewer reliability metrics are under-reported, the reported values may not accurately reflect the true agreement.

Sample size of the survey is small, undermining both the internal and external validity.

Introduction

Evidence-based medicine (EBM) is the integration of clinical expertise with the best available evidence from systematic research. The aim of EBM is to combine comprehensive evidence with patients’ values to inform decision-making for the individual care pathway and the development of clinical (treatment) guidelines. 1 Synthesis of the available evidence by means of a systematic literature review (SLR) forms the foundation of this type of informed medical decision-making, making it one of the most important sources of evidence. 2 The rigorous character of SLRs, combined with the increasing volume of evidence and the need for systematic updates to prevent evidence from becoming outdated, 3 4 has put excessive pressure on researchers involved in evidence generation and assessment. The potential impact of outdated and incomplete health information goes beyond the field of evidence generation and is likely to result in suboptimal treatment of patients. As a result, automating aspects of the SLR process could lead to better and more up-to-date informed medical decision-making, and thus indirectly improve the health of individual patients and entire populations.

A combination of human effort and machine learning automation has been proposed to reduce the workload of conducting SLRs and potentially enhance screening and data extraction quality. Machine learning can be used for fully automated or assisted screening and eligibility assessment, as well as to support data extraction, and has shown promising potential over recent years. 5 A recent survey found a 32% uptake of automation tools among systematic review practitioners, but identified a lack of published evidence on the tools’ benefit as one possible cause of the relatively low uptake. 6 The absence of transparency was also mentioned as a barrier to the use of automation tools. Another concern is machine learning’s compatibility with established methodology in evidence synthesis. Rigorously produced, disseminated and easily accessible evidence of machine learning validation is one of the key requirements for advancing the field of machine learning in evidence synthesis. Demonstrating accuracy against human classification is one of the first steps toward a meaningful introduction of machine learning into evidence synthesis. 7

In terms of agreement between reviewers of SLRs, Cohen’s kappa is the inter-reviewer reliability (IRR) score most often used to assess the agreement between two reviewers. 8 Cohen’s kappa is defined as the relative observed agreement among reviewers, corrected for the probability of chance agreement. A level-of-agreement scale is commonly attached to the Cohen’s kappa score: values under 0.20 indicate none to slight agreement, 0.21–0.39 minimal agreement, 0.40–0.59 weak agreement, 0.60–0.79 moderate agreement, 0.80–0.90 strong agreement and values above 0.90 almost perfect agreement. 9 Notably, the Cohen’s kappa score can also be used to analyse the performance of machine learning approaches against human classification in automation tasks. For example, the Cohen’s kappa score was used to evaluate the performance of a machine learning model on automated classification of patient-based age-related macular degeneration severity using colour fundus photographs. 10 To our knowledge, neither an assessment nor a reference standard of the inter-reviewer reliability between human literature reviewers, or between human and machine learning reviewers, has yet been established.
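
As a concrete illustration of this definition, Cohen’s kappa can be computed directly from the observed agreement and the agreement expected by chance, κ = (p_o − p_e)/(1 − p_e). The sketch below uses made-up include/exclude decisions for two reviewers; it is not drawn from any of the included studies:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical include/exclude decisions of two reviewers on ten abstracts.
a = ["include", "exclude", "exclude", "include", "exclude",
     "exclude", "include", "exclude", "exclude", "exclude"]
b = ["include", "exclude", "include", "include", "exclude",
     "exclude", "exclude", "exclude", "exclude", "exclude"]
print(round(cohens_kappa(a, b), 2))
```

The same function could be applied unchanged to a human reviewer paired with a machine learning classifier, which is what makes kappa a convenient common yardstick for the comparison discussed here.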

One way to gauge how well the researchers involved in an SLR understood the topic being investigated is the reporting and the level of disagreements between the researchers. 11 This can be used as a proxy for the quality of the evidence summarised in such SLRs. As a first step towards understanding the potential improvements machine approaches can bring, it is important to consider the standard of human-executed SLRs. To define the accuracy of machine learning approaches in SLRs, machine learning automation approaches for systematic review screening need to be compared with human-performed SLRs, using the human–human agreement as a reference. In this way, human inter-reviewer reliability can serve as a reference to validate the introduction of machine learning automation in SLRs.

A similar problem of setting standards for the introduction of machine learning automation arises with autonomous self-driving cars, which involve similarly high stakes in terms of human lives and the automation of jobs. 12 One of the problems encountered when setting safety standards for self-driving cars is the better-than-average effect. At the level of individual risk assessment, most drivers perceive themselves to be safer than the average driver. 12 Most drivers want self-driving cars that are safer than their own perceived ability to drive safely before they would feel reasonably safe riding in or buying a self-driving vehicle, all other things being equal. 12 To our knowledge, the extent of the better-than-average effect in the introduction of machine learning automation in evidence synthesis is unknown.

The aim of this work is to assess the level of agreement of human-executed SLRs and to assist in setting objectives for machine learning algorithms and creating a benchmark for determining the level of conflicts between machine learning algorithm classification and human classification. Therefore, our main objective is to assess the IRR reported in systematic reviews and analyse how it relates with the overall systematic review quality, reviewer experience and size of the screening task. Our secondary objective is to determine the expected IRR by authors of SLRs and to determine the potential existence of the better-than-average effect between human and automated SLRs.

This mixed-methods review consists of two parts to create a comprehensive synthesis of quantitative and qualitative data of IRR for screening and data extraction in SLRs. In the first part of this study, we performed a review of SLRs of randomised controlled trials (RCTs). The Pitts web application ( www.pitts.ai ) was used for both the literature screening as well as the data extraction. 13 The machine learning components of the Pitts web application were not employed for any steps of this systematic review. To limit the scope of the review and to increase the comparability of the included studies, we only included systematic reviews of RCTs. We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline 14 for reporting for this part of the study ( online supplemental appendix A ). In the second part of this research, we surveyed authors of SLRs on their expectations of machine learning automation and human performed inter-reviewer reliability in SLRs. For the survey, we used the Qualtrics platform, a user-friendly, feature-rich, web-based survey tool which allows users to build, distribute and analyse online surveys. We followed the Strengthening the Reporting of Observational Studies in Epidemiology guideline 15 for reporting of observational studies for this part of the study ( online supplemental appendix B ).


Systematic literature review

Selection criteria for eligible studies

For title and abstract screening and full-text screening, the Sample, Phenomenon of Interest, Design, Evaluation, Research type (SPIDER) framework 16 ( online supplemental appendix C ) was used to guide the selection of the keywords and to organise the inclusion and exclusion criteria. We included SLRs of RCTs with double-blind screening that reported on the inter-reviewer reliability of screening or data extraction decisions. We restricted the search to publications from the past 5 years.


Studies were included if they met the following inclusion criteria

Systematic review articles of RCTs in which:

Two or more reviewers were involved during literature screening or data extraction.

Level of agreement between reviewers was reported.

The study screening process was double-blinded.

The number of included and excluded studies was reported.

A kappa score or another inter-reviewer agreement metric was reported for abstract screening, full-text screening or overall literature screening, or double data extraction was reported; alternatively, a kappa score or another inter-reviewer agreement metric could be calculated for any of the above items.

Studies were excluded if they met the following exclusion criteria

Systematic reviews:

In which fewer than two reviewers were involved in literature screening or data extraction.

Which do not report the level of agreement between reviewers.

In which the study screening was not double-blinded or not clearly reported.

In which the data extraction was not double-blinded or not clearly reported.

Where the number of included and excluded studies was not reported.

The following types of studies were excluded: diagnostic test reviews, individual patient data meta-analysis, scoping review, realist reviews, systematic review of health economic evaluation, empirical studies, case reports, narratives, letters to editors, genome-wide association meta-analysis, umbrella review and mixed-methods review.

Search strategy and study selection

The PubMed and Embase databases were searched using the search terms presented below. The search was limited by language (Dutch and English) and time (publication between January 2017 and December 2022) and was performed on 7 December 2022. Two reviewers performed the title and abstract screening independently. The articles that fulfilled the eligibility criteria were retrieved as full text, and full-text screening was again performed by two reviewers independently. A third reviewer was asked to resolve disagreements between the two reviewers during the title and abstract screening as well as the full-text screening. Agreement between reviewers is presented via the Cohen’s kappa score, separately for abstract screening, full-text screening and data extraction.

Search query PubMed via https://pubmed.ncbi.nlm.nih.gov/ (416 hits):

(("Systematic Review" [Publication Type] OR "Meta-Analysis" [Publication Type]) AND ("inter rater agreement" OR "inter-rater reliability" OR "IRR" OR "percent agreement" OR "percentage agreement" OR "reviewers agreed" OR no disagreement OR "Cohen's kappa" OR "Cohen’s Kappa statistic" OR "Cohen’s kappa coefficient" OR "Cohen’s Κ" OR "kappa test" OR "kappa*")) AND (randomised controlled trial OR randomized controlled trial OR rct OR randomized control trial OR randomised control trial) AND (y_5[Filter])

Search query Embase via https://www.embase.com/%23advancedSearch/default (454 hits):

('meta analysis'/exp OR 'systematic review'/exp OR 'meta-analysis' OR 'metaanalysis') AND ('inter rater agreement' OR 'inter-rater reliability' OR 'irr' OR 'percent agreement' OR 'percentage agreement' OR 'reviewers agreed' OR 'no disagreement' OR 'cohen/s kappa' OR 'cohen/s kappa statistic' OR 'cohen/s kappa coefficient' OR 'cohen/s κ' OR 'kappa test' OR kappa*) AND ('randomised controlled trial' OR 'rct' OR 'randomized control trial' OR 'randomised control trial' OR 'randomized controlled trial') AND [2017–2022]/py.

Data extraction and data synthesis

Two reviewers independently performed the data extraction in a custom-made data form. If there were any inconsistencies during data extraction, a third author was consulted. If necessary, data were calculated from the available information in the article. If data were missing, we tried to contact the author to retrieve the missing information. The following data were extracted: study characteristics, review team size, screening data (ie, number of papers screened for title/abstract, number of papers retained after title/abstract screening, number of studies retained following full-text screening, and reporting of inclusion and exclusion criteria), study quality (A MeaSurement Tool to Assess systematic Reviews (AMSTAR) 2 17 ) and Cohen’s kappa score or other inter-reviewer reliability metrics of title/abstract screening, full-text screening and/or data extraction. In addition, one of the reviewers assessed whether the protocol of the included SLR was registered in a publicly available database such as PROSPERO.

For the quality assessment of the included SLRs, we applied the AMSTAR 2 17 checklist to all the included SLRs. AMSTAR 2 is a critical appraisal tool for SLRs which consists of 16 questions. The quality assessment followed the approach recommended by the tool's authors (a minimal sketch of this decision rule is given after the list below). The quality of each SLR was deemed:

High; no or one non-critical weakness: The systematic review provides an accurate and comprehensive summary of the results of the available studies that address the question of interest.

Moderate; more than one non-critical weakness: The systematic review has more than one weakness but no critical flaws. It may provide an accurate summary of the results of the available studies that were included in the review.

Low; one critical flaw with or without non-critical weaknesses: The review has a critical flaw and may not provide an accurate and comprehensive summary of the available studies that address the question of interest.

Critically low; more than one critical flaw with or without non-critical weaknesses: The review has more than one critical flaw and should not be relied on to provide an accurate and comprehensive summary of the available studies.
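
The rating rules above can be captured in a small decision function. This is only a sketch of the published decision rule, assuming the critical flaws and non-critical weaknesses have already been counted for each review:

```python
def amstar2_rating(critical_flaws, noncritical_weaknesses):
    """Map counts of critical flaws and non-critical weaknesses to the
    overall AMSTAR 2 confidence rating described above."""
    if critical_flaws > 1:
        return "critically low"
    if critical_flaws == 1:
        return "low"
    if noncritical_weaknesses > 1:
        return "moderate"
    return "high"   # no critical flaws, at most one non-critical weakness

print(amstar2_rating(critical_flaws=0, noncritical_weaknesses=1))  # -> high
print(amstar2_rating(critical_flaws=2, noncritical_weaknesses=0))  # -> critically low
```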

The list of questions included in the survey is presented in online supplemental appendix D . The survey questions are intended to collect data on the participants’ expectations on the Cohen’s kappa score of SLRs between two human reviewers or a human reviewer and a machine learning agent, the better-than-average expectations of the participant, and the participants’ own experience on presenting the Cohen’s kappa score of their SLRs as primary outcome. In addition, we collected the participants’ scientific experience, experience with SLRs and experience with automation in SLRs as potential effect modifiers. Participants in this survey were also asked to reflect on the study objectives. We contacted authors of SLRs included in our review via email to complete our survey. In addition, we used a snow-balling approach to identify authors of SLRs from our own network. We followed up with the authors once after the initial contact. Informed consent was given by all authors who participated in the survey. The survey was open between 31 January 2023 and 17 February 2023, in which we aimed for a response rate of 10% of the targeted SLR authors (in the past 5 years).


Patient and public involvement

No patients were involved.

Data analysis

We calculated the mean, variance and SD over the extracted IRR metrics. An analysis of variance was used to determine whether a significant difference in Cohen’s kappa exists between the different levels of review team size, screening data size and AMSTAR 2 ratings. An alpha value of <0.001 was used to determine statistical significance. All survey metrics were presented against the different characteristics of the reviewers (ie, reviewer experience, reviewer ability, reviewer background, publication experience and academic qualification) in a colour-coded heat grid to observe trends in their association with the expected IRR for human and machine learning-assisted reviewing. All questions of the survey had to be answered to avoid missing data, and only fully completed surveys were analysed.
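
The analysis-of-variance comparison can be sketched as follows, assuming the extracted kappa scores have been grouped by review team size; the values below are placeholders rather than the extracted data, and the same call would be repeated for the screening-size and AMSTAR 2 groupings:

```python
from scipy.stats import f_oneway

# Hypothetical Cohen's kappa scores grouped by review team size.
kappa_team_of_2 = [0.81, 0.74, 0.88, 0.79]
kappa_team_of_3 = [0.90, 0.76, 0.83]
kappa_team_of_4_plus = [0.70, 0.85, 0.92]

f_stat, p_value = f_oneway(kappa_team_of_2, kappa_team_of_3, kappa_team_of_4_plus)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # judged against the alpha of 0.001 used here
```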

A total of 836 records were identified after applying the search query and removing duplicates. As part of the study selection procedure, we excluded 423 records during abstract screening and another 363 during full-text screening. In total, 45 articles met the eligibility criteria and were included in this review. Figure 1 gives an overview of the flow diagram of the systematic review. The primary reason for exclusion at full-text level was that the study did not report on the level of agreement between reviewers (n=307). An overview of the included studies and the extracted data is given in online supplemental appendix E . The Cohen’s kappa scores between our own reviewers for the abstract screening and the full-text screening were both 0.72.

Figure 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram 2020 for systematic reviews. 14


Among the included SLRs, on average a team of 2.45 (SD=1.17) reviewers screened 3386 abstracts (SD=8880), 129 full texts (SD=170) and included 41 articles (SD=71). The average Cohen’s kappa score was 0.82 (SD=0.11, n=12) for abstract screening, 0.77 (SD=0.18, n=14) for full-text screening, 0.86 (SD=0.07, n=15) for the whole screening process and 0.88 (SD=0.08, n=16) for data extraction. Most of the studies (84.4%, n=38/45) reported on the inclusion and exclusion criteria. However, the AMSTAR 2 rating was critically low in almost all the studies included (91.1%, n=41/45). No association was observed between the IRR reported and review team size, items screened and quality of the SLR as reported by the AMSTAR 2 rating as well as the number of ‘no’s’ reported in the rating.

In total, 37 respondents completed the survey ( table 1 ). The full survey results can be found in online supplemental appendix F . The academic qualification of respondents was primarily a Ph.D. or doctorate degree (59.5%, n=22/37), with an almost even spread of H-indexes. Most of the respondents had their professional activity in academia (83.8%, n=31/37), followed by industry (10.8%, n=4/37), government (5.4%, n=2/37) and healthcare organisations (5.4%, n=2/37).

Table 1. Characteristics of survey respondents (SLRs, systematic literature reviews).


Both the scientific experience and the SLR publication experience of respondents were evenly distributed among the different categories ( table 1 ). However, experience with machine learning within SLRs was self-assessed as below average by most of the respondents (62.2%, n=23/37). Both screening ability and data extraction ability (ie, time spent per article and overall precision) were self-rated as above average by the larger part of the respondents (97.3%, n=36/37 and 94.6%, n=35/37). The survey results for the expected IRR showed Cohen’s kappa values that overlap with those found in the review of SLRs, ranging between approximately 0.6 and 0.9 and indicating moderate to strong expected agreement between reviewers ( figures 2 and 3 ). In general, respondents expect a higher-than-average IRR for machine learning-assisted SLRs compared with human-based SLRs in both screening and data extraction. No trend was observed between reviewer experience (ie, scientific experience, SLR publication experience, machine learning experience with SLRs, and screening and data extraction ability) and expected IRR for either human-based or machine learning-assisted SLRs.

Figure 2

Lowest acceptable agreement expressed in Cohen’s kappa score between two reviewers (human–human or human–machine learning agent) for double-blinded literature abstract screening, full-text screening and data extraction decisions. The number represents the total respondents for the combination of answer categories, a darker colour represents a higher number of respondents.

Figure 3

Lowest acceptable agreement expressed in Cohen’s kappa score between two reviewers (human–human or human–machine learning agent) for double-blinded literature abstract screening, full-text screening and data extraction decisions to be published in a scientific journal.

Figures 2 and 3 address, respectively, the general acceptability of the IRR (human–human or human–machine learning agent) and its acceptability for publication in a scientific journal. Similar trends were observed for human–human and human–machine learning agent expected IRR, with more respondents preferring not to set a minimal Cohen’s kappa score when the question concerned acceptability for publication in a scientific journal. The absolute Cohen’s kappa scores given for general acceptability and for acceptability for publication remained similar for both human and machine learning-assisted reviewing.

The majority of the respondents indicated that automated literature screening systems should be above average with respect to their ability to screen (72.97%) or extract (70.27%) accurately before they should be used for peer-reviewed, journal-published SLRs. This is in line with the number of respondents who indicated that automated literature screening systems should perform above average before they would feel confident using such a system (75.67%), compared with performing average (18.92%) or respondents who would never use an automated literature screening system (5.41%). Among the respondents, 45.95% indicated that they had previously reported IRR for either screening or data extraction in their SLRs. Among the reasons for not reporting IRR, respondents indicated that they were not aware of the need to report such metrics or that the software used was not capable of retrieving such information. An important common response was that the reviewers included a third assessor to resolve disagreements and that the IRR only reflects part of the consensus-building process. Misinterpretation of the IRR as a quality measure of the reviewing process was mentioned as another underlying reason not to record or report IRR in SLRs.

The aim of this study was to assess the IRR reported in human-performed SLRs and the IRR expected by SLR authors for both human and machine learning-assisted SLRs. The findings from our review of SLRs of RCTs show that there is moderate to strong agreement between reviewers in published reviews. The data from our survey show expected moderate to strong agreement between reviewers, and a trend of higher expected agreement as the screening process progresses. On average, respondents of our survey expect a higher-than-average IRR for machine learning-assisted SLRs compared with human-based SLRs. No association was observed between reported IRR and review team size, items screened or SLR quality, nor between reviewer experience and expected IRR, likely because of the small sample size.

The findings of our review highlight that human reviewers are able to reach moderate to strong, but not perfect, agreement, despite the use of mechanisms such as clearly defined study selection criteria to improve agreement. A reviewer's decision is likely to depend on individual characteristics and on the interpretation of the reviewing process. Some reviewers may strictly follow the inclusion and exclusion criteria, while others may be less strict and more inclusive; this leads to disagreement and a resulting suboptimal inter-reviewer agreement score. 18 The survey showed an increase in expected agreement as the screening process progresses, with the expected agreement for abstract screening being lower than for full-text screening and data extraction. This might be attributed to the expectation that the understanding of the inclusion and exclusion criteria increases as the work progresses, thereby minimising disagreement, and to the fact that full-text papers contain much more information than abstracts. Additionally, the discussions held among reviewers to settle disagreements through consensus or a third party in the early stages of the review might help one reviewer understand the decision-making behaviour of the others and foster harmony in the later stages. However, such a trend was not observed in the SLRs of RCTs, where IRR was comparable among the different review steps. An explanation could be that the learning effect of the reviewers described above is cancelled out by the larger quantity of material on which disagreement can arise.

SLRs are trusted to inform clinical practice and public health policy and have been used for this purpose for a very long time. When a new method deviates from conventional practice, stakeholders are likely to have concerns that it might not be as effective. A lack of trust from users is a major barrier to the adoption of machine learning in the SLR process. 19 Therefore, in order to advance the adoption of machine learning in SLRs, researchers, funding agencies and policy-makers must be convinced that the new norm is either equal to or better than current practice. 19 The early adoption of machine learning for SLRs is already well documented in the literature 20–22 and it is therefore crucial to investigate reviewers’ expectations for the use of machine learning. These expectations can affect how much reviewers are willing to rely on machine learning in comparison to their own and other human reviewers' performance. In that regard, our study can be used to set a perceived performance benchmark for machine learning vis-à-vis human reviewers’ performance. A finding of our research is that, in order for machine learning to be widely integrated into the systematic review process, it should demonstrate above-average IRR and an above-average ability to carefully screen literature and extract data. Considering the IRR found in our review (moderate agreement between human reviewers), this would mean that machine learning-assisted reviews should show at least strong agreement.

To our knowledge, this is the first attempt to assess the threshold of agreement between human reviewers in order to guide the setting of a threshold for machine learning-assisted SLRs. However, the results should be seen in light of their limitations. It is reasonable to assume that publication bias is present in the reporting of IRR in SLRs: where IRR is reported, it is likely to be higher than where it is not. We found scarce reporting of Cohen’s kappa scores for IRR in SLRs of RCTs. About 83% of the articles in our review were excluded during the full-text screening because they did not contain any IRR data. However, it is unclear whether the authors did not assess IRR at all or simply did not report it in their publication. This may be partially because the Cochrane Review Guide and other major systematic review guidelines do not require conducting an IRR assessment or reporting IRR data for literature screening. 23 The publication bias in this particular instance would be more pronounced, as the IRR measures the agreement level of authors who are themselves directly involved in reporting the finding. Therefore, it is important to note that the IRR reported in SLRs may be inflated and thus not accurately reflect the true IRR. In those studies that reported Cohen’s kappa statistics, variability in the kappa scores was found. Differences in researcher experience, in combination with the difficulty of the screening task attributable to the fineness of the inclusion and exclusion criteria, time constraints on reviewing, and the level of preparation including pretesting and training for screening and data extraction, likely drove variations in inter-reviewer kappa estimates, but it was not possible to assess this in our review. As reported in our survey, an important common reason for not reporting IRR was that the reviewers included a third assessor to resolve disagreements, so that the IRR only reflects part of the consensus-building process. Misinterpretation of the IRR as a quality measure of the reviewing process was mentioned as another underlying reason not to record or report IRR in SLRs. The process of reaching agreement between humans is likely to be more iterative than the final IRR measure presented in the publication. However, the use of AI in machine-assisted reviews, where the model is often trained or retrained, can also be seen as an iterative process. In that sense, both the human–human IRR reported in SLRs and the machine-assisted IRR in SLRs may face the same interpretation issues.

A sizeable portion of respondents to the survey expressed the opinion that there should not be a minimum acceptable agreement level (expressed as a kappa score) for screening and data extraction decisions or for publishing a systematic review in a scientific journal, indicating that reviewers will continue to question the appropriateness of IRR for benchmarking. To increase confidence in the accuracy of the overall review, it is crucial to ensure reliability during screening and data extraction. 18 SLRs involve a number of subjective decisions that need to be recorded along the way from screening to data extraction in order to ensure transparency and replicability. 18 24 It is only in this sense that the significance of IRR can be fully appreciated. Due to the lack of prior research on the topic, it is impossible to compare the findings of the current study with those of other studies, which is another limitation of our study. Finally, the sample size for both the review and the survey is small, undermining both the internal and the external validity of the study. We recommend future studies with larger sample sizes to generate more accurate results. Such future results could build further on the quality of human-executed SLRs and the value of machine learning-assisted methods for conducting SLRs. In addition, future validation and direct comparison of machine-assisted reviews against human reviewers’ performance should place the IRR threshold in the context of real-world performance, as opposed to the indirect comparison made in this study. Accuracy measures of the reviewing process should be further explored to guide the process and evaluation of the introduction of machine learning in evidence generation and evidence synthesis.

Currently, it is not common to report on IRR in the scientific literature for either human or machine learning-assisted SLRs. Human performed SLRs likely show a moderate agreement between reviewers, while authors expect machine learning-assisted SLRs to perform better. This mixed-methods review gives first guidance on the human IRR benchmark, which could be used as a minimal threshold for IRR in machine learning-assisted SLRs. A minimal strong agreement between reviewers of machine learning-assisted SLRs is recommended to ensure overall acceptance of machine learning in SLRs.


Contributors: Conceptualisation of this study was done by PH, AW, SA, LQ, ML, MP, CB and JvdS. The design of the methodology was done by PH, AW, SA, LQ, ML, MP, CB and JvdS. The abstract and full-text screening was done by AW and JvdS, with CB for disagreement resolution. The data extraction was performed by AW, PH, with CB for disagreement resolution. The statistical analysis was performed by RdJ and JJM. Writing, reviewing and editing was done by all authors. All authors have read and agreed to the published version of the manuscript. JvdS acts as the guarantor for this publication.

Funding: This work was funded by F. Hoffmann-La Roche, Basel, Switzerland (grant number: N/A).

Competing interests: MP and CB reported stock ownership in Health-Ecore B.V. LQ, ML and SA are employed by Roche. SA and ML reported stock ownership in Roche. Roche provided funding for this research to Health-Ecore and Pitts. PH and JJM reported ownership in Pitts. The other authors declare that they have no further competing interests related to this specific study and topic.

Patient and public involvement: Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review: Not commissioned; externally peer reviewed.

Supplemental material: This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.

Data availability statement

No data are available.

Ethics statements

Patient consent for publication.

Consent obtained directly from patient(s).

Ethics approval

This study involves human participants but patient anonymity is guaranteed by the use of a unique anonymous identifier, hence the ethical approval for observational studies has been waived. Participants gave informed consent to participate in the study before taking part.

  • 1. Sackett DL, Rosenberg WM, Gray JA, et al. Evidence based medicine: what it is and what it isn't. BMJ 1996;312:71–2. 10.1136/bmj.312.7023.71
  • 2. Gough D, Elbourne D. Systematic research synthesis to inform policy, practice and democratic debate. Soc Policy Soc 2002;1:225–36. 10.1017/S147474640200307X
  • 3. Shojania KG, Sampson M, Ansari MT, et al. How quickly do systematic reviews go out of date? A survival analysis. Ann Intern Med 2007;147:224–33. 10.7326/0003-4819-147-4-200708210-00179
  • 4. Elliott JH, Synnot A, Turner T, et al. Living systematic review: 1. introduction-the why, what, when, and how. J Clin Epidemiol 2017;91:23–30. 10.1016/j.jclinepi.2017.08.010
  • 5. Cierco Jimenez R, Lee T, Rosillo N, et al. Machine learning computational tools to assist the performance of systematic reviews: a mapping review. BMC Med Res Methodol 2022;22:322. 10.1186/s12874-022-01805-4
  • 6. van Altena AJ, Spijker R, Olabarriaga SD. Usage of automation tools in systematic reviews. Res Synth Methods 2019;10:72–82. 10.1002/jrsm.1335
  • 7. Arno A, Elliott J, Wallace B, et al. The views of health guideline developers on the use of automation in health evidence synthesis. Syst Rev 2021;10:16. 10.1186/s13643-020-01569-2
  • 8. Park CU, Kim HJ. Measurement of inter-rater reliability in systematic review. Hanyang Med Rev 2015;35:44. 10.7599/hmr.2015.35.1.44
  • 9. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22:276–82.
  • 10. Peng Y, Dharssi S, Chen Q, et al. DeepSeeNet: a deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs. Ophthalmology 2019;126:565–75. 10.1016/j.ophtha.2018.11.015
  • 11. Belur J, Tompson L, Thornton A, et al. Interrater reliability in systematic review methodology. Sociol Methods Res 2018:004912411879937. 10.1177/0049124118799372
  • 12. Nees MA. Safer than the average human driver (who is less safe than me)? Examining a popular safety benchmark for self-driving cars. J Safety Res 2019;69:61–8. 10.1016/j.jsr.2019.02.002
  • 13. Pitts. Living systematic review software. Available: https://pitts.ai/ [Accessed 24 Nov 2022].
  • 14. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst Rev 2021;10:89. 10.1186/s13643-021-01626-4
  • 15. von Elm E, Altman DG, Egger M, et al. Strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ 2007;335:806–8. 10.1136/bmj.39335.541782.AD
  • 16. Cooke A, Smith D, Booth A. Beyond PICO: the SPIDER tool for qualitative evidence synthesis. Qual Health Res 2012;22:1435–43. 10.1177/1049732312452938
  • 17. Shea BJ, Reeves BC, Wells G, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ 2017;358:j4008. 10.1136/bmj.j4008
  • 18. Belur J, Tompson L, Thornton A, et al. Interrater reliability in systematic review methodology: exploring variation in coder decision-making. Sociol Methods Res 2021;50:837–65. 10.1177/0049124118799372
  • 19. O’Connor AM, Tsafnat G, Thomas J, et al. A question of trust: can we build an evidence base to gain trust in systematic review automation technologies. Syst Rev 2019;8:143. 10.1186/s13643-019-1062-0
  • 20. Cohen AM, Hersh WR, Peterson K, et al. Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc 2006;13:206–19. 10.1197/jamia.M1929
  • 21. Howard BE, Phillips J, Miller K, et al. SWIFT-Review: a text-mining workbench for systematic review. Syst Rev 2016;5:87. 10.1186/s13643-016-0263-z
  • 22. Liao J, Ananiadou S, Currie LG, et al. Automation of citation screening in pre-clinical systematic reviews. Neuroscience [Preprint] 2018. 10.1101/280131
  • 23. Higgins JPT, Thomas J, Chandler J, et al., eds. Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022). Cochrane, 2022. Available: www.training.cochrane.org/handbook
  • 24. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22:276–82.


Inter-rater reliability of risk of bias tools for non-randomized studies

  • Isabel Kalaycioglu (ORCID: orcid.org/0000-0002-7116-151X) 1,
  • Bastien Rioux 1,2,3,
  • Joel Neves Briard 1,2,3,
  • Ahmad Nehme 1,2,3,
  • Lahoud Touma 1,2,3,
  • Bénédicte Dansereau 1,2,3,
  • Ariane Veilleux-Carpentier 1,2,3 &
  • Mark R. Keezer 1,2,3,4

Systematic Reviews, volume 12, Article number 227 (2023)


There is limited knowledge on the reliability of risk of bias (ROB) tools for assessing internal validity in systematic reviews of exposure and frequency studies. We aimed to identify and then compare the inter-rater reliability (IRR) of six commonly used tools for frequency (Loney scale, Gyorkos checklist, American Academy of Neurology [AAN] tool) and exposure (Newcastle–Ottawa scale, SIGN50 checklist, AAN tool) studies.

Six raters independently assessed the ROB of 30 frequency and 30 exposure studies using the three respective ROB tools. Articles were rated as low, intermediate, or high ROB. We calculated an intraclass correlation coefficient (ICC) for each tool and category of ROB tool. We compared the IRR between ROB tools and tool type by inspection of overlapping ICC 95% CIs and by comparing their coefficients after transformation to Fisher’s Z values. We assessed the criterion validity of the AAN ROB tools by calculating an ICC for each rater in comparison with the original ratings from the AAN.

All individual ROB tools had an IRR in the substantial range (ICC point estimates between 0.61 and 0.80) or higher. The IRR was almost perfect (ICC point estimate > 0.80) for the AAN frequency tool and the SIGN50 checklist. All tools were comparable in IRR, except for the AAN frequency tool, which had a significantly higher ICC than the Gyorkos checklist (p = 0.021) and trended towards a higher ICC when compared to the Loney scale (p = 0.085). When examined by category of ROB tool, scales and checklists had a substantial IRR, whereas the AAN tools had an almost perfect IRR. For the criterion validity of the AAN ROB tools, the average agreement between our raters and the original AAN ratings was moderate.

All tools had substantial IRRs except for the AAN frequency tool and the SIGN50 checklist, which both had an almost perfect IRR. The AAN ROB tools were the only category of ROB tools to demonstrate an almost perfect IRR. This category of ROB tools had fewer and simpler criteria. Overall, parsimonious tools with clear instructions, such as those from the AAN, may provide more reliable ROB assessments.


Introduction

Risk of bias (ROB) assessment is a critical step in a systematic review [ 1 ]. Accurate ROB assessments identify the degree of bias in different bodies of evidence to inform decisions made by health professionals and policy makers. Given that low-bias randomized controlled trials (RCTs) cannot always be conducted, many public health officials rely on observational studies to inform their medical policies [ 2 ]. Proper ROB assessment is especially important for these non-randomized observational studies, as various sources of bias (e.g., confounding bias) are more likely to arise than in their RCT counterparts [ 1 ]. Without reliable ROB tools, one may overestimate the validity of results from high-bias studies, which may lead to the incorrect synthesis of knowledge and incorrect guidance for policy makers [ 2 ].

The high number of ROB tools and the lack of guidance on their optimal use in non-randomized studies, particularly in descriptive or analytical observational studies, are major obstacles to the interpretation of systematic reviews. There is a growing number of domain- and design-specific ROB tools for non-randomized studies, especially for frequency and exposure studies in health-related systematic reviews. Frequency studies use cohort or cross-sectional designs to assess the incidence or prevalence of an outcome [3]. Exposure studies use cohort or case–control designs to observe outcome occurrence in relation to a given exposure [4]. Several research organizations, such as the American Academy of Neurology (AAN), have created their own tools to evaluate these types of studies [5]. Other commonly used ROB tools for frequency studies include the Loney scale and the Gyorkos checklist, whereas for exposure studies the Newcastle–Ottawa scale and the SIGN50 checklist are widely used [6, 7, 8, 9]. In general, for non-randomized interventional studies, Cochrane recommends the ROBINS-I tool to evaluate potential sources of bias. There are currently no practice standards for ROB tools in observational studies, possibly due to the limited knowledge on how these numerous tools compare to one another [10, 11].

Inter-rater reliability, which attempts to quantify the performance of a tool by assessing the reproducibility of ratings between evaluators [1], has not previously been reported for most of these commonly used ROB tools. Furthermore, comprehensive head-to-head comparisons for these ROB tools are lacking [12]. There is a pressing need to identify and compare the inter-rater reliability of individual ROB tools to better guide their optimal use in systematic reviews of observational studies. As a primary objective, we aimed to quantify and then compare the inter-rater reliability of three commonly used ROB tools for frequency (Loney scale, Gyorkos checklist, AAN frequency tool) and for exposure (Newcastle–Ottawa scale, SIGN50 checklist, AAN exposure tool) studies. As secondary objectives, we identified and compared the inter-rater reliability of each category of ROB tool (scales, checklists, AAN tools) and evaluated the criterion validity of the AAN tools.

We conducted a reliability study and reported our findings using the Guidelines for Reporting Reliability and Agreement Studies (GRRAS; Supplemental Material, Table S 1 ) [ 13 ]. We defined frequency studies as descriptive studies that aimed to measure incidence or prevalence [ 3 ]. We defined exposure studies as analytical observational studies (e.g., cohort or case–control studies) that aimed to compare outcomes in two or more exposure groups [ 4 ]. These definitions are based on those generally used in the systematic review literature.

Selection and description of the ROB tools

We first selected one AAN ROB tool designed for frequency studies and another for exposure studies. The AAN ROB assessment tools use a four-tier classification system, whereby each article is rated from class one (lowest ROB) to class four (highest ROB) [5]. Each rating has a distinct set of criteria tailored to the review question and study design. Although the AAN has various ROB tools, none was explicitly stated to be a frequency or exposure ROB tool. We therefore selected tools with the most fitting criteria for the intended type of study. For frequency studies, we chose the Population Screening Scheme, as this tool assessed characteristics needed for a high-quality frequency study, such as having a representative and unbiased sample population. For exposure studies, we chose the Prognostic Accuracy Scheme over the similar Causation Evidence Scheme, as the latter had stricter criteria concerning confounding factors and biological plausibility. The precision of these criteria limited the scope of the Causation Evidence Scheme and made it better suited to assessing observational studies implemented specifically where randomized controlled trials could not be conducted for ethical reasons [5].

The two other categories of ROB tools considered in our study were scales and checklists (with or without summary judgments). Scales include a list of items that are each scored and assigned a weight. After scoring each weighted item, a quantitative summary score is produced [ 1 ]. For checklists, raters answer predetermined domain-specific questions from a given set of responses, such as “yes,” “no,” or “uncertain.” Although no instructions are provided to calculate an overall score, some checklists provide guidance to formulate a summary judgment, such as a low, intermediate, or high ROB [ 10 ].

We searched for two scales and two checklists in published systematic reviews that qualitatively described an extensive list of available ROB tools [1, 14, 15]. Over the period of June–August 2020, we searched for a combination of the following terms on Google Scholar: “Risk of Bias Tools,” “Observational Studies,” “Non-randomized studies,” “Exposure studies,” and “Frequency studies.” From this search, we found three systematic reviews, which each had a comprehensive list of various ROB tools, and five academic institutions that each created their own ROB tool [1, 9, 14, 15, 16, 17, 18, 19]. We screened for a preliminary set of ROB tools for exposure and frequency studies from these systematic reviews and academic institutions by using the following criteria: (i) freely available online in English, (ii) simple to use for non-experts in ROB assessment, and (iii) commonly used for non-randomized studies of frequency or exposure. A ROB tool was considered simple to use for non-experts if there were no reviews stating it was “complicated” or “difficult to summarize” [1, 14, 15]. Two authors (IK and BR) then assessed the citation impact of each tool on PubMed and Google Scholar to produce a list of five commonly used tools for each category of tool (scale, checklist) and for each study design (frequency, exposure; Supplemental Material, Table S2). The final set of tools was settled through consensus with a third author (MRK) based on the initial set of criteria. We selected four ROB tools: the Loney scale and the Gyorkos checklist for frequency studies, as well as the Newcastle–Ottawa scale and the SIGN50 checklist for exposure studies (Table 1) [6, 7, 8, 9]. Certain tools had various versions designed for specific study designs. We used the most appropriate version of these tools for each study design (frequency tools: case series/survey studies or cross-sectional designs; exposure tools: cohort or case–control designs). We followed the suggested summary scoring method for the Gyorkos and SIGN50 checklists [7, 9]. For the Loney and the Newcastle–Ottawa scales, we split the total score into three equal tiers (low, intermediate, and high ROB) to allow for category comparisons [6, 8].
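As a concrete illustration of this tier split, the following minimal Python sketch maps a total scale score onto the three ROB categories. It assumes only that higher totals indicate lower risk of bias and that the maximum attainable score of the tool version is known; neither the function name nor the example values come from the paper.

```python
def rob_tier(score: float, max_score: float) -> str:
    """Map a total scale score onto three equal risk-of-bias tiers.

    Assumption: higher totals indicate lower risk of bias, and
    `max_score` is the maximum attainable score of the tool version.
    """
    fraction = score / max_score
    if fraction > 2 / 3:
        return "low ROB"
    elif fraction > 1 / 3:
        return "intermediate ROB"
    return "high ROB"

# Hypothetical example: a Newcastle-Ottawa total of 7 out of an assumed maximum of 9
print(rob_tier(7, 9))  # -> "low ROB"
```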

Article selection

We sampled 30 frequency and 30 exposure articles from randomly selected clinical practice guidelines of the AAN published between 2015 and 2020 (Supplemental Material, Tables S3 and S4). We selected articles from the AAN guidelines for convenience, as they had already been assigned a ROB rating by the AAN. To ensure that we selected articles evaluated by the appropriate AAN ROB tool, we verified the appendices of these clinical guidelines, which stated whether the Population Screening Scheme (frequency studies) or the Prognostic Accuracy Scheme (exposure studies) was used to evaluate the included articles. The appendices outlined all articles by class; we therefore used information from this section to choose an equal number of class one, class two, and class three ROB articles, as rated by the authors of the original AAN systematic reviews. Although the AAN has four classes of risk of bias, we only used articles from classes 1–3 for two reasons. Firstly, class four studies are not included in the published AAN guidelines given their high risk of bias; therefore, we could not choose any class four articles from the guidelines to be evaluated [5]. Secondly, to allow for comparisons between ROB tools, we needed to split ROB assessments into three levels, with class one articles as low ROB, class two articles as intermediate ROB, and class three articles as high ROB. Of note, although articles were selected from the AAN guidelines, the chosen studies covered a diverse range of topics within neurology and medicine.

Rating process

We recruited six raters (BR, JNB, AN, LT, BD, AVC), all of whom were post-graduate neurology residents at our institution who had previously completed at least one systematic review. All raters attended a 60-min course on the selected ROB tools to ensure a standardized familiarity with the instruments. During this course, the necessity of ROB tools in systematic reviews was discussed and a description of each tool, along with its scoring system, was given. After the training, participants were asked to rate articles independently (i.e., without communication between raters) using a customized online form. Each rater assessed all 60 chosen articles using a set of three tools for frequency (n = 30) and exposure (n = 30) studies. All the exposure and frequency tools were used by each rater on all the exposure and frequency studies, respectively. We varied the sequence of articles to be assessed across raters, as well as the order of ROB tools across both raters and articles. Raters were asked to limit themselves to a maximum of 10 articles per day to avoid fatigue.
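The randomization of article and tool order described above can be sketched as follows. The helper is hypothetical (the authors used a customized online form and do not describe their randomization code), but it reproduces the stated constraints: every rater rates every article with every tool, article order varies across raters, tool order varies across raters and articles, and work is chunked into days of at most 10 articles.

```python
import random

def build_rating_schedule(articles, tools, raters, max_per_day=10, seed=0):
    """Hypothetical sketch of a randomized rating schedule."""
    rng = random.Random(seed)
    schedule = {}
    for rater in raters:
        order = articles[:]          # article order shuffled per rater
        rng.shuffle(order)
        plan = []
        for start in range(0, len(order), max_per_day):    # <= 10 articles/day
            for article in order[start:start + max_per_day]:
                tool_order = tools[:]                       # tool order shuffled
                rng.shuffle(tool_order)                     # per rater and article
                plan.append((article, tool_order))
        schedule[rater] = plan
    return schedule

# Hypothetical identifiers for the 30 frequency articles, 3 tools, 6 raters
schedule = build_rating_schedule(
    articles=[f"frequency_{i:02d}" for i in range(1, 31)],
    tools=["Loney", "Gyorkos", "AAN frequency"],
    raters=["R1", "R2", "R3", "R4", "R5", "R6"],
)
```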

Statistical analyses

We assessed inter-rater reliability with a two-way, agreement, average-measures intraclass correlation coefficient (ICC) with 95% confidence intervals (CI). This coefficient is commonly used to measure agreement on the ordinal scale for multiple raters [ 20 ]. We compared the inter-rater reliability between frequency tools (Loney, Gyorkos, and AAN frequency tool), exposure tools (Newcastle–Ottawa scale, SIGN50, and AAN exposure tool), and category of ROB tool (scales, checklists, and AAN tools) by transforming their ICC to Fisher’s Z values and testing the null hypothesis of equality. No adjustment for multiple testing was done. We also inspected their ICC and associated 95% CI. We visually inspected the variances across raters for each median score (for the pooled checklists, scales, and the AAN tools) and did not identify evidence of heteroscedastic variances. Homoscedasticity is a primary assumption behind the ICC, and violation of this assumption may inflate ICC estimates, which may lead to an overstatement of the inter-rater reliability [ 21 ]. Finally, we calculated an ICC for each of our six raters by comparing the ratings they produced with the AAN tools for each article to the ROB ratings published by the AAN for these same articles (criterion validity).
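For readers who want to reproduce the coefficient on their own data, the following is a minimal numpy sketch of the two-way, absolute-agreement, average-measures ICC (ICC(2,k) in the Shrout and Fleiss notation) computed from ANOVA mean squares. The authors performed their analyses in R, so this Python sketch is illustrative only, and the example scores are invented.

```python
import numpy as np

def icc_2k(ratings: np.ndarray) -> float:
    """Two-way, absolute-agreement, average-measures ICC from ANOVA
    mean squares. `ratings` is an (n articles x k raters) array of
    ordinal ROB scores (e.g., 1 = low, 2 = intermediate, 3 = high)."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-article means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_error = np.sum((ratings - grand_mean) ** 2) - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)

# Invented example: 5 articles rated 1-3 by 3 raters
scores = np.array([[1, 1, 2],
                   [3, 3, 3],
                   [2, 2, 1],
                   [1, 2, 1],
                   [3, 2, 3]])
print(round(icc_2k(scores), 3))
```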

We expected an ICC for most tools of approximately 0.50, based on prior publications assessing the Newcastle–Ottawa scale [22]. We used the Landis and Koch benchmarks to define inter-rater reliability as poor (ICC < 0), slight (0–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), almost perfect (0.81–0.99), and perfect (1.00) [23]. To detect a statistically significant difference between an ICC of 0.20 (slight reliability) and one of 0.50 with a group of six raters, a minimum of 27 studies was required, assuming at least 80% power and an alpha of 0.05 [24]. This was our reason for choosing to include a priori 30 frequency (10 of each class) and 30 exposure studies (10 of each class), for a total of 60 articles. We used a threshold of p < 0.05 for statistical significance and performed our analyses with RStudio (v1.2.5) [25].
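A sketch of the benchmark labelling and of an approximate Fisher-Z comparison of two independent ICCs follows. The variance term 1/(n - 3) is the usual large-sample approximation for a transformed correlation and is an assumption on our part; the paper does not state which variance formula was used, so the resulting p value is only indicative.

```python
import numpy as np
from scipy.stats import norm

def landis_koch(icc: float) -> str:
    """Label an ICC with the Landis and Koch benchmarks quoted in the text."""
    if icc < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (0.99, "almost perfect")]:
        if icc <= upper:
            return label
    return "perfect"

def compare_iccs(icc1, n1, icc2, n2):
    """Approximate two-sided test of equality of two independent ICCs
    after Fisher's Z transformation (variance 1/(n - 3) assumed)."""
    z1, z2 = np.arctanh(icc1), np.arctanh(icc2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return 2 * (1 - norm.cdf(abs(z)))

print(landis_koch(0.893))                        # "almost perfect"
# AAN frequency tool vs. Gyorkos checklist, 30 frequency articles each
print(round(compare_iccs(0.893, 30, 0.669, 30), 3))
```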

Availability of data and materials

The datasets supporting the conclusions of this article are available at https://datadryad.org/stash/share/6PQuln5wyTvTBx_CO_JFESVD8M7gX1ImQAy4t4JVxls .

Inter-rater reliability of ROB tools

The SIGN50 (ICC = 0.835; 95% CI 0.719, 0.912) and the AAN frequency (ICC = 0.893; 95% CI 0.821, 0.943) tools had the highest ICC point estimates; these fell within the range of almost perfect reliability (i.e., 0.81–0.99; Fig. 1, panels A and B). The four other tools had substantial reliability (i.e., 0.61–0.80): the Loney scale (ICC = 0.749; 95% CI 0.580, 0.865), the Gyorkos checklist (ICC = 0.669; 95% CI 0.450, 0.821; Fig. 1A), the Newcastle–Ottawa scale (ICC = 0.633; 95% CI 0.387, 0.802), and the AAN exposure tool (ICC = 0.743; 95% CI 0.517, 0.862; Fig. 1B). The AAN frequency tool had higher inter-rater reliability than the Gyorkos checklist (p = 0.021). The AAN frequency tool trended towards a greater inter-rater reliability compared to the Loney scale, with only minimal overlap in their 95% CIs (p = 0.085; Fig. 1A). We did not observe any other significant differences in ICC among the remaining tools. A summary of the results can be found in Supplemental Material, Table S5.

Figure 1. Intraclass correlation coefficient (ICC) by individual tools (A, B) and tool types (C). Abbreviations: AAN, American Academy of Neurology; CI, confidence interval; ICC, intraclass correlation coefficient; NOS, Newcastle–Ottawa scale.

Inter-rater reliability of categories of ROB tools

The AAN ROB tools, taken together as a category of ROB tool, had an almost perfect inter-rater reliability (ICC = 0.838; 95% CI 0.765, 0.894; Fig. 1C). The inter-rater reliability of scales (ICC = 0.698; 95% CI 0.559, 0.803) and checklists (ICC = 0.772; 95% CI 0.664, 0.852) was substantial. Although checklists did not differ significantly in inter-rater reliability when compared to the AAN ROB tools (p = 0.311), scales trended towards a lower inter-rater reliability compared to the AAN ROB tools (p = 0.061), with little overlap in their 95% CIs. A summary of the results can be found in Supplemental Material, Table S6.

Criterion validity of AAN ROB tools

We obtained the ICC using the AAN tools for each of our six reviewers as compared to the original ratings from the published AAN reviews. The average ICC among the six reviewers was moderate (0.563; 95% CI 0.239, 0.739). Individual point estimates for ICCs ranged from 0.417 (95% CI 0.022, 0.652) to 0.683 (95% CI 0.472, 0.810).

Several ROB tools are available to assess non-randomized studies; however, few have been thoroughly evaluated in terms of inter-rater reliability. Non-randomized studies, especially observational studies, usually harbor greater potential threats to their internal validity that deserve particular attention as compared to randomized studies. Reliable ROB tools for observational studies are therefore essential to properly appreciate and assess evidence from articles in systematic reviews.

In this inter-rater reliability assessment of ROB tools for exposure and frequency articles, we observed that all individual tools reached at least the substantial inter-rater reliability range (ICC point estimate = 0.61–0.80). We observed that the AAN tool for frequency studies had a higher inter-rater reliability as compared to the Gyorkos checklist and trended towards a higher inter-rater reliability as compared to the Loney scale. We did not observe differences in the inter-rater reliability for tools used in exposure studies (Newcastle–Ottawa scale, SIGN50 checklist, and AAN tool). When each category of ROB tool was analyzed, the AAN category of ROB tools was the only one to demonstrate an almost perfect inter-rater reliability, with trends in their favor as compared to ROB scales (Newcastle–Ottawa and Loney scales). These results suggest that the AAN ROB tools, especially the AAN frequency tool, may offer a high inter-rater reliability.

We observed a significantly higher inter-rater reliability for the AAN frequency tool when compared to the Gyorkos checklist. These results may be explained by differences in scoring structure between the Gyorkos checklist and the AAN frequency tool. The Gyorkos checklist was the only ROB instrument in our study to distinguish between minor and major flaws in ROB appraisal [7]. We suspect this stratification of the potential impact of biases added complexity to the ratings and may have allowed for greater variation in responses between raters, particularly when compared to the parsimonious grading scheme of the AAN. Furthermore, the Gyorkos checklist was the only tool lacking instructions for each question [7]. Lack of guidance within the instrument may have led to varying interpretations of items. These results suggest that individual characteristics of ROB tools, such as their complexity and the lack of explicit guidance aimed at the raters, may decrease their inter-rater reliability. In keeping with this, a way to enhance the Gyorkos checklist would be to simplify its scoring structure and add clearer instructions to guide its use.

The AAN category of ROB tools was the only category (i.e., as compared to scales and checklists) to show an almost perfect reliability. The simple criteria of the AAN tools may have contributed to their greater inter-rater reliability as these criteria are less susceptible to divergent interpretations. We did not, however, include any class 4 articles from the AAN ROB tools, which may have led to an overestimation of their inter-rater reliability. The AAN tools also trended towards a higher inter-rater reliability when compared to scales. Scales included in our study had a stricter grading scheme than the chosen AAN tools, which should theoretically have led to less variability amongst raters. An explanation for this may be that certain questions in our scales were much more open to interpretation than the relatively explicit AAN criteria. In addition, our scales comprised a greater set of criteria than the AAN ROB tools, which may have contributed to their higher inter-rater variability. Our checklists were just as complete as our scales, and yet, no difference was found between checklists and the AAN ROB tools. This may also be explained by the possibility that the questions in our scales were less objective than our checklists. Moving forward, a way to optimize scales would be to incorporate simpler, more straightforward criteria.

Our findings may be compared to previous studies on the Newcastle–Ottawa scale, as this is the only included tool whose inter-rater reliability had already been assessed [8, 22]. Oremus et al. assessed the inter-rater reliability of scales such as the Newcastle–Ottawa scale using novice student raters [22]. The inter-rater reliability in their study for the case–control and for the cohort version of the tool was fair (0.55, 95% CI −0.18, 0.89) and poor (−0.19, 95% CI −0.67, 0.35), respectively. Here, we report an overall substantial reliability (0.633; 95% CI 0.387, 0.802) for both versions combined. Differences in reliability study designs might contribute to this discrepancy. In the earlier study, raters all had different levels of experience and were new to ROB rating, whereas our raters were all neurology trainees with similar experience in systematic reviews and had participated in a 60-min training session [22].

The inter-rater reliability of the AAN tool type was almost perfect between our participants but only moderate to substantial when compared to the ROB assessments from published AAN guidelines. Several sources of discrepancy may explain these results. First, the AAN ROB tools do not guide raters on how to respond when information needed for a criterion is not explicitly stated in the article. This is especially important if that specific criterion can change the class of the article. For example, many of the class one frequency articles were graded as class three by our raters. This often occurred when our raters felt that there were ambiguities in determining whether the cohort under study came from a clinical center with or without a specialized interest in the outcome. Many raters could not find this information directly stated in certain class one articles, assumed that the criterion was not met, and therefore rated these class one articles as class three. Although these articles met all the other criteria of a class one article, raters were required to rate them as class three because of this single criterion. Raters did not have the opportunity to consider whether these ambiguities should affect the final ROB rating. It is possible that raters from the AAN leave room for interpretation of ambiguous information, especially when an article meets all other necessary criteria for a lower ROB level. Second, the moderate agreement of our raters with the reference AAN ratings may be partly explained by a framing effect. It is possible that reviewers involved in AAN guidelines implicitly prioritized certain criteria when classifying ambiguous articles. In contrast, our raters all came from similar academic backgrounds, and it could be that they prioritized certain AAN criteria similarly to one another, but differently from other authors involved in AAN guidelines. As an example, some exposure studies rated as class one in the AAN guidelines were assigned a class two ROB by our raters, as many of them were retrospective studies. Class one and class two ROB categories in the AAN exposure tool share core criteria; however, class one studies require prospective data collection. Finally, certain criteria in the AAN tools may be open to interpretation. For example, a class three article requires a “narrow” spectrum of people with or without the disease, whereas a class one article requires a “broad” spectrum of people, yet these terms are not quantified. This lack of specification may explain why some of our raters assigned a class three ROB to articles considered class one by AAN raters. Overall, to improve the AAN tools, it would be beneficial to add instructions on how to rate articles when information is presented ambiguously, in particular whether certain criteria should be prioritized in such cases, as well as definitions for the imprecise adjectives (such as “narrow” and “broad”) used in the criteria.

A high inter-rater reliability is necessary, but not sufficient, to reach a valid assessment of ROB. Other factors are also important to consider when choosing a tool to assess and report ROB in systematic reviews. The choice of ROB tool usually implies a tradeoff between completeness and complexity. More parsimonious tools such as those from the AAN may allow raters to assess relevant sources of bias faster than more complex tools while maintaining a high inter-rater reliability, as observed in our study. They may not, however, cover all potential sources of bias across different study settings and designs. Whether the focused scope of domains assessed in more parsimonious tools preserves the validity of ratings for more complex study designs remains unclear. Future studies assessing the validity of various tools, especially in other health-related domains, and how their content influences their validity and inter-rater reliability are needed to better understand how these tools compare to one another.

Strengths and limitations

The strengths of our study include a comprehensive assessment of the reliability of a larger number of ROB tools and the inclusion of a larger number of raters than in prior publications [1, 11, 12, 15]. The ratings were independent and performed on a sizable sample of articles. Our study, however, has limitations. We included participants with a similar academic background and asked them to rate articles in their field of study, which may have inflated the inter-rater reliability compared to what might be observed for a more heterogeneous group of participants. We chose raters with a common medical background as we believed this was more likely to reflect the most frequent population of raters in systematic reviews of clinical data. Furthermore, although the selected articles were diverse in study topic, they were all chosen from the AAN guidelines. This enabled us to assess the criterion validity of the AAN ROB tools; however, it could have hindered the generalizability of our findings to other domains. The selected ROB tools do not have criteria relating solely to neurology studies; therefore, selecting neurology articles from the AAN guidelines should not have caused these tools to perform better here than they would in a study using articles from other medical domains. Future studies could address these limitations in generalizability by incorporating a more heterogeneous group of raters with varying academic backgrounds and articles from varying medical domains. Another limitation to our study's completeness is that we chose to assess inter-rater reliability as a first step in evaluating the reliability of these ROB tools; we did not assess intra-rater reliability. In addition, although we chose commonly used ROB tools, we did not select a wide range of ROB tools. For future studies to be more complete, both intra- and inter-rater reliability could be assessed within the same study, with a wider range of ROB tools. Finally, we constructed a summary ROB score for each scale assessed in our study to allow for ease of comparison between all tools. This could have influenced the results, as the scales did not originally provide a method for converting the total score into a ROB category; the final ROB assessment was left up to the interpretation of the rater based on the answered questions. Future studies comparing the inter-rater reliability of scales with and without a strict scoring system would be necessary to assess the impact this modification had on our results.

There is a growing body of available ROB tools for non-randomized studies, although information is generally lacking on their reliability. In this inter-rater reliability study, we assessed and compared six common ROB tools for frequency and exposure studies. We observed that the AAN category of ROB tools had an almost perfect reliability, while all other categories had a substantial inter-rater reliability. All exposure tools were comparable in reliability, yet amongst the frequency tools, the AAN frequency tool had a significantly higher inter-rater reliability as compared to the Gyorkos checklist and trended towards a higher inter-rater reliability when compared to the Loney scale. Our findings suggest that parsimonious ROB tools, such as those from the AAN, may contribute to a high inter-rater reliability. However, it remains uncertain how such minimal criteria affect the overall validity of ratings produced by these tools.

The datasets generated and analyzed during the current study are available in the dryad repository, https://datadryad.org/stash/share/6PQuln5wyTvTBx_CO_JFESVD8M7gX1ImQAy4t4JVxls

Abbreviations

ROB: Risk of bias

AAN: American Academy of Neurology

ICC: Intraclass correlation coefficient

CI: Confidence interval

References

1. Sanderson S, Tatt ID, Higgins JP. Tools for assessing quality and susceptibility to bias in observational studies in epidemiology: a systematic review and annotated bibliography. Int J Epidemiol. 2007;36(3):666–76.

2. Barnish MS, Turner S. The value of pragmatic and observational studies in health care and public health. Pragmat Obs Res. 2017;8:49–55.

3. Munnangi S, Boktor SW. Epidemiology of study design. Treasure Island: StatPearls; 2022.

4. Lee TA, Pickard AS. Exposure definition and measurement. In: Velentgas P, Dreyer NA, Nourjah P, Smith SR, Torchia MM, editors. Developing a protocol for observational comparative effectiveness research: a user's guide. AHRQ Publication No. 12(13)-EHC099. Rockville: Agency for Healthcare Research and Quality; 2013.

5. American Academy of Neurology. Guideline development procedure manual. American Academy of Neurology; c2017 [cited 2023 Jul 6].

6. Loney PL, Chambers LW, Bennett KJ, Roberts JG, Stratford PW. Critical appraisal of the health research literature: prevalence or incidence of a health problem. Chronic Dis Can. 1998;19(4):170–6.

7. Gyorkos TW, Tannenbaum TN, Abrahamowicz M, Oxman AD, Scott EA, Millson ME, et al. An approach to the development of practice guidelines for community health interventions. Can J Public Health. 1994;85(Suppl 1):S8–13.

8. Wells GA, Shea B, O'Connell D, Peterson J, Welch V, Losos M, Tugwell P. The Newcastle-Ottawa Scale (NOS) for assessing the quality of nonrandomised studies in meta-analyses. The Ottawa Hospital Research Institute. Available from: http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp.

9. Scottish Intercollegiate Guidelines Network (SIGN). A guideline developer's handbook. Edinburgh: SIGN; 2019. Available from: http://www.sign.ac.uk.

10. Cochrane Collaboration. RoB 2: a revised Cochrane risk-of-bias tool for randomized trials. Cochrane Methods Bias. Available from: https://methods.cochrane.org/bias/resources/rob-2-revised-cochrane-risk-bias-tool-randomized-trials.

11. Viswanathan M, Ansari MT, Berkman ND, Chang S, Hartling L, McPheeters M, et al. Assessing the risk of bias of individual studies in systematic reviews of health care interventions. Methods guide for effectiveness and comparative effectiveness reviews. Rockville: AHRQ Methods for Effective Health Care; 2008.

12. Da Costa BR, Beckett B, Diaz A, Resta NM, Johnston BC, Egger M, et al. Effect of standardized training on the reliability of the Cochrane risk of bias assessment tool: a prospective study. Syst Rev. 2017;6(1):44.

13. Kottner J, Audigé L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64(1):96–106.

14. Wang Z, Taylor K, Allman-Farinelli M, Armstrong B, Askie L, Ghersi D, et al. A systematic review: tools for assessing methodological quality of human observational studies. NHMRC; 2019.

15. Shamliyan T, Kane RL, Dickinson S. A systematic review of tools used to assess the quality of observational studies that examine incidence or prevalence and risk factors for diseases. J Clin Epidemiol. 2010;63(10):1061–70.

16. Migliavaca CB, Stein C, Colpani V, Munn Z, Falavigna M; Prevalence Estimates Reviews – Systematic Review Methodology Group (PERSyst). Quality assessment of prevalence studies: a systematic review. J Clin Epidemiol. 2020;127:59–68. https://doi.org/10.1016/j.jclinepi.2020.06.039.

17. National Heart, Lung, and Blood Institute. Background: development and use of study quality assessment tools. Available from: https://www.nhlbi.nih.gov/node/80102.

18. National Collaborating Centre for Methods and Tools. Webinar companion: spotlight on KT methods and tools. Episode 3. Hamilton: McMaster University. Available from: https://www.nccmt.ca/uploads/media/media/0001/01/8cad682fd4a6ebf34531046a79f3fbb1cfccbfb6.pdf.

19. Moola S, Munn Z, Tufanaru C, Aromataris E, Sears K, Sfetcu R, et al. Chapter 7: Systematic reviews of etiology and risk. In: Aromataris E, Munn Z, editors. JBI Manual for Evidence Synthesis. JBI; 2020. https://doi.org/10.46658/JBIMES-20-0. Available from: https://synthesismanual.jbi.global.

20. Hallgren KA. Computing inter-rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol. 2012;8(1):23–34.

21. Bobak C, Barr P, O'Malley A. Estimation of an inter-rater intra-class correlation coefficient that overcomes common assumption violations in the assessment of health measurement scales. BMC Med Res Methodol. 2018;18:93.

22. Oremus M, Oremus C, Hall GB, McKinnon MC. Inter-rater and test-retest reliability of quality assessments by novice student raters using the Jadad and Newcastle-Ottawa Scales. BMJ Open. 2012;2(4):e001368.

23. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.

24. Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med. 2012;31(29):3972–81.

25. RStudio Team. RStudio: integrated development environment for R. Boston: RStudio Inc.; 2015. Available from: http://www.rstudio.com/.


Acknowledgements

Not applicable.

Author information

Authors and Affiliations

Faculty of Medicine, Université de Montréal, Montreal, QC, Canada

Isabel Kalaycioglu, Bastien Rioux, Joel Neves Briard, Ahmad Nehme, Lahoud Touma, Bénédicte Dansereau, Ariane Veilleux-Carpentier & Mark R. Keezer

Department of Neurosciences, Université de Montréal, Montreal, QC, Canada

Bastien Rioux, Joel Neves Briard, Ahmad Nehme, Lahoud Touma, Bénédicte Dansereau, Ariane Veilleux-Carpentier & Mark R. Keezer

Centre Hospitalier de L’Université de Montréal, Pavillon R R04-700, 1000 Saint-Denis St., Montreal, QC, H2X 0C1, Canada

School of Public Health, Université de Montréal, Montreal, QC, Canada

Mark R. Keezer


Contributions

IK, MRK, and BR contributed to the design of the study protocol and chose the ROB tools as well as the articles to be evaluated in the study. BR, JNB, AN, LT, BD, and AVC all participated as raters for the ROB assessments. BR carried out the statistical analysis. IK, BR, and MRK drafted the manuscript. All authors critically revised the manuscript.

Corresponding author

Correspondence to Mark R. Keezer .

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

MRK reports unrestricted educational grants from UCB and Eisai, research grants for investigator-initiated studies from UCB and Eisai as well as from government entities (Canadian Institutes of Health Research, Fonds de Recherche Québec – Santé), academic institutions (Centre Hospitalier de l’Université de Montréal), and foundations (TD Bank, TSC Alliance, Savoy Foundation, Quebec Bio-Imaging Network). MRK’s salary is supported by the Fonds de Recherche Québec – Santé. MRK is a member of the Guidelines Subcommittee of the American Academy of Neurology.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1. GRRAS Checklist. Table S2. Preliminary ROB Tool List. Table S3. Information for frequency articles. Table S4. Information for exposure articles. Table S5. Inter-rater reliability per ROB tool. Table S6. Inter-rater reliability per ROB tool category.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Kalaycioglu, I., Rioux, B., Briard, J.N. et al. Inter-rater reliability of risk of bias tools for non-randomized studies. Syst Rev 12 , 227 (2023). https://doi.org/10.1186/s13643-023-02389-w


Received : 14 November 2022

Accepted : 12 November 2023

Published : 07 December 2023

DOI : https://doi.org/10.1186/s13643-023-02389-w


Keywords: ROB assessments; Systematic reviews



PLOS ONE

A Reliability-Generalization Study of Journal Peer Reviews: A Multilevel Meta-Analysis of Inter-Rater Reliability and Its Determinants

Lutz Bornmann, Rüdiger Mutz, Hans-Dieter Daniel


* E-mail: [email protected]

Conceived and designed the experiments: LB. Performed the experiments: RM. Analyzed the data: RM. Wrote the paper: LB HDD.

Received 2010 Apr 15; Accepted 2010 Nov 24; Collection date 2010.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.

Background

This paper presents the first meta-analysis of the inter-rater reliability (IRR) of journal peer reviews. IRR is defined as the extent to which two or more independent reviews of the same scientific document agree.

Methodology/Principal Findings

Altogether, 70 reliability coefficients (Cohen's Kappa, intra-class correlation [ICC], and Pearson product-moment correlation [r]) from 48 studies were taken into account in the meta-analysis. The studies were based on a total of 19,443 manuscripts; on average, each study had a sample size of 311 manuscripts (minimum: 28, maximum: 1983). The results of the meta-analysis confirmed the findings of the narrative literature reviews published to date: the level of IRR (mean ICC/r² = .34, mean Cohen's Kappa = .17) was low. To explain the study-to-study variation of the IRR coefficients, meta-regression analyses were calculated using seven covariates. Two covariates emerged as statistically significant in the meta-regression analyses required to reach approximate homogeneity of the intra-class correlations: first, the more manuscripts a study is based on, the smaller the reported IRR coefficients are; second, if information on the rating system for reviewers was reported in a study, this was associated with a smaller IRR coefficient than if the information was not conveyed.

Conclusions/Significance

Studies that report a high level of IRR are to be considered less credible than those with a low level of IRR. According to our meta-analysis the IRR of peer assessments is quite limited and needs improvement (e.g., reader system).

Introduction

Science rests on journal peer review [1] . As stated in a British Academy report, “the essential principle of peer review is simple to state: it is that judgements about the worth or value of a piece of research should be made by those with demonstrated competence to make such a judgement … With publications, an author submits a paper to a journal … and peers are asked to offer a judgement as to whether it should be published. A decision is then taken, in the light of peer review, on publication” [2] (p. 2). Quality control undertaken by peers in the traditional peer review of manuscripts for scientific journals is an essential part in most scientific disciplines to reach valid and reliable knowledge [3] .

According to Marsh, Bond, and Jayasinghe [4] , the most important weakness of the peer review process is that the ratings given to the same submission by different reviewers typically differ. This results in a lack of inter-rater reliability (IRR). Cicchetti [5] defines IRR as “the extent to which two or more independent reviews of the same scientific document agree” (p. 120). All overviews of the literature on the reliability of peer reviews published so far come to the same conclusion: There is a low level of IRR [5] , [6] , [7] , [8] . However, these reviews describe the existing literature using the narrative technique, without attempting any quantitative synthesis of study results. From the viewpoint of quantitative social scientists, narrative reviews are not very precise in their descriptions of study results [9] . The term meta-analysis refers to “the statistical analysis of a large collection of analytical results from individual studies for the purpose of integrating the findings” [10] (p. 3). Marsh, Jayasinghe, and Bond [11] note the relevance of meta-analysis to synthesizing results of peer review research. In peer review research, previously published meta-analyses investigated only gender differences in the selection process of grant proposals [12] , [13] .

In this study, we test whether the result of the narrative techniques used in the reviews – that there is a generally low level of IRR in peer reviews – can be confirmed using the quantitative technique of meta-analysis. Additionally, we examine how the study-to-study variation of the reported reliability coefficients can be explained by covariates. What are the determinants of a high or low level of IRR [7] ?

Materials and Methods

Literature search

We performed a systematic search of publications of all document types (journal articles, monographs, collected works, etc.). In a first step, we located several studies that investigated the reliability of journal peer reviews using the reference lists provided by narrative overviews of research on this topic [5] , [6] , [7] , [8] and using tables of contents of special issues of journals publishing research papers on journal peer review (e.g., Journal of the American Medical Association ). In a second step, to obtain keywords for searching computerized databases, we prepared a bibliogram [14] for the studies located in the first step. The bibliogram ranks by frequency the words included in the abstracts of the studies located. Words at the top of the ranking list (e.g., peer review, reliability, and agreement) were used for searches in computerized literature databases (e.g., Web of Science, Scopus, IngentaConnect, PubMed, PsycINFO, ERIC) and Internet search engines (e.g., Google). In a third step of our literature search, we located all of the citing publications for a series of articles (found in the first and second steps) for which there are a fairly large number of citations in Web of Science.

The search for publications identified 84 studies published between 1966 and 2008. Fifty-two out of the 84 studies reported all information required for a meta-analysis: reliability coefficients and number of manuscripts. Nearly all of the studies provided the following quantitative IRR coefficients: Cohen's Kappa, intra-class correlation (ICC), and Pearson product-moment correlation (r). If different coefficients were reported for the same sample in one single study, ICCs were included in the meta-analyses (n = 35). The ICC measures inter-rater reliability and inter-rater agreement of single reviewers [15] . An ICC is high, if reviewers absolutely agree in their ratings of the same manuscript (absolute consensus) and rate different manuscripts quite differently (consistency). With a high ICC, the average rating of reviewers across all manuscripts in the sample can be accurately inferred from the individual ratings of reviewers for a manuscript. If there were no ICCs available (n = 35), r (n = 9) or Cohen's Kappa (n = 26) was used. Of the 52 studies, 4 could not be included because they reported neither ICC nor r nor Cohen's Kappa. In the end, we had 48 studies [16] – [65] (two studies reported their findings in two papers). As some of the studies reported more than one reliability coefficient for various journals and different cohorts of submissions, we had 70 reliability coefficients for the analyses (on average 1.5 coefficients per study). The studies included were based on a total of 19,443 manuscripts. On average, each study had a sample size of 311 manuscripts; the average sample size per study ranged between 28 and 1983 manuscripts (some studies were based on more than one sample).
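The rule for choosing one coefficient per sample (ICC preferred, then r, then Cohen's Kappa) can be written down as a small helper; the dictionary format below is invented for illustration and is not the authors' data structure.

```python
def pick_coefficient(reported):
    """Prefer the ICC, fall back to Pearson's r, then to Cohen's Kappa;
    return None if a study sample reports none of the three."""
    for name in ("icc", "r", "kappa"):
        if reported.get(name) is not None:
            return name, reported[name]
    return None

print(pick_coefficient({"kappa": 0.17, "icc": 0.34}))  # ('icc', 0.34)
```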

Statistical Procedure

Reliability generalization studies were originally introduced by Vacha-Haase [66] to summarize the score reliabilities across studies while searching for the sources of variability in studies' reliabilities. In our study we focus on the inter-rater reliabilities of journal peer reviews instead of score reliabilities. The technique involves pooling the reported IRR estimates and applying meta-analytic techniques to sum up commonalities and differences across studies [67]. There are two ways to conceptualize this summarization: fixed effects models and random effects models. Following Hedges [68] and Hedges and Vevea [69], the fixed effects model implies that the IRR in the population is assumed to be the same for all studies included in the meta-analysis (homogeneous case). Therefore, the only reason the IRR estimates vary between studies is sampling error, that is, the error in estimating the reliability. The theoretically defined standard error of the IRR coefficient indicates the amount of sampling error. The standard error, however, depends strongly on the sample size: the larger the sample size of a study, the lower the standard error of the reliability coefficient, and the more informative this study is for the estimation of the overall true reliability. Therefore, in summing up the reliabilities across studies to a mean value, studies with large sample sizes are weighted more heavily (with the inverse of the sampling variance as weight) than studies with small sample sizes.

As opposed to fixed effects models, the objective of random effects models is not to estimate a fixed reliability coefficient but to estimate the average of a distribution of reliabilities. Random effects models assume that the population effect sizes themselves vary randomly from study to study and that the true inter-rater reliabilities are sampled from a universe of possible reliabilities (“super-population”).

Whereas fixed effects models only allow generalizations about the studies that are included in the meta-analysis, in random effects models the studies are assumed to be a sample of all possible studies that could be done on a given topic, about which the results can be generalized [70]. From a statistical point of view, the main difference between fixed effects and random effects models lies in the calculation of the standard errors associated with the combined effect size. Fixed effects models only use within-study variability to estimate the standard errors. In random effects models, two sources of error variability are taken into account: within-study variability and between-study variability. Within the framework of random effects models, it can be tested whether the between-study variability deviates statistically significantly from zero, that is, whether a fixed effects model would be sufficient to fit the data (Q test).
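The fixed-effect and random-effects pooling described here can be sketched in a few lines. The function below implements generic inverse-variance pooling with the DerSimonian-Laird estimate of the between-study variance and Cochran's Q test; it ignores the additional sample-within-study level that the authors modelled (see below), and the example inputs are invented Fisher-Z values and variances.

```python
import numpy as np
from scipy.stats import chi2

def pool_effects(effects, variances):
    """Fixed-effect and DerSimonian-Laird random-effects pooling of
    study-level effect sizes with known sampling variances."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)

    w = 1.0 / variances                            # inverse-variance weights
    fixed_mean = np.sum(w * effects) / np.sum(w)

    q = np.sum(w * (effects - fixed_mean) ** 2)    # Cochran's Q
    df = len(effects) - 1
    p_q = 1 - chi2.cdf(q, df)

    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                  # between-study variance

    w_rand = 1.0 / (variances + tau2)
    random_mean = np.sum(w_rand * effects) / np.sum(w_rand)
    return {"fixed": fixed_mean, "random": random_mean,
            "tau2": tau2, "Q": q, "p_Q": p_q}

# Invented Fisher-Z transformed reliabilities and their sampling variances
print(pool_effects([0.32, 0.55, 0.21, 0.48], [0.010, 0.025, 0.008, 0.040]))
```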

Multilevel models are an improvement over fixed and random effects models, as they allow simultaneous estimation of the overall reliability and the between-study variability and do not assume the independence of the effect sizes or correlations. If a single study reports results from different samples, these results might be more similar to one another than results reported by different studies. Statistically speaking, the different reliability coefficients reported by a single study are not independent. This lack of independence may distort the statistical analyses, particularly the estimation of standard errors [71], because the effective sample size decreases with increasing similarity among the units of analysis (i.e., the samples of a single study). Multilevel models take into account the hierarchical structure of the data and are therefore able to deal with the dependence problem by including the different samples of a single study as an additional level of analysis. With respect to reliability generalization studies, Beretvas and Pastor [67] suggested a three-level model, which we used in this paper as follows: first level: manuscript, second level: sample, third level: study. Whereas the variability of the reliability coefficients between different samples within single studies (level 2) and the variability between studies (level 3) are estimated by the multilevel model, the within-study variability (the standard error, level 1) must be calculated for each study from the standard error of the reliability coefficient and is entered into the multilevel analysis as known.

In this study we used a multilevel model (especially a three-level model) suggested by several researchers, including DerSimonian and Laird [72] , [73] , DerSimonian and Kacker [74] , Goldstein, Yang, Omar, Turner, and Thompson [75] , van den Noortgate and Onghena [76] , Beretvas and Pastor [67] , and van Houwelingen, Arends, and Stijnen [77] .

When there is a high level of between-study variation (study heterogeneity), it is important to look for explanatory variables (covariates) to explain this variation. As Egger, Ebrahim, and Smith [78] argued, "the thorough consideration of heterogeneity between observational study results, in particular of possible sources of confounding and bias, will generally provide more insights than the mechanistic calculation of an overall measure of effect" (p. 3). To explain the study heterogeneity of the inter-rater reliabilities in this study, meta-regression analyses were calculated. Whereas ordinary linear regression uses individual data from a single study, meta-regression uses weighted data from multiple studies, where each study provides a data point in the regression analysis. To include categorical covariates (e.g., disciplines) in the meta-regression, they were dummy-coded. To avoid an excessive reduction of sample size and to preserve the power of the statistical tests, missing values in categorical covariates were coded as an additional category, called "unknown." In total, 32 studies reporting 44 reliability coefficients (ICC or r) could be included in the meta-regression analyses. Following the recommendations of Baker, White, Cappelleri, Kluger, and Coleman [79], we thus had a sufficient number of studies to run a linear meta-regression with two or more covariates.
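As an illustration of the meta-regression step, the following sketch fits a weighted least squares regression of Fisher-Z transformed reliabilities on dummy-coded covariates, using inverse-variance weights. This approximates a fixed-effect meta-regression and ignores the multilevel structure the authors modelled; all column names and values are invented.

```python
import pandas as pd
import statsmodels.api as sm

# Invented study-level data: one row per reliability coefficient
df = pd.DataFrame({
    "z":        [0.32, 0.55, 0.21, 0.48, 0.39, 0.27, 0.61, 0.18],   # Fisher-Z values
    "variance": [0.010, 0.025, 0.008, 0.040, 0.015, 0.006, 0.050, 0.009],
    "n_manuscripts_100": [3.1, 0.4, 9.8, 0.6, 1.2, 12.0, 0.3, 5.0],  # n / 100
    "is_icc":   [1, 1, 0, 1, 0, 1, 0, 1],     # 1 = ICC, 0 = Pearson r
    "medicine": [1, 0, 0, 1, 0, 1, 0, 0],     # example discipline dummy
})

X = sm.add_constant(df[["n_manuscripts_100", "is_icc", "medicine"]])
model = sm.WLS(df["z"], X, weights=1.0 / df["variance"]).fit()
print(model.params)
```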

Proposed Covariates

The following covariates were included in the meta-regression analysis:

(1) Number of manuscripts

The first covariate is the number of manuscripts on which the reliability coefficients of the individual studies are based. The number of manuscripts was divided by 100 so that the resulting regression parameter is not too small; this both preserves the accuracy of the estimation and eases the interpretation of the results. This covariate tests for the influence of what is commonly called “publication bias” or the “file drawer problem” [80] (p. 150): “Publication bias is the tendency on the parts of investigators, reviewers, and editors to submit or accept manuscripts for publication based on the direction or strength of the study findings” [81] (p. 1385). Hopewell, Loudon, Clarke, Oxman, and Dickersin [82] found, for example, that clinical trials with positive or statistically significant findings are published more often, and more quickly, than trials with negative or non-significant findings. It is well known that even very low correlations or, in our case, IRR coefficients are statistically significant if the sample size of a study is large enough and, vice versa, that high IRR coefficients are statistically significant even if the sample size of a study is small. Therefore, Hox [80] recommended including the sample sizes of the studies as a covariate in a multilevel meta-analysis.

(2) Method of reliability calculation

According to an analysis by Cicchetti and Conn [24], inter-rater reliabilities vary considerably depending on the method with which they were calculated in the empirical studies. For this reason, the method used for the calculation of the IRR in a study (ICC or r) was considered as a second covariate in the meta-regression analyses, since higher coefficients are to be expected with the one method than with the other. Only when the ratings of different reviewers have identical means and variances are r and the one-way ICC identical [83]; otherwise, r considerably overestimates the IRR [36]. To include ICC and r in a single analysis, we followed Thompson and Vacha-Haase [84] and used the square root of the ICC as a kind of correlation coefficient. Fisher Z-transformed correlations and the corresponding standard errors were used instead of the correlations (square roots of the reliabilities), because correlations are bounded and their sampling distribution is skewed; the Fisher Z-transformation yields an approximately normally distributed quantity on a continuous, unbounded scale.
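The transformation chain described above can be sketched as follows (Python; the ICC value and number of manuscripts are invented, and the sampling variance 1/(n − 3) is the standard approximation for a Fisher Z-transformed Pearson correlation, assumed here to be applied to √ICC as well).

```python
import numpy as np

icc, n_manuscripts = 0.30, 150       # hypothetical reliability and sample size
r = np.sqrt(icc)                     # treat the ICC as a squared correlation
z = np.arctanh(r)                    # Fisher Z-transformation
var_z = 1.0 / (n_manuscripts - 3)    # approximate sampling variance on the Z scale

# ...the meta-analysis is then carried out on the Z scale...

r_back = np.tanh(z)                  # back-transform a (pooled) Z value
icc_back = r_back ** 2               # square again to report a reliability
```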

(3) Discipline

As a third covariate, the scientific discipline was included in the meta-regression analysis: (1) economics/law, (2) natural sciences, (3) medical sciences, or (4) social sciences. For Weller [7], “some discipline differences were apparent in reviewer agreement studies. Many of the studies were conducted in psychology and sociology and to some degree medicine, where the subject matter is human behavior and human health. These areas are less precise and absolute than other sciences and, therefore, it might be expected that there are more discussions of reviewer agreement” (p. 200).

(4) Object of appraisal

The fourth covariate is based on the object of appraisal. According to Weller [7], lower levels of inter-rater reliability are to be expected for abstracts, which are typically submitted to conferences or meetings, than for papers (such as research articles or short communications), which are normally submitted to journals: “Abstracts by their very nature are an abbreviated representation of a study. Reviewers of abstracts are asked to make a recommendation to accept or reject a work with little knowledge of the entire endeavor. One would expect studies of reviewer agreement of abstracts to show a relatively high level of reviewer disagreement” (p. 183).

(5) Cohort

The covariate cohort captures the period on which the data of a study are based. In general, a study investigated the IRR for manuscripts submitted to a journal within a certain period of time (e.g., one year). For the meta-analysis, we classified these periods into four categories of time (e.g., 1980–1989). The covariate cohort thus tests whether the level of IRR has changed since 1950.

(6) Blinding

“In an attempt to eliminate some of the drawbacks of the peer review system, many journals resort to a double-blind review system, keeping the names and affiliations of both authors and referees confidential” [85] (p. 294). In a single-blind system, by contrast, the reviewer knows the identity of the author, while the reviewer remains anonymous to the author. One of the drawbacks meant to be eliminated by the double-blind system is the low level of IRR: if reviewers' ratings are no longer influenced by potential sources of bias (such as the author's gender or affiliation), a higher level of agreement between reviewers is to be expected. We tested the extent to which the type of blinding actually influences the level of IRR.

(7) Rating system

Finally, the type of rating system used by the reviewers in the journal peer review process analyzed in a reliability study was included as a covariate. This tests whether different rating systems (metric or categorical) are connected to different levels of IRR. Strayhorn, McDermott, and Tanguay [60], for instance, found that reliability increased when the number of rating-scale points for the questions about a manuscript was increased. In some of the studies included here, no information about the rating system could be found (coded as “unknown” for the regression analysis). In a narrative review of studies on the reliability of peer review, Cicchetti [5] stated that information about the rating system is basic information for an empirical research paper and criticized studies that did not provide it. Whether or not the rating system is reported can therefore itself provide information about the quality of a study.

All analyses were performed with PROC MIXED in SAS, version 9.1.3 [86], using the SAS syntax suggested by van Houwelingen, Arends, and Stijnen [77].

Comparison of Average Effects

Using the meta-analysis methods described above, three analyses were calculated on the basis of the r and ICC coefficients (see Table 1, part a). The different meta-analysis methods estimate mean correlations, which were squared again to obtain reliability coefficients comparable to the ICC. A very low average reliability (∼.23) with a 95% confidence interval of ∼.22 to ∼.25 was obtained for the fixed effects model. The random effects model yielded a slightly higher average reliability (∼.34) with a 95% confidence interval of ∼.29 to ∼.39. An ICC of .23 indicates that only 23% of the variability in the reviewers' ratings of a manuscript can be attributed to agreement between the reviewers; the remaining 77% reflects disagreement among the reviewers' ratings.

Table 1. Overview of mean reliabilities with confidence interval.

Notes: To obtain the reliability estimates (ICC/r²) shown in this table, correlations (r) were squared. N = number of coefficients included. Levels = number of levels in the meta-analysis.

One further model was calculated on the basis of Cohen's Kappa (see Table 1, part b). The mean reliability amounts to .17, with a confidence interval ranging from .13 to .21. According to the guidelines for the interpretation of Kappa by Landis and Koch [87], this mean reliability indicates only slight IRR. A Cohen's Kappa of .17 means that the reviewers agreed in their evaluations for 17% more of the manuscripts than would have been predicted on the basis of chance alone [88].
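The chance-corrected logic behind this interpretation can be sketched as follows (Python; the 2×2 table of accept/reject decisions is invented and not taken from any of the included studies).

```python
import numpy as np

# Hypothetical cross-tabulation of two reviewers' accept/reject decisions
table = np.array([[20, 10],    # reviewer A: accept; reviewer B: accept / reject
                  [15, 55]])   # reviewer A: reject; reviewer B: accept / reject
n = table.sum()
p_observed = np.trace(table) / n                           # observed proportion of agreement
p_chance = (table.sum(axis=1) @ table.sum(axis=0)) / n**2  # agreement expected by chance
kappa = (p_observed - p_chance) / (1 - p_chance)           # agreement beyond chance
print(round(kappa, 2))
```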

The forest plot (Figure 1) shows the predicted inter-rater reliability for each study and the individual 95% confidence interval for each reliability coefficient (r or ICC), based on the three-level model [77]. The predicted coefficients are empirical Bayes estimates [80], which take the different sampling errors of the reliability coefficients into account. The smaller the sampling error of a study, and thus the larger its sample size (number of manuscripts), the more closely the predicted coefficient follows the reliability coefficient actually reported by that study. The larger the sampling error, and thus the smaller the sample size, the more the predicted coefficient is determined by the mean value across all reliability coefficients. This means that the smaller the sample sizes of the studies included in the meta-analysis are, the more the empirical Bayes estimates are shrunken towards the overall mean β0. As Figure 1 shows, there was a positive correlation between the size of the IRR coefficient and the width of its confidence interval: the smaller the coefficient, the smaller the confidence interval. Furthermore, the coefficients show high variability; most lie outside the 95% confidence interval of the mean value (shaded grey). The test of homogeneity (Q test) was statistically significant (Q(44) = 409.99, p<.05), i.e., the study-to-study variation of the inter-rater reliabilities was considerably higher than would be expected on the basis of random sampling alone (fixed effects model). To explain this study-to-study variation by covariates, meta-regression analyses were calculated.
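The shrinkage behavior described above can be sketched with the standard two-level formula (Python; the variance components, overall mean, and study values are invented, and the actual analysis used the three-level model, so this only illustrates the principle).

```python
import numpy as np

def empirical_bayes(z, v, tau2, beta0):
    """z: observed coefficients; v: sampling variances; tau2: between-study variance;
    beta0: overall mean. Imprecise studies are pulled towards beta0."""
    z, v = np.asarray(z, float), np.asarray(v, float)
    weight = tau2 / (tau2 + v)            # close to 1 for large, precise studies
    return weight * z + (1 - weight) * beta0

# The same observed coefficient is shrunken strongly for a small study (large
# sampling variance) and hardly at all for a large study (small variance):
print(empirical_bayes([0.90, 0.90], [0.20, 0.01], tau2=0.03, beta0=0.35))
```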

Figure 1. Forest plot of the predicted inter-rater reliability (Bayes estimate) for each study (random effects model without covariates) with a 95% confidence interval (as bars) for each reliability coefficient, sorted in ascending order. The 95% confidence interval of the mean value (vertical line) is shaded grey. Predicted values with the same author and year but different letters (e.g., Herzog 2005a and Herzog 2005b) belong to the same study.

Meta-Regression Analyses

Table 2 provides a description of the covariates included in the meta-regression analyses. Table 3 shows the results of the multilevel meta-analyses. These analyses are based on the studies that reported an ICC or r (n = 44). For the studies reporting a Kappa coefficient (n = 26), no meta-regression analyses could be performed, owing to the lack of an established statistical approach and the comparatively small number of studies.

Table 2. Description of the covariates included in the meta-regression analyses (n = 32 studies with 44 coefficients).

Note: RC = reference category in meta-regression analysis. Unknown = this information is missing in a study.

*Rating systems are classified as categorical if they have nine or fewer categories; systems with more than nine categories are classified as metric [110].

Table 3. Multilevel meta-analyses of the metric inter-rater reliabilities (Fisher Z-transformed √r_tt or r).

Note: For each categorical variable, one category was chosen as the reference category (RC; e.g., RC = social sciences for the categorical variable discipline). For categorical variables, the effect of each predictor (a dummy variable representing one of the categories) is a regression coefficient (Coeff) that should be interpreted relative to its standard error (SE) and to the reference category. The variance components for level 1 are derived from the data, whereas the variance components at level 2 and level 3 indicate the amount of variance attributable to differences between studies (level 3) and to differences between single reliability coefficients nested within studies (level 2). The loglikelihood test provided by SAS PROC MIXED (−2LL) can be used to compare different models, as can the Bayesian Information Criterion (BIC); the smaller the BIC, the better the model.

We carried out a series of meta-regression analyses in which we explored the effects of each covariate in isolation and in combination with other covariates. The focus was particularly on tests of the a priori predictions about the effects of the covariates (e.g., publication bias). As Table 3 shows, a total of nine different models were calculated: model 0 is the null model; in models 1 to 7, the IRR was regressed on a single covariate; in model 8, those covariates were included that had emerged as statistically significant in models 1 to 7.

The loglikelihood test provided by SAS PROC MIXED (−2LL) can be used to compare different models, as can the Bayesian Information Criterion (BIC); the smaller the BIC, the better the model. Compared to the null model, only models 1, 7, and 8 exhibited significant differences in the loglikelihood and a smaller BIC, together with statistically significant regression coefficients. The covariates method, discipline, object of appraisal, cohort, and blinding were accordingly not significantly related to the study-to-study variation (see models 2, 3, 4, 5, and 6).
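The model comparison described here can be sketched as follows (Python; the −2LL and BIC values are invented, and REML/ML subtleties of mixed-model likelihood comparisons are ignored in this illustration).

```python
from scipy import stats

m2ll_null, m2ll_model1 = 60.0, 50.0     # hypothetical -2 log-likelihoods of nested models
delta = m2ll_null - m2ll_model1         # deviance difference, approximately chi-square
p_value = stats.chi2.sf(delta, df=1)    # df = number of added covariate parameters

bic_null, bic_model1 = 66.0, 58.0       # hypothetical BIC values
prefer_model1 = bic_model1 < bic_null   # the smaller BIC indicates the better model
print(p_value, prefer_model1)
```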

The statistically significant regression coefficient of −.03 in model 1 can be interpreted as follows: the more manuscripts (divided by 100) a study is based on, the smaller the reported reliability coefficients. If the number of manuscripts increases, for instance, from 100 to 500, the predicted reliability decreases from .40 to .34. By including this covariate, the study-to-study random effects variance declined from .03 (model 0) to .016 (model 1), i.e., 46.6% of the variance between studies could be explained by the number of manuscripts. This result indicates a distinctly marked publication bias in the publication of studies on the reliability of peer review. Even when the statistical significance level was adjusted by a Bonferroni correction (α divided by the number of single tests), the regression parameter remained statistically significant. There is much evidence in the meta-analysis literature that studies reporting relatively high correlations or effect sizes are more likely to be published than studies reporting low correlations or effect sizes [89]. It seems that low correlations or effect sizes are only published by journals if the results are backed by a very large sample size, whereas high correlations or effect sizes are published even if the sample size of the study is small. The negative relationship found in our meta-analysis between the number of manuscripts and the reliability coefficient is consistent with this publication bias hypothesis.
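The percentage of explained between-study variance quoted above is the proportional reduction of the variance component relative to the null model, as the following minimal sketch shows (the reported 46.6% presumably rests on unrounded variance components, so the rounded inputs used here give a slightly different value).

```python
tau2_null = 0.03      # study-to-study variance in model 0 (null model)
tau2_model1 = 0.016   # study-to-study variance in model 1 (number of manuscripts)

explained = (tau2_null - tau2_model1) / tau2_null   # pseudo-R^2 for the covariate
print(f"{explained:.1%}")                           # ~46.7% with these rounded inputs
```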

A further significant covariate is the rating system. Even when the statistical significance level is adjusted by a Bonferroni correction, the regression parameter for the categorical rating system remains statistically significant. Decisive was whether or not the rating system was reported in a study: if this information was provided, the reported reliability coefficients were smaller (regression coefficients in Table 3: −.40, −.33) than if it was not. By considering this covariate, the study-to-study random effects variance decreased from .03 (model 0) to .017 (model 7), i.e., 43.3% of the variance between studies could be explained. Since, following Cicchetti [5], the mention or non-mention of information about the rating system can be taken as an indicator of the quality of a study (see above), the IRR reported by the individual studies varies accordingly with study quality.

When the statistically significant covariates of models 1 and 7, number of manuscripts and rating system, were included together in a multiple meta-regression analysis, the study-to-study variance fell from .03 (model 0) to .0036 (model 8), i.e., 86.6% of the variance could be explained by the two variables. As the variance component was no longer statistically significant in this model, approximate homogeneity of the intra-class correlations was reached, i.e., the residuals of the meta-regression analysis varied almost exclusively due to sampling error (the desired final result of a meta-analysis).

Meta-analysis tests the replicability and generalizability of results, the hallmark of good science. In this study we present the first meta-analysis of the reliability of journal peer reviews. The results of our analyses confirm the findings of the narrative reviews: a low level of IRR of .34 for ICC and r (random effects model) and .17 for Cohen's Kappa. Even when we used different models for calculating the meta-analyses, we arrived at similar results. With respect to Cohen's Kappa, a meta-analysis of studies examining the IRR of the standard practice of peer assessments of quality of care published by Goldman [90] found a similar result: the weighted mean Kappa of 21 independent findings from 13 studies was .31. Based on this result, Goldman [90] considered the IRR of peer assessments to be quite limited and in need of improvement.

Neff and Olden [91] concluded in a study on peer review that there are considerable benefits to employing three or four reviewers instead of just two in order to minimize decision errors over manuscripts. Marsh, Jayasinghe, and Bond [11] and Jayasinghe, Marsh, and Bond [92] proposed a reader trial approach to peer review to increase IRR: a small number of expert readers is chosen on the basis of research expertise in a certain subdiscipline of a subject, and the level of expertise of these readers should be higher than that of the broader cross-section of reviewers in the traditional review system. “The same reader reviewed all the proposal in their subdisciplinary area, rated the quality of both the proposal and the researcher (or team of researcher), provided written comments, and were paid a small emolument” [92] (p. 597). Marsh, Jayasinghe, and Bond [11] found that single-rater reliabilities were much higher for the reader system than for the traditional review approach; with 4.3 readers on average per proposal, the IRR of the researcher ratings reached an acceptable value of .88 for the reader system.

Although a high level of IRR is generally seen as desirable, some researchers, such as Bailar [93], view too much agreement as detrimental to the peer review process: “Too much agreement is in fact a sign that the review process is not working well, that reviewers are not properly selected for diversity, and that some are redundant” (p. 138). Although selecting reviewers according to the principle of complementarity (for example, choosing a generalist and a specialist) will lower IRR, the validity of the process can gain, according to Langfeldt [94]: “Low inter-reviewer agreement on a peer panel is no indication of low validity or low legitimacy of the assessments. In fact, it may indicate that the panel is highly competent because it represents a wide sample of the various views on what is good and valuable research” (p. 821).

To explain the study-to-study variation of the inter-rater reliabilities, we calculated meta-regression analyses for the metric reliability coefficients. It emerged that neither the type of blinding nor the discipline corresponded to the level of IRR. According to our results, an effect of double-blinding, which is already used by many journals as a measure against biases in refereeing [95], on reviewer agreement can thus be excluded. This finding may reflect the fact that such blinding is difficult to accomplish: reviewers have been found to identify the authors of approximately a quarter to a third of the manuscripts [96]. Every text contains clues as to the author's identity (e.g., self-citations), and in many cases long-standing researchers in a particular field recognize the author on the basis of these clues [97], [98], [99]. Falagas, Zouglakis, and Kavvadia [100] showed that “half the abstracts we reviewed provided information about the origin of the study, despite the fact that instructions to the authors for the preparation of abstracts informed authors that the submissions would undergo masked peer review.”

As we mentioned in the section “Material and Methods” with regard to discipline-specific reliabilities, it has been suggested that peer review in the natural and physical sciences should be more reliable because of shared theoretical perspectives, in contrast to the social sciences and humanities. In fact, we did not find any effect of discipline, which contradicts this “theoretical paradigms” hypothesis. Our results are in accordance with Cole's statement [101] that a low level of agreement among reviewers reflects the lack of consensus that is prevalent in all scientific disciplines at the “research frontier.” According to Cole [101], scientific work occurring at the frontiers of research usually cannot be assessed reliably by anyone.

Two covariates emerged as significant in the analyses and led to approximate homogeneity of the intra-class correlations. First, the number of manuscripts on which a study is based was statistically significant. We therefore assume a distinctly marked publication bias for studies on IRR: with a small sample, results are published only if the reported reliability coefficients are high; if the reported reliability coefficients are low, a study has to be based on a large number of manuscripts to justify publication. Figure 1 also shows this relationship distinctly: the larger the confidence interval of a reliability coefficient, the higher the coefficient tends to be. This follows from the fact that high reliability coefficients are more likely to be reported by studies with small sample sizes, which are associated with large standard errors and wide confidence intervals of the estimates.

Apart from the number of manuscripts on which a study is based, the covariate rating system was also statistically significant. Studies that provide no information on the rating system report higher IRR coefficients than studies that describe the rating system in detail. Failure to mention the rating system must be viewed as an indication of low study quality.

The main conclusion of our meta-analysis is that studies reporting a high level of IRR are to be considered less credible than those reporting a low level of IRR, because high IRR coefficients are mostly based on smaller sample sizes than low IRR coefficients, which mostly rest on very large sample sizes. In contrast to narrative literature reviews, quantitative meta-analysis weights the study results according to their standard errors to obtain unbiased estimates of the mean IRR; meta-analysis should therefore be preferred over narrative reviews. However, future primary studies on the IRR of peer review that could be included in later meta-analyses should be based on large sample sizes and should describe the evaluation sheet/rating system used by the reviewers in detail.

Very few studies have investigated reviewer agreement with the purpose of identifying the actual reasons behind reviewer disagreement, e.g., by carrying out comparative content analyses of reviewers' comment sheets [102], [103]. LaFollette [104], for example, noted the scarcity of research on questions such as how reviewers apply standards and the specific criteria established for making a decision on a manuscript. In-depth studies addressing these issues might prove to be fruitful avenues for future investigation [7]. Such research should dedicate itself primarily to the dislocational component in reviewers' judgments as well as to differences in strictness or leniency between reviewers [105], [106].

Studies included in the meta-analyses are marked with an asterisk.

Acknowledgments

The authors wish to express their gratitude to three reviewers for their helpful comments.

Competing Interests: The authors have declared that no competing interests exist.

Funding: The authors have no support or funding to report.

  • 1. Ziman J. Real science. What it is, and what it means. Cambridge, UK: Cambridge University Press; 2000.
  • 2. British Academy. Peer Review: the challenges for the humanities and social sciences. London, UK: The British Academy; 2007.
  • 3. Hemlin S, Rasmussen SB. The shift in academic quality control. Science Technology & Human Values. 2006;31:173–198.
  • 4. Marsh HW, Bond NW, Jayasinghe UW. Peer review process: assessments by applicant-nominated referees are biased, inflated, unreliable and invalid. Australian Psychologist. 2007;42:33–38.
  • 5. Cicchetti DV. The reliability of peer review for manuscript and grant submissions: a cross-disciplinary investigation. Behavioral and Brain Sciences. 1991;14:119–135.
  • 6. Lindsey D. Assessing precision in the manuscript review process - a little better than a dice roll. Scientometrics. 1988;14:75–82.
  • 7. Weller AC. Editorial peer review: its strengths and weaknesses. Medford, NJ, USA: Information Today, Inc; 2002.
  • 8. Campanario JM. Peer review for journals as it stands today - part 1. Science Communication. 1998;19:181–211.
  • 9. Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Boston, MA, USA: Houghton Mifflin Company; 2002.
  • 10. Glass GV. Primary, secondary, and meta-analysis. Review of Research in Education. 1976;5:351–379.
  • 11. Marsh HW, Jayasinghe UW, Bond NW. Improving the peer-review process for grant applications - reliability, validity, bias, and generalizability. American Psychologist. 2008;63:160–168. doi: 10.1037/0003-066X.63.3.160.
  • 12. Bornmann L, Mutz R, Daniel H-D. Gender differences in grant peer review: a meta-analysis. Journal of Informetrics. 2007;1:226–238.
  • 13. Marsh HW, Bornmann L, Mutz R, Daniel HD, O'Mara A. Gender effects in the peer reviews of grant proposals: a comprehensive meta-analysis comparing traditional and multilevel approaches. Review of Educational Research. 2009;79:1290–1326.
  • 14. White HD. On extending informetrics: an opinion paper. In: Ingwersen P, Larsen B, editors. Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics. Stockholm, Sweden: Karolinska University Press; 2005. pp. 442–449.
  • 15. LeBreton JM, Senter JL. Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods. 2008;11:815–852.
  • 16. *Hargens LL, Herting JR. A new approach to referees assessments of manuscripts. Social Science Research. 1990;19:1–16.
  • 17. Bakanic V, McPhail C, Simon RJ. The manuscript review and decision-making process. American Sociological Review. 1987;52:631–642.
  • 18. *Beyer JM, Chanove RG, Fox WB. Review process and the fates of manuscripts submitted to AMJ. Academy of Management Journal. 1995;38:1219–1260.
  • 19. *Bhandari M, Templeman D, Tornetta P. Interrater reliability in grading abstracts for the Orthopaedic Trauma Association. Clinical Orthopaedics and Related Research. 2004;(423):217–221. doi: 10.1097/01.blo.0000127584.02606.00.
  • 20. Blackburn JL, Hakel MD. An examination of sources of peer-review bias. Psychological Science. 2006;17:378–382. doi: 10.1111/j.1467-9280.2006.01715.x.
  • 21. *Bohannon RW. Agreement among reviewers. Physical Therapy. 1986;66:1431–1432. doi: 10.1093/ptj/66.9.1431a.
  • 22. *Bornmann L, Daniel H-D. The effectiveness of the peer review process: inter-referee agreement and predictive validity of manuscript refereeing at Angewandte Chemie. Angewandte Chemie-International Edition. 2008;47:7173–7178. doi: 10.1002/anie.200800513.
  • 23. *Callaham ML, Baxt WG, Waeckerie JF, Wears RL. Reliability of editors' subjective quality ratings of peer reviews of manuscripts. Journal of the American Medical Association. 1998;280:229–231. doi: 10.1001/jama.280.3.229.
  • 24. Cicchetti DV, Conn HO. A statistical analysis of reviewer agreement and bias in evaluating medical abstracts. Yale Journal of Biology and Medicine. 1976;49:373–383.
  • 25. *Cicchetti DV, Conn HO. Reviewer evaluation of manuscripts submitted to medical journals. Biometrics. 1978;34:728–728.
  • 26. *Cicchetti DV, Eron LD. The reliability of manuscript reviewing for the Journal of Abnormal Psychology. Proceedings of the American Statistical Association (Social Statistics Section). 1979;22:596–600.
  • 27. *Cicchetti DV. Reliability of reviews for the American Psychologist - a biostatistical assessment of the data. American Psychologist. 1980;35:300–303.
  • 28. *Cohen IT, Patel K. Peer review interrater reliability of scientific abstracts: a study of an anesthesia subspecialty society. Journal of Education in Perioperative Medicine. 2005;7.
  • 29. *Cohen IT, Patel K. Peer review interrater concordance of scientific abstracts: a study of anesthesiology subspecialty and component societies. Anesthesia and Analgesia. 2006;102:1501–1503. doi: 10.1213/01.ane.0000200314.73035.4d.
  • 30. *Conn HO. An experiment in blind program selection. Clinical Research. 1974;22:128–134.
  • 31. *Daniel H-D. An evaluation of the peer-review process at Angewandte Chemie. Angewandte Chemie - International Edition in English. 1993;32:234–238.
  • 32. Glidewell JC. Reflections on thirteen years of editing AJCP. American Journal of Community Psychology. 1988;16:759–770.
  • 33. Gottfredson SD. Evaluating psychological research reports - dimensions, reliability, and correlates of quality judgments. American Psychologist. 1978;33:920–934.
  • 34. Gupta P, Kaur G, Sharma B, Shah D, Choudhury P. What is submitted and what gets accepted in Indian Pediatrics: analysis of submissions, review process, decision making, and criteria for rejection. Indian Pediatrics. 2006;43:479–489.
  • 35. *Hendrick C. Editorial comment. Personality and Social Psychology Bulletin. 1976;2:207–208.
  • 36. *Hendrick C. Editorial comment. Personality and Social Psychology Bulletin. 1977;3:1–2.
  • 37. Herzog HA, Podberscek AL, Docherty A. The reliability of peer review in anthrozoology. Anthrozoos. 2005;18:175–182.
  • 38. Howard L, Wilkinson G. Peer review and editorial decision-making. British Journal of Psychiatry. 1998;173:110–113. doi: 10.1192/bjp.173.2.110.
  • 39. *Justice AC, Berlin JA, Fletcher SW, Fletcher RH, Goodman SN. Do readers and peer reviewers agree on manuscript quality? Journal of the American Medical Association. 1994;272:117–119.
  • 40. Kemp S. Editorial Comment: agreement between reviewers of Journal of Economic Psychology submissions. Journal of Economic Psychology. 2005;26:779–784.
  • 41. *Kemper KJ, McCarthy PL, Cicchetti DV. Improving participation and interrater agreement in scoring ambulatory pediatric association abstracts: how well have we succeeded? Archives of Pediatrics & Adolescent Medicine. 1996;150:380–383. doi: 10.1001/archpedi.1996.02170290046007.
  • 42. *Kirk SA, Franke TM. Agreeing to disagree: a study of the reliability of manuscript reviews. Social Work Research. 1997;21:121–126.
  • 43. *Lempert RO. From the editor. Law and Society Review. 1985;19:529–536.
  • 44. *Linden W, Craig KD, Wen FK. Contributions of reviewer judgements to editorial decision-making for the Canadian Journal of Behavioural Science: 1985–1986. Canadian Journal of Behavioural Science. 1992;24:433–441.
  • 45. Marsh HW, Ball S. Interjudgmental reliability of reviews for the Journal of Educational Psychology. Journal of Educational Psychology. 1981;73:872–880.
  • 46. *Marusic A, Lukic IK, Marusic M, McNamee D, Sharp D, et al. Peer review in a small and a big medical journal: case study of the Croatian Medical Journal and The Lancet. Croatian Medical Journal. 2002;43:286–289.
  • 47. *McReynolds P. Reliability of ratings of research papers. American Psychologist. 1971;26:400–401.
  • 48. *Montgomery AA, Graham A, Evans PH, Fahey T. Inter-rater agreement in the scoring of abstracts submitted to a primary care research conference. BMC Health Services Research. 2002;2. doi: 10.1186/1472-6963-2-8.
  • 49. *Morrow JR, Bray MS, Fulton JE, Thomas JR. Interrater Reliability of 1987–1991 Research Quarterly for Exercise and Sport reviews. Research Quarterly for Exercise and Sport. 1992;63:200–204. doi: 10.1080/02701367.1992.10607582.
  • 50. Munley PH, Sharkin B, Gelso CJ. Reviewer ratings and agreement on manuscripts reviewed for the Journal of Counseling Psychology. Journal of Counseling Psychology. 1988;35:198–202.
  • 51. *Oxman AD, Guyatt GH, Singer J, Goldsmith CH, Hutchison BG, et al. Agreement among reviewers of review articles. Journal of Clinical Epidemiology. 1991;44:91–98. doi: 10.1016/0895-4356(91)90205-n.
  • 52. Petty RE, Fleming MA. The review process at PSPB: correlates of interreviewer agreement and manuscript acceptance. Personality and Social Psychology Bulletin. 1999;25:188–203.
  • 53. *Plug C. The reliability of manuscript evaluation for the South African Journal of Psychology. South African Journal of Psychology. 1993;23:43–48.
  • 54. *Rothwell PM, Martyn CN. Reproducibility of peer review in clinical neuroscience: is agreement between reviewers any greater than would be expected by chance alone? Brain. 2000;123:1964–1969. doi: 10.1093/brain/123.9.1964.
  • 55. *Rubin HR, Redelmeier DA, Wu AW, Steinberg EP. How reliable is peer review of scientific abstracts? Looking back at the 1991 Annual Meeting of the Society of General Internal Medicine. Clinical Research. 1992;40:A604. doi: 10.1007/BF02600092.
  • 56. *Rubin HR, Redelmeier DA, Wu AW, Steinberg EP. How reliable is peer review of scientific abstracts? Looking back at the 1991 Annual Meeting of the Society of General Internal Medicine. Journal of General Internal Medicine. 1993;8:255–258. doi: 10.1007/BF02600092.
  • 57. *Scarr S, Weber BLR. The reliability of reviews for the American Psychologist. American Psychologist. 1978;33:935.
  • 58. *Scott WA. Interreferee agreement on some characteristics of manuscripts submitted to Journal of Personality and Social Psychology. American Psychologist. 1974;29:698–702.
  • 59. *Scott JR, Martin S, Burmeister L. Consistency between reviewers and editors about which papers should be published. Fifth International Congress on Peer Review and Biomedical Publication, September 16–18, 2005, Chicago, Illinois. 2005.
  • 60. Strayhorn J, McDermott JF, Tanguay P. An intervention to improve the reliability of manuscript reviews for the Journal of the American Academy of Child and Adolescent Psychiatry. American Journal of Psychiatry. 1993;150:947–952. doi: 10.1176/ajp.150.6.947.
  • 61. *Timmer A, Sutherland L, Hilsden R. Development and evaluation of a quality score for abstracts. BMC Medical Research Methodology. 2003;3:2. doi: 10.1186/1471-2288-3-2.
  • 62. *van der Steen LPE, Hage JJ, Kon M, Mazzola R. Reliability of a structured method of selecting abstracts for a plastic surgical scientific meeting. Plastic and Reconstructive Surgery. 2003;111:2215–2222. doi: 10.1097/01.PRS.0000061092.88629.82.
  • 63. *Whitehurst GJ. Interrater agreement for reviews for Developmental Review. Developmental Review. 1983;3:73–78.
  • 64. Wood M, Roberts M, Howell B. The reliability of peer reviews of papers on information systems. Journal of Information Science. 2004;30:2–11.
  • 65. *Yadollahie M, Roshanipoor M, Habibzadeh F. The agreement in reports of peer reviews in the Iranian Journal of Medical Sciences. Saudi Medical Journal. 2004;25(Supplement):S44.
  • 66. Vacha-Haase T. Reliability generalization: exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement. 1998;58:6–20.
  • 67. Beretvas SN, Pastor DA. Using mixed-effects models in reliability generalization studies. Educational and Psychological Measurement. 2003;63:75–95.
  • 68. Hedges LV. Fixed effects models. In: Cooper HM, Hedges LV, editors. The handbook of research synthesis. New York, NY, USA: Russell Sage Foundation; 1994. pp. 285–299.
  • 69. Hedges LV, Vevea JL. Fixed and random effects models in meta-analysis. Psychological Methods. 1998;3:486–504.
  • 70. Field AP. Meta-analysis of correlation coefficients: a Monte Carlo comparison of fixed- and random-effects methods. Psychological Methods. 2001;6:161–180. doi: 10.1037/1082-989x.6.2.161.
  • 71. Bateman IJ, Jones AP. Contrasting conventional with multi-level modeling approaches to meta-analysis: expectation consistency in UK woodland recreation values. Land Economics. 2003;79:235–258.
  • 72. DerSimonian R, Laird NM. Evaluating the effect of coaching on SAT scores: a meta-analysis. Harvard Educational Review. 1983;53:1–15.
  • 73. DerSimonian R, Laird NM. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7:177–188. doi: 10.1016/0197-2456(86)90046-2.
  • 74. DerSimonian R, Kacker R. Random-effects model for meta-analysis of clinical trials: an update. Contemporary Clinical Trials. 2007;28:105–144. doi: 10.1016/j.cct.2006.04.004.
  • 75. Goldstein H, Yang M, Omar R, Turner R, Thompson S. Meta-analysis using multilevel models with an application to the study of class size effects. Applied Statistics. 2000;49:399–412.
  • 76. van den Noortgate W, Onghena P. Multilevel meta-analysis: a comparison with traditional meta-analytic procedures. Educational and Psychological Measurement. 2003;63:765–790.
  • 77. van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Statistics in Medicine. 2002;21:589–624. doi: 10.1002/sim.1040.
  • 78. Egger M, Ebrahim S, Smith GD. Where now for meta-analysis? International Journal of Epidemiology. 2002;31:1–5. doi: 10.1093/ije/31.1.1.
  • 79. Baker WL, White CM, Cappelleri JC, Kluger J, Coleman CI. Understanding heterogeneity in meta-analysis: the role of meta-regression. International Journal of Clinical Practice. 2009;63:1426–1434. doi: 10.1111/j.1742-1241.2009.02168.x.
  • 80. Hox JJ. Multilevel analysis. London, UK: Lawrence Erlbaum; 2002.
  • 81. Dickersin K. The existence of publication bias and risk-factors for its occurrence. Journal of the American Medical Association. 1990;263:1385–1389.
  • 82. Hopewell S, Loudon K, Clarke MJ, Oxman AD, Dickersin K. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database of Systematic Reviews; 2009.
  • 83. Ebel RL. Estimation of the reliability of ratings. Psychometrika. 1951;16:407–424.
  • 84. Thompson B, Vacha-Haase T. Psychometrics is datametrics: the test is not reliable. Educational and Psychological Measurement. 2000;60:174–195.
  • 85. Campanario JM. Peer review for journals as it stands today - part 2. Science Communication. 1998;19:277–306.
  • 86. Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O. SAS for mixed models. Cary, NC, USA: SAS Institute Inc; 2007.
  • 87. Landis JR, Koch GG. Measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
  • 88. Daniel H-D. Guardians of science. Fairness and reliability of peer review. Weinheim, Germany: Wiley-VCH; 1993.
  • 89. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to meta-analysis. Chichester, UK: Wiley; 2009.
  • 90. Goldman RL. The reliability of peer assessments - a meta-analysis. Evaluation & the Health Professions. 1994;17:3–21. doi: 10.1177/016327879401700101.
  • 91. Neff BD, Olden JD. Is peer review a game of chance? BioScience. 2006;56:333–340.
  • 92. Jayasinghe UW, Marsh HW, Bond N. A new reader trial approach to peer review in funding research grants: an Australian experiment. Scientometrics. 2006;69:591–606.
  • 93. Bailar JC. Reliability, fairness, objectivity, and other inappropriate goals in peer review. Behavioral and Brain Sciences. 1991;14:137–138.
  • 94. Langfeldt L. The decision-making constraints and processes of grant peer review, and their effects on the review outcome. Social Studies of Science. 2001;31:820–841.
  • 95. Good CD, Parente ST. A worldwide assessment of medical journal editors' practices and needs - results of a survey by the World Association of Medical Editors. South African Medical Journal. 1999;4:397–401.
  • 96. Smith R. Peer review: a flawed process at the heart of science and journals. Journal of the Royal Society of Medicine. 2006;99:178–182. doi: 10.1258/jrsm.99.4.178.
  • 97. Cho MK, Justice AC, Winker MA, Berlin JA, Waeckerle JF, et al. Masking author identity in peer review - What factors influence masking success? Journal of the American Medical Association. 1998;280:243–245. doi: 10.1001/jama.280.3.243.
  • 98. Godlee F. Making reviewers visible - Openness, accountability, and credit. Journal of the American Medical Association. 2002;287:2762–2765. doi: 10.1001/jama.287.21.2762.
  • 99. Snodgrass R. Single- versus double-blind reviewing: an analysis of the literature. Sigmod Record. 2006;35:8–21.
  • 100. Falagas ME, Zouglakis GM, Kavvadia PK. How masked is the “masked peer review” of abstracts submitted to international medical conferences? Mayo Clinic Proceedings. 2006;81:705. doi: 10.4065/81.5.705.
  • 101. Cole JR. The role of journals in the growth of scientific knowledge. In: Cronin B, Atkins HB, editors. The web of knowledge: a festschrift in honor of Eugene Garfield. Medford, NJ, USA: Information Today; 2000. pp. 109–142.
  • 102. Siegelman SS. Assassins and zealots - variations in peer review - special report. Radiology. 1991;178:637–642. doi: 10.1148/radiology.178.3.1994394.
  • 103. Fiske DW, Fogg L. But the reviewers are making different criticisms of my paper - diversity and uniqueness in reviewer comments. American Psychologist. 1990;45:591–598.
  • 104. LaFollette MC. Stealing into print: fraud, plagiarism and misconduct in scientific publishing. Berkeley, CA, USA: University of California Press; 1992.
  • 105. Lienert GA. Schulnoten-Evaluation [Evaluation of school grades]. Frankfurt am Main, Germany: Athenäum; 1987.
  • 106. Eckes T. Rater agreement and rater severity: a many-faceted Rasch analysis of performance assessments in the “Test Deutsch als Fremdsprache” (TestDaF). Diagnostica. 2004;50:65–77.
  • 107. Hunter JE, Schmidt FL. Methods of meta-analysis: correcting error and bias in research findings. Newbury Park, CA, USA: Sage; 1990.
  • 108. Hunter JE, Schmidt FL. Fixed effects vs. random effects meta-analysis models: implications for cumulative research knowledge. International Journal of Selection and Assessment. 2000;8:275–292.
  • 109. Hunter JE, Schmidt FL. Methods of meta-analysis: correcting error and bias in research findings. Thousand Oaks, CA, USA: Sage; 2004.
  • 110. Muthén LK, Muthén BO. Mplus User's Guide. Los Angeles, CA, USA: Muthén & Muthén; 1998–2007.