

Applied Mathematical Sciences, Vol. 7, 2013, no. 124, 6153 - 6166 HIKARI Ltd, www.m-hikari.com

http://dx.doi.org/10.12988/ams.2013.39496

Sample Size Determination and Power Analysis

for Modified Cohen’s Kappa Statistic

Pornpis Yimprayoon

Department of Mathematics, Faculty of Liberal Arts and Science

Kasetsart University, Kamphaeng Saen Campus Nakhonpathom 73140, Thailand

[email protected]

Copyright © 2013 Pornpis Yimprayoon. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In this research, we focus on statistical inference for the problem of measuring agreement between two observers who employ measurements on a 2-point nominal scale. One of the most popular indices of agreement was originally presented by Cohen [2], namely Cohen's kappa statistic ($\kappa_C$), a reliability index for measuring agreement between two raters employing nominal scales. Sinha et al. [5] pointed out some undesirable features of $\kappa_C$ and modified it by proposing the modified Cohen's kappa $\kappa_M$ to deal with the full strength of agreement and disagreement between the two raters; examining the expressions for $\kappa_M$ thus makes adequate interpretation of $\kappa_C$ in assessing agreement possible. The purpose of this study is to determine the sample size for testing hypotheses based on $\kappa_C$ and $\kappa_M$; moreover, the power of the test for $\kappa_C$ and $\kappa_M$ is computed. The results of this study show that $\kappa_M$ is more efficient than $\kappa_C$.

Mathematics Subject Classification: 97K80

Keywords: Cohen's kappa, Modified kappa, Sample size, Power of test


1 Introduction

Researchers became aware of the problem of measurement error more than a century and a half ago, and statisticians, clinicians, epidemiologists, psychologists, and many other scientists have since investigated the scientific bases of measurement errors [4]. A number of statistical problems in the social and biomedical sciences require the measurement of agreement between two or more raters. In health care practice, clinical measurements serve as a basis for diagnostic, prognostic, and therapeutic evaluations. With recent technological advancement, new methods and instruments for such evaluations have become available, and these need to be tested and compared with the old ones: before a new method or instrument is adopted for measuring a variable of interest, one must ensure the accuracy and precision of the measurement. Often a reliability or validity study involving multiple raters is conducted in clinical or experimental settings; high measures of agreement indicate consensus in the diagnosis and interchangeability of the measuring devices [1]. One of the most popular indices of agreement was originally presented by Cohen [2] as a reliability index for measuring agreement between two raters employing nominal scales.

In the following section, we describe some fundamental concepts that are necessary for a proper understanding of the Cohen's kappa statistic discussed in this research.

2 Brief Description of Cohen's Kappa and its Modification

Let us consider a reliability study in which 2 raters, referred to as rater A and rater B, operate independently and are required to classify subjects into one of 2 possible response categories. The subjects are independent. The 2 response categories, labeled 1 and 2, are mutually exclusive and exhaustive. We denote by $\pi_{ij}$ the chance that rater A classifies a subject into category $i$ while rater B classifies the same subject into category $j$, for $i, j = 1, 2$. Let

$\pi_{1\cdot} = \sum_{j=1}^{2} \pi_{1j}$ and $\pi_{2\cdot} = \sum_{j=1}^{2} \pi_{2j} = 1 - \pi_{1\cdot}$

be the probabilities of being classified by rater A into categories 1 and 2, respectively. We also define

$\pi_{\cdot 1} = \sum_{i=1}^{2} \pi_{i1}$ and $\pi_{\cdot 2} = \sum_{i=1}^{2} \pi_{i2} = 1 - \pi_{\cdot 1}$

in the same manner.

In this set-up, Cohen's kappa statistic for measuring agreement between the two raters is defined as

$\kappa_C = \dfrac{\theta_o - \theta_e}{1 - \theta_e}$   (1)


where

$\theta_o = \pi_{11} + \pi_{22}$   (2)

and

$\theta_e = \pi_{1\cdot}\pi_{\cdot 1} + \pi_{2\cdot}\pi_{\cdot 2}$.   (3)

In applications, if there are $n$ subjects and $n_{ij}$ represents the number of subjects classified in category $i$ by rater A and in category $j$ by rater B, the sample estimate of $\kappa_C$ is given by

$\hat\kappa_C = \dfrac{\hat\theta_o - \hat\theta_e}{1 - \hat\theta_e}$   (4)

where

$\hat\pi_{ij} = \dfrac{n_{ij}}{n}$,   (5)

$\hat\pi_{i\cdot} = \dfrac{n_{i\cdot}}{n}$,   (6)

$\hat\pi_{\cdot j} = \dfrac{n_{\cdot j}}{n}$,   (7)

$\hat\theta_o = \dfrac{n_{11} + n_{22}}{n}$,   (8)

and

$\hat\theta_e = \dfrac{n_{1\cdot}\,n_{\cdot 1} + n_{2\cdot}\,n_{\cdot 2}}{n^2}$.   (9)
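As an illustration of equations (4)-(9), the sample estimate $\hat\kappa_C$ can be computed from a 2x2 table of counts with a few lines of code. This sketch is ours, not the paper's; the function name and example counts are illustrative.

```python
# Minimal sketch of equations (4)-(9): the sample estimate of Cohen's
# kappa from a 2x2 table of counts. Function name and example counts
# are illustrative, not from the paper.

def cohen_kappa(n11, n12, n21, n22):
    """n_ij = number of subjects put in category i by rater A and j by rater B."""
    n = n11 + n12 + n21 + n22
    theta_o_hat = (n11 + n22) / n                                # equation (8)
    # row totals n_{i.} (rater A) and column totals n_{.j} (rater B)
    n1_dot, n2_dot = n11 + n12, n21 + n22
    n_dot1, n_dot2 = n11 + n21, n12 + n22
    theta_e_hat = (n1_dot * n_dot1 + n2_dot * n_dot2) / n**2     # equation (9)
    return (theta_o_hat - theta_e_hat) / (1 - theta_e_hat)       # equation (4)

# 85 of 100 subjects fall in the agreement cells:
print(round(cohen_kappa(40, 10, 5, 45), 4))  # → 0.7
```

Here $\hat\theta_o = 0.85$ and $\hat\theta_e = 0.5$, so $\hat\kappa_C = 0.35/0.5 = 0.7$.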

The difference $\hat\theta_o - \hat\theta_e$ is the proportion of agreement beyond what is expected by chance. If $\hat\theta_o - \hat\theta_e$ is positive, the two raters agree more often than expected based on chance, while negative values of $\hat\theta_o - \hat\theta_e$ indicate that they agree less often than expected based on chance. The maximum possible discrepancy between $\hat\theta_o$ and $\hat\theta_e$ is $1 - \hat\theta_e$; this discrepancy results when all decisions fall in the agreement cells of the cross-classification, in which case agreement is perfect.

Note that the sample estimates of the parameters $\pi_{ij}$, $\pi_{\cdot j}$, $\pi_{i\cdot}$, $\theta_e$, and $\theta_o$ are denoted by $\hat\pi_{ij}$, $\hat\pi_{\cdot j}$, $\hat\pi_{i\cdot}$, $\hat\theta_e$, and $\hat\theta_o$, respectively.

In addition, a number of authors have proposed guidelines for the interpretation of $\kappa_C$. For example, Landis and Koch [3] suggest categories in which the largest value of kappa is 1.00, indicating perfect agreement; a value of 0.00 indicates that the observed agreement is the same as that expected by chance; and the minimum value of kappa falls between -1.00 and 0.00.

A general interpretation of $\kappa_C$ focuses on the following characteristics of $\kappa_C$:


1. $\kappa_C = 1.00$ if and only if $\theta_o = 1.00$; this means there is no controversial judgment by the raters, that is, the probability in the disagreement cells (off-diagonal cells) is zero.

2. $\kappa_C = 0.00$ only if $\theta_o = \theta_e$ (that is, $\pi_{11} = \pi_{1\cdot} \times \pi_{\cdot 1}$ or $\pi_{22} = \pi_{2\cdot} \times \pi_{\cdot 2}$ or $\pi_{12} = \pi_{1\cdot} \times \pi_{\cdot 2}$ or $\pi_{21} = \pi_{2\cdot} \times \pi_{\cdot 1}$), which says only that raters A and B perform independently.

3. $\kappa_C = -1$ if the probability in the agreement cells (diagonal cells) is zero and the probability of cell (2,1) is equal to the probability of cell (1,2); that is, it occurs only when $\pi_{12} = \pi_{21} = 0.50$ and $\pi_{11} = \pi_{22} = 0.00$.

Sinha et al. [5] pointed out some undesirable features of $\kappa_C$ and observed that the case "$\kappa_C = -1$" seems to impose too restrictive a behavior on the part of the raters. When $\pi_{11} = \pi_{22} = 0$, there is already an indication of total disagreement between the two raters. Therefore, in such situations, irrespective of the values assumed by $\pi_{12}$ and $\pi_{21}$ ($0 < \pi_{12}, \pi_{21} < 1$, $\pi_{12} + \pi_{21} = 1$), the kappa coefficient should assume the value $-1$. With this in mind, they set $\pi_{12} = \alpha$ and $\pi_{21} = 1 - \alpha$, $0 < \alpha < 1$, and analyzed the situation with the purpose of modifying the definition of $\kappa_C$ to deal with the full strength of disagreement between the two raters while the ratings are given independently on a 2-point nominal scale.

Their modification is aimed at the value $\kappa_C = -1$. They modified $\kappa_C$ as

$\kappa_M = \dfrac{\theta_o - \theta_e}{A - \theta_e}$   (10)

and suggested a value of $A$ to take care of the situations

$\pi_{11} = \pi_{22} = 0$,   (11)

$\pi_{12} = \alpha$,   (12)

$\pi_{21} = 1 - \alpha$,   (13)

where $0 < \alpha < 1$, along with $\kappa_M = -1$. Under (11)-(13), $\kappa_M$ reduces to

$\kappa_M = \dfrac{-2\alpha(1-\alpha)}{A - 2\alpha(1-\alpha)}$,   (14)

and $\kappa_M = -1$ yields

$A = 4\alpha(1-\alpha)$.   (15)

Then, replacing $\alpha$ by $\dfrac{\pi_{1\cdot} + \pi_{\cdot 2}}{2}$ in (15), we have

$A = 4 \cdot \dfrac{\pi_{1\cdot} + \pi_{\cdot 2}}{2} \cdot \dfrac{\pi_{2\cdot} + \pi_{\cdot 1}}{2} = (\pi_{1\cdot} + \pi_{\cdot 2})(\pi_{2\cdot} + \pi_{\cdot 1})$.   (16)

Next, substituting (16) in (10), we obtain


$\kappa_M = \dfrac{\theta_o - \theta_e}{(\pi_{1\cdot} + \pi_{\cdot 2})(\pi_{\cdot 1} + \pi_{2\cdot}) - (\pi_{1\cdot}\pi_{\cdot 1} + \pi_{2\cdot}\pi_{\cdot 2})}$.   (17)

Hence, the modified kappa statistic $\kappa_M$ is defined as

$\kappa_M = \dfrac{\theta_o - \theta_e}{\pi_{1\cdot}\pi_{2\cdot} + \pi_{\cdot 1}\pi_{\cdot 2}}$.   (18)
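A companion sketch (ours, not the paper's) obtained by plugging the sample proportions into (18); the total-disagreement example shows the behavior the modification was designed for.

```python
# Sketch of equation (18): modified kappa from a 2x2 table of counts.
# Function name and example counts are illustrative, not from the paper.

def modified_kappa(n11, n12, n21, n22):
    n = n11 + n12 + n21 + n22
    p1_dot, p2_dot = (n11 + n12) / n, (n21 + n22) / n   # rater A marginals
    p_dot1, p_dot2 = (n11 + n21) / n, (n12 + n22) / n   # rater B marginals
    theta_o = (n11 + n22) / n
    theta_e = p1_dot * p_dot1 + p2_dot * p_dot2
    # equation (18): denominator pi_{1.}pi_{2.} + pi_{.1}pi_{.2}
    return (theta_o - theta_e) / (p1_dot * p2_dot + p_dot1 * p_dot2)

# Total disagreement with an unbalanced off-diagonal split: kappa_M
# reaches -1 regardless of the split, which kappa_C does not.
print(modified_kappa(0, 30, 70, 0))  # → -1.0
```

For the same table, $\hat\kappa_C = -0.42/0.58 \approx -0.72$, illustrating the restriction that motivated the modification.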

This modification is based on the analysis of situations leading to total disagreement between the two raters. In addition, Sinha et al. [5] verified that all three essential features of the kappa statistic are retained by $\kappa_M$.

The purpose of this study is therefore to determine the sample size for testing hypotheses based on $\kappa_C$ and $\kappa_M$; moreover, the power of the test for $\kappa_C$ and $\kappa_M$ is computed.

3 Sample Size Determination for Modified Kappa

In this section, kappa based on the multi-sample case is considered. We wish to test the hypothesis

$H_0: \kappa_C^{(1)} = \kappa_C^{(2)} = \cdots = \kappa_C^{(t)}$   (19)

versus

$H_1:$ not all $\kappa_C^{(r)}$'s are equal.   (20)

Suppose that on $t$ different opportunities, independent sample units, each of size $n$, are collected and presented before the raters for rating.

The estimator of the kappa coefficient on the $r$th opportunity is

$\hat\kappa_C^{(r)} = \dfrac{\hat\theta_o^{(r)} - \hat\theta_e^{(r)}}{1 - \hat\theta_e^{(r)}}$,  $r = 1, 2, \ldots, t$,   (21)

where

$\hat\theta_o^{(r)} = \dfrac{n_{11}^{(r)} + n_{22}^{(r)}}{n}$   (22)

and

$\hat\theta_e^{(r)} = \dfrac{n_{1\cdot}^{(r)}\, n_{\cdot 1}^{(r)} + n_{2\cdot}^{(r)}\, n_{\cdot 2}^{(r)}}{n^2}$.   (23)

Then

$E(\hat\kappa_C^{(r)}) \approx \kappa_C^{(r)} = \dfrac{\theta_o^{(r)} - \theta_e^{(r)}}{1 - \theta_e^{(r)}}$   (24)

and


$V(\hat\kappa_C^{(r)}) \approx \dfrac{Q_C^{(r)}}{n}$   (25)

where

$Q_C^{(r)} = \dfrac{1}{(1-\theta_e^{(r)})^4} \Big\{ \sum_i \pi_{ii}^{(r)} \big[(1-\theta_e^{(r)}) - (\pi_{\cdot i}^{(r)} + \pi_{i\cdot}^{(r)})(1-\theta_o^{(r)})\big]^2 + (1-\theta_o^{(r)})^2 \sum_{i \neq j} \pi_{ij}^{(r)} (\pi_{\cdot i}^{(r)} + \pi_{j\cdot}^{(r)})^2 - \big(\theta_o^{(r)}\theta_e^{(r)} - 2\theta_e^{(r)} + \theta_o^{(r)}\big)^2 \Big\}$.   (26)
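The variance term in (26) can be sketched as follows. This is our reading of the formula, which agrees with the standard large-sample variance of $\hat\kappa_C$; the function name is ours.

```python
# Sketch of equation (26): the large-sample variance coefficient Q for
# Cohen's kappa, computed from a 2x2 table of cell probabilities.
# This encodes our reading of (26); names are illustrative.

def kappa_variance_Q(pi):
    """pi: 2x2 nested list, pi[i][j] = P(rater A says i+1, rater B says j+1)."""
    theta_o = pi[0][0] + pi[1][1]
    row = [pi[0][0] + pi[0][1], pi[1][0] + pi[1][1]]   # pi_{i.}
    col = [pi[0][0] + pi[1][0], pi[0][1] + pi[1][1]]   # pi_{.j}
    theta_e = row[0] * col[0] + row[1] * col[1]
    term1 = sum(pi[i][i] * ((1 - theta_e) - (col[i] + row[i]) * (1 - theta_o)) ** 2
                for i in range(2))
    term2 = (1 - theta_o) ** 2 * sum(pi[i][j] * (col[i] + row[j]) ** 2
                                     for i in range(2) for j in range(2) if i != j)
    term3 = (theta_o * theta_e - 2 * theta_e + theta_o) ** 2
    return (term1 + term2 - term3) / (1 - theta_e) ** 4

# Under independence with uniform marginals, Q reduces to
# theta_e / (1 - theta_e) = 1:
print(kappa_variance_Q([[0.25, 0.25], [0.25, 0.25]]))  # → 1.0
```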

Moreover, for each $r$ such that $r = 1, 2, \ldots, t$,

$Z_C^{(r)} = \dfrac{\hat\kappa_C^{(r)} - \kappa_C^{(r)}}{\sqrt{Q_C^{(r)}/n}} \approx N(0, 1)$.   (27)

Now we are in a position to propose a large-sample test for the null hypothesis $H_0$ in (19), based on a $\chi^2$-test. Define

$\hat\kappa_C = \dfrac{\sum_{r=1}^{t} \hat\kappa_C^{(r)} / \hat Q_C^{(r)}}{\sum_{r=1}^{t} 1 / \hat Q_C^{(r)}}$   (28)

where $\hat Q_C^{(r)}$ is the estimator of $Q_C^{(r)}$ for all $r = 1, 2, \ldots, t$. In applications, often $\hat\pi_{1\cdot}^{(r)}$ and $\hat\pi_{\cdot 1}^{(r)}$ are known in advance or are based on a prior guess. In that case, the estimator $\hat Q_C^{(r)}$ depends on the estimated value of $\hat\kappa_C^{(r)}$.

Next define

$T = n \sum_{r=1}^{t} \dfrac{(\hat\kappa_C^{(r)} - \hat\kappa_C)^2}{\hat Q_C^{(r)}}$.   (29)

This is equivalent to

$T = n \sum_{r=1}^{t} \dfrac{(\hat\kappa_C^{(r)})^2 - 2\hat\kappa_C \hat\kappa_C^{(r)} + \hat\kappa_C^2}{\hat Q_C^{(r)}} = n \sum_{r=1}^{t} \dfrac{(\hat\kappa_C^{(r)})^2}{\hat Q_C^{(r)}} - 2\hat\kappa_C\, n \sum_{r=1}^{t} \dfrac{\hat\kappa_C^{(r)}}{\hat Q_C^{(r)}} + \hat\kappa_C^2\, n \sum_{r=1}^{t} \dfrac{1}{\hat Q_C^{(r)}}$.   (30)

We then substitute (28) in the above expression, so it takes the form

$T = n \sum_{r=1}^{t} \dfrac{(\hat\kappa_C^{(r)})^2}{\hat Q_C^{(r)}} - 2\hat\kappa_C^2\, n \sum_{r=1}^{t} \dfrac{1}{\hat Q_C^{(r)}} + \hat\kappa_C^2\, n \sum_{r=1}^{t} \dfrac{1}{\hat Q_C^{(r)}} = n \sum_{r=1}^{t} \dfrac{(\hat\kappa_C^{(r)})^2}{\hat Q_C^{(r)}} - \hat\kappa_C^2\, n \sum_{r=1}^{t} \dfrac{1}{\hat Q_C^{(r)}}$,   (31)

which can be rewritten as


$T = n \sum_{r=1}^{t} \dfrac{(\hat\kappa_C^{(r)})^2 - \hat\kappa_C^2}{\hat Q_C^{(r)}}$.   (32)

It follows that if $H_0$ is correct, then $T$ has the central chi-squared distribution with $t-1$ degrees of freedom. If $H_0$ is not correct, then $T$ has a non-central chi-squared distribution with $t-1$ degrees of freedom and non-centrality parameter $\Delta$. We denote these by

$T \sim$ central $\chi^2_{t-1}$ under $H_0$,

$T \sim$ non-central $\chi^2_{t-1, \Delta}$ under $H_1$.
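The test in (28)-(32) can be sketched numerically; scipy supplies the chi-squared critical value, and the per-opportunity estimates $\hat\kappa_C^{(r)}$ and $\hat Q_C^{(r)}$ are assumed to have been computed already. All names here are ours, not the paper's.

```python
# Sketch of the homogeneity test in (28)-(32). Assumes the per-opportunity
# estimates kappa_hat[r] and Q_hat[r] are already available; names are ours.
from scipy.stats import chi2

def kappa_homogeneity_test(kappa_hat, Q_hat, n, alpha=0.05):
    t = len(kappa_hat)
    weights = [1.0 / q for q in Q_hat]
    # weighted pooled estimate, equation (28)
    kappa_bar = sum(k * w for k, w in zip(kappa_hat, weights)) / sum(weights)
    # test statistic, equation (29) (equivalently (32))
    T = n * sum((k - kappa_bar) ** 2 / q for k, q in zip(kappa_hat, Q_hat))
    crit = chi2.ppf(1 - alpha, df=t - 1)   # critical value chi^2_{t-1;alpha}
    return T, crit, T > crit

T, crit, reject = kappa_homogeneity_test([0.75, 0.60, 0.42], [0.9, 1.1, 1.0], n=100)
print(T > 0)  # → True
```

Under $H_0$, $T$ is referred to the central $\chi^2_{t-1}$ distribution, as stated above.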

Writing

$\kappa_C^{(r)} = \delta_C^{(r)} + \kappa_C^{(t)}$,  $r = 1, 2, \ldots, t-1$,   (33)

it can be seen that

$\Delta = n \sum_{r=1}^{t} \dfrac{(\kappa_C^{(r)} - \kappa_C)^2}{Q_C^{(r)}}$.   (34)

We then substitute $\kappa_C^{(t)} = \kappa_C^{(r)} - \delta_C^{(r)}$ and

$\kappa_C = \dfrac{\sum_{r=1}^{t} \kappa_C^{(r)} / Q_C^{(r)}}{\sum_{r=1}^{t} 1 / Q_C^{(r)}}$

in the above expression. So equation (34) can be put in the form

$\Delta = n \sum_{r=1}^{t} \dfrac{(\delta_C^{(r)} - \delta_C)^2}{Q_C^{(r)}}$   (35)

where

$\delta_C = \dfrac{\sum_{r=1}^{t} \delta_C^{(r)} / Q_C^{(r)}}{\sum_{r=1}^{t} 1 / Q_C^{(r)}}$   (36)

and

$\delta_C^{(t)} = 0$.   (37)

If we use the significance level $\alpha$, then the critical region of the test is

$T > \chi^2_{t-1; \alpha}$   (38)

and we have the power of the test

$\gamma(\Delta) = P[T > \chi^2_{t-1; \alpha} \mid H_1]$.   (39)

This is equivalent to

$\gamma(\Delta) = P[\text{non-central } \chi^2_{t-1, \Delta} > \chi^2_{t-1; \alpha}]$   (40)

where $\Delta$ is as in equation (35). For a specific power, (40) gives an expression for $\Delta$ for given $\alpha$ and $t$.


Once $\Delta$ is known, we can solve for $n$, provided the $\delta_C^{(r)}$'s and $Q_C^{(r)}$'s are known. We may summarize the method to compute the sample size $n$ for testing hypothesis (19) against (20) as follows:

Step 1: Specify the values of the significance level $\alpha$ and the power of the test $\gamma(\Delta)$.
Step 2: Evaluate the $\Delta$ corresponding to the power of the test by using (40).

That is, for all $0 < \chi^2 < \infty$, we have

$\gamma(\Delta) = \sum_{j=0}^{\infty} \dfrac{e^{-\Delta/2}\left(\frac{\Delta}{2}\right)^j}{j!} \displaystyle\int_{\chi^2_{t-1;\alpha}}^{\infty} \dfrac{(\chi^2)^{\frac{t-1+2j}{2}-1}\, e^{-\chi^2/2}}{2^{\frac{t-1+2j}{2}}\, \Gamma\!\left(\frac{t-1+2j}{2}\right)}\, d\chi^2,$   (41)

i.e., the power is a Poisson-weighted mixture of upper-tail probabilities of central chi-squared densities with $t-1+2j$ degrees of freedom.
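Numerically, the series in (41) is just the survival function of a non-central chi-squared distribution, so Step 2 can be carried out directly with scipy. This sketch is ours, not the authors' code.

```python
# Sketch of Step 2 / equation (41): the power is the upper-tail probability
# of a non-central chi-squared variable with t-1 df and non-centrality Delta,
# evaluated at the central critical value. Not code from the paper.
from scipy.stats import chi2, ncx2

def power(Delta, t, alpha):
    crit = chi2.ppf(1 - alpha, df=t - 1)       # chi^2_{t-1;alpha}
    return ncx2.sf(crit, df=t - 1, nc=Delta)   # equations (40)/(41)

# Power grows with the non-centrality parameter:
print(power(1.0, 2, 0.05) < power(10.0, 2, 0.05))  # → True
```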

Step 3: Find $Q_C^{(r)}$ and $\delta_C^{(r)}$ for each $r = 1, 2, \ldots, t$, and calculate $\delta_C$ from equation (36).
Step 4: Solve equation (35); the desired sample size $n$ is then obtained for the analysis.

Table 1 through Table 4 show some examples of tests of hypothesis (19) against (20) in order to evaluate the desired sample size $n$ for $\alpha = 0.01, 0.05$, power $\gamma(\Delta) = 0.80, 0.85, 0.90, 0.95, 0.99$, and $t = 2, 3, 4, 5$, respectively.

From the numerical examples shown in Tables 1-4, we can divide the settings into the following 4 cases:
Case 1: High agreement and high kappa;
Case 2: High agreement and low kappa;
Case 3: Low agreement and low kappa;
Case 4: Low agreement and high kappa.
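Steps 1-4 can be sketched end-to-end: bisect (40) for the $\Delta$ attaining the target power, then solve (35) for $n$. This is our illustration, not the authors' code; the inputs delta_r and Q_r are hypothetical values that would come from Step 3.

```python
# Sketch of Steps 1-4: find Delta from the target power via (40), then
# solve (35) for n. Illustrative code; inputs delta_r, Q_r are hypothetical.
import math
from scipy.stats import chi2, ncx2

def required_delta(t, alpha, target_power, hi=1000.0):
    crit = chi2.ppf(1 - alpha, df=t - 1)
    lo = 0.0
    for _ in range(100):                     # bisection on the non-centrality
        mid = (lo + hi) / 2.0
        if ncx2.sf(crit, df=t - 1, nc=mid) < target_power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def required_n(delta_r, Q_r, alpha, target_power):
    t = len(delta_r)
    weights = [1.0 / q for q in Q_r]
    delta_bar = sum(d * w for d, w in zip(delta_r, weights)) / sum(weights)  # (36)
    per_subject = sum((d - delta_bar) ** 2 / q for d, q in zip(delta_r, Q_r))
    # equation (35): Delta = n * per_subject, so n = Delta / per_subject
    return math.ceil(required_delta(t, alpha, target_power) / per_subject)

print(required_n([0.1, 0.0], [1.0, 1.0], alpha=0.05, target_power=0.80))
```

The bisection works because the power in (40) is increasing in the non-centrality parameter.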


Table 1 Values of $n$ for $t = 2$

NOTE: $a_C$ and $b_C$ represent the sample size for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the sample size for $\kappa_M$ at levels 0.01 and 0.05, respectively.

Table 2 Values of $n$ for $t = 3$

NOTE: $a_C$ and $b_C$ represent the sample size for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the sample size for $\kappa_M$ at levels 0.01 and 0.05, respectively.


Table 3 Values of $n$ for $t = 4$

NOTE: $a_C$ and $b_C$ represent the sample size for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the sample size for $\kappa_M$ at levels 0.01 and 0.05, respectively.

Table 4 Values of $n$ for $t = 5$

NOTE: $a_C$ and $b_C$ represent the sample size for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the sample size for $\kappa_M$ at levels 0.01 and 0.05, respectively.

4 Power Analysis for Modified Kappa

Conversely, we can also summarize the method to compute the power of the test for a given sample size $n$ as follows:


Step 1: Specify the values of the significance level $\alpha$ and the sample size $n$.
Step 2: Find $Q_C^{(r)}$ and $\delta_C^{(r)}$ for each $r = 1, 2, \ldots, t$, and use equation (36) to calculate $\delta_C$.
Step 3: Compute the value of the non-centrality parameter $\Delta$ by using (35).
Step 4: Calculate the power of the test $\gamma(\Delta)$ from equation (41).

Table 5 through Table 8 display some examples of tests of hypothesis (19) against (20) in order to calculate the power for given $\alpha = 0.01, 0.05$ and $n = 30, 50, 100, 150, 200$, with $t = 2, 3, 4, 5$, respectively. Similarly, looking at Tables 5-8, we again divide the numerical examples into the 4 cases mentioned in the previous section.

Table 5 Values of power for $t = 2$

NOTE: $a_C$ and $b_C$ represent the power for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the power for $\kappa_M$ at levels 0.01 and 0.05, respectively.


Table 6 Values of power for $t = 3$

NOTE: $a_C$ and $b_C$ represent the power for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the power for $\kappa_M$ at levels 0.01 and 0.05, respectively.

Table 7 Values of power for $t = 4$

NOTE: $a_C$ and $b_C$ represent the power for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the power for $\kappa_M$ at levels 0.01 and 0.05, respectively.


Table 8 Values of power for $t = 5$

NOTE: $a_C$ and $b_C$ represent the power for $\kappa_C$ at levels 0.01 and 0.05, respectively; $a_M$ and $b_M$ represent the power for $\kappa_M$ at levels 0.01 and 0.05, respectively.

5 Conclusion

In this research, we discussed the problem of measuring agreement or disagreement between two raters when the ratings are given independently on a 2-point nominal scale. The purpose of this study was to determine the sample size for testing hypotheses based on the Cohen's kappa statistic ($\kappa_C$) and the modified Cohen's kappa ($\kappa_M$); moreover, the power of the test for $\kappa_C$ and $\kappa_M$ was computed. The results of this study show that $\kappa_M$ is more efficient than $\kappa_C$.

Acknowledgments

Our sincere thanks are due to Assoc. Prof. Dr. Montip Tiensuwan of the Department of Mathematics, Faculty of Science, Mahidol University, for suggesting this problem. Special gratitude is expressed to Prof. Dr. Bikas K. Sinha of the Indian Statistical Institute, Kolkata, India, for his expert and excellent guidance and his useful comments. We are also particularly indebted to the Kasetsart University Research and Development Institute (KURDI), Kasetsart University, Thailand, and to the Faculty of Liberal Arts and Science, Kasetsart University, Kamphaeng Saen Campus Research Fund, for the financial support that enabled us to undertake this research.


References

[1] A.O. Adejumo, Modelling Generalized Linear (Loglinear) Models for Raters Agreement Measure, Peter Lang GmbH, Frankfurt, 2005.

[2] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20 (1) (1960), 37-46.

[3] J.R. Landis and G.G. Koch, The measurement of observer agreement for categorical data, Biometrics, 33 (1977), 159-174.

[4] L. Lin, A.S. Hedayat, B.K. Sinha and M. Yang, Statistical methods in assessing agreement: Models, issues, and tools, Journal of the American Statistical Association, 97 (457) (2002), 257-270.

[5] B.K. Sinha, P. Yimprayoon and M. Tiensuwan, Cohen's kappa statistic: A critical appraisal and some modifications, Calcutta Statistical Association Bulletin, 58 (2006), 151-169.

Received: September 9, 2013