This article was downloaded by: [Colorado College] On: 08 October 2014, At: 16:25 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Journal of the American Statistical Association Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20 A Framework for Assessing Broad Sense Agreement Between Ordinal and Continuous Measurements Limin Peng a , Ruosha Li a , Ying Guo a & Amita Manatunga a a Limin Peng is Assistant Professor, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322. Ruosha Li is Graduate Student, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322. Ying Guo is Assistant Professor, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322. Amita Manatunga is Professor, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322. This research is supported by U.S. National Institutes of Health grant R01 MH079448-02. Published online: 24 Jan 2012. To cite this article: Limin Peng, Ruosha Li, Ying Guo & Amita Manatunga (2011) A Framework for Assessing Broad Sense Agreement Between Ordinal and Continuous Measurements, Journal of the American Statistical Association, 106:496, 1592-1601, DOI: 10.1198/jasa.2011.tm10483 To link to this article: http://dx.doi.org/10.1198/jasa.2011.tm10483 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. 
Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http:// www.tandfonline.com/page/terms-and-conditions



A Framework for Assessing Broad Sense Agreement Between Ordinal and Continuous Measurements

Limin PENG, Ruosha LI, Ying GUO, and Amita MANATUNGA

Conventional agreement studies have been confined to addressing the sense of reproducibility, and therefore are limited to assessing measurements on the same scale. In this work, we propose a new concept, called “broad sense agreement,” which extends the classical framework of agreement to evaluate the capability of interpreting a continuous measurement in an ordinal scale. We present a natural measure for broad sense agreement. Nonparametric estimation and inference procedures are developed for the proposed measure along with theoretical justifications. We also consider longitudinal settings which involve agreement assessments at multiple time points. Simulation studies have demonstrated good performance of the proposed method with small sample sizes. We illustrate our methods via an application to a mental health study.

KEY WORDS: Asymptotic normality; Consistency; Hypothesis testing; Jackknife; Nonparametric statistics; Order consistency.

1. INTRODUCTION

Agreement studies have often been conducted in biomedical research, for example, to assess the reliability of laboratory measurements by examining their similarity among different raters, or to validate a new diagnostic instrument through comparing with some “gold standard.” Various methods have been developed in the literature for measuring agreement and making appropriate statistical inference. For example, with categorical data, the kappa coefficient (Cohen 1960, 1968; Fleiss 1971; Kraemer 1980) has been widely used. Since its development, kappa has been extended for dependent samples (Williamson, Lipsitz, and Manatunga 2000; Barnhart and Williamson 2002) and survival studies with censored observations (Guo and Manatunga 2009). In the context of continuous data, the concordance correlation coefficient (CCC) is a popular measure of agreement and has been well studied (Lin 1989, 1992; Lin et al. 2002; among others). In addition, CCC has been extended for replicated samples (Lin, Hedayat, and Wenting 2007), repeated measures (Quiroz 2005; King et al. 2007), survival outcomes (Guo and Manatunga 2007), and multivariate observations (Jason and Olsson 2001). These agreement methods essentially serve to characterize reproducibility, by which one may conclude that the measurements produced by different raters or instruments are completely exchangeable. Applications are thus limited to studying measurements on the same scale.

In mental health studies, numerous instruments have been developed for the diagnosis of psychiatric disorders such as major depression, and there is considerable interest in replacing one instrument by another for reduction of cost, ease of administration, and other considerations. However, since these instruments are based on different questionnaires with distinctive structures and point systems, they often have different scales. When the scales of the instruments are different, the existing agreement methodology is not applicable. For example, in the Melanoma and Depression study (Musselman et al. 2001), depression was measured by the clinician-administered Hamilton Depression Scale (HAM-D) and a self-report dimensional scale (Carroll-D). While depression grade, such as no depression, mild depression, and severe depression, has been well defined based on the observer-rated HAM-D scale (Potts et al. 1990), it remains unknown whether the less time-consuming Carroll-D can replace HAM-D to determine the grade of the illness. This problem is equivalent to assessing to what extent the continuous Carroll-D can be interpreted as the ordinal graded severity of depression. Due to the different measurement scales, this question cannot be addressed in the classical framework of agreement.

Motivated by the Melanoma and Depression study, we propose a new concept of agreement between an ordinal measurement and a continuous measurement and call it “broad sense agreement.” This new concept is designed to characterize the capability of interpreting a continuous instrument (e.g., Carroll-D) according to ordered categories of interest (e.g., depression grade as measured by HAM-D), and thus expands the traditional framework of agreement study. Specifically, let X and Y denote a continuous measurement and an ordinal measurement of a common outcome variable from the same subject, respectively, and let $D_X$ and $D_Y$ be the domains of X and Y. We formally introduce the concept of the broad sense agreement between X and Y as follows.

Definition 1. X and Y are in perfect broad sense agreement (or disagreement) if and only if there exists an increasing (or decreasing) step function $\Phi$ from $D_X$ to $D_Y$ such that $Y = \Phi(X)$ with probability 1.

Broad sense agreement so defined may be viewed as an extension of the classical concept of agreement because it addresses the relationship between two distinct types of measurements of the same characteristic or outcome. A more straightforward interpretation of perfect broad sense agreement (or disagreement) is the clear identification of a set of cut-off points for the continuous X that produce an exact one-to-one concordant (or discordant) correspondence between the discretized X and the ordinal Y. It essentially serves the same purpose as traditional agreement, for example, to find a convenient replacement for an established instrument.

© 2011 American Statistical Association. Journal of the American Statistical Association, December 2011, Vol. 106, No. 496, Theory and Methods. DOI: 10.1198/jasa.2011.tm10483

As suggested by Definition 1, order consistency is one crucial requirement for classifying X and Y as in perfect broad sense agreement. That is, given $(X_1, Y_1)$ and $(X_2, Y_2) \in D_X \times D_Y$, $X_1 \le X_2$ implies $Y_1 \le Y_2$ and vice versa. In this regard, classical measures of concordance may be applicable for describing broad sense agreement, including Kendall’s tau-b (Kendall 1945; Agresti 1984), Goodman–Kruskal’s gamma (Brown and Benedetti 1977; Agresti 1990), Somers’ delta (Somers 1962), and Svensson’s (2000) approach for assessing order consistency. However, most of these measures are designed to study the relationship between two ordinal scales. Even after being adapted to assess continuous and ordinal measurements, all these measures are constructed based on the probability of concordant or discordant pairs of (X, Y), thereby only involving order comparisons between two observations. As a result, they may lack a clear reflection of the overall deviation of the correspondence between X and Y from perfect broad sense agreement, especially when Y has more than two levels. In addition, many existing concordance-based measures may not achieve 1 (or −1) in the case of perfect broad sense agreement (or disagreement). This limitation will be further illustrated in Section 2.2.

In this article, we develop a novel framework of agreement study based on the newly defined concept of broad sense agreement, which can provide useful information on the extent to which a continuous scale corresponds to an ordinal scale. In Section 2, we first introduce a natural summary measure for broad sense agreement, and then propose a robust nonparametric estimator. Inferences including asymptotic properties, variance estimation, and construction of confidence intervals are also presented. In addition, we briefly describe an extension to the longitudinal setting. Monte Carlo simulation results are reported in Section 3, showing good and stable performance of the proposed estimator even with small sample sizes. We illustrate our proposal via an application to the Melanoma and Depression study in Section 4. Finally, we conclude with some remarks in Section 5.

2. METHODS

2.1 Proposed Broad Sense Agreement Measure

Without loss of generality, we suppose that the ordinal Y takes values $1 < \cdots < L$, that is, $D_Y = \{1, \ldots, L\}$. For brevity, we use the shorthand BSA for the term broad sense agreement. By Definition 1, perfect broad sense agreement entails a scenario where, for a randomly selected X with Y = l, denoted by $X^{(*l)}$, it must hold that $X^{(*1)} < \cdots < X^{(*L)}$. In the contrary case of perfect broad sense disagreement, the ranks of $\{X^{(*1)}, \ldots, X^{(*L)}\}$ are reversed; that is, $X^{(*1)} > \cdots > X^{(*L)}$. One can further show that $X^{(*1)} < \cdots < X^{(*L)}$ (or $X^{(*1)} > \cdots > X^{(*L)}$) with probability 1 implies the existence of an increasing (or decreasing) step function, say $\Phi(\cdot)$, such that $Y = \Phi(X)$ with probability 1, which is perfect broad sense agreement (or disagreement).
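The ordering condition above has a simple sample-level analogue: the X values observed in each level of Y must occupy non-overlapping, ordered ranges. The following is a minimal sketch of that check, not part of the original method; the function name `perfect_bsa_direction` is hypothetical, and at least two observed levels of Y are assumed.

```python
def perfect_bsa_direction(x, y):
    """Sample-level check of the ordering condition (hypothetical helper).

    Returns +1 if the x-ranges of the y-levels are disjoint and increasing
    (perfect agreement in the sample), -1 if they are disjoint and
    decreasing (perfect disagreement), and 0 otherwise.
    Assumes at least two distinct levels of y are present.
    """
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    levels = sorted(groups)
    lows = [min(groups[l]) for l in levels]
    highs = [max(groups[l]) for l in levels]
    k = len(levels)
    # Increasing step function exists iff every level lies strictly below the next one.
    if all(highs[i] < lows[i + 1] for i in range(k - 1)):
        return 1
    # Decreasing step function exists iff every level lies strictly above the next one.
    if all(lows[i] > highs[i + 1] for i in range(k - 1)):
        return -1
    return 0
```

When the function returns +1, any value between the largest X in one level and the smallest X in the next can serve as a cut-off point.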

The facts stated above motivate us to formulate a BSA measure based on the comparison between the ranks of $\{X^{(*1)}, \ldots, X^{(*L)}\}$, denoted by $(R_1, \ldots, R_L)$, and their anticipated ranks under the scenario of perfect broad sense agreement, $(1, \ldots, L)$. Specifically, we propose a BSA measure taking the form

$$\rho_{bsa} = 1 - \frac{E\{\sum_{l=1}^{L}(l - R_l)^2\}}{E\{\sum_{l=1}^{L}(l - R_l)^2 \mid X \perp Y\}},$$

where $E(\cdot)$ denotes the expectation and $X \perp Y$ stands for the independence between X and Y. In the definition of $\rho_{bsa}$, we adapt the form of CCC to characterize the similarity between the two sets of ranks, $(R_1, \ldots, R_L)$ and $(1, \ldots, L)$. By this construction, we characterize the extent of departure from perfect broad sense agreement by using the distance of the observed ranks in X from the ranks expected under perfect BSA, given L randomly selected observations with distinct Y. This represents a key distinction between BSA and previous measures, which are based on the probability of discordant pairs. That is, the proposed measure $\rho_{bsa}$ has a clear meaning as a summary of the global resemblance to perfect BSA, which the classical concordance-based measures lack.

The new measure $\rho_{bsa}$ is standardized and is easy to interpret. Specifically, as shown in Appendix A, $-1 \le \rho_{bsa} \le 1$, and

$$\rho_{bsa} = \begin{cases} 1, & X \text{ and } Y \text{ are in perfect broad sense agreement} \\ 0, & X \text{ and } Y \text{ are independent} \\ -1, & X \text{ and } Y \text{ are in perfect broad sense disagreement.} \end{cases} \qquad (1)$$

In the Melanoma and Depression study, Y has three levels, 1, 2, 3, representing no or borderline depression, mild or moderate depression, and marked or severe depression, respectively. A $\rho_{bsa}$ of large magnitude would suggest a high capability of finding interpretable cut-off points of Carroll-D corresponding to the three ordinal categorical depression grades.

Remark 1. As an alternative view, the proposed measure $\rho_{bsa}$ reflects a comparison among $F_{X|Y=l}(x)$ ($l = 1, \ldots, L$), where $F_{X|Y=l}(x)$ denotes the conditional distribution of X given Y = l. The special case with $\rho_{bsa} = 1$ indicates that $F_{X|Y=l}(x)$ ($l = 1, \ldots, L$) have disjoint supports which are ordered according to Y.

2.2 Nonparametric Estimation of the Proposed Broad Sense Agreement Measure

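Suppose that the observed data consist of n random samples of (X, Y), denoted by $\{X_i, Y_i\}_{i=1}^{n}$. Without loss of generality, we arrange data as follows:

$$\begin{aligned}
&(X_1, Y_1 = 1), \ldots, (X_{n_1}, Y_{n_1} = 1), \\
&(X_{n_1+1}, Y_{n_1+1} = 2), \ldots, (X_{n_1+n_2}, Y_{n_1+n_2} = 2), \\
&\qquad\vdots \\
&(X_{\sum_{l=1}^{L-1} n_l + 1}, Y_{\sum_{l=1}^{L-1} n_l + 1} = L), \ldots, (X_{\sum_{l=1}^{L} n_l}, Y_{\sum_{l=1}^{L} n_l} = L),
\end{aligned}$$

where $n_l = \sum_{i=1}^{n} I(Y_i = l)$ and $\sum_{l=1}^{L} n_l = n$.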

Let $\Omega_L$ denote the sample space of $\vec{R} = (R_1, \ldots, R_L)$, which consists of the $L!$ permutations of $\{1, \ldots, L\}$. When X and Y are independent, $X^{(*l)}$ ($l = 1, \ldots, L$) have identical distributions and thus $\vec{R}$ has equal probability on all sample points in $\Omega_L$. This implies

$$E\Bigl\{\sum_{l=1}^{L}(l - R_l)^2 \Bigm| X \perp Y\Bigr\} = \sum_{(k_1, \ldots, k_L) \in \Omega_L} \sum_{l=1}^{L}(l - k_l)^2 \Big/ L! \equiv C_L.$$

Some standard algebraic manipulation presented in Appendix A shows that $C_L = (L^3 - L)/6$.
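The closed form for $C_L$ can be checked numerically for small L by averaging the squared rank distance over all L! permutations, exactly as in the displayed sum. The sketch below is only a sanity check under that definition; the function name `c_l_bruteforce` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def c_l_bruteforce(L):
    """Average of sum_l (l - k_l)^2 over all permutations (k_1, ..., k_L)."""
    target = range(1, L + 1)
    total = sum(
        sum((l - k) ** 2 for l, k in zip(target, perm))
        for perm in permutations(target)
    )
    # Exact rational arithmetic avoids any rounding in the comparison.
    return Fraction(total, factorial(L))


if __name__ == "__main__":
    for L in range(2, 7):
        print(L, c_l_bruteforce(L), Fraction(L ** 3 - L, 6))
```

The brute-force enumeration grows as L!, so it is only meant for small L.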

For $1 \le s \le L$, define $X_{s,t_s} = X_{\sum_{l=1}^{s-1} n_l + t_s}$ with $1 \le t_s \le n_s$. Let $\vec{S} = (1, 2, \ldots, L)$, and let $\vec{R}(\cdot)$ be a mapping from $\{D_X\}^L$ to $\Omega_L$: $\vec{R}(\vec{x}) = (r_1, \ldots, r_L)$, where $\vec{x} = (x_1, \ldots, x_L) \in \{D_X\}^L$ and $r_l = \sum_{j=1}^{L} I(x_l \ge x_j)$ denotes the rank of $x_l$ among $\{x_1, \ldots, x_L\}$. Adopting the idea of stratified sampling without replacement, we propose to estimate $E\{\sum_{l=1}^{L}(l - R_l)^2\}$ by

$$W_n = \Bigl(\prod_{l=1}^{L} n_l\Bigr)^{-1} \sum_{j_1=1}^{n_1} \cdots \sum_{j_L=1}^{n_L} \bigl\|\vec{S} - \vec{R}(X_{1,j_1}, \ldots, X_{L,j_L})\bigr\|^2,$$

where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^L$. Note that $(X_{1,j_1}, \ldots, X_{L,j_L})$ is a random realization of $(X^{(*1)}, \ldots, X^{(*L)})$. Therefore, $W_n$ is an unbiased estimator of $E\{\sum_{l=1}^{L}(l - R_l)^2\}$. A natural estimator of $\rho_{bsa}$ is then given by

$$\hat\rho_{bsa} = 1 - \frac{W_n}{C_L}.$$

It is easy to see that under perfect broad sense agreement or disagreement, $\hat\rho_{bsa}$ equals 1 or −1, respectively, with probability 1.
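The estimator can be computed directly from its definition by enumerating every way of picking one observation from each level of Y. The following is a minimal sketch under the assumption that Y is coded 1, ..., L and every level is observed; the function name `rho_bsa_hat` is hypothetical, and the enumeration costs $\prod_l n_l$ rank evaluations, so it is intended for small samples only.

```python
from itertools import product


def rho_bsa_hat(x, y, L):
    """Direct evaluation of 1 - W_n / C_L (illustrative sketch).

    x: continuous measurements; y: ordinal levels coded 1..L.
    """
    groups = [[xi for xi, yi in zip(x, y) if yi == l] for l in range(1, L + 1)]
    if any(len(g) == 0 for g in groups):
        raise ValueError("every level 1..L needs at least one observation")
    c_l = (L ** 3 - L) / 6
    total = 0
    count = 0
    # One observation per level, all combinations (stratified enumeration).
    for combo in product(*groups):
        # r_l = number of selected values not exceeding x_l, i.e. the rank of x_l.
        ranks = [sum(a >= b for b in combo) for a in combo]
        total += sum((l - r) ** 2 for l, r in zip(range(1, L + 1), ranks))
        count += 1
    w_n = total / count
    return 1 - w_n / c_l
```

For example, with `x = [1, 3, 2, 4]` and `y = [1, 1, 2, 2]`, exactly one of the four cross-level pairs is discordant, so the estimate is 1 − 2(1/4) = 0.5.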

It is interesting to note that when L = 2, $\hat\rho_{bsa} = 1 - 2S_n/(\prod_{l=1}^{2} n_l)$, where $S_n$ is the number of strictly discordant pairs, namely, $(X_i, Y_i)$ and $(X_j, Y_j)$ with $(X_i - X_j)(Y_i - Y_j) < 0$. From this view, the form of $\hat\rho_{bsa}$ somewhat resembles that of Kendall’s tau-b, with the major discrepancy lying in $\prod_{l=1}^{2} n_l$ in place of the denominator of Kendall’s tau-b. This distinction essentially indicates a better strategy of handling tied ordinal measurements adopted by $\hat\rho_{bsa}$ for evaluating the correspondence between a continuous scale and an ordinal scale. That is, $\hat\rho_{bsa}$ only compares observations with different Y values. Assessing the ordering of data points within the same level of Y can dilute the information on BSA. For instance, in the example depicted by Figure 1, it is quite suggestive that the observed small overlap in X between the two Y categories may be caused by some random factor such as measurement error. In this case, Kendall’s tau-b is only 0.62, while $\hat\rho_{bsa}$ equals 0.90, which may better reflect the good capability of identifying a cutoff in X in correspondence to the ordinal Y categories, “Low” and “High.” Switching the Y values of the two data points with X = 3 and X = 4 leads to a perfect BSA situation, in which Kendall’s tau-b only equals 0.68. Similar phenomena are observed for other concordance-based measures. For example, computing Stuart’s tau-c in the above two cases gives 0.76 and 0.86, respectively. This example further demonstrates the limitation of using concordance-based association measures to assess BSA.

Figure 1. An example of a high broad sense agreement situation.

Remark 2. Some concordance-based measures, such as Goodman–Kruskal’s gamma, can achieve high values in the example shown in Figure 1. However, they are still prone to the general issues associated with concordance-based measures for assessing BSA, which are discussed in Section 1.

As pointed out by the associate editor, when L = 2, $\hat\rho_{bsa}$ has a one-to-one relationship with the area under the ROC curve (AUROC) for a logistic regression (Hosmer and Lemeshow 2000). Specifically, AUROC equals $1 - S_n/(\prod_{l=1}^{2} n_l)$, and thus $\hat\rho_{bsa} = 1 - 2(1 - \text{AUROC})$. However, in general, $\rho_{bsa}$ and ROC measures are designed to capture different aspects of the correspondence between the continuous X and the ordinal Y. For example, with L > 2, Waegeman, De Baets, and Boullart’s (2008) ROC measure for ordinal regression is estimated by the proportion of the observed ranks of X’s from L subjects, one for each category of Y, exactly matching the expected ranks under perfect broad sense agreement, namely,

$$U = \Bigl(\prod_{l=1}^{L} n_l\Bigr)^{-1} \sum_{j_1=1}^{n_1} \cdots \sum_{j_L=1}^{n_L} I\bigl(\vec{S} = \vec{R}(X_{1,j_1}, \ldots, X_{L,j_L})\bigr).$$

Though the construction of U bears some similarity with that of $\hat\rho_{bsa}$, U is solely focused on differentiating accurate ranking (i.e., $\vec{S} = \vec{R}$) versus inaccurate ranking (i.e., $\vec{S} \ne \vec{R}$) for the classification purpose. In contrast, $\hat\rho_{bsa}$ reveals additional dimensions of the degree of departure from perfect broad sense agreement by using $\|\vec{S} - \vec{R}\|^2$. This fact entails the key distinction between the new concept of broad sense agreement and the existing ROC measures for ordinal regression.
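The contrast between U and the BSA estimator can be made concrete by computing both in a single pass over the same stratified combinations. The sketch below assumes Y is coded 1, ..., L with every level observed; the name `bsa_and_u` is hypothetical and the code is an illustration of the two formulas, not the authors' implementation.

```python
from itertools import product


def bsa_and_u(x, y, L):
    """Return (rho_bsa_hat, U) computed over the same stratified combinations."""
    groups = [[xi for xi, yi in zip(x, y) if yi == l] for l in range(1, L + 1)]
    target = tuple(range(1, L + 1))
    squared = 0
    exact = 0
    count = 0
    for combo in product(*groups):
        ranks = tuple(sum(a >= b for b in combo) for a in combo)
        squared += sum((s - r) ** 2 for s, r in zip(target, ranks))  # ||S - R||^2
        exact += ranks == target                                     # I(S = R)
        count += 1
    rho = 1 - (squared / count) / ((L ** 3 - L) / 6)
    u = exact / count
    return rho, u
```

With L = 3 and one observation per level, `x = [1, 2, 3], y = [1, 3, 2]` (one adjacent swap) and `x = [3, 2, 1], y = [1, 2, 3]` (full reversal) both give U = 0, while the BSA estimate is 0.5 in the first case and −1 in the second. With L = 2, U coincides with the AUROC expression above, so the estimate equals 2U − 1.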

2.3 Asymptotic Properties

We establish the consistency and asymptotic normality of the proposed estimator $\hat\rho_{bsa}$ in Theorems 1 and 2.

Theorem 1. (i) If $\rho_{bsa} = 1$ (or −1), then $\hat\rho_{bsa} = 1$ (or −1) with probability 1. (ii) If $-1 < \rho_{bsa} < 1$, under regularity conditions C1 and C2 given in Appendix B, $\hat\rho_{bsa} \to \rho_{bsa}$ almost surely.

Theorem 2. If $-1 < \rho_{bsa} < 1$, under regularity conditions C1 and C2 given in Appendix B, we have

$$n^{1/2}(\hat\rho_{bsa} - \rho_{bsa}) \to_d N(0, \sigma^2_{bsa}),$$

where the definition of $\sigma_{bsa}$ is provided in Appendix B.

To prove Theorems 1 and 2, the key step is to approximate $\hat\rho_{bsa}$ by a U-statistic, $\hat\rho^{[1]}_{bsa}$, as shown in Appendix B. The standard U-statistic theory, coupled with the fact that the expectation of $\hat\rho^{[1]}_{bsa}$ equals $\rho_{bsa}$ and its deviation from $\hat\rho_{bsa}$ converges to 0, is used to establish the consistency of $\hat\rho_{bsa}$. When studying the asymptotic distribution of $\hat\rho_{bsa}$, we derive the influence functions of $n^{1/2}(\hat\rho_{bsa} - \rho_{bsa})$, appropriately accounting for the difference between $\hat\rho_{bsa}$ and $\hat\rho^{[1]}_{bsa}$. Detailed proofs for Theorems 1 and 2 are provided in Appendix B.


2.4 Inference

We propose to estimate the asymptotic variance of $\hat\rho_{bsa}$ using the jackknife method, given the rather complicated analytic form of $\sigma_{bsa}$. The consistency of the resulting jackknife estimator is ensured by the fact that $n^{1/2}(\hat\rho_{bsa} - \rho_{bsa})$ is asymptotically equivalent to a U-statistic (Arvesen 1969). Specifically, let $\hat\rho^{-i}_{bsa}$ denote the proposed estimator of $\rho_{bsa}$ obtained from the data excluding $(X_i, Y_i)$. The jackknife variance estimator of $\hat\rho_{bsa}$ is given by

$$V(\hat\rho_{bsa}) = \frac{n-1}{n} \sum_{i=1}^{n} \Bigl(\hat\rho^{-i}_{bsa} - n^{-1} \sum_{k=1}^{n} \hat\rho^{-k}_{bsa}\Bigr)^2.$$
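The jackknife formula translates into a short leave-one-out loop. The sketch below repeats a brute-force version of the estimator so that it is self-contained; the names `rho_bsa_hat` and `jackknife_variance` are hypothetical, and it is assumed that every level of Y has at least two observations so that each leave-one-out sample still covers all L levels.

```python
from itertools import product


def rho_bsa_hat(x, y, L):
    """Brute-force 1 - W_n / C_L; y coded 1..L, every level observed."""
    groups = [[xi for xi, yi in zip(x, y) if yi == l] for l in range(1, L + 1)]
    total = count = 0
    for combo in product(*groups):
        ranks = [sum(a >= b for b in combo) for a in combo]
        total += sum((l - r) ** 2 for l, r in zip(range(1, L + 1), ranks))
        count += 1
    return 1 - (total / count) / ((L ** 3 - L) / 6)


def jackknife_variance(x, y, L):
    """(n-1)/n * sum_i (rho^{-i} - mean_k rho^{-k})^2 over leave-one-out estimates."""
    x, y = list(x), list(y)
    n = len(x)
    # Estimate recomputed with observation i removed, for each i.
    loo = [rho_bsa_hat(x[:i] + x[i + 1:], y[:i] + y[i + 1:], L) for i in range(n)]
    mean = sum(loo) / n
    return (n - 1) / n * sum((r - mean) ** 2 for r in loo)
```

For `x = [1, 3, 2, 4]` and `y = [1, 1, 2, 2]`, the four leave-one-out estimates are 0, 1, 1, 0, which gives a jackknife variance of 0.75; a sample this small is only useful for checking the arithmetic.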

One may use a normal approximation to construct confidence intervals for $\rho_{bsa}$. Some transformation, such as Fisher’s z-transformation (Fisher 1921), may be adopted to address the issues related to the skewed distribution of $\hat\rho_{bsa}$ when $\rho_{bsa}$ is close to the boundary −1 or 1. Specifically, let $\phi(\cdot)$ be the selected transformation function and $\dot\phi(x) = d\phi(x)/dx$. The two-sided confidence interval of level $1 - \alpha$ for $\rho_{bsa}$ may be constructed as

$$\Bigl(\phi^{-1}\bigl(\phi(\hat\rho_{bsa}) - z_{1-\alpha/2}\,\dot\phi(\hat\rho_{bsa})\{V(\hat\rho_{bsa})\}^{1/2}\bigr),\ \phi^{-1}\bigl(\phi(\hat\rho_{bsa}) + z_{1-\alpha/2}\,\dot\phi(\hat\rho_{bsa})\{V(\hat\rho_{bsa})\}^{1/2}\bigr)\Bigr),$$

where $z_{1-\alpha/2}$ denotes the $100(1 - \alpha/2)$th percentile of $N(0, 1)$ and $\phi^{-1}(\cdot)$ denotes the inverse function of $\phi(\cdot)$.
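With Fisher's z-transformation, $\phi(z) = \log\{(1+z)/(1-z)\}/2$ is the inverse hyperbolic tangent and its derivative is $1/(1 - z^2)$, so the interval can be sketched in a few lines. The function name `fisher_z_interval` is hypothetical, and the code assumes the usual two-sided normal quantile $z_{1-\alpha/2}$ and an estimate strictly inside (−1, 1).

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_z_interval(rho_hat, var_hat, alpha=0.05):
    """Delta-method interval on the Fisher z scale, mapped back to (-1, 1)."""
    if not -1 < rho_hat < 1:
        raise ValueError("rho_hat must lie strictly between -1 and 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided normal quantile
    dphi = 1 / (1 - rho_hat ** 2)             # derivative of atanh at rho_hat
    half_width = z * dphi * sqrt(var_hat)
    centre = atanh(rho_hat)
    return tanh(centre - half_width), tanh(centre + half_width)
```

Because the end points are mapped back through tanh, the interval always stays inside (−1, 1), which is the practical reason for transforming when the estimate is near the boundary.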

As in traditional agreement studies, complications may arise when the estimated agreement measure falls on the boundary $\{1, -1\}$, in which case $V(\hat\rho_{bsa}) = 0$. We propose a practical remedy for this problem, assuming that the true measure $\rho_{bsa} \in (-1, 1)$. Our strategy is to slightly perturb the observed data by adding L − 1 pseudo data points,

$$\{(X_{s,1}, Y_{s,1} = 1), \ldots, (X_{s,L-1}, Y_{s,L-1} = L - 1)\},$$

where $X_{s,l}$ is chosen to be larger (or smaller) than the smallest (or largest) $X_{l+1,j}$ ($j = 1, \ldots, n_{l+1}$) but smaller (or larger) than the second smallest (or largest) $X_{l+1,j}$ when $\hat\rho_{bsa} = 1$ (or −1). The estimators, $\hat\rho_{bsa}$ and $V(\hat\rho_{bsa})$, may be recomputed based on such perturbed data. As n goes to infinity, such a data perturbation is expected to be asymptotically negligible. Our simulation studies reported in Section 3 suggest that this practical solution works very well in situations with small sample sizes.
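The perturbation step can be sketched as follows. The text only requires each pseudo value to fall between the two most extreme observations of the next level; taking the midpoint is an arbitrary choice made here for illustration. The function name `add_pseudo_points` is hypothetical, and it is assumed that levels 2, ..., L each have at least two observations with distinct values.

```python
def add_pseudo_points(x, y, L, rho_hat):
    """Append L-1 pseudo observations when the estimate sits on the boundary.

    For rho_hat == 1 the pseudo value for level l lies between the smallest and
    second smallest X of level l+1; for rho_hat == -1 it lies between the
    second largest and largest X of level l+1. The midpoint is an assumption.
    """
    new_x, new_y = list(x), list(y)
    if rho_hat not in (1, -1):
        return new_x, new_y          # nothing to do away from the boundary
    for l in range(1, L):
        nxt = sorted(xi for xi, yi in zip(x, y) if yi == l + 1)
        if len(nxt) < 2:
            raise ValueError("level %d needs at least two observations" % (l + 1))
        a, b = (nxt[0], nxt[1]) if rho_hat == 1 else (nxt[-2], nxt[-1])
        new_x.append((a + b) / 2)    # strictly between a and b when a != b
        new_y.append(l)
    return new_x, new_y
```

Each pseudo point creates exactly one discordant comparison with the next level, which pulls the recomputed estimate slightly inside the boundary and yields a nonzero jackknife variance.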

2.5 Extension to Longitudinal Settings

We extend the proposed method to longitudinal settings, where subjects are followed at multiple time points and therefore produce two or more pairs of continuous and ordinal measurements. In such a setting, in addition to cross-sectionally evaluating the BSA at each time point, assessing whether BSA changes over time may serve as a useful approach to validating the finding on the capability of interpreting a continuous scale in an ordinal scale. Evidence for constant BSA over time will also support the robustness of the proposed BSA measure to potential influence from time-varying confounders, such as intervention or disease progression.

Suppose there are J time points ($J \ge 2$). Let $(X_i^{(j)}, Y_i^{(j)})$ denote the (X, Y) observed on the ith subject at the jth time point ($i = 1, \ldots, n$, $j = 1, \ldots, J$). Denote the true and estimated BSA measures at the jth time point by $\rho^{(j)}_{bsa}$ and $\hat\rho^{(j)}_{bsa}$, respectively. The question of the constancy of $\rho_{bsa}$ over time can be formulated as the problem of testing the following hypothesis:

$$H_0: \rho^{(1)}_{bsa} = \cdots = \rho^{(J)}_{bsa}.$$

To test $H_0$, we consider constructing a test statistic based on

$$D_n = \sqrt{n} \begin{pmatrix} \phi(\hat\rho^{(2)}) - \phi(\hat\rho^{(1)}) \\ \vdots \\ \phi(\hat\rho^{(J)}) - \phi(\hat\rho^{(1)}) \end{pmatrix}.$$

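Under $H_0$, we can show that $D_n$ converges to a mean-zero multivariate Gaussian distribution, the covariance of which can be consistently estimated by $\hat\Sigma_n = (\hat\sigma_{ij})_{(J-1)\times(J-1)}$. Here $\hat\sigma_{ij}$ denotes the element on the ith row and jth column of $\hat\Sigma_n$ and is given by

$$\hat\sigma_{ij} = \dot\phi\{(\hat\rho^{(1)} + \hat\rho^{(i+1)})/2\}\,\dot\phi\{(\hat\rho^{(1)} + \hat\rho^{(j+1)})/2\}\,v_{ij},$$

where $\dot\phi(x) = d\phi(x)/dx$ and $v_{ij}$ is the element on the ith row and the jth column of $V_n$, the jackknife estimator of the covariance matrix of $R_n = \sqrt{n}(\hat\rho^{(2)} - \hat\rho^{(1)}, \ldots, \hat\rho^{(J)} - \hat\rho^{(1)})^T$. Specifically, with $R_n^{-k}$ denoting $R_n$ obtained from data without $\{(X_k^{(j)}, Y_k^{(j)}), j = 1, \ldots, J\}$,

$$V_n = \frac{n-1}{n} \sum_{i=1}^{n} \Bigl(R_n^{-i} - \frac{1}{n}\sum_{k=1}^{n} R_n^{-k}\Bigr)\Bigl(R_n^{-i} - \frac{1}{n}\sum_{k=1}^{n} R_n^{-k}\Bigr)^T.$$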

Therefore, the proposed test statistic for $H_0$ takes the form

$$T_n = D_n^T \hat\Sigma_n^{-1} D_n.$$

The limit distribution of $T_n$ is $\chi^2_{J-1}$ under the null hypothesis $H_0$. Therefore, one may reject $H_0$ at significance level $\alpha$ when $T_n$ is greater than the $100(1 - \alpha)$th percentile of $\chi^2_{J-1}$.

When any $\hat\rho^{(j)}_{bsa}$ equals 1, we propose to recompute $T_n$ after inserting L − 1 pseudo data points at all time points in the same manner as that described in Section 2.4.
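The test statistic can be assembled from the pieces defined in this section. The sketch below is one possible reading, not the authors' code: it uses Fisher's z-transformation for $\phi$, computes each leave-one-subject-out $R_n^{-k}$ with the square root of the reduced sample size (the text does not spell out this scaling), and solves the linear system with a small hand-written elimination routine. All function names are hypothetical; every level of Y is assumed to keep at least one observation at every time point after a subject is removed, and no estimate may sit on the boundary.

```python
from itertools import product
from math import atanh, sqrt


def rho_bsa_hat(x, y, L):
    """Brute-force 1 - W_n / C_L; y coded 1..L, every level observed."""
    groups = [[xi for xi, yi in zip(x, y) if yi == l] for l in range(1, L + 1)]
    total = count = 0
    for combo in product(*groups):
        ranks = [sum(a >= b for b in combo) for a in combo]
        total += sum((l - r) ** 2 for l, r in zip(range(1, L + 1), ranks))
        count += 1
    return 1 - (total / count) / ((L ** 3 - L) / 6)


def r_stat(xs, ys, L):
    """sqrt(m) * (rho^(2) - rho^(1), ..., rho^(J) - rho^(1)) for m subjects."""
    rho = [rho_bsa_hat(x, y, L) for x, y in zip(xs, ys)]
    return [sqrt(len(xs[0])) * (r - rho[0]) for r in rho[1:]]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, k):
            f = m[r][c] / m[c][c]
            for j in range(c, k + 1):
                m[r][j] -= f * m[c][j]
    z = [0.0] * k
    for r in reversed(range(k)):
        z[r] = (m[r][k] - sum(m[r][j] * z[j] for j in range(r + 1, k))) / m[r][r]
    return z


def bsa_constancy_statistic(xs, ys, L):
    """T_n = D_n' Sigma_n^{-1} D_n; xs[j][i], ys[j][i] = time point j, subject i."""
    xs, ys = [list(x) for x in xs], [list(y) for y in ys]
    n, J = len(xs[0]), len(xs)
    rho = [rho_bsa_hat(x, y, L) for x, y in zip(xs, ys)]
    d = [sqrt(n) * (atanh(r) - atanh(rho[0])) for r in rho[1:]]
    # Leave-one-subject-out replicates of R_n (subject k dropped at all times).
    loo = [r_stat([x[:k] + x[k + 1:] for x in xs],
                  [y[:k] + y[k + 1:] for y in ys], L) for k in range(n)]
    mean = [sum(col) / n for col in zip(*loo)]
    v = [[(n - 1) / n * sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in loo)
          for j in range(J - 1)] for i in range(J - 1)]
    # Derivative of Fisher's z at the midpoint of rho^(1) and rho^(j+1).
    dphi = [1 / (1 - ((rho[0] + r) / 2) ** 2) for r in rho[1:]]
    sigma = [[dphi[i] * dphi[j] * v[i][j] for j in range(J - 1)]
             for i in range(J - 1)]
    z = solve(sigma, d)
    return sum(di * zi for di, zi in zip(d, z))
```

The returned value is then compared with an upper quantile of the chi-square distribution with J − 1 degrees of freedom (for instance 3.841 for one degree of freedom or 5.991 for two degrees of freedom at the 0.05 level).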

3. MONTE CARLO SIMULATIONS

We conduct Monte Carlo simulations to evaluate the finite-sample performance of the proposed method. We consider setups with one time point and longitudinal setups with three time points.

In the case involving only one time point, we let L = 3 and generate Y from {1, 2, 3} with equal probability. Given Y = y, we generate X from two setups: (I) a normal distribution, $N(y, \sigma)$; (II) a nonnormal distribution, $y + \text{Weibull}(2, \xi)$. In setup (I), we set $\sigma$ as 0.4, 0.8, and 1 to produce high to moderate sizes of BSA, corresponding to $\rho_{bsa}$ = 0.961, 0.773, and 0.681, respectively. In setup (II), we choose $\xi$ = 1, 1.8, and 2.5 such that $\rho_{bsa}$ equals 0.935, 0.760, and 0.624, respectively. In all configurations, we construct confidence intervals choosing $\phi(\cdot)$ as Fisher’s z-transformation; that is, $\phi(z) = \log\{(1 + z)/(1 - z)\}/2$. Results presented in Table 1 are based on 1000 simulated datasets of size N = 40, 60, or 80.
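The data-generating step for setups (I) and (II) can be sketched with the standard library alone. Two readings of the notation are assumed here and are not stated explicitly in the text: $\sigma$ is treated as a standard deviation, and Weibull(2, ξ) is read as shape 2 with scale ξ. The function names `simulate_one_time_point` and `rho_bsa_hat` are hypothetical.

```python
import random
from itertools import product


def rho_bsa_hat(x, y, L):
    """Brute-force 1 - W_n / C_L; y coded 1..L, every level observed."""
    groups = [[xi for xi, yi in zip(x, y) if yi == l] for l in range(1, L + 1)]
    total = count = 0
    for combo in product(*groups):
        ranks = [sum(a >= b for b in combo) for a in combo]
        total += sum((l - r) ** 2 for l, r in zip(range(1, L + 1), ranks))
        count += 1
    return 1 - (total / count) / ((L ** 3 - L) / 6)


def simulate_one_time_point(n, setup, scale, rng):
    """Draw (x, y) for setup 'I' (normal) or 'II' (shifted Weibull), L = 3.

    Assumptions: `scale` is the normal standard deviation in setup I and the
    Weibull scale (with shape fixed at 2) in setup II.
    """
    y = [rng.randint(1, 3) for _ in range(n)]            # equal probability on 1, 2, 3
    if setup == "I":
        x = [rng.gauss(yi, scale) for yi in y]
    else:
        # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        x = [yi + rng.weibullvariate(scale, 2) for yi in y]
    return x, y


if __name__ == "__main__":
    rng = random.Random(2011)
    x, y = simulate_one_time_point(80, "I", 0.8, rng)
    print(rho_bsa_hat(x, y, 3))
```

Repeating the draw many times and averaging the estimates gives the kind of empirical bias and standard deviation summarized in the simulation study.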

Table 1 suggests quite good performance of the proposed agreement method when X given Y follows either a normal distribution or a nonnormal distribution. First, the estimator $\hat\rho_{bsa}$ is virtually unbiased. The estimated standard deviations based on the jackknife method agree very well with the empirical standard deviations. The 95% confidence intervals have coverage probabilities close to the nominal level, even when the true $\rho_{bsa}$ is close to the boundary. It is also noted that the standard deviations of $\hat\rho_{bsa}$ seem to be bounded by 0.1 in most configurations. This suggests that the proposed BSA measure can be estimated quite accurately even with a small dataset of size 40 or 60.

Table 1. Simulation results in cases with one time point: empirical biases (EmpBias), empirical standard deviations (EmpSD), estimated standard deviations (EstSD), and coverage probabilities of 95% confidence intervals (Cov95)

| Setup | Parameter | $\rho_{bsa}$ | N | EmpBias | EmpSD | EstSD | Cov95 |
|-------|-----------|-------|----|---------|-------|-------|-------|
| (I) | σ = 0.4 | 0.961 | 40 | 0.001 | 0.024 | 0.024 | 0.945 |
| | | | 60 | 0.001 | 0.019 | 0.019 | 0.951 |
| | | | 80 | 0.000 | 0.016 | 0.016 | 0.955 |
| | σ = 0.8 | 0.773 | 40 | −0.001 | 0.078 | 0.080 | 0.959 |
| | | | 60 | 0.003 | 0.064 | 0.063 | 0.942 |
| | | | 80 | 0.000 | 0.055 | 0.054 | 0.950 |
| | σ = 1.0 | 0.681 | 40 | −0.001 | 0.101 | 0.103 | 0.946 |
| | | | 60 | 0.003 | 0.082 | 0.082 | 0.941 |
| | | | 80 | 0.001 | 0.068 | 0.071 | 0.950 |
| (II) | ξ = 1 | 0.935 | 40 | 0.000 | 0.034 | 0.034 | 0.963 |
| | | | 60 | 0.000 | 0.027 | 0.027 | 0.960 |
| | | | 80 | 0.000 | 0.024 | 0.023 | 0.954 |
| | ξ = 1.8 | 0.760 | 40 | 0.003 | 0.082 | 0.083 | 0.942 |
| | | | 60 | 0.001 | 0.066 | 0.066 | 0.938 |
| | | | 80 | 0.000 | 0.059 | 0.058 | 0.941 |
| | ξ = 2.5 | 0.624 | 40 | 0.000 | 0.112 | 0.116 | 0.949 |
| | | | 60 | 0.001 | 0.092 | 0.093 | 0.943 |
| | | | 80 | 0.000 | 0.079 | 0.080 | 0.935 |

We also evaluate the proposed test for $H_0$ in longitudinal settings with three time points. Specifically, we generate $Y_i^{(1)}$ from {1, 2, 3} with equal probability, and let

$$Y_i^{(2)} = \psi(Y_i^{(1)})\,I(U_i^{(1)} < p) + Y_i^{(1)}\,I(U_i^{(1)} \ge p), \qquad Y_i^{(3)} = \psi(Y_i^{(2)})\,I(U_i^{(2)} < p) + Y_i^{(2)}\,I(U_i^{(2)} \ge p),$$

where $\{U_i^{(k)}\}_{i=1}^{n}$ are n iid Unif(0, 1) variates, and $\psi(y) = (y + 1) - 3\max\{0, y - 2\}$. Here, the superscript (k) indicates the time point (k = 1, 2, 3). The parameter p represents the proportion of subjects with changing Y values over time, and we let p = 0 or 0.1. Given $(Y_i^{(1)}, Y_i^{(2)}, Y_i^{(3)}) = (y_1, y_2, y_3)$, we generate $(X_i^{(1)}, X_i^{(2)}, X_i^{(3)})$ from the following two configurations: (III) a multivariate normal distribution $N(\mu, \Sigma)$, where $\mu = (y_1, y_2, y_3)^T$ and

$$\Sigma = \begin{pmatrix} \sigma^2 & c\sigma^2\alpha & c^2\sigma^2\alpha \\ c\sigma^2\alpha & c^2\sigma^2 & c^3\sigma^2\alpha \\ c^2\sigma^2\alpha & c^3\sigma^2\alpha & c^4\sigma^2 \end{pmatrix}$$

with $\alpha = 0.6$; (IV) a three-variate Gaussian copula model (Song 2000) with the dispersion matrix

$$\begin{pmatrix} 1.0 & 0.6 & 0.6 \\ 0.6 & 1.0 & 0.6 \\ 0.6 & 0.6 & 1.0 \end{pmatrix},$$

and the marginal of $X^{(j)}$ is set as $y_j + \text{Weibull}(2, c^{j-1}\xi)$ (j = 1, 2, 3). The setups (III) and (IV) mimic the scenarios where the longitudinal X values are moderately positively correlated.
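The longitudinal design for setup (III) can be sketched as follows. Because the covariance matrix has standard deviations $\sigma, c\sigma, c^2\sigma$ and a common correlation $\alpha$, the three normal components can be built from one shared and three independent standard normal draws; this construction is a convenience chosen here, not something described in the text. The function names are hypothetical, and setup (IV) is not covered.

```python
import random
from math import sqrt


def psi(y):
    """Cyclic shift on {1, 2, 3}: 1 -> 2, 2 -> 3, 3 -> 1."""
    return (y + 1) - 3 * max(0, y - 2)


def simulate_longitudinal(n, p, c, sigma, alpha=0.6, rng=None):
    """Draw xs[j][i], ys[j][i] for time points j = 0, 1, 2 under setup (III)."""
    rng = rng or random.Random()
    xs = [[], [], []]
    ys = [[], [], []]
    for _ in range(n):
        y1 = rng.randint(1, 3)
        y2 = psi(y1) if rng.random() < p else y1      # changes with probability p
        y3 = psi(y2) if rng.random() < p else y2
        shared = rng.gauss(0, 1)                      # induces correlation alpha
        for j, yj in enumerate((y1, y2, y3)):
            z = sqrt(alpha) * shared + sqrt(1 - alpha) * rng.gauss(0, 1)
            xs[j].append(yj + (c ** j) * sigma * z)   # sd c^j * sigma at time j + 1
            ys[j].append(yj)
    return xs, ys
```

With c = 1 the three time points share the same conditional spread, so the null hypothesis of constant BSA holds; larger c inflates the spread at later time points and lowers the BSA there.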

The values of $\rho_{bsa}$ under different combinations of $(c, \sigma)$ or $(c, \xi)$ are presented in Table 2. Note that when c = 1, the BSA between X and Y is the same at all three time points and thus $H_0$ holds. As indicated in Table 2, the longitudinal difference in BSA increases with c.

In Table 2, we report the empirical rejection rates for testing $H_0$ based on the proposed test under setups (III) and (IV). Note that in the configurations with c = 1, the empirical rejection rates stand for the empirical sizes of the proposed test. We observe from Table 2 that with either normal or nonnormal data configurations, the empirical sizes of the proposed test are rather close to the nominal level. The power, even with a sample size as small as 40, also appears satisfactory. As expected, when c increases, yielding a bigger change in $\rho_{bsa}$ over time, the proposed test demonstrates larger power to detect the change in $\rho_{bsa}$.

4. AN APPLICATION TO THE MELANOMA AND DEPRESSION STUDY

The Melanoma and Depression study is a double-blind study of 40 patients with malignant melanoma, who were randomly assigned to receive the antidepressant paroxetine or placebo (Musselman et al. 2001). The primary goal of this study was to determine whether the prophylactic antidepressant treatment could effectively reduce the development of clinical depression.

Patients were evaluated at baseline (i.e., before the administration of the study drug) and at week 4, week 8, and week 12 follow-up visits. At each visit, assessments included the dimensional 21-item Hamilton Depression Scale (HAM-D), administered by clinicians, and a self-report dimensional scale (Carroll-D). While there are established rules for reporting graded severity of depression (for example, mild depression) based on the HAM-D scale (Potts et al. 1990), how the self-report Carroll-D relates to the depression grade so defined remains unclear. If a close correspondence between Carroll-D and depression grade is identified, the simpler Carroll-D may replace the more time-consuming HAM-D in determining the ordinal depression grade, thereby saving clinician time and effort while providing the needed information.

The focus of our analysis is to assess the correspondence between Carroll-D scores and depression grades. In Figure 2, we plot the depression grade versus Carroll-D score at baseline and at 4 weeks, 8 weeks, and 12 weeks after the start of the assigned treatment. Here, depression grades are determined based on a well-established guideline (Potts et al. 1990): HAM-D scores 0–11 correspond to no or borderline depression; 11–25 correspond to mild or moderate depression; and ≥25 correspond to marked or severe depression. It is observed from Figure 2 that, at each time point, at most two subjects were diagnosed with marked or severe depression. To avoid unstable inferences, we combine mild or moderate depression and marked or severe depression into one category, and examine the interpretability of Carroll-D in terms of two depression grades: no or borderline depression, and mild to severe depression.

Table 2. Simulation results for longitudinal settings with three time points: empirical rejection rates for testing $H_0: \rho_{\mathrm{bsa}}^{(1)} = \rho_{\mathrm{bsa}}^{(2)} = \rho_{\mathrm{bsa}}^{(3)}$

                                    c = 1                    c = 1.5                    c = 2
                                    (no BSA change)          (moderate BSA increase)    (large BSA increase)

Setup (III)              σ:         0.4    0.8    1.0        0.4    0.8    1.0          0.4    0.8    1.0
  $\rho_{\mathrm{bsa}}^{(1)}$       0.961  0.773  0.681      0.961  0.773  0.681        0.961  0.773  0.681
  $\rho_{\mathrm{bsa}}^{(2)}$       0.961  0.773  0.681      0.872  0.603  0.508        0.773  0.482  0.397
  $\rho_{\mathrm{bsa}}^{(3)}$       0.961  0.773  0.681      0.726  0.436  0.357        0.482  0.258  0.209
  p = 0      N = 40                 0.060  0.046  0.039      0.943  0.738  0.605        0.999  0.948  0.882
             N = 60                 0.065  0.040  0.043      0.993  0.931  0.839        1.000  0.999  0.983
             N = 80                 0.050  0.041  0.046      1.000  0.977  0.919        1.000  0.999  0.999
  p = 0.1    N = 40                 0.065  0.041  0.039      0.915  0.666  0.525        0.998  0.914  0.811
             N = 60                 0.063  0.044  0.047      0.990  0.872  0.730        1.000  0.991  0.953
             N = 80                 0.060  0.045  0.051      1.000  0.942  0.844        1.000  0.999  0.983

Setup (IV)               ξ:         1      1.8    2.5        1      1.8    2.5          1      1.8    2.5
  $\rho_{\mathrm{bsa}}^{(1)}$       0.935  0.760  0.624      0.935  0.760  0.624        0.935  0.760  0.624
  $\rho_{\mathrm{bsa}}^{(2)}$       0.935  0.760  0.624      0.826  0.591  0.457        0.718  0.473  0.359
  $\rho_{\mathrm{bsa}}^{(3)}$       0.935  0.760  0.624      0.666  0.426  0.321        0.431  0.255  0.184
  p = 0      N = 40                 0.068  0.038  0.039      0.892  0.681  0.481        0.996  0.946  0.805
             N = 60                 0.059  0.045  0.044      0.989  0.897  0.716        1.000  0.998  0.959
             N = 80                 0.056  0.045  0.043      0.998  0.966  0.858        1.000  1.000  0.992
  p = 0.1    N = 40                 0.068  0.047  0.046      0.865  0.601  0.416        0.995  0.883  0.693
             N = 60                 0.066  0.055  0.059      0.978  0.822  0.598        1.000  0.984  0.872
             N = 80                 0.063  0.049  0.050      0.996  0.923  0.752        1.000  0.998  0.958

The top section of Table 3 presents the estimated $\rho_{\mathrm{bsa}}$ and the associated standard errors and confidence intervals. Fisher's z-transformation is used in computing standard errors and confidence intervals. The estimates for $\rho_{\mathrm{bsa}}$ at baseline and at the 4- and 12-week follow-up visits suggest a relatively high capability of interpreting the continuous Carroll-D in terms of no or borderline depression and mild to severe depression, with score 11 being a possible clinical cutoff. However, the confidence intervals for $\rho_{\mathrm{bsa}}$ may not be tight enough to make this finding conclusive. Studies of larger sample size may be warranted to confirm the utility of Carroll-D as a basis for determining graded depression severity.
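To make the role of Fisher's z-transformation concrete, the sketch below builds a 95% interval on the z scale and maps it back. It assumes that the reported SE is on the original scale and is carried to the z scale by the delta method; that is one plausible reading, since the exact procedure of Section 2.4 is not reproduced in this part of the article. The function name `fisher_z_interval` is hypothetical.

```python
import math


def fisher_z_interval(estimate, se, z_crit=1.959964):
    """Approximate CI for a coefficient in (-1, 1) via Fisher's z-transformation."""
    z = math.atanh(estimate)            # z = 0.5 * log((1 + r) / (1 - r))
    se_z = se / (1.0 - estimate ** 2)   # delta method: dz/dr = 1 / (1 - r^2)
    # Symmetric interval on the z scale, mapped back with tanh.
    return math.tanh(z - z_crit * se_z), math.tanh(z + z_crit * se_z)


# Week-8 estimate and SE as reported in Table 3.
lower, upper = fisher_z_interval(0.676, 0.178)
print(round(lower, 3), round(upper, 3))
```

Under these assumptions the week-8 inputs give an interval close to the reported (0.179, 0.898); small discrepancies are expected because the inputs are rounded to three decimals. The back-transformed interval is asymmetric around the estimate and always stays inside (−1, 1).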

It is also seen from Table 3 that the estimated BSA measures demonstrate a U-shaped pattern across the four time points of this study, reaching the bottom at the week 8 follow-up. We investigate this phenomenon by applying the proposed test $T_n$ to compare $\rho_{\mathrm{bsa}}$ between any pair of time points. The bottom section of Table 3 reports the differences in $\rho_{\mathrm{bsa}}$, the observed test statistics, and the corresponding p-values. The results are consistent with the cross-sectional BSA estimates. That is, the BSA between Carroll-D and graded depression appears to be fairly robust over time regardless of the presence of antidepressant intervention. The observed $\rho_{\mathrm{bsa}}$ difference between week 8 and baseline and that between week 8 and week 12, however, may not be ignorable. An explanation may be linked to the sharply increased patient dropout rate at the week 8 follow-up, which may indicate that 8 weeks after the start of study medication is a change point for the mental health of patients undergoing malignant melanoma therapy. The looser correspondence between the self-reported Carroll-D and the clinician-rated depression grade may be caused by the slower adjustment of a self-evaluation-based instrument during the transition of depression status. We also conduct a comparison of $\rho_{\mathrm{bsa}}$ across all four time points, which yields $T_n = 5.533$ with a corresponding p-value of 0.137. This p-value may reflect the decreased power of $T_n$, partially due to the smaller sample size when all four time points are considered. The observed U-shaped pattern of $\rho_{\mathrm{bsa}}$ may be further investigated through a larger-scale study.
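The reported p-values can be checked numerically. The sketch below assumes, without confirmation from this part of the article, that a pairwise statistic is referred to the standard normal distribution (two-sided) and that the four-time-point statistic is referred to a chi-square distribution with 3 degrees of freedom. The function names are hypothetical, and only the Python standard library is used.

```python
import math


def two_sided_normal_p(t):
    """Two-sided p-value for a statistic referred to N(0, 1)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    # Closed form for 3 df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


print(round(two_sided_normal_p(-2.642), 3))  # week 8 vs. baseline
print(round(two_sided_normal_p(2.360), 3))   # week 12 vs. week 8
print(round(chi2_sf_3df(5.533), 3))          # all four time points
```

Under these assumed reference distributions, the three calls agree with the values 0.008, 0.018, and 0.137 in Table 3 to within rounding, which supports the assumption but does not replace the formal definition of $T_n$.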

5. REMARKS

In this article we propose a novel concept of broad sense agreement, which aims to measure the capability to interpret continuous measurements in ordinal categories of interest. We formulate a broad sense agreement measure following the intrinsic order consistency requirement. Estimation and inference procedures are developed for the one-sample case and also for the longitudinal setting. As suggested by the motivating example of the Melanoma and Depression study, this research endeavor may help determine appropriate instrument replacement, thereby saving cost and time in clinical studies.

Our simulation studies demonstrate quite stable small-sample performance of the proposed method. Since the proposed measure is designed to compare only $X$ values of observations from distinct levels of $Y$, ties pose little difficulty for our method under the assumption that $X$ is continuous. When ties in $X$ are present in a real dataset, evenly breaking the ties appears to work well based on our numerical experience (unreported).


Figure 2. Melanoma and Depression study: depression grade versus Carroll-D score at baseline, 4, 8, 12 week follow-up visits.

It is worth mentioning that existing non-concordance-based association measures, such as the Pearson correlation coefficient, are generally inadequate for measuring broad sense agreement. For example, the Pearson correlation coefficient, by its definition, is not designed to capture the order consistency between $X$ and $Y$ but is targeted at quantifying the linear association between $X$ and $Y$. The Pearson correlation coefficient equals 1 only if $X$ and $Y$ strictly follow a linear relationship. With continuous $X$ and ordinal $Y$, the Pearson correlation coefficient would not attain the upper bound of 1 even when $X$ and $Y$ are in perfect broad sense agreement. In the example depicted in Figure 1, the Pearson correlation coefficient is as low as 0.72 and increases only slightly, to 0.80, after a data perturbation that makes the $X$'s in the low $Y$ category have no overlap with the $X$'s in the high $Y$ category. These observations further justify the need for the proposed work to develop a new measure to assess the replaceability of one established continuous instrument by a convenient ordinal scale.
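The point about the Pearson correlation coefficient can be seen on a small hypothetical dataset (not the data of Figure 1). In the sketch below, the $X$ values of the three $Y$ categories do not overlap, so the ordering is perfect, yet the Pearson correlation stays below 1 because the relationship is not linear. All values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical data: X is continuous, Y is ordinal with three levels, and the
# X ranges of the three Y levels do not overlap (perfect ordering).
x = np.array([0.1, 0.2, 0.3, 1.5, 1.7, 5.0, 9.0])
y = np.array([1, 1, 1, 2, 2, 3, 3])

# Check that every X in a lower Y category lies below every X in a higher one.
perfectly_ordered = all(
    x[y == a].max() < x[y == b].min()
    for a in (1, 2)
    for b in (2, 3)
    if a < b
)

# Pearson correlation between the continuous X and the ordinal Y codes.
pearson = np.corrcoef(x, y)[0, 1]
print(perfectly_ordered, round(pearson, 3))
```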

Table 3. Melanoma and Depression study: cross-sectional estimates for $\rho_{\mathrm{bsa}}$ and associated standard errors (SE) and 95% confidence intervals (CI); comparisons of $\rho_{\mathrm{bsa}}$ between any two time points including differences in $\rho_{\mathrm{bsa}}$ (DIFF), test statistics ($T_n$) and the corresponding p-values; and the comparison of $\rho_{\mathrm{bsa}}$ across all four time points including the test statistic ($T_n$) and the p-value

Time                     N     $\hat\rho_{\mathrm{bsa}}$   SE       95% CI
Baseline                 40    0.941                       0.089    (0.225, 0.997)
4 weeks                  38    0.857                       0.099    (0.503, 0.965)
8 weeks                  31    0.676                       0.178    (0.179, 0.898)
12 weeks                 26    0.925                       0.086    (0.421, 0.993)

Comparison               N     DIFF      $T_n$     p-value
Week 4 vs. Baseline      38    −0.080    −0.649    0.516
Week 8 vs. Baseline      31    −0.324    −2.642    0.008
Week 12 vs. Baseline     26    −0.075    −0.343    0.732
Week 8 vs. Week 4        31    −0.102    −0.413    0.680
Week 12 vs. Week 4       26    0.142     0.906     0.365
Week 12 vs. Week 8       26    0.408     2.360     0.018
All time points          26              5.533     0.137

We also emphasize that the proposed framework of broad sense agreement, though designed to evaluate the feasibility of interpreting the continuous $X$ in the ordinal scale $Y$, does not require identification of the cut-off points of $X$ corresponding to $Y$ categories. This feature gives the new method an advantage over an alternative two-stage approach, which first determines the optimal cut-off points of $X$, for example, by jackknife discriminant analysis, and then calculates the weighted kappa between $Y$ and the classified $X$. The value of the weighted kappa derived from the two-stage approach depends on the selected cut-off points, which are both data-dependent and method-dependent, that is, dependent on the method used to determine the cut-off. This can reduce the interpretability and generalizability of the resulting weighted kappa. In comparison, $\rho_{\mathrm{bsa}}$ does not require the specification of such cut-off points and is a more direct and objective measure of agreement between a continuous and an ordinal variable.

As noted in Section 2.4, $\hat\rho_{\mathrm{bsa}}$ can have a skewed distribution when $\rho_{\mathrm{bsa}}$ is close to the boundary $-1$ or $1$. In practice, a transformation can be adopted to improve the finite-sample inference on $\rho_{\mathrm{bsa}}$. In our numerical studies, we choose Fisher's z-transformation because it has been very successful in addressing the same problem for the correlation coefficient, where it is also supported by theoretical justifications (Fisher 1921; Hotelling 1953). Given the strong supporting evidence from our simulation studies, we conjecture that theory justifying the use of Fisher's z-transformation in the proposed broad sense agreement framework may also be derived. A detailed investigation along this direction merits future research.

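APPENDIX A: JUSTIFICATION FOR $-1 \le \rho_{\mathrm{bsa}} \le 1$ AND EQUATION (1)

It is easily seen from the definition of $\rho_{\mathrm{bsa}}$ that $\rho_{\mathrm{bsa}} \le 1$. Next, we show that for any $(k_1, \ldots, k_L) \in \mathcal{P}_L$,

$$\sum_{l=1}^{L} (l - k_l)^2 \le 2C_L. \tag{A.1}$$

If (A.1) holds, then we have $E\{\sum_{l=1}^{L} (l - R_l)^2\} \le 2C_L$, and hence $\rho_{\mathrm{bsa}} \ge 1 - 2C_L/C_L = -1$. Because $\sum_{l=1}^{L} k_l^2 = \sum_{l=1}^{L} l^2$, showing (A.1) is equivalent to proving that $2\sum_{l=1}^{L} l \cdot k_l \ge 2\sum_{l=1}^{L} l^2 - 2C_L$. First, applying the Cauchy–Schwarz inequality, we get

$$\sum_{l=1}^{L} 2l(L + 1 - k_l) \le \sum_{l=1}^{L} l^2 + \sum_{l=1}^{L} (L + 1 - k_l)^2 = 2\sum_{l=1}^{L} l^2. \tag{A.2}$$

The last equality follows from the fact that $(k_1, \ldots, k_L)$ is a permutation of $(1, \ldots, L)$. It is implied by (A.2) that

$$2\sum_{l=1}^{L} l \cdot k_l \ge \sum_{l=1}^{L} 2l(L + 1) - 2\sum_{l=1}^{L} l^2 = \frac{1}{3}L^3 + L^2 + \frac{2}{3}L = 2\sum_{l=1}^{L} l^2 - \frac{1}{3}(L^3 - L). \tag{A.3}$$

Now it only remains to show that $C_L = (L^3 - L)/6$. First, we write

$$C_L = \sum_{(k_1,\ldots,k_L) \in \mathcal{P}_L} \sum_{l=1}^{L} \frac{l^2}{L!} + \sum_{(k_1,\ldots,k_L) \in \mathcal{P}_L} \sum_{l=1}^{L} \frac{k_l^2}{L!} - \sum_{(k_1,\ldots,k_L) \in \mathcal{P}_L} \sum_{l=1}^{L} \frac{2l \cdot k_l}{L!} \equiv \mathrm{I} + \mathrm{II} - \mathrm{III}.$$

It is easy to see that

$$\mathrm{I} = \mathrm{II} = \frac{1}{3}L^3 + \frac{1}{2}L^2 + \frac{1}{6}L.$$

To calculate III, the key is to note that

$$\sum_{(k_1,\ldots,k_L) \in \mathcal{P}_L} \sum_{l=1}^{L} l \cdot k_l = \sum_{(k_1,\ldots,k_L) \in \mathcal{P}_L} \sum_{l=1}^{L} (L + 1 - l) \cdot k_l.$$

Therefore,

$$\sum_{(k_1,\ldots,k_L) \in \mathcal{P}_L} \sum_{l=1}^{L} 2l \cdot k_l = \sum_{(k_1,\ldots,k_L) \in \mathcal{P}_L} \sum_{l=1}^{L} (L + 1) \cdot k_l = L(L + 1)^2 L!/2.$$

This implies that $\mathrm{III} = L(L + 1)^2/2$, and hence

$$C_L = \frac{2}{3}L^3 + L^2 + \frac{1}{3}L - L(L + 1)^2/2 = (L^3 - L)/6. \tag{A.4}$$

Combining (A.3) and (A.4) implies (A.1) and thus completes the proof of $\rho_{\mathrm{bsa}} \ge -1$.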
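The two facts established above, $C_L = (L^3 - L)/6$ and the bound (A.1), can be checked by brute force for small $L$. The sketch below takes $C_L$ to be the average of $\sum_l (l - k_l)^2$ over all $L!$ equally likely permutations, which is how the decomposition I + II − III reads. The function names are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from itertools import permutations


def squared_distance(perm):
    """Sum over l of (l - k_l)^2 for a permutation (k_1, ..., k_L)."""
    return sum((l - k) ** 2 for l, k in enumerate(perm, start=1))


def check_appendix_a(L):
    """Return (C_L, largest distance) by enumerating all L! permutations."""
    dists = [squared_distance(p) for p in permutations(range(1, L + 1))]
    c_L = Fraction(sum(dists), len(dists))  # average over equally likely permutations
    return c_L, max(dists)


for L in range(2, 7):
    c_L, worst = check_appendix_a(L)
    assert c_L == Fraction(L**3 - L, 6)  # closed form (A.4)
    assert worst == 2 * c_L              # bound (A.1), attained by the reversed order
```

For $L = 3$, for instance, the six permutations give distances 0, 2, 2, 6, 6, and 8, so the average is 4 and the maximum is 8.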

To prove Equation (1), the key is to note that when $X$ and $Y$ are in perfect broad sense agreement (or disagreement), $(R_1, \ldots, R_L) = (1, \ldots, L)$ (or $(L, \ldots, 1)$) with probability 1. Therefore, when there is perfect broad sense agreement (or disagreement), $E\{\sum_{l=1}^{L} (l - R_l)^2\}$ equals 0 (or $\sum_{l=1}^{L} \{l - (L + 1 - l)\}^2 = 2C_L$), rendering $\rho_{\mathrm{bsa}} = 1$ (or $-1$). When $X$ and $Y$ are independent, we immediately have $\rho_{\mathrm{bsa}} = 0$ by the definition of $\rho_{\mathrm{bsa}}$.

APPENDIX B: PROOFS OF THEOREMS 1 AND 2

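We first introduce necessary notation and regularity conditions. Let $Z_i = (X_i, Y_i)^T$ and $\Lambda_{n,L} = \{(j_1, \ldots, j_L) : 1 \le j_l \le n,\ j_1, \ldots, j_L \text{ are distinct}\}$. For $(m_1, \ldots, m_L) \in \Lambda_{n,L}$, define

$$\Phi(Z_{m_1}, \ldots, Z_{m_L}) = I\big((Y_{m_1}, \ldots, Y_{m_L}) \in \mathcal{P}_L\big) \times \sum_{k=1}^{L} \Big\{ Y_{m_k} - \sum_{r=1}^{L} I(X_{m_k} \ge X_{m_r}) \Big\}^2.$$

Let $p_l = \Pr(Y = l)$ for $l = 1, \ldots, L$ and $\gamma_L = 1/(C_L \cdot L!)$. Define $h(Z_{m_1}, \ldots, Z_{m_L}) = 1 - \Phi(Z_{m_1}, \ldots, Z_{m_L})\, \gamma_L \big(\prod_{l=1}^{L} p_l\big)^{-1}$, $h_1(z_1) = E\{h(z_1, Z_2, \ldots, Z_L)\}$, $\tilde h_1(z_1) = h_1(z_1) - \rho_{\mathrm{bsa}}$, and $\zeta_1 = \mathrm{var}\{h_1(Z_1)\}$.

Regularity conditions include:

(C1) $p_l > 0$ for $l = 1, \ldots, L$;
(C2) $\zeta_1 > 0$.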
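To show what the kernel computes, the sketch below evaluates a plug-in estimate on a small sample. It assumes that the estimator equals one minus the average, over all tuples taking one observation from each category, of $\sum_l (l - R_l)^2$, divided by $C_L$. That is one reading of the notation in this appendix, not a transcription of the authors' implementation, and the function names are hypothetical. The enumeration costs $\prod_l n_l$ operations, so it is only practical for small samples.

```python
from itertools import product


def c_L(L):
    """Normalizing constant C_L = (L^3 - L) / 6 from Appendix A."""
    return (L**3 - L) / 6


def bsa_estimate(x, y, L):
    """Plug-in estimate: 1 - (mean over tuples of sum_l (l - R_l)^2) / C_L."""
    groups = [[xi for xi, yi in zip(x, y) if yi == l] for l in range(1, L + 1)]
    if any(len(g) == 0 for g in groups):
        raise ValueError("every category 1..L needs at least one observation")
    total, count = 0.0, 0
    # Each tuple takes one X value from each category l = 1, ..., L.
    for combo in product(*groups):
        # R_l = rank of the category-l value among the L values in the tuple.
        ranks = [sum(v >= w for w in combo) for v in combo]
        total += sum((l - r) ** 2 for l, r in enumerate(ranks, start=1))
        count += 1
    return 1.0 - total / (count * c_L(L))


x = [0.1, 0.2, 1.5, 1.7, 5.0, 9.0]
print(bsa_estimate(x, [1, 1, 2, 2, 3, 3], L=3))  # X ordered with Y
print(bsa_estimate(x, [3, 3, 2, 2, 1, 1], L=3))  # X ordered against Y
```

The two calls illustrate the boundary cases of Equation (1): non-overlapping $X$ values ordered with $Y$ give the upper bound, and the reversed labeling gives the lower bound.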


Proof of Theorem 1

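When $\rho_{\mathrm{bsa}} = 1$ (or $-1$), the definition of $\rho_{\mathrm{bsa}}$ implies that with probability 1, $(R_1, \ldots, R_L) = (1, \ldots, L)$ [or $(L, \ldots, 1)$], and then equivalently $X^{(1)} < \cdots < X^{(L)}$ [or $X^{(1)} > \cdots > X^{(L)}$], which implies $\hat\rho_{\mathrm{bsa}} = 1$ (or $-1$).

To prove Theorem 1(ii), first note that $\Phi(\cdot)$ is symmetric in its arguments, and

$$W_n = \Big( \prod_{l=1}^{L} n_l \Big)^{-1} \sum_{(m_1,\ldots,m_L) \in \Lambda_{n,L}} \Phi(Z_{m_1}, \ldots, Z_{m_L}).$$

Therefore, the proposed estimator of $\rho_{\mathrm{bsa}}$ can be written as

$$\hat\rho_{\mathrm{bsa}} = \binom{n}{L}^{-1} \sum_{(m_1,\ldots,m_L) \in \Lambda_{n,L}} \big\{ 1 - \Phi(Z_{m_1}, \ldots, Z_{m_L})\, M^*_{n,L} \big\},$$

where $M^*_{n,L} = \binom{n}{L} \big/ \{C_L \cdot \prod_{l=1}^{L} n_l\}$. By (C1) and the strong law of large numbers, $M^*_{n,L} \to \gamma_L \times \big(\prod_{l=1}^{L} p_l\big)^{-1}$ almost surely. Then we can rewrite $\hat\rho_{\mathrm{bsa}}$ as

$$\hat\rho_{\mathrm{bsa}} = \hat\rho^{[1]}_{\mathrm{bsa}} - \hat\rho^{[2]}_{\mathrm{bsa}}, \tag{B.1}$$

where

$$\hat\rho^{[1]}_{\mathrm{bsa}} = \binom{n}{L}^{-1} \sum_{(m_1,\ldots,m_L) \in \Lambda_{n,L}} h(Z_{m_1}, \ldots, Z_{m_L})$$

and

$$\hat\rho^{[2]}_{\mathrm{bsa}} = \binom{n}{L}^{-1} \sum_{(m_1,\ldots,m_L) \in \Lambda_{n,L}} \Phi(Z_{m_1}, \ldots, Z_{m_L}) \times \Big\{ M^*_{n,L} - \gamma_L \Big( \prod_{l=1}^{L} p_l \Big)^{-1} \Big\}.$$

Note that $\hat\rho^{[1]}_{\mathrm{bsa}}$ is a U-statistic with kernel $h(\cdot)$ of order $L$. Since $\Phi(Z_{m_1}, \ldots, Z_{m_L}) \le L(L - 1)^2$, we have $E\{h(Z_{m_1}, \ldots, Z_{m_L})^2\} < \infty$. It is also easy to see that $E\{\Phi(Z_1, \ldots, Z_L)\} = a_L$, where $a_L = \big(L! \prod_{l=1}^{L} p_l\big) \cdot E\{\sum_{l=1}^{L} (l - R_l)^2\}$. This implies $E\{h(Z_{m_1}, \ldots, Z_{m_L})\} = \rho_{\mathrm{bsa}}$. Therefore, by condition (C2), it follows from theorem 3.5 of the book by Shao (2003) that

$$\hat\rho^{[1]}_{\mathrm{bsa}} \to \rho_{\mathrm{bsa}}, \quad \text{a.s.}$$

To complete the proof of Theorem 1(ii), it now suffices to show that $\hat\rho^{[2]}_{\mathrm{bsa}} \to 0$, a.s. This follows immediately from the boundedness of $\Phi(Z_{m_1}, \ldots, Z_{m_L})$ and the convergence of $M^*_{n,L}$ to $\gamma_L \big(\prod_{l=1}^{L} p_l\big)^{-1}$.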

Proof of Theorem 2

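When $-1 < \rho_{\mathrm{bsa}} < 1$, under conditions (C1) and (C2), applying the projection method to the U-statistic $\hat\rho^{[1]}_{\mathrm{bsa}}$ (Shao 2003) gives

$$\hat\rho^{[1]}_{\mathrm{bsa}} - \rho_{\mathrm{bsa}} = n^{-1} \sum_{i=1}^{n} \tilde h_1(Z_i) + o(n^{-1/2}). \tag{B.2}$$

Let $\hat p_l = n_l/n \equiv n^{-1} \sum_{i=1}^{n} I(Y_i = l)$ for $1 \le l \le L$. By some simple algebraic manipulation, we get

$$M^*_{n,L} - \gamma_L \Big( \prod_{l=1}^{L} p_l \Big)^{-1} = \gamma_L \Big\{ \Big( \prod_{l=1}^{L} \hat p_l \Big)^{-1} - \Big( \prod_{l=1}^{L} p_l \Big)^{-1} \Big\} + o(n^{-1/2}).$$

Using Taylor expansion and standard asymptotic arguments, we can show that

$$M^*_{n,L} - \gamma_L \Big( \prod_{l=1}^{L} p_l \Big)^{-1} = n^{-1} \sum_{i=1}^{n} \frac{\sum_{l=1}^{L} \gamma_L \cdot \big( \prod_{1 \le j \le L,\, j \ne l} p_j \big) \cdot \{ I(Y_i = l) - p_l \}}{\big( \prod_{l=1}^{L} p_l \big)^2} + o(n^{-1/2}). \tag{B.3}$$

The U-statistic theory also implies that

$$\binom{n}{L}^{-1} \sum_{(m_1,\ldots,m_L) \in \Lambda_{n,L}} \Phi(Z_{m_1}, \ldots, Z_{m_L}) \to a_L, \quad \text{a.s.} \tag{B.4}$$

By (B.1)–(B.4), we then have

$$\hat\rho_{\mathrm{bsa}} - \rho_{\mathrm{bsa}} = n^{-1} \sum_{i=1}^{n} \xi_i + o(n^{-1/2}),$$

where

$$\xi_i = \tilde h_1(Z_i) - a_L \cdot \frac{\sum_{l=1}^{L} \gamma_L \cdot \big( \prod_{1 \le j \le L,\, j \ne l} p_j \big) \cdot \{ I(Y_i = l) - p_l \}}{\big( \prod_{l=1}^{L} p_l \big)^2}.$$

It follows from the central limit theorem that

$$n^{1/2} (\hat\rho_{\mathrm{bsa}} - \rho_{\mathrm{bsa}}) \to_d N(0, \sigma^2_{\mathrm{bsa}})$$

with $\sigma^2_{\mathrm{bsa}} = E(\xi_1^2)$. This completes the proof of Theorem 2.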

[Received July 2010. Revised June 2011.]

REFERENCES

Agresti, A. (1984), Analysis of Ordinal Categorical Data, New York: Wiley.

Agresti, A. (1990), Categorical Data Analysis, New York: Wiley.

Arvesen, J. N. (1969), “Jackknifing U-Statistics,” Annals of Mathematical Statistics, 40, 2076–2100.

Barnhart, H. X., and Williamson, J. M. (2002), “Weighted Least-Squares Approach for Comparing Correlated Kappa,” Biometrics, 58, 1012–1019.

Brown, M., and Benedetti, J. (1977), “Sampling Behavior of Tests for Correlation in Two-Way Contingency Tables,” Journal of the American Statistical Association, 72, 309–315.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cohen, J. (1968), “Weighted Kappa: Nominal Scale Agreement With Provision for Scaled Disagreement or Partial Credit,” Psychological Bulletin, 70, 213–220.

Fisher, R. A. (1921), “On the ‘Probable Error’ of a Coefficient of Correlation Deduced From a Small Sample,” Metron, 1, 3–32.

Fleiss, J. L. (1971), “Measuring Nominal Scale Agreement Among Many Raters,” Psychological Bulletin, 76, 378–382.

Guo, Y., and Manatunga, A. (2007), “Nonparametric Estimation of the Concordance Correlation Coefficient Under Univariate Censoring,” Biometrics, 63, 164–172.

Guo, Y., and Manatunga, A. (2009), “Measuring Agreement of Multivariate Discrete Survival Times Using a Modified Weighted Kappa Coefficient,” Biometrics, 65, 125–134.

Hosmer, D., and Lemeshow, S. (2000), Applied Logistic Regression (2nd ed.), New York: Wiley.

Hotelling, H. (1953), “New Light on the Correlation Coefficient and Its Transforms,” Journal of the Royal Statistical Society, Ser. B, 15, 193–225.

Jason, H., and Olsson, U. (2001), “A Measure of Agreement for Interval or Nominal Multivariate Observations,” Educational and Psychological Measurement, 61, 277–289.

Kendall, M. (1945), “The Treatment of Ties in Ranking Problems,” Biometrika, 33, 239–251.

King, T. S., Chinchilli, V. M., Carrasco, J. L., and Wang, K. (2007), “A Class of Repeated Measures Concordance Correlation Coefficients,” Journal of Biopharmaceutical Statistics, 17, 653–672.

Kraemer, H. C. (1980), “Extension of the Kappa Coefficient,” Biometrics, 36, 207–216.

Lin, L. (1989), “A Concordance Correlation Coefficient to Evaluate Reproducibility,” Biometrics, 45, 255–268.

Lin, L. (1992), “Assay Validation Using the Concordance Correlation Coefficient,” Biometrics, 48, 599–604.

Lin, L., Hedayat, A., Sinha, B., and Yang, M. (2002), “Statistical Methods in Assessing Agreement: Models, Issues and Tools,” Journal of the American Statistical Association, 97, 257–270.

Lin, L., Hedayat, A., and Wenting, W. (2007), “A Unified Approach for Assessing Agreement for Continuous and Categorical Data,” Journal of Biopharmaceutical Statistics, 17, 629–652.

Musselman, D., Lawson, D., Gumnick, J., Manatunga, A., Penna, S., Goodkin, R., Greiner, K., Nemeroff, C., and Miller, A. (2001), “Paroxetine for the Prevention of Depression Induced by High-Dose Interferon Alfa,” The New England Journal of Medicine, 344, 961–966.

Potts, M. K., Daniels, M., Burnam, M. A., and Wells, K. B. (1990), “A Structured Interview Version of the Hamilton Depression Rating Scale: Evidence of Reliability and Versatility of Administration,” Journal of Psychiatric Research, 24, 335–350.

Quiroz, J. (2005), “Assessment of Equivalence Using a Concordance Correlation Coefficient in a Repeated Measurement Design,” Journal of Biopharmaceutical Statistics, 15, 913–928.

Shao, J. (2003), Mathematical Statistics (2nd ed.), New York: Springer.

Somers, R. (1962), “A New Asymmetric Measure of Association for Ordinal Variables,” American Sociological Review, 27, 799–811.

Song, P. (2000), “Multivariate Dispersion Models Generated From Gaussian Copula,” Scandinavian Journal of Statistics, 27, 305–320.

Svensson, E. (2000), “Concordance Between Rating Using Different Scales for the Same Variable,” Statistics in Medicine, 19, 3483–3496.

Waegeman, W., De Baets, B., and Boullart, L. (2008), “ROC Analysis in Ordinal Regression Learning,” Pattern Recognition Letters, 29, 1–9.

Williamson, J. M., Lipsitz, S., and Manatunga, A. K. (2000), “Modelling Kappa for Measuring Dependent Categorical Agreement Data,” Biostatistics, 1, 191–202.
