12
© 2012 Royal Statistical Society 0035–9254/12/61653 Appl. Statist. (2012) 61, Part 4, pp. 653–664 An extension of the Wilcoxon rank sum test for complex sample survey data Sundar Natarajan, Veterans Affairs NewYork Harbor Healthcare System, NewYork, USA Stuart R. Lipsitz and Garrett M. Fitzmaurice, Harvard Medical School, Boston, USA Debajyoti Sinha, Florida State University, Talahassee, USA Joseph G. Ibrahim, University of North Carolina, Chapel Hill, USA Jennifer Haas Harvard Medical School, Boston, USA and Walid Gellad University of Pittsburgh, USA [Received February 2010. Final revision November 2011] Summary. In complex survey sampling, a fraction of a finite population is sampled. Often, the survey is conducted so that each subject in the population has a different probability of being selected into the sample. Further, many complex surveys involve stratification and clustering. For generalizability of the sample to the finite population, these features of the design are usu- ally incorporated in the analysis. Although the Wilcoxon rank sum test is commonly used to compare an ordinal variable in bivariate analyses, no simple extension of the Wilcoxon rank sum test has been proposed for complex survey data. With multinomial sampling of indepen- dent subjects, the Wilcoxon rank sum test statistic equals the score test statistic for the group effect from a proportional odds cumulative logistic regression model for an ordinal outcome. Using this regression framework, for complex survey data, we formulate a similar proportional odds cumulative logistic regression model for the ordinal variable, and we use an estimating equations score statistic for no group effect as an extension of the Wilcoxon test. The method proposed is applied to a complex survey designed to produce national estimates of healthcare use, expenditures, sources of payment and insurance coverage. Keywords: Cumulative logistic model; Medical Expenditure Panel Survey; Proportional odds model; Score statistic; Weighted estimating equations 1. Introduction The Wilcoxon rank sum test is a frequently used statistical test to compare an ordinal out- come between two groups of subjects. Even in cases where regression analyses are subsequently Address for correspondence: Sundar Natarajan, Section of Primary Care, Veterans Affairs New York Harbor Healthcare System, 423 East 23rd Street, 15160 North, New York, NY 10010, USA. E-mail: [email protected]

An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

  • Upload
    lethien

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

© 2012 Royal Statistical Society 0035–9254/12/61653

Appl. Statist. (2012)61, Part 4, pp. 653–664

An extension of the Wilcoxon rank sum test forcomplex sample survey data

Sundar Natarajan,

Veterans Affairs New York Harbor Healthcare System, New York, USA

Stuart R. Lipsitz and Garrett M. Fitzmaurice,

Harvard Medical School, Boston, USA

Debajyoti Sinha,

Florida State University, Talahassee, USA

Joseph G. Ibrahim,

University of North Carolina, Chapel Hill, USA

Jennifer Haas

Harvard Medical School, Boston, USA

and Walid Gellad

University of Pittsburgh, USA

[Received February 2010. Final revision November 2011]

Summary. In complex survey sampling, a fraction of a finite population is sampled. Often, thesurvey is conducted so that each subject in the population has a different probability of beingselected into the sample. Further, many complex surveys involve stratification and clustering.For generalizability of the sample to the finite population, these features of the design are usu-ally incorporated in the analysis. Although the Wilcoxon rank sum test is commonly used tocompare an ordinal variable in bivariate analyses, no simple extension of the Wilcoxon ranksum test has been proposed for complex survey data. With multinomial sampling of indepen-dent subjects, the Wilcoxon rank sum test statistic equals the score test statistic for the groupeffect from a proportional odds cumulative logistic regression model for an ordinal outcome.Using this regression framework, for complex survey data, we formulate a similar proportionalodds cumulative logistic regression model for the ordinal variable, and we use an estimatingequations score statistic for no group effect as an extension of the Wilcoxon test. The methodproposed is applied to a complex survey designed to produce national estimates of healthcareuse, expenditures, sources of payment and insurance coverage.

Keywords: Cumulative logistic model; Medical Expenditure Panel Survey; Proportional oddsmodel; Score statistic; Weighted estimating equations

1. Introduction

The Wilcoxon rank sum test is a frequently used statistical test to compare an ordinal out-come between two groups of subjects. Even in cases where regression analyses are subsequently

Address for correspondence: Sundar Natarajan, Section of Primary Care, Veterans Affairs New York HarborHealthcare System, 423 East 23rd Street, 15160 North, New York, NY 10010, USA.E-mail: [email protected]

Page 2: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

654 S. Natarajan, S. R. Lipsitz, G. M. Fitzmaurice, D. Sinha, J. G. Ibrahim, J. Haas and W. Gellad

performed, initial summaries in terms of bivariate analyses are regularly reported at the begin-ning of the results section of published papers. In this paper, we propose an extension of theWilcoxon rank sum test to complex survey data. In complex survey sampling, a fraction of afinite population is sampled, while accounting for its size and characteristics. On the basis ofcertain subject characteristics (e.g. age, race and gender), some individuals may be oversampledor undersampled. Thus, individuals in the population may have different probabilities of beingselected into the sample. Further, the sampling design can have multiple stages of stratificationand clustering. Thus, in general, the design for complex sample surveys often includes stratifica-tion, clustering and different selection probabilities. Although alternative approaches have beenproposed for analysing complex survey data (Chambers and Skinner, 2003), for generalizabilityof the sample to the finite population (Korn and Graubard, 1999), in this paper we incorporatethe design in the analysis, including sampling weights (derived from the probability of selectioninto the survey), strata and/or cluster variables.

Extensions of rank sum tests have been proposed for clustered data (Jung and Kang, 2001;Rosner et al., 2003; Datta and Satten, 2005), which would arise from a random sample of inde-pendent clusters. In these proposed tests, the variance of the usual Wilcoxon test is adjustedfor clustering. The multistage sampling design, with different probabilities of selection, hasbeen a roadblock in developing a general extension of the Wilcoxon test procedure to complexsurveys.

The ready availability of public use data from large population-based complex sample surveyshas led to the calculation of population estimates of frequency of disease (incidence and prev-alence) and to associations between risk factors and disease. Many seminal papers publishedin leading journals have been based on such complex surveys. For example, Reilly and Dorosty(1999), using Health Survey for England data, was published in The Lancet and Bibbins-Domingo et al. (2007), using National Health and Nutrition Examination Survey (NHANES)data, was published in The New England Journal of Medicine. A search of PubMed (NationalLibrary of Medicine) abstracts using the term ‘NHANES’ yielded 7699 articles between 2006and 2011; the NHANES is just one of at least 100 national complex surveys conducted in differ-ent countries. Despite the huge increase in the use of such complex sample surveys, there hasnot been a simple proposed extension of the Wilcoxon test for comparing an ordinal outcomebetween two groups of subjects in complex survey data.

Our motivating example is the Medical Expenditure Panel Survey (MEPS) (Cohen, 2003) forthe year 2002, conducted by the United States Agency for Health Research and Quality, and theUnited States National Center for Health Statistics, Centers for Disease Control and Preven-tion. The 2002 cross-sectional survey was designed to produce national and regional estimatesof healthcare use, expenditures, sources of payment and insurance coverage of the US civiliannon-institutionalized population. The MEPS is a stratified, multistage probability cluster sam-ple. We analyse data from the 25388 subjects who participated in the household componentof the MEPS. In the design of the study, the population was first stratified into 203 geograph-ical regions, defined by census and state regions, metropolitan status and sociodemographicmeasures. Within each stratum, the geographical region was subdivided into area segments,which are composed of counties or groups of contiguous counties. Two or three clusters (areasegments) were sampled within each stratum. In 161 of the strata, there were two clusters; in theremaining 42 strata, there were three clusters. On average, a typical cluster contained 59 sub-jects, with a range of 2–316 subjects. Although the clusters within strata were sampled withoutreplacement, for our analysis we can assume that they were sampled with replacement, since thefraction of clusters sampled within each stratum is much less than 1%: The study oversampledHispanics, African-Americans, adults with functional impairments, children with limitations

Page 3: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

Extension of Wilcoxon Rank Sum Test 655

in activities, individuals who were predicted to incur high levels of medical expenditures andlow income individuals.

For this study of the MEPS, we explored whether patients with and without health insurancediffer in the following ordinal variables: level of education, income (defined as a percentage ofthe poverty line), perceived health status and body mass index BMI. Thus, for these cross-sec-tional data, the ‘group’ is health insurance (yes or no), and the ordinal variables are level ofeducation (no degree, high school graduate equivalency degree, high school diploma, Bache-lor’s degree, Master’s degree or doctorate degree), income (poor, near poor, low income, middleincome or high income), perceived health status (excellent, very good, good, fair or poor) andBMI (underweight, BMI < 18:5 kg m−2; normal, BMI 18.5–24.9 kg m−2; overweight, BMI25.0–29.9 kg m−2; obese, BMI > 30:0 kg m−2). Table 1 show individual level data from 25typical subjects, including strata, cluster and weights (these data are typical, not true, data sothat subjects cannot be identified).

Motivated by the need to develop an analogue for complex sample survey data, we proposean extension of the Wilcoxon rank sum test to the MEPS. Since subjects in complex surveysare not independent, the assumptions that are needed for applying the Wilcoxon rank sum testdo not hold. With a multinomial sample of independent subjects, the Wilcoxon rank sum teststatistic is shown to equal the score test statistic for no effect of a single dichotomous covariate

Table 1. Example data on 25 subjects from the MEPS

Subject Stratum Cluster Sampling Health Education Income Perceived BMIweight insurance health status

1 1 1 7080.48 Yes Bachelor’s Middle Good Normal2 1 2 4714.22 Yes No degree High Good Normal3 2 2 6925.06 Yes High school High Excellent Obese4 3 2 9358.85 Yes No degree High Very good Overweight5 4 1 6081.79 No No degree Middle Good Normal6 4 2 3728.20 No High school Poor Very good Normal7 5 1 4056.79 No High school Middle Good Overweight8 6 1 5936.66 Yes Master’s High Excellent Overweight9 7 2 2871.62 No Bachelor’s High Good Normal

10 8 2 2671.22 Yes Doctorate High Very good Obese11 9 1 5101.48 Yes High school Middle Very good Normal12 10 1 3569.07 Yes High school Poor Poor Overweight13 11 1 4751.75 Yes High school Poor Excellent Overweight14 12 1 9790.85 Yes Graduate Middle Very good Overweight

equivalencydegree

15 13 1 7168.04 Yes Graduate High Excellent Overweightequivalencydegree

16 14 2 5762.49 Yes No degree High Excellent Overweight17 15 1 7382.55 Yes High school Middle Excellent Normal18 15 1 10140.54 No No degree Middle Excellent Underweight19 16 1 4952.08 Yes High school High Good Normal20 17 1 6989.89 No No degree High Excellent Overweight21 18 1 2649.72 Yes Graduate High Very good Obese

equivalencydegree

22 19 2 3363.35 Yes High school High Very good Underweight23 20 2 5425.54 Yes High school Middle Fair Normal24 21 2 9417.92 No High school Low Excellent Overweight25 22 1 2017.34 No No degree Middle Very good Obese

Page 4: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

656 S. Natarajan, S. R. Lipsitz, G. M. Fitzmaurice, D. Sinha, J. G. Ibrahim, J. Haas and W. Gellad

in a proportional odds cumulative logistic regression model (McCullagh, 1980) for the ordinaloutcome. Using this regression framework, the Wilcoxon test is then extended to general com-plex sample surveys. With complex survey data, we propose to formulate a proportional oddscumulative logistic regression model for an ordinal outcome, with the group as a dichotomouscovariate. Weighted estimating equations (Shah et al., 2001; Binder, 1983; Pfeffermann, 1993)are used to account for the weighting and sampling design. We propose the use of an estimatingequations score test (Rao et al., 1998) for no group effect for the proportional odds model;this score test reduces to the standard Wilcoxon test if the design is a multinomial sample. Thetest can be obtained with minimal programming in common statistical software such as SASprocedure Surveylogistic.

In Section 2, we introduce some notation and discuss the score test for a simple multinomialsample. In Section 3, we describe weighted estimating equations for complex survey data, andour proposed score test statistic based on weighted estimating equations. Finally, in Section 4,we present the results of analyses of the MEPS data to illustrate the method proposed.

2. Proportional odds model and Wilcoxon rank sum test

In this section, we describe the proportional odds model and the Wilcoxon rank sum test fora multinomial sample of n independent subjects, i= 1, 2, . . . , n: In the next section, we extendthese results to complex survey sampling. We assume that the outcome Yi is an ordinal discreterandom variable which can take positive integer values j =1, 2, . . . , J: Since the outcome has Jlevels, we can form J indicator random variables Yij, where Yij =1 if subject i has response j andYij = 0 if otherwise. Our goal is to determine whether the distribution of this ordinal outcomediffers across two groups. Thus we form a dichotomous covariate xi, where xi = 1 if subject iis in group 1 and xi =0 if subject i is in group 2. Then, we denote the probability of response jgiven xi as

pij =pr.Yi = j|xi/=pr.Yij =1|xi/,

and the multinomial probability mass function for subject i is

f.yi1, yi2, . . . , yiJ /=J∏

j=1p

yij

ij : .1/

The proportional odds model can be written as

γij =pr.Yi � j|xi, θ, β/= exp.θj −xiβ/

1+ exp.θj −xiβ/, .2/

where γij is a ‘cumulative probability’ and θ′ = .θ1, . . . , θJ−1/ is the vector of cumulative inter-cepts. Since

pij =pr.Yi = j|xi/

=pr.Yi � j|xi, θ, β/−pr.Yi � j −1|xi, θ, β/

=γij −γi,j−1, .3/

with γiJ =1 and γi0 =0, the contribution to the likelihood for subject i can be rewritten as

Li.θ, β/=J∏

j=1.γij −γi,j−1/yij : .4/

Our main interest is in testing for no group effect in model (2), i.e.

H0 :β =0:

Page 5: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

Extension of Wilcoxon Rank Sum Test 657

Under this null hypothesis the distribution of the ordinal variable is identical in the two groups.As we briefly describe here, the Wilcoxon rank sum test statistic equals the score test statisticfor testing β =0: Next, consider the general form of a score test statistic for testing β =0: If θ̂0is the maximum likelihood estimate of θ under the null hypothesis that β = 0, then the scoretest statistic for testing the null hypothesis has the general form

X2 =U.θ̂0, 0/′{var{U.θ, β/}}−1θ=θ̂0,β=0

U.θ̂0, 0/, .5/

where U.θ, β/ is the score (first-derivative) vector of the logarithm of the likelihood in equa-tion (4) with respect to φ = .θ′, β/′, U.θ̂0, 0/ is the score vector evaluated at .θ = θ̂0, β = 0/

and {var{U.θ, β/}}θ=θ̂0,β=0 is the variance of U.θ, β/ evaluated at .θ= θ̂0, β =0/: For regularlikelihood problems such as this when subjects are independent,

var{U.θ, β/}=E[U.θ, β/U.θ, β/′]=−E

[d

dφU.θ, β/′

]:

Under the null hypothesis, X2 in equation (5) has an asymptotic χ2-distribution with 1 degreeof freedom.

For the proportional odds model (McCullagh and Nelder, 1989), the general form of thescore vector is

U.θ, β/=n∑

i=1

J∑j=1

dpij

dφ.yij −pij/=pij:

The only non-zero component of U.θ̂0, 0/ is

U0 =n∑

i=1

J∑j=1

(dpij

)θ=θ̂0,β=0

yij − p̂j

p̂j

=n∑

i=1

J∑j=1

(d.γij −γi,j−1/

)θ=θ̂0,β=0

yij − p̂j

p̂j

, .6/

where p̂j = n−1Σni=1Yij is the level j proportion (regardless of group). Since γij is a logistic

regression model, using results for ordinary logistic regression,

d.γij −γi,j−1/

dβ=xi{γij.1−γij/−γi,j−1.1−γi,j−1/}:

Thus, the non-zero component of U.θ̂0, 0/ can be written as

U0 =n∑

i=1xi

J∑j=1

Sj.yij − p̂j/, .7/

where

Sj = p̂−1j .γij.1−γij/−γi,j−1.1−γi,j−1//θ=θ̂0,β=0 = γ̂j.1− γ̂j/− γ̂j−1.1− γ̂j−1/

γ̂j − γ̂j−1,

and γ̂j =Σjk=1p̂k is the cumulative proportion (regardless of group). Straightforward algebra

shows that Sj =1− .γ̂j + γ̂j−1/, and that expression (7) is proportional to

U0 =n∑

i=1xi

J∑j=1

0:5.γ̂j + γ̂j−1/.yij − p̂j/, .8/

where

0:5.γ̂j + γ̂j−1/

Page 6: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

658 S. Natarajan, S. R. Lipsitz, G. M. Fitzmaurice, D. Sinha, J. G. Ibrahim, J. Haas and W. Gellad

is the ‘ridit’ score (Bross, 1958). We further show in Appendix A that expression (8) is propor-tional to

U0 =n∑

i=1xi

J∑j=1

Rj.yij − p̂j/=J∑

j=1Rj

n∑i=1

yijxi −J∑

j=1Rjp̂j

n∑i=1

xi, .9/

where

Rj =n.γ̂j + γ̂j−1/=2+0:5

is the average rank or ’midrank’ for subjects in category j. Subjects with xi =0 do not contrib-ute to expression (9). For subjects in group xi = 1, this statistic multiplies the average ranksin category j .Rj/ by the number of subjects in category j .Σn

i=1xiyij/ and then sums acrossall categories. Further, the sum of the average ranks in group xi = 1 under the null hypothesisis subtracted from the observed ranks. Thus, the form of the score statistic in equation (9) isequivalent to the Wilcoxon rank sum test statistic.

By formulating the Wilcoxon test statistic in terms of a score test statistic from the propor-tional odds model, we can apply theory that was developed for estimating equations score teststo the proportional odds models in the complex sample survey setting, without having to developnew theory for ranks in complex survey data. As discussed in Agresti (2010), the Wicoxon ranksum test statistic can be written either in terms of ridits as in equation (8) or the midranks asin equation (9). In the next section, we discuss the estimating equations score test for complexsurvey data in terms of ridits.

3. Extension of the Wilcoxon rank sum test for complex survey data

To develop our extension of the Wilcoxon rank sum test, we first discuss weighted estimatingequations for estimating .θ, β/ in complex sample surveys. For complex sample surveys, thetarget population is usually thought to be of finite size N. We assume that the sample is still ofsize n. To indicate which n subjects are sampled from the population of N subjects, we definethe indicator random variable

δi ={

1 if subject i is selected into the sample,0 if subject i is not selected into the sample,

for i = 1, . . . , N, where ΣNi=1δi = n: Depending on the sampling design, some of the δi could

be correlated (e.g. for two subjects within the same cluster). We let πi denote the probabilitythat subject i is selected into the survey, which is typically specified in the design of the study.Depending on the sampling design, πi may depend on the outcome of interest, the independentvariables or additional variables (screening variables, for example) that are not in the model ofinterest. In particular, πi = pr.δi = 1|yi, xi, si/, where si is a vector of additional variables. Weassume that the proportional odds model holds for subjects in the population, and the distribu-tion of the ordinal outcome for a subject in the population follows a multinomial distributionas in equation (4).

To obtain a consistent estimate of .θ, β/, we can use a weighted estimating equation, whichis the solution to Uwee.θ̂, β̂/=0, where

Uwee.θ̂, β̂/=N∑

i=1

δi

πi

J∑j=1

(dpij

)θ=θ̂,β=β̂

yij − p̂ij

p̂ij

, .10/

and φ= .θ′, β/′: Here, the ‘weights’ are wi = δi=πi (wi =1=πi if sampled δi =1/: Note, also, thatthese are weighted likelihood score equations under a working ‘independence’ assumption forthe N subjects (disregarding any clustering).

Page 7: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

Extension of Wilcoxon Rank Sum Test 659

Using a first-order Taylor series expansion and a suitable central limit theorem for samplesurvey data (Binder, 1983), .θ̂, β̂/ has an asymptotic multivariate normal distribution with mean.θ, β/ and covariance matrix

Vθ,β =var{.θ̂, β̂/}=E

[dUwee.θ, β/

]−1

var{Uwee.θ, β/}E

[dUwee.θ, β/

]−1

, .11/

Note that var{Uwee.θ, β/} depends on the sample design (stratification, clustering and samplingwith or without replacement) as well as the finite population correction factor. Empirically,equation (11) is estimated via the ‘sandwich variance estimator’. For multinomial and ordinallogistic regression, sample survey programs in SAS, SUDAAN, R and Stata can be used tocalculate this variance.

Next, we apply an estimating equations score test statistic (Rao et al., 1998) for the nullhypothesis, H0 :β =0, in the proportional odds model. Similarly to the previous section, we letθ̂0 denote the weighted estimating equation estimate of θ under the null hypothesis that β =0:

Then, similarly to the usual score test, the estimating equations score test statistic for H0 :β =0is

X2 =Uwee.θ̂0, 0/′{var{Uwee.θ, β/}}−1θ=θ̂0,β=0

Uwee.θ̂0, 0/, .12/

where the form of Uwee.θ̂0, 0/ and {var{Uwee.θ, β/}}θ=θ̂0,β=0 are both derived under the alter-native hypothesis, but evaluated at .θ= θ̂0, β =0/: In particular,

{var{Uwee.θ, β/}}θ=θ̂0,β=0

=(

E

[dUwee.θ, β/

])θ=θ̂0,β=0

{var{.θ̂, β̂/}}θ=θ̂0,β=0

(E

[dUwee.θ, β/

])θ=θ̂0,β=0

.13/

where (E

[dUwee.θ, β/

])θ=θ̂0,β=0

=N∑

i=1

δi

πi

J∑j=1

p̂−1j

(dpij

)θ=θ̂0,β=0

(dpij

)′

θ=θ̂0,β=0: .14/

is the negative of the information matrix that is obtained if we ignore the complex survey designand assume that all subjects are independent with weights wi: The central limit theorem canbe used to show that asymptotically X2 has a χ2-distribution with 1 degree of freedom underthe null hypothesis (Rao et al., 1998), although the definition of ‘asymptotic’ is sometimesnon-standard in complex sample surveys. Finite sample approximations for the distribution ofexpression (13) are given in Rao et al. (1998). For example, for a stratified cluster design, if welet S be the number of strata and C the number of clusters, then Rao et al. (1998) proposedto approximate the distribution of expression (12) with an F -distribution with 1 and f =C −S

degrees of freedom. For the MEPS data that we analyse, since f = 448 − 203 = 245, the F -and χ2-approximations are virtually identical, so we use the χ2-approximation in the followingsection.

Similarly to the score test for non-complex survey data, the only non-zero component ofUwee.θ̂0, 0/ can be written as

Uwee,0 =N∑

i=1wi

J∑j=1

(dpij

)θ=θ̂0,β=0

yij − p̂j

p̂j

=N∑

i=1wixi

J∑j=1

0:5.γ̂j + γ̂j−1/.Yij − p̂j/, .15/

Page 8: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

660 S. Natarajan, S. R. Lipsitz, G. M. Fitzmaurice, D. Sinha, J. G. Ibrahim, J. Haas and W. Gellad

where

p̂j =N∑

i=1wiYij

/N∑

i=1wi

is the weighted proportion of subjects with response level j, regardless of group,

γ̂j =j∑

k=1p̂k

is the cumulative weighted proportion (regardless of group) and

0:5.γ̂j + γ̂j−1/

is the weighted ridit. Subjects with xi =0 do not contribute to expression (15). Then, the estimat-ing equations score statistic has the same form, in terms of ridits, as the usual proportional oddsstatistic from a multinomial sample and can be considered an extension of the usual Wilcoxonrank sum test to complex survey data.

Most sample survey programs allow fitting of the proportional odds model for ordinal datafrom complex sample surveys. However, the estimating equations score statistic is not directlyavailable and requires a two-step procedure. First, one fits the proportional odds model under thenull hypothesis H0 :β =0 to obtain θ̂0: Then, one fits a proportional odds model under the alter-native hypothesis H0 :β �=0 with starting values .θ̂0, 0/ but, instead of iterating until convergence,perform 0 iterations. From this fit, we can obtain Uwee.θ̂0, 0/ as well as {var{.θ̂, β̂/}}θ=θ̂0,β=0 inequation (13). Finally, one ignores the design and assumes that all subjects are independent withweights wi; under these assumptions, we fit a proportional odds model under the alternativeH0 :β �=0 with starting values .θ̂0, 0/ and perform zero iterations; this gives us expression (14).The SAS Proc Surveylogistic code for the example that is analysed in Section 4 can befound at

http://www.blackwellpublishing.com/rss

As an alternative to the score statistic, one can also use a Wald statistic (an estimate of βdivided by its estimated standard error) to test H0 :β =0: The Wald and score statistics have iden-tical large sample properties under the null, but have different properties under the alternative,with the score statistic typically being more powerful (Hauck and Donner, 1977).

4. Application: Medical Expenditure Panel Survey

In this section, we present results for analyses of data from the MEPS example that was discussedin Section 1. There are 203 strata in this study, and two or three clusters were sampled withoutreplacement within each stratum. Although the sampling was performed without replacementin each stratum, the total (population) number of clusters within each stratum was so large thatthe finite population correction factor can be ignored. The weights that were used in the analysisare the (Horvitz–Thompson) survey weights provided by the MEPS, so the weights sum to thepopulation total. These weights account for unit non-response. Data on 25 subjects from thisdata set are given in Table 1. Our goal is to explore bivariate analyses between health insurancestatus (yes or no) and the ordinal variables level of education (no degree, high school gradu-ate equivalency degree, high school diploma, Bachelor’s degree, Master’s degree or doctoratedegree), income (poor, near poor, low income, middle income or high income) and perceivedhealth status (excellent, very good, good, fair or poor) and BMI (underweight, normal, over-

Page 9: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

Extension of Wilcoxon Rank Sum Test 661

weight or obese). In this data set, 5401 .21:2%/ of 25388 subjects did not have health insurance,although the weighted percentage of subjects without health insurance was 17:2%: Table 2 givesthe weighted column percentages given that the subject has or does not have health insurance,as well as the usual Wilcoxon test (the proportional odds score test) ignoring the complex surveydesign (including stratification, clustering, and weighting) and our proposed proportional oddsscore test which takes the complex survey design into account (and uses a χ2-approximationwith 1 degree of freedom).

We see from Table 2 that, as expected, the usual Wilcoxon test (proportional odds score test)ignoring the complex survey design and the proportional odds score test taking the design intoaccount are quite different in value. For education and income, the proportional odds scoretest statistics taking the design into account are almost half the size of those that do not takethe design into account, albeit all are very significant. In contrast, for perceived health statusand BMI, we see that the opposite is true; the test statistics taking the design into account aremuch larger than those that do not. In fact, for BMI, the test statistic taking the design intoaccount is borderline significant (P = 0:070), whereas the test statistic not taking the designinto account is far from significant (P =0:472). We note here that if one used the Wald statistic(which is printed out directly in SAS Proc Surveylogistic) instead of the score statistic,one obtains P-values that are very similar to the score statistic: education level (P < 0:0001/,income .P<0:0001/, perceived health status .P =0:30/ and BMI .P =0:067/; however, this veryclose correspondence between the Wald and score test cannot be expected in general. The resultsof analyses of the MEPS data indicate that failure to incorporate the design in the analysis canpotentially yield misleading inferences about the associations.

Table 2. Weighted column percentages and Wilcoxon test statistics for the MEPS data

Variable Level Health insurance (%) Ignoring design Complex surveyWilcoxon proportional odds

No Yes (proportional odds) X2 (P-value)X2 (P-value)

Education No degree 31.3 17.9 959.81 .< 0:0001/ 355.10 .< 0:0001/Graduate 7.3 4.2

equivalencydegree

High School 49.3 49.5Bachelor’s 9.7 18.8Master’s 2.0 7.7Doctorate 0.5 2.0

Income Poor 21.0 7.8 1933.38 .< 0:0001/ 626.24 .< 0:0001/Near poor 7.4 3.1Low 22.5 10.7Middle 30.8 31.0High 18.3 47.5

Perceived Excellent 26.0 25.8 0.03 (0.864) 1.06 (0.30)health status Very good 31.4 34.6

Good 30.6 26.6Fair 9.4 9.5Poor 2.6 3.5

BMI Underweight 2.7 2.0 0.52 (0.472) 3.28 (0.070)Normal 38.7 37.7Overweight 34.8 35.7Obese 23.8 24.6

Page 10: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

662 S. Natarajan, S. R. Lipsitz, G. M. Fitzmaurice, D. Sinha, J. G. Ibrahim, J. Haas and W. Gellad

From Table 2, we see that patients without health insurance are less educated and poorer, andhave slightly lower BMI. There does not appear to be any difference in perceived health statusbetween patients with or without health insurance. With the large number of subjects sampledin complex surveys, we usually have high power to detect small differences. We see that this is sofor BMI, in that patients with health insurance are approximately 2% more likely (in absoluteterms) to be overweight or obese.

To explore which of the design factors (stratification, clustering or weighting) has the largesteffect on the P-values, we recalculated the proposed test statistic under three scenarios. In thefirst scenario, we ignored the strata and just assumed that the design is a cluster sampling designin which we randomly sampled 448 clusters from the population; we also included the weightsin this analysis. In the second scenario, we ignored the clusters and just assumed that the designis a stratified sampling design; we again also included the weights in the analysis. Finally, in thethird scenario, we kept stratification and clustering as in the design, but we ignored the weights(i.e. set all weights equal to 1). Overall, the average of the weights is 8903.6, with standard devi-ation 5446.2, and range 386.9–49958.0. Across the 203 strata, the mean of the weights rangesfrom 2441.6 to 15711.4, and the standard deviations range from 1436.0 to 8750.0. Given thisvariability in the weights across strata, we might expect the weights to play an important role incalculating the test statistics. The results under these three scenarios are given in Table 3.

In general, stratification tends to decrease the variance, and we see that the values of the teststatistics tend to be smaller (owing the larger variance) when stratification is ignored. Clusteringtends to increases the variance, and we see that the values of the test statistics tend to be larger(owing to the smaller variance) when clustering is ignored. Finally, ignoring the weighting doesnot yield a clear pattern in that the value of two of the test statistics (for education and income)are larger, and the value of two of the test statistics (for health and BMI) are smaller. On thebasis of the pattern of results in Table 3, for this particular example it appears that the relative

Table 3. Wilcoxon test statistics for the MEPS data accountingfor various features of the survey design

Variable Scenario X2 P-value

Education Ignoring design 959.81 < 0:0001Ignoring strata 267.28 < 0:0001Ignoring clusters 686.69 < 0:0001Ignoring weights 467.91 < 0:0001Incorporating design 355.10 < 0:0001

Income Ignoring design 1933.38 < 0:0001Ignoring strata 400.79 < 0:0001Ignoring clusters 1347.41 < 0:0001Ignoring weights 632.16 < 0:0001Incorporating design 626.24 < 0:0001

Perceived Ignoring design 0.03 0.864health status Ignoring strata 0.99 0.319

Ignoring clusters 1.33 0.249Ignoring weights 0.03 0.872Incorporating design 1.06 0.301

BMI Ignoring design 0.52 0.472Ignoring strata 3.28 0.070Ignoring clusters 3.89 0.048Ignoring weights 0.40 0.525Incorporating design 3.28 0.070

Page 11: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

Extension of Wilcoxon Rank Sum Test 663

importance of stratification, clustering and weighting differs for the four ordinal variables. Forexample, for perceived health status and BMI, clustering and weighting are the most impor-tant design factors to take account of in the analysis; stratification has little effect on the teststatistics. However, for income, stratification and clustering appear to be the most importantdesign factors. This differential relative importance of the three design factors for the ordinaloutcomes explains in part why the test statistics taking the design into account can be eitherlarger or smaller than those that do not.

5. Conclusion

In summary, we propose an extension of the Wilcoxon rank sum test to complex survey data.The approach is not ad hoc but is based on the connection between the Wilcoxon rank sum testand the proportional odds score test for the group effect. Our proposed test statistic uses anestimating equations score statistic (Rao et al., 1998) for no group effect in a proportional oddslogistic regression model.

By formulating the test statistic in terms of a score test statistic from the proportional oddsmodel for complex survey data, we can apply theory developed for estimating equations scoretests without having to develop new theory for ranks in complex survey data. The huge increasein use of population-based complex sample surveys has led to analyses that have been publishedin leading medical journals, yet no simple approach has been developed to test for associationbetween group and an ordinal categorical variable. This paper provides such an approach thatcan be used for any complex survey design.

One issue that may arise is the maximum number of outcome categories .J/ that is allowed forour proposed test to be approximately χ2 with 1 degree of freedom under the null hypothesis.Considering the data as arising from a 2×J contingency table (Reynolds, 1984), for multinomialsamples there should be at least 5 × 2 ×J observations in the data set, e.g. n� 5 × 2 ×J: If welet DEFF represent the design effect for a complex sample survey (Korn and Graubard, 1999),where DEFF represents the ratio of the variance of equation (15) under the given design toa multinomial sample, then n� DEFF × 10 ×J or, equivalently, J �n=.10DEFF/: Typically,design effects for complex surveys are less than 2 (Heeringa and Liu, 2006), giving the rule ofthumb that our proposed method can be applied provided that J � n=20: For a typical largecomplex survey data set, with say 10000 subjects, this means that we could have an orderedcategorical variable with 500 levels.

On the basis of this paper, in future, researchers analysing complex samples could use ourapproach to formulate a non-parametric test with sample survey data. The same approachcould be used to formulate Wilcoxon-type test statistics in other directions, such as adjustingfor covariates and missing data.

Acknowledgements

We are grateful for the support provided by grants from the United States National Institutesof Health and the United States Department of Veterans Affairs.

Appendix A: Derivation of Wilcoxon test from proportional odds score test

In equation (8), we can further rewrite γ̂j + γ̂j−1 in terms of the ‘average rank’ for category j. First, notethat nγ̂j−1 is the number of subjects with outcome less than j; also, let nj =Σn

i=1Yij =np̂j denote the numberof subjects with outcome j. Then, the average rank for the nj subjects in category j equals the average of the

Page 12: An extension of theWilcoxon rank sum test for …krader/lipsitz/ranksum/ranksum.pdfAn extension of theWilcoxon rank sum test for complex sample survey data ... Although the Wilcoxon

664 S. Natarajan, S. R. Lipsitz, G. M. Fitzmaurice, D. Sinha, J. G. Ibrahim, J. Haas and W. Gellad

numbers nγ̂j−1 + 1 to nγ̂j−1 +nj . Then, using the formula for sums given in Conover (1998), the averagerank for the nj subjects in category j equals

Rj =n−1j

nγ̂j−1+nj∑

m=nγ̂j−1+1m

=n−1j .nγ̂j−1 +nj +nγ̂j−1 +1/nj=2

= .nγ̂j−1 +nj +nγ̂j−1 +1/=2= .nγ̂j−1 +np̂j +nγ̂j−1 +1/=2=n.γ̂j + γ̂j−1/=2+0:5, .16/

since nγ̂j−1 +np̂j =nγ̂j . Then, in terms of the average rank Rj , we can rewrite γ̂j + γ̂j−1 as

γ̂j + γ̂j−1 =2Rj=n−1=n: .17/

Then, inserting equation (17) in equation (8),

U0 =n∑

i=1xi

J∑

j=10:5.2Rj=n−1=n/.yij − p̂j/= .1=n/

n∑

i=1xi

J∑

j=1Rj.yij − p̂j/:

Since the factor 1=n does not affect the score statistic in equation (5), without loss of generality, we writethe numerator of the score statistic as

U0 =n∑

i=1xi

J∑

j=1Rj.yij − p̂j/=

J∑

j=1Rj

n∑

i=1yijxi −

J∑

j=1Rjp̂j

n∑

i=1xi:

References

Agresti, A. (2010) Analysis of Ordinal Categorical Data, 2nd edn. New York: Wiley.Bibbins-Domingo, K., Coxson, P., Pletcher, M. J., Lightwood, J. and Goldman, L. (2007) Adolescent overweight

and future adult coronary heart disease. New Engl. J. Med., 357, 2371–2379.Binder, D. (1983) On the variance of asymptotically normal estimators from complex surveys. Int. Statist. Rev.,

51, 279–292.Bross, I. D. J. (1958) How to use ridit analysis. Biometrics, 14, 18–38.Chambers R. L. and Skinner C. J. (2003) Analysis of Survey Data. New York: Wiley.Cohen, S. B. (2003) Design strategies and innovations in the Medical Expenditure Panel Survey. Med. Care, 41,

5–12.Conover, W. J. (1998) Practical Nonparametric Statistics, 3rd edn. New York: Wiley.Datta, S. and Satten, G. A. (2005) Rank-sum tests for clustered data. J. Am. Statist. Ass., 100, 908–915.Hauck, W. W. and Donner, A. (1977) Wald’s test as applied to hypotheses in logit analysis. J. Am. Statist. Ass.,

72, 851–853.Heeringa, S. G. and Liu, J. (2006) Complex sample design effects and inference for mental health survey data. Int.

J. Meth. Psychiatr. Res., 7, 56–65.Jung, S. and Kang, S. (2001) Tests for 2 × K contingency tables with clustered ordered categorical data. Statist.

Med., 20, 785–794.Korn, E. L. and Graubard, B. I. (1999) Analysis of Health Surveys. New York: Wiley.McCullagh, P. (1980) Regression models for ordinal data (with discussion). J. R. Statist. Soc. B, 42, 109–142.McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models, 2nd edn. London: Chapman and Hall.Pfeffermann, D. (1993) The role of sampling weights when modeling survey data. Int. Statist. Rev., 61, 317–337.Rao, J. N. K., Scott, A. J. and Skinner, C. J. (1998) Quasi-score tests with survey data. Statist. Sin., 8, 1059–1070.Reilly, J. J. and Dorosty, A. R. (1999) Epidemic of obesity in UK children. Lancet, 354, 1874–1875.Reynolds, H. T. (1984) Analysis of Nominal Data, 2nd edn. Beverly Hills: Sage.Rosner, B., Glynn, R. J. and Lee, M. L. (2003) Incorporation of clustering effects for the Wilcoxon rank sum test:

a large-sample approach. Biometrics, 59, 1089–1098.Shah, B. V., Barnwell, B. G. and Bieler, G. S. (2001) SUDAAN User’s Manual, Release 8.0. Research Triangle

Park: Research Triangle Institute.