Likelihood-Based Inference on Haplotype Effects in Genetic Association Studies

D. Y. LIN and D. ZENG

A haplotype is a specific sequence of nucleotides on a single chromosome. The population associations between haplotypes and disease phenotypes provide critical information about the genetic basis of complex human diseases. Standard genotyping techniques cannot distinguish the two homologous chromosomes of an individual, so only the unphased genotype (i.e., the combination of the two homologous haplotypes) is directly observable. Statistical inference about haplotype–phenotype associations based on unphased genotype data presents an intriguing missing-data problem, especially when the sampling depends on the disease status. The objective of this article is to provide a systematic and rigorous treatment of this problem. All commonly used study designs, including cross-sectional, case-control, and cohort studies, are considered. The phenotype can be a disease indicator, a quantitative trait, or a potentially censored time-to-disease variable. The effects of haplotypes on the phenotype are formulated through flexible regression models, which can accommodate various genetic mechanisms and gene–environment interactions. Appropriate likelihoods are constructed that may involve high-dimensional parameters. The identifiability of the parameters and the consistency, asymptotic normality, and efficiency of the maximum likelihood estimators are established. Efficient and reliable numerical algorithms are developed. Simulation studies show that the likelihood-based procedures perform well in practical settings. An application to the Finland–United States Investigation of NIDDM Genetics Study is provided. Areas in need of further development are discussed.

KEY WORDS: Case-control study; Gene–environment interaction; Hardy–Weinberg equilibrium; Missing data; Single nucleotide polymorphism; Unphased genotype.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, CB#7420, University of North Carolina, Chapel Hill, NC 27599. The authors are grateful to the FUSION Study Group for sharing their data and to Michael Boehnke and Laura Scott for transmitting the data. They also thank the editor, an associate editor, and two referees for their comments. This work was supported by the National Institutes of Health.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000808

1. INTRODUCTION

In the early 1900s there was a fierce debate between Gregor Mendel’s followers and the biometrical school led by Francis Galton and Karl Pearson as to whether the patterns of inheritance were consistent with Mendel’s law of segregation or with a “blending”-type theory. Fisher (1918) reconciled the two conflicting schools by recognizing the difference in the genetic basis for the variation in the trait being studied. For the traits that the Mendelists studied, the observed variation was due to a simple difference at a single gene; for the traits studied by the biometrical school, individual differences were attributed to many different genes, with no particular gene having a singly large effect.

Like the traits studied by Mendel, many genetic disorders, such as Huntington disease and cystic fibrosis, are caused by mutations of single genes. The genes underlying a number of these Mendelian syndromes have been discovered over the last 20 years through linkage analysis and positional cloning (Risch 2000). The same approach, however, is failing to unravel the genetic basis of complex human diseases (e.g., hypertension, bipolar disorder, diabetes, schizophrenia), which are influenced by a variety of genetic and environmental factors, just like the traits studied by the biometrical school a century ago. It is widely recognized that genetic dissection of complex human disorders requires large-scale association studies, which relate disease phenotypes to genetic variants, especially single nucleotide polymorphisms (SNPs) (Risch 2000; Botstein and Risch 2003).

SNPs are DNA sequence variations that occur when a single nucleotide in the genome sequence is altered. SNPs make up about 90% of all human genetic variation and are believed to have a major impact on disease susceptibility. Aided by the sequencing of the human genome (International Human Genome Sequencing Consortium 2001; Venter et al. 2001), geneticists have identified several million SNPs (International SNP Map Working Group 2001). With current technology, it is economically feasible to genotype thousands of subjects for thousands of SNPs. These remarkable scientific and technological advances offer unprecedented opportunities to conduct SNPs-based association studies aimed at unraveling the genetic basis of complex diseases.

There is one of three possible genotypes at each SNP site: homozygous with allele A, homozygous with allele a, or heterozygous with one allele A and one allele a. Thus assessing the association between a SNP and a disease phenotype is a trivial three-sample problem. It is, however, desirable to deal with multiple SNPs simultaneously. One appealing approach is to consider the haplotypes for multiple SNPs within candidate genes (Hallman, Groenemeijer, Jukema, and Boerwinkle 1999; International SNP Map Working Group 2001; Patil et al. 2001; Stephens, Smith, and Donnelly 2001).

The haplotype (i.e., a specific combination of nucleotides at a series of closely linked SNPs on the same chromosome of an individual) contains information about the protein products. Because the actual number of haplotypes within a candidate gene is much smaller than the number of all possible haplotypes, haplotyping serves as an effective data-reduction strategy. Using SNP-based haplotypes may yield more powerful tests of genetic associations than using individual, unorganized SNPs, especially when the causal variants are not measured directly or when there are strong interactions of multiple mutations on the same chromosome (Akey, Jin, and Xiong 2001; Fallin et al. 2001; Li 2001; Morris and Kaplan 2002; Schaid, Rowland, Tines, Jacobson, and Poland 2002; Zaykin et al. 2002; Schaid 2004).

Determining haplotype requires the parental origin or gametic phase information, which cannot be easily obtained with the current genotyping technology. As a result, only the unphased genotype (i.e., the combination of the two homologous
haplotypes) can be determined. Statistically speaking, this is a missing-data problem in which the variable of interest pertains to two ordered sequences of 0's and 1's, but only the summation of the two sequences is observed. This type of missing-data problem has not been studied in the statistical literature.

Many authors (e.g., Clark 1990; Excoffier and Slatkin 1995; Stephens et al. 2001; Zhang, Pakstis, Kidd, and Zhao 2001; Niu, Qin, Xu, and Liu 2002; Qin, Niu, and Liu 2002) have proposed methods to infer haplotypes or estimate haplotype frequencies from unphased genotype data. To make inference about haplotype effects, one may then relate the probabilistically inferred haplotypes to the phenotype through a regression model (e.g., Zaykin et al. 2002). This approach does not account for the variation due to haplotype estimation and does not yield consistent estimators of regression parameters.

A growing number of articles have been published in genetic journals on making proper inference about the effects of haplotypes on disease phenotypes. Most of these articles have dealt with case-control studies. Specifically, Zhao, Li, and Khalid (2003) developed an estimating function that approximates the expectation of the complete-data prospective-likelihood score function given the observable data. This method assumes that the disease is rare and that haplotypes are independent of environmental variables, and it is not statistically efficient. Epstein and Satten (2003) derived a retrospective likelihood for the relative risk that does not accommodate environmental variables. Stram et al. (2003) proposed a conditional likelihood for the odds ratio, assuming that cases and controls are chosen randomly with known probabilities from the target population, but did not consider environmental variables or investigate the properties of the estimator. Building on the earlier work of Schaid et al. (2002), Lake et al. (2003) discussed likelihood-based inference for cross-sectional studies under generalized linear models. Seltman, Roeder, and Devlin (2003) provided a similar discussion based on the cladistic approach. Recently, Lin (2004) showed how to perform Cox's (1972) regression when potentially censored age at onset of the disease observations are collected in cohort studies. All of the aforementioned work assumes Hardy–Weinberg equilibrium (Weir 1996, p. 40). Simulation studies (Epstein and Satten 2003; Lake et al. 2003; Satten and Epstein 2004) revealed that violation of this assumption can adversely affect the validity of the inference.

The aim of this article is to address statistical issues in estimating haplotype effects in a systematic and rigorous manner. For case-control studies, we allow environmental variables and derive efficient inference procedures. For cross-sectional and cohort studies, we consider more versatile models than those in the existing literature. For all study designs, we accommodate Hardy–Weinberg disequilibrium. We construct appropriate likelihoods for a variety of models. Under case-control sampling, the likelihood pertains to the distribution of genotypes and environmental variables conditional on the case-control status, which involves infinite-dimensional nuisance parameters if environmental variables are continuous. In cohort studies, it is desirable to not parameterize the distribution of time to disease, so that the likelihood also involves infinite-dimensional parameters. The presence of infinite-dimensional parameters entails considerable theoretical and computational challenges. We establish the theoretical properties of the maximum likelihood estimators (MLEs) by appealing to modern asymptotic techniques and develop efficient and stable algorithms to implement the corresponding inference procedures. We assess the performance of the proposed methods through simulation studies and provide an application to a major genetic study of type 2 diabetes mellitus.

2. INFERENCE PROCEDURES

2.1 Preliminaries

We consider SNP-based association studies of unrelated individuals. Suppose that each individual is genotyped at $M$ biallelic SNPs within a candidate gene. At each SNP site, we indicate the two possible alleles by the values 0 and 1. Thus each haplotype $h$ is a unique sequence of $M$ numbers from $\{0, 1\}$. The total number of possible haplotypes is $K \equiv 2^M$; the actual number of haplotypes consistent with the data is usually much smaller. For $k = 1, \ldots, K$, let $h_k$ denote the $k$th possible haplotype. Figure 1 shows the eight possible haplotypes for three SNPs.

Our human chromosomes come in pairs, one member of each pair inherited from our mother and the other member inherited from our father. These pairs are called homologous chromosomes. Thus each individual has a pair of homologous haplotypes that may or may not be identical. Routine genotyping procedures cannot separate the two homologous chromosomes, so only the (unphased) genotypes (i.e., the combinations of the two homologous haplotypes) are directly observable. For each individual, the multi-SNP genotype is an ordered sequence of $M$ numbers from $\{0, 1, 2\}$.

Let $H$ and $G$ denote the pair of haplotypes and the genotype for an individual. We write $H = (h_k, h_l)$ if the individual's haplotypes are $h_k$ and $h_l$, in which case $G = h_k + h_l$. The ordering of the two homologous haplotypes within an individual is considered arbitrary. By allowing genotypes to include missing SNP information, we may assume that $G$ is known for each individual. Given $G$, the value of $H$ is unknown if the individual is heterozygous at more than one SNP or if any SNP genotype is missing. For the case of $M = 3$ shown in Figure 1, if $G = (0, 2, 1)$, then $H = (h_3, h_4)$; and if $G = (0, 1, 1)$, then $H = (h_1, h_4)$ or $H = (h_2, h_3)$.

Figure 1. Possible Haplotype Configurations With Three SNPs.
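
The relation $G = h_k + h_l$ and the resulting phase ambiguity can be illustrated with a short sketch. It assumes that $h_1, \ldots, h_8$ are listed in lexicographic order, which is consistent with the two examples in the text but is not confirmed by the figure itself; the function names are hypothetical, and missing SNP genotypes are not handled.

```python
from itertools import product

def all_haplotypes(M):
    """All 2**M haplotypes as 0/1 tuples, in lexicographic order (h_1 first)."""
    return list(product((0, 1), repeat=M))

def genotype(h_k, h_l):
    """Unphased genotype G = h_k + h_l, summed SNP by SNP."""
    return tuple(a + b for a, b in zip(h_k, h_l))

def consistent_pairs(g, haplotypes):
    """Unordered, 1-based index pairs (k, l), k <= l, with h_k + h_l == g."""
    g = tuple(g)
    K = len(haplotypes)
    return [(k, l)
            for k in range(1, K + 1)
            for l in range(k, K + 1)
            if genotype(haplotypes[k - 1], haplotypes[l - 1]) == g]

haps = all_haplotypes(3)
pairs_021 = consistent_pairs((0, 2, 1), haps)  # single pair: phase is known
pairs_011 = consistent_pairs((0, 1, 1), haps)  # two pairs: phase is ambiguous
```

Under the assumed ordering, `pairs_021` contains only the index pair for $(h_3, h_4)$, whereas `pairs_011` contains the pairs for $(h_1, h_4)$ and $(h_2, h_3)$.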

The goal of the association studies is to relate the pair of haplotypes to disease phenotypes or traits. The simplest phenotype is the binary indicator for the disease status, which takes the value 1 if the individual is diseased and 0 otherwise. The diseased individuals may be further classified into several categories corresponding to different types of disease or varying degrees of disease severity. If the age of onset is likely to be genetically mediated, then it is desirable to use the age of onset as the phenotype. One may also be interested in disease-related traits, such as blood pressure.

The data on the disease phenotype may be gathered in various ways. The simplest approach is to obtain a random sample from the target population and measure the phenotype of interest on every individual in the sample. Such studies are referred to as cross-sectional studies, which are feasible if the disease is relatively frequent or if one is interested only in some readily measured traits that are related to the disease. If one is interested in the age at the onset of a disease, however, then it is necessary to follow a cohort of individuals forward in time, in which case the phenotype (i.e., time to disease occurrence) may be censored. When the disease is relatively rare, it is more cost-effective to use the case-control design, which collects data retrospectively on a sample of diseased individuals and on a separate sample of disease-free individuals. It is often desirable to collect data on environmental variables or covariates so as to investigate gene–environment interactions.

Let $Y$ be the phenotype of interest, and let $X$ be the covariates. For cross-sectional and case-control studies, the association between $Y$ and $(X, H)$ is characterized by the conditional density of $Y = y$ given $H = (h_k, h_l)$ and $X = x$, denoted by $P_{\alpha,\beta,\xi}(y \mid x, (h_k, h_l))$, where $\alpha$ denotes the intercept(s), $\beta$ denotes the regression effects, and $\xi$ denotes the nuisance parameters (e.g., variance and overdispersion parameters). There is considerable flexibility in specifying the regression relationship. Suppose that $h^*$ is the target haplotype of interest and that there are no covariates. Then a linear predictor in the form of $\alpha + \beta I(h_k = h_l = h^*)$ pertains to a recessive model; $\alpha + \beta\{I(h_k = h^*) + I(h_l = h^*) - I(h_k = h_l = h^*)\}$ pertains to a dominant model; $\alpha + \beta\{I(h_k = h^*) + I(h_l = h^*)\}$ pertains to an additive model; and $\alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 I(h_k = h_l = h^*)$ pertains to a codominant model, where $I(\cdot)$ is the indicator function. Clearly, the codominant model contains the other three models as special cases. A codominant model with gene–environment interactions has the following linear predictor:

$$
\alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 I(h_k = h_l = h^*) + \beta_3^T x + \beta_4^T\{I(h_k = h^*) + I(h_l = h^*)\}x + \beta_5^T I(h_k = h_l = h^*)x. \tag{1}
$$

Additional terms may be included so as to examine the effects of several haplotype configurations or to investigate the joint effects of multiple candidate genes.
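
As a minimal sketch of the four covariate-free linear predictors described above, the following code evaluates each model for a given haplotype pair. The function name, argument names, and numeric values are hypothetical and chosen only for illustration; haplotypes are represented as 0/1 tuples.

```python
def linear_predictor(h_k, h_l, h_star, model, alpha, beta, beta2=0.0):
    """Linear predictor without covariates for the four genetic models."""
    copies = int(h_k == h_star) + int(h_l == h_star)   # I(h_k = h*) + I(h_l = h*)
    both = int(h_k == h_star and h_l == h_star)        # I(h_k = h_l = h*)
    if model == "recessive":
        return alpha + beta * both
    if model == "dominant":
        return alpha + beta * (copies - both)
    if model == "additive":
        return alpha + beta * copies
    if model == "codominant":
        return alpha + beta * copies + beta2 * both
    raise ValueError(f"unknown model: {model}")

h_star, other = (0, 1, 1), (0, 0, 0)
pairs = [(other, other), (h_star, other), (h_star, h_star)]  # 0, 1, 2 copies of h*
dominant = [linear_predictor(a, b, h_star, "dominant", 0.0, 1.0) for a, b in pairs]
# The same values arise from the codominant form with beta2 = -beta1.
codominant = [linear_predictor(a, b, h_star, "codominant", 0.0, 1.0, beta2=-1.0)
              for a, b in pairs]
```

The comparison of `dominant` and `codominant` shows one of the nesting relations: the dominant model is the codominant model with $\beta_2 = -\beta_1$. The additive model corresponds to $\beta_2 = 0$ and the recessive model to $\beta_1 = 0$.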

Although we are interested in the effects of $H$ and $X$ on $Y$, we observe $G$ instead of $H$. As mentioned earlier, $G$ is the summation of the paired sequences in $H$. Thus we have a regression problem with missing data in which the primary explanatory variable pertains to two ordered sequences of numbers from $\{0, 1\}$, but only the summation of the two sequences is observed. We assume that $X$ is independent of $H$ conditional on $G$ and that $(1, X^T)$ is linearly independent with positive probability.

Write $\pi_{kl} = P\{H = (h_k, h_l)\}$ and $\pi_k = P(h = h_k)$, $k, l = 1, \ldots, K$. As we demonstrate in this article, it is sometimes possible to make inference about haplotype effects without imposing any structures on $\{\pi_{kl}\}$, although estimating $\{\pi_k\}$ and testing for no haplotype effects require some restrictions on $\{\pi_{kl}\}$. Under Hardy–Weinberg equilibrium,

$$
\pi_{kl} = \pi_k \pi_l, \qquad k, l = 1, \ldots, K. \tag{2}
$$

We consider two specific forms of departure from Hardy–Weinberg equilibrium:

$$
\pi_{kl} = (1 - \rho)\pi_k \pi_l + \delta_{kl}\rho\pi_k \tag{3}
$$

and

$$
\pi_{kl} = \frac{(1 - \rho + \delta_{kl}\rho)\pi_k \pi_l}{1 - \rho + \rho\sum_{j=1}^{K}\pi_j^2}, \tag{4}
$$

where $0 \le \pi_k \le 1$, $\sum_{k=1}^{K}\pi_k = 1$, $\delta_{kk} = 1$, and $\delta_{kl} = 0$ $(k \ne l)$. In (3), $\rho$ is called the inbreeding coefficient or fixation index (Weir 1996, p. 93) and corresponds to Cohen's (1960) kappa measure of agreement. Equation (4) creates disequilibrium by giving different fitness values to the homozygous and heterozygous pairs (Niu et al. 2002). The denominator is a normalizing constant. Both (3) and (4) reduce to (2) if $\rho = 0$. Excess homozygosity (i.e., $\pi_{kk} > \pi_k^2$, $k = 1, \ldots, K$) arises when $\rho > 0$, and excess heterozygosity (i.e., $\pi_{kk} < \pi_k^2$, $k = 1, \ldots, K$) arises when $\rho < 0$. Recently, Satten and Epstein (2004) considered (3) for the control population under the case-control design. We abuse the notation slightly in that the $\pi_k$ in (4) do not pertain to the marginal distribution of $H$ unless $\rho = 0$.
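
The three specifications (2), (3), and (4) can be compared numerically. The sketch below builds the $K \times K$ table of $\pi_{kl}$ over ordered pairs $(k, l)$ for each form; the function names and the example frequencies are hypothetical.

```python
def pair_probs_hwe(pi):
    """Equation (2): pi_kl = pi_k * pi_l."""
    return [[pk * pl for pl in pi] for pk in pi]

def pair_probs_inbreeding(pi, rho):
    """Equation (3): pi_kl = (1 - rho) pi_k pi_l + delta_kl rho pi_k."""
    K = len(pi)
    return [[(1 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
             for l in range(K)] for k in range(K)]

def pair_probs_fitness(pi, rho):
    """Equation (4): (1 - rho + delta_kl rho) pi_k pi_l over a normalizer."""
    K = len(pi)
    norm = 1 - rho + rho * sum(p * p for p in pi)
    return [[(1 - rho + (rho if k == l else 0.0)) * pi[k] * pi[l] / norm
             for l in range(K)] for k in range(K)]

def total(table):
    """Sum of all entries of a K x K table."""
    return sum(sum(row) for row in table)

pi = [0.5, 0.3, 0.2]                          # illustrative haplotype frequencies
table3 = pair_probs_inbreeding(pi, rho=0.1)   # excess homozygosity under (3)
table4 = pair_probs_fitness(pi, rho=0.1)      # excess homozygosity under (4)
```

Both tables sum to 1, and setting `rho=0` in either function reproduces the Hardy–Weinberg table. The row sums of `table3` equal the $\pi_k$, whereas those of `table4` generally do not, which is the abuse of notation noted in the text.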

Let $\tilde{h}$ denote a haplotype that differs from $h$ at only one SNP. Write $\nabla_u f(u, v) = \partial f(u, v)/\partial u$. The following lemma states that, under (3) or (4), $\{\pi_k\}$ and $\rho$ are identifiable from the data on $G$, and the data on $G$ provide positive information about these parameters.

Lemma 1. Assume that either (3) or (4) holds. The parameters $\{\pi_k\}$ and $\rho$ are uniquely determined by the distribution of $G$. For nondegenerate distribution $\{\pi_k\}$, if there exist a constant $\mu$ and a vector $\nu = (\nu_1, \ldots, \nu_K)^T$ such that $\sum_{k=1}^{K}\nu_k = 0$ and $\mu\nabla_\rho \log P(G = g) + \sum_{k=1}^{K}\nu_k\nabla_{\pi_k}\log P(G = g) = 0$ for $g = 2h$, then $\mu = 0$ and $\nu = 0$.

In the sequel, $\mathcal{G}$ denotes the set of all possible genotypes, and $S(G)$ denotes the set of haplotype pairs that are consistent with genotype $G$. We suppose that $\pi_k > 0$ for all $k = 1, \ldots, K$, where $K$ is now interpreted as the total number of haplotypes that exist in the population. For any parameter $\theta$, we use $\theta_0$ to denote its true value if the distinction is necessary. We assume that the true value of any Euclidean parameter $\theta$ belongs to the interior of a known compact set within the domain of $\theta$. Proofs of Lemma 1 and all of the theorems are provided in the Appendix.

2.2 Cross-Sectional Studies

There is a random sample of $n$ individuals from the underlying population. The observable data consist of $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$. The trait $Y$ can be discrete or continuous, univariate or multivariate. As stated in Section 2.1, the conditional density of $Y$ given $X$ and $H$ is given by $P_{\alpha,\beta,\xi}(Y \mid X, H)$. For a univariate trait, this regression model may take the form of a generalized linear model (McCullagh and Nelder 1989) with the linear predictor given in (1). If the trait is measured repeatedly in a longitudinal study, then generalized linear mixed models (Diggle, Heagerty, Liang, and Zeger 2002, chap. 9) may be used. The following conditions are required for estimating $(\alpha, \beta, \xi)$.

Condition 1. If $P_{\alpha,\beta,\xi}(Y \mid X, H) = P_{\tilde\alpha,\tilde\beta,\tilde\xi}(Y \mid X, H)$ for any $H = (h_k, h_k)$ and $H = (h_k, \tilde{h}_k)$, $k = 1, \ldots, K$, then $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\xi = \tilde\xi$.

Condition 2. If there exists a constant vector $\nu$ such that $\nu^T\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}(Y \mid X, H) = 0$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde{h}_k)$, then $\nu = 0$.

Remark 1. Condition 1 ensures that the parameters of interest are identifiable from the genotype data. The linear independence of the score function stated in Condition 2 ensures nonsingularity of the information matrix. The reason for considering $H = (h_k, h_k)$ and $H = (h_k, \tilde{h}_k)$ is that these haplotype pairs can be inferred with certainty because of the unique decompositions of the corresponding genotypes $g = 2h_k$ and $g = h_k + \tilde{h}_k$. All of the commonly used regression models, particularly generalized linear (mixed) models with linear predictors in the form of (1), satisfy Conditions 1 and 2.

We show in Section A.2.1 that it is possible to estimate the regression parameters without imposing any structure on the joint distribution of $H$. But this estimation requires knowledge of whether or not the dominant effects exist. Specifically, if there are no dominant effects, then only $(\alpha, \beta, \xi)$ and $P(G = g)$ are identifiable; otherwise, $(\alpha, \beta, \xi)$, $P(G = g)$, and $P(H = (h^*, g - h^*))$ are identifiable. If either (3) or (4) holds, then it follows from Lemma 1 and Condition 1 that all of the parameters are identifiable, regardless of the genetic mechanism. Denote the joint distribution of $H$ by $P_\gamma(H = (h_k, h_l))$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$. Under (3) or (4), $\gamma = (\rho, \pi_1, \ldots, \pi_K)^T$. When the distribution of $H$ is unspecified, $\gamma$ pertains to the aspects of the distribution of $H$ that are identifiable.

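Write $\theta = (\alpha, \beta, \gamma, \xi)$. The likelihood for $\theta$ based on the cross-sectional data is proportional to

$$
L_n(\theta) \equiv \prod_{i=1}^{n}\prod_{g \in \mathcal{G}} m_g(Y_i, X_i; \theta)^{I(G_i = g)}, \tag{5}
$$

where

$$
m_g(y, x; \theta) = \sum_{(h_k, h_l) \in S(g)} P_{\alpha,\beta,\xi}\bigl(y \mid x, (h_k, h_l)\bigr)P_\gamma(h_k, h_l).
$$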
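
To make the structure of (5) concrete, the following sketch evaluates $m_g$ and the log-likelihood for a binary trait. It assumes a logistic model with an additive effect of a target haplotype, no covariates, and Hardy–Weinberg equilibrium; it sums over ordered haplotype pairs so that the pair probabilities form a $K \times K$ table summing to 1. All names and numeric values are hypothetical.

```python
import math
from itertools import product

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def make_logistic_density(alpha, beta, h_star):
    """P(y | H) for a binary trait under the additive model, no covariates."""
    def density(y, h_k, h_l):
        copies = int(h_k == h_star) + int(h_l == h_star)
        p1 = expit(alpha + beta * copies)
        return p1 if y == 1 else 1.0 - p1
    return density

def m_g(y, g, haps, pair_prob, density):
    """Sum of P(y | H) * P(H) over ordered pairs H = (h_k, h_l) with h_k + h_l = g."""
    g = tuple(g)
    return sum(density(y, h_k, h_l) * pair_prob[k][l]
               for k, h_k in enumerate(haps)
               for l, h_l in enumerate(haps)
               if tuple(a + b for a, b in zip(h_k, h_l)) == g)

def log_likelihood(data, haps, pair_prob, density):
    """Logarithm of (5) for data given as (Y_i, G_i) pairs."""
    return sum(math.log(m_g(y, g, haps, pair_prob, density)) for y, g in data)

haps = list(product((0, 1), repeat=2))               # M = 2, so K = 4
pi = [0.4, 0.3, 0.2, 0.1]                            # illustrative frequencies
pair_prob = [[pk * pl for pl in pi] for pk in pi]    # HWE, equation (2)
density = make_logistic_density(alpha=-1.0, beta=0.5, h_star=(1, 1))
data = [(1, (1, 1)), (0, (0, 0)), (0, (2, 1))]       # toy (Y_i, G_i) observations
ll = log_likelihood(data, haps, pair_prob, density)
```

A useful sanity check on any implementation of $m_g$ is that it defines a joint distribution of $(Y, G)$: summing `m_g` over both values of `y` and all genotypes in $\{0, 1, 2\}^M$ should give 1.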

The MLE $\hat\theta$ can be obtained by maximizing (5) via the Newton–Raphson algorithm or an optimization algorithm. It is generally more efficient to use the expectation–maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), especially when the distribution of $H$ satisfies (3) with $\rho \ge 0$; see Section A.2.2 for details.
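
The algorithm of Section A.2.2 is not part of this excerpt. As a simpler stand-in, the sketch below runs EM for haplotype frequencies from genotypes alone, under Hardy–Weinberg equilibrium and with no phenotype model, in the spirit of the frequency-estimation methods cited in the Introduction. It is not the authors' algorithm; the function name, the fixed iteration count, and the toy data are assumptions made for illustration.

```python
from itertools import product

def em_haplotype_freqs(genotypes, M, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes under HWE."""
    haps = list(product((0, 1), repeat=M))
    K = len(haps)
    # Ordered haplotype pairs (k, l) consistent with each distinct genotype.
    compat = {
        g: [(k, l) for k in range(K) for l in range(K)
            if tuple(a + b for a, b in zip(haps[k], haps[l])) == g]
        for g in set(genotypes)
    }
    pi = [1.0 / K] * K                      # uniform starting values
    for _ in range(n_iter):
        counts = [0.0] * K
        for g in genotypes:
            # E-step: posterior weight of each consistent pair given g.
            weights = [pi[k] * pi[l] for k, l in compat[g]]
            norm = sum(weights)
            for (k, l), w in zip(compat[g], weights):
                counts[k] += w / norm
                counts[l] += w / norm
        # M-step: expected haplotype counts divided by 2n chromosomes.
        pi = [c / (2.0 * len(genotypes)) for c in counts]
    return haps, pi

toy_genotypes = [(0, 0), (0, 0), (1, 1), (2, 2), (0, 1)]   # hypothetical data
haps_2snp, pi_hat = em_haplotype_freqs(toy_genotypes, M=2)
```

When every genotype has a unique decomposition, the E-step weights are all 1 and the estimate reduces to simple haplotype counting. Extending the E-step weights by the factor $P_{\alpha,\beta,\xi}(y \mid x, (h_k, h_l))$ would bring the phenotype into the posterior, but the corresponding M-step for the regression parameters is not shown here.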

By the classical likelihood theory, we can show that $\hat\theta$ is consistent, asymptotically normal, and asymptotically efficient under Conditions 1 and 2 and the following condition.

Condition 3. If there exists a constant vector $\nu$ such that $\nu^T\nabla_\theta\log m_g(Y, X; \theta_0) = 0$, then $\nu = 0$.

Remark 2. Condition 3 ensures the nonsingularity of the information matrix. This condition can be easily verified when the joint distribution of $H$ is unspecified and is implied by Lemma 1 and Condition 2 when the distribution satisfies (3) or (4).

2.3 Case-Control Studies With Known Population Totals

We consider case-control data supplemented by information on population totals (Scott and Wild 1997). There is a finite population of $N$ individuals that is considered a random sample from the joint distribution of $(Y, X, H)$, where $Y$ is a categorical response variable. All that is known about this finite population is the total number of individuals in each category of $Y = y$. A sample of size $n$ stratified on the disease status is drawn from the finite population, and the values of $X$ and $G$ are recorded for each sampled individual. The supplementary information on population totals is often available from hospital records, cancer registries, and official statistics. If a case-control sample is drawn from a cohort study, then the cohort serves as the finite population. The observable data consist of $(Y_i, R_i, R_i X_i, R_i G_i)$, $i = 1, \ldots, N$, where $R_i$ indicates, by the values 1 versus 0, whether or not the $i$th individual in the finite population is selected into the case-control sample.

The association between $Y$ and $(X, H)$ is characterized by $P_{\alpha,\beta,\xi}(Y \mid X, H)$, where $\alpha$, $\beta$, and $\xi$ pertain to the intercept(s), regression effects, and overdispersion parameters (McCullagh and Nelder 1989). In the case of a binary response variable, important examples of $P_{\alpha,\beta,\xi}(Y \mid X, H)$ include the logistic, probit, and complementary log–log regression models. When there are more than two categories, examples include the proportional odds, multivariate probit, and multivariate logistic regression models. Because the data associated with $R_i = 1$ yield the same form of likelihood as that of a cross-sectional study and the data associated with $R_i = 0$ yield a missing-data likelihood, all of the identifiability results stated in Section 2.2 apply to the current setting. We again write $\theta = (\alpha, \beta, \xi, \gamma)$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$.

Let $F_g(\cdot)$ be the cumulative distribution function of $X$ given $G = g$, and let $f_g(x)$ be the density of $F_g(x)$ with respect to a dominating measure $\mu(x)$. Note that $F_g(\cdot)$ is infinite-dimensional if $X$ has continuous components. The joint density of $(Y = y, G = g, X = x)$ is $m_g(y, x; \theta)f_g(x)$. The likelihood concerning $\theta$ and $\{F_g\}$ takes the form

$$
L_n(\theta, \{F_g\}) = \prod_{i=1}^{N}\Biggl[\prod_{g \in \mathcal{G}}\bigl\{m_g(Y_i, X_i; \theta)f_g(X_i)\bigr\}^{I(G_i = g)}\Biggr]^{R_i} \times \Biggl[\sum_{g \in \mathcal{G}}\int m_g(Y_i, x; \theta)\,dF_g(x)\Biggr]^{1 - R_i}. \tag{6}
$$

Unlike the likelihood for the cross-sectional design given in (5)the density functions of X given G cannot be factored out ofthe likelihood given in (6) and thus cannot be omitted from thelikelihood

We maximize (6) to obtain the MLEs θ and Fg(middot) The lat-ter is an empirical function with point masses at the observed Xisuch that Gi = g and Ri = 1 The maximization can be car-ried out via the NewtonndashRaphson profile-likelihood or large-scale optimization methods An alternative way to calculate theMLEs is through the EM algorithm described in Section A31

We impose the following regularity condition and then statethe asymptotic results in Theorem 1

Condition 4 For any g isin G fg(x) is positive in its supportand continuously differentiable with respect to a suitable mea-sure

Theorem 1 Under Conditions 1ndash4 θ and Fg(middot) are con-sistent in that |θ minus θ0| + supxg |Fg(x) minus Fg(x)| rarr 0 almostsurely In addition n12(θ minus θ0) converges in distribution to amean 0 normal random vector whose covariance matrix attainsthe semiparametric efficiency bound

Let $pl_n(\theta)$ be the profile log-likelihood for $\theta$, that is, $pl_n(\theta) = \max_{\{F_g\}} \log L_n(\theta, \{F_g\})$. Then the $(s, t)$th element of the inverse covariance matrix of $\hat\theta$ can be estimated by $-\epsilon_n^{-2}\{pl_n(\hat\theta + \epsilon_n e_s + \epsilon_n e_t) - pl_n(\hat\theta + \epsilon_n e_s - \epsilon_n e_t) - pl_n(\hat\theta - \epsilon_n e_s + \epsilon_n e_t) + pl_n(\hat\theta)\}$, where $\epsilon_n$ is a constant of the order $n^{-1/2}$, and $e_s$ and $e_t$ are the $s$th and $t$th canonical vectors. The function $pl_n(\theta)$ can be calculated via the EM algorithm by holding $\theta$ constant in both the E-step and the M-step.
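The idea of estimating the information matrix from second differences of a profile log-likelihood can be sketched numerically. The following is a minimal illustration only: it uses the standard four-point central difference, which is a common variant and not necessarily the exact expression given in the text, and the names `numerical_information`, `toy_pl`, and `A` are hypothetical. A quadratic toy function stands in for $pl_n$, which in practice would come from an inner maximization such as the EM algorithm.

```python
import numpy as np

def numerical_information(pl, theta_hat, eps):
    """Approximate the negative Hessian of pl at theta_hat by central second differences."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = theta_hat.size
    info = np.zeros((d, d))
    for s in range(d):
        for t in range(d):
            es = np.zeros(d)
            et = np.zeros(d)
            es[s] = eps  # step of size eps along the s-th canonical vector
            et[t] = eps  # step of size eps along the t-th canonical vector
            second_diff = (pl(theta_hat + es + et) - pl(theta_hat + es - et)
                           - pl(theta_hat - es + et) + pl(theta_hat - es - et))
            info[s, t] = -second_diff / (4.0 * eps ** 2)
    return info

# Toy stand-in for a profile log-likelihood (an assumption for illustration):
# a concave quadratic with known negative Hessian A, maximized at 0.
A = np.array([[2.0, 0.5], [0.5, 1.0]])

def toy_pl(theta):
    return -0.5 * theta @ A @ theta

info = numerical_information(toy_pl, np.zeros(2), eps=1e-3)
covariance = np.linalg.inv(info)  # estimated covariance matrix of the estimator
```

For a quadratic function the central difference recovers the negative Hessian up to rounding error, so `info` should agree closely with `A`; for a real profile log-likelihood the step size would be taken of order $n^{-1/2}$, as the text indicates.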

Remark 3. If $N$ is much larger than $n$, or if the population frequencies, rather than the totals, are known, then we maximize $\prod_{i=1}^{n} \prod_{g \in \mathcal{G}} \{m_g(Y_i, X_i; \theta) f_g(X_i)\}^{I(G_i = g)}$ subject to the constraints that $\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x) = p_y$, where $p_y$ is the population frequency of $Y = y$. The resultant estimator of $\theta_0$ is consistent, asymptotically normal, and asymptotically efficient. The results in this section can be extended straightforwardly to accommodate stratifications on covariates.

2.4 Case-Control Studies With Unknown Population Totals

We consider the classical case-control design, which measures $X$ and $G$ on $n_1$ cases ($Y = 1$) and $n_0$ controls ($Y = 0$) and requires no knowledge about the finite population. With the notation introduced in the previous section, the likelihood contribution from one individual takes the form

$$RL(\theta, \{F_g\}) = \frac{\prod_{g \in \mathcal{G}} \{m_g(y, X; \theta) f_g(X)\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x)}, \tag{7}$$

where we use $y$ instead of $Y$ to emphasize that $y$ is not random. Define

$$f_g^{\dagger}(x) = \frac{m_g(0, x; \theta) f_g(x)}{\int m_g(0, x; \theta)\, dF_g(x)}, \qquad q_g = \frac{\int m_g(0, x; \theta)\, dF_g(x)}{\sum_{\tilde g \in \mathcal{G}} \int m_{\tilde g}(0, x; \theta)\, dF_{\tilde g}(x)}.$$

Clearly, $f_g^{\dagger}(x)$ is the conditional density of $X$ given $G = g$ and $Y = 0$, and $q_g$ is the conditional probability of $G = g$ given $Y = 0$. Let $g_0$ and $x_0$ be some specific values of $G$ and $X$. Write $F_g^{\dagger}(x) = \int_0^x f_g^{\dagger}(s)\, d\mu(s)$. We can express (7) as

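$$RL(\theta, \{F_g^{\dagger}\}, \{q_g\}) = \frac{\prod_{g \in \mathcal{G}} \{\eta(y, X, g; \theta) f_g^{\dagger}(X) q_g\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} q_g \int \eta(y, x, g; \theta)\, dF_g^{\dagger}(x)}, \tag{8}$$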

where

$$\eta(y, x, g; \theta) = \frac{m_g(y, x; \theta)\, m_{g_0}(0, x_0; \theta)}{m_g(0, x; \theta)\, m_{g_0}(y, x_0; \theta)}.$$

We call $\eta$ the generalized odds ratio (Liang and Qin 2000), which reduces to the ordinary odds ratio when $S(g)$ is a singleton.
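To make the definition of the generalized odds ratio concrete, the following sketch evaluates $\eta$ for a logistic model with a scalar regressor. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the representation of $S(g)$ as a list of (diplotype, probability) pairs, the string haplotype labels, and the choice of $Z$ as the number of copies of a target haplotype are all hypothetical.

```python
import math

def logistic_prob(y, lin_pred):
    """P(Y = y | linear predictor) under a logistic model, for y in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-lin_pred))
    return p1 if y == 1 else 1.0 - p1

def m_g(y, x, diplotypes, alpha, beta, z):
    """m_g(y, x; theta): sum over diplotypes H in S(g) of P(y | x, H) * P(H).

    `diplotypes` is a list of (H, probability) pairs representing S(g).
    """
    return sum(logistic_prob(y, alpha + beta * z(x, H)) * p for H, p in diplotypes)

def generalized_odds_ratio(y, x, S_g, x0, S_g0, alpha, beta, z):
    """eta(y, x, g; theta) relative to the reference values (g0, x0)."""
    num = m_g(y, x, S_g, alpha, beta, z) * m_g(0, x0, S_g0, alpha, beta, z)
    den = m_g(0, x, S_g, alpha, beta, z) * m_g(y, x0, S_g0, alpha, beta, z)
    return num / den

# Hypothetical regressor: number of copies of the target haplotype "h*".
def copies_of_target(x, H):
    return sum(1 for h in H if h == "h*")

alpha, beta = -3.0, 0.4
S_g = [(("h*", "h*"), 0.09)]   # singleton S(g): two copies of h*
S_g0 = [(("a", "a"), 0.25)]    # singleton reference S(g0): no copies
eta = generalized_odds_ratio(1, 0.0, S_g, 0.0, S_g0, alpha, beta, copies_of_target)
```

When both $S(g)$ and $S(g_0)$ are singletons, the diplotype probabilities and the intercept cancel, so `eta` equals the ordinary odds ratio $e^{2\beta}$ in this toy configuration.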

Remark 4. The parameter $q_g$ is a functional of $f_g^{\dagger}$ and $\theta$ because $\int m_g(0, x; \theta)\, dF_g(x) = \{\int m_g^{-1}(0, x; \theta)\, dF_g^{\dagger}(x)\}^{-1}$. This constraint makes it very difficult to study the identifiability of the parameters. Thus we treat $q_g$ as a free parameter in our development.

For traditional case-control data, the odds ratio is identifiable (whereas the intercept is not), and its MLE can be obtained by maximizing the prospective likelihood (Prentice and Pyke 1979). Similar results hold when the exposure is measured with error (Roeder, Carroll, and Lindsay 1996); however, the distribution of the measurement error needs to be estimated from a validation set or an external source. With unphased genotype data, identifiability is much more delicate. We show in Section A.4.1 that the components of $\theta$ that are identifiable from the retrospective likelihood are exactly those that are identifiable from the generalized odds ratio. Thus we assume that the generalized odds ratio depends only on a set of identifiable parameters, still denoted by $\theta$; otherwise, the inference is not tractable. For the logistic link function with linear predictor (1), we show in Section A.4.2 that if there are no dominant effects, then $\theta$ consists only of $\beta$; if there are no covariate effects but there exists a dominant main effect, then $\beta$ is identifiable and $P(H = (h^*, g - h^*))/P(G = g)$ is identifiable up to a scalar constant; and if the dominant effect depends on a continuous covariate, or if the dominant main effect and the main effect of a continuous covariate are nonzero, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$. For the probit and complementary log–log link functions, we show in Section A.4.3 that if there are dominant effects and at least one continuous covariate has an effect, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$.

We maximize the product of (8) over the $n \equiv n_1 + n_0$ individuals in the case-control sample to produce the MLEs $\hat\theta$, $\{\hat F_g^{\dagger}(\cdot)\}$, and $\{\hat q_g\}$. Although the $\hat F_g^{\dagger}(\cdot)$ are high-dimensional, we show in Section A.4.4 that $\hat\theta$ can be obtained by profiling a likelihood function over a scalar nuisance parameter.

To state the asymptotic properties of the MLEs, we impose the following conditions.

Condition 5. If there exists a vector $v$ such that $v^T \nabla_\theta \log \eta(1, x, g; \theta)$ is a constant with probability 1, then $v = 0$.

Condition 6. The function $f_g^{\dagger}$ is positive in its support and continuously differentiable.

Condition 7. The fraction $n_1/n$ converges to a constant in $(0, 1)$.

Remark 5. Condition 5 implies nonsingularity of the information matrix for $\theta_0$ and can be shown to hold for the logistic, probit, and complementary log–log link functions. Condition 7 ensures that there are both cases and controls in the sample.

Theorem 2. Under Conditions 5–7, $|\hat\theta - \theta_0| + \sup_g |\hat q_g - q_g| + \sup_{x,g} |\hat F_g^{\dagger}(x) - F_g^{\dagger}(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix attains the semiparametric efficiency bound.

In most case-control studies, the disease is (relatively) rare. When the disease is rare, considerable simplicity arises because of the following approximation for the logistic regression model:

$$P_{\alpha,\beta}(Y \mid X, H) \approx \exp\{Y(\alpha + \beta^T Z(X, H))\},$$

where $Z(X, H)$ is a specific function of $X$ and $H$. We assume that either (3) or (4) holds. The likelihood based on $(X_i, G_i, y_i)$, $i = 1, \ldots, n$, can be approximated by

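$$L_n(\theta, \{F_g\}) = \prod_{i=1}^{n} \left( \frac{\prod_{g \in \mathcal{G}} \bigl[f_g(X_i) \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(X_i, (h_k, h_l))} P_\gamma(h_k, h_l)\bigr]^{I(G_i = g)}}{\sum_{g \in \mathcal{G}} \int_x \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, (h_k, h_l))} P_\gamma(h_k, h_l)\, dF_g(x)} \right)^{y_i} \times \left[ \prod_{g \in \mathcal{G}} \Bigl\{ f_g(X_i) \sum_{(h_k, h_l) \in S(g)} P_\gamma(h_k, h_l) \Bigr\}^{I(G_i = g)} \right]^{1 - y_i}. \tag{9}$$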

We impose the following condition.

Condition 8. If $\alpha + \beta^T Z(X, H) = \tilde\alpha + \tilde\beta^T Z(X, H)$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, then $\alpha = \tilde\alpha$ and $\beta = \tilde\beta$.

This condition is similar to Condition 1 stated in Section 2.2, and it holds for the codominant model. Under this condition, it follows from Lemma 1 that no two sets of parameters can give the same likelihood with probability 1. Thus the maximizer of (9), denoted by $(\hat\theta, \{\hat F_g\})$, is locally unique. We show in Section A.4.5 that $\hat\theta$ can be easily obtained by profiling over a small number of parameters.

To derive the asymptotic properties, we provide a mathematical definition of rare disease.

Condition 9. For $i = 1, \ldots, n$, the conditional distribution of $Y_i$ given $(X_i, H_i)$ satisfies $P(Y_i = 1 \mid X_i, H_i) = a_n \exp\{\beta_0^T Z(X_i, H_i)\} / [1 + a_n \exp\{\beta_0^T Z(X_i, H_i)\}]$, where $a_n = o(n^{-1/2})$.
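The quality of the rare-disease approximation under a model of the form in Condition 9 can be checked with simple arithmetic. The sketch below is illustrative only; the function names and the numerical values of $a_n$ and the linear predictor are hypothetical.

```python
import math

def exact_prob(a_n, lin_pred):
    """P(Y = 1) under the logistic form a_n e^{lin} / (1 + a_n e^{lin})."""
    e = a_n * math.exp(lin_pred)
    return e / (1.0 + e)

def rare_disease_approx(a_n, lin_pred):
    """Exponential approximation a_n e^{lin}, dropping the denominator."""
    return a_n * math.exp(lin_pred)

a_n, lin_pred = 1e-3, 0.5  # hypothetical values for illustration
exact = exact_prob(a_n, lin_pred)
approx = rare_disease_approx(a_n, lin_pred)
relative_error = (approx - exact) / exact
```

Algebraically, the relative error equals $a_n e^{\beta^T Z}$ itself, so the approximation improves at the same rate at which $a_n$ tends to 0.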

Theorem 3. Under Conditions 6–9, $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to_{P_n} 0$, where $P_n$ is the probability measure given by Condition 9. Furthermore, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix achieves the semiparametric efficiency bound.

2.5 Cohort Studies

In a cohort study, $Y$ represents the time to disease occurrence, which is subject to right-censorship by $C$. The data consist of $(\tilde Y_i, \Delta_i, X_i, G_i)$, $i = 1, \ldots, n$, where $\tilde Y_i = \min(Y_i, C_i)$ and $\Delta_i = I(Y_i \le C_i)$. We relate $Y_i$ to $(X_i, H_i)$ through a class of semiparametric linear transformation models,

$$\Gamma(Y_i) = -\beta^T Z(X_i, H_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{10}$$

where $\Gamma$ is an unknown increasing function, $Z(X, H)$ is a known function of $X$ and $H$, and the $\epsilon_i$'s are independent errors with known distribution function $F$. We may rewrite (10) as

$$P(Y_i \le t \mid X_i, H_i) = Q\bigl(\Lambda(t) e^{\beta^T Z(X_i, H_i)}\bigr),$$

where $\Lambda(t) = e^{\Gamma(t)}$ and $Q(x) = F(\log x)$ ($x > 0$). The choices of the extreme-value and standard logistic distributions for $F$, or equivalently $Q(x) = 1 - e^{-x}$ and $Q(x) = 1 - (1 + x)^{-1}$, yield the proportional hazards model and the proportional odds model (Pettitt 1984).
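The correspondence between the two choices of $Q$ and the two familiar models can be verified numerically. The following is a small sketch under illustrative assumptions: the baseline function `Lambda` (taken here as $t^2$), the helper names, and the value of the linear predictor are hypothetical and serve only to check the two identities.

```python
import math

def q_ph(x):
    """Q(x) = 1 - exp(-x): extreme-value errors, proportional hazards."""
    return 1.0 - math.exp(-x)

def q_po(x):
    """Q(x) = 1 - 1/(1 + x): standard logistic errors, proportional odds."""
    return 1.0 - 1.0 / (1.0 + x)

def event_prob(t, lin_pred, Q, Lambda):
    """P(Y <= t | X, H) = Q(Lambda(t) * exp(beta^T Z))."""
    return Q(Lambda(t) * math.exp(lin_pred))

def cumulative_hazard(p):
    """Cumulative hazard implied by an event probability p."""
    return -math.log(1.0 - p)

def odds(p):
    """Odds implied by an event probability p."""
    return p / (1.0 - p)

Lambda = lambda t: t ** 2  # hypothetical baseline function
lin_pred = 0.7             # hypothetical value of beta^T Z

# Under q_ph the cumulative-hazard ratio is constant in t;
# under q_po the odds ratio is constant in t.
hazard_ratios = [cumulative_hazard(event_prob(t, lin_pred, q_ph, Lambda))
                 / cumulative_hazard(event_prob(t, 0.0, q_ph, Lambda))
                 for t in (0.3, 1.0, 2.0)]
odds_ratios = [odds(event_prob(t, lin_pred, q_po, Lambda))
               / odds(event_prob(t, 0.0, q_po, Lambda))
               for t in (0.3, 1.0, 2.0)]
```

Both lists should be constant and equal to $e^{\beta^T Z}$, which is the defining property of the proportional hazards and proportional odds models, respectively.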

We impose Condition 8. Under this condition, $\beta$ and $\Lambda(\cdot)$ are identifiable from the observable data. The identifiability of the distribution of $H$ is the same here as in the case of cross-sectional studies. Under (3) or (4) and Condition 8, all of the parameters, including $\beta$, $\Lambda(\cdot)$, and $\gamma$, are identifiable. This is shown in Section A.5.1.

The following assumption on censoring is required in constructing the likelihood.

Condition 10. Conditional on $X$ and $G$, the censoring time $C$ is independent of $Y$ and $H$.

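Let $\theta = (\beta, \gamma)$. The likelihood concerning $\theta$ and $\Lambda$ takes the form

$$L_n(\theta, \Lambda) = \prod_{i=1}^{n} \Biggl[ \sum_{(h_k, h_l) \in S(G_i)} \Bigl\{ \Lambda'(\tilde Y_i) e^{\beta^T Z(X_i, (h_k, h_l))} Q'\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i, (h_k, h_l))}\bigr) \Bigr\}^{\Delta_i} \times \Bigl\{ 1 - Q\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i, (h_k, h_l))}\bigr) \Bigr\}^{1 - \Delta_i} P_\gamma(h_k, h_l) \Biggr]. \tag{11}$$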

Here and in the sequel, $f'(x) = df(x)/dx$ and $f''(x) = d^2 f(x)/dx^2$. Like (6), (8), and (9), this likelihood involves infinite-dimensional parameters. If $\Lambda$ is restricted to be absolutely continuous, then, as in the case of density estimation, there is no maximizer of this likelihood. Thus we relax $\Lambda$ to be right-continuous and replace $\Lambda'(\tilde Y_i)$ in (11) by the jump size of $\Lambda$ at $\tilde Y_i$. By the arguments of Zeng, Lin, and Lin (2005), the resultant MLE, denoted by $(\hat\theta, \hat\Lambda)$, exists, and $\hat\Lambda$ is a step function with jumps only at the observed $\tilde Y_i$ for which $\Delta_i = 1$. The maximization can be carried out through an optimization algorithm. Furthermore, the covariance matrix of $\hat\theta$ can be estimated by the profile likelihood method, as discussed by Zeng et al. (2005).

Lin (2004) considered the special case of the proportional hazards model under condition (2) and provided an EM algorithm for obtaining the MLEs. We can modify that algorithm to accommodate Hardy–Weinberg disequilibrium along the lines of Section A.2.2. In addition, the EM algorithm can be used to evaluate the profile likelihood.

We assume the following regularity conditions for the asymptotic results.

Condition 11. There exists some positive constant $\delta_0$ such that $P(C_i \ge \tau \mid X_i, G_i) = P(C_i = \tau \mid X_i, G_i) \ge \delta_0$ almost surely, where $\tau$ corresponds to the end of the study.

Condition 12. The true value $\Lambda_0(t)$ of $\Lambda(t)$ is a strictly increasing function in $[0, \tau]$ and is continuously differentiable. In addition, $\Lambda_0(0) = 0$, $\Lambda_0(\tau) < \infty$, and $\Lambda_0'(0) > 0$.

Theorem 4. Under Conditions 8 and 10–12, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ converges weakly to a Gaussian process in $R^d \times l^\infty([0, \tau])$, where $d$ is the dimension of $\theta_0$ and $l^\infty([0, \tau])$ is the space of all bounded functions on $[0, \tau]$ equipped with the supremum norm. Furthermore, $\hat\theta$ is asymptotically efficient.

3. SIMULATION STUDIES

We used Monte Carlo simulation to evaluate the proposed methods in realistic settings. We considered the five SNPs on chromosome 22 from the Finland–United States Investigation of NIDDM Genetics (FUSION) Study described in the next section. We obtained the $\pi_k$'s from the frequencies shown in Table 1 by assuming a 7% disease rate and generated haplotypes under (3) with $\rho = .05$. The $R_h^2$ in Table 1 is the measure of haplotype certainty of Stram et al. (2003). We focused on $h^* = (01100)$ and considered case-control and cohort studies.

For the cohort studies, we generated ages of onset from the proportional hazards model

$$\lambda\{t \mid x, (h_k, h_l)\} = 2t \exp[\beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x],$$

where $X$ is a Bernoulli variable with $P(X = 1) = .2$ that is independent of $H$. We generated the censoring times from the uniform$(0, \tau)$ distribution, where $\tau$ was chosen to yield approximately 250, 500, and 1,000 cases under $n = 5{,}000$. We let $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$.
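Data of this kind can be generated by inverting the survival function: a hazard of the form $2t\,e^{\eta}$ has cumulative hazard $t^2 e^{\eta}$, so an onset age is $\{-\log(1 - U)/e^{\eta}\}^{1/2}$ for a uniform variate $U$. The sketch below illustrates this under simplifying assumptions that differ from the study described in the text: the number of copies of $h^*$ is drawn as Binomial(2, `p_star`) (Hardy–Weinberg equilibrium with a single hypothetical frequency) rather than from the FUSION haplotype distribution under (3), and the function name, default values, and choice of $\tau$ are hypothetical.

```python
import numpy as np

def simulate_cohort(n, beta1, beta2, beta3, tau, p_star=0.25, p_x=0.2, seed=0):
    """Simulate right-censored onset ages from the hazard 2t * exp(linear predictor)."""
    rng = np.random.default_rng(seed)
    copies = rng.binomial(2, p_star, size=n)  # copies of h* (simplified: HWE)
    x = rng.binomial(1, p_x, size=n)          # binary covariate, independent of H
    lin = beta1 * copies + beta2 * x + beta3 * copies * x
    u = rng.uniform(size=n)
    # Inverse transform: S(t) = exp(-t^2 e^{lin}), so t = sqrt(-log(1 - u) / e^{lin}).
    onset = np.sqrt(-np.log1p(-u) / np.exp(lin))
    censor = rng.uniform(0.0, tau, size=n)    # uniform(0, tau) censoring times
    time = np.minimum(onset, censor)          # observed follow-up time
    event = (onset <= censor).astype(int)     # 1 if disease onset is observed
    return time, event, copies, x

time, event, copies, x = simulate_cohort(5000, 0.25, 0.25, 0.0, tau=0.5)
n_cases = int(event.sum())
```

In practice $\tau$ would be tuned, for example by trial and error, until `n_cases` is close to the target number of cases.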

As shown in Table 2, the MLE is virtually unbiased, the likelihood ratio test has proper type I error, and the confidence interval has reasonable coverage. Additional simulation studies revealed that the proposed methods also perform well for making inference about other parameters and under other genetic models.

Table 1. Observed Haplotype Frequencies in the FUSION Study

              Frequencies
Haplotype   Controls    Cases       $R_h^2$
00011       .0042       .0066       .388
00100       .0035       .0034       .336
00110       .0018       .0007       .377
01011       .1292       .1344       .592
01100       .2514       .3183       .738
01101       .0012       <10^{-4}    .450
01110       <10^{-4}    .0045       .499
01111       .0019       <10^{-4}    .325
10000       .0136       .0114       .456
10010       <10^{-4}    .0012       .500
10011       .3573       .2883       .727
10100       .0521       .0597       .402
10110       .0317       .0318       .554
11011       .1392       .1290       .560
11100       .0109       .0092       .266
11110       <10^{-4}    .0014       <10^{-4}
11111       .0020       <10^{-4}    .338

Table 2. Simulation Results for the Haplotype–Environment Interactions in Cohort Studies

$\beta_3$   Cases   Bias     SE     CP     Power
0           250     −.010    .232   .949   .051
0           500     −.005    .157   .953   .047
0           1,000   −.003    .114   .954   .046
−.25        250     −.014    .256   .950   .190
−.25        500     −.008    .172   .949   .334
−.25        1,000   −.004    .122   .952   .554
−.5         250     −.022    .281   .950   .505
−.5         500     −.011    .190   .950   .806
−.5         1,000   −.006    .132   .952   .976
.25         250     −.007    .216   .947   .207
.25         500     −.002    .146   .953   .395
.25         1,000   −.001    .109   .954   .614
.5          250     −.003    .204   .943   .693
.5          500     −.001    .140   .951   .940
.5          1,000   −.001    .105   .952   .998

NOTE: Bias and SE are the bias and standard error of $\hat\beta_3$. CP is the coverage probability of the 95% confidence interval for $\beta_3$. Power pertains to the .05-level likelihood ratio test of $H_0$: $\beta_3 = 0$. Each entry is based on 5,000 replicates.

For the case-control studies, we used the same distributions of $H$ and $X$ and considered the same $h^*$ as in the cohort studies. We generated disease incidence from the logistic regression model

$$\operatorname{logit} P\{Y = 1 \mid x, (h_k, h_l)\} = \alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x. \tag{12}$$

For making inference on $\beta_1$, we set $\beta_2 = \beta_3 = .25$ and varied $\beta_1$ from $-.5$ to $.5$; for making inference on $\beta_3$, we set $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$. We chose $\alpha = -3$ or $-4$, yielding disease rates between 1.6% and 7%. We let $n_1 = n_0 = 500$ or 1,000. We considered the situations of known and unknown population totals, with $N$ being 15 and 30 times $n$ under $\alpha = -3$ and $-4$. For known population totals, we used the EM algorithm described in Section A.3.1 and evaluated the inference procedures based on the likelihood ratio statistic. For unknown population totals, we used the profile-likelihood method for rare diseases described in Section A.4.5 and set the $\pi_k$ less than $2/n$ to 0 to improve numerical stability. The results for $\beta_1$ and $\beta_3$ are displayed in Tables 3 and 4.

For known population totals, the proposed estimators are virtually unbiased, and the likelihood ratio statistics yield proper tests and confidence intervals. For unknown population totals, $\hat\beta_1$ has little bias, especially for large $n$, whereas $\hat\beta_3$ tends to be slightly biased downward; the variance estimators are fairly accurate, and the corresponding confidence intervals have reasonable coverage probabilities, except for $\alpha = -3$, $\beta_3 = .5$. The method with known population totals yields slightly higher power than the method with unknown population totals.

All the aforementioned results pertain to haplotype 01100, which has a relatively high frequency and a large value of $R_h^2$; the covariate is binary, and $\rho$ is .05, which is relatively large. Additional simulation studies showed that the foregoing conclusions continue to hold for other haplotypes, other values of $\rho$, and continuous covariates. Table 5 reports some results for haplotype 10100, which has a frequency of about 5% and an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

$$\operatorname{logit} P\{Y = 1 \mid X_1, X_2, (h_k, h_l)\} = \alpha + \beta_h\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x_1} X_1 + \beta_{x_2} X_2 + \beta_{hx_2}\{I(h_k = h^*) + I(h_l = h^*)\}X_2,$$

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform(0, 1). We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x_1} = \beta_{x_2} = -\beta_{hx_2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.

Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                                   Known totals                     Unknown totals
$n_1 = n_0$  $\alpha$  $\beta_1$   Bias    SE     CP     Power      Bias    SE     SEE    CP     Power
500          −3        −.5         −.003   .117   .952   .987       .019    .121   .124   .951   .979
500          −3        −.25        −.002   .109   .954   .587       .014    .112   .117   .960   .525
500          −3        0           −.001   .104   .951   .049       .009    .109   .112   .955   .045
500          −3        .25         −.001   .102   .950   .641       .002    .105   .108   .961   .646
500          −3        .5          .000    .099   .948   .996       −.005   .103   .106   .958   .998
500          −4        −.5         .001    .112   .954   .987       .022    .119   .124   .951   .977
500          −4        −.25        −.002   .104   .955   .574       .013    .114   .117   .952   .529
500          −4        0           −.002   .100   .953   .047       .004    .109   .112   .953   .047
500          −4        .25         −.001   .095   .956   .636       −.003   .103   .108   .959   .640
500          −4        .5          −.000   .094   .950   .999       −.009   .102   .105   .956   .997
1,000        −3        −.5         −.003   .082   .953   1.00       .005    .087   .087   .949   1.00
1,000        −3        −.25        −.002   .076   .952   .874       .005    .081   .082   .948   .853
1,000        −3        0           −.001   .073   .951   .049       .005    .077   .077   .954   .046
1,000        −3        .25         −.001   .071   .953   .898       .004    .075   .076   .948   .920
1,000        −3        .5          −.001   .070   .953   1.00       .003    .075   .075   .946   1.00
1,000        −4        −.5         .000    .079   .952   1.00       .005    .087   .088   .947   1.00
1,000        −4        −.25        −.000   .074   .959   .867       .005    .081   .083   .954   .847
1,000        −4        0           −.001   .070   .955   .045       .002    .079   .079   .949   .051
1,000        −4        .25         −.001   .067   .956   .904       .000    .074   .076   .955   .909
1,000        −4        .5          −.001   .066   .956   1.00       −.002   .073   .074   .954   1.00

NOTE: Bias and SE are the bias and standard error of $\hat\beta_1$. SEE is the mean of the standard error estimator for $\hat\beta_1$. CP is the coverage probability of the 95% confidence interval for $\beta_1$. Power pertains to the .05-level test of $H_0$: $\beta_1 = 0$. Each entry is based on 5,000 replicates.

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                                   Known totals                     Unknown totals
$n_1 = n_0$  $\alpha$  $\beta_3$   Bias    SE     CP     Power      Bias    SE     SEE    CP     Power
500          −3        −.5         −.008   .205   .949   .729       .030    .187   .195   .953   .692
500          −3        −.25        −.002   .186   .949   .271       .016    .169   .176   .961   .244
500          −3        0           −.001   .173   .946   .054       −.006   .155   .162   .963   .037
500          −3        .25         .002    .165   .949   .334       −.038   .144   .151   .958   .287
500          −3        .5          .006    .161   .947   .885       −.088   .138   .143   .915   .831
500          −4        −.5         −.009   .198   .950   .763       .012    .194   .195   .950   .720
500          −4        −.25        −.005   .181   .949   .309       .006    .172   .176   .953   .264
500          −4        0           −.002   .168   .945   .055       −.007   .156   .161   .956   .044
500          −4        .25         −.001   .157   .944   .370       −.022   .146   .149   .948   .333
500          −4        .5          .001    .148   .945   .926       −.047   .136   .141   .945   .904
1,000        −3        −.5         −.004   .147   .943   .953       .027    .134   .136   .950   .953
1,000        −3        −.25        −.003   .133   .946   .493       .013    .122   .123   .949   .477
1,000        −3        0           −.001   .123   .951   .049       −.005   .114   .113   .948   .052
1,000        −3        .25         .000    .119   .945   .580       −.034   .107   .106   .934   .535
1,000        −3        .5          .002    .117   .947   .994       −.080   .102   .101   .870   .986
1,000        −4        −.5         −.005   .140   .945   .965       .010    .137   .136   .949   .965
1,000        −4        −.25        −.002   .128   .945   .529       .005    .124   .123   .951   .505
1,000        −4        0           −.001   .119   .946   .054       −.004   .113   .113   .947   .053
1,000        −4        .25         −.000   .110   .947   .633       −.016   .104   .105   .952   .601
1,000        −4        .5          .002    .105   .949   .998       −.037   .099   .099   .937   .995

NOTE: Bias and SE are the bias and standard error of $\hat\beta_3$. SEE is the mean of the standard error estimator for $\hat\beta_3$. CP is the coverage probability of the 95% confidence interval for $\beta_3$. Power pertains to the .05-level test of $H_0$: $\beta_3 = 0$. Each entry is based on 5,000 replicates.

Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

$n_1 = n_0$  Parameter       True value   Bias    SE     SEE    CP     Power
500          $\beta_h$       0            −.030   .400   .401   .957   .043
500          $\beta_{x_1}$   .5           .002    .151   .152   .951   .917
500          $\beta_{x_2}$   .5           .001    .228   .230   .953   .584
500          $\beta_{hx_2}$  −.5          .015    .641   .644   .956   .118
1,000        $\beta_h$       0            −.017   .275   .277   .953   .047
1,000        $\beta_{x_1}$   .5           .002    .107   .107   .954   .997
1,000        $\beta_{x_2}$   .5           .000    .162   .161   .950   .871
1,000        $\beta_{hx_2}$  −.5          .012    .441   .443   .950   .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated $\rho$ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate $X$ by setting $X = 1$ for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of $X$ given $Y$ and $G$ under model (12) with $\alpha = -3.7$, $\beta_1 = .32$, and $\beta_2 = .25$. Based on 5,000 replicates, the power for testing $H_0$: $\beta_3 = 0$ is estimated at .053, .479, or .974 under $\beta_3 = 0$, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation, providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

                                                                        Codominant model
Haplotype   Recessive model   Dominant model   Additive model    Additive        Recessive
01011       .327 (.270)       −.027 (.140)     .049 (.135)       .005 (.143)     .331 (.289)
01100       .316 (.146)       .274 (.109)      .355 (.099)       .334 (.114)     .063 (.167)
10011       −.206 (.155)      −.323 (.112)     −.320 (.095)      −.344 (.111)    .076 (.183)
10100       −1.019 (1.020)    .196 (.219)      .116 (.213)       .169 (.217)     −1.131 (1.029)
10110       .903 (.746)       −.007 (.248)     .063 (.249)       .016 (.254)     .892 (.765)
11011       −.222 (.328)      −.096 (.140)     −.127 (.133)      −.108 (.140)    −.143 (.344)

NOTE: Standard error estimates are shown in parentheses.

Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters, $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$, yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly, $(1 - \rho)\pi_k^2 + \rho\pi_k = (1 - \tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k$. We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$ and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have $\pi_k = [-\rho + \{\rho^2 + 4c_k(1 - \rho)\}^{1/2}]/\{2(1 - \rho)\}$. Thus $(1 - \rho)^{-1}$ satisfies the equation $\sum_k [(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}] = 2$, and $(1 - \tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$. To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1 - \rho) + \rho\} + \mu\pi_k(1 - \pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have $\sum_k \mu\pi_k(1 - \pi_k)/\{2\pi_k(1 - \rho) + \rho\} = 0$. Therefore, $\mu = 0$ and $\nu = 0$.
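The first part of the argument is constructive, and it can be illustrated numerically: given the homozygote probabilities $c_k$, the equation in $x = (1 - \rho)^{-1}$ can be solved by bisection, after which the $\pi_k$ follow in closed form. The sketch below assumes $\rho \ge 0$ (so that $x \ge 1$) and a root bracketed above by repeated doubling; the function names and the numerical values are hypothetical.

```python
import math

def homozygote_probs(pi, rho):
    """c_k = (1 - rho) * pi_k^2 + rho * pi_k, the probability of genotype 2 h_k."""
    return [(1.0 - rho) * p * p + rho * p for p in pi]

def recover_parameters(c, tol=1e-12):
    """Solve sum_k[(1 - x) + sqrt((x - 1)^2 + 4 c_k x)] = 2 for x = 1/(1 - rho) >= 1."""
    def f(x):
        return sum((1.0 - x) + math.sqrt((x - 1.0) ** 2 + 4.0 * ck * x) for ck in c) - 2.0

    lo, hi = 1.0, 2.0
    while f(hi) > 0.0:       # f is nonincreasing, so expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:     # plain bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    rho = 1.0 - 1.0 / x
    pi = [0.5 * ((1.0 - x) + math.sqrt((x - 1.0) ** 2 + 4.0 * ck * x)) for ck in c]
    return rho, pi

true_pi, true_rho = [0.5, 0.3, 0.2], 0.05  # hypothetical values
c = homozygote_probs(true_pi, true_rho)
rho_hat, pi_hat = recover_parameters(c)
```

Because the left side of the equation is monotone in $x$, bisection converges to the unique root, and `rho_hat` and `pi_hat` should reproduce the values used to generate `c`.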

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \not\ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h, h)\}$ or $\{(h, \tilde h)\}$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, h)) P(H = (h, h))$ or $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, \tilde h)) P(H = (h, \tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))$ does not depend on $(h_k, h_l) \in S(g)$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l)) P(G = g)$, where $(h_k, h_l) \in S(g)$. For $g \in \mathcal{G}_3$,

$$m_g(y, x; \theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\, \pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\, \pi_2(g),$$

where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k, h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) when $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).

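A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^{n} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is

$$\sum_{i=1}^{n} \sum_{(h_k, h_l) \in S(G_i)} p_{ikl}(\theta) \bigl\{\log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l)) + \log P_\gamma(h_k, h_l)\bigr\},$$

where

$$p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l)) P_\gamma(h_k, h_l)}{\sum_{(h_{k'}, h_{l'}) \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_{k'}, h_{l'})) P_\gamma(h_{k'}, h_{l'})}.$$

Thus, in the $(m + 1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:

$$\sum_{i=1}^{n} \sum_{(h_k, h_l) \in S(G_i)} p_{ikl}(\theta^{(m)}) \nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l)) = 0,$$

$$\sum_{i=1}^{n} \sum_{(h_k, h_l) \in S(G_i)} p_{ikl}(\theta^{(m)}) \nabla_\gamma \log P_\gamma(h_k, h_l) = 0. \tag{A.1}$$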

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k, h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k, h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1 - B)Q_2$. The complete-data likelihood can be represented by

$$\prod_{i=1}^{n} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_k \pi_k^{I(Q_{1i} = (h_k, h_k)) B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i} = (h_k, h_l))(1 - B_i)} \rho^{B_i}(1 - \rho)^{1 - B_i}.$$

The corresponding score equations for $\pi_k$ and $\rho$ satisfy

$$\pi_k = c^{-1} \Biggl[ \sum_{i=1}^{n} B_i I(Q_{1i} = (h_k, h_k)) + \sum_{i=1}^{n} \sum_{l=1}^{K} (1 - B_i)\bigl\{I(Q_{2i} = (h_k, h_l)) + I(Q_{2i} = (h_l, h_k))\bigr\} \Biggr]$$

and $\rho = n^{-1} \sum_{i=1}^{n} B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

$$E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\} = \sum_{bq_1 + (1 - b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\, p(b, q_1, q_2)\, \omega(b, q_1, q_2) \times \Biggl[ \sum_{bq_1 + (1 - b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\, p(b, q_1, q_2) \Biggr]^{-1},$$

where $\omega(B, Q_1, Q_2) = BI(Q_1 = (h_k, h_k))$, $(1 - B)I(Q_2 = (h_k, h_l))$, or $B$, and

$$p(b, q_1, q_2) = \prod_k \pi_k^{bI(q_1 = (h_k, h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1 - b)I(q_2 = (h_k, h_l))} \rho^{b}(1 - \rho)^{1 - b}.$$

In the $(m + 1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:

$$\pi_k^{(m+1)} = \frac{1}{c^{(m+1)}} \Biggl[ \sum_{i=1}^{n} E^{(m)}\bigl\{B_i I(Q_{1i} = (h_k, h_k))\bigr\} + 2\sum_{i=1}^{n} \sum_{l=1}^{K} E^{(m)}\bigl\{(1 - B_i) I(Q_{2i} = (h_k, h_l))\bigr\} \Biggr]$$

and $\rho^{(m+1)} = n^{-1} \sum_{i=1}^{n} E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i, Q_{1i}, Q_{2i})\}$ is $E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.
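The flavor of these closed-form updates can be conveyed by a deliberately simplified sketch. The code below runs an EM algorithm for haplotype frequencies from unphased genotypes under Hardy–Weinberg equilibrium ($\rho = 0$) and without any phenotype model, so the E-step weights are proportional to $\pi_k\pi_l$ alone rather than to $P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))P_\gamma(h_k, h_l)$; it is not the full algorithm of this section. The genotype coding (per-SNP counts of allele 1), the function names, and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """Ordered haplotype pairs (h_k, h_l) with h_k + h_l equal to the unphased genotype."""
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([(0, 0)])
        elif g == 2:
            per_locus.append([(1, 1)])
        else:  # heterozygous locus: phase is unknown
            per_locus.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_locus):
        hk = tuple(a for a, _ in combo)
        hl = tuple(b for _, b in combo)
        pairs.append((hk, hl))
    return pairs

def em_haplotype_freqs(genotypes, n_iter=200):
    """EM for haplotype frequencies under HWE (rho = 0), ignoring phenotype data."""
    n_snps = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_snps))
    pi = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each compatible pair, proportional to pi_k * pi_l.
            weights = [pi[hk] * pi[hl] for hk, hl in pairs]
            total = sum(weights)
            for (hk, hl), w in zip(pairs, weights):
                counts[hk] += w / total
                counts[hl] += w / total
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        pi = {h: counts[h] / (2.0 * len(genotypes)) for h in haps}
    return pi

# Toy data with two SNPs; the last genotype (1, 1) is phase-ambiguous.
genotypes = [(0, 0), (0, 0), (2, 2), (0, 1), (1, 1)]
freqs = em_haplotype_freqs(genotypes)
```

Extending this sketch toward the algorithm in the text would require multiplying each E-step weight by the phenotype model and adding the Bernoulli indicator $B$ for the inbreeding component, neither of which is attempted here.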

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is

$$\prod_{i=1}^{N} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)\, P_\gamma(H_i) \prod_{g} f_g(X_i)^{I(G_i = g)}.$$

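The M-step solves the following equations for $\theta$:

$$\sum_{i=1}^{N} I(R_i = 1)\, E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^{N} I(R_i = 0)\, E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i\} = 0,$$

$$\sum_{i=1}^{N} I(R_i = 1)\, E\{\nabla_{\gamma} \log P_{\gamma}(H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^{N} I(R_i = 0)\, E\{\nabla_{\gamma} \log P_{\gamma}(H_i) \mid Y_i\} = 0, \tag{A.2}$$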

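and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

$$F_g\{X_i\} = \Bigl[\sum_{j=1}^{N} I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)\, E\{I(X_j = X_i, G_j = g) \mid Y_j\}\Bigr] \times \Bigl[\sum_{j=1}^{N} I(G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)\, E\{I(G_j = g) \mid Y_j\}\Bigr]^{-1},$$

where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form

$$\frac{\sum_{(h_k, h_l) \in S(G_i)} \omega(Y_i, X_i, (h_k, h_l))\, P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))\, P_\gamma(h_k, h_l)}{\sum_{(h_k, h_l) \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))\, P_\gamma(h_k, h_l)}$$

for $R_i = 1$, and

$$\sum_{g \in \mathcal{G}}\ \sum_{x \in \{X_i : G_i = g, R_i = 1\}}\ \sum_{(h_k, h_l) \in S(g)} \omega(Y_i, x, (h_k, h_l))\, P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))\, P_\gamma(h_k, h_l)\, F_g\{x\} \times \Bigl(\sum_{g \in \mathcal{G}}\ \sum_{x \in \{X_i : G_i = g, R_i = 1\}}\ \sum_{(h_k, h_l) \in S(g)} P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))\, P_\gamma(h_k, h_l)\, F_g\{x\}\Bigr)^{-1}$$

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.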
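For a selected subject ($R_i = 1$), the conditional expectation above is a weighted average over the haplotype pairs compatible with the observed genotype. The following Python sketch illustrates that computation under model (3). The function names, the index-based representation of haplotypes, and the caller-supplied `outcome_prob` are illustrative assumptions, not part of the original text.

```python
def pair_prob_model3(k, l, pi, rho):
    """P(H = (h_k, h_l)) = rho * delta_kl * pi_k + (1 - rho) * pi_k * pi_l."""
    delta = 1.0 if k == l else 0.0
    return rho * delta * pi[k] + (1.0 - rho) * pi[k] * pi[l]


def e_step_expectation(omega, compatible_pairs, outcome_prob, pi, rho):
    """E{omega(H) | Y, X, G} for a subject with R = 1.

    compatible_pairs : list of (k, l) index pairs playing the role of S(G)
    outcome_prob     : callable (k, l) -> P(Y | X, (h_k, h_l)), supplied by the caller
    """
    num = 0.0
    den = 0.0
    for (k, l) in compatible_pairs:
        w = outcome_prob(k, l) * pair_prob_model3(k, l, pi, rho)
        num += omega(k, l) * w
        den += w
    return num / den


# Toy usage: an ambiguous genotype compatible with (h_0, h_1) and (h_1, h_0).
pi = [0.6, 0.4]
rho = 0.1
pairs = [(0, 1), (1, 0)]
prob = lambda k, l: 0.2 if (k, l) == (0, 1) else 0.6  # made-up P(Y | X, H)
indicator = lambda k, l: 1.0 if (k, l) == (0, 1) else 0.0
posterior = e_step_expectation(indicator, pairs, prob, pi, rho)
```

The $R_i = 0$ case has the same structure, with an additional sum over genotypes and observed covariate values weighted by $F_g\{x\}$.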

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components $F_g(\cdot)$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).

100 Journal of the American Statistical Association March 2006

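A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F^\dagger_g\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F^\dagger_g\}, \{\tilde q_g\})$ yield the same likelihood,

$$RL(\theta, \{F^\dagger_g\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F^\dagger_g\}, \{\tilde q_g\}). \tag{A.3}$$

Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f^\dagger_g(x) q_g / \sum_{g \in \mathcal{G}} q_g = \tilde f^\dagger_g(x) \tilde q_g / \sum_{g \in \mathcal{G}} \tilde q_g$. Thus $f^\dagger_g(x) = \tilde f^\dagger_g(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

$$\eta(y, x, g; \theta) = C(y)\, \eta(y, x, g; \tilde\theta), \tag{A.4}$$

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F^\dagger_g\}, \{q_g\})$ is $\{(\tilde\theta, \{F^\dagger_g\}, \{q_g\}) : \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that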

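$$\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5}$$

for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

$$\frac{\dfrac{\pi_1(g)(1 + e^{\alpha + \psi_2(x)})}{\pi_2(g)(1 + e^{\alpha + \psi_1(x)})} + e^{\psi_2(x) - \psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha + \psi_2(x)})}{\pi_2(g)(1 + e^{\alpha + \psi_1(x)})} + 1} = \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha + \psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha + \psi_1(x)})} + e^{\psi_2(x) - \psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha + \psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha + \psi_1(x)})} + 1}, \tag{A.6}$$

where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.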

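Without loss of generality, assume that 0 is in the support of $X$. We then have the following results:

1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha + \beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha + \beta_1}}.$$

Thus (A.6) is equivalent to

$$\frac{\pi_1(g)/\pi_2(g)}{\pi_1(\tilde g)/\pi_2(\tilde g)} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(\tilde g)/\tilde\pi_2(\tilde g)} \quad \text{for all } g, \tilde g \in \mathcal{G}_3.$$

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z} z \ne 0$, (A.6) yields

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha + \beta_{3z} z}}{1 + e^{\alpha + \beta_1 + \beta_{3z} z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha + \beta_{3z} z}}{1 + e^{\tilde\alpha + \beta_1 + \beta_{3z} z}}.$$

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha + \psi_2(x)}}{1 + e^{\alpha + \psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha + \psi_2(x)}}{1 + e^{\tilde\alpha + \psi_1(x)}} \tag{A.7}$$

for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.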
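The monotonicity of $(\lambda + c)/(\lambda + 1)$ invoked in the cases above can be checked by differentiating with respect to $\lambda$. In (A.6), $\lambda$ plays the role of $\pi_1(g)(1 + e^{\alpha + \psi_2(x)})/\{\pi_2(g)(1 + e^{\alpha + \psi_1(x)})\}$, which is positive, and $c = e^{\psi_2(x) - \psi_1(x)}$:

$$\frac{d}{d\lambda}\,\frac{\lambda + c}{\lambda + 1} = \frac{(\lambda + 1) - (\lambda + c)}{(\lambda + 1)^2} = \frac{1 - c}{(\lambda + 1)^2}.$$

The derivative has a constant sign on $\lambda > 0$ whenever $c \ne 1$, so equality of the two sides of (A.6) forces equality of the two values of $\lambda$. When $c = 1$, that is, $\psi_1(x) = \psi_2(x)$, both sides of (A.6) equal 1 and the equation carries no information, which is case 1.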

A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + h^*$, and $0$ in turn. Because $S(g)$ has a single element for such $g$, we obtain

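$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\Bigr\}, \tag{A.8}$$

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\Bigr\}, \tag{A.9}$$

and

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\Bigr\}, \tag{A.10}$$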

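where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\Bigr\}.$$

By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore, $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}.$$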

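By taking the ratio of the two sides and letting $z \to \operatorname{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore, $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,

$$\eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\Bigl\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\,\frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x)\Bigr\} \times \Bigl[\bigl\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\bigr\}\frac{\pi_1(g)}{\pi_2(g)} + 1 - \Phi(\alpha + \beta_3^T x)\Bigr]^{-1}. \tag{A.11}$$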


It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

$$\frac{e^{-e^{\alpha}}\bigl(e^{e^{\alpha + \beta_z z}} - 1\bigr)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}\bigl(e^{e^{\tilde\alpha + \tilde\beta_z z}} - 1\bigr)}{1 - e^{-e^{\tilde\alpha}}}.$$

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha + \beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha + \tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.
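The derivative-ratio identity used for the complementary log–log model can be checked numerically. The sketch below is only a sanity check under illustrative parameter values; the function names are hypothetical. It differentiates the $z$-dependent factor $e^{e^{\alpha + \beta_z z}} - 1$ by central finite differences (the constant prefactor cancels in the ratio) and compares the ratio of second to first derivative with $\beta_z(e^{\alpha + \beta_z z} + 1)$.

```python
import math


def cloglog_factor(z, alpha, beta_z):
    """z-dependent factor exp(exp(alpha + beta_z * z)) - 1."""
    return math.exp(math.exp(alpha + beta_z * z)) - 1.0


def derivative_ratio(z, alpha, beta_z, h=1e-3):
    """Second derivative divided by first derivative, by central differences."""
    f_plus = cloglog_factor(z + h, alpha, beta_z)
    f_mid = cloglog_factor(z, alpha, beta_z)
    f_minus = cloglog_factor(z - h, alpha, beta_z)
    first = (f_plus - f_minus) / (2.0 * h)
    second = (f_plus - 2.0 * f_mid + f_minus) / (h * h)
    return second / first


def closed_form_ratio(z, alpha, beta_z):
    """The expression beta_z * (exp(alpha + beta_z * z) + 1) from the text."""
    return beta_z * (math.exp(alpha + beta_z * z) + 1.0)


# Illustrative values only.
alpha, beta_z = -0.5, 0.7
numeric = derivative_ratio(0.3, alpha, beta_z)
exact = closed_form_ratio(0.3, alpha, beta_z)
```

Since the closed-form ratio is an exponential function of $z$ plus a constant, equality for all $z$ pins down both $\beta_z$ and $\alpha$, which is the step the proof relies on.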

A.4.4 Profile Likelihood of $\theta$ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

$$l_n(\theta, \{\delta_j\}) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta) - n_1 \log\Bigl\{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j\Bigr\} + \sum_{j=1}^{J} n_{+j} \log \delta_j,$$

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

$$\frac{n_{+j}}{\delta_j} - \frac{n_1\,\eta(1, x_j, g_j; \theta)}{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j} + \lambda = 0.$$

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

$$\delta_j = \frac{n_{+j}}{n - n_1 + n_1\,\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12}$$

where $\mu = \sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

$$l^*_n(\theta, \mu) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta) - \sum_{j=1}^{J} n_{+j} \log\Bigl\{\frac{n_1}{n}\,\eta(1, x_j, g_j; \theta) + \Bigl(1 - \frac{n_1}{n}\Bigr)\mu\Bigr\} + (n - n_1)\log \mu.$$

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_{\mu} l^*_n(\theta, \mu) + C_n$. If $\tilde\mu$ maximizes $l^*_n(\theta, \mu)$, then $\partial l^*_n(\theta, \tilde\mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^{J} \delta_j = 1$. Thus $\max_{\mu} l^*_n(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l^*_n(\theta, \mu)$ up to a constant $C_n$. We maximize $l^*_n(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l^*_n(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l^*_n(\theta, \mu)$.
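The link between the stationarity condition in $\mu$ and the constraint $\sum_j \delta_j = 1$ can be illustrated numerically for a fixed $\theta$. The Python sketch below is a hedged illustration rather than the authors' algorithm: it takes made-up values of $\eta(1, x_j, g_j; \theta)$ and counts $n_{+j}$, solves $\sum_j \delta_j(\mu) = 1$ by bisection (a stand-in for the Newton–Raphson step mentioned in the text), and returns the jump sizes from (A.12). The function names are hypothetical, and the sketch assumes $\eta_j > 0$ and $n_1 < n$.

```python
def jump_sizes(mu, eta, n_plus, n1):
    """delta_j from (A.12): n_{+j} / (n - n1 + n1 * eta_j / mu)."""
    n = sum(n_plus)
    return [m / (n - n1 + n1 * e / mu) for m, e in zip(n_plus, eta)]


def solve_mu(eta, n_plus, n1, iterations=200):
    """Find mu with sum_j delta_j(mu) = 1; the sum is increasing in mu."""
    lo, hi = 1e-12, 1.0
    while sum(jump_sizes(hi, eta, n_plus, n1)) < 1.0:
        hi *= 2.0  # expand until the root is bracketed
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(jump_sizes(mid, eta, n_plus, n1)) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative data: three distinct (x_j, g_j) values, 10 cases and 10 controls.
eta = [1.0, 2.5, 0.4]   # made-up values of eta(1, x_j, g_j; theta)
n_plus = [7, 6, 7]      # n_{0j} + n_{1j}
n1 = 10
mu = solve_mu(eta, n_plus, n1)
delta = jump_sizes(mu, eta, n_plus, n1)
```

At the solution, the jump sizes also satisfy $\mu = \sum_j \eta_j \delta_j$, which is the defining relation for $\mu$ stated after (A.12).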

A.4.5 Profile Likelihood of $\theta$ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define

$$\zeta_1(x, g; \theta) = \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, h_k, h_l)}\{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\},$$

$$\zeta_0(g; \theta) = \sum_{(h_k, h_l) \in S(g)} \{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\}.$$

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

$$l^*_n(\theta, \{\mu_g\}) = \sum_{i=1}^{n}\{y_i \log \zeta_1(X_i, G_i; \theta) + (1 - y_i)\log \zeta_0(G_i; \theta)\} - \sum_{i=1}^{n}\sum_{g} I(G_i = g)\log\Bigl\{\zeta_1(X_i, G_i; \theta) + n_1^{-1} n_g \sum_{g'} \mu_{g'} - \mu_g\Bigr\} + \sum_{i=1}^{n}(1 - y_i)\log\Bigl(\sum_{g} \mu_g\Bigr),$$

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

$$l^*_n(\theta, \mu) = \sum_{i=1}^{n} y_i \log \zeta_1(X_i, G_i; \theta) + \sum_{i=1}^{n}(1 - y_i)\log \zeta_0(G_i; \theta) + \sum_{i=1}^{n}(1 - y_i)\log \mu - \sum_{i=1}^{n}\log\Bigl\{(1 - r)\mu + r\sum_{g}\zeta_1(X_i, g; \theta)\Bigr\},$$

where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k) : k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l) : k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by

$$P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B, Q_1, Q_2, Y}\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},$$

where

$$\vartheta = \Bigl(-\log\mu + \log\frac{r}{1 - r},\ \beta^T,\ \log\pi_1 - \log\frac{\rho}{1 - \rho},\ \ldots,\ \log\pi_K - \log\frac{\rho}{1 - \rho}\Bigr)^T$$

and

$$W(B, Q_1, Q_2, Y, X) = \Bigl(Y,\ Y Z^T(X, H),\ B\,I(Q_1 = (h_1, h_1)) + (1 - B)\sum_{l}\{I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1))\},\ \ldots,\ B\,I(Q_1 = (h_K, h_K)) + (1 - B)\sum_{l}\{I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K))\}\Bigr)^T.$$

We can show that $l^*_n(\theta, \mu)$ is equivalent to the log-likelihood

$$l^*_n(\vartheta) = \sum_{i=1}^{n}\log\Biggl[\frac{\sum_{BQ_1 + (1 - B)Q_2 \in S(G_i)} e^{\vartheta^T W(B, Q_1, Q_2, Y_i, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\Biggr].$$

We maximize $l^*_n(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l^*_n(\vartheta)$.
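To see why the haplotype components of $\vartheta$ take the form $\log\pi_k - \log\{\rho/(1 - \rho)\}$, the sketch below checks numerically that the exponential-family weights reproduce model (3), $P(H = (h_k, h_l)) = \rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l$. It is an illustration under stated assumptions only: it fixes $Y = 0$ so that the outcome components of $W$ vanish, counts each latent configuration ($B = 1$ with a homozygous $Q_1$, or $B = 0$ with an ordered pair $Q_2$) exactly once, and requires $0 < \rho < 1$. The function names are hypothetical.

```python
import math


def latent_probabilities(pi, rho):
    """Normalized exp(theta^T W) over latent configurations, with Y = 0.

    Keys are (b, k, l): (1, k, k) stands for B = 1, Q1 = (h_k, h_k);
    (0, k, l) stands for B = 0, Q2 = (h_k, h_l).
    """
    shift = math.log(rho / (1.0 - rho))
    theta = [math.log(p) - shift for p in pi]
    weights = {}
    for k in range(len(pi)):
        weights[(1, k, k)] = math.exp(theta[k])  # W has a single 1 in slot k
        for l in range(len(pi)):
            # W has a 1 in slots k and l (a 2 in slot k when k == l)
            weights[(0, k, l)] = math.exp(theta[k] + theta[l])
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def implied_pair_prob(pi, rho, k, l):
    """P(H = (h_k, h_l)) implied by the exponential-family form."""
    probs = latent_probabilities(pi, rho)
    value = probs[(0, k, l)]
    if k == l:
        value += probs[(1, k, k)]
    return value


# Illustrative haplotype frequencies and inbreeding-type parameter.
pi = [0.5, 0.3, 0.2]
rho = 0.2
p_01 = implied_pair_prob(pi, rho, 0, 1)
p_00 = implied_pair_prob(pi, rho, 0, 0)
```

The shift by $\log\{\rho/(1 - \rho)\}$ is what balances the $B = 1$ configurations, whose weight involves one $\pi$ factor, against the $B = 0$ configurations, whose weight involves two.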

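The complete-data score function is

$$\sum_{i=1}^{n}\Biggl[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y}\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\Biggr].$$

Thus, in the E-step, we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:

$$E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i] = \frac{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}\, W(b, q_1, q_2, Y_i, X_i)}{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}.$$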


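In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

$$\vartheta^{(k+1)} = \vartheta^{(k)} - \Sigma^{-1} \times \sum_{i=1}^{n}\Biggl[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y}\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\Biggr],$$

where

$$\Sigma = -\Biggl[\sum_{i=1}^{n}\frac{\sum_{b, q_1, q_2, y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\Biggr] + \sum_{i=1}^{n}\Biggl[\frac{\bigl\{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\bigr\}^{\otimes 2}}{\bigl\{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\bigr\}^{2}}\Biggr]$$

and $a^{\otimes 2} = aa^T$.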

A.4.6 Proof of Theorem 2. Write $\hat F_{xg}(x, g) = \hat F^\dagger_g(x)\hat q_g$ and $F_{xg}(x, g) = F^\dagger_g(x) q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F^*_{xg}(x, g) \equiv F^*_g(x) q^*_g$, where $q^*_g > 0$ for any $g$.

Because $\hat F^\dagger_g$ maximizes the likelihood, there exists some Lagrange multiplier $\lambda_g$ such that

$$\frac{I(G_i = g)}{\hat F^\dagger_g\{X_i\}} - \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)} - n\lambda_g = 0,$$

where $\hat F^\dagger_g\{X_i\}$ denotes the point mass of $\hat F^\dagger_g$ at $X_i$, and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^{n}\hat F^\dagger_g\{X_i\} = 1$, $\lambda_g$ satisfies the equation

$$n^{-1}\sum_{i=1}^{n} I(G_i = g)\Bigl\{\lambda_g + \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{n\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\Bigr\}^{-1} = 1 \tag{A.13}$$

and

$$\min_{1 \le i \le n}\Bigl\{\lambda_g + \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{n\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\Bigr\} > 0.$$

Clearly, $\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\lambda_g \to \lambda^*_g$. By (A.13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that

$$\min_{g, x}\Bigl|\lambda^*_g + \frac{\eta(1, x, g; \theta^*)\,q^*_g}{\int_{x,g}\eta(1, x, g; \theta^*)\,dF^*_{xg}(x, g)}\Bigr| > \delta.$$

Consequently, when $n$ is sufficiently large,

$$\hat F^\dagger_g(x) = n^{-1}\sum_{i=1}^{n} I(G_i = g, X_i \le x) \times \Biggl(\max\Biggl[\Bigl|\lambda_g + \eta(1, X_i, g; \hat\theta)\,\hat q_g\, n_1 \times \Bigl\{n\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)\Bigr\}^{-1}\Bigr|,\ \delta\Biggr]\Biggr)^{-1}.$$

We define an empirical function $\tilde F^\dagger_g$ whose jump size at $X_i$ is proportional to

$$n^{-1} I(G_i = g)\Biggl(P(G = g, Y = 0) + \eta(1, X_i, g; \theta_0)\,q_g \times \Bigl\{\int_{x,g}\eta(1, x, g; \theta_0)\,dF_{xg}(x, g)\Bigr\}^{-1}\Biggr)^{-1}.$$

Then it can be verified that $\tilde F^\dagger_g$ converges uniformly to $F^\dagger_g$. In addition, $\hat F^\dagger_g$ is absolutely continuous with respect to $\tilde F^\dagger_g$, and the Radon–Nikodym derivative $d\hat F^\dagger_g(x)/d\tilde F^\dagger_g(x)$ is bounded and converges uniformly to $dF^*_g(x)/dF^\dagger_g(x)$. Let $\tilde F_{xg}(x, g) = \tilde F^\dagger_g(x) q_g$, and let $l_n(\theta, \{F^\dagger_g\}, \{q_g\})$ be the log-likelihood based on (8). By the definition of the MLE, $n^{-1} l_n(\hat\theta, \{\hat F^\dagger_g\}, \{\hat q_g\}) - n^{-1} l_n(\theta_0, \{\tilde F^\dagger_g\}, \{q_g\}) \ge 0$. The limit of this difference is the negative Kullback–Leibler information of the distribution for $(\theta^*, \{F^*_g\}, \{q^*_g\})$ with respect to $(\theta_0, \{F^\dagger_g\}, \{q_g\})$ under $P(Y = 1) = \varrho$. The identifiability conditions then yield $\theta^* = \theta_0$, $F^*_g = F^\dagger_g$, and $q^*_g = q_g$. Thus the consistency of $\hat\theta$ is established. Because $F_{xg}$ is continuous, $\sup_{x,g}|\hat F_{xg}(x, g) - F_{xg}(x, g)| \to 0$ almost surely.

The derivation of the asymptotic distribution is similar to the proof of theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, \{F^\dagger_g\}, \{q_g\})$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon\int\psi(x, g)\,dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

$$n^{1/2}\Bigl\{(v^T\Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0) + \int (v^T\Omega_{12} + \Omega_{22}[\psi])\,d(\hat F_{xg} - F_{xg})\Bigr\}$$
$$= n^{-1/2}\sum_{i=1}^{n} y_i\Bigl\{v^T l_\theta(1, X_i, G_i; \theta_0, F_{xg}) + l_F(1, X_i, G_i; \theta_0, F_{xg})\Bigl[\int\psi\,dF_{xg}\Bigr]\Bigr\}$$
$$\quad + n^{-1/2}\sum_{i=1}^{n}(1 - y_i)\Bigl\{v^T l_\theta(0, X_i, G_i; \theta_0, F_{xg}) + l_F(0, X_i, G_i; \theta_0, F_{xg})\Bigl[\int\psi\,dF_{xg}\Bigr]\Bigr\} + o_p(1),$$

where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of $x$, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of $\psi$, and $l_\theta$ and $l_F$ are the scores with respect to $\theta$ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through $\varrho$. We can show that the operator $B[v, \psi] \equiv (v^T\Omega_{11} + \Omega_{21}[\psi]^T,\ v^T\Omega_{12} + \Omega_{22}[\psi])^T$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via $\varrho$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean $\varrho$. By choosing some $\psi$ such that $B[v, \psi] = (v^T, 0)^T$ for all $v$, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

$$\frac{dP_n}{d\tilde P_n} = \exp\Bigl\{a_n\sum_{i=1}^{n}\frac{\partial \log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a}\Bigr|_{a=0} + o(1)\Bigr\} \to_{\tilde P_n} 1.$$

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

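A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

$$\sum_{H \in S(G)}\Bigl\{\dot{\tilde\Lambda}(Y)\,e^{\tilde\beta^T Z(X,H)}\,\dot Q\bigl(\tilde\Lambda(Y)\,e^{\tilde\beta^T Z(X,H)}\bigr)\Bigr\}^{\Delta} \times \Bigl\{1 - Q\bigl(\tilde\Lambda(Y)\,e^{\tilde\beta^T Z(X,H)}\bigr)\Bigr\}^{1 - \Delta} P_\gamma(H)$$
$$= \sum_{H \in S(G)}\Bigl\{\dot\Lambda(Y)\,e^{\beta^T Z(X,H)}\,\dot Q\bigl(\Lambda(Y)\,e^{\beta^T Z(X,H)}\bigr)\Bigr\}^{\Delta} \times \Bigl\{1 - Q\bigl(\Lambda(Y)\,e^{\beta^T Z(X,H)}\bigr)\Bigr\}^{1 - \Delta} P_\gamma(H).$$

By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain

$$\sum_{H \in S(G)} Q\bigl(\tilde\Lambda(\tau)\,e^{\tilde\beta^T Z(X,H)}\bigr)P_\gamma(H) = \sum_{H \in S(G)} Q\bigl(\Lambda(\tau)\,e^{\beta^T Z(X,H)}\bigr)P_\gamma(H).$$

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y)\,e^{\tilde\beta^T Z(X,H)} = \Lambda(Y)\,e^{\beta^T Z(X,H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.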

A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

$$\mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\Bigl[\int\psi\,d\Lambda_0\Bigr] = 0, \tag{A.14}$$

where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int\psi\,d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon\int\psi\,d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

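To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain

$$\sum_{H \in S(G)} Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)P_\gamma(H) \times \Biggl\{\frac{\dot Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_\beta^T Z(X,H)}{Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)} + \frac{\dot Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\int_0^\tau\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)} + \mu_\gamma^T\nabla_\gamma\log P_\gamma(H)\Biggr\} = 0. \tag{A.15}$$

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

$$\sum_{H \in S(G)}\Bigl\{1 - Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\Bigr\}P_\gamma(H) \times \Biggl\{-\frac{\dot Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_\beta^T Z(X,H)}{1 - Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)} - \frac{\dot Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\int_0^\tau\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{1 - Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)} + \mu_\gamma^T\nabla_\gamma\log P_\gamma(H)\Biggr\} = 0. \tag{A.16}$$

The summation of (A.15) and (A.16) entails $\mu_\gamma^T\nabla_\gamma\log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

$$\psi(Y) + \frac{\ddot Q\bigl(\Lambda_0(Y)e^{\beta_0^T Z(X,H)}\bigr)\int_0^Y\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{\dot Q\bigl(\Lambda_0(Y)e^{\beta_0^T Z(X,H)}\bigr)} = 0$$

for $H = (h, h)$. Therefore, $\psi = 0$.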

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), “Prediction and Entropy,” in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), “Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?” European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to

© 2006 American Statistical Association
Journal of the American Statistical Association

March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000817

Sabatti Comment 105

a clear understanding of the disease origins and suggests drugtargets This goal is achieved in a multistep process Initiallywith genome-wide investigations one tries to identify regionsthat harbor susceptibility genes These broad regions are thenanalyzed in more detail to lead to finer mapping and eventuallyto identification of the disease variants The study of haplotypes(the collection of allele values at measured polymorphic sites ona chromosome) plays a role in both these phases Note that typ-ically one does not assume that any of the variants recorded in ahaplotype is the disease-causing variant generally one simplyassumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype

To understand the origins of such an association recall thatwhen a disease-causing mutation arises it occurs on a specificgenetic background (a specific selection of alleles at the sur-rounding polymorphic loci) As the disease mutation is passeddown across generations the portion of the genome closer toit is also transmitted with boundaries determined by recombi-nation events Affected descendants are then carriers not onlyof the disease-specific mutation but also of the ancestral haplo-type Under the hypothesis of one disease-causing mutation af-ter a number of generations have intervened since the foundingevent affected individuals that may be practically consideredas unrelated will be sharing an identical-by-descent genomicsegment surrounding the mutation This is reflected in the asso-ciation between the disease status and haplotypes in the regionof interest This scenario implies that the effect of the mutationon fitness must be limited through reduced penetrance late on-set mild disease symptoms or recessive mode of inheritanceor other factors for the mutation to be passed down across gen-erations originating an association effect If current cases arein the vast majority due to new mutations then one would notbe able to detect association between disease status and haplo-type In present of multiple mutations with founder effect theremay be multiple ldquodiseaserdquo haplotypes and if the consequencesof the mutations are slightly different then different haplotypesmay be associated with different risks

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be suspected of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and to understanding their relevance in determining disease risks.
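The windowing step of such a scan can be sketched as follows. This is only an illustration of how marker windows of fixed genetic length might be enumerated; the function name `sliding_windows`, the step size, and the centimorgan input format are assumptions made here, and the association test applied within each window is deliberately left out.

```python
def sliding_windows(positions_cm, window_cm, step_cm):
    """Return a list of (window_start, marker_indices) pairs.

    positions_cm: sorted marker positions in centimorgans (assumed format).
    window_cm:    fixed genetic length of each window.
    step_cm:      distance between consecutive window starts (hypothetical knob).
    """
    if step_cm <= 0 or window_cm <= 0:
        raise ValueError("window_cm and step_cm must be positive")
    windows = []
    if not positions_cm:
        return windows
    first, last = positions_cm[0], positions_cm[-1]
    k = 0
    while first + k * step_cm <= last:
        start = first + k * step_cm
        # Markers falling in the half-open interval [start, start + window_cm).
        idx = [i for i, p in enumerate(positions_cm)
               if start <= p < start + window_cm]
        if idx:  # skip windows that contain no typed marker
            windows.append((start, idx))
        k += 1
    return windows
```

Each returned index list identifies the markers whose haplotypes would be tested together; overlapping windows share markers, so the resulting tests are correlated.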

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a "very good" estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping "similar" ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information, when available, or by assuming linkage disequilibrium (LD) and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, and suggest simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered "true," with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
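The EM treatment of missing phase can be illustrated on a toy problem. The sketch below estimates haplotype frequencies for two biallelic SNPs from unphased genotypes, assuming Hardy–Weinberg equilibrium and no phenotype model. It follows the general EM idea cited above (Excoffier and Slatkin 1995) and is not Lin and Zeng's algorithm; all function and variable names are illustrative.

```python
from itertools import product

# Haplotypes over two biallelic SNPs, coded as tuples of 0/1 alleles.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def compatible_pairs(genotype):
    """Ordered haplotype pairs whose per-locus allele counts equal the genotype."""
    return [(h1, h2) for h1, h2 in product(HAPLOTYPES, repeat=2)
            if tuple(a + b for a, b in zip(h1, h2)) == tuple(genotype)]

def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies under Hardy-Weinberg equilibrium.

    genotypes: list of tuples giving the count (0, 1, or 2) of allele 1 per SNP.
    """
    freqs = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each phase resolution given current freqs.
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new_freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = new_freqs
        if change < tol:
            break
    return freqs
```

Only the double heterozygote `(1, 1)` is phase-ambiguous with two SNPs; the number of compatible pairs grows quickly with the number of heterozygous loci, which is one way to see why EM becomes burdensome for many markers.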

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population and its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), "A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes," American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), "The Human Phenome Project," Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), "Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny," Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), "Haplotype Fine Mapping by Evolutionary Trees," American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), "Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets," American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), "Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables," The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), "Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping," Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), "Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms," Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), "Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping," American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), "Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques," American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), "Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models," American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), "False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders," Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), "Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations," American Journal of Human Genetics, 64, 1728–1738.

Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and that situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng's (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariates and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If, for each value of covariates $X$, we can assume Hardy–Weinberg equilibrium, then Lin and Zeng's model (3) applies marginally. However, conditionally on covariates $X$, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x; \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes $g$, rather than the distribution of haplotypes conditional on $X$. With this assumption, specifying the haplotype frequencies at each covariate value is unnecessary. However, this assumption is unlikely to hold when $X$ and $G$ are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng's simulations assume that $X$ and $H$ are independent at the population level, not just conditional on $G$.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this "target" haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

In the Public Domain
Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for "all commonly used study designs." In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng's article. Recently we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of $H$ given $G$ in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), "Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies," Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), "Prospective Analysis of Logistic Case-Control Studies," Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng's article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain
Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000835

With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the "diplotypes" (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

\[
\mathrm{logit}\{\mathrm{pr}(D = 1 \mid H^d, X)\} = \beta_0 + m(H^d, X; \beta_1), \tag{1}
\]

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a "robust approach" for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

\[
\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d; \theta), \tag{2}
\]

where the model $q(h^d; \theta)$, in turn, could be specified according to HWE or some of its extensions as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

\[
\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x; \theta). \tag{3}
\]

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by "population stratification." In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.
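One way to picture a model of the form (3) is a polytomous logistic (softmax) model for haplotype frequencies given a stratum covariate, combined with HWE within each stratum. The sketch below is a minimal illustration under that assumption; the parameterization (intercepts `alpha`, slopes `gamma`, a scalar covariate `x`) is hypothetical and is not taken from Spinka et al. (2005).

```python
import math

def haplotype_freqs(x, alpha, gamma):
    """Haplotype frequencies given covariate x under a polytomous logistic model.

    alpha, gamma: dicts keyed by haplotype label (hypothetical parameterization);
    a reference haplotype can be given zero intercept and slope.
    """
    scores = {h: alpha[h] + gamma[h] * x for h in alpha}
    top = max(scores.values())  # subtract the maximum for numerical stability
    expd = {h: math.exp(s - top) for h, s in scores.items()}
    total = sum(expd.values())
    return {h: v / total for h, v in expd.items()}

def q_diplotype(h1, h2, x, alpha, gamma):
    """q(h^d, x; theta) for the unordered diplotype {h1, h2}, assuming HWE given x."""
    f = haplotype_freqs(x, alpha, gamma)
    p = f[h1] * f[h2]
    return p if h1 == h2 else 2.0 * p  # heterozygotes arise in two orders
```

Setting every slope in `gamma` to zero removes the dependence on `x`, which recovers a model of the form (2).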

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

\[
\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}
\]

and

\[
S(d, h, x) = \frac{q(h, x; \theta)\,\exp[d\{\kappa + m(h, x; \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x; \beta_1)\}},
\]

where $S$ depends on the parameter vector $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

\[
L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h} \sum_{d} S(d, h, X)}.
\]

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

\[
l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}
\]
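To make the definitions of $S$ and $L^*$ concrete, the following sketch evaluates them for a toy setting with two biallelic SNPs, HWE diplotype frequencies as in model (2), and a scalar covariate. The risk function `m` (an additive effect of haplotype "11" plus a covariate main effect) and all parameter values are hypothetical choices for illustration, not those of Spinka et al. (2005).

```python
import math
from itertools import combinations_with_replacement

HAPLOTYPES = ["00", "01", "10", "11"]
# Unordered diplotypes h^d = {h1, h2}.
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))

def q_hwe(diplotype, theta):
    """q(h^d; theta) under HWE, for haplotype frequencies theta."""
    h1, h2 = diplotype
    p = theta[h1] * theta[h2]
    return p if h1 == h2 else 2.0 * p

def genotype_of(diplotype):
    """Unphased genotype: count of allele 1 at each SNP."""
    h1, h2 = diplotype
    return tuple(int(a) + int(b) for a, b in zip(h1, h2))

def m(diplotype, x, beta1):
    """Hypothetical risk function: copies of haplotype '11' plus covariate effect."""
    b_hap, b_x = beta1
    return b_hap * diplotype.count("11") + b_x * x

def kappa_from(beta0, n1, n0, pi):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))

def S(d, diplotype, x, beta0, kappa, theta, beta1):
    eta = m(diplotype, x, beta1)
    return (q_hwe(diplotype, theta) * math.exp(d * (kappa + eta))
            / (1.0 + math.exp(beta0 + eta)))

def L_star(d, genotype, x, beta0, kappa, theta, beta1):
    """Sum S over diplotypes consistent with the genotype, normalized over all (d, h)."""
    numer = sum(S(d, h, x, beta0, kappa, theta, beta1)
                for h in DIPLOTYPES if genotype_of(h) == tuple(genotype))
    denom = sum(S(dd, h, x, beta0, kappa, theta, beta1)
                for h in DIPLOTYPES for dd in (0, 1))
    return numer / denom
```

For a fixed covariate value, summing `L_star` over both disease states and all nine possible genotypes gives 1, which reflects its reading as a conditional probability of $(D, G)$ given $X$ in the sampled population.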

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let $R = 1$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

\[
\begin{aligned}
l^* &= \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H^d_i = h^d, X_i, R_i = 1) \\
&\qquad\qquad \times\, \mathrm{pr}(H^d_i = h^d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}
\]

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an "ascertainment-corrected joint likelihood" of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with $F(x)$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).
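The role of the shifted intercept $\kappa$ in this representation can be checked with a short derivation. It assumes that selection depends on $(H^d, X)$ only through $D$, and that $N_d$ in the definition of $\mu_d$ is the same count written $n_d$ in the definition of $\kappa$; both are assumptions stated here for the sketch rather than claims from the source.

```latex
% Bayes' rule under Bernoulli sampling with selection probability proportional to mu_d
\begin{align*}
\frac{\mathrm{pr}(D = 1 \mid h^d, x, R = 1)}{\mathrm{pr}(D = 0 \mid h^d, x, R = 1)}
  &= \frac{\mu_1\,\mathrm{pr}(D = 1 \mid h^d, x)}{\mu_0\,\mathrm{pr}(D = 0 \mid h^d, x)}, \\
\mathrm{logit}\{\mathrm{pr}(D = 1 \mid h^d, x, R = 1)\}
  &= \beta_0 + m(h^d, x; \beta_1) + \log(\mu_1/\mu_0) \\
  &= \beta_0 + m(h^d, x; \beta_1) + \log(N_1/N_0) - \log\{\pi/(1 - \pi)\} \\
  &= \kappa + m(h^d, x; \beta_1).
\end{align*}
```

Under these assumptions, the disease model in the selected sample keeps the logistic form of (1) with $\beta_0$ replaced by $\kappa$, which is why $\kappa$ appears in the exponent of the numerator of $S$.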

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than model (2), which assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$
0 = \sum_{i=1}^{N} \sum_{h_d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d; \theta) \times \Biggl( \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d; \theta) \Biggr)^{-1}. \tag{6}
$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$
0 = \sum_{i=1}^{N} \sum_{h_d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, r(h_d, X_i)\, q(h_d; \theta) \times \Biggl( \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, r(h_d, X_i)\, q(h_d; \theta) \Biggr)^{-1}, \tag{7}
$$

where

$$
r(h_d, x) = \frac{1 + \exp\{\kappa + m(h_d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h_d, x, \beta_1)\}}.
$$

Spinka et al. showed that, with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).

2. COHORT-BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H_d$) and environmental covariates ($X$) as

$$
\lambda[t \mid H_d, X] = \lambda_0(t)\, R(H_d, X, \beta_1), \tag{8}
$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H_d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H_d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\Pr(H_d) = q(H_d; \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H_d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

$$
\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\}, \tag{9}
$$

where

$$
R^*\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\} = E\{R(H_d, X, \beta_1) \mid G, X, T \ge t\}
= \frac{\sum_{H_d \in \mathcal{H}_G} R(H_d, X, \beta_1)\, \mathrm{pr}[T > t \mid H_d, X]\, q(H_d; \theta)}{\sum_{H_d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H_d, X]\, q(H_d; \theta)}.
$$

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H_d, X] \approx 1$. The corresponding induced relative risk function

$$
R^*(G, X, \beta_1, \theta) = \frac{\sum_{H_d \in \mathcal{H}_G} R(H_d, X, \beta_1)\, q(H_d; \theta)}{\sum_{H_d \in \mathcal{H}_G} q(H_d; \theta)}
$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat\theta)$, where $\hat\theta$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
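As a rough illustration of the induced relative risk under the rare disease approximation, the following Python sketch enumerates the diplotypes compatible with an unphased genotype and averages a relative risk over them with HWE weights. The function names, the two-SNP haplotype frequencies, and the multiplicative risk model are hypothetical choices made for this example; they are not taken from Chen and Chatterjee (2005).

```python
import math
from itertools import product


def compatible_diplotypes(genotype):
    """Ordered haplotype pairs (h1, h2) consistent with an unphased genotype.

    genotype: tuple of allele counts (0, 1, or 2) per SNP.
    Each haplotype is a tuple of 0/1 alleles.
    """
    per_snp = []
    for g in genotype:
        if g == 0:
            per_snp.append([(0, 0)])
        elif g == 2:
            per_snp.append([(1, 1)])
        else:  # heterozygous site: the phase is unknown
            per_snp.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_snp):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def induced_relative_risk(genotype, freqs, rel_risk):
    """Weighted average of rel_risk over compatible diplotypes.

    Weights are q(H_d; theta) = pi_h1 * pi_h2, i.e. HWE is assumed.
    """
    num = 0.0
    den = 0.0
    for h1, h2 in compatible_diplotypes(genotype):
        q = freqs.get(h1, 0.0) * freqs.get(h2, 0.0)
        num += rel_risk(h1, h2) * q
        den += q
    return num / den


# Hypothetical haplotype frequencies for two SNPs (assumed values).
example_freqs = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}


def risk(h1, h2, beta1=math.log(2.0)):
    """Assumed model: risk multiplies by exp(beta1) per copy of haplotype 11."""
    copies = (h1 == (1, 1)) + (h2 == (1, 1))
    return math.exp(beta1 * copies)


# Double heterozygote: phase is ambiguous, so R* mixes two diplotypes.
r_star = induced_relative_risk((1, 1), example_freqs, risk)
print(r_star)  # approximately 1.4 for these illustrative inputs
```

For the double heterozygote, the diplotype 00/11 carries relative risk 2 and the diplotype 01/10 carries relative risk 1, so the induced value lies between them, pulled toward the more frequent phase configuration.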

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat\theta)$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), "Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions," Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), "Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects," Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Prentice, R. L. (1982), "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model," Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), "Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity," Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of "Hardy–Weinberg equilibrium (HWE)" in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho \pi_k + (1 - \rho) \pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
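To make the inbreeding model concrete, here is a minimal Python sketch that builds the joint distribution of ordered haplotype pairs and compares total homozygosity with and without inbreeding. The diagonal entries follow the model stated above; the off-diagonal form $\pi_{kl} = (1 - \rho)\pi_k \pi_l$ is an assumption added here so that the probabilities sum to 1, and the frequencies and the value of $\rho$ are illustrative only.

```python
def diplotype_probs(pi, rho):
    """Joint probabilities of ordered haplotype pairs (k, l).

    Diagonal:     pi_kk = rho * pi_k + (1 - rho) * pi_k**2  (model in the text)
    Off-diagonal: pi_kl = (1 - rho) * pi_k * pi_l           (assumed here)
    """
    n = len(pi)
    probs = [[(1.0 - rho) * pi[k] * pi[l] for l in range(n)] for k in range(n)]
    for k in range(n):
        probs[k][k] += rho * pi[k]
    return probs


def homozygosity(probs):
    """Total probability that both haplotypes of an individual match."""
    return sum(probs[k][k] for k in range(len(probs)))


# Illustrative haplotype frequencies and inbreeding coefficient (assumed).
pi = [0.2, 0.3, 0.5]
hwe = diplotype_probs(pi, rho=0.0)
inbred = diplotype_probs(pi, rho=0.015)

total = sum(sum(row) for row in inbred)
print(total)                 # the distribution sums to 1 (up to rounding)
print(homozygosity(hwe))     # about 0.38 under HWE
print(homozygosity(inbred))  # about 0.3893 with rho = 0.015
```

With a small $\rho$, the excess homozygosity is correspondingly small: total homozygosity rises by $\rho (1 - \sum_k \pi_k^2)$, which is under one percentage point in this example.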

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
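The confounding argument can be checked with a small numerical sketch. The Python code below builds expected 2 × 2 tables for two hypothetical subpopulations in which the marker is independent of disease, then pools them. All population sizes, carrier frequencies, and prevalences are invented for illustration.

```python
def expected_table(n, carrier_freq, prevalence):
    """Expected counts when disease is independent of carrier status.

    Returns ((carrier cases, noncarrier cases),
             (carrier controls, noncarrier controls)).
    """
    carriers = n * carrier_freq
    noncarriers = n - carriers
    return (
        (carriers * prevalence, noncarriers * prevalence),
        (carriers * (1.0 - prevalence), noncarriers * (1.0 - prevalence)),
    )


def odds_ratio(table):
    """Odds ratio of disease for carriers versus noncarriers."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def pool(t1, t2):
    """Add two 2 x 2 tables cell by cell, ignoring subpopulation labels."""
    return tuple(
        tuple(x + y for x, y in zip(row1, row2)) for row1, row2 in zip(t1, t2)
    )


# Hypothetical subpopulations: A has both a higher carrier frequency and a
# higher disease prevalence than B, but no marker effect within either.
sub_a = expected_table(n=1000, carrier_freq=0.8, prevalence=0.10)
sub_b = expected_table(n=1000, carrier_freq=0.2, prevalence=0.02)
pooled = pool(sub_a, sub_b)

print(odds_ratio(sub_a))   # 1 within subpopulation A (up to rounding)
print(odds_ratio(sub_b))   # 1 within subpopulation B (up to rounding)
print(odds_ratio(pooled))  # roughly 2.46 after pooling
```

The pooled odds ratio is well above 1 even though the marker has no effect in either subpopulation, which is the spurious association that stratification or a matched design is meant to remove.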

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. Naive haplotype analysis performs poorly because it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods are needed that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases and requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far, we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and controls using (12) from LZ with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ's rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.

Table 1. Bias and Mean Squared Error (MSE) of β1

β1    Scenario                          Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of $\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. In contrast, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.
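One way to read Table 1 is through the standard decomposition MSE = bias² + variance. The short Python sketch below applies it to the rows for $\beta_1 = .5$, reading the table entries as decimals (bias, MSE). Because the published values are rounded, the implied variances are only approximate, and the helper name `decompose` is introduced here purely for illustration.

```python
# Table 1 rows for beta_1 = .5, stored as (bias, MSE); values are rounded.
table1 = {
    "Baseline": (-0.002, 0.010),
    "[A] rho = .015": (0.023, 0.012),
    "[B] Ascertainment bias": (0.273, 0.083),
    "[C] Multiple causal haplotypes": (-0.217, 0.058),
}


def decompose(bias, mse):
    """Split MSE into squared bias and implied variance.

    Returns (bias^2, implied variance, share of MSE due to bias^2).
    """
    bias_sq = bias ** 2
    variance = mse - bias_sq
    return bias_sq, variance, bias_sq / mse


results = {name: decompose(b, m) for name, (b, m) in table1.items()}
for name, (bias_sq, variance, share) in results.items():
    print(f"{name:32s} bias^2={bias_sq:.4f} var={variance:.4f} share={share:.1%}")
```

Under this reading, the implied variances are all of similar size (roughly .008 to .011), so the larger MSEs in scenarios [B] and [C] come almost entirely from squared bias, whereas in scenario [A] squared bias accounts for under 5% of the MSE.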

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 249, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.


Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It would also be interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis as considered by LZ provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs is typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally I agree with LZ that there are many interesting statis-tical problems related to large-scale genetic association analysisand more efforts from statisticians are needed to develop robustand theoretically sound strategies for identifying genetic contri-butions to disease and pharmaceutical responses and for identi-fying gene variants that contribute to good health and resistanceto disease

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), “Mapping Complex Disease Loci in Whole-Genome Association Studies,” Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least Angle Regression,” The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), “Gradient Directed Regularization,” technical report, Stanford University.

Gui, J., and Li, H. (2005a), “Threshold Gradient Descent Method for Censored Data Regression With Applications in Pharmacogenomics,” Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), “Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings With Applications to Microarray Gene Expression Data,” Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), “The Hallmarks of Cancer,” Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), “Tree-Structured Supervised Learning and the Genetics of Hypertension,” Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), “The KEGG Databases at GenomeNet,” Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), “Identifying Interacting SNPs Using Monte Carlo Logic Regression,” Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), “Boosting Proportional Hazards Models Using Smoothing Splines With Applications to High-Dimensional Microarray Data,” Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), “Logic Regression,” Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.

116 Journal of the American Statistical Association March 2006

Rejoinder

D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and of the assumption of independence between haplotypes and covariates. However, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative-risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative-risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative-risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating; no viability and/or fertility differential of alleles; no immigration or emigration; no mutation; and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single-marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single-marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


haplotypes) can be determined. Statistically speaking, this is a missing-data problem in which the variable of interest pertains to two ordered sequences of 0's and 1's, but only the summation of the two sequences is observed. This type of missing-data problem has not been studied in the statistical literature.

Many authors (e.g., Clark 1990; Excoffier and Slatkin 1995; Stephens et al. 2001; Zhang, Pakstis, Kidd, and Zhao 2001; Niu, Qin, Xu, and Liu 2002; Qin, Niu, and Liu 2002) have proposed methods to infer haplotypes or estimate haplotype frequencies from unphased genotype data. To make inference about haplotype effects, one may then relate the probabilistically inferred haplotypes to the phenotype through a regression model (e.g., Zaykin et al. 2002). This approach does not account for the variation due to haplotype estimation and does not yield consistent estimators of regression parameters.

A growing number of articles have been published in genetic journals on making proper inference about the effects of haplotypes on disease phenotypes. Most of these articles have dealt with case-control studies. Specifically, Zhao, Li, and Khalid (2003) developed an estimating function that approximates the expectation of the complete-data prospective-likelihood score function given the observable data. This method assumes that the disease is rare and that haplotypes are independent of environmental variables, and it is not statistically efficient. Epstein and Satten (2003) derived a retrospective likelihood for the relative risk that does not accommodate environmental variables. Stram et al. (2003) proposed a conditional likelihood for the odds ratio, assuming that cases and controls are chosen randomly with known probabilities from the target population, but did not consider environmental variables or investigate the properties of the estimator. Building on the earlier work of Schaid et al. (2002), Lake et al. (2003) discussed likelihood-based inference for cross-sectional studies under generalized linear models. Seltman, Roeder, and Devlin (2003) provided a similar discussion based on the cladistic approach. Recently, Lin (2004) showed how to perform Cox's (1972) regression when observations on the potentially censored age at onset of the disease are collected in cohort studies. All of the aforementioned work assumes Hardy–Weinberg equilibrium (Weir 1996, p. 40). Simulation studies (Epstein and Satten 2003; Lake et al. 2003; Satten and Epstein 2004) revealed that violation of this assumption can adversely affect the validity of the inference.

The aim of this article is to address statistical issues in estimating haplotype effects in a systematic and rigorous manner. For case-control studies, we allow environmental variables and derive efficient inference procedures. For cross-sectional and cohort studies, we consider more versatile models than those in the existing literature. For all study designs, we accommodate Hardy–Weinberg disequilibrium. We construct appropriate likelihoods for a variety of models. Under case-control sampling, the likelihood pertains to the distribution of genotypes and environmental variables conditional on the case-control status, which involves infinite-dimensional nuisance parameters if environmental variables are continuous. In cohort studies, it is desirable to not parameterize the distribution of time to disease, so that the likelihood also involves infinite-dimensional parameters. The presence of infinite-dimensional parameters entails considerable theoretical and computational challenges. We establish the theoretical properties of the maximum likelihood estimators (MLEs) by appealing to modern asymptotic techniques and develop efficient and stable algorithms to implement the corresponding inference procedures. We assess the performance of the proposed methods through simulation studies and provide an application to a major genetic study of type 2 diabetes mellitus.

2. INFERENCE PROCEDURES

2.1 Preliminaries

We consider SNP-based association studies of unrelated individuals. Suppose that each individual is genotyped at $M$ biallelic SNPs within a candidate gene. At each SNP site, we indicate the two possible alleles by the values 0 and 1. Thus each haplotype $h$ is a unique sequence of $M$ numbers from $\{0, 1\}$. The total number of possible haplotypes is $K \equiv 2^M$; the actual number of haplotypes consistent with the data is usually much smaller. For $k = 1, \ldots, K$, let $h_k$ denote the $k$th possible haplotype. Figure 1 shows the eight possible haplotypes for three SNPs.

Human chromosomes come in pairs, with one member of each pair inherited from the mother and the other member inherited from the father. These pairs are called homologous chromosomes. Thus each individual has a pair of homologous haplotypes that may or may not be identical. Routine genotyping procedures cannot separate the two homologous chromosomes, so only the (unphased) genotypes (i.e., the combinations of the two homologous haplotypes) are directly observable. For each individual, the multi-SNP genotype is an ordered sequence of $M$ numbers from $\{0, 1, 2\}$.

Let $H$ and $G$ denote the pair of haplotypes and the genotype for an individual. We write $H = (h_k, h_l)$ if the individual's haplotypes are $h_k$ and $h_l$, in which case $G = h_k + h_l$. The ordering of the two homologous haplotypes within an individual is considered arbitrary. By allowing genotypes to include missing SNP information, we may assume that $G$ is known for each individual. Given $G$, the value of $H$ is unknown if the individual is heterozygous at more than one SNP or if any SNP genotype is missing. For the case of $M = 3$ shown in Figure 1, if $G = (0, 2, 1)$, then $H = (h_3, h_4)$, and if $G = (0, 1, 1)$, then $H = (h_1, h_4)$ or $H = (h_2, h_3)$.
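The relationship between a genotype and its compatible haplotype pairs can be illustrated with a short enumeration. The following Python sketch is illustrative only: the function names are hypothetical, and the numbering of haplotypes in binary counting order is an assumption chosen because it reproduces the two examples given in the text for $M = 3$.

```python
from itertools import product


def enumerate_haplotypes(m):
    """Return all 2**m haplotypes as 0/1 tuples, in binary counting order.

    Assumption: this ordering gives h_1 = (0,0,0), ..., h_8 = (1,1,1),
    which agrees with the examples quoted in the text.
    """
    return list(product((0, 1), repeat=m))


def compatible_pairs(genotype, haplotypes):
    """Return 1-based index pairs (k, l), k <= l, with h_k + h_l == genotype."""
    pairs = []
    for k, hk in enumerate(haplotypes, start=1):
        for l in range(k, len(haplotypes) + 1):
            hl = haplotypes[l - 1]
            # The genotype is the site-by-site sum of the two haplotypes.
            if all(a + b == g for a, b, g in zip(hk, hl, genotype)):
                pairs.append((k, l))
    return pairs


haps = enumerate_haplotypes(3)
print(compatible_pairs((0, 2, 1), haps))  # [(3, 4)]
print(compatible_pairs((0, 1, 1), haps))  # [(1, 4), (2, 3)]
```

A genotype that is heterozygous at no more than one SNP yields a single pair, so the phase is known; two or more heterozygous sites yield several candidate pairs, which is the source of the missing-data problem.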

Figure 1 Possible Haplotype Configurations With Three SNPs


The goal of the association studies is to relate the pair of haplotypes to disease phenotypes or traits. The simplest phenotype is the binary indicator for the disease status, which takes the value 1 if the individual is diseased and 0 otherwise. The diseased individuals may be further classified into several categories, corresponding to different types of disease or varying degrees of disease severity. If the age of onset is likely to be genetically mediated, then it is desirable to use the age of onset as the phenotype. One may also be interested in disease-related traits, such as blood pressure.

The data on the disease phenotype may be gathered in various ways. The simplest approach is to obtain a random sample from the target population and measure the phenotype of interest on every individual in the sample. Such studies are referred to as cross-sectional studies, which are feasible if the disease is relatively frequent or if one is interested only in some readily measured traits that are related to the disease. If one is interested in the age at the onset of a disease, however, then it is necessary to follow a cohort of individuals forward in time, in which case the phenotype (i.e., time to disease occurrence) may be censored. When the disease is relatively rare, it is more cost-effective to use the case-control design, which collects data retrospectively on a sample of diseased individuals and on a separate sample of disease-free individuals. It is often desirable to collect data on environmental variables or covariates so as to investigate gene–environment interactions.

Let $Y$ be the phenotype of interest, and let $X$ be the covariates. For cross-sectional and case-control studies, the association between $Y$ and $(X, H)$ is characterized by the conditional density of $Y = y$ given $H = (h_k, h_l)$ and $X = x$, denoted by $P_{\alpha,\beta,\xi}(y \mid x, (h_k, h_l))$, where $\alpha$ denotes the intercept(s), $\beta$ denotes the regression effects, and $\xi$ denotes the nuisance parameters (e.g., variance and overdispersion parameters). There is considerable flexibility in specifying the regression relationship. Suppose that $h^*$ is the target haplotype of interest and that there are no covariates. Then a linear predictor in the form of $\alpha + \beta I(h_k = h_l = h^*)$ pertains to a recessive model, $\alpha + \beta\{I(h_k = h^*) + I(h_l = h^*) - I(h_k = h_l = h^*)\}$ pertains to a dominant model, $\alpha + \beta\{I(h_k = h^*) + I(h_l = h^*)\}$ pertains to an additive model, and $\alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 I(h_k = h_l = h^*)$ pertains to a codominant model, where $I(\cdot)$ is the indicator function. Clearly, the codominant model contains the other three models as special cases. A codominant model with gene–environment interactions has the following linear predictor:

$$
\alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 I(h_k = h_l = h^*) + \beta_3^T x + \beta_4^T\{I(h_k = h^*) + I(h_l = h^*)\}x + \beta_5^T I(h_k = h_l = h^*)x. \tag{1}
$$

Additional terms may be included so as to examine the effects of several haplotype configurations or to investigate the joint effects of multiple candidate genes.
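The four genetic models without covariates differ only in how the two indicator terms enter the linear predictor. The Python sketch below evaluates each form for a given haplotype pair and checks numerically that the codominant form contains the other three. The function names and the numeric parameter values are hypothetical choices made for illustration.

```python
def haplotype_terms(hk, hl, target):
    """Return (count, both): copies of the target haplotype and I(h_k = h_l = h*)."""
    count = int(hk == target) + int(hl == target)
    both = int(hk == target and hl == target)
    return count, both


def linear_predictor(hk, hl, target, model, alpha, beta, beta2=0.0):
    """Linear predictor without covariates for the four models in the text."""
    count, both = haplotype_terms(hk, hl, target)
    if model == "recessive":
        return alpha + beta * both
    if model == "dominant":
        return alpha + beta * (count - both)
    if model == "additive":
        return alpha + beta * count
    if model == "codominant":
        # Here beta plays the role of beta_1 and beta2 the role of beta_2.
        return alpha + beta * count + beta2 * both
    raise ValueError(f"unknown model: {model}")


target = (0, 1, 1)
other = (0, 0, 0)
for pair in [(other, other), (target, other), (target, target)]:
    add = linear_predictor(*pair, target, "additive", -1.0, 0.5)
    dom = linear_predictor(*pair, target, "dominant", -1.0, 0.5)
    rec = linear_predictor(*pair, target, "recessive", -1.0, 0.5)
    # Codominant reduces to additive (beta_2 = 0), dominant (beta_2 = -beta_1),
    # and recessive (beta_1 = 0, beta_2 = beta).
    assert add == linear_predictor(*pair, target, "codominant", -1.0, 0.5, 0.0)
    assert dom == linear_predictor(*pair, target, "codominant", -1.0, 0.5, -0.5)
    assert rec == linear_predictor(*pair, target, "codominant", -1.0, 0.0, 0.5)
    print(add, dom, rec)
```

Covariate and interaction terms as in (1) would add inner products with $x$ to the same two building blocks, `count` and `both`.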

Although we are interested in the effects of $H$ and $X$ on $Y$, we observe $G$ instead of $H$. As mentioned earlier, $G$ is the summation of the paired sequences in $H$. Thus we have a regression problem with missing data in which the primary explanatory variable pertains to two ordered sequences of numbers from $\{0, 1\}$, but only the summation of the two sequences is observed. We assume that $X$ is independent of $H$ conditional on $G$ and that $(1, X^T)$ is linearly independent with positive probability.

Write $\pi_{kl} = P\{H = (h_k, h_l)\}$ and $\pi_k = P(h = h_k)$, $k, l = 1, \ldots, K$. As we demonstrate in this article, it is sometimes possible to make inference about haplotype effects without imposing any structures on $\{\pi_{kl}\}$, although estimating $\{\pi_k\}$ and testing for no haplotype effects require some restrictions on $\{\pi_{kl}\}$. Under Hardy–Weinberg equilibrium,

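$$
\pi_{kl} = \pi_k \pi_l, \qquad k, l = 1, \ldots, K. \tag{2}
$$

We consider two specific forms of departure from Hardy–Weinberg equilibrium,

$$
\pi_{kl} = (1 - \rho)\pi_k \pi_l + \delta_{kl}\rho\pi_k \tag{3}
$$

and

$$
\pi_{kl} = \frac{(1 - \rho + \delta_{kl}\rho)\pi_k \pi_l}{1 - \rho + \rho\sum_{j=1}^{K}\pi_j^2}, \tag{4}
$$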

where $0 \le \pi_k \le 1$, $\sum_{k=1}^{K}\pi_k = 1$, $\delta_{kk} = 1$, and $\delta_{kl} = 0$ $(k \ne l)$. In (3), $\rho$ is called the inbreeding coefficient or fixation index (Weir 1996, p. 93) and corresponds to Cohen's (1960) kappa measure of agreement. Equation (4) creates disequilibrium by giving different fitness values to the homozygous and heterozygous pairs (Niu et al. 2002). The denominator is a normalizing constant. Both (3) and (4) reduce to (2) if $\rho = 0$. Excess homozygosity (i.e., $\pi_{kk} > \pi_k^2$, $k = 1, \ldots, K$) arises when $\rho > 0$, and excess heterozygosity (i.e., $\pi_{kk} < \pi_k^2$, $k = 1, \ldots, K$) arises when $\rho < 0$. Recently, Satten and Epstein (2004) considered (3) for the control population under the case-control design. We abuse the notation slightly in that the $\pi_k$ in (4) do not pertain to the marginal distribution of $H$ unless $\rho = 0$.
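The three specifications of the haplotype-pair distribution are easy to compare numerically. The following Python sketch builds the $K \times K$ matrix of $\pi_{kl}$ under (2), (3), and (4). The function name, the model labels, and the example frequencies are hypothetical, and the pairs are treated as ordered so that the matrix sums to 1.

```python
def pair_probs(pi, rho=0.0, model="hwe"):
    """Return the K x K matrix of pi_kl under (2) 'hwe', (3) 'inbreeding', or (4) 'fitness'."""
    K = len(pi)
    if model == "hwe":  # equation (2)
        return [[pi[k] * pi[l] for l in range(K)] for k in range(K)]
    if model == "inbreeding":  # equation (3)
        return [[(1 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
                 for l in range(K)] for k in range(K)]
    if model == "fitness":  # equation (4)
        denom = 1 - rho + rho * sum(p * p for p in pi)  # normalizing constant
        return [[(1 - rho + (rho if k == l else 0.0)) * pi[k] * pi[l] / denom
                 for l in range(K)] for k in range(K)]
    raise ValueError(f"unknown model: {model}")


pi = [0.5, 0.3, 0.2]  # illustrative haplotype frequencies
for model in ("hwe", "inbreeding", "fitness"):
    P = pair_probs(pi, rho=0.1, model=model)
    total = sum(sum(row) for row in P)
    # Each specification defines a proper distribution over ordered pairs.
    print(model, round(total, 12), round(P[0][0], 6))
```

With $\rho = 0.1$ the homozygote probability $\pi_{11}$ exceeds $\pi_1^2 = 0.25$ under both (3) and (4), which is the excess homozygosity described in the text; setting `rho=0.0` recovers (2) in both cases.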

Let $\tilde{h}$ denote a haplotype that differs from $h$ at only one SNP. Write $\nabla_u f(u, v) = \partial f(u, v)/\partial u$. The following lemma states that, under (3) or (4), $\{\pi_k\}$ and $\rho$ are identifiable from the data on $G$, and the data on $G$ provide positive information about these parameters.

Lemma 1. Assume that either (3) or (4) holds. The parameters $\{\pi_k\}$ and $\rho$ are uniquely determined by the distribution of $G$. For nondegenerate distribution $\{\pi_k\}$, if there exist a constant $\mu$ and a vector $\nu = (\nu_1, \ldots, \nu_K)^T$ such that $\sum_{k=1}^{K}\nu_k = 0$ and $\mu\nabla_\rho \log P(G = g) + \sum_{k=1}^{K}\nu_k\nabla_{\pi_k}\log P(G = g) = 0$ for $g = 2h$, then $\mu = 0$ and $\nu = 0$.

In the sequel, $\mathcal{G}$ denotes the set of all possible genotypes, and $S(G)$ denotes the set of haplotype pairs that are consistent with genotype $G$. We suppose that $\pi_k > 0$ for all $k = 1, \ldots, K$, where $K$ is now interpreted as the total number of haplotypes that exist in the population. For any parameter $\theta$, we use $\theta_0$ to denote its true value if the distinction is necessary. We assume that the true value of any Euclidean parameter $\theta$ belongs to the interior of a known compact set within the domain of $\theta$. Proofs of Lemma 1 and all of the theorems are provided in the Appendix.


2.2 Cross-Sectional Studies

There is a random sample of $n$ individuals from the underlying population. The observable data consist of $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$. The trait $Y$ can be discrete or continuous, univariate or multivariate. As stated in Section 2.1, the conditional density of $Y$ given $X$ and $H$ is given by $P_{\alpha,\beta,\xi}(Y \mid X, H)$. For a univariate trait, this regression model may take the form of a generalized linear model (McCullagh and Nelder 1989) with the linear predictor given in (1). If the trait is measured repeatedly in a longitudinal study, then generalized linear mixed models (Diggle, Heagerty, Liang, and Zeger 2002, chap. 9) may be used. The following conditions are required for estimating $(\alpha, \beta, \xi)$.

Condition 1. If $P_{\alpha,\beta,\xi}(Y \mid X, H) = P_{\tilde\alpha,\tilde\beta,\tilde\xi}(Y \mid X, H)$ for any $H = (h_k, h_k)$ and $H = (h_k, \tilde{h}_k)$, $k = 1, \ldots, K$, then $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\xi = \tilde\xi$.

Condition 2. If there exists a constant vector $\nu$ such that $\nu^T\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}(Y \mid X, H) = 0$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde{h}_k)$, then $\nu = 0$.

Remark 1. Condition 1 ensures that the parameters of interest are identifiable from the genotype data. The linear independence of the score function stated in Condition 2 ensures nonsingularity of the information matrix. The reason for considering $H = (h_k, h_k)$ and $H = (h_k, \tilde{h}_k)$ is that these haplotype pairs can be inferred with certainty because of the unique decompositions of the corresponding genotypes $g = 2h_k$ and $g = h_k + \tilde{h}_k$. All of the commonly used regression models, particularly generalized linear (mixed) models with linear predictors in the form of (1), satisfy Conditions 1 and 2.

We show in Section A.2.1 that it is possible to estimate the regression parameters without imposing any structure on the joint distribution of $H$. But this estimation requires knowledge of whether or not the dominant effects exist. Specifically, if there are no dominant effects, then only $(\alpha, \beta, \xi)$ and $P(G = g)$ are identifiable; otherwise, $(\alpha, \beta, \xi)$, $P(G = g)$, and $P(H = (h^*, g - h^*))$ are identifiable. If either (3) or (4) holds, then it follows from Lemma 1 and Condition 1 that all of the parameters are identifiable, regardless of the genetic mechanism. Denote the joint distribution of $H$ by $P_\gamma(H = (h_k, h_l))$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$. Under (3) or (4), $\gamma = (\rho, \pi_1, \ldots, \pi_K)^T$. When the distribution of $H$ is unspecified, $\gamma$ pertains to the aspects of the distribution of $H$ that are identifiable.

Write $\theta = (\alpha, \beta, \gamma, \xi)$. The likelihood for $\theta$ based on the cross-sectional data is proportional to

$$
L_n(\theta) \equiv \prod_{i=1}^{n}\prod_{g \in \mathcal{G}} m_g(Y_i, X_i; \theta)^{I(G_i = g)}, \tag{5}
$$

where

$$
m_g(y, x; \theta) = \sum_{(h_k, h_l) \in S(g)} P_{\alpha,\beta,\xi}\bigl(y \mid x, (h_k, h_l)\bigr) P_\gamma(h_k, h_l).
$$

The MLE $\hat\theta$ can be obtained by maximizing (5) via the Newton–Raphson algorithm or an optimization algorithm. It is generally more efficient to use the expectation–maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), especially when the distribution of $H$ satisfies (3) with $\rho \ge 0$; see Section A.2.2 for details.

By the classical likelihood theory we can show that θ is con-sistent asymptotically normal and asymptotically efficient un-der Conditions 1 and 2 and the following condition

Condition 3 If there exists a constant vector ν such thatνTnablaθ log mg(YX θ0) = 0 then ν = 0

Remark 2 Condition 3 ensures the nonsingularity of the in-formation matrix This condition can be easily verified when thejoint distribution of H is unspecified and is implied by Lemma 1and Condition 2 when the distribution satisfies (3) or (4)

2.3 Case-Control Studies With Known Population Totals

We consider case-control data supplemented by information on population totals (Scott and Wild 1997). There is a finite population of $N$ individuals that is considered a random sample from the joint distribution of $(Y, X, H)$, where $Y$ is a categorical response variable. All that is known about this finite population is the total number of individuals in each category of $Y = y$. A sample of size $n$ stratified on the disease status is drawn from the finite population, and the values of $X$ and $G$ are recorded for each sampled individual. The supplementary information on population totals is often available from hospital records, cancer registries, and official statistics. If a case-control sample is drawn from a cohort study, then the cohort serves as the finite population. The observable data consist of $(Y_i, R_i, R_iX_i, R_iG_i)$, $i = 1, \ldots, N$, where $R_i$ indicates, by the values 1 versus 0, whether or not the $i$th individual in the finite population is selected into the case-control sample.

The association between $Y$ and $(X, H)$ is characterized by $P_{\alpha,\beta,\xi}(Y|X,H)$, where $\alpha$, $\beta$, and $\xi$ pertain to the intercept(s), regression effects, and overdispersion parameters (McCullagh and Nelder 1989). In the case of a binary response variable, important examples of $P_{\alpha,\beta,\xi}(Y|X,H)$ include the logistic, probit, and complementary log–log regression models. When there are more than two categories, examples include the proportional odds, multivariate probit, and multivariate logistic regression models. Because the data associated with $R_i = 1$ yield the same form of likelihood as that of a cross-sectional study and the data associated with $R_i = 0$ yield a missing-data likelihood, all of the identifiability results stated in Section 2.2 apply to the current setting. We again write $\theta = (\alpha, \beta, \xi, \gamma)$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$.

Let $F_g(\cdot)$ be the cumulative distribution function of $X$ given $G = g$, and let $f_g(x)$ be the density of $F_g(x)$ with respect to a dominating measure $\mu(x)$. Note that $F_g(\cdot)$ is infinite-dimensional if $X$ has continuous components. The joint density of $(Y = y, G = g, X = x)$ is $m_g(y, x; \theta) f_g(x)$. The likelihood concerning $\theta$ and $\{F_g\}$ takes the form

$$L_n(\theta, \{F_g\}) = \prod_{i=1}^{N} \Biggl[\prod_{g \in \mathcal{G}} \bigl\{m_g(Y_i, X_i; \theta) f_g(X_i)\bigr\}^{I(G_i = g)}\Biggr]^{R_i} \times \Biggl[\sum_{g \in \mathcal{G}} \int m_g(Y_i, x; \theta)\, dF_g(x)\Biggr]^{1 - R_i}. \tag{6}$$

Lin and Zeng Inference on Haplotype Effects 93

Unlike the likelihood for the cross-sectional design given in (5), the density functions of $X$ given $G$ cannot be factored out of the likelihood given in (6) and thus cannot be omitted from the likelihood.

We maximize (6) to obtain the MLEs $\hat\theta$ and $\{\hat F_g(\cdot)\}$. The latter is an empirical function with point masses at the observed $X_i$ such that $G_i = g$ and $R_i = 1$. The maximization can be carried out via the Newton–Raphson, profile-likelihood, or large-scale optimization methods. An alternative way to calculate the MLEs is through the EM algorithm described in Section A.3.1.

We impose the following regularity condition and then state the asymptotic results in Theorem 1.

Condition 4. For any $g \in \mathcal{G}$, $f_g(x)$ is positive in its support and continuously differentiable with respect to a suitable measure.

Theorem 1. Under Conditions 1–4, $\hat\theta$ and $\{\hat F_g(\cdot)\}$ are consistent in that $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a mean 0 normal random vector whose covariance matrix attains the semiparametric efficiency bound.

Let $pl_n(\theta)$ be the profile log-likelihood for $\theta$, that is, $pl_n(\theta) = \max_{\{F_g\}} \log L_n(\theta, \{F_g\})$. Then the $(s, t)$th element of the inverse covariance matrix of $\hat\theta$ can be estimated by

$$-\epsilon_n^{-2}\bigl\{pl_n(\hat\theta + \epsilon_n e_s + \epsilon_n e_t) - pl_n(\hat\theta + \epsilon_n e_s - \epsilon_n e_t) - pl_n(\hat\theta - \epsilon_n e_s + \epsilon_n e_t) + pl_n(\hat\theta)\bigr\},$$

where $\epsilon_n$ is a constant of the order $n^{-1/2}$ and $e_s$ and $e_t$ are the $s$th and $t$th canonical vectors. The function $pl_n(\theta)$ can be calculated via the EM algorithm by holding $\theta$ constant in both the E-step and the M-step.
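A numerical second difference of a profile log-likelihood can be sketched as follows. The sketch uses the conventional symmetric four-point difference, which is an assumption of this illustration rather than a transcription of the expression in the text. The function `pl` stands for any user-supplied profile log-likelihood, and the name `observed_information` is hypothetical.

```python
import numpy as np

def observed_information(pl, theta_hat, eps):
    """Estimate the negative Hessian of pl at theta_hat by central differences.

    pl        : callable mapping a parameter vector to a profile log-likelihood
    theta_hat : maximizer of pl
    eps       : step size, for example a constant times n ** -0.5
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = theta_hat.size
    info = np.empty((d, d))
    for s in range(d):
        for t in range(d):
            es = np.zeros(d)
            es[s] = eps
            et = np.zeros(d)
            et[t] = eps
            # standard symmetric four-point second difference
            diff = (pl(theta_hat + es + et) - pl(theta_hat + es - et)
                    - pl(theta_hat - es + et) + pl(theta_hat - es - et))
            info[s, t] = -diff / (4.0 * eps ** 2)
    return info
```

Inverting the returned matrix gives a covariance estimate for $\hat\theta$. For a quadratic `pl` the difference formula is exact, which makes a convenient sanity check.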

Remark 3. If $N$ is much larger than $n$, or if the population frequencies rather than the totals are known, then we maximize $\prod_{i=1}^{n} \prod_{g \in \mathcal{G}} \{m_g(Y_i, X_i; \theta) f_g(X_i)\}^{I(G_i = g)}$ subject to the constraints that $\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x) = p_y$, where $p_y$ is the population frequency of $Y = y$. The resultant estimator of $\theta_0$ is consistent, asymptotically normal, and asymptotically efficient. The results in this section can be extended straightforwardly to accommodate stratifications on covariates.

2.4 Case-Control Studies With Unknown Population Totals

We consider the classical case-control design, which measures $X$ and $G$ on $n_1$ cases ($Y = 1$) and $n_0$ controls ($Y = 0$) and requires no knowledge about the finite population. With the notation introduced in the previous section, the likelihood contribution from one individual takes the form

$$RL(\theta, \{F_g\}) = \frac{\prod_{g \in \mathcal{G}} \{m_g(y, X; \theta) f_g(X)\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x)}, \tag{7}$$

where we use $y$ instead of $Y$ to emphasize that $y$ is not random. Define

$$f_g^\dagger(x) = \frac{m_g(0, x; \theta) f_g(x)}{\int m_g(0, x; \theta)\, dF_g(x)}, \qquad q_g = \frac{\int m_g(0, x; \theta)\, dF_g(x)}{\sum_{g' \in \mathcal{G}} \int m_{g'}(0, x; \theta)\, dF_{g'}(x)}.$$

Clearly, $f_g^\dagger(x)$ is the conditional density of $X$ given $G = g$ and $Y = 0$, and $q_g$ is the conditional probability of $G = g$ given $Y = 0$. Let $g_0$ and $x_0$ be some specific values of $G$ and $X$. Write $F_g^\dagger(x) = \int_0^x f_g^\dagger(s)\, d\mu(s)$. We can express (7) as

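$$RL(\theta, \{F_g^\dagger\}, \{q_g\}) = \frac{\prod_{g \in \mathcal{G}} \{\eta(y, X, g; \theta) f_g^\dagger(X) q_g\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} q_g \int \eta(y, x, g; \theta)\, dF_g^\dagger(x)}, \tag{8}$$

where

$$\eta(y, x, g; \theta) = \frac{m_g(y, x; \theta)\, m_{g_0}(0, x_0; \theta)}{m_g(0, x; \theta)\, m_{g_0}(y, x_0; \theta)}.$$

We call $\eta$ the generalized odds ratio (Liang and Qin 2000), which reduces to the ordinary odds ratio when $S(g)$ is a singleton.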
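The reduction to the ordinary odds ratio can be checked directly. The short derivation below is a sketch under two added assumptions that are not stated in the text: both $S(g)$ and $S(g_0)$ are singletons, with single members written $H_g$ and $H_{g_0}$ (hypothetical notation), and the regression model is logistic with linear predictor $\alpha + \beta^T Z(x, H)$.

```latex
% Singleton S(g) = {H_g}: m_g(y, x; theta) = P(y | x, H_g) P_gamma(H_g),
% so the haplotype-pair probabilities cancel in the ratio.
\begin{align*}
\eta(y, x, g; \theta)
  &= \frac{P_{\alpha,\beta}(y \mid x, H_g)\, P_\gamma(H_g)\;
           P_{\alpha,\beta}(0 \mid x_0, H_{g_0})\, P_\gamma(H_{g_0})}
          {P_{\alpha,\beta}(0 \mid x, H_g)\, P_\gamma(H_g)\;
           P_{\alpha,\beta}(y \mid x_0, H_{g_0})\, P_\gamma(H_{g_0})} \\
  &= \frac{P_{\alpha,\beta}(y \mid x, H_g) \,/\, P_{\alpha,\beta}(0 \mid x, H_g)}
          {P_{\alpha,\beta}(y \mid x_0, H_{g_0}) \,/\, P_{\alpha,\beta}(0 \mid x_0, H_{g_0})}, \\
\eta(1, x, g; \theta)
  &= \exp\bigl[\beta^T \{Z(x, H_g) - Z(x_0, H_{g_0})\}\bigr]
  \quad \text{(logistic model)}.
\end{align*}
```

Under these assumptions the intercept $\alpha$ and the distribution of $H$ drop out of $\eta$, which agrees with the familiar fact that only the odds ratio is identifiable from traditional case-control data.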

Remark 4. The parameter $q_g$ is a functional of $f_g^\dagger$ and $\theta$ because $\int m_g(0, x; \theta)\, dF_g(x) = \{\int m_g^{-1}(0, x; \theta)\, dF_g^\dagger(x)\}^{-1}$. This constraint makes it very difficult to study the identifiability of the parameters. Thus we treat $q_g$ as a free parameter in our development.

For traditional case-control data, the odds ratio is identifiable (whereas the intercept is not), and its MLE can be obtained by maximizing the prospective likelihood (Prentice and Pyke 1979). Similar results hold when the exposure is measured with error (Roeder, Carroll, and Lindsay 1996); however, the distribution of the measurement error needs to be estimated from a validation set or an external source. With unphased genotype data, identifiability is much more delicate. We show in Section A.4.1 that the components of $\theta$ that are identifiable from the retrospective likelihood are exactly those that are identifiable from the generalized odds ratio. Thus we assume that the generalized odds ratio depends only on a set of identifiable parameters, still denoted by $\theta$; otherwise, the inference is not tractable. For the logistic link function with linear predictor (1), we show in Section A.4.2 that if there are no dominant effects, then $\theta$ consists only of $\beta$; if there are no covariate effects but there exists a dominant main effect, then $\beta$ is identifiable and $P(H = (h^*, g - h^*))/P(G = g)$ is identifiable up to a scalar constant; and if the dominant effect depends on a continuous covariate, or if the dominant main effect and the main effect of a continuous covariate are nonzero, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$. For the probit and complementary log–log link functions, we show in Section A.4.3 that if there are dominant effects and at least one continuous covariate has an effect, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$.

We maximize the product of (8) over the $n \equiv n_1 + n_0$ individuals in the case-control sample to produce the MLEs $\hat\theta$, $\{\hat F_g^\dagger(\cdot)\}$, and $\{\hat q_g\}$. Although the $\{F_g^\dagger(\cdot)\}$ are high-dimensional, we show in Section A.4.4 that $\hat\theta$ can be obtained by profiling a likelihood function over a scalar nuisance parameter.

To state the asymptotic properties of the MLEs, we impose the following conditions.

Condition 5. If there exists a vector $v$ such that $v^T \nabla_\theta \log \eta(1, x, g; \theta)$ is a constant with probability 1, then $v = 0$.

Condition 6. The function $f_g^\dagger$ is positive in its support and continuously differentiable.

Condition 7. The fraction $n_1/n$ converges to a limit in $(0, 1)$.

94 Journal of the American Statistical Association March 2006

Remark 5. Condition 5 implies nonsingularity of the information matrix for $\theta_0$ and can be shown to hold for the logistic, probit, and complementary log–log link functions. Condition 7 ensures that there are both cases and controls in the sample.

Theorem 2. Under Conditions 5–7, $|\hat\theta - \theta_0| + \sup_g |\hat q_g - q_g| + \sup_{x,g} |\hat F_g^\dagger(x) - F_g^\dagger(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix attains the semiparametric efficiency bound.

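In most case-control studies, the disease is (relatively) rare. When the disease is rare, considerable simplicity arises because of the following approximation for the logistic regression model:

$$P_{\alpha,\beta}(Y|X,H) \approx \exp\bigl\{Y\bigl(\alpha + \beta^T Z(X,H)\bigr)\bigr\},$$

where $Z(X,H)$ is a specific function of $X$ and $H$. We assume that either (3) or (4) holds. The likelihood based on $(X_i, G_i, y_i)$, $i = 1, \ldots, n$, can be approximated by

$$L_n(\theta, \{F_g\}) = \prod_{i=1}^{n} \Biggl(\frac{\prod_{g \in \mathcal{G}} \bigl[f_g(X_i) \sum_{(h_k,h_l) \in S(g)} e^{\beta^T Z(X_i,(h_k,h_l))} P_\gamma(h_k,h_l)\bigr]^{I(G_i = g)}}{\sum_{g \in \mathcal{G}} \int_x \sum_{(h_k,h_l) \in S(g)} e^{\beta^T Z(x,(h_k,h_l))} P_\gamma(h_k,h_l)\, dF_g(x)}\Biggr)^{y_i} \times \Biggl[\prod_{g \in \mathcal{G}} \Bigl\{f_g(X_i) \sum_{(h_k,h_l) \in S(g)} P_\gamma(h_k,h_l)\Bigr\}^{I(G_i = g)}\Biggr]^{1 - y_i}. \tag{9}$$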

We impose the following condition.

Condition 8. If $\alpha + \beta^T Z(X,H) = \tilde\alpha + \tilde\beta^T Z(X,H)$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, then $\alpha = \tilde\alpha$ and $\beta = \tilde\beta$.

This condition is similar to Condition 1 stated in Section 2.2, and it holds for the codominant model. Under this condition, it follows from Lemma 1 that no two sets of parameters can give the same likelihood with probability 1. Thus the maximizer of (9), denoted by $(\hat\theta, \{\hat F_g\})$, is locally unique. We show in Section A.4.5 that $\hat\theta$ can be easily obtained by profiling over a small number of parameters.

To derive the asymptotic properties, we provide a mathematical definition of rare disease.

Condition 9. For $i = 1, \ldots, n$, the conditional distribution of $Y_i$ given $(X_i, H_i)$ satisfies that $P(Y_i = 1|X_i, H_i) = a_n \exp\{\beta_0^T Z(X_i, H_i)\}/[1 + a_n \exp\{\beta_0^T Z(X_i, H_i)\}]$, where $a_n = o(n^{-1/2})$.
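The quality of the rare-disease approximation can be examined numerically. The sketch below compares the logistic probability in Condition 9 with its exponential approximation; the function names are hypothetical, and `lin` stands for the value of $\beta_0^T Z(X_i, H_i)$.

```python
import math

def logistic_prob(a_n, lin):
    """P(Y = 1 | X, H) in the form of Condition 9."""
    r = a_n * math.exp(lin)
    return r / (1.0 + r)

def rare_approx(a_n, lin):
    """Rare-disease approximation a_n * exp(lin)."""
    return a_n * math.exp(lin)

def relative_error(a_n, lin):
    """Relative error of the approximation; algebraically equal to a_n * exp(lin)."""
    exact = logistic_prob(a_n, lin)
    return (rare_approx(a_n, lin) - exact) / exact
```

Because the relative error equals $a_n \exp(\text{lin})$, it vanishes as $a_n \to 0$, which is the sense in which the approximation improves as the disease becomes rarer.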

Theorem 3. Under Conditions 6–9, $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to_{P_n} 0$, where $P_n$ is the probability measure given by Condition 9. Furthermore, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix achieves the semiparametric efficiency bound.

2.5 Cohort Studies

In a cohort study, $Y$ represents the time to disease occurrence, which is subject to right-censorship by $C$. The data consist of $(\tilde Y_i, \Delta_i, X_i, G_i)$, $i = 1, \ldots, n$, where $\tilde Y_i = \min(Y_i, C_i)$ and $\Delta_i = I(Y_i \le C_i)$. We relate $Y_i$ to $(X_i, H_i)$ through a class of semiparametric linear transformation models,

$$\Gamma(Y_i) = -\beta^T Z(X_i, H_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{10}$$

where $\Gamma$ is an unknown increasing function, $Z(X, H)$ is a known function of $X$ and $H$, and the $\epsilon_i$'s are independent errors with known distribution function $F$. We may rewrite (10) as

$$P(Y_i \le t|X_i, H_i) = Q\bigl(\Lambda(t) e^{\beta^T Z(X_i, H_i)}\bigr),$$

where $\Lambda(t) = e^{\Gamma(t)}$ and $Q(x) = F(\log x)$ $(x > 0)$. The choices of the extreme-value and standard logistic distributions for $F$, or equivalently $Q(x) = 1 - e^{-x}$ and $Q(x) = 1 - (1 + x)^{-1}$, yield the proportional hazards model and the proportional odds model (Pettitt 1984).
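The two special cases can be verified with a few lines of code. The sketch below encodes the two choices of $Q$ and the corresponding error distributions $F$. The function names are hypothetical, and `Lambda` is any user-supplied increasing function standing in for the transformation-scale function of the model.

```python
import math

def Q_ph(x):
    """Q for the proportional hazards model."""
    return 1.0 - math.exp(-x)

def Q_po(x):
    """Q for the proportional odds model."""
    return 1.0 - 1.0 / (1.0 + x)

def F_extreme_value(u):
    """Extreme-value distribution function."""
    return 1.0 - math.exp(-math.exp(u))

def F_logistic(u):
    """Standard logistic distribution function."""
    return math.exp(u) / (1.0 + math.exp(u))

def cdf(t, lin, Lambda, Q):
    """P(Y <= t | X, H) = Q(Lambda(t) * exp(lin)), with lin = beta' Z(X, H)."""
    return Q(Lambda(t) * math.exp(lin))
```

With `Q_ph`, the ratio of log survival probabilities at two covariate values is $e^{\text{lin}}$ for every $t$; with `Q_po`, the same holds for the odds of failure by time $t$.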

We impose Condition 8. Under this condition, $\beta$ and $\Lambda(\cdot)$ are identifiable from the observable data. The identifiability of the distribution of $H$ is the same here as in the case of cross-sectional studies. Under (3) or (4) and Condition 8, all of the parameters, including $\beta$, $\Lambda(\cdot)$, and $\gamma$, are identifiable. This is shown in Section A.5.1.

The following assumption on censoring is required in constructing the likelihood.

Condition 10. Conditional on $X$ and $G$, the censoring time $C$ is independent of $Y$ and $H$.

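Let $\theta = (\beta, \gamma)$. The likelihood concerning $\theta$ and $\Lambda$ takes the form

$$L_n(\theta, \Lambda) = \prod_{i=1}^{n} \sum_{(h_k,h_l) \in S(G_i)} \Bigl[\Lambda'(\tilde Y_i) e^{\beta^T Z(X_i,(h_k,h_l))} Q'\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i,(h_k,h_l))}\bigr)\Bigr]^{\Delta_i} \times \Bigl[1 - Q\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i,(h_k,h_l))}\bigr)\Bigr]^{1 - \Delta_i} P_\gamma(h_k, h_l). \tag{11}$$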

Here and in the sequel, $f'(x) = df(x)/dx$ and $f''(x) = d^2f(x)/dx^2$. Like (6), (8), and (9), this likelihood involves infinite-dimensional parameters. If $\Lambda$ is restricted to be absolutely continuous, then, as in the case of density estimation, there is no maximizer of this likelihood. Thus we relax $\Lambda$ to be right-continuous and replace $\Lambda'(\tilde Y_i)$ in (11) by the jump size of $\Lambda$ at $\tilde Y_i$. By the arguments of Zeng, Lin, and Lin (2005), the resultant MLE, denoted by $(\hat\theta, \hat\Lambda)$, exists, and $\hat\Lambda$ is a step function with jumps only at the observed $\tilde Y_i$ for which $\Delta_i = 1$. The maximization can be carried out through an optimization algorithm. Furthermore, the covariance matrix of $\hat\theta$ can be estimated by the profile likelihood method, as discussed by Zeng et al. (2005).

Lin (2004) considered the special case of the proportional hazards model under condition (2) and provided an EM algorithm for obtaining the MLEs. We can modify that algorithm to accommodate Hardy–Weinberg disequilibrium along the lines of Section A.2.2. In addition, the EM algorithm can be used to evaluate the profile likelihood.

We assume the following regularity conditions for the asymptotic results.


Condition 11. There exists some positive constant $\delta_0$ such that $P(C_i \ge \tau|X_i, G_i) = P(C_i = \tau|X_i, G_i) \ge \delta_0$ almost surely, where $\tau$ corresponds to the end of the study.

Condition 12. The true value $\Lambda_0(t)$ of $\Lambda(t)$ is a strictly increasing function in $[0, \tau]$ and is continuously differentiable. In addition, $\Lambda_0(0) = 0$, $\Lambda_0(\tau) < \infty$, and $\Lambda_0'(0) > 0$.

Theorem 4. Under Conditions 8 and 10–12, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ converges weakly to a Gaussian process in $R^d \times l^\infty([0, \tau])$, where $d$ is the dimension of $\theta_0$ and $l^\infty([0, \tau])$ is the space of all bounded functions on $[0, \tau]$ equipped with the supremum norm. Furthermore, $\hat\theta$ is asymptotically efficient.

3. SIMULATION STUDIES

We used Monte Carlo simulation to evaluate the proposed methods in realistic settings. We considered the five SNPs on chromosome 22 from the Finland–United States Investigation of NIDDM Genetics (FUSION) Study described in the next section. We obtained the $\pi_k$'s from the frequencies shown in Table 1 by assuming a 7% disease rate and generated haplotypes under (3) with $\rho = .05$. The $R_h^2$ in Table 1 is the measure of haplotype certainty of Stram et al. (2003). We focused on $h^* = (01100)$ and considered case-control and cohort studies.

For the cohort studies, we generated ages of onset from the proportional hazards model

$$\lambda\{t|x, (h_k, h_l)\} = 2t \exp\bigl[\beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x\bigr],$$

where $X$ is a Bernoulli variable with $P(X = 1) = .2$ that is independent of $H$. We generated the censoring times from the uniform $(0, \tau)$ distribution, where $\tau$ was chosen to yield approximately 250, 500, and 1000 cases under $n = 5000$. We let $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$.
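A data-generating step of this kind can be sketched as follows. This is an illustration, not the authors' simulation code: it assumes that model (3) is the mixture $(1-\rho)\pi_k\pi_l + \delta_{kl}\rho\pi_k$ suggested by the appendix, and all names, including `simulate_cohort` and `target`, are hypothetical. Since the hazard is $2t\,e^{\text{lin}}$, the cumulative hazard is $t^2 e^{\text{lin}}$ and onset times can be drawn by inversion.

```python
import numpy as np

def simulate_cohort(n, haplotypes, pi, rho, target, beta, tau, p_x=0.2, seed=0):
    """Draw (observed time, event indicator, covariate, unphased genotype)."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    k = rng.choice(K, size=n, p=pi)
    l = rng.choice(K, size=n, p=pi)
    # assumed model (3): with probability rho the pair is (h_k, h_k)
    same = rng.random(n) < rho
    l = np.where(same, k, l)
    # copies of the target haplotype (given by its index) in each pair
    copies = (k == target).astype(int) + (l == target).astype(int)
    x = rng.binomial(1, p_x, size=n)
    lin = beta[0] * copies + beta[1] * x + beta[2] * copies * x
    # invert the cumulative hazard t^2 * exp(lin) at a unit exponential draw
    t = np.sqrt(rng.exponential(size=n) / np.exp(lin))
    c = rng.uniform(0.0, tau, size=n)
    y_obs = np.minimum(t, c)
    delta = (t <= c).astype(int)
    haps = np.asarray(haplotypes)
    g = haps[k] + haps[l]  # only the sum of the two haplotypes is observed
    return y_obs, delta, x, g
```

In practice `tau` would be tuned, for example by trial runs, until `delta.sum()` is close to the desired number of cases.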

As shown in Table 2, the MLE is virtually unbiased, the likelihood ratio test has proper type I error, and the confidence interval has reasonable coverage. Additional simulation studies revealed that the proposed methods also perform well for making inference about other parameters and under other genetic models.

Table 1. Observed Haplotype Frequencies in the FUSION Study

                  Frequencies
Haplotype    Controls    Cases       R_h^2
00011        .0042       .0066       .388
00100        .0035       .0034       .336
00110        .0018       .0007       .377
01011        .1292       .1344       .592
01100        .2514       .3183       .738
01101        .0012       <10^{-4}    .450
01110        <10^{-4}    .0045       .499
01111        .0019       <10^{-4}    .325
10000        .0136       .0114       .456
10010        <10^{-4}    .0012       .500
10011        .3573       .2883       .727
10100        .0521       .0597       .402
10110        .0317       .0318       .554
11011        .1392       .1290       .560
11100        .0109       .0092       .266
11110        <10^{-4}    .0014       <10^{-4}
11111        .0020       <10^{-4}    .338

Table 2. Simulation Results for the Haplotype–Environment Interactions in Cohort Studies

β3       Cases    Bias     SE      CP      Power
0        250      −.010    .232    .949    .051
         500      −.005    .157    .953    .047
         1000     −.003    .114    .954    .046
−.25     250      −.014    .256    .950    .190
         500      −.008    .172    .949    .334
         1000     −.004    .122    .952    .554
−.5      250      −.022    .281    .950    .505
         500      −.011    .190    .950    .806
         1000     −.006    .132    .952    .976
.25      250      −.007    .216    .947    .207
         500      −.002    .146    .953    .395
         1000     −.001    .109    .954    .614
.5       250      −.003    .204    .943    .693
         500      −.001    .140    .951    .940
         1000     −.001    .105    .952    .998

NOTE: Bias and SE are the bias and standard error of the estimator of β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level likelihood ratio test of H0: β3 = 0. Each entry is based on 5000 replicates.

For the case-control studies, we used the same distributions of $H$ and $X$ and considered the same $h^*$ as in the cohort studies. We generated disease incidence from the logistic regression model

$$\text{logit}\, P\{Y = 1|x, (h_k, h_l)\} = \alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x. \tag{12}$$

For making inference on $\beta_1$, we set $\beta_2 = \beta_3 = .25$ and varied $\beta_1$ from $-.5$ to $.5$; for making inference on $\beta_3$, we set $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$. We chose $\alpha = -3$ or $-4$, yielding disease rates between 1.6% and 7%. We let $n_1 = n_0 = 500$ or 1000. We considered the situations of known and unknown population totals, with $N$ being 15 and 30 times $n$ under $\alpha = -3$ and $-4$. For known population totals, we used the EM algorithm described in Section A.3.1 and evaluated the inference procedures based on the likelihood ratio statistic. For unknown population totals, we used the profile-likelihood method for rare diseases described in Section A.4.5 and set the $\pi_k$ less than $2/n$ to 0 to improve numerical stability. The results for $\beta_1$ and $\beta_3$ are displayed in Tables 3 and 4.

For known population totals, the proposed estimators are virtually unbiased, and the likelihood ratio statistics yield proper tests and confidence intervals. For unknown population totals, $\hat\beta_1$ has little bias, especially for large $n$, whereas $\hat\beta_3$ tends to be slightly biased downward; the variance estimators are fairly accurate, and the corresponding confidence intervals have reasonable coverage probabilities, except for $\alpha = -3$, $\beta_3 = .5$. The method with known population totals yields slightly higher power than the method with unknown population totals.

All the aforementioned results pertain to haplotype 01100, which has a relatively high frequency and a large value of $R_h^2$; the covariate is binary, and $\rho$ is .05, which is relatively large. Additional simulation studies showed that the foregoing conclusions continue to hold for other haplotypes, other values of $\rho$, and continuous covariates. Table 5 reports some results for haplotype 10100, which has a frequency of about 5% and an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

$$\text{logit}\, P\{Y = 1|X_1, X_2, (h_k, h_l)\} = \alpha + \beta_h\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x_1} X_1 + \beta_{x_2} X_2 + \beta_{hx_2}\{I(h_k = h^*) + I(h_l = h^*)\}X_2,$$

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform(0, 1). We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x_1} = \beta_{x_2} = -\beta_{hx_2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.

Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                            Known totals                    Unknown totals
n1 = n0   α     β1      Bias    SE     CP     Power     Bias    SE     SEE    CP     Power
500       −3    −.5     −.003   .117   .952   .987      .019    .121   .124   .951   .979
                −.25    −.002   .109   .954   .587      .014    .112   .117   .960   .525
                0       −.001   .104   .951   .049      .009    .109   .112   .955   .045
                .25     −.001   .102   .950   .641      .002    .105   .108   .961   .646
                .5      .000    .099   .948   .996      −.005   .103   .106   .958   .998
          −4    −.5     .001    .112   .954   .987      .022    .119   .124   .951   .977
                −.25    −.002   .104   .955   .574      .013    .114   .117   .952   .529
                0       −.002   .100   .953   .047      .004    .109   .112   .953   .047
                .25     −.001   .095   .956   .636      −.003   .103   .108   .959   .640
                .5      −.000   .094   .950   .999      −.009   .102   .105   .956   .997
1000      −3    −.5     −.003   .082   .953   1.00      .005    .087   .087   .949   1.00
                −.25    −.002   .076   .952   .874      .005    .081   .082   .948   .853
                0       −.001   .073   .951   .049      .005    .077   .077   .954   .046
                .25     −.001   .071   .953   .898      .004    .075   .076   .948   .920
                .5      −.001   .070   .953   1.00      .003    .075   .075   .946   1.00
          −4    −.5     .000    .079   .952   1.00      .005    .087   .088   .947   1.00
                −.25    −.000   .074   .959   .867      .005    .081   .083   .954   .847
                0       −.001   .070   .955   .045      .002    .079   .079   .949   .051
                .25     −.001   .067   .956   .904      .000    .074   .076   .955   .909
                .5      −.001   .066   .956   1.00      −.002   .073   .074   .954   1.00

NOTE: Bias and SE are the bias and standard error of the estimator of β1. SEE is the mean of the standard error estimator for β1. CP is the coverage probability of the 95% confidence interval for β1. Power pertains to the .05-level test of H0: β1 = 0. Each entry is based on 5000 replicates.

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                            Known totals                    Unknown totals
n1 = n0   α     β3      Bias    SE     CP     Power     Bias    SE     SEE    CP     Power
500       −3    −.5     −.008   .205   .949   .729      .030    .187   .195   .953   .692
                −.25    −.002   .186   .949   .271      .016    .169   .176   .961   .244
                0       −.001   .173   .946   .054      −.006   .155   .162   .963   .037
                .25     .002    .165   .949   .334      −.038   .144   .151   .958   .287
                .5      .006    .161   .947   .885      −.088   .138   .143   .915   .831
          −4    −.5     −.009   .198   .950   .763      .012    .194   .195   .950   .720
                −.25    −.005   .181   .949   .309      .006    .172   .176   .953   .264
                0       −.002   .168   .945   .055      −.007   .156   .161   .956   .044
                .25     −.001   .157   .944   .370      −.022   .146   .149   .948   .333
                .5      .001    .148   .945   .926      −.047   .136   .141   .945   .904
1000      −3    −.5     −.004   .147   .943   .953      .027    .134   .136   .950   .953
                −.25    −.003   .133   .946   .493      .013    .122   .123   .949   .477
                0       −.001   .123   .951   .049      −.005   .114   .113   .948   .052
                .25     .000    .119   .945   .580      −.034   .107   .106   .934   .535
                .5      .002    .117   .947   .994      −.080   .102   .101   .870   .986
          −4    −.5     −.005   .140   .945   .965      .010    .137   .136   .949   .965
                −.25    −.002   .128   .945   .529      .005    .124   .123   .951   .505
                0       −.001   .119   .946   .054      −.004   .113   .113   .947   .053
                .25     −.000   .110   .947   .633      −.016   .104   .105   .952   .601
                .5      .002    .105   .949   .998      −.037   .099   .099   .937   .995

NOTE: Bias and SE are the bias and standard error of the estimator of β3. SEE is the mean of the standard error estimator for β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level test of H0: β3 = 0. Each entry is based on 5000 replicates.


Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias    SE     SEE    CP     Power
500       βh          0            −.030   .400   .401   .957   .043
          βx1         .5           .002    .151   .152   .951   .917
          βx2         .5           .001    .228   .230   .953   .584
          βhx2        −.5          .015    .641   .644   .956   .118
1000      βh          0            −.017   .275   .277   .953   .047
          βx1         .5           .002    .107   .107   .954   .997
          βx2         .5           .000    .162   .161   .950   .871
          βhx2        −.5          .012    .441   .443   .950   .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated $\rho$ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate $X$ by setting $X = 1$ for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043 with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of $X$ given $Y$ and $G$ under model (12) with $\alpha = -3.7$, $\beta_1 = .32$, and $\beta_2 = .25$. Based on 5000 replicates, the power for testing $H_0: \beta_3 = 0$ is estimated at .053, .479, or .974 under $\beta_3 = 0$, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

                                                                          Codominant model
Haplotype   Recessive model    Dominant model   Additive model   Additive        Recessive
01011       .327 (.270)        −.027 (.140)     .049 (.135)      .005 (.143)     .331 (.289)
01100       .316 (.146)        .274 (.109)      .355 (.099)      .334 (.114)     .063 (.167)
10011       −.206 (.155)       −.323 (.112)     −.320 (.095)     −.344 (.111)    .076 (.183)
10100       −1.019 (1.020)     .196 (.219)      .116 (.213)      .169 (.217)     −1.131 (1.029)
10110       .903 (.746)        −.007 (.248)     .063 (.249)      .016 (.254)     .892 (.765)
11011       −.222 (.328)       −.096 (.140)     −.127 (.133)     −.108 (.140)    −.143 (.344)

NOTE: Standard error estimates are shown in parentheses.


Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
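The window construction itself is simple to write down. The following sketch enumerates the index ranges of sliding windows over an ordered list of SNPs. The function name and the `step` argument are hypothetical, and the association test applied within each window is left to the analyst.

```python
def sliding_windows(n_snps, width, step=1):
    """Return half-open index ranges (start, stop) of windows of `width` SNPs."""
    if n_snps <= width:
        # too few SNPs for more than one window
        return [(0, n_snps)]
    return [(s, s + width) for s in range(0, n_snps - width + 1, step)]
```

For example, `sliding_windows(7, 5)` yields `[(0, 5), (1, 6), (2, 7)]`. With `step=1`, adjacent windows share `width - 1` SNPs, which is the source of the strong correlation among the window-level test statistics.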

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters, $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$, yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly, $(1 - \rho)\pi_k^2 + \rho\pi_k = (1 - \tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k$. We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have $\pi_k = [-\rho + \{\rho^2 + 4c_k(1 - \rho)\}^{1/2}]/\{2(1 - \rho)\}$. Thus $(1 - \rho)^{-1}$ satisfies the equation $\sum_k [(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}] = 2$, and $(1 - \tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$. To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1 - \rho) + \rho\} + \mu\pi_k(1 - \pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have $\sum_k \mu\pi_k(1 - \pi_k)/\{2\pi_k(1 - \rho) + \rho\} = 0$. Therefore, $\mu = 0$ and $\nu = 0$.
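The constructive part of this argument can be mirrored numerically. Given the homozygote probabilities $c_k$, the sketch below solves the displayed equation for $x = (1 - \rho)^{-1}$ by bisection and then recovers $\rho$ and the $\pi_k$. The function name is hypothetical, and the sketch assumes $\rho \ge 0$ and $\sum_k c_k < 1$ so that a root exists on $[1, \infty)$.

```python
import math

def recover_rho_pi(c, n_iter=200):
    """Recover (rho, pi) from c_k = (1 - rho) pi_k^2 + rho pi_k."""
    def term(x, ck):
        return (1.0 - x) + math.sqrt((x - 1.0) ** 2 + 4.0 * ck * x)

    def excess(x):
        # nonincreasing in x; its root is x = 1 / (1 - rho)
        return sum(term(x, ck) for ck in c) - 2.0

    lo, hi = 1.0, 2.0
    while excess(hi) > 0.0:  # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(n_iter):  # plain bisection
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    rho = 1.0 - 1.0 / x
    pi = [0.5 * term(x, ck) for ck in c]
    return rho, pi
```

Monotonicity of the left side in $x$, as noted in the proof, is what makes bisection adequate here.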

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \not\ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h,h)\}$ or $\{(h,\tilde h)\}$, so that $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h,h))P(H = (h,h))$ or $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h,\tilde h))P(H = (h,\tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k,h_l))$ does not depend on $(h_k,h_l) \in S(g)$, so that $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k,h_l))P(G = g)$, where $(h_k,h_l) \in S(g)$. For $g \in \mathcal{G}_3$,
\[ m_g(y,x;\theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g), \]
where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k,h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true value of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) When $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y,x;\theta) = m_g(y,x;\theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y,x;\theta) = m_g(y,x;\theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).
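The three-way partition of genotypes can be made concrete with a short Python sketch. It enumerates the set $S(g)$ of haplotype pairs compatible with a genotype coded as allele counts (0, 1, or 2 per SNP) and classifies the genotype. The function names are hypothetical, and the reading of the second category as "ambiguous genotypes that cannot carry $h^*$" is an assumption about the notation.

```python
from itertools import product

def compatible_pairs(g):
    """Unordered haplotype pairs {h, h'} with h + h' = g, for 0/1 alleles."""
    pairs = set()
    for h in product((0, 1), repeat=len(g)):
        other = tuple(gj - hj for gj, hj in zip(g, h))
        if all(o in (0, 1) for o in other):
            pairs.add(tuple(sorted((h, other))))
    return sorted(pairs)

def classify(g, h_star):
    """G1: phase is unambiguous; G2: ambiguous and h_star cannot be carried;
    G3: ambiguous and h_star may be carried (interpretation assumed here)."""
    pairs = compatible_pairs(g)
    if len(pairs) == 1:
        return "G1"
    if not any(h_star in pair for pair in pairs):
        return "G2"
    return "G3"

# A double heterozygote at two SNPs has two compatible pairs.
print(compatible_pairs((1, 1)))
print(classify((2, 1), (1, 1)), classify((1, 1), (1, 1)), classify((1, 1, 0), (1, 1, 1)))
```

A genotype has a single compatible pair exactly when it is heterozygous at no more than one SNP, which is why only genotypes in the third category contribute the two-component mixture in $m_g(y,x;\theta)$.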

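A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is
\[ \sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta)\big\{\log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l)) + \log P_\gamma(h_k,h_l)\big\}, \]
where
\[ p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_{k'},h_{l'})\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_{k'},h_{l'}))P_\gamma(h_{k'},h_{l'})}. \]
Thus, in the $(m+1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:
\[ \sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)}) \nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l)) = 0, \]
\[ \sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)}) \nabla_\gamma \log P_\gamma(h_k,h_l) = 0. \tag{A.1} \]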

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k,h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k,h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1-B)Q_2$. The complete-data likelihood can be represented by
\[ \prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_k \pi_k^{I(Q_{1i}=(h_k,h_k))B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i}=(h_k,h_l))(1-B_i)} \rho^{B_i}(1-\rho)^{1-B_i}. \]
The corresponding score equations for $\pi_k$ and $\rho$ satisfy
\[ \pi_k = c^{-1}\Big[\sum_{i=1}^n B_i I(Q_{1i} = (h_k,h_k)) + \sum_{i=1}^n \sum_{l=1}^K (1-B_i)\big\{I(Q_{2i} = (h_k,h_l)) + I(Q_{2i} = (h_l,h_k))\big\}\Big] \]
and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

\[ E\{\omega(B_i,Q_{1i},Q_{2i}) \mid Y_i,X_i,G_i\} = \sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1+(1-b)q_2)\, p(b,q_1,q_2)\,\omega(b,q_1,q_2) \times \Big[\sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1+(1-b)q_2)\, p(b,q_1,q_2)\Big]^{-1}, \]
where $\omega(B,Q_1,Q_2) = BI(Q_1 = (h_k,h_k))$, $(1-B)I(Q_2 = (h_k,h_l))$, or $B$, and
\[ p(b,q_1,q_2) = \prod_k \pi_k^{bI(q_1=(h_k,h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1-b)I(q_2=(h_k,h_l))} \rho^b(1-\rho)^{1-b}. \]

In the $(m+1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:
\[ \pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\Big[\sum_{i=1}^n E^{(m)}\{B_i I(Q_{1i} = (h_k,h_k))\} + 2\sum_{i=1}^n \sum_{l=1}^K E^{(m)}\{(1-B_i)I(Q_{2i} = (h_k,h_l))\}\Big] \]
and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i,Q_{1i},Q_{2i})\}$ is $E\{\omega(B_i,Q_{1i},Q_{2i}) \mid Y_i,X_i,G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.
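The closed-form update can be sketched in Python for the simplest setting. This is a deliberately reduced illustration, not the full algorithm of this section: it assumes there is no phenotype or covariate (so the factor $P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)$ is dropped and only $\gamma = (\rho, \{\pi_k\})$ is estimated), it treats haplotype pairs as ordered, and the function names are hypothetical.

```python
import math
from itertools import product

def ordered_pairs(g, haps):
    """Ordered index pairs (k, l) with haps[k] + haps[l] == g."""
    return [(k, l) for k in range(len(haps)) for l in range(len(haps))
            if all(a + b == gj for a, b, gj in zip(haps[k], haps[l], g))]

def em_step(genotypes, haps, pi, rho):
    """One EM iteration for (pi, rho) under P(H = (h_k, h_l)) =
    rho * delta_kl * pi_k + (1 - rho) * pi_k * pi_l (genotype data only)."""
    counts = [0.0] * len(haps)
    exp_b = 0.0
    for g in genotypes:
        # Latent configurations (b, k, l) compatible with g and their weights.
        configs = []
        for k, l in ordered_pairs(g, haps):
            if k == l:
                configs.append((1, k, l, rho * pi[k]))            # B = 1, Q1 = (h_k, h_k)
            configs.append((0, k, l, (1 - rho) * pi[k] * pi[l]))  # B = 0, Q2 = (h_k, h_l)
        total = sum(cfg[3] for cfg in configs)
        for b, k, l, w in configs:
            post = w / total
            if b == 1:
                exp_b += post
                counts[k] += post      # E{B I(Q1 = (h_k, h_k))}
            else:
                counts[k] += post      # both positions of Q2 count, which
                counts[l] += post      # gives the factor 2 in the update
    norm = sum(counts)
    return [c / norm for c in counts], exp_b / len(genotypes)

def loglik(genotypes, haps, pi, rho):
    """Observed-data log-likelihood of the genotypes."""
    total = 0.0
    for g in genotypes:
        prob = sum((rho * pi[k] if k == l else 0.0) + (1 - rho) * pi[k] * pi[l]
                   for k, l in ordered_pairs(g, haps))
        total += math.log(prob)
    return total

haps = list(product((0, 1), repeat=2))
genos = [(0, 0)] * 10 + [(1, 1)] * 6 + [(2, 2)] * 4 + [(1, 0)] * 3 + [(2, 1)] * 2 + [(0, 1)]
pi, rho = [0.25] * 4, 0.1
for _ in range(50):
    pi, rho = em_step(genos, haps, pi, rho)
```

Adding a phenotype model would amount to multiplying each configuration weight by $P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))$ before normalizing, as in the definition of the conditional expectation above.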

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is
\[ \prod_{i=1}^N P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i) \prod_g f_g(X_i)^{I(G_i=g)}. \]
The M-step solves the following equations for $\theta$:
\[ \sum_{i=1}^N I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i,H_i) \mid Y_i,X_i,G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i,H_i) \mid Y_i\} = 0, \]
\[ \sum_{i=1}^N I(R_i = 1)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i,X_i,G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i\} = 0, \tag{A.2} \]
and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:
\[ F_g\{X_i\} = \Big[\sum_{j=1}^N I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(X_j = X_i, G_j = g) \mid Y_j\}\Big] \times \Big[\sum_{j=1}^N I(G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(G_j = g) \mid Y_j\}\Big]^{-1}, \]
where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i,X_i,H_i)$, the conditional expectation takes the form
\[ \frac{\sum_{(h_k,h_l)\in S(G_i)} \omega(Y_i,X_i,(h_k,h_l))P_{\alpha,\beta,\xi}(Y_i \mid X_i,(h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_k,h_l)\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i,(h_k,h_l))P_\gamma(h_k,h_l)} \]
for $R_i = 1$, and
\[ \sum_{g\in\mathcal{G}} \sum_{x\in\{X_j:\, G_j=g,\, R_j=1\}} \sum_{(h_k,h_l)\in S(g)} \omega(Y_i,x,(h_k,h_l)) P_{\alpha,\beta,\xi}(Y_i \mid x,(h_k,h_l)) P_\gamma(h_k,h_l)F_g\{x\} \times \Big(\sum_{g\in\mathcal{G}} \sum_{x\in\{X_j:\, G_j=g,\, R_j=1\}} \sum_{(h_k,h_l)\in S(g)} P_{\alpha,\beta,\xi}(Y_i \mid x,(h_k,h_l)) P_\gamma(h_k,h_l)F_g\{x\}\Big)^{-1} \]
for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components, $\{F_g(\cdot)\}$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).

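A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters, $(\theta, \{F^\dagger_g\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F^\dagger_g\}, \{\tilde q_g\})$, yield the same likelihood:
\[ \mathrm{RL}(\theta, \{F^\dagger_g\}, \{q_g\}) = \mathrm{RL}(\tilde\theta, \{\tilde F^\dagger_g\}, \{\tilde q_g\}). \tag{A.3} \]
Because $\eta(0,x,g;\theta) = 1$, (A.3) with $y = 0$ implies that $f^\dagger_g(x)q_g / \sum_{g\in\mathcal{G}} q_g = \tilde f^\dagger_g(x)\tilde q_g / \sum_{g\in\mathcal{G}} \tilde q_g$. Thus $f^\dagger_g(x) = \tilde f^\dagger_g(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that
\[ \eta(y,x,g;\theta) = C(y)\,\eta(y,x,g;\tilde\theta), \tag{A.4} \]
where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y,x_0,g_0;\theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F^\dagger_g\}, \{q_g\})$ is $\{(\tilde\theta, \{F^\dagger_g\}, \{q_g\}): \eta(y,x,g;\tilde\theta) = \eta(y,x,g;\theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that
\[ \eta(y,x,g;\theta) = \eta(y,x,g;\tilde\theta) \tag{A.5} \]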

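for two sets of parameters, $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k,h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes
\[ \frac{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + 1} = \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + 1}, \tag{A.6} \]
where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results:

1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields
\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}}. \]
Thus (A.6) is equivalent to
\[ \frac{\pi_1(g)\pi_2(g')}{\pi_1(g')\pi_2(g)} = \frac{\tilde\pi_1(g)\tilde\pi_2(g')}{\tilde\pi_1(g')\tilde\pi_2(g)} \quad \text{for all } g, g' \in \mathcal{G}_3. \]

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ with all components other than $z$ set to 0 and $\beta_{3z}z \ne 0$, (A.6) yields
\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\beta_{3z}z}}{1 + e^{\alpha+\beta_1+\beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\beta_{3z}z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z}z}}. \]
The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z \to -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to
\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \tag{A.7} \]
for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.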
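The monotonicity claim invoked in case 2 can be verified in one line. The following short derivation is added for the reader's convenience and uses only the function $(\lambda + c)/(\lambda + 1)$ named in the text; the symbol $\tilde\lambda$ for the corresponding quantity under the second parameter set is introduced here for illustration.

```latex
\[
\frac{d}{d\lambda}\,\frac{\lambda + c}{\lambda + 1}
  = \frac{(\lambda + 1) - (\lambda + c)}{(\lambda + 1)^2}
  = \frac{1 - c}{(\lambda + 1)^2},
\]
% For c \ne 1 the derivative has a constant nonzero sign on \lambda > -1, so
\[
\frac{\lambda + c}{\lambda + 1} = \frac{\tilde\lambda + c}{\tilde\lambda + 1}
\quad\Longrightarrow\quad \lambda = \tilde\lambda .
\]
```

Here $c = e^{\psi_2(x) - \psi_1(x)}$, so $c \ne 1$ exactly when $\beta_1 + \beta_4^T x \ne 0$, which is the condition attached to (A.7).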

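A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1,x,g;\theta) = \eta(1,x,g;\tilde\theta)$ for two sets of parameters, $\theta$ and $\tilde\theta$, if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1,x,g;\theta) = \eta(1,x,g;\tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain
\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\Big\}, \tag{A.8} \]
\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\Big\}, \tag{A.9} \]
and
\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\Big\}, \tag{A.10} \]
where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then
\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\Big\}. \]
By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore, $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain
\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}. \]
By taking the ratio of the two sides and letting $z \to \mathrm{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore, $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,
\[ \eta(1,x,g;\theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\Big\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x)\Big\} \times \Big[\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}\frac{\pi_1(g)}{\pi_2(g)} + 1 - \Phi(\alpha + \beta_3^T x)\Big]^{-1}. \tag{A.11} \]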

It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1,x,g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11), with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,
\[ \frac{e^{-e^{\alpha}}\big(e^{e^{\alpha+\beta_z z}} - 1\big)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}\big(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1\big)}{1 - e^{-e^{\tilde\alpha}}}. \]
Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.
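The ratio of derivatives used for the complementary log–log model can be worked out explicitly. The short derivation below is added as a check; the shorthand $f(z) = e^{e^{\alpha + \beta_z z}}$ is introduced here and is not notation from the article.

```latex
\[
f(z) = \exp\{e^{\alpha + \beta_z z}\}, \qquad
f'(z) = \beta_z e^{\alpha + \beta_z z} f(z), \qquad
f''(z) = \beta_z^2 e^{\alpha + \beta_z z}\big(e^{\alpha + \beta_z z} + 1\big) f(z),
\]
\[
\frac{f''(z)}{f'(z)} = \beta_z\big(e^{\alpha + \beta_z z} + 1\big).
\]
% The constant factor e^{-e^\alpha} / (1 - e^{-e^\alpha}) multiplies both
% derivatives and cancels in the ratio.
```

Because the ratio must agree with its counterpart under the second parameter set for all $z$, letting $e^{\alpha + \beta_z z} \to 0$ matches the slopes $\beta_z$, and the remaining exponential term then matches the intercepts $\alpha$.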

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1,g_1), \ldots, (x_J,g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j,g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X,G)$ at $(x_j,g_j)$. Then the log-likelihood based on (8) can be written as
\[ l_n(\theta, \{\delta_j\}) = \sum_{j=1}^J n_{1j} \log \eta(1,x_j,g_j;\theta) - n_1 \log\Big\{\sum_{j=1}^J \eta(1,x_j,g_j;\theta)\delta_j\Big\} + \sum_{j=1}^J n_{+j} \log \delta_j, \]
where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain
\[ \frac{n_{+j}}{\delta_j} - \frac{n_1\eta(1,x_j,g_j;\theta)}{\sum_{j=1}^J \eta(1,x_j,g_j;\theta)\delta_j} + \lambda = 0. \]
Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus
\[ \delta_j = \frac{n_{+j}}{n - n_1 + n_1\eta(1,x_j,g_j;\theta)/\mu}, \tag{A.12} \]
where $\mu = \sum_{j=1}^J \eta(1,x_j,g_j;\theta)\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to
\[ l^*_n(\theta,\mu) = \sum_{j=1}^J n_{1j} \log \eta(1,x_j,g_j;\theta) - \sum_{j=1}^J n_{+j} \log\Big\{\frac{n_1}{n}\eta(1,x_j,g_j;\theta) + \Big(1 - \frac{n_1}{n}\Big)\mu\Big\} + (n - n_1)\log\mu. \]
Thus $\max_{\{\delta_j\}} l_n(\theta,\{\delta_j\}) \le \max_\mu l^*_n(\theta,\mu) + C_n$. If $\hat\mu$ maximizes $l^*_n(\theta,\mu)$, then $\partial l^*_n(\theta,\mu)/\partial\mu = 0$ at $\mu = \hat\mu$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l^*_n(\theta,\mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta,\{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta,\{\delta_j\})$ equals the profile function based on $l^*_n(\theta,\mu)$ up to a constant $C_n$. We maximize $l^*_n(\theta,\mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l^*_n(\theta,\mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l^*_n(\theta,\mu)$.
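The link between the stationarity condition for $\mu$ and the constraint on the jump sizes can be checked numerically. In the Python sketch below, $\theta$ is held fixed, so the values $\eta(1,x_j,g_j;\theta)$ are supplied as plain numbers; the data and function names are made up for illustration. The code solves $\sum_j \delta_j = 1$ for $\mu$ by bisection, using (A.12), and then confirms that $\mu = \sum_j \eta_j\delta_j$.

```python
def delta_given_mu(mu, eta, n_plus, n1):
    """Jump sizes from (A.12): delta_j = n_{+j} / (n - n1 + n1 * eta_j / mu)."""
    n = sum(n_plus)
    return [npj / (n - n1 + n1 * ej / mu) for npj, ej in zip(n_plus, eta)]

def solve_mu(eta, n_plus, n1):
    """Find mu with sum_j delta_j(mu) = 1. The sum is increasing in mu and the
    root lies between min(eta) and max(eta), so bisection suffices."""
    lo, hi = min(eta), max(eta)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(delta_given_mu(mid, eta, n_plus, n1)) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical inputs: odds-ratio values and case/control counts per (x_j, g_j).
eta = [1.0, 1.8, 0.6, 2.5]
n1j = [5, 9, 2, 8]
n0j = [10, 6, 7, 3]
n_plus = [a + b for a, b in zip(n1j, n0j)]
n1 = sum(n1j)

mu_hat = solve_mu(eta, n_plus, n1)
delta = delta_given_mu(mu_hat, eta, n_plus, n1)
```

Setting $\partial l^*_n/\partial\mu = 0$ and rearranging gives exactly $\sum_j \delta_j = 1$ with $\delta_j$ as in (A.12), so the one-dimensional search over $\mu$ replaces the $J$-dimensional maximization over the jump sizes.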

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also, define
\[ \zeta_1(x,g;\theta) = \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(x,h_k,h_l)}\{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}, \]
\[ \zeta_0(g;\theta) = \sum_{(h_k,h_l)\in S(g)} \{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}. \]
By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:
\[ l^*_n(\theta,\{\mu_g\}) = \sum_{i=1}^n \{y_i \log\zeta_1(X_i,G_i;\theta) + (1-y_i)\log\zeta_0(G_i;\theta)\} - \sum_{i=1}^n \sum_g I(G_i = g)\log\Big\{\zeta_1(X_i,G_i;\theta) + n_1^{-1}n_g\sum_{g'}\mu_{g'} - \mu_g\Big\} + \sum_{i=1}^n (1-y_i)\log\Big(\sum_g \mu_g\Big), \]
where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:
\[ l^*_n(\theta,\mu) = \sum_{i=1}^n y_i\log\zeta_1(X_i,G_i;\theta) + \sum_{i=1}^n (1-y_i)\log\zeta_0(G_i;\theta) + \sum_{i=1}^n (1-y_i)\log\mu - \sum_{i=1}^n \log\Big\{(1-r)\mu + r\sum_g \zeta_1(X_i,g;\theta)\Big\}, \]
where $r = n_1/n$. Let $H = BQ_1 + (1-B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k,h_k): k = 1,\ldots,K\}$, and $Q_2$ takes values in $\{(h_k,h_l): k,l = 1,\ldots,K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B,Q_1,Q_2,Y)$ given $X$ is characterized by
\[ P(B,Q_1,Q_2,Y \mid X) = \frac{\exp\{\vartheta^T W(B,Q_1,Q_2,Y,X)\}}{\sum_{B,Q_1,Q_2,Y}\exp\{\vartheta^T W(B,Q_1,Q_2,Y,X)\}}, \]
where
\[ \vartheta = \Big(-\log\mu + \log\frac{r}{1-r},\ \beta^T,\ \log\pi_1 - \log\frac{\rho}{1-\rho},\ \ldots,\ \log\pi_K - \log\frac{\rho}{1-\rho}\Big)^T \]
and
\[ W(B,Q_1,Q_2,Y,X) = \Big(Y,\ YZ^T(X,H),\ BI(Q_1 = (h_1,h_1)) + (1-B)\sum_l\{I(Q_2 = (h_1,h_l)) + I(Q_2 = (h_l,h_1))\},\ \ldots,\ BI(Q_1 = (h_K,h_K)) + (1-B)\sum_l\{I(Q_2 = (h_K,h_l)) + I(Q_2 = (h_l,h_K))\}\Big)^T. \]
We can show that $l^*_n(\theta,\mu)$ is equivalent to the log-likelihood
\[ l^*_n(\vartheta) = \sum_{i=1}^n \log\Big[\frac{\sum_{BQ_1+(1-B)Q_2\in S(G_i)} e^{\vartheta^T W(B,Q_1,Q_2,Y_i,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Big]. \]
We maximize $l^*_n(\vartheta)$ through the EM algorithm, in which $(B,Q_1,Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l^*_n(\vartheta)$.
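The haplotype part of this exponential-family parameterization can be checked with a few lines of Python. The sketch below verifies that, with $\vartheta_k = \log\pi_k - \log\{\rho/(1-\rho)\}$, the weight $\exp(\sum_k \vartheta_k W_k)$ is proportional to $\rho\pi_k$ when $B = 1$ and to $(1-\rho)\pi_k\pi_l$ when $B = 0$. It ignores the $Y$ and $YZ$ components of $W$ and indexes configurations simply by $(b, k, l)$; both simplifications and all function names are assumptions made for illustration.

```python
import math

def theta_hap(pi, rho):
    """Haplotype components of the natural parameter: log pi_k - log{rho/(1-rho)}."""
    shift = math.log(rho / (1 - rho))
    return [math.log(p) - shift for p in pi]

def exp_family_weight(b, k, l, theta):
    """exp(sum_k theta_k * W_k): W counts h_k once when B = 1 (Q1 = (h_k, h_k))
    and counts each position of Q2 = (h_k, h_l) when B = 0."""
    if b == 1:
        return math.exp(theta[k])
    return math.exp(theta[k] + theta[l])

def model3_weight(b, k, l, pi, rho):
    """Target weights under model (3)."""
    return rho * pi[k] if b == 1 else (1 - rho) * pi[k] * pi[l]

pi, rho = [0.5, 0.3, 0.2], 0.2
theta = theta_hap(pi, rho)
ratios = [exp_family_weight(1, k, k, theta) / model3_weight(1, k, k, pi, rho)
          for k in range(3)]
ratios += [exp_family_weight(0, k, l, theta) / model3_weight(0, k, l, pi, rho)
           for k in range(3) for l in range(3)]
```

Every ratio equals the same constant, $(1-\rho)/\rho^2$, which is absorbed by the normalizing sum in the denominator of the conditional distribution.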

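The complete-data score function is
\[ \sum_{i=1}^n \Big[W(B_i,Q_{1i},Q_{2i},Y_i,X_i) - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}\Big]. \]
Thus, in the E-step, we calculate the conditional expectation of $W(B_i,Q_{1i},Q_{2i},Y_i,X_i)$ given $(Y_i,X_i,G_i)$ and the current parameter estimates:
\[ E[W(B_i,Q_{1i},Q_{2i},Y_i,X_i) \mid Y_i,X_i,G_i] = \frac{\sum_{b,q_1,q_2} I(bq_1+(1-b)q_2\in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}\,W(b,q_1,q_2,Y_i,X_i)}{\sum_{b,q_1,q_2} I(bq_1+(1-b)q_2\in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}}. \]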

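In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:
\[ \vartheta^{(k+1)} = \vartheta^{(k)} - \Psi^{-1} \times \sum_{i=1}^n \Big[E[W(B,Q_1,Q_2,Y_i,X_i) \mid Y_i,X_i,G_i] - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}\Big], \]
where
\[ \Psi = -\Big[\sum_{i=1}^n \frac{\sum_{b,q_1,q_2,y} W^{\otimes 2}(b,q_1,q_2,y,X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Big] + \sum_{i=1}^n \Big[\frac{\{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^{\otimes 2}}{\{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^2}\Big] \]
and $a^{\otimes 2} = aa^T$.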

A.4.6 Proof of Theorem 2. Write $F_{xg}(x,g) = F^\dagger_g(x)q_g$ and $\hat F_{xg}(x,g) = \hat F^\dagger_g(x)\hat q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x,g) \to F^*_{xg}(x,g) \equiv F^*_g(x)q^*_g$, where $q^*_g > 0$ for any $g$.

Because $\hat F^\dagger_g$ maximizes the likelihood, there exists some Lagrange multiplier $\hat\lambda_g$ such that
\[ \frac{I(G_i = g)}{\hat F^\dagger_g\{X_i\}} - \frac{n_1\eta(1,X_i,g;\hat\theta)\hat q_g}{\int_{x,g}\eta(1,x,g;\hat\theta)\,d\hat F_{xg}(x,g)} - n\hat\lambda_g = 0, \]
where $\hat F^\dagger_g\{X_i\}$ denotes the point mass of $\hat F^\dagger_g$ at $X_i$ and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^n \hat F^\dagger_g\{X_i\} = 1$, $\hat\lambda_g$ satisfies the equation
\[ n^{-1}\sum_{i=1}^n I(G_i = g)\Big\{\hat\lambda_g + \frac{n_1\eta(1,X_i,g;\hat\theta)\hat q_g}{n\int_{x,g}\eta(1,x,g;\hat\theta)\,d\hat F_{xg}(x,g)}\Big\}^{-1} = 1 \tag{A.13} \]
and
\[ \min_{1\le i\le n}\Big\{\hat\lambda_g + \frac{n_1\eta(1,X_i,g;\hat\theta)\hat q_g}{n\int_{x,g}\eta(1,x,g;\hat\theta)\,d\hat F_{xg}(x,g)}\Big\} > 0. \]
Clearly, $\hat\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\hat\lambda_g \to \lambda^*_g$.

By (A.13) and the Lipschitz continuity of $\eta(1,x,g;\theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that
\[ \min_{g,x}\Big|\lambda^*_g + \frac{\varrho\,\eta(1,x,g;\theta^*)q^*_g}{\int_{x,g}\eta(1,x,g;\theta^*)\,dF^*_{xg}(x,g)}\Big| > \delta, \]
where $\varrho$ denotes the limit of $n_1/n$. Consequently, when $n$ is sufficiently large,
\[ \hat F^\dagger_g(x) = n^{-1}\sum_{i=1}^n I(G_i = g, X_i \le x) \times \Big(\max\Big[\Big|\hat\lambda_g + \eta(1,X_i,g;\hat\theta)\hat q_g n_1 \times \Big\{n\int_{x,g}\eta(1,x,g;\hat\theta)\,d\hat F_{xg}(x,g)\Big\}^{-1}\Big|,\ \delta\Big]\Big)^{-1}. \]
We define an empirical function $\tilde F^\dagger_g$ whose jump size at $X_i$ is proportional to
\[ n^{-1}I(G_i = g)\Big(P(G = g, Y = 0) + \varrho\,\eta(1,X_i,g;\theta_0)q_g \times \Big\{\int_{x,g}\eta(1,x,g;\theta_0)\,dF_{xg}(x,g)\Big\}^{-1}\Big)^{-1}. \]
Then it can be verified that $\tilde F^\dagger_g$ converges uniformly to $F^\dagger_g$. In addition, $\hat F^\dagger_g$ is absolutely continuous with respect to $\tilde F^\dagger_g$, and the Radon–Nikodym derivative $d\hat F^\dagger_g(x)/d\tilde F^\dagger_g(x)$ is bounded and converges uniformly to $dF^*_g(x)/dF^\dagger_g(x)$. Let $\tilde F_{xg}(x,g) = \tilde F^\dagger_g(x)q_g$, and let $l_n(\theta,\{F^\dagger_g\},\{q_g\})$ be the log-likelihood based on (8). By the definition of the MLE, $n^{-1}l_n(\hat\theta,\{\hat F^\dagger_g\},\{\hat q_g\}) - n^{-1}l_n(\theta_0,\{\tilde F^\dagger_g\},\{q_g\}) \ge 0$. The limit of this difference is the negative Kullback–Leibler information of the distribution for $(\theta^*,\{F^*_g\},\{q^*_g\})$ with respect to $(\theta_0,\{F^\dagger_g\},\{q_g\})$ under $P(Y = 1) = \varrho$. The identifiability conditions then yield $\theta^* = \theta_0$, $F^*_g = F^\dagger_g$, and $q^*_g = q_g$. Thus the consistency of $\hat\theta$ is established. Because $F_{xg}$ is continuous, $\sup_{x,g}|\hat F_{xg}(x,g) - F_{xg}(x,g)| \to 0$ almost surely.

The derivation of the asymptotic distribution is similar to the proof of theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta,\{F^\dagger_g\},\{q_g\})$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon\int\psi(x,g)\,dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot,g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields
\[ n^{1/2}\Big\{(v^T\Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0) + \int (v^T\Omega_{12} + \Omega_{22}[\psi])\,d(\hat F_{xg} - F_{xg})\Big\} \]
\[ = n^{-1/2}\sum_{i=1}^n y_i\Big\{v^T l_\theta(1,X_i,G_i;\theta_0,F_{xg}) + l_F(1,X_i,G_i;\theta_0,F_{xg})\Big[\int\psi\,dF_{xg}\Big]\Big\} + n^{-1/2}\sum_{i=1}^n (1-y_i)\Big\{v^T l_\theta(0,X_i,G_i;\theta_0,F_{xg}) + l_F(0,X_i,G_i;\theta_0,F_{xg})\Big[\int\psi\,dF_{xg}\Big]\Big\} + o_p(1), \]
where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of $x$, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of $\psi$, and $l_\theta$ and $l_F$ are the scores with respect to $\theta$ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through $\varrho$, the limit of $n_1/n$. We can show that the operator $B[v,\psi] \equiv (v^T\Omega_{11} + \Omega_{21}[\psi]^T,\ v^T\Omega_{12} + \Omega_{22}[\psi])^T$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via $\varrho$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean $\varrho$. By choosing some $\psi$ such that $B[v,\psi] = (v^T, 0)^T$ for all $v$, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.

A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y,x,g;\theta,\{F_g\},a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,
\[ \frac{dP_n}{d\tilde P_n} = \exp\Big\{a_n\sum_{i=1}^n \frac{\partial\log f(y_i,X_i,G_i;\theta,\{F_g\},a)}{\partial a}\Big|_{a=0} + o(1)\Big\} \to_{\tilde P_n} 1. \]
Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

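A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters, $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$, yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that
\[ \sum_{H\in S(G)} \big\{\tilde\Lambda'(Y)e^{\tilde\beta^T Z(X,H)}Q'\big(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)}\big)\big\}^{\Delta} \times \big\{1 - Q\big(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)}\big)\big\}^{1-\Delta}P_\gamma(H) \]
\[ = \sum_{H\in S(G)} \big\{\Lambda'(Y)e^{\beta^T Z(X,H)}Q'\big(\Lambda(Y)e^{\beta^T Z(X,H)}\big)\big\}^{\Delta} \times \big\{1 - Q\big(\Lambda(Y)e^{\beta^T Z(X,H)}\big)\big\}^{1-\Delta}P_\gamma(H), \]
where primes denote derivatives. By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain
\[ \sum_{H\in S(G)} Q\big(\tilde\Lambda(\tau)e^{\tilde\beta^T Z(X,H)}\big)P_\gamma(H) = \sum_{H\in S(G)} Q\big(\Lambda(\tau)e^{\beta^T Z(X,H)}\big)P_\gamma(H). \]
Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)} = \Lambda(Y)e^{\beta^T Z(X,H)}$ for $H = (h,h)$ and $H = (h,\tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.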

A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that
\[ \mu^T l_\theta(\theta_0,\Lambda_0) + l_\Lambda(\theta_0,\Lambda_0)\Big[\int\psi\,d\Lambda_0\Big] = 0, \tag{A.14} \]
where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int\psi\,d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon\int\psi\,d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). Throughout, primes denote derivatives of $Q$, and we abbreviate $u_\tau(H) = \Lambda_0(\tau)e^{\beta_0^T Z(X,H)}$. We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain
\[ \sum_{H\in S(G)} Q(u_\tau(H))P_\gamma(H) \times \Big[\frac{Q'(u_\tau(H))\,u_\tau(H)\,\mu_\beta^T Z(X,H)}{Q(u_\tau(H))} + \frac{Q'(u_\tau(H))\int_0^\tau\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q(u_\tau(H))} + \mu_\gamma^T\nabla_\gamma\log P_\gamma(H)\Big] = 0. \tag{A.15} \]
In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have
\[ \sum_{H\in S(G)} \{1 - Q(u_\tau(H))\}P_\gamma(H) \times \Big[-\frac{Q'(u_\tau(H))\,u_\tau(H)\,\mu_\beta^T Z(X,H)}{1 - Q(u_\tau(H))} - \frac{Q'(u_\tau(H))\int_0^\tau\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{1 - Q(u_\tau(H))} + \mu_\gamma^T\nabla_\gamma\log P_\gamma(H)\Big] = 0. \tag{A.16} \]
The summation of (A.15) and (A.16) entails $\mu_\gamma^T\nabla_\gamma\log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X,H) + \psi(0) = 0$ for $H = (h,h)$ and $(h,\tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that
\[ \psi(Y) + \frac{Q''\big(\Lambda_0(Y)e^{\beta_0^T Z(X,H)}\big)\int_0^Y\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q'\big(\Lambda_0(Y)e^{\beta_0^T Z(X,H)}\big)} = 0 \]
for $H = (h,h)$. Therefore, $\psi = 0$.

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), "Prediction and Entropy," in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), "Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?" European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), "Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease," Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), "Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling," The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), "Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations," Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), "Regression Models and Life-Tables" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), "Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data," American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), "Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population," Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), "Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer's Disease," Genome Research, 11, 143–151.

Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), "Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constitute Loci," Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), "Initial Sequencing and Analysis of the Human Genome," Nature, 409, 860–921.

International SNP Map Working Group (2001), "A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms," Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), "Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous," Human Heredity, 55, 56–65.

Li, H. (2001), "A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants," Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), "Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach," Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), "Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals," Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), "An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies," Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), "On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles," Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), "On Profile Likelihood," Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), "Semiparametric Mixtures in Case-Control Studies," Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), "Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), "Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21," Science, 294, 1719–1723.

Pettitt, A. N. (1984), "Proportional Odds Models for Survival Data and Estimates Using Ranks," Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), "Logistic Disease Incidence Models and Case-Control Studies," Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), "Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), "Searching for Genetic Determinants in the New Millennium," Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), "A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables," Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), "Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies," Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), "Evaluating Associations of Haplotypes With Traits," Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), "Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous," American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), "Fitting Regression Models to Case-Control Data by Maximum Likelihood," Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), "Evolutionary-Based Association Analysis Using Haplotype Data," Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), "A New Statistical Method for Haplotype Reconstruction From Population Data," American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), "Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals," Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), "Mapping Genes for NIDDM," Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), "Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling," Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), "The Sequence of the Human Genome," Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), "On the Use of DNA Pooling to Estimate Haplotype Frequencies," Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), "Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals," Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), "Estimating Haplotype-Disease Associations With Pooled Genotype Data," Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), "Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data," technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), "Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data," American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), "A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies," American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and eventually to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and to understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information when available, or assuming linkage disequilibrium (LD) and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
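To make the role of EM under phase ambiguity concrete, the following is a minimal sketch of an EM iteration for haplotype frequencies from unphased genotypes. It is an illustration in the spirit of the EM approach cited above, not the algorithm of Lin and Zeng: it assumes Hardy–Weinberg equilibrium, biallelic SNPs coded as minor-allele counts (0, 1, 2), and no phenotype model. The function names `compatible_pairs` and `em_haplotype_freqs` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) consistent with an unphased genotype.

    genotype: tuple of minor-allele counts (0, 1, or 2), one per SNP.
    Each haplotype is a tuple of 0/1 alleles.
    """
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])
        elif g == 2:
            per_site.append([(1, 1)])
        else:  # heterozygous site: either chromosome may carry the minor allele
            per_site.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_site):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies, assuming HWE (illustrative only)."""
    n_snps = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_snps))
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each phase given current frequencies
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq
```

The number of compatible pairs doubles with each heterozygous site, which is one way to see why such an iteration becomes burdensome with many markers and remains practical for short haplotypes.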

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rate in Linkage and Association Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariates and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates X we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model (3) applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x; \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate value is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that X and H are independent at the population level, not just conditional on G.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition, and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\{\mathrm{pr}(D = 1 \mid H^d, X)\} = \beta_0 + m(H^d, X; \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).
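The link between the two likelihoods can be written out with Bayes’ rule. The display below is a standard identity expressed in the notation of this comment, not a formula quoted from Lin and Zeng; $\mathcal{H}_G$ is used here for the set of diplotypes consistent with genotype $G$, and $\mathrm{pr}(G, X)$ denotes the joint distribution of genotype and covariates in the source population.

$$\mathrm{pr}(G, X \mid D) = \frac{\mathrm{pr}(D \mid G, X)\,\mathrm{pr}(G, X)}{\mathrm{pr}(D)}, \qquad \mathrm{pr}(D \mid G, X) = \sum_{h \in \mathcal{H}_G} \mathrm{pr}(D \mid H^d = h, X)\,\mathrm{pr}(H^d = h \mid G, X).$$

The first identity shows that the retrospective likelihood is the prospective likelihood multiplied by a factor that depends on the distribution of $(G, X)$. Assumptions such as HWE or gene–environment independence restrict that factor, which is where the retrospective analysis can gain information. The second identity shows that, with phase ambiguity, even the prospective likelihood involves the diplotype distribution through $\mathrm{pr}(H^d = h \mid G, X)$.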

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospectivemaximum likelihood method for haplotype-based associationanalysis of case-control studies Incorporation of nongeneticcovariates X in this method is complicated by the fact that theretrospective likelihood involves potentially high-dimensionalnuisance parameters that specify the distribution of X in the un-derlying population In the genendashenvironment interaction con-text and as in the simulation study and example of Lin andZeng it is often reasonable to assume that Hd and environmen-tal factors X are independent in the population with a paramet-ric form

pr(Hd = hd|X) = pr(H = hd) = q(hd θ) (2)

where the model q(hd θ) in turn could be specified accordingto HWE or some of its extensions as considered by Lin andZeng More generally one can assume a parametric model forthe diplotype distribution of the form

\[ \mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x; \theta). \tag{3} \]

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure X.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of X and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be n1 cases and n0 controls, and let π = pr(D = 1) be the marginal probability of the disease in the population. Assume the definitions

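\[ \kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\} \]

and

\[ S(d, h, x) = \frac{q(h, x; \theta)\exp[d\{\kappa + m(h, x; \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x; \beta_1)\}}, \]

where Ω = (β0, κ, θ^T, β1^T)^T. Let H_G be the set of diplotypes consistent with the observed genotype G. Define

\[ L^*(D, G, X) = \frac{\sum_{h \in H_G} S(D, h, X)}{\sum_{h}\sum_{d} S(d, h, X)}. \]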

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in Ω, including the intercept parameter β0, are identifiable from the retrospective likelihood ∏_i pr(G_i, X_i|D_i), as long as the underlying models are specified in such a way that Ω would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of Ω can be obtained as a solution of the score equation corresponding to the pseudolikelihood

\[ l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4} \]
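To make the definitions of S, L*, and l* concrete, the following is a minimal Python sketch rather than the implementation of Spinka et al. It assumes a toy setting that is not in the original: two SNPs, HWE with no dependence of q on x (model (2)), and a hypothetical linear predictor m that counts copies of the haplotype (1, 1) and adds a single covariate. All parameter values and observations are made up for illustration.

```python
import itertools
import math

# Hypothetical toy setting: two SNPs, hence four possible haplotypes.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Diplotypes are unordered pairs of haplotypes.
DIPLOTYPES = list(itertools.combinations_with_replacement(HAPLOTYPES, 2))


def genotype(dip):
    """Unphased genotype G: the allele count at each SNP."""
    return tuple(a + b for a, b in zip(*dip))


def q_hwe(dip, theta):
    """q(h; theta) under HWE, with no dependence on x (model (2))."""
    h1, h2 = dip
    prob = theta[h1] * theta[h2]
    return prob if h1 == h2 else 2.0 * prob


def m(dip, x, beta1):
    """Illustrative linear predictor: copies of haplotype (1, 1) plus a covariate."""
    b_hap, b_x = beta1
    return b_hap * sum(h == (1, 1) for h in dip) + b_x * x


def S(d, dip, x, beta0, kappa, theta, beta1):
    """S(d, h, x) as defined in the text."""
    lin = m(dip, x, beta1)
    return q_hwe(dip, theta) * math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def L_star(D, G, X, beta0, kappa, theta, beta1):
    """L*(D, G, X): S summed over diplotypes consistent with G, normalized over all (h, d)."""
    args = (beta0, kappa, theta, beta1)
    num = sum(S(D, dip, X, *args) for dip in DIPLOTYPES if genotype(dip) == G)
    den = sum(S(d, dip, X, *args) for dip in DIPLOTYPES for d in (0, 1))
    return num / den


def log_pseudolikelihood(data, beta0, kappa, theta, beta1):
    """l* in (4) for data given as (D, G, X) triples."""
    return sum(math.log(L_star(D, G, X, beta0, kappa, theta, beta1)) for D, G, X in data)


# Made-up parameter values and observations, for illustration only.
theta = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}
beta0, beta1 = -3.0, (0.5, 0.2)
n1, n0, pi = 500, 500, 0.05
kappa = beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))
data = [(1, (1, 1), 0.3), (0, (0, 0), -1.2), (1, (2, 2), 0.8)]
print(log_pseudolikelihood(data, beta0, kappa, theta, beta1))
```

In this sketch the double heterozygote G = (1, 1) is the only phase-ambiguous genotype, so it is the only one for which the numerator of L* sums over more than one diplotype. In practice l* would be maximized numerically over the parameters; the sketch only evaluates it.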

Spinka et al. described strategies for estimating the regression parameter β1 based on l* for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all H^d and X, then l* effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood l*.

An alternative representation of l* is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status D = d is proportional to μ_d = N_d/pr(D = d). Let R = 1 denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood l* can be expressed in the form

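\[ l^* = \sum_{i=1}^{N} \log \sum_{h^d \in H_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1) \times \mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) = \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1). \tag{5} \]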
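The algebra behind (5) can be sketched as follows. This is a reconstruction under two assumptions that are not stated explicitly in the text: that N_d in μ_d coincides with n_d in the definition of κ, and that the disease model is logistic, pr(D = 1|h, x) = exp{β0 + m(h, x; β1)}/[1 + exp{β0 + m(h, x; β1)}], as the form of S suggests. Here m abbreviates m(h, x; β1).

```latex
% Step 1: the disease odds among sampled subjects shift the intercept from beta_0 to kappa.
\frac{\mathrm{pr}(D = 1 \mid h, x, R = 1)}{\mathrm{pr}(D = 0 \mid h, x, R = 1)}
  = \frac{\mu_1}{\mu_0}\, e^{\beta_0 + m}
  = \frac{N_1}{N_0} \cdot \frac{1 - \pi}{\pi}\, e^{\beta_0 + m}
  = e^{\kappa + m}.

% Step 2: the diplotype distribution among sampled subjects is reweighted.
\mathrm{pr}(H^d = h \mid X = x, R = 1)
  \propto q(h, x; \theta) \sum_{d} \mu_d\, \mathrm{pr}(D = d \mid h, x)
  \propto q(h, x; \theta)\, \frac{1 + e^{\kappa + m}}{1 + e^{\beta_0 + m}}
  = \sum_{d} S(d, h, x).

% Step 3: multiplying the two pieces recovers S(d, h, x) up to normalization.
\mathrm{pr}(D = d \mid h, x, R = 1)\, \mathrm{pr}(H^d = h \mid X = x, R = 1)
  = \frac{e^{d(\kappa + m)}}{1 + e^{\kappa + m}}
    \cdot \frac{\sum_{d'} S(d', h, x)}{\sum_{h'} \sum_{d'} S(d', h', x)}
  = \frac{S(d, h, x)}{\sum_{h'} \sum_{d'} S(d', h', x)}.
```

Summing the last expression over h in H_G gives L*(D, G, X), so each term of l* is log pr(D_i, G_i|X_i, R_i = 1). Because everything is conditional on X, the marginal distribution of X never enters.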

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form ∏_i pr(D_i, G_i|R_i = 1). The representation of l* given in (5) suggests that under model (2) or (3), with F(x) treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on X in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes H^d and X are independent given G and then allows the distribution of [X|G] to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes H^d and X are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates X. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of X for each different genotype G, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data (D, G, X), ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter θ, the score equations for the regression parameters β* = (κ, β1) under model (2) corresponding to the prospective likelihood of the data are given by

\[ 0 = \sum_{i=1}^{N} \sum_{h^d \in H_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \times \Biggl\{ \sum_{h^d \in H_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \Biggr\}^{-1}. \tag{6} \]

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

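\[ 0 = \sum_{i=1}^{N} \sum_{h^d \in H_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \times \Biggl\{ \sum_{h^d \in H_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \Biggr\}^{-1}, \tag{7} \]

where

\[ r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x; \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x; \beta_1)\}}. \]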

Spinka et al. showed that with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating β1 and κ based on the modified prospective estimating equation (7), where the nuisance parameters θ and possibly β0 are estimated based on the score equation derived from the pseudolikelihood l*. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).

2. COHORT-BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status (H^d) and environmental covariates (X) as

\[ \lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X; \beta_1), \tag{8} \]

where λ0(t) is the unspecified baseline hazard function, R(H^d, X; β1) is a parametric function describing the relative risk associated with the exposure (H^d, X), and β1 is the vector of associated regression parameters of interest. As before, assume that pr(H^d) = q(H^d; θ) is specified according to HWE and that θ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because H^d is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data G and covariates X in the form

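\[ \lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9} \]

where

\[ R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\} = E\{R(H^d, X; \beta_1) \mid G, X, T \ge t\} = \frac{\sum_{H^d \in H_G} R(H^d, X; \beta_1)\, \mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}{\sum_{H^d \in H_G} \mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}. \]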

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function R*{G, X, t; β1, θ, Λ0(·)} itself depends on the baseline hazard function λ0(t). However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters β1. In particular, the authors observed that for a rare disease, one could assume that pr[T > t|H^d, X] ≈ 1. The corresponding induced relative risk function

\[ R^*(G, X; \beta_1, \theta) = \frac{\sum_{H^d \in H_G} R(H^d, X; \beta_1)\, q(H^d; \theta)}{\sum_{H^d \in H_G} q(H^d; \theta)} \]

is free of the baseline hazard function λ0(t). Thus, under the rare disease approximation, one could estimate β1 by maximizing the partial likelihood associated with the relative risk function R*(G, X; β1, θ̂), where θ̂ is a consistent estimate of the haplotype-frequency parameters θ. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of θ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
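As an illustration of the induced relative risk under the rare disease approximation, the following Python sketch evaluates R*(G, X; β1, θ) for a phase-ambiguous genotype. It is a toy example under assumptions that are not in the original: two SNPs, HWE for q, made-up haplotype frequencies, and a hypothetical log-linear form for R in the number of copies of haplotype (1, 1) and a scalar covariate.

```python
import math

# Hypothetical haplotype frequencies theta for two SNPs.
theta = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}


def q_hwe(dip):
    """q(H; theta) under HWE for an unordered haplotype pair."""
    h1, h2 = dip
    prob = theta[h1] * theta[h2]
    return prob if h1 == h2 else 2.0 * prob


def rel_risk(dip, x, beta1):
    """Illustrative R(H, X; beta1): log-linear in copies of haplotype (1, 1) and in x."""
    b_hap, b_x = beta1
    return math.exp(b_hap * sum(h == (1, 1) for h in dip) + b_x * x)


def induced_rel_risk(compatible, x, beta1):
    """R*(G, X; beta1, theta): q-weighted average of R over diplotypes consistent with G."""
    num = sum(rel_risk(dip, x, beta1) * q_hwe(dip) for dip in compatible)
    den = sum(q_hwe(dip) for dip in compatible)
    return num / den


# The double heterozygote G = (1, 1) is consistent with two phase resolutions.
ambiguous = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
beta1 = (0.5, 0.2)
r_star = induced_rel_risk(ambiguous, x=1.0, beta1=beta1)
print(r_star)
```

The induced relative risk is a weighted average of the diplotype-specific relative risks, so it always lies between the smallest and largest R among the compatible diplotypes and reduces to R itself when the phase is unambiguous. A partial likelihood would then be built from these values with θ replaced by an estimate.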

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases, such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate β1 based on the induced relative risk function R*(G, X; β1, θ̂). Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng’s further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), “Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions,” Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), “Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects,” Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Prentice, R. L. (1982), “Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model,” Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), “Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity,” Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data, and potential violations of “Hardy–Weinberg equilibrium (HWE)” in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects, rather than estimating the size of such effects.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods, DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits “inbreeding” (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider “inbreeding.” This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual; that is, π_kk > π_k^2. This excess of matching haplotypes (called homozygotes) tends to follow the model π_kk = ρπ_k + (1 − ρ)π_k^2, where ρ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
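The inbreeding model can be made concrete with a short Python sketch. The homozygote formula is the one given above; the heterozygote formula, π_kl = (1 − ρ)π_k π_l for ordered pairs with k ≠ l, is the standard companion that makes the frequencies sum to 1 and is an assumption here, because the text states only the homozygote case. The haplotype frequencies and the value of ρ are illustrative.

```python
def diplotype_freqs(pi, rho):
    """Ordered-pair diplotype frequencies under an inbreeding coefficient rho.

    Homozygotes follow pi_kk = rho * pi_k + (1 - rho) * pi_k**2 (from the text);
    heterozygotes use (1 - rho) * pi_k * pi_l (assumed standard companion formula).
    """
    freqs = {}
    for k, pk in enumerate(pi):
        for l, pl in enumerate(pi):
            if k == l:
                freqs[(k, l)] = rho * pk + (1.0 - rho) * pk ** 2
            else:
                freqs[(k, l)] = (1.0 - rho) * pk * pl
    return freqs


pi = [0.2, 0.3, 0.5]   # illustrative haplotype frequencies
rho = 0.015            # illustrative inbreeding coefficient

inbred = diplotype_freqs(pi, rho)
hwe = diplotype_freqs(pi, 0.0)  # rho = 0 recovers HWE

for k, pk in enumerate(pi):
    excess = inbred[(k, k)] - hwe[(k, k)]
    print(f"haplotype {k}: HWE {hwe[(k, k)]:.4f}, inbred {inbred[(k, k)]:.4f}, excess {excess:.4f}")
```

The excess homozygosity for haplotype k equals ρπ_k(1 − π_k), so for small ρ the departure from HWE is small in absolute terms.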

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate ρ = .0002 in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be ρ = .015. Contrast this with the level of inbreeding (ρ = .05) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding (ρ = .03) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of “homozygotes” in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson’s paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
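The confounding mechanism described above can be illustrated numerically. In the following Python sketch, all numbers are made up: two equally sized subpopulations differ in both marker carrier frequency and disease prevalence, and within each subpopulation the marker and the disease are independent by construction. The sketch compares the within-stratum odds ratios with the odds ratio in the pooled population.

```python
def stratum_table(carrier_freq, prevalence):
    """Joint distribution of (carrier, diseased) when the two are independent."""
    return {
        (g, d): (carrier_freq if g else 1.0 - carrier_freq)
        * (prevalence if d else 1.0 - prevalence)
        for g in (0, 1)
        for d in (0, 1)
    }


def odds_ratio(table):
    """Marker-disease odds ratio from joint probabilities keyed by (carrier, diseased)."""
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])


# Hypothetical strata: (population weight, joint table).
strata = [
    (0.5, stratum_table(carrier_freq=0.8, prevalence=0.10)),
    (0.5, stratum_table(carrier_freq=0.2, prevalence=0.02)),
]

# Pool the strata while ignoring subpopulation membership.
pooled = {cell: sum(w * table[cell] for w, table in strata) for cell in strata[0][1]}

within = [odds_ratio(table) for _, table in strata]
print("within-stratum odds ratios:", within)
print("pooled odds ratio:", odds_ratio(pooled))
```

Both within-stratum odds ratios equal 1, yet the pooled odds ratio is well above 1, because the marker is more common in the subpopulation where the disease is more common. Stratifying by subpopulation removes the spurious association in this toy setting.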

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next, we consider “testing” versus “estimating” the effect of haplotypes. Traditionally, genetic epidemiologists have tested β = 0 rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists tracing back to Fisher have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the “needle in a haystack” nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of β, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in β. This is just one of the substantial biases that might be present in a genetic study.

LZ’s model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling’s T^2 (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling’s T^2 in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of β: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated n = 500 cases and controls using (12) from LZ with α = −4 and β2 = β3 = 0. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes (h1 = 11, h2 = 01, and h3 = 00) with probabilities (π1 = .2, π2 = .3, and π3 = .5). To estimate β1, we used LZ’s rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of ρ = .015. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between h1 and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario, both h1 and h2 are associated with the phenotype. For all three scenarios, only the effect of h1 was investigated.

Table 1. Bias and Mean Squared Error (MSE) of β1

β1    Scenario                          Bias     MSE
.5    Baseline                         −.002    .010
      [A] ρ = .015                      .023    .012
      [B] Ascertainment bias            .273    .083
      [C] Multiple causal haplotypes   −.217    .058
0     Baseline                         −.002    .013
      [A] ρ = .015                      .001    .012
      [B] Ascertainment bias            .002    .008
      [C] Multiple causal haplotypes    .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario “Baseline” refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of β1 is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of β1 is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret β1 in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), “Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus,” American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), “Use of Unphased Multilocus Genotype Data in Indirect Association Studies,” Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), “Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power,” Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), “No Excess of Homozygosity at Loci Used for DNA Fingerprinting,” Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), “Genomic Control for Association Studies,” Biometrics, 55, 997–1004.

Donbak, L. (2004), “Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact,” Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), “The Transmission/Disequilibrium Test: History, Subdivision, and Admixture,” American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), “Genome Association Studies of Complex Diseases by Case-Control Designs,” American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), “Genetic Dissection of Complex Traits,” Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), “Analysis of Single-Locus Tests to Detect Gene/Disease Associations,” Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), “TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes,” American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), “Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching,” Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ’s article is to provide a rigorous treatment to the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated; for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. One recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method, called FlexTree, for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as “metadata,” usually defined as “data about data.” For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that existing biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions, by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), “Mapping Complex Disease Loci in Whole-Genome Association Studies,” Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least Angle Regression,” The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), “Gradient Directed Regularization,” technical report, Stanford University.

Gui, J., and Li, H. (2005a), “Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics,” Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), “Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data,” Bioinformatics, Advance Access published on April 6, 2005; doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), “The Hallmarks of Cancer,” Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), “Tree-Structured Supervised Learning and the Genetics of Hypertension,” Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), “The KEGG Databases at GenomeNet,” Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), “Identifying Interacting SNPs Using Monte Carlo Logic Regression,” Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), “Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data,” Bioinformatics, Advance Access published on February 15, 2005; doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), “Logic Regression,” Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.


Rejoinder

D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

their work. We have also greatly benefited from personal communications with them.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype–environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With $K$ haplotypes, we can either compare one haplotype with all the others or compare each of the first $(K - 1)$ haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence), unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence, under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violations of HWE and of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative-risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative-risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative-risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating; no viability and/or fertility differential of alleles; no immigration or emigration; no mutation; and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimation of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single-marker analysis depends on a number of factors, including the nature of the SNP–disease association, the number and positions of disease-causing SNPs, the extent and strength of linkage disequilibrium, and the selection of markers. Haplotype analysis is likely to be more powerful than single-marker analysis if the causal


SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


The goal of the association studies is to relate the pair of haplotypes to disease phenotypes or traits. The simplest phenotype is the binary indicator for the disease status, which takes the value 1 if the individual is diseased and 0 otherwise. The diseased individuals may be further classified into several categories, corresponding to different types of disease or varying degrees of disease severity. If the age of onset is likely to be genetically mediated, then it is desirable to use the age of onset as the phenotype. One may also be interested in disease-related traits, such as blood pressure.

The data on the disease phenotype may be gathered in various ways. The simplest approach is to obtain a random sample from the target population and measure the phenotype of interest on every individual in the sample. Such studies are referred to as cross-sectional studies, which are feasible if the disease is relatively frequent or if one is interested only in some readily measured traits that are related to the disease. If one is interested in the age at the onset of a disease, however, then it is necessary to follow a cohort of individuals forward in time, in which case the phenotype (i.e., time to disease occurrence) may be censored. When the disease is relatively rare, it is more cost-effective to use the case-control design, which collects data retrospectively on a sample of diseased individuals and on a separate sample of disease-free individuals. It is often desirable to collect data on environmental variables or covariates so as to investigate gene–environment interactions.

Let $Y$ be the phenotype of interest, and let $X$ be the covariates. For cross-sectional and case-control studies, the association between $Y$ and $(X, H)$ is characterized by the conditional density of $Y = y$ given $H = (h_k, h_l)$ and $X = x$, denoted by $P_{\alpha,\beta,\xi}(y \mid x, (h_k, h_l))$, where $\alpha$ denotes the intercept(s), $\beta$ denotes the regression effects, and $\xi$ denotes the nuisance parameters (e.g., variance and overdispersion parameters). There is considerable flexibility in specifying the regression relationship. Suppose that $h^*$ is the target haplotype of interest and that there are no covariates. Then a linear predictor in the form of $\alpha + \beta I(h_k = h_l = h^*)$ pertains to a recessive model, $\alpha + \beta\{I(h_k = h^*) + I(h_l = h^*) - I(h_k = h_l = h^*)\}$ pertains to a dominant model, $\alpha + \beta\{I(h_k = h^*) + I(h_l = h^*)\}$ pertains to an additive model, and $\alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 I(h_k = h_l = h^*)$ pertains to a codominant model, where $I(\cdot)$ is the indicator function. Clearly, the codominant model contains the other three models as special cases. A codominant model with gene–environment interactions has the following linear predictor:

\[
\begin{aligned}
&\alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 I(h_k = h_l = h^*) \\
&\quad + \beta_3^T x + \beta_4^T\{I(h_k = h^*) + I(h_l = h^*)\}x + \beta_5^T I(h_k = h_l = h^*)x.
\end{aligned}
\tag{1}
\]

Additional terms may be included so as to examine the effects of several haplotype configurations or to investigate the joint effects of multiple candidate genes.
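The four covariate-free linear predictors described above can be sketched in a few lines of Python. This is only an illustration: the function name `linear_predictor`, the string labels for the models, and the representation of haplotypes as tuples of 0s and 1s are assumptions made here, not notation from the article.

```python
def linear_predictor(hk, hl, h_star, alpha, beta, model):
    """Linear predictor for the haplotype pair (hk, hl) with no covariates.

    hk, hl, h_star are haplotypes (any comparable objects, e.g. tuples of 0/1).
    For the codominant model, beta is a pair (beta1, beta2).
    """
    a = int(hk == h_star)      # I(h_k = h*)
    b = int(hl == h_star)      # I(h_l = h*)
    both = a * b               # I(h_k = h_l = h*)
    if model == "recessive":
        return alpha + beta * both
    if model == "dominant":
        return alpha + beta * (a + b - both)
    if model == "additive":
        return alpha + beta * (a + b)
    if model == "codominant":
        beta1, beta2 = beta
        return alpha + beta1 * (a + b) + beta2 * both
    raise ValueError("unknown model: " + str(model))


# Hypothetical target haplotype and pairs carrying 0, 1, or 2 copies of it.
h_star, other = (0, 1, 1), (1, 0, 0)
pairs = [(other, other), (h_star, other), (h_star, h_star)]
additive = [linear_predictor(hk, hl, h_star, 1.0, 2.0, "additive") for hk, hl in pairs]
```

Setting `beta2 = 0`, `beta1 = 0`, or `beta2 = -beta1` in the codominant form recovers the additive, recessive, and dominant predictors, respectively, which is the sense in which the codominant model contains the other three.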

Although we are interested in the effects of $H$ and $X$ on $Y$, we observe $G$ instead of $H$. As mentioned earlier, $G$ is the summation of the paired sequences in $H$. Thus we have a regression problem with missing data in which the primary explanatory variable pertains to two ordered sequences of numbers from $\{0, 1\}$, but only the summation of the two sequences is observed. We assume that $X$ is independent of $H$ conditional on $G$ and that $(1, X^T)$ is linearly independent with positive probability.

Write $\pi_{kl} = P\{H = (h_k, h_l)\}$ and $\pi_k = P(h = h_k)$, $k, l = 1, \ldots, K$. As we demonstrate in this article, it is sometimes possible to make inference about haplotype effects without imposing any structures on $\{\pi_{kl}\}$, although estimating $\{\pi_k\}$ and testing for no haplotype effects require some restrictions on $\{\pi_{kl}\}$. Under Hardy–Weinberg equilibrium,

\[
\pi_{kl} = \pi_k \pi_l, \qquad k, l = 1, \ldots, K. \tag{2}
\]

We consider two specific forms of departure from Hardy–Weinberg equilibrium:

\[
\pi_{kl} = (1 - \rho)\pi_k \pi_l + \delta_{kl}\rho\pi_k \tag{3}
\]

and

\[
\pi_{kl} = \frac{(1 - \rho + \delta_{kl}\rho)\pi_k \pi_l}{1 - \rho + \rho\sum_{j=1}^{K} \pi_j^2}, \tag{4}
\]

where $0 \le \pi_k \le 1$, $\sum_{k=1}^{K} \pi_k = 1$, $\delta_{kk} = 1$, and $\delta_{kl} = 0$ $(k \ne l)$. In (3), $\rho$ is called the inbreeding coefficient or fixation index (Weir 1996, p. 93) and corresponds to Cohen's (1960) kappa measure of agreement. Equation (4) creates disequilibrium by giving different fitness values to the homozygous and heterozygous pairs (Niu et al. 2002). The denominator is a normalizing constant. Both (3) and (4) reduce to (2) if $\rho = 0$. Excess homozygosity (i.e., $\pi_{kk} > \pi_k^2$, $k = 1, \ldots, K$) arises when $\rho > 0$, and excess heterozygosity (i.e., $\pi_{kk} < \pi_k^2$, $k = 1, \ldots, K$) arises when $\rho < 0$. Recently, Satten and Epstein (2004) considered (3) for the control population under the case-control design. We abuse the notation slightly in that the $\pi_k$ in (4) do not pertain to the marginal distribution of $H$ unless $\rho = 0$.
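As a quick numerical check of the two disequilibrium models, the sketch below builds the matrix of pair probabilities under (3) and under (4) for an illustrative frequency vector. The function names and the example values of the frequencies and of $\rho$ are assumptions chosen for illustration only.

```python
def pair_probs_model3(pi, rho):
    """pi_kl = (1 - rho) * pi_k * pi_l + delta_kl * rho * pi_k, as in (3)."""
    K = len(pi)
    return [[(1 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
             for l in range(K)] for k in range(K)]


def pair_probs_model4(pi, rho):
    """pi_kl = (1 - rho + delta_kl * rho) * pi_k * pi_l / normalizer, as in (4)."""
    K = len(pi)
    norm = 1 - rho + rho * sum(p * p for p in pi)
    return [[(1 - rho + (rho if k == l else 0.0)) * pi[k] * pi[l] / norm
             for l in range(K)] for k in range(K)]


pi = [0.5, 0.3, 0.2]      # illustrative haplotype frequencies
rho = 0.1                 # illustrative positive disequilibrium parameter
P3 = pair_probs_model3(pi, rho)
P4 = pair_probs_model4(pi, rho)

total3 = sum(map(sum, P3))            # both matrices should sum to 1
total4 = sum(map(sum, P4))
marginals3 = [sum(row) for row in P3]  # equal to pi under (3)
marginals4 = [sum(row) for row in P4]  # generally not equal to pi under (4)
```

Under (3) the row sums are $(1 - \rho)\pi_k + \rho\pi_k = \pi_k$, so the $\pi_k$ remain the marginal haplotype frequencies; under (4) the row sums differ from $\pi_k$ unless $\rho = 0$, which is the abuse of notation noted in the text.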

Let $\tilde h$ denote a haplotype that differs from $h$ at only one SNP. Write $\nabla_u f(u, v) = \partial f(u, v)/\partial u$. The following lemma states that under (3) or (4), $\{\pi_k\}$ and $\rho$ are identifiable from the data on $G$, and the data on $G$ provide positive information about these parameters.

Lemma 1. Assume that either (3) or (4) holds. The parameters $\{\pi_k\}$ and $\rho$ are uniquely determined by the distribution of $G$. For nondegenerate distribution $\{\pi_k\}$, if there exist a constant $\mu$ and a vector $\nu = (\nu_1, \ldots, \nu_K)^T$ such that $\sum_{k=1}^{K} \nu_k = 0$ and $\mu\nabla_\rho \log P(G = g) + \sum_{k=1}^{K} \nu_k \nabla_{\pi_k} \log P(G = g) = 0$ for $g = 2h$, then $\mu = 0$ and $\nu = 0$.

In the sequel, $\mathcal{G}$ denotes the set of all possible genotypes, and $S(G)$ denotes the set of haplotype pairs that are consistent with genotype $G$. We suppose that $\pi_k > 0$ for all $k = 1, \ldots, K$, where $K$ is now interpreted as the total number of haplotypes that exist in the population. For any parameter $\theta$, we use $\theta_0$ to denote its true value if the distinction is necessary. We assume that the true value of any Euclidean parameter $\theta$ belongs to the interior of a known compact set within the domain of $\theta$. Proofs of Lemma 1 and all of the theorems are provided in the Appendix.
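To make the set $S(G)$ concrete, the following sketch enumerates the ordered haplotype pairs whose elementwise sum equals a given unphased genotype. It assumes, purely for illustration, two SNPs with all four 0/1 haplotypes present in the population; the helper names `genotype` and `compatible_pairs` are hypothetical.

```python
from itertools import product


def genotype(hk, hl):
    """Unphased genotype: elementwise sum of the two haplotypes."""
    return tuple(a + b for a, b in zip(hk, hl))


def compatible_pairs(g, haplotypes):
    """Ordered index pairs (k, l) such that h_k + h_l equals the genotype g."""
    g = tuple(g)
    return [(k, l)
            for k, hk in enumerate(haplotypes)
            for l, hl in enumerate(haplotypes)
            if genotype(hk, hl) == g]


# All haplotypes over two SNPs: (0,0), (0,1), (1,0), (1,1).
haplotypes = list(product((0, 1), repeat=2))

ambiguous = compatible_pairs((1, 1), haplotypes)   # heterozygous at both SNPs
homozygous = compatible_pairs((2, 0), haplotypes)  # phase is known with certainty
one_het = compatible_pairs((1, 0), haplotypes)     # differs at one SNP only
```

A genotype that is heterozygous at two SNPs is consistent with two distinct unordered pairs, so its phase is ambiguous, whereas genotypes of the form $2h$ or $h + \tilde h$ determine the unordered pair uniquely.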


2.2 Cross-Sectional Studies

There is a random sample of $n$ individuals from the underlying population. The observable data consist of $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$. The trait $Y$ can be discrete or continuous, univariate or multivariate. As stated in Section 2.1, the conditional density of $Y$ given $X$ and $H$ is given by $P_{\alpha,\beta,\xi}(Y \mid X, H)$. For a univariate trait, this regression model may take the form of a generalized linear model (McCullagh and Nelder 1989) with the linear predictor given in (1). If the trait is measured repeatedly in a longitudinal study, then generalized linear mixed models (Diggle, Heagerty, Liang, and Zeger 2002, chap. 9) may be used. The following conditions are required for estimating $(\alpha, \beta, \xi)$.

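Condition 1. If $P_{\alpha,\beta,\xi}(Y \mid X, H) = P_{\tilde\alpha,\tilde\beta,\tilde\xi}(Y \mid X, H)$ for any $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, $k = 1, \ldots, K$, then $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\xi = \tilde\xi$.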

Condition 2. If there exists a constant vector $\nu$ such that $\nu^T \nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y \mid X, H) = 0$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, then $\nu = 0$.

Remark 1. Condition 1 ensures that the parameters of interest are identifiable from the genotype data. The linear independence of the score function stated in Condition 2 ensures nonsingularity of the information matrix. The reason for considering $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$ is that these haplotype pairs can be inferred with certainty because of the unique decompositions of the corresponding genotypes $g = 2h_k$ and $g = h_k + \tilde h_k$. All of the commonly used regression models, particularly generalized linear (mixed) models with linear predictors in the form of (1), satisfy Conditions 1 and 2.

We show in Section A.2.1 that it is possible to estimate the regression parameters without imposing any structure on the joint distribution of $H$. But this estimation requires knowledge of whether or not the dominant effects exist. Specifically, if there are no dominant effects, then only $(\alpha, \beta, \xi)$ and $P(G = g)$ are identifiable; otherwise, $(\alpha, \beta, \xi)$, $P(G = g)$, and $P(H = (h^*, g - h^*))$ are identifiable. If either (3) or (4) holds, then it follows from Lemma 1 and Condition 1 that all of the parameters are identifiable, regardless of the genetic mechanism. Denote the joint distribution of $H$ by $P_\gamma(H = (h_k, h_l))$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$. Under (3) or (4), $\gamma = (\rho, \pi_1, \ldots, \pi_K)^T$. When the distribution of $H$ is unspecified, $\gamma$ pertains to the aspects of the distribution of $H$ that are identifiable.

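Write $\theta = (\alpha, \beta, \gamma, \xi)$. The likelihood for $\theta$ based on the cross-sectional data is proportional to

\[
L_n(\theta) \equiv \prod_{i=1}^{n} \prod_{g \in \mathcal{G}} m_g(Y_i, X_i; \theta)^{I(G_i = g)}, \tag{5}
\]

where

\[
m_g(y, x; \theta) = \sum_{(h_k, h_l) \in S(g)} P_{\alpha,\beta,\xi}\bigl(y \mid x, (h_k, h_l)\bigr) P_\gamma(h_k, h_l).
\]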

The MLE $\hat\theta$ can be obtained by maximizing (5) via the Newton–Raphson algorithm or an optimization algorithm. It is generally more efficient to use the expectation–maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), especially when the distribution of $H$ satisfies (3) with $\rho \ge 0$; see Section A.2.2 for details.
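The structure of the likelihood (5) can be illustrated with a minimal sketch that evaluates $m_g$ and the log-likelihood for a binary trait. Several simplifications are assumed here and are not taken from the article: there are no covariates, the trait follows a logistic model with an additive linear predictor, there is a single SNP with two haplotypes, and the haplotype pairs follow Hardy–Weinberg equilibrium with made-up frequencies. The function names are hypothetical, and no maximization step (Newton–Raphson or EM) is shown.

```python
import math


def logistic_density(y, eta):
    """P(Y = y | linear predictor eta) under a logistic model, y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p


def m_g(y, pairs, eta_of_pair, pair_prob):
    """Sum over the pairs (k, l) in S(g) of P(y | (h_k, h_l)) * P(h_k, h_l)."""
    return sum(logistic_density(y, eta_of_pair(k, l)) * pair_prob[k][l]
               for k, l in pairs)


def log_likelihood(data, compatible, eta_of_pair, pair_prob):
    """Log of (5) for data given as (y_i, g_i) records."""
    return sum(math.log(m_g(y, compatible[g], eta_of_pair, pair_prob))
               for y, g in data)


def additive_eta(alpha, beta, target=1):
    """Additive linear predictor: alpha + beta * (copies of the target haplotype)."""
    return lambda k, l: alpha + beta * ((k == target) + (l == target))


# Toy setting (illustrative values only): haplotypes indexed 0 and 1.
pi = [0.6, 0.4]
pair_prob = [[pi[k] * pi[l] for l in range(2)] for k in range(2)]   # HWE, model (2)
compatible = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}         # S(g) for g = 0, 1, 2

data = [(1, 1), (0, 0), (1, 2), (0, 1)]   # made-up (y_i, g_i) records
ll = log_likelihood(data, compatible, additive_eta(-1.0, 0.5), pair_prob)
```

Summing $m_g$ over the two values of $y$ returns $P(G = g)$, which gives a simple sanity check on any implementation. In practice, one would pass the negative of `log_likelihood` to a numerical optimizer or write the corresponding EM updates.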

By the classical likelihood theory, we can show that $\hat\theta$ is consistent, asymptotically normal, and asymptotically efficient under Conditions 1 and 2 and the following condition.

Condition 3. If there exists a constant vector $\nu$ such that $\nu^T \nabla_\theta \log m_g(Y, X; \theta_0) = 0$, then $\nu = 0$.

Remark 2 Condition 3 ensures the nonsingularity of the in-formation matrix This condition can be easily verified when thejoint distribution of H is unspecified and is implied by Lemma 1and Condition 2 when the distribution satisfies (3) or (4)

2.3 Case-Control Studies With Known Population Totals

We consider case-control data supplemented by information on population totals (Scott and Wild 1997). There is a finite population of $N$ individuals that is considered a random sample from the joint distribution of $(Y, X, H)$, where $Y$ is a categorical response variable. All that is known about this finite population is the total number of individuals in each category of $Y = y$. A sample of size $n$ stratified on the disease status is drawn from the finite population, and the values of $X$ and $G$ are recorded for each sampled individual. The supplementary information on population totals is often available from hospital records, cancer registries, and official statistics. If a case-control sample is drawn from a cohort study, then the cohort serves as the finite population. The observable data consist of $(Y_i, R_i, R_iX_i, R_iG_i)$, $i = 1, \ldots, N$, where $R_i$ indicates, by the values 1 versus 0, whether or not the $i$th individual in the finite population is selected into the case-control sample.

The association between $Y$ and $(X, H)$ is characterized by $P_{\alpha,\beta,\xi}(Y \mid X, H)$, where $\alpha$, $\beta$, and $\xi$ pertain to the intercept(s), regression effects, and overdispersion parameters (McCullagh and Nelder 1989). In the case of a binary response variable, important examples of $P_{\alpha,\beta,\xi}(Y \mid X, H)$ include the logistic, probit, and complementary log–log regression models. When there are more than two categories, examples include the proportional odds, multivariate probit, and multivariate logistic regression models. Because the data associated with $R_i = 1$ yield the same form of likelihood as that of a cross-sectional study, and the data associated with $R_i = 0$ yield a missing-data likelihood, all of the identifiability results stated in Section 2.2 apply to the current setting. We again write $\theta = (\alpha, \beta, \xi, \gamma)$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$.

Let $F_g(\cdot)$ be the cumulative distribution function of $X$ given $G = g$, and let $f_g(x)$ be the density of $F_g(x)$ with respect to a dominating measure $\mu(x)$. Note that $F_g(\cdot)$ is infinite-dimensional if $X$ has continuous components. The joint density of $(Y = y, G = g, X = x)$ is $m_g(y, x; \theta) f_g(x)$. The likelihood concerning $\theta$ and $\{F_g\}$ takes the form

$$L_n(\theta, \{F_g\}) = \prod_{i=1}^{N} \Bigl[\prod_{g \in \mathcal{G}} \bigl\{m_g(Y_i, X_i; \theta) f_g(X_i)\bigr\}^{I(G_i = g)}\Bigr]^{R_i} \times \Bigl[\sum_{g \in \mathcal{G}} \int m_g(Y_i, x; \theta)\, dF_g(x)\Bigr]^{1 - R_i}. \tag{6}$$


Unlike the likelihood for the cross-sectional design given in (5), the density functions of $X$ given $G$ cannot be factored out of the likelihood given in (6) and thus cannot be omitted from the likelihood.

We maximize (6) to obtain the MLEs $\hat\theta$ and $\{\hat F_g(\cdot)\}$. The latter is an empirical function with point masses at the observed $X_i$ such that $G_i = g$ and $R_i = 1$. The maximization can be carried out via the Newton–Raphson, profile-likelihood, or large-scale optimization methods. An alternative way to calculate the MLEs is through the EM algorithm described in Section A.3.1.

We impose the following regularity condition and then state the asymptotic results in Theorem 1.

Condition 4. For any $g \in \mathcal{G}$, $f_g(x)$ is positive in its support and continuously differentiable with respect to a suitable measure.

Theorem 1. Under Conditions 1–4, $\hat\theta$ and $\{\hat F_g(\cdot)\}$ are consistent in that $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a mean-0 normal random vector whose covariance matrix attains the semiparametric efficiency bound.

Let $pl_n(\theta)$ be the profile log-likelihood for $\theta$; that is, $pl_n(\theta) = \max_{\{F_g\}} \log L_n(\theta, \{F_g\})$. Then the $(s, t)$th element of the inverse covariance matrix of $\hat\theta$ can be estimated by

$$-\epsilon_n^{-2} \bigl\{ pl_n(\hat\theta + \epsilon_n e_s + \epsilon_n e_t) - pl_n(\hat\theta + \epsilon_n e_s - \epsilon_n e_t) - pl_n(\hat\theta - \epsilon_n e_s + \epsilon_n e_t) + pl_n(\hat\theta) \bigr\},$$

where $\epsilon_n$ is a constant of the order $n^{-1/2}$, and $e_s$ and $e_t$ are the $s$th and $t$th canonical vectors. The function $pl_n(\theta)$ can be calculated via the EM algorithm by holding $\theta$ constant in both the E-step and the M-step.
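The curvature-based variance estimate can be illustrated with a short numerical sketch. The code below is only an illustration under stated assumptions: `pl` stands for a user-supplied profile log-likelihood (here replaced by a hypothetical toy quadratic, `toy_pl`), and the stencil used is the common forward second difference, which may differ in detail from the expression printed in the text.

```python
import numpy as np

def curvature_estimate(pl, theta_hat, eps):
    """Approximate the negative Hessian of pl at theta_hat by second differences.

    Uses the forward stencil
        pl(t + eps*e_s + eps*e_t) - pl(t + eps*e_s) - pl(t + eps*e_t) + pl(t),
    divided by eps**2 and negated. This is a standard stencil chosen for
    illustration, not necessarily the one used in the article.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = theta_hat.size
    eye = np.eye(d)
    base = pl(theta_hat)
    info = np.empty((d, d))
    for s in range(d):
        for t in range(d):
            es, et = eps * eye[s], eps * eye[t]
            diff = (pl(theta_hat + es + et) - pl(theta_hat + es)
                    - pl(theta_hat + et) + base)
            info[s, t] = -diff / eps**2
    return info

# Hypothetical stand-in for a profile log-likelihood: a quadratic with
# known curvature matrix A, maximized at theta = 0.
A = np.array([[2.0, 0.5], [0.5, 1.0]])

def toy_pl(theta):
    return -0.5 * theta @ A @ theta

info = curvature_estimate(toy_pl, np.zeros(2), eps=1e-3)
cov = np.linalg.inv(info)  # estimated covariance of the toy "MLE"
```

In practice `eps` would be taken of the order $n^{-1/2}$, as stated in the text, and each call to `pl` would involve an inner maximization over the nuisance functions.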

Remark 3. If $N$ is much larger than $n$, or if the population frequencies rather than the totals are known, then we maximize $\prod_{i=1}^{n} \prod_{g \in \mathcal{G}} \{m_g(Y_i, X_i; \theta) f_g(X_i)\}^{I(G_i = g)}$ subject to the constraints that $\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x) = p_y$, where $p_y$ is the population frequency of $Y = y$. The resultant estimator of $\theta_0$ is consistent, asymptotically normal, and asymptotically efficient. The results in this section can be extended straightforwardly to accommodate stratifications on covariates.

2.4 Case-Control Studies With Unknown Population Totals

We consider the classical case-control design, which measures $X$ and $G$ on $n_1$ cases ($Y = 1$) and $n_0$ controls ($Y = 0$) and requires no knowledge about the finite population. With the notation introduced in the previous section, the likelihood contribution from one individual takes the form

$$RL(\theta, \{F_g\}) = \frac{\prod_{g \in \mathcal{G}} \bigl\{m_g(y, X; \theta) f_g(X)\bigr\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x)}, \tag{7}$$

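where we use $y$ instead of $Y$ to emphasize that $y$ is not random. Define

$$f_g^\dagger(x) = \frac{m_g(0, x; \theta) f_g(x)}{\int m_g(0, x; \theta)\, dF_g(x)}, \qquad q_g = \frac{\int m_g(0, x; \theta)\, dF_g(x)}{\sum_{\tilde g \in \mathcal{G}} \int m_{\tilde g}(0, x; \theta)\, dF_{\tilde g}(x)}.$$

Clearly, $f_g^\dagger(x)$ is the conditional density of $X$ given $G = g$ and $Y = 0$, and $q_g$ is the conditional probability of $G = g$ given $Y = 0$. Let $g_0$ and $x_0$ be some specific values of $G$ and $X$. Write $F_g^\dagger(x) = \int_0^x f_g^\dagger(s)\, d\mu(s)$. We can express (7) as

$$RL(\theta, \{F_g^\dagger\}, \{q_g\}) = \frac{\prod_{g \in \mathcal{G}} \bigl\{\eta(y, X, g; \theta) f_g^\dagger(X) q_g\bigr\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} q_g \int \eta(y, x, g; \theta)\, dF_g^\dagger(x)}, \tag{8}$$

where

$$\eta(y, x, g; \theta) = \frac{m_g(y, x; \theta)\, m_{g_0}(0, x_0; \theta)}{m_g(0, x; \theta)\, m_{g_0}(y, x_0; \theta)}.$$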

We call $\eta$ the generalized odds ratio (Liang and Qin 2000), which reduces to the ordinary odds ratio when $S(g)$ is a singleton.
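To make the reduction to the ordinary odds ratio concrete, the following sketch evaluates the generalized odds ratio for a logistic model. Everything specific in it is an assumption made for illustration: the additive predictor `alpha + beta_h * copies + beta_x * x`, the representation of $S(g)$ as a list of `(copies of the target haplotype, pair probability)` tuples, and all numerical values.

```python
import math

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def m_g(y, x, pairs, alpha, beta_h, beta_x):
    """Mixture over the haplotype pairs compatible with a genotype.

    `pairs` is a list of (copies, prob) tuples: the number of copies of the
    target haplotype in the pair and the probability of that pair.
    """
    total = 0.0
    for copies, prob in pairs:
        p1 = expit(alpha + beta_h * copies + beta_x * x)
        total += (p1 if y == 1 else 1.0 - p1) * prob
    return total

def generalized_odds_ratio(y, x, pairs, x0, pairs0, alpha, beta_h, beta_x):
    """eta(y, x, g) relative to the reference values (x0, g0)."""
    args = (alpha, beta_h, beta_x)
    num = m_g(y, x, pairs, *args) * m_g(0, x0, pairs0, *args)
    den = m_g(0, x, pairs, *args) * m_g(y, x0, pairs0, *args)
    return num / den

# Hypothetical singleton sets S(g) and S(g0): one compatible pair each.
alpha, beta_h, beta_x = -3.0, 0.3, 0.25
eta_singleton = generalized_odds_ratio(
    1, 1.0, [(2, 0.10)], 0.0, [(0, 0.30)], alpha, beta_h, beta_x)
ordinary_or = math.exp(beta_h * 2 + beta_x * 1.0)

# Hypothetical ambiguous genotype: two compatible pairs with different
# numbers of copies of the target haplotype.
eta_ambiguous = generalized_odds_ratio(
    1, 1.0, [(1, 0.04), (0, 0.06)], 0.0, [(0, 0.30)], alpha, beta_h, beta_x)
```

With singleton sets the pair probabilities and the intercept cancel, so `eta_singleton` agrees with `ordinary_or`; for the ambiguous genotype the value depends on the pair probabilities as well.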

Remark 4. The parameter $q_g$ is a functional of $f_g^\dagger$ and $\theta$, because $\int m_g(0, x; \theta)\, dF_g(x) = \{\int m_g^{-1}(0, x; \theta)\, dF_g^\dagger(x)\}^{-1}$. This constraint makes it very difficult to study the identifiability of the parameters. Thus we treat $q_g$ as a free parameter in our development.

For traditional case-control data, the odds ratio is identifiable (whereas the intercept is not), and its MLE can be obtained by maximizing the prospective likelihood (Prentice and Pyke 1979). Similar results hold when the exposure is measured with error (Roeder, Carroll, and Lindsay 1996); however, the distribution of the measurement error needs to be estimated from a validation set or an external source. With unphased genotype data, identifiability is much more delicate. We show in Section A.4.1 that the components of $\theta$ that are identifiable from the retrospective likelihood are exactly those that are identifiable from the generalized odds ratio. Thus we assume that the generalized odds ratio depends only on a set of identifiable parameters, still denoted by $\theta$; otherwise, the inference is not tractable. For the logistic link function with linear predictor (1), we show in Section A.4.2 that if there are no dominant effects, then $\theta$ consists only of $\beta$; if there are no covariate effects but there exists a dominant main effect, then $\beta$ is identifiable and $P(H = (h^*, g - h^*))/P(G = g)$ is identifiable up to a scalar constant; and if the dominant effect depends on a continuous covariate, or if the dominant main effect and the main effect of a continuous covariate are nonzero, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$. For the probit and complementary log–log link functions, we show in Section A.4.3 that if there are dominant effects and at least one continuous covariate has an effect, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$.

We maximize the product of (8) over the $n \equiv n_1 + n_0$ individuals in the case-control sample to produce the MLEs $\hat\theta$, $\{\hat F_g^\dagger(\cdot)\}$, and $\{\hat q_g\}$. Although the $\hat F_g^\dagger(\cdot)$ are high-dimensional, we show in Section A.4.4 that $\hat\theta$ can be obtained by profiling a likelihood function over a scalar nuisance parameter.

To state the asymptotic properties of the MLEs, we impose the following conditions.

Condition 5. If there exists a vector $v$ such that $v^T \nabla_\theta \log \eta(1, x, g; \theta)$ is a constant with probability 1, then $v = 0$.

Condition 6. The function $f_g^\dagger$ is positive in its support and continuously differentiable.

Condition 7. The fraction $n_1/n$ converges to a limit in $(0, 1)$.

94 Journal of the American Statistical Association March 2006

Remark 5. Condition 5 implies nonsingularity of the information matrix for $\theta_0$ and can be shown to hold for the logistic, probit, and complementary log–log link functions. Condition 7 ensures that there are both cases and controls in the sample.

Theorem 2. Under Conditions 5–7, $|\hat\theta - \theta_0| + \sup_g |\hat q_g - q_g| + \sup_{x,g} |\hat F_g^\dagger(x) - F_g^\dagger(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix attains the semiparametric efficiency bound.

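In most case-control studies, the disease is (relatively) rare. When the disease is rare, considerable simplicity arises because of the following approximation for the logistic regression model:

$$P_{\alpha,\beta}(Y \mid X, H) \approx \exp\bigl\{Y\bigl(\alpha + \beta^T Z(X, H)\bigr)\bigr\},$$

where $Z(X, H)$ is a specific function of $X$ and $H$. We assume that either (3) or (4) holds. The likelihood based on $(X_i, G_i, y_i)$, $i = 1, \ldots, n$, can be approximated by

$$L_n(\theta, \{F_g\}) = \prod_{i=1}^{n} \left( \frac{\prod_{g \in \mathcal{G}} \bigl[ f_g(X_i) \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(X_i, h_k, h_l)} P_\gamma(h_k, h_l) \bigr]^{I(G_i = g)}}{\sum_{g \in \mathcal{G}} \int_x \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, h_k, h_l)} P_\gamma(h_k, h_l)\, dF_g(x)} \right)^{y_i} \times \left[ \prod_{g \in \mathcal{G}} \Bigl\{ f_g(X_i) \sum_{(h_k, h_l) \in S(g)} P_\gamma(h_k, h_l) \Bigr\}^{I(G_i = g)} \right]^{1 - y_i}. \tag{9}$$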

We impose the following condition.

Condition 8. If $\alpha + \beta^T Z(X, H) = \tilde\alpha + \tilde\beta^T Z(X, H)$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, then $\alpha = \tilde\alpha$ and $\beta = \tilde\beta$.

This condition is similar to Condition 1 stated in Section 2.2, and it holds for the codominant model. Under this condition, it follows from Lemma 1 that no two sets of parameters can give the same likelihood with probability 1. Thus the maximizer of (9), denoted by $(\hat\theta, \{\hat F_g\})$, is locally unique. We show in Section A.4.5 that $\hat\theta$ can be easily obtained by profiling over a small number of parameters.

To derive the asymptotic properties, we provide a mathematical definition of rare disease.

Condition 9. For $i = 1, \ldots, n$, the conditional distribution of $Y_i$ given $(X_i, H_i)$ satisfies $P(Y_i = 1 \mid X_i, H_i) = a_n \exp\{\beta_0^T Z(X_i, H_i)\} / [1 + a_n \exp\{\beta_0^T Z(X_i, H_i)\}]$, where $a_n = o(n^{-1/2})$.

Theorem 3. Under Conditions 6–9, $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to_{P_n} 0$, where $P_n$ is the probability measure given by Condition 9. Furthermore, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix achieves the semiparametric efficiency bound.

2.5 Cohort Studies

In a cohort study, $Y$ represents the time to disease occurrence, which is subject to right-censorship by $C$. The data consist of $(\tilde Y_i, \Delta_i, X_i, G_i)$, $i = 1, \ldots, n$, where $\tilde Y_i = \min(Y_i, C_i)$ and $\Delta_i = I(Y_i \le C_i)$. We relate $Y_i$ to $(X_i, H_i)$ through a class of semiparametric linear transformation models,

$$\Gamma(Y_i) = -\beta^T Z(X_i, H_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{10}$$

where $\Gamma$ is an unknown increasing function, $Z(X, H)$ is a known function of $X$ and $H$, and the $\varepsilon_i$'s are independent errors with known distribution function $F$. We may rewrite (10) as

$$P(Y_i \le t \mid X_i, H_i) = Q\bigl(\Lambda(t)\, e^{\beta^T Z(X_i, H_i)}\bigr),$$

where $\Lambda(t) = e^{\Gamma(t)}$ and $Q(x) = F(\log x)$ $(x > 0)$. The choices of the extreme-value and standard logistic distributions for $F$, or equivalently $Q(x) = 1 - e^{-x}$ and $Q(x) = 1 - (1 + x)^{-1}$, yield the proportional hazards model and the proportional odds model (Pettitt 1984).
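The two named special cases can be checked numerically. The sketch below is illustrative only: the increasing function of time is written `Lam` and is given a hypothetical form ($t^2$), and the linear predictor is a hypothetical scalar. It verifies that $Q(x) = 1 - e^{-x}$ makes the survival functions powers of one another (proportional hazards), and that $Q(x) = 1 - (1 + x)^{-1}$ makes the odds of failure proportional.

```python
import math

def q_ph(x):
    """Q for the extreme-value error distribution (proportional hazards)."""
    return 1.0 - math.exp(-x)

def q_po(x):
    """Q for the standard logistic error distribution (proportional odds)."""
    return 1.0 - 1.0 / (1.0 + x)

def failure_prob(t, lin_pred, Q, Lam):
    """P(Y <= t | predictor) = Q(Lam(t) * exp(lin_pred))."""
    return Q(Lam(t) * math.exp(lin_pred))

def Lam(t):
    # Hypothetical increasing function with Lam(0) = 0, for illustration.
    return t ** 2

lin_pred = 0.4  # hypothetical value of beta^T Z
checks = []
for t in (0.3, 0.8, 1.5):
    # Proportional hazards: S(t | Z) = S(t | 0) ** exp(beta^T Z).
    s1 = 1.0 - failure_prob(t, lin_pred, q_ph, Lam)
    s0 = 1.0 - failure_prob(t, 0.0, q_ph, Lam)
    ph_gap = abs(s1 - s0 ** math.exp(lin_pred))
    # Proportional odds: odds(t | Z) / odds(t | 0) = exp(beta^T Z).
    f1 = failure_prob(t, lin_pred, q_po, Lam)
    f0 = failure_prob(t, 0.0, q_po, Lam)
    po_gap = abs((f1 / (1 - f1)) / (f0 / (1 - f0)) - math.exp(lin_pred))
    checks.append((ph_gap, po_gap))
```

Both gaps are zero up to rounding error at every time point, whatever increasing `Lam` is used.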

We impose Condition 8. Under this condition, $\beta$ and $\Lambda(\cdot)$ are identifiable from the observable data. The identifiability of the distribution of $H$ is the same here as in the case of cross-sectional studies. Under (3) or (4) and Condition 8, all of the parameters, including $\beta$, $\Lambda(\cdot)$, and $\gamma$, are identifiable. This is shown in Section A.5.1.

The following assumption on censoring is required in constructing the likelihood.

Condition 10. Conditional on $X$ and $G$, the censoring time $C$ is independent of $Y$ and $H$.

Let $\theta = (\beta, \gamma)$. The likelihood concerning $\theta$ and $\Lambda$ takes the form

$$L_n(\theta, \Lambda) = \prod_{i=1}^{n} \Bigl[ \sum_{(h_k, h_l) \in S(G_i)} \bigl\{ \dot\Lambda(\tilde Y_i)\, e^{\beta^T Z(X_i, (h_k, h_l))}\, \dot Q\bigl(\Lambda(\tilde Y_i)\, e^{\beta^T Z(X_i, (h_k, h_l))}\bigr) \bigr\}^{\Delta_i} \times \bigl\{ 1 - Q\bigl(\Lambda(\tilde Y_i)\, e^{\beta^T Z(X_i, (h_k, h_l))}\bigr) \bigr\}^{1 - \Delta_i} P_\gamma(h_k, h_l) \Bigr]. \tag{11}$$

Here and in the sequel, $\dot f(x) = df(x)/dx$ and $\ddot f(x) = d^2 f(x)/dx^2$. Like (6), (8), and (9), this likelihood involves infinite-dimensional parameters. If $\Lambda$ is restricted to be absolutely continuous, then, as in the case of density estimation, there is no maximizer of this likelihood. Thus we relax $\Lambda$ to be right-continuous and replace $\dot\Lambda(\tilde Y_i)$ in (11) by the jump size of $\Lambda$ at $\tilde Y_i$. By the arguments of Zeng, Lin, and Lin (2005), the resultant MLE, denoted by $(\hat\theta, \hat\Lambda)$, exists, and $\hat\Lambda$ is a step function with jumps only at the observed $\tilde Y_i$ for which $\Delta_i = 1$. The maximization can be carried out through an optimization algorithm. Furthermore, the covariance matrix of $\hat\theta$ can be estimated by the profile likelihood method, as discussed by Zeng et al. (2005).

Lin (2004) considered the special case of the proportional hazards model under condition (2) and provided an EM algorithm for obtaining the MLEs. We can modify that algorithm to accommodate Hardy–Weinberg disequilibrium along the lines of Section A.2.2. In addition, the EM algorithm can be used to evaluate the profile likelihood.

We assume the following regularity conditions for the asymptotic results.


Condition 11. There exists some positive constant $\delta_0$ such that $P(C_i \ge \tau \mid X_i, G_i) = P(C_i = \tau \mid X_i, G_i) \ge \delta_0$ almost surely, where $\tau$ corresponds to the end of the study.

Condition 12. The true value $\Lambda_0(t)$ of $\Lambda(t)$ is a strictly increasing function in $[0, \tau]$ and is continuously differentiable. In addition, $\Lambda_0(0) = 0$, $\Lambda_0(\tau) < \infty$, and $\dot\Lambda_0(0) > 0$.

Theorem 4. Under Conditions 8 and 10–12, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ converges weakly to a Gaussian process in $R^d \times l^\infty([0, \tau])$, where $d$ is the dimension of $\theta_0$ and $l^\infty([0, \tau])$ is the space of all bounded functions on $[0, \tau]$ equipped with the supremum norm. Furthermore, $\hat\theta$ is asymptotically efficient.

3. SIMULATION STUDIES

We used Monte Carlo simulation to evaluate the proposed methods in realistic settings. We considered the five SNPs on chromosome 22 from the Finland–United States Investigation of NIDDM Genetics (FUSION) Study described in the next section. We obtained the $\pi_k$'s from the frequencies shown in Table 1 by assuming a 7% disease rate, and generated haplotypes under (3) with $\rho = .05$. The $R_h^2$ in Table 1 is the measure of haplotype certainty of Stram et al. (2003). We focused on $h^* = (01100)$ and considered case-control and cohort studies.

For the cohort studies, we generated ages of onset from the proportional hazards model

$$\lambda\{t \mid x, (h_k, h_l)\} = 2t \exp\bigl[\beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x\bigr],$$

where $X$ is a Bernoulli variable with $P(X = 1) = .2$ that is independent of $H$. We generated the censoring times from the uniform$(0, \tau)$ distribution, where $\tau$ was chosen to yield approximately 250, 500, and 1,000 cases under $n = 5{,}000$. We let $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$.
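The data-generating scheme for the cohort simulations can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it assumes that model (3) is the mixture $P(H = (h_k, h_l)) = (1 - \rho)\pi_k\pi_l + \rho\,\delta_{kl}\pi_k$ described in Section A.2.2, it inverts the survival function implied by the hazard $2t\exp(\cdot)$, and the three-haplotype frequency vector and the value of `tau` are hypothetical.

```python
import numpy as np

def sample_haplotype_pairs(pi, rho, n, rng):
    """Draw n haplotype-index pairs with
    P(H = (h_k, h_l)) = (1 - rho) * pi_k * pi_l + rho * 1{k == l} * pi_k."""
    pi = np.asarray(pi, dtype=float)
    first = rng.choice(pi.size, size=n, p=pi)
    second = rng.choice(pi.size, size=n, p=pi)
    forced_homozygous = rng.random(n) < rho   # mixture indicator B = 1
    second = np.where(forced_homozygous, first, second)
    return first, second

def sample_cohort(pi, rho, target, beta, tau, n, rng, p_x=0.2):
    """Simulate (observed time, event indicator, covariate, haplotype pair)."""
    b1, b2, b3 = beta
    first, second = sample_haplotype_pairs(pi, rho, n, rng)
    copies = (first == target).astype(int) + (second == target).astype(int)
    x = (rng.random(n) < p_x).astype(int)
    lin = b1 * copies + b2 * x + b3 * copies * x
    # Hazard 2t*exp(lin) implies survival exp(-t^2 * exp(lin)); invert it.
    u = 1.0 - rng.random(n)                   # uniform on (0, 1]
    onset = np.sqrt(-np.log(u) / np.exp(lin))
    censor = rng.uniform(0.0, tau, size=n)
    observed = np.minimum(onset, censor)
    event = (onset <= censor).astype(int)
    return observed, event, x, first, second

rng = np.random.default_rng(2006)
pi = [0.5, 0.3, 0.2]                          # hypothetical frequencies
obs, event, x, h1, h2 = sample_cohort(
    pi, rho=0.05, target=1, beta=(0.25, 0.25, 0.0), tau=0.5, n=5000, rng=rng)
n_cases = int(event.sum())
```

Only the unphased genotype (the sum of the two haplotypes) would be passed to the estimation procedure; in a study like the one described, `tau` would be tuned until `n_cases` is close to the desired number of cases.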

As shown in Table 2, the MLE is virtually unbiased, the likelihood ratio test has proper type I error, and the confidence interval has reasonable coverage. Additional simulation studies revealed that the proposed methods also perform well for making inference about other parameters and under other genetic models.

Table 1. Observed Haplotype Frequencies in the FUSION Study

                 Frequencies
Haplotype   Controls    Cases       R_h^2
00011       .0042       .0066       .388
00100       .0035       .0034       .336
00110       .0018       .0007       .377
01011       .1292       .1344       .592
01100       .2514       .3183       .738
01101       .0012       <10^-4      .450
01110       <10^-4      .0045       .499
01111       .0019       <10^-4      .325
10000       .0136       .0114       .456
10010       <10^-4      .0012       .500
10011       .3573       .2883       .727
10100       .0521       .0597       .402
10110       .0317       .0318       .554
11011       .1392       .1290       .560
11100       .0109       .0092       .266
11110       <10^-4      .0014       <10^-4
11111       .0020       <10^-4      .338

Table 2. Simulation Results for the Haplotype–Environment Interactions in Cohort Studies

β3      Cases    Bias     SE      CP      Power
0        250     -.010    .232    .949    .051
         500     -.005    .157    .953    .047
        1000     -.003    .114    .954    .046
-.25     250     -.014    .256    .950    .190
         500     -.008    .172    .949    .334
        1000     -.004    .122    .952    .554
-.5      250     -.022    .281    .950    .505
         500     -.011    .190    .950    .806
        1000     -.006    .132    .952    .976
.25      250     -.007    .216    .947    .207
         500     -.002    .146    .953    .395
        1000     -.001    .109    .954    .614
.5       250     -.003    .204    .943    .693
         500     -.001    .140    .951    .940
        1000     -.001    .105    .952    .998

NOTE: Bias and SE are the bias and standard error of the estimator of β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level likelihood ratio test of H0: β3 = 0. Each entry is based on 5,000 replicates.

For the case-control studies, we used the same distributions of $H$ and $X$ and considered the same $h^*$ as in the cohort studies. We generated disease incidence from the logistic regression model

$$\operatorname{logit} P\{Y = 1 \mid x, (h_k, h_l)\} = \alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x. \tag{12}$$

For making inference on $\beta_1$, we set $\beta_2 = \beta_3 = .25$ and varied $\beta_1$ from $-.5$ to $.5$; for making inference on $\beta_3$, we set $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$. We chose $\alpha = -3$ or $-4$, yielding disease rates between 1.6% and 7%. We let $n_1 = n_0 = 500$ or 1,000. We considered the situations of known and unknown population totals, with $N$ being 15 and 30 times $n$ under $\alpha = -3$ and $-4$. For known population totals, we used the EM algorithm described in Section A.3.1 and evaluated the inference procedures based on the likelihood ratio statistic. For unknown population totals, we used the profile-likelihood method for rare diseases described in Section A.4.5 and set the $\pi_k$ less than $2/n$ to 0 to improve numerical stability. The results for $\beta_1$ and $\beta_3$ are displayed in Tables 3 and 4.

For known population totals, the proposed estimators are virtually unbiased, and the likelihood ratio statistics yield proper tests and confidence intervals. For unknown population totals, $\hat\beta_1$ has little bias, especially for large $n$, whereas $\hat\beta_3$ tends to be slightly biased downward; the variance estimators are fairly accurate, and the corresponding confidence intervals have reasonable coverage probabilities, except for $\alpha = -3$, $\beta_3 = .5$. The method with known population totals yields slightly higher power than the method with unknown population totals.

All the aforementioned results pertain to haplotype 01100, which has a relatively high frequency and a large value of $R_h^2$; the covariate is binary, and $\rho$ is .05, which is relatively large. Additional simulation studies showed that the foregoing conclusions continue to hold for other haplotypes, other values of $\rho$, and continuous covariates. Table 5 reports some results for haplotype 10100, which has a frequency of about 5% and


Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                          Known totals                      Unknown totals
n1 = n0   α     β1      Bias    SE      CP      Power     Bias    SE      SEE     CP      Power
500       -3    -.5     -.003   .117    .952    .987      .019    .121    .124    .951    .979
                -.25    -.002   .109    .954    .587      .014    .112    .117    .960    .525
                0       -.001   .104    .951    .049      .009    .109    .112    .955    .045
                .25     -.001   .102    .950    .641      .002    .105    .108    .961    .646
                .5      .000    .099    .948    .996      -.005   .103    .106    .958    .998
          -4    -.5     .001    .112    .954    .987      .022    .119    .124    .951    .977
                -.25    -.002   .104    .955    .574      .013    .114    .117    .952    .529
                0       -.002   .100    .953    .047      .004    .109    .112    .953    .047
                .25     -.001   .095    .956    .636      -.003   .103    .108    .959    .640
                .5      -.000   .094    .950    .999      -.009   .102    .105    .956    .997
1000      -3    -.5     -.003   .082    .953    1.00      .005    .087    .087    .949    1.00
                -.25    -.002   .076    .952    .874      .005    .081    .082    .948    .853
                0       -.001   .073    .951    .049      .005    .077    .077    .954    .046
                .25     -.001   .071    .953    .898      .004    .075    .076    .948    .920
                .5      -.001   .070    .953    1.00      .003    .075    .075    .946    1.00
          -4    -.5     .000    .079    .952    1.00      .005    .087    .088    .947    1.00
                -.25    -.000   .074    .959    .867      .005    .081    .083    .954    .847
                0       -.001   .070    .955    .045      .002    .079    .079    .949    .051
                .25     -.001   .067    .956    .904      .000    .074    .076    .955    .909
                .5      -.001   .066    .956    1.00      -.002   .073    .074    .954    1.00

NOTE: Bias and SE are the bias and standard error of the estimator of β1. SEE is the mean of the standard error estimator for β1. CP is the coverage probability of the 95% confidence interval for β1. Power pertains to the .05-level test of H0: β1 = 0. Each entry is based on 5,000 replicates.

an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

$$\operatorname{logit} P\{Y = 1 \mid X_1, X_2, (h_k, h_l)\} = \alpha + \beta_h\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x_1} X_1 + \beta_{x_2} X_2 + \beta_{hx_2}\{I(h_k = h^*) + I(h_l = h^*)\}X_2,$$

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform$(0, 1)$. We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x_1} = \beta_{x_2} = -\beta_{hx_2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                          Known totals                      Unknown totals
n1 = n0   α     β3      Bias    SE      CP      Power     Bias    SE      SEE     CP      Power
500       -3    -.5     -.008   .205    .949    .729      .030    .187    .195    .953    .692
                -.25    -.002   .186    .949    .271      .016    .169    .176    .961    .244
                0       -.001   .173    .946    .054      -.006   .155    .162    .963    .037
                .25     .002    .165    .949    .334      -.038   .144    .151    .958    .287
                .5      .006    .161    .947    .885      -.088   .138    .143    .915    .831
          -4    -.5     -.009   .198    .950    .763      .012    .194    .195    .950    .720
                -.25    -.005   .181    .949    .309      .006    .172    .176    .953    .264
                0       -.002   .168    .945    .055      -.007   .156    .161    .956    .044
                .25     -.001   .157    .944    .370      -.022   .146    .149    .948    .333
                .5      .001    .148    .945    .926      -.047   .136    .141    .945    .904
1000      -3    -.5     -.004   .147    .943    .953      .027    .134    .136    .950    .953
                -.25    -.003   .133    .946    .493      .013    .122    .123    .949    .477
                0       -.001   .123    .951    .049      -.005   .114    .113    .948    .052
                .25     .000    .119    .945    .580      -.034   .107    .106    .934    .535
                .5      .002    .117    .947    .994      -.080   .102    .101    .870    .986
          -4    -.5     -.005   .140    .945    .965      .010    .137    .136    .949    .965
                -.25    -.002   .128    .945    .529      .005    .124    .123    .951    .505
                0       -.001   .119    .946    .054      -.004   .113    .113    .947    .053
                .25     -.000   .110    .947    .633      -.016   .104    .105    .952    .601
                .5      .002    .105    .949    .998      -.037   .099    .099    .937    .995

NOTE: Bias and SE are the bias and standard error of the estimator of β3. SEE is the mean of the standard error estimator for β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level test of H0: β3 = 0. Each entry is based on 5,000 replicates.


Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias    SE      SEE     CP      Power
500       βh          0            -.030   .400    .401    .957    .043
          βx1         .5           .002    .151    .152    .951    .917
          βx2         .5           .001    .228    .230    .953    .584
          βhx2        -.5          .015    .641    .644    .956    .118
1000      βh          0            -.017   .275    .277    .953    .047
          βx1         .5           .002    .107    .107    .954    .997
          βx2         .5           .000    .162    .161    .950    .871
          βhx2        -.5          .012    .441    .443    .950    .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated $\rho$ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate $X$ by setting $X = 1$ for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of $X$ given $Y$ and $G$ under model (12) with $\alpha = -3.7$, $\beta_1 = .32$, and $\beta_2 = .25$. Based on 5,000 replicates, the power for testing $H_0: \beta_3 = 0$ is estimated at .053, .479, or .974 under $\beta_3 = 0$, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation, providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

                                                                      Codominant model
Haplotype   Recessive model   Dominant model   Additive model   Additive        Recessive
01011       .327 (.270)       -.027 (.140)     .049 (.135)      .005 (.143)     .331 (.289)
01100       .316 (.146)       .274 (.109)      .355 (.099)      .334 (.114)     .063 (.167)
10011       -.206 (.155)      -.323 (.112)     -.320 (.095)     -.344 (.111)    .076 (.183)
10100       -1.019 (1.020)    .196 (.219)      .116 (.213)      .169 (.217)     -1.131 (1.029)
10110       .903 (.746)       -.007 (.248)     .063 (.249)      .016 (.254)     .892 (.765)
11011       -.222 (.328)      -.096 (.140)     -.127 (.133)     -.108 (.140)    -.143 (.344)

NOTE: Standard error estimates are shown in parentheses.


Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$ yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly, $(1 - \rho)\pi_k^2 + \rho\pi_k = (1 - \tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k$. We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have $\pi_k = [-\rho + \{\rho^2 + 4c_k(1 - \rho)\}^{1/2}] / \{2(1 - \rho)\}$. Thus $(1 - \rho)^{-1}$ satisfies the equation $\sum_k [(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}] = 2$, and $(1 - \tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$. To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1 - \rho) + \rho\} + \mu\pi_k(1 - \pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have $\sum_k \mu\pi_k(1 - \pi_k) / \{2\pi_k(1 - \rho) + \rho\} = 0$. Therefore, $\mu = 0$ and $\nu = 0$.
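The inversion step in the first part of the proof is easy to check numerically. The sketch below uses hypothetical values of $\pi_k$ and $\rho$; it confirms that the closed-form root recovers $\pi_k$ from the homozygote probability $c_k$, and that $x = (1 - \rho)^{-1}$ solves the summed equation when the $\pi_k$ add to 1.

```python
import math

def homozygote_prob(pi_k, rho):
    """c_k = P(G = 2 h_k) = (1 - rho) * pi_k**2 + rho * pi_k."""
    return (1.0 - rho) * pi_k ** 2 + rho * pi_k

def recover_pi(c_k, rho):
    """Nonnegative root of (1 - rho) * pi**2 + rho * pi - c_k = 0."""
    disc = rho ** 2 + 4.0 * c_k * (1.0 - rho)
    return (-rho + math.sqrt(disc)) / (2.0 * (1.0 - rho))

def summed_equation(x, c):
    """Left side of sum_k [(1 - x) + {(x - 1)^2 + 4 c_k x}^(1/2)]."""
    return sum((1.0 - x) + math.sqrt((x - 1.0) ** 2 + 4.0 * ck * x) for ck in c)

# Hypothetical haplotype frequencies and inbreeding-type parameter.
pi = [0.5, 0.3, 0.2]
rho = 0.05
c = [homozygote_prob(p, rho) for p in pi]

recovered = [recover_pi(ck, rho) for ck in c]
lhs = summed_equation(1.0 / (1.0 - rho), c)  # should equal 2
```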

A2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \text{ is not} \ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h, h)\}$ or $\{(h, \tilde h)\}$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, h))P(H = (h, h))$ or $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, \tilde h))P(H = (h, \tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))$ does not depend on $(h_k, h_l) \in S(g)$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))P(G = g)$, where $(h_k, h_l) \in S(g)$. For $g \in \mathcal{G}_3$,

\[ m_g(y, x; \theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g), \]

where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k, h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) When $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).
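Conclusion (1) can be seen concretely by evaluating the mixture expression for an ambiguous genotype. The sketch below is an illustration only: it assumes a single scalar covariate, and the function names `logistic_lik` and `m_g3` and all numerical values are hypothetical. When the effects of the target haplotype vanish, the two mixture components coincide and only the sum of the two mixing probabilities matters.

```python
import math

def logistic_lik(y, eta):
    """P(Y = y) under a logistic model with linear predictor eta, y in {0, 1}."""
    return math.exp(y * eta) / (1.0 + math.exp(eta))

def m_g3(y, x, alpha, beta1, beta3, beta4, pi1, pi2):
    """m_g(y, x; theta) for an ambiguous genotype (third category),
    with a scalar covariate x (an assumption made for this sketch)."""
    eta_with = alpha + beta1 + beta3 * x + beta4 * x  # pairs carrying one copy of h*
    eta_without = alpha + beta3 * x                   # pairs carrying no copy of h*
    return logistic_lik(y, eta_with) * pi1 + logistic_lik(y, eta_without) * pi2

# No haplotype effect: (pi1, pi2) enters only through pi1 + pi2
null_a = m_g3(1, 0.7, -1.0, 0.0, 0.5, 0.0, 0.1, 0.3)
null_b = m_g3(1, 0.7, -1.0, 0.0, 0.5, 0.0, 0.2, 0.2)

# Nonzero haplotype effect: the same two splits now give different values
alt_a = m_g3(1, 0.7, -1.0, 0.8, 0.5, 0.0, 0.1, 0.3)
alt_b = m_g3(1, 0.7, -1.0, 0.8, 0.5, 0.0, 0.2, 0.2)
```

Summing `m_g3` over y = 0 and y = 1 returns pi1 + pi2, the genotype probability, as it should.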

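A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is

\[ \sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta)\big\{\log P_{\alpha,\beta,\xi}\big(Y_i \mid X_i, (h_k,h_l)\big) + \log P_\gamma(h_k,h_l)\big\}, \]

where

\[ p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_{k'},h_{l'})\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_{k'},h_{l'}))P_\gamma(h_{k'},h_{l'})}. \]

Thus, in the $(m+1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:

\[ \sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}\big(Y_i \mid X_i, (h_k,h_l)\big) = 0, \]

\[ \sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_\gamma \log P_\gamma(h_k,h_l) = 0. \tag{A.1} \]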
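The E-step weights are simply the posterior probabilities of the compatible haplotype pairs. The following minimal sketch computes them for one subject; it is an illustration rather than the authors' implementation. The function name `e_step_weights`, the two-SNP haplotype coding, the Hardy–Weinberg form of the pair probability, and the additive logistic trait model are all assumptions made for the example.

```python
import math

def e_step_weights(pairs, trait_lik, hap_prob):
    """Posterior weights p_ikl for one subject.

    pairs     : ordered haplotype-index pairs (k, l) in S(G_i)
    trait_lik : callable (k, l) -> P(Y_i | X_i, (h_k, h_l))
    hap_prob  : callable (k, l) -> P_gamma(h_k, h_l)
    """
    joint = [trait_lik(k, l) * hap_prob(k, l) for k, l in pairs]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical setting: two SNPs, haplotypes 00, 01, 10, 11 indexed 0..3.
pi = [0.4, 0.1, 0.2, 0.3]                  # assumed haplotype frequencies
alpha, beta1, target, y = -1.0, 0.8, 3, 1  # assumed trait model; h* is index 3

def hap_prob(k, l):
    return pi[k] * pi[l]                   # Hardy-Weinberg equilibrium

def trait_lik(k, l):
    eta = alpha + beta1 * ((k == target) + (l == target))
    return math.exp(y * eta) / (1.0 + math.exp(eta))

# A double heterozygote is compatible with 00/11 and 01/10 in either order.
pairs = [(0, 3), (3, 0), (1, 2), (2, 1)]
weights = e_step_weights(pairs, trait_lik, hap_prob)
```

In a full implementation these weights would be recomputed for every subject at each iteration and then plugged into the weighted score equations (A.1).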

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k, h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k, h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1-B)Q_2$. The complete-data likelihood can be represented by

\[ \prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_k \pi_k^{I(Q_{1i}=(h_k,h_k))B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i}=(h_k,h_l))(1-B_i)} \rho^{B_i}(1-\rho)^{1-B_i}. \]

The corresponding score equations for $\pi_k$ and $\rho$ satisfy

\[ \pi_k = c^{-1}\Big[\sum_{i=1}^n B_i I\big(Q_{1i} = (h_k,h_k)\big) + \sum_{i=1}^n \sum_{l=1}^K (1-B_i)\big\{I\big(Q_{2i} = (h_k,h_l)\big) + I\big(Q_{2i} = (h_l,h_k)\big)\big\}\Big] \]

and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

\[ E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\} = \sum_{bq_1+(1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}\big(Y_i \mid X_i, bq_1 + (1-b)q_2\big)\,p(b,q_1,q_2)\,\omega(b,q_1,q_2) \times \Big[\sum_{bq_1+(1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}\big(Y_i \mid X_i, bq_1 + (1-b)q_2\big)\,p(b,q_1,q_2)\Big]^{-1}, \]

where $\omega(B, Q_1, Q_2) = BI(Q_1 = (h_k,h_k))$, $(1-B)I(Q_2 = (h_k,h_l))$, or $B$, and

\[ p(b,q_1,q_2) = \prod_k \pi_k^{bI(q_1=(h_k,h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1-b)I(q_2=(h_k,h_l))} \rho^{b}(1-\rho)^{1-b}. \]

In the $(m+1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form,

\[ \pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\Big[\sum_{i=1}^n E^{(m)}\big\{B_i I\big(Q_{1i} = (h_k,h_k)\big)\big\} + 2\sum_{i=1}^n \sum_{l=1}^K E^{(m)}\big\{(1-B_i) I\big(Q_{2i} = (h_k,h_l)\big)\big\}\Big] \]

and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i, Q_{1i}, Q_{2i})\}$ is $E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.
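The closed-form updates can be sketched in a few lines for the simplified case in which the trait term is constant, so that only the genotype data inform the haplotype frequencies and ρ. This simplification, the function name `em_pi_rho`, the starting values, and the toy data are assumptions of the example, not part of the original algorithm. Each compatible ordered pair is expanded into its latent configurations: B = 1 is possible only for homozygous pairs.

```python
def em_pi_rho(subjects, K, n_iter=2000, rho=0.1):
    """EM updates for (pi_1, ..., pi_K, rho) under model (3), with the trait
    term treated as constant (an assumption made for this sketch).

    subjects: list in which each entry is the list of ordered haplotype-index
              pairs (k, l) compatible with that subject's genotype, S(G_i).
    """
    pi = [1.0 / K] * K
    n = len(subjects)
    for _ in range(n_iter):
        counts = [0.0] * K  # bracketed term in the update of pi_k
        b_sum = 0.0         # sum over subjects of E(B_i | data)
        for pairs in subjects:
            configs = []
            for k, l in pairs:
                if k == l:  # B = 1 requires Q1 = (h_k, h_k)
                    configs.append((1, k, l, rho * pi[k]))
                configs.append((0, k, l, (1.0 - rho) * pi[k] * pi[l]))
            total = sum(cfg[3] for cfg in configs)
            for b, k, l, w in configs:
                post = w / total
                if b == 1:
                    b_sum += post
                    counts[k] += post
                else:
                    counts[k] += post
                    counts[l] += post
        c = sum(counts)     # normalizing constant c^(m+1)
        pi = [v / c for v in counts]
        rho = b_sum / n
    return pi, rho

# Toy single-locus data: 40 AA, 30 Aa, 30 aa, with alleles indexed 0 and 1
subjects = [[(0, 0)]] * 40 + [[(0, 1), (1, 0)]] * 30 + [[(1, 1)]] * 30
pi_hat, rho_hat = em_pi_rho(subjects, K=2)
```

For this saturated single-locus example the iterations should settle at the values that reproduce the observed genotype proportions, namely an allele frequency of 0.55 and ρ = 1 − 0.3/0.495.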

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is

\[ \prod_{i=1}^N P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i) \prod_g f_g(X_i)^{I(G_i=g)}. \]

The M-step solves the following equations for $\theta$:

\[ \sum_{i=1}^N I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i\} = 0, \]

\[ \sum_{i=1}^N I(R_i = 1)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i\} = 0, \tag{A.2} \]

and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

\[ F_g\{X_i\} = \Big[\sum_{j=1}^N I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(X_j = X_i, G_j = g) \mid Y_j\}\Big] \times \Big[\sum_{j=1}^N I(G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(G_j = g) \mid Y_j\}\Big]^{-1}, \]

where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form

\[ \frac{\sum_{(h_k,h_l)\in S(G_i)} \omega(Y_i, X_i, (h_k,h_l))P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_k,h_l)\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)} \]

for $R_i = 1$, and

\[ \sum_{g\in\mathcal{G}} \sum_{x\in\{X_j:\, G_j=g,\, R_j=1\}} \sum_{(h_k,h_l)\in S(g)} \omega(Y_i, x, (h_k,h_l))\,P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k,h_l))\,P_\gamma(h_k,h_l)F_g\{x\} \times \Big(\sum_{g\in\mathcal{G}} \sum_{x\in\{X_j:\, G_j=g,\, R_j=1\}} \sum_{(h_k,h_l)\in S(g)} P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k,h_l))\,P_\gamma(h_k,h_l)F_g\{x\}\Big)^{-1} \]

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components $F_g(\cdot)$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1, with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


A.4 Case-Control Studies With Unknown Population Totals

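A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$ yield the same likelihood,

\[ RL(\theta, \{F_g^\dagger\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A.3} \]

Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x)q_g/\sum_{g\in\mathcal{G}} q_g = \tilde f_g^\dagger(x)\tilde q_g/\sum_{g\in\mathcal{G}} \tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

\[ \eta(y, x, g; \theta) = C(y)\,\eta(y, x, g; \tilde\theta), \tag{A.4} \]

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}): \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that

\[ \eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5} \]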

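for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

\[ \frac{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + 1} = \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + 1}, \tag{A.6} \]

where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results:

1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields

\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}}. \]

Thus (A.6) is equivalent to

\[ \frac{\pi_1(g)/\pi_2(g)}{\pi_1(g')/\pi_2(g')} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(g')/\tilde\pi_2(g')} \quad \text{for all } g, g' \in \mathcal{G}_3. \]

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z}z \ne 0$, (A.6) yields

\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\beta_{3z}z}}{1 + e^{\alpha+\beta_1+\beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\beta_{3z}z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z}z}}. \]

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \tag{A.7} \]

for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.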
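As a reading aid that is not part of the original argument, the monotonicity invoked in result 2 can be verified in one line. For $\lambda > 0$ and a constant $c > 0$,

\[ \frac{d}{d\lambda}\,\frac{\lambda + c}{\lambda + 1} = \frac{(\lambda + 1) - (\lambda + c)}{(\lambda + 1)^2} = \frac{1 - c}{(\lambda + 1)^2}, \]

which has a constant nonzero sign whenever $c \ne 1$. Hence, when the two sides of (A.6) share the same $c = e^{\psi_2(x) - \psi_1(x)} \ne 1$, equality of the two ratios forces equality of the two arguments $\lambda$, which is the step used in results 2 through 4.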

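A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that, under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\Big\}, \tag{A.8} \]

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\Big\}, \tag{A.9} \]

and

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\Big\}, \tag{A.10} \]

where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\Big\}. \]

By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}. \]

By taking the ratio of the two sides and letting $z \to \operatorname{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,

\[ \eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\Big\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x)\Big\} \times \Big[\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}\frac{\pi_1(g)}{\pi_2(g)} + 1 - \Phi(\alpha + \beta_3^T x)\Big]^{-1}. \tag{A.11} \]

It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

\[ \frac{e^{-e^{\alpha}}(e^{e^{\alpha+\beta_z z}} - 1)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1)}{1 - e^{-e^{\tilde\alpha}}}. \]

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.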
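The derivative ratio used for the complementary log–log model can be written out explicitly. The short derivation below is added as a reading aid and is not part of the original proof; the symbols $u$ and $h$ are introduced only for this purpose. Let $u(z) = \alpha + \beta_z z$ and $h(z) = e^{e^{u(z)}} - 1$. Then

\[ h'(z) = \beta_z e^{u}e^{e^{u}}, \qquad h''(z) = \beta_z^2 e^{u}e^{e^{u}}(e^{u} + 1), \qquad \frac{h''(z)}{h'(z)} = \beta_z(e^{\alpha+\beta_z z} + 1). \]

The multiplicative constant $e^{-e^{\alpha}}/(1 - e^{-e^{\alpha}})$ cancels in the ratio, which is why the resulting identity involves only $\alpha$ and $\beta_z$ on each side.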

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

\[ l_n(\theta, \{\delta_j\}) = \sum_{j=1}^J n_{1j}\log\eta(1, x_j, g_j; \theta) - n_1\log\Big\{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j\Big\} + \sum_{j=1}^J n_{+j}\log\delta_j, \]

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

\[ \frac{n_{+j}}{\delta_j} - \frac{n_1\eta(1, x_j, g_j; \theta)}{\sum_{j'=1}^J \eta(1, x_{j'}, g_{j'}; \theta)\delta_{j'}} + \lambda = 0. \]

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

\[ \delta_j = \frac{n_{+j}}{n - n_1 + n_1\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12} \]

where $\mu = \sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

\[ l_n^*(\theta, \mu) = \sum_{j=1}^J n_{1j}\log\eta(1, x_j, g_j; \theta) - \sum_{j=1}^J n_{+j}\log\Big\{\frac{n_1}{n}\eta(1, x_j, g_j; \theta) + \Big(1 - \frac{n_1}{n}\Big)\mu\Big\} + (n - n_1)\log\mu. \]

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_\mu l_n^*(\theta, \mu) + C_n$. If $\hat\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \hat\mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
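The link between the stationarity condition in μ and the constraint on the jump sizes can be checked numerically. The sketch below is an illustration with made-up inputs: the values of η(1, x_j, g_j; θ) and the counts are arbitrary assumptions, and the helper name `solve_mu` is hypothetical. For a fixed θ it solves the stationarity equation in μ by bisection and then forms the jump sizes from (A.12).

```python
def solve_mu(eta, n_plus, n1, n_iter=200):
    """Solve d l*_n / d mu = 0 for mu at fixed theta, written equivalently as
    sum_j n_{+j} mu / {r eta_j + (1 - r) mu} = n with r = n1 / n."""
    n = sum(n_plus)
    r = n1 / n

    def g(mu):
        return sum(m * mu / (r * e + (1.0 - r) * mu)
                   for e, m in zip(eta, n_plus)) - n

    lo, hi = 0.0, 1.0
    while g(hi) <= 0.0:          # g is increasing in mu; expand the bracket
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Assumed inputs: eta_j = eta(1, x_j, g_j; theta) at some fixed theta
eta = [1.0, 1.5, 0.8, 2.0]
n1j = [5, 8, 3, 9]               # case counts at each distinct (x_j, g_j)
n0j = [10, 4, 6, 5]              # control counts
n_plus = [a + b for a, b in zip(n1j, n0j)]
n1, n = sum(n1j), sum(n_plus)

mu_hat = solve_mu(eta, n_plus, n1)
# Jump sizes from (A.12)
delta = [m / (n - n1 + n1 * e / mu_hat) for e, m in zip(eta, n_plus)]
mu_check = sum(e * d for e, d in zip(eta, delta))
```

At the stationary point the jump sizes sum to one and reproduce μ as the weighted sum of the η values, which is the self-consistency the profile argument relies on.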

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define

\[ \zeta_1(x, g; \theta) = \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(x,h_k,h_l)}\{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}, \]

\[ \zeta_0(g; \theta) = \sum_{(h_k,h_l)\in S(g)} \{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}. \]

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

\[ l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^n \{y_i\log\zeta_1(X_i, G_i; \theta) + (1-y_i)\log\zeta_0(G_i; \theta)\} - \sum_{i=1}^n \sum_g I(G_i = g)\log\Big\{\zeta_1(X_i, G_i; \theta) + n_1^{-1}n_g\sum_{g'}\mu_{g'} - \mu_g\Big\} + \sum_{i=1}^n (1-y_i)\log\sum_g \mu_g, \]

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

\[ l_n^*(\theta, \mu) = \sum_{i=1}^n y_i\log\zeta_1(X_i, G_i; \theta) + \sum_{i=1}^n (1-y_i)\log\zeta_0(G_i; \theta) + \sum_{i=1}^n (1-y_i)\log\mu - \sum_{i=1}^n \log\Big\{(1-r)\mu + r\sum_g \zeta_1(X_i, g; \theta)\Big\}, \]

where $r = n_1/n$. Let $H = BQ_1 + (1-B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k): k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l): k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by

\[ P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B,Q_1,Q_2,Y}\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}, \]

where $\vartheta = (-\log\mu + \log\{r/(1-r)\}, \beta^T, \log\pi_1 - \log\{\rho/(1-\rho)\}, \ldots, \log\pi_K - \log\{\rho/(1-\rho)\})^T$ and

\[ W(B, Q_1, Q_2, Y, X) = \Big(Y,\; YZ^T(X, H),\; BI(Q_1 = (h_1,h_1)) + (1-B)\sum_l \{I(Q_2 = (h_1,h_l)) + I(Q_2 = (h_l,h_1))\},\; \ldots,\; BI(Q_1 = (h_K,h_K)) + (1-B)\sum_l \{I(Q_2 = (h_K,h_l)) + I(Q_2 = (h_l,h_K))\}\Big)^T. \]

We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood

\[ l_n^*(\vartheta) = \sum_{i=1}^n \log\Big[\frac{\sum_{BQ_1+(1-B)Q_2\in S(G_i)} e^{\vartheta^T W(B,Q_1,Q_2,Y_i,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Big]. \]

We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

The complete-data score function is

\[ \sum_{i=1}^n \Big[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}\Big]. \]

Thus, in the E-step, we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:

\[ E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i] = \frac{\sum_{b,q_1,q_2} I(bq_1 + (1-b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}\,W(b,q_1,q_2,Y_i,X_i)}{\sum_{b,q_1,q_2} I(bq_1 + (1-b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}}. \]

In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

\[ \vartheta^{(k+1)} = \vartheta^{(k)} - \Gamma^{-1} \times \sum_{i=1}^n \Big[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}\Big], \]

where

\[ \Gamma = -\Big[\sum_{i=1}^n \frac{\sum_{b,q_1,q_2,y} W^{\otimes 2}(b,q_1,q_2,y,X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Big] + \sum_{i=1}^n \Big[\frac{\{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^{\otimes 2}}{\{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^2}\Big] \]

and $a^{\otimes 2} = aa^T$.
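The structure of this E-step followed by a one-step Newton–Raphson M-step can be seen in a scalar toy problem. The sketch below is not the haplotype model itself: it assumes a one-parameter exponential family on a small finite support, with each observation known only up to a set of compatible values, and the function name `em_gradient` and the data are hypothetical. The curvature term is the negative of n times the variance of W, the scalar analogue of the matrix defined above.

```python
import math

def em_gradient(obs_sets, support, theta=0.0, n_iter=200):
    """One-step Newton-Raphson (EM gradient) updates for a scalar exponential
    family with P(w) proportional to exp(theta * w) on a finite support.

    obs_sets: for each subject, the tuple of support values compatible
              with what was observed (the analogue of S(G_i)).
    """
    n = len(obs_sets)
    score = 0.0
    for _ in range(n_iter):
        wts = {w: math.exp(theta * w) for w in support}
        z = sum(wts.values())
        mean = sum(w * v for w, v in wts.items()) / z          # E(W)
        second = sum(w * w * v for w, v in wts.items()) / z    # E(W^2)
        curvature = -n * second + n * mean ** 2                # -n Var(W)
        score = 0.0
        for s in obs_sets:
            # E-step: conditional mean of W given the compatible set
            num = sum(w * wts[w] for w in s)
            den = sum(wts[w] for w in s)
            score += num / den - mean
        # M-step: a single Newton-Raphson step on the expected score
        theta = theta - score / curvature
    return theta, score

support = (0, 1, 2)
obs_sets = [(0,)] * 30 + [(1, 2)] * 50 + [(2,)] * 20
theta_hat, last_score = em_gradient(obs_sets, support)
```

The returned score is the one evaluated just before the final update, so a value close to zero indicates that the iterations have reached a stationary point of the observed-data likelihood.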

A.4.6 Proof of Theorem 2. Write $F_{xg}(x, g) = F_g^\dagger(x)q_g$ and $\hat F_{xg}(x, g) = \hat F_g^\dagger(x)\hat q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F_{xg}^*(x, g) \equiv F_g^*(x)q_g^*$, where $q_g^* > 0$ for any $g$.

Because $\hat F_g^\dagger$ maximizes the likelihood, there exists some Lagrange multiplier $\lambda_g$ such that

\[ \frac{I(G_i = g)}{\hat F_g^\dagger\{X_i\}} - \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)} - n\lambda_g = 0, \]

where $\hat F_g^\dagger\{X_i\}$ denotes the point mass of $\hat F_g^\dagger$ at $X_i$, and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^n \hat F_g^\dagger\{X_i\} = 1$, $\lambda_g$ satisfies the equation

\[ n^{-1}\sum_{i=1}^n I(G_i = g)\Big\{\lambda_g + \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{n\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\Big\}^{-1} = 1 \tag{A.13} \]

and

\[ \min_{1\le i\le n}\Big\{\lambda_g + \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{n\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\Big\} > 0. \]

Clearly, $\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\lambda_g \to \lambda_g^*$.

By (A.13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that

\[ \min_{g,x}\Big|\lambda_g^* + \frac{\varrho\,\eta(1, x, g; \theta^*)q_g^*}{\int_{x,g}\eta(1, x, g; \theta^*)\,dF_{xg}^*(x, g)}\Big| > \delta. \]

Consequently, when $n$ is sufficiently large,

\[ \hat F_g^\dagger(x) = n^{-1}\sum_{i=1}^n I(G_i = g, X_i \le x) \times \Big(\max\Big[\Big|\lambda_g + \eta(1, X_i, g; \hat\theta)\hat q_g n_1 \times \Big\{n\int_{x,g}\eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)\Big\}^{-1}\Big|, \delta\Big]\Big)^{-1}. \]

We define an empirical function $\tilde F_g^\dagger$ whose jump size at $X_i$ is proportional to

\[ n^{-1}I(G_i = g)\Big(P(G = g, Y = 0) + \varrho\,\eta(1, X_i, g; \theta_0)q_g \times \Big\{\int_{x,g}\eta(1, x, g; \theta_0)\,dF_{xg}(x, g)\Big\}^{-1}\Big)^{-1}. \]

Then it can be verified that $\tilde F_g^\dagger$ converges uniformly to $F_g^\dagger$. In addition, $\hat F_g^\dagger$ is absolutely continuous with respect to $\tilde F_g^\dagger$, and the Radon–Nikodym derivative $d\hat F_g^\dagger(x)/d\tilde F_g^\dagger(x)$ is bounded and converges uniformly to $dF_g^*(x)/dF_g^\dagger(x)$. Let $\tilde F_{xg}(x, g) = \tilde F_g^\dagger(x)q_g$, and let $l_n(\theta, \{F_g^\dagger\}, \{q_g\})$ be the log-likelihood based on (8). By the definition of the MLE, $n^{-1}l_n(\hat\theta, \{\hat F_g^\dagger\}, \{\hat q_g\}) - n^{-1}l_n(\theta_0, \{\tilde F_g^\dagger\}, \{q_g\}) \ge 0$. The limit of this difference is the negative Kullback–Leibler information of the distribution for $(\theta^*, \{F_g^*\}, \{q_g^*\})$ with respect to $(\theta_0, \{F_g^\dagger\}, \{q_g\})$ under $P(Y = 1) = \varrho$. The identifiability conditions then yield $\theta^* = \theta_0$, $F_g^* = F_g^\dagger$, and $q_g^* = q_g$. Thus the consistency of $\hat\theta$ is established. Because $F_{xg}$ is continuous, $\sup_{x,g}|\hat F_{xg}(x, g) - F_{xg}(x, g)| \to 0$ almost surely.

The derivation of the asymptotic distribution is similar to the proof of theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, \{F_g^\dagger\}, \{q_g\})$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon\int\psi(x, g)\,dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

\[ n^{1/2}\Big\{(v^T\Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0) + \int (v^T\Omega_{12} + \Omega_{22}[\psi])\,d(\hat F_{xg} - F_{xg})\Big\} \]
\[ = n^{-1/2}\sum_{i=1}^n y_i\Big\{v^T l_\theta(1, X_i, G_i; \theta_0, F_{xg}) + l_F(1, X_i, G_i; \theta_0, F_{xg})\Big[\int\psi\,dF_{xg}\Big]\Big\} \]
\[ \quad + n^{-1/2}\sum_{i=1}^n (1-y_i)\Big\{v^T l_\theta(0, X_i, G_i; \theta_0, F_{xg}) + l_F(0, X_i, G_i; \theta_0, F_{xg})\Big[\int\psi\,dF_{xg}\Big]\Big\} + o_p(1), \]

where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of $x$, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of $\psi$, and $l_\theta$ and $l_F$ are the scores with respect to $\theta$ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through $\varrho$. We can show that the operator $B[v, \psi] \equiv (v^T\Omega_{11} + \Omega_{21}[\psi]^T, v^T\Omega_{12} + \Omega_{22}[\psi])^T$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via $\varrho$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean $\varrho$. By choosing some $\psi$ such that $B[v, \psi] = (v^T, 0)^T$ for all $v$, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

\[ \frac{dP_n}{d\tilde P_n} = \exp\Big\{a_n\sum_{i=1}^n \frac{\partial\log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a}\Big|_{a=0} + o(1)\Big\} \to_{\tilde P_n} 1. \]

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

A.5 Cohort Studies

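A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

\[ \sum_{H\in S(G)} \big\{\tilde\Lambda'(Y)e^{\tilde\beta^T Z(X,H)}\dot Q\big(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)}\big)\big\}^{\Delta} \times \big\{1 - Q\big(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)}\big)\big\}^{1-\Delta}P_\gamma(H) \]
\[ = \sum_{H\in S(G)} \big\{\Lambda'(Y)e^{\beta^T Z(X,H)}\dot Q\big(\Lambda(Y)e^{\beta^T Z(X,H)}\big)\big\}^{\Delta} \times \big\{1 - Q\big(\Lambda(Y)e^{\beta^T Z(X,H)}\big)\big\}^{1-\Delta}P_\gamma(H). \]

By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain

\[ \sum_{H\in S(G)} Q\big(\tilde\Lambda(\tau)e^{\tilde\beta^T Z(X,H)}\big)P_\gamma(H) = \sum_{H\in S(G)} Q\big(\Lambda(\tau)e^{\beta^T Z(X,H)}\big)P_\gamma(H). \]

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)} = \Lambda(Y)e^{\beta^T Z(X,H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.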

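A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

\[ \mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\Big[\int\psi\,d\Lambda_0\Big] = 0, \tag{A.14} \]

where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int\psi\,d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon\int\psi\,d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain

\[ \sum_{H\in S(G)} Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\big)P_\gamma(H) \times \Big\{\frac{\dot Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_\beta^T Z(X,H)}{Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} + \frac{\dot Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\int_0^\tau\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} + \mu_\gamma^T\nabla_\gamma\log P_\gamma(H)\Big\} = 0. \tag{A.15} \]

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

\[ \sum_{H\in S(G)} \big\{1 - Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\big)\big\}P_\gamma(H) \times \Big\{-\frac{\dot Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_\beta^T Z(X,H)}{1 - Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} - \frac{\dot Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\int_0^\tau\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{1 - Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} + \mu_\gamma^T\nabla_\gamma\log P_\gamma(H)\Big\} = 0. \tag{A.16} \]

The summation of (A.15) and (A.16) entails $\mu_\gamma^T\nabla_\gamma\log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

\[ \psi(Y) + \frac{\ddot Q(\Lambda_0(Y)e^{\beta_0^T Z(X,H)})\int_0^Y\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{\dot Q(\Lambda_0(Y)e^{\beta_0^T Z(X,H)})} = 0 \]

for $H = (h, h)$. Therefore $\psi = 0$.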

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), "Prediction and Entropy," in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), "Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?" European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), "Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease," Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), "Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling," The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), "Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations," Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), "Regression Models and Life-Tables" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), "Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data," American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), "Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population," Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), "Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer's Disease," Genome Research, 11, 143–151.

Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), "Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constitute Loci," Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), "Initial Sequencing and Analysis of the Human Genome," Nature, 409, 860–921.


International SNP Map Working Group (2001), "A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms," Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), "Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous," Human Heredity, 55, 56–65.

Li, H. (2001), "A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants," Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), "Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach," Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), "Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals," Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), "An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies," Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), "On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles," Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), "On Profile Likelihood," Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), "Semiparametric Mixtures in Case-Control Studies," Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), "Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), "Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21," Science, 294, 1719–1723.

Pettitt, A. N. (1984), "Proportional Odds Models for Survival Data and Estimates Using Ranks," Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), "Logistic Disease Incidence Models and Case-Control Studies," Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), "Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), "Searching for Genetic Determinants in the New Millennium," Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), "A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables," Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), "Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies," Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), "Evaluating Associations of Haplotypes With Traits," Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), "Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous," American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), "Fitting Regression Models to Case-Control Data by Maximum Likelihood," Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), "Evolutionary-Based Association Analysis Using Haplotype Data," Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), "A New Statistical Method for Haplotype Reconstruction From Population Data," American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), "Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals," Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), "Mapping Genes for NIDDM," Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), "Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling," Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), "The Sequence of the Human Genome," Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), "On the Use of DNA Pooling to Estimate Haplotype Frequencies," Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), "Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals," Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), "Estimating Haplotype-Disease Associations With Pooled Genotype Data," Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), "Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data," technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), "Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data," American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), "A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies," American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetic background on the problem at hand. Through epidemiological studies (where, e.g., one compares the risk of siblings or twins of affected individuals with the population prevalence), we can identify that some diseases have a clear genetic component; that is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications vary in nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance: it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and, eventually, to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817.

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are, in the vast majority, due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and to understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effects estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information, when available, or by assuming linkage disequilibrium (LD) and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting that one simultaneously impute phase and estimate haplotype effects. This is an important point. Unfortunately, in much of the genetics literature inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice, or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
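To make the role of EM with missing phase concrete, the following is a minimal sketch of the classic haplotype-frequency EM for two biallelic SNPs under Hardy–Weinberg equilibrium. It is not Lin and Zeng's algorithm, which also involves the phenotype and covariates; the function name `em_haplotype_freqs` and the data layout (a dictionary from per-SNP allele counts to numbers of subjects) are assumptions made for illustration. With two SNPs, only the double heterozygote has ambiguous phase, so the E-step reduces to splitting those subjects between the two compatible haplotype pairs.

```python
def em_haplotype_freqs(geno_counts, init=None, max_iter=500, tol=1e-12):
    """Estimate frequencies of haplotypes 00, 01, 10, 11 from unphased genotypes.

    geno_counts maps (g1, g2) -> number of subjects, where g1 and g2 are the
    counts (0, 1, or 2) of allele 1 at SNP 1 and SNP 2. Assumes HWE.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}  # allele pair implied by each count
    n = sum(geno_counts.values())

    # Haplotype counts from subjects whose phase is unambiguous.
    fixed = dict.fromkeys(haps, 0.0)
    n_double_het = 0.0
    for (g1, g2), count in geno_counts.items():
        if (g1, g2) == (1, 1):
            n_double_het += count  # phase unknown: 00/11 or 01/10
        else:
            a, b = alleles[g1], alleles[g2]
            fixed[(a[0], b[0])] += count
            fixed[(a[1], b[1])] += count

    p = dict(init) if init else dict.fromkeys(haps, 0.25)
    for _ in range(max_iter):
        # E-step: posterior probability that a double heterozygote is 00/11.
        cis = p[(0, 0)] * p[(1, 1)]
        trans = p[(0, 1)] * p[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        counts = dict(fixed)
        counts[(0, 0)] += n_double_het * w
        counts[(1, 1)] += n_double_het * w
        counts[(0, 1)] += n_double_het * (1.0 - w)
        counts[(1, 0)] += n_double_het * (1.0 - w)
        new_p = {h: counts[h] / (2.0 * n) for h in haps}
        delta = max(abs(new_p[h] - p[h]) for h in haps)
        p = new_p
        if delta < tol:
            break
    return p
```

A call such as `em_haplotype_freqs({(0, 0): 30, (1, 1): 40, (2, 2): 30})` returns a dictionary of four frequencies that sum to one. Treating the resulting most likely phase as observed is exactly the practice the paragraph above cautions against.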

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If, for each value of covariates $X$, we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model 3 applies marginally. However, conditionally on covariates $X$, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x; \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes $g$, rather than the distribution of haplotypes conditional on $X$. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when $X$ and $G$ are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that $X$ and $H$ are independent at the population level, not just conditional on $G$.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model 3 is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826.

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently, we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence of the distribution of $H$ given $G$ in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835.


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X, \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d, \theta), \tag{2}$$

where the model $q(h^d, \theta)$, in turn, could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \operatorname{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

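$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1-\pi)\}$$

and

$$S(d, h, x) = \frac{q(h, x; \theta)\exp[d\{\kappa + m(h, x; \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x; \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h}\sum_{d} S(d, h, X)}.$$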
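As a rough illustration of these definitions, the following Python sketch evaluates $\kappa$, $S(d, h, x)$, and $L^*(D, G, X)$ for a toy problem. It is only a sketch under stated assumptions: $q$ is taken from HWE as in model (2), the risk function `m` is a hypothetical additive form chosen for the example, and all function names are illustrative rather than taken from Spinka et al. (2005).

```python
import math
from itertools import combinations_with_replacement


def kappa(beta0, n1, n0, pi_d):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(pi_d / (1 - pi_d))


def hwe_q(pi):
    """Diplotype probabilities q(h; theta) under HWE, for unordered pairs."""
    q = {}
    for a, b in combinations_with_replacement(range(len(pi)), 2):
        q[(a, b)] = pi[a] ** 2 if a == b else 2 * pi[a] * pi[b]
    return q


def m(h, x, beta1):
    # Hypothetical risk model (an assumption of this sketch): additive
    # effect of haplotype 0 plus a linear effect of the covariate x.
    b_hap, b_x = beta1
    return b_hap * h.count(0) + b_x * x


def S(d, h, x, q, beta0, kap, beta1):
    """S(d, h, x) = q(h) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    eta = m(h, x, beta1)
    return q[h] * math.exp(d * (kap + eta)) / (1 + math.exp(beta0 + eta))


def L_star(D, H_G, x, q, beta0, kap, beta1):
    """Sum of S over diplotypes consistent with G, normalized over all (h, d)."""
    num = sum(S(D, h, x, q, beta0, kap, beta1) for h in H_G)
    den = sum(S(d, h, x, q, beta0, kap, beta1) for h in q for d in (0, 1))
    return num / den


q = hwe_q([0.2, 0.3, 0.5])
kap = kappa(beta0=-4.0, n1=500, n0=500, pi_d=0.02)
# Contribution of a case whose genotype is consistent with two diplotypes
contribution = L_star(1, [(0, 2), (1, 1)], 0.0, q, -4.0, kap, (0.5, 0.1))
```

Because the denominator sums over every diplotype and both disease states, the values of `L_star` add up to 1 over all disease states and over any partition of the diplotypes into genotype classes, which is a convenient sanity check.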

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \operatorname{Pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\operatorname{pr}(D = d)$. Let $R = 1$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

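$$l^* = \sum_{i=1}^{N} \log \Biggl\{ \sum_{h^d \in \mathcal{H}_{G_i}} \operatorname{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1) \times \operatorname{pr}(H_i^d = h^d \mid X_i, R_i = 1) \Biggr\} = \sum_{i=1}^{N} \log \operatorname{pr}(D_i, G_i \mid X_i, R_i = 1). \tag{5}$$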

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form $\prod_i \operatorname{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with $F(x)$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

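$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \Biggr)^{-1}. \tag{6}$$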

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

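$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \Biggr)^{-1}, \tag{7}$$

where

$$r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x; \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x; \beta_1)\}}.$$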
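The difference between (6) and (7) lies in the weights attached to the diplotypes in $\mathcal{H}_{G_i}$. The following Python sketch compares the two sets of weights numerically. It assumes, as a reading of the notation rather than something stated explicitly above, that $\operatorname{pr}_{\beta^*}(D \mid h^d, X)$ is the logistic model with intercept $\kappa$; under that assumption the product $\operatorname{pr}_{\beta^*} \times r$ reduces to $\exp[d\{\kappa + m\}]/[1 + \exp\{\beta_0 + m\}]$, that is, to $S/q$. Function names and inputs are illustrative.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def pr_beta_star(d, eta, kap):
    """Prospective logistic probability with intercept kappa (assumed form)."""
    p1 = expit(kap + eta)
    return p1 if d == 1 else 1.0 - p1


def r(eta, beta0, kap):
    """r = [1 + exp{kappa + m}] / [1 + exp{beta0 + m}], with eta = m(h, x; beta1)."""
    return (1.0 + math.exp(kap + eta)) / (1.0 + math.exp(beta0 + eta))


def s_over_q(d, eta, beta0, kap):
    """exp[d{kappa + m}] / [1 + exp{beta0 + m}], i.e. S(d, h, x) / q(h)."""
    return math.exp(d * (kap + eta)) / (1.0 + math.exp(beta0 + eta))


def diplotype_weights(d, etas, qs, beta0, kap, modified):
    """Normalized weights over the diplotypes consistent with one genotype.

    etas[j] is m(h_j, x; beta1) and qs[j] is q(h_j; theta) for diplotype j.
    modified=False gives the weights in (6); modified=True those in (7).
    """
    raw = [
        pr_beta_star(d, e, kap) * (r(e, beta0, kap) if modified else 1.0) * q
        for e, q in zip(etas, qs)
    ]
    total = sum(raw)
    return [v / total for v in raw]


etas, qs = [0.0, 0.5, 1.0], [0.5, 0.3, 0.2]
w_prospective = diplotype_weights(1, etas, qs, beta0=-4.0, kap=-0.1, modified=False)
w_modified = diplotype_weights(1, etas, qs, beta0=-4.0, kap=-0.1, modified=True)
```

When $\beta_0 = \kappa$ the factor $r$ equals 1 and the two sets of weights coincide; otherwise they differ, which is where the bias of the purely prospective equation enters.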

Spinka et al. showed that, with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).

2. COHORT-BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X; \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X; \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\operatorname{Pr}(H^d) = q(H^d; \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

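$$\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}$$

where

$$R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\} = E\{R(H^d, X; \beta_1) \mid G, X, T \ge t\} = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\, \operatorname{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} \operatorname{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}.$$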

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for rare disease, one could assume that $\operatorname{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function

$$R^*(G, X; \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d; \theta)}$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X; \beta_1, \hat\theta)$, where $\hat\theta$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
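To make the induced relative risk under the rare disease approximation concrete, here is a small Python sketch for a subject who is heterozygous at two SNPs, so that two diplotypes are consistent with the genotype. The haplotype frequencies, the log-linear form of $R$, and the value of $\beta_1$ are illustrative assumptions, not values from Chen and Chatterjee (2005).

```python
import math


def hwe_prob(h, pi):
    """q(H; theta) under HWE for an unordered pair of haplotype labels."""
    a, b = h
    return pi[a] ** 2 if a == b else 2 * pi[a] * pi[b]


def induced_relative_risk(H_G, q, rel_risk):
    """R*(G, X; beta1, theta): q-weighted average of R over diplotypes in H_G."""
    num = sum(rel_risk(h) * q[h] for h in H_G)
    den = sum(q[h] for h in H_G)
    return num / den


# Illustrative haplotype frequencies for two biallelic SNPs
pi = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}

# The double heterozygote is consistent with exactly these two diplotypes
H_G = [("00", "11"), ("01", "10")]
q = {h: hwe_prob(h, pi) for h in H_G}

beta1 = math.log(2.0)  # hypothetical log relative risk per copy of haplotype "11"


def rel_risk(h):
    # Assumed log-linear relative risk in the number of copies of "11"
    return math.exp(beta1 * h.count("11"))


r_star = induced_relative_risk(H_G, q, rel_risk)
```

With these inputs the two diplotypes have HWE probabilities .24 and .04 and relative risks 2 and 1, so the induced relative risk is a weighted average lying between 1 and 2. In the pseudolikelihood approach, a function of this kind would replace $R$ in the partial likelihood, with $\theta$ replaced by a consistent estimate.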

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases, such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X; \beta_1, \hat\theta)$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), “Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions,” Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), “Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions and Joint Effects,” Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Prentice, R. L. (1982), “Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model,” Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), “Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity,” Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of “Hardy–Weinberg equilibrium (HWE)” in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects, rather than estimating the size of such effects.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits “inbreeding” (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider “inbreeding.” This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
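The inbreeding model above can be tabulated directly. The Python sketch below computes diplotype probabilities for a given $\rho$ and reports the total homozygosity. The formula for homozygotes is the one stated in the text; the formula used for heterozygotes, $(1 - \rho)\,2\pi_k\pi_l$, is the standard companion that makes the probabilities sum to 1 and is an assumption of this sketch, as are the illustrative haplotype frequencies.

```python
from itertools import combinations_with_replacement


def diplotype_probs(pi, rho):
    """Diplotype probabilities under inbreeding coefficient rho.

    Homozygotes:   pi_kk = rho * pi_k + (1 - rho) * pi_k**2   (from the text)
    Heterozygotes: pi_kl = (1 - rho) * 2 * pi_k * pi_l         (assumed here)
    """
    probs = {}
    for k, l in combinations_with_replacement(range(len(pi)), 2):
        if k == l:
            probs[(k, l)] = rho * pi[k] + (1 - rho) * pi[k] ** 2
        else:
            probs[(k, l)] = (1 - rho) * 2 * pi[k] * pi[l]
    return probs


def homozygosity(probs):
    """Total probability that the two haplotypes of an individual match."""
    return sum(p for (k, l), p in probs.items() if k == l)


pi = [0.2, 0.3, 0.5]  # illustrative haplotype frequencies
for rho in (0.0, 0.015, 0.05):
    print(rho, round(homozygosity(diplotype_probs(pi, rho)), 4))
```

Setting `rho = 0` recovers HWE, and the homozygosity rises linearly in $\rho$ from $\sum_k \pi_k^2$ toward 1, which gives a feel for how small the departure from HWE is at the values of $\rho$ reported for human populations.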

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).
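A back-of-envelope calculation shows why an inbreeding level of .05 is demanding. The sketch below uses the standard inbreeding coefficients for offspring of first cousins (1/16) and second cousins (1/64), and assumes, for simplicity, that the population coefficient is the average over marriages with no background inbreeding. The function names are illustrative.

```python
F_FIRST_COUSINS = 1 / 16   # inbreeding coefficient, offspring of first cousins
F_SECOND_COUSINS = 1 / 64  # inbreeding coefficient, offspring of second cousins


def mean_inbreeding(frac_first, frac_second):
    """Average inbreeding coefficient; all other marriages count as unrelated."""
    return frac_first * F_FIRST_COUSINS + frac_second * F_SECOND_COUSINS


def first_cousin_fraction_needed(target, rest_second_cousins=False):
    """Fraction of first-cousin marriages needed to reach a target average.

    If rest_second_cousins is True, every remaining marriage is assumed to be
    between second cousins; otherwise the remainder is assumed unrelated.
    """
    base = F_SECOND_COUSINS if rest_second_cousins else 0.0
    return (target - base) / (F_FIRST_COUSINS - base)


needed_unrelated_rest = first_cousin_fraction_needed(0.05)
needed_second_cousin_rest = first_cousin_fraction_needed(0.05, rest_second_cousins=True)
```

Under these simplifying assumptions, well over half of all marriages would have to be between first cousins to reach an average of .05, even if every other marriage were between second cousins.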

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of “homozygotes” in control subjects, leading to strong violations of HWE. Presumably the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good-quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
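The confounding mechanism described above can be reproduced with a few lines of arithmetic. In the Python sketch below, a binary marker is independent of disease within each of two subpopulations, yet the pooled odds ratio departs markedly from 1 because both marker frequency and disease prevalence differ between the subpopulations. All numbers are made up for illustration.

```python
def joint_table(strata):
    """Pooled joint distribution of (marker, disease).

    strata: list of (weight, marker_freq, disease_prev) tuples; within each
    stratum the marker and the disease are independent by construction.
    """
    cells = {(m, d): 0.0 for m in (0, 1) for d in (0, 1)}
    for weight, marker_freq, prev in strata:
        for m in (0, 1):
            for d in (0, 1):
                p_m = marker_freq if m == 1 else 1 - marker_freq
                p_d = prev if d == 1 else 1 - prev
                cells[(m, d)] += weight * p_m * p_d
    return cells


def odds_ratio(cells):
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])


# Two equally sized subpopulations (illustrative values)
pop_a = (0.5, 0.8, 0.10)  # marker common, disease common
pop_b = (0.5, 0.2, 0.02)  # marker rare, disease rare

or_within_a = odds_ratio(joint_table([(1.0, 0.8, 0.10)]))
or_within_b = odds_ratio(joint_table([(1.0, 0.2, 0.02)]))
or_pooled = odds_ratio(joint_table([pop_a, pop_b]))
```

The within-stratum odds ratios are exactly 1, while the pooled odds ratio exceeds 2, which is the spurious association that stratified or matched designs are meant to remove.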

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider “testing” versus “estimating” the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the “needle in a haystack” nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and controls using (12) from LZ, with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification: We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ's rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.

Table 1. Bias and Mean Squared Error (MSE) of $\hat\beta_1$

β1    Scenario                          Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario “Baseline” refers to ρ = 0, no ascertainment bias, and one causal haplotype, $h_1$.
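Since MSE equals squared bias plus variance, the entries of Table 1 also indicate how much of each MSE is attributable to bias. The short Python sketch below performs that decomposition, reading the table entries as decimals (for example, bias .273 and MSE .083 for scenario B with $\beta_1 = .5$). The dictionary layout and function names are illustrative.

```python
# (bias, MSE) pairs read from Table 1, keyed by (true beta1, scenario)
TABLE1 = {
    (0.5, "Baseline"): (-0.002, 0.010),
    (0.5, "A"): (0.023, 0.012),
    (0.5, "B"): (0.273, 0.083),
    (0.5, "C"): (-0.217, 0.058),
    (0.0, "Baseline"): (-0.002, 0.013),
    (0.0, "A"): (0.001, 0.012),
    (0.0, "B"): (0.002, 0.008),
    (0.0, "C"): (0.002, 0.012),
}


def implied_variance(bias, mse):
    """Variance implied by MSE = bias**2 + variance."""
    return mse - bias ** 2


def bias_share(bias, mse):
    """Fraction of the MSE that is due to squared bias."""
    return bias ** 2 / mse


for (beta1, scenario), (bias, mse) in TABLE1.items():
    print(beta1, scenario,
          round(implied_variance(bias, mse), 6),
          round(bias_share(bias, mse), 3))
```

For $\beta_1 = .5$, squared bias accounts for most of the MSE in scenarios (B) and (C) but for only a small fraction under inbreeding, in line with the discussion of the table.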

The bias of $\hat\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), “Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus,” American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), “Use of Unphased Multilocus Genotype Data in Indirect Association Studies,” Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), “Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power,” Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), “No Excess of Homozygosity at Loci Used for DNA Fingerprinting,” Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), “Genomic Control for Association Studies,” Biometrics, 55, 997–1004.

Donbak, L. (2004), “Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact,” Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), “The Transmission/Disequilibrium Test: History, Subdivision, and Admixture,” American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), “Genome Association Studies of Complex Diseases by Case-Control Designs,” American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), “Genetic Dissection of Complex Traits,” Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.


Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), “Analysis of Single-Locus Tests to Detect Gene/Disease Associations,” Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), “TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes,” American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), “Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching,” Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.
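For readers less familiar with the EM idea referred to in this discussion, the Python sketch below implements the textbook EM iteration for haplotype frequencies from unphased genotypes at two biallelic SNPs under HWE. It covers only the phase-ambiguity part of the missing-data problem; it is not the NPMLE procedure envisioned for case-cohort or nested case-control designs, and it is not LZ's algorithm. The genotype coding (minor-allele counts per SNP) and all names are assumptions of the sketch.

```python
from itertools import product

# The four possible haplotypes at two biallelic SNPs (alleles coded 0/1)
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(g):
    """Ordered haplotype pairs (a, b) whose allele counts add up to genotype g."""
    return [
        (a, b)
        for a, b in product(HAPS, repeat=2)
        if all(a[j] + b[j] == g[j] for j in range(len(g)))
    ]


def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies under HWE.

    genotypes: list of tuples (g1, g2), each entry the count (0, 1, or 2)
    of allele 1 at that SNP.
    """
    pi = {h: 1.0 / len(HAPS) for h in HAPS}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPS}
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each phase resolution under HWE
            weights = [pi[a] * pi[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes
        pi = {h: counts[h] / (2 * len(genotypes)) for h in HAPS}
    return pi


# Only the double heterozygote (1, 1) is phase-ambiguous
data = [(1, 1)] * 4 + [(0, 0)] * 3 + [(2, 2)] * 3
freqs = em_haplotype_freqs(data)
```

In this toy dataset the unambiguous genotypes carry only haplotypes 00 and 11, so the iteration assigns the double heterozygotes almost entirely to the 00/11 resolution. Extending the E-step to unobserved genotypes and adding a baseline hazard to the M-step would be the kind of development the text calls for.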

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs is typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000853


the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods for the analysis of high-dimensional genomic data to select relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as “metadata,” usually defined as “data about data.” For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides a way of mitigating the problem of a large number of potential interactions, by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), “Mapping Complex Disease Loci in Whole-Genome Association Studies,” Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least Angle Regression,” The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), “Gradient Directed Regularization,” technical report, Stanford University.

Gui, J., and Li, H. (2005a), “Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics,” Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), “Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data,” Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), “The Hallmarks of Cancer,” Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), “Tree-Structured Supervised Learning and the Genetics of Hypertension,” Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), “The KEGG Databases at GenomeNet,” Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), “Identifying Interacting SNPs Using Monte Carlo Logic Regression,” Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), “Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data,” Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), “Logic Regression,” Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.


Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
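The window scan with a permutation adjustment for multiple comparisons can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, genotypes are coded as minor-allele counts, and `window_stat` is a simple placeholder statistic standing in for the likelihood-ratio test, which is not implemented here. The adjustment compares each observed window statistic with the permutation distribution of the maximum statistic over all windows.

```python
import random


def sliding_windows(n_markers, width):
    """Index tuples for all consecutive marker windows of the given width."""
    return [tuple(range(s, s + width)) for s in range(n_markers - width + 1)]


def window_stat(genotypes, status, window):
    """Placeholder statistic (NOT a likelihood-ratio test): absolute
    case-control difference in mean minor-allele count over the window."""
    def mean_count(group):
        rows = [g for g, y in zip(genotypes, status) if y == group]
        return sum(sum(g[j] for j in window) for g in rows) / len(rows)
    return abs(mean_count(1) - mean_count(0))


def max_stat_permutation(genotypes, status, width, n_perm=200, seed=0,
                         stat=window_stat):
    """Adjusted p-value for each window, based on the permutation
    distribution of the maximum statistic across windows."""
    rng = random.Random(seed)
    windows = sliding_windows(len(genotypes[0]), width)
    observed = [stat(genotypes, status, w) for w in windows]
    labels = list(status)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute disease labels; genotypes stay fixed
        max_null.append(max(stat(genotypes, labels, w) for w in windows))
    adjusted = [(1 + sum(m >= obs for m in max_null)) / (n_perm + 1)
                for obs in observed]
    return windows, observed, adjusted
```

Any other window-level statistic can be passed through the `stat` argument, as long as it takes the genotypes, the (possibly permuted) labels, and a window.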

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

their work. We have also greatly benefited from personal communications with them.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype–environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situation of conditional independence and that of unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence, under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative-risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative-risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative-risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problems with the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR’s view that testing for the presence of a genetic effect is more realistic than obtaining refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single-marker analysis depends on a number of factors, including the nature of the SNP–disease association, the number and positions of disease-causing SNPs, the extent and strength of linkage disequilibrium, and the selection of markers. Haplotype analysis is likely to be more powerful than single-marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


2.2 Cross-Sectional Studies

There is a random sample of $n$ individuals from the underlying population. The observable data consist of $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$. The trait $Y$ can be discrete or continuous, univariate or multivariate. As stated in Section 2.1, the conditional density of $Y$ given $X$ and $H$ is given by $P_{\alpha,\beta,\xi}(Y \mid X, H)$. For a univariate trait, this regression model may take the form of a generalized linear model (McCullagh and Nelder 1989) with the linear predictor given in (1). If the trait is measured repeatedly in a longitudinal study, then generalized linear mixed models (Diggle, Heagerty, Liang, and Zeger 2002, chap. 9) may be used. The following conditions are required for estimating $(\alpha, \beta, \xi)$.

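Condition 1. If $P_{\alpha,\beta,\xi}(Y \mid X, H) = P_{\tilde\alpha,\tilde\beta,\tilde\xi}(Y \mid X, H)$ for any $H = (h_k, h_k)$ and $H = (h_k, h_{\tilde k})$, $k = 1, \ldots, K$, then $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\xi = \tilde\xi$.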

Condition 2. If there exists a constant vector $\nu$ such that $\nu^T \nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y \mid X, H) = 0$ for $H = (h_k, h_k)$ and $H = (h_k, h_{\tilde k})$, then $\nu = 0$.

Remark 1. Condition 1 ensures that the parameters of interest are identifiable from the genotype data. The linear independence of the score function stated in Condition 2 ensures nonsingularity of the information matrix. The reason for considering $H = (h_k, h_k)$ and $H = (h_k, h_{\tilde k})$ is that these haplotype pairs can be inferred with certainty because of the unique decompositions of the corresponding genotypes $g = 2h_k$ and $g = h_k + h_{\tilde k}$. All of the commonly used regression models, particularly generalized linear (mixed) models with linear predictors in the form of (1), satisfy Conditions 1 and 2.

We show in Section A.2.1 that it is possible to estimate the regression parameters without imposing any structure on the joint distribution of $H$. But this estimation requires knowledge of whether or not the dominant effects exist. Specifically, if there are no dominant effects, then only $(\alpha, \beta, \xi)$ and $P(G = g)$ are identifiable; otherwise, $(\alpha, \beta, \xi)$, $P(G = g)$, and $P(H = (h^*, g - h^*))$ are identifiable. If either (3) or (4) holds, then it follows from Lemma 1 and Condition 1 that all of the parameters are identifiable regardless of the genetic mechanism. Denote the joint distribution of $H$ by $P_\gamma(H = (h_k, h_l))$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$. Under (3) or (4), $\gamma = (\rho, \pi_1, \ldots, \pi_K)^T$. When the distribution of $H$ is unspecified, $\gamma$ pertains to the aspects of the distribution of $H$ that are identifiable.

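Write $\theta = (\alpha, \beta, \gamma, \xi)$. The likelihood for $\theta$ based on the cross-sectional data is proportional to

$$L_n(\theta) \equiv \prod_{i=1}^{n} \prod_{g \in \mathcal{G}} \{m_g(Y_i, X_i; \theta)\}^{I(G_i = g)}, \tag{5}$$

where

$$m_g(y, x; \theta) = \sum_{(h_k, h_l) \in S(g)} P_{\alpha,\beta,\xi}\bigl(y \mid x, (h_k, h_l)\bigr) P_\gamma(h_k, h_l).$$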
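As an illustration of how $m_g$ sums over the phase-compatible haplotype pairs, the following minimal sketch evaluates it for a binary trait. Several pieces are assumptions of the sketch rather than the article's specification: genotypes are coded as allele counts per SNP, $S(g)$ is enumerated as ordered pairs, the regression model is a logistic model with an additive effect of a hypothetical `target` haplotype and one scalar covariate, and the `rho` term is one plausible inbreeding-type departure from HWE (with `rho = 0` giving HWE).

```python
import math
from itertools import product


def compatible_pairs(g):
    """All ordered haplotype pairs (h_k, h_l) with h_k + h_l = g, i.e. S(g)."""
    per_locus = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}
    pairs = []
    for combo in product(*(per_locus[c] for c in g)):
        h_k = tuple(a for a, _ in combo)
        h_l = tuple(b for _, b in combo)
        pairs.append((h_k, h_l))
    return pairs


def pair_prob(h_k, h_l, freqs, rho=0.0):
    """P_gamma(h_k, h_l): HWE when rho = 0; the rho term is an assumed
    one-parameter departure that puts extra mass on homozygous pairs."""
    base = (1.0 - rho) * freqs[h_k] * freqs[h_l]
    return base + (rho * freqs[h_k] if h_k == h_l else 0.0)


def trait_prob(y, x, h_k, h_l, alpha, beta_h, beta_x, target):
    """Hypothetical logistic model: additive in copies of `target`, plus x."""
    copies = int(h_k == target) + int(h_l == target)
    p1 = 1.0 / (1.0 + math.exp(-(alpha + beta_h * copies + beta_x * x)))
    return p1 if y == 1 else 1.0 - p1


def m_g(y, x, g, freqs, alpha, beta_h, beta_x, target, rho=0.0):
    """Sum of P(y | x, (h_k, h_l)) * P_gamma(h_k, h_l) over S(g)."""
    return sum(
        trait_prob(y, x, h_k, h_l, alpha, beta_h, beta_x, target)
        * pair_prob(h_k, h_l, freqs, rho)
        for h_k, h_l in compatible_pairs(g)
    )
```

For fixed `x`, summing `m_g` over both trait values and all genotypes returns 1, which is a convenient sanity check on the enumeration of $S(g)$.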

The MLE $\hat\theta$ can be obtained by maximizing (5) via the Newton–Raphson algorithm or an optimization algorithm. It is generally more efficient to use the expectation–maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), especially when the distribution of $H$ satisfies (3) with $\rho \ge 0$; see Section A.2.2 for details.
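To give a feel for the E- and M-steps in the missing-phase setting, here is a minimal sketch of EM for haplotype frequencies alone. It is a deliberately simplified special case and not the algorithm of Section A.2.2: it assumes HWE, ignores the trait and covariates, codes genotypes as allele counts per SNP, and uses hypothetical function names.

```python
from itertools import product


def phase_resolutions(g):
    """All ordered haplotype pairs whose sum equals the unphased genotype g."""
    per_locus = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}
    return [
        (tuple(a for a, _ in combo), tuple(b for _, b in combo))
        for combo in product(*(per_locus[c] for c in g))
    ]


def em_haplotype_freqs(genotypes, n_iter=100):
    """EM estimate of haplotype frequencies under HWE (no trait model)."""
    n_snps = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_snps))
    freqs = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for g in genotypes:
            pairs = phase_resolutions(g)
            # E-step: posterior weight of each phase resolution given g
            weights = [freqs[a] * freqs[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes
        freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freqs
```

When every genotype has at most one heterozygous site, the phase is known and the iteration reduces to simple haplotype counting.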

By the classical likelihood theory, we can show that $\hat\theta$ is consistent, asymptotically normal, and asymptotically efficient under Conditions 1 and 2 and the following condition.

Condition 3. If there exists a constant vector $\nu$ such that $\nu^T \nabla_\theta \log m_g(Y, X; \theta_0) = 0$, then $\nu = 0$.

Remark 2. Condition 3 ensures the nonsingularity of the information matrix. This condition can be easily verified when the joint distribution of $H$ is unspecified and is implied by Lemma 1 and Condition 2 when the distribution satisfies (3) or (4).

2.3 Case-Control Studies With Known Population Totals

We consider case-control data supplemented by information on population totals (Scott and Wild 1997). There is a finite population of $N$ individuals that is considered a random sample from the joint distribution of $(Y, X, H)$, where $Y$ is a categorical response variable. All that is known about this finite population is the total number of individuals in each category of $Y = y$. A sample of size $n$ stratified on the disease status is drawn from the finite population, and the values of $X$ and $G$ are recorded for each sampled individual. The supplementary information on population totals is often available from hospital records, cancer registries, and official statistics. If a case-control sample is drawn from a cohort study, then the cohort serves as the finite population. The observable data consist of $(Y_i, R_i, R_i X_i, R_i G_i)$, $i = 1, \ldots, N$, where $R_i$ indicates, by the values 1 versus 0, whether or not the $i$th individual in the finite population is selected into the case-control sample.

The association between $Y$ and $(X, H)$ is characterized by $P_{\alpha,\beta,\xi}(Y \mid X, H)$, where $\alpha$, $\beta$, and $\xi$ pertain to the intercept(s), regression effects, and overdispersion parameters (McCullagh and Nelder 1989). In the case of a binary response variable, important examples of $P_{\alpha,\beta,\xi}(Y \mid X, H)$ include the logistic, probit, and complementary log–log regression models. When there are more than two categories, examples include the proportional odds, multivariate probit, and multivariate logistic regression models. Because the data associated with $R_i = 1$ yield the same form of likelihood as that of a cross-sectional study and the data associated with $R_i = 0$ yield a missing-data likelihood, all of the identifiability results stated in Section 2.2 apply to the current setting. We again write $\theta = (\alpha, \beta, \xi, \gamma)$, where $\gamma$ consists of the identifiable parameters in the distribution of $H$.

Let $F_g(\cdot)$ be the cumulative distribution function of $X$ given $G = g$, and let $f_g(x)$ be the density of $F_g(x)$ with respect to a dominating measure $\mu(x)$. Note that $F_g(\cdot)$ is infinite-dimensional if $X$ has continuous components. The joint density of $(Y = y, G = g, X = x)$ is $m_g(y, x; \theta) f_g(x)$. The likelihood concerning $\theta$ and $\{F_g\}$ takes the form

$$L_n(\theta, \{F_g\}) = \prod_{i=1}^{N} \Biggl[\prod_{g \in \mathcal{G}} \{m_g(Y_i, X_i; \theta) f_g(X_i)\}^{I(G_i = g)}\Biggr]^{R_i} \times \Biggl[\sum_{g \in \mathcal{G}} \int m_g(Y_i, x; \theta)\, dF_g(x)\Biggr]^{1 - R_i}. \tag{6}$$


Unlike the likelihood for the cross-sectional design given in (5), the density functions of $X$ given $G$ cannot be factored out of the likelihood given in (6) and thus cannot be omitted from the likelihood.

We maximize (6) to obtain the MLEs $\hat\theta$ and $\{\hat F_g(\cdot)\}$. The latter is an empirical function with point masses at the observed $X_i$ such that $G_i = g$ and $R_i = 1$. The maximization can be carried out via the Newton–Raphson, profile-likelihood, or large-scale optimization methods. An alternative way to calculate the MLEs is through the EM algorithm described in Section A.3.1.

We impose the following regularity condition and then state the asymptotic results in Theorem 1.

Condition 4. For any $g \in \mathcal{G}$, $f_g(x)$ is positive in its support and continuously differentiable with respect to a suitable measure.

Theorem 1. Under Conditions 1–4, $\hat\theta$ and $\{\hat F_g(\cdot)\}$ are consistent in that $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a mean-0 normal random vector whose covariance matrix attains the semiparametric efficiency bound.

Let $pl_n(\theta)$ be the profile log-likelihood for $\theta$, that is, $pl_n(\theta) = \max_{\{F_g\}} \log L_n(\theta, \{F_g\})$. Then the $(s, t)$th element of the inverse covariance matrix of $\hat\theta$ can be estimated by

$$-\epsilon_n^{-2}\bigl\{pl_n(\hat\theta + \epsilon_n e_s + \epsilon_n e_t) - pl_n(\hat\theta + \epsilon_n e_s - \epsilon_n e_t) - pl_n(\hat\theta - \epsilon_n e_s + \epsilon_n e_t) + pl_n(\hat\theta)\bigr\},$$

where $\epsilon_n$ is a constant of the order $n^{-1/2}$, and $e_s$ and $e_t$ are the $s$th and $t$th canonical vectors. The function $pl_n(\theta)$ can be calculated via the EM algorithm by holding $\theta$ constant in both the E-step and the M-step.
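The idea of estimating the curvature of the profile log-likelihood by finite differences can be sketched numerically as follows. The sketch uses the standard four-point central difference, scaled by $4\epsilon^2$; this choice is an assumption of the sketch and is not claimed to reproduce the displayed expression term by term. The argument `pl` is a hypothetical stand-in for a routine that returns $pl_n(\theta)$ (for instance, an EM run with $\theta$ held fixed), and `eps` plays the role of $\epsilon_n$.

```python
def second_difference(f, theta, s, t, eps):
    """Four-point central-difference approximation to the mixed second
    derivative of f with respect to theta[s] and theta[t]."""
    def shifted(ds, dt):
        point = list(theta)
        point[s] += ds * eps
        point[t] += dt * eps  # when s == t the two shifts add up
        return f(point)
    return (shifted(1, 1) - shifted(1, -1)
            - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * eps ** 2)


def information_estimate(pl, theta_hat, eps):
    """Negative numerical Hessian of a profile log-likelihood at theta_hat."""
    d = len(theta_hat)
    return [[-second_difference(pl, theta_hat, s, t, eps) for t in range(d)]
            for s in range(d)]
```

Because the central difference is exact for quadratic functions, a quadratic `pl` gives a quick correctness check before the routine is applied to a real profile likelihood.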

Remark 3. If $N$ is much larger than $n$, or if the population frequencies rather than the totals are known, then we maximize $\prod_{i=1}^{n} \prod_{g \in \mathcal{G}} \{m_g(Y_i, X_i; \theta) f_g(X_i)\}^{I(G_i = g)}$ subject to the constraints that $\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x) = p_y$, where $p_y$ is the population frequency of $Y = y$. The resultant estimator of $\theta_0$ is consistent, asymptotically normal, and asymptotically efficient. The results in this section can be extended straightforwardly to accommodate stratifications on covariates.

2.4 Case-Control Studies With Unknown Population Totals

We consider the classical case-control design, which measures $X$ and $G$ on $n_1$ cases ($Y = 1$) and $n_0$ controls ($Y = 0$) and requires no knowledge about the finite population. With the notation introduced in the previous section, the likelihood contribution from one individual takes the form

$$RL(\theta, \{F_g\}) = \frac{\prod_{g \in \mathcal{G}} \{m_g(y, X; \theta) f_g(X)\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x)}, \tag{7}$$

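where we use $y$ instead of $Y$ to emphasize that $y$ is not random. Define

$$f_g^\dagger(x) = \frac{m_g(0, x; \theta) f_g(x)}{\int m_g(0, x; \theta)\, dF_g(x)}, \qquad q_g = \frac{\int m_g(0, x; \theta)\, dF_g(x)}{\sum_{\tilde g \in \mathcal{G}} \int m_{\tilde g}(0, x; \theta)\, dF_{\tilde g}(x)}.$$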

Clearly, $f_g^\dagger(x)$ is the conditional density of $X$ given $G = g$ and $Y = 0$, and $q_g$ is the conditional probability of $G = g$ given $Y = 0$. Let $g_0$ and $x_0$ be some specific values of $G$ and $X$. Write $F_g^\dagger(x) = \int_0^x f_g^\dagger(s)\, d\mu(s)$. We can express (7) as

$$RL(\theta, \{F_g^\dagger\}, \{q_g\}) = \frac{\prod_{g \in \mathcal{G}} \{\eta(y, X, g; \theta) f_g^\dagger(X) q_g\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} q_g \int \eta(y, x, g; \theta)\, dF_g^\dagger(x)}, \tag{8}$$

where

$$\eta(y, x, g; \theta) = \frac{m_g(y, x; \theta)\, m_{g_0}(0, x_0; \theta)}{m_g(0, x; \theta)\, m_{g_0}(y, x_0; \theta)}.$$

We call $\eta$ the generalized odds ratio (Liang and Qin 2000), which reduces to the ordinary odds ratio when $S(g)$ is a singleton.
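The reduction to the ordinary odds ratio can be checked in a few lines. The following sketch assumes, for illustration, that both $S(g) = \{H_g\}$ and $S(g_0) = \{H_{g_0}\}$ are singletons and writes $\ell(x, H)$ for the linear predictor of a logistic model; the symbols $H_g$, $H_{g_0}$, and $\ell$ are notation introduced here, not taken from the article.

```latex
% With singleton S(g), the sum defining m_g has a single term:
%   m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(y \mid x, H_g) \, P_\gamma(H_g),
% so the haplotype-pair probabilities cancel within each ratio.
\begin{align*}
\eta(y, x, g; \theta)
  &= \frac{P_{\alpha,\beta,\xi}(y \mid x, H_g)\, P_\gamma(H_g)}
          {P_{\alpha,\beta,\xi}(0 \mid x, H_g)\, P_\gamma(H_g)}
     \cdot
     \frac{P_{\alpha,\beta,\xi}(0 \mid x_0, H_{g_0})\, P_\gamma(H_{g_0})}
          {P_{\alpha,\beta,\xi}(y \mid x_0, H_{g_0})\, P_\gamma(H_{g_0})} \\
  &= \frac{P_{\alpha,\beta,\xi}(y \mid x, H_g) \big/ P_{\alpha,\beta,\xi}(0 \mid x, H_g)}
          {P_{\alpha,\beta,\xi}(y \mid x_0, H_{g_0}) \big/ P_{\alpha,\beta,\xi}(0 \mid x_0, H_{g_0})}.
\end{align*}
% For a logistic model with y = 1, each odds equals exp{\ell(\cdot)}, hence
\[
\eta(1, x, g; \theta) = \exp\bigl\{\ell(x, H_g) - \ell(x_0, H_{g_0})\bigr\},
\]
% in which the intercept cancels, as in the ordinary odds ratio.
```

When $S(g)$ contains more than one pair, the mixture over phases does not cancel in this way, which is why $\eta$ then depends on the haplotype distribution as well.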

Remark 4. The parameter $q_g$ is a functional of $f_g^\dagger$ and $\theta$, because $\int m_g(0, x; \theta)\, dF_g(x) = \bigl\{\int m_g^{-1}(0, x; \theta)\, dF_g^\dagger(x)\bigr\}^{-1}$. This constraint makes it very difficult to study the identifiability of the parameters. Thus we treat $q_g$ as a free parameter in our development.

For traditional case-control data, the odds ratio is identifiable (whereas the intercept is not), and its MLE can be obtained by maximizing the prospective likelihood (Prentice and Pyke 1979). Similar results hold when the exposure is measured with error (Roeder, Carroll, and Lindsay 1996); however, the distribution of the measurement error needs to be estimated from a validation set or an external source. With unphased genotype data, identifiability is much more delicate. We show in Section A.4.1 that the components of $\theta$ that are identifiable from the retrospective likelihood are exactly those that are identifiable from the generalized odds ratio. Thus, we assume that the generalized odds ratio depends only on a set of identifiable parameters, still denoted by $\theta$; otherwise, the inference is not tractable. For the logistic link function with linear predictor (1), we show in Section A.4.2 that if there are no dominant effects, then $\theta$ consists only of $\beta$; if there are no covariate effects but there exists a dominant main effect, then $\beta$ is identifiable and $P(H = (h^*, g - h^*))/P(G = g)$ is identifiable up to a scalar constant; and if the dominant effect depends on a continuous covariate, or if the dominant main effect and the main effect of a continuous covariate are nonzero, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$. For the probit and complementary log–log link functions, we show in Section A.4.3 that if there are dominant effects and at least one continuous covariate has an effect, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$.

We maximize the product of (8) over the $n \equiv n_1 + n_0$ individuals in the case-control sample to produce the MLEs $\hat\theta$, $\hat F_g^\dagger(\cdot)$, and $\hat q_g$. Although the $\hat F_g^\dagger(\cdot)$ are high-dimensional, we show in Section A.4.4 that $\hat\theta$ can be obtained by profiling a likelihood function over a scalar nuisance parameter.

To state the asymptotic properties of the MLEs, we impose the following conditions.

Condition 5. If there exists a vector $v$ such that $v^T \nabla_\theta \log \eta(1, x, g; \theta)$ is a constant with probability 1, then $v = 0$.

Condition 6. The function $f_g^\dagger$ is positive in its support and continuously differentiable.

Condition 7. The fraction $n_1/n$ converges to a constant in $(0, 1)$.

94 Journal of the American Statistical Association March 2006

Remark 5. Condition 5 implies nonsingularity of the information matrix for $\theta_0$ and can be shown to hold for the logistic, probit, and complementary log–log link functions. Condition 7 ensures that there are both cases and controls in the sample.

Theorem 2. Under Conditions 5–7, $|\hat\theta - \theta_0| + \sup_g |\hat q_g - q_g| + \sup_{x,g} |\hat F_g^\dagger(x) - F_g^\dagger(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix attains the semiparametric efficiency bound.

In most case-control studies, the disease is (relatively) rare. When the disease is rare, considerable simplicity arises because of the following approximation for the logistic regression model:

\[ P_{\alpha,\beta}(Y \mid X, H) \approx \exp[Y\{\alpha + \beta^T Z(X, H)\}], \]

where $Z(X, H)$ is a specific function of $X$ and $H$. We assume that either (3) or (4) holds. The likelihood based on $(X_i, G_i, y_i)$, $i = 1, \ldots, n$, can be approximated by

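\[
L_n(\theta, \{F_g\}) = \prod_{i=1}^{n} \left( \frac{\prod_{g\in\mathcal{G}} \bigl[ f_g(X_i) \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(X_i, h_k, h_l)} P_\gamma(h_k, h_l) \bigr]^{I(G_i=g)}}{\sum_{g\in\mathcal{G}} \int_x \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(x, h_k, h_l)} P_\gamma(h_k, h_l)\,dF_g(x)} \right)^{y_i}
\times \left[ \prod_{g\in\mathcal{G}} \Bigl\{ f_g(X_i) \sum_{(h_k,h_l)\in S(g)} P_\gamma(h_k, h_l) \Bigr\}^{I(G_i=g)} \right]^{1-y_i}. \tag{9}
\]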

We impose the following condition.

Condition 8. If $\alpha + \beta^T Z(X, H) = \tilde\alpha + \tilde\beta^T Z(X, H)$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, then $\alpha = \tilde\alpha$ and $\beta = \tilde\beta$.

This condition is similar to Condition 1 stated in Section 2.2, and it holds for the codominant model. Under this condition, it follows from Lemma 1 that no two sets of parameters can give the same likelihood with probability 1. Thus, the maximizer of (9), denoted by $(\hat\theta, \{\hat F_g\})$, is locally unique. We show in Section A.4.5 that $\hat\theta$ can be easily obtained by profiling over a small number of parameters.

To derive the asymptotic properties, we provide a mathematical definition of rare disease.

Condition 9. For $i = 1, \ldots, n$, the conditional distribution of $Y_i$ given $(X_i, H_i)$ satisfies $P(Y_i = 1 \mid X_i, H_i) = a_n \exp\{\beta_0^T Z(X_i, H_i)\} / [1 + a_n \exp\{\beta_0^T Z(X_i, H_i)\}]$, where $a_n = o(n^{-1/2})$.
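The rare-disease approximation used for (9) can be checked numerically. The sketch below compares the logistic probability with $\exp\{y \cdot \text{lin}\}$, where `lin` stands for the linear predictor $\alpha + \beta^T Z(X, H)$; the function names are illustrative and not part of the article. For both $y = 0$ and $y = 1$ the relative error works out to $e^{\text{lin}}$, so it vanishes as the intercept decreases.

```python
import math


def logistic_prob(y, lin):
    """Exact logistic probability P(Y = y) for linear predictor lin."""
    p1 = 1.0 / (1.0 + math.exp(-lin))
    return p1 if y == 1 else 1.0 - p1


def rare_disease_approx(y, lin):
    """Rare-disease approximation exp(y * lin) to the logistic probability."""
    return math.exp(y * lin)


def relative_error(y, lin):
    """Relative error of the approximation; algebraically equal to exp(lin)."""
    return rare_disease_approx(y, lin) / logistic_prob(y, lin) - 1.0


if __name__ == "__main__":
    for lin in (-2.0, -3.0, -4.0, -6.0):
        print(lin, relative_error(0, lin), relative_error(1, lin))
```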

Theorem 3. Under Conditions 6–9, $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to_{P_n} 0$, where $P_n$ is the probability measure given by Condition 9. Furthermore, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix achieves the semiparametric efficiency bound.

2.5 Cohort Studies

In a cohort study, $Y$ represents the time to disease occurrence, which is subject to right-censorship by $C$. The data consist of $(\tilde Y_i, \Delta_i, X_i, G_i)$, $i = 1, \ldots, n$, where $\tilde Y_i = \min(Y_i, C_i)$ and $\Delta_i = I(Y_i \le C_i)$. We relate $Y_i$ to $(X_i, H_i)$ through a class of semiparametric linear transformation models,

\[ \Gamma(Y_i) = -\beta^T Z(X_i, H_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{10} \]

where $\Gamma$ is an unknown increasing function, $Z(X, H)$ is a known function of $X$ and $H$, and the $\epsilon_i$'s are independent errors with known distribution function $F$. We may rewrite (10) as

\[ P(Y_i \le t \mid X_i, H_i) = Q\bigl(\Lambda(t) e^{\beta^T Z(X_i, H_i)}\bigr), \]

where $\Lambda(t) = e^{\Gamma(t)}$ and $Q(x) = F(\log x)$ ($x > 0$). The choices of the extreme-value and standard logistic distributions for $F$, or equivalently $Q(x) = 1 - e^{-x}$ and $Q(x) = 1 - (1 + x)^{-1}$, yield the proportional hazards model and the proportional odds model (Pettitt 1984).
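To make the two special cases concrete, the sketch below evaluates $P(Y \le t) = Q(\Lambda(t) e^{\text{lin}})$ for both choices of $Q$, with `lin` standing for $\beta^T Z(X, H)$. The baseline $\Lambda(t) = t^2$ in the usage example is an arbitrary assumption chosen for illustration. Under the first choice the ratio of cumulative hazards between two linear predictors is constant in $t$; under the second the odds ratio is constant.

```python
import math


def q_ph(x):
    """Q(x) = 1 - exp(-x): proportional hazards model."""
    return 1.0 - math.exp(-x)


def q_po(x):
    """Q(x) = 1 - 1/(1 + x): proportional odds model."""
    return 1.0 - 1.0 / (1.0 + x)


def transformation_cdf(t, lin, q, baseline):
    """P(Y <= t) under the transformation model with linear predictor lin."""
    return q(baseline(t) * math.exp(lin))


if __name__ == "__main__":
    baseline = lambda t: t ** 2  # hypothetical Lambda(t), for illustration only
    for t in (0.3, 1.0, 2.5):
        f0 = transformation_cdf(t, 0.0, q_ph, baseline)
        f1 = transformation_cdf(t, 0.5, q_ph, baseline)
        # ratio of cumulative hazards, constant in t under proportional hazards
        print(t, math.log(1.0 - f1) / math.log(1.0 - f0))
```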

We impose Condition 8. Under this condition, $\beta$ and $\Lambda(\cdot)$ are identifiable from the observable data. The identifiability of the distribution of $H$ is the same here as in the case of cross-sectional studies. Under (3) or (4) and Condition 8, all of the parameters, including $\beta$, $\Lambda(\cdot)$, and $\gamma$, are identifiable. This is shown in Section A.5.1.

The following assumption on censoring is required in constructing the likelihood.

Condition 10. Conditional on $X$ and $G$, the censoring time $C$ is independent of $Y$ and $H$.

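Let $\theta = (\beta, \gamma)$. The likelihood concerning $\theta$ and $\Lambda$ takes the form

\[
L_n(\theta, \Lambda) = \prod_{i=1}^{n} \Biggl[ \sum_{(h_k,h_l)\in S(G_i)} \Bigl\{ \Lambda'(\tilde Y_i) e^{\beta^T Z(X_i,(h_k,h_l))} Q'\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i,(h_k,h_l))}\bigr) \Bigr\}^{\Delta_i}
\times \Bigl\{ 1 - Q\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i,(h_k,h_l))}\bigr) \Bigr\}^{1-\Delta_i} P_\gamma(h_k, h_l) \Biggr]. \tag{11}
\]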

Here and in the sequel, $f'(x) = df(x)/dx$ and $f''(x) = d^2 f(x)/dx^2$. Like (6), (8), and (9), this likelihood involves infinite-dimensional parameters. If $\Lambda$ is restricted to be absolutely continuous, then, as in the case of density estimation, there is no maximizer of this likelihood. Thus, we relax $\Lambda$ to be right-continuous and replace $\Lambda'(\tilde Y_i)$ in (11) by the jump size of $\Lambda$ at $\tilde Y_i$. By the arguments of Zeng, Lin, and Lin (2005), the resultant MLE, denoted by $(\hat\theta, \hat\Lambda)$, exists, and $\hat\Lambda$ is a step function with jumps only at the observed $\tilde Y_i$ for which $\Delta_i = 1$. The maximization can be carried out through an optimization algorithm. Furthermore, the covariance matrix of $\hat\theta$ can be estimated by the profile likelihood method, as discussed by Zeng et al. (2005).

Lin (2004) considered the special case of the proportional hazards model under condition (2) and provided an EM algorithm for obtaining the MLEs. We can modify that algorithm to accommodate Hardy–Weinberg disequilibrium along the lines of Section A.2.2. In addition, the EM algorithm can be used to evaluate the profile likelihood.

We assume the following regularity conditions for the asymptotic results.


Condition 11. There exists some positive constant $\delta_0$ such that $P(C_i \ge \tau \mid X_i, G_i) = P(C_i = \tau \mid X_i, G_i) \ge \delta_0$ almost surely, where $\tau$ corresponds to the end of the study.

Condition 12. The true value $\Lambda_0(t)$ of $\Lambda(t)$ is a strictly increasing function in $[0, \tau]$ and is continuously differentiable. In addition, $\Lambda_0(0) = 0$, $\Lambda_0(\tau) < \infty$, and $\Lambda_0'(0) > 0$.

Theorem 4. Under Conditions 8 and 10–12, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ converges weakly to a Gaussian process in $R^d \times l^\infty([0, \tau])$, where $d$ is the dimension of $\theta_0$ and $l^\infty([0, \tau])$ is the space of all bounded functions on $[0, \tau]$ equipped with the supremum norm. Furthermore, $\hat\theta$ is asymptotically efficient.

3. SIMULATION STUDIES

We used Monte Carlo simulation to evaluate the proposed methods in realistic settings. We considered the five SNPs on chromosome 22 from the Finland–United States Investigation of NIDDM Genetics (FUSION) Study described in the next section. We obtained the $\pi_k$'s from the frequencies shown in Table 1 by assuming a 7% disease rate and generated haplotypes under (3) with $\rho = .05$. The $R_h^2$ in Table 1 is the measure of haplotype certainty of Stram et al. (2003). We focused on $h^* = (01100)$ and considered case-control and cohort studies.
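A minimal sketch of how diplotypes might be generated under Hardy–Weinberg disequilibrium, assuming that (3) has the form $P\{H = (h_k, h_l)\} = (1 - \rho)\pi_k \pi_l + \rho\,\delta_{kl}\pi_k$, which is the form implied by the mixture representation $H = BQ_1 + (1 - B)Q_2$ in Section A.2.2. The function name and the use of NumPy are choices made for this example, not part of the article.

```python
import numpy as np


def sample_diplotypes(pi, rho, n, rng):
    """Draw n diplotypes (as index pairs) under the assumed form of (3).

    With probability rho the pair is forced to be homozygous (B = 1);
    otherwise the two haplotypes are drawn independently from pi (B = 0).
    """
    pi = np.asarray(pi, dtype=float)
    k = len(pi)
    forced_homozygous = rng.random(n) < rho
    first = rng.choice(k, size=n, p=pi)
    second = rng.choice(k, size=n, p=pi)
    second = np.where(forced_homozygous, first, second)
    return first, second


if __name__ == "__main__":
    rng = np.random.default_rng(2006)
    first, second = sample_diplotypes([0.5, 0.3, 0.2], 0.05, 10, rng)
    print(list(zip(first.tolist(), second.tolist())))
```

The unphased genotype of each individual would then be obtained by adding the two sampled haplotypes position by position.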

For the cohort studies, we generated ages of onset from the proportional hazards model

\[ \lambda\{t \mid x, (h_k, h_l)\} = 2t \exp[\beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x], \]

where $X$ is a Bernoulli variable with $P(X = 1) = .2$ that is independent of $H$. We generated the censoring times from the uniform $(0, \tau)$ distribution, where $\tau$ was chosen to yield approximately 250, 500, and 1,000 cases under $n = 5{,}000$. We let $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$.
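The data-generating step above can be sketched by inverse-transform sampling: the hazard $2t\,e^{\text{lin}}$ has cumulative hazard $t^2 e^{\text{lin}}$, so $T = \{-\log U / e^{\text{lin}}\}^{1/2}$ for uniform $U$. As a simplifying assumption made only for this example, the number of copies of $h^*$ is drawn as Binomial(2, `pi_star`), that is, under Hardy–Weinberg equilibrium rather than under (3); `pi_star` and the function name are hypothetical.

```python
import numpy as np


def simulate_cohort(n, pi_star, beta, tau, rng, p_x=0.2):
    """Simulate (observed time, event indicator, covariate, copies of h*)."""
    copies = rng.binomial(2, pi_star, size=n)   # copies of h* (HWE assumed)
    x = rng.binomial(1, p_x, size=n)            # Bernoulli covariate
    lin = beta[0] * copies + beta[1] * x + beta[2] * copies * x
    u = 1.0 - rng.random(n)                     # uniform on (0, 1]
    onset = np.sqrt(-np.log(u) / np.exp(lin))   # inverts S(t) = exp(-t^2 e^lin)
    censor = rng.uniform(0.0, tau, size=n)      # uniform(0, tau) censoring
    observed = np.minimum(onset, censor)
    delta = (onset <= censor).astype(int)
    return observed, delta, x, copies


if __name__ == "__main__":
    rng = np.random.default_rng(22)
    y, delta, x, copies = simulate_cohort(5000, 0.25, (0.25, 0.25, 0.0), 0.5, rng)
    print("number of cases:", int(delta.sum()))
```

In practice `tau` would be tuned by trial until the expected number of cases matches the target.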

As shown in Table 2, the MLE is virtually unbiased, the likelihood ratio test has proper type I error, and the confidence interval has reasonable coverage. Additional simulation studies revealed that the proposed methods also perform well for making inference about other parameters and under other genetic models.

Table 1. Observed Haplotype Frequencies in the FUSION Study

                  Frequencies
Haplotype    Controls    Cases      R_h^2
00011        .0042       .0066      .388
00100        .0035       .0034      .336
00110        .0018       .0007      .377
01011        .1292       .1344      .592
01100        .2514       .3183      .738
01101        .0012       <10^-4     .450
01110        <10^-4      .0045      .499
01111        .0019       <10^-4     .325
10000        .0136       .0114      .456
10010        <10^-4      .0012      .500
10011        .3573       .2883      .727
10100        .0521       .0597      .402
10110        .0317       .0318      .554
11011        .1392       .1290      .560
11100        .0109       .0092      .266
11110        <10^-4      .0014      <10^-4
11111        .0020       <10^-4     .338

Table 2. Simulation Results for the Haplotype–Environment Interactions in Cohort Studies

β3      Cases    Bias     SE      CP      Power
0        250     -.010    .232    .949    .051
         500     -.005    .157    .953    .047
        1000     -.003    .114    .954    .046
-.25     250     -.014    .256    .950    .190
         500     -.008    .172    .949    .334
        1000     -.004    .122    .952    .554
-.5      250     -.022    .281    .950    .505
         500     -.011    .190    .950    .806
        1000     -.006    .132    .952    .976
.25      250     -.007    .216    .947    .207
         500     -.002    .146    .953    .395
        1000     -.001    .109    .954    .614
.5       250     -.003    .204    .943    .693
         500     -.001    .140    .951    .940
        1000     -.001    .105    .952    .998

NOTE: Bias and SE are the bias and standard error of the estimator of β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level likelihood ratio test of H0: β3 = 0. Each entry is based on 5,000 replicates.

For the case-control studies, we used the same distributions of $H$ and $X$ and considered the same $h^*$ as in the cohort studies. We generated disease incidence from the logistic regression model

\[ \mathrm{logit}\, P\{Y = 1 \mid x, (h_k, h_l)\} = \alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x. \tag{12} \]

For making inference on $\beta_1$, we set $\beta_2 = \beta_3 = .25$ and varied $\beta_1$ from $-.5$ to $.5$; for making inference on $\beta_3$, we set $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$. We chose $\alpha = -3$ or $-4$, yielding disease rates between 1.6% and 7%. We let $n_1 = n_0 = 500$ or 1,000. We considered the situations of known and unknown population totals, with $N$ being 15 and 30 times $n$ under $\alpha = -3$ and $-4$. For known population totals, we used the EM algorithm described in Section A.3.1 and evaluated the inference procedures based on the likelihood ratio statistic. For unknown population totals, we used the profile-likelihood method for rare diseases described in Section A.4.5 and set the $\hat\pi_k$ less than $2/n$ to 0 to improve numerical stability. The results for $\beta_1$ and $\beta_3$ are displayed in Tables 3 and 4.

For known population totals, the proposed estimators are virtually unbiased, and the likelihood ratio statistics yield proper tests and confidence intervals. For unknown population totals, $\hat\beta_1$ has little bias, especially for large $n$, whereas $\hat\beta_3$ tends to be slightly biased downward; the variance estimators are fairly accurate, and the corresponding confidence intervals have reasonable coverage probabilities, except for $\alpha = -3$, $\beta_3 = .5$. The method with known population totals yields slightly higher power than the method with unknown population totals.

All the aforementioned results pertain to haplotype 01100, which has a relatively high frequency and a large value of $R_h^2$; the covariate is binary; and $\rho$ is .05, which is relatively large. Additional simulation studies showed that the foregoing conclusions continue to hold for other haplotypes, other values of $\rho$, and continuous covariates. Table 5 reports some results for haplotype 10100, which has a frequency of about 5% and


Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                          Known totals                  Unknown totals
n1 = n0   α     β1     Bias    SE     CP     Power    Bias    SE     SEE    CP     Power
 500     -3    -.5     -.003   .117   .952   .987     .019    .121   .124   .951   .979
               -.25    -.002   .109   .954   .587     .014    .112   .117   .960   .525
                0      -.001   .104   .951   .049     .009    .109   .112   .955   .045
                .25    -.001   .102   .950   .641     .002    .105   .108   .961   .646
                .5      .000   .099   .948   .996    -.005    .103   .106   .958   .998
         -4    -.5      .001   .112   .954   .987     .022    .119   .124   .951   .977
               -.25    -.002   .104   .955   .574     .013    .114   .117   .952   .529
                0      -.002   .100   .953   .047     .004    .109   .112   .953   .047
                .25    -.001   .095   .956   .636    -.003    .103   .108   .959   .640
                .5     -.000   .094   .950   .999    -.009    .102   .105   .956   .997
1000     -3    -.5     -.003   .082   .953   1.00     .005    .087   .087   .949   1.00
               -.25    -.002   .076   .952   .874     .005    .081   .082   .948   .853
                0      -.001   .073   .951   .049     .005    .077   .077   .954   .046
                .25    -.001   .071   .953   .898     .004    .075   .076   .948   .920
                .5     -.001   .070   .953   1.00     .003    .075   .075   .946   1.00
         -4    -.5      .000   .079   .952   1.00     .005    .087   .088   .947   1.00
               -.25    -.000   .074   .959   .867     .005    .081   .083   .954   .847
                0      -.001   .070   .955   .045     .002    .079   .079   .949   .051
                .25    -.001   .067   .956   .904     .000    .074   .076   .955   .909
                .5     -.001   .066   .956   1.00    -.002    .073   .074   .954   1.00

NOTE: Bias and SE are the bias and standard error of the estimator of β1. SEE is the mean of the standard error estimator for β1. CP is the coverage probability of the 95% confidence interval for β1. Power pertains to the .05-level test of H0: β1 = 0. Each entry is based on 5,000 replicates.

an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

\[ \mathrm{logit}\, P\{Y = 1 \mid X_1, X_2, (h_k, h_l)\} = \alpha + \beta_h\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x_1} X_1 + \beta_{x_2} X_2 + \beta_{hx_2}\{I(h_k = h^*) + I(h_l = h^*)\}X_2, \]

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform(0, 1). We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x_1} = \beta_{x_2} = -\beta_{hx_2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                          Known totals                  Unknown totals
n1 = n0   α     β3     Bias    SE     CP     Power    Bias    SE     SEE    CP     Power
 500     -3    -.5     -.008   .205   .949   .729     .030    .187   .195   .953   .692
               -.25    -.002   .186   .949   .271     .016    .169   .176   .961   .244
                0      -.001   .173   .946   .054    -.006    .155   .162   .963   .037
                .25     .002   .165   .949   .334    -.038    .144   .151   .958   .287
                .5      .006   .161   .947   .885    -.088    .138   .143   .915   .831
         -4    -.5     -.009   .198   .950   .763     .012    .194   .195   .950   .720
               -.25    -.005   .181   .949   .309     .006    .172   .176   .953   .264
                0      -.002   .168   .945   .055    -.007    .156   .161   .956   .044
                .25    -.001   .157   .944   .370    -.022    .146   .149   .948   .333
                .5      .001   .148   .945   .926    -.047    .136   .141   .945   .904
1000     -3    -.5     -.004   .147   .943   .953     .027    .134   .136   .950   .953
               -.25    -.003   .133   .946   .493     .013    .122   .123   .949   .477
                0      -.001   .123   .951   .049    -.005    .114   .113   .948   .052
                .25     .000   .119   .945   .580    -.034    .107   .106   .934   .535
                .5      .002   .117   .947   .994    -.080    .102   .101   .870   .986
         -4    -.5     -.005   .140   .945   .965     .010    .137   .136   .949   .965
               -.25    -.002   .128   .945   .529     .005    .124   .123   .951   .505
                0      -.001   .119   .946   .054    -.004    .113   .113   .947   .053
                .25    -.000   .110   .947   .633    -.016    .104   .105   .952   .601
                .5      .002   .105   .949   .998    -.037    .099   .099   .937   .995

NOTE: Bias and SE are the bias and standard error of the estimator of β3. SEE is the mean of the standard error estimator for β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level test of H0: β3 = 0. Each entry is based on 5,000 replicates.


Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias     SE      SEE     CP      Power
 500      βh            0          -.030    .400    .401    .957    .043
          βx1           .5          .002    .151    .152    .951    .917
          βx2           .5          .001    .228    .230    .953    .584
          βhx2         -.5          .015    .641    .644    .956    .118
1000      βh            0          -.017    .275    .277    .953    .047
          βx1           .5          .002    .107    .107    .954    .997
          βx2           .5          .000    .162    .161    .950    .871
          βhx2         -.5          .012    .441    .443    .950    .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated $\rho$ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate $X$ by setting $X = 1$ for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of $X$ given $Y$ and $G$ under model (12) with $\alpha = -3.7$, $\beta_1 = .32$, and $\beta_2 = .25$. Based on 5,000 replicates, the power for testing $H_0\colon \beta_3 = 0$ is estimated at .053, .479, or .974 under $\beta_3 = 0$, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

                                                                      Codominant model
Haplotype   Recessive model   Dominant model   Additive model   Additive        Recessive
01011         .327 (.270)      -.027 (.140)     .049 (.135)      .005 (.143)      .331 (.289)
01100         .316 (.146)       .274 (.109)     .355 (.099)      .334 (.114)      .063 (.167)
10011        -.206 (.155)      -.323 (.112)    -.320 (.095)     -.344 (.111)      .076 (.183)
10100       -1.019 (1.020)      .196 (.219)     .116 (.213)      .169 (.217)    -1.131 (1.029)
10110         .903 (.746)      -.007 (.248)     .063 (.249)      .016 (.254)      .892 (.765)
11011        -.222 (.328)      -.096 (.140)    -.127 (.133)     -.108 (.140)     -.143 (.344)

NOTE: Standard error estimates are shown in parentheses.


Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
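The sliding-window idea can be sketched in a few lines. The helper names below are illustrative only; the example simply enumerates window positions and reports how much adjacent windows overlap, which is the source of the strong correlation between the window-level test statistics.

```python
def sliding_windows(n_snps, width, step=1):
    """Return (start, end) index pairs, end exclusive, for each SNP window."""
    return [(s, s + width) for s in range(0, n_snps - width + 1, step)]


def adjacent_overlap(width, step=1):
    """Fraction of SNPs shared by two consecutive windows."""
    return max(width - step, 0) / width


if __name__ == "__main__":
    windows = sliding_windows(n_snps=12, width=5)
    print(len(windows), "windows; first:", windows[0], "last:", windows[-1])
    print("shared fraction between neighbours:", adjacent_overlap(5))
```

With a step of one SNP and a width of five, neighbouring windows share four of their five SNPs, so treating the tests as independent in a Bonferroni correction would overstate the effective number of comparisons.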

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters, $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$, yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly, $(1 - \rho)\pi_k^2 + \rho\pi_k = (1 - \tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k$. We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have $\pi_k = [-\rho + \{\rho^2 + 4c_k(1 - \rho)\}^{1/2}]/\{2(1 - \rho)\}$. Thus $(1 - \rho)^{-1}$ satisfies the equation $\sum_k [(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}] = 2$, and $(1 - \tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus, the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$. To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1 - \rho) + \rho\} + \mu\pi_k(1 - \pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have $\sum_k \mu\pi_k(1 - \pi_k)/\{2\pi_k(1 - \rho) + \rho\} = 0$. Therefore, $\mu = 0$ and $\nu = 0$.
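The first part of the argument is constructive and can be sketched numerically: given the homozygote probabilities $c_k = (1 - \rho)\pi_k^2 + \rho\pi_k$, solve the displayed equation for $x = (1 - \rho)^{-1}$ and then recover each $\pi_k$. The bisection routine below is a minimal illustration rather than the authors' procedure; it assumes $\rho \ge 0$ (so that $x \ge 1$) and $\sum_k c_k < 1$, and the function name is hypothetical.

```python
import math


def solve_rho_pi(c, n_iter=200):
    """Recover (rho, pi) from homozygote probabilities c_k by bisection on x."""

    def g(x):
        # Left-hand side minus 2; nonincreasing in x by the derivative argument.
        return sum((1.0 - x) + math.sqrt((x - 1.0) ** 2 + 4.0 * ck * x) for ck in c) - 2.0

    lo, hi = 1.0, 2.0
    while g(hi) > 0.0 and hi < 1e12:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    rho = 1.0 - 1.0 / x
    pi = [0.5 * ((1.0 - x) + math.sqrt((x - 1.0) ** 2 + 4.0 * ck * x)) for ck in c]
    return rho, pi


if __name__ == "__main__":
    true_pi, true_rho = [0.5, 0.3, 0.2], 0.05
    c = [(1.0 - true_rho) * p * p + true_rho * p for p in true_pi]
    print(solve_rho_pi(c))
```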

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \not\ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h, h)\}$ or $\{(h, \tilde h)\}$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, h))P(H = (h, h))$ or $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, \tilde h))P(H = (h, \tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))$ does not depend on $(h_k, h_l) \in S(g)$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))P(G = g)$, where $(h_k, h_l) \in S(g)$. For $g \in \mathcal{G}_3$,

\[ m_g(y, x; \theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g), \]

where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k, h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) When $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).

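A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^{n} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is

\[ \sum_{i=1}^{n} \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta)\bigl\{\log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l)) + \log P_\gamma(h_k, h_l)\bigr\}, \]

where

\[ p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_{k'},h_{l'})\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_{k'}, h_{l'}))P_\gamma(h_{k'}, h_{l'})}. \]

Thus, in the $(m + 1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:

\[ \sum_{i=1}^{n} \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l)) = 0, \]

\[ \sum_{i=1}^{n} \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_\gamma \log P_\gamma(h_k, h_l) = 0. \tag{A.1} \]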
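The E-step weights $p_{ikl}(\theta)$ are posterior probabilities of the diplotypes compatible with an individual's unphased genotype. The sketch below computes them for one individual under two assumptions made only for this example: a logistic phenotype model with an additive effect of a hypothetical target haplotype, and Hardy–Weinberg equilibrium for unordered diplotypes (probability $2\pi_k\pi_l$ for heterozygotes). The names and parameter layout are illustrative.

```python
import math

TARGET = "01100"  # hypothetical target haplotype h* (assumption for this sketch)


def phenotype_prob(y, x, pair, theta):
    """Logistic P(Y = y | x, H = pair); theta = (alpha, beta_h, beta_x)."""
    alpha, beta_h, beta_x = theta
    copies = sum(h == TARGET for h in pair)
    p1 = 1.0 / (1.0 + math.exp(-(alpha + beta_h * copies + beta_x * x)))
    return p1 if y == 1 else 1.0 - p1


def diplotype_prob(pair, pi):
    """P(H = pair) for an unordered pair under Hardy-Weinberg equilibrium."""
    hk, hl = pair
    p = pi[hk] * pi[hl]
    return p if hk == hl else 2.0 * p


def e_step_weights(y, x, s_g, theta, pi):
    """Posterior weights over the diplotypes in S(G_i) for one individual."""
    joint = [phenotype_prob(y, x, pair, theta) * diplotype_prob(pair, pi) for pair in s_g]
    total = sum(joint)
    return [j / total for j in joint]


if __name__ == "__main__":
    pi = {"01100": 0.3, "10011": 0.4, "01101": 0.2, "10010": 0.1}
    s_g = [("01100", "10011"), ("01101", "10010")]  # both sum to genotype 11111
    print(e_step_weights(1, 0.0, s_g, (-3.0, 0.5, 0.25), pi))
```

In a full implementation these weights would feed the weighted score equations (A.1), with the M-step carried out by Newton–Raphson as described above.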

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k, h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k, h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1 - B)Q_2$. The complete-data likelihood can be represented by

\[ \prod_{i=1}^{n} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_k \pi_k^{I(Q_{1i}=(h_k,h_k))B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i}=(h_k,h_l))(1-B_i)} \rho^{B_i}(1 - \rho)^{1-B_i}. \]

The corresponding score equations for $\pi_k$ and $\rho$ satisfy

\[ \pi_k = c^{-1}\Biggl[ \sum_{i=1}^{n} B_i I(Q_{1i} = (h_k, h_k)) + \sum_{i=1}^{n} \sum_{l=1}^{K} (1 - B_i)\bigl\{I(Q_{2i} = (h_k, h_l)) + I(Q_{2i} = (h_l, h_k))\bigr\} \Biggr] \]

and $\rho = n^{-1}\sum_{i=1}^{n} B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

\[
E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\} = \sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\,p(b, q_1, q_2)\,\omega(b, q_1, q_2)
\times \Biggl[ \sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\,p(b, q_1, q_2) \Biggr]^{-1},
\]

where $\omega(B, Q_1, Q_2) = BI(Q_1 = (h_k, h_k))$, $(1 - B)I(Q_2 = (h_k, h_l))$, or $B$, and

\[ p(b, q_1, q_2) = \prod_k \pi_k^{bI(q_1=(h_k,h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1-b)I(q_2=(h_k,h_l))} \rho^{b}(1 - \rho)^{1-b}. \]

In the $(m + 1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:

\[ \pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\Biggl[ \sum_{i=1}^{n} E^{(m)}\{B_i I(Q_{1i} = (h_k, h_k))\} + 2\sum_{i=1}^{n} \sum_{l=1}^{K} E^{(m)}\{(1 - B_i)I(Q_{2i} = (h_k, h_l))\} \Biggr] \]

and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^{n} E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i, Q_{1i}, Q_{2i})\}$ is $E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is

\[ \prod_{i=1}^{N} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i) \prod_g f_g(X_i)^{I(G_i=g)}. \]

The M-step solves the following equations for $\theta$:

\[ \sum_{i=1}^{N} I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^{N} I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i\} = 0, \]

\[ \sum_{i=1}^{N} I(R_i = 1)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^{N} I(R_i = 0)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i\} = 0, \tag{A.2} \]

and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

\[
F_g\{X_i\} = \Biggl[ \sum_{j=1}^{N} I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)E\{I(X_j = X_i, G_j = g) \mid Y_j\} \Biggr]
\times \Biggl[ \sum_{j=1}^{N} I(G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)E\{I(G_j = g) \mid Y_j\} \Biggr]^{-1},
\]

where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form

\[ \frac{\sum_{(h_k,h_l)\in S(G_i)} \omega(Y_i, X_i, (h_k, h_l))P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_k,h_l)\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))P_\gamma(h_k, h_l)} \]

for $R_i = 1$, and

\[
\sum_{g\in\mathcal{G}} \sum_{x\in\{X_j: G_j=g, R_j=1\}} \sum_{(h_k,h_l)\in S(g)} \omega(Y_i, x, (h_k, h_l))P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))P_\gamma(h_k, h_l)F_g\{x\}
\times \Biggl( \sum_{g\in\mathcal{G}} \sum_{x\in\{X_j: G_j=g, R_j=1\}} \sum_{(h_k,h_l)\in S(g)} P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))P_\gamma(h_k, h_l)F_g\{x\} \Biggr)^{-1}
\]

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components $F_g(\cdot)$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus, the consistency of $(\hat\theta, \hat F_g(\cdot))$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


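A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$ yield the same likelihood,
\[
RL(\theta, \{F_g^\dagger\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A3}
\]
Because $\eta(0, x, g; \theta) = 1$, (A3) with $y = 0$ implies that $f_g^\dagger(x) q_g / \sum_{g\in\mathcal{G}} q_g = \tilde f_g^\dagger(x) \tilde q_g / \sum_{g\in\mathcal{G}} \tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A3) that
\[
\eta(y, x, g; \theta) = C(y)\, \eta(y, x, g; \tilde\theta), \tag{A4}
\]
where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}) : \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that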

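\[
\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A5}
\]
for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $\mathcal{S}(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A5) becomes
\[
\frac{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}
{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + 1}
=
\frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}
{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + 1},
\tag{A6}
\]
where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results: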

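1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A6) yields
\[
\frac{\pi_1(g)}{\pi_2(g)} \, \frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)} \, \frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}}.
\]
Thus (A6) is equivalent to
\[
\frac{\pi_1(g)/\pi_2(g)}{\pi_1(\tilde g)/\pi_2(\tilde g)} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(\tilde g)/\tilde\pi_2(\tilde g)} \quad \text{for all } g, \tilde g \in \mathcal{G}_3.
\]

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z} z \ne 0$, (A6) yields
\[
\frac{\pi_1(g)}{\pi_2(g)} \, \frac{1 + e^{\alpha+\beta_{3z}z}}{1 + e^{\alpha+\beta_1+\beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)} \, \frac{1 + e^{\tilde\alpha+\beta_{3z}z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z}z}}.
\]
The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A6) is equivalent to
\[
\frac{\pi_1(g)}{\pi_2(g)} \, \frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)} \, \frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \tag{A7}
\]
for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.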
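The monotonicity used in case 2 can be checked directly. The short derivation below is a supplementary sketch and not part of the original proof; $\lambda > 0$ stands for the common first term of the numerator and denominator in (A6), and $c = e^{\psi_2(x)-\psi_1(x)}$.

```latex
\[
\frac{d}{d\lambda}\,\frac{\lambda + c}{\lambda + 1}
  = \frac{(\lambda + 1) - (\lambda + c)}{(\lambda + 1)^2}
  = \frac{1 - c}{(\lambda + 1)^2},
\]
% The derivative has the constant sign of (1 - c) for every lambda > 0,
% so the map is strictly monotone when c differs from 1 and constant
% (equal to 1) when c = 1, which is case 1.
```

Strict monotonicity means that equal values of the ratio force equal values of $\lambda$, which is how the equality of the two sides of (A6) is transferred to the terms involving $\pi_1(g)/\pi_2(g)$ and $\alpha$.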

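A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $\mathcal{S}(g)$ has a single element for such $g$, we obtain
\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \Big\{ \frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1 \Big\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \Big\{ \frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1 \Big\}, \tag{A8}
\]
\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \Big\{ \frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1 \Big\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \Big\{ \frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1 \Big\}, \tag{A9}
\]
and
\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \Big\{ \frac{1}{\Phi(\alpha + \beta_3^T x)} - 1 \Big\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \Big\{ \frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1 \Big\}, \tag{A10}
\]
where $\Phi$ is the standard normal distribution function. In (A10), we let $x$, except the component $z$, be 0. Then
\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \Big\{ \frac{1}{\Phi(\alpha + \beta_z z)} - 1 \Big\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \Big\{ \frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1 \Big\}.
\]
By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore, $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain
\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \, \frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \, \frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}.
\]
By taking the ratio of the two sides and letting $z \to \mathrm{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A8)–(A10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore, $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,
\[
\eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}
\Big\{ \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x) \frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x) \Big\}
\]
\[
\times \Big[ \{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\} \frac{\pi_1(g)}{\pi_2(g)} + 1 - \Phi(\alpha + \beta_3^T x) \Big]^{-1}. \tag{A11}
\]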


It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A11) and the expressions of $\eta(1, x, g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A8)–(A11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,
\[
\frac{e^{-e^{\alpha}} (e^{e^{\alpha+\beta_z z}} - 1)}{1 - e^{-e^{\alpha}}}
= \frac{e^{-e^{\tilde\alpha}} (e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1)}{1 - e^{-e^{\tilde\alpha}}}.
\]
Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.
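The derivative ratio quoted for the complementary log–log model can be verified in two lines. The following is a supplementary sketch rather than part of the original argument; $u(z) = \alpha + \beta_z z$ and $f(z) = e^{e^{u(z)}} - 1$ are shorthand introduced here.

```latex
\[
f'(z) = \beta_z\, e^{u}\, e^{e^{u}}, \qquad
f''(z) = \beta_z^2\, e^{u}\, e^{e^{u}} \,(e^{u} + 1), \qquad
\frac{f''(z)}{f'(z)} = \beta_z \,(e^{\alpha + \beta_z z} + 1).
\]
% The constant factors multiplying f on each side cancel in the ratio,
% so equating the ratios of the two sides gives the displayed identity.
% Sending beta_z z to minus infinity leaves beta_z on one side and its
% counterpart on the other, so the slopes agree; the exponential terms
% must then agree for all z, which gives equality of the intercepts.
```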

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as
\[
l_n(\theta, \{\delta_j\}) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta) - n_1 \log \Big\{ \sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\, \delta_j \Big\} + \sum_{j=1}^{J} n_{+j} \log \delta_j,
\]
where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain
\[
\frac{n_{+j}}{\delta_j} - \frac{n_1\, \eta(1, x_j, g_j; \theta)}{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\, \delta_j} + \lambda = 0.
\]
Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus
\[
\delta_j = \frac{n_{+j}}{n - n_1 + n_1\, \eta(1, x_j, g_j; \theta)/\mu}, \tag{A12}
\]
where $\mu = \sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\, \delta_j$. Plugging (A12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to
\[
l_n^*(\theta, \mu) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta) - \sum_{j=1}^{J} n_{+j} \log \Big\{ \frac{n_1}{n}\, \eta(1, x_j, g_j; \theta) + \Big(1 - \frac{n_1}{n}\Big) \mu \Big\} + (n - n_1) \log \mu.
\]
Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_{\mu} l_n^*(\theta, \mu) + C_n$. If $\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \mu)/\partial\mu = 0$, and the $\{\delta_j\}$ given in (A12) satisfy $\sum_{j=1}^{J} \delta_j = 1$. Thus $\max_{\mu} l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
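For a fixed $\theta$, the inner maximization over $\mu$ and the recovery of the jump sizes from (A12) can be illustrated numerically. The sketch below is hypothetical: the function names, the bracketing interval, and the use of bisection (instead of the Newton–Raphson iteration mentioned above) are choices made here for simplicity. It relies on the fact that $\partial l_n^*/\partial\mu = 0$ is equivalent to $\sum_j n_{+j}\mu / \{n_1 \eta_j + (n - n_1)\mu\} = 1$, whose left side is increasing in $\mu$.

```python
import math


def profile_mu(eta, n_plus, n1, lo=1e-12, hi=1e12, iters=200):
    """Maximizer of l*_n(theta, mu) over mu for fixed theta.

    eta    : values eta(1, x_j, g_j; theta), all positive (illustrative input)
    n_plus : counts n_{+j} of each distinct (x_j, g_j)
    n1     : number of cases, with 0 < n1 < n
    """
    n = sum(n_plus)

    def g(mu):
        # Stationarity condition of l*_n in mu, rearranged; increasing in mu.
        return sum(m * mu / (n1 * e + (n - n1) * mu)
                   for e, m in zip(eta, n_plus)) - 1.0

    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on the log scale
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def jump_sizes(eta, n_plus, n1, mu):
    """Jump sizes delta_j from (A12) for a given mu."""
    n = sum(n_plus)
    return [m / (n - n1 + n1 * e / mu) for e, m in zip(eta, n_plus)]
```

At the returned value of $\mu$, the jump sizes sum to 1 and satisfy $\mu = \sum_j \eta_j \delta_j$, which is the self-consistency that the argument above uses to show that the two profile functions agree.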

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define
\[
\zeta_1(x, g; \theta) = \sum_{(h_k,h_l)\in\mathcal{S}(g)} e^{\beta^T Z(x, h_k, h_l)} \{\rho \pi_k \delta_{kl} + (1 - \rho) \pi_k \pi_l\},
\]
\[
\zeta_0(g; \theta) = \sum_{(h_k,h_l)\in\mathcal{S}(g)} \{\rho \pi_k \delta_{kl} + (1 - \rho) \pi_k \pi_l\}.
\]
By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:
\[
l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^{n} \{ y_i \log \zeta_1(X_i, G_i; \theta) + (1 - y_i) \log \zeta_0(G_i; \theta) \}
\]
\[
- \sum_{i=1}^{n} \sum_{g} I(G_i = g) \log \Big\{ \zeta_1(X_i, G_i; \theta) + n_1^{-1} n_g \sum_{g'} \mu_{g'} - \mu_g \Big\}
+ \sum_{i=1}^{n} (1 - y_i) \log \sum_{g} \mu_g,
\]
where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:
\[
l_n^*(\theta, \mu) = \sum_{i=1}^{n} y_i \log \zeta_1(X_i, G_i; \theta) + \sum_{i=1}^{n} (1 - y_i) \log \zeta_0(G_i; \theta) + \sum_{i=1}^{n} (1 - y_i) \log \mu
- \sum_{i=1}^{n} \log \Big\{ (1 - r)\mu + r \sum_{g} \zeta_1(X_i, g; \theta) \Big\},
\]
where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k) : k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l) : k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by
\[
P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B, Q_1, Q_2, Y} \exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},
\]
where $\vartheta = (-\log\mu + \log\{r/(1 - r)\}, \beta^T, \log\pi_1 - \log\{\rho/(1 - \rho)\}, \ldots, \log\pi_K - \log\{\rho/(1 - \rho)\})^T$ and
\[
W(B, Q_1, Q_2, Y, X) = \Big( Y,\; Y Z^T(X, H),\; B\, I(Q_1 = (h_1, h_1)) + (1 - B) \sum_{l} \{ I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1)) \},\; \ldots,
\]
\[
B\, I(Q_1 = (h_K, h_K)) + (1 - B) \sum_{l} \{ I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K)) \} \Big)^T.
\]
We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood
\[
l_n^*(\vartheta) = \sum_{i=1}^{n} \log \Big[ \frac{\sum_{BQ_1 + (1-B)Q_2 \in \mathcal{S}(G_i)} e^{\vartheta^T W(B, Q_1, Q_2, Y_i, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}} \Big].
\]
We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

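The complete-data score function is
\[
\sum_{i=1}^{n} \Big[ W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}} \Big].
\]
Thus, in the E-step we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:
\[
E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i]
= \frac{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in \mathcal{S}(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}\, W(b, q_1, q_2, Y_i, X_i)}{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in \mathcal{S}(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}.
\]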


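In the M-step we use the one-step Newton–Raphson iteration to update the parameter estimates:
\[
\vartheta^{(k+1)} = \vartheta^{(k)} - \Sigma^{-1} \times \sum_{i=1}^{n} \Big[ E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}} \Big],
\]
where
\[
\Sigma = -\Big[ \sum_{i=1}^{n} \frac{\sum_{b, q_1, q_2, y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}} \Big]
+ \sum_{i=1}^{n} \Big[ \frac{\{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^{\otimes 2}}{\{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^2} \Big],
\]
and $a^{\otimes 2} = aa^T$.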
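The structure of the one-step update can be seen in a deliberately simplified setting. The sketch below assumes a scalar parameter and a sufficient statistic with the same finite support for every subject; both simplifications, and all names, are assumptions made here for illustration and are not the article's setup, where $\vartheta$ is a vector and the support sums depend on $X_i$. The score is the sum of E-step expectations minus model means, and the Hessian is minus the sum of model variances, matching the two bracketed terms above.

```python
import math


def model_moments(theta, support):
    """Mean and variance of W under p(w) proportional to exp(theta * w)."""
    weights = [math.exp(theta * w) for w in support]
    total = sum(weights)
    mean = sum(w * wt for w, wt in zip(support, weights)) / total
    second = sum(w * w * wt for w, wt in zip(support, weights)) / total
    return mean, second - mean ** 2


def newton_step(theta, expected_w, support):
    """One Newton-Raphson update of a scalar theta.

    expected_w : E[W_i | observed data] for each subject, from the E-step
    support    : common finite support of W (a simplifying assumption)
    """
    score = 0.0
    hessian = 0.0
    for ew in expected_w:
        mean, var = model_moments(theta, support)
        score += ew - mean   # E-step expectation minus model mean
        hessian -= var       # minus the model variance of W
    return theta - score / hessian
```

In an EM run, `expected_w` would be recomputed from the current estimate before each call; holding it fixed and iterating simply solves the complete-data score equation for those expectations.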

A.4.6 Proof of Theorem 2. Write $\hat F_{xg}(x, g) = \hat F_g^\dagger(x) \hat q_g$ and $F_{xg}(x, g) = F_g^\dagger(x) q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F_{xg}^*(x, g) \equiv F_g^*(x) q_g^*$, where $q_g^* > 0$ for any $g$.

Because $\hat F_g^\dagger$ maximizes the likelihood, there exists some Lagrange multiplier $\lambda_g$ such that
\[
\frac{I(G_i = g)}{\hat F_g^\dagger\{X_i\}} - \frac{n_1\, \eta(1, X_i, g; \hat\theta)\, \hat q_g}{\int_{x,g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)} - n\lambda_g = 0,
\]
where $\hat F_g^\dagger\{X_i\}$ denotes the point mass of $\hat F_g^\dagger$ at $X_i$, and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^{n} \hat F_g^\dagger\{X_i\} = 1$, $\lambda_g$ satisfies the equation
\[
n^{-1} \sum_{i=1}^{n} I(G_i = g) \Big\{ \lambda_g + \frac{n_1\, \eta(1, X_i, g; \hat\theta)\, \hat q_g}{n \int_{x,g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)} \Big\}^{-1} = 1 \tag{A13}
\]
and
\[
\min_{1 \le i \le n} \Big\{ \lambda_g + \frac{n_1\, \eta(1, X_i, g; \hat\theta)\, \hat q_g}{n \int_{x,g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)} \Big\} > 0.
\]
Clearly, $\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\lambda_g \to \lambda_g^*$. By (A13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that

mingx

∣∣∣∣λ

lowastg + η(1xg θlowast)qlowast

gintxg η(1x g θlowast)dFlowast

xg(x g)

∣∣∣∣

gt δ

Consequently when n is sufficiently large

Fdaggerg(x) = nminus1

nsum

i=1

I(Gi = gXi le x)

times(

max

[∣∣∣∣λg + η(1Xig θ )qgn1

times

nint

xgη(1x g θ)dFxg(x g)

minus1∣∣∣∣ δ

])minus1

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proof of theorem 12 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, \{F_g^\dagger\}, \{q_g\})$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon \int \psi(x, g)\, dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

n12

(vT11 + 21[ψ]T )(θ minus θ0)

+int

(vT12 + 22[ψ])d(Fxg minus Fxg)

= nminus12nsum

i=1

yi

vT lθ (1XiGi θ0Fxg)

+ lF(1XiGi θ0Fxg)

[int

ψ dFxg

]

+ nminus12nsum

i=1

(1 minus yi)

vT lθ (0XiGi θ0Fxg)

+ lF(0XiGi θ0Fxg)

[int

ψ dFxg

]

+ op(1)

where 11 is a constant matrix 12 is a vector function of x 21[ψ]and 22[ψ] are linear operators of ψ and lθ and lF are the scoreswith respect to θ and Fxg The right side of the foregoing equa-tion converges weakly to a Gaussian process which depends on( y1 y2 ) only through We can show that the operator B[vψ] equivvT11 +21[ψ]T vT12 +22[ψ]T is invertible along the lines ofMurphy and van der Vaart (2001) It then follows from theorem 331of van der Vaart and Wellner (1996) that n12(θ minus θ0 Fxg minus Fxg)

converges weakly to a Gaussian processBecause the asymptotic distribution depends on ( y1 y2 ) only

via we assume that ( y1 y2 ) are independent realizations froma Bernoulli distribution with mean By choosing some ψ such thatB[vψ] = (vT 0)T for all v we see that θ is an asymptotically linearestimator for θ0 with the influence function in the score space It fol-lows from proposition 331 of Bickel Klaassen Ritov and Wellner(1993) that the limiting covariance matrix of n12(θ minus θ0) attains thesemiparametric efficiency bound


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,
\[
\frac{dP_n}{d\tilde P_n} = \exp\Big\{ a_n \sum_{i=1}^{n} \frac{\partial \log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a} \Big|_{a=0} + o(1) \Big\} \to_{\tilde P_n} 1.
\]
Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

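A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that
\[
\sum_{H\in\mathcal{S}(G)} \big\{ \dot\Lambda(Y)\, e^{\beta^T Z(X,H)}\, \dot Q\big(\Lambda(Y) e^{\beta^T Z(X,H)}\big) \big\}^{\Delta} \big\{ 1 - Q\big(\Lambda(Y) e^{\beta^T Z(X,H)}\big) \big\}^{1-\Delta} P_\gamma(H)
\]
\[
= \sum_{H\in\mathcal{S}(G)} \big\{ \dot{\tilde\Lambda}(Y)\, e^{\tilde\beta^T Z(X,H)}\, \dot Q\big(\tilde\Lambda(Y) e^{\tilde\beta^T Z(X,H)}\big) \big\}^{\Delta} \big\{ 1 - Q\big(\tilde\Lambda(Y) e^{\tilde\beta^T Z(X,H)}\big) \big\}^{1-\Delta} P_\gamma(H).
\]
By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain
\[
\sum_{H\in\mathcal{S}(G)} Q\big(\Lambda(\tau) e^{\beta^T Z(X,H)}\big) P_\gamma(H) = \sum_{H\in\mathcal{S}(G)} Q\big(\tilde\Lambda(\tau) e^{\tilde\beta^T Z(X,H)}\big) P_\gamma(H).
\]
Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\Lambda(Y) e^{\beta^T Z(X,H)} = \tilde\Lambda(Y) e^{\tilde\beta^T Z(X,H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.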

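A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in \mathcal{S}(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that
\[
\mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\Big[\int \psi\, d\Lambda_0\Big] = 0, \tag{A14}
\]
where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int \psi\, d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon \int \psi\, d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain
\[
\sum_{H\in\mathcal{S}(G)} Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) P_\gamma(H)
\times \Big\{ \frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \Lambda_0(\tau) e^{\beta_0^T Z(X,H)} \mu_\beta^T Z(X,H)}{Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)}
\]
\[
+ \frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)}
+ \mu_\gamma^T \nabla_\gamma \log P_\gamma(H) \Big\} = 0. \tag{A15}
\]
In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A14), we have
\[
\sum_{H\in\mathcal{S}(G)} \big\{ 1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \big\} P_\gamma(H)
\times \Big\{ -\frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \Lambda_0(\tau) e^{\beta_0^T Z(X,H)} \mu_\beta^T Z(X,H)}{1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)}
\]
\[
- \frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)}
+ \mu_\gamma^T \nabla_\gamma \log P_\gamma(H) \Big\} = 0. \tag{A16}
\]
The summation of (A15) and (A16) entails $\mu_\gamma^T \nabla_\gamma \log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A14) with $\Delta = 1$ implies that
\[
\psi(Y) + \frac{\ddot Q\big(\Lambda_0(Y) e^{\beta_0^T Z(X,H)}\big) \int_0^Y \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{\dot Q\big(\Lambda_0(Y) e^{\beta_0^T Z(X,H)}\big)} = 0
\]
for $H = (h, h)$. Therefore, $\psi = 0$.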

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), "Prediction and Entropy," in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), "Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?" European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), "Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease," Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), "Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling," The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), "Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations," Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), "Regression Models and Life-Tables" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), "Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data," American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), "Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population," Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), "Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer's Disease," Genome Research, 11, 143–151.

Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), "Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci," Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), "Initial Sequencing and Analysis of the Human Genome," Nature, 409, 860–921.


International SNP Map Working Group (2001), "A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms," Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), "Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous," Human Heredity, 55, 56–65.

Li, H. (2001), "A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants," Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), "Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach," Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), "Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals," Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), "An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies," Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), "On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles," Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), "On Profile Likelihood," Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), "Semiparametric Mixtures in Case-Control Studies," Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), "Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), "Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21," Science, 294, 1719–1723.

Pettitt, A. N. (1984), "Proportional Odds Models for Survival Data and Estimates Using Ranks," Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), "Logistic Disease Incidence Models and Case-Control Studies," Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), "Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), "Searching for Genetic Determinants in the New Millennium," Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), "A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables," Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), "Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies," Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), "Evaluating Associations of Haplotypes With Traits," Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), "Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous," American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), "Fitting Regression Models to Case-Control Data by Maximum Likelihood," Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), "Evolutionary-Based Association Analysis Using Haplotype Data," Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), "A New Statistical Method for Haplotype Reconstruction From Population Data," American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), "Modeling and E-M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals," Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), "Mapping Genes for NIDDM," Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), "Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling," Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), "The Sequence of the Human Genome," Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), "On the Use of DNA Pooling to Estimate Haplotype Frequencies," Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), "Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals," Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), "Estimating Haplotype-Disease Associations With Pooled Genotype Data," Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), "Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data," technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), "Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data," American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), "A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies," American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance: it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and, eventually, to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are, in the vast majority, due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple "disease" haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and to understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases, where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999), Service, Temple Lang, Freimer, and Sandkuijl (1999), Lam, Roeder, and Devlin (2000), Morris, Whittaker, and Balding (2000), Liu, Sabatti, Teng, Keats, and Risch (2001), and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which of the observed alleles constituting a genotype (phase) can be inferred by using family information when available, by assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), by resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or by exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice, or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
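To make the role of EM in handling missing phase concrete, the following Python sketch estimates haplotype frequencies for two biallelic SNPs from unphased genotypes under Hardy–Weinberg equilibrium. It follows the general idea of EM haplotype-frequency estimation rather than the specific algorithm of Lin and Zeng, which also involves the phenotype model; the names `HAPS`, `compatible_pairs`, and `em_haplotype_freqs`, and the coding of genotypes as minor-allele counts, are assumptions made for this illustration.

```python
from itertools import product

# The four possible haplotypes at two biallelic SNPs (alleles coded 0/1).
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Ordered haplotype index pairs whose allele counts add up to the genotype."""
    target = tuple(genotype)
    return [(a, b) for a, b in product(range(4), repeat=2)
            if tuple(x + y for x, y in zip(HAPS[a], HAPS[b])) == target]


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies from unphased genotypes under HWE.

    genotypes: list of (g1, g2), each the count (0, 1, or 2) of allele 1 per SNP.
    """
    freqs = [0.25] * 4  # uniform starting values
    for _ in range(n_iter):
        counts = [0.0] * 4
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each phase resolution under HWE.
            weights = [freqs[a] * freqs[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new = [c / (2 * len(genotypes)) for c in counts]
        converged = max(abs(n - f) for n, f in zip(new, freqs)) < tol
        freqs = new
        if converged:
            break
    return freqs


if __name__ == "__main__":
    # Only the double heterozygote (1, 1) is phase ambiguous.
    data = [(0, 0)] * 3 + [(2, 2)] + [(1, 1)]
    print(em_haplotype_freqs(data))
```

In this toy data set the single ambiguous individual is resolved almost entirely toward the phase supported by the unambiguous genotypes, which illustrates why treating EM-reconstructed haplotypes as observed can overstate the apparent LD.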

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of diseases/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69(suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariates and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates X we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model (3) applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x, \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate value is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that X and H are independent at the population level, not just conditional on G.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let D be disease status, $H_d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), G be the observed genotype, and X be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H_d, X) = \beta_0 + m(H_d, X, \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on (G, X) conditional on D. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H_d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates (G, X) is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates X in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of X in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H_d$ and environmental factors X are independent in the population, with a parametric form

$$\mathrm{pr}(H_d = h_d \mid X) = \mathrm{pr}(H_d = h_d) = q(h_d, \theta), \tag{2}$$

where the model $q(h_d, \theta)$, in turn, could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H_d = h_d \mid X = x) = q(h_d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure X.
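As a hedged illustration of how models (2) and (3) might be specified, the Python sketch below computes diplotype probabilities under HWE from a vector of haplotype frequencies and lets those frequencies depend on a stratum covariate through a polytomous logistic form. The parameterization (baseline haplotype with zero coefficients, scalar stratum indicator `z`, coefficient names `alpha` and `gamma`) is an assumption chosen for the example and is not claimed to be the parameterization used by Spinka et al.

```python
import math
from itertools import combinations_with_replacement


def hwe_diplotype_probs(theta):
    """q(h_d, theta) for unordered diplotypes (j, k), j <= k, under HWE."""
    probs = {}
    for j, k in combinations_with_replacement(range(len(theta)), 2):
        # Homozygotes have probability theta_j^2, heterozygotes 2 theta_j theta_k.
        probs[(j, k)] = theta[j] ** 2 if j == k else 2.0 * theta[j] * theta[k]
    return probs


def polytomous_logistic_freqs(alpha, gamma, z):
    """Stratum-specific haplotype frequencies (hypothetical parameterization).

    Haplotype 0 is the baseline; haplotype j >= 1 has linear predictor
    alpha[j-1] + gamma[j-1] * z, where z is a scalar stratum covariate.
    """
    scores = [0.0] + [a + g * z for a, g in zip(alpha, gamma)]
    top = max(scores)  # subtract the maximum for numerical stability
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]


if __name__ == "__main__":
    theta_z = polytomous_logistic_freqs(alpha=[-0.5, -1.0], gamma=[0.3, 0.0], z=1.0)
    print(hwe_diplotype_probs(theta_z))
```

With `gamma` set to zero the frequencies no longer depend on the stratum, which corresponds to the independence model (2) as a special case of (3).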

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of X and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = \frac{q(h, x, \theta)\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype G. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h}\sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H_d$ and X, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
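The quantities entering the pseudolikelihood can be written out directly. The Python sketch below evaluates κ, S(d, h, x), L*, and the pseudo-log-likelihood for user-supplied functions q and m; it is only an illustrative transcription of the definitions, not the estimation procedure of Spinka et al. The function names, the callable interface for `q` and `m`, and the toy diplotype coding are assumptions introduced for this example.

```python
import math


def kappa(beta0, n1, n0, pi):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))


def s_term(d, h, x, beta0, kap, beta1, q, m):
    """S(d, h, x) = q(h, x) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    lin = m(h, x, beta1)
    return q(h, x) * math.exp(d * (kap + lin)) / (1.0 + math.exp(beta0 + lin))


def l_star(d_obs, compatible, x, diplotypes, beta0, kap, beta1, q, m):
    """L*(D, G, X): sum over diplotypes compatible with G, normalized over all (d, h)."""
    num = sum(s_term(d_obs, h, x, beta0, kap, beta1, q, m) for h in compatible)
    den = sum(s_term(d, h, x, beta0, kap, beta1, q, m)
              for h in diplotypes for d in (0, 1))
    return num / den


def pseudo_loglik(records, diplotypes, beta0, kap, beta1, q, m):
    """l* = sum_i log L*(D_i, G_i, X_i); records hold (D_i, compatible set, X_i)."""
    return sum(math.log(l_star(d, comp, x, diplotypes, beta0, kap, beta1, q, m))
               for d, comp, x in records)


# Toy inputs (all hypothetical): a diplotype is coded by the number of copies
# of a target haplotype with frequency 0.2, so q follows HWE and ignores x.
DIPLOTYPES = [0, 1, 2]
Q_TOY = {0: 0.64, 1: 0.32, 2: 0.04}


def q_toy(h, x):
    return Q_TOY[h]


def m_toy(h, x, beta1):
    return beta1[0] * h + beta1[1] * x


if __name__ == "__main__":
    b0 = -3.0
    kap = kappa(b0, n1=500, n0=500, pi=0.05)
    recs = [(1, [1, 2], 1.0), (0, [0], 0.0)]
    print(pseudo_loglik(recs, DIPLOTYPES, b0, kap, (0.4, 0.2), q_toy, m_toy))
```

Because the denominator sums S over every disease status and diplotype, L* sums to one over all (D, G) configurations at a fixed x, which is a convenient sanity check on an implementation.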

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status D = d is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let R = 1 denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

$$l^* = \sum_{i=1}^{N} \log \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_{di} = h_d, X_i, R_i = 1) \times \mathrm{pr}(H_{di} = h_d \mid X_i, R_i = 1) = \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1). \tag{5}$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with F(x) treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on X in the approach of Stram et al. (2003).
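A small numerical check may help connect the Bernoulli sampling scheme to the intercept κ. Under selection probabilities proportional to μ_d, Bayes' rule gives the disease probability among selected subjects, and its logit works out to κ + m. The Python sketch below verifies this for one set of hypothetical values; the function names are assumptions for the illustration, and N_d is taken to be the number of sampled subjects with D = d (n_1 cases and n_0 controls).

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def kappa(beta0, n1, n0, pi):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))


def selected_disease_prob(beta0, lin, n1, n0, pi):
    """pr(D = 1 | h, x, R = 1) when selection is proportional to mu_d = n_d / pr(D = d).

    lin stands for m(h, x, beta1), the non-intercept part of the linear predictor.
    """
    p = expit(beta0 + lin)            # population-level pr(D = 1 | h, x)
    mu1, mu0 = n1 / pi, n0 / (1.0 - pi)
    return mu1 * p / (mu1 * p + mu0 * (1.0 - p))


if __name__ == "__main__":
    b0, lin, n1, n0, pi = -3.0, 0.7, 400, 600, 0.02
    print(selected_disease_prob(b0, lin, n1, n0, pi))
    print(expit(kappa(b0, n1, n0, pi) + lin))  # agrees up to rounding
```

The two printed values coincide, so within the selected sample the disease model keeps its logistic form with β0 replaced by κ.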

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H_d$ and X are independent given G, and then allows the distribution of [X | G] to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than model (2), which assumes that $H_d$ and X are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates X. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of X for each different genotype G, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data (D, G, X), ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$0 = \sum_{i=1}^{N} \sum_{h_d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d, \theta) \times \Biggl\{ \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d, \theta) \Biggr\}^{-1}. \tag{6}$$

Unfortunately this purely prospective score equation is bi-ased under the case-control sampling design even if the truehaplotype frequencies were known and the underlying HWEand genendashenvironment independence assumptions were validHowever a simple modification of the prospective score equa-tion is unbiased

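\[
0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \times \Bigl\{ \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \Bigr\}^{-1}, \tag{7}
\]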

where

\[
r(h^d, X) = \frac{1 + \exp\{\kappa + m(h^d, X; \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, X; \beta_1)\}}.
\]

Spinka et al. showed that with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters θ and possibly $\beta_0$ are estimated based on score equations derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
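To make the difference between (6) and (7) concrete, the following minimal Python sketch computes the normalized weights that each score equation places on the diplotypes compatible with one subject's genotype. It assumes a logistic form pr(D = 1 | h^d, X) = expit{κ + m(h^d, X; β1)}; the function names and the numerical inputs are illustrative and are not taken from Spinka et al.

```python
import math

def logistic(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

def r_weight(kappa, beta0, m_value):
    """r(h^d, X) = [1 + exp(kappa + m)] / [1 + exp(beta0 + m)]."""
    return (1.0 + math.exp(kappa + m_value)) / (1.0 + math.exp(beta0 + m_value))

def diplotype_weights(d, m_values, q_values, kappa, beta0=None):
    """Normalized weights over the diplotypes compatible with one genotype.

    d        : disease indicator (0 or 1)
    m_values : m(h^d, X; beta_1) for each compatible diplotype
    q_values : q(h^d; theta) for each compatible diplotype
    beta0    : None gives the purely prospective weights of (6);
               otherwise each term is multiplied by r(h^d, X) as in (7).
    """
    terms = []
    for m, q in zip(m_values, q_values):
        p1 = logistic(kappa + m)          # assumed logistic disease model
        p = p1 if d == 1 else 1.0 - p1
        r = 1.0 if beta0 is None else r_weight(kappa, beta0, m)
        terms.append(p * r * q)
    total = sum(terms)
    return [t / total for t in terms]

# Illustrative inputs: two compatible diplotypes for a case.
w_prospective = diplotype_weights(1, [0.0, 0.5], [0.7, 0.3], kappa=-1.0)
w_modified = diplotype_weights(1, [0.0, 0.5], [0.7, 0.3], kappa=-1.0, beta0=-4.0)
```

When κ equals β0 the factor r is identically 1 and the two sets of weights coincide; the modification matters only to the extent that case-control sampling shifts the intercept.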

2. COHORT-BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates (X) as

\[
\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X; \beta_1), \tag{8}
\]

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X; \beta_1)$ is a parametric function describing the relative risk associated with the exposure ($H^d$, X), and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\Pr(H^d) = q(H^d; \theta)$ is specified according to HWE and that θ denotes the associated haplotype frequency parameters. Model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data G and covariates X in the form

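\[
\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\}, \tag{9}
\]

where

\[
R^*\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\} = E\{R(H^d, X; \beta_1) \mid G, X, T \ge t\} = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\, \mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}.
\]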

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function

\[
R^*(G, X; \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d; \theta)}
\]

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X; \beta_1, \hat{\theta})$, where $\hat{\theta}$ is a consistent estimate of the haplotype-frequency parameters θ. Chen and Chatterjee described alternative strategies for obtaining consistent estimates of θ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
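As a rough illustration of the induced relative risk, the sketch below evaluates the conditional expectation of R over the diplotypes compatible with a genotype. It assumes the Cox model form pr[T > t | H^d, X] = exp{−Λ0(t) R(H^d, X; β1)}, where Λ0 is the cumulative baseline hazard; setting Λ0(t) = 0 reproduces the rare-disease version. The function name and the numerical inputs (a double heterozygote with two compatible diplotypes) are hypothetical.

```python
import math

def induced_relative_risk(risks, q_values, cum_baseline_hazard=0.0):
    """E{R(H^d, X) | G, X, T >= t} over the diplotypes compatible with G.

    risks[j]            : R(H^d_j, X; beta_1) for the j-th compatible diplotype
    q_values[j]         : q(H^d_j; theta), e.g. 2 * pi_k * pi_l under HWE
    cum_baseline_hazard : Lambda_0(t); 0 gives the rare-disease approximation
    """
    # Survival probability of each diplotype under the assumed Cox model.
    surv = [math.exp(-cum_baseline_hazard * r) for r in risks]
    num = sum(r * s * q for r, s, q in zip(risks, surv, q_values))
    den = sum(s * q for s, q in zip(surv, q_values))
    return num / den

# Hypothetical double heterozygote: one compatible diplotype carries the
# risk haplotype (relative risk exp(0.5)), the other does not.
risks = [math.exp(0.5), 1.0]
q_values = [0.20, 0.04]
rare = induced_relative_risk(risks, q_values)
later = induced_relative_risk(risks, q_values, cum_baseline_hazard=0.5)
```

As the cumulative hazard grows, high-risk diplotypes are depleted from the risk set and the induced relative risk drifts toward the lower-risk diplotype, which is why (9) depends on the baseline hazard unless the disease is rare.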

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X; \beta_1, \hat{\theta})$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), “Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions,” Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), “Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects,” Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Prentice, R. L. (1982), “Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model,” Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), “Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity,” Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of “Hardy–Weinberg equilibrium (HWE)” in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits “inbreeding” (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider “inbreeding.” This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual; that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where ρ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
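The inbreeding model can be written out for all haplotype pairs with a few lines of Python. The diagonal entries follow the formula in the text; the off-diagonal entries (1 − ρ)π_k π_l are an assumption (the usual companion formula), chosen so that the probabilities sum to 1. The haplotype frequencies used here are arbitrary illustrative values.

```python
def diplotype_probabilities(hap_freqs, rho):
    """Matrix pi[k][l] of ordered haplotype-pair probabilities.

    Diagonal:     rho * pi_k + (1 - rho) * pi_k**2   (as in the text)
    Off-diagonal: (1 - rho) * pi_k * pi_l            (assumed companion formula)
    """
    n_haps = len(hap_freqs)
    pi = [[(1.0 - rho) * hap_freqs[k] * hap_freqs[l] for l in range(n_haps)]
          for k in range(n_haps)]
    for k in range(n_haps):
        pi[k][k] += rho * hap_freqs[k]   # extra mass on matching pairs
    return pi

freqs = [0.2, 0.3, 0.5]                  # illustrative haplotype frequencies
hwe = diplotype_probabilities(freqs, rho=0.0)
inbred = diplotype_probabilities(freqs, rho=0.05)
excess = [inbred[k][k] - hwe[k][k] for k in range(len(freqs))]
```

Each entry of `excess` equals ρπ_k(1 − π_k), so the homozygote excess is largest for haplotypes of intermediate frequency and vanishes as ρ goes to 0.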

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate ρ = .0002 in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be ρ = .015. Contrast this with the level of inbreeding (ρ = .05) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).
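A back-of-the-envelope calculation shows what such levels imply. It uses the standard pedigree values, which are not stated in the text: the offspring of first cousins have inbreeding coefficient 1/16 and the offspring of second cousins 1/64. If a fraction $f_1$ of marriages are between first cousins and a fraction $f_2$ between second cousins, and all other couples are treated as unrelated (a simplification that ignores more remote relationships and inbreeding accumulated in earlier generations), then the average coefficient is approximately

\[
\rho \approx \frac{f_1}{16} + \frac{f_2}{64}.
\]

With $f_2 = 0$, reaching $\rho = .05$ would require $f_1 = 16 \times .05 = .8$, whereas $\rho = .015$ corresponds to $f_1 = 16 \times .015 = .24$.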

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding (ρ = .03) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of “homozygotes” in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
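The confounding mechanism can be seen in a small deterministic example. The sketch below builds expected 2 × 2 tables (marker carrier status by disease status) for two hypothetical subpopulations in which marker and disease are independent, then pools them. All sizes, frequencies, and risks are invented for illustration.

```python
def odds_ratio(table):
    """Odds ratio for [[carrier_case, carrier_ctrl], [noncarrier_case, noncarrier_ctrl]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def stratum_table(size, carrier_freq, disease_risk):
    """Expected counts when marker and disease are independent within a stratum."""
    rows = []
    for marker_prob in (carrier_freq, 1.0 - carrier_freq):
        n = size * marker_prob
        rows.append([n * disease_risk, n * (1.0 - disease_risk)])
    return rows

def pool(t1, t2):
    """Cell-by-cell sum of two 2x2 tables."""
    return [[t1[i][j] + t2[i][j] for j in range(2)] for i in range(2)]

# Subpopulation A: marker common, disease common; B: marker rare, disease rare.
table_a = stratum_table(10000, carrier_freq=0.8, disease_risk=0.10)
table_b = stratum_table(10000, carrier_freq=0.2, disease_risk=0.02)
within_a, within_b = odds_ratio(table_a), odds_ratio(table_b)
pooled_or = odds_ratio(pool(table_a, table_b))
```

Within each subpopulation the odds ratio is 1, yet the pooled table gives an odds ratio of roughly 2.5, purely because marker frequency and disease risk both vary with subpopulation membership.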

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider “testing” versus “estimating” the effect of haplotypes. Traditionally, genetic epidemiologists have tested β = 0 rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the “needle in a haystack” nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of β, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in β. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of β: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated n = 500 cases and controls using (12) from LZ with α = −4 and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification: we simulated the data from a list of three haplotypes ($h_1$ = 11, $h_2$ = 01, and $h_3$ = 00) with probabilities ($\pi_1$ = .2, $\pi_2$ = .3, and $\pi_3$ = .5). To estimate $\beta_1$, we used LZ's rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of ρ = .015. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario, both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
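A data-generating step in the spirit of scenario (A) might look like the following sketch. LZ's equation (12) is not reproduced in this comment, so the disease model here is an assumption: a logistic risk with intercept α and an additive effect β1 per copy of $h_1$. The function names are hypothetical, and the sketch covers only the sampling of diplotypes, not the estimation step.

```python
import math
import random

def case_control_distributions(freqs, rho, alpha, beta1):
    """Exact diplotype distributions among cases and among controls.

    Assumptions (not taken from LZ): logistic risk with intercept alpha and an
    additive effect beta1 per copy of haplotype index 0 (h1); ordered-pair
    probabilities follow the inbreeding model with coefficient rho.
    """
    pairs, case_w, ctrl_w = [], [], []
    for k, fk in enumerate(freqs):
        for l, fl in enumerate(freqs):
            prob = (1.0 - rho) * fk * fl + (rho * fk if k == l else 0.0)
            copies = (k == 0) + (l == 0)             # copies of h1 carried
            risk = 1.0 / (1.0 + math.exp(-(alpha + beta1 * copies)))
            pairs.append((k, l))
            case_w.append(prob * risk)
            ctrl_w.append(prob * (1.0 - risk))
    case_total, ctrl_total = sum(case_w), sum(ctrl_w)
    return (pairs,
            [w / case_total for w in case_w],
            [w / ctrl_total for w in ctrl_w])

def draw_case_control(freqs, rho, alpha, beta1, n, seed=0):
    """Draw n case diplotypes and n control diplotypes."""
    rng = random.Random(seed)
    pairs, p_case, p_ctrl = case_control_distributions(freqs, rho, alpha, beta1)
    cases = rng.choices(pairs, weights=p_case, k=n)
    controls = rng.choices(pairs, weights=p_ctrl, k=n)
    return cases, controls

cases, controls = draw_case_control([0.2, 0.3, 0.5], rho=0.015,
                                    alpha=-4.0, beta1=0.5, n=500)
```

Sampling directly from the conditional diplotype distributions avoids simulating a large source population, which matters when the disease is rare (α = −4 gives a baseline risk below 2%).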

Table 1. Bias and Mean Squared Error (MSE) of $\hat{\beta}_1$

β1    Scenario                          Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario “Baseline” refers to ρ = 0, no ascertainment bias, and one causal haplotype, $h_1$.

The bias of $\hat{\beta}_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.
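The decomposition of mean squared error into squared bias and variance makes the point quantitatively. Taking the Table 1 entries for $\beta_1 = .5$ as bias .273 and MSE .083 in scenario [B], and bias −.217 and MSE .058 in scenario [C], the implied variances are

\[
\mathrm{Var}(\hat{\beta}_1) = \mathrm{MSE} - \mathrm{bias}^2:
\qquad
.083 - (.273)^2 \approx .083 - .0745 = .0085,
\qquad
.058 - (.217)^2 \approx .058 - .0471 = .0109 .
\]

Both values are close to the baseline MSE of .010, so under these readings the inflation in MSE is attributable almost entirely to bias rather than to added variability.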

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), “Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus,” American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), “Use of Unphased Multilocus Genotype Data in Indirect Association Studies,” Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), “Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power,” Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), “No Excess of Homozygosity at Loci Used for DNA Fingerprinting,” Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), “Genomic Control for Association Studies,” Biometrics, 55, 997–1004.

Donbak, L. (2004), “Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact,” Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), “The Transmission/Disequilibrium Test: History, Subdivision, and Admixture,” American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), “Genome Association Studies of Complex Diseases by Case-Control Designs,” American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), “Genetic Dissection of Complex Traits,” Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), “Analysis of Single-Locus Tests to Detect Gene/Disease Associations,” Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), “TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes,” American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), “Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching,” Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.
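For readers less familiar with the EM approach to missing phase mentioned above, the following minimal sketch estimates haplotype frequencies for two biallelic SNPs from unphased genotypes under HWE. It is a textbook-style illustration, not LZ's algorithm: it ignores the phenotype entirely, and the function names and genotype coding (allele counts 0, 1, or 2 at each SNP) are assumptions made for the example.

```python
from itertools import product

# The four possible haplotypes over two biallelic SNPs.
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def compatible_pairs(genotype):
    """Ordered haplotype-index pairs whose allele counts match the genotype."""
    return [(i, j) for i, j in product(range(4), repeat=2)
            if tuple(a + b for a, b in zip(HAPS[i], HAPS[j])) == tuple(genotype)]

def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes (HWE assumed)."""
    freqs = [0.25] * 4
    for _ in range(n_iter):
        counts = [0.0] * 4
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior probability of each phase resolution under HWE.
            weights = [freqs[i] * freqs[j] for i, j in pairs]
            total = sum(weights)
            for (i, j), w in zip(pairs, weights):
                counts[i] += w / total
                counts[j] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freqs = [c / (2 * len(genotypes)) for c in counts]
    return freqs

# Only the double heterozygote (1, 1) is phase-ambiguous with two SNPs.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimated = em_haplotype_freqs(sample)
```

In this toy sample the unambiguous homozygotes carry only haplotypes 00 and 11, so the EM iterations resolve the double heterozygotes as 00/11 and drive the estimated frequencies of 01 and 10 toward 0.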

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis as considered by LZ provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.
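To illustrate what a logic combination of binary SNP variables looks like as a predictor, here is a minimal sketch. The particular Boolean expression, the function names, and the 0/1 coding (e.g., carrier versus noncarrier of the minor allele) are all hypothetical; logic regression itself searches over such expressions, which this sketch does not attempt.

```python
def logic_feature(snps):
    """Hypothetical Boolean predictor: (SNP1 and not SNP2) or SNP3.

    snps is a sequence of 0/1 indicators for at least three SNPs.
    """
    s1, s2, s3 = (bool(v) for v in snps[:3])
    return int((s1 and not s2) or s3)

def logic_column(subjects):
    """Apply the logic predictor to each subject's SNP indicator vector."""
    return [logic_feature(row) for row in subjects]

subjects = [
    [1, 0, 0],   # SNP1 present, SNP2 absent        -> 1
    [1, 1, 0],   # SNP2 cancels the SNP1 term       -> 0
    [0, 1, 1],   # SNP3 alone is sufficient         -> 1
    [0, 0, 0],   # no contributing variants         -> 0
]
column = logic_column(subjects)
```

The resulting 0/1 column would enter a regression model as a single predictor, which is how such models can capture interactions that a haplotype parameterization cannot express directly.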

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as “metadata,” usually defined as “data about data.” For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that known biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.
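The reduction in the number of candidate interactions can be sketched as follows. The pathway map is entirely hypothetical (placeholder gene and pathway names, not drawn from GO, KEGG, or BioCarta), and the rule used, keeping only gene pairs that share at least one pathway, is just one simple way to encode biological plausibility.

```python
from itertools import combinations

def candidate_pairs(pathways):
    """Gene pairs that share at least one pathway."""
    pairs = set()
    for genes in pathways.values():
        pairs.update(combinations(sorted(set(genes)), 2))
    return pairs

def all_pairs(pathways):
    """Every pair among the genes that appear in any pathway."""
    genes = sorted({g for members in pathways.values() for g in members})
    return set(combinations(genes, 2))

# Hypothetical pathway membership, for illustration only.
pathways = {
    "pathway_A": ["G1", "G2", "G3"],
    "pathway_B": ["G3", "G4"],
}
restricted = candidate_pairs(pathways)   # 4 pairs
unrestricted = all_pairs(pathways)       # 6 pairs
```

With four genes the saving is modest, but the number of unrestricted pairs grows quadratically in the number of genes, whereas the restricted set grows only with pathway sizes.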

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.


Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
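The sliding-window scan with a permutation adjustment can be sketched as follows. This is only an illustration under stated assumptions: `window_stat` is a hypothetical stand-in (a difference in mean allele counts between cases and controls) for the likelihood-ratio statistic, the genotype coding (0, 1, or 2 copies per marker) is assumed, and the adjustment shown is the familiar single-step maximum-statistic permutation procedure rather than the Monte Carlo method of Lin (2005).

```python
import random

def sliding_windows(n_markers, width):
    """All windows [start, stop) of `width` adjacent markers."""
    return [(s, s + width) for s in range(n_markers - width + 1)]

def window_stat(genos, y, window):
    """Hypothetical stand-in for the per-window likelihood-ratio statistic:
    absolute case-control difference in the mean allele count in the window."""
    start, stop = window
    case = [sum(g[start:stop]) for g, yi in zip(genos, y) if yi == 1]
    ctrl = [sum(g[start:stop]) for g, yi in zip(genos, y) if yi == 0]
    return abs(sum(case) / len(case) - sum(ctrl) / len(ctrl))

def permutation_adjusted_p(genos, y, width, n_perm=200, seed=0):
    """Adjusted p-values: compare each observed window statistic with the
    permutation distribution of the maximum statistic over all windows."""
    rng = random.Random(seed)
    windows = sliding_windows(len(genos[0]), width)
    observed = [window_stat(genos, y, w) for w in windows]
    exceed = [0] * len(windows)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permuting disease labels breaks any association
        max_stat = max(window_stat(genos, y_perm, w) for w in windows)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Because every window is compared with the same permutation maximum, the adjustment automatically reflects the correlation between overlapping windows.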

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype–environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With $K$ haplotypes, we can either compare one haplotype with all the others or compare each of the first $(K - 1)$ haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and that of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating; no viability and/or fertility differential of alleles; no immigration or emigration; no mutation; and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


Unlike the likelihood for the cross-sectional design given in (5), the density functions of $X$ given $G$ cannot be factored out of the likelihood given in (6) and thus cannot be omitted from the likelihood.

We maximize (6) to obtain the MLEs $\hat\theta$ and $\{\hat F_g(\cdot)\}$. The latter is an empirical function with point masses at the observed $X_i$ such that $G_i = g$ and $R_i = 1$. The maximization can be carried out via the Newton–Raphson, profile-likelihood, or large-scale optimization methods. An alternative way to calculate the MLEs is through the EM algorithm described in Section A.3.1.

We impose the following regularity condition and then state the asymptotic results in Theorem 1.

Condition 4. For any $g \in \mathcal{G}$, $f_g(x)$ is positive in its support and continuously differentiable with respect to a suitable measure.

Theorem 1. Under Conditions 1–4, $\hat\theta$ and $\{\hat F_g(\cdot)\}$ are consistent in that $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a mean-0 normal random vector whose covariance matrix attains the semiparametric efficiency bound.

Let $\mathrm{pl}_n(\theta)$ be the profile log-likelihood for $\theta$; that is, $\mathrm{pl}_n(\theta) = \max_{\{F_g\}} \log L_n(\theta, \{F_g\})$. Then the $(s, t)$th element of the inverse covariance matrix of $\hat\theta$ can be estimated by

\[
-\varepsilon_n^{-2}\bigl\{\mathrm{pl}_n(\hat\theta + \varepsilon_n e_s + \varepsilon_n e_t) - \mathrm{pl}_n(\hat\theta + \varepsilon_n e_s - \varepsilon_n e_t) - \mathrm{pl}_n(\hat\theta - \varepsilon_n e_s + \varepsilon_n e_t) + \mathrm{pl}_n(\hat\theta)\bigr\},
\]

where $\varepsilon_n$ is a constant of the order $n^{-1/2}$ and $e_s$ and $e_t$ are the $s$th and $t$th canonical vectors. The function $\mathrm{pl}_n(\theta)$ can be calculated via the EM algorithm by holding $\theta$ constant in both the E-step and the M-step.
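The idea of estimating the information for $\theta$ from numerical second differences of the profile log-likelihood can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `pl` is a hypothetical callable standing in for $\mathrm{pl}_n$ (which in practice would be evaluated by the EM algorithm with $\theta$ held fixed), and the code uses the textbook four-point central difference, a standard numerical recipe rather than a term-by-term transcription of the expression in the text.

```python
def profile_information(pl, theta_hat, eps):
    """Estimate the matrix of negative second derivatives of pl at theta_hat
    by four-point central differences with step size eps."""
    d = len(theta_hat)

    def shifted(s, a, t, b):
        # theta_hat moved by a*eps along e_s and by b*eps along e_t
        theta = list(theta_hat)
        theta[s] += a * eps
        theta[t] += b * eps
        return theta

    info = [[0.0] * d for _ in range(d)]
    for s in range(d):
        for t in range(d):
            diff = (pl(shifted(s, 1, t, 1)) - pl(shifted(s, 1, t, -1))
                    - pl(shifted(s, -1, t, 1)) + pl(shifted(s, -1, t, -1)))
            info[s][t] = -diff / (4.0 * eps ** 2)
    return info


# Toy check with a quadratic "profile log-likelihood" whose curvature is known.
A = [[2.0, 0.5], [0.5, 1.0]]

def pl_toy(theta):
    return -0.5 * sum(A[s][t] * theta[s] * theta[t]
                      for s in range(2) for t in range(2))

n = 400
info_hat = profile_information(pl_toy, [0.0, 0.0], eps=n ** -0.5)
```

For the quadratic toy function the differences recover `A` up to rounding; for a real profile log-likelihood the choice of step size of order $n^{-1/2}$ matters, as stated in the text.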

Remark 3. If $N$ is much larger than $n$, or if the population frequencies rather than the totals are known, then we maximize

\[
\prod_{i=1}^{n} \prod_{g \in \mathcal{G}} \{m_g(Y_i, X_i; \theta) f_g(X_i)\}^{I(G_i = g)}
\]

subject to the constraints that $\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x) = p_y$, where $p_y$ is the population frequency of $Y = y$. The resultant estimator of $\theta_0$ is consistent, asymptotically normal, and asymptotically efficient. The results in this section can be extended straightforwardly to accommodate stratifications on covariates.

2.4 Case-Control Studies With Unknown Population Totals

We consider the classical case-control design, which measures $X$ and $G$ on $n_1$ cases ($Y = 1$) and $n_0$ controls ($Y = 0$) and requires no knowledge about the finite population. With the notation introduced in the previous section, the likelihood contribution from one individual takes the form

\[
RL(\theta, \{F_g\}) = \frac{\prod_{g \in \mathcal{G}} \{m_g(y, X; \theta) f_g(X)\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} \int m_g(y, x; \theta)\, dF_g(x)}, \tag{7}
\]

where we use $y$ instead of $Y$ to emphasize that $y$ is not random. Define

\[
f_g^{\dagger}(x) = \frac{m_g(0, x; \theta) f_g(x)}{\int m_g(0, x; \theta)\, dF_g(x)}, \qquad
q_g = \frac{\int m_g(0, x; \theta)\, dF_g(x)}{\sum_{\tilde g \in \mathcal{G}} \int m_{\tilde g}(0, x; \theta)\, dF_{\tilde g}(x)}.
\]

Clearly, $f_g^{\dagger}(x)$ is the conditional density of $X$ given $G = g$ and $Y = 0$, and $q_g$ is the conditional probability of $G = g$ given $Y = 0$. Let $g_0$ and $x_0$ be some specific values of $G$ and $X$. Write $F_g^{\dagger}(x) = \int_0^x f_g^{\dagger}(s)\, d\mu(s)$. We can express (7) as

\[
RL(\theta, \{F_g^{\dagger}\}, \{q_g\}) = \frac{\prod_{g \in \mathcal{G}} \{\eta(y, X, g; \theta) f_g^{\dagger}(X) q_g\}^{I(G = g)}}{\sum_{g \in \mathcal{G}} q_g \int \eta(y, x, g; \theta)\, dF_g^{\dagger}(x)}, \tag{8}
\]

where

\[
\eta(y, x, g; \theta) = \frac{m_g(y, x; \theta)\, m_{g_0}(0, x_0; \theta)}{m_g(0, x; \theta)\, m_{g_0}(y, x_0; \theta)}.
\]

We call $\eta$ the generalized odds ratio (Liang and Qin 2000), which reduces to the ordinary odds ratio when $S(g)$ is a singleton.
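The reduction of $\eta$ to an ordinary odds ratio can be checked numerically. The sketch below rests on assumptions that are not restated in this section: $m_g(y, x; \theta)$ is taken to be the sum, over the diplotypes compatible with $g$, of $P(y \mid x, H)$ times the diplotype probability; the disease model is logistic with a scalar coefficient; and the function `z` and all numerical values are hypothetical.

```python
import math

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def m_g(y, x, diplotypes, alpha, beta, z):
    """Assumed form of m_g(y, x; theta): sum over (h, prob) pairs compatible
    with genotype g of P(Y = y | x, h) * prob, under a logistic model."""
    total = 0.0
    for h, prob in diplotypes:
        p1 = expit(alpha + beta * z(x, h))
        total += (p1 if y == 1 else 1.0 - p1) * prob
    return total

def eta(y, x, dip_g, x0, dip_g0, alpha, beta, z):
    """Generalized odds ratio relative to the reference values (g0, x0)."""
    num = m_g(y, x, dip_g, alpha, beta, z) * m_g(0, x0, dip_g0, alpha, beta, z)
    den = m_g(0, x, dip_g, alpha, beta, z) * m_g(y, x0, dip_g0, alpha, beta, z)
    return num / den

# Hypothetical design function: h counts copies of the target haplotype.
z = lambda x, h: h + x

# Singleton S(g) and S(g0): the diplotype probabilities cancel, leaving
# the ordinary odds ratio exp{beta * (z(x, h) - z(x0, h0))}.
singleton = eta(1, 1, [(2, 0.3)], 0, [(0, 0.5)], alpha=-3.0, beta=0.25, z=z)

# Ambiguous genotype: S(g) holds two diplotypes, so eta is a ratio of mixtures.
ambiguous = eta(1, 1, [(2, 0.1), (0, 0.2)], 0, [(0, 0.5)], alpha=-3.0, beta=0.25, z=z)
```

In the singleton case the value does not depend on the intercept, which is consistent with the intercept not being identifiable from traditional case-control data.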

Remark 4. The parameter $q_g$ is a functional of $f_g^{\dagger}$ and $\theta$ because $\int m_g(0, x; \theta)\, dF_g(x) = \{\int m_g^{-1}(0, x; \theta)\, dF_g^{\dagger}(x)\}^{-1}$. This constraint makes it very difficult to study the identifiability of the parameters. Thus we treat $q_g$ as a free parameter in our development.
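The identity in Remark 4 follows in one step from the definition of $f_g^{\dagger}$. As a brief worked derivation (the shorthand $c_g$ is introduced here only for illustration, and $m_g(0, x; \theta) > 0$ is assumed on the support of $f_g$): write $c_g = \int m_g(0, x; \theta)\, dF_g(x)$, so that $f_g^{\dagger}(x) = m_g(0, x; \theta) f_g(x) / c_g$. Then

\[
\int m_g^{-1}(0, x; \theta)\, dF_g^{\dagger}(x)
= \int \frac{1}{m_g(0, x; \theta)} \cdot \frac{m_g(0, x; \theta) f_g(x)}{c_g}\, d\mu(x)
= \frac{1}{c_g} \int f_g(x)\, d\mu(x)
= \frac{1}{c_g},
\]

and inverting both sides gives $c_g = \{\int m_g^{-1}(0, x; \theta)\, dF_g^{\dagger}(x)\}^{-1}$. Because $q_g$ is $c_g$ normalized over $g \in \mathcal{G}$, it is determined by $\{f_g^{\dagger}\}$ and $\theta$.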

For traditional case-control data, the odds ratio is identifiable (whereas the intercept is not), and its MLE can be obtained by maximizing the prospective likelihood (Prentice and Pyke 1979). Similar results hold when the exposure is measured with error (Roeder, Carroll, and Lindsay 1996); however, the distribution of the measurement error needs to be estimated from a validation set or an external source. With unphased genotype data, identifiability is much more delicate. We show in Section A.4.1 that the components of $\theta$ that are identifiable from the retrospective likelihood are exactly those that are identifiable from the generalized odds ratio. Thus we assume that the generalized odds ratio depends only on a set of identifiable parameters, still denoted by $\theta$; otherwise, the inference is not tractable. For the logistic link function with linear predictor (1), we show in Section A.4.2 that if there are no dominant effects, then $\theta$ consists only of $\beta$; if there are no covariate effects but there exists a dominant main effect, then $\beta$ is identifiable and $P(H = (h^*, g - h^*))/P(G = g)$ is identifiable up to a scalar constant; and if the dominant effect depends on a continuous covariate, or if the dominant main effect and the main effect of a continuous covariate are nonzero, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$. For the probit and complementary log–log link functions, we show in Section A.4.3 that if there are dominant effects and at least one continuous covariate has an effect, then $\theta$ consists of $\alpha$, $\beta$, and $P(H = (h^*, g - h^*))/P(G = g)$.

We maximize the product of (8) over the $n \equiv n_1 + n_0$ individuals in the case-control sample to produce the MLEs $\hat\theta$, $\{\hat F_g^{\dagger}(\cdot)\}$, and $\{\hat q_g\}$. Although the $\hat F_g^{\dagger}(\cdot)$ are high-dimensional, we show in Section A.4.4 that $\hat\theta$ can be obtained by profiling a likelihood function over a scalar nuisance parameter.

To state the asymptotic properties of the MLEs, we impose the following conditions.

Condition 5. If there exists a vector $v$ such that $v^T \nabla_\theta \log \eta(1, x, g; \theta)$ is a constant with probability 1, then $v = 0$.

Condition 6. The function $f_g^{\dagger}$ is positive in its support and continuously differentiable.

Condition 7. The fraction $n_1/n$ converges to a constant in $(0, 1)$.


Remark 5. Condition 5 implies nonsingularity of the information matrix for $\theta_0$ and can be shown to hold for the logistic, probit, and complementary log–log link functions. Condition 7 ensures that there are both cases and controls in the sample.

Theorem 2. Under Conditions 5–7, $|\hat\theta - \theta_0| + \sup_g |\hat q_g - q_g| + \sup_{x,g} |\hat F_g^{\dagger}(x) - F_g^{\dagger}(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix attains the semiparametric efficiency bound.

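In most case-control studies, the disease is (relatively) rare. When the disease is rare, considerable simplicity arises because of the following approximation for the logistic regression model:

\[
P_{\alpha,\beta}(Y \mid X, H) \approx \exp\bigl\{Y\bigl(\alpha + \beta^T Z(X, H)\bigr)\bigr\},
\]

where $Z(X, H)$ is a specific function of $X$ and $H$. We assume that either (3) or (4) holds. The likelihood based on $(X_i, G_i, y_i)$, $i = 1, \ldots, n$, can be approximated by

\[
L_n(\theta, \{F_g\}) = \prod_{i=1}^{n}
\left( \frac{\prod_{g \in \mathcal{G}} \bigl[ f_g(X_i) \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(X_i, h_k, h_l)} P_\gamma(h_k, h_l) \bigr]^{I(G_i = g)}}
{\sum_{g \in \mathcal{G}} \int_x \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, h_k, h_l)} P_\gamma(h_k, h_l)\, dF_g(x)} \right)^{y_i}
\times \Biggl[ \prod_{g \in \mathcal{G}} \Bigl\{ f_g(X_i) \sum_{(h_k, h_l) \in S(g)} P_\gamma(h_k, h_l) \Bigr\}^{I(G_i = g)} \Biggr]^{1 - y_i}. \tag{9}
\]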
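How close the rare-disease approximation is to the exact logistic probability can be seen from a small numerical comparison. The sketch below is illustrative only: the intercept values and the linear predictor `lin` (standing in for $\beta^T Z(X, H)$) are hypothetical.

```python
import math

def logistic_prob(y, alpha, lin):
    """Exact P(Y = y | X, H) under the logistic model with predictor alpha + lin."""
    p1 = 1.0 / (1.0 + math.exp(-(alpha + lin)))
    return p1 if y == 1 else 1.0 - p1

def rare_disease_approx(y, alpha, lin):
    """Approximation exp{y (alpha + lin)}, intended for rare diseases."""
    return math.exp(y * (alpha + lin))

def relative_error(y, alpha, lin):
    exact = logistic_prob(y, alpha, lin)
    return abs(rare_disease_approx(y, alpha, lin) - exact) / exact

# The error shrinks as the intercept decreases, that is, as the disease gets rarer.
for alpha in (-2.0, -4.0, -6.0):
    print(alpha, relative_error(1, alpha, 0.5), relative_error(0, alpha, 0.5))
```

For $y = 1$ the ratio of the approximation to the exact probability is $1 + e^{\alpha + \beta^T Z}$, so the relative error equals the odds of disease, which is small exactly when the disease is rare.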

We impose the following condition.

Condition 8. If $\alpha + \beta^T Z(X, H) = \tilde\alpha + \tilde\beta^T Z(X, H)$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, then $\alpha = \tilde\alpha$ and $\beta = \tilde\beta$.

This condition is similar to Condition 1 stated in Section 2.2, and it holds for the codominant model. Under this condition, it follows from Lemma 1 that no two sets of parameters can give the same likelihood with probability 1. Thus the maximizer of (9), denoted by $(\hat\theta, \{\hat F_g\})$, is locally unique. We show in Section A.4.5 that $\hat\theta$ can be easily obtained by profiling over a small number of parameters.

To derive the asymptotic properties, we provide a mathematical definition of rare disease.

Condition 9. For $i = 1, \ldots, n$, the conditional distribution of $Y_i$ given $(X_i, H_i)$ satisfies that $P(Y_i = 1 \mid X_i, H_i) = a_n \exp\{\beta_0^T Z(X_i, H_i)\} / [1 + a_n \exp\{\beta_0^T Z(X_i, H_i)\}]$, where $a_n = o(n^{-1/2})$.

Theorem 3. Under Conditions 6–9, $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to_{P_n} 0$, where $P_n$ is the probability measure given by Condition 9. Furthermore, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix achieves the semiparametric efficiency bound.

2.5 Cohort Studies

In a cohort study, $Y$ represents the time to disease occurrence, which is subject to right-censorship by $C$. The data consist of $(\tilde Y_i, \Delta_i, X_i, G_i)$, $i = 1, \ldots, n$, where $\tilde Y_i = \min(Y_i, C_i)$ and $\Delta_i = I(Y_i \le C_i)$. We relate $Y_i$ to $(X_i, H_i)$ through a class of semiparametric linear transformation models

\[
\Gamma(Y_i) = -\beta^T Z(X_i, H_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{10}
\]

where $\Gamma$ is an unknown increasing function, $Z(X, H)$ is a known function of $X$ and $H$, and the $\epsilon_i$'s are independent errors with known distribution function $F$. We may rewrite (10) as

\[
P(Y_i \le t \mid X_i, H_i) = Q\bigl(\Lambda(t) e^{\beta^T Z(X_i, H_i)}\bigr),
\]

where $\Lambda(t) = e^{\Gamma(t)}$ and $Q(x) = F(\log x)$ $(x > 0)$. The choices of the extreme-value and standard logistic distributions for $F$, or equivalently $Q(x) = 1 - e^{-x}$ and $Q(x) = 1 - (1 + x)^{-1}$, yield the proportional hazards model and the proportional odds model (Pettitt 1984).
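The way the two choices of $Q$ produce the proportional hazards and proportional odds structures can be verified numerically. In this sketch the baseline function `Lam` and the values of the linear predictor `lin` (standing in for $\beta^T Z$) are hypothetical.

```python
import math

def Q_ph(x):
    """Q for extreme-value errors: leads to the proportional hazards model."""
    return 1.0 - math.exp(-x)

def Q_po(x):
    """Q for standard logistic errors: leads to the proportional odds model."""
    return 1.0 - 1.0 / (1.0 + x)

def Lam(t):
    """Hypothetical increasing baseline function, chosen only for illustration."""
    return t ** 2

def cdf(t, lin, Q):
    """P(Y <= t | X, H) = Q(Lam(t) * exp(lin))."""
    return Q(Lam(t) * math.exp(lin))

def cum_hazard(t, lin):
    """Cumulative hazard -log{1 - P(Y <= t)} under the proportional hazards choice."""
    return -math.log(1.0 - cdf(t, lin, Q_ph))

def odds(t, lin):
    """Odds of disease by time t under the proportional odds choice."""
    p = cdf(t, lin, Q_po)
    return p / (1.0 - p)

# Both ratios equal exp(0.25) at every time point.
for t in (0.5, 1.0, 2.0):
    print(t, cum_hazard(t, 0.25) / cum_hazard(t, 0.0), odds(t, 0.25) / odds(t, 0.0))
```

Under the first choice the cumulative hazard is $\Lambda(t) e^{\beta^T Z}$, and under the second the odds of failure by time $t$ are $\Lambda(t) e^{\beta^T Z}$, so in each case the covariate effect is a time-constant multiplier on the corresponding scale.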

We impose Condition 8. Under this condition, $\beta$ and $\Lambda(\cdot)$ are identifiable from the observable data. The identifiability of the distribution of $H$ is the same here as in the case of cross-sectional studies. Under (3) or (4) and Condition 8, all of the parameters, including $\beta$, $\Lambda(\cdot)$, and $\gamma$, are identifiable. This is shown in Section A.5.1.

The following assumption on censoring is required in constructing the likelihood.

Condition 10. Conditional on $X$ and $G$, the censoring time $C$ is independent of $Y$ and $H$.

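Let $\theta = (\beta, \gamma)$. The likelihood concerning $\theta$ and $\Lambda$ takes the form

\[
L_n(\theta, \Lambda) = \prod_{i=1}^{n} \Biggl[ \sum_{(h_k, h_l) \in S(G_i)}
\Bigl\{ \dot\Lambda(\tilde Y_i) e^{\beta^T Z(X_i, (h_k, h_l))}\, \dot Q\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i, (h_k, h_l))}\bigr) \Bigr\}^{\Delta_i}
\Bigl\{ 1 - Q\bigl(\Lambda(\tilde Y_i) e^{\beta^T Z(X_i, (h_k, h_l))}\bigr) \Bigr\}^{1 - \Delta_i} P_\gamma(h_k, h_l) \Biggr]. \tag{11}
\]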

Here and in the sequel, $\dot f(x) = df(x)/dx$ and $\ddot f(x) = d^2 f(x)/dx^2$. Like (6), (8), and (9), this likelihood involves infinite-dimensional parameters. If $\Lambda$ is restricted to be absolutely continuous, then, as in the case of density estimation, there is no maximizer of this likelihood. Thus we relax $\Lambda$ to be right-continuous and replace $\dot\Lambda(\tilde Y_i)$ in (11) by the jump size of $\Lambda$ at $\tilde Y_i$. By the arguments of Zeng, Lin, and Lin (2005), the resultant MLE, denoted by $(\hat\theta, \hat\Lambda)$, exists, and $\hat\Lambda$ is a step function with jumps only at the observed $\tilde Y_i$ for which $\Delta_i = 1$. The maximization can be carried out through an optimization algorithm. Furthermore, the covariance matrix of $\hat\theta$ can be estimated by the profile likelihood method, as discussed by Zeng et al. (2005).

Lin (2004) considered the special case of the proportional hazards model under condition (2) and provided an EM algorithm for obtaining the MLEs. We can modify that algorithm to accommodate Hardy–Weinberg disequilibrium along the lines of Section A.2.2. In addition, the EM algorithm can be used to evaluate the profile likelihood.

We assume the following regularity conditions for the asymptotic results.


Condition 11. There exists some positive constant $\delta_0$ such that $P(C_i \ge \tau \mid X_i, G_i) = P(C_i = \tau \mid X_i, G_i) \ge \delta_0$ almost surely, where $\tau$ corresponds to the end of the study.

Condition 12. The true value $\Lambda_0(t)$ of $\Lambda(t)$ is a strictly increasing function in $[0, \tau]$ and is continuously differentiable. In addition, $\Lambda_0(0) = 0$, $\Lambda_0(\tau) < \infty$, and $\dot\Lambda_0(0) > 0$.

Theorem 4. Under Conditions 8 and 10–12, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ converges weakly to a Gaussian process in $R^d \times l^\infty([0, \tau])$, where $d$ is the dimension of $\theta_0$ and $l^\infty([0, \tau])$ is the space of all bounded functions on $[0, \tau]$ equipped with the supremum norm. Furthermore, $\hat\theta$ is asymptotically efficient.

3. SIMULATION STUDIES

We used Monte Carlo simulation to evaluate the proposed methods in realistic settings. We considered the five SNPs on chromosome 22 from the Finland–United States Investigation of NIDDM Genetics (FUSION) Study described in the next section. We obtained the $\pi_k$'s from the frequencies shown in Table 1 by assuming a 7% disease rate and generated haplotypes under (3) with $\rho = .05$. The $R_h^2$ in Table 1 is the measure of haplotype certainty of Stram et al. (2003). We focused on $h^* = (01100)$ and considered case-control and cohort studies.
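Generating haplotype pairs under a one-parameter departure from HWE can be sketched as follows. Model (3) is not restated in this section, so the code assumes the familiar inbreeding-coefficient form $P(H = (h_k, h_l)) = (1 - \rho)\pi_k\pi_l + \rho\,\pi_k I(k = l)$; that form, the function names, and the toy frequencies are assumptions for illustration and are not taken from Table 1.

```python
import random

def diplotype_probs(pi, rho):
    """Matrix of P(H = (h_k, h_l)) under the assumed one-parameter model:
    (1 - rho) * pi_k * pi_l, plus rho * pi_k on the diagonal."""
    K = len(pi)
    return [[(1.0 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
             for l in range(K)] for k in range(K)]

def sample_diplotypes(pi, rho, n, seed=0):
    """Draw n ordered haplotype pairs (k, l) from the assumed model."""
    rng = random.Random(seed)
    P = diplotype_probs(pi, rho)
    K = len(pi)
    pairs = [(k, l) for k in range(K) for l in range(K)]
    weights = [P[k][l] for k, l in pairs]
    return rng.choices(pairs, weights=weights, k=n)

pi_toy = [0.5, 0.3, 0.2]            # hypothetical haplotype frequencies
P_toy = diplotype_probs(pi_toy, rho=0.05)
draws = sample_diplotypes(pi_toy, rho=0.05, n=1000, seed=1)
```

Under this form the marginal haplotype frequencies remain $\pi_k$ for any $\rho$, while a positive $\rho$ inflates the homozygote probabilities relative to HWE.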

For the cohort studies, we generated ages of onset from the proportional hazards model

\[
\lambda\{t \mid x, (h_k, h_l)\} = 2t \exp\bigl[\beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x\bigr],
\]

where $X$ is a Bernoulli variable with $P(X = 1) = .2$ that is independent of $H$. We generated the censoring times from the uniform $(0, \tau)$ distribution, where $\tau$ was chosen to yield approximately 250, 500, and 1,000 cases under $n = 5{,}000$. We let $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$.
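Event times from this hazard can be drawn by inverting the cumulative hazard, which is $t^2 e^{\text{lin}}$ with lin denoting the linear predictor. The sketch below is a simplified illustration: the number of copies of $h^*$ is drawn as two independent Bernoulli variables (that is, under HWE rather than model (3)), and the frequency `p_hstar`, the value of `tau`, and the function name are hypothetical choices.

```python
import math
import random

def simulate_cohort(n, beta1, beta2, beta3, tau, p_hstar=0.25, seed=0):
    """Return a list of (observed time, event indicator, x, copies of h*)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Copies of h* carried by the subject (simplification: HWE).
        copies = int(rng.random() < p_hstar) + int(rng.random() < p_hstar)
        x = 1 if rng.random() < 0.2 else 0
        lin = beta1 * copies + beta2 * x + beta3 * copies * x
        # Cumulative hazard t^2 * exp(lin): set it equal to a unit exponential
        # variate and solve for t.
        e = rng.expovariate(1.0)
        t = math.sqrt(e / math.exp(lin))
        c = rng.uniform(0.0, tau)          # uniform (0, tau) censoring time
        data.append((min(t, c), int(t <= c), x, copies))
    return data

cohort = simulate_cohort(n=5000, beta1=0.25, beta2=0.25, beta3=0.0, tau=0.3, seed=1)
n_cases = sum(delta for _, delta, _, _ in cohort)
```

In practice `tau` would be tuned, for example by trial runs, until `n_cases` is close to the target number of cases.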

As shown in Table 2, the MLE is virtually unbiased, the likelihood ratio test has proper type I error, and the confidence interval has reasonable coverage. Additional simulation studies revealed that the proposed methods also perform well for making inference about other parameters and under other genetic models.

Table 1. Observed Haplotype Frequencies in the FUSION Study

                  Frequencies
Haplotype    Controls     Cases        $R_h^2$
00011        .0042        .0066        .388
00100        .0035        .0034        .336
00110        .0018        .0007        .377
01011        .1292        .1344        .592
01100        .2514        .3183        .738
01101        .0012        <10^{-4}     .450
01110        <10^{-4}     .0045        .499
01111        .0019        <10^{-4}     .325
10000        .0136        .0114        .456
10010        <10^{-4}     .0012        .500
10011        .3573        .2883        .727
10100        .0521        .0597        .402
10110        .0317        .0318        .554
11011        .1392        .1290        .560
11100        .0109        .0092        .266
11110        <10^{-4}     .0014        <10^{-4}
11111        .0020        <10^{-4}     .338

Table 2. Simulation Results for the Haplotype–Environment Interactions in Cohort Studies

$\beta_3$    Cases    Bias     SE      CP      Power
0            250      −.010    .232    .949    .051
             500      −.005    .157    .953    .047
             1,000    −.003    .114    .954    .046
−.25         250      −.014    .256    .950    .190
             500      −.008    .172    .949    .334
             1,000    −.004    .122    .952    .554
−.5          250      −.022    .281    .950    .505
             500      −.011    .190    .950    .806
             1,000    −.006    .132    .952    .976
.25          250      −.007    .216    .947    .207
             500      −.002    .146    .953    .395
             1,000    −.001    .109    .954    .614
.5           250      −.003    .204    .943    .693
             500      −.001    .140    .951    .940
             1,000    −.001    .105    .952    .998

NOTE: Bias and SE are the bias and standard error of $\hat\beta_3$. CP is the coverage probability of the 95% confidence interval for $\beta_3$. Power pertains to the .05-level likelihood ratio test of $H_0\colon \beta_3 = 0$. Each entry is based on 5,000 replicates.

For the case-control studies, we used the same distributions of $H$ and $X$ and considered the same $h^*$ as in the cohort studies. We generated disease incidence from the logistic regression model

\[
\operatorname{logit} P\{Y = 1 \mid x, (h_k, h_l)\} = \alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x. \tag{12}
\]

For making inference on $\beta_1$, we set $\beta_2 = \beta_3 = .25$ and varied $\beta_1$ from $-.5$ to $.5$; for making inference on $\beta_3$, we set $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$. We chose $\alpha = -3$ or $-4$, yielding disease rates between 1.6% and 7%. We let $n_1 = n_0 = 500$ or 1,000. We considered the situations of known and unknown population totals, with $N$ being 15 and 30 times of $n$ under $\alpha = -3$ and $-4$. For known population totals, we used the EM algorithm described in Section A.3.1 and evaluated the inference procedures based on the likelihood ratio statistic. For unknown population totals, we used the profile-likelihood method for rare diseases described in Section A.4.5 and set the $\pi_k$ less than $2/n$ to 0 to improve numerical stability. The results for $\beta_1$ and $\beta_3$ are displayed in Tables 3 and 4.
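Case-control sampling under a model of the form (12) can be sketched as follows. The code is a simplified illustration: subjects are drawn from an effectively infinite population until the case and control quotas are filled, the number of copies of $h^*$ is generated under HWE rather than model (3), and `p_hstar` and the function name are hypothetical.

```python
import math
import random

def sample_case_control(n1, n0, alpha, beta1, beta2, beta3, p_hstar=0.25, seed=0):
    """Draw subjects until n1 cases and n0 controls are collected.
    Each record is (x, copies of h*)."""
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n1 or len(controls) < n0:
        copies = int(rng.random() < p_hstar) + int(rng.random() < p_hstar)
        x = 1 if rng.random() < 0.2 else 0
        lin = alpha + beta1 * copies + beta2 * x + beta3 * copies * x
        diseased = rng.random() < 1.0 / (1.0 + math.exp(-lin))
        if diseased and len(cases) < n1:
            cases.append((x, copies))
        elif not diseased and len(controls) < n0:
            controls.append((x, copies))
    return cases, controls

cases, controls = sample_case_control(500, 500, alpha=-3.0,
                                      beta1=0.25, beta2=0.25, beta3=0.25, seed=2)
```

Because the disease is rare, many more subjects are screened than are retained as cases, which is why the retrospective sampling must be reflected in the likelihood.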

For known population totals, the proposed estimators are virtually unbiased, and the likelihood ratio statistics yield proper tests and confidence intervals. For unknown population totals, $\hat\beta_1$ has little bias, especially for large $n$, whereas $\hat\beta_3$ tends to be slightly biased downward; the variance estimators are fairly accurate, and the corresponding confidence intervals have reasonable coverage probabilities, except for $\alpha = -3$, $\beta_3 = .5$. The method with known population totals yields slightly higher power than the method with unknown population totals.
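When reading the coverage probabilities in the tables, it helps to know how much Monte Carlo noise 5,000 replicates leave. The following sketch uses the usual binomial approximation (our assumption, not a calculation reported in the paper) to obtain the standard error of an empirical coverage probability and a rough two-standard-error band around the nominal level.

```python
import math

def mc_standard_error(p, n_rep):
    """Binomial standard error of an empirical proportion from n_rep replicates."""
    return math.sqrt(p * (1.0 - p) / n_rep)

se = mc_standard_error(0.95, 5000)
lower, upper = 0.95 - 2 * se, 0.95 + 2 * se  # rough two-SE band around .95
```

Under this approximation the band runs from roughly .944 to .956, so the coverages of .915 and .870 reported in Table 4 for $\alpha = -3$, $\beta_3 = .5$ lie well outside what simulation error alone would explain.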

All of the aforementioned results pertain to haplotype 01100, which has a relatively high frequency and a large value of $R_h^2$; the covariate is binary, and $\rho$ is .05, which is relatively large. Additional simulation studies showed that the foregoing conclusions continue to hold for other haplotypes, other values of $\rho$, and continuous covariates. Table 5 reports some results for haplotype 10100, which has a frequency of about 5% and


Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                        Known totals                    Unknown totals
n1 = n0   α     β1      Bias    SE     CP     Power     Bias    SE     SEE    CP     Power
500       −3    −.5     −.003   .117   .952   .987      .019    .121   .124   .951   .979
                −.25    −.002   .109   .954   .587      .014    .112   .117   .960   .525
                0       −.001   .104   .951   .049      .009    .109   .112   .955   .045
                .25     −.001   .102   .950   .641      .002    .105   .108   .961   .646
                .5      .000    .099   .948   .996      −.005   .103   .106   .958   .998
          −4    −.5     .001    .112   .954   .987      .022    .119   .124   .951   .977
                −.25    −.002   .104   .955   .574      .013    .114   .117   .952   .529
                0       −.002   .100   .953   .047      .004    .109   .112   .953   .047
                .25     −.001   .095   .956   .636      −.003   .103   .108   .959   .640
                .5      −.000   .094   .950   .999      −.009   .102   .105   .956   .997
1000      −3    −.5     −.003   .082   .953   1.00      .005    .087   .087   .949   1.00
                −.25    −.002   .076   .952   .874      .005    .081   .082   .948   .853
                0       −.001   .073   .951   .049      .005    .077   .077   .954   .046
                .25     −.001   .071   .953   .898      .004    .075   .076   .948   .920
                .5      −.001   .070   .953   1.00      .003    .075   .075   .946   1.00
          −4    −.5     .000    .079   .952   1.00      .005    .087   .088   .947   1.00
                −.25    −.000   .074   .959   .867      .005    .081   .083   .954   .847
                0       −.001   .070   .955   .045      .002    .079   .079   .949   .051
                .25     −.001   .067   .956   .904      .000    .074   .076   .955   .909
                .5      −.001   .066   .956   1.00      −.002   .073   .074   .954   1.00

NOTE: Bias and SE are the bias and standard error of the estimator of β1. SEE is the mean of the standard error estimator for β1. CP is the coverage probability of the 95% confidence interval for β1. Power pertains to the .05-level test of H0: β1 = 0. Each entry is based on 5,000 replicates.

an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

\[
\operatorname{logit} P\{Y = 1 \mid X_1, X_2, (h_k, h_l)\} = \alpha + \beta_h\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x_1} X_1 + \beta_{x_2} X_2 + \beta_{hx_2}\{I(h_k = h^*) + I(h_l = h^*)\}X_2,
\]

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform(0, 1). We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x_1} = \beta_{x_2} = -\beta_{hx_2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                        Known totals                    Unknown totals
n1 = n0   α     β3      Bias    SE     CP     Power     Bias    SE     SEE    CP     Power
500       −3    −.5     −.008   .205   .949   .729      .030    .187   .195   .953   .692
                −.25    −.002   .186   .949   .271      .016    .169   .176   .961   .244
                0       −.001   .173   .946   .054      −.006   .155   .162   .963   .037
                .25     .002    .165   .949   .334      −.038   .144   .151   .958   .287
                .5      .006    .161   .947   .885      −.088   .138   .143   .915   .831
          −4    −.5     −.009   .198   .950   .763      .012    .194   .195   .950   .720
                −.25    −.005   .181   .949   .309      .006    .172   .176   .953   .264
                0       −.002   .168   .945   .055      −.007   .156   .161   .956   .044
                .25     −.001   .157   .944   .370      −.022   .146   .149   .948   .333
                .5      .001    .148   .945   .926      −.047   .136   .141   .945   .904
1000      −3    −.5     −.004   .147   .943   .953      .027    .134   .136   .950   .953
                −.25    −.003   .133   .946   .493      .013    .122   .123   .949   .477
                0       −.001   .123   .951   .049      −.005   .114   .113   .948   .052
                .25     .000    .119   .945   .580      −.034   .107   .106   .934   .535
                .5      .002    .117   .947   .994      −.080   .102   .101   .870   .986
          −4    −.5     −.005   .140   .945   .965      .010    .137   .136   .949   .965
                −.25    −.002   .128   .945   .529      .005    .124   .123   .951   .505
                0       −.001   .119   .946   .054      −.004   .113   .113   .947   .053
                .25     −.000   .110   .947   .633      −.016   .104   .105   .952   .601
                .5      .002    .105   .949   .998      −.037   .099   .099   .937   .995

NOTE: Bias and SE are the bias and standard error of the estimator of β3. SEE is the mean of the standard error estimator for β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level test of H0: β3 = 0. Each entry is based on 5,000 replicates.


Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias     SE     SEE    CP     Power
500       βh          0            −.030    .400   .401   .957   .043
          βx1         .5           .002     .151   .152   .951   .917
          βx2         .5           .001     .228   .230   .953   .584
          βhx2        −.5          .015     .641   .644   .956   .118
1000      βh          0            −.017    .275   .277   .953   .047
          βx1         .5           .002     .107   .107   .954   .997
          βx2         .5           .000     .162   .161   .950   .871
          βhx2        −.5          .012     .441   .443   .950   .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated $\rho$ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate $X$ by setting $X = 1$ for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of $X$ given $Y$ and $G$ under model (12) with $\alpha = -3.7$, $\beta_1 = .32$, and $\beta_2 = .25$. Based on 5,000 replicates, the power for testing $H_0: \beta_3 = 0$ is estimated at .053, .479, or .974 under $\beta_3 = 0$, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

             Recessive         Dominant        Additive        Codominant model
Haplotype    model             model           model           Additive        Recessive
01011        .327 (.270)       −.027 (.140)    .049 (.135)     .005 (.143)     .331 (.289)
01100        .316 (.146)       .274 (.109)     .355 (.099)     .334 (.114)     .063 (.167)
10011        −.206 (.155)      −.323 (.112)    −.320 (.095)    −.344 (.111)    .076 (.183)
10100        −1.019 (1.020)    .196 (.219)     .116 (.213)     .169 (.217)     −1.131 (1.029)
10110        .903 (.746)       −.007 (.248)    .063 (.249)     .016 (.254)     .892 (.765)
11011        −.222 (.328)      −.096 (.140)    −.127 (.133)    −.108 (.140)    −.143 (.344)

NOTE: Standard error estimates are shown in parentheses.


Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
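The permutation route to the joint distribution can be illustrated with a max-statistic adjustment. The sketch below is a generic illustration, not the procedure of Lin (2005): the per-window statistic is a simple absolute difference in mean score between cases and controls, standing in for a haplotype–disease test, and `scores[w][i]` is a hypothetical numeric summary of window `w` for individual `i`. Permuting the disease labels preserves the correlation among overlapping windows, which is what makes the adjustment less conservative than Bonferroni.

```python
import random

def window_stats(y, scores):
    """Absolute case-control difference in mean score, one value per window."""
    n1 = sum(y)
    n0 = len(y) - n1
    stats = []
    for s in scores:
        m1 = sum(v for v, yi in zip(s, y) if yi == 1) / n1
        m0 = sum(v for v, yi in zip(s, y) if yi == 0) / n0
        stats.append(abs(m1 - m0))
    return stats

def max_stat_permutation(y, scores, n_perm=1000, seed=0):
    """Family-wise adjusted p-values from the permutation null of the maximum."""
    rng = random.Random(seed)
    observed = window_stats(y, scores)
    y_perm = list(y)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the label-window association only
        max_null.append(max(window_stats(y_perm, scores)))
    return [(1 + sum(m >= t for m in max_null)) / (n_perm + 1)
            for t in observed]

# Toy data: window 0 tracks disease status, windows 1 and 2 do not.
y = [1] * 20 + [0] * 20
scores = [[float(v) for v in y],
          [(i % 2) * 0.5 for i in range(40)],
          [(i % 3) * 0.1 for i in range(40)]]
p_adj = max_stat_permutation(y, scores, n_perm=200, seed=1)
```

In a real analysis the stand-in statistic would be replaced by the likelihood-based test for each window, at the cost of refitting the model for every permutation.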

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$ yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly,

\[
(1 - \rho)\pi_k^2 + \rho\pi_k = (1 - \tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k.
\]

We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have

\[
\pi_k = \frac{-\rho + \{\rho^2 + 4c_k(1 - \rho)\}^{1/2}}{2(1 - \rho)}.
\]

Thus $(1 - \rho)^{-1}$ satisfies the equation

\[
\sum_k \bigl[(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}\bigr] = 2,
\]

and $(1 - \tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$.

To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1 - \rho) + \rho\} + \mu\pi_k(1 - \pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have $\sum_k \mu\pi_k(1 - \pi_k)/\{2\pi_k(1 - \rho) + \rho\} = 0$. Therefore, $\mu = 0$ and $\nu = 0$.
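The first part of the argument is constructive, so it can be checked numerically. The following sketch (our illustration, assuming $\rho \ge 0$ so that $x = (1 - \rho)^{-1} \ge 1$) starts from the homozygote probabilities $c_k = (1 - \rho)\pi_k^2 + \rho\pi_k$, solves the displayed equation for $x$ by bisection, which is justified by the monotonicity used in the proof, and then recovers $\rho$ and the $\pi_k$. The function names are ours.

```python
import math

def homozygote_probs(pi, rho):
    """c_k = P(G = 2 h_k) = (1 - rho) * pi_k^2 + rho * pi_k."""
    return [(1 - rho) * p * p + rho * p for p in pi]

def lemma_equation(x, c):
    """sum_k [(1 - x) + sqrt((x - 1)^2 + 4 c_k x)] - 2; nonincreasing in x."""
    return sum((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x) for ck in c) - 2

def recover_parameters(c, tol=1e-12):
    """Recover (rho, pi) from the c_k, assuming rho >= 0 and sum(c) < 1."""
    lo, hi = 1.0, 2.0
    while lemma_equation(hi, c) > 0:   # expand until the root is bracketed
        hi *= 2
    while hi - lo > tol:               # bisection on a monotone function
        mid = (lo + hi) / 2
        if lemma_equation(mid, c) > 0:
            lo = mid
        else:
            hi = mid
    x = (lo + hi) / 2
    rho = 1 - 1 / x
    pi = [((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x)) / 2 for ck in c]
    return rho, pi

c = homozygote_probs([0.5, 0.3, 0.2], rho=0.05)
rho_hat, pi_hat = recover_parameters(c)
```

The recovered values agree with the inputs up to the bisection tolerance, which mirrors the uniqueness claim: the homozygote probabilities alone pin down $(\{\pi_k\}, \rho)$.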

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G} : g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1 : g \not\ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h, h)\}$ or $\{(h, \tilde h)\}$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, h))P(H = (h, h))$ or $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, \tilde h))P(H = (h, \tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))$ does not depend on $(h_k, h_l) \in S(g)$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))P(G = g)$, where $(h_k, h_l) \in S(g)$. For $g \in \mathcal{G}_3$,

\[
m_g(y, x; \theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g),
\]

where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k, h_l) : h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) when $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).

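A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is

\[
\sum_{i=1}^n \sum_{(h_k, h_l) \in S(G_i)} p_{ikl}(\theta)\bigl\{\log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l)) + \log P_\gamma(h_k, h_l)\bigr\},
\]

where

\[
p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_{k'}, h_{l'}) \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_{k'}, h_{l'}))P_\gamma(h_{k'}, h_{l'})}.
\]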

Thus, in the $(m + 1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:

\[
\sum_{i=1}^n \sum_{(h_k, h_l) \in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l)) = 0,
\]

\[
\sum_{i=1}^n \sum_{(h_k, h_l) \in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_\gamma \log P_\gamma(h_k, h_l) = 0. \tag{A.1}
\]
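The E-step weights $p_{ikl}(\theta)$ are simple to compute once the haplotype pairs compatible with a genotype are enumerated. The following sketch is an illustration under our own assumptions, not the authors' implementation: it codes a genotype as per-SNP allele counts, enumerates ordered pairs $(h_k, h_l)$ with $h_k + h_l = g$, uses an additive logistic model for a binary trait with a single covariate, and takes $P_\gamma(h_k, h_l) = \pi_k\pi_l$ (Hardy–Weinberg). All names are ours.

```python
import itertools
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def compatible_pairs(g):
    """Ordered haplotype pairs (hk, hl) of 0/1 alleles with hk + hl = g."""
    het = [j for j, a in enumerate(g) if a == 1]
    base = [1 if a == 2 else 0 for a in g]
    pairs = []
    for bits in itertools.product((0, 1), repeat=len(het)):
        hk, hl = list(base), list(base)
        for j, b in zip(het, bits):
            hk[j], hl[j] = b, 1 - b   # split each heterozygous site
        pairs.append((tuple(hk), tuple(hl)))
    return pairs

def e_step_weights(y, x, g, pi, h_star, alpha, b1, b2):
    """Posterior probability of each compatible pair given (y, x, g)."""
    pairs = compatible_pairs(g)
    weights = []
    for hk, hl in pairs:
        dose = (hk == h_star) + (hl == h_star)
        p1 = expit(alpha + b1 * dose + b2 * x)
        lik = p1 if y == 1 else 1.0 - p1          # P(Y = y | x, (hk, hl))
        weights.append(lik * pi.get(hk, 0.0) * pi.get(hl, 0.0))
    total = sum(weights)
    return {pair: w / total for pair, w in zip(pairs, weights)}

# Hypothetical two-SNP haplotype frequencies.
pi = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
w = e_step_weights(y=1, x=1, g=(1, 1), pi=pi, h_star=(1, 1),
                   alpha=-3.0, b1=0.5, b2=0.25)
```

For the doubly heterozygous genotype in the example, the phenotype shifts weight toward the phasing that carries the risk haplotype, which is how the trait information sharpens the haplotype reconstruction relative to a genotype-only analysis.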

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k, h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k, h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1 - B)Q_2$. The complete-data likelihood can be represented by

\[
\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_k \pi_k^{I(Q_{1i} = (h_k, h_k))B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i} = (h_k, h_l))(1 - B_i)}\,\rho^{B_i}(1 - \rho)^{1 - B_i}.
\]

The corresponding score equations for $\pi_k$ and $\rho$ satisfy

\[
\pi_k = c^{-1}\Biggl[\sum_{i=1}^n B_i I(Q_{1i} = (h_k, h_k)) + \sum_{i=1}^n \sum_{l=1}^K (1 - B_i)\bigl\{I(Q_{2i} = (h_k, h_l)) + I(Q_{2i} = (h_l, h_k))\bigr\}\Biggr]
\]

and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

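\[
E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\} = \sum_{bq_1 + (1 - b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\,p(b, q_1, q_2)\,\omega(b, q_1, q_2) \times \Biggl[\sum_{bq_1 + (1 - b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\,p(b, q_1, q_2)\Biggr]^{-1},
\]

where $\omega(B, Q_1, Q_2) = BI(Q_1 = (h_k, h_k))$, $(1 - B)I(Q_2 = (h_k, h_l))$, or $B$, and

\[
p(b, q_1, q_2) = \prod_k \pi_k^{bI(q_1 = (h_k, h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1 - b)I(q_2 = (h_k, h_l))}\,\rho^b(1 - \rho)^{1 - b}.
\]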

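In the $(m + 1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:

\[
\pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\Biggl[\sum_{i=1}^n E^{(m)}\bigl\{B_i I(Q_{1i} = (h_k, h_k))\bigr\} + 2\sum_{i=1}^n \sum_{l=1}^K E^{(m)}\bigl\{(1 - B_i)I(Q_{2i} = (h_k, h_l))\bigr\}\Biggr]
\]

and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i, Q_{1i}, Q_{2i})\}$ is $E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.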
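The closed-form updates can be sketched for the simplest situation in which the phenotype factor $P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)$ is dropped, so that only genotype data drive the estimation of $(\{\pi_k\}, \rho)$. This simplification, the genotype coding (per-SNP allele counts), and the function names are our assumptions; the latent structure $H = BQ_1 + (1 - B)Q_2$ and the update formulas follow the display above.

```python
import itertools

def compatible_pairs(g):
    """Ordered haplotype pairs (hk, hl) of 0/1 alleles with hk + hl = g."""
    het = [j for j, a in enumerate(g) if a == 1]
    base = [1 if a == 2 else 0 for a in g]
    pairs = []
    for bits in itertools.product((0, 1), repeat=len(het)):
        hk, hl = list(base), list(base)
        for j, b in zip(het, bits):
            hk[j], hl[j] = b, 1 - b
        pairs.append((tuple(hk), tuple(hl)))
    return pairs

def em_step(genotypes, pi, rho):
    """One EM iteration for (pi, rho); pi must list every possible haplotype."""
    counts = {h: 0.0 for h in pi}
    expected_b = 0.0
    for g in genotypes:
        # Latent states (b, hk, hl) with unnormalized posterior weights.
        states = []
        for hk, hl in compatible_pairs(g):
            states.append((0, hk, hl, (1 - rho) * pi[hk] * pi[hl]))
            if hk == hl:                      # B = 1 forces a homozygous pair
                states.append((1, hk, hl, rho * pi[hk]))
        total = sum(s[3] for s in states)
        for b, hk, hl, wgt in states:
            post = wgt / total
            if b == 1:
                expected_b += post
                counts[hk] += post            # E{B I(Q1 = (hk, hk))}
            else:
                counts[hk] += post            # E{(1 - B) I(Q2 = (hk, hl))}
                counts[hl] += post            # E{(1 - B) I(Q2 = (hl, hk))}
    c = sum(counts.values())                  # normalizing constant
    return {h: v / c for h, v in counts.items()}, expected_b / len(genotypes)

# Single-SNP toy data: 50, 20, and 30 individuals with 0, 1, and 2 copies.
genotypes = [(0,)] * 50 + [(1,)] * 20 + [(2,)] * 30
pi, rho = {(0,): 0.5, (1,): 0.5}, 0.5
for _ in range(500):
    pi, rho = em_step(genotypes, pi, rho)
```

With a single SNP the model is saturated, so the iterations should settle at the allele frequency of .4 and at the value of $\rho$ that reproduces the observed heterozygote deficit, $1 - .2/(2 \times .4 \times .6) \approx .583$.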

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $\{F_g(\cdot)\}$. The complete-data likelihood is

\[
\prod_{i=1}^N P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i) \prod_g f_g(X_i)^{I(G_i = g)}.
\]

The M-step solves the following equations for $\theta$:

\[
\sum_{i=1}^N I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i\} = 0,
\]

\[
\sum_{i=1}^N I(R_i = 1)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i\} = 0, \tag{A.2}
\]

and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

\[
F_g\{X_i\} = \Biggl[\sum_{j=1}^N I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(X_j = X_i, G_j = g) \mid Y_j\}\Biggr] \times \Biggl[\sum_{j=1}^N I(G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(G_j = g) \mid Y_j\}\Biggr]^{-1},
\]

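where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form

\[
\frac{\sum_{(h_k, h_l) \in S(G_i)} \omega(Y_i, X_i, (h_k, h_l))P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_k, h_l) \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))P_\gamma(h_k, h_l)}
\]

for $R_i = 1$ and

\[
\sum_{g \in \mathcal{G}} \sum_{x \in \{X_j : G_j = g, R_j = 1\}} \sum_{(h_k, h_l) \in S(g)} \omega(Y_i, x, (h_k, h_l))P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))P_\gamma(h_k, h_l)F_g\{x\} \times \Biggl(\sum_{g \in \mathcal{G}} \sum_{x \in \{X_j : G_j = g, R_j = 1\}} \sum_{(h_k, h_l) \in S(g)} P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))P_\gamma(h_k, h_l)F_g\{x\}\Biggr)^{-1}
\]

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.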

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components $\{F_g(\cdot)\}$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


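A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$ yield the same likelihood,

\[
RL(\theta, \{F_g^\dagger\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A.3}
\]

Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x)q_g/\sum_{g \in \mathcal{G}} q_g = \tilde f_g^\dagger(x)\tilde q_g/\sum_{g \in \mathcal{G}} \tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

\[
\eta(y, x, g; \theta) = C(y)\eta(y, x, g; \tilde\theta), \tag{A.4}
\]

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}) : \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that

\[
\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5}
\]

for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

\[
\frac{\dfrac{\pi_1(g)(1 + e^{\alpha + \psi_2(x)})}{\pi_2(g)(1 + e^{\alpha + \psi_1(x)})} + e^{\psi_2(x) - \psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha + \psi_2(x)})}{\pi_2(g)(1 + e^{\alpha + \psi_1(x)})} + 1}
= \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha + \psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha + \psi_1(x)})} + e^{\psi_2(x) - \psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha + \psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha + \psi_1(x)})} + 1}, \tag{A.6}
\]

where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results: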

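1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields

\[
\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha + \beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha + \beta_1}}.
\]

Thus (A.6) is equivalent to

\[
\frac{\pi_1(g)/\pi_2(g)}{\pi_1(g')/\pi_2(g')} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(g')/\tilde\pi_2(g')} \quad \text{for all } g, g' \in \mathcal{G}_3.
\]

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z}z \ne 0$, (A.6) yields

\[
\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha + \beta_{3z}z}}{1 + e^{\alpha + \beta_1 + \beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha + \beta_{3z}z}}{1 + e^{\tilde\alpha + \beta_1 + \beta_{3z}z}}.
\]

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

\[
\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha + \psi_2(x)}}{1 + e^{\alpha + \psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha + \psi_2(x)}}{1 + e^{\tilde\alpha + \psi_1(x)}} \tag{A.7}
\]

for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.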

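A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\Bigr\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\Bigr\}, \tag{A.8}
\]

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\Bigr\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\Bigr\}, \tag{A.9}
\]

and

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\Bigr\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\Bigr\}, \tag{A.10}
\]

where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\Bigr\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\Bigr\}.
\]

By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore, $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}.
\]

By taking the ratio of the two sides and letting $z \to \operatorname{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore, $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,

\[
\eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\bigl\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\pi_1(g)/\pi_2(g) + \Phi(\alpha + \beta_3^T x)\bigr\}
\times \bigl[\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}\pi_1(g)/\pi_2(g) + 1 - \Phi(\alpha + \beta_3^T x)\bigr]^{-1}. \tag{A.11}
\]

It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g; \theta)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

\[
\frac{e^{-e^{\alpha}}(e^{e^{\alpha + \beta_z z}} - 1)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}(e^{e^{\tilde\alpha + \tilde\beta_z z}} - 1)}{1 - e^{-e^{\tilde\alpha}}}.
\]

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha + \beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha + \tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.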

A.4.4 Profile Likelihood of $\theta$ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

\[
l_n(\theta, \{\delta_j\}) = \sum_{j=1}^J n_{1j} \log \eta(1, x_j, g_j; \theta) - n_1 \log\Bigl\{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j\Bigr\} + \sum_{j=1}^J n_{+j} \log \delta_j,
\]

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

\[
\frac{n_{+j}}{\delta_j} - \frac{n_1\eta(1, x_j, g_j; \theta)}{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j} + \lambda = 0.
\]

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

\[
\delta_j = \frac{n_{+j}}{n - n_1 + n_1\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12}
\]

where $\mu = \sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

\[
l_n^*(\theta, \mu) = \sum_{j=1}^J n_{1j} \log \eta(1, x_j, g_j; \theta) - \sum_{j=1}^J n_{+j} \log\Bigl\{\frac{n_1}{n}\eta(1, x_j, g_j; \theta) + \Bigl(1 - \frac{n_1}{n}\Bigr)\mu\Bigr\} + (n - n_1)\log \mu.
\]

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_\mu l_n^*(\theta, \mu) + C_n$. If $\tilde\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \tilde\mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
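The profiling step over $\mu$ for a fixed $\theta$ can be checked numerically. In the sketch below (our illustration; the values of $\eta(1, x_j, g_j; \theta)$ and the counts $n_{+j}$ are made up), $\mu$ is found by bisection on the stationarity condition $\partial l_n^*(\theta, \mu)/\partial\mu = 0$, and the jump sizes are then computed from (A.12). The counts $n_{1j}$ do not enter because the first term of $l_n^*$ does not involve $\mu$.

```python
def profile_mu(eta, n_plus, n1, tol=1e-12):
    """Solve d l*_n / d mu = 0 for fixed eta_j = eta(1, x_j, g_j; theta)."""
    n = sum(n_plus)
    r = n1 / n

    def scaled_score(mu):
        # mu times the derivative of l*_n in mu; strictly decreasing in mu.
        return (n - n1) - sum(m * (1 - r) * mu / (r * e + (1 - r) * mu)
                              for e, m in zip(eta, n_plus))

    lo, hi = 0.0, max(eta)
    while scaled_score(hi) > 0:        # expand until the root is bracketed
        hi *= 2
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        if scaled_score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def jump_sizes(eta, n_plus, n1, mu):
    """Jump sizes delta_j from (A.12)."""
    n = sum(n_plus)
    return [m / (n - n1 + n1 * e / mu) for e, m in zip(eta, n_plus)]

# Hypothetical inputs: four distinct (x, g) values, n = 100, n1 = 40 cases.
eta = [1.0, 1.5, 2.2, 0.8]
n_plus = [30, 25, 20, 25]
mu = profile_mu(eta, n_plus, n1=40)
delta = jump_sizes(eta, n_plus, n1=40, mu=mu)
```

At the solution the jump sizes sum to 1 and reproduce $\mu = \sum_j \eta(1, x_j, g_j; \theta)\delta_j$, which are the two properties the derivation relies on.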

A.4.5 Profile Likelihood of $\theta$ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also, define

\[
\zeta_1(x, g; \theta) = \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, h_k, h_l)}\{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\},
\]

\[
\zeta_0(g; \theta) = \sum_{(h_k, h_l) \in S(g)} \{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\}.
\]

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

\[
l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^n \bigl\{y_i \log \zeta_1(X_i, G_i; \theta) + (1 - y_i)\log \zeta_0(G_i; \theta)\bigr\}
- \sum_{i=1}^n \sum_g I(G_i = g)\log\Bigl\{\zeta_1(X_i, G_i; \theta) + n_1^{-1}n_g\sum_{g'} \mu_{g'} - \mu_g\Bigr\}
+ \sum_{i=1}^n (1 - y_i)\log \sum_g \mu_g,
\]

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

\[
l_n^*(\theta, \mu) = \sum_{i=1}^n y_i \log \zeta_1(X_i, G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log \zeta_0(G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log \mu - \sum_{i=1}^n \log\Bigl\{(1 - r)\mu + r\sum_g \zeta_1(X_i, g; \theta)\Bigr\},
\]

where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k) : k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l) : k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by

\[
P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B, Q_1, Q_2, Y} \exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},
\]

where $\vartheta = (-\log \mu + \log\{r/(1 - r)\}, \beta^T, \log \pi_1 - \log\{\rho/(1 - \rho)\}, \ldots, \log \pi_K - \log\{\rho/(1 - \rho)\})^T$ and

\[
W(B, Q_1, Q_2, Y, X) = \Bigl(Y, YZ^T(X, H), BI(Q_1 = (h_1, h_1)) + (1 - B)\sum_l \{I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1))\}, \ldots, BI(Q_1 = (h_K, h_K)) + (1 - B)\sum_l \{I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K))\}\Bigr)^T.
\]

We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood

\[
l_n^*(\vartheta) = \sum_{i=1}^n \log\Biggl[\frac{\sum_{BQ_1 + (1 - B)Q_2 \in S(G_i)} e^{\vartheta^T W(B, Q_1, Q_2, Y_i, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\Biggr].
\]

We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

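The complete-data score function is

\[
\sum_{i=1}^n \Biggl[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\Biggr].
\]

Thus, in the E-step, we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:

\[
E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i] = \frac{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\,e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}\,W(b, q_1, q_2, Y_i, X_i)}{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\,e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}.
\]

In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

\[
\vartheta^{(k+1)} = \vartheta^{(k)} - \Omega^{-1} \times \sum_{i=1}^n \Biggl[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\Biggr],
\]

where

\[
\Omega = -\Biggl[\sum_{i=1}^n \frac{\sum_{b, q_1, q_2, y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\Biggr] + \sum_{i=1}^n \Biggl[\frac{\{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^{\otimes 2}}{\{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^2}\Biggr],
\]

and $a^{\otimes 2} = aa^T$.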

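A.4.6 Proof of Theorem 2. Write $F_{xg}(x, g) = F_g^\dagger(x)q_g$ and $\hat F_{xg}(x, g) = \hat F_g^\dagger(x)\hat q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F_{xg}^*(x, g) \equiv F_g^*(x)q_g^*$, where $q_g^* > 0$ for any $g$.

Because $\hat F_g^\dagger$ maximizes the likelihood, there exists some Lagrange multiplier $\hat\lambda_g$ such that

\[
\frac{I(G_i = g)}{\hat F_g^\dagger\{X_i\}} - \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{\int_{x,g} \eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)} - n\hat\lambda_g = 0,
\]

where $\hat F_g^\dagger\{X_i\}$ denotes the point mass of $\hat F_g^\dagger$ at $X_i$, and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^n \hat F_g^\dagger\{X_i\} = 1$, $\hat\lambda_g$ satisfies the equation

\[
n^{-1}\sum_{i=1}^n I(G_i = g)\Biggl\{\hat\lambda_g + \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{n\int_{x,g} \eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\Biggr\}^{-1} = 1 \tag{A.13}
\]

and

\[
\min_{1 \le i \le n}\Biggl\{\hat\lambda_g + \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{n\int_{x,g} \eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\Biggr\} > 0.
\]

Clearly, $\hat\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\hat\lambda_g \to \lambda_g^*$. By (A.13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that

\[
\min_{g, x}\Biggl|\lambda_g^* + \frac{\eta(1, x, g; \theta^*)q_g^*}{\int_{x,g} \eta(1, x, g; \theta^*)\,dF_{xg}^*(x, g)}\Biggr| > \delta.
\]

Consequently, when $n$ is sufficiently large,

\[
\hat F_g^\dagger(x) = n^{-1}\sum_{i=1}^n I(G_i = g, X_i \le x) \times \Biggl(\max\Biggl[\Biggl|\hat\lambda_g + \eta(1, X_i, g; \hat\theta)\hat q_g n_1 \times \Bigl\{n\int_{x,g} \eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)\Bigr\}^{-1}\Biggr|, \delta\Biggr]\Biggr)^{-1}.
\]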

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proof of theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, \{F^\dagger_g\}, \{q_g\})$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon \int \psi(x, g)\, dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

$$n^{1/2} \left\{ (v^T \Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0) + \int (v^T \Omega_{12} + \Omega_{22}[\psi])\, d(\hat F_{xg} - F_{xg}) \right\}$$
$$= n^{-1/2} \sum_{i=1}^n y_i \left\{ v^T l_\theta(1, X_i, G_i; \theta_0, F_{xg}) + l_F(1, X_i, G_i; \theta_0, F_{xg})\left[\int \psi\, dF_{xg}\right] \right\}$$
$$\quad + n^{-1/2} \sum_{i=1}^n (1 - y_i) \left\{ v^T l_\theta(0, X_i, G_i; \theta_0, F_{xg}) + l_F(0, X_i, G_i; \theta_0, F_{xg})\left[\int \psi\, dF_{xg}\right] \right\} + o_p(1),$$

where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of $x$, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of $\psi$, and $l_\theta$ and $l_F$ are the scores with respect to $\theta$ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through $\rho$. We can show that the operator $B[v, \psi] \equiv (v^T \Omega_{11} + \Omega_{21}[\psi]^T,\ v^T \Omega_{12} + \Omega_{22}[\psi]^T)$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via $\rho$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean $\rho$. By choosing some $\psi$ such that $B[v, \psi] = (v^T, 0)^T$ for all $v$, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

$$\frac{dP_n}{d\tilde P_n} = \exp\left\{ a_n \sum_{i=1}^n \left. \frac{\partial \log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a} \right|_{a=0} + o(1) \right\} \to_{\tilde P_n} 1.$$

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

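A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

$$\sum_{H \in S(G)} \left\{ \tilde\Lambda'(Y)\, e^{\tilde\beta^T Z(X,H)}\, Q'\big(\tilde\Lambda(Y)\, e^{\tilde\beta^T Z(X,H)}\big) \right\}^{\Delta} \times \left\{ 1 - Q\big(\tilde\Lambda(Y)\, e^{\tilde\beta^T Z(X,H)}\big) \right\}^{1-\Delta} P_\gamma(H)$$
$$= \sum_{H \in S(G)} \left\{ \Lambda'(Y)\, e^{\beta^T Z(X,H)}\, Q'\big(\Lambda(Y)\, e^{\beta^T Z(X,H)}\big) \right\}^{\Delta} \times \left\{ 1 - Q\big(\Lambda(Y)\, e^{\beta^T Z(X,H)}\big) \right\}^{1-\Delta} P_\gamma(H).$$

By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain

$$\sum_{H \in S(G)} Q\big(\tilde\Lambda(\tau)\, e^{\tilde\beta^T Z(X,H)}\big) P_\gamma(H) = \sum_{H \in S(G)} Q\big(\Lambda(\tau)\, e^{\beta^T Z(X,H)}\big) P_\gamma(H).$$

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y)\, e^{\tilde\beta^T Z(X,H)} = \Lambda(Y)\, e^{\beta^T Z(X,H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\tilde\beta = \beta$ and $\tilde\Lambda = \Lambda$.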

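A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

$$\mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\left[\int \psi\, d\Lambda_0\right] = 0, \qquad (A.14)$$

where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int \psi\, d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon \int \psi\, d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain

$$\sum_{H \in S(G)} Q\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big) P_\gamma(H) \times \left\{ \frac{Q'\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big)\, \Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\, \mu_\beta^T Z(X,H)}{Q\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big)} + \frac{Q'\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{Q\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H) \right\} = 0. \qquad (A.15)$$

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

$$\sum_{H \in S(G)} \left\{ 1 - Q\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big) \right\} P_\gamma(H) \times \left\{ -\frac{Q'\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big)\, \Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\, \mu_\beta^T Z(X,H)}{1 - Q\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big)} - \frac{Q'\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{1 - Q\big(\Lambda_0(\tau)\, e^{\beta_0^T Z(X,H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H) \right\} = 0. \qquad (A.16)$$

The summation of (A.15) and (A.16) entails $\mu_\gamma^T \nabla_\gamma \log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

$$\psi(Y) + \frac{Q''\big(\Lambda_0(Y)\, e^{\beta_0^T Z(X,H)}\big) \int_0^Y \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{Q'\big(\Lambda_0(Y)\, e^{\beta_0^T Z(X,H)}\big)} = 0$$

for $H = (h, h)$. Therefore, $\psi = 0$.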

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), "Prediction and Entropy," in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), "Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?" European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), "Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease," Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), "Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling," The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), "Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations," Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), "Regression Models and Life-Tables" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), "Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data," American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), "Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population," Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), "Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer's Disease," Genome Research, 11, 143–151.

Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), "Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci," Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), "Initial Sequencing and Analysis of the Human Genome," Nature, 409, 860–921.


International SNP Map Working Group (2001), "A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms," Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), "Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous," Human Heredity, 55, 56–65.

Li, H. (2001), "A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants," Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), "Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach," Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), "Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals," Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), "An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies," Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), "On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles," Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), "On Profile Likelihood," Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), "Semiparametric Mixtures in Case-Control Studies," Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), "Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), "Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21," Science, 294, 1719–1723.

Pettitt, A. N. (1984), "Proportional Odds Models for Survival Data and Estimates Using Ranks," Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), "Logistic Disease Incidence Models and Case-Control Studies," Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), "Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), "Searching for Genetic Determinants in the New Millennium," Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), "A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables," Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), "Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies," Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), "Evaluating Associations of Haplotypes With Traits," Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), "Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous," American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), "Fitting Regression Models to Case-Control Data by Maximum Likelihood," Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), "Evolutionary-Based Association Analysis Using Haplotype Data," Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), "A New Statistical Method for Haplotype Reconstruction From Population Data," American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), "Modeling and E-M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals," Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), "Mapping Genes for NIDDM," Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), "Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling," Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), "The Sequence of the Human Genome," Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), "On the Use of DNA Pooling to Estimate Haplotype Frequencies," Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), "Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals," Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), "Estimating Haplotype-Disease Associations With Pooled Genotype Data," Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), "Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data," technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), "Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data," American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), "A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies," American Journal of Human Genetics, 72, 1231–1250.

Comment
Chiara SABATTI

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and eventually to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited, through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors, for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple "disease" haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases, where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a "very good" estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping "similar" ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.
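One widely used form of multiplicity control in genome screens is the false discovery rate. The following is a minimal sketch of the Benjamini–Hochberg step-up rule applied to a list of window-level p values. It is offered only as an illustration of what such a control could look like; it is not a procedure proposed in the article or in this comment, and the function name and the level `q` are illustrative choices.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q.

    Benjamini-Hochberg step-up rule: sort the p values, find the largest
    rank k with p_(k) <= q * k / m, and reject the k smallest p values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank          # keep the largest rank that passes
    return sorted(order[:k_max])
```

The rule assumes independent or positively dependent tests, which overlapping marker windows may satisfy only approximately.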

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information when available, or assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered "true," with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
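To make the role of EM in phase inference concrete, here is a minimal sketch of EM estimation of haplotype frequencies from unphased genotypes, in the spirit of the approach cited above. It is a toy version under stated assumptions, not the algorithm of Lin and Zeng or of any cited package: two biallelic SNPs, genotypes coded as counts (0, 1, or 2) of one allele at each SNP, Hardy–Weinberg equilibrium, and no missing genotypes. All names are illustrative.

```python
from itertools import product

# The four possible haplotypes for two biallelic SNPs (allele coded 0 or 1).
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Ordered haplotype pairs whose allele counts add up to the genotype."""
    return [(h1, h2) for h1, h2 in product(HAPLOTYPES, repeat=2)
            if tuple(a + b for a, b in zip(h1, h2)) == tuple(genotype)]


def em_haplotype_frequencies(genotypes, n_iter=200):
    """Estimate haplotype frequencies by EM under Hardy-Weinberg equilibrium."""
    freq = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for g in genotypes:
            # E-step: posterior weight of each compatible phase assignment.
            pairs = compatible_pairs(g)
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n_chromosomes = 2 * len(genotypes)
        freq = {h: c / n_chromosomes for h, c in counts.items()}
    return freq
```

Only the double heterozygote, genotype (1, 1), is phase-ambiguous in this toy setting. The E-step splits each such subject between the two possible phasings in proportion to the current frequency estimates, which is why treating the most likely phasing as observed discards the remaining uncertainty.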

One interesting aspect of the article is that through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), "A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes," American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), "The Human Phenome Project," Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), "Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny," Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), "Haplotype Fine Mapping by Evolutionary Trees," American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), "Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets," American Journal of Human Genetics, 69(suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), "Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables," The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), "Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping," Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), "Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms," Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), "Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping," American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), "Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques," American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), "Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models," American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), "False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders," Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), "Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations," American Journal of Human Genetics, 64, 1728–1738.


Comment
Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng's (3) and (4) correspond to such restrictions.
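The relationship between the two likelihoods can be written out with Bayes' rule. The display below is a generic sketch in illustrative notation (x for the exposure, y for disease status, and a discrete exposure distribution P(x)); it is not taken from Prentice and Pyke or from Lin and Zeng.

```latex
% Retrospective likelihood of exposures given disease status, via Bayes' rule.
\prod_{i=1}^{n} P(x_i \mid y_i)
  = \prod_{i=1}^{n}
    \frac{P(y_i \mid x_i)\, P(x_i)}{\sum_{x'} P(y_i \mid x')\, P(x')}
```

The numerator contains the prospective likelihood P(y_i | x_i) times the exposure distribution. When P(x) is left unrestricted, it is a nuisance quantity estimated freely; when it is restricted, for example to haplotype-pair probabilities built from haplotype frequencies, the denominator ties the regression parameters to the exposure model, which is the situation in which a retrospective analysis can gain efficiency.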

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates X we can assume Hardy–Weinberg equilibrium, then Lin and Zeng's model 3 applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining m_g(y, x; θ), where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng's simulations assume that X and H are independent at the population level, not just conditional on G.
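The "distribution of marginal haplotypes corresponding to genotypes g" mentioned above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: two biallelic SNPs, Hardy–Weinberg equilibrium, and made-up haplotype frequencies (`freq`). The function names are hypothetical and are not taken from Lin and Zeng's article. It enumerates the haplotype pairs consistent with an unphased genotype and computes their probabilities given the genotype alone, that is, without conditioning on X.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Return the unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple of minor-allele counts (0, 1, or 2), one entry per SNP.
    Haplotypes are tuples of 0/1 alleles of the same length.
    """
    haplotypes = list(product((0, 1), repeat=len(genotype)))
    pairs = set()
    for h1 in haplotypes:
        for h2 in haplotypes:
            if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
                pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def phase_probabilities(genotype, freq):
    """pr(diplotype | genotype) under HWE, using marginal haplotype frequencies."""
    weights = {}
    for h1, h2 in compatible_diplotypes(genotype):
        w = freq[h1] * freq[h2]
        if h1 != h2:
            w *= 2.0  # two orderings of a heterozygous pair
        weights[(h1, h2)] = w
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Illustrative (assumed) haplotype frequencies for two SNPs.
freq = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# The double heterozygote is the only phase-ambiguous genotype with two SNPs.
probs = phase_probabilities((1, 1), freq)
for pair, p in probs.items():
    print(pair, round(p, 4))
```

If haplotype frequencies varied with X, the same calculation would have to be repeated with a different `freq` for each covariate value, which is the modeling burden that the conditional-independence assumption sidesteps.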

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited

In the Public Domain
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000826


to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this "target" haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model 3 is correct.

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for "all commonly used study designs." In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng's article. Recently we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence of the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), "Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies," Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), "Prospective Analysis of Logistic Case-Control Studies," Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng's article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the "diplotypes" (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X; \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a "robust approach" for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d; \theta), \tag{2}$$

where the model $q(h^d; \theta)$ in turn could be specified according to HWE or some of its extensions as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x; \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by "population stratification." In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = \frac{q(h, x; \theta)\exp[d\{\kappa + m(h, x; \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x; \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h}\sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
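To make the pseudolikelihood contribution L*(D, G, X) concrete, the following is a minimal sketch and not the authors' implementation. It assumes two biallelic SNPs, HWE with gene–environment independence (so q does not depend on x), and a hypothetical risk function `m_additive` that counts copies of a target haplotype; all parameter values are illustrative. Because the genotype classes partition the diplotypes, the contributions summed over all (D, G) should equal 1 for any fixed x, which the last lines check.

```python
import math
from itertools import combinations_with_replacement, product

# All haplotypes for two biallelic SNPs and all unordered haplotype pairs.
HAPLOTYPES = list(product((0, 1), repeat=2))
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))


def genotype_of(dip):
    """Unphased genotype (allele counts per SNP) implied by a diplotype."""
    h1, h2 = dip
    return tuple(a + b for a, b in zip(h1, h2))


def q_hwe(dip, theta):
    """Diplotype probability under HWE; theta maps haplotype -> frequency."""
    h1, h2 = dip
    p = theta[h1] * theta[h2]
    return p if h1 == h2 else 2.0 * p


def m_additive(dip, x, beta1):
    """Hypothetical m(h, x; beta1): additive effect of target haplotype (1, 1)."""
    b_h, b_x, b_hx = beta1
    copies = sum(h == (1, 1) for h in dip)
    return b_h * copies + b_x * x + b_hx * copies * x


def kappa_from(beta0, n1, n0, pi):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))


def S(d, dip, x, beta0, kappa, beta1, theta):
    lin = m_additive(dip, x, beta1)
    return q_hwe(dip, theta) * math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def L_star(D, G, x, beta0, kappa, beta1, theta):
    """Numerator sums over diplotypes consistent with G; denominator over all (d, h)."""
    num = sum(S(D, dip, x, beta0, kappa, beta1, theta)
              for dip in DIPLOTYPES if genotype_of(dip) == G)
    den = sum(S(d, dip, x, beta0, kappa, beta1, theta)
              for dip in DIPLOTYPES for d in (0, 1))
    return num / den


theta = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
beta0, beta1 = -3.0, (0.5, 0.2, 0.1)
kappa = kappa_from(beta0, n1=500, n0=500, pi=0.05)

genotypes = sorted({genotype_of(dip) for dip in DIPLOTYPES})
total = sum(L_star(D, G, 1.0, beta0, kappa, beta1, theta)
            for D in (0, 1) for G in genotypes)
print(total)
```

The log of `L_star` summed over subjects gives the quantity in (4); maximizing it numerically over the parameters is left out of this sketch.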

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let $R = 1$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can


now show that the pseudolikelihood $l^*$ can be expressed in the form

$$\begin{aligned}
l^* &= \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H^d_i = h^d, X_i, R_i = 1)\,\mathrm{pr}(H^d_i = h^d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an "ascertainment-corrected joint likelihood" of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with the covariate distribution $F(x)$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \left\{\frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\right\} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \left(\sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta)\right)^{-1}. \tag{6}$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \left\{\frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\right\} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \left(\sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta)\right)^{-1}, \tag{7}$$

where

$$r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x; \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x; \beta_1)\}}.$$

Spinka et al. showed that with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
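A short numerical sketch may help show what the weight r in (7) does. It assumes, as an interpretation rather than something stated explicitly in the comment, that the prospective probability in (6) is logistic with intercept κ, that is, logit pr(D = 1 | h, x) = κ + m(h, x; β1). Under that assumption the product of the prospective probability and r reduces to exp[d{κ + m}] / [1 + exp{β0 + m}], the factor that multiplies q in S(d, h, x), and r equals 1 when κ = β0. The function names and parameter values are illustrative.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def r_weight(lin, beta0, kappa):
    """r = [1 + exp(kappa + m)] / [1 + exp(beta0 + m)], where lin = m(h, x; beta1)."""
    return (1.0 + math.exp(kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def prospective_prob(d, lin, kappa):
    """pr(D = d | h, x) for a logistic model with intercept kappa (assumed form)."""
    p1 = expit(kappa + lin)
    return p1 if d == 1 else 1.0 - p1


beta0, kappa, lin = -3.0, -0.2, 0.7  # illustrative values only

# Reweighted prospective probabilities versus the S-type factor, for d = 0, 1.
reweighted = [prospective_prob(d, lin, kappa) * r_weight(lin, beta0, kappa) for d in (0, 1)]
direct = [math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin)) for d in (0, 1)]
print(reweighted, direct)
```

Read this way, the modification in (7) replaces the purely prospective weights on each compatible diplotype with weights proportional to S, which is one way to see why it corrects the bias of (6).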

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t) R(H^d, X; \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X; \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\mathrm{pr}(H^d) = q(H^d; \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and


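covariates $X$ in the form

$$\lambda[t \mid G, X] = \lambda_0(t) R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}$$

where

$$\begin{aligned}
R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\} &= E\{R(H^d, X; \beta_1) \mid G, X, T \ge t\} \\
&= \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\,\mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}.
\end{aligned}$$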

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for rare disease, one could assume that $\mathrm{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function,

$$R^*(G, X; \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d; \theta)},$$

is free of the baseline hazard function $\lambda_0(t)$. Thus under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X; \beta_1, \hat\theta)$, where $\hat\theta$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
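The induced relative risk under the rare disease approximation is a weighted average of diplotype-specific relative risks, with HWE diplotype probabilities as weights. The sketch below illustrates that calculation only; it assumes two biallelic SNPs, a hypothetical log-linear relative risk in the number of copies of a target haplotype, and made-up haplotype frequencies standing in for a consistent estimate of θ. It does not implement the partial likelihood maximization.

```python
import math
from itertools import combinations_with_replacement, product

HAPLOTYPES = list(product((0, 1), repeat=2))
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))


def genotype_of(dip):
    h1, h2 = dip
    return tuple(a + b for a, b in zip(h1, h2))


def q_hwe(dip, theta):
    h1, h2 = dip
    p = theta[h1] * theta[h2]
    return p if h1 == h2 else 2.0 * p


def relative_risk(dip, x, beta1):
    """Hypothetical R(H, X; beta1): log-linear in copies of haplotype (1, 1) and x."""
    b_h, b_x = beta1
    copies = sum(h == (1, 1) for h in dip)
    return math.exp(b_h * copies + b_x * x)


def induced_relative_risk(G, x, beta1, theta):
    """R*(G, X; beta1, theta): average of R over diplotypes consistent with G."""
    compatible = [dip for dip in DIPLOTYPES if genotype_of(dip) == G]
    num = sum(relative_risk(dip, x, beta1) * q_hwe(dip, theta) for dip in compatible)
    den = sum(q_hwe(dip, theta) for dip in compatible)
    return num / den


theta_hat = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}  # assumed estimate
beta1 = (math.log(2.0), 0.0)  # each copy of the target haplotype doubles the hazard

rr_ambiguous = induced_relative_risk((1, 1), 0.0, beta1, theta_hat)
rr_unambiguous = induced_relative_risk((2, 2), 0.0, beta1, theta_hat)
print(rr_ambiguous, rr_unambiguous)
```

For the phase-ambiguous double heterozygote, the induced relative risk falls between the values for zero and one copy of the target haplotype, weighted by how likely each phase is; for an unambiguous genotype it coincides with the diplotype-specific relative risk.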

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X; \beta_1, \hat\theta)$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), "Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions," Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), "Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects," Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Prentice, R. L. (1982), "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model," Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), "Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity," Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of "Hardy–Weinberg equilibrium (HWE)" in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
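The inbreeding model for homozygotes can be completed into a full diplotype distribution, as in the sketch below. The homozygote formula is the one given above; the heterozygote probability 2(1 − ρ)π_k π_l is the standard companion formula from population genetics and is stated here as an assumption, since the discussion does not spell it out. The haplotype labels and frequencies are illustrative.

```python
def diplotype_frequencies(pi, rho):
    """Unordered diplotype probabilities under the inbreeding model.

    pi:  dict mapping haplotype label -> population frequency (sums to 1).
    rho: inbreeding coefficient; rho = 0 recovers Hardy-Weinberg proportions.
    """
    labels = sorted(pi)
    freqs = {}
    for i, k in enumerate(labels):
        for l in labels[i:]:
            if k == l:
                # Homozygote: pi_kk = rho * pi_k + (1 - rho) * pi_k^2
                freqs[(k, l)] = rho * pi[k] + (1.0 - rho) * pi[k] ** 2
            else:
                # Heterozygote (assumed standard form): 2 * (1 - rho) * pi_k * pi_l
                freqs[(k, l)] = 2.0 * (1.0 - rho) * pi[k] * pi[l]
    return freqs


pi = {"h1": 0.5, "h2": 0.3, "h3": 0.2}  # illustrative haplotype frequencies
hwe = diplotype_frequencies(pi, rho=0.0)
inbred = diplotype_frequencies(pi, rho=0.05)

# Excess homozygosity relative to HWE for each haplotype.
for k in pi:
    print(k, inbred[(k, k)] - hwe[(k, k)])
```

The excess for haplotype k works out to ρπ_k(1 − π_k), so it is positive whenever ρ > 0, and the probabilities still sum to 1 because the heterozygote classes shrink by the factor (1 − ρ).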

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate ρ = 0.0002 in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be ρ = 0.015. Contrast this with the level of inbreeding (ρ = 0.05) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of 0.05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).
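A rough back-of-the-envelope calculation illustrates the cousin-marriage claim. It uses the standard inbreeding coefficients for offspring of first cousins (1/16) and second cousins (1/64) and the simplifying assumption that all other marriages contribute no inbreeding; these inputs are added here for illustration and are not figures taken from the discussion. If a fraction $f_1$ of marriages are between first cousins and a fraction $f_2$ between second cousins, the average inbreeding coefficient is approximately

$$\bar\rho \approx \frac{f_1}{16} + \frac{f_2}{64}.$$

With $f_2 = 0$, reaching $\bar\rho = 0.05$ requires $f_1 = 16 \times 0.05 = 0.8$. Even if every marriage were between second cousins ($f_2 = 1$, $f_1 = 0$), the result would be only $\bar\rho = 1/64 \approx 0.016$, well short of 0.05.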

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding (ρ = 0.03) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than 0.05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus by careful design we can usually rule out population substructure as the cause of excess homozygosity.
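The spurious association described above can be reproduced with a small numerical sketch. All numbers are invented for illustration: two equally sized subpopulations, a marker with no effect on disease within either one, but with disease prevalence and marker frequency both higher in the first subpopulation. The within-stratum odds ratios equal 1, yet the pooled odds ratio is well above 1.

```python
def joint_cells(prevalence, carrier_freq):
    """Joint probabilities when disease and carrier status are independent.

    Returns (case & carrier, case & noncarrier, control & carrier, control & noncarrier).
    """
    return (
        prevalence * carrier_freq,
        prevalence * (1.0 - carrier_freq),
        (1.0 - prevalence) * carrier_freq,
        (1.0 - prevalence) * (1.0 - carrier_freq),
    )


def odds_ratio(case_carrier, case_non, ctrl_carrier, ctrl_non):
    return (case_carrier * ctrl_non) / (case_non * ctrl_carrier)


# Hypothetical subpopulations: the marker is unrelated to disease within each.
subpops = [
    {"weight": 0.5, "prevalence": 0.10, "carrier_freq": 0.8},
    {"weight": 0.5, "prevalence": 0.01, "carrier_freq": 0.2},
]

stratum_ors = [
    odds_ratio(*joint_cells(s["prevalence"], s["carrier_freq"])) for s in subpops
]

# Pool the subpopulations, ignoring membership (the lurking variable).
pooled_cells = [
    sum(s["weight"] * joint_cells(s["prevalence"], s["carrier_freq"])[i] for s in subpops)
    for i in range(4)
]
pooled_or = odds_ratio(*pooled_cells)

print(stratum_ors, pooled_or)
```

Stratifying by subpopulation, or using a matched design as mentioned above, removes the association in this toy example because the marker is independent of disease within each stratum.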

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next, we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested β = 0 rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of β, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in β. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's T² (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single-SNP tests is yet more powerful than Hotelling's T² in some instances. Naive haplotype analysis performs poorly because it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods are needed that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases and requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far, we have discussed three scenarios that may introduce bias in estimators of β: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated n = 500 cases and controls using (12) from LZ, with α = −4 and β2 = β3 = 0. To expedite the simulations, we made a simplification: we simulated the data from a list of three haplotypes (h1 = 11, h2 = 01, and h3 = 00) with probabilities (π1 = .2, π2 = .3, and π3 = .5). To estimate β1, we used LZ's rare-disease algorithm as described in section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of ρ = .015. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between h1 and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both h1 and h2 are associated with the phenotype. For all three scenarios, only the effect of h1 was investigated.
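To make scenario (A) concrete, the following is a minimal sketch of how haplotype-pair probabilities change under inbreeding. It assumes the standard one-parameter inbreeding model, which may not be the exact parameterization used in the simulations; the function and variable names are illustrative only.

```python
from itertools import combinations_with_replacement


def diplotype_probs(freqs, rho=0.0):
    """Probabilities of unordered haplotype pairs under a one-parameter
    inbreeding model (assumed here for illustration):

        P(h, h) = pi_h**2 + rho * pi_h * (1 - pi_h)
        P(h, k) = 2 * (1 - rho) * pi_h * pi_k    for h != k

    Setting rho = 0 gives Hardy-Weinberg equilibrium.
    """
    probs = {}
    for (h, ph), (k, pk) in combinations_with_replacement(list(freqs.items()), 2):
        if h == k:
            probs[(h, k)] = ph ** 2 + rho * ph * (1 - ph)
        else:
            probs[(h, k)] = 2 * (1 - rho) * ph * pk
    return probs


# Haplotypes and frequencies as in the simulation described above.
freqs = {"11": 0.2, "01": 0.3, "00": 0.5}
hwe = diplotype_probs(freqs)
inbred = diplotype_probs(freqs, rho=0.015)

# Total excess of homozygotes relative to HWE: rho * (1 - sum of pi_h**2).
excess_homozygosity = sum(inbred[(h, h)] - hwe[(h, h)] for h in freqs)
```

Under this model the homozygote excess is proportional to ρ, so a coefficient as small as .015 shifts the pair probabilities only slightly.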

Table 1. Bias and Mean Squared Error (MSE) of β1

β1    Scenario                          Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of β1 is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. In contrast, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of β1 is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret β1 in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment
Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.


Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
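The sliding-window scan with a permutation adjustment can be sketched as follows. This is only an outline under stated assumptions: the window statistic `mean_difference` is a simple stand-in, not the likelihood-ratio test described above, and all function names and the data layout are hypothetical.

```python
import random


def sliding_windows(n_markers, width, step=1):
    """Yield (start, end) index pairs for windows of `width` consecutive markers."""
    for start in range(0, n_markers - width + 1, step):
        yield start, start + width


def max_stat_permutation_pvalue(genotypes, status, stat_fn, width,
                                n_perm=200, seed=0):
    """Family-wise adjusted p-value for the largest window statistic.

    genotypes: list of per-subject lists of marker codes
    status:    list of 0/1 disease indicators
    stat_fn(genotypes, status, start, end) -> nonnegative association statistic
    """
    rng = random.Random(seed)
    windows = list(sliding_windows(len(genotypes[0]), width))
    observed = max(stat_fn(genotypes, status, s, e) for s, e in windows)
    labels = list(status)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permuting labels breaks any genotype-phenotype link
        perm_max = max(stat_fn(genotypes, labels, s, e) for s, e in windows)
        exceed += perm_max >= observed
    return observed, (exceed + 1) / (n_perm + 1)


def mean_difference(genotypes, status, start, end):
    """Stand-in statistic: absolute difference between cases and controls
    in the mean allele count over the window."""
    cases = [sum(g[start:end]) for g, y in zip(genotypes, status) if y == 1]
    controls = [sum(g[start:end]) for g, y in zip(genotypes, status) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
```

Comparing the observed maximum with the permutation distribution of the maximum accounts for the correlation between overlapping windows, which a Bonferroni correction would ignore.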

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability and the numerical algorithms, apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).
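The two formulations correspond to two ways of coding a haplotype pair as regression covariates. The sketch below assumes additive (copy-count) coding, which is only one of the genetic mechanisms the regression models can accommodate; the function names are illustrative.

```python
def code_single_haplotype(diplotype, target):
    """First formulation: number of copies (0, 1, or 2) of `target`,
    compared with all other haplotypes combined."""
    return sum(1 for h in diplotype if h == target)


def code_against_reference(diplotype, haplotypes):
    """Second formulation: copy counts of the first K - 1 haplotypes;
    the last haplotype in the list serves as the reference."""
    return [sum(1 for h in diplotype if h == hap) for hap in haplotypes[:-1]]


haplotypes = ["h1", "h2", "h3"]  # K = 3, with h3 as the reference

x_single = code_single_haplotype(("h1", "h3"), target="h1")        # one copy of h1
x_reference = code_against_reference(("h1", "h2"), haplotypes)      # one copy each of h1, h2
x_homozygous_ref = code_against_reference(("h3", "h3"), haplotypes) # all zeros
```

The first coding uses a single regression parameter, whereas the second uses K − 1, which is why the second becomes unstable when many haplotypes are rare.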

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence), unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence, under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and of the assumption of independence between haplotypes and covariates. On the other hand, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative-risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative-risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative-risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.
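The monotone-likelihood property of the EM algorithm can be seen in a much simpler setting than the NPMLE discussed above. The following sketch estimates only haplotype frequencies from unphased genotypes at two biallelic SNPs under HWE; it involves no phenotype, regression parameters, or baseline hazard, and the toy data and names are hypothetical.

```python
import math
from itertools import product

# The four possible haplotypes over two biallelic SNPs (alleles coded 0/1).
HAPS = list(product((0, 1), repeat=2))


def compatible_pairs(genotype):
    """Ordered haplotype pairs whose allele counts add up to the unphased genotype."""
    return [(a, b) for a in HAPS for b in HAPS
            if tuple(x + y for x, y in zip(a, b)) == tuple(genotype)]


def log_likelihood(genotypes, pi):
    """Observed-data log-likelihood under HWE."""
    return sum(math.log(sum(pi[a] * pi[b] for a, b in compatible_pairs(g)))
               for g in genotypes)


def em_step(genotypes, pi):
    """One EM iteration: expected haplotype counts (E-step), then renormalise (M-step)."""
    counts = dict.fromkeys(HAPS, 0.0)
    for g in genotypes:
        pairs = compatible_pairs(g)
        total = sum(pi[a] * pi[b] for a, b in pairs)
        for a, b in pairs:
            w = pi[a] * pi[b] / total  # posterior probability of this phase
            counts[a] += w
            counts[b] += w
    n = 2 * len(genotypes)  # two haplotypes per subject
    return {h: c / n for h, c in counts.items()}


# Toy unphased genotypes (allele counts per SNP); (1, 1) is phase-ambiguous.
genotypes = [(1, 1)] * 6 + [(0, 0)] * 5 + [(2, 2)] * 4 + [(1, 0)] * 3 + [(0, 1)] * 2
pi = {h: 0.25 for h in HAPS}
trace = [log_likelihood(genotypes, pi)]
for _ in range(25):
    pi = em_step(genotypes, pi)
    trace.append(log_likelihood(genotypes, pi))
```

The sequence stored in `trace` is nondecreasing up to rounding error, which gives a convenient check on any EM implementation.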

A major advantage of the NPMLE approach is that it can beapplied to the entire class of transformation models not just theproportional hazards model In addition this approach is ap-plicable to case-cohort and nested case-control studies whetherenvironmental factors are measured for the selected sample orthe entire cohort and the estimators continue to be asymptoti-cally efficient A manuscript on the NPMLEs for case-cohortand nested case-control designs is currently under review

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


94 Journal of the American Statistical Association March 2006

Remark 5. Condition 5 implies nonsingularity of the information matrix for $\theta_0$ and can be shown to hold for the logistic, probit, and complementary log–log link functions. Condition 7 ensures that there are both cases and controls in the sample.

Theorem 2. Under Conditions 5–7, $|\hat\theta - \theta_0| + \sup_g |\hat q_g - q_g| + \sup_{x,g} |\hat F^\dagger_g(x) - F^\dagger_g(x)| \to 0$ almost surely. In addition, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix attains the semiparametric efficiency bound.

In most case-control studies, the disease is (relatively) rare. When the disease is rare, considerable simplicity arises because of the following approximation for the logistic regression model:

\[
P_{\alpha,\beta}(Y \mid X, H) \approx \exp\{Y(\alpha + \beta^T Z(X,H))\},
\]

where $Z(X,H)$ is a specific function of $X$ and $H$. We assume that either (3) or (4) holds. The likelihood based on $(X_i, G_i, y_i)$, $i = 1,\ldots,n$, can be approximated by

\[
L_n(\theta,\{F_g\}) = \prod_{i=1}^n \left( \frac{\prod_{g\in\mathcal{G}} \bigl[ f_g(X_i) \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(X_i,h_k,h_l)} P_\gamma(h_k,h_l) \bigr]^{I(G_i=g)}}{\sum_{g\in\mathcal{G}} \int_x \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(x,h_k,h_l)} P_\gamma(h_k,h_l)\, dF_g(x)} \right)^{y_i}
\times \left[ \prod_{g\in\mathcal{G}} \Bigl\{ f_g(X_i) \sum_{(h_k,h_l)\in S(g)} P_\gamma(h_k,h_l) \Bigr\}^{I(G_i=g)} \right]^{1-y_i}. \tag{9}
\]

We impose the following condition.

Condition 8. If $\alpha + \beta^T Z(X,H) = \tilde\alpha + \tilde\beta^T Z(X,H)$ for $H = (h_k, h_k)$ and $H = (h_k, \tilde h_k)$, then $\alpha = \tilde\alpha$ and $\beta = \tilde\beta$.

This condition is similar to Condition 1 stated in Section 2.2, and it holds for the codominant model. Under this condition, it follows from Lemma 1 that no two sets of parameters can give the same likelihood with probability 1. Thus the maximizer of (9), denoted by $(\hat\theta, \{\hat F_g\})$, is locally unique. We show in Section A.4.5 that $\hat\theta$ can be easily obtained by profiling over a small number of parameters.

To derive the asymptotic properties, we provide a mathematical definition of rare disease.

Condition 9. For $i = 1,\ldots,n$, the conditional distribution of $Y_i$ given $(X_i, H_i)$ satisfies $P(Y_i = 1 \mid X_i, H_i) = a_n \exp\{\beta_0^T Z(X_i,H_i)\} / [1 + a_n \exp\{\beta_0^T Z(X_i,H_i)\}]$, where $a_n = o(n^{-1/2})$.

Theorem 3. Under Conditions 6–9, $|\hat\theta - \theta_0| + \sup_{x,g} |\hat F_g(x) - F_g(x)| \to_{P_n} 0$, where $P_n$ is the probability measure given by Condition 9. Furthermore, $n^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a normal random vector whose covariance matrix achieves the semiparametric efficiency bound.
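As a numerical illustration of the rare-disease formulation in Condition 9, the following sketch compares the probability $a_n e^{\eta}/(1 + a_n e^{\eta})$ with the approximation $a_n e^{\eta}$, where $\eta$ stands for $\beta_0^T Z(X_i,H_i)$. The function names and the numerical values of $a_n$ and $\eta$ are illustrative assumptions and do not come from the article.

```python
import math

def exact_prob(a_n, eta):
    """P(Y = 1) = a_n e^eta / (1 + a_n e^eta), the form used in Condition 9."""
    odds = a_n * math.exp(eta)
    return odds / (1.0 + odds)

def approx_prob(a_n, eta):
    """Rare-disease approximation a_n e^eta, which drops the denominator."""
    return a_n * math.exp(eta)

def relative_error(a_n, eta):
    """Relative error of the approximation; algebraically this equals a_n e^eta."""
    exact = exact_prob(a_n, eta)
    return (approx_prob(a_n, eta) - exact) / exact

if __name__ == "__main__":
    eta = 0.5  # illustrative value of beta^T Z(X, H), an assumption
    for a_n in (0.1, 0.01, 0.001):
        print(a_n, exact_prob(a_n, eta), approx_prob(a_n, eta), relative_error(a_n, eta))
```

Because the relative error equals the odds $a_n e^{\eta}$ itself, the approximation improves in direct proportion to the rarity of the disease.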

2.5 Cohort Studies

In a cohort study, $Y$ represents the time to disease occurrence, which is subject to right-censorship by $C$. The data consist of $(\tilde Y_i, \Delta_i, X_i, G_i)$, $i = 1,\ldots,n$, where $\tilde Y_i = \min(Y_i, C_i)$ and $\Delta_i = I(Y_i \le C_i)$. We relate $Y_i$ to $(X_i, H_i)$ through a class of semiparametric linear transformation models,

\[
\Psi(Y_i) = -\beta^T Z(X_i,H_i) + \epsilon_i, \qquad i = 1,\ldots,n, \tag{10}
\]

where $\Psi$ is an unknown increasing function, $Z(X,H)$ is a known function of $X$ and $H$, and the $\epsilon_i$'s are independent errors with known distribution function $F$. We may rewrite (10) as

\[
P(Y_i \le t \mid X_i, H_i) = Q\bigl(\Lambda(t)\, e^{\beta^T Z(X_i,H_i)}\bigr),
\]

where $\Lambda(t) = e^{\Psi(t)}$ and $Q(x) = F(\log x)$ $(x > 0)$. The choices of the extreme-value and standard logistic distributions for $F$, or equivalently $Q(x) = 1 - e^{-x}$ and $Q(x) = 1 - (1 + x)^{-1}$, yield the proportional hazards model and the proportional odds model (Pettitt 1984).
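The correspondence between the error distribution $F$ and the function $Q(x) = F(\log x)$ can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the quadratic form used for $\Lambda$ in the usage line is an arbitrary assumption rather than anything estimated in the article.

```python
import math

def F_extreme_value(e):
    """Extreme-value distribution function F(e) = 1 - exp(-exp(e))."""
    return 1.0 - math.exp(-math.exp(e))

def F_logistic(e):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-e))

def Q_from_F(F, x):
    """Q(x) = F(log x), defined for x > 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    return F(math.log(x))

def failure_prob(t, eta, Lambda, Q):
    """P(Y <= t | X, H) = Q(Lambda(t) * exp(eta)), with eta = beta^T Z(X, H)."""
    return Q(Lambda(t) * math.exp(eta))

if __name__ == "__main__":
    for x in (0.1, 1.0, 5.0):
        # Extreme-value errors reproduce Q(x) = 1 - exp(-x) (proportional hazards);
        # logistic errors reproduce Q(x) = 1 - 1/(1 + x) (proportional odds).
        print(x, Q_from_F(F_extreme_value, x), 1 - math.exp(-x),
              Q_from_F(F_logistic, x), 1 - 1 / (1 + x))
    # Hypothetical Lambda(t) = t^2, used only to exercise failure_prob.
    print(failure_prob(1.0, 0.0, lambda t: t ** 2, lambda x: 1 - math.exp(-x)))
```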

We impose Condition 8. Under this condition, $\beta$ and $\Lambda(\cdot)$ are identifiable from the observable data. The identifiability of the distribution of $H$ is the same here as in the case of cross-sectional studies. Under (3) or (4) and Condition 8, all of the parameters, including $\beta$, $\Lambda(\cdot)$, and $\gamma$, are identifiable. This is shown in Section A.5.1.

The following assumption on censoring is required in constructing the likelihood.

Condition 10. Conditional on $X$ and $G$, the censoring time $C$ is independent of $Y$ and $H$.

Let $\theta = (\beta, \gamma)$. The likelihood concerning $\theta$ and $\Lambda$ takes the form

\[
L_n(\theta,\Lambda) = \prod_{i=1}^n \Biggl[ \sum_{(h_k,h_l)\in S(G_i)} \Bigl\{ \Lambda'(\tilde Y_i)\, e^{\beta^T Z(X_i,(h_k,h_l))}\, Q'\bigl(\Lambda(\tilde Y_i)\, e^{\beta^T Z(X_i,(h_k,h_l))}\bigr) \Bigr\}^{\Delta_i}
\times \Bigl\{ 1 - Q\bigl(\Lambda(\tilde Y_i)\, e^{\beta^T Z(X_i,(h_k,h_l))}\bigr) \Bigr\}^{1-\Delta_i} P_\gamma(h_k,h_l) \Biggr]. \tag{11}
\]

Here and in the sequel, $f'(x) = df(x)/dx$ and $f''(x) = d^2 f(x)/dx^2$. Like (6), (8), and (9), this likelihood involves infinite-dimensional parameters. If $\Lambda$ is restricted to be absolutely continuous, then, as in the case of density estimation, there is no maximizer of this likelihood. Thus we relax $\Lambda$ to be right-continuous and replace $\Lambda'(\tilde Y_i)$ in (11) by the jump size of $\Lambda$ at $\tilde Y_i$. By the arguments of Zeng, Lin, and Lin (2005), the resultant MLE, denoted by $(\hat\theta, \hat\Lambda)$, exists, and $\hat\Lambda$ is a step function with jumps only at the observed $\tilde Y_i$ for which $\Delta_i = 1$. The maximization can be carried out through an optimization algorithm. Furthermore, the covariance matrix of $\hat\theta$ can be estimated by the profile likelihood method, as discussed by Zeng et al. (2005).

Lin (2004) considered the special case of the proportional hazards model under condition (2) and provided an EM algorithm for obtaining the MLEs. We can modify that algorithm to accommodate Hardy–Weinberg disequilibrium along the lines of Section A.2.2. In addition, the EM algorithm can be used to evaluate the profile likelihood.

We assume the following regularity conditions for the asymptotic results.

Lin and Zeng Inference on Haplotype Effects 95

Condition 11. There exists some positive constant $\delta_0$ such that $P(C_i \ge \tau \mid X_i, G_i) = P(C_i = \tau \mid X_i, G_i) \ge \delta_0$ almost surely, where $\tau$ corresponds to the end of the study.

Condition 12. The true value $\Lambda_0(t)$ of $\Lambda(t)$ is a strictly increasing function in $[0,\tau]$ and is continuously differentiable. In addition, $\Lambda_0(0) = 0$, $\Lambda_0(\tau) < \infty$, and $\Lambda_0'(0) > 0$.

Theorem 4. Under Conditions 8 and 10–12, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ converges weakly to a Gaussian process in $R^d \times l^\infty([0,\tau])$, where $d$ is the dimension of $\theta_0$ and $l^\infty([0,\tau])$ is the space of all bounded functions on $[0,\tau]$ equipped with the supremum norm. Furthermore, $\hat\theta$ is asymptotically efficient.

3. SIMULATION STUDIES

We used Monte Carlo simulation to evaluate the proposed methods in realistic settings. We considered the five SNPs on chromosome 22 from the Finland–United States Investigation of NIDDM Genetics (FUSION) Study described in the next section. We obtained the $\pi_k$'s from the frequencies shown in Table 1 by assuming a 7% disease rate and generated haplotypes under (3) with $\rho = .05$. The $R_h^2$ in Table 1 is the measure of haplotype certainty of Stram et al. (2003). We focused on $h^* = (01100)$ and considered case-control and cohort studies.

For the cohort studies, we generated ages of onset from the proportional hazards model

\[
\lambda\{t \mid x, (h_k,h_l)\} = 2t \exp\bigl[\beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x\bigr],
\]

where $X$ is a Bernoulli variable with $P(X = 1) = .2$ that is independent of $H$. We generated the censoring times from the uniform $(0, \tau)$ distribution, where $\tau$ was chosen to yield approximately 250, 500, and 1,000 cases under $n = 5{,}000$. We let $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$.
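The data-generating step for the cohort simulations can be sketched as follows. Under the hazard $2t\exp(\eta)$, the cumulative hazard is $t^2 e^{\eta}$, so an onset age can be drawn by inverting the survival function at a uniform variate. This is a simplified illustration and not the authors' simulation code: the function name, the assumed frequency `p_star` of $h^*$, and the independent (HWE) draw of the two haplotypes are assumptions, whereas the article generates haplotypes under (3) with $\rho = .05$.

```python
import math
import random

def simulate_cohort(n, beta1, beta2, beta3, tau, p_star=0.25, p_x=0.2, seed=0):
    """Return a list of (observed time, event indicator, x, copies of h*).

    Hazard: lambda(t | x, d) = 2 t exp(beta1*d + beta2*x + beta3*d*x), where d
    is the number of copies of the target haplotype h*. p_star is an assumed
    frequency of h*, and the two haplotypes are drawn independently (HWE),
    which simplifies the article's model (3).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        d = int(rng.random() < p_star) + int(rng.random() < p_star)
        x = 1 if rng.random() < p_x else 0
        eta = beta1 * d + beta2 * x + beta3 * d * x
        # Survival function is exp(-t^2 e^eta); invert it at a uniform draw in (0, 1].
        u = 1.0 - rng.random()
        t = math.sqrt(-math.log(u) / math.exp(eta))
        c = rng.uniform(0.0, tau)  # censoring time, uniform on (0, tau)
        data.append((min(t, c), int(t <= c), x, d))
    return data

if __name__ == "__main__":
    sample = simulate_cohort(5000, 0.25, 0.25, 0.0, tau=0.3)
    print(sum(delta for _, delta, _, _ in sample), "observed cases")
```

In practice, `tau` would be tuned by trial runs until the number of observed cases is close to the target.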

As shown in Table 2, the MLE is virtually unbiased, the likelihood ratio test has proper type I error, and the confidence interval has reasonable coverage. Additional simulation studies revealed that the proposed methods also perform well for making inference about other parameters and under other genetic models.

Table 1. Observed Haplotype Frequencies in the FUSION Study

Haplotype   Frequency (controls)   Frequency (cases)   R_h^2
00011       .0042                  .0066               .388
00100       .0035                  .0034               .336
00110       .0018                  .0007               .377
01011       .1292                  .1344               .592
01100       .2514                  .3183               .738
01101       .0012                  <10^-4              .450
01110       <10^-4                 .0045               .499
01111       .0019                  <10^-4              .325
10000       .0136                  .0114               .456
10010       <10^-4                 .0012               .500
10011       .3573                  .2883               .727
10100       .0521                  .0597               .402
10110       .0317                  .0318               .554
11011       .1392                  .1290               .560
11100       .0109                  .0092               .266
11110       <10^-4                 .0014               <10^-4
11111       .0020                  <10^-4              .338

Table 2. Simulation Results for the Haplotype–Environment Interactions in Cohort Studies

β3      Cases    Bias     SE      CP      Power
0       250      −.010    .232    .949    .051
0       500      −.005    .157    .953    .047
0       1,000    −.003    .114    .954    .046
−.25    250      −.014    .256    .950    .190
−.25    500      −.008    .172    .949    .334
−.25    1,000    −.004    .122    .952    .554
−.5     250      −.022    .281    .950    .505
−.5     500      −.011    .190    .950    .806
−.5     1,000    −.006    .132    .952    .976
.25     250      −.007    .216    .947    .207
.25     500      −.002    .146    .953    .395
.25     1,000    −.001    .109    .954    .614
.5      250      −.003    .204    .943    .693
.5      500      −.001    .140    .951    .940
.5      1,000    −.001    .105    .952    .998

NOTE: Bias and SE are the bias and standard error of $\hat\beta_3$. CP is the coverage probability of the 95% confidence interval for $\beta_3$. Power pertains to the .05-level likelihood ratio test of $H_0\colon \beta_3 = 0$. Each entry is based on 5,000 replicates.

For the case-control studies, we used the same distributions of $H$ and $X$ and considered the same $h^*$ as in the cohort studies. We generated disease incidence from the logistic regression model

\[
\operatorname{logit} P\{Y = 1 \mid x, (h_k,h_l)\} = \alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x. \tag{12}
\]

For making inference on $\beta_1$, we set $\beta_2 = \beta_3 = .25$ and varied $\beta_1$ from $-.5$ to $.5$; for making inference on $\beta_3$, we set $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$. We chose $\alpha = -3$ or $-4$, yielding disease rates between 1.6% and 7%. We let $n_1 = n_0 = 500$ or 1,000. We considered the situations of known and unknown population totals, with $N$ being 15 and 30 times of $n$ under $\alpha = -3$ and $-4$. For known population totals, we used the EM algorithm described in Section A.3.1 and evaluated the inference procedures based on the likelihood ratio statistic. For unknown population totals, we used the profile-likelihood method for rare diseases described in Section A.4.5 and set the $\hat\pi_k$ less than $2/n$ to 0 to improve numerical stability. The results for $\beta_1$ and $\beta_3$ are displayed in Tables 3 and 4.

For known population totals, the proposed estimators are virtually unbiased, and the likelihood ratio statistics yield proper tests and confidence intervals. For unknown population totals, $\hat\beta_1$ has little bias, especially for large $n$, whereas $\hat\beta_3$ tends to be slightly biased downward; the variance estimators are fairly accurate, and the corresponding confidence intervals have reasonable coverage probabilities, except for $\alpha = -3$, $\beta_3 = .5$. The method with known population totals yields slightly higher power than the method with unknown population totals.

All the aforementioned results pertain to haplotype 01100, which has a relatively high frequency and a large value of $R_h^2$; the covariate is binary, and $\rho$ is .05, which is relatively large. Additional simulation studies showed that the foregoing conclusions continue to hold for other haplotypes, other values of $\rho$, and continuous covariates. Table 5 reports some results for haplotype 10100, which has a frequency of about 5% and an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

\[
\operatorname{logit} P\{Y = 1 \mid X_1, X_2, (h_k,h_l)\} = \alpha + \beta_h\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x1} X_1 + \beta_{x2} X_2 + \beta_{hx2}\{I(h_k = h^*) + I(h_l = h^*)\}X_2,
\]

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform(0, 1). We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x1} = \beta_{x2} = -\beta_{hx2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.

Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                        Known totals                   Unknown totals
n1 = n0   α     β1      Bias    SE     CP     Power    Bias    SE     SEE    CP     Power
500       −3    −.5     −.003   .117   .952   .987     .019    .121   .124   .951   .979
500       −3    −.25    −.002   .109   .954   .587     .014    .112   .117   .960   .525
500       −3    0       −.001   .104   .951   .049     .009    .109   .112   .955   .045
500       −3    .25     −.001   .102   .950   .641     .002    .105   .108   .961   .646
500       −3    .5      .000    .099   .948   .996     −.005   .103   .106   .958   .998
500       −4    −.5     .001    .112   .954   .987     .022    .119   .124   .951   .977
500       −4    −.25    −.002   .104   .955   .574     .013    .114   .117   .952   .529
500       −4    0       −.002   .100   .953   .047     .004    .109   .112   .953   .047
500       −4    .25     −.001   .095   .956   .636     −.003   .103   .108   .959   .640
500       −4    .5      −.000   .094   .950   .999     −.009   .102   .105   .956   .997
1,000     −3    −.5     −.003   .082   .953   1.00     .005    .087   .087   .949   1.00
1,000     −3    −.25    −.002   .076   .952   .874     .005    .081   .082   .948   .853
1,000     −3    0       −.001   .073   .951   .049     .005    .077   .077   .954   .046
1,000     −3    .25     −.001   .071   .953   .898     .004    .075   .076   .948   .920
1,000     −3    .5      −.001   .070   .953   1.00     .003    .075   .075   .946   1.00
1,000     −4    −.5     .000    .079   .952   1.00     .005    .087   .088   .947   1.00
1,000     −4    −.25    −.000   .074   .959   .867     .005    .081   .083   .954   .847
1,000     −4    0       −.001   .070   .955   .045     .002    .079   .079   .949   .051
1,000     −4    .25     −.001   .067   .956   .904     .000    .074   .076   .955   .909
1,000     −4    .5      −.001   .066   .956   1.00     −.002   .073   .074   .954   1.00

NOTE: Bias and SE are the bias and standard error of $\hat\beta_1$. SEE is the mean of the standard error estimator for $\hat\beta_1$. CP is the coverage probability of the 95% confidence interval for $\beta_1$. Power pertains to the .05-level test of $H_0\colon \beta_1 = 0$. Each entry is based on 5,000 replicates.

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                        Known totals                   Unknown totals
n1 = n0   α     β3      Bias    SE     CP     Power    Bias    SE     SEE    CP     Power
500       −3    −.5     −.008   .205   .949   .729     .030    .187   .195   .953   .692
500       −3    −.25    −.002   .186   .949   .271     .016    .169   .176   .961   .244
500       −3    0       −.001   .173   .946   .054     −.006   .155   .162   .963   .037
500       −3    .25     .002    .165   .949   .334     −.038   .144   .151   .958   .287
500       −3    .5      .006    .161   .947   .885     −.088   .138   .143   .915   .831
500       −4    −.5     −.009   .198   .950   .763     .012    .194   .195   .950   .720
500       −4    −.25    −.005   .181   .949   .309     .006    .172   .176   .953   .264
500       −4    0       −.002   .168   .945   .055     −.007   .156   .161   .956   .044
500       −4    .25     −.001   .157   .944   .370     −.022   .146   .149   .948   .333
500       −4    .5      .001    .148   .945   .926     −.047   .136   .141   .945   .904
1,000     −3    −.5     −.004   .147   .943   .953     .027    .134   .136   .950   .953
1,000     −3    −.25    −.003   .133   .946   .493     .013    .122   .123   .949   .477
1,000     −3    0       −.001   .123   .951   .049     −.005   .114   .113   .948   .052
1,000     −3    .25     .000    .119   .945   .580     −.034   .107   .106   .934   .535
1,000     −3    .5      .002    .117   .947   .994     −.080   .102   .101   .870   .986
1,000     −4    −.5     −.005   .140   .945   .965     .010    .137   .136   .949   .965
1,000     −4    −.25    −.002   .128   .945   .529     .005    .124   .123   .951   .505
1,000     −4    0       −.001   .119   .946   .054     −.004   .113   .113   .947   .053
1,000     −4    .25     −.000   .110   .947   .633     −.016   .104   .105   .952   .601
1,000     −4    .5      .002    .105   .949   .998     −.037   .099   .099   .937   .995

NOTE: Bias and SE are the bias and standard error of $\hat\beta_3$. SEE is the mean of the standard error estimator for $\hat\beta_3$. CP is the coverage probability of the 95% confidence interval for $\beta_3$. Power pertains to the .05-level test of $H_0\colon \beta_3 = 0$. Each entry is based on 5,000 replicates.


Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias    SE     SEE    CP     Power
500       β_h         0            −.030   .400   .401   .957   .043
500       β_x1        .5           .002    .151   .152   .951   .917
500       β_x2        .5           .001    .228   .230   .953   .584
500       β_hx2       −.5          .015    .641   .644   .956   .118
1,000     β_h         0            −.017   .275   .277   .953   .047
1,000     β_x1        .5           .002    .107   .107   .954   .997
1,000     β_x2        .5           .000    .162   .161   .950   .871
1,000     β_hx2       −.5          .012    .441   .443   .950   .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated $\rho$ at .000 for controls and .002 for cases.
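The enlargement of $S(G_i)$ for missing genotypes can be illustrated with a small enumeration. The sketch below lists the haplotype pairs compatible with an unphased multi-SNP genotype, treating a missing SNP as compatible with every allele combination. The coding of genotypes (0, 1, or 2 copies of allele 1, with `None` for missing), the use of unordered pairs, and the function name are assumptions made for this illustration.

```python
from itertools import product

def compatible_pairs(genotype):
    """Return the set of unordered haplotype pairs consistent with a genotype.

    genotype: sequence whose entries are 0, 1, or 2 (copies of allele 1 at
    each SNP) or None (missing). Haplotypes are strings of '0' and '1'.
    """
    per_snp = []
    for g in genotype:
        if g == 0:
            options = [(0, 0)]
        elif g == 2:
            options = [(1, 1)]
        elif g == 1:
            options = [(0, 1), (1, 0)]
        elif g is None:
            # A missing SNP places no restriction on the two alleles.
            options = [(0, 0), (0, 1), (1, 0), (1, 1)]
        else:
            raise ValueError("genotype entries must be 0, 1, 2, or None")
        per_snp.append(options)
    pairs = set()
    for combo in product(*per_snp):
        h1 = "".join(str(a) for a, _ in combo)
        h2 = "".join(str(b) for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))
    return pairs

if __name__ == "__main__":
    print(compatible_pairs([0, 2, 2, 0, 0]))     # homozygous everywhere: phase is known
    print(compatible_pairs([1, 1, 0, 0, 0]))     # two heterozygous SNPs: two possible pairs
    print(compatible_pairs([None, 0, 0, 0, 0]))  # one missing SNP enlarges the set
```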

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate $X$ by setting $X = 1$ for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of $X$ given $Y$ and $G$ under model (12) with $\alpha = -3.7$, $\beta_1 = .32$, and $\beta_2 = .25$. Based on 5,000 replicates, the power for testing $H_0\colon \beta_3 = 0$ is estimated at .053, .479, or .974 under $\beta_3 = 0$, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

                                                                     Codominant model
Haplotype   Recessive model   Dominant model   Additive model   Additive        Recessive
01011       .327 (.270)       −.027 (.140)     .049 (.135)      .005 (.143)     .331 (.289)
01100       .316 (.146)       .274 (.109)      .355 (.099)      .334 (.114)     .063 (.167)
10011       −.206 (.155)      −.323 (.112)     −.320 (.095)     −.344 (.111)    .076 (.183)
10100       −1.019 (1.020)    .196 (.219)      .116 (.213)      .169 (.217)     −1.131 (1.029)
10110       .903 (.746)       −.007 (.248)     .063 (.249)      .016 (.254)     .892 (.765)
11011       −.222 (.328)      −.096 (.140)     −.127 (.133)     −.108 (.140)    −.143 (.344)

NOTE: Standard error estimates are shown in parentheses.


Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
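The permutation route to a multiplicity adjustment can be sketched generically: the disease labels are permuted, the statistic is recomputed in every window, and the observed maximum is compared with the permutation distribution of the maximum. The code below is a toy illustration under stated assumptions. The per-window statistic (an absolute difference in mean scores between cases and controls) is a stand-in for a haplotype–disease test statistic, and the function names are hypothetical; this is not the procedure of Lin (2005).

```python
import random

def window_stats(y, windows):
    """Toy statistic per window: |mean score in cases - mean score in controls|."""
    cases = [i for i, yi in enumerate(y) if yi == 1]
    controls = [i for i, yi in enumerate(y) if yi == 0]
    stats = []
    for scores in windows:
        mean_case = sum(scores[i] for i in cases) / len(cases)
        mean_ctrl = sum(scores[i] for i in controls) / len(controls)
        stats.append(abs(mean_case - mean_ctrl))
    return stats

def max_stat_permutation_pvalue(y, windows, n_perm=500, seed=0):
    """Adjusted p-value for the largest window statistic via label permutation."""
    rng = random.Random(seed)
    observed = max(window_stats(y, windows))
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permuting labels keeps the correlation among windows intact
        if max(window_stats(y_perm, windows)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

if __name__ == "__main__":
    y = [1] * 10 + [0] * 10
    windows = [list(y), [0, 1] * 10]  # the first window is perfectly associated with y
    print(max_stat_permutation_pvalue(y, windows, n_perm=200))
```

Because every window is recomputed on the same permuted labels, the dependence among overlapping windows is reflected in the reference distribution, which is what a Bonferroni correction ignores.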

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters, $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$, yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly, $(1 - \rho)\pi_k^2 + \rho\pi_k = (1 - \tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k$. We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have $\pi_k = [-\rho + \{\rho^2 + 4c_k(1 - \rho)\}^{1/2}] / \{2(1 - \rho)\}$. Thus $(1 - \rho)^{-1}$ satisfies the equation $\sum_k [(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}] = 2$, and $(1 - \tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$. To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1 - \rho) + \rho\} + \mu\pi_k(1 - \pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have $\sum_k \mu\pi_k(1 - \pi_k)/\{2\pi_k(1 - \rho) + \rho\} = 0$. Therefore, $\mu = 0$ and $\nu = 0$.
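The first part of the argument is constructive, so it can be checked numerically: given the homozygote probabilities $c_k = (1 - \rho)\pi_k^2 + \rho\pi_k$, solving $\sum_k [(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}] = 2$ for $x = (1 - \rho)^{-1}$ recovers $\rho$ and then each $\pi_k$. The sketch below uses bisection, which relies on the monotonicity noted in the proof. The function names and the example parameter values are assumptions made for illustration.

```python
import math

def homozygote_probs(pi, rho):
    """c_k = P(G = 2 h_k) = (1 - rho) pi_k^2 + rho pi_k under model (3)."""
    return [(1 - rho) * p * p + rho * p for p in pi]

def recover_rho_pi(c, tol=1e-12):
    """Recover (rho, pi) from the homozygote probabilities c_k by bisection on x = 1/(1 - rho)."""
    def g(x):
        return sum((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x) for ck in c) - 2

    lo, hi = 1.0, 2.0
    while g(hi) > 0:          # g is nonincreasing in x, so expand until the sign changes
        hi *= 2
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    x = (lo + hi) / 2
    rho = 1 - 1 / x
    pi = [((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x)) / 2 for ck in c]
    return rho, pi

if __name__ == "__main__":
    true_pi, true_rho = [0.5, 0.3, 0.2], 0.05   # illustrative values
    rho_hat, pi_hat = recover_rho_pi(homozygote_probs(true_pi, true_rho))
    print(rho_hat, pi_hat)
```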

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \not\ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h,h)\}$ or $\{(h,\tilde h)\}$, so that $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h,h))P(H = (h,h))$ or $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h,\tilde h))P(H = (h,\tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k,h_l))$ does not depend on $(h_k,h_l) \in S(g)$, so that $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k,h_l))P(G = g)$, where $(h_k,h_l) \in S(g)$. For $g \in \mathcal{G}_3$,

\[
m_g(y,x;\theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g),
\]

where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k,h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) when $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y,x;\theta) = m_g(y,x;\theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y,x;\theta) = m_g(y,x;\theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).

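A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1,\ldots,n$, is

\[
\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta)\bigl\{\log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l)) + \log P_\gamma(h_k,h_l)\bigr\},
\]

where

\[
p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_{k'},h_{l'})\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_{k'},h_{l'}))P_\gamma(h_{k'},h_{l'})}.
\]

Thus, in the $(m + 1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:

\[
\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l)) = 0,
\]
\[
\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_\gamma \log P_\gamma(h_k,h_l) = 0. \tag{A.1}
\]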
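The E-step weights $p_{ikl}(\theta)$ are posterior probabilities of the haplotype pairs compatible with a subject's genotype, and they can be computed for one subject at a time. The sketch below is a minimal illustration: the function name and its callable arguments are hypothetical, and the numerical values in the usage example are made up rather than taken from the FUSION data.

```python
def e_step_weights(pairs, trait_lik, pair_prob):
    """Posterior weights over the haplotype pairs in S(G_i) for one subject.

    pairs: iterable of haplotype pairs compatible with the genotype
    trait_lik: function mapping a pair to P(Y_i | X_i, pair) at the current estimate
    pair_prob: function mapping a pair to P_gamma(pair) at the current estimate
    """
    joint = {pair: trait_lik(pair) * pair_prob(pair) for pair in pairs}
    total = sum(joint.values())
    if total <= 0:
        raise ValueError("no compatible pair has positive probability")
    return {pair: value / total for pair, value in joint.items()}

if __name__ == "__main__":
    pairs = [("01100", "10011"), ("01101", "10010")]
    freq = {pairs[0]: 0.09, pairs[1]: 0.01}   # made-up pair probabilities
    # With a constant trait likelihood, the weights are proportional to the pair probabilities.
    print(e_step_weights(pairs, lambda pair: 0.5, freq.get))
```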

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k,h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k,h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1 - B)Q_2$. The complete-data likelihood can be represented by

\[
\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_k \pi_k^{I(Q_{1i}=(h_k,h_k))B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i}=(h_k,h_l))(1-B_i)} \rho^{B_i}(1 - \rho)^{1-B_i}.
\]

The corresponding score equations for $\pi_k$ and $\rho$ satisfy

\[
\pi_k = c^{-1}\left[ \sum_{i=1}^n B_i I(Q_{1i} = (h_k,h_k)) + \sum_{i=1}^n \sum_{l=1}^K (1 - B_i)\bigl\{I(Q_{2i} = (h_k,h_l)) + I(Q_{2i} = (h_l,h_k))\bigr\} \right]
\]

and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

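\[
E\{\omega(B_i,Q_{1i},Q_{2i}) \mid Y_i, X_i, G_i\} = \sum_{bq_1+(1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\, p(b,q_1,q_2)\,\omega(b,q_1,q_2)
\times \left[ \sum_{bq_1+(1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1 + (1 - b)q_2)\, p(b,q_1,q_2) \right]^{-1},
\]

where $\omega(B,Q_1,Q_2) = BI(Q_1 = (h_k,h_k))$, $(1 - B)I(Q_2 = (h_k,h_l))$, or $B$, and

\[
p(b,q_1,q_2) = \prod_k \pi_k^{bI(q_1=(h_k,h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1-b)I(q_2=(h_k,h_l))} \rho^b(1 - \rho)^{1-b}.
\]

In the $(m + 1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:

\[
\pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\left[ \sum_{i=1}^n E^{(m)}\bigl\{B_i I(Q_{1i} = (h_k,h_k))\bigr\} + 2\sum_{i=1}^n \sum_{l=1}^K E^{(m)}\bigl\{(1 - B_i) I(Q_{2i} = (h_k,h_l))\bigr\} \right]
\]

and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i,Q_{1i},Q_{2i})\}$ is $E\{\omega(B_i,Q_{1i},Q_{2i}) \mid Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.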
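The closed-form updates can be illustrated with a stripped-down EM that estimates $(\rho, \{\pi_k\})$ from genotype data alone, that is, with the trait factor $P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)$ treated as constant. This is a simplified sketch and not the authors' implementation: the function name, the representation of each subject by a set of unordered compatible pairs, and the starting values are assumptions. A homozygous pair contributes through both $B_i = 1$ (one draw of $h_k$ from $Q_1$) and $B_i = 0$ (two draws from $Q_2$), mirroring the weights 1 and 2 in the update for $\pi_k^{(m+1)}$.

```python
def em_haplotype_freqs(pair_sets, haplotypes, n_iter=200, rho0=0.1):
    """EM for (rho, pi) under P(h_k, h_l) = (1 - rho) pi_k pi_l + delta_kl rho pi_k.

    pair_sets: one collection per subject of unordered haplotype pairs (a, b)
    compatible with that subject's genotype. The trait model is omitted here.
    """
    pi = {h: 1.0 / len(haplotypes) for h in haplotypes}
    rho = rho0
    n = len(pair_sets)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        b_sum = 0.0
        for pairs in pair_sets:
            configs = []  # (a, b, B, unnormalized probability)
            for a, b in pairs:
                if a == b:
                    configs.append((a, b, 1, rho * pi[a]))
                    configs.append((a, b, 0, (1 - rho) * pi[a] ** 2))
                else:
                    # An unordered heterozygous pair covers both orderings.
                    configs.append((a, b, 0, 2 * (1 - rho) * pi[a] * pi[b]))
            total = sum(c[3] for c in configs)
            if total <= 0:
                raise ValueError("a genotype has zero probability under the current estimate")
            for a, b, b_flag, value in configs:
                post = value / total
                if b_flag:
                    counts[a] += post       # B = 1: a single draw from Q1
                    b_sum += post
                else:
                    counts[a] += post       # B = 0: two draws from Q2
                    counts[b] += post
        norm = sum(counts.values())
        pi = {h: c / norm for h, c in counts.items()}
        rho = b_sum / n
    return rho, pi

if __name__ == "__main__":
    # Phase-known toy data: 30 AA, 40 AB, 30 BB, so heterozygosity 0.4 = 2(1 - rho)(0.5)(0.5).
    data = [{("A", "A")}] * 30 + [{("A", "B")}] * 40 + [{("B", "B")}] * 30
    print(em_haplotype_freqs(data, ["A", "B"]))
```

In the toy data, the maximum likelihood estimates are $\pi_A = .5$ and $\rho = .2$, which the iteration approaches at a geometric rate.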

A.3 Case-Control Studies With Known Population Totals

A31 EM Algorithm This is similar to the EM algorithm forcross-sectional studies except that in addition to unknown H on allindividuals X is missing for the individuals not selected into the case-control sample and there are nonparametric components Fg(middot) Thecomplete-data likelihood is

Nprod

i=1

Pαβξ (Yi|XiHi)Pγ (Hi)prod

g fg(Xi)I(Gi=g)

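The M-step solves the following equations for $\theta$:

\[
\sum_{i=1}^{N} I(R_i = 1)\, E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i, X_i, G_i\}
+ \sum_{i=1}^{N} I(R_i = 0)\, E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i\} = 0,
\]

\[
\sum_{i=1}^{N} I(R_i = 1)\, E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i, X_i, G_i\}
+ \sum_{i=1}^{N} I(R_i = 0)\, E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i\} = 0, \tag{A.2}
\]

and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

\[
F_g\{X_i\} = \Biggl[\sum_{j=1}^{N} I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)\, E\{I(X_j = X_i, G_j = g) \mid Y_j\}\Biggr]
\times \Biggl[\sum_{j=1}^{N} I(G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)\, E\{I(G_j = g) \mid Y_j\}\Biggr]^{-1},
\]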

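where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form

\[
\frac{\sum_{(h_k,h_l) \in \mathcal{S}(G_i)} \omega(Y_i, X_i, (h_k,h_l))\, P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))\, P_\gamma(h_k,h_l)}{\sum_{(h_k,h_l) \in \mathcal{S}(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))\, P_\gamma(h_k,h_l)}
\]

for $R_i = 1$ and

\[
\sum_{g \in \mathcal{G}}\ \sum_{x \in \{X_j : G_j = g, R_j = 1\}}\ \sum_{(h_k,h_l) \in \mathcal{S}(g)} \omega(Y_i, x, (h_k,h_l))\, P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k,h_l))\, P_\gamma(h_k,h_l)\, F_g\{x\}
\times \Biggl(\sum_{g \in \mathcal{G}}\ \sum_{x \in \{X_j : G_j = g, R_j = 1\}}\ \sum_{(h_k,h_l) \in \mathcal{S}(g)} P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k,h_l))\, P_\gamma(h_k,h_l)\, F_g\{x\}\Biggr)^{-1}
\]

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.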
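For a selected subject ($R_i = 1$), the E-step expectation above is simply a weighted average over the diplotypes compatible with the observed genotype. A minimal sketch follows, assuming the caller has already evaluated the phenotype likelihood and the diplotype probability for each compatible diplotype; the function name `e_step_expectation` and its arguments are hypothetical and introduced only for illustration.

```python
import numpy as np

def e_step_expectation(omega, lik, prior):
    """Conditional expectation of omega given (Y_i, X_i, G_i) when R_i = 1.

    All three arguments are sequences indexed by the diplotypes in S(G_i):
      omega[d] : value of omega(Y_i, X_i, d)
      lik[d]   : P(Y_i | X_i, d) under the current regression parameters
      prior[d] : P_gamma(d) under the current haplotype parameters
    """
    weights = np.asarray(lik, dtype=float) * np.asarray(prior, dtype=float)
    total = weights.sum()
    if total <= 0.0:
        raise ValueError("no compatible diplotype has positive weight")
    # Numerator and denominator of the ratio displayed in the text.
    return float(np.dot(np.asarray(omega, dtype=float), weights) / total)
```

For an unselected subject ($R_i = 0$), the same ratio would be taken over an enlarged index set (genotypes, observed covariate values, and compatible diplotypes), with the extra factor $F_g\{x\}$ multiplied into the weights.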

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (23) of Breslow et al. The key difference is that the former involves several nonparametric components, $F_g(\cdot)$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


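A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters, $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$, yield the same likelihood,

\[
\mathrm{RL}(\theta, \{F_g^\dagger\}, \{q_g\}) = \mathrm{RL}(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A.3}
\]

Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x) q_g / \sum_{g \in \mathcal{G}} q_g = \tilde f_g^\dagger(x) \tilde q_g / \sum_{g \in \mathcal{G}} \tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

\[
\eta(y, x, g; \theta) = C(y)\, \eta(y, x, g; \tilde\theta), \tag{A.4}
\]

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}) : \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that

\[
\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5}
\]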

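for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $\mathcal{S}(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k,h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

\[
\frac{\dfrac{\pi_1(g)(1 + e^{\alpha + \psi_2(x)})}{\pi_2(g)(1 + e^{\alpha + \psi_1(x)})} + e^{\psi_2(x) - \psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha + \psi_2(x)})}{\pi_2(g)(1 + e^{\alpha + \psi_1(x)})} + 1}
= \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha + \psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha + \psi_1(x)})} + e^{\psi_2(x) - \psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha + \psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha + \psi_1(x)})} + 1}, \tag{A.6}
\]

where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results: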

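1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields

\[
\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha + \beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha + \beta_1}}.
\]

Thus (A.6) is equivalent to

\[
\frac{\pi_1(g)/\pi_2(g)}{\pi_1(\tilde g)/\pi_2(\tilde g)} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(\tilde g)/\tilde\pi_2(\tilde g)} \quad \text{for all } g, \tilde g \in \mathcal{G}_3.
\]

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z}z \ne 0$, (A.6) yields

\[
\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha + \beta_{3z}z}}{1 + e^{\alpha + \beta_1 + \beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha + \beta_{3z}z}}{1 + e^{\tilde\alpha + \beta_1 + \beta_{3z}z}}.
\]

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

\[
\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha + \psi_2(x)}}{1 + e^{\alpha + \psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha + \psi_2(x)}}{1 + e^{\tilde\alpha + \psi_1(x)}} \tag{A.7}
\]

for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.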

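A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter, $\beta_z$, is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $\mathcal{S}(g)$ has a single element for such $g$, we obtain

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\right\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\right\}, \tag{A.8}
\]

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\right\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\right\}, \tag{A.9}
\]

and

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\right\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\right\}, \tag{A.10}
\]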

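where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\right\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\right\}.
\]

By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore, $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain

\[
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}.
\]

By taking the ratio of the two sides and letting $z \to \operatorname{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore, $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,

\[
\eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}
\left[\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\,\frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x)\right]
\times \left[\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}\,\frac{\pi_1(g)}{\pi_2(g)} + 1 - \Phi(\alpha + \beta_3^T x)\right]^{-1}. \tag{A.11}
\]

It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.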

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11), with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

\[
\frac{e^{-e^{\alpha}}\bigl(e^{e^{\alpha + \beta_z z}} - 1\bigr)}{1 - e^{-e^{\alpha}}}
= \frac{e^{-e^{\tilde\alpha}}\bigl(e^{e^{\tilde\alpha + \tilde\beta_z z}} - 1\bigr)}{1 - e^{-e^{\tilde\alpha}}}.
\]

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha + \beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha + \tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

\[
l_n(\theta, \{\delta_j\}) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta) - n_1 \log\Biggl\{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j\Biggr\} + \sum_{j=1}^{J} n_{+j} \log \delta_j,
\]

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

\[
\frac{n_{+j}}{\delta_j} - \frac{n_1\,\eta(1, x_j, g_j; \theta)}{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j} + \lambda = 0.
\]

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

\[
\delta_j = \frac{n_{+j}}{n - n_1 + n_1\,\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12}
\]

where $\mu = \sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

\[
l_n^*(\theta, \mu) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta) - \sum_{j=1}^{J} n_{+j} \log\left\{\frac{n_1}{n}\,\eta(1, x_j, g_j; \theta) + \left(1 - \frac{n_1}{n}\right)\mu\right\} + (n - n_1) \log \mu.
\]

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_\mu l_n^*(\theta, \mu) + C_n$. If $\hat\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \hat\mu)/\partial\mu = 0$ and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^{J} \delta_j = 1$. Thus $\max_\mu l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
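For fixed $\theta$, the stationarity condition $\partial l_n^*/\partial\mu = 0$ can be rearranged to $\sum_j \eta_j n_{+j}/\{(n - n_1)\mu + n_1\eta_j\} = 1$, whose left side is strictly decreasing in $\mu$. The sketch below solves it by bisection and then recovers the jump sizes from (A.12). It is an illustration only: the article maximizes over $(\theta, \mu)$ jointly by Newton–Raphson, whereas bisection over $\mu$ alone is swapped in here for simplicity, and the name `profile_jump_sizes` and its arguments are hypothetical.

```python
import numpy as np

def profile_jump_sizes(eta, n_plus, n1, tol=1e-12, max_iter=200):
    """Solve for mu at fixed theta and return (mu, delta) from (A.12).

    eta[j]    : eta(1, x_j, g_j; theta), assumed strictly positive
    n_plus[j] : n_{0j} + n_{1j}
    n1        : number of cases, with 0 < n1 < n
    """
    eta = np.asarray(eta, dtype=float)
    n_plus = np.asarray(n_plus, dtype=float)
    n = n_plus.sum()
    if not (0 < n1 < n) or np.any(eta <= 0):
        raise ValueError("need 0 < n1 < n and eta > 0")

    def h(mu):
        # Equals 1 exactly at the stationary point of l*_n in mu.
        return np.sum(eta * n_plus / ((n - n1) * mu + n1 * eta))

    # h(0) = n / n1 > 1 and h(hi) < 1, so the root is bracketed.
    lo, hi = 0.0, np.sum(eta * n_plus) / (n - n1)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if h(mid) > 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol * max(1.0, hi):
            break
    mu = 0.5 * (lo + hi)
    delta = n_plus / (n - n1 + n1 * eta / mu)   # equation (A.12)
    return mu, delta
```

At the solution the jump sizes sum to one and satisfy $\mu = \sum_j \eta_j\delta_j$, which is the consistency property used in the argument above.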

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also, define

\[
\zeta_1(x, g; \theta) = \sum_{(h_k,h_l) \in \mathcal{S}(g)} e^{\beta^T Z(x, (h_k,h_l))}\{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\},
\]

\[
\zeta_0(g; \theta) = \sum_{(h_k,h_l) \in \mathcal{S}(g)} \{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\}.
\]

By a derivation similar to that of Section A.4.4, profiling (9) over $F_g(\cdot)$ is equivalent to profiling the following function over $\{\mu_g\}$:

\[
l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^{n} \{y_i \log \zeta_1(X_i, G_i; \theta) + (1 - y_i) \log \zeta_0(G_i; \theta)\}
- \sum_{i=1}^{n}\sum_{g} I(G_i = g) \log\Bigl\{\zeta_1(X_i, G_i; \theta) + n_1^{-1} n_g \sum_{g'} \mu_{g'} - \mu_g\Bigr\}
+ \sum_{i=1}^{n} (1 - y_i) \log \sum_{g} \mu_g,
\]

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.
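The quantities $\zeta_0$ and $\zeta_1$ defined above are sums over the set $\mathcal{S}(g)$ of diplotypes compatible with an unphased genotype. The following sketch enumerates that set by brute force and evaluates both sums. It assumes, for illustration only, that haplotypes are coded as 0/1 tuples, that a genotype is the site-wise sum of its two haplotypes, and that $\mathcal{S}(g)$ consists of ordered pairs; the function names and the `rel_risk` callback (standing in for $e^{\beta^T Z(x,(h_k,h_l))}$ at a fixed $x$) are hypothetical.

```python
from itertools import product

def compatible_diplotypes(g, haplotypes):
    """Ordered index pairs (k, l) whose haplotypes sum site-wise to g."""
    return [(k, l)
            for k, hk in enumerate(haplotypes)
            for l, hl in enumerate(haplotypes)
            if all(a + b == c for a, b, c in zip(hk, hl, g))]

def zeta1(g, haplotypes, pi, rho, rel_risk):
    """Sum over S(g) of rel_risk(k, l) * {rho*pi_k*delta_kl + (1-rho)*pi_k*pi_l}."""
    total = 0.0
    for k, l in compatible_diplotypes(g, haplotypes):
        prob = rho * pi[k] * (k == l) + (1.0 - rho) * pi[k] * pi[l]
        total += rel_risk(k, l) * prob
    return total

def zeta0(g, haplotypes, pi, rho):
    """zeta_0 is zeta_1 with the relative-risk factor set to one."""
    return zeta1(g, haplotypes, pi, rho, lambda k, l: 1.0)

# Example setup: two SNPs, hence four possible haplotypes.
haplotypes = list(product((0, 1), repeat=2))
pi = [0.4, 0.3, 0.2, 0.1]
rho = 0.1
double_het = compatible_diplotypes((1, 1), haplotypes)
```

In this setup the double heterozygote `(1, 1)` is the only phase-ambiguous genotype. Full enumeration grows as $4^M$ in the number of SNPs $M$, so this approach is practical only for short haplotypes.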

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

\[
l_n^*(\theta, \mu) = \sum_{i=1}^{n} y_i \log \zeta_1(X_i, G_i; \theta) + \sum_{i=1}^{n} (1 - y_i) \log \zeta_0(G_i; \theta) + \sum_{i=1}^{n} (1 - y_i) \log \mu
- \sum_{i=1}^{n} \log\Bigl\{(1 - r)\mu + r \sum_{g} \zeta_1(X_i, g; \theta)\Bigr\},
\]

where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k,h_k) : k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k,h_l) : k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by

\[
P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B, Q_1, Q_2, Y} \exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},
\]

where $\vartheta = (-\log\mu + \log\{r/(1 - r)\}, \beta^T, \log\pi_1 - \log\{\rho/(1 - \rho)\}, \ldots, \log\pi_K - \log\{\rho/(1 - \rho)\})^T$ and

\[
W(B, Q_1, Q_2, Y, X) = \Bigl(Y,\ Y Z^T(X, H),\ B\,I(Q_1 = (h_1,h_1)) + (1 - B)\sum_{l}\{I(Q_2 = (h_1,h_l)) + I(Q_2 = (h_l,h_1))\},\ \ldots,\ B\,I(Q_1 = (h_K,h_K)) + (1 - B)\sum_{l}\{I(Q_2 = (h_K,h_l)) + I(Q_2 = (h_l,h_K))\}\Bigr)^T.
\]

We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood

\[
l_n^*(\vartheta) = \sum_{i=1}^{n} \log\left[\frac{\sum_{BQ_1 + (1-B)Q_2 \in \mathcal{S}(G_i)} e^{\vartheta^T W(B, Q_1, Q_2, Y_i, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right].
\]

We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

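The complete-data score function is

\[
\sum_{i=1}^{n}\left[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right].
\]

Thus, in the E-step we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:

\[
E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i]
= \frac{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in \mathcal{S}(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}\, W(b, q_1, q_2, Y_i, X_i)}{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in \mathcal{S}(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}.
\]

In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

\[
\vartheta^{(k+1)} = \vartheta^{(k)} - \Sigma^{-1} \times \sum_{i=1}^{n}\left[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right],
\]

where

\[
\Sigma = -\left[\sum_{i=1}^{n} \frac{\sum_{b, q_1, q_2, y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right]
+ \sum_{i=1}^{n}\left[\frac{\{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^{\otimes 2}}{\{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^{2}}\right]
\]

and $a^{\otimes 2} = aa^T$.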
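The one-step Newton–Raphson update has a simple structure in a log-linear model: the matrix in the update is the negative of the model covariance of $W$, so the step moves $\vartheta$ by the inverse covariance times the gap between the E-step expectation and the model mean. The sketch below shows this for a single finite support without covariates, which is a deliberate simplification of the per-subject sums in the text; `newton_step`, `w_support`, and `target` are hypothetical names.

```python
import numpy as np

def newton_step(theta, w_support, target):
    """One Newton-Raphson step for P(w) proportional to exp(theta^T w).

    w_support : (m, d) array whose rows are the sufficient statistics W
                of the m possible configurations
    target    : (d,) array playing the role of the E-step expectation of W
    """
    w = np.asarray(w_support, dtype=float)
    scores = w @ theta
    p = np.exp(scores - scores.max())   # subtract the max for stability
    p /= p.sum()
    mean = p @ w                        # model expectation of W
    centered = w - mean
    cov = (centered * p[:, None]).T @ centered   # E[W W^T] - E[W] E[W]^T
    # The matrix in the text equals -cov here, so subtracting its inverse
    # times the score is the same as adding cov^{-1} (target - mean).
    return theta + np.linalg.solve(cov, np.asarray(target, dtype=float) - mean)

# Example: two independent binary components with target means 0.3 and 0.6.
support = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
theta = np.zeros(2)
for _ in range(25):
    theta = newton_step(theta, support, [0.3, 0.6])
```

In an EM iteration only one such step would be taken per M-step, with `target` recomputed in each E-step; the loop above simply shows that repeated steps with a fixed target converge to the matching log-odds.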

A46 Proof of Theorem 2 Write Fxg(xg) = Fdaggerg(x)qg and

Fxg(xg) = Fdaggerg(x)qg Because θ is bounded and Fxg is a probabil-

ity distribution we can choose a subsequence such that θ rarr θlowast andFxg(xg) rarr Flowast

xg(xg) equiv Flowastg(x)qlowast

g where qlowastg gt 0 for any g

Because Fdaggerg maximizes the likelihood there exists some Lagrange

multiplier λg such that

I(Gi = g)

FdaggergXi

minus n1η(1Xig θ )qgint

xg η(1x g θ)dFxg(x g)minus nλg = 0

where FdaggergXi denotes the point mass of Fdagger

g at Xi and the integralis interpreted as integration over x and summation over g Becausesumn

i=1 FdaggergXi = 1 λg satisfies the equation

nminus1nsum

i=1

I(Gi = g)

λg + n1η(1Xig θ )qgn intxg η(1x g θ)dFxg(x g)minus1

= 1 (A13)

and

min1leilen

λg + n1η(1Xig θ )qg

nint

xg η(1x g θ)dFxg(x g)

gt 0

Clearly λg must be bounded asymptotically Thus by choosing a sub-sequence we assume that λg rarr λlowast

g By (A13) and the Lipschitz continuity of η(1xg θlowast) in the con-

tinuous components of x we can show that there exists a positive con-stant δ such that

mingx

∣∣∣∣λ

lowastg + η(1xg θlowast)qlowast

gintxg η(1x g θlowast)dFlowast

xg(x g)

∣∣∣∣

gt δ

Consequently when n is sufficiently large

Fdaggerg(x) = nminus1

nsum

i=1

I(Gi = gXi le x)

times(

max

[∣∣∣∣λg + η(1Xig θ )qgn1

times

nint

xgη(1x g θ)dFxg(x g)

minus1∣∣∣∣ δ

])minus1

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proofof theorem 12 of Murphy and van der Vaart (2001) We first obtaina score function by differentiating ln(θ Fdagger

g qg) with respect to θ

along the direction v and with respect to Fxg along the path Fε =Fxg + ε

intψ(xg)dFxg where v has a unit norm and ψ(middotg) is any

function whose total variation is bounded by 1 The linearization of thescore function around the true parameter value yields

n12

(vT11 + 21[ψ]T )(θ minus θ0)

+int

(vT12 + 22[ψ])d(Fxg minus Fxg)

= nminus12nsum

i=1

yi

vT lθ (1XiGi θ0Fxg)

+ lF(1XiGi θ0Fxg)

[int

ψ dFxg

]

+ nminus12nsum

i=1

(1 minus yi)

vT lθ (0XiGi θ0Fxg)

+ lF(0XiGi θ0Fxg)

[int

ψ dFxg

]

+ op(1)

where 11 is a constant matrix 12 is a vector function of x 21[ψ]and 22[ψ] are linear operators of ψ and lθ and lF are the scoreswith respect to θ and Fxg The right side of the foregoing equa-tion converges weakly to a Gaussian process which depends on( y1 y2 ) only through We can show that the operator B[vψ] equivvT11 +21[ψ]T vT12 +22[ψ]T is invertible along the lines ofMurphy and van der Vaart (2001) It then follows from theorem 331of van der Vaart and Wellner (1996) that n12(θ minus θ0 Fxg minus Fxg)

converges weakly to a Gaussian processBecause the asymptotic distribution depends on ( y1 y2 ) only

via we assume that ( y1 y2 ) are independent realizations froma Bernoulli distribution with mean By choosing some ψ such thatB[vψ] = (vT 0)T for all v we see that θ is an asymptotically linearestimator for θ0 with the influence function in the score space It fol-lows from proposition 331 of Bickel Klaassen Ritov and Wellner(1993) that the limiting covariance matrix of n12(θ minus θ0) attains thesemiparametric efficiency bound


A47 Proof of Theorem 3 We call the probability distribu-tion induced by (9) the pseudoprobability law denoted by Pn Letf ( yxg θ Fgan) be the density function under the true probabil-ity law Pn Because an = o(nminus12)

dPn

dPn= exp

an

nsum

i=1

part log f ( yiXiGi θ Fga)

parta

∣∣∣∣a=0

+o(1)

rarrPn 1

Thus any weak convergence under Pn also holds for Pn In additionby the arguments in the proof of Theorem 2 we can easily verify theresults of Theorem 3 when the data are generated from Pn Thus The-orem 3 holds when the data are generated from Pn

A5 Cohort Studies

A51 Identifiability We show that if two sets of parameters (θ )

and (θ ) yield the same joint distribution then θ = θ and = First it follows from Lemma 1 that γ = γ Suppose that

sum

HisinS(G)

˜(Y)eβTZ(XH)Q((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

=sum

HisinS(G)

(Y)eβTZ(XH)Q

((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

By choosing = 1 and integrating Y from 0 to τ on both sides weobtain

sum

HisinS(G)

Q((τ )eβTZ(XH)

)Pγ (H)

=sum

HisinS(G)

Q((τ)eβTZ(XH)

)Pγ (H)

Because Q(middot) is strictly increasing the foregoing equation implies that

(Y)eβTZ(XH) = (Y)eβTZ(XH) for H = (hh) and H = (h h) Itthen follows from Condition 8 that β = β and =

A52 Proof of Theorem 4 Our problem is the same as that ofZeng et al (2005) except replacing the integration over random ef-fects in that article by the sum over H isin S(G) The asymptotic prop-erties stated in the theorem follow from the identifiability shown inSection A51 and the proofs of Zeng et al (2005) provided that wecan verify the following result If there exist a vector micro = (microT

β microTγ )T

and a function ψ(t) such that

microT lθ (θ00) + l(θ00)

[int

ψ d0

]

= 0 (A14)

where lθ is the score function for θ and l[int ψ d0] is the score func-tion for along the submodel 0 +ε

intψ d0 then micro = 0 and ψ = 0

To prove the desired result we write out (A14) We then let = 1and integrate Y from 0 to τ to obtain

sum

HisinS(G)

Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times Q(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

Q(0(τ )eβT0Z(XH))

+ Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A15)

In contrast by letting = 0 and Y = τ in (A14) we havesum

HisinS(G)

1 minus Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times

minusQ(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

1 minus Q(0(τ )eβT0Z(XH))

minus Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

1 minus Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A16)

The summation of (A15) and (A16) entails microTγ nablaγ log Pγ (H) = 0

From the proof of Lemma 1 microγ = 0 We choose G = 2h or h + h and

let = 1 and Y = 0 in (A14) to obtain microTβZ(XH) + ψ(0) = 0 for

H = (hh) and (h h) Thus microβ = 0 and ψ(0) = 0 under Condition 8Finally (A14) with = 1 implies that

ψ(Y) + Q(0(Y)eβT0Z(XH))

int Y0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(Y)eβT0Z(XH))

= 0

for H = (hh) Therefore ψ = 0

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), “Prediction and Entropy,” in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), “Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?” European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature; there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and, eventually, to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited, through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors, for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be suspected of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and to understanding their relevance in determining disease risks.
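The window scan described above can be sketched in a few lines. This is only an illustration of how fixed-length marker windows might be enumerated; the function name `sliding_windows`, the use of centimorgan map positions, and the choice of one window per starting marker are assumptions made here, and the association test applied within each window is deliberately left out.

```python
from typing import List, Tuple


def sliding_windows(positions_cm: List[float], width_cm: float) -> List[Tuple[int, int]]:
    """Return one (start, end) index pair per marker, with `end` exclusive.

    The window that starts at marker `start` contains every marker whose map
    position lies within `width_cm` of that first marker. `positions_cm` is
    assumed to be sorted in increasing order.
    """
    windows = []
    n = len(positions_cm)
    end = 0
    for start in range(n):
        end = max(end, start)
        # Extend the right edge while the next marker is still inside the window.
        while end < n and positions_cm[end] - positions_cm[start] <= width_cm:
            end += 1
        windows.append((start, end))
    return windows


# Hypothetical marker map (in cM) scanned with a 1 cM window.
markers = [0.0, 0.5, 1.2, 3.0]
for start, end in sliding_windows(markers, width_cm=1.0):
    # A haplotype association test on markers[start:end] would go here.
    print(markers[start:end])
```

Each index pair identifies the markers whose haplotypes would be tested jointly; windows containing a single marker reduce to a single-SNP test.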

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a "very good" estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping "similar" ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information when available, or by assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered "true," with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
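To make the role of EM in phase inference concrete, the smallest nontrivial case can be written out. The sketch below is not Lin and Zeng's algorithm (which also involves the phenotype model) and is not code from Excoffier and Slatkin (1995); it is a generic gene-counting EM under stated assumptions: two biallelic SNPs, Hardy–Weinberg equilibrium, and genotypes coded as counts of allele 1 at each SNP. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# The four possible haplotypes over two biallelic SNPs (alleles coded 0/1).
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

Genotype = Tuple[int, int]  # count of allele 1 at each SNP, each in {0, 1, 2}


def compatible_pairs(genotype: Genotype) -> List[Tuple[int, int]]:
    """Unordered pairs of haplotype indices whose allele counts sum to `genotype`."""
    pairs = []
    for i in range(4):
        for j in range(i, 4):
            h1, h2 = HAPLOTYPES[i], HAPLOTYPES[j]
            if (h1[0] + h2[0], h1[1] + h2[1]) == genotype:
                pairs.append((i, j))
    return pairs


def em_haplotype_frequencies(genotypes: Iterable[Genotype],
                             max_iter: int = 500,
                             tol: float = 1e-12) -> List[float]:
    """Estimate haplotype frequencies from unphased genotypes by EM under HWE."""
    counts = Counter(genotypes)
    n_chromosomes = 2 * sum(counts.values())
    freqs = [0.25] * 4  # uniform starting values
    for _ in range(max_iter):
        expected = [0.0] * 4
        for genotype, n_g in counts.items():
            pairs = compatible_pairs(genotype)
            # E-step: posterior weight of each compatible pair under HWE
            # (p_i^2 for a homozygous pair, 2 p_i p_j otherwise).
            weights = [(1.0 if i == j else 2.0) * freqs[i] * freqs[j] for i, j in pairs]
            total = sum(weights)
            for (i, j), w in zip(pairs, weights):
                expected[i] += n_g * w / total
                expected[j] += n_g * w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        updated = [e / n_chromosomes for e in expected]
        converged = max(abs(u - f) for u, f in zip(updated, freqs)) < tol
        freqs = updated
        if converged:
            break
    return freqs


# Toy data: only the double heterozygote (1, 1) has ambiguous phase.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_frequencies(sample))
```

In this toy sample the unambiguous genotypes carry only haplotypes 00 and 11, so the EM attributes the double heterozygotes almost entirely to the 00/11 pairing. With many markers the number of compatible pairs per genotype grows quickly, which is one way to see the performance concern raised above.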

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven: faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, turning the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), "A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes," American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), "The Human Phenome Project," Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), "Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny," Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), "Haplotype Fine Mapping by Evolutionary Trees," American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), "Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets," American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), "Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables," The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), "Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping," Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), "Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms," Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), "Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping," American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), "Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques," American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), "Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models," American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), "False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders," Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), "Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations," American Journal of Human Genetics, 64, 1728–1738.

Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems surprising in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng's (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates X we can assume Hardy–Weinberg equilibrium, then Lin and Zeng's model 3 applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x, \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate value is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng's simulations assume that X and H are independent at the population level, not just conditional on G.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this "target" haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model 3 is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for "all commonly used study designs." In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng's article. Recently, we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design, also not considered by Lin and Zeng, corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), "Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies," Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), "Prospective Analysis of Logistic Case-Control Studies," Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng's article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835

With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the "diplotypes" (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\{\mathrm{pr}(D = 1 \mid H^d, X)\} = \beta_0 + m(H^d, X, \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a "robust approach" for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d, \theta), \tag{2}$$

where the model $q(h^d, \theta)$, in turn, could be specified according to HWE or some of its extensions as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by "population stratification." In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = q(h, x, \theta)\,\frac{\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h} \sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$\ell^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $\ell^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $\ell^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $\ell^*$.
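The pseudolikelihood in (4) is straightforward to evaluate once $q$ and $m$ are specified. The following is a minimal sketch rather than the implementation of Spinka et al.: the diplotype distribution `q` and risk function `m` are passed in as callables, and the toy diplotype labels, frequencies, and coefficients are invented purely for illustration.

```python
import math
from typing import Callable, Hashable, Iterable, Sequence, Tuple

Diplotype = Hashable


def kappa_from_design(beta0: float, n1: int, n0: int, prevalence: float) -> float:
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(prevalence / (1.0 - prevalence))


def S(d: int, h: Diplotype, x: float, beta0: float, kappa: float,
      q: Callable, m: Callable) -> float:
    """S(d, h, x) = q(h, x) exp[d{kappa + m(h, x)}] / [1 + exp{beta0 + m(h, x)}]."""
    eta = m(h, x)
    return q(h, x) * math.exp(d * (kappa + eta)) / (1.0 + math.exp(beta0 + eta))


def L_star(D: int, compatible: Iterable[Diplotype], x: float,
           diplotypes: Sequence[Diplotype], beta0: float, kappa: float,
           q: Callable, m: Callable) -> float:
    """Sum of S over diplotypes compatible with G, divided by the sum over all (d, h)."""
    numerator = sum(S(D, h, x, beta0, kappa, q, m) for h in compatible)
    denominator = sum(S(d, h, x, beta0, kappa, q, m)
                      for h in diplotypes for d in (0, 1))
    return numerator / denominator


def log_pseudolikelihood(data: Iterable[Tuple[int, Sequence[Diplotype], float]],
                         diplotypes: Sequence[Diplotype], beta0: float, kappa: float,
                         q: Callable, m: Callable) -> float:
    """l* = sum_i log L*(D_i, G_i, X_i); each record is (D, compatible diplotypes, x)."""
    return sum(math.log(L_star(D, HG, x, diplotypes, beta0, kappa, q, m))
               for D, HG, x in data)


# Toy setting (all values hypothetical): four diplotypes over two SNPs, where
# "00/11" and "01/10" share the same unphased genotype.
DIPLOTYPES = ["00/00", "00/11", "01/10", "11/11"]
FREQ = {"00/00": 0.40, "00/11": 0.35, "01/10": 0.05, "11/11": 0.20}


def q_toy(h, x):
    return FREQ[h]  # model (2): diplotype frequencies do not depend on x


def m_toy(h, x):
    # Additive effect of haplotype "11" plus a covariate main effect.
    return 0.8 * h.split("/").count("11") + 0.3 * x


beta0 = -3.0
kappa = kappa_from_design(beta0, n1=500, n0=500, prevalence=0.05)
records = [(1, ["00/11", "01/10"], 1.0), (0, ["00/00"], 0.0), (1, ["11/11"], 2.0)]
print(log_pseudolikelihood(records, DIPLOTYPES, beta0, kappa, q_toy, m_toy))
```

Note that the denominator sums over all diplotypes and both disease states but not over $X$, which is why no model for the covariate distribution is needed. In practice one would maximize this function over the parameters with a numerical optimizer; that step is omitted here.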

An alternative representation of $\ell^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let $R$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme, with $R = 1$ for a selected subject. With some algebra, one can now show that the pseudolikelihood $\ell^*$ can be expressed in the form

$$\begin{aligned}
\ell^* &= \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1)\,\mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an "ascertainment-corrected joint likelihood" of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $\ell^*$ given in (5) suggests that under model (2) or (3), with $F(x)$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$
0 = \sum_{i=1}^{N} \sum_{h_d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\; \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d; \theta) \left\{ \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d; \theta) \right\}^{-1}.
\tag{6}
$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$
0 = \sum_{i=1}^{N} \sum_{h_d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\; \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, r(h_d, X_i)\, q(h_d; \theta) \left\{ \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, r(h_d, X_i)\, q(h_d; \theta) \right\}^{-1},
\tag{7}
$$

where

$$
r(h_d, x) = \frac{1 + \exp\{\kappa + m(h_d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h_d, x, \beta_1)\}}.
$$

Spinka et al. (2005) showed that, with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
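As a rough illustration of how (6) and (7) differ, the sketch below computes the normalized diplotype weights that multiply the score contributions for one subject. It assumes, because model (2) is not restated here, that $\mathrm{pr}_{\beta^*}(D = 1 \mid h_d, X)$ is logistic with intercept $\kappa$ and linear predictor $m(h_d, x, \beta_1)$; the function names and all numeric inputs are hypothetical.

```python
import math


def expit(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def r_weight(kappa, beta0, m):
    """r(h_d, x) = [1 + exp(kappa + m)] / [1 + exp(beta0 + m)]."""
    return (1.0 + math.exp(kappa + m)) / (1.0 + math.exp(beta0 + m))


def diplotype_weights(d, ms, qs, kappa, beta0=None):
    """Normalized weights over the diplotypes in H_G for one subject.

    d     : disease indicator (0 or 1)
    ms    : m(h_d, x, beta1) for each compatible diplotype
    qs    : q(h_d; theta) for each compatible diplotype
    beta0 : None gives the purely prospective weights of (6);
            a number gives the modified weights of (7).
    Assumption: pr(D = 1 | h_d, x) = expit(kappa + m).
    """
    raw = []
    for m, q in zip(ms, qs):
        p1 = expit(kappa + m)
        p = p1 if d == 1 else 1.0 - p1
        r = 1.0 if beta0 is None else r_weight(kappa, beta0, m)
        raw.append(p * r * q)
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical subject with two compatible diplotypes.
ms, qs = [0.0, 0.7], [0.3, 0.1]
print(diplotype_weights(1, ms, qs, kappa=0.2))              # weights as in (6)
print(diplotype_weights(1, ms, qs, kappa=0.2, beta0=-4.0))  # weights as in (7)
```

Under the logistic assumption above, multiplying by $r(h_d, x)$ turns the weights computed with intercept $\kappa$ into the weights that would be obtained with intercept $\beta_0$, and setting $\kappa = \beta_0$ gives $r \equiv 1$, so that (7) reduces to (6).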

2. COHORT-BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H_d$) and environmental covariates ($X$) as

$$
\lambda[t \mid H_d, X] = \lambda_0(t)\, R(H_d, X, \beta_1),
\tag{8}
$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H_d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H_d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\mathrm{pr}(H_d) = q(H_d; \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. Model (8) cannot be used directly, because $H_d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

$$
\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\},
\tag{9}
$$

where

$$
\begin{aligned}
R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}
  &= E\{R(H_d, X, \beta_1) \mid G, X, T \ge t\} \\
  &= \frac{\sum_{H_d \in \mathcal{H}_G} R(H_d, X, \beta_1)\, \mathrm{pr}[T > t \mid H_d, X]\, q(H_d; \theta)}
          {\sum_{H_d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H_d, X]\, q(H_d; \theta)}.
\end{aligned}
$$

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H_d, X] \approx 1$. The corresponding induced relative risk function,

$$
R^*(G, X, \beta_1, \theta) = \frac{\sum_{H_d \in \mathcal{H}_G} R(H_d, X, \beta_1)\, q(H_d; \theta)}{\sum_{H_d \in \mathcal{H}_G} q(H_d; \theta)},
$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat{\theta})$, where $\hat{\theta}$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining consistent estimates of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
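The induced relative risk under the rare disease approximation is a $q$-weighted average of $R(H_d, X, \beta_1)$ over the diplotypes compatible with $G$. The sketch below illustrates this under a loud simplifying assumption that is not from Chen and Chatterjee (2005): there are no covariates $X$, and $R(H_d, \beta_1) = \exp(\beta_1 \times \text{number of copies of a target haplotype})$. Function and argument names are hypothetical.

```python
import math


def induced_relative_risk(diplotypes, freq, beta1, target):
    """R*(G, beta1, theta) = sum_{H_G} R(h_d) q(h_d) / sum_{H_G} q(h_d).

    diplotypes : list of unordered haplotype pairs compatible with G
    freq       : dict mapping haplotype -> population frequency (theta)
    beta1      : log relative risk per copy of `target` (assumed risk model)
    target     : the haplotype whose copies raise the risk (assumed)
    """
    num = 0.0
    den = 0.0
    for h, h2 in diplotypes:
        # HWE diplotype probability q(h_d; theta).
        q = freq[h] * freq[h2] * (1.0 if h == h2 else 2.0)
        copies = (h == target) + (h2 == target)
        num += math.exp(beta1 * copies) * q
        den += q
    return num / den


# Hypothetical double heterozygote at two SNPs: two compatible diplotypes.
diplotypes = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
freq = {(0, 0): 0.4, (1, 1): 0.3, (0, 1): 0.2, (1, 0): 0.1}
print(induced_relative_risk(diplotypes, freq, math.log(2.0), target=(1, 1)))
```

When the genotype determines the diplotype uniquely, the ratio collapses to $R(H_d, X, \beta_1)$ itself; phase ambiguity shrinks the induced relative risk toward the average over the compatible diplotypes.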

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases, such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat{\theta})$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), "Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions," Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), "Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects," Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Prentice, R. L. (1982), "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model," Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), "Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity," Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data, and potential violations of "Hardy–Weinberg equilibrium (HWE)" in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho \pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
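The homozygote model above can be turned into a full diplotype distribution. The sketch below is a minimal illustration; the heterozygote probability $2(1 - \rho)\pi_k \pi_l$ is not stated in the text and is assumed here as the usual companion formula that makes the probabilities sum to one. The function name and example frequencies are hypothetical.

```python
def diplotype_probs(freq, rho):
    """Diplotype probabilities under the inbreeding model.

    Homozygote   (k, k): rho * pi_k + (1 - rho) * pi_k ** 2   (from the text)
    Heterozygote (k, l): 2 * (1 - rho) * pi_k * pi_l          (assumed companion)
    """
    haps = sorted(freq)
    probs = {}
    for i, hk in enumerate(haps):
        for hl in haps[i:]:
            if hk == hl:
                probs[(hk, hl)] = rho * freq[hk] + (1 - rho) * freq[hk] ** 2
            else:
                probs[(hk, hl)] = 2 * (1 - rho) * freq[hk] * freq[hl]
    return probs


# Hypothetical haplotype frequencies and a small inbreeding coefficient.
freq = {"00": 0.5, "01": 0.3, "11": 0.2}
hwe = diplotype_probs(freq, rho=0.0)
inbred = diplotype_probs(freq, rho=0.015)
print(hwe[("11", "11")], inbred[("11", "11")])
```

Setting $\rho = 0$ recovers HWE, and any $\rho > 0$ moves probability mass from heterozygotes to homozygotes without changing the marginal haplotype frequencies.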

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = 0.0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = 0.015$. Contrast this with the level of inbreeding ($\rho = 0.05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of 0.05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = 0.03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than 0.05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation of HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good-quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
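The confounding mechanism can be checked with simple arithmetic. The sketch below builds expected 2×2 tables for two subpopulations in which the marker and the disease are independent, then pools them; all counts, carrier frequencies, and prevalences are invented for illustration and do not come from any study cited here.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = carrier cases, b = carrier non-cases,
    c = non-carrier cases, d = non-carrier non-cases.
    """
    return (a * d) / (b * c)


def expected_table(n, carrier_freq, prevalence):
    """Expected counts when disease and marker are independent within a subpopulation."""
    return (
        n * carrier_freq * prevalence,
        n * carrier_freq * (1 - prevalence),
        n * (1 - carrier_freq) * prevalence,
        n * (1 - carrier_freq) * (1 - prevalence),
    )


# Hypothetical subpopulations: the marker is common and the disease more
# frequent in A; both are rarer in B. No within-subpopulation association.
pop_a = expected_table(10000, carrier_freq=0.8, prevalence=0.10)
pop_b = expected_table(10000, carrier_freq=0.2, prevalence=0.02)
pooled = tuple(x + y for x, y in zip(pop_a, pop_b))

print(odds_ratio(*pop_a), odds_ratio(*pop_b), odds_ratio(*pooled))
```

Each stratum has an odds ratio of one, yet the pooled table shows a marked association, which is the spurious signal that stratification or a matched design is meant to remove.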

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists tracing back to Fisher have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. Naive haplotype analysis performs poorly because it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods are needed that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases and that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and 500 controls using (12) from LZ with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification: we simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities $\pi_1 = 0.2$, $\pi_2 = 0.3$, and $\pi_3 = 0.5$. To estimate $\beta_1$, we used LZ's rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = 0.015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
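One way to generate diplotypes with a given inbreeding coefficient, as in scenario (A), is to copy the first haplotype with probability $\rho$ and otherwise draw the second haplotype independently. The sketch below is only an assumption about how such data could be simulated, not the discussants' actual code; the function name, sample size, and seed are hypothetical, and disease status is not simulated.

```python
import random


def sample_diplotypes(n, haps, probs, rho, seed=0):
    """Draw n diplotypes under the inbreeding model.

    With probability rho both haplotypes are copies of one draw (identical by
    descent); otherwise the two haplotypes are drawn independently.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        first = rng.choices(haps, weights=probs)[0]
        if rng.random() < rho:
            second = first
        else:
            second = rng.choices(haps, weights=probs)[0]
        out.append((first, second))
    return out


# Haplotypes and frequencies as described in the text; rho from scenario (A).
haps = ["11", "01", "00"]
probs = [0.2, 0.3, 0.5]
pairs = sample_diplotypes(20000, haps, probs, rho=0.015, seed=1)
homozygous = sum(a == b for a, b in pairs) / len(pairs)
expected = 0.015 + (1 - 0.015) * sum(p * p for p in probs)
print(homozygous, expected)
```

The expected homozygote fraction follows from summing $\rho \pi_k + (1 - \rho)\pi_k^2$ over the three haplotypes, so the printed sample fraction should be close to it for large samples.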

Table 1. Bias and Mean Squared Error (MSE) of $\hat{\beta}_1$

$\beta_1$   Scenario                          Bias      MSE
0.5         Baseline                          -0.002    0.010
0.5         [A] $\rho = 0.015$                 0.023    0.012
0.5         [B] Ascertainment bias             0.273    0.083
0.5         [C] Multiple causal haplotypes    -0.217    0.058
0           Baseline                          -0.002    0.013
0           [A] $\rho = 0.015$                 0.001    0.012
0           [B] Ascertainment bias             0.002    0.008
0           [C] Multiple causal haplotypes     0.002    0.012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to $\rho = 0$, no ascertainment bias, and one causal haplotype, $h_1$.
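Because MSE equals squared bias plus variance, the entries of Table 1 indicate how much of each error is systematic. The short sketch below computes that share, reading the table entries as three-decimal values (for example, a bias of 0.273 and an MSE of 0.083 for scenario B); the dictionary layout and function name are illustrative assumptions.

```python
# Table 1 entries as (bias, MSE), keyed by (true beta1, scenario).
TABLE1 = {
    (0.5, "Baseline"): (-0.002, 0.010),
    (0.5, "A"): (0.023, 0.012),
    (0.5, "B"): (0.273, 0.083),
    (0.5, "C"): (-0.217, 0.058),
    (0.0, "Baseline"): (-0.002, 0.013),
    (0.0, "A"): (0.001, 0.012),
    (0.0, "B"): (0.002, 0.008),
    (0.0, "C"): (0.002, 0.012),
}


def squared_bias_share(bias, mse):
    """Fraction of the MSE attributable to squared bias (MSE = bias^2 + variance)."""
    return bias ** 2 / mse


for key, (bias, mse) in TABLE1.items():
    print(key, round(squared_bias_share(bias, mse), 3))
```

On this reading, squared bias accounts for most of the MSE in scenarios (B) and (C) when $\beta_1 = 0.5$ and for almost none of it elsewhere, which matches the discussion of the table.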

The bias of $\hat{\beta}_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 249, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that existing biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

116 Journal of the American Statistical Association March 2006

Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).
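The two formulations can be sketched as two ways of turning a diplotype into regression covariates. This is only an illustration under an additive (copy-count) coding; the function names and the three-haplotype example are hypothetical and do not come from the article.

```python
def one_vs_rest(diplotype, target):
    """Number of copies (0, 1, or 2) of the target haplotype in a diplotype.

    First formulation: one haplotype is compared with all the others.
    """
    h_k, h_l = diplotype
    return int(h_k == target) + int(h_l == target)


def each_vs_reference(diplotype, haplotypes):
    """Copy counts for the first K - 1 haplotypes.

    Second formulation: the last haplotype in `haplotypes` is the reference,
    so it receives no covariate of its own.
    """
    return [one_vs_rest(diplotype, h) for h in haplotypes[:-1]]


# Hypothetical example with K = 3 haplotypes.
haplotypes = ["01100", "10011", "11011"]
single = one_vs_rest(("01100", "10011"), "01100")
vector = each_vs_reference(("01100", "11011"), haplotypes)
```

The first coding yields a single covariate per subject, whereas the second yields K − 1 covariates, which is why sparse haplotypes are more of a concern under the second formulation.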

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862

Lin and Zeng Rejoinder 117

3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence), unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and that of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


Lin and Zeng Inference on Haplotype Effects 95

Condition 11. There exists some positive constant $\delta_0$ such that $P(C_i \ge \tau \mid X_i, G_i) = P(C_i = \tau \mid X_i, G_i) \ge \delta_0$ almost surely, where $\tau$ corresponds to the end of the study.

Condition 12. The true value $\Lambda_0(t)$ of $\Lambda(t)$ is a strictly increasing function in $[0, \tau]$ and is continuously differentiable. In addition, $\Lambda_0(0) = 0$, $\Lambda_0(\tau) < \infty$, and $\Lambda_0'(0) > 0$.

Theorem 4. Under Conditions 8 and 10–12, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ converges weakly to a Gaussian process in $R^d \times l^\infty([0, \tau])$, where $d$ is the dimension of $\theta_0$ and $l^\infty([0, \tau])$ is the space of all bounded functions on $[0, \tau]$ equipped with the supremum norm. Furthermore, $\hat\theta$ is asymptotically efficient.

3. SIMULATION STUDIES

We used Monte Carlo simulation to evaluate the proposed methods in realistic settings. We considered the five SNPs on chromosome 22 from the Finland–United States Investigation of NIDDM Genetics (FUSION) Study described in the next section. We obtained the $\pi_k$'s from the frequencies shown in Table 1 by assuming a 7% disease rate and generated haplotypes under (3) with $\rho = .05$. The $R_h^2$ in Table 1 is the measure of haplotype certainty of Stram et al. (2003). We focused on $h^* = (01100)$ and considered case-control and cohort studies.
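Model (3) is not restated in this section, so the following sketch rests on an assumption: that the one-parameter departure from HWE takes the familiar form in which $\rho$ acts like an inbreeding coefficient, $P\{H = (h_k, h_l)\} = (1 - \rho)\pi_k\pi_l + \delta_{kl}\rho\pi_k$. Under that assumed form, diplotypes could be drawn as follows; the function names and the three-haplotype frequencies are hypothetical.

```python
import random


def diplotype_probs(pi, rho):
    """Probabilities of ordered haplotype pairs (k, l) under the ASSUMED form
    (1 - rho) * pi[k] * pi[l] + rho * pi[k] * [k == l]; rho = 0 gives HWE."""
    K = len(pi)
    return {
        (k, l): (1.0 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
        for k in range(K)
        for l in range(K)
    }


def draw_diplotype(pi, rho, rng):
    """Draw one ordered pair of haplotype indices from the assumed model."""
    probs = diplotype_probs(pi, rho)
    pairs = list(probs)
    return rng.choices(pairs, weights=[probs[p] for p in pairs], k=1)[0]


# Hypothetical frequencies for three haplotypes; rho = .05 as in the text.
pi = [0.5, 0.3, 0.2]
probs = diplotype_probs(pi, 0.05)
pair = draw_diplotype(pi, 0.05, random.Random(2006))
```

Positive $\rho$ in this form shifts probability mass from heterozygous to homozygous pairs while leaving the marginal haplotype frequencies unchanged.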

For the cohort studies, we generated ages of onset from the proportional hazards model

\[ \lambda\{t \mid x, (h_k, h_l)\} = 2t \exp[\beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x], \]

where $X$ is a Bernoulli variable with $P(X = 1) = .2$ that is independent of $H$. We generated the censoring times from the uniform $(0, \tau)$ distribution, where $\tau$ was chosen to yield approximately 250, 500, and 1,000 cases under $n = 5{,}000$. We let $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$.
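As a minimal sketch of how such data could be generated (not the authors' code), note that the baseline hazard $2t$ gives the cumulative hazard $t^2 e^{\eta}$, where $\eta$ is the linear predictor, so an onset time can be obtained by inverting it at a unit-exponential draw. The function names are hypothetical, and `n_copies` stands for $I(h_k = h^*) + I(h_l = h^*)$.

```python
import math
import random


def onset_time(unit_exponential, eta):
    """Solve t**2 * exp(eta) = e for t, where e is a unit-exponential draw."""
    return math.sqrt(unit_exponential / math.exp(eta))


def simulate_subject(n_copies, beta1, beta2, beta3, tau, rng):
    """Return (observed time, event indicator, covariate) for one subject."""
    x = 1 if rng.random() < 0.2 else 0             # X ~ Bernoulli(.2)
    eta = beta1 * n_copies + beta2 * x + beta3 * n_copies * x
    t = onset_time(rng.expovariate(1.0), eta)      # age of onset
    c = rng.uniform(0.0, tau)                      # censoring ~ uniform(0, tau)
    return min(t, c), t <= c, x


# One subject carrying a single copy of h*, with tau = 1 chosen arbitrarily.
rng = random.Random(1)
y, event, x = simulate_subject(1, 0.25, 0.25, 0.5, 1.0, rng)
```

In a full simulation, $\tau$ would be tuned so that the expected number of events among 5,000 subjects matches the target number of cases.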

As shown in Table 2, the MLE is virtually unbiased, the likelihood ratio test has proper type I error, and the confidence interval has reasonable coverage. Additional simulation studies revealed that the proposed methods also perform well for making inference about other parameters and under other genetic models.

Table 1. Observed Haplotype Frequencies in the FUSION Study

                 Frequencies
Haplotype    Controls     Cases        R_h^2
00011        .0042        .0066        .388
00100        .0035        .0034        .336
00110        .0018        .0007        .377
01011        .1292        .1344        .592
01100        .2514        .3183        .738
01101        .0012        <10^{-4}     .450
01110        <10^{-4}     .0045        .499
01111        .0019        <10^{-4}     .325
10000        .0136        .0114        .456
10010        <10^{-4}     .0012        .500
10011        .3573        .2883        .727
10100        .0521        .0597        .402
10110        .0317        .0318        .554
11011        .1392        .1290        .560
11100        .0109        .0092        .266
11110        <10^{-4}     .0014        <10^{-4}
11111        .0020        <10^{-4}     .338

Table 2. Simulation Results for the Haplotype–Environment Interactions in Cohort Studies

β3       Cases     Bias      SE       CP       Power
0        250       −.010     .232     .949     .051
         500       −.005     .157     .953     .047
         1,000     −.003     .114     .954     .046
−.25     250       −.014     .256     .950     .190
         500       −.008     .172     .949     .334
         1,000     −.004     .122     .952     .554
−.5      250       −.022     .281     .950     .505
         500       −.011     .190     .950     .806
         1,000     −.006     .132     .952     .976
.25      250       −.007     .216     .947     .207
         500       −.002     .146     .953     .395
         1,000     −.001     .109     .954     .614
.5       250       −.003     .204     .943     .693
         500       −.001     .140     .951     .940
         1,000     −.001     .105     .952     .998

NOTE: Bias and SE are the bias and standard error of $\hat\beta_3$. CP is the coverage probability of the 95% confidence interval for $\beta_3$. Power pertains to the .05-level likelihood ratio test of $H_0$: $\beta_3 = 0$. Each entry is based on 5,000 replicates.
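As a rough guide to reading such tables (a back-of-the-envelope calculation, not part of the original analysis), an empirical rejection or coverage rate based on 5,000 independent replicates has binomial Monte Carlo standard error

\[ \sqrt{\frac{p(1 - p)}{5{,}000}}, \qquad \text{which for } p = .05 \text{ or } p = .95 \text{ is } \sqrt{\frac{.05 \times .95}{5{,}000}} \approx .0031. \]

The three empirical type I error rates in Table 2 (.051, .047, and .046) all lie within two such standard errors of the nominal .05 level.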

For the case-control studies, we used the same distributions of $H$ and $X$ and considered the same $h^*$ as in the cohort studies. We generated disease incidence from the logistic regression model

\[ \mathrm{logit}\, P\{Y = 1 \mid x, (h_k, h_l)\} = \alpha + \beta_1\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_2 x + \beta_3\{I(h_k = h^*) + I(h_l = h^*)\}x. \qquad (12) \]

For making inference on $\beta_1$, we set $\beta_2 = \beta_3 = .25$ and varied $\beta_1$ from $-.5$ to $.5$; for making inference on $\beta_3$, we set $\beta_1 = \beta_2 = .25$ and varied $\beta_3$ from $-.5$ to $.5$. We chose $\alpha = -3$ or $-4$, yielding disease rates between 1.6% and 7%. We let $n_1 = n_0 = 500$ or 1,000. We considered the situations of known and unknown population totals, with $N$ being 15 and 30 times $n$ under $\alpha = -3$ and $-4$. For known population totals, we used the EM algorithm described in Section A.3.1 and evaluated the inference procedures based on the likelihood ratio statistic. For unknown population totals, we used the profile-likelihood method for rare diseases described in Section A.4.5 and set the $\hat\pi_k$ less than $2/n$ to 0 to improve numerical stability. The results for $\beta_1$ and $\beta_3$ are displayed in Tables 3 and 4.
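The disease model (12) can be sketched in a few lines. This is an illustration only; the function names are hypothetical, and `n_copies` again denotes $I(h_k = h^*) + I(h_l = h^*)$.

```python
import math
import random


def disease_prob(n_copies, x, alpha, beta1, beta2, beta3):
    """P(Y = 1 | x, diplotype) under the logistic model (12)."""
    eta = alpha + beta1 * n_copies + beta2 * x + beta3 * n_copies * x
    return 1.0 / (1.0 + math.exp(-eta))


def draw_disease(n_copies, x, alpha, beta1, beta2, beta3, rng):
    """Bernoulli draw of the disease indicator Y."""
    return 1 if rng.random() < disease_prob(n_copies, x, alpha, beta1, beta2, beta3) else 0


# Baseline risk for a non-carrier with X = 0 under alpha = -3.
baseline = disease_prob(0, 0, -3.0, 0.25, 0.25, 0.25)
y = draw_disease(2, 1, -3.0, 0.25, 0.25, 0.25, random.Random(7))
```

A case-control sample would then be formed by generating a population of size $N$ in this way and sampling $n_1$ cases and $n_0$ controls from it.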

For known population totals, the proposed estimators are virtually unbiased, and the likelihood ratio statistics yield proper tests and confidence intervals. For unknown population totals, $\hat\beta_1$ has little bias, especially for large $n$, whereas $\hat\beta_3$ tends to be slightly biased downward; the variance estimators are fairly accurate, and the corresponding confidence intervals have reasonable coverage probabilities, except for $\alpha = -3$, $\beta_3 = .5$. The method with known population totals yields slightly higher power than the method with unknown population totals.

All the aforementioned results pertain to haplotype 01100, which has a relatively high frequency and a large value of $R_h^2$; the covariate is binary; and $\rho$ is .05, which is relatively large. Additional simulation studies showed that the foregoing conclusions continue to hold for other haplotypes, other values of $\rho$, and continuous covariates. Table 5 reports some results for haplotype 10100, which has a frequency of about 5% and an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

\[ \mathrm{logit}\, P\{Y = 1 \mid X_1, X_2, (h_k, h_l)\} = \alpha + \beta_h\{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x_1} X_1 + \beta_{x_2} X_2 + \beta_{hx_2}\{I(h_k = h^*) + I(h_l = h^*)\}X_2, \]

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform(0, 1). We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x_1} = \beta_{x_2} = -\beta_{hx_2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.

Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                              Known totals                      Unknown totals
n1 = n0   α     β1      Bias     SE     CP     Power     Bias     SE     SEE    CP     Power
500       −3    −.5     −.003    .117   .952   .987      .019     .121   .124   .951   .979
                −.25    −.002    .109   .954   .587      .014     .112   .117   .960   .525
                0       −.001    .104   .951   .049      .009     .109   .112   .955   .045
                .25     −.001    .102   .950   .641      .002     .105   .108   .961   .646
                .5      .000     .099   .948   .996      −.005    .103   .106   .958   .998
          −4    −.5     .001     .112   .954   .987      .022     .119   .124   .951   .977
                −.25    −.002    .104   .955   .574      .013     .114   .117   .952   .529
                0       −.002    .100   .953   .047      .004     .109   .112   .953   .047
                .25     −.001    .095   .956   .636      −.003    .103   .108   .959   .640
                .5      −.000    .094   .950   .999      −.009    .102   .105   .956   .997
1,000     −3    −.5     −.003    .082   .953   1.00      .005     .087   .087   .949   1.00
                −.25    −.002    .076   .952   .874      .005     .081   .082   .948   .853
                0       −.001    .073   .951   .049      .005     .077   .077   .954   .046
                .25     −.001    .071   .953   .898      .004     .075   .076   .948   .920
                .5      −.001    .070   .953   1.00      .003     .075   .075   .946   1.00
          −4    −.5     .000     .079   .952   1.00      .005     .087   .088   .947   1.00
                −.25    −.000    .074   .959   .867      .005     .081   .083   .954   .847
                0       −.001    .070   .955   .045      .002     .079   .079   .949   .051
                .25     −.001    .067   .956   .904      .000     .074   .076   .955   .909
                .5      −.001    .066   .956   1.00      −.002    .073   .074   .954   1.00

NOTE: Bias and SE are the bias and standard error of $\hat\beta_1$. SEE is the mean of the standard error estimator for $\hat\beta_1$. CP is the coverage probability of the 95% confidence interval for $\beta_1$. Power pertains to the .05-level test of $H_0$: $\beta_1 = 0$. Each entry is based on 5,000 replicates.

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                              Known totals                      Unknown totals
n1 = n0   α     β3      Bias     SE     CP     Power     Bias     SE     SEE    CP     Power
500       −3    −.5     −.008    .205   .949   .729      .030     .187   .195   .953   .692
                −.25    −.002    .186   .949   .271      .016     .169   .176   .961   .244
                0       −.001    .173   .946   .054      −.006    .155   .162   .963   .037
                .25     .002     .165   .949   .334      −.038    .144   .151   .958   .287
                .5      .006     .161   .947   .885      −.088    .138   .143   .915   .831
          −4    −.5     −.009    .198   .950   .763      .012     .194   .195   .950   .720
                −.25    −.005    .181   .949   .309      .006     .172   .176   .953   .264
                0       −.002    .168   .945   .055      −.007    .156   .161   .956   .044
                .25     −.001    .157   .944   .370      −.022    .146   .149   .948   .333
                .5      .001     .148   .945   .926      −.047    .136   .141   .945   .904
1,000     −3    −.5     −.004    .147   .943   .953      .027     .134   .136   .950   .953
                −.25    −.003    .133   .946   .493      .013     .122   .123   .949   .477
                0       −.001    .123   .951   .049      −.005    .114   .113   .948   .052
                .25     .000     .119   .945   .580      −.034    .107   .106   .934   .535
                .5      .002     .117   .947   .994      −.080    .102   .101   .870   .986
          −4    −.5     −.005    .140   .945   .965      .010     .137   .136   .949   .965
                −.25    −.002    .128   .945   .529      .005     .124   .123   .951   .505
                0       −.001    .119   .946   .054      −.004    .113   .113   .947   .053
                .25     −.000    .110   .947   .633      −.016    .104   .105   .952   .601
                .5      .002     .105   .949   .998      −.037    .099   .099   .937   .995

NOTE: Bias and SE are the bias and standard error of $\hat\beta_3$. SEE is the mean of the standard error estimator for $\hat\beta_3$. CP is the coverage probability of the 95% confidence interval for $\beta_3$. Power pertains to the .05-level test of $H_0$: $\beta_3 = 0$. Each entry is based on 5,000 replicates.


Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias     SE     SEE    CP     Power
500       β_h         0            −.030    .400   .401   .957   .043
          β_x1        .5           .002     .151   .152   .951   .917
          β_x2        .5           .001     .228   .230   .953   .584
          β_hx2       −.5          .015     .641   .644   .956   .118
1,000     β_h         0            −.017    .275   .277   .953   .047
          β_x1        .5           .002     .107   .107   .954   .997
          β_x2        .5           .000     .162   .161   .950   .871
          β_hx2       −.5          .012     .441   .443   .950   .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated $\rho$ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate $X$ by setting $X = 1$ for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of $X$ given $Y$ and $G$ under model (12) with $\alpha = -3.7$, $\beta_1 = .32$, and $\beta_2 = .25$. Based on 5,000 replicates, the power for testing $H_0$: $\beta_3 = 0$ is estimated at .053, .479, or .974 under $\beta_3 = 0$, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

             Recessive         Dominant        Additive        Codominant model
Haplotype    model             model           model           Additive        Recessive
01011        .327 (.270)       −.027 (.140)    .049 (.135)     .005 (.143)     .331 (.289)
01100        .316 (.146)       .274 (.109)     .355 (.099)     .334 (.114)     .063 (.167)
10011        −.206 (.155)      −.323 (.112)    −.320 (.095)    −.344 (.111)    .076 (.183)
10100        −1.019 (1.020)    .196 (.219)     .116 (.213)     .169 (.217)     −1.131 (1.029)
10110        .903 (.746)       −.007 (.248)    .063 (.249)     .016 (.254)     .892 (.765)
11011        −.222 (.328)      −.096 (.140)    −.127 (.133)    −.108 (.140)    −.143 (.344)

NOTE: Standard error estimates are shown in parentheses.

98 Journal of the American Statistical Association March 2006

Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
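The permutation idea for sliding windows can be illustrated with a minimal Python sketch. It is not the procedure of Lin (2005) and it does not fit the haplotype models of this article: the per-window statistic `window_stat` is a deliberately simple stand-in (squared case-control differences in mean allele counts), and the names `window_stat` and `max_stat_adjusted_pvalues` are hypothetical. What the sketch shows is how permuting the disease labels yields the null distribution of the maximum statistic over overlapping windows, which accounts for the correlation among windows.

```python
import random

def window_stat(genotypes, status, start, width):
    """Toy stand-in statistic for one window: sum over the window's SNPs of the
    squared difference in mean allele count between cases and controls."""
    cases = [g for g, y in zip(genotypes, status) if y == 1]
    controls = [g for g, y in zip(genotypes, status) if y == 0]
    stat = 0.0
    for s in range(start, start + width):
        mean_case = sum(g[s] for g in cases) / len(cases)
        mean_ctrl = sum(g[s] for g in controls) / len(controls)
        stat += (mean_case - mean_ctrl) ** 2
    return stat

def max_stat_adjusted_pvalues(genotypes, status, width=5, n_perm=200, seed=0):
    """Adjusted p-value per window, based on the permutation distribution
    of the maximum statistic over all sliding windows."""
    rng = random.Random(seed)
    starts = range(len(genotypes[0]) - width + 1)
    observed = [window_stat(genotypes, status, s, width) for s in starts]
    exceed = [0] * len(observed)
    labels = list(status)
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute disease status, keep genotypes fixed
        max_perm = max(window_stat(genotypes, labels, s, width) for s in starts)
        for w, obs in enumerate(observed):
            if max_perm >= obs:
                exceed[w] += 1
    # add-one correction keeps every p-value strictly positive
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Because every window is compared with the same permutation maximum, the adjustment is automatically less conservative than a Bonferroni correction when adjacent windows share most of their SNPs.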

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters, $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$, yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $\mathcal{S}(g)$ is a singleton. Clearly,

$$(1-\rho)\pi_k^2 + \rho\pi_k = (1-\tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k.$$

We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have

$$\pi_k = \frac{-\rho + \{\rho^2 + 4c_k(1-\rho)\}^{1/2}}{2(1-\rho)}.$$

Thus $(1-\rho)^{-1}$ satisfies the equation

$$\sum_k \big[(1-x) + \{(x-1)^2 + 4c_kx\}^{1/2}\big] = 2,$$

and $(1-\tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1-x) + \{(x-1)^2 + 4c_kx\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$.

To prove the second part of the lemma, we choose $g = 2h_k$ to obtain

$$\nu_k\{2\pi_k(1-\rho) + \rho\} + \mu\pi_k(1-\pi_k) = 0.$$

Because $\sum_k \nu_k = 0$, we have

$$\sum_k \frac{\mu\pi_k(1-\pi_k)}{2\pi_k(1-\rho) + \rho} = 0.$$

Therefore $\mu = 0$, and $\nu = 0$.
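The uniqueness argument in the first part of the proof can be checked numerically. The following Python sketch is only an illustration: the function names `homozygote_probs` and `recover_parameters` are hypothetical, and bisection is one convenient choice of root finder, not one prescribed by the proof. It computes the homozygote probabilities $c_k$ from given $(\{\pi_k\}, \rho)$, solves $\sum_k[(1-x) + \{(x-1)^2 + 4c_kx\}^{1/2}] = 2$ for $x = (1-\rho)^{-1}$ by exploiting the monotonicity of the left side, and then recovers $\rho$ and $\pi_k$.

```python
import math

def homozygote_probs(pi, rho):
    """c_k = P(G = 2 h_k) = (1 - rho) * pi_k^2 + rho * pi_k under model (3)."""
    return [(1 - rho) * p * p + rho * p for p in pi]

def recover_parameters(c, tol=1e-12):
    """Solve sum_k [(1 - x) + sqrt((x - 1)^2 + 4 c_k x)] = 2 for x = 1/(1 - rho),
    then recover rho and the pi_k from the c_k."""
    def f(x):
        return sum((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x) for ck in c) - 2

    lo, hi = 1.0, 2.0
    while f(hi) > 0:          # f is decreasing in x; widen until the root is bracketed
        hi *= 2
    while hi - lo > tol * hi:  # plain bisection
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    rho = 1 - 1 / x
    pi = [0.5 * ((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x)) for ck in c]
    return pi, rho
```

For example, starting from $\pi = (.5, .3, .2)$ and $\rho = .1$, the homozygote probabilities alone suffice to recover both $\rho$ and all $\pi_k$, which is the content of the identifiability claim.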

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \text{ is not} \ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $\mathcal{S}(g) = \{(h,h)\}$ or $\{(h,\tilde h)\}$, so that $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h,h))P(H = (h,h))$ or $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h,\tilde h))P(H = (h,\tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h_k,h_l))$ does not depend on $(h_k,h_l) \in \mathcal{S}(g)$, so that $m_g(y,x;\theta) = P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h_k,h_l))P(G = g)$, where $(h_k,h_l) \in \mathcal{S}(g)$. For $g \in \mathcal{G}_3$,

$$m_g(y,x;\theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^Tx + \beta_4^Tx)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^Tx + \beta_4^Tx)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^Tx)\}}{1 + \exp(\alpha + \beta_3^Tx)}\,\pi_2(g),$$

where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k,h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) When $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y,x;\theta) = m_g(y,x;\theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y,x;\theta) = m_g(y,x;\theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).

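A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i|X_i,H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is

$$\sum_{i=1}^n \sum_{(h_k,h_l)\in\mathcal{S}(G_i)} p_{ikl}(\theta)\big\{\log P_{\alpha,\beta,\xi}(Y_i|X_i,(h_k,h_l)) + \log P_\gamma(h_k,h_l)\big\},$$

where

$$p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i|X_i,(h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_{k'},h_{l'})\in\mathcal{S}(G_i)} P_{\alpha,\beta,\xi}(Y_i|X_i,(h_{k'},h_{l'}))P_\gamma(h_{k'},h_{l'})}.$$

Thus, in the $(m+1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:

$$\sum_{i=1}^n \sum_{(h_k,h_l)\in\mathcal{S}(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}(Y_i|X_i,(h_k,h_l)) = 0,$$

$$\sum_{i=1}^n \sum_{(h_k,h_l)\in\mathcal{S}(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_\gamma\log P_\gamma(h_k,h_l) = 0. \qquad (A.1)$$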
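The E-step weights $p_{ikl}(\theta)$ can be made concrete with a short Python sketch. It rests on simplifying assumptions that are not part of the general development: a binary trait, an additive logistic model with no covariates, Hardy–Weinberg equilibrium for $P_\gamma$, and genotypes coded as the number of copies of allele 1 at each SNP. The function names `compatible_pairs` and `posterior_weights` and the argument `target` (playing the role of $h^*$) are hypothetical.

```python
import math
from itertools import product

def compatible_pairs(genotype):
    """All ordered haplotype pairs (h_k, h_l) with h_k + h_l equal to the genotype."""
    choices = []
    for count in genotype:            # copies of allele 1 at each SNP: 0, 1, or 2
        if count == 0:
            choices.append([(0, 0)])
        elif count == 2:
            choices.append([(1, 1)])
        else:                          # heterozygous SNP: phase is ambiguous
            choices.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*choices):
        hk = tuple(a for a, _ in combo)
        hl = tuple(b for _, b in combo)
        pairs.append((hk, hl))
    return pairs

def posterior_weights(y, genotype, hap_freq, alpha, beta, target):
    """p_ikl for one subject: P(Y | H) * P(H), normalized over S(G_i).
    Assumes an additive logistic model and Hardy-Weinberg equilibrium,
    and that at least one compatible pair has positive prior probability."""
    weights = {}
    for hk, hl in compatible_pairs(genotype):
        copies = (hk == target) + (hl == target)   # number of copies of h*
        eta = alpha + beta * copies
        prob_y = math.exp(y * eta) / (1 + math.exp(eta))
        prior = hap_freq.get(hk, 0.0) * hap_freq.get(hl, 0.0)
        weights[(hk, hl)] = prob_y * prior
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}
```

With a doubly heterozygous genotype, a case ($y = 1$), and a positive effect of the target haplotype, the pairs containing the target haplotype receive larger posterior weight than the competing phase, even when all haplotypes are equally frequent a priori.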

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k,h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k,h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1-B)Q_2$. The complete-data likelihood can be represented by

$$\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i|X_i,H_i)\prod_k \pi_k^{I(Q_{1i}=(h_k,h_k))B_i} \times \prod_{k,l}(\pi_k\pi_l)^{I(Q_{2i}=(h_k,h_l))(1-B_i)}\rho^{B_i}(1-\rho)^{1-B_i}.$$

The corresponding score equations for $\pi_k$ and $\rho$ satisfy

$$\pi_k = c^{-1}\Big[\sum_{i=1}^n B_iI\big(Q_{1i} = (h_k,h_k)\big) + \sum_{i=1}^n\sum_{l=1}^K (1-B_i)\big\{I\big(Q_{2i} = (h_k,h_l)\big) + I\big(Q_{2i} = (h_l,h_k)\big)\big\}\Big]$$

and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

$$E\{\omega(B_i,Q_{1i},Q_{2i})|Y_i,X_i,G_i\} = \sum_{bq_1+(1-b)q_2\in\mathcal{S}(G_i)} P_{\alpha,\beta,\xi}\big(Y_i|X_i,bq_1+(1-b)q_2\big)\,p(b,q_1,q_2)\,\omega(b,q_1,q_2) \times \Big[\sum_{bq_1+(1-b)q_2\in\mathcal{S}(G_i)} P_{\alpha,\beta,\xi}\big(Y_i|X_i,bq_1+(1-b)q_2\big)\,p(b,q_1,q_2)\Big]^{-1},$$

where $\omega(B,Q_1,Q_2) = BI(Q_1 = (h_k,h_k))$, $(1-B)I(Q_2 = (h_k,h_l))$, or $B$, and

$$p(b,q_1,q_2) = \prod_k \pi_k^{bI(q_1=(h_k,h_k))} \times \prod_{k,l}(\pi_k\pi_l)^{(1-b)I(q_2=(h_k,h_l))}\rho^b(1-\rho)^{1-b}.$$

In the $(m+1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:

$$\pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\Big[\sum_{i=1}^n E^{(m)}\big\{B_iI\big(Q_{1i} = (h_k,h_k)\big)\big\} + 2\sum_{i=1}^n\sum_{l=1}^K E^{(m)}\big\{(1-B_i)I\big(Q_{2i} = (h_k,h_l)\big)\big\}\Big]$$

and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i,Q_{1i},Q_{2i})\}$ is $E\{\omega(B_i,Q_{1i},Q_{2i})|Y_i,X_i,G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.
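The closed-form updates can be sketched in Python for the special case in which the trait carries no information, so that $P_{\alpha,\beta,\xi}(Y_i|X_i,H_i)$ drops out and only the genotypes drive the iteration. This restriction is an assumption made purely for illustration, as are the function names `compatible_pairs` and `em_hap_freq`, the starting values, and the fixed number of iterations. Genotypes are coded as the number of copies of allele 1 at each SNP.

```python
from itertools import product

def compatible_pairs(genotype):
    """All ordered haplotype pairs (h_k, h_l) with h_k + h_l equal to the genotype."""
    choices = [[(0, 0)] if c == 0 else [(1, 1)] if c == 2 else [(0, 1), (1, 0)]
               for c in genotype]
    return [(tuple(a for a, _ in combo), tuple(b for _, b in combo))
            for combo in product(*choices)]

def em_hap_freq(genotypes, n_snps, n_iter=200):
    """EM for (pi_k, rho) under P(H = (h_k, h_l)) = rho*delta_kl*pi_k + (1-rho)*pi_k*pi_l,
    using the latent representation H = B*Q1 + (1-B)*Q2 (genotype data only)."""
    haplotypes = list(product((0, 1), repeat=n_snps))
    pi = {h: 1.0 / len(haplotypes) for h in haplotypes}
    rho = 0.1                                   # arbitrary starting value
    n = len(genotypes)
    for _ in range(n_iter):
        hap_count = {h: 0.0 for h in haplotypes}
        b_sum = 0.0
        for g in genotypes:
            terms = []                          # (b, h_k, h_l, unnormalized weight)
            for hk, hl in compatible_pairs(g):
                if hk == hl:
                    terms.append((1, hk, hl, rho * pi[hk]))
                terms.append((0, hk, hl, (1 - rho) * pi[hk] * pi[hl]))
            total = sum(t[3] for t in terms)
            for b, hk, hl, w in terms:
                p = w / total                   # E-step: posterior weight of (b, H)
                if b == 1:
                    b_sum += p
                    hap_count[hk] += p          # B_i I(Q_1i = (h_k, h_k))
                else:
                    hap_count[hk] += p          # (1 - B_i) term, first position
                    hap_count[hl] += p          # (1 - B_i) term, second position
        norm = sum(hap_count.values())          # the normalizing constant c
        pi = {h: c / norm for h, c in hap_count.items()}
        rho = b_sum / n
    return pi, rho
```

With a single SNP and genotype counts (30, 40, 30) for 0, 1, and 2 copies of allele 1, the iteration settles at $\pi = .5$ and $\rho = .2$, the values that reproduce the observed genotype proportions exactly.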

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is

$$\prod_{i=1}^N P_{\alpha,\beta,\xi}(Y_i|X_i,H_i)P_\gamma(H_i)\prod_g f_g(X_i)^{I(G_i=g)}.$$

The M-step solves the following equations for $\theta$:

$$\sum_{i=1}^N I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}(Y_i|X_i,H_i)|Y_i,X_i,G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}(Y_i|X_i,H_i)|Y_i\} = 0,$$

$$\sum_{i=1}^N I(R_i = 1)E\{\nabla_\gamma\log P_\gamma(H_i)|Y_i,X_i,G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_\gamma\log P_\gamma(H_i)|Y_i\} = 0, \qquad (A.2)$$

and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

$$F_g\{X_i\} = \Big[\sum_{j=1}^N I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(X_j = X_i, G_j = g)|Y_j\}\Big] \times \Big[\sum_{j=1}^N I(G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(G_j = g)|Y_j\}\Big]^{-1},$$

where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i,X_i,H_i)$, the conditional expectation takes the form

$$\frac{\sum_{(h_k,h_l)\in\mathcal{S}(G_i)}\omega(Y_i,X_i,(h_k,h_l))P_{\alpha,\beta,\xi}(Y_i|X_i,(h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_k,h_l)\in\mathcal{S}(G_i)}P_{\alpha,\beta,\xi}(Y_i|X_i,(h_k,h_l))P_\gamma(h_k,h_l)}$$

for $R_i = 1$, and

$$\sum_{g\in\mathcal{G}}\ \sum_{x\in\{X_j:\,G_j=g,R_j=1\}}\ \sum_{(h_k,h_l)\in\mathcal{S}(g)}\omega(Y_i,x,(h_k,h_l))\,P_{\alpha,\beta,\xi}(Y_i|x,(h_k,h_l))\,P_\gamma(h_k,h_l)F_g\{x\} \times \Big(\sum_{g\in\mathcal{G}}\ \sum_{x\in\{X_j:\,G_j=g,R_j=1\}}\ \sum_{(h_k,h_l)\in\mathcal{S}(g)}P_{\alpha,\beta,\xi}(Y_i|x,(h_k,h_l))\,P_\gamma(h_k,h_l)F_g\{x\}\Big)^{-1}$$

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components $F_g(\cdot)$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


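A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters, $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$, yield the same likelihood,

$$RL(\theta, \{F_g^\dagger\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \qquad (A.3)$$

Because $\eta(0,x,g;\theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x)q_g/\sum_{g\in\mathcal{G}} q_g = \tilde f_g^\dagger(x)\tilde q_g/\sum_{g\in\mathcal{G}}\tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

$$\eta(y,x,g;\theta) = C(y)\,\eta(y,x,g;\tilde\theta), \qquad (A.4)$$

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y,x_0,g_0;\theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}): \eta(y,x,g;\tilde\theta) = \eta(y,x,g;\theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that

$$\eta(y,x,g;\theta) = \eta(y,x,g;\tilde\theta) \qquad (A.5)$$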

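for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $\mathcal{S}(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0|X = x, H = (h_k,h_l)) = \{1 + \exp(\alpha + \beta_3^Tx)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

$$\frac{\dfrac{\pi_1(g)\{1 + e^{\alpha+\psi_2(x)}\}}{\pi_2(g)\{1 + e^{\alpha+\psi_1(x)}\}} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\pi_1(g)\{1 + e^{\alpha+\psi_2(x)}\}}{\pi_2(g)\{1 + e^{\alpha+\psi_1(x)}\}} + 1} = \frac{\dfrac{\tilde\pi_1(g)\{1 + e^{\tilde\alpha+\psi_2(x)}\}}{\tilde\pi_2(g)\{1 + e^{\tilde\alpha+\psi_1(x)}\}} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\tilde\pi_1(g)\{1 + e^{\tilde\alpha+\psi_2(x)}\}}{\tilde\pi_2(g)\{1 + e^{\tilde\alpha+\psi_1(x)}\}} + 1}, \qquad (A.6)$$

where $\psi_1(x) = \beta_1 + \beta_3^Tx + \beta_4^Tx$ and $\psi_2(x) = \beta_3^Tx$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results:

1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}}.$$

Thus (A.6) is equivalent to

$$\frac{\pi_1(g)/\pi_2(g)}{\pi_1(\tilde g)/\pi_2(\tilde g)} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(\tilde g)/\tilde\pi_2(\tilde g)} \quad \text{for all } g, \tilde g \in \mathcal{G}_3.$$

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z}z \ne 0$, (A.6) yields

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\beta_{3z}z}}{1 + e^{\alpha+\beta_1+\beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\beta_{3z}z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z}z}}.$$

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \qquad (A.7)$$

for any $x$ such that $\beta_1 + \beta_4^Tx \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.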

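A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1,x,g;\theta) = \eta(1,x,g;\tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1,x,g;\theta) = \eta(1,x,g;\tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $\mathcal{S}(g)$ has a single element for such $g$, we obtain

$$\frac{\Phi(\alpha)}{1-\Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^Tx + 2\beta_4^Tx + \beta_5^Tx)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1-\Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^Tx + 2\tilde\beta_4^Tx + \tilde\beta_5^Tx)} - 1\Big\}, \qquad (A.8)$$

$$\frac{\Phi(\alpha)}{1-\Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^Tx + \beta_4^Tx)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1-\Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^Tx + \tilde\beta_4^Tx)} - 1\Big\}, \qquad (A.9)$$

and

$$\frac{\Phi(\alpha)}{1-\Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_3^Tx)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1-\Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^Tx)} - 1\Big\}, \qquad (A.10)$$

where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then

$$\frac{\Phi(\alpha)}{1-\Phi(\alpha)}\Big\{\frac{1}{\Phi(\alpha + \beta_zz)} - 1\Big\} = \frac{\Phi(\tilde\alpha)}{1-\Phi(\tilde\alpha)}\Big\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_zz)} - 1\Big\}.$$

By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain

$$\frac{\Phi(\alpha)}{1-\Phi(\alpha)}\,\frac{\phi(\alpha + \beta_zz)}{\Phi(\alpha + \beta_zz)^2} = \frac{\Phi(\tilde\alpha)}{1-\Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_zz)}{\Phi(\tilde\alpha + \beta_zz)^2}.$$

By taking the ratio of the two sides and letting $z \to \operatorname{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^Tx + 2\beta_4^Tx + \beta_5^Tx = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^Tx + 2\tilde\beta_4^Tx + \tilde\beta_5^Tx$, $\beta_1 + \beta_3^Tx + \beta_4^Tx = \tilde\beta_1 + \tilde\beta_3^Tx + \tilde\beta_4^Tx$, and $\beta_3^Tx = \tilde\beta_3^Tx$. Therefore $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,

$$\eta(1,x,g;\theta) = \frac{1-\Phi(\alpha)}{\Phi(\alpha)}\big\{\Phi(\alpha + \beta_1 + \beta_3^Tx + \beta_4^Tx)\,\pi_1(g)/\pi_2(g) + \Phi(\alpha + \beta_3^Tx)\big\} \times \big[\{1 - \Phi(\alpha + \beta_1 + \beta_3^Tx + \beta_4^Tx)\}\,\pi_1(g)/\pi_2(g) + \{1 - \Phi(\alpha + \beta_3^Tx)\}\big]^{-1}. \qquad (A.11)$$

It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1,x,g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

$$\frac{e^{-e^{\alpha}}\big(e^{e^{\alpha+\beta_zz}} - 1\big)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}\big(e^{e^{\tilde\alpha+\tilde\beta_zz}} - 1\big)}{1 - e^{-e^{\tilde\alpha}}}.$$

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_zz} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_zz} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.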

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X,G)$, denoted by $(x_1,g_1), \ldots, (x_J,g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j,g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X,G)$ at $(x_j,g_j)$. Then the log-likelihood based on (8) can be written as

$$l_n(\theta,\{\delta_j\}) = \sum_{j=1}^J n_{1j}\log\eta(1,x_j,g_j;\theta) - n_1\log\Big\{\sum_{j=1}^J \eta(1,x_j,g_j;\theta)\delta_j\Big\} + \sum_{j=1}^J n_{+j}\log\delta_j,$$

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

$$\frac{n_{+j}}{\delta_j} - \frac{n_1\eta(1,x_j,g_j;\theta)}{\sum_{j=1}^J \eta(1,x_j,g_j;\theta)\delta_j} + \lambda = 0.$$

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

$$\delta_j = \frac{n_{+j}}{n - n_1 + n_1\eta(1,x_j,g_j;\theta)/\mu}, \qquad (A.12)$$

where $\mu = \sum_{j=1}^J \eta(1,x_j,g_j;\theta)\delta_j$. Plugging (A.12) into $l_n(\theta,\{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

$$l_n^*(\theta,\mu) = \sum_{j=1}^J n_{1j}\log\eta(1,x_j,g_j;\theta) - \sum_{j=1}^J n_{+j}\log\Big\{\frac{n_1}{n}\eta(1,x_j,g_j;\theta) + \Big(1 - \frac{n_1}{n}\Big)\mu\Big\} + (n - n_1)\log\mu.$$

Thus $\max_{\{\delta_j\}} l_n(\theta,\{\delta_j\}) \le \max_\mu l_n^*(\theta,\mu) + C_n$. If $\mu$ maximizes $l_n^*(\theta,\mu)$, then $\partial l_n^*(\theta,\mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l_n^*(\theta,\mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta,\{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta,\{\delta_j\})$ equals the profile function based on $l_n^*(\theta,\mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta,\mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta,\mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta,\hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta,\mu)$.
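The link between the jump sizes in (A.12) and the maximizer of $l_n^*(\theta,\mu)$ in $\mu$ can be verified numerically for a fixed $\theta$. The Python sketch below is an illustration under stated assumptions: the values of $\eta(1,x_j,g_j;\theta)$ are supplied directly as positive numbers in `eta`, the sample contains at least one control, and the root of $\partial l_n^*/\partial\mu = 0$ is found by bisection rather than by the Newton–Raphson iteration used in the text. The function names `solve_mu` and `jump_sizes` are hypothetical.

```python
def solve_mu(eta, n_plus, n1, tol=1e-12):
    """Root in mu of d l*_n / d mu = 0 for fixed theta, found by bisection.
    eta[j] = eta(1, x_j, g_j; theta) > 0, n_plus[j] = n_{+j}, n1 = number of cases."""
    n = sum(n_plus)
    r = n1 / n

    def g(mu):
        # mu * d l*_n / d mu, which is strictly decreasing in mu
        return (n - n1) - sum(m * (1 - r) * mu / (r * e + (1 - r) * mu)
                              for m, e in zip(n_plus, eta))

    lo, hi = 0.0, 1.0
    while g(hi) > 0:           # widen until the root is bracketed
        hi *= 2
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def jump_sizes(eta, n_plus, n1, mu):
    """delta_j from (A.12): n_{+j} / (n - n1 + n1 * eta_j / mu)."""
    n = sum(n_plus)
    return [m / (n - n1 + n1 * e / mu) for m, e in zip(n_plus, eta)]
```

At the stationary point, the resulting $\delta_j$ sum to 1 and satisfy $\sum_j \eta(1,x_j,g_j;\theta)\delta_j = \mu$, so the one-dimensional search over $\mu$ reproduces the constrained maximization over the $J$ jump sizes.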

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define

$$\zeta_1(x,g;\theta) = \sum_{(h_k,h_l)\in\mathcal{S}(g)} e^{\beta^TZ(x,h_k,h_l)}\{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\},$$

$$\zeta_0(g;\theta) = \sum_{(h_k,h_l)\in\mathcal{S}(g)} \{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}.$$

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

$$l_n^*(\theta,\{\mu_g\}) = \sum_{i=1}^n \{y_i\log\zeta_1(X_i,G_i;\theta) + (1-y_i)\log\zeta_0(G_i;\theta)\} - \sum_{i=1}^n\sum_g I(G_i = g)\log\Big\{\zeta_1(X_i,G_i;\theta) + n_1^{-1}n_g\sum_{\tilde g}\mu_{\tilde g} - \mu_g\Big\} + \sum_{i=1}^n (1-y_i)\log\Big(\sum_g \mu_g\Big),$$

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

$$l_n^*(\theta,\mu) = \sum_{i=1}^n y_i\log\zeta_1(X_i,G_i;\theta) + \sum_{i=1}^n (1-y_i)\log\zeta_0(G_i;\theta) + \sum_{i=1}^n (1-y_i)\log\mu - \sum_{i=1}^n \log\Big\{(1-r)\mu + r\sum_g \zeta_1(X_i,g;\theta)\Big\},$$

where $r = n_1/n$. Let $H = BQ_1 + (1-B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k,h_k): k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k,h_l): k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B,Q_1,Q_2,Y)$ given $X$ is characterized by

$$P(B,Q_1,Q_2,Y|X) = \frac{\exp\{\vartheta^TW(B,Q_1,Q_2,Y,X)\}}{\sum_{B,Q_1,Q_2,Y}\exp\{\vartheta^TW(B,Q_1,Q_2,Y,X)\}},$$

where $\vartheta = (-\log\mu + \log\{r/(1-r)\}, \beta^T, \log\pi_1 - \log\{\rho/(1-\rho)\}, \ldots, \log\pi_K - \log\{\rho/(1-\rho)\})^T$ and

$$W(B,Q_1,Q_2,Y,X) = \Big(Y,\ YZ^T(X,H),\ BI(Q_1 = (h_1,h_1)) + (1-B)\sum_l\{I(Q_2 = (h_1,h_l)) + I(Q_2 = (h_l,h_1))\},\ \ldots,\ BI(Q_1 = (h_K,h_K)) + (1-B)\sum_l\{I(Q_2 = (h_K,h_l)) + I(Q_2 = (h_l,h_K))\}\Big)^T.$$

We can show that $l_n^*(\theta,\mu)$ is equivalent to the log-likelihood

$$l_n^*(\vartheta) = \sum_{i=1}^n \log\Big[\frac{\sum_{BQ_1+(1-B)Q_2\in\mathcal{S}(G_i)} e^{\vartheta^TW(B,Q_1,Q_2,Y_i,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^TW(b,q_1,q_2,y,X_i)}}\Big].$$

We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B,Q_1,Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

The complete-data score function is

$$\sum_{i=1}^n \Big[W(B_i,Q_{1i},Q_{2i},Y_i,X_i) - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^TW(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^TW(b,q_1,q_2,y,X_i)\}}\Big].$$

Thus, in the E-step, we calculate the conditional expectation of $W(B_i,Q_{1i},Q_{2i},Y_i,X_i)$ given $(Y_i,X_i,G_i)$ and the current parameter estimates:

$$E[W(B_i,Q_{1i},Q_{2i},Y_i,X_i)|Y_i,X_i,G_i] = \frac{\sum_{b,q_1,q_2} I(bq_1+(1-b)q_2\in\mathcal{S}(G_i))\,e^{\vartheta^TW(b,q_1,q_2,Y_i,X_i)}\,W(b,q_1,q_2,Y_i,X_i)}{\sum_{b,q_1,q_2} I(bq_1+(1-b)q_2\in\mathcal{S}(G_i))\,e^{\vartheta^TW(b,q_1,q_2,Y_i,X_i)}}.$$

In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

$$\vartheta^{(k+1)} = \vartheta^{(k)} - \Omega^{-1} \times \sum_{i=1}^n \Big[E[W(B,Q_1,Q_2,Y_i,X_i)|Y_i,X_i,G_i] - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^TW(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^TW(b,q_1,q_2,y,X_i)\}}\Big],$$

where

$$\Omega = -\Big[\sum_{i=1}^n \frac{\sum_{b,q_1,q_2,y} W^{\otimes 2}(b,q_1,q_2,y,X_i)\,e^{\vartheta^TW(b,q_1,q_2,y,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^TW(b,q_1,q_2,y,X_i)}}\Big] + \sum_{i=1}^n \Big[\frac{\{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\,e^{\vartheta^TW(b,q_1,q_2,y,X_i)}\}^{\otimes 2}}{\{\sum_{b,q_1,q_2,y} e^{\vartheta^TW(b,q_1,q_2,y,X_i)}\}^2}\Big]$$

and $a^{\otimes 2} = aa^T$.

A46 Proof of Theorem 2 Write Fxg(xg) = Fdaggerg(x)qg and

Fxg(xg) = Fdaggerg(x)qg Because θ is bounded and Fxg is a probabil-

ity distribution we can choose a subsequence such that θ rarr θlowast andFxg(xg) rarr Flowast

xg(xg) equiv Flowastg(x)qlowast

g where qlowastg gt 0 for any g

Because Fdaggerg maximizes the likelihood there exists some Lagrange

multiplier λg such that

I(Gi = g)

FdaggergXi

minus n1η(1Xig θ )qgint

xg η(1x g θ)dFxg(x g)minus nλg = 0

where FdaggergXi denotes the point mass of Fdagger

g at Xi and the integralis interpreted as integration over x and summation over g Becausesumn

i=1 FdaggergXi = 1 λg satisfies the equation

nminus1nsum

i=1

I(Gi = g)

λg + n1η(1Xig θ )qgn intxg η(1x g θ)dFxg(x g)minus1

= 1 (A13)

and

min1leilen

λg + n1η(1Xig θ )qg

nint

xg η(1x g θ)dFxg(x g)

gt 0

Clearly λg must be bounded asymptotically Thus by choosing a sub-sequence we assume that λg rarr λlowast

g By (A13) and the Lipschitz continuity of η(1xg θlowast) in the con-

tinuous components of x we can show that there exists a positive con-stant δ such that

mingx

∣∣∣∣λ

lowastg + η(1xg θlowast)qlowast

gintxg η(1x g θlowast)dFlowast

xg(x g)

∣∣∣∣

gt δ

Consequently when n is sufficiently large

Fdaggerg(x) = nminus1

nsum

i=1

I(Gi = gXi le x)

times(

max

[∣∣∣∣λg + η(1Xig θ )qgn1

times

nint

xgη(1x g θ)dFxg(x g)

minus1∣∣∣∣ δ

])minus1

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proofof theorem 12 of Murphy and van der Vaart (2001) We first obtaina score function by differentiating ln(θ Fdagger

g qg) with respect to θ

along the direction v and with respect to Fxg along the path Fε =Fxg + ε

intψ(xg)dFxg where v has a unit norm and ψ(middotg) is any

function whose total variation is bounded by 1 The linearization of thescore function around the true parameter value yields

n12

(vT11 + 21[ψ]T )(θ minus θ0)

+int

(vT12 + 22[ψ])d(Fxg minus Fxg)

= nminus12nsum

i=1

yi

vT lθ (1XiGi θ0Fxg)

+ lF(1XiGi θ0Fxg)

[int

ψ dFxg

]

+ nminus12nsum

i=1

(1 minus yi)

vT lθ (0XiGi θ0Fxg)

+ lF(0XiGi θ0Fxg)

[int

ψ dFxg

]

+ op(1)

where 11 is a constant matrix 12 is a vector function of x 21[ψ]and 22[ψ] are linear operators of ψ and lθ and lF are the scoreswith respect to θ and Fxg The right side of the foregoing equa-tion converges weakly to a Gaussian process which depends on( y1 y2 ) only through We can show that the operator B[vψ] equivvT11 +21[ψ]T vT12 +22[ψ]T is invertible along the lines ofMurphy and van der Vaart (2001) It then follows from theorem 331of van der Vaart and Wellner (1996) that n12(θ minus θ0 Fxg minus Fxg)

converges weakly to a Gaussian processBecause the asymptotic distribution depends on ( y1 y2 ) only

via we assume that ( y1 y2 ) are independent realizations froma Bernoulli distribution with mean By choosing some ψ such thatB[vψ] = (vT 0)T for all v we see that θ is an asymptotically linearestimator for θ0 with the influence function in the score space It fol-lows from proposition 331 of Bickel Klaassen Ritov and Wellner(1993) that the limiting covariance matrix of n12(θ minus θ0) attains thesemiparametric efficiency bound


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y,x,g;\theta,\{F_g\},a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

$$\frac{dP_n}{d\tilde P_n} = \exp\Big\{a_n\sum_{i=1}^n \frac{\partial\log f(y_i,X_i,G_i;\theta,\{F_g\},a)}{\partial a}\Big|_{a=0} + o(1)\Big\} \to_{\tilde P_n} 1.$$

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

A.5 Cohort Studies

A51 Identifiability We show that if two sets of parameters (θ )

and (θ ) yield the same joint distribution then θ = θ and = First it follows from Lemma 1 that γ = γ Suppose that

sum

HisinS(G)

˜(Y)eβTZ(XH)Q((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

=sum

HisinS(G)

(Y)eβTZ(XH)Q

((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

By choosing = 1 and integrating Y from 0 to τ on both sides weobtain

sum

HisinS(G)

Q((τ )eβTZ(XH)

)Pγ (H)

=sum

HisinS(G)

Q((τ)eβTZ(XH)

)Pγ (H)

Because Q(middot) is strictly increasing the foregoing equation implies that

(Y)eβTZ(XH) = (Y)eβTZ(XH) for H = (hh) and H = (h h) Itthen follows from Condition 8 that β = β and =

A52 Proof of Theorem 4 Our problem is the same as that ofZeng et al (2005) except replacing the integration over random ef-fects in that article by the sum over H isin S(G) The asymptotic prop-erties stated in the theorem follow from the identifiability shown inSection A51 and the proofs of Zeng et al (2005) provided that wecan verify the following result If there exist a vector micro = (microT

β microTγ )T

and a function ψ(t) such that

microT lθ (θ00) + l(θ00)

[int

ψ d0

]

= 0 (A14)

where lθ is the score function for θ and l[int ψ d0] is the score func-tion for along the submodel 0 +ε

intψ d0 then micro = 0 and ψ = 0

To prove the desired result we write out (A14) We then let = 1and integrate Y from 0 to τ to obtain

sum

HisinS(G)

Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times Q(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

Q(0(τ )eβT0Z(XH))

+ Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A15)

In contrast by letting = 0 and Y = τ in (A14) we havesum

HisinS(G)

1 minus Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times

minusQ(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

1 minus Q(0(τ )eβT0Z(XH))

minus Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

1 minus Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A16)

The summation of (A15) and (A16) entails microTγ nablaγ log Pγ (H) = 0

From the proof of Lemma 1 microγ = 0 We choose G = 2h or h + h and

let = 1 and Y = 0 in (A14) to obtain microTβZ(XH) + ψ(0) = 0 for

H = (hh) and (h h) Thus microβ = 0 and ψ(0) = 0 under Condition 8Finally (A14) with = 1 implies that

ψ(Y) + Q(0(Y)eβT0Z(XH))

int Y0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(Y)eβT0Z(XH))

= 0

for H = (hh) Therefore ψ = 0

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), "Prediction and Entropy," in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), "Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?" European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), "Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease," Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), "Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling," The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), "Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations," Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), "Regression Models and Life-Tables" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), "Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data," American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), "Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population," Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), "Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer's Disease," Genome Research, 11, 143–151.

Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), "Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci," Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), "Initial Sequencing and Analysis of the Human Genome," Nature, 409, 860–921.


International SNP Map Working Group (2001) ldquoA Map of Human GenomeSequence Variation Containing 142 Million Single Nucleotide Polymor-phismsrdquo Nature 409 928ndash933

Lake S L Lyon H Tantisira K Silverman E K Weiss S T Laird N Mand Schaid D J (2003) ldquoEstimation and Tests of Haplotype-EnvironmentInteraction When the Linkage Phase Is Ambiguousrdquo Human Heredity 5556ndash65

Li H (2001) ldquoA Permutation Procedure for the Haplotype Method for Iden-tification of Disease-Predisposing Variantsrdquo Annals of Human Genetics 65180ndash196

Liang K-Y and Qin J (2000) ldquoRegression Analysis Under Non-StandardSituations A Pairwise Pseudolikelihood Approachrdquo Journal of the Royal Sta-tistical Society Ser B 62 773ndash786

Lin D Y (2004) ldquoHaplotype-Based Association Analysis in Cohort Studies ofUnrelated Individualsrdquo Genetic Epidemiology 26 255ndash264

(2005) ldquoAn Efficient Monte Carlo Approach to Assessing StatisticalSignificance in Genomic Studiesrdquo Bioinformatics 21 781ndash787

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment
Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to

© 2006 American Statistical Association
Journal of the American Statistical Association

March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000817


a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and eventually to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and understanding their relevance in determining disease risks.
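The window scan described above can be sketched in a few lines. This is only an illustration of how markers might be grouped into windows of fixed genetic length; the function name `sliding_windows`, its parameters, and the use of centimorgan positions are assumptions made here, and the association test applied within each window is left out.

```python
def sliding_windows(positions_cm, window_cm, step_cm):
    """Yield (window_start, marker_indices) for windows of fixed genetic length.

    positions_cm: sorted marker positions (hypothetical units: centimorgans).
    window_cm:    length of each window.
    step_cm:      distance between consecutive window starts (must be > 0).
    Windows that contain no markers are skipped.
    """
    if not positions_cm:
        return
    first, last = positions_cm[0], positions_cm[-1]
    k = 0
    while first + k * step_cm <= last:
        start = first + k * step_cm
        # Markers falling in the half-open interval [start, start + window_cm)
        idx = [i for i, p in enumerate(positions_cm)
               if start <= p < start + window_cm]
        if idx:
            yield start, idx
        k += 1


if __name__ == "__main__":
    markers = [0.0, 0.5, 1.2, 2.0, 2.1]
    for start, idx in sliding_windows(markers, window_cm=1.0, step_cm=1.0):
        # Each idx would select the markers whose haplotypes are tested together.
        print(start, idx)
```

Choosing a step smaller than the window length gives overlapping windows, which is the usual meaning of "sliding"; the example uses equal values only to keep the output short.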

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effects estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should

be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multiloci genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information, when available, or assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating


haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
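To illustrate how EM handles missing phase, here is a minimal sketch for two biallelic SNPs. It is a generic haplotype-frequency EM under Hardy–Weinberg equilibrium, not Lin and Zeng's algorithm: it contains no phenotype or covariate model, and the function name `em_haplotype_freqs` and the genotype coding (count of allele 1 at each SNP) are assumptions made for this example.

```python
from itertools import product


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), where g1 and g2 are the counts (0, 1, or 2)
               of allele 1 at SNP 1 and SNP 2. Assumes HWE and random mating.
    Returns a dict mapping each haplotype (a1, a2) to its estimated frequency.
    """
    haps = list(product((0, 1), repeat=2))
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            # E-step: ordered haplotype pairs compatible with this genotype,
            # weighted by their current probability under HWE.
            pairs = [(h1, h2) for h1 in haps for h2 in haps
                     if (h1[0] + h2[0], h1[1] + h2[1]) == tuple(g)]
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        new_freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    # Only the double heterozygotes (1, 1) have ambiguous phase.
    data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    for hap, f in sorted(em_haplotype_freqs(data).items()):
        print(hap, round(f, 4))
```

In this toy data set the unambiguous homozygotes pull the two double heterozygotes toward the (0, 0)/(1, 1) phase, so the estimated frequencies concentrate on those two haplotypes. The enumeration of compatible pairs grows exponentially with the number of markers, which is one reason EM is considered practical only for short haplotypes.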

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments, I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates

the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69(suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment
Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If, for each value of covariates X, we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model (3) applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining m_g(y, x; θ), where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well, but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that X and H are independent at the population level, not just conditional on G.
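The distinction drawn above between weighting by marginal haplotype frequencies and weighting by covariate-specific frequencies can be written out as follows. This is a sketch rather than Lin and Zeng's exact notation: S(g) (the set of haplotype pairs compatible with genotype g), \pi_k (marginal haplotype frequencies), and \pi_k(x) (frequencies at covariate value x) are symbols introduced here, and Hardy–Weinberg equilibrium is assumed in both lines purely for illustration.

```latex
% For a haplotype pair (h_k, h_l) in S(g):
\begin{align*}
% (a) phase distribution that depends on the covariate value x
P\{H = (h_k, h_l) \mid G = g, X = x\}
  &= \frac{\pi_k(x)\,\pi_l(x)}
          {\sum_{(h_{k'}, h_{l'}) \in S(g)} \pi_{k'}(x)\,\pi_{l'}(x)}, \\
% (b) conditional independence of H and X given G: the same weights for every x
P\{H = (h_k, h_l) \mid G = g, X = x\}
  &= \frac{\pi_k\,\pi_l}
          {\sum_{(h_{k'}, h_{l'}) \in S(g)} \pi_{k'}\,\pi_{l'}}.
\end{align*}
```

The two expressions agree when \pi_k(x) does not vary with x, or trivially when S(g) contains a single pair, which corresponds to the case noted above in which genotypes specify haplotypes completely.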

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited

In the Public Domain
Journal of the American Statistical Association

March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000826


to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently we have developed methods

for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence of the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment
Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

haplotype-based regression analysis We point out the connec-tions and the differences between these alternative methods toshed light on their relative merits In particular we note thatin some important subproblems other methods are availableThese methods are efficient and simple to implement and theyavoid the need to estimate possibly high-dimensional nuisanceparameters

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the "diplotypes" (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\{\operatorname{pr}(D = 1 \mid H^d, X)\} = \beta_0 + m(H^d, X, \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\operatorname{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\operatorname{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.
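To make the prospective likelihood concrete, the following is a minimal Python sketch of $\operatorname{pr}(D \mid G, X)$ under phase ambiguity. It is illustrative only: the four two-SNP haplotypes, their frequencies, the choice of `11` as the risk haplotype, the additive form of $m$, and the use of HWE with gene–environment independence for the diplotype distribution are all assumptions made for the example.

```python
import itertools
import math

# Hypothetical two-SNP haplotypes (alleles coded 0/1) with assumed frequencies.
HAP_FREQ = {"00": 0.4, "01": 0.3, "10": 0.1, "11": 0.2}


def genotype(h1, h2):
    """Unphased genotype: per-SNP count of the '1' allele."""
    return tuple(int(a) + int(b) for a, b in zip(h1, h2))


def compatible_diplotypes(g):
    """All unordered haplotype pairs consistent with genotype g."""
    haps = sorted(HAP_FREQ)
    pairs = itertools.combinations_with_replacement(haps, 2)
    return [(a, b) for a, b in pairs if genotype(a, b) == g]


def q_hwe(dip):
    """Diplotype probability under HWE (an assumption of this sketch)."""
    a, b = dip
    return HAP_FREQ[a] * HAP_FREQ[b] * (1 if a == b else 2)


def risk(dip, x, beta0, beta1, target="11"):
    """pr(D = 1 | diplotype, x), with the assumed form
    m = beta1[0] * (copies of target haplotype) + beta1[1] * x."""
    m = beta1[0] * dip.count(target) + beta1[1] * x
    return 1.0 / (1.0 + math.exp(-(beta0 + m)))


def prospective_lik(d, g, x, beta0, beta1):
    """pr(D = d | G = g, X = x): average the risk over the diplotypes
    compatible with g, weighted by their HWE probabilities."""
    dips = compatible_diplotypes(g)
    num = 0.0
    for h in dips:
        p1 = risk(h, x, beta0, beta1)
        num += (p1 if d == 1 else 1.0 - p1) * q_hwe(h)
    den = sum(q_hwe(h) for h in dips)
    return num / den
```

For the double heterozygote `(1, 1)`, two diplotypes are compatible (`00/11` and `01/10`), so the likelihood is a weighted mixture of two logistic risks; for an unambiguous genotype it reduces to the ordinary logistic likelihood.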

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a "robust approach" for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\operatorname{pr}(H^d = h^d \mid X) = \operatorname{pr}(H^d = h^d) = q(h^d, \theta), \tag{2}$$

where the model $q(h^d, \theta)$, in turn, could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\operatorname{pr}(H^d = h^d \mid X = x) = q(h^d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by "population stratification." In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \operatorname{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = q(h, x, \theta)\,\frac{\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$ denotes the full parameter vector. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h} \sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \operatorname{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
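The pseudolikelihood (4) is straightforward to evaluate once $q$, $m$, and the parameters are specified. The Python sketch below is a hedged illustration, not the implementation of Spinka et al.: the haplotype set and frequencies, the HWE form of $q$ (model (2), no dependence on $x$), the additive form of $m$, and all function names are assumptions introduced for the example.

```python
import itertools
import math

# Assumed haplotype frequencies (theta) for four hypothetical two-SNP haplotypes.
HAP_FREQ = {"00": 0.4, "01": 0.3, "10": 0.1, "11": 0.2}
DIPLOTYPES = list(itertools.combinations_with_replacement(sorted(HAP_FREQ), 2))


def genotype(dip):
    """Unphased genotype: per-SNP count of the '1' allele."""
    a, b = dip
    return tuple(int(s) + int(t) for s, t in zip(a, b))


def q(dip):
    """q(h, theta): diplotype probability under HWE, free of x (model (2))."""
    a, b = dip
    return HAP_FREQ[a] * HAP_FREQ[b] * (1 if a == b else 2)


def m(dip, x, beta1):
    """Assumed risk function: additive effect of haplotype '11' plus a covariate effect."""
    return beta1[0] * dip.count("11") + beta1[1] * x


def kappa(beta0, n1, n0, pi):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(pi / (1 - pi))


def S(d, dip, x, beta0, kap, beta1):
    """S(d, h, x) = q(h) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    eta = m(dip, x, beta1)
    return q(dip) * math.exp(d * (kap + eta)) / (1 + math.exp(beta0 + eta))


def L_star(d, g, x, beta0, kap, beta1):
    """L*(D, G, X): sum of S over diplotypes compatible with g,
    normalized by the sum of S over all diplotypes and both disease states."""
    num = sum(S(d, h, x, beta0, kap, beta1) for h in DIPLOTYPES if genotype(h) == g)
    den = sum(S(dd, h, x, beta0, kap, beta1) for h in DIPLOTYPES for dd in (0, 1))
    return num / den


def pseudo_loglik(data, beta0, kap, beta1):
    """l* = sum_i log L*(D_i, G_i, X_i) for data given as (d, g, x) triples."""
    return sum(math.log(L_star(d, g, x, beta0, kap, beta1)) for d, g, x in data)
```

Because the sets $\mathcal{H}_G$ partition the diplotypes, $L^*$ sums to 1 over all $(d, g)$ for fixed $x$, which is a convenient sanity check on any implementation. In practice `pseudo_loglik` would be passed to a numerical optimizer; that step is omitted here.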

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\operatorname{pr}(D = d)$. Let $R$ be the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme, with $R = 1$ denoting selection. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

$$l^* = \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \operatorname{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1) \times \operatorname{pr}(H_i^d = h^d \mid X_i, R_i = 1) = \sum_{i=1}^{N} \log \operatorname{pr}(D_i, G_i \mid X_i, R_i = 1). \tag{5}$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an "ascertainment-corrected joint likelihood" of the form $\prod_i \operatorname{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with the distribution $F(x)$ of $X$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).
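The way $\kappa$ enters this representation can be checked with a short calculation. The sketch below assumes the logistic model (1), that $N_d$ is the number of sampled subjects with $D = d$ (so that $N_1 = n_1$ and $N_0 = n_0$), and that selection depends on $(H^d, X)$ only through $D$; these are assumptions stated for the derivation rather than statements taken from the cited papers.

```latex
\begin{align*}
\frac{\operatorname{pr}(D = 1 \mid H^d, X, R = 1)}{\operatorname{pr}(D = 0 \mid H^d, X, R = 1)}
  &= \frac{\mu_1 \operatorname{pr}(D = 1 \mid H^d, X)}{\mu_0 \operatorname{pr}(D = 0 \mid H^d, X)} \\
  &= \frac{N_1/\pi}{N_0/(1 - \pi)} \exp\{\beta_0 + m(H^d, X, \beta_1)\}, \\[4pt]
\operatorname{logit}\{\operatorname{pr}(D = 1 \mid H^d, X, R = 1)\}
  &= \beta_0 + \log(N_1/N_0) - \log\{\pi/(1 - \pi)\} + m(H^d, X, \beta_1) \\
  &= \kappa + m(H^d, X, \beta_1).
\end{align*}
```

Under these assumptions, the disease model in the selected sample is again logistic, with the intercept $\beta_0$ replaced by $\kappa$.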

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than model (2), which assumes that $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}. \tag{6}$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \operatorname{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}, \tag{7}$$

where

$$r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x, \beta_1)\}}.$$

Spinka et al. showed that, with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X, \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\operatorname{pr}(H^d) = q(H^d, \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. Model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

$$\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t, \beta_1, \theta, \lambda_0(\cdot)\}, \tag{9}$$

where

$$R^*\{G, X, t, \beta_1, \theta, \lambda_0(\cdot)\} = E\{R(H^d, X, \beta_1) \mid G, X, T \ge t\} = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, \operatorname{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} \operatorname{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}.$$

In general, standard partial likelihood inference cannot be performed based on (9), because the induced relative risk function $R^*$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\operatorname{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function,

$$R^*(G, X, \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d, \theta)},$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat{\theta})$, where $\hat{\theta}$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
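The rare-disease induced relative risk is simply a weighted average of diplotype-specific relative risks over the diplotypes compatible with the observed genotype. A minimal Python sketch follows; the haplotypes and frequencies, the log-linear form $R = \exp(\beta_1 \times \text{copies of haplotype } 11)$, and the omission of covariates $X$ are assumptions made for illustration and are not taken from Chen and Chatterjee (2005).

```python
import itertools
import math

# Assumed haplotype frequencies (a stand-in for a consistent estimate of theta).
HAP_FREQ = {"00": 0.4, "01": 0.3, "10": 0.1, "11": 0.2}
DIPLOTYPES = list(itertools.combinations_with_replacement(sorted(HAP_FREQ), 2))


def genotype(dip):
    """Unphased genotype: per-SNP count of the '1' allele."""
    a, b = dip
    return tuple(int(s) + int(t) for s, t in zip(a, b))


def q_hwe(dip):
    """Diplotype probability under HWE."""
    a, b = dip
    return HAP_FREQ[a] * HAP_FREQ[b] * (1 if a == b else 2)


def relative_risk(dip, beta1, target="11"):
    """Assumed R(H^d, beta1): log-linear in the number of copies of the target haplotype."""
    return math.exp(beta1 * dip.count(target))


def induced_relative_risk(g, beta1):
    """R*(G, beta1, theta): q-weighted average of R over diplotypes compatible with g."""
    dips = [h for h in DIPLOTYPES if genotype(h) == g]
    num = sum(relative_risk(h, beta1) * q_hwe(h) for h in dips)
    den = sum(q_hwe(h) for h in dips)
    return num / den
```

For an unambiguous genotype the induced relative risk equals the diplotype relative risk; for the double heterozygote `(1, 1)` it is a mixture of the risks for `00/11` and `01/10`, with weights 0.16 and 0.06 under the assumed frequencies.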

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases, such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat{\theta})$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), "Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions," Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), "Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions and Joint Effects," Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Prentice, R. L. (1982), "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model," Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), "Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity," Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity, we focus our remarks on those models appropriate to case-control data.

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of "Hardy–Weinberg equilibrium (HWE)" in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual; that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho \pi_k + (1 - \rho) \pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
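The size of the homozygote excess implied by this model is easy to tabulate. The Python sketch below is illustrative only; the heterozygote formula $\pi_{kl} = 2(1 - \rho)\pi_k \pi_l$ for $k \ne l$ is the standard companion of the homozygote formula and is an assumption added here so that the diplotype probabilities sum to 1, and the function name is hypothetical.

```python
def diplotype_probs(pi, rho):
    """Diplotype probabilities under the inbreeding model.

    pi  : list of haplotype frequencies summing to 1
    rho : inbreeding coefficient in [0, 1]
    Returns a dict keyed by unordered haplotype index pairs (k, l), k <= l.
    """
    probs = {}
    for k in range(len(pi)):
        # Homozygotes: pi_kk = rho * pi_k + (1 - rho) * pi_k^2
        probs[(k, k)] = rho * pi[k] + (1 - rho) * pi[k] ** 2
        for l in range(k + 1, len(pi)):
            # Heterozygotes (assumed standard form): 2 * (1 - rho) * pi_k * pi_l
            probs[(k, l)] = 2 * (1 - rho) * pi[k] * pi[l]
    return probs


# Example: three haplotypes, with and without inbreeding.
hwe = diplotype_probs([0.2, 0.3, 0.5], rho=0.0)
inbred = diplotype_probs([0.2, 0.3, 0.5], rho=0.05)
```

Setting `rho=0.0` recovers HWE, so comparing the two dictionaries shows directly how much each homozygote class is inflated for a given $\rho$.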

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = 0.0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = 0.015$. Contrast this with the level of inbreeding ($\rho = 0.05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of 0.05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = 0.03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than 0.05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and 500 controls using (12) from LZ, with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = 0.2$, $\pi_2 = 0.3$, and $\pi_3 = 0.5$). To estimate $\beta_1$, we used LZ's rare disease algorithm, as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = 0.015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario, both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
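The difference between scenarios (A)/(B) and scenario (C) can be read directly off the haplotype list. The short Python sketch below tabulates which haplotypes carry the variant allele at each SNP position; treating allele `1` as the causal variant, and the function names, are assumptions made for illustration.

```python
# Haplotypes and frequencies from the simulation setup described above.
HAPLOTYPES = {"h1": "11", "h2": "01", "h3": "00"}
FREQ = {"h1": 0.2, "h2": 0.3, "h3": 0.5}


def carriers(position):
    """Names of haplotypes carrying allele '1' at the given SNP position (1-based)."""
    return sorted(name for name, alleles in HAPLOTYPES.items()
                  if alleles[position - 1] == "1")


def allele_frequency(position):
    """Population frequency of allele '1' at the given SNP position."""
    return sum(FREQ[name] for name in carriers(position))
```

With the causal SNP in position 1, only `h1` carries the variant, so the haplotype and the variant map one to one; with the causal SNP in position 2, both `h1` and `h2` carry it, so a model that assigns an effect to `h1` alone places the `h2` carriers in the reference group.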

Table 1. Bias and Mean Squared Error (MSE) of $\hat{\beta}_1$

| $\beta_1$ | Scenario | Bias | MSE |
|---|---|---|---|
| 0.5 | Baseline | -0.002 | 0.010 |
| 0.5 | [A] $\rho = 0.015$ | 0.023 | 0.012 |
| 0.5 | [B] Ascertainment bias | 0.273 | 0.083 |
| 0.5 | [C] Multiple causal haplotypes | -0.217 | 0.058 |
| 0 | Baseline | -0.002 | 0.013 |
| 0 | [A] $\rho = 0.015$ | 0.001 | 0.012 |
| 0 | [B] Ascertainment bias | 0.002 | 0.008 |
| 0 | [C] Multiple causal haplotypes | 0.002 | 0.012 |

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to $\rho = 0$, no ascertainment bias, and one causal haplotype, $h_1$.

The bias of the estimator of $\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.
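The missing-phase part of the EM idea in Section 2 can be illustrated on its simplest case. The following Python sketch estimates haplotype frequencies from unphased genotypes under Hardy–Weinberg equilibrium. It is not the NPMLE procedure for case-cohort or nested case-control data discussed above: it ignores the phenotype, covariates, and the baseline hazard. The genotype coding (counts 0, 1, or 2 of a reference allele at each SNP) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Enumerate ordered haplotype pairs (h1, h2) consistent with an unphased genotype.

    genotype: tuple of allele counts (0, 1, or 2) at each SNP (assumed coding).
    """
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])
        elif g == 2:
            per_site.append([(1, 1)])
        else:  # heterozygous site: the phase is unknown
            per_site.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_site):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies, assuming Hardy-Weinberg equilibrium."""
    pair_sets = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pair_sets for pair in ps for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pair_sets:
            # E-step: posterior weight of each compatible ordered pair
            weights = [freq[h1] * freq[h2] for h1, h2 in ps]
            total = sum(weights)
            for (h1, h2), w in zip(ps, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes
        new = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if delta < tol:
            break
    return freq


# Toy data: eight phase-known individuals and one double heterozygote
genotypes = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)]
freqs = em_haplotype_freqs(genotypes)
```

In this toy sample the only phase-ambiguous individual is the double heterozygote, and the iterations assign it almost entirely to the pair 00/11, because those two haplotypes are well supported by the homozygous individuals.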

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method, called FlexTree, for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as “metadata,” usually defined as “data about data.” For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. Such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information provides a way of mitigating the problem of a large number of potential interactions, by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.
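As a rough illustration of how pathway metadata could limit the set of candidate interactions, the following Python sketch keeps only those SNP pairs whose genes differ but share at least one annotated pathway. The annotation dictionaries, the gene and pathway names, and the function name are hypothetical; a real analysis would draw the annotations from resources such as GO or KEGG.

```python
from itertools import combinations


def pathway_restricted_pairs(snp_to_gene, gene_to_pathways):
    """Return SNP pairs whose genes are different but share at least one pathway."""
    pairs = []
    for s1, s2 in combinations(sorted(snp_to_gene), 2):
        g1, g2 = snp_to_gene[s1], snp_to_gene[s2]
        if g1 == g2:
            continue  # same gene: left to haplotype or within-gene models
        shared = gene_to_pathways.get(g1, set()) & gene_to_pathways.get(g2, set())
        if shared:
            pairs.append((s1, s2, sorted(shared)))
    return pairs


# Hypothetical toy annotation (all names are made up for illustration)
snp_to_gene = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B", "snp4": "GENE_C"}
gene_to_pathways = {
    "GENE_A": {"apoptosis"},
    "GENE_B": {"apoptosis", "cell_cycle"},
    "GENE_C": {"adhesion"},
}
candidate_pairs = pathway_restricted_pairs(snp_to_gene, gene_to_pathways)
```

Here four SNPs give six possible pairs, of which only the two that link GENE_A and GENE_B through the shared pathway are retained as interaction candidates.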

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), “Mapping Complex Disease Loci in Whole-Genome Association Studies,” Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least Angle Regression,” The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), “Gradient Directed Regularization,” technical report, Stanford University.

Gui, J., and Li, H. (2005a), “Threshold Gradient Descent Method for Censored Data Regression With Applications in Pharmacogenomics,” Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), “Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data,” Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), “The Hallmarks of Cancer,” Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), “Tree-Structured Supervised Learning and the Genetics of Hypertension,” Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), “The KEGG Databases at GenomeNet,” Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), “Identifying Interacting SNPs Using Monte Carlo Logic Regression,” Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), “Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data,” Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), “Logic Regression,” Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.

Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
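The sliding-window scan with a permutation adjustment for multiple comparisons can be sketched as follows in Python. This is only an outline of the permutation idea: the per-window statistic used here is a simple stand-in based on allele-count differences, not the likelihood-ratio test described above, and the function names, window width, and number of permutations are assumptions made for illustration.

```python
import numpy as np


def sliding_windows(n_markers, width):
    """Index ranges of consecutive marker windows of the given width."""
    return [range(start, start + width) for start in range(n_markers - width + 1)]


def window_stat(genotypes, status, window):
    """Stand-in association statistic for one window (not the likelihood-ratio test):
    sum over markers of squared standardized case-control differences in mean allele count."""
    g = genotypes[:, list(window)]
    cases, controls = g[status == 1], g[status == 0]
    diff = cases.mean(axis=0) - controls.mean(axis=0)
    se = np.sqrt(g.var(axis=0, ddof=1) * (1 / len(cases) + 1 / len(controls)))
    se = np.where(se > 0, se, np.inf)  # monomorphic markers contribute zero
    return float(np.sum((diff / se) ** 2))


def max_stat_permutation(genotypes, status, width, n_perm=200, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum window statistic."""
    rng = np.random.default_rng(seed)
    windows = sliding_windows(genotypes.shape[1], width)
    observed = np.array([window_stat(genotypes, status, w) for w in windows])
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(status)  # break the phenotype-genotype link
        max_null[b] = max(window_stat(genotypes, perm, w) for w in windows)
    # Share of permutations whose maximum reaches each observed statistic
    exceed = (max_null[:, None] >= observed[None, :]).sum(axis=0)
    adj_p = (1 + exceed) / (n_perm + 1)
    return observed, adj_p
```

Comparing each observed statistic with the permutation distribution of the maximum over all windows controls the family-wise error rate while respecting the correlation between overlapping windows.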

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype–environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862

3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.
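Model (3) is defined in the main article and is not restated in this rejoinder. Purely as an illustration of what a one-parameter extension of HWE can look like, the Python sketch below assumes the familiar fixation-index form, in which the probability of the ordered haplotype pair (h_k, h_l) is (1 − ρ)π_kπ_l + ρπ_k when k = l and (1 − ρ)π_kπ_l otherwise. That form, the function name, and the example frequencies are assumptions of this sketch, not a quotation of model (3).

```python
import numpy as np


def diplotype_probs(pi, rho):
    """Ordered haplotype-pair probabilities under an assumed one-parameter departure from HWE.

    Assumed form: P(h_k, h_l) = (1 - rho) * pi_k * pi_l + rho * pi_k * 1{k == l}.
    Setting rho = 0 recovers Hardy-Weinberg equilibrium.
    """
    pi = np.asarray(pi, dtype=float)
    return (1.0 - rho) * np.outer(pi, pi) + rho * np.diag(pi)


# Hypothetical haplotype frequencies for three haplotypes
pi = [0.6, 0.3, 0.1]
hwe = diplotype_probs(pi, 0.0)   # equilibrium
hwd = diplotype_probs(pi, 0.05)  # excess homozygosity
```

Under this assumed form a single extra parameter shifts probability mass from heterozygous to homozygous pairs while leaving the marginal haplotype frequencies unchanged, which is consistent with the remark that allowing disequilibrium adds little complexity.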

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR’s view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct or even an approximately correct model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


Table 3. Simulation Results for the Main Effects of the Haplotype in Case-Control Studies

                           Known totals                     Unknown totals
n1 = n0   α     β1      Bias    SE     CP     Power     Bias    SE     SEE    CP     Power
500       −3    −.5     −.003   .117   .952   .987      .019    .121   .124   .951   .979
500       −3    −.25    −.002   .109   .954   .587      .014    .112   .117   .960   .525
500       −3    0       −.001   .104   .951   .049      .009    .109   .112   .955   .045
500       −3    .25     −.001   .102   .950   .641      .002    .105   .108   .961   .646
500       −3    .5      .000    .099   .948   .996      −.005   .103   .106   .958   .998
500       −4    −.5     .001    .112   .954   .987      .022    .119   .124   .951   .977
500       −4    −.25    −.002   .104   .955   .574      .013    .114   .117   .952   .529
500       −4    0       −.002   .100   .953   .047      .004    .109   .112   .953   .047
500       −4    .25     −.001   .095   .956   .636      −.003   .103   .108   .959   .640
500       −4    .5      −.000   .094   .950   .999      −.009   .102   .105   .956   .997
1000      −3    −.5     −.003   .082   .953   1.00      .005    .087   .087   .949   1.00
1000      −3    −.25    −.002   .076   .952   .874      .005    .081   .082   .948   .853
1000      −3    0       −.001   .073   .951   .049      .005    .077   .077   .954   .046
1000      −3    .25     −.001   .071   .953   .898      .004    .075   .076   .948   .920
1000      −3    .5      −.001   .070   .953   1.00      .003    .075   .075   .946   1.00
1000      −4    −.5     .000    .079   .952   1.00      .005    .087   .088   .947   1.00
1000      −4    −.25    −.000   .074   .959   .867      .005    .081   .083   .954   .847
1000      −4    0       −.001   .070   .955   .045      .002    .079   .079   .949   .051
1000      −4    .25     −.001   .067   .956   .904      .000    .074   .076   .955   .909
1000      −4    .5      −.001   .066   .956   1.00      −.002   .073   .074   .954   1.00

NOTE: Bias and SE are the bias and standard error of β1. SEE is the mean of the standard error estimator for β1. CP is the coverage probability of the 95% confidence interval for β1. Power pertains to the .05-level test of H0: β1 = 0. Each entry is based on 5,000 replicates.
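The summary measures defined in the table note can be computed from simulation output along the following lines. This Python sketch assumes that each replicate yields a point estimate and an estimated standard error, and that both the confidence interval and the test are of Wald type; the article does not state which test was used in the tables, so that choice, like the function name, is an assumption of the sketch.

```python
import numpy as np


def summarize_replicates(estimates, std_errors, true_value, z=1.959964):
    """Bias, SE, SEE, CP, and Power across simulation replicates (Wald-type, assumed)."""
    est = np.asarray(estimates, dtype=float)
    se_hat = np.asarray(std_errors, dtype=float)
    lower, upper = est - z * se_hat, est + z * se_hat
    return {
        "Bias": est.mean() - true_value,          # mean estimate minus truth
        "SE": est.std(ddof=1),                    # empirical standard error
        "SEE": se_hat.mean(),                     # mean of the standard error estimates
        "CP": np.mean((lower <= true_value) & (true_value <= upper)),  # 95% CI coverage
        "Power": np.mean(np.abs(est / se_hat) > z),  # rejection rate of H0: parameter = 0
    }
```

When the true value is 0, the "Power" entry is the empirical type I error rate, which is why the rows with a zero parameter report values near .05.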

an $R_h^2$ of .4. We generated disease incidence from the logistic regression model

$$
\operatorname{logit} P\{Y = 1 \mid X_1, X_2, (h_k, h_l)\} = \alpha + \beta_h \{I(h_k = h^*) + I(h_l = h^*)\} + \beta_{x_1} X_1 + \beta_{x_2} X_2 + \beta_{h x_2} \{I(h_k = h^*) + I(h_l = h^*)\} X_2 ,
$$

where $h^* = (10100)$, $X_1$ is Bernoulli with .2 success probability, and $X_2$ is uniform(0, 1). We set $\rho = .01$, $\alpha = -3.7$, $\beta_h = 0$, and $\beta_{x_1} = \beta_{x_2} = -\beta_{h x_2} = .5$, yielding an overall disease rate of 7%. We assumed unknown population totals and used the profile-likelihood method for rare diseases described in Section A.4.5. The method performed remarkably well.
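To make the data-generating model concrete, the following Python sketch draws covariates and disease status from the logistic model above. It reads the parameter values as α = −3.7, β_h = 0, and β_x1 = β_x2 = −β_hx2 = .5, draws the number of copies of h* under HWE rather than under model (3) with ρ, and takes the frequency of h* as a user-supplied argument; these simplifications and all names are assumptions of the sketch.

```python
import numpy as np


def disease_probability(n_copies, x1, x2, alpha=-3.7, beta_h=0.0,
                        beta_x1=0.5, beta_x2=0.5, beta_hx2=-0.5):
    """P(Y = 1) under the additive logistic model; n_copies counts copies of h* (0, 1, or 2)."""
    eta = (alpha + beta_h * n_copies + beta_x1 * x1 + beta_x2 * x2
           + beta_hx2 * n_copies * x2)
    return 1.0 / (1.0 + np.exp(-eta))


def simulate_population(n, freq_hstar, seed=0):
    """Draw haplotype copies, covariates, and disease status for n individuals."""
    rng = np.random.default_rng(seed)
    n_copies = rng.binomial(2, freq_hstar, size=n)  # HWE draw (simplification)
    x1 = rng.binomial(1, 0.2, size=n)               # Bernoulli(.2) covariate
    x2 = rng.uniform(0.0, 1.0, size=n)              # uniform(0, 1) covariate
    y = rng.binomial(1, disease_probability(n_copies, x1, x2))
    return n_copies, x1, x2, y
```

A case-control sample would then be obtained by sampling n1 cases and n0 controls from such a population, after which only the unphased genotypes, the covariates, and the disease status would be passed to the estimation procedure.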

4. APPLICATION TO THE FUSION STUDY

Type 2 diabetes mellitus, or non–insulin-dependent diabetes mellitus, is a complex disease characterized by resistance of peripheral tissues to insulin and a deficiency of insulin secretion. Approximately 7% of adults in developed countries suffer from the disease. The FUSION study is a major effort to map and clone genetic variants that predispose to type 2 diabetes (Valle et al. 1998). We consider a subset of data from this study.

Table 4. Simulation Results for the Haplotype–Environment Interactions in Case-Control Studies

                           Known totals                     Unknown totals
n1 = n0   α     β3      Bias    SE     CP     Power     Bias    SE     SEE    CP     Power
500       −3    −.5     −.008   .205   .949   .729      .030    .187   .195   .953   .692
500       −3    −.25    −.002   .186   .949   .271      .016    .169   .176   .961   .244
500       −3    0       −.001   .173   .946   .054      −.006   .155   .162   .963   .037
500       −3    .25     .002    .165   .949   .334      −.038   .144   .151   .958   .287
500       −3    .5      .006    .161   .947   .885      −.088   .138   .143   .915   .831
500       −4    −.5     −.009   .198   .950   .763      .012    .194   .195   .950   .720
500       −4    −.25    −.005   .181   .949   .309      .006    .172   .176   .953   .264
500       −4    0       −.002   .168   .945   .055      −.007   .156   .161   .956   .044
500       −4    .25     −.001   .157   .944   .370      −.022   .146   .149   .948   .333
500       −4    .5      .001    .148   .945   .926      −.047   .136   .141   .945   .904
1000      −3    −.5     −.004   .147   .943   .953      .027    .134   .136   .950   .953
1000      −3    −.25    −.003   .133   .946   .493      .013    .122   .123   .949   .477
1000      −3    0       −.001   .123   .951   .049      −.005   .114   .113   .948   .052
1000      −3    .25     .000    .119   .945   .580      −.034   .107   .106   .934   .535
1000      −3    .5      .002    .117   .947   .994      −.080   .102   .101   .870   .986
1000      −4    −.5     −.005   .140   .945   .965      .010    .137   .136   .949   .965
1000      −4    −.25    −.002   .128   .945   .529      .005    .124   .123   .951   .505
1000      −4    0       −.001   .119   .946   .054      −.004   .113   .113   .947   .053
1000      −4    .25     −.000   .110   .947   .633      −.016   .104   .105   .952   .601
1000      −4    .5      .002    .105   .949   .998      −.037   .099   .099   .937   .995

NOTE: Bias and SE are the bias and standard error of β3. SEE is the mean of the standard error estimator for β3. CP is the coverage probability of the 95% confidence interval for β3. Power pertains to the .05-level test of H0: β3 = 0. Each entry is based on 5,000 replicates.

Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias    SE     SEE    CP     Power
500       βh          0            −.030   .400   .401   .957   .043
500       βx1         .5           .002    .151   .152   .951   .917
500       βx2         .5           .001    .228   .230   .953   .584
500       βhx2        −.5          .015    .641   .644   .956   .118
1000      βh          0            −.017   .275   .277   .953   .047
1000      βx1         .5           .002    .107   .107   .954   .997
1000      βx2         .5           .000    .162   .161   .950   .871
1000      βhx2        −.5          .012    .441   .443   .950   .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If $G_i$ is missing, then the set $S(G_i)$ is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of $R_h^2$ (Stram et al. 2003) for the controls. We estimated ρ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate X by setting X = 1 for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of X given Y and G under model (12) with α = −3.7, β1 = .32, and β2 = .25. Based on 5,000 replicates, the power for testing H0: β3 = 0 is estimated at .053, .479, or .974 under β3 = 0, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation, providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNP-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

                                                                   Codominant model
Haplotype   Recessive model   Dominant model   Additive model   Additive         Recessive
01011        .327 (.270)      −.027 (.140)      .049 (.135)      .005 (.143)      .331 (.289)
01100        .316 (.146)       .274 (.109)      .355 (.099)      .334 (.114)      .063 (.167)
10011       −.206 (.155)      −.323 (.112)     −.320 (.095)     −.344 (.111)      .076 (.183)
10100      −1.019 (1.020)      .196 (.219)      .116 (.213)      .169 (.217)    −1.131 (1.029)
10110        .903 (.746)      −.007 (.248)      .063 (.249)      .016 (.254)      .892 (.765)
11011       −.222 (.328)      −.096 (.140)     −.127 (.133)     −.108 (.140)     −.143 (.344)

NOTE: Standard error estimates are shown in parentheses.


Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
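The permutation route to the joint distribution of correlated window statistics can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Lin (2005): the window statistic is a hypothetical stand-in (the absolute case-control difference in mean allele count within the window) rather than a haplotype-based likelihood test, and the function names `window_stats` and `max_stat_adjusted_pvalues` are introduced here for illustration only. The adjustment compares each observed statistic with the permutation distribution of the maximum statistic over all windows, which automatically reflects the correlation between overlapping windows.

```python
import random


def window_stats(y, snps, width):
    """Stand-in statistic per sliding window (an assumption, not a haplotype test):
    |mean allele count in cases - mean allele count in controls| within the window."""
    cases = [i for i, yi in enumerate(y) if yi == 1]
    controls = [i for i, yi in enumerate(y) if yi == 0]
    n_windows = len(snps[0]) - width + 1
    stats = []
    for start in range(n_windows):
        score = [sum(row[start:start + width]) for row in snps]
        mean_case = sum(score[i] for i in cases) / len(cases)
        mean_ctrl = sum(score[i] for i in controls) / len(controls)
        stats.append(abs(mean_case - mean_ctrl))
    return stats


def max_stat_adjusted_pvalues(y, snps, width, n_perm=200, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum statistic."""
    rng = random.Random(seed)
    observed = window_stats(y, snps, width)
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permute disease labels, keeping genotypes fixed
        max_null = max(window_stats(y_perm, snps, width))
        for w, t in enumerate(observed):
            if max_null >= t:
                exceed[w] += 1
    # Add-one correction keeps every adjusted p-value strictly positive.
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Here `y` is a list of 0/1 disease indicators and `snps` is a list of per-individual allele-count rows; both layouts are assumptions of this sketch.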

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$ yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly, $(1 - \rho)\pi_k^2 + \rho\pi_k = (1 - \tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k$. We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have
$$\pi_k = \frac{-\rho + \{\rho^2 + 4c_k(1 - \rho)\}^{1/2}}{2(1 - \rho)}.$$
Thus $(1 - \rho)^{-1}$ satisfies the equation
$$\sum_k \left[(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}\right] = 2,$$
and $(1 - \tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1 - x) + \{(x - 1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$.

To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1 - \rho) + \rho\} + \mu\pi_k(1 - \pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have
$$\sum_k \frac{\mu\pi_k(1 - \pi_k)}{2\pi_k(1 - \rho) + \rho} = 0.$$
Therefore $\mu = 0$ and $\nu = 0$.
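The first part of the argument is constructive: the homozygote probabilities $c_k$ determine $x = (1 - \rho)^{-1}$ as the root of a monotone equation, and then each $\pi_k$. The sketch below checks this numerically by bisection. The function names and the example frequencies are illustrative assumptions, and the code presumes $0 < \rho < 1$ and at least two haplotypes with positive frequency, so that the equation changes sign on $x > 1$.

```python
import math


def homozygote_probs(pi, rho):
    """c_k = (1 - rho) * pi_k^2 + rho * pi_k, the probability of genotype 2*h_k."""
    return [(1 - rho) * p * p + rho * p for p in pi]


def recover_pi_rho(c, n_iter=200):
    """Recover ({pi_k}, rho) from {c_k} by solving
    sum_k [(1 - x) + sqrt((x - 1)^2 + 4 c_k x)] = 2 for x = 1 / (1 - rho)."""
    def f(x):
        return sum((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x) for ck in c) - 2

    lo, hi = 1.0, 2.0
    while f(hi) > 0:          # f is nonincreasing in x; expand until the sign changes
        hi *= 2
    for _ in range(n_iter):   # plain bisection
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    rho = 1 - 1 / x
    pi = [0.5 * ((1 - x) + math.sqrt((x - 1) ** 2 + 4 * ck * x)) for ck in c]
    return pi, rho
```

For example, `recover_pi_rho(homozygote_probs([0.5, 0.3, 0.2], 0.1))` should return values close to the inputs, in line with the uniqueness claim.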

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G} : g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1 : g \not\ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h, h)\}$ or $\{(h, \tilde h)\}$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y | X = x, H = (h, h))P(H = (h, h))$ or $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y | X = x, H = (h, \tilde h))P(H = (h, \tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y | X = x, H = (h_k, h_l))$ does not depend on $(h_k, h_l) \in S(g)$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y | X = x, H = (h_k, h_l))P(G = g)$, where $(h_k, h_l) \in S(g)$. For $g \in \mathcal{G}_3$,
$$m_g(y, x; \theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g),$$
where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k, h_l) : h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) when $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).

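A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i | X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$ $(i = 1, \ldots, n)$ is
$$\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta)\left\{\log P_{\alpha,\beta,\xi}\big(Y_i | X_i, (h_k, h_l)\big) + \log P_\gamma(h_k, h_l)\right\},$$
where
$$p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i | X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_{k'},h_{l'})\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i | X_i, (h_{k'}, h_{l'}))P_\gamma(h_{k'}, h_{l'})}.$$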
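The weights $p_{ikl}(\theta)$ are posterior probabilities of the haplotype pairs compatible with an individual's genotype. A minimal sketch of this E-step quantity for one individual follows, assuming a logistic model with additive coding of a single target haplotype and no covariates. The function name, the string encoding of haplotypes, and the numerical values in the usage note are hypothetical choices made for illustration.

```python
import math


def posterior_pair_probs(y, pairs, prior, alpha, beta, target):
    """p_ikl for one individual: P(Y | pair) * P(pair), normalized over S(G_i).

    pairs  : list of compatible haplotype pairs (h_k, h_l), i.e., S(G_i)
    prior  : dict mapping each pair to P_gamma(h_k, h_l)
    alpha, beta : intercept and additive effect of the target haplotype (assumed model)
    """
    weights = []
    for hk, hl in pairs:
        copies = (hk == target) + (hl == target)   # additive coding: 0, 1, or 2
        p1 = 1.0 / (1.0 + math.exp(-(alpha + beta * copies)))
        lik = p1 if y == 1 else 1.0 - p1
        weights.append(lik * prior[(hk, hl)])
    total = sum(weights)
    return [w / total for w in weights]
```

For a double heterozygote at two SNPs, the compatible unordered pairs are ("00", "11") and ("01", "10"). With prior weights 0.3 and 0.1 and `beta = 0`, the posterior is just the normalized prior (0.75, 0.25); for a case with `beta > 0` and target "11", the first pair receives more weight.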

Thus, in the $(m + 1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:
$$\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}\big(Y_i | X_i, (h_k, h_l)\big) = 0,$$
$$\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_\gamma \log P_\gamma(h_k, h_l) = 0. \tag{A.1}$$

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k, h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k, h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1 - B)Q_2$. The complete-data likelihood can be represented by
$$\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i | X_i, H_i)\prod_k \pi_k^{I(Q_{1i} = (h_k,h_k))B_i} \times \prod_{k,l}(\pi_k\pi_l)^{I(Q_{2i} = (h_k,h_l))(1-B_i)}\rho^{B_i}(1 - \rho)^{1-B_i}.$$
The corresponding score equations for $\pi_k$ and $\rho$ satisfy
$$\pi_k = c^{-1}\left[\sum_{i=1}^n B_i I\big(Q_{1i} = (h_k, h_k)\big) + \sum_{i=1}^n\sum_{l=1}^K (1 - B_i)\left\{I\big(Q_{2i} = (h_k, h_l)\big) + I\big(Q_{2i} = (h_l, h_k)\big)\right\}\right]$$
and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define
$$E\{\omega(B_i, Q_{1i}, Q_{2i}) | Y_i, X_i, G_i\} = \sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}\big(Y_i | X_i, bq_1 + (1 - b)q_2\big)\,p(b, q_1, q_2)\,\omega(b, q_1, q_2) \times \left[\sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}\big(Y_i | X_i, bq_1 + (1 - b)q_2\big)\,p(b, q_1, q_2)\right]^{-1},$$
where $\omega(B, Q_1, Q_2) = BI(Q_1 = (h_k, h_k))$, $(1 - B)I(Q_2 = (h_k, h_l))$, or $B$, and
$$p(b, q_1, q_2) = \prod_k \pi_k^{bI(q_1 = (h_k,h_k))} \times \prod_{k,l}(\pi_k\pi_l)^{(1-b)I(q_2 = (h_k,h_l))}\rho^b(1 - \rho)^{1-b}.$$
In the $(m + 1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:
$$\pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\left[\sum_{i=1}^n E^{(m)}\left\{B_i I\big(Q_{1i} = (h_k, h_k)\big)\right\} + 2\sum_{i=1}^n\sum_{l=1}^K E^{(m)}\left\{(1 - B_i)I\big(Q_{2i} = (h_k, h_l)\big)\right\}\right]$$
and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i, Q_{1i}, Q_{2i})\}$ is $E\{\omega(B_i, Q_{1i}, Q_{2i}) | Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.
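To make the closed-form update concrete, the sketch below implements a deliberately simplified special case: Hardy–Weinberg equilibrium ($\rho = 0$, so $B \equiv 0$) and no phenotype term, which reduces the update to the classical EM for haplotype frequencies from unphased genotypes. It is an illustration under these assumptions rather than the full algorithm described above; the function names and the coding of genotypes as per-SNP allele counts in {0, 1, 2} are choices made for this example.

```python
from itertools import product


def compatible_pairs(g):
    """Ordered haplotype pairs (h_k, h_l) with h_k + h_l = g, i.e., the set S(g)."""
    haps = list(product((0, 1), repeat=len(g)))
    return [(a, b) for a in haps for b in haps
            if all(x + y == z for x, y, z in zip(a, b, g))]


def em_haplotype_freqs(genotypes, n_iter=100):
    """EM for {pi_k} under Hardy-Weinberg equilibrium (rho = 0, no phenotype model)."""
    haps = list(product((0, 1), repeat=len(genotypes[0])))
    pi = {h: 1.0 / len(haps) for h in haps}          # uniform starting values
    pair_sets = [compatible_pairs(g) for g in genotypes]
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for pairs in pair_sets:
            # E-step: posterior weight of each compatible pair is prop. to pi_k * pi_l
            w = [pi[a] * pi[b] for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by the 2n chromosomes
        pi = {h: c / (2.0 * len(genotypes)) for h, c in counts.items()}
    return pi
```

With only phase-known genotypes, for example three copies of (2, 2), five of (0, 0), and two of (1, 0), the estimates coincide with simple haplotype counting; double heterozygotes such as (1, 1) are where the E-step weights matter.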

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is
$$\prod_{i=1}^N P_{\alpha,\beta,\xi}(Y_i | X_i, H_i)P_\gamma(H_i)\prod_g f_g(X_i)^{I(G_i = g)}.$$
The M-step solves the following equations for $\theta$:
$$\sum_{i=1}^N I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}(Y_i | X_i, H_i) | Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi}\log P_{\alpha,\beta,\xi}(Y_i | X_i, H_i) | Y_i\} = 0,$$
$$\sum_{i=1}^N I(R_i = 1)E\{\nabla_\gamma\log P_\gamma(H_i) | Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_\gamma\log P_\gamma(H_i) | Y_i\} = 0, \tag{A.2}$$
and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:
$$F_g\{X_i\} = \left[\sum_{j=1}^N I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(X_j = X_i, G_j = g) | Y_j\}\right] \times \left[\sum_{j=1}^N I(G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(G_j = g) | Y_j\}\right]^{-1},$$
where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form
$$\frac{\sum_{(h_k,h_l)\in S(G_i)}\omega(Y_i, X_i, (h_k, h_l))P_{\alpha,\beta,\xi}(Y_i | X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_k,h_l)\in S(G_i)}P_{\alpha,\beta,\xi}(Y_i | X_i, (h_k, h_l))P_\gamma(h_k, h_l)}$$
for $R_i = 1$, and
$$\sum_{g\in\mathcal{G}}\ \sum_{x\in\{X_j : G_j = g, R_j = 1\}}\ \sum_{(h_k,h_l)\in S(g)}\omega(Y_i, x, (h_k, h_l))P_{\alpha,\beta,\xi}(Y_i | x, (h_k, h_l))P_\gamma(h_k, h_l)F_g\{x\} \times \left(\sum_{g\in\mathcal{G}}\ \sum_{x\in\{X_j : G_j = g, R_j = 1\}}\ \sum_{(h_k,h_l)\in S(g)}P_{\alpha,\beta,\xi}(Y_i | x, (h_k, h_l))P_\gamma(h_k, h_l)F_g\{x\}\right)^{-1}$$
for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components $F_g(\cdot)$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


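A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$ yield the same likelihood:
$$RL(\theta, \{F_g^\dagger\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A.3}$$
Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x)q_g/\sum_{g\in\mathcal{G}} q_g = \tilde f_g^\dagger(x)\tilde q_g/\sum_{g\in\mathcal{G}}\tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that
$$\eta(y, x, g; \theta) = C(y)\,\eta(y, x, g; \tilde\theta), \tag{A.4}$$
where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}) : \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that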

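$$\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5}$$
for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 | X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes
$$\frac{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + 1} = \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + 1}, \tag{A.6}$$
where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results:

1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields
$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}}.$$
Thus (A.6) is equivalent to
$$\frac{\pi_1(g)/\pi_2(g)}{\pi_1(g')/\pi_2(g')} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(g')/\tilde\pi_2(g')} \quad \text{for all } g, g' \in \mathcal{G}_3.$$

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z}z \ne 0$, (A.6) yields
$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\beta_{3z}z}}{1 + e^{\alpha+\beta_1+\beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\beta_{3z}z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z}z}}.$$
The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to
$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \tag{A.7}$$
for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.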

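A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that, under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\right\}, \tag{A.8}$$
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\right\}, \tag{A.9}$$
and
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\right\}, \tag{A.10}$$
where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\right\}.$$
By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}.$$
By taking the ratio of the two sides and letting $z \to \operatorname{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,
$$\eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\left\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x)\right\} \times \left[\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}\frac{\pi_1(g)}{\pi_2(g)} + 1 - \Phi(\alpha + \beta_3^T x)\right]^{-1}. \tag{A.11}$$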


It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g; \theta)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,
$$\frac{e^{-e^{\alpha}}\big(e^{e^{\alpha+\beta_z z}} - 1\big)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}\big(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1\big)}{1 - e^{-e^{\tilde\alpha}}}.$$
Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as
$$l_n(\theta, \{\delta_j\}) = \sum_{j=1}^J n_{1j}\log\eta(1, x_j, g_j; \theta) - n_1\log\left\{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j\right\} + \sum_{j=1}^J n_{+j}\log\delta_j,$$
where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain
$$\frac{n_{+j}}{\delta_j} - \frac{n_1\eta(1, x_j, g_j; \theta)}{\sum_{j'=1}^J \eta(1, x_{j'}, g_{j'}; \theta)\delta_{j'}} + \lambda = 0.$$
Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus
$$\delta_j = \frac{n_{+j}}{n - n_1 + n_1\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12}$$
where $\mu = \sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to
$$l_n^*(\theta, \mu) = \sum_{j=1}^J n_{1j}\log\eta(1, x_j, g_j; \theta) - \sum_{j=1}^J n_{+j}\log\left\{\frac{n_1}{n}\eta(1, x_j, g_j; \theta) + \left(1 - \frac{n_1}{n}\right)\mu\right\} + (n - n_1)\log\mu.$$
Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_\mu l_n^*(\theta, \mu) + C_n$. If $\hat\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \mu)/\partial\mu = 0$ at $\mu = \hat\mu$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
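For a fixed $\theta$, the stationarity condition in $\mu$ is equivalent to requiring that the jump sizes in (A.12) sum to 1, which gives a one-dimensional root-finding problem. The sketch below solves it by bisection for given values of $\eta(1, x_j, g_j; \theta)$. It is an illustration only: the function name `solve_mu` and the inputs in the usage note are hypothetical, the $\eta$ values are assumed positive, and the full procedure described above would update $\theta$ jointly by Newton–Raphson rather than hold it fixed.

```python
def solve_mu(eta, n1j, n0j, n_iter=200):
    """Given eta_j = eta(1, x_j, g_j; theta) at a fixed theta, find mu such that
    delta_j = n_{+j} / (n - n1 + n1 * eta_j / mu) sums to 1 over j."""
    n_plus = [a + b for a, b in zip(n1j, n0j)]
    n1, n = sum(n1j), sum(n_plus)

    def total_mass(mu):
        return sum(npj / (n - n1 + n1 * e / mu) for npj, e in zip(n_plus, eta))

    # total_mass is increasing in mu, is <= 1 at min(eta), and is >= 1 at max(eta).
    lo, hi = min(eta), max(eta)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if total_mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    delta = [npj / (n - n1 + n1 * e / mu) for npj, e in zip(n_plus, eta)]
    return mu, delta
```

As a check on the algebra, the returned jump sizes should also satisfy $\mu = \sum_j \eta_j\delta_j$; for instance, with `eta = [1.0, 2.0, 0.5]`, `n1j = [5, 10, 2]`, and `n0j = [10, 6, 7]`, both identities hold up to rounding.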

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define
$$\zeta_1(x, g; \theta) = \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(x,h_k,h_l)}\{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\},$$
$$\zeta_0(g; \theta) = \sum_{(h_k,h_l)\in S(g)} \{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\}.$$
By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:
$$l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^n \{y_i\log\zeta_1(X_i, G_i; \theta) + (1 - y_i)\log\zeta_0(G_i; \theta)\} - \sum_{i=1}^n\sum_g I(G_i = g)\log\left\{\zeta_1(X_i, G_i; \theta) + n_1^{-1}n_g\sum_{g'}\mu_{g'} - \mu_g\right\} + \sum_{i=1}^n (1 - y_i)\log\sum_g \mu_g,$$
where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:
$$l_n^*(\theta, \mu) = \sum_{i=1}^n y_i\log\zeta_1(X_i, G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log\zeta_0(G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log\mu - \sum_{i=1}^n \log\left\{(1 - r)\mu + r\sum_g \zeta_1(X_i, g; \theta)\right\},$$
where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k) : k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l) : k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by
$$P(B, Q_1, Q_2, Y | X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B,Q_1,Q_2,Y}\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},$$
where
$$\vartheta = \big(-\log\mu + \log\{r/(1 - r)\},\ \beta^T,\ \log\pi_1 - \log\{\rho/(1 - \rho)\},\ \ldots,\ \log\pi_K - \log\{\rho/(1 - \rho)\}\big)^T$$
and
$$W(B, Q_1, Q_2, Y, X) = \Big(Y,\ YZ^T(X, H),\ BI(Q_1 = (h_1, h_1)) + (1 - B)\sum_l\{I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1))\},\ \ldots,\ BI(Q_1 = (h_K, h_K)) + (1 - B)\sum_l\{I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K))\}\Big)^T.$$
We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood
$$l_n^*(\vartheta) = \sum_{i=1}^n \log\left[\frac{\sum_{BQ_1+(1-B)Q_2\in S(G_i)} e^{\vartheta^T W(B,Q_1,Q_2,Y_i,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\right].$$
We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

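The complete-data score function is
$$\sum_{i=1}^n\left[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right].$$
Thus, in the E-step, we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:
$$E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) | Y_i, X_i, G_i] = \frac{\sum_{b,q_1,q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}\,W(b, q_1, q_2, Y_i, X_i)}{\sum_{b,q_1,q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}}.$$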


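In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:
$$\vartheta^{(k+1)} = \vartheta^{(k)} - \Sigma^{-1} \times \sum_{i=1}^n\left[E[W(B, Q_1, Q_2, Y_i, X_i) | Y_i, X_i, G_i] - \frac{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right],$$
where
$$\Sigma = -\sum_{i=1}^n\left[\frac{\sum_{b,q_1,q_2,y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\right] + \sum_{i=1}^n\left[\frac{\{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^{\otimes 2}}{\{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^2}\right]$$
and $a^{\otimes 2} = aa^T$.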

A46 Proof of Theorem 2 Write Fxg(xg) = Fdaggerg(x)qg and

Fxg(xg) = Fdaggerg(x)qg Because θ is bounded and Fxg is a probabil-

ity distribution we can choose a subsequence such that θ rarr θlowast andFxg(xg) rarr Flowast

xg(xg) equiv Flowastg(x)qlowast

g where qlowastg gt 0 for any g

Because Fdaggerg maximizes the likelihood there exists some Lagrange

multiplier λg such that

I(Gi = g)

FdaggergXi

minus n1η(1Xig θ )qgint

xg η(1x g θ)dFxg(x g)minus nλg = 0

where FdaggergXi denotes the point mass of Fdagger

g at Xi and the integralis interpreted as integration over x and summation over g Becausesumn

i=1 FdaggergXi = 1 λg satisfies the equation

nminus1nsum

i=1

I(Gi = g)

λg + n1η(1Xig θ )qgn intxg η(1x g θ)dFxg(x g)minus1

= 1 (A13)

and

min1leilen

λg + n1η(1Xig θ )qg

nint

xg η(1x g θ)dFxg(x g)

gt 0

Clearly λg must be bounded asymptotically Thus by choosing a sub-sequence we assume that λg rarr λlowast

g By (A13) and the Lipschitz continuity of η(1xg θlowast) in the con-

tinuous components of x we can show that there exists a positive con-stant δ such that

mingx

∣∣∣∣λ

lowastg + η(1xg θlowast)qlowast

gintxg η(1x g θlowast)dFlowast

xg(x g)

∣∣∣∣

gt δ

Consequently when n is sufficiently large

Fdaggerg(x) = nminus1

nsum

i=1

I(Gi = gXi le x)

times(

max

[∣∣∣∣λg + η(1Xig θ )qgn1

times

nint

xgη(1x g θ)dFxg(x g)

minus1∣∣∣∣ δ

])minus1

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proofof theorem 12 of Murphy and van der Vaart (2001) We first obtaina score function by differentiating ln(θ Fdagger

g qg) with respect to θ

along the direction v and with respect to Fxg along the path Fε =Fxg + ε

intψ(xg)dFxg where v has a unit norm and ψ(middotg) is any

function whose total variation is bounded by 1 The linearization of thescore function around the true parameter value yields

n12

(vT11 + 21[ψ]T )(θ minus θ0)

+int

(vT12 + 22[ψ])d(Fxg minus Fxg)

= nminus12nsum

i=1

yi

vT lθ (1XiGi θ0Fxg)

+ lF(1XiGi θ0Fxg)

[int

ψ dFxg

]

+ nminus12nsum

i=1

(1 minus yi)

vT lθ (0XiGi θ0Fxg)

+ lF(0XiGi θ0Fxg)

[int

ψ dFxg

]

+ op(1)

where 11 is a constant matrix 12 is a vector function of x 21[ψ]and 22[ψ] are linear operators of ψ and lθ and lF are the scoreswith respect to θ and Fxg The right side of the foregoing equa-tion converges weakly to a Gaussian process which depends on( y1 y2 ) only through We can show that the operator B[vψ] equivvT11 +21[ψ]T vT12 +22[ψ]T is invertible along the lines ofMurphy and van der Vaart (2001) It then follows from theorem 331of van der Vaart and Wellner (1996) that n12(θ minus θ0 Fxg minus Fxg)

converges weakly to a Gaussian processBecause the asymptotic distribution depends on ( y1 y2 ) only

via we assume that ( y1 y2 ) are independent realizations froma Bernoulli distribution with mean By choosing some ψ such thatB[vψ] = (vT 0)T for all v we see that θ is an asymptotically linearestimator for θ0 with the influence function in the score space It fol-lows from proposition 331 of Bickel Klaassen Ritov and Wellner(1993) that the limiting covariance matrix of n12(θ minus θ0) attains thesemiparametric efficiency bound


A47 Proof of Theorem 3 We call the probability distribu-tion induced by (9) the pseudoprobability law denoted by Pn Letf ( yxg θ Fgan) be the density function under the true probabil-ity law Pn Because an = o(nminus12)

dPn

dPn= exp

an

nsum

i=1

part log f ( yiXiGi θ Fga)

parta

∣∣∣∣a=0

+o(1)

rarrPn 1

Thus any weak convergence under Pn also holds for Pn In additionby the arguments in the proof of Theorem 2 we can easily verify theresults of Theorem 3 when the data are generated from Pn Thus The-orem 3 holds when the data are generated from Pn

A5 Cohort Studies

A51 Identifiability We show that if two sets of parameters (θ )

and (θ ) yield the same joint distribution then θ = θ and = First it follows from Lemma 1 that γ = γ Suppose that

sum

HisinS(G)

˜(Y)eβTZ(XH)Q((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

=sum

HisinS(G)

(Y)eβTZ(XH)Q

((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

By choosing = 1 and integrating Y from 0 to τ on both sides weobtain

sum

HisinS(G)

Q((τ )eβTZ(XH)

)Pγ (H)

=sum

HisinS(G)

Q((τ)eβTZ(XH)

)Pγ (H)

Because Q(middot) is strictly increasing the foregoing equation implies that

(Y)eβTZ(XH) = (Y)eβTZ(XH) for H = (hh) and H = (h h) Itthen follows from Condition 8 that β = β and =

A52 Proof of Theorem 4 Our problem is the same as that ofZeng et al (2005) except replacing the integration over random ef-fects in that article by the sum over H isin S(G) The asymptotic prop-erties stated in the theorem follow from the identifiability shown inSection A51 and the proofs of Zeng et al (2005) provided that wecan verify the following result If there exist a vector micro = (microT

β microTγ )T

and a function ψ(t) such that

microT lθ (θ00) + l(θ00)

[int

ψ d0

]

= 0 (A14)

where lθ is the score function for θ and l[int ψ d0] is the score func-tion for along the submodel 0 +ε

intψ d0 then micro = 0 and ψ = 0

To prove the desired result we write out (A14) We then let = 1and integrate Y from 0 to τ to obtain

sum

HisinS(G)

Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times Q(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

Q(0(τ )eβT0Z(XH))

+ Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A15)

In contrast by letting = 0 and Y = τ in (A14) we havesum

HisinS(G)

1 minus Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times

minusQ(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

1 minus Q(0(τ )eβT0Z(XH))

minus Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

1 minus Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A16)

The summation of (A15) and (A16) entails microTγ nablaγ log Pγ (H) = 0

From the proof of Lemma 1 microγ = 0 We choose G = 2h or h + h and

let = 1 and Y = 0 in (A14) to obtain microTβZ(XH) + ψ(0) = 0 for

H = (hh) and (h h) Thus microβ = 0 and ψ(0) = 0 under Condition 8Finally (A14) with = 1 implies that

ψ(Y) + Q(0(Y)eβT0Z(XH))

int Y0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(Y)eβT0Z(XH))

= 0

for H = (hh) Therefore ψ = 0

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), “Prediction and Entropy,” in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), “Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?” European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares the risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and eventually to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals who may be practically considered as unrelated will share an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be suspected of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and to understanding their relevance in determining disease risks.
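The window scan described in this paragraph can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `sliding_windows`, the use of marker positions in centimorgans, the step size, and the rule of skipping windows with fewer than two markers are all hypothetical choices, and the association test applied within each window is left to the caller.

```python
from bisect import bisect_left, bisect_right


def sliding_windows(positions_cm, width_cm, step_cm):
    """Yield (window_start, marker_indices) for windows of fixed genetic length.

    positions_cm: marker positions in centimorgans, sorted ascending (assumed).
    width_cm:     window width; step_cm: distance between window starts.
    Windows holding fewer than two markers are skipped, because a single
    marker does not define a multi-locus haplotype.
    """
    if not positions_cm:
        return
    start = positions_cm[0]
    last = positions_cm[-1]
    while start <= last:
        lo = bisect_left(positions_cm, start)              # first marker in window
        hi = bisect_right(positions_cm, start + width_cm)  # one past last marker
        if hi - lo >= 2:
            yield start, list(range(lo, hi))
        start += step_cm


if __name__ == "__main__":
    # Hypothetical marker map; each window would be passed to an association test.
    for start, markers in sliding_windows([0.0, 0.4, 1.0, 1.3, 2.5], 1.0, 1.0):
        print(start, markers)
```

Overlapping windows (a step smaller than the width) give smoother coverage at the cost of more, and more strongly correlated, tests.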

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases, where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999), Service, Temple Lang, Freimer, and Sandkuijl (1999), Lam, Roeder, and Devlin (2000), Morris, Whittaker, and Balding (2000), Liu, Sabatti, Teng, Keats, and Risch (2001), and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.
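To make the multiple-comparisons issue concrete, here is a generic max-statistic permutation adjustment written as a sketch. It is an illustration only: it is neither the Monte Carlo approach of Lin (2005) nor the false discovery rate approach of Sabatti, Service, and Freimer (2003), and it assumes that a 0/1 carrier indicator is directly observed for each window, which sidesteps precisely the phase ambiguity that makes permutation awkward for the likelihood-based methodology. All names, the difference-in-proportions statistic, and the default settings are hypothetical.

```python
import random


def max_stat_permutation_pvalue(labels, features, n_perm=999, seed=0):
    """Family-wise adjusted p-value for the largest per-window statistic.

    labels:   list of 0/1 disease indicators (1 = case), one per subject.
    features: list of windows; each window is a list of 0/1 carrier
              indicators aligned with `labels` (assumed observed).
    The statistic is the absolute difference in carrier proportion between
    cases and controls; the null distribution of its maximum over windows
    is approximated by shuffling the case-control labels.
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")

    def max_stat(lab):
        best = 0.0
        for feat in features:
            carriers_in_cases = sum(f for l, f in zip(lab, feat) if l == 1)
            carriers_in_controls = sum(feat) - carriers_in_cases
            diff = abs(carriers_in_cases / n_cases
                       - carriers_in_controls / n_controls)
            best = max(best, diff)
        return best

    rng = random.Random(seed)
    observed = max_stat(labels)
    permuted = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # break any label-haplotype association
        if max_stat(permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because every permutation recomputes all window statistics, the cost grows with the number of windows times the number of permutations, which is the computational concern raised above for genome-wide scans.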

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which of the observed alleles constituting a genotype (phase) can be inferred using family information, when available, or by assuming linkage disequilibrium (LD) and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting that phase be imputed and haplotype effects estimated simultaneously. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice, or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
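For readers unfamiliar with EM-based phasing, the following sketch estimates haplotype frequencies from unphased biallelic genotypes under Hardy–Weinberg equilibrium, in the spirit of Excoffier and Slatkin (1995). It is an illustration under stated assumptions, not the algorithm of Lin and Zeng, which also updates regression parameters. The genotype coding (a tuple giving the count of the "1" allele at each locus), the function names, and the convergence settings are hypothetical. The enumeration over all 2^L haplotypes also makes visible why EM becomes impractical with many markers.

```python
from itertools import combinations_with_replacement, product


def compatible_pairs(genotype):
    """Unordered haplotype pairs whose allele counts add up to `genotype`."""
    haps = list(product((0, 1), repeat=len(genotype)))
    return [(h1, h2)
            for h1, h2 in combinations_with_replacement(haps, 2)
            if all(a + b == g for a, b, g in zip(h1, h2, genotype))]


def em_haplotype_freqs(genotypes, n_iter=500, tol=1e-12):
    """EM estimate of haplotype frequencies, assuming Hardy-Weinberg equilibrium.

    genotypes: list of tuples; entry j is the number (0, 1, or 2) of copies
               of allele "1" at locus j for that subject.
    Returns a dict mapping each haplotype (tuple of 0/1) to its frequency.
    """
    n_loci = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_loci))
    freqs = {h: 1.0 / len(haps) for h in haps}      # uniform starting value
    pairs_for = {g: compatible_pairs(g) for g in set(genotypes)}
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for g in genotypes:
            pairs = pairs_for[g]
            # E-step: posterior weight of each phase resolution under HWE;
            # heterozygous pairs can arise in two orders, hence the factor 2.
            weights = [freqs[h1] * freqs[h2] * (1 if h1 == h2 else 2)
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        new_freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haps)
        freqs = new_freqs
        if change < tol:
            break
    return freqs
```

With two loci, only the double heterozygote `(1, 1)` is phase-ambiguous; its two resolutions are weighted by the current frequency estimates, so treating the most likely resolution as observed would discard exactly the uncertainty discussed in this paragraph.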

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, and its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping every variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and that situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.
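The contrast drawn in this paragraph can be written out with Bayes' rule. The display below is only a sketch in assumed notation that does not appear in the original text: $d_i$ is the disease status of subject $i$, $x_i$ a directly observed exposure with unrestricted population density $f$, $g_i$ an unphased genotype, $S(g)$ the set of haplotype pairs compatible with $g$, and $P_\beta$ and $P_\gamma$ the risk model and the haplotype distribution.

```latex
% Directly observed exposure: retrospective likelihood with f unrestricted.
\prod_{i=1}^{n} P(x_i \mid d_i)
  = \prod_{i=1}^{n}
    \frac{P_\beta(d_i \mid x_i)\, f(x_i)}
         {\int P_\beta(d_i \mid x)\, f(x)\, dx}

% Haplotype exposure observed only through the unphased genotype g_i:
\prod_{i=1}^{n} P(g_i \mid d_i)
  = \prod_{i=1}^{n}
    \frac{\sum_{H \in S(g_i)} P_\beta(d_i \mid H)\, P_\gamma(H)}
         {\sum_{H} P_\beta(d_i \mid H)\, P_\gamma(H)}
```

In the first display $f$ can be left saturated, which is the setting of Prentice and Pyke (1979). In the second, $P_\gamma$ must be restricted (for example, by Hardy–Weinberg equilibrium) to be identifiable from genotypes, and such a restriction is what allows the retrospective analysis to gain efficiency.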

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariates and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If, for each value of covariates X, we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model 3 applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x; \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that X and H are independent at the population level, not just conditional on G.
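The statement that conditional independence given genotypes includes population-level independence as a special case can be checked in a few lines. The derivation below is a sketch in assumed notation: $S(g)$ is the set of haplotype pairs compatible with genotype $g$, $\mathbf{1}\{\cdot\}$ is an indicator, and the only fact used is that the genotype $G$ is a deterministic function of the haplotype pair $H$.

```latex
% Assume H and X are independent in the population: P(H = h | X) = P(H = h).
% Because G is determined by H,
P(H = h,\, G = g \mid X) = P(H = h)\,\mathbf{1}\{h \in S(g)\},
\qquad
P(G = g \mid X) = \sum_{h' \in S(g)} P(H = h') = P(G = g).

% Dividing the first expression by the second,
P(H = h \mid G = g,\, X)
  = \frac{P(H = h)\,\mathbf{1}\{h \in S(g)\}}{\sum_{h' \in S(g)} P(H = h')}
  = P(H = h \mid G = g).
```

The converse fails in general: the last identity can hold while $P(G = g \mid X)$ still varies with $X$, which is the additional flexibility, and the additional complexity, that the comment refers to.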

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model 3 is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently, we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition, and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X; \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d; \theta), \tag{2}$$

where the model $q(h^d; \theta)$, in turn, could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x; \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = q(h, x; \theta)\,\frac{\exp[d\{\kappa + m(h, x; \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x; \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h} \sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
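The definitions of $S$ and $L^*$ can be made concrete with a small numerical sketch. The Python below is illustrative only and is not taken from Spinka et al. (2005). It assumes two biallelic SNPs, HWE for $q(h; \theta)$ (so that $q$ does not depend on $x$), and a hypothetical additive risk function $m$ that counts copies of a target haplotype; every function name and parameter value is an assumption made for this example.

```python
import itertools
import math

# Toy setting (assumed for illustration): two biallelic SNPs, four haplotypes.
HAPLOTYPES = ("00", "01", "10", "11")
DIPLOTYPES = list(itertools.combinations_with_replacement(HAPLOTYPES, 2))


def q_hwe(dip, theta):
    """Diplotype probability q(h; theta) under HWE for an unordered pair."""
    a, b = dip
    prob = theta[a] * theta[b]
    return prob if a == b else 2.0 * prob


def genotype(dip):
    """Unphased genotype: per-SNP count of '1' alleles."""
    a, b = dip
    return tuple(int(x) + int(y) for x, y in zip(a, b))


def m(dip, x, beta1, target="11"):
    """Hypothetical risk function: additive in copies of `target`, linear in x."""
    return beta1[0] * dip.count(target) + beta1[1] * x


def S(d, dip, x, theta, beta0, kappa, beta1):
    """S(d, h, x) = q(h) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    lin = m(dip, x, beta1)
    return q_hwe(dip, theta) * math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def L_star(d, g, x, theta, beta0, kappa, beta1):
    """L*(D, G, X): S summed over diplotypes consistent with G, then normalized."""
    num = sum(S(d, h, x, theta, beta0, kappa, beta1)
              for h in DIPLOTYPES if genotype(h) == g)
    den = sum(S(dd, h, x, theta, beta0, kappa, beta1)
              for h in DIPLOTYPES for dd in (0, 1))
    return num / den


# Illustrative values; pi is treated as known and is not forced to agree
# with the prevalence implied by beta0, beta1, and theta.
theta = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}
beta0, beta1 = -3.0, (0.5, 0.2)
n1, n0, pi = 500, 500, 0.05
kappa = beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))

# The double heterozygote (1, 1) is phase-ambiguous: two diplotypes match it.
ambiguous = [h for h in DIPLOTYPES if genotype(h) == (1, 1)]
print(ambiguous, L_star(1, (1, 1), 0.0, theta, beta0, kappa, beta1))
```

For a fixed covariate value, summing `L_star` over every genotype and both disease states returns 1, which is what one would expect if $L^*$ behaves as a joint probability of $(D, G)$ given $X$ in the sample.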

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$, with $N_d$ the number of sampled subjects with $D = d$ (i.e., $n_1$ and $n_0$ above). Let $R = 1$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

$$l^* = \sum_{i=1}^{N} \log \Bigl\{ \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1)\, \mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) \Bigr\} = \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1). \tag{5}$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with the distribution $F(x)$ of $X$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than model (2), which assumes that $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equation for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data is given by

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \times \Bigl\{ \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \Bigr\}^{-1}. \tag{6}$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \times \Bigl\{ \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \Bigr\}^{-1}, \tag{7}$$

where

$$r(h^d, X) = \frac{1 + \exp\{\kappa + m(h^d, X; \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, X; \beta_1)\}}.$$

Spinka et al. showed that with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
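One way to read the correction factor $r$ is through a short piece of algebra. If $\mathrm{pr}_{\beta^*}$ is taken to be the logistic model with intercept $\kappa$ (our reading of $\beta^* = (\kappa, \beta_1)$), then $\mathrm{pr}_{\beta^*}(D \mid h^d, X)\, r(h^d, X)\, q(h^d; \theta)$ reduces to $S(D, h^d, X)$ from Section 1.1, so the diplotype weights in (7) coincide with those implied by $L^*$. The Python sketch below checks this identity numerically and compares the weights in (6) and (7); the function names and numerical values are hypothetical and serve only as an illustration.

```python
import math


def pr_beta_star(d, lin, kappa):
    """Logistic probability of D = d with intercept kappa; lin = m(h, x; beta1)."""
    eta = kappa + lin
    return math.exp(d * eta) / (1.0 + math.exp(eta))


def r(lin, beta0, kappa):
    """r(h, X) = [1 + exp{kappa + m}] / [1 + exp{beta0 + m}]."""
    return (1.0 + math.exp(kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def S(d, q, lin, beta0, kappa):
    """S(d, h, x) = q exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    return q * math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def weights(d, candidates, beta0, kappa, modified):
    """Normalized diplotype weights in (6) (modified=False) or (7) (modified=True)."""
    raw = []
    for q, lin in candidates:
        w = pr_beta_star(d, lin, kappa) * q
        if modified:
            w *= r(lin, beta0, kappa)
        raw.append(w)
    total = sum(raw)
    return [w / total for w in raw]


# Two diplotypes consistent with one ambiguous genotype, given as
# (q(h; theta), m(h, x; beta1)); the numbers are assumed for illustration.
candidates = [(0.08, 0.5), (0.12, 0.0)]
beta0, kappa = -3.0, -0.06
for d in (0, 1):
    print(d,
          weights(d, candidates, beta0, kappa, modified=False),
          weights(d, candidates, beta0, kappa, modified=True))
```

When $\kappa = \beta_0$ the factor $r$ equals 1 and the two sets of weights agree; the further $\kappa$ is from $\beta_0$, the more the weights in (6) and (7) can differ.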

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X; \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X; \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\mathrm{pr}(H^d) = q(H^d; \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

$$\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}$$

where

$$R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\} = E\{R(H^d, X; \beta_1) \mid G, X, T \ge t\} = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\, \mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^d, X]\, q(H^d; \theta)}.$$

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t; \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function,

$$R^*(G, X; \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X; \beta_1)\, q(H^d; \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d; \theta)},$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X; \beta_1, \hat{\theta})$, where $\hat{\theta}$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
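Under the rare disease approximation, the induced relative risk is simply a $q$-weighted average of $R$ over the diplotypes consistent with the observed genotype. The following Python sketch illustrates that calculation under assumptions of our own choosing: two biallelic SNPs, HWE, no covariates, and a hypothetical log-additive $R$ in the number of copies of one target haplotype. It is not the implementation of Chen and Chatterjee (2005), and all names are illustrative.

```python
import itertools
import math

# Toy setting (assumed for illustration): two biallelic SNPs, four haplotypes.
HAPLOTYPES = ("00", "01", "10", "11")
DIPLOTYPES = list(itertools.combinations_with_replacement(HAPLOTYPES, 2))


def q_hwe(dip, theta):
    """Diplotype probability q(H; theta) under HWE for an unordered pair."""
    a, b = dip
    prob = theta[a] * theta[b]
    return prob if a == b else 2.0 * prob


def genotype(dip):
    """Unphased genotype: per-SNP count of '1' alleles."""
    a, b = dip
    return tuple(int(x) + int(y) for x, y in zip(a, b))


def rel_risk(dip, beta1, target="11"):
    """Hypothetical R(H; beta1): log-additive in copies of `target`, no covariates."""
    return math.exp(beta1 * dip.count(target))


def induced_rel_risk(g, beta1, theta_hat):
    """R*(G; beta1, theta): q-weighted mean of R over diplotypes consistent with G."""
    consistent = [h for h in DIPLOTYPES if genotype(h) == g]
    num = sum(rel_risk(h, beta1) * q_hwe(h, theta_hat) for h in consistent)
    den = sum(q_hwe(h, theta_hat) for h in consistent)
    return num / den


theta_hat = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}  # assumed estimate
beta1 = math.log(2.0)
print(induced_rel_risk((1, 1), beta1, theta_hat))  # phase-ambiguous genotype
print(induced_rel_risk((2, 2), beta1, theta_hat))  # only 11/11 is consistent
```

For the ambiguous double heterozygote, the induced relative risk falls between the relative risks of the two consistent diplotypes, weighted by their HWE probabilities; for an unambiguous genotype it reduces to $R$ itself.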

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X; \beta_1, \hat{\theta})$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng’s further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), “Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions,” Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), “Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions and Joint Effects,” Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Prentice, R. L. (1982), “Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model,” Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), “Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity,” Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis, using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of “Hardy–Weinberg equilibrium (HWE)” in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits “inbreeding” (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider “inbreeding.” This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
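The size of the departure from HWE implied by a given $\rho$ is easy to tabulate. The Python sketch below applies the homozygote formula from the text and, as an added assumption, the standard companion form $(1 - \rho)\,2\pi_k\pi_l$ for heterozygous pairs, which makes the diplotype frequencies sum to 1. The haplotype frequencies and the function name are illustrative choices of ours.

```python
def diplotype_freqs(pi, rho):
    """Unordered diplotype frequencies under the inbreeding model.

    Homozygotes: rho * pi_k + (1 - rho) * pi_k**2, as in the text.
    Heterozygotes: (1 - rho) * 2 * pi_k * pi_l (standard companion form, assumed).
    """
    n_hap = len(pi)
    freqs = {}
    for k in range(n_hap):
        freqs[(k, k)] = rho * pi[k] + (1.0 - rho) * pi[k] ** 2
        for l in range(k + 1, n_hap):
            freqs[(k, l)] = (1.0 - rho) * 2.0 * pi[k] * pi[l]
    return freqs


pi = [0.2, 0.3, 0.5]  # illustrative haplotype frequencies
for rho in (0.0, 0.015, 0.05):
    f = diplotype_freqs(pi, rho)
    homozygosity = sum(f[(k, k)] for k in range(len(pi)))
    print(rho, round(homozygosity, 4))
```

With $\rho = 0$ the function reproduces HWE, and total homozygosity increases linearly in $\rho$, so small inbreeding coefficients produce correspondingly small departures from HWE.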

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of “homozygotes” in control subjects, leading to strong violations of HWE. Presumably the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson’s paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good-quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
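A small numerical example may help show how subpopulation membership can create a spurious marker–disease association. In the Python sketch below, the marker is independent of disease within each of two subpopulations, yet the pooled odds ratio is well above 1 because both the carrier frequency and the disease risk are higher in the first subpopulation. The mixing weights, carrier frequencies, and risks are invented for illustration and do not come from any study.

```python
def odds_ratio(case_carrier, case_non, ctrl_carrier, ctrl_non):
    """Odds ratio of disease for marker carriers versus noncarriers."""
    return (case_carrier * ctrl_non) / (case_non * ctrl_carrier)


def joint_table(weight, carrier_freq, risk):
    """Joint cell probabilities for one subpopulation with NO marker effect."""
    return (
        weight * carrier_freq * risk,              # diseased carriers
        weight * (1 - carrier_freq) * risk,        # diseased noncarriers
        weight * carrier_freq * (1 - risk),        # healthy carriers
        weight * (1 - carrier_freq) * (1 - risk),  # healthy noncarriers
    )


# (population share, carrier frequency, disease risk); values are hypothetical.
subpops = [(0.5, 0.8, 0.10), (0.5, 0.2, 0.02)]
tables = [joint_table(*s) for s in subpops]

within_or = [odds_ratio(*t) for t in tables]
pooled = tuple(sum(cells) for cells in zip(*tables))
pooled_or = odds_ratio(*pooled)
print(within_or, pooled_or)
```

Stratifying by subpopulation recovers the within-stratum odds ratios of 1, which is the rationale for controlling population substructure by design or by stratification.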

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider “testing” versus “estimating” the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the “needle in a haystack” nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ’s model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling’s $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling’s $T^2$ in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and controls using (12) from LZ with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ’s rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.

Table 1. Bias and Mean Squared Error (MSE) of β1

β1    Scenario                           Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario “Baseline” refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of $\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of the haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method, called FlexTree, for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that existing biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. Such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions, by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Rejoinder

D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
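The sliding-window scan with a permutation adjustment for multiple comparisons might be organized along the following lines. This is only a schematic sketch: the per-window statistic below is a simple stand-in (a sum of squared standardized case-control differences in mean allele count), not the likelihood-ratio statistic based on (9), the maximum-statistic permutation scheme is one common choice assumed here, and all function names and the toy data are hypothetical.

```python
import numpy as np

def sliding_windows(n_markers, width):
    """Yield (start, stop) index pairs for consecutive marker windows."""
    for start in range(n_markers - width + 1):
        yield start, start + width

def window_stats(genotypes, status, width):
    """Toy statistic per window (assumes every marker varies in both groups)."""
    cases, controls = genotypes[status == 1], genotypes[status == 0]
    diff = cases.mean(axis=0) - controls.mean(axis=0)
    se = np.sqrt(cases.var(axis=0, ddof=1) / len(cases)
                 + controls.var(axis=0, ddof=1) / len(controls))
    z2 = (diff / se) ** 2
    return np.array([z2[a:b].sum()
                     for a, b in sliding_windows(genotypes.shape[1], width)])

def permutation_adjusted_pvalues(genotypes, status, width, n_perm, rng):
    """Adjust each window's p-value against the permutation null of the maximum."""
    observed = window_stats(genotypes, status, width)
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        # Permuting case-control labels breaks any genotype-disease association.
        max_null[b] = window_stats(genotypes, rng.permutation(status), width).max()
    exceed = (max_null[:, None] >= observed[None, :]).sum(axis=0)
    return (1 + exceed) / (n_perm + 1)

rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.3, size=(200, 6))   # made-up data: 200 subjects, 6 SNPs
status = np.repeat([1, 0], 100)                   # 100 cases, 100 controls
pvals = permutation_adjusted_pvalues(genotypes, status, width=3, n_perm=200, rng=rng)
```

Comparing each observed statistic with the permutation distribution of the maximum over windows is what accounts for the correlation among overlapping windows.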

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype–environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).
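The two formulations can be illustrated by the regression covariates they induce for a given haplotype pair. The sketch below assumes an additive coding (number of copies of a haplotype), which is only one of the genetic mechanisms the framework allows; the function names and the integer labeling of haplotypes 0, ..., K − 1 are hypothetical.

```python
import numpy as np

def one_vs_rest(pairs, target):
    """First formulation: copies of one target haplotype versus all the others."""
    pairs = np.asarray(pairs)
    return (pairs == target).sum(axis=1)

def reference_coding(pairs, n_haplotypes):
    """Second formulation: copies of each of the first K - 1 haplotypes,
    with the last haplotype (index K - 1) serving as the reference."""
    pairs = np.asarray(pairs)
    design = np.zeros((len(pairs), n_haplotypes - 1), dtype=int)
    for column in range(n_haplotypes - 1):
        design[:, column] = (pairs == column).sum(axis=1)
    return design

# Four subjects with known haplotype pairs, K = 3 haplotypes labeled 0, 1, 2.
pairs = [(0, 0), (0, 2), (1, 2), (2, 2)]
x_single = one_vs_rest(pairs, target=0)          # one covariate per subject
x_full = reference_coding(pairs, n_haplotypes=3) # K - 1 covariates per subject
```

With unphased data the pairs are not observed, so in practice such covariates would be averaged over the pairs compatible with each genotype inside the likelihood.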

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000862

3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence), unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence, under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and of the assumption of independence between haplotypes and covariates. On the other hand, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.
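The monotone behavior of the EM algorithm can be seen in a much simpler setting than the NPMLE discussed above: estimating haplotype frequencies alone from unphased two-SNP genotypes under HWE. The sketch below is an illustration of that simpler problem only, with hypothetical function names and made-up genotype counts; it involves no phenotype, covariates, or baseline hazard.

```python
import itertools
import numpy as np

HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # the four two-SNP haplotypes

def compatible_pairs(genotype):
    """Unordered haplotype index pairs whose allele counts sum to the genotype."""
    return [(k, l)
            for k, l in itertools.combinations_with_replacement(range(4), 2)
            if tuple(a + b for a, b in zip(HAPS[k], HAPS[l])) == tuple(genotype)]

def pair_prob(k, l, pi):
    """Probability of an unordered haplotype pair under HWE."""
    return pi[k] ** 2 if k == l else 2 * pi[k] * pi[l]

def em_haplotype_freqs(genotypes, n_iter=50):
    """Return estimated frequencies and the observed-data log-likelihood path."""
    pi = np.full(4, 0.25)
    candidates = [compatible_pairs(g) for g in genotypes]
    loglik_path = []
    for _ in range(n_iter):
        counts = np.zeros(4)
        loglik = 0.0
        for cand in candidates:
            probs = np.array([pair_prob(k, l, pi) for k, l in cand])
            total = probs.sum()
            loglik += np.log(total)
            # E-step: posterior weight of each compatible pair given the genotype.
            for (k, l), weight in zip(cand, probs / total):
                counts[k] += weight
                counts[l] += weight
        loglik_path.append(loglik)
        pi = counts / counts.sum()         # M-step: expected haplotype proportions
    return pi, loglik_path

# Made-up genotype data (allele counts at SNP 1 and SNP 2).
genotypes = ([(0, 0)] * 30 + [(2, 2)] * 20 + [(1, 1)] * 25
             + [(0, 1)] * 10 + [(1, 0)] * 5 + [(1, 2)] * 6 + [(2, 1)] * 4)
pi_hat, path = em_haplotype_freqs(genotypes)
```

Only the double heterozygotes (1, 1) are phase-ambiguous here; the recorded log-likelihood never decreases from one iteration to the next, which is the property referred to in the text.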

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.
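For orientation, the idea of testing HWE against a one-parameter alternative can be illustrated with the textbook single-locus chi-square test, sketched below. This is not the likelihood-based test for haplotypes developed in the article; it is a simpler analogue for one biallelic marker, with a hypothetical function name and made-up genotype counts.

```python
import math

def hwe_chisq(n_aa, n_ab, n_bb):
    """Pearson chi-square test of HWE at one biallelic locus (1 degree of freedom).

    Returns the statistic, its p-value, and the moment estimate of the
    inbreeding coefficient, 1 - (observed / expected heterozygotes).
    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)        # allele frequency of A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    f_hat = 1 - n_ab / (2 * n * p * q)
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, f_hat

# Counts exactly in HWE proportions versus counts with a heterozygote deficit.
stat_eq, p_eq, f_eq = hwe_chisq(90, 420, 490)
stat_dis, p_dis, f_dis = hwe_chisq(120, 360, 520)
```

In this single-locus case the statistic equals n times the squared inbreeding estimate, which shows how a one-parameter departure from HWE is what the test detects.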

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than obtaining refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


Table 5. Simulation Results for Haplotype 10100 in Case-Control Studies

n1 = n0   Parameter   True value   Bias    SE     SEE    CP     Power
500       βh             0         −.030   .400   .401   .957   .043
          βx1            .5         .002   .151   .152   .951   .917
          βx2            .5         .001   .228   .230   .953   .584
          βhx2          −.5         .015   .641   .644   .956   .118
1000      βh             0         −.017   .275   .277   .953   .047
          βx1            .5         .002   .107   .107   .954   .997
          βx2            .5         .000   .162   .161   .950   .871
          βhx2          −.5         .012   .441   .443   .950   .198

NOTE: Bias and SE are the bias and standard error of the parameter estimator. SEE is the mean of the standard error estimator. CP is the coverage probability of the 95% confidence interval. Power pertains to the .05-level test of zero parameter value. Each entry is based on 5,000 replicates.
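The summary measures defined in the note can be computed from simulation replicates as in the following sketch. The function name is hypothetical, Wald-type intervals and tests are assumed (estimate ± 1.96 standard errors), and the replicates are artificial draws generated here purely to exercise the function; they are not the simulation output behind Table 5.

```python
import numpy as np

def summarize_replicates(estimates, std_errors, true_value, z=1.959964):
    """Bias, SE, SEE, CP, and Power across replicates, assuming Wald inference."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return {
        "bias": estimates.mean() - true_value,             # mean estimate minus truth
        "se": estimates.std(ddof=1),                       # empirical standard error
        "see": std_errors.mean(),                          # mean estimated standard error
        "cp": np.mean((lower <= true_value) & (true_value <= upper)),
        "power": np.mean(np.abs(estimates / std_errors) > z),  # rejects H0: parameter = 0
    }

# Artificial replicates: true value .5, sampling standard deviation .15.
rng = np.random.default_rng(2)
estimates = rng.normal(loc=0.5, scale=0.15, size=5000)
std_errors = np.full(5000, 0.15)
summary = summarize_replicates(estimates, std_errors, true_value=0.5)
```

When the true value is 0, the "power" entry is the empirical type I error rate, which is how the first row of each block of the table should be read.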

A total of 796 cases and 415 controls were genotyped at 5 SNPs in a putative susceptibility region on chromosome 22, with 131 cases and 82 controls having missing genotype information for at least one SNP. If Gi is missing, then the set S(Gi) is enlarged accordingly in the analysis. Table 1 displays the estimated haplotype frequencies under (3), separated by the cases and controls, along with the values of R_h^2 (Stram et al. 2003) for the controls. We estimated ρ at .000 for controls and .002 for cases.

We use the method based on (9) to estimate the effects of the haplotypes whose observed frequencies in the controls are greater than 2%. As shown in Table 6, the results are significant for the two most common haplotypes: haplotype 01100 increases the risk of disease, whereas haplotype 10011 is protective against diabetes. Epstein and Satten (2003) also reported the estimates for these two haplotypes, which agree with our numbers. Although they did not report standard error estimates, their confidence intervals are similar to those based on Table 6. The results under the codominant model, as well as the calculations of the Akaike information criterion (AIC) (Akaike 1985), suggest that the additive model fits the data the best for both haplotypes 01100 and 10011.

The FUSION investigators are currently exploring gene–environment interactions on chromosome 22, so the covariate information is confidential at this stage. To illustrate our method for detecting gene–environment interactions, we artificially created a binary covariate X by setting X = 1 for the first 600 individuals in the dataset. Under the additive genetic model for haplotype 01100, the estimate of the interaction is .043, with an estimated standard error of .110. For further illustration, we generated a binary covariate from the conditional distribution of X given Y and G under model (12) with α = −3.7, β1 = .32, and β2 = .25. Based on 5,000 replicates, the power for testing H0: β3 = 0 is estimated at .053, .479, or .974 under β3 = 0, .25, or .5.

5. DISCUSSION

Inferring haplotype–disease associations is an interesting and difficult statistical problem. The presence of infinite-dimensional nuisance parameters in the likelihoods for case-control and cohort studies entails considerable theoretical and computational challenges. Although we have conducted a systematic and rigorous investigation, providing powerful new methods, there remain substantial open problems. Here we discuss some directions for future research.

Case-Control Studies. It is numerically difficult to maximize (6) when $N$ is much larger than $n$, and algorithms for implementing the constrained maximization mentioned in Remark 3 have yet to be developed. For case-control studies with unknown population totals, identifiability is a thorny issue. We have provided a simple and efficient method under the rare disease assumption, which appears to work well even when the disease is not rare. But can we do better?

Model Selection and Model Assessment. Because our approach is built on likelihood, we can apply likelihood-based model selection criteria, such as the AIC used in Section 4. Lin (2004) showed that the AIC performs well for the proportional hazards model. It is unclear how to apply the traditional residual-based methods for assessing model adequacy, because the haplotypes are not directly observable.

Other Genetic Variants. We have focused on SNPs-based haplotypes. The proposed inference procedures are potentially applicable to microsatellite loci and other genotype data, although the identifiability of parameters needs to be verified for each kind of genotype data.

Other Study Designs. It is sometimes desirable to use the matched case-control design, in which one or more controls are individually matched to each case. In large cohort studies with rare diseases, it is cost-effective to adopt the case-cohort design or nested case-control design, so that only a subset of the cohort members needs to be genotyped. We are currently developing efficient inference procedures for such designs.

Population Substructure. The presence of latent population substructure can cause bias in association studies of unrelated individuals. There exist several statistical methods to adjust for the effects of population substructure with the aid of genomic markers. It should be possible to extend the proposed methods so as to accommodate potential population substructure.

Table 6. Estimates of Haplotype Effects Under Various Genetic Models for the FUSION Study

                                                                  Codominant model
Haplotype   Recessive model   Dominant model   Additive model   Additive         Recessive
01011       .327 (.270)       −.027 (.140)     .049 (.135)      .005 (.143)      .331 (.289)
01100       .316 (.146)       .274 (.109)      .355 (.099)      .334 (.114)      .063 (.167)
10011       −.206 (.155)      −.323 (.112)     −.320 (.095)     −.344 (.111)     .076 (.183)
10100       −1.019 (1.020)    .196 (.219)      .116 (.213)      .169 (.217)      −1.131 (1.029)
10110       .903 (.746)       −.007 (.248)     .063 (.249)      .016 (.254)      .892 (.765)
11011       −.222 (.328)      −.096 (.140)     −.127 (.133)     −.108 (.140)     −.143 (.344)

NOTE: Standard error estimates are shown in parentheses.
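As a rough check on the statement that the two most common haplotypes have significant effects, the following sketch computes Wald z statistics and two-sided normal p-values from the additive-model entries of Table 6. The decimal placement of the entries (.355 with standard error .099, and −.320 with standard error .095) is an assumption about the printed table, and the helper name `wald_test` is hypothetical.

```python
import math

# Additive-model estimates and standard errors for haplotypes 01100 and 10011.
# Assumption: the Table 6 entries read .355 (.099) and -.320 (.095).
additive = {"01100": (0.355, 0.099), "10011": (-0.320, 0.095)}


def wald_test(estimate, std_err):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_err
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


results = {hap: wald_test(est, se) for hap, (est, se) in additive.items()}
for hap, (z, p) in results.items():
    print(f"haplotype {hap}: z = {z:.2f}, two-sided p = {p:.4f}")
```

Both statistics exceed 1.96 in absolute value, in line with the text's description of a risk-increasing and a protective haplotype.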

98 Journal of the American Statistical Association March 2006

Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
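The permutation idea for sliding windows can be sketched as follows. This is only an illustration under simplifying assumptions: the per-window statistic is a stand-in (a standardized case-control difference in the total minor-allele count over the window), not the likelihood-based haplotype test of this article, and the function names and simulated data are hypothetical. Adjusted p-values compare each observed statistic with the permutation distribution of the maximum over windows, which respects the correlation between overlapping windows.

```python
import numpy as np


def window_stats(y, geno, width):
    """Stand-in statistic per sliding window: |difference in mean allele count| / SE."""
    stats = []
    for start in range(geno.shape[1] - width + 1):
        score = geno[:, start:start + width].sum(axis=1)
        cases, controls = score[y == 1], score[y == 0]
        se = np.sqrt(cases.var(ddof=1) / cases.size
                     + controls.var(ddof=1) / controls.size)
        stats.append(abs(cases.mean() - controls.mean()) / se)
    return np.array(stats)


def max_t_adjusted_pvalues(y, geno, width, n_perm=500, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum statistic."""
    rng = np.random.default_rng(seed)
    observed = window_stats(y, geno, width)
    max_null = np.array([window_stats(rng.permutation(y), geno, width).max()
                         for _ in range(n_perm)])
    exceed = (max_null[None, :] >= observed[:, None]).sum(axis=1)
    return (1 + exceed) / (n_perm + 1)


# Simulated null data: 120 subjects, 8 SNPs coded 0/1/2, windows of 5 SNPs.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 60)
geno = rng.integers(0, 3, size=(120, 8))
adjusted = max_t_adjusted_pvalues(y, geno, width=5, n_perm=200)
```

Because every window is compared with the same null maximum, no Bonferroni factor is needed, and the correction is less conservative when adjacent windows share most of their SNPs.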

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$ yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly, $(1-\rho)\pi_k^2 + \rho\pi_k = (1-\tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k$. We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have
$$\pi_k = \frac{-\rho + \{\rho^2 + 4c_k(1-\rho)\}^{1/2}}{2(1-\rho)}.$$
Thus $(1-\rho)^{-1}$ satisfies the equation
$$\sum_k \bigl[(1-x) + \{(x-1)^2 + 4c_k x\}^{1/2}\bigr] = 2,$$
and $(1-\tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1-x) + \{(x-1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$.

To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1-\rho) + \rho\} + \mu\pi_k(1-\pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have
$$\sum_k \frac{\mu\pi_k(1-\pi_k)}{2\pi_k(1-\rho) + \rho} = 0.$$
Therefore $\mu = 0$ and $\nu = 0$.
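The first part of the proof is constructive, and a small numerical sketch may make it concrete. Assuming model (3), in which $P(G = 2h_k) = (1-\rho)\pi_k^2 + \rho\pi_k$, the code below computes the homozygote probabilities $c_k$ from chosen parameters, solves the displayed equation for $x = (1-\rho)^{-1}$ by bisection (valid because the left side is decreasing in $x$), and recovers $\rho$ and $\pi_k$. The function names and the bisection bounds are illustrative choices, not part of the article.

```python
import numpy as np


def homozygote_probs(pi, rho):
    """c_k = P(G = 2 h_k) = (1 - rho) * pi_k**2 + rho * pi_k under model (3)."""
    pi = np.asarray(pi, dtype=float)
    return (1.0 - rho) * pi**2 + rho * pi


def lemma_equation(x, c):
    """sum_k [(1 - x) + sqrt((x - 1)**2 + 4 c_k x)] - 2, which is decreasing in x."""
    return np.sum((1.0 - x) + np.sqrt((x - 1.0)**2 + 4.0 * c * x)) - 2.0


def recover_parameters(c, upper=1e6, tol=1e-12):
    """Solve for x = 1 / (1 - rho) by bisection on [1, upper], then back out rho and pi."""
    lo, hi = 1.0, upper
    while hi - lo > tol * max(1.0, lo):
        mid = 0.5 * (lo + hi)
        if lemma_equation(mid, c) > 0.0:
            lo = mid  # root lies to the right because the function is decreasing
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    rho = 1.0 - 1.0 / x
    # pi_k = [-rho + sqrt(rho^2 + 4 c_k (1 - rho))] / {2 (1 - rho)}, rewritten in terms of x
    pi = ((1.0 - x) + np.sqrt((x - 1.0)**2 + 4.0 * c * x)) / 2.0
    return pi, rho


pi_true, rho_true = np.array([0.5, 0.3, 0.2]), 0.1
pi_hat, rho_hat = recover_parameters(homozygote_probs(pi_true, rho_true))
```

The recovered values agree with the inputs up to numerical error, which mirrors the uniqueness argument: the homozygote probabilities alone pin down $\rho$ and every $\pi_k$.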

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \text{ is not} \ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h, h)\}$ or $\{(h, \tilde h)\}$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h, h))P(H = (h, h))$ or $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h, \tilde h))P(H = (h, \tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h_k, h_l))$ does not depend on $(h_k, h_l) \in S(g)$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y|X = x, H = (h_k, h_l))P(G = g)$, where $(h_k, h_l) \in S(g)$. For $g \in \mathcal{G}_3$,
$$m_g(y, x; \theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g),$$
where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k, h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) When $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).
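The two conclusions can be illustrated numerically for a genotype in the third category. The sketch below evaluates the two-component mixture $m_g(y, x; \theta)$ for a scalar covariate and shows that, when $\beta_1 = \beta_4 = 0$, only the sum $\pi_1(g) + \pi_2(g)$ matters, whereas a nonzero $\beta_1$ makes the individual components distinguishable. The scalar covariate, the parameter values, and the function names are assumptions made only for this illustration.

```python
import math


def logistic_density(y, eta):
    """exp(y * eta) / (1 + exp(eta)) for a binary response y in {0, 1}."""
    return math.exp(y * eta) / (1.0 + math.exp(eta))


def m_g3(y, x, alpha, beta1, beta3, beta4, pi1, pi2):
    """Mixture over carrier and non-carrier diplotypes for a genotype in the third category."""
    eta_carrier = alpha + beta1 + beta3 * x + beta4 * x   # one copy of h*
    eta_noncarrier = alpha + beta3 * x                    # no copy of h*
    return (logistic_density(y, eta_carrier) * pi1
            + logistic_density(y, eta_noncarrier) * pi2)


# Two splits of the same genotype probability P(G = g) = 0.30.
split_a, split_b = (0.10, 0.20), (0.25, 0.05)

# No haplotype effect: the two splits give the same value.
null_a = m_g3(1, 0.7, -1.0, 0.0, 0.5, 0.0, *split_a)
null_b = m_g3(1, 0.7, -1.0, 0.0, 0.5, 0.0, *split_b)

# Nonzero haplotype main effect: the two splits give different values.
effect_a = m_g3(1, 0.7, -1.0, 0.8, 0.5, 0.0, *split_a)
effect_b = m_g3(1, 0.7, -1.0, 0.8, 0.5, 0.0, *split_b)
```

This matches conclusion (1), where only $P(G = g)$ is identifiable, and conclusion (2), where $\pi_1(g)$ and $\pi_2(g)$ are separately identifiable.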

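A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i|X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is
$$\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta)\bigl\{\log P_{\alpha,\beta,\xi}(Y_i|X_i, (h_k, h_l)) + \log P_\gamma(h_k, h_l)\bigr\},$$
where
$$p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i|X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_{k'},h_{l'})\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i|X_i, (h_{k'}, h_{l'}))P_\gamma(h_{k'}, h_{l'})}.$$

Thus, in the $(m+1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:
$$\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)}) \times \nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i|X_i, (h_k, h_l)) = 0,$$
$$\sum_{i=1}^n \sum_{(h_k,h_l)\in S(G_i)} p_{ikl}(\theta^{(m)}) \nabla_\gamma \log P_\gamma(h_k, h_l) = 0. \tag{A.1}$$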

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k, h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k, h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1-B)Q_2$. The complete-data likelihood can be represented by
$$\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i|X_i, H_i) \prod_k \pi_k^{I(Q_{1i}=(h_k,h_k))B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i}=(h_k,h_l))(1-B_i)} \rho^{B_i}(1-\rho)^{1-B_i}.$$

The corresponding score equations for $\pi_k$ and $\rho$ satisfy
$$\pi_k = c^{-1}\Biggl[\sum_{i=1}^n B_i I(Q_{1i} = (h_k, h_k)) + \sum_{i=1}^n \sum_{l=1}^K (1-B_i)\bigl\{I(Q_{2i} = (h_k, h_l)) + I(Q_{2i} = (h_l, h_k))\bigr\}\Biggr]$$
and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define
$$E\{\omega(B_i, Q_{1i}, Q_{2i})|Y_i, X_i, G_i\} = \sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i|X_i, bq_1 + (1-b)q_2)\, p(b, q_1, q_2)\,\omega(b, q_1, q_2) \times \Biggl[\sum_{bq_1+(1-b)q_2\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i|X_i, bq_1 + (1-b)q_2)\, p(b, q_1, q_2)\Biggr]^{-1},$$
where $\omega(B, Q_1, Q_2) = BI(Q_1 = (h_k, h_k))$, $(1-B)I(Q_2 = (h_k, h_l))$, or $B$, and
$$p(b, q_1, q_2) = \prod_k \pi_k^{bI(q_1=(h_k,h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1-b)I(q_2=(h_k,h_l))} \rho^b(1-\rho)^{1-b}.$$

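In the $(m+1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:
$$\pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\Biggl[\sum_{i=1}^n E^{(m)}\bigl\{B_i I(Q_{1i} = (h_k, h_k))\bigr\} + 2\sum_{i=1}^n \sum_{l=1}^K E^{(m)}\bigl\{(1-B_i) I(Q_{2i} = (h_k, h_l))\bigr\}\Biggr]$$
and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i, Q_{1i}, Q_{2i})\}$ is $E\{\omega(B_i, Q_{1i}, Q_{2i})|Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.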
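The closed-form update can be sketched in a stripped-down setting. The code below is a minimal illustration, not the full algorithm: it assumes there is no phenotype (so the factor $P_{\alpha,\beta,\xi}(Y_i|X_i, H_i)$ is constant and drops out), two SNPs with four haplotypes, and genotypes coded as allele counts; all function names and the simulated data are hypothetical. Each ordered pair $(h_k, h_l)$ compatible with a genotype receives posterior weight from the $B = 0$ component, and homozygous pairs additionally from the $B = 1$ component, following the symmetric form of the score equations.

```python
import itertools
import numpy as np


def compatible_pairs(g, haps):
    """Ordered pairs (k, l) with haps[k] + haps[l] == g, i.e. the set S(g)."""
    return [(k, l) for k in range(len(haps)) for l in range(len(haps))
            if np.array_equal(haps[k] + haps[l], g)]


def pair_prob(k, l, pi, rho):
    """P(H = (h_k, h_l)) = rho * delta_kl * pi_k + (1 - rho) * pi_k * pi_l under (3)."""
    return rho * (k == l) * pi[k] + (1.0 - rho) * pi[k] * pi[l]


def log_likelihood(genos, haps, pi, rho):
    """Observed-data log-likelihood of the unphased genotypes."""
    return float(sum(np.log(sum(pair_prob(k, l, pi, rho)
                                for k, l in compatible_pairs(g, haps)))
                     for g in genos))


def em_step(genos, haps, pi, rho):
    """One EM iteration returning the updated (pi, rho)."""
    counts = np.zeros(len(haps))
    expected_b = 0.0
    for g in genos:
        pairs = compatible_pairs(g, haps)
        total = sum(pair_prob(k, l, pi, rho) for k, l in pairs)
        for k, l in pairs:
            if k == l:
                w1 = rho * pi[k] / total          # E{B I(Q1 = (h_k, h_k))}
                expected_b += w1
                counts[k] += w1
            w0 = (1.0 - rho) * pi[k] * pi[l] / total  # E{(1 - B) I(Q2 = (h_k, h_l))}
            counts[k] += w0
            counts[l] += w0
    return counts / counts.sum(), expected_b / len(genos)


def simulate_genotypes(n, haps, pi, rho, seed=0):
    """Draw H = B Q1 + (1 - B) Q2 and return the unphased genotypes."""
    rng = np.random.default_rng(seed)
    genos = []
    for _ in range(n):
        k = rng.choice(len(haps), p=pi)
        l = k if rng.random() < rho else rng.choice(len(haps), p=pi)
        genos.append(haps[k] + haps[l])
    return genos


haps = [np.array(h) for h in itertools.product([0, 1], repeat=2)]
genos = simulate_genotypes(300, haps, pi=[0.4, 0.3, 0.2, 0.1], rho=0.1)
pi_hat, rho_hat = np.full(4, 0.25), 0.5
trace = [log_likelihood(genos, haps, pi_hat, rho_hat)]
for _ in range(50):
    pi_hat, rho_hat = em_step(genos, haps, pi_hat, rho_hat)
    trace.append(log_likelihood(genos, haps, pi_hat, rho_hat))
```

The recorded `trace` should be nondecreasing, which is the usual sanity check for an EM implementation. With a phenotype model, each weight would additionally be multiplied by $P_{\alpha,\beta,\xi}(Y_i|X_i, (h_k, h_l))$ before normalization.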

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $\{F_g(\cdot)\}$. The complete-data likelihood is
$$\prod_{i=1}^N P_{\alpha,\beta,\xi}(Y_i|X_i, H_i)P_\gamma(H_i)\prod_g f_g(X_i)^{I(G_i=g)}.$$
The M-step solves the following equations for $\theta$:
$$\sum_{i=1}^N I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i|X_i, H_i)|Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i|X_i, H_i)|Y_i\} = 0,$$
$$\sum_{i=1}^N I(R_i = 1)E\{\nabla_\gamma \log P_\gamma(H_i)|Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_\gamma \log P_\gamma(H_i)|Y_i\} = 0, \tag{A.2}$$
and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:
$$F_g\{X_i\} = \Biggl[\sum_{j=1}^N I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(X_j = X_i, G_j = g)|Y_j\}\Biggr] \times \Biggl[\sum_{j=1}^N I(G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(G_j = g)|Y_j\}\Biggr]^{-1},$$
where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form
$$\frac{\sum_{(h_k,h_l)\in S(G_i)} \omega(Y_i, X_i, (h_k, h_l))P_{\alpha,\beta,\xi}(Y_i|X_i, (h_k, h_l))P_\gamma(h_k, h_l)}{\sum_{(h_k,h_l)\in S(G_i)} P_{\alpha,\beta,\xi}(Y_i|X_i, (h_k, h_l))P_\gamma(h_k, h_l)}$$
for $R_i = 1$, and
$$\sum_{g\in\mathcal{G}}\ \sum_{x\in\{X_j:\, G_j=g, R_j=1\}}\ \sum_{(h_k,h_l)\in S(g)} \omega(Y_i, x, (h_k, h_l)) P_{\alpha,\beta,\xi}(Y_i|x, (h_k, h_l)) P_\gamma(h_k, h_l)F_g\{x\} \times \Biggl(\sum_{g\in\mathcal{G}}\ \sum_{x\in\{X_j:\, G_j=g, R_j=1\}}\ \sum_{(h_k,h_l)\in S(g)} P_{\alpha,\beta,\xi}(Y_i|x, (h_k, h_l)) P_\gamma(h_k, h_l)F_g\{x\}\Biggr)^{-1}$$
for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components $\{F_g(\cdot)\}$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1, with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


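A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$ yield the same likelihood,
$$RL(\theta, \{F_g^\dagger\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A.3}$$
Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x)q_g/\sum_{g\in\mathcal{G}} q_g = \tilde f_g^\dagger(x)\tilde q_g/\sum_{g\in\mathcal{G}} \tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that
$$\eta(y, x, g; \theta) = C(y)\eta(y, x, g; \tilde\theta), \tag{A.4}$$
where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}): \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that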

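$$\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5}$$
for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0|X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes
$$\frac{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + 1} = \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + 1}, \tag{A.6}$$
where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results: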

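1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields
$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}}.$$
Thus (A.6) is equivalent to
$$\frac{\pi_1(g)\pi_2(g')}{\pi_1(g')\pi_2(g)} = \frac{\tilde\pi_1(g)\tilde\pi_2(g')}{\tilde\pi_1(g')\tilde\pi_2(g)} \quad \text{for all } g, g' \in \mathcal{G}_3.$$

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z}z \ne 0$, (A.6) yields
$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\beta_{3z}z}}{1 + e^{\alpha+\beta_1+\beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\beta_{3z}z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z}z}}.$$
The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to
$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \tag{A.7}$$
for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.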

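A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\right\}, \tag{A.8}$$
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\right\}, \tag{A.9}$$
and
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\right\}, \tag{A.10}$$
where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\right\}.$$
By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain
$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}.$$
By taking the ratio of the two sides and letting $z \to \operatorname{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,
$$\eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\bigl\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\pi_1(g)/\pi_2(g) + \Phi(\alpha + \beta_3^T x)\bigr\} \times \bigl[\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}\pi_1(g)/\pi_2(g) + 1 - \Phi(\alpha + \beta_3^T x)\bigr]^{-1}. \tag{A.11}$$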


It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11), with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,
$$\frac{e^{-e^{\alpha}}(e^{e^{\alpha+\beta_z z}} - 1)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1)}{1 - e^{-e^{\tilde\alpha}}}.$$
Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as
$$l_n(\theta, \{\delta_j\}) = \sum_{j=1}^J n_{1j}\log\eta(1, x_j, g_j; \theta) - n_1\log\Biggl\{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j\Biggr\} + \sum_{j=1}^J n_{+j}\log\delta_j,$$
where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain
$$\frac{n_{+j}}{\delta_j} - \frac{n_1\eta(1, x_j, g_j; \theta)}{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j} + \lambda = 0.$$
Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus
$$\delta_j = \frac{n_{+j}}{n - n_1 + n_1\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12}$$
where $\mu = \sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to
$$l_n^*(\theta, \mu) = \sum_{j=1}^J n_{1j}\log\eta(1, x_j, g_j; \theta) - \sum_{j=1}^J n_{+j}\log\Bigl\{\frac{n_1}{n}\eta(1, x_j, g_j; \theta) + \Bigl(1 - \frac{n_1}{n}\Bigr)\mu\Bigr\} + (n - n_1)\log\mu.$$
Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_\mu l_n^*(\theta, \mu) + C_n$. If $\hat\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \hat\mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
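The reduction from the $J$ jump sizes to the single scalar $\mu$ can be checked numerically. The sketch below fixes $\theta$, so that the values $\eta(1, x_j, g_j; \theta)$ are just given positive numbers, solves $\partial l_n^*/\partial\mu = 0$ by bisection on the log scale, and then forms the jump sizes from (A.12). The simulated inputs and the function names are assumptions made for illustration; in practice $\mu$ would be updated jointly with $\theta$ by Newton–Raphson as described above.

```python
import numpy as np


def solve_mu(eta, n_plus, n1, lo=1e-12, hi=1e12, iters=200):
    """Root of d l*_n / d mu = 0, i.e. sum_j n_+j * mu / (n1 * eta_j + (n - n1) * mu) = 1."""
    n = n_plus.sum()

    def score(mu):
        # Increasing in mu, from about -1 near 0 to n / (n - n1) - 1 > 0 for large mu.
        return np.sum(n_plus * mu / (n1 * eta + (n - n1) * mu)) - 1.0

    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # bisection on the log scale
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)


def jump_sizes(eta, n_plus, n1, mu):
    """Equation (A.12): delta_j = n_+j / (n - n1 + n1 * eta_j / mu)."""
    n = n_plus.sum()
    return n_plus / (n - n1 + n1 * eta / mu)


# Hypothetical inputs: J = 6 distinct (x_j, g_j) values with case and control counts.
rng = np.random.default_rng(0)
eta = np.exp(rng.normal(size=6))       # eta(1, x_j, g_j; theta) at a fixed theta
n1j = rng.integers(1, 20, size=6)      # counts among cases
n0j = rng.integers(1, 20, size=6)      # counts among controls
n_plus, n1 = (n1j + n0j).astype(float), float(n1j.sum())

mu_hat = solve_mu(eta, n_plus, n1)
delta = jump_sizes(eta, n_plus, n1, mu_hat)
```

At the solution the jump sizes sum to 1 and satisfy $\mu = \sum_j \eta(1, x_j, g_j; \theta)\delta_j$, which is the self-consistency used in the derivation.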

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define
$$\zeta_1(x, g; \theta) = \sum_{(h_k,h_l)\in S(g)} e^{\beta^T Z(x,h_k,h_l)}\{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\},$$
$$\zeta_0(g; \theta) = \sum_{(h_k,h_l)\in S(g)} \{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}.$$
By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:
$$l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^n \{y_i\log\zeta_1(X_i, G_i; \theta) + (1 - y_i)\log\zeta_0(G_i; \theta)\} - \sum_{i=1}^n\sum_g I(G_i = g)\log\Bigl\{\zeta_1(X_i, G_i; \theta) + n_1^{-1}n_g\sum_{\tilde g}\mu_{\tilde g} - \mu_g\Bigr\} + \sum_{i=1}^n (1 - y_i)\log\Bigl(\sum_g \mu_g\Bigr),$$
where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:
$$l_n^*(\theta, \mu) = \sum_{i=1}^n y_i\log\zeta_1(X_i, G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log\zeta_0(G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log\mu - \sum_{i=1}^n \log\Bigl\{(1 - r)\mu + r\sum_g \zeta_1(X_i, g; \theta)\Bigr\},$$
where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k): k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l): k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by
$$P(B, Q_1, Q_2, Y|X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B,Q_1,Q_2,Y}\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},$$
where $\vartheta = (-\log\mu + \log\{r/(1 - r)\}, \beta^T, \log\pi_1 - \log\{\rho/(1 - \rho)\}, \ldots, \log\pi_K - \log\{\rho/(1 - \rho)\})^T$ and $W(B, Q_1, Q_2, Y, X) = (Y, YZ^T(X, H), BI(Q_1 = (h_1, h_1)) + (1 - B)\sum_l\{I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1))\}, \ldots, BI(Q_1 = (h_K, h_K)) + (1 - B)\sum_l\{I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K))\})^T$. We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood
$$l_n^*(\vartheta) = \sum_{i=1}^n \log\Biggl[\frac{\sum_{BQ_1+(1-B)Q_2\in S(G_i)} e^{\vartheta^T W(B,Q_1,Q_2,Y_i,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Biggr].$$
We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

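The complete-data score function is
$$\sum_{i=1}^n\Biggl[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\Biggr].$$
Thus, in the E-step we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:
$$E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)|Y_i, X_i, G_i] = \frac{\sum_{b,q_1,q_2} I(bq_1 + (1-b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}\,W(b, q_1, q_2, Y_i, X_i)}{\sum_{b,q_1,q_2} I(bq_1 + (1-b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}}.$$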


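In the M-step we use the one-step Newton–Raphson iteration to update the parameter estimates:
$$\vartheta^{(k+1)} = \vartheta^{(k)} - \Omega^{-1} \times \sum_{i=1}^n\Biggl[E[W(B, Q_1, Q_2, Y_i, X_i)|Y_i, X_i, G_i] - \frac{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b,q_1,q_2,y}\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\Biggr],$$
where
$$\Omega = -\Biggl[\sum_{i=1}^n \frac{\sum_{b,q_1,q_2,y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Biggr] + \sum_{i=1}^n\Biggl[\frac{\{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^{\otimes 2}}{\{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^2}\Biggr]$$
and $a^{\otimes 2} = aa^T$.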

A46 Proof of Theorem 2 Write Fxg(xg) = Fdaggerg(x)qg and

Fxg(xg) = Fdaggerg(x)qg Because θ is bounded and Fxg is a probabil-

ity distribution we can choose a subsequence such that θ rarr θlowast andFxg(xg) rarr Flowast

xg(xg) equiv Flowastg(x)qlowast

g where qlowastg gt 0 for any g

Because Fdaggerg maximizes the likelihood there exists some Lagrange

multiplier λg such that

I(Gi = g)

FdaggergXi

minus n1η(1Xig θ )qgint

xg η(1x g θ)dFxg(x g)minus nλg = 0

where FdaggergXi denotes the point mass of Fdagger

g at Xi and the integralis interpreted as integration over x and summation over g Becausesumn

i=1 FdaggergXi = 1 λg satisfies the equation

nminus1nsum

i=1

I(Gi = g)

λg + n1η(1Xig θ )qgn intxg η(1x g θ)dFxg(x g)minus1

= 1 (A13)

and

min1leilen

λg + n1η(1Xig θ )qg

nint

xg η(1x g θ)dFxg(x g)

gt 0

Clearly λg must be bounded asymptotically Thus by choosing a sub-sequence we assume that λg rarr λlowast

g By (A13) and the Lipschitz continuity of η(1xg θlowast) in the con-

tinuous components of x we can show that there exists a positive con-stant δ such that

mingx

∣∣∣∣λ

lowastg + η(1xg θlowast)qlowast

gintxg η(1x g θlowast)dFlowast

xg(x g)

∣∣∣∣

gt δ

Consequently when n is sufficiently large

Fdaggerg(x) = nminus1

nsum

i=1

I(Gi = gXi le x)

times(

max

[∣∣∣∣λg + η(1Xig θ )qgn1

times

nint

xgη(1x g θ)dFxg(x g)

minus1∣∣∣∣ δ

])minus1

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proofof theorem 12 of Murphy and van der Vaart (2001) We first obtaina score function by differentiating ln(θ Fdagger

g qg) with respect to θ

along the direction v and with respect to Fxg along the path Fε =Fxg + ε

intψ(xg)dFxg where v has a unit norm and ψ(middotg) is any

function whose total variation is bounded by 1 The linearization of thescore function around the true parameter value yields

n12

(vT11 + 21[ψ]T )(θ minus θ0)

+int

(vT12 + 22[ψ])d(Fxg minus Fxg)

= nminus12nsum

i=1

yi

vT lθ (1XiGi θ0Fxg)

+ lF(1XiGi θ0Fxg)

[int

ψ dFxg

]

+ nminus12nsum

i=1

(1 minus yi)

vT lθ (0XiGi θ0Fxg)

+ lF(0XiGi θ0Fxg)

[int

ψ dFxg

]

+ op(1)

where 11 is a constant matrix 12 is a vector function of x 21[ψ]and 22[ψ] are linear operators of ψ and lθ and lF are the scoreswith respect to θ and Fxg The right side of the foregoing equa-tion converges weakly to a Gaussian process which depends on( y1 y2 ) only through We can show that the operator B[vψ] equivvT11 +21[ψ]T vT12 +22[ψ]T is invertible along the lines ofMurphy and van der Vaart (2001) It then follows from theorem 331of van der Vaart and Wellner (1996) that n12(θ minus θ0 Fxg minus Fxg)

converges weakly to a Gaussian processBecause the asymptotic distribution depends on ( y1 y2 ) only

via we assume that ( y1 y2 ) are independent realizations froma Bernoulli distribution with mean By choosing some ψ such thatB[vψ] = (vT 0)T for all v we see that θ is an asymptotically linearestimator for θ0 with the influence function in the score space It fol-lows from proposition 331 of Bickel Klaassen Ritov and Wellner(1993) that the limiting covariance matrix of n12(θ minus θ0) attains thesemiparametric efficiency bound


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,
$$\frac{dP_n}{d\tilde P_n} = \exp\Biggl\{a_n\sum_{i=1}^n \frac{\partial\log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a}\Bigg|_{a=0} + o(1)\Biggr\} \to_{\tilde P_n} 1.$$
Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

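A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that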

sum

HisinS(G)

˜(Y)eβTZ(XH)Q((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

=sum

HisinS(G)

(Y)eβTZ(XH)Q

((Y)eβTZ(XH)

)

times 1 minus Q

((Y)eβTZ(XH)

)1minusPγ (H)

By choosing = 1 and integrating Y from 0 to τ on both sides weobtain

sum

HisinS(G)

Q((τ )eβTZ(XH)

)Pγ (H)

=sum

HisinS(G)

Q((τ)eβTZ(XH)

)Pγ (H)

Because Q(middot) is strictly increasing the foregoing equation implies that

(Y)eβTZ(XH) = (Y)eβTZ(XH) for H = (hh) and H = (h h) Itthen follows from Condition 8 that β = β and =

A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

microT lθ (θ00) + l(θ00)

[int

ψ d0

]

= 0 (A14)

where lθ is the score function for θ and l[int ψ d0] is the score func-tion for along the submodel 0 +ε

intψ d0 then micro = 0 and ψ = 0

To prove the desired result we write out (A14) We then let = 1and integrate Y from 0 to τ to obtain

sum

HisinS(G)

Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times Q(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

Q(0(τ )eβT0Z(XH))

+ Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A15)

In contrast by letting = 0 and Y = τ in (A14) we havesum

HisinS(G)

1 minus Q

(0(τ )eβT

0Z(XH))

Pγ (H)

times

minusQ(0(τ )eβT

0Z(XH))0(τ )eβT0Z(XH)microT

βZ(XH)

1 minus Q(0(τ )eβT0Z(XH))

minus Q(0(τ )eβT0Z(XH))

int τ0 ψ(t)d0(t) eβT

0Z(XH)

1 minus Q(0(τ )eβT0Z(XH))

+ microTγ nablaγ log Pγ (H)

= 0 (A16)

The summation of (A15) and (A16) entails microTγ nablaγ log Pγ (H) = 0

From the proof of Lemma 1 microγ = 0 We choose G = 2h or h + h and

let = 1 and Y = 0 in (A14) to obtain microTβZ(XH) + ψ(0) = 0 for

H = (hh) and (h h) Thus microβ = 0 and ψ(0) = 0 under Condition 8Finally (A14) with = 1 implies that

ψ(Y) + Q(0(Y)eβT0Z(XH))

int Y0 ψ(t)d0(t) eβT

0Z(XH)

Q(0(Y)eβT0Z(XH))

= 0

for H = (hh) Therefore ψ = 0

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), “Prediction and Entropy,” in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), “Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?” European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constitute Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment
Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and eventually to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited, through reduced penetrance, late onset, mild disease symptoms, or recessive mode of inheritance, or other factors, for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effects estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases, where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.
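For readers unfamiliar with what a permutation procedure for multiple comparisons looks like, the following is a minimal sketch of a single-step max-statistic permutation adjustment across several markers or windows. It is an illustration only: the per-marker statistic (absolute difference in mean minor-allele count between cases and controls) is a deliberately simple stand-in chosen for this sketch, not the likelihood-based tests of Lin and Zeng nor the Monte Carlo approach of Lin (2005), and all function names are hypothetical.

```python
import random


def allele_count_difference(genotype_column, is_case):
    """Absolute difference in mean minor-allele count, cases versus controls.

    A simple stand-in statistic (an assumption of this sketch).
    """
    cases = [g for g, c in zip(genotype_column, is_case) if c]
    controls = [g for g, c in zip(genotype_column, is_case) if not c]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def max_stat_permutation_pvalues(columns, is_case, n_perm=999, seed=0):
    """Family-wise adjusted p-values from the permutation null of the maximum.

    columns: one list of 0/1/2 genotype codes per marker (same subject order).
    is_case: list of booleans, True for cases.
    """
    rng = random.Random(seed)
    observed = [allele_count_difference(col, is_case) for col in columns]
    labels = list(is_case)
    exceed = [0] * len(columns)
    for _ in range(n_perm):
        # Permuting disease labels breaks any genotype-disease association
        # while preserving the correlation among markers.
        rng.shuffle(labels)
        max_stat = max(allele_count_difference(col, labels) for col in columns)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add one to numerator and denominator so no p-value is exactly zero.
    return [(1 + e) / (n_perm + 1) for e in exceed]


if __name__ == "__main__":
    status = [True] * 4 + [False] * 4
    markers = [[2, 2, 2, 2, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0, 1, 0]]
    print(max_stat_permutation_pvalues(markers, status))
```

Each permutation requires recomputing every test statistic, which is why a procedure of this kind becomes costly when each statistic itself requires an iterative likelihood maximization.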

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multiloci genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information, when available, or assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
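To make concrete how an EM algorithm treats missing phase, here is a minimal sketch in the spirit of the EM approach to haplotype frequency estimation cited above (Excoffier and Slatkin 1995). It estimates haplotype frequencies only, under Hardy–Weinberg equilibrium, from unphased SNP genotypes. It is not Lin and Zeng's algorithm, whose likelihood also involves regression parameters; the function names and the 0/1/2 minor-allele-count coding are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) consistent with an unphased genotype.

    genotype: tuple of minor-allele counts (0, 1, or 2), one per SNP.
    """
    options = []
    for g in genotype:
        if g == 0:
            options.append([(0, 0)])
        elif g == 2:
            options.append([(1, 1)])
        else:  # heterozygous site: either chromosome may carry the minor allele
            options.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*options):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """EM estimates of haplotype frequencies under Hardy-Weinberg equilibrium."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pair_lists:
            # E-step: posterior weight of each phase given current frequencies;
            # under HWE an ordered pair has probability freq[h1] * freq[h2].
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
        change = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    sample = [(0, 0), (0, 0), (2, 2), (1, 1)]
    for hap, f in em_haplotype_frequencies(sample).items():
        print(hap, round(f, 4))
```

The number of compatible pairs doubles with every heterozygous site, which is one reason a direct enumeration of this kind becomes impractical for long haplotypes.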

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments, I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69(suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment
Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates X we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model (3) applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining m_g(y, x; θ), where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that X and H are independent at the population level, not just conditional on G.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently, we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence of the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), "Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies," Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), "Prospective Analysis of Logistic Case-Control Studies," Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng's article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition, and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the "diplotypes" (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

\[
\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X, \beta_1), \tag{1}
\]

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.
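Both likelihoods sum over the diplotypes that are compatible with an unphased genotype. As a minimal sketch of that missing-data structure, the following Python function enumerates those diplotypes for biallelic SNPs. The coding of a genotype as a tuple of allele counts and the function name `consistent_diplotypes` are assumptions made for this illustration; they do not come from Lin and Zeng or from this comment.

```python
from itertools import product


def consistent_diplotypes(genotype):
    """Return all unordered haplotype pairs compatible with an unphased genotype.

    `genotype` holds, for each SNP, the number of copies (0, 1, or 2) of the
    "1" allele. A haplotype is a tuple of 0/1 alleles, one entry per SNP.
    """
    n_snps = len(genotype)
    pairs = set()
    for h1 in product((0, 1), repeat=n_snps):
        # The second haplotype is forced by the genotype once h1 is fixed.
        h2 = tuple(g - a for g, a in zip(genotype, h1))
        if all(allele in (0, 1) for allele in h2):
            pairs.add(tuple(sorted((h1, h2))))  # store as an unordered pair
    return sorted(pairs)


# A double heterozygote is phase-ambiguous: two diplotypes are compatible.
print(consistent_diplotypes((1, 1)))
# With at most one heterozygous SNP the phase is determined.
print(consistent_diplotypes((2, 1)))
```

A genotype with $s$ heterozygous SNPs has $2^{s-1}$ compatible diplotypes for $s \ge 1$, which is why the sums over compatible diplotypes grow quickly with the number of SNPs.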

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a "robust approach" for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

\[
\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d, \theta), \tag{2}
\]

where the model $q(h^d, \theta)$, in turn, could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

\[
\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x, \theta). \tag{3}
\]

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by "population stratification." In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.
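For concreteness, one instance of models (2) and (3) can be written out as follows. This is an illustrative sketch only: the haplotype labels $h_1, \dots, h_K$, the stratum variable $z$ (a component of $x$), and the coefficients $\gamma_k$ are notation assumed here, not taken from the comment. Under HWE with haplotype frequencies $\theta = (\theta_1, \dots, \theta_K)$, the frequency of the unordered diplotype $(h_k, h_l)$ is

```latex
\[
q\{(h_k, h_l), \theta\} =
\begin{cases}
\theta_k^2, & k = l,\\[2pt]
2\,\theta_k \theta_l, & k \neq l,
\end{cases}
\qquad \sum_{k=1}^{K} \theta_k = 1 .
\]
% A stratified version in the spirit of model (3): HWE holds within strata
% defined by z, with stratum-specific haplotype frequencies given by a
% polytomous logistic model (gamma_K = 0 for identifiability).
\[
\theta_k(z) = \frac{\exp(\gamma_k^{T} z)}{\sum_{j=1}^{K} \exp(\gamma_j^{T} z)},
\qquad
q\{(h_k, h_l), x, \theta\} =
\begin{cases}
\theta_k(z)^2, & k = l,\\[2pt]
2\,\theta_k(z)\,\theta_l(z), & k \neq l .
\end{cases}
\]
```

In the second display the parameter vector $\theta$ of model (3) would be the collection of the $\gamma_k$.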

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

\[
\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}
\]

and

\[
S(d, h, x) = \frac{q(h, x, \theta)\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},
\]

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

\[
L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h}\sum_{d} S(d, h, X)}.
\]

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that the parameters would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

\[
\ell^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}
\]
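The pseudolikelihood (4) is straightforward to evaluate once $q$ and $m$ are chosen. The following Python sketch does so under assumptions that are ours, not the authors': HWE with gene–environment independence as in model (2), a scalar covariate, a hypothetical additive form for $m$ with a target haplotype, and made-up haplotype frequencies and parameter values. It is not the implementation of Spinka et al. (2005).

```python
import math
from itertools import combinations_with_replacement

# Illustrative (hypothetical) haplotype frequencies; not taken from any study.
THETA = {"00": 0.4, "01": 0.3, "10": 0.1, "11": 0.2}
ALL_PAIRS = list(combinations_with_replacement(sorted(THETA), 2))


def q_hwe(pair, theta):
    """Diplotype frequency q(h, theta) under Hardy-Weinberg equilibrium."""
    a, b = pair
    return theta[a] ** 2 if a == b else 2.0 * theta[a] * theta[b]


def m_additive(pair, x, beta1, target="11"):
    """Hypothetical m(h, x, beta1): additive haplotype effect, covariate, interaction."""
    copies = pair.count(target)
    b_h, b_x, b_hx = beta1
    return b_h * copies + b_x * x + b_hx * copies * x


def kappa_from(beta0, n1, n0, pi):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}, with pi treated as known."""
    return beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))


def S(d, pair, x, beta0, kappa, beta1, theta):
    """S(d, h, x) = q(h, theta) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    lin = m_additive(pair, x, beta1)
    return q_hwe(pair, theta) * math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def L_star(d, compatible_pairs, x, beta0, kappa, beta1, theta):
    """L*(D, G, X): sum of S over H_G divided by the sum of S over all (d, h)."""
    num = sum(S(d, h, x, beta0, kappa, beta1, theta) for h in compatible_pairs)
    den = sum(S(dd, h, x, beta0, kappa, beta1, theta) for h in ALL_PAIRS for dd in (0, 1))
    return num / den


def pseudo_loglik(data, beta0, kappa, beta1, theta):
    """Equation (4); `data` is a list of (d, diplotypes compatible with G, x)."""
    return sum(math.log(L_star(d, hg, x, beta0, kappa, beta1, theta)) for d, hg, x in data)


beta0, beta1 = -3.0, (0.5, 0.2, 0.0)
kappa = kappa_from(beta0, n1=500, n0=500, pi=0.05)
# One case with the phase-ambiguous double heterozygote, one control homozygous for "00".
data = [(1, [("00", "11"), ("01", "10")], 1.0), (0, [("00", "00")], 0.0)]
print(pseudo_loglik(data, beta0, kappa, beta1, THETA))
```

In practice this function would be handed to a numerical optimizer over the unknown parameters; the sketch only evaluates it at fixed values. Note that the covariate distribution never appears, which is the computational point made in the text.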

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $\ell^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $\ell^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $\ell^*$.

An alternative representation of $\ell^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$, with $N_d$ the number of sampled subjects with $D = d$ (so that $N_1 = n_1$ and $N_0 = n_0$). Let $R = 1$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $\ell^*$ can be expressed in the form

\[
\begin{aligned}
\ell^* &= \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1)\,\mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}
\]
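The algebra behind (5) can be sketched as follows, assuming, as the Bernoulli scheme implies, that selection depends on $(D, H^d, X)$ only through $D$. This is our reconstruction of the intermediate steps, not a derivation quoted from Spinka et al. (2005).

```latex
% Step 1: Bayes' rule under selection, with pr(R = 1 | D = d) proportional to mu_d.
\[
\mathrm{pr}(D = d, H^d = h \mid X = x, R = 1)
= \frac{\mu_d\,\mathrm{pr}(D = d \mid h, x)\,q(h, x, \theta)}
       {\sum_{d'}\sum_{h'} \mu_{d'}\,\mathrm{pr}(D = d' \mid h', x)\,q(h', x, \theta)} .
\]
% Step 2: insert the logistic model (1) and note that
% mu_1 / mu_0 = (N_1 / N_0)(1 - pi)/pi, so log(mu_1 / mu_0) = kappa - beta_0.
\[
\mu_d\,\mathrm{pr}(D = d \mid h, x)
= \mu_0\,\frac{\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}} .
\]
% Step 3: mu_0 cancels between numerator and denominator.
\[
\mathrm{pr}(D = d, H^d = h \mid X = x, R = 1)
= \frac{S(d, h, x)}{\sum_{h'}\sum_{d'} S(d', h', x)},
\qquad
\mathrm{pr}(D, G \mid X, R = 1) = \sum_{h \in \mathcal{H}_G} \mathrm{pr}(D, H^d = h \mid X, R = 1) = L^*(D, G, X).
\]
```

Taking logarithms and summing over subjects gives the second line of (5).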

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an "ascertainment-corrected joint likelihood" of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $\ell^*$ given in (5) suggests that under model (2) or (3), with the covariate distribution $F(x)$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, which is a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equation for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data is given by

\[
0 = \sum_{i=1}^{N} \frac{\sum_{h^d \in \mathcal{H}_{G_i}} \dfrac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\;\mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\,q(h^d, \theta)}{\sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\,q(h^d, \theta)}. \tag{6}
\]

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

\[
0 = \sum_{i=1}^{N} \frac{\sum_{h^d \in \mathcal{H}_{G_i}} \dfrac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\;\mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\,r(h^d, X_i)\,q(h^d, \theta)}{\sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\,r(h^d, X_i)\,q(h^d, \theta)}, \tag{7}
\]

where

\[
r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x, \beta_1)\}}.
\]
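One way to see what the factor $r$ does is to note that $\mathrm{pr}_{\beta^*}(d \mid h, x)\,r(h, x)\,q(h, \theta)$ equals $q(h, \theta)\exp[d\{\kappa + m\}]/[1 + \exp\{\beta_0 + m\}]$, which is the quantity $S(d, h, x)$ of Section 1.1. The diplotype weights in (7) are therefore the posterior weights implied by the retrospective pseudolikelihood. The short Python check below illustrates this identity numerically. The function names and all numerical values are hypothetical, and the check is our own illustration rather than part of the cited methods.

```python
import math


def expit(u):
    """Logistic function 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def prospective_prob(d, lin, kappa):
    """pr_{beta*}(D = d | h, x), the prospective model with intercept kappa."""
    p = expit(kappa + lin)
    return p if d == 1 else 1.0 - p


def r_weight(lin, beta0, kappa):
    """r(h, x) = [1 + exp{kappa + m}] / [1 + exp{beta0 + m}]."""
    return (1.0 + math.exp(kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def modified_weights(d, lins, qs, beta0, kappa):
    """Normalized diplotype weights appearing in the modified score equation (7)."""
    w = [prospective_prob(d, lin, kappa) * r_weight(lin, beta0, kappa) * q
         for lin, q in zip(lins, qs)]
    total = sum(w)
    return [wi / total for wi in w]


def retrospective_weights(d, lins, qs, beta0, kappa):
    """Normalized weights S(d, h, x) / sum_h S(d, h, x) over the compatible diplotypes."""
    w = [q * math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin))
         for lin, q in zip(lins, qs)]
    total = sum(w)
    return [wi / total for wi in w]


# Hypothetical values of m(h, x, beta1) and q(h, theta) for two compatible diplotypes.
lins, qs = [0.7, 0.0], [0.16, 0.06]
beta0, kappa = -3.0, -0.05
for d in (0, 1):
    print(d, modified_weights(d, lins, qs, beta0, kappa),
          retrospective_weights(d, lins, qs, beta0, kappa))
```

Without the factor $r$, the weights in (6) would be proportional to $\mathrm{pr}_{\beta^*}(d \mid h, x)\,q(h, \theta)$, which generally differ from the retrospective weights unless $\kappa = \beta_0$.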

Spinka et al. showed that with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $\ell^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazard (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

\[
\lambda[t \mid H^d, X] = \lambda_0(t)\,R(H^d, X, \beta_1), \tag{8}
\]

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\mathrm{pr}(H^d) = q(H^d, \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

\[
\lambda[t \mid G, X] = \lambda_0(t)\,R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}
\]

where

\[
\begin{aligned}
R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}
&= E\{R(H^d, X, \beta_1) \mid G, X, T \ge t\} \\
&= \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\,\mathrm{pr}[T > t \mid H^d, X]\,q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^d, X]\,q(H^d, \theta)}.
\end{aligned}
\]

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function,

\[
R^*(G, X, \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\,q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d, \theta)},
\]

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat\theta)$, where $\hat\theta$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
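Under the rare disease approximation, the induced relative risk is simply a $q$-weighted average of $R$ over the diplotypes compatible with $G$. The Python sketch below computes it under assumptions of our own: HWE for $q$, a hypothetical log-linear form for $R$ with a target haplotype and a scalar covariate, and made-up values for $\hat\theta$ and $\beta_1$. It illustrates the formula only and is not the software of Chen and Chatterjee (2005).

```python
import math


def q_hwe(pair, theta):
    """Diplotype frequency q(H^d, theta) under Hardy-Weinberg equilibrium."""
    a, b = pair
    return theta[a] ** 2 if a == b else 2.0 * theta[a] * theta[b]


def rel_risk(pair, x, beta1, target="11"):
    """Hypothetical R(H^d, X, beta1) = exp(b_h * copies of target + b_x * x)."""
    b_h, b_x = beta1
    return math.exp(b_h * pair.count(target) + b_x * x)


def induced_rel_risk(compatible_pairs, x, beta1, theta_hat):
    """R*(G, X, beta1, theta_hat): q-weighted average of R over diplotypes in H_G."""
    num = sum(rel_risk(h, x, beta1) * q_hwe(h, theta_hat) for h in compatible_pairs)
    den = sum(q_hwe(h, theta_hat) for h in compatible_pairs)
    return num / den


# Illustrative plug-in haplotype frequencies (hypothetical values).
theta_hat = {"00": 0.4, "01": 0.3, "10": 0.1, "11": 0.2}
ambiguous = [("00", "11"), ("01", "10")]  # the double heterozygote
print(induced_rel_risk(ambiguous, 0.0, (math.log(2.0), 0.0), theta_hat))
```

Because the result depends on the data only through $(G, X)$ and on no baseline hazard, it can be supplied to a partial likelihood in place of the usual relative risk term, which is the design choice the text describes.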

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases, such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat\theta)$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), "Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions," Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), "Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects," Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Prentice, R. L. (1982), "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model," Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), "Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity," Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data, and potential violations of "Hardy–Weinberg equilibrium (HWE)" in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual; that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
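To get a feel for how small the departure from HWE is at realistic values of $\rho$, the homozygote model above can be completed into a full diplotype distribution. The sketch below assumes the standard companion formula $\pi_{kl} = 2(1 - \rho)\pi_k\pi_l$ for unordered heterozygous pairs $k \ne l$, which is not stated in the text but is what makes the frequencies sum to 1. The function name and the numerical inputs are illustrative.

```python
def diplotype_freqs(pi, rho):
    """Unordered diplotype frequencies under the inbreeding model.

    Homozygotes:   pi_kk = rho * pi_k + (1 - rho) * pi_k ** 2   (as in the text)
    Heterozygotes: pi_kl = 2 * (1 - rho) * pi_k * pi_l, k < l   (assumed companion formula)
    """
    freqs = {}
    n_hap = len(pi)
    for k in range(n_hap):
        for l in range(k, n_hap):
            if k == l:
                freqs[(k, l)] = rho * pi[k] + (1.0 - rho) * pi[k] ** 2
            else:
                freqs[(k, l)] = 2.0 * (1.0 - rho) * pi[k] * pi[l]
    return freqs


pi = [0.2, 0.3, 0.5]  # illustrative haplotype frequencies
for rho in (0.0, 0.015, 0.05):
    f = diplotype_freqs(pi, rho)
    homozygosity = sum(v for (k, l), v in f.items() if k == l)
    print(rho, round(homozygosity, 4))
```

With $\rho = 0$ the function reduces to HWE, and total homozygosity increases linearly in $\rho$ from $\sum_k \pi_k^2$.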

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
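The confounding mechanism described above can be made concrete with a small numerical example. In the Python sketch below, the subpopulation sizes, carrier frequencies, and disease risks are invented for illustration. Within each subpopulation the marker is independent of disease, yet pooling the two subpopulations produces a marked association.

```python
def expected_table(n, carrier_freq, risk):
    """Expected 2x2 counts when disease is independent of carrier status.

    Returns (carrier cases, carrier controls, noncarrier cases, noncarrier controls).
    """
    carriers = n * carrier_freq
    noncarriers = n - carriers
    return (carriers * risk, carriers * (1.0 - risk),
            noncarriers * risk, noncarriers * (1.0 - risk))


def odds_ratio(table):
    """Odds ratio (a * d) / (b * c) for a table (a, b, c, d) laid out as above."""
    a, b, c, d = table
    return (a * d) / (b * c)


# Hypothetical subpopulations: the marker is common where the disease is common.
pop_a = expected_table(1000, carrier_freq=0.8, risk=0.20)
pop_b = expected_table(1000, carrier_freq=0.2, risk=0.05)
pooled = tuple(x + y for x, y in zip(pop_a, pop_b))

print("OR within A:", odds_ratio(pop_a))
print("OR within B:", odds_ratio(pop_b))
print("OR pooled:  ", odds_ratio(pooled))
```

Stratifying by subpopulation recovers the null odds ratio in each stratum, which is the rationale for the stratified and matched designs mentioned in the text.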

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists tracing back to Fisher have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. Naive haplotype analysis performs poorly because it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods are needed that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases and that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and controls using (12) from LZ, with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ's rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario, both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
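The distinction between scenarios (A)/(B) and scenario (C) comes down to which haplotypes carry the causal allele. The few lines of Python below spell that out for the three haplotypes used in the simulation. Coding the causal allele as "1" is an assumption made for this sketch, chosen because it reproduces the mappings described in the text. The function name is hypothetical.

```python
# The three simulated haplotypes, written as allele strings over two SNP positions.
HAPLOTYPES = {"h1": "11", "h2": "01", "h3": "00"}


def carriers_of_allele(position, allele="1"):
    """Names of haplotypes carrying `allele` at the given 1-based SNP position."""
    return [name for name, seq in HAPLOTYPES.items() if seq[position - 1] == allele]


# Causal SNP in position 1 (scenarios A and B): only h1 carries the allele.
print(carriers_of_allele(1))
# Causal SNP in position 2 (scenario C): h1 and h2 both carry it.
print(carriers_of_allele(2))
```

In scenario (C), a model that contrasts $h_1$ with all other haplotypes places the risk-carrying $h_2$ in the reference group, which is one way to understand why the estimated effect of $h_1$ is distorted.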

Table 1. Bias and Mean Squared Error (MSE) of $\hat\beta_1$

β1    Scenario                          Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of $\hat\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. In contrast, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

114 Journal of the American Statistical Association March 2006

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment
Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curve. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment to the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedure for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of the haplotype-based genetic association analysis.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000853


the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that known biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mediating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.


Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

their work. We have also greatly benefited from personal communications with them.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and that of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal


SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


Studies of Related Individuals. This article is concerned with studies of unrelated individuals. Many genetic studies involve multiple family members or relatives. Haplotype ambiguity possibly can be reduced by using the genotype information from related individuals. Inference on haplotype effects needs to account for the intraclass correlation.

Genotyping Error and DNA Pooling. Laboratory genotyping is prone to error. It is sometimes necessary to pool DNA samples rather than genotyping individual samples (Wang, Kidd, and Zhao 2003). Such data create additional complexity in haplotype analysis (Zeng and Lin 2005).

Many SNPs. The traditional EM algorithm works well for a small number of SNPs. When the number of SNPs is large, the partition–ligation method of Niu et al. (2002) and Qin et al. (2002) and other modifications potentially can be adapted to reduce the computational burden. However, the haplotype analysis may not be very useful if the SNPs are weakly linked.
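As a rough illustration of the traditional EM algorithm for a small number of SNPs, the following Python sketch estimates haplotype frequencies from unphased genotypes. It assumes HWE (random pairing of haplotypes), biallelic SNPs coded as counts of the "1" allele, and no covariates or phenotype; the function names are hypothetical. It is a textbook version of the algorithm, not the authors' implementation, and it enumerates all 2^m haplotypes, which is exactly why it becomes impractical when the number of SNPs m is large.

```python
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs (h, h2) whose allele counts sum to genotype."""
    options = []
    for count in genotype:
        if count == 0:
            options.append([(0, 0)])
        elif count == 2:
            options.append([(1, 1)])
        else:  # heterozygous site: phase is unknown
            options.append([(0, 1), (1, 0)])
    pairs = set()
    for combo in product(*options):
        h = tuple(c[0] for c in combo)
        h2 = tuple(c[1] for c in combo)
        pairs.add(tuple(sorted((h, h2))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimates of haplotype frequencies under HWE."""
    counts = Counter(genotypes)
    n_subjects = len(genotypes)
    haplotypes = list(product((0, 1), repeat=len(genotypes[0])))
    pi = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        expected = dict.fromkeys(haplotypes, 0.0)
        for g, n_g in counts.items():
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each phase given current pi.
            weights = [pi[h] * pi[h2] * (1 if h == h2 else 2) for h, h2 in pairs]
            total = sum(weights)
            for (h, h2), w in zip(pairs, weights):
                expected[h] += n_g * w / total
                expected[h2] += n_g * w / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        pi = {h: expected[h] / (2 * n_subjects) for h in haplotypes}
    return pi


if __name__ == "__main__":
    data = [(0, 0)] * 10 + [(2, 2)] * 10 + [(1, 1)] * 5
    for hap, freq in em_haplotype_freqs(data).items():
        print(hap, round(freq, 4))
```

Only genotypes with two or more heterozygous sites contribute phase ambiguity; all other genotypes have a single compatible pair and are counted directly.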

Many Haplotypes and Rare Haplotypes. The approach taken in this article assumes that we are interested in a small number of haplotype configurations that are relatively frequent. If there are many haplotypes, then we are confronted with the problem of multiple comparisons and sparse data. Schaid (2004) discussed some possible solutions.

Large-Scale Studies. There is an increasing interest in genome-wide association studies. With a large number of SNPs, one possible approach is to use sliding windows of 5–10 SNPs and test for the haplotype–disease association in each window. Because most of the SNPs are common between adjacent windows, the test statistics tend to be highly correlated, so that the Bonferroni-type correction for multiple comparisons would be extremely conservative. To properly adjust for multiple comparisons, one needs to ascertain the joint distribution of the test statistics. This can be done by permuting the data or by evaluating the asymptotic joint normal distribution of the test statistics (Lin 2005).
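The permutation route to the joint distribution can be sketched in Python as follows. The sketch assumes a generic per-window test statistic; `abs_mean_diff`, the per-subject window scores, and the toy data are placeholders invented for illustration and stand in for a haplotype-based test such as a likelihood-ratio statistic. Each permutation shuffles the phenotype once and records the maximum statistic over all windows, so the correlation between overlapping windows is preserved in the null distribution.

```python
import random


def abs_mean_diff(phenotype, scores):
    """Placeholder statistic: |mean score in cases - mean score in controls|."""
    cases = [s for s, y in zip(scores, phenotype) if y == 1]
    controls = [s for s, y in zip(scores, phenotype) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def max_stat_adjusted_pvalues(stat_fn, phenotype, windows, n_perm=1000, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum
    statistic over all windows (single-step max-T adjustment)."""
    rng = random.Random(seed)
    observed = [stat_fn(phenotype, w) for w in windows]
    exceed = [0] * len(windows)
    y = list(phenotype)
    for _ in range(n_perm):
        rng.shuffle(y)  # break the phenotype-genotype link, keep window correlation
        max_null = max(stat_fn(y, w) for w in windows)
        for j, t in enumerate(observed):
            if max_null >= t:
                exceed[j] += 1
    # Add-one correction keeps every p-value strictly positive.
    return [(c + 1) / (n_perm + 1) for c in exceed]


if __name__ == "__main__":
    phenotype = [1] * 20 + [0] * 20
    window_a = list(phenotype)                # toy window tracking disease status
    window_b = [i % 2 for i in range(40)]     # toy window unrelated to disease
    print(max_stat_adjusted_pvalues(abs_mean_diff, phenotype,
                                    [window_a, window_b], n_perm=500))
```

Unlike a Bonferroni correction, the adjustment does not grow more conservative as adjacent windows share more SNPs, because the maximum is taken over the same correlated set of statistics in every permutation.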

We hope that other statisticians will join us in tackling the foregoing problems and other challenges in genetic association studies.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

A.1 Proof of Lemma 1

We provide a proof under (3); the proof under (4) is simpler and is omitted here. To prove the first part of the lemma, we suppose that two sets of parameters, $(\{\pi_k\}, \rho)$ and $(\{\tilde\pi_k\}, \tilde\rho)$, yield the same distribution of $G$. We wish to show that these two sets are identical. Consider $g = 2h_k$. For such a choice of $g$, the set $S(g)$ is a singleton. Clearly,

\[ (1-\rho)\pi_k^2 + \rho\pi_k = (1-\tilde\rho)\tilde\pi_k^2 + \tilde\rho\tilde\pi_k . \]

We denote this constant by $c_k$. Then $0 \le c_k \le 1$ for all $k$, and $0 < c_k < 1$ for at least one $k$. Because $\pi_k \ge 0$, we have

\[ \pi_k = \frac{-\rho + \{\rho^2 + 4c_k(1-\rho)\}^{1/2}}{2(1-\rho)} . \]

Thus $(1-\rho)^{-1}$ satisfies the equation

\[ \sum_k \bigl[(1-x) + \{(x-1)^2 + 4c_k x\}^{1/2}\bigr] = 2, \]

and $(1-\tilde\rho)^{-1}$ satisfies the same equation. It can be shown that the first derivative of $(1-x) + \{(x-1)^2 + 4c_k x\}^{1/2}$ is nonpositive and is strictly negative for at least one $k$. Thus the foregoing equation has a unique solution for $x > 1$, which implies that $\rho = \tilde\rho$. It follows immediately that $\pi_k = \tilde\pi_k$ for all $k$.

To prove the second part of the lemma, we choose $g = 2h_k$ to obtain $\nu_k\{2\pi_k(1-\rho) + \rho\} + \mu\pi_k(1-\pi_k) = 0$. Because $\sum_k \nu_k = 0$, we have $\sum_k \mu\pi_k(1-\pi_k)/\{2\pi_k(1-\rho) + \rho\} = 0$. Therefore $\mu = 0$ and $\nu = 0$.
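The first part of the argument is constructive, and a small numerical sketch may make it concrete. Under the assumption that $\rho \ge 0$ (so that $x = (1-\rho)^{-1} \ge 1$), the homozygote probabilities $c_k$ determine $x$ as the root of a decreasing function, which bisection can locate. The function names below are hypothetical and the code is an illustration only, not part of the authors' software.

```python
import math


def pi_given_x(c, x):
    """pi_k as a function of x = 1/(1 - rho), solved from
    (1 - rho) * pi^2 + rho * pi = c with pi >= 0."""
    return 0.5 * ((1.0 - x) + math.sqrt((x - 1.0) ** 2 + 4.0 * c * x))


def recover_rho_pi(c_list, tol=1e-13):
    """Recover (rho, pi) from homozygote probabilities c_k, assuming rho >= 0."""
    def excess(x):
        # sum_k pi_k(x) - 1 is nonincreasing in x, as noted in the proof
        return sum(pi_given_x(c, x) for c in c_list) - 1.0

    lo, hi = 1.0, 2.0
    while excess(hi) > 0.0 and hi < 1e12:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return 1.0 - 1.0 / x, [pi_given_x(c, x) for c in c_list]
```

For instance, with $\pi = (0.5, 0.3, 0.2)$ and $\rho = 0.1$, the values $c_k = (1-\rho)\pi_k^2 + \rho\pi_k$ are passed in and the same $\pi$ and $\rho$ are returned up to numerical tolerance.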

A.2 Cross-Sectional Studies

A.2.1 Identifiability Under Arbitrary Distributions of H. Under Condition 1, $(\alpha, \beta, \xi)$ is identifiable. The identifiability of the distribution of $H$ depends on the structure of $P_{\alpha,\beta,\xi}$. For concreteness, we consider the codominant logistic regression model for a binary trait. We divide $\mathcal{G}$ into three categories: $\mathcal{G}_1 = \{g \in \mathcal{G}: g = h + h \text{ or } g = h + \tilde h\}$, $\mathcal{G}_2 = \{g \in \mathcal{G} - \mathcal{G}_1: g \not\ge h^*\}$, and $\mathcal{G}_3 = \mathcal{G} - \mathcal{G}_1 - \mathcal{G}_2$. We derive the expression for $m_g(y, x; \theta)$ when $g$ belongs to each of the three categories.

For $g \in \mathcal{G}_1$, $S(g) = \{(h, h)\}$ or $\{(h, \tilde h)\}$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, h))P(H = (h, h))$ or $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h, \tilde h))P(H = (h, \tilde h))$. For $g \in \mathcal{G}_2$, $P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))$ does not depend on $(h_k, h_l) \in S(g)$, so that $m_g(y, x; \theta) = P_{\alpha,\beta,\xi}(Y = y \mid X = x, H = (h_k, h_l))P(G = g)$, where $(h_k, h_l) \in S(g)$. For $g \in \mathcal{G}_3$,

\[ m_g(y, x; \theta) = \frac{\exp\{y(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}}{1 + \exp(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)}\,\pi_1(g) + \frac{\exp\{y(\alpha + \beta_3^T x)\}}{1 + \exp(\alpha + \beta_3^T x)}\,\pi_2(g), \]

where $\pi_1(g) = 2P(H = (h^*, g - h^*))$ and $\pi_2(g) = P(H = (h_k, h_l): h_k + h_l = g, h_k \ne h^*, h_l \ne h^*)$.

Let $\theta_0$ denote the true value of $\theta$, $P_0(G = g)$ denote the true value of $P(G = g)$, and $\pi_{0j}(g)$ denote the true values of $\pi_j(g)$, $j = 1, 2$. We then can draw the following conclusions: (1) When $\beta_{01} = 0$ and $\beta_{04} = 0$, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, and $P(G = g) = P_0(G = g)$ for any $g \in \mathcal{G}$; and (2) when either $\beta_{01}$ or $\beta_{04}$ is nonzero, $m_g(y, x; \theta) = m_g(y, x; \theta_0)$ if and only if $\alpha = \alpha_0$, $\beta = \beta_0$, $P(G = g) = P_0(G = g)$ for $g \in \mathcal{G}_1 \cup \mathcal{G}_2$, and $\pi_j(g) = \pi_{0j}(g)$ for $g \in \mathcal{G}_3$ and $j = 1, 2$. These conclusions hold for any generalized linear model with the linear predictor given in (1).

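A.2.2 EM Algorithm. The complete-data likelihood is proportional to $\prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i)$. The expectation of the logarithm of this function conditional on the observable data $(Y_i, X_i, G_i)$, $i = 1, \ldots, n$, is

\[ \sum_{i=1}^n \sum_{(h_k,h_l) \in S(G_i)} p_{ikl}(\theta)\bigl\{\log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l)) + \log P_\gamma(h_k,h_l)\bigr\}, \]

where

\[ p_{ikl}(\theta) = \frac{P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_{k'},h_{l'}) \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_{k'},h_{l'}))P_\gamma(h_{k'},h_{l'})} . \]

Thus in the $(m+1)$st iteration of the EM algorithm, we evaluate $p_{ikl}(\theta)$ at the current estimate $\theta^{(m)}$ and obtain $\theta^{(m+1)}$ by solving the following equations through the Newton–Raphson algorithm:

\[ \sum_{i=1}^n \sum_{(h_k,h_l) \in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l)) = 0, \]

\[ \sum_{i=1}^n \sum_{(h_k,h_l) \in S(G_i)} p_{ikl}(\theta^{(m)})\,\nabla_\gamma \log P_\gamma(h_k,h_l) = 0. \tag{A.1} \]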
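The E-step weights $p_{ikl}(\theta)$ can be sketched for one subject as follows. This is an illustration under stated assumptions, not the authors' implementation: the trait model is taken to be an additive logistic model in the number of copies of a target haplotype, without covariates, and $P_\gamma$ is taken to be the Hardy–Weinberg product $\pi_k\pi_l$. The names `logistic_prob`, `e_step_weights`, `hap_freq`, and `target` are hypothetical.

```python
import math


def logistic_prob(y, eta):
    """P(Y = y) for a binary trait with linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p


def e_step_weights(y, pairs, hap_freq, alpha, beta, target):
    """Posterior weights over the ordered haplotype pairs in S(G_i).

    Assumptions (for illustration only): additive logistic trait model and
    Hardy-Weinberg equilibrium, so P_gamma(h_k, h_l) = pi_k * pi_l.
    """
    unnormalized = []
    for k, l in pairs:
        copies = (k == target) + (l == target)  # copies of the target haplotype
        trait = logistic_prob(y, alpha + beta * copies)
        unnormalized.append(trait * hap_freq[k] * hap_freq[l])
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With two SNPs and haplotypes indexed 0 to 3 (00, 01, 10, 11), a double heterozygote is compatible with the ordered pairs (0, 3), (3, 0), (1, 2), and (2, 1). When `beta` is 0 the weights reduce to the normalized products of haplotype frequencies, and a positive `beta` shifts weight toward pairs carrying the target haplotype for a case.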

Under (3) with $\rho \ge 0$, the estimate of $\gamma \equiv (\rho, \{\pi_k\})$ can be obtained in a closed form rather than by solving (A.1). Let $B$ be a Bernoulli variable with success probability $\rho$, let $Q_1$ be a discrete random variable taking values in $\mathcal{H}$ with $P(Q_1 = (h_k,h_l)) = \delta_{kl}\pi_k$, and let $Q_2$ be another discrete random variable taking values in $\mathcal{H}$ with $P(Q_2 = (h_k,h_l)) = \pi_k\pi_l$. Then $H$ has the same distribution as $BQ_1 + (1-B)Q_2$. The complete-data likelihood can be represented by

\[ \prod_{i=1}^n P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_k \pi_k^{I(Q_{1i} = (h_k,h_k))B_i} \times \prod_{k,l} (\pi_k\pi_l)^{I(Q_{2i} = (h_k,h_l))(1-B_i)}\,\rho^{B_i}(1-\rho)^{1-B_i} . \]

The corresponding score equations for $\pi_k$ and $\rho$ satisfy

\[ \pi_k = c^{-1}\Bigl[\sum_{i=1}^n B_i I(Q_{1i} = (h_k,h_k)) + \sum_{i=1}^n \sum_{l=1}^K (1-B_i)\bigl\{I(Q_{2i} = (h_k,h_l)) + I(Q_{2i} = (h_l,h_k))\bigr\}\Bigr] \]

and $\rho = n^{-1}\sum_{i=1}^n B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define

\[ E\{\omega(B_i,Q_{1i},Q_{2i}) \mid Y_i,X_i,G_i\} = \sum_{bq_1+(1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1+(1-b)q_2)\,p(b,q_1,q_2)\,\omega(b,q_1,q_2) \times \Bigl[\sum_{bq_1+(1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, bq_1+(1-b)q_2)\,p(b,q_1,q_2)\Bigr]^{-1}, \]

where $\omega(B,Q_1,Q_2) = BI(Q_1 = (h_k,h_k))$, $(1-B)I(Q_2 = (h_k,h_l))$, or $B$, and

\[ p(b,q_1,q_2) = \prod_k \pi_k^{bI(q_1 = (h_k,h_k))} \times \prod_{k,l} (\pi_k\pi_l)^{(1-b)I(q_2 = (h_k,h_l))}\,\rho^b(1-\rho)^{1-b} . \]

In the $(m+1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:

\[ \pi_k^{(m+1)} = \frac{1}{c^{(m+1)}}\Bigl[\sum_{i=1}^n E^{(m)}\{B_i I(Q_{1i} = (h_k,h_k))\} + 2\sum_{i=1}^n \sum_{l=1}^K E^{(m)}\{(1-B_i)I(Q_{2i} = (h_k,h_l))\}\Bigr] \]

and $\rho^{(m+1)} = n^{-1}\sum_{i=1}^n E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i,Q_{1i},Q_{2i})\}$ is $E\{\omega(B_i,Q_{1i},Q_{2i}) \mid Y_i,X_i,G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.
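The closed-form updates can be sketched in a simplified setting. As a labeled assumption, the trait factor $P_{\alpha,\beta,\xi}$ is dropped, so the code below fits only the genotype model (3) with $\rho \ge 0$ from genotype data; the function name `em_inbreeding` and the weighted-data input format are hypothetical. Each observation supplies the ordered haplotype pairs compatible with its genotype, and the latent Bernoulli variable $B$ is integrated out in the E-step.

```python
def em_inbreeding(data, n_hap, n_iter=2000):
    """EM for (pi, rho) under P(H=(h_k,h_l)) = rho*delta_kl*pi_k + (1-rho)*pi_k*pi_l.

    data: list of (pairs, weight), where pairs lists the ordered haplotype
    index pairs in S(g) and weight is the count (or proportion) of that genotype.
    The trait model is omitted here (an assumption made for brevity).
    """
    pi = [1.0 / n_hap] * n_hap
    rho = 0.5
    total_weight = sum(w for _, w in data)
    for _ in range(n_iter):
        expected_b = 0.0
        counts = [0.0] * n_hap
        for pairs, w in data:
            terms = []  # (probability, b, k, l) for every latent configuration
            for k, l in pairs:
                terms.append(((1.0 - rho) * pi[k] * pi[l], 0, k, l))
                if k == l:
                    terms.append((rho * pi[k], 1, k, l))
            norm = sum(t[0] for t in terms)
            for prob, b, k, l in terms:
                post = w * prob / norm
                if b == 1:
                    expected_b += post
                    counts[k] += post  # Q1 = (h_k, h_k) contributes one factor pi_k
                else:
                    counts[k] += post  # Q2 = (h_k, h_l) contributes pi_k * pi_l
                    counts[l] += post
        rho = expected_b / total_weight
        norm_pi = sum(counts)
        pi = [c / norm_pi for c in counts]
    return pi, rho
```

With a single SNP, $\pi = (0.7, 0.3)$, and $\rho = 0.2$, supplying the exact genotype probabilities (0.532, 0.336, 0.132) as weights returns estimates close to the generating values, which is a convenient check that the M-step counts each $\pi_k$ factor correctly.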

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is

\[ \prod_{i=1}^N P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)P_\gamma(H_i) \prod_g f_g(X_i)^{I(G_i = g)} . \]

The M-step solves the following equations for $\theta$:

\[ \sum_{i=1}^N I(R_i = 1)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i\} = 0, \]

\[ \sum_{i=1}^N I(R_i = 1)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i, X_i, G_i\} + \sum_{i=1}^N I(R_i = 0)E\{\nabla_\gamma \log P_\gamma(H_i) \mid Y_i\} = 0, \tag{A.2} \]

and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

\[ F_g\{X_i\} = \Bigl[\sum_{j=1}^N I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(X_j = X_i, G_j = g) \mid Y_j\}\Bigr] \times \Bigl[\sum_{j=1}^N I(G_j = g, R_j = 1) + \sum_{j=1}^N I(R_j = 0)E\{I(G_j = g) \mid Y_j\}\Bigr]^{-1}, \]

where the conditional expectations are evaluated at the current estimates of $\theta$ and $\{F_g\}$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form

\[ \frac{\sum_{(h_k,h_l) \in S(G_i)} \omega(Y_i, X_i, (h_k,h_l))P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)}{\sum_{(h_k,h_l) \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k,h_l))P_\gamma(h_k,h_l)} \]

for $R_i = 1$, and

\[ \sum_{g \in \mathcal{G}} \sum_{x \in \{X_i: G_i = g, R_i = 1\}} \sum_{(h_k,h_l) \in S(g)} \omega(Y_i, x, (h_k,h_l))P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k,h_l))P_\gamma(h_k,h_l)F_g\{x\} \times \Bigl(\sum_{g \in \mathcal{G}} \sum_{x \in \{X_i: G_i = g, R_i = 1\}} \sum_{(h_k,h_l) \in S(g)} P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k,h_l))P_\gamma(h_k,h_l)F_g\{x\}\Bigr)^{-1} \]

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (2.3) of Breslow et al. The key difference is that the former involves several nonparametric components, $\{F_g(\cdot)\}$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1, with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \{\hat F_g(\cdot)\})$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


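A.4 Case-Control Studies With Unknown Population Totals

A.4.1 Equivalence Class. Suppose that two sets of parameters, $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$, yield the same likelihood,

\[ R_L(\theta, \{F_g^\dagger\}, \{q_g\}) = R_L(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A.3} \]

Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x)q_g / \sum_{g \in \mathcal{G}} q_g = \tilde f_g^\dagger(x)\tilde q_g / \sum_{g \in \mathcal{G}} \tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

\[ \eta(y, x, g; \theta) = C(y)\,\eta(y, x, g; \tilde\theta), \tag{A.4} \]

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}): \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that

\[ \eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5} \]

for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k,h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

\[ \frac{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + 1} = \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + 1}, \tag{A.6} \]

where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results:

1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \ne 1$, (A.6) yields

\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}} . \]

Thus (A.6) is equivalent to

\[ \frac{\pi_1(g)/\pi_2(g)}{\pi_1(\tilde g)/\pi_2(\tilde g)} = \frac{\tilde\pi_1(g)/\tilde\pi_2(g)}{\tilde\pi_1(\tilde g)/\tilde\pi_2(\tilde g)} \quad \text{for all } g, \tilde g \in \mathcal{G}_3 . \]

3. $\beta_1 \ne 0$, $\beta_4 = 0$, and $\beta_{3z} \ne 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z}z \ne 0$, (A.6) yields

\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\beta_{3z}z}}{1 + e^{\alpha+\beta_1+\beta_{3z}z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\beta_{3z}z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z}z}} . \]

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \ne 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

\[ \frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \tag{A.7} \]

for any $x$ such that $\beta_1 + \beta_4^T x \ne 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.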

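A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \ne 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\Bigr\}, \tag{A.8} \]

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\Bigr\}, \tag{A.9} \]

and

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\Bigr\}, \tag{A.10} \]

where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\Bigl\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\Bigr\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\Bigl\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\Bigr\} . \]

By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain

\[ \frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2} . \]

By taking the ratio of the two sides and letting $z \to \mathrm{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,

\[ \eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\Bigl\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x)\Bigr\} \times \Bigl[\bigl\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\bigr\}\frac{\pi_1(g)}{\pi_2(g)} + 1 - \Phi(\alpha + \beta_3^T x)\Bigr]^{-1} . \tag{A.11} \]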


It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11), with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

\[ \frac{e^{-e^{\alpha}}\bigl(e^{e^{\alpha+\beta_z z}} - 1\bigr)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}\bigl(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1\bigr)}{1 - e^{-e^{\tilde\alpha}}} . \]

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

\[ l_n(\theta, \{\delta_j\}) = \sum_{j=1}^J n_{1j} \log \eta(1, x_j, g_j; \theta) - n_1 \log\Bigl\{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j\Bigr\} + \sum_{j=1}^J n_{+j} \log \delta_j, \]

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

\[ \frac{n_{+j}}{\delta_j} - \frac{n_1\eta(1, x_j, g_j; \theta)}{\sum_{j'=1}^J \eta(1, x_{j'}, g_{j'}; \theta)\delta_{j'}} + \lambda = 0. \]

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

\[ \delta_j = \frac{n_{+j}}{n - n_1 + n_1\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12} \]

where $\mu = \sum_{j=1}^J \eta(1, x_j, g_j; \theta)\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

\[ l_n^*(\theta, \mu) = \sum_{j=1}^J n_{1j} \log \eta(1, x_j, g_j; \theta) - \sum_{j=1}^J n_{+j} \log\Bigl\{\frac{n_1}{n}\eta(1, x_j, g_j; \theta) + \Bigl(1 - \frac{n_1}{n}\Bigr)\mu\Bigr\} + (n - n_1) \log \mu . \]

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_\mu l_n^*(\theta, \mu) + C_n$. If $\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
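The link between the stationarity condition in $\mu$ and the constraint on the jump sizes can be checked numerically. The sketch below holds $\theta$ fixed, so the values $\eta(1, x_j, g_j; \theta)$ enter as given positive numbers; setting $\partial l_n^*/\partial\mu = 0$ is algebraically the same as requiring the $\delta_j$ of (A.12) to sum to 1, and the code finds that $\mu$ by bisection rather than by the Newton–Raphson iteration used in the article. The function name `solve_mu` is hypothetical, and the code assumes $0 < n_1 < n$.

```python
def solve_mu(eta, n_plus, n1, tol=1e-12):
    """Find mu such that the jump sizes of (A.12) sum to one.

    eta:    positive values eta(1, x_j, g_j; theta) at a fixed theta
    n_plus: counts n_{+j} of each distinct (x_j, g_j) among cases and controls
    n1:     number of cases, with 0 < n1 < n = sum(n_plus)
    """
    n = sum(n_plus)

    def delta(mu):
        # jump sizes from (A.12)
        return [m / (n - n1 + n1 * e / mu) for e, m in zip(eta, n_plus)]

    lo, hi = 1e-12, 1.0
    while sum(delta(hi)) < 1.0:  # the sum increases with mu toward n / (n - n1) > 1
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if sum(delta(mid)) < 1.0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    return mu, delta(mu)
```

At the returned $\mu$ the jump sizes also satisfy $\mu = \sum_j \eta(1, x_j, g_j; \theta)\delta_j$, which is the self-consistency used in passing from $l_n$ to $l_n^*$.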

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define

\[ \zeta_1(x, g; \theta) = \sum_{(h_k,h_l) \in S(g)} e^{\beta^T Z(x,h_k,h_l)}\{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}, \]

\[ \zeta_0(g; \theta) = \sum_{(h_k,h_l) \in S(g)} \{\rho\pi_k\delta_{kl} + (1-\rho)\pi_k\pi_l\}. \]

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

\[ l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^n \{y_i \log \zeta_1(X_i, G_i; \theta) + (1 - y_i) \log \zeta_0(G_i; \theta)\} - \sum_{i=1}^n \sum_g I(G_i = g) \log\Bigl\{\zeta_1(X_i, G_i; \theta) + n_1^{-1}n_g \sum_{\tilde g} \mu_{\tilde g} - \mu_g\Bigr\} + \sum_{i=1}^n (1 - y_i) \log\Bigl(\sum_g \mu_g\Bigr), \]

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

\[ l_n^*(\theta, \mu) = \sum_{i=1}^n y_i \log \zeta_1(X_i, G_i; \theta) + \sum_{i=1}^n (1 - y_i) \log \zeta_0(G_i; \theta) + \sum_{i=1}^n (1 - y_i) \log \mu - \sum_{i=1}^n \log\Bigl\{(1 - r)\mu + r\sum_g \zeta_1(X_i, g; \theta)\Bigr\}, \]

where $r = n_1/n$. Let $H = BQ_1 + (1-B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k): k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l): k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by

\[ P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B,Q_1,Q_2,Y} \exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}, \]

where $\vartheta = (-\log\mu + \log\{r/(1-r)\}, \beta^T, \log\pi_1 - \log\{\rho/(1-\rho)\}, \ldots, \log\pi_K - \log\{\rho/(1-\rho)\})^T$ and

\[ W(B, Q_1, Q_2, Y, X) = \Bigl(Y, YZ^T(X, H), BI(Q_1 = (h_1,h_1)) + (1-B)\sum_l \{I(Q_2 = (h_1,h_l)) + I(Q_2 = (h_l,h_1))\}, \ldots, BI(Q_1 = (h_K,h_K)) + (1-B)\sum_l \{I(Q_2 = (h_K,h_l)) + I(Q_2 = (h_l,h_K))\}\Bigr)^T . \]

We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood

\[ l_n^*(\vartheta) = \sum_{i=1}^n \log\Bigl[\frac{\sum_{BQ_1+(1-B)Q_2 \in S(G_i)} e^{\vartheta^T W(B,Q_1,Q_2,Y_i,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Bigr]. \]

We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

The complete-data score function is

\[ \sum_{i=1}^n \Bigl[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y} \exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}\Bigr]. \]

Thus in the E-step, we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates,

\[ E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i] = \frac{\sum_{b,q_1,q_2} I(bq_1+(1-b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}\,W(b,q_1,q_2,Y_i,X_i)}{\sum_{b,q_1,q_2} I(bq_1+(1-b)q_2 \in S(G_i))\,e^{\vartheta^T W(b,q_1,q_2,Y_i,X_i)}} . \]


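In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates,

\[ \vartheta^{(k+1)} = \vartheta^{(k)} - \Sigma^{-1} \times \sum_{i=1}^n \Bigl[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}{\sum_{b,q_1,q_2,y} \exp\{\vartheta^T W(b,q_1,q_2,y,X_i)\}}\Bigr], \]

where

\[ \Sigma = -\Bigl[\sum_{i=1}^n \frac{\sum_{b,q_1,q_2,y} W^{\otimes 2}(b,q_1,q_2,y,X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}\Bigr] + \sum_{i=1}^n \Bigl[\frac{\{\sum_{b,q_1,q_2,y} W(b,q_1,q_2,y,X_i)\,e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^{\otimes 2}}{\{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^2}\Bigr] \]

and $a^{\otimes 2} = aa^T$.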

A.4.6 Proof of Theorem 2. Write $\hat F_{xg}(x, g) = \hat F_g^\dagger(x)\hat q_g$ and $F_{xg}(x, g) = F_g^\dagger(x)q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F_{xg}^*(x, g) \equiv F_g^*(x)q_g^*$, where $q_g^* > 0$ for any $g$.

Because $\hat F_g^\dagger$ maximizes the likelihood, there exists some Lagrange multiplier $\hat\lambda_g$ such that

\[ \frac{I(G_i = g)}{\hat F_g^\dagger\{X_i\}} - \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{\int_{x,\tilde g} \eta(1, x, \tilde g; \hat\theta)\,d\hat F_{xg}(x, \tilde g)} - n\hat\lambda_g = 0, \]

where $\hat F_g^\dagger\{X_i\}$ denotes the point mass of $\hat F_g^\dagger$ at $X_i$, and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^n \hat F_g^\dagger\{X_i\} = 1$, $\hat\lambda_g$ satisfies the equation

\[ n^{-1}\sum_{i=1}^n I(G_i = g)\Bigl\{\hat\lambda_g + \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{n\int_{x,\tilde g} \eta(1, x, \tilde g; \hat\theta)\,d\hat F_{xg}(x, \tilde g)}\Bigr\}^{-1} = 1 \tag{A.13} \]

and

\[ \min_{1 \le i \le n}\Bigl\{\hat\lambda_g + \frac{n_1\eta(1, X_i, g; \hat\theta)\hat q_g}{n\int_{x,\tilde g} \eta(1, x, \tilde g; \hat\theta)\,d\hat F_{xg}(x, \tilde g)}\Bigr\} > 0. \]

Clearly, $\hat\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\hat\lambda_g \to \lambda_g^*$. By (A.13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that

\[ \min_{g,x}\Bigl|\lambda_g^* + \frac{\varrho\,\eta(1, x, g; \theta^*)q_g^*}{\int_{x,\tilde g} \eta(1, x, \tilde g; \theta^*)\,dF_{xg}^*(x, \tilde g)}\Bigr| > \delta . \]

Consequently, when $n$ is sufficiently large,

\[ \hat F_g^\dagger(x) = n^{-1}\sum_{i=1}^n I(G_i = g, X_i \le x) \times \Bigl(\max\Bigl[\Bigl|\hat\lambda_g + \eta(1, X_i, g; \hat\theta)\hat q_g n_1 \times \Bigl\{n\int_{x,\tilde g} \eta(1, x, \tilde g; \hat\theta)\,d\hat F_{xg}(x, \tilde g)\Bigr\}^{-1}\Bigr|, \delta\Bigr]\Bigr)^{-1} . \]

We define an empirical function $\tilde F_g^\dagger$ whose jump size at $X_i$ is proportional to

\[ n^{-1}I(G_i = g)\Bigl(P(G = g, Y = 0) + \varrho\,\eta(1, X_i, g; \theta_0)q_g \times \Bigl\{\int_{x,\tilde g} \eta(1, x, \tilde g; \theta_0)\,dF_{xg}(x, \tilde g)\Bigr\}^{-1}\Bigr)^{-1} . \]

Then it can be verified that $\tilde F_g^\dagger$ converges uniformly to $F_g^\dagger$. In addition, $\hat F_g^\dagger$ is absolutely continuous with respect to $\tilde F_g^\dagger$, and the Radon–Nikodym derivative $d\hat F_g^\dagger(x)/d\tilde F_g^\dagger(x)$ is bounded and converges uniformly to $dF_g^*(x)/dF_g^\dagger(x)$. Let $\tilde F_{xg}(x, g) = \tilde F_g^\dagger(x)q_g$, and let $l_n(\theta, \{F_g^\dagger\}, \{q_g\})$ be the log-likelihood based on (8). By the definition of the MLE, $n^{-1}l_n(\hat\theta, \{\hat F_g^\dagger\}, \{\hat q_g\}) - n^{-1}l_n(\theta_0, \{\tilde F_g^\dagger\}, \{q_g\}) \ge 0$. The limit of this difference is the negative Kullback–Leibler information of the distribution for $(\theta^*, \{F_g^*\}, \{q_g^*\})$ with respect to $(\theta_0, \{F_g^\dagger\}, \{q_g\})$ under $P(Y = 1) = \varrho$. The identifiability conditions then yield $\theta^* = \theta_0$, $F_g^* = F_g^\dagger$, and $q_g^* = q_g$. Thus the consistency of $\hat\theta$ is established. Because $F_{xg}$ is continuous, $\sup_{x,g}|\hat F_{xg}(x, g) - F_{xg}(x, g)| \to 0$ almost surely.

The derivation of the asymptotic distribution is similar to the proof of theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, \{F_g^\dagger\}, \{q_g\})$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon\int\psi(x, g)\,dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

\[ n^{1/2}\Bigl\{(v^T\Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0) + \int (v^T\Omega_{12} + \Omega_{22}[\psi])\,d(\hat F_{xg} - F_{xg})\Bigr\} \]
\[ = n^{-1/2}\sum_{i=1}^n y_i\Bigl\{v^T l_\theta(1, X_i, G_i; \theta_0, F_{xg}) + l_F(1, X_i, G_i; \theta_0, F_{xg})\Bigl[\int\psi\,dF_{xg}\Bigr]\Bigr\} + n^{-1/2}\sum_{i=1}^n (1 - y_i)\Bigl\{v^T l_\theta(0, X_i, G_i; \theta_0, F_{xg}) + l_F(0, X_i, G_i; \theta_0, F_{xg})\Bigl[\int\psi\,dF_{xg}\Bigr]\Bigr\} + o_p(1), \]

where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of $x$, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of $\psi$, and $l_\theta$ and $l_F$ are the scores with respect to $\theta$ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through $\varrho$. We can show that the operator $B[v, \psi] \equiv (v^T\Omega_{11} + \Omega_{21}[\psi]^T, v^T\Omega_{12} + \Omega_{22}[\psi]^T)$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via $\varrho$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean $\varrho$. By choosing some $\psi$ such that $B[v, \psi] = (v^T, 0)^T$ for all $v$, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

\[ \frac{dP_n}{d\tilde P_n} = \exp\Bigl\{a_n\sum_{i=1}^n \frac{\partial \log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a}\Bigr|_{a=0} + o(1)\Bigr\} \to_{\tilde P_n} 1. \]

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

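A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters, $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$, yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

\[ \sum_{H \in S(G)} \bigl\{\tilde\Lambda'(Y)e^{\tilde\beta^T Z(X,H)}Q'(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)})\bigr\}^{\Delta} \times \bigl\{1 - Q(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)})\bigr\}^{1-\Delta}P_\gamma(H) \]
\[ = \sum_{H \in S(G)} \bigl\{\Lambda'(Y)e^{\beta^T Z(X,H)}Q'(\Lambda(Y)e^{\beta^T Z(X,H)})\bigr\}^{\Delta} \times \bigl\{1 - Q(\Lambda(Y)e^{\beta^T Z(X,H)})\bigr\}^{1-\Delta}P_\gamma(H). \]

By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain

\[ \sum_{H \in S(G)} Q(\tilde\Lambda(\tau)e^{\tilde\beta^T Z(X,H)})P_\gamma(H) = \sum_{H \in S(G)} Q(\Lambda(\tau)e^{\beta^T Z(X,H)})P_\gamma(H). \]

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)} = \Lambda(Y)e^{\beta^T Z(X,H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.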

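A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

\[ \mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\Bigl[\int\psi\,d\Lambda_0\Bigr] = 0, \tag{A.14} \]

where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int\psi\,d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon\int\psi\,d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain

\[ \sum_{H \in S(G)} Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})P_\gamma(H) \times \Bigl\{\frac{Q'(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_\beta^T Z(X,H)}{Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} + \frac{Q'(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\int_0^\tau \psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} + \mu_\gamma^T\nabla_\gamma \log P_\gamma(H)\Bigr\} = 0. \tag{A.15} \]

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

\[ \sum_{H \in S(G)} \bigl\{1 - Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\bigr\}P_\gamma(H) \times \Bigl\{-\frac{Q'(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_\beta^T Z(X,H)}{1 - Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} - \frac{Q'(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})\int_0^\tau \psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{1 - Q(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)})} + \mu_\gamma^T\nabla_\gamma \log P_\gamma(H)\Bigr\} = 0. \tag{A.16} \]

The summation of (A.15) and (A.16) entails $\mu_\gamma^T\nabla_\gamma \log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

\[ \psi(Y) + \frac{Q''(\Lambda_0(Y)e^{\beta_0^T Z(X,H)})\int_0^Y \psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q'(\Lambda_0(Y)e^{\beta_0^T Z(X,H)})} = 0 \]

for $H = (h, h)$. Therefore $\psi = 0$.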

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), "Prediction and Entropy," in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), "Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?" European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), "Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease," Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), "Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling," The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), "Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations," Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), "Regression Models and Life-Tables" (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), "Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data," American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), "Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population," Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), "Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer's Disease," Genome Research, 11, 143–151.

Fisher, R. A. (1918), "The Correlation Between Relatives on the Supposition of Mendelian Inheritance," Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), "Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci," Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), "Initial Sequencing and Analysis of the Human Genome," Nature, 409, 860–921.


International SNP Map Working Group (2001), "A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms," Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), "Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous," Human Heredity, 55, 56–65.

Li, H. (2001), "A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants," Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), "Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach," Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), "Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals," Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), "An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies," Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), "On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles," Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), "On Profile Likelihood," Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), "Semiparametric Mixtures in Case-Control Studies," Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), "Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), "Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21," Science, 294, 1719–1723.

Pettitt, A. N. (1984), "Proportional Odds Models for Survival Data and Estimates Using Ranks," Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), "Logistic Disease Incidence Models and Case-Control Studies," Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), "Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms," American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), "Searching for Genetic Determinants in the New Millennium," Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), "A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables," Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), "Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies," Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), "Evaluating Associations of Haplotypes With Traits," Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), "Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous," American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), "Fitting Regression Models to Case-Control Data by Maximum Likelihood," Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), "Evolutionary-Based Association Analysis Using Haplotype Data," Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), "A New Statistical Method for Haplotype Reconstruction From Population Data," American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), "Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals," Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), "Mapping Genes for NIDDM," Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), "Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling," Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), "The Sequence of the Human Genome," Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), "On the Use of DNA Pooling to Estimate Haplotype Frequencies," Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), "Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals," Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), "Estimating Haplotype-Disease Associations With Pooled Genotype Data," Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), "Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data," technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang S Pakstis A J Kidd K K and Zhao H (2001) ldquoComparisons ofTwo Methods for Haplotype Reconstruction and Haplotype Frequency Esti-mation From Population Datardquo American Journal of Human Genetics 69906ndash912

Zhao L P Li S S and Khalid N (2003) ldquoA Method for the Assessmentof Disease Associations With Single-Nucleotide Polymorphism Haplotypesand Environmental Variables in Case-Control Studiesrdquo American Journal ofHuman Genetics 72 1231ndash1250

Comment

Chiara SABATTI

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and, eventually, to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, or recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple "disease" haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effects estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a "very good" estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping "similar" ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999), Service, Temple Lang, Freimer, and Sandkuijl (1999), Lam, Roeder, and Devlin (2000), Morris, Whittaker, and Balding (2000), Liu, Sabatti, Teng, Keats, and Risch (2001), and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multiloci genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information when available, or assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered "true," with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
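To make the role of EM with missing phase concrete, the following is a minimal sketch of EM estimation of haplotype frequencies from unphased genotypes. It is an illustration under stated assumptions, not the algorithm of Lin and Zeng or the one implemented in Mendel: it assumes exactly two biallelic SNPs, Hardy-Weinberg equilibrium, genotypes coded as minor-allele counts (0, 1, or 2) at each SNP, and no phenotype model. The function names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product

# The four possible haplotypes over two biallelic SNPs (0 = major, 1 = minor allele).
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) whose allele counts add up to the genotype."""
    target = tuple(genotype)
    return [
        (h1, h2)
        for h1, h2 in product(HAPLOTYPES, repeat=2)
        if tuple(a + b for a, b in zip(h1, h2)) == target
    ]


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from unphased genotypes by EM (assumes HWE)."""
    counts = Counter(tuple(g) for g in genotypes)
    n_chromosomes = 2 * sum(counts.values())
    # Start from equal frequencies for all haplotypes.
    freq = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    for _ in range(n_iter):
        # E-step: expected haplotype counts given the current frequencies.
        expected = dict.fromkeys(HAPLOTYPES, 0.0)
        for g, n in counts.items():
            pairs = compatible_pairs(g)
            # Under HWE, an ordered pair has probability freq[h1] * freq[h2].
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                posterior = w / total
                expected[h1] += n * posterior
                expected[h2] += n * posterior
        # M-step: new frequencies are expected counts divided by chromosomes.
        new_freq = {h: expected[h] / n_chromosomes for h in HAPLOTYPES}
        change = max(abs(new_freq[h] - freq[h]) for h in HAPLOTYPES)
        freq = new_freq
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    # Only the double heterozygote (1, 1) is phase-ambiguous with two SNPs.
    sample = [(0, 0)] * 50 + [(2, 2)] * 30 + [(1, 1)] * 20
    for haplotype, f in em_haplotype_frequencies(sample).items():
        print(haplotype, round(f, 4))
```

With two SNPs only the double heterozygote is ambiguous, so each iteration is cheap. The number of compatible haplotype pairs grows exponentially with the number of heterozygous markers, which is one way to see why EM becomes unsatisfactory for long haplotypes.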

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven: faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), "A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes," American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), "The Human Phenome Project," Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), "Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny," Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), "Haplotype Fine Mapping by Evolutionary Trees," American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), "Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets," American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), "Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables," The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), "Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping," Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), "Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms," Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), "Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping," American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), "Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques," American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), "Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models," American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), "False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders," Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), "Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations," American Journal of Human Genetics, 64, 1728–1738.


Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng's (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates X we can assume Hardy–Weinberg equilibrium, then Lin and Zeng's model (3) applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x; \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes g rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when X and G are associated unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng's simulations assume that X and H are independent at the population level, not just conditional on G.
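The difference between the two independence assumptions can be written out as a short sketch. The notation is assumed here for illustration and is not taken from Lin and Zeng: $S(g)$ denotes the set of haplotype pairs compatible with genotype $g$, and $\pi(h)$ denotes the marginal population probability of the haplotype pair $h$. Conditional independence of H and X given G says that, for $h \in S(g)$,

$$P(H = h \mid G = g, X = x) = P(H = h \mid G = g) = \frac{\pi(h)}{\sum_{h' \in S(g)} \pi(h')}.$$

Population-level independence, $P(H = h \mid X = x) = \pi(h)$, yields the same expression by Bayes' rule, because G is a deterministic function of H:

$$P(H = h \mid G = g, X = x) = \frac{P(H = h \mid X = x)\, 1\{h \in S(g)\}}{\sum_{h' \in S(g)} P(H = h' \mid X = x)} = \frac{\pi(h)}{\sum_{h' \in S(g)} \pi(h')}.$$

So population-level independence implies conditional independence given G, while the converse need not hold, which is the sense in which the simpler model is a special case.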

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this "target" haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for "all commonly used study designs." In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng's article. Recently we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), "Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies," Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), "Prospective Analysis of Logistic Case-Control Studies," Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng's article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let D be disease status, $H^d$ be the "diplotypes" (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), G be the observed genotype, and X be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\operatorname{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X; \beta_1), \tag{1}$$

where m(·) is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by pr(D | G, X), which ignores the fact that under the case-control sampling design, data are observed on (G, X) conditional on D. In contrast, the retrospective likelihood of the data is given by pr(G, X | D) and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H_d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a "robust approach" for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H_d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H_d = h_d \mid X) = \mathrm{pr}(H_d = h_d) = q(h_d, \theta), \tag{2}$$

where the model $q(h_d, \theta)$, in turn, could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H_d = h_d \mid X = x) = q(h_d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by "population stratification." In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.
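One possible form of model (3) can be sketched as follows: haplotype frequencies depend on a stratum indicator through a polytomous logistic model, and HWE is assumed within each stratum. The stratum coding `z`, the haplotype labels, and every parameter value below are hypothetical, chosen only to illustrate the construction.

```python
import math
from itertools import combinations_with_replacement


def haplotype_freqs(z, theta):
    """Polytomous logistic haplotype frequencies within stratum z.

    theta maps each haplotype to (intercept, slope); the reference haplotype
    carries (0.0, 0.0). All values are hypothetical.
    """
    scores = {h: math.exp(a + b * z) for h, (a, b) in theta.items()}
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}


def q(diplotype, z, theta):
    """q(h_d, x, theta): HWE within stratum z with stratum-specific frequencies."""
    freq = haplotype_freqs(z, theta)
    h1, h2 = diplotype
    return freq[h1] ** 2 if h1 == h2 else 2.0 * freq[h1] * freq[h2]


theta = {"00": (0.0, 0.0), "01": (-0.5, 0.8), "11": (-1.0, -0.4)}
diplotypes = list(combinations_with_replacement(sorted(theta), 2))
for z in (0, 1):  # e.g., two ethnic or geographic strata
    total = sum(q(d, z, theta) for d in diplotypes)
    print(z, haplotype_freqs(z, theta), round(total, 12))
```

Within each stratum the diplotype probabilities sum to one, while the pooled population need not satisfy HWE or gene–environment independence.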

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x, \Omega) = q(h, x, \theta)\,\frac{\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X, \Omega) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X, \Omega)}{\sum_{h} \sum_{d} S(d, h, X, \Omega)}.$$

Spinka et al. (2005) first showed that, under certain conditions that are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that the parameters would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i, \Omega). \tag{4}$$
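The quantities entering (4) can be evaluated directly for a small haplotype set. The sketch below is only an illustration under stated assumptions: model (2) with HWE for $q$, a hypothetical additive form for `m`, and arbitrary parameter values and data. It is not the estimation procedure of Spinka et al., only an evaluation of $l^*$ at a fixed parameter value.

```python
import math
from itertools import combinations_with_replacement

HAPLOTYPES = ["00", "01", "10", "11"]
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))


def genotype(dip):
    """Per-SNP count of '1' alleles over the two haplotypes."""
    return tuple(int(a) + int(b) for a, b in zip(*dip))


def q(dip, theta):
    """Diplotype probability under HWE (model (2)); theta holds frequencies."""
    h1, h2 = dip
    return theta[h1] ** 2 if h1 == h2 else 2.0 * theta[h1] * theta[h2]


def m(dip, x, beta1):
    """Hypothetical m: additive effect of haplotype '11' plus a covariate effect."""
    return beta1[0] * sum(h == "11" for h in dip) + beta1[1] * x


def kappa(beta0, n1, n0, pi):
    return beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))


def S(d, dip, x, beta0, kap, theta, beta1):
    lin = m(dip, x, beta1)
    return q(dip, theta) * math.exp(d * (kap + lin)) / (1.0 + math.exp(beta0 + lin))


def L_star(d, g, x, beta0, kap, theta, beta1):
    num = sum(S(d, h, x, beta0, kap, theta, beta1)
              for h in DIPLOTYPES if genotype(h) == g)
    den = sum(S(dd, h, x, beta0, kap, theta, beta1)
              for h in DIPLOTYPES for dd in (0, 1))
    return num / den


def pseudo_loglik(data, beta0, kap, theta, beta1):
    """l* in (4); data is a list of (D_i, G_i, X_i) triples."""
    return sum(math.log(L_star(d, g, x, beta0, kap, theta, beta1))
               for d, g, x in data)


theta = {"00": 0.4, "01": 0.2, "10": 0.1, "11": 0.3}
beta0, beta1 = -3.0, (0.7, 0.2)
kap = kappa(beta0, n1=500, n0=500, pi=0.05)
data = [(1, (1, 1), 0.5), (0, (0, 0), -1.0), (1, (2, 2), 0.0)]
print(pseudo_loglik(data, beta0, kap, theta, beta1))
```

For a fixed covariate value, $L^*$ sums to one over all $(d, G)$ combinations, so each term of $l^*$ is the log of a proper probability of the observed pair $(D_i, G_i)$.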

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H_d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$ (here $N_d$ is the number of sampled subjects with $D = d$, so that $N_1 = n_1$ and $N_0 = n_0$). Let $R = 1$ indicate that a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

$$
\begin{aligned}
l^* &= \sum_{i=1}^{N} \log \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_{di} = h_d, X_i, R_i = 1) \times \mathrm{pr}(H_{di} = h_d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}
$$
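The equivalence in (5) can be checked numerically. The sketch below computes $\mathrm{pr}(D, G \mid X, R = 1)$ from first principles under Bernoulli sampling with weights $\mu_d$ and compares it with $L^*$ built from $S$. The haplotype set, the HWE form of $q$, the additive form of `m`, and all numerical values are assumptions for illustration only.

```python
import math
from itertools import combinations_with_replacement

HAPLOTYPES = ["00", "01", "10", "11"]
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))
THETA = {"00": 0.4, "01": 0.2, "10": 0.1, "11": 0.3}  # haplotype frequencies
BETA0, BETA1 = -3.0, (0.7, 0.2)                       # illustrative values
N1, N0, PI = 500, 500, 0.05                           # cases, controls, pr(D = 1)


def genotype(dip):
    return tuple(int(a) + int(b) for a, b in zip(*dip))


def q(dip):
    h1, h2 = dip
    return THETA[h1] ** 2 if h1 == h2 else 2.0 * THETA[h1] * THETA[h2]


def m(dip, x):
    return BETA1[0] * sum(h == "11" for h in dip) + BETA1[1] * x


def sampled_prob(d, g, x):
    """pr(D = d, G = g | X = x, R = 1) under Bernoulli sampling with weights mu_d."""
    mu = {1: N1 / PI, 0: N0 / (1.0 - PI)}
    joint = {}
    for h in DIPLOTYPES:
        p1 = 1.0 / (1.0 + math.exp(-(BETA0 + m(h, x))))  # risk model (1)
        joint[(1, h)] = mu[1] * p1 * q(h)
        joint[(0, h)] = mu[0] * (1.0 - p1) * q(h)
    total = sum(joint.values())
    return sum(v for (dd, h), v in joint.items()
               if dd == d and genotype(h) == g) / total


def L_star(d, g, x):
    """L*(D, G, X) computed from S(d, h, x)."""
    kappa = BETA0 + math.log(N1 / N0) - math.log(PI / (1.0 - PI))

    def S(dd, h):
        lin = m(h, x)
        return q(h) * math.exp(dd * (kappa + lin)) / (1.0 + math.exp(BETA0 + lin))

    num = sum(S(d, h) for h in DIPLOTYPES if genotype(h) == g)
    den = sum(S(dd, h) for h in DIPLOTYPES for dd in (0, 1))
    return num / den


gap = max(abs(sampled_prob(d, g, 0.5) - L_star(d, g, 0.5))
          for d in (0, 1) for g in {genotype(h) for h in DIPLOTYPES})
print(gap)  # the two expressions agree up to floating-point rounding
```

The agreement holds because $\mu_d\,\mathrm{pr}(D = d \mid h, x)\,q(h)$ is proportional to $S(d, h, x, \Omega)$, with proportionality constant $\mu_0$ for both values of $d$.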

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an "ascertainment-corrected joint likelihood" of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with the distribution $F(x)$ of $X$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H_d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than model (2), which assumes that $H_d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, which is a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$
0 = \sum_{i=1}^{N} \sum_{h_d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d, \theta) \times \Biggl( \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, q(h_d, \theta) \Biggr)^{-1}. \tag{6}
$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$
0 = \sum_{i=1}^{N} \sum_{h_d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, r(h_d, X_i)\, q(h_d, \theta) \times \Biggl( \sum_{h_d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h_d, X_i)\, r(h_d, X_i)\, q(h_d, \theta) \Biggr)^{-1}, \tag{7}
$$

where

$$r(h_d, X) = \frac{1 + \exp\{\kappa + m(h_d, X, \beta_1)\}}{1 + \exp\{\beta_0 + m(h_d, X, \beta_1)\}}.$$
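The only difference between (6) and (7) is the factor $r(h_d, X)$ in the weights placed on the diplotypes in $\mathcal{H}_{G_i}$. The sketch below computes those normalized weights with and without the factor. It assumes, as the notation $\beta^* = (\kappa, \beta_1)$ suggests, that $\mathrm{pr}_{\beta^*}(D \mid h_d, X)$ is the logistic model with intercept $\kappa$; the function names and numerical inputs are hypothetical.

```python
import math


def r(lin, beta0, kappa):
    """Correction factor r(h_d, X), with lin = m(h_d, X, beta1)."""
    return (1.0 + math.exp(kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def diplotype_weights(d, lins, qs, beta0, kappa, modified):
    """Normalized weights on the diplotypes in H_G used by (6) or (7).

    lins[j] = m(h_j, X, beta1) and qs[j] = q(h_j, theta) for the j-th diplotype.
    """
    weights = []
    for lin, qh in zip(lins, qs):
        p1 = 1.0 / (1.0 + math.exp(-(kappa + lin)))  # logistic with intercept kappa
        p = p1 if d == 1 else 1.0 - p1
        weights.append(p * qh * (r(lin, beta0, kappa) if modified else 1.0))
    total = sum(weights)
    return [w / total for w in weights]


beta0 = -3.0
kappa = beta0 + math.log(19.0)        # e.g., n1 = n0 and pr(D = 1) = 0.05
lins, qs = [0.7, 0.0], [0.24, 0.04]   # two diplotypes compatible with one genotype
w6 = diplotype_weights(1, lins, qs, beta0, kappa, modified=False)
w7 = diplotype_weights(1, lins, qs, beta0, kappa, modified=True)
print(w6, w7)
```

When $\kappa = \beta_0$, that is, when cases are not oversampled relative to the population, $r \equiv 1$ and the two sets of weights coincide. With the factor included, the weights are proportional to $S(D, h_d, X, \Omega)$.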

Spinka et al. showed that, with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H_d$) and environmental covariates ($X$) as

$$\lambda[t \mid H_d, X] = \lambda_0(t)\, R(H_d, X, \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H_d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H_d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\mathrm{pr}(H_d) = q(H_d, \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H_d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

$$\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}$$

where

$$
\begin{aligned}
R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\} &= E\{R(H_d, X, \beta_1) \mid G, X, T \ge t\} \\
&= \frac{\sum_{H_d \in \mathcal{H}_G} R(H_d, X, \beta_1)\, \mathrm{pr}[T > t \mid H_d, X]\, q(H_d, \theta)}{\sum_{H_d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H_d, X]\, q(H_d, \theta)}.
\end{aligned}
$$

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease one could assume that $\mathrm{pr}[T > t \mid H_d, X] \approx 1$. The corresponding induced relative risk function

$$R^*(G, X, \beta_1, \theta) = \frac{\sum_{H_d \in \mathcal{H}_G} R(H_d, X, \beta_1)\, q(H_d, \theta)}{\sum_{H_d \in \mathcal{H}_G} q(H_d, \theta)}$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat\theta)$, where $\hat\theta$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
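Under the rare disease approximation, the induced relative risk is simply a $q$-weighted average of $R$ over the diplotypes compatible with $G$. A minimal sketch follows, assuming HWE for $q$, a hypothetical log-linear form for $R$, and made-up plug-in frequencies; it evaluates $R^*$ only and does not carry out the partial likelihood maximization.

```python
import math
from itertools import combinations_with_replacement


def genotype(dip):
    """Per-SNP count of '1' alleles over the two haplotypes."""
    return tuple(int(a) + int(b) for a, b in zip(*dip))


def q(dip, theta):
    """Diplotype probability under HWE; theta holds haplotype frequencies."""
    h1, h2 = dip
    return theta[h1] ** 2 if h1 == h2 else 2.0 * theta[h1] * theta[h2]


def relative_risk(dip, x, beta1, target="11"):
    """Hypothetical R(Hd, X, beta1) = exp(beta1[0]*copies + beta1[1]*x)."""
    copies = sum(h == target for h in dip)
    return math.exp(beta1[0] * copies + beta1[1] * x)


def induced_relative_risk(g, x, beta1, theta):
    """R*(G, X, beta1, theta): q-weighted mean of R over diplotypes in H_G."""
    dips = [d for d in combinations_with_replacement(sorted(theta), 2)
            if genotype(d) == g]
    num = sum(relative_risk(d, x, beta1) * q(d, theta) for d in dips)
    den = sum(q(d, theta) for d in dips)
    return num / den


theta_hat = {"00": 0.4, "01": 0.2, "10": 0.1, "11": 0.3}  # plug-in frequencies
beta1 = (0.7, 0.2)
print(induced_relative_risk((1, 1), 0.0, beta1, theta_hat))  # phase-ambiguous
print(induced_relative_risk((2, 2), 0.0, beta1, theta_hat))  # phase known
```

For a genotype with a single compatible diplotype, $R^*$ reduces to $R$ itself, so the approximation matters only for phase-ambiguous subjects.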

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat\theta)$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), "Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions," Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), "Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects," Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Prentice, R. L. (1982), "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model," Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), "Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity," Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data, and potential violations of "Hardy–Weinberg equilibrium (HWE)" in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
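The size of the excess homozygosity implied by this model is easy to tabulate. In the sketch below the homozygote formula is the one given in the text; the heterozygote formula $2(1 - \rho)\pi_k\pi_l$ is the standard companion expression, assumed here so that the pair frequencies sum to one. The haplotype labels, frequencies, and values of $\rho$ are illustrative.

```python
def diplotype_freqs(pi, rho):
    """Unordered haplotype-pair frequencies under the inbreeding model.

    Homozygotes follow pi_kk = rho*pi_k + (1 - rho)*pi_k**2 as in the text.
    The heterozygote formula 2*(1 - rho)*pi_k*pi_l is an assumed companion,
    chosen so that the frequencies sum to one.
    """
    haps = sorted(pi)
    out = {}
    for i, k in enumerate(haps):
        out[(k, k)] = rho * pi[k] + (1.0 - rho) * pi[k] ** 2
        for l in haps[i + 1:]:
            out[(k, l)] = 2.0 * (1.0 - rho) * pi[k] * pi[l]
    return out


def homozygosity(pi, rho):
    """Total probability that the two haplotypes of an individual match."""
    return sum(v for (k, l), v in diplotype_freqs(pi, rho).items() if k == l)


pi = {"h1": 0.2, "h2": 0.3, "h3": 0.5}
for rho in (0.0, 0.015, 0.05):
    print(rho, round(homozygosity(pi, rho), 4))
```

Total homozygosity equals $\rho + (1 - \rho)\sum_k \pi_k^2$, so small values of $\rho$ shift it only slightly away from the HWE value.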

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
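A small numerical example shows how the confounding arises. In the sketch below the marker is independent of disease within each of two subpopulations, yet the pooled table shows a clear association. The subpopulation sizes, prevalences, and carrier frequencies are made up for illustration.

```python
def table(n, prevalence, carrier_freq):
    """Expected 2x2 counts with NO marker-disease association in the stratum."""
    return {
        ("carrier", "diseased"): n * carrier_freq * prevalence,
        ("carrier", "healthy"): n * carrier_freq * (1 - prevalence),
        ("noncarrier", "diseased"): n * (1 - carrier_freq) * prevalence,
        ("noncarrier", "healthy"): n * (1 - carrier_freq) * (1 - prevalence),
    }


def odds_ratio(t):
    """Odds ratio of disease for carriers versus noncarriers."""
    return (t[("carrier", "diseased")] * t[("noncarrier", "healthy")]) / (
        t[("carrier", "healthy")] * t[("noncarrier", "diseased")]
    )


# Subpopulation A: common disease, common marker; B: rare disease, rare marker.
sub_a = table(10_000, prevalence=0.10, carrier_freq=0.8)
sub_b = table(10_000, prevalence=0.02, carrier_freq=0.2)
pooled = {cell: sub_a[cell] + sub_b[cell] for cell in sub_a}

print(odds_ratio(sub_a), odds_ratio(sub_b), odds_ratio(pooled))
```

Stratifying by subpopulation recovers the null within-stratum odds ratios, which is the rationale for controlling substructure by design or by stratified analysis.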

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next, we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists tracing back to Fisher have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and controls using (12) from LZ with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ's rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
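The mapping between causal SNP position and haplotypes in these scenarios can be read off directly from the three haplotype strings. The short sketch below does so; the helper names are hypothetical, and the code covers only the haplotype-to-SNP mapping, not the simulation itself.

```python
HAPLOTYPES = {"h1": "11", "h2": "01", "h3": "00"}
FREQS = {"h1": 0.2, "h2": 0.3, "h3": 0.5}


def carriers(position):
    """Haplotypes carrying allele '1' at the given SNP position (1-based)."""
    return [name for name, seq in HAPLOTYPES.items() if seq[position - 1] == "1"]


def allele_freq(position):
    """Population frequency of allele '1' at the given SNP position."""
    return sum(FREQS[name] for name in carriers(position))


for position in (1, 2):
    print(position, carriers(position), allele_freq(position))
```

With the causal SNP in position 1, only $h_1$ carries the risk allele; in position 2, the allele is shared by $h_1$ and $h_2$, so a model attributing the effect to $h_1$ alone is misspecified.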

Table 1. Bias and Mean Squared Error (MSE) of $\hat\beta_1$

β1    Scenario                          Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of $\hat\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curve. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment to the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in tradi-tional epidemiological studies including cross-sectional stud-ies case-control studies with known or unknown populationtotals and cohort designs Due to the cost of genotyping andcollection of environmental covariates the most practical andthe most commonly used design among these listed for large-scale genetic association studies is the case-control designwith unknown population totals Inferences for such a designwere investigated by LZ in their simulations and application

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It would also be interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000853


the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method, called FlexTree, for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.
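As a rough illustration of what "logic combinations of the binary variables coded for the SNPs" can look like, the following minimal Python sketch codes each SNP genotype (assumed here to be a minor-allele count of 0, 1, or 2) into dominant and recessive indicators and forms one Boolean predictor. The coding scheme, the function names, and the particular Boolean expression are illustrative assumptions and are not taken from the cited papers.

```python
import numpy as np

def dominant(g):
    """Indicator that at least one copy of the minor allele is present."""
    return np.asarray(g) >= 1

def recessive(g):
    """Indicator that two copies of the minor allele are present."""
    return np.asarray(g) == 2

# Hypothetical genotypes: rows are subjects, columns are two SNPs.
snp = np.array([[0, 2],
                [1, 0],
                [2, 1],
                [0, 0]])

# One example logic combination: (D1 AND NOT R2) OR R1.
# A Boolean predictor of this kind would enter a regression model
# as a single 0/1 covariate.
predictor = (dominant(snp[:, 0]) & ~recessive(snp[:, 1])) | recessive(snp[:, 0])
```

A search procedure would consider many such expressions; the sketch only shows how a single candidate predictor is evaluated.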

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new

set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mediating the problem of a large number of potential interactions, by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.


Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
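The sliding-window scan with a permutation adjustment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `placeholder_stat` is a hypothetical stand-in for the per-window likelihood-ratio statistic (it is simply n times a squared correlation), and the function names, window width, and number of permutations are all illustrative choices.

```python
import numpy as np

def sliding_windows(n_markers, width):
    """Return (start, stop) index pairs for consecutive marker windows."""
    return [(s, s + width) for s in range(n_markers - width + 1)]

def placeholder_stat(y, window_genotypes):
    """Hypothetical per-window statistic; a real analysis would use the
    likelihood-ratio statistic for haplotype association instead."""
    score = window_genotypes.sum(axis=1).astype(float)
    if score.std() == 0 or np.std(y) == 0:
        return 0.0
    r = np.corrcoef(y, score)[0, 1]
    return len(y) * r ** 2

def max_stat_permutation(stat_fn, y, genotypes, windows, n_perm=200, seed=0):
    """Adjust window-wise statistics for multiplicity by permuting phenotypes
    and comparing each observed statistic with the permutation maxima."""
    rng = np.random.default_rng(seed)
    observed = np.array([stat_fn(y, genotypes[:, a:b]) for a, b in windows])
    max_null = np.empty(n_perm)
    for p in range(n_perm):
        y_perm = rng.permutation(y)
        max_null[p] = max(stat_fn(y_perm, genotypes[:, a:b]) for a, b in windows)
    # Adjusted p-value: share of permutation maxima at least as large as
    # the observed statistic (with the usual +1 correction).
    exceed = (max_null[None, :] >= observed[:, None]).sum(axis=1)
    return observed, (1 + exceed) / (n_perm + 1)
```

Permuting the phenotype while keeping the genotype matrix intact preserves the dependence among overlapping windows, which is why the maximum over windows is taken within each permutation.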

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

their work. We have also greatly benefited from personal communications with them.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence), unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence, under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and that of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the

partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.
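For readers less familiar with the Breslow estimator, the following Python sketch computes it in the standard setting with fully observed covariates. It is only a reference point under that assumption: in the M-step described above, the risk scores would presumably be replaced by conditional expectations over the haplotype pairs compatible with each genotype, and that weighting is not shown. The function name and argument layout are illustrative.

```python
import numpy as np

def breslow_cumulative_hazard(times, events, risk_scores):
    """Breslow estimator of the cumulative baseline hazard.

    times       : observed follow-up times
    events      : 1 if the event occurred at that time, 0 if censored
    risk_scores : exp(beta' z_i) for each subject, at the current beta
    Returns the distinct event times and the cumulative hazard at each.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    risk = np.asarray(risk_scores, dtype=float)

    event_times = np.unique(times[events == 1])
    increments = []
    for t in event_times:
        n_events = np.sum((times == t) & (events == 1))
        at_risk_total = risk[times >= t].sum()  # sum over the risk set
        increments.append(n_events / at_risk_total)
    return event_times, np.cumsum(increments)
```

With all risk scores equal to one, the estimator reduces to the Nelson–Aalen estimator, which gives a quick sanity check.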

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal


SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct or even an approximately correct model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


with $P(Q_2 = (h_k, h_l)) = \pi_k \pi_l$. Then $H$ has the same distribution as $BQ_1 + (1 - B)Q_2$. The complete-data likelihood can be represented by

$$
\prod_{i=1}^{n} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \prod_{k} \pi_k^{I(Q_{1i} = (h_k, h_k)) B_i}
\times \prod_{k,l} (\pi_k \pi_l)^{I(Q_{2i} = (h_k, h_l))(1 - B_i)} \rho^{B_i} (1 - \rho)^{1 - B_i}.
$$

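The corresponding score equations for $\pi_k$ and $\rho$ satisfy

$$
\pi_k = c^{-1} \left[ \sum_{i=1}^{n} B_i I\bigl(Q_{1i} = (h_k, h_k)\bigr)
+ \sum_{i=1}^{n} \sum_{l=1}^{K} (1 - B_i) \bigl\{ I\bigl(Q_{2i} = (h_k, h_l)\bigr) + I\bigl(Q_{2i} = (h_l, h_k)\bigr) \bigr\} \right]
$$

and $\rho = n^{-1} \sum_{i=1}^{n} B_i$, where $c$ is a normalizing constant such that $\sum_k \pi_k = 1$. Define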

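$$
E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\}
= \sum_{bq_1 + (1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}\bigl(Y_i \mid X_i, bq_1 + (1 - b)q_2\bigr)\, p(b, q_1, q_2)\, \omega(b, q_1, q_2)
\times \left[ \sum_{bq_1 + (1-b)q_2 \in S(G_i)} P_{\alpha,\beta,\xi}\bigl(Y_i \mid X_i, bq_1 + (1 - b)q_2\bigr)\, p(b, q_1, q_2) \right]^{-1},
$$

where $\omega(B, Q_1, Q_2) = B\,I(Q_1 = (h_k, h_k))$, $(1 - B)\,I(Q_2 = (h_k, h_l))$, or $B$, and

$$
p(b, q_1, q_2) = \prod_{k} \pi_k^{b I(q_1 = (h_k, h_k))}
\times \prod_{k,l} (\pi_k \pi_l)^{(1-b) I(q_2 = (h_k, h_l))} \rho^{b} (1 - \rho)^{1-b}.
$$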

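In the $(m + 1)$st iteration, the estimates of $\pi_k$ and $\rho$ are obtained in closed form:

$$
\pi_k^{(m+1)} = \frac{1}{c^{(m+1)}} \left[ \sum_{i=1}^{n} E^{(m)}\bigl\{ B_i I\bigl(Q_{1i} = (h_k, h_k)\bigr) \bigr\}
+ 2 \sum_{i=1}^{n} \sum_{l=1}^{K} E^{(m)}\bigl\{ (1 - B_i) I\bigl(Q_{2i} = (h_k, h_l)\bigr) \bigr\} \right]
$$

and $\rho^{(m+1)} = n^{-1} \sum_{i=1}^{n} E^{(m)}(B_i)$, where $E^{(m)}\{\omega(B_i, Q_{1i}, Q_{2i})\}$ is $E\{\omega(B_i, Q_{1i}, Q_{2i}) \mid Y_i, X_i, G_i\}$ evaluated at $\theta = \theta^{(m)}$, and $c^{(m+1)}$ is the constant such that $\sum_k \pi_k^{(m+1)} = 1$.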
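The closed-form update can be sketched in a few lines of Python once the E-step quantities are available. The array names and shapes below are assumptions made for illustration: `e_b_hom[i, k]` holds the conditional expectation of B_i I(Q_1i = (h_k, h_k)), `e_nb_pair[i, k, l]` holds that of (1 − B_i) I(Q_2i = (h_k, h_l)) over ordered pairs, and `e_b[i]` holds that of B_i. Computing these expectations requires the phenotype model and is not shown here.

```python
import numpy as np

def m_step_pi_rho(e_b_hom, e_nb_pair, e_b):
    """Closed-form update of the haplotype frequencies and rho.

    e_b_hom   : (n, K) array, E{B_i I(Q_1i = (h_k, h_k))}
    e_nb_pair : (n, K, K) array, E{(1 - B_i) I(Q_2i = (h_k, h_l))}
    e_b       : (n,) array, E(B_i)
    """
    e_b_hom = np.asarray(e_b_hom, dtype=float)
    e_nb_pair = np.asarray(e_nb_pair, dtype=float)
    e_b = np.asarray(e_b, dtype=float)

    # Bracketed term of the update: sum over subjects i (axis 0) and over
    # the second haplotype l (axis 2), with the factor 2 from the text.
    raw = e_b_hom.sum(axis=0) + 2.0 * e_nb_pair.sum(axis=(0, 2))
    pi_new = raw / raw.sum()   # dividing by c^(m+1) makes the pi_k sum to 1
    rho_new = e_b.mean()       # n^{-1} times the sum of E(B_i)
    return pi_new, rho_new
```

For example, with two subjects and K = 2, where the first subject has B = 1 and Q_1 = (h_1, h_1) with certainty and the second has B = 0 with Q_2 equally likely to be (h_1, h_2) or (h_2, h_1), the update gives frequencies (2/3, 1/3) and rho = 1/2.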

A.3 Case-Control Studies With Known Population Totals

A.3.1 EM Algorithm. This is similar to the EM algorithm for cross-sectional studies, except that, in addition to unknown $H$ on all individuals, $X$ is missing for the individuals not selected into the case-control sample, and there are nonparametric components $F_g(\cdot)$. The complete-data likelihood is

$$
\prod_{i=1}^{N} P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i)\, P_{\gamma}(H_i) \prod_{g} f_g(X_i)^{I(G_i = g)}.
$$

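The M-step solves the following equations for $\theta$:

$$
\sum_{i=1}^{N} I(R_i = 1)\, E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i, X_i, G_i\}
+ \sum_{i=1}^{N} I(R_i = 0)\, E\{\nabla_{\alpha,\beta,\xi} \log P_{\alpha,\beta,\xi}(Y_i \mid X_i, H_i) \mid Y_i\} = 0,
$$

$$
\sum_{i=1}^{N} I(R_i = 1)\, E\{\nabla_{\gamma} \log P_{\gamma}(H_i) \mid Y_i, X_i, G_i\}
+ \sum_{i=1}^{N} I(R_i = 0)\, E\{\nabla_{\gamma} \log P_{\gamma}(H_i) \mid Y_i\} = 0, \tag{A.2}
$$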

and also estimates $F_g$ by an empirical function with the following point mass at the $X_i$ for which $(G_i = g, R_i = 1)$:

$$
F_g\{X_i\} = \left[ \sum_{j=1}^{N} I(X_j = X_i, G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)\, E\{I(X_j = X_i, G_j = g) \mid Y_j\} \right]
\times \left[ \sum_{j=1}^{N} I(G_j = g, R_j = 1) + \sum_{j=1}^{N} I(R_j = 0)\, E\{I(G_j = g) \mid Y_j\} \right]^{-1},
$$

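where the conditional expectations are evaluated at the current estimates of $\theta$ and $F_g$ in the E-step. For a random function $\omega(Y_i, X_i, H_i)$, the conditional expectation takes the form

$$
\frac{\sum_{(h_k, h_l) \in S(G_i)} \omega(Y_i, X_i, (h_k, h_l))\, P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))\, P_{\gamma}(h_k, h_l)}
{\sum_{(h_k, h_l) \in S(G_i)} P_{\alpha,\beta,\xi}(Y_i \mid X_i, (h_k, h_l))\, P_{\gamma}(h_k, h_l)}
$$

for $R_i = 1$, and

$$
\sum_{g \in \mathcal{G}} \sum_{x \in \{X_j : G_j = g, R_j = 1\}} \sum_{(h_k, h_l) \in S(g)} \omega(Y_i, x, (h_k, h_l))\, P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))\, P_{\gamma}(h_k, h_l)\, F_g\{x\}
\times \left( \sum_{g \in \mathcal{G}} \sum_{x \in \{X_j : G_j = g, R_j = 1\}} \sum_{(h_k, h_l) \in S(g)} P_{\alpha,\beta,\xi}(Y_i \mid x, (h_k, h_l))\, P_{\gamma}(h_k, h_l)\, F_g\{x\} \right)^{-1}
$$

for $R_i = 0$. Under (3) with $\rho \ge 0$, the idea described in Section A.2.2 can be applied to (A.2) to obtain a closed-form estimate of $\gamma$.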
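The E-step for a genotyped subject (R_i = 1) amounts to a weighted average over the haplotype pairs compatible with the observed genotype. The Python sketch below illustrates this under simplifying assumptions: haplotypes are coded as tuples of 0/1 alleles, the genotype is their elementwise sum (so each entry is 0, 1, or 2), ordered pairs are enumerated, and the phenotype likelihoods and haplotype-pair probabilities are supplied by the caller. All function names are hypothetical.

```python
from itertools import product

import numpy as np

def compatible_pairs(genotype):
    """Enumerate the ordered haplotype pairs (h, h') with h + h' = genotype."""
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([(0, 0)])
        elif g == 2:
            per_locus.append([(1, 1)])
        elif g == 1:
            per_locus.append([(0, 1), (1, 0)])  # phase is ambiguous here
        else:
            raise ValueError("genotype entries must be 0, 1, or 2")
    pairs = []
    for combo in product(*per_locus):
        h_first = tuple(a for a, _ in combo)
        h_second = tuple(b for _, b in combo)
        pairs.append((h_first, h_second))
    return pairs

def conditional_expectation(omega_values, phenotype_lik, pair_prob):
    """Weighted average of omega over compatible pairs, with weights
    proportional to P(Y | X, pair) * P(pair), as in the R_i = 1 formula."""
    weights = np.asarray(phenotype_lik, float) * np.asarray(pair_prob, float)
    return float(np.dot(omega_values, weights) / weights.sum())
```

A genotype with m heterozygous loci yields 2^m ordered pairs, which is why the number of markers per window has to stay moderate in practice.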

A.3.2 Proof of Theorem 1. The case-control design with known population totals is a special case of the two-phase designs studied by Breslow, McNeney, and Wellner (2003). The likelihood given in (6) resembles (23) of Breslow et al. The key difference is that the former involves several nonparametric components $F_g(\cdot)$, whereas the latter involves only a single nonparametric function. Despite this difference, the arguments of Breslow et al. can be used to prove Theorem 1 with minor modifications. Specifically, the regularity conditions of Breslow et al. hold under our Conditions 1–4. Thus the consistency of $(\hat\theta, \hat F_g(\cdot))$ follows from the results of van der Vaart and Wellner (2001), whereas the weak convergence and asymptotic efficiency can be established by applying the results of Murphy and van der Vaart (2000) through a least favorable submodel, which can be constructed as done by Breslow et al. (2003, sec. 3).


A.4 Case-Control Studies With Unknown Population Totals

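A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F^{\dagger}_g\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F^{\dagger}_g\}, \{\tilde q_g\})$ yield the same likelihood,

$$
RL(\theta, \{F^{\dagger}_g\}, \{q_g\}) = RL(\tilde\theta, \{\tilde F^{\dagger}_g\}, \{\tilde q_g\}). \tag{A.3}
$$

Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f^{\dagger}_g(x) q_g / \sum_{g \in \mathcal{G}} q_g = \tilde f^{\dagger}_g(x) \tilde q_g / \sum_{g \in \mathcal{G}} \tilde q_g$. Thus $f^{\dagger}_g(x) = \tilde f^{\dagger}_g(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

$$
\eta(y, x, g; \theta) = C(y)\, \eta(y, x, g; \tilde\theta), \tag{A.4}
$$

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F^{\dagger}_g\}, \{q_g\})$ is $\{(\tilde\theta, \{F^{\dagger}_g\}, \{q_g\}) : \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that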

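$$
\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5}
$$

for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

$$
\frac{\dfrac{\pi_1(g)\{1 + e^{\alpha + \psi_2(x)}\}}{\pi_2(g)\{1 + e^{\alpha + \psi_1(x)}\}} + e^{\psi_2(x) - \psi_1(x)}}
{\dfrac{\pi_1(g)\{1 + e^{\alpha + \psi_2(x)}\}}{\pi_2(g)\{1 + e^{\alpha + \psi_1(x)}\}} + 1}
= \frac{\dfrac{\tilde\pi_1(g)\{1 + e^{\tilde\alpha + \psi_2(x)}\}}{\tilde\pi_2(g)\{1 + e^{\tilde\alpha + \psi_1(x)}\}} + e^{\psi_2(x) - \psi_1(x)}}
{\dfrac{\tilde\pi_1(g)\{1 + e^{\tilde\alpha + \psi_2(x)}\}}{\tilde\pi_2(g)\{1 + e^{\tilde\alpha + \psi_1(x)}\}} + 1}, \tag{A.6}
$$

where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.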

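Without loss of generality, assume that 0 is in the support of $X$. We then have the following results:

1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \neq 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \neq 1$, (A.6) yields

$$
\frac{\pi_1(g)}{\pi_2(g)} \frac{1 + e^{\alpha}}{1 + e^{\alpha + \beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)} \frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha + \beta_1}}.
$$

Thus (A.6) is equivalent to

$$
\frac{\pi_1(g)\, \pi_2(\tilde g)}{\pi_1(\tilde g)\, \pi_2(g)} = \frac{\tilde\pi_1(g)\, \tilde\pi_2(\tilde g)}{\tilde\pi_1(\tilde g)\, \tilde\pi_2(g)} \quad \text{for all } g, \tilde g \in \mathcal{G}_3.
$$

3. $\beta_1 \neq 0$, $\beta_4 = 0$, and $\beta_{3z} \neq 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z} z \neq 0$, (A.6) yields

$$
\frac{\pi_1(g)}{\pi_2(g)} \frac{1 + e^{\alpha + \beta_{3z} z}}{1 + e^{\alpha + \beta_1 + \beta_{3z} z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)} \frac{1 + e^{\tilde\alpha + \beta_{3z} z}}{1 + e^{\tilde\alpha + \beta_1 + \beta_{3z} z}}.
$$

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \neq 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

$$
\frac{\pi_1(g)}{\pi_2(g)} \frac{1 + e^{\alpha + \psi_2(x)}}{1 + e^{\alpha + \psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)} \frac{1 + e^{\tilde\alpha + \psi_2(x)}}{1 + e^{\tilde\alpha + \psi_1(x)}} \tag{A.7}
$$

for any $x$ such that $\beta_1 + \beta_4^T x \neq 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.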
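The monotonicity invoked in case 2 can be checked directly. The short derivation below is added for convenience; it uses only the symbols λ and c from the text and assumes λ > 0, which holds when λ is a ratio of positive quantities.

```latex
\frac{d}{d\lambda}\,\frac{\lambda + c}{\lambda + 1}
  = \frac{(\lambda + 1) - (\lambda + c)}{(\lambda + 1)^2}
  = \frac{1 - c}{(\lambda + 1)^2}
```

The derivative has the constant sign of 1 − c, so the map is strictly increasing for c < 1, strictly decreasing for c > 1, and constant only when c = 1. Equality of the two sides of (A.6) with c ≠ 1 therefore forces the two values of λ to coincide.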

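A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \neq 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that, under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain

$$
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \left\{ \frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1 \right\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \left\{ \frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1 \right\}, \tag{A.8}
$$

$$
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \left\{ \frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1 \right\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \left\{ \frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1 \right\}, \tag{A.9}
$$

and

$$
\frac{\Phi(\alpha)}{1 - \Phi(\alpha)} \left\{ \frac{1}{\Phi(\alpha + \beta_3^T x)} - 1 \right\}
= \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)} \left\{ \frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1 \right\}, \tag{A.10}
$$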

where is the standard normal distribution function In (A10) we letx except the component z be 0 Then

(α)

1 minus (α)

1

(α + βzz)minus 1

= (α)

1 minus (α)

1

(α + βzz)minus 1

By letting z rarr infin or minusinfin we conclude that βz and βz must have thesame sign Without loss of generality assume that βz gt βz gt 0 Thenthe left side divided by the right side goes to 0 as z rarr infin This is acontradiction Therefore βz = βz We differentiate both sides to obtain

(α)

1 minus (α)

φ(α + βzz)

(α + βzz)2= (α)

1 minus (α)

φ(α + βzz)

(α + βzz)2

By taking the ratio of the two sides and letting z rarr sgn(βz)infin we im-mediately conclude that α = α Applying this result to (A8)ndash(A10)

we obtain 2β1 +β2 +βT3 x+2βT

4 x+βT5 x = 2β1 + β2 + β

T3 x+2β

T4 x+

βT5 x β1 +βT

3 x+βT4 x = β1 + β

T3 x+ β

T4 x and βT

3 x = βT3 x Therefore

β = β For g isin G3

η(1xg θ)

= 1 minus (α)

(α)

(α + β1 + βT

3 x + βT4 x)π1(g)π2(g)

+ (α + βT3 x)

times [1 minus (α + β1 + βT3 x + βT

4 x)π1(g)π2(g)

+ 1 minus (α + βT3 x)

]minus1 (A11)

Lin and Zeng Inference on Haplotype Effects 101

It follows that π1(g)π2(g) = π1(g)π2(g) The other direction of theclaim is obvious in view of (A11) and the expressions of η(1xg) forg isin G1 and g isin G2

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

\[
\frac{e^{-e^{\alpha}}\left(e^{e^{\alpha+\beta_z z}} - 1\right)}{1 - e^{-e^{\alpha}}}
= \frac{e^{-e^{\tilde\alpha}}\left(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1\right)}{1 - e^{-e^{\tilde\alpha}}}.
\]

Taking the first and second derivatives of the two sides with respect to z and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.
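The derivative-ratio step for the complementary log–log model can be written out explicitly. The shorthand $u = \alpha + \beta_z z$ and the constant $C$ are introduced here only for this worked step and are not the article's notation:

```latex
% Let u = \alpha + \beta_z z and C = e^{-e^{\alpha}} / (1 - e^{-e^{\alpha}}),
% so that one side of the identity is f(z) = C (e^{e^{u}} - 1).
\[
f'(z) = C\,\beta_z\, e^{u}\, e^{e^{u}}, \qquad
f''(z) = C\,\beta_z^{2}\, e^{u}\,(1 + e^{u})\, e^{e^{u}},
\]
\[
\frac{f''(z)}{f'(z)} = \beta_z\,(e^{u} + 1) = \beta_z\,(e^{\alpha + \beta_z z} + 1).
\]
% The constant C cancels in the ratio, which is why the ratio isolates
% (\alpha, \beta_z) from the odds factor on each side.
```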

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are J distinct observed values of (X, G), denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of (X, G) at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

\[
l_n(\theta, \{\delta_j\}) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta)
- n_1 \log\left\{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j\right\}
+ \sum_{j=1}^{J} n_{+j} \log \delta_j,
\]

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

\[
\frac{n_{+j}}{\delta_j} - \frac{n_1\,\eta(1, x_j, g_j; \theta)}{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j} + \lambda = 0.
\]

Multiplying both sides by $\delta_j$ and summing over j, we see that $\lambda = n_1 - n$. Thus

\[
\delta_j = \frac{n_{+j}}{n - n_1 + n_1\,\eta(1, x_j, g_j; \theta)/\mu},
\tag{A.12}
\]

where $\mu = \sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

\[
l^*_n(\theta, \mu) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta)
- \sum_{j=1}^{J} n_{+j} \log\left\{\frac{n_1}{n}\,\eta(1, x_j, g_j; \theta) + \left(1 - \frac{n_1}{n}\right)\mu\right\}
+ (n - n_1) \log \mu.
\]

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_{\mu} l^*_n(\theta, \mu) + C_n$. If $\hat\mu$ maximizes $l^*_n(\theta, \mu)$, then $\partial l^*_n(\theta, \mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^{J} \delta_j = 1$. Thus $\max_{\mu} l^*_n(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for θ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l^*_n(\theta, \mu)$ up to a constant $C_n$. We maximize $l^*_n(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of θ. It can be shown that, up to a constant, $l^*_n(\theta, \mu)$ is the log-likelihood based on a random sample of size n from a conditional distribution of Y given X and G. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l^*_n(\theta, \mu)$.
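The link between the stationarity condition in μ and the constraint on the jump sizes can be checked numerically. The following is a minimal sketch, not the authors' implementation: it holds θ fixed, takes the values $\eta_j = \eta(1, x_j, g_j; \theta)$ as given numbers, and solves $\partial l^*_n/\partial\mu = 0$ by bisection rather than by the Newton–Raphson iteration used in the article. The function names `solve_mu` and `jump_sizes` are hypothetical.

```python
def solve_mu(eta, n_plus, n1, tol=1e-12):
    """Solve d l*_n / d mu = 0 for mu, with theta (hence eta_j) held fixed.

    eta    : values eta_j = eta(1, x_j, g_j; theta), all positive
    n_plus : counts n_{+j} = n_{0j} + n_{1j}
    n1     : total number of cases (0 < n1 < n)
    """
    n = sum(n_plus)
    r = n1 / n

    def h(mu):
        # Stationarity rewritten as sum_j n_{+j} mu / (r eta_j + (1-r) mu) = n;
        # the left side is increasing in mu, so the root is unique.
        return sum(m * mu / (r * e + (1 - r) * mu)
                   for e, m in zip(eta, n_plus)) - n

    lo, hi = min(eta), max(eta)  # h(lo) <= 0 <= h(hi) brackets the root
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jump_sizes(eta, n_plus, n1, mu):
    """Jump sizes delta_j from (A.12) for a given mu."""
    n = sum(n_plus)
    return [m / (n - n1 + n1 * e / mu) for e, m in zip(eta, n_plus)]
```

For example, with `eta = [1.0, 1.8, 0.6, 2.5]`, case counts `[10, 25, 5, 20]`, and control counts `[30, 15, 25, 10]`, the jump sizes computed at the solved μ sum to 1 and reproduce $\mu = \sum_j \eta_j \delta_j$ up to the bisection tolerance.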

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define

\[
\zeta_1(x, g; \theta) = \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, h_k, h_l)} \{\rho\,\pi_k \delta_{kl} + (1 - \rho)\,\pi_k \pi_l\},
\]

\[
\zeta_0(g; \theta) = \sum_{(h_k, h_l) \in S(g)} \{\rho\,\pi_k \delta_{kl} + (1 - \rho)\,\pi_k \pi_l\}.
\]

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

\[
l^*_n(\theta, \{\mu_g\})
= \sum_{i=1}^{n} \{y_i \log \zeta_1(X_i, G_i; \theta) + (1 - y_i) \log \zeta_0(G_i; \theta)\}
- \sum_{i=1}^{n} \sum_{g} I(G_i = g) \log\left\{\zeta_1(X_i, G_i; \theta) + n_1^{-1} n_g \sum_{\tilde g} \mu_{\tilde g} - \mu_g\right\}
+ \sum_{i=1}^{n} (1 - y_i) \log \sum_{g} \mu_g,
\]

where $n_g$ is the number of times G = g in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If X is independent of G, then we obtain the MLE $\hat\theta$ by maximizing the following function:

\[
l^*_n(\theta, \mu) = \sum_{i=1}^{n} y_i \log \zeta_1(X_i, G_i; \theta)
+ \sum_{i=1}^{n} (1 - y_i) \log \zeta_0(G_i; \theta)
+ \sum_{i=1}^{n} (1 - y_i) \log \mu
- \sum_{i=1}^{n} \log\left\{(1 - r)\mu + r \sum_{g} \zeta_1(X_i, g; \theta)\right\},
\]

where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where B is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k): k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l): k, l = 1, \ldots, K\}$. Suppose that Y is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given X is characterized by

\[
P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B, Q_1, Q_2, Y} \exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},
\]

where

\[
\vartheta = \bigl(-\log\mu + \log\{r/(1 - r)\},\ \beta^T,\ \log\pi_1 - \log\{\rho/(1 - \rho)\},\ \ldots,\ \log\pi_K - \log\{\rho/(1 - \rho)\}\bigr)^T
\]

and

\[
W(B, Q_1, Q_2, Y, X) = \Bigl(Y,\ Y Z^T(X, H),\ B\,I(Q_1 = (h_1, h_1)) + (1 - B)\sum_{l}\{I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1))\},\ \ldots,\ B\,I(Q_1 = (h_K, h_K)) + (1 - B)\sum_{l}\{I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K))\}\Bigr)^T.
\]

We can show that $l^*_n(\theta, \mu)$ is equivalent to the log-likelihood

\[
l^*_n(\vartheta) = \sum_{i=1}^{n} \log\left[\frac{\sum_{BQ_1 + (1 - B)Q_2 \in S(G_i)} e^{\vartheta^T W(B, Q_1, Q_2, Y_i, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right].
\]

We maximize $l^*_n(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l^*_n(\vartheta)$.

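The complete-data score function is

\[
\sum_{i=1}^{n}\left[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)
- \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right].
\]

Thus, in the E-step we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:

\[
E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i]
= \frac{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}\, W(b, q_1, q_2, Y_i, X_i)}{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}.
\]

In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

\[
\vartheta^{(k+1)} = \vartheta^{(k)} - \Omega^{-1} \times \sum_{i=1}^{n}\left[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i]
- \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right],
\]

where

\[
\Omega = -\sum_{i=1}^{n}\left[\frac{\sum_{b, q_1, q_2, y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right]
+ \sum_{i=1}^{n}\left[\frac{\bigl\{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\bigr\}^{\otimes 2}}{\bigl\{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\bigr\}^{2}}\right]
\]

and $a^{\otimes 2} = aa^T$.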
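The structure of the E-step, a weighted average over the haplotype pairs in S(G) that are compatible with the observed genotype, can be illustrated with a deliberately simplified stand-in. The sketch below is not the algorithm of this section: it ignores the phenotype and covariates, assumes Hardy–Weinberg equilibrium (ρ = 0), considers only two biallelic SNPs, and uses the closed-form M-step available in that special case instead of a one-step Newton–Raphson update. All names (`HAPS`, `compatible_pairs`, `em_haplotype_freqs`) are hypothetical.

```python
from itertools import product

# The four possible haplotypes at two biallelic SNPs (0/1 allele coding).
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(g):
    """S(g): ordered haplotype index pairs (k, l) with h_k + h_l = g."""
    return [(k, l) for k, l in product(range(4), repeat=2)
            if tuple(a + b for a, b in zip(HAPS[k], HAPS[l])) == tuple(g)]


def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), the allele counts (0, 1, or 2) at each SNP.
    Assumes every genotype keeps positive probability under the current
    estimates (true for the uniform starting value used here).
    """
    pi = [0.25] * 4
    for _ in range(n_iter):
        counts = [0.0] * 4
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each compatible pair under HWE.
            weights = [pi[k] * pi[l] for k, l in pairs]
            total = sum(weights)
            for (k, l), w in zip(pairs, weights):
                counts[k] += w / total
                counts[l] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        pi = [c / (2 * len(genotypes)) for c in counts]
    return pi
```

Only the doubly heterozygous genotype (1, 1) is phase-ambiguous here; its posterior weight is split between the pairs {00, 11} and {01, 10} in proportion to the current frequency products, which mirrors the role of the indicator $I(bq_1 + (1 - b)q_2 \in S(G_i))$ in the conditional expectation above.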

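A.4.6 Proof of Theorem 2. Write $\hat F_{xg}(x, g) = \hat F^{\dagger}_g(x)\hat q_g$ and $F_{xg}(x, g) = F^{\dagger}_g(x) q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F^*_{xg}(x, g) \equiv F^*_g(x) q^*_g$, where $q^*_g > 0$ for any g. Because $\hat F^{\dagger}_g$ maximizes the likelihood, there exists some Lagrange multiplier $\hat\lambda_g$ such that

\[
\frac{I(G_i = g)}{\hat F^{\dagger}_g\{X_i\}}
- \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{\int_{x,g} \eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}
- n\hat\lambda_g = 0,
\]

where $\hat F^{\dagger}_g\{X_i\}$ denotes the point mass of $\hat F^{\dagger}_g$ at $X_i$, and the integral is interpreted as integration over x and summation over g. Because $\sum_{i=1}^{n} \hat F^{\dagger}_g\{X_i\} = 1$, $\hat\lambda_g$ satisfies the equation

\[
n^{-1}\sum_{i=1}^{n} I(G_i = g)\left\{\hat\lambda_g + \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{n\int_{x,g} \eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\right\}^{-1} = 1
\tag{A.13}
\]

and

\[
\min_{1 \le i \le n}\left\{\hat\lambda_g + \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{n\int_{x,g} \eta(1, x, g; \hat\theta)\,d\hat F_{xg}(x, g)}\right\} > 0.
\]

Clearly, $\hat\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\hat\lambda_g \to \lambda^*_g$. By (A.13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of x, we can show that there exists a positive constant δ such that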

mingx

∣∣∣∣λ

lowastg + η(1xg θlowast)qlowast

gintxg η(1x g θlowast)dFlowast

xg(x g)

∣∣∣∣

gt δ

Consequently when n is sufficiently large

Fdaggerg(x) = nminus1

nsum

i=1

I(Gi = gXi le x)

times(

max

[∣∣∣∣λg + η(1Xig θ )qgn1

times

nint

xgη(1x g θ)dFxg(x g)

minus1∣∣∣∣ δ

])minus1

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proof of Theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, \{F^{\dagger}_g\}, \{q_g\})$ with respect to θ along the direction v and with respect to $F_{xg}$ along the path $F_{\varepsilon} = F_{xg} + \varepsilon\int \psi(x, g)\,dF_{xg}$, where v has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

\[
n^{1/2}\left\{(v^T\Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0)
+ \int (v^T\Omega_{12} + \Omega_{22}[\psi])\,d(\hat F_{xg} - F_{xg})\right\}
\]
\[
= n^{-1/2}\sum_{i=1}^{n} y_i\left\{v^T l_{\theta}(1, X_i, G_i; \theta_0, F_{xg}) + l_F(1, X_i, G_i; \theta_0, F_{xg})\left[\int \psi\,dF_{xg}\right]\right\}
\]
\[
+\; n^{-1/2}\sum_{i=1}^{n} (1 - y_i)\left\{v^T l_{\theta}(0, X_i, G_i; \theta_0, F_{xg}) + l_F(0, X_i, G_i; \theta_0, F_{xg})\left[\int \psi\,dF_{xg}\right]\right\}
+ o_p(1),
\]

where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of x, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of ψ, and $l_{\theta}$ and $l_F$ are the scores with respect to θ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through the limit of $n_1/n$. We can show that the operator $B[v, \psi] \equiv (v^T\Omega_{11} + \Omega_{21}[\psi]^T,\ v^T\Omega_{12} + \Omega_{22}[\psi])^T$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from Theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via the limit of $n_1/n$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean equal to that limit. By choosing some ψ such that $B[v, \psi] = (v^T, 0)^T$ for all v, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from Proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

\[
\frac{dP_n}{d\tilde P_n}
= \exp\left\{a_n \sum_{i=1}^{n} \left.\frac{\partial \log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a}\right|_{a=0} + o(1)\right\}
\to_{\tilde P_n} 1.
\]

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

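A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

\[
\sum_{H \in S(G)} \left\{\Lambda'(Y)\,e^{\beta^T Z(X,H)}\,Q'\bigl(\Lambda(Y)e^{\beta^T Z(X,H)}\bigr)\right\}^{\Delta}
\left\{1 - Q\bigl(\Lambda(Y)e^{\beta^T Z(X,H)}\bigr)\right\}^{1-\Delta} P_{\gamma}(H)
\]
\[
= \sum_{H \in S(G)} \left\{\tilde\Lambda'(Y)\,e^{\tilde\beta^T Z(X,H)}\,Q'\bigl(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)}\bigr)\right\}^{\Delta}
\left\{1 - Q\bigl(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)}\bigr)\right\}^{1-\Delta} P_{\gamma}(H).
\]

By choosing $\Delta = 1$ and integrating Y from 0 to τ on both sides, we obtain

\[
\sum_{H \in S(G)} Q\bigl(\Lambda(\tau)e^{\beta^T Z(X,H)}\bigr) P_{\gamma}(H)
= \sum_{H \in S(G)} Q\bigl(\tilde\Lambda(\tau)e^{\tilde\beta^T Z(X,H)}\bigr) P_{\gamma}(H).
\]

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\Lambda(Y)e^{\beta^T Z(X,H)} = \tilde\Lambda(Y)e^{\tilde\beta^T Z(X,H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.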

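A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_{\beta}^T, \mu_{\gamma}^T)^T$ and a function $\psi(t)$ such that

\[
\mu^T l_{\theta}(\theta_0, \Lambda_0) + l_{\Lambda}(\theta_0, \Lambda_0)\left[\int \psi\,d\Lambda_0\right] = 0,
\tag{A.14}
\]

where $l_{\theta}$ is the score function for θ and $l_{\Lambda}[\int \psi\,d\Lambda_0]$ is the score function for Λ along the submodel $\Lambda_0 + \varepsilon\int \psi\,d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate Y from 0 to τ to obtain

\[
\sum_{H \in S(G)} Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr) P_{\gamma}(H)
\times \left\{\frac{Q'\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_{\beta}^T Z(X,H)}{Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)}
+ \frac{Q'\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\int_0^{\tau}\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)}
+ \mu_{\gamma}^T\nabla_{\gamma}\log P_{\gamma}(H)\right\} = 0.
\tag{A.15}
\]

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

\[
\sum_{H \in S(G)} \left\{1 - Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\right\} P_{\gamma}(H)
\times \left\{-\frac{Q'\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\mu_{\beta}^T Z(X,H)}{1 - Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)}
- \frac{Q'\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)\int_0^{\tau}\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{1 - Q\bigl(\Lambda_0(\tau)e^{\beta_0^T Z(X,H)}\bigr)}
+ \mu_{\gamma}^T\nabla_{\gamma}\log P_{\gamma}(H)\right\} = 0.
\tag{A.16}
\]

The summation of (A.15) and (A.16) entails $\mu_{\gamma}^T\nabla_{\gamma}\log P_{\gamma}(H) = 0$. From the proof of Lemma 1, $\mu_{\gamma} = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_{\beta}^T Z(X,H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_{\beta} = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

\[
\psi(Y) + \frac{Q''\bigl(\Lambda_0(Y)e^{\beta_0^T Z(X,H)}\bigr)\int_0^{Y}\psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X,H)}}{Q'\bigl(\Lambda_0(Y)e^{\beta_0^T Z(X,H)}\bigr)} = 0
\]

for $H = (h, h)$. Therefore, $\psi = 0$.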

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), “Prediction and Entropy,” in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), “Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?” European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and, eventually, to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and understanding their relevance in determining disease risks.
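The window scan described above can be sketched as follows. This is only an illustration of how markers might be grouped into windows of fixed genetic length; the function name `sliding_windows`, the half-open window convention, and the choice of step size are assumptions made here, and the association test applied within each window is left out.

```python
def sliding_windows(positions, width, step):
    """Yield (window_start, marker_indices) for windows [start, start + width).

    positions: sorted marker map positions (for example, in centimorgans).
    width, step: window length and shift, in the same units (step > 0).
    """
    if not positions:
        return
    first, last = positions[0], positions[-1]
    k = 0
    while first + k * step <= last:
        start = first + k * step
        idx = [i for i, p in enumerate(positions) if start <= p < start + width]
        if idx:  # skip windows that contain no markers
            yield start, idx
        k += 1
```

Each yielded index list identifies the markers whose alleles would form the haplotypes tested in that window; overlapping windows are obtained by taking `step` smaller than `width`.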

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases, where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999); Service, Temple Lang, Freimer, and Sandkuijl (1999); Lam, Roeder, and Devlin (2000); Morris, Whittaker, and Balding (2000); Liu, Sabatti, Teng, Keats, and Risch (2001); and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information, when available, or by assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments, I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69(suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.

Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If, for each value of covariates X, we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model 3 applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining m_g(y, x; θ), where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when X and G are associated unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that X and H are independent at the population level, not just conditional on G.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model 3 is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently, we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence of the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835

With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X, \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\Pr(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).
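The link between the two likelihoods can be made explicit with Bayes' rule. The identity below is a standard sketch, not a formula taken from the comment; $F(g, x)$ is an assumed symbol for the joint distribution of $(G, X)$ in the source population.

$$\Pr(G = g, X = x \mid D = d) = \frac{\mathrm{pr}(D = d \mid g, x)\, dF(g, x)}{\int \mathrm{pr}(D = d \mid g', x')\, dF(g', x')}.$$

The numerator is the prospective likelihood times the covariate distribution. When $F$ is left completely unrestricted, the equivalence result of Prentice and Pyke (1979) applies; when $F$ is restricted, for example by HWE or gene–environment independence, the restriction is what the retrospective analysis can exploit for efficiency.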

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d, \theta), \tag{2}$$

where the model $q(h^d, \theta)$ in turn could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = q(h, x, \theta)\,\frac{\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h} \sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \Pr(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
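The definitions of $S$, $L^*$, and the pseudolikelihood (4) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the implementation of Spinka et al. (2005): two biallelic SNPs, $q$ given by HWE under model (2), a hypothetical additive form for $m(h, x, \beta_1)$ with target haplotype `(1, 1)`, and made-up parameter values. All function and variable names are assumptions.

```python
import math
from itertools import combinations_with_replacement, product

HAPLOTYPES = list(product((0, 1), repeat=2))                     # two SNPs
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))  # unordered pairs


def genotype_of(dip):
    """Unphased genotype (allele counts per SNP) implied by a diplotype."""
    h1, h2 = dip
    return tuple(a + b for a, b in zip(h1, h2))


def q_hwe(dip, freq):
    """Diplotype probability under HWE, as in model (2)."""
    h1, h2 = dip
    return (1.0 if h1 == h2 else 2.0) * freq[h1] * freq[h2]


def m(dip, x, beta1, target=(1, 1)):
    """Hypothetical risk term: haplotype dose, covariate, and their interaction."""
    copies = sum(h == target for h in dip)
    return beta1[0] * copies + beta1[1] * x + beta1[2] * copies * x


def S(d, dip, x, beta0, kappa, beta1, freq):
    eta = m(dip, x, beta1)
    return q_hwe(dip, freq) * math.exp(d * (kappa + eta)) / (1.0 + math.exp(beta0 + eta))


def L_star(d, genotype, x, beta0, kappa, beta1, freq):
    """Sum of S over diplotypes consistent with G, over the sum across all h and d."""
    num = sum(S(d, dip, x, beta0, kappa, beta1, freq)
              for dip in DIPLOTYPES if genotype_of(dip) == genotype)
    den = sum(S(dd, dip, x, beta0, kappa, beta1, freq)
              for dip in DIPLOTYPES for dd in (0, 1))
    return num / den


def pseudo_loglik(data, beta0, kappa, beta1, freq):
    """Pseudolikelihood (4) for records (d, genotype, x)."""
    return sum(math.log(L_star(d, g, x, beta0, kappa, beta1, freq)) for d, g, x in data)


# Made-up parameter values; n1 = n0 = 500 and pi = 0.05 are assumptions.
freq = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
beta0, beta1 = -3.0, (0.5, 0.2, 0.1)
kappa = beta0 + math.log(500 / 500) - math.log(0.05 / 0.95)

# For fixed x, L* sums to one over (d, genotype).
total = sum(L_star(d, g, 1.0, beta0, kappa, beta1, freq)
            for d in (0, 1) for g in product((0, 1, 2), repeat=2))
```

Because the sets $\mathcal{H}_G$ partition the diplotypes, `total` is 1 up to rounding, which is a convenient sanity check that $L^*$ behaves like a conditional probability of $(D, G)$ given $X$.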

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let $R = 1$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

$$
\begin{aligned}
l^* &= \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1) \times \mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}
$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with the distribution $F(x)$ of $X$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$
0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}. \tag{6}
$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$
0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}, \tag{7}
$$

where

$$r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x, \beta_1)\}}.$$

Spinka et al. showed that, with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
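A short numerical sketch may help show what the weight $r(h^d, x)$ in (7) does. It uses only the definitions of $\kappa$ and $r$ given above; the parameter values and function names are made up for illustration. When the case-to-control ratio matches the population odds $\pi/(1 - \pi)$, one has $\kappa = \beta_0$ and $r \equiv 1$, so (7) reduces to (6); under a balanced case-control design the weight differs across diplotypes.

```python
import math


def kappa(beta0, n1, n0, pi):
    """kappa = beta0 + log(n1/n0) - log{pi/(1 - pi)}."""
    return beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))


def r_weight(eta, beta0, kap):
    """r(h^d, x), written in terms of eta = m(h^d, x, beta_1)."""
    return (1.0 + math.exp(kap + eta)) / (1.0 + math.exp(beta0 + eta))


beta0, pi = -3.0, 0.05   # illustrative values, not taken from the article

# Sampling at the population odds (50 cases per 950 controls): no correction.
kap_pop = kappa(beta0, n1=50, n0=950, pi=pi)
r_pop = [r_weight(eta, beta0, kap_pop) for eta in (0.0, 1.0)]

# Balanced case-control sampling: the weight now depends on eta.
kap_cc = kappa(beta0, n1=500, n0=500, pi=pi)
r_cc = [r_weight(eta, beta0, kap_cc) for eta in (0.0, 1.0)]
```

In the balanced design the diplotypes with larger $m(h^d, x, \beta_1)$ receive larger weights, which is the sense in which (7) corrects the phase probabilities for the oversampling of cases.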

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazard (CPH) model deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X, \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\Pr(H^d) = q(H^d, \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

$$\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}$$

where

$$
R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\} = E\{R(H^d, X, \beta_1) \mid G, X, T \ge t\} = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, \mathrm{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}.
$$

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for rare disease, one could assume that $\mathrm{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function,

$$R^*(G, X, \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d, \theta)},$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat{\theta})$, where $\hat{\theta}$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
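The induced relative risk under the rare disease approximation is a weighted average of diplotype-specific relative risks, with HWE weights restricted to the diplotypes consistent with the genotype. The Python sketch below illustrates this for two SNPs. It is a toy under stated assumptions, not the code of Chen and Chatterjee (2005): the log-linear form of $R$, the target haplotype `(1, 1)`, the haplotype frequencies, and all names are hypothetical.

```python
import math
from itertools import product


def compatible_diplotypes(genotype):
    """Unordered haplotype pairs whose allele counts sum to `genotype`."""
    pairs = set()
    for h1 in product((0, 1), repeat=len(genotype)):
        h2 = tuple(g - a for g, a in zip(genotype, h1))
        if all(b in (0, 1) for b in h2):
            pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def q_hwe(h1, h2, freq):
    """Diplotype probability under HWE."""
    return (1.0 if h1 == h2 else 2.0) * freq[h1] * freq[h2]


def relative_risk(dip, x, beta1, target=(1, 1)):
    """Hypothetical R(H^d, X, beta_1): log-linear in haplotype dose and covariate."""
    copies = sum(h == target for h in dip)
    return math.exp(beta1[0] * copies + beta1[1] * x)


def induced_relative_risk(genotype, x, beta1, freq):
    """R*(G, X, beta_1, theta): HWE-weighted average of R over diplotypes in H_G."""
    dips = compatible_diplotypes(genotype)
    weights = [q_hwe(h1, h2, freq) for h1, h2 in dips]
    risks = [relative_risk(dip, x, beta1) for dip in dips]
    return sum(r * w for r, w in zip(risks, weights)) / sum(weights)


freq = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # plays the role of theta-hat
beta1 = (0.7, 0.2)

r_ambiguous = induced_relative_risk((1, 1), 0.0, beta1, freq)    # phase unknown
r_unambiguous = induced_relative_risk((2, 2), 0.0, beta1, freq)  # phase known
```

For an unambiguous genotype the induced relative risk equals $R$ itself, while for the double heterozygote it lies between the relative risks of the two possible phase resolutions. In practice these values would be supplied as the relative risk in a partial likelihood, with `freq` replaced by a consistent estimate of $\theta$.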

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases, such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat{\theta})$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng’s further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), “Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions,” Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), “Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects,” Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Prentice, R. L. (1982), “Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model,” Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), “Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity,” Genetic Epidemiology, 29, 105–127.

Comment
Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of “Hardy–Weinberg equilibrium (HWE)” in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits “inbreeding” (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider “inbreeding.” This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual; that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
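The homozygote model above can be sketched numerically. The text gives only the diagonal term; the off-diagonal form $\pi_{kl} = (1 - \rho)\pi_k\pi_l$ for $k \neq l$ used below is the standard companion assumption that makes the pair probabilities sum to 1, and the frequencies are illustrative.

```python
def diplotype_probs(pi, rho):
    """Ordered haplotype-pair probabilities under the inbreeding model.

    Diagonal:     pi_kk = rho * pi_k + (1 - rho) * pi_k**2   (as in the text)
    Off-diagonal: pi_kl = (1 - rho) * pi_k * pi_l            (ASSUMED companion form)
    """
    size = len(pi)
    return [
        [(rho * pi[k] if k == l else 0.0) + (1 - rho) * pi[k] * pi[l]
         for l in range(size)]
        for k in range(size)
    ]

pi = [0.2, 0.3, 0.5]   # illustrative haplotype frequencies
rho = 0.05             # illustrative inbreeding coefficient
probs = diplotype_probs(pi, rho)

# Excess homozygosity relative to HWE equals rho * pi_k * (1 - pi_k).
excess = [probs[k][k] - pi[k] ** 2 for k in range(len(pi))]
```

Setting `rho = 0` recovers HWE, so the size of `excess` shows directly how much homozygosity a given $\rho$ adds.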

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of “homozygotes” in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.
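A rough calculation shows why a level of .05 is hard to reach. The sketch below uses the standard pedigree inbreeding coefficients for offspring of first cousins (1/16) and second cousins (1/64), and assumes, for illustration only, that all other marriages contribute zero and that $\rho$ can be read as the population-average coefficient.

```python
F_FIRST_COUSINS = 1 / 16    # offspring of first cousins
F_SECOND_COUSINS = 1 / 64   # offspring of second cousins

def mean_inbreeding(frac_first, frac_second):
    """Population-average coefficient, ASSUMING unrelated marriages contribute 0."""
    return frac_first * F_FIRST_COUSINS + frac_second * F_SECOND_COUSINS

# Fraction of first-cousin marriages needed to reach an average of .05.
needed_first_cousin_fraction = 0.05 / F_FIRST_COUSINS

# Even if every marriage were between second cousins, the average stays below .05.
all_second_cousins = mean_inbreeding(0.0, 1.0)
```

Under these assumptions, roughly four in five marriages would have to be between first cousins, which is consistent with the remark that a substantial fraction of marriages between close cousins would be required.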

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson’s paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good-quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
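The confounding mechanism can be illustrated with expected counts. In the sketch below the marker is unrelated to disease within each of two hypothetical subpopulations, yet the pooled table shows a strong association; the carrier frequencies and disease risks are made-up values chosen only for illustration.

```python
def expected_table(n, carrier_freq, disease_risk):
    """Expected 2x2 counts when carrier status and disease are independent."""
    return {
        ("carrier", "case"): n * carrier_freq * disease_risk,
        ("carrier", "control"): n * carrier_freq * (1 - disease_risk),
        ("noncarrier", "case"): n * (1 - carrier_freq) * disease_risk,
        ("noncarrier", "control"): n * (1 - carrier_freq) * (1 - disease_risk),
    }

def odds_ratio(table):
    """Carrier-versus-noncarrier odds ratio for disease."""
    return (table[("carrier", "case")] * table[("noncarrier", "control")]) / (
        table[("carrier", "control")] * table[("noncarrier", "case")]
    )

# Hypothetical subpopulations: both marker frequency and disease risk differ.
sub_a = expected_table(1000, carrier_freq=0.8, disease_risk=0.20)
sub_b = expected_table(1000, carrier_freq=0.2, disease_risk=0.05)
pooled = {cell: sub_a[cell] + sub_b[cell] for cell in sub_a}

or_a, or_b, or_pooled = odds_ratio(sub_a), odds_ratio(sub_b), odds_ratio(pooled)
```

Here the stratum-specific odds ratios are both 1, while the pooled odds ratio exceeds 2, which is the spurious association that stratified or matched designs are meant to remove.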

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.
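One common form of such screening is a goodness-of-fit test of HWE at each biallelic marker in the controls. The sketch below is a textbook Pearson chi-squared version with 1 degree of freedom, offered as an illustration rather than as the procedure used in any particular laboratory; the genotype counts are invented.

```python
import math

def hwe_chisq(n_aa, n_ab, n_bb):
    """Pearson chi-squared test of HWE for one biallelic marker (1 df).

    Assumes both alleles are observed, so no expected count is zero.
    Returns the statistic and its upper-tail p-value.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat_ok, p_ok = hwe_chisq(25, 50, 25)        # counts exactly at HWE proportions
stat_bad, p_bad = hwe_chisq(40, 20, 40)      # marked excess of homozygotes
```

A marker like the second one would be flagged and rechecked for genotyping error before entering an association analysis.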

Next we consider “testing” versus “estimating” the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the “needle in a haystack” nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ’s model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling’s $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single-SNP tests is yet more powerful than Hotelling’s $T^2$ in some instances. Naive haplotype analysis performs poorly because it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods are needed that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases and requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and 500 controls using (12) from LZ, with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification: we simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities $\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$. To estimate $\beta_1$, we used LZ’s rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
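The structure of this three-haplotype list can be checked directly. The short sketch below encodes the haplotypes as allele tuples (position 1 first, which is our reading of the labels 11, 01, and 00) and tabulates which haplotypes carry allele 1 at each SNP and which haplotype pairs produce each unphased genotype.

```python
from itertools import combinations_with_replacement

# Haplotypes from the simulation: (allele at SNP1, allele at SNP2).
HAPLOTYPES = {"h1": (1, 1), "h2": (0, 1), "h3": (0, 0)}

def unphased_genotype(a, b):
    """Per-SNP count of allele 1 for the haplotype pair (a, b)."""
    return tuple(x + y for x, y in zip(HAPLOTYPES[a], HAPLOTYPES[b]))

# Map each unphased genotype to the unordered haplotype pairs that generate it.
genotype_to_pairs = {}
for a, b in combinations_with_replacement(HAPLOTYPES, 2):
    genotype_to_pairs.setdefault(unphased_genotype(a, b), []).append((a, b))

# Which haplotypes carry allele 1 at each position?
carriers_snp1 = [h for h, alleles in HAPLOTYPES.items() if alleles[0] == 1]
carriers_snp2 = [h for h, alleles in HAPLOTYPES.items() if alleles[1] == 1]
```

Only `h1` carries allele 1 at position 1, whereas both `h1` and `h2` carry it at position 2, matching scenarios (A)/(B) versus (C). With this list every unphased genotype also corresponds to exactly one haplotype pair, so the simplification removes phase ambiguity altogether.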

Table 1. Bias and Mean Squared Error (MSE) of $\hat{\beta}_1$

β1    Scenario                          Bias     MSE
.5    Baseline                         −.002    .010
      [A] ρ = .015                      .023    .012
      [B] Ascertainment bias            .273    .083
      [C] Multiple causal haplotypes   −.217    .058
0     Baseline                         −.002    .013
      [A] ρ = .015                      .001    .012
      [B] Ascertainment bias            .002    .008
      [C] Multiple causal haplotypes    .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario “Baseline” refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of $\hat{\beta}_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. In contrast, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.
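The relative importance of bias can be read off Table 1 through the identity MSE = bias² + variance. The sketch below applies it to the rows for $\beta_1 = .5$, reading the table entries as three-decimal values (for example, bias .273 and MSE .083 for scenario [B]); that reading of the printed digits is our assumption.

```python
# (bias, MSE) for beta_1 = .5, as read from Table 1.
TABLE1_BETA_HALF = {
    "Baseline": (-0.002, 0.010),
    "[A] rho = .015": (0.023, 0.012),
    "[B] Ascertainment bias": (0.273, 0.083),
    "[C] Multiple causal haplotypes": (-0.217, 0.058),
}

def implied_variance(bias, mse):
    """Sampling variance implied by MSE = bias**2 + variance."""
    return mse - bias ** 2

def bias_share(bias, mse):
    """Fraction of the MSE attributable to squared bias."""
    return bias ** 2 / mse

summary = {
    scenario: (implied_variance(b, m), bias_share(b, m))
    for scenario, (b, m) in TABLE1_BETA_HALF.items()
}
```

On this reading, squared bias accounts for most of the MSE in scenarios [B] and [C], while the implied variances stay close to the baseline value, so the damage in those scenarios comes from bias rather than from added noise.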

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), “Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus,” American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), “Use of Unphased Multilocus Genotype Data in Indirect Association Studies,” Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), “Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power,” Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), “No Excess of Homozygosity at Loci Used for DNA Fingerprinting,” Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), “Genomic Control for Association Studies,” Biometrics, 55, 997–1004.

Donbak, L. (2004), “Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact,” Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), “The Transmission/Disequilibrium Test: History, Subdivision, and Admixture,” American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), “Genome Association Studies of Complex Diseases by Case-Control Designs,” American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), “Genetic Dissection of Complex Traits,” Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.


Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), “Analysis of Single-Locus Tests to Detect Gene/Disease Associations,” Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), “TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes,” American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), “Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching,” Journal of the American Statistical Association, 98, 236–247.

Comment
Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ’s article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It would also be interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.
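The phase-uncertainty part of such an EM algorithm can be sketched in its simplest setting. The code below is the classical EM for haplotype frequencies from unphased genotypes in a random sample under HWE; it is not the NPMLE for case-cohort or nested case-control data discussed above, it omits the baseline hazard and risk parameters entirely, and the function names are our own.

```python
from collections import Counter
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs whose per-SNP allele-1 counts equal `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]            # allele shared at homozygous sites
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        a, b = list(base), list(base)
        for i, bit in zip(het, bits):
            a[i], b[i] = bit, 1 - bit            # split the heterozygous sites
        pairs.add(tuple(sorted((tuple(a), tuple(b)))))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies, ASSUMING HWE and random sampling."""
    counts = Counter(genotypes)
    pairs = {g: compatible_pairs(g) for g in counts}
    haps = sorted({h for ps in pairs.values() for pair in ps for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        expected = dict.fromkeys(haps, 0.0)
        for g, n_g in counts.items():
            # E-step: posterior weight of each phase resolution of genotype g.
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs[g]]
            total = sum(w)
            for (a, b), w_ab in zip(pairs[g], w):
                expected[a] += n_g * w_ab / total
                expected[b] += n_g * w_ab / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: expected[h] / n_chrom for h in haps}
    return freq

# Toy data: ten double heterozygotes plus five of each double homozygote.
toy = [(1, 1)] * 10 + [(2, 2)] * 5 + [(0, 0)] * 5
toy_freq = em_haplotype_freqs(toy)
```

An NPMLE along the lines suggested above would wrap the same E-step inside a larger likelihood that also involves the baseline hazard and the subjects with unobserved genotypes.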

3. PARAMETERIZATION OF GENETIC VARIANTS

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000853

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated; for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.
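To illustrate what a “logic combination” predictor looks like, the sketch below codes three SNPs as binary carrier indicators and evaluates one Boolean expression per subject. The particular expression and the toy data are hypothetical, and the search over expressions that logic regression performs is not shown.

```python
def logic_feature(snps):
    """Hypothetical Boolean predictor: (S1 AND NOT S2) OR S3."""
    s1, s2, s3 = snps
    return int((s1 and not s2) or s3)

# Each tuple holds binary carrier indicators for three SNPs (toy data).
subjects = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 0)]

# The resulting 0/1 column would enter a regression model as a single predictor.
features = [logic_feature(s) for s in subjects]
```

A predictor of this kind captures an interaction pattern that need not correspond to any single haplotype, which is the sense in which such models go beyond haplotype parameterizations.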

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as “metadata,” usually defined as “data about data.” For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), “Mapping Complex Disease Loci in Whole-Genome Association Studies,” Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least Angle Regression,” The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), “Gradient Directed Regularization,” technical report, Stanford University.

Gui, J., and Li, H. (2005a), “Threshold Gradient Descent Method for Censored Data Regression With Applications in Pharmacogenomics,” Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), “Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data,” Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), “The Hallmarks of Cancer,” Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), “Tree-Structured Supervised Learning and the Genetics of Hypertension,” Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), “The KEGG Databases at GenomeNet,” Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), “Identifying Interacting SNPs Using Monte Carlo Logic Regression,” Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), “Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data,” Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), “Logic Regression,” Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.


Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr Sabatti offered an excellent description of the importantroles of haplotype-based studies in localizing regions harboringdisease susceptibility genes and in identifying the functionalvariants She also provided a nice discussion of the potentialuse of our methods in different types of studies Although thenumerical results in our article pertain to the inference on hap-lotype effects within a region of interest the theory developedin the article allows one to perform association mapping aswell As we mention in section 5 of our article one can scanthrough the genome with sliding marker windows and per-form an overall likelihood-ratio test for association between thedisease phenotype and the haplotypes formed by the markersin each window In genomewide investigations environmentalfactors are typically ignored so the maximization of the like-lihood given in (9) of our article is very fast Rare haplotypesneed to be removed or combined in the analysis so as to avoidthe sparsity of the data The effects of multiple comparisons canbe adjusted by permuting the data or by applying the MonteCarlo method of Lin (2005) An article on haplotype-based as-sociation mapping is currently under preparation
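The window scan with a permutation adjustment can be sketched as follows. This is a generic max-statistic permutation scheme, and the per-window statistic is a simple stand-in (summed absolute differences in mean allele counts between cases and controls), not the likelihood-ratio test described above and not the Monte Carlo method of Lin (2005); all function names are hypothetical.

```python
import random

def sliding_windows(n_markers, width):
    """Index ranges of all contiguous marker windows of the given width."""
    return [range(start, start + width) for start in range(n_markers - width + 1)]

def window_stat(genotypes, status, window):
    """Stand-in statistic: summed |case mean - control mean| over window markers."""
    cases = [g for g, d in zip(genotypes, status) if d == 1]
    controls = [g for g, d in zip(genotypes, status) if d == 0]
    return sum(
        abs(sum(g[j] for g in cases) / len(cases)
            - sum(g[j] for g in controls) / len(controls))
        for j in window
    )

def max_stat_adjusted_pvalues(genotypes, status, width, n_perm=200, seed=0):
    """Adjust for scanning many windows by comparing each observed statistic
    with the permutation distribution of the maximum over all windows."""
    rng = random.Random(seed)
    windows = sliding_windows(len(genotypes[0]), width)
    observed = [window_stat(genotypes, status, w) for w in windows]
    exceed = [0] * len(windows)
    permuted = list(status)
    for _ in range(n_perm):
        rng.shuffle(permuted)                      # break genotype-phenotype link
        max_null = max(window_stat(genotypes, permuted, w) for w in windows)
        for i, obs in enumerate(observed):
            if max_null >= obs:
                exceed[i] += 1
    return [(1 + e) / (1 + n_perm) for e in exceed]
```

Replacing `window_stat` with a per-window likelihood-ratio statistic would leave the adjustment logic unchanged, since only the maximum over windows enters the permutation reference distribution.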

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862

Lin and Zeng: Rejoinder

3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence, under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and that of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.
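The monotone-ascent property of the EM algorithm can be seen in a much simpler setting than the proportional hazards NPMLE. The sketch below is not the authors' algorithm: it swaps in the classical EM for haplotype frequencies from unphased genotypes under HWE (in the spirit of Excoffier and Slatkin 1995), with no phenotype and no baseline hazard. Haplotypes are coded as 0/1 tuples, a genotype is the componentwise sum of two haplotypes, ordered haplotype pairs are assumed, and all names are illustrative.

```python
import math
from itertools import product

def compatible_pairs(g, haps):
    # Ordered index pairs (k, l) whose haplotypes sum to the unphased genotype g.
    return [(k, l) for k, hk in enumerate(haps) for l, hl in enumerate(haps)
            if all(a + b == c for a, b, c in zip(hk, hl, g))]

def log_likelihood(genos, haps, pi):
    # Observed-data log-likelihood under HWE: P(g) = sum over compatible pairs of pi_k * pi_l.
    return sum(math.log(sum(pi[k] * pi[l] for k, l in compatible_pairs(g, haps)))
               for g in genos)

def em_haplotype_freqs(genos, n_snps, n_iter=50):
    haps = list(product((0, 1), repeat=n_snps))
    pi = [1.0 / len(haps)] * len(haps)
    trace = [log_likelihood(genos, haps, pi)]
    for _ in range(n_iter):
        counts = [0.0] * len(haps)
        for g in genos:
            pairs = compatible_pairs(g, haps)
            weights = [pi[k] * pi[l] for k, l in pairs]
            total = sum(weights)
            # E-step: posterior weight of each compatible pair, credited to both haplotypes.
            for (k, l), w in zip(pairs, weights):
                counts[k] += w / total
                counts[l] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        pi = [c / (2 * len(genos)) for c in counts]
        trace.append(log_likelihood(genos, haps, pi))
    return haps, pi, trace
```

The returned `trace` holds the observed-data log-likelihood after each iteration; it should never decrease, which is the property referred to in the paragraph above.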

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


A.4 Case-Control Studies With Unknown Population Totals

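A.4.1 Equivalence Class. Suppose that two sets of parameters $(\theta, \{F_g^\dagger\}, \{q_g\})$ and $(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\})$ yield the same likelihood,

$$R_L(\theta, \{F_g^\dagger\}, \{q_g\}) = R_L(\tilde\theta, \{\tilde F_g^\dagger\}, \{\tilde q_g\}). \tag{A.3}$$

Because $\eta(0, x, g; \theta) = 1$, (A.3) with $y = 0$ implies that $f_g^\dagger(x) q_g / \sum_{g \in \mathcal{G}} q_g = \tilde f_g^\dagger(x) \tilde q_g / \sum_{g \in \mathcal{G}} \tilde q_g$. Thus $f_g^\dagger(x) = \tilde f_g^\dagger(x)$ and $q_g = \tilde q_g$. It then follows from (A.3) that

$$\eta(y, x, g; \theta) = C(y)\,\eta(y, x, g; \tilde\theta), \tag{A.4}$$

where $C(y)$ depends only on $y$. By setting $x = x_0$ and $g = g_0$ in (A.4) and noting that $\eta(y, x_0, g_0; \theta) = 1$, we conclude that $C(y) = 1$. Hence the equivalence class for $(\theta, \{F_g^\dagger\}, \{q_g\})$ is $\{(\tilde\theta, \{F_g^\dagger\}, \{q_g\}) : \eta(y, x, g; \tilde\theta) = \eta(y, x, g; \theta)\}$.

A.4.2 Identifiability for the Logistic Link Function. Suppose that

$$\eta(y, x, g; \theta) = \eta(y, x, g; \tilde\theta) \tag{A.5}$$

for two sets of parameters $\theta$ and $\tilde\theta$. Let $g_0 = 0$. As in Section A.2.1, we partition $\mathcal{G}$ into $(\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3)$. For $g \in \mathcal{G}_1$, $S(g)$ is a singleton, so the generalized odds ratio reduces to the ordinary odds ratio of $Y$ given $X$ and $H$. Thus (A.5) is equivalent to $\beta = \tilde\beta$ under Condition 8. For $g \in \mathcal{G}_2$, $P(Y = 0 \mid X = x, H = (h_k, h_l)) = \{1 + \exp(\alpha + \beta_3^T x)\}^{-1}$. Thus (A.5) holds if and only if $\beta_3 = \tilde\beta_3$. For $g \in \mathcal{G}_3$, both $\pi_1(g)$ and $\pi_2(g)$ are nonzero. Then (A.5) becomes

$$\frac{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\pi_1(g)(1 + e^{\alpha+\psi_2(x)})}{\pi_2(g)(1 + e^{\alpha+\psi_1(x)})} + 1} = \frac{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + e^{\psi_2(x)-\psi_1(x)}}{\dfrac{\tilde\pi_1(g)(1 + e^{\tilde\alpha+\psi_2(x)})}{\tilde\pi_2(g)(1 + e^{\tilde\alpha+\psi_1(x)})} + 1}, \tag{A.6}$$

where $\psi_1(x) = \beta_1 + \beta_3^T x + \beta_4^T x$ and $\psi_2(x) = \beta_3^T x$.

Without loss of generality, assume that 0 is in the support of $X$. We then have the following results: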

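1. $\beta_1 = 0$ and $\beta_4 = 0$. Then (A.6) holds naturally.

2. $\beta_1 \neq 0$, $\beta_4 = 0$, and $\beta_3 = 0$. Then, because the function $(\lambda + c)/(\lambda + 1)$ is strictly monotone in $\lambda$ for $c \neq 1$, (A.6) yields

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha}}{1 + e^{\alpha+\beta_1}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha}}{1 + e^{\tilde\alpha+\beta_1}}.$$

Thus (A.6) is equivalent to

$$\frac{\pi_1(g)\pi_2(\tilde g)}{\pi_1(\tilde g)\pi_2(g)} = \frac{\tilde\pi_1(g)\tilde\pi_2(\tilde g)}{\tilde\pi_1(\tilde g)\tilde\pi_2(g)} \quad \text{for all } g, \tilde g \in \mathcal{G}_3.$$

3. $\beta_1 \neq 0$, $\beta_4 = 0$, and $\beta_{3z} \neq 0$, where $\beta_{3z}$ is the component of $\beta_3$ associated with a continuous covariate $Z$. For $x$ such that $\beta_{3z} z \neq 0$, (A.6) yields

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\beta_{3z} z}}{1 + e^{\alpha+\beta_1+\beta_{3z} z}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\beta_{3z} z}}{1 + e^{\tilde\alpha+\beta_1+\beta_{3z} z}}.$$

The foregoing equation holds for any $z \in (-\infty, \infty)$, because the functions on the two sides are analytic in $z$ and $z$ is continuous. Without loss of generality, assume that $\beta_{3z} > 0$. By letting $z = -\infty$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by letting $z = 0$, we have $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.

4. $\beta_{4z} \neq 0$, where $\beta_{4z}$ is the component of $\beta_4$ pertaining to $z$. Then (A.6) is equivalent to

$$\frac{\pi_1(g)}{\pi_2(g)}\,\frac{1 + e^{\alpha+\psi_2(x)}}{1 + e^{\alpha+\psi_1(x)}} = \frac{\tilde\pi_1(g)}{\tilde\pi_2(g)}\,\frac{1 + e^{\tilde\alpha+\psi_2(x)}}{1 + e^{\tilde\alpha+\psi_1(x)}} \tag{A.7}$$

for any $x$ such that $\beta_1 + \beta_4^T x \neq 0$. We set $x$, except the component $z$, to 0. By letting $z \to -\beta_1/\beta_{4z}$, we have $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. Then, by differentiating both sides of (A.7) with respect to $z$ and letting $z \to -\beta_1/\beta_{4z}$, we obtain $\alpha = \tilde\alpha$. Thus (A.6) is equivalent to $\alpha = \tilde\alpha$, $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$.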
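The monotonicity invoked in case 2 (and implicitly in cases 3 and 4) can be checked in one line. Writing each side of (A.6) as $(\lambda + c)/(\lambda + 1)$, where $\lambda > 0$ stands for the leading ratio in (A.6) and $c = e^{\psi_2(x) - \psi_1(x)}$ (this shorthand is ours), we have

$$\frac{d}{d\lambda}\,\frac{\lambda + c}{\lambda + 1} = \frac{(\lambda + 1) - (\lambda + c)}{(\lambda + 1)^2} = \frac{1 - c}{(\lambda + 1)^2},$$

which keeps a constant nonzero sign on $\lambda > -1$ whenever $c \neq 1$. Equality of the two sides of (A.6) therefore forces the two leading ratios to agree whenever $\psi_1(x) \neq \psi_2(x)$, that is, whenever $\beta_1 + \beta_4^T x \neq 0$; when $c = 1$, both sides equal 1 and nothing is learned, which is case 1.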

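A.4.3 Identifiability for Probit and Complementary Log–Log Link Functions. Assume that $|\beta_1| + |\beta_4| \neq 0$. Also assume that there exists a continuous covariate in $X$, denoted by $Z$, such that the corresponding regression parameter $\beta_z$ is nonzero. Let $x_0 = 0$ and $g_0 = 0$. We claim that, under the probit and complementary log–log regression models, $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$ for two sets of parameters $\theta$ and $\tilde\theta$ if and only if $\alpha = \tilde\alpha$, $\beta = \tilde\beta$, and $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$ for $g \in \mathcal{G}_3$.

We first prove the foregoing claim for the probit model. Suppose that $\eta(1, x, g; \theta) = \eta(1, x, g; \tilde\theta)$. Without loss of generality, assume that $h^*$ is a nonzero sequence. Let $g = 2h^*$, $h^* + \tilde h^*$, and 0 in turn. Because $S(g)$ has a single element for such $g$, we obtain

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + 2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x)} - 1\right\}, \tag{A.8}$$

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x)} - 1\right\}, \tag{A.9}$$

and

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_3^T x)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_3^T x)} - 1\right\}, \tag{A.10}$$

where $\Phi$ is the standard normal distribution function. In (A.10), we let $x$, except the component $z$, be 0. Then

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\left\{\frac{1}{\Phi(\alpha + \beta_z z)} - 1\right\} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\left\{\frac{1}{\Phi(\tilde\alpha + \tilde\beta_z z)} - 1\right\}.$$

By letting $z \to \infty$ or $-\infty$, we conclude that $\beta_z$ and $\tilde\beta_z$ must have the same sign. Without loss of generality, assume that $\beta_z > \tilde\beta_z > 0$. Then the left side divided by the right side goes to 0 as $z \to \infty$. This is a contradiction. Therefore, $\beta_z = \tilde\beta_z$. We differentiate both sides to obtain

$$\frac{\Phi(\alpha)}{1 - \Phi(\alpha)}\,\frac{\phi(\alpha + \beta_z z)}{\Phi(\alpha + \beta_z z)^2} = \frac{\Phi(\tilde\alpha)}{1 - \Phi(\tilde\alpha)}\,\frac{\phi(\tilde\alpha + \beta_z z)}{\Phi(\tilde\alpha + \beta_z z)^2}.$$

By taking the ratio of the two sides and letting $z \to \mathrm{sgn}(\beta_z)\infty$, we immediately conclude that $\alpha = \tilde\alpha$. Applying this result to (A.8)–(A.10), we obtain $2\beta_1 + \beta_2 + \beta_3^T x + 2\beta_4^T x + \beta_5^T x = 2\tilde\beta_1 + \tilde\beta_2 + \tilde\beta_3^T x + 2\tilde\beta_4^T x + \tilde\beta_5^T x$, $\beta_1 + \beta_3^T x + \beta_4^T x = \tilde\beta_1 + \tilde\beta_3^T x + \tilde\beta_4^T x$, and $\beta_3^T x = \tilde\beta_3^T x$. Therefore, $\beta = \tilde\beta$. For $g \in \mathcal{G}_3$,

$$\eta(1, x, g; \theta) = \frac{1 - \Phi(\alpha)}{\Phi(\alpha)}\left\{\Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\,\frac{\pi_1(g)}{\pi_2(g)} + \Phi(\alpha + \beta_3^T x)\right\} \times \left[\{1 - \Phi(\alpha + \beta_1 + \beta_3^T x + \beta_4^T x)\}\,\frac{\pi_1(g)}{\pi_2(g)} + \{1 - \Phi(\alpha + \beta_3^T x)\}\right]^{-1}. \tag{A.11}$$

It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g)$ for $g \in \mathcal{G}_1$ and $g \in \mathcal{G}_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^x)$. In particular,

$$\frac{e^{-e^{\alpha}}\,(e^{e^{\alpha+\beta_z z}} - 1)}{1 - e^{-e^{\alpha}}} = \frac{e^{-e^{\tilde\alpha}}\,(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1)}{1 - e^{-e^{\tilde\alpha}}}.$$

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.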

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

$$l_n(\theta, \{\delta_j\}) = \sum_{j=1}^J n_{1j} \log \eta(1, x_j, g_j; \theta) - n_1 \log\left\{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\,\delta_j\right\} + \sum_{j=1}^J n_{+j} \log \delta_j,$$

where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

$$\frac{n_{+j}}{\delta_j} - \frac{n_1\,\eta(1, x_j, g_j; \theta)}{\sum_{j=1}^J \eta(1, x_j, g_j; \theta)\,\delta_j} + \lambda = 0.$$

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

$$\delta_j = \frac{n_{+j}}{n - n_1 + n_1\,\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12}$$

where $\mu = \sum_{j=1}^J \eta(1, x_j, g_j; \theta)\,\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

$$l_n^*(\theta, \mu) = \sum_{j=1}^J n_{1j} \log \eta(1, x_j, g_j; \theta) - \sum_{j=1}^J n_{+j} \log\left\{\frac{n_1}{n}\,\eta(1, x_j, g_j; \theta) + \left(1 - \frac{n_1}{n}\right)\mu\right\} + (n - n_1)\log\mu.$$

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_\mu l_n^*(\theta, \mu) + C_n$. If $\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^J \delta_j = 1$. Thus $\max_\mu l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
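For a fixed θ, the inner step of this profiling argument is easy to check numerically. The sketch below takes the values η(1, x_j, g_j; θ) as given positive numbers (it does not compute them from a model), solves the stationarity condition in μ, which is equivalent to requiring that the jump sizes from (A.12) sum to 1, and returns those jump sizes. Bisection is used here only for transparency; the text maximizes over θ and μ jointly by Newton–Raphson. The function name and arguments are illustrative.

```python
def solve_mu(eta, n_plus, n1, tol=1e-12):
    """Find mu > 0 such that the jump sizes from (A.12) sum to 1.

    eta    : values eta(1, x_j, g_j; theta), assumed positive, for a fixed theta
    n_plus : counts n_{+j} = n_{0j} + n_{1j} at each distinct (x_j, g_j)
    n1     : number of cases, assumed to satisfy 0 < n1 < n
    """
    n = sum(n_plus)

    def total(mu):
        # Sum of delta_j(mu) = n_{+j} / (n - n1 + n1 * eta_j / mu); increasing in mu.
        return sum(m / (n - n1 + n1 * e / mu) for m, e in zip(n_plus, eta))

    lo, hi = 1e-12, 1.0
    while total(hi) < 1.0:          # the limit as mu grows is n / (n - n1) > 1
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol * hi:
            break
    mu = 0.5 * (lo + hi)
    delta = [m / (n - n1 + n1 * e / mu) for m, e in zip(n_plus, eta)]
    return mu, delta
```

At the solution, the returned jump sizes also satisfy μ = Σ_j η(1, x_j, g_j; θ) δ_j, which is the definition of μ given after (A.12).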

A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define

$$\zeta_1(x, g; \theta) = \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, h_k, h_l)}\{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\},$$

$$\zeta_0(g; \theta) = \sum_{(h_k, h_l) \in S(g)} \{\rho\pi_k\delta_{kl} + (1 - \rho)\pi_k\pi_l\}.$$

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

$$l_n^*(\theta, \{\mu_g\}) = \sum_{i=1}^n \{y_i \log\zeta_1(X_i, G_i; \theta) + (1 - y_i)\log\zeta_0(G_i; \theta)\} - \sum_{i=1}^n \sum_g I(G_i = g)\log\left\{\zeta_1(X_i, G_i; \theta) + n_1^{-1} n_g \sum_{\tilde g}\mu_{\tilde g} - \mu_g\right\} + \sum_{i=1}^n (1 - y_i)\log\left(\sum_g \mu_g\right),$$

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.
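The two building blocks ζ0 and ζ1 are sums over the haplotype pairs compatible with an unphased genotype, weighted by the diplotype probabilities ρπ_kδ_kl + (1 − ρ)π_kπ_l. A minimal sketch follows, under assumptions of ours: haplotypes are coded as 0/1 tuples, S(g) is taken to be the set of ordered pairs whose componentwise sum equals g, and the design function `Z_example` (copies of a target haplotype, a scalar covariate, and their product) is purely hypothetical.

```python
import math
from itertools import product

def diplotype_prob(k, l, pi, rho):
    # rho * pi_k * delta_kl + (1 - rho) * pi_k * pi_l; rho = 0 gives HWE.
    return rho * pi[k] * (k == l) + (1 - rho) * pi[k] * pi[l]

def compatible_pairs(g, haps):
    # S(g): ordered index pairs (k, l) with haps[k] + haps[l] == g componentwise.
    return [(k, l) for k, hk in enumerate(haps) for l, hl in enumerate(haps)
            if all(a + b == c for a, b, c in zip(hk, hl, g))]

def zeta0(g, haps, pi, rho):
    return sum(diplotype_prob(k, l, pi, rho) for k, l in compatible_pairs(g, haps))

def zeta1(x, g, haps, pi, rho, beta, Z):
    # Each compatible pair is weighted by exp(beta^T Z(x, h_k, h_l)).
    return sum(
        math.exp(sum(b * z for b, z in zip(beta, Z(x, haps[k], haps[l]))))
        * diplotype_prob(k, l, pi, rho)
        for k, l in compatible_pairs(g, haps))

def Z_example(x, hk, hl, target=(1, 1)):
    # Hypothetical design vector: (copies of target haplotype, x, copies * x).
    copies = (hk == target) + (hl == target)
    return (copies, x, copies * x)
```

Two sanity checks follow directly from the definitions: ζ0 summed over all genotypes equals 1, and ζ1 reduces to ζ0 when β = 0.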

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

$$l_n^*(\theta, \mu) = \sum_{i=1}^n y_i \log\zeta_1(X_i, G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log\zeta_0(G_i; \theta) + \sum_{i=1}^n (1 - y_i)\log\mu - \sum_{i=1}^n \log\left\{(1 - r)\mu + r\sum_g \zeta_1(X_i, g; \theta)\right\},$$

where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k) : k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l) : k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by

$$P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B, Q_1, Q_2, Y} \exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},$$

where

$$\vartheta = \left(-\log\mu + \log\{r/(1 - r)\},\ \beta^T,\ \log\pi_1 - \log\{\rho/(1 - \rho)\},\ \ldots,\ \log\pi_K - \log\{\rho/(1 - \rho)\}\right)^T$$

and

$$W(B, Q_1, Q_2, Y, X) = \Big(Y,\ Y Z^T(X, H),\ B\,I(Q_1 = (h_1, h_1)) + (1 - B)\sum_l \{I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1))\},\ \ldots,\ B\,I(Q_1 = (h_K, h_K)) + (1 - B)\sum_l \{I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K))\}\Big)^T.$$

We can show that $l_n^*(\theta, \mu)$ is equivalent to the log-likelihood

$$l_n^*(\vartheta) = \sum_{i=1}^n \log\left[\frac{\sum_{BQ_1 + (1 - B)Q_2 \in S(G_i)} e^{\vartheta^T W(B, Q_1, Q_2, Y_i, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right].$$

We maximize $l_n^*(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l_n^*(\vartheta)$.

The complete-data score function is

$$\sum_{i=1}^n \left[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right].$$

Thus, in the E-step, we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:

$$E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i] = \frac{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\,e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}\,W(b, q_1, q_2, Y_i, X_i)}{\sum_{b, q_1, q_2} I(bq_1 + (1 - b)q_2 \in S(G_i))\,e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}.$$

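In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

$$\vartheta^{(k+1)} = \vartheta^{(k)} - \Omega^{-1} \times \sum_{i=1}^n \left[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right],$$

where

$$\Omega = -\left[\sum_{i=1}^n \frac{\sum_{b, q_1, q_2, y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right] + \sum_{i=1}^n \left[\frac{\{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\,e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^{\otimes 2}}{\{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^2}\right]$$

and $a^{\otimes 2} = aa^T$.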

A46 Proof of Theorem 2 Write Fxg(xg) = Fdaggerg(x)qg and

Fxg(xg) = Fdaggerg(x)qg Because θ is bounded and Fxg is a probabil-

ity distribution we can choose a subsequence such that θ rarr θlowast andFxg(xg) rarr Flowast

xg(xg) equiv Flowastg(x)qlowast

g where qlowastg gt 0 for any g

Because Fdaggerg maximizes the likelihood there exists some Lagrange

multiplier λg such that

I(Gi = g)

FdaggergXi

minus n1η(1Xig θ )qgint

xg η(1x g θ)dFxg(x g)minus nλg = 0

where FdaggergXi denotes the point mass of Fdagger

g at Xi and the integralis interpreted as integration over x and summation over g Becausesumn

i=1 FdaggergXi = 1 λg satisfies the equation

nminus1nsum

i=1

I(Gi = g)

λg + n1η(1Xig θ )qgn intxg η(1x g θ)dFxg(x g)minus1

= 1 (A13)

and

min1leilen

λg + n1η(1Xig θ )qg

nint

xg η(1x g θ)dFxg(x g)

gt 0

Clearly λg must be bounded asymptotically Thus by choosing a sub-sequence we assume that λg rarr λlowast

g By (A13) and the Lipschitz continuity of η(1xg θlowast) in the con-

tinuous components of x we can show that there exists a positive con-stant δ such that

mingx

∣∣∣∣λ

lowastg + η(1xg θlowast)qlowast

gintxg η(1x g θlowast)dFlowast

xg(x g)

∣∣∣∣

gt δ

Consequently when n is sufficiently large

Fdaggerg(x) = nminus1

nsum

i=1

I(Gi = gXi le x)

times(

max

[∣∣∣∣λg + η(1Xig θ )qgn1

times

nint

xgη(1x g θ)dFxg(x g)

minus1∣∣∣∣ δ

])minus1

We define an empirical function Fdaggerg whose jump size at Xi is propor-

tional to

nminus1I(Gi = g)

(

P(G = gY = 0) + η(1Xig θ0)qg

timesint

xgη(1x g θ0)dFxg(x g)

minus1)minus1

Then it can be verified that Fdaggerg converges uniformly to Fdagger

g In ad-

dition Fdaggerg is absolutely continuous with respect to Fdagger

g and the

RadonndashNikodym derivative dFdaggerg(x)dFdagger

g(x) is bounded and con-

verges uniformly to dFlowastg(x)dFdagger

g(x) Let Fxg(xg) = Fdaggerg(x)qg and let

ln(θ Fdaggerg qg) be the log-likelihood based on (8) By the definition

of the MLE nminus1ln (θ Fdaggerg qg) minus nminus1ln(θ0 Fdagger

g qg) ge 0 Thelimit of this difference is the negative KullbackndashLeibler informationof the distribution for (θlowast Flowast

g qlowastg) with respect to (θ0 Fdagger

g qg)under P(Y = 1) = The identifiability conditions then yield θlowast = θ0Flowast

g = Fdaggerg and qlowast

g = qg Thus the consistency of θ is established Be-

cause Fxg is continuous supxg |Fxg(xg) minus Fxg(xg)| rarr 0 almostsurely

The derivation of the asymptotic distribution is similar to the proofof theorem 12 of Murphy and van der Vaart (2001) We first obtaina score function by differentiating ln(θ Fdagger

g qg) with respect to θ

along the direction v and with respect to Fxg along the path Fε =Fxg + ε

intψ(xg)dFxg where v has a unit norm and ψ(middotg) is any

function whose total variation is bounded by 1 The linearization of thescore function around the true parameter value yields

n12

(vT11 + 21[ψ]T )(θ minus θ0)

+int

(vT12 + 22[ψ])d(Fxg minus Fxg)

= nminus12nsum

i=1

yi

vT lθ (1XiGi θ0Fxg)

+ lF(1XiGi θ0Fxg)

[int

ψ dFxg

]

+ nminus12nsum

i=1

(1 minus yi)

vT lθ (0XiGi θ0Fxg)

+ lF(0XiGi θ0Fxg)

[int

ψ dFxg

]

+ op(1)

where 11 is a constant matrix 12 is a vector function of x 21[ψ]and 22[ψ] are linear operators of ψ and lθ and lF are the scoreswith respect to θ and Fxg The right side of the foregoing equa-tion converges weakly to a Gaussian process which depends on( y1 y2 ) only through We can show that the operator B[vψ] equivvT11 +21[ψ]T vT12 +22[ψ]T is invertible along the lines ofMurphy and van der Vaart (2001) It then follows from theorem 331of van der Vaart and Wellner (1996) that n12(θ minus θ0 Fxg minus Fxg)

converges weakly to a Gaussian processBecause the asymptotic distribution depends on ( y1 y2 ) only

via we assume that ( y1 y2 ) are independent realizations froma Bernoulli distribution with mean By choosing some ψ such thatB[vψ] = (vT 0)T for all v we see that θ is an asymptotically linearestimator for θ0 with the influence function in the score space It fol-lows from proposition 331 of Bickel Klaassen Ritov and Wellner(1993) that the limiting covariance matrix of n12(θ minus θ0) attains thesemiparametric efficiency bound


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

$$\frac{dP_n}{d\tilde P_n} = \exp\left\{a_n \sum_{i=1}^n \left.\frac{\partial \log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a}\right|_{a=0} + o(1)\right\} \to_{\tilde P_n} 1.$$

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

A.5 Cohort Studies

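A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

$$\sum_{H \in S(G)} \left\{\tilde\Lambda'(Y)\,e^{\tilde\beta^T Z(X, H)}\,Q'\big(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X, H)}\big)\right\}^{\Delta} \left\{1 - Q\big(\tilde\Lambda(Y)e^{\tilde\beta^T Z(X, H)}\big)\right\}^{1 - \Delta} P_\gamma(H) = \sum_{H \in S(G)} \left\{\Lambda'(Y)\,e^{\beta^T Z(X, H)}\,Q'\big(\Lambda(Y)e^{\beta^T Z(X, H)}\big)\right\}^{\Delta} \left\{1 - Q\big(\Lambda(Y)e^{\beta^T Z(X, H)}\big)\right\}^{1 - \Delta} P_\gamma(H).$$

By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain

$$\sum_{H \in S(G)} Q\big(\tilde\Lambda(\tau)e^{\tilde\beta^T Z(X, H)}\big) P_\gamma(H) = \sum_{H \in S(G)} Q\big(\Lambda(\tau)e^{\beta^T Z(X, H)}\big) P_\gamma(H).$$

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y)e^{\tilde\beta^T Z(X, H)} = \Lambda(Y)e^{\beta^T Z(X, H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.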

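A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

$$\mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\left[\int \psi\, d\Lambda_0\right] = 0, \tag{A.14}$$

where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int \psi\, d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \epsilon\int \psi\, d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain

$$\sum_{H \in S(G)} Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big) P_\gamma(H) \times \left\{\frac{Q'\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\mu_\beta^T Z(X, H)}{Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)} + \frac{Q'\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)\int_0^\tau \psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X, H)}}{Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H)\right\} = 0. \tag{A.15}$$

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

$$\sum_{H \in S(G)} \left\{1 - Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)\right\} P_\gamma(H) \times \left\{-\frac{Q'\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\mu_\beta^T Z(X, H)}{1 - Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)} - \frac{Q'\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)\int_0^\tau \psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X, H)}}{1 - Q\big(\Lambda_0(\tau)e^{\beta_0^T Z(X, H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H)\right\} = 0. \tag{A.16}$$

The summation of (A.15) and (A.16) entails $\mu_\gamma^T \nabla_\gamma \log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

$$\psi(Y) + \frac{Q''\big(\Lambda_0(Y)e^{\beta_0^T Z(X, H)}\big)\int_0^Y \psi(t)\,d\Lambda_0(t)\,e^{\beta_0^T Z(X, H)}}{Q'\big(\Lambda_0(Y)e^{\beta_0^T Z(X, H)}\big)} = 0$$

for $H = (h, h)$. Therefore, $\psi = 0$.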

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike H (1985) ldquoPrediction and Entropyrdquo in A Celebration of Statistics edsA C Atkinson and S E Fienberg New York Springer-Verlag pp 1ndash24

Akey J Jin L and Xiong M (2001) ldquoHaplotypes vs Single-Marker Link-age Disequilibrium Tests What Do We Gainrdquo European Journal of HumanGenetics 9 291ndash300

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment
Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and, eventually, to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000817

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals who may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are, in the vast majority, due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be suspected of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and to understanding their relevance in determining disease risks.
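The windowed scan described above can be sketched in a few lines. This is only an illustration: the function names `sliding_windows` and `scan`, the use of marker positions in centimorgans, and the user-supplied `test_fn` (standing in for whatever haplotype association test is applied within a window) are assumptions made here, not part of Lin and Zeng's method or of any particular software package.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def sliding_windows(
    positions_cm: Sequence[float], window_cm: float, step_cm: float
) -> Iterator[Tuple[float, List[int]]]:
    """Yield (window_start, marker_indices) for windows of fixed genetic length.

    positions_cm is assumed to be sorted in increasing order (not checked).
    Windows that contain no marker are skipped.
    """
    if not positions_cm or window_cm <= 0 or step_cm <= 0:
        return
    first, last = positions_cm[0], positions_cm[-1]
    k = 0
    # Use first + k * step rather than repeated addition to limit rounding drift.
    while first + k * step_cm <= last:
        start = first + k * step_cm
        idx = [i for i, p in enumerate(positions_cm)
               if start <= p < start + window_cm]
        if idx:
            yield start, idx
        k += 1


def scan(
    positions_cm: Sequence[float],
    window_cm: float,
    step_cm: float,
    test_fn: Callable[[List[int]], float],
) -> List[Tuple[float, List[int], float]]:
    """Apply a hypothetical per-window association test to every window."""
    return [(start, idx, test_fn(idx))
            for start, idx in sliding_windows(positions_cm, window_cm, step_cm)]
```

When `step_cm` is smaller than `window_cm` the windows overlap, so the per-window tests are correlated; this bears directly on the multiple-comparison issue raised later in the comment.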

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more on this later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999), Service, Temple Lang, Freimer, and Sandkuijl (1999), Lam, Roeder, and Devlin (2000), Morris, Whittaker, and Balding (2000), Liu, Sabatti, Teng, Keats, and Risch (2001), and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information when available, or assuming linkage disequilibrium (LD) and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
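To make the missing-phase problem concrete, here is a minimal sketch of an EM iteration for haplotype frequencies from unphased genotypes. It is a toy under stated assumptions, not the algorithm of Lin and Zeng or the implementation of Excoffier and Slatkin: it handles only two biallelic SNPs, assumes Hardy–Weinberg equilibrium (random pairing of haplotypes), ignores phenotypes altogether, and the names `compatible_pairs` and `em_haplotype_freqs` are hypothetical. A genotype is coded as the number of copies of allele 1 at each SNP, so only the double heterozygote (1, 1) is phase-ambiguous.

```python
from collections import Counter
from itertools import product

# The four possible haplotypes over two biallelic SNPs (alleles coded 0/1).
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) whose allele counts sum to the genotype."""
    return [(h1, h2) for h1, h2 in product(HAPS, repeat=2)
            if tuple(a + b for a, b in zip(h1, h2)) == tuple(genotype)]


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg equilibrium."""
    freqs = {h: 1.0 / len(HAPS) for h in HAPS}   # uniform starting values
    counts = Counter(tuple(g) for g in genotypes)
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        expected = {h: 0.0 for h in HAPS}
        # E-step: distribute each genotype over its compatible haplotype pairs
        # in proportion to the current pair probabilities freqs[h1] * freqs[h2].
        for g, n_g in counts.items():
            pairs = compatible_pairs(g)
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                post = w / total
                expected[h1] += n_g * post
                expected[h2] += n_g * post
        # M-step: new frequencies are expected haplotype counts per chromosome.
        new_freqs = {h: expected[h] / n_chrom for h in HAPS}
        delta = max(abs(new_freqs[h] - freqs[h]) for h in HAPS)
        freqs = new_freqs
        if delta < tol:
            break
    return freqs
```

With data such as four (0, 0) homozygotes, four (2, 2) homozygotes, and two double heterozygotes, the iteration assigns the ambiguous individuals almost entirely to the pair of haplotypes already seen in the homozygotes. This illustrates why the phase imputed by EM is not independent evidence about the haplotype distribution.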

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes, measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population, in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment
Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems surprising in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.
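The link between the two likelihoods can be made concrete with a short derivation. The notation below (D for disease status, X for exposure, a logistic model with intercept α and log odds ratio β) is introduced here for illustration only and is not taken from Lin and Zeng's article.

```latex
% Bayes' rule links the retrospective and prospective probabilities:
\Pr(X = x \mid D = d) \;=\; \frac{\Pr(D = d \mid X = x)\,\Pr(X = x)}{\Pr(D = d)} .

% Under the logistic model
% \Pr(D = 1 \mid X = x) = e^{\alpha + \beta x} / (1 + e^{\alpha + \beta x}),
% the exposure distribution cancels from the case/control ratio:
\frac{\Pr(X = x \mid D = 1)}{\Pr(X = x \mid D = 0)}
  \;=\; \frac{\Pr(D = 1 \mid X = x)}{\Pr(D = 0 \mid X = x)}
        \cdot \frac{\Pr(D = 0)}{\Pr(D = 1)}
  \;=\; e^{\alpha + \beta x}\,\frac{\Pr(D = 0)}{\Pr(D = 1)} .
```

The ratio depends on x only through βx, so β is determined by retrospective data while α is absorbed into a constant. When Pr(X = x) is left unrestricted it acts as a free nuisance factor; when the exposure is a haplotype pair whose distribution is restricted (for example, by Hardy–Weinberg equilibrium), that factor is no longer free, which is the situation the paragraph above describes.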

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If, for each value of covariates X, we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model (3) applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining m_g(y, x; θ), where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that X and H are independent at the population level, not just conditional on G.
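The conditional independence assumption discussed above can be written out explicitly. The notation here is illustrative and is only a sketch: H denotes the pair of haplotypes carried by an individual, S(g) the set of haplotype pairs compatible with genotype g, and P(y | h, x) the regression model for the phenotype. The exact form of m_g in Lin and Zeng's article may differ in detail.

```latex
% Assumption: haplotypes and covariates are independent given the genotype.
\Pr(H = h \mid G = g, X = x) \;=\; \Pr(H = h \mid G = g)
  \;=\; \frac{\Pr(H = h)}{\sum_{h' \in S(g)} \Pr(H = h')},
  \qquad h \in S(g).

% Consequence: the phenotype model given the observed data averages over
% marginal haplotype probabilities, with no covariate-specific frequencies.
\Pr(Y = y \mid G = g, X = x)
  \;=\; \sum_{h \in S(g)} P(y \mid h, x)\,\Pr(H = h \mid G = g).
```

If X and H are independent at the population level, the first equality holds automatically, because G is a function of H; this is the sense in which the simpler model is a special case of the assumption.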

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

In the Public Domain
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently, we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835

With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X, \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d, \theta), \tag{2}$$

where the model $q(h^d, \theta)$ in turn could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = \frac{q(h, x, \theta)\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h}\sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$, for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
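To make the definitions of $S$, $L^*$, and $l^*$ concrete, the following is a minimal sketch for a toy setting with two biallelic SNPs. Everything specific here is an assumption made for illustration and is not taken from Spinka et al.: the haplotype coding, the HWE form of $q$, the additive form of `m` with a hypothetical target haplotype `"11"`, and all function names.

```python
import math
from itertools import combinations_with_replacement

HAPLOTYPES = ["00", "01", "10", "11"]  # two SNPs, alleles coded 0/1 (toy setup)
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))  # unordered pairs


def genotype(dip):
    """Unphased genotype: number of '1' alleles at each SNP."""
    a, b = dip
    return tuple(int(u) + int(v) for u, v in zip(a, b))


def q_hwe(dip, freq):
    """Diplotype probability q(h, theta) under HWE, one possible choice for model (2)."""
    a, b = dip
    return freq[a] ** 2 if a == b else 2.0 * freq[a] * freq[b]


def m(dip, x, beta1, target="11"):
    """Hypothetical m(h, x, beta1): additive in copies of `target` plus a covariate term."""
    beta_h, beta_x = beta1
    return beta_h * dip.count(target) + beta_x * x


def S(d, dip, x, beta0, kappa, beta1, freq):
    """S(d, h, x) = q(h) * exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    eta = m(dip, x, beta1)
    return q_hwe(dip, freq) * math.exp(d * (kappa + eta)) / (1.0 + math.exp(beta0 + eta))


def L_star(D, G, x, beta0, kappa, beta1, freq):
    """Sum S over diplotypes consistent with G, normalised over all (d, h)."""
    num = sum(S(D, h, x, beta0, kappa, beta1, freq) for h in DIPLOTYPES if genotype(h) == G)
    den = sum(S(d, h, x, beta0, kappa, beta1, freq) for h in DIPLOTYPES for d in (0, 1))
    return num / den


def pseudo_loglik(data, beta0, kappa, beta1, freq):
    """l* = sum_i log L*(D_i, G_i, X_i) for records (D, G, x)."""
    return sum(math.log(L_star(D, G, x, beta0, kappa, beta1, freq)) for D, G, x in data)
```

For fixed $x$, `L_star` sums to one over all $(D, G)$ pairs, which is a quick sanity check that it behaves like a conditional distribution of $(D, G)$ given $X$. In practice `pseudo_loglik` would be passed to a numerical optimizer; that step is omitted here.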

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let $R$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme ($R = 1$ if selected). With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

$$\begin{aligned}
l^* &= \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1)\,\mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with the distribution $F(x)$ of $X$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}. \tag{6}$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}, \tag{7}$$

where

$$r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x, \beta_1)\}}.$$

Spinka et al. showed that with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
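As a small worked check on the correction factor $r$ (a sketch that uses only the definitions of $\kappa$ and $r$ given earlier, not a result quoted from Spinka et al.), note that the two intercepts differ by

$$\kappa - \beta_0 = \log(n_1/n_0) - \log\{\pi/(1 - \pi)\},$$

so that $\kappa = \beta_0$ exactly when $n_1/n_0 = \pi/(1 - \pi)$, that is, when cases and controls appear in the sample in the same ratio as in the population. In that case

$$r(h^d, x) = \frac{1 + \exp\{\beta_0 + m(h^d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x, \beta_1)\}} = 1$$

for every $h^d$ and $x$, and (7) reduces to (6). Otherwise $r(h^d, x)$ varies with $h^d$ and so reweights the diplotypes within $\mathcal{H}_{G_i}$, which is where the two score equations part ways.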

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X, \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\mathrm{pr}(H^d) = q(H^d, \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

$$\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}$$

where

$$R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\} = E\{R(H^d, X, \beta_1) \mid G, X, T \ge t\} = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, \mathrm{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}.$$

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function,

$$R^*(G, X, \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d, \theta)},$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat\theta)$, where $\hat\theta$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
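The induced relative risk under the rare disease approximation is simply a weighted average of the diplotype-specific relative risks, with weights given by the diplotype probabilities among those diplotypes consistent with $G$. The sketch below illustrates this for two biallelic SNPs. The log-linear form of `relative_risk`, the hypothetical target haplotype `"11"`, the HWE form of $q$, and all names are assumptions made for illustration, not the specification used by Chen and Chatterjee.

```python
import math
from itertools import combinations_with_replacement

HAPLOTYPES = ["00", "01", "10", "11"]  # two SNPs, alleles coded 0/1 (toy setup)
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))


def genotype(dip):
    """Unphased genotype: number of '1' alleles at each SNP."""
    a, b = dip
    return tuple(int(u) + int(v) for u, v in zip(a, b))


def q_hwe(dip, freq):
    """Diplotype probability q(H, theta) under HWE."""
    a, b = dip
    return freq[a] ** 2 if a == b else 2.0 * freq[a] * freq[b]


def relative_risk(dip, x, beta1, target="11"):
    """Hypothetical R(H, X, beta1): log-linear in copies of `target` and in x."""
    beta_h, beta_x = beta1
    return math.exp(beta_h * dip.count(target) + beta_x * x)


def induced_relative_risk(G, x, beta1, freq):
    """R*(G, X, beta1, theta): q-weighted mean of R over diplotypes consistent with G."""
    compatible = [h for h in DIPLOTYPES if genotype(h) == G]
    weights = [q_hwe(h, freq) for h in compatible]
    num = sum(w * relative_risk(h, x, beta1) for h, w in zip(compatible, weights))
    return num / sum(weights)
```

For a genotype with a single consistent diplotype, `induced_relative_risk` returns that diplotype's relative risk unchanged; only phase-ambiguous genotypes, such as the double heterozygote `(1, 1)`, are averaged.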

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat\theta)$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng’s further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), “Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions,” Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), “Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects,” Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Prentice, R. L. (1982), “Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model,” Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), “Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity,” Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis, using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of “Hardy–Weinberg equilibrium (HWE)” in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects, rather than estimating the size of such effects.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits “inbreeding” (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider “inbreeding.” This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
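The size of the homozygote excess implied by this model is easy to tabulate. The sketch below is illustrative only: the homozygote formula is the one stated above, while the heterozygote probability $2(1 - \rho)\pi_k\pi_l$ for an unordered pair $k \ne l$ is the standard companion formula and is an assumption added here, as are the function name and the example frequencies.

```python
def diplotype_probs(pi, rho):
    """Unordered diplotype probabilities under the inbreeding model.

    Homozygotes:   pi_kk = rho * pi_k + (1 - rho) * pi_k**2   (as in the text)
    Heterozygotes: pi_kl = 2 * (1 - rho) * pi_k * pi_l, k < l (assumed companion formula)
    """
    probs = {}
    n_hap = len(pi)
    for k in range(n_hap):
        probs[(k, k)] = rho * pi[k] + (1.0 - rho) * pi[k] ** 2
        for l in range(k + 1, n_hap):
            probs[(k, l)] = 2.0 * (1.0 - rho) * pi[k] * pi[l]
    return probs


if __name__ == "__main__":
    pi = [0.2, 0.3, 0.5]  # example haplotype frequencies
    for rho in (0.0, 0.015, 0.05):
        p = diplotype_probs(pi, rho)
        homozygous = sum(v for (k, l), v in p.items() if k == l)
        print(f"rho={rho}: total homozygote probability {homozygous:.4f}")
```

Setting `rho=0.0` recovers HWE, so comparing the homozygote totals across values of `rho` shows how far a given inbreeding level moves the diplotype distribution away from it.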

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .0002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of “homozygotes” in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson’s paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design we can usually rule out population substructure as the cause of excess homozygosity.
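The spurious association described above can be reproduced with a few lines of arithmetic. The sketch below uses made-up numbers (two equally sized subpopulations, carrier frequencies of .8 and .2, disease risks of .10 and .02) and expected counts rather than sampled data; within each subpopulation the marker is independent of disease by construction, so any pooled association is due to confounding alone.

```python
def expected_table(n, carrier_freq, risk):
    """Expected 2x2 counts when carrier status is independent of disease."""
    carriers = n * carrier_freq
    noncarriers = n * (1.0 - carrier_freq)
    return {
        "case_carrier": carriers * risk,
        "control_carrier": carriers * (1.0 - risk),
        "case_noncarrier": noncarriers * risk,
        "control_noncarrier": noncarriers * (1.0 - risk),
    }


def odds_ratio(t):
    """Marker-disease odds ratio from a 2x2 table of counts."""
    return (t["case_carrier"] * t["control_noncarrier"]) / (
        t["control_carrier"] * t["case_noncarrier"]
    )


def pool(*tables):
    """Collapse several strata into one table, ignoring subpopulation."""
    return {key: sum(t[key] for t in tables) for key in tables[0]}


# Hypothetical subpopulations: marker common and disease common in A, both rare in B.
sub_a = expected_table(10_000, carrier_freq=0.8, risk=0.10)
sub_b = expected_table(10_000, carrier_freq=0.2, risk=0.02)
pooled = pool(sub_a, sub_b)

if __name__ == "__main__":
    print("OR within A:", odds_ratio(sub_a))
    print("OR within B:", odds_ratio(sub_b))
    print("OR pooled  :", odds_ratio(pooled))
```

The stratum-specific odds ratios equal one, while the pooled odds ratio is well above one. Stratifying by subpopulation, as the text recommends, removes the artifact in this example.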

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider “testing” versus “estimating” the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists tracing back to Fisher have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the “needle in a haystack” nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ’s model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling’s $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling’s $T^2$ in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and controls using (12) from LZ with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ’s rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario, both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
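The difference between scenarios (A)/(B) and scenario (C) comes down to which haplotypes carry the causal allele. The short sketch below makes that mapping explicit for the three haplotypes used in the simulation. It assumes that allele `1` is the risk allele at the causal SNP and reads the haplotype probabilities as .2, .3, and .5; the function names are hypothetical.

```python
HAPLOTYPES = {"h1": "11", "h2": "01", "h3": "00"}  # alleles at SNP positions 1 and 2
FREQS = {"h1": 0.2, "h2": 0.3, "h3": 0.5}          # haplotype probabilities as read from the text


def carriers(position, allele="1"):
    """Names of haplotypes carrying `allele` at the given 1-based SNP position."""
    return [name for name, seq in HAPLOTYPES.items() if seq[position - 1] == allele]


def allele_frequency(position, allele="1"):
    """Population frequency of `allele` at `position`, summed over carrier haplotypes."""
    return sum(FREQS[name] for name in carriers(position, allele))


if __name__ == "__main__":
    for pos in (1, 2):
        print(f"SNP{pos}: carried by {carriers(pos)}, frequency {allele_frequency(pos):.1f}")
```

With the causal SNP at position 1 only `h1` carries the allele, so the haplotype effect and the SNP effect coincide. At position 2 the allele sits on both `h1` and `h2`, so a model that contrasts `h1` with all other haplotypes places some risk-carrying haplotypes in the reference group.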

Table 1. Bias and Mean Squared Error (MSE) of $\hat\beta_1$

| $\beta_1$ | Scenario | Bias | MSE |
|---|---|---|---|
| .5 | Baseline | −.002 | .010 |
| .5 | [A] $\rho = .015$ | .023 | .012 |
| .5 | [B] Ascertainment bias | .273 | .083 |
| .5 | [C] Multiple causal haplotypes | −.217 | .058 |
| 0 | Baseline | −.002 | .013 |
| 0 | [A] $\rho = .015$ | .001 | .012 |
| 0 | [B] Ascertainment bias | .002 | .008 |
| 0 | [C] Multiple causal haplotypes | .002 | .012 |

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario “Baseline” refers to $\rho = 0$, no ascertainment bias, and one causal haplotype, $h_1$.

The bias of $\hat\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), “Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus,” American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), “Use of Unphased Multilocus Genotype Data in Indirect Association Studies,” Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), “Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power,” Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), “No Excess of Homozygosity at Loci Used for DNA Fingerprinting,” Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), “Genomic Control for Association Studies,” Biometrics, 55, 997–1004.

Donbak, L. (2004), “Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact,” Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), “The Transmission/Disequilibrium Test: History, Subdivision, and Admixture,” American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), “Genome Association Studies of Complex Diseases by Case-Control Designs,” American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), “Genetic Dissection of Complex Traits,” Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.


Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), “Analysis of Single-Locus Tests to Detect Gene/Disease Associations,” Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), “TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes,” American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), “Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching,” Journal of the American Statistical Association, 98, 236–247.

Comment
Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ’s article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of the haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method, called FlexTree, for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as “metadata,” usually defined as “data about data.” For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that existing biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mitigating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), “Mapping Complex Disease Loci in Whole-Genome Association Studies,” Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least Angle Regression,” The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), “Gradient Directed Regularization,” technical report, Stanford University.

Gui, J., and Li, H. (2005a), “Threshold Gradient Descent Method for Censored Data Regression With Applications in Pharmacogenomics,” Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), “Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data,” Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), “The Hallmarks of Cancer,” Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), “Tree-Structured Supervised Learning and the Genetics of Hypertension,” Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), “The KEGG Databases at GenomeNet,” Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), “Identifying Interacting SNPs Using Monte Carlo Logic Regression,” Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), “Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data,” Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), “Logic Regression,” Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), “Evolutionary-Based Grouping of Haplotypes in Association Analysis,” Genetic Epidemiology, 28, 220–231.


Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
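The permutation adjustment mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the per-window statistic is a simple stand-in (an absolute difference in mean scores between cases and controls), not the likelihood-ratio statistic of the article. The sketch shows the max-statistic idea, in which each window's observed statistic is compared with the permutation distribution of the largest statistic over all windows.

```python
import random

def mean_difference(y, scores):
    """Stand-in association statistic (an assumption, not the article's
    likelihood-ratio test): |mean score in cases - mean score in controls|."""
    cases = [s for yi, s in zip(y, scores) if yi == 1]
    controls = [s for yi, s in zip(y, scores) if yi == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def max_stat_adjusted_pvalues(y, windows, n_perm=1000, seed=0):
    """Permutation p-values adjusted for scanning over all windows.

    y       : list of 0/1 disease indicators (both groups must be nonempty)
    windows : list of per-window score lists, one score per subject
    """
    rng = random.Random(seed)
    observed = [mean_difference(y, w) for w in windows]
    exceed = [0] * len(windows)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any phenotype-window association
        max_stat = max(mean_difference(y_perm, w) for w in windows)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # The add-one correction keeps every estimated p-value strictly positive.
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Because the phenotype labels are permuted as a whole, the dependence among overlapping windows is preserved in the reference distribution, which is the main reason to prefer this to a Bonferroni correction in a sliding-window scan.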

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype–environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).
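The two formulations can be made concrete with a small sketch of how a diplotype might be turned into regression covariates. The function names are hypothetical, haplotypes are indexed 0, ..., K − 1, and an additive (copy-count) coding is assumed purely for illustration; dominant or recessive codings would follow the same pattern.

```python
def one_vs_rest(diplotype, target):
    """First formulation: number of copies of the target haplotype,
    with all other haplotypes pooled into the reference group."""
    return [sum(h == target for h in diplotype)]

def versus_last(diplotype, K):
    """Second formulation: copies of each of haplotypes 0..K-2,
    with haplotype K-1 serving as the reference."""
    return [sum(h == k for h in diplotype) for k in range(K - 1)]

# A subject carrying haplotypes 0 and 2 when K = 3:
print(one_vs_rest((0, 2), target=0))  # [1]
print(versus_last((0, 2), K=3))       # [1, 0]
```

The first coding yields a single effect parameter, whereas the second yields K − 1 of them, which is one way to see why the second formulation is more demanding in small samples.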

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence), unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence, under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and of the assumption of independence between haplotypes and covariates. On the other hand, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative-risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative-risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative-risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR’s view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single-marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single-marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


It follows that $\pi_1(g)/\pi_2(g) = \tilde\pi_1(g)/\tilde\pi_2(g)$. The other direction of the claim is obvious in view of (A.11) and the expressions of $\eta(1, x, g)$ for $g \in G_1$ and $g \in G_2$.

For the complementary log–log model, we obtain the same equations as (A.8)–(A.11) with $\Phi(x)$ replaced by $1 - \exp(-e^{x})$. In particular,

$$
\frac{e^{-e^{\alpha}}\bigl(e^{e^{\alpha+\beta_z z}} - 1\bigr)}{1 - e^{-e^{\alpha}}}
= \frac{e^{-e^{\tilde\alpha}}\bigl(e^{e^{\tilde\alpha+\tilde\beta_z z}} - 1\bigr)}{1 - e^{-e^{\tilde\alpha}}}.
$$

Taking the first and second derivatives of the two sides with respect to $z$ and forming the ratio of them, we obtain $\beta_z(e^{\alpha+\beta_z z} + 1) = \tilde\beta_z(e^{\tilde\alpha+\tilde\beta_z z} + 1)$. Thus $\alpha = \tilde\alpha$ and $\beta_z = \tilde\beta_z$. The rest of the proof is the same as that of the probit model.
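The differentiation step for the complementary log–log model can be written out as a short worked derivation. The shorthand $C$, $u(z)$, and $f(z)$ is introduced here only for illustration and does not appear in the article.

```latex
% Shorthand (illustrative): C = e^{-e^{\alpha}} / (1 - e^{-e^{\alpha}}),  u(z) = e^{\alpha + \beta_z z}
\begin{align*}
f(z)   &= C\,\bigl(e^{u(z)} - 1\bigr), \\
f'(z)  &= C\,\beta_z\,u(z)\,e^{u(z)}, \\
f''(z) &= C\,\beta_z^{2}\,u(z)\,e^{u(z)}\,\bigl(u(z) + 1\bigr), \\
\frac{f''(z)}{f'(z)} &= \beta_z\,\bigl(e^{\alpha + \beta_z z} + 1\bigr).
\end{align*}
```

The constant $C$ cancels in the ratio, which is why the ratio of derivatives is used. If the resulting identity holds on an interval of $z$ values with $\beta_z \neq 0$, matching the constant term and the exponential term on the two sides would give $\beta_z = \tilde\beta_z$ and then $\alpha = \tilde\alpha$.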

A.4.4 Profile Likelihood of θ Based on (8). Suppose that there are $J$ distinct observed values of $(X, G)$, denoted by $(x_1, g_1), \ldots, (x_J, g_J)$. Let $n_{1j}$ and $n_{0j}$ be the number of times that $(x_j, g_j)$ is observed in the cases and controls, and let $\delta_j$ be the jump size of the estimated distribution of $(X, G)$ at $(x_j, g_j)$. Then the log-likelihood based on (8) can be written as

$$
l_n(\theta, \{\delta_j\}) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta)
- n_1 \log\Bigl\{\sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j\Bigr\}
+ \sum_{j=1}^{J} n_{+j} \log \delta_j,
$$

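where $n_{+j} = n_{0j} + n_{1j}$. Following Scott and Wild (1997), we introduce a Lagrange multiplier $\lambda$ for the constraint $\sum_j \delta_j = 1$ and set the derivative with respect to $\delta_j$ to 0. We then obtain

$$
\frac{n_{+j}}{\delta_j}
- \frac{n_1\,\eta(1, x_j, g_j; \theta)}{\sum_{j'=1}^{J} \eta(1, x_{j'}, g_{j'}; \theta)\,\delta_{j'}}
+ \lambda = 0.
$$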

Multiplying both sides by $\delta_j$ and summing over $j$, we see that $\lambda = n_1 - n$. Thus

$$
\delta_j = \frac{n_{+j}}{n - n_1 + n_1\,\eta(1, x_j, g_j; \theta)/\mu}, \tag{A.12}
$$

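where $\mu = \sum_{j=1}^{J} \eta(1, x_j, g_j; \theta)\,\delta_j$. Plugging (A.12) into $l_n(\theta, \{\delta_j\})$, we see that the objective function to be maximized is, up to a constant $C_n$, equal to

$$
l_n^*(\theta, \mu) = \sum_{j=1}^{J} n_{1j} \log \eta(1, x_j, g_j; \theta)
- \sum_{j=1}^{J} n_{+j} \log\Bigl\{\frac{n_1}{n}\,\eta(1, x_j, g_j; \theta) + \Bigl(1 - \frac{n_1}{n}\Bigr)\mu\Bigr\}
+ (n - n_1) \log \mu.
$$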

Thus $\max_{\{\delta_j\}} l_n(\theta, \{\delta_j\}) \le \max_{\mu} l_n^*(\theta, \mu) + C_n$. If $\mu$ maximizes $l_n^*(\theta, \mu)$, then $\partial l_n^*(\theta, \mu)/\partial\mu = 0$, and the $\delta_j$ given in (A.12) satisfy $\sum_{j=1}^{J} \delta_j = 1$. Thus $\max_{\mu} l_n^*(\theta, \mu) + C_n \le \max_{\{\delta_j\}} l_n(\theta, \{\delta_j\})$. Therefore, the profile log-likelihood function for $\theta$ based on $l_n(\theta, \{\delta_j\})$ equals the profile function based on $l_n^*(\theta, \mu)$ up to a constant $C_n$. We maximize $l_n^*(\theta, \mu)$ via Newton–Raphson to yield $\hat\theta$ and $\hat\mu$, where $\hat\theta$ is the MLE of $\theta$. It can be shown that, up to a constant, $l_n^*(\theta, \mu)$ is the log-likelihood based on a random sample of size $n$ from a conditional distribution of $Y$ given $X$ and $G$. Hence the covariance matrix of $(\hat\theta, \hat\mu)$ can be estimated by the inverse information matrix of $l_n^*(\theta, \mu)$.
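The equivalence between the two profile functions can be checked numerically for a fixed θ. The sketch below is an illustration only: the values of η, the counts, and all function names are made up for the example, and bisection is used in place of Newton–Raphson simply to keep the code short. It solves for the μ at which the jump sizes in (A.12) sum to 1 and compares $l_n$ at those jump sizes with $l_n^*$ at that μ.

```python
import math

def delta_from_mu(mu, eta, n_plus, n1):
    """Jump sizes from (A.12) for a given mu."""
    n = sum(n_plus)
    return [npj / (n - n1 + n1 * e / mu) for npj, e in zip(n_plus, eta)]

def solve_mu(eta, n_plus, n1, tol=1e-13):
    """Bisection for the mu at which the (A.12) jump sizes sum to 1.
    Assumes at least one control, so that a root exists."""
    total = lambda mu: sum(delta_from_mu(mu, eta, n_plus, n1))
    lo, hi = 1e-12, 1.0
    while total(hi) < 1.0:          # total(mu) is increasing in mu
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

def log_lik(delta, eta, n1j, n_plus):
    """l_n(theta, {delta_j}) with eta_j = eta(1, x_j, g_j; theta) held fixed."""
    mu = sum(e * d for e, d in zip(eta, delta))
    return (sum(a * math.log(e) for a, e in zip(n1j, eta))
            - sum(n1j) * math.log(mu)
            + sum(b * math.log(d) for b, d in zip(n_plus, delta)))

def profile_objective(mu, eta, n1j, n_plus):
    """l*_n(theta, mu) with the same fixed eta_j."""
    n1, n = sum(n1j), sum(n_plus)
    r = n1 / n
    return (sum(a * math.log(e) for a, e in zip(n1j, eta))
            - sum(b * math.log(r * e + (1 - r) * mu) for b, e in zip(n_plus, eta))
            + (n - n1) * math.log(mu))

# Made-up toy inputs: J = 3 support points.
eta = [0.5, 1.0, 2.0]
n1j = [5, 10, 20]    # case counts
n0j = [15, 10, 5]    # control counts
n_plus = [a + b for a, b in zip(n1j, n0j)]
n1, n = sum(n1j), sum(n_plus)

mu_hat = solve_mu(eta, n_plus, n1)
delta_hat = delta_from_mu(mu_hat, eta, n_plus, n1)
C_n = sum(b * math.log(b) for b in n_plus) - n * math.log(n)
gap = log_lik(delta_hat, eta, n1j, n_plus) - (profile_objective(mu_hat, eta, n1j, n_plus) + C_n)
print(sum(delta_hat), gap)  # the sum should be close to 1 and the gap close to 0
```

Working through the algebra for this sketch gives $C_n = \sum_j n_{+j}\log n_{+j} - n\log n$, which does not involve θ, so dropping it does not change the maximizer.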

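A.4.5 Profile Likelihood of θ Based on (9). Suppose that (3) holds. Write $\theta = (\beta, \{\pi_k\}, \rho)$. Also define

$$
\zeta_1(x, g; \theta) = \sum_{(h_k, h_l) \in S(g)} e^{\beta^T Z(x, h_k, h_l)} \{\rho\,\pi_k \delta_{kl} + (1 - \rho)\,\pi_k \pi_l\},
$$

$$
\zeta_0(g; \theta) = \sum_{(h_k, h_l) \in S(g)} \{\rho\,\pi_k \delta_{kl} + (1 - \rho)\,\pi_k \pi_l\}.
$$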
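A small sketch may help to fix ideas about these two quantities. It rests on several assumptions that are not in the article: haplotypes are coded as tuples of 0/1 alleles, a genotype is the locus-wise sum of two haplotypes, $\delta_{kl}$ is read as the indicator that $k = l$, there are no covariates $x$, and $Z$ is taken to be the number of copies of a hypothetical target haplotype. All function names are illustrative.

```python
import math
from itertools import product

def diplotype_prob(k, l, pi, rho):
    """rho * pi_k * 1(k = l) + (1 - rho) * pi_k * pi_l for the ordered pair (h_k, h_l)."""
    return rho * pi[k] * (k == l) + (1 - rho) * pi[k] * pi[l]

def compatible_pairs(g, haplotypes):
    """S(g): ordered haplotype index pairs whose allele counts add up to genotype g."""
    K = len(haplotypes)
    return [(k, l) for k, l in product(range(K), repeat=2)
            if all(a + b == c for a, b, c in zip(haplotypes[k], haplotypes[l], g))]

def zeta0(g, haplotypes, pi, rho):
    return sum(diplotype_prob(k, l, pi, rho) for k, l in compatible_pairs(g, haplotypes))

def zeta1(g, haplotypes, pi, rho, beta, target):
    # Z = copies of haplotype `target` (additive coding, an assumption); no covariates.
    return sum(math.exp(beta * ((k == target) + (l == target))) * diplotype_prob(k, l, pi, rho)
               for k, l in compatible_pairs(g, haplotypes))

# Two SNPs, four haplotypes, made-up frequencies.
haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
pi = [0.4, 0.3, 0.2, 0.1]
rho = 0.05
genotypes = list(product(range(3), repeat=2))
print(sum(zeta0(g, haplotypes, pi, rho) for g in genotypes))  # should be close to 1
```

The double heterozygote $g = (1, 1)$ is the only genotype here with phase ambiguity: it is compatible with both the pair $(0,0)/(1,1)$ and the pair $(0,1)/(1,0)$, so its $\zeta_0$ sums over four ordered pairs.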

By a derivation similar to that of Section A.4.4, profiling (9) over $\{F_g(\cdot)\}$ is equivalent to profiling the following function over $\{\mu_g\}$:

$$
\begin{aligned}
l_n^*(\theta, \{\mu_g\})
={}& \sum_{i=1}^{n} \bigl\{ y_i \log \zeta_1(X_i, G_i; \theta) + (1 - y_i) \log \zeta_0(G_i; \theta) \bigr\} \\
& - \sum_{i=1}^{n} \sum_{g} I(G_i = g) \log\Bigl\{ \zeta_1(X_i, G_i; \theta) + n_1^{-1} n_g \sum_{\tilde g} \mu_{\tilde g} - \mu_g \Bigr\} \\
& + \sum_{i=1}^{n} (1 - y_i) \log \sum_{g} \mu_g,
\end{aligned}
$$

where $n_g$ is the number of times $G = g$ in the sample. The covariance matrix of $\hat\theta$ can be estimated by the sandwich estimator or the profile likelihood method.

If $X$ is independent of $G$, then we obtain the MLE $\hat\theta$ by maximizing the following function:

$$l^*_n(\theta, \mu) = \sum_{i=1}^{n} y_i \log \zeta_1(X_i, G_i; \theta) + \sum_{i=1}^{n} (1 - y_i) \log \zeta_0(G_i; \theta) + \sum_{i=1}^{n} (1 - y_i) \log \mu - \sum_{i=1}^{n} \log\Big\{(1 - r)\mu + r \sum_{g} \zeta_1(X_i, g; \theta)\Big\},$$

where $r = n_1/n$. Let $H = BQ_1 + (1 - B)Q_2$, where $B$ is a Bernoulli variable, $Q_1$ takes values in $\{(h_k, h_k) : k = 1, \ldots, K\}$, and $Q_2$ takes values in $\{(h_k, h_l) : k, l = 1, \ldots, K\}$. Suppose that $Y$ is a binary variable and that the conditional distribution of $(B, Q_1, Q_2, Y)$ given $X$ is characterized by

$$P(B, Q_1, Q_2, Y \mid X) = \frac{\exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}}{\sum_{B, Q_1, Q_2, Y} \exp\{\vartheta^T W(B, Q_1, Q_2, Y, X)\}},$$

where

$$\vartheta = \Big(-\log\mu + \log\{r/(1 - r)\},\ \beta^T,\ \log\pi_1 - \log\{\rho/(1 - \rho)\},\ \ldots,\ \log\pi_K - \log\{\rho/(1 - \rho)\}\Big)^T$$

and

$$W(B, Q_1, Q_2, Y, X) = \Big(Y,\ Y Z^T(X, H),\ B\,I(Q_1 = (h_1, h_1)) + (1 - B)\sum_{l}\{I(Q_2 = (h_1, h_l)) + I(Q_2 = (h_l, h_1))\},\ \ldots,\ B\,I(Q_1 = (h_K, h_K)) + (1 - B)\sum_{l}\{I(Q_2 = (h_K, h_l)) + I(Q_2 = (h_l, h_K))\}\Big)^T .$$

We can show that $l^*_n(\theta, \mu)$ is equivalent to the log-likelihood

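$$l^*_n(\vartheta) = \sum_{i=1}^{n} \log\left[\frac{\sum_{b q_1 + (1 - b) q_2 \in S(G_i)} e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right].$$

We maximize $l^*_n(\vartheta)$ through the EM algorithm, in which $(B, Q_1, Q_2)$ is treated as missing. The estimation of the covariance matrix of $\hat\theta$ is based on the information matrix of $l^*_n(\vartheta)$.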

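The complete-data score function is

$$\sum_{i=1}^{n}\left[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right].$$

Thus, in the E-step we calculate the conditional expectation of $W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i)$ given $(Y_i, X_i, G_i)$ and the current parameter estimates:

$$E[W(B_i, Q_{1i}, Q_{2i}, Y_i, X_i) \mid Y_i, X_i, G_i] = \frac{\sum_{b, q_1, q_2} I(b q_1 + (1 - b) q_2 \in S(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}\, W(b, q_1, q_2, Y_i, X_i)}{\sum_{b, q_1, q_2} I(b q_1 + (1 - b) q_2 \in S(G_i))\, e^{\vartheta^T W(b, q_1, q_2, Y_i, X_i)}}.$$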


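In the M-step we use the one-step Newton–Raphson iteration to update the parameter estimates:

$$\vartheta^{(k+1)} = \vartheta^{(k)} - \mathcal{H}^{-1} \times \sum_{i=1}^{n}\left[E[W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i] - \frac{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b, q_1, q_2, y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}\right],$$

where

$$\mathcal{H} = -\sum_{i=1}^{n}\left[\frac{\sum_{b, q_1, q_2, y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}}\right] + \sum_{i=1}^{n}\left[\frac{\{\sum_{b, q_1, q_2, y} W(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^{\otimes 2}}{\{\sum_{b, q_1, q_2, y} e^{\vartheta^T W(b, q_1, q_2, y, X_i)}\}^{2}}\right]$$

and $a^{\otimes 2} = aa^T$.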
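The E-step and one-step Newton–Raphson M-step have the same shape in any exponential-family model with coarsened observations, which the following toy sketch tries to convey. It is an illustration under simplifying assumptions, not the article's algorithm: there are no covariates or outcome, the "configurations" are just three categories with indicator sufficient statistics, and each observation reveals only a set of categories it may belong to. All names and counts are hypothetical.

```python
import numpy as np

def softmax_probs(theta, W):
    """p(c) proportional to exp(theta^T W_c) over a finite set of configurations."""
    a = W @ theta
    a -= a.max()                      # stabilize the exponentials
    p = np.exp(a)
    return p / p.sum()

def em_newton(W, obs_sets, counts, n_iter=200):
    """EM with a one-step Newton-Raphson update in the M-step."""
    theta = np.zeros(W.shape[1])
    n = sum(counts)
    for _ in range(n_iter):
        p = softmax_probs(theta, W)
        # E-step: sum over observations of E[W | observed set, current theta].
        ew_obs = np.zeros_like(theta)
        for s, m in zip(obs_sets, counts):
            idx = list(s)
            w = p[idx] / p[idx].sum()          # conditional probs within the set
            ew_obs += m * (w @ W[idx])
        # M-step: one Newton step on the expected complete-data log-likelihood.
        mean = p @ W                            # E_theta[W]
        score = ew_obs - n * mean
        hess = -n * ((W.T * p) @ W - np.outer(mean, mean))   # -n * Var_theta(W)
        theta = theta - np.linalg.solve(hess, score)
    return theta

# Three categories; category 0 is the baseline, W holds indicators of 1 and 2.
W = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
# Hypothetical data: 20, 10, and 30 fully observed in categories 0, 1, 2,
# plus 40 observations known only to lie in {1, 2}.
obs_sets = [{0}, {1}, {2}, {1, 2}]
counts = [20, 10, 30, 40]

theta_hat = em_newton(W, obs_sets, counts)
p_hat = softmax_probs(theta_hat, W)
```

For these counts the maximizer can be worked out by hand: $p_0 = 20/100$, and the remaining mass 0.8 is split in the ratio 10:30 seen among the unambiguous observations, giving $(0.2, 0.2, 0.6)$. A single Newton step per iteration avoids an inner optimization loop while keeping the same fixed point as full EM.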

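A.4.6 Proof of Theorem 2. Write $\hat F_{xg}(x, g) = \hat F^\dagger_g(x)\hat q_g$ and $F_{xg}(x, g) = F^\dagger_g(x) q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F^*_{xg}(x, g) \equiv F^*_g(x) q^*_g$, where $q^*_g > 0$ for any $g$. Because $\hat F^\dagger_g$ maximizes the likelihood, there exists some Lagrange multiplier $\hat\lambda_g$ such that

$$\frac{I(G_i = g)}{\hat F^\dagger_g\{X_i\}} - \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{\int_{x, g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)} - n\hat\lambda_g = 0,$$

where $\hat F^\dagger_g\{X_i\}$ denotes the point mass of $\hat F^\dagger_g$ at $X_i$, and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^{n} \hat F^\dagger_g\{X_i\} = 1$, $\hat\lambda_g$ satisfies the equation

$$n^{-1}\sum_{i=1}^{n} I(G_i = g)\left\{\hat\lambda_g + n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g \Big[n \int_{x, g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)\Big]^{-1}\right\}^{-1} = 1 \tag{A.13}$$

and

$$\min_{1 \le i \le n}\left\{\hat\lambda_g + \frac{n_1\,\eta(1, X_i, g; \hat\theta)\,\hat q_g}{n \int_{x, g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)}\right\} > 0.$$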

Clearly, $\hat\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\hat\lambda_g \to \lambda^*_g$. By (A.13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that

$$\min_{g, x}\left|\lambda^*_g + \frac{\eta(1, x, g; \theta^*)\, q^*_g}{\int_{x, g} \eta(1, x, g; \theta^*)\, dF^*_{xg}(x, g)}\right| > \delta.$$

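Consequently, when $n$ is sufficiently large,

$$\hat F^\dagger_g(x) = n^{-1}\sum_{i=1}^{n} I(G_i = g, X_i \le x) \times \left(\max\left[\left|\hat\lambda_g + \eta(1, X_i, g; \hat\theta)\,\hat q_g\, n_1 \times \Big\{n \int_{x, g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)\Big\}^{-1}\right|,\ \delta\right]\right)^{-1}.$$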

We define an empirical function $\tilde F^\dagger_g$ whose jump size at $X_i$ is proportional to

$$n^{-1} I(G_i = g)\left(P(G = g, Y = 0) + \eta(1, X_i, g; \theta_0)\, q_g \times \Big\{\int_{x, g} \eta(1, x, g; \theta_0)\, dF_{xg}(x, g)\Big\}^{-1}\right)^{-1}.$$

Then it can be verified that $\tilde F^\dagger_g$ converges uniformly to $F^\dagger_g$. In addition, $\hat F^\dagger_g$ is absolutely continuous with respect to $\tilde F^\dagger_g$, and the Radon–Nikodym derivative $d\hat F^\dagger_g(x)/d\tilde F^\dagger_g(x)$ is bounded and converges uniformly to $dF^*_g(x)/dF^\dagger_g(x)$. Let $\tilde F_{xg}(x, g) = \tilde F^\dagger_g(x) q_g$, and let $l_n(\theta, \{F^\dagger_g\}, \{q_g\})$ be the log-likelihood based on (8). By the definition of the MLE, $n^{-1} l_n(\hat\theta, \{\hat F^\dagger_g\}, \{\hat q_g\}) - n^{-1} l_n(\theta_0, \{\tilde F^\dagger_g\}, \{q_g\}) \ge 0$. The limit of this difference is the negative Kullback–Leibler information of the distribution for $(\theta^*, \{F^*_g\}, \{q^*_g\})$ with respect to $(\theta_0, \{F^\dagger_g\}, \{q_g\})$ under $P(Y = 1) = \varrho$. The identifiability conditions then yield $\theta^* = \theta_0$, $F^*_g = F^\dagger_g$, and $q^*_g = q_g$. Thus the consistency of $\hat\theta$ is established. Because $F_{xg}$ is continuous, $\sup_{x, g} |\hat F_{xg}(x, g) - F_{xg}(x, g)| \to 0$ almost surely.

The derivation of the asymptotic distribution is similar to the proof of theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, \{F^\dagger_g\}, \{q_g\})$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon \int \psi(x, g)\, dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

$$n^{1/2}\Big\{(v^T \Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0) + \int (v^T \Omega_{12} + \Omega_{22}[\psi])\, d(\hat F_{xg} - F_{xg})\Big\}$$
$$= n^{-1/2}\sum_{i=1}^{n} y_i \Big\{v^T l_\theta(1, X_i, G_i; \theta_0, F_{xg}) + l_F(1, X_i, G_i; \theta_0, F_{xg})\Big[\int \psi\, dF_{xg}\Big]\Big\}$$
$$\quad + n^{-1/2}\sum_{i=1}^{n} (1 - y_i) \Big\{v^T l_\theta(0, X_i, G_i; \theta_0, F_{xg}) + l_F(0, X_i, G_i; \theta_0, F_{xg})\Big[\int \psi\, dF_{xg}\Big]\Big\} + o_p(1),$$

where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of $x$, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of $\psi$, and $l_\theta$ and $l_F$ are the scores with respect to $\theta$ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through $\varrho$. We can show that the operator $B[v, \psi] \equiv (v^T \Omega_{11} + \Omega_{21}[\psi]^T,\ v^T \Omega_{12} + \Omega_{22}[\psi]^T)$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via $\varrho$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean $\varrho$. By choosing some $\psi$ such that $B[v, \psi] = (v^T, 0)^T$ for all $v$, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, \{F_g\}, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

$$\frac{dP_n}{d\tilde P_n} = \exp\left\{a_n \sum_{i=1}^{n} \frac{\partial \log f(y_i, X_i, G_i; \theta, \{F_g\}, a)}{\partial a}\bigg|_{a=0} + o(1)\right\} \to_{\tilde P_n} 1.$$

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

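A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

$$\sum_{H \in S(G)} \big\{\tilde\Lambda'(Y) e^{\tilde\beta^T Z(X, H)} Q'\big(\tilde\Lambda(Y) e^{\tilde\beta^T Z(X, H)}\big)\big\}^{\Delta} \times \big\{1 - Q\big(\tilde\Lambda(Y) e^{\tilde\beta^T Z(X, H)}\big)\big\}^{1 - \Delta} P_\gamma(H)$$
$$= \sum_{H \in S(G)} \big\{\Lambda'(Y) e^{\beta^T Z(X, H)} Q'\big(\Lambda(Y) e^{\beta^T Z(X, H)}\big)\big\}^{\Delta} \times \big\{1 - Q\big(\Lambda(Y) e^{\beta^T Z(X, H)}\big)\big\}^{1 - \Delta} P_\gamma(H).$$

By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain

$$\sum_{H \in S(G)} Q\big(\tilde\Lambda(\tau) e^{\tilde\beta^T Z(X, H)}\big) P_\gamma(H) = \sum_{H \in S(G)} Q\big(\Lambda(\tau) e^{\beta^T Z(X, H)}\big) P_\gamma(H).$$

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y) e^{\tilde\beta^T Z(X, H)} = \Lambda(Y) e^{\beta^T Z(X, H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\beta = \tilde\beta$ and $\Lambda = \tilde\Lambda$.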

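A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

$$\mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\Big[\int \psi\, d\Lambda_0\Big] = 0, \tag{A.14}$$

where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int \psi\, d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon \int \psi\, d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain

$$\sum_{H \in S(G)} Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big) P_\gamma(H) \times \left\{\frac{Q'\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big) \Lambda_0(\tau) e^{\beta_0^T Z(X, H)} \mu_\beta^T Z(X, H)}{Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big)} + \frac{Q'\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X, H)}}{Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H)\right\} = 0. \tag{A.15}$$

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

$$\sum_{H \in S(G)} \big\{1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big)\big\} P_\gamma(H) \times \left\{-\frac{Q'\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big) \Lambda_0(\tau) e^{\beta_0^T Z(X, H)} \mu_\beta^T Z(X, H)}{1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big)} - \frac{Q'\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X, H)}}{1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X, H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H)\right\} = 0. \tag{A.16}$$

The summation of (A.15) and (A.16) entails $\mu_\gamma^T \nabla_\gamma \log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

$$\psi(Y) + \frac{Q''\big(\Lambda_0(Y) e^{\beta_0^T Z(X, H)}\big) \int_0^Y \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X, H)}}{Q'\big(\Lambda_0(Y) e^{\beta_0^T Z(X, H)}\big)} = 0$$

for $H = (h, h)$. Therefore, $\psi = 0$.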

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), “Prediction and Entropy,” in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), “Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?” European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case/Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000817


a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and eventually to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should

be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases, where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999), Service, Temple Lang, Freimer, and Sandkuijl (1999), Lam, Roeder, and Devlin (2000), Morris, Whittaker, and Balding (2000), Liu, Sabatti, Teng, Keats, and Risch (2001), and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multiloci genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information, when available, or assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating


haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
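For readers less familiar with the EM reconstruction step mentioned above, a minimal sketch of haplotype-frequency estimation from unphased genotypes, in the spirit of Excoffier and Slatkin (1995), might look as follows. It is illustrative only: it assumes biallelic SNPs coded as minor-allele counts, random pairing of haplotypes, and complete genotype data, and the function name and toy data are hypothetical. Enumerating all $2^M$ haplotypes is also what makes this approach impractical for a large number of markers.

```python
import itertools
import numpy as np

def haplotype_em(genotypes, n_snps, n_iter=100):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    haps = list(itertools.product([0, 1], repeat=n_snps))
    index = {h: k for k, h in enumerate(haps)}
    pi = np.full(len(haps), 1.0 / len(haps))   # uniform starting frequencies
    # For each genotype, list the ordered haplotype pairs (k, l) that add up to it.
    pairs = []
    for g in genotypes:
        pg = [(index[h], index[tuple(c - a for c, a in zip(g, h))])
              for h in haps
              if all(0 <= c - a <= 1 for c, a in zip(g, h))]
        pairs.append(pg)
    for _ in range(n_iter):
        counts = np.zeros(len(haps))
        for pg in pairs:
            # E-step: posterior weight of each compatible pair under random pairing.
            w = np.array([pi[k] * pi[l] for k, l in pg])
            w /= w.sum()
            for (k, l), wk in zip(pg, w):
                counts[k] += wk
                counts[l] += wk
        # M-step: frequencies are expected haplotype counts over 2n chromosomes.
        pi = counts / counts.sum()
    return haps, pi

# Toy data with two SNPs; none of these genotypes is phase-ambiguous,
# so the estimates reduce to direct haplotype counting.
genotypes = [(0, 0), (0, 0), (1, 0), (2, 2)]
haps, pi_hat = haplotype_em(genotypes, n_snps=2)
```

Here the eight chromosomes carry haplotype (0,0) five times, (1,0) once, and (1,1) twice, so the estimates are 0.625, 0.125, and 0.25. Treating the most probable pair for each individual as known, as in the two-stage practice criticized in the text, discards the uncertainty carried by the E-step weights.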

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven: faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes, measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population and its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates

the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk, or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment
Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and that situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng’s (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates $X$ we can assume Hardy–Weinberg equilibrium, then Lin and Zeng’s model (3) applies marginally. However, conditionally on covariates $X$, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x; \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes $g$, rather than the distribution of haplotypes conditional on $X$. With this assumption, specifying the haplotype frequencies at each covariate is unnecessary. However, this assumption is unlikely to hold when $X$ and $G$ are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng’s simulations assume that $X$ and $H$ are independent at the population level, not just conditional on $G$.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited

In the Public Domain
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000826


to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model (3) is correct.

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng’s article. Recently we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence of the distribution of $H$ given $G$ in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment
Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng’s article in the broader context of various alternative methods for

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition, and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll’s research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

1. CASE-CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X, \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d, \theta), \tag{2}$$

where the model $q(h^d, \theta)$, in turn, could be specified according to HWE or some of its extensions, as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x, \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.
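To make the phase ambiguity behind models (2) and (3) concrete, the following Python sketch enumerates the diplotypes consistent with an unphased genotype and evaluates the diplotype probability under HWE. It is only an illustration under our own assumptions: biallelic SNPs coded by minor-allele counts, haplotypes coded as 0/1 tuples, and hypothetical helper names (`compatible_diplotypes`, `hwe_prob`) and frequencies that do not come from any of the cited papers.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple of minor-allele counts (0, 1, or 2), one per SNP.
    """
    per_snp = []
    for g in genotype:
        if g == 0:
            per_snp.append([(0, 0)])
        elif g == 2:
            per_snp.append([(1, 1)])
        else:  # heterozygous site: the phase is unknown
            per_snp.append([(0, 1), (1, 0)])
    pairs = set()
    for combo in product(*per_snp):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def hwe_prob(pair, freq):
    """Probability of an unordered diplotype under HWE."""
    h1, h2 = pair
    p = freq[h1] * freq[h2]
    return p if h1 == h2 else 2.0 * p


# Hypothetical haplotype frequencies for two SNPs.
theta = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# The double heterozygote is the only two-SNP genotype with ambiguous phase.
for pair in compatible_diplotypes((1, 1)):
    print(pair, hwe_prob(pair, theta))
```

With two SNPs, only the double heterozygote is compatible with more than one diplotype; every other genotype determines the haplotype pair uniquely.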

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = q(h, x, \theta)\,\frac{\exp[d\{\kappa + m(h, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x, \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h} \sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that under certain conditions, which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
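As an illustration of how the pseudolikelihood in (4) could be evaluated, here is a minimal Python sketch under model (2) with HWE. Everything specific in it is an assumption made for the example rather than part of the cited methods: the two-SNP haplotype set, the additive form chosen for $m(\cdot)$, the function names, and the numerical values (which are not calibrated to one another).

```python
import math
from itertools import combinations_with_replacement

# Illustrative setting (hypothetical): two SNPs, four haplotypes.
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))


def genotype_of(dip):
    """Unphased genotype (minor-allele count per SNP) of a diplotype."""
    h1, h2 = dip
    return tuple(a + b for a, b in zip(h1, h2))


def q_hwe(dip, theta):
    """Diplotype probability under HWE, independent of x as in model (2)."""
    h1, h2 = dip
    p = theta[h1] * theta[h2]
    return p if h1 == h2 else 2.0 * p


def m(dip, x, beta1, target=(1, 1)):
    """Assumed risk model: additive in copies of a target haplotype,
    plus a covariate main effect and an interaction."""
    copies = sum(h == target for h in dip)
    b_h, b_x, b_hx = beta1
    return b_h * copies + b_x * x + b_hx * copies * x


def S(d, dip, x, beta0, kappa, theta, beta1):
    eta = m(dip, x, beta1)
    return q_hwe(dip, theta) * math.exp(d * (kappa + eta)) / (1.0 + math.exp(beta0 + eta))


def L_star(d, g, x, beta0, kappa, theta, beta1):
    args = (beta0, kappa, theta, beta1)
    # Numerator: diplotypes consistent with g; denominator: all (d, h).
    num = sum(S(d, h, x, *args) for h in DIPLOTYPES if genotype_of(h) == g)
    den = sum(S(dd, h, x, *args) for h in DIPLOTYPES for dd in (0, 1))
    return num / den


def pseudo_loglik(data, beta0, kappa, theta, beta1):
    """Sum of log L*(D_i, G_i, X_i) over subjects, as in (4)."""
    return sum(math.log(L_star(d, g, x, beta0, kappa, theta, beta1)) for d, g, x in data)


theta = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
beta0, beta1 = -3.0, (0.5, 0.2, 0.1)
n1, n0, pi = 500, 500, 0.05  # pi treated as a known input for the example
kappa = beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))
data = [(1, (1, 1), 0.3), (0, (0, 0), -1.2), (1, (2, 2), 0.8)]  # (D, G, X) triples
print(pseudo_loglik(data, beta0, kappa, theta, beta1))
```

For a fixed covariate value, $L^*$ sums to 1 over all disease statuses and genotypes, so each subject's contribution behaves like a conditional probability of $(D, G)$ given $X$. In practice the function would be handed to a numerical optimizer; that step is omitted here.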

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let $R = 1$ denote the indicator of whether a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can


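now show that the pseudolikelihood $l^*$ can be expressed in the form

$$\begin{aligned}
l^* &= \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1) \times \mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) \\
&= \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1).
\end{aligned} \tag{5}$$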

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with $F(x)$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).
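The Bernoulli sampling representation can be checked numerically. The Python sketch below assumes that $N_d$ is the number of sampled subjects with $D = d$ (so $N_1 = n_1$ and $N_0 = n_0$); this is our reading of the notation, and the parameter values are hypothetical. Under that reading, the disease probability among selected subjects is logistic with the intercept $\beta_0$ replaced by $\kappa$.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def prob_case_given_selected(eta, beta0, n1, n0, pi):
    """pr(D = 1 | h, x, R = 1) when selection is Bernoulli with probability
    proportional to mu_d = n_d / pr(D = d); eta stands for m(h, x, beta1)."""
    p1 = expit(beta0 + eta)                # population risk pr(D = 1 | h, x)
    mu1, mu0 = n1 / pi, n0 / (1.0 - pi)    # selection weights for cases, controls
    return mu1 * p1 / (mu1 * p1 + mu0 * (1.0 - p1))


beta0, n1, n0, pi = -3.0, 400, 600, 0.05   # hypothetical values
kappa = beta0 + math.log(n1 / n0) - math.log(pi / (1.0 - pi))

for eta in (-1.0, 0.0, 0.7):
    lhs = prob_case_given_selected(eta, beta0, n1, n0, pi)
    rhs = expit(kappa + eta)               # logistic model with intercept kappa
    print(eta, lhs, rhs)
```

The two columns agree because the selection odds $\mu_1/\mu_0$ multiply the population disease odds $\exp\{\beta_0 + m\}$, which shifts only the intercept.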

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equations for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data are given by

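$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}. \tag{6}$$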

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

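$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d, \theta) \Biggr)^{-1}, \tag{7}$$

where

$$r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x, \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x, \beta_1)\}}.$$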

Spinka et al. showed that with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
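As a side remark, the weight $r$ in (7) is linked to the function $S$ of Section 1.1 by a one-line identity. The display below is our own algebra from the definitions given earlier, offered as a reading aid and not as a derivation taken from the cited papers: summing $S$ over the two disease states reproduces $r$ times the diplotype probability.

```latex
\sum_{d=0}^{1} S(d, h, x)
  = q(h, x, \theta)\,
    \frac{1 + \exp\{\kappa + m(h, x, \beta_1)\}}
         {1 + \exp\{\beta_0 + m(h, x, \beta_1)\}}
  = q(h, x, \theta)\, r(h, x).
```

In particular, the denominator of $L^*(D, G, X)$ equals $\sum_h q(h, X, \theta)\, r(h, X)$, which is the diplotype distribution reweighted by the case-control sampling.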

2. COHORT-BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazards (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X, \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^d, X, \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^d, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\mathrm{pr}(H^d) = q(H^d, \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^d$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and


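covariates $X$ in the form

$$\lambda[t \mid G, X] = \lambda_0(t)\, R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}, \tag{9}$$

where

$$\begin{aligned}
R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}
&= E\{R(H^d, X, \beta_1) \mid G, X, T \ge t\} \\
&= \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, \mathrm{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^d, X]\, q(H^d, \theta)}.
\end{aligned}$$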

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^*\{G, X, t, \beta_1, \theta, \Lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H^d, X] \approx 1$. The corresponding induced relative risk function

$$R^*(G, X, \beta_1, \theta) = \frac{\sum_{H^d \in \mathcal{H}_G} R(H^d, X, \beta_1)\, q(H^d, \theta)}{\sum_{H^d \in \mathcal{H}_G} q(H^d, \theta)}$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^*(G, X, \beta_1, \hat{\theta})$, where $\hat{\theta}$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining a consistent estimate of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
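The induced relative risk under the rare disease approximation is a weighted average of diplotype-specific relative risks, with HWE diplotype probabilities as weights. The Python sketch below illustrates the computation for two SNPs. The log-linear form of $R$, the target haplotype, the function names, and the plugged-in frequency values are all assumptions made for this example, not part of the Chen and Chatterjee method as published.

```python
import math
from itertools import combinations_with_replacement

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]   # two SNPs, illustrative
DIPLOTYPES = list(combinations_with_replacement(HAPLOTYPES, 2))


def genotype_of(dip):
    """Unphased genotype (minor-allele count per SNP) of a diplotype."""
    h1, h2 = dip
    return tuple(a + b for a, b in zip(h1, h2))


def q_hwe(dip, theta):
    """Diplotype probability under HWE."""
    h1, h2 = dip
    p = theta[h1] * theta[h2]
    return p if h1 == h2 else 2.0 * p


def rel_risk(dip, x, beta1, target=(1, 1)):
    """Assumed log-linear relative risk: multiplicative in copies of a
    target haplotype, with a covariate effect."""
    b_h, b_x = beta1
    copies = sum(h == target for h in dip)
    return math.exp(b_h * copies + b_x * x)


def induced_rel_risk(g, x, beta1, theta_hat):
    """R*(G, X, beta1, theta_hat): average of R over diplotypes consistent
    with g, weighted by their HWE probabilities."""
    compat = [h for h in DIPLOTYPES if genotype_of(h) == g]
    num = sum(rel_risk(h, x, beta1) * q_hwe(h, theta_hat) for h in compat)
    den = sum(q_hwe(h, theta_hat) for h in compat)
    return num / den


theta_hat = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}  # hypothetical estimate
beta1 = (math.log(2.0), 0.0)
print(induced_rel_risk((1, 1), 0.0, beta1, theta_hat))  # phase-ambiguous genotype
print(induced_rel_risk((2, 2), 0.0, beta1, theta_hat))  # phase known
```

For the double heterozygote, the result lies between the relative risks of the two compatible diplotypes; for unambiguous genotypes, it reduces to the relative risk of the single compatible diplotype. The resulting values could then be supplied as the risk score in a partial likelihood routine.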

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases, such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^*(G, X, \beta_1, \hat{\theta})$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng’s further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), “Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions,” Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), “Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects,” Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), “Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies,” Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), “A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies,” Human Heredity, 58, 18–29.

Prentice, R. L. (1982), “Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model,” Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), “Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity,” Genetic Epidemiology, 29, 105–127.

Comment
Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of “Hardy–Weinberg equilibrium (HWE)” in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual; that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where ρ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
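The inbreeding model above can be made concrete with a small numerical sketch. The homozygote formula is the one given in the text; the heterozygote probability $(1 - \rho)\pi_k\pi_l$ for $k \ne l$ is the usual companion assumption and is not stated in this comment, and the function name is illustrative only.

```python
def diplotype_probs(pi, rho):
    """Probabilities of ordered haplotype pairs (k, l) under inbreeding.

    Homozygotes follow rho*pi_k + (1 - rho)*pi_k**2, as in the text.
    Heterozygotes are taken to be (1 - rho)*pi_k*pi_l (an assumption here).
    """
    n = len(pi)
    probs = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k == l:
                probs[k][l] = rho * pi[k] + (1 - rho) * pi[k] ** 2
            else:
                probs[k][l] = (1 - rho) * pi[k] * pi[l]
    return probs


# Example: three haplotypes, with and without inbreeding.
hwe = diplotype_probs([0.2, 0.3, 0.5], rho=0.0)
inbred = diplotype_probs([0.2, 0.3, 0.5], rho=0.05)
print(hwe[0][0], inbred[0][0])  # homozygote probability rises with rho
```

Setting `rho=0` recovers Hardy–Weinberg proportions, so the same function covers both the HWE and the inbreeding case.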

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate ρ = .0002 in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be ρ = .015. Contrast this with the level of inbreeding (ρ = .05) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding (ρ = .03) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
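The confounding argument above can be illustrated with a small exact calculation. The sketch below uses made-up carrier frequencies and disease prevalences for two hypothetical subpopulations; within each subpopulation the marker and the disease are independent by construction, yet the pooled table shows an association.

```python
def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def stratum_table(weight, carrier_freq, prevalence):
    """Joint probabilities within one subpopulation.

    Rows: carrier / noncarrier; columns: diseased / healthy.
    Marker and disease are independent inside the stratum.
    """
    return [
        [weight * carrier_freq * prevalence,
         weight * carrier_freq * (1 - prevalence)],
        [weight * (1 - carrier_freq) * prevalence,
         weight * (1 - carrier_freq) * (1 - prevalence)],
    ]


def pooled_table(tables):
    """Add the stratum tables cell by cell, ignoring subpopulation labels."""
    return [[sum(t[i][j] for t in tables) for j in range(2)] for i in range(2)]


# Hypothetical numbers: the marker and the disease are both more common
# in subpopulation 1 than in subpopulation 2.
sub1 = stratum_table(weight=0.5, carrier_freq=0.8, prevalence=0.10)
sub2 = stratum_table(weight=0.5, carrier_freq=0.2, prevalence=0.01)
print(odds_ratio(sub1), odds_ratio(sub2))    # both equal 1 up to rounding
print(odds_ratio(pooled_table([sub1, sub2])))  # clearly above 1
```

Stratifying by subpopulation removes the association here, which is the rationale for the stratified and matched designs mentioned in the text.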

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested β = 0 rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists, tracing back to Fisher, have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of β, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in β. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. The reason that naive haplotype analysis performs poorly is that it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes are needed. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases that requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far we have discussed three scenarios that may introduce bias in estimators of β: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated n = 500 cases and controls using (12) from LZ, with α = −4 and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification. We simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ's rare disease algorithm, as described in section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of ρ = .015. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
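A rough way to see why scenario (C) distorts the haplotype effect is to compute, exactly and without simulation, the population log odds ratio for carrying $h_1$ in cases versus controls. The sketch below makes two assumptions that are not taken from the comment: disease risk is logistic in the number of copies of the causal allele (LZ's model (12) may differ in form), and the allelic log odds ratio is used as a crude stand-in for the fitted $\beta_1$ rather than LZ's rare-disease algorithm. The function name and arguments are illustrative only.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def allelic_log_or(pi, causal, alpha, beta, target=0, rho=0.0):
    """Exact log odds ratio (cases vs. controls) for a random chromosome
    being haplotype `target`.

    pi      haplotype frequencies
    causal  0/1 flag per haplotype: does it carry the causal allele?
    Risk is assumed to be expit(alpha + beta * copies of the causal allele).
    """
    n = len(pi)
    case_t = case_all = ctrl_t = ctrl_all = 0.0
    for k in range(n):
        for l in range(n):
            # Ordered-pair probability under the inbreeding model.
            p = (1 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
            risk = expit(alpha + beta * (causal[k] + causal[l]))
            share = ((k == target) + (l == target)) / 2.0
            case_all += p * risk
            case_t += p * risk * share
            ctrl_all += p * (1 - risk)
            ctrl_t += p * (1 - risk) * share
    f_case = case_t / case_all
    f_ctrl = ctrl_t / ctrl_all
    return math.log(f_case / (1 - f_case)) - math.log(f_ctrl / (1 - f_ctrl))


pi = [0.2, 0.3, 0.5]  # h1 = 11, h2 = 01, h3 = 00
baseline = allelic_log_or(pi, causal=[1, 0, 0], alpha=-4.0, beta=0.5)
scenario_c = allelic_log_or(pi, causal=[1, 1, 0], alpha=-4.0, beta=0.5)
print(baseline, scenario_c)
```

Under these assumptions the baseline value stays close to the true effect of .5, while the scenario (C) value falls well below it, because $h_2$ also carries the causal allele and is lumped into the comparison group.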

Table 1. Bias and Mean Squared Error (MSE) of $\beta_1$

$\beta_1$   Scenario                          Bias     MSE
.5          Baseline                          −.002    .010
            [A] ρ = .015                       .023    .012
            [B] Ascertainment bias             .273    .083
            [C] Multiple causal haplotypes    −.217    .058
0           Baseline                          −.002    .013
            [A] ρ = .015                       .001    .012
            [B] Ascertainment bias             .002    .008
            [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to ρ = 0, no ascertainment bias, and one causal haplotype, $h_1$.

The bias of $\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.
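When reading Table 1, it can help to recall that mean squared error splits into squared bias plus variance. The short sketch below applies that identity to the $\beta_1 = .5$ rows of the table, reading the entries as decimals; the function name is illustrative only.

```python
def squared_bias_share(bias, mse):
    """Fraction of the MSE due to squared bias (MSE = bias**2 + variance)."""
    return bias ** 2 / mse


# (bias, MSE) pairs from the beta_1 = .5 block of Table 1.
rows = {
    "Baseline": (-0.002, 0.010),
    "[A] rho = .015": (0.023, 0.012),
    "[B] Ascertainment bias": (0.273, 0.083),
    "[C] Multiple causal haplotypes": (-0.217, 0.058),
}
for scenario, (bias, mse) in rows.items():
    print(f"{scenario}: {squared_bias_share(bias, mse):.2f}")
```

On this reading, squared bias accounts for about 90% of the MSE in scenario (B) and about 81% in scenario (C), but for almost none of it in the baseline and inbreeding rows.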

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment to the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.
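For readers less familiar with the EM treatment of missing phase mentioned above, the following is a minimal sketch for the simplest case: two biallelic SNPs, unrelated individuals, and Hardy–Weinberg equilibrium. It estimates haplotype frequencies only and is not LZ's algorithm, which also involves regression parameters; all names are illustrative.

```python
from itertools import product

# Haplotypes over two SNPs, coded by allele (0 or 1) at each position.
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Ordered haplotype index pairs whose allele counts match the genotype."""
    return [
        (i, j)
        for i, j in product(range(len(HAPS)), repeat=2)
        if tuple(a + b for a, b in zip(HAPS[i], HAPS[j])) == tuple(genotype)
    ]


def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes.

    Each genotype is a pair of allele counts (0, 1, or 2), one per SNP.
    """
    freqs = [1.0 / len(HAPS)] * len(HAPS)
    for _ in range(n_iter):
        counts = [0.0] * len(HAPS)
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each phase under HWE.
            weights = [freqs[i] * freqs[j] for i, j in pairs]
            total = sum(weights)
            for (i, j), w in zip(pairs, weights):
                counts[i] += w / total
                counts[j] += w / total
        # M-step: expected haplotype counts over 2n chromosomes.
        freqs = [c / (2 * len(genotypes)) for c in counts]
    return freqs


# The double heterozygote (1, 1) is the only phase-ambiguous genotype here.
print(em_haplotype_freqs([(2, 2), (0, 0), (1, 1)]))
```

In this toy data set the unambiguous individuals carry only haplotypes 00 and 11, so the iterations resolve the double heterozygote as 00/11.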

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.
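To make the idea of a logic-regression predictor concrete, here is a minimal sketch of a single Boolean term built from binary SNP codes. The particular expression, the 0/1 coding (for example, carrier versus noncarrier of a variant allele), and the function name are all hypothetical; in logic regression proper the expression is searched for by the fitting procedure rather than fixed in advance.

```python
def logic_feature(snps):
    """Hypothetical logic term (SNP1 AND NOT SNP2) OR SNP3 on 0/1 codes."""
    s1, s2, s3 = (bool(x) for x in snps)
    return int((s1 and not s2) or s3)


# Each row holds the binary codes of three SNPs for one subject.
subjects = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 0)]
predictor = [logic_feature(row) for row in subjects]
print(predictor)  # this column would enter the regression as one covariate
```

A term of this kind depends on the SNPs only through a Boolean combination, which is one way to express interactions that a haplotype-count parameterization does not capture directly.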

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that known biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mediating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.


Rejoinder

D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
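The sliding-window scan with permutation adjustment described above can be sketched as follows. This is only an outline under stated assumptions: `window_stat` is a placeholder for whatever per-window statistic is used (in our setting it would be the likelihood-ratio statistic, which is not implemented here), and the adjustment shown is the generic maximum-statistic permutation approach, not the Monte Carlo method of Lin (2005).

```python
import random


def sliding_windows(n_markers, width):
    """Index ranges of all contiguous marker windows of the given width."""
    return [range(start, start + width)
            for start in range(n_markers - width + 1)]


def permutation_adjusted_pvalues(phenotype, window_stat, windows,
                                 n_perm=1000, seed=0):
    """Multiplicity-adjusted p-values from the permutation distribution
    of the maximum window statistic.

    window_stat(phenotype, window) must return a statistic for which
    larger values indicate stronger association (a user-supplied placeholder).
    """
    rng = random.Random(seed)
    observed = [window_stat(phenotype, w) for w in windows]
    exceed = [0] * len(windows)
    shuffled = list(phenotype)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any phenotype-genotype link
        max_stat = max(window_stat(shuffled, w) for w in windows)
        for i, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[i] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]


print([list(w) for w in sliding_windows(5, 3)])
```

Comparing each observed statistic with the permutation distribution of the maximum over windows controls the familywise error across the overlapping, and hence correlated, windows.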

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).
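The two formulations described above correspond to two ways of coding a diplotype as regression covariates. The sketch below assumes an additive (copy-count) coding purely for illustration; section 2.1 of our article allows other genetic mechanisms, and the function names are not taken from any software.

```python
def one_versus_rest(diplotype, target):
    """Formulation 1: copies of `target` carried (0, 1, or 2);
    all other haplotypes are pooled into the comparison group."""
    return sum(1 for h in diplotype if h == target)


def reference_coding(diplotype, n_haplotypes):
    """Formulation 2: copy counts for haplotypes 0..K-2,
    with the last haplotype (index K-1) as the reference."""
    return [sum(1 for h in diplotype if h == k)
            for k in range(n_haplotypes - 1)]


# Haplotypes are indexed 0, 1, 2 (K = 3); a diplotype is a pair of indices.
print(one_versus_rest((0, 2), target=0))       # one covariate
print(reference_coding((0, 2), n_haplotypes=3))  # K - 1 covariates
```

The first coding yields a single effect parameter for the designated haplotype, whereas the second yields K − 1 parameters, one for each non-reference haplotype.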

Our article deals exclusively with studies of unrelated in-dividuals The methods developed by Allen et al (2005) forcase-parent trio studies are very clever and provide another il-lustration of the usefulness of semiparametric efficiency theoryin statistical genetics We look forward to learning about theextension of that work to the situations with covariates and tocase-control studies as well as the extension of the work ofEpstein and Satten (2003) to matched case-control studies

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862


3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and that of the assumption of independence between haplotypes and covariates. In contrast, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative-risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative-risk estimators require further investigation.

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative-risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.
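For readers less familiar with the Breslow estimator referred to above, the following is a minimal sketch of its complete-data form for a single scalar covariate. It is not the M-step of our algorithm: the function name, the argument layout, and the restriction to one fully observed covariate are assumptions made purely for illustration. In an EM setting, the risk-set sums would presumably be weighted by posterior probabilities over the compatible haplotype pairs.

```python
import math

def breslow_cumhaz(times, events, z, beta):
    """Breslow estimate of the cumulative baseline hazard.

    times  : observed follow-up times
    events : 1 if the time is an event, 0 if censored
    z      : scalar covariate per subject (illustrative simplification)
    beta   : regression coefficient, treated as known here
    Returns a list of (event time, cumulative hazard) pairs.
    """
    risk = [math.exp(beta * zi) for zi in z]
    out, cum = [], 0.0
    event_times = sorted({t for t, d in zip(times, events) if d})
    for t in event_times:
        # number of events at time t
        d_t = sum(1 for ti, d in zip(times, events) if d and ti == t)
        # sum of relative risks over subjects still at risk at t
        at_risk = sum(r for ti, r in zip(times, risk) if ti >= t)
        cum += d_t / at_risk
        out.append((t, cum))
    return out
```

With beta = 0 the estimate reduces to the Nelson–Aalen estimator, which gives a quick sanity check.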

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating; no viability and/or fertility differential of alleles; no immigration or emigration; no mutation; and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.
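To make the idea of a one-parameter extension of HWE concrete, the sketch below uses the familiar fixation-index form, in which a single parameter rho inflates the homozygote probabilities. Model (3) is not reproduced in this rejoinder, so the exact form used here is an assumption for illustration, as are the function names. The second function shows how pooling equally sized subpopulations that are each in HWE produces an apparent rho greater than 0 at a biallelic locus.

```python
def diplotype_probs(pi, rho=0.0):
    """P(H = (h_k, h_l)) = (1 - rho) * pi_k * pi_l + rho * pi_k * [k == l].

    pi  : haplotype frequencies summing to 1
    rho : disequilibrium parameter; rho = 0 gives HWE (assumed form)
    """
    K = len(pi)
    return [[(1 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
             for l in range(K)] for k in range(K)]

def pooled_rho(sub_freqs):
    """Apparent rho when equally sized subpopulations, each in HWE with
    allele frequency p in sub_freqs, are pooled (biallelic locus)."""
    m = len(sub_freqs)
    p_bar = sum(sub_freqs) / m
    hom = sum(p * p for p in sub_freqs) / m   # pooled homozygote frequency
    return (hom - p_bar ** 2) / (p_bar * (1 - p_bar))
```

For example, pooling two subpopulations with allele frequencies 0.2 and 0.8 gives a homozygote frequency of 0.34 rather than the 0.25 expected under HWE, which corresponds to rho = 0.36 in this parameterization.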

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR’s view that testing for the presence of a genetic effect is more realistic than refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single-marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single-marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.


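In the M-step, we use the one-step Newton–Raphson iteration to update the parameter estimates:

$$
\vartheta^{(k+1)} = \vartheta^{(k)} - J^{-1} \times \sum_{i=1}^{n} \left[ E\{W(B, Q_1, Q_2, Y_i, X_i) \mid Y_i, X_i, G_i\} - \frac{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i) \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}}{\sum_{b,q_1,q_2,y} \exp\{\vartheta^T W(b, q_1, q_2, y, X_i)\}} \right],
$$

where

$$
J = -\left[ \sum_{i=1}^{n} \frac{\sum_{b,q_1,q_2,y} W^{\otimes 2}(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b,q_1,q_2,y,X_i)}}{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}} \right] + \sum_{i=1}^{n} \left[ \frac{\{\sum_{b,q_1,q_2,y} W(b, q_1, q_2, y, X_i)\, e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^{\otimes 2}}{\{\sum_{b,q_1,q_2,y} e^{\vartheta^T W(b,q_1,q_2,y,X_i)}\}^{2}} \right]
$$

and $a^{\otimes 2} = aa^T$.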
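The update above is a Newton step for a log-linear model: the score is the sum of the conditional expectations of W minus their model-based means, and the matrix being inverted is minus the sum of the model-based covariance matrices of W. The following NumPy sketch illustrates that structure under stated assumptions: the configurations (b, q1, q2, y) are enumerated along one array axis, the E-step expectations are supplied as an input, and all names (`newton_step`, `W`, `EW_obs`) are hypothetical rather than taken from our software.

```python
import numpy as np

def newton_step(theta, W, EW_obs):
    """One Newton-Raphson update for p(c) proportional to exp(theta' W[c]).

    theta  : (p,) current parameter value
    W      : (n, C, p) array; W[i, c] is the statistic for subject i
             under configuration c (assumed finite enumeration)
    EW_obs : (n, p) array of E-step conditional expectations of W
    """
    eta = W @ theta                                 # (n, C) linear predictors
    eta = eta - eta.max(axis=1, keepdims=True)      # guard against overflow
    p = np.exp(eta)
    p /= p.sum(axis=1, keepdims=True)               # model probabilities
    mean = np.einsum('ic,icp->ip', p, W)            # model mean of W per subject
    second = np.einsum('ic,icp,icq->pq', p, W, W)   # summed second moments
    hess = -(second - mean.T @ mean)                # minus summed covariances
    score = (EW_obs - mean).sum(axis=0)
    return theta - np.linalg.solve(hess, score)
```

In a full EM iteration, `EW_obs` would be recomputed in the E-step before each call; repeating the step with `EW_obs` held fixed simply solves the corresponding complete-data score equation.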

A.4.6 Proof of Theorem 2. Write $\hat F_{xg}(x, g) = \hat F^\dagger_g(x)\hat q_g$ and $F_{xg}(x, g) = F^\dagger_g(x) q_g$. Because $\hat\theta$ is bounded and $\hat F_{xg}$ is a probability distribution, we can choose a subsequence such that $\hat\theta \to \theta^*$ and $\hat F_{xg}(x, g) \to F^*_{xg}(x, g) \equiv F^*_g(x) q^*_g$, where $q^*_g > 0$ for any $g$. Because $\hat F^\dagger_g$ maximizes the likelihood, there exists some Lagrange multiplier $\hat\lambda_g$ such that

$$
\frac{I(G_i = g)}{\hat F^\dagger_g\{X_i\}} - \frac{n_1 \eta(1, X_i, g; \hat\theta)\hat q_g}{\int_{x,g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)} - n\hat\lambda_g = 0,
$$

where $\hat F^\dagger_g\{X_i\}$ denotes the point mass of $\hat F^\dagger_g$ at $X_i$, and the integral is interpreted as integration over $x$ and summation over $g$. Because $\sum_{i=1}^n \hat F^\dagger_g\{X_i\} = 1$, $\hat\lambda_g$ satisfies the equation

$$
n^{-1} \sum_{i=1}^{n} I(G_i = g) \left\{ \hat\lambda_g + \frac{n_1 \eta(1, X_i, g; \hat\theta)\hat q_g}{n \int_{x,g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)} \right\}^{-1} = 1 \qquad \text{(A.13)}
$$

and

$$
\min_{1 \le i \le n} \left\{ \hat\lambda_g + \frac{n_1 \eta(1, X_i, g; \hat\theta)\hat q_g}{n \int_{x,g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g)} \right\} > 0.
$$

Clearly, $\hat\lambda_g$ must be bounded asymptotically. Thus, by choosing a subsequence, we assume that $\hat\lambda_g \to \lambda^*_g$. By (A.13) and the Lipschitz continuity of $\eta(1, x, g; \theta^*)$ in the continuous components of $x$, we can show that there exists a positive constant $\delta$ such that

$$
\min_{g,x} \left| \lambda^*_g + \frac{\varrho\, \eta(1, x, g; \theta^*) q^*_g}{\int_{x,g} \eta(1, x, g; \theta^*)\, dF^*_{xg}(x, g)} \right| > \delta.
$$

Consequently, when $n$ is sufficiently large,

$$
\hat F^\dagger_g(x) = n^{-1} \sum_{i=1}^{n} I(G_i = g, X_i \le x) \left( \max\left[ \left| \hat\lambda_g + \eta(1, X_i, g; \hat\theta)\hat q_g n_1 \times \left\{ n \int_{x,g} \eta(1, x, g; \hat\theta)\, d\hat F_{xg}(x, g) \right\}^{-1} \right|, \delta \right] \right)^{-1}.
$$

We define an empirical function $\tilde F^\dagger_g$ whose jump size at $X_i$ is proportional to

$$
n^{-1} I(G_i = g) \left( P(G = g, Y = 0) + \varrho\, \eta(1, X_i, g; \theta_0) q_g \times \left\{ \int_{x,g} \eta(1, x, g; \theta_0)\, dF_{xg}(x, g) \right\}^{-1} \right)^{-1}.
$$

Then it can be verified that $\tilde F^\dagger_g$ converges uniformly to $F^\dagger_g$. In addition, $\hat F^\dagger_g$ is absolutely continuous with respect to $\tilde F^\dagger_g$, and the Radon–Nikodym derivative $d\hat F^\dagger_g(x)/d\tilde F^\dagger_g(x)$ is bounded and converges uniformly to $dF^*_g(x)/dF^\dagger_g(x)$. Let $\tilde F_{xg}(x, g) = \tilde F^\dagger_g(x) q_g$, and let $l_n(\theta, F^\dagger_g, q_g)$ be the log-likelihood based on (8). By the definition of the MLE, $n^{-1} l_n(\hat\theta, \hat F^\dagger_g, \hat q_g) - n^{-1} l_n(\theta_0, \tilde F^\dagger_g, q_g) \ge 0$. The limit of this difference is the negative Kullback–Leibler information of the distribution for $(\theta^*, F^*_g, q^*_g)$ with respect to $(\theta_0, F^\dagger_g, q_g)$ under $P(Y = 1) = \varrho$. The identifiability conditions then yield $\theta^* = \theta_0$, $F^*_g = F^\dagger_g$, and $q^*_g = q_g$. Thus the consistency of $\hat\theta$ is established. Because $F_{xg}$ is continuous, $\sup_{x,g} |\hat F_{xg}(x, g) - F_{xg}(x, g)| \to 0$ almost surely.

The derivation of the asymptotic distribution is similar to the proof of theorem 1.2 of Murphy and van der Vaart (2001). We first obtain a score function by differentiating $l_n(\theta, F^\dagger_g, q_g)$ with respect to $\theta$ along the direction $v$ and with respect to $F_{xg}$ along the path $F_\varepsilon = F_{xg} + \varepsilon \int \psi(x, g)\, dF_{xg}$, where $v$ has a unit norm and $\psi(\cdot, g)$ is any function whose total variation is bounded by 1. The linearization of the score function around the true parameter value yields

$$
\begin{aligned}
n^{1/2} &\left\{ (v^T \Omega_{11} + \Omega_{21}[\psi]^T)(\hat\theta - \theta_0) + \int (v^T \Omega_{12} + \Omega_{22}[\psi])\, d(\hat F_{xg} - F_{xg}) \right\} \\
&= n^{-1/2} \sum_{i=1}^{n} y_i \left\{ v^T l_\theta(1, X_i, G_i; \theta_0, F_{xg}) + l_F(1, X_i, G_i; \theta_0, F_{xg})\left[ \int \psi\, dF_{xg} \right] \right\} \\
&\quad + n^{-1/2} \sum_{i=1}^{n} (1 - y_i) \left\{ v^T l_\theta(0, X_i, G_i; \theta_0, F_{xg}) + l_F(0, X_i, G_i; \theta_0, F_{xg})\left[ \int \psi\, dF_{xg} \right] \right\} + o_p(1),
\end{aligned}
$$

where $\Omega_{11}$ is a constant matrix, $\Omega_{12}$ is a vector function of $x$, $\Omega_{21}[\psi]$ and $\Omega_{22}[\psi]$ are linear operators of $\psi$, and $l_\theta$ and $l_F$ are the scores with respect to $\theta$ and $F_{xg}$. The right side of the foregoing equation converges weakly to a Gaussian process, which depends on $(y_1, y_2, \ldots)$ only through $\varrho$. We can show that the operator $B[v, \psi] \equiv (v^T \Omega_{11} + \Omega_{21}[\psi]^T, v^T \Omega_{12} + \Omega_{22}[\psi]^T)$ is invertible along the lines of Murphy and van der Vaart (2001). It then follows from theorem 3.3.1 of van der Vaart and Wellner (1996) that $n^{1/2}(\hat\theta - \theta_0, \hat F_{xg} - F_{xg})$ converges weakly to a Gaussian process.

Because the asymptotic distribution depends on $(y_1, y_2, \ldots)$ only via $\varrho$, we assume that $(y_1, y_2, \ldots)$ are independent realizations from a Bernoulli distribution with mean $\varrho$. By choosing some $\psi$ such that $B[v, \psi] = (v^T, 0)^T$ for all $v$, we see that $\hat\theta$ is an asymptotically linear estimator for $\theta_0$ with the influence function in the score space. It follows from proposition 3.3.1 of Bickel, Klaassen, Ritov, and Wellner (1993) that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ attains the semiparametric efficiency bound.


A.4.7 Proof of Theorem 3. We call the probability distribution induced by (9) the pseudoprobability law, denoted by $\tilde P_n$. Let $f(y, x, g; \theta, F_g, a_n)$ be the density function under the true probability law $P_n$. Because $a_n = o(n^{-1/2})$,

$$
\frac{dP_n}{d\tilde P_n} = \exp\left\{ a_n \sum_{i=1}^{n} \left. \frac{\partial \log f(y_i, X_i, G_i; \theta, F_g, a)}{\partial a} \right|_{a=0} + o(1) \right\} \to_{\tilde P_n} 1.
$$

Thus any weak convergence under $\tilde P_n$ also holds for $P_n$. In addition, by the arguments in the proof of Theorem 2, we can easily verify the results of Theorem 3 when the data are generated from $\tilde P_n$. Thus Theorem 3 holds when the data are generated from $P_n$.

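A.5 Cohort Studies

A.5.1 Identifiability. We show that if two sets of parameters $(\theta, \Lambda)$ and $(\tilde\theta, \tilde\Lambda)$ yield the same joint distribution, then $\theta = \tilde\theta$ and $\Lambda = \tilde\Lambda$. First, it follows from Lemma 1 that $\gamma = \tilde\gamma$. Suppose that

$$
\begin{aligned}
\sum_{H \in S(G)} &\left\{ \tilde\Lambda'(Y) e^{\tilde\beta^T Z(X,H)} \dot Q\big(\tilde\Lambda(Y) e^{\tilde\beta^T Z(X,H)}\big) \right\}^{\Delta} \times \left\{ 1 - Q\big(\tilde\Lambda(Y) e^{\tilde\beta^T Z(X,H)}\big) \right\}^{1-\Delta} P_\gamma(H) \\
&= \sum_{H \in S(G)} \left\{ \Lambda'(Y) e^{\beta^T Z(X,H)} \dot Q\big(\Lambda(Y) e^{\beta^T Z(X,H)}\big) \right\}^{\Delta} \times \left\{ 1 - Q\big(\Lambda(Y) e^{\beta^T Z(X,H)}\big) \right\}^{1-\Delta} P_\gamma(H).
\end{aligned}
$$

By choosing $\Delta = 1$ and integrating $Y$ from 0 to $\tau$ on both sides, we obtain

$$
\sum_{H \in S(G)} Q\big(\tilde\Lambda(\tau) e^{\tilde\beta^T Z(X,H)}\big) P_\gamma(H) = \sum_{H \in S(G)} Q\big(\Lambda(\tau) e^{\beta^T Z(X,H)}\big) P_\gamma(H).
$$

Because $Q(\cdot)$ is strictly increasing, the foregoing equation implies that $\tilde\Lambda(Y) e^{\tilde\beta^T Z(X,H)} = \Lambda(Y) e^{\beta^T Z(X,H)}$ for $H = (h, h)$ and $H = (h, \tilde h)$. It then follows from Condition 8 that $\tilde\beta = \beta$ and $\tilde\Lambda = \Lambda$.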

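A.5.2 Proof of Theorem 4. Our problem is the same as that of Zeng et al. (2005), except replacing the integration over random effects in that article by the sum over $H \in S(G)$. The asymptotic properties stated in the theorem follow from the identifiability shown in Section A.5.1 and the proofs of Zeng et al. (2005), provided that we can verify the following result: If there exist a vector $\mu = (\mu_\beta^T, \mu_\gamma^T)^T$ and a function $\psi(t)$ such that

$$
\mu^T l_\theta(\theta_0, \Lambda_0) + l_\Lambda(\theta_0, \Lambda_0)\left[ \int \psi\, d\Lambda_0 \right] = 0, \qquad \text{(A.14)}
$$

where $l_\theta$ is the score function for $\theta$ and $l_\Lambda[\int \psi\, d\Lambda_0]$ is the score function for $\Lambda$ along the submodel $\Lambda_0 + \varepsilon \int \psi\, d\Lambda_0$, then $\mu = 0$ and $\psi = 0$.

To prove the desired result, we write out (A.14). We then let $\Delta = 1$ and integrate $Y$ from 0 to $\tau$ to obtain

$$
\begin{aligned}
\sum_{H \in S(G)} & Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) P_\gamma(H) \times \Bigg[ \frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \Lambda_0(\tau) e^{\beta_0^T Z(X,H)} \mu_\beta^T Z(X,H)}{Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)} \\
&+ \frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H) \Bigg] = 0. \qquad \text{(A.15)}
\end{aligned}
$$

In contrast, by letting $\Delta = 0$ and $Y = \tau$ in (A.14), we have

$$
\begin{aligned}
\sum_{H \in S(G)} & \left\{ 1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \right\} P_\gamma(H) \times \Bigg[ -\frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \Lambda_0(\tau) e^{\beta_0^T Z(X,H)} \mu_\beta^T Z(X,H)}{1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)} \\
&- \frac{\dot Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big) \int_0^\tau \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{1 - Q\big(\Lambda_0(\tau) e^{\beta_0^T Z(X,H)}\big)} + \mu_\gamma^T \nabla_\gamma \log P_\gamma(H) \Bigg] = 0. \qquad \text{(A.16)}
\end{aligned}
$$

The summation of (A.15) and (A.16) entails $\mu_\gamma^T \nabla_\gamma \log P_\gamma(H) = 0$. From the proof of Lemma 1, $\mu_\gamma = 0$. We choose $G = 2h$ or $h + \tilde h$ and let $\Delta = 1$ and $Y = 0$ in (A.14) to obtain $\mu_\beta^T Z(X, H) + \psi(0) = 0$ for $H = (h, h)$ and $(h, \tilde h)$. Thus $\mu_\beta = 0$ and $\psi(0) = 0$ under Condition 8. Finally, (A.14) with $\Delta = 1$ implies that

$$
\psi(Y) + \frac{\ddot Q\big(\Lambda_0(Y) e^{\beta_0^T Z(X,H)}\big) \int_0^Y \psi(t)\, d\Lambda_0(t)\, e^{\beta_0^T Z(X,H)}}{\dot Q\big(\Lambda_0(Y) e^{\beta_0^T Z(X,H)}\big)} = 0
$$

for $H = (h, h)$. Therefore, $\psi = 0$.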

[Received September 2004. Revised February 2005.]

REFERENCES

Akaike, H. (1985), “Prediction and Entropy,” in A Celebration of Statistics, eds. A. C. Atkinson and S. E. Fienberg, New York: Springer-Verlag, pp. 1–24.

Akey, J., Jin, L., and Xiong, M. (2001), “Haplotypes vs. Single-Marker Linkage Disequilibrium Tests: What Do We Gain?” European Journal of Human Genetics, 9, 291–300.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993), Efficient and Adaptive Estimation for Semiparametric Models, Baltimore: Johns Hopkins University Press.

Botstein, D., and Risch, N. (2003), “Discovering Genotypes Underlying Human Phenotypes: Past Successes for Mendelian Disease, Future Approaches for Complex Disease,” Nature Genetics Supplement, 33, 228–237.

Breslow, N., McNeney, B., and Wellner, J. A. (2003), “Large-Sample Theory for Semiparametric Regression Models With Two-Phase, Outcome-Dependent Sampling,” The Annals of Statistics, 31, 1110–1139.

Clark, A. G. (1990), “Inference of Haplotypes From PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, 7, 111–122.

Cohen, J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.

Cox, D. R. (1972), “Regression Models and Life-Tables” (with discussion), Journal of the Royal Statistical Society, Ser. B, 34, 187–220.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood Estimation From Incomplete Data via the EM Algorithm” (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd ed.), Oxford, U.K.: Oxford University Press.

Epstein, M. P., and Satten, G. A. (2003), “Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data,” American Journal of Human Genetics, 73, 1316–1329.

Excoffier, L., and Slatkin, M. (1995), “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, 12, 921–927.

Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N. (2001), “Genetic Analysis of Case-Control Data Using Estimated Haplotype Frequencies: Application to APOE Locus Variation and Alzheimer’s Disease,” Genome Research, 11, 143–151.

Fisher, R. A. (1918), “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” Transactions of the Royal Society of Edinburgh, 52, 399–433.

Hallman, D. M., Groenemeijer, B. E., Jukema, J. W., and Boerwinkle, E. (1999), “Analysis of Lipoprotein Lipase Haplotypes Reveals Associations Not Apparent From Analysis of the Constituent Loci,” Annals of Human Genetics, 63, 499–510.

International Human Genome Sequencing Consortium (2001), “Initial Sequencing and Analysis of the Human Genome,” Nature, 409, 860–921.


International SNP Map Working Group (2001), “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, 409, 928–933.

Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M., and Schaid, D. J. (2003), “Estimation and Tests of Haplotype-Environment Interaction When the Linkage Phase Is Ambiguous,” Human Heredity, 55, 56–65.

Li, H. (2001), “A Permutation Procedure for the Haplotype Method for Identification of Disease-Predisposing Variants,” Annals of Human Genetics, 65, 180–196.

Liang, K.-Y., and Qin, J. (2000), “Regression Analysis Under Non-Standard Situations: A Pairwise Pseudolikelihood Approach,” Journal of the Royal Statistical Society, Ser. B, 62, 773–786.

Lin, D. Y. (2004), “Haplotype-Based Association Analysis in Cohort Studies of Unrelated Individuals,” Genetic Epidemiology, 26, 255–264.

Lin, D. Y. (2005), “An Efficient Monte Carlo Approach to Assessing Statistical Significance in Genomic Studies,” Bioinformatics, 21, 781–787.

McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), New York: Chapman & Hall.

Morris, R. W., and Kaplan, N. L. (2002), “On the Advantage of Haplotype Analysis in the Presence of Multiple Disease Susceptibility Alleles,” Genetic Epidemiology, 23, 221–233.

Murphy, S. A., and van der Vaart, A. W. (2000), “On Profile Likelihood,” Journal of the American Statistical Association, 95, 449–465.

Murphy, S. A., and van der Vaart, A. W. (2001), “Semiparametric Mixtures in Case-Control Studies,” Journal of Multivariate Analysis, 79, 1–32.

Niu, T., Qin, Z. S., Xu, X., and Liu, J. S. (2002), “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 70, 157–169.

Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., Nguyen, B. T. N., Norris, M. C., Sheehan, J. B., Shen, N. P., Stern, D., Stokowski, R. P., Thomas, D. J., Trulson, M. O., Vyas, K. R., Frazer, K. A., Fodor, S. P. A., and Cox, D. R. (2001), “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, 294, 1719–1723.

Pettitt, A. N. (1984), “Proportional Odds Models for Survival Data and Estimates Using Ranks,” Applied Statistics, 26, 183–214.

Prentice, R. L., and Pyke, R. (1979), “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika, 66, 403–411.

Qin, Z. S., Niu, T., and Liu, J. S. (2002), “Partition-Ligation-Expectation Maximization Algorithm for Haplotype Inference With Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, 71, 1242–1247.

Risch, N. (2000), “Searching for Genetic Determinants in the New Millennium,” Nature, 405, 847–856.

Roeder, K., Carroll, R. J., and Lindsay, B. G. (1996), “A Semiparametric Mixture Approach to Case-Control Studies With Errors in Covariables,” Journal of the American Statistical Association, 91, 722–732.

Satten, G. A., and Epstein, M. P. (2004), “Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies,” Genetic Epidemiology, 27, 192–201.

Schaid, D. J. (2004), “Evaluating Associations of Haplotypes With Traits,” Genetic Epidemiology, 27, 348–364.

Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M., and Poland, G. A. (2002), “Score Tests for Association Between Traits and Haplotypes When the Linkage Phase Is Ambiguous,” American Journal of Human Genetics, 70, 425–434.

Scott, A. J., and Wild, C. J. (1997), “Fitting Regression Models to Case-Control Data by Maximum Likelihood,” Biometrika, 84, 57–71.

Seltman, H., Roeder, K., and Devlin, B. (2003), “Evolutionary-Based Association Analysis Using Haplotype Data,” Genetic Epidemiology, 25, 48–58.

Stephens, M., Smith, N. J., and Donnelly, P. (2001), “A New Statistical Method for Haplotype Reconstruction From Population Data,” American Journal of Human Genetics, 68, 978–989.

Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003), “Modeling and E–M Estimation of Haplotype-Specific Relative Risks From Genotype Data for a Case-Control Study of Unrelated Individuals,” Human Heredity, 55, 179–190.

Valle, T., Ehnholm, C., Tuomilehto, J., Blaschak, J., Bergman, R. N., Langefeld, C. D., Ghosh, S., Watanabe, R. M., Hauser, E. R., Magnuson, V., Eriksson, J., Ally, D. S., Nylund, S. J., Hagopian, W. A., Kohtamaki, K., Ross, E., Toivanen, L., Buchanan, T. A., Vidgren, G., Collins, F., Tuomilehto-Wolf, E., and Boehnke, M. (1998), “Mapping Genes for NIDDM,” Diabetes Care, 21, 949–958.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.

van der Vaart, A. W., and Wellner, J. A. (2001), “Consistency of Semiparametric Maximum Likelihood Estimators for Two-Phase Sampling,” Canadian Journal of Statistics, 29, 269–288.

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., et al. (2001), “The Sequence of the Human Genome,” Science, 291, 1304–1351.

Wang, S., Kidd, K., and Zhao, H. (2003), “On the Use of DNA Pooling to Estimate Haplotype Frequencies,” Genetic Epidemiology, 24, 74–82.

Weir, B. S. (1996), Genetic Data Analysis II, Sunderland, MA: Sinauer Associates.

Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002), “Testing Association of Statistically Inferred Haplotypes With Discrete and Continuous Traits in Samples of Unrelated Individuals,” Human Heredity, 53, 79–91.

Zeng, D., and Lin, D. Y. (2005), “Estimating Haplotype-Disease Associations With Pooled Genotype Data,” Genetic Epidemiology, 28, 70–82.

Zeng, D., Lin, D. Y., and Lin, X. (2005), “Semiparametric Transformation Models With Random Effects for Clustered Failure Time Data,” technical report, University of North Carolina, Chapel Hill, Dept. of Biostatistics.

Zhang, S., Pakstis, A. J., Kidd, K. K., and Zhao, H. (2001), “Comparisons of Two Methods for Haplotype Reconstruction and Haplotype Frequency Estimation From Population Data,” American Journal of Human Genetics, 69, 906–912.

Zhao, L. P., Li, S. S., and Khalid, N. (2003), “A Method for the Assessment of Disease Associations With Single-Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies,” American Journal of Human Genetics, 72, 1231–1250.

Comment

Chiara SABATTI

The detailed and careful article by Lin and Zeng deals with the estimation of haplotype effects. It is perhaps useful to give a little more genetical background on the problem at hand. Through epidemiological studies (where, e.g., one compares risk of siblings or twins of affected individuals with population prevalence), we can identify that some diseases have a clear genetic component. That is, there are modifications in the DNA sequence that predispose carriers to develop the disease. These modifications have varied nature: there may be mutations, insertions, or deletions in the gene sequence that lead to the synthesis of a different protein, or these variations may take place in noncoding portions of DNA, affecting splicing patterns or expression levels. Understanding the nature of these mutations and their functional effects is of considerable importance; it leads to a clear understanding of the disease origins and suggests drug targets. This goal is achieved in a multistep process. Initially, with genome-wide investigations, one tries to identify regions that harbor susceptibility genes. These broad regions are then analyzed in more detail to lead to finer mapping and eventually to identification of the disease variants. The study of haplotypes (the collection of allele values at measured polymorphic sites on a chromosome) plays a role in both these phases. Note that typically one does not assume that any of the variants recorded in a haplotype is the disease-causing variant; generally, one simply assumes that there may be an association between the disease-causing variant and the alleles characterizing one haplotype.

Chiara Sabatti is Assistant Professor, Departments of Human Genetics and Statistics, University of California, Los Angeles, Los Angeles, CA 90095 (E-mail: csabatti@mednet.ucla.edu).

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000817

To understand the origins of such an association, recall that when a disease-causing mutation arises, it occurs on a specific genetic background (a specific selection of alleles at the surrounding polymorphic loci). As the disease mutation is passed down across generations, the portion of the genome closer to it is also transmitted, with boundaries determined by recombination events. Affected descendants are then carriers not only of the disease-specific mutation, but also of the ancestral haplotype. Under the hypothesis of one disease-causing mutation, after a number of generations have intervened since the founding event, affected individuals that may be practically considered as unrelated will be sharing an identical-by-descent genomic segment surrounding the mutation. This is reflected in the association between the disease status and haplotypes in the region of interest. This scenario implies that the effect of the mutation on fitness must be limited (through reduced penetrance, late onset, mild disease symptoms, recessive mode of inheritance, or other factors) for the mutation to be passed down across generations, originating an association effect. If current cases are in the vast majority due to new mutations, then one would not be able to detect association between disease status and haplotype. In the presence of multiple mutations with founder effect, there may be multiple “disease” haplotypes, and if the consequences of the mutations are slightly different, then different haplotypes may be associated with different risks.

The study of association of haplotypes with disease status helps in the initial localization of the region harboring the disease gene. One can scan through the genome with a window of fixed genetic length and test for association between disease status and the haplotypes formed by the markers in the window. The locations where association is detected can be considered suspicious of harboring the disease gene. The study of haplotype effects further contributes to the identification of functional variants and understanding their relevance in determining disease risks.

At the beginning of the article, the authors motivate their study with reference to association mapping, and in the conclusion they refer again to application of their methods to genome scans. In my opinion, however, the specific problem that they tackle is not so much related to mapping as to refining the haplotype effect estimates once a location of interest has been identified. This is an important and worthy goal, because (1) ranking the haplotypes in terms of their association with the disease is necessary to identify those that are most likely to include the disease susceptibility locus, and (2) estimating haplotype effects helps quantify the impact of the mutation in an epidemiological framework. It may seem that (2) should be more easily and effectively done when such a mutation is found, but going from a locus to the identification of a mutation can still require years of work, especially in complex diseases where gene–gene and gene–environment interactions are expected, so that a preliminary quantification of the effect size is important to appropriately allocate resources. The methodology described in the article shows how quantification of these effects can be obtained, paying particular attention to the identifiability issues and asymptotic efficiency.

Genome screens for association mapping of disease genes may not be the most direct application of the methodology described here. For disease mapping, case-control is definitely the preferred design, and so one needs to consider in particular the methods developed here for this purpose, coupled with the idea of studying haplotype effects for sliding marker windows. Computationally, the efficiency of the algorithm may be unsatisfactory in a case-control study that is based on hundreds of thousands of SNPs, as one can expect for genome-wide investigations. It must be emphasized that the methodology presented here leads to efficient estimates of haplotype effects. For the purpose of mapping, it is not really necessary to get a “very good” estimate of the haplotype effect, only to assess whether there is any such effect. For this reason, one may be able to take advantage of less sophisticated methods that are computationally preferable. For example, consider the association tests implemented in Mendel (Lange et al. 2001), which are quite fast and can appropriately handle missing phase information (more later). Aside from this computational issue, and perhaps more important, a methodology that estimates the effects of well-defined haplotypes will encounter difficulties due to the sparsity of the data, as the authors themselves note. Disease haplotypes are eroded by the effects of recombination (which can occur in many different locations in the sampled haplotypes) and mutation. Models that allow one to combine information across haplotypes, grouping “similar” ones, are expected to be more powerful. This idea is at the basis of methodologies described by McPeek and Strahs (1999), Service, Temple Lang, Freimer, and Sandkuijl (1999), Lam, Roeder, and Devlin (2000), Morris, Whittaker, and Balding (2000), Liu, Sabatti, Teng, Keats, and Risch (2001), and Molitor, Marjoram, and Thomas (2003), to cite just a few. Finally, when contemplating the idea of a genome-wide series of association tests, one must put in place some control for multiple comparisons. Although there have been some contributions in this direction, the problem is not yet solved (Sabatti, Service, and Freimer 2003; Lin 2005), and the methodology presented here does not appear to be easily amenable to permutation procedures.

Haplotypes are not directly observable. Polymorphism scoring technologies lead to multilocus genotypes, as the authors explain in their introduction. Information on which chromosome is carrying which observed alleles constituting a genotype (phase) can be inferred using family information when available, or by assuming linkage disequilibrium and using EM (Excoffier and Slatkin 1995), resorting to a Bayesian approach (as in Niu et al. 2002; Stephens et al. 2001), or exploiting the existence of block structure (as in Halperin and Eskin 2004). The authors note the importance of not considering such inferred haplotypes as true ones when making inference on their effects, suggesting simultaneously imputing phase and estimating haplotype effects. This is an important point. Unfortunately, in much of the genetics literature, inferred haplotypes are considered “true,” with varying impacts on the results of analysis. For example, it is common practice to reconstruct haplotypes with EM and then measure the amount of LD on the reconstructed haplotypes. Now, given that EM algorithms operate under the assumption of LD, it may well be that the resulting measures are biased upward. But Lin and Zeng are not the first to note the arbitrariness of this practice or the first to propose mapping methodologies that deal directly with nonphased data. For example, the mentioned association tests coded in Mendel can be run on multilocus genotypes (Lange et al. 2001; Lazzeroni and Lange 1997), and methods for fine mapping via reconstruction of ancestral haplotypes (as in Liu et al. 2001) take as input unphased data and reconstruct haplotypes in the process. Lu, Niu, and Liu (2003) have compared the results obtained by sequentially reconstructing haplotypes and applying association mapping procedures with results derived by joint inference on the data. In the present contribution, Lin and Zeng suggest using an EM algorithm to deal with missing phase information in their likelihood. It is well known that the performance of EM is unsatisfactory with a large number of markers. However, given that the considered analysis is most interesting when applied to rather short haplotypes, this should not represent a serious limitation.
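The role of EM in handling missing phase can be illustrated with a minimal Python sketch of the classical EM iteration for haplotype frequencies from unphased genotypes under Hardy–Weinberg equilibrium, in the spirit of Excoffier and Slatkin (1995). This is not Lin and Zeng's algorithm, which also updates regression parameters; the function names, the 0/1/2 genotype coding, and the toy data are assumptions made purely for illustration.

```python
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) consistent with an unphased genotype.

    genotype[j] is the number of copies (0, 1, or 2) of allele 1 at SNP j.
    """
    haps = list(product((0, 1), repeat=len(genotype)))
    return [(h1, h2) for h1 in haps for h2 in haps
            if all(a + b == g for a, b, g in zip(h1, h2, genotype))]


def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies under HWE (illustrative sketch)."""
    haps = list(product((0, 1), repeat=len(genotypes[0])))
    freqs = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior probability of each phase given current freqs
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes
        freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freqs


# Toy data: homozygotes identify haplotypes 00 and 11; the two double
# heterozygotes (1, 1) are phase-ambiguous (00/11 or 01/10).
genotypes = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
freqs = em_haplotype_freqs(genotypes)
```

In this toy data set the homozygotes pull the ambiguous double heterozygotes toward the 00/11 phase, which also illustrates why LD measured on EM-reconstructed haplotypes can be inflated.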

One interesting aspect of the article is that, through their regression model, the authors deal simultaneously with quantitative and qualitative traits. In my comments I have referred mainly to disease traits, which are typically considered qualitative. However, quantitative traits are also the object of much attention, and indeed the distinction between the two types is often arbitrary (consider, e.g., the threshold models). In practice, the mapping of qualitative and quantitative traits is interwoven; faced with the difficulties in identifying susceptibility loci for complex traits, geneticists are increasingly turning their attention to quantitative endophenotypes measured on a variety of scales (e.g., brain morphology, gene expression values, enzyme concentrations, personality questionnaires) (see Freimer and Sabatti 2003 for a discussion). The possibility of using the same framework to study quantitative and qualitative traits is definitely an advantage.

Another aspect that makes this article very relevant is that the authors consider a variety of designs. The gene mapping community has been very much focused on family-based or case-control designs. But when a locus is identified and one needs to determine its effect in a population and in its interaction with environmental variables, then other designs, such as cross-sectional and cohort studies, may be more relevant. Also, recent years have seen increased interest in cohort-based genetics studies, exploiting cohorts that have been collected and analyzed with respect to a variety of disease/health states/questionnaires. Genotypes of individuals in these cohorts could be used to assess possible associations between loci and a number of phenotypes of interest, translating the costs incurred by genotyping such large and unspecific samples into savings.

The results obtained in the article concern the identifiability of association parameters and the asymptotic behavior of their estimates in the presence of covariate information. Indeed, it is the consideration of covariates that substantially complicates the statistical analysis. In the context of estimating the population effects of haplotypes associated with increased disease risk or with any phenotype of interest, it is particularly important to consider covariate information. Gene–environment interactions are known to be important in complex diseases, and the authors should be congratulated for their thorough treatment of the subject.

Although haplotype effects are an important step toward identifying the genetic variation underlying the traits under study, ultimately one would like to isolate the polymorphism that is directly responsible for the phenotype. As mentioned earlier, this may or may not be scored in the haplotype considered. When the haplotypes are obtained by genotyping any variation in an identified genomic region, one can reasonably assume that the causative polymorphism is scored. One approach of interest, then, is to try to single out this polymorphism among all of the genotyped ones. For this purpose, regression models similar to the one analyzed here have been proposed (see, e.g., Cordell and Clayton 2002). Often the effects of SNPs are considered one at a time, so that phase information is not always as relevant. However, one would expect that if the causal SNP is not included in the typed set, or if causality is achieved by more than one mutation, then shorter haplotypes may be important explanatory variables. It would be interesting to explore the implications of the results presented here for this problem.

ADDITIONAL REFERENCES

Cordell, H., and Clayton, D. (2002), “A Unified Stepwise Regression Procedure for Evaluating the Relative Effects of Polymorphisms Within a Gene Using Case/Control or Family Data: Application to HLA in Type 1 Diabetes,” American Journal of Human Genetics, 70, 124–141.

Freimer, N., and Sabatti, C. (2003), “The Human Phenome Project,” Nature Genetics, 34, 15–21.

Halperin, E., and Eskin, E. (2004), “Haplotype Reconstruction From Genotype Data Using Imperfect Phylogeny,” Bioinformatics, 20, 1842–1849.

Lam, J., Roeder, K., and Devlin, B. (2000), “Haplotype Fine Mapping by Evolutionary Trees,” American Journal of Human Genetics, 66, 659–673.

Lange, K., Cantor, R., Horvath, S., Perola, M., Sabatti, C., Sinsheimer, J., and Sobel, E. (2001), “Mendel Version 4.0: A Complete Package for the Exact Genetic Analysis of Discrete Traits in Pedigree and Population Data Sets,” American Journal of Human Genetics, 69 (suppl.), A1886.

Lazzeroni, L., and Lange, K. (1997), “Markov Chains for Monte Carlo Tests of Genetic Equilibrium in Multidimensional Contingency Tables,” The Annals of Statistics, 25, 138–168.

Liu, J., Sabatti, C., Teng, J., Keats, B., and Risch, N. (2001), “Bayesian Analysis of Haplotypes for Linkage Disequilibrium Mapping,” Genome Research, 11, 1716–1724.

Lu, X., Niu, T., and Liu, J. (2003), “Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms,” Genome Research, 13, 2112–2117.

McPeek, M., and Strahs, A. (1999), “Assessment of Linkage Disequilibrium by the Decay of Haplotype Sharing, With Application to Fine-Scale Genetic Mapping,” American Journal of Human Genetics, 65, 858–875.

Molitor, J., Marjoram, P., and Thomas, D. (2003), “Fine-Scale Mapping of Disease Genes With Multiple Mutations via Spatial Clustering Techniques,” American Journal of Human Genetics, 73, 1368–1384.

Morris, A. P., Whittaker, J. C., and Balding, D. J. (2000), “Bayesian Fine-Scale Mapping of Disease Loci by Hidden Markov Models,” American Journal of Human Genetics, 67, 155–169.

Sabatti, C., Service, S., and Freimer, N. (2003), “False Discovery Rates in Linkage and Association Linkage Genome Screens for Complex Disorders,” Genetics, 164, 829–833.

Service, S., Temple Lang, D., Freimer, N., and Sandkuijl, L. (1999), “Linkage-Disequilibrium Mapping of Disease Genes by Reconstruction of Ancestral Haplotypes in Founder Populations,” American Journal of Human Genetics, 64, 1728–1738.


Comment

Glen A. SATTEN, Andrew S. ALLEN, and Michael P. EPSTEIN

We congratulate Lin and Zeng for their ambitious and thorough treatment of statistical procedures for haplotype analysis of complex genetic traits. As the authors note, the development of haplotype models is complicated by many factors, including missing data arising from haplotype ambiguity in unphased genotype data and (potentially) high-dimensional nuisance parameters arising from the modeling of covariates. These complications illustrate the many challenging statistical problems in genetic epidemiology that warrant the attention of the wider statistical community.

Our interest has been primarily in case-control studies, and this interest colors our subsequent comments. Several methods have been proposed for modeling haplotype–disease association (e.g., Zhao et al. 2003; Stram et al. 2003; Epstein and Satten 2003; Lake et al. 2003). Of these, only the prospective approaches of Zhao et al. and Lake et al. explicitly incorporate covariates. In the absence of covariates, Satten and Epstein (2004) showed that there could be a remarkable difference in efficiency between prospective and retrospective approaches. This difference seems remarkable in light of the classic result of Prentice and Pyke (1979) on the equivalence of retrospective and prospective analyses of case-control data. In fact, Prentice and Pyke showed that the retrospective likelihood for case-control data was proportional to the prospective likelihood times a factor related to the distribution of exposures, so that if a saturated (nonparametric) model for the exposure distribution were used, then inference based on prospective and retrospective likelihoods would be equivalent. In the haplotype problem, a saturated model for the distribution of haplotypes would not be identifiable given only genotype data; for this reason, the result of Prentice and Pyke does not apply. Carroll, Wang, and Wang (1995) gave some guidance, indicating that a retrospective analysis will always be at least as efficient as a prospective analysis, and situations in which the retrospective analysis will be more efficient correspond to cases where the distribution of exposures is restricted. Haplotype models such as those in Lin and Zeng's (3) and (4) correspond to such restrictions.

The situation is considerably more complicated when covariates are included in a model. Here it would appear that the retrospective likelihood is inconvenient, because it requires us to specify the distribution of covariates given disease status, a potentially infinite-dimensional nuisance parameter. Fortunately, several authors, including Lin and Zeng, have shown how to avoid this inconvenience. We feel that these are exciting and important results. The difficulty of modeling the effects of covariate and haplotype–covariate interactions using the retrospective likelihood depends substantially on the presumed relationship between covariates and haplotypes in the population. The simplest model is to assume that haplotypes and covariates are independent at the population level. This model is biologically plausible if population stratification (confounding) is absent and if certain haplotypes or genotypes do not cause the exposure to occur. Although it is not hard to imagine behavioral covariates having genetic influence, the assumption of haplotype–covariate independence is reasonable in many cases.

Glen A. Satten is Mathematical Statistician, Division of Laboratory Science, Centers for Disease Control and Prevention, Atlanta, GA 30341 (E-mail: GSatten@cdc.gov). Andrew S. Allen is Assistant Professor, Department of Biostatistics and Bioinformatics and Duke Clinical Research Institute, Duke University, Durham, NC 27710. Michael P. Epstein is Assistant Professor, Department of Human Genetics, Emory University, Atlanta, GA 30322.

If haplotypes and covariates are associated at the population level, then the situation is much more complicated. If for each value of covariates X we can assume Hardy–Weinberg equilibrium, then Lin and Zeng's model 3 applies marginally. However, conditionally on covariates X, we would expect a different set of haplotype frequencies for each covariate value. This is a modeling nightmare. Lin and Zeng have avoided this, however, but at a different cost: they make the assumption that haplotypes and covariates are conditionally independent given genotypes. This assumption is used when defining $m_g(y, x; \theta)$, where the integration is over the distribution of marginal haplotypes corresponding to genotypes g, rather than the distribution of haplotypes conditional on X. With this assumption, specifying the haplotype frequencies at each covariate value is unnecessary. However, this assumption is unlikely to hold when X and G are associated, unless genotypes specify haplotypes completely (in which case there are no missing data). Thus we come to a matter of style: Is it better to make a mathematically convenient assumption that leads to more flexible models that may not themselves be biologically plausible, or to stick with a simpler model that may not describe the data as well but that does have a chance of being correct? In fact, the assumption made by Lin and Zeng includes the simpler model as a special case, so that the sole cost of the assumption is an increase in complexity. How does this play out in the case-control setting? If we make the rare disease assumption (corresponding to the only situation where a case-control study makes sense) and assume haplotype–covariate independence at the population level, then it is possible to show that the retrospective likelihood factors into one part involving only the parameters of interest and a second part involving only the nuisance distribution of exposures in the population. This factorization is analogous to the classic result of Prentice and Pyke (1979) that enables prospective analysis of a case-control study even though data are gathered retrospectively. As a final comment on this issue, we note that Lin and Zeng's simulations assume that X and H are independent at the population level, not just conditional on G.

Lin and Zeng give several results on the joint identifiability of parameters governing relative risk and parameters governing the distribution of haplotypes. We applaud these results, even as we note some limitations. First, these results appear limited to situations in which the model allows only for the effect of a single nonnull haplotype, effectively comparing this “target” haplotype with all others. Tests based on such a procedure can perform very poorly in cases where more than one haplotype affects disease risk. We feel that the modeling approach is especially important in just this case, because hypothesis tests for single haplotype effects are tests of a composite null hypothesis; such tests require the capacity to estimate relative risk parameters for those haplotypes not constrained by the null hypothesis. But estimation requires modeling the effects of multiple haplotypes, which appears to go beyond the identifiability results presented by Lin and Zeng. When interest is limited to models of a single haplotype, tests can be constructed without much difficulty that are valid regardless of the distribution of haplotypes (see, e.g., Schaid et al. 2002; Zaykin et al. 2002). Thus the identifiability results of Lin and Zeng are most important in situations where one wishes to estimate the effect of a single haplotype (relative to all of the others). Can these results actually be used to analyze real data? The answer to this question is less clear, because the haplotype distribution parameters may be only weakly identified in finite samples, especially when the true parameters are close to the null hypothesis or where the true risk model is close to dominant (a fact that will not always be known a priori). Along these lines, we note that in their analyses of the FUSION and simulated data, Lin and Zeng use the stronger assumption that model 3 is correct.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000826

Finally, some quibbles. Lin and Zeng claim to describe analytical methods for “all commonly used study designs.” In fact, genetic epidemiologists often use family-based association studies, such as case-parent trio studies, that are not covered by Lin and Zeng's article. Recently we have developed methods for fitting haplotype risk models using case-parent trio data that are robust to misspecification of the parental haplotype distribution (Allen, Satten, and Tsiatis 2005). We have extended our approach to include haplotype–covariate interactions, where the robustness to misspecification of the parental haplotype distribution enables a general dependence of haplotype frequencies on covariates. These methods are based on the efficient score function; we are currently studying the application of our approach to case-control studies. In particular, it appears possible to remove any dependence on the distribution of H given G in models with no covariates; given the assumption that Lin and Zeng were forced to make, this will be of particular interest should these methods extend to models that include haplotype–covariate interactions in case-control studies. Another design also not considered by Lin and Zeng corresponds to conditional logistic regression of finely stratified data. Here we note that the retrospective approach of Epstein and Satten (2003) can be used with highly stratified data, because the intercept parameter is conditioned out; as a result, we can use this approach when we have a large number of intercept parameters. We have also developed an extension of the Epstein and Satten approach that includes covariate effects in addition to haplotype effects (and their interactions) for matched or highly stratified studies.

In summary, we congratulate Lin and Zeng on an interesting and stimulating article.

ADDITIONAL REFERENCES

Allen, A. S., Satten, G. A., and Tsiatis, A. A. (2005), “Locally-Efficient Robust Estimation of Haplotype–Disease Association in Family-Based Studies,” Biometrika, 92, 559–571.

Carroll, R. J., Wang, S., and Wang, C. Y. (1995), “Prospective Analysis of Logistic Case-Control Studies,” Journal of the American Statistical Association, 90, 157–169.

Comment

Nilanjan CHATTERJEE, Christine SPINKA, Jinbo CHEN, and Raymond J. CARROLL

Lin and Zeng are to be congratulated on an article that describes identifiability and estimation of haplotype distributions and risk parameters for very general models, both prospectively and for case-control studies. In particular, the identifiability conditions will give important guidance to researchers as they attempt to use different models for haplotypes besides Hardy–Weinberg equilibrium (HWE).

Our major aim in this comment is to place Lin and Zeng's article in the broader context of various alternative methods for haplotype-based regression analysis. We point out the connections and the differences between these alternative methods to shed light on their relative merits. In particular, we note that in some important subproblems, other methods are available. These methods are efficient and simple to implement, and they avoid the need to estimate possibly high-dimensional nuisance parameters.

Nilanjan Chatterjee is Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chattern@mail.nih.gov). Christine Spinka is Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211-6100 (E-mail: spinkac@missouri.edu). Jinbo Chen is Research Fellow in the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20852 (E-mail: chenjin@mail.nih.gov). Raymond J. Carroll is Distinguished Professor of Statistics, Nutrition and Toxicology, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: carroll@stat.tamu.edu). Carroll's research was supported by a grant from the National Cancer Institute (CA-57030) and by the Texas A&M Center for Environmental and Rural Health through a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

1. CASE–CONTROL STUDIES

Because haplotype-based association studies are becoming increasingly popular, a number of researchers have developed methods for logistic regression analysis of case-control studies in the presence of phase ambiguity. The methods can be broadly classified into two categories: prospective and retrospective. Before going into technical details, it is useful to understand the main principles behind these two classes of methods.

In the Public Domain. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000835


With a slightly different notation than that of Lin and Zeng, let $D$ be disease status, $H^d$ be the “diplotypes” (i.e., the two haplotypes an individual carries in his or her pair of homologous chromosomes), $G$ be the observed genotype, and $X$ be the nongenetic/environmental covariates. Let the risk function satisfy

$$\operatorname{logit}\,\mathrm{pr}(D = 1 \mid H^d, X) = \beta_0 + m(H^d, X; \beta_1), \tag{1}$$

where $m(\cdot)$ is known but of completely general form.

Under the foregoing notation, the prospective likelihood of the data is given by $\mathrm{pr}(D \mid G, X)$, which ignores the fact that under the case-control sampling design, data are observed on $(G, X)$ conditional on $D$. In contrast, the retrospective likelihood of the data is given by $\mathrm{pr}(G, X \mid D)$ and accounts for the underlying case-control sampling design.

When there are no missing data (i.e., $G = H^d$), it follows from the well-known results of Prentice and Pyke (1979) that the prospective approach is actually equivalent to the retrospective maximum likelihood analysis, provided that the distribution of the covariates $(G, X)$ is treated completely nonparametrically. Thus the prospective method is a “robust approach” for analysis of case-control studies that does not rely on any assumption about the covariate distribution. In studies of genetic epidemiology, however, it often may be reasonable to assume certain parametric or semiparametric models for the covariate distribution in the underlying source population. The assumptions of HWE and gene–environment independence are examples of such models. The retrospective likelihood can directly incorporate such assumptions into the analysis and can be much more efficient than the prospective method when the assumptions are valid (Epstein and Satten 2003; Chatterjee and Carroll 2005).

1.1 Retrospective Maximum Likelihood Analysis With Haplotype-Phase Ambiguity

Epstein and Satten (2003) first described the retrospective maximum likelihood method for haplotype-based association analysis of case-control studies. Incorporation of nongenetic covariates $X$ in this method is complicated by the fact that the retrospective likelihood involves potentially high-dimensional nuisance parameters that specify the distribution of $X$ in the underlying population. In the gene–environment interaction context, and as in the simulation study and example of Lin and Zeng, it is often reasonable to assume that $H^d$ and environmental factors $X$ are independent in the population, with a parametric form

$$\mathrm{pr}(H^d = h^d \mid X) = \mathrm{pr}(H^d = h^d) = q(h^d; \theta), \tag{2}$$

where the model $q(h^d; \theta)$, in turn, could be specified according to HWE or some of its extensions as considered by Lin and Zeng. More generally, one can assume a parametric model for the diplotype distribution of the form

$$\mathrm{pr}(H^d = h^d \mid X = x) = q(h^d, x; \theta). \tag{3}$$

Model (3), for example, can incorporate departure from gene–environment independence and HWE that may be caused by “population stratification.” In particular, one could assume HWE and gene–environment independence conditional on various demographic factors, such as ethnicity and geographic regions, and specify the haplotype frequencies conditional on these factors according to a parametric model, such as the polytomous logistic regression model (Spinka, Carroll, and Chatterjee 2005). Moreover, (3) potentially can be used to directly model the association between haplotypes and environmental exposure $X$.

Under models (2) and (3), Spinka et al. (2005) described simple and easily computable methods that avoid estimating the nonparametric marginal distribution of $X$ and exploit the information available in (2) or (3) to increase efficiency. Chatterjee, Kalaylioglu, and Carroll (2005) described similarly simple methods applicable for family-based or other types of individually matched case-control studies. Let there be $n_1$ cases and $n_0$ controls, and let $\pi = \mathrm{pr}(D = 1)$ be the marginal probability of the disease in the population. Assume the definitions

$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\pi/(1 - \pi)\}$$

and

$$S(d, h, x) = \frac{q(h, x; \theta)\exp[d\{\kappa + m(h, x; \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x; \beta_1)\}},$$

where $\Omega = (\beta_0, \kappa, \theta^T, \beta_1^T)^T$. Let $\mathcal{H}_G$ be the set of diplotypes consistent with the observed genotype $G$. Define

$$L^*(D, G, X) = \frac{\sum_{h \in \mathcal{H}_G} S(D, h, X)}{\sum_{h}\sum_{d} S(d, h, X)}.$$

Spinka et al. (2005) first showed that, under certain conditions which are easily verifiable from the data, all of the parameters in $\Omega$, including the intercept parameter $\beta_0$, are identifiable from the retrospective likelihood $\prod_i \mathrm{pr}(G_i, X_i \mid D_i)$, as long as the underlying models are specified in such a way that $\Omega$ would be identifiable from prospective studies. Moreover, the maximum retrospective likelihood estimate of $\Omega$ can be obtained as a solution of the score equation corresponding to the pseudolikelihood

$$l^* = \sum_{i=1}^{N} \log L^*(D_i, G_i, X_i). \tag{4}$$

Spinka et al. described strategies for estimating the regression parameter $\beta_1$ based on $l^*$ for both known and unknown values of the marginal probability of the disease in the underlying population. If one is also willing to make the rare disease assumption for all $H^d$ and $X$, then $l^*$ effectively becomes equivalent to the method that Lin and Zeng derived in their Section A.4.5 under the assumption of gene–environment independence. Note, however, that neither the rare disease approximation nor the gene–environment independence assumption is necessary to derive the simple pseudolikelihood $l^*$.
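A minimal Python sketch of how one contribution $L^*(D, G, X)$ could be evaluated is given next. It assumes model (2) with HWE for $q(h; \theta)$ and a hypothetical form for $m(h, x; \beta_1)$ (additive effect of a target haplotype, a covariate main effect, and their interaction); all function names and numerical values are illustrative assumptions, not the implementation of Spinka et al.

```python
import math
from itertools import product


def m_effect(pair, x, beta1, target):
    """Hypothetical m(h, x; beta1): target-haplotype copies, covariate, interaction."""
    copies = sum(h == target for h in pair)
    return beta1[0] * copies + beta1[1] * x + beta1[2] * copies * x


def s_term(d, pair, x, beta0, kappa, beta1, freqs, target):
    """S(d, h, x) with q(h; theta) from HWE and independence of X (model (2))."""
    q = freqs[pair[0]] * freqs[pair[1]]
    m = m_effect(pair, x, beta1, target)
    return q * math.exp(d * (kappa + m)) / (1.0 + math.exp(beta0 + m))


def pseudo_likelihood(d, genotype, x, beta0, kappa, beta1, freqs, target):
    """L*(D, G, X): S summed over diplotypes consistent with G, over the sum for all (d, h)."""
    pairs = [(h1, h2) for h1 in freqs for h2 in freqs]
    consistent = [p for p in pairs
                  if all(a + b == g for a, b, g in zip(p[0], p[1], genotype))]
    args = (x, beta0, kappa, beta1, freqs, target)
    numerator = sum(s_term(d, p, *args) for p in consistent)
    denominator = sum(s_term(dd, p, *args) for p in pairs for dd in (0, 1))
    return numerator / denominator


# Made-up values: two SNPs, 500 cases, 500 controls, disease prevalence 5%.
freqs = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
beta0, beta1, target = -3.0, (0.5, 0.2, 0.1), (1, 1)
n1, n0, prevalence = 500, 500, 0.05
kappa = beta0 + math.log(n1 / n0) - math.log(prevalence / (1 - prevalence))

# Contribution of a case (D = 1) who is heterozygous at both SNPs, with X = 0.7.
value = pseudo_likelihood(1, (1, 1), 0.7, beta0, kappa, beta1, freqs, target)

# For fixed x, L* sums to 1 over all disease states and unphased genotypes.
all_genotypes = list(product((0, 1, 2), repeat=2))
total = sum(pseudo_likelihood(d, g, 0.7, beta0, kappa, beta1, freqs, target)
            for d in (0, 1) for g in all_genotypes)
```

Summing the logarithm of such contributions over subjects gives $l^*$ in (4); no estimate of the distribution of $X$ is needed at any point.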

An alternative representation of $l^*$ is very revealing. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme, where the selection probability for a subject given his or her disease status $D = d$ is proportional to $\mu_d = N_d/\mathrm{pr}(D = d)$. Let $R = 1$ indicate that a subject is selected in the case-control sample under the foregoing Bernoulli sampling scheme. With some algebra, one can now show that the pseudolikelihood $l^*$ can be expressed in the form

$$l^* = \sum_{i=1}^{N} \log \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}(D_i \mid H_i^d = h^d, X_i, R_i = 1)\,\mathrm{pr}(H_i^d = h^d \mid X_i, R_i = 1) = \sum_{i=1}^{N} \log \mathrm{pr}(D_i, G_i \mid X_i, R_i = 1). \tag{5}$$

When no environmental factors are involved, Stram et al. (2003) proposed an analysis of haplotype-based case-control studies using an “ascertainment-corrected joint likelihood” of the form $\prod_i \mathrm{pr}(D_i, G_i \mid R_i = 1)$. The representation of $l^*$ given in (5) suggests that under model (2) or (3), with the distribution $F(x)$ of $X$ treated completely nonparametrically, the efficient retrospective maximum likelihood estimate of the haplotype frequency and the regression parameters can be obtained by conditioning on $X$ in the approach of Stram et al. (2003).
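The algebra behind this representation can be checked numerically. The sketch below, in Python with made-up values, compares the normalized $S(d, h, x)$ with the joint distribution of $(D, H^d)$ given $X = x$ and $R = 1$ under Bernoulli sampling with selection probabilities proportional to $n_d/\mathrm{pr}(D = d)$; the function name and inputs are hypothetical, and $\pi$ is treated as a given constant. Summing either quantity over the diplotypes consistent with a genotype then yields $L^*$.

```python
import math


def sampled_joint_two_ways(beta0, prevalence, n1, n0, q, m):
    """Return pr(D=d, H=h | x, R=1) computed via S(d, h, x) and via Bernoulli sampling.

    q: dict diplotype -> population probability q(h, x; theta) at a fixed x
    m: dict diplotype -> m(h, x; beta1) at the same x
    """
    kappa = beta0 + math.log(n1 / n0) - math.log(prevalence / (1 - prevalence))

    # Route 1: normalize S(d, h, x) over all (d, h).
    s = {(d, h): q[h] * math.exp(d * (kappa + m[h])) / (1 + math.exp(beta0 + m[h]))
         for d in (0, 1) for h in q}
    s_total = sum(s.values())
    via_s = {k: v / s_total for k, v in s.items()}

    # Route 2: weight the population joint pr(D=d | h, x) q(h) by mu_d, then normalize.
    mu = {1: n1 / prevalence, 0: n0 / (1 - prevalence)}
    joint = {}
    for h in q:
        p1 = 1 / (1 + math.exp(-(beta0 + m[h])))  # logistic model (1)
        joint[(1, h)] = mu[1] * p1 * q[h]
        joint[(0, h)] = mu[0] * (1 - p1) * q[h]
    j_total = sum(joint.values())
    via_sampling = {k: v / j_total for k, v in joint.items()}
    return via_s, via_sampling


# Made-up inputs: three diplotypes at a fixed covariate value.
q = {"a/a": 0.49, "a/b": 0.42, "b/b": 0.09}
m = {"a/a": 0.0, "a/b": 0.4, "b/b": 0.8}
via_s, via_sampling = sampled_joint_two_ways(-3.0, 0.05, 500, 500, q, m)
max_gap = max(abs(via_s[k] - via_sampling[k]) for k in via_s)
```

The two routes agree to rounding error for any choice of inputs, since $\mu_1/\mu_0 = (n_1/n_0)(1 - \pi)/\pi$ is exactly the factor that turns $\beta_0$ into $\kappa$.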

In most parts of their article, Lin and Zeng considered retrospective maximum likelihood estimation under the model that assumes $H^d$ and $X$ are independent given $G$ and then allows the distribution of $[X \mid G]$ to be completely nonparametric. This model has advantages and disadvantages. It is more flexible than the model (2) that assumes $H^d$ and $X$ are independent unconditionally; however, unlike model (3), it cannot allow direct association between haplotypes and environmental/demographic factors. Computationally, retrospective maximum likelihood assuming model (2) or model (3) completely avoids estimation of the distribution of the possibly high-dimensional covariates $X$. In contrast, under the model considered by Lin and Zeng, one must estimate the nonparametric distribution of $X$ for each different genotype $G$, possibly stratified by subpopulations, which is a potentially daunting task. Finally, in situations where the gene–environment independence assumption is likely to be valid, either in the entire population or within subpopulations, based on results of Chatterjee and Carroll (2005) and Spinka et al. (2005), we conjecture that the retrospective maximum likelihood method assuming model (2) or model (3) can be much more efficient than that assuming the model of Lin and Zeng.

1.2 Prospective Methods for Retrospective Data

Lake et al. (2003) described methods for haplotype-based regression analysis based on the prospective likelihood of the data $(D, G, X)$, ignoring the true case-control sampling design. For fixed values of the haplotype-frequency parameter $\theta$, the score equation for the regression parameters $\beta^* = (\kappa, \beta_1)$ under model (2) corresponding to the prospective likelihood of the data is given by

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, q(h^d; \theta) \Biggr)^{-1}. \tag{6}$$

Unfortunately, this purely prospective score equation is biased under the case-control sampling design, even if the true haplotype frequencies were known and the underlying HWE and gene–environment independence assumptions were valid. However, a simple modification of the prospective score equation is unbiased:

$$0 = \sum_{i=1}^{N} \sum_{h^d \in \mathcal{H}_{G_i}} \frac{\partial}{\partial \beta^*} \log \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i) \times \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \times \Biggl( \sum_{h^d \in \mathcal{H}_{G_i}} \mathrm{pr}_{\beta^*}(D_i \mid h^d, X_i)\, r(h^d, X_i)\, q(h^d; \theta) \Biggr)^{-1}, \tag{7}$$

where

$$r(h^d, x) = \frac{1 + \exp\{\kappa + m(h^d, x; \beta_1)\}}{1 + \exp\{\beta_0 + m(h^d, x; \beta_1)\}}.$$

Spinka et al. showed that, with an appropriate rare disease approximation, the modified prospective estimating equation (7) is equivalent to the approximate estimating equation approach proposed by Zhao et al. (2003). Spinka et al. described strategies for estimating $\beta_1$ and $\kappa$ based on the modified prospective estimating equation (7), where the nuisance parameters $\theta$ and possibly $\beta_0$ are estimated based on the score equation derived from the pseudolikelihood $l^*$. Simulation studies show that such a prospective approach generally tends to be much more robust to violation of both HWE and the gene–environment independence assumption compared with the retrospective maximum likelihood method (see also Satten and Epstein 2004).
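One way to see what the factor $r(h^d, x)$ does is to note that, if $\mathrm{pr}_{\beta^*}$ is read as the logistic model (1) with $\beta_0$ replaced by $\kappa$ (our reading of the definition $\beta^* = (\kappa, \beta_1)$, stated here as an assumption), then $\mathrm{pr}_{\beta^*}(d \mid h, x)\, r(h, x)\, q(h; \theta) = S(d, h, x)$ under model (2). The diplotype weights in (7) are therefore proportional to $S$, whereas those in (6) are not. The Python sketch below checks this identity numerically; the function names and all values are made up for illustration.

```python
import math


def prospective_prob(d, kappa, m):
    """pr_{beta*}(D = d | h, x): logistic model with intercept kappa (assumed reading)."""
    return math.exp(d * (kappa + m)) / (1.0 + math.exp(kappa + m))


def r_weight(beta0, kappa, m):
    """r(h, x) = [1 + exp(kappa + m)] / [1 + exp(beta0 + m)]."""
    return (1.0 + math.exp(kappa + m)) / (1.0 + math.exp(beta0 + m))


def s_term(d, q, beta0, kappa, m):
    """S(d, h, x) for a diplotype with probability q and risk score m."""
    return q * math.exp(d * (kappa + m)) / (1.0 + math.exp(beta0 + m))


# Made-up values: diplotype label -> (q(h; theta), m(h, x; beta1)) at a fixed x.
beta0, kappa = -3.0, -0.06
diplotypes = {"a/a": (0.49, 0.0), "a/b": (0.42, 0.5), "b/b": (0.09, 1.0)}

max_gap = max(
    abs(prospective_prob(d, kappa, m) * r_weight(beta0, kappa, m) * q
        - s_term(d, q, beta0, kappa, m))
    for d in (0, 1)
    for q, m in diplotypes.values()
)
```

Under this reading, the modification amounts to replacing the naive prospective posterior weights for the unobserved diplotypes by weights that account for case-control sampling.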

2. COHORT–BASED STUDIES AND THE COX PROPORTIONAL HAZARDS MODEL

Lin and Zeng admirably describe fully efficient nonparametric maximum likelihood estimation for fitting a general haplotype-based semiparametric linear transformation model to cohort studies with unphased genotype data. An alternative estimator, considered by Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) for the popular Cox proportional hazard (CPH) model, deserves attention. Consider the CPH model for specifying the hazard function for a subject given his or her diplotype status ($H^d$) and environmental covariates ($X$) as

$$\lambda[t \mid H^d, X] = \lambda_0(t)\, R(H^d, X; \beta_1), \tag{8}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function, $R(H^{d}, X; \beta_1)$ is a parametric function describing the relative risk associated with the exposure $(H^{d}, X)$, and $\beta_1$ is the vector of associated regression parameters of interest. As before, assume that $\Pr(H^{d}) = q(H^{d}; \theta)$ is specified according to HWE and that $\theta$ denotes the associated haplotype frequency parameters. The model (8) cannot be used directly, because $H^{d}$ is not observable. Following Prentice (1982), one can derive the hazard function for disease conditional on the observable genotype data $G$ and covariates $X$ in the form

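$$
\lambda[t \mid G, X] = \lambda_0(t)\, R^{*}\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\}, \qquad (9)
$$

where

$$
\begin{aligned}
R^{*}\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\}
&= E\{R(H^{d}, X; \beta_1) \mid G, X, T \ge t\} \\
&= \frac{\sum_{H^{d} \in \mathcal{H}_G} R(H^{d}, X; \beta_1)\, \mathrm{pr}[T > t \mid H^{d}, X]\, q(H^{d}; \theta)}
        {\sum_{H^{d} \in \mathcal{H}_G} \mathrm{pr}[T > t \mid H^{d}, X]\, q(H^{d}; \theta)}.
\end{aligned}
$$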

In general, standard partial likelihood inference cannot be performed based on (9), because the relative risk function $R^{*}\{G, X, t; \beta_1, \theta, \lambda_0(\cdot)\}$ itself depends on the baseline hazard function $\lambda_0(t)$. However, Chen et al. (2004) showed that an omnibus score test for genetic association can be performed using outputs from standard statistical software for partial likelihood analysis. Based on (9), Chen and Chatterjee (2005) also described alternative strategies for estimation of the risk parameters $\beta_1$. In particular, the authors observed that for a rare disease, one could assume that $\mathrm{pr}[T > t \mid H^{d}, X] \approx 1$. The corresponding induced relative risk function

$$
R^{*}(G, X; \beta_1, \theta) = \frac{\sum_{H^{d} \in \mathcal{H}_G} R(H^{d}, X; \beta_1)\, q(H^{d}; \theta)}{\sum_{H^{d} \in \mathcal{H}_G} q(H^{d}; \theta)}
$$

is free of the baseline hazard function $\lambda_0(t)$. Thus, under the rare disease approximation, one could estimate $\beta_1$ by maximizing the partial likelihood associated with the relative risk function $R^{*}(G, X; \beta_1, \hat\theta)$, where $\hat\theta$ is a consistent estimate of the haplotype-frequency parameters $\theta$. Chen and Chatterjee described alternative strategies for obtaining consistent estimates of $\theta$ for cohort and nested case-control studies. A simple asymptotic variance estimator was also provided. Simulation studies for the full cohort design show that the loss of efficiency in this pseudolikelihood method was quite small compared with the fully efficient nonparametric maximum likelihood estimator (NPMLE) proposed by Lin (2004).
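To make the rare disease approximation concrete, the following minimal Python sketch evaluates the induced relative risk for a single unphased genotype, once with survival weights $\mathrm{pr}[T > t \mid H^{d}, X] = \exp\{-\Lambda_0(t) R(H^{d}, X; \beta_1)\}$ implied by model (8) and once with those weights set to 1. Everything numerical here is an illustrative assumption rather than something taken from Chen and Chatterjee (2005): the haplotype frequencies, the value of $\beta_1$, the cumulative baseline hazard, the relative risk form $\exp(\beta_1 \times \text{copies})$, and the function name `induced_relative_risk`.

```python
import math

def induced_relative_risk(rel_risk, freq, surv=None):
    """Weighted average of diplotype relative risks over diplotypes compatible with G.

    rel_risk, freq, surv are dicts keyed by diplotype and hold R(H, X; beta1),
    q(H; theta), and pr[T > t | H, X]. With surv=None the rare disease
    approximation pr[T > t | H, X] ~ 1 is applied.
    """
    if surv is None:
        surv = {h: 1.0 for h in freq}
    num = sum(rel_risk[h] * surv[h] * freq[h] for h in freq)
    den = sum(surv[h] * freq[h] for h in freq)
    return num / den

# Hypothetical two-SNP example: a double heterozygote is compatible with
# the diplotypes 00/11 and 01/10.
pi = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}  # assumed haplotype frequencies
beta1 = 0.5                                         # assumed log relative risk per copy of "11"
freq = {
    ("00", "11"): 2 * pi["00"] * pi["11"],          # HWE diplotype probabilities
    ("01", "10"): 2 * pi["01"] * pi["10"],
}
rel_risk = {h: math.exp(beta1 * h.count("11")) for h in freq}

cum_hazard = 0.01  # assumed small cumulative baseline hazard at time t
surv = {h: math.exp(-cum_hazard * rel_risk[h]) for h in freq}

exact = induced_relative_risk(rel_risk, freq, surv)  # depends on the baseline hazard
approx = induced_relative_risk(rel_risk, freq)       # free of the baseline hazard
print(round(exact, 4), round(approx, 4))
```

With a small cumulative hazard the two quantities nearly coincide, which is the sense in which the approximate relative risk can stand in for the exact one in a partial likelihood.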

An advantage of the pseudolikelihood approach of Chen and Chatterjee (2005) is its wide applicability to alternative cohort-based study designs. In particular, for studies of rare diseases such as cancer, it is common to conduct case-control or case-cohort sampling within a cohort to select a subset of people for whom genotype and expensive environmental exposure information will be collected. Various alternative types of partial likelihoods that are currently available for analysis of nested case-control and case-cohort studies can be applied to estimate $\beta_1$ based on the induced relative risk function $R^{*}(G, X; \beta_1, \hat\theta)$. Future research is merited to study whether and how one can obtain the NPMLE for these alternative designs, especially when both genotype and environmental exposure data are available only for the selected subsample of the subjects. We look forward to Lin and Zeng's further innovations in this area.

ADDITIONAL REFERENCES

Chatterjee, N., and Carroll, R. J. (2005), "Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene–Environment Interactions," Biometrika, 92, 399–418.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R. J. (2005), "Exploiting Gene–Environment Independence in Family-Based Case-Control Studies: Increased Power for Detecting Associations, Interactions, and Joint Effects," Genetic Epidemiology, 28, 138–156.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, in press.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Prentice, R. L. (1982), "Covariate Measurement Errors and Parameter Estimation in a Failure Time Regression Model," Biometrika, 69, 331–342.

Spinka, C., Carroll, R. J., and Chatterjee, N. (2005), "Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity," Genetic Epidemiology, 29, 105–127.

Comment

Jung-Ying TZENG and Kathryn ROEDER

All data analysis relies on a model that is, strictly speaking, not correct. Choices about which features to model and which to ignore distinguish successful models from the rest. Without artful modeling, statisticians would be unable to make inferences based on finite samples. In this wide-ranging article, Lin and Zeng (LZ hereinafter) make novel contributions to the statistical genetics literature by introducing new models and providing a rigorous statistical analysis of these models. Specifically, their article builds on a series of related works modeling the effect of haplotypes on the risk of disease. We congratulate the authors for providing a firm theoretical foundation in this exciting area of research. The authors investigate a family of models that address a broad range of sampling designs commonly used in genetic epidemiology, but for brevity we focus our remarks on those models appropriate to case-control data.

Jung-Ying Tzeng is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, NC 27695 (E-mail: jytzeng@stat.ncsu.edu). Kathryn Roeder is Professor, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (E-mail: roeder@stat.cmu.edu).

Schaid et al. (2002) published a practical methodological approach for haplotype association analysis using a prospective model to link the risk of disease to observed genetic data. The chosen model ignored two features of the data: the case-control sampling scheme that typically generates the data and potential violations of "Hardy–Weinberg equilibrium (HWE)" in the pairwise distribution of haplotypes in the population. Zhao et al. (2003) proposed a similar model. By ignoring the retrospective nature of the data, these authors were able to easily incorporate environmental covariates into their models. Both of these articles took a hypothesis testing approach and focused on testing for the presence of haplotype effects rather than estimating the size of such effects.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000844

Subsequently, Epstein and Satten (2003) introduced an approach based on a retrospective model that accounts for case-control sampling of the data. Building on these results, Satten and Epstein (2004) chose not to assume HWE. They used a population genetics model that permits "inbreeding" (a particular violation of independence of haplotype pairings within individuals). Their retrospective approach does not easily facilitate the inclusion of environmental covariates. LZ continue in this line by developing more general models that handle all three data features described earlier: inbreeding, environmental covariates, and retrospective sampling.

Each step in this progression has led to models of increasing complexity. Clearly, more complex models may more closely reflect reality. The question is whether the extra complexity improves the inferences in realistic situations. This is the question that we pursue in this discussion.

Consider "inbreeding." This term is generally used to describe a situation in which there is an increase in the probability of observing matching genetic information at the pair of haplotypes within an individual, that is, $\pi_{kk} > \pi_k^2$. This excess of matching haplotypes (called homozygotes) tends to follow the model $\pi_{kk} = \rho\pi_k + (1 - \rho)\pi_k^2$, where $\rho$ is the probability that both of a pair of inherited haplotypes trace back to a common ancestor. Excess homozygosity in the controls arises naturally under four scenarios: cultural practices that encourage marriage between close relatives; insular populations for whom the present generation traces back to a small number of ancestors; populations consisting of subpopulations whose members rarely intermarry (i.e., population substructure); and laboratory error.
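The inbreeding model above can be sketched in a few lines of Python. The homozygote formula is the one given in the text; the heterozygote formula $2(1 - \rho)\pi_k\pi_l$ is the standard companion expression and is an assumption added here so that the diplotype frequencies sum to 1. The function name `diplotype_freqs` and the example frequencies are illustrative only.

```python
from itertools import combinations_with_replacement

def diplotype_freqs(pi, rho=0.0):
    """Unordered diplotype frequencies under the inbreeding model.

    pi  : dict mapping haplotype label -> population frequency
    rho : inbreeding coefficient (rho = 0 gives Hardy-Weinberg equilibrium)
    """
    freqs = {}
    for h, k in combinations_with_replacement(sorted(pi), 2):
        if h == k:
            # Homozygote: pi_kk = rho * pi_k + (1 - rho) * pi_k^2 (from the text).
            freqs[(h, k)] = rho * pi[h] + (1 - rho) * pi[h] ** 2
        else:
            # Heterozygote (assumed companion formula): 2 * (1 - rho) * pi_k * pi_l.
            freqs[(h, k)] = 2 * (1 - rho) * pi[h] * pi[k]
    return freqs

# Illustrative haplotype frequencies and a modest inbreeding coefficient.
pi = {"11": 0.2, "01": 0.3, "00": 0.5}
hwe = diplotype_freqs(pi, rho=0.0)
inbred = diplotype_freqs(pi, rho=0.015)

for dip in sorted(hwe):
    print(dip, round(hwe[dip], 4), round(inbred[dip], 4))
```

Every homozygote frequency rises above $\pi_k^2$ when $\rho > 0$, and every heterozygote frequency falls by the factor $1 - \rho$, which is the excess homozygosity the paragraph describes.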

Cultural norms generally discourage marriage among relatives, and hence inbreeding levels in human populations tend to be very low. For instance, LZ estimate $\rho = .002$ in the FUSION data described in their article. In populations in which inbreeding is considered acceptable, even encouraged, Donbak (2004) assessed the level of inbreeding to be $\rho = .015$. Contrast this with the level of inbreeding ($\rho = .05$) used by Satten and Epstein (2004) and LZ in their simulation studies. For a large population to attain inbreeding levels of .05 would require a substantial fraction of the marriages to be between first and second cousins (Lange 1997).

The other potential for inbreeding, insular populations, is best illustrated by the Hutterite population of North Dakota, whose ancestry can be traced back to 90 ancestors in the 1700s/1800s. Yet even this unique population has only a moderate amount of inbreeding ($\rho = .03$) (Bourgain et al. 2003).

Thus true inbreeding in most human populations is considerably less than .05. Yet we sometimes observe a substantial excess of "homozygotes" in control subjects, leading to strong violations of HWE. Presumably, the third and fourth features are the likely causes.

Population substructure is known to lead to excesses of homozygosity in practice (e.g., Devlin, Risch, and Roeder 1990). If the violation in HWE is due to population substructure, then we argue that it is not acceptable to perform further analysis of the data using the type of association analysis developed by LZ. Why? Geneticists are not interested in simply finding associations; rather, they wish to discover causal associations. However, population substructure leads to confounding due to Simpson's paradox. The lurking variable in this context is subpopulation membership. For example, suppose that the disease is more common in one subpopulation than the other. Any genetic marker that differs strongly in distribution across the subpopulations will exhibit a spurious association with the disease (Lander and Schork 1994; Devlin and Roeder 1999). Consequently, great care must be taken to control for population substructure in good quality studies of genetic association. If it is impossible to control for this feature directly by stratifying the data by subpopulation, then a different experimental design, similar to matched case-control, is often used (Ewens and Spielman 1995). Thus, by careful design, we can usually rule out population substructure as the cause of excess homozygosity.
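The confounding argument can be checked with a small numerical illustration. The Python sketch below builds the expected marker-by-disease table for a population made of two subpopulations in which marker and disease are independent within each subpopulation, and then pools them. All numbers (mixing weights, carrier probabilities, disease risks) and the helper names `expected_table` and `odds_ratio` are hypothetical choices made for illustration.

```python
def expected_table(strata):
    """Expected 2x2 cell proportions, pooled over subpopulations.

    strata: list of (weight, carrier_prob, disease_prob) tuples; within each
    subpopulation, carrier status and disease status are independent.
    Returns [carrier & case, carrier & control, noncarrier & case, noncarrier & control].
    """
    cells = [0.0, 0.0, 0.0, 0.0]
    for weight, p_carrier, p_disease in strata:
        cells[0] += weight * p_carrier * p_disease
        cells[1] += weight * p_carrier * (1 - p_disease)
        cells[2] += weight * (1 - p_carrier) * p_disease
        cells[3] += weight * (1 - p_carrier) * (1 - p_disease)
    return cells

def odds_ratio(cells):
    a, b, c, d = cells
    return (a * d) / (b * c)

# Hypothetical subpopulations: the marker and the disease are both more
# common in the first one, but unrelated within each.
sub1 = (0.5, 0.6, 0.10)
sub2 = (0.5, 0.2, 0.02)

within_1 = odds_ratio(expected_table([sub1]))
within_2 = odds_ratio(expected_table([sub2]))
pooled = odds_ratio(expected_table([sub1, sub2]))
print(round(within_1, 3), round(within_2, 3), round(pooled, 3))
```

The within-subpopulation odds ratios equal 1 by construction, while the pooled odds ratio is well above 1, which is the spurious association that stratification or a matched design is meant to remove.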

Laboratory error often yields an excess of apparent homozygotes; however, these violations are rarely seen in the final stages of data analysis. Standard laboratory practice involves screening all genetic markers that result in violations of HWE to determine whether they are the consequence of errors. Thus, in careful studies, violations of HWE in the control sample are unlikely in practice. If a testing framework were used, it would be reasonable to assume HWE under the null hypothesis.

Next, we consider "testing" versus "estimating" the effect of haplotypes. Traditionally, genetic epidemiologists have tested $\beta = 0$ rather than attempting to estimate the haplotypic effect. To understand why, it is necessary to consider the history of genetic epidemiology in this domain. Although geneticists tracing back to Fisher have been remarkably adept in their use of quantitative tools, the data that they deal with are not always amenable to modeling that is excessively detailed. Moreover, progress in mapping the genetic risk factors for complex disease has been slow due to the "needle in a haystack" nature of the quest. Millions of polymorphisms potentially could be tested for association. A disease may be caused by multigenic and/or environmental factors, which are likely to interact in complex ways. Effects of individual genes are likely to be small and, even worse, to vary across ethnic groups. For these reasons, the odds are against discovering risk factors, let alone refined estimates of the magnitude of these risks.

In practice, however, a number of steps are taken to improve the likelihood of success. Although some steps could potentially introduce estimation bias of $\beta$, they are amenable to simple testing for the presence of a genetic effect. For example, to enhance the chance that participants in a study have the disease due to a genetic cause (rather than environmental ones), diseased subjects often are included only if they have close relatives with the disease. This type of sampling leads to an ascertainment bias in $\beta$. This is just one of the substantial biases that might be present in a genetic study.

LZ's model requires specification of a particular haplotype as the associated one. This requirement is made rather cavalierly, considering that there typically is no prior knowledge about the relationship between haplotypes and a disease phenotype. Indeed, for the most part, haplotypes do not cause disease; rather, one or more genetic variants do. Haplotypes are used as proxies for these variants in association studies, because the spatial correlation within the haplotypes often leads to a correlation between a causal variant and one or more distinct haplotypes. Certainly, there is no guarantee that there will be a one-to-one mapping between a haplotype and a causal variant. Consequently, there is no reason to believe that specific haplotypic effects are interpretable or reproducible across ethnic groups or studies.

What is the alternative to the choice of preselecting the associated haplotype? Even in the testing framework, this is a challenging problem. Chapman, Cooper, Todd, and Clayton (2003) performed some intriguing simulations and reached the conclusion that haplotypes have very low power in a search for genetic risk factors. Their work suggests that greater power can be obtained by analyzing a set of single nucleotide polymorphisms (SNPs) with a simple application of Hotelling's $T^2$ (Fan and Knapp 2003). Roeder, Bacanu, Sonpar, Zhang, and Devlin (2005) took this view one step further and showed that the maximum of single SNP tests is yet more powerful than Hotelling's $T^2$ in some instances. Naive haplotype analysis performs poorly because it requires too many degrees of freedom when the causal haplotype is not known a priori. To fully capitalize on their potential, haplotype-based methods are needed that do not dilute their power to detect interesting relationships between haplotypes and phenotypes in the sheer volume of distinct haplotypes. Tzeng (2005) used haplotype similarity to cluster related types and reduce the dimension of the problem. This approach can improve the power of the test. Seltman, Roeder, and Devlin (2001, 2003) proposed a method of haplotype analysis that exploits the evolutionary history of the haplotypes to limit the tests performed. This approach increases the interpretability and the power of generic haplotype analysis. Alternatively, Tzeng, Byerley, Devlin, Roeder, and Wasserman (2003) described an approach that tests for any unusual sharing of haplotypes among the cases and requires only a single degree of freedom. (For a summary of the challenges in this area, see Clayton, Chapman, and Cooper 2004.)

Thus far, we have discussed three scenarios that may introduce bias in estimators of $\beta$: (A) inbreeding, (B) ascertainment bias, and (C) a causal SNP not uniquely associated with a haplotype. To examine the impact of these violations of assumptions, we generated $n = 500$ cases and controls using (12) from LZ, with $\alpha = -4$ and $\beta_2 = \beta_3 = 0$. To expedite the simulations, we made a simplification: we simulated the data from a list of three haplotypes ($h_1 = 11$, $h_2 = 01$, and $h_3 = 00$) with probabilities ($\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$). To estimate $\beta_1$, we used LZ's rare disease algorithm as described in Section A.4.5, but assuming HWE. Under scenario (A), we allowed a realistic violation of HWE and simulated data with an inbreeding coefficient of $\rho = .015$. In scenario (B), we drew the cases from families with an affected sibpair (one child sampled per family). In scenarios (A) and (B), the causal SNP is in position 1, which leads to a unique mapping between $h_1$ and SNP1. In scenario (C), the causal SNP is in position 2. Consequently, for this scenario, both $h_1$ and $h_2$ are associated with the phenotype. For all three scenarios, only the effect of $h_1$ was investigated.
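As a rough illustration of the data-generating step for scenario (A) only, the Python sketch below draws case and control diplotypes retrospectively from the three haplotypes listed above. It is not LZ's equation (12) or their estimation algorithm: the additive logistic risk model in the number of copies of $h_1$, the heterozygote frequency formula $2(1 - \rho)\pi_k\pi_l$, and all function names are assumptions made for this sketch.

```python
import math
import random
from itertools import combinations_with_replacement

PI = {"11": 0.2, "01": 0.3, "00": 0.5}  # haplotype frequencies from the text

def diplotype_freqs(pi, rho):
    """Diplotype frequencies with inbreeding coefficient rho (heterozygote form assumed)."""
    out = {}
    for h, k in combinations_with_replacement(sorted(pi), 2):
        if h == k:
            out[(h, k)] = rho * pi[h] + (1 - rho) * pi[h] ** 2
        else:
            out[(h, k)] = 2 * (1 - rho) * pi[h] * pi[k]
    return out

def risk(dip, alpha, beta1, target="11"):
    """Assumed additive logistic model in the number of copies of the target haplotype."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta1 * dip.count(target))))

def retrospective_dists(pi, rho, alpha, beta1):
    """Diplotype distributions given case status and given control status (Bayes rule)."""
    q = diplotype_freqs(pi, rho)
    case = {d: risk(d, alpha, beta1) * q[d] for d in q}
    ctrl = {d: (1 - risk(d, alpha, beta1)) * q[d] for d in q}
    z_case, z_ctrl = sum(case.values()), sum(ctrl.values())
    return ({d: v / z_case for d, v in case.items()},
            {d: v / z_ctrl for d, v in ctrl.items()})

def draw(dist, n, rng):
    """Draw n diplotypes from a distribution given as a dict."""
    dips = list(dist)
    return rng.choices(dips, weights=[dist[d] for d in dips], k=n)

rng = random.Random(2006)
case_dist, ctrl_dist = retrospective_dists(PI, rho=0.015, alpha=-4.0, beta1=0.5)
cases = draw(case_dist, 500, rng)
controls = draw(ctrl_dist, 500, rng)
```

An estimator that assumes HWE could then be applied to `cases` and `controls` and compared against the value of `beta1` used to generate them; repeating this over many replicates gives bias and MSE summaries of the kind reported in Table 1.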

Table 1. Bias and Mean Squared Error (MSE) of β1

β1    Scenario                          Bias     MSE
.5    Baseline                          −.002    .010
      [A] ρ = .015                       .023    .012
      [B] Ascertainment bias             .273    .083
      [C] Multiple causal haplotypes    −.217    .058
0     Baseline                          −.002    .013
      [A] ρ = .015                       .001    .012
      [B] Ascertainment bias             .002    .008
      [C] Multiple causal haplotypes     .002    .012

NOTE: The results are based on 200 simulations with 500 cases and 500 controls. Scenario "Baseline" refers to ρ = 0, no ascertainment bias, and one causal haplotype, h1.

The bias of $\beta_1$ is indicated in Table 1. Note that the observed bias is negligible when inbreeding is ignored. This small level of bias is likely the reason why virtually every published method for estimating haplotype distribution is based on a model that assumes HWE. Alternatively, ascertainment bias leads to a substantial bias in the estimated effect of the haplotype. Clearly, if the size of $\beta_1$ is of interest, then care must be taken to model the ascertainment process. Finally, the bias due to nonunique association between SNPs and haplotypes is also considerable. This supports our view that it may be difficult to interpret $\beta_1$ in practice.

Thus, although LZ provide models that are closer to truth, we believe the nature of the data is not yet in sync with the refinements to the model that LZ have introduced. Nevertheless, technology continues to improve the quality and amount of data available. The likelihood of success in the quest to discover the genetic risk factors for complex disease rises accordingly. So, although we think LZ are ahead of their time for most genetic analyses at the moment, as the data improve and become more focused, it will be advantageous for statisticians to have refined models with good properties at their fingertips.

ADDITIONAL REFERENCES

Bourgain, C., Hoffjan, S., Nicolae, R., Newman, D., Steiner, L., Walker, K., Reynolds, R., Ober, C., and McPeek, M. S. (2003), "Novel Case-Control Test in a Founder Population Identifies P-Selectin as an Atopy-Susceptibility Locus," American Journal of Human Genetics, 73, 612–626.

Clayton, D., Chapman, J., and Cooper, J. (2004), "Use of Unphased Multilocus Genotype Data in Indirect Association Studies," Genetic Epidemiology, 27, 415–428.

Chapman, J. M., Cooper, J. D., Todd, J. A., and Clayton, D. G. (2003), "Detecting Disease Associations Due to Linkage Disequilibrium Using Haplotype Tags: A Class of Tests and the Determinants of Statistical Power," Human Heredity, 56, 18–31.

Devlin, B., Risch, N., and Roeder, K. (1990), "No Excess of Homozygosity at Loci Used for DNA Fingerprinting," Science, 240, 1416–1420.

Devlin, B., and Roeder, K. (1999), "Genomic Control for Association Studies," Biometrics, 55, 997–1004.

Donbak, L. (2004), "Consanguinity in Kahramanmaras City, Turkey, and Its Medical Impact," Saudi Medical Journal, 25, 1991–1994.

Ewens, W. J., and Spielman, R. S. (1995), "The Transmission/Disequilibrium Test: History, Subdivision, and Admixture," American Journal of Human Genetics, 57, 455–464.

Fan, R., and Knapp, M. (2003), "Genome Association Studies of Complex Diseases by Case-Control Designs," American Journal of Human Genetics, 72, 850–868.

Lander, E. S., and Schork, N. J. (1994), "Genetic Dissection of Complex Traits," Science, 265, 2037–2048.

Lange, K. (1997), Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.

Roeder, K., Bacanu, S. A., Sonpar, V., Zhang, X., and Devlin, B. (2005), "Analysis of Single-Locus Tests to Detect Gene/Disease Associations," Genetic Epidemiology, 28, 207–219.

Seltman, H., Roeder, K., and Devlin, B. (2001), "TDT Meets MHA: Family-Based Association Analysis Guided by Evolution of Haplotypes," American Journal of Human Genetics, 68, 1250–1263.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003), "Outlier Detection and False Discovery Rates for Whole-Genome DNA Matching," Journal of the American Statistical Association, 98, 236–247.

Comment

Hongzhe LI

1. INTRODUCTION

I congratulate Professors Lin and Zeng (LZ for short) on their fine work on inference on haplotype effects in genetic association studies. The completion of the Human Genome Project and near completion of the HapMap project make genomewide genetic association studies of complex traits now a reality. One of the main challenges is performing rigorous and efficient statistical analysis for identifying genetic variants that explain the variation of the traits of interest in the population. Such traits can be binary, such as disease status; quantitative, such as blood pressure; censored survival data, such as age of disease onset; or longitudinal or functional data, such as growth curves. LZ provide a unified framework for haplotype association analysis for different types of traits and several different study designs. Although the likelihood-based methods using the EM algorithm for inferring the missing phases of the haplotypes have been studied and reported by many authors in recent years, most of these articles were published in genetics journals or genetic epidemiological journals, and some of them lack statistical rigor. The major contribution of LZ's article is to provide a rigorous treatment of the problem and the methods developed previously, especially in terms of parameter identifiability and the assumptions for validity of the likelihood-based statistical inferences.

Instead of commenting on the theoretical content of the article, I provide some comments on the following aspects: the design of genetic association studies, alternative parameterization of genetic variants, and incorporation of additional biological information into genetic association analysis.

2. STUDY DESIGNS

LZ consider several commonly used study designs in traditional epidemiological studies, including cross-sectional studies, case-control studies with known or unknown population totals, and cohort designs. Due to the cost of genotyping and collection of environmental covariates, the most practical and the most commonly used design among these listed for large-scale genetic association studies is the case-control design with unknown population totals. Inferences for such a design were investigated by LZ in their simulations and application to the FUSION data. The traditional cohort design is probably not practical or necessary for large-scale genetic association studies. If age of onset and haplotype-specific risk ratios are important, then a case-cohort or nested case-control design may provide a feasible alternative, especially for relatively rare diseases. Chen, Peters, Foster, and Chatterjee (2004) and Chen and Chatterjee (2005) proposed likelihood-based score tests for haplotype association and likelihood-based procedures for estimating the haplotype risk ratio parameters in the framework of the Cox proportional hazards model, where they proposed first estimating the haplotype frequencies and then estimating the risk ratio parameters. It should be easy to develop a nonparametric maximum likelihood estimator (NPMLE) procedure in the framework of missing data. Under the case-cohort or nested case-control design, there are two types of missing data. For those who were cases or were selected as controls or a subcohort, we may miss the phases of the haplotypes; for those who were disease-free but were not selected as controls or a subcohort, we miss their genotypes and, of course, also their haplotypes. The EM algorithm can be developed for the NPMLE to estimate the haplotype frequencies, the baseline hazard function, and the haplotype-specific risk ratios simultaneously. It is also interesting to develop rigorous statistical inference procedures for the class of semiparametric linear transformation models considered by LZ for case-cohort or nested case-control designs. Such procedures should contribute greatly to the future of haplotype-based genetic association analysis.

Hongzhe Li is a Professor of Biostatistics, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104 (E-mail: hli@cceb.upenn.edu). This work was done when he was on the faculty at the University of California, Davis. His research is supported by National Institutes of Health grant R01 ES009911.

3. PARAMETERIZATION OF GENETIC VARIANTS

Haplotype-based genetic association analysis, as considered by LZ, provides one way of identifying genetic variants that are related to complex traits. However, there are some unsolved practical issues when a large set of SNPs are typed and investigated, for example, how many SNPs one should consider in haplotype analysis and how one can efficiently deal with many rare haplotypes. A recent idea is to use evolutionary-based grouping of rare haplotypes in association analysis to reduce the degrees of freedom (Tzeng 2005). Such grouping depends, of course, on the population models assumed, which sometimes can be difficult to justify. In addition, the models parameterized in terms of haplotypes may not be appropriate for modeling complex higher-order interactions between the SNPs. Recently, there have been some new developments in modeling more complex interactions among the SNPs than what the haplotypes can model. These include the logic regression models (Ruczinski, Kooperberg, and LeBlanc 2003; Kooperberg and Ruczinski 2004), where logic combinations of the binary variables coded for the SNPs are treated as predictors. Huang et al. (2004) developed a tree-structured supervised learning method called FlexTree for studying gene–gene and gene–environment interactions, in which they considered both an additive score model involving many genes and a model in which only a precise list of aberrant genotypes is predisposing. These models hold great promise in modeling complex interactions among the gene variants. Alternatively, one can model the genotype effects over many SNPs and use the recently developed methods in the analysis of high-dimensional genomic data in selecting relevant SNPs and their interactions (Efron, Hastie, Johnstone, and Tibshirani 2004; Gui and Li 2005a,b; Li and Luan 2005). Among these methods, the threshold gradient descent procedure (Friedman and Popescu 2004; Gui and Li 2005a) can be used to implicitly model the linkage disequilibrium between the SNPs, and the boosting procedure with regression trees as a base learner (Friedman 2001; Li and Luan 2005) provides an interesting way to model complex interactions among the SNPs. Which of these models is the best for identifying the gene variants related to complex traits remains to be seen.

© 2006 American Statistical Association, Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods. DOI 10.1198/016214505000000853

4. INCORPORATING ADDITIONAL BIOLOGICAL INFORMATION

Although new high-throughput technologies are continuously being developed and elaborated for biomedical research, it is also extremely important to make full use of the data and knowledge accumulated by traditional experiments. This knowledge is often stored in databases as "metadata," usually defined as "data about data." For example, it is now known that many aspects of cancer result from deregulation or disturbance of interacting signaling pathways and regulatory circuits that normally control processes of the cell cycle, apoptosis, cell–cell adhesion, and cytoskeletal organization. Deregulation of these pathways often leads to aberrant signaling and the corresponding cancer hallmarks, such as increased proliferation, decreased apoptosis, genome instability, sustained angiogenesis, tissue invasion, and metastasis (Hanahan and Weinberg 2000). Studies of these deregulated pathways have revealed genomic and biological differences between tumor and normal cells. These observations suggest that genomic data and metadata, such as known pathways or networks, have great potential in identifying genetic variants and pathways that contribute to the risk of cancers and to the variability in treatment responses among cancer patients. Important metadata that have been widely used in biomedical research include Gene Ontology (GO) (Ashburner et al. 2000), the KEGG metabolic pathways database (Kanehisa, Goto, Kawashima, and Nakaya 2002), and BioCarta pathways (www.biocarta.com).

One limitation of almost all of the available methods for genetic association analysis of SNP data is that biological knowledge derived from metadata, such as known biological pathways, is rarely incorporated into the analysis. From a statistical standpoint, doing this may require a whole new set of regression analysis methods, where the levels of activity of several biological networks are treated as predictors. However, such network activities cannot be measured directly, but they may be inferred from complex combinations of genetic variants in genes within the pathways. Such biological knowledge can be used to form more biologically relevant disease-predisposing models, including complex interactions among the genetic variants. In addition, biological information also provides an alternative for mediating the problem of a large number of potential interactions by limiting the analysis to biologically plausible interactions between genes in related pathways. In general, risk interactions are more plausible between genes involved in a physical interaction, found in the same pathways, or involved in the same regulatory network (Carlson, Eberle, Kruglyak, and Nickerson 2004). Incorporating these metadata into current genetic association analysis would appear to hold great promise in large-scale genetic association analysis.

Finally, I agree with LZ that there are many interesting statistical problems related to large-scale genetic association analysis, and more efforts from statisticians are needed to develop robust and theoretically sound strategies for identifying genetic contributions to disease and pharmaceutical responses and for identifying gene variants that contribute to good health and resistance to disease.

ADDITIONAL REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium," Nature Genetics, 25, 25–29.

Carlson, C. S., Eberle, M. A., Kruglyak, L., and Nickerson, D. A. (2004), "Mapping Complex Disease Loci in Whole-Genome Association Studies," Nature, 429, 446–452.

Chen, J., and Chatterjee, N. (2005), "Haplotype-Based Association Analysis in Cohort and Nested Case-Control Studies," Biometrics, to appear.

Chen, J., Peters, U., Foster, C., and Chatterjee, N. (2004), "A Haplotype-Based Test of Association Using Data From Cohort and Nested Case-Control Epidemiologic Studies," Human Heredity, 58, 18–29.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Friedman, J. (2001), "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J. H., and Popescu, B. E. (2004), "Gradient Directed Regularization," technical report, Stanford University.

Gui, J., and Li, H. (2005a), "Threshold Gradient Descent Method for Censored Data Regression, With Applications in Pharmacogenomics," Pacific Symposium on Biocomputing, 10, 272–283.

Gui, J., and Li, H. (2005b), "Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, With Applications to Microarray Gene Expression Data," Bioinformatics, Advance Access published on April 6, 2005, doi:10.1093/bioinformatics/bti422.

Hanahan, D., and Weinberg, R. A. (2000), "The Hallmarks of Cancer," Cell, 100, 57–70.

Huang, J., Lin, A., Narasimhan, B., Quertermous, T., Hsiung, C. A., Ho, L. T., Grove, J. S., Olivier, M., Ranade, K., Risch, N. J., and Olshen, R. A. (2004), "Tree-Structured Supervised Learning and the Genetics of Hypertension," Proceedings of the National Academy of Sciences, 101, 10529–10534.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002), "The KEGG Databases at GenomeNet," Nucleic Acids Research, 30, 42–46.

Kooperberg, C., and Ruczinski, I. (2004), "Identifying Interacting SNPs Using Monte Carlo Logic Regression," Genetic Epidemiology, 28, 157–170.

Li, H., and Luan, Y. (2005), "Boosting Proportional Hazards Models Using Smoothing Splines, With Applications to High-Dimensional Microarray Data," Bioinformatics, Advance Access published on February 15, 2005, doi:10.1093/bioinformatics/bti324.

Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Journal of Computational and Graphical Statistics, 12, 475–511.

Tzeng, J. Y. (2005), "Evolutionary-Based Grouping of Haplotypes in Association Analysis," Genetic Epidemiology, 28, 220–231.

116 Journal of the American Statistical Association March 2006

Rejoinder
D. Y. LIN and D. ZENG

We are grateful to the editor and associate editor for organizing the discussion of our article and to the discussants for their valuable contributions. The discussants are leading researchers in statistical genetics, and we greatly appreciate their expert opinions and insightful comments.

Inference on haplotype effects is a hot topic in genetics. From a statistical standpoint, the problem is very interesting and challenging because of haplotype ambiguity and high-dimensional nuisance parameters. Because the number of haplotypes within a candidate gene can be large and the underlying biology is mostly unknown, modeling the haplotype effects correctly is difficult. The task becomes more daunting when the study involves a large number of SNPs. The comments provided by the discussants underscore the importance of haplotype analysis and reflect the many challenges confronted by data analysts. Indeed, a major motivation for writing our article was to bring these challenges to the attention of the broader statistical community. Here we address the main issues raised by each group of discussants.

1. RESPONSE TO SABATTI

Dr. Sabatti offered an excellent description of the important roles of haplotype-based studies in localizing regions harboring disease susceptibility genes and in identifying the functional variants. She also provided a nice discussion of the potential use of our methods in different types of studies. Although the numerical results in our article pertain to the inference on haplotype effects within a region of interest, the theory developed in the article allows one to perform association mapping as well. As we mention in Section 5 of our article, one can scan through the genome with sliding marker windows and perform an overall likelihood-ratio test for association between the disease phenotype and the haplotypes formed by the markers in each window. In genomewide investigations, environmental factors are typically ignored, so the maximization of the likelihood given in (9) of our article is very fast. Rare haplotypes need to be removed or combined in the analysis so as to avoid the sparsity of the data. The effects of multiple comparisons can be adjusted by permuting the data or by applying the Monte Carlo method of Lin (2005). An article on haplotype-based association mapping is currently under preparation.
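The sliding-window scan with a permutation adjustment for multiple comparisons can be sketched as follows. This is only an illustrative outline under stated assumptions: the function names, the data layout (one row of allele counts per subject), and the max-statistic form of the permutation adjustment are choices made here, and `toy_stat` is a stand-in for the window-specific likelihood-ratio statistic, which is not implemented.

```python
import random


def sliding_windows(n_markers, width):
    """Return (start, end) index pairs for consecutive marker windows."""
    return [(s, s + width) for s in range(n_markers - width + 1)]


def permutation_adjusted_pvalues(phenotype, markers, width, window_stat,
                                 n_perm=200, seed=0):
    """Max-statistic permutation adjustment over sliding marker windows.

    window_stat(phenotype, window_rows) is a hypothetical user-supplied
    function returning a test statistic (larger = stronger evidence),
    for example a likelihood-ratio statistic for the window's haplotypes.
    """
    rng = random.Random(seed)
    windows = sliding_windows(len(markers[0]), width)

    def scan(y):
        # One statistic per window, computed on that window's marker columns.
        return [window_stat(y, [row[s:e] for row in markers])
                for s, e in windows]

    observed = scan(phenotype)
    exceed = [0] * len(windows)
    for _ in range(n_perm):
        shuffled = list(phenotype)
        rng.shuffle(shuffled)            # break any phenotype-marker link
        max_stat = max(scan(shuffled))   # largest statistic across windows
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    adjusted = [(c + 1) / (n_perm + 1) for c in exceed]
    return windows, observed, adjusted


def toy_stat(y, window_rows):
    """Stand-in statistic (an assumption, not the likelihood-ratio test):
    squared case-control difference in mean allele count within the window."""
    cases = [sum(g) for g, d in zip(window_rows, y) if d == 1]
    controls = [sum(g) for g, d in zip(window_rows, y) if d == 0]
    if not cases or not controls:
        return 0.0
    return (sum(cases) / len(cases) - sum(controls) / len(controls)) ** 2
```

Comparing each observed statistic with the permutation distribution of the maximum over all windows controls the familywise error rate across overlapping, correlated windows without requiring their joint distribution.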

2. RESPONSE TO SATTEN, ALLEN, AND EPSTEIN

When we started this project 3 years ago, the article by Epstein and Satten (2003) was unpublished. The authors kindly provided us with earlier versions of their article, from which we learned a great deal. In fact, our work on case-control studies with unknown population totals was largely motivated by their work. We have also greatly benefited from personal communications with them.

D. Y. Lin is Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu) and D. Zeng is Assistant Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

The first major comment of Drs. Satten, Allen, and Epstein (SAE hereinafter) pertains to the efficiency advantages of the retrospective likelihood analysis over the prospective likelihood analysis for case-control studies. Satten and Epstein (2004) found considerable efficiency differences in estimating main haplotype effects. Our simulation studies revealed that the differences can be even more substantial in estimating haplotype-environment interactions. (The results are not given in the final version of our article.) There is a potential price for the efficiency gains: the retrospective likelihood is less robust against violation of Hardy–Weinberg equilibrium (HWE).

We agree with SAE that the dependence between haplotypes and covariates is a difficult issue and that the independence assumption is reasonable in many cases. For all study designs, the likelihoods involve the conditional distribution of covariates given haplotypes and genotypes, so one must either assume the conditional independence of covariates and haplotypes given genotypes or model the conditional distribution of covariates given haplotypes. Our work accommodates both the situations of conditional independence and unconditional independence. The distinction between the two situations is immaterial to cross-sectional and cohort studies, because the likelihoods for the parameters of interest are the same. For case-control studies, the computations are slightly more demanding under conditional independence than under unconditional independence.

As we mention in Section 2.1 of our article, there is much flexibility in specifying the regression effects of haplotypes on the phenotype. With K haplotypes, we can either compare one haplotype with all the others or compare each of the first (K − 1) haplotypes with the last haplotype. The numerical results in our article pertain to the first formulation; however, all of the theoretical results, including those on identifiability, and the numerical algorithms apply to the second formulation as well. We agree with SAE that the identifiability results under arbitrary distributions of haplotypes do not guarantee stable estimates in small samples, which is why all of our data analysis relies on model (3).
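The two formulations differ only in how a pair of haplotypes is turned into regression covariates. A minimal sketch, assuming an additive mode of inheritance (other genetic mechanisms are possible) and hypothetical function names, with haplotypes labeled 0, ..., K − 1:

```python
def code_one_vs_rest(diplotype, target):
    """First formulation: one covariate, the number of copies (0, 1, or 2)
    of the target haplotype; all other haplotypes are pooled."""
    return [sum(1 for h in diplotype if h == target)]


def code_vs_reference(diplotype, K):
    """Second formulation: K - 1 covariates, the copy number of each of the
    first K - 1 haplotypes; the last haplotype (label K - 1) is the reference."""
    return [sum(1 for h in diplotype if h == k) for k in range(K - 1)]


# A subject carrying haplotypes 3 and 1 out of K = 4:
# code_one_vs_rest((3, 1), target=1) gives [1]
# code_vs_reference((3, 1), K=4)     gives [0, 1, 0]
```

The first coding uses a single effect parameter per target haplotype, whereas the second uses K − 1 parameters, which is one reason the second formulation can be less stable in small samples when some haplotypes are rare.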

Our article deals exclusively with studies of unrelated individuals. The methods developed by Allen et al. (2005) for case-parent trio studies are very clever and provide another illustration of the usefulness of semiparametric efficiency theory in statistical genetics. We look forward to learning about the extension of that work to the situations with covariates and to case-control studies, as well as the extension of the work of Epstein and Satten (2003) to matched case-control studies.

© 2006 American Statistical Association
Journal of the American Statistical Association
March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000862

3. RESPONSE TO CHATTERJEE, SPINKA, CHEN, AND CARROLL

3.1 Case-Control Studies

This group of discussants has done an impressive amount of work on case-control studies. The original work by Carroll and colleagues (e.g., Roeder et al. 1996) and the recent work by Chatterjee and Carroll (2005) provide fundamental insights into the analysis of case-control data, particularly the relative merits of the retrospective-likelihood versus prospective-likelihood analyses.

As indicated in our response to SAE, it is necessary to assume the conditional independence of haplotypes and covariates given genotypes (or the stronger unconditional independence) unless one is willing and able to model the relationship between haplotypes and covariates. It is straightforward to incorporate a parametric model for the relationship between haplotypes and covariates into our likelihoods. It is also possible to account for latent population substructure with the aid of genomic markers, as we mention in Section 5 of our article.

The identifiability results of Spinka, Carroll, and Chatterjee (2005) seem to be related to those presented in Sections A.4.1 and A.4.2 of our article. In our view, such identifiability conditions are difficult to verify in practice. Although the intercept term can be identified in some situations, the estimator of this parameter is likely to be unstable in finite samples. Thus we advocate the methods based on the rare-disease assumption.

We agree with Drs. Chatterjee, Spinka, Chen, and Carroll (CSCC hereinafter) that one can avoid the nonparametric estimation of the covariate distribution whether or not the disease is rare. In fact, the profile likelihood method presented in Section A.4.4 of our article does not impose the rare-disease assumption. Expression (5) of CSCC appears to be similar to the second paragraph of our Section A.4.5.

Our work covers both the situation of conditional independence between haplotypes and covariates given genotypes and that of unconditional independence under HWE or one-parameter extensions of HWE. We showed that our estimators achieve the semiparametric efficiency bounds. It does not seem possible to obtain more efficient estimators without imposing additional restrictions.

The prospective methods of Spinka et al. (2005) improve on those of Zhao et al. (2003). We agree with CSCC that the prospective methods are more robust than the retrospective methods against violation of HWE and of the assumption of independence between haplotypes and covariates. On the other hand, the prospective methods can be considerably less efficient. The choice between the prospective and retrospective methods would depend on the plausibility of the assumptions.

3.2 Cohort Studies

The work by Chen and colleagues on cohort studies is a novel application of the results of Prentice (1982) for the proportional hazards model with covariate measurement error. Because the relative-risk function shown in (9) of CSCC involves the baseline hazard function, it is very challenging, both computationally and theoretically, to make inference about the relative-risk parameters on the basis of (9). For rare diseases, the relative risk function is free of the baseline hazard function, so that the partial likelihood principle can be applied to both cohort studies and nested case-control studies, given consistent estimators of the haplotype frequencies. The bias and efficiency of the resultant relative risk estimators require further investigation.
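The reason the baseline hazard drops out for rare diseases can be seen from the standard induced-hazard argument. The display below is a sketch under assumed notation, not a reproduction of expression (9) of CSCC: S(G) denotes the set of haplotype pairs compatible with the unphased genotype G, Z(H, X) is a design vector, the covariates are time-independent, and Λ_0 is the cumulative baseline hazard.

```latex
% Assumed notation: S(G) = haplotype pairs compatible with genotype G;
% Z(H, X) = design vector; \Lambda_0 = cumulative baseline hazard.
\begin{align*}
\lambda(t \mid G, X)
  &= \lambda_0(t)\, E\bigl[ e^{\beta^{\mathsf T} Z(H, X)} \mid G, X, T \ge t \bigr] \\
  &= \lambda_0(t)\,
     \frac{\sum_{H \in S(G)} e^{\beta^{\mathsf T} Z(H, X)}\,
           \exp\{-\Lambda_0(t)\, e^{\beta^{\mathsf T} Z(H, X)}\}\, P(H \mid G, X)}
          {\sum_{H \in S(G)} \exp\{-\Lambda_0(t)\, e^{\beta^{\mathsf T} Z(H, X)}\}\, P(H \mid G, X)} \\
  &\approx \lambda_0(t) \sum_{H \in S(G)} e^{\beta^{\mathsf T} Z(H, X)}\, P(H \mid G, X)
  \qquad \text{when } \Lambda_0(t) \approx 0 .
\end{align*}
```

In the second line the survival factors depend on Λ_0(t), which is what makes exact inference difficult. When the disease is rare, these factors are all close to 1, and the relative-risk function reduces to a weighted average that involves only the regression parameters and the haplotype distribution.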

The NPMLE is very easy to calculate for the proportional hazards model. The EM algorithm guarantees an increase in the likelihood function at each step of the iteration and facilitates calculation of the information matrix. In the M-step of the EM algorithm, estimation of the relative risk parameters and the cumulative baseline hazard function is very similar to calculation of the maximum partial likelihood estimator and the Breslow estimator. We have not encountered any convergence problem of the EM algorithm in our extensive simulation studies.

A major advantage of the NPMLE approach is that it can be applied to the entire class of transformation models, not just the proportional hazards model. In addition, this approach is applicable to case-cohort and nested case-control studies, whether environmental factors are measured for the selected sample or the entire cohort, and the estimators continue to be asymptotically efficient. A manuscript on the NPMLEs for case-cohort and nested case-control designs is currently under review.
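The monotone-likelihood property of EM can be illustrated on a much simpler problem than the NPMLE for the proportional hazards model: estimating haplotype frequencies from unphased genotypes at two biallelic SNPs, assuming HWE and no phenotype. The sketch below is not the algorithm of the article; the function names and data layout are assumptions made for illustration.

```python
import math
from itertools import product

# Two biallelic SNPs give four possible haplotypes: 00, 01, 10, 11.
HAPLOTYPES = list(product((0, 1), repeat=2))


def compatible_pairs(genotype):
    """Ordered haplotype pairs whose allele counts add up to the genotype."""
    return [(a, b) for a in HAPLOTYPES for b in HAPLOTYPES
            if all(a[j] + b[j] == genotype[j] for j in range(2))]


def log_likelihood(genotypes, freqs):
    """Observed-data log-likelihood under HWE (ordered haplotype pairs)."""
    return sum(
        math.log(sum(freqs[a] * freqs[b] for a, b in compatible_pairs(g)))
        for g in genotypes)


def em_haplotype_freqs(genotypes, n_iter=50):
    """EM for haplotype frequencies; returns estimates and likelihood trace."""
    freqs = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    trace = [log_likelihood(genotypes, freqs)]
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for g in genotypes:
            # E-step: posterior weight of each compatible haplotype pair.
            pairs = compatible_pairs(g)
            weights = [freqs[a] * freqs[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        trace.append(log_likelihood(genotypes, freqs))
    return freqs, trace
```

Only double heterozygotes, coded (1, 1), are ambiguous here: they are compatible with either 00/11 or 01/10. The recorded trace of log-likelihood values is non-decreasing across iterations, which is the property referred to above.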

4. RESPONSE TO TZENG AND ROEDER

HWE requires stringent conditions (i.e., random mating, no viability and/or fertility differential of alleles, no immigration or emigration, no mutation, and infinite population size) and is unlikely to hold in human populations. We agree with Drs. Tzeng and Roeder (TR hereinafter) that population substructure is the primary source of departure from HWE (assuming that laboratory errors have been corrected). Population substructure will cause spurious associations if and only if the risks of disease vary among subpopulations and the substructure is not accounted for in the analysis. Our work contains HWE as a special case and enables one to test this assumption. We wish to accommodate Hardy–Weinberg disequilibrium because the analysis of case-control data is highly sensitive to violation of HWE, and using model (3) greatly reduces the bias even when the disequilibrium does not conform to (3) (see Satten and Epstein 2004). Allowing Hardy–Weinberg disequilibrium adds very little complexity to our algorithms.
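A one-parameter extension of HWE can be sketched as follows. The form used here, with a single fixation-index parameter `rho` that inflates the probability of carrying two copies of the same haplotype, is a common choice and is assumed for illustration; whether it matches model (3) exactly should be checked against the article. The function name is hypothetical.

```python
def diplotype_probs(freqs, rho=0.0):
    """Probabilities of ordered haplotype pairs (h_k, h_l).

    Assumed form: P(k, l) = (1 - rho) * pi_k * pi_l + rho * pi_k * [k == l].
    Setting rho = 0 recovers Hardy-Weinberg equilibrium.
    """
    K = len(freqs)
    return [[(1 - rho) * freqs[k] * freqs[l]
             + (rho * freqs[k] if k == l else 0.0)
             for l in range(K)]
            for k in range(K)]


# Example with three haplotypes and a modest excess of homozygotes.
probs = diplotype_probs([0.5, 0.3, 0.2], rho=0.1)
```

Under this form the probabilities still sum to 1 and each row sums to the corresponding haplotype frequency, so the marginal frequencies are unchanged and only one extra parameter is needed. Testing `rho = 0` then amounts to testing HWE.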

Although genomewide association studies are on the horizon, most genetic epidemiology studies are concerned with candidate genes. In those studies, epidemiologists are often interested in both testing and estimation. Because our methods are likelihood-based, parameter estimation and hypothesis testing are encompassed within a common framework. We share TR's view that testing for the presence of a genetic effect is more realistic than obtaining refined estimates of the effect size. Although our numerical illustrations focus on models that specify a disease-susceptibility haplotype, our theory and numerical algorithms allow other types of models. Extension to genomewide association mapping is also possible, as we mentioned earlier in our response to Dr. Sabatti.

The relative efficiency of haplotype analysis versus single marker analysis depends on a number of factors, including the nature of the SNP–disease association, number and positions of disease-causing SNPs, extent and strength of linkage disequilibrium, and selection of markers. Haplotype analysis is likely to be more powerful than single marker analysis if the causal SNPs are not typed or if there are strong interactions of multiple SNPs on the same chromosome. When there are a large number of haplotypes, it is sensible to group them by the methods of Tzeng (2005) and Seltman et al. (2001, 2003). The haplotype-sharing statistic developed by Tzeng et al. (2003) is a useful approach for testing the presence of a genetic effect.

As with any type of observational study, it is highly challenging to come up with the correct, or even an approximately correct, model. The small simulation study conducted by TR demonstrated the potential bias of parameter estimators under model misspecification and ascertainment bias. Because there is no bias when the true parameter value is 0, hypothesis testing still may be appropriate. We certainly agree with TR that refined models will become more useful as the data improve and become more focused.

5. RESPONSE TO LI

As mentioned in our response to CSCC, we have already developed the NPMLEs for the class of semiparametric transformation models under the case-cohort and nested case-control designs. The EM algorithm indeed can be used to estimate all of the parameters simultaneously. The estimators again achieve the semiparametric efficiency bounds. Our simulation studies showed that the case-cohort and nested case-control designs are highly cost-effective.

Dr. Li described a number of methods for parameterizing genetic variants. He also suggested incorporating known biological information into the modeling and analysis. Exploring these interesting ideas will entail considerable statistical innovation and yield useful new methods for genetic association analysis.
