Human Heredity 21: 523-542 (1971)

A General Model for the Genetic Analysis of Pedigree Data¹

R. C. ELSTON and J. STEWART

Department of Biostatistics and the Genetics Curriculum, University of North Carolina, Chapel Hill, N.C., and Department of Genetics, Milton Road, Cambridge

Abstract. Assuming random mating and random sampling of pedigrees, the likelihood of a set of pedigree data is developed in terms of: (1) the population distribution of the different genotypes; (2) the phenotypic distributions for the different genotypes, and (3) the genotypic distribution of offspring given the parents' genotypes. This last is given for any number of unlinked autosomal loci, two linked autosomal loci, an X-linked locus, and combinations of these possibilities. Methods are given for using this likelihood to test specific genetic hypotheses and for genetic counselling.

Key Words: Pedigree data; Genetic analysis

I. Introduction

The purpose of analysing pedigree data is to establish the presence or absence of a genetic mechanism for the manifestation of a particular trait or set of traits; to elucidate such a mechanism, if it is present; and to classify individuals for their genotypes. By pedigree data we mean data collected on one or more groups of related individuals, a group being more extensive than just parents and children (families): thus more than two generations will be involved. Whereas it is possible to examine genetic mechanisms without such data, using families, pairs of relatives, or even unrelated individuals, pedigree data provide the most genetic information. A typical pedigree may comprise a hundred or so individuals covering four or more generations, and from a genetic point of view such a group of individuals is capable of yielding far more information than can be obtained from the same number of individuals divided up into small unrelated groups. Yet, except for the very specialized purpose of studying genetic linkage, informative statistical analysis of such data seems to have been completely ignored.

¹ This investigation was supported by a Public Health Service Research Career Development Award (1-K3-GM-31,732), training grant (GM 00685) and research grant (GM HD 16697) from the National Institute of General Medical Sciences.

It is a common practice among geneticists, when wishing to establish a certain genetic hypothesis from pedigree data, to divide the pedigree up into families (i.e. groups of parents and children) and analyze the data as though the families were independently sampled. This practice 'wastes' information, even though it is not always statistically invalid. Such analyses have probably done little harm in the past, since so far they have usually been used to study dichotomous traits with a view to establishing a 'dominant' or 'recessive' mode of inheritance. The families are of course not independent, but under a simple null hypothesis there is independent segregation in each of the families. However, if we are to examine quantitative traits, or traits whose manifestation is influenced to a large degree by the environment, a more powerful analysis becomes necessary. It is the purpose of this paper to indicate, in broad outline and with a fair amount of generality, an approach to this problem.

II. The Basic Probability Model: Likelihood of a Set of Data

In this section we derive, under very general assumptions as to the underlying genetic model, the likelihood that a particular set of pedigree data should be observed. We shall elaborate in section III certain special cases, and indicate in section V how this likelihood formulation can be used to test specific genetic hypotheses. If the data are distributed discretely this likelihood is in fact the probability that the particular data should be observed; if the data are distributed continuously the likelihood is no longer a true probability, but it is intuitively helpful to think of it as such.

A. Notation for Data

We need first some notation to identify the measures on each member of the pedigree. We consider here only the case in which there are no consanguineous marriages, so that each member of the pedigree is one of two types: either he is related to someone in the previous generation, or, if not, he is an unrelated person 'marrying into' the pedigree. Measures on the first type of person will be denoted by x, on the second type by y. In the case of the original parents of a pedigree, it is arbitrary as to which is termed x and which is termed y. The use of subscripted subscripts, though clumsy typographically, will be helpful in the sequel to keep track of the different generations.

Fig. 1. Hypothetical examples to illustrate the notation for the beginnings of the first 2 pedigrees.

In general, the data will consist of separate pedigrees, numbered $1, 2, \ldots, i_0, \ldots$ Let the measures on the original parents of the $i_0$-th pedigree be $x_{i_0}$ and $y_{i_0}$ (fig. 1). Let the measure on the $i_1$-th child of these original parents be $x_{i_0 i_1}$, and that on his or her spouse be $y_{i_0 i_1}$; similarly let the measure on the $i_2$-th child of the $i_1$-th child of the $i_0$-th original parents be $x_{i_0 i_1 i_2}$, and that on his or her spouse be $y_{i_0 i_1 i_2}$; and so on (fig. 1). In general, the measure on an individual in the $j$-th generation of the pedigree, counting the original parents as generation 0, will be of the form $x_{i_0 i_1 i_2 \ldots i_{j-1} i_j}$; this being the $i_j$-th child of the $i_{j-1}$-th child of the ... of the $i_2$-th child of the $i_1$-th child of the $i_0$-th original parents.

B. Relationship between Genotype and Phenotype

Let k be the number of different genotypes that cause variation in the trait measured; in particular it is the smallest number of distinguishable genotypes that must be postulated to exist in the population to account for the segregation occurring in our particular sample of pedigrees. Thus k must be equal to or greater than the number of different genotypes, influencing the trait concerned, that occur in our sample. For example, if in our sample data there is segregation at just one of several loci that can affect the trait, and furthermore only two alleles occur in our data, three genotypes are possible, say AA, Aa and aa. In this case k will be three, whether or not all three of these genotypes occur in the sample data. The genotypes can be arranged in some specified order, and so we can talk of the u-th genotype, u = 1, 2, ..., k.

Frequently a single phenotype is associated with each genotype, but this ignores the possibility of misclassification. More importantly, it does not cater for the case of quantitative inheritance, where each genotype is associated with a range of phenotypic values, the variation within each genotype being due to environmental influences. We, therefore, associate a probability density function with each genotype. If x is the trait measured, denote this function by $g_u(x)$ for the u-th genotype. There will thus be k such functions, not necessarily all distinct. For example, dominance at a single locus with two alleles segregating would imply that two of the three density functions are identical. There is no major difficulty in letting $g_u(x)$ be age and/or sex dependent, but, for simplicity of notation, this will not be done here. In the discrete case $g_u(x)$ is a multinomial distribution (binomial for the special case of a dichotomous trait); in the continuous case it will usually be reasonable to assume it is a normal distribution (x being a suitable transformation, if necessary, of the scale of measurement). Thus, $g_u(x)$ is the conditional probability (density), given the u-th genotype, that x should be observed.

The $g_u(x)$ (u = 1, 2, ..., k) may be known independently, but usually they will need to be estimated from the data. For example, x might be a continuous character whose distribution in the population suggests trimodality, and the genetic model it is desired to test is that of a single locus with two alleles resulting in three genotypes. We could then fit a mixture of three normal distributions to the overall data, estimating the three means, relative proportions, and a common variance by maximum likelihood [MURPHY and BOLLING, 1967]. These estimates are then used as starting values in the maximization of the likelihood that will now be developed.
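As an illustration of how such starting values might be obtained, the following is a minimal sketch that fits a mixture of normals with a common variance. It uses the EM algorithm for brevity, which is an assumption of this sketch and not necessarily the procedure of MURPHY and BOLLING; the function name `fit_normal_mixture`, the simulated data and the starting values are all hypothetical.

```python
import math
import random

def fit_normal_mixture(xs, means, props, var, iters=200):
    """EM iterations for a mixture of normals sharing one variance.

    means, props and var are starting values; the updated values are returned.
    """
    k, n = len(means), len(xs)
    for _ in range(iters):
        # E-step: posterior weight of each component for each observation.
        # The factor 1/sqrt(2*pi*var) is common to all components and cancels.
        resp = []
        for x in xs:
            w = [props[u] * math.exp(-(x - means[u]) ** 2 / (2.0 * var))
                 for u in range(k)]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: weighted means, proportions and pooled (common) variance.
        weight = [sum(r[u] for r in resp) for u in range(k)]
        means = [sum(r[u] * x for r, x in zip(resp, xs)) / weight[u]
                 for u in range(k)]
        props = [weight[u] / n for u in range(k)]
        var = sum(r[u] * (x - means[u]) ** 2
                  for r, x in zip(resp, xs) for u in range(k)) / n
    return means, props, var

# Simulated trimodal data (illustrative only): means 0, 3, 6, common s.d. 0.5.
random.seed(0)
truth = [(0.0, 150), (3.0, 300), (6.0, 150)]
data = [random.gauss(m, 0.5) for m, count in truth for _ in range(count)]
fitted_means, fitted_props, fitted_var = fit_normal_mixture(
    data, [0.5, 2.5, 5.5], [1 / 3, 1 / 3, 1 / 3], 1.0)
```

The fitted means, proportions and variance would then serve only as starting values for maximizing the pedigree likelihood, not as final estimates.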

C. Construction of the Likelihood

We shall start by considering the likelihood for a single sibship. The key here is that, given the genotypes of both parents, the genotypes of all the offspring are independent of each other. Let $p_{stu}$ be the probability that an individual has genotype u, given that his parents' genotypes are s and t (s, t and u can each have the values 1, 2, ..., k). Then if the values of x for a sibship of size n are $x_1, x_2, \ldots, x_n$, the likelihood of the sibship given that the parents have genotypes s and t is:

$$\prod_{i=1}^{n} \sum_{u=1}^{k} p_{stu}\, g_u(x_i). \tag{1}$$
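The sibship likelihood (1) can be sketched in a few lines. The array `P2` below is the one-locus, two-allele transition array that the paper gives later in table I (genotypes indexed from 0 here rather than 1); the normal densities with means 0, 1, 2 and standard deviation 0.5 are hypothetical values chosen only for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# P2[s][t][u]: probability of offspring genotype u given parental genotypes
# s and t, with AA = 0, Aa = 1, aa = 2 (table I of the paper).
P2 = [
    [[1, 0, 0], [0.5, 0.5, 0], [0, 1, 0]],
    [[0.5, 0.5, 0], [0.25, 0.5, 0.25], [0, 0.5, 0.5]],
    [[0, 1, 0], [0, 0.5, 0.5], [0, 0, 1]],
]

def sibship_likelihood(xs, s, t, P, g):
    """Expression (1): product over sibs of sum over u of p_stu * g_u(x_i)."""
    like = 1.0
    for x in xs:
        like *= sum(P[s][t][u] * g[u](x) for u in range(len(g)))
    return like

# Hypothetical phenotype densities g_u, one per genotype.
g = [lambda x, m=m: normal_pdf(x, m, 0.5) for m in (0.0, 1.0, 2.0)]

# Three sibs from an Aa x Aa mating.
example = sibship_likelihood([0.1, 1.2, 0.9], 1, 1, P2, g)
```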

Now let $\psi_v$ be the probability that an individual should be of the v-th genotype, i.e. $\psi_v$ is the proportion of individuals in the population who have the v-th genotype. Then the likelihood of observing the spouse of the i-th member of the sibship, whose measure is $y_i$, is simply

$$\sum_{v=1}^{k} \psi_v\, g_v(y_i). \tag{2}$$

Thus the likelihood of observing the sibship and their spouses, given that the parents have genotypes s and t, can be written:

$$\prod_{i=1}^{n} \sum_{u=1}^{k} p_{stu}\, g_u(x_i) \sum_{v=1}^{k} \psi_v\, g_v(y_i). \tag{3}$$

The expression (3) is of course a function of s and t. However, the s and t in this expression correspond to the u and v in a similar expression for the previous generation. This relationship between the generations can be expressed by rewriting (3) as the likelihood of observing a sibship and their spouses of the j-th generation, using the notation of section A above. We have:

$$\Gamma_j = \prod_{i_j} \sum_{s_j=1}^{k} p_{s_{j-1} t_{j-1} s_j}\, g_{s_j}(x_{i_0 i_1 \ldots i_j}) \sum_{t_j=1}^{k} \psi_{t_j}\, g_{t_j}(y_{i_0 i_1 \ldots i_j}). \tag{4}$$

$\Gamma_j$ is thus an operator which is a function of $s_{j-1}$ and $t_{j-1}$, since it is the likelihood of observing a sibship and their spouses in generation j conditional on the sibs' parents being of genotypes $s_{j-1}$ and $t_{j-1}$, respectively. But the likelihood of the parents, if they are of genotypes $s_{j-1}$ and $t_{j-1}$, is the term

$$p_{s_{j-2} t_{j-2} s_{j-1}}\, g_{s_{j-1}}(x_{i_0 i_1 \ldots i_{j-1}})\, \psi_{t_{j-1}}\, g_{t_{j-1}}(y_{i_0 i_1 \ldots i_{j-1}}) \tag{5}$$

in the likelihood of observing a (j-1)th generation sibship and spouses, except that this is conditional on their parents being of genotypes $s_{j-2}$ and $t_{j-2}$. Thus

$$\Gamma_{j-1} = \prod_{i_{j-1}} \sum_{s_{j-1}=1}^{k} p_{s_{j-2} t_{j-2} s_{j-1}}\, g_{s_{j-1}}(x_{i_0 i_1 \ldots i_{j-1}}) \sum_{t_{j-1}=1}^{k} \psi_{t_{j-1}}\, g_{t_{j-1}}(y_{i_0 i_1 \ldots i_{j-1}}) \tag{6}$$

and the likelihood of observing both the j-th and (j-1)th generation sibships and spouses, conditional upon the (j-2)th generation being of genotypes $s_{j-2}$ and $t_{j-2}$, respectively, is given by

$$\Gamma_{j-1}\, \Gamma_j. \tag{7}$$

(This is not $\Gamma_{j-1}$ multiplied by $\Gamma_j$, but rather the operation (6) performed on the operation (4); in other words it is the expression obtained if the right hand side of (6) is followed immediately by the right hand side of (4), all products and sums being performed in the naturally implied sequence.)

The likelihood of the entire data can thus be expressed as the sequence of operations

$$\Gamma_0\, \Gamma_1\, \Gamma_2 \ldots \Gamma_j \ldots \tag{8}$$

provided we define $p_{s_{j-1} t_{j-1} s_j}$ as $\psi_{s_j}$ when j = 0. This corresponds to the likelihood of observing the measures on all of the original parents in the data being:

$$\prod_{i_0} \sum_{s_0=1}^{k} \psi_{s_0}\, g_{s_0}(x_{i_0}) \sum_{t_0=1}^{k} \psi_{t_0}\, g_{t_0}(y_{i_0}). \tag{9}$$

This is correct on the assumption that mating is random with respect to the k genotypes, that the sampling of pedigrees is independent, and that any ascertainment problems may be ignored. Expression (8) with $\Gamma_j$ defined as in (4) enables us to calculate the likelihood of observing a set of pedigree data. It should be noted that missing observations are easily handled in this expression: the corresponding $g_u$ are simply set equal to unity wherever they occur.
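The nested sequence of operations in (8) maps naturally onto a recursion over generations. The sketch below is one possible reading of it for a single pedigree without consanguinity; the `Member` class, the normal form of the phenotype densities and all parameter values are assumptions made for illustration, and missing observations are given a density of unity as described above.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Member:
    """A pedigree member (measure x), an optional spouse (measure y), and children."""
    x: Optional[float] = None   # None marks a missing observation
    y: Optional[float] = None   # None: spouse unobserved or absent
    children: List["Member"] = field(default_factory=list)

# One-locus, two-allele transition array (AA = 0, Aa = 1, aa = 2).
P2 = [
    [[1, 0, 0], [0.5, 0.5, 0], [0, 1, 0]],
    [[0.5, 0.5, 0], [0.25, 0.5, 0.25], [0, 0.5, 0.5]],
    [[0, 1, 0], [0, 0.5, 0.5], [0, 0, 1]],
]

def normal_pdf(value, mean, sd):
    z = (value - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def g(u, value, means, sd):
    """Phenotype density for genotype u; unity for a missing observation."""
    return 1.0 if value is None else normal_pdf(value, means[u], sd)

def gamma_operator(children, s_prev, t_prev, P, psi, means, sd):
    """Operator (4) for one sibship, followed by all later generations."""
    k = len(psi)
    result = 1.0
    for child in children:
        child_sum = 0.0
        for s in range(k):
            p = P[s_prev][t_prev][s]
            if p == 0:
                continue
            # Sum over the spouse's genotype, then descend to the next generation.
            spouse_sum = sum(
                psi[t] * g(t, child.y, means, sd)
                * gamma_operator(child.children, s, t, P, psi, means, sd)
                for t in range(k))
            child_sum += p * g(s, child.x, means, sd) * spouse_sum
        result *= child_sum
    return result

def pedigree_likelihood(founders, P, psi, means, sd):
    """Expression (8) for one pedigree, starting from the original parents (9)."""
    k = len(psi)
    return sum(
        psi[s] * g(s, founders.x, means, sd)
        * psi[t] * g(t, founders.y, means, sd)
        * gamma_operator(founders.children, s, t, P, psi, means, sd)
        for s in range(k) for t in range(k))

# Hypothetical three-generation pedigree and parameter values.
psi = [0.25, 0.5, 0.25]            # Hardy-Weinberg proportions, allele frequency 0.5
means, sd = [0.0, 1.0, 2.0], 0.5
pedigree = Member(x=0.2, y=1.1, children=[
    Member(x=0.9, y=1.8, children=[Member(x=1.7), Member(x=None)]),
    Member(x=0.1),
])
likelihood = pedigree_likelihood(pedigree, P2, psi, means, sd)
```

For several independently sampled pedigrees the overall likelihood would be the product of the individual values. The recursion visits every combination of genotypes along each line of descent, so this sketch is suitable only for small k.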

III. Specification of the Genetic Transition Matrix for Different Genetic Models

The conditional probabilities $p_{stu}$ make up a three-dimensional stochastic matrix that we can conveniently call the genetic transition matrix and denote by $\mathbf{P}$. As already explained, the genotype-phenotype correspondences are specified by the functions $g_u$. On the other hand the purely genetic part of the model is specified by the genetic transition matrix, since the probability that an individual should be of genotype u, given that his parents have genotypes s and t, depends only on the genetic mechanism of inheritance (which includes Mendelian segregation, recombination and selection).

A. One Autosomal Locus

Let $\mathbf{P}_a$ be the genetic transition matrix for the segregation of a alleles at one autosomal locus, with complete viability of each genotype. The simplest example is $\mathbf{P}_2$, which is given in table I. The two alleles are taken to be A and a, and the genotypes are numbered AA = 1, Aa = 2 and aa = 3; each entry in the table is the appropriate value of the vector $(p_{st1}\ p_{st2}\ p_{st3})$.

B. One X-Linked Locus

For the segregation of b alleles at an X-linked locus, with complete viability of each genotype, we can let $\mathbf{P}_{b♀}$ and $\mathbf{P}_{b♂}$ be the genetic transition matrices for female and male children, respectively. The matrices $\mathbf{P}_{2♀}$ and $\mathbf{P}_{2♂}$ are given in table II, in which the hemizygous genotypes are numbered A· = 1 and a· = 2. In the previous section we have associated the subscript s with an x-measurement and the subscript t with a y-measurement, while in table II s, the row of the matrix, is associated with the male parent and t, the column, is associated with the female parent. Thus $\mathbf{P}_{2♀}$ and $\mathbf{P}_{2♂}$, as defined by table II, are appropriate when it is the mother who marries into the pedigree.

Table I. $\mathbf{P}_2$, the genetic transition matrix for two alleles at one autosomal locus with complete viability of each genotype (no selection)

| s \ t | 1 = AA | 2 = Aa | 3 = aa |
|---|---|---|---|
| 1 = AA | (1 0 0) | (1/2 1/2 0) | (0 1 0) |
| 2 = Aa | (1/2 1/2 0) | (1/4 1/2 1/4) | (0 1/2 1/2) |
| 3 = aa | (0 1 0) | (0 1/2 1/2) | (0 0 1) |

Table II. $\mathbf{P}_{2♀}$ and $\mathbf{P}_{2♂}$, the genetic transition matrices for two alleles at one X-linked locus with complete viability of each genotype (no selection)

| Children | s \ t | 1 = AA | 2 = Aa | 3 = aa |
|---|---|---|---|---|
| Female | 1 = A· | (1 0 0) | (1/2 1/2 0) | (0 1 0) |
| Female | 2 = a· | (0 1 0) | (0 1/2 1/2) | (0 0 1) |
| Male | 1 = A· | (1 0) | (1/2 1/2) | (0 1) |
| Male | 2 = a· | (1 0) | (1/2 1/2) | (0 1) |

The appropriate matrices for the case where it is the father who marries into the pedigree can be simply denoted $\mathbf{P}'_{2♀}$ and $\mathbf{P}'_{2♂}$, i.e. the transposes of two-dimensional matrices with vector elements.

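C. Multiple Unlinked Loci

By the use of Kronecker products, it is a simple matter to generate from these matrices the appropriate genetic transition matrix for any number of unlinked loci. If $\mathbf{A}$ and $\mathbf{B}$ are any three-dimensional matrices, which we can consider as two-dimensional matrices with vector elements $\mathbf{a}_{ij} = (a_{ij1}, a_{ij2}, \ldots)$ and $\mathbf{b}_{ij} = (b_{ij1}, b_{ij2}, \ldots)$, respectively, then the Kronecker product of $\mathbf{B}$ premultiplied by $\mathbf{A}$ is given by

$$\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix}
(a_{111}\mathbf{b}_{11}\ \ a_{112}\mathbf{b}_{11} \cdots) & (a_{111}\mathbf{b}_{12}\ \ a_{112}\mathbf{b}_{12} \cdots) & \cdots & (a_{121}\mathbf{b}_{11}\ \ a_{122}\mathbf{b}_{11} \cdots) & \cdots \\
(a_{111}\mathbf{b}_{21}\ \ a_{112}\mathbf{b}_{21} \cdots) & (a_{111}\mathbf{b}_{22}\ \ a_{112}\mathbf{b}_{22} \cdots) & \cdots & (a_{121}\mathbf{b}_{21}\ \ a_{122}\mathbf{b}_{21} \cdots) & \cdots \\
\vdots & \vdots & & \vdots & \\
(a_{211}\mathbf{b}_{11}\ \ a_{212}\mathbf{b}_{11} \cdots) & (a_{211}\mathbf{b}_{12}\ \ a_{212}\mathbf{b}_{12} \cdots) & \cdots & (a_{221}\mathbf{b}_{11}\ \ a_{222}\mathbf{b}_{11} \cdots) & \cdots \\
(a_{211}\mathbf{b}_{21}\ \ a_{212}\mathbf{b}_{21} \cdots) & (a_{211}\mathbf{b}_{22}\ \ a_{212}\mathbf{b}_{22} \cdots) & \cdots & (a_{221}\mathbf{b}_{21}\ \ a_{222}\mathbf{b}_{21} \cdots) & \cdots \\
\vdots & \vdots & & \vdots &
\end{pmatrix}$$

Using this notation, the genetic transition matrix for two unlinked autosomal loci, one with $a_1$ alleles and the other with $a_2$ alleles, is simply $\mathbf{P}_{a_1} \otimes \mathbf{P}_{a_2}$; similarly, if an X-linked locus with b alleles is also involved, the genetic transition matrix is one of the four matrices $\mathbf{P}_{a_1} \otimes \mathbf{P}_{a_2} \otimes \mathbf{P}_{b♀}$, $\mathbf{P}_{a_1} \otimes \mathbf{P}_{a_2} \otimes \mathbf{P}_{b♂}$, $\mathbf{P}_{a_1} \otimes \mathbf{P}_{a_2} \otimes \mathbf{P}'_{b♀}$ or $\mathbf{P}_{a_1} \otimes \mathbf{P}_{a_2} \otimes \mathbf{P}'_{b♂}$, depending on whether it is the mother or father who marries into the pedigree and whether a daughter or son is involved.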
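A minimal sketch of this Kronecker product for three-dimensional arrays stored as nested lists follows. The function name `kron3` and the zero-based indexing are choices of this sketch; a joint genotype at two loci is indexed as (index at first locus) × (number of genotypes at second locus) + (index at second locus), which matches the block layout displayed above.

```python
def kron3(A, B):
    """Kronecker product of two three-dimensional transition arrays.

    A[i][j][l] and B[i2][j2][l2] combine into
    P[i * rows_b + i2][j * cols_b + j2][l * depth_b + l2] = A[i][j][l] * B[i2][j2][l2].
    """
    rows_b, cols_b, depth_b = len(B), len(B[0]), len(B[0][0])
    rows = len(A) * rows_b
    cols = len(A[0]) * cols_b
    depth = len(A[0][0]) * depth_b
    P = [[[0.0] * depth for _ in range(cols)] for _ in range(rows)]
    for i, a_row in enumerate(A):
        for j, a_vec in enumerate(a_row):
            for l, a in enumerate(a_vec):
                for i2, b_row in enumerate(B):
                    for j2, b_vec in enumerate(b_row):
                        for l2, b in enumerate(b_vec):
                            P[i * rows_b + i2][j * cols_b + j2][l * depth_b + l2] = a * b
    return P

# One-locus, two-allele array of table I (AA = 0, Aa = 1, aa = 2).
P2 = [
    [[1, 0, 0], [0.5, 0.5, 0], [0, 1, 0]],
    [[0.5, 0.5, 0], [0.25, 0.5, 0.25], [0, 0.5, 0.5]],
    [[0, 1, 0], [0, 0.5, 0.5], [0, 0, 1]],
]

# Transition array for two unlinked two-allele loci: 9 x 9 x 9.
P22 = kron3(P2, P2)
```

Because the function reads the three dimensions of each argument separately, it also accepts the non-cubic X-linked arrays of table II.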

D. Two Linked Autosomal Loci

Before going on to the case of two linked autosomal loci, it is convenient to consider first an expanded form of $\mathbf{P}_2$ in which two 'different' heterozygotes are distinguished, depending on which allele was contributed by which parent. Thus we now recognize, for two alleles at one locus, the four genotypes AA, Aa, aA and aa, where in each case the first allele is contributed by the person already in the pedigree (the 'x-parent') and the second allele by the person marrying in (the 'y-parent'). The expanded form of $\mathbf{P}_2$ is then as given in table III. The expanded form of $\mathbf{P}_a$, for arbitrary a, can be easily generated by the following algorithm. Denote the alleles by the numbers 1, 2, ..., a, and genotypes by the number pairs ij (i, j = 1, 2, ..., a). Let $\tau_{ijk}$ be the probability that a parent with genotype ij should transmit allele k to his offspring; we shall call such a quantity a transmission probability. For an autosomal locus with complete viability of each genotype we have the simple relation

$$\tau_{ijk} = \tfrac{1}{2}(\delta_{ik} + \delta_{jk}), \tag{10}$$

where $\delta_{uv}$ is the usual Kronecker delta:

$$\delta_{uv} = \begin{cases} 1 & \text{if } u = v \\ 0 & \text{if } u \neq v. \end{cases}$$

Then if we rewrite $p_{stu}$ as $p_{ss'tt'uu'}$, the probability an individual should have genotype uu', given that his parents' genotypes are ss' and tt', we immediately see that the elements of the expanded form of $\mathbf{P}_a$ are

$$p_{ss'tt'uu'} = \tau_{ss'u}\, \tau_{tt'u'}. \tag{11}$$

It can be verified that for a = 2, (10) and (11) give rise to the 4 × 4 × 4 matrix shown in table III.

Table III. Expanded form of $\mathbf{P}_2$

| s \ t | 1 = AA | 2 = Aa | 3 = aA | 4 = aa |
|---|---|---|---|---|
| 1 = AA | (1 0 0 0) | (1/2 1/2 0 0) | (1/2 1/2 0 0) | (0 1 0 0) |
| 2 = Aa | (1/2 0 1/2 0) | (1/4 1/4 1/4 1/4) | (1/4 1/4 1/4 1/4) | (0 1/2 0 1/2) |
| 3 = aA | (1/2 0 1/2 0) | (1/4 1/4 1/4 1/4) | (1/4 1/4 1/4 1/4) | (0 1/2 0 1/2) |
| 4 = aa | (0 0 1 0) | (0 0 1/2 1/2) | (0 0 1/2 1/2) | (0 0 0 1) |
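The algorithm defined by (10) and (11) can be sketched directly. The code below builds the expanded array for any number of alleles a, with alleles numbered from 0 and ordered genotypes stored as pairs; the dictionary layout and function names are assumptions of this sketch. Exact fractions are used so that entries can be compared with table III without rounding.

```python
from fractions import Fraction
from itertools import product

def tau(i, j, k):
    """Transmission probability (10): parent with genotype ij transmits allele k."""
    return Fraction((i == k) + (j == k), 2)

def expanded_transition(a):
    """Expanded transition array from (11), keyed by (s s', t t', u u') pairs."""
    genotypes = list(product(range(a), repeat=2))   # ordered pairs (x-allele, y-allele)
    P = {}
    for (s, s2), (t, t2), (u, u2) in product(genotypes, repeat=3):
        # First offspring allele comes from the x-parent, second from the y-parent.
        P[(s, s2), (t, t2), (u, u2)] = tau(s, s2, u) * tau(t, t2, u2)
    return genotypes, P

# Two alleles: A = 0, a = 1, so genotypes are AA, Aa, aA, aa in that order.
genotypes, P = expanded_transition(2)
row_Aa_by_AA = [P[(0, 1), (0, 0), u] for u in genotypes]
```

Here `row_Aa_by_AA` holds the vector for s = Aa and t = AA, which can be checked against the corresponding cell of table III.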

When two autosomal loci are involved, with, say, $a_1$ alleles at locus 1 and $a_2$ alleles at locus 2, we distinguish $a_1 a_2$ different gametes and $a_1^2 a_2^2$ different genotypes. The transmission probabilities we are now interested in are $\tau_{i_1 j_1 i_2 j_2 k_1 k_2}$, the probability that a parent whose genotypic constitution is $i_1 j_1$ at locus 1 and $i_2 j_2$ at locus 2 (i.e. $i_1 i_2 / j_1 j_2$) should transmit the gamete $k_1 k_2$ to his offspring ($i_1, j_1, k_1 = 1, 2, \ldots, a_1$; $i_2, j_2, k_2 = 1, 2, \ldots, a_2$). This

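probability depends upon the recombination fraction, which we shall denote $\lambda$, and is given by

$$\tau_{i_1 j_1 i_2 j_2 k_1 k_2} = \tfrac{1}{2}(1-\lambda)\,(\delta_{i_1 k_1}\delta_{i_2 k_2} + \delta_{j_1 k_1}\delta_{j_2 k_2}) + \tfrac{1}{2}\lambda\,(\delta_{i_1 k_1}\delta_{j_2 k_2} + \delta_{j_1 k_1}\delta_{i_2 k_2}). \tag{12}$$

Thus the elements of $\mathbf{P}$ for two autosomal loci with recombination fraction $\lambda$ are given, analogous to (11) (but with each genotype now being specified by four alleles, two subscripted 1 for locus 1 and two subscripted 2 for locus 2), by

$$p_{s_1 s_1' s_2 s_2'\; t_1 t_1' t_2 t_2'\; u_1 u_1' u_2 u_2'} = \tau_{s_1 s_1' s_2 s_2' u_1 u_2}\; \tau_{t_1 t_1' t_2 t_2' u_1' u_2'},$$

where the transmission probabilities are defined by (12).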
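A minimal sketch of a two-locus transmission probability follows, assuming the usual meaning of the recombination fraction: each of the two parental haplotypes $i_1 i_2$ and $j_1 j_2$ is transmitted intact with probability $(1-\lambda)/2$, and each of the two recombinant gametes $i_1 j_2$ and $j_1 i_2$ with probability $\lambda/2$. The function name and argument order are hypothetical.

```python
from itertools import product

def tau_linked(i1, j1, i2, j2, k1, k2, lam):
    """Probability that a parent with haplotypes i1 i2 / j1 j2 transmits gamete k1 k2.

    lam is the recombination fraction between the two loci.
    """
    def delta(u, v):
        return 1.0 if u == v else 0.0
    non_recombinant = delta(i1, k1) * delta(i2, k2) + delta(j1, k1) * delta(j2, k2)
    recombinant = delta(i1, k1) * delta(j2, k2) + delta(j1, k1) * delta(i2, k2)
    return 0.5 * (1.0 - lam) * non_recombinant + 0.5 * lam * recombinant

# Double heterozygote AB/ab (alleles coded A = B = 0, a = b = 1), lambda = 0.1.
lam = 0.1
gametes = {
    (k1, k2): tau_linked(0, 1, 0, 1, k1, k2, lam)
    for k1, k2 in product(range(2), repeat=2)
}
```

With $\lambda = 1/2$ the expression factorizes into the product of two single-locus transmission probabilities of the form (10), which is the unlinked case.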

It should be noted that if x is a single trait, so that $g_u(x)$ is a univariate distribution, the $\mathbf{P}$ matrix just described will be appropriate if the single trait is genetically determined by the action of genes at two linked loci. On the other hand, the same $\mathbf{P}$ matrix is also appropriate for the case where there is linkage between two autosomal loci affecting different traits. In this case, however, x is a vector of two variables and $g_u(x)$ is a bivariate distribution; and, provided there is no epistatic interaction between the two loci, this bivariate distribution can be expressed as the product of two univariate distributions, one for each trait.

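E. Infinite Number of Equal and Additive Unlinked Loci

Suppose we have $l$ unlinked autosomal loci, and represent gametes by the vectors $\mathbf{i}$, $\mathbf{j}$, $\mathbf{k}$, each with $l$ elements $i_h$, $j_h$ and $k_h$, respectively. Then the appropriate transmission probabilities are

$$\tau_{\mathbf{i}\mathbf{j}\mathbf{k}} = \prod_{h=1}^{l} \tfrac{1}{2}(\delta_{i_h k_h} + \delta_{j_h k_h}), \tag{13}$$

and again, analogous to (11), we have

$$p_{\mathbf{s}\mathbf{s}'\mathbf{t}\mathbf{t}'\mathbf{u}\mathbf{u}'} = \tau_{\mathbf{s}\mathbf{s}'\mathbf{u}}\; \tau_{\mathbf{t}\mathbf{t}'\mathbf{u}'}. \tag{14}$$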

Since the number of elements in $\mathbf{P}$ increases rapidly as $l$ increases, it is useful, for computational purposes, to determine a simpler expression for the limiting case as $l$ tends to infinity. This can be done if we assume there are only two alleles at each locus, and that the magnitude of the effect of a gene substitution is the same at all loci: this is a commonly assumed model for polygenic inheritance of a quantitative trait.

Letting the allele number at each locus be 0 or 1, depending on whether the allele decreases or increases the measure on the trait in question, we can, because of our assumptions, replace our vectorial representation of each gamete by the sum of the elements in the vector, and each genotype as the sum of two gametes; for now the phenotypic distribution to be associated with a given genotype is determined solely by the number of '1' alleles in that genotype. Now as $l$ tends to infinity, we can consider gametes and genotypes as continuous normally distributed variables. Furthermore, as $l$ tends to infinity, the proportion of loci in a genotype that is heterozygous tends to a constant value; hence the variance of the population of gametes transmitted by any genotype also tends to a constant value, say $\sigma^2$, equal to half the additive genetic variance of the population. Under random mating all uniting gametes have an equal correlation of zero, and so the variance of genotypes within a sibship is always the same; in fact, if the parents' genotypes are s and t, the gametic distributions are $N(\tfrac{s}{2}, \sigma^2)$ and $N(\tfrac{t}{2}, \sigma^2)$, and (since the coefficient of relationship of sibs is one half) the genotype distribution within the sibship is $N(\tfrac{s+t}{2}, \sigma^2)$.

Using this result we can define a function of the three continuous variables s, t, u which is completely analogous to the matrix $\mathbf{P}$. Explicitly, putting $\tfrac{s+t}{2} = m$, this function is

$$\varphi(u-m, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(u-m)^2}{2\sigma^2}}, \tag{15}$$

i.e. the ordinate at u of the distribution $N(m, \sigma^2)$, which is equal to the ordinate at zero of the distribution $N(u-m, \sigma^2)$. Since $\mathbf{P}$ is now replaced by a continuous function, (4) must be modified accordingly. The trait x will necessarily be continuous, and the vector of densities $g_u(x)$ is replaced by a continuous function of the two variables x and u, $g(x|u)$, the conditional density of x given u. A reasonable functional form of $g(x|u)$ is $N(u + \mu, \sigma_e^2)$; then $\mu$ will in fact be the mean of x over all genotypes, and $\sigma_e^2$ the environmental variance. Similarly $\psi_v$ is replaced by $\varphi(v, 2\sigma^2)$, since the genotypic variance, being equal to the additive genetic variance under this model, is $2\sigma^2$. Finally, all the summations are replaced by integrals.
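The three continuous ingredients just described can be written down in a few lines. The sketch below defines $\varphi$, the transition density (15) and the phenotype density, and then uses a small simulation (an addition of this sketch, with a hypothetical value of $\sigma^2$) to check that the pieces are mutually consistent: parents drawn from $N(0, 2\sigma^2)$ and offspring drawn from $N(m, \sigma^2)$ should again give offspring genotypes with variance close to $2\sigma^2$.

```python
import math
import random

def phi(z, variance):
    """Ordinate at zero of N(z, variance), the function written as phi in (15)."""
    return math.exp(-z * z / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def transition_density(u, s, t, sigma2):
    """Continuous analogue of p_stu: density of offspring genotype u given s and t."""
    return phi(u - (s + t) / 2.0, sigma2)

def phenotype_density(x, u, mu, sigma2_e):
    """g(x | u), taken to be N(u + mu, sigma2_e)."""
    return phi(u + mu - x, sigma2_e)

def population_density(v, sigma2):
    """Continuous analogue of psi_v: genotypic values distributed N(0, 2 * sigma2)."""
    return phi(v, 2.0 * sigma2)

def simulate_offspring(n, sigma2, rng):
    """Draw n offspring genotypic values from randomly mated parents."""
    values = []
    for _ in range(n):
        s = rng.gauss(0.0, math.sqrt(2.0 * sigma2))
        t = rng.gauss(0.0, math.sqrt(2.0 * sigma2))
        values.append(rng.gauss((s + t) / 2.0, math.sqrt(sigma2)))
    return values

rng = random.Random(1)
sigma2 = 1.5                                  # hypothetical value
offspring = simulate_offspring(20000, sigma2, rng)
mean = sum(offspring) / len(offspring)
variance = sum((v - mean) ** 2 for v in offspring) / (len(offspring) - 1)
```

The mid-parent value has variance $\sigma^2$ and the within-sibship deviation adds another $\sigma^2$, so the simulated variance should be near $2\sigma^2$.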

Making these changes, (4) becomes

$$\Gamma_j = \prod_{i_j} \int_{s_j} \varphi\!\left(s_j - \tfrac{s_{j-1}+t_{j-1}}{2},\ \sigma^2\right) \varphi(s_j + \mu - x_{i_0 i_1 \ldots i_j},\ \sigma_e^2) \int_{t_j} \varphi(t_j,\ 2\sigma^2)\, \varphi(t_j + \mu - y_{i_0 i_1 \ldots i_j},\ \sigma_e^2), \tag{16}$$

where the symbol $\int_z$ is used to mean that everything following it is to be integrated with respect to z from minus infinity to plus infinity. Using (16) for $\Gamma_j$, (8) can be integrated algebraically and the result is $K\varphi(N, T^2)$, where K, N and T are functions of $\mu$, $\sigma^2$, $\sigma_e^2$ and the data. Although it is not feasible to write out explicitly what these functions are, it is not difficult to devise an algorithm for calculating them, given particular values of $\mu$, $\sigma^2$, $\sigma_e^2$ and the data, by repeated use of the fact that (all products and sums being over a set of n values for i)

$$\int_{-\infty}^{\infty} \prod_i \varphi(s + A_i t + B_i,\ C_i)\, ds = \kappa\, \varphi(t + \nu,\ T^2),$$

-00

where

and

F. Major Loci and an Infinite Number of Equal and Additive Loci

Finally it should be noted that there is no theoretical difficulty in defining a function $p_{stu}$ for the more general case where the genotype is partly made up of a few major loci, as in sections A-E, and partly made up of an 'infinite' number of equal and additive gene effects, as in the last section. The appropriate function is the product of: (a) an element of a $\mathbf{P}$-matrix, and (b) a function of the form (15); this is analogous to forming a Kronecker product for the case of two unlinked loci. Denoting the two parts of the genotype $u_1$ and $u_2$ respectively, we now have a vector of densities $g_{u_1}(x|u_2)$, $u_1 = 1, 2, \ldots, k$, where k is the number of genotypes determined by the major loci; assuming the two parts of the genotype interact additively, each density can be taken to be of the form $N(u_2 + \mu_{u_1}, \sigma_e^2)$. Similarly $\psi_v$ is replaced by the product of $\psi_{v_1}$ and the normal density $\varphi(v_2, 2\sigma^2)$, and we now have both summation and integration in (8). In this general case the computation of (8) can become very complicated, but it would not be unreasonable to attempt it for the particular case in which there is only one 2-allele major locus, using $\mathbf{P}_2$ as in table I, so that only three 'major genotypes' are involved.

IV. Genotypic Classification of Individuals

In this section we consider the problem of determining the most probable genotype for each individual assuming: (a) the genetic transition matrix, i.e. the genetic mechanism, is known, and (b) all the other parameters in the likelihood are known. The problem of testing whether a particular genetic transition matrix is appropriate will be taken up in section V; however, one of the methods suggested there requires that we already have, assuming that transition matrix is appropriate, the most likely genotypic classification of all the individuals. The other parameters in the likelihood will in practice be replaced by their maximum likelihood estimates. These can be obtained by various computer methods, searching the likelihood surface [ROSENBROCK, 1960; POWELL, 1964; NELDER and MEAD, 1965; KAPLAN and ELSTON, 1972]. The maximization of the likelihood must be performed under certain constraints: all probabilities must lie between 0 and 1, and all variances must be strictly positive. Also, there will normally be imposed certain functional relationships, or restrictions, among the parameters; for example, the assumption of Hardy-Weinberg equilibrium implies restrictions on the $\psi_v$. The maximum likelihood program devised by KAPLAN and ELSTON [1972] allows for such constraints and restrictions.

Apart from its use in testing genetic hypotheses, the genotypic classification of individuals can be used for genetic counselling. The methods to be described in this section can be used to derive not only the most probable genotype for any existing member of a pedigree, but also the appropriate genotypic distribution for an individual who, as yet, is neither born nor even conceived. This is done by including him in the pedigree as a 'missing observation', i.e. an individual on whom no direct information is available; his genotypic distribution, conditional on what is known about his relatives, is then obtained in the same manner as is the distribution for any other member of the pedigree.

A. Finite Number of Genotypes

The likelihood (8), with $\Gamma_j$ defined as in (4), contains for each individual in any one pedigree just one summation over the k possible genotypes for that individual. Thus the summation over $s_j$ in (4) is the summation over the k possible genotypes for the individual whose measure is $x_{i_0 i_1 \ldots i_j}$, and likewise the summation over $t_j$ corresponds to the individual whose measure is $y_{i_0 i_1 \ldots i_j}$. Replacing a particular summation in (8) by a single term corresponding to the u-th genotype, we obtain the likelihood of the data given that particular individual has genotype u. Denote this conditional likelihood for a particular individual, w say, L(w, u). Then the probability that individual w should have genotype u, conditional on what is known about his relatives, is simply

$$q(w, u) = L(w, u) \Big/ \sum_{u=1}^{k} L(w, u), \quad u = 1, 2, \ldots, k.$$

This probability does not depend upon any information on individuals unrelated to w, and so can be calculated simply from w's own pedigree.
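As a small worked illustration of q(w, u), the sketch below takes w to be the x-parent of a single two-generation family, so that L(w, u) is obtained by dropping the summation over that parent's genotype. The fully penetrant recessive trait, the allele frequency of 0.1 and the function names are hypothetical choices made only for this example.

```python
# One-locus, two-allele transition array of table I (AA = 0, Aa = 1, aa = 2).
P2 = [
    [[1, 0, 0], [0.5, 0.5, 0], [0, 1, 0]],
    [[0.5, 0.5, 0], [0.25, 0.5, 0.25], [0, 0.5, 0.5]],
    [[0, 1, 0], [0, 0.5, 0.5], [0, 0, 1]],
]

def penetrance(u, phenotype):
    """g_u for a hypothetical fully penetrant recessive trait.

    phenotype is 1 (affected), 0 (unaffected) or None (missing observation).
    """
    if phenotype is None:
        return 1.0
    affected = (u == 2)
    return 1.0 if affected == bool(phenotype) else 0.0

def conditional_likelihoods(x_parent, y_parent, sib_phenotypes, P, psi, g):
    """L(w, u) for w = the x-parent: the sum over his genotype is replaced by one term."""
    k = len(psi)
    values = []
    for s in range(k):
        spouse_and_sibs = 0.0
        for t in range(k):
            sibs = 1.0
            for x in sib_phenotypes:
                sibs *= sum(P[s][t][u] * g(u, x) for u in range(k))
            spouse_and_sibs += psi[t] * g(t, y_parent) * sibs
        values.append(psi[s] * g(s, x_parent) * spouse_and_sibs)
    return values

def genotype_probabilities(likelihoods):
    """q(w, u) = L(w, u) / sum over u of L(w, u)."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]

freq_a = 0.1                                           # hypothetical allele frequency
psi = [(1 - freq_a) ** 2, 2 * freq_a * (1 - freq_a), freq_a ** 2]

# Both parents unaffected, one affected and one unaffected child.
q = genotype_probabilities(
    conditional_likelihoods(0, 0, [1, 0], P2, psi, penetrance))
```

In this example an unaffected parent with an affected child can only be heterozygous, so all of the probability falls on Aa. For counselling, an unborn child would be added to the sibship with phenotype `None` and his own summation singled out in the same way.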

Let $u_1$ be the genotype for which, for a given individual w, q(w, u) is largest, and $u_2$ the genotype for which q(w, u) is second largest. Then provided $q(w, u_1)$ is much larger than $q(w, u_2)$, it is reasonable to classify w as having genotype $u_1$. But if $q(w, u_1)$ is not too much larger than $q(w, u_2)$, it may be better to classify him as having genotype $u_2$. Ideally, if there are W individuals in a given pedigree, we should consider the $k^W$ configurations possible for the assignment of genotypes to individuals and calculate the likelihood corresponding to each configuration; the configuration that yields the largest likelihood then gives us the best genotypic classification. Since in each case the genotypes are assumed to be known, these likelihoods are far simpler to compute than (8); nevertheless, even for k as small as 3 and W as small as 7, this ideal procedure involves the calculation of over 2,000 likelihoods ($3^7 = 2187$). We suggest that in practice a number, W*, first be chosen on the basis of it being feasible to calculate $k^{W^*}$ such likelihoods for each pedigree. Then a constant $\beta$ is chosen, for each pedigree separately, such that

$$q(w, u_1) < \beta\, q(w, u_2)$$

for W* individuals in the pedigree. The individuals for which the inequality does not hold, i.e. those W-W* individuals for whom $q(w, u_1)$ is at least $\beta$ times $q(w, u_2)$, are then classified as having genotype $u_1$ (i.e. each one his own 'most probable' genotype); and then, fixing these, the $k^{W^*}$ configurations possible for the remaining W* individuals are examined. Of course, this limited procedure is not necessary if a pedigree contains fewer than W* members; and it may well be possible to reduce the number of configurations that need be examined by putting some upper limit, such as 2, on the value of $\beta$.
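The limited procedure can be sketched as follows. Individuals are ranked by the ratio $q(w, u_1)/q(w, u_2)$; the W* individuals with the smallest ratios are treated as ambiguous (which amounts to choosing $\beta$ between the W*-th and the next ratio), the rest are fixed at their most probable genotype, and the $k^{W^*}$ remaining configurations are enumerated. The function names, the table of q values and the stand-in `toy_likelihood` are hypothetical; in practice the configuration likelihood would be (8) with every summation replaced by a single term.

```python
from itertools import product

def split_by_ratio(q, w_star):
    """Return (fixed, ambiguous) according to the ratio of the two largest q values."""
    def ratio(probs):
        top, second = sorted(probs, reverse=True)[:2]
        return float("inf") if second == 0 else top / second
    order = sorted(q, key=lambda w: ratio(q[w]))
    ambiguous = order[:w_star]
    fixed = {w: max(range(len(q[w])), key=q[w].__getitem__) for w in order[w_star:]}
    return fixed, ambiguous

def best_configuration(q, w_star, config_likelihood):
    """Examine the k**w_star configurations of the ambiguous individuals."""
    fixed, ambiguous = split_by_ratio(q, w_star)
    k = len(next(iter(q.values())))
    best, best_like = None, float("-inf")
    for combo in product(range(k), repeat=len(ambiguous)):
        config = dict(fixed)
        config.update(zip(ambiguous, combo))
        like = config_likelihood(config)
        if like > best_like:
            best, best_like = config, like
    return best, best_like

# Hypothetical q(w, u) values for three individuals and k = 3 genotypes.
q = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.40, 0.35, 0.25],
    "c": [0.10, 0.80, 0.10],
}

def toy_likelihood(config):
    """Stand-in for the configuration likelihood (illustration only)."""
    value = 1.0
    for w, u in config.items():
        value *= q[w][u]
    return value

fixed, ambiguous = split_by_ratio(q, 1)
best, best_like = best_configuration(q, 1, toy_likelihood)
```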

B. Infinite Number of Genotypes

Here we shall limit ourselves to outlining the solution for the simple polygenic model considered earlier, for which $\Gamma_j$ is defined by (16). Analogous to the case of a finite number of genotypes, the ideal solution would be to take out all the integral signs in the likelihood and then, given $\mu$, $\sigma^2$ and $\sigma_e^2$, find those genotypic values (s or t in (16), one for each individual) that maximize the resulting expression. This expression, which is a likelihood in which the genotypic values are the unknown parameters, is simply a product of many functions; hence, taking logarithms, it is in principle a simple matter to obtain the maximum likelihood estimates. In practice, however, we may have to limit the number of genotypic values that are jointly estimated to W*. Thus, analogous to the finite case, we first estimate the W genotypic values one at a time by maximizing a likelihood of the form (8), using (16) for $\Gamma_j$, but with just one integral sign removed; and at the same time we obtain the variances of these W estimates. Then, fixing at these estimated values the W-W* estimates with smallest variances, the W* remaining genotypic values are estimated jointly.

V. Testing Genetic Hypotheses

In this section we outline three basically different methods of testing the various genetic models. Each method has certain disadvantages, as will be pointed out, and so it is suggested that all three be used on any given body of data.

A. Testing for Segregation Ratios

In this method, we assume we know a priori the genotypic classification of each individual for the model being tested, though in actual fact this will be estimated in the manner just described. The pedigrees are divided up into two-generational families, in the usual way, and (in the case of a finite number of genotypes) grouped into genotypically distinct mating types. A test is then performed within each group for departure from the expected segregation ratios, using a chi-square statistic. Ancillary to this the data measurements for each genotype are examined separately and tests performed to detect departure from $g_u(x)$, assuming that the parameters of $g_u(x)$ are in fact equal to their estimated values. A simple pictorial method of determining whether the data fit this environmental part of the model is to compare the theoretical cumulative distribution curve with an empirical cumulative plot of the data for each genotype. Exact tests are not feasible, either for the segregation ratios or for the fit of $g_u(x)$, since there is no simple way of allowing for the fact that the genotypic classification and the function $g_u(x)$ are estimated from the data.
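The chi-square test for one mating type can be illustrated as follows. This is a minimal sketch with hypothetical offspring counts; the function names are assumptions made for illustration. The closed-form tail probability used here holds only for an even number of degrees of freedom, and, as noted above, the resulting significance level is approximate because the genotypic classification is itself estimated.

```python
import math


def segregation_chi_square(observed, expected_ratio):
    """Pearson chi-square statistic and degrees of freedom for one mating type."""
    n = sum(observed)
    total = sum(expected_ratio)
    expected = [n * r / total for r in expected_ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


def chi_square_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form needs an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Hypothetical counts of the three offspring genotypes pooled over matings of
# two heterozygotes, for which the expected segregation ratio is 1:2:1.
stat, df = segregation_chi_square([18, 55, 27], [1, 2, 1])
p_value = chi_square_sf_even_df(stat, df)
```

For these counts the statistic is 2.62 on 2 degrees of freedom, which gives no evidence of departure from the 1:2:1 ratio.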

When there are an infinite number of genotypes, similar tests can be performed to test the fit of the simple polygenic model considered. Analogous to testing for segregation ratios within sibships, the genotypic variances within sibships are estimated as the mean squared deviations of the genotypic values from the mid-parental genotypic values, pooled over all sibships in the pedigree, and then compared with the estimated $\sigma^2$. Furthermore, one can test for homogeneity of the within sibship genotypic variances. The environmental part of the model is easily tested by comparing the average squared difference between the phenotypic and estimated genotypic values, over all members of the pedigree, with the estimated $\sigma_e^2$.

B. Randomization Test

The difficulty in allowing for the fact that the genotypic classification is estimated, in the procedure just outlined, is completely circumvented by the use of a randomization test. If a pedigree contains W individuals on each of whom a measure has been obtained, there are W! different ways in which these measures could be assigned to the individuals. If, in the absence of any prior knowledge, we consider these W! permutations equiprobable, we have the basis for assigning a significance level to any particular genetic hypothesis. For we can calculate W! likelihoods, based on a particular genetic transition matrix but maximized with respect to all other unknown parameters, and then a priori the rank, say r, corresponding to the actually observed permutation has an equal probability of being any number from 1 to W!; if the smallest likelihood is ranked 1 and the largest W!, then the hypothesis under which the likelihoods are calculated can be rejected at exactly the r/W! significance level. Significance levels obtained in this way can be used both for testing the fit of a particular model and also for comparing the relative plausibility of several different models.

There are two disadvantages to this method. The first is that in practice W! can be a large number, and it may not be feasible to compute W! likelihoods. This is not too serious a limitation, however, since it is always possible to use a pseudo-random number generator to choose a random sample of, say, 99 permutations, and then estimate the significance level from the likelihoods corresponding to these and the observed permutation. The other disadvantage is that this method may have very little power; for this reason we turn now to a more parametric approach.
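The sampled version of the randomization test can be sketched as follows. The callable `likelihood` is a hypothetical stand-in for the likelihood computed under a particular transition matrix and maximized over the other parameters; the function name, the fixed seed and the handling of ties (counted against the hypothesis being retained) are assumptions made for illustration.

```python
import random


def randomization_p_value(measures, likelihood, n_perm=99, seed=0):
    """Estimate the significance level r / (n_perm + 1).

    r is the rank of the observed likelihood among the observed arrangement
    and n_perm randomly permuted arrangements, the smallest being ranked 1.
    """
    rng = random.Random(seed)
    observed = likelihood(list(measures))
    values = list(measures)
    rank = 1  # the observed arrangement itself
    for _ in range(n_perm):
        rng.shuffle(values)  # a random reassignment of measures to individuals
        if likelihood(values) <= observed:
            rank += 1
    return rank / (n_perm + 1)


# Toy stand-in: individuals are listed in a fixed order, and the "likelihood"
# is smallest when the measures increase along that order.
def toy_likelihood(values):
    return -sum(i * x for i, x in enumerate(values))


p = randomization_p_value([1, 2, 3, 4, 5, 6, 7, 8], toy_likelihood)
```

With 99 sampled permutations the estimated level is a multiple of 0.01, so the smallest attainable value is 0.01.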

C. Testing for a Member of a Parametric Class of Hypotheses

In this method we consider a parametric class of hypotheses, and test the particular hypothesis on the assumption that the true state of nature belongs to that class. The most obvious parametric class to consider in general is that which allows all possible values between zero and unity for each $p_{stu}$, subject to $\sum_u p_{stu} = 1$. In the simplest case possible, that of testing for segregation of two alleles at one locus, this results in the parameter space being 18-dimensional. For practical reasons we wish to keep down the number of parameters, and one way to do this is to further assume that $p_{stu} = p_{tsu}$, i.e. that the transition matrix is symmetric; then for two alleles at one locus the parameter space becomes 12-dimensional. This is still larger than is desirable, however. To overcome this difficulty, we suggest that the parametric class be taken to be that which allows all possible values between zero and unity for each transmission probability $\tau_{ijk}$, subject to $\sum_k \tau_{ijk} = 1$. Thus to test for two alleles at one locus the parameter space is 3-dimensional, and the null hypothesis is given by the point

$$\tau_{111} = 1, \quad \tau_{121} = \tfrac{1}{2}, \quad \tau_{221} = 0. \tag{17}$$
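The parameter counts quoted above, and the way three transmission probabilities generate a full transition matrix, can be checked with a short sketch. It assumes that the two parents transmit alleles independently and codes the genotypes 11, 12 and 22 as 0, 1 and 2; the function name and this coding are illustrative assumptions. The Mendelian values 1, 1/2 and 0 are used for the null point.

```python
def transition_matrix(tau):
    """Build p[s][t][u] from tau[g], the probability that a parent of
    genotype g (0: 11, 1: 12, 2: 22) transmits allele 1.

    p[s][t][u] is the probability that parents of genotypes s and t have a
    child of genotype u, assuming independent transmission by each parent.
    """
    p = [[None] * 3 for _ in range(3)]
    for s in range(3):
        for t in range(3):
            a, b = tau[s], tau[t]
            p[s][t] = [a * b, a * (1 - b) + (1 - a) * b, (1 - a) * (1 - b)]
    return p


k = 3                                          # genotypes for two alleles at one locus
dim_general = k * k * (k - 1)                  # 27 entries less 9 sum constraints
dim_symmetric = (k * (k + 1) // 2) * (k - 1)   # 6 unordered parental pairs
dim_transmission = k                           # one free probability per genotype

mendelian = transition_matrix([1.0, 0.5, 0.0])
```

Here `mendelian[1][1]` is the familiar 1/4, 1/2, 1/4 distribution for the offspring of two heterozygotes, and the matrix is symmetric in the two parents by construction.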

The null hypothesis can be tested by either of two standard asymptotic procedures; in each case a statistic is obtained that is asymptotically distributed as a $\chi^2$, the number of degrees of freedom being equal to the dimension of the parameter space for the class considered. The first procedure uses the likelihood ratio criterion: the maximum likelihood is obtained when the restrictions (17) are imposed, and this is divided by the unrestricted maximum likelihood; minus twice the natural logarithm of this ratio is the $\chi^2$-statistic. The second procedure requires a consistent estimate of the variance-covariance matrix of the estimated $\tau_{ijk}$ obtained when the unrestricted likelihood is maximized. If we let this matrix be $V$, and let $\hat{\tau}$ and $\tau_0$ be the vectors of estimates and null hypothesis values, respectively, of the $\tau_{ijk}$, then the $\chi^2$-statistic is

$$(\hat{\tau} - \tau_0)' \, V^{-1} (\hat{\tau} - \tau_0).$$

The matrix $V^{-1}$ can be conveniently obtained by numerical double differentiation of the likelihood surface at its maximum [KAPLAN and ELSTON, 1972].

It is instructive to note how this parametric approach compares with classical segregation analysis for two-generational data. If we have a random sample of two-generational data and there are no errors of misclassification, this method becomes very similar to testing for segregation ratios, especially if it is assumed that $\tau_{111} = 1$ and $\tau_{221} = 0$. Indeed, it is fair to say that classical segregation analysis corresponds to testing the hypothesis $\tau_{121} = \tfrac{1}{2}$ assuming $\tau_{111}$ and $\tau_{221}$ are known to be 1 and 0, respectively. For this reason, although the class of hypotheses that is taken above to be the underlying model may seem restrictive, it is in fact less restrictive than that taken in classical segregation analysis. In both cases a significant $\chi^2$-statistic could mean either that there is not 1:1 segregation at a genetic locus or that the underlying model is not true; but, because the model is less restrictive, the latter possibility is somewhat less likely to result in a significant statistic when the null hypothesis tested is (17).
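The two asymptotic procedures can be illustrated on the classical special case, in which only the heterozygote's transmission probability is free and is tested against 1/2. The sketch below assumes offspring of a heterozygote mated to a homozygote who never transmits allele 1, with hypothetical counts; the function names are illustrative. The information (the one-parameter analogue of the inverse variance-covariance matrix) is obtained by a second central difference of the log-likelihood at its maximum, which is one simple way of carrying out the numerical double differentiation mentioned in the text.

```python
import math


def log_lik(tau, x, n):
    """Binomial log-likelihood (constant dropped): x heterozygous offspring of n."""
    return x * math.log(tau) + (n - x) * math.log(1.0 - tau)


def segregation_tests(x, n, tau0=0.5, h=1e-4):
    """Likelihood ratio and quadratic-form statistics, each on 1 degree of freedom."""
    if not 0 < x < n:
        raise ValueError("this sketch needs 0 < x < n so the maximum is interior")
    tau_hat = x / n  # unrestricted maximum likelihood estimate
    lrt = -2.0 * (log_lik(tau0, x, n) - log_lik(tau_hat, x, n))
    # Numerical double differentiation at the maximum, sign reversed.
    info = -(log_lik(tau_hat + h, x, n) - 2.0 * log_lik(tau_hat, x, n)
             + log_lik(tau_hat - h, x, n)) / h ** 2
    quadratic = (tau_hat - tau0) ** 2 * info
    return lrt, quadratic


# Hypothetical data: 26 heterozygotes among 40 offspring.
lrt, quadratic = segregation_tests(26, 40)
```

For these counts the likelihood ratio statistic is about 3.66 and the quadratic-form statistic about 3.96; the two are asymptotically equivalent but need not agree in a finite sample, and here they fall on either side of the 5% point of $\chi^2$ with 1 degree of freedom (3.84).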

An alternative method of generalising the parameter space, in order to test either the single locus hypothesis or the simple polygenic hypothesis, would be to take as the genetic model the combination of a single 'major' locus and an infinite number of 'polygenes', as described in section III F. In this case the proportion of genetic variation due to the single major locus could be estimated, with a confidence interval, and we could test whether this proportion is significantly different from unity or zero. The advantage of this method is that it simultaneously considers the two extreme genetic hypotheses, and so answers what is possibly the single most important question concerning the inheritance of any character: is most of the genetic variation due to one locus, or are many gene loci necessarily involved? This point will be taken up again in the discussion below.

VI. Discussion

Many important human characteristics, ranging from disease susceptibility to personality, are measured as continuously distributed quantitative traits. Most approaches to studying genetically-caused variation in such characters have simply assumed that many loci were involved, and consequently have been limited to estimates of heritability. Indeed, it has become fashionable to assume a polygenic mode of inheritance and to estimate heritability for qualitative traits also, whenever a fully penetrant single gene model does not fit the data [FALCONER, 1965; GOTTESMAN and SHIELDS, 1967].

Such studies may tell us whether genetic variation causes significant phenotypic variation in a given character, but they are unlikely to tell us how genetic variation causes such variation [MORTON et al., 1970]. In fact, if many loci are significantly involved, each locus may well be causing phenotypic variation by a different mechanism, so that there would be virtually no single discernible mechanism. On the other hand, if a single gene is responsible for most of the genetic variation, it becomes feasible to examine the mechanism of action of this gene. Moreover, when the mechanism whereby a gene causes undesirable variation is understood, it may become possible to cure the genetic disease - for example, rhesus incompatibility [SCHNEIDER, 1970] and phenylketonuria [KOCH et al., 1970]; or to prevent its occurrence [SCOTT, 1970]. It is for these reasons that the question of whether or not a single identifiable gene locus is responsible for a significant amount of the phenotypic variation is extremely important. The methods outlined in this paper provide the basis for the genetic analyses required to answer this question, even if the trait is continuously distributed and the data are from a human pedigree rather than a controlled breeding experiment [ELSTON and STEWART, 1972; STEWART and ELSTON, 1972]. In particular, the methods determine specifically whether one or many loci are involved, as discussed above in section V; and it may also be noted that the formulation in section III D is appropriate to the case where the single locus has been detected indirectly through its linkage to a marker locus [HASEMAN and ELSTON, 1972].

Finally, the question arises as to whether the analyses proposed in this paper are feasible on the computers currently available. We are at present preparing computer programs to perform all these analyses. The results obtained so far for some of the simpler genetic models, which will be published separately, have shown that the methods are both feasible, on a large computer, and meaningful. Extension to the more general genetic models presents no theoretical problems, but may depend on practical advances in computer technology.

References

ELSTON, R.C. and STEWART, J.: The analysis of quantitative traits for simple genetic models from parental, F1 and backcross data. Genetics (in press, 1972).

FALCONER, D.S.: The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann. hum. Genet. 29: 51-76 (1965).

GOTTESMAN, I.I. and SHIELDS, J.: A polygenic theory of schizophrenia. Proc. nat. Acad. Sci., Wash. 58: 199-205 (1967).

HASEMAN, J.K. and ELSTON, R.C.: The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2: (in press, 1972).

KAPLAN, E.B. and ELSTON, R.C.: A subroutine package for maximum likelihood estimation (Maxlik). University of North Carolina Institute of Statistics Mimeo Series (1972).

KOCH, R.; SHAW, K.N.P.; ACOSTA, P.B.; FISHLER, K.; SCHAEFFLER, G.; WENZ, M.S., and WOHLERS, A.: An approach to the management of phenylketonuria. J. Pediat. 76: 815-828 (1970).

MORTON, N.E.; YEE, S.; ELSTON, R.C., and LEW, R.: Discontinuity and quasicontinuity: alternative hypotheses of multifactorial inheritance. Clin. Genet. 1: 81-94 (1970).

MURPHY, E.A. and BOLLING, D.R.: Testing of single locus hypotheses where there is incomplete separation of the phenotypes. Amer. J. hum. Genet. 19: 322-334 (1967).

NELDER, J.A. and MEAD, R.: A simplex method for function minimization. Comput. J. 7: 308-313 (1965).

POWELL, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput. J. 7: 155-162 (1964).

ROSENBROCK, H.H.: An automatic method for finding the greatest or least value of a function. Comput. J. 3: 175-184 (1960).

SCHNEIDER, J.: Prevention of Rh-sensitization by anti-D immunoglobulin. Germ. med. Mon. 13: 551-555, 613-615 (1970).

SCOTT, R.B.: Sickle-cell anemia - high prevalence and low priority. New Engl. J. Med. 282: 164-165 (1970).

STEWART, J. and ELSTON, R.C.: Biometrical genetics with one or two loci: physiological characters in mice. Genetics (in press, 1972).

Authors' addresses: Dr. R. C. ELSTON, Department of Biostatistics, School of Public Health, University of North Carolina, Chapel Hill, NC 27514; Dr. JOHN STEWART, Department of Physiology, School of Medicine, University of North Carolina, Chapel Hill, NC 27514 (USA)
