U.U.D.M. Project Report 2008:20

Hierarchical Linear Models and Structural Equation Modelling for the Children of Siblings model

Ralf Kuja-Halkola

Degree project in mathematical statistics (Examensarbete i matematisk statistik), 30 hp
Supervisor: Paul Lichtenstein, Medicinsk epidemiologi och biostatistik, Karolinska Institutet
Examiner: Ingemar Kaj
December 2008

Department of Mathematics
Uppsala University





Abstract

The main aim of this paper is to examine and apply the Children of Siblings model to normally distributed outcome data. The model refers to siblings who are mothers/aunts and/or fathers/uncles and on whose children an outcome variable is measured. This is done through two statistical models. The first is a Hierarchical Linear Model (also known as a multi-level model), which takes into account clustering of the data at three levels. The levels correspond to increasing sizes of a family: individual, nuclear family and extended family. This model makes it possible to examine whether a specific exposure is responsible for a change in the outcome variable after confounders and clustering have been considered. Furthermore, the genetic relatedness is taken into account, making it possible to draw conclusions regarding the familial effects on the association between exposure and outcome. The second model is a 2-level Structural Equation Model, which aims at splitting the variance/covariance into three predetermined components representing genetic influence, environmental similarity and environmental dissimilarity, using two levels of clustering in the data. The clustering levels are within and between nuclear families, which are also considered hierarchical. This model investigates the importance of genetic heritability versus environmental heritability. An example study is performed using data on mothers' smoking habits during pregnancy and the "psychological functioning capacity", a prognosis of the ability to cope with war-time stress, of their children as measured by Forsvarsverket during the medical examination for military service. We found that smoking during pregnancy does not affect the psychological functioning capacity. The exposure's effect on the outcome has familial confounding, and this is mainly explained through genetic relatedness.


Contents

1 Introduction and aims
    1.1 Background
    1.2 Quasi-experimental study
    1.3 The Children of Siblings model

2 Methods - Theory
    2.1 Hierarchical Linear Models
        2.1.1 CoS - a Hierarchical Linear Model
        2.1.2 The submodels included in the CoS
        2.1.3 Interaction terms
        2.1.4 Estimation and degrees of freedom
    2.2 Structural Equation Models
        2.2.1 What is Structural Equation Models?
        2.2.2 Matrix notation approach
        2.2.3 Model fit
        2.2.4 1-level SEM, ACE model
        2.2.5 2-level SEM for ACE
        2.2.6 Estimation: Maximum Likelihood using Expectation Maximization

3 Data
    3.1 The data
        3.1.1 The variables

4 Results
    4.1 The problem of belonging to multiple families
    4.2 HLM results
    4.3 SEM results

5 Discussion
    5.1 Limitations of the model
    5.2 Future improvements

References


1 Introduction and aims

The aim of this paper is to make the Children of Siblings (CoS) model accessible for researchers, both in theory and in practice. By using this model it is possible to draw conclusions from data that can be influenced by familial confounding. The confounding can be due to genetics or environment, and the model also tries to distinguish the importance of these two sources.

As an example, it is not unlikely that a child's reading ability is influenced by the environment in which he/she is brought up (access to books, parental encouragement). His/her sibling will probably have a similar environment. The siblings will thus be more similar to each other than to a random child taken from the same population (e.g. Swedish children born in a certain year). Furthermore, they are genetically similar, but the similarity differs between full and half siblings. They also share genes with their cousins (here, too, the genetic similarity depends on whether they are full or half cousins), but the environment in which cousins are brought up may differ. By considering all these (dis)similarities we can hopefully isolate a specific exposure's influence on an outcome and, if there is familial confounding, isolate its source.

The outcome variable is restricted to be normally distributed. We study whether a (male) child's exposure to the mother's Smoking During Pregnancy (SDP) influences his Psychological Functional capacity score (PF), as measured by Forsvarsverket during the medical examination for military service (henceforth called conscript). The data concern boys born between 1973 and 1988 (99% born between 1979 and 1988). We try to explain the model in a conceptually simple way and, when examining the theory behind the model, keep the presentation at an overview level rather than dwelling on every small problem, since a researcher who is not a mathematician should be able to grasp and apply the results. In the theory section some more advanced theoretical problems are addressed.

The Background section introduces the idea of the model and indicates its applicability. The Methods - Theory section explains the CoS models and deals with the mathematical theory behind them and the obstacles to overcome. In the Data and Results sections we analyse how SDP influences the PF score. The data are taken from a large database called the MgrCrime database, which includes information about relatedness, maternal and paternal social and economic status, and birth data. Merging of data from different sub-databases and construction of variables are done with SAS; analyses are done with SAS and Mplus.¹ Finally, the Discussion section considers the conclusions that can be drawn from the analyses. We also discuss and suggest improvements, both for the current study and for making the model applicable to other problems.

¹ Data and analysis program code can be obtained from the author, e-mail: [email protected].

1.1 Background

Researchers in the epidemiological field often try to address causation. Does a mother's SDP influence the child's intellect? Does a father's bipolarity cause his infant child to have a higher mortality rate? Does a certain lifestyle increase the risk of cardiovascular diseases? Such questions are hard to answer since the causality is confounded by the interplay between outcome and exposure; an exposed child's mother also provides genes and the rest of the environment. For example, the exposure to SDP might be a negligible part of the whole picture that makes the child perform poorly in e.g. a test of PF. The problem of parents providing (possibly) "hazardous" genes while the child also runs a higher risk of being exposed to a "hazardous" environment is called Gene-Environment correlation (rGE). The child is at double risk of being exposed to a "hazardous" environment, both due to the upbringing and due to her/his own genes making her/him more prone to put herself/himself in such an environment.¹ The causality in these studies is difficult to assess, but with designs such as sibling comparison or the Children of Twins (CoT) model it is possible to give some answers. The CoS model is another part of this toolbox.

The CoS model extends the CoT model, which has been used for some time by researchers in e.g. epidemiology and psychology (see for example (D'Onofrio, 2005; Harden et al., 2007)). By using CoS rather than CoT a greater population is made available for analysis, and using CoS rather than ordinary sibling comparison can help differentiate the possible underlying mechanisms (D'Onofrio et al., In press). The CoT model, as well as the CoS model, aims at determining whether a risk factor affects the outcome, and whether this effect is environmentally and/or genetically driven. This is not a simple task because of the rGE. An important part of the CoS model consists of considering the genetic relatedness between siblings (full or half) and/or cousins (full or half), so that some conclusions regarding the importance of genetics can, hopefully, be drawn.

The approach used is similar to that used by Brian M. D'Onofrio in (D'Onofrio et al., In press; D'Onofrio et al., 2008). The idea of using this design is also mentioned in (Rutter, 2007).

1.2 Quasi-experimental study

The basis of statistical theory in general relies on the data being a valid statistical sample, e.g. acquired through an experimental design randomizing subjects to different exposures. For many studies, including the one we have performed, this is not the case. Treating the data as experimental entails a number of assumptions, which are more easily violated in database studies or "natural experiments" (Rutter, 2007) of the kind often used in epidemiology. Labeling the study "quasi-experimental" emphasizes this. Examples of situations in which the assumptions may not hold are:

• Voluntary or involuntary self-censorship. When data are collected through self-reporting, the supplier of the information may be less likely to report certain variables. E.g. mothers who smoke during pregnancy might feel ashamed and choose not to report this, making the data biased.

• Sampling bias, reflecting the possibility that data collected in a voluntary manner are susceptible to e.g. social bias; people of low socioeconomic status may have less incentive to supply the requested information.

¹ That the parents passing on their genes can also create a "hazardous" environment is called passive gene-environment correlation; that the child creates her/his own "hazardous" environment due to genes inherited from the parents is called active gene-environment correlation (Rutter, 2007).

Often a randomized case-control design is out of the question; for example, it would be unethical to randomize mothers to smoke or not to smoke during pregnancy, and randomizing parents to have a certain mental disorder, thereby affecting their children, is impossible. The second-best option in this case is to use available data and analyse them as if they came from a designed experiment. Even though the researcher has to be aware of the possible pitfalls, these kinds of studies fill an important role. And even if the inferences can be questioned, they can at least be said to be hypothesis-generating.

1.3 The Children of Siblings model

Our goal is to explore the influence of an exposure on an outcome after controlling for confounders and clustering effects. The confounders are the covariates that influence the outcome and might affect the causation. In the CoS model the clusters are families and subfamilies. The extended families are clusters which consist of grandparents, their children and spouses, and their grandchildren. Clusters within these clusters are the nuclear families, consisting of mothers, fathers and their children. These clusters can be said to be hierarchical. Another level of the hierarchy is the individual level, consisting of the children. Approached in this way, it is reasonable to consider a model with three levels of hierarchicity.

The CoS model is designed to detect whether the (possible) influence of the exposure is due to genetics or environment. By assuming that siblings are raised equally (i.e. have the same environment during upbringing), and that cousins are not, a possibility to differentiate the variation arises. When comparing cousins, any similarity in the correlation between exposure and outcome will be due to both genetic and environmental (dis)similarities (full cousins share 0.125 of their genes and half cousins share 0.0625 of their genes). And when comparing full siblings (0.5 shared genes) with maternal half siblings (0.25 shared genes), any detected difference between the two groups will be (thought to be) mainly due to genetics. The confounders are merely nuisance variables in the analyses and will not be investigated in depth.

2 Methods - Theory

The CoS model is hierarchical by construction. To cope with this a Hierarchical Linear Model (HLM) will be applied. This model is able to incorporate the dependency structure within each cluster/family and divide the variance into level-specific parts. Handling the dependence through clusters in this way can be thought of as taking into account unmeasured confounders in addition to the measured ones included in the model (D'Onofrio et al., 2008).

The hierarchicity will also be incorporated into a Structural Equation Model (SEM). The SEM aims to explain the variability in the outcome variable in terms of genetic and environmental similarity/dissimilarity. The model has two levels: one within each of two female siblings separately, regarding their children, and the other between the two siblings. This model uses an approach which divides the variation into three components: the A-component, which measures the genetic influence on likeness; the C-component, which measures the environmental influence that makes the siblings similar; and the E-component, which measures the environmental influence that makes the siblings different. When we refer to the ACE model, this is the model to have in mind.

2.1 Hierarchical Linear Models

The Hierarchical Linear Model (HLM) (also known as a multi-level model) is a way to cope with dependent and nested structures in data using clustering at several levels. The basis is an ordinary linear model. By letting e.g. the intercept vary randomly at each level, the resulting model represents a hierarchical structure (Raudenbush & Bryk, 2002). The variation of the intercept depends on the cluster membership of the outcome variable. This structure allows for hierarchical nesting; e.g. in CoS the nuclear families are nested within extended families (or households) and the offspring are nested within nuclear families.

To conceptualize this model we imagine how the data are "created". With a three-level model in mind, we try to understand the formation of a unit at the individual level. First, at the extended family level, a cluster is "drawn" from the extended family clusters according to some rules. The nuclear family level cluster is then "drawn", and the information about the extended family level cluster is incorporated into this cluster. Finally the unit at the individual level is "drawn", and this unit is affected by which clusters it belongs to at the nuclear family and extended family levels. When this unit is compared with another unit, their extended family level clusters are compared with each other, and the same holds for their nuclear family level clusters.
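The "drawing" described above can be mimicked with a small simulation. The following is a minimal sketch, assuming that each level contributes an additive, normally distributed effect (as in the random-intercept model of the next subsection). The function name, the family sizes and all parameter values are made up for illustration and are not taken from the study.

```python
import random
import statistics

def simulate_three_level(n_ext=2000, n_nuc=2, n_child=2, grand_mean=5.0,
                         sigma=1.0, tau1=0.7, tau2=0.5, seed=1):
    """Draw outcomes top-down: extended family, nuclear family, individual."""
    rng = random.Random(seed)
    rows = []  # one (extended id, nuclear id, outcome) triple per child
    for t in range(n_ext):
        u_t = rng.gauss(0.0, tau2)             # effect shared by the extended family
        for n in range(n_nuc):
            r_nt = rng.gauss(0.0, tau1)        # effect shared by the nuclear family
            for _ in range(n_child):
                e_int = rng.gauss(0.0, sigma)  # effect specific to the child
                rows.append((t, n, grand_mean + u_t + r_nt + e_int))
    return rows

rows = simulate_three_level()
total_var = statistics.pvariance([y for _, _, y in rows])
# With these made-up values the total variance should be close to
# sigma^2 + tau1^2 + tau2^2 = 1.0 + 0.49 + 0.25 = 1.74.
print(round(total_var, 2))
```

Because the three effects are drawn independently, the variance of the simulated outcome splits into one part per level, which is the partitioning that the HLM exploits.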

2.1.1 CoS - a Hierarchical Linear Model

The hierarchicity of the model enables a partitioning of the variance into parts corresponding to each level of the hierarchy. Let $Y_{int}$ denote the outcome of the i:th offspring in the n:th nuclear family within the t:th extended family. The model without any covariates is

$$Y_{int} = \lambda_0 + e_{int} + r_{nt} + u_t. \tag{1}$$

This model relies upon the assumptions that the random errors are normally distributed, $e_{int} \sim N(0, \sigma^2)$, that the errors at the nuclear family level are normally distributed as well, $r_{nt} \sim N(0, \tau_1^2)$, and that the same holds for the errors at the extended family level, $u_t \sim N(0, \tau_2^2)$. We interpret equation (1) as: "The outcome $Y_{int}$ equals a grand mean $\lambda_0$ and includes three random effects: $e_{int}$ - a random effect at the individual level for individual i, $r_{nt}$ - a random effect at the nuclear level for nuclear family n in which individual i is included, and $u_t$ - a random effect at the extended family level for extended family t including nuclear family n and individual i." Another way to describe the random parts of (1) is that the variation of the offspring from the nuclear-family mean is measured by $e_{int}$, the variation of the nuclear family from the extended family mean is accounted for in $r_{nt}$, and the variation of the extended family from the overall mean is found in $u_t$. At each level it is possible to include covariates which are thought to have an impact on the outcome and which therefore confound the causality. At the individual level (level 1) consider a total of p covariates, and the simple linear model

$$Y_{int} = \pi_{0nt} + \sum_{j=1}^{p} \pi_{jnt}(\alpha_{jint}) + e_{int}, \tag{2}$$

where the meaning of the notation $\pi_{jnt}(\alpha_{jint})$ is explained below. To include the random errors of the other levels as in (1), let the intercept $\pi_{0nt}$ vary randomly while including covariates specific to the nuclear family level (level 2). With q covariates at the nuclear family level we obtain

$$\pi_{0nt} = \beta_{0t} + \sum_{j=1}^{q} \beta_{jt}(\chi_{jnt}) + r_{nt}. \tag{3}$$

Similarly, the extended family level (level 3) terms are introduced by letting the intercept vary randomly once again and including covariates that are common to the whole extended family. Using the linear model with s covariates at the extended family level,

$$\beta_{0t} = \lambda_0 + \sum_{j=1}^{s} \lambda_j(\xi_{jt}) + u_t. \tag{4}$$

By combining (2) - (4) and using the fact that only the intercepts, not the slopes, vary between clusters (i.e. $\pi_{jnt} = \pi_j \; \forall n, t$, and similarly for $\beta$), the model is

$$Y_{int} = \lambda_0 + \sum_{j=1}^{p} \pi_j(\alpha_{jint}) + \sum_{j=1}^{q} \beta_j(\chi_{jnt}) + \sum_{j=1}^{s} \lambda_j(\xi_{jt}) + e_{int} + r_{nt} + u_t. \tag{5}$$

The assumptions made are that $\mathrm{Cov}(e_{int}, r_{nt}) = 0$, $\mathrm{Cov}(e_{int}, u_t) = 0$ and $\mathrm{Cov}(r_{nt}, u_t) = 0$, i.e. all random effects are uncorrelated with each other. For fixed effect covariates, the variables $\alpha_{jint}$, $\chi_{knt}$ and $\xi_{lt}$ represent indicator variables for covariates j, k and l at levels 1, 2 and 3, respectively. In the case of a continuous effect (considered to be measured without error) they represent a centered version of the variables, creating contrast codes. E.g. the model containing (only) one regression coefficient $\pi_1$ associated with $\alpha_{1int}$ = continuous variable $X_{1int}$ would look like $Y_{int} = \lambda_0 + \pi_1 \cdot (X_{1int} - \bar{X}_{1 \cdot nt}) + e_{int} + r_{nt} + u_t$. But if the continuous variable were at level 2, e.g. $\chi_{1nt} = X_{1nt}$, the centering would look like $Y_{int} = \lambda_0 + \beta_1 \cdot (X_{1nt} - \bar{X}_{1 \cdot t}) + e_{int} + r_{nt} + u_t$. The reason for the cluster-level centering is that possible bias from correlation between the covariate and the random effects will be (approximately) removed (Neuhaus & McCulloch, 2006); furthermore, this contrast-coding approach yields parameters equivalent to fixed effects (Greene, 2003). It is possible to let the slopes vary randomly as well, but in this application they do not; see (Raudenbush & Bryk, 2002) for a more complete treatment of HLMs. As can be seen, the model in (5) is a linear mixed model with a certain design of the random and fixed effects.

The HLM design is reflected in the covariance matrix. For example, let the outcome vector $Y = (Y_{111}, Y_{211}, Y_{121}, Y_{221}, Y_{112}, Y_{122})^T$ be all the data from the three-level HLM used in CoS. In this small example the first extended family consists of two nuclear families with two children in each, and the second extended family consists of two nuclear families with one child in each. The covariance matrix has the block structure shown in (6).

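$$
\mathrm{Cov}(Y) =
\begin{pmatrix}
\sigma^2 + \tau_1^2 + \tau_2^2 & \tau_1^2 + \tau_2^2 & \tau_2^2 & \tau_2^2 & 0 & 0 \\
\tau_1^2 + \tau_2^2 & \sigma^2 + \tau_1^2 + \tau_2^2 & \tau_2^2 & \tau_2^2 & 0 & 0 \\
\tau_2^2 & \tau_2^2 & \sigma^2 + \tau_1^2 + \tau_2^2 & \tau_1^2 + \tau_2^2 & 0 & 0 \\
\tau_2^2 & \tau_2^2 & \tau_1^2 + \tau_2^2 & \sigma^2 + \tau_1^2 + \tau_2^2 & 0 & 0 \\
0 & 0 & 0 & 0 & \sigma^2 + \tau_1^2 + \tau_2^2 & \tau_2^2 \\
0 & 0 & 0 & 0 & \tau_2^2 & \sigma^2 + \tau_1^2 + \tau_2^2
\end{pmatrix}
\tag{6}
$$

In the covariance matrix (6) the different families (nuclear and extended) can be identified by looking at the covariance parameters of $r_{11}, r_{21}, r_{12}, r_{22}$ ($\tau_1^2$) and $u_1, u_2$ ($\tau_2^2$).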
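The block structure of (6) follows directly from which clusters two observations share. As a rough sketch, assuming the independence assumptions stated for (5), the matrix can be rebuilt from the index triples (i, n, t) alone. The function name and the numerical variance values below are illustrative and not taken from the study.

```python
def hlm_covariance(labels, sigma2, tau1_sq, tau2_sq):
    """Covariance matrix of a three-level random-intercept model.

    labels holds one (i, n, t) triple per observation: child i in
    nuclear family n within extended family t.
    """
    cov = []
    for i_a, n_a, t_a in labels:
        row = []
        for i_b, n_b, t_b in labels:
            same_ext = t_a == t_b                # share u_t
            same_nuc = same_ext and n_a == n_b   # share r_nt
            same_ind = same_nuc and i_a == i_b   # share e_int (diagonal only)
            row.append(sigma2 * same_ind + tau1_sq * same_nuc + tau2_sq * same_ext)
        cov.append(row)
    return cov

# Same ordering as the outcome vector used for (6).
labels = [(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1), (1, 1, 2), (1, 2, 2)]
cov = hlm_covariance(labels, sigma2=1.0, tau1_sq=0.5, tau2_sq=0.25)

total = cov[0][0]
sibling_corr = cov[0][1] / total  # (tau1^2 + tau2^2) / (sigma^2 + tau1^2 + tau2^2)
cousin_corr = cov[0][2] / total   # tau2^2 / (sigma^2 + tau1^2 + tau2^2)
```

Dividing an off-diagonal entry by the common diagonal entry gives the correlation that the model implies between two siblings or two cousins, which shows how the variance components translate into within-family similarity.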

Let us turn our attention to the different levels of the covariates. We describe each level and exemplify with variables used in the study in this paper. The individual level, level 1, includes the outcome variable and confounders specific to the individual, e.g. a mother's SDP, a mother's age at childbirth and a child's age when the outcome variable is measured. Level 2 is the nuclear family level; variables common to a nuclear family belong here. Examples are the socioeconomic status of the mother/father, whether the siblings are full or half, and the cohabitation of the parents. Level 3 is the extended family level; variables common to the whole extended family belong to this level, an example being whether the aunts are full or half siblings.

Figure 1 gives a schematic illustration of the three levels of hierarchicity. The outcome is measured at level 1, and $\rho_1$ represents that the aunts have some correlation because of shared genes and environment. The $\rho_2$ indicates that correlation between siblings is present, again due to genes and environment, while $\rho_3$ represents a correlation between cousins that is (thought to be) mainly due to genetics. The $\rho_4$ indicates correlation between half siblings, which differs from $\rho_2$ in terms of genetic similarity. There are more correlations which have not been indicated in the figure. All of these correlations of exposure, confounding variables and outcome have to be taken into account as rigorously as possible, and the CoS model deals with this through clustering and submodelling.


[Figure 1: pedigree diagram with Level 3, Level 2 and Level 1 marked and the correlations ρ1, ρ2, ρ3 and ρ4 indicated.]

Figure 1: An illustration of the different levels, M = male, F = female.

2.1.2 The submodels included in the CoS

The ingenuity of the CoS model lies in taking good care of the information about genetic similarity/dissimilarity among the individuals included in the data. This is addressed in a number of submodels. The first model, M1, is an ordinary HLM with no consideration of genetics or covariates; M1 performs a simple test of any correlation between the exposure and the outcome. The next model, M2, in which all confounders are included, is an ordinary way of testing the significance of the exposure. But the idea of the CoS model is that the significance of the exposure in model M2 may be confounded by non-independence between e.g. cousins due to partly shared genetics. In model M3 siblings are compared with siblings and cousins with cousins regarding the exposure. This means that the similarity of siblings and of cousins is addressed, but there is no differentiation between environmental and genetic effects. A mean across all cousins included in an extended family supplies an unrelated comparison. In models M4 (cousins) and M5 (siblings) it is taken into account whether the cousins (for M4) or siblings (for M5) are full or half. These models use subsamples which include two sisters and one child of each in M4, or two children per mother in M5, where possible. The subsamples are indicated with superscripts (c) for cousins in M4 and (s) for siblings in M5. We favour the inclusion of mothers/sisters with diverse exposure, where there is a possibility of important comparisons. For example, if a mother in M5 has more than two children and just one of these has been exposed, we pick the exposed one and randomly sample one of the others. In M4 and M5 the interaction term between the exposure variable and the full/half sibling/cousin variable is included in the models. This term measures the difference between the differentially exposed siblings/cousins in the full/half pairs, yielding a genetically/environmentally informative result. The description of the submodels in equations (7) - (11) below uses the exposure variable SDP, from the study performed, as an example; the confounders are left unspecified.

• Model 1, M1 - An ordinary HLM, examines the magnitude of association between SDP (= 0 or 1) and PF using the entire sample.

$$Y_{int} = \lambda_0 + \pi_1(\mathrm{SDP}_{int}) + e_{int} + r_{nt} + u_t. \tag{7}$$

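• Model 2, M2 - As above but controlling for covariates/confounders.

$$Y_{int} = \lambda_0 + \pi_1(\mathrm{SDP}_{int}) + \sum_{j=2}^{p} \pi_j(\alpha_{jint}) + \sum_{j=1}^{q} \beta_j(\chi_{jnt}) + \sum_{j=1}^{s} \lambda_j(\xi_{jt}) + e_{int} + r_{nt} + u_t. \tag{8}$$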

• Model 3, M3 - This model compares cousins (with cousins) and siblings (with siblings) who are differentially exposed to SDP; in these calculations we have included contrast coding.

$$
\begin{aligned}
Y_{int} = {} & \lambda_0 + \pi_1 \cdot (\mathrm{SDP}_{int} - \overline{\mathrm{SDP}}_{\cdot nt}) + \sum_{j=2}^{p} \pi_j(\alpha_{jint}) \\
& + \beta_1 \cdot (\overline{\mathrm{SDP}}_{\cdot nt} - \overline{\mathrm{SDP}}_{\cdot \cdot t}) + \sum_{j=2}^{q} \beta_j(\chi_{jnt}) + \sum_{j=1}^{s} \lambda_j(\xi_{jt}) + e_{int} + r_{nt} + u_t.
\end{aligned}
\tag{9}
$$

The contrast codes are calculated as follows:

– For siblings: the average SDP for a mother over all her pregnancies is subtracted from the SDP for each son, thus comparing siblings differently exposed to SDP within the nuclear family.

– For cousins: the average SDP for sisters over all their pregnancies is subtracted from the mean SDP for each nuclear family, comparing nuclear families differently exposed to SDP, between nuclear families and within extended families.

• Model 4, M4 - Compares cousins whose parents are full siblings with those whose parents are half siblings, to further examine the aspect of genetic relatedness. For this a subsample is created, including two adult siblings, each with one of their children, within each extended family (where possible). The same approach to contrast coding as above is used, recalculated for the new subset, and a full-/half-cousin variable (values 0 and 1) is included. The interaction term between the contrast for smoking and the full-/half-sibling variable now estimates the difference between cousin pair types.

$$
\begin{aligned}
Y^{(c)}_{int} = {} & \lambda_0 + \beta_1 \cdot (\overline{\mathrm{SDP}}^{(c)}_{\cdot nt} - \overline{\mathrm{SDP}}^{(c)}_{\cdot \cdot t}) + \sum_{j=1}^{p} \pi_j(\alpha_{jint}) \\
& + \sum_{j=2}^{q} \beta_j(\chi_{jnt}) + \sum_{j=1}^{s} \lambda_j(\xi_{jt}) + e_{int} + r_{nt} + u_t.
\end{aligned}
\tag{10}
$$


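• Model 5, M5 - Compares full and half siblings. As above, but with sibling pairs and their types.

$$
\begin{aligned}
Y^{(s)}_{int} = {} & \lambda_0 + \pi_1 \cdot (\mathrm{SDP}^{(s)}_{int} - \overline{\mathrm{SDP}}^{(s)}_{\cdot nt}) + \sum_{j=2}^{p} \pi_j(\alpha_{jint}) \\
& + \sum_{j=1}^{q} \beta_j(\chi_{jnt}) + \sum_{j=1}^{s} \lambda_j(\xi_{jt}) + e_{int} + r_{nt} + u_t.
\end{aligned}
\tag{11}
$$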

Of course each model is estimated separately, meaning that for example the grand mean $\lambda_0$ will (probably) have different values in each model.
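The way the sibling subsample for M5 is formed can be illustrated with a short sketch. The text only states the rule for a mother with more than two children of whom exactly one is exposed. The generalisation below (prefer one exposed and one unexposed child whenever both exist, otherwise sample two children at random) is an assumption made for illustration, as are the function name and the data layout.

```python
import random

def pick_sibling_pair(children, rng):
    """Select two children of one mother for the sibling subsample.

    children is a list of (child_id, sdp) tuples with sdp coded 0 or 1.
    Returns a pair of tuples, or None if the mother has fewer than two children.
    """
    if len(children) < 2:
        return None
    exposed = [c for c in children if c[1] == 1]
    unexposed = [c for c in children if c[1] == 0]
    if exposed and unexposed:
        # Differentially exposed siblings exist: keep one of each kind.
        return (rng.choice(exposed), rng.choice(unexposed))
    # All children share the same exposure: sample any two at random.
    return tuple(rng.sample(children, 2))

rng = random.Random(2008)
family = [("a", 0), ("b", 1), ("c", 0)]
pair = pick_sibling_pair(family, rng)
```

For the family above the exposed child "b" is always selected, and the second child is drawn at random from "a" and "c", which matches the example given for M5.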

We have chosen to include all confounders in our model and to remove the non-significant ones by stepwise elimination, each step eliminating the least significant confounder. Another way to pick the confounders to be included in the models is to test for correlation prior to fitting the big models and thus arrive at the risk factors/confounders (as in (D'Onofrio et al., In press)).

The contrast coding is of great importance. Since the mean over all cousins in an extended family is included, we can find an estimate for unrelated individuals in each model. The differences when comparing within the extended or nuclear family are found in the contrast. E.g. siblings in a family in M3 might have a high mean SDP and also a lower mean result in PF; the contrast allows us to examine whether the non-SDP siblings are different from the SDP sibling within this family. By doing this we can eliminate the main family effect and make statements regarding comparisons of individuals with a similar "starting position".
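As an illustration of the contrast codes used in M3, the following sketch computes the within-nuclear-family contrast (each son's SDP minus his mother's mean SDP) and the between-nuclear-family contrast (the mother's mean SDP minus the mean over all pregnancies of the sisters in the extended family). The function names, the record layout and the toy data are hypothetical.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def sdp_contrasts(records):
    """Compute M3-style contrast codes.

    records holds one (extended_id, nuclear_id, sdp) tuple per child,
    with sdp coded 0 or 1. Returns one (within_nuclear, between_nuclear)
    pair per child, in the same order as the input.
    """
    by_nuclear = defaultdict(list)
    by_extended = defaultdict(list)
    for ext, nuc, sdp in records:
        by_nuclear[(ext, nuc)].append(sdp)
        by_extended[ext].append(sdp)
    contrasts = []
    for ext, nuc, sdp in records:
        nuclear_mean = mean(by_nuclear[(ext, nuc)])  # mother's mean SDP
        extended_mean = mean(by_extended[ext])       # mean over all the sisters' pregnancies
        contrasts.append((sdp - nuclear_mean, nuclear_mean - extended_mean))
    return contrasts

# One extended family: mother A smoked in one of two pregnancies, mother B in none.
records = [(1, "A", 1), (1, "A", 0), (1, "B", 0), (1, "B", 0)]
contrasts = sdp_contrasts(records)
```

In this toy family the two sons of mother A get within-family contrasts of +0.5 and -0.5, so their comparison is free of whatever the two brothers share, while the between-family contrast (+0.25 for A, -0.25 for B) carries the comparison between the cousins' families.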

2.1.3 Interaction terms

When deciding which, if any, interaction terms are to be included in the model there are a couple of things to be aware of. First, interaction between levels seems unfeasible (Montgomery, 2005), because not all levels of the factors at the lower level are necessarily present for every higher level. In our analysis the interaction between the contrast-coded SDP variable and the full or half sibling/cousin variable must be included because it is of main interest for the conclusions. In (D'Onofrio et al., In press) (as mentioned, this is one of the papers using the CoS model which we use as a template) none of the interaction terms between confounders has been taken into consideration. This approach is also favorable when dealing with large datasets, since it lowers the computation time. Another approach is to test for all possible interactions and stepwise eliminate the non-significant ones. Or perhaps the middle way: pick the interaction terms to be included using knowledge about them. No matter which approach is taken, an informed decision has to be made.


2.1.4 Estimation and degrees of freedom

The mixed model of the type in (5) is estimated using SAS's proc mixed. The estimation of the parameters π, β and λ and of the variance parameters is done within the General Linear Mixed Model framework. We rewrite equation (5) in matrix form

Y = Xβ +Zu+ ε. (12)

X contains the values of the fixed effects in β, Z contains the dummy variables which indicate which random effects in u apply to each observation, and ε contains the random errors corresponding to the $e_{int}$:s. To illustrate the transition from (5) to (12) we use the outcome vector $Y = (Y_{111}, Y_{211}, Y_{121}, Y_{221}, Y_{112}, Y_{122})^T$ as we did when illustrating the covariance matrix (6). For each level of the hierarchy we include two effects, one fixed with two levels and one continuous. The β-vector specifying the covariates is

β(10×1) = (λ0, π1(1), π1(2), π2, β1(1), β1(2), β2, λ1(1), λ1(2), λ2)T . (13)

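With made-up values for the fixed effect covariates π1, β1 and λ1 and centered versions for the continuous covariates π2, β2 and λ2 we have

$$
X_{(6\times 10)} =
\begin{pmatrix}
1 & 1 & 0 & (X_{2111} - X_{2\cdot 11}) & 1 & 0 & (X_{211} - X_{2\cdot 1}) & 1 & 0 & (X_{21} - X_{2\cdot}) \\
1 & 0 & 1 & (X_{2211} - X_{2\cdot 11}) & 1 & 0 & (X_{211} - X_{2\cdot 1}) & 1 & 0 & (X_{21} - X_{2\cdot}) \\
1 & 0 & 1 & (X_{2121} - X_{2\cdot 21}) & 0 & 1 & (X_{221} - X_{2\cdot 1}) & 1 & 0 & (X_{21} - X_{2\cdot}) \\
1 & 0 & 1 & (X_{2221} - X_{2\cdot 21}) & 0 & 1 & (X_{221} - X_{2\cdot 1}) & 1 & 0 & (X_{21} - X_{2\cdot}) \\
1 & 0 & 1 & (X_{2112} - X_{2\cdot 12}) & 1 & 0 & (X_{212} - X_{2\cdot 2}) & 0 & 1 & (X_{22} - X_{2\cdot}) \\
1 & 1 & 0 & (X_{2122} - X_{2\cdot 22}) & 1 & 0 & (X_{222} - X_{2\cdot 2}) & 0 & 1 & (X_{22} - X_{2\cdot})
\end{pmatrix}. \tag{14}
$$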

Within each of the two extended families we have two nuclear families, which yields a random effects vector

u(6×1) = (r11, r21, r12, r22, u1, u2)T . (15)

The specifications are dependent on which family each individual belongs to

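$$
Z_{(6\times 6)} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1
\end{pmatrix}. \tag{16}
$$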

The random errors vector is simply

ε(6×1) = (e111, e211, e121, e221, e112, e122)T . (17)
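The matrix Z in (16) follows mechanically from the family indices of the individuals. The following Python sketch, with hypothetical helper names, rebuilds it from the subscripts of the outcome vector Y; it assumes the subscript order (individual, nuclear family, extended family) and the ordering of u given in (15).

```python
# Individuals in the order of the outcome vector Y,
# written as (individual, nuclear family, extended family).
individuals = [(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1), (1, 1, 2), (1, 2, 2)]

# Random effects in the order of u in (15):
# nuclear family effects r_{nt} first, then extended family effects u_t.
nuclear_effects = [(1, 1), (2, 1), (1, 2), (2, 2)]  # (nuclear family n, extended family t)
extended_effects = [1, 2]


def design_row(nuclear, extended):
    """One row of Z: a 1 for the individual's nuclear and extended family effect."""
    row = [1 if (nuclear, extended) == effect else 0 for effect in nuclear_effects]
    row += [1 if extended == t else 0 for t in extended_effects]
    return row


Z = [design_row(n, t) for (_, n, t) in individuals]

for row in Z:
    print(row)
```

Every row contains exactly two ones, since each individual belongs to exactly one nuclear family and one extended family.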

Following the notation in (Littell et al., 2006), let u ∼ N(0,G) and ε ∼ N(0,R), and assume that Cov(u, ε) = 0. From this we can write a joint normal distribution for u and ε

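$$
f(u, \varepsilon) = \frac{1}{(2\pi)^{(n+g)/2}}
\begin{vmatrix} G & 0 \\ 0 & R \end{vmatrix}^{-1/2}
\exp\left( -\frac{1}{2}
\begin{bmatrix} u \\ y - X\beta - Zu \end{bmatrix}^T
\begin{bmatrix} G^{-1} & 0 \\ 0 & R^{-1} \end{bmatrix}
\begin{bmatrix} u \\ y - X\beta - Zu \end{bmatrix}
\right) \tag{18}
$$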

with n = the sample size and g = the number of elements in u. The joint distribution f(u, ε) can be thought of as a joint likelihood. But since the randomness in Zu is present, the maximization of this (quasi-)likelihood would yield an incorrect solution; the estimation method starts out here and then refines the estimation. By maximizing the distribution (likelihood) with respect to β and u, solutions $\hat{\beta}$ and $\hat{u}$ can be obtained for β and u (see (Littell et al., 2006) for details). Via partial derivatives the mixed model equations can be found:

$$
\begin{bmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z + G^{-1} \end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{u} \end{bmatrix}
=
\begin{bmatrix} X^T R^{-1} y \\ Z^T R^{-1} y \end{bmatrix}. \tag{19}
$$

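This can be solved as

$$
\begin{bmatrix} \hat{\beta} \\ \hat{u} \end{bmatrix}
=
\begin{bmatrix} (X^T V^{-1} X)^{-1} X^T V^{-1} y \\ G Z^T V^{-1} \left( y - X (X^T V^{-1} X)^{-1} X^T V^{-1} y \right) \end{bmatrix}
=
\begin{bmatrix} (X^T V^{-1} X)^{-1} X^T V^{-1} y \\ G Z^T V^{-1} ( y - X \hat{\beta} ) \end{bmatrix} \tag{20}
$$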

where $V = Var(y) = ZGZ^T + R$. We have ignored the possibility of needing to deal with generalized inverses (for which there are solutions implemented in proc mixed). These equations can of course be solved, but it can be time consuming to find the inverse of V, and the estimate $\hat{u}$ of u depends on the estimate of β. Through a method called ”SWEEP” the time is considerably lowered; for an explanation of the method see for example (Smith & Graser, 1986).
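The equivalence between the mixed model equations and the closed form (20) can be checked numerically. The sketch below uses NumPy with the Z matrix of (16); the fixed effects design, the outcome values and the variance components are made up for illustration, and G and R are assumed diagonal. It is a plain dense solve and does not attempt to mimic the SWEEP strategy used by proc mixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Z from (16): four nuclear family effects followed by two extended family effects.
Z = np.array([
    [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

# Hypothetical fixed effects design (intercept plus one covariate) and outcome.
X = np.column_stack([np.ones(6), rng.normal(size=6)])
y = rng.normal(loc=5.0, scale=2.0, size=6)

# Hypothetical variance components: nuclear family, extended family, residual.
var_r, var_u, var_e = 0.5, 0.3, 1.0
G = np.diag([var_r] * 4 + [var_u] * 2)
R = var_e * np.eye(6)
V = Z @ G @ Z.T + R

# Route 1: solve the mixed model equations in Henderson's standard form.
R_inv, G_inv = np.linalg.inv(R), np.linalg.inv(G)
lhs = np.block([
    [X.T @ R_inv @ X, X.T @ R_inv @ Z],
    [Z.T @ R_inv @ X, Z.T @ R_inv @ Z + G_inv],
])
rhs = np.concatenate([X.T @ R_inv @ y, Z.T @ R_inv @ y])
solution = np.linalg.solve(lhs, rhs)
beta_mme, u_mme = solution[:2], solution[2:]

# Route 2: the closed form (20) through V.
V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
u_blup = G @ Z.T @ V_inv @ (y - X @ beta_gls)

print(np.allclose(beta_mme, beta_gls), np.allclose(u_mme, u_blup))
```

Route 1 never forms the inverse of the n × n matrix V, which is the practical reason the mixed model equations are attractive when n is large and R has a simple structure.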

The default method of estimation in SAS's proc mixed for solutions to the HLMs is Restricted Maximum Likelihood (REML). The reason for using REML rather than ordinary Maximum Likelihood (ML) is that ML tends to bias the estimation of the standard errors negatively (Littell et al., 2006). This results in narrower confidence intervals and, in the end, hypothesis tests that are more likely to be overly liberal. The Minimum Variance Quadratic Unbiased Estimators (MIVQUE) procedure is, in contrast to ML and REML, non-iterative; it finds the quadratic variance/covariance estimators that come as close as analytically possible to satisfying the Lehmann-Scheffé criterion for uniformly minimum variance (Slanger, 1996). This procedure provides unbiased estimates of the variance parameters. The reasons for choosing MIVQUE over REML are that convergence can sometimes be hard to obtain using REML or ML, and that it saves computing time when handling large data sets¹. Sometimes there can be trouble with convergence due to problems involving one or more confounders. A way to approach this problem is to use MIVQUE during the stepwise elimination of non-significant variables to identify ”problem-variables”, but the final analysis should be made using REML. The REML procedure uses the values of MIVQUE as a starting point.

¹ In the current study the data sets for M1, M2 and M3 can be considered large, and convergence sometimes fails using REML.

REML is actually just an estimation technique for the covariance parameters; the fixed effects β are estimated at the REML estimates of the covariance parameters. Let us start with the ordinary ML method. ML minimizes the -2 log likelihood. Let θ be the covariance parameters and y the observed sample; from (20) we can find a solution $\hat{\beta}(\theta)$ for the fixed effects

$$
\hat{\beta}(\theta) = (X^T V(\theta)^{-1} X)^{-1} X^T V(\theta)^{-1} y. \tag{21}
$$

Thus the -2 log likelihood will be

$$
-2l(\theta; y) = n \ln(2\pi) + \ln|V(\theta)| + (y - X\hat{\beta}(\theta))^T V(\theta)^{-1} (y - X\hat{\beta}(\theta)). \tag{22}
$$

By minimizing this function with regard to θ the ML solution can be found.

The ML estimators for a sample $y = (y_1, \ldots, y_n)^T$, independent and identically distributed from N(µ, σ²) with unknown µ and σ², are $\hat{\mu} = \frac{1}{n}\sum_i y_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2$. The $\hat{\sigma}^2$ is biased by $-n^{-1}\sigma^2$, but if the mean was known the unbiased estimator $\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \mu)^2$ would constitute the ML estimator for σ². Thus the unknown mean seems to be a source of bias for the covariance parameter, and REML incorporates this fact. The REML estimation procedure can be seen as a maximization of the likelihood of transformed data. Instead of, as in ordinary ML, minimizing the -2 log likelihood of Y, REML minimizes the -2 log likelihood of KY where K is chosen such that E(KY) = 0, i.e. the maximization occurs on error contrasts. A known result shown by Harville 1974 (see (Littell et al., 2006) for details) is that the restricted likelihood for REML is

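$$
L_R(\theta) = \frac{1}{(2\pi)^{(n-p)/2}} \left| X^T X \right|^{1/2} \frac{1}{|V(\theta)|^{1/2}} \frac{1}{\left| X^T V(\theta)^{-1} X \right|^{1/2}}
\cdot \exp\left( -\frac{1}{2} (y - X\hat{\beta}(\theta))^T V(\theta)^{-1} (y - X\hat{\beta}(\theta)) \right). \tag{23}
$$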

Thus the -2 log likelihood is

$$
-2l_R(\theta; y) = \ln|V(\theta)| + \ln\left| X^T V(\theta)^{-1} X \right| + (y - X\hat{\beta}(\theta))^T V(\theta)^{-1} (y - X\hat{\beta}(\theta)) + c_R \tag{24}
$$

where the extra term $\ln\left| X^T V(\theta)^{-1} X \right|$ is the only difference from the ML -2 log likelihood in terms of what is minimized. The $c_R$ term is a constant which is not affected by the choice of the matrix K, and is independent of θ.

The minimization of (24) with regard to θ is performed iteratively since the covariance parameters enter V(θ) non-linearly. The solution for the fixed effects is found through the estimated covariance components $\hat{\theta}$ as $\hat{\beta}(\hat{\theta}) = (X^T V(\hat{\theta})^{-1} X)^{-1} X^T V(\hat{\theta})^{-1} y$. Proc mixed uses a ridge-stabilized Newton-Raphson algorithm for minimization of the -2 log likelihood (Littell et al., 2006).
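The effect of the extra term in (24) is easiest to see in the i.i.d. case mentioned above, where X is a column of ones and V = σ²I. The following Python sketch is an illustration only: the sample is made up, and a simple ternary search stands in for the Newton-Raphson iterations of proc mixed. It minimizes (24) numerically, omitting the constant c_R, and compares the result with the closed forms.

```python
import math


def neg2_restricted_loglik(sigma2, y):
    """-2 l_R from (24) without c_R, for X = column of ones and V = sigma2 * I."""
    n = len(y)
    y_bar = sum(y) / n                      # beta_hat(theta) reduces to the sample mean
    ss = sum((yi - y_bar) ** 2 for yi in y)
    log_det_v = n * math.log(sigma2)        # ln|V|
    log_det_xvx = math.log(n / sigma2)      # ln|X^T V^{-1} X|
    return log_det_v + log_det_xvx + ss / sigma2


def minimize_unimodal(f, lo, hi, iterations=200):
    """Ternary search for the minimizer of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


y = [4.0, 6.0, 5.0, 3.0, 7.0, 5.0]          # made-up sample
n = len(y)
y_bar = sum(y) / n
ss = sum((yi - y_bar) ** 2 for yi in y)

sigma2_ml = ss / n                          # ML estimator, biased downwards
sigma2_reml_closed = ss / (n - 1)           # the usual unbiased estimator
sigma2_reml_numeric = minimize_unimodal(
    lambda s: neg2_restricted_loglik(s, y), 0.01, 100.0
)

print(sigma2_ml, sigma2_reml_closed, sigma2_reml_numeric)
```

In this special case the REML solution coincides with the familiar estimator with divisor n − 1, while ML uses the divisor n.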

Now we turn to the problem of picking the correct degrees of freedom for the F-tests and confidence interval construction for the fixed factors in the model. Since the regular ANOVA assumptions (e.g. the independence of outcomes and homoscedasticity) do not hold when using the mixed model, a correction of the degrees of freedom has to be made. If no corrections are made the resulting confidence intervals will be too narrow, and tests of significance will be overly liberal. Thus the importance of the correct degrees of freedom cannot be stressed enough. There are a couple of ways to do this. One is the Fai-Cornelius (FC) method as implemented in the satterthwaite option when specifying the model in proc mixed. Another option is Kenward-Roger, which is often more suitable when the covariance structure is more complex, but in this study the covariance structure is relatively simple and thus the FC option is better (Littell et al., 2006) (Schaalje et al., 2001). The FC method uses a Satterthwaite approximation of the degrees of freedom found through spectral decomposition of the covariance matrix of the estimated β (for details see (Schaalje et al., 2001)). A third option is the between-within option, which is most suitable for highly unbalanced data. The best choice for the actual study is the satterthwaite option, but sometimes the data can be too unbalanced, and then the between-within approach can be suitable.

2.2 Structural Equation Models

We will use the SEM on data for siblings and their children to examine whether an exposure’s correlation to an outcome is due to genetics, environment that makes siblings similar or environment that makes siblings dissimilar. The model used to do this is called the ACE model, where A represents genetic influence, C represents environment that makes siblings similar and E represents environment that makes siblings dissimilar.

2.2.1 What are Structural Equation Models?

The Structural Equation Models are models which, through mean- and covariance matrix-analyses, ”depict relationships between observed variables” (Schumacker & Lomax, 2004). Behind this fuzzy formulation lies a complex theory mainly developed by applied researchers in psychology, psychometrics and econometrics (Hays et al., 2005). The method is closely related to factor analysis. The difference is that while factor analysis focuses on the covariance matrices and through these on latent (unmeasured) factors, SEM includes more possible dependencies within variables, within latent factors and between variables and latent factors. SEM is able to cope with multiple latent variables in combination with measured indicator and outcome variables, using dependence between and within the latent, indicator and outcome variables. Since the dependences can be complicated, researchers often choose to use a type of diagram to indicate the models. An example is shown in figure 2 where F1 and F2 represent latent (unobserved) independent variables, x1 and x2 represent observed dependent variables and y is the outcome variable. The V1 and V2 imply that the latent unobservable variables are estimated with an error. Further explanation of the arrows is found below.


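[Path diagram: latent factors F1 and F2 with error terms V1 and V2, observed variables x1 and x2, and outcome variable y.]

Figure 2: An illustration of SEM.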

The illustrations often use circles to indicate latent unmeasured factors and squares for measured indicator and outcome variables. By using this presentation it is easy to create a picture of what is going on, which is quite suitable if you are more interested in drawing conclusions than in understanding the theory behind them. Figure 2, for example, says that the indicator variables x1 and x2 and the outcome variable y are all three manifestations of two unmeasured factors. The indicator variables are also correlated. In general the model consists of two parts, the measurement model and the structural model. In the measurement model the measured variables are included; in the structural model the structures of the latent variables are described.

2.2.2 Matrix notation approach

Apart from the distinction between observed and latent variables there is also a distinction between dependent and independent variables. The independent variables are variables that are not influenced by any other variable and the dependent variables are variables which are influenced by other variables. The SEM can be written in matrix notation; the notation follows that of (Schumacker & Lomax, 2004). First the structural model. Let η(m×1) be the latent dependent variables, an (m × 1) vector, and ξ(n×1) be the latent independent variables, an (n × 1) vector. Let the relationships between the latent variables be described by B(m×m) for latent dependent variables amongst themselves and let Γ(m×n) describe the relation of latent independent variables to latent dependent variables. Let the equation prediction errors be ζ(m×1). Then the model is

η = Bη + Γξ + ζ. (25)

The covariance matrix Cov(ξ) = Φ(n×n) describes the variances and covariances among the independent latent variables, and Cov(ζ) = Ψ(m×m) is the covariance matrix for the latent dependent prediction equation errors. Let us turn our attention to the measurement model. For the latent dependent variables η, let Y(p×1) be the observed measures (dependent or independent) and let Λy (p×m) be the relationship between observed variables and latent dependent variables. Let the measurement errors for Y be ε(p×1); the measurement model will be

Y = Λyη + ε. (26)

In the same manner the independent latent variables are found through the measured X(q×1), and the relationship between observed variables (dependent or independent) and latent independent variables is found in Λx (q×n). Let the measurement errors be δ(q×1); the measurement model will be

X = Λxξ + δ. (27)

The covariance matrices of ε and δ can be labeled Cov(ε) = Θε (p×p) and Cov(δ) = Θδ (q×q) and contain the covariances between errors of the observed dependent and independent variables respectively.

These rather complex notations to describe the model include eight matrices, of which four (Φ, Ψ, Θε and Θδ) are random covariance matrices. All matrices can be estimated and used for inferences. This is done by letting parameters of the matrices be free, fixed or constrained. Which parameters are of interest is of course dependent on how the model is formulated; all matrices might not be of interest for a model (Schumacker & Lomax, 2004).

2.2.3 Model fit

We will use two model fit measures, Root Mean Squared Error of Approximation (RMSEA) and Comparative Fit Index (CFI).

The RMSEA is defined as (Muthen, 1998-2004)

$$
\mathrm{RMSEA} = \sqrt{\max\left( \frac{2F_{ML}(\hat{\pi})}{d} - \frac{1}{n},\ 0 \right)}\ \sqrt{G} \tag{28}
$$

where $F_{ML}$ is the ML fitting function for G groups, d is the number of degrees of freedom for the model, n is the number of observations and $\hat{\pi}$ is the ML estimate under $H_0$. The fitting function is

$$
F_{ML}(\hat{\pi}) = -\frac{\ln L_{H_0}}{n} + \frac{\ln L_{H_1}}{n} \tag{29}
$$

where $L_{H_0}$ is the likelihood of the fitted model and $L_{H_1}$ is the likelihood of an unrestricted model. Unrestricted means that the means and covariances are not constrained. The RMSEA is a global measure, and it improves when more variables are added to the model. A value of 0.05 or less is deemed to be acceptable (Schumacker & Lomax, 2004).

The CFI is defined as

$$
\mathrm{CFI} = 1 - \frac{\max\left( \chi^2_{H_0} - d_{H_0},\ 0 \right)}{\max\left( \chi^2_{H_0} - d_{H_0},\ \chi^2_{B} - d_{B},\ 0 \right)} \tag{30}
$$

where $H_0$ and B represent the fitted and baseline model respectively. The baseline model has uncorrelated outcomes with unrestricted means. As can be seen the CFI utilizes the χ²-distribution, which relies on the normality assumption for the outcome variable. The CFI spans from 0 to 1, where 0 indicates no fit and 1 perfect fit. A value over 0.90 indicates a ”good” fit (Schumacker & Lomax, 2004). The $\chi^2_H$ for the models is based on the assumption that the estimated parameters in the covariance matrix have a normal error. The squared difference between the observed and the hypothesized parameters divided by the hypothesized parameter follows a χ²-distribution with degrees of freedom = number of estimated parameters - 1. The actual $\chi^2_H$-statistic is more advanced though, ”taking into account stratification, non-independence of observations due to cluster sampling, and/or unequal probability of selection. Subpopulation analysis is also available.” (Muthen & Muthen, 1998-2007).
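The two fit measures are simple functions of quantities reported by the fitting software. A minimal Python sketch of (28), (29) and (30) follows; the function names and all numbers are hypothetical, and returning 1 when the CFI denominator is zero is a convention assumed here, not something taken from the Mplus documentation.

```python
import math


def f_ml_from_loglik(loglik_h0, loglik_h1, n):
    """Fitting function (29) from the log likelihoods of the fitted and unrestricted model."""
    return -loglik_h0 / n + loglik_h1 / n


def rmsea(f_ml, d, n, groups=1):
    """RMSEA as in (28): sqrt(max(2 * F_ML / d - 1 / n, 0)) * sqrt(G)."""
    return math.sqrt(max(2.0 * f_ml / d - 1.0 / n, 0.0)) * math.sqrt(groups)


def cfi(chi2_h0, d_h0, chi2_b, d_b):
    """CFI as in (30)."""
    numerator = max(chi2_h0 - d_h0, 0.0)
    denominator = max(chi2_h0 - d_h0, chi2_b - d_b, 0.0)
    if denominator == 0.0:
        return 1.0  # assumed convention: no misfit beyond the degrees of freedom
    return 1.0 - numerator / denominator


# Made-up example: n = 1000 observations, one group, 10 degrees of freedom.
f_ml = f_ml_from_loglik(loglik_h0=-5010.0, loglik_h1=-5000.0, n=1000)
print(rmsea(f_ml, d=10, n=1000))                        # about 0.032, below 0.05
print(cfi(chi2_h0=20.0, d_h0=10, chi2_b=520.0, d_b=20))  # about 0.98, above 0.90
```

Note that 2n·F_ML equals the likelihood ratio statistic of the fitted model against the unrestricted model, so the RMSEA penalizes the misfit per degree of freedom.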

When modifying the model according to findings some parameters might be excluded. The estimation procedure includes the forming of confidence intervals constructed with standard errors of the parameters. Following (Schumacker & Lomax, 2004) the rules for exclusion of parameters are based on three criteria: the parameter should be in the expected direction, be statistically different from zero and make practical sense. But Schumacker & Lomax also stress that ”If a parameter is not significant but is of substantive interest, then the parameter should probably remain in the model.” The actual ACE-model is based on the parameters associated with the respective parts A, C and E being considered as a ”group”. Thus exclusion of just one parameter is probably unwise; if one is excluded the rest must be excluded as well. This also applies in the opposite direction: if one of the parameters is slightly non-significant but the rest are significant, then that parameter can, and should, stay in the model.

2.2.4 1-level SEM, ACE model

The ACE model is described in figure 3 for two sisters, their SDP and their children’s PF. The left part of the figure represents one sibling, and the right part the other. The parameters connected to the A-part (VA, bA and Acov) model the variation in outcome due to genetics. The parameters connected to the C-factor represent the variation due to shared environment, the environmental influence that makes siblings similar. The parameters associated with the E-factor represent the variation due to unshared environment, making siblings dissimilar. The arrows with a 1 beside them say that a parameter is (perfectly) measured by the parameter to which the arrow points. The arrows with parameters bF (F = A, C, E) beside them are regression coefficients, e.g. PF1 is regressed on A1 and the resulting regression coefficient is bA. The arrows pointing in two directions indicate correlation/covariation, e.g. A1 covaries with A2 with value Acov.


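[Path diagram: for each sibling i = 1, 2, latent factors Ai, Ci and Ei with variances VA, VC and VE; arrows marked 1 connect them to SDPi and arrows marked bA, bC and bE connect them to PFi; double-headed arrows Acov between A1 and A2, Ccov between C1 and C2, and r between PF1 and PF2.]

Figure 3: An illustration of SEM.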

By using constraints on the parameters we can find the values of VA, VC and VE, the variances of the A-, C- and E-parameters. We may also be interested in bA, bC and bE, the regression parameters of PF on the A-, C- and E-parameters. The r is the covariation/correlation between the outcomes for both cousins; this might be of interest as well. The ACE-parameters are independent latent variables (the covariances between the A:s and C:s are not considered to induce dependent variables) and the SDP:s and PF:s are dependent variables. The covariance matrices Φ and Θδ are the ones of main interest in the model, since the latent independent variables are measured by all observed variables. The constraints to put upon the model are those that arise from the assumptions for the model: 1. The genetic likeness of full siblings is twice that of half siblings. 2. The environment that makes siblings alike is equal for siblings. The constraints also include setting covariances to zero for variables that are not considered to covary in the model. This includes only independent latent variables, so all of the information will be handled by (27) and the vectors and matrices X, Λx, ξ, δ, Φ and Θδ. The structural equation of interest will be

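$$
X = \Lambda_x \xi + \delta \ \Leftrightarrow\
\begin{pmatrix} SDP_1 \\ PF_1 \\ SDP_2 \\ PF_2 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 \\
b_A & b_C & b_E & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & b_A & b_C & b_E
\end{pmatrix}
\begin{pmatrix} A_1 \\ C_1 \\ E_1 \\ A_2 \\ C_2 \\ E_2 \end{pmatrix}
+
\begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \end{pmatrix} \tag{31}
$$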


And the covariance matrix for the latent independent variables is

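$$
\Phi =
\begin{pmatrix}
V_A & 0 & 0 & Acov & 0 & 0 \\
0 & V_C & 0 & 0 & Ccov & 0 \\
0 & 0 & V_E & 0 & 0 & 0 \\
Acov & 0 & 0 & V_A & 0 & 0 \\
0 & Ccov & 0 & 0 & V_C & 0 \\
0 & 0 & 0 & 0 & 0 & V_E
\end{pmatrix} \tag{32}
$$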

The last matrix is the observed dependent variables measurement error covariance matrix

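$$
\Theta_\delta =
\begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & \varphi_{PF_1} & 0 & r \\
0 & 0 & 0 & 0 \\
0 & r & 0 & \varphi_{PF_2}
\end{pmatrix} \tag{33}
$$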

The model indicates that SDP is measured without error, but not PF. Further restrictions will also be laid upon Acov and Ccov to fit our modelling assumptions. We assume equal environment for the siblings, which means that $Ccov = V_C$. For the genetic part the full and half siblings induce different restrictions. Full siblings share 0.5 of their genes, so $Acov_{full} = \frac{1}{2}V_A$. Half siblings share half of the amount shared by full siblings, thus $Acov_{half} = \frac{1}{2}Acov_{full} = \frac{1}{4}V_A$. The last two restrictions are the most informative parts: by enforcing them on the model the genetic influence can be measured, since it differs between the groups of full and half siblings.
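To see why the full/half sibling contrast identifies the genetic part, one can compute the covariance matrix of (SDP1, PF1, SDP2, PF2) implied by (31), (32) and (33), which is Σ = Λx Φ Λxᵀ + Θδ when ξ and δ are uncorrelated. The Python sketch below is an illustration with made-up parameter values; the function names are hypothetical, and for brevity the same PF error variance is assumed for both siblings.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def implied_covariance(v_a, v_c, v_e, b_a, b_c, b_e, phi_pf, r, genetic_share):
    """Sigma = Lambda Phi Lambda^T + Theta for (SDP1, PF1, SDP2, PF2)."""
    a_cov = genetic_share * v_a   # 1/2 for full siblings, 1/4 for half siblings
    c_cov = v_c                   # equal environment assumption
    lam = [
        [1, 1, 1, 0, 0, 0],
        [b_a, b_c, b_e, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 0, b_a, b_c, b_e],
    ]
    phi = [
        [v_a, 0, 0, a_cov, 0, 0],
        [0, v_c, 0, 0, c_cov, 0],
        [0, 0, v_e, 0, 0, 0],
        [a_cov, 0, 0, v_a, 0, 0],
        [0, c_cov, 0, 0, v_c, 0],
        [0, 0, 0, 0, 0, v_e],
    ]
    theta = [
        [0, 0, 0, 0],
        [0, phi_pf, 0, r],
        [0, 0, 0, 0],
        [0, r, 0, phi_pf],
    ]
    core = matmul(matmul(lam, phi), transpose(lam))
    return [[core[i][j] + theta[i][j] for j in range(4)] for i in range(4)]


# Made-up parameter values, for illustration only.
params = dict(v_a=0.4, v_c=0.2, v_e=0.4, b_a=-0.5, b_c=-0.2, b_e=0.0, phi_pf=3.0, r=0.3)
sigma_full = implied_covariance(**params, genetic_share=0.5)
sigma_half = implied_covariance(**params, genetic_share=0.25)

# Cov(SDP1, SDP2) is Acov + Ccov, so the two groups differ by V_A / 4.
print(sigma_full[0][2], sigma_half[0][2])
```

The sibling covariance of SDP is ½VA + VC for full siblings and ¼VA + VC for half siblings, so the difference between the groups carries the information about VA, while the common part carries the information about VC.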

2.2.5 2-level SEM for ACE

The idea of multiple levels can be applied to SEM. The model is then split up into a within-part and a between-part, where within and between refer to within and between nuclear families. This does not mean that each part is estimated separately; as in the HLM they are estimated simultaneously. We apply this approach to the ACE-model to incorporate the nuclear families into the model. The model looks like figure 4, where the upper portion of the figure represents the within part and the lower portion the between part.


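[Path diagram in two parts. WITHIN: SDP1, SDP2, PF1 and PF2, with parameters labelled w. BETWEEN: the ACE structure of figure 3, with latent factors A1, C1, E1 and A2, C2, E2 (variances VA, VC, VE), arrows marked 1 to SDP1 and SDP2, arrows marked bA, bC and bE to PF1 and PF2, and covariances Acov, Ccov and r.]

Figure 4: An illustration of the 2-level SEM describing the ACE model.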

We want to explain how the variation can be split up between the ACE-parameters and from this draw conclusions regarding environmental and genetic relevance for the influence of SDP on PF. To do this we want to incorporate the confounders in the model. One simple way to do this is to perform a regression analysis of PF on the confounders and then do the SEM-analysis on the residuals. Since the SEM aims to use the familial information we perform a simple regression analysis without using clustering. The SDP-variable is not included in the regression since this is the indicator variable in the ACE model.
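The residualization step can be sketched as follows. This is a deliberately reduced illustration in Python, assuming a single numeric confounder and made-up values; the actual analysis regresses PF on all selected confounders, most of which are categorical.

```python
def ols_residuals(y, x):
    """Residuals from the simple linear regression y = a + b * x + error."""
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


pf = [5.0, 6.0, 4.0, 7.0, 3.0]                 # made-up outcome values
birthweight = [3.4, 3.6, 3.1, 3.9, 3.0]        # made-up confounder values (kg)

# The residuals replace PF as the outcome in the SEM-analysis.
pf_residuals = ols_residuals(pf, birthweight)
print(pf_residuals)
```

The residuals have mean zero and are uncorrelated with the confounder in the sample, so whatever covariance with SDP remains is not attributable to a linear effect of that confounder.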

2.2.6 Estimation: Maximum Likelihood using Expectation Maximization

The estimation procedure uses an ML-approach with Expectation Maximization. A full description of the procedure is outside the scope of this paper, but details are given in (Muthen, 1998-2004). The model incorporates that two classes are possible, full and half siblings (more classes can be included if wanted). The method maximizes the likelihood, which is a mixture distribution of the separate classes. Through conditioning on data and class membership the likelihood is approximated and maximized.


3 Data

3.1 The data

The main idea of this study is to examine whether SDP influences the child’s Psychological Functioning capacity (PF). This is a variable that is obtained from a standardized psychological interview for Swedish boys at conscription. PF is a prognosis of the ability to cope with stress in war-time (Nilsson et al., 2001). Any conclusions made by the study concern how prenatal exposure to nicotine affects the susceptibility to stress in late adolescence/early adulthood. The boys were born between 1973 and 1988 and were at conscription between 1997 and 2006. The ages vary between 18 and 30, but the majority are 18 or 19 (281 999 of 287 456, 98.1%). To be able to make correct inferences regarding the association of interest other effects must be taken into account. To cope with the rGE an analysis must be made with covariates which may affect the outcome. How to pick these is always a subjective task, and prior knowledge must be used to make correct decisions regarding possible confounders to include in the model.

The data used for the CoS model is taken from a large dataset called the MgrCrime data base. This data base consists of many merged Swedish data bases, a couple of which are used in this study. The data bases used include information collected from: BRA (Swedish National Council for Crime Prevention), EpC (Centre for Epidemiology at the National Board of Health and Welfare), Pliktverket (National Service Administration) and SCB (Statistics Sweden). Variables of importance to this study are extracted from the data base and re-merged as suitable for the models M1−M5. The data used is stored in different data sets as described in figure 5.

Central is the Multi-Generation Register which enables connection between children and parents/grandparents through the usage of unique identification numbers for each individual (variables originating from this registry: lopnr, idfam, idextfam, idspouse, idgrandpa, lopnrmor, lopnrfar, lopnrmormor, lopnrmorfar, lopnrfarmor, lopnrfarfar). The National Crime Register is used to find out whether the parents have been convicted of a crime (crimem, crimef). The Medical Birth Register provides information concerning the birth of the children such as weight (birthweight), gestational length (pregtime), mother’s age at childbirth (agem) and mother’s SDP (rok1, sdp). From the National Service Administration the outcome variable PF is collected; the age at conscript (conscriptage) is also a variable of interest. To find out about socioeconomic status (seim, seif), income (incomem, incomef) and cohabitation (cohab) of parents the Population and Housing Census of 1990 is used. Educational level (edum, eduf) for parents is obtained from the Register of Education.


[Flow chart. Source registers and the variables taken from each:

Multi-Generation Register: lopnr, lopnrmor, lopnrfar, lopnrmormor, lopnrmorfar, lopnrfarmor, lopnrfarfar.
National Crime Register: crime.
Medical Birth Register: rok1, birthweight, agem, pregtime.
National Service Administration: pf, conscriptage.
Population and Housing Census 1990: sei, income, cohab.
Register of Education: edu.

Resulting data sets:

Sample M1, M2, M3: N = 287 456
Subsample cousin comparison M4: N = 47 990
Subsample sibling comparison M5: N = 26 118
Subsample ACE data: N = 9 460]

Figure 5: The data flow.

3.1.1 The variables

agem Mother’s age at childbirth, categorical variable taking on 5 values; 1: ”less than 20 years”, 2: ”20-25 years”, 3: ”25-30 years”, 4: ”30-35 years”, 5: ”greater than 35 years”.

birthnr An integer which indicates the child’s number in the birth order.

birthnr2 As birthnr but recalculated for the M5-subsample.

birthweight all The child’s birthweight, centered for unrelated comparison, created from bviktbs.


birthweight cou The child’s birthweight, centered for cousin comparison.

birthweight sib The child’s birthweight, centered for sibling comparison.

bviktbs The child’s birthweight.

crimef A variable indicating whether the father has been convicted for a crime between 1973 and 2004.

crimem A variable indicating whether the mother has been convicted for a crime between 1973 and 2004.

cohab The cohabitation of parents at the time of child’s birth. 0: ”Lives together with child’s father”, 1: ”Mother single”, 2: ”Not married to/divorced from child’s father (no cohabitation data)”, . : ”no information”

conscriptage Child’s age at conscript, a categorical variable taking on 3 values; 1: ”less than 17.5 years”, 2: ”17.5-18.5 years”, 3: ”greater than 18.5 years”.

eduf The education for father 2004, a categorical variable taking on 7 values; 1: ”Primary and lower secondary school, less than 9 years of education”, 2: ”Primary and lower secondary school, 9 years of education”, 3: ”Upper secondary school, 2-3 years of education”, 4: ”Post secondary school, less than 2 years of education”, 5: ”Post secondary school/college/university, 2-5 years of education”, 6: ”Postgraduate education”, 7: ”Unknown”.

edum The education for mother 2004, a categorical variable taking on values as above.

fullhalf c A variable indicating if the cousins are full- or half- cousins.

fullhalf s A variable indicating if the siblings are full- or half- siblings.

idextfam The index of the extended family.

idfam The index of the family.

idgrandpa The id-number of the man with whom the grandmother with id-number idextfam has the children included in the extended family.

idspouse The index of the man with whom the mother with id-number idfam has the child.

incomef The income of the father 1990, a categorical variable taking on 4 values; 1: ”less than 100 kkr”, 2: ”100-200 kkr”, 3: ”200-300 kkr”, 4: ”greater than 300 kkr”.

incomem The income of the mother 1990, values as above.

lopnr The index of the child of interest.


lopnrfar The index of the father of the child.

lopnrfarfar The index of the paternal grandfather of the child.

lopnrfarmor The index of the paternal grandmother of the child.

lopnrmor The index of the mother of the child, the same as idfam.

lopnrmorfar The index of the maternal grandfather of the child.

lopnrmormor The index of the maternal grandmother of the child.

pf The Psychological Functioning capacity as measured by Forsvarsverket at conscript, this is a discrete variable taking on integer values 1−9 and has a stipulated mean of 5 and variance of 4. The distribution is called stanine.

pregtime The gestational time, a categorical variable taking on 4 values; 1: ”28-31 weeks”, 2: ”31-36 weeks”, 3: ”36-41 weeks”, 4: ”greater than 41 weeks”.

rok1, sdp A dichotomous variable (= 0, 1) indicating whether the mother smoked during pregnancy.

sdp cou A contrast coded variable indicating the difference between a mother’s mean SDP during her pregnancies and the extended family’s SDP mean during pregnancies.

sdp mean cou The mean value of sdp for cousins in an extended family.

sdp sib A contrast coded variable indicating the difference between a mother’s SDP for a single child and the mean over all her pregnancies.

seif The socioeconomic status of the father 1990, this is a categorical variable taking on values; 1: ”Blue collar worker (In production of goods), not specially trained”, 2: ”Blue collar worker, specially trained”, 3: ”White collar worker, lower level”, 4: ”White collar worker, intermediate level”, 5: ”White collar worker, upper level or Self-employed, academic work”, 6: ”Self-employed”, 7: ”Uncategorized employed or no information”.

seim The socioeconomic status of the mother 1990, a categorical variable taking on values as above.

4 Results

Since we are considering normally distributed variables we will treat the PF variable asa (proposed) N(µ, σ2) variable, where µ = 5 and σ2 = 4 are the predefined values. Nomajor problems has been encountered by doing this, but the F -tests, and consequently theinference and conclusions of the analyses, are questionable. We will only consider maineffect of the confounders, not the interactions, to save computational time and simplify


Figure 6: Left: The M1 quantile-quantile-plot. Right: The M3 quantile-quantile-plot

interpretation. In the left part of figure 6 the quantile-quantile plot shows that even though the data approximately follow the normal distribution, the discrete nature of the variable is still present after controlling for SDP and clustering as in M1. The deviations from the normal line must be seen as major since the sample size is so large. A second q-q plot, see the right part of figure 6, is included for model M3, the model controlled for SDP, covariates and clustering effects. The data seem to follow the normal-quantile line more closely, but the deviations at the ends are still major. The histogram of the scaled residuals of M4 (i.e. scaled as a proposed N(0, 1) distribution) is displayed in figure 7 together with the assumed normal distribution and a fitted kernel density. This also shows that there are deviations from the normal assumption. Ways to deal with

Figure 7: The histogram of the residuals of M4.


the misfitting of the model are discussed in the discussion section. Since PF is measured on a comprehensible scale the variable will not be standardized. The actual raw data on PF have mean 4.86 and variance 3.24.

A program is written in SAS¹ to merge the data and create variables in an appropriate fashion, and then to perform the mixed model regression analysis using proc mixed. A critical part of creating the program lies in keeping track of the relations between individuals; this is done through the identification numbers (as described above). The analyses of the HLM are made in SAS. For the SEM analysis the data is created in SAS, but the analysis is made with Mplus (original code provided by Brian D’Onofrio).
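To illustrate how identification numbers can be used to keep track of relations, the Python sketch below classifies two mothers as full or half sisters from the identification numbers of their own parents (for a child, these are lopnrmorfar and lopnrmormor). It is only an illustration under assumptions: the dictionary keys and the function name are hypothetical, the actual SAS program may proceed differently, and only the maternal side is shown.

```python
def sister_relation(mother_a, mother_b):
    """Classify two different mothers by the parents they share.

    Each mother is a dict with the hypothetical keys 'father' and 'mother'
    holding the identification numbers of her own parents. A missing
    identification number (None) never counts as shared.
    """
    shared = sum(
        mother_a[key] is not None and mother_a[key] == mother_b[key]
        for key in ("father", "mother")
    )
    return {2: "full", 1: "half", 0: "unrelated"}[shared]


# The children of full sisters are full cousins and the children of half
# sisters are half cousins.
anna = {"father": 101, "mother": 102}
berit = {"father": 101, "mother": 102}
carin = {"father": 101, "mother": 999}
relation_ab = sister_relation(anna, berit)
relation_ac = sister_relation(anna, carin)
```

The same comparison applied to two children of one mother, using the identification numbers of their fathers, would separate full from half siblings.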

4.1 The problem of belonging to multiple families

A problem not yet addressed is the possibility for a child to belong to multiple nuclear and extended families in the HLM models. To start with, it is possible to avoid the multiple nuclear family problem by indexing nuclear families by the mother. The reason for this is that the child stays with the mother in most cases of break-up, and thus receives the (unmeasured) environment from her. The child then shares the environment with his half siblings. The child might still belong to two extended families, the other through his father. A couple of different approaches to this are listed below:

1. Allowing a child to appear only once, deleting one of the extended families that he belongs to.

2. Only indexing extended families through one of the parents, missing the comparisons that could (should) be made through the other extended family that he belongs to.

3. Allowing multiple extended family inclusion, letting the child appear a maximum of two times in the data, and thus twice in the analysis.

4. As 3 but controlling for this in some manner.

The fourth option should be the best, but limited time prevents us from exploring it. Each of the others carries some problems. Number one should be valid for inference but loses a lot of information on the way. The second approach would ignore a lot of correlation in the data and (possibly) bias the inference in some manner. A combination of the second and third choice is used, depending on what the submodel is examining. The multiple inclusion makes the inferences concerning cousins more correct, while allowing for more comparisons between cousins. But since some families appear twice, the sibling comparisons are bound to be overly exact, producing narrower confidence intervals, as well as being biased. To minimize the bias and avoid confidence interval problems, multiple inclusion is applied to the datasets where cousins are compared, i.e. M3 and M4. In the other models we use the single inclusion through one of the parents. An extra analysis of M3 with the single inclusion is also made to investigate

¹ All program code can be obtained from the author.


whether the sibling comparison is trustworthy. The number of multiple inclusions in the dataset used in M3 is 69 196, as compared to the sample size of 287 456.
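The difference between single and multiple inclusion can be sketched in Python as follows. This is a hypothetical illustration, not the program used in the study: the keys `ext_fam_mother` and `ext_fam_father` and the function name are assumptions, and single inclusion is taken here to mean the first available extended family.

```python
def expand_inclusions(children, multiple=True):
    """Return one analysis row per (child, extended family) pair.

    children: list of dicts with the hypothetical keys 'child',
    'ext_fam_mother' and 'ext_fam_father' (None when unknown).
    With multiple=False a child contributes at most one row.
    """
    rows = []
    for c in children:
        families = [f for f in (c["ext_fam_mother"], c["ext_fam_father"])
                    if f is not None]
        if not multiple:
            families = families[:1]  # single inclusion through one parent
        for family in families:
            rows.append({"child": c["child"], "ext_fam": family})
    return rows


example = [
    {"child": 1, "ext_fam_mother": 10, "ext_fam_father": 20},
    {"child": 2, "ext_fam_mother": 10, "ext_fam_father": None},
]
rows_multiple = expand_inclusions(example)               # child 1 appears twice
rows_single = expand_inclusions(example, multiple=False)  # one row per child
```

For scale, the ratio of the two figures reported for the M3 dataset, 69 196 / 287 456, is about 0.24.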

4.2 HLM results

The selection process finds 287 456 individuals available for analysis; all of these are included in models M1 − M3. In the data set there are 117 822 individuals with non-SDP, 44 550 with SDP and 125 084 with a missing value for SDP. There are 2 709 individuals with SDP that differs from any of their siblings, and 18 190 individuals with SDP that differs from any of their cousins and/or siblings.

In table 1 the regression coefficients for PF on SDP are listed. Model M1 shows that PF is associated with SDP: the (crude) regression coefficient of -0.37 tells us that on average a child whose mother smoked during the pregnancy will score 0.37 units lower than a child whose mother did not smoke during her pregnancy. The p-value is < 0.0001, which tells us that the study is not in vain and that there is a connection between SDP and PF, but this could still be confounded. We will now examine whether this result still holds when using the models M2 − M5 to address causation.
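As a rough check on how the estimates, standard errors and 95% confidence intervals in table 1 relate to each other, the Python sketch below computes a normal-approximation (Wald) interval for the crude M1 estimate. The quantile 1.96 is an assumption: proc mixed bases its intervals on a t distribution, which at this sample size should be practically indistinguishable, and the inputs are the rounded values from table 1.

```python
def wald_ci(estimate, std_error, z=1.96):
    """Approximate 95% confidence interval: estimate +/- z * standard error."""
    half_width = z * std_error
    return (estimate - half_width, estimate + half_width)


# Crude M1 estimate and standard error as reported in table 1.
low, high = wald_ci(-0.37, 0.010)
print(round(low, 2), round(high, 2))  # -0.39 -0.35
```

Because the table entries are rounded, intervals recomputed in this way can differ from the reported ones in the last digit.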

In model M2 the confounders are included and this yields a diminished regression coefficient for SDP; the value is -0.14 and the effect is still significant. The significance levels for SDP and the confounders in M2 are listed in table 2, showing that all confounders are significant at the 5% level. When comparing how much of the variation is explained at the three different levels, M1 has more variation explained at level 3, between extended families, than M2 (0.092 vs 0.058). This is expected since covariates are included that can differ between individuals both within and between nuclear families as well as between extended families. Another effect is that the between nuclear families variation (level 2) is slightly larger in M2 (0.207 vs 0.200).

In M3 the unrelated comparison has a point estimate of -0.17. When comparing cousins without any consideration of genetic relation there is a borderline significant result with estimate -0.06 and p-value 0.0418. When the siblings are compared, again not making use of information regarding genetic relatedness, there is an estimate of 0.20 in the positive direction. This finding is quite astonishing; the sibling who has been exposed to SDP is less susceptible to stress than the unexposed one. The extra single-inclusion analysis confirms the results: the differences in point estimates are minimal, and the standard errors are larger but do not yield non-significant values. A possible reason for this result is a difference in PF depending on whether the siblings are full or half siblings; this is examined in the M5 model. There could also be other unmeasured confounders that influence PF.

In model M4 the subset includes 47 990 individuals. In this subset the number of full cousins where both have non-SDP is 27 372, and the number of full cousins where both have SDP is 5 062. The number of half cousins where both have non-SDP is 978, and where both have SDP 492. The number of full cousins with varying SDP is 17 356, and of half cousins with varying SDP 1 380. The multiple inclusions (i.e. individuals belonging to multiple extended families) number 4 650. Using this subsample in M4 we still get significant results, but in the range 0.01 < p-value < 0.05 (the unrelated comparison has p-value < 0.0001 though).


The comparison of half cousins yields an estimate of -0.25, and when comparing full cousins the estimate is -0.03 = -0.25 + 0.22. The difference is found through the interaction between cousin type and SDP, in table 1 denoted ”Full cousin - Half cousin”.
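The way the interaction term is read can be sketched in a few lines of Python. The indicator coding (1 for full cousins, 0 for half cousins) and the function name are assumptions made for illustration; the numbers are the rounded M4 estimates quoted above.

```python
def within_cousin_effect(b_half, b_interaction, full):
    """SDP effect within cousin pairs.

    b_half: estimate for half cousins (the reference group).
    b_interaction: the "Full cousin - Half cousin" interaction estimate.
    full: 1 for full cousins, 0 for half cousins (assumed coding).
    """
    return b_half + b_interaction * full


half_cousins = within_cousin_effect(-0.25, 0.22, full=0)
full_cousins = within_cousin_effect(-0.25, 0.22, full=1)
print(round(half_cousins, 2), round(full_cousins, 2))  # -0.25 -0.03
```

The same reading applies to the ”Half sibling” and ”Full sibling - Half sibling” rows of table 1 for model M5.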

In model M5 the result for the siblings is examined. In table 3 the confounders and exposure are listed together with the p-values. The sample size is 26 118. The number of full siblings where both have non-SDP is 18 702, and where both have SDP 4 450. The number of half siblings where both have non-SDP is 256, and where both have SDP 292. The number of full siblings where SDP varies is 2 300, and for half siblings 118. The unrelated comparison is still significant and in the negative direction, estimate -0.18. The comparisons within sibling pairs, in table 1 denoted ”Half sibling” and ”Full sibling - Half sibling”, are not significant. This contradicts the previous M3 result that SDP has a positive effect on PF when comparing siblings; the variation between full and half siblings seems to be the reason for this. The level 3 variance parameter turned out to be non-significant, p-value 0.181, so it was set to zero. This indicates that there is no variation when comparing extended families; all the variation is captured at the nuclear family level. The model is thus a 2-level HLM. The fact that the extended family level is non-significant gives an indication of the possible origin of the familial confounding. The differences in environment need not be in the model (as they are when comparing cousins); thus the familial confounding seems to originate in the differences within the nuclear family, i.e. genetics.


SDP estimates
Comparison                     Model     Estimate    Standard error   95% CI
Unrelated comparison           Model 1   -0.37***    0.010            (-0.39, -0.35)
Unrelated comparison           Model 2   -0.14***    0.012            (-0.16, -0.11)
Unrelated comparison           Model 3   -0.17***    0.012            (-0.20, -0.15)
Unrelated comparison           Model 4   -0.24***    0.028            (-0.29, -0.18)
Unrelated comparison           Model 5   -0.18***    0.038            (-0.25, -0.10)
Cousin comparison
  All cousins                  Model 3   -0.06**     0.027            (-0.11, -0.002)
  Half cousins                 Model 4   -0.25**     0.108            (-0.46, -0.03)
  Full cousin - Half cousin    Model 4    0.22**     0.112            (0.005, 0.44)
Sibling comparison
  All siblings                 Model 3    0.20***    0.050            (0.10, 0.30)
  Half siblings                Model 5    0.13*      0.069            (-0.004, 0.27)
  Full sibling - Half sibling  Model 5   -0.33*      0.349            (-1.02, 0.35)

                               Model 1   Model 2   Model 3   Model 4   Model 5
Offspring included             287 456   287 456   287 456   47 990    26 118

Variance components: amount explained (p-value)
                               Model 1           Model 2           Model 3           Model 4           Model 5
Between extended family        0.092 (<0.0001)   0.058 (<0.0001)   0.062 (<0.0001)   0.058 (<0.0001)   0 (0.181)
Between nuclear family         0.200 (<0.0001)   0.207 (<0.0001)   0.202 (<0.0001)   0.145 (0.0175)    0.260 (<0.0001)
Within nuclear family          0.707 (<0.0001)   0.735 (<0.0001)   0.736 (<0.0001)   0.797 (<0.0001)   0.740 (<0.0001)

Table 1: Regression parameter estimates for SDP regressed on PF for the different models. The intervals (CI:s) are at 95% confidence level. The ”amount explained” in the variance part shows how the variance is partitioned between the levels. *** p-value<0.01, ** 0.01<p-value<0.05 and * p-value>0.05.


Solution for fixed effects for M2

Effect            p-value
ROK1              < 0.0001
seim              < 0.0001
incomem           < 0.0001
seif              < 0.0001
incomef           < 0.0001
pregtime          0.0255
agem              < 0.0001
cohab             < 0.0001
birthyear         < 0.0001
birthnr           < 0.0001
edum              < 0.0001
eduf              < 0.0001
crimem            < 0.0001
crimef            < 0.0001
birthweight all   < 0.0001

Table 2: p-values for covariates in model 2, including the centered continuous ones.

4.3 SEM results

The residuals to be used for the ACE model are produced using linear regression. In this step a lot of observations are lost because of missing values. This is the reason for the few residual observations left for analysis: the sample size is reduced from the whole sample of 287 456 to the ACE sample of 9 460.

The ACE model was fitted; the CFI was 0.869 and the RMSEA was 0.032. The low CFI indicates that the model does not fit the data well. An inspection of the parameters shows that the C-parameter has a point estimate of 0.006 but a standard error of 28.978, making it non-significant. The parameter bC was unstable in the sense that it was large (= 4.095) but had a very large standard error (= 424.367). This result implies that one latent variable (in this case the C-variable) is negligible (D’Onofrio et al., 2008). A new model constraining the parameters associated with the C-parameter to be zero was fitted. The model yields a CFI of 0.884 and an RMSEA of 0.029; the results of this model are collected in table 4. Even though the model fit is better, the CFI value is still below the recommended value of 0.90 (Schumacker & Lomax, 2004). The A-parameter has estimate 0.104 and standard error 0.006; the E-parameter has estimate 0.040 and standard error 0.008. Thus 72% of the remaining variance between the adult siblings, after correcting for known confounders, can be explained by genetics (= the heritability), and 28% by environment making siblings different. The value of the parameter w, measuring the influence of SDP on PF within nuclear family, is in the positive direction. This is in accordance with the result for HLM model M3 where we found a positive influence when comparing siblings. The


Solution for fixed effects for M5

Effect             p-value
sdp_mean_cou       < 0.0001
sdp_sib            > 0.05
fullhalf           0.0296
sdp_sib*fullhalf   > 0.05
seim               < 0.0001
seif               < 0.0001
incomem            > 0.05
incomef            < 0.0001
agem               < 0.0001
conscriptage       < 0.0001
cohab              > 0.05
edum               < 0.0001
eduf               0.0037
crimem             > 0.05
crimef             0.0002
pregtime           > 0.05
birthnr2           < 0.0001
vikt_sib           > 0.05

Table 3: Solution for confounders in model 5, including the centered continuous ones.

95% confidence interval of parameter bE includes zero, but in accordance with the earlier discussion we choose to keep the parameter in the model. The parameter rh (modelling the correlation in outcome between half siblings) is also found to be non-significant, but since the parameter rf (correlation in outcome between full siblings) is significant we keep the parameter in the model. As for the w-parameter, we have found it to be non-significant, but excluding it from the model gives a poorer fit (CFI=0.875). Since this parameter is central to our analysis (the association between PF and SDP) we choose to keep it in the model.
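The 72% and 28% shares reported for the constrained ACE model follow directly from the estimated A and E variances (0.104 and 0.040, with the C variance fixed to zero). A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def variance_shares(components):
    """Express each variance component as a share of their sum."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


# A and E estimates from the constrained ACE model; C is fixed to zero.
shares = variance_shares({"A": 0.104, "E": 0.040})
print(round(shares["A"], 2), round(shares["E"], 2))  # 0.72 0.28
```

The share attributed to A is what the text refers to as the heritability.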

5 Discussion

We have tried to assess causation using two hierarchical cluster model designs: HLM and SEM. When comparing unrelated individuals, SDP is associated with a lower PF score, implying that if a mother smoked during her pregnancy her child will show a higher susceptibility to stress. But when taking familial relations into account using the hierarchical design the effect vanishes. The HLM indicated that the familial confounding was genetic in origin, and the ACE model that was fitted using SEM confirmed this. We can make a cautious interpretation of our results: Mothers prone to SDP provide genes that make the child more susceptible to stress than children of mothers less prone to SDP. Even


CFI     0.884
RMSEA   0.029

Effect   Estimate   Standard error   Confidence interval
VA        0.104     0.006            (0.092, 0.117)
VC        -         -                (-, -)
VE        0.040     0.008            (0.024, 0.056)
bA       -0.362     0.149            (-0.655, -0.070)
bC        -         -                (-, -)
bE       -0.213     0.465            (-1.125, 0.698)
w         0.372     0.228            (-0.075, 0.818)
rf        0.217     0.043            (0.133, 0.301)
rh       -0.013     0.147            (-0.301, 0.274)

Table 4: The solution for the ACE model with 95% confidence intervals. The parameter names are as in figure 4 except for rf and rh which are the values for the r in figure 4 but for full- and half- cousins respectively.

though parents supply the (possibly hazardous) environment as well, we have reasons to believe that genes play a significant role.

Through the submodelling in HLM we found some indications that there is familial confounding of the effect of SDP on PF. The term comparing full and half cousins in M4 is (borderline) significant, but in M5 the comparison between full and half siblings is non-significant. This can be interpreted as the unmeasured confounding increasing with genetic relatedness and possibly environmental likeness. If the family did not matter, the effect would always be present, and quite similar in magnitude, in all models. As shown, this is not the case. The use of SEM in the ACE model allows us to examine the cause of the familial confounding; this is done through comparing aunts and their children. The finding that the C-part, the environment making the aunts alike, is negligible indicates that the confounding is due to genetics.

5.1 Limitations of the model

The model is limited in that the outcome variable is restricted to be continuous and normally distributed. In the actual study the assumption is breached and thus inferences are not to be accepted without further investigation. Experience tells us that the deviations from normality could be acceptable though. For applications to other data sets, and for other analyses using this study as a template, the inferences would be valid if the outcome variable is continuous and normally distributed. When examining the assumptions regarding the environmental upbringing of siblings and cousins, a couple of notes have to be made. The assumption that cousins have a non-shared environment can be incorrect. One can easily imagine that cousins would, on average, have a more similar environment to each other than to unrelated children. And the opposite can be said about siblings; the environment might not be identical during the upbringing of the children (one way to cope with this is to include mother’s age and/or birth order as covariates in the model, as we have done in the current study).


5.2 Future improvements

If we do not trust the normality of the data, a way to ”force” normality is to rank the residuals, map them onto a perfect N(0, 1) distribution and perform the analysis on the rank-ordered residuals. For other applications, logistic regression could fit the model to binary outcome variables. This can be done using SAS’s proc glimmix. If the outcome is not binary but can take many discrete values, Generalized Estimating Equations could solve the problem; this is applicable to both ordinal and nominal outcomes. This could be another solution to the non-normality problem in our study. A suitable software application for this would be SAS’s proc genmod. Using Cox regression, survival data could be examined within the CoS model context. In SAS, proc phreg (phreg ∼ Proportional Hazard REGression) deals with these kinds of models.
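The rank-based way of ”forcing” normality mentioned above can be sketched in Python as follows. This is one possible variant, not a procedure carried out in this study: the plotting position (rank − 0.5)/n and the use of average ranks for ties are assumptions, and other offsets are in common use.

```python
from statistics import NormalDist


def rank_to_normal(values):
    """Map values to N(0, 1) quantiles according to their ranks.

    Tied values receive the average of their ranks, so they are mapped
    to the same quantile. Uses the plotting position (rank - 0.5) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical residuals; the output keeps their ordering.
transformed = rank_to_normal([0.8, -1.3, 0.1, 2.4, -0.2])
```

With heavily tied data, such as the raw stanine scores, many observations would be mapped to the same value, so the transformation is more useful for residuals than for the discrete outcome itself.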

The fitting of the ACE model in Mplus can incorporate more adult siblings, allowing other possible relations (full and half cousins, mono- and dizygotic twins). Also, the amount of lost observations could be lowered using some kind of imputation of missing values. As the model is implemented now, many observations are cut out of the estimation because of some missing values: when finding the residuals, no values will be produced for these observations. An even better approach is to include the confounders directly in the SEM model rather than modelling ACE on the residuals. There are still many problems to be solved before doing this. The problems concern e.g. at which level the confounders belong and how the confounders should be modelled to affect the outcome (through latent factors, directly, in combination with the exposure, etcetera).

The number of genetic and environmental relationships covered in this model can be extended to include twins. Twins can be mono- or dizygotic (sharing 1 or 0.5 of their genes) and, as in CoT, this can help to elucidate the source of familial confounding.


References

D’Onofrio, Brian M. 2005. Encyclopedia of Statistics in Behavioral Science, Children of Twin design. Vol. 1. John Wiley & Sons, Ltd, Chichester.

D’Onofrio, Brian M., van Hulle, Carol A., Waldman, Irwin D., Rodgers, Joseph Lee, Harden, K. Paige, Rathouz, Paul J., & Lahey, Benjamin B. 2008. Smoking during pregnancy and offspring externalizing problems: An exploration of genetic and environmental confounds. Development and Psychopathology, 20, 139–164.

D’Onofrio, Brian M., Singh, Amber L., Iliadou, Anastasia, Lambe, Mats, Hultman, Christina, Neiderhiser, Jenae M., Langstrom, Niklas, & Lichtenstein, Paul. In press. A quasi-experimental approach of maternal smoking during pregnancy and offspring academic achievement. Child Development.

Greene, W. H. 2003. Econometric Analysis. 5 edn. Upper Saddle River, NJ: Prentice-Hall.

Harden, K. Paige, Lynch, Stacy K., Turkheimer, Eric, Emery, Robert E., D’Onofrio, Brian M., Slutske, Wendy S., Waldron, Mary D., Heath, Andrew C., Statham, Dixie J., & Martin, Nicholas G. 2007. A Behavior Genetic Investigation of Adolescent Motherhood and Offspring Mental Health Problems. Journal of Abnormal Psychology, 116(4), 667–683.

Hays, Ron D., Revicki, Dennis, & Coyne, Karin S. 2005. Application of Structural Equation Modeling to Health Outcomes Research. Evaluation and the Health Professions, 28, 295–309.

Littell, Ramon C., Milliken, George A., Stroup, Walter W., Wolfinger, Russell D., & Schnabenberger, Oliver. 2006. SAS for Mixed Models. SAS Press.

Montgomery, Douglas C. 2005. Design and Analysis of Experiments. 6 edn. John Wiley & Sons, Inc.

Muthen, Bengt O. 1998-2004. Mplus Technical Appendices. Los Angeles, CA: Muthen & Muthen.

Muthen, Linda K., & Muthen, Bengt O. 1998-2007. Mplus User’s Guide. 5 edn. Los Angeles, CA: Muthen & Muthen.

Neuhaus, John M., & McCulloch, Charles E. 2006. Separating between- and within-cluster covariate effects by using conditional and partitioning methods. Journal of the Royal Statistical Society, 68, 859–872.

Nilsson, Peter M., Nyberg, Per, & Ostergren, Per-Olof. 2001. Increased susceptibility to stress at a psychological assessment of stress tolerance is associated with impaired fetal growth. International Journal of Epidemiology, 30, 75–80.


Raudenbush, Stephen W., & Bryk, Anthony S. 2002. Hierarchical Linear Models - Applications and Data Analysis Methods, second edition. Sage Publications.

Rutter, Michael. 2007. Proceeding From Observed Correlation to Causal Inference. Perspectives on Psychological Science, 2(4), 377–395.

Schaalje, G. Bruce, McBride, Justin B., & Fellingham, Gilbert W. 2001. Approximations to Distributions of Test Statistics in Complex Mixed Linear Models Using SAS Proc MIXED. http://www2.sas.com/proceedings/sugi26/p262-26.pdf.

Schumacker, Randall E., & Lomax, Richard G. 2004. A Beginner’s Guide to Structural Equation Modeling. 2 edn. Lawrence Erlbaum Associates, Inc.

Slanger, W. D. 1996. Least Squares Lehmann-Scheffe Estimation of Variances and Covariances with Mixed Linear Models. Journal of Animal Science, 74, 2577–2585.

Smith, S. P., & Graser, H.U. 1986. Estimating Variance Components in a Class of Mixed Models by Restricted Maximum Likelihood. Journal of Dairy Science, 69(4), 1156–1165.
