
MCMC for random effect models - UoB Interactive Server
seis.bris.ac.uk/~frwjb/materials/wbnztalk.pdf


Page 1:

MCMC for random effect models

In the last session we looked at an educational dataset containing 4059 pupils in 65 schools. We were interested in what affects a pupil's exam score at the end of their schooling (normexam). We considered several predictor variables: an intake score based on a reading test (standlrt), the pupil's gender, and the type of school they attended. Finally we considered the effects of the 65 schools themselves and fitted 64 dummy variables for the schools, taking school 1 as the baseline school. We could alternatively have removed the intercept term and fitted all 65 school dummies; the school effects would then be relative to 0 rather than to school 1.

As we have been using MCMC sampling in a Bayesian framework we have had to specify prior distributions for all our unknown parameters. For the fixed effects in the model (including the school effects) we have so far used independent uniform priors for each effect, to say that we know nothing 'a priori' about any of the parameters. Our model can be written

y_ij = X_ij β + u_j + e_ij,   e_ij ~ N(0, σ²_e)

where X_ij contains the predictor variables (excluding an intercept), u_j are the (fixed) school effects and e_ij are the individual pupil residuals. We are assuming that the pupil-level residuals are random effects but that the school effects are fixed independent quantities. This means that if we knew the value of β, we could evaluate u_j from the data for school j alone. We could of course fit the school effects as random effects, which results in the model:

y_ij = X_ij β + u_j + e_ij,   u_j ~ N(0, σ²_u),   e_ij ~ N(0, σ²_e)
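As a concrete illustration of the random effects formulation, here is a minimal simulation sketch of the two-level model (Python/NumPy; the group sizes, variances and coefficients are invented for illustration, not taken from the tutorial data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes and parameter values only (hypothetical, not estimates)
n_schools, pupils_per_school = 65, 60
sigma2_u, sigma2_e = 0.1, 0.55
beta = np.array([0.0, 0.6])  # intercept and a single slope

school = np.repeat(np.arange(n_schools), pupils_per_school)
X = np.column_stack([np.ones(school.size), rng.normal(size=school.size)])

u = rng.normal(0.0, np.sqrt(sigma2_u), size=n_schools)    # u_j ~ N(0, sigma2_u)
e = rng.normal(0.0, np.sqrt(sigma2_e), size=school.size)  # e_ij ~ N(0, sigma2_e)
y = X @ beta + u[school] + e                              # y_ij = X_ij b + u_j + e_ij
```

The only change from the fixed effects model is that the 65 school effects are drawn from a common Normal distribution rather than being free parameters.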

Page 2:

Fixed or Random Effects?

Consider a set of data containing a class of pupils, and in particular consider a pupil John Smith who is a boy from a working-class background. Let us assume the pupils have taken two tests: one before they entered the class (on which John got 6 out of 10) and one that the children have just taken (on which John got 30 out of 40). We could fit a simple model that relates the current test to the previous test, gender and social class, and this will give us predicted scores for combinations of the predictors. For example we may find that boys from a working-class background who scored 6 in the pretest are predicted to score 27.5 in this test. However if our interest is directly in John Smith then we care more about the 30 that he scored than the 27.5 that our model predicts.

Similarly for higher-level units, for example schools: if we are really interested in the results of a particular school then we are really interested in that school's data, and hence a fixed effect model is appropriate. If however we have chosen to sample from several schools and are primarily interested in controlling for the school effects, then a random effects model is more appropriate. Of course variables like gender should always be treated as fixed effects, as it doesn't make sense to think of male and female as a sample from a population of genders!

Page 3:

MCMC Algorithm for random effects models

When fitting random effects models we cannot calculate the posterior distributions directly, so MCMC algorithms are required. For the two-level variance components model, MLwiN uses the following Gibbs sampling algorithm.

First set starting values (MLwiN often uses the ML estimates) for each of the unknown parameters: β(0), u_j(0), σ²_u(0), σ²_e(0). Here the number in brackets refers to the iteration number. We then perform the following four steps:

1. Generate β(1) from its multivariate Normal conditional posterior distribution p(β | u(0), σ²_u(0), σ²_e(0)).
2. For each of the 65 u_j, generate u_j(1) from its Normal conditional posterior distribution p(u_j | β(1), σ²_u(0), σ²_e(0)).
3. Generate σ²_u(1) from its inverse Gamma conditional posterior distribution p(σ²_u | u(1)).
4. Generate σ²_e(1) from its inverse Gamma conditional posterior distribution p(σ²_e | β(1), u(1)).

These four steps are then repeated over and over, each time replacing the starting values with the values generated at the previous iteration.
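The four steps above can be written out directly in code. The following is a minimal Gibbs sampler sketch for the variance components model (Python/NumPy, derived from the standard conjugate full conditionals; the Gamma(0.001, 0.001) priors on the precisions are an assumption chosen to mirror common defaults, and the slide itself does not specify them):

```python
import numpy as np

def gibbs_vc(y, X, school, n_iter=200, seed=0):
    """Gibbs sampler sketch for y_ij = X_ij b + u_j + e_ij with a flat prior
    on b and Gamma(0.001, 0.001) priors on the precisions 1/s2u, 1/s2e."""
    rng = np.random.default_rng(seed)
    J = school.max() + 1
    n_j = np.bincount(school, minlength=J)          # pupils per school
    XtX_inv = np.linalg.inv(X.T @ X)
    u = np.zeros(J)
    s2u, s2e = 1.0, 1.0                             # crude starting values
    a = b = 0.001
    out = []
    for _ in range(n_iter):
        # 1. beta from its multivariate Normal conditional
        beta_hat = XtX_inv @ (X.T @ (y - u[school]))
        beta = rng.multivariate_normal(beta_hat, s2e * XtX_inv)
        # 2. each u_j from its Normal conditional
        r = y - X @ beta
        v_j = 1.0 / (n_j / s2e + 1.0 / s2u)
        m_j = v_j * np.bincount(school, weights=r, minlength=J) / s2e
        u = rng.normal(m_j, np.sqrt(v_j))
        # 3. s2u from its inverse Gamma conditional (sample the precision)
        s2u = 1.0 / rng.gamma(a + J / 2, 1.0 / (b + (u @ u) / 2))
        # 4. s2e from its inverse Gamma conditional
        e = r - u[school]
        s2e = 1.0 / rng.gamma(a + y.size / 2, 1.0 / (b + (e @ e) / 2))
        out.append((beta.copy(), s2u, s2e))
    return out

# Tiny demo run on simulated data (all values illustrative)
rng = np.random.default_rng(7)
school = np.repeat(np.arange(10), 30)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = X @ np.array([1.0, 0.5]) + rng.normal(0, 0.3, 10)[school] + rng.normal(0, 0.7, 300)
draws = gibbs_vc(y, X, school, n_iter=100, seed=11)
```

Each element of `draws` is one iteration's (β, σ²_u, σ²_e) triple; discarding an initial burn-in and summarising the rest gives the posterior estimates.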

Page 4:

Model Comparison (Maximum Likelihood Estimation)

Using the maximum likelihood IGLS method, model comparison is straightforward for Normal responses provided the two models are nested. For each model we have a deviance; in reality IGLS fits a huge multivariate Normal response model with a structured variance matrix, Y ~ MVN(Xβ, V), that is equivalent to the required multilevel model. Assuming we use the multivariate Normal likelihood, each random variance and covariance is a parameter in the structured variance matrix and so uses up 1 degree of freedom. So, for example, in moving from the variance components model to the random slopes regression model we add 3 new parameters: the fixed effect for standlrt, the between-school slopes variance and the covariance between slopes and intercepts. We can therefore compute the change in deviance, which has a (large sample) χ² distribution with 3 degrees of freedom, and perform a likelihood ratio test. Here the change in deviance = 5829.23 − 5514.16 = 315.07, which is highly significant, so unsurprisingly the random slopes regression model is significantly better than the variance components model.
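The likelihood ratio test arithmetic can be checked in a few lines (the 7.815 threshold is the standard 95% critical value of χ² on 3 degrees of freedom):

```python
# Deviances quoted in the text for the two nested models
dev_vc, dev_rsr = 5829.23, 5514.16
change = dev_vc - dev_rsr            # change in deviance, ~ chi-squared on 3 df

CHI2_95_3DF = 7.815                  # 95% critical value, chi-squared(3)
significant = change > CHI2_95_3DF   # likelihood ratio test at the 5% level
```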

Page 5:

Model Comparison (MCMC)

Spiegelhalter, Best et al. (2002) introduced the Deviance Information Criterion (DIC) diagnostic in a paper to the RSS this year, which got a mixed reaction from the Bayesian community present. The DIC is an extension of the AIC that can be calculated directly from the chains produced by an MCMC run, and it combines model fit with complexity. As models get more complex through the addition of extra parameters their fit improves, so the DIC (like the AIC) penalizes additional parameters in order to choose a parsimonious model. The diagnostic is DIC = D(θ̄) + 2·pD, where pD is the 'effective number of parameters', calculated from the chain as the difference between the mean deviance (D̄) and the deviance at the mean parameter values (D(θ̄)). In random effect models the effective number of parameters is less than the nominal number that an equivalent fixed effect model would have, because of the additional distributional assumption on the random effects. The DIC diagnostic then gives a single number for each model, with the smallest value representing the best model.
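A minimal sketch of the DIC calculation from an MCMC run (Python; the deviance values fed in at the bottom are made up purely to exercise the function):

```python
import numpy as np

def dic_from_chain(deviance_chain, deviance_at_mean):
    """pD = Dbar - D(theta_bar); DIC = D(theta_bar) + 2*pD (= Dbar + pD)."""
    dbar = float(np.mean(deviance_chain))   # mean deviance over the chain
    pd = dbar - deviance_at_mean            # effective number of parameters
    return deviance_at_mean + 2.0 * pd, pd

# Toy numbers: a three-draw 'chain' with Dbar = 102 and D(theta_bar) = 99
dic, pd = dic_from_chain([100.0, 104.0, 102.0], deviance_at_mean=99.0)
```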

Page 6:

Tutorial dataset example

We will consider the DIC diagnostic with the tutorial dataset, fitting several models and comparing the DIC values obtained, as detailed below:

- Model 1 is the NULL model, fitting simply a constant (mean) for the normexam response variable.
- Model 2 is a simple linear regression of normexam against standlrt.
- Model 3 consists of 65 fixed school effects (in fact 64 dummy variables).
- Model 4 treats school as a random effect.
- Model 5 consists of 65 fixed school intercepts and 65 fixed school slopes, i.e. fitting a separate linear regression to each school.
- Model 6 is a random slopes regression model.

Model | Nominal parameters | Effective parameters pD | Deviance D(θ̄) | DIC
1 | 2 | 2.01 | 11509.37 | 11513.38
2 | 3 | 3.02 | 9760.51 | 9766.56
3 | 66 | 66.06 | 10783.49 | 10915.62
4 | 68 | 60.03 | 10790.01 | 10910.08
5 | 131 | 131.02 | 8987.03 | 9249.08
6 | 136 | 91.67 | 9031.32 | 9214.65

So we see that the RSR model is the best model fitted so far. Notice how the random effects models (4 & 6) have more nominal parameters than their fixed counterparts but fewer effective parameters and a lower DIC value.

Page 7:

Interval Estimation of School Ranks

UK society today is often driven by:

- Competition: schools competing to be higher on a league table, hospitals being judged on performance, universities being judged on their research output.
- Sensationalism: newspapers reporting the 'best' school or 'worst' hospital, with everything centred on point estimates rather than interval estimates.

It is worth noting that in a league table someone has to be top and someone has to be bottom. In reality what matters is not whether a school is lower in a league table than another school but whether it is 'significantly worse' than another school. Interval estimates go some way to correcting for sensationalized results based on point estimates. However we must still be careful not to make the leap that being 'significantly worse' on some univariate measure, e.g. exam results at age 16, necessarily makes one school worse than another. The 'worse' school may have a poorer student intake, which we could account for by doing a 'value-added' study where we adjust for intake in our model. Alternatively it may excel in particular subjects but do very badly in others. Ideally, fitting statistical models will highlight unusual (outlying) schools where more investigation could be carried out and where a government could potentially concentrate resources.

Page 8:

MCMC and Interval Estimates for ranks

One clear advantage of simulation-based approaches is the ease with which they can be used to construct point and interval estimates for additional parameters that are functions of the model parameters. Constructing such estimates using ML approaches would either involve a great deal of complex mathematics or be impossible; MCMC estimation gives us these estimates for free!

In this box we see a plot of 95% interval estimates for the 65 school ranks in our dataset sorted in mean rank order. For the ‘best’ schools, school 59 has a rank interval [1,4] while school 28 has a rank interval [1,5] and 2 other schools also have intervals that include 1. For the ‘worst’ schools, school 53 has a rank interval [63,65] while school 63 has an interval [56,65] and school 6 has an interval [59,64] and on average has a worse rank than school 63. We can therefore see that choosing a ‘best’ or ‘worst’ school is not a simple task! Also if we examine the schools nearer the centre of the dataset our task of judging the better of two schools is even more difficult.
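Rank intervals like these are straightforward to compute from MCMC output: rank the schools at every iteration, then summarise each school's chain of ranks. A sketch (Python/NumPy, using artificial chains for five hypothetical, well-separated schools rather than the real 65):

```python
import numpy as np

def rank_intervals(effect_chains, lower=2.5, upper=97.5):
    """effect_chains: (n_iterations, n_schools) array of sampled school
    effects. Returns the (lower, upper) percentile interval of each
    school's rank (rank 1 = lowest effect at that iteration)."""
    ranks = effect_chains.argsort(axis=1).argsort(axis=1) + 1
    return np.percentile(ranks, [lower, upper], axis=0)

rng = np.random.default_rng(0)
# Hypothetical chains for 5 schools with clearly separated effects
chains = rng.normal(loc=[[-2, -1, 0, 1, 2]], scale=0.1, size=(1000, 5))
lo, hi = rank_intervals(chains)
```

With overlapping posteriors, as in the real data, the intervals widen and neighbouring schools share plausible ranks.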

Page 9:

'Default' Prior Distributions

The Bayesian approach requires that all unknown parameters are given prior distributions as part of the statistical model. Often in practice we want to express the fact that we know nothing about the parameters, and hence want some 'default' priors for these scenarios. Such priors are variously described as 'diffuse', 'flat' or 'non-informative'. For certain parameters, for example the fixed effects, the choice of 'default' prior doesn't appear to be too important: normally a uniform prior or a Normal prior with large variance is used. For variance parameters, however, the choice is not so obvious: a prior that is flat on one scale will be anything but flat on another, and uniform priors on σ, σ² and log(σ²), or their proper equivalents, have all been advocated. See Browne (1998) and Browne and Draper (2002) for some comparisons, and Browne and Draper (2000) for equivalent comparisons for variance matrices. It is always advisable to run a sensitivity analysis on the choice of prior, and this also applies when informative priors are used: a sensitivity analysis will show whether your inferences are robust to your prior choice.

Page 10:

WinBUGS and the MLwiN -> WinBUGS interface

WinBUGS History

WinBUGS is a piece of software for estimating models using MCMC estimation. It is produced by a team of researchers headed by David Spiegelhalter, who were originally all based at the MRC Biostatistics Unit in Cambridge, UK. The original BUGS package was produced in the early 1990s, BUGS being an acronym for Bayesian inference Using Gibbs Sampling. It allowed the user to specify Bayesian statistical models via a series of statements in what is effectively its own high-level programming language. The user could then combine the model specification with data and initial values stored in separate files. BUGS would compile the model and produce chains of samples from the posterior distribution via its MCMC estimation engine, which could then be output to other software, e.g. S-Plus, to produce results and diagnostics. A suite of S functions known as CODA, written by Nicky Best and colleagues, could then be used to produce convergence diagnostics and summary statistics. The majority of the WinBUGS team, including the programmers Andrew Thomas and Dave Lunn, are now based at Imperial College's Department of Epidemiology in London. Since the first release of BUGS, the software has evolved from a DOS/UNIX-based estimation engine to a full-blown Windows-based statistical package which directly incorporates many of the graphical functions originally only available via S-Plus.

Page 11:

The MLwiN – WinBUGS interface

The MCMC estimation methods in MLwiN are just one of a set of estimation features available in the package. MLwiN is designed to fit only specific types of model, but for these models we try to include many additional features suited to them, e.g. residual plots and very efficient maximum likelihood estimation procedures. The WinBUGS package aims to fit a far wider range of models. This means, however, that its algorithms are often slower and sometimes less efficient, due to the single-site updating nature of its samplers, although they are improving all the time and for some models the WinBUGS algorithms are in fact more efficient than MLwiN's. The WinBUGS syntax is also more like a programming language, making it less accessible to some researchers, and this was the motivation behind writing an interface between the two packages. The interface allows users to specify their model in MLwiN via the Equations window and then, at the click of a button, generate the corresponding WinBUGS code. In the chapter that you will work through we do this for a simple Normal response model and show how to modify the WinBUGS code to allow t-distributed residuals, which MLwiN does not allow. The interface can also input the estimate chains produced in WinBUGS back into MLwiN so that we can use the MLwiN diagnostics etc.

Page 12:

An Example of WinBUGS model code

Here is the WinBUGS code for a simple linear regression model. As you can see there are three distinct parts: the model definition, the initial values for the unknown parameters, and the data values.

#----MODEL Definition----------------
model
{
    # Level 1 definition
    for(i in 1:N) {
        normexam[i] ~ dnorm(mu[i], tau)
        mu[i] <- beta[1] + beta[2] * standlrt[i]
    }
    # Priors for fixed effects
    for (k in 1:2) { beta[k] ~ dflat() }
    # Priors for random terms
    tau ~ dgamma(0.001, 0.001)
    sigma2 <- 1/tau
}

#----Initial values file----------------------------
list(beta = c(-0.001191, 0.595057), tau = 1.542213)

#----Data File----------------------------------
list(N = 4059,
     standlrt = c(0.619059, 0.205802, -1.364576, 0.205802, ..., -1.364576),
     normexam = c(0.261324, 0.134067, ..., -1.029067))

Page 13:

DAG Models

The models that WinBUGS generally fits can each be described by a directed acyclic graph or DAG. 'DAG' can refer either to the diagram that displays the dependence of the variables and data in the model, or to the model itself. Example of a DAG for our linear regression model:

[DAG diagram: a plate for(i IN 1:N) containing the nodes standlrt[i], mu[i] and normexam[i]; beta[1], beta[2] and tau sit outside the plate; edges run from beta[1], beta[2] and standlrt[i] into mu[i], and from mu[i] and tau into normexam[i].]

Here all parameters and data in the model appear as nodes. Each node is joined to others via arrows, or edges, which indicate dependence between the two nodes. A single edge identifies a stochastic dependence whilst a double (hollow) edge identifies a logical or deterministic dependence. The oval nodes are stochastic nodes that have some distributional assumption, whilst the rectangular nodes are constants. BUGS was originally designed to fit only DAG models, although certain models, for example CAR spatial models, can be fitted by assuming that a whole section of the model which cannot be expressed as a DAG is contained in a single node.

Page 14:

WinBUGS version 1.4

At the end of October a new version of WinBUGS (1.4) became available for beta testing. This version contains many improvements over earlier versions, including more statistical distributions, graphical output options and more efficient estimation procedures. One new feature, which actually existed in the original DOS-based BUGS program, is the ability to call WinBUGS via a script file. This means that in the future it will be possible to run models in WinBUGS from MLwiN at the touch of a button. Below is the MLwiN screen for a prototype that allows users to run WinBUGS via scripts and then input the chains.

Page 15:

MODELLING DISCRETE RESPONSE DATA

- Sometimes we want to model data where the response variable is discrete.

Logistic models

- We may have a binary response variable y_ij = 1 or 0 (e.g. dead / alive!)
- π_ij is the probability that y_ij = 1
- Assume y_ij ~ Binomial(n_ij, π_ij)
- Var(y_ij | π_ij) = π_ij(1 − π_ij)/n_ij

In the case of binary data, however, n_ij = 1 (as each 'proportion' is based on 1 observation). This is fitted as a denominator in the models.

- Logistic regression models can also be used to model proportions in aggregate data, e.g. the proportion unemployed in area i of region j; n_ij would then be the number of people eligible to work in area i of region j.
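For the binary case the variance formula reduces to π(1 − π). A quick numeric sketch of the logit link and this variance (the linear-predictor value 0.4 is arbitrary):

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: pi = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

pi = inv_logit(0.4)          # hypothetical linear predictor value
var_binary = pi * (1 - pi)   # Var(y | pi) when n = 1
```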

Page 16:

Poisson models

- Poisson models are useful for modelling count data, particularly counts associated with rare events, e.g. cases of malignant melanoma in populations.
- Counts of cases will be non-negative, but a Normal model could produce negative predictions (hence we prefer to model the logarithms of the counts).
- Fit a Poisson model to the counts with a log-link function (and fit the expected values as an offset, called 'offs'):

y_ij ~ Poisson(λ_ij)
log(λ_ij) = log(exp_ij) + X_ij β + Z_ij u_j

or alternatively

log(λ_ij / exp_ij) = X_ij β + Z_ij u_j
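A small simulation sketch of the log-link-with-offset formulation (Python/NumPy; the expected counts and the effect size are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
expected = rng.uniform(5.0, 50.0, size=n)   # hypothetical expected counts exp_ij
x = rng.normal(size=n)                      # a single covariate
beta = 0.3                                  # hypothetical covariate effect

# log(lambda) = log(expected) + x*beta: the log expected count is the offset
lam = np.exp(np.log(expected) + beta * x)
y = rng.poisson(lam)

# Equivalently the model is for the log relative rate:
log_rel_rate = np.log(lam / expected)       # equals beta * x
```

The offset has a fixed coefficient of 1, so the model describes the count relative to what was expected.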

Page 17:

How to fit the models: a binary logistic example (voters within areas)

- From the Equations window, change the default Normal response distribution to Binomial, which gives:

[Equations window screenshot]

Note that, whilst y_ij ~ Binomial(n_ij, π_ij), the area effects are

u_j ~ Normal(0, σ²_u)

- Hence we need two constant terms (bcons and cons): bcons models the individual-level variance, and cons the rest.
- We also need a denominator column ('denom').

Page 18:

Hence we have:

[Equations window screenshot, annotated:]
- Logit link function (could instead be probit or log-log)
- BCONS only appears in the random part
- * = transform
- u_j assumed to be Normal
- Var(e_ij) forced to be 1

Page 19:

Interpretation of explanatory variables in logistic regression models

β_1 = coefficient of explanatory variable x_1. It is expressed in terms of logits (log odds ratios), so exp(β_1) may be interpreted as the effect of x_1 on the odds that y = 1.

- If x_1 is continuous, for each 1-unit increase in x_1 the odds are multiplied by exp(β_1).
- If x_1 is binary (0,1), exp(β_1) is interpreted as an odds ratio, comparing the odds that y = 1 for x_1 = 1 relative to the odds that y = 1 for x_1 = 0.
- If x_1 is categorical with l categories then we set up l−1 dummy variables x_ni (i = 1…l−1) with coefficients β_ni; exp(β_ni) is interpreted as an odds ratio, comparing the odds that y = 1 for x_1 = i relative to the odds that y = 1 for x_1 = l.

Page 20:

e.g.

The response is whether or not a respondent voted Conservative; unemp is a 21-point attitude scale.

Note: the coefficient for unemp = 0.069, and exp(0.069) = 1.07. Hence the odds of voting Conservative are multiplied by 1.07 (i.e. increase by about 7%) for each 1-unit increase in the respondent's attitude score.
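The arithmetic on this slide, checked directly:

```python
import math

beta_unemp = 0.069                  # coefficient quoted for unemp
odds_ratio = math.exp(beta_unemp)   # multiplicative effect on the odds, ~1.07
pct_increase = 100 * (odds_ratio - 1)
```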

Page 21:

Variance functions

In this session we will look at modelling the variance, and in particular heteroskedasticity, or complex variation at level 1. We will firstly motivate this topic by considering a 1-level problem.

Partitioning the dataset

Here we see the mean and variance of our response variable normexam for different partitions of the dataset:

[Table: N, mean and variance of normexam for the whole dataset, for boys and girls, and for seven bands of standardised LRT score; the numerical values are not recoverable here.]

In the slides that follow we use σ²_u,ij as the level 2 variance function and σ²_e,ij as the level 1 variance function for individual i in school j.

Page 22:

A 1-level model

y_ij ~ N(β_0 + X_1ij β_1, V)
σ²_e,ij = σ²_e0 + 2 X_1ij σ_e01 + X²_1ij σ²_e1     (1)

where X_1 is London Reading Test (LRT) score. This graph mimics the results from partitioning the data.

[Figure: level 1 variance (roughly 0.64 to 0.74) plotted as a quadratic function of standardised LRT score from −3 to 3.]

Page 23:

A 2-level model with a constant variance at level 2

y_ij ~ N(β_0 + X_1ij β_1, V)
σ²_u,ij = σ²_u0
σ²_e,ij = σ²_e0 + 2 X_1ij σ_e01 + X²_1ij σ²_e1     (2)

where X_1 is London Reading Test (LRT) score.

[Figure: quadratic level 1 variance function and constant level 2 variance function (vertical axis roughly 0.1 to 0.7) plotted against standardised LRT score from −3 to 3.]

Page 24:

A 2-level model with complex variation at both levels 1 and 2

y_ij ~ N(β_0 + X_1ij β_1, V)
σ²_u,ij = σ²_u0 + 2 X_1ij σ_u01 + X²_1ij σ²_u1
σ²_e,ij = σ²_e0 + 2 X_1ij σ_e01 + X²_1ij σ²_e1     (3)

where X_1 is London Reading Test (LRT) score.

[Figure: quadratic level 1 and level 2 variance functions plotted against standardised LRT score from −3 to 3.]

Page 25:

A 2-level model with a more complicated variance structure at level 1

y_ij ~ N(β_0 + X_1ij β_1 + X_2ij β_2, V)
σ²_u,ij = σ²_u0 + 2 X_1ij σ_u01 + X²_1ij σ²_u1
σ²_e,ij = σ²_e0 + 2 X_1ij σ_e01 + 2 X_1ij X_2ij σ_e12 + X_2ij σ²_e2     (4)

where X_1 is London Reading Test (LRT) score and X_2 is 0 for boys and 1 for girls.

[Figure: separate level 1 variance functions for boys and girls, and the level 2 variance function, plotted against standardised LRT score from −3 to 3.]
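The level 1 variance functions on these slides are all quadratics in the predictors. A sketch of evaluating such a function over a grid (Python; the coefficient values are placeholders, not the fitted values from the tutorial data):

```python
import numpy as np

def level1_variance(x1, s2_e0=0.55, s_e01=-0.015, s2_e1=0.002):
    """sigma2_e,ij = sigma2_e0 + 2*x1*sigma_e01 + x1^2*sigma2_e1
    (coefficients are illustrative placeholders)."""
    return s2_e0 + 2.0 * x1 * s_e01 + x1**2 * s2_e1

x = np.linspace(-3.0, 3.0, 61)   # standardised LRT score grid
v = level1_variance(x)           # variance function over the grid
```

Plotting `v` against `x` reproduces the shape of the variance-function graphs above; note that with some coefficient combinations a quadratic variance function can go negative, which is one practical caution with these models.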

Page 26:

NON-HIERARCHICAL MULTILEVEL MODELS

Two types:

- Cross-classified models
- Multiple membership models

Cross-classification

For example: hospitals by neighbourhoods. Hospitals will draw patients from many different neighbourhoods, and the inhabitants of a neighbourhood will go to many hospitals. No pure hierarchy can be found, and patients are said to be contained within a cross-classification of hospitals by neighbourhoods (each X marks one patient):

Hospital | nbhd 1 | nbhd 2 | nbhd 3
hospital 1 | XX | X |
hospital 2 | X | XX |
hospital 3 | | X | X
hospital 4 | | X | XXX

[Unit diagram: patients P1-P12 grouped under hospitals H1-H4 and, separately, under neighbourhoods N1-N3.]

We can sort the data by patient within hospital (as above) or by patient within neighbourhood, but not both.

Page 27:

Other examples:

- pupils within primary schools by secondary schools
- patients within GPs by hospitals
- interviewees within interviewers by surveys
- repeated measures within raters by individual (e.g. patients by nurses):

Patient | nurse 1 | nurse 2 | nurse 3
patient 1 | X | XX |
patient 2 | X | X | X
patient 3 | X | XX |

Page 28:

Notation

[Unit diagram as before: patients P1-P12 under hospitals H1-H4 and neighbourhoods N1-N3.]

i | nbhd(i) | hosp(i)
1 | 1 | 1
2 | 2 | 1
3 | 1 | 1
4 | 2 | 2
5 | 1 | 2
6 | 2 | 2
7 | 2 | 3
8 | 3 | 3
9 | 3 | 4
10 | 2 | 4
11 | 3 | 4
12 | 3 | 4

y_i = β_0 + u^(2)_nbhd(i) + u^(3)_hosp(i) + e_i

Here classification 2 is neighbourhood and classification 3 is hospital. Classification 1 always corresponds to the classification at which the response measurements are made, in this case patients.
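The classification notation can be made concrete by treating nbhd(i) and hosp(i) as index vectors (Python/NumPy; the intercept, random effect values and zero residuals are hypothetical, chosen only to show the indexing):

```python
import numpy as np

# Classification vectors from the notation table (1-based in the text)
nbhd = np.array([1, 2, 1, 2, 1, 2, 2, 3, 3, 2, 3, 3]) - 1
hosp = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4]) - 1

beta0 = 0.0                            # hypothetical intercept
u2 = np.array([0.3, -0.1, 0.5])        # hypothetical neighbourhood effects
u3 = np.array([0.2, 0.0, -0.4, 0.1])   # hypothetical hospital effects
e = np.zeros(12)                       # residuals set to 0 for clarity

# y_i = beta_0 + u2[nbhd(i)] + u3[hosp(i)] + e_i
y = beta0 + u2[nbhd] + u3[hosp] + e
```

Each response picks up one effect from each crossed classification; no nesting of the index vectors is required.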

Page 29:

Classification diagrams

- One node per classification.
- Nodes linked by arrows indicate a nested relationship.
- Unlinked nodes indicate a crossed relationship.

[Diagrams: the nested structure is a single chain of arrows from Patient up through Neighbourhood to Hospital; the crossed structure has Hospital and Neighbourhood as two unlinked nodes, each with an arrow from Patient.]

These diagrams are useful in situations where we have multiple classifications, to indicate the population structure.

Page 30:

Data example: artificial insemination by donor

1901 women, 279 donors, 1328 donations, 12100 ovulatory cycles. The response is whether conception occurs in a given cycle.

The structure is:

Women      w1            w2            w3
Cycles     c1 c2 c3 c4…  c1 c2 c3 c4…  c1 c2 c3 c4…
Donations  d1 d2         d1 d2 d3      d1 d2
Donors     m1            m2            m3

which can also be represented as a classification diagram:

[Diagram: Cycle at the bottom, with arrows up to Donation and to Woman, and an arrow from Donation up to Donor.]

Page 31:

We can write the model as

y_i ~ Binomial(1, π_i)
logit(π_i) = (Xβ)_i + u^(2)_woman(i) + u^(3)_donation(i) + u^(4)_donor(i)
u^(2)_woman(i) ~ N(0, σ²_u(2))
u^(3)_donation(i) ~ N(0, σ²_u(3))
u^(4)_donor(i) ~ N(0, σ²_u(4))

Results

Parameter | Description | Estimate (se)
β_0 | intercept | −4.04 (2.30)
β_1 | azoospermia* | 0.22 (0.11)
β_2 | semen quality | 0.19 (0.03)
β_3 | woman's age > 35 | −0.30 (0.14)
β_4 | sperm count | 0.20 (0.07)
β_5 | sperm motility | 0.02 (0.06)
β_6 | insemination too early | −0.72 (0.19)
β_7 | insemination too late | −0.27 (0.10)
σ²_u(2) | women variance | 1.02 (0.21)
σ²_u(3) | donation variance | 0.644 (0.21)
σ²_u(4) | donor variance | 0.338 (0.07)

* fecundability of these women not impaired


Multiple membership models These are models where level 1 units are members of more than one higher level unit. For example,

- Pupils change schools/classes and each school/class has an effect on pupil outcomes.
- Patients are seen by more than one nurse during the course of their treatment.

We can write such a model (for the patient/nurse example) as

y_i = X_i β + Σ_{j ∈ nurse(i)} w^(2)_{i,j} u^(2)_j + e_i    (1)

u^(2)_j ~ N(0, σ²_u(2)),  e_i ~ N(0, σ²_e)

Note that nurse(i) now indexes the set of nurses that treat patient i and w^(2)_{i,j} is a weighting factor relating patient i to nurse j. For example, with four patients and three nurses, we may have the following weights:

          n1 (j=1)   n2 (j=2)   n3 (j=3)
p1 (i=1)  0.5        0          0.5
p2 (i=2)  1          0          0
p3 (i=3)  0          0.5        0.5
p4 (i=4)  0.5        0.5        0

Here patient 1 was seen by nurses 1 and 3, but not nurse 2, and so on. If we substitute the values of w^(2)_{i,j}, i and j from the above table into (1) we get the following series of equations:

y_1 = X_1 β + 0.5 u^(2)_1 + 0.5 u^(2)_3 + e_1
y_2 = X_2 β + u^(2)_1 + e_2
y_3 = X_3 β + 0.5 u^(2)_2 + 0.5 u^(2)_3 + e_3
y_4 = X_4 β + 0.5 u^(2)_1 + 0.5 u^(2)_2 + e_4
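The weighted-sum structure can be checked numerically. Below is a minimal Python sketch in which the nurse effects u_j are illustrative numbers, not estimates:

```python
import numpy as np

# Weight matrix from the table: rows = patients, columns = nurses.
W = np.array([
    [0.5, 0.0, 0.5],   # patient 1: nurses 1 and 3
    [1.0, 0.0, 0.0],   # patient 2: nurse 1 only
    [0.0, 0.5, 0.5],   # patient 3: nurses 2 and 3
    [0.5, 0.5, 0.0],   # patient 4: nurses 1 and 2
])

u = np.array([0.3, -0.1, 0.4])  # illustrative nurse effects u_1, u_2, u_3

# Multiple membership contribution: sum_j w_ij * u_j, one value per patient.
mm_effect = W @ u
# 0.35, 0.30, 0.15, 0.10 -- matching the four substituted equations,
# e.g. patient 1: 0.5*u_1 + 0.5*u_3 = 0.5*0.3 + 0.5*0.4 = 0.35.
```

Note that each row of weights sums to 1, the usual convention for multiple membership weights.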


Classification diagrams for multiple membership models

Double arrows indicate a multiple membership relationship between classifications.

[Diagram: nurse above patient, joined by a double arrow.]

We can mix multiple membership, crossed and hierarchical structures in a single model:

[Diagram: patient at the bottom, joined to nurse by a double arrow; nurse nested within hospital; GP practice a separate node crossed with both nurse and hospital.]

Here patients are multiple members of nurses, nurses are nested within hospitals and GP practice is crossed with both nurse and hospital. We can write the model as

y_i = X_i β + Σ_{j ∈ nurse(i)} w^(2)_{i,j} u^(2)_j + u^(3)_hospital(i) + u^(4)_gpp(i) + e_i

u^(2)_j ~ N(0, σ²_u(2))
u^(3)_hospital(i) ~ N(0, σ²_u(3))
u^(4)_gpp(i) ~ N(0, σ²_u(4))
e_i ~ N(0, σ²_e)


Example involving nesting, crossing and multiple membership: the Danish chickens dataset.

Production hierarchy: 10,127 child flocks, 725 houses, 304 farms.
Breeding hierarchy: 10,127 child flocks, 200 parent flocks.

The response is whether a child flock is infected with salmonella.

Farm          f1                      f2 …
Houses        h1         h2           h1         h2
Child flocks  c1 c2 c3…  c1 c2 c3…    c1 c2 c3…  c1 c2 c3…
Parent flock  p1 p2 p3 p4 p5 …

or, as a classification diagram:

[Diagram: Child flock at the bottom, nested within House, which is nested within Farm; Parent flock joined to Child flock by a double arrow (multiple membership), crossed with the production hierarchy.]


Note the breeding hierarchy is crossed with the multiple membership production hierarchy. We can write the model as:

y_i ~ Binomial(1, π_i)
logit(π_i) = X_i β + Σ_{j ∈ p.flock(i)} w^(2)_{i,j} u^(2)_j + u^(3)_house(i) + u^(4)_farm(i)

u^(2)_j ~ N(0, σ²_u(2))
u^(3)_house(i) ~ N(0, σ²_u(3))
u^(4)_farm(i) ~ N(0, σ²_u(4))

Results:

Parameter   Description             Estimate (se)
β_0         intercept               -2.322 (0.213)
β_1         1996                    -1.239 (0.162)
β_2         1997                    -1.165 (0.187)
β_3         hatchery 2              -1.733 (0.255)
β_4         hatchery 3              -0.211 (0.252)
β_5         hatchery 4              -1.062 (0.388)
σ²_u(2)     parent flock variance   0.895 (0.179)
σ²_u(3)     house variance          0.208 (0.108)
σ²_u(4)     farm variance           0.927 (0.197)


Multivariate Normal Response Models and Missing Data

In all of the models we have considered so far we have picked one of our variables as a response and fitted an equation relating it to our other 'predictor' variables, e.g.

y_ij = X_ij β + Z_ij u_j + e_ij

Consider a scenario where we have many measures of equal interest: for example, at the end of schooling a pupil will take exams in many different subjects and we may be interested in their marks in each of these subjects. The normexam example we have looked at so far gets around this by using the total exam score across all subjects as a measure rather than individual subjects, but educational ability is not in reality a scalar measure and different predictors will be important for different subjects.

A quick solution would be to fit many models, one for each subject: for example, fit a (multilevel) model to the mathematics scores and then another model for chemistry, etc. Each of these models would exist in isolation, and this would answer questions like 'Do girls do better than boys in science?', but we may also be interested in questions like 'Do pupils who do well in Mathematics also do well in Chemistry?', which cannot be answered by a series of independent models. The solution is to fit a single model with many responses, and for this we will use a multivariate Normal response model.


Multivariate Normal single level model

To answer our second question we are really interested in whether there is a positive or negative correlation between Mathematics and Chemistry scores in the population of students. Assuming (for simplicity) that we have scores out of 100 for each pupil for both mathematics (m_i) and chemistry (c_i), we can calculate the correlation ρ_mc between these two variables, where

ρ_mc = cov(m, c) / √( var(m) var(c) )

We could equivalently use the terms that form the correlation to give a multivariate distribution for the two responses:

( m_i )          ( μ_m )   ( var(m)     cov(m,c) )
( c_i )  ~ MVN(  ( μ_c ) , ( cov(m,c)   var(c)   ) )

If we rename m and c as y0 and y1 then we can rewrite the above distribution as the following model (renaming many terms)

y_0i = β_0 + e_0i
y_1i = β_1 + e_1i

( e_0i )          ( 0 )   ( σ²_e0    σ_e01 )
( e_1i )  ~ MVN(  ( 0 ) , ( σ_e01    σ²_e1 ) )

This is a simple single level multivariate response model and we can now extend this model by adding predictor variables etc. if we so wish. If we have additional levels in our dataset we can also easily extend this model to a multilevel setting just as we would a single response model.
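As a quick sketch of this single level bivariate model, the following Python code simulates maths and chemistry scores with a chosen residual covariance (all parameter values are illustrative) and recovers the between-subject correlation from the simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Illustrative parameters: mean marks and residual covariance matrix.
beta = np.array([60.0, 55.0])         # beta_0 (maths), beta_1 (chemistry)
Sigma_e = np.array([[100.0, 60.0],
                    [60.0, 144.0]])   # variances 100 and 144, covariance 60

e = rng.multivariate_normal([0.0, 0.0], Sigma_e, size=n)
y = beta + e   # y_0i = beta_0 + e_0i,  y_1i = beta_1 + e_1i

# Correlation implied by the covariance terms, and its sample estimate:
rho_true = 60.0 / np.sqrt(100.0 * 144.0)          # 0.5
rho_hat = np.corrcoef(y[:, 0], y[:, 1])[0, 1]     # close to 0.5
```

Here a positive `rho_hat` would answer our second question: pupils who do well in one subject tend to do well in the other.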


Multilevel Multivariate Response Models in MLwiN

In MLwiN we fit multivariate models by assuming a dummy bottom level that represents the responses. This means we have responses nested within individuals, and so the model discussed on the last slide will appear as:

y_0j = β_0 + u_0j
y_1j = β_1 + u_1j

( u_0j )          ( 0 )   ( σ²_u0    σ_u01 )
( u_1j )  ~ MVN(  ( 0 ) , ( σ_u01    σ²_u1 ) )

where u replaces e as the residuals are now effectively at level 2. In fact the terms β_0 and u_0j are multiplied by an indicator variable maths_ij, which takes the value 1 when the response is mathematics and 0 when the response is chemistry. Similarly β_1 and u_1j are multiplied by an indicator variable chemistry_ij, which takes the value 1 when the response is chemistry and 0 when the response is mathematics. In this way the multivariate model is in fact fitted (when we use ML methods) as a univariate model with no random terms at level 1. To fit a 2 level model, for example including school effects in our current model, we will actually fit a 3 level model with responses nested within students nested within schools.

y_ijk = maths_ijk (β_0 + u_0jk + v_0k) + chemistry_ijk (β_1 + u_1jk + v_1k)

( v_0k )          ( 0 )   ( σ²_v0    σ_v01 )
( v_1k )  ~ MVN(  ( 0 ) , ( σ_v01    σ²_v1 ) )

( u_0jk )          ( 0 )   ( σ²_u0    σ_u01 )
( u_1jk )  ~ MVN(  ( 0 ) , ( σ_u01    σ²_u1 ) )

where maths_ijk = 1 when i = 0 and chemistry_ijk = 1 when i = 1.
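The indicator-variable trick described above can be sketched directly: stacking a bivariate response into long form with maths/chemistry dummies turns the multivariate model into a univariate one. A toy example with three pupils and no random part (all marks are made up):

```python
import numpy as np

# Marks for 3 pupils: columns are (maths, chemistry).
marks = np.array([[70.0, 65.0],
                  [55.0, 60.0],
                  [80.0, 72.0]])
n_pupils = marks.shape[0]

# Stack into long form: one row per response, responses nested in pupils.
y = marks.reshape(-1)                      # [m1, c1, m2, c2, m3, c3]
pupil = np.repeat(np.arange(n_pupils), 2)  # pupil id for each response
maths = np.tile([1.0, 0.0], n_pupils)      # indicator: 1 if maths response
chemistry = 1.0 - maths                    # indicator: 1 if chemistry

# Fixed part: beta_0 * maths + beta_1 * chemistry (no shared intercept).
X = np.column_stack([maths, chemistry])

# OLS on the stacked data recovers the two subject means.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta_hat is approximately [68.333, 65.667]
```

In a real multilevel fit the pupil-level (and school-level) random terms would be attached to the same indicators, exactly as in the equations above.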


Missing Data

We will here give a brief introduction to missing data methods. In MLwiN, multivariate response models can handle observations where some of the responses are missing. When using MCMC estimation algorithms the missing responses are imputed as part of the estimation procedure.

Imputation Methods

Consider the following small dataset:

Y    5    16    25    24    ?
X    10   22    31    35    26

Here we wish to impute the value ? based on the data so that we can fit models to the complete dataset. There are many simple imputation methods that could be used. For example, mean imputation would impute the average of the observed Y values (17.5). An improvement is regression imputation, which fits a regression line to the complete cases and imputes the value of Y predicted by this line at X = 26 (18.7). The disadvantage of this approach is that it results in variances that are too small, as all imputed values are placed exactly on the line.

A method that corrects for this is multiple imputation, where rather than predicting the unknown Y exactly from the line we instead generate several datasets, each with the unknown Y drawn at random from its predictive distribution. We then perform any further statistical analysis on all the generated datasets and average the results. MCMC sampling can be thought of as an infinite extension of multiple imputation: here we have only one dataset (with unknowns), but at each iteration of the MCMC algorithm we generate the missing data afresh.
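The imputation calculations for this toy dataset can be reproduced directly; the sketch below also illustrates the multiple imputation idea of drawing from the predictive distribution (the residual standard deviation used for the draws is one simple choice among several):

```python
import numpy as np

x = np.array([10.0, 22.0, 31.0, 35.0])  # observed X for complete cases
y = np.array([5.0, 16.0, 25.0, 24.0])   # observed Y (fifth value missing)
x_miss = 26.0                           # X for the case with missing Y

# Mean imputation: the average of the observed Y values.
mean_imp = y.mean()                     # 17.5

# Regression imputation: predict Y at X = 26 from the fitted line.
b, a = np.polyfit(x, y, 1)              # slope, intercept
reg_imp = a + b * x_miss                # approximately 18.7

# Multiple imputation (sketch): add residual noise to the regression
# prediction to draw from the predictive distribution of the unknown Y.
rng = np.random.default_rng(2)
resid_sd = np.std(y - (a + b * x), ddof=2)
draws = reg_imp + rng.normal(0.0, resid_sd, size=5)  # 5 imputed datasets
```

Each of the five `draws` would complete one copy of the dataset; the analysis is then run on each copy and the results averaged.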


Measurement Error Modelling

One main aim of statistical modelling is to explore the relationship between variables. For example, in a linear regression model we are interested in the relationship between a response y and a predictor variable x. We hope that x will firstly explain some of the variation in the values of y in our dataset, and secondly allow us to predict values of y for other observations with known x. The linear regression can be written:

y_i = β_0 + β_1 x_i + e_i

Here the e_i are residuals or errors representing unexplained variation, which will be due to many causes including important missing covariates and measurement errors in the y variable. In the physical sciences we can perform experiments to verify known laws. Students of physics will probably recall 'Hooke's law' experiments, which involve putting weights on the end of a spring and measuring the extension. In such experiments (assuming the weight is not above a particular value) the relationship between extension and weight is known to be linear, so any points that do not lie on the regression line are due to measurement errors, either in measuring the weight or the length. In the case of a known linear relationship small measurement errors will not have a great effect, but more generally measurement errors will influence both the coefficients that define the regression line and the standard errors of those coefficients, i.e. our confidence in the estimated line.


Measurement Error Modelling

In this section we will consider measurement errors in predictor variables, in the setting of a simple linear regression model. (Measurement errors in the response variable are captured as part of the error term e_i.) Ideally we would fit the model

y_i = β_0 + β_1 x_i^t + e_i    (1)

where x_i^t is the true predictor value for individual i, but we cannot actually observe x_i^t, so instead we fit

y_i = β_0 + β_1 x_i^o + e_i    (2)

where x_i^o is the observed predictor value for individual i. We can write x_i^o = x_i^t + m_i, where m_i is the measurement error for observation i. So we need to account for the fact that our predictor is not measured directly. Ideally we would like to know the exact values of the measurement errors so that we could calculate the x_i^t and hence fit model (1). In practice we make some assumption about the errors and include this assumption in our model so that the errors are corrected for. We will assume that there are no systematic errors in our predictor, so that the errors are Normally distributed with zero mean and variance σ²_m. The value of σ²_m is needed by the model and may be estimated from the data by repeated sampling. In order to adjust for measurement errors it is important to first look at their effect on the model.


The Effects of Measurement Errors In the example below the heights of 50 students (in metres) at age 16 were measured. Their heights at age 11 were also measured and used as a predictor variable. Below we plot in blue (solid triangles and solid line) the true data and regression line and in red (hollow triangles and dashed line) the same data with random measurement errors added and the subsequent regression line. We have here used fairly small measurement errors but as the errors increase so the discrepancy between the lines increases.

The equations of the lines here are:

y_i = -0.244 + 1.512 x_i^t    (true line)
y_i = -0.068 + 1.361 x_i^o    (observed line)

and if we were to simply regress y on the errors (line not shown):

y_i = 1.569 + 0.512 m_i


So the observed line is a weighted sum of the true and error lines.

Effects of Measurement Errors in Multilevel models

- In both single level and multilevel models, measurement errors will lead to the coefficients of the affected predictors being underestimated (in magnitude).
- Higher-level random effects and residuals associated with the predictor term that has measurement error (i.e. random slopes) will also be underestimated, leading to the corresponding higher-level variances being underestimated.
- As a result of the underestimation of the fixed and random effects, the level 1 variance will be overestimated, as the errors have reduced the predictive power of the predictor and hence increased the residual variation.
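The attenuation described in the first bullet can be illustrated by simulation: in the classical errors-in-variables setting the observed slope shrinks by the reliability ratio σ²_x / (σ²_x + σ²_m). The sketch below reuses the true line from the height example, but all the variances are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

beta0, beta1 = -0.244, 1.512                  # true line (height example)
sigma_x, sigma_m, sigma_e = 0.1, 0.05, 0.05   # illustrative spreads

x_true = 1.45 + rng.normal(0.0, sigma_x, n)   # true heights at age 11
y = beta0 + beta1 * x_true + rng.normal(0.0, sigma_e, n)
x_obs = x_true + rng.normal(0.0, sigma_m, n)  # heights measured with error

slope_true = np.polyfit(x_true, y, 1)[0]      # close to beta1
slope_obs = np.polyfit(x_obs, y, 1)[0]        # attenuated towards zero

# Classical attenuation: the observed slope shrinks by the reliability.
reliability = sigma_x**2 / (sigma_x**2 + sigma_m**2)  # 0.8 here
# slope_obs / slope_true is close to the reliability, 0.8
```

Increasing `sigma_m` relative to `sigma_x` shrinks the observed slope further, which is the discrepancy between the true and observed regression lines seen in the plot above.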

Measurement Errors in categorical predictor variables

Measurement errors in categorical predictors are also known as misclassifications. We may, for example, have categorized an individual as belonging to category A when in reality they belong to category B. As with continuous measurement errors, the effect of misclassification is generally an underestimation of the coefficients of the affected predictor variables. Currently there are no facilities in MLwiN for dealing with misclassification errors, although we aim to look at such errors at a later date.