Lecture Slides on Mixed Models
Based on
A Course in Mixed Models for Use in Animal Health and Animal Welfare Research
Søren Højsgaard & Erik Jørgensen
Biometry Research Unit, Danish Institute of Agricultural Sciences
Research Centre Foulum
October 18, 2001
1 Preface
In the spring of 2001 the Biometry Research group at the Danish Institute of Agricultural Sciences arranged a course in mixed models for researchers at the Department of Animal Health and Animal Welfare at the same institute. The course consisted of a combination of lectures, group exercises, written assignments and a final project report based on data from experiments that the project participants were involved in.
During the course, the book SAS System for Mixed Models by Littell et al. (1996) was used, referred to as LMSW in the present document. It was necessary to supplement the book with additional theoretical material and examples based on data from the research institute. This led to a large number of slides used for the presentations.
This supplementary material is compiled in the present document. We hope the readers will find it useful. The online version1 of this document may be even more useful because of the hypertext facilities.
Søren Højsgaard & Erik Jørgensen
[email protected] [email protected]
Biometry Research Unit, Danish Institute of Agricultural Sciences
Research Centre Foulum, P.O. Box 50
DK-8830 Tjele
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/HSVmixed2001Slides.pdf
Contents
1 Preface

2 Overview of slides

3 Basic Concepts from Linear Algebra
  Why Linear Algebra??; Vectors; Matrices; Linear Combinations; n-dimensional Spaces; Linear Subspaces; Linear dependence and independence; Projections onto Linear Subspaces

4 Linear normal models
  Introduction; Linear Normal Models; Random Vectors and Matrices; Functions of Random Vectors; The Multivariate Normal Distribution; The Distribution of a LNM; The Expectation in a LNM; Representations of Models in SAS; Least Squares Estimation in a LNM; Estimation on matrix form; The parameter vector β; Estimability and Contrasts; Estimability in SAS; Least Squares Means; Hypothesis Testing; Calculating things in Practice

5 Some Basic Statistical Concepts
  Data and Models; Why the Normal Distribution is so “Normal”; The Central Limit Theorem; Some General Principles of Estimation; Method of Moments; How good is an estimator?; Consistency of Estimators; Desirable Properties of Estimators; The Method of Maximum Likelihood; The Likelihood function; The Maximum likelihood principle; How Good is the Estimate?; The Asymptotic Normal Distribution of the MLE; Asymptotic normality of transformations of the MLE; Tests of Hypotheses; How to get the asymptotic normality

6 An overview
  Outline; Darwin's maize; Galton's approach; The correct approach; What has happened; The 5th pot; Population genetics; Population genetics / animal breeding; Mixed models in general

7 Experimental planning and design
  Outline; The research process; Darwin's maize; Hypotheses; Louse decision support; Research decision support; Design options

8 Randomized Complete Block Design
  Outline; Linear Normal Model; Random vs. Fixed; ML estimation; Proc Mixed; Other examples of RCBD; Proc Mixed continued; Summary; IC options

9 Randomized Complete Block Design II
  Outline; BLUEs and BLUPs; Example; BLUP Summary; Model Check

10 Split-Plot Experiments
  The General Idea behind Split-Plot Experiments; Variance and Correlation; Comparing Differences; Inference Issues for Mixed Models; Analysis of the Split-Plot Experiment; Modelling the Mean; Three Technical Results; Back to the Original Setup; Unbalanced cases; Satterthwaite's approximation; How Good is Satterthwaite's Approximation; Two-sample Problem; Split-Plot Experiment; Making the “right” tests with PROC MIXED; A Severe Warning!!; Some Tentative Conclusions on Satterthwaite; Random or Fixed Effects?; Multilocation Trials

11 Examples of Split-Plot Designs
  Example: W. Schouten Ph.D. work; Breed Effect on Production; Straw shortener; Group Housing; Herd Investigations; Multilocation trials

12 Estimation and tests in mixed models
  Maximum Likelihood and Linear Normal Models; Maximum Likelihood Estimation in Mixed Models; Using ML or REML; Tests in Mixed Models

13 Complications concerning Variance Components
  Sugar Beet example; Outline; Reason; Likelihood contour plot; G not positive definite; Warning: Satterthwaite goes wrong; Conclusions; Testing effects of random components

14 Repeated Measurements
  Analyzing Repeated Measurements; Tacit Assumptions when using the Split-Plot Model; Modelling of Covariances; Types of random variation; Unstructured Covariance Matrix; The AR(1) model; How to estimate the autocorrelation??; Compound Symmetry; Which Covariance Structure to use?; Numerical Criteria; What does the covariance structure mean for the conclusions?

15 Repeated Measurements: Covariance structures
  Repeated statement; Types of variance structure; Unstructured; Autoregressive; Antedependence; Toeplitz; Heterogeneous; Conclusions; AR vs CS

16 Random Regression
  The Basic Idea behind Random Regression; Analyzing the Individual Regression Coefficients; Random Regression; How to ... In SAS; Inference; Correlation structure in Random Regression Models

17 Factor Structure Diagrams
  Factor Structure Diagrams; Two-way ANOVA with Replicates; Two-way ANOVA without Replicates; Block Experiments with Replicates within Blocks; Block Experiments without Replicates within Blocks; Split Plot Experiment

18 Covariate Models and Multivariate Response
  Example of the use of covariates; Model reduction; Table 5.1, LMSW Section 5.2.2; SAS code; Plots; Feed vs daily gain; Multivariate Responses; The Components of a MLNM; How to ... In SAS; The general setup

19 Heterogeneous Variance
  Why Variance Heterogeneity is Important to Recognize; Graphical Investigation of the Variance Structure; Variance Functions; The Delta Method; Taylor's Approximation; Applying Taylor's Approximation; Transformation of Data; Modelling Variance Heterogeneity; Heterogeneous Variance for Grouped Data; Power-of-Mean for Data with Covariates; On transformations, normal approximation and confidence intervals; Transformation and confidence intervals

20 Variance Heterogeneity: Example of the effect of transformation
  Variance Homogeneity; Example; Model of Expectations; Model comparisons; Treatment differences; Conclusions; Natural Scales

21 Variance Homogeneity: Diurnal Variation
  Example; Random Regression Model; Model of mean?; Modelling variance inhomogeneity; SAS model; Experience

22 Links to supplementary material

Bibliography
2 Overview of slides
The course was arranged in three blocks of lectures.
1. Brush-up concerning the necessary prerequisites: statistical concepts, linear algebra and linear normal models. In addition, a historical review was given and experimental planning was discussed. This covers Chapters 3-7.
2. This block of lectures covered the basic application of Mixed Models within the experimental designs typically used at the Department of Animal Health and Animal Welfare, that is
• randomized complete block designs (Chapters 8 and 9),
• split-plot designs (Chapters 10 and 11),
• repeated measurements (Chapters 14 and 15),
• random regression (Chapter 16),
• covariates and multivariate response (Chapter 18).
In addition, the fundamentals concerning estimation and tests in Mixed Models are discussed in Chapter 12. The two remaining issues, numerical problems (Chapter 13) and factor structure diagrams (Chapter 17), were included because of questions raised by the participants. In practical examples some of the variance component estimates were very often set to 0, leading to problems concerning the calculation of d.f. (i.e., with Satterthwaite's approximation). This further raised a need for a more 'manual' approach towards d.f. calculations in different designs.
3. In the final part of the course some additional topics and developments within Mixed Models were presented, and efforts were made to give a general summary and overview of the topics. Lectures concerning variance heterogeneity are presented in Chapters 19 and 20. An example using the presented methods on data concerning diurnal variation is presented in Chapter 21.
In addition, the preliminary work on the final project reports was presented during this final block.
The final chapter (22) in this book consists of links to supplementary material, mainly SAS examples.
The exercises used in the course are not included but can be found by visiting the home page of the course1.
Finally, it should be mentioned that each chapter starts with a very short introduction to the topic. In addition, a link to the full screen version of the presentation can be found.
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/HSVmixed2001.htm
3 Basic Concepts from Linear Algebra
Linear algebra is an important prerequisite for understanding the model formulation and calculations within mixed models. The following slides served as a brush-up on the theory, with a presentation of the most important concepts and results.
Link to the full screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/LinAlg.f.pdf
Why Linear Algebra??
• Many statistical models used in practice are assumed to have some
kind of a linear structure. (Linear regression and analysis of variance
are classical examples.)
• Linear algebra is the branch of mathematics that deals with linear
structures.
• Linear algebra is a convenient tool for handling models with linear
structures.
• Moreover, many concepts from linear algebra can be given
geometrical interpretation.
• Hence geometry can be a way to understand statistical models with
linear structures.
Vectors
Vectors: A column vector is a list of numbers stacked on top of each
other, e.g.
a = \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}
A row vector is a list of numbers written one after the other, e.g.
b = (2, 1, 3)
In both cases, the list is ordered, i.e.
(2, 1, 3) ≠ (1, 2, 3).
• Note In what follows all vectors are column vectors unless
otherwise stated.
In general an n–vector has the form
a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}
where the ai are numbers.
Transpose of vectors: This means that a column vector is turned
into a row vector and that a row vector is turned into a column
vector. The transpose is denoted by “⊤”. For example,
a⊤ = (a1, a2, . . . , an)
Hence transposing twice takes us back to where we started:
a = (a⊤)⊤
• Example:
\begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}^\top = (1, 3, 2) \qquad \text{and} \qquad (1, 3, 2)^\top = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}
Multiplying a vector by a number: If a is a vector and α is a
number then αa is the vector
\alpha a = \begin{pmatrix} \alpha a_1 \\ \alpha a_2 \\ \vdots \\ \alpha a_n \end{pmatrix}
• Example:
7 \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 7 \\ 21 \\ 14 \end{pmatrix}
Sum of vectors: Let a and b be n–vectors. The sum a + b is the
n–vector
a + b = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} = \begin{pmatrix} a_1 + b_1 \\ a_2 + b_2 \\ \vdots \\ a_n + b_n \end{pmatrix} = b + a
• Note Only vectors of the same dimension can be added!
• Example:
\begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} + \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix} = \begin{pmatrix} 1 + 2 \\ 3 + 8 \\ 2 + 9 \end{pmatrix} = \begin{pmatrix} 3 \\ 11 \\ 11 \end{pmatrix}
Inner product of vectors: Let a and b be n–vectors. The inner
product a · b is the number
a \cdot b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{i=1}^{n} a_i b_i
• Note The product is a number – not a vector
• Note Only vectors of the same dimension can be multiplied!
• Example:
\begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} \cdot \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix} = 1 \cdot 2 + 3 \cdot 8 + 2 \cdot 9 = 44
The length (norm) of a vector: The length (or norm) of a vector
a is
\|a\| = \sqrt{a \cdot a} = \sqrt{\sum_{i=1}^{n} a_i^2}
The 0–vector and the 1–vector: The 0-vector (1–vector) is a
vector with 0 (1) on all entries. The 0–vector (1–vector) is
frequently written simply as 0 (1) or as 0n (1n) to emphasize that
it is of length n.
Orthogonal (perpendicular) vectors: Two vectors a and b with
a ≠ 0 and b ≠ 0 are orthogonal if their inner product is zero,
written
a ⊥ b ⇔ a · b = 0
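The vector operations above are easy to try out numerically. A minimal PROC IML sketch (our addition, not part of the original slides) computing the inner product, the norm and an orthogonality check for the example vectors:

proc iml;
   a = {1, 3, 2};                /* column 3-vector */
   b = {2, 8, 9};                /* column 3-vector */
   inner  = t(a) * b;            /* inner product a.b = 1*2 + 3*8 + 2*9 = 44 */
   norm_a = sqrt(t(a) * a);      /* length ||a|| = sqrt(14) */
   ortho  = (inner = 0);         /* 1 if a and b are orthogonal, here 0 */
   print inner norm_a ortho;
quit;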
Matrices
Matrix: A matrix A with r rows and c columns is an r × c table of
the form
A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1c} \\ a_{21} & a_{22} & \ldots & a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r1} & a_{r2} & \ldots & a_{rc} \end{pmatrix}
It is said that A has the dimension r × c.
• Note One can regard A as consisting of c column vectors put
after each other:
A = [a1 : a2 : · · · : ac]
Transpose of matrices: A matrix is transposed by interchanging
rows and columns and is denoted by “>”. That is,
A^\top = \begin{pmatrix} a_{11} & a_{21} & \ldots & a_{r1} \\ a_{12} & a_{22} & \ldots & a_{r2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1c} & a_{2c} & \ldots & a_{rc} \end{pmatrix}
Example:
\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix}^\top = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 8 & 9 \end{pmatrix}
• Note If A is an r × c matrix then A⊤ is a c × r matrix.
• Note One can regard a column vector of length r as an r × 1 matrix and a row vector of length c as a 1 × c matrix.
Multiplying a matrix with a number: For a number α and a matrix
A, the product αA is the matrix
\alpha A = \begin{pmatrix} \alpha a_{11} & \alpha a_{12} & \ldots & \alpha a_{1c} \\ \alpha a_{21} & \alpha a_{22} & \ldots & \alpha a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha a_{r1} & \alpha a_{r2} & \ldots & \alpha a_{rc} \end{pmatrix}
Example:
7 \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} = \begin{pmatrix} 7 & 14 \\ 21 & 56 \\ 14 & 63 \end{pmatrix}
Sum of matrices: Let A = [a1 : a2 : · · · : ac] and B = [b1 : b2 : · · · : bc] be r × c matrices.
The sum A + B is the r × c matrix given by
A + B = [a_1 + b_1 : a_2 + b_2 : \cdots : a_c + b_c] = \begin{pmatrix} a_{11} & \ldots & a_{1c} \\ a_{21} & \ldots & a_{2c} \\ \vdots & \ddots & \vdots \\ a_{r1} & \ldots & a_{rc} \end{pmatrix} + \begin{pmatrix} b_{11} & \ldots & b_{1c} \\ b_{21} & \ldots & b_{2c} \\ \vdots & \ddots & \vdots \\ b_{r1} & \ldots & b_{rc} \end{pmatrix} = \begin{pmatrix} a_{11} + b_{11} & \ldots & a_{1c} + b_{1c} \\ a_{21} + b_{21} & \ldots & a_{2c} + b_{2c} \\ \vdots & \ddots & \vdots \\ a_{r1} + b_{r1} & \ldots & a_{rc} + b_{rc} \end{pmatrix} = B + A
• Note Only matrices with the same dimensions can be added.
Example:
\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} + \begin{pmatrix} 5 & 4 \\ 8 & 2 \\ 3 & 7 \end{pmatrix} = \begin{pmatrix} 6 & 6 \\ 11 & 10 \\ 5 & 16 \end{pmatrix}
Multiplication of a matrix and a vector: Let A be an r× c matrix
and let b be a c-dimensional column vector. The product Ab is the
r × 1 matrix
Ab = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1c} \\ a_{21} & a_{22} & \ldots & a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r1} & a_{r2} & \ldots & a_{rc} \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_c \end{pmatrix} = \begin{pmatrix} a_{11}b_1 + a_{12}b_2 + \cdots + a_{1c}b_c \\ a_{21}b_1 + a_{22}b_2 + \cdots + a_{2c}b_c \\ \vdots \\ a_{r1}b_1 + a_{r2}b_2 + \cdots + a_{rc}b_c \end{pmatrix}
• Example:
\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 5 \\ 8 \end{pmatrix} = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 8 \\ 3 \cdot 5 + 8 \cdot 8 \\ 2 \cdot 5 + 9 \cdot 8 \end{pmatrix} = \begin{pmatrix} 21 \\ 79 \\ 82 \end{pmatrix}
Multiplication of matrices: Let A be an r× c matrix and B a c× t
matrix, i.e. B = [b1 : b2 : · · · : bt]. The product AB is the r × t
matrix given by:
AB = A[b1 : b2 : · · · : bt] = [Ab1 : Ab2 : · · · : Abt]
Example:
\begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 5 & 4 \\ 8 & 2 \end{pmatrix} = \left[ \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 5 \\ 8 \end{pmatrix} : \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} \right] = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 8 & 1 \cdot 4 + 2 \cdot 2 \\ 3 \cdot 5 + 8 \cdot 8 & 3 \cdot 4 + 8 \cdot 2 \\ 2 \cdot 5 + 9 \cdot 8 & 2 \cdot 4 + 9 \cdot 2 \end{pmatrix} = \begin{pmatrix} 21 & 8 \\ 79 & 28 \\ 82 & 26 \end{pmatrix}
• Note The product AB can only be formed if the number of rows in B and the number of columns in A are the same. In that case, A and B are said to be conformable.
• Note In general AB and BA are not identical.
A mnemonic for matrix multiplication is to place B above and to the right of A; each entry of the product is then the inner product of the corresponding row of A and column of B:
\begin{array}{cc} & \begin{pmatrix} 5 & 4 \\ 8 & 2 \end{pmatrix} \\ \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} & \begin{pmatrix} 1 \cdot 5 + 2 \cdot 8 & 1 \cdot 4 + 2 \cdot 2 \\ 3 \cdot 5 + 8 \cdot 8 & 3 \cdot 4 + 8 \cdot 2 \\ 2 \cdot 5 + 9 \cdot 8 & 2 \cdot 4 + 9 \cdot 2 \end{pmatrix} \end{array} = \begin{pmatrix} 21 & 8 \\ 79 & 28 \\ 82 & 26 \end{pmatrix}
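As a quick check of the mnemonic, the product can be computed in PROC IML; this sketch (our addition, not from the slides) reproduces the example above:

proc iml;
   A = {1 2, 3 8, 2 9};          /* 3 x 2 matrix */
   B = {5 4, 8 2};               /* 2 x 2 matrix; A and B are conformable */
   C = A * B;                    /* should equal {21 8, 79 28, 82 26} */
   print C;
quit;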
Special matrices:
• An n× n matrix is said to be a square matrix
• A matrix with 0 on all entries is the 0–matrix and is often written
simply as 0 (or as 0r×c to emphasize the dimension).
• A matrix consisting of 1s in all entries is often written J (or as Jr×c to emphasize the dimension).
• A square matrix with 0 on all off-diagonal entries and elements d1, d2, . . . , dn on the diagonal is said to be a diagonal matrix and is often written diag{d1, d2, . . . , dn}.
• A diagonal matrix with 1s on the diagonal is called the identity matrix and is denoted I (or In×n to emphasize the dimension).
• A matrix A is symmetric if A = A⊤.
Some rules for matrix operations: For (conformable) matrices
A,B and C the following rules apply
(A + B)^\top = A^\top + B^\top
(AB)^\top = B^\top A^\top
A(B + C) = AB + AC
AB = AC \not\Rightarrow B = C
Inverse of a matrix: The inverse of an n × n matrix A is the matrix B (which is also n × n) which multiplied with A gives the identity matrix I. That is,
AB = BA = I.
One says that B is A's inverse and writes B = A−1.
• Note Only square matrices can have an inverse.
• Note Not all square matrices have an inverse.
• Note When the inverse exists, it is unique.
• Note Finding the inverse of a large matrix A is numerically
complicated.
Example 1. It is easy to find the inverse of a 2 × 2 matrix. When
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}
then the inverse is
A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}
under the assumption that ad − bc ≠ 0. The number ad − bc is called the determinant of A, sometimes written det(A).
If the determinant det(A) = 0, then A has no inverse. fin
Example 2. Finding the inverse of a diagonal matrix is easy: Let
A = \begin{pmatrix} a_1 & 0 & \ldots & 0 \\ 0 & a_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & 0 & \ldots & a_n \end{pmatrix}
where all ai ≠ 0. Then the inverse is
A^{-1} = \begin{pmatrix} \frac{1}{a_1} & 0 & \ldots & 0 \\ 0 & \frac{1}{a_2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & 0 & \ldots & \frac{1}{a_n} \end{pmatrix}
If one ai = 0 then A−1 does not exist. fin
Generalized inverse: Not all square matrices have an inverse.
However all square matrices have a generalized inverse.
A generalized inverse of a square matrix A is a matrix A− satisfying
AA−A = A
Any square matrix has an infinite number of generalized inverses.
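Both kinds of inverses are available in PROC IML: inv gives the ordinary inverse, and ginv gives the Moore-Penrose generalized inverse, which is one particular choice among the infinitely many generalized inverses. A small sketch (our addition, with made-up matrices):

proc iml;
   A = {1 2, 2 3};                /* det(A) = -1, so inv(A) exists */
   Ainv = inv(A);
   S = {1 1 0, 1 1 0, 0 0 1};     /* singular: first two rows coincide */
   G = ginv(S);                   /* Moore-Penrose generalized inverse */
   resid = S * G * S - S;         /* should be the 0-matrix, i.e. S G S = S */
   print Ainv, G, resid;
quit;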
Linear Combinations
Let a1, a2, . . . , ac be r–vectors and let A = [a1 : a2 : · · · : ac] be the
corresponding r × c matrix.
Let v = (v1, v2, . . . , vc)⊤ be a c-vector and let
x = Av = a_1 v_1 + a_2 v_2 + \cdots + a_c v_c = \sum_j a_j v_j
Then the r–vector x is said to be a linear combination of
a1, a2, . . . , ac.
Let w = (w1, w2, . . . , wc)⊤ be another c-vector and let correspondingly y = Aw = \sum_j a_j w_j.
Then the following can be noted:
• For a number α the vector αx = α(Av) = A(αv) is also a linear
combination of a1, a2, . . . , ac.
• The sum x+y = Av+Aw = A(v+w) is also a linear combination
of a1, a2, . . . , ac.
• Hence if x and y are both linear combinations of a1, a2, . . . , ac then so
is the sum αx + βy where α and β are numbers.
n–dimensional Spaces
A 2–vector x = (x1, x2) can be regarded as the point with
coordinates (x1, x2) in a 2–dimensional coordinate system, i.e. in the
plane.
Likewise a 3–vector x = (x1, x2, x3) can be regarded as the point
with coordinates (x1, x2, x3) in a 3–dimensional coordinate system,
i.e. in space.
In general an n–vector x = (x1, x2, . . . , xn) can be regarded as the
point with coordinates (x1, x2, . . . , xn) in an n–dimensional
coordinate system, i.e. in an n-dimensional space. Such a space shall here be referred to as Rn. It's hard to draw!
To justify such n–dimensional spaces, suppose x consists of a
location of an object (that takes 3 coordinates), the temperature of
the object (that occupies one coordinate) and the time (that also
occupies one coordinate). Hence the total information about the
object can be regarded as a point in a 5–dimensional space.
Note that if x and y are both vectors in Rn then so is the sum
αx + βy.
Linear Subspaces
Consider a set a1, a2, . . . , ac of r–vectors.
We can regard these vectors as “building blocks” for creating new
vectors as linear combinations of the building blocks. Any such
vector is an r–vector
The set of vectors which can be created as linear combinations of
the “building blocks” is called a linear subspace of Rr.
Such a space, let us call it L, is said to be spanned by a1, a2, . . . , ac
and we write L = span(a1, a2, . . . , ac).
Example 3. Consider the vectors
a_1 = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix}, \qquad a_2 = \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix}
Hence span(a1, a2) is the set of vectors which can be written as
y = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix} v_1 + \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix} v_2
for all possible choices of v = (v1, v2). fin
More precisely, L consists of all vectors of the form
a_1 v_1 + a_2 v_2 + \cdots + a_c v_c
for all possible choices of c-vectors v = (v1, . . . , vc).
It is common to organize the building blocks as a matrix
A = [a1 : · · · : ac]. Then another way of describing L is as the set of
vectors that can be written as Av, or more precisely
L = {y|y = Av for all possible vectors v}
Frequently one uses the name span(A) for L.
There are some additional aspects of subspaces of which a few will
be illustrated:
Example 4. Consider again the subspace L = span(a1, a2) where
a1 = (2, 6, 4)⊤ and a2 = (1, 5, 7)⊤
• A question is whether all vectors y = (y1, y2, y3)> can be written
as y = a1v1 + a2v2?
The answer is “no”; for example y = (1, 5, 3)⊤ cannot be written in that form.
• Another question is whether there are other ways of representing L?
The answer is “yes” – there are infinitely many. To pick one, let b1 = a1 + a2 and b2 = a1 − a2. Then L = span(b1, b2).
fin
• Note The 0-vector belongs to all linear subspaces. In the previous example one gets y = 0 by choosing all coefficients equal to zero.
Linear dependence and independence
Linearly dependent vectors: A set of vectors a1, ..., ac are linearly dependent if one of them can be written as a linear combination of the others, for example if
a_c = \sum_{j=1}^{c-1} a_j v_j
where the vj are numbers.
Linearly independent vectors: If none of the vectors a1, ..., ac can be written as a linear combination of the others, the set is said to be linearly independent.
Throw–out–technique: If one vector, say ac, can be written as a linear combination of the other vectors, then it can be thrown away without changing the structure of the space, i.e.
span(a1, . . . , ac) = span(a1, . . . , ac−1)
This process can go on until one ends up with a set of linearly independent vectors.
This allows us to find a representation of the subspace which is as simple (economical) as possible.
Example 5. Consider the vectors
a_1 = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix}, \quad a_2 = \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix}, \quad a_3 = \begin{pmatrix} 0 \\ 2 \\ 5 \end{pmatrix} \quad \text{and} \quad x = \begin{pmatrix} 3 \\ 13 \\ 16 \end{pmatrix}
1. The vector x is a linear combination of a1, a2 and a3, since x = a1 + a2 + a3.
2. Since a3 = a2 − (1/2)a1, the ai vectors are linearly dependent. Consequently x can be written as a linear combination of only a1 and a2, because x = (1/2)a1 + 2a2.
3. The vectors a1, a2 are linearly independent and so are the sets a1, a3 and a2, a3.
fin
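Linear dependence can be checked numerically by computing the rank of the matrix [a1 : a2 : a3]. A PROC IML sketch (our addition; the trace(A−A) idiom is one common way to obtain the rank):

proc iml;
   a1 = {2, 6, 4};   a2 = {1, 5, 7};   a3 = {0, 2, 5};
   A = a1 || a2 || a3;               /* the 3 x 3 matrix [a1 : a2 : a3] */
   r = round(trace(ginv(A) * A));    /* rank of A */
   print r;                          /* r = 2 < 3: the columns are dependent */
quit;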
Basis of a subspace: If the vectors a1, ..., ac span a given subspace
L and are linearly independent, they are said to be a basis for L.
Any linear subspace has infinitely many different bases.
Dimension of a linear subspace: Yet all bases of a linear subspace share a common feature: They have the same number of elements. The number of elements of a basis is the dimension of the subspace.
Throw–away: Having a linearly dependent set of vectors a1, ..., ac one can always apply the throw-away technique to obtain a linearly independent set of vectors. This set is then a basis for span(a1, . . . , ac).
Example 6. Consider the vectors
a_1 = \begin{pmatrix} 2 \\ 6 \\ 4 \end{pmatrix}, \quad a_2 = \begin{pmatrix} 1 \\ 5 \\ 7 \end{pmatrix}, \quad a_3 = \begin{pmatrix} 0 \\ 2 \\ 5 \end{pmatrix}, \quad b_1 = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} \quad \text{and} \quad b_2 = \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix}
and the corresponding matrices A = [a1 : a2 : a3], A2 = [a1 : a2] and B = [b1 : b2].
1. Since a3 = a2 − (1/2)a1, the ai vectors are linearly dependent.
fin
• Note Since L = span(A) = span(B) one can think of the
matrices A and B as two different ways of representing the same
linear subspace.
Projections onto Linear Subspaces
Example 7. Consider the vectors a = (2, 2)⊤ and y = (1, 2)⊤.
Clearly y is not in span(a). In statistics the following question is extremely important: Can we find a vector ŷ in span(a) which is as “close to” y as possible?
The answer is “yes”: Find the (orthogonal) projection of the point y onto the line going through a. There is a simple mathematical expression for obtaining ŷ, namely
\hat{y} = a(a^\top a)^{-1} a^\top y = \begin{pmatrix} 2 \\ 2 \end{pmatrix} \frac{1}{8} (2, 2) \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 3/2 \\ 3/2 \end{pmatrix}
The property of ŷ is that the length of y − ŷ is as small as possible. Moreover, y − ŷ and ŷ are orthogonal. fin
In general let y be an r–vector and let A = [a1 : · · · : ac] be an r × c
matrix.
Then there always exists a vector ŷ in span(A) which is as close to y as possible.
If y is in span(A), then ŷ = y because in this case the length of y − ŷ is zero.
If y is not in span(A) then the expression is as follows: Assume that
all columns of A are linearly independent. (Recall that if that is not
the case we can throw away redundant columns without changing
the space spanned by those remaining.)
Then ŷ = Py where
P = A(A^\top A)^{-1} A^\top
is the projection matrix onto span(A).
It then holds that
1. Py is in span(A).
2. Py is the vector in span(A) which is closest to y (in the sense that the length of y − Py is minimized).
3. Py = y if and only if y is already in span(A).
Example 8. Consider the 3 × 2 matrix A = [a1 : a2], where
a_1 = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} \quad \text{and} \quad a_2 = \begin{pmatrix} 2 \\ 8 \\ 9 \end{pmatrix}
Then the projection matrix onto span(A) is P = A(A⊤A)−1A⊤. To find P we first calculate
A^\top A = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 8 & 9 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} = \begin{pmatrix} 14 & 44 \\ 44 & 149 \end{pmatrix}
Hence
(A^\top A)^{-1} = \frac{1}{150} \begin{pmatrix} 149 & -44 \\ -44 & 14 \end{pmatrix}
From this we find
(A^\top A)^{-1} A^\top = \frac{1}{150} \begin{pmatrix} 149 & -44 \\ -44 & 14 \end{pmatrix} \begin{pmatrix} 1 & 3 & 2 \\ 2 & 8 & 9 \end{pmatrix} = \frac{1}{150} \begin{pmatrix} 61 & 95 & -98 \\ -16 & -20 & 38 \end{pmatrix}
Finally we find
P = A(A^\top A)^{-1} A^\top = \frac{1}{150} \begin{pmatrix} 1 & 2 \\ 3 & 8 \\ 2 & 9 \end{pmatrix} \begin{pmatrix} 61 & 95 & -98 \\ -16 & -20 & 38 \end{pmatrix} = \frac{1}{150} \begin{pmatrix} 29 & 55 & -22 \\ 55 & 125 & 10 \\ -22 & 10 & 146 \end{pmatrix}
fin
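The same computation in PROC IML (our addition; the vector y below is an arbitrary made-up example, not taken from the slides):

proc iml;
   A = {1 2, 3 8, 2 9};                /* the matrix of Example 8 */
   P = A * inv(t(A) * A) * t(A);       /* projection matrix onto span(A) */
   y = {1, 2, 3};                      /* hypothetical observation vector */
   yhat = P * y;                       /* closest point to y in span(A) */
   print P[format=8.4], yhat;
quit;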
Exercises in linear algebra
Exercise 1. 1. Are the vectors (1, 1) and (1, 2) orthogonal?
2. Are (1, 1) and (2,−2) ?
3. Are (1, 1) and (−1,−1) ?
4. Make a drawing which illustrates these vectors
Exercise 2. Let
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}.
1. Is A symmetrical?
2. Is A>A symmetrical?
3. Is AA> symmetrical?
4. What is the result of adding A and A⊤?
Exercise 3. Let
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}.
Calculate AB and BA. What can be concluded from this?
Exercise 4. Let a = (1, 1, 1, 0, 0, 0)⊤ be a 6 × 1 matrix. Find aa⊤ and a⊤a.
Exercise 5. Let
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \quad \text{and} \quad B = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}
Calculate AB. What can be concluded from this?
Exercise 6. What is the inverse of the 3 × 3 matrix diag(1, 4, 9)?
Exercise 7. Two equations with two unknowns. Convince yourself
that the system of equations
x1 + 2x2 = 3
2x1 + 3x2 = 4
can be written as
\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 4 \end{pmatrix},
i.e. as Ax = b. Find A−1 and use this for solving the system of equations as follows:
x = Ix = A−1Ax = A−1b.
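A sketch of this exercise in PROC IML (our addition); the built-in solve function is the numerically preferable alternative to forming the inverse explicitly:

proc iml;
   A = {1 2, 2 3};
   b = {3, 4};
   x  = inv(A) * b;        /* x = A^{-1} b = (-1, 2)' */
   x2 = solve(A, b);       /* same solution without forming inv(A) */
   print x x2;
quit;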
Exercise 8. Let
A = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}.
1. How do vectors of the form Av look when v = (v1, v2)⊤?
2. Find the projection matrix P = A(A⊤A)−1A⊤.
3. Let y = (1, 3, 5, 7)⊤. Find Py.
4 Linear normal models
Linear normal models serve as a natural starting point for the presentation of mixed model theory. Most researchers within animal science have at least a working knowledge of linear normal models.
These slides served the purpose of giving an overview of the different concepts and of linking the concepts with the underlying statistical theory. Finally, the standard terminology used within SAS was presented from a theoretical point of view.
Link to the full screen presentation1.
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/LNM.f.pdf
Introduction
Many well known statistical models used in practice, for example
• linear regression,
• multiple regression,
• analysis of variance,
• analysis of covariance,
can be formulated in the general framework of linear normal models
(abbreviated LNM), which undoubtedly is the most important class of
models in statistics.
A linear normal model is also sometimes called a
general linear model.
The SAS procedure PROC GLM is designed to deal with the class of
linear normal models.
Any linear normal model can be formulated in matrix form as
Y = Xβ + ε
where Y is an n × 1 vector of observations, X is an n × p matrix of covariates, β is a p × 1 vector of unknown parameters and ε is an n × 1 vector of unobservable random errors.
Example 1. One–way analysis of variance.
The model
Ykl = αk + εkl
where εkl ∼ N(0, σ2) for k = 1, 2 and l = 1, 2, 3 can be written in matrix form as
\begin{pmatrix} Y_{11} \\ Y_{12} \\ Y_{13} \\ Y_{21} \\ Y_{22} \\ Y_{23} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \end{pmatrix}
i.e. Y = Xβ + ε.
The vector of expected values µ = (µ11, µ12, . . . , µ23)⊤ is
\begin{pmatrix} \mu_{11} \\ \mu_{12} \\ \mu_{13} \\ \mu_{21} \\ \mu_{22} \\ \mu_{23} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_1 \\ \alpha_1 \\ \alpha_2 \\ \alpha_2 \\ \alpha_2 \end{pmatrix}
i.e. µ = Xβ. fin
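For illustration, the design matrix of Example 1 and the induced mean vector can be reproduced in PROC IML (our addition; the parameter values are made up):

proc iml;
   X = {1 0, 1 0, 1 0, 0 1, 0 1, 0 1};   /* design matrix of Example 1 */
   beta = {2, 3};                        /* hypothetical (alpha1, alpha2) */
   mu = X * beta;                        /* gives (2,2,2,3,3,3)' */
   print mu;
quit;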
There are good reasons for dealing with LNMs in general instead of
treating regression analysis, analysis of variance etc. separately.
For LNMs in general it is easy to establish how to
• estimate parameters,
• estimate contrasts,
• make significance tests,
• perform model control.
From these general results, it can be deduced how to make the
corresponding tests in e.g. regression models and in analysis of
variance.
It is also convenient to work with LNMs in matrix terminology,
because any LNM can be formulated generally as
y = Xβ + ε
Moreover, random effects models (mixed models) are an extension of
linear normal models. I.e. any linear normal model is in a sense also
a mixed model.
Many aspects of mixed models become extremely cumbersome if the
matrix representation is not available.
Example 2. Simple linear regression:
The linear regression model
Yi = β0 + β1xi + εi
where εi ∼ N(0, σ2) for i = 1, . . . , 6 can be written in matrix form as
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \\ 1 & x_6 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{pmatrix}
i.e. Y = Xβ + ε.
The vector of expected values µ = (µ1, µ2, . . . , µ6)⊤ is
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \\ 1 & x_6 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \beta_0 + \beta_1 x_3 \\ \beta_0 + \beta_1 x_4 \\ \beta_0 + \beta_1 x_5 \\ \beta_0 + \beta_1 x_6 \end{pmatrix}
i.e. µ = Xβ. fin
Linear Normal Models
A linear normal model (LNM) is defined as follows:
1. The observations y1, . . . , yn come from (are realizations of)
independent random variables Y1, . . . , Yn.
2. Each random variable has a normal distribution
Yi = µi + εi, εi ∼ N(0, σ2).
Hence each Yi is allowed to have its own mean value, but the
variance σ2 is the same for all i = 1, . . . , n.
3. To each observation yi there are covariates (known constants)
xi1, . . . , xip such that
\mu_i = \mu(\beta)_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p = \sum_{k=1}^{p} x_{ik}\beta_k.
That is, the mean value µi is related to the covariates in a linear
way through the parameters β1, . . . , βp.
A practical interpretation of constant variance is that each random
variable Yi has the same tendency to deviate (in a random way)
from its expectation µi.
As it has been illustrated, any LNM can be cast in matrix form as
Y = Xβ + ε
where
Y : is an n× 1 vector of observations,
X : is an n× p matrix of covariates, whose ith row is xi1, . . . , xip,
β : is a p× 1 vector of unknown parameters, and
ε : is an n × 1 vector of unobservable random errors which are
independent and N(0, σ2) distributed.
The matrix X is called the design matrix (or model matrix) because
it contains information about covariates, i.e. about the design of the
study.
Example 3. Polynomial regression:
The polynomial regression model
Yi = β0 + β1xi + β2xi² + εi
where εi ∼ N(0, σ2) for i = 1, . . . , 6 can be written in matrix form as
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ 1 & x_4 & x_4^2 \\ 1 & x_5 & x_5^2 \\ 1 & x_6 & x_6^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{pmatrix}
i.e. Y = Xβ + ε. fin
Random Vectors and Matrices
A random vector Z = (Z1, . . . , Zn)> is a vector of random variables.
Since we are working with vectors of random variables, it is
convenient to establish the notions of
• expectation vector (or mean vector ) and
• covariance matrix of a vector of random variables.
• Most frequently the interest is in the mean vector.
• Yet, the covariance matrix is of interest when modelling that
observations cannot be regarded as coming from independent
random variables.
• In fact, one view of mixed models is that mixed models are
concerned with modelling the covariance matrix in some structured
way.
The mean or expectation of a random vector is the vector of mean
values, i.e.
E(Z) = \begin{pmatrix} E(Z_1) \\ E(Z_2) \\ \vdots \\ E(Z_n) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix} = \mu
For a LNM, we have already seen a use of this, namely through
writing
µ = Xβ.
The covariance matrix Cov(Z) of a random vector
Z = (Z1, . . . , Zn)>
is the n× n matrix whose element in the ith row and jth column is
the covariance between Zi and Zj.
Example 4. For example, with n = 3 we have
\mathrm{Cov}(Z) = \begin{pmatrix} \mathrm{Var}(Z_1) & \mathrm{Cov}(Z_1, Z_2) & \mathrm{Cov}(Z_1, Z_3) \\ \mathrm{Cov}(Z_2, Z_1) & \mathrm{Var}(Z_2) & \mathrm{Cov}(Z_2, Z_3) \\ \mathrm{Cov}(Z_3, Z_1) & \mathrm{Cov}(Z_3, Z_2) & \mathrm{Var}(Z_3) \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}
fin
In general
Cov(Z)ij = Cov(Zi, Zj) = E[(Zi − µi)(Zj − µj)].
In particular the diagonal elements of Cov(Z) contain the variances,
Cov(Z)ii = Cov(Zi, Zi) = E[(Zi − µi)2] = Var(Zi).
Since Cov(Zi, Zj) = Cov(Zj, Zi), the covariance matrix is
symmetric.
Example 5. The error term ε = (ε1, . . . , εn) from a linear normal model has a very simple covariance matrix:
• Var(εi) = σ2 because the variance is the same for all units
• Cov(εi, εj) = 0 because εi and εj are independent.
• Hence
\mathrm{Cov}(\varepsilon) = \sigma^2 \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{pmatrix} = \sigma^2 I_n
fin
Functions of Random Vectors
Matrix algebra is useful when dealing with
linear functions of random vectors.
If Z is a random n-vector, A is an r× n matrix and b is an r–vector,
then
U = AZ + b
is also a random vector.
The mean and covariance of linear functions of random vectors is
easily calculated using the following:
Result 1.
E(AY + b) = AE(Y) + b    (1)
Cov(AY + b) = Cov(AY) = ACov(Y)A⊤    (2)
A particular application of (1) and (2) is the following:
• Let Z be a random vector of length n with mean E(Z) (an
n–vector) and covariance matrix Cov(Z) (an n× n matrix).
• Let a = (a1, . . . , an)⊤ be a vector of numbers and consider the linear combination U = \sum_i a_i Z_i = a⊤Z.
• Then (1) and (2) imply that
E(U) = E(a⊤Z) = a⊤E(Z)
Cov(U) = Cov(a⊤Z) = a⊤Cov(Z)a
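A numerical illustration of these two formulas in PROC IML (our addition; the mean vector and covariance matrix are made-up values):

proc iml;
   mu    = {1, 2};              /* E(Z), assumed */
   Sigma = {2 1, 1 3};          /* Cov(Z), assumed */
   a     = {1, -1};
   EU   = t(a) * mu;            /* E(a'Z)   = a'mu      = -1 */
   VarU = t(a) * Sigma * a;     /* Var(a'Z) = a'Sigma a =  3 */
   print EU VarU;
quit;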
The Multivariate Normal Distribution
So far, we have treated the mean and covariance of a random vector.
We shall now discuss a distribution of a random vector:
Definition 1. It is said that Z follows an n–dimensional
multivariate normal distribution (in short MVN) with mean vector
µ = E(Z) and covariance matrix Σ = Cov(Z), written
Z ∼ Nn(µ,Σ)
if a⊤Z follows a univariate normal distribution for all possible n-vectors a.
Without going into detail, we shall just mention that if Σ has an
inverse, then Z has a density which can be written
f(z) = (2\pi)^{-n/2} \det(\Sigma)^{-1/2} \exp\{-\tfrac{1}{2}(z - \mu)^\top \Sigma^{-1} (z - \mu)\}
Example 6. For n = 2 the density is the familiar bell-shaped surface over the plane. (The surface plot shown in the slides is omitted here.) fin
The Distribution of a LNM
For a LNM, the vector of unobservable errors is ε = (ε1, . . . , εn)>,
where εi ∼ N(0, σ2) and ε1, . . . , εn are independent.
Hence we have
E(ε) = 0 and Cov(ε) = σ2I
Since any linear combination of independent N(0, σ2)–variables
yields a normal variable we conclude that
ε ∼ Nn(0, σ2I)
Hence for the linear normal model Y = Xβ + ε we find that
E(Y) = µ = E(Xβ + ε) = Xβ + E(ε) = Xβ
Cov(Y) = Cov(Xβ + ε) = Cov(ε) = σ2I
and can write
Y ∼ Nn(Xβ, σ2I).
The Expectation in a LNM
Example 7. (Continuation of Example 1).
The one-way analysis of variance model in Example 1 can be formulated in at least three different ways:
1. As Ykl = αk + εkl, and β = (α1, α2)>.
2. As Ykl = δ + γk + εkl where γ2 = 0, such that γ1 represents the treatment effect. Hence, β2 = (δ, γ1)⊤.
3. As Ykl = δ + ρk + εkl. Thus, β3 = (δ, ρ1, ρ2)⊤.
In many ways, the latter formulation is the most natural and conventional, but it poses some problems.
Let
X = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \qquad X_2 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \end{pmatrix} \qquad X_3 = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix} \qquad (3)
Any vector which can be written as Xβ must be of the form (a, a, a, b, b, b)⊤ for numbers a and b.
But that is also the case for vectors of the form X2β2 and X3β3. From this we conclude that with respect to the mean vector the matrices X, X2 and X3 are “all the same”. Hence
µ = Xβ = X2β2 = X3β3.
1. X corresponds to writing the model as Ykl = αk + εkl.
2. X2 corresponds to writing the model as Ykl = δ + γk + εkl, withγ2 = 0.
3. X3 corresponds to writing the model as Ykl = δ + ρk + εkl.
Consider the mean vector µ = (2, 2, 2, 3, 3, 3)⊤. The formulation as µ = X3β3 where β3 = (δ, ρ1, ρ2)⊤ is different from the two others in an important way:
• Under the representation µ = Xβ, there is only one choice of β, namely β = (2, 3)⊤, which yields µ.
• Under the representation µ = X2β2, there is only one choice of β2, namely β2 = (3, −1)⊤, which yields µ.
• Under the representation µ = X3β3, there are infinitely many ways of obtaining µ. Two such are β3 = (1, 1, 2)⊤ and β3 = (3, −1, 0)⊤.
fin
• Example 7 illustrates that there in general are different
representations of the same model. Corresponding to the different
representations, there are different parameters, with different
interpretations.
• We say that there are different parametrizations of the same
model.
• The representation µ = X3β3 is said to be over parametrized –
there are too many parameters in the model.
In many practical situations the models we work with are over
parametrized.
Yet, it does not matter which representation of the model we choose
and it is not really important whether the model is over
parametrized in the following sense:
Any question that can be answered under one representation can
also be answered under another.
To treat these issues in detail, it is necessary to think about what a
LNM really says: It says that
y = Xβ + ε where µ = Xβ.
Hence β affects the distribution of the observables y only indirectly,
namely through Xβ.
Therefore, since y is what can be observed, we can only use y for saying “something” about β if this “something” can be expressed
through Xβ.
This observation leads to the important notion of estimability and
estimable functions.
The columns of X define a subspace of Rn which we denote by L,
i.e.
L = span(X).
The statement µ = Xβ simply means that µ can be written as a
linear combination of the column vectors of X, i.e. that µ lies in
span(X).
But as has been illustrated in Example 7, there might be more than
one β vector producing µ.
Hence by saying that µ = Xβ, all one really says is that µ belongs
to L.
Moreover, there are infinitely many different ways of representing L,
because one can always find another matrix, say X2 with
span(X2) = span(X) such that any vector µ = Xβ = X2β2.
Therefore, since the parameter vector β is closely related to the
actual representation of L, and since β might not be uniquely
determined, the value of a parameter vector β is rarely of direct
interest in itself.
Example 8. (Continuation of Example 2)
Let x̄ = \frac{1}{n} \sum_i x_i denote the average of the xi. Define new variables zi = xi − x̄ and consider the regression model
Yi = α0 + α1zi + εi.
This model corresponds to “centering the xi around their mean”. Not surprisingly, this does not change the fundamental structure of the model - it is still a linear regression model, but with the following new design matrix:
X = \begin{pmatrix} 1 & z_1 \\ 1 & z_2 \\ 1 & z_3 \\ 1 & z_4 \\ 1 & z_5 \\ 1 & z_6 \end{pmatrix} = \begin{pmatrix} 1 & x_1 - \bar{x} \\ 1 & x_2 - \bar{x} \\ 1 & x_3 - \bar{x} \\ 1 & x_4 - \bar{x} \\ 1 & x_5 - \bar{x} \\ 1 & x_6 - \bar{x} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \alpha_0 \\ \alpha_1 \end{pmatrix}
fin
Representations of Models in SAS
Here we shall illustrate some of the differences between different
ways of specifying the models in SAS.
The illustration is with PROC MIXED but applies to PROC GLM too.
The model in Example 7 can be analyzed with the SAS program
PROC MIXED;
CLASS TREAT;
MODEL Y = TREAT / SOLUTION;
RUN;
Here TREAT is a variable with levels 1 and 2.
1. First SAS generates the matrix X3.
2. SAS then realizes that the columns of X3 are linearly dependent.
3. SAS therefore proceeds by eliminating columns until a set of linearly
independent columns is achieved. This is done in a systematic
way: The column corresponding to the highest value of TREAT is
removed which yields X2.
The parameter estimates reported by SAS are therefore (δ, γ1).
Note that it is the option SOLUTION that causes the parameter
estimates to be reported.
The SAS program
PROC MIXED;
CLASS TREAT;
MODEL Y = TREAT / NOINT SOLUTION;
RUN;
on the other hand causes SAS to directly generate X, because the
NOINT option specifies that there shall not be a column of 1s in the
design matrix. The parameter estimates reported by SAS are therefore
(α1, α2).
Example 9. Consider the two–way analysis of variance
Yijk = δ + αi + βj + γij + εijk
where i = 1, 2, j = 1, 2 and k = 1, 2, 3. The mean vector is
\mu = \begin{pmatrix} 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \delta \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \\ \gamma_{11} \\ \gamma_{12} \\ \gamma_{21} \\ \gamma_{22} \end{pmatrix} = X\beta
(where in the design matrix we regard 1 and 0 as vectors of length 3).
This model is highly over parametrized. SAS handles this problem in the way indicated above: A new design matrix giving the same model is created, namely
\mu = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} \delta \\ \alpha_1 \\ \beta_1 \\ \gamma_{11} \end{pmatrix} = X_2\beta_2
This corresponds to setting α2 = β2 = γ21 = γ12 = γ22 = 0 beforehand. (That is, every time a parameter contains the level number 2 in its index it is set to zero.) fin
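A sketch of how the model of Example 9 could be specified in SAS (our addition; the variable names A, B and Y are assumptions, not taken from the example):

PROC MIXED;
CLASS A B;
MODEL Y = A B A*B / SOLUTION;
RUN;

With the SOLUTION option the reported estimates would then be (δ, α1, β1, γ11); the parameters whose index contains the highest level number are set to zero as described above.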
This means that SAS solves the problem of an over parametrized
model by simply reducing it to a representation which is not over
parametrized.
As mentioned previously, this is not a problem because any question
that can be answered under one representation of a model can also
be answered under another.
Yet, care should be taken when it comes to interpreting output from
SAS, see Section 18.
Least Squares Estimation in a LNM
In a LNM, the mean µi is a function of the parameter vector β.
One frequently used criterion for estimation is the method of
least squares:
Find the vector µ̂ = (µ̂1, . . . , µ̂n)⊤ which minimizes the sum of squared deviations
D(\beta) = \sum_{i=1}^{n} (y_i - \mu_i)^2
under the restriction that µ = Xβ for some parameter vector β.
• Such a vector µ̂ always exists and is unique.
• We say that β̂ is a least squares estimate for β. Such an estimate β̂ also exists, but it is in general not unique.
Example 10. (Continuation of Example 2)
For the regression analysis we find
$$ D(\beta) = \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 $$

Most standard textbooks on statistics take the following approach to minimization of D(β):

1) calculate the derivatives ∂D(β)/∂β0 and ∂D(β)/∂β1,
2) set these equal to zero, and
3) solve for β0 and β1.
This gives

$$ \hat\beta_1 = \frac{\sum_i (y_i - \bar{y}_{.})(x_i - \bar{x}_{.})}{\sum_i (x_i - \bar{x}_{.})^2},
\qquad \hat\beta_0 = \bar{y}_{.} - \hat\beta_1 \bar{x}_{.} $$

fin
Example 11. (Continuation of Example 1) For the one–way analysis of variance

$$ D(\beta) = \sum_{k=1}^{2}\sum_{l=1}^{3} (y_{kl} - \alpha_k)^2 $$

The values of αk which minimize D(β), where β = (α1, α2)ᵀ, are

$$ \hat\alpha_k = \frac{1}{3}\sum_{l=1}^{3} y_{kl} = \bar{y}_{k.} $$

The vector µ̂ is in this case $(\bar{y}_{1.}, \bar{y}_{1.}, \bar{y}_{1.}, \bar{y}_{2.}, \bar{y}_{2.}, \bar{y}_{2.})^\top$.
However, if the model is written as Ykl = δ + αk + εkl, i.e. as Y = X3β3 + ε in Example 7, there is no unique least squares estimate of β3 = (δ, α1, α2). To see this, just note that

δ = 0,  α1 = ȳ1.,  α2 = ȳ2.

and

δ = (ȳ1. + ȳ2.)/2,  α1 = (ȳ1. − ȳ2.)/2,  α2 = (−ȳ1. + ȳ2.)/2

both result in the same vector µ̂ = (ȳ1., ȳ1., ȳ1., ȳ2., ȳ2., ȳ2.)ᵀ. fin
Estimation on matrix form
The estimation problem can be formulated very generally in matrix
notation and can be solved generally using projections onto
subspaces:
Using matrix notation the least squares method is:

Find the vector µ = (µ1, . . . , µn)ᵀ which minimizes

$$ D(\beta) = (y - \mu)^\top (y - \mu) $$

under the restriction that µ = Xβ for some parameter vector β.
Then we have the following results:

1. There always exists a unique vector of expected values µ̂ = (µ̂1, . . . , µ̂n)ᵀ which minimizes D(β).

2. The vector µ̂ is µ̂ = Py, where P is the projection matrix onto span(X).

3. Since µ̂ is in span(X), there exists a vector β1 satisfying µ̂ = Xβ1. We say that β1 is a least squares estimate of β.

4. If the columns of X are linearly independent, there exists only one vector β1 satisfying µ̂ = Xβ1. In that case the least squares estimate is unique.
5. If the columns of X are linearly dependent, there exist several least squares estimates, i.e. there is another vector β2 with µ̂ = Xβ2 and β1 ≠ β2.

6. In regression problems, the least squares estimate is typically unique, whereas in analysis of variance problems, the least squares estimate is generally not unique.

7. In the case where the least squares estimate is unique, it is given as

β̂ = (XᵀX)⁻¹Xᵀy.

It is easy to see why this is so: we know that µ̂ = Py = X[(XᵀX)⁻¹Xᵀy]. However, since µ̂ is in span(X), we also know that µ̂ = Xβ̂. But both equations can only be true if β̂ = (XᵀX)⁻¹Xᵀy.
The vector e = y − µ̂ is the vector of residuals, reflecting the unobserved error vector ε.

Hence eᵀe = (y − µ̂)ᵀ(y − µ̂) is the residual sum of squares, and if the model fits the data well, eᵀe should be "small" in some sense.

If there are p linearly independent columns in X, the estimate for the variance σ² is

$$ \hat\sigma^2 = \frac{1}{n-p}\, e^\top e = \frac{1}{n-p}(y - \hat\mu)^\top (y - \hat\mu) $$
Example 12. (Continuation of Example 7.)

With the matrix X as in Example 7, the projection matrix becomes

$$ P = \frac{1}{3}
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 1 & 1\\
0 & 0 & 0 & 1 & 1 & 1\\
0 & 0 & 0 & 1 & 1 & 1
\end{bmatrix} $$

fin
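The projection matrix, and with it µ̂ and σ̂², can also be computed directly. The following PROC IML sketch is our illustration (not part of the original example); it assumes the full–rank design matrix of Example 7 and a made-up response vector y:

proc iml;
  /* Full-rank design matrix: two treatments, three observations each */
  X = {1 0, 1 0, 1 0, 0 1, 0 1, 0 1};
  y = {2, 3, 4, 7, 8, 9};               /* made-up response values    */
  P     = X * inv(t(X)*X) * t(X);       /* projection onto span(X)    */
  bhat  = inv(t(X)*X) * t(X) * y;       /* least squares estimate     */
  muhat = X * bhat;                     /* fitted values, muhat = Py  */
  e     = y - muhat;                    /* residuals                  */
  s2    = t(e)*e / (nrow(X) - ncol(X)); /* variance estimate          */
  print P bhat muhat s2;
quit;

Printing P reproduces the block matrix above with entries 1/3, and bhat contains the two group means.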
The parameter vector β
We shall now assume that the LNM is such that the columns of X are linearly independent, so that the least squares estimate

β̂ = (XᵀX)⁻¹Xᵀy

of β is unique.

Letting A = (XᵀX)⁻¹Xᵀ, we note that A is a p × n matrix and see that β̂ = Ay.
Thinking in terms of random variables, the data y is a realization of a random vector Y with E(Y) = Xβ and Cov(Y) = σ²I. Then

β̂(Y) = (XᵀX)⁻¹XᵀY = AY

is also a random vector, because β̂(Y) is a function of the random vector Y.

If the elements of A are denoted aij, the ith component of β̂ is $\hat\beta_i = \sum_{j=1}^{n} a_{ij} y_j$.

Hence each component β̂i of the vector β̂ is a linear function of the data y. Therefore it is not surprising that the corresponding random variables β̂i(Y) are dependent in some way.
Using the relations (1) and (2) we find that

$$ E(\hat\beta(Y)) = A\,E(Y) = (X^\top X)^{-1}X^\top E(Y) = (X^\top X)^{-1}X^\top X\beta = \beta \qquad (4) $$

Equation (4) says that the expected value of the least squares estimator β̂ is simply the true but unknown value β.
$$ \begin{aligned}
\mathrm{Cov}(\hat\beta(Y)) = A\,\mathrm{Cov}(Y)A^\top = \sigma^2 A I A^\top = \sigma^2 AA^\top
&= \sigma^2 (X^\top X)^{-1}X^\top\big[(X^\top X)^{-1}X^\top\big]^\top\\
&= \sigma^2 (X^\top X)^{-1}X^\top X (X^\top X)^{-1}\\
&= \sigma^2 (X^\top X)^{-1} \qquad (5)
\end{aligned} $$

Equation (5) says that the covariance of the least squares estimator β̂ is proportional to the residual variance σ². Moreover, the matrix (XᵀX)⁻¹ does not depend on the data y but only on the design matrix X, i.e. on how the study at hand was conducted.
Recall that the diagonal of a covariance matrix holds the variances. Hence, knowing (XᵀX)⁻¹ and an estimate of σ², we also know the variance estimates for the β̂i.
Example 13. (Continuation of Example 2) Suppose xi = i and zi = i − 3.5 in the regression example, for i = 1, . . . , 6.
Regression of y on x with the program
PROC GLM ;
MODEL y = x / inv;
RUN; QUIT;
gives the result
The GLM Procedure
X’X Inverse Matrix
Intercept x y
Intercept 0.8666666667 -0.2 -1.286578758
x -0.2 0.0571428571 0.4835938022
y -1.286578758 0.4835938022 3.225955579
Dependent Variable: y Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 4.09260190 4.09260190 5.07 0.0874
Error 4 3.22595558 0.80648889
Corrected Total 5 7.31855748
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept -1.286578758 0.83603651 -1.54 0.1987
x 0.483593802 0.21467436 2.25 0.0874
The first two diagonal elements of (XᵀX)⁻¹ times the variance estimate σ̂² (i.e. the Mean Square Error) give the variance estimates of the regression parameters.

The square roots of these estimates are the standard errors reported.

Moreover, the off–diagonal element of (XᵀX)⁻¹ is −0.2, so the estimates of the intercept and the slope are (negatively) correlated.
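As a quick check (our computation, using the numbers in the output above): the standard error of the intercept is √(0.8666666667 · 0.80648889) ≈ 0.8360 and that of the slope is √(0.0571428571 · 0.80648889) ≈ 0.2147, exactly the values reported by PROC GLM.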
Regression of y on z with the program
PROC GLM ;
MODEL y = z / inv;
RUN; QUIT;
gives the result
The GLM Procedure
X’X Inverse Matrix
Intercept z y
Intercept 0.1666666667 0 0.4059995498
z 0 0.0571428571 0.4835938022
y 0.4059995498 0.4835938022 3.225955579
The GLM Procedure
Dependent Variable: y Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 4.09260190 4.09260190 5.07 0.0874
Error 4 3.22595558 0.80648889
Corrected Total 5 7.31855748
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 0.4059995498 0.36662626 1.11 0.3302
z 0.4835938022 0.21467436 2.25 0.0874
In this case we see that centering the x values around their average (3.5) gives parameter estimates which are uncorrelated. Moreover, the estimate of the slope (and the associated standard error) is the same as before. fin
Example 14. (Continuation of Example 2)
With

$$ X = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ 1 & x_3\\ 1 & x_4\\ 1 & x_5\\ 1 & x_6 \end{bmatrix},
\qquad \beta = \begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix} $$

we find (when letting n = 6) that

$$ X^\top X = \begin{bmatrix} n & \sum_i x_i\\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix} $$
Recall that

$$ A = \begin{bmatrix} a & b\\ c & d \end{bmatrix}
\quad\text{implies that}\quad
A^{-1} = \frac{1}{ad-bc}\begin{bmatrix} d & -b\\ -c & a \end{bmatrix} $$

(provided that ad − bc ≠ 0). Using this gives

$$ (X^\top X)^{-1} = \frac{1}{n\sum_i x_i^2 - (\sum_i x_i)^2}
\begin{bmatrix} \sum_i x_i^2 & -\sum_i x_i\\ -\sum_i x_i & n \end{bmatrix} $$

Letting K = n Σi xi² − (Σi xi)², the variance of the estimator β̂0 for the intercept is

$$ Var(\hat\beta_0) = \sigma^2\,\frac{\sum_i x_i^2}{K} $$
and the variance of the estimator β̂1 for the slope is

$$ Var(\hat\beta_1) = \sigma^2\,\frac{n}{K} $$

The estimators β̂0 and β̂1 are correlated, since

$$ Cov(\hat\beta_0, \hat\beta_1) = -\sigma^2\,\frac{\sum_i x_i}{K} $$
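As a numerical check (our addition, tying this to Example 13 where xi = i for i = 1, . . . , 6): Σi xi = 21 and Σi xi² = 91, so K = 6 · 91 − 21² = 105, giving Σi xi²/K = 0.8667, n/K = 0.05714 and −Σi xi/K = −0.2, which are precisely the entries of the (XᵀX)⁻¹ matrix printed by PROC GLM above.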
fin
Example 15. (Continuation of Example 8)
Since Σi(xi − x̄.) = 0 (verify this!), we find that

$$ X^\top X = \begin{bmatrix} n & 0\\ 0 & \sum_i z_i^2 \end{bmatrix}
= \begin{bmatrix} n & 0\\ 0 & \sum_i (x_i - \bar{x}_{.})^2 \end{bmatrix} $$

Since the inverse of a diagonal matrix is also diagonal, we conclude that the estimators α̂0 and α̂1 are independent. fin
The estimator β̂ has a p–dimensional multivariate normal distribution (in short MVN), with mean vector β and covariance matrix σ²(XᵀX)⁻¹.

This is written

$$ \hat\beta \sim N_p(\beta, \sigma^2 (X^\top X)^{-1}). $$

This means that any linear combination λᵀβ̂ has a univariate normal distribution

$$ \lambda^\top\hat\beta \sim N\big(\lambda^\top\beta,\; \sigma^2\lambda^\top (X^\top X)^{-1}\lambda\big) \qquad (6) $$

and that is a very important result for practical statistics.
Estimability and Contrasts
In a LNM with mean vector µ = Xβ one is typically interested in
making statements about (some of) the components of the
parameter vector β.
However, with µ = Xβ we only have indirect knowledge about β, because all we know is that $\mu_i = \sum_j x_{ij}\beta_j$ and, as has been illustrated, β is in general not uniquely determined. That is, there can be another vector β2 such that µ = Xβ = Xβ2.

Hence there are some constraints on what can actually be said about β.
In the one–way analysis of variance of Example 1 one might be
interested in the difference α1 − α2 or in α1 itself and there is no
problem in that. For later purposes it can be noted that
α1 − α2 = (1, −1)(α1, α2)ᵀ = (1, −1)β
α1 = (1, 0)(α1, α2)ᵀ = (1, 0)β
Example 16. Consider the two–way analysis of variance
Yij = δ + αi + βj + εij
where
$$ \mu = \begin{bmatrix}
1 & 1 & 0 & 1 & 0\\
1 & 1 & 0 & 0 & 1\\
1 & 0 & 1 & 1 & 0\\
1 & 0 & 1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \delta\\ \alpha_1\\ \alpha_2\\ \beta_1\\ \beta_2 \end{bmatrix} = X\beta $$

It is clear that this model is grossly overparametrized (why?).

Under this model we can estimate quantities like

$$ \alpha_1 - \alpha_2, \qquad \delta + \alpha_1, \qquad \delta + \alpha_1 + \tfrac{1}{2}(\beta_1 + \beta_2) $$
Note that

$$ \alpha_1 - \alpha_2 = (0, 1, -1, 0, 0)\beta, \qquad
\delta + \alpha_1 + \tfrac{1}{2}(\beta_1 + \beta_2) = (1, 1, 0, \tfrac{1}{2}, \tfrac{1}{2})\beta $$

However, other things like

$$ \alpha_1 = (0, 1, 0, 0, 0)\beta \quad\text{or}\quad \beta_1 = (0, 0, 0, 1, 0)\beta $$

cannot be estimated under this model.

fin
In a sense, the only thing uniquely determined in a LNM is µ. Therefore the only thing one can truly say something about is linear combinations of µ, i.e. linear combinations of the form

aᵀµ

for some n–vector a.

Most frequently, interest is in contrasts of the form λᵀβ. Therefore, a natural question is how

aᵀµ and λᵀβ

relate to each other.
Since µ = Xβ, we can only say something about β if we can express it as

aᵀXβ.

Note that aᵀX is a 1 × p vector.

Therefore, we can say something about the contrast λᵀβ only if one can find an n–vector a such that

aᵀX = λᵀ.

If there exists such a vector a, the contrast λᵀβ is said to be estimable. In this case the contrast can be written

λᵀβ = aᵀXβ = aᵀµ.
After having estimated µ, the contrast λᵀβ is estimated by

λᵀβ̂ = aᵀXβ̂ = aᵀµ̂.

Recall from the section on estimation that there may in general be many least squares estimates of β. However, the following holds:

Result 2. The least squares estimate of λᵀβ is unique if and only if λᵀβ is estimable.

In other words: the only thing one can say something about in an unambiguous way is estimable functions.
From the general result

$$ \lambda^\top\hat\beta \sim N\big(\lambda^\top\beta,\; \sigma^2\lambda^\top(X^\top X)^{-1}\lambda\big) \qquad (7) $$

we know the distribution of the contrast λᵀβ̂, and hence testing for the contrast being zero is straightforward.

Note that transposing aᵀX = λᵀ gives Xᵀa = λ. Hence the condition for estimability is that λ can be written as a linear combination of the columns of Xᵀ, i.e. as a linear combination of the rows of X.

This amounts to solving a set of linear equations – and computers can do that!
Example 17. (Continuation of Example 16)
We wish to verify that

$$ \delta + \alpha_1 + \tfrac{1}{2}(\beta_1 + \beta_2) = (1, 1, 0, \tfrac{1}{2}, \tfrac{1}{2})\beta $$

is indeed estimable.

That is, we seek a vector a = (a1, a2, a3, a4)ᵀ such that

$$ a^\top X = (1, 1, 0, \tfrac{1}{2}, \tfrac{1}{2}). $$
Direct multiplication gives

a1 + a2 + a3 + a4 = 1
a1 + a2 = 1
a3 + a4 = 0
a1 + a3 = 1/2
a2 + a4 = 1/2

It is not hard to spot that a solution to these equations is a1 = a2 = 1/2 and a3 = a4 = 0.
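The check can also be left to the computer. Here is a minimal PROC IML sketch (our illustration, not part of the original example) verifying that aᵀX reproduces the desired λ:

proc iml;
  /* Design matrix X of Example 16 */
  X = {1 1 0 1 0,
       1 1 0 0 1,
       1 0 1 1 0,
       1 0 1 0 1};
  a = {0.5, 0.5, 0, 0};   /* the candidate vector found above */
  lambda = t(a) * X;      /* should be (1, 1, 0, 0.5, 0.5)    */
  print lambda;
quit;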
fin
Estimability in SAS
In checking whether a specific contrast is estimable, it is
recommended to use PROC GLM.
The following SAS program deals with data from Example 16
proc glm data=a;
class i j;
model y = i j/E;
lsmeans i j /E;
run;
The output caused by the E–option in the MODEL statement is
General Form of Estimable Functions
Effect Coefficients
1 Intercept L1
2 i 1 L2
3 i 2 L1-L2
4 j 1 L4
5 j 2 L1-L4
Recall that β = (δ, α1, α2, β1, β2). The numbers 1,2,3,4,5 identify
the entry of the λ–vector, λ = (λ1, λ2, . . . , λ5), and the Ls specify
the constraints to be satisfied by the λis.
It reads as follows: λ1 can be set to any value L1, and λ2 can be set
to any value L2. But then λ3 is constrained to be equal to L1− L2.
Likewise, λ4 can be set to any value L4, but then λ5 is constrained to be equal to L1 − L4.
From this we see how to specify some contrasts:

λ = (1, 1, 0, 1, 0):       λᵀβ = δ + α1 + β1
λ = (1, 1, 0, 1/2, 1/2):   λᵀβ = δ + α1 + (1/2)(β1 + β2)
λ = (0, 1, −1, 0, 0):      λᵀβ = α1 − α2

But we can also see that the contrast δ + (1/2)(α1 + α2) is not estimable: taking λ1 = 1 and λ2 = λ3 = 1/2 would give the desired result, but setting λ4 = 0 implies that λ5 = 1, so it is not possible.
The contrasts specified above are constructed as follows in PROC GLM (and in PROC MIXED). Note that we have indicated two ways of constructing the last contrast.
title ’Estimation of contrasts’;
proc glm data=a;
class i j;
model y = i j /E;
estimate ’Lambda 1’ intercept 1 i 1 0 j 1 0 / E;
estimate ’Lambda 2’ intercept 1 i 1 0 j .5 .5 / E;
estimate ’Lambda 3’ intercept 0 i 1 -1 j 0 0 / E;
estimate ’Lambda 3’ intercept 0 i 1 -1 / E;
run; quit;
Least Squares Means
The LSMEANS statement in GLM is an attempt to generate meaningful
estimates automatically, sometimes (but not always) with success.
These are denoted least squares means and can be constructed as
title ’Least squares means’;
proc glm data=a;
class i j;
model y = i j ;
lsmeans i j / E stderr;
run; quit;
The output caused by the E–option in the LSMEANS statement is
Least Squares Means
Coefficients for i Least Square Means i Level
Effect 1 2
1 Intercept 1 1
2 i 1 1 0
3 i 2 0 1
4 j 1 0.5 0.5
5 j 2 0.5 0.5
Coefficients for j Least Square Means j Level
Effect 1 2
1 Intercept 1 1
2 i 1 0.5 0.5
3 i 2 0.5 0.5
4 j 1 1 0
5 j 2 0 1
The interpretation of the columns to the right is exactly as before: the vector λ = (1, 1, 0, 0.5, 0.5)ᵀ gives

λᵀβ = δ + α1 + (1/2)(β1 + β2).

From this we see that the LSMEAN for i = 1 is δ + α1 plus the "average effect" of the factor j, i.e. (1/2)(β1 + β2).
Hypothesis Testing

Example 18. The two–way analysis of variance model

Yij = δ + αi + βj + εij,   i = 1, 2, j = 1, 2

will in the following be referred to as the large model.

Data is assumed to be in accordance with the large model.

Suppose we are interested in testing whether βj = 0.
The mean µij of Yij is δ + αi + βj, and the mean vector has the form

$$ \mu = \begin{bmatrix} \mu_{11}\\ \mu_{12}\\ \mu_{21}\\ \mu_{22} \end{bmatrix}
= \begin{bmatrix}
1 & 1 & 0 & 1 & 0\\
1 & 1 & 0 & 0 & 1\\
1 & 0 & 1 & 1 & 0\\
1 & 0 & 1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \delta\\ \alpha_1\\ \alpha_2\\ \beta_1\\ \beta_2 \end{bmatrix} = X\beta $$

Testing βj = 0 corresponds to testing whether the reduced model

Yij = δ + αi + εij

is in accordance with data.
Under the reduced model, the mean µij of Yij is δ + αi, and the mean vector has the form

$$ \mu = \begin{bmatrix} \mu_{11}\\ \mu_{12}\\ \mu_{21}\\ \mu_{22} \end{bmatrix}
= \begin{bmatrix}
1 & 1 & 0\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \delta\\ \alpha_1\\ \alpha_2 \end{bmatrix} = X_0\beta_0 $$

Hence testing the hypothesis βj = 0 corresponds to testing whether µ = X0β0 when we "know" that µ = Xβ. fin
Note that any vector µ that can be written as µ = X0β0 can also be
written as µ = Xβ – simply by setting the last two elements of β to
zero.
More generally, any vector in span(X0) is also in span(X), but not
vice versa.
(Recall that span(X0) is the set of vectors that can be written as a
linear combination of the columns of X0.)
Let P and P0 be the projection matrices corresponding to X and
X0. The least squares estimates of µ under the two models are

µ̂ = Py under the large model
µ̂0 = P0y under the reduced model
How do we judge whether the reduced model is feasible? The answer lies in the "distance" between the observations and the expected values.

The vector of residuals

e = y − µ̂ = y − Py = (I − P)y
reflects random deviations from the mean under the large model (in which we "believe"). Therefore the length of e (and hence the squared length eᵀe) is expected to be "small" in some sense.
If the reduced model is true, then e0 = (I − P0)y is also a vector of residuals, and its length should also be small. On the other hand, if the reduced model is not true, then e0 is not just residuals, because it contains some of the variation due to the factor βj. In this case the length of the residual vector is expected to be large.

Consider the difference between the residual vectors

D = e0 − e = (y − P0y) − (y − Py) = Py − P0y = (P − P0)y

If the reduced model is true, then D is just a difference between residuals, and the length of D is expected to be small.
If we let d and d0 denote the number of linearly independent columns in X and X0 respectively, one can show the following:

Result 3.

$$ E\Big(\frac{D^\top D}{d - d_0}\Big) = \frac{1}{d - d_0}\,E(D^\top D) = \sigma^2 + k $$

or equivalently

$$ E(D^\top D) = (d - d_0)(\sigma^2 + k) = (d - d_0)\sigma^2 + (d - d_0)k, $$

where k ≥ 0 and k = 0 when the reduced model is true.
If σ² had been known, the result above would be very useful: if DᵀD is "much larger" than (d − d0)σ², this would indicate that k > 0, which in turn causes us to doubt the feasibility of the reduced model.
There are two problems in this connection:

1. σ² is not known, and
2. what does "much larger" mean?

Yet, in Linear Normal Models there is a simple solution to these two problems, now to be outlined:
Problem 1: σ² is not known.

Under the large model, the variance estimate is

σ̂² = eᵀe/(n − d),

i.e. the residual sum of squares divided by the residual degrees of freedom. It is well known that E(σ̂²) = σ², so it is reasonable to assume that σ̂² ≈ σ². Therefore, if the reduced model is true (and hence k = 0), the ratio

$$ F = \frac{D^\top D/(d - d_0)}{e^\top e/(n - d)} \approx 1. $$
That takes, to some extent, "care of" the problem that σ² is unknown.

Problem 2: what does "much larger" mean?

If the reduced model is not true, then the ratio F will tend to be larger than 1. The remaining problem is to define what is meant by "large". One can show the following:

Result 4. If the reduced model is true, then F has an F(d−d0, n−d)–distribution.

Here d − d0 is the number of parameters removed from the model (i.e. the additional residual degrees of freedom gained by going from the large to the reduced model), and n − d is the residual degrees of
freedom under the large model.
If the reduced model is not true, then F has an expected value larger than 1. Therefore, if F is larger than a pre–specified quantile of the F(d−d0, n−d)–distribution, one doubts the feasibility of the model reduction, i.e. rejects the hypothesis.
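The required quantile is easily obtained in SAS with the FINV function. A minimal sketch (our illustration; the degrees of freedom 1 and 1 are those of Example 19 below):

data _null_;
  /* 95% quantile of the F-distribution with 1 and 1 degrees of freedom */
  fquant = finv(0.95, 1, 1);
  put fquant=;
run;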
Calculating things in Practice
Consider again the difference between the residual vectors

D = e0 − e = (P − P0)y.

There is an easy way to calculate DᵀD in practice:

Result 5.

DᵀD = e0ᵀe0 − eᵀe = RSS0 − RSS

where RSS and RSS0 denote the residual (or error) sums of squares under the large and the reduced model, respectively.
Tests in LNMs in short form
• Consider a LNM Y ∼ Nn(µ, σ²I). Hence Y = µ + e, where e ∼ Nn(0, σ²I).

• Consider two models for the mean value,

M: µ ∈ L = C(X) and M0: µ ∈ L0 = C(X0), where L0 ⊂ L,

where M is assumed to hold true, and let M and M0 denote the corresponding projection matrices, of dimension d and d0.

• Under M, MY = Mµ + Me = µ + Me.
• If M0 is true, then

(M − M0)Y = Mµ + Me − M0µ − M0e = (M − M0)e

is only "random noise". In this case (M − M0)Y is expected to be small.

• Clearly, M − M0 is the projection onto $L \cap L_0^{\perp}$.

• Hence

$$ \frac{\|(M - M_0)Y\|^2}{d - d_0} = \frac{Y^\top(M - M_0)Y}{r(M - M_0)} $$

is a measure of how close M0Y is to MY, in relation to the difference in dimensionality of the models.
• We use the results that

$$ E(Y^\top A Y) = \mathrm{tr}(A\,\mathrm{Var}(Y)) + E(Y)^\top A\,E(Y), \qquad \mathrm{tr}(M) = d, \quad \mathrm{tr}(M - M_0) = d - d_0 $$

• Assuming only M,

$$ E\Big(\frac{Y^\top(M - M_0)Y}{r(M - M_0)}\Big)
= \frac{\sigma^2}{d - d_0}\,\mathrm{tr}(M - M_0) + \frac{\beta^\top X^\top(M - M_0)X\beta}{d - d_0}
= \sigma^2 + \frac{\beta^\top X^\top(M - M_0)X\beta}{d - d_0}
= \sigma^2 + \|v\|^2 $$

where $\|v\|^2 = \beta^\top X^\top(M - M_0)X\beta/(d - d_0)$.

• If M0 is true, then ‖v‖² = 0.
• If we use MSE = Yᵀ(I − M)Y/(n − d) = σ̂² as an estimate of σ², then under M0,

$$ F = \frac{Y^\top(M - M_0)Y/(d - d_0)}{Y^\top(I - M)Y/(n - d)} \approx 1 $$

• Numerator and denominator are independent, since

$$ \begin{pmatrix} I - M\\ M - M_0 \end{pmatrix} Y
\sim N\left( \begin{pmatrix} I - M\\ M - M_0 \end{pmatrix}\mu;\;
\sigma^2 \begin{pmatrix} I - M & 0\\ 0 & M - M_0 \end{pmatrix} \right) $$

• Under M,

$$ \frac{1}{\sigma^2}\, Y^\top(M - M_0)Y \sim \chi^2\big(d - d_0,\; \beta^\top X^\top(M - M_0)X\beta\big), $$

i.e. a non–central χ² distribution; under M0 the non–centrality parameter is zero.
• Hence large values of F cause doubt about M0.
Hypothesis Testing in SAS
In practice SAS performs all relevant calculations (and,
unfortunately, a few more).
Degrees of freedom: A comment regarding the degrees of
freedom reported by SAS is appropriate:
Default in SAS is that all observations are centered around their
average.
This centering “costs” one degree of freedom and therefore SAS
reports the Corrected Total which is n− 1, where n is the
number of observations.
In the large model in Example 18 there are three parameters, (δ, α1, β1).

Because of the centering of the data, SAS does not regard δ as a parameter when it comes to reporting degrees of freedom. So the real number of parameters is the number SAS reports plus 1; hence d = 2 + 1 while d0 = 1 + 1.

(Note: if the NOINT option is specified, the model degrees of freedom become correct.)

In practice it is not a problem whether data are centered or not, because we are mainly interested in differences between the numbers of parameters, i.e. differences in degrees of freedom.
Example 19. (Continuation of Example 18) Below we find the output from fitting the large and the reduced model in PROC GLM.
Dependent Variable: y Large model
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 2 3.76999467 1.88499734 2.70 0.3954
Error 1 0.69877998 0.69877998
Corrected Total 3 4.46877465
Source DF Type III SS Mean Square F Value Pr > F
i 1 0.73276693 0.73276693 1.05 0.4924
j 1 3.03722775 3.03722775 4.35 0.2847
Dependent Variable: y Reduced model
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 0.73276693 0.73276693 0.39 0.5951
Error 2 3.73600773 1.86800386
Corrected Total 3 4.46877465
In the notation from before,

DᵀD = RSS0 − RSS = 3.73600773 − 0.69877998 = 3.037
eᵀe = RSS = 0.699
d − d0 = 3 − 2 = 2 − 1 = 1
n − d = 4 − 3 = 3 − 2 = 1

The F–statistic therefore becomes

F = (3.037/1)/(0.699/1) = 4.35

This is the statistic reported in the Type III SS section of the output. So in most (but not all) cases, SAS does the work for us. fin
Example 20. The two–way analysis of variance with interactions
Yijk = δ + αi + βj + γij + εijk, i = 1, 2; j = 1, 2; k = 1, 2, 3
has mean
$$ \mu = \begin{bmatrix} \mu_{11}\\ \mu_{12}\\ \mu_{21}\\ \mu_{22} \end{bmatrix}
= \begin{bmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0\\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0\\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \delta\\ \alpha_1\\ \alpha_2\\ \beta_1\\ \beta_2\\ \gamma_{11}\\ \gamma_{12}\\ \gamma_{21}\\ \gamma_{22} \end{bmatrix} = X\beta $$
Here we regard µij, 1 and 0 as vectors of length 3, such that µ contains 12 elements.

In this form the model is overparametrized, so SAS works with an equivalent representation, namely

$$ \mu = \begin{bmatrix}
1 & 1 & 1 & 1\\
1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} \delta\\ \alpha_1\\ \beta_1\\ \gamma_{11} \end{bmatrix} = X_2\beta_2 \qquad (8) $$
fin
5 Some Basic Statistical Concepts
This lecture presented/refreshed basic statistical concepts, such as the central limit theorem, principles of estimation, the likelihood principle and tests of hypotheses.
Link to the full screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/StatTheory.f.pdf
Data and Models
The starting point for a statistical analysis is a set of observations

y = (y1, . . . , yn)

resulting from an experiment (or perhaps an observational study) conducted in order to gain insight into a specific area.
We shall in general use the term experiment even though the setting
may not be that of a controlled experiment.
Some Characteristics:
A fundamental characteristic of the experiment is that the outcome
is stochastic rather than deterministic.
Hence, if the experiment is repeated again under similar conditions
the new result would not necessarily be y.
Because of the random/stochastic variation in data, it is natural to consider models based on probability theory, because this is the branch of mathematics dealing with random variation. In this setting, the starting point is the set of possible outcomes

Y = (Y1, . . . , Yn)

of the experiment.
Here Yi could be for example
• the set of all real numbers,
• the set of positive real numbers,
• the set {diseased, not diseased}, or
• the set {low, medium, high}.
The link between the observed value yi and the set of possible values
Yi is established through the notion of a random variable Yi.
A random variable Yi is a function whose values can be in the set
Yi, and the observed value yi is said to be a realization of the
random variable Yi.
The random variable Yi is a function, but not a deterministic
function such as e.g.
f(x) = x2 + 7.
It is a random function whose outcome on one hand is uncertain but
on the other hand typically governed by some rules. Those rules are
best formulated in terms of a probability distribution.
Example 1. : Binomial Experiment Any animal can be infected with a specific disease, i.e. it can be diseased or not–diseased.

For the ith animal in the population the state of disease is denoted by Yi, and Yi can therefore take one of the values {diseased, not diseased} (for brevity written simply as {1, 0}).
fin
Example 2. : Binomial Experiment If the set of possible outcomes of Yi is {diseased, not diseased} (for brevity written simply as {1, 0}), the random variable Yi can be either 1 or 0. A statistical model for Yi is obtained by specifying the probability distribution of Yi, for instance

$$ P(Y_i = y) = \theta^y (1 - \theta)^{1-y} $$

where 0 ≤ θ ≤ 1. fin

Example 3. : Samples from the normal distribution If Yi has a normal distribution, e.g. Yi ∼ N(θ, 1), the set of possible outcomes Yi is the real line. fin
In both examples, the function Yi is specified through a
probability distribution.
The distribution depends on an (unknown) parameter θ. (In the
examples, θ is a single number but more generally the parameter is a
vector θ = (θ1, . . . , θp).)
In statistical terms, one speaks of a parametrical statistical model:
1. It is a statistical model, because the outcome of Yi is described in
terms of a probability distribution.
2. It is a parametrical model because once the parameter θ is known
the distribution is known.
Why the Normal Distribution is so “Normal”
The most frequently employed distribution is the normal distribution.
Many (but certainly not all) random phenomena encountered in
practice exhibit a certain regularity:
1. Observations have a tendency to be clustered around a “mean
value”.
2. Deviations from the “mean value” are often symmetric.
3. The histogram of observations can be well approximated with the
bell–shaped normal (or Gaussian) distribution
[Figure: histogram of z.mean (relative frequency) with an overlaid bell–shaped curve]
The bell–shaped curve is written

$$ f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{1}{2\sigma^2}(y - \mu)^2\Big) $$

Why does this bell–shaped curve fit quite well to many phenomena encountered in practice?
The Central Limit Theorem
Part of the answer is given by the Central Limit Theorem:

Let Z1, . . . , Zn be independent random variables with E(Zi) = µi and Var(Zi) = σi².

Let $Y = \sum_{i=1}^{n} Z_i$. Then $E(Y) = \mu = \sum_i \mu_i$ and $Var(Y) = \sigma^2 = \sum_i \sigma_i^2$.

What about the distribution of Y?
Result 1. The Central Limit Theorem says that

Y ∼approx N(µ, σ²).

The approximation becomes better as n → ∞.

(Note: we have not made any assumption about the distribution of the Zis; it has only been assumed that they are independent.)

Many things encountered in nature can be regarded as the sum of many small (independent) contributions. That is one explanation of why the normal distribution is so "normal".
Example 4. Let Zi be uniformly distributed on [0, 1], i.e. all values in the [0, 1]–interval are "equally likely", for i = 1, . . . , 4.

How does the distribution of $\bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i$ look?

Quite normal, actually!
[Figure: histograms of z1, z2 and z.mean, and a normal Q–Q plot of z.mean]
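A small simulation makes the point concrete. The following SAS sketch is our illustration (not part of the original example); it simulates 1000 means of four uniform variables and plots a histogram with a fitted normal curve:

data sim;
  do rep = 1 to 1000;
    z = 0;
    do i = 1 to 4;
      z = z + ranuni(12345);  /* uniform variate on (0,1) */
    end;
    zmean = z / 4;            /* mean of four uniforms    */
    output;
  end;
  keep zmean;
run;

proc univariate data=sim noprint;
  var zmean;
  histogram zmean / normal;   /* overlay fitted normal curve */
run;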
fin
Some General Principles of Estimation
After establishing a statistical model a problem is to estimate the
value of the parameter θ. To find this estimate we need to make
some assumptions.
In what follows, a very fundamental assumption will be made:
There exists a true (but unknown) value of θ.
If θ had been known, then the distribution of Yi would be known
too. That is we would know the characteristics of the mechanism
which generated the data y.
A consequence of this is that the important task is to obtain a good
estimate of θ. Some examples of doing so are given in the following.
Example 5. (Continuation of Example 2) Consider the experiment of tossing a "pin" n times, giving data y = (y1, . . . , yn). Hence the possible outcomes are Yi = {up, down}, which we write {1, 0}.

It is assumed that

P(Yi = 1) = θ

for all i, such that the probability of observing "pin up" (!) is the same every time. If we observe that the pin points upwards altogether y+ = Σi yi times, then it takes only very little creativity to suggest that the relative frequency

y+/n

is a sensible estimate of θ. fin
Example 6. : Linear regression Consider the case where a known number xi is associated with each outcome yi of the experiment, and where it is suspected that there might be an approximately linear relationship between xi and yi.

This can lead to the linear regression model

Yi ∼ N(θi, σ²) where θi = θ0 + θ1xi

This model is fundamentally different from the model in Example 2: in Example 2, each observation was assumed to have the same distribution. In the present model this is not the case, as the mean of each random variable Yi is allowed to depend on the value of xi.
It is well known from any standard textbook on statistics that the parameters θ = (θ0, θ1) can be estimated by minimizing the squared distance between the observed and the expected values, i.e. by minimizing the function

$$ D(\theta_0, \theta_1) = \sum_i \big(y_i - (\theta_0 + \theta_1 x_i)\big)^2 $$

fin
Example 7. (Continuation of Example 3) Suppose we conduct an experiment where each observation yi is a realization of Yi ∼ N(θ, 1). Then it takes very little imagination to suggest that the average

$$ z_1 = \frac{1}{n}\sum_{i=1}^{n} y_i $$

is a sensible estimate of θ. fin
In the examples above it is easy to suggest ways of estimating the
unknown parameters. These can be described as:
Example 5: Estimation by the relative frequency.
Example 6: Estimation by minimizing the squared distance.
Example 7: Estimation by the average.
However, it is clear that there is a need for:
• General principles for obtaining those estimates.
• Some notion for how “good” an estimate is.
In the following we present and discuss some of these principles briefly.

The exposition is not intended to be comprehensive or very precise.
The aim is solely to illustrate some of the considerations made in
connection with estimation of unknown parameters on the basis of
data.
Eventually the exposition leads to the method of maximum
likelihood.
Method of Moments

One approach is to base the estimation on the moments, i.e. the expectation, variance etc. of random variables.

Recall that the first moment of a random variable X is E(X), and the second central moment of X is E(X − E(X))² = Var(X).

For Example 3 with Yi ∼ N(θ, 1) we define a new random variable, say Z1, as the average of the Yis. Then it is well known that

$$ Z_1 = \frac{1}{n}\sum_{i=1}^{n} Y_i \sim N(\theta, 1/n) $$
The estimate $z_1 = \frac{1}{n}\sum_{i=1}^{n} y_i$ can then be regarded as a realization of the random variable Z1, which has mean E(Z1) = θ.

It is important to keep in mind that Z1 is a function of Y1, . . . , Yn, which can be emphasized by writing Z1(Y). Likewise, z1 is a function of the observed data, which is emphasized by writing z1(y).

We say that

• the random variable Z1(Y) is an estimator, and
• a specific value z1(y) is an estimate.
The method of moments is to consider z1(y) as a good estimate of θ because the corresponding random variable Z1(Y) has θ as its expectation:

E(Z1(Y)) = θ (1)
How good is an estimator?

An estimator with the property (1) is said to be unbiased. Unbiasedness seems to be a desirable property of an estimator. However, there are many estimators with the property (1). Two additional ones are

• the average Z2(Y) = (Y1 + Y2)/2 of the first two random variables, and
• Z3(Y) = Y1, i.e. the first random variable itself.
Yet, intuition indicates that z1 is a "better" estimate of θ than z2 = (y1 + y2)/2, which in turn is "better" than z3 = y1. To be precise about what is meant by "better", we consider the variances of the estimators:

Var(Z1(Y)) = 1/n
Var(Z2(Y)) = 1/2
Var(Z3(Y)) = 1
Hence (with more than 2 observations) we have

Var(Z1) < Var(Z2) < Var(Z3),

and on the basis of this it is clear that we will consider Z1 to be a better estimator of θ than Z2 or Z3.

Note: because estimates are realizations of random variables (their corresponding estimators), it is "a must" always to report a variance, a standard deviation or a related quantity whenever reporting the value of an estimate.
Someone might suggest to estimate θ by Z4(Y) = Z1(Y) + 7.

In terms of considering estimators with small variance as being "good", one can argue that Z4 is just as good as Z1, because Var(Z4) = Var(Z1).

However, E(Z4) = θ + 7 ≠ θ, so Z4 is not an unbiased estimator of θ.

These considerations suggest that good estimators should be unbiased and have as small a variance as possible.
These two criteria lead to the theory of Minimum Variance Unbiased Estimation, sometimes written briefly as MVUE. It is not surprising that Z1 is a MVUE (Minimum Variance Unbiased Estimator).

In general, establishing MVUEs can be a complicated task: finding estimators that are unbiased may not be too hard, but finding one with the smallest possible variance may be very complicated.
Consistency of Estimators
The estimator Z1 has other nice properties compared with Z2, Z3
and Z4.
When the number of observations n tends to infinity, the variance of Z1 tends to 0. The practical implication of this is straightforward: Z1 becomes indistinguishable from its expectation θ. An estimator with this property is said to be consistent.

Consistency is an attractive feature of an estimator, because it means that the estimate of θ gets better and better the more data we collect.

It is clear that none of Z2, Z3 and Z4 are consistent.
Desirable Properties of Estimators
From the discussion above we have found that
• Unbiasedness,
• Smallest possible variance, and
• Consistency
are three attractive properties of estimators.
Estimators, whatever kind they are, are functions of the random
variables Y1, . . . , Yn from which data y1, . . . , yn are realizations.
Hence estimators are random variables and as such they have a
distribution. This distribution is needed when drawing inference
about a parameter, e.g. when making a test or constructing a
confidence interval.
Therefore a fourth desirable property of an estimator is that

• the distribution of the estimator is known.
The Method of Maximum Likelihood

There is a general estimation method called maximum likelihood estimation, to be discussed in the following.

An estimator obtained from this method does not in general have the attractive properties mentioned above – but almost: when the sample size goes to infinity (in a sufficiently well behaved way), the properties hold.

We say that the estimator is asymptotically unbiased, asymptotically has the smallest possible variance, is consistent, and finally, the distribution of the estimator is asymptotically normal.
These four properties of maximum likelihood estimators indicate why this is such a powerful method.

Moreover, it turns out that the estimation can be carried out by maximizing a particular function, called the likelihood function. Maximization of such a function can in practice be complicated, but is in principle not much different from what we all learned in high school: calculate the derivative, set it to zero and solve!
Example 8. : Binomial Experiment

Consider n throws with a pin where θ = Pr("falls with pin up"). Hence the outcome of the ith toss can be {up, down}, written briefly as {1, 0}, and

$$ p(y_i; \theta) = P(Y_i = y_i; \theta) = \theta^{y_i}(1 - \theta)^{1 - y_i} $$
Suppose the observed data are y = {1, 1, 0, 1, 0, 1, 0, . . . , 0, 0}.

If the outcomes of the tosses are independent, then the probability of observing y is

$$ \begin{aligned}
p(y; \theta) &= p(y_1;\theta)\,p(y_2;\theta)\cdots p(y_n;\theta)\\
&= p(1)p(1)p(0)p(1)p(0)p(1)p(0)\cdots p(0)p(0)\\
&= \theta\theta(1-\theta)\theta(1-\theta)\theta(1-\theta)\cdots(1-\theta)(1-\theta)\\
&= \theta^{y_+}(1-\theta)^{n-y_+}
\end{aligned} \qquad (2) $$

where n is the number of times the pin is thrown and y+ = Σi yi is the number of times the pin points up.

fin
The Likelihood function

When data y is observed, p(y; θ) can be regarded as a function of θ. This function is called the likelihood function and is denoted by L(θ).

Hence, in the example,

$$ L(\theta) = \theta^{y_+}(1 - \theta)^{n - y_+}. $$

To be specific, let the pin be thrown n = 25 times, and suppose that pin up is observed y+ = 10 times. Then we have

$$ L(\theta; y) = \theta^{10}(1 - \theta)^{25-10} $$
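The likelihood is easy to tabulate and plot yourself. The following SAS sketch is our addition (any plotting procedure will do; PROC GPLOT requires SAS/GRAPH):

data lik;
  n = 25; yplus = 10;
  do theta = 0.01 to 0.99 by 0.01;
    L    = theta**yplus * (1 - theta)**(n - yplus);        /* likelihood     */
    logL = yplus*log(theta) + (n - yplus)*log(1 - theta);  /* log-likelihood */
    output;
  end;
run;

symbol i=join;
proc gplot data=lik;
  plot L*theta logL*theta;
run; quit;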
Figure 1 shows a plot of L(θ) against θ for n = 25 and y+ = 10.
Figure 1: Likelihood function for n = 25 and y+ = 10.
The Maximum likelihood principle
The principle in maximum likelihood estimation is that
the estimate of θ is the value of θ which maximizes the likelihood
function.
One can think of θ̂ as the value of θ which maximizes the probability of observing the data which one actually has observed.
• This value is called the maximum likelihood estimate (MLE) and is often denoted by θ̂.

• The corresponding estimator is called the maximum likelihood estimator.

For clarity one should write θ̂(y) for the estimate and θ̂(Y) for the corresponding estimator, but this is too cumbersome. So, except for special cases, we simply write θ̂ for both entities and infer from the context whether it is an estimate (a number) or an estimator (the corresponding random variable).

Figure 1 suggests that 0.4 is the maximum likelihood estimate.
It is often easier to maximize the log–likelihood function, denoted by l(θ):

$$ l(\theta) = \log L(\theta) = y_+ \log\theta + (n - y_+)\log(1 - \theta) $$

Since log is a monotone function, the value of θ maximizing l(θ) will also maximize L(θ).
Figure 2 shows a plot of l(θ) against θ for n = 25 and y+ = 10.
Figure 2: Log–Likelihood function for n = 25 and y+ = 10.
Maximization of

$$ l(\theta) = y_+ \log\theta + (n - y_+)\log(1 - \theta) $$

is obtained by solving the equation

S(θ) = l′(θ) = 0,

where l′(θ) denotes the derivative of l(θ).

• The function S(θ) is called the score function.
• The equation S(θ) = 0 is called the likelihood equation.

We find that

$$ S(\theta) = l'(\theta) = \frac{y_+}{\theta} - \frac{n - y_+}{1 - \theta} = 0 $$
which happens if and only if

$$ \hat\theta = \frac{y_+}{n} $$

Hence the maximum likelihood estimate is just the relative frequency. The corresponding maximum likelihood estimator is

$$ \hat\theta(Y_+) = \frac{Y_+}{n}. $$

Hence, when y+ (= 10) is observed, the observed value of the maximum likelihood estimator (i.e. the maximum likelihood estimate) becomes θ̂(y+) = θ̂(10) = 0.4, in accordance with Figure 1 and Figure 2.
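In SAS the same estimate can be obtained by direct numerical maximization of the log–likelihood. A minimal sketch with PROC NLMIXED (our illustration; it assumes a data set pin with one 0/1 variable y per toss):

proc nlmixed data=pin;
  parms theta = 0.5;                          /* starting value           */
  logl = y*log(theta) + (1-y)*log(1-theta);   /* Bernoulli log-likelihood */
  model y ~ general(logl);
run;

With 10 ones among 25 observations the reported estimate should be 0.4, with an approximate standard error based on the Hessian (cf. the asymptotics below).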
How Good is the Estimate?

When y+ = 10 and n = 25 we have θ̂ = 0.4, but the same value is found if y+ = 2 and n = 5.

However, intuition suggests that with 25 observations we should have more confidence that θ̂ is a good estimate than with only 5 observations. That is, we would expect the variance of the estimator to be smaller with 25 observations than with only 5.

It is well known for binomial experiments that Var(Y+) = nθ(1 − θ), and hence that Var(θ̂) = θ(1 − θ)/n, which indeed confirms the intuition.
Figure 3 shows the likelihood function for (n = 5, y+ = 2), (n = 10, y+ = 4), (n = 25, y+ = 10) and (n = 50, y+ = 20).
Figure 3: Likelihood function for (n = 5, y+ = 2), (n = 10, y+ = 4),
(n = 25, y+ = 10) and (n = 50, y+ = 20).
It is clear from those graphs that the more observations, the more "peaked" the likelihood function is, and the higher its curvature at its maximum.

That is, the value of L(θ̂) becomes more and more distinct from the value of L(θ) for θ ≠ θ̂ as more and more observations are made.

It is therefore not surprising that there is a connection (indeed it turns out to be a close connection) between the variance of the maximum likelihood estimator and the curvature of the likelihood function at its maximum. This connection is presented in the next sections.
The Asymptotic Normal Distribution of the MLE
In this section we present a very important result:
The maximum likelihood estimator is asymptotically normally
distributed.
This property of the MLE is central to much practical statistical
inference.
Example 9. Frequently one is interested in making statements about θ on the basis of the experiment. For example, one might be interested in whether one can reasonably assume that the true value of θ is 0.5.

The key to answering this question is the random variable θ̂(Y). Put in a popular way, one has to investigate whether 0.5 is a "likely" outcome of θ̂(Y). To answer that question, one needs to know the distribution of θ̂(Y) – and this distribution is in general very complicated to find. fin
Therefore one frequently resorts to an approximate result, on which so much resides in statistics:

When n → ∞ and certain conditions are satisfied, it holds approximately that

$$ \hat\theta \sim N\Big(\theta,\; -\frac{1}{l''(\hat\theta)}\Big) $$

That is, the distribution of θ̂(Y) will asymptotically be like a normal distribution with the true (but unknown) parameter θ as expectation and variance −1/l″(θ̂).
Example 10. For the binomial experiment it is not hard to see why the MLE is asymptotically normal:

We can regard y+ as a sum of independent random variables yi, where yi = 1 corresponds to "pin up" and yi = 0 to "pin not up". Hence the Central Limit Theorem gives that y+ is approximately normally distributed, and hence so is θ̂ = y+/n.

For a single toss we know that E(yi) = θ and Var(yi) = θ(1 − θ). From this we find that

$$ E(\hat\theta) = \theta, \qquad Var(\hat\theta) = \frac{\theta(1-\theta)}{n} $$

so approximately,

$$ \hat\theta \sim N\Big(\theta,\; \frac{\theta(1-\theta)}{n}\Big) $$

fin
Example 11. In general the answer is not so straightforward. We therefore outline the "standard" calculations which one goes through in this connection.

The expression for the variance is obtained as follows. Recall that the log–likelihood and score functions are given by

$$ l(\theta) = y_+\log\theta + (n - y_+)\log(1 - \theta), \qquad
S(\theta) = l'(\theta) = \frac{y_+}{\theta} - \frac{n - y_+}{1 - \theta} $$

Differentiating the score function and changing sign gives

$$ -l''(\theta) = \frac{y_+}{\theta^2} + \frac{n - y_+}{(1 - \theta)^2} $$
In practice θ is unknown. However, it can be justified to plug the estimate θ̂ = y+/n into l″(θ), and this gives −l″(θ̂) = n/(θ̂(1 − θ̂)).

Hence, asymptotically,

$$ \hat\theta \sim N\Big(\theta,\; \frac{\theta(1-\theta)}{n}\Big) $$
With n = 25 and y+ = 10 we get θ̂ = 0.4 and V̂ar(θ̂) ≈ 0.0096. Hence an (approximate) 95% confidence interval for θ is

$$ \big(\hat\theta - 1.96\sqrt{\widehat{Var}(\hat\theta)}\;;\; \hat\theta + 1.96\sqrt{\widehat{Var}(\hat\theta)}\big)
= (0.4 - 0.19\;;\; 0.4 + 0.19) = (0.21;\, 0.59) $$

fin
Asymptotic normality of transformations of the MLE

If h is a function of θ, then the distribution of h(θ̂) will, asymptotically, look like a normal distribution with mean h(θ) and a variance which can be estimated by −h′(θ̂)²/l″(θ̂), i.e. asymptotically

$$ h(\hat\theta) \sim N\Big(h(\theta),\; -\frac{h'(\hat\theta)^2}{l''(\hat\theta)}\Big) $$
Example 12. For example, if we are more comfortable with interpreting the odds η = h(θ) = θ/(1 − θ), we find h′(θ) = 1/(1 − θ)². Hence, asymptotically,

$$ \hat\eta \sim N\Big(\frac{\theta}{1-\theta},\; \frac{\theta}{n(1-\theta)^3}\Big) $$

which with θ̂ = 0.4 and n = 25 gives an estimated variance of about 0.074. fin
Tests of Hypotheses

The final point to touch upon concerns tests of hypotheses regarding θ. Suppose interest is in testing whether θ is equal to a specific fixed value θ0.

The likelihood ratio test

The maximum likelihood estimate θ̂ is the value of θ which gives the observed data the highest probability, namely L(θ̂). If the value θ0 assigns nearly the same probability L(θ0) as θ̂ does, we would be tempted to accept the hypothesis that θ = θ0.
In other words, it is tempting to consider the likelihood ratio test statistic Q defined by

$$ Q = \frac{L(\theta_0)}{L(\hat\theta)} $$

Clearly, Q is a number between 0 and 1, and values close to 1 are in favor of the hypothesis.

It can be shown that if the hypothesis is true, then

$$ -2\log Q = 2\big(l(\hat\theta) - l(\theta_0)\big) $$

has (when n is large) approximately a χ² distribution with 1 degree of freedom. Large values of −2 log Q lead to rejection of the hypothesis. In Figure 4 it can be seen that −2 log Q is twice the vertical distance between the value of l at θ̂ and at θ0.
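As a numerical illustration (our computation, not in the original text): for n = 25, y+ = 10 and the hypothesis θ0 = 0.5 we get l(θ̂) = 10 log 0.4 + 15 log 0.6 ≈ −16.83 and l(θ0) = 25 log 0.5 ≈ −17.33, so −2 log Q ≈ 2(−16.83 + 17.33) = 1.01. This is well below 3.84, the 95% quantile of the χ²(1) distribution, so the hypothesis θ = 0.5 would not be rejected.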
Figure 4: Illustration of the likelihood ratio test, the score test and
the Wald test.
The Score Test

A test statistic equivalent to −2 log Q is obtained by considering the slope of l at the point θ0. The slope of l at θ̂ is 0 (l′(θ̂) = 0 by the definition of the MLE). Hence values of l′(θ0) near 0 will also speak in favor of the hypothesis.

It can be shown that when n is large and the hypothesis is true, the distribution of the so–called score test statistic

$$ S = -\frac{l'(\theta_0)^2}{l''(\theta_0)} $$

will also look like a χ² distribution with 1 degree of freedom. Hence, when n is large, the likelihood ratio test and the score test are equivalent.
The Wald Test

A third test is the Wald test, which compares the values of θ̂ and θ0 directly, corresponding to the horizontal distance in Figure 4.

It can be shown that when n is large and the hypothesis is true, the distribution of the Wald test statistic

$$ W = -(\hat\theta - \theta_0)^2\, l''(\hat\theta) $$

will also look like a χ² distribution with 1 degree of freedom.

Note that W is simply the square of the difference (θ̂ − θ0) divided by its estimated variance −1/l″(θ̂). In the literature, the term "Wald test" is frequently used about the square root of W, which yields a test statistic with approximately a N(0, 1) distribution.
Hence when n is large the likelihood ratio test, the score test and
the Wald test are equivalent.
How to get the asymptotic normality

This section is somewhat theoretical.

Consider the following general setup: let X be a single random variable. The expectation and variance of X are

$$ \mu = E(X) = \int x\, p(x;\theta)\,dx, \qquad Var(X) = \int (x - \mu)^2\, p(x;\theta)\,dx. $$

Since X is a random variable, then so is the score function S(θ; X) = l′(θ; X).
For later purposes we need the mean and the variance of the score function. To obtain these quantities, we use the following facts:

$$ S(\theta) = l'(\theta; x) = (\log p(x;\theta))' = \frac{p'(x;\theta)}{p(x;\theta)} $$

$$ S'(\theta) = l''(\theta; x) = -\frac{(p'(x;\theta))^2}{p(x;\theta)^2} + \frac{p''(x;\theta)}{p(x;\theta)} $$

$$ \int p(x;\theta)\,dx = 1 $$

The function S′(θ) is called the Hessian (matrix) and is very important in connection with PROC MIXED.
Moreover, in most cases of practical interest, the order of differentiation and integration can be interchanged. Hence

$$ \int \frac{d}{d\theta}\, p(x;\theta)\,dx = \frac{d}{d\theta}\int p(x;\theta)\,dx = \frac{d}{d\theta}\,1 = 0 $$

Mean of the score function. We shall suppress the dependence on X in the following. We find that

$$ E(S(\theta)) = E(l'(\theta)) = \int l'(\theta)\,p(x;\theta)\,dx = \int p'(x;\theta)\,dx $$

Interchanging the order of differentiation and integration yields

$$ E(S(\theta)) = \int \frac{d}{d\theta}\,p(x;\theta)\,dx = \frac{d}{d\theta}\int p(x;\theta)\,dx = \frac{d}{d\theta}\,1 = 0 $$
So the expected value of the score function is zero.

Variance of the score function. The variance of the score function has a special name, namely the Fisher information, and is usually denoted by I(θ). Hence we have

$$ I(\theta) = Var(S(\theta)) = E(S(\theta)^2) = E\big([l'(\theta)]^2\big)
= \int l'(\theta)^2\, p(x;\theta)\,dx = \int \frac{1}{p(x;\theta)}\, p'(x;\theta)^2\,dx $$

because the expected value of the score is zero.

A more convenient expression for the variance can be found in
terms of the derivative of the score function:

$$ \begin{aligned}
E(S'(\theta)) = E(l''(\theta))
&= \int \Big[-\frac{(p'(x;\theta))^2}{p(x;\theta)^2} + \frac{p''(x;\theta)}{p(x;\theta)}\Big]\, p(x;\theta)\,dx\\
&= \int \Big[-\frac{(p'(x;\theta))^2}{p(x;\theta)} + p''(x;\theta)\Big]\,dx
\end{aligned} $$

Interchanging the order of differentiation and integration as before gives that ∫ p″(x; θ)dx = 0. Hence

$$ E(S'(\theta)) = -\int \frac{(p'(x;\theta))^2}{p(x;\theta)}\,dx = -Var(S(\theta)). $$
Hence we have, for a single observation,

$$ E(S(\theta)) = 0, \qquad I(\theta) = Var(S(\theta)) = E(S(\theta)^2) = -E(S'(\theta)) \qquad (3) $$

The likelihood for all data

From (2) it is seen that the likelihood for all data is the product of the likelihoods for each observation, i.e.

$$ L(\theta; y) = p(y_1;\theta)\cdots p(y_n;\theta) = \prod_i p(y_i;\theta). $$

Consequently, the log–likelihood, the score function and the derivative of the score function for all data are sums of independent
components:

$$ l(\theta) = \sum_i l(\theta; y_i) = \sum_i l_i(\theta) $$

$$ S(\theta) = l'(\theta; y) = \sum_i l'(\theta; y_i) = \sum_i S(\theta; y_i) = \sum_i S_i(\theta), \qquad
S'(\theta) = \sum_i S'_i(\theta) \qquad (4) $$

For a single observation we have

$$ E(S_i(\theta)) = 0, \qquad I(\theta) = Var(S_i(\theta)) = E(S_i(\theta)^2) = -E(S'_i(\theta)) $$
and correspondingly, for all observations,

$$ E(S(\theta)) = 0, \qquad Var(S(\theta)) = nI(\theta). $$

We then need three small results:

Result 1: Since S′(θ; y) = Σi S′i(θ), it is reasonable to assume (using the law of large numbers) that

$$ \frac{1}{n}\,S'(\theta) = \frac{1}{n}\sum_i S'_i(\theta) \approx E(S'_i(\theta)) = -I(\theta) $$

Result 2: S(θ) = Σi Si(θ) is a sum of independent random variables with E(Si(θ)) = 0 and Var(Si(θ)) = I(θ). Hence, by
the central limit theorem, approximately,

$$ S(\theta) \sim N(0, nI(\theta)) $$

Result 3: Let θ0 be the true (but unknown to us) value of the parameter θ. Let us assume that θ̂ is a good estimate, i.e. close to θ0. Then

$$ 0 = S(\hat\theta) \approx S(\theta_0) + S'(\theta_0)(\hat\theta - \theta_0) $$

That is,

$$ \frac{1}{\sqrt{n}}\,S(\theta_0) \approx -\frac{1}{n}\,S'(\theta_0)\,\sqrt{n}(\hat\theta - \theta_0)
\approx I(\theta_0)\,\sqrt{n}(\hat\theta - \theta_0) $$
The left hand side is approximately N(0, I(θ)) distributed. Hence, approximately,

$$ \frac{1}{\sqrt{n}\,I(\theta_0)}\,S(\theta_0) \sim N(0, I(\theta)^{-1}). $$

That is, approximately,

$$ \sqrt{n}(\hat\theta - \theta_0) \sim N(0, I(\theta)^{-1}), \quad\text{or}\quad
\hat\theta \sim N(\theta_0, (nI(\theta))^{-1}), $$

as desired.
Likelihood and Linear Normal Models
For a linear normal model maximum likelihood estimation is the
same as least squares estimation. The unknown parameters are β
and σ2, so let θ = (β, σ2).
Because the observations are independent, the likelihood becomes

$$ \begin{aligned}
L(\theta) = f(y_1, \ldots, y_n; \theta)
&= \prod_{i=1}^{n} f(y_i; \theta)\\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\,\frac{1}{\sqrt{\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(y_i - \mu_i)^2\Big)\\
&= \frac{1}{(\sqrt{2\pi})^{n}}\,\frac{1}{(\sqrt{\sigma^2})^{n}}\exp\Big(-\frac{1}{2\sigma^2}\sum_i (y_i - \mu_i)^2\Big)\\
&= \frac{1}{(\sqrt{2\pi})^{n}}\,\frac{1}{(\sqrt{\sigma^2})^{n}}\exp\Big(-\frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta)\Big)
\end{aligned} $$

For the moment, suppose σ² is known.
Maximizing L(θ) = L(β, σ²) is done by minimizing Σi(yi − µi)² = (y − Xβ)ᵀ(y − Xβ). But this is exactly what is done in least squares estimation.
Once β has been estimated, it can be verified that the maximum likelihood estimate of σ² is

$$ \hat\sigma^2 = \frac{1}{n}(y - \hat\mu)^\top(y - \hat\mu) $$

In practice one never uses this variance estimate. Instead one uses

$$ \tilde\sigma^2 = \frac{1}{n - p}(y - \hat\mu)^\top(y - \hat\mu) $$

where p is the number of parameters in β.
The reason for using the latter estimate is that

$$ E(\hat\sigma^2) = \frac{n - p}{n}\,\sigma^2, \qquad E(\tilde\sigma^2) = \sigma^2. $$

Hence the latter estimate is unbiased while the former is not.
6 An overview
The purpose of this lecture was to illustrate how the research problems within the biological sciences are related to the progress in statistical theory, both in general and in relation to mixed models.

Starting out with an experiment reported by Darwin, the lecture discussed the state of the art of experimental design and analysis in Darwin's time, proceeded with the progress in statistical theory, much of it related to animal breeding, and ended up with the general theory of mixed models. Important researchers such as F. Galton, R.A. Fisher, S. Wright and C.R. Henderson were presented.
The slides are in Danish. Link to the full screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/oversigt.f.pdf
Outline

• Background for the methods
• Historical development
• Relation to our subject areas
February 7, 2001
2
Darwin's Maize

• C. Darwin (1876) The Effects of Cross- and Self-Fertilisation in the Vegetable Kingdom. John Murray, London.
• cf. Fisher, R.A. (1935) Design of Experiments. Oliver and Boyd.
Darwin's Maize (heights in inches and eighths)

Pot       Crossed    Self-fertilised
Pot I     23 4/8     17 3/8
          12         20 3/8
          21         20
Pot II    22         20
          19 1/8     18 3/8
          21 4/8     18 5/8
Pot III   22 1/8     18 5/8
          20 3/8     15 2/8
          18 2/8     16 4/8
          21 5/8     18
          23 2/8     16 2/8
Pot IV    21         18
          22 1/8     12 6/8
          23         15 4/8
          12         18

Darwin's Maize (heights in decimal inches)

Pot       Crossed    Self-fertilised
Pot I     23.50      17.38
          12.00      20.38
          21.00      20.00
Pot II    22.00      20.00
          19.13      18.38
          21.50      18.63
Pot III   22.13      18.63
          20.38      15.25
          18.25      16.50
          21.63      18.00
          23.25      16.25
Pot IV    21.00      18.00
          22.13      12.75
          23.00      15.50
          12.00      18.00
Darwin's Maize
” As only a moderate number of crossed and self-fertilised plants
were measured, it was of great importance to learn, how far the
averages were trustworthy. I therefore asked Mr Galton, who has
much experience in statistical researches, to examine some of my
tables..... I may premise that if we took by chance a dozen score of
men belonging to different nations and measured them, it would I
presume, be very rash to form any judgment from such small
numbers on their average heights. But the case is somewhat
different with my crossed and self-fertilised plants, as they were of
exactly the same age, were subjected from first to last to the same
conditions, and were descended from the same parents”
Galton's approach

Pot       Crossed   Self-fert.   Sorted Crossed   Sorted Self-fert.   Diff.
Pot I     23.50     17.38        23.50            20.38               3.125
          12.00     20.38        23.25            20.00               3.250
          21.00     20.00        23.00            20.00               3.000
Pot II    22.00     20.00        22.13            18.63               3.500
          19.13     18.38        22.13            18.63               3.500
          21.50     18.63        22.00            18.38               3.625
Pot III   22.13     18.63        21.63            18.00               3.625
          20.38     15.25        21.50            18.00               3.500
          18.25     16.50        21.00            18.00               3.000
          21.63     18.00        21.00            17.38               3.625
          23.25     16.25        20.38            16.50               3.875
Pot IV    21.00     18.00        19.13            16.25               2.875
          22.13     12.75        18.25            15.50               2.750
          23.00     15.50        12.00            15.25               -3.250
          12.00     18.00        12.00            12.75               -0.750
Galton's Approach

• Sorting
• Differences
• Dispersion ("most probable error") – but no t-test

Who was Galton?

Anthropology, meteorology, population genetics, eugenics, fingerprints, correlation.

Very interested in measurement methods and in objective quantification of phenomena.

K. Pearson's guru.
Correct approach?

Pot       Crossed   Self-fert.   Diff.
Pot I     23.50     17.38        3.125
          12.00     20.38        3.250
          21.00     20.00        3.000
Pot II    22.00     20.00        3.500
          19.13     18.38        3.500
          21.50     18.63        3.625
Pot III   22.13     18.63        3.625
          20.38     15.25        3.500
          18.25     16.50        3.000
          21.63     18.00        3.625
          23.25     16.25        3.875
Pot IV    21.00     18.00        2.875
          22.13     12.75        2.750
          23.00     15.50        -3.250
          12.00     18.00        -0.750
Correct approach?

• Differences
• Standard deviation + t-test
• ANOVA. Linear normal model.
• Hypothesis testing. Null hypotheses.
• Assumption of independence.
• Randomization
What has happened since?

• R.A. Fisher
  – Rothamsted
• Student (W.S. Gosset)
The Fifth Pot

• What response do we expect in pot 5? What is a guess at the difference?
• Why?
• What is a guess at the level for the self-fertilised plants?
• Random effects, populations, samples
Population Genetics

• Population
• P = A + M
• V(P) = V(A) + V(M)
• h² = V(A)/V(P)
• Ao = (1/2)Am + (1/2)Af
Population Genetics

• R.A. Fisher
• Sewall Wright
• (Haldane)
Hierarchical Populations

[Diagram: a hierarchical population tree – sires at the top, females within sires, offspring within females]
Population Genetics / Animal Breeding

• R.A. Fisher
• Sewall Wright
• Jay R. Lush
• C.R. Henderson
• S.R. Searle
Animal Breeding

• Originally a hierarchical structure
• The structure breaks down, especially due to artificial insemination
• Methods for crossed classifications
• Henderson's Mixed Model Equations
Animal Breeding

• Main emphasis on estimation (selection)
• Dependence described by residual variance and heritability
• The problem is primarily computational (matrix inversion)
• Usually MANY observations!
• Hypothesis testing of minor interest
Mixed Models in General

• Repeated measurements / longitudinal data
• Spatial observations
• Hierarchical experimental designs (e.g. split-plot)
• The Mixed Model Equations as a common frame of reference
• Joint software development
Mixed Models in General

• Hypothesis testing of great interest
• Dependence described by many variance parameters
• Limited number of observations
• Still some loose ends
7 Experimental planning and design
The purpose of the lecture was to refresh the concepts used in experimental planning and design, i.e., hypotheses, power of designs, and blocking. Typical blocking factors were discussed.

Different types of experimental design, such as randomized block, split-plot, Latin square and factorial designs, were discussed, and examples were sought within the participants' areas of research.
The slides are in Danish. Link to full-screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/Forsplanpl.f.pdf
Outline
• Hypotheses
• Decision Support
• Need of information for planning
• Restrictions in experimental design
• Different designs
The Research Process

[Diagram: the research cycle – application (ansøgning), experiment (forsøg), publication (publicering), 'package' (pakke)]
The Research Process

• Get ideas about areas where existing knowledge/theory is insufficient or wrong
• Make observations so that the ideas can be confirmed or refuted
• Decide whether knowledge/theory should be adjusted
• (Quantification of knowledge)
• Group work across time and place
Darwin's Maize (height, inches)

Pot       Crossed   Self-fertilised
Pot I     23.50     17.38
          12.00     20.38
          21.00     20.00
Pot II    22.00     20.00
          19.13     18.38
          21.50     18.63
Pot III   22.13     18.63
          20.38     15.25
          18.25     16.50
          21.63     18.00
          23.25     16.25
Pot IV    21.00     18.00
          22.13     12.75
          23.00     15.50
          12.00     18.00
Hypotheses

Hypothesis A: GMO sugar beets are not harmful to cows
Hypothesis B: GMO sugar beets are harmful to cows

Hypothesis A: Pesticide use reduces fertility
Hypothesis B: Pesticide use does not reduce fertility
Lice – Decision Support

Table 1: Spraying example – payoff table

                       State of the crop
Decision         No lice                          Lice
Spray            Costs of pesticide and labour    Costs of pesticide and labour
Do not spray     0                                Yield loss
Research – Decision Support

Table 2: Research example – payoff table

                          State of 'the world'
Decision                  Hypothesis 1 is true   Hypothesis 2 is true
Accept hypothesis 1       OK                     Error!
Accept hypothesis 2       Costly error!          Breakthrough!
Types of Erroneous Conclusion

              Hypothesis 1     Hypothesis 2
              Type I error     Type II error
Options in the Design Phase

[Figure: overlapping outcome distributions under Hypothesis 1 and Hypothesis 2]

Increase precision        Increase the treatment effect

NB: The Type I error is held constant, e.g. at 0.05.
Biological Input

• Measurement properties
• Expected treatment effects
• Possible conclusions of the experiment
• Dependent vs. independent hypotheses
• Hypothesis-generating properties
Table 3: Overview of expected treatment effects

         Hypothesis 1 is true           Hypothesis 2 is true
Trait    Treatment 1   Treatment 2      Treatment 1   Treatment 2
A        100           100              100           120
B        ...           ...              ...           ...
Typical Blocking Factors

• Litter
• Pen, flock, cage
• Sex
• Ancestry
• Herd
• Observer
Restrictions on Design Options

• Block size
• Housing/management
• Competition for resources
Design Types

• Randomized block designs
• Split-plot designs
• Latin squares
• Incomplete block designs
• Factorial designs
• Fractional designs
8 Randomized Complete Block Design
These are the first slides in the second block of lectures. They start off with the augmentation of the linear normal model to a mixed model. Then PROC MIXED in SAS was presented, and example 1.2.4 in LMSW (Littell et al., 1996) was discussed. The slides can be seen as a summary of chapter 1 in LMSW.
Link to the full screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RDBC.f.pdf
Outline

• Hypotheses
• Extension of the LNM
• Introduction to PROC MIXED
• RCBD example (1.2.4)
Linear Normal Model

Y11 = δ + α1 + u1 + ε11
Y12 = δ + α2 + u1 + ε12
Y21 = δ + α1 + u2 + ε21
Y22 = δ + α2 + u2 + ε22

εij ∼ N(0, σ²)

Y ∼ N(µ, σ²I)
Matrix formulation

(Y11, Y12, Y21, Y22)> = X (δ, α1, α2)> + Z (u1, u2)> + (ε11, ε12, ε21, ε22)>

with (rows separated by semicolons)

X = [1 1 0; 1 0 1; 1 1 0; 1 0 1]    Z = [1 0; 1 0; 0 1; 0 1]

i.e.

Y = Xβ + Zu + ε

ε ∼ N(0, R),  u ∼ N(0, G)

V(Zu) = Z V(u) Z> = ZGZ>

V(Y) = ZGZ> + R
Random vs. Fixed

• Do the levels of the factor come from a probability distribution? McCulloch & Searle (1997)
• Are inferences to be drawn from these data about just these levels of the factor? Searle (1971)
ML estimation

Type   Distribution          Estimate
LNM    Y ∼ N(Xβ, σ²I)        β̂ = (X>X)−1 X> y

If V is known:

LMM    Y ∼ N(Xβ, V)          β̂ = (X>V−1X)−1 X>V−1 y

V = ZGZ> + R is not known; it depends on parameters, V = f(σ², σ²u).
Likelihood function

l(y, β, σ², σ²u) = −(1/2) log|V| − (1/2)(y − Xβ)>V−1(y − Xβ) − (n/2) log(2π)

[Figure: the log-likelihood (Loglike) plotted as a function of σ²]
Proc Mixed I
PROC MIXED < options > ;
BY variables ;
ID variables ;
WEIGHT variable ;
Proc Mixed II
CLASS variables ;
MODEL dependent = < fixed-effects > < / options > ;
RANDOM random-effects < / options > ;
REPEATED < repeated-effect> < / options > ;
PARMS (value-list) ... < / options > ;
PRIOR <distribution > < / options > ;
Proc Mixed III
CONTRAST ’label’ < fixed-effect values ... >
< | random-effect values ... > , ... < / options > ;
ESTIMATE ’label’ < fixed-effect values ... >
< | random-effect values ... >< / options > ;
LSMEANS fixed-effects < / options > ;
MAKE ’table’ OUT=SAS-data-set ;
Proc Mixed
Model concerns Xβ
Random concerns Zu and G = V(u)
Repeated concerns ε and R = V(ε)
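To make this mapping concrete, here is a minimal hedged sketch (the data set mydata and the variables y, trt and block are hypothetical; the REPEATED structure shown is just one possible choice for R):

proc mixed data=mydata;          /* hypothetical data set */
  class block trt;
  model y = trt;                 /* MODEL: the fixed part X*beta */
  random block;                  /* RANDOM: Zu with G = V(u) */
  repeated / type=ar(1) subject=block;  /* REPEATED: residual structure R */
run;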
The ingot data (Data Set 1.2.4)

ingot – casting block / metal bar
metal – the metal used for bonding (soldering?) the ingot (nickel, iron, copper)
pres – the pressure required to break the bond

/*---Data Set 1.2.4---*/
data rcb;
  input ingot metal $ pres;
  datalines;
1 n 67.0
1 i 71.9
1 c 72.2
...
;
Design

              Ingot no.
Bond     1   2   3   4   5   6   7
1        n   i   c   c   c   n   n
2        c   n   i   i   n   c   i
3        i   c   n   n   i   i   c
Other examples of RCBDs

• Paired observations – the on-farm testing scheme ("Den rullende Afprøvning")
• (Beretning 685) Increasing amounts of sunflower seed (4 levels); 20 litters of 4 pigs.
• Beretning 546. Rearing intensity, Jersey; 10 pairs of monozygotic twins; high vs. low intensity.
• Forskningsrapport 25. The Airwash system; herd split by even vs. odd cow numbers.
PROC MIXED model

proc mixed data=rcb;
  class ingot metal;
  model pres = metal;
  random ingot;
  lsmeans metal / pdiff;
  estimate 'nickel mean' intercept 1 metal 0 0 1;
  estimate 'copper vs iron' metal 1 -1 0;
  contrast 'copper vs iron' metal 1 -1 0;
run;
Second notation

Yij = µ + αi + uj + εij

uj ∼ N(0, σ²u),  εij ∼ N(0, σ²ε)
Third notation

Y = Xβ + Zu + ε

u ∼ N(0, G),  ε ∼ N(0, R)
SAS (8e) Output

The Mixed Procedure

Model Information

Data Set                     WORK.RCB
Dependent Variable           pres
Covariance Structure         Variance Components
Estimation Method            REML
Residual Variance Method     Profile
Fixed Effects SE Method      Model-Based
Degrees of Freedom Method    Containment
Class Level Information

Class    Levels    Values
ingot    7         1 2 3 4 5 6 7
metal    3         c i n
Dimensions

Covariance Parameters      2
Columns in X               4
Columns in Z               7
Subjects                   1
Max Obs Per Subject        21
Observations Used          21
Observations Not Used      0
Total Observations         21
Iteration History

Iteration    Evaluations    -2 Res Log Like    Criterion
0            1              112.40987952
1            1              107.79020201       0.00000000

Convergence criteria met.
Estimates of σ²u and σ²ε

Covariance Parameter Estimates

Cov Parm     Estimate
ingot        11.4478
Residual     10.3716
Criteria for model fit, used when comparing models.

Fit Statistics

-2 Res Log Likelihood         107.8
AIC (smaller is better)       111.8
AICC (smaller is better)      112.6
BIC (smaller is better)       111.7
Significance test

Type 3 Tests of Fixed Effects

            Num   Den
Effect      DF    DF    F Value    Pr > F
metal       2     12    6.36       0.0131
Degrees of Freedom

Numerator: H0 : α1 = α2 = α3 = 0

K>β = 0  ⇔  [0 1 −1 0; 0 1 0 −1; 0 0 1 −1] (µ, α1, α2, α3)> = 0

Num DF is rank(K).
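As a small check, note that the third row of the matrix above is the difference of the first two rows, so rank(K) = 2; this matches the Num DF of 2 for metal in the output above.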
Denominator – Containment method: "Denote the fixed effect in question A, and search the RANDOM effect list for the effects that syntactically contain A. For example, the RANDOM effect B(A) contains A, but the RANDOM effect C does not, even if it has the same levels as B(A).
Among the RANDOM effects that contain A, compute their rank contribution to the (X Z) matrix. The DDF assigned to A is the smallest of these rank contributions. If no effects are found, the DDF for A is set equal to the residual degrees of freedom, N − rank(X Z)."

Methods: CONTAIN, BETWITHIN, RESIDUAL, SATTERTH, KENWARDROGER.
MODEL .... / DDFM=SATTERTH;
Output from ESTIMATE

Estimates

                                 Standard
Label             Estimate       Error      DF    t Value    Pr > |t|
nickel mean       71.1000        1.7655     12    40.27      <.0001
copper vs iron    -5.7143        1.7214     12    -3.32      0.0061

Contrasts

                   Num   Den
Label              DF    DF    F Value    Pr > F
copper vs iron     1     12    11.02      0.0061
Least Squares Means

                                Standard
Effect    metal    Estimate     Error      DF    t Value    Pr > |t|
metal     c        70.1857      1.7655     12    39.75      <.0001
metal     i        75.9000      1.7655     12    42.99      <.0001
metal     n        71.1000      1.7655     12    40.27      <.0001

Differences of Least Squares Means

                                           Standard
Effect    metal    _metal    Estimate      Error      DF    t Value    Pr > |t|
metal     c        i         -5.7143       1.7214     12    -3.32      0.0061
metal     c        n         -0.9143       1.7214     12    -0.53      0.6050
metal     i        n         4.8000        1.7214     12    2.79       0.0164
GLM vs. MIXED

GLM:
Source    DF    Type III SS     Mean Square    F Value    Pr > F
ingot     6     268.2895238     44.7149206     4.31       0.0151
metal     2     131.9009524     65.9504762     6.36       0.0131

Mixed:
            Num   Den
Effect      DF    DF    F Value    Pr > F
metal       2     12    6.36       0.0131
GLM:
                             Standard                LSMEAN
metal    pres LSMEAN         Error        Pr > |t|   Number
c        70.1857143          1.2172327    <.0001     1
i        75.9000000          1.2172327    <.0001     2
n        71.1000000          1.2172327    <.0001     3

Mixed: Least Squares Means
                                Standard
Effect    metal    Estimate     Error      DF    t Value    Pr > |t|
metal     c        70.1857      1.7655     12    39.75      <.0001
metal     i        75.9000      1.7655     12    42.99      <.0001
metal     n        71.1000      1.7655     12    40.27      <.0001
GLM:
                                 Standard
Parameter         Estimate       Error         t Value    Pr > |t|
nickel mean       71.1000000     1.21723265    58.41      <.0001
copper vs iron    -5.7142857     1.72142692    -3.32      0.0061

Mixed:
                                 Standard
Label             Estimate       Error      DF    t Value    Pr > |t|
nickel mean       71.1000        1.7655     12    40.27      <.0001
copper vs iron    -5.7143        1.7214     12    -3.32      0.0061

Note that the standard error of the nickel mean differs between GLM and MIXED: GLM treats ingot as fixed, so only the residual variance enters the standard error of a mean, whereas MIXED also includes the ingot variance component.
Summary

• Model specification
• Output elements
• Estimation methods
• Fit statistics / information criteria
• Degrees of freedom, model parameters
• GLM differs
IC Option

The IC option displays a table of various information criteria. The criteria are all in smaller-is-better form and are described in the table below.

Criterion   Formula                      Reference
AIC         −2l + 2d                     Akaike (1974)
AICC        −2l + 2d·n*/(n* − d − 1)     Burnham and Anderson (1998)
HQIC        −2l + 2d log(log(n))         Hannan and Quinn (1979)
BIC         −2l + d log(n)               Schwarz (1978)
CAIC        −2l + d(log(n) + 1)          Bozdogan (1987)

Here l denotes the maximum value of the (possibly restricted) log likelihood, d the dimension of the model, and n the number of observations. In Version 6 of SAS/STAT software, n equals the number of valid observations for maximum likelihood estimation and n − p for restricted maximum likelihood estimation, where p equals the rank of X. In later versions, n equals the number of effective subjects as displayed in the "Dimensions" table, unless this value equals 1, in which case n equals the number of levels of the first RANDOM effect you specify. If the number of effective subjects equals 1 and you have no RANDOM statements, then n reverts to the Version 6 values. For AICC (a finite-sample corrected version of AIC), n* equals the Version 6 values of n, unless this number is less than d + 2, in which case it equals d + 2.

For restricted likelihood estimation, d equals q, the effective number of estimated covariance parameters. In Version 6, when a parameter estimate lies on a boundary constraint, it is still included in the calculation of d, but in later versions it is not. The most common example of this behavior is when a variance component is estimated to equal zero. For maximum likelihood estimation, d equals q + p.

For ODS purposes, the name of the "Information Criteria" table is "InfoCrit".
9 Randomized Complete Block Design II
These slides discuss the concepts of BLUE and BLUP estimates. The question of model checking is also addressed.
Link to the full-screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RDBC2SLU.f.pdf
Outline

• BLUEs and BLUPs
• Examples of model checking
BLUEs and BLUPs
• Best Linear Unbiased Estimator l>Xβ0
• Best Predictor: E(u|y)
Linear Regression
[Figure: two scatter plots of x2 against x1, illustrating the conditional mean of x2 given x1]
Linear Regression

(x2, x1)> ∼ N((µx2, µx1)>, [V2 C21; C12 V1])

E(X2|X1) = µx2 + C21 V1−1 (x1 − µx1)

V(X2|X1) = V2 − C21 V1−1 C21>

V(E(X2|X1)) = C21 V1−1 C21>
(u, y)> ∼ N((µu, µy)>, [G C; C> V])

u1 = u1
u2 = u2
Y11 = δ + α1 + u1 + ε11
Y12 = δ + α2 + u1 + ε12
Y21 = δ + α1 + u2 + ε21
Y22 = δ + α2 + u2 + ε22
BLUEs and BLUPs

• Best Linear Unbiased Estimator: l>Xβ0
• Best Predictor: E(u|y)
• Best Linear Predictor: µu + CV−1(y − µy)
• Best Linear Unbiased Predictor:
  BLUP(t>Xβ + s>u) = t>Xβ0 + s>CV−1(y − Xβ0)
• Estimated Best (?) Linear Unbiased Predictor:
  EBLUP(t>Xβ + s>u) = t>Xβ0 + s>ĈV̂−1(y − Xβ0)
  = t>Xβ0 + s>ĜZ>V̂−1(y − Xβ0)
Variance in BLUP

u: true value, û: BLUP estimate, εu: error of prediction.

u = û + εu  ⇔  u − û = εu

V(u) = V(û) + V(εu)

The error of prediction:

V(u − û) = G − CV−1C>

The variance of the BLUP value:

V(û) = CV−1C>
Example

One-way classification model: the effect of the number of observations per block.

ûi = BLUP(ui) = (ni σ²u / (σ² + ni σ²u)) (yi· − µ̂)

i: block no., ni: number of observations in block i, yi·: block mean.

As ni → ∞ the coefficient ni σ²u/(σ² + ni σ²u) → 1 and the variance of the BLUP estimates V(ûi) → G.
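As a quick plug-in illustration, using the variance component estimates from the ingot example above (σ²u ≈ 11.45, σ² ≈ 10.37, and ni = 3 observations per ingot), the shrinkage coefficient is 3 · 11.45/(10.37 + 3 · 11.45) ≈ 0.77, so each ingot's BLUP pulls the raw block deviation about 23% of the way towards zero.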
Fixed vs. Random
[Figure: distributions of estimated block effects, fixed vs. random; panel (c) Ingots, panel (d) Litter]
BLUP summary

• The BLUP corresponds to the conditional expectation of the random effect given the observations.
• Under normality assumptions and known variances, BP = BLUP.
• With unknown variances this no longer holds.
• The variance of the BLUPs depends on the precision of the information concerning the random effects.
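For reference, here is a minimal sketch of how the EBLUPs can be requested in PROC MIXED for the ingot example (the S option on the RANDOM statement prints the random-effect solutions, i.e. the EBLUPs):

proc mixed data=rcb;
  class ingot metal;
  model pres = metal / solution;  /* prints the fixed-effect estimates */
  random ingot / s;               /* 's' prints the EBLUPs for the ingots */
run;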
Model check – the LNM

• The εi are independent and identically distributed, εi ∼ N(0, σ²)
• Residuals vs. predicted values
• Residuals vs. anything else
• Probit plots
• εi,t vs. εi,t−1
• etc.
Residuals – Mixed Models

Distribution of residuals in mixed models:

(y − Xβ̂) = (Zu + ε) ∼ N(0, V)

i.e., not i.i.d. (option OUTPM in PROC MIXED)

Another definition of residuals:

(y − Xβ̂ − Zû) = (Z(u − û) + ε) ∼ N(0, VG − VG V−1 VG> + R)

where VG = ZGZ>; i.e., again not i.i.d. (option OUTP in PROC MIXED)

Standardized residuals?
Residuals vs. predicted values

[Figure: residuals (r1, r2) plotted against predicted values (p1, p2); panels (e) and (f)]
10 Split-Plot Experiments
These slides present the theoretical background for split-plot designs. They augment the presentation of split-plot designs in chapter 2 of LMSW (Littell et al., 1996). The concept of variance components is presented, together with the variances of different contrasts. In addition, concepts such as the distribution of sums of squares, Satterthwaite's approximation and the distinction between random and fixed effects are presented.
Link to the full screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SplitPlot.f.pdf
The General Idea behind Split-Plot Experiments

"Once upon a time there were linear normal models – systematic effects plus one error term..."

Yet, many experiments and studies have a hierarchical structure

• with respect to treatments
• with respect to error structures

Split-plot models are a very powerful (and early) way of handling such situations.
The name "split-plot" comes from the area of field experiments:

• Some treatments (say factor A) are applied to entire plots (parcels). Those plots are called whole-plot units, and the factor A is the whole-plot factor.
• A plot is sometimes further subdivided into sub-plots, and other treatments (say factor B) are applied to each of these sub-plots. The sub-plots are called split-plot units, and the factor B is called the split-plot factor.
Other examples:

• Treatment A (e.g. feeding) applied to a whole pig pen (the whole-plot), while treatment B (something...) is applied to pigs within a pen.
• Treatment A is applied to an entire litter of piglets; treatment B is applied to each piglet in the litter.
• Treatment A is a management strategy applied to a whole farm, while treatment B is a treatment of each pig pen on the farm.
The basic property of split-plot experiments is that subjects within a whole-plot are more similar than subjects in different whole-plots.

More generally, subjects/individuals/plots close (in some sense) to each other are expected to be more similar than if they were further apart.

Split-plot models are sometimes appropriate for analyzing repeated measurements.
Example 1. (Example 2.2 from LMSW.)

• The effect of 3 bacterial inoculation treatments (INOC, indexed by j) applied to 2 grass cultivars (CULT, indexed by i).
• There are 4 blocks (BLOCK, indexed by k), and CULT is randomly assigned to each half of a block.
• Half a block is the whole-plot unit. Each whole-plot unit is subdivided into 3 split-plot units, and each INOC is applied there.

The statistical model is

yijk = µ + αi + βj + γij + rk + wik + εijk

where rk ∼ N(0, σ²r), wik ∼ N(0, σ²w) and εijk ∼ N(0, σ²). fin
Variance and Correlation

The total variance is

Var(yijk) = Var(rk) + Var(wik) + Var(εijk) = σ²r + σ²w + σ² = σ²tot

which justifies the name variance component model:

• The total variance is a sum of individual variance contributions.
• Moreover, each variance contribution can be assigned to a specific feature of the experiment.
The variance components have implications for the correlation structure among the variables:

1. Observations within the same block (k) but with different levels of factor A (i) are correlated through the block component:

   Corr(yijk, yi′j′k) = Corr(yijk, yi′jk) = Cov(yijk, yi′jk)/Var(yijk) = σ²r/σ²tot

2. Observations within the same block (k) and with the same level of factor A (i) but different levels of factor B (j) are correlated through the block component and the whole-plot component:

   Corr(yijk, yij′k) = Cov(yijk, yij′k)/σ²tot = (σ²r + σ²w)/σ²tot
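As a small numeric illustration (the values are hypothetical): if σ²r = 1, σ²w = 2 and σ² = 3, then σ²tot = 6, so observations sharing only a block have correlation 1/6 ≈ 0.17, while observations sharing both block and whole-plot have correlation (1 + 2)/6 = 0.5.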
Hence in the split plot model it is assumed that the correlation, when
present, is positive.
The split–plot structure has important implications with respect to the
statistical inference:
1. The effect of the interaction between A and B and treatment B itself
should be compared with the “residual variation” i.e. the variation
between the split–plot units.
2. The effect of treatment A should be compared with the “whole–plot
variation”, i.e. the variation between the whole–plot units.
We shall illustrate these points for a balanced split–plot experiment.
Comparing Differences

Consider again the model

yijk = µ + αi + βj + γij + rk + wik + εijk

where rk ∼ N(0, σ²r), wik ∼ N(0, σ²w) and εijk ∼ N(0, σ²), and i = 1...a, j = 1...b and k = 1...c.

A simple calculation of differences of means illustrates the special issues arising in a split-plot experiment.
Different levels of factor A can be compared by

y1.. − y2.. = α1 − α2 + γ1. − γ2. + (w1. − w2.) + (ε1.. − ε2..)

Var(y1.. − y2..) = Var(w1. − w2.) + Var(ε1.. − ε2..) = 2σ²w/c + 2σ²/(bc) = (2/c)(σ²w + σ²/b)

Different levels of factor B can be compared by

y.1. − y.2. = β1 − β2 + γ.1 − γ.2 + (ε.1. − ε.2.)

Var(y.1. − y.2.) = Var(ε.1. − ε.2.) = 2σ²/(ac)

Hence Var(y1.. − y2..) is bigger than Var(y.1. − y.2.).

In other words, the effect of the whole-plot factor is determined less accurately than the effect of the split-plot factor.
Inference Issues for Mixed Models
For balanced experiments, inference is based on F–tests.
For unbalanced cases, inference is a delicate issue. Loosely speaking
“What are the denominator degrees of freedom”.
In PROC MIXED one can make “approximate F–tests” (but SAS never
informs you that the tests are only approximate).
Several suggestions have been made regarding this. One such is Satterthwaite's approximation.
Analysis of the Split-Plot Experiment

Consider again the model

yijk = µ + αi + βj + γij + rk + wik + εijk

where rk ∼ N(0, σ²r), wik ∼ N(0, σ²w) and εijk ∼ N(0, σ²), and i = 1...a, j = 1...b and k = 1...c.

For simplicity suppose that factor B does not represent a treatment but only replications within each whole-plot. Then the model reduces to

yijk = µ + αi + rk + wik + εijk
The replicates due to factor B are eliminated by calculating the average within each block and treatment:

yi.k = µ + αi + rk + (wik + εi.k),  where Var(wik + εi.k) = σ²w + σ²/b

• Hence the between whole-plot variation (σ²w) remains unchanged, while the within whole-plot contribution σ² is reduced by a factor b.
• Therefore, by taking more replicates within a whole-plot unit, parts of the variation are reduced, while other parts remain the same.
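For a small numeric illustration (the values are hypothetical): with σ²w = 2, σ² = 3 and b = 3 replicates, the averaged observation has variance σ²w + σ²/b = 2 + 1 = 3; increasing b further can only push this towards σ²w = 2.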
Modelling the Mean

Let zik = yi.k denote the mean and define uik = wik + εi.k.

Then the model for the means can be written

zik = µ + αi + rk + uik

where uik ∼ N(0, σ²u) with σ²u = σ²w + σ²/b, and rk ∼ N(0, σ²r).

This is an ordinary ANOVA model with one treatment, one (random) block effect and no interaction. Analyzing such a model is straightforward.
Three Technical Results

In connection with ANOVA calculations, one frequently uses the following results:

ANOVA1: Let X, Y be independent with E(X) = E(Y) = 0 and let a be a number. Then

E(a + X + Y)² = Var(a + X + Y) + [E(a + X + Y)]² = Var(X) + Var(Y) + a² = E(X²) + E(Y²) + a²

ANOVA2: Let Y1,...,Yn be independent with Yi ∼ N(µ, σ²), and let SSD = ∑(i=1..n) (Yi − Y.)². Then

E(SSD) = (n − 1)σ² = (n − 1) Var(Yi)

SSD ∼ σ²χ²(n − 1)
ANOVA3: Let Y1,...,Yn be independent with Yi = µi + εi, where εi ∼ N(0, σ²), and let

SSD = ∑(i=1..n) (Yi − Y.)²  and  Q(µ) = ∑(i=1..n) (µi − µ.)²

Then

E(SSD) = Q(µ) + E(∑(i=1..n) (εi − ε.)²) = Q(µ) + (n − 1)σ²
With

zik = µ + αi + rk + uik

averaging gives

zi. = µ + αi + r. + ui.
z.. = µ + α. + r. + u..

The difference

zi. − z.. = (αi − α.) + (ui. − u..)

is a measure of the treatment effect and does not depend on the block.
Letting SSDA = ∑i (zi. − z..)² we find that

E(SSDA) = ∑i (αi − α.)² + E(∑i (ui. − u..)²) = Q(α) + (a − 1)σ²u/c

and hence

E(c ∑i (zi. − z..)²) = cQ(α) + (a − 1)σ²u.

• If there is no effect of treatment A, then Q(α) = 0 and SSDA has a (scaled) χ²-distribution.
• To be able to make the F-test we need to find a quantity which has σ²u as expected value whether or not αi = 0.
1. Let SSDAC = ∑ik (zik − zi. − z.k + z..)². It is easy to see that

   zik − zi. − z.k + z.. = uik − ui. − u.k + u..

2. It is not difficult to verify (and it can be found in any standard textbook on statistics) that

   E(SSDAC) = σ²u (a − 1)(c − 1).

3. Finally, it is equally easy to verify that SSDA and SSDAC are independent.
4. Therefore the F-statistic for testing αi = 0 becomes

   F = [c · SSDA/(a − 1)] / [SSDAC/((a − 1)(c − 1))]
     = [c ∑i (zi. − z..)²/(a − 1)] / [∑ik (zik − zi. − z.k + z..)²/((a − 1)(c − 1))]
     ∼ F(a − 1, (a − 1)(c − 1))

Large values of F are critical to the hypothesis.
• The important point is that the treatment effect of factor A is "tested against" the variance σ²u = σ²w + σ²/b, which largely consists of the whole-plot variation (σ²w) plus a "minor" contribution from the split-plot variation (σ²/b).
• In the balanced case, the test for αi = 0 can be made by simply analyzing the "means". That is the reason why PROC GLM in special (balanced) cases can make the correct tests in certain variance component models.
Back to the Original Setup

Return to the original model with a treatment effect of factor B, i.e.

yijk = µ + αi + βj + γij + rk + wik + εijk

1. The interaction effect γij is tested exactly as if wik and rk had been fixed effects, i.e. the test is made "against" the residual variation σ².
2. In the absence of γij, the main effect βj is also tested as if wik and rk had been fixed effects.
3. The main effect of factor A is tested as described previously. Just note that the effect of B cancels out in all calculations.
Unbalanced cases

All the nice calculations presented previously break down when the design is no longer balanced.

Consider again

yijk = µ + αi + rk + wik + εijk

and suppose this time that i = 1...a, k = 1...c and j = 1...bik.

Hence there might not be the same number of replicates (j) within each whole-plot unit.
As before, the replicates due to factor B are eliminated by calculating the average within each block and treatment:

zik = yi.k = µ + αi + rk + (wik + εi.k)

But now, with uik = wik + εi.k,

Var(uik) = σ²w + σ²/bik = σ²uik

That is, the zik's have different variances.
1. One unpleasant consequence of this is that

   zi. = µ + αi + r. + ui.

   has a variance which depends on i (through the bik).

2. Another, equally unpleasant, consequence is that SSDAC from before does not have a χ²-distribution.

3. Consequently, the F-statistic from before does not have an F-distribution.
Some consequences of this:

• We can still calculate the F-statistic, but it has an unknown distribution in the unbalanced case.
• Hence we have a problem in judging whether an observed F-statistic is "large".
• It seems plausible that when the experiment is "nearly balanced", then F must be "nearly F-distributed". But what is "nearly balanced", and what should we do when the experiment is very unbalanced?
A related problem:

A related problem arises even in the balanced case. Suppose interest is in comparing

µ11 − µ21 = α1 − α2 + γ11 − γ21.

The optimal estimate of this contrast is, in the balanced case, the difference

y11. − y21.

and the variance of that difference is

Var(y11. − y21.) = (2/3)(σ²w + σ²)
• The problem is that to estimate σ²w + σ², two sums of squares are needed.
• To put it in general terms, suppose SSD1 ∼ σ²1 χ²(f1) and SSD2 ∼ σ²2 χ²(f2) are needed. The problem is that the weighted sum

  SSD = a1 SSD1 + a2 SSD2

  does not have a χ²-distribution unless σ1 = σ2 and a1 = a2.
• Satterthwaite's idea was the following: let us assume that SSD approximately has a (scaled) χ²-distribution.
• The problem is then how many degrees of freedom – but this number can be "estimated" in the following way.
Satterthwaite's approximation

Consider the two-sample problem

Yij ∼ N(µi, σ²i), i = 1, 2, j = 1,...,ni

Then

Ȳi ∼ N(µi, σ²i/ni),  Ȳ1 − Ȳ2 ∼ N(µ1 − µ2, σ²1/n1 + σ²2/n2)

S²i = (1/fi) ∑(j=1..ni) (Yij − Ȳi.)² ∼ (σ²i/fi) χ²(fi),  fi = ni − 1
Let σ²D = σ²1/n1 + σ²2/n2. A natural and unbiased estimate for σ²D is

S²D = S²1/n1 + S²2/n2    (1)

Question: What is the distribution of S²D?

Satterthwaite (worked at General Electric, USA) (approx. 1945): We don't know, but let's approximate the distribution of S²D with a suitable χ²-distribution:

S²D ∼approx (φ²/η) χ²(η)    (2)
• With S²D = S²1/n1 + S²2/n2 we have

  E(S²D) = σ²1/n1 + σ²2/n2 = σ²D

  Var(S²D) = 2(σ⁴1/(n²1 f1) + σ⁴2/(n²2 f2))

• Under the approximation S²D ∼approx (φ²/η) χ²(η) we have

  E(S²D) = φ²

  Var(S²D) = 2φ⁴/η
• Satterthwaite's idea: match the first two moments:

  φ² = σ²D

  η = (σ²D)² / (σ⁴1/(n²1 f1) + σ⁴2/(n²2 f2))

• In real life σ²i, and hence σ²D, are unknown. Instead we plug the estimates s²i and s²D into the calculation of η:

  η̂ = (s²D)² / (s⁴1/(n²1 f1) + s⁴2/(n²2 f2))
Example 2. Let σ²1 = 2, σ²2 = 10, n1 = n2 = 6, f1 = f2 = 5. Then

σ²D = 2/6 + 10/6 = 2

η = 2² / (2²/(6²·5) + 10²/(6²·5)) = 6.9 ≈ 7

Hence

S²D = S²1/n1 + S²2/n2 ∼approx (σ²D/7) χ²(7)

fin
Example 3. Let σ²1 = 100, σ²2 = 90, n1 = 100, n2 = 10, f1 = 99, f2 = 9. Then

σ²D = 100/100 + 90/10 = 10

η = (1 + 9)² / (1²/99 + 9²/9) = 11.1

If the variances are assumed equal, then

σ̂²D = [(99·100 + 9·90)/108] · (1/100 + 1/10) = 10.9

which has a scaled χ²(108)-distribution.

Quite a difference! fin
How Good is Satterthwaite's Approximation?

The 1000 EURO question is now: how good is Satterthwaite's approximation?

The usual answer: simulate and calculate coverage percentages!
Two-sample Problem

Model:

Yij = µi + εij, i = 1, 2, j = 1,...,ni

where εij ∼ N(0, σ²i).

1. Simulate data where µ1 = µ2.
2. Test the hypothesis µ1 = µ2 at different significance levels,
   - using Satterthwaite's approximation,
   - using the Containment method (default in PROC MIXED).
3. Calculate coverage percentages.
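A minimal sketch of the corresponding PROC MIXED call for the two-sample problem (the data set twosample and the variables y and grp are hypothetical; GROUP=grp on the REPEATED statement allows a separate residual variance per group, which is what makes the Satterthwaite correction relevant here):

proc mixed data=twosample;
  class grp;
  model y = grp / ddfm=satterth;  /* Satterthwaite denominator DF */
  repeated / group=grp;           /* separate residual variance per group */
run;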
n1  σ1  n2  σ2   Method     DDF    Fpr0.01  χ²pr0.01  Fpr0.05  χ²pr0.05  Fpr0.10  χ²pr0.10
3   1   3   20   contain    4      0.047    0.127     0.114    0.204     0.182    0.260
3   1   3   20   satterth   2.16   0.020    0.124     0.056    0.202     0.106    0.258
8   1   3   20   contain    9      0.071    0.110     0.133    0.169     0.187    0.227
8   1   3   20   satterth   2.01   0.013    0.110     0.052    0.169     0.088    0.227
3   1   8   20   contain    9      0.009    0.030     0.053    0.084     0.101    0.134
3   1   8   20   satterth   7.16   0.006    0.030     0.046    0.084     0.093    0.134
8   1   8   20   contain    14     0.010    0.024     0.064    0.084     0.112    0.145
8   1   8   20   satterth   7.04   0.007    0.024     0.038    0.084     0.096    0.145
16  1   16  20   contain    30     0.013    0.025     0.068    0.078     0.119    0.128
16  1   16  20   satterth   15.1   0.010    0.025     0.060    0.078     0.110    0.128
3   1   3   5    contain    4      0.026    0.105     0.090    0.178     0.157    0.235
3   1   3   5    satterth   2.61   0.013    0.105     0.056    0.178     0.107    0.234
8   1   3   5    contain    9      0.078    0.132     0.168    0.210     0.226    0.271
8   1   3   5    satterth   2.62   0.026    0.132     0.070    0.210     0.130    0.271
3   1   8   5    contain    9      0.020    0.046     0.062    0.089     0.117    0.144
3   1   8   5    satterth   7.94   0.016    0.046     0.059    0.089     0.112    0.144
8   1   8   5    contain    14     0.026    0.035     0.056    0.080     0.107    0.131
8   1   8   5    satterth   7.73   0.014    0.035     0.048    0.080     0.090    0.131

Table 1: Two-sample problem – 1000 simulations
Split-Plot Experiment

We consider the model

Yijk = µ + αi + βj + wik + εijk, i = 1, 2, k = 1,...,ni, j = 1,...,nik

where wik ∼ N(0, σ²w) and εijk ∼ N(0, σ²).

• Make simulations for different values of σ²w.
• In the simulations α1 = α2.
• Test the hypothesis α1 = α2.
The design is as follows:

n1 = 3 and n2 = 8

i = 1:              j = 1,...,nik = 5
i = 2, k = 1,...,3: j = 1,...,nik = 3
i = 2, k = 4,...,8: j = 1,...,nik = 9

So all problems are due to unbalancedness (rather than variance heterogeneity as before).
σ   σw   Method     DDF    Fpr0.01  χ²pr0.01  Fpr0.05  χ²pr0.05  Fpr0.10  χ²pr0.10
1   1    contain    9      0.007    0.030     0.050    0.068     0.086    0.125
1   1    satterth   9.67   0.012    0.030     0.051    0.068     0.088    0.125
3   1    contain    9      0.004    0.018     0.037    0.064     0.083    0.125
3   1    satterth   21.7   0.009    0.018     0.043    0.064     0.098    0.125
6   1    contain    9      0.002    0.014     0.020    0.043     0.057    0.086
6   1    satterth   33.5   0.012    0.014     0.034    0.043     0.072    0.086
9   1    contain    9      0.002    0.020     0.034    0.063     0.083    0.116
9   1    satterth   36.5   0.011    0.020     0.054    0.063     0.097    0.116

Table 2: Split-Plot Experiment – 1000 simulations
Making the "right" tests with PROC MIXED

A typical SAS program for analyzing the split-plot data above is:

proc mixed data=sim noitprint;
  class i j k subject;
  model y = i j / ddfm=contain chisq;
  random i*k;
run;

• The containment method is the default in PROC MIXED (but can be specified explicitly with ddfm=contain in the MODEL statement).
• This tells SAS that when testing any of the fixed effects in the model, it should look for a random effect which syntactically contains the fixed effect: since i is contained in i*k, SAS then knows that it is against this random effect that the test should be made.
• It is well known that this is the right thing to do when the experiment is balanced.
A Severe Warning!!

A very commonly made mistake in this connection is the following: each combination (i, k) often identifies an experimental entity, e.g. an animal or a (whole) plot in a field. Typically one would have a variable in the data set identifying such an entity. For illustration we have made a variable, called subject, defined as (i, k). A typical SAS program would then be:

proc mixed data=sim noitprint;
  class i j k subject;
  model y = i j / ddfm=contain chisq;
  random subject;
run;

Such a program is made under the mistaken impression that since subject and (i, k) really identify the same units in the experiment, it should be immaterial what one writes.

This is not true, and the reason is the following:

Since i is not syntactically contained in subject, the tests (for the effect of the factor i) would be made against the residual variance, which we know is wrong.
To emphasize this point, suppose that we declare a new variable icopy which is just a copy of i. Then writing

proc mixed data=sim noitprint;
  class i j k subject icopy;
  model y = i j / ddfm=contain chisq;
  random icopy*k;
run;

will also make SAS perform the test of the effect of the factor i against the residual variance, which, as pointed out above, is wrong.

If, however, we write ddfm=satterth in any of the examples above, then SAS will actually identify the right variance component to test the effect of factor i against.
Some Tentative Conclusions on Satterthwaite

• For small samples, Satterthwaite's method performs much better than the default Containment method.
• For larger samples, there is not much difference between the two methods. In practice this is because the difference between the quantiles of an F(1, 7) and an F(1, 14) distribution is not large, whereas the difference between the quantiles of an F(1, 2) and an F(1, 4) distribution can be substantial.
• Both methods generally perform better than the large-sample χ² tests.
• A drawback of Satterthwaite's method is that it is computationally somewhat intensive.
• The results suggest using Satterthwaite's approximation.
Random or Fixed Effects?

Sometimes it is straightforward to decide whether a specific effect should be considered random or fixed. In other cases it is a more delicate issue.

The text below is taken from lecture notes by L.R. Schaeffer, University of Guelph, Ontario, Canada:

Fixed factors are factors in which the classes comprise all of the possible classes of interest that could be observed. For example, the sex of an animal is either male, female, sterilized male, or sterilized female. If the number of classes in a factor is small and confined to this number even if conceptual resampling were performed an infinite number of times, then the factor is likely fixed. Other examples are age classes, lactation number, management system, cage number, and breed class. Usually, if the sampling were to be repeated a second time, those factors which maintain the same classes between the two samplings would be fixed factors. For example, a growth trial on pigs using two diets would probably need to use the same housing facilities, the same age groups of pigs, and the same diets, but the individual pigs would necessarily have to be new animals because an animal could not go through the same growth phase a second time in its life. Pig effects would be considered a random factor while the other effects would be fixed.

Random factors are factors whose levels are considered to be drawn randomly from an infinitely large population of levels. As in the previous pig experiment, pigs were considered random because the pig population of the world is large enough to be considered infinitely large, and the group that were involved in that experiment were a random sample from that population. In actual fact, however, the pigs on that experiment were likely sampled from those relatively few pigs that were available at the time the trial started, but still they are considered to be a random factor because if the experiment were to be repeated again, there would likely be a completely different group of pigs involved.
Another way to determine if a factor is fixed or random is to know how the results will be used. In a nutrition trial the results infer something about the diets in the trial. The diets are specific and no inferences should be made about other diets not tested in the experiment. Hence diet effects would be a fixed factor. On the contrary, if animal effects were in the model, inferences about how any animal might respond to a specific diet may need to be made. There should not be anything peculiar about the animals on the trial that would nullify that inference. Animal effects would be a random factor.

In general, a few questions need to be answered to make the correct choice of fixed or random factor designation. Some of the questions are:

1. How many levels of the factor are in the model? If small, then perhaps this is a fixed factor. If large, then perhaps this is a random factor.
2. Is the number of levels in the population large enough to be considered infinite? If yes, then perhaps this factor is random.
3. Would the same levels be used again if the experiment were to be repeated a second time? If yes, then perhaps this factor is fixed.
4. Are inferences to be made about levels not included in the experiment? If yes, then perhaps this factor should be random.
5. Were the levels of a factor determined in a nonrandom manner? If yes, then perhaps this factor should be treated as fixed.

By studying the scientific literature, a researcher should be able to get some help in this decision process. If in doubt, then the assistance of an experienced statistician should be sought.
Multilocation Trials
Consider the following setup:
• Four treatments, e.g. of housing systems for pigs are to be compared.
• Studies are carried out on 9 farms (locations)
• Within each farm a randomized block design with 3 blocks is employed,
i.e. each treatment is repeated 3 times within each farm, once in each
block.
How to analyze such data?
Note that since there are replicates within each farm, the
farm–treatment interaction can be estimated.
The following model seems appealing:
yijk = µ + τi + Lj + (RL)jk + (τL)ij + εijk
where i = 1 . . . 4 is treatment, j = 1 . . . 9 is location and k = 1 . . . 3 is
block.
It is reasonable to assume that (RL)jk and εijk are random. But other
effects need more consideration:
• One can consider Lj and (hence) (τL)ij as being random.
• Alternatively one can consider Lj and (τL)ij to be fixed effects.
The effects in question can be considered random if the farms (locations)
are random representatives from the population of farms with specific
characteristics.
But if the farms are selected as e.g. "those 9 farms whose owners responded to a questionnaire sent out to all farms with given characteristics", then the farms are not random representatives from the population. In that case, the effects in question should be regarded as fixed, and one cannot extrapolate the conclusions from the study beyond these 9 farms.
What to do if 6 farms are selected randomly, while 3 are not?
What to do if there are only 3 randomly selected farms in the study?
11 Examples of Split-Plot Designs
The purpose of this lecture was to illustrate the kind of problems that may arise if split-plot designs are not treated properly. Most of the experiments presented were made at the Danish Institute of Agricultural Sciences, or rather the National Institute of Animal Science, as it was called in those days.

Another common aspect of several of the experiments was that they had led to heated debate. The pros and cons in those debates were presented.
Link to the full screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SPLITPLOTExamples.pdf
. . . After reading 50 of these papers in AABS-issues [Applied
Animal Behaviour Science] of 1984 and 1985, we found that in
about 25 cases statistical methods were used incorrectly. The
main defect was that observations entered into test statistics
were not independent. In a number of cases it was totally
unclear how the authors made their computations
Hoekstra & Jansen, AABS 16 (1986) 303-308
Example: W. Schouten's Ph.D. work

Rearing conditions and behaviour in pigs.

How does early experience influence later behaviour?

'Barren' farrowing crates (2 × 2 m²) vs. 'enriched' large straw pens (28 m²).

8 sows (4 sister-pairs). Within each sister-pair the pigs were assigned to treatment at random. Each litter consisted of 8 pigs, i.e., a total of 64 piglets.

Detailed behavioural observations.
ANOVAs

                     Reported Model              Litter Averages
Effect               df    SS         F          df    SS       F
Sister-Pair          3     384.2                 3     48.0
Housing System       1     893.3      5.14*      1     117.0    2.424
Residual             59    10253.7               3     138.2
Total                63                          7
Mixed model formulation

Reported model:

Yijk = µ + Pi + Hj + εijk

Pi: effect of sister pair, i ∈ {1,...,4}. Hj: effect of housing. εijk: random residual.

Correct model:

Yijk = µ + Pi + Hj + Sij + εijk

Sij: effect of sow.
Breed effect on production

Are the present feeding standards for essential nutrients per FUp sufficient for ad lib feeding?

Beretning 579. A. Just et al. (1985).

6 litters (YY) and 6 litters (LL) of 6 (7) pigs each (boars, gilts, castrates). Two levels of nutrient concentration in the feed.
Model
Yijkl = µ + ai + bj + ck + dl(j) + (ab)ij + (ac)ik + εijkl
• ai: effect of feed nutrient concentration, i ∈ {1, 2}
(Norm vs. Norm +20%).
• bj: effect of breed, j ∈ 1, 2 (LL and YY).
• ck: effect of sex k, k ∈ {1, 2, 3}.
• dl(j): effect of litter l within breed j.
• (ab)ij: interaction between feed concentration and breed.
• (ac)ik: interaction between feed concentration and sex.
• εijkl: random residual.
Similar designs
• Breeding line vs. pecking behaviour
• Rearing Conditions vs. later productivity
• Effect of organic feed.
• Effect of GMO production.
Straw shortener

A number of sows were fed either control feed or feed containing straw from fields treated with a straw shortener (CCC). To investigate long-term effects the study covered 4 parities.

Reported model:

Yijk = µ + ti + pj + (tp)ij + εijk

Yijk: observed variable, e.g. litter size. ti: effect of treatment. pj: effect of parity. (tp)ij: interaction between parity and treatment. εijk: random residual.

Correct model:

Yijk = µ + ti + pj + (tp)ij + Sik + εijk

Sik: effect of sow k on treatment i, Sik ∼ N(0, σ²S).
Group housing

Loose-housed sows. Automatic feeding systems.

Hypothesis: Pelleted feed reduces aggression compared with mealy feed.

Hypothesis: Pelleted feed reduces the effect of rank on received aggression.
Herd Investigations

Inspired by Nørgard (1999).

Yijklm = µ + ai + sj + Hijk + vl + (vs)jl + εijklm

• Yijklm: measurement at slaughter.
• ai: effect of abattoir i.
• sj: effect of herd disease state j.
• Hijk: random effect of herd, Hijk ∼ N(0, σ²H).
• vl: effect of season l.
• (vs)jl: interaction between season and disease state.
• εijklm: random residual for the mth animal, εijklm ∼ N(0, σ²).
Multi location trials
Yijk = µ + τi + Lj + R(L)jk + (τL)ij + εijk
• τi: effect of treatment
• Lj: effect of location
• R(L)jk: random effect of block within location, R(L)jk ∼
N (0, σ2R)
• (τL)ij: interaction between treatment and location
• εijk: residual εijk ∼ N (0, σ2)
12 Estimation and tests in mixed models
The purpose of this lecture was to give a detailed description of theoretical issues of estimation and tests in mixed models, i.e. properties of maximum likelihood estimators in the linear normal model and the mixed linear normal model. Concepts such as ML and REML are introduced.
Link to the full screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/MLMixed.f.pdf
Maximum Likelihood and Linear Normal Models
Example 1. Consider the linear regression model
yi = β0 + β1xi + εi
We shall show that the maximum likelihood estimate and the least squaresestimate for
β = (β0, β1)
are identical.
Because of the independence, the joint density for y1,...,yn (and hence the likelihood function) becomes

f(y1,...,yn; β) = ∏(i=1..n) f(yi; β)
               = ∏(i=1..n) (1/√(2π)) (1/σ) exp(−(1/(2σ²))(yi − (β0 + β1xi))²)
               = (1/√(2π))^n (1/σ^n) exp(−(1/(2σ²)) ∑i (yi − (β0 + β1xi))²)
               = L(β)
The likelihood function is

L(β) = (1/√(2π))^n (1/σ^n) exp(−(1/(2σ²)) ∑i (yi − (β0 + β1xi))²)

• Let D(β0, β1) = ∑i (yi − (β0 + β1xi))².
• If σ is known then L(β) is maximized by minimizing the sum of squared deviations D(β0, β1) (because of the "−" sign in the exponential).
• Therefore the maximum likelihood estimate is the same as the least squares estimate.

fin
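As a quick numerical check (a hedged sketch; the data set reg with variables y and x is hypothetical), the OLS fit from PROC REG and the ML fit from PROC MIXED should agree on β0 and β1:

proc reg data=reg;
  model y = x;                    /* least squares estimates */
run;

proc mixed data=reg method=ml;
  model y = x / solution;         /* ML estimates: identical to OLS here */
run;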
For a general linear normal model

y = Xβ + ε, where ε ∼ N(0, σ²I),

the likelihood is

L(β, σ²) = (1/√(2π))^n (1/σ^n) exp(−(1/(2σ²)) ∑i (yi − µi)²)
         = (1/√(2π))^n (1/σ^n) exp(−(1/(2σ²)) (y − Xβ)>(y − Xβ))

Hence the maximum likelihood estimate for β is found by minimizing (y − Xβ)>(y − Xβ).
Once β̂ (and hence µ̂) is found, it is not hard to verify that L(β̂, σ²) is maximized as a function of σ² by

σ̂² = (1/n)(y − Xβ̂)>(y − Xβ̂)

However, in practice one never uses the ML estimate for σ². Instead one uses

σ̃² = (1/(n − p))(y − Xβ̂)>(y − Xβ̂)

where p is the number of parameters in the model.

The reason for using σ̃² instead of σ̂² is that

E(σ̃²) = σ²,  E(σ̂²) = ((n − p)/n) σ²

That is, σ̃² is an unbiased estimate of σ² while σ̂² is biased.
It can be noted that

σ̃² = (1/(n − p))(y − Xβ̂)>(y − Xβ̂)

is called the REML estimate of σ², where REML means REstricted (or REsidual) Maximum Likelihood.

The REML method is frequently applied in connection with mixed models in an attempt to obtain unbiased variance estimates.
Maximum Likelihood Estimation in Mixed Models

For a mixed model

y = Xβ + Zu + ε

the variance of y is Cov(y) = V = Z Cov(u) Z> + Cov(ε).

• The unknown parameters are in this case (β, V).
• The typical case is that V itself depends only on a small number of parameters, e.g. on α = (σ²r, σ²w, σ²) in a split-plot experiment.
• So we write V = V(α).
In mixed models, maximum likelihood estimation becomes much more involved.

The likelihood function is

L(β, V) = (2π)^(−n/2) det(V)^(−1/2) exp(−(1/2)(y − Xβ)>V−1(y − Xβ))

Here det(V) is a number, called the determinant of V.

There are two situations to consider: when V is known and when V is unknown.
Case 1 – V is known: If V is known then L is maximized by minimizing

(y − Xβ)>V−1(y − Xβ)

This quantity is minimized by

β̂ = (X>V−1X)−1 X>V−1 y

which is also the weighted least squares estimate of β.
Case 2 – V is unknown: If V is unknown (which of course is generally the case in practice), things become more complicated. There are different approaches available. Two of these are

• Maximum Likelihood (ML) and
• Restricted Maximum Likelihood (REML)
Maximum Likelihood: The expression

β̂(V) = (X>V−1X)−1 X>V−1 y

depends on V, which is unknown. If this expression for β is substituted into L we get

L(β̂(V), V) = (2π)^(−n/2) det(V)^(−1/2) exp(−(1/2)(y − Xβ̂(V))>V−1(y − Xβ̂(V)))

This likelihood now depends only on V.

Maximization of L has to be done iteratively. This gives V̂ and hence

β̂(V̂) = (X>V̂−1X)−1 X>V̂−1 y

Typically V depends only on a few parameters, say α, so we write V = V(α). In that case L(β̂(V(α)), V(α)) has to be maximized as a function of α.
Restricted Maximum Likelihood:

An alternative to ML estimation is REML estimation. This is the default method in PROC MIXED.

Consider a mixed model

y = Xβ + Zu + ε,  where Var(y) = V

and V and β are unknown.

If β had been known, the residuals would be

ε = y − Xβ ∼ N(0, V)
and one could use the ML method from before for estimating V.

However, β is not known. Therefore one frequently does the following: the least squares estimate of β is

β̂ls = (X>X)−1 X> y

which, while not the optimal estimate for β, is still unbiased.

One then considers the residuals

ε̂ls = y − Xβ̂ls ∼ N(0, A(X) V A(X)>)

where A(X) is a known matrix which is a function of X.

The likelihood for the "residuals" ε̂ls then depends only on V, and one can maximize that likelihood numerically. This gives the REML estimate V̂reml of V. When V depends on fewer parameters α, the result is the REML estimate α̂reml.

With this estimate at hand we can estimate β as

β̂reml = β̂(V̂reml) = (X>V̂reml−1 X)−1 X>V̂reml−1 y
Using ML or REML

In practice the ML and the REML estimates do not differ much. The main argument for REML estimation is that, at least in balanced cases, V̂reml is unbiased while V̂ml is not. Whether V̂reml is always unbiased is not known.
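As a small illustration of how the estimation method is selected in PROC MIXED (REML is the default; METHOD=ML requests maximum likelihood), here is a sketch based on the ingot example from the earlier lecture:

proc mixed data=rcb method=ml;   /* compare with the default METHOD=REML */
  class ingot metal;
  model pres = metal;
  random ingot;
run;

Comparing the "Covariance Parameter Estimates" tables from the two runs shows the (typically small) difference between the ML and REML variance component estimates.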
Tests in Mixed Models
In dealing with tests in mixed models we shall first assume that the
covariance matrix V is known.
Typically we are interested in testing hypotheses of the form λ>β = k for
some vector λ and some number k (often k = 0.)
We know that the contrast λ>β is estimable if and only if there is a
vector a such that a>X = λ>.
The estimate of the contrast $\lambda^\top\beta$ is $a^\top X\hat\beta$, where
$$X\hat\beta = X(X^\top V^{-1}X)^{-1}X^\top V^{-1}y$$
Standard calculations give that
$$\mathrm{Var}(X\hat\beta) = X(X^\top V^{-1}X)^{-1}X^\top V^{-1}X(X^\top V^{-1}X)^{-1}X^\top = X(X^\top V^{-1}X)^{-1}X^\top$$
so
$$X\hat\beta \sim N(X\beta,\ X(X^\top V^{-1}X)^{-1}X^\top).$$
Hence
$$a^\top X\hat\beta \sim N(a^\top X\beta,\ a^\top X(X^\top V^{-1}X)^{-1}X^\top a)$$
If the hypothesis $\lambda^\top\beta = k$ is true then
$$a^\top X\hat\beta - k \sim N(0,\ a^\top X(X^\top V^{-1}X)^{-1}X^\top a)$$
Therefore, if V is known, the task is to test whether $E(a^\top X\hat\beta - k) = 0$ when $\mathrm{Cov}(a^\top X\hat\beta - k)$ is known.
This can be done by constructing the statistic
$$X^2 = (a^\top X\hat\beta - k)^\top [a^\top X(X^\top V^{-1}X)^{-1}X^\top a]^{-1}(a^\top X\hat\beta - k)$$
which under the hypothesis has a $\chi^2(f_1)$–distribution, where $f_1$ is the number of parameters “eliminated” in the contrast $a^\top X\beta = k$.
The problem is what to do when V is unknown.
In some cases (e.g. in a split–plot experiment) the structure of V is such that $V = \omega^2 W^{-1}$, where W is known and $\omega^2$ is unknown.
In that case, one can construct an F–statistic
$$F = \frac{(a^\top X\hat\beta - k)^\top [a^\top X(X^\top W^{-1}X)^{-1}X^\top a]^{-1}(a^\top X\hat\beta - k)/f_1}{\hat\omega^2}$$
which under the hypothesis has an $F_{f_1,f_2}$–distribution.
How to derive $f_2$ shall not be discussed here. We just note that PROC MIXED attempts to construct such test statistics and to derive the appropriate number $f_2$ of denominator degrees of freedom.
In this connection it is to be pointed out that it is extremely important to
specify the random effects in the RANDOM–statement in the correct way.
Another approach is to construct approximate F–tests by establishing a denominator D such that
$$F = \frac{(a^\top X\hat\beta - k)^\top [a^\top X(X^\top V^{-1}X)^{-1}X^\top a]^{-1}(a^\top X\hat\beta - k)/f_1}{D/f_2}$$
has an approximate F–distribution when the hypothesis is true.
Adding the option DDFM=SATTERTH to the MODEL statement causes PROC MIXED to attempt to construct such tests.
A final option is the following:
When $n \to \infty$ (in a suitably regular way) $\hat V$ and V become indistinguishable.
Therefore, one approach is to simply “pretend” that the ML estimate $\hat V$ is the true, but unknown, variance V.
One can force PROC MIXED to make such tests by adding the CHISQ option to the MODEL statement.
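A minimal sketch of where the option goes (dataset and effect names are placeholders, not from the course data):

proc mixed data=mydata;
  class block treat;
  model y = treat / chisq;   /* chi-square tests reported alongside the F-tests */
  random block;
run;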
13 Complications concerning Variance Components
This lecture illustrates some of the problems that may arise because of numerical problems in the iterative search for the maximum likelihood, and the reason why some of the variance components are set equal to 0.
Based on an example from one of the exercises, the profile of the likelihood function is illustrated.
A special problem is that Satterthwaite's approximation fails in the cases where a variance component is set to 0 and the G matrix is not positive semidefinite. Rules of thumb are suggested in that case.
Finally, the relevance of a test of a positive variance component is discussed, e.g. comparable to a test of a block effect when block is treated as a fixed effect.
Link to the fullscreen presentation: http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/Complicate.pdf
Sugar beet example
Pct Sukk
Num Den
Effect DF DF F Value Pr > F
OPTAGN 1 2 15.21 0.0599
SAATID 4 16 189.37 <.0001
OPTAGN*SAATID 4 16 5.37 0.0061
Kg
OPTAGN 1 18 336.85 <.0001
SAATID 4 18 408.52 <.0001
OPTAGN*SAATID 4 18 12.70 <.0001
Inspection of Log
Pct Sukk
NOTE: Convergence criteria met.
NOTE: There were 30 observations read from the data set WORK.ROER.
Kg
NOTE: Convergence criteria met.
NOTE: Estimated G matrix is not positive definite.
NOTE: There were 30 observations read from the data set WORK.ROER.
Sugar beet example
Table 1: Covariance Parameter Estimates
Pct Sukk
Cov Parm Estimate Alpha Lower Upper
BLOK 0.001000 0.05 0.000164 37.9371
BLOK(OPTAGN) 0.001000 0.05 0.000219 0.2840
Residual 0.001333 0.05 0.000740 0.003088
Kg
BLOK 0.05344 0.05 0.01660 3.13E192
BLOK(OPTAGN) 0 . . .
Residual 5.1215 0.05 2.9241 11.2004
Outline
• Estimation of variance components
  – Why is $\hat\sigma^2_X = 0$?
  – Consequences
  – Rules of thumb
• Are random effects significant?
  – Are we really interested?
  – Likelihood ratio tests
Reason
The likelihood function is maximized subject to the constraint that the variance component parameters satisfy $\sigma^2_X \geq 0$.
The precision of numerical optimisation methods depends on the internal representation of numbers in the computer. PROC MIXED solves this by setting $\hat\sigma^2_X = 0$ if it is close to 0.
Other statistical packages (R, S-Plus) handle the constraint by maximising the likelihood as a function of $\log(\sigma^2_X)$.
Sometimes (e.g., with repeated measurements) the assumption that $\sigma^2_X \geq 0$ cannot be justified.
Likelihood contour plot, Pct Sukk
[Figure: contour plot of the likelihood as a function of $\log_{10}(\sigma^2_{B(O)})$ and $\log_{10}(\sigma^2_B)$]
Likelihood contour plot, Kg
[Figure: contour plot of the likelihood as a function of $\log_{10}(\sigma^2_{B(O)})$ and $\log_{10}(\sigma^2_B)$]
G Not positive Definite
$$V(u) = G = \begin{bmatrix} \sigma^2_B & 0 & 0 & 0 & 0 & 0\\ 0 & \ddots & 0 & 0 & 0 & 0\\ 0 & 0 & \sigma^2_B & 0 & 0 & 0\\ 0 & 0 & 0 & \sigma^2_{B(O)} & 0 & 0\\ 0 & 0 & 0 & 0 & \ddots & 0\\ 0 & 0 & 0 & 0 & 0 & \sigma^2_{B(O)} \end{bmatrix}$$
G Not positive Definite
$$G = \begin{bmatrix} \sigma^2_B & 0 & 0 & 0 & 0 & 0\\ 0 & \ddots & 0 & 0 & 0 & 0\\ 0 & 0 & \sigma^2_B & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & \ddots & 0\\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad G^{-1} = ???$$
Warning: Satterthwaite Goes Wrong
Satterthwaite's approximation uses the estimated variance components for the calculation of test degrees of freedom. The calculations include differentiation with respect to $\sigma^2_X$. At boundary values such as 0 this derivative is not defined.
In the PARMS statement a lower bound on the estimated variance components may be specified, e.g.,
PARMS /LBOUND=0.001,0.001,0.001;
This produces the same problems as $\hat\sigma^2_X = 0$.
Conclusions
• If the estimated covariance parameters are > 0, use Satterthwaite's approximation.
• If not:
  – If model reductions are ”natural”, re-estimate the parameters using the revised models.
  – Nested designs should be reformulated to maintain the design.
  – Use the containment method, but be careful to specify the model syntactically correctly. (Compare with the RANDOM statement in GLM.)
Testing Effects of Random Components
• Why are we interested in testing $\sigma^2_B > 0$?
• Model reduction.
• $\hat\sigma^2_B = 0$ is not a test and may not be used for this purpose.
• Fixed effects vs. random effects.
• Biological significance, i.e., if we sample x individuals at random, what is the average difference between the lowest and the highest, and what is a confidence interval for the difference? What is the correlation, heritability, repeatability, sensitivity and specificity?
Model Reduction
Consider a model A and a model B that represents a special case of A, e.g., one of the variance components $\sigma^2_X = 0$. B is said to be nested within A. In this case a likelihood ratio test may be performed.
Then $2(\mathrm{LogLike}_A - \mathrm{LogLike}_B)$ is asymptotically $\chi^2$ distributed with $(p_A - p_B)$ degrees of freedom, where $p_A$ is the number of parameters in model A.
NB! This is not feasible if $\hat\sigma^2_X = 0$.
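A minimal sketch of such a comparison in PROC MIXED (dataset and effect names are placeholders; the −2 log likelihood values below are invented and must be taken from the two Fit Statistics tables):

proc mixed data=mydata method=ml;   /* model A: with the block variance component */
  class block treat;
  model y = treat;
  random block;
run;
proc mixed data=mydata method=ml;   /* model B: sigma2_block = 0 */
  class block treat;
  model y = treat;
run;
data lrt;
  m2llA = 100.8;                    /* "-2 Log Likelihood" from model A */
  m2llB = 105.2;                    /* "-2 Log Likelihood" from model B */
  chi2  = m2llB - m2llA;            /* equals 2(LogLike_A - LogLike_B) */
  p     = 1 - probchi(chi2, 1);     /* 1 df: one variance component removed */
run;
proc print data=lrt; run;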
General Recommendations
• Using ML, any nested models may be compared.
• Using REML, only nested models with identical fixed effects may be compared.
• With respect to tests for variance components this test is conservative, i.e., the true p-value is smaller than the calculated one. Thus the test results in too few significant findings.
• With respect to tests for fixed effects this test is anti-conservative, i.e., the true p-value is larger than the calculated one. Thus the test results in too many significant findings. (Therefore likelihood ratio tests should not be used for fixed effects.)
Fixed Effects
If the variance component is 0, this implies that $u_i = u_j$ for every i and j, i.e., reformulate the model and treat the factor of interest as fixed.
However:
$\hat u_i \approx \hat u_j$ does not imply that $\sigma^2_u = 0$.
Biologically significant
• Very often the real interest can be formulated as an interval for the variance component parameter, e.g., is it larger than some preset 'irrelevance' level?
• The confidence intervals produced with the CL option in the PROC MIXED statement are often sufficient for this. However, the general comment about sufficient sample size is VERY relevant here.
• Many 'biologically' relevant parameters are combinations of several variance component parameters, e.g., the correlation (repeatability) $\sigma^2_A/(\sigma^2_\varepsilon + \sigma^2_A)$. Therefore the joint distribution of the parameter estimates needs to be considered. This is not trivial (Interest ???).
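A sketch of how the point estimate of the repeatability could be computed from the PROC MIXED output (all dataset and variable names are placeholders; the model is a simple one with repeated records per animal):

proc mixed data=mydata cl;
  class animal;
  model y = ;                      /* intercept-only mean structure */
  random animal;
  ods output covparms=cp;          /* capture the variance component estimates */
run;
proc transpose data=cp out=cpt;
  var estimate;
  id covparm;                      /* creates variables named 'animal' and 'Residual' */
run;
data repeat;
  set cpt;
  repeatability = animal / (animal + residual);
run;
proc print data=repeat; run;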
Covariance Matrix: Sugar beet PCT Sukk
Asymptotic Covariance Matrix of Estimates
Row Cov Parm CovP1 CovP2 CovP3
1 BLOK 3.069E-6 -8.02E-7
2 BLOK(OPTAGN) -8.02E-7 1.613E-6 -4.44E-8
3 Residual -4.44E-8 2.222E-7
14 Repeated Measurements
This lecture gives an introduction to repeated measurements, and is a supplement to Chapter 3 in LMSW (Littell et al., 1996). It illustrates how it is possible to modify the tacit assumptions of the split-plot design into a more flexible modelling of the variance matrix.
Different variance structures are illustrated graphically and the use of SAS to compare different structures is presented. The AR(1) and CS structures are discussed in detail. Finally, methods for comparison between different structures are shown.
Link to full-screen presentation: http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/Repeated.f.pdf
Analyzing Repeated Measurements
Consider the setup:
• A treatment factor A with a levels is applied to individuals, e.g.
pigs.
• Within each treatment there are c individuals
• On each individual repeated measurements of the same response is
made at b different time points.
Example: Exercise Therapy (LMSW p. 88)
• Subjects (SUBJ) were assigned to one of three different training
programs (PROGRAM) on weightlifting.
• The strength (STRENGTH) of the subjects was measured every second day (TIME) for a two–week period from the start of the study.
Some questions:
• Is there a treatment effect?
• Is there an interaction between treatment and time?
Mean profiles
[Figure: group mean strength over time (1–7) for the three programs C, R and W]
The task: Comparison of the mean profiles
Clear evidence of treatment effect and treat–by–time interaction.
Individual profiles:
[Figure: individual strength profiles over time, one panel per program (CONT, RI, WI)]
No evidence of non–constant variance!!
Sometimes (but certainly not always!) repeated measurements can
be appropriately dealt with by a split–plot model.
• A statistical model for this situation could be
$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + w_{ik} + \varepsilon_{ijk}$$
where $w_{ik} \sim N(0, \sigma^2_w)$ and $\varepsilon_{ijk} \sim N(0, \sigma^2)$.
• Here i denotes treatment, k is replications (within treatment) and
j is “time”
• “Time” is called the within–subject factor.
• Note: “Time” can also refer to different locations, e.g. in the
intestine.
• It is the usual split–plot model!
Tacit Assumptions when using the Split–Plot Model
It is important to realize the assumptions one makes in applying a split–plot model to a repeated measurement problem:
1. It is assumed that the variance is constant.
This may not be a reasonable assumption: sometimes the variance increases with the mean, and if the mean changes over “time”, this assumption is violated.
If time is really location in the intestine, there might be certain segments where the variance of a given response is much larger than in other segments.
2. It is assumed that the correlation between two measurements on the
same individual is the same – no matter how far the measurements
are apart in time.
This may not be a reasonable assumption: Observations close
to each other in time might be expected to be more alike than
observations far from each other.
3. It is assumed that the correlation is positive.
This may not be a reasonable assumption: Consider a feeding
experiment. If the feed intake is lower than expected in one week
because of diseases it may be higher than expected in the next
week. Hence the observations would be negatively correlated.
4. It is assumed that the biological questions can be answered through the interaction $\gamma_{ij}$ and possibly the main effects $\alpha_i$ and $\beta_j$.
That might be too crude a model. For example, data might indicate that the mean value evolves over time in a specific way, e.g.
$$\mu_{ij} = \mu + \alpha_i + \beta_1 j + \beta_2 j^2$$
Modelling of Covariances
A classical way of thinking of a statistical model is as
Observables = Systematic effects + Random effects
Most frequently, the main interest is in the systematic effects, while
the random effects are considered a nuisance.
Yet, the random effects are important to understand and to model in
an appropriate way.
Types of random variation
[Figure: four simulated examples of random variation — m + e (pure residual), m + subj (random subject effect), m + ser (serial dependence), and m + subj + ser + e]
Can be summarized as:
• Random subject effect
• Serial dependence
• Residual variation
Unstructured Covariance Matrix
Consider Exercise Therapy data.
A very general model is the model where for each treatment i and
time j there is mean value µij, and the measurements have a
completely unstructured covariance matrix.
$$Y_{ik} = \begin{bmatrix} Y_{i1k}\\ \vdots\\ Y_{i7k} \end{bmatrix} \sim N_7\left(\mu_i = \begin{bmatrix} \mu_{i1}\\ \vdots\\ \mu_{i7} \end{bmatrix},\ V\right)$$
where k refers to subject within treatment, and where V is a $7\times 7$ unstructured matrix.
Since the subjects are independent the random vector arising after
stacking all Yiks on the top of each other has a covariance matrix
consisting of V ’s on the “diagonal” and 0s outside.
Such a matrix is said to be block diagonal.
Note that in V there are 7× 8/2 = 28 parameters.
This model can be fitted with the following SAS program:
proc mixed data=weight2;
class program subj time;
model strength = program time program*time / outP=pred;
repeated time / subject=subj*program type=un r rcorr; /* rcorr needed for the RCorr ODS table */
ods listing exclude r rcorr; ods output r=r rcorr=rcorr;
data r; set r; keep col1-col7;
data rcorr; set rcorr; keep col1-col7;
run;
The data set r contains the estimated covariance matrix, while rcorr contains the correlation matrix.
Note that V is the covariance matrix for $Y_{ik}$. But if we write $Y_{ik} = \mu_i + \varepsilon_{ik}$ (note: everything here is a vector) then V is also the covariance of the error terms $\varepsilon_{ik}$, which have mean 0.
The estimated correlation matrix is
1.0000 0.9602 0.9246 0.8716 0.8421 0.8091 0.7968
0.9602 1.0000 0.9396 0.8770 0.8596 0.8273 0.7917
0.9246 0.9396 1.0000 0.9556 0.9372 0.8975 0.8755
0.8716 0.8770 0.9556 1.0000 0.9601 0.9094 0.8874
0.8421 0.8596 0.9372 0.9601 1.0000 0.9514 0.9165
0.8091 0.8273 0.8975 0.9094 0.9514 1.0000 0.9531
0.7968 0.7917 0.8755 0.8874 0.9165 0.9531 1.0000
The AR(1)–model
Consider a sequence of measurements z1, z2, . . . , zT made on the
same experimental unit at T time points t = 1, . . . , T .
It is assumed that E(zt) = 0 for all t.
A frequently employed model is the AutoRegressive model of order
1, which states that
$$z_t = \rho z_{t-1} + \varepsilon_t, \qquad t = 2,\ldots,T$$
where $\varepsilon_t \sim N(0, \sigma^2)$, all independent, and where $-1 < \rho < 1$.
Hence what happens at time t is ρ times what happened at time t − 1, plus some random noise.
The variance of each zt is the same and is denoted ω2.
This variance can be found as:
$$\omega^2 = \mathrm{Var}(z_t) = \mathrm{Var}(\rho z_{t-1} + \varepsilon_t) = \rho^2\,\mathrm{Var}(z_{t-1}) + \mathrm{Var}(\varepsilon_t) = \rho^2\omega^2 + \sigma^2$$
Hence $\omega^2 = \dfrac{\sigma^2}{1-\rho^2}$.
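A small data step sketch that simulates such a series (parameter values are arbitrary); by the formula above the simulated $z_t$ should have variance $1/(1-0.5^2) \approx 1.33$:

data ar1;
  rho = 0.5;
  z = 0;                      /* start the series at its mean */
  do t = 1 to 50;
    z = rho*z + rannor(0);    /* z_t = rho*z_{t-1} + eps_t, sigma2 = 1 */
    output;
  end;
run;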
It is illustrative to investigate the covariance structure of this model.
First consider observations one time–step apart:
Cov(zt, zt−1) = Cov(ρzt−1 + εt, zt−1)
= ρCov(zt−1, zt−1) = ρVar(zt−1) = ρω2
Next we consider observations two time–steps apart:
Cov(zt, zt−2) = Cov(ρzt−1 + εt, zt−2)
= Cov(ρzt−1, zt−2) = ρCov(zt−1, zt−2)
= ρ2ω2
In general, the covariance between observations k time–steps apart is
$$\mathrm{Cov}(z_t, z_{t-k}) = \rho^k\omega^2$$
The correlation between observations k time steps apart therefore becomes
$$\gamma(k) = \mathrm{Corr}(z_t, z_{t-k}) = \frac{\rho^k\omega^2}{\omega^2} = \rho^k$$
The number k is called the lag between the observations and γ(k) is
called the autocorrelation function
If the postulated model is correct, the autocorrelation should tend to
0 as the lag increases.
Some Autocorrelations
[Figure: autocorrelation functions $\rho^k$ and corresponding simulated series for ρ = 0.5, ρ = −0.5, ρ = 0.9 and ρ = 0.1]
How to estimate the autocorrelation??
A very brute–force way of estimating the autocorrelation is the
following: Suppose there are observations from 4 time points, i.e.
t = 1, . . . , 4 on many subjects and assume observations all have zero
mean.
Then the (symmetric) matrix of correlations is
$$\mathrm{Corr} = \begin{bmatrix} 1 & \rho_{12} & \rho_{13} & \rho_{14}\\ \rho_{12} & 1 & \rho_{23} & \rho_{24}\\ \rho_{13} & \rho_{23} & 1 & \rho_{34}\\ \rho_{14} & \rho_{24} & \rho_{34} & 1 \end{bmatrix}$$
Simple estimates of the autocorrelation for observations one, two
and three time–step apart are
$$\hat\gamma(1) = \tfrac{1}{3}(\hat\rho_{12} + \hat\rho_{23} + \hat\rho_{34}), \qquad \hat\gamma(2) = \tfrac{1}{2}(\hat\rho_{13} + \hat\rho_{24}), \qquad \hat\gamma(3) = \tfrac{1}{1}\,\hat\rho_{14}$$
Obviously, for higher values of k, γ(k) will be poorly estimated as it
is the average over few values.
The autocorrelation can be estimated (as described above) by invoking the macro:
%autocorr(r);
where r is the covariance matrix estimated in connection with the model with unstructured covariance matrix.
If the file autocorr.sas is located in e.g. c:\stat then the macro is included, i.e. made available, by submitting the statement
%include 'c:\stat\autocorr.sas';
This creates the SAS dataset autocorr with autocorrelation and lag.
The macro also creates a plot of the autocorrelation against lag:
[Figure: autocorrelation against lag for the Exercise Therapy data]
What can be concluded from that?
• There is a clear indication of positive correlation and that the
correlation decreases with time.
• Whether the correlation structure can be appropriately described
by ρk is another issue. There is not much evidence for or against
that structure.
Since all autocorrelations γ(k) are positive it is tempting to plot
log γ(k) against k as well.
The reason is that if the autocorrelation is γ(k) = ρk then
log γ(k) = k log ρ.
Hence a plot of log γ(k) = k log ρ against k should approximately
yield a straight line with intercept 0 and slope log ρ:
[Figure: log autocorrelation against lag for the Exercise Therapy data]
Again, there is not any strong evidence against the AR(1) structure.
From the graph it follows that the slope is approximately
log ρ ≈ −0.23/6 = −0.038 such that ρ ≈ 0.962.
Hence the correlation between observations does decrease as the
time between them increases – but it decreases very slowly!!
Compound Symmetry
The Split–plot model can also be formulated using a REPEATED
statement instead of a RANDOM statement.
proc mixed data=weight2;
class program subj time;
model strength = program time program*time;
repeated time / type=cs sub=subj(program) r rcorr;
ods listing exclude r; ods output r=r;
run;
Fortunately, the results using a REPEATED or a RANDOM statement are
the same!
The option type=cs specifies that the covariance matrix for each
subject has a compound symmetry structure:
$$\begin{bmatrix} \sigma^2 + \sigma^2_w & \sigma^2_w & \cdots & \sigma^2_w\\ \sigma^2_w & \sigma^2 + \sigma^2_w & \cdots & \sigma^2_w\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^2_w & \sigma^2_w & \cdots & \sigma^2 + \sigma^2_w \end{bmatrix}$$
From the SAS output one sees that the correlation between observations on the same subject is estimated to be
$$\frac{\hat\sigma^2_w}{\hat\sigma^2_w + \hat\sigma^2} \approx 0.8892$$
Which Covariance Structure to use?
With all this flexibility in choosing the covariance structure, some
guidelines are needed for choosing an appropriate one:
• Parsimony: Covariance structures with few parameters are most
attractive as there are fewer parameters to be estimated from data.
• Exploratory data analysis: A graphical investigation of the data
might suggest an appropriate covariance structure.
• Subject matter considerations: Sometimes the problem at hand
really dictates an appropriate covariance structure
• Necessity: Sometimes one is for numerical reasons forced to use a
very simple covariance structure – PROC MIXED might not be able
to fit the complex ones.
• Numerical criteria: There are some numerical criteria, which can
be a guideline.
Numerical Criteria
AIC and BIC are some criteria to be used. They are both the
log–likelihood + some term penalizing for the number of parameters
used in the model. BIC penalizes the use of many parameters harder
than AIC.
Smaller values of both criteria indicate a good fit.
For the Exercise Therapy the result is
Structure CS AR(1) UN
AIC 1424.9 1270.8 1290.9
BIC 1428.9 1274.9 1348.1
Hence the result is in favor of using the AR(1)–structure.
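A sketch of how such a comparison can be produced: fit the same mean model repeatedly, changing only the covariance structure, and read AIC/BIC off the Fit Statistics table (the subject specification is the one used earlier for these data):

proc mixed data=weight2;
  class program subj time;
  model strength = program time program*time;
  repeated time / subject=subj*program type=ar(1);  /* rerun with type=cs and type=un */
run;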
What does the covariance structure mean for the
conclusions?
For the Exercise Therapy the p–values for the test of no interaction
effect are:
Structure CS AR(1) UN
Program*Time 0.0005 0.3007 0.1297
Radically different conclusions!
The data really suggests that the interaction is present!
15 Repeated Measurements: Covariance Structures
This lecture gives an overview of how to specify different covariance structures in SAS via the REPEATED statement in PROC MIXED. The lecture is based on the description in the on-line SAS manual (http://dokumentation.agrsci.dk/sasdocv8/sasdoc/sashtml/onldoc.htm).
The most important types of covariance structure are presented.
• Unstructured (UN)
• Autoregressive (AR(1)–SP(POW))
• Antedependence (ANTE(1))
• Toeplitz (TOEP)
• Heterogeneous variance (ARH(1),CSH, etc.)
The pros and cons of the different structures are discussed.
Link to full screen presentation: http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RepeatedType.f.pdf
Repeated statement
Y = Xβ + Zu + ε
V(ε) = R
R is an n × n matrix, where n is the number of observations.
In order to handle this, a structure of the matrix is defined with
repeated use of the elements in the structure.
Repeated Statement
The syntax of the REPEATED statement
REPEATED < repeated-effect > < / options >;
Usually a formulation like:
REPEATED time / subj=animal*treat ;
A good precaution is always to specify the repeated-effect
Missing data: example
Treat Animal Time Y
A 1 1 12.4
A 1 2 .
A 1 3 14.5
B 1 1 14.3
B 1 2 15.3
B 1 3 14.8
... ... ... ...
PROC MIXED: REPEATED Statement
REPEATED < repeated-effect > < / options > ;
You can specify the following options in the REPEATED statement
after a slash (/): GROUP=effect, HLM, HLPS, LDATA=SAS-data-set, LOCAL, LOCALW, NONLOCALW, R<=value-list>, RC<=value-list>, RCI<=value-list>, RCORR<=value-list>, RI<=value-list>, SSCP, SUBJECT=effect, TYPE=covariance-structure
Types of variance structure
• Approximately 30 different structures
• “Time”/“linear” structure vs. spatial structure
• Homogeneous vs. heterogeneous variance
• “Banded” vs. full structure
Unstructured: type=un
The measurements of each subject
$$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} & \sigma_{14}\\ & \sigma_{22} & \sigma_{23} & \sigma_{24}\\ & & \sigma_{33} & \sigma_{34}\\ & & & \sigma_{44} \end{bmatrix}$$
Parameters: $t(t+1)/2$
Autoregressive: type=AR(1)
The measurements of each subject
$$\sigma^2\begin{bmatrix} 1 & \rho & \rho^2 & \rho^3\\ & 1 & \rho & \rho^2\\ & & 1 & \rho\\ & & & 1 \end{bmatrix}$$
$$Y_1 \xrightarrow{\ \rho\ } Y_2 \xrightarrow{\ \rho\ } Y_3 \xrightarrow{\ \rho\ } Y_4 \xrightarrow{\ \rho\ } Y_5$$
Autocovariance
[Figure: autocorrelation $\rho^{\mathrm{lag}}$ plotted against lag]
Autoregressive: type=SP(POW)
The measurements of each subject
$$\sigma^2\begin{bmatrix} 1 & \rho^{|t_2-t_1|} & \rho^{|t_3-t_1|} & \rho^{|t_4-t_1|}\\ & 1 & \rho^{|t_3-t_2|} & \rho^{|t_4-t_2|}\\ & & 1 & \rho^{|t_4-t_3|}\\ & & & 1 \end{bmatrix}$$
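A sketch of the corresponding REPEATED statement (dataset and variable names are placeholders): for SP(POW) the actual measurement times enter the structure, so the time variable is given as a numeric coordinate rather than as a CLASS variable:

proc mixed data=mydata;
  class treat animal;
  model y = treat;
  repeated / subject=animal type=sp(pow)(time);  /* time is a numeric variable */
run;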
Ante-Dependence: type=ANTE(1)
$$\mathrm{AR(1)}: \quad Y_1 \xrightarrow{\ \rho\ } Y_2 \xrightarrow{\ \rho\ } Y_3 \xrightarrow{\ \rho\ } Y_4 \xrightarrow{\ \rho\ } Y_5$$
$$\mathrm{ANTE(1)}: \quad Y_1 \xrightarrow{\ \rho_1\ } Y_2 \xrightarrow{\ \rho_2\ } Y_3 \xrightarrow{\ \rho_3\ } Y_4 \xrightarrow{\ \rho_4\ } Y_5$$
Ante-Dependence: type=ANTE(1)
The measurements of each subject
$$\begin{bmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho_1 & \sigma_1\sigma_3\rho_1\rho_2 & \sigma_1\sigma_4\rho_1\rho_2\rho_3\\ & \sigma_2^2 & \sigma_2\sigma_3\rho_2 & \sigma_2\sigma_4\rho_2\rho_3\\ & & \sigma_3^2 & \sigma_3\sigma_4\rho_3\\ & & & \sigma_4^2 \end{bmatrix}$$
Toeplitz: type=TOEP
The measurements of each subject
$$\begin{bmatrix} \sigma^2 & \sigma_1 & \sigma_2 & \sigma_3\\ & \sigma^2 & \sigma_1 & \sigma_2\\ & & \sigma^2 & \sigma_1\\ & & & \sigma^2 \end{bmatrix}$$
Heterogeneous variance
Instead of an identical variance at every time point, the variance is estimated at each time point.
In general, the type is found by simply adding an H to the type, i.e., csh, arh(1), toeph.
The structures are preserved as far as the correlation between time points is concerned.
More elaborate parametric techniques are available, e.g., LIN.
Conclusions
• Parsimony !
• Fixed observation times and similar intervals : AR(1)
(2 parms)
• Slightly varying observation times and similar intervals
: SP(POW) (2 parms)
• Fixed observation times but intervals of different type:
ANTE(1) (2t− 1 parms (heterogen. variance))
• Fixed observation times, similar intervals, no simple
lag-structure : TOEP (t− 1 parms)
AR vs CS
$$\mathrm{AR(1)}: \quad Y_1 \xrightarrow{\ \rho\ } Y_2 \xrightarrow{\ \rho\ } Y_3 \xrightarrow{\ \rho\ } Y_4 \xrightarrow{\ \rho\ } Y_5$$
$$\mathrm{CS}: \quad Y_1, Y_2, Y_3, Y_4, Y_5 \ \text{all connected through the common subject effect } A$$
16 Random Regression
The random regression model is discussed starting with an example from one of the exercises. The presentation supplements chapter 7: Random Coefficients in LMSW (Littell et al., 1996).
The basic idea behind random regression and the implementation of the model in PROC MIXED is shown. Finally, the implications for the covariance structure of the observations are presented.
Link to full-screen presentation: http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/RandomRegression.f.pdf
The Basic Idea behind Random Regression
Feeding pigs with different amounts of vitamin E supplement.
Weights recorded weekly.
[Figure: weight against time for the individual pigs, one panel per treatment (Cu = 1, 2, 3)]
• Clearly (random) between–subject (pig) variation.
• Approximately linear increase in weight.
• Slight tendency to larger dispersion between pigs at the end of the
study than at the beginning.
• Repeated measurement problem.
Aims:
• Find a regression model which describes the weight as function of
time.
• Draw inferences about possible treatment effects.
First idea: fit linear regression model (with random pig effect) and
treatment specific parameters:
$$y_{ijt} = \alpha_i + \beta_i t + U_{ij} + \varepsilon_{ijt}$$
Here, i is treatment, j is subject (pig) within treatment, t is time, $U_{ij} \sim N(0, \sigma^2_u)$ and $\varepsilon_{ijt} \sim N(0, \sigma^2)$, all independent.
title ’Linear regression (with random Pig effect)’;
title2 ’Treatment specific parameters’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Time /noint solution outp=R1 ;
random Cu*Pig;
run;
Plot the curves of residuals:
symbol i=j;
proc gplot data=R1;
by Cu;
plot resid*Time=Pig;
run;
[Figure: residual curves against time by pig, one panel per treatment (Cu = 1, 2, 3)]
The “residual curves” do not look random.
Second idea: fit individual linear regression model (with random pig
effect):
$$y_{ijt} = \alpha_i + \beta_{ij} t + U_{ij} + \varepsilon_{ijt}$$
where i is treatment, j is subject (pig) within treatment, t is time, and $U_{ij} \sim N(0, \sigma^2_u)$ and $\varepsilon_{ijt} \sim N(0, \sigma^2)$, independent.
title ’Individual linear regressions (with random Pig effect)’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Pig*Time /noint solution outp=R2;
random Cu*Pig;
ods output solutionf=sf2;
proc gplot data=R2;
by Cu;
plot resid*Time=Pig;
run;
[Figure: residual curves against time by pig for the individual regression model, one panel per treatment (Cu = 1, 2, 3)]
The “residual curves” now look much more random.
This approach gives a whole lot of parameter estimates βij, where i
refers to treatment and j to individual within treatment.
How to proceed with the analysis?
Analyzing the Individual Regression Coefficients
Frequently the task is to estimate the effect of time for each
treatment.
A tempting (and classical) way of doing this is to continue analyzing
the βijs.
For example, $\bar\beta_{i\cdot} = \frac{1}{J}\sum_j \hat\beta_{ij}$ is the average slope within treatment i.
The analysis could then proceed by comparing $\bar\beta_{1\cdot}$, $\bar\beta_{2\cdot}$ and $\bar\beta_{3\cdot}$ in some way.
Yet, it is somewhat unsatisfactory to first estimate the $\beta_{ij}$s as systematic effects and then afterwards analyze these as if they were random quantities.
279
Some graphics of the βijs:
[Figure: histograms and normal Q–Q plots of the estimated slopes Time*Cu*Pig for Cu = 1, 2, 3]
Random Regression
A random regression model is an alternative:
$$y_{ijt} = \alpha_i + \beta_i t + U_{ij} + B_{ij} t + \varepsilon_{ijt}$$
The systematic effects are as usual.
The random effects are $U_{ij} \sim N(0, \sigma^2_u)$, $B_{ij} \sim N(0, \sigma^2_B)$ and $\varepsilon_{ijt} \sim N(0, \sigma^2)$.
It is assumed that $\varepsilon_{ijt}$ is independent of $U_{ij}$ and of $B_{ij}$, but it need not be assumed that $U_{ij}$ and $B_{ij}$ are independent.
Hence
• $\beta_i$ is the population slope for pigs receiving the ith treatment.
• $B_{ij}$ describes the individual random deviations from the population slope.
In this way systematic and random variation of the regression coefficients can be separated.
Just as the parameter estimates in a regression usually are correlated, so might the random effects $U_{ij}$ and $B_{ij}$ be.
To obtain such flexibility, we assume
$$\begin{bmatrix} U_{ij}\\ B_{ij} \end{bmatrix} \sim N_2\left(\begin{bmatrix} 0\\ 0 \end{bmatrix},\ \begin{bmatrix} \sigma^2_U & \sigma_{UB}\\ \sigma_{UB} & \sigma^2_B \end{bmatrix}\right)$$
If σUB = 0 then Uij and Bij are independent.
How to ... In SAS
Independence:
title ’Random regression model (with random Pig effect)’;
title2’Independent intercepts and slopes’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Time / ddfm=satterth noint solution outp=R3;
random int Time / sub=Pig type=vc solution;
ods output solutionf=sf3;
ods listing exclude solutionr;
ods output solutionr=sr3;
run;
Independence of Uij and Bij is obtained by type=vc in the RANDOM
statement.
Dependence:
title ’Random regression model (with random Pig effect)’;
title2’Dependent intercepts and slopes’;
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Cu*Time / ddfm=satterth noint solution outp=R4;
random int Time / sub=Pig type=un solution;
ods output solutionf=sf4;
ods listing exclude solutionr;
ods output solutionr=sr4;
run;
Dependence of Uij and Bij is obtained by type=un in the RANDOM
statement.
Inference
In connection with random regression models we recommend always
using the ddfm=satterth option for estimating the degrees of
freedom.
Contrast etc. can be obtained as follows:
proc mixed data=CuFeed;
class Cu Pig;
model Weight = Cu Time Cu*Time / ddfm=satterth solution outp=R3;
random int Time / sub=Pig type=vc solution;
lsmeans Cu / diff;
estimate 'Slope: Cu1 vs Cu2' Cu*Time 1 -1 0;
estimate 'Slope: Cu1 vs Cu3' Cu*Time 1 0 -1;
estimate 'Slope: Cu2 vs Cu3' Cu*Time 0 1 -1;
run;
When a random regression coefficient is present in the model, then
it is important that the model also contains a random intercept.
To see why consider the random regression model
yijt = αi + βit + Uij + Bijt + εijt
Suppose that the scale of time t is changed to t′ = c1t + c2. Then it
would be very desirable to obtain the same result whether t or t′ was
used as time in the regression.
Now we use t′ in a random regression model without random
intercept:
yijt = αi + βit + Bijt′ + εijt
= αi + βit + Bij(c1t + c2) + εijt
= αi + βit + (Bijc1t) + (Bijc2) + εijt
Hence Bijc2 will play the role of a random intercept.
In other words, the presence of a random intercept is a matter of the scale on which t is measured.
Likewise, in a polynomial regression involving t2: If there is a
random regression coefficient for t2 then there must also be a
random regression coefficient for t and a random intercept.
Correlation structure in Random Regression Models
Consider again the random regression model
yijt = αi + βit + Uij + Bijt + εijt
and assume for simplicity that Uij and Bij are independent.
The variance of $Y_{ijt}$ is
$$\mathrm{Var}(Y_{ijt}) = \sigma^2_U + \sigma^2_B t^2 + \sigma^2_e$$
For later use let $V_t = \sigma^2_U + \sigma^2_B t^2$.
Next consider the variance at time t + k:
$$\mathrm{Var}(Y_{ij(t+k)}) = \sigma^2_U + \sigma^2_B (t+k)^2 + \sigma^2_e = V_{t+k} + \sigma^2_e = V_t + k(2t+k)\sigma^2_B + \sigma^2_e$$
The covariance between $Y_{ijt}$ and $Y_{ij(t+k)}$ is
$$\mathrm{Cov}(Y_{ijt}, Y_{ij(t+k)}) = \mathrm{Cov}(U_{ij} + B_{ij}t + \varepsilon_{ijt},\ U_{ij} + B_{ij}(t+k) + \varepsilon_{ij(t+k)}) = \mathrm{Var}(U_{ij}) + \mathrm{Cov}(B_{ij}t,\ B_{ij}(t+k)) = \sigma^2_U + t(t+k)\sigma^2_B = [\sigma^2_U + t^2\sigma^2_B] + tk\sigma^2_B = V_t + tk\sigma^2_B$$
In total:
$$\mathrm{Var}(Y_{ijt}) = V_t + \sigma^2_e, \quad \mathrm{Var}(Y_{ij(t+k)}) = V_t + k(2t+k)\sigma^2_B + \sigma^2_e, \quad \mathrm{Cov}(Y_{ijt}, Y_{ij(t+k)}) = V_t + tk\sigma^2_B$$
Hence the correlation is
$$\mathrm{Corr}(Y_{ijt}, Y_{ij(t+k)}) = \frac{V_t + tk\sigma^2_B}{\sqrt{(V_t + \sigma^2_e)(V_t + k(2t+k)\sigma^2_B + \sigma^2_e)}}$$
Now consider a fixed t. The numerator is a linear function in k while
the denominator is a quadratic function in k.
Hence we know from high school mathematics that
$$\mathrm{Corr}(Y_{ijt}, Y_{ij(t+k)}) \to 0$$
as k (i.e. the time span between $Y_{ijt}$ and $Y_{ij(t+k)}$) goes to infinity.
In other words, under the random regression model the correlation decreases with distance in time.
That is an appealing property of the model!
17 Factor Structure Diagrams
The discussion with participants during the previous lectures had shown the need for an independent means of checking the degrees of freedom in the F-tests in PROC MIXED. The methods of calculation of degrees of freedom (option ddfm) are not fool-proof. The containment method may lead to errors if the experimental design cannot be deduced from the model specification, and the Satterthwaite method is erroneous if one of the variance components is estimated as 0.
Therefore, the factor structure diagram method was presented, supplemented with an exercise.
Link to the full screen presentation: http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/FactorStructure.f.pdf
Factor Structure Diagrams
Factor structure diagrams are a way of representing certain factorial
designs, including block experiments, split plot experiments etc.
• With such diagrams, it is for certain balanced cases easy to calculate
the correct degrees of freedom for the tests.
• It is also for certain balanced cases easy to identify which “error” an
effect is to be “tested against”.
However
• it is a somewhat restricted class of models that can be appropriately
represented this way.
• the degree of freedom calculations are not correct in unbalanced cases
• It is a very comprehensive task to describe the class of designs for which
factor structure diagrams can be used
Nonetheless, they are quite useful...
Two–way ANOVA with Replicates
Factors A and B have a and b levels. Replicates within each
combination A×B are denoted by the factor R with r levels.
That is, there are abr units in the experiment
The usual two–factor ANOVA model is
yabr = µ+ αa + βb + (αβ)ab + εabr
The model can be represented in a factor structure diagram
$$[ABR]^{abr}_{abr-ab} \leftarrow AB^{ab}_{ab-a-b+1} \leftarrow \{A^a_{a-1},\ B^b_{b-1}\} \leftarrow O^1_1$$
(Superscripts give the number of levels, subscripts the degrees of freedom; arrows point from coarser towards finer factors.)
• The term O is to be identified with µ
• The term A is to be identified with αa
• The term AB is to be identified with (αβ)ab etc.
• Terms in [. . . ] are random effects.
Calculating the degrees of freedom
1. Fill in the levels of the factors as superscripts.
2. Then calculate the degrees of freedom (DF), written as subscripts, recursively from right to left:
The DF for O is 1.
The DF for A is a minus the sum of DFs of the factors pointing towards A in the diagram, i.e.
$$a - 1$$
3. Proceed like this towards the left in the diagram: the DF for AB is
$$ab - (a-1) - (b-1) - 1 = ab - a - b + 1$$
“Proof that it works...”
%let a=4; %let b=2; %let r=3;
title ’Two-way ANOVA with replicates’;
data data1;
do A=1 to &a;
do B=1 to &b;
do R=1 to &r;
y=rannor(0);
output;
end; end; end;
proc mixed data=data1 noinfo noclprint;
class A B R;
model Y = A B A*B;
run;
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
A 3 16 0.56 0.6470
B 1 16 1.54 0.2329
A*B 3 16 1.72 0.2021
Two–way ANOVA without Replicates
If there are no replicates within each combination of A and B (i.e.
r = 1), the model is
yab = µ+ αa + βb + εab
since the interaction can not be estimated.
Following the lines from before, a diagram is
$$[ABR]^{ab}_{ab-ab=0} \leftarrow AB^{ab}_{ab-a-b+1} \leftarrow \{A^a_{a-1},\ B^b_{b-1}\} \leftarrow O^1_1$$
Another way of looking at it is by saying that the random error is the
interaction!!
So a more appropriate diagram is
$$[AB]^{ab}_{ab-a-b+1} \leftarrow \{A^a_{a-1},\ B^b_{b-1}\} \leftarrow O^1_1$$
“Proof that it works...”
title ’Two-way ANOVA without replicates’;
data data2;
do A=1 to &a;
do B=1 to &b;
y=rannor(0);
output;
end; end;
proc mixed data=data2 noinfo noclprint;
class A B;
model Y = A B;
run;
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
A 3 3 0.45 0.7377
B 1 3 0.05 0.8414
Block Experiments with Replicates within Blocks
If A is a (random) block effect and there are replicates of the factor B
within each block the model is
yabr = µ+ Ua + βb + Vab + εabr
The diagram is
$$[ABR]^{abr}_{abr-ab} \leftarrow [AB]^{ab}_{ab-a-b+1} \leftarrow \{[A]^a_{a-1},\ B^b_{b-1}\} \leftarrow O^1_1$$
Note:
• The systematic effect B is to be tested against the random effect
closest to it in the diagram, i.e. [AB]
• Note that since A is a random effect, any factor containing A must
also be random.
“Proof that it works...”
title ’Block experiment with replicates within blocks’;
data data3;
do A=1 to &a;
U = rannor(0);
do B=1 to &b;
V = rannor(0);
do R=1 to &r;
y=rannor(0) + U + V;
output;
end; end; end;
proc mixed data=data3 noinfo noclprint;
class A B R;
model Y = B;
random A A*B;
run;
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
B 1 3 14.99 0.0305
Block Experiments without Replicates within Blocks
If A is a (random) block effect and there are no replicates of the factor
B within each block the model is
yab = µ+ Ua + βb + εab
The diagram is
$$[AB]^{ab}_{ab-a-b+1} \leftarrow \{[A]^a_{a-1},\ B^b_{b-1}\} \leftarrow O^1_1$$
“Proof that it works...”
title ’Block experiment without replicates within blocks’;
data data4;
do A=1 to &a;
U = rannor(0);
do B=1 to &b;
y=rannor(0) + U;
output;
end; end;
proc mixed data=data4 noinfo noclprint;
class A B;
model Y = B;
random A;
run;
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
B 1 3 3.30 0.1671
Split Plot Experiment
Let A denote the whole–plot treatment and B the split–plot treatment.
Replicate units within A are denoted by R.
The model is:
yabr = µ+ αa + Uar + βb + (αβ)ab + εabr
$$[ABR]^{abr}_{abr-ab-a(r-1)} \leftarrow AB^{ab}_{(a-1)(b-1)} \leftarrow \{A^a_{a-1},\ B^b_{b-1}\} \leftarrow O^1_1$$
and, in addition, $[ABR] \leftarrow [AR]^{ar}_{a(r-1)} \leftarrow A$.
“Proof that it works”
title 'Split plot experiment';
%let a=4; %let b=3; %let r=3;
data data5;
do A=1 to &a;
do R=1 to &r;
U = rannor(0);
do B=1 to &b;
y=rannor(0) + U;
output;
end; end; end;
proc mixed data=data5 noinfo noclprint;
class A B R;
model Y = A B A*B;
random A*R;
run;
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
A 3 8 0.68 0.5901
B 2 16 3.81 0.0444
A*B 6 16 2.57 0.0618
Split Plot Experiment – Homework
Let E and C be the vitamin E and copper treatments applied to R pigs
within each combination of E and C.
Let M denote the membrane.
Hence the model is
$$y_{ecrm} = \mu + \alpha_e + \beta_c + (\alpha\beta)_{ec} + U_{ecr} + \gamma_m + (\alpha\gamma)_{em} + (\beta\gamma)_{cm} + (\alpha\beta\gamma)_{ecm} + \varepsilon_{ecrm}$$
The factor structure diagram becomes
$$O^1_1;\quad E^e_{e-1},\ C^c_{c-1},\ M^m_{m-1};\quad EC^{ec}_{ec-e-c+1},\ EM^{em}_{em-e-m+1},\ CM^{cm}_{cm-c-m+1};\quad ECM^{ecm}_{(e-1)(c-1)(m-1)};\quad [ECR]^{ecr}_{ec(r-1)};\quad [ECRM]^{ecrm}_{ec(rm-r-m+1)}$$
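Checking against the simulation below, where e = c = 2, r = 8 and m = 3: the whole–plot error has DF([ECR]) = ec(r − 1) = 28 and the residual has DF([ECRM]) = ec(rm − r − m + 1) = 4 · 14 = 56 — matching the denominator degrees of freedom 28 (for cu, e_vit, cu*e_vit) and 56 (for the membran terms) in the SAS output.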
“Proof that it works”
title ’Split plot experiment - homework - with 3 membranes’;
%let sigma_G = 2;
%let sigma_M = 6;
%let sigma_E = 1;
data mem;
do cu= 1 to 2;
do e_vit= 1 to 2;
do grnr= 1 to 8;
U_g = &sigma_G * rannor(0);
do membran= 1 to 3;
V_m = &sigma_M * rannor(0);
do muskel= 1 to 2;
E = &sigma_E * rannor(0);
y = U_g + V_m + E;
output;
end;
end;
end;
end;
end;
data mem1; set mem(where=(muskel=1));
proc mixed data=mem1;
class cu e_vit membran grnr;
model y = cu | e_vit | membran ;
random cu*e_vit*grnr ;
run;
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
cu 1 28 0.05 0.8316
e_vit 1 28 0.10 0.7489
cu*e_vit 1 28 1.55 0.2230
membran 2 56 0.10 0.9091
cu*membran 2 56 0.57 0.5708
e_vit*membran 2 56 1.26 0.2904
cu*e_vit*membran 2 56 1.16 0.3198
A Neat Little Exercise
1. Draw a factor structure diagram for the entire membrane experiment.
2. Compute the degrees of freedom for each test.
3. Verify by simulation that SAS does the right thing.
Hint: Use a BIG sheet of paper!
18 Covariate Models and Multivariate Response
The use of covariates in mixed models is discussed, initially based on chapter 5 in LMSW (Littell et al., 1996), i.e., model specification, comparison, and reduction.
Then it is shown that the covariate model may be naturally modified to include several dependent variables, i.e., extended to a multivariate response model. The data manipulation steps in SAS are described and the necessary model specification is shown.
Link to full screen presentation: http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/covariate.f.pdf
Example of use of covariates
Exercise 1: Treatments copper and vitamin E, each at three levels. Litters as blocks. Dependent variables: daily gain (and feed intake). Weight at start differed.
Plot
[Figure: daily gain plotted against start weight]
$$Y_{ijk} = (\alpha\gamma)_{ij} + L_k + \beta_{ij} w_{ijk} + \varepsilon_{ijk}$$
• $Y_{ijk}$: daily gain,
• $w_{ijk}$: weight at start,
• $\beta_{ij}$: regression coefficient for level ij of treatment,
• $(\alpha\gamma)_{ij}$: interaction between copper and vitamin E,
• $L_k$: random effect of litter ($L_k \sim N(0, \sigma^2_L)$),
• $\varepsilon_{ijk}$: random residual, $\varepsilon_{ijk} \sim N(0, \sigma^2)$
Model reduction ?
Model reduction
Reformulate as additive model and remove non-significant terms
$$Y_{ijk} = (\alpha\gamma)_{ij} + L_k + \beta_{ij} w_{ijk} + \varepsilon_{ijk}$$
$$(\alpha\gamma)_{ij} = \mu + \alpha_i + \gamma_j + (\alpha\gamma)'_{ij}$$
$$\beta_{ij} = \beta_0 + \beta_{1i} + \beta_{2j} + \beta'_{ij}$$
Table 5.1 in LMSW, section 5.2.2
1. Are all slopes = 0? If we fail to reject, go to step 2; else go to step 3.
2. Fit a common slope and test the hypothesis that it is 0. If we fail to reject, compare treatments using ANOVA; else use parallel lines.
3. Test that the slopes are equal. If we fail to reject, use the common slope model; if we reject, go to step 4.
4. Use the unequal slopes model.
SAS-code
Step 1:
proc Mixed data=a;
class Kuld Evit Kobber ;
model Tilv= Evit*Kobber
Startv*Evit*Kobber /noint solution ;
random kuld ;
Step 3:
model Tilv= Evit Kobber Evit*Kobber
Startv Startv*Evit Startv*Kobber
Startv*Kobber*Evit ;
SAS-Anova
Type 3 Tests of Fixed Effects
Num Den
Effect DF DF F Value Pr > F
EVIT 2 34 0.54 0.5905
KOBBER 2 34 0.46 0.6333
EVIT*KOBBER 4 34 1.10 0.3740
STARTV 1 34 27.62 <.0001
STARTV*EVIT 2 34 0.79 0.4627
STARTV*KOBBER 2 34 0.55 0.5829
STARTV*EVIT*KOBBER 4 34 1.13 0.3572
Plot
[Figure: daily gain against start weight]
Final Model
[Figure: daily gain against start weight, final model fit]
Feed per day
[Figure: daily gain plotted against feed per day, shown on three consecutive slides]
SAS-code
Test
proc Mixed data=a;
class Kuld Evit Kobber ;
model Tilv= Kobber
Fedag Fedag*Kobber ;
random kuld ;
Estimation:
model Tilv= Kobber
Fedag*Kobber /noint solution ;
Feed per day
[Figure: daily gain plotted against feed per day]
The lines actually denote the conditional distribution of the daily gain given the feed intake, i.e.,
$$Y_{ij} = \mu + \beta X_{ij} + \varepsilon_{ij}$$
If both variables measure the effect of the treatment, the joint distribution may be more interesting.
There is a relatively simple relationship between the conditional and the joint distribution:
$$E(X_{ij}) = \mu_x, \qquad E(Y_{ij}) = \mu_y = E(\mu + \beta X_{ij}) = \mu + \beta\mu_x$$
$$\mathrm{V}(X_{ij}) = \sigma^2_x$$
$$\mathrm{V}(Y_{ij}\mid X_{ij}) = \mathrm{V}(\varepsilon_{ij}) = \sigma^2_y - \sigma_{yx}\frac{1}{\sigma^2_x}\sigma_{xy}$$
$$\mathrm{C}(X_{ij}, Y_{ij}) = \mathrm{C}(X_{ij},\ \mu + \beta X_{ij} + \varepsilon_{ij}) = \beta\,\mathrm{V}(X_{ij}) = \beta\sigma^2_x$$
$$\mathrm{V}(Y_{ij}) = \sigma^2_\varepsilon + \beta^2\sigma^2_x$$
i.e., the joint distribution is
$$\begin{bmatrix} X_{ij}\\ Y_{ij} \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_x\\ \mu_y \end{bmatrix},\ \begin{bmatrix} \sigma^2_x & \beta\sigma^2_x\\ \beta\sigma^2_x & \sigma^2_\varepsilon + \beta^2\sigma^2_x \end{bmatrix}\right)$$
Can this be generalised ?
Multivariate Responses
Consider a feeding experiment where a treatment factor A (say
supplement of copper) is applied to pigs.
Two responses are measured:
$Y^1$: weight gain
$Y^2$: feed intake
Hence the response is a two–dimensional vector $Y = (Y^1, Y^2)^\top$.
Return to the feeding experiment.
A model for each response $Y^r$, where r = 1, 2, could be
$$Y^r_{ik} = \mu^r + \alpha^r_i + \varepsilon^r_{ik}$$
where $i = 1,\ldots,I$ is treatment, $k = 1,\ldots,K$ is replicate within each treatment, and $\varepsilon^r_{ik} \sim N(0, \sigma^2_r)$.
Hence all parameters $\mu^r$, $\alpha^r_i$, $\sigma^2_r$ are specific to the rth response.
The Components of a MLNM
For each response Y r it is assumed that E(Y r) can be written as a
linear function of the explanatory variables.
In the example,
$$E(Y^r_{ik}) = \delta^r + \alpha^r_i$$
It is assumed that the mean value has the same structure for each response r made on the same unit.
In the example,
$$E(Y_{ik}) = (E(Y^1_{ik}),\ E(Y^2_{ik})) = (\delta^1 + \alpha^1_i,\ \delta^2 + \alpha^2_i) = (\mu^1_i,\ \mu^2_i)$$
It is also assumed that the parameters $\beta^r$ and $\beta^s$ relating to the rth and the sth response, respectively, have nothing in common.
In the example, this means that there are no restrictions on the parameters of the form that e.g. $\alpha^1_i$ and $\alpha^2_i$ are restricted to being identical.
The responses are possibly correlated. To account for this we allow for a covariance matrix of the form
$$\Sigma = \mathrm{C}(Y_{ik}) = \begin{bmatrix} \sigma^2_1 & \sigma_{12}\\ \sigma_{21} & \sigma^2_2 \end{bmatrix}$$
The model we consider can be briefly written
$$Y_{ik} = (Y^1_{ik}, Y^2_{ik}) \sim N_2((\mu^1_i, \mu^2_i),\ \Sigma)$$
If the vectors are regarded as row vectors, then it just looks like two linear normal models appended to each other, with the extra finesse that the two responses are allowed to be non–independent.
And – that is just what it is!
Such models can be dealt with in a mixed model setup.
The trick is to arrange the data in columns.
Suppose there are two treatments, i.e. i = 1, 2, and two pigs per treatment, i.e. k = 1, 2.
Then there are 4 units in the experiment, each with two measurements, giving altogether 8 measurements.
It is not very hard to see that the mean of each of these can be written in the matrix form
$$E\begin{bmatrix} Y^1_{11}\\ Y^2_{11}\\ Y^1_{12}\\ Y^2_{12}\\ Y^1_{21}\\ Y^2_{21}\\ Y^1_{22}\\ Y^2_{22} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 1 & 0\\ 1 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 1 & 0\\ 1 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 1\\ 1 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \delta^1\\ \alpha^1_1\\ \alpha^1_2\\ \delta^2\\ \alpha^2_1\\ \alpha^2_2 \end{bmatrix}$$
The covariance matrix is easy to specify too: The units are assumed
independent, and hence the covariance between measurements on
different units is zero.
The covariance structure for measurements on the same unit
together with the variances are described in the 2× 2 matrix Σ.
For all measurements, the covariance matrix is therefore the 8 × 8 matrix
$$\mathrm{C}\begin{bmatrix} Y^1_{11}\\ Y^2_{11}\\ \vdots\\ Y^1_{22}\\ Y^2_{22} \end{bmatrix} = \begin{bmatrix} \Sigma & 0_2 & 0_2 & 0_2\\ 0_2 & \Sigma & 0_2 & 0_2\\ 0_2 & 0_2 & \Sigma & 0_2\\ 0_2 & 0_2 & 0_2 & \Sigma \end{bmatrix}$$
where $0_2$ is the $2\times 2$ matrix consisting exclusively of 0s.
How to ... In SAS
A brief outline about how to work with such problems in SAS.
The response variables are stacked on top of each other in a variable
called Y.
Let R be another variable with levels, say W and I indicating whether
the corresponding measurement in Y is a measurement of weight or
feed intake.
Let K be a variable identifying the subjects (within the treatment),
and let A be the treatment factor.
Then the following SAS program would do the trick:
proc mixed data=...;
class R K A;
model Y = R R*A / noint ddfm=satterth ...;
repeated R / subject=K*A type=un;
run;
In the REPEATED statement the subject option specifies the blocks
of the covariance matrix (in the example that there are 4 blocks).
The option type=un specifies that the blocks should be completely
unstructured
The variable R in the REPEATED statement is used for identifying the
different response types.
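A sketch of the stacking step (assuming a dataset wide with treatment A, subject K and the two responses in variables Y1 and Y2 — names invented for illustration):

data long;
  set wide;
  R = 'W'; Y = Y1; output;   /* weight record */
  R = 'I'; Y = Y2; output;   /* intake record */
  drop Y1 Y2;
run;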
The General Setup
More generally,
$$E(Y^r_j) = x_j^\top \beta^r$$
where $x_j$ are covariates for the jth experimental unit and $\beta^r$ is a vector of parameters establishing the connection between $E(Y^r_j)$ and $x_j$.
More generally,
$$E(Y_j) = (E(Y^1_j), E(Y^2_j), \ldots, E(Y^R_j)) = x_j^\top[\beta^1 : \beta^2 : \cdots : \beta^R] = x_j^\top B$$
Hence $B = [\beta^1 : \beta^2 : \cdots : \beta^R]$ is now a matrix of parameters where the rth column holds the parameters associated with the rth response.
If we let $Y_j = (Y^1_j, \ldots, Y^R_j)$ be a row vector, then
$$E(Y_j) = x_j^\top B$$
is also a row vector.
If the rows of data from all n units are stacked on top of each other we obtain an $n\times R$ matrix
$$Y = \begin{bmatrix} Y^1_1 & Y^2_1 & \cdots & Y^R_1\\ Y^1_2 & Y^2_2 & \cdots & Y^R_2\\ \vdots & \vdots & & \vdots\\ Y^1_n & Y^2_n & \cdots & Y^R_n \end{bmatrix}$$
Similarly the covariates $x_j^\top$ can be stacked on top of each other to give a design matrix X (with dimension $n\times p$) in the usual way.
The previous considerations then give that
$$\underset{(n\times R)}{E(Y)} = \underset{(n\times p)}{X}\ \underset{(p\times R)}{B}$$
i.e. the mean is now organized as a matrix rather than as a vector.
19 Heterogeneous Variance
The purpose of this lecture was to present why it is important to recognize variance heterogeneity, how to model such heterogeneity, and the consequences of different modelling approaches. The lecture extends the description in chapter 8 in LMSW (Littell et al., 1996).
Graphical techniques for finding suitable models of variance heterogeneity are presented and variance functions including the power family are introduced. In addition, the effect of transformation is illustrated.
Link to full-screen presentation: http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/VarianceStructure.f.pdf
Why Variance Heterogeneity is Important to
Recognize
Frequently the usual assumptions about variance homogeneity are
not met in practice. In that case the variance is said to be
heterogeneous.
One reason for incorporating variance heterogeneity in the model is
the ability to
• downweight portions of data which are highly variable, and
• extract more information from portions of the data which are more
precise.
As always there is a price to pay:
• The models become less parsimonious in terms of the number of
parameters.
• Fitting the models can be more difficult (numerical problems).
• Usually, only asymptotic inference can be carried out (i.e. no exact
F–tests etc.)
• Model control becomes more complicated.
Graphical Investigation of the Variance Structure
Frequently there is some structure on the way in which the variance
is non–constant:
Frequently the variance increases when the mean increases.
That is, the variance is a function of the mean, symbolically
Var(Y ) = f(E(Y ))
With grouped data, the variance function can sometimes be
identified.
Example 1. One-way ANOVA:
$$Y_{kl} = \alpha_k + \varepsilon_{kl}$$
where $\varepsilon_{kl} \sim N(0, \sigma^2_k)$ for treatments $k = 1, 2, \ldots, K$ and replicates within treatments $l = 1, 2, \ldots, L_k$.
Good estimates for mean and variance in the kth group are
• Mean: $\bar y_{k\cdot}$
• Variance: $s^2_k = \frac{1}{L_k - 1}\sum_l (y_{kl} - \bar y_{k\cdot})^2$
A reasonable idea is to plot $s^2_k$ against $\bar y_{k\cdot}$ to see if the variance is a function of the mean. fin
Variance Functions
After having found that the variance is non–constant, the next step
is to look for some structure in which it is non–constant.
This is obtained by considering a particular function for the variance
as a function of the mean.
Frequently in practice one works with the variance function
$$\mathrm{Var}(Y) = \sigma^2\mu^\theta$$
where $\mu = E(Y)$, and $\sigma^2$ and $\theta$ are unknown constants.
Variance functions of this form are called the power family.
With $\mathrm{Var}(Y) = \sigma^2\mu^\theta$ we have a linear relationship on the log scale:
$$\log \mathrm{Var}(Y) = \log\sigma^2 + \theta\log\mu$$
Therefore, in the ANOVA example the natural thing to do is to plot $\log s^2_k$ against $\log\bar y_{k\cdot}$ and see if the relationship is approximately linear.
If so, it may be reasonable to assume that we are within the power family of variance functions – and this is a nice family, as shall soon be shown.
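A sketch of the suggested diagnostic plot in SAS (dataset and variable names are placeholders):

proc means data=mydata noprint;
  class k;                        /* the treatment factor */
  var y;
  output out=mv mean=m var=s2;
run;
data mv;
  set mv;
  if _type_ = 1;                  /* keep the per-group rows only */
  logm = log(m); logv = log(s2);
run;
proc gplot data=mv;
  plot logv*logm;                 /* roughly linear => power family; slope estimates theta */
run;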
Example 2. A substance X14 has been added in the concentration fod ∈ {0.0, 4.4, 6.2, 9.3} to the food for some pigs. The pigs are fed (up!) with this food until their weight is 60 kg. From then until they are slaughtered at 100 kg, their food does contain the substance.
At 60 kg (sample=1) and 100 kg (sample=2) muscle biopsies are made and the concentration of the substance is determined.
[Figure: mean concentration against fod; 1 = 60 kg, 2 = 100 kg]
Plots of the individual points and of log–variance against log–mean indicate that the variance increases with the mean:
[Figure: X14 against fod for sample 1 and sample 2, and log–variance against log–mean with estimated slope 1.23 (s.e. 0.25)]
• One possibility is a linear increase with the slope being ≈ 1.

• Another is that there are two variances: one when fod = 0 and another one when fod ≠ 0.
fin
From here there are different possibilities:
• Transform data onto a scale where the variance is (approximately)
constant
• Include the heterogeneous variance explicitly in the model
The Delta–method
First we consider transformation of data onto a scale where the
variance is approximately constant.
Let Y be a random variable and let h() be a nice function, e.g. h(y) = √y, h(y) = y², or h(y) = log y.
We shall investigate the properties of the transformed random
variable Z where
Z = h(Y )
Example 3. Let Y ∼ N(µ, σ²). If h is linear, i.e. h(y) = α + βy, then it is well known that

Z = h(Y) ∼ N(α + βµ, β²σ²)

If h is non–linear, e.g. if h(y) = log y, then Z is not normally distributed. fin
• However, Z = h(Y ) will in certain cases be approximately normal
if Y is normal.
• Moreover, one can find the approximate mean and variance of Z
independently of whether Y is normal or not.
October 18, 2001 Mixed Models Course 11
Taylor's Approximation

The road to these results can be based on the following argument:

Let x_0 and x be two numbers (not too far apart) and assume that h is "nice" (i.e. differentiable). Then it is well known from high school that

h(x) ≈ h(x_0) + h′(x_0)(x − x_0).

The further x is from x_0, the worse this approximation is. This approximation is frequently called a Taylor expansion of h around x_0.
[Figure: first order Taylor approximation; a function f(x) and its tangent h(x) ≈ h(x_0) + h′(x_0)(x − x_0) at x_0]
Applying Taylor's Approximation

Taylor's approximation is now applied to the random variable Y with mean µ = E(Y) and variance σ² = Var(Y). The approximation is around µ. We then get

Z = h(Y) ≈ h(µ) + h′(µ)(Y − µ).

• Hence, when Y is "close to" µ, then h(Y) is approximately a linear function of Y.

• Y "being close to" µ basically means that σ² has to be small.
• From the approximation

Z = h(Y) ≈ h(µ) + h′(µ)(Y − µ)

we also conclude that

E(Z) = E(h(Y)) ≈ h(µ)
Var(Z) = Var(h(Y)) ≈ h′(µ)² Var(Y)

• Hence, if Y is normal then it follows that Z must also be approximately normal, since Z is an approximately linear function of Y. In this case we therefore conclude

Z = h(Y) ≈ N(h(µ), h′(µ)²σ²).
It must be emphasized that these results are asymptotic results. How good they are depends on many things, including

• the variance of Y, i.e. how close Y–values tend to be to µ

• the form of h – how "smooth" (that is, how close to being linear) h is.
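The quality of the approximation is easily checked by simulation. A small sketch for h(y) = log y with µ = 10 and σ = 1 (the values and the seed are chosen only for illustration):

data sim;
  call streaminit(2001);
  do i = 1 to 100000;
    y = rand('normal', 10, 1);
    z = log(y);
    output;
  end;
run;
proc means data=sim mean var;   /* compare with h(10) = log(10) = 2.303 */
  var z;                        /* and h'(10)**2 * 1**2 = 0.01          */
run;

With σ = 1 the empirical mean and variance of Z should be close to the delta method values; increasing σ makes the agreement worse.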
Transformation of Data
The previous results can sometimes be used for identifying
transformations of data onto a scale where the variance is constant.
It is assumed in the following that

E(Y_i) = µ_i and Var(Y_i) = σ²µ_i^θ.
By plotting log–variance against log–mean one can frequently get a
good estimate of θ, and from that one can (sometimes) identify an
appropriate transformation.
We look for a function h such that Z = h(Y) has constant variance σ²_Z:

• From the previous section we have

σ²_Z = Var(h(Y)) ≈ h′(µ)² Var(Y) = h′(µ)²σ²µ^θ

• If we solve for h′ we get

h′(µ) ≈ √(σ²_Z/σ²) · µ^(−θ/2)

• For later use let c = √(σ²_Z/σ²). Hence we look for a function h which
satisfies that its derivative is

h′(µ) = cµ^(−θ/2).
Such an equation is called a differential equation.
The search for h has to be taken in two steps:
When θ = 2: Then h′(µ) = c · (1/µ), and high school knowledge tells us that the solution is the natural logarithm, i.e.

h(µ) = c log(µ).

When θ ≠ 2: In this case we need the anti–derivative of a simple power function. It is then well known from high school that

h(µ) = (2c/(2 − θ)) µ^((2−θ)/2).
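For example, for θ = 1 (the Poisson-like case mentioned below) the general solution gives

h′(µ) = cµ^(−1/2) and h(µ) = (2c/(2 − 1)) µ^((2−1)/2) = 2c√µ,

i.e. the square-root transformation, which is the classical variance-stabilizing transformation for Poisson-type data.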
With Var(Y) = σ²µ^θ there are some well known special cases:

• Note that θ = 0 implies that Var(Y) = σ².
(As is the case in Linear Normal Models.)

• Note that σ² = θ = 1 implies that Var(Y) = µ.
(As is the case in the Poisson distribution.)

• Note that θ = 2 implies that Var(Y) = σ²µ².
(I.e. the coefficient of variation is constant, as is the case in the Gamma distribution.)
Modelling Variance Heterogeneity
As has been seen, transformation of data in an attempt to obtain variance homogeneity can be a mixed blessing:

• the transformation can ruin the linearity of the mean structure.

• it can be very difficult to report contrasts and their standard errors on the original scale.
An attractive alternative to transformation is therefore to include
variance heterogeneity in the model.
Consider the pig–feeding example from before and the model

y_is = α + βx_i + β_s x_i + ε_is

where i is pig, s is sample and x_i is the dose given to the ith pig.

• if ε_is ∼ N(0, σ²) then it is a LNM, i.e. variance homogeneity is assumed.

• if ε_is ∼ N(0, σ²_{x_i}) then we accommodate different variances corresponding to different doses of x. (Recall that x_i can assume 4 different values, so there are 4 different variance parameters.)

• if ε_is ∼ N(0, σ²_1) when x_i = 0.0 and ε_is ∼ N(0, σ²_2) when x_i ≠ 0.0, there are two different variance parameters in the model.
• if ε_is ∼ N(0, σ²_{x_i,s}) then we accommodate different variances corresponding to different doses of x and to the different samples. (Hence there are 8 different variance parameters.)
Fitting the models in PROC MIXED:
data biopsi; set biopsi; fod_c =fod; if fod=0.0 then fod_c2 = 1;
else fod_c2=2;
title ’Variance homogeneity’;
proc mixed data=biopsi;
class sample fod_c fod_c2;
model x14=fod fod*sample / ddfm=satterth chisq solution outp=o1;
ods output solutionf=sf1;
title ’Variance heterogeneity, 4 variances’;
proc mixed data=biopsi;
class sample fod_c fod_c2;
model x14=fod fod*sample / ddfm=satterth chisq solution outp=o2;
ods output solutionf=sf2;
repeated fod_c/ type=un(1);
title ’Variance heterogeneity, 2 variances’;
proc mixed data=biopsi;
class sample fod_c fod_c2;
model x14=fod fod*sample / ddfm=satterth chisq solution outp=o3;
ods output solutionf=sf3;
repeated fod_c2/ type=un(1);
run;
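The 8-variance model from the previous slide is not fitted above. Presumably it could be handled analogously with the GROUP= option on the repeated statement, as in this sketch (which has not been run on the data):

title 'Variance heterogeneity, 8 variances';
proc mixed data=biopsi;
  class sample fod_c;
  model x14=fod fod*sample / ddfm=satterth chisq solution;
  repeated fod_c / type=un(1) group=sample;  /* 4 variances per sample */
run;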
Parts of the SAS output are:

Variance homogeneity:
  Residual                  0.1262
  -2 Res Log Likelihood     51.6
  AIC  (smaller is better)  53.6
  AICC (smaller is better)  53.7
  BIC  (smaller is better)  55.4

Variance heterogeneity, 4 variances:
  -2 Res Log Likelihood     39.1      Cov Parm  Estimate
  AIC  (smaller is better)  47.1      UN(1,1)   0.02512
  AICC (smaller is better)  48.0      UN(2,2)   0.08855
  BIC  (smaller is better)  54.6      UN(3,3)   0.1491
                                      UN(4,4)   0.2481

Variance heterogeneity, 2 variances:
  -2 Res Log Likelihood     41.8      Cov Parm  Estimate
  AIC  (smaller is better)  45.8      UN(1,1)   0.02517
  AICC (smaller is better)  46.1      UN(2,2)   0.1592
  BIC  (smaller is better)  49.6

Note that the homogeneous model can be compared with the 4-variance model by a REML likelihood ratio test: 51.6 − 39.1 = 12.5 on 3 degrees of freedom, which is clearly significant. The 4- and 2-variance models differ by 41.8 − 39.1 = 2.7 on 2 degrees of freedom, so the 2-variance model seems adequate; it is also the model preferred by AIC and BIC.
The parameter estimates are:
Effect sample Estimate StdErr DF tValue Probt model
Intercept 0.3130 0.09145 46 3.42 0.0013 varhomo
fod 0.1453 0.01735 46 8.38 <.0001 varhomo
fod*sample 1 0.2433 0.01689 46 14.40 <.0001 varhomo
fod*sample 2 0 . . . . varhomo
Intercept 0.2608 0.04468 12.1 5.84 <.0001 varhet1
fod 0.1546 0.01552 41 9.96 <.0001 varhet1
fod*sample 1 0.2489 0.01985 33.3 12.54 <.0001 varhet1
fod*sample 2 0 . . . . varhet1
Intercept 0.2620 0.04489 11.9 5.84 <.0001 varhet2
fod 0.1524 0.01466 44.9 10.39 <.0001 varhet2
fod*sample 1 0.2432 0.01897 34.9 12.82 <.0001 varhet2
fod*sample 2 0 . . . . varhet2
Heterogeneous Variance for Grouped Data
Example 4. Example 8.2 from LMSW, p. 268.

• The response is the ultrafiltration rate UFR (in ml/hr) of 20 high flux membrane dialyzers measured at 7 different transmembrane pressures TMP.

• The measurements are made in vivo and the aim is to characterize the ultrafiltration characteristics of the membranes.

• The dialyzers are evaluated in vitro using bovine blood and flow rates QB of either 200 or 300 dl/min.
[Figure: ufr against tmp for QB = 200 and QB = 300]
• Plots suggest inhomogeneous variance, and more specifically that the variance increases with the mean.

• The plot also suggests that there might be individual curves for each membrane, i.e. to consider random regression coefficient models.
The starting point is the 4th degree polynomial model

y_imj = β_0 + τ_i + (β_1 + δ_1i)x_imj + (β_2 + δ_2i)x²_imj + (β_3 + δ_3i)x³_imj + (β_4 + δ_4i)x⁴_imj + ε_imj

where x is TMP, i denotes QB–level, m is membrane within QB–level, and j is the jth measurement on the membrane to which the measurement x_imj is associated.

There are 7 measurements on each membrane, so a crude starting point could be to assume that ε_im = (ε_im1, ..., ε_im7) follows a 7–dimensional normal distribution,

ε_im ∼ N(0, R)

where R is an unstructured 7 × 7 covariance matrix.
The SAS program employed by LMSW for fitting this model is
proc mixed data=dial;
class qb sub;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp;
repeated / type=un subject=sub r rcorr;
ods output r=r rcorr=rcorr;
run;
With this program the data are treated as being equidistant in TMP, i.e. the actual difference between two TMP–measurements is not accounted for.
This becomes transparent if the program is rewritten as
proc mixed data=dial;
class qb sub index;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp;
repeated index / type=un subject=sub r rcorr;
ods output r=r rcorr=rcorr;
run;
Some of the SAS output is
Estimated Covariance matrix
2.76  2.90  3.57  3.04  0.36  0.46  0.64
2.90  5.10  6.40  6.38  4.13  3.32  1.16
3.57  6.40 11.15 12.46  8.33  5.44  4.02
3.04  6.38 12.46 18.54 13.38 10.90  7.68
0.36  4.13  8.33 13.38 17.71 13.83 12.04
0.46  3.32  5.44 10.90 13.83 20.31 11.33
0.64  1.16  4.02  7.68 12.04 11.33 19.67

Estimated Correlation matrix
1.00  0.77  0.64  0.43  0.05  0.06  0.09
0.77  1.00  0.85  0.66  0.43  0.33  0.12
0.64  0.85  1.00  0.87  0.59  0.36  0.27
0.43  0.66  0.87  1.00  0.74  0.56  0.40
0.05  0.43  0.59  0.74  1.00  0.73  0.65
0.06  0.33  0.36  0.56  0.73  1.00  0.57
0.09  0.12  0.27  0.40  0.65  0.57  1.00
fin
• Note that with the model above there are 7×8/2 = 28 parameters
in the covariance matrix.
• The variances increase with TMP, and hence the covariances increase
with the differences in TMP.
• Yet, the correlations decrease with the difference in TMP.
• We seek a more parsimonious model describing this correlation structure.
• A simple AR(1) model in which the ijth element of R is

R_ij = σ²ρ^|i−j|

(which has 2 parameters) will clearly not fit these data.

• A more flexible alternative is the heterogeneous AR(1) model (the ARH(1) model) in which the ijth element of R is

R_ij = σ_i σ_j ρ^|i−j|

(which has 8 parameters). This model is still much more parsimonious than the unstructured covariance matrix, which requires 28 parameters.
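To make the structure concrete, a small PROC IML sketch builds such a matrix. The σ_i values below are merely rough square roots of the diagonal of the estimated covariance matrix above, and ρ = 0.76 is taken from the ARH(1) fit shown later:

proc iml;
  sigma = {1.66, 2.26, 3.34, 4.31, 4.21, 4.51, 4.43};
  rho = 0.76;
  R = j(7, 7, 0);
  do i = 1 to 7;
    do j = 1 to 7;
      R[i, j] = sigma[i] * sigma[j] * rho**abs(i - j);  /* R_ij = sigma_i sigma_j rho^|i-j| */
    end;
  end;
  print R;
quit;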
The ARH(1) model can be fitted using
proc mixed data=dial;
class qb sub index;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp;
repeated index / type=arh(1) subject=sub r rcorr;
ods output r=r rcorr=rcorr;
run;
The estimated correlation matrices from the ARH(1) model and from the unstructured model are close:

Estimated Correlation matrix (ARH(1))
1.00  0.76  0.58  0.44  0.34  0.26  0.20
0.76  1.00  0.76  0.58  0.44  0.34  0.26
0.58  0.76  1.00  0.76  0.58  0.44  0.34
0.44  0.58  0.76  1.00  0.76  0.58  0.44
0.34  0.44  0.58  0.76  1.00  0.76  0.58
0.26  0.34  0.44  0.58  0.76  1.00  0.76
0.20  0.26  0.34  0.44  0.58  0.76  1.00

Estimated Correlation matrix (Unstructured)
1.00  0.77  0.64  0.43  0.05  0.06  0.09
0.77  1.00  0.85  0.66  0.43  0.33  0.12
0.64  0.85  1.00  0.87  0.59  0.36  0.27
0.43  0.66  0.87  1.00  0.74  0.56  0.40
0.05  0.43  0.59  0.74  1.00  0.73  0.65
0.06  0.33  0.36  0.56  0.73  1.00  0.57
0.09  0.12  0.27  0.40  0.65  0.57  1.00
For the model with the unstructured covariance matrix, a plot of the
residuals against TMP gives some insight:
[Figure: residuals from the UN model against tmp, for QB = 200 and QB = 300]
• The profiles do not vary randomly around 0 – some profiles are steadily increasing, others steadily decreasing.
• This suggests that maybe we are not faced with variance
heterogeneity but rather with individual regression coefficients.
• (After all, there is likely to be some variation between the
membranes).
The random regression model is fitted by:
proc mixed data=dial ;
class qb sub index;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / outp=o2;
random int tmp tmp*tmp / subject=sub type=un;
run;
Now there is no tendency for the residuals to be steadily increasing
or decreasing when plotted against TMP.
[Figure: residuals from the random regression model against tmp, for QB = 200 and QB = 300]
Yet, the curves are still somewhat “smooth” suggesting that some
within subject variation has yet to be accounted for.
Power–of–Mean for Data with Covariates
Previously it was discussed that the variance can sometimes be
regarded as a function of the mean.
This was used for
• identifying situations where serious variance heterogeneity was
present
• suggesting transformations of data
Yet, until now the actual structure – the variance as a function of the mean – has never been used directly.
Usually when estimating variance/covariance parameters this is done
by subtracting estimates for the mean from the observed data to
give residuals. The residuals are then used for estimating the
variance/covariance parameters.
REML estimation is a clear example of this.
• In the setup in this section the mean and variance parameters are
not estimated separately.
• With this setup, one can capture variance heterogeneity together with having random regression coefficients in the model.
• We consider cases where the variance of the residuals is

Var(ε_i) = σ²|µ_i|^θ

such that the R–matrix is diagonal with R_ii = σ²|µ_i|^θ.

• Since µ_i = x_iᵀβ, the mixed model becomes complicated:

y = Xβ + Zu + ε

where

E(Y) = Xβ
Var(ε) = R(σ², β, θ) = diag(σ²|x_iᵀβ|^θ)

are both functions of β.
• Consequently, maximizing the likelihood function is going to be a
very complicated task.
Yet, it is easy to suggest a heuristic solution to the estimation problem:

• Suppose we have a provisional estimate β_p of β.

• If this estimate is plugged into R, i.e.

R(σ², β_p, θ) = diag(σ²|x_iᵀβ_p|^θ) = R(σ², θ)

then R is all of a sudden only a function of σ² and the power θ.

• These parameters can be estimated, together with β and the parameters in Var(u), in PROC MIXED.

• The trick is then to set β_p equal to the new estimate for β and repeat the iteration until the parameters stop changing.
In LMSW, p. 278 a way of doing it is shown. A simpler way is given
here:
1. First the iteration has to be started:
proc mixed data=dial;
class qb sub;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / s;
random int tmp tmp*tmp / type=un sub=sub;
repeated / local;
ods output solutionf=sf covparms=cp;
run;
2. Then the estimated parameters β̂ are used as provisional parameters in the next iteration. (This happens in the repeated statement.) The estimated parameters of Var(u) are used as starting points for the maximization algorithm. (This happens in the parms statement.)

This step is not necessary, but it speeds up the procedure considerably:
proc mixed data=dial;
class qb sub;
model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / s outp=o3;
random int tmp tmp*tmp / type=un sub=sub;
repeated / local=pom(sf);
parms / pdata=cp;
ods output solutionf=sf1 covparms=cp1;
run;
3. Finally the provisional estimate βp is set to the recent estimate for
β.
Likewise, the starting values for the parameters in Var(u) are set
to the recently estimated values of these:
proc compare brief data=sf compare=sf1;
var estimate;
data sf; set sf1;
data cp; set cp1;
run;
Now iterate between 2. and 3. until convergence, i.e. until the
parameters in sf and sf1 become very similar.
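The iteration can also be wrapped in a simple macro loop. This is only a sketch: it runs a fixed number of iterations and leaves the judgment of convergence to the PROC COMPARE listings:

%macro pomloop(maxit=10);
  %do it = 1 %to &maxit;
    proc mixed data=dial;
      class qb sub;
      model ufr = tmp|tmp|tmp|tmp qb|tmp|tmp|tmp|tmp / s;
      random int tmp tmp*tmp / type=un sub=sub;
      repeated / local=pom(sf);
      parms / pdata=cp;
      ods output solutionf=sf1 covparms=cp1;
    run;
    proc compare brief data=sf compare=sf1;  /* convergence check */
      var estimate;
    run;
    data sf; set sf1; run;
    data cp; set cp1; run;
  %end;
%mend pomloop;
%pomloop(maxit=10)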
Parts of the output from the final iteration are
Covariance Parameter Estimates
Cov Parm Subject Estimate
UN(1,1) sub 3.8360
UN(2,1) sub -5.8353
UN(2,2) sub 28.2501
UN(3,1) sub 1.3778
UN(3,2) sub -8.3312
UN(3,3) sub 2.6970
POM 1.9785
Residual 0.001974
The power is estimated at 1.9785 ≈ 2 which, in a sense, corresponds to the case of constant coefficient of variation (and which, by the earlier results, would correspond to a logarithmic transformation).
Now there is no tendency for the residuals to be steadily increasing or decreasing when plotted against TMP.

Also, the curves are less smooth than before, suggesting that more of the within subject variation has now been accounted for.
[Figure: residuals from the POM model against tmp, for QB = 200 and QB = 300]
Some remarks on transformations, the normal approximation and confidence intervals

Based on 250 receipts for purchases of petrol, l_i, together with corresponding registrations of kilometres driven per tankful, k_i, the fuel economy

y_i = k_i/l_i, i = 1, ..., 250,

expressed in kilometres per litre, has been calculated.
The histogram and the probit diagram in the top row of the figure below show that one can reasonably assume that the y_i's are
realisations of random variables Y_i, where

Y_i ∼ N(µ, σ²), i = 1, ..., 250.

On the basis of the data one can now construct, e.g., a confidence interval for µ.
For various reasons one decides to sell the car in the USA, where fuel economy is usually stated as "gallons per 100 miles". To keep things simple we instead consider "litres per 100 km", namely

z_i = 100 l_i/k_i = 100 · (1/y_i).

That is, we transform the data as z_i = h(y_i) = 100/y_i.
It is well known that if Y_i is normally distributed, then 100/Y_i is NOT normally distributed.
Below, histograms and QQ–plots are shown for Y and for Z = h(Y) = 100/Y.

[Figure: histogram and normal Q–Q plot of y (top row) and of z (bottom row)]
One can trace a slightly right-skewed distribution for the z_i's, but otherwise the data seem to be reasonably well described by a normal distribution. That is, with some justification one can work with 100/Y_i being approximately normally distributed.
The data above are in fact 250 observations simulated from an N(12, 1²) distribution.

We shall now illustrate that the approximation to the normal distribution becomes gradually worse as the standard deviation becomes larger. We have therefore carried out the above for the standard deviations σ = 2 and σ = 3. The results are shown below:
[Figure: histograms and normal Q–Q plots of y and z, for σ = 2 (top rows) and σ = 3 (bottom rows)]
We shall now illustrate what happens to the mean and the variance of the transformed data.
In the following we let E(Z) = η and Var(Z) = τ². We can then estimate η and τ² directly from the transformed data, as the average and the sample variance respectively.
Next it is noted that with h(x) = 100/x we have h′(x) = −100/x². From the results

E(Z) = E(h(Y)) ≈ h(µ)
Var(Z) = Var(h(Y)) ≈ h′(µ)² Var(Y)

we therefore have E(Z) ≈ 100/µ and Var(Z) ≈ 10000σ²/µ⁴.

For σ = 1, 2, 3 the numbers are given in the table below.
It is seen that 100/µ is a good approximation to E(Z) = η, and likewise 10000σ²/µ⁴ is a reasonable approximation to Var(Z) = τ², when the standard deviation is small. It also appears that when the standard deviation becomes large, the approximation to Var(Z) = τ² in particular becomes poor.

Quantity                            σ = 1     σ = 2     σ = 3
µ                                   11.968    11.919    11.962
σ²                                   1.146     4.007     9.475
η                                    8.423     8.658     9.261
τ²                                   0.588     2.834    18.833
E(Z) = E(h(Y)) ≈ 100/µ               8.355     8.389     8.359
Var(Z) = Var(h(Y)) ≈ 10000σ²/µ⁴      0.558     1.985     4.627
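A simulation along the lines of the table is easily set up. This is only a sketch (the seed and the names are our own, so the numbers will not match the table exactly):

data fuel;
  call streaminit(2001);
  do sigma = 1, 2, 3;
    do i = 1 to 250;
      y = rand('normal', 12, sigma);  /* km per litre      */
      z = 100/y;                      /* litres per 100 km */
      output;
    end;
  end;
run;
proc means data=fuel mean var;
  class sigma;
  var y z;
run;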
Finally it is noted that η and µ in this example express the same thing, namely the fuel economy.
Through the transformation of the data, z_i = 100/y_i, the z_i's are expressed in "litres per 100 km", which then also becomes the unit for E(Z) = η. The unit for µ is "km per litre", and therefore the unit for 100/µ is "litres per 100 km".
One can therefore discuss whether 100/µ or η is the relevant quantity. They are estimated differently, the first as 100 times a reciprocal average and the second as 100 times the average of the reciprocal data:

100/µ̂ = 100 ((1/n) Σ_i y_i)^(−1)

η̂ = (1/n) Σ_i z_i = 100 · (1/n) Σ_i (1/y_i)

If one decides that the unit "litres per 100 km" is the relevant quantity, there are thus two ways of arriving at it: either as an average of transformed data, or as a transformation of the mean of the original data.
Transformation and confidence intervals
Assume that the observed data are y_1, ..., y_n and that these, e.g. in order to obtain variance homogeneity, have been transformed to z_1, ..., z_n by the transformation h, i.e. z_i = h(y_i).

On the transformed scale a statistical analysis has been carried out. Let θ be the quantity we are interested in. On the basis of the (transformed) data one obtains an estimate θ̂ of θ, together with an estimate σ̂_θ̂ of the standard error of θ̂.

For example, θ could be the slope in a linear regression

Z_i = α + θx_i + ε_i.
In general, a (1 − α) confidence interval for θ is given by two random variables Z_low and Z_high such that the probability that θ lies in the interval [Z_low, Z_high] is 100(1 − α)%.

In many classical linear models a (1 − α) confidence interval is calculated as

Z_low = θ̂ − t_{1−α/2}(d) σ̂_θ̂
Z_high = θ̂ + t_{1−α/2}(d) σ̂_θ̂

where t_{1−α/2}(d) is the 1 − α/2 quantile of a t–distribution with d degrees of freedom.

If, e.g., θ is the slope in a regression as above, then θ expresses the expected increase in Z when x is increased by one unit.
Often one is interested in expressing the expected increase of Y, that is, on the original scale, when x is increased by one unit. Loosely speaking, one wants to express θ "on the original scale".
This is often done as follows. Let h⁻¹ be the inverse function of h. One then lets h⁻¹(θ) be an expression of θ "on the original scale".

One therefore applies h⁻¹ to the estimated value θ̂, which gives η̂ = h⁻¹(θ̂). The confidence limits on the transformed scale can also be transformed back with h⁻¹:

If h is strictly increasing then

Y_low = h⁻¹(Z_low)
Y_high = h⁻¹(Z_high)
and if h is strictly decreasing, then

Y_low = h⁻¹(Z_high)
Y_high = h⁻¹(Z_low)

If [Z_low, Z_high] is a 100(1 − α)% confidence interval for θ, then [Y_low, Y_high] is a 100(1 − α)% confidence interval for h⁻¹(θ).

Note: [Z_low, Z_high] is symmetric around θ̂, but [Y_low, Y_high] is in general NOT symmetric around h⁻¹(θ̂).

If h is approximately linear, then so is h⁻¹, and in that case [Y_low, Y_high] becomes approximately symmetric around h⁻¹(θ̂).
An alternative to the above is the following: The mean and the variance on the transformed scale are approximately given by

E(Z) = E(h(Y)) ≈ h(E(Y))
Var(Z) = Var(h(Y)) ≈ h′(E(Y))² Var(Y).

One can now solve these by means of h⁻¹. One gets

E(Y) ≈ h⁻¹(E(Z))
Var(Y) ≈ Var(Z)/[h′(E(Y))]² = Var(Z)/[h′(h⁻¹(E(Z)))]².
These results can be applied to the parameter θ that we are interested in. One then gets

η̂ = h⁻¹(θ̂)
σ̂_η̂ = σ̂_θ̂ / |h′(η̂)|

It is now tempting to calculate confidence limits for h⁻¹(θ) as

Ỹ_low = η̂ − t_{1−α/2}(d) σ̂_η̂
Ỹ_high = η̂ + t_{1−α/2}(d) σ̂_η̂

This interval becomes symmetric around η̂.

There are, however, as far as we know, no good formal arguments for calling [Ỹ_low, Ỹ_high] a 100(1 − α)% confidence interval for h⁻¹(θ). Therefore [Y_low, Y_high] is generally recommended.

In some cases, though, [Y_low, Y_high] and [Ỹ_low, Ỹ_high] will in fact resemble each other closely. This happens if the variation in the data is small. Within a narrow interval h can then be regarded as roughly linear, whereby the above approximations become good.
Example: Assume that the data have been transformed as

z_i = h(y_i)

On the basis of the transformed data a regression is made,

Z_i = α + βx_i + ε_i

We are interested in a confidence interval for h⁻¹(β).

On the basis of the data, β̂ = 0.25 and σ̂_β̂ = 0.03 are estimated.

We shall now compare two ways of calculating the intervals. For the sake of the argument we shall carry out the corresponding calculations for σ̂_β̂ = 0.06 and σ̂_β̂ = 0.09.
For simplicity we assume that there are so many observations that the t–distribution resembles a normal distribution. Hence t_{1−α/2}(d) ≈ 1.96 for α = 0.05.

First note that

h(y) = √y = y^(1/2), whereby
h⁻¹(y) = y² and
h′(y) = 1/(2√y),

and that η̂ = h⁻¹(β̂) = 0.25² = 0.0625 and h′(η̂) = 1/(2 · 0.25) = 2 (check for yourself!).
For σ̂_β̂ = 0.03 we now get

Z_low = β̂ − 1.96 σ̂_β̂ = 0.19
Z_high = β̂ + 1.96 σ̂_β̂ = 0.31

Transforming these limits back by h⁻¹ gives

Y_low = 0.19² = 0.0361
Y_high = 0.31² = 0.0961

which is not symmetric around η̂ = h⁻¹(β̂) (but almost!).
Under the second method sketched above we must calculate

σ̂_η̂ = σ̂_β̂ / |h′(η̂)| = σ̂_β̂ / 2 = 0.015

since σ̂_β̂ = 0.03. We now get

Ỹ_low = η̂ − 1.96 σ̂_η̂ = 0.0331
Ỹ_high = η̂ + 1.96 σ̂_η̂ = 0.0919.

We thus see that the intervals [Y_low, Y_high] and [Ỹ_low, Ỹ_high] resemble each other closely.
For σ̂_β̂ = 0.06 completely analogous calculations are carried out, and we find σ̂_η̂ = 0.06/2 = 0.03. This gives

Z_low = β̂ − 1.96 σ̂_β̂ = 0.13
Z_high = β̂ + 1.96 σ̂_β̂ = 0.37
Y_low = Z_low² = 0.0175
Y_high = Z_high² = 0.1351
Ỹ_low = η̂ − 1.96 σ̂_η̂ = 0.0037
Ỹ_high = η̂ + 1.96 σ̂_η̂ = 0.1213.

We now see that the intervals [Y_low, Y_high] and [Ỹ_low, Ỹ_high] become more different.
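The interval arithmetic above can be reproduced in a small data step (a sketch; the variable names are our own):

data intervals;
  beta = 0.25;
  eta  = beta**2;                     /* h-inverse of beta-hat      */
  do sigma = 0.03, 0.06, 0.09;
    z_low    = beta - 1.96*sigma;     /* interval on the sqrt scale */
    z_high   = beta + 1.96*sigma;
    y_low    = z_low**2;              /* back-transformed interval  */
    y_high   = z_high**2;
    sigmaeta = 2*sqrt(eta)*sigma;     /* = sigma/|h'(eta)|          */
    yt_low   = eta - 1.96*sigmaeta;   /* symmetric (delta) interval */
    yt_high  = eta + 1.96*sigmaeta;
    output;
  end;
run;
proc print data=intervals noobs; run;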
20 Variance heterogeneity: Example of effect of transformation

This lecture illustrates the consequence of transformation, based on an analysis of an experiment investigating the effect of feed concentration on the muscle content of a certain ingredient.
Transformation back to the original scale is discussed, both in relation to the mean level and to estimates of treatment effects.

Finally, examples are shown of different scales for usual production traits within animal production.
Link to full-screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/VariansHetero.f.pdf
Variance homogeneity
• Variance homogeneity
• Transformation as a solution
• Effect of back-transformation.
Variance homogeneity
Y_ij = µ + α_i + ε_ij

ε_ij ∼ N(0, σ²)

Variance homogeneity is implied by the missing suffix on σ².
Variance homogeneity
Herd no.   Herd type   Observations   Herd average
A          1           10             12.3
B          2           10             13.6
C          1           10             10.2
D          2           10             15.0
Variance homogeneity
Herd no.   Herd type   Observations   Herd average
A          1           100            12.3
B          2           100            13.6
C          1           1              10.2
D          2           1              15.0

Weigh the herd averages according to the precision of the measurements.
Variance of an average
Ȳ = (1/n_obs) Σ_i Y_i

V(Ȳ) = σ²_Y / n_obs

The magnitude of the variance inhomogeneity can be assessed by using this as an analogue: a herd average based on 100 observations has variance σ²_Y/100, while one based on a single observation has variance σ²_Y.
Example
A certain ingredient is added to the feed ration in the concentration x, x ∈ {0.0, 4.4, 6.2, 9.3}. The pigs are fed with the rations until 60 kg. Biopsies are made at 60 kg. The concentration of the feed ingredient in the biopsy is measured. Let y_i denote the concentration of the ingredient in animal i.
Mean curve
[Figure: fitted mean curves of y, muscle conc. at 60 kg, against x, feed contents]
Transformation ?
[Figure: variance against mean (left); log(variance) against log(mean) (right)]
Model of expectations
E(y) = µ + α_i

E(√y) = µ + α_i ⇒ E(y) = µ² + α_i² + 2µα_i

E(log(y)) = µ + α_i ⇒ E(y) = exp(µ) exp(α_i)
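For the log model in particular, a difference on the transformed scale back-transforms to a ratio:

E(log(y_1)) − E(log(y_2)) = α_1 − α_2 ⇒ exp(µ) exp(α_1) / (exp(µ) exp(α_2)) = exp(α_1 − α_2)

so on the original scale the treatment effect is multiplicative rather than additive.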
Curve fitting

E(y) = µ + β_1 x + β_2 x²

E(√y) = µ + β_1 x + β_2 x²

E(log(y)) = µ + β_1 x + β_2 x²
Model comparison
Dependent variable          y                      √y
Parameter         Estimate   P-value     Estimate   P-value
β_1               0.438      0.081***    0.242      0.026***
β_2               -0.007     0.008       -0.010     0.003**
Sqrt transformed
[Figure: sqrt(y), muscle conc. at 60 kg, against x, feed contents (left); y against x (right)]
Comparisons
[Figure: two panels comparing fitted curves of y, muscle conc. at 60 kg, against x, feed contents]
Treatment differences
Very often we are interested in estimating treatment differences, α_1 − α_2. In SAS we may use the PDIFF option in LSMEANS, or ESTIMATE. How do we transform?

[Figure: sqrt(y), muscle conc. at 60 kg, against x, feed contents]
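On the square-root scale the differences themselves are obtained directly; a sketch (the data set and variable names are hypothetical):

proc mixed data=pigs;
  class conc;
  model sqrty = conc;
  lsmeans conc / pdiff cl;                   /* all pairwise differences */
  estimate 'conc 0.0 vs 4.4' conc 1 -1 0 0;
run;

Squaring an LSMEAN back-transforms a mean to the original scale, but the difference of two squared LSMEANs is not the square of the difference; this is exactly the problem raised above.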
Conclusion
• Transformations may achieve variance homogeneity

• Transformations change the model of the mean

• Back transformation of expected values is OK

• Back transformation of general estimable functions may cause problems
Natural scales ?
• Geometric cell-count
• Daily gain vs Age at slaughter
• Feed utilisation FU/Gain vs. Gain/FU
• Calvings per cow year vs. Calving interval.
• Feeding interval vs. Feeding frequency
21 Variance Homogeneity: Diurnal Variation
The purpose of this lecture was to illustrate the application and combination of some of the advanced topics presented during the course.

A data set consisting of half-hourly observations of cortisol release in pigs was analysed using a random regression model to capture the individual differences between pigs in diurnal variation. The power-of-mean approach was used to model the variance heterogeneity.

The application of such a model requires iterative use of PROC MIXED.

The experience with the model was that it was possible to estimate the model parameters, but that it was necessary to 'nudge' the procedure to secure convergence of the iterative calculations, and that the calculations were very time-consuming. At the current state of the art the application of such models is not a routine matter.
Link to full-screen presentation1
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/PowerOfMean.f.pdf
Example

In an experiment pigs were assigned to two different treatments in order to study the effect of the treatment on the diurnal release of cortisol. Cortisol was sampled continuously over a period of approximately 24 hours for each animal.

Y_ijk = µ + α_i + A_ij + (β_1 + B_1j) cos(2πt_ijk/24) + (β_2 + B_2j) sin(2πt_ijk/24) + ε_ijk

where Y_ijk is the logarithmically transformed plasma cortisol, µ the general mean, α_i the effect of treatment, and A_ij the random effect of animal j within treatment i.
cos(2πt_ijk/24) and sin(2πt_ijk/24) are covariates for estimation of the diurnal variation. β_k and B_kj are the corresponding regression parameters, β_k a systematic effect and B_kj a random deviation from the line. The random effects (A_ij, B_1j, B_2j)ᵀ ∼ N_3(0, V), where V is a 3 × 3 variance matrix, and ε_ijk ∼ N(0, σ²).
Random regression model
The model is a random regression model and can be estimated using the following SAS statements
*Initial model ;
data a ;
....
PI=3.141593 ;
sint=sin(time*2*pi/24) ;
cost=cos(time*2*pi/24) ;
proc mixed CL data=a ;
class beh dyr ;
model Logcort = beh sint cost /ddfm=satterth ;
random intercept sint cost / subject=dyr*kuld*beh type=un ;
Example results
[Figure: log(Cortisol) against time (hours) with fitted diurnal curves for animals 17111, 31111 and 35111]
Model of Mean ?
exp(Xβ) = exp(µ + α_i + A_ij + (β_1 + B_1j) cos(2πt_ijk/24) + (β_2 + B_2j) sin(2πt_ijk/24))
Modelling variance inhomogeneity

The logarithmic transform of cortisol was used because the variance increased with the mean. Another approach is to model this increase directly.

Using the so-called power of mean method, we use the measured cortisol level directly, but instead of homogeneous variance we assume

ε_ijk ∼ N(0, σ²_n |Xβ|^δ)

and estimate σ²_n and δ.

In order to do this it is necessary to perform the calculations with PROC MIXED iteratively.
SAS Model
*Initial model ;
proc mixed CL data=a ;
class kuld beh dyr ;
model cortisol = beh sint cost /ddfm=satterth s;
random intercept sint cost / subject=dyr*kuld*beh type=un ;
repeated / subject=dyr*kuld*beh local ;
ods output SolutionF=sf ;
ods output Covparms=cp ;
run;
* Loop ;
proc mixed CL data=a maxiTER=100 CONVH=1e-8;
class kuld beh dyr ;
model cortisol = beh sint cost /ddfm=satterth s;
random intercept sint cost /
subject=dyr*kuld*beh type=un s ;
repeated /local=pom(sf) ;
parms /pdata=cp ;
ods output SolutionF=sf1 ;
ods output SolutionR=Coeff ;
ods output Covparms=cp1 ;
run ;
proc compare brief data=sf compare=sf1 ;
var estimate ;
run;
data sf ; set sf1 ;
data cp ; set cp1 ;
run;
Experience

• δ was estimated as 3.10, indicating that the logarithmic transformation may not be sufficient to obtain variance homogeneity (the power family instead suggests y^(−1/2))

• Estimation of a single model runs much longer with POM

• It was necessary to adjust the convergence criteria to obtain convergence

• Approx. 10 iterations were needed.
22 Links to supplementary material
In order to illustrate the underlying principles in linear algebra it was necessary to introduce a method for performing the calculations. For that purpose the IML procedure of SAS was introduced, using the small program in ImlExample.sas1

Several SAS macros were introduced for performing standard calculations, e.g., a SAS macro for calculation of autocorrelations2. The biometry research unit has further SAS macros and examples on this web-page3.

The book used for the course, LMSW (Littell et al., 1996), contains a series of program examples. These examples may be downloaded from SAS Institute's home page, but can be found here4 as well. Another important link is the SAS online manual5.

Finally, most of the course participants used Word for text processing and SAS for making graphs. Getting these two programs to interact satisfactorily was clearly a problem. Therefore a short note, Eksport af grafer fra SAS til Word6, was made, and references were made to SAS tech. report ts252x7, where the export facilities are discussed in detail.
1http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/ImlExample.sas2http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SAS/autocorr.sas3 http://www.jbs.agrsci.dk/Biometri/SASmateriale/SASmateriale.html4http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SAS/sasmixed.sas5http://dokumentation.agrsci.dk/sasdocv8/sasdoc/sashtml/onldoc.htm6http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/SAS2Word.pdf7http://www.jbs.agrsci.dk/biometri/Courses/HSVmixed2001/ts252x.pdf
Bibliography
Littell, R.C., G.A. Milliken, W.W. Stroup, & R.D. Wolfinger (1996). SAS System for Mixed Models. SAS Institute, Inc., Cary, NC.