STAT3012 TYPED LECTURE NOTES
APPLIED LINEAR MODELS STAT3012 NOTES
Contents
Introduction
  Assessments
  Outline
  Outcomes
  Rough Weekly Outline
  Background
    Role of statistics
    Unit aims
  ‘Linear Models’
    Note on ‘Linear Model’
    Why linear models?
  Course Synopsis
    Lecture structure
Obtaining Data
  Experiments vs Observational studies
    Definitions
    Experiments
    Observational Studies
    Key message:
Statistical Principles in the Design of Experiments
  Theory:
    Terminology Definitions
    Common Experimental design problems:
Intro to RStudio
  Basic theory:
    Expressions and assignments
    Sweave
Simple Linear Regression
  Assumed Knowledge:
    Correlation Coefficient
  Data and model
    Components:
  Parameter Estimation:
    Least Squares Estimates (LSEs)
    Method of Maximum Likelihood
    Correlation Coefficient and Regression Slope
    Estimating the Error Variance:
    Explaining the variability of the 𝑌’s
Diagnostics and Inference in Regression
  Model Diagnostics:
    Assessing assumptions
    Q-Q plots
  Inference for a Linear Regression Model
    Theorem: Distribution of 𝒄ᵀ𝒀
    Distribution of LSEs (least squares estimators)
  Inference and Prediction in simple linear regression
    Sampling Distribution of 𝛽₁
    Inference for the error variance 𝜎²
Prediction and estimation in SLR
  Prediction of 𝑌|𝑥₀
    Prediction vs estimation of mean response
    Sampling distribution of mean response
Multiple Regression
  Multiple regression Theory – Data
    Principle of least squares estimation
    Polynomial regression:
    Goodness of Fit (GoF) criteria:
    Predicted value
  Matrix approach to multiple regression
    Matrix formulation of the linear model
  Leverage and Cook’s Distance
    Outlying points in ℝᵖ
Extensions to Regression Modelling:
  Theory: ANOVA/ANCOVA etc.
    Three treatments:
General F test / Multicollinearity
  General 𝐹 test
    Classical approach
Variable selection: Backward and forward
  Motivation:
    Possible subsets:
    The linear regression model 𝑚
  Automated variable selection algorithms
    Steps
    Backward variable selection
    Forward variable selection
  Stepwise AIC and BIC
    Theory: Stepwise forward variable selection
    Theory: More goodness of fit criteria
Polynomial Regression
  Theory:
    Polynomial Regression Model
    Collinearity
Robust Regression
  References and further reading:
  Theory:
    𝑥 and 𝑦 outliers
    Theory:
    Alternatives to LS and L1:
    Resistant and efficient regression
One way ANOVA
  One way ANOVA:
    Scope:
  ANOVA Model
    Model equation
  More 1 way ANOVA
    Structure of ANOVA table for single treatment model
    Distribution of Treatment sum of squares (TSS)
    Contrasts:
Multiple Comparisons
  Keeping the 𝛼 Error
    Example: 3 CIs
  Data snooping
    Tukey’s Confidence Intervals (Honest Significant Difference)
    Bonferroni CIs
    Scheffé simultaneous CI
  Conclusion: Multiple testing
Quantitative factors
  Factor or Numerical Variable?
    Example: drug levels
  Polynomial regression
    Polynomial regression equivalent to ANOVA
    Nesting of linear effects
2 way ANOVA
  2 way analysis of variance
    Additive factor model
    Main effects model for 2 factors
    Estimation:
  More 2 way ANOVA
    Recall: Decomposing TSS given 𝑛ᵢⱼ = 𝑟
    Test for interaction effects:
    Mean response / interaction plot
Assessing Normality
  Assessing normality
    Data and testing problems:
    Pearson’s chi-squared test
    Kolmogorov-Smirnov (KS) test
    How to get correct p values?
Introduction to the Design of Experiments
  Origins: RA Fisher
  Randomised Design:
    Example: 𝑡 = 2; 𝑛₁ = 2; 𝑛₂ = 3
    Randomised complete block design (RCBD)
  Simple design for comparing one factor
    Completely randomised design
Randomised complete block design (RCBD)
  Recall:
  2 way ANOVA for complete block design
    Assumptions:
    Construction of RCBDs
  2 way ANOVA table
    Example: omnibus
    Pairwise differences:
Latin square design
  Motivation:
    Definition of standard 𝑡² Latin square design (LSD)
    Cyclic permutation in LSD
    Linear regression model for LSD
  Analysis of LSD
    3 way ANOVA for LSD
    Error variance in LSD
    Treatment contrasts
Revisiting design of experiments
  Concepts:
    Previous work:
  Experimental unit and observational unit
    Example: lady tasting tea
    Example: tomatoes:
  Relationship between experimental unit (EU) and observational unit (OU)
    Mathematical formulation
  Blocking
    Example: weed control
    How to block?
    Ideal vs reality
Nested factors
  Concepts:
    Example: scientists in labs
    Modelling with nested factors:
Nested Design
  Concepts
    Example: Calf feeding
  Pseudo-replication
    Technical & biological replication
    More on Split-plot designs
Incomplete Block Design
  Incomplete block designs
    Example: Potato yield
    Different types of sum of squares
    Balanced incomplete block design (BIBD)
Analysis of Covariance (ANCOVA)
    Example: Optimal fish meal for ducklings
  ANCOVA
    Example: Fish meal
    Common slope model
    ANCOVA: Treatment contrasts
    Linear models for ANCOVA
Random Effects Model
    Example: Sodium content in beer
  One way ANOVA model 2
    Example: Beer sodium content
    Differences between one-way ANOVA model 1 and 2
Linear Mixed Models
  Linear Mixed Model
    Mixed model equations (MME)
    Performance test: random effect model
    Some notes on linear mixed models:
Variance Component Estimation
  Concepts:
  Method of moments estimate of variance components
    Maximum likelihood estimate of variance components
    Residual (restricted) likelihood
Longitudinal Data
  Repeated measures and longitudinal data
    Example: Sleep data
Agricultural Data
    Example: Split plot experiment for bean yield
    Example: Pedigree information
  More thoughts on covariance structure
Hierarchical Data
  Nested vs crossed random effects
    Small area estimation
    Area level model
    Unit Level model:
    Final Thoughts:
Revision Lecture and Exam Information
  Exam info:
  Summary:
    Multiple Linear Regression:
    ANOVA and experimental Design
    Linear Mixed Model
Lecture 1.
Introduction
Michael Stewart
Carslaw 818
Assessments
- Week 04: Wed, 28/03/18: a quiz in place of the lecture
- Week 10: Fri, 18/05/18: a quiz in place of the lecture
- Week 13: no computer lab
Outline
The main objective of this course is to introduce the fundamental concepts of analysing data from
both observational studies and designed experiments using classical linear methods, together with
the concepts of data collection and the design of experiments. Additional objectives are to
gain competency in the application and understanding of linear models and regression methods, with
diagnostics for checking the appropriateness of models; to be introduced to robust regression methods;
to be introduced to the design and analysis of experiments and to further understand the notions of
replication, randomisation and ideas of factorial designs; to enhance proficiency in the use of the R
statistical package to give analyses and graphical displays.
Outcomes
- Proficiency in the use of the general F-test as the main tool to choose between two nested
regression models
- proficiency in assessing model assumptions and outlier detection in regression models
through standard diagnostic plots (box plot, scatterplot, Q-Q-plot, Cook’s distance plot,
leverage vs residual plot), through influence measures (leverage values, Cook’s distance) and
through tests (the Bartlett test for homoscedasticity, and normality tests)
- proficiency in the understanding and application of multiple linear regression and in the
understanding of $R^2$ and the adjusted $R^2$
- proficiency in the understanding and application of 1-way ANOVA models of type I and II,
including finding an interpretation of the TSS term through using the concept of orthogonal
contrasts and making inference on all parameters
- proficiency in the understanding and application of 2-way ANOVA models of type I and
making inference on all parameters
- proficiency in the calculation and decomposition of sum of squares terms in multi-way
ANOVA for orthogonal designs
- competency in correcting multiple pairwise comparisons by applying the Tukey, Scheffé
and Bonferroni correction
- competency in deriving the least-squares estimator in linear regression
- competency in the calculation and interpretation of confidence intervals for all parameters
in linear regression
- competency in the understanding of the difference between confidence intervals and
prediction intervals
- competency in model selection through using the F-test, t-test, AIC or BIC through full
searches or by using step-wise procedures (backward, forward, stepwise)
- competency in polynomial regression models and their selection through using orthogonal
polynomials
- competency in using the R function lmer for the fitting of mixed models and a
basic understanding of these complicated models
- competency in reducing a nominal factor in a multi-way ANOVA to a continuous variable
through using linear contrast coefficients
- competency in calculating the distribution for contrasts and using this to calculate
confidence intervals for contrasts
- competency in the design of an appropriate scheme for treatment allocation and data
collection as well as the correct analysis for complete block designs (CBD),
randomised CBD (RCBD), Latin square designs (LSD), incomplete block designs (IBD) and
balanced IBD (BIBD), ANCOVAs, and nested designs
- competency in the understanding of blocks, nested factors, interactions terms and
confounding in experimental designs
- competency in using R to compute estimates and standard errors for regression parameters
without built-in functions such as lm and aov, for generating treatment allocation lists for
the CBD, RCBD and LSD.
- basic understanding of L1 regression, M regression and MM regression
- advanced stream students will additionally have competency in theoretical aspects of
regression methods, in particular the Gauss-Markov theorem and appreciation of the F-test;
if time permits, partial correlation coefficients will be taught and in that case a level of
competency should be reached
Rough Weekly Outline
1. Experimental designs, observational studies, software R, simple linear regression
2. Model diagnostics, inference for linear regression, fitting multiple linear regression models
3. Inference for multiple regression models, multiple correlation coefficients, Leverage and
Cook’s distance, the general F-test
4. Subset selection using stepwise procedures and AIC, Cp and BIC
5. Polynomial regression, orthogonal polynomials, Robust regression, 1-way ANOVA
6. Simultaneous CIs, decomposing sums of squares
7. Quantitative factors, 2-way ANOVA, interactions
8. 2-way ANOVA with interactions, Normality tests, experimental designs
9. Randomized complete block designs, Latin square designs, incomplete block designs
10. Analysis of covariance, nested factors
11. Revisiting Experimental Design, nested designs, random effect model
12. Variance component estimation, mixed effects models, longitudinal data
13. Agricultural data, hierarchical data, revision
5/03/2018
Background
- The scientific method is about gaining knowledge based on (hard) evidence, which involves
the following steps:
o Formulate question
o Collect relevant data
o Do statistical analysis of data
o Draw conclusions
Example – abundance of bird species
- What habitat characteristics explain the diversity of bird species?
- Field study: collect relevant information (ask Ecologists / Biologists)
- Calculate correlation coefficients, look at scatter plots, use multiple linear regression.
- The best statistical model is based on characteristic x, y and z.
Role of statistics
- Design an appropriate scheme for data collection;
- Perform a suitable statistical analysis;
- Derive valid conclusions.
Applied Linear Models will help you do all these things.
Unit aims
The overall aim of this unit is to develop skills in the statistical analysis of data from designed
experiments and observational studies.
Specific learning objectives are
- Understanding the fundamentals of good design in experiments and studies;
- Understanding the elements of a linear model;
- Ability to develop and apply linear models for data from real-world experiments and studies;
- Further proficiency with a statistical computer package for linear modelling
Note: distinction between observational studies and experiments
‘Linear Models’
Note on ‘Linear Model’
It’s clear that 𝑦𝑖 = 𝛼 + 𝛽𝑥𝑖 + 𝜖𝑖 is linear.
But what if the relationship between 𝑦 and 𝑥 is not linear?
- Suppose you have data on a response variable 𝑦 (e.g. blood pressure) and an explanatory
variable 𝑥 (e.g. a measurement of cholesterol).
- Want to model the relationship between the mean value of 𝑦, and 𝑥.
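- Might use a ‘simple’ linear regression model, extended by a squared term:
$$y_i = \alpha + \beta x_i + \gamma x_i^2 + \epsilon_i$$
You might call this ‘quadratic regression’, but the model is still linear in the parameters (usually the
$\beta$’s).
E.g.
$$E[Y] = \beta_0 + \beta_1 x^2$$
can be written as
$$E[Y] = \begin{pmatrix} 1 & x^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$
- As we know $x$ (it comes from the data), it is the $\beta$’s that we do not know and have to
estimate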
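The point that a model with a squared term is still linear in its parameters can be sketched in R. This is a minimal illustration on simulated data; the coefficients, noise level and variable names are assumptions chosen for the example, not course data.

```r
# Simulated data; the true coefficients and noise level are assumptions
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x + 0.3 * x^2 + rnorm(50, sd = 1)

# Quadratic in x, but linear in (alpha, beta, gamma), so lm() still applies
fit <- lm(y ~ x + I(x^2))

# The design matrix has columns 1, x and x^2; solving the normal equations
# by hand gives the same estimates as lm()
X <- model.matrix(fit)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
print(cbind(lm = coef(fit), by_hand = as.vector(beta_hat)))
```

The wrapper I() is needed so that ^ is read as arithmetic rather than as formula syntax.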
Why linear models?
- Linear models are easy to apply and interpret.
- The mathematical theory underlying linear models is very well understood.
- We can investigate the relationship between a response and lots of explanatory variables in
a straightforward manner.
- A linear model will often (but not always) provide an adequate approximation to reality
Course Synopsis
- Introduction / overview
o Statistical experiments
o RStudio – R, TeX and HTML
o Simple linear regression - again!
- Multiple linear regression
- Analysis of variance and covariance (of experimental data)
- Analysis of experimental designs
o Repeated measures, nested factors, complete designs,
o Balanced incomplete designs, LSD, random effects, mixed models, etc
Lecture structure
- Theme is shown after the lecture number
- Refresher of key concepts from previous lecture(s)
- New concepts – motivation
- Several blocks of new material
o Theory : definitions, theorems and proofs
o Examples : by hand and with R
- Summary and outlook
Obtaining Data
3 types of data:
- Available data
- Experimental
- Observational studies
Experiments vs Observational studies
Definitions
Definition 1 (Experiment).
Something is done to people, animals or objects in order to observe the response.
Definition 2 (Observational study).
Individuals are observed and variables of interest are measured, but nothing is deliberately done to
the individuals to affect the response.
Example: Blood pressure
Ten volunteers have their blood pressure measured on day 1 and day 2 in a study. Describe a first
scenario in which these volunteers take part in an observational study and a second scenario where
they are part of an experiment.
- Observational study: the volunteers receive no specific instruction, i.e. their blood pressure
is measured together with other variables (amount of sleep, alcohol and food intake, level of
exercise etc).
- Experiment: each volunteer is assigned to a treatment group (5h vs 8h sleep), with the
objective of detecting different rates of change in blood pressure.
Lecture 2. Wednesday, 7 March 2018
Experiments
Theory
Terminology
- Individuals on which the experiment is done are called (experimental) units.
- Each unit is subjected to a specific experimental condition called a treatment.
- The treatment is determined by the combination of values (or levels) taken by the
explanatory variables (or factors).
Principles of a well-designed experiment:
1. Control: The effects of lurking variables on the response should be controlled.
E.g. effectiveness of drinking V: dosage, time of day, time since last meal
2. Randomization: Using impersonal chance to assign experimental units to treatments is
important for two reasons:
a. It removes the danger of experimental bias;
E.g. if young people only ask other young people, the conclusion is only true for
young people.
b. It allows the laws of probability to be applied in a straightforward fashion to the
results, and for conclusions to be interpreted in terms of causation.
3. Replication: This reduces chance variation in results. The larger the n the smaller the
standard error.
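The randomization principle can be sketched in R with sample(). The example reuses the ten volunteers and the 5h vs 8h sleep treatments from the blood pressure example; the unit labels and the seed are assumptions made only for illustration.

```r
# Ten hypothetical experimental units and two treatments (labels are assumptions)
set.seed(2018)
units <- paste0("unit", 1:10)
treatments <- rep(c("5h sleep", "8h sleep"), each = 5)

# sample() permutes the treatment labels, so chance alone decides the allocation
allocation <- data.frame(unit = units, treatment = sample(treatments))
print(allocation)

# Replication: each treatment is applied to five units
print(table(allocation$treatment))
```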
Observational Studies
Terminology
- The population is the entire group of individuals about which we want information.
- A sample is that part of the population that is examined in order to gather data.
Example: Birds of the High Paramo
- A paramo is an exposed, high plateau in the tropical parts of South America.
- In the northern Andes, there is a pattern of `islands' of vegetation within the otherwise bare
paramo.
- An (observational) study was conducted to investigate the bird life in this region.
- One question of interest : what characteristics of these islands (if any) affect the diversity of
bird species?
Investigation:
For each island of vegetation the following variables were recorded:
- number of species of bird present (𝑁)
- area of the island in square kilometers (𝐴𝑅),
- elevation in thousands of meters (𝐸𝐿),
- the distance from Ecuador in kilometers (𝐷𝐸𝑐)
- distance to the nearest other island in kilometers (𝐷𝑁𝐼).
Reference: Vuilleumier (1970), `Insular biogeography in continental regions. I. The northern Andes of
South America', American Naturalist, 104, 373-388.
Initial questions: observational study or experiment? experimental units? treatment? levels?
population? sample?
Example: stimulating effects of caffeine
Key message: Good design is important
- Data can be obtained from a variety of sources.
- Importance of good design in experiments and observational studies cannot be overstated.
- Poorly designed studies can lead to data that cannot answer the scientific question at hand
no matter how cleverly they are analyzed. (otherwise GIGO: Garbage in, Garbage Out)
Statistical Principles in the Design of Experiments
Several important facts:
- Statistically designed experiments are economical;
- They allow the influence of one or several factors on a response to be measured;
- They allow the estimation of the magnitude of experimental error;
- Economical means achieving fixed type I and type II error rates with the smallest 𝑛 (the lowest error for the fewest
experimental units).
New concepts:
- Notion of factors, blocks, covariates, confounding variables, design layout, effects,
interactions, replications.
- Common problems in experimental designs: masking, under-powered or overpowered
studies.
Theory:
Terminology Definitions
Block
Group of homogeneous experimental units
- Eg: STAT3012 vs STAT3912
Confounding
One or more effects that cannot unambiguously be attributed to a single factor or interaction.
- Eg: murder rate and ice cream consumption are related through hot weather
Covariate:
Uncontrollable variable that influences the response but is unaffected by any other experimental
factors:
- Eg: age, health
Design (layout)
Complete specification of experimental test runs, including blocking, randomization, repeat tests,
replication, and the assignment of factor level combinations to experimental units.
Effect:
Change in the average response between two factor-level combinations or between two
experimental conditions.
Factor.
A controllable experimental variable that is thought to influence the response.
- Sitting in the front vs sitting in the back.
- Each factor level combination is regarded as a different treatment
Interaction.
Existence of joint factor effects in which the effect of each factor depends on the levels of the other
factors.
Replication.
Repetition of an entire experiment or a portion of an experiment under two or more sets of
conditions.
Response.
Outcome or result of an experiment.
Unit (item).
Entity on which a measurement or an observation is made; sometimes refers to the actual
measurement or observation.
Example: agricultural experiment
Example: pipes
A test program was conducted to evaluate the quality of epoxy-glass-fiber pipes taken from each of
two manufacturing plants. Each pipe was produced under normal or severe operating conditions and
at one of two water temperatures. The following test conditions constituted the experimental
protocol:
Common Experimental design problems:
- Masking of factor effects: experimental variation masks factor effects.
- Uncontrolled factors: uncontrolled factors compromise experimental conclusions (too few
factors).
- Erroneous principles of efficiency lead to unnecessary waste or inconclusive results (too
many factors, too complex designs, e.g. gene arrays with p ≈ 30k but n ≤ 100).
- Scientific objectives for many-factor experiments may not be achieved with one-factor-at-a-
time designs
Example: masking problem
Two possible situations for an experimental factor with two levels (red/blue):
There is a clear effect (of the same size) but it can only be detected in case 2.
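Masking can be sketched with a small simulation in R. The effect size, sample size and the two noise levels below are assumptions chosen for illustration, and the two situations are labelled by their noise level rather than by the case numbers above.

```r
set.seed(3)
n <- 20
effect <- 1   # the same true difference between red and blue in both situations

# High experimental variation: the effect is masked
red_noisy  <- rnorm(n, mean = 0,      sd = 10)
blue_noisy <- rnorm(n, mean = effect, sd = 10)

# Low experimental variation: the same effect stands out
red_clean  <- rnorm(n, mean = 0,      sd = 0.1)
blue_clean <- rnorm(n, mean = effect, sd = 0.1)

# Two-sample t-tests for a difference in means
p_noisy <- t.test(red_noisy, blue_noisy)$p.value
p_clean <- t.test(red_clean, blue_clean)$p.value
print(c(high_variation = p_noisy, low_variation = p_clean))
```

With the same effect size, the p-value is far smaller when the experimental variation is small.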
Intro to RStudio
New Concepts:
- Vectors put into data frames (flexible data structure).
- A data frame is a collection of column vectors each of the same length.
- The vectors may be numeric, factor, or whatever and each particular column of a data frame
is given a name (chosen by the user, or assigned a default by R).
- A data matrix in R is a collection of numeric vectors of the same length.
- An array in R is a collection of matrices of the same size.
- A list in R is a collection of different R objects
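The data structures listed above can be sketched in a few lines of R. All values and object names are made up for illustration.

```r
# Vectors of the same length (values are made up)
species <- c(36, 30, 37)                # numeric vector
region  <- factor(c("A", "B", "A"))     # factor vector

# A data frame: named columns of equal length, possibly of different types
birds <- data.frame(N = species, region = region)

# A matrix: numeric columns of the same length
m <- cbind(a = 1:3, b = 4:6)

# An array: here two 3 x 2 matrices stacked along a third dimension
arr <- array(1:12, dim = c(3, 2, 2))

# A list: a collection of different R objects
everything <- list(data = birds, matrix = m, array = arr)

str(birds)
print(dim(arr))
```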
Basic theory:
Expressions and assignments
Sweave
Lecture 3. Friday, 9 March 2018
Simple Linear Regression
In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
It is common to make the additional hypothesis that the ordinary least squares method should be used to minimize the residuals (vertical distances between the points of the data set and the fitted line). Under this hypothesis, the accuracy of a line through the sample points is measured by the sum of squared residuals, and the goal is to make this sum as small as possible. Other regression methods that can be used in place of ordinary least squares include least absolute deviations (minimizing the sum of absolute values of residuals) and the Theil–Sen estimator (which chooses a line whose slope is the median of the slopes determined by pairs of sample points). Deming regression (total least squares) also finds a line that fits a set of two-dimensional sample points, but (unlike ordinary least squares, least absolute deviations, and median slope regression) it is not really an instance of simple linear regression, because it does not separate the coordinates into one dependent and one independent variable and could potentially return a vertical line as its fit.
The remainder of these notes assumes an ordinary least squares regression. In this case, the
slope of the fitted line is equal to the correlation between y and x corrected by the ratio of
standard deviations of these variables. The intercept of the fitted line is such that the line passes
through the center of mass $(\bar{x}, \bar{y})$ of the data points.
Assumed Knowledge:
This topic is assumed knowledge and was already taught in STAT2x12, where the emphasis was on
the practical application of simple linear regression. In STAT3x12 this topic is revisited with the aim
to gain a better theoretical understanding
- Four assumptions for linear regression: errors are iid $N(0, \sigma^2)$.
- Least squares estimates for the parameters in the linear regression model.
- Pearson correlation (𝑟), 𝑟2, and their interpretation.
- Residuals estimate errors
Correlation Coefficient
In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also referred to
as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC) or
the bivariate correlation, is a measure of the linear correlation between two variables X and Y.
It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear
correlation, and −1 is total negative linear correlation. It is widely used in the sciences. It was
developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
In statistics, the correlation coefficient r measures the strength and
direction of a linear relationship between two variables on
a scatterplot. The value of r is always between +1 and –1. To
interpret its value, see which of the following values your
correlation r is closest to:
Exactly –1. A perfect downhill (negative) linear relationship
–0.70. A strong downhill (negative) linear relationship
–0.50. A moderate downhill (negative) relationship
–0.30. A weak downhill (negative) linear relationship
0. No linear relationship
+0.30. A weak uphill (positive) linear relationship
+0.50. A moderate uphill (positive) relationship
+0.70. A strong uphill (positive) linear relationship
Exactly +1. A perfect uphill (positive) linear relationship
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
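As a quick check of the formula for $r$, the sketch below computes it term by term in R and compares the result with the built-in cor(). The six data points are made up for illustration.

```r
# Small made-up data set
x <- c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.2, 5.1, 4.8, 7.2)

# Numerator and the two sums of squares from the formula for r
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)

r_by_hand <- Sxy / (sqrt(Sxx) * sqrt(Syy))
print(c(by_hand = r_by_hand, cor = cor(x, y)))
```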
Data and model
Simple linear regression seeks to model the relationship between
- the mean of a response variable, 𝑌 , and
- a single explanatory variable (or predictor/covariate) 𝑥.
For data (𝑥1, 𝑌1), … , (𝑥𝑛, 𝑌𝑛), where 𝑥1, … , 𝑥𝑛 (lower case) are known constants and 𝑌𝑖 = 𝑦𝑖 are the
observed random responses, we formulate the simple linear regression model as
𝑌𝑖 = 𝛽0 + 𝛽1𝑥𝑖 + 𝜖𝑖
Components:
There are two new components on the RHS of equation 1
- $\beta_0$ and $\beta_1$ are the regression parameters
- $\epsilon_1, \dots, \epsilon_n$ are error terms, satisfying the assumptions below
Error assumptions
1. $E[\epsilon_i] = 0$, for $i = 1, \dots, n$
2. The $\epsilon_i$’s are independent
3. $Var(\epsilon_i) = \sigma^2$ (homoscedasticity assumption)
4. The $\epsilon_i$’s are normally distributed
Also described as:
$$\epsilon_i \sim NID(0, \sigma^2); \quad i = 1, \dots, n$$
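To make the model concrete, the following sketch simulates data from it in R. The parameter values, sample size and range of x are assumptions chosen only for illustration.

```r
# Assumed parameter values, for illustration only
set.seed(42)
n     <- 100
beta0 <- 1
beta1 <- 2
sigma <- 0.5

x   <- runif(n, 0, 10)                  # treated as known constants once drawn
eps <- rnorm(n, mean = 0, sd = sigma)   # NID(0, sigma^2) error terms
Y   <- beta0 + beta1 * x + eps          # responses generated by the model

# The errors average out near 0 and have spread close to sigma
print(c(mean_eps = mean(eps), sd_eps = sd(eps)))
head(data.frame(x = x, Y = Y))
```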
Example: Body weight versus brain weight
- Data from Allison T and Cicchetti D (1976), Sleep in mammals: ecological and constitutional
correlates, Science 194:732-734.
- It is of interest to know whether brain weight for different mammal species truly depends on
body weight.
- View brain weight as the response (𝑌) variable and body weight as the predictor (𝑥) variable.
- We can test if there is an underlying simple linear regression model
- Note that the log-transformed data result in a more homogeneous scatterplot.
Parameter Estimation:
We have $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, under the assumption that $\epsilon_i \sim NID(0, \sigma^2)$,
giving
$$E[Y_i \mid x_i] = \mu_i = \beta_0 + \beta_1 x_i$$
Our strategy:
- Estimate parameters 𝛽0 and 𝛽1 from the data with the method of least squares, which in
the case of normal errors is the same as using the maximum-likelihood method.
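Least Squares Estimates (LSEs)
$\hat{\beta}_0$ and $\hat{\beta}_1$ are those values that minimize the sum of squares
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \mu_i)^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2$$
The least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are given by
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}; \quad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$
Where the sums of squares $S_{xx}$, $S_{yy}$, $S_{xy}$ are
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2; \quad S_{yy} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2; \quad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})$$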
Proof of LSE
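Estimate the parameters via least squares by setting the partial derivatives to zero, $\nabla S = 0$.
It is easier to find
$$\min_{\beta_0, \beta_1} S(\beta_0, \beta_1) = \min_{\beta_1} \left[ \min_{\beta_0} S(\beta_0, \beta_1) \right]$$
- So minimize over each parameter in turn
Giving:
1. Setting $\partial S / \partial \beta_0 = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i) = 0$ gives
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$
2. Setting $\partial S / \partial \beta_1 = -2 \sum_{i=1}^{n} x_i (Y_i - \beta_0 - \beta_1 x_i) = 0$ and subbing in $\hat{\beta}_0$ gives
$$(\bar{Y} n \bar{x} - \hat{\beta}_1 n \bar{x}^2) + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i Y_i$$
$$\to \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{S_{xy}}{S_{xx}}$$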
Example: Brain weight
In R: use the lm function (linear model).
- lm.out = lm(y ~ x)
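The closed-form estimators can be checked against lm in R. This sketch assumes that the mammals data set in the MASS package (body weight in kg, brain weight in g) can stand in for the course data; that correspondence is an assumption, not something stated in these notes.

```r
# Assumption: MASS::mammals stands in for the body/brain weight data
library(MASS)
x <- mammals$body
Y <- mammals$brain

# Sums of squares
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (Y - mean(Y)))

# Closed-form least squares estimates
beta1_hat <- Sxy / Sxx
beta0_hat <- mean(Y) - beta1_hat * mean(x)

# The same fit with lm
lm.out <- lm(Y ~ x)
print(rbind(by_hand = c(beta0_hat, beta1_hat), lm = coef(lm.out)))
```

Both rows should agree up to floating-point error, since lm minimises the same sum of squares.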
Method of Maximum Likelihood
- As we have seen, the simple linear regression model is $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ with errors $\epsilon_i \sim NID(0, \sigma^2)$, so that
$$Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$$
Joint Density
- Remember that the joint density of independent random variables is the product of the ‘individual’ density functions
$$f(y_1, \dots, y_n) = \prod_{i=1}^{n} f_{Y_i}(y_i)$$
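The joint density of the independent random responses $Y_i$ evaluated at (the observed values) $\boldsymbol{y}^T = (y_1, \dots, y_n)$ is
$$f(\boldsymbol{y}; \beta_0, \beta_1, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y_1 - \beta_0 - \beta_1 x_1)^2}{2\sigma^2}} \times \dots \times \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y_n - \beta_0 - \beta_1 x_n)^2}{2\sigma^2}}$$
$$= \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^n \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \right)$$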
- The method of maximum-likelihood is called such because it finds the parameter values $\hat{\beta}_0$, $\hat{\beta}_1$
and $\hat{\sigma}$ that maximise the joint density (likelihood).
o For $\beta$: we want to maximise the joint density, for each $\sigma$ held fixed.
- One can show (worksheet week 2) that the maximiser of the joint density over $\beta_0$ and $\beta_1$ does not depend on $\sigma$
and is found by minimising $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
- In this (special) case the method of maximum-likelihood gives the same parameter estimates
as the method of least-squares.
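A numerical sketch of this equivalence in R: the negative log-likelihood is minimised over $\beta_0$ and $\beta_1$ for two different fixed values of $\sigma$, and both results are compared with lm. The simulated data, the two values of $\sigma$, the starting values and the choice of optim with method "BFGS" are assumptions made for illustration.

```r
# Simulated data (true values are assumptions)
set.seed(7)
n <- 60
x <- runif(n, 0, 5)
y <- 1 + 2 * x + rnorm(n, sd = 0.8)

# Negative log-likelihood in (beta0, beta1) for a fixed sigma
negloglik <- function(beta, sigma) {
  -sum(dnorm(y, mean = beta[1] + beta[2] * x, sd = sigma, log = TRUE))
}

# Maximise the likelihood numerically for two different fixed values of sigma
fit_sigma1 <- optim(c(0, 0), negloglik, sigma = 1, method = "BFGS")
fit_sigma3 <- optim(c(0, 0), negloglik, sigma = 3, method = "BFGS")

ls_fit <- lm(y ~ x)
print(rbind(mle_sigma1 = fit_sigma1$par,
            mle_sigma3 = fit_sigma3$par,
            least_squares = coef(ls_fit)))
```

The three rows agree up to the tolerance of the numerical optimiser, whichever value of $\sigma$ is held fixed.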
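Sampling distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$
From the error assumptions, it follows that
$$\hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right] \right); \quad \hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{S_{xx}} \right)$$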
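The sampling distribution of the slope can be examined by simulation in R. This is a sketch under assumed true parameter values and an assumed fixed design; it compares the empirical mean and variance of $\hat{\beta}_1$ over many simulated data sets with $\beta_1$ and $\sigma^2 / S_{xx}$.

```r
set.seed(123)
n <- 30
beta0 <- 1; beta1 <- 2; sigma <- 1.5    # assumed true values
x <- seq(1, 10, length.out = n)         # fixed design, reused in every data set
Sxx <- sum((x - mean(x))^2)

# Simulate 5000 data sets and compute the least squares slope for each
slopes <- replicate(5000, {
  Y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  sum((x - mean(x)) * (Y - mean(Y))) / Sxx
})

print(c(mean_slope = mean(slopes), theory_mean = beta1))
print(c(var_slope = var(slopes), theory_var = sigma^2 / Sxx))
```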
Fitted Regression Line
The fitted regression line equation becomes
$$y = \hat{\beta}_0 + \hat{\beta}_1 x = \bar{Y} + \hat{\beta}_1 (x - \bar{x})$$
- Thus, the regression line passes through the component-wise mean $(\bar{x}, \bar{Y})$
Correlation Coefficient and Regression Slope
Recall that the Pearson correlation coefficient between vectors $x$ and $y$ is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \in [-1, 1]$$
So we see that
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = r \sqrt{\frac{S_{yy}}{S_{xx}}}$$
- $\hat{\beta}_1$ has the same sign as $r$ (it is a scaled version of $r$)
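The relationship between the slope and $r$ can be verified numerically in R. The six data points below are made up for illustration.

```r
# Small made-up data set
x <- c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.2, 5.1, 4.8, 7.2)

Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
r   <- cor(x, y)

# Slope as a rescaled correlation coefficient
slope_from_r <- r * sqrt(Syy / Sxx)

# Slope from the fitted least squares line
slope_lm <- unname(coef(lm(y ~ x))[2])

print(c(from_r = slope_from_r, from_lm = slope_lm))
```

Since sqrt(Syy / Sxx) equals sd(y) / sd(x), this is the same as correcting $r$ by the ratio of standard deviations.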
Example: Brain weight
Fitted regression line:
𝐵𝑟𝑎𝑖𝑛𝑊𝑡 = 91.00 + 0.97 × 𝐵𝑜𝑑𝑦𝑊𝑡
Estimating the Error Variance:
Residual sum of squares (RSS)
- 𝜎2 (the error variance) is also an unknown parameter
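- an estimator can be obtained using the residual sum of squares (RSS)
$$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$
We find that:
$$RSS \sim \sigma^2 \chi^2_{n-2} \implies E[RSS] = (n - 2)\sigma^2$$
Unbiased estimate of $\sigma^2$
So, the unbiased estimate of $\sigma^2$ is
$$s^2 = \frac{1}{n - 2} RSS = \frac{1}{n - 2} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$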
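The estimate $s^2$ can be computed by hand in R and compared with the value that summary() reports for an lm fit. The simulated data and the true parameter values are assumptions for illustration.

```r
# Simulated data (assumed true sigma = 2)
set.seed(10)
n <- 40
x <- runif(n, 0, 10)
Y <- 3 + 1.5 * x + rnorm(n, sd = 2)

fit <- lm(Y ~ x)

# Residual sum of squares and the unbiased variance estimate
RSS <- sum((Y - fitted(fit))^2)
s2  <- RSS / (n - 2)

# summary(fit)$sigma is the residual standard error, i.e. sqrt(s2)
print(c(s2_by_hand = s2, s2_from_lm = summary(fit)$sigma^2))
```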
Example: Brain weight
Residuals as Error Estimators:
- The residuals $R_i = Y_i - \hat{Y}_i$ can be thought of as estimates for the error terms, $\epsilon_i$, in the model.
- The empirical distribution of the residuals is an estimator of the error distribution.
- The residuals sum to 0:
$$\sum_{i=1}^{n} R_i = 0$$
- Note: random variables that sum to a constant cannot be independent!
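A short R check that the residuals of a least squares fit sum to zero. The built-in cars data set is used only as a convenient example and is not part of the course material. The second sum uses the other normal equation from the least squares derivation, which makes the residuals orthogonal to x as well.

```r
# Built-in data set, used only as an example
fit <- lm(dist ~ speed, data = cars)
R <- residuals(fit)

# Zero up to floating-point error
print(sum(R))

# The residuals are also orthogonal to the predictor
print(sum(R * cars$speed))
```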
Example: Brain weight