
Page 1: Multivariate Statistics

Multivariate Statistics

Regression Analysis

W. M. van der Veld

University of Amsterdam

Page 2: Multivariate Statistics

Overview
• Digression: Galton revisited
• Types of regression
• Goals of regression
• Spurious effects
• Simple regression
  – Prediction
  – Fitting a line
  – OLS estimation
  – Assessment of the fit (R²)
  – Assumptions
  – Confidence intervals of the estimates
• The matrix approach
  – Multiple regression
  – Assumptions
  – OLS estimation
  – Assessment of the fit (R²)

Page 3: Multivariate Statistics

Digression: Galton revisited

Page 4: Multivariate Statistics

Digression: Galton revisited

• Galton was inspired by the work of his cousin (who’s that?)
• In one of his ‘early’ studies, Galton looked at the development of the size and weight of sweet pea seeds over two generations.
• This led to the conclusion that size and weight have a tendency to regress toward the mean, called ‘reversion’ by Galton.
  – i.e., the offspring produced is less extreme than its ancestors.
• Galton assumed that the same process would hold for human beings.
  – Indeed, he found that the physical characteristics of thousands of human volunteers showed a similar regression toward mediocrity in human hereditary stature.
• It is ironic that today the term ‘regression’ is still used purely because of its role in history. Even the term reversion would be more appropriate. But in contemporary statistics there is a more accurate description of the process involved.

• Which is?

Page 5: Multivariate Statistics

Types of regression

Page 6: Multivariate Statistics

Types of regression

Nomenclature

x-variables     y-variables     ε
Predictor       Predicted       Error in prediction
Explanatory     Explained       Residual
Independent     Dependent       Residual
Stimulus        Response        Residual
Cause           Effect          Residual
Exogenous       Endogenous      Residual

Page 7: Multivariate Statistics

Types of regression

                x-variables         y-variables
                #       related     #       related
Simple          1       NA          1       NA
Multiple        >1      YES         1       NA
Multivariate    >1      YES         >1      YES

[Path diagrams: simple regression (x1 → y1), multivariate regression (x1, x2 → y1, y2), and multiple regression (x1, x2 → y1).]

Page 8: Multivariate Statistics

Goals of regression

Page 9: Multivariate Statistics

Goals of regression: Prediction

• x1 = Scholastic achievement (at elementary school)

• y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

[Path diagram: x1 → y1.]

Page 10: Multivariate Statistics

Goals of regression: Prediction

• x1 = Scholastic achievement (at elementary school)

• x2 = Quality of elementary school

• y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

[Path diagram: x1 and x2 → y1.]

Page 11: Multivariate Statistics

Goals of regression: Causation

• x1 = Socio-economic status

• x2 = Quality of elementary school

• y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

• y2 = Scholastic achievement (at elementary school)

[Path diagram: x1, x2 → y1, y2.]

Page 12: Multivariate Statistics

Spurious effects

The effect of confounding variables

Page 13: Multivariate Statistics

Spurious effects

• Below is the ‘true’ model,
  – where x2 is the confounding variable,
  – and the relationship between y1 and y2 is partly spurious.

[Path diagram: x2 → y1, x2 → y2, and y2 → y1.]

Page 14: Multivariate Statistics

Spurious effects

• Say we estimate the effect of y2 on y1, using the simplest model.

• What will that effect be? (draw the model)

• Now three examples of the effect of a confounding variable

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1      y2      x2
y1    1.00
y2    0.25    1.00
x2    0.52    0.40    1.00

Page 15: Multivariate Statistics

Spurious effects [1]

• Say we want to estimate the effect of y2 on y1, taking into account the effect of quality of elementary school. What will that effect be? (draw the model)

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1      y2      x2
y1    1.00
y2    0.25    1.00
x2    0.52    0.40    1.00

Page 16: Multivariate Statistics

Spurious effects [2]

• Suppose the correlations have changed. What will that effect be in this case?

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1                  y2                  x2
y1    1.00
y2    0.25                1.00
x2    0.68 (was 0.52)     0.60 (was 0.40)     1.00

Page 17: Multivariate Statistics

Spurious effects [3]

• Suppose the correlations have changed. What will that effect be in this case?

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1                  y2                   x2
y1    1.00
y2    0.25                1.00
x2    0.32 (was 0.52)     -0.40 (was 0.40)     1.00

Page 18: Multivariate Statistics

Spurious effects

• It should be obvious that some relationships are spurious.
• In order to get unbiased estimates of the effects, we need to include the confounding variables.
• In our example the observed relation was 0.25:
  – Situation 1: the causal effect almost disappeared (0.05).
  – Situation 2: the causal effect almost doubled (0.45).
  – Situation 3: the causal effect changed sign (-0.25).
  (A short computation illustrating how such partialled effects are obtained is sketched below.)

• This is related to the assumption that the error terms are unrelated with each other or with other variables in the model.
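These partialled effects can be reproduced directly from the correlation tables. Below is a minimal sketch (not part of the original slides) that computes the standardized partial regression coefficient of y2 on y1, controlling for x2, for situation 1; the formula used is the standard one for a standardized partial slope, and the variable names are chosen for illustration only.

```python
# Sketch (illustrative, not from the slides): effect of y2 on y1 controlling for x2,
# using the correlations of situation 1.
r_y1y2 = 0.25   # observed correlation between y1 and y2
r_y1x2 = 0.52   # correlation between y1 and the confounder x2
r_y2x2 = 0.40   # correlation between y2 and the confounder x2

# Standardized partial regression coefficient of y2 in the regression of y1 on y2 and x2.
beta_y2 = (r_y1y2 - r_y1x2 * r_y2x2) / (1 - r_y2x2 ** 2)
print(round(beta_y2, 2))  # about 0.05: the observed relation of 0.25 almost disappears
```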

Page 19: Multivariate Statistics

Simple regression

Page 20: Multivariate Statistics

Prediction

• Let f(x) be given: y = 0.5 + 2x

• What can we tell about y?
  – That y equals 0.5 when x equals 0.

  – That y increases 2 points when x increases 1 point.

  – And also that when
    x = 0 => y = 0.5
    x = 1 => y = 2.5
    x = 2 => y = 4.5
    et cetera, for all values of x within the range of possible x values.

• What’s the use of knowing this?

Page 21: Multivariate Statistics

Prediction

• Let’s say x is the number of hours of sunshine today, and y is the amount of sunshine tomorrow (in hours).

• We can PREDICT y under the condition that we know x.
  – And we know x because we can measure the number of hours of sunshine today.

• Of course, we know from weather forecasting that the causal mechanisms are more complex.

• Nevertheless, suppose that in 50% of the predictions we are correct; then who cares that real life is more complex?

Page 22: Multivariate Statistics

Fitting a line

• Let’s start all over again.

• The regression equation is: y = α + βx + ε

• Where:
  – α is called the intercept,
  – β is called the regression coefficient,
  – ε is a random variable indicating the error in the equation, and
  – y is a random variable.

• α and β are unknown population parameters.

• We must first know them before we can start to make predictions, or do even more (causality).

• We can estimate α and β if we have observations for x and y.

• How?

Page 23: Multivariate Statistics

Fitting a line
• On the right there are 25 observations of an x and a y variable.

• We can plot the observations in a (2-dimensional) plane.

• This way we can test whether the relationship is linear.

• If not, linear regression is not the way.
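As a concrete illustration of this step, the sketch below (not part of the slides) plots made-up data with matplotlib to judge visually whether a straight line is a reasonable model; the data-generating values are assumptions for illustration, since the 25 observations from the slides are not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate 25 made-up (x, y) observations with a roughly linear relation plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(40, 65, size=25)
y = 13.6 - 0.08 * x + rng.normal(scale=0.5, size=25)

# Scatter plot: if the cloud of points does not look roughly linear,
# linear regression is not the way.
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Is the relationship approximately linear?")
plt.show()
```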

Page 24: Multivariate Statistics

Fitting a line
• This relationship is pretty linear.
• We have already drawn the regression line through the observed points.
• It can be seen that none of the observations is on the prediction line.
• Still, this prediction line has some properties which make it the best line.

Page 25: Multivariate Statistics

OLS estimation

• So how do we determine the best line? I.e., under what conditions must α and β be estimated?
  – By making the sum of errors as small as possible.
  – But because some errors are negative and others positive (so they can cancel each other out), we determine the best line by
  – making the sum of squared errors as small as possible: min(Σε²).

• There are other possibilities (not discussed).
• The error is the result of a wrong prediction; thus when we
  – predict that ŷ = 4, and
  – in fact the observed y = 5, then
  – the error = 1.

• Thus: ε = y – ŷ
• Note that we know neither ε nor ŷ.

Page 26: Multivariate Statistics

OLS estimation

• Let us write the prediction equation as: ŷ = a + bx.
  – Note that in the prediction there is no error; it’s a prediction!
  – The error stems from the observed value of y.

• ε = y – ŷ = y – (a + bx)
• We are looking for the values of a and b that minimize the sum of squares:
  min(Σε²) =
  min(Σ(y – (a + bx))²) =
  min(Σ(y – a – bx)²) =
  min(g(a, b)), where g(a, b) = Σ(y – a – bx)²

• This means that we have to find the partial derivatives, set them to zero, and solve for a and b.

Page 27: Multivariate Statistics

OLS estimation

• The minimum of g(a, b) = Σ(y – a – bx)² is found by setting the derivatives equal to zero and solving.

• The parameters of the model are the unknowns: a and b.

• The partial derivatives are:
  – ∂g/∂a = 2Σ(y – a – bx)·(-1) = -2Σ(y – a – bx)
  – ∂g/∂b = 2Σ(y – a – bx)·(-x) = -2Σx(y – a – bx)

• Setting them to zero and solving the equations results in:

b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]²
  = [Σxiyi – (Σxi)(Σyi)/n] / [Σxi² – (Σxi)²/n]
  = cov(xy) / var(x)

a = E(y) – b·E(x)
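For illustration, a minimal Python sketch of these two formulas follows; the data are made up, since the 25 observations from the slides are not reproduced here.

```python
import numpy as np

# Made-up observations for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# b = corrected sum of products / corrected sum of squares = cov(xy)/var(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# a = mean of y minus b times mean of x
a = y.mean() - b * x.mean()

print(a, b)                      # intercept and slope
print(np.sum(y - (a + b * x)))   # sum of residuals: zero up to floating-point error
```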

Page 28: Multivariate Statistics

Digression: terminology

Some terminology that is commonly found in statistical packages such as SPSS:

• Σxi²                   : the uncorrected sum of squares
• (Σxi)²/n               : the correction for the mean of the x’s
• Σxi² – (Σxi)²/n        : the corrected sum of squares of the x’s

• Σxiyi                  : the uncorrected sum of products
• (Σxi)(Σyi)/n           : the correction for the mean of the x’s and y’s
• Σxiyi – (Σxi)(Σyi)/n   : the corrected sum of products of x and y

Page 29: Multivariate Statistics

Digression: deviation scores

• These formulas can be rewritten as follows, if the variables are expressed in deviation scores:

b = Σxiyi / Σxi² = E(xy) / E(x²) = cov(xy) / var(x)

a = E(y) – b·E(x) = ?

For comparison, in raw scores:

b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]² = [Σxiyi – (Σxi)(Σyi)/n] / [Σxi² – (Σxi)²/n] = cov(xy) / var(x)

a = E(y) – b·E(x)

Page 30: Multivariate Statistics

OLS estimation

• Using the 25 observations of X and Y, we can estimate both α and β.

• You could compute both a and b as an exercise at home if you like:
  b = -0.079829
  a = 9.4240 – (-0.079829 × 52.60) = 13.623005

• So, the estimated regression equation is: ŷ = a + bx = 13.623005 – 0.079829x

• These estimates are computed using the criterion that Σε² should be minimized.

• From that criterion (OLS) it follows that Σε should be 0.
• ε = y – ŷ; we already had y and we can calculate ŷ.
• Is it true that Σε = 0?
  i.e., does our estimation procedure yield parameter estimates with this property?

Page 31: Multivariate Statistics

• We have computed all predicted y’s.

• It can be seen that in all 25 cases we make prediction errors, i.e. yi – ŷi ≠ 0

• If we sum over the last column, then we get: Σε=-0.02.

• This deviation is due to rounding errors.

OLS estimation

Page 32: Multivariate Statistics

Assessment of fit

• How good is the prediction of y with x?

– One possibility is the size of the average error.

– Not a good choice, since Σε = 0; hence (1/n)Σε is also 0.

• The error is written as: ε = y – ŷ, which is the same as:

yi – ŷi = (yi – ȳ) – (ŷi – ȳ)

If we take the square of both sides and sum over all observations, we obtain:

Σ(yi – ŷi)² = Σ[(yi – ȳ) – (ŷi – ȳ)]²

If we work this out, we obtain:

Σ(yi – ȳ)² = Σ(ŷi – ȳ)² + Σ(yi – ŷi)²

Page 33: Multivariate Statistics

Assessment of fit

• SST is the variation of y [=n*var(y)].

• SST is the sum of SSR and SSE.

• If there are no prediction errors, SSE will be zero!

• In that case all observations will be on the prediction line, and we will have a perfect prediction.

• An indication of how good the prediction is, is therefore given as: SSR/SST.

• This quotient is called R2.

1. Σ(yi – ȳ)²   : Sum of squares about the mean [SS total, SST]
2. Σ(yi – ŷi)²  : Sum of squares about the regression [SSE]
3. Σ(ŷi – ȳ)²   : Sum of squares due to the regression [SSR]
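A small sketch of this decomposition and of R², assuming made-up data rather than the 25 observations from the slides:

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# OLS fit as on the earlier slides.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)      # sum of squares about the mean
sse = np.sum((y - y_hat) ** 2)         # sum of squares about the regression
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to the regression

print(np.isclose(sst, ssr + sse))  # SST = SSR + SSE
print(ssr / sst)                   # R^2
```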

Page 34: Multivariate Statistics

Assumptions

• Up to this point we have made no assumptions at all that involve probability distributions.

• A number of specified algebraic calculations have been made, but that’s all.

• We now make the basic assumptions about the model that are needed for correct OLS estimation:
  yi = α + βxi + εi, where i = 1, 2, …, n

• (1) εi is a random variable with E(εi) = 0 and var(εi) = σ²

• (2) cov(εi, εj) = 0 where i ≠ j
• Thus: E(yi) = E(α + βxi + εi) = E(α + βxi) = α + βxi

– Where α and β are constants, and xi for a given case i is fixed.

• Thus: var(yi) = var(α + βxi + εi) = var(εi) = σ2

– Where α and β are constants, and xi for a given case i is fixed.

Page 35: Multivariate Statistics

Assumptions

• (3) cov(εi,xi) = 0, no spurious relations

• For statistical tests it is necessary to also assume the following:

• (4) εi is normally distributed with mean zero and variance σ2.So εi ~ N(0, σ2)

– Under this assumption εi and εj are not only uncorrelated, but necessarily independent.

Page 36: Multivariate Statistics

Confidence intervals of the estimates

• Both equations provide OLS estimates of the parameters of the (simple) regression model. But, how close are these estimates to the population values α and β?

• Due to sampling variation our estimates deviate from the true population values.

• We are interested in the interval in which the true value should lie, given our estimate.

• This interval is unfortunately called the confidence interval.
• Due to some odd circumstances (Fisher?) we generally use the arbitrary value of 0.95 to define the limits of our interval (95%).

b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]²

a = E(y) – b·E(x)

Page 37: Multivariate Statistics

Confidence intervals of the estimates

• This might lead to the wrong idea that there is a 0.95 probability that the true population value lies within this interval.

• The meaning of a confidence interval is however:If samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the population mean.

• So we are talking here about variations due to sampling.

• How do we determine sampling variance of the parameters a and b?

• Let us start with b.

Page 38: Multivariate Statistics

Confidence intervals of the estimates

• The equation for the estimation of b is:
  b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]²

• From this it can be derived that the variance of b due to sampling errors equals:
  var(b) = σ² / Σ[xi – E(x)]²

• We can also compute the standard deviation of b, which is the square root of the variance of b.

• This statistic is normally called the standard error.
• The standard error of b will be:
  s.e.(b) = √[var(b)] = σ / √[Σ[xi – E(x)]²]

• Normally σ is unknown, so we use the estimate s, assuming that the model is correct, i.e. y = ŷ.

• Therefore: est. s.e.(b) = s / √[Σ[xi – E(x)]²]
• This is an expression for the sampling variation.

Page 39: Multivariate Statistics

Confidence intervals of the estimates

• This expression of the standard error allows the computation of a confidence interval, so that we get an idea about the closeness of b to β.

• If it is assumed that the errors are all from the same normal distribution (assumption 4), then the confidence interval can be computed with:

• b ± t(n-2,1-½ α) [est. s.e.(b)]

• Where t is the t-value from a t-distribution with k degrees of freedom and a specified α-level. In this case the estimate of s2 has n-2 degrees of freedom. And normally α is chosen to be 0.05.

• It is clear that the smaller the standard error, the closer b is to β.
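A minimal sketch of the standard error, the 95% confidence interval, and the t-value for b, following the formulas above; the data are made up for illustration, and the residual standard deviation s is estimated with n – 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# OLS estimates and residuals.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

s = np.sqrt(np.sum(residuals ** 2) / (n - 2))    # estimate of sigma
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))  # est. s.e.(b)

t_crit = stats.t.ppf(0.975, df=n - 2)            # t(n-2, 1 - alpha/2), alpha = 0.05
ci = (b - t_crit * se_b, b + t_crit * se_b)      # 95% confidence interval for beta
t_value = b / se_b                               # test of b against B = 0 (next slide)
print(b, se_b, ci, t_value)
```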

Page 40: Multivariate Statistics

Confidence intervals of the estimates

• Is it possible to test whether an estimate of b is equal to some value B?

• Yes, we can test whether b and B are statistically different from each other.

• t = (b-B) / [est. s.e.(b)]

• If |t| < t(n-2, 1-½α), then this difference is said to be statistically not significant.

• The test that is usually done in statistical packages is the test whether the parameter is different from zero (t = b / [est. s.e.(b)]), i.e. whether zero is included in the interval.

Page 41: Multivariate Statistics

Confidence intervals of the estimates

• Now the sampling variation of a.
• The equation for the estimation of a is:
  a = E(y) – b·E(x)
• From this it can be derived that:
  var(a) = σ² Σxi² / [n Σ[xi – E(x)]²]

• The standard error of a:
  s.e.(a) = σ √[Σxi² / (n Σ[xi – E(x)]²)]
• Normally σ is unknown, so we use the estimate s, assuming that the model is correct, i.e. y = ŷ.
• Therefore: est. s.e.(a) = s √[Σxi² / (n Σ[xi – E(x)]²)]
• This is an expression for the sampling variation of a.
• Confidence intervals and t-tests are computed in the same way as for b.

Page 42: Multivariate Statistics

The Matrix Approach

Multiple Regression

Page 43: Multivariate Statistics

The Matrix Approach

• The use of matrix notation has many advantages:

• If a problem is written and solved in matrix terms, the solution can be applied to any regression problem no matter how many terms there are in the regression equation.

Page 44: Multivariate Statistics

Multiple Regression

• Instead of simple regression we now consider multiple regression, with p independent variables.

• Notice that simple regression is a special case of multiple regression.

[Path diagram: x1, x2, …, xp-1, xp → y.]

Page 45: Multivariate Statistics

Multiple Regression

• Notice that for each case i we have a separate equation, so with a sample size of n, we will have n equations.

• Which can also be written as a system of equations.

Yi = β1 + β2X2i + β3X3i + … + βpXpi + εi

where:
  β1       : Intercept
  β2 … βp  : Partial regression slope coefficients
  εi       : Residual term associated with the ith case

Page 46: Multivariate Statistics

Multiple Regression

nppnnn

p

p

n XXX

XXX

XXX

Y

Y

Y

2

1

2

1

32

23222

13121

2

1

1

1

1

y = X +

(n 1) (n p) (p 1) (n 1)

npnpnn

pp

pp

n XXX

XXX

XXX

Y

Y

Y

33221

223232221

113132121

2

1

ipipiii XXXY 33221

Page 47: Multivariate Statistics

Assumptions

• Model assumptions.
• (1) εi is a random variable
  – with mean zero, and
  – constant variance.

• (2) There is no correlation between the ith and jth residual terms.

E(ε) = E[ε1 ε2 … εn]′ = 0

E(εiεj) = 0 for i ≠ j, or in matrix notation: E(εε′) = σ²I

Page 48: Multivariate Statistics

Assumptions

• (3) εi is normally distributed with mean zero and variance σ². So εi ~ N(0, σ²)
  – Under this assumption εi and εj are not only uncorrelated, but necessarily independent.

• (4) Covariance between the X’s and residual terms is 0: cov(X, ε) = 0.
  – No forgotten variables that cause spurious relations.

• (5) No multicollinearity.
  – The x variables are not linearly dependent; otherwise det(X′X) = 0 and the inverse (X′X)⁻¹ does not exist.

Page 49: Multivariate Statistics

OLS Estimation

• If these assumptions hold, then the OLS estimators have the following properties:
  – They are unbiased linear estimators, and
  – They are also minimum variance estimators.

• This means that the OLS estimator is the best linear unbiased estimator of a parameter θ:
  • Linear
  • Unbiased, i.e., E(θ̂) = θ
  • Minimum variance in the class of all linear unbiased estimators

• The unbiased and minimum variance properties mean that OLS estimators are efficient estimators.

Page 50: Multivariate Statistics

OLS Estimation

• If one or more of the assumptions are not met, then the OLS estimators are no longer the best linear unbiased estimators.

• Does this matter?

• Yes, it means we require an alternative method for characterizing the association between our y and x variables

Page 51: Multivariate Statistics

OLS Estimation
• The population regression model: y = Xβ + ε
• We want to estimate β; the estimate has the property that it minimizes the sum of squared errors (or residuals): min(SSE) = min(Σε²).
• Let us call the estimate of β: b
• So we have to find an expression for SSE in terms of the variables and parameters of the model.
• SSE = Σε² = ε′ε
• y = Xb + ε => ε = y – Xb
• SSE = ε′ε = (y – Xb)′(y – Xb)
• SSE = y′y – y′Xb – b′X′y + b′X′Xb

Page 52: Multivariate Statistics

OLS Estimation
• SSE = y′y – y′Xb – b′X′y + b′X′Xb
• Finding the minimum of this function means setting the derivative with respect to b to zero; so
• ∂SSE/∂b = –2X′y + 2X′Xb = 0
• Hence: X′Xb = X′y

• It is now easily seen that b can be estimated with:
• (X′X)⁻¹X′Xb = (X′X)⁻¹X′y =>
• Ib = (X′X)⁻¹X′y =>
• b = (X′X)⁻¹X′y
• This is the OLS estimator of β.
• The result is a column vector containing all estimates.
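A minimal sketch of this estimator in NumPy, with a made-up data set of two independent variables plus an intercept column of 1’s:

```python
import numpy as np

# Made-up data: two independent variables and a dependent variable.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1, x2])  # design matrix: first column of 1's
b = np.linalg.inv(X.T @ X) @ X.T @ y       # b = (X'X)^-1 X'y
print(b)  # column vector with the estimates of the intercept and the two slopes

# Numerically, np.linalg.solve(X.T @ X, X.T @ y) or np.linalg.lstsq(X, y, rcond=None)
# is more stable than forming the inverse explicitly.
```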

Page 53: Multivariate Statistics

OLS Estimation

• Similarity between factor and regression analysis solutions:

• The factor equation: x = Λξ + δ

• The covariance equation: Σ = ΛΦΛ’ + Θδ

• The ULS solution: – Where i is the total number of unique elements of the correlations matrix.

• Regression equation: y = Xβ + ε

• The variance equation: σ²y = β² + σ²ε

• OLS Solution: – Where i is the number of cases in the data matrix.

• OLS Estimator of β: b = (X’X)-1X’y

• OLS solution: Fols(y, ŷ) = min Σi (yi – ŷi)² = min Σi εi²

• ULS solution: Fuls(S, Σ̂) = min Σi (S – Λ̂Φ̂Λ̂′ – Θ̂δ)²

Page 54: Multivariate Statistics

Assessment of fit

• Same measure is used: R2.

• It represents the proportion of variability in the dependent variable that is attributable to the independent variables, i.e. the total explained variance.

• 0 ≤ R2 ≤ 1 (=perfect fit)

• R2 = SSR / SST (see sheet 33)

• Now in the situation of multiple regression, i.e. more than 1 independent variable.

Page 55: Multivariate Statistics

Assessment of fit

• Obviously, the more independent variables, the larger R2 will be.

• This is all right, but when comparing models, the model with the most independent variables would then always appear to be the best model.

• So, when comparing models use the adjusted R².

• Adj(R²) = 1 – (1 – R²)(n – 1)/(n – p – 1)
  – Where n is the number of cases, and p the number of independent variables.
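A small sketch computing R² and the adjusted R² for a multiple regression, assuming made-up data with p = 3 independent variables:

```python
import numpy as np

# Made-up data: n cases, p independent variables plus an intercept column.
rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimates
y_hat = X @ b

r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSR / SST
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)                       # penalizes extra predictors
print(r2, adj_r2)
```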