
Page 1: Multivariate Statistics

Multivariate Statistics

Regression Analysis

W. M. van der Veld

University of Amsterdam

Page 2: Multivariate Statistics

Overview
• Digression: Galton revisited
• Types of regression
• Goals of regression
• Spurious effects
• Simple regression
  – Prediction
  – Fitting a line
  – OLS estimation
  – Assessment of the fit (R²)
  – Assumptions
  – Confidence intervals of the estimates
• The matrix approach
  – Multiple regression
  – Assumptions
  – OLS estimation
  – Assessment of the fit (R²)

Page 3: Multivariate Statistics

Digression: Galton revisited

Page 4: Multivariate Statistics

Digression: Galton revisited

• Galton was inspired by the work of his cousin (who’s that?)
• In one of his ‘early’ studies, Galton looked at the development of the size and weight of sweet pea seeds over two generations.
• This led to the conclusion that size and weight have a tendency to regress toward the mean, called ‘reversion’ by Galton.
  – i.e., the offspring produced is less extreme than its ancestors.
• Galton assumed that the same process would hold for human beings.
  – Indeed, he found that the physical characteristics of thousands of human volunteers showed a similar regression toward mediocrity in human hereditary stature.
• It is ironic that today the term ‘regression’ is still used purely because of its role in history. Even the term reversion would be more appropriate. But in contemporary statistics there is a more accurate description of the process involved.

• Which is?

Page 5: Multivariate Statistics

Types of regression

Page 6: Multivariate Statistics

Types of regression

Nomenclature

x-variables     y-variables     ε
Predictor       Predicted       Error in prediction
Explanatory     Explained       Residual
Independent     Dependent       Residual
Stimulus        Response        Residual
Cause           Effect          Residual
Exogenous       Endogenous      Residual

Page 7: Multivariate Statistics

Types of regression

                x-variables         y-variables
                #       related     #       related
Simple          1       NA          1       NA
Multiple        >1      YES         1       NA
Multivariate    >1      YES         >1      YES

[Path diagrams: simple regression (x1 → y1), multivariate regression (x1, x2 → y1, y2), and multiple regression (x1, x2 → y1).]

Page 8: Multivariate Statistics

Goals of regression

Page 9: Multivariate Statistics

Goals of regression: Prediction

• x1 = Scholastic achievement (at elementary school)

• y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

[Path diagram: x1 → y1.]

Page 10: Multivariate Statistics

Goals of regression: Prediction

• x1 = Scholastic achievement (at elementary school)

• x2 = Quality of elementary school

• y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

[Path diagram: x1 and x2 → y1.]

Page 11: Multivariate Statistics

Goals of regression: Causation

• x1 = Socio-economic status

• x2 = Quality of elementary school

• y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

• y2 = Scholastic achievement (at elementary school)

[Path diagram: x1, x2 → y1, y2.]

Page 12: Multivariate Statistics

Spurious effects

The effect of confounding variables

Page 13: Multivariate Statistics

Spurious effects

• Below is the ‘true’ model,
  – where x2 is the confounding variable,
  – and the relationship between y1 and y2 is partly spurious.

[Path diagram: x2 → y1, x2 → y2, and y2 → y1.]

Page 14: Multivariate Statistics

Spurious effects

• Say we estimate the effect of y2 on y1, using the simplest model.

• What will that effect be? (draw the model)

• Now three examples of the effect of a confounding variable

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1      y2      x2
y1    1.00
y2    0.25    1.00
x2    0.52    0.40    1.00

Page 15: Multivariate Statistics

Spurious effects [1]

• Say we want to estimate the effect of y2 on y1, taking into account the effect of quality of elementary school. What will that effect be? (draw the model)

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1      y2      x2
y1    1.00
y2    0.25    1.00
x2    0.52    0.40    1.00

Page 16: Multivariate Statistics

Spurious effects [2]

• Suppose the correlations have changed. What will that effect be in this case?

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1                  y2                  x2
y1    1.00
y2    0.25                1.00
x2    0.68 (was 0.52)     0.60 (was 0.40)     1.00

Page 17: Multivariate Statistics

Spurious effects [3]

• Suppose the correlations have changed. What will that effect be in this case?

Correlations (y1 = Choice of Secondary School, y2 = Scholastic Achievement, x2 = Quality of Elementary School):

      y1                  y2                   x2
y1    1.00
y2    0.25                1.00
x2    0.32 (was 0.52)     -0.40 (was 0.40)     1.00

Page 18: Multivariate Statistics

Spurious effects

• It should be obvious that some relationships are spurious.
• In order to get unbiased estimates of the effects, we need to include the confounding variables.
• In our example the observed relation was 0.25:
  – Situation 1: the causal effect almost disappeared (0.05).
  – Situation 2: the causal effect almost doubled (0.45).
  – Situation 3: the causal effect changed sign (-0.25).
  (A short computation illustrating how such partialled effects are obtained is sketched below.)

• This is related to the assumption that the error terms are unrelated with each other or with other variables in the model.
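These partialled effects can be reproduced directly from the correlation tables. Below is a minimal sketch (not part of the original slides) that computes the standardized partial regression coefficient of y2 on y1, controlling for x2, for situation 1; the formula used is the standard one for a standardized partial slope, and the variable names are chosen for illustration only.

```python
# Sketch (illustrative, not from the slides): effect of y2 on y1 controlling for x2,
# using the correlations of situation 1.
r_y1y2 = 0.25   # observed correlation between y1 and y2
r_y1x2 = 0.52   # correlation between y1 and the confounder x2
r_y2x2 = 0.40   # correlation between y2 and the confounder x2

# Standardized partial regression coefficient of y2 in the regression of y1 on y2 and x2.
beta_y2 = (r_y1y2 - r_y1x2 * r_y2x2) / (1 - r_y2x2 ** 2)
print(round(beta_y2, 2))  # about 0.05: the observed relation of 0.25 almost disappears
```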

Page 19: Multivariate Statistics

Simple regression

Page 20: Multivariate Statistics

Prediction

• Let f(x) be given: y = 0.5 + 2x

• What can we tell about y?
  – That y equals 0.5 when x equals 0.

  – That y increases 2 points when x increases 1 point.

  – And also that when
    x = 0 => y = 0.5
    x = 1 => y = 2.5
    x = 2 => y = 4.5
    et cetera, for all values of x within the range of possible x values.

• What’s the use of knowing this?

Page 21: Multivariate Statistics

Prediction

• Let’s say x is the number of hours of sunshine today, and y is the amount of sunshine tomorrow (in hours).

• We can PREDICT y under the condition that we know x.
  – And we know x because we can measure the number of hours of sunshine today.

• Of course, we know from weather forecasting that the causal mechanisms are more complex.

• Nevertheless, suppose that in 50% of the predictions we are correct; then who cares that real life is more complex?

Page 22: Multivariate Statistics

Fitting a line

• Let’s start all over again.

• The regression equation is: y = α + βx + ε

• Where:
  – α is called the intercept,
  – β is called the regression coefficient,
  – ε is a random variable indicating the error in the equation, and
  – y is a random variable.

• α and β are unknown population parameters.

• We must first know them before we can start to make predictions, or do even more (causality).

• We can estimate α and β if we have observations for x and y.

• How?

Page 23: Multivariate Statistics

Fitting a line
• On the right there are 25 observations of an x and a y variable.

• We can plot the observations in a (2-dimensional) plane.

• This way we can test whether the relationship is linear.

• If not, linear regression is not the way.
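As a concrete illustration of this step, the sketch below (not part of the slides) plots made-up data with matplotlib to judge visually whether a straight line is a reasonable model; the data-generating values are assumptions for illustration, since the 25 observations from the slides are not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate 25 made-up (x, y) observations with a roughly linear relation plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(40, 65, size=25)
y = 13.6 - 0.08 * x + rng.normal(scale=0.5, size=25)

# Scatter plot: if the cloud of points does not look roughly linear,
# linear regression is not the way.
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Is the relationship approximately linear?")
plt.show()
```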

Page 24: Multivariate Statistics

Fitting a line
• This relationship is pretty linear.
• We have already drawn the regression line through the observed points.
• It can be seen that none of the observations is on the prediction line.
• Still, this prediction line has some properties which make it the best line.

Page 25: Multivariate Statistics

OLS estimation

• So how do we determine the best line? I.e., under what conditions must α and β be estimated?
  – By making the sum of errors as small as possible.
  – But because some errors are negative and others positive (so they can cancel each other out), we determine the best line by
  – making the sum of squared errors as small as possible: min(Σε²).

• There are other possibilities (not discussed).
• The error is the result of a wrong prediction; thus when we
  – predict that ŷ = 4, and
  – in fact the observed y = 5, then
  – the error = 1.

• Thus: ε = y – ŷ
• Note that we know neither ε nor ŷ.

Page 26: Multivariate Statistics

OLS estimation

• Let us write the prediction equation as: ŷ = a + bx.
  – Note that in the prediction there is no error; it’s a prediction!
  – The error stems from the observed value of y.

• ε = y – ŷ = y – (a + bx)
• We are looking for the values of a and b that minimize the sum of squares:
  min(Σε²) =
  min(Σ(y – (a + bx))²) =
  min(Σ(y – a – bx)²) =
  min(g(a, b)), where g(a, b) = Σ(y – a – bx)²

• This means that we have to find the partial derivatives, set them to zero, and solve for a and b.

Page 27: Multivariate Statistics

OLS estimation

• The minimum of g(a, b) = Σ(y – a – bx)² is found by setting the derivatives equal to zero and solving.

• The parameters of the model are the unknowns: a and b.

• The partial derivatives are:
  – ∂g/∂a = 2Σ(y – a – bx)·(-1) = -2Σ(y – a – bx)
  – ∂g/∂b = 2Σ(y – a – bx)·(-x) = -2Σx(y – a – bx)

• Setting them to zero and solving the equations results in:

b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]²
  = [Σxiyi – (Σxi)(Σyi)/n] / [Σxi² – (Σxi)²/n]
  = cov(xy) / var(x)

a = E(y) – b·E(x)
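For illustration, a minimal Python sketch of these two formulas follows; the data are made up, since the 25 observations from the slides are not reproduced here.

```python
import numpy as np

# Made-up observations for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# b = corrected sum of products / corrected sum of squares = cov(xy)/var(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# a = mean of y minus b times mean of x
a = y.mean() - b * x.mean()

print(a, b)                      # intercept and slope
print(np.sum(y - (a + b * x)))   # sum of residuals: zero up to floating-point error
```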

Page 28: Multivariate Statistics

Digression: terminology

Some terminology that is commonly found in statistical packages such as SPSS:

• Σxi²                   : the uncorrected sum of squares
• (Σxi)²/n               : the correction for the mean of the x’s
• Σxi² – (Σxi)²/n        : the corrected sum of squares of the x’s

• Σxiyi                  : the uncorrected sum of products
• (Σxi)(Σyi)/n           : the correction for the mean of the x’s and y’s
• Σxiyi – (Σxi)(Σyi)/n   : the corrected sum of products of x and y

Page 29: Multivariate Statistics

Digression: deviation scores

• These formulas can be rewritten as follows, if the variables are expressed in deviation scores:

b = Σxiyi / Σxi² = E(xy) / E(x²) = cov(xy) / var(x)

a = E(y) – b·E(x) = ?

For comparison, in raw scores:

b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]² = [Σxiyi – (Σxi)(Σyi)/n] / [Σxi² – (Σxi)²/n] = cov(xy) / var(x)

a = E(y) – b·E(x)

Page 30: Multivariate Statistics

OLS estimation

• Using the 25 observations of X and Y, we can estimate both α and β.

• You could compute both a and b as an exercise at home if you like:
  b = -0.079829
  a = 9.4240 – (-0.079829 × 52.60) = 13.623005

• So, the estimated regression equation is: ŷ = a + bx = 13.623005 – 0.079829x

• These estimates are computed using the criterion that Σε² should be minimized.

• From that criterion (OLS) it follows that Σε should be 0.
• ε = y – ŷ; we already had y and we can calculate ŷ.
• Is it true that Σε = 0?
  i.e., does our estimation procedure yield parameter estimates with this property?

Page 31: Multivariate Statistics

• We have computed all predicted y’s.

• It can be seen that in all 25 cases we make prediction errors, i.e. yi – ŷi ≠ 0

• If we sum over the last column, then we get: Σε=-0.02.

• This deviation is due to rounding errors.

OLS estimation

Page 32: Multivariate Statistics

Assessment of fit

• How good is the prediction of y with x?

– One possibility is the size of the average error.

– Not a good choice, since Σε = 0; hence (1/n)Σε is also 0.

• The error is written as: ε = y – ŷ, which is the same as:

yi – ŷi = (yi – ȳ) – (ŷi – ȳ)

If we take the square of both sides and sum over all observations, we obtain:

Σ(yi – ŷi)² = Σ[(yi – ȳ) – (ŷi – ȳ)]²

If we work this out, we obtain:

Σ(yi – ȳ)² = Σ(ŷi – ȳ)² + Σ(yi – ŷi)²

Page 33: Multivariate Statistics

Assessment of fit

• SST is the variation of y [=n*var(y)].

• SST is the sum of SSR and SSE.

• If there are no prediction errors, SSE will be zero!

• In that case all observations will be on the prediction line, and we will have a perfect prediction.

• An indication of how good the prediction is, is therefore given as: SSR/SST.

• This quotient is called R2.

1. Σ(yi – ȳ)²   : Sum of squares about the mean [SS total, SST]
2. Σ(yi – ŷi)²  : Sum of squares about the regression [SSE]
3. Σ(ŷi – ȳ)²   : Sum of squares due to the regression [SSR]
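A small sketch of this decomposition and of R², assuming made-up data rather than the 25 observations from the slides:

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# OLS fit as on the earlier slides.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)      # sum of squares about the mean
sse = np.sum((y - y_hat) ** 2)         # sum of squares about the regression
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to the regression

print(np.isclose(sst, ssr + sse))  # SST = SSR + SSE
print(ssr / sst)                   # R^2
```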

Page 34: Multivariate Statistics

Assumptions

• Up to this point we have made no assumptions at all that involve probability distributions.

• A number of specified algebraic calculations have been made, but that’s all.

• We now make the basic assumptions about the model that are needed for correct OLS estimation:
  yi = α + βxi + εi, where i = 1, 2, …, n

• (1) εi is a random variable with E(εi) = 0 and var(εi) = σ²

• (2) cov(εi, εj) = 0 where i ≠ j
• Thus: E(yi) = E(α + βxi + εi) = E(α + βxi) = α + βxi

– Where α and β are constants, and xi for a given case i is fixed.

• Thus: var(yi) = var(α + βxi + εi) = var(εi) = σ2

– Where α and β are constants, and xi for a given case i is fixed.

Page 35: Multivariate Statistics

Assumptions

• (3) cov(εi,xi) = 0, no spurious relations

• For statistical tests it is necessary to also assume the following:

• (4) εi is normally distributed with mean zero and variance σ2.So εi ~ N(0, σ2)

– Under this assumption εi and εj are not only uncorrelated, but necessarily independent.

Page 36: Multivariate Statistics

Confidence intervals of the estimates

• Both equations provide OLS estimates of the parameters of the (simple) regression model. But, how close are these estimates to the population values α and β?

• Due to sampling variation our estimates deviate from the true population values.

• We are interested in the interval in which the true value should lie, given our estimate.

• This interval is unfortunately called the confidence interval.
• Due to some odd circumstances (Fisher?) we generally use the arbitrary value of 0.95 to define the limits of our interval (95%).

b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]²

a = E(y) – b·E(x)

Page 37: Multivariate Statistics

Confidence intervals of the estimates

• This might lead to the wrong idea that there is a 0.95 probability that the true population value lies within this interval.

• The meaning of a confidence interval is however:If samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the population mean.

• So we are talking here about variations due to sampling.

• How do we determine sampling variance of the parameters a and b?

• Let us start with b.

Page 38: Multivariate Statistics

Confidence intervals of the estimates

• The equation for the estimation of b is:
  b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]²

• From this it can be derived that the variance of b due to sampling errors equals:
  var(b) = σ² / Σ[xi – E(x)]²

• We can also compute the standard deviation of b, which is the square root of the variance of b.

• This statistic is normally called the standard error.
• The standard error of b will be:
  s.e.(b) = √[var(b)] = σ / √[Σ[xi – E(x)]²]

• Normally σ is unknown, so we use the estimate s, assuming that the model is correct, i.e. y = ŷ.

• Therefore: est. s.e.(b) = s / √[Σ[xi – E(x)]²]
• This is an expression for the sampling variation.

Page 39: Multivariate Statistics

Confidence intervals of the estimates

• This expression of the standard error allows the computation of a confidence interval, so that we get an idea about the closeness of b to β.

• If it is assumed that the errors are all from the same normal distribution (assumption 4), then the confidence interval can be computed with:

• b ± t(n-2,1-½ α) [est. s.e.(b)]

• Where t is the t-value from a t-distribution with k degrees of freedom and a specified α-level. In this case the estimate of s2 has n-2 degrees of freedom. And normally α is chosen to be 0.05.

• It is clear that the smaller the standard error, the closer b is to β.
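A minimal sketch of the standard error, the 95% confidence interval, and the t-value for b, following the formulas above; the data are made up for illustration, and the residual standard deviation s is estimated with n – 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# OLS estimates and residuals.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

s = np.sqrt(np.sum(residuals ** 2) / (n - 2))    # estimate of sigma
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))  # est. s.e.(b)

t_crit = stats.t.ppf(0.975, df=n - 2)            # t(n-2, 1 - alpha/2), alpha = 0.05
ci = (b - t_crit * se_b, b + t_crit * se_b)      # 95% confidence interval for beta
t_value = b / se_b                               # test of b against B = 0 (next slide)
print(b, se_b, ci, t_value)
```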

Page 40: Multivariate Statistics

Confidence intervals of the estimates

• Is it possible to test whether an estimate of b is equal to some value B?

• Yes, we can test whether b and B are statistically different from each other.

• t = (b-B) / [est. s.e.(b)]

• If |t| < t(n-2, 1-½α), then this difference is said to be statistically not significant.

• The test that is usually done in statistical packages is the test whether the parameter is different from zero (t = b / [est. s.e.(b)]), i.e. whether zero is included in the interval.

Page 41: Multivariate Statistics

Confidence intervals of the estimates

• Now the sampling variation of a.
• The equation for the estimation of a is:
  a = E(y) – b·E(x)
• From this it can be derived that:
  var(a) = σ² Σxi² / [n Σ[xi – E(x)]²]

• The standard error of a:
  s.e.(a) = σ √[Σxi² / (n Σ[xi – E(x)]²)]
• Normally σ is unknown, so we use the estimate s, assuming that the model is correct, i.e. y = ŷ.
• Therefore: est. s.e.(a) = s √[Σxi² / (n Σ[xi – E(x)]²)]
• This is an expression for the sampling variation of a.
• Confidence intervals and t-tests are computed in the same way as for b.

Page 42: Multivariate Statistics

The Matrix Approach

Multiple Regression

Page 43: Multivariate Statistics

The Matrix Approach

• The use of matrix notation has many advantages:

• If a problem is written and solved in matrix terms, the solution can be applied to any regression problem no matter how many terms there are in the regression equation.

Page 44: Multivariate Statistics

Multiple Regression

• Instead of simple regression we now consider multiple regression, with p independent variables.

• Notice that simple regression is a special case of multiple regression.

[Path diagram: x1, x2, …, xp-1, xp → y.]

Page 45: Multivariate Statistics

Multiple Regression

• Notice that for each case i we have a separate equation, so with a sample size of n, we will have n equations.

• Which can also be written as a system of equations.

Yi = β1 + β2X2i + β3X3i + … + βpXpi + εi

where:
  β1       : Intercept
  β2 … βp  : Partial regression slope coefficients
  εi       : Residual term associated with the ith case

Page 46: Multivariate Statistics

Multiple Regression

nppnnn

p

p

n XXX

XXX

XXX

Y

Y

Y

2

1

2

1

32

23222

13121

2

1

1

1

1

y = X +

(n 1) (n p) (p 1) (n 1)

npnpnn

pp

pp

n XXX

XXX

XXX

Y

Y

Y

33221

223232221

113132121

2

1

ipipiii XXXY 33221

Page 47: Multivariate Statistics

Assumptions

• Model assumptions.
• (1) εi is a random variable
  – with mean zero, and
  – constant variance.

• (2) There is no correlation between the ith and jth residual terms.

E(ε) = E[ε1 ε2 … εn]′ = 0

E(εiεj) = 0 for i ≠ j, or in matrix notation: E(εε′) = σ²I

Page 48: Multivariate Statistics

Assumptions

• (3) εi is normally distributed with mean zero and variance σ². So εi ~ N(0, σ²)
  – Under this assumption εi and εj are not only uncorrelated, but necessarily independent.

• (4) Covariance between the X’s and residual terms is 0: cov(X, ε) = 0.
  – No forgotten variables that cause spurious relations.

• (5) No multicollinearity.
  – The x variables are not linearly dependent; otherwise det(X′X) = 0 and the inverse (X′X)⁻¹ does not exist.

Page 49: Multivariate Statistics

OLS Estimation

• If these assumptions hold, then the OLS estimators have the following properties:
  – They are unbiased linear estimators, and
  – They are also minimum variance estimators.

• This means that the OLS estimator is the best linear unbiased estimator of a parameter θ:
  • Linear
  • Unbiased, i.e., E(θ̂) = θ
  • Minimum variance in the class of all linear unbiased estimators

• The unbiased and minimum variance properties mean that OLS estimators are efficient estimators.

Page 50: Multivariate Statistics

OLS Estimation

• If one or more of the assumptions are not met, then the OLS estimators are no longer the best linear unbiased estimators.

• Does this matter?

• Yes, it means we require an alternative method for characterizing the association between our y and x variables

Page 51: Multivariate Statistics

OLS Estimation
• The population regression model: y = Xβ + ε
• We want to estimate β; the estimate has the property that it minimizes the sum of squared errors (or residuals): min(SSE) = min(Σε²).
• Let us call the estimate of β: b
• So we have to find an expression for SSE in terms of the variables and parameters of the model.
• SSE = Σε² = ε′ε
• y = Xb + ε => ε = y – Xb
• SSE = ε′ε = (y – Xb)′(y – Xb)
• SSE = y′y – y′Xb – b′X′y + b′X′Xb

Page 52: Multivariate Statistics

OLS Estimation
• SSE = y′y – y′Xb – b′X′y + b′X′Xb
• Finding the minimum of this function means setting the derivative with respect to b to zero; so
• ∂SSE/∂b = –2X′y + 2X′Xb = 0
• Hence: X′Xb = X′y

• It is now easily seen that b can be estimated with:
• (X′X)⁻¹X′Xb = (X′X)⁻¹X′y =>
• Ib = (X′X)⁻¹X′y =>
• b = (X′X)⁻¹X′y
• This is the OLS estimator of β.
• The result is a column vector containing all estimates.
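A minimal sketch of this estimator in NumPy, with a made-up data set of two independent variables plus an intercept column of 1’s:

```python
import numpy as np

# Made-up data: two independent variables and a dependent variable.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1, x2])  # design matrix: first column of 1's
b = np.linalg.inv(X.T @ X) @ X.T @ y       # b = (X'X)^-1 X'y
print(b)  # column vector with the estimates of the intercept and the two slopes

# Numerically, np.linalg.solve(X.T @ X, X.T @ y) or np.linalg.lstsq(X, y, rcond=None)
# is more stable than forming the inverse explicitly.
```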

Page 53: Multivariate Statistics

OLS Estimation

• Similarity between factor and regression analysis solutions:

• The factor equation: x = Λξ + δ

• The covariance equation: Σ = ΛΦΛ’ + Θδ

• The ULS solution: – Where i is the total number of unique elements of the correlations matrix.

• Regression equation: y = Xβ + ε

• The variance equation: σ²y = β² + σ²ε

• OLS Solution: – Where i is the number of cases in the data matrix.

• OLS Estimator of β: b = (X’X)-1X’y

• OLS solution: Fols(y, ŷ) = min Σi (yi – ŷi)² = min Σi εi²

• ULS solution: Fuls(S, Σ̂) = min Σi (S – Λ̂Φ̂Λ̂′ – Θ̂δ)²

Page 54: Multivariate Statistics

Assessment of fit

• Same measure is used: R2.

• It represents the proportion of variability in the dependent variable that is attributable to the independent variables, i.e. the total explained variance.

• 0 ≤ R2 ≤ 1 (=perfect fit)

• R2 = SSR / SST (see sheet 33)

• Now in the situation of multiple regression, i.e. more than 1 independent variable.

Page 55: Multivariate Statistics

Assessment of fit

• Obviously, the more independent variables, the larger R2 will be.

• This is all right, but when comparing models, the model with the most independent variables would then always appear to be the best model.

• So, when comparing models use the adjusted R².

• Adj(R²) = 1 – (1 – R²)(n – 1)/(n – p – 1)
  – Where n is the number of cases, and p the number of independent variables.
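A small sketch computing R² and the adjusted R² for a multiple regression, assuming made-up data with p = 3 independent variables:

```python
import numpy as np

# Made-up data: n cases, p independent variables plus an intercept column.
rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimates
y_hat = X @ b

r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSR / SST
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)                       # penalizes extra predictors
print(r2, adj_r2)
```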