
Economics 105: Statistics
• Go over GH 19
• GH 20 due next Thur
• Modeling exercise …

• Variables you all mentioned on Tue …
  • Gun ownership per capita
  • Avg temperature / humidity
  • GDP per capita for each city
  • HS dropout rate
  • % of young people in the city
  • Unemployment rate

The Multiple Regression Model

Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)

Multiple Regression Model with k Independent Variables:

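$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

where $\beta_0$ is the Y-intercept, $\beta_1, \ldots, \beta_k$ are the population slopes, and $\varepsilon_i$ is the random error.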
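To make the model concrete, here is a minimal sketch of estimating an intercept and two slopes by ordinary least squares. It solves the normal equations $(X'X)b = X'y$ directly in plain Python. The data, the helper names `solve` and `ols`, and the "true" coefficients 1, 2, and -3 are hypothetical choices for illustration, and the error term is set to zero so the estimates can be checked by eye.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(xs, ys):
    """Return [b0, b1, ..., bk] solving the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(x) for x in xs]  # prepend a column of ones for the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data generated from Y = 1 + 2*X1 - 3*X2 with no error term.
xs = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in xs]

b0, b1, b2 = ols(xs, ys)
print("intercept:", round(b0, 6), "slopes:", round(b1, 6), round(b2, 6))
```

With real data the error term is not zero, so the estimates would differ from the population values from sample to sample.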

• Endogenous explanatory variables

Modeling Exercise examples
• What is the effect of your roommate’s SAT scores on your grades? The effect of studying?

• Do police reduce crime?

• Does more education increase wages?

• What is the effect of school start time on academic achievement?

• Does movie violence increase violent crime?

Endogenous Explanatory Variable
• Causes of endogenous explanatory variables include …
  • Wrong functional form
  • Omitted variable bias … occurs if both:
    1. The omitted variable theoretically determines Y
    2. The omitted variable is correlated with an included X
  • Errors-in-variables (aka measurement error)
  • Sample selection bias
  • Simultaneity bias (Y also determines X)

Properties of OLS Estimator
• Gauss-Markov Theorem
  • Under assumptions (1) - (5) [don’t need normality of errors], the OLS estimator $\hat{\beta}$ is B.L.U.E. (best linear unbiased estimator) of $\beta$
• Unbiased estimator
• Efficiency of an estimator
  • Intuition for when the variance is smaller
  • We won’t know the error variance $\sigma^2$, so we’ll need to estimate it

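Measures of Variation
• Total variation is made up of two parts:

$SST = SSR + SSE$

(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)

$SST = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$

where:
$\bar{Y}$ = Average value of the dependent variable
$Y_i$ = Observed values of the dependent variable
$\hat{Y}_i$ = Predicted value of Y for the given $X_i$ value

[Figure: plot of Y against X showing, at a given $X_i$, the observed $Y_i$, the predicted $\hat{Y}_i$, and the mean $\bar{Y}$, with the distances behind SST, SSE, and SSR]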

Measures of Variation

• The coefficient of determination is the portion of the total variation in Y that is explained by variation in X

• Also called r-squared and denoted $r^2$ (or $R^2$)
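Written in terms of the sums of squares, the verbal definition above corresponds to the standard formula below. The second equality assumes the decomposition SST = SSR + SSE, so it applies to least-squares fits with an intercept.

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad 0 \le R^2 \le 1
```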

Goodness of Fit

Examples of Approximate $R^2$ Values

[Figure: two plots of Y against X] $R^2 = 1$
Perfect linear relationship between X and Y: 100% of the variation in Y is explained by variation in X

[Figure: two plots of Y against X] $0 < R^2 < 1$
Weaker linear relationships between X and Y: Some but not all of the variation in Y is explained by variation in X

[Figure: plot of Y against X] $R^2 = 0$
No linear relationship between X and Y: The value of Y does not depend on X. (None of the variation in Y is explained by variation in X)

Standard Error of the Estimate
• The variation of observations around the sample regression line is estimated by

$S_e = \sqrt{\dfrac{SSE}{n - K - 1}}$

where
SSE = error sum of squares
n = sample size
K = number of slope beta parameters

• $S_e^2$ is an unbiased estimator of the variance of the error term, so $S_e$ is used as the estimate of its std dev
• Also called “standard error of the model,” or “root mean squared error” (RMSE). Book calls it $S_{YX}$.

Comparing Standard Errors

[Figure: two plots of Y against X]

• $S_e$ is a measure of the variation of observed Y values around the regression line
• The magnitude of $S_e$ should always be judged relative to the variation of the Y values in the sample data (measured by $S_Y$, the sample standard deviation of the actual Y values)
• The closer $S_e$ is to 0 rather than to $S_Y$, the better the fit

“Multiple R”
• The coefficient of determination, $R^2$, in simple regressions of the form $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, is equal to the square of the correlation coefficient:

$R^2 = r_{XY}^2$

• Provides a link between correlation and regression
• “Multiple R” is $\sqrt{R^2}$
• In the multiple regression context, it is another, less commonly used measure of the strength of the relationship between the dependent var and the independent (explanatory) vars.

• One should not place too much importance on obtaining a high $R^2$
• If all else is equal, a model with a higher $R^2$ explains a higher fraction of the variance
  – The model has more explanatory power
• Dependent var must be the same to compare
• However, $R^2$ can be influenced by factors such as the nature of the data
  – Cross-sectional data on individual people: .1 to .2
  – Cross-sectional data on firms, counties, cities, countries, states: .4 to .6
  – Time-series data: > .80

Goodness of Fit