# Regression Assumptions


If the assumptions below are met, OLS gives the Best Linear Unbiased Estimate (BLUE).


• If the following assumptions are met, the OLS estimate is BLUE:
  - The model is complete, linear, and additive.
  - Variables are measured at an interval or ratio scale, without error.
  - The regression error term is unrelated to the predictors, is normally distributed, and has an expected value of 0; the errors are independent and homoscedastic.
  - In a system of interrelated equations, the errors are unrelated to each other.

Characteristics of OLS if the sample is a probability sample: unbiased, efficient, consistent.

• Unbiased: E(b) = β, where b is the sample estimate and β is the true population coefficient; on average we are on target.
Efficient: the standard error will be minimal.
Consistent: as N increases, the standard error decreases and the estimate closes in on the population value.
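These properties can be illustrated with a small Monte Carlo sketch. The true model y = 2 + 0.5x + e and all sample sizes are assumptions for illustration, not data from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(n):
    """Draw one sample of size n from the assumed model and return the OLS slope."""
    x = rng.normal(0, 1, n)
    y = 2 + 0.5 * x + rng.normal(0, 1, n)
    return np.polyfit(x, y, 1)[0]

results = {}
for n in (30, 3000):
    slopes = np.array([ols_slope(n) for _ in range(2000)])
    # Unbiased: the slopes average to the true 0.5 at any sample size.
    # Consistent: their spread (the standard error) shrinks as n grows.
    results[n] = (slopes.mean(), slopes.std())
    print(n, round(results[n][0], 2), round(results[n][1], 3))
```

Across 2,000 replications the mean slope sits at the true value for both sample sizes, while the spread of the slopes is far smaller at n = 3000 than at n = 30.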

• Meals and parents' education: predicting schools' API13 scores (the Model row, observation count, and F statistics below are recovered arithmetically from the reported sums of squares and degrees of freedom).

```
. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<.

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F(  6, 10075) = 2947.08
       Model |  65503313.7     6  10917218.9           Prob > F      =  0.0000
    Residual |  37321960.3 10075  3704.41293           R-squared     =  0.6370
-------------+------------------------------           Adj R-squared =  0.6368
       Total |   102825274 10081  10199.9081           Root MSE      =  60.864

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |   .1843877   .0394747     4.67   0.000                 .0508435
      AVG_ED |   92.81476   1.575453    58.91   0.000                 .6976283
        P_EL |   .6984374   .0469403    14.88   0.000                 .1225343
      P_GATE |   .8179836   .0666113    12.28   0.000                 .0769699
        EMER |  -1.095043   .1424199    -7.69   0.000                 -.046344
        DMOB |   4.715438   .0817277    57.70   0.000                 .3746754
       _cons |   52.79082   8.491632     6.22   0.000                        .
------------------------------------------------------------------------------

. regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI
    PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<.

      Source |       SS       df       MS              Number of obs =   10082
-------------+------------------------------           F( 13, 10068) = 1488.01
       Model |  67627352.1    13  5202104.01           Prob > F      =  0.0000
    Residual |  35197921.9 10068  3496.01926           R-squared     =  0.6577
-------------+------------------------------           Adj R-squared =  0.6572
       Total |   102825274 10081  10199.9081           Root MSE      =  59.127

------------------------------------------------------------------------------
       API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       MEALS |    .370891   .0395857     9.37   0.000                 .1022703
      AVG_ED |   89.51041   1.851184    48.35   0.000                 .6727917
        P_EL |   .2773577   .0526058     5.27   0.000                 .0486598
      P_GATE |   .7084009   .0664352    10.66   0.000                 .0666584
        EMER |  -.7563048   .1396315    -5.42   0.000                  -.032008
        DMOB |   4.398746   .0817144    53.83   0.000                  .349512
      PCT_AA |  -1.096513   .0651923   -16.82   0.000                -.1112841
      PCT_AI |  -1.731408   .1560803   -11.09   0.000                -.0718944
      PCT_AS |   .5951273   .0585275    10.17   0.000                 .0715228
      PCT_FI |   .2598189   .1650952     1.57   0.116                 .0099543
      PCT_HI |   .0231088   .0445723     0.52   0.604                 .0066676
      PCT_PI |  -2.745531   .6295791    -4.36   0.000                -.0274142
      PCT_MR |  -.8061266   .1838885    -4.38   0.000                -.0295927
       _cons |   96.52733   9.305661    10.37   0.000                        .
------------------------------------------------------------------------------
```

• Diagnosis: theoretical (are all relevant variables in the model?). Remedy: including new variables.

• Violation of linearity: an almost perfect relationship may appear as a weak one. Almost all linear relations stop being linear at a certain point.

• Diagnosis: visual inspection of scatter plots; comparing a regression with a continuous independent variable to one with the same variable dummied. Remedy: use dummies. Y=a+bX+e becomes Y=a+b1*D1+...+bk-1*Dk-1+e, where X is broken up into k dummies (Di) and k-1 of them are included. If the R-square of this equation is significantly higher than the R-square of the original, that is a sign of non-linearity. The pattern of the slopes (bi) will indicate the shape of the non-linearity.
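A minimal sketch of this dummy-variable diagnostic, on simulated data where the true relation is assumed quadratic (variable names, bin cutpoints, and coefficients are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
y = 1 + 2 * x - 0.2 * x**2 + rng.normal(0, 1, 500)  # assumed non-linear truth

def r_squared(design, y):
    """R-square of an OLS fit of y on the given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid.var() / y.var()

ones = np.ones_like(x)
r2_linear = r_squared(np.column_stack([ones, x]), y)

# Break x into k=5 bins and include k-1 dummies (first bin is the baseline).
bins = np.digitize(x, np.linspace(0, 10, 6)[1:-1])
dummies = np.column_stack([(bins == j).astype(float) for j in range(1, 5)])
r2_dummies = r_squared(np.column_stack([ones, dummies]), y)

# A clearly higher dummy R-square signals non-linearity.
print(round(r2_linear, 3), round(r2_dummies, 3))
```

Here the linear R-square is close to zero (the hump-shaped relation averages out), while the dummied version captures it, which is exactly the diagnostic contrast described above.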

Or transform the variables through a non-linear transformation, so that Y=a+bX+e becomes:

Quadratic: Y=a+b1*X+b2*X^2+e
Cubic: Y=a+b1*X+b2*X^2+b3*X^3+e
Kth-degree polynomial: Y=a+b1*X+...+bk*X^k+e

Logarithmic: Y=a+b*log(X)+e
Exponential: log(Y)=a+bX+e, i.e. Y=e^(a+bX+e)
Inverse: Y=a+b/X+e, etc.

• Turning point: -b1/(2*b2) = -(-3.666183)/(2*.0181756) = 100.85425. As you approach 100%, the negative effect disappears; a turning point beyond 100% is meaningless!
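The turning-point formula can be checked on simulated data. This sketch assumes coefficients a=1, b1=4, b2=-0.02, so the true turning point is -b1/(2*b2) = 100; none of these numbers come from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 100, 1000)
y = 1 + 4 * x - 0.02 * x**2 + rng.normal(0, 5, 1000)

# polyfit returns the highest power first: [b2, b1, a]
b2, b1, a = np.polyfit(x, y, 2)
turning_point = -b1 / (2 * b2)  # where the fitted slope changes sign
print(round(turning_point, 1))  # close to the assumed true value of 100
```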

• Y=a+b1X1+b2X2+e

The assumption is that X1 and X2 each, separately, add to Y regardless of the value of the other. E.g., Inc = a + b1*Education + b2*Citizenship + e.

Imagine that the effect of X1 depends on X2:
- If citizen: Inc = a + b*1*Education + e*
- If not citizen: Inc = a + b**1*Education + e**, where b*1 > b**1.

You cannot simply add the two. If Citizenship takes only two values, their effect is multiplicative: Inc = a + b*(Education*Citizenship) + e.

There are many examples of the violation of additivity:
- the effect of previous knowledge (X1) and effort (X2) on grades (Y);
- the effect of race and skills on income (discrimination);
- the effect of paternal and maternal education on academic achievement.

• Diagnosis: try other functional forms and compare R-squares.

Remedy: introduce the multiplicative term as a new variable, so that Yi = a + b1*X1i + b2*X2i + e becomes Yi = a + b1*X1i + b2*X2i + b3*Zi + e, where Z = X1*X2. Or transform the equation into additive form: if Y = a*X1^b1*X2^b2*e, then log(Y) = log(a) + b1*log(X1) + b2*log(X2) + log(e), which is additive.
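A sketch of the multiplicative-term remedy on simulated data. The coefficient values and the education/citizenship framing are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x1 = rng.normal(0, 1, n)                   # e.g. education
x2 = rng.integers(0, 2, n).astype(float)   # e.g. citizenship dummy
# Assumed truth: the effect of x1 is 1.0 when x2=0 and 2.5 when x2=1.
y = 2 + 1.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(0, 1, n)

# Add Z = X1*X2 as a new column of the design matrix.
design = np.column_stack([np.ones(n), x1, x2, x1 * x2])
a, b1, b2, b3 = np.linalg.lstsq(design, y, rcond=None)[0]

# The slope of x1 is b1 when x2=0 and (b1 + b3) when x2=1.
print(round(b1, 1), round(b1 + b3, 1))
```

Leaving out the product term would force a single compromise slope for both groups; including it recovers the two distinct slopes.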

• Coefficients(a), model with interaction

| Model 1 | B | Std. Error | Beta | t | Sig. |
| --- | --- | --- | --- | --- | --- |
| (Constant) | 454.542 | 4.151 | | 109.497 | .000 |
| AVG_ED | 107.938 | 1.481 | .801 | 72.896 | .000 |
| ESCHOOL | 145.801 | 5.386 | .707 | 27.073 | .000 |
| AVG_ED*ESCHOOL (interaction) | -33.145 | 1.885 | -.495 | -17.587 | .000 |

a. Dependent Variable: API13

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .730(a) | .533 | .533 | 69.867 |

a. Predictors: (Constant), INTESXED, AVG_ED, ESCHOOL

Coefficients(a), model without interaction

| Model 1 | B | Std. Error | Beta | t | Sig. |
| --- | --- | --- | --- | --- | --- |
| (Constant) | 510.030 | 2.738 | | 186.250 | .000 |
| AVG_ED | 87.476 | .930 | .649 | 94.085 | .000 |
| ESCHOOL | 54.352 | 1.424 | .264 | 38.179 | .000 |

a. Dependent Variable: API13

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .720(a) | .519 | .519 | 70.918 |

a. Predictors: (Constant), ESCHOOL, AVG_ED

Does parents' education matter more in elementary school or later?

• Pred(API13) = 454.542 + 107.938*AVG_ED + 145.801*ESCHOOL + (-33.145)*AVG_ED*ESCHOOL

IF ESCHOOL = 1, i.e. the school is an elementary school:

Pred(API13) = 454.542 + 107.938*AVG_ED + 145.801*1 + (-33.145)*AVG_ED*1
= (454.542 + 145.801) + (107.938 - 33.145)*AVG_ED
= 600.343 + 74.793*AVG_ED

IF ESCHOOL = 0, i.e. the school is a middle or high school:

Pred(API13) = 454.542 + 107.938*AVG_ED + 145.801*0 + (-33.145)*AVG_ED*0
= 454.542 + 107.938*AVG_ED

The effect of parental education is larger after elementary school! Is this difference statistically significant? Yes: the t-test of the interaction coefficient (t = -17.587, Sig. = .000) shows that it is.
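The two implied regression lines can be verified directly from the reported coefficients:

```python
# The four coefficients reported for the interaction model.
a, b_ed, b_esch, b_int = 454.542, 107.938, 145.801, -33.145

def pred_api13(avg_ed, eschool):
    """Predicted API13 from the interaction model."""
    return a + b_ed * avg_ed + b_esch * eschool + b_int * avg_ed * eschool

# Slope of AVG_ED within each school type = change in prediction per unit AVG_ED.
slope_elementary = pred_api13(1, 1) - pred_api13(0, 1)  # 107.938 - 33.145
slope_later = pred_api13(1, 0) - pred_api13(0, 0)       # 107.938
print(round(slope_elementary, 3), round(slope_later, 3))
```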


• Choosing a technique by the level of measurement of the dependent and independent variables (K = number of nominal categories, N = number of ordinal categories):

| Independent \ Dependent | Nominal (dichotomous) | Nominal (polytomous) | Ordinal | Interval/Ratio |
| --- | --- | --- | --- | --- |
| Dichotomous | 2x2 table; dummy variables with logit/probit | Kx2 table; dummy variables with multinomial logit/probit | Nx2 table; dummy variables with ordered logit/probit | Difference-of-means test; regression with dummy variables |
| Polytomous | 2xK table; dummy variables with logit/probit | KxK table; dummy variables with multinomial logit/probit | NxK table; dummy variables with ordered logit/probit | ANOVA; regression with dummy variables |
| Ordinal | 2xN table; dummy variables with logit/probit, or just logit/probit | KxN table; dummy variables with multinomial logit/probit, or just multinomial logit/probit | NxN table; dummy variables with ordered logit/probit, or just ordered logit/probit | Regression with dummy variables, or just regression |
| Interval/Ratio | Logit/probit | Multinomial logit/probit | Ordered logit/probit | Regression |

• Take Y = a + bX + e. Suppose X* = X + u, where X is the real value and u is a random measurement error. Then

Y = a + b(X* - u) + e = a + bX* + E, where E = e - b*u.

The slope (b) in the equation does not change, but the error grows as a result: our R-square will be smaller, and our standard errors will be larger, so t-values will be smaller and significance weaker. (Strictly, because X* is correlated with the composite error E, the estimated slope is also attenuated toward zero.)

Now suppose X# = X + c*W + u, where W is a systematic measurement error and c is a weight. Then

Y = a + b'X# + E, and b' = b iff r(W,X) = 0 or r(W,Y) = 0; otherwise b' ≠ b, which means the slope changes together with the increase in the error. Apart from the problems stated above, this means that our slope will be wrong.
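A quick simulation of the random-error case (the true model y = 1 + 2x + e and all variances are assumed for illustration). It shows the smaller R-square described above, and also that the estimated slope on the error-laden X* is pulled toward zero, because X* is correlated with the composite error:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(0, 1, n)
y = 1 + 2 * x + rng.normal(0, 1, n)
x_star = x + rng.normal(0, 1, n)  # X* = X + u, random measurement error

def fit(xv):
    """OLS slope and R-square of y regressed on xv."""
    slope = np.polyfit(xv, y, 1)[0]
    r2 = np.corrcoef(xv, y)[0, 1] ** 2
    return slope, r2

b_true, r2_true = fit(x)       # regression on the real X
b_err, r2_err = fit(x_star)    # regression on the mismeasured X*
print(round(b_true, 2), round(b_err, 2))    # slope attenuated
print(round(r2_true, 2), round(r2_err, 2))  # R-square drops
```

With equal variances for X and u, the attenuation factor is var(X)/(var(X)+var(u)) = 0.5, so the fitted slope lands near 1 instead of the assumed 2.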

• Diagnosis: look at the correlation of the measure with other measures of the same variable. Remedy: use multiple indicators and structural equation models (e.g., AMOS); confirmatory factor analysis; better measures.

• Our calculations of statistical significance depe