Regression Assumptions


Transcript

  • OLS is the Best Linear Unbiased Estimate (BLUE) if the following assumptions are met:
    The model is
    - complete
    - linear
    - additive
    Variables are measured at an interval or ratio scale, without error.
    The regression error term is
    - unrelated to the predictors
    - normally distributed
    - has an expected value of 0
    Errors are independent of each other.
    Errors have constant variance (homoscedasticity).
    In a system of interrelated equations, the errors are unrelated to each other.

    Characteristics of OLS if the sample is a probability sample:
    - Unbiased
    - Efficient
    - Consistent

  • Unbiased: E(b) = β, where b is the sample coefficient and β is the true population coefficient. On average, we are on target.
    Efficient: the standard error will be minimal.
    Consistent: as N increases, the standard error decreases and the estimate closes in on the population value.
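    The unbiasedness and consistency claims can be illustrated with a small pure-Python simulation. The true slope, sample sizes, and error scale below are arbitrary choices for the sketch, not values from the slides:

    ```python
    import random
    import statistics

    random.seed(42)

    def ols_slope(x, y):
        # OLS slope for a simple regression: b = cov(x, y) / var(x)
        mx, my = statistics.fmean(x), statistics.fmean(y)
        num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        den = sum((xi - mx) ** 2 for xi in x)
        return num / den

    TRUE_B = 2.0  # the population coefficient the estimator should recover

    def simulate(n, reps=500):
        # Draw `reps` probability samples of size n and estimate b each time
        slopes = []
        for _ in range(reps):
            x = [random.uniform(0, 10) for _ in range(n)]
            y = [1.0 + TRUE_B * xi + random.gauss(0, 3) for xi in x]
            slopes.append(ols_slope(x, y))
        return slopes

    small = simulate(n=30)
    large = simulate(n=500)

    # Unbiased: the average estimate is close to the true slope at either N
    print(statistics.fmean(small), statistics.fmean(large))
    # Consistent: the spread of the estimates shrinks as N grows
    print(statistics.stdev(small), statistics.stdev(large))
    ```

    The averages sit near the true slope for both sample sizes (unbiasedness), while the standard deviation of the estimates is visibly smaller at N=500 (consistency).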

  • Meals, parents' education:

    . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<…
    (part of the command and the ANOVA header were lost in extraction)

                                                     Prob > F      = 0.0000
        Residual |  37321960.3 10075 3704.41293      R-squared     = 0.6370
                                                     Adj R-squared = 0.6368
           Total |   102825274 10081 10199.9081      Root MSE      = 60.864

           API13 |      Coef.   Std. Err.      t    P>|t|       Beta
    -------------+--------------------------------------------------
           MEALS |   .1843877   .0394747    4.67    0.000   .0508435
          AVG_ED |   92.81476   1.575453   58.91    0.000   .6976283
            P_EL |   .6984374   .0469403   14.88    0.000   .1225343
          P_GATE |   .8179836   .0666113   12.28    0.000   .0769699
            EMER |  -1.095043   .1424199   -7.69    0.000   -.046344
            DMOB |   4.715438   .0817277   57.70    0.000   .3746754
           _cons |   52.79082   8.491632    6.22    0.000          .

    . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<…
    (part of the command and the ANOVA header were lost in extraction)

                                                     Prob > F      = 0.0000
        Residual |  35197921.9 10068 3496.01926      R-squared     = 0.6577
                                                     Adj R-squared = 0.6572
           Total |   102825274 10081 10199.9081      Root MSE      = 59.127

           API13 |      Coef.   Std. Err.      t    P>|t|       Beta
    -------------+--------------------------------------------------
           MEALS |    .370891   .0395857    9.37    0.000   .1022703
          AVG_ED |   89.51041   1.851184   48.35    0.000   .6727917
            P_EL |   .2773577   .0526058    5.27    0.000   .0486598
          P_GATE |   .7084009   .0664352   10.66    0.000   .0666584
            EMER |  -.7563048   .1396315   -5.42    0.000    -.032008
            DMOB |   4.398746   .0817144   53.83    0.000    .349512
          PCT_AA |  -1.096513   .0651923  -16.82    0.000   -.1112841
          PCT_AI |  -1.731408   .1560803  -11.09    0.000   -.0718944
          PCT_AS |   .5951273   .0585275   10.17    0.000    .0715228
          PCT_FI |   .2598189   .1650952    1.57    0.116    .0099543
          PCT_HI |   .0231088   .0445723    0.52    0.604    .0066676
          PCT_PI |  -2.745531   .6295791   -4.36    0.000   -.0274142
          PCT_MR |  -.8061266   .1838885   -4.38    0.000   -.0295927
           _cons |   96.52733   9.305661   10.37    0.000           .

  • Diagnosis: theoretical
    Remedy: including new variables

  • Violation of linearity: an almost perfect relationship can appear as a weak one. Almost all linear relations stop being linear at a certain point.

  • Diagnosis:
    - Visual: scatter plots
    - Comparing the regression with a continuous independent variable to one with a dummied independent variable
    Remedy: use dummies.
    Y=a+bX+e becomes Y=a+b1D1+...+bk-1Dk-1+e, where X is broken up into k dummies (Di), of which k-1 are included. If the R-square of this equation is significantly higher than the R-square of the original, that is a sign of non-linearity. The pattern of the slopes (bi) will indicate the shape of the non-linearity.

    Or transform the variables through a non-linear transformation, so that Y=a+bX+e becomes:

    Quadratic:              Y=a+b1X+b2X^2+e
    Cubic:                  Y=a+b1X+b2X^2+b3X^3+e
    Kth-degree polynomial:  Y=a+b1X+...+bkX^k+e
    Logarithmic:            Y=a+b*log(X)+e
    Exponential:            log(Y)=a+bX+e, i.e. Y=e^(a+bX+e)
    Inverse:                Y=a+b/X+e, etc.
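    A quick sketch of why the transformation matters: when the true relationship is logarithmic, regressing Y on log(X) fits far better than Y on X. All constants below are illustrative, not from the slides:

    ```python
    import math
    import random
    import statistics

    random.seed(0)

    def ols_r2(x, y):
        # R-squared from a simple OLS regression of y on x
        mx, my = statistics.fmean(x), statistics.fmean(y)
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        b = sxy / sxx
        a = my - b * mx
        ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
        ss_tot = sum((yi - my) ** 2 for yi in y)
        return 1 - ss_res / ss_tot

    # True relationship is logarithmic: Y = 2 + 5*log(X) + error
    x = [random.uniform(1, 100) for _ in range(1000)]
    y = [2 + 5 * math.log(xi) + random.gauss(0, 1) for xi in x]

    r2_linear = ols_r2(x, y)                        # misspecified: Y on X
    r2_log = ols_r2([math.log(xi) for xi in x], y)  # correct: Y on log(X)
    print(r2_linear, r2_log)
    ```

    The misspecified linear fit still produces a respectable R-square, which is exactly the danger: without a scatter plot or a transformed comparison model, the non-linearity goes unnoticed.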

  • Turning point of the quadratic: -b1/(2*b2) = -(-3.666183)/(2*.0181756) = 100.854. As you approach 100%, the negative effect disappears; a turning point beyond 100% is meaningless for a percentage variable!
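    The turning-point arithmetic can be checked directly; b1 and b2 are the quadratic coefficients quoted on the slide:

    ```python
    # Turning point of a quadratic fit Y = a + b1*X + b2*X^2:
    # dY/dX = b1 + 2*b2*X = 0, so X = -b1 / (2*b2).
    b1 = -3.666183  # linear coefficient from the slide
    b2 = 0.0181756  # quadratic coefficient from the slide

    turning_point = -b1 / (2 * b2)
    print(turning_point)  # roughly 100.85: just past the 0-100 range of a percentage
    ```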

  • Y=a+b1X1+b2X2+e

    The assumption is that X1 and X2 each, separately, add to Y regardless of the value of the other.
    E.g., Inc=a+b1Education+b2Citizenship+e.
    Imagine that the effect of X1 depends on X2:
    If citizen:      Inc=a*+b*1Education+e*
    If not a citizen: Inc=a**+b**1Education+e**
    where b*1 > b**1. You cannot simply add the two. If Citizenship takes only two values, their joint effect is multiplicative: Inc=a+b1Education*b2Citizenship+e.
    There are many examples of violations of additivity:
    - the effect of previous knowledge (X1) and effort (X2) on grades (Y)
    - the effect of race and skills on income (discrimination)
    - the effect of paternal and maternal education on academic achievement

  • Diagnosis: try other functional forms and compare R-squares.

    Remedy: introduce the multiplicative term as a new variable, so that
    Yi=a+b1X1+b2X2+e becomes Yi=a+b1X1+b2X2+b3Z+e, where Z=X1*X2.
    Or transform the equation into additive form:
    if Y=a*X1^b1*X2^b2*e, then log(Y)=log(a)+b1*log(X1)+b2*log(X2)+e.
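    The log transformation's additivity can be verified numerically. The constants a, b1, b2 below are arbitrary illustrative values, not estimates from the slides:

    ```python
    import math

    # If Y = a * X1**b1 * X2**b2 (multiplicative), taking logs makes it additive:
    # log Y = log(a) + b1*log(X1) + b2*log(X2)
    a, b1, b2 = 3.0, 2.0, 1.5  # hypothetical constants for illustration

    for x1, x2 in [(1.0, 1.0), (2.0, 5.0), (10.0, 0.5)]:
        y = a * x1 ** b1 * x2 ** b2
        additive = math.log(a) + b1 * math.log(x1) + b2 * math.log(x2)
        print(math.log(y), additive)  # identical: the model is now linear in logs
    ```

    After the transformation, b1 and b2 can be estimated by ordinary (additive) regression on the logged variables.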

  • Coefficients(a) — model with the interaction

    Model                         B       Std. Error   Beta      t        Sig.
    (Constant)                  454.542     4.151               109.497   .000
    AVG_ED                      107.938     1.481      .801      72.896   .000
    ESCHOOL                     145.801     5.386      .707      27.073   .000
    AVG_ED*ESCHOOL (interaction) -33.145    1.885     -.495     -17.587   .000
    a Dependent Variable: API13

    Model Summary
    R = .730(a)   R Square = .533   Adjusted R Square = .533   Std. Error of the Estimate = 69.867
    a Predictors: (Constant), INTESXED, AVG_ED, ESCHOOL

    Coefficients(a) — model without the interaction

    Model            B       Std. Error   Beta      t        Sig.
    (Constant)     510.030     2.738               186.250   .000
    AVG_ED          87.476      .930      .649      94.085   .000
    ESCHOOL         54.352     1.424      .264      38.179   .000
    a Dependent Variable: API13

    Model Summary
    R = .720(a)   R Square = .519   Adjusted R Square = .519   Std. Error of the Estimate = 70.918
    a Predictors: (Constant), ESCHOOL, AVG_ED

    Does parents' education matter more in elementary school or later?

  • Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*ESCHOOLi + (-33.145)*AVG_EDi*ESCHOOLi

    If ESCHOOL=1, i.e. the school is an elementary school:
    Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*1 + (-33.145)*AVG_EDi*1
                 = (454.542 + 145.801) + (107.938 - 33.145)*AVG_EDi
                 = 600.343 + 74.793*AVG_EDi

    If ESCHOOL=0, i.e. the school is not an elementary but a middle or high school:
    Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*0 + (-33.145)*AVG_EDi*0
                 = 454.542 + 107.938*AVG_EDi

    The effect of parental education is larger after elementary school! Is this difference statistically significant?

    Coefficients(a)

    Model                         B       Std. Error   Beta      t        Sig.
    (Constant)                  454.542     4.151               109.497   .000
    AVG_ED                      107.938     1.481      .801      72.896   .000
    ESCHOOL                     145.801     5.386      .707      27.073   .000
    AVG_ED*ESCHOOL (interaction) -33.145    1.885     -.495     -17.587   .000
    a Dependent Variable: API13
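    The two group equations can be derived mechanically from the interaction model's coefficients; a small sketch using the estimates quoted on the slide:

    ```python
    # Interaction model from the slide:
    # Pred(API13) = 454.542 + 107.938*AVG_ED + 145.801*ESCHOOL - 33.145*AVG_ED*ESCHOOL
    const, b_ed, b_esch, b_int = 454.542, 107.938, 145.801, -33.145

    # Elementary schools (ESCHOOL = 1): the dummy and the interaction both switch on
    intercept_elem = const + b_esch  # 600.343
    slope_elem = b_ed + b_int        # 74.793

    # Middle/high schools (ESCHOOL = 0): the base equation
    intercept_other = const          # 454.542
    slope_other = b_ed               # 107.938

    print(intercept_elem, slope_elem)
    print(intercept_other, slope_other)
    ```

    The interaction coefficient is exactly the difference between the two slopes, so its t-test (-17.587, Sig. .000) is the significance test for that difference.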

  • Choosing a method by the level of measurement of the dependent and independent variables (dependent: nominal, ordinal, or interval/ratio):

    Independent: Dichotomous
    - Dependent nominal (dichotomous): 2x2 table; dummy variables with logit/probit
    - Dependent nominal (polytomous): Kx2 table; dummy variables with multinomial logit/probit
    - Dependent ordinal: Nx2 table; dummy variables with ordered logit/probit
    - Dependent interval/ratio: difference of means test; regression with dummy variables

    Independent: Polytomous
    - Dependent nominal (dichotomous): 2xK table; dummy variables with logit/probit
    - Dependent nominal (polytomous): KxK table; dummy variables with multinomial logit/probit
    - Dependent ordinal: NxK table; dummy variables with ordered logit/probit
    - Dependent interval/ratio: ANOVA; regression with dummy variables

    Independent: Ordinal
    - Dependent nominal (dichotomous): 2xN table; dummy variables with logit/probit, or just logit/probit
    - Dependent nominal (polytomous): NxK table; dummy variables with multinomial logit/probit, or just multinomial logit/probit
    - Dependent ordinal: KxK table; dummy variables with ordered logit/probit, or just ordered logit/probit
    - Dependent interval/ratio: regression with dummies, or just regression

    Independent: Interval/Ratio
    - Dependent nominal (dichotomous): logit/probit
    - Dependent nominal (polytomous): multinomial logit/probit
    - Dependent ordinal: ordered logit/probit
    - Dependent interval/ratio: regression

  • Take Y=a+bX+e. Suppose X*=X+u, where X is the real value and u is a random measurement error.
    Then Y=a+b'X*+E, and
    Y=a+b(X+u)+e = a+bX+bu+e = a+bX+E, where E=bu+e and b'=b.
    The slope (b) will not change, but the error will increase as a result:
    - our R-square will be smaller
    - our standard errors will be larger, so t-values are smaller and significance is smaller
    Now suppose X#=X+cW+u, where W is a systematic measurement error and c is a weight.
    Then Y=a+b'X#+E, and
    Y=a+b(X+cW+u)+e = a+bX+bcW+E.
    b'=b iff rWX=0 or rWY=0; otherwise b'≠b, which means the slope will change together with the increase in the error. Apart from the problems stated above, that means our slope will be wrong.
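    A minimal stdlib simulation of the slide's claim about random error: noise that enters the error term and is uncorrelated with X leaves the slope on target while R-square falls. All numbers are made up for illustration:

    ```python
    import random
    import statistics

    random.seed(1)

    def ols_fit(x, y):
        # Returns (intercept, slope, R-squared) of a simple OLS regression
        mx, my = statistics.fmean(x), statistics.fmean(y)
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        b = sxy / sxx
        a = my - b * mx
        ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
        ss_tot = sum((yi - my) ** 2 for yi in y)
        return a, b, 1 - ss_res / ss_tot

    TRUE_B = 2.0
    n = 5000
    x = [random.uniform(0, 10) for _ in range(n)]
    e = [random.gauss(0, 1) for _ in range(n)]
    u = [random.gauss(0, 3) for _ in range(n)]  # extra random noise, independent of x

    y_clean = [1 + TRUE_B * xi + ei for xi, ei in zip(x, e)]
    y_noisy = [yi + ui for yi, ui in zip(y_clean, u)]  # error term E = e + u, inflated

    _, b_clean, r2_clean = ols_fit(x, y_clean)
    _, b_noisy, r2_noisy = ols_fit(x, y_noisy)
    print(b_clean, b_noisy)    # both near 2: the slope stays on target
    print(r2_clean, r2_noisy)  # R-squared drops once the noise is folded in
    ```

    This mirrors the derivation Y=a+bX+E with E=bu+e: the added noise is absorbed by the error term, shrinking R-square and inflating standard errors without biasing b.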

  • Diagnosis: look at the correlation of the measure with other measures of the same variable.
    Remedy:
    - use multiple indicators and structural equation models (AMOS)
    - confirmatory factor analysis
    - better measures

  • Our calculations of statistical significance depe
