SDA 3E Chapter 6 (2)

  • 8/4/2019 SDA 3E Chapter 6 (2)

    2007 Pearson Education

    Chapter 6: Regression Analysis

    Part 2: Multiple Regression

    Model Form

    Multiple linear regression model:

    Y = b0 + b1X1 + b2X2 + ... + bkXk + e

    Predicted model:

    Ŷ = b0 + b1X1 + b2X2 + ... + bkXk

    The b's are called partial regression coefficients.
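
The partial regression coefficients are usually estimated by least squares. The following is a minimal sketch in plain Python, assuming the normal equations (X'X)b = X'y are solved by Gaussian elimination; the function name `fit_ols` and the toy data are illustrative and do not come from the chapter.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] for y regressed on rows.

    rows: list of observations, each a list of k predictor values.
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy data generated exactly from Y = 1 + 2*X1 + 3*X2 (no error term).
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)  # approximately [1.0, 2.0, 3.0]
```

In practice a spreadsheet tool or a statistics library would do this step; the sketch only shows what "fitting" the b's means.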

    Example: 2000 NFL Data

    Games Won = b0 + b1 Yards Gained + b2 Takeaways + b3 Giveaways + b4 Yards Allowed + b5 Points Scored + e

    Fitted model:

    Games Won = 8.29 + 0.00074 Yards Gained + 0.1001 Takeaways - 0.0839 Giveaways - 0.0018 Yards Allowed + 0.0138 Points Scored
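
To make the fitted equation concrete, it can be wrapped in a small function. This is only a sketch: the coefficients are copied from the model above, while the function name and the team statistics are hypothetical values chosen for illustration.

```python
def predict_games_won(yards_gained, takeaways, giveaways,
                      yards_allowed, points_scored):
    """Evaluate the fitted NFL model for one team."""
    return (8.29
            + 0.00074 * yards_gained
            + 0.1001 * takeaways
            - 0.0839 * giveaways
            - 0.0018 * yards_allowed
            + 0.0138 * points_scored)


# Hypothetical team statistics (not taken from the data set).
base = predict_games_won(5000, 30, 25, 5000, 350)

# A partial regression coefficient is the change in the prediction when one
# variable rises by one unit and all others are held fixed.
one_more_takeaway = predict_games_won(5000, 31, 25, 5000, 350)
takeaway_effect = one_more_takeaway - base  # about 0.1001
```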

    Excel Tool Results

    Interpreting Results

    Regression statistics are similar to the single independent variable case.

    R Square (coefficient of multiple determination): the value 0.779 indicates that about 78% of the variation in games won can be explained by the variation in the independent variables.

    Adjusted R² accounts for the sample size and the number of independent variables.

    ANOVA Results

    Significance of regression:

    H0: b1 = b2 = ... = bk = 0

    H1: at least one bj is not 0

    Note: df for residual is n - k - 1; df for regression is k.
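
The degrees of freedom above feed directly into the F statistic of the ANOVA table, F = (SSR / k) / (SSE / (n - k - 1)). A minimal sketch follows; the sums of squares and the sample size n = 31 are hypothetical numbers chosen so that SSR / SST matches the reported R Square of 0.779.

```python
def regression_f_statistic(ssr, sse, n, k):
    """Return (F, df_regression, df_residual) for a model with k predictors."""
    df_regression = k
    df_residual = n - k - 1
    msr = ssr / df_regression   # mean square due to regression
    mse = sse / df_residual     # mean square error
    return msr / mse, df_regression, df_residual


# Hypothetical values: SST = 100, so SSR / SST = 0.779.
f_stat, df_reg, df_res = regression_f_statistic(ssr=77.9, sse=22.1, n=31, k=5)
```

A large F (relative to the F distribution with these degrees of freedom) leads to rejecting H0 that all slope coefficients are zero.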

    Residual Plots

    Test for Individual Coefficients

    H0: bj = 0 vs. H1: bj ≠ 0

    t = bj / standard error, with n - k - 1 df

    Confidence intervals: bj ± t(n-k-1) × s.e.
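
The test statistic and interval can be sketched as below. The standard error is a hypothetical value, and the critical value 2.060 is the two-sided 95% t value for 25 degrees of freedom (which would correspond to n = 31 and k = 5, an assumption); only the coefficient 0.1001 comes from the fitted model.

```python
def coefficient_t_test(b_j, std_error, t_critical):
    """Return the t statistic and the confidence interval for one coefficient."""
    t_stat = b_j / std_error
    margin = t_critical * std_error
    return t_stat, (b_j - margin, b_j + margin)


# b_j is the Takeaways coefficient; std_error = 0.05 is hypothetical.
t_stat, (lower, upper) = coefficient_t_test(b_j=0.1001, std_error=0.05,
                                            t_critical=2.060)
# If the interval contains 0, H0: bj = 0 is not rejected at that level.
contains_zero = lower <= 0.0 <= upper
```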

    Building Good Models

    Include only significant independent variables. Use the fewest necessary to permit adequate interpretation of the dependent variable.

    With 10 candidate variables there are potentially 2^10 = 1024 models!

    As you add more explanatory variables to a model, R² never decreases (even if the variables are irrelevant). However, the adjusted R² could either increase or decrease, thus providing information about the value of additional variables.

    $$\text{Adjusted } R^2 = 1 - \frac{SSE}{SST} \cdot \frac{n - 1}{n - k - 1}$$
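
The adjusted R² formula, 1 - (SSE / SST)(n - 1) / (n - k - 1), is simple enough to check by hand. The sketch below uses hypothetical sums of squares and an assumed n = 31 to show how adding a nearly irrelevant sixth variable can raise R² while lowering adjusted R².

```python
def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R-squared for a model with k independent variables."""
    return 1 - (sse / sst) * (n - 1) / (n - k - 1)


# Hypothetical: five variables give R-squared = 1 - 22.1/100 = 0.779.
adj_five = adjusted_r_squared(sse=22.1, sst=100.0, n=31, k=5)

# Hypothetical: a sixth variable trims SSE only slightly (R-squared = 0.780).
adj_six = adjusted_r_squared(sse=22.0, sst=100.0, n=31, k=6)
```

Here `adj_six` is smaller than `adj_five`, which is the signal that the extra variable is not earning its place.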

    Model After Dropping Yards Gained

    Adjusted R² increases slightly.

    Model After Dropping Takeaways

    Best Subsets Regression

    Evaluates all possible models, or those containing a fixed number of independent variables, to identify the best.

    Selects appropriate models based on Cp.

    PHStat output
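
The size of the search and the Cp criterion can be sketched as follows. This assumes the usual Mallows definition, Cp = SSE_p / MSE_full - (n - 2p), where p counts the coefficients including the intercept; whether PHStat computes it exactly this way is an assumption, and the numeric inputs are hypothetical.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every subset of the candidate variables, including the empty one."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


def mallows_cp(sse_subset, mse_full, n, num_predictors):
    """Mallows Cp for a subset model with num_predictors variables."""
    p = num_predictors + 1  # coefficients including the intercept
    return sse_subset / mse_full - (n - 2 * p)


candidates = ["Yards Gained", "Takeaways", "Giveaways",
              "Yards Allowed", "Points Scored"]
subset_count = sum(1 for _ in all_subsets(candidates))     # 2**5
ten_variable_count = sum(1 for _ in all_subsets(range(10)))  # 2**10

# Hypothetical full model: SSE = 22.1 with n = 31 and k = 5.
mse_full = 22.1 / (31 - 5 - 1)
cp_full = mallows_cp(sse_subset=22.1, mse_full=mse_full, n=31, num_predictors=5)
```

For the full model Cp always equals k + 1; subset models with Cp close to their own p are usually the ones worth a closer look.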

    Stepwise Regression

    Best subsets is not always practical.

    Stepwise regression is a search process that adds or deletes variables at each step until no changes can improve the model.
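
The search idea can be illustrated with a forward-only variant. This is a sketch under stated assumptions, not PHStat's procedure: the caller supplies a `score` function (for example adjusted R²), and the variable names and score table below are hypothetical.

```python
def forward_selection(candidates, score, min_gain=1e-9):
    """Greedily add the variable that most improves score until none helps."""
    selected = []
    remaining = list(candidates)
    best = score(selected)
    while remaining:
        # Score every model that adds one more variable to the current set.
        trials = [(score(selected + [v]), v) for v in remaining]
        top_score, top_var = max(trials)
        if top_score - best <= min_gain:
            break  # no single addition improves the model
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected, best


# Hypothetical adjusted R-squared values for every subset of A, B, C.
SCORES = {
    frozenset(): 0.00,
    frozenset("A"): 0.50, frozenset("B"): 0.40, frozenset("C"): 0.10,
    frozenset("AB"): 0.70, frozenset("AC"): 0.49, frozenset("BC"): 0.45,
    frozenset("ABC"): 0.69,
}
chosen, chosen_score = forward_selection(
    ["A", "B", "C"], lambda subset: SCORES[frozenset(subset)])
```

The search stops after adding A and then B, because adding C would lower the score. A general stepwise method would also consider removing variables at each step.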

    PHStat Tool: Stepwise Regression

    PHStat menu > Regression > Stepwise Regression

    Enter variable ranges

    Select stepwise criteria

    Select type of method to use: general, forward selection, or backward elimination

    Stepwise Regression Results

    Forward Selection

    First model

    Second model

    Final model

    Multicollinearity

    Multicollinearity occurs when two or more independent variables contain high levels of the same information.

    The independent variables predict each other better than they predict the dependent variable, making it difficult to interpret the regression coefficients and leading to poor statistical conclusions.

    Effects: estimates of the regression coefficients are unstable depending on which variables are present, signs may be opposite of expectations, and p-values can be inflated.

    Correlation Matrix

    The correlation between Points Scored and Yards Gained is larger than any correlation between Games Won and other independent variables.

    Measuring Multicollinearity

    Variance Inflation Factor: $VIF_j = \frac{1}{1 - r_j^2}$

    Option in PHStat routine (be sure to check the box).

    If there is no multicollinearity, VIF = 1.

    Researchers suggest that VIF should be no greater than 5.
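
The variance inflation factor is VIF = 1 / (1 - r_j²), where r_j² is usually taken as the R² obtained by regressing independent variable j on the other independent variables. A minimal sketch with hypothetical r_j² values:

```python
def variance_inflation_factor(r_squared_j):
    """VIF for one independent variable, given its R-squared on the others."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("r_squared_j must be in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


vif_none = variance_inflation_factor(0.0)      # no multicollinearity
vif_threshold = variance_inflation_factor(0.8)  # the suggested limit of 5
vif_severe = variance_inflation_factor(0.95)    # well above the limit
```

The guideline "VIF no greater than 5" is therefore equivalent to requiring that no more than 80% of a variable's variation be explained by the other independent variables.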

    Variance Inflation Factors

    Models with Categorical Independent Variables

    Examples:

    Gender (male, female)

    College graduate (no, 2-year degree, 4-year degree, postgraduate degree)

    Own home (yes, no)

    Example

    How do age and MBA degree affect employee salaries?

    Y = b0 + b1X1 + b2X2 + e

    where

    Y = salary

    X1 = age

    X2 = MBA indicator (0 = No; 1 = Yes)

    Results

    Model

    Salary = 893.59 + 1044.15 Age + 14767.23 MBA

    No MBA: Salary = 893.59 + 1044.15 Age

    MBA: Salary = 15660.82 + 1044.15 Age

    The models suggest that the rate of salary increase for age is the same for both groups.

    However, individuals with MBAs might earn relatively higher salaries as they get older. In other words, the slope of Age may depend on the value of MBA. Such a dependence is called an interaction.
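
The additive model with a 0/1 indicator can be sketched as a function to show that the MBA effect shifts the intercept only. The coefficients are those of the fitted model above; the function name and the ages are illustrative.

```python
def predict_salary(age, has_mba):
    """Additive model: the MBA indicator shifts the intercept, not the slope."""
    mba = 1 if has_mba else 0
    return 893.59 + 1044.15 * age + 14767.23 * mba


# The gap between the two groups is the same at every age.
gap_at_30 = predict_salary(30, True) - predict_salary(30, False)
gap_at_50 = predict_salary(50, True) - predict_salary(50, False)
```

Both gaps equal the MBA coefficient, 14767.23, which is exactly the restriction an interaction term relaxes.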

    Interaction Model

    Y = b0 + b1 Age + b2 MBA + b3 Age*MBA + e

    Results With Interaction Term

    Final Model

    Model Results

    Salary = 3323.11 + 984.25 Age + 425.58 MBA*Age

    No MBA: Salary = 3323.11 + 984.25 Age + 425.58 (0)*Age = 3323.11 + 984.25 Age

    MBA: Salary = 3323.11 + 984.25 Age + 425.58 (1)*Age = 3323.11 + 1409.83 Age
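
With the interaction term, the slope of Age depends on the MBA indicator. A short sketch using the final model's coefficients (the function name and ages are illustrative):

```python
def predict_salary_interaction(age, has_mba):
    """Final model: the MBA*Age term changes the slope of Age."""
    mba = 1 if has_mba else 0
    return 3323.11 + 984.25 * age + 425.58 * mba * age


# Salary gained per additional year of age in each group.
slope_no_mba = (predict_salary_interaction(41, False)
                - predict_salary_interaction(40, False))
slope_mba = (predict_salary_interaction(41, True)
             - predict_salary_interaction(40, True))

# Unlike the additive model, the gap between groups widens with age.
gap_at_30 = predict_salary_interaction(30, True) - predict_salary_interaction(30, False)
gap_at_50 = predict_salary_interaction(50, True) - predict_salary_interaction(50, False)
```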

    Categorical Variables With More Than Two Levels

    For k > 2 levels, add k - 1 additional variables.

    Example: The Excel file Surface Finish.xls provides measurements of the surface finish of 35 parts produced on a lathe, along with the revolutions per minute (RPM) of the spindle and one of four types of cutting tools used. The engineer who collected the data is interested in predicting the surface finish as a function of RPM and type of tool.

    Model

    Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + e

    where

    Y = surface finish

    X1 = RPM

    X2 = tool type B

    X3 = tool type C

    X4 = tool type D

    Tool Type   X2   X3   X4
    A            0    0    0
    B            1    0    0
    C            0    1    0
    D            0    0    1

    Tool Type A: Y = b0 + b1X1 + e

    Tool Type B: Y = b0 + b1X1 + b2 + e

    Tool Type C: Y = b0 + b1X1 + b3 + e

    Tool Type D: Y = b0 + b1X1 + b4 + e
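
The coding scheme, one baseline level plus k - 1 indicator variables, can be sketched as a small helper. The function name `encode_levels` is illustrative; the first level in the list is treated as the baseline (tool type A here).

```python
def encode_levels(value, levels):
    """Return k - 1 indicator values for a categorical value with k levels.

    The first entry of levels is the baseline and is coded as all zeros.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


TOOL_TYPES = ["A", "B", "C", "D"]
coding = {tool: encode_levels(tool, TOOL_TYPES) for tool in TOOL_TYPES}
# coding["A"] is [0, 0, 0]; coding["C"] is [0, 1, 0]
```

Using k indicators instead of k - 1 would make the indicator columns sum to the intercept column, so the coefficients could not be estimated uniquely.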

    Regression Results

    Results

    Surface Finish = 24.49 + 0.098 RPM - 13.31 Type B - 20.49 Type C - 26.04 Type D

    Tool A: Surface Finish = 24.49 + 0.098 RPM - 13.31(0) - 20.49(0) - 26.04(0) = 24.49 + 0.098 RPM

    Tool B: Surface Finish = 24.49 + 0.098 RPM - 13.31(1) - 20.49(0) - 26.04(0) = 11.18 + 0.098 RPM

    Tool C: Surface Finish = 24.49 + 0.098 RPM - 13.31(0) - 20.49(1) - 26.04(0) = 4.00 + 0.098 RPM

    Tool D: Surface Finish = 24.49 + 0.098 RPM - 13.31(0) - 20.49(0) - 26.04(1) = -1.55 + 0.098 RPM
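
The four tool-specific equations share one slope and differ only in intercept. The sketch below takes the tool coefficients as negative, which is what the per-tool intercepts (11.18, 4.00, -1.55) imply; the function name and the RPM value are illustrative.

```python
# Intercept shift for each tool relative to the baseline, tool type A.
TOOL_EFFECTS = {"A": 0.0, "B": -13.31, "C": -20.49, "D": -26.04}


def predict_surface_finish(rpm, tool):
    """Fitted surface finish for a given spindle speed and tool type."""
    return 24.49 + 0.098 * rpm + TOOL_EFFECTS[tool]


# Setting RPM to 0 recovers each tool's intercept.
intercepts = {tool: predict_surface_finish(0, tool) for tool in TOOL_EFFECTS}

# Hypothetical spindle speed of 200 RPM with tool type A.
finish_a_200 = predict_surface_finish(200, "A")
```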

    Nonlinear Models

    Interaction terms (X1*X2) or nonlinear variables (X2^2) do not make a model nonlinear; linear regression still applies because the model is linear in the parameters:

    Y = b0 + b1X1 + b2X2 + b3X1X2 + b4X3^2 + e

    However, if the model is nonlinear in the parameters (for example, Y = aX^b), then you must try to transform the model or use a nonlinear regression technique.

    Example: Energy Imports

    Delete explainable variation due to oil embargo

    Residual Plot

    Residual plot suggests nonlinearity

    Alternative Models

    Linear model (R² = 0.944):

    Total Imports = -938764498.6 + 481246.5728*Year

    2nd order polynomial: Y = b0 + b1X + b2X^2 + e (R² = 0.985)

    Total Imports = 3292116394 - 33822316.2*Year + 8687.66*Year^2

    Exponential Model

    Y = ae^(bX) (R² = 0.973)

    Transformation: Ln Y = Ln a + bX

    Model: Ln Y = -87.14 + 0.052*Year

    Original variables: Y = 1.43E-38 e^(0.052X)
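
The log transformation turns the exponential model into a straight line, so ordinary simple regression on (X, Ln Y) recovers Ln a and b. The sketch below first checks the back-transformation of the intercept reported above, then fits synthetic data; the function name and the synthetic data are illustrative, not the energy imports series.

```python
import math


def fit_exponential(xs, ys):
    """Fit Y = a * exp(b * X) by simple linear regression of ln(Y) on X."""
    log_ys = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_ly = sum(log_ys) / n
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    ln_a = mean_ly - b * mean_x
    return math.exp(ln_a), b


# Back-transform the reported intercept: a = e^(-87.14), about 1.43E-38.
a_reported = math.exp(-87.14)

# Synthetic data generated exactly from Y = 2 * exp(0.5 * X).
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = fit_exponential(xs, ys)  # approximately (2.0, 0.5)
```

Note that the R² of the transformed fit measures variation in Ln Y, so it is not directly comparable with the R² of the linear or polynomial models for Y.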