Regression and Multivariate Analysis


Slide 1: Least Squares Regression and Multiple Regression

Slide 2: Regression: A Simplified Example

X (predictor)   Y (criterion)
3               14
4               18
2               10
1               6
5               22
3               14
6               26

Let's find the best-fitting equation for predicting new, as yet unknown scores on Y from scores on X. The regression equation takes the form Y = a + bX + e, where Y is the dependent or criterion variable we're trying to predict; a is the intercept, or the point where the regression line crosses the Y axis; X is the independent or predictor variable; b is the weight by which we multiply the value of X (it is the slope of the regression line, and tells us how many units Y increases or decreases for every unit change in X); and e is an error term (basically an estimate of how much our prediction is off). a and b are often called regression coefficients. When Y is an estimated value it is usually symbolized as Ŷ (Y-hat).
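
None of the code in these notes comes from the original slides, but if you want to check the least-squares machinery outside SPSS, here is a minimal Python sketch (numpy's polyfit; the variable names are mine) fitting the table above:

```python
import numpy as np

# The seven (X, Y) pairs from the table above.
x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# polyfit with degree 1 returns the least-squares slope b and intercept a.
b, a = np.polyfit(x, y, 1)
print(f"Y = {a:.3f} + {b:.3f}X")  # Y = 2.000 + 4.000X for these data
```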

Slide 3: Finding the Regression Line with SPSS

First let's use a scatterplot to visualize the relationship between X and Y. The first thing we notice is that the points appear to form a straight line and that as X gets larger, Y gets larger, so it would appear that we have a strong, positive relationship between X and Y. Based on the way the points seem to fall, what do you think the value of Y would be for a person who obtained a score of 7 on X?

[Scatterplot of the seven (X, Y) points: X axis 0-7, Y axis 0-30]

Slide 4: Fitting a Line to the Scatterplot

Next let's fit a line to the scatterplot. Note that the points appear to be fit well by the straight line, and that the line crosses the Y axis (at the point called the intercept, or the constant a in our regression equation) at about the point Y = 2. So it's a good guess that our regression equation will be something like Y = 2 + some positive multiple of X, since the values of Y look to be about 4-5 times the size of X.

[Scatterplot with fitted straight line: X axis 0-7, Y axis 0-30; the line crosses the Y axis near 2]

Slide 5: The Least Squares Solution to Finding the Regression Equation

Mathematically, the regression equation is that combination of constant a and weights b on the predictors (the Xs) which minimizes the sum, across all subjects, of the squared differences between their predicted scores (the scores they would get if the regression equation were doing the predicting) and their obtained scores (their actual scores) on the criterion Y. That is, it minimizes the error sum of squares, or residuals. This is known as the least squares solution.

The correlation between the obtained scores on the criterion or dependent variable, Y, and the scores predicted by the regression equation is expressed in the correlation coefficient, r, or in the case of more than one independent variable, R.* Alternatively, R expresses the correlation between Y and the weighted combination of predictors. R ranges from zero to 1.

*SPSS uses R in the regression output even if there is only one predictor.
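
For reference (this closed form is standard, though not shown on the slide), the one-predictor least squares solution can be written as:

```latex
\min_{a,b} \sum_{i=1}^{n} \bigl( Y_i - (a + bX_i) \bigr)^2
\quad\Longrightarrow\quad
b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\bar{X}
```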

Slide 6: Using SPSS to Calculate the Regression Equation

Download the data file simpleregressionexample.sav and open it in SPSS.

In Data Editor, go to Analyze / Regression / Linear, move X into the Independent box (in regression, the independent variables are the predictor variables), move Y into the Dependent box, and click OK. The dependent variable, Y, is the one for which we are trying to find an equation that will predict new cases of Y given that we know X.

http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/simpleregressionexample.sav
Slide 7: Obtaining the Regression Equation from the SPSS Output

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B        Std. Error            Beta                        t   Sig.
(Constant)    2.000    .000                                              .   .
X             4.000    .000                  1.000                       .   .

a. Dependent Variable: Y

This table gives us the regression coefficients. Look in the column called Unstandardized Coefficients. There are two values provided. The first one, labeled the constant, is the intercept a, or the point at which the regression line crosses the Y axis. The second one, X, is the unstandardized regression weight, or the b from our regression equation. So this output tells us that the best-fitting equation for predicting Y from X is Y = 2 + (4)X. Let's check that out with a known value of X and Y. According to the equation, if X is 3, Y should be 2 + 4(3), or 14. How about when X = 5?

    X Y

    3 14

    4 18

    2 10

    1 6

    5 22

    3 14

    6 26

The constant representing the intercept is the value that the dependent variable would take when all the predictors are at a value of zero. In some treatments this is called B0 instead of a.

Slide 8: What is the Regression Equation when the Scores are in Standard (Z) Units?

When the scores on X and Y have been converted to Z scores, the intercept disappears (because the two sets of scores are expressed on the same scale) and the equation for predicting Y from X just becomes ZY = (Beta)ZX, where Beta is the standardized coefficient reported in your SPSS regression procedure output.

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B        Std. Error            Beta                        t   Sig.
(Constant)    2.000    .000                                              .   .
X             4.000    .000                  1.000                       .   .

a. Dependent Variable: Y

In the bivariate case, where there is only one X and one Y, the standardized beta weight will equal the correlation coefficient. Let's confirm this by seeing what would happen if we convert our raw scores to Z scores.

Slide 9: Regression Equation for Z Scores

In SPSS I have converted X and Y to two new variables, ZX and ZY, expressed in standard score units. You achieve this by going to Analyze / Descriptive Statistics / Descriptives (don't do this now), moving the variables you want to convert into the variables box, and selecting "Save standardized values as variables." This creates the new variables expressed as Z scores. Note that if you reran the linear regression analysis that we just did on the raw scores, in the output for the regression equation predicting the standard scores on Y the constant has dropped out and the equation is now of the form ZY = (Beta)ZX, where Beta is equal to 1. In this case the Z scores are identical on X and Y, although they certainly wouldn't always be.

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B        Std. Error            Beta                        t   Sig.
(Constant)    .000     .000                                              .   .
Zscore(X)     1.000    .000                  1.000                       .   .

a. Dependent Variable: Zscore(Y)

Correlations

                                 Zscore(Y)   Zscore(X)
Zscore(Y)  Pearson Correlation   1           1.000**
           Sig. (2-tailed)       .           .
           N                     7           7
Zscore(X)  Pearson Correlation   1.000**     1
           Sig. (2-tailed)       .           .
           N                     7           7

**. Correlation is significant at the 0.01 level (2-tailed).
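
As a check outside SPSS (not part of the slides), here is a small Python sketch of the same idea: it standardizes the toy data by hand and shows that the constant drops out and the slope equals r:

```python
import numpy as np

x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# Z scores; SPSS's "save standardized values" uses the sample SD (ddof=1).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

beta, constant = np.polyfit(zx, zy, 1)
r = np.corrcoef(x, y)[0, 1]
print(round(constant, 10))  # ~0: the intercept drops out
print(beta, r)              # both 1.0 for these perfectly linear data
```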

Slide 10: Meaning of Regression Weights

The regression weights or regression coefficients (the raw score bs and the standardized Betas) can be interpreted as expressing the unique contribution of a variable: you can say they represent the amount of change in Y that you can expect to occur per unit change in Xi, where Xi is the ith variable in the predictive equation, when statistical control has been achieved for all of the other variables in the equation.

Let's consider an example from the raw-score regression equation Y = 2 + (b)X, where the weight b is 4: Y = 2 + (4)X. In predicting Y, what the weight b means is that for every unit change in X, Y will increase by four units. Consider the data from this table and verify that this is the case. For example, if X = 1, Y = 6. Now make a unit change in X, so that X is 2, and Y becomes equal to 10. Make a further unit change, from 2 to 3, and Y becomes equal to 14. Make another, from 3 to 4, and Y becomes equal to 18. So each unit change in X increases Y by 4 (the value of the b weight). If the b weight were negative (e.g., Y = 2 - (4)X), the value of Y would decrease by four units for every unit increase in X.

    X Y

    3 14

    4 18

    2 10

    1 6

    5 22

    3 14

    6 26

Slide 11: Finding the Regression Equation for Some Real-World Data

Download the World95.sav data file and open it in SPSS Data Editor. We are going to find the regression equation for predicting the raw (unstandardized) scores on the dependent variable, Average Female Life Expectancy (Y), from Daily Calorie Intake (X). Another way to say this is that we are trying to find the regression of Y on X.

Go to Graphs / Chart Builder / OK. Under Choose From, select Scatter/Dot (top leftmost icon) and double click to move it into the preview window. Drag Daily Calorie Intake onto the X axis box, drag Average Female Life Expectancy onto the Y axis box, and click OK. In the Output Viewer, double click on the chart to bring up the Chart Editor; go to Elements and select Fit Line at Total, then select Linear and click Close.

http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/World95.sav
Slide 12: Scatterplot of Relationship between Female Life Expectancy and Daily Caloric Intake

From the scatterplot it would appear that there is a strong positive correlation between X and Y (as daily caloric intake increases, life expectancy increases), and X can be expected to be a good predictor of as-yet unknown cases of Y. Note, however, that there is a lot of scatter about the line, and we may need additional predictors to soak up some of the variance left over after this particular X has done its work. Also consider loess regression: in the loess method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points.
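
If you wanted to try a loess-type fit outside SPSS, statsmodels ships a lowess smoother. The sketch below uses made-up stand-in data, since reading the actual World95.sav file would need an extra library such as pyreadstat; the variable names and numbers are illustrative only:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical stand-ins for the World95 variables, for illustration only.
rng = np.random.default_rng(0)
calories = rng.uniform(1500, 3800, 75)
life_exp = 26 + 0.016 * calories + rng.normal(0, 7, 75)

# frac sets the neighborhood: the share of the data used for each local fit.
smoothed = lowess(life_exp, calories, frac=0.5)  # columns: sorted x, fitted y
print(smoothed[:5])
```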

Slide 13: Finding the Regression Equation

Go to Analyze / Regression / Linear.

Move the Average Female Life Expectancy variable into the Dependent box and the Daily Calorie Intake variable into the Independent box.

Under Options, make sure "Include constant in equation" is checked and click Continue.

Under Statistics, check Estimates, Confidence Intervals, and Model Fit. Click Continue and then OK.

Compare your output to the next slide.

Slide 14: Interpreting the SPSS Regression Output

From your output you can obtain the regression equation for predicting Average Female Life Expectancy from Daily Calorie Intake. The equation is Y = 25.904 + .016X + e, where e is the error term. Thus for a country where the average daily calorie intake is 3000 calories, the predicted average female life expectancy is about 25.904 + (.016)(3000), or 73.904 years. This is a raw score regression equation.

If the data were expressed in standard scores, the equation would be ZY = .775ZX + e, and .775 is also the correlation between X and Y. This is a standard score regression equation.

Coefficients(a)

                       Unstandardized Coefficients   Standardized
Model 1                B        Std. Error           Beta     t        Sig.   95% CI for B (Lower, Upper)
(Constant)             25.904   4.175                         6.204    .000   17.583, 34.225
Daily calorie intake   .016     .001                 .775     10.491   .000   .013, .019

a. Dependent Variable: Average female life expectancy

In this table, B for the constant is our a and B for Daily Calorie Intake is our b. These B weights are called unstandardized partial regression coefficients or weights. The Beta column holds the standardized partial regression coefficient, or beta weight. The significance test of the constant is of little use: it just says that the constant differs significantly from zero (e.g., that when X is zero, Y is not zero).

Slide 15: More Information from the SPSS Regression Output

There are some other questions we could ask about this regression:

(1) Is the regression equation a significant predictor of Y? (That is, is it good enough to reject the null hypothesis, which is more or less that the mean of Y is the best predictor of any given obtained Y?) To find this out we consult the ANOVA output which is provided and look for a significant value of F. In this case the regression equation is significant.

(2) How much of the variation in Y can be explained by the regression equation? To find this out we look for the value of R Square, which is .601 (that is, the regression sum of squares divided by the total sum of squares: 5792.910 / 9635.387 = .601).

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   5792.910         1    5792.910      110.055   .000(a)
Residual     3842.477         73   52.637
Total        9635.387         74

a. Predictors: (Constant), Daily calorie intake
b. Dependent Variable: Average female life expectancy

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .775(a)   .601       .596                7.255

a. Predictors: (Constant), Daily calorie intake

Residual SS is the sum of squared deviations of the known values of Y from the values of Y predicted by the equation. Regression SS is the sum of the squared deviations of the predicted values about their mean.

Slide 16: How Much Error Do We Have?

Just how good a job will our regression equation do in predicting new cases of Y? As it happens, the greater the departure of the obtained Y scores from the location that the regression equation predicted they should be, the larger the error.

If you created a distribution of all the errors of prediction (what are called the residuals, or the differences between observed and predicted scores for each case), the standard deviation of this distribution would be the standard error of estimate.

The standard error of estimate can be used to put confidence intervals or prediction intervals around predicted scores, to indicate the interval within which they might fall with a certain level of confidence (e.g., 95%).
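
One way to see where the 7.255 in the Model Summary comes from (this worked step is mine, using the residual sum of squares and degrees of freedom from the ANOVA table on the previous slide):

```latex
SEE = \sqrt{\frac{SS_{\mathrm{residual}}}{df_{\mathrm{residual}}}}
    = \sqrt{\frac{3842.477}{73}}
    \approx 7.255
```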

Slide 17: Confidence Intervals in Regression

Coefficients(a)

                       Unstandardized Coefficients   Standardized
Model 1                B        Std. Error           Beta     t        Sig.   95% CI for B (Lower, Upper)
(Constant)             25.904   4.175                         6.204    .000   17.583, 34.225
Daily calorie intake   .016     .001                 .775     10.491   .000   .013, .019

a. Dependent Variable: Average female life expectancy

Look at the columns headed 95% Confidence Interval for B. These columns put confidence intervals, based on the standard errors of the coefficients, around the regression coefficients a and b. Thus, for example, from the table above we can say with 95% confidence that the value of the constant a lies somewhere between 17.583 and 34.225, and the value of the (unstandardized) regression coefficient b lies somewhere between .013 and .019.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .775(a)   .601       .596                7.255

a. Predictors: (Constant), Daily calorie intake

Looking at the standardized coefficient, we can see that the estimate R (which is also the standardized version of b) is .775. Thus if ZX is the Z score corresponding to a particular calorie level, predicted life expectancy in standardized form is .775(ZX), and the standard error of estimate (7.255 years in raw score units) indicates the typical size of the prediction error around that prediction.

The SEE equals the SD of Y multiplied by the square root of the coefficient of nondetermination (1 - r²). It says what an error standard score of 1 is equal to in terms of Y units.

Slide 18: Multivariate Analysis

Multivariate analysis is a term applied to a related set of statistical techniques which seek to assess, and in some cases summarize or make more parsimonious, the relationships among a set of independent variables and a set of dependent variables.

Multivariate analyses seek to answer questions such as:

Is there a linear combination of personal and intellectual traits that will maximally discriminate between people who will successfully complete freshman year of college and people who drop out? What linear combination of characteristics of the tax return and the taxpayer best distinguishes between those whom it would and would not be worthwhile to audit? (Discriminant Analysis)

What are the underlying factors of a 94-item statistics test, and how can a more parsimonious measure of statistical knowledge be achieved? (Factor Analysis)

What are the effects of gender, ethnicity, and language spoken in the home, and their interaction, on a set of ten socio-economic status indicators? Even if none of these is significant by itself, will their linear combination yield significant effects? (MANOVA, Multiple Regression)

Slide 19: More Examples of Multivariate Analysis Questions

What are the underlying dimensions of judgment in a set of similarity and/or preference ratings of political candidates? (Multidimensional Scaling)

What is the incremental contribution of each of ten predictors of marital happiness? Should all of the variables be kept in the prediction equation? What is the maximum accuracy of prediction that can be achieved? (Stepwise Multiple Regression Analysis)

How do a set of univariate measures of nonverbal behavior combine to predict ratings of communicator attractiveness? (Multiple Regression)

What is the correlation between a set of measures assessing the attractiveness of a communicator and a second set of measures assessing the communicator's verbal skills? (Canonical Correlation)

Slide 20: An Example (Sort of) of Multivariate Analysis: Multiple Regression

A good place to start in learning about multivariate analysis is with multiple regression. Perhaps it is not, strictly speaking, a multivariate procedure, since although there are multiple independent variables there is only one dependent variable. Canonical correlation is perhaps a more classic multivariate procedure, with multiple dependent and independent variables.

Multiple regression is a relative of simple bivariate or zero-order correlation (two interval-level variables). In multiple regression, the investigator is concerned with predicting a dependent or criterion variable from two or more independent variables. The regression equation (raw score version) takes the form Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + e.

One motivation for doing this is to be able to predict the scores on cases for which measurements have not yet been obtained or might be difficult to obtain. The regression equation can be used to classify, rate, or rank new cases.

Slide 21: Coding Categorical Variables in Regression

In multiple regression, both the independent or predictor variables and the dependent or criterion variable are usually continuous (interval or ratio-level measurement), although sometimes there will be concocted or "dummy" independent variables which are categorical (e.g., men and women are assigned scores of one or two on a dummy gender variable; or, for more categories, K-1 dummy variables are used, where 1 means "has the property" and 0 means "doesn't have the property").

Consider the race variable from one of our data sets, which has three categories: White, African-American, and Other. To code this variable for multiple regression, you create two dummy variables, White and African-American. Each subject will get a score of either 1 or 0 on each of the two variables, as in the table below.

                               Caucasian   African-American
Subject 1 (Caucasian)          1           0
Subject 2 (African-American)   0           1
Subject 3 (Other)              0           0
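
Outside SPSS, this kind of dummy coding takes one line. Here is a small Python/pandas sketch (not from the slides; the column selection mirrors the table above, with Other left as the reference category):

```python
import pandas as pd

race = pd.Series(["White", "African-American", "Other"], name="race")

# K - 1 = 2 dummy variables; "Other" is the reference category scored 0, 0.
dummies = pd.get_dummies(race)[["White", "African-American"]].astype(int)
print(pd.concat([race, dummies], axis=1))
```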

Slide 22: Coding Categorical Variables in Regression, cont'd

                                   High Status   Medium Status
Subject 1 (High Status Attire)     1             0
Subject 2 (Medium Status Attire)   0             1
Subject 3 (Low Status Attire)      0             0

You can use this same type of procedure to code assignments to levels of a treatment in an experiment, and thus you can use a factor from an experiment, such as interviewer status, as a predictor variable in a regression. For example, if you had an experiment with three levels of interviewer attire, you would create one dummy variable for the high status attire condition and one for the medium status attire condition. As the table above shows, high status condition subjects would get 1, 0 and medium status condition subjects would get 0, 1 on the two variables, respectively, while people in the low status attire condition would get 0, 0 on both.

Slide 23: Regression and Prediction

Most regression analyses look for a linear relationship between predictors and criterion, although nonlinear trends can be explored through regression procedures as well.

In multiple regression we attempt to derive an equation which is the weighted sum of two or more variables. The equation tells you how much weight to place on each of the variables to arrive at the optimal predictive combination.

The equation that is arrived at is the best combination of predictors for the sample from which it was derived. But how well will it predict new cases? Sometimes the regression equation is tested against a new sample of cases to see how well it holds up. The first sample is used for the derivation study (to derive the equation) and a second sample is used for cross-validation. If the second sample was part of the original sample, reserved for just this cross-validation purpose, then it is called a hold-out sample.
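
The slides describe the derivation/cross-validation split in prose only; here is one way it might look in Python, on made-up data (all names and numbers below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Reserve a hold-out sample; derive the equation on the rest.
idx = rng.permutation(n)
derive, holdout = idx[:150], idx[150:]

# Least squares fit (constant plus two predictors) on the derivation sample.
A = np.column_stack([np.ones(len(derive)), X[derive]])
coef, *_ = np.linalg.lstsq(A, y[derive], rcond=None)

# Cross-validate: correlate predicted and observed Y in the hold-out sample.
pred = np.column_stack([np.ones(len(holdout)), X[holdout]]) @ coef
print(np.corrcoef(pred, y[holdout])[0, 1])  # typically a bit below the derivation R
```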

Slide 24: Simultaneous Multiple Regression Analysis

One of the most important notions in multiple regression analysis is the notion of statistical control, that is, mathematical operations to remove the effects of potentially confounding or "third" variables from the relationship between a predictor or IV and a criterion or DV. Terms you might hear which refer to this include:

Partialing
Controlling for
Residualizing
Holding constant

Slide 25: Meaning of Regression Weights

In multiple regression, when you have multiple predictors of the same dependent or criterion variable Y, the standardized regression coefficient Beta1 expresses the independent contribution of X1 to predicting Y when the effects of the other variables X2 through Xn are not a factor (have been statistically controlled for), and similarly for weights Beta2 through Betan.

These regression weights or coefficients can be tested for statistical significance, and it will be possible to state with 95% (or 99%) confidence that the magnitude of the coefficient differs from zero, and thus that that particular predictor makes a contribution to predicting the criterion or dependent variable, Y, that is unrelated to the contribution of any of the other predictors.

Slide 26: Tests of the Predictors

The magnitude of the raw score weights (usually symbolized by b1, b2, etc.) cannot be directly compared, since they are (usually) associated with variables with different units of measurement.

It is common practice to compare the standardized regression weights (Beta1, Beta2, etc.) and make claims about the relative importance of the unique contribution of each predictor variable to predicting the criterion. It is also possible to do tests for the significance of the difference between two predictors: is one a significantly better predictor than the other?

These coefficients vary from sample to sample, so it's not prudent to generalize too much about the relative ability of two predictors to predict. It's also the case that in the context of the regression equation the variable which is a good predictor is not the original variable, but rather a residualized version for which the effects of all the other variables have been held constant. So the magnitude of its contribution is relative to the other variables, and only holds for this particular combination of variables included in the predictive equation.

Slide 27: How Do We Find the Regression Weights (Beta Weights)?

Although this is not how SPSS would calculate them, we can get the Beta weights from the zero-order (pairwise) correlations between Y and the various predictor variables X1, X2, etc., and the intercorrelations among the latter.

Suppose we want to find the beta weights for an equation Y = Beta1X1 + Beta2X2. We need three correlations: the correlation between Y and X1, the correlation between Y and X2, and the correlation between X1 and X2.

Slide 28: How Do We Find the Regression Weights (Beta Weights)?, cont'd

Let's suppose we have the following data: r for Y and X1 = .776; r for Y and X2 = .869; and r for X1 and X2 = .682.

The formula for the standardized partial regression weight for X1, with the effects of X2 removed, is*

Beta(YX1.X2) = (r(YX1) - r(YX2) r(X1X2)) / (1 - r²(X1X2))

Substituting the correlations we already have into the formula, we find that the beta weight for the predictive effect of variable X1 on Y is equal to (.776 - (.869)(.682)) / (1 - .682²) = .342. To compute the second weight, Beta(YX2.X1), we just switch the first and second terms in the numerator. Now let's see that in the context of an SPSS-calculated multiple regression.

*Read this as "the Beta weight for the regression of Y on X1 when the effects of X2 have been removed."
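
Here is the same hand calculation done in Python (not an SPSS feature, just a check of the formula above):

```python
r_yx1, r_yx2, r_x1x2 = 0.776, 0.869, 0.682  # correlations from the slide

beta1 = (r_yx1 - r_yx2 * r_x1x2) / (1 - r_x1x2 ** 2)
beta2 = (r_yx2 - r_yx1 * r_x1x2) / (1 - r_x1x2 ** 2)
print(round(beta1, 3), round(beta2, 3))  # 0.343 0.635; SPSS reports .342/.636
                                         # from the unrounded correlations
```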

Slide 29: Multiple Regression Using SPSS

Suppose we think that the ability of Daily Calorie Intake to predict Female Life Expectancy is not adequate, and we would like to achieve a more accurate prediction. One way to do this is to add additional variables to the equation and conduct a multiple regression analysis.

Suppose we have a suspicion that literacy rate might also be a good predictor, not only as a general measure of the state of the country's development but also as an indicator of the likelihood that individuals will have the wherewithal to access health and medical information. We have no particular reason to assume that literacy rate and calorie consumption are correlated, so we will assume for the moment that they will have a separate and additive effect on female life expectancy.

Let's add literacy rate (People who Read %) as a second predictor (X2), so now the equation we are looking for is Y = a + b1X1 + b2X2, where Y = Female Life Expectancy, Daily Calorie Intake is X1, and Literacy Rate is X2.

Slide 30: Multiple Regression Using SPSS: Steps to Set Up the Analysis

Download the World95.sav data file and open it in SPSS Data Editor.

In Data Editor go to Analyze / Regression / Linear and click Reset.

Put Average Female Life Expectancy into the Dependent box.

Put Daily Calorie Intake and People who Read % into the Independents box.

Under Statistics, select Estimates, Confidence Intervals, Model Fit, Descriptives, Part and Partial Correlations, R Square Change, and Collinearity Diagnostics, and click Continue.

Under Options, check Include Constant in the Equation, click Continue and then OK.

Compare your output to the next several slides.

http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/World95.sav
Slide 31: Interpreting Your SPSS Multiple Regression Output

Correlations (Pearson; Sig. 1-tailed = .000 for all pairs; N = 74 throughout)

                                 Average female    Daily calorie   People who
                                 life expectancy   intake          read (%)
Average female life expectancy   1.000             .776            .869
Daily calorie intake             .776              1.000           .682
People who read (%)              .869              .682            1.000

First let's look at the zero-order (pairwise) correlations between Average Female Life Expectancy (Y), Daily Calorie Intake (X1), and People who Read (X2). Note that these are rYX1 = .776 for Y with X1, rYX2 = .869 for Y with X2, and rX1X2 = .682 for X1 with X2.

Slide 32: Examining the Regression Weights

Below are the raw (unstandardized) and standardized regression weights for the regression of female life expectancy on daily calorie intake and percentage of people who read. Consistent with our hand calculation, the standardized regression coefficient (beta weight) for daily caloric intake is .342. The beta weight for percentage of people who read is much larger, .636. What this weight means is that for every increase of one standard deviation on the people who read variable, Y (female life expectancy) is predicted to increase by .636 standard deviations. Note that both beta coefficients are significant at p < .001.

Coefficients(a)

                       Unstandardized          Standardized
Model 1                B        Std. Error     Beta    t       Sig.   95% CI for B (Lower, Upper)   Zero-order   Partial   Part   Tolerance   VIF
(Constant)             25.838   2.882                  8.964   .000   20.090, 31.585
People who read (%)    .315     .034           .636    9.202   .000   .247, .383                    .869         .738      .465   .535        1.868
Daily calorie intake   .007     .001           .342    4.949   .000   .004, .010                    .776         .506      .250   .535        1.868

a. Dependent Variable: Average female life expectancy

Slide 33: R, R Square, and the SEE

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .905(a)   .818       .813                4.948                        .818              159.922    2     71    .000

a. Predictors: (Constant), People who read (%), Daily calorie intake

Above is the model summary, which has some important statistics. It gives us R and R Square for the regression of Y (female life expectancy) on the two predictors. R is .905, which is a very high correlation. R Square tells us what proportion of the variation in female life expectancy is explained by the two predictors: a very high .818. It also gives us the standard error of estimate, which we can use to put confidence intervals around the unstandardized regression coefficients.
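
A useful identity lurking here (my addition, not on the slide): with standardized variables, the beta weights solve the system of predictor intercorrelations, and R Square is the beta-weighted sum of the predictor-criterion correlations. Checking this in Python against the correlations reported earlier reproduces the .818:

```python
import numpy as np

r_yx = np.array([0.776, 0.869])   # Y with calories, Y with literacy
R_xx = np.array([[1.0, 0.682],    # predictor intercorrelation matrix
                 [0.682, 1.0]])

beta = np.linalg.solve(R_xx, r_yx)  # standardized weights
r_square = float(beta @ r_yx)       # R^2 = sum of beta_i * r_YXi
print(beta.round(3), round(r_square, 3))  # [0.343 0.635] 0.818
```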

Slide 34: F Test for the Significance of the Regression Equation

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   7829.451         2    3914.726      159.922   .000(a)
Residual     1738.008         71   24.479
Total        9567.459         73

a. Predictors: (Constant), People who read (%), Daily calorie intake
b. Dependent Variable: Average female life expectancy

Next we look at the F test of the significance of the regression equation, ZY = .342 ZX1 + .636 ZX2 in standardized form. Is this so much better a predictor of female life expectancy (Y) than simply using the mean of Y that the difference is statistically significant? The F test is a ratio of the mean square for the regression equation to the mean square for the residual (the departures of the actual scores on Y from what the regression equation predicted): here F = 3914.726 / 24.479 ≈ 159.92. This is a very large value of F, significant at p < .001.

Slide 35: Confidence Intervals around the Regression Weights

Coefficients(a)

                       Unstandardized          Standardized
Model 1                B        Std. Error     Beta    t       Sig.   95% CI for B (Lower, Upper)   Zero-order   Partial   Part
(Constant)             25.838   2.882                  8.964   .000   20.090, 31.585
Daily calorie intake   .007     .001           .342    4.949   .000   .004, .010                    .776         .506      .250
People who read (%)    .315     .034           .636    9.202   .000   .247, .383                    .869         .738      .465

a. Dependent Variable: Average female life expectancy

Finally, your output provides confidence intervals around the unstandardized regression coefficients. Thus we can say with 95% confidence that the unstandardized weight to apply to daily calorie intake to predict female life expectancy lies between .004 and .010, and that the unstandardized weight to apply to percentage of people who read lies between .247 and .383.

Slide 36: Multicollinearity

One of the requirements for a mathematical solution to the multiple regression problem is that the predictors or independent variables not be highly correlated.

If in fact two predictors are perfectly correlated, the analysis cannot be completed.

Multicollinearity (the case in which two or more of the predictors are too highly correlated) also leads to unstable partial regression coefficients which won't hold up when applied to a new sample of cases.

Further, if predictors are too highly correlated with each other, their shared variance with the dependent or criterion variable may be redundant, and it's hard to tell just using statistical procedures which variable is producing the effect.

Moreover, the regression weights for the predictors would look much like their zero-order correlations with Y if the predictors were mutually independent; if the predictors are highly correlated, this may produce regression weights that don't really reflect the independent contribution to prediction of each of the predictors.

Slide 37: Multicollinearity, cont'd

As a rule of thumb, bivariate zero-order correlations between predictors should not exceed .80. This is easy to check: run a complete analysis of all possible pairs of predictors using the correlation procedure.

Also, no predictor should be totally accounted for by a combination of the other predictors. Look at tolerance levels. Tolerance for a predictor variable is equal to 1 - R² for an equation where that predictor is regressed on all of the other predictors. If the predictor is highly correlated with (explained by) the combination of the other predictors, it will have a low tolerance, approaching zero, because the R² will be large. So, tolerance near zero = BAD, tolerance near 1 = GOOD, in terms of the independence of a predictor.

The best prediction occurs when the predictors are moderately independent of each other, but each is highly correlated with the dependent (criterion) variable Y. Some interpretive problems resulting from multicollinearity can be resolved using path analysis (see Chapter 3 in Grimm and Yarnold). A quick tolerance check for our two-predictor example is sketched below.
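
With only two predictors, the tolerance arithmetic is simple enough to do by hand; this little Python check (mine, not from the slides) reproduces the tolerance value discussed on the next slide:

```python
r_x1x2 = 0.682  # correlation between the two predictors

tolerance = 1 - r_x1x2 ** 2  # with two predictors, R^2 is just r^2
vif = 1 / tolerance
print(round(tolerance, 3), round(vif, 2))  # 0.535 1.87 (SPSS shows 1.868
                                           # from unrounded correlations)
```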

Slide 38: Multicollinearity Issues in our Current SPSS Problem

Correlations (Pearson; Sig. 1-tailed = .000 for all pairs; N = 74 throughout)

                                 Average female    Daily calorie   People who
                                 life expectancy   intake          read (%)
Average female life expectancy   1.000             .776            .869
Daily calorie intake             .776              1.000           .682
People who read (%)              .869              .682            1.000

From our SPSS output we note that the correlation between our two predictors, Daily Calorie Intake (X1) and People who Read (X2), is rX1X2 = .682. This is a pretty high correlation for two predictors to be interpreted independently: it means each explains about half the variation in the other (.682² = .465). If you look at the zero-order correlation of our Y variable, average female life expectancy, with % people who read, you note that the correlation is quite high, rYX2 = .869. However, the value of R for the two-variable combination was .905, which is an improvement.

Slide 39: Multicollinearity Issues in our Current SPSS Problem, cont'd

The tolerance and VIF values discussed here are excerpted from the more complete table on Slide 32. Look at the tolerance value. Recall that tolerance near zero means very high multicollinearity (high intercorrelation among the predictors, which is bad). Tolerance is .535 for both variables (since there are only two, the value is the same for either one predicting the other).

VIF (variance inflation factor) is a completely redundant statistic with tolerance (it is 1/tolerance). The higher it is, the greater the multicollinearity. When there is no multicollinearity the value of VIF equals 1. Multicollinearity problems have to be dealt with (by getting rid of redundant predictor variables or other means) if VIF approaches 10, which means that only about 10% of the variance in the predictor in question is not explained by the combination of the other predictors.

In the case of our two predictors, there is some indication of multicollinearity, but not enough to throw out one of the variables.

Slide 40: Specification Errors

One type of specification error occurs when the relationship among the variables you are looking at is not linear (e.g., you know that Y peaks at high and low levels of one or more predictors, a curvilinear relationship) but you are using linear regression anyhow. There are options for nonlinear regression available that should be used in such a case.

Another type of specification error occurs when you have either underspecified or overspecified the model by (a) failing to include all relevant predictors (for example, including weight but not height in an equation for predicting obesity), or (b) including predictors which are not relevant. Most irrelevant predictors will not even show up in the final regression equation unless you insist on it, but they can affect the results if they are correlated with at least some of the other predictors.

For proper specification, nothing beats a good theory (as opposed to launching a fishing expedition).

Slide 41: Types of Multiple Regression Analysis

So far we have looked at a standard or simultaneous multiple regression analysis, where all of the predictor variables were entered at the same time, that is, considered in combination with each other simultaneously. But there are other types of multiple regression analyses which can yield some interesting results.

Hierarchical regression analysis refers to the method of regression in which not all of the variables are entered simultaneously, but rather one at a time or a few at a time; at each step the correlation of Y, the criterion variable, with the current set of predictors is calculated and evaluated. At each stage the R Square change that is calculated shows the incremental change in variance accounted for in Y with the addition of the most recently entered predictor, and that is exclusively associated with that predictor.

Tests can be done to determine the significance of the change in R Square at each step, to see if each newly added predictor makes a significant improvement in the predictive power of the regression equation.

The order in which variables are entered makes a difference to the outcome. The researcher determines the order on theoretical grounds (the exception is stepwise analysis).

Slide 42: Stepwise Multiple Regression

Stepwise multiple regression is a variant of hierarchical regression where the order of entry is determined not by the researcher but on empirical criteria.

In the forward inclusion version of stepwise regression, the order of entry is determined at each step by calculating which variable will produce the greatest increase in R Square (the amount of variance in the dependent variable Y accounted for) at that step.

In the backward elimination version of stepwise multiple regression, the analysis starts off with all of the predictors at the first step and then eliminates them, so that each successive step has fewer predictors in the equation. Elimination is based on an empirical criterion that is the reverse of that for forward inclusion: the variable that produces the smallest decline in R Square is removed at each step.

Slide 43: Reducing the Overall Level of Type I Error

One of the problems with doing multiple regression is that there are a lot of significance tests being conducted simultaneously, but for all practical purposes each test is treated as an independent one even though the data are related. When a large number of tests are done, theoretically the likelihood of Type I error increases (rejecting the null hypothesis when it is in fact true).

This is particularly problematic in stepwise regression, with its iterative process of assessing the significance of R Square over and over again, not to speak of the significance of individual regression coefficients.

Therefore it is desirable to do something to reduce the increased chance of making Type I errors (finding significant results that aren't there), such as keeping the number of predictors to a minimum (to reduce the number of times you go to the normal table to obtain a significance level), dividing the usual required alpha level by the number of tests, or keeping the intercorrelation of the predictors as low as possible (avoiding use of redundant predictors, which would cause you to basically test the significance of the same relationship to Y over and over).

Slide 44: Reducing the Overall Level of Type I Error, cont'd

This may be of particular importance when the researcher is testing a theory which has a network of interlocking claims, such that the invalidation of one of them brings the whole thing tumbling down. An issue of HCR (July 2003) devoted several papers to exploring this question.

As mentioned in class before, the Bonferroni procedure is sometimes used, but it's hard to swallow, as you have to divide the usual alpha level of .05 by the number of tests you expect to perform; so if you are conducting thirty tests, you have to set your alpha level at .05/30, or .0017, for each test. With stepwise regression it's not clear in advance how many tests you will have to perform, although you can estimate it by the number of predictor variables you intend to start off with.
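
For what it's worth, the Bonferroni arithmetic described above is a one-liner (illustrative only):

```python
alpha, n_tests = 0.05, 30
print(round(alpha / n_tests, 4))  # 0.0017 per test
```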