Regression and Multivariate Analysis


Slide 1: Least Squares Regression and Multiple Regression

Slide 2: Regression: A Simplified Example

X (predictor)   Y (criterion)
3               14
4               18
2               10
1               6
5               22
3               14
6               26

Let's find the best-fitting equation for predicting new, as yet unknown scores on Y from scores on X. The regression equation takes the form Y = a + bX + e, where Y is the dependent or criterion variable we're trying to predict; a is the intercept, or the point where the regression line crosses the Y axis; X is the independent or predictor variable; b is the weight by which we multiply the value of X (it is the slope of the regression line, and tells us how many units Y increases or decreases for every unit change in X); and e is an error term (basically an estimate of how much our prediction is off). a and b are often called regression coefficients. When Y is an estimated value it is usually symbolized as Ŷ (Y-hat).
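
None of the code in these notes comes from the original slides, but if you want to check the least-squares machinery outside SPSS, here is a minimal Python sketch (numpy's polyfit; the variable names are mine) fitting the table above:

```python
import numpy as np

# The seven (X, Y) pairs from the table above.
x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# polyfit with degree 1 returns the least-squares slope b and intercept a.
b, a = np.polyfit(x, y, 1)
print(f"Y = {a:.3f} + {b:.3f}X")  # Y = 2.000 + 4.000X for these data
```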

Slide 3: Finding the Regression Line with SPSS

First let's use a scatterplot to visualize the relationship between X and Y. The first thing we notice is that the points appear to form a straight line and that as X gets larger, Y gets larger, so it would appear that we have a strong, positive relationship between X and Y. Based on the way the points seem to fall, what do you think the value of Y would be for a person who obtained a score of 7 on X?

[Scatterplot of the seven (X, Y) points: X axis 0-7, Y axis 0-30]

Slide 4: Fitting a Line to the Scatterplot

Next let's fit a line to the scatterplot. Note that the points appear to be fit well by the straight line, and that the line crosses the Y axis (at the point called the intercept, or the constant a in our regression equation) at about the point Y = 2. So it's a good guess that our regression equation will be something like Y = 2 + some positive multiple of X, since the values of Y look to be about 4-5 times the size of X.

[Scatterplot with fitted straight line: X axis 0-7, Y axis 0-30; the line crosses the Y axis near 2]

Slide 5: The Least Squares Solution to Finding the Regression Equation

Mathematically, the regression equation is that combination of constant a and weights b on the predictors (the Xs) which minimizes the sum, across all subjects, of the squared differences between their predicted scores (the scores they would get if the regression equation were doing the predicting) and their obtained scores (their actual scores) on the criterion Y. That is, it minimizes the error sum of squares, or residuals. This is known as the least squares solution.

The correlation between the obtained scores on the criterion or dependent variable, Y, and the scores predicted by the regression equation is expressed in the correlation coefficient, r, or in the case of more than one independent variable, R.* Alternatively, R expresses the correlation between Y and the weighted combination of predictors. R ranges from zero to 1.

*SPSS uses R in the regression output even if there is only one predictor.
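
For reference (this closed form is standard, though not shown on the slide), the one-predictor least squares solution can be written as:

```latex
\min_{a,b} \sum_{i=1}^{n} \bigl( Y_i - (a + bX_i) \bigr)^2
\quad\Longrightarrow\quad
b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\bar{X}
```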

Slide 6: Using SPSS to Calculate the Regression Equation

Download the data file simpleregressionexample.sav and open it in SPSS.

In Data Editor, go to Analyze / Regression / Linear, move X into the Independent box (in regression, the independent variables are the predictor variables), move Y into the Dependent box, and click OK. The dependent variable, Y, is the one for which we are trying to find an equation that will predict new cases of Y given that we know X.

http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/simpleregressionexample.sav
Slide 7: Obtaining the Regression Equation from the SPSS Output

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B        Std. Error            Beta                        t   Sig.
(Constant)    2.000    .000                                              .   .
X             4.000    .000                  1.000                       .   .

a. Dependent Variable: Y

This table gives us the regression coefficients. Look in the column called Unstandardized Coefficients. There are two values provided. The first one, labeled the constant, is the intercept a, or the point at which the regression line crosses the Y axis. The second one, X, is the unstandardized regression weight, or the b from our regression equation. So this output tells us that the best-fitting equation for predicting Y from X is Y = 2 + (4)X. Let's check that out with a known value of X and Y. According to the equation, if X is 3, Y should be 2 + 4(3), or 14. How about when X = 5?

    X Y

    3 14

    4 18

    2 10

    1 6

    5 22

    3 14

    6 26

The constant representing the intercept is the value that the dependent variable would take when all the predictors are at a value of zero. In some treatments this is called B0 instead of a.

Slide 8: What is the Regression Equation when the Scores are in Standard (Z) Units?

When the scores on X and Y have been converted to Z scores, the intercept disappears (because the two sets of scores are expressed on the same scale) and the equation for predicting Y from X just becomes ZY = (Beta)ZX, where Beta is the standardized coefficient reported in your SPSS regression procedure output.

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B        Std. Error            Beta                        t   Sig.
(Constant)    2.000    .000                                              .   .
X             4.000    .000                  1.000                       .   .

a. Dependent Variable: Y

In the bivariate case, where there is only one X and one Y, the standardized beta weight will equal the correlation coefficient. Let's confirm this by seeing what would happen if we convert our raw scores to Z scores.

Slide 9: Regression Equation for Z Scores

In SPSS I have converted X and Y to two new variables, ZX and ZY, expressed in standard score units. You achieve this by going to Analyze / Descriptive Statistics / Descriptives (don't do this now), moving the variables you want to convert into the variables box, and selecting "Save standardized values as variables." This creates the new variables expressed as Z scores. Note that if you reran the linear regression analysis that we just did on the raw scores, in the output for the regression equation predicting the standard scores on Y the constant has dropped out and the equation is now of the form ZY = (Beta)ZX, where Beta is equal to 1. In this case the Z scores are identical on X and Y, although they certainly wouldn't always be.

Coefficients(a)

              Unstandardized Coefficients    Standardized Coefficients
Model 1       B        Std. Error            Beta                        t   Sig.
(Constant)    .000     .000                                              .   .
Zscore(X)     1.000    .000                  1.000                       .   .

a. Dependent Variable: Zscore(Y)

Correlations

                                 Zscore(Y)   Zscore(X)
Zscore(Y)  Pearson Correlation   1           1.000**
           Sig. (2-tailed)       .           .
           N                     7           7
Zscore(X)  Pearson Correlation   1.000**     1
           Sig. (2-tailed)       .           .
           N                     7           7

**. Correlation is significant at the 0.01 level (2-tailed).
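
As a check outside SPSS (not part of the slides), here is a small Python sketch of the same idea: it standardizes the toy data by hand and shows that the constant drops out and the slope equals r:

```python
import numpy as np

x = np.array([3, 4, 2, 1, 5, 3, 6], dtype=float)
y = np.array([14, 18, 10, 6, 22, 14, 26], dtype=float)

# Z scores; SPSS's "save standardized values" uses the sample SD (ddof=1).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

beta, constant = np.polyfit(zx, zy, 1)
r = np.corrcoef(x, y)[0, 1]
print(round(constant, 10))  # ~0: the intercept drops out
print(beta, r)              # both 1.0 for these perfectly linear data
```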

Slide 10: Meaning of Regression Weights

The regression weights or regression coefficients (the raw score bs and the standardized Betas) can be interpreted as expressing the unique contribution of a variable: you can say they represent the amount of change in Y that you can expect to occur per unit change in Xi, where Xi is the ith variable in the predictive equation, when statistical control has been achieved for all of the other variables in the equation.

Let's consider an example from the raw-score regression equation Y = 2 + (b)X, where the weight b is 4: Y = 2 + (4)X. In predicting Y, what the weight b means is that for every unit change in X, Y will increase by four units. Consider the data from this table and verify that this is the case. For example, if X = 1, Y = 6. Now make a unit change in X, so that X is 2, and Y becomes equal to 10. Make a further unit change, from 2 to 3, and Y becomes equal to 14. Make another, from 3 to 4, and Y becomes equal to 18. So each unit change in X increases Y by 4 (the value of the b weight). If the b weight were negative (e.g., Y = 2 - (4)X), the value of Y would decrease by four units for every unit increase in X.

    X Y

    3 14

    4 18

    2 10

    1 6

    5 22

    3 14

    6 26

Slide 11: Finding the Regression Equation for Some Real-World Data

Download the World95.sav data file and open it in SPSS Data Editor. We are going to find the regression equation for predicting the raw (unstandardized) scores on the dependent variable, Average Female Life Expectancy (Y), from Daily Calorie Intake (X). Another way to say this is that we are trying to find the regression of Y on X.

Go to Graphs / Chart Builder / OK. Under Choose From, select Scatter/Dot (top leftmost icon) and double click to move it into the preview window. Drag Daily Calorie Intake onto the X axis box, drag Average Female Life Expectancy onto the Y axis box, and click OK. In the Output Viewer, double click on the chart to bring up the Chart Editor; go to Elements and select Fit Line at Total, then select Linear and click Close.

http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/World95.sav
Slide 12: Scatterplot of Relationship between Female Life Expectancy and Daily Caloric Intake

From the scatterplot it would appear that there is a strong positive correlation between X and Y (as daily caloric intake increases, life expectancy increases), and X can be expected to be a good predictor of as-yet unknown cases of Y. Note, however, that there is a lot of scatter about the line, and we may need additional predictors to soak up some of the variance left over after this particular X has done its work. Also consider loess regression: in the loess method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points.
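
If you wanted to try a loess-type fit outside SPSS, statsmodels ships a lowess smoother. The sketch below uses made-up stand-in data, since reading the actual World95.sav file would need an extra library such as pyreadstat; the variable names and numbers are illustrative only:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical stand-ins for the World95 variables, for illustration only.
rng = np.random.default_rng(0)
calories = rng.uniform(1500, 3800, 75)
life_exp = 26 + 0.016 * calories + rng.normal(0, 7, 75)

# frac sets the neighborhood: the share of the data used for each local fit.
smoothed = lowess(life_exp, calories, frac=0.5)  # columns: sorted x, fitted y
print(smoothed[:5])
```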

Slide 13: Finding the Regression Equation

Go to Analyze / Regression / Linear.

Move the Average Female Life Expectancy variable into the Dependent box and the Daily Calorie Intake variable into the Independent box.

Under Options, make sure "Include constant in equation" is checked and click Continue.

Under Statistics, check Estimates, Confidence Intervals, and Model Fit. Click Continue and then OK.

Compare your output to the next slide.

Slide 14: Interpreting the SPSS Regression Output

From your output you can obtain the regression equation for predicting Average Female Life Expectancy from Daily Calorie Intake. The equation is Y = 25.904 + .016X + e, where e is the error term. Thus for a country where the average daily calorie intake is 3000 calories, the predicted average female life expectancy is about 25.904 + (.016)(3000), or 73.904 years. This is a raw score regression equation.

If the data were expressed in standard scores, the equation would be ZY = .775ZX + e, and .775 is also the correlation between X and Y. This is a standard score regression equation.

Coefficients(a)

                       Unstandardized Coefficients   Standardized
Model 1                B        Std. Error           Beta     t        Sig.   95% CI for B (Lower, Upper)
(Constant)             25.904   4.175                         6.204    .000   17.583, 34.225
Daily calorie intake   .016     .001                 .775     10.491   .000   .013, .019

a. Dependent Variable: Average female life expectancy

In this table, B for the constant is our a and B for Daily Calorie Intake is our b. These B weights are called unstandardized partial regression coefficients or weights. The Beta column holds the standardized partial regression coefficient, or beta weight. The significance test of the constant is of little use: it just says that the constant differs significantly from zero (e.g., that when X is zero, Y is not zero).

Slide 15: More Information from the SPSS Regression Output

There are some other questions we could ask about this regression:

(1) Is the regression equation a significant predictor of Y? (That is, is it good enough to reject the null hypothesis, which is more or less that the mean of Y is the best predictor of any given obtained Y?) To find this out we consult the ANOVA output which is provided and look for a significant value of F. In this case the regression equation is significant.

(2) How much of the variation in Y can be explained by the regression equation? To find this out we look for the value of R Square, which is .601 (that is, the regression sum of squares divided by the total sum of squares: 5792.910 / 9635.387 = .601).

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   5792.910         1    5792.910      110.055   .000(a)
Residual     3842.477         73   52.637
Total        9635.387         74

a. Predictors: (Constant), Daily calorie intake
b. Dependent Variable: Average female life expectancy

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .775(a)   .601       .596                7.255

a. Predictors: (Constant), Daily calorie intake

Residual SS is the sum of squared deviations of the known values of Y from the values of Y predicted by the equation. Regression SS is the sum of the squared deviations of the predicted values about their mean.

Slide 16: How Much Error Do We Have?

Just how good a job will our regression equation do in predicting new cases of Y? As it happens, the greater the departure of the obtained Y scores from the location that the regression equation predicted they should be, the larger the error.

If you created a distribution of all the errors of prediction (what are called the residuals, or the differences between observed and predicted scores for each case), the standard deviation of this distribution would be the standard error of estimate.

The standard error of estimate can be used to put confidence intervals or prediction intervals around predicted scores, to indicate the interval within which they might fall with a certain level of confidence (e.g., 95%).
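
One way to see where the 7.255 in the Model Summary comes from (this worked step is mine, using the residual sum of squares and degrees of freedom from the ANOVA table on the previous slide):

```latex
SEE = \sqrt{\frac{SS_{\mathrm{residual}}}{df_{\mathrm{residual}}}}
    = \sqrt{\frac{3842.477}{73}}
    \approx 7.255
```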

Slide 17: Confidence Intervals in Regression

Coefficients(a)

                       Unstandardized Coefficients   Standardized
Model 1                B        Std. Error           Beta     t        Sig.   95% CI for B (Lower, Upper)
(Constant)             25.904   4.175                         6.204    .000   17.583, 34.225
Daily calorie intake   .016     .001                 .775     10.491   .000   .013, .019

a. Dependent Variable: Average female life expectancy

Look at the columns headed 95% Confidence Interval for B. These columns put confidence intervals, based on the standard errors of the coefficients, around the regression coefficients a and b. Thus, for example, from the table above we can say with 95% confidence that the value of the constant a lies somewhere between 17.583 and 34.225, and the value of the (unstandardized) regression coefficient b lies somewhere between .013 and .019.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .775(a)   .601       .596                7.255

a. Predictors: (Constant), Daily calorie intake

Looking at the standardized coefficient, we can see that the estimate R (which is also the standardized version of b) is .775. Thus if ZX is the Z score corresponding to a particular calorie level, predicted life expectancy in standardized form is .775(ZX), and the standard error of estimate (7.255 years in raw score units) indicates the typical size of the prediction error around that prediction.

The SEE equals the SD of Y multiplied by the square root of the coefficient of nondetermination (1 - r²). It says what an error standard score of 1 is equal to in terms of Y units.

Slide 18: Multivariate Analysis

Multivariate analysis is a term applied to a related set of statistical techniques which seek to assess, and in some cases summarize or make more parsimonious, the relationships among a set of independent variables and a set of dependent variables.

Multivariate analyses seek to answer questions such as:

Is there a linear combination of personal and intellectual traits that will maximally discriminate between people who will successfully complete freshman year of college and people who drop out? What linear combination of characteristics of the tax return and the taxpayer best distinguishes between those whom it would and would not be worthwhile to audit? (Discriminant Analysis)

What are the underlying factors of a 94-item statistics test, and how can a more parsimonious measure of statistical knowledge be achieved? (Factor Analysis)

What are the effects of gender, ethnicity, and language spoken in the home, and their interaction, on a set of ten socio-economic status indicators? Even if none of these is significant by itself, will their linear combination yield significant effects? (MANOVA, Multiple Regression)

Slide 19: More Examples of Multivariate Analysis Questions

What are the underlying dimensions of judgment in a set of similarity and/or preference ratings of political candidates? (Multidimensional Scaling)

What is the incremental contribution of each of ten predictors of marital happiness? Should all of the variables be kept in the prediction equation? What is the maximum accuracy of prediction that can be achieved? (Stepwise Multiple Regression Analysis)

How do a set of univariate measures of nonverbal behavior combine to predict ratings of communicator attractiveness? (Multiple Regression)

What is the correlation between a set of measures assessing the attractiveness of a communicator and a second set of measures assessing the communicator's verbal skills? (Canonical Correlation)

Slide 20: An Example (Sort of) of Multivariate Analysis: Multiple Regression

A good place to start in learning about multivariate analysis is with multiple regression. Perhaps it is not, strictly speaking, a multivariate procedure, since although there are multiple independent variables there is only one dependent variable. Canonical correlation is perhaps a more classic multivariate procedure, with multiple dependent and independent variables.

Multiple regression is a relative of simple bivariate or zero-order correlation (two interval-level variables). In multiple regression, the investigator is concerned with predicting a dependent or criterion variable from two or more independent variables. The regression equation (raw score version) takes the form Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + e.

One motivation for doing this is to be able to predict the scores on cases for which measurements have not yet been obtained or might be difficult to obtain. The regression equation can be used to classify, rate, or rank new cases.

Slide 21: Coding Categorical Variables in Regression

In multiple regression, both the independent or predictor variables and the dependent or criterion variable are usually continuous (interval or ratio-level measurement), although sometimes there will be concocted or "dummy" independent variables which are categorical (e.g., men and women are assigned scores of one or two on a dummy gender variable; or, for more categories, K-1 dummy variables are used, where 1 means "has the property" and 0 means "doesn't have the property").

Consider the race variable from one of our data sets, which has three categories: White, African-American, and Other. To code this variable for multiple regression, you create two dummy variables, White and African-American. Each subject will get a score of either 1 or 0 on each of the two variables, as in the table below.

                               Caucasian   African-American
Subject 1 (Caucasian)          1           0
Subject 2 (African-American)   0           1
Subject 3 (Other)              0           0
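
Outside SPSS, this kind of dummy coding takes one line. Here is a small Python/pandas sketch (not from the slides; the column selection mirrors the table above, with Other left as the reference category):

```python
import pandas as pd

race = pd.Series(["White", "African-American", "Other"], name="race")

# K - 1 = 2 dummy variables; "Other" is the reference category scored 0, 0.
dummies = pd.get_dummies(race)[["White", "African-American"]].astype(int)
print(pd.concat([race, dummies], axis=1))
```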

Slide 22: Coding Categorical Variables in Regression, cont'd

                                   High Status   Medium Status
Subject 1 (High Status Attire)     1             0
Subject 2 (Medium Status Attire)   0             1
Subject 3 (Low Status Attire)      0             0

You can use this same type of procedure to code assignments to levels of a treatment in an experiment, and thus you can use a factor from an experiment, such as interviewer status, as a predictor variable in a regression. For example, if you had an experiment with three levels of interviewer attire, you would create one dummy variable for the high status attire condition and one for the medium status attire condition. As the table above shows, high status condition subjects would get 1, 0 and medium status condition subjects would get 0, 1 on the two variables, respectively, while people in the low status attire condition would get 0, 0 on both.

Slide 23: Regression and Prediction

Most regression analyses look for a linear relationship between predictors and criterion, although nonlinear trends can be explored through regression procedures as well.

In multiple regression we attempt to derive an equation which is the weighted sum of two or more variables. The equation tells you how much weight to place on each of the variables to arrive at the optimal predictive combination.

The equation that is arrived at is the best combination of predictors for the sample from which it was derived. But how well will it predict new cases? Sometimes the regression equation is tested against a new sample of cases to see how well it holds up. The first sample is used for the derivation study (to derive the equation) and a second sample is used for cross-validation. If the second sample was part of the original sample, reserved for just this cross-validation purpose, then it is called a hold-out sample.
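
The slides describe the derivation/cross-validation split in prose only; here is one way it might look in Python, on made-up data (all names and numbers below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Reserve a hold-out sample; derive the equation on the rest.
idx = rng.permutation(n)
derive, holdout = idx[:150], idx[150:]

# Least squares fit (constant plus two predictors) on the derivation sample.
A = np.column_stack([np.ones(len(derive)), X[derive]])
coef, *_ = np.linalg.lstsq(A, y[derive], rcond=None)

# Cross-validate: correlate predicted and observed Y in the hold-out sample.
pred = np.column_stack([np.ones(len(holdout)), X[holdout]]) @ coef
print(np.corrcoef(pred, y[holdout])[0, 1])  # typically a bit below the derivation R
```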

Slide 24: Simultaneous Multiple Regression Analysis

One of the most important notions in multiple regression analysis is the notion of statistical control, that is, mathematical operations to remove the effects of potentially confounding or "third" variables from the relationship between a predictor or IV and a criterion or DV. Terms you might hear which refer to this include:

Partialing
Controlling for
Residualizing
Holding constant

Slide 25: Meaning of Regression Weights

In multiple regression, when you have multiple predictors of the same dependent or criterion variable Y, the standardized regression coefficient Beta1 expresses the independent contribution of X1 to predicting Y when the effects of the other variables X2 through Xn are not a factor (have been statistically controlled for), and similarly for weights Beta2 through Betan.

These regression weights or coefficients can be tested for statistical significance, and it will be possible to state with 95% (or 99%) confidence that the magnitude of the coefficient differs from zero, and thus that that particular predictor makes a contribution to predicting the criterion or dependent variable, Y, that is unrelated to the contribution of any of the other predictors.

Slide 26: Tests of the Predictors

The magnitude of the raw score weights (usually symbolized by b1, b2, etc.) cannot be directly compared, since they are (usually) associated with variables with different units of measurement.

It is common practice to compare the standardized regression weights (Beta1, Beta2, etc.) and make claims about the relative importance of the unique contribution of each predictor variable to predicting the criterion. It is also possible to do tests for the significance of the difference between two predictors: is one a significantly better predictor than the other?

These coefficients vary from sample to sample, so it's not prudent to generalize too much about the relative ability of two predictors to predict. It's also the case that in the context of the regression equation the variable which is a good predictor is not the original variable, but rather a residualized version for which the effects of all the other variables have been held constant. So the magnitude of its contribution is relative to the other variables, and only holds for this particular combination of variables included in the predictive equation.

Slide 27: How Do We Find the Regression Weights (Beta Weights)?

Although this is not how SPSS would calculate them, we can get the Beta weights from the zero-order (pairwise) correlations between Y and the various predictor variables X1, X2, etc., and the intercorrelations among the latter.

Suppose we want to find the beta weights for an equation Y = Beta1X1 + Beta2X2. We need three correlations: the correlation between Y and X1, the correlation between Y and X2, and the correlation between X1 and X2.

Slide 28: How Do We Find the Regression Weights (Beta Weights)?, cont'd

Let's suppose we have the following data: r for Y and X1 = .776; r for Y and X2 = .869; and r for X1 and X2 = .682.

The formula for the standardized partial regression weight for X1, with the effects of X2 removed, is*

Beta(YX1.X2) = (r(YX1) - r(YX2) r(X1X2)) / (1 - r²(X1X2))

Substituting the correlations we already have into the formula, we find that the beta weight for the predictive effect of variable X1 on Y is equal to (.776 - (.869)(.682)) / (1 - .682²) = .342. To compute the second weight, Beta(YX2.X1), we just switch the first and second terms in the numerator. Now let's see that in the context of an SPSS-calculated multiple regression.

*Read this as "the Beta weight for the regression of Y on X1 when the effects of X2 have been removed."
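
Here is the same hand calculation done in Python (not an SPSS feature, just a check of the formula above):

```python
r_yx1, r_yx2, r_x1x2 = 0.776, 0.869, 0.682  # correlations from the slide

beta1 = (r_yx1 - r_yx2 * r_x1x2) / (1 - r_x1x2 ** 2)
beta2 = (r_yx2 - r_yx1 * r_x1x2) / (1 - r_x1x2 ** 2)
print(round(beta1, 3), round(beta2, 3))  # 0.343 0.635; SPSS reports .342/.636
                                         # from the unrounded correlations
```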

Slide 29: Multiple Regression Using SPSS

Suppose we think that the ability of Daily Calorie Intake to predict Female Life Expectancy is not adequate, and we would like to achieve a more accurate prediction. One way to do this is to add additional variables to the equation and conduct a multiple regression analysis.

Suppose we have a suspicion that literacy rate might also be a good predictor, not only as a general measure of the state of the country's development but also as an indicator of the likelihood that individuals will have the wherewithal to access health and medical information. We have no particular reason to assume that literacy rate and calorie consumption are correlated, so we will assume for the moment that they will have a separate and additive effect on female life expectancy.

Let's add literacy rate (People who Read %) as a second predictor (X2), so now the equation we are looking for is Y = a + b1X1 + b2X2, where Y = Female Life Expectancy, Daily Calorie Intake is X1, and Literacy Rate is X2.

Slide 30: Multiple Regression Using SPSS: Steps to Set Up the Analysis

Download the World95.sav data file and open it in SPSS Data Editor.

In Data Editor go to Analyze / Regression / Linear and click Reset.

Put Average Female Life Expectancy into the Dependent box.

Put Daily Calorie Intake and People who Read % into the Independents box.

Under Statistics, select Estimates, Confidence Intervals, Model Fit, Descriptives, Part and Partial Correlations, R Square Change, and Collinearity Diagnostics, and click Continue.

Under Options, check Include Constant in the Equation, click Continue and then OK.

Compare your output to the next several slides.

http://www-rcf.usc.edu/~mmclaugh/550x/DataFiles/World95.sav
Slide 31: Interpreting Your SPSS Multiple Regression Output

Correlations (Pearson; Sig. 1-tailed = .000 for all pairs; N = 74 throughout)

                                 Average female    Daily calorie   People who
                                 life expectancy   intake          read (%)
Average female life expectancy   1.000             .776            .869
Daily calorie intake             .776              1.000           .682
People who read (%)              .869              .682            1.000

First let's look at the zero-order (pairwise) correlations between Average Female Life Expectancy (Y), Daily Calorie Intake (X1), and People who Read (X2). Note that these are rYX1 = .776 for Y with X1, rYX2 = .869 for Y with X2, and rX1X2 = .682 for X1 with X2.

Slide 32: Examining the Regression Weights

Below are the raw (unstandardized) and standardized regression weights for the regression of female life expectancy on daily calorie intake and percentage of people who read. Consistent with our hand calculation, the standardized regression coefficient (beta weight) for daily caloric intake is .342. The beta weight for percentage of people who read is much larger, .636. What this weight means is that for every increase of one standard deviation on the people who read variable, Y (female life expectancy) is predicted to increase by .636 standard deviations. Note that both beta coefficients are significant at p < .001.

Coefficients(a)

                       Unstandardized          Standardized
Model 1                B        Std. Error     Beta    t       Sig.   95% CI for B (Lower, Upper)   Zero-order   Partial   Part   Tolerance   VIF
(Constant)             25.838   2.882                  8.964   .000   20.090, 31.585
People who read (%)    .315     .034           .636    9.202   .000   .247, .383                    .869         .738      .465   .535        1.868
Daily calorie intake   .007     .001           .342    4.949   .000   .004, .010                    .776         .506      .250   .535        1.868

a. Dependent Variable: Average female life expectancy

Slide 33: R, R Square, and the SEE

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .905(a)   .818       .813                4.948                        .818              159.922    2     71    .000

a. Predictors: (Constant), People who read (%), Daily calorie intake

Above is the model summary, which has some important statistics. It gives us R and R Square for the regression of Y (female life expectancy) on the two predictors. R is .905, which is a very high correlation. R Square tells us what proportion of the variation in female life expectancy is explained by the two predictors: a very high .818. It also gives us the standard error of estimate, which we can use to put confidence intervals around the unstandardized regression coefficients.
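
A useful identity lurking here (my addition, not on the slide): with standardized variables, the beta weights solve the system of predictor intercorrelations, and R Square is the beta-weighted sum of the predictor-criterion correlations. Checking this in Python against the correlations reported earlier reproduces the .818:

```python
import numpy as np

r_yx = np.array([0.776, 0.869])   # Y with calories, Y with literacy
R_xx = np.array([[1.0, 0.682],    # predictor intercorrelation matrix
                 [0.682, 1.0]])

beta = np.linalg.solve(R_xx, r_yx)  # standardized weights
r_square = float(beta @ r_yx)       # R^2 = sum of beta_i * r_YXi
print(beta.round(3), round(r_square, 3))  # [0.343 0.635] 0.818
```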

Slide 34: F Test for the Significance of the Regression Equation

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   7829.451         2    3914.726      159.922   .000(a)
Residual     1738.008         71   24.479
Total        9567.459         73

a. Predictors: (Constant), People who read (%), Daily calorie intake
b. Dependent Variable: Average female life expectancy

Next we look at the F test of the significance of the regression equation, ZY = .342 ZX1 + .636 ZX2 in standardized form. Is this so much better a predictor of female life expectancy (Y) than simply using the mean of Y that the difference is statistically significant? The F test is a ratio of the mean square for the regression equation to the mean square for the residual (the departures of the actual scores on Y from what the regression equation predicted): here F = 3914.726 / 24.479 ≈ 159.92. This is a very large value of F, significant at p < .001.

Slide 35: Confidence Intervals around the Regression Weights

Coefficients(a)

                       Unstandardized          Standardized
Model 1                B        Std. Error     Beta    t       Sig.   95% CI for B (Lower, Upper)   Zero-order   Partial   Part
(Constant)             25.838   2.882                  8.964   .000   20.090, 31.585
Daily calorie intake   .007     .001           .342    4.949   .000   .004, .010                    .776         .506      .250
People who read (%)    .315     .034           .636    9.202   .000   .247, .383                    .869         .738      .465

a. Dependent Variable: Average female life expectancy

Finally, your output provides confidence intervals around the unstandardized regression coefficients. Thus we can say with 95% confidence that the unstandardized weight to apply to daily calorie intake to predict female life expectancy lies between .004 and .010, and that the unstandardized weight to apply to percentage of people who read lies between .247 and .383.

Slide 36: Multicollinearity

One of the requirements for a mathematical solution to the multiple regression problem is that the predictors or independent variables not be highly correlated.

If in fact two predictors are perfectly correlated, the analysis cannot be completed.

Multicollinearity (the case in which two or more of the predictors are too highly correlated) also leads to unstable partial regression coefficients which won't hold up when applied to a new sample of cases.

Further, if predictors are too highly correlated with each other, their shared variance with the dependent or criterion variable may be redundant, and it's hard to tell just using statistical procedures which variable is producing the effect.

Moreover, the regression weights for the predictors would look much like their zero-order correlations with Y if the predictors were mutually independent; if the predictors are highly correlated, this may produce regression weights that don't really reflect the independent contribution to prediction of each of the predictors.

Slide 37: Multicollinearity, cont'd

As a rule of thumb, bivariate zero-order correlations between predictors should not exceed .80. This is easy to check: run a complete analysis of all possible pairs of predictors using the correlation procedure.

Also, no predictor should be totally accounted for by a combination of the other predictors. Look at tolerance levels. Tolerance for a predictor variable is equal to 1 - R² for an equation where that predictor is regressed on all of the other predictors. If the predictor is highly correlated with (explained by) the combination of the other predictors, it will have a low tolerance, approaching zero, because the R² will be large. So, tolerance near zero = BAD, tolerance near 1 = GOOD, in terms of the independence of a predictor.

The best prediction occurs when the predictors are moderately independent of each other, but each is highly correlated with the dependent (criterion) variable Y. Some interpretive problems resulting from multicollinearity can be resolved using path analysis (see Chapter 3 in Grimm and Yarnold). A quick tolerance check for our two-predictor example is sketched below.
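
With only two predictors, the tolerance arithmetic is simple enough to do by hand; this little Python check (mine, not from the slides) reproduces the tolerance value discussed on the next slide:

```python
r_x1x2 = 0.682  # correlation between the two predictors

tolerance = 1 - r_x1x2 ** 2  # with two predictors, R^2 is just r^2
vif = 1 / tolerance
print(round(tolerance, 3), round(vif, 2))  # 0.535 1.87 (SPSS shows 1.868
                                           # from unrounded correlations)
```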

Slide 38: Multicollinearity Issues in our Current SPSS Problem

Correlations (Pearson; Sig. 1-tailed = .000 for all pairs; N = 74 throughout)

                                 Average female    Daily calorie   People who
                                 life expectancy   intake          read (%)
Average female life expectancy   1.000             .776            .869
Daily calorie intake             .776              1.000           .682
People who read (%)              .869              .682            1.000

From our SPSS output we note that the correlation between our two predictors, Daily Calorie Intake (X1) and People who Read (X2), is rX1X2 = .682. This is a pretty high correlation for two predictors to be interpreted independently: it means each explains about half the variation in the other (.682² = .465). If you look at the zero-order correlation of our Y variable, average female life expectancy, with % people who read, you note that the correlation is quite high, rYX2 = .869. However, the value of R for the two-variable combination was .905, which is an improvement.

Slide 39: Multicollinearity Issues in our Current SPSS Problem, cont'd

The tolerance and VIF values discussed here are excerpted from the more complete table on Slide 32. Look at the tolerance value. Recall that tolerance near zero means very high multicollinearity (high intercorrelation among the predictors, which is bad). Tolerance is .535 for both variables (since there are only two, the value is the same for either one predicting the other).

VIF (variance inflation factor) is a completely redundant statistic with tolerance (it is 1/tolerance). The higher it is, the greater the multicollinearity. When there is no multicollinearity the value of VIF equals 1. Multicollinearity problems have to be dealt with (by getting rid of redundant predictor variables or other means) if VIF approaches 10, which means that only about 10% of the variance in the predictor in question is not explained by the combination of the other predictors.

In the case of our two predictors, there is some indication of multicollinearity, but not enough to throw out one of the variables.

Slide 40: Specification Errors

One type of specification error occurs when the relationship among the variables you are looking at is not linear (e.g., you know that Y peaks at high and low levels of one or more predictors, a curvilinear relationship) but you are using linear regression anyhow. There are options for nonlinear regression available that should be used in such a case.

Another type of specification error occurs when you have either underspecified or overspecified the model by (a) failing to include all relevant predictors (for example, including weight but not height in an equation for predicting obesity), or (b) including predictors which are not relevant. Most irrelevant predictors will not even show up in the final regression equation unless you insist on it, but they can affect the results if they are correlated with at least some of the other predictors.

For proper specification, nothing beats a good theory (as opposed to launching a fishing expedition).

Slide 41: Types of Multiple Regression Analysis

So far we have looked at a standard or simultaneous multiple regression analysis, where all of the predictor variables were entered at the same time, that is, considered in combination with each other simultaneously. But there are other types of multiple regression analyses which can yield some interesting results.

Hierarchical regression analysis refers to the method of regression in which not all of the variables are entered simultaneously, but rather one at a time or a few at a time; at each step the correlation of Y, the criterion variable, with the current set of predictors is calculated and evaluated. At each stage the R Square change that is calculated shows the incremental change in variance accounted for in Y with the addition of the most recently entered predictor, and that is exclusively associated with that predictor.

Tests can be done to determine the significance of the change in R Square at each step, to see if each newly added predictor makes a significant improvement in the predictive power of the regression equation.

The order in which variables are entered makes a difference to the outcome. The researcher determines the order on theoretical grounds (the exception is stepwise analysis).

Slide 42: Stepwise Multiple Regression

Stepwise multiple regression is a variant of hierarchical regression where the order of entry is determined not by the researcher but on empirical criteria.

In the forward inclusion version of stepwise regression, the order of entry is determined at each step by calculating which variable will produce the greatest increase in R Square (the amount of variance in the dependent variable Y accounted for) at that step.

In the backward elimination version of stepwise multiple regression, the analysis starts off with all of the predictors at the first step and then eliminates them, so that each successive step has fewer predictors in the equation. Elimination is based on an empirical criterion that is the reverse of that for forward inclusion: the variable that produces the smallest decline in R Square is removed at each step.

Slide 43: Reducing the Overall Level of Type I Error

One of the problems with doing multiple regression is that there are a lot of significance tests being conducted simultaneously, but for all practical purposes each test is treated as an independent one even though the data are related. When a large number of tests are done, theoretically the likelihood of Type I error increases (rejecting the null hypothesis when it is in fact true).

This is particularly problematic in stepwise regression, with its iterative process of assessing the significance of R Square over and over again, not to speak of the significance of individual regression coefficients.

Therefore it is desirable to do something to reduce the increased chance of making Type I errors (finding significant results that aren't there), such as keeping the number of predictors to a minimum (to reduce the number of times you go to the normal table to obtain a significance level), dividing the usual required alpha level by the number of tests, or keeping the intercorrelation of the predictors as low as possible (avoiding use of redundant predictors, which would cause you to basically test the significance of the same relationship to Y over and over).

Slide 44: Reducing the Overall Level of Type I Error, cont'd

This may be of particular importance when the researcher is testing a theory which has a network of interlocking claims, such that the invalidation of one of them brings the whole thing tumbling down. An issue of HCR (July 2003) devoted several papers to exploring this question.

As mentioned in class before, the Bonferroni procedure is sometimes used, but it's hard to swallow, as you have to divide the usual alpha level of .05 by the number of tests you expect to perform; so if you are conducting thirty tests, you have to set your alpha level at .05/30, or .0017, for each test. With stepwise regression it's not clear in advance how many tests you will have to perform, although you can estimate it by the number of predictor variables you intend to start off with.
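
For what it's worth, the Bonferroni arithmetic described above is a one-liner (illustrative only):

```python
alpha, n_tests = 0.05, 30
print(round(alpha / n_tests, 4))  # 0.0017 per test
```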