Regression analysis and multiple regression: Here’s the beef*

*Graphic kindly provided by Microsoft.


Page 1

Regression analysis and multiple regression: Here’s the beef*

*Graphic kindly provided by Microsoft.

Page 2

• Generally, regression analysis is used with interval and ratio data

• Regression analysis is a method of determining the specific function relating y to x ===> Y=f(X)

• Not really cause and effect(s) but… how the independent variables combine to help predict the dependent variable

• Widely used in the social sciences

• Provides a value called R2 (R-squared) which tells how well a set of variables explains a dependent variable

Page 3

• To explain means to reduce errors when predicting the dependent variable’s scores on the basis of information about the independent variables

• The regression results measure the direction and size of the effect of each variable on the dependent variable

• The form of the regression line is: Y=a+bX, where Y is the dependent variable, a is the intercept, b is the slope, and X is the independent variable
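Since the line Y=a+bX is central to everything that follows, here is a minimal Python sketch of how a and b are found by ordinary least squares (the function name and data points are ours, for illustration only):

```python
# Ordinary least squares for Y = a + bX (simple linear regression).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # intercept: the fitted line passes through the means
    return a, b

# Perfectly linear toy data, Y = 1 + 2X:
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # -> (1.0, 2.0)
```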

Page 4

Regression analysis

• Examples: If we know Tara’s IQ, what can we say about her prospects of satisfactorily completing a university degree? Knowing Nancy’s prior voting record, can we make any informed guesses concerning her vote in the coming provincial election? Kendra is enrolled in a statistics course. If we know her score on the midterm exam, can we make a reasonably good estimate of her grade on the final?

Page 5

• There are several forms of regression analysis, depending on the complexity of the relationships being studied.

• The simplest is known as linear regression, which assumes a perfect linear association between two variables. The straight line connecting the points is called the regression line.

Page 6

• The regression line rarely cuts through all points in a distribution (e.g., picture a scatterplot). As such, we draw an approximate line showing the best possible linear representation of the several points.

• Recall geometry: a straight line on a graph can be represented by the equation Y = a + bX

Page 7

• To simplify our discussion, let’s start with an example of two variables that are usually perfectly related: monthly salary and yearly income. ===> Y=12X

• Let’s add one more factor to this linear relationship. Suppose that we received a Christmas bonus of $500 ===> Y=500+12X

• In the above income example, the slope of the line is 12, which means that Y changes by 12 for each change of one unit in X.
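The salary example, written out as code (a trivial sketch; the function name is ours):

```python
# Yearly income from monthly salary: Y = 500 + 12X (the bonus plus 12 months' pay).
def yearly_income(monthly_salary, bonus=500):
    return bonus + 12 * monthly_salary

yearly_income(3000)  # -> 36500
```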

Page 8

Example of linear regression

• If we are interested in exploring the relationship between SEI and EDUC using linear regression, we would do the following

• First, assign SEI as our dependent variable and EDUC as our independent variable

• Run SPSS using Analyze-->Regression--> Linear

• Interpret the output -- look only at R2 and the unstandardized coefficients and their associated levels of significance

Page 9

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .591a   .349       .348                15.347

a. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED

Page 10

Coefficients(a)

Model 1                            Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)                         -4.321             1.969                            -2.195   .028
HIGHEST YEAR OF SCHOOL COMPLETED    3.917              .142         .591               27.624   .000

a. Dependent Variable: RESPONDENT SOCIOECONOMIC INDEX

Page 11

• Taking the unstandardized B coefficients for the constant and the variable EDUC gives us the following regression equation:

SEI = -4.321 + (EDUC*3.917)

• For example, the predicted SEI for someone with 18 years of education is:

SEI = -4.321 + (18*3.917)

= 66.2
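That prediction is simple arithmetic; a sketch in Python using the coefficients reported above (the function name is ours):

```python
# Predicted SEI from the fitted equation SEI = -4.321 + (EDUC * 3.917).
def predict_sei(educ, intercept=-4.321, slope=3.917):
    return intercept + slope * educ

predict_sei(18)  # about 66.2, matching the worked example
```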

Page 12

Let’s look at Pearson’s r for SEI and EDUC

Correlations

                                      HIGHEST YEAR OF    RESPONDENT
                                      SCHOOL COMPLETED   SOCIOECONOMIC INDEX
Pearson Correlation
  HIGHEST YEAR OF SCHOOL COMPLETED    1.000              .591**
  RESPONDENT SOCIOECONOMIC INDEX      .591**             1.000
Sig. (2-tailed)
  HIGHEST YEAR OF SCHOOL COMPLETED    .                  .000
  RESPONDENT SOCIOECONOMIC INDEX      .000               .
N
  HIGHEST YEAR OF SCHOOL COMPLETED    1494               1427
  RESPONDENT SOCIOECONOMIC INDEX      1427               1432

**. Correlation is significant at the 0.01 level (2-tailed).

Page 13

The scatterplot shows the following relationship

[Scatterplot: RESPONDENT SOCIOECONOMIC INDEX (y-axis, 0 to 100) plotted against HIGHEST YEAR OF SCHOOL COMPLETED (x-axis, -10 to 30)]

Page 14

Multiple regression example

• If we believe that variables other than EDUC influenced SEI, we could bring them into the model using stepwise multiple regression.

• Let’s consider the influence of EDUC, AGE, and SEX.

• Now remember… we can only use interval/ratio variables in regression, and SEX is nominal.

• To get around this we need to use dummy variable re-coding for SEX.

Page 15

• Since SEX is coded 1=male and 2=female in the GSS, and we believe a priori that being male confers status advantages, we will code for “maleness.”

• We want to recode so that male=1 and female=0. This allows us to treat male as 100% male and female as 0% male.

• Use Transform-->Recode-->Into different variables to create a new variable called SEX2

• Run the regression by Analyze-->Regression-->Linear (make sure that method=stepwise)

Page 16

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .591a   .349       .349                15.351
2       .600b   .360       .359                15.226
3       .602c   .362       .361                15.205

a. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED
b. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR OF SCHOOL COMPLETED, AGE OF RESPONDENT, SEX2

Page 17

Coefficients(a)

                                     Unstandardized B   Std. Error   Standardized Beta   t        Sig.
Model 1
  (Constant)                         -4.313             1.969                            -2.190   .029
  HIGHEST YEAR OF SCHOOL COMPLETED    3.919              .142         .591               27.624   .000
Model 2
  (Constant)                         -11.140            2.393                            -4.655   .000
  HIGHEST YEAR OF SCHOOL COMPLETED    4.017              .142         .606               28.268   .000
  AGE OF RESPONDENT                    .123              .025         .106                4.938   .000
Model 3
  (Constant)                         -11.826            2.409                            -4.909   .000
  HIGHEST YEAR OF SCHOOL COMPLETED    4.000              .142         .603               28.151   .000
  AGE OF RESPONDENT                    .124              .025         .107                5.002   .000
  SEX2                                1.819              .809         .048                2.247   .025

a. Dependent Variable: RESPONDENT SOCIOECONOMIC INDEX

Page 18

• Model 1 (EDUC): SEI = -4.313 + (EDUC*3.919)

• Model 2 (EDUC and AGE): SEI = -11.140 + (EDUC*4.017) + (AGE*.123)

• Model 3 (EDUC, AGE, and SEX2): SEI = -11.826 + (EDUC*4.000) + (AGE*.124) + (SEX2*1.819)

Page 19

Example from Model 3

• What is the predicted SEI score for a 40-year-old woman with 13 years of education?

SEI = -11.826 + (13*4.000) + (40*.124) + (0*1.819) = 45.13

• What is the predicted SEI score for a 25-year-old man with 18 years of education?

SEI = -11.826 + (18*4.000) + (25*.124) + (1*1.819) = 65.09
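The two worked examples can be checked in code, using the Model 3 coefficients from the output (the function name is ours):

```python
# Model 3: SEI = -11.826 + (EDUC * 4.000) + (AGE * .124) + (SEX2 * 1.819)
def predict_sei_model3(educ, age, sex2):
    return -11.826 + educ * 4.000 + age * 0.124 + sex2 * 1.819

round(predict_sei_model3(13, 40, 0), 2)  # 40-year-old woman, 13 yrs -> 45.13
round(predict_sei_model3(18, 25, 1), 2)  # 25-year-old man, 18 yrs -> 65.09
```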

Page 20

Multiple regression

• Viewed as a plane rather than a line.

Page 21

There are several assumptions associated with using a multiple regression model:

• linearity
• equal variance: variation around the regression line is constant (known as homoscedasticity)
• normality: errors are normally distributed
• independence: different errors are sampled independently

Multicollinearity occurs when independent variables are highly correlated (usually over .80).

Page 22

Dummy regression analysis

• Multiple regression accommodates several quantitative independent variables, but frequently independent variables of interest are qualitative. Dummy variable regressors permit the effects of qualitative independent variables to be incorporated into a regression equation.

• Suppose that, along with a quantitative independent variable X there is a two-category (dichotomous) independent variable thought to influence the dependent variable Y.

• For example, if Y is income, X may be years of education and the qualitative independent variable may be gender.

Page 23

Dummy variable coding a polytomous independent variable

• When a qualitative independent variable has several categories (polytomous), its effects can be captured by coding a set of dummy regressors.

• A variable with m categories gives rise to m-1 dummy variables.

• For example, to add region effects to a regression in which income is the dependent variable and education and labour-force experience are quantitative independent variables:

Page 24

Dummy regressors

Region D1 D2 D3 D4

East 1 0 0 0

Quebec 0 1 0 0

Ontario 0 0 1 0

Prairie 0 0 0 1

B.C.* 0 0 0 0

* arbitrary reference or baseline category

Thus the model represents 5 parallel regression planes, one for each region.
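The coding scheme in the table can be sketched as follows (m = 5 regions gives m-1 = 4 dummies, with B.C. as the all-zero baseline; the helper is hypothetical):

```python
# Dummy-code a polytomous variable: m categories -> m - 1 regressors.
REGIONS = ["East", "Quebec", "Ontario", "Prairie", "B.C."]

def region_dummies(region, baseline="B.C."):
    # One 0/1 indicator per non-baseline category (D1..D4 in the table).
    return [1 if region == r else 0 for r in REGIONS if r != baseline]

region_dummies("Quebec")  # -> [0, 1, 0, 0]
region_dummies("B.C.")    # -> [0, 0, 0, 0]  (baseline category)
```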

Page 25

Diagnosing and correcting problems in regression

• Collinearity: When there is a perfect linear relationship among the independent variables in a regression, the least-squares regression coefficients are not uniquely defined.

• Strong, but less than perfect, collinearity (sometimes called multicollinearity) doesn’t prevent the least-squares coefficients from being calculated, but makes them unstable: coefficient standard errors are big; small changes in the data (due even to rounding errors) can cause large changes in the regression coefficients.

Page 26

• The variance inflation factor (VIF) measures the extent to which collinearity affects sampling variance.

• The VIF for a given predictor equals 1/(1-R2), where R2 comes from regressing that predictor on the other independent variables; it is at a minimum (1) when R2=0 and at a maximum (infinity) when R2=1

• Caveat: The VIF is not very useful when an effect is spread over several degrees of freedom.

• Collinearity is a data problem; it does not imply that the model is wrong, only that the data are incapable of providing good estimates of the model parameters.

• If, for example, X1 and X2 are perfectly correlated in a set of data, it’s impossible to separate their effects.
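The VIF relationship described above can be written out directly; here the R2 is the one obtained by regressing predictor j on the other predictors (a minimal sketch):

```python
# VIF_j = 1 / (1 - R_j^2): equals 1 when R_j^2 = 0, unbounded as R_j^2 -> 1.
def vif(r_squared_j):
    return 1.0 / (1.0 - r_squared_j)

vif(0.0)  # -> 1.0 (no collinearity)
vif(0.9)  # about 10 (strong collinearity)
```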

Page 27

There are, however, several strategies for coping with collinear data

• Give up: this is an honest, if unsatisfying, answer.

• Collect new data.

• Reconsider the model: perhaps X1 and X2 are better conceived as alternative measures of the same construct, in which case their high correlation is indicative of high reliability. Get rid of one of them or combine them in some manner (index).