
Action Research Correlation and Regression


Page 1: Action Research Correlation and Regression

Action Research
Correlation and Regression

INFO 515
Glenn Booker

Page 2: Action Research Correlation and Regression

Measures of Association

Measures of association are used to determine how strong the relationship is between two variables or measures, and how we can predict such a relationship.

This only applies to interval or ratio scale variables -- everything this week applies only to interval or ratio scale variables!

Page 3: Action Research Correlation and Regression

Measures of Association

For example, suppose I have GRE and GPA scores for a random sample of graduate students. How strong is the relationship between GRE scores and GPA? Do these variables relate to each other in some way?

If there is a strong relationship, how well can we predict the values of one variable when values of the other variable are known?

Page 4: Action Research Correlation and Regression

Strength of Prediction

Two techniques are used to describe the strength of a relationship, and to predict values of one variable when another variable's value is known:

Correlation: describes the degree (strength) to which the two variables are related.

Regression: used to predict the values of one variable when values of the other are known.

Page 5: Action Research Correlation and Regression

Strength of Prediction

Correlation and regression are linked -- the ability to predict one variable when another variable is known depends on the degree and direction of the variables' relationship in the first place.

We find correlation before we calculate regression, so generating a regression without checking for a correlation first is pointless (though we'll do both at once).

Page 6: Action Research Correlation and Regression

Correlation

There are different types of statistical measures of correlation. They give us a measure known as the correlation coefficient.

The most common procedure used is the Pearson Product Moment Correlation, or Pearson's 'r'.

Page 7: Action Research Correlation and Regression

Pearson's 'r'

Pearson's 'r' can only be calculated for interval or ratio scale data. Its value is a real number from -1 to +1.

Strength: as the value of 'r' approaches -1 or +1, the relationship is stronger. As the magnitude of 'r' approaches zero, we see little or no relationship.

Page 8: Action Research Correlation and Regression

Pearson's 'r'

For example, 'r' might equal 0.89, -0.9, 0.613, or -0.3. Which would be the strongest correlation?

Direction: positive or negative correlation cannot be distinguished just from the strength of 'r'. The direction of the correlation depends on the type of equation used, and the resulting constants obtained for it.

Page 9: Action Research Correlation and Regression

Example of Relationships

Positive direction -- as the independent variable increases, the dependent variable tends to increase:

Student   GRE (X)   GPA1 (Y)
1         1500      4.0
2         1400      3.8
3         1250      3.5
4         1050      3.1
5          950      2.9

Page 10: Action Research Correlation and Regression

Example of Relationships

Negative direction -- as the independent variable increases, the dependent variable tends to decrease:

Student   GRE (X)   GPA2 (Y)
1         1500      2.9
2         1400      3.1
3         1250      3.4
4         1050      3.7
5          950      4.0

Page 11: Action Research Correlation and Regression

Positive and Negative Correlation

[Chart: GPA1 vs. GRE, observed points with a linear fit (data from slide 9) -- positive correlation, r = 1.0]

[Chart: GPA2 vs. GRE, observed points with a linear fit (data from slide 10) -- negative correlation, r = 1.0]

Notice that a high 'r' doesn't tell whether the correlation is positive or negative!

Page 12: Action Research Correlation and Regression

*Important Note*

An association value provided by a correlation analysis, such as Pearson's 'r', tells us nothing about causation. In this case, high GRE scores don't necessarily cause high or low GPA scores, and vice versa.

Page 13: Action Research Correlation and Regression

Significance of r

We can test for the significance of r (to see whether our relationship is statistically significant) by consulting a table of critical values for r (Action Research p. 41/42), the table "VALUES OF THE CORRELATION COEFFICIENT FOR DIFFERENT LEVELS OF SIGNIFICANCE", where df = (number of data pairs) - 2.

Page 14: Action Research Correlation and Regression

Significance of r

We test the null hypothesis that the correlation between the two variables is equal to zero (there is no relationship between them).

Reject the null hypothesis (H0) if the absolute value of r is greater than the critical r value: reject H0 if |r| > r_crit.

This is similar to evaluating actual versus critical 't' values.

Page 15: Action Research Correlation and Regression

Significance of r Example

So if we had 20 pairs of data, then for two-tail 95% confidence (P = .05) the critical 'r' value at df = 20 - 2 = 18 is 0.444.

So reject the null hypothesis (hence the correlation is statistically significant) if: r > 0.444 or r < -0.444.
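For illustration only (the lecture itself uses tables and SPSS), here is a minimal Python sketch of the same decision rule; the 20 data pairs are randomly generated stand-ins, and 0.444 is the critical value quoted above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                 # hypothetical data: 20 pairs
x = rng.normal(size=20)
y = 0.6 * x + rng.normal(scale=0.8, size=20)

r, p_value = stats.pearsonr(x, y)              # Pearson's r and its two-tail p-value
r_crit = 0.444                                 # critical r for df = 20 - 2 = 18, P = .05

print(f"r = {r:.3f}, p = {p_value:.3f}")
print("Reject H0?", abs(r) > r_crit)           # same decision as checking p < .05
```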

Page 16: Action Research Correlation and Regression

Strength of "|r|"

The absolute value of Pearson's 'r' indicates the strength of a correlation:

1.0 to 0.9: very strong correlation
0.9 to 0.7: strong
0.7 to 0.4: moderate to substantial
0.4 to 0.2: moderate to low
0.2 to 0.0: low to negligible correlation

Notice that a correlation can be strong, but still not be statistically significant! (especially for small data sets)
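As a small aid (not from the slides), the ranges above can be turned into a lookup function; where two of the listed ranges share an endpoint, the cutoffs below are an arbitrary choice:

```python
def strength_label(r):
    """Classify |r| using the ranges listed on this slide."""
    a = abs(r)
    if a >= 0.9:
        return "very strong"
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate to substantial"
    if a >= 0.2:
        return "moderate to low"
    return "low to negligible"

print(strength_label(-0.75))   # "strong": the sign is ignored, only strength matters
```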

Page 17: Action Research Correlation and Regression

*Important Notes*

The stronger the r, the smaller the standard error of the estimate, and the better the prediction!

A significant r does not necessarily mean that you have a strong correlation. A significant r means that whatever correlation you do have is not due to random chance.

Page 18: Action Research Correlation and Regression

Coefficient of Determination

By squaring r, we can determine the amount of variance the two variables share (called "explained variance"). R Square is the coefficient of determination.

So, an "R Square" of 0.94 means that 94% of the variance in the Y variable is explained by the variance of the X variable.
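For instance (a worked example, not from the slides), a correlation of r = -0.97 squares to an R Square of (-0.97)^2 = 0.94 to two decimals, so the sign of r drops out and only the strength of the relationship contributes to the explained variance.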

Page 19: Action Research Correlation and Regression

What is R Squared?

• The coefficient of determination, R^2, is a measure of the goodness of fit
• R^2 ranges from 0 to 1
• R^2 = 1 is a perfect fit (all data points fall on the estimated line or curve)
• R^2 = 0 means that the variable(s) have no explanatory power

Page 20: Action Research Correlation and Regression

What is R Squared?

• Having R^2 closer to 1 helps choose which regression model is best suited to a problem
• Having R^2 actually equal zero is very difficult
• A sample of ten random numbers from Excel still obtained an R^2 of 0.006

Page 21: Action Research Correlation and Regression

Scatter Plots

It's nice to use R^2 to determine the strength of a relationship, but visual feedback helps verify whether the model fits the data well. It also helps in looking for data fliers (outliers).

A scatter plot (or scattergram) allows us to compare any two interval or ratio scale variables, and see how data points are related to each other.

Page 22: Action Research Correlation and Regression

Scatter Plots

Scatter plots are two-dimensional graphs with an axis for each variable (independent variable X and dependent variable Y).

To construct one: place an * on the graph for each X and Y value from the data.

Seeing data this way can help choose the correct mathematical model for the data.
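The lecture does this in SPSS; purely as an illustration, the same kind of plot for the slide 9 data could be drawn in Python with matplotlib:

```python
import matplotlib.pyplot as plt

# GRE/GPA1 data from slide 9
gre  = [1500, 1400, 1250, 1050, 950]   # independent variable X
gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]       # dependent variable Y

plt.scatter(gre, gpa1, marker="*")     # one mark per (X, Y) pair
plt.xlabel("GRE (X, independent)")
plt.ylabel("GPA1 (Y, dependent)")
plt.title("GPA1 vs. GRE")
plt.show()
```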

Page 23: Action Research Correlation and Regression

Scatter Plots

[Diagram: axes X (independent, horizontal) and Y (dependent, vertical) with the origin at (0, 0); a single data point is marked with * at (2, 3), i.e. X = 2 and Y = 3]

Page 24: Action Research Correlation and Regression

Models

Models allow us to focus on select elements of the problem at hand, and ignore irrelevant ones.

They may show how parts of the problem relate to each other.

They may be expressed as equations, mappings, or diagrams.

They may be chosen or derived before or after measurement (theory vs. empirical).

Page 25: Action Research Correlation and Regression

Modeling

Often we look for a linear relationship -- one described by fitting a straight line to the data as well as possible.

More generally, any equation could be used as the basis for regression modeling, or for describing the relationship between two variables. You could have Y = a*X**2 + b*ln(X) + c*sin(d*X - e).

Page 26: Action Research Correlation and Regression

Linear Model

[Diagram: a straight line plotted on axes X (independent) and Y (dependent)]

Y = m*X + b   or   Y = b0 + b1*X

where b is the Y-axis intercept, and m is the slope (the change in Y for 1 unit of X).

Page 27: Action Research Correlation and Regression

Linear Model

Pearson's 'r' for linear regression is calculated per (Action Research p. 29/30). Define:

N   = number of data pairs
SX  = sum of all X values
SX2 = sum of all (X values squared)
SY  = sum of all Y values
SY2 = sum of all (Y values squared)
SXY = sum of all (X values times Y values)

Pearson's r = [N*(SXY) - (SX)*(SY)] / sqrt[(N*(SX2) - (SX)^2) * (N*(SY2) - (SY)^2)]
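As a sanity check (not part of the lecture), the sum formula above can be coded directly; applied to the slide 9 GRE/GPA1 data it returns 1.0, the perfect positive correlation shown on slide 11:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r via the sum formula above (Action Research p. 29/30)."""
    n   = len(x)                              # N
    sx  = sum(x)                              # SX
    sy  = sum(y)                              # SY
    sx2 = sum(v * v for v in x)               # SX2
    sy2 = sum(v * v for v in y)               # SY2
    sxy = sum(a * b for a, b in zip(x, y))    # SXY
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

gre  = [1500, 1400, 1250, 1050, 950]          # data from slide 9
gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]
print(pearson_r(gre, gpa1))                   # 1.0 (up to floating-point rounding)
```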

Page 28: Action Research Correlation and Regression

Linear Model

For the linear model, you could find the slope 'm' and Y-intercept 'b' from:

m = (r) * (standard deviation of Y) / (standard deviation of X)
b = (mean of Y) - (m)*(mean of X)

But it's a lot easier to use SPSS, where slope = b1 and Y intercept = b0.
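A minimal sketch of those two formulas in Python (again not from the lecture), continuing the slide 9 example where r = 1.0:

```python
import statistics as st

def slope_and_intercept(x, y, r):
    """m = r * sd(Y) / sd(X);  b = mean(Y) - m * mean(X)."""
    m = r * st.stdev(y) / st.stdev(x)   # sample or population sd both work: the ratio is the same
    b = st.mean(y) - m * st.mean(x)
    return m, b

gre  = [1500, 1400, 1250, 1050, 950]
gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]
m, b = slope_and_intercept(gre, gpa1, 1.0)
print(m, b)   # about 0.002 and 1.0, i.e. GPA1 = 0.002*GRE + 1.0 for this data
```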

Page 29: Action Research Correlation and Regression

Regression Analysis

Regression analysis allows us to predict the likely value of one variable from knowledge of another variable.

The two variables should be fairly highly correlated (close to a straight line).

The regression equation is a mathematical expression of the relationship between two variables on, for example, a straight line.

Page 30: Action Research Correlation and Regression

Regression Equation

Y = mX + b

In this linear equation, you predict Y values (the dependent variable) from known values of X (the independent variable); this is called the regression of Y on X.

The regression equation is fundamentally an equation for plotting a straight line, so the stronger our correlation, the closer our variables will fall to a straight line, and the better our prediction will be.

Page 31: Action Research Correlation and Regression

Linear Regression

[Diagram: a scatter of data points with the fitted line ŷ = a + b*x; each observed y is the predicted ŷ plus a residual, y = ŷ + e]

Choose the "best" line by minimizing the sum of the squares of the vertical distances between the data points and the regression line.

Page 32: Action Research Correlation and Regression

Standard Error of the Estimate

The standard error of the estimate is the standard deviation of the data around the regression line.

It tells how much the actual values of Y deviate from the predicted values of Y.

Page 33: Action Research Correlation and Regression

Standard Error of the Estimate

After you calculate the standard error of the estimate, you add and subtract that value from your predicted values of Y to get a % area around the regression line within which you would expect repeated actual values to occur or cluster if you took many samples (sort of like a sampling distribution for the mean...).

Page 34: Action Research Correlation and Regression

Standard Error of Estimate

The Standard Error of Estimate for Y predicted by X is

s_y/x = sqrt[ sum of (Y - predicted Y)^2 / (N - 2) ]

where 'Y' is each actual Y value, 'predicted Y' is the Y value predicted by the linear regression, and 'N' is the number of data pairs.

For the example on (Action Research p. 33/34), s_y/x = sqrt(2.641 / (10 - 2)) = 0.574
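A short Python sketch of the same formula (not from the lecture); the final line just reproduces the textbook arithmetic quoted above:

```python
from math import sqrt

def std_error_of_estimate(y_actual, y_predicted):
    """s_y/x = sqrt( sum of (Y - predicted Y)^2 / (N - 2) )."""
    n = len(y_actual)
    ss_residual = sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_predicted))
    return sqrt(ss_residual / (n - 2))

# Action Research example: 10 data pairs with a residual sum of squares of 2.641
print(sqrt(2.641 / (10 - 2)))   # 0.574...
```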

Page 35: Action Research Correlation and Regression

Standard Error of the Estimate

So, if the standard error of the estimate is 0.574, and you have a predicted Y value of 4.560, then 68% of your actual values, with repeated sampling, would fall between 3.986 and 5.134 (predicted Y +/- 1 standard error).

The smaller the standard error, the closer your actual values are to the regression line, and the more confident you can be in your prediction.

Page 36: Action Research Correlation and Regression

SPSS Regression Equations

Instead of constants called 'm' and 'b', 'b0' and 'b1' are used for most equations.

The meaning of 'b0' and 'b1' varies, depending on the type of equation being modeled.

You can suppress the use of 'b0' by unchecking "Include constant in equation".

Page 37: Action Research Correlation and Regression

SPSS Regression Models

Linear model:       Y = b0 + b1*X

Logarithmic model:  Y = b0 + b1*ln(X), where 'ln' = natural log

Inverse model:      Y = b0 + b1/X (similar to the form X*Y = constant, which is a hyperbola)

Page 38: Action Research Correlation and Regression

SPSS Regression Models

Power model:     Y = b0*(X**b1)

Compound model:  Y = b0*(b1**X)

A variant of this is the Logistic model, which requires a constant input 'u' that is larger than Y for any actual data point:
Y = 1/[ 1/u + b0*(b1**X) ]

Where "**" indicates "to the power of".

Page 39: Action Research Correlation and Regression

SPSS Regression Models

Exponential model:  Y = b0*exp(b1*X)

Other exponential functions:

S model:       Y = exp(b0 + b1/X)

Growth model:  Y = exp(b0 + b1*X) (almost identical to the exponential model)

"exp" means "e to the power of"; e = 2.7182818...

Page 40: Action Research Correlation and Regression

SPSS Regression Models

Polynomials beyond the Linear model (linear is a first-order polynomial):

Quadratic (second order):  Y = b0 + b1*X + b2*X**2

Cubic (third order):       Y = b0 + b1*X + b2*X**2 + b3*X**3

These are the only equations which use constants b2 and b3.

Higher-order polynomials require the Regression module of SPSS, which can do regression using any equation you enter.

Page 41: Action Research Correlation and Regression

Y = whattheflock?

To help picture these equations:

Make an X variable over some typical range (0 to 10 in a small increment, maybe 0.01).

Define a Y variable, and calculate it using Transform > Compute... and whatever equation you want to see.

Pick values for b0 and b1 that aren't 0, 1, or 2.

Have SPSS plot the results of a regression of Y vs X for that type of equation (a rough Python equivalent is sketched below).
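Outside SPSS, the same exercise can be sketched with numpy and matplotlib; the values b0 = 3 and b1 = 0.4 below are arbitrary picks (per the "not 0, 1, or 2" advice), not from the lecture:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.01, 10, 0.01)    # X from 0 to 10 in small steps (skip 0 so ln(X) and 1/X are defined)
b0, b1 = 3.0, 0.4                # arbitrary constants that aren't 0, 1, or 2

models = {                       # model equations from slides 37-39
    "linear":      b0 + b1 * x,
    "logarithmic": b0 + b1 * np.log(x),
    "inverse":     b0 + b1 / x,
    "power":       b0 * x ** b1,
    "compound":    b0 * b1 ** x,
    "exponential": b0 * np.exp(b1 * x),
}

for name, y in models.items():
    plt.plot(x, y, label=name)
plt.legend()
plt.ylim(0, 20)                  # keep the fast-growing curves from hiding the rest
plt.show()
```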

Page 42: Action Research Correlation and Regression

How Apply This?

Given a set of data containing two variables of interest, generate a scatter plot to get some idea of what the data looks like.

Choose which types of models are most likely to be useful.

For linear models only, use Analyze / Regression / Linear...

Page 43: Action Research Correlation and Regression

How Apply This?

Select the Independent (X) and Dependent (Y) variables.

Rules may be applied to limit the scope of the analysis, e.g. gender=1.

Dozens of other characteristics may also be obtained, which are beyond our scope here.

Page 44: Action Research Correlation and Regression

How Apply This?

Then check the R Square value in the Model Summary.

Check the Coefficients to make sure they are all significant (e.g. Sig. < 0.050).

If so, use the 'b0' and 'b1' coefficients from under the 'B' column (see the Statistics for Software Process Improvement handout), plus or minus their standard errors "SE B".

Page 45: Action Research Correlation and Regression

Regression Example

For example, go back to the "GSS91 political.sav" data set.

Generate a linear regression (Analyze > Regression > Linear) for 'age' as the Independent variable, and 'partyid' as the Dependent variable.

Notice that R^2 and the ANOVA summary are given, with F and its significance.

Page 46: Action Research Correlation and Regression

Regression Example

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .075a   .006       .005                2.082

a. Predictors: (Constant), AGE OF RESPONDENT

ANOVA(b)

Model 1      Sum of Squares   df     Mean Square   F       Sig.
Regression   36.235           1      36.235        8.361   .004a
Residual     6457.063         1490   4.334
Total        6493.298         1491

a. Predictors: (Constant), AGE OF RESPONDENT
b. Dependent Variable: POLITICAL PARTY AFFILIATION

Page 47: Action Research Correlation and Regression

Regression Example

The R Square of 0.006 means there is a very slight correlation (little strength).

But the ANOVA Significance, well under 0.050, confirms there is a statistically significant relationship here - it's just a really weak one.

Page 48: Action Research Correlation and Regression

Regression Example

Output from Analyze > Regression > Linear:

Coefficients(a)

Model 1             Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)          3.333              .148                             22.462   .000
AGE OF RESPONDENT   -.009              .003         -.075               -2.892   .004

a. Dependent Variable: POLITICAL PARTY AFFILIATION

Output from Analyze > Regression > Curve Estimation:

Coefficients

                    Unstandardized B   Std. Error   Standardized Beta   t        Sig.
AGE OF RESPONDENT   -.009              .003         -.075               -2.892   .004
(Constant)          3.333              .148                             22.462   .000

Page 49: Action Research Correlation and Regression

Regression Example

The heart of the regression analysis is in the Coefficients section.

We could look up 't' in a critical values table, but it's easier to:

See if all values of Sig are < 0.050 - if they are, reject the null hypothesis, meaning there is a significant relationship.

If so, use the values under B for b0 and b1.

If any coefficient has Sig > 0.050, don't use that regression (the coefficient might be zero).

Page 50: Action Research Correlation and Regression

Regression Example

The answer to "what is the effect of age on political view?" is that there is a very weak but statistically significant linear relationship, with a reduction of 0.009 (b1) political view categories per year.

From the Variable View of the data, since low values are liberal and large values conservative, this means that people tend to get slightly more liberal as they get older.
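Purely as an illustration of using the B values, the fitted line from the slide 48 Coefficients table (b0 = 3.333, b1 = -0.009) can be evaluated for a couple of hypothetical ages:

```python
b0, b1 = 3.333, -0.009               # B values from the Coefficients table on slide 48

def predicted_partyid(age):
    """Regression of partyid on age: predicted Y = b0 + b1*X."""
    return b0 + b1 * age

print(predicted_partyid(20))         # about 3.15
print(predicted_partyid(70))         # about 2.70 -- roughly half a category lower over 50 years
```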

Page 51: Action Research Correlation and Regression

Curve Estimation Example

For the other regression options, choose Analyze / Regression / Curve Estimation...

Define the Dependent(s) and the Independent variable - note that multiple Dependents may be selected.

Check which math models you want used.

Display the ANOVA table for reference.

Page 52: Action Research Correlation and Regression

Curve Estimation Example

SPSS Tip: up to three regression models can be plotted at once, so don't select more than that if you want a scatter plot to go with the data and the regressions.

For the same example just used, get a summary for the linear and quadratic models (Analyze > Regression > Curve Estimation).

Find "R Square" for each model; generally pick the model with the largest R Square. We already saw the Linear output, now see the Quadratic.

Page 53: Action Research Correlation and Regression

Curve Estimation Example

For the quadratic regression, R Square is slightly higher, and the ANOVA is still significant.

Model Summary

R      R Square   Adjusted R Square   Std. Error of the Estimate
.094   .009       .008                2.079

The independent variable is AGE OF RESPONDENT.

ANOVA

             Sum of Squares   df     Mean Square   F       Sig.
Regression   57.801           2      28.901        6.687   .001
Residual     6435.496         1489   4.322
Total        6493.298         1491

The independent variable is AGE OF RESPONDENT.

Page 54: Action Research Correlation and Regression

Curve Estimation Example

The Quadratic coefficients are all significant at the 0.050 level.

Coefficients

                         Unstandardized B   Std. Error   Standardized Beta   t        Sig.
AGE OF RESPONDENT        -.048              .018         -.410               -2.691   .007
AGE OF RESPONDENT ** 2   .000               .000         .341                2.234    .026
(Constant)               4.191              .412                             10.175   .000

Interpret as: partyid = (4.191 +/- 0.412) + (-0.048 +/- 0.018)*age + (0.0003918 +/- 0.0001754)*age**2

(The age-squared row shows .000 only because of rounding; edit the data table, then double-click on the cells to get the full values of b2 and its standard error.)
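A small sketch (not from the lecture) that evaluates this quadratic fit at a few illustrative ages, using the coefficients quoted above:

```python
b0, b1, b2 = 4.191, -0.048, 0.0003918      # coefficients from the quadratic output above

def predicted_partyid_quadratic(age):
    """partyid = b0 + b1*age + b2*age**2."""
    return b0 + b1 * age + b2 * age ** 2

for age in (20, 45, 70):                   # illustrative ages
    print(age, round(predicted_partyid_quadratic(age), 3))

# The fitted parabola bottoms out near age -b1 / (2*b2), roughly 61 years.
```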

Page 55: Action Research Correlation and Regression

Curve Estimation Example

The data set will be plotted as the Observed points, with the regression models shown for comparison.

Look to see which model most closely matches the data.

Look for regions of data which do or don't match the model well (if any).

Page 56: Action Research Correlation and Regression

Curve Estimation Example

[Chart: scatter of the observed partyid vs. age data, with the fitted linear and quadratic curves overlaid for comparison]

Page 57: Action Research Correlation and Regression

Curve Estimation Procedure

See which models are significant (throw out the rest!).

Compare the R Square values to see which provides the best fit.

Use the graph to verify visually that the correct model was chosen.

Use the model equation's 'B' values and their standard errors to describe and predict the data's behavior.