Correlation and Simple Linear Regression (PSY440, June 10, 2008)


Page 1: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation and Simple Linear Regression

PSY440

June 10, 2008

Page 2: Correlation and Simple Linear Regression PSY440 June 10, 2008

A few points of clarification

• For the chi-squared test, the results are unreliable if the expected frequency in too many of your cells is too low.

• A rule of thumb is that the minimum expected frequency should be 5 (i.e., no cells with expected counts less than 5). A more conservative rule recommended by some is a minimum expected frequency of 10. If your minimum is too low, you need a larger sample! The more categories you have the larger your sample must be.

• SPSS will warn you if you have any cells with expected frequency less than 5.
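As a rough sketch of the check SPSS performs, expected counts can be computed from the row and column totals; the 2×2 table below is hypothetical:

```python
# Check the minimum expected frequency for a chi-squared test.
# Expected count for cell (i, j) = row_total[i] * col_total[j] / grand_total.
observed = [[12, 8],
            [30, 50]]  # hypothetical 2x2 contingency table

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
min_expected = min(min(row) for row in expected)

print(min_expected)  # flag the test if this falls below 5 (or 10, by the stricter rule)
```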

Page 3: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regarding threats to internal validity

• One of the strengths of well-designed single-subject research is the use of repeated observations during each phase.

• Repeated observations during baseline and intervention (e.g., during an AB study) help rule out testing, instrumentation (somewhat), and regression. These effects would be unlikely to produce a marked change between experimental phases that is not apparent during repeated observations before and after the phase change.

Page 4: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regarding histograms

The difference between a histogram and a bar graph is that the variable on the x axis (which represents the score on the variable being graphed, as opposed to the frequency of observations) is conceptualized as being continuous in a histogram, whereas a bar graph represents discrete categories along the x axis.

Page 5: Correlation and Simple Linear Regression PSY440 June 10, 2008

About the exam….

Exam on Thursday will cover material from the first three weeks of class (lectures 1-6, or everything through Chi-Squared tests).

Emphasis of exam will be on generating results with computers (calculations by hand will not be emphasized), and interpreting the results.

Exam questions will be based mainly on lecture material and modeled on previous active learning experiences (homework and in-class demonstrations and exercises).

Knowledge of material on qualitative methods and experimental & single-subject design is expected.

Page 6: Correlation and Simple Linear Regression PSY440 June 10, 2008

Before we move on…..

Any questions?

Page 7: Correlation and Simple Linear Regression PSY440 June 10, 2008

Today’s lecture and next homework

Today’s lecture will cover correlation and simple (bivariate) regression.

Homework based on today’s lecture will be distributed on Thursday and due on Tuesday (June 17).

Page 8: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation

• A correlation is the association between scores on two variables
  – age and coordination skills in children: as kids get older, their motor coordination tends to improve
  – price and quality: generally, the more expensive something is, the higher its quality

Page 9: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation and Causality

Correlational research
– Correlation as a statistical procedure is generally used to measure the association between two (or more) continuous variables.
– Correlation as a kind of research design refers to observational studies in which there is no experimental manipulation.

Page 10: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation and Causality

Correlational research

– Not all “correlational” (i.e., observational) research designs use correlation as the statistical procedure for analyzing the data (example: comparison of verbal abilities between boys and girls - observational study - don’t manipulate gender - but probably analyze mean differences with t-tests).

– But: virtually all of the inferential statistical methods (including t-tests, ANOVA, ANCOVA) covered in 440 can be represented in terms of correlation/regression models (the general linear model; we’ll talk more about this later).

– Bottom line: Don’t confuse design with analytic strategy.

Page 11: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation and Causality

• Correlations (like other linear statistical models) describe relationships between variables, but DO NOT explain why the variables are related

Suppose that Dr. Steward finds that rates of spilled coffee and severity of plane turbulence are strongly positively correlated.

One might argue that turbulence causes coffee spills

One might argue that spilling coffee causes turbulence

Page 12: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation and Causation

Suppose that Dr. Cranium finds a positive correlation between head size and digit span (roughly the number of digits you can remember).

One might argue that the bigger your head, the larger your digit span.

(Example digit strings from the slide: 1, 21, 24, 1537)

One might argue that head size and digit span both increase with age (but head size and digit span aren’t directly related)

Page 13: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation and Causation

Observational research and correlational statistical methods (including regression and path analysis) can be used to compare competing models of causation, to see which model fits the data best.


Page 14: Correlation and Simple Linear Regression PSY440 June 10, 2008

Relationships between variables

• Properties of a statistical correlation
  – Form (linear or non-linear)
  – Direction (positive or negative)
  – Strength (none, weak, strong, perfect)

• To examine this relationship you should:
  – Make a scatterplot: a picture of the relationship
  – Compute the correlation coefficient: a numerical description of the relationship

Page 15: Correlation and Simple Linear Regression PSY440 June 10, 2008

Graphing Correlations

• Steps for making a scatterplot (scatter diagram):
1. Draw axes and assign variables to them
2. Determine the range of values for each variable and mark it on the axes
3. Mark a dot for each person’s pair of scores

Page 16: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplot

(Figure: scatterplot, X and Y axes each scaled 1–6)

• Plots one variable against the other
• Each point corresponds to a different individual

   X  Y
A  6  6

Page 17: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplot

(Figure: scatterplot, X and Y axes each scaled 1–6)

• Plots one variable against the other
• Each point corresponds to a different individual

   X  Y
A  6  6
B  1  2

Page 18: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplot

(Figure: scatterplot, X and Y axes each scaled 1–6)

• Plots one variable against the other
• Each point corresponds to a different individual

   X  Y
A  6  6
B  1  2
C  5  6

Page 19: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplot

(Figure: scatterplot, X and Y axes each scaled 1–6)

• Plots one variable against the other
• Each point corresponds to a different individual

   X  Y
A  6  6
B  1  2
C  5  6
D  3  4

Page 20: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplot

(Figure: scatterplot, X and Y axes each scaled 1–6)

• Plots one variable against the other
• Each point corresponds to a different individual

   X  Y
A  6  6
B  1  2
C  5  6
D  3  4
E  3  2

Page 21: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplot

(Figure: scatterplot, X and Y axes each scaled 1–6)

• Plots one variable against the other
• Each point corresponds to a different individual
• Imagine a line through the data points
• Useful for “seeing” the relationship: Form, Direction, and Strength

   X  Y
A  6  6
B  1  2
C  5  6
D  3  4
E  3  2

Page 22: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplots with Excel and SPSS

In SPSS: Graphs menu => Legacy Dialogs => Scatter/Dot => Simple Scatter. Click Define, and select which variable you want on the x axis and which on the y axis.

In Excel: Insert menu => Chart => XY (Scatter). Specify whether the variables are arranged in rows or columns and select the cells with the relevant data.

Page 23: Correlation and Simple Linear Regression PSY440 June 10, 2008

Form: linear vs. non-linear (slide shows an example scatterplot of each)

Page 24: Correlation and Simple Linear Regression PSY440 June 10, 2008

Direction

Positive:
• X & Y vary in the same direction
• As X goes up, Y goes up
• positive Pearson’s r

Negative:
• X & Y vary in opposite directions
• As X goes up, Y goes down
• negative Pearson’s r

(Figure: example scatterplots of a positive and a negative relationship)

Page 25: Correlation and Simple Linear Regression PSY440 June 10, 2008

Strength

• The strength of the relationship
  – Spread around the line (note the axis scales)
  – The correlation coefficient ranges from -1 to +1
    • Zero means “no relationship”.
    • The farther r is from zero, the stronger the relationship.
• In general, when we talk about correlation coefficients: correlation coefficient = Pearson’s product-moment coefficient = Pearson’s r = r.

Page 26: Correlation and Simple Linear Regression PSY440 June 10, 2008

Strength

r = +1.0: “perfect positive corr.” (r² = 100%)
r = 0.0: “no relationship” (r² = 0%)
r = -1.0: “perfect negative corr.” (r² = 100%)

(Number line from -1.0 through 0.0 to +1.0; the farther from zero, the stronger the relationship)

Page 27: Correlation and Simple Linear Regression PSY440 June 10, 2008

The Correlation Coefficient

• Formulas for the correlation coefficient:

Conceptual formula (z-scores): r = Σ(zX zY) / N

Common alternative (sums of squares): r = SP / √(SSX SSY), where SP = Σ(X − X̄)(Y − Ȳ)


Page 29: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 1: SP (Sum of the Products)

SP = Σ(X − X̄)(Y − Ȳ)

X  Y   (X − X̄)  (Y − Ȳ)  (X − X̄)(Y − Ȳ)
6  6
1  2
5  6
3  4
3  2
mean: X̄ = 3.6, Ȳ = 4.0

Page 30: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 1: SP (Sum of the Products)

SP = Σ(X − X̄)(Y − Ȳ)

X  Y   (X − X̄)
6  6    2.4  (= 6 − 3.6)
1  2   -2.6  (= 1 − 3.6)
5  6    1.4  (= 5 − 3.6)
3  4   -0.6  (= 3 − 3.6)
3  2   -0.6  (= 3 − 3.6)
mean: X̄ = 3.6, Ȳ = 4.0; the deviations sum to 0.0 (quick check)

Page 31: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 1: SP (Sum of the Products)

SP = Σ(X − X̄)(Y − Ȳ)

X  Y   (X − X̄)  (Y − Ȳ)
6  6    2.4      2.0  (= 6 − 4.0)
1  2   -2.6     -2.0  (= 2 − 4.0)
5  6    1.4      2.0  (= 6 − 4.0)
3  4   -0.6      0.0  (= 4 − 4.0)
3  2   -0.6     -2.0  (= 2 − 4.0)
mean: X̄ = 3.6, Ȳ = 4.0; each deviation column sums to 0.0 (quick check)

Page 32: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 1: SP (Sum of the Products)

SP = Σ(X − X̄)(Y − Ȳ)

X  Y   (X − X̄)  (Y − Ȳ)  (X − X̄)(Y − Ȳ)
6  6    2.4      2.0      4.8
1  2   -2.6     -2.0      5.2
5  6    1.4      2.0      2.8
3  4   -0.6      0.0      0.0
3  2   -0.6     -2.0      1.2
mean: X̄ = 3.6, Ȳ = 4.0; deviation sums 0.0; SP = 14.0

Page 33: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 2: SSX & SSY

Page 34: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 2: SSX & SSY

(X − X̄)² column: 5.76, 6.76, 1.96, 0.36, 0.36 → SSX = 15.20

Page 35: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 2: SSX & SSY

(X − X̄)² column: 5.76, 6.76, 1.96, 0.36, 0.36 → SSX = 15.20
(Y − Ȳ)² column: 4.0, 4.0, 4.0, 0.0, 4.0 → SSY = 16.0

Page 36: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 3: compute r

r = SP / √(SSX SSY)

Page 37: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using SP)

• Step 3: compute r

From the table: SP = 14.0, SSX = 15.20, SSY = 16.0

r = SP / √(SSX SSY)

Page 38: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r

• Step 3: compute r

SP = 14.0, SSX = 15.20, SSY = 16.0

r = SP / √(SSX SSY)

Page 39: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r

• Step 3: compute r

r = 14 / √(SSX SSY), with SSX = 15.20 and SSY = 16.0

Page 40: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r

• Step 3: compute r

r = 14 / √(SSX × 16), with SSX = 15.20

Page 41: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r

• Step 3: compute r

r = 14 / √(15.2 × 16)

Page 42: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r

• Step 3: compute r

r = 14 / √(15.2 × 16) = 0.89

(Scatterplot of the five data points, axes 1–6)
• Appears linear
• Positive relationship
• Fairly strong relationship: .89 is far from 0, near +1
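The hand computation above can be mirrored in plain Python; a sketch using the five example pairs, with no libraries beyond the standard `math` module:

```python
import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mean_x = sum(X) / n  # 3.6
mean_y = sum(Y) / n  # 4.0

# Sum of products of deviations, and sums of squared deviations
SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # 14.0
SSX = sum((x - mean_x) ** 2 for x in X)                       # 15.2
SSY = sum((y - mean_y) ** 2 for y in Y)                       # 16.0

r = SP / math.sqrt(SSX * SSY)
print(round(r, 3))  # 0.898 (the slides round to 0.89)
```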

Page 43: Correlation and Simple Linear Regression PSY440 June 10, 2008

The Correlation Coefficient

• Formulas for the correlation coefficient:

Conceptual formula (z-scores): r = Σ(zX zY) / N

Common alternative (sums of squares): r = SP / √(SSX SSY), where SP = Σ(X − X̄)(Y − Ȳ)

Page 44: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 1: compute the standard deviation for X and Y (note: keep track of sample or population)
• For this example we will assume the data are from a population.

X  Y
6  6
1  2
5  6
3  4
3  2

Page 45: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 1: compute the standard deviation for X and Y (note: keep track of sample or population)
• For this example we will assume the data are from a population.

X deviations: 2.4, -2.6, 1.4, -0.6, -0.6; squared: 5.76, 6.76, 1.96, 0.36, 0.36 → SSX = 15.20

σX = √(SSX / N) = √(15.2 / 5) = 1.74

Page 46: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 1: compute the standard deviation for X and Y (note: keep track of sample or population)
• For this example we will assume the data are from a population.

Y deviations: 2.0, -2.0, 2.0, 0.0, -2.0; squared: 4.0, 4.0, 4.0, 0.0, 4.0 → SSY = 16.0

σY = √(SSY / N) = √(16.0 / 5) = 1.79

Std devs: sX = 1.74, sY = 1.79

Page 47: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 2: compute z-scores

zX = (X − X̄) / sX; first row: zX = 2.4 / 1.74 = 1.38

Page 48: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 2: compute z-scores

zX column: 1.38, -1.49, 0.8, -0.34, -0.34 (sum = 0.0, quick check)

Page 49: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 2: compute z-scores

zY = (Y − Ȳ) / sY; first row: zY = 2.0 / 1.79 = 1.1

Page 50: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 2: compute z-scores

zX column: 1.38, -1.49, 0.8, -0.34, -0.34
zY column: 1.1, -1.1, 1.1, 0.0, -1.1
(each column sums to 0.0, quick check)

Page 51: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 3: compute r

r = Σ(zX zY) / N; first product: (1.38)(1.1) = 1.52

Page 52: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 3: compute r

zX zY products: 1.52, 1.64, 0.88, 0.0, 0.37 → Σ(zX zY) = 4.41

r = Σ(zX zY) / N = 4.41 / 5 = 0.88

Page 53: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Pearson’s r (using z-scores)

• Step 3: compute r

r = Σ(zX zY) / N = 0.88

(Scatterplot of the five data points, axes 1–6)
• Appears linear
• Positive relationship
• Fairly strong relationship: .88 is far from 0, near +1
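A sketch of the z-score route in Python, using population standard deviations (dividing by N, as in the example). Without intermediate rounding it gives the same 0.898 as the SP formula; the slides' 0.88 reflects rounding the z-scores to two decimals:

```python
import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
# Population standard deviations (divide by N, not N - 1)
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in X) / n)  # ~1.74
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / n)  # ~1.79

zx = [(x - mean_x) / sd_x for x in X]
zy = [(y - mean_y) / sd_y for y in Y]

r = sum(a * b for a, b in zip(zx, zy)) / n  # r = sum(zx * zy) / N
print(round(r, 3))  # 0.898
```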

Page 54: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlation in Research Articles

• Correlation matrix: a display of the correlations among more than two variables

(Example: a correlation matrix from an article on acculturation)

• Why have a “-”?
• Why is only half the table filled with numbers?

Page 55: Correlation and Simple Linear Regression PSY440 June 10, 2008

Correlations with SPSS & Excel

SPSS: Analyze => Correlate => Bivariate. Then select the variables you want correlation(s) for (you can select just one pair, or many variables to get a correlation matrix).

Try this with height and shoe size in our data. Now try it with height, shoe size, mother’s height, and number of shoes owned.

Excel: Arrange the data for two variables in two columns or rows and use the formula bar to request a correlation: =CORREL(array1,array2)

Page 56: Correlation and Simple Linear Regression PSY440 June 10, 2008

SPSS correlation output

(Screenshot of SPSS correlation output)

Page 57: Correlation and Simple Linear Regression PSY440 June 10, 2008

Invalid inferences from correlations

Why you should always look at the scatter plot before computing (and certainly before interpreting Pearson’s r):

• Correlations are greatly affected by the range of scores in the data
  – Consider the height and age relationship
  – Restricted range example from the text (SAT and GPA)
• Extreme scores can have dramatic effects on correlations
  – A single extreme score can radically change r, especially when your sample is small.
• Relations between variables may differ for subgroups, resulting in misleading r values for aggregate data
• Curvilinear relations are not captured by Pearson’s r

Page 58: Correlation and Simple Linear Regression PSY440 June 10, 2008

What to do about a curvilinear pattern

• If the pattern is monotonically increasing or decreasing, convert the scores to ranks and compute r (using the same formula) on the rank scores. The result is called Spearman’s rank correlation coefficient (Spearman’s rho) and can be requested in your SPSS output by checking the appropriate box when you select the variables for which you want correlations.

• If pattern is more complicated (u-shaped or s-shaped, for example), consult more advanced statistics resources.
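The rank-then-correlate idea can be sketched in plain Python. The `average_ranks` helper and the sample data below are illustrative, not from the slides; SPSS computes the same quantity when you check the Spearman box:

```python
import math

def average_ranks(values):
    """Rank scores 1..n, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over the run of ties
        avg = (i + j) / 2 + 1           # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sp  = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    ssx = sum((x - mx) ** 2 for x in X)
    ssy = sum((y - my) ** 2 for y in Y)
    return sp / math.sqrt(ssx * ssy)

# A curved but monotone relationship: Pearson's r on the raw scores is < 1,
# but Spearman's rho (Pearson's r on the ranks) is exactly 1.
X = [1, 2, 3, 4, 5]
Y = [1, 4, 9, 16, 25]
rho = pearson(average_ranks(X), average_ranks(Y))
print(rho)  # 1.0
```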

Page 59: Correlation and Simple Linear Regression PSY440 June 10, 2008

Coefficient of determination

• When considering "how good" a relationship is, we really should consider r2 (coefficient of determination), not just r.

• This coefficient tells you the percent of the variance in one variable that is explained or accounted for by the other variable.

Page 60: Correlation and Simple Linear Regression PSY440 June 10, 2008

From Correlation to Regression

• With correlation, we can examine whether variables X & Y are related.
• With regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.

Page 61: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• Last time: “it doesn’t matter which variable goes on the X-axis or the Y-axis”
• For regression this is NOT the case
• The variable that you are predicting goes on the Y-axis: the criterion or “dependent” variable (e.g., quiz performance)
• The variable that you are making the prediction from goes on the X-axis: the predictor or “independent” variable (e.g., hours of study)

(Figure: scatterplot with quiz performance on the Y-axis and hours of study on the X-axis)

Page 62: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• Correlation: “Imagine a line through the points”
• But there are lots of possible lines
• One line is the “best fitting line”
• Regression: compute the equation corresponding to this “best fitting line”

(Figure: scatterplot of quiz performance vs. hours of study with several candidate lines)

Page 63: Correlation and Simple Linear Regression PSY440 June 10, 2008

The equation for a line

• A brief review of geometry

Y = (X)(slope) + (intercept)

(Figure: a line with intercept 2.0; Y = intercept when X = 0)

Page 64: Correlation and Simple Linear Regression PSY440 June 10, 2008

The equation for a line

• A brief review of geometry

Y = (X)(slope) + (intercept)

slope = (change in Y) / (change in X); here the line rises 1 unit in Y for every 2 units in X, so slope = 0.5 and intercept = 2.0

Page 65: Correlation and Simple Linear Regression PSY440 June 10, 2008

The equation for a line

• A brief review of geometry

Y = (X)(0.5) + 2.0

(Figure: the line plotted over axes 0–6)

Page 66: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• A brief review of geometry
• Consider a perfect correlation: Y = (X)(0.5) + (2.0)
• Can make specific predictions about Y based on X:
  X = 5, Y = ?  Y = (5)(0.5) + (2.0) = 2.5 + 2 = 4.5

Page 67: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• Consider a less-than-perfect correlation
• The line still represents the predicted values of Y given X:
  Y = (X)(0.5) + (2.0); at X = 5, Y = (5)(0.5) + (2.0) = 4.5

(Figure: scatterplot with points scattered around the line)

Page 68: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• The “best fitting line” is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points)
• Rather than compare the errors from different lines and pick the best, we will directly compute the equation for the best fitting line

(Figure: scatterplot with the best-fitting line and the residual distances to each point)

Page 69: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• The linear model

Y = intercept + slope(X) + error

μY = β0 + β1X + ε

• Betas (β) are sometimes called parameters. They come in two types:
  – unstandardized: μY = β0 + β1X + ε
  – standardized: predicted zY = (β)(zX) + ε

Now let’s go through an example computing these things.

Page 70: Correlation and Simple Linear Regression PSY440 June 10, 2008

Scatterplot

• Using the dataset from our correlation example

X  Y
6  6
1  2
5  6
3  4
3  2

(Scatterplot of the five points, axes 1–6)

Page 71: Correlation and Simple Linear Regression PSY440 June 10, 2008

From when we computed Pearson’s r

X  Y   (X − X̄)  (Y − Ȳ)  (X − X̄)(Y − Ȳ)  (X − X̄)²  (Y − Ȳ)²
6  6    2.4      2.0      4.8             5.76       4.0
1  2   -2.6     -2.0      5.2             6.76       4.0
5  6    1.4      2.0      2.8             1.96       4.0
3  4   -0.6      0.0      0.0             0.36       0.0
3  2   -0.6     -2.0      1.2             0.36       4.0
mean: X̄ = 3.6, Ȳ = 4.0;  SP = 14.0,  SSX = 15.20,  SSY = 16.0

Page 72: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing regression line (with raw scores)

SP = 14.0, SSX = 15.20, SSY = 16.0; X̄ = 3.6, Ȳ = 4.0

slope = b = SP / SSX = 14 / 15.2 = 0.92

intercept = a = Ȳ − bX̄ = 4.0 − (0.92)(3.6) = 0.688
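A sketch of the slope and intercept computation in Python (same five pairs). Without intermediate rounding the intercept comes out near 0.684 rather than 0.688, which comes from using the rounded slope of 0.92:

```python
X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # 14.0
SSX = sum((x - mean_x) ** 2 for x in X)                       # 15.2

b = SP / SSX             # slope, ~0.921
a = mean_y - b * mean_x  # intercept, ~0.684

predicted = [b * x + a for x in X]  # points on the regression line
print(round(b, 3), round(a, 3))     # 0.921 0.684
```

Note that the line passes through the point of means: b·X̄ + a = Ȳ, as the later slide on the two means points out.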

Page 73: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing regression line (with raw scores)

slope = b = 0.92; intercept = 0.688

Ŷ = 0.92X + 0.688

(Figure: the regression line drawn through the scatterplot of the five points)

Page 74: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing regression line (with raw scores)

slope = b = 0.92; intercept = 0.688

Ŷ = 0.92X + 0.688

• The two means (X̄, Ȳ) will be on the line.

(Figure: the regression line passing through the point (X̄, Ȳ))

Page 75: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing regression line (standardized, using z-scores)

• Sometimes the regression equation is standardized.
  – Computed based on z-scores rather than raw scores

zX: 1.38, -1.49, 0.8, -0.34, -0.34  (sX = 1.74)
zY: 1.1, -1.1, 1.1, 0.0, -1.1  (sY = 1.79)

Page 76: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing regression line (standardized, using z-scores)

• Sometimes the regression equation is standardized.
  – Computed based on z-scores rather than raw scores
• Prediction model
  – Predicted z score (on the criterion variable) = standardized regression coefficient multiplied by the z score on the predictor variable
  – Formula: predicted zY = (β)(zX)
  – The standardized regression coefficient is β
• In bivariate prediction, β = r

Page 77: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing regression line (with z-scores)

slope = β = r = 0.89
intercept = 0.0

predicted zY = (β)(zX)

zX: 1.38, -1.49, 0.8, -0.34, -0.34
zY: 1.1, -1.1, 1.1, 0.0, -1.1

(Figure: standardized scatterplot of zX vs. zY, axes -2 to 2, with the regression line through the origin)

Page 78: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• Also need a measure of error
• The linear equation isn’t the whole thing:

Y = intercept + slope(X) + error

• Same line, but different relationships (strength difference): both panels show Y = X(0.5) + (2.0) + error

(Figure: two scatterplots with the same line but different spread around it)

Page 79: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression

• Error
  – Actual score minus the predicted score
• Measures of error
  – r² (r-squared)
  – Proportionate reduction in error = (SStotal − SSerror) / SStotal
    • Note: total squared error when predicting from the mean = SStotal = SSY
    • Squared error using the prediction model = sum of the squared residuals = SSresidual = SSerror

Page 80: Correlation and Simple Linear Regression PSY440 June 10, 2008

R-squared

• r² represents the percent variance in Y accounted for by X

(Figure: two scatterplots)
  r = 0.8, r² = 0.64: 64% variance explained
  r = 0.5, r² = 0.25: 25% variance explained

Page 81: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

• Compute the difference between the predicted values and the observed values (the “residuals”)
• Square the differences
• Add up the squared differences
• Sum of the squared residuals = SSresidual = SSerror

(Figure: scatterplot with vertical segments from each point to the line)

Page 82: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

Ŷ = 0.92X + 0.688 gives the predicted values of Y (points on the line).

X  Y   Ŷ
6  6
1  2
5  6
3  4
3  2
mean: X̄ = 3.6, Ȳ = 4.0

• Sum of the squared residuals = SSresidual = SSerror

Page 83: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

Ŷ = 0.92X + 0.688; first row: Ŷ = (0.92)(6) + 0.688 = 6.2

• Sum of the squared residuals = SSresidual = SSerror

Page 84: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

Ŷ = 0.92X + 0.688:

X  Y   Ŷ
6  6   6.2   (= (0.92)(6) + 0.688)
1  2   1.6   (= (0.92)(1) + 0.688)
5  6   5.3   (= (0.92)(5) + 0.688)
3  4   3.45  (= (0.92)(3) + 0.688)
3  2   3.45  (= (0.92)(3) + 0.688)

• Sum of the squared residuals = SSresidual = SSerror

Page 85: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

(Figure: scatterplot with the line Ŷ = 0.92X + 0.688; the predicted values 6.2, 1.6, 5.3, 3.45, 3.45 lie on the line)

• Sum of the squared residuals = SSresidual = SSerror

Page 86: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

X  Y   Ŷ     (Y − Ŷ)
6  6   6.2   -0.20  (= 6 − 6.2)
1  2   1.6    0.40  (= 2 − 1.6)
5  6   5.3    0.70  (= 6 − 5.3)
3  4   3.45   0.55  (= 4 − 3.45)
3  2   3.45  -1.45  (= 2 − 3.45)
The residuals sum to 0.00 (quick check)

• Sum of the squared residuals = SSresidual = SSerror

Page 87: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

X  Y   Ŷ     (Y − Ŷ)  (Y − Ŷ)²
6  6   6.2   -0.20     0.04
1  2   1.6    0.40     0.16
5  6   5.3    0.70     0.49
3  4   3.45   0.55     0.30
3  2   3.45  -1.45     2.10
residual sum 0.00; SSerror = 3.09

• Sum of the squared residuals = SSresidual = SSerror

Page 88: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror = 3.09
• Total squared error when predicting from the mean: SSY = Σ(Y − Ȳ)² = 4.0 + 4.0 + 4.0 + 0.0 + 4.0 = 16.0

Page 89: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

• Sum of the squared residuals = SSresidual = SSerror
• The standard error of estimate (from the textbook) is analogous to a standard deviation. It is the square root of the average squared error: sx.y = √(SSerror / df)
• The standard error of estimate is also related to r² and to the standard deviation of Y: sx.y = sy √(1 − r²)
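Both expressions for the standard error of estimate can be checked numerically. A sketch with the example data and a common divisor in both formulas; the two agree because SSerror = SSY(1 − r²):

```python
import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
SSX = sum((x - mean_x) ** 2 for x in X)
SSY = sum((y - mean_y) ** 2 for y in Y)

b = SP / SSX              # unrounded slope
a = mean_y - b * mean_x   # unrounded intercept
ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(X, Y))  # ~3.11

r = SP / math.sqrt(SSX * SSY)
sd_y = math.sqrt(SSY / n)            # population SD of Y (~1.79, as on the earlier slides)

se_a = math.sqrt(ss_error / n)       # sqrt(SS_error / N)
se_b = sd_y * math.sqrt(1 - r ** 2)  # s_y * sqrt(1 - r^2)
# These agree because SS_error = SS_Y * (1 - r^2); the same identity holds
# if df = N - 2 replaces N in both places, as in the textbook's formula.
print(round(se_a, 3), round(se_b, 3))
```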

Page 90: Correlation and Simple Linear Regression PSY440 June 10, 2008

Computing Error around the line

SSerror = 3.09; SSY (= SStotal) = 16.0

• Proportionate reduction in error = (SStotal − SSerror) / SStotal = (16.0 − 3.09) / 16.0 = 0.81

• Also (like r²) represents the percent variance in Y accounted for by X
• In fact, it is mathematically identical to r²
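That identity can be confirmed directly; a sketch with the example data:

```python
import math

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
SSX = sum((x - mean_x) ** 2 for x in X)
SSY = sum((y - mean_y) ** 2 for y in Y)  # SS_total when predicting from the mean

b = SP / SSX
a = mean_y - b * mean_x
ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(X, Y))

pre = (SSY - ss_error) / SSY       # proportionate reduction in error
r_squared = SP ** 2 / (SSX * SSY)  # r^2 computed from the correlation formula
print(round(pre, 3), round(r_squared, 3))  # 0.806 0.806
```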

Page 91: Correlation and Simple Linear Regression PSY440 June 10, 2008

Seeing patterns in the error

• Residual plots

• The sum of the residuals should always equal 0 (as should their mean).

– The least squares regression line splits the data in half: half of the error is above the line and half is below the line.

• In addition to summing to zero, we also want the residuals to be randomly distributed.

– That is, there should be no pattern to the residuals.
– If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables.

• Residual plots are very useful tools for examining the relationship even further.

– These are basically scatterplots of the residuals (Yobs − Ypred) against the explanatory (X) variable.

(Note: the examples actually plot residuals that have been transformed into z-scores.)
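Both properties described above (residuals summing to zero, and the z-scoring used in the example plots) can be checked directly. A minimal sketch, assuming the worked example's data from earlier slides:

```python
import math

xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Exact least-squares fit
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

residuals = [y - (b * x + a) for x, y in zip(xs, ys)]

# Least-squares residuals sum (and average) to zero
print(round(sum(residuals), 10))

# z-scored residuals, as plotted in the slides' residual plots
sd = math.sqrt(sum(e ** 2 for e in residuals) / n)
z_resid = [e / sd for e in residuals]
print([round(z, 2) for z in z_resid])
```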

Page 92: Correlation and Simple Linear Regression PSY440 June 10, 2008

Seeing patterns in the error

• The residual plot shows that the residuals fall randomly above and below the line. Critically, there doesn't seem to be a discernible pattern to the residuals.

Residual plotScatter plot

• The scatterplot shows a nice linear relationship.

Page 93: Correlation and Simple Linear Regression PSY440 June 10, 2008

Seeing patterns in the error

Residual plot

• The scatterplot also shows a nice linear relationship.

• The residual plot shows that the residuals get larger as X increases.

• This suggests that the variability around the line is not constant across values of X.

• This is referred to as a violation of homogeneity of variance.

Scatter plot

Page 94: Correlation and Simple Linear Regression PSY440 June 10, 2008

Seeing patterns in the error

• The residual plot suggests that a non-linear relationship may be more appropriate (see how a curved pattern appears in the residual plot).

Residual plotScatter plot

• The scatterplot shows what may be a linear relationship.

Page 95: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression in SPSS

• Running the analysis in SPSS is pretty easy

– Analyze: Regression: Linear
– X or predictor variable(s) go into the ‘independent variable’ field
– Y or predicted variable goes into the ‘dependent variable’ field
– You can save the residuals as a new variable to plot the residuals against X as shown in the previous slide.

• You get a lot of output

Page 96: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression in SPSS

• The variables in the model

• r

• r2

• Unstandardized coefficients

– Slope (indep var name)
– Intercept (constant)

• Standardized coefficients

• We’ll get back to these numbers in a few weeks

Page 97: Correlation and Simple Linear Regression PSY440 June 10, 2008

In Excel

• With the Data Analysis “ToolPak” add-in you can perform regression analysis.

• With the standard software package, you can get the bivariate correlation (which is the same as the standardized regression coefficient), you can create a scatterplot, and you can request a trend line (as we did when plotting data for single-subject research), which is a regression line (what is y and what is x in that case?).

Page 98: Correlation and Simple Linear Regression PSY440 June 10, 2008

Considerations:

• The slope is dependent on the variance of x and y.

• Standardized slope = r (weaker associations between x and y result in flatter slopes).

• This means that as the association becomes weaker, your prediction of y is more influenced by the mean of y than by changes in x.

• Regression to the mean is a special case of this…
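The claim that the standardized slope equals r can be verified numerically. A sketch (not from the slides), again using the worked example's data: standardizing x and y to z-scores and refitting yields a slope equal to the Pearson correlation.

```python
import math

xs = [6, 1, 5, 3, 3]
ys = [6, 2, 6, 4, 2]
n = len(xs)

def zscores(vals):
    """Convert a list of scores to z-scores (population SD, denominator n)."""
    m = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
    return [(v - m) / sd for v in vals]

zx, zy = zscores(xs), zscores(ys)

# Slope of the regression of zy on zx (means are 0, so b = Σ(zx*zy) / Σ(zx²))
b_std = sum(x * y for x, y in zip(zx, zy)) / sum(x ** 2 for x in zx)

# Pearson r computed directly from sums of squares and cross-products
mx, my = sum(xs) / n, sum(ys) / n
sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sp / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                   sum((y - my) ** 2 for y in ys))

print(round(b_std, 4), round(r, 4))   # the two values are equal
```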

Page 99: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression to the mean

• Sometimes reliability is represented as r values (test-retest, split-half).

• If you have a test with low test-retest reliability, your score on the first administration is only weakly related to your score on the second administration. It is influenced by a considerable amount of error variance.

• Any time you take a measurement, the observed score reflects your true score plus error: Score(observed) = Score(true) + Error.

• The further away your observed score gets from the mean score for the test, the more likely it is that the distance from the mean is due at least in part to error.

• If error is randomly distributed, then your next observed score is more likely to be closer to the mean than farther from the mean.

Page 100: Correlation and Simple Linear Regression PSY440 June 10, 2008

Regression to the mean

If x = obs1 and y = obs2, and the test-retest reliability of your measure is relatively low (say, r = .5), then your first score only helps predict your second score somewhat. The standardized regression equation is

y = .5x + error

On a standardized test with mean = 0 and sd = 1, if you get a score above the mean, say 1.2, the first time you take the test (obs1 = x = 1.2), and the test-retest reliability is only .5, your predicted score the next time you take the test is .5 × 1.2 = .6. You are more likely to score closer to the mean. This doesn’t mean that you will definitely score closer to the mean; it just means that, on average, people who score 1.2 sd above the mean the first time tend to have scores closer to .6 the next time they are tested. This is because the test isn’t that reliable, and the original observation of 1.2 includes error. For the average person with that score (but not for everyone), the error is part of what accounts for the difference between the score and the mean.

If your test has higher reliability, then the regression to the mean effect is reduced.
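The prediction described above can be expressed as a tiny function (a sketch, not from the slides): with standardized scores, the predicted retest score is simply the reliability r times the first score.

```python
def predicted_retest(z_first, reliability):
    """Predicted standardized retest score, given test-retest reliability r."""
    return reliability * z_first

# Slide's example: first score 1.2 SD above the mean, reliability r = .5
print(predicted_retest(1.2, 0.5))   # 0.6

# Higher reliability shrinks the regression-to-the-mean effect
print(predicted_retest(1.2, 0.9))
```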

Page 101: Correlation and Simple Linear Regression PSY440 June 10, 2008

Multiple Regression

• Multiple regression prediction models

Y = β0 + β1X1 + β2X2 + β3X3 + ε

“fit” = β0 + β1X1 + β2X2 + β3X3    “residual” = ε
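A multiple-regression prediction just extends the bivariate case to several predictors. This sketch uses made-up (hypothetical) coefficient values purely to illustrate the "fit" part of the model; they do not come from the slides.

```python
def predict(x1, x2, x3, b0, b1, b2, b3):
    """The fitted ('fit') part of the multiple regression model;
    the residual (epsilon) is the gap between this and observed Y."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x3

# Hypothetical coefficients, for illustration only
y_hat = predict(2.0, 1.0, 3.0, b0=0.5, b1=0.9, b2=-0.3, b3=0.2)
print(y_hat)   # 0.5 + 1.8 - 0.3 + 0.6 ≈ 2.6
```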

Page 102: Correlation and Simple Linear Regression PSY440 June 10, 2008

Prediction in Research Articles

• Bivariate prediction models rarely reported

• Multiple regression results commonly reported