
Correlation and regression

Lecture: Correlation; Regression
Exercise: Group tasks on correlation and regression; Free experiment supervision/help


Page 1:
Page 2:

Lecture: Correlation; Regression

Exercise: Group tasks on correlation and regression; Free experiment supervision/help

Page 3:

Last week we covered four types of non-parametric statistical tests

They make no assumptions about the data's characteristics.

Use them if any of the three properties below are true:

(a) the data are not normally distributed (e.g. skewed);

(b) the data show inhomogeneity of variance;

(c) the data are measurements on an ordinal scale (can be ranked).

Page 4:

Non-parametric tests make few assumptions about the distribution of the data being analyzed

They get around this by not using the raw scores, but by ranking them: the lowest score gets rank 1, the next lowest rank 2, and so on. How the ranking is carried out differs from test to test, but the principle is the same.

The analysis is carried out on the ranks, not the raw data

Ranking data means we lose information – we do not know the distance between the ranks

This means that non-parametric tests are less powerful than parametric tests: they are less likely to detect an effect that is actually present in our data (an increased chance of a Type II error).
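The ranking principle described above can be sketched in a few lines of Python. This is an illustration, not the exact procedure of any particular test; ties are given the average of the ranks they would span, which is the usual convention:

```python
def rank(scores):
    """Lowest score gets rank 1, next lowest rank 2, and so on.
    Tied scores share the average of the ranks they would occupy."""
    ordered = sorted(scores)
    return [
        # average 1-based position of all occurrences of s in the sorted list
        sum(i + 1 for i, v in enumerate(ordered) if v == s) / ordered.count(s)
        for s in scores
    ]

print(rank([12, 5, 20, 5]))  # -> [3.0, 1.5, 4.0, 1.5]
```

The analysis is then carried out on these ranks, not on the raw scores, which is why the distances between the original scores are lost.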

Page 5:

Examples of parametric tests and their non-parametric equivalents:

Parametric test:                      Non-parametric counterpart:
Pearson correlation                   Spearman's correlation
(No parametric equivalent)            Chi-square test
Independent-means t-test              Mann-Whitney test
Dependent-means t-test                Wilcoxon test
One-way independent-measures ANOVA    Kruskal-Wallis test
One-way repeated-measures ANOVA       Friedman's test

Page 6:

Just as with parametric tests, which non-parametric test to use depends on the experimental design (repeated measures or independent groups) and on the number of IVs and levels per IV

Page 7:

Mann-Whitney: Two conditions, two groups, each participant one score

Wilcoxon: Two conditions, one group, each participant two scores (one per condition)

Kruskal-Wallis: 3+ conditions, different people in all conditions, each participant one score

Friedman's ANOVA: 3+ conditions, one group, each participant 3+ scores

Page 8:

Which nonparametric test?

1. Differences in fear ratings for 3-, 5- and 7-year-olds in response to sinister noises from under their bed

2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams

3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone

4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners.

Consider: How many groups? How many levels of IV/conditions?

Page 9:

1. Differences in fear ratings for 3-, 5- and 7-year-olds in response to sinister noises from under their bed [3 groups, one score each, 3 conditions – Kruskal-Wallis]

2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams [one group, 4 scores each, 4 conditions – Friedman's ANOVA]

3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone [one group, 4 scores each, 4 conditions – Friedman's ANOVA]

4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners. [4 groups, one score each – Kruskal-Wallis]

Page 10:
Page 11:

We often want to know if there is a relationship between two variables

Do people who drive fast cars get into accidents more often?

Do students who give the teacher red apples get higher grades?

Do blondes have more fun? Etc.

Page 12:

Correlation coefficient:

A succinct measure of the strength of the relationship between two variables (e.g. height and weight, age and reaction time, IQ and exam score).

Page 13:

A correlation is a measure of the linear relationship between variables

Two variables can be related in different ways:

1) positively related: the faster the car, the more accidents

2) not related: the speed of the car has no bearing on the number of accidents

3) negatively related: the faster the car, the fewer accidents

Page 14:

We describe the relationship between variables statistically by looking at two measures: the covariance and the correlation coefficient.

We represent relationships graphically using scatterplots

The simplest way to decide if two variables are associated is to evaluate if they covary

Recall: the variance of one variable is the average amount by which the scores in the sample vary from the mean – if the variance is high, the scores in the sample are very different from the mean

Page 15:

[Figure: two dot plots (axes 0-6) showing low and high variance around the mean of a sample]

Page 16:

If we want to know whether two variables are related, we want to know whether changes in the scores on one variable are met with similar changes in the other variable.

Therefore, when one variable deviates from its mean, we would expect the scores on the other variable to deviate from their mean in a similar way.

Example: We take 5 people, show them a commercial, and measure how many packets of sweets they buy the week after.

If the number of times the commercial was seen is related to how many packets of sweets were bought, the scores should vary around the means of the two samples in a similar way.

Page 17:

[Figure: scatterplot of the two samples – looks like a relationship exists]

Page 18:

How do we calculate the exact similarity between the pattern of difference in the two variables (samples)?

We calculate covariance

Step 1: For each pair of scores, multiply the deviation from the mean in one sample by the corresponding deviation in the other sample.

Note that if the two deviations are both positive or both negative, we get a positive value (+ * + = + and - * - = +).

If one deviation is negative and the other positive, we get a negative value (+ * - = -).

Page 19:

Step 2: Divide the sum of these products by the number of observations (scores) minus 1: N - 1.

This is the same equation as for calculating the variance, except that we multiply each deviation by the corresponding deviation of the score in the other sample, rather than squaring the deviations within one sample.
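The two steps can be sketched in Python. The numbers below are invented for illustration (five people: number of times the commercial was seen vs. packets of sweets bought), so only the method, not the data, comes from the lecture:

```python
def covariance(x, y):
    """Step 1: multiply each deviation from the mean in one sample by the
    corresponding deviation in the other. Step 2: divide the sum by N - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

viewings = [5, 4, 4, 6, 8]            # hypothetical scores for 5 people
packets = [8, 9, 10, 13, 15]
print(covariance(viewings, packets))  # positive: the scores vary together
```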

Page 20:

A positive covariance indicates that as one variable deviates from its mean, the other deviates in the same direction: faster cars are involved in more accidents.

A negative covariance indicates that as one variable deviates from its mean, the other deviates in the opposite direction: faster cars are involved in fewer accidents.

Page 21:

Covariance, however, depends on the scale of measurement used – it is not a scale-independent measure.

To overcome this problem we standardize the covariance, so that covariances are comparable across all experiments, no matter what type of measure we use.

Page 22:

We do this by converting the differences between scores and means into standard deviations.

Recall: any score can be expressed in terms of how many SDs it lies away from the mean (the z-score).

We therefore divide the covariance by the SDs of both samples – there are two samples, so we need the SD of each to standardize the covariance.

Page 23:

This standardized covariance is known as the correlation coefficient.

It is also called Pearson's correlation coefficient, and it is one of the most important formulas in statistics.
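A minimal sketch of this standardization in Python: the covariance divided by the product of the two sample SDs. The data are the same invented illustration as before, and the rescaling at the end is there to show why dividing by the SDs makes the measure scale-independent:

```python
def pearson_r(x, y):
    """Standardized covariance: cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
    return cov / (sx * sy)

# Unlike the raw covariance, r does not change if we rescale a variable:
x, y = [5, 4, 4, 6, 8], [8, 9, 10, 13, 15]
y_rescaled = [v * 250 for v in y]     # same data on a different scale
print(round(pearson_r(x, y), 3), round(pearson_r(x, y_rescaled), 3))  # same value
```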

Page 24:

When we standardize covariance we end up with a value that lies between -1 and +1

If r = +1 , we have a perfect positive relationship

Page 25:

+1 (perfect positive correlation: as X increases, so does Y):

Y

X

Page 26:

When we standardize covariance we end up with a value that lies between -1 and +1

If r = -1 , we have a perfect negative relationship

Page 27:

Perfect negative correlation: As X increases, Y decreases, or vice versa

Y

X

Page 28:

If r = 0 there is no correlation between the two samples – changes in sample X are not associated with systematic changes in sample Y, or vice versa.

Recall that we can use the correlation coefficient as a measure of effect size: an r of +/- 0.1 is a small effect, 0.3 a medium effect and 0.5 a large effect.

Page 29:
Page 30:

Before performing correlational analysis we plot a scatterplot to get an idea about how the variables covary

A scatterplot is a graph of the scores of one sample (variable) vs. the scores of another sample. Further variables can be included in a 3D plot.

Page 31:

A scatterplot tells us: (a) whether there is a relationship between the variables; (b) what kind of relationship it is; and (c) whether any cases (scores) are markedly different – outliers – these cause problems.

We normally plot the IV on the x-axis, and the DV on the y-axis

Page 32:

A 2D scatterplot

Page 33:

A 3D scatterplot

Page 34:

Using SPSS to obtain scatterplots: (a) simple scatterplot: Graphs > Legacy Dialogs > Scatter/Dot...

Page 35:

Using SPSS to obtain scatterplots: (a) simple scatterplot via the Chart Builder: Graphs > Chart Builder...

1. Pick Scatter/Dot.
2. Drag the "Simple scatter" icon into the chart preview window.
3. Drag the X and Y variables into the x-axis and y-axis boxes in the chart preview window.

Page 36:

Using SPSS to obtain scatterplots: (b) scatterplot with regression line: Analyze > Regression > Curve Estimation...

Model Summary and Parameter Estimates
Dependent variable: memory score. The independent variable is number of vitamin treatments.

Equation: Linear
Model Summary: R Square = .736, F = 16.741, df1 = 1, df2 = 6, Sig. = .006
Parameter Estimates: Constant = 17.167, b1 = 8.333

"Constant" is the intercept with the y-axis; "b1" is the slope.

Page 37:

Having looked at the data visually, we can conduct a correlation analysis in SPSS (the procedure is on page 123 in chapter 4 of Field's book in the compendium).

Note: there are two types of correlation: bivariate and partial. A bivariate correlation is a correlation between two variables; a partial correlation is the same, but controls for one or more additional variables.

Page 38:

Using SPSS to obtain correlations:

Analyze > Correlate > Bivariate...

Correlations (Pearson), number of vitamin treatments vs. memory score:
Pearson Correlation = .858**, Sig. (2-tailed) = .006, N = 8

** Correlation is significant at the 0.01 level (2-tailed).

Correlations (Spearman's rho), number of vitamin treatments vs. memory score:
Correlation Coefficient = .928**, Sig. (2-tailed) = .001, N = 8

** Correlation is significant at the 0.01 level (2-tailed).

Page 39:

There are various types of correlation coefficient, for different purposes:

Pearson's "r": used when both X and Y variables are (a) continuous; (b) (ideally) measurements on interval or ratio scales; (c) normally distributed – e.g. height, weight, IQ.

Spearman's rho: used in the same circumstances as Pearson's r, except that the data need only be on an ordinal scale – e.g. attitudes, personality scores.

Page 40:

r is a parametric test: the data have to have certain characteristics (parameters) before it can be used.

rho is a non-parametric test - less fussy about the nature of the data on which it is performed.

Both are dead easy to calculate in SPSS

Page 41:
Page 42:

Calculating Pearson's r: a worked example:

Is there a relationship between the number of parties a person gives each month, and the amount of flour they purchase from Møller-Mogens?

Page 43:

Our formula for the correlation coefficient from before, slightly modified into a computational form:

r = [ ΣXY - (ΣX)(ΣY)/N ] / √( [ ΣX² - (ΣX)²/N ] [ ΣY² - (ΣY)²/N ] )

Page 44:

Month   Flour production (X)   No. of parties (Y)   X²     Y²     XY
A       37                     75                   1369   5625   2775
B       41                     78                   1681   6084   3198
C       48                     88                   2304   7744   4224
D       32                     80                   1024   6400   2560
E       36                     78                   1296   6084   2808
F       30                     71                   900    5041   2130
G       40                     75                   1600   5625   3000
H       45                     83                   2025   6889   3735
I       39                     74                   1521   5476   2886
J       34                     74                   1156   5476   2516

N = 10   ΣX = 382   ΣY = 776   ΣX² = 14876   ΣY² = 60444   ΣXY = 29832

Page 45:

Using our values (from the bottom row of the table):

N = 10, ΣX = 382, ΣY = 776, ΣX² = 14876, ΣY² = 60444, ΣXY = 29832

r = [ 29832 - (382 × 776)/10 ] / √( [ 14876 - 382²/10 ] [ 60444 - 776²/10 ] )

Page 46:

r = (29832 - 29643.20) / √(283.60 × 226.40)

r = 188.80 / 253.39 ≈ 0.75

r is 0.75. This is a positive correlation: people who buy a lot of flour from Møller-Mogens also hold a lot of parties (and vice versa).
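The worked example can be checked mechanically. A short Python version of the computational formula, using the X and Y values from the table:

```python
X = [37, 41, 48, 32, 36, 30, 40, 45, 39, 34]   # flour purchases (X), from the table
Y = [75, 78, 88, 80, 78, 71, 75, 83, 74, 74]   # no. of parties (Y)
n = len(X)

sum_xy = sum(x * y for x, y in zip(X, Y))
numerator = sum_xy - sum(X) * sum(Y) / n
denominator = ((sum(x * x for x in X) - sum(X) ** 2 / n)
               * (sum(y * y for y in Y) - sum(Y) ** 2 / n)) ** 0.5
r = numerator / denominator
print(round(r, 2))  # -> 0.75
```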

Page 47:

How to interpret the size of a correlation: r² (r * r, "r-squared") is the "coefficient of determination". It tells us what proportion of the variation in the Y scores is associated with changes in X.

E.g., if r is 0.2, r² is 4% (0.2 * 0.2 = 0.04 = 4%).

Only 4% of the variation in the Y scores is attributable to their relationship with X.

Thus, knowing a person's Y score tells you essentially nothing about what their X score might be!

Page 48:

Our correlation of 0.75 gives an r² of 56%.

An r of 0.9 gives an r² of (0.9 * 0.9 = 0.81) = 81%.

Note that correlations become much stronger the closer they are to 1 (or -1): correlations of .6 or -.6 (r² = 36%) are four times as strong as correlations of .3 or -.3 (r² = 9%), not merely twice as strong!
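The squaring is what makes strength grow so steeply near 1. A one-liner makes the point (r values chosen to match the examples above):

```python
# percent of variance shared, for a range of r values
strength = {r: round(r * r * 100) for r in (0.1, 0.3, 0.5, 0.6, 0.75, 0.9)}
print(strength)  # doubling r from .3 to .6 quadruples the shared variance
```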

Page 49:
Page 50:

We use Spearman's correlation coefficient when the data have violated parametric assumptions (e.g. a non-normal distribution).

Spearman's correlation coefficient works by ranking the data in the samples, just like the other non-parametric tests.

Page 51:

Spearman's rho measures the degree of monotonicity, rather than linearity, in the relationship between two variables – i.e., the extent to which there is some kind of consistent change in Y associated with changes in X.

Hence it copes better than Pearson's r when the relationship is monotonic but non-linear. [Figures: a monotonic but non-linear relationship, which rho handles well, and a non-monotonic relationship, which it does not]
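Since rho is just Pearson's r computed on the ranks, it can be sketched directly (the rank helper uses average ranks for ties; the cubic data are invented to show the monotonic-but-non-linear case):

```python
def ranks(scores):
    """Lowest score gets rank 1; tied scores share the average rank."""
    ordered = sorted(scores)
    return [
        sum(i + 1 for i, v in enumerate(ordered) if v == s) / ordered.count(s)
        for s in scores
    ]

def spearman_rho(x, y):
    """Pearson's r applied to the ranks of the two samples."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

x = [1, 2, 3, 4, 5]
print(spearman_rho(x, [v ** 3 for v in x]))  # -> 1.0 (monotonic, though non-linear)
```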

Page 52:
Page 53:

Some pertinent notes on interpreting correlations:

Correlation does not imply causality:

X might cause Y. Y might cause X. Z (or a whole set of factors) might cause both X and Y.

Page 54:

Factors affecting the size of a correlation coefficient:

1. Sample size and random variation:

The larger the sample, the more stable the correlation coefficient.

Correlations obtained with small samples are unreliable.

Page 55:

Conclusion: You need a large sample before you can be really sure that your sample r is an accurate reflection of the population r.

Limits within which 80% of sample r's will fall, when the true (population) correlation is 0:

Sample size:   80% limits for r:
5              -0.69 to +0.69
15             -0.35 to +0.35
25             -0.26 to +0.26
50             -0.18 to +0.18
100            -0.13 to +0.13
200            -0.09 to +0.09

Page 56:

2. Linearity of the relationship:

Pearson's r measures the strength of the linear relationship between two variables; r will be misleading if there is a strong but non-linear relationship. [Figure: a strong curvilinear relationship for which r is low]

Page 57:

3. Range of talent (variability): the smaller the amount of variability in X and/or Y, the lower the apparent correlation. [Figure: a restricted range of scores shows no linear trend if viewed in isolation, despite a strong linear relationship overall]

Page 58:

4. Homoscedasticity (equal variability): r describes the average strength of the relationship between X and Y, so the scores should have a constant amount of variability at all points in their distribution. [Figure: a regression line with a region of low variability of Y (small Y-Y') and a region of high variability of Y (large Y-Y')]

Page 59:

5. Effect of discontinuous distributions:

A few outliers can distort things considerably. [Figure: two separated clusters of points producing an apparent trend, even though there is no real correlation between X and Y]

Page 60:

Deciding what is a "good" correlation:

A moderate correlation could be due to either:

(a) sampling variation (and hence a "fluke"); or

(b) a genuine association between the variables concerned.

How can we tell which of these is correct?

Page 61:

Distribution of r's obtained using samples drawn from two uncorrelated populations of scores: [Figure: the sampling distribution of r, centred on r = 0 – small correlations are likely to occur by chance; large positive or negative correlations are unlikely to occur by chance]

Page 62:

For an N of 20: [Figure: distribution of chance r's, with 0.025 of the area below r = -0.44 and 0.025 above r = +0.44]

For a sample size of 20, 5 out of 100 random samples are likely to produce an r of 0.44 or larger in magnitude merely by chance (i.e., even though in the population there is no correlation at all!)
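This can be checked with a quick simulation (an illustration, not from the lecture; seeded for reproducibility): draw many pairs of uncorrelated samples of N = 20 and count how often |r| reaches 0.44 purely by chance.

```python
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(1)
n, trials = 20, 4000
hits = sum(
    abs(pearson_r([random.gauss(0, 1) for _ in range(n)],
                  [random.gauss(0, 1) for _ in range(n)])) >= 0.44
    for _ in range(trials)
)
print(hits / trials)  # close to 0.05, even though the true correlation is 0
```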

Page 63:

Thus we arbitrarily decide that:

(a) If our sample correlation is so large that it would occur by chance only 5 times in a hundred, we will assume that it reflects a genuine correlation in the population from which the sample came.

(b) If a correlation like ours is likely to occur by chance more often than this, we assume it has arisen merely by chance, and that it is not evidence for a correlation in the parent population.

Page 64:

How do we know how likely it is to obtain a sample correlation as large as ours by chance?

Tables (on the website) give this information for different sample sizes.

An illustration of how to use these tables:

Suppose we take a sample of 20 people, and measure their eye-separation and back hairiness. Our sample r is .75. Does this reflect a true correlation between eye-separation and hairiness in the parent population, or has our r arisen merely by chance (i.e. because we have a freaky sample)?

Page 65:

Step 1:

Calculate the "degrees of freedom" (DF = the number of pairs of scores, minus 2).

Here, we have 20 pairs of scores, so DF = 18.

Step 2:

Find a table of "critical values for Pearson's r".

Page 66:

Part of a table of "critical values for Pearson's r":

        Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
19      .4329    .5487    .6652
20      .4227    .5368    .6524

With 18 df, a correlation of .4438 or larger will occur by chance with a probability of 0.05: i.e., if we took 100 samples of 20 people, about 5 of those samples are likely to produce an r of .4438 or larger (even though there is actually no correlation in the population!)

Page 67:

Part of a table of "critical values for Pearson's r":

        Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
20      .4227    .5368    .6524

With 18 df, a correlation of .5614 or larger will occur by chance with a probability of 0.01: i.e., if we took 100 samples of 20 people, about 1 of those 100 samples is likely to give an r of .5614 or larger (again, even though there is actually no correlation in the population!)

Page 68:

Part of a table of "critical values for Pearson's r":

        Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
20      .4227    .5368    .6524

With 18 df, a correlation of .6787 or larger will occur by chance with a probability of 0.001: i.e., if we took 1000 samples of 20 people, about 1 of those 1000 samples is likely to give an r of .6787 or larger (again, even though there is actually no correlation in the population!)

Page 69:

The table shows that an r of .6787 is likely to occur by chance only once in a thousand times.

Our obtained r is .75. This is larger than .6787.

Hence our obtained r of .75 is likely to occur by chance less than one time in a thousand (p<0.001).
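The table lookup amounts to finding the smallest tabled significance level whose critical value our r exceeds. A sketch using the df = 18 row quoted above (the helper function and its name are illustrative, not part of the lecture):

```python
# critical values for Pearson's r, two-tailed, df = 18 (from the table)
CRITICAL_DF18 = {0.05: 0.4438, 0.01: 0.5614, 0.001: 0.6787}

def significance_level(r, critical=CRITICAL_DF18):
    """Smallest tabled alpha at which |r| is significant, or None if n.s."""
    passed = [alpha for alpha, cutoff in critical.items() if abs(r) >= cutoff]
    return min(passed) if passed else None

print(significance_level(0.75))  # -> 0.001  (our r of .75 exceeds .6787)
```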

Page 70:

Conclusion:

Any sample correlation could in principle occur due to chance or because it reflects a true relationship in the population from which the sample was taken.

Because our r of .75 is so unlikely to occur by chance, we can safely assume that there really is a relationship between eye-separation and back-hairiness.

Page 71:

Important point:

Do not confuse statistical significance with practical importance.

We have just assessed "statistical significance" – the likelihood that our obtained correlation has arisen merely by chance.

Our r of .75 is "highly significant" (i.e., highly unlikely to have arisen by chance).

However, a weak correlation can be statistically significant if the sample size is large enough: with 100 df, an r of .1946 is "significant" in the sense that it is unlikely to have arisen by chance (r's bigger than this will occur by chance only 5 times in 100).

Page 72:

The coefficient of determination (r²) shows that an r of 0.1946 is not a strong relationship in a practical sense:

r² = 0.1946 * 0.1946 = 0.0379 = 3.79%

Knowledge of one of the variables would account for only 3.79% of the variance in the other – completely useless for predictive purposes!

Page 73:

[Cartoon: "My scaly butt is of large size!"]

Page 74:
Page 75:

The relationship between two variables (e.g. height and weight; age and I.Q.) can be described graphically with a scatterplot

[Figure: scatterplot with reaction time (msec) on the x-axis (short - medium - long) and age (years) on the y-axis (young - medium - old); each point is an individual's performance – each person supplies two scores, age and r.t.]

Page 76:

We are often interested in seeing whether or not a linear relationship exists between two variables.

Here, there is a strong positive relationship between reaction time and age:

(a) positive correlation

[Scatterplot: age (years, 20-80) on the x-axis, reaction time (msec, 300-700) on the y-axis; the points rise from left to right.]


Here is an equally strong but negative relationship between reaction time and age:

(b) negative correlation

[Scatterplot: age (years, 20-80) on the x-axis, reaction time (msec, 300-650) on the y-axis; the points fall from left to right.]


And here, there is no statistically significant relationship between reaction time and age:

(c) no correlation

[Scatterplot: age (years, 0-80) on the x-axis, reaction time (msec, 0-400) on the y-axis; no trend in the points.]


If we find a reasonably strong linear relationship between two variables, we might want to fit a straight line to the scatterplot.

There are two reasons for wanting to do this:

(a)For description: The line acts as a succinct description of the "idealized" relationship between our two variables, a relationship which we assume the real data reflect somewhat imperfectly.

(b)For prediction: We could use the line to obtain estimates of values for one of the variables, on the basis of knowledge of the value of the other variable (e.g. if we knew a person's height, we could predict their weight).


Linear Regression is an objective method of fitting a line to a scatterplot - better than trying to do it by eye!

(a) positive correlation

[Scatterplot: age (years, 20-80) on the x-axis, reaction time (msec, 300-700) on the y-axis, with several candidate straight lines drawn through the points.]

Which line is the best fit to the data?


The recipe for drawing a straight line: To draw a line, we need two values:

(a) the intercept - the point at which the line intercepts the vertical axis of the graph

(b) the slope of the line.

[Illustrations: one panel shows lines with the same intercept but different slopes; the other shows lines with different intercepts but the same slope.]


The formula for a straight line:

Y = a + b * X

Y is a value on the vertical (Y) axis;

a is the intercept (the point at which the line intersects the vertical axis of the graph [Y-axis]);

b is the slope of the line;

X is any value on the horizontal (X) axis.
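The straight-line formula can be sketched as a tiny function (illustrative only; the intercept and slope values below are made up):

```python
# Y = a + b * X: a is the intercept, b is the slope.
def line(x, a, b):
    return a + b * x

# A hypothetical line with intercept 2 and slope 0.5:
print(line(0, 2, 0.5))   # 2.0 (at X = 0 the line meets the Y-axis at a)
print(line(10, 2, 0.5))  # 7.0
```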


Linear regression step-by-step:

10 individuals do two tests: a stress test, and a statistics test. What is the relationship between stress and statistics performance?

subject: stress (X) test score (Y)

A 18 84

B 31 67

C 25 63

D 29 89

E 21 93

F 32 63

G 40 55

H 36 70

I 35 53

J 27 77


Draw a scatterplot to see what the data look like:

[Scatterplot of the relationship between test scores and stress scores: stress score (X, 0-50) on the x-axis, test score (Y, 0-120) on the y-axis.]


There is a negative relationship between stress scores and statistics scores:

People who scored high on the statistics test tend to have low stress levels, and people who scored low on the statistics test tend to have high stress levels.

[Same scatterplot as before: stress score (X) against test score (Y), with the points trending downwards.]


Calculating the regression line:

We need to find "a" (the intercept) and "b" (the slope) of the line.

Work out "b" first, and "a" second.


To calculate b, the slope of the line:

b = (ΣXY - (ΣX * ΣY) / N) / (ΣX² - (ΣX)² / N)


subjects:  X (stress)  X²           Y (test)  XY

A          18          18² = 324    84        18 * 84 = 1512
B          31          31² = 961    67        31 * 67 = 2077
C          25          25² = 625    63        25 * 63 = 1575
D          29          29² = 841    89        29 * 89 = 2581
E          21          21² = 441    93        21 * 93 = 1953
F          32          32² = 1024   63        32 * 63 = 2016
G          40          40² = 1600   55        40 * 55 = 2200
H          36          36² = 1296   70        36 * 70 = 2520
I          35          35² = 1225   53        35 * 53 = 1855
J          27          27² = 729    77        27 * 77 = 2079

ΣX = 294   ΣX² = 9066   ΣY = 714   ΣXY = 20368


We also need:

N = the number of pairs of scores, = 10 in this case.

(ΣX)² = "the square of the sum of X" = 294 * 294 = 86436.

NB: (ΣX)² means "square the sum of X": Add together all of the X values to get a total, and then square this total.

ΣX² means "sum the squared X values": Square each X value, and then add together these squared X values to get a total.
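The (ΣX)² versus ΣX² distinction is easy to check with the stress scores from the table (an illustrative sketch, not part of the slides):

```python
# The stress scores (X) from the worked example.
X = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]

square_of_sum = sum(X) ** 2              # (ΣX)²: total first, then square
sum_of_squares = sum(x * x for x in X)   # ΣX²: square each, then total

print(square_of_sum)   # 86436
print(sum_of_squares)  # 9066
```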


Working through the formula for b:

b = (20368 - (294 * 714) / 10) / (9066 - 86436 / 10)
  = (20368 - 20991.60) / (9066 - 8643.60)
  = -623.60 / 422.40
  = -1.476
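The whole calculation of b can be reproduced in a few lines of Python (a sketch for checking the arithmetic, not part of the original slides):

```python
# Slope of the regression of Y (test score) on X (stress), using the
# raw-score formula b = (ΣXY - ΣXΣY/N) / (ΣX² - (ΣX)²/N).
X = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]  # stress scores
Y = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]  # test scores
N = len(X)

numerator = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / N
denominator = sum(x * x for x in X) - sum(X) ** 2 / N

b = numerator / denominator
print(round(b, 3))  # -1.476
```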


b = -1.476.

b is negative, because the regression line slopes downwards from left to right: As stress scores (X) increase, statistics scores (Y) decrease.

[Same scatterplot as before: stress score (X) against test score (Y), with the cloud of points sloping downwards from left to right.]


Now work out a:

a = mean(Y) - b * mean(X)

mean(Y) is the mean of the Y scores: 714 / 10 = 71.4.

mean(X) is the mean of the X scores: 294 / 10 = 29.4.

b = -1.476

Therefore a = 71.4 - (-1.476 * 29.4) = 114.80
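The intercept step in code (again just a checking sketch; the unrounded b is used so the result matches the slides' 114.80):

```python
# a = mean(Y) - b * mean(X), with the values from the worked example.
mean_y = 714 / 10        # 71.4
mean_x = 294 / 10        # 29.4
b = -623.60 / 422.40     # unrounded slope, about -1.476

a = mean_y - b * mean_x
print(round(a, 2))  # 114.8
```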


The complete regression equation now looks like:

Y' = 114.80 + (-1.476 * X)

To draw the line, input any three different values for X, in order to get the associated values of Y'.

For X = 10, Y' = 114.80 + (-1.476 * 10) = 100.04

For X = 30, Y' = 114.80 + (-1.476 * 30) = 70.52

For X = 50, Y' = 114.80 + (-1.476 * 50) = 41.00
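Generating those three plotting points from the fitted equation (an illustrative sketch):

```python
# Y' = 114.80 + (-1.476 * X): predicted test score for a given stress score.
def predict(x):
    return 114.80 + (-1.476 * x)

for x in (10, 30, 50):
    print(x, round(predict(x), 2))  # 100.04, 70.52, 41.0
```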


Regression line for predicting test scores (Y) from stress scores (X):

Plot:

X = 10, Y' = 100.04
X = 30, Y' = 70.52
X = 50, Y' = 41.00

intercept = 114.80

[Scatterplot with the fitted regression line drawn through these three points.]


Important:

This is the regression line for predicting statistics test score on the basis of knowledge of a person's stress score; this is the "regression of Y on X".

To predict stress score on the basis of knowledge of statistics test score (the "regression of X on Y"), we can't use this regression line!


To predict Y from X requires a line that minimizes the deviations of the predicted Y's from actual Y's.

To predict X from Y requires a line that minimizes the deviations of the predicted X's from actual X's - a different task (although somewhat similar)!

Solution: To calculate regression of X on Y, swap the column labels (so that the "X" values are now the "Y" values, and vice versa); and re-do the calculations. So X is now test results, Y is now stress score
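Both regressions can come from one helper just by swapping the argument order (a sketch using the same raw-score formulas as above; `slope_intercept` is a hypothetical helper name, not from the slides):

```python
# Least-squares line for predicting ys from xs (raw-score formulas).
def slope_intercept(xs, ys):
    n = len(xs)
    b = (sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n) / \
        (sum(x * x for x in xs) - sum(xs) ** 2 / n)
    a = sum(ys) / n - b * sum(xs) / n
    return a, b

stress = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]
test   = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]

a1, b1 = slope_intercept(stress, test)  # Y on X: predict test from stress
a2, b2 = slope_intercept(test, stress)  # X on Y: predict stress from test
print(round(a1, 2), round(b1, 3))  # 114.8 -1.476
print(round(a2, 2), round(b2, 3))  # 55.04 -0.359
```

Note that the two lines are genuinely different: swapping the variables changes both the slope and the intercept.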


Regression lines for predicting Y from X, and vice versa:

[Scatterplot with both regression lines: the regression of Y on X predicts test score (Y) given knowledge of stress score (X); the regression of X on Y predicts stress score (X) given knowledge of test score (Y). N.B.: the X-on-Y line has an intercept of about 55.]


Simple regression in SPSS: Page 155 in Field's book in the compendium

More advanced types of regression handle non-linear relationships.

Also, multiple regression: regression with more than two variables.



Analyze -> Regression -> Linear



The output comes in several tables. The first, the Model Summary, provides the R and R-square values and the standard error of the estimate:

Model Summary

Model | R    | R Square | Adjusted R Square | Std. Error of the Estimate
1     | .578 | .335     | .331              | 65.9914

Predictors: (Constant), Advertising Budget (thousands of pounds)


The second provides an ANOVA.

The ANOVA provides an estimate of whether the regression model is significantly better than simply using the mean value of the sample,

i.e., whether our regression model predicts the variation in the DV significantly well or not

ANOVA

Model 1    | Sum of Squares | df  | Mean Square | F      | Sig.
Regression | 433687.833     | 1   | 433687.833  | 99.587 | .000
Residual   | 862264.167     | 198 | 4354.870    |        |
Total      | 1295952.000    | 199 |             |        |

Predictors: (Constant), Advertising Budget (thousands of pounds)
Dependent Variable: Record Sales (thousands)
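Both the F ratio and the R Square from the Model Summary can be recovered from this table (a quick checking sketch, not part of the slides):

```python
# From the ANOVA table: F = MS_regression / MS_residual,
# and R² = SS_regression / SS_total.
ss_regression = 433687.833
ms_regression = 433687.833   # same as SS here, since df = 1
ms_residual = 4354.870
ss_total = 1295952.0

print(round(ms_regression / ms_residual, 3))  # 99.587
print(round(ss_regression / ss_total, 3))     # 0.335
```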


The third table provides the information needed for prediction:

Here, we can see that an increase of 1 in X is associated with an increase of 0.096 in Y

We see that a = 134.140 (intercept with Y-axis)

Coefficients

Model 1                                  | B       | Std. Error | Beta | t      | Sig.
(Constant)                               | 134.140 | 7.537      |      | 17.799 | .000
Advertising Budget (thousands of pounds) | .09612  | .010       | .578 | 9.979  | .000

Dependent Variable: Record Sales (thousands)


This gives us the formula:

Y = 134.14 + (0.09612*X)
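Using the fitted equation for prediction (the budget value below is hypothetical, chosen only to illustrate the formula):

```python
# Predicted record sales (thousands) from advertising budget
# (thousands of pounds), using the SPSS coefficients.
def predict_sales(budget):
    return 134.14 + 0.09612 * budget

print(round(predict_sales(1000), 2))  # 230.26
```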
