Correlation, Regression & T-test. Prepared By: Dr. Kumara Thevan a/l Krishnan


Page 1: Lect w8 w9_correlation_regression

Correlation, Regression & T-test

Prepared By: Dr. Kumara Thevan a/l Krishnan

Page 2: Lect w8 w9_correlation_regression

Introduction

The relationship between two or more numerical (quantitative) variables can be investigated using the techniques of correlation and regression analysis.

- Correlation is a statistical method used to determine whether a linear relationship between variables exists.

- Simple linear regression is a statistical method used to describe the nature of the relationship between two variables.

Page 3: Lect w8 w9_correlation_regression

Definition

A scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are plotted with a horizontal x axis and a vertical y axis.

Each individual (x, y) pair is plotted as a single point.

Page 4: Lect w8 w9_correlation_regression

Definition

A correlation exists between two variables when one of them is related to the other in some way.

Page 5: Lect w8 w9_correlation_regression

Example

Open SPSS.

Data: weight height biometry male 2012.sav

Graphs => Legacy Dialogs => Scatter/Dot

Height (X); Weight (Y)

Page 6: Lect w8 w9_correlation_regression

BMI test

Category                                 BMI range (kg/m²)
Very severely underweight                less than 15
Severely underweight                     from 15.0 to 16.0
Underweight                              from 16.0 to 18.5
Normal (healthy weight)                  from 18.5 to 25
Overweight                               from 25 to 30
Obese Class I (Moderately obese)         from 30 to 35
Obese Class II (Severely obese)          from 35 to 40
Obese Class III (Very severely obese)    over 40

Page 7: Lect w8 w9_correlation_regression

Normality test?

• The Kolmogorov-Smirnov and Shapiro-Wilk tests.

• They compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation.

• If p > 0.05, the test is non-significant. This tells us that the distribution of the sample is not significantly different from a normal distribution.

• If p < 0.05, the test is significant. This tells us that the distribution in question is significantly different from a normal distribution (non-normal).

Page 8: Lect w8 w9_correlation_regression

What can you see?

Page 9: Lect w8 w9_correlation_regression

Positive Linear Correlation

[Figure: three scatterplots of y against x showing (a) positive, (b) strong positive, and (c) perfect positive correlation]

Page 10: Lect w8 w9_correlation_regression

Negative Linear Correlation

[Figure: three scatterplots of y against x showing (d) negative, (e) strong negative, and (f) perfect negative correlation]

Page 11: Lect w8 w9_correlation_regression

What can you see?

Page 12: Lect w8 w9_correlation_regression

Test 1

- Draw a scatterplot using the data in ExamAnxiety.sav

- Exam performance (%): y axis

- Exam anxiety: x axis

- Colour (set markers by): gender

- Results?

- Try a 3D plot with 3 variables

Page 13: Lect w8 w9_correlation_regression

Bivariate correlation

• Having taken a preliminary glance at the data, we can proceed to conduct the correlation analysis.

Page 14: Lect w8 w9_correlation_regression

Definition

The linear correlation coefficient r measures the strength of the linear relationship between paired quantitative x and y values in a sample.

Page 15: Lect w8 w9_correlation_regression

No Linear Correlation

[Figure: two scatterplots of y against x showing (g) no correlation and (h) nonlinear correlation]

Page 16: Lect w8 w9_correlation_regression

Definition

The linear correlation coefficient r is sometimes referred to as the Pearson product-moment correlation coefficient.

Page 17: Lect w8 w9_correlation_regression

Notation for the Linear Correlation Coefficient

n: the number of pairs of data present.

Σ: denotes the addition of the items indicated.

Σx: the sum of all x values.

Σx²: each x value is squared, and then those squares are added.

(Σx)²: the x values are added, and the total is then squared.

Σxy: each x value is first multiplied by its corresponding y value; after obtaining all such products, find their sum.

r: the linear correlation coefficient for a sample.

ρ: the linear correlation coefficient for a population.

Page 18: Lect w8 w9_correlation_regression

Definition: Linear Correlation Coefficient r

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\;\sqrt{n(\sum y^2) - (\sum y)^2}} $$
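The formula for r can be sketched in a few lines of Python using only the sums listed in the notation (n, Σx, Σy, Σxy, Σx², Σy²). This is a minimal illustration, not the SPSS procedure: the function name `pearson_r` and the five data pairs are made up for the example and do not come from the lecture files.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed from the raw sums n, Σx, Σy, Σxy, Σx², Σy²."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical illustrative data (not from the lecture data files)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

A value of r close to +1 or -1 indicates a strong linear relationship; a value close to 0 indicates little or no linear relationship.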

Page 19: Lect w8 w9_correlation_regression

Test 2

• Run the correlation analysis on ExamAnxiety.sav

• Assumption: the data are normally distributed

Page 20: Lect w8 w9_correlation_regression

Results: Correlations

                                             Exam Performance (%)   Exam Anxiety   Time Spent Revising
Exam Performance (%)   Pearson Correlation   1                      -.441          .397
                       Sig. (1-tailed)                              .000           .000
                       N                     103                    103            103
Exam Anxiety           Pearson Correlation   -.441                  1              -.709
                       Sig. (1-tailed)       .000                                  .000
                       N                     103                    103            103
Time Spent Revising    Pearson Correlation   .397                   -.709          1
                       Sig. (1-tailed)       .000                   .000
                       N                     103                    103            103

**. Correlation is significant at the 0.01 level (1-tailed).

Page 21: Lect w8 w9_correlation_regression

Interpretation

• Exam performance is positively related to the amount of time spent revising, with a coefficient of r = 0.397, which is also significant at p < 0.01.

• Exam anxiety appears to be negatively related to the time spent revising (r = -0.709, p < 0.01).

Page 22: Lect w8 w9_correlation_regression

Interpretation

• Each variable is perfectly correlated with itself (r = 1).

• Exam performance is negatively related to exam anxiety, with a Pearson correlation coefficient of r = -0.441, and there is a less than 0.01 probability that a correlation coefficient this big would have occurred by chance in a sample of 103 people.

Page 23: Lect w8 w9_correlation_regression

In layman's terms

• As exam anxiety increases, exam mark decreases.

• As revision time increases, exam mark increases.

• As revision time increases, exam anxiety decreases.

Page 24: Lect w8 w9_correlation_regression

Hands on

• Is there a linear association between weight and heart girth in this herd of cows?

• Weight was measured in kg and heart girth in cm on 10 cows

• Assume the data are normally distributed

Page 25: Lect w8 w9_correlation_regression
Page 26: Lect w8 w9_correlation_regression

• The sample coefficient of correlation is 0.704. The P value is 0.012, which is less than 0.05. The conclusion is that correlation exists in the population.

Correlations

                               Weight   Girth
Weight   Pearson Correlation   1        .704
         Sig. (1-tailed)                .012
         N                     10       10
Girth    Pearson Correlation   .704     1
         Sig. (1-tailed)       .012
         N                     10       10

*. Correlation is significant at the 0.05 level (1-tailed).

Page 27: Lect w8 w9_correlation_regression

Using R² for interpretation

(correlation coefficient)² = coefficient of determination, R²

R² is a measure of the amount of variability in one variable that is explained by the other.

Page 28: Lect w8 w9_correlation_regression

Example: Correlations

                                             Exam Performance (%)   Exam Anxiety   Time Spent Revising
Exam Performance (%)   Pearson Correlation   1                      -.441          .397
                       Sig. (1-tailed)                              .000           .000
                       N                     103                    103            103
Exam Anxiety           Pearson Correlation   -.441                  1              -.709
                       Sig. (1-tailed)       .000                                  .000
                       N                     103                    103            103
Time Spent Revising    Pearson Correlation   .397                   -.709          1
                       Sig. (1-tailed)       .000                   .000
                       N                     103                    103            103

**. Correlation is significant at the 0.01 level (1-tailed).

Page 29: Lect w8 w9_correlation_regression

Example

Exam anxiety and exam performance

• (correlation coefficient)² = coefficient of determination, R²

• R² = (-0.441)² = 0.194

• As a percentage: 0.194 × 100 = 19.4%

Page 30: Lect w8 w9_correlation_regression

• Although exam anxiety was correlated with exam performance, it can account for only 19.4% of the variation in exam scores.

• The remaining 80.6% of the variability must be accounted for by other variables (such as differences in ability, different levels of preparation, and so on).
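The same arithmetic can be applied to every pair in the Correlations table. The sketch below squares each reported r to get R² and the explained and unexplained percentages; the dictionary layout and the helper name `shared_variance_percent` are choices made for this illustration only.

```python
# Pearson correlations as reported in the Correlations table (ExamAnxiety.sav)
correlations = {
    ("exam anxiety", "exam performance"): -0.441,
    ("time spent revising", "exam performance"): 0.397,
    ("time spent revising", "exam anxiety"): -0.709,
}

def shared_variance_percent(r):
    """Coefficient of determination R² = r², expressed as a percentage."""
    return r ** 2 * 100

for (first, second), r in correlations.items():
    explained = shared_variance_percent(r)
    print(f"{first} vs {second}: R^2 = {r ** 2:.3f} "
          f"({explained:.1f}% explained, {100 - explained:.1f}% unexplained)")
```

Note that the sign of r is lost when squaring, so R² says how much variability is shared but not the direction of the relationship.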

Page 31: Lect w8 w9_correlation_regression

Hands on

Subject   Age, x   Pressure, y
A         43       128
B         48       120
C         56       135
D         61       143
E         67       141
F         70       152

Compute the value of the correlation coefficient for the data. Is there enough statistical evidence that this relationship did not occur by chance?

Page 32: Lect w8 w9_correlation_regression

Correlations

                                 Age    Pressure
Age        Pearson Correlation   1      .897
           Sig. (2-tailed)              .015
           N                     6      6
Pressure   Pearson Correlation   .897   1
           Sig. (2-tailed)       .015
           N                     6      6

*. Correlation is significant at the 0.05 level (2-tailed).

R² = ?
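As a cross-check on the SPSS output, the age and pressure data can be run through the raw-sum formula for r by hand. The sketch below also adds a significance check that the slides do not show: the standard t statistic for a correlation, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom, compared against the two-tailed 5% critical value for 4 degrees of freedom (2.776, taken from a standard t table). Treat that part as an assumption about how the significance could be verified, not as the SPSS calculation itself.

```python
import math

age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

n = len(age)
sum_x, sum_y = sum(age), sum(pressure)
sum_xy = sum(x * y for x, y in zip(age, pressure))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in pressure)

# Pearson r from the raw-sum formula
r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

# t statistic for testing whether the population correlation is zero
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)
T_CRIT = 2.776  # two-tailed 5% critical value for df = 4 (standard t table)
significant = abs(t_stat) > T_CRIT

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}, t({n - 2}) = {t_stat:.2f}")
print("significant at the 0.05 level" if significant else "not significant")
```

The computed r agrees with the .897 in the table, and the t statistic exceeds the critical value, which is consistent with the reported two-tailed significance of .015.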

Page 33: Lect w8 w9_correlation_regression

Regression

Correlation does not tell us about the predictive power of variables.

In regression analysis we fit a predictive model to our data and use that model to predict values of the dependent variable from one or more independent variables.

Page 34: Lect w8 w9_correlation_regression

Independent vs. Dependent

Independent variable:

• Intentionally manipulated

• Controlled

• Varies at a known rate

• Cause

Dependent variable:

• Intentionally left alone

• Measured

• Varies at an unknown rate

• Effect

Page 35: Lect w8 w9_correlation_regression

• Simple regression seeks to predict an outcome variable from a single predictor variable, whereas multiple regression seeks to predict an outcome from several predictors.

$$ \text{outcome}_i = (\text{model}_i) + \text{error}_i $$

$$ Y_i = (b_0 + b_1 x_i) + e_i $$

Page 36: Lect w8 w9_correlation_regression
Page 37: Lect w8 w9_correlation_regression

Least squares

Least squares is a method of finding the line that best fits the data.

This “line of best fit” is found by ascertaining which line, of all of the possible lines that could be drawn, results in the least amount of difference between the observed data points and the line.

Page 38: Lect w8 w9_correlation_regression

The vertical lines (dashed) represent the differences (or residuals) between the line and the actual data

Page 39: Lect w8 w9_correlation_regression

• Even with the “line of best fit”, there will be small differences between the values predicted by the line and the data that were actually observed.

• Our interest is in the vertical differences between the line and the actual data, because we are using the line to predict values of Y from values of the X variable.

• Some data points fall above or below the line, indicating that there is a difference between the model fitted to these data and the data collected.

Page 40: Lect w8 w9_correlation_regression

• These differences are called “residuals”.

• If we simply added up the residuals, the positive and negative ones would cancel each other out. How do we avoid this?

• Square the differences before adding them up.

• If the sum of squared differences is large, the line is not representative of the data; if the sum of squared differences is small, the line is representative.

Page 41: Lect w8 w9_correlation_regression

Total sum of squares, SST

SST uses the differences between the observed data and the mean value of Y

Page 42: Lect w8 w9_correlation_regression

• The sum of squared differences (SS) can be calculated for any line that is fitted to some data; the “goodness of fit” of each line can then be compared by looking at the sum of squares for each.

• The method of least squares works by selecting the line that has the lowest sum of squared differences (so it chooses the line that best represents the observed data).

• This “line of best fit” is known as a regression line.
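The idea that the regression line is the one with the lowest sum of squared differences can be checked numerically. The sketch below fits a line to the age and blood pressure data from the earlier hands-on exercise using the standard closed-form least-squares solution (b1 = Sxy / Sxx, b0 = mean of y minus b1 times mean of x), which the slides do not state and which is assumed here. It then shows that shifting or tilting the line makes the sum of squared residuals larger. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical differences between the data and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

b0, b1 = fit_line(age, pressure)
best = sum_squared_residuals(age, pressure, b0, b1)

# Two other candidate lines: one shifted upwards, one with a steeper slope
shifted = sum_squared_residuals(age, pressure, b0 + 1.0, b1)
tilted = sum_squared_residuals(age, pressure, b0, b1 + 0.05)

print(f"pressure = {b0:.2f} + {b1:.3f} * age")
print(f"SS (best fit) = {best:.2f}, shifted = {shifted:.2f}, tilted = {tilted:.2f}")
```

Both alternative lines give a larger sum of squares than the fitted line, which is what "least squares" means in practice.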

Page 43: Lect w8 w9_correlation_regression

Residual sum of squares, SSR

SSR uses the differences between the observed data and the regression line

Page 44: Lect w8 w9_correlation_regression

Model sum of squares, SSM

SSM uses the differences between the mean value of Y and the regression line

Page 45: Lect w8 w9_correlation_regression

F-ratio

$$ F = \frac{\mathrm{MSM}}{\mathrm{MSR}} $$

MSM (mean square for the model):

$$ \mathrm{MSM} = \frac{\mathrm{SSM}}{\text{number of variables in the model}} $$

Page 46: Lect w8 w9_correlation_regression

F-ratio

$$ F = \frac{\mathrm{MSM}}{\mathrm{MSR}} $$

MSR (residual mean square):

$$ \mathrm{MSR} = \frac{\mathrm{SSR}}{\text{number of observations} - \text{number of parameters being estimated}} $$

Page 47: Lect w8 w9_correlation_regression

F-ratio

• A good model should have a large F-ratio (greater than 1 at least).
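The three sums of squares and the F-ratio can be worked through on a small data set. The sketch below reuses the age and blood pressure data from the earlier hands-on exercise and the standard closed-form least-squares fit (an assumption, since the slides rely on SPSS for the fit). It computes SST, SSR and SSM exactly as defined above, then MSM, MSR and F.

```python
age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(pressure) / n

# Least-squares slope and intercept (closed-form solution)
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(age, pressure))
      / sum((x - mean_x) ** 2 for x in age))
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * x for x in age]

# SST: observed data vs the mean of Y
sst = sum((y - mean_y) ** 2 for y in pressure)
# SSR: observed data vs the regression line
ssr = sum((y - p) ** 2 for y, p in zip(pressure, predicted))
# SSM: mean of Y vs the regression line
ssm = sum((p - mean_y) ** 2 for p in predicted)

n_predictors = 1   # one variable in the model (age)
n_parameters = 2   # b0 and b1 are estimated
msm = ssm / n_predictors
msr = ssr / (n - n_parameters)
f_ratio = msm / msr
r_squared = ssm / sst

print(f"SST = {sst:.2f}, SSM = {ssm:.2f}, SSR = {ssr:.2f}")
print(f"F = {f_ratio:.2f}, R^2 = {r_squared:.3f}")
```

SSM and SSR add up to SST, and the ratio SSM / SST gives the same R² as squaring the correlation coefficient for these data.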

Page 48: Lect w8 w9_correlation_regression

Test 1

Open sample data: Record1.sav

Graphs => scatterplot

Analyze => Regression

Page 49: Lect w8 w9_correlation_regression
Page 50: Lect w8 w9_correlation_regression

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .578   .335       .331                65.991

a. Predictors: (Constant), Advertsing Budget (thousands of pounds)

Page 51: Lect w8 w9_correlation_regression

Interpretation

• The value of R² is 0.335, which tells us that advertising expenditure can account for 33.5% of the variation in record sales.

• This means that 66.5% of the variation in record sales cannot be explained by advertising alone.

Page 52: Lect w8 w9_correlation_regression

The F-ratio is 99.59, which is significant at p < 0.001 (because the value in the column labelled Sig. is less than 0.001).

This result tells us there is less than a 0.1% chance that an F-ratio this large would happen by chance alone. Overall, the regression model predicts record sales significantly well.

Page 53: Lect w8 w9_correlation_regression

Multiple regression

• Open data: Record2.sav

Page 54: Lect w8 w9_correlation_regression
Page 55: Lect w8 w9_correlation_regression
Page 56: Lect w8 w9_correlation_regression
Page 57: Lect w8 w9_correlation_regression
Page 58: Lect w8 w9_correlation_regression

Results: Descriptive Statistics

                                          Mean     Std. Deviation   N
Record Sales (thousands)                  193.20   80.699           200
Advertsing Budget (thousands of pounds)   614.41   485.65521        200
No. of plays on Radio 1 per week          27.50    12.270           200
Attractiveness of Band                    6.77     1.395            200

Page 59: Lect w8 w9_correlation_regression

Correlations

Columns: (1) Record Sales (thousands); (2) Advertsing Budget (thousands of pounds); (3) No. of plays on Radio 1 per week; (4) Attractiveness of Band

                                                                    (1)     (2)     (3)     (4)
Pearson Correlation   (1) Record Sales (thousands)                  1.000   .578    .599    .326
                      (2) Advertsing Budget (thousands of pounds)   .578    1.000   .102    .081
                      (3) No. of plays on Radio 1 per week          .599    .102    1.000   .182
                      (4) Attractiveness of Band                    .326    .081    .182    1.000
Sig. (1-tailed)       (1) Record Sales (thousands)                  .       .000    .000    .000
                      (2) Advertsing Budget (thousands of pounds)   .000    .       .076    .128
                      (3) No. of plays on Radio 1 per week          .000    .076    .       .005
                      (4) Attractiveness of Band                    .000    .128    .005    .
N                     (1) Record Sales (thousands)                  200     200     200     200
                      (2) Advertsing Budget (thousands of pounds)   200     200     200     200
                      (3) No. of plays on Radio 1 per week          200     200     200     200
                      (4) Attractiveness of Band                    200     200     200     200

Page 60: Lect w8 w9_correlation_regression

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .815   .665       .660                47.087

Change Statistics (Model 1)

R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
.665              129.498    3     196   .000            1.950

a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week

b. Dependent Variable: Record Sales (thousands)

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .578   .335       .331                65.991

a. Predictors: (Constant), Advertsing Budget (thousands of pounds)

Page 61: Lect w8 w9_correlation_regression

ANOVA

Model            Sum of Squares   df    Mean Square   F        Sig.
1   Regression   433687.833       1     433687.833    99.587   .000
    Residual     862264.167       198   4354.870
    Total        1295952.000      199

a. Predictors: (Constant), Advertsing Budget (thousands of pounds)

b. Dependent Variable: Record Sales (thousands)

ANOVA

Model            Sum of Squares   df    Mean Square   F         Sig.
1   Regression   861377.418       3     287125.806    129.498   .000
    Residual     434574.582       196   2217.217
    Total        1295952.000      199

a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week

b. Dependent Variable: Record Sales (thousands)
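The mean squares, F-ratios and R² values in these tables follow directly from the sums of squares and degrees of freedom. The sketch below recomputes them from the reported Regression and Residual sums of squares for both models; the helper name `anova_summary` and its return format are illustrative choices, and the residual degrees of freedom are taken as observations minus predictors minus one (for the constant).

```python
def anova_summary(ss_model, ss_residual, n_predictors, n_observations):
    """Recompute mean squares, F and R² from an ANOVA table's sums of squares."""
    ss_total = ss_model + ss_residual
    df_model = n_predictors
    df_residual = n_observations - n_predictors - 1  # minus 1 for the constant
    ms_model = ss_model / df_model
    ms_residual = ss_residual / df_residual
    return {
        "df_residual": df_residual,
        "ms_model": ms_model,
        "ms_residual": ms_residual,
        "f_ratio": ms_model / ms_residual,
        "r_squared": ss_model / ss_total,
    }

# Values copied from the two ANOVA tables (200 records in each analysis)
simple = anova_summary(433687.833, 862264.167, n_predictors=1, n_observations=200)
multiple = anova_summary(861377.418, 434574.582, n_predictors=3, n_observations=200)

for name, result in (("advertising only", simple), ("three predictors", multiple)):
    print(f"{name}: F = {result['f_ratio']:.3f}, R^2 = {result['r_squared']:.3f}")
```

Adding the two extra predictors roughly doubles the share of variation in record sales that the model explains, from about 33.5% to about 66.5%.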

Page 62: Lect w8 w9_correlation_regression

Hands on

• Open file softdrinks.sav

• Do a multiple regression analysis

• Y (dependent variable): delivery time

Page 63: Lect w8 w9_correlation_regression

Results: Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .980   .960       .956                3.25947

a. Predictors: (Constant), distance, cases

ANOVA

Model            Sum of Squares   df   Mean Square   F         Sig.
1   Regression   5550.811         2    2775.405      261.235   .000
    Residual     233.732          22   10.624
    Total        5784.543         24

a. Predictors: (Constant), distance, cases

b. Dependent Variable: time

Page 64: Lect w8 w9_correlation_regression

Coefficients

                 Unstandardized Coefficients   Standardized Coefficients
Model            B        Std. Error           Beta                        t       Sig.
1   (Constant)   2.341    1.097                                            2.135   .044
    cases        1.616    .171                 .716                        9.464   .000
    distance     .014     .004                 .301                        3.981   .001

a. Dependent Variable: time
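The unstandardized B values define the fitted model: time = 2.341 + 1.616 × cases + 0.014 × distance. A minimal sketch of using it for prediction follows; the function name and the example delivery (10 cases, distance 500) are hypothetical, and the units are whatever the softdrinks.sav variables use, which the slides do not state.

```python
# Unstandardized coefficients (B) from the Coefficients table
B_CONSTANT = 2.341
B_CASES = 1.616
B_DISTANCE = 0.014

def predict_delivery_time(cases, distance):
    """Predicted delivery time from the fitted multiple regression model."""
    return B_CONSTANT + B_CASES * cases + B_DISTANCE * distance

# Hypothetical delivery, in the same units as the data file
predicted = predict_delivery_time(cases=10, distance=500)
print(f"predicted time = {predicted:.3f}")
```

Each B is the change in predicted time for a one-unit increase in that predictor while the other predictor is held constant.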

Page 65: Lect w8 w9_correlation_regression

T-test

• Testing differences between means

• Dependent means t-test: used when there are two experimental conditions and the same participants took part in both conditions of the experiment.

• Independent means t-test: used when there are two experimental conditions and different participants were assigned to each condition.

Page 66: Lect w8 w9_correlation_regression

Dependent t-test

• Twelve spider-phobes were exposed to a picture of a spider (picture) and, on a separate occasion, to a real live tarantula (real). Their anxiety was measured in each condition (half of the participants were exposed to the picture before the real spider, while the other half were exposed to the real spider first).

• Which situation caused more anxiety?

• Open spiderRM.sav

Page 67: Lect w8 w9_correlation_regression

Results: Paired Samples Statistics

                             Mean    N    Std. Deviation   Std. Error Mean
Pair 1   Picture of Spider   40.00   12   9.293            2.683
         Real Spider         47.00   12   11.029           3.184

Paired Samples Correlations

                                           N    Correlation   Sig.
Pair 1   Picture of Spider & Real Spider   12   .545          .067

r = 0.545; the scores in the two conditions are not significantly correlated (p > 0.05).

Page 68: Lect w8 w9_correlation_regression

Paired Samples Test

Pair 1: Picture of Spider - Real Spider

Paired Differences
Mean     Std. Deviation   Std. Error Mean   95% CI Lower   95% CI Upper   t        df   Sig. (2-tailed)
-7.000   9.807            2.831             -13.231        -.769          -2.473   11   .031

The t-value is negative, which tells us that the picture condition had a smaller mean than the real tarantula condition, so the real spider led to greater anxiety than the picture.

Conclusion: exposure to a real spider caused significantly more reported anxiety in spider-phobes than exposure to a picture (t(11) = -2.47, p < 0.05).
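The entries in the Paired Samples Test table are linked by simple arithmetic: the standard error is the standard deviation of the differences divided by √n, t is the mean difference divided by that standard error, and the confidence interval is the mean difference plus or minus a critical t value times the standard error. The sketch below checks this from the reported summary values; the critical value 2.201 (two-tailed 5%, 11 degrees of freedom) is taken from a standard t table and is an assumption added here.

```python
import math

# Summary values from the Paired Samples Test table
mean_diff = -7.000
sd_diff = 9.807
n = 12

se_diff = sd_diff / math.sqrt(n)   # standard error of the mean difference
t_stat = mean_diff / se_diff
df = n - 1

T_CRIT = 2.201  # two-tailed 5% critical value for df = 11 (standard t table)
ci_lower = mean_diff - T_CRIT * se_diff
ci_upper = mean_diff + T_CRIT * se_diff

print(f"SE = {se_diff:.3f}, t({df}) = {t_stat:.3f}")
print(f"95% CI: {ci_lower:.3f} to {ci_upper:.3f}")
```

Because the interval does not contain zero, it leads to the same conclusion as the significance value of .031.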

Page 69: Lect w8 w9_correlation_regression

Hands on

• All students who enrol in a certain memory course are given a pretest before the course begins and a post-test at the completion of the course. Their scores are listed here. Verify the results shown on the output by calculating the values, and assume normality.

Student   1    2    3    4    5    6    7    8    9    10
Before    93   86   72   54   92   65   80   81   62   73
After     98   92   80   62   91   78   89   78   71   80
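One way to verify the SPSS output by hand is to compute the paired (dependent) t-test directly from the listed scores. This is a sketch using Python's standard `statistics` module; the direction of the difference (after minus before) is a choice made here, so the sign of t will flip if SPSS is given the variables in the other order.

```python
import math
import statistics

before = [93, 86, 72, 54, 92, 65, 80, 81, 62, 73]
after = [98, 92, 80, 62, 91, 78, 89, 78, 71, 80]

# Work with the per-student differences (after - before)
differences = [a - b for a, b in zip(after, before)]
n = len(differences)

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)   # sample standard deviation (n - 1)
se_diff = sd_diff / math.sqrt(n)
t_stat = mean_diff / se_diff
df = n - 1

print(f"mean difference = {mean_diff:.2f}, SD = {sd_diff:.3f}, SE = {se_diff:.3f}")
print(f"t({df}) = {t_stat:.3f}")
```

The resulting t can be compared with the two-tailed 5% critical value for 9 degrees of freedom (2.262 in a standard t table) or with the Sig. value that SPSS reports.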

Page 70: Lect w8 w9_correlation_regression

Independent t-test

• We have 12 spider-phobes who were exposed to a picture of a spider and 12 different spider-phobes who were exposed to a real-life tarantula. Their anxiety levels were measured.

• Open spiderBG.sav

Page 71: Lect w8 w9_correlation_regression

Group Statistics

          Spider or Picture?   N    Mean    Std. Deviation   Std. Error Mean
Anxiety   Picture              12   40.00   9.293            2.683
          Real Spider          12   47.00   11.029           3.184

Independent Samples Test (Anxiety)

Levene's Test for Equality of Variances

                              F      Sig.
Equal variances assumed       .782   .386

t-test for Equality of Means

                              t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       -1.681   22       .107              -7.000            4.163                   -15.634        1.634
Equal variances not assumed   -1.681   21.385   .107              -7.000            4.163                   -15.649        1.649
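Both rows of the t-test output can be reproduced from the Group Statistics alone. The sketch below computes the pooled-variance t (equal variances assumed) and the unequal-variance version with Welch-Satterthwaite degrees of freedom (equal variances not assumed). These are the standard textbook formulas and are assumed here to correspond to what SPSS reports; variable names are illustrative.

```python
import math

# Group statistics from the table: picture group, then real-spider group
n1, mean1, sd1 = 12, 40.00, 9.293
n2, mean2, sd2 = 12, 47.00, 11.029

mean_difference = mean1 - mean2

# Equal variances assumed: pool the two sample variances
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = mean_difference / se_pooled
df_pooled = n1 + n2 - 2

# Equal variances not assumed: Welch's t with Satterthwaite degrees of freedom
v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = mean_difference / se_welch
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"pooled: t({df_pooled}) = {t_pooled:.3f}, SE = {se_pooled:.3f}")
print(f"Welch:  t({df_welch:.3f}) = {t_welch:.3f}, SE = {se_welch:.3f}")
```

With equal group sizes the two standard errors coincide, which is why both rows of the table show the same t and only the degrees of freedom differ. Levene's test is non-significant (Sig. = .386), so the "equal variances assumed" row is the one normally read, and with p = .107 the difference between the groups is not significant at the 0.05 level.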

Page 72: Lect w8 w9_correlation_regression

Thank you