Correlation, Regression & T-test
Prepared By: Dr. Kumara Thevan a/l Krishnan
Introduction
A relationship between two or more numerical (quantitative) variables can be investigated using the techniques of correlation and regression analysis.
- Correlation is a statistical method used to determine whether a linear relationship exists between variables.
- Simple linear regression is a statistical method used to describe the nature of the relationship between two variables.
Definition
A scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are plotted with a horizontal x axis and a vertical y axis. Each individual (x, y) pair is plotted as a single point.
Definition
Correlation exists between two variables when one of them is related to the other in some way.
Example
Open SPSS.
Data: weight height biometry male 2012.sav
Graphs => Legacy Dialogs => Scatter plot
Height (X); Weight (Y)
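A quick way to reproduce this scatterplot outside SPSS is with Python; a minimal sketch, assuming the .sav data have been exported to CSV with hypothetical column names "height" and "weight":

```python
# Scatterplot of weight against height, mirroring the SPSS Legacy Dialogs step.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weight_height_biometry_male_2012.csv")  # hypothetical export of the .sav file
plt.scatter(df["height"], df["weight"])
plt.xlabel("Height (X)")
plt.ylabel("Weight (Y)")
plt.title("Weight vs. height")
plt.show()
```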
BMI test
Category                                   BMI range (kg/m²)
Very severely underweight                  less than 15
Severely underweight                       from 15.0 to 16.0
Underweight                                from 16.0 to 18.5
Normal (healthy weight)                    from 18.5 to 25
Overweight                                 from 25 to 30
Obese Class I (Moderately obese)           from 30 to 35
Obese Class II (Severely obese)            from 35 to 40
Obese Class III (Very severely obese)      over 40
Normality test?
• The Kolmogorov-Smirnov and Shapiro-Wilk tests.
• These compare the scores in the sample to a normally distributed set of scores with the same mean and s.d.
• If p > 0.05, the test is non-significant: the distribution of the sample is not significantly different from a normal distribution.
• If the test is significant (p < 0.05), the distribution in question is significantly different from a normal distribution (non-normal).
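Both tests are available in SciPy; a minimal sketch on a stand-in sample (the variable names here are illustrative, not from the course data):

```python
# Shapiro-Wilk and Kolmogorov-Smirnov normality tests on one sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores = rng.normal(loc=50, scale=10, size=100)  # stand-in sample

w, p_sw = stats.shapiro(scores)                  # Shapiro-Wilk

# K-S against a normal with the sample's own mean and s.d. (standardise first).
z = (scores - scores.mean()) / scores.std(ddof=1)
d, p_ks = stats.kstest(z, "norm")

print(f"Shapiro-Wilk p = {p_sw:.3f}, K-S p = {p_ks:.3f}")
# p > 0.05 in either test: no significant departure from normality.
```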
What can you see?
Positive Linear Correlation
Figure: (a) positive, (b) strong positive, (c) perfect positive (y plotted against x).
Negative Linear Correlation
Figure: (d) negative, (e) strong negative, (f) perfect negative (y plotted against x).
What can you see?
Test 1
- Draw a scatter plot using the data in ExamAnxiety.sav
- Exam performance (%) on the y axis
- Exam anxiety on the x axis
- Colour: set by gender
- Results?
- Try a 3D plot with all 3 variables
Bivariate correlation
• Having taken a preliminary glance at the data, we can proceed to conduct the correlation analysis.
Definition
The linear correlation coefficient r measures the strength of the linear relationship between the paired quantitative x and y values in a sample.
No Linear Correlation
Figure: (g) no correlation, (h) nonlinear correlation (y plotted against x).
Definition
The linear correlation coefficient r is sometimes referred to as the Pearson product moment correlation coefficient.
Notation for the Linear Correlation Coefficient
n        number of pairs of data presented.
Σ        denotes the addition of the items indicated.
Σx       denotes the sum of all x values.
Σx²      indicates that each x score should be squared and then those squares added.
(Σx)²    indicates that the x scores should be added and the total then squared.
Σxy      indicates that each x score should first be multiplied by its corresponding y score; after obtaining all such products, find their sum.
r        represents the linear correlation coefficient for a sample.
ρ        represents the linear correlation coefficient for a population.

Definition: Linear Correlation Coefficient r

r = [nΣxy − (Σx)(Σy)] / ( √[n(Σx²) − (Σx)²] · √[n(Σy²) − (Σy)²] )
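The formula can be computed directly and checked against SciPy's built-in Pearson correlation; a small sketch with made-up numbers:

```python
# Pearson r from the raw-score formula, verified against scipy.stats.pearsonr.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt(n * np.sum(x**2) - np.sum(x)**2) * np.sqrt(n * np.sum(y**2) - np.sum(y)**2)
r = num / den

r_scipy, p = stats.pearsonr(x, y)
print(f"formula r = {r:.4f}, scipy r = {r_scipy:.4f}, p = {p:.4f}")
```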
Test 2
• Run the correlation analysis on ExamAnxiety.sav
• Assumption: the data are normally distributed
Results

Correlations
                                             Exam Performance (%)   Exam Anxiety   Time Spent Revising
Exam Performance (%)   Pearson Correlation   1                      -.441          .397
                       Sig. (1-tailed)                              .000           .000
                       N                     103                    103            103
Exam Anxiety           Pearson Correlation   -.441                  1              -.709
                       Sig. (1-tailed)       .000                                  .000
                       N                     103                    103            103
Time Spent Revising    Pearson Correlation   .397                   -.709          1
                       Sig. (1-tailed)       .000                   .000
                       N                     103                    103            103
**. Correlation is significant at the 0.01 level (1-tailed).
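A correlation matrix like this can be reproduced in pandas; a sketch on synthetic data shaped to mimic the signs reported above (column names are stand-ins, not the actual ExamAnxiety.sav variables):

```python
# Pairwise Pearson correlation matrix, analogous to SPSS bivariate correlations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
anxiety = rng.normal(60, 15, 103)
revising = 100 - 0.7 * anxiety + rng.normal(0, 10, 103)      # negatively related to anxiety
performance = 40 + 0.4 * revising + rng.normal(0, 12, 103)   # positively related to revision

df = pd.DataFrame({"exam_performance": performance,
                   "exam_anxiety": anxiety,
                   "time_revising": revising})
print(df.corr(method="pearson"))
```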
Interpretation
• Exam performance is positively related to the amount of time spent revising, with a coefficient of r = 0.397, which is significant at p < 0.01.
• Exam anxiety appears to be negatively related to the time spent revising (r = -0.709, p < 0.01).
Interpretation
• Each variable is perfectly correlated with itself (r = 1).
• Exam performance is negatively related to exam anxiety, with a Pearson correlation coefficient of r = -0.441, and there is less than a 0.01 probability that a correlation coefficient this big would have occurred by chance in a sample of 103 people.
In layman's terms
• As exam anxiety goes up, exam marks go down.
• As revision time goes up, exam marks go up.
• As revision time goes up, exam anxiety goes down.
Hands on
• Is there a linear association between weight and heart girth in this herd of cows?
• Weight was measured in kg and heart girth in cm on 10 cows.
• Assume the data are normally distributed.
• The sample coefficient of correlation is 0.704. The p-value is 0.012, which is less than 0.05. The conclusion is that a correlation exists in the population.
Correlations
                               Weight   Girth
Weight   Pearson Correlation   1        .704
         Sig. (1-tailed)                .012
         N                     10       10
Girth    Pearson Correlation   .704     1
         Sig. (1-tailed)       .012
         N                     10       10
*. Correlation is significant at the 0.05 level (1-tailed).
Using R² for interpretation
(correlation coefficient)² = coefficient of determination, R²
R² is a measure of the amount of variability in one variable that is explained by the other.
Example
(Refer to the Correlations table for ExamAnxiety.sav shown above: exam performance vs. exam anxiety, r = -.441.)
Example
Exam anxiety and exam performance
• (correlation coefficient)² = coefficient of determination, R²
• R² = (-0.441)² = 0.194
• As a percentage: 0.194 × 100 = 19.4%
• Although exam anxiety was correlated with exam performance, it can account for only 19.4% of the variation in exam scores.
• The remaining 80.6% of the variability must be accounted for by other variables, such as differences in ability, differences in level of preparation, and so on.
Hands on
Subject   Age, x   Pressure, y
A         43       128
B         48       120
C         56       135
D         61       143
E         67       141
F         70       152

Compute the value of the correlation coefficient for these data. Is there enough statistical evidence that this relationship did not occur by chance?

Correlations
                                Age     Pressure
Age        Pearson Correlation  1       .897
           Sig. (2-tailed)              .015
           N                    6       6
Pressure   Pearson Correlation  .897    1
           Sig. (2-tailed)      .015
           N                    6       6
*. Correlation is significant at the 0.05 level (2-tailed).

R² = ?
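Since the six data pairs are given in full, the SPSS output can be verified directly; a sketch using SciPy:

```python
# Verify r, its two-tailed p-value, and R² for the age/pressure data.
import numpy as np
from scipy import stats

age = np.array([43, 48, 56, 61, 67, 70], dtype=float)
pressure = np.array([128, 120, 135, 143, 141, 152], dtype=float)

r, p = stats.pearsonr(age, pressure)   # two-tailed p-value
print(f"r = {r:.3f}, p = {p:.3f}, R² = {r**2:.3f}")
# Matches the table: r = .897, p = .015, so R² is about 0.80.
```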
Regression
Correlation does not provide predictive power over the variables. In regression analysis we fit a predictive model to our data and use that model to predict values of the dependent variable from one or more independent variables.
Independent vs. Dependent variables
Independent variable:
• Intentionally manipulated
• Controlled
• Varies at a known rate
• Cause
Dependent variable:
• Intentionally left alone
• Measured
• Varies at an unknown rate
• Effect
• Simple regression seeks to predict an outcome variable from a single predictor variable, whereas multiple regression seeks to predict an outcome from several predictors.

Outcomeᵢ = (Modelᵢ) + errorᵢ
Yᵢ = (b₀ + b₁xᵢ) + eᵢ
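A minimal sketch of fitting this model by least squares with NumPy (the data values are made up for illustration):

```python
# Fit Yi = b0 + b1*xi by least squares and inspect the residuals.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)   # slope b1 and intercept b0 of the best-fit line
y_hat = b0 + b1 * x                # model predictions
e = y - y_hat                      # residuals: vertical distances from the line
print(f"Y = {b0:.3f} + {b1:.3f}x, residual SS = {np.sum(e**2):.4f}")
```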
Least squares
Least squares is a method of finding the line that best fits the data. This "line of best fit" is found by ascertaining which line, of all the possible lines that could be drawn, results in the least amount of difference between the observed data points and the line.
The vertical (dashed) lines represent the differences (or residuals) between the line and the actual data.
• "The best fit line": there will be small differences between the values predicted by the line and the data that were actually observed.
• Our interest is in the vertical differences between the line and the actual data, because we are using the line to predict values of Y from values of the X variable.
• Some data fall above or below the line, indicating that there is a difference between the model fitted to these data and the data collected.
• These differences are called "residuals".
• The positive and negative residuals would cancel each other out. How do we avoid this?
• Square the differences before adding them up.
• If the squared differences are large, the line is not representative of the data; if the squared differences are small, it is representative.
Total sum of squares, SST
SST uses the differences between the observed data and the mean value of Y
• The sum of squared differences (SS) can be calculated for any line that is fitted to some data; the "goodness of fit" of each line can then be compared by looking at the sum of squares for each.
• The method of least squares works by selecting the line that has the lowest sum of squared differences (so it chooses the line that best represents the observed data).
• This "line of best fit" is known as the regression line.
Residual sum of squares, SSR
SSR uses the differences between the observed data and the regression line.

Model sum of squares, SSM
SSM uses the differences between the mean value of Y and the regression line.
F-ratio
F = MSM / MSR

MSM (mean square for the model) = SSM / (number of variables in the model)
MSR (mean square for the residual) = SSR / (number of observations − number of parameters being estimated)
F-ratio
• A good model should have a large F-ratio (greater than 1 at least).
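Putting the pieces together, here is a sketch that computes SST, SSM, SSR and the F-ratio for a simple regression on made-up data:

```python
# Sum-of-squares decomposition and F-ratio for a one-predictor regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean())**2)       # total: data vs. mean of Y
ssr = np.sum((y - y_hat)**2)          # residual: data vs. regression line
ssm = np.sum((y_hat - y.mean())**2)   # model: regression line vs. mean of Y

msm = ssm / 1                         # one variable in the model
msr = ssr / (n - 2)                   # n observations, two parameters (b0, b1)
print(f"F = {msm / msr:.2f}")         # a good model gives F well above 1
```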
Test 1
Open the sample data Record1.sav
Graphs => scatterplot
Analyze => Regression
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .578   .335       .331                65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
Interpretation
• The value of R² is 0.335, which tells us that advertising expenditure can account for 33.5% of the variation in record sales.
• This means that 66.5% of the variation in record sales cannot be explained by advertising alone.
• The F-ratio is 99.587, which is significant at p < 0.001 (the value in the column labelled Sig. is less than 0.001). This result tells us there is less than a 0.1% chance that an F-ratio this large would happen by chance alone. Overall, the regression model predicts record sales significantly well.
Multiple regression
• Open data: Record2.sav
Results

Descriptive Statistics
                                           Mean     Std. Deviation   N
Record Sales (thousands)                   193.20   80.699           200
Advertsing Budget (thousands of pounds)    614.41   485.655          200
No. of plays on Radio 1 per week           27.50    12.270           200
Attractiveness of Band                     6.77     1.395            200
Correlations
                                            Record Sales   Advertsing   No. of plays on    Attractiveness
                                            (thousands)    Budget       Radio 1 per week   of Band
Pearson Correlation
  Record Sales (thousands)                  1.000          .578         .599               .326
  Advertsing Budget (thousands of pounds)   .578           1.000        .102               .081
  No. of plays on Radio 1 per week          .599           .102         1.000              .182
  Attractiveness of Band                    .326           .081         .182               1.000
Sig. (1-tailed)
  Record Sales (thousands)                  .              .000         .000               .000
  Advertsing Budget (thousands of pounds)   .000           .            .076               .128
  No. of plays on Radio 1 per week          .000           .076         .                  .005
  Attractiveness of Band                    .000           .128         .005               .
N = 200 for every cell.
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
1       .815   .665       .660                47.087                       .665              129.498    3     196   .000            1.950
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .578   .335       .331                65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
ANOVA
Model          Sum of Squares   df    Mean Square   F        Sig.
1 Regression   433687.833       1     433687.833    99.587   .000
  Residual     862264.167       198   4354.870
  Total        1295952.000      199
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
b. Dependent Variable: Record Sales (thousands)
ANOVA
Model          Sum of Squares   df    Mean Square   F         Sig.
1 Regression   861377.418       3     287125.806    129.498   .000
  Residual     434574.582       196   2217.217
  Total        1295952.000      199
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
Hands on
• Open the file softdrinks.sav
• Run a multiple regression analysis
• Y (dependent): delivery time
Results

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .980   .960       .956                3.25947
a. Predictors: (Constant), distance, cases
ANOVA
Model          Sum of Squares   df   Mean Square   F         Sig.
1 Regression   5550.811         2    2775.405      261.235   .000
  Residual     233.732          22   10.624
  Total        5784.543         24
a. Predictors: (Constant), distance, cases
b. Dependent Variable: time
Coefficients
               Unstandardized Coefficients   Standardized Coefficients
Model          B        Std. Error           Beta                        t       Sig.
1 (Constant)   2.341    1.097                                            2.135   .044
  cases        1.616    .171                 .716                        9.464   .000
  distance     .014     .004                 .301                        3.981   .001
a. Dependent Variable: time
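The Coefficients table gives the prediction equation time = 2.341 + 1.616·cases + 0.014·distance; a sketch of using it (the cases and distance values below are made up):

```python
# Predict delivery time from the fitted softdrinks regression equation.
def predict_time(cases: float, distance: float) -> float:
    # Coefficients taken from the SPSS Coefficients table above.
    return 2.341 + 1.616 * cases + 0.014 * distance

print(f"Predicted time for 10 cases at distance 500: {predict_time(10, 500):.2f}")
```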
T-test
• Testing differences between means.
• Dependent means t-test: used when there are two experimental conditions and the same participants took part in both conditions of the experiment.
• Independent means t-test: used when there are two experimental conditions and different participants were assigned to each condition.
Dependent t-test
• 12 spider-phobes were exposed to a picture of a spider (picture) and, on a separate occasion, to a real live tarantula (real). Their anxiety was measured in each condition (half of the participants were exposed to the picture before the real spider, while the other half were exposed to the real spider first).
• Which situation caused more anxiety?
• Open spiderRM.sav
Results

Paired Samples Statistics
                             Mean    N    Std. Deviation   Std. Error Mean
Pair 1   Picture of Spider   40.00   12   9.293            2.683
         Real Spider         47.00   12   11.029           3.184

Paired Samples Correlations
                                           N    Correlation   Sig.
Pair 1   Picture of Spider & Real Spider   12   .545          .067

r = 0.545, not significantly correlated (p > 0.05)
Paired Samples Test
                                           Paired Differences
                                           Mean     Std. Deviation   Std. Error Mean   95% CI of the Difference   t        df   Sig. (2-tailed)
                                                                                       Lower        Upper
Pair 1   Picture of Spider - Real Spider   -7.000   9.807            2.831             -13.231      -.769          -2.473   11   .031
The t-value is negative, which tells us that the picture condition had a smaller mean than the real tarantula, so the real spider led to greater anxiety than the picture. Conclusion: exposure to a real spider caused significantly more reported anxiety in spider-phobes than exposure to a picture (t(11) = -2.47, p < 0.05).
Hands on
• All students who enrol in a certain memory course are given a pretest before the course begins. At the completion of the course they take a post-test; both sets of scores are listed here. Verify the results shown on the output by calculating the values yourself, and assume normality.

Student   1    2    3    4    5    6    7    8    9    10
Before    93   86   72   54   92   65   80   81   62   73
After     98   92   80   62   91   78   89   78   71   80
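A sketch of the paired t-test on these scores with SciPy (the same data as the table, so the output can be checked against SPSS):

```python
# Dependent (paired) t-test on the pretest/post-test memory scores.
from scipy import stats

before = [93, 86, 72, 54, 92, 65, 80, 81, 62, 73]
after  = [98, 92, 80, 62, 91, 78, 89, 78, 71, 80]

t, p = stats.ttest_rel(before, after)   # two-tailed, df = n - 1 = 9
print(f"t(9) = {t:.3f}, p = {p:.4f}")
# A negative t means the post-test mean is higher than the pretest mean.
```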
Independent t-test
• 12 spider-phobes were exposed to a picture of a spider and 12 different spider-phobes were exposed to a real live tarantula. Anxiety levels were measured.
• Open spiderBG.sav
Group Statistics
          Spider or Picture?   N    Mean    Std. Deviation   Std. Error Mean
Anxiety   Picture              12   40.00   9.293            2.683
          Real Spider          12   47.00   11.029           3.184
Independent Samples Test
                                        Levene's Test for
                                        Equality of Variances   t-test for Equality of Means
                                        F       Sig.            t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Anxiety   Equal variances assumed       .782    .386            -1.681   22       .107              -7.000            4.163                   -15.634        1.634
          Equal variances not assumed                           -1.681   21.385   .107              -7.000            4.163                   -15.649        1.649
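A sketch of the same analysis in SciPy, on stand-in anxiety scores drawn to match the reported group means and s.d.s (not the actual spiderBG.sav data):

```python
# Independent-samples t-test, with and without the equal-variances assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
picture = rng.normal(40, 9.3, 12)    # stand-in for the picture group
real = rng.normal(47, 11.0, 12)      # stand-in for the real-spider group

t_eq, p_eq = stats.ttest_ind(picture, real)                  # equal variances assumed
t_w, p_w = stats.ttest_ind(picture, real, equal_var=False)   # Welch's t-test
print(f"equal var: t = {t_eq:.3f}, p = {p_eq:.3f}; Welch: t = {t_w:.3f}, p = {p_w:.3f}")
```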
Thank you