Correlation, Regression & T-test
Prepared By: Dr. Kumara Thevan a/l Krishnan
Introduction
A relationship between two or more numerical (quantitative) variables can be investigated using the techniques of correlation and regression analysis.
- Correlation is a statistical method used to determine whether a linear relationship exists between variables.
- Simple linear regression is a statistical method used to describe the nature of the relationship between two variables.
Definition
A scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are plotted with a horizontal x axis and a vertical y axis. Each individual (x, y) pair is plotted as a single point.
Definition
Correlation exists between two variables when one of them is related to the other in some way.
Example
Open SPSS.
Data: weight height biometry male 2012.sav
Graphs => Legacy Dialogs => Scatter plot
Height (X); Weight (Y)
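A quick way to reproduce this scatterplot outside SPSS is with Python; a minimal sketch, assuming the .sav data have been exported to CSV with hypothetical column names "height" and "weight":

```python
# Scatterplot of weight against height, mirroring the SPSS Legacy Dialogs step.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weight_height_biometry_male_2012.csv")  # hypothetical export of the .sav file
plt.scatter(df["height"], df["weight"])
plt.xlabel("Height (X)")
plt.ylabel("Weight (Y)")
plt.title("Weight vs. height")
plt.show()
```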
BMI test
Category                                   BMI range (kg/m²)
Very severely underweight                  less than 15
Severely underweight                       from 15.0 to 16.0
Underweight                                from 16.0 to 18.5
Normal (healthy weight)                    from 18.5 to 25
Overweight                                 from 25 to 30
Obese Class I (Moderately obese)           from 30 to 35
Obese Class II (Severely obese)            from 35 to 40
Obese Class III (Very severely obese)      over 40
Normality test?
• The Kolmogorov-Smirnov and Shapiro-Wilk tests.
• These compare the scores in the sample to a normally distributed set of scores with the same mean and s.d.
• If p > 0.05, the test is non-significant: the distribution of the sample is not significantly different from a normal distribution.
• If the test is significant (p < 0.05), the distribution in question is significantly different from a normal distribution (non-normal).
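Both tests are available in SciPy; a minimal sketch on a stand-in sample (the variable names here are illustrative, not from the course data):

```python
# Shapiro-Wilk and Kolmogorov-Smirnov normality tests on one sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores = rng.normal(loc=50, scale=10, size=100)  # stand-in sample

w, p_sw = stats.shapiro(scores)                  # Shapiro-Wilk

# K-S against a normal with the sample's own mean and s.d. (standardise first).
z = (scores - scores.mean()) / scores.std(ddof=1)
d, p_ks = stats.kstest(z, "norm")

print(f"Shapiro-Wilk p = {p_sw:.3f}, K-S p = {p_ks:.3f}")
# p > 0.05 in either test: no significant departure from normality.
```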
What can you see?
Positive Linear Correlation
Figure: (a) positive, (b) strong positive, (c) perfect positive (y plotted against x).
Negative Linear Correlation
Figure: (d) negative, (e) strong negative, (f) perfect negative (y plotted against x).
What can you see?
Test 1
- Draw a scatter plot using the data in ExamAnxiety.sav
- Exam performance (%) on the y axis
- Exam anxiety on the x axis
- Colour: set by gender
- Results?
- Try a 3D plot with all 3 variables
Bivariate correlation
• Having taken a preliminary glance at the data, we can proceed to conduct the correlation analysis.
Definition
The linear correlation coefficient r measures the strength of the linear relationship between the paired quantitative x and y values in a sample.
No Linear Correlation
Figure: (g) no correlation, (h) nonlinear correlation (y plotted against x).
Definition
The linear correlation coefficient r is sometimes referred to as the Pearson product moment correlation coefficient.
Notation for the Linear Correlation Coefficient
n        number of pairs of data presented.
Σ        denotes the addition of the items indicated.
Σx       denotes the sum of all x values.
Σx²      indicates that each x score should be squared and then those squares added.
(Σx)²    indicates that the x scores should be added and the total then squared.
Σxy      indicates that each x score should first be multiplied by its corresponding y score; after obtaining all such products, find their sum.
r        represents the linear correlation coefficient for a sample.
ρ        represents the linear correlation coefficient for a population.

Definition: Linear Correlation Coefficient r

r = [nΣxy − (Σx)(Σy)] / ( √[n(Σx²) − (Σx)²] · √[n(Σy²) − (Σy)²] )
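The formula can be computed directly and checked against SciPy's built-in Pearson correlation; a small sketch with made-up numbers:

```python
# Pearson r from the raw-score formula, verified against scipy.stats.pearsonr.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt(n * np.sum(x**2) - np.sum(x)**2) * np.sqrt(n * np.sum(y**2) - np.sum(y)**2)
r = num / den

r_scipy, p = stats.pearsonr(x, y)
print(f"formula r = {r:.4f}, scipy r = {r_scipy:.4f}, p = {p:.4f}")
```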
Test 2
• Run the correlation analysis on ExamAnxiety.sav
• Assumption: the data are normally distributed
Results

Correlations
                                             Exam Performance (%)   Exam Anxiety   Time Spent Revising
Exam Performance (%)   Pearson Correlation   1                      -.441          .397
                       Sig. (1-tailed)                              .000           .000
                       N                     103                    103            103
Exam Anxiety           Pearson Correlation   -.441                  1              -.709
                       Sig. (1-tailed)       .000                                  .000
                       N                     103                    103            103
Time Spent Revising    Pearson Correlation   .397                   -.709          1
                       Sig. (1-tailed)       .000                   .000
                       N                     103                    103            103
**. Correlation is significant at the 0.01 level (1-tailed).
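A correlation matrix like this can be reproduced in pandas; a sketch on synthetic data shaped to mimic the signs reported above (column names are stand-ins, not the actual ExamAnxiety.sav variables):

```python
# Pairwise Pearson correlation matrix, analogous to SPSS bivariate correlations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
anxiety = rng.normal(60, 15, 103)
revising = 100 - 0.7 * anxiety + rng.normal(0, 10, 103)      # negatively related to anxiety
performance = 40 + 0.4 * revising + rng.normal(0, 12, 103)   # positively related to revision

df = pd.DataFrame({"exam_performance": performance,
                   "exam_anxiety": anxiety,
                   "time_revising": revising})
print(df.corr(method="pearson"))
```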
Interpretation
• Exam performance is positively related to the amount of time spent revising, with a coefficient of r = 0.397, which is significant at p < 0.01.
• Exam anxiety appears to be negatively related to the time spent revising (r = -0.709, p < 0.01).
Interpretation
• Each variable is perfectly correlated with itself (r = 1).
• Exam performance is negatively related to exam anxiety, with a Pearson correlation coefficient of r = -0.441, and there is less than a 0.01 probability that a correlation coefficient this big would have occurred by chance in a sample of 103 people.
In layman's terms
• As exam anxiety goes up, exam marks go down.
• As revision time goes up, exam marks go up.
• As revision time goes up, exam anxiety goes down.
Hands on
• Is there a linear association between weight and heart girth in this herd of cows?
• Weight was measured in kg and heart girth in cm on 10 cows.
• Assume the data are normally distributed.
• The sample coefficient of correlation is 0.704. The p-value is 0.012, which is less than 0.05. The conclusion is that a correlation exists in the population.
Correlations
                               Weight   Girth
Weight   Pearson Correlation   1        .704
         Sig. (1-tailed)                .012
         N                     10       10
Girth    Pearson Correlation   .704     1
         Sig. (1-tailed)       .012
         N                     10       10
*. Correlation is significant at the 0.05 level (1-tailed).
Using R² for interpretation
(correlation coefficient)² = coefficient of determination, R²
R² is a measure of the amount of variability in one variable that is explained by the other.
Example
(Refer to the Correlations table for ExamAnxiety.sav shown above: exam performance vs. exam anxiety, r = -.441.)
Example
Exam anxiety and exam performance
• (correlation coefficient)² = coefficient of determination, R²
• R² = (-0.441)² = 0.194
• As a percentage: 0.194 × 100 = 19.4%
• Although exam anxiety was correlated with exam performance, it can account for only 19.4% of the variation in exam scores.
• The remaining 80.6% of the variability must be accounted for by other variables, such as differences in ability, differences in level of preparation, and so on.
Hands on
Subject   Age, x   Pressure, y
A         43       128
B         48       120
C         56       135
D         61       143
E         67       141
F         70       152

Compute the value of the correlation coefficient for these data. Is there enough statistical evidence that this relationship did not occur by chance?

Correlations
                                Age     Pressure
Age        Pearson Correlation  1       .897
           Sig. (2-tailed)              .015
           N                    6       6
Pressure   Pearson Correlation  .897    1
           Sig. (2-tailed)      .015
           N                    6       6
*. Correlation is significant at the 0.05 level (2-tailed).

R² = ?
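Since the six data pairs are given in full, the SPSS output can be verified directly; a sketch using SciPy:

```python
# Verify r, its two-tailed p-value, and R² for the age/pressure data.
import numpy as np
from scipy import stats

age = np.array([43, 48, 56, 61, 67, 70], dtype=float)
pressure = np.array([128, 120, 135, 143, 141, 152], dtype=float)

r, p = stats.pearsonr(age, pressure)   # two-tailed p-value
print(f"r = {r:.3f}, p = {p:.3f}, R² = {r**2:.3f}")
# Matches the table: r = .897, p = .015, so R² is about 0.80.
```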
Regression
Correlation does not provide predictive power over the variables. In regression analysis we fit a predictive model to our data and use that model to predict values of the dependent variable from one or more independent variables.
Independent vs. Dependent variables
Independent variable:
• Intentionally manipulated
• Controlled
• Varies at a known rate
• Cause
Dependent variable:
• Intentionally left alone
• Measured
• Varies at an unknown rate
• Effect
• Simple regression seeks to predict an outcome variable from a single predictor variable, whereas multiple regression seeks to predict an outcome from several predictors.

Outcomeᵢ = (Modelᵢ) + errorᵢ
Yᵢ = (b₀ + b₁xᵢ) + eᵢ
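A minimal sketch of fitting this model by least squares with NumPy (the data values are made up for illustration):

```python
# Fit Yi = b0 + b1*xi by least squares and inspect the residuals.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)   # slope b1 and intercept b0 of the best-fit line
y_hat = b0 + b1 * x                # model predictions
e = y - y_hat                      # residuals: vertical distances from the line
print(f"Y = {b0:.3f} + {b1:.3f}x, residual SS = {np.sum(e**2):.4f}")
```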
Least squares
Least squares is a method of finding the line that best fits the data. This "line of best fit" is found by ascertaining which line, of all the possible lines that could be drawn, results in the least amount of difference between the observed data points and the line.
The vertical (dashed) lines represent the differences (or residuals) between the line and the actual data.
• "The best fit line": there will be small differences between the values predicted by the line and the data that were actually observed.
• Our interest is in the vertical differences between the line and the actual data, because we are using the line to predict values of Y from values of the X variable.
• Some data fall above or below the line, indicating that there is a difference between the model fitted to these data and the data collected.
• These differences are called "residuals".
• The positive and negative residuals would cancel each other out. How do we avoid this?
• Square the differences before adding them up.
• If the squared differences are large, the line is not representative of the data; if the squared differences are small, it is representative.
Total sum of squares, SST
SST uses the differences between the observed data and the mean value of Y
• The sum of squared differences (SS) can be calculated for any line that is fitted to some data; the "goodness of fit" of each line can then be compared by looking at the sum of squares for each.
• The method of least squares works by selecting the line that has the lowest sum of squared differences (so it chooses the line that best represents the observed data).
• This "line of best fit" is known as the regression line.
Residual sum of squares, SSR
SSR uses the differences between the observed data and the regression line.

Model sum of squares, SSM
SSM uses the differences between the mean value of Y and the regression line.
F-ratio
F = MSM / MSR

MSM (mean square for the model) = SSM / (number of variables in the model)
MSR (mean square for the residual) = SSR / (number of observations − number of parameters being estimated)
F-ratio
• A good model should have a large F-ratio (greater than 1 at least).
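Putting the pieces together, here is a sketch that computes SST, SSM, SSR and the F-ratio for a simple regression on made-up data:

```python
# Sum-of-squares decomposition and F-ratio for a one-predictor regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean())**2)       # total: data vs. mean of Y
ssr = np.sum((y - y_hat)**2)          # residual: data vs. regression line
ssm = np.sum((y_hat - y.mean())**2)   # model: regression line vs. mean of Y

msm = ssm / 1                         # one variable in the model
msr = ssr / (n - 2)                   # n observations, two parameters (b0, b1)
print(f"F = {msm / msr:.2f}")         # a good model gives F well above 1
```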
Test 1
Open the sample data Record1.sav
Graphs => scatterplot
Analyze => Regression
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .578   .335       .331                65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
Interpretation
• The value of R² is 0.335, which tells us that advertising expenditure can account for 33.5% of the variation in record sales.
• This means that 66.5% of the variation in record sales cannot be explained by advertising alone.
• The F-ratio is 99.587, which is significant at p < 0.001 (the value in the column labelled Sig. is less than 0.001). This result tells us there is less than a 0.1% chance that an F-ratio this large would happen by chance alone. Overall, the regression model predicts record sales significantly well.
Multiple regression
• Open data: Record2.sav
Results

Descriptive Statistics
                                           Mean     Std. Deviation   N
Record Sales (thousands)                   193.20   80.699           200
Advertsing Budget (thousands of pounds)    614.41   485.655          200
No. of plays on Radio 1 per week           27.50    12.270           200
Attractiveness of Band                     6.77     1.395            200
Correlations
                                            Record Sales   Advertsing   No. of plays on    Attractiveness
                                            (thousands)    Budget       Radio 1 per week   of Band
Pearson Correlation
  Record Sales (thousands)                  1.000          .578         .599               .326
  Advertsing Budget (thousands of pounds)   .578           1.000        .102               .081
  No. of plays on Radio 1 per week          .599           .102         1.000              .182
  Attractiveness of Band                    .326           .081         .182               1.000
Sig. (1-tailed)
  Record Sales (thousands)                  .              .000         .000               .000
  Advertsing Budget (thousands of pounds)   .000           .            .076               .128
  No. of plays on Radio 1 per week          .000           .076         .                  .005
  Attractiveness of Band                    .000           .128         .005               .
N = 200 for every cell.
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
1       .815   .665       .660                47.087                       .665              129.498    3     196   .000            1.950
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .578   .335       .331                65.991
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
ANOVA
Model          Sum of Squares   df    Mean Square   F        Sig.
1 Regression   433687.833       1     433687.833    99.587   .000
  Residual     862264.167       198   4354.870
  Total        1295952.000      199
a. Predictors: (Constant), Advertsing Budget (thousands of pounds)
b. Dependent Variable: Record Sales (thousands)
ANOVA
Model          Sum of Squares   df    Mean Square   F         Sig.
1 Regression   861377.418       3     287125.806    129.498   .000
  Residual     434574.582       196   2217.217
  Total        1295952.000      199
a. Predictors: (Constant), Attractiveness of Band, Advertsing Budget (thousands of pounds), No. of plays on Radio 1 per week
b. Dependent Variable: Record Sales (thousands)
Hands on
• Open the file softdrinks.sav
• Run a multiple regression analysis
• Y (dependent): delivery time
Results

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .980   .960       .956                3.25947
a. Predictors: (Constant), distance, cases
ANOVA
Model          Sum of Squares   df   Mean Square   F         Sig.
1 Regression   5550.811         2    2775.405      261.235   .000
  Residual     233.732          22   10.624
  Total        5784.543         24
a. Predictors: (Constant), distance, cases
b. Dependent Variable: time
Coefficients
               Unstandardized Coefficients   Standardized Coefficients
Model          B        Std. Error           Beta                        t       Sig.
1 (Constant)   2.341    1.097                                            2.135   .044
  cases        1.616    .171                 .716                        9.464   .000
  distance     .014     .004                 .301                        3.981   .001
a. Dependent Variable: time
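The Coefficients table gives the prediction equation time = 2.341 + 1.616·cases + 0.014·distance; a sketch of using it (the cases and distance values below are made up):

```python
# Predict delivery time from the fitted softdrinks regression equation.
def predict_time(cases: float, distance: float) -> float:
    # Coefficients taken from the SPSS Coefficients table above.
    return 2.341 + 1.616 * cases + 0.014 * distance

print(f"Predicted time for 10 cases at distance 500: {predict_time(10, 500):.2f}")
```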
T-test
• Testing differences between means.
• Dependent means t-test: used when there are two experimental conditions and the same participants took part in both conditions of the experiment.
• Independent means t-test: used when there are two experimental conditions and different participants were assigned to each condition.
Dependent t-test
• 12 spider-phobes were exposed to a picture of a spider (picture) and, on a separate occasion, to a real live tarantula (real). Their anxiety was measured in each condition (half of the participants were exposed to the picture before the real spider, while the other half were exposed to the real spider first).
• Which situation caused more anxiety?
• Open spiderRM.sav
Results

Paired Samples Statistics
                             Mean    N    Std. Deviation   Std. Error Mean
Pair 1   Picture of Spider   40.00   12   9.293            2.683
         Real Spider         47.00   12   11.029           3.184

Paired Samples Correlations
                                           N    Correlation   Sig.
Pair 1   Picture of Spider & Real Spider   12   .545          .067

r = 0.545, not significantly correlated (p > 0.05)
Paired Samples Test
                                           Paired Differences
                                           Mean     Std. Deviation   Std. Error Mean   95% CI of the Difference   t        df   Sig. (2-tailed)
                                                                                       Lower        Upper
Pair 1   Picture of Spider - Real Spider   -7.000   9.807            2.831             -13.231      -.769          -2.473   11   .031
The t-value is negative, which tells us that the picture condition had a smaller mean than the real tarantula, so the real spider led to greater anxiety than the picture. Conclusion: exposure to a real spider caused significantly more reported anxiety in spider-phobes than exposure to a picture (t(11) = -2.47, p < 0.05).
Hands on
• All students who enrol in a certain memory course are given a pretest before the course begins. At the completion of the course they take a post-test; both sets of scores are listed here. Verify the results shown on the output by calculating the values yourself, and assume normality.

Student   1    2    3    4    5    6    7    8    9    10
Before    93   86   72   54   92   65   80   81   62   73
After     98   92   80   62   91   78   89   78   71   80
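A sketch of the paired t-test on these scores with SciPy (the same data as the table, so the output can be checked against SPSS):

```python
# Dependent (paired) t-test on the pretest/post-test memory scores.
from scipy import stats

before = [93, 86, 72, 54, 92, 65, 80, 81, 62, 73]
after  = [98, 92, 80, 62, 91, 78, 89, 78, 71, 80]

t, p = stats.ttest_rel(before, after)   # two-tailed, df = n - 1 = 9
print(f"t(9) = {t:.3f}, p = {p:.4f}")
# A negative t means the post-test mean is higher than the pretest mean.
```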
Independent t-test
• 12 spider-phobes were exposed to a picture of a spider and 12 different spider-phobes were exposed to a real live tarantula. Anxiety levels were measured.
• Open spiderBG.sav
Group Statistics
          Spider or Picture?   N    Mean    Std. Deviation   Std. Error Mean
Anxiety   Picture              12   40.00   9.293            2.683
          Real Spider          12   47.00   11.029           3.184
Independent Samples Test
                                        Levene's Test for
                                        Equality of Variances   t-test for Equality of Means
                                        F       Sig.            t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Anxiety   Equal variances assumed       .782    .386            -1.681   22       .107              -7.000            4.163                   -15.634        1.634
          Equal variances not assumed                           -1.681   21.385   .107              -7.000            4.163                   -15.649        1.649
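A sketch of the same analysis in SciPy, on stand-in anxiety scores drawn to match the reported group means and s.d.s (not the actual spiderBG.sav data):

```python
# Independent-samples t-test, with and without the equal-variances assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
picture = rng.normal(40, 9.3, 12)    # stand-in for the picture group
real = rng.normal(47, 11.0, 12)      # stand-in for the real-spider group

t_eq, p_eq = stats.ttest_ind(picture, real)                  # equal variances assumed
t_w, p_w = stats.ttest_ind(picture, real, equal_var=False)   # Welch's t-test
print(f"equal var: t = {t_eq:.3f}, p = {p_eq:.3f}; Welch: t = {t_w:.3f}, p = {p_w:.3f}")
```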
Thank you