Multivariate Data
Descriptive techniques for Multivariate data
In most research situations data is collected on more than one variable (usually many variables)
Graphical Techniques
• The scatter plot
• The two dimensional Histogram
The Scatter Plot
For two variables X and Y we have a measurement of each variable on each case:
(xi, yi)
where
xi = the value of X for case i
and
yi = the value of Y for case i.
To construct a scatter plot we plot the points
(xi, yi)
for each case on the X-Y plane.
[Figure: a single point (xi, yi) on the X-Y plane, with xi read off the horizontal axis and yi off the vertical axis]
Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student  Verbal IQ  Math IQ  Initial Reading Achievement  Final Reading Achievement
   1         86        94              1.1                          1.7
   2        104       103              1.5                          1.7
   3         86        92              1.5                          1.9
   4        105       100              2.0                          2.0
   5        118       115              1.9                          3.5
   6         96       102              1.4                          2.4
   7         90        87              1.5                          1.8
   8         95       100              1.4                          2.0
   9        105        96              1.7                          1.7
  10         84        80              1.6                          1.7
  11         94        87              1.6                          1.7
  12        119       116              1.7                          3.1
  13         82        91              1.2                          1.8
  14         80        93              1.0                          1.7
  15        109       124              1.8                          2.5
  16        111       119              1.4                          3.0
  17         89        94              1.6                          1.8
  18         99       117              1.6                          2.6
  19         94        93              1.4                          1.4
  20         99       110              1.4                          2.0
  21         95        97              1.5                          1.3
  22        102       104              1.7                          3.1
  23        102        93              1.6                          1.9
Scatter Plot
[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both axes running from 0 to 140]
[Figure: the same scatter plot with the point (84, 80) for student 10 labelled]
[Figure: the same scatter plot with both axes rescaled to run from 60 to 130]
Some Scatter Patterns
• Circular
• No relationship between X and Y
• Unable to predict Y from X
• Ellipsoidal
• Positive relationship between X and Y
• Increases in X correspond to increases in Y (but not always)
• Major axis of the ellipse has positive slope
Example
Verbal IQ, Math IQ
[Figure: scatter plot of Math IQ against Verbal IQ, both axes running from 60 to 130]
Some More Patterns
• Ellipsoidal (thinner ellipse)
• Stronger positive relationship between X and Y
• Increases in X correspond to increases in Y (more frequently)
• Major axis of the ellipse has positive slope
• Minor axis of the ellipse much smaller
• Increased strength in the positive relationship between X and Y
• Increases in X correspond to increases in Y (almost always)
• Minor axis of the ellipse extremely small relative to the major axis of the ellipse.
• Perfect positive relationship between X and Y
• Y perfectly predictable from X
• Data falls exactly along a straight line with positive slope
• Ellipsoidal
• Negative relationship between X and Y
• Increases in X correspond to decreases in Y (but not always)
• Major axis of the ellipse has negative slope
• The strength of the relationship can increase until changes in Y can be perfectly predicted from X
[Figures: scatter plots showing progressively stronger negative relationships, ending with points lying exactly on a straight line with negative slope]
Some Non-Linear Patterns
• In a linear pattern Y increases with respect to X at a constant rate
• In a non-linear pattern the rate at which Y increases with respect to X is variable
Growth Patterns
• Growth patterns frequently follow a sigmoid curve
• Growth at the start is slow
• It then speeds up
• It slows down again as it reaches its limiting size
Review: the scatter plot
Some Scatter Patterns
Circular
• No relationship between X and Y
• Unable to predict Y from X
Ellipsoidal
• Positive relationship between X and Y
• Increases in X correspond to increases in Y (but not always)
• Major axis of the ellipse has positive slope
Ellipsoidal
• Negative relationship between X and Y
• Increases in X correspond to decreases in Y (but not always)
• Major axis of the ellipse has negative slope
Non-Linear Patterns
[Figures: the circular, ellipsoidal, linear and non-linear scatter patterns shown earlier]
Measures of strength of a relationship (Correlation)
• Pearson’s correlation coefficient (r)
• Spearman’s rank correlation coefficient (rho, ρ)
Assume that we have collected data on two variables X and Y. Let
(x1, y1) (x2, y2) (x3, y3) … (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
From this data we can compute summary statistics for each variable.
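The means
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad\text{and}\quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
The standard deviations
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \quad\text{and}\quad s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$$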
These statistics ($\bar{x}$, $\bar{y}$, $s_x$, $s_y$):
• give information for each variable separately
but
• give no information about the relationship between the two variables
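As a minimal sketch of the summary statistics above, the following Python computes the mean and the sample standard deviation (with the n - 1 divisor) for each variable separately. The data values and the function names `mean` and `sample_sd` are hypothetical, chosen only for illustration.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean: the sum of the values divided by n."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation, using the n - 1 divisor."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


# Hypothetical data, not taken from the slides.
x = [2.0, 4.0, 4.0, 6.0]
y = [1.0, 3.0, 2.0, 6.0]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = sample_sd(x), sample_sd(y)
print(x_bar, y_bar, s_x, s_y)
```

Each statistic uses only one of the two lists, which is why none of them can say anything about how X and Y vary together.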
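Consider the statistics:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$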
The first two statistics, $S_{xx}$ and $S_{yy}$:
• are used to measure variability in each variable
• are used to compute the sample standard deviations
$$s_x = \sqrt{\frac{S_{xx}}{n-1}} \quad\text{and}\quad s_y = \sqrt{\frac{S_{yy}}{n-1}}$$
The third statistic:
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• is used to measure correlation
• If two variables are positively related, the sign of $(x_i - \bar{x})$ will agree with the sign of $(y_i - \bar{y})$.
• When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be positive: when xi is above its mean, yi will be above its mean.
• When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be negative: when xi is below its mean, yi will be below its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.
This implies that the statistic $S_{xy}$
• will be positive
• most of the terms in this sum will be positive
On the other hand
• If two variables are negatively related, the sign of $(x_i - \bar{x})$ will be opposite to the sign of $(y_i - \bar{y})$.
• When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be negative: when xi is above its mean, yi will be below its mean.
• When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be positive: when xi is below its mean, yi will be above its mean.
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.
Again this implies that the statistic $S_{xy}$
• will be negative
• most of the terms in this sum will be negative
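Pearson’s correlation coefficient is defined as follows:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$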
The denominator:
$$\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}$$
is always positive (provided neither variable is constant).
The numerator:
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• is positive if there is a positive relationship between X and Y, and
• negative if there is a negative relationship between X and Y.
• This property carries over to Pearson’s correlation coefficient r.
Properties of Pearson’s correlation coefficient r
1. The value of r is always between -1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no relationship between X and Y, then r will be zero (or, in a sample, close to zero).
5. The value of r will be +1 if the points (xi, yi) lie on a straight line with positive slope.
6. The value of r will be -1 if the points (xi, yi) lie on a straight line with negative slope.
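Properties 5 and 6 can be checked numerically. The sketch below implements the definition of r directly from the deviations about the means; the function name `pearson_r` and the two straight lines are assumptions made for illustration only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical points lying exactly on two straight lines.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [3.0 + 2.0 * v for v in x])    # positive slope
r_down = pearson_r(x, [3.0 - 2.0 * v for v in x])  # negative slope
print(r_up, r_down)
```

Up to floating-point rounding, the two values are +1 and -1 regardless of the particular slope and intercept chosen.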
[Figures: ten scatter plots illustrating r = 1, 0.95, 0.7, 0.4, 0, -0.4, -0.7, -0.8, -0.95 and -1]
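Computing formulae for the statistics:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute
$$A = \sum_{i=1}^{n} x_i, \quad B = \sum_{i=1}^{n} y_i, \quad C = \sum_{i=1}^{n} x_i^2, \quad D = \sum_{i=1}^{n} y_i^2, \quad E = \sum_{i=1}^{n} x_i y_i$$
Then
$$S_{xx} = C - \frac{A^2}{n}, \quad S_{yy} = D - \frac{B^2}{n}, \quad S_{xy} = E - \frac{AB}{n}$$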
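The computing formulae can be compared against the defining formulae with a short sketch. The variable names A to E follow the slide; the data and the function names are hypothetical.

```python
def sums_via_totals(x, y):
    """S_xx, S_yy, S_xy from the five totals A, B, C, D, E."""
    n = len(x)
    A = sum(x)
    B = sum(y)
    C = sum(v * v for v in x)
    D = sum(v * v for v in y)
    E = sum(a * b for a, b in zip(x, y))
    return C - A * A / n, D - B * B / n, E - A * B / n


def sums_via_deviations(x, y):
    """The same three statistics from deviations about the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return s_xx, s_yy, s_xy


# Hypothetical data, not taken from the slides.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 7.0, 8.0]
totals = sums_via_totals(x, y)
deviations = sums_via_deviations(x, y)
print(totals, deviations)
```

The totals form needs only one pass over the data, which is what made it convenient for hand calculation; with floating-point data of large magnitude the deviation form is usually the numerically safer choice.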
Example
Verbal IQ, Math IQ
Using Data Set #3 (the 23 students tabulated earlier), let X = Verbal IQ and Y = Math IQ.
[Figure: scatter plot of Math IQ against Verbal IQ, both axes running from 60 to 130]
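Now
$$\sum_{i=1}^{n} x_i = 2244, \quad \sum_{i=1}^{n} y_i = 2307, \quad \sum_{i=1}^{n} x_i^2 = 221494, \quad \sum_{i=1}^{n} y_i^2 = 234363, \quad \sum_{i=1}^{n} x_i y_i = 227199$$
Hence
$$S_{xx} = 221494 - \frac{(2244)^2}{23} = 2557.652$$
$$S_{yy} = 234363 - \frac{(2307)^2}{23} = 2960.87$$
$$S_{xy} = 227199 - \frac{(2244)(2307)}{23} = 2116.043$$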
Thus Pearson’s correlation coefficient is:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{2116.043}{\sqrt{(2557.652)(2960.87)}} = 0.769$$
Thus r = 0.769
• Verbal IQ and Math IQ are positively correlated.
• If Verbal IQ is above (below) the mean then for most cases Math IQ will also be above (below) the mean.
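The hand calculation for Data Set #3 can be reproduced with a few lines of Python. The sketch assumes, as in the worked example, that X is Verbal IQ and Y is Math IQ; the list names are illustrative.

```python
from math import sqrt

# Verbal IQ (X) and Math IQ (Y) for the 23 students in Data Set #3.
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]

n = len(verbal)
A = sum(verbal)                                   # sum of x
B = sum(math_iq)                                  # sum of y
C = sum(v * v for v in verbal)                    # sum of x squared
D = sum(v * v for v in math_iq)                   # sum of y squared
E = sum(a * b for a, b in zip(verbal, math_iq))   # sum of x times y

s_xx = C - A * A / n
s_yy = D - B * B / n
s_xy = E - A * B / n
r = s_xy / sqrt(s_xx * s_yy)
print(round(s_xx, 3), round(s_yy, 2), round(s_xy, 3), round(r, 3))
```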
Is the improvement in reading achievement (RA) related to either Verbal IQ or Math IQ?
improvement in RA = Final RA – Initial RA
The Data

Student  Math IQ  Verbal IQ  Initial RA  Final RA  Imp RA
   1        86        94        1.1        1.7       0.6
   2       104       103        1.5        1.7       0.2
   3        86        92        1.5        1.9       0.4
   4       105       100        2.0        2.0       0.0
   5       118       115        1.9        3.5       1.6
   6        96       102        1.4        2.4       1.0
   7        90        87        1.5        1.8       0.3
   8        95       100        1.4        2.0       0.6
   9       105        96        1.7        1.7       0.0
  10        84        80        1.6        1.7       0.1
  11        94        87        1.6        1.7       0.1
  12       119       116        1.7        3.1       1.4
  13        82        91        1.2        1.8       0.6
  14        80        93        1.0        1.7       0.7
  15       109       124        1.8        2.5       0.7
  16       111       119        1.4        3.0       1.6
  17        89        94        1.6        1.8       0.2
  18        99       117        1.6        2.6       1.0
  19        94        93        1.4        1.4       0.0
  20        99       110        1.4        2.0       0.6
  21        95        97        1.5        1.3      -0.2
  22       102       104        1.7        3.1       1.4
  23       102        93        1.6        1.9       0.3
Correlation between Math IQ and RA Improvement
r = 0.48469
[Figure: scatterplot of RA Improvement against Math IQ, r = 0.48469]
Correlation between Verbal IQ and RA Improvement
r = 0.68318
[Figure: scatterplot of RA Improvement against Verbal IQ, r = 0.68318]
Spearman’s rank correlation coefficient (rho, ρ)
Spearman’s rank correlation coefficient is computed as follows:
• Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n.
• Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n.
• For any case i let (xi, yi) denote the observations on X and Y and let (ri, si) denote the ranks on X and Y.
• If the variables X and Y are strongly positively correlated, the ranks on X should generally agree with the ranks on Y. (The largest X should go with the largest Y, the smallest X with the smallest Y.)
• If the variables X and Y are strongly negatively correlated, the ranks on X should be in the reverse order to the ranks on Y. (The largest X should go with the smallest Y, the smallest X with the largest Y.)
• If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed with respect to the ranks on Y.
Spearman’s rank correlation coefficient is defined as follows:
For each case let di = ri - si = the difference in the two ranks.
Then Spearman’s rank correlation coefficient (ρ) is:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Properties of Spearman’s rank correlation coefficient ρ
1. The value of ρ is always between -1 and +1.
2. If the relationship between X and Y is positive, then ρ will be positive.
3. If the relationship between X and Y is negative, then ρ will be negative.
4. If there is no relationship between X and Y, then ρ will be zero.
5. The value of ρ will be +1 if the ranks of X completely agree with the ranks of Y.
6. The value of ρ will be -1 if the ranks of X are in reverse order to the ranks of Y.
Example
xi  25.0  33.9  16.7  37.4  24.6  17.3  40.2
yi  24.3  38.7  13.4  32.1  28.0  12.5  44.9
Ranking the X’s and the Y’s we get:
ri  4  5  1  6  3  2  7
si  3  6  2  5  4  1  7
Computing the differences in ranks gives us:
di  1  -1  -1  1  -1  1  0
$$\sum_{i=1}^{n} d_i^2 = 6$$
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(6)}{7(7^2 - 1)} = 1 - \frac{36}{7(48)} = 1 - \frac{3}{28} = \frac{25}{28} = 0.893$$
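The ranking and the formula for ρ in this example can be sketched as follows. The helper names `ranks` and `spearman_rho` are hypothetical, and the ranking helper assumes there are no tied values, as is the case for these seven observations.

```python
def ranks(values):
    """Ranks 1..n in increasing order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i = r_i - s_i."""
    n = len(x)
    d_squared = sum((r - s) ** 2 for r, s in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]

r_ranks, s_ranks = ranks(x), ranks(y)
rho = spearman_rho(x, y)
print(r_ranks, s_ranks, round(rho, 3))
```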
Computing Pearson’s correlation coefficient, r, for the same problem:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute
$$A = \sum_{i=1}^{n} x_i = 195.1, \quad B = \sum_{i=1}^{n} y_i = 193.9$$
$$C = \sum_{i=1}^{n} x_i^2 = 5972.35, \quad D = \sum_{i=1}^{n} y_i^2 = 6254.41, \quad E = \sum_{i=1}^{n} x_i y_i = 6053.78$$
Then
$$S_{xx} = C - \frac{A^2}{n} = 5972.35 - \frac{(195.1)^2}{7} = 534.63$$
$$S_{yy} = D - \frac{B^2}{n} = 6254.41 - \frac{(193.9)^2}{7} = 883.38$$
$$S_{xy} = E - \frac{AB}{n} = 6053.78 - \frac{(195.1)(193.9)}{7} = 649.51$$
and
$$r = \frac{649.51}{\sqrt{(534.63)(883.38)}} = 0.945$$
Compare with
$$\rho = 0.893$$
Comments: Spearman’s rank correlation coefficient ρ and Pearson’s correlation coefficient r
1. The value of ρ can also be computed from:
$$\rho = \frac{\sum_{i=1}^{n} (r_i - \bar{r})(s_i - \bar{s})}{\sqrt{\sum_{i=1}^{n} (r_i - \bar{r})^2 \sum_{i=1}^{n} (s_i - \bar{s})^2}}$$
2. Spearman’s ρ is Pearson’s r computed from the ranks.
3. Spearman’s ρ is less sensitive to extreme observations (outliers).
4. The value of Pearson’s r is much more sensitive to extreme outliers.
This is similar to the comparison between the median and the mean, or between the pseudo-standard deviation and the standard deviation. The mean and standard deviation are more sensitive to outliers than the median and pseudo-standard deviation.
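Both comments can be illustrated with the seven-case example. The sketch below computes Pearson's r on the ranks, then replaces the largest X value with a hypothetical extreme value (400.0, chosen arbitrarily) to show that the ranks, and hence ρ, do not move while Pearson's r does. Function names are illustrative and no ties are assumed.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def ranks(values):
    """Ranks 1..n in increasing order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]

rho = pearson_r(ranks(x), ranks(y))   # Spearman's rho as r on the ranks
r = pearson_r(x, y)

# Hypothetical outlier: the largest X value is replaced by 400.0.
x_out = x[:-1] + [400.0]
rho_out = pearson_r(ranks(x_out), ranks(y))
r_out = pearson_r(x_out, y)
print(round(rho, 3), round(r, 3), round(rho_out, 3), round(r_out, 3))
```

Because the altered value is still the largest X, every rank is unchanged and ρ is identical before and after; only r responds to the size of the outlier.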
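Measuring correlation
1. Pearson’s correlation coefficient r
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
2. Spearman’s rank correlation coefficient ρ
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \quad d_i = r_i - s_i$$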
Simple Linear Regression
Fitting straight lines to data
The Least Squares Line The Regression Line
• When data is correlated it falls roughly about a straight line.
[Figure: scatter plot of correlated data falling roughly about a straight line]
In this situation one wants to:
• Find the equation of the straight line through the data that yields the best fit.
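The equation of any straight line is of the form:
Y = a + bX
b = the slope of the line
a = the intercept of the line
[Figure: a straight line crossing the Y axis at a, with Run = x2 - x1 and Rise = y2 - y1 marked between two points on the line]
$$b = \frac{\text{Rise}}{\text{Run}} = \frac{y_2 - y_1}{x_2 - x_1}$$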
• a is the value of Y when X is zero
• b is the rate that Y increases per unit increase in X.
• For a straight line this rate is constant.
• For non linear curves the rate that Y increases per unit increase in X varies with X.
[Figures: a linear pattern and a non-linear pattern]
Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below:

Age Class           30-40  40-50  50-60  60-70  70-80
Midpoint Age (X)      35     45     55     65     75
Median BP (Y)        114    124    143    158    166

Graph:
[Figure: scatter plot of Median BP against Midpoint Age with the fitted line Y = 65.1 + 1.38 X]
Interpretation of the slope and intercept
1. Intercept: the value of Y at X = 0.
   - Predicted blood pressure of a newborn (65.1).
   - This interpretation remains valid only if linearity holds down to X = 0.
2. Slope: the rate of increase in Y per unit increase in X.
   - Blood pressure increases 1.38 units each year.
The Least Squares Line
Fitting the best straight line to “linear” data
Reasons for fitting a straight line to data
1. It provides a precise description of the relationship between Y and X.
2. The interpretation of the parameters of the line (slope and intercept) leads to an improved understanding of the phenomenon under study.
3. The equation of the line is useful for prediction of the dependent variable (Y) from the independent variable (X).
Assume that we have collected data on two variables X and Y. Let
(x1, y1) (x2, y2) (x3, y3) … (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
Let
Y = a + bX
denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.
For example, if X = xi (as for the ith case) then the predicted value of Y is:
$$\hat{y}_i = a + b x_i$$
For example, if
Y = a + bX = 25.2 + 2.0 X
is the equation of the straight line, and if X = xi = 20 (for the ith case), then the predicted value of Y is:
$$\hat{y}_i = a + b x_i = 25.2 + 2.0(20) = 65.2$$
If the actual value of Y is yi = 70.0 for case i, then the difference
$$y_i - \hat{y}_i = 70 - 65.2 = 4.8$$
is the error in the prediction for case i. The quantity
$$r_i = y_i - \hat{y}_i = y_i - a - b x_i$$
is also called the residual for case i.
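The prediction and residual in this example amount to two lines of arithmetic. A minimal sketch, with the function names `predict` and `residual` chosen only for illustration:

```python
def predict(a, b, x):
    """Predicted value of Y on the line Y = a + bX."""
    return a + b * x


def residual(a, b, x, y):
    """Observed value minus predicted value for one case."""
    return y - predict(a, b, x)


# Values from the example: a = 25.2, b = 2.0, x_i = 20, y_i = 70.0.
y_hat = predict(25.2, 2.0, 20.0)
r_i = residual(25.2, 2.0, 20.0, 70.0)
print(round(y_hat, 1), round(r_i, 1))
```

A positive residual means the observed point lies above the line; a negative one means it lies below.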
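If the residual
$$r_i = y_i - \hat{y}_i = y_i - a - b x_i$$
is computed for each case in the sample,
$$r_1 = y_1 - \hat{y}_1, \quad r_2 = y_2 - \hat{y}_2, \quad \ldots, \quad r_n = y_n - \hat{y}_n$$
then the residual sum of squares (RSS)
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
is a measure of the “goodness of fit” of the line Y = a + bX to the data.
[Figure: points (x1, y1), (x2, y2), (x3, y3), (x4, y4) plotted about the line Y = a + bX, with the residuals r1, r2, r3, r4 drawn as vertical distances from each point to the line]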
The optimal choice of a and b will result in the residual sum of squares
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
attaining a minimum.
If this is the case then the line
Y = a + bX
is called the Least Squares Line.
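To make the idea of comparing candidate lines concrete, the sketch below evaluates the residual sum of squares for two trial lines. The data and the two (a, b) pairs are hypothetical and are not the data behind the plots in the slides.

```python
def rss(a, b, x, y):
    """Residual sum of squares of the line Y = a + bX for the data."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data scattered about the line Y = 1 + 2X.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

rss_close = rss(1.0, 2.0, x, y)   # a line close to the data
rss_far = rss(0.0, 2.0, x, y)     # same slope, intercept too low
print(round(rss_close, 2), round(rss_far, 2))
```

The line with the smaller RSS fits better; the least squares line is the one for which no other choice of a and b gives a smaller value.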
[Figures: the same scatter plot (X from 0 to 50, Y from 0 to 70) drawn with six trial lines; the residual sum of squares for each line is listed below]

Line                     R.S.S.
Y = 10 + (0.5)X          3389.9
Y = 15 + (0.5)X          1861.9
Y = 20 + (0.5)X           833.9
Y = 20 + (1)X             883.1
Y = 20 + (0.7)X           303.98
Y = 26.46 + (0.55)X       225.74
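The equation for the least squares line
Let
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Computing Formulae:
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
Then the slope of the least squares line can be shown to be:
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
and the intercept of the least squares line can be shown to be:
$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\bar{x}$$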
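The slope and intercept formulae translate directly into code. The following sketch fits the line to a small hypothetical data set and then checks that nudging a or b away from the fitted values increases the residual sum of squares; the function names and the size of the nudge are assumptions made for illustration.

```python
def least_squares(x, y):
    """Return (a, b) with b = S_xy / S_xx and a = y_bar - b * x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def rss(a, b, x, y):
    """Residual sum of squares of the line Y = a + bX."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.8, 10.1]

a, b = least_squares(x, y)
best = rss(a, b, x, y)
# RSS at four nearby lines, each differing slightly in a or in b.
nearby = [rss(a + da, b + db, x, y)
          for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]]
print(round(a, 2), round(b, 2), round(best, 4), [round(v, 4) for v in nearby])
```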
The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.
TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

Country (i)      Xi    Yi
Australia        48    18
Canada           50    15
Denmark          38    17
Finland         110    35
Great Britain   110    46
Holland          49    24
Iceland          23     6
Norway           25     9
Sweden           30    11
Switzerland      51    25
USA             130    20
[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with each point labelled by country]
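Fitting the Least Squares Line
First compute the following quantities:
$$\sum_{i=1}^{n} x_i = 664, \quad \sum_{i=1}^{n} y_i = 226, \quad \sum_{i=1}^{n} x_i^2 = 54{,}404, \quad \sum_{i=1}^{n} y_i^2 = 6{,}018, \quad \sum_{i=1}^{n} x_i y_i = 16{,}914$$
Then
$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$
$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$
$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$
Computing Estimate of Slope and Intercept
$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$
$$a = \bar{y} - b\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$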
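The fit for the cigarette data can be reproduced with a short script. This is a sketch that follows the computing formulae step by step; the variable names are illustrative.

```python
# Per capita cigarette consumption in 1930 (X) and male lung cancer
# death rate per 100,000 in 1950 (Y) for the 11 countries in the table.
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(v * v for v in x)
sum_yy = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

b = s_xy / s_xx                  # slope
a = sum_y / n - b * sum_x / n    # intercept
print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2), round(b, 3), round(a, 3))
```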
[Figure: the same scatter plot with the fitted least squares line Y = 6.756 + (0.228)X]
Interpretation of the slope and intercept
1. Intercept: the value of Y at X = 0.
   - Predicted death rate from lung cancer (6.756) for men in 1950 in countries with no smoking in 1930 (X = 0).
2. Slope: the rate of increase in Y per unit increase in X.
   - The death rate from lung cancer for men in 1950 increases by 0.228 units for each increase of 1 cigarette in per capita consumption in 1930.
Example: blood pressure and age. For the five age classes given earlier, X = midpoint age (35, 45, 55, 65, 75) and Y = median blood pressure (114, 124, 143, 158, 166).
Fitting the Least Squares Line
First compute the following quantities:
$$\sum_{i=1}^{n} x_i = 275, \quad \sum_{i=1}^{n} y_i = 705, \quad \sum_{i=1}^{n} x_i^2 = 16{,}125, \quad \sum_{i=1}^{n} y_i^2 = 101{,}341, \quad \sum_{i=1}^{n} x_i y_i = 40{,}155$$
Then
$$S_{xx} = 16125 - \frac{(275)^2}{5} = 1000$$
$$S_{yy} = 101341 - \frac{(705)^2}{5} = 1936$$
$$S_{xy} = 40155 - \frac{(275)(705)}{5} = 1380$$
Computing Estimate of Slope and Intercept
$$b = \frac{S_{xy}}{S_{xx}} = \frac{1380}{1000} = 1.38$$
$$a = \bar{y} - b\bar{x} = \frac{705}{5} - (1.380)\left(\frac{275}{5}\right) = 65.1$$
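Once the line is fitted it can be used for prediction. The sketch below refits the blood pressure data and predicts the median blood pressure at a hypothetical age of 50, which lies inside the observed range of 35 to 75; predictions outside that range (such as the intercept at X = 0) rest on the assumption that linearity continues to hold there.

```python
# Midpoint age (X) and median blood pressure (Y) for the five age classes.
age = [35, 45, 55, 65, 75]
bp = [114, 124, 143, 158, 166]

n = len(age)
x_bar, y_bar = sum(age) / n, sum(bp) / n
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, bp))

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

# Hypothetical query age, chosen inside the observed range.
predicted_at_50 = a + b * 50
print(round(a, 1), round(b, 2), round(predicted_at_50, 1))
```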
Graph:
[Figure: scatter plot of Median BP against Midpoint Age with the fitted line Y = 65.1 + 1.38 X]
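Relationship between correlation and Linear Regression
1. Pearson’s correlation
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
• Takes values between -1 and +1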
2. Least squares line Y = a + bX
- Minimises the Residual Sum of Squares:
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
- This is the Sum of Squares that measures the variability in Y that is unexplained by X.
- It can also be denoted by $SS_{Unexplained}$.
Some other Sums of Squares:
- The Sum of Squares that measures the total variability in Y (ignoring X):
$$SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
- The Sum of Squares that measures the variability in Y that is explained by X:
$$SS_{Explained} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
It can be shown:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)
$$SS_{Total} = SS_{Explained} + SS_{Unexplained}$$
It can also be shown:
$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
= proportion of variability in Y explained by X
= the coefficient of determination
Further:
$$1 - r^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
= proportion of variability in Y that is unexplained by X.
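The decomposition and the two proportions can be checked numerically. The sketch below uses the cigarette consumption and lung cancer data from the table in this section; the variable names are illustrative.

```python
from math import sqrt

# Cigarette consumption (X) and lung cancer death rate (Y), n = 11 countries.
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]   # fitted values on the least squares line

ss_total = s_yy
ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r = s_xy / sqrt(s_xx * s_yy)
print(round(ss_total, 2), round(ss_explained, 2), round(ss_unexplained, 2))
print(round(r * r, 3), round(ss_explained / ss_total, 3))
```

The identity holds only when the fitted values come from the least squares line; for an arbitrary line the explained and unexplained parts need not add up to the total.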
Example
For the cigarette consumption (Xi, 1930) and lung cancer death rate (Yi, 1950) data for n = 11 countries tabulated earlier:
$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55, \quad S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73, \quad S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$
$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228, \quad a = \bar{y} - b\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$
Computing r and r²
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{3271.82}{\sqrt{(14322.55)(1374.73)}} = 0.737$$
$$r^2 = (0.737)^2 = 0.544$$
54.4% of the variability in Y (death rate due to lung cancer, 1950) is explained by X (per capita cigarette smoking in 1930).
Comments
• Correlation will be +1 or -1 if the data lie exactly on a straight line (with non-zero slope).
• Correlation can be zero or close to zero if the data are either
   - not related, or
   - in some situations, non-linearly related.
[Figure: a symmetric U-shaped (parabolic) scatter pattern, X from -1.5 to 1.5, for which the correlation is zero]
Example
The data

  X      Y
1.00   4.00
1.40   2.56
1.80   1.44
2.20   0.64
2.60   0.16
3.00   0.00
3.40   0.16
3.80   0.64
4.20   1.44
4.60   2.56
5.00   4.00

Sxx = 17.6, Syy = 21.9648, Sxy = 0
r = 0
[Figure: scatter plot of Y against X for these data, showing a parabola with its minimum at X = 3]
One should be careful in interpreting zero correlation.
It does not necessarily imply that Y is not related to X.
It could happen that Y is non-linearly related to X.
One should plot Y vs X before concluding that Y is not related to X.
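The example can be verified directly. In the data above every Y value equals (X - 3)², so Y is completely determined by X, yet the sketch below shows that Sxy, and therefore r, is zero up to floating-point rounding.

```python
from math import sqrt

x = [1.00, 1.40, 1.80, 2.20, 2.60, 3.00, 3.40, 3.80, 4.20, 4.60, 5.00]
y = [4.00, 2.56, 1.44, 0.64, 0.16, 0.00, 0.16, 0.64, 1.44, 2.56, 4.00]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r = s_xy / sqrt(s_xx * s_yy)

# Largest gap between each listed Y and (X - 3)^2; zero up to rounding.
max_gap = max(abs(yi - (xi - 3.0) ** 2) for xi, yi in zip(x, y))
print(round(s_xx, 4), round(s_yy, 4), round(s_xy, 10), round(r, 10), max_gap)
```

The positive and negative products on either side of X = 3 cancel exactly, which is why a measure of linear association misses this relationship.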