8/6/2019 Unit 5 ( CORRELATION AND REGRESSION )
CORRELATION AND REGRESSION C 5606 / 5/
CORRELATION AND REGRESSION
OBJECTIVES
General Objective
To understand and apply the concept of correlation and regression
Specific Objectives
At the end of the unit, you should be able to:
Draw a scatterplot for a set of ordered pairs
Compute the correlation coefficient
Compute the equation of the regression line
UNIT 5
5.0 CORRELATION
So far we have considered the statistics of one variable. Of course, we sometimes get data involving two variables. For example, look at the marks obtained on two Mathematics papers by a group of students below.
Student A B C D E F G H I J
Paper 1 42 84 50 42 33 50 69 81 50 35
Paper 2 31 83 42 60 28 63 59 92 73 40
So what can we find out from the data? Students B and H have done very well on both papers, E has done very badly on both papers, and student I has done much better on Paper 2 than on Paper 1.
A graph might help us to make more sense of the data, as would the average (mean) mark for Papers 1 and 2. The most useful type of graph is a scatter diagram.
INPUT
5.1 CORRELATION- SCATTER DIAGRAM
If we plot the data as points, with marks for Paper 1 on the x-axis and for Paper 2 on the y-axis, we obtain a graph like the one shown here. Note that we do not need to start the scales at zero.
We see that the points go roughly from bottom left to top right (this is made clearer by enclosing the points as shown below).
From the data, the mean value for Paper 1 is x̄ = 53.6
and for Paper 2 is ȳ = 57.1.
We now plot the lines x = 53.6 and y = 57.1 on the scatter diagram.
The lines divide the graph into four quadrants:
Top Right: All points have both x values and y values greater than their respective means, i.e. (x > x̄ and y > ȳ).
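The means and the quadrant split can be checked numerically. The following is a minimal Python sketch using the marks from the table in section 5.0; the helper names `mean` and `quadrant_counts` are illustrative and are not part of the unit.

```python
# Marks for students A to J, taken from the table in section 5.0.
paper1 = [42, 84, 50, 42, 33, 50, 69, 81, 50, 35]
paper2 = [31, 83, 42, 60, 28, 63, 59, 92, 73, 40]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def quadrant_counts(xs, ys):
    """Count the points in each quadrant formed by the lines x = x-bar and y = y-bar."""
    x_bar, y_bar = mean(xs), mean(ys)
    counts = {"top right": 0, "top left": 0, "bottom left": 0, "bottom right": 0}
    for x, y in zip(xs, ys):
        if x == x_bar or y == y_bar:
            continue  # a point lying exactly on a mean line belongs to no quadrant
        vertical = "top" if y > y_bar else "bottom"
        horizontal = "right" if x > x_bar else "left"
        counts[f"{vertical} {horizontal}"] += 1
    return counts


print(round(mean(paper1), 1), round(mean(paper2), 1))
print(quadrant_counts(paper1, paper2))
```

Seven of the ten points fall in the top right or bottom left quadrants and none in the bottom right, which is consistent with the bottom-left-to-top-right trend seen on the scatter diagram.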
Roughly speaking:
Positive correlation: the higher the value of x, the higher the value of y.
Negative correlation: the higher the value of x, the lower the value of y.
Zero correlation: no fixed relationship between x and y.
Again, this is made clearer by drawing the lines y = ȳ and x = x̄.
You have met scatter diagrams in your earlier work, where you may have drawn a line of best fit on the graph in order to estimate a value of y given a value of x. The line was drawn by eye, but you would know that the line passes through the point of mean values (x̄, ȳ), as shown below.
The lines on the first two diagrams are relatively easy to draw, but where do we draw a line on the third and, having drawn it, would it be of any practical use?
Notice that we have been looking for a special type of relationship between the x and y values: a straight-line, or linear, relationship. The fact that we can't find such a relationship does not mean that there is no relationship at all.
The product-moment formula for determining the linear correlation coefficient
The convention for dealing with data:
Horizontal (x) axis: the independent variable
Vertical (y) axis: the dependent variable
Let us look at some data on the height of students and the distance they can throw a cricket ball.
Height (x) cm 122 124 133 138 144 156 158 161 164 168
Distance (y) m 41 38 52 56 29 54 59 61 63 67
Just looking at the data, a general response might be: the taller a person, the further they can throw a cricket ball (apart from the odd person!).
Does a scatter diagram support that hypothesis?
The example below shows one drawback: SCALE
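One of the measures of the degree of linear correlation between two variables is called the coefficient of correlation, denoted by the symbol r. The coefficient of correlation for two variables, say X and Y, is given by:
$$ r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\left[\sum (X-\bar{X})^2\right]\left[\sum (Y-\bar{Y})^2\right]}} $$
or simply
$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} $$
where x = X - X̄ and y = Y - Ȳ.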
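The product-moment formula translates directly into a few lines of code. The sketch below is one possible Python rendering, not the unit's own method; the function name `pearson_r` is an assumption made for illustration. It is applied to the height and throwing-distance data given earlier in this section.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation: r = sum(xy) / sqrt(sum(x^2) * sum(y^2)),
    where x and y are deviations from the respective means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]  # x = X - X-bar
    dy = [y - y_bar for y in ys]  # y = Y - Y-bar
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    if sum_x2 == 0 or sum_y2 == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Height (cm) and throwing distance (m) from the table earlier in this section.
height = [122, 124, 133, 138, 144, 156, 158, 161, 164, 168]
distance = [41, 38, 52, 56, 29, 54, 59, 61, 63, 67]
print(round(pearson_r(height, distance), 4))
```

The two `ValueError` checks are a design choice of this sketch: with fewer than two pairs, or with one variable constant, the denominator is zero and r has no meaning.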
Example 5.1
a) Determine the coefficient of correlation between X and Y based on the data below.
X 4 5 6 9
Y 12 10 8 6
b) The data given below gives the experimental values obtained for the torque output from an electric motor, X, against the current taken from the supply, Y. Determine the value, degree and nature of the coefficient of linear correlation between the variables X and Y (if there is one).
X 0 1 2 3 4 5 6 7 8 9
Y 4 6 6 6 8 10 10 10 14 12
The value of the correlation coefficient ranges from
+1 for a perfect positive correlation
to -1 for a perfect negative correlation.
Solution to Example 5.1
a) Construct a table from the given data.
1     2     3            4            5     6     7
X     Y     x = X - X̄    y = Y - Ȳ    xy    x²    y²
4     12    -2           3            -6    4     9
5     10    -1           1            -1    1     1
6     8     0            -1           0     0     1
9     6     3            -3           -9    9     9
ΣX = 24, ΣY = 36, Σxy = -16, Σx² = 14, Σy² = 20
X̄ = 24/4 = 6, Ȳ = 36/4 = 9
$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{-16}{\sqrt{(14)(20)}} = \frac{-16}{\sqrt{280}} = -0.9562 $$
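The table for part a) can be rebuilt column by column as a check on the arithmetic. This is a minimal Python sketch; the variable names are illustrative only. Note the sign of Σxy: X rises while Y falls, so the products of the deviations are negative or zero and r is negative.

```python
import math

X = [4, 5, 6, 9]
Y = [12, 10, 8, 6]

x_bar = sum(X) / len(X)  # mean of X
y_bar = sum(Y) / len(Y)  # mean of Y

# One row per data pair: X, Y, x = X - X-bar, y = Y - Y-bar, xy, x^2, y^2
rows = []
for big_x, big_y in zip(X, Y):
    x = big_x - x_bar
    y = big_y - y_bar
    rows.append((big_x, big_y, x, y, x * y, x * x, y * y))

sum_xy = sum(row[4] for row in rows)
sum_x2 = sum(row[5] for row in rows)
sum_y2 = sum(row[6] for row in rows)
r = sum_xy / math.sqrt(sum_x2 * sum_y2)

for row in rows:
    print(*row)
print("totals:", sum_xy, sum_x2, sum_y2)
print("r =", round(r, 4))
```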
b)
X     Y     x = X - X̄    y = Y - Ȳ    xy      x²       y²
0     4     -4.5         -4.6         20.7    20.25    21.16
1     6     -3.5         -2.6         9.1     12.25    6.76
2     6     -2.5         -2.6         6.5     6.25     6.76
3     6     -1.5         -2.6         3.9     2.25     6.76
4     8     -0.5         -0.6         0.3     0.25     0.36
5     10    0.5          1.4          0.7     0.25     1.96
6     10    1.5          1.4          2.1     2.25     1.96
7     10    2.5          1.4          3.5     6.25     1.96
8     14    3.5          5.4          18.9    12.25    29.16
9     12    4.5          3.4          15.3    20.25    11.56
ΣX = 45, X̄ = 45/10 = 4.5
ΣY = 86, Ȳ = 86/10 = 8.6
Σxy = 81.0, Σx² = 82.5, Σy² = 88.4
$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{81}{\sqrt{(82.5)(88.4)}} = 0.95 $$
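When the means are not whole numbers, as in part b), the deviation columns become tedious. An algebraically equivalent form that works on the raw sums is r = (nΣXY - ΣXΣY) / sqrt[(nΣX² - (ΣX)²)(nΣY² - (ΣY)²)]. This raw-sums form is not given in the unit; it is introduced here only as a cross-check, and the function name `r_from_raw_sums` is an assumption.

```python
import math


def r_from_raw_sums(xs, ys):
    """Correlation coefficient computed from raw sums, without deviation columns."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Data of Example 5.1 b): torque output X against supply current Y.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [4, 6, 6, 6, 8, 10, 10, 10, 14, 12]
print(round(r_from_raw_sums(X, Y), 2))
```

Both forms should agree to rounding, since nΣXY - ΣXΣY is just n times Σxy.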
A good direct correlation exists between the values of X and Y.
ACTIVITY 5A
TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!
1. Determine the coefficient of correlation up to 4 decimal places between X and Y based on the data below.
X 122 124 133 138 144 156 158 161 164 168
Y 41 38 52 56 29 54 59 61 63 67
2. The co-ordinates given below refer to an experiment to verify Newton's law of cooling over a limited range of values. Determine the value, degree and nature of the coefficient of correlation.
Time (min) 4 8 10 12 16 22
Temperature (°C) 46 34 30 26 24 20
3. The following results were obtained experimentally when verifying Hooke's law:
Load (N) 2 5 8 11 15
Extension (mm) 2 23 62 119 223
Determine the value, degree and nature of the coefficient of correlation.
4. The thickness of case-hardening achieved varies with temperature, and some co-ordinates obtained by experiment are as shown.
Temperature (°C) 400 420 350 320 400 480 440 370
Thickness (m) 3.7 3.4 3.7 3.8 3.6 3.3 3.4 3.7
Determine the coefficient of correlation based on these values.
FEEDBACK TO ACTIVITY 5A
1. r = 0.7289
2. r = -0.92, good, inverse
3. r = 0.97, good, direct
4. r = -0.93
5.2 LEAST SQUARES REGRESSION LINE
Scatter Diagrams: Line of Best Fit
We have already referred to the drawing of a line of best fit by eye.
The only calculation involved was determining x̄ and ȳ, since the line of best fit passes through the point (x̄, ȳ).
From the line you might be expected to estimate a y-value given an x-value. Of course, fitting a line by eye is a subjective matter, in which you try to minimise the distances between the points and the line.
A mathematical computation method is available to produce two lines, known as y on x (to estimate values of y) and x on y (to estimate values of x).
These are known as (Linear) Regression Lines or Least-Squares Regression Lines.
INPUT
Scatter Diagrams: The y on x Regression Line
Since the line must pass through (x̄, ȳ), the parameters that can vary are the gradient of the line and the point where the line cuts the y-axis.
The equation of the y on x line will be of the form y = a + bx (some syllabuses use the Greek letters α and β instead of a and b).
The y on x line minimises the sum of the squares of the vertical distances from the points to the regression line (the square of the distance is used to ensure a positive result).
As with correlation, there is a formula derived from a proof and a corresponding computational method. (The proof is not required at A/AS Level.)
For y = a + bx,
$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} $$
$$ a = \bar{y} - b\bar{x} $$
where ȳ and x̄ are the mean values of y and x.
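Because the formulas for b and a need only n and four sums, they are easy to wrap in a small helper. The sketch below is a hedged illustration in Python; the name `regression_from_sums` and the three-point data set used to exercise it are hypothetical and chosen so that the points lie exactly on y = 1 + 2x.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the y on x line y = a + bx from summary totals.

    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x**2 / n)
    a = y_bar - b * x_bar
    """
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sum_x2 - sum_x ** 2 / n
    if denominator == 0:
        raise ValueError("all x values are equal, so the gradient is undefined")
    b = numerator / denominator
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Hypothetical points (1, 3), (2, 5), (3, 7), which lie exactly on y = 1 + 2x:
# n = 3, sum x = 6, sum y = 15, sum xy = 34, sum x^2 = 14.
a, b = regression_from_sums(3, 6, 15, 34, 14)
print(a, b)
```

Working from the totals rather than the raw pairs matches the way the exercises in this unit present their data.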
Example 5.2
a) y on x Regression Line ( Least Squares Regression Line )
x 2.5 4 8 5 7 9.5 8.5 12.5 12.5 14.5
y 3.5 3 6.5 7 8 11 9 10.5 13 13
Σx = 84, Σy = 84.5, Σxy = 827, Σx² = 845.5, n = 10, x̄ = 8.4, ȳ = 8.45
Calculate the regression line y on x.
b) Based on the data already calculated, find the regression line y on x and estimate the value of y when x = 160.
Σx = 1468, Σy = 520, Σxy = 77689, Σx² = 218070, n = 10, x̄ = 146.8, ȳ = 52
Solution to Example 5.2
a) To calculate the regression line y on x:
$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{827 - \dfrac{(84)(84.5)}{10}}{845.5 - \dfrac{(84)^2}{10}} = 0.8377 $$
a = ȳ - b x̄ = 8.45 - (0.8377 × 8.4) = 1.4133
So the least squares regression line y on x is y = 1.4133 + 0.8377x.
Least Squares Regression Line: y on x
From the solution above, the least squares regression line y on x is:
y = 1.4133 + 0.8377x
We can now use this equation to calculate (estimate) a value of y for a given value of x.
For example, find a value for y given x = 10.
Substituting: y = 1.4133 + (0.8377 × 10) = 9.7903
Finding a value from within the range of x is called interpolation.
Warning: Estimating a value from outside the data range (say x = 20) is called extrapolation and should be avoided (at all costs), since you do not know that the relationship between x and y will hold for larger and smaller values than those recorded.
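The distinction between interpolation and extrapolation can be built into the estimate itself. The following Python sketch assumes the line y = 1.4133 + 0.8377x and the x range 2.5 to 14.5 of Example 5.2 a); the function name `estimate_y` and the decision to raise an error outside the range are illustrative choices, not something the unit prescribes.

```python
def estimate_y(x, a, b, x_min, x_max):
    """Estimate y = a + bx, refusing to extrapolate beyond the recorded x values."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} lies outside the data range [{x_min}, {x_max}]: extrapolation"
        )
    return a + b * x


A, B = 1.4133, 0.8377    # regression line from Example 5.2 a)
X_MIN, X_MAX = 2.5, 14.5  # smallest and largest x in that data set

print(round(estimate_y(10, A, B, X_MIN, X_MAX), 4))  # interpolation

try:
    estimate_y(20, A, B, X_MIN, X_MAX)  # extrapolation
except ValueError as error:
    print("refused:", error)
```

A softer design would return the value together with a warning flag; raising an error simply mirrors the unit's advice to avoid extrapolation.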
b) For the regression line y on x,
$$ b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{77689 - \dfrac{(1468)(520)}{10}}{218070 - \dfrac{(1468)^2}{10}} = 0.5270 $$
a = ȳ - b x̄ = 52 - (0.5270 × 146.8) = -25.3636
So the regression line is y = -25.3636 + 0.5270x.
When x = 160, y = -25.3636 + (0.5270 × 160) = 58.96
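The totals used in part b) are those of the height and throwing-distance data from section 5.1, so the whole calculation can be reproduced from the raw pairs. This is a minimal Python sketch with illustrative variable names.

```python
# Height (cm) and throwing distance (m) from section 5.1.
height = [122, 124, 133, 138, 144, 156, 158, 161, 164, 168]
distance = [41, 38, 52, 56, 29, 54, 59, 61, 63, 67]

n = len(height)
sum_x = sum(height)
sum_y = sum(distance)
sum_xy = sum(x * y for x, y in zip(height, distance))
sum_x2 = sum(x * x for x in height)

# Gradient and intercept of the y on x line, kept at full precision.
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * (sum_x / n)
y_160 = a + b * 160

print(sum_x, sum_y, sum_xy, sum_x2)
print(round(b, 4), round(a, 2), round(y_160, 2))
```

Keeping b at full precision gives an intercept of about -25.36 rather than -25.3636, because the worked solution rounds b to 0.5270 before computing a. The estimate at x = 160 is 58.96 either way.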
ACTIVITY 5B
TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!
a. The table shows the results for a number of athletes. X represents long jump (metres)
Σx = 19, Σy = 66, Σxy = 126.22, Σx² = 36.44, n = 10
x      y      x²       y²       xy
1.8    6.7    3.24     44.89    12.06
2.1    7.6    4.41     57.76    15.96
1.9    6.3    3.61     39.69    11.97
2.0    6.8    4.00     46.24    13.6
1.8    5.9    3.24     34.81    10.62
1.8    7.9    3.24     62.41    14.22
1.6    5.5    2.56     30.25    8.8
1.8    5.6    3.24     31.36    10.08
1.9    6.5    3.61     42.25    12.35
2.3    7.2    5.29     51.84    16.56
19     66     36.44    441.5    126.22
Calculate the value of b for the regression line y = a + bx.
b. The length y metres of a cable subjected to a load of x kilograms is given by y = α + βx. In an experiment to estimate α and β for a particular cable, the value of y was measured for each of 15 values of x. The following quantities were calculated from the 15 pairs of values.
Σx = 225, Σy = 238.5, Σxy = 3581, Σx² = 3625
Calculate the least squares estimates of α and β.
c. A set of bivariate data can be summarised as follows:
Σx = 21, Σy = 43, Σxy = 171, Σx² = 91, n = 6, Σy² = 335
i) Calculate the equation of the regression line of y on x. Give your answer in the form y = a + bx, where the values of a and b should be stated to 3 significant figures.
ii) It is required to estimate the value of y for a given value of x. State circumstances under which the regression line of x on y should be used, rather than the regression line of y on x.
FEEDBACK TO ACTIVITY 5B
a. b = 2.4118
b. y = α + βx: y = 15.69 + 0.014x
c. i) a = 3.0667, regression line is y = 3.07 + 1.17x (3 significant figures)
ii) Use the regression line of x on y to estimate the value of x when y is the independent variable.
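To see how the two regression lines in part c differ, both can be computed from the same totals. The unit only gives the formula for the y on x line; the x on y line below is obtained by swapping the roles of x and y in that formula, which is an assumption of this sketch, as is the function name `both_regression_lines`.

```python
def both_regression_lines(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return ((a, b), (c, d)) for the lines y = a + bx (y on x) and x = c + dy (x on y)."""
    x_bar, y_bar = sum_x / n, sum_y / n
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    b = s_xy / s_xx          # gradient of y on x
    a = y_bar - b * x_bar
    d = s_xy / s_yy          # gradient of x on y (roles of x and y swapped)
    c = x_bar - d * y_bar
    return (a, b), (c, d)


# Totals from Activity 5B, part c.
(a, b), (c, d) = both_regression_lines(6, 21, 43, 171, 91, 335)
print(f"y on x: y = {a:.3g} + {b:.3g}x")
print(f"x on y: x = {c:.3g} + {d:.3g}y")
```

Both lines pass through (x̄, ȳ), but they coincide only when the correlation is perfect, which is why the choice of line depends on which variable is being estimated.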
SELF ASSESSMENT 5
You are approaching success. Try all the questions in this self-assessment section and check your answers against the feedback that follows. If you encounter any problems, consult your instructor. Good luck.
1. The data given below refers to the relationship between man-hours worked and production achieved in a factory. Determine the coefficient of correlation.
Index of production, man-hour basis 100 97 100 101 93 103 91 89 110 86
Index of production, actual basis 94 91 100 105 84 112 83 80 123 78
2. The number of man-days lost per week due to sickness in two similar departments of a factory are shown for a 12-week period.
Department A 20 18 19 21 17 18 12 16 14 17 13 15
Department B 18 21 18 20 17 19 16 15 15 18 16 18
Determine the coefficient of correlation and comment on its degree and nature.
3. The masses and heights of ten people were measured and the results are as shown.
Mass (kg) 38 38 38 44 44 51 32 51 77 32
Height (cm) 135 140 137 141 147 145 132 149 164 130
Calculate the coefficient of correlation for this data.
4. The relationship between the pressure and volume of a gas was measured and the following results were obtained:
Pressure (kPa) 58 62 67 73 81 81 86 92 104
Volume (m³) 0.36 0.97 0.43 0.52 0.48 0.29 0.31 0.75 0.27
Determine the coefficient of correlation and comment on the result obtained.
5. The caloric intake of rats varies with body mass as shown below.
Body mass (g) 2.0 3.1 3.6 4.6 5.0 6.0 7.0 8.0 8.5 9.0 10.0
Caloric intake (cal h⁻¹) 1.5 2.1 3.2 3.6 3.6 3.9 4.1 4.2 4.5 4.6 5.9
Is there a linear correlation between these results?
6. Determine the coefficient of correlation for the data given below and test the null hypothesis that ρ = 0 at a level of significance of 0.1. The data given relates the number of hours of sunshine per week to the hours lost due to sickness.
Hours of sunshine/week 10 13 15 17 18 20 22 23 24
Hours lost due to sickness 90 75 75 65 55 45 55 45 35
7. The length y metres of a cable subjected to a load of x kilograms is given by y = α + βx. In an experiment to estimate α and β for a particular cable, the value of y was measured for each of 15 values of x. The following quantities were calculated from the pairs of values.
Σx = 225, Σy = 238.5, Σxy = 3581, Σx² = 3625
a) Calculate the least squares estimates of α and β.
8. A set of bivariate data can be summarised as follows:
Σx = 21, Σy = 43, Σxy = 171, Σx² = 91, n = 6, Σy² = 335
i) Calculate the equation of the regression line of y on x. Give your answer in the form y = a + bx, where the values of a and b should be stated to 3 significant figures.
ii) It is required to estimate the value of y for a given value of x. State circumstances under which the regression line of x on y should be used, rather than the regression line of y on x.
9. The data given below show the relationship between the heights and masses of ten people.
Height, X (cm) 175 180 193 165 187 171 198 168 184 177
Mass, Y (kg) 82 78 86 72 91 80 95 72 89 74
Determine the equation of the regression line of mass on height, expressing the regression coefficients correct to two decimal places.
10. The power needed to drive a lathe increases as the cutting angle of the tool increases when cutting at a constant speed and depth of cut. The relationship for mild steel is:
Cutting angle (degrees), X 50 55 60 65 70 75 80 85 90
Power (kW), Y 6.2 6.8 7.6 8.2 8.1 8.8 9.7 10.0 10.4
Determine
a) the equation of the regression line of power on cutting angle and
b) the equation of the regression line of cutting angle on power,
expressing the regression coefficients correct to three significant figures in each case.
FEEDBACK TO SELF ASSESSMENT 5
Have you tried all the questions? If YES, check your answers now.
1. 0.97
2. 0.70, fair, direct
3. 0.97
4. -0.31. It is probable that the measurements were made at different temperatures.
5. r = 0.94, hence there is a good, direct correlation.
6. r = -0.95, t.99, 7 = 1.42, |t| = 8.05, hypothesis is rejected
7. α = 15.69, β = 0.014, y = 15.69 + 0.014x
8. i) y = 3.07 + 1.17x
ii) Use the regression line of x on y to estimate the value of x when y is the independent variable.
9. y = -36.83 + 0.66x
10. a) Y = 1.14 + 0.104X
b) X = -9.27 + 9.41Y