Unit 5 ( CORRELATION AND REGRESSION )

  • 8/6/2019 Unit 5 ( CORRELATION AND REGRESSION )


    CORRELATION AND REGRESSION C 5606 / 5/

    CORRELATION AND REGRESSION

    OBJECTIVES

    General Objective

    To understand and apply the concept of correlation and regression

    Specific Objectives

    At the end of the unit, you should be able to:

    Draw a scatterplot for a set of ordered pairs

    Compute the correlation coefficient

    Compute the equation of the regression line


    5.0 CORRELATION

    So far we have considered the statistics of one variable. Sometimes, however, the data involve two variables. For example, look at the marks obtained on two Mathematics papers by a group of students below.

    Student A B C D E F G H I J

    Paper 1 42 84 50 42 33 50 69 81 50 35

    Paper 2 31 83 42 60 28 63 59 92 73 40

    So what can we find out from the data? Students B and H have done very well on both papers, E has done very badly on both papers, and student I has done much better on Paper 2 than on Paper 1.

    A graph might help us to make more sense of the data, as would the average (mean) mark for Papers 1 and 2. The most useful type of graph is a scatter diagram.


    INPUT


    5.1 CORRELATION - SCATTER DIAGRAM

    If we plot the data as points, with marks for Paper 1 on the x-axis and for Paper 2 on the y-axis, we obtain a graph like the one shown here. Note that we do not need to start the scales at zero.

    We see that the points go roughly from bottom left to top right (this is made clearer by enclosing the points as shown below).


    From the data, the mean value for Paper 1 is $\bar{x} = 53.6$

    and for Paper 2 is $\bar{y} = 57.1$.

    We now plot the lines x = 53.6 and y = 57.1 on the scatter diagram.

    The lines divide the graph into four quadrants:

    Top Right: all points have both x values and y values greater than their respective means, i.e. $(x - \bar{x}) > 0$ and $(y - \bar{y}) > 0$.
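The quadrant idea can be checked numerically. The following is a minimal Python sketch, not part of the original unit: the function name `quadrant_counts` and the quadrant labels are assumptions chosen for illustration. It uses the Paper 1 and Paper 2 marks from the table above.

```python
from collections import Counter

paper1 = [42, 84, 50, 42, 33, 50, 69, 81, 50, 35]  # x values (students A to J)
paper2 = [31, 83, 42, 60, 28, 63, 59, 92, 73, 40]  # y values


def quadrant_counts(xs, ys):
    """Count points in each quadrant formed by the lines x = mean(x), y = mean(y)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    counts = Counter()
    for x, y in zip(xs, ys):
        horizontal = "right" if x > x_bar else "left"
        vertical = "top" if y > y_bar else "bottom"
        counts[f"{vertical} {horizontal}"] += 1
    return x_bar, y_bar, counts


x_bar, y_bar, counts = quadrant_counts(paper1, paper2)
print(round(x_bar, 1), round(y_bar, 1))  # 53.6 57.1
print(dict(counts))
```

Seven of the ten points lie in the top right or bottom left quadrants, which is consistent with a positive correlation between the two sets of marks.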


    Roughly speaking:

    Positive correlation: the higher the value of x, the higher the value of y.

    Negative correlation: the higher the value of x, the lower the value of y.

    Zero correlation: no fixed relationship between x and y.

    Again this is made clearer by drawing the lines $y = \bar{y}$ and $x = \bar{x}$.

    You have met scatter diagrams in your earlier work, where you may have drawn a line of best fit on the graph in order to estimate a value of y given a value of x. The line was drawn by eye, but you would know that the line passes through the point of mean values $(\bar{x}, \bar{y})$, as shown below.


    The lines on the first two diagrams are relatively easy to draw, but where do we draw a line on the third and, having drawn it, would it be of any practical use?

    Notice that we have been looking for a special type of relationship between the x and y values: a straight line, or linear, relationship. The fact that we can't find such a relationship does not mean that there is no relationship at all.
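A small made-up example (the values are hypothetical, not taken from the unit) shows why the absence of a linear relationship does not rule out a relationship. In the Python sketch below y is exactly x squared, yet the sum of the products of deviations from the means, which is the numerator of the correlation coefficient defined later in this unit, comes out as zero.

```python
xs = [-2, -1, 0, 1, 2]      # hypothetical values, symmetric about zero
ys = [x * x for x in xs]    # y depends on x exactly, but not linearly

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Sum of products of deviations from the means
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
print(sum_xy)  # 0.0
```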

    The product-moment formula for determining the linear correlation coefficient

    The convention of dealing with data

    Horizontal (x) axis: the independent variable

    Vertical (y) axis: the dependent variable

    Let us look at some data on the height of students and the distance they can throw a cricket ball.

    Height (x) cm 122 124 133 138 144 156 158 161 164 168

    Distance (y) m 41 38 52 56 29 54 59 61 63 67

    Just looking at the data, a general response might be: the taller a person, the further they can throw a cricket ball (apart from the odd person!).


    Does a scatter diagram support that hypothesis?

    The example below shows one drawback: SCALE


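    One of the measures of the degree of linear correlation between two variables is called the coefficient of correlation, denoted by the symbol r. The coefficient of correlation for two variables, say X and Y, is given by:

    \[
    r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}
    \]

    or simply

    \[
    r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}}
    \]

    where $x = X - \bar{X}$ and $y = Y - \bar{Y}$ are the deviations from the means.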
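For hand calculation it is often convenient to avoid the deviations altogether. Expanding the sums gives an equivalent form in terms of the raw totals. This is a standard identity rather than something stated in the unit, with n (a symbol assumed here) denoting the number of pairs of values:

\[
r = \frac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sqrt{\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right]}}
\]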

    Example 5.1

    a) Determine the coefficient of correlation between X and Y based on the data below.

    X 4 5 6 9

    Y 12 10 8 6

    b) The data given below are the experimental values obtained for the torque output from an electric motor, X, against the current taken from the supply, Y. Determine the value, degree and nature of the coefficient of linear correlation between the variables X and Y (if there is one).

    X 0 1 2 3 4 5 6 7 8 9

    Y 4 6 6 6 8 10 10 10 14 12

    The value of the correlation coefficient ranges from +1 for a perfect positive correlation to -1 for a perfect negative correlation.


    Solution to Example 5.1

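    a) Construct a table from the given data, where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.

    (1)  (2)   (3)   (4)   (5)   (6)   (7)
    X    Y     x     y     xy    x²    y²
    4    12   -2     3    -6     4     9
    5    10   -1     1    -1     1     1
    6     8    0    -1     0     0     1
    9     6    3    -3    -9     9     9

    \[
    \sum X = 24, \quad \sum Y = 36, \quad \sum xy = -16, \quad \sum x^2 = 14, \quad \sum y^2 = 20
    \]

    \[
    \bar{X} = \frac{24}{4} = 6, \qquad \bar{Y} = \frac{36}{4} = 9
    \]

    \[
    r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{-16}{\sqrt{(14)(20)}} = \frac{-16}{\sqrt{280}} = -0.9562
    \]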

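    b) Again with $x = X - \bar{X}$ and $y = Y - \bar{Y}$:

    X    Y     x      y      xy      x²      y²
    0    4    -4.5   -4.6   20.7    20.25   21.16
    1    6    -3.5   -2.6    9.1    12.25    6.76
    2    6    -2.5   -2.6    6.5     6.25    6.76
    3    6    -1.5   -2.6    3.9     2.25    6.76
    4    8    -0.5   -0.6    0.3     0.25    0.36
    5   10     0.5    1.4    0.7     0.25    1.96
    6   10     1.5    1.4    2.1     2.25    1.96
    7   10     2.5    1.4    3.5     6.25    1.96
    8   14     3.5    5.4   18.9    12.25   29.16
    9   12     4.5    3.4   15.3    20.25   11.56

    \[
    \sum X = 45, \quad \bar{X} = \frac{45}{10} = 4.5, \qquad \sum Y = 86, \quad \bar{Y} = \frac{86}{10} = 8.6
    \]

    \[
    \sum xy = 81.0, \quad \sum x^2 = 82.5, \quad \sum y^2 = 88.4
    \]

    \[
    r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} = \frac{81}{\sqrt{(82.5)(88.4)}} = 0.95
    \]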


    A good direct correlation exists between the values of X and Y.
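The tabular method of Example 5.1 translates directly into a few lines of Python. This is only a sketch for checking hand working: the function name `correlation_coefficient` and the variable names are assumptions, and each intermediate list corresponds to one column of the table.

```python
import math


def correlation_coefficient(xs, ys):
    """Product-moment correlation coefficient using deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]   # column x = X - mean(X)
    dy = [y - y_bar for y in ys]   # column y = Y - mean(Y)
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Data from Example 5.1 (a) and (b)
r_a = correlation_coefficient([4, 5, 6, 9], [12, 10, 8, 6])
r_b = correlation_coefficient(list(range(10)), [4, 6, 6, 6, 8, 10, 10, 10, 14, 12])
print(round(r_a, 4), round(r_b, 2))  # -0.9562 0.95
```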

    ACTIVITY 5A

    TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

    1. Determine the coefficient of correlation, to 4 decimal places, between X and Y based on the data below.

    X 122 124 133 138 144 156 158 161 164 168

    Y 41 38 52 56 29 54 59 61 63 67

    2. The co-ordinates given below refer to an experiment to verify Newton's law of cooling over a limited range of values. Determine the value, degree and nature of the coefficient of correlation.

    Time (min) 4 8 10 12 16 22

    Temperature (°C) 46 34 30 26 24 20

    3. The following results were obtained experimentally when verifying Hooke's law:

    Load (N) 2 5 8 11 15

    Extension (mm) 2 23 62 119 223

    Determine the value, degree and nature of the coefficient of correlation.

    4. The thickness of case-hardening achieved varies with temperature, and some co-ordinates obtained by experiment are as shown.

    Temperature (°C) 400 420 350 320 400 480 440 370

    Thickness (m) 3.7 3.4 3.7 3.8 3.6 3.3 3.4 3.7


    Determine the coefficient of correlation based on these values.

    FEEDBACK TO ACTIVITY 5A

    1. r = 0.7289

    2. r = -0.92, good, inverse

    3. r = 0.97, good, direct

    4. r = -0.93


    5.2 LEAST SQUARES REGRESSION LINE

    Scatter Diagrams: Line of Best Fit

    We have already referred to the drawing of a line of best fit by eye.

    The only calculation involved was determining $\bar{x}$ and $\bar{y}$, since the line of best fit passes through the point $(\bar{x}, \bar{y})$.

    From the line you might be expected to estimate a y value given an x value. Of course, fitting a line by eye is a subjective matter: you try to minimise the distances between the points and the line.

    A mathematical method is available to produce two lines, known as y on x (to estimate values of y) and x on y (to estimate values of x).

    These are known as (Linear) Regression Lines or Least-Squares Regression Lines.
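The two regression lines are generally different. As a rough illustration, the Python sketch below computes the gradient of each line for the data of Example 5.1 (a); the helper name `slope` is an assumption made for this sketch. A standard result, not stated in the unit, is that the product of the two gradients equals r squared, so the lines coincide only when the correlation is perfect.

```python
def slope(us, vs):
    """Least-squares gradient for predicting v from u (the 'v on u' line)."""
    n = len(us)
    s_uv = sum(u * v for u, v in zip(us, vs)) - sum(us) * sum(vs) / n
    s_uu = sum(u * u for u in us) - sum(us) ** 2 / n
    return s_uv / s_uu


# Data from Example 5.1 (a)
X = [4, 5, 6, 9]
Y = [12, 10, 8, 6]

b_y_on_x = slope(X, Y)   # gradient of the y on x line
b_x_on_y = slope(Y, X)   # gradient of the x on y line
print(b_y_on_x, b_x_on_y)
print(b_y_on_x * b_x_on_y)  # equals r squared for this data
```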


    INPUT


    Scatter Diagrams: The y on x Regression Line

    Since the line must pass through $(\bar{x}, \bar{y})$, the parameters that can vary are the gradient of the line and the point where the line cuts the y-axis.

    The equation of the y on x line will be of the form y = a + bx (some syllabuses use the Greek letters α and β instead of a and b).

    The y on x line minimises the sum of the squares of the vertical distances from the points to the regression line (the square of the distance is used to ensure a positive result).

    As with correlation, there is a formula derived from a proof and a corresponding computational method. (The proof is not required at A/AS Level.)

    For y = a + bx:

    \[
    b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}
    \]

    \[
    a = \bar{y} - b\bar{x}
    \]

    where $\bar{y}$ and $\bar{x}$ are the mean values of y and x.
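The gradient formula is the raw-total version of a ratio of deviation sums. The following standard identities (not derived in the unit) make the link with the correlation section explicit:

\[
\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n}, \qquad \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}
\]

so that

\[
b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}
\]

Substituting $x = \bar{x}$ into $y = a + bx$ with $a = \bar{y} - b\bar{x}$ gives $y = \bar{y}$, which confirms that the line passes through $(\bar{x}, \bar{y})$.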


    Example 5.2

    a) y on x Regression Line (Least Squares Regression Line)

    x 2.5 4 8 5 7 9.5 8.5 12.5 12.5 14.5

    y 3.5 3 6.5 7 8 11 9 10.5 13 13

    $\sum x = 84$, $\sum y = 84.5$, $\sum xy = 827$, $\sum x^2 = 845.5$, $n = 10$, $\bar{x} = 8.4$, $\bar{y} = 8.45$

    Calculate the regression line y on x.

    b) Based on the data already calculated, find the regression line y on x and estimate the value of y when x = 160.

    $\sum x = 1468$, $\sum y = 520$, $\sum xy = 77689$, $\sum x^2 = 218070$, $n = 10$, $\bar{x} = 146.8$, $\bar{y} = 52$

    Solution to Example 5.2

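    a) To calculate the regression line y on x:

    \[
    b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{827 - \dfrac{(84)(84.5)}{10}}{845.5 - \dfrac{(84)^2}{10}} = 0.8377
    \]

    \[
    a = \bar{y} - b\bar{x} = 8.45 - (0.8377 \times 8.4) = 1.4133
    \]

    So the least squares regression line y on x is y = 1.4133 + 0.8377x.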
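The same calculation can be checked from the raw data with a short Python sketch. The function name `fit_y_on_x` is an assumption for illustration; the formulas are the ones given above for b and a.

```python
def fit_y_on_x(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n   # a = mean(y) - b * mean(x)
    return a, b


# Data from Example 5.2 (a)
x = [2.5, 4, 8, 5, 7, 9.5, 8.5, 12.5, 12.5, 14.5]
y = [3.5, 3, 6.5, 7, 8, 11, 9, 10.5, 13, 13]
a, b = fit_y_on_x(x, y)
print(round(b, 4))  # 0.8377
print(round(a, 4))
```

Carrying the unrounded gradient through gives an intercept of about 1.4130 rather than 1.4133. The small difference arises because the worked solution rounds b to four decimal places before computing a.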

    Least Squares Regression Line: y on x

    From the working above, the least squares regression line y on x is:

    y = 1.4133 + 0.8377x

    We can now use this equation to calculate (estimate) a value of y for a given value of x.

    For example, find a value for y given x = 10.

    Substituting: y = 1.4133 + (0.8377 × 10) = 9.7903

    Finding a value from within the range of x is called interpolation.

    Warning: estimating a value from outside the data range (say x = 20) is called extrapolation and should be avoided (at all costs), since you do not know that the relationship between x and y will hold for larger and smaller values than those recorded.
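One way to keep the warning in view when estimating is to report whether the x value lies inside the recorded range. The Python sketch below is an illustration only: the function name `estimate_y` and the returned labels are assumptions, and the range 2.5 to 14.5 is read from the x data of Example 5.2 (a).

```python
def estimate_y(x_value, a, b, x_min, x_max):
    """Estimate y from y = a + b*x and say whether x lies inside the data range."""
    kind = "interpolation" if x_min <= x_value <= x_max else "extrapolation"
    return a + b * x_value, kind


# Line and x-range from Example 5.2 (a): x runs from 2.5 to 14.5
a, b = 1.4133, 0.8377
print(estimate_y(10, a, b, 2.5, 14.5))   # inside the range
print(estimate_y(20, a, b, 2.5, 14.5))   # outside the range, treat with caution
```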

    b) For the regression line y on x,

    \[
    b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{77689 - \dfrac{(1468)(520)}{10}}{218070 - \dfrac{(1468)^2}{10}} = 0.5270
    \]

    \[
    a = \bar{y} - b\bar{x} = 52 - (0.5270 \times 146.8) = -25.3636
    \]

    So the regression line is y = -25.3636 + 0.5270x.

    When x = 160, y = -25.3636 + (0.5270 × 160) = 58.96
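When only the totals are given, as in part b), the line can be found without the raw data. The Python sketch below assumes a helper named `line_from_totals` (not from the unit) that takes the five summary quantities used in the formula for b.

```python
def line_from_totals(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for y = a + b*x from the summary totals alone."""
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b


# Totals quoted in Example 5.2 (b)
a, b = line_from_totals(n=10, sum_x=1468, sum_y=520, sum_xy=77689, sum_x2=218070)
print(round(b, 4))   # 0.527
print(a + b * 160)   # estimate of y when x = 160, close to the 58.96 found above
```

The same function can be used to check questions that supply only totals, such as those in the activity that follows.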


    ACTIVITY 5B

    TEST YOUR UNDERSTANDING BEFORE PROCEEDING TO THE NEXT INPUT...!

    a. The table shows the results for a number of athletes. x represents long jump (metres)

    $\sum x = 19$, $\sum y = 66$, $\sum xy = 126.22$, $\sum x^2 = 36.44$, $n = 10$

    x      y      x²      y²       xy
    1.8    6.7    3.24    44.89    12.06
    2.1    7.6    4.41    57.76    15.96
    1.9    6.3    3.61    39.69    11.97
    2.0    6.8    4.00    46.24    13.6
    1.8    5.9    3.24    34.81    10.62
    1.8    7.9    3.24    62.41    14.22
    1.6    5.5    2.56    30.25    8.8
    1.8    5.6    3.24    31.36    10.08
    1.9    6.5    3.61    42.25    12.35
    2.3    7.2    5.29    51.84    16.56
    19     66     36.44   441.5    126.22   (totals)

    Calculate the value of b for the regression line y = a + bx.

    b. The length y metres of a cable subjected to a load of x kilograms is given by y = α + βx. In an experiment to estimate α and β for a particular cable, the value of y was measured for each of 15 values of x. The following quantities were calculated from the 15 pairs of values.

    $\sum x = 225$, $\sum y = 238.5$, $\sum xy = 3581$, $\sum x^2 = 3625$

    Calculate the least squares estimates of α and β.


    c. A set of bivariate data can be summarised as follows:

    $\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $n = 6$, $\sum y^2 = 335$

    i) Calculate the equation of the regression line of y on x. Give your answer in the form y = a + bx, where the values of a and b should be stated to 3 significant figures.

    ii) It is required to estimate the value of y for a given value of x. State the circumstances under which the regression line of x on y should be used, rather than the regression line of y on x.


    FEEDBACK TO ACTIVITY 5B

    a. b = 2.4118

    b. y = α + βx, y = 15.69 + 0.014x

    c. i) a = 3.0667; the regression line is y = 3.07 + 1.17x (3 significant figures)

    ii) Use the regression line of x on y to estimate the value of x when y is the independent variable.


    SELF ASSESSMENT 5

    You are approaching success. Try all the questions in this self-assessment section and check your answers against the feedback that follows. If you encounter any problems, consult your instructor. Good luck.

    1. The data given below refer to the relationship between man-hours worked and production achieved in a factory. Determine the coefficient of correlation.

    Index of production, man-hour basis: 100 97 100 101 93 103 91 89 110 86

    Index of production, actual basis: 94 91 100 105 84 112 83 80 123 78

    2. The numbers of man-days lost per week due to sickness in two similar departments of a factory are shown for a 12-week period.

    Department A: 20 18 19 21 17 18 12 16 14 17 13 15

    Department B: 18 21 18 20 17 19 16 15 15 18 16 18

    Determine the coefficient of correlation and comment on its degree and nature.


    3. The masses and heights of ten people were measured and the results are as shown.

    Mass (kg): 38 38 38 44 44 51 32 51 77 32

    Height (cm): 135 140 137 141 147 145 132 149 164 130

    Calculate the coefficient of correlation for this data.

    4. The relationship between the pressure and volume of a gas was measured and the following results were obtained:

    Pressure (kPa): 58 62 67 73 81 81 86 92 104

    Volume (m³): 0.36 0.97 0.43 0.52 0.48 0.29 0.31 0.75 0.27

    Determine the coefficient of correlation and comment on the result obtained.

    5. The caloric intake of rats varies with body mass as shown below.

    Body mass (g): 2.0 3.1 3.6 4.6 5.0 6.0 7.0 8.0 8.5 9.0 10.0

    Caloric intake (cal h⁻¹): 1.5 2.1 3.2 3.6 3.6 3.9 4.1 4.2 4.5 4.6 5.9

    Is there a linear correlation between these results?


    6. Determine the coefficient of correlation for the data given below and test the null hypothesis that ρ = 0 at a level of significance of 0.1. The data given relate the number of hours of sunshine per week to the hours lost due to sickness.

    Hours of sunshine/week: 10 13 15 17 18 20 22 23 24

    Hours lost due to sickness: 90 75 75 65 55 45 55 45 35

    7. The length y metres of a cable subjected to a load of x kilograms is given by y = α + βx. In an experiment to estimate α and β for a particular cable, the value of y was measured for each of 15 values of x. The following quantities were calculated from the pairs of values.

    $\sum x = 225$, $\sum y = 238.5$, $\sum xy = 3581$, $\sum x^2 = 3625$

    a) Calculate the least squares estimates of α and β.

    8. A set of bivariate data can be summarised as follows:

    $\sum x = 21$, $\sum y = 43$, $\sum xy = 171$, $\sum x^2 = 91$, $n = 6$, $\sum y^2 = 335$

    i) Calculate the equation of the regression line of y on x. Give your answer in the form y = a + bx, where the values of a and b should be stated to 3 significant figures.

    ii) It is required to estimate the value of y for a given value of x. State the circumstances under which the regression line of x on y should be used, rather than the regression line of y on x.

    9. The data given below show the relationship between the heights and masses of ten people.

    Height, X (cm): 175 180 193 165 187 171 198 168 184 177

    Mass, Y (kg): 82 78 86 72 91 80 95 72 89 74

    Determine the equation of the regression line of mass on height, expressing the regression coefficients correct to two decimal places.


    10. The power needed to drive a lathe increases as the cutting angle of the tool increases when cutting at a constant speed and depth of cut. The relationship for mild steel is:

    Cutting angle (degrees), X: 50 55 60 65 70 75 80 85 90

    Power (kW), Y: 6.2 6.8 7.6 8.2 8.1 8.8 9.7 10.0 10.4

    Determine a) the equation of the regression line of power on cutting angle and b) the equation of the regression line of cutting angle on power, expressing the regression coefficients correct to three significant figures in each case.


    FEEDBACK TO SELF ASSESSMENT 5

    Have you tried all the questions? If YES, check your answers now.

    1. 0.97

    2. 0.70, fair, direct

    3. 0.97

    4. -0.31. It is probable that the measurements were made at different temperatures.

    5. r = 0.94, hence there is a good, direct correlation.

    6. r = -0.95, $t_{.99,\,7} = 1.42$, $|t| = 8.05$; the hypothesis is rejected.

    7. α = 15.69, β = 0.014, y = 15.69 + 0.014x

    8. i) y = 3.07 + 1.17x

    ii) Use the regression line of x on y to estimate the value of x when y is the independent variable.

    9. y = -36.83 + 0.66x

    10. a) Y = 1.14 + 0.104X

    b) X = -9.27 + 9.41Y
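The value |t| = 8.05 in answer 6 is consistent with the usual test statistic for a correlation coefficient, t = r√(n - 2)/√(1 - r²), with n - 2 degrees of freedom. The unit does not state this formula, so the Python sketch below should be read as an assumption about how the answer was obtained; the function name `t_statistic` is illustrative.

```python
import math


def t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Answer 6: r = -0.95 from n = 9 pairs of values
t = t_statistic(-0.95, 9)
print(round(abs(t), 2))  # 8.05
```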
