(12) Bivariate Data

  • 7/30/2019 (12) Bivariate Data

    1/31

    Applied Statistics and Computing Lab

    BIVARIATE DATA

    Applied Statistics and Computing Lab

    Indian School of Business

    Learning goals

    Understanding bivariate data

    Understanding the idea of correlation

    Understanding linear regression

    Bivariate Data

    Why study variables together?

    Variation in one variable may or may not affect the variation in another variable.

    Understanding the relationship: when the value of one variable changes, compare the other variable for

    : Direction of movement, and

    : Magnitude of movement

    Prediction: if a new value of one variable is observed, can we predict the corresponding value of the other variable?

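    Statistics for bivariate data

    Data type I: n paired observations $(x_1, y_1), \dots, (x_n, y_n)$, listed in two columns X and Y.

    Data type II: a two-way table of relative frequencies $p_{ij}$ for each pair of values $(x_i, y_j)$, with row totals, column totals and a grand total of 1 (used when there are multiple observations with the same values of X and Y).

    E(XY)

    Data type I:

    \[
    E(XY) = \frac{x_1 y_1 + x_2 y_2 + \dots + x_n y_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i
    \]

    Data type II:

    \[
    E(XY) = \sum_{i}\sum_{j} x_i\, y_j\, p_{ij}
    \]

    E(X), E(Y), V(X) and V(Y) are calculated as per the univariate mean and variance formulae.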
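The two ways of computing E(XY) can be checked against each other with a small sketch in R. The data values below are hypothetical and chosen only for illustration; the relative-frequency table is built from the same paired data, so both routes should give the same number.

```r
# Hypothetical paired observations (Data type I); not taken from the slides.
x <- c(1, 2, 2, 3)
y <- c(10, 20, 20, 40)

# Data type I: average of the products x_i * y_i.
exy_paired <- mean(x * y)

# Data type II: the same data tabulated as relative frequencies p_ij.
p <- prop.table(table(x, y))            # entries sum to 1
x_vals <- as.numeric(rownames(p))       # distinct values of X
y_vals <- as.numeric(colnames(p))       # distinct values of Y

# Sum over all cells of x_i * y_j * p_ij.
exy_table <- sum(outer(x_vals, y_vals) * p)

print(c(paired = exy_paired, table = exy_table))
```

Both routes give 52.5 for this data, since the table only regroups repeated (x, y) pairs.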

    Covariance (denoted by Cov)

    We understand the variation in a single variable by looking at the movement of its values from a measure of central tendency.

    For bivariate data, we want to look at the combined deviation of the two variables. The sign of such a measure tells us how the two variables covary: it is positive when they tend to move in the same direction and negative when they tend to move in opposite directions.

    Hence we can take the product of the two sets of deviations.

    The covariance calculates just this! It is defined as the expected value of the product of the deviation of X from its mean and the deviation of Y from its mean*. It is a reasonable measure of joint variation.

    \[
    \begin{aligned}
    \operatorname{cov}(X,Y) &= E[(X - E(X))(Y - E(Y))] \\
                            &= E(XY) - E(X)\,E(Y) \\
                            &= \frac{\sum xy}{n} - \frac{\sum x}{n}\cdot\frac{\sum y}{n}
    \end{aligned}
    \]

    *Aczel A., Sounderpandian J. Complete business statistics
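As a minimal sketch, the three equivalent forms of the covariance can be compared numerically in R. The data are hypothetical. One assumption worth flagging: R's built-in `cov()` divides by n - 1 (sample covariance), whereas the formulae above divide by n, so the built-in value is rescaled before comparison.

```r
# Hypothetical data, for illustration only.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 7, 9)
n <- length(x)

# E[(X - E(X)) (Y - E(Y))]
cov_def <- mean((x - mean(x)) * (y - mean(y)))

# E(XY) - E(X) E(Y)
cov_shortcut <- mean(x * y) - mean(x) * mean(y)

# sum(xy)/n - (sum(x)/n) * (sum(y)/n)
cov_sums <- sum(x * y) / n - (sum(x) / n) * (sum(y) / n)

# Built-in sample covariance uses the divisor n - 1; rescale to divisor n.
cov_builtin <- cov(x, y) * (n - 1) / n

print(c(cov_def, cov_shortcut, cov_sums, cov_builtin))
```

All four values agree (8 for this data).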

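    Covariance (contd.)

    Covariance is independent of change of origin but affected by change of scale:

    \[
    \text{For } U = \frac{X - a}{c} \text{ and } V = \frac{Y - b}{d}, \qquad \operatorname{cov}(U,V) = \frac{1}{cd}\,\operatorname{cov}(X,Y)
    \]

    The absolute value of the covariance of two variables is always less than or equal to the product of their standard deviations, i.e. the square root of the product of their variances:

    \[
    |\operatorname{cov}(X,Y)| \le \sqrt{\operatorname{Var}(X)\cdot\operatorname{Var}(Y)}
    \]

    The unit of covariance is the product of the units of X and Y.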
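The effect of a change of origin and scale, and the upper bound on the covariance, can be illustrated with a short R sketch. The data and the constants a, b, c, d are hypothetical; `c_scale` and `d_scale` are used as names because `c` is already a base R function.

```r
# Hypothetical data and constants, for illustration only.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 7, 9)
a <- 5; c_scale <- 2       # U = (X - a) / c
b <- 1; d_scale <- 10      # V = (Y - b) / d

u <- (x - a) / c_scale
v <- (y - b) / d_scale

# Shifting by a and b has no effect; dividing by c and d rescales by 1/(cd).
cov_uv        <- cov(u, v)
cov_xy_scaled <- cov(x, y) / (c_scale * d_scale)

# |cov(X, Y)| never exceeds sqrt(Var(X) * Var(Y)).
bound_holds <- abs(cov(x, y)) <= sqrt(var(x) * var(y))

print(c(cov_uv = cov_uv, cov_xy_scaled = cov_xy_scaled))
print(bound_holds)
```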

    Covariance (contd.)

    cov(Waist circumference, Adipose tissue area) = 643.39

    Can we compare this with another covariance?

    For the Body measurement data, consider both the Weight and the Height of all the individuals.

    What is the covariance between Height and Weight for each of the two genders?

    The two values, one per gender, are 27.13 kg·cm and 40.38 kg·cm.

    What information do we obtain by comparing these two covariance values?

    Standardization

    If we standardize both variables, the covariance is independent of the unit of measurement.

    This makes the covariances of both categories comparable.

    The standardized covariance lies in the interval [-1, 1].

    A value close to 0 => the variables do not covary much

    A value close to 1 or -1 => the variables covary highly

    For Height and Weight, the standardized covariances for the two genders are 0.43 and 0.53.

    Height and weight are moderately related to each other, for both genders.

    We will see that this covariance is the same as the measure we study next!
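A small R sketch of why standardization helps: changing the unit of height from centimetres to metres changes the raw covariance by a factor of 100, but leaves the covariance of the standardized variables untouched. The height and weight values and the helper `standardise()` are hypothetical, introduced only for this illustration.

```r
# Hypothetical measurements; not the Body measurement data from the slides.
height_cm <- c(150, 160, 165, 172, 180)
weight_kg <- c(52, 58, 66, 64, 80)
height_m  <- height_cm / 100          # same heights, different unit

# The raw covariance depends on the unit of measurement.
cov_cm <- cov(height_cm, weight_kg)
cov_m  <- cov(height_m,  weight_kg)

# Hypothetical helper: subtract the mean and divide by the standard deviation.
standardise <- function(v) (v - mean(v)) / sd(v)

# The covariance of the standardized variables is unit-free.
std_cov_cm <- cov(standardise(height_cm), standardise(weight_kg))
std_cov_m  <- cov(standardise(height_m),  standardise(weight_kg))

print(c(cov_cm = cov_cm, cov_m = cov_m))
print(c(std_cov_cm = std_cov_cm, std_cov_m = std_cov_m))
```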

    Correlation coefficient

    Denoted by $\rho$ (called rho)

    Defined as the measure of the degree of linear association between the two variables X and Y*

    Indicates the strength and the direction of the movement of the two variables in relation to each other

    Calculated as the ratio of the covariance between X and Y to the product of the standard deviations of X and Y:

    \[
    \rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y}
    \]

    The correlation coefficient is also termed the Pearson product-moment correlation coefficient.

    Computed values: $\rho = 0.77$; for Height and Weight, $\rho = 0.43$ and $\rho = 0.53$ for the two genders.

    *Aczel A., Sounderpandian J. Complete business statistics
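The definition can be checked directly in R: dividing the covariance by the product of the two standard deviations should reproduce the value returned by the built-in `cor()`. The data are hypothetical.

```r
# Hypothetical data, for illustration only.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 7, 9)

# Covariance divided by the product of the standard deviations.
# cov() and sd() both use the divisor n - 1, so the divisors cancel.
rho_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation (the default method of cor()).
rho_builtin <- cor(x, y)

print(c(manual = rho_manual, builtin = rho_builtin))
```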

    Properties of Correlation coefficient

    The correlation coefficient of two variables is equal to the covariance of their standardised forms

    Lies between -1 and 1 (extremes included): $-1 \le \rho \le 1$

    It is a dimension-free measure, i.e. a measure free of units

    Is independent of both change of origin and change of scale:

    \[
    U = \frac{X - a}{c}, \quad V = \frac{Y - b}{d} \quad (c, d > 0) \quad \Rightarrow \quad \rho_{U,V} = \rho_{X,Y}
    \]
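A short derivation sketch of the last property, assuming the transformation $U = (X - a)/c$ and $V = (Y - b)/d$ with positive constants $c$ and $d$ (the positivity is an assumption added here; with constants of opposite sign the correlation changes sign). A shift of origin affects neither the covariance nor the standard deviations, and the scale factors cancel:

```latex
\[
\operatorname{cov}(U,V) = \frac{\operatorname{cov}(X,Y)}{cd},
\qquad
\sigma_U = \frac{\sigma_X}{c},
\qquad
\sigma_V = \frac{\sigma_Y}{d}
\]
\[
\rho_{U,V}
= \frac{\operatorname{cov}(U,V)}{\sigma_U\,\sigma_V}
= \frac{\operatorname{cov}(X,Y)/(cd)}{(\sigma_X/c)\,(\sigma_Y/d)}
= \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y}
= \rho_{X,Y}
\]
```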

    Perfect positive correlation. If one of X or Y increases, the other must increase as per an exact linear relation. Similarly, if one decreases, the other decreases by the same rule.

    Perfect negative correlation. If one of X or Y increases, the other must decrease as per an exact linear relation. Similarly, if one decreases, the other increases by the same rule.

    No linear relationship.

    Moderate negative correlation. If one of X or Y increases, the other decreases as per a moderately strong linear relation. Similarly, if one decreases, the other increases by the same rule.

    Strong negative correlation. If one of X or Y increases, the other decreases as per a very strong linear relation. Similarly, if one decreases, the other increases by the same rule.

    Moderate positive correlation. If one of X or Y increases, the other must increase as per a moderately strong linear relation. Similarly, if one decreases, the other decreases by the same rule.

    Weak positive correlation. If one of X or Y increases, the other must increase as per a weak linear relation. Similarly, if one decreases, the other decreases by the same rule.

    No linear relationship.

    Visuals from Aczel A., Sounderpandian J. Complete business statistics

    Limitations of correlation coefficient

    Y       X
    0.6     2.01
    0.2     2
    0.2     2
    0.2     2
    0.1     2
    0.1     2
    0.1     2
    0.05    2
    0.05    2
    0       2

    Correlation coefficient = 0.911! X is constant apart from a single observation, yet that one point produces a high correlation.

    X       Y
    -3      9
    -2      4
    -1      1
    0       0
    1       1
    2       4
    3       9

    Correlation coefficient = 0

    Yet, there exists a perfect quadratic relation between X and Y
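Both tables can be reproduced in R using the values listed on this slide. This is only a sketch of the computation with the built-in `cor()`; the variable names are chosen here for illustration.

```r
# First table: X equals 2 everywhere except for one observation at 2.01.
y1 <- c(0.6, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0)
x1 <- c(2.01, rep(2, 9))
r_single_point <- cor(x1, y1)

# Second table: Y is exactly the square of X.
x2 <- -3:3
y2 <- x2^2
r_quadratic <- cor(x2, y2)

print(round(c(single_point = r_single_point, quadratic = r_quadratic), 3))
```

The first value rounds to 0.911 and the second is 0, matching the slide: the coefficient measures only linear association, and it can be dominated by a single observation.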

    Correlation and causality

    A huge Roger Federer fan!

    Watches several Federer - Nadal matches live on television

    Has recorded that Federer loses approximately 80% of the matches that this fan watches live

    Does he cause Federer to lose, by watching the match?

    Other measures

    Rank correlation

    To measure the degree of correlation between two ordinal variables or rankings

    : Company rankings given by two different publications

    : Ranks of universities published on two websites

    Consider two groups of women. They are grouped based on whether they use a particular brand of shampoo (say Shampoo A) or not. For each of the groups, responses are collated to indicate which of the five characteristics of their shampoo are most important to them.

    Characteristics     Group 1 rankings    Group 2 rankings    D = (rank 2 - rank 1)
    Characteristic 1    1                   5                    4
    Characteristic 2    3                   3                    0
    Characteristic 3    2                   4                    2
    Characteristic 4    5                   1                   -4
    Characteristic 5    4                   2                   -2

    Other measures (contd.)

    Spearman's rank correlation coefficient ($\rho_s$):

    \[
    \rho_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}
    \]

    where d = difference between the two ranks of each object

    n = number of objects

    This rank correlation is also equal to the Pearson product-moment correlation applied to the ranks (exactly so when there are no tied ranks)

    Lies in the interval [-1, 1]

    The higher the positive correlation coefficient, the greater the degree of agreement between the two rankings

    The closer a negative correlation coefficient is to -1, the greater the degree of disagreement between the two rankings

    A correlation coefficient of 0 indicates no association between the two rankings given to the same objects
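Applying the formula to the shampoo rankings from the earlier table gives a quick worked example in R. The rankings are those listed for Group 1 and Group 2; the variable names are chosen here for illustration.

```r
# Rankings of the five characteristics, as tabulated for the two groups.
rank1 <- c(1, 3, 2, 5, 4)   # Group 1
rank2 <- c(5, 3, 4, 1, 2)   # Group 2

d <- rank2 - rank1          # 4, 0, 2, -4, -2
n <- length(d)

# 1 - 6 * sum(d^2) / (n * (n^2 - 1)) = 1 - 6 * 40 / (5 * 24)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Built-in equivalent.
rho_builtin <- cor(rank1, rank2, method = "spearman")

print(c(formula = rho_formula, builtin = rho_builtin))
```

Both values are -1: the two groups rank the characteristics in exactly opposite order.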

    Other measures (contd.)

    Kendall's tau ($\tau$):

    \[
    \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2}\,n(n-1)}
    \]

    For n objects with ranks $(x_i, y_i)$, i = 1, 2, ..., n, a pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ is said to be

    concordant if the ranks of both elements agree, i.e. both $x_i > x_j$ and $y_i > y_j$, OR both $x_i < x_j$ and $y_i < y_j$

    discordant if $x_i > x_j$ and $y_i < y_j$, OR $x_i < x_j$ and $y_i > y_j$

    neither concordant nor discordant if $x_i = x_j$ or $y_i = y_j$

    Lies in the interval [-1, 1]

    If the agreement between two rankings is perfect, coefficient = 1

    If the disagreement between two rankings is perfect, coefficient = -1

    If the rankings are independent, the coefficient would be close to 0
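The counting of concordant and discordant pairs can be sketched in R for the shampoo rankings tabulated earlier. The approach below (enumerating all pairs with `combn()` and comparing signs) is one possible way to do the count, not necessarily how `cor()` computes it internally.

```r
# Rankings of the five characteristics by the two groups.
rank1 <- c(1, 3, 2, 5, 4)
rank2 <- c(5, 3, 4, 1, 2)
n <- length(rank1)

# Each column of 'pairs' holds the indices (i, j) of one pair of objects.
pairs <- combn(n, 2)

# +1 when both rankings order the pair the same way, -1 when they disagree,
# 0 when either ranking has a tie for that pair.
s <- sign(rank1[pairs[1, ]] - rank1[pairs[2, ]]) *
     sign(rank2[pairs[1, ]] - rank2[pairs[2, ]])

n_concordant <- sum(s > 0)
n_discordant <- sum(s < 0)

tau_formula <- (n_concordant - n_discordant) / (n * (n - 1) / 2)
tau_builtin <- cor(rank1, rank2, method = "kendall")

print(c(concordant = n_concordant, discordant = n_discordant))
print(c(formula = tau_formula, builtin = tau_builtin))
```

All 10 pairs are discordant here, so the coefficient is -1, in line with perfect disagreement between the two rankings.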

    Linear Regression

    Suppose now that the variation in one variable (X) influences the variation in the other variable (Y)

    : Is the adipose tissue area influenced by waist circumference?

    : Are ice-cream sales affected by the temperature in the city?

    The variable X, i.e. the variable that influences, is also referred to as the predictor variable, the independent variable or the explanatory variable

    The variable Y, i.e. the variable that is being influenced, is also referred to as the outcome variable, the dependent variable or the response variable

    Can we draw one line such that the equation of that line explains the relation between X and Y?

    Which line describes the relationship in a reasonable way?

    Linear regression (contd.)

    The fitted regression line minimizes the sum of squared vertical distances between the observed points and the line

    Visuals from Aczel A., Sounderpandian J. Complete business statistics

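    Linear regression (contd.)

    Simple linear regression model:

    \[
    Y = \beta_0 + \beta_1 X + \epsilon
    \]

    where Y = Outcome variable

    X = Predictor variable

    $\epsilon$ = Random component in the model

    \[
    \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}
    \]

    If we can safely assume a linear relationship between X and Y, the slope of this model gives the average amount by which Y changes for a one-unit change in X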
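The slope and intercept formulae can be evaluated by hand in R and compared with the coefficients returned by `lm()`. The data are hypothetical, and `b0` and `b1` are names chosen here for the two estimates.

```r
# Hypothetical data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: sum of cross-deviations divided by sum of squared deviations of x.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: mean of y minus slope times mean of x.
b0 <- mean(y) - b1 * mean(x)

# The same estimates from R's linear model function.
fit <- lm(y ~ x)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```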

    Linear regression (contd.)

    The model is estimated using the method of least squares

    This method minimizes the sum of squared errors

    There are other methods of estimation

    Visuals from Aczel A., Sounderpandian J. Complete business statistics
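One way to see what "least squares" means is to compare the sum of squared errors at the fitted line with the same quantity for slightly different lines. The sketch below uses hypothetical data and a hypothetical helper `sse()`; shifting the intercept or the slope by 0.1 is an arbitrary choice for illustration.

```r
# Hypothetical data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Hypothetical helper: sum of squared vertical distances from the line
# y = b0 + b1 * x to the observed points.
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

est <- unname(coef(lm(y ~ x)))      # least squares intercept and slope

sse_fit             <- sse(est[1], est[2])
sse_shift_intercept <- sse(est[1] + 0.1, est[2])
sse_shift_slope     <- sse(est[1], est[2] + 0.1)

print(c(fit = sse_fit,
        shifted_intercept = sse_shift_intercept,
        shifted_slope = sse_shift_slope))
```

Any move away from the least squares estimates increases the sum of squared errors.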

    Linear regression (contd.)

    Goodness of the model depends on the strength of the linear relationship between X and Y

    The error could comprise factors other than X that may affect Y

    The coefficient of determination, or $R^2$, is a measure of the strength of linearity in the relationship

    It indicates the proportion of variation in Y that is explained by X
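As a sketch, the coefficient of determination can be computed in R as one minus the ratio of unexplained to total variation, and compared with the value reported by `summary()`. For simple linear regression it also equals the squared correlation coefficient. The data are hypothetical.

```r
# Hypothetical data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)     # variation in Y left unexplained
sst <- sum((y - mean(y))^2)      # total variation in Y

r2_manual   <- 1 - sse / sst
r2_summary  <- summary(fit)$r.squared
r2_from_cor <- cor(x, y)^2       # holds for one predictor with an intercept

print(c(manual = r2_manual, summary = r2_summary, from_cor = r2_from_cor))
```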

    Linear regression (contd.)

    Fitting a linear regression for the Waist circumference-Adipose tissue data gives the following output in R:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 71.26327    1.88565   37.79

    We get the following regression equation: $\hat{Y} = 71.26 + 0.2\,X$

    R-codes

    Function                       R-code
    Dotplot                        install.packages("TeachingDemos")
                                   library(TeachingDemos)
                                   dots(x)
    Scatter plot                   plot(x, y)
    Covariance                     cov(x, y)
    Correlation                    cor(x, y)
    Spearman's rank correlation    cor(x, y, method = "spearman")
    Kendall's tau                  cor(x, y, method = "kendall")
    Linear regression              lm(y ~ x)
    Regression line                abline(lm(y ~ x))

    Here x and y stand for the names of the two variables; in lm(), y is the response variable and x the explanatory variable. abline() draws the line on an existing scatter plot and takes the fitted model rather than the formula itself.

    Thank you