Correlation

Rizal Maulana


Outline
- Basic Concepts of Correlation
- Scatter Diagrams
- One Sample Hypothesis Testing for Correlation

Basic Concepts of Correlation

Definition: The covariance between two sample random variables x and y is a measure of the linear association between the two variables, and is defined by the formula

cov(x, y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Observation: The covariance is similar to the variance, except that the covariance is defined for two variables (x and y above) whereas the variance is defined for only one variable. In fact, cov(x, x) = var(x).
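To make the definition concrete, here is a minimal Python sketch (Python rather than Excel, and with made-up data, purely for illustration) that computes the sample covariance directly from the formula and checks that cov(x, x) = var(x):

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative data
y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(sample_cov(x, y))                     # agrees with np.cov(x, y, ddof=1)[0, 1]
print(sample_cov(x, x), np.var(x, ddof=1))  # cov(x, x) = var(x)
```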

The covariance can be thought of as the sum of matches and mismatches among the pairs of data elements for x and y.

A match occurs when both elements in the pair are on the same side of their mean; a mismatch occurs when one element in the pair is above its mean and the other is below its mean.

The covariance is positive when the matches outweigh the mismatches and is negative when the mismatches outweigh the matches.

The stronger the linear relationship, the larger the magnitude of the covariance.

The size of the covariance is also influenced by the scale of the data elements, and so to eliminate this scale factor the correlation coefficient is used as a scale-free measure of linear association.

Definition: The correlation coefficient between two sample variables x and y is a scale-free measure of linear association between the two variables, and is given by the formula

r = cov(x, y) / (sx ∙ sy)

where sx and sy are the sample standard deviations of x and y.

Observation: The covariance can be calculated as

cov(x, y) = (Σᵢ xᵢyᵢ − n x̄ȳ) / (n − 1)

As a result, we can also calculate the correlation coefficient as

r = (Σᵢ xᵢyᵢ − n x̄ȳ) / ((n − 1) sx sy)
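A short sketch of both computations (again illustrative Python with made-up data, not part of the original slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))     # r = cov(x, y) / (s_x * s_y)

# Shortcut form: cov(x, y) = (sum of x_i*y_i - n*x_bar*y_bar) / (n - 1)
cov_alt = (np.sum(x * y) - n * x.mean() * y.mean()) / (n - 1)

print(np.isclose(cov_xy, cov_alt))             # True: the two covariances agree
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # True: matches the built-in
```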

Property 1: −1 ≤ r ≤ 1

If r is close to 1 then x and y are positively correlated. A positive linear correlation means that high values of x are associated with high values of y, and low values of x are associated with low values of y.

If r is close to -1 then x and y are negatively correlated. A negative linear correlation means that high values of x are associated with low values of y, and low values of x are associated with high values of y.

When r is close to 0 there is little linear relationship between x and y.

Definition: The covariance between two random variables x and y for a population with a discrete or continuous pdf is

cov(x, y) = E[(x − μx)(y − μy)]

Definition: The (Pearson's product moment) correlation coefficient for two variables x and y for a population with a discrete or continuous pdf is

ρ = cov(x, y) / (σx ∙ σy)

If x and y are independent then cov(x, y) = 0. This is true both for the sample and population versions of the covariance.

Proof (population version): independence implies E[xy] = E[x] ∙ E[y], and so cov(x, y) = E[xy] − E[x] ∙ E[y] = 0.

Observation: It turns out that r is not an unbiased estimate of ρ. A relatively unbiased estimate of ρ is given by the adjusted correlation coefficient radj:

radj = ±√(1 − (1 − r²)(n − 1)/(n − 2))   (taking the sign of r)

While radj is a better estimate of the population correlation, especially for small values of n, for large values of n it is easy to see that radj ≈ r.
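A sketch of the adjustment in Python (assuming the formula above, with the sign taken from r):

```python
import math

def r_adjusted(r, n):
    # Assumes r_adj = sign(r) * sqrt(1 - (1 - r^2)(n - 1)/(n - 2))
    val = 1 - (1 - r**2) * (n - 1) / (n - 2)
    return math.copysign(math.sqrt(max(val, 0.0)), r)

print(r_adjusted(0.6, 10))     # noticeably smaller than .6 for small n
print(r_adjusted(0.6, 1000))   # approximately .6 for large n
```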

For constant a and random variables x, y and z, the following are true both for the sample and population definitions of covariance:

a. cov(x, y) = cov(y, x)

b. cov(x, x) = var(x)

c. cov(a, y) = 0

d. cov(ax, y) = a · cov(x, y)

e. cov(x+z, y) = cov(x, y)+ cov(z, y)
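These identities are easy to confirm numerically; the following Python check (random data, for illustration only, not part of the original slides) exercises each one:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 1000))
a = 2.5

def cov(u, v):
    return np.cov(u, v, ddof=1)[0, 1]   # sample covariance

assert np.isclose(cov(x, y), cov(y, x))                  # a. symmetry
assert np.isclose(cov(x, x), np.var(x, ddof=1))          # b. cov(x, x) = var(x)
assert np.isclose(cov(np.full_like(y, a), y), 0.0)       # c. constants have zero covariance
assert np.isclose(cov(a * x, y), a * cov(x, y))          # d. scaling
assert np.isclose(cov(x + z, y), cov(x, y) + cov(z, y))  # e. additivity
```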

If x and y are random variables and z = ax + b, where a and b are constants with a > 0, then the correlation coefficient between z and y is the same as the correlation coefficient between x and y.

Proof: cov(z, y) = cov(ax + b, y) = a ∙ cov(x, y), and var(z) = a² ∙ var(x), and so stdev(z) = a ∙ stdev(x). Thus

corr(z, y) = a ∙ cov(x, y) / (a ∙ stdev(x) ∙ stdev(y)) = corr(x, y)

Property 2:

Proof: Since tᵢ and eᵢ are independent, cov(t, e) = 0, and the result follows.

Excel Functions:

1. COVAR(R1, R2) = the population covariance between the data in arrays R1 and R2. If R1 contains the data {x1, …, xn}, R2 contains {y1, …, yn}, x̄ = AVERAGE(R1) and ȳ = AVERAGE(R2), then COVAR(R1, R2) has the value

Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / n

This is the same as the sample covariance formula given earlier, with n − 1 replaced by n. Older versions of Excel don't have a sample version of the covariance, although it can be calculated using the formula:

n * COVAR(R1, R2) / (n – 1)

2. CORREL(R1, R2) = the correlation coefficient of the data in arrays R1 and R2. This function can be used for both the sample and population versions of the correlation coefficient. Note that:

◦ CORREL(R1, R2) = COVAR(R1, R2) / (STDEVP(R1) * STDEVP(R2)) = the population version of the correlation coefficient

◦ CORREL(R1, R2) = n * COVAR(R1, R2) / (STDEV(R1) * STDEV(R2) * (n – 1)) = the sample version of the correlation coefficient

3. Excel also provides COVARIANCE.S(R1, R2) to compute the sample covariance, as well as COVARIANCE.P(R1, R2), which is equivalent to COVAR(R1, R2). Also, the Real Statistics supplemental functions COVARP(R1, R2) and COVARS(R1, R2) compute the population and sample covariances respectively.
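Outside Excel, the same quantities can be reproduced in Python via the ddof argument of np.cov (a sketch with illustrative data, not part of the original slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)

covar_p = np.cov(x, y, ddof=0)[0, 1]   # like COVAR / COVARIANCE.P (divide by n)
covar_s = np.cov(x, y, ddof=1)[0, 1]   # like COVARIANCE.S (divide by n - 1)

print(np.isclose(covar_s, n * covar_p / (n - 1)))   # True: sample from population version
print(np.corrcoef(x, y)[0, 1])                      # like CORREL(R1, R2)
```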

Scatter Diagrams

To better visualize the association between two data sets {x1, …, xn} and {y1, …, yn} we can employ a chart called a scatter diagram (also called a scatter plot). This is done in Excel by highlighting the data in the two data sets and selecting Insert > Charts|Scatter.
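The same chart can be drawn outside Excel as well; here is a minimal Python/matplotlib sketch with simulated data (an illustration, not part of the original slides):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.6, size=50)   # positively correlated by construction

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title(f"r = {np.corrcoef(x, y)[0, 1]:.2f}")
plt.show()
```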

[Figure: a set of scatter diagrams illustrating the relationship between the pattern of the plotted points and the correlation coefficient (or covariance).]

One Sample Hypothesis Testing for Correlation

As we do in Sampling Distributions, we can consider the distribution of r over repeated samples of x and y.

We require that x and y have a joint bivariate normal distribution or that the samples are sufficiently large.

We can think of a bivariate normal distribution as the three-dimensional version of the normal distribution, in which any vertical slice through the surface which graphs the distribution results in an ordinary bell curve.

The sampling distribution of r is only symmetric when ρ = 0 (which, for a bivariate normal distribution, is equivalent to x and y being independent).

If ρ ≠ 0, then the sampling distribution is asymmetric and so the following theorem does not apply, and other methods of inference must be used.

Theorem 1: Suppose ρ = 0. If x and y have a bivariate normal distribution or if the sample size n is sufficiently large, then r has a normal distribution with mean 0, and t = r/sr ~ T(n – 2) where

sr = √((1 − r²)/(n − 2))

Here the numerator r of the random variable t is the estimate of ρ = 0 and sr is the standard error of r.

Observation: If we solve the equation in Theorem 1 for r, we get

r = t / √(n − 2 + t²)

Observation: The theorem can be used to test the hypothesis that the population random variables x and y are independent, i.e. that ρ = 0.
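A sketch of this test in Python (using scipy, which is an assumption; the slides themselves work in Excel):

```python
import math
from scipy import stats

def correlation_t_test(r, n):
    """Two-tail test of H0: rho = 0 using t = r / s_r from Theorem 1."""
    s_r = math.sqrt((1 - r**2) / (n - 2))
    t = r / s_r
    p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tail p-value, df = n - 2
    return t, p
```

The examples that follow can be reproduced by calling this function with their values of r and n.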

Example 1

A study is designed to check the relationship between smoking and longevity. A sample of 15 men 50 years and older was taken, and the average number of cigarettes smoked per day and the age at death were recorded, as summarized in the table. Can we conclude from the sample that longevity is independent of smoking?

The scatter diagram for this data is as follows. We have also included the linear trend line that seems to best match the data.

Next we calculate the correlation coefficient of the sample using the CORREL function:

r = CORREL(R1, R2) = -.713

From the scatter diagram and the correlation coefficient, it is clear that the population correlation is likely to be negative.

The absolute value of the correlation coefficient looks high, but is it high enough? To determine this, we establish the following null hypothesis:

H0: ρ = 0

Recall that ρ = 0 would mean that the two population variables are independent. We use t = r/sr as the test statistic, where sr is as in Theorem 1. Based on the null hypothesis ρ = 0, we can apply Theorem 1, provided x and y have a bivariate normal distribution.

It is difficult to check for bivariate normality, but we can at least check to make sure that each variable is approximately normal via QQ plots.

Both samples appear to be normal, and so by Theorem 1, we know that t has approximately a t-distribution with n – 2 = 13 degrees of freedom. We now calculate

t = r√(n − 2) / √(1 − r²) = −.713 ∙ √13 / √(1 − (−.713)²) ≈ −3.67

Finally, we perform either one of the following tests:

p-value = TDIST(ABS(-3.67), 13, 2) = .00282 < .05 = α (two-tail)
tcrit = TINV(.05, 13) = 2.16 < 3.67 = |tobs|

And so we reject the null hypothesis and conclude that there is a non-zero correlation between smoking and longevity. In fact, it appears from the data that increased levels of smoking reduce longevity.
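The calculation of Example 1 can be replicated as follows (Python equivalents of the Excel formulas above, assuming scipy):

```python
import math
from scipy import stats

r, n = -0.713, 15
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # about -3.67
p_two_tail = 2 * stats.t.sf(abs(t), df=n - 2)    # about .0028, like TDIST(|t|, 13, 2)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # about 2.16, like TINV(.05, 13)
print(t, p_two_tail, t_crit)
```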

Example 2

The US Census Bureau collects statistics comparing the 50 states. The following table shows the poverty rate (% of population below the poverty level) and the infant mortality rate (per 1,000 live births) by state. Based on this data, can we conclude that the poverty and infant mortality rates by state are correlated?

The correlation coefficient of the sample is given by

r = CORREL(R1, R2) = .564

where R1 is the range containing the poverty data and R2 is the range containing the infant mortality data.

From the scatter diagram and the correlation coefficient, it is clear that the population correlation is likely to be positive, and so this time we use the following one-tail null hypothesis:

H0: ρ ≤ 0

Based on the null hypothesis we will assume that ρ = 0 (the boundary case), and so, as in Example 1,

t = r√(n − 2) / √(1 − r²) = .564 ∙ √48 / √(1 − .564²) ≈ 4.737

Finally, we perform either one of the following tests:

p-value = TDIST(4.737, 48, 1) = 9.8E-08 < .05 = α (one-tail)
tcrit = TINV(2 ∙ .05, 48) = 1.677 < 4.737 = tobs (TINV is two-tailed, so we double α for a one-tail critical value)

And so we reject the null hypothesis and conclude that there is a positive correlation between poverty and infant mortality.
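The one-tail version differs from Example 1 only in how the p-value and critical value are taken (again a Python rendering of the Excel calculation, assuming scipy):

```python
import math
from scipy import stats

r, n = 0.564, 50
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # about 4.737
p_one_tail = stats.t.sf(t, df=n - 2)             # about 9.8E-08, like TDIST(t, 48, 1)
t_crit = stats.t.ppf(1 - 0.05, df=n - 2)         # about 1.677 for a one-tail test
print(t, p_one_tail, t_crit)
```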

Observation: For samples of any given size n, it turns out that r is not normally distributed when ρ ≠ 0 (even when the population has a bivariate normal distribution), and so we can't use Theorem 1.

There is a simple transformation of r, however, that gets around this problem, and allows us to test whether ρ = ρ0 for some value of ρ0 ≠ 0.

Definition 1: For any r, define the Fisher transformation of r as follows:

r′ = ½ ln((1 + r)/(1 − r))

Theorem 2: If x and y have a joint bivariate normal distribution or n is sufficiently large, then the Fisher transformation r′ of the correlation coefficient r for samples of size n has distribution N(ρ′, sr′) where

ρ′ = ½ ln((1 + ρ)/(1 − ρ))    sr′ = 1/√(n − 3)

Corollary 1: Suppose r1 and r2 are as in the theorem, where r1 and r2 are based on independent samples of sizes n1 and n2, and further suppose that ρ1 = ρ2. If z is defined as follows, then z ~ N(0, 1)

z = (r1′ − r2′) / s

where

s = √(1/(n1 − 3) + 1/(n2 − 3))

Excel Functions: Excel provides functions that calculate the Fisher transformation and its inverse.

FISHER(r) = .5 * LN((1 + r) / (1 – r))
FISHERINV(z) = (EXP(2 * z) – 1) / (EXP(2 * z) + 1)

Observation: We can use Theorem 2 to test the null hypothesis H0: ρ = ρ0. This test is very sensitive to outliers. If outliers are present it may be better to use the Spearman rank correlation test or Kendall’s tau test.
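A sketch of the Fisher-transformation test of H0: ρ = ρ0 in Python (math.atanh plays the role of Excel's FISHER; scipy is an assumption):

```python
import math
from scipy import stats

def fisher_rho_test(r, n, rho0):
    """Two-tail test of H0: rho = rho0 via the Fisher transformation."""
    r_prime = math.atanh(r)       # FISHER(r)
    rho_prime = math.atanh(rho0)  # FISHER(rho0)
    s = 1 / math.sqrt(n - 3)      # standard error from Theorem 2
    z = (r_prime - rho_prime) / s
    p = 2 * stats.norm.sf(abs(z))
    return z, p
```

For instance, fisher_rho_test(.6, 100, .7) reproduces the quantities used in Example 3 below.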

Example 3

Suppose we calculate r = .6 for a sample of size n = 100. Test the following null hypothesis and find the 95% confidence interval.

H0: ρ = .7

Observe that

r′ = FISHER(r) = FISHER(.6) = 0.693
ρ′ = FISHER(ρ) = FISHER(.7) = 0.867
sr′ = 1 / SQRT(n – 3) = 1 / SQRT(100 – 3) = 0.102

Since r′ < ρ′, we are looking at the left tail of a two-tail test:

p-value = NORMDIST(r′, ρ′, sr′, TRUE) = NORMDIST(.693, .867, .102, TRUE) = .0432 > 0.025 = α/2

r′-crit = NORMINV(α/2, ρ′, sr′) = NORMINV(.025, .867, .102) = .668 < .693 = r′

In either case, we cannot reject the null hypothesis.

The 95% confidence interval for ρ′ is

r′ ± zcrit ∙ sr′ = 0.693 ± 1.96 ∙ 0.102 = (0.494, 0.892)

Here zcrit = ABS(NORMSINV(.025)) = 1.96.

The 95% confidence interval for ρ is therefore (FISHERINV(0.494), FISHERINV(0.892)) = (.457, .712). Note that .7 lies in this interval, confirming our conclusion not to reject the null hypothesis.
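The confidence-interval calculation can be reproduced in Python (math.tanh is the inverse Fisher transformation, like FISHERINV; scipy is an assumption):

```python
import math
from scipy import stats

r, n = 0.6, 100
r_prime = math.atanh(r)              # 0.693
s = 1 / math.sqrt(n - 3)             # 0.102
z_crit = stats.norm.ppf(1 - 0.025)   # 1.96
lo, hi = r_prime - z_crit * s, r_prime + z_crit * s
print(math.tanh(lo), math.tanh(hi))  # about (.457, .712), the interval for rho
```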

Effect Size and Power

Until now, when we have discussed effect size we have used some version of Cohen's d. The correlation coefficient r (as well as r²) provides another common measure of effect size. We now show how to calculate the power of a test of correlation, using the approach from Power of a Sample.

Example 4

A market research team is conducting a study in which they believe the correlation between increases in product sales and marketing expenditures is 0.35. What is the power of the one-tail test if they use a sample of size 40 with α = .05? How big does their sample need to be to carry out the study with α = .05 and power = .80?
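The slides end before working this example. A sketch of one standard approach (approximating power via the Fisher transformation; assuming this is the method intended by Power of a Sample):

```python
import math
from scipy import stats

def corr_power_one_tail(rho, n, alpha=0.05):
    """Approximate power of the one-tail test of H0: rho = 0
    when the true correlation is rho."""
    effect = math.atanh(rho) * math.sqrt(n - 3)
    z_alpha = stats.norm.ppf(1 - alpha)
    return stats.norm.sf(z_alpha - effect)

def corr_sample_size(rho, power=0.80, alpha=0.05):
    """Smallest n giving at least the requested power under the same approximation."""
    z = stats.norm.ppf(1 - alpha) + stats.norm.ppf(power)
    return math.ceil((z / math.atanh(rho)) ** 2 + 3)

print(corr_power_one_tail(0.35, 40))   # roughly .72 under these assumptions
print(corr_sample_size(0.35))          # roughly n = 50
```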