
Module 9 (Correlation Analysis)


Reading Material #9 (Correlation Analysis)

---------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------------------------
Gabino P. Petilos, Ph.D.
FIC, EDRE 231, 2nd Sem 11-12

CORRELATION ANALYSIS

INTRODUCTION

It has been said that research is conducted in order to find relationships between or among variables. When factors or variables are related in some systematic pattern, so that a change in the value of one is associated with a concurrent change in the value of the other, we say that they are correlated. Thus, we know that ability level is correlated with academic performance based on our common observation that students belonging to a high ability level tend to show better academic performance while those belonging to a low ability level tend to show poor academic performance.

In statistics, we not only establish the existence of certain correlations but also measure the direction and the degree of correlation. Ideally, we want to know the correlation between two variables X and Y in a given population (Figure 1). The correlation between these variables is denoted by the symbol ρ (rho), called the population correlation coefficient. Since it is not always feasible to study the entire population, we attempt to describe the correlation between X and Y by drawing a random sample from the population. We denote the estimate of the parameter ρ by the sample correlation coefficient r.

If a sample is used to estimate the amount of correlation between two variables, significance testing is called for to find out whether the variables are indeed related in the actual population. For this reason, correlation analysis employs both descriptive and inferential statistics.

[Figure 1. The population correlation between X and Y (ρ = ?) is estimated by the correlation computed from a sample (r = ?).]

Correlation analysis is concerned with the linear relationship between two variables. It aims to determine the direction (whether positive or negative) as well as the strength (whether weak, moderate, or strong) of linear association between two variables. When two variables vary in the same direction, we say that the variables are positively correlated. For example, it has been shown that IQ and academic performance are positively correlated. This means that a person who has a high IQ would tend to have good academic performance in school and, in turn, a person's good academic performance is usually associated with a high IQ. Other examples of variables which are positively correlated are:

  Grade in Mathematics and Grade in Physics;

  Work performance and Level of morale;

  Number of hours spent in studying and Grades in mathematics.

On the other hand, when two variables vary in opposite directions, the variables are said to be negatively correlated. Examples of variables which exhibit negative correlation are:

   Academic achievement and Hours per week of watching TV  

  Time spent in typing practice and Number of typing errors 

   Absenteeism and Job satisfaction 

Variables that are not linearly correlated have zero correlation. For instance, the height of students and their ability level have zero correlation. In this example, it does not make sense to associate a particular value of height with a particular ability level. As another example, there is zero correlation between size of shoes and level of income of bank managers!

The direction and strength of linear correlation between variables may be described using a statistical device called a “scatter plot” or “scatter diagram”. Examples of scatter plots are given in Figure 1. Here, the scatter plots from (a) to (c) illustrate a positive correlation between the two variables in varying strengths while (d) to (f) illustrate a negative correlation, also in varying strengths. The scatter plots in (a) and (d) illustrate a perfect correlation between the two variables while those of (g) and (h) illustrate a zero correlation.

[Figure 1. Examples of Scatter Plots. Panels: (a) perfect positive, (b) strong positive, (c) weak positive, (d) perfect negative, (e) strong negative, (f) weak negative, (g) zero correlation, (h) zero correlation]


A strong correlation between two variables can occur when not all points fall on the line of relationship but they are close to it. If the points lie far from the line, the correlation is said to be weak (or low) [see graphs (c) and (f)].

When the points do not tend to follow the path of a straight line, the correlation is said to be zero. This is illustrated by the scatter plots in (g) and (h). Note that zero correlation between two variables does not necessarily mean that the variables are not related. In (g), for instance, there is zero linear correlation between the variables, yet they are related in a quadratic sense.

THE CORRELATION COEFFICIENT

As mentioned earlier, the scatter diagram is a visual device which is useful in

characterizing the direction and strength of linear correlation between two variables. The

direction of relationship is perhaps easy to discern in a scatter diagram. However,

interpretation of the strength of linear correlation using a scatter diagram is not easy since it

is open to various interpretations when viewed by different persons.

The correlation coefficient is another tool by which the direction and strength of linear correlation between two variables may be described. As a measure of correlation, the correlation coefficient ranges in value from -1.0 to +1.0. Thus if ρ (rho) represents the population correlation coefficient, then

-1.0 ≤ ρ ≤ +1.0

If ρ = +1.0, the variables are said to be perfectly correlated in a positive sense. If the value is -1.0, the variables are perfectly correlated in a negative sense. A value of ρ = 0 indicates a zero linear correlation between the two variables. Figure 2 illustrates the descriptive interpretation of the correlation coefficient.

Figure 2. Interpretation of ρ

Value of ρ    Interpretation
+1.0          perfect positive correlation
+0.5          moderate positive correlation
 0            zero correlation
-0.5          moderate negative correlation
-1.0          perfect negative correlation

Direction of correlation: values of ρ above 0 indicate POSITIVE CORRELATION; values below 0 indicate NEGATIVE CORRELATION.
Strength of correlation: values of ρ close to 0 indicate weak or low correlation; values close to +1.0 or -1.0 indicate strong or high correlation.


Since ρ is usually unknown, it has to be estimated from sample data. The estimate of ρ is called a sample correlation coefficient. In the succeeding discussion, we present techniques for estimating the population correlation coefficient.

PEARSON'S PRODUCT-MOMENT CORRELATION COEFFICIENT

Although there are several measures of correlation, the most common and useful one is Pearson's product-moment correlation coefficient, denoted by r. This measure of correlation is used when both variables are measured in at least the interval scale. The computational formula for Pearson's r is given by

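\[
r = \frac{N\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[N\sum X^{2} - \left(\sum X\right)^{2}\right]\left[N\sum Y^{2} - \left(\sum Y\right)^{2}\right]}} \qquad \text{(Equation 1)}
\]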

Pearson's r is a parametric measure of correlation. The following assumptions must be satisfied when using Pearson's r:

1.  Both variables X and Y must be measured in at least the interval scale;

2.  Observations are sampled from a bivariate normal distribution; and

3.  The variables are linearly related.
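As a rough illustration of how Equation 1 is applied, the following Python sketch computes r directly from the raw-score sums. The function name `pearson_r` and the paired scores `hours` and `grade` are hypothetical and are used here only to show the arithmetic; they do not come from the reading material.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r computed from the raw-score sums of Equation 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two cases")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero when either variable is constant; r is undefined then.
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired scores, for illustration only.
hours = [1, 2, 3, 4, 5]
grade = [2, 4, 5, 4, 5]
print(round(pearson_r(hours, grade), 4))
```

The sketch only carries out the computation; it does not check the three assumptions listed above, which still have to be examined separately (for example, linearity through a scatter plot).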

Example 1. The table below shows experimental data for the observed pairs (x, y). Find the value of r.

x    2   3   7   4   6   8   5
y    3   5   8   5   7  10   5

Solution: Without loss of generality, let us assume that the first two assumptions above have

been satisfied. To determine whether the third assumption is also satisfied, we

construct a scatter plot for the given data. This scatter plot is shown below.

[Scatter plot of the Example 1 data, with x on the horizontal axis and y on the vertical axis, both scaled from -1 to 10]

Clearly, the scatter plot suggests a linear relationship between the two variables. To

determine the extent of correlation between the two variables, we compute the value of  r  

using Equation 1. The following worksheet illustrates how this value is computed.


correlation coefficient of 0.72 between political affiliation and religion in social science research may be interpreted as high. The same value, however, may be interpreted as low when used as a measure of reliability or validity of standardized tests. Also, it is often tempting to say that an r value of 0.80 is twice as strong as an r value of 0.40. Such an interpretation is incorrect since the correlation scale is not ratio or interval but rather an ordinal one.

Another consideration in interpreting a correlation coefficient is when the value is 0. In general, a value of r = 0 does not mean that the variables are not related. As shown in Figure 1 (g), a value of 0 merely implies that there is no linear association between the two variables. Moreover, a value of r different from 0 cannot be construed to mean that one variable causes the other: if two variables are correlated, it does not follow that one of them causes the other.

One meaningful interpretation of r involves the concept of the coefficient of determination, which is denoted by r². This value gives us a measure of the amount of variation in one variable which can be attributed to the variation of the other variable and vice versa. Thus, if r = 0.91, then r² = 0.8281 or 82.81%, which means that 82.81% of the variation in one variable is accounted for by the variation of the other variable and vice versa. The coefficient of determination is a very important and useful concept in regression analysis.

Testing the Significance of r  

If the value of the correlation coefficient is obtained from sample data, the researcher would often want to know whether the variables are in fact related in the actual population from which the sample was drawn. The hypothesis of interest is whether the population correlation coefficient is zero or not. Thus the following null hypothesis must be tested using the obtained sample correlation coefficient.

Null Hypothesis        : H0: ρ = 0 (There is no correlation between X and Y)
Alternative Hypothesis : Ha: ρ ≠ 0 (For a Non-directional Test)
                       : Ha: ρ < 0 or ρ > 0 (For a Directional Test)

To test whether the obtained Pearson's r is significantly different from zero, a t-test could be used if N < 30 or a z-test if N ≥ 30. The t statistic is given below:

\[
t = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}, \qquad \text{d.f.} = N - 2 \qquad \text{(Equation 2)}
\]

Thus, for instance, if r = .91 and N = 7, we have

\[
t = \frac{(.91)\sqrt{7-2}}{\sqrt{1-(.91)^{2}}} = 4.9078
\]


At α = 0.05 (two-tailed) with d.f. = 5, the corresponding critical value of t is 2.571. Since the absolute value of the computed t-value exceeds the critical value, we reject the null hypothesis. We conclude that the relationship between the two variables cannot be attributed to chance.
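The computation of the t statistic in Equation 2 can be sketched in Python as follows. The function name `t_from_r` is hypothetical; the critical value 2.571 is simply the tabled value quoted in the text and is not computed by the code.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic of Equation 2 for a sample correlation r based on n pairs."""
    if n < 3:
        raise ValueError("need at least 3 pairs so that d.f. = n - 2 is positive")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


t = t_from_r(0.91, 7)
critical_t = 2.571  # tabled two-tailed critical value at alpha = 0.05, d.f. = 5
print(round(t, 4))          # 4.9078
print(abs(t) > critical_t)  # True, so the null hypothesis is rejected
```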

When the data are analyzed using SPSS, we obtain the following output. This table provides both descriptive and inferential information about the correlation between the variables X and Y. In this table, the value of Pearson's r is 0.913, hence the correlation is positive and the strength of linear correlation is high. Also, the associated p-value is .004 (two-tailed). Because the reported p-value is already two-tailed, it is compared directly with α = 0.05; since .004 < 0.05, the null hypothesis is rejected.

Correlations

                              X         Y
X    Pearson Correlation      1         .913**
     Sig. (2-tailed)                    .004
     N                        7         7
Y    Pearson Correlation      .913**    1
     Sig. (2-tailed)          .004
     N                        7         7

** Correlation is significant at the 0.01 level.

Another test statistic for testing the null hypothesis about the value of the population correlation coefficient is the z-test. This test statistic is particularly useful when the hypothesized value of the population correlation coefficient is different from zero, as for example H0: ρ = ρ0 (ρ0 ≠ 0). In using this test, we first apply Fisher's transformation to the obtained value of r to get the corresponding z-value. For a given r, the transformed value of z is given by

\[
z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right).
\]

For this variable, the mean and standard deviation are given by

\[
\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right) \quad\text{and}\quad \sigma_z = \frac{1}{\sqrt{n-3}}.
\]

Therefore, the quantity

\[
Z = \frac{z-\mu_z}{\sigma_z}
\]

is a standard score which follows the standard normal distribution. Using the same values of r and N in Example 1, for example, we have

\[
z = \frac{1}{2}\ln\left(\frac{1.91}{0.09}\right) = 1.5275 \quad\text{and}\quad \sigma_z = \frac{1}{\sqrt{7-3}} = 0.5.
\]


Thus,

\[
Z = \frac{1.5275 - 0}{0.5} = 3.055
\]

which is significant at the α = .05 level of significance (two-tailed).
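The z-test based on Fisher's transformation can be sketched in Python as follows. The function names `fisher_z` and `z_test_for_r` and the parameter `rho0` (the hypothesized population value) are hypothetical names introduced for illustration.

```python
from math import log, sqrt


def fisher_z(r):
    """Fisher's transformation: z = (1/2) ln((1 + r) / (1 - r))."""
    return 0.5 * log((1 + r) / (1 - r))


def z_test_for_r(r, n, rho0=0.0):
    """Standard score Z = (z - mu_z) / sigma_z for testing H0: rho = rho0."""
    z = fisher_z(r)
    mu_z = fisher_z(rho0)        # mean of z under the null hypothesis
    sigma_z = 1 / sqrt(n - 3)    # standard deviation of z
    return (z - mu_z) / sigma_z


print(round(fisher_z(0.91), 4))         # 1.5275
print(round(z_test_for_r(0.91, 7), 3))  # 3.055
```

Setting `rho0` to a nonzero value gives the test of H0: ρ = ρ0, which is the case where this statistic is described above as particularly useful.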

Pearson's product-moment correlation is the most popular measure of correlation. However, as was pointed out earlier, this measure is appropriate only when both variables are measured in at least the interval scale. When the assumptions on the use of r are not met, it is not advisable to use Pearson's r. Instead, we estimate the population correlation using other measures of correlation. The succeeding discussion considers other measures of correlation for cases in which the scale of measurement is not interval or one of the assumptions (normality or linearity) is violated.

OTHER MEASURES OF CORRELATION

The Spearman's Rank Order Correlation (r_s)

Spearman's rho (r_s; see footnote a) is a measure of correlation based on the difference between the ranks of the values of two variables X and Y. It is used when both variables are measured in at least the ordinal scale. Spearman's rho is the nonparametric counterpart of Pearson's r. Unlike Pearson's r, this measure does not make an assumption about normality of the distribution of the paired data.

The formula for computing Spearman's r_s is given by

\[
r_s = 1 - \frac{6\sum d^{2}}{N(N-1)(N+1)} \qquad \text{(Equation 3)}
\]

where d is the difference between the ranks of paired values of  X and Y , and N is the total

number of cases.

When ranking the data, “1” is usually treated as the lowest rank corresponding to the lowest score value of the variable, followed by “2” for the next higher score, etc. Thus, higher ranks correspond to higher scores while lower ranks correspond to lower scores. You have to adopt this rule of ranking numbers because this is the convention used in analyzing ordinal data using nonparametric statistics. (Note: The same value of Σd² in Equation 3 is obtained if we assign rank 1 to the highest score instead of rank 1 to the lowest score. Check this!)

Another important rule that you should remember is the assignment of ranks for tied

scores. The rule is very simple: think of the scores as if they were distinct, get their ranks,

and assign the average of their ranks as the ranks of the tied scores. Let us illustrate these

rules by considering an example.

a We don't use the symbol ρ (rho) since this is our symbol for the population correlation coefficient.


Suppose we have the following scores in an achievement test in science: 47, 43, 46, 40, 43, 47, 47, 48. The scores in ascending order, with their ranks if they were distinct and their actual ranks considering the tied scores, are shown as follows:

Score                              40    43    43    46    47    47    47    48
Ranks if scores were distinct       1     2     3     4     5     6     7     8
Actual ranks (with tied scores)     1   2.5   2.5     4     6     6     6     8

Rank of the two 43's: (2 + 3)/2 = 2.5
Rank of the three 47's: (5 + 6 + 7)/3 = 6

Thus, the two 43’s are ranked 2.5 each while the three 47’s are ranked 6 each.
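The ranking rule for tied scores can be sketched in Python as follows. The function name `average_ranks` is hypothetical; it ranks from the lowest score (rank 1) upward and gives tied scores the average of the ranks they would occupy if they were distinct.

```python
def average_ranks(scores):
    """Rank scores from lowest (1) to highest; ties share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Extend the block [start, end] over all scores tied with scores[order[start]].
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end + 2) / 2  # mean of the ranks start+1, ..., end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


science = [47, 43, 46, 40, 43, 47, 47, 48]
print(average_ranks(science))
# [6.0, 2.5, 4.0, 1.0, 2.5, 6.0, 6.0, 8.0]
```

The ranks are returned in the original order of the scores, so the first 47 receives rank 6, the first 43 receives rank 2.5, and so on.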

Example 2. The following hypothetical data are the grades of 7 students in mathematics and

statistics. Estimate the strength of correlation between the variables using the

Spearman’s rank order correlation coefficient. 

X (Grade in Math)    Y (Grade in Statistics)
86                   88
78                   78
79                   78
85                   86
87                   90
90                   88
87                   78

Solution: The necessary computations are indicated in the following table based on the ranks of the values of X and the values of Y, denoted by RX and RY, respectively.

X     Y     RX     RY     d       d²
86    88    4      5.5     1.5     2.25
78    78    1      2       1       1
79    78    2      2       0       0
85    86    3      4       1       1
87    90    5.5    7       1.5     2.25
90    88    7      5.5    -1.5     2.25
87    78    5.5    2      -3.5    12.25
                          Σd² = 21

Since N = 7 and Σd² = 21, it follows that

\[
r_s = 1 - \frac{6(21)}{(7)(48)} = 0.625 \quad\text{or}\quad r_s \approx 0.63.
\]

Hence, there is a substantial positive correlation between students’ grades in

mathematics and grades in statistics.


Testing the Significance of r_s

The test statistic for testing the significance of r_s is similar to Equation 2. The test statistic is given by

\[
t = \frac{r_s\sqrt{N-2}}{\sqrt{1-r_s^{2}}}, \qquad \text{d.f.} = N - 2. \qquad \text{(Equation 4)}
\]

Using Example 2, the relevant null hypothesis and the corresponding alternative

hypothesis can be stated as follows:

H0: There is no significant correlation between grades in mathematics and grades in Statistics.

Ha: There is a significant correlation between grades in mathematics and grades in Statistics.

Since r_s = 0.625 and N = 7, the computed t-value is given by

\[
t = \frac{(.625)\sqrt{7-2}}{\sqrt{1-(.625)^{2}}} = 1.7903
\]

We use a two-tailed test because the alternative hypothesis is non-directional. At α = 0.05 (two-tailed) and d.f. = 5, the corresponding critical value of t is 2.571. Since the absolute value of the computed t-value did not exceed the critical value, the null hypothesis cannot be rejected. We say that the data did not provide sufficient evidence to reject the null hypothesis.

--------------------------------------------------------------------------------------------------------------------------

Note: When a statistical test IS NOT SIGNIFICANT, we accept the null hypothesis. Accepting the null hypothesis, however, does not mean that it (the null hypothesis) is true because we only considered one sample out of the many possible samples from the population.

--------------------------------------------------------------------------------------------------------------------------

We present below the SPSS output for the same data analyzed using Spearman's rank-order correlation coefficient. Note that the two-tailed p-value = 0.151 > α = 0.05. Hence, the variables are NOT significantly correlated. Note the discrepancy between the value we obtained using the formula (0.625) and the value in the SPSS output, which is 0.604. Some sort of adjustment is made in the SPSS formula because of tied observations. (Research on this.)

Correlations

                                                                  Grade in       Grade in
                                                                  Mathematics    Statistics
Spearman's rho   Grade in Mathematics   Correlation Coefficient   1.000          .604
                                        Sig. (2-tailed)           .              .151
                                        N                         7              7
                 Grade in Statistics    Correlation Coefficient   .604           1.000
                                        Sig. (2-tailed)           .151           .
                                        N                         7              7
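The discrepancy between 0.625 and 0.604 can be explored with the Python sketch below. It computes r_s in two ways for the Example 2 data: with Equation 3, and by applying Pearson's formula to the average ranks. That the second route is what the software does is an assumption on our part; it is offered only because it reproduces the 0.604 in the output. The function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank from lowest (1) to highest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end + 2) / 2  # mean of the ranks start+1, ..., end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_equation3(x, y):
    """r_s from the sum of squared rank differences (Equation 3)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n - 1) * (n + 1))


def pearson_on_ranks(x, y):
    """Pearson's r applied to the average ranks (assumed software approach)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_rx, mean_ry = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry))
    sxx = sum((a - mean_rx) ** 2 for a in rx)
    syy = sum((b - mean_ry) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


math_grades = [86, 78, 79, 85, 87, 90, 87]
stat_grades = [88, 78, 78, 86, 90, 88, 78]
print(round(spearman_equation3(math_grades, stat_grades), 3))  # 0.625
print(round(pearson_on_ranks(math_grades, stat_grades), 3))    # 0.604
```

When there are no tied scores, the two routes give the same value; they differ here because both variables contain ties.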

 


The Point-Biserial Correlation Coefficient (r_pb)

In the previous discussion, we considered the correlation between the variables X and Y when both are measured in at least the interval scale or the ordinal scale. If, for instance, the variable X is a dichotomy (categorical with 2 categories) and Y is measured in the interval scale, neither Pearson's r nor Spearman's rank order correlation coefficient is appropriate as a measure of correlation.

The point biserial correlation coefficient, which is denoted by r_pb, is a measure of correlation which is appropriate when one variable is a dichotomy and the other is measured in at least the interval scale. For instance, if one wants to know the strength of correlation between gender and mathematics performance, then the point biserial correlation coefficient will be an appropriate measure. The formula for the point biserial coefficient is given by

\[
r_{pb} = \frac{M_p - M_q}{SD_X}\sqrt{pq} \qquad \text{(Equation 5)}
\]

where M_p = the mean score of those in one category of the dichotomised variable
M_q = the mean score of those in the other category
p = the proportion of cases in the first category
q = the proportion of cases in the other category
SD_X = the standard deviation of the interval variable.

NOTE: This formula is discussed on page 95 and the example is given on page 96 of Module 12.

There is another formula for the point-biserial correlation coefficient which is slightly different from Equation 5. The formula makes use of the number of cases in each category of the dichotomized variable and is given by

\[
r_{pb} = \frac{M_1 - M_0}{SD_X}\sqrt{\frac{n_1 n_0}{n(n-1)}} \qquad \text{(Equation 6)}
\]

where M_1 = the mean score of the scores in category 1 of the dichotomised variable
M_0 = the mean score of the scores in category 0 of the dichotomised variable
n_1 = the number of cases in category 1
n_0 = the number of cases in category 0
n = the total number of cases (n_1 + n_0)
SD_X = the standard deviation of the interval variable.

The test of significance of r_pb is given by the test statistic

\[
t = \frac{r_{pb}\sqrt{n-2}}{\sqrt{1-r_{pb}^{2}}}, \qquad \text{d.f.} = n - 2. \qquad \text{(Equation 7)}
\]


Note the similarity between this test statistic and the test statistics for testing Pearson's r and Spearman's r_s (Equations 2 and 4).

Example 3. Are graduates from private high schools better than graduates from public

schools? Suppose we have the entrance test scores of 6 students who

graduated from private schools (coded 1) and 8 students who graduated from

public high schools (coded 0) as follows:

Score    89  78  94  86  85  79  81  82  96  90  88  75  87  84
School    1   0   1   0   1   0   0   0   1   0   1   0   0   1

a)  Compute the correlation coefficient between type of high school graduated

from and entrance test score using Equation 6.

b)  State the null hypothesis and the corresponding alternative hypothesis.

c)  Test the null hypothesis at α = .05 

Solution: a) We first categorize the scores into two groups coded 1 and 0 as shown below.

Group Coded 1 (Private)    Group Coded 0 (Public)
89                         78
94                         86
85                         79
96                         81
88                         82
84                         90
                           75
                           87
n1 = 6, M1 = 89.33         n0 = 8, M0 = 82.25

n = 14
SD_X = 5.9927 (s.d. of all scores combined)

Using the summary values in the table, we have

\[
r_{pb} = \frac{89.33 - 82.25}{5.9927}\sqrt{\frac{(6)(8)}{14(14-1)}} = 0.607 \quad\text{or}\quad r_{pb} \approx 0.61.
\]
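The computation in part (a) can be checked with the Python sketch below, which applies Equation 6 to the Example 3 scores using the sample standard deviation (divisor n - 1), the one that reproduces SD_X = 5.9927. For comparison, it also evaluates Equation 5 under the assumption, not stated in the text, that SD_X there is the standard deviation with divisor n; under that assumption the two equations agree. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

private = [89, 94, 85, 96, 88, 84]          # group coded 1
public = [78, 86, 79, 81, 82, 90, 75, 87]   # group coded 0


def r_pb_equation6(group1, group0):
    """Point-biserial r from Equation 6, with SD_X computed using divisor n - 1."""
    n1, n0 = len(group1), len(group0)
    n = n1 + n0
    sd_x = stdev(group1 + group0)
    return (mean(group1) - mean(group0)) / sd_x * sqrt(n1 * n0 / (n * (n - 1)))


def r_pb_equation5(group1, group0):
    """Point-biserial r from Equation 5, ASSUMING SD_X uses divisor n."""
    n1, n0 = len(group1), len(group0)
    n = n1 + n0
    p, q = n1 / n, n0 / n
    sd_x = pstdev(group1 + group0)
    return (mean(group1) - mean(group0)) / sd_x * sqrt(p * q)


print(round(stdev(private + public), 4))          # 5.9927
print(round(r_pb_equation6(private, public), 3))  # 0.607
print(round(r_pb_equation5(private, public), 3))  # 0.607
```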

b) The null hypothesis and the corresponding alternative hypothesis based on the given problem are as follows:

H0: There is no significant relationship between type of high school

graduated from and score in the entrance test.

Ha: Graduates from private schools are better than graduates from public

schools in terms of scores in the college entrance test.


c) To test the significance of the obtained r_pb, we compute the test statistic using the values r_pb = 0.607 and n = 14. Thus,

\[
t = \frac{(.607)\sqrt{14-2}}{\sqrt{1-(.607)^{2}}} = 2.646
\]

Because the alternative hypothesis is directional, we use a one-tailed test. The critical value of t at α = 0.05 and d.f. = 12 (one-tailed) is 1.782. Since the absolute value of the computed t-value exceeds the critical value, the null hypothesis is rejected. Based on the hypothetical data, we conclude that there is a significant correlation between type of school graduated from and performance in the college entrance test. The specific relationship between the given variables can be specified by using the mean scores of the two groups. Thus, we say that graduates from private schools are generally better than graduates from public high schools.

Remarks:

1. The same conclusion is arrived at when a t-test for independent samples is conducted. Using the pooled variance estimate, the computed t-value is 2.646 (which is equal to the computed value in the test of significance of r_pb) with a one-tailed p-value of 0.0105 < 0.05 (see footnote b), as shown in the computer output. Hence, the mean scores of 89.33 and 82.25 are significantly different in favor of students who graduated from private schools.

Group Statistics (Entrance test score)

Type of high school    N    Mean       Std. Deviation    Std. Error Mean
Private                6    89.3333    4.80278           1.96073
Public                 8    82.2500    5.06388           1.79035

Independent Samples Test (Entrance test score)

                               Levene's Test for
                               Equality of Variances    t-test for Equality of Means
                               F       Sig.             t        df        Sig. (2-tailed)
Equal variances assumed        .043    .839             2.646    12        .021
Equal variances not assumed                             2.668    11.235    .022

b The p-value associated with the computed t of 2.646 is 0.021 (2-tailed). Since the test is one-tailed and the difference is in the hypothesized direction, this p-value must be divided by 2 (0.021/2 = 0.0105).


Using the computed value of t, the value of $r_{pb}$ can actually be computed by solving for $r_{pb}$ in the formula

$$t = \frac{r_{pb}\sqrt{n-2}}{\sqrt{1-r_{pb}^{2}}}.$$

This value is given by

$$r_{pb} = \sqrt{\frac{t^{2}}{t^{2} + d.f.}},$$

where t is the computed t-value and d.f. = n1 + n2 – 2. From the computer output, we have t = 2.646 and the corresponding degrees of freedom is 12. Hence we have,

$$r_{pb} = \sqrt{\frac{t^{2}}{t^{2} + d.f.}} = \sqrt{\frac{(2.646)^{2}}{(2.646)^{2} + 12}} = 0.607$$

which is identical to the value obtained using the formula for $r_{pb}$. This is the reason why Equation 6 is preferred over Equation 5.
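The step from the test statistic to the expression for $r_{pb}$ is not written out above. A short derivation, using only the formula for t and the fact that d.f. = n – 2 with n = n1 + n2, might run as follows:

```latex
\begin{aligned}
t^{2} &= \frac{r_{pb}^{2}\,(n-2)}{1-r_{pb}^{2}} \\
t^{2} - t^{2}\,r_{pb}^{2} &= r_{pb}^{2}\,(n-2) \\
t^{2} &= r_{pb}^{2}\,\bigl[t^{2} + (n-2)\bigr] \\
r_{pb}^{2} &= \frac{t^{2}}{t^{2} + (n-2)} = \frac{t^{2}}{t^{2} + d.f.}
\end{aligned}
```

Taking the square root gives the magnitude of $r_{pb}$; its sign is the same as the sign of t.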

2. Another reason which justifies the use of Equation 6 instead of Equation 5 is that the value of $r_{pb}$ can be obtained using the same procedure as for Pearson's r. However, the categories of the dichotomous variable are first coded as 1 and 0 (other codes are NOT acceptable). We construct a worksheet similar to the one used for computing Pearson's r. The variables to be correlated are X [school type (coded 1 and 0)] and Y (entrance test score). The worksheet is shown as follows:

X (School Type)    Y (Entrance Test Score)    X^2    Y^2       XY
1                  89                         1      7921      89
0                  78                         0      6084      0
1                  94                         1      8836      94
0                  86                         0      7396      0
1                  85                         1      7225      85
0                  79                         0      6241      0
0                  81                         0      6561      0
0                  82                         0      6724      0
1                  96                         1      9216      96
0                  90                         0      8100      0
1                  88                         1      7744      88
0                  75                         0      5625      0
0                  87                         0      7569      0
1                  84                         1      7056      84
ΣX = 6             ΣY = 1194                  ΣX^2 = 6    ΣY^2 = 102298    ΣXY = 536

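Therefore,

$$r = \frac{N\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[N\sum X^{2} - \left(\sum X\right)^{2}\right]\left[N\sum Y^{2} - \left(\sum Y\right)^{2}\right]}} = \frac{14(536) - (6)(1194)}{\sqrt{\left[14(6) - (6)^{2}\right]\left[14(102298) - (1194)^{2}\right]}} = 0.607$$

which is equal to $r_{pb}$ as obtained before.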
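The worksheet computation can be repeated programmatically. The following sketch applies the raw-score Pearson formula to the 1/0 coded school type and the entrance test scores; the helper name `pearson_r` is illustrative only.

```python
import math

# X: school type coded 1 (private) and 0 (public); Y: entrance test score.
school_type = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
score = [89, 78, 94, 86, 85, 79, 81, 82, 96, 90, 88, 75, 87, 84]

def pearson_r(x, y):
    """Pearson's r from the raw-score (worksheet) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(school_type, score)
print(f"r = {r:.3f}")
```

The intermediate sums inside `pearson_r` correspond to the totals row of the worksheet, so each can be compared with the values used in the hand computation.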


Chi-Square Based Measures of Correlation^c

In our discussion of the Chi-square test, we mentioned that the test can be used to establish independence of variables. When the null hypothesis is rejected using the Chi-square test, we conclude that the variables are NOT independent, which means that the variables are correlated. The Chi-square value, however, does not give information about the strength of correlation between the variables.

We can estimate the strength of correlation by using the computed value of the Chi-square statistic (hence the term Chi-square based). These measures are described as crude measures because they are not as accurate or reliable as the Pearson-based measures ($r$, $r_s$, and $r_{pb}$).

Also, the technique employed here differs from the Pearson-based measures in that one first establishes, using the Chi-square statistic, whether or not the variables are significantly correlated. The strength of correlation is computed only when the Chi-square test is significant. We outline below the computation of some Chi-square based measures of correlation.

A. Contingency Coefficient: $C = \sqrt{\dfrac{\chi^{2}}{\chi^{2} + N}}$, where $\chi^{2}$ is the computed Chi-square value and N is the grand total in the contingency table.

This measure is used when the contingency table is a square table with at least three categories for each variable.

Example 4. The Chi-square value based on the contingency table below is 27.160 which is significant at α = 0.05. Estimate the strength of correlation between interest in sports and social class using the contingency coefficient.

                       Social Class
Interest in Sports     Working    Middle    Upper    TOTAL
High                   12         45        7        64
Moderate               24         40        21       85
Low                    21         14        23       58
Total                  57         99        51       207

Solution: The contingency table is a 3×3 square table. Hence the contingency coefficient is an appropriate measure of correlation. Using the values $\chi^{2} = 27.160$ and N = 207, we have

c Siegel, S. & Castellan, J. (1988). Nonparametric Statistics (2nd ed.). New York: McGraw-Hill Book Company.


$$C = \sqrt{\frac{\chi^{2}}{\chi^{2} + N}} = \sqrt{\frac{27.160}{27.160 + 207}} = 0.34.$$

Hence interest in sports and social class are significantly correlated (since the computed Chi-square value is significant) and the strength of correlation is 0.34 (weak).
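Example 4 can be verified from the raw table. The sketch below recomputes the Chi-square statistic from the observed frequencies (expected frequency = row total × column total / grand total) and then applies the contingency coefficient; the function names are illustrative, not taken from any library.

```python
import math

# Observed frequencies: rows = interest in sports, columns = social class.
observed = [
    [12, 45, 7],    # High
    [24, 40, 21],   # Moderate
    [21, 14, 23],   # Low
]

def chi_square(table):
    """Chi-square statistic and grand total for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat, grand_total

def contingency_coefficient(chi2, n):
    """C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi2 / (chi2 + n))

chi2, n = chi_square(observed)
c = contingency_coefficient(chi2, n)
print(f"chi-square = {chi2:.3f}, N = {n}, C = {c:.2f}")
```

Whether the statistic is significant still has to be judged against the critical value for the table's degrees of freedom; the sketch only reproduces the arithmetic.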

B. Cramer's Coefficient: $C_r = \sqrt{\dfrac{\chi^{2}}{N(L-1)}}$, where

$\chi^{2}$ is the computed Chi-square value;

N is the grand total in the contingency table; and

L is either the number of rows or the number of columns, whichever is smaller.

Cramer's coefficient is applied when the contingency table is non-square.

Example 5. The table below is taken from page 5 of Reading Material #11. The computed Chi-square value based on this contingency table is 20.67 which is significant at α = 0.05. Estimate the strength of correlation between method of teaching and academic performance using Cramer's coefficient.

                         Method of Teaching
Performance Category     Lecture    Modular    CAI    TOTAL
Above Satisfactory       9          20         18     47
Satisfactory             12         18         21     51
Fair                     15         10         8      33
Below Satisfactory       24         12         6      42
Total                    60         60         53     173

Solution: As gleaned from the problem, $\chi^{2} = 20.67$ and N = 173. Also, the contingency table is non-square and the smaller number of categories is 3, hence L = 3. Therefore, we have

$$C_r = \sqrt{\frac{\chi^{2}}{N(L-1)}} = \sqrt{\frac{20.67}{173(3-1)}} = 0.24$$

Based on the computed Cramer's coefficient, there is a significant correlation between method of teaching and academic performance but the strength of correlation is weak (0.24).
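The computation in Example 5 can be sketched in Python as follows. The Chi-square value 20.67 is taken as given from the text; the function name and its parameters are assumptions made for illustration.

```python
import math

def cramers_coefficient(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """C_r = sqrt(chi2 / (N (L - 1))), with L the smaller of rows and columns."""
    smaller = min(n_rows, n_cols)   # L in the formula
    return math.sqrt(chi2 / (n * (smaller - 1)))

# Example 5: 4 performance categories by 3 teaching methods, N = 173.
c_r = cramers_coefficient(chi2=20.67, n=173, n_rows=4, n_cols=3)
print(f"Cramer's coefficient = {c_r:.2f}")
```

Passing the table dimensions instead of L directly keeps the "whichever is smaller" rule inside the function, so the caller cannot pick the wrong one.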


C. Phi Coefficient: $\phi = \sqrt{\dfrac{\chi^{2}}{N}}$, where $\chi^{2}$ is the computed Chi-square value and N is the grand total in the contingency table.

The Phi coefficient is used only for 2×2 contingency tables.

Example 6. A survey of 300 undergraduate and 100 graduate students from a large university was conducted to determine their opinions on autonomous status of colleges. The following contingency table was generated from the survey.

              Level of Education
Opinion       Undergrad    Graduate    Total
Favor         100          70          170
Not Favor     200          30          230
Total         300          100         400

Find out if there is a significant correlation between opinion and level of education at α = 0.05. If significant, estimate the strength of correlation.

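Solution: We are given a 2×2 table, thus we can compute the $\chi^{2}$ value using the formula

$$\chi^{2} = \frac{N(ad - bc)^{2}}{(a+b)(c+d)(a+c)(b+d)}$$

with d.f. = 1, where a and b are the cell frequencies in the first row of the table and c and d are those in the second row. Note that the expected frequencies are all greater than 5 (check this). Thus,

$$\chi^{2} = \frac{N(ad - bc)^{2}}{(a+b)(c+d)(a+c)(b+d)} = \frac{400\,[(100)(30) - (70)(200)]^{2}}{(170)(230)(300)(100)} = 41.262.$$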

The critical value of $\chi^{2}$ at α = 0.05 and d.f. = 1 is 3.84. Since the computed $\chi^{2}$ is greater than the critical value, the null hypothesis is rejected, which means that the opinion of the students regarding the autonomous status of colleges is dependent on their level of education.

Using the Phi coefficient, the estimated strength of correlation is

$$\phi = \sqrt{\frac{\chi^{2}}{N}} = \sqrt{\frac{41.262}{400}} = 0.32 \text{ (weak)}.$$
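The whole of Example 6, including the check on expected frequencies that the solution leaves to the reader, can be sketched as below. The cell labels a, b, c, d are assumed to be read row by row from the table, and the critical value 3.84 is taken from the text.

```python
import math

# Cell frequencies read row by row.
a, b = 100, 70    # Favor: undergrad, graduate
c, d = 200, 30    # Not favor: undergrad, graduate
n = a + b + c + d

# Expected frequency of each cell = (row total)(column total) / N.
row_totals = (a + b, c + d)
col_totals = (a + c, b + d)
expected = [[r * k / n for k in col_totals] for r in row_totals]
all_expected_above_5 = all(e > 5 for row in expected for e in row)

# Shortcut Chi-square formula for a 2x2 table (d.f. = 1).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
significant = chi2 > 3.84   # critical value at alpha = 0.05, d.f. = 1 (from the text)

# The strength of correlation is estimated only when the test is significant.
phi = math.sqrt(chi2 / n) if significant else None

print(expected, all_expected_above_5)
print(f"chi-square = {chi2:.3f}, phi = {phi:.2f}")
```

Computing phi only after the significance check mirrors the procedure described earlier for all Chi-square based measures.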

---------------------------------------------------------------------------------------------------------------------------