8/3/2019 Module 9 (Correlation Analysis)
http://slidepdf.com/reader/full/module-9-correlation-analysis 1/17
Reading Material #9 (Correlation Analysis)
---------------------------------------------------------------------------------------------------------------------------
Gabino P. Petilos, Ph.D.
FIC, EDRE 231, 2nd Sem 11-12
CORRELATION ANALYSIS
INTRODUCTION
It has been said that research is conducted in order to find relationships between or
among variables. When factors or variables are related in some systematic pattern, so that a
change in the value of one is associated with a concurrent change in the value of the other,
we say that they are correlated. Thus, we know that ability level is correlated with academic
performance based on our common observation that students belonging to a high ability level
tend to show better academic performance while those belonging to a low ability level tend
to show poorer academic performance.
In statistics, we not only establish the existence of certain correlations but also
measure the direction and the degree of correlation. Ideally, we want to know the
correlation between two variables X and Y in a given population (Figure 1). The correlation
between these variables is denoted by the symbol ρ, called the population correlation
coefficient. Since it is not always feasible to study the entire population, we attempt to
describe the correlation between X and Y by drawing a random sample from the population.
We denote the estimate of the parameter ρ by the sample correlation coefficient r.
If a sample is used to estimate the amount of correlation between two variables,
significance testing is called for to find out if the variables in the actual population are indeed
significantly related. For this reason, correlation analysis employs both descriptive statistics
as well as inferential statistics.
[Figure 1. A population in which the correlation between X and Y is ρ = ?, and a sample drawn from it in which the correlation is r = ?]

Correlation analysis is concerned with the linear relationship between two variables.
It aims to determine the direction (whether positive or negative) as well as the strength
(whether weak, moderate, or strong) of linear association between two variables. When
two variables vary in the same direction, we say that the variables are positively correlated.
For example, it has been shown that IQ and academic performance are positively correlated.
This means that a person who has a high IQ would tend to have good academic performance
in school, and in turn a person's good academic performance is usually associated with a
high IQ. Other examples of variables which are positively correlated are:
Grade in Mathematics and Grade in Physics;
Work performance and Level of morale;
Number of hours spent in studying and Grades in mathematics.
On the other hand, when two variables vary in the opposite direction, the variables
are said to be negatively correlated. Examples of variables which exhibit negative
correlation are:
Academic achievement and Hours per week of watching TV
Time spent in typing practice and Number of typing errors
Absenteeism and Job satisfaction
Variables that are not linearly correlated have zero correlation. For instance, height
of students and their ability level have a zero correlation. In this example, it does not make
sense to associate a particular value of height with a particular ability level. As another
example, there is zero correlation between size of shoes and level of income of bank
managers!
The direction and strength of linear correlation between variables may be described
using a statistical device called a “scatter plot” or “scatter diagram”. Examples of scatter plots
are given in Figure 1. Here, the scatter plots from (a) to (c) illustrate a positive correlation
between the two variables in varying strengths while (d) to (f) illustrate a negative
correlation, also in varying strengths. The scatter plots in (a) and (d) illustrate a perfect
correlation between the two variables while those in (g) and (h) illustrate a zero correlation.
[Figure 1. Examples of Scatter Plots. Panels: (a) perfect positive, (b) strong positive, (c) weak positive, (d) perfect negative, (e) strong negative, (f) weak negative, (g) zero correlation, (h) zero correlation]
A strong correlation between two variables can occur even when not all points fall on the
line of relationship, provided they are close to it. If the points lie far from the
line, the correlation is said to be weak (or low) [see graphs (c) and (f)].
When the points do not tend to follow the path of a straight line, the correlation is
said to be zero. This is illustrated by the scatter plots in (g) and (h). Note that zero
correlation between two variables does not necessarily mean that the variables are not
related. In (g), for instance, there is zero linear correlation between the variables, yet they are
related in a quadratic sense.
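The point about panel (g) can be checked numerically. The sketch below is only an illustration: the data are hypothetical (not taken from the figure), and `pearson_r` is a helper name introduced here that applies the usual raw-score formula for the correlation coefficient. The variable y is completely determined by x, yet the linear correlation comes out as zero.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient from raw sums (the standard computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: y = x squared, so y depends on x exactly, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_quadratic = pearson_r(xs, ys)
print(r_quadratic)  # 0.0
```

The zero arises because the x values are symmetric about 0, so the positive and negative products cancel in the numerator.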
THE CORRELATION COEFFICIENT
As mentioned earlier, the scatter diagram is a visual device which is useful in
characterizing the direction and strength of linear correlation between two variables. The
direction of relationship is perhaps easy to discern in a scatter diagram. However,
interpretation of the strength of linear correlation using a scatter diagram is not easy since it
is open to various interpretations when viewed by different persons.
The correlation coefficient is another tool by which the direction and strength of
linear correlation between two variables may be described. As a measure of correlation, the
correlation coefficient ranges in value from -1.0 to +1.0. Thus if ρ (rho) represents the
population correlation coefficient, then
\[ -1.0 \le \rho \le +1.0 \]
If ρ = 1.0, the variables are said to be perfectly correlated in a positive sense. If the
value is -1.0, the variables are perfectly correlated in a negative sense. A value of ρ = 0
indicates a zero linear correlation between the two variables. Figure 2 illustrates the
descriptive interpretation of the correlation coefficient.
Figure 2. Interpretation of ρ

Value of ρ        Interpretation (direction and strength of correlation)
+1.0              perfect positive correlation
+0.5 to +1.0      strong or high (positive) correlation
+0.5              moderate positive correlation
0 to +0.5         weak or low (positive) correlation
0                 zero correlation
-0.5 to 0         weak or low (negative) correlation
-0.5              moderate negative correlation
-1.0 to -0.5      strong or high (negative) correlation
-1.0              perfect negative correlation
Since ρ is usually unknown, it has to be estimated from sample data. The estimate
of ρ is called a sample correlation coefficient. We present in the succeeding discussion
techniques for estimating the population correlation coefficient.
PEARSON'S PRODUCT-MOMENT CORRELATION COEFFICIENT
Although there are several measures of correlation, the most common and
useful one is the Pearson’s product-moment correlation, denoted by r. This measure of
correlation is used when both variables are measured in at least the interval scale. The
computational formula for the Pearson's r is given by
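\[ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} \qquad \text{(Equation 1)} \]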
The Pearson’s r is a parametric measure of correlation. The following assumptions
must be satisfied when using the Pearson's r:
1. Both variables X and Y must be measured in at least the interval scale;
2. Observations are sampled from a bivariate normal distribution; and
3. The variables are linearly related.
Example 1. The table below shows experimental data for the observed pairs ( x , y ). Find the
value of r.
x 2 3 7 4 6 8 5
y 3 5 8 5 7 10 5
Solution: For the purposes of this example, let us assume that the first two assumptions above have
been satisfied. To determine whether the third assumption is also satisfied, we
construct a scatter plot for the given data. This scatter plot is shown below.
[Scatter plot of the data in Example 1, with x on the horizontal axis and y on the vertical axis, both axes running from -1 to 10]
Clearly, the scatter plot suggests a linear relationship between the two variables. To
determine the extent of correlation between the two variables, we compute the value of r
using Equation 1. The following worksheet illustrates how this value is computed.
correlation coefficient of 0.72 between political affiliation and religion in social science
research may be interpreted as high. The same value, however, may be interpreted as low
when used as a measure of reliability or validity of standardized tests. Also, it is often
tempting to say that an r value of 0.80 is twice as strong as an r value of 0.40. Such an
interpretation is incorrect since the correlation scale is not a ratio or interval scale but rather an
ordinal one.
Another consideration in interpreting a correlation coefficient is when the value is 0.
In general, a value of r = 0 does not mean that the variables are not related. As shown in
Figure 1 (g), a value of 0 merely implies that there is no linear association between the two
variables. Moreover, a value of r different from 0 cannot be construed to mean that one
variable causes the other: if two variables are correlated, it does not follow
that one of them causes the other.
One meaningful interpretation of r involves the concept of the coefficient of
determination, which is denoted by r². This value gives us a measure of the proportion of
variation in one variable which can be attributed to the variation of the other variable and
vice versa. Thus, if r = 0.91, r² = 0.8281 or 82.81%, which means that 82.81% of the variation
in one variable is accounted for by the variation of the other variable and vice versa. The
coefficient of determination is a very important and useful concept in regression analysis.
Testing the Significance of r
If the value of the correlation coefficient is obtained from sample data, the
researcher would often want to know whether the variables are in fact related in the actual
population from which the sample was drawn. The hypothesis of interest is whether
the population correlation coefficient ρ is zero or not. Thus the following null hypothesis
must be tested using the obtained sample correlation coefficient.
Null Hypothesis        : Ho: ρ = 0 (There is no correlation between X and Y)
Alternative Hypothesis : Ha: ρ ≠ 0 (For a Non-directional Test)
                       : Ha: ρ < 0 or ρ > 0 (For a Directional Test)
To test whether the obtained Pearson’s r is significantly different from zero, a t-test
could be used if N < 30 or a z-test if N ≥ 30. The t statistic is given below:
\[ t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}, \qquad \text{d.f.} = N - 2 \qquad \text{(Equation 2)} \]
Thus, for instance, if r = .91 and N = 7, we have
\[ t = \frac{(.91)\sqrt{7-2}}{\sqrt{1-(.91)^2}} = 4.9078 \]
At α = 0.05 (two-tailed) and d.f. = 5, the corresponding critical value of t is 2.571. Since the
absolute value of the computed t-value exceeds the critical value, we reject the null
hypothesis. We conclude that the relationship between the two variables cannot be
attributed to chance.
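The computation and decision above can be reproduced in a few lines. This is a minimal sketch: the function name `t_for_correlation` is introduced here for illustration, and the critical value 2.571 is simply the table value quoted in the text rather than something the code derives.

```python
import math

def t_for_correlation(r, n):
    """t statistic of Equation 2 for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.91, 7
t_value = t_for_correlation(r, n)
critical_t = 2.571  # two-tailed, alpha = 0.05, d.f. = 5 (value quoted in the text)
reject_null = abs(t_value) > critical_t

print(round(t_value, 4))  # 4.9078
print(reject_null)        # True, so the null hypothesis is rejected
```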
When analyzed using SPSS, we obtain the following output. This table provides
us both descriptive and inferential information about the correlation between the variables
X and Y. In this table, the value of the Pearson’s r is 0.913; hence the correlation is positive
and the strength of linear correlation is high. Also, the associated p-value is .004 (two-tailed),
which is less than α = 0.05; hence the null hypothesis is rejected.
Correlations

                              X         Y
X    Pearson Correlation      1         .913**
     Sig. (2-tailed)                    .004
     N                        7         7
Y    Pearson Correlation      .913**    1
     Sig. (2-tailed)          .004
     N                        7         7

**. Correlation is significant at the 0.01 level.
Another test statistic for testing the null hypothesis about the value of the population
correlation coefficient is the z-test. This test statistic is particularly useful when the
hypothesized value of the population correlation coefficient is different from zero, as for
example Ho: ρ = ρ0 (ρ0 ≠ 0). In using this test, we first apply Fisher's transformation to
the obtained value of r to get the corresponding z-value. For a given r, the transformed
value z_r is given by
\[ z_r = \frac{1}{2}\ln\frac{1+r}{1-r}. \]
For this variable, the mean and standard deviation under the null hypothesis are given by
\[ \mu_z = \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0} \quad\text{and}\quad \sigma_z = \frac{1}{\sqrt{n-3}}. \]
Therefore, the quantity
\[ Z = \frac{z_r - \mu_z}{\sigma_z} \]
is a standard score which approximately follows the standard
normal distribution. Using the same values of r and N in Example 1, for example, we have
\[ z_r = \frac{1}{2}\ln\frac{1.91}{0.09} = 1.5275 \quad\text{and}\quad \sigma_z = \frac{1}{\sqrt{7-3}} = 0.5. \]
Thus,
\[ Z = \frac{1.5275 - 0}{0.5} = 3.055, \]
which is significant at the α = .05 level of significance (two-tailed).
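As a check on the arithmetic, the Fisher transformation and the resulting Z can be computed directly. This is a sketch under the assumptions stated in the text (r = .91, n = 7, hypothesized value 0). The names `fisher_z`, `z_test_for_correlation` and `rho0` are introduced here for illustration.

```python
import math

def fisher_z(r):
    """Fisher's transformation: one half of ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_test_for_correlation(r, n, rho0=0.0):
    """Standard score for testing H0: rho = rho0 using the transformed r."""
    z_r = fisher_z(r)
    mu_z = fisher_z(rho0)            # equals 0 when rho0 = 0
    sigma_z = 1 / math.sqrt(n - 3)
    return (z_r - mu_z) / sigma_z

z_r = fisher_z(0.91)
Z = z_test_for_correlation(0.91, 7)

print(round(z_r, 4))  # 1.5275
print(round(Z, 3))    # 3.055
```

Passing a nonzero `rho0` covers the case the text singles out, where the hypothesized population value differs from zero.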
The Pearson's product moment correlation is the most popular measure of
correlation. However, as was pointed out earlier, this measure is appropriate only when
both variables are measured in at least the interval scale. When the assumptions on the use
of r are not met, it is not advisable to use the Pearson’s r. Instead, we estimate the
population correlation using other measures of correlation. The succeeding discussion
considers other measures of correlation for use when the scale of measurement is not interval or
when one of the assumptions (normality or linearity) is violated.
OTHER MEASURES OF CORRELATION
The Spearman's Rank Order Correlation (r_s)
The Spearman's rho (r_s) [a] is a measure of correlation based on the difference between
ranks of the values of two variables X and Y. It is used when both variables are measured in
at least the ordinal scale. The Spearman’s rho is the nonparametric counterpart of the
Pearson’s r. Unlike the Pearson's r, this measure does not make any assumption about the normality
of the distribution of the paired data.
The formula for computing the Spearman's r_s is given by
\[ r_s = 1 - \frac{6\sum d^2}{N(N-1)(N+1)} \qquad \text{(Equation 3)} \]
where d is the difference between the ranks of paired values of X and Y, and N is the total
number of cases.
When ranking the data, “1” is usually treated as the lowest rank corresponding to the
lowest score value of the variable, followed by “2” for the next higher score, etc. Thus,
higher ranks correspond to higher scores while lower ranks correspond to lower scores. You
have to adopt this rule of ranking numbers because this is the convention used in analyzing
ordinal data using nonparametric statistics. (Note: The same value of Σd² in Equation 3 is
obtained if, for both variables, we assign rank 1 to the highest score instead of rank 1 to the
lowest score. Check this!)
Another important rule that you should remember is the assignment of ranks for tied
scores. The rule is very simple: think of the scores as if they were distinct, get their ranks,
and assign the average of their ranks as the ranks of the tied scores. Let us illustrate these
rules by considering an example.
[a] We don’t use the symbol ρ for rho since this is our symbol for the population correlation coefficient.
Suppose we have the following scores in an achievement test in science: 47, 43, 46,
40, 43, 47, 47, 48. The scores in ascending order, with their ranks if they were distinct and
their actual ranks considering the tied scores, are shown as follows:

Score                             40   43    43    46   47   47   47   48
Ranks if scores were distinct     1    2     3     4    5    6    7    8
Actual ranks (with tied scores)   1    2.5   2.5   4    6    6    6    8

\[ \frac{2+3}{2} = 2.5 \qquad\qquad \frac{5+6+7}{3} = 6 \]
Thus, the two 43’s are ranked 2.5 each while the three 47’s are ranked 6 each.
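The tie rule can be expressed compactly in code. The following is a small sketch, with `average_ranks` a helper name introduced here: each score receives the average of the positions it would occupy if the tied scores were distinct, with rank 1 going to the lowest score.

```python
def average_ranks(values):
    """Rank from lowest (1) upward; tied values share the average of their positions."""
    ordered = sorted(values)
    # ordered.index(v) is the 0-based first position of v, ordered.count(v) the number of ties,
    # so first position (1-based) plus last position, divided by 2, is index + (count + 1) / 2.
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

scores = [47, 43, 46, 40, 43, 47, 47, 48]
ranks = average_ranks(scores)
print(ranks)  # [6.0, 2.5, 4.0, 1.0, 2.5, 6.0, 6.0, 8.0]
```

The ranks are returned in the original order of the scores, so the first 47 maps to 6.0, the first 43 to 2.5, and so on.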
Example 2. The following hypothetical data are the grades of 7 students in mathematics and
statistics. Estimate the strength of correlation between the variables using the
Spearman’s rank order correlation coefficient.
X (Grade in Math)    Y (Grade in Statistics)
86                   88
78                   78
79                   78
85                   86
87                   90
90                   88
87                   78
Solution: The necessary computations are indicated in the following table, based on the
ranks of the values of X and the values of Y, denoted by R_X and R_Y, respectively, with d = R_Y - R_X.

X     Y     R_X    R_Y    d       d²
86    88    4      5.5    1.5     2.25
78    78    1      2      1       1
79    78    2      2      0       0
85    86    3      4      1       1
87    90    5.5    7      1.5     2.25
90    88    7      5.5    -1.5    2.25
87    78    5.5    2      -3.5    12.25
                          Σd² =   21

Since N = 7 and Σd² = 21, it follows that
\[ r_s = 1 - \frac{6(21)}{(7)(48)} = 0.625 \quad\text{or}\quad r_s \approx 0.63. \]
Hence, there is a substantial positive correlation between students’ grades in
mathematics and grades in statistics.
Testing the Significance of r_s
The test statistic for testing the significance of r_s is similar to Equation 2. The
test statistic is given by
\[ t = \frac{r_s\sqrt{N-2}}{\sqrt{1-r_s^2}}, \qquad \text{d.f.} = N - 2. \qquad \text{(Equation 4)} \]
Using Example 2, the relevant null hypothesis and the corresponding alternative
hypothesis can be stated as follows:
H0: There is no significant correlation between grades in mathematics and grades in Statistics.
Ha: There is a significant correlation between grades in mathematics and grades in Statistics.
Since r_s = 0.625 and N = 7, the computed t-value is given by
\[ t = \frac{(.625)\sqrt{7-2}}{\sqrt{1-(.625)^2}} = 1.7903 \]
We use a two-tailed test because the alternative hypothesis is non-directional. At
α = 0.05 (two-tailed) and d.f. = 5, the corresponding critical value of t is 2.571. Since the
absolute value of the computed t-value did not exceed the critical value, the null hypothesis
cannot be rejected. We say that the data did not provide sufficient evidence to reject the
null hypothesis.
--------------------------------------------------------------------------------------------------------------------------
Note: When a statistical test IS NOT SIGNIFICANT, we do not reject (loosely, we "accept") the null hypothesis.
Not rejecting the null hypothesis, however, does not mean that it (the null hypothesis) is true, because we only considered
one sample out of the many possible samples from the population.
--------------------------------------------------------------------------------------------------------------------------
We present below the SPSS output for the same data analyzed using the Spearman’s
rank-order correlation coefficient. Note that the p-value = 0.151 > α = 0.05. Hence, the
variables are NOT significantly correlated. Note the discrepancy between the value we
obtained using the formula (0.625) and the value in the SPSS output, which is 0.604. The
discrepancy is due to the tied observations: Equation 3 is exact only when there are no ties,
and applying the Pearson formula (Equation 1) to the tie-averaged ranks gives 0.604. (Research on this.)
Correlations (Spearman's rho)

                                                  Grade in       Grade in
                                                  Mathematics    Statistics
Grade in Mathematics    Correlation Coefficient   1.000          .604
                        Sig. (2-tailed)           .              .151
                        N                         7              7
Grade in Statistics     Correlation Coefficient   .604           1.000
                        Sig. (2-tailed)           .151           .
                        N                         7              7
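The two values for Example 2 can be compared side by side. The sketch below is an illustration, not a description of how SPSS is implemented: it ranks the grades with averaged ties, then computes r_s once with Equation 3 and once by applying Equation 1 to the ranks. The helper names `average_ranks` and `pearson_r` are introduced here.

```python
import math

def average_ranks(values):
    """Rank from lowest (1) upward; tied values share the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson_r(xs, ys):
    """Equation 1 applied to any two paired lists of numbers."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

math_grades = [86, 78, 79, 85, 87, 90, 87]
stat_grades = [88, 78, 78, 86, 90, 88, 78]

rx = average_ranks(math_grades)
ry = average_ranks(stat_grades)
n = len(rx)

sum_d2 = sum((b - a) ** 2 for a, b in zip(rx, ry))
rs_equation3 = 1 - 6 * sum_d2 / (n * (n - 1) * (n + 1))   # Equation 3
rs_pearson_on_ranks = pearson_r(rx, ry)                    # Equation 1 on the ranks

print(sum_d2)                         # 21.0
print(round(rs_equation3, 3))         # 0.625
print(round(rs_pearson_on_ranks, 3))  # 0.604
```

With no tied scores the two computations agree exactly, so the difference seen here comes entirely from the ties in the data.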
The Point-Biserial Correlation Coefficient (r_pb)
In the previous discussion, we considered the correlation between the variables X
and Y when both are measured in at least the interval scale or ordinal scale. If, for instance,
the variable X is a dichotomy (categorical with 2 categories) and Y is measured in the
interval scale, neither Pearson’s r nor Spearman’s rank order correlation coefficient is
appropriate as a measure of correlation.
The point biserial correlation coefficient, which is denoted by r_pb, is a measure of
correlation which is appropriate when one variable is a dichotomy and the other is measured
in at least the interval scale. For instance, if one wants to know the strength of correlation
between gender and mathematics performance, then the point biserial correlation
coefficient will be an appropriate measure. The formula for the point biserial coefficient is
given by
\[ r_{pb} = \frac{M_p - M_q}{SD_X}\sqrt{pq} \qquad \text{(Equation 5)} \]
where M_p = the mean score of those in one category of the dichotomised variable
      M_q = the mean score of those scoring in the other category
      p = the proportion scoring in the first category
      q = the proportion scoring in the other category
      SD_X = the standard deviation of the interval variable.
NOTE: This formula is discussed on page 95 and the example is given on page 96 of Module 12.
There is another formula for the point-biserial correlation coefficient which is slightly
different from Equation 5. The formula makes use of the number of cases in each category of the
dichotomized variable and is given by
\[ r_{pb} = \frac{M_1 - M_0}{SD_X}\sqrt{\frac{n_1 n_0}{n(n-1)}} \qquad \text{(Equation 6)} \]
where M_1 = the mean score of the scores in category 1 of the dichotomised variable
      M_0 = the mean score of the scores in category 0 of the dichotomised variable
      n_1 = the number of cases in category 1
      n_0 = the number of cases in category 0
      n = the total number of cases (n_1 + n_0)
      SD_X = the standard deviation of the interval variable.
The test of significance of r_pb is given by the test statistic
\[ t = \frac{r_{pb}\sqrt{n-2}}{\sqrt{1-r_{pb}^2}}, \qquad \text{d.f.} = n - 2. \qquad \text{(Equation 7)} \]
Note the similarity between this test statistic and the test statistics for testing the
Pearson’s r and the Spearman’s r_s (Equations 2 and 4).
Example 3. Are graduates from private high schools better than graduates from public
schools? Suppose we have the entrance test scores of 6 students who
graduated from private schools (coded 1) and 8 students who graduated from
public high schools (coded 0) as follows:
Score     89  78  94  86  85  79  81  82  96  90  88  75  87  84
School     1   0   1   0   1   0   0   0   1   0   1   0   0   1
a) Compute the correlation coefficient between type of high school graduated
from and entrance test score using Equation 6.
b) State the null hypothesis and the corresponding alternative hypothesis.
c) Test the null hypothesis at α = .05
Solution: a) We first categorize the scores into two groups coded 1 and 0 as shown below.

Group Coded 1 (Private)    Group Coded 0 (Public)
89                         78
94                         86
85                         79
96                         81
88                         82
84                         90
                           75
                           87
n_1 = 6, M_1 = 89.33       n_0 = 8, M_0 = 82.25

n = 14
SD_X = 5.9927 (s.d. of all scores combined)

Using the summary values in the table, we have
\[ r_{pb} = \frac{89.33 - 82.25}{5.9927}\sqrt{\frac{(6)(8)}{14(14-1)}} = 0.607 \quad\text{or}\quad r_{pb} \approx 0.61. \]
b) The null hypothesis and the corresponding alternative hypothesis based on
the given problem are as follows:
H0: There is no significant relationship between type of high school
graduated from and score in the entrance test.
Ha: Graduates from private schools are better than graduates from public
schools in terms of scores in the college entrance test.
c) To test the significance of the obtained r_pb, we compute the test statistic using
the values r_pb = 0.607 and n = 14. Thus,
\[ t = \frac{(.607)\sqrt{14-2}}{\sqrt{1-(.607)^2}} = 2.646 \]
Because the alternative hypothesis is directional, we use a one-tailed test. The
critical value of t at α = 0.05 d.f. = 12 (one-tailed) is 1.782. Since the absolute value of the
computed t -value exceeds the critical value, the null hypothesis is rejected. Based on the
hypothetical data, we conclude that there is a significant correlation between type of school
graduated from and performance in the college entrance test. The specific relationship
between the given variables can be specified by using the mean scores of the two groups.
Thus, we say that graduates from private schools are generally better than graduates from
public high schools.
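Parts (a) and (c) of Example 3 can be reproduced from the raw scores. This is a minimal sketch using only the Python standard library. The variable names are introduced here for illustration, and `statistics.stdev` is used because the value 5.9927 in the text corresponds to the sample standard deviation (divisor n - 1) of all 14 scores.

```python
import math
import statistics

scores = [89, 78, 94, 86, 85, 79, 81, 82, 96, 90, 88, 75, 87, 84]
school = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]  # 1 = private, 0 = public

group1 = [y for y, g in zip(scores, school) if g == 1]
group0 = [y for y, g in zip(scores, school) if g == 0]
n1, n0 = len(group1), len(group0)
n = n1 + n0

sd_all = statistics.stdev(scores)  # s.d. of all scores combined
mean_diff = statistics.mean(group1) - statistics.mean(group0)

r_pb = mean_diff / sd_all * math.sqrt(n1 * n0 / (n * (n - 1)))   # Equation 6
t_value = r_pb * math.sqrt(n - 2) / math.sqrt(1 - r_pb ** 2)     # Equation 7

print(round(sd_all, 4))   # 5.9927
print(round(r_pb, 3))     # 0.607
print(round(t_value, 3))  # 2.646
```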
Remarks:
1. The same conclusion is arrived at when a t-test for independent samples is conducted.
Using the pooled variance estimate, the computed t-value is 2.646 (which is equal to the
computed value in the test of significance of r_pb) with a two-tailed p-value of 0.021, as
shown in the computer output. Since the test is one-tailed, the relevant p-value is
0.021 / 2 = 0.0105 [b], which is less than α = 0.05. Hence, the mean
scores of 89.33 and 82.25 are significantly different in favor of students who graduated
from private schools.
Group Statistics

                      Type of high school    N    Mean       Std. Deviation    Std. Error Mean
Entrance test score   Private                6    89.3333    4.80278           1.96073
                      Public                 8    82.2500    5.06388           1.79035

Independent Samples Test

                                                    Levene's Test for
                                                    Equality of Variances    t-test for Equality of Means
                                                    F        Sig.            t        df        Sig. (2-tailed)
Entrance test score   Equal variances assumed       .043     .839            2.646    12        .021
                      Equal variances not assumed                            2.668    11.235    .022
[b] The p-value associated with the computed t of 2.646 is 0.021 (two-tailed). Since the test is one-tailed and the
observed difference is in the hypothesized direction, this p-value is divided by 2 (0.021 / 2 = 0.0105).
Using the computed value of t, the value of r_pb can actually be computed by solving for
r_pb in the formula
\[ t = \frac{r_{pb}\sqrt{n-2}}{\sqrt{1-r_{pb}^2}}. \]
This value is given by
\[ r_{pb} = \sqrt{\frac{t^2}{t^2 + \text{d.f.}}}, \]
where t is the computed t-value and d.f. = n_1 + n_0 - 2. From the computer output, we have t = 2.646
and the corresponding degrees of freedom is 12. Hence we have
\[ r_{pb} = \sqrt{\frac{(2.646)^2}{(2.646)^2 + 12}} = 0.607, \]
which is identical to the value obtained using the formula for r_pb. This is the reason why
Equation 6 is preferred over Equation 5.
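The step from Equation 7 to the expression for r_pb in terms of t can be written out as follows. This is a short derivation, taking d.f. = n - 2 and assuming t and r_pb have the same sign, so that the positive root applies when t > 0:

\[
t = \frac{r_{pb}\sqrt{n-2}}{\sqrt{1-r_{pb}^2}}
\;\Longrightarrow\;
t^2\left(1-r_{pb}^2\right) = r_{pb}^2\,(n-2)
\;\Longrightarrow\;
t^2 = r_{pb}^2\left(t^2 + n - 2\right)
\;\Longrightarrow\;
r_{pb} = \sqrt{\frac{t^2}{t^2 + (n-2)}}
\]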
2. Another reason which justifies the use of Equation 6 instead of Equation 5 is that the
value of r_pb can be obtained using the same idea as the Pearson’s r. However, the
categories of the dichotomous variable are first coded as 1 and 0 (other codes are NOT
acceptable). We construct a similar worksheet for computing the Pearson’s r. The
variables to be correlated are X [school type (coded 1 and 0)] and Y (entrance test score).
The worksheet is shown as follows:
X (School Type)    Y (Entrance Test Score)    X²    Y²        XY
1                  89                         1     7921      89
0                  78                         0     6084      0
1                  94                         1     8836      94
0                  86                         0     7396      0
1                  85                         1     7225      85
0                  79                         0     6241      0
0                  81                         0     6561      0
0                  82                         0     6724      0
1                  96                         1     9216      96
0                  90                         0     8100      0
1                  88                         1     7744      88
0                  75                         0     5625      0
0                  87                         0     7569      0
1                  84                         1     7056      84
ΣX = 6             ΣY = 1194                  ΣX² = 6    ΣY² = 102298    ΣXY = 536
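Therefore,
\[ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} = \frac{14(536) - (6)(1194)}{\sqrt{\left[14(6) - (6)^2\right]\left[14(102298) - (1194)^2\right]}} = 0.607 = r_{pb} \]
as obtained before.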
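The worksheet totals and the resulting value can be verified with a few lines of code. The sketch below applies Equation 1 directly to the 0/1 codes and the test scores. The variable names are introduced here for illustration.

```python
import math

school = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]               # X, coded 1 and 0
scores = [89, 78, 94, 86, 85, 79, 81, 82, 96, 90, 88, 75, 87, 84]  # Y

n = len(scores)
sum_x = sum(school)
sum_y = sum(scores)
sum_x2 = sum(x * x for x in school)
sum_y2 = sum(y * y for y in scores)
sum_xy = sum(x * y for x, y in zip(school, scores))

# Equation 1 applied to the coded variable and the scores.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 6 1194 6 102298 536
print(round(r, 3))                           # 0.607
```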
Chi-Square Based Measures of Correlation [c]
In our discussion of the Chi-square test, we mentioned that the test can be used to
establish independence of variables. When the null hypothesis is rejected using the Chi-
Square test, we conclude that the variables ARE NOT independent, which means that the
variables are correlated. The Chi-square value, however, does not give information as to the
strength of correlation between the variables.
We can estimate the strength of correlation by using the computed value of the Chi-
square statistic (hence the term Chi-Square based). These measures are described as crude
measures because they are not as accurate or reliable as the Pearson-based
measures (r, r_s, and r_pb).
Also, the technique employed here is different from the Pearson based measures in
the sense that one tries to establish first whether the variables are significantly correlated or
not using the Chi-square statistic. The strength of correlation is computed only when the Chi-square test is significant. We outline below the computation of some Chi-square based
measures of correlation.
A. Contingency Coefficient:
\[ C = \sqrt{\frac{\chi^2}{\chi^2 + N}} \]
where χ² is the computed Chi-square value and
N is the grand total in the contingency table.
This measure is used when the contingency table is a square table with at least
three categories for each variable.
Example 4. The Chi-square value based on the contingency table below is 27.160 which is
significant at α = 0.05. Estimate the strength of correlation between interest in
sports and social class using the contingency coefficient.
Interest in Sports           Social Class             TOTAL
                      Working   Middle   Upper
High                     12       45        7           64
Moderate                 24       40       21           85
Low                      21       14       23           58
Total                    57       99       51          207
Solution: The contingency table is a 3×3 square table. Hence the contingency coefficient is
an appropriate measure of correlation. Using the values $\chi^2 = 27.160$ and $N = 207$,
we have

$$C = \sqrt{\dfrac{\chi^2}{\chi^2 + N}} = \sqrt{\dfrac{27.160}{27.160 + 207}} = 0.34.$$

Hence interest in sports and social class are significantly correlated (since the computed
Chi-square value is significant) and the strength of correlation is 0.34 (weak).

^c Siegel, S. & Castellan, J. (1988). Nonparametric Statistics (2nd ed.). New York: McGraw-Hill Book Company.
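As a quick check of the arithmetic in Example 4, the contingency coefficient can be computed with a few lines of Python. This is only a sketch: the function name `contingency_coefficient` is hypothetical, and the inputs are the Chi-square value and grand total given in the example.

```python
import math


def contingency_coefficient(chi2, n):
    """Contingency coefficient C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi2 / (chi2 + n))


# Values given in Example 4: chi-square = 27.160, N = 207.
c = contingency_coefficient(27.160, 207)
print(round(c, 2))  # 0.34
```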
B. Cramer's Coefficient:

$$C_r = \sqrt{\dfrac{\chi^2}{N(L-1)}}$$

where
$\chi^2$ is the computed Chi-square value;
$N$ is the grand total in the contingency table; and
$L$ is either the number of rows or the number of columns, whichever is smaller.
The Cramer's coefficient is applied when the contingency table is non-square.
Example 5. The table below is taken from page 5 of Reading Material #11. The computed
Chi-square value based on this contingency table is 20.67 which is significant at
α = 0.05. Estimate the strength of correlation between method of teaching and
academic performance using the Cramer’s coefficient.
Performance                  Method of Teaching           TOTAL
Category                 Lecture   Modular   CAI
Above Satisfactory           9        20      18             47
Satisfactory                12        18      21             51
Fair                        15        10       8             33
Below Satisfactory          24        12       6             42
Total                       60        60      53            173
Solution: As gleaned from the problem, $\chi^2 = 20.67$ and $N = 173$. Also, the contingency
table is non-square and the smaller number of categories is 3, hence $L = 3$.
Therefore, we have

$$C_r = \sqrt{\dfrac{\chi^2}{N(L-1)}} = \sqrt{\dfrac{20.67}{173(3-1)}} = 0.24.$$
Based on the computed Cramer's coefficient, there is a significant correlation
between method of teaching and academic performance, but the strength of
correlation is weak (0.24).
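The computation in Example 5 can be reproduced with a short Python sketch. The function name `cramers_coefficient` and its parameter names are hypothetical; the function simply takes the smaller of the two table dimensions as $L$ and applies the formula for Cramer's coefficient given above.

```python
import math


def cramers_coefficient(chi2, n, n_rows, n_cols):
    """Cramer's coefficient C_r = sqrt(chi2 / (N * (L - 1)))."""
    smaller = min(n_rows, n_cols)  # L: the smaller number of categories
    return math.sqrt(chi2 / (n * (smaller - 1)))


# Values given in Example 5: chi-square = 20.67, N = 173, a 4 x 3 table.
c_r = cramers_coefficient(20.67, 173, n_rows=4, n_cols=3)
print(round(c_r, 2))  # 0.24
```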
C. Phi Coefficient:

$$\phi = \sqrt{\dfrac{\chi^2}{N}}$$

where $\chi^2$ is the computed Chi-square value and $N$ is the
grand total in the contingency table.
The Phi-coefficient is used only for 2×2 contingency tables.
Example 6. A survey of 300 undergraduate and 100 graduate students from a large
university was conducted to determine their opinions on the autonomous status of
colleges. The following contingency table was generated from the survey.
Opinion          Level of Education        Total
              Undergrad     Graduate
Favor            100           70           170
Not Favor        200           30           230
Total            300          100           400
Find out if there is a significant correlation between opinion and level of education at α = 0.05. If significant, estimate the strength of correlation.
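Solution: We are given a 2×2 table, thus we can compute the $\chi^2$ value using the formula

$$\chi^2 = \dfrac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$

with d.f. = 1, where $a$ and $b$ are the cell frequencies in the first row of the table
and $c$ and $d$ are those in the second row.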
Note that the expected frequencies are all greater than 5 (check this). Thus,

$$\chi^2 = \dfrac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} = \dfrac{400[(100)(30) - (70)(200)]^2}{(170)(230)(300)(100)} = 41.262.$$
The critical value of $\chi^2$ at α = 0.05 and d.f. = 1 is 3.84. Since the computed
$\chi^2$ is greater than the critical value, the null hypothesis is rejected, which means
that the opinions of the students regarding the autonomous status of colleges are
dependent on the level of education.
Using the Phi-coefficient, the estimated strength of correlation is

$$\phi = \sqrt{\dfrac{\chi^2}{N}} = \sqrt{\dfrac{41.262}{400}} = 0.32 \text{ (weak)}.$$
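The whole of Example 6 can be traced in a short Python sketch. The function names `chi_square_2x2` and `phi_coefficient` are hypothetical, and the cell labels are an assumption consistent with the substitution in the solution: a and b are taken as the first-row frequencies (Favor) and c and d as the second-row frequencies (Not Favor). The critical value 3.84 is the one quoted in the solution.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Shortcut Chi-square for a 2x2 table: N(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def phi_coefficient(chi2, n):
    """Phi coefficient: sqrt(chi2 / N)."""
    return math.sqrt(chi2 / n)


# Cell frequencies from Example 6 (assumed labelling: row 1 = a, b; row 2 = c, d).
a, b, c, d = 100, 70, 200, 30
n = a + b + c + d

# Expected frequencies (row total)(column total)/N, for the "greater than 5" check.
expected = [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]

chi2 = chi_square_2x2(a, b, c, d)
is_significant = chi2 > 3.84  # critical value at alpha = 0.05, d.f. = 1
phi = phi_coefficient(chi2, n)

print(expected)                          # [127.5, 42.5, 172.5, 57.5]
print(round(chi2, 3), is_significant)    # 41.262 True
print(round(phi, 2))                     # 0.32
```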