This article was downloaded by: [University of Chicago Library] On: 08 December 2014, At: 10:26 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK PRIMUS: Problems, Resources, and Issues in Mathematics Undergraduate Studies Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/upri20 AIDING UNDERGRADUATE UNDERSTANDING OF COVARIANCE AND CORRELATION William P. Fox BS and MS and Ph.D. a & Christopher W. Fowler BS and MS a a Department of Mathematical Sciences , United States Military Academy , West Point, NY, 10996, USA Published online: 13 Aug 2007. To cite this article: William P. Fox BS and MS and Ph.D. & Christopher W. Fowler BS and MS (1996) AIDING UNDERGRADUATE UNDERSTANDING OF COVARIANCE AND CORRELATION, PRIMUS: Problems, Resources, and Issues in Mathematics Undergraduate Studies, 6:3, 235-244, DOI: 10.1080/10511979608965826 To link to this article: http://dx.doi.org/10.1080/10511979608965826 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions

AIDING UNDERGRADUATE UNDERSTANDING OF COVARIANCE AND CORRELATION


Fox and Fowler Understanding of Covariance and Correlation


William P. Fox and Christopher W. Fowler

ADDRESS: Department of Mathematical Sciences, United States Military Academy, West Point NY 10996 USA.

ABSTRACT: We describe the use of graphical exercises to explain the concepts of covariance and correlation as they are used in an undergraduate course on probability and statistics. Students seem unable to grasp the concepts and their interpretation, although they can perform the necessary calculations to provide values for both the covariance and the correlation coefficient. We provide examples that help to illustrate the two key concepts of covariance and correlation. These examples show the importance of understanding the meaning of correlation. We use graphical methods to support our numerical approaches.

KEYWORDS: Correlation, covariance, correlation coefficient, graphical methods, geometric aspects, scale.

INTRODUCTION

Our probability and statistics course is a required course for sophomores, and serves as our capstone core-mathematics course. As such, it ties together the five threads of our core-mathematics curriculum: history of mathematics, computation, scientific computing, modeling, and writing [1]. We weave these threads together in our investigation of covariance and correlation. Our faculty has noted that the students do not understand the underlying concepts of covariance and correlation. The students can perform the necessary calculations (usually without error), either by hand or with mechanical aids (hand-held graphing calculator or microcomputer).


PRIMUS September 1996 Volume VI Number 3

Yet, when asked to explain the meaning of these values, the students are at a loss. Typical students get confused by the magnitude of covariance. They believe covariance is significant if the calculated value is large, and insignificant if the calculated value is very small. They also tend to blindly accept the value calculated for the correlation coefficient, without considering graphical evidence.

COVARIANCE AND CORRELATION

Covariance between two random variables X and Y is described as the association (positive or negative) between these random variables. Covariance is defined to be:

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)], \qquad (1)$$

or, more simply:

$$\mathrm{Cov}(X, Y) = E[XY] - \mu_X \mu_Y. \qquad (1a)$$

In the special case where X = Y, the covariance reduces to the variance of a random variable:

$$\mathrm{Cov}(X, X) = E[(X - \mu_X)^2] = \mathrm{Var}[X] = \sigma_X^2. \qquad (2)$$

The correlation coefficient of X and Y, $\rho_{XY}$, is defined to be:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad (3)$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of the random variables X and Y. Unlike some other texts, Devore [2] points out that a major defect of using the covariance to define the degree of association between two random variables is that the computed value depends critically on the units of measurement. This unit dependence follows clearly from the definition in (1). Measuring X and Y in units that give numerically larger values produces correspondingly larger values of $\sigma_X$ and $\sigma_Y$. A covariance value of 1875 indicates a positive association between the two random variables, but says little about the strength of that relationship. The students learn that $\rho_{XY}$ ranges from -1 to 1, and provides a measure of the strength of the linear relationship between random variables. In the Devore text, the relationship is strong if $|\rho_{XY}| \ge 0.8$, moderate if $0.5 < |\rho_{XY}| < 0.8$, and weak if $|\rho_{XY}| \le 0.5$. Students tend to use these guidelines as if they were absolutes, abandoning all reason and graphical evidence.
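As a quick numerical check of Equations (1), (1a), (2), and (3), the sketch below computes the covariance both ways and then the correlation coefficient. The five data pairs and the function names are hypothetical, chosen only for illustration, and each expectation is taken as a plain average over the listed values (a population-style calculation, not the sample version that divides by n - 1).

```python
import math

def mean(values):
    """Average of a list, standing in for the expectation E[.]."""
    return sum(values) / len(values)

def cov_definition(xs, ys):
    """Equation (1): E[(X - mu_X)(Y - mu_Y)]."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def cov_shortcut(xs, ys):
    """Equation (1a): E[XY] - mu_X * mu_Y."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Equation (3): Cov(X, Y) / (sigma_X * sigma_Y)."""
    sigma_x = math.sqrt(cov_definition(xs, xs))  # Equation (2): Cov(X, X) = Var[X]
    sigma_y = math.sqrt(cov_definition(ys, ys))
    return cov_definition(xs, ys) / (sigma_x * sigma_y)

# Hypothetical data, not taken from the article.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print("Cov by (1): ", cov_definition(xs, ys))
print("Cov by (1a):", cov_shortcut(xs, ys))
print("rho by (3): ", correlation(xs, ys))
```

The two covariance routines agree up to floating-point rounding, and the correlation stays inside the interval from -1 to 1 regardless of the units chosen for the data.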


Thus, the average student can perform two functions well: use Equations (1a) and (3), and compare the value of the correlation coefficient to these descriptive standards. What the student cannot do well is describe the relationship between the random variables that leads to these values. It is important for the student to be exposed to the "bigger picture", a picture including more than just computation. To help show the flaws in using just computation, we created several example problems to illustrate the importance of trying to understand more about the association between random variables.

ILLUSTRATIVE EXAMPLES

We created several problems for the students that require more understanding than mere computation. In these problems, students examine data samples and calculate the sample covariance and correlation coefficient. They then make statements concerning the relationship between random variables. Finally, they examine the data graphically. We want the student to recognize not only linear trends but also patterns where associations may indicate a non-linear relationship. The first exercise examines the cases of random variables with little or no linear association. We use Figures 1 and 2 as example patterns where $\rho_{XY} \approx 0$.

[Scatter plot omitted; horizontal axis: HOUR (0 to 10); vertical axis: temperature.]

Figure 1. Nonlinear trend: temperature versus time (hours).

Figure 1 plots temperature versus time. The students can easily see the nonlinear trend in the graph, while calculating a correlation coefficient near zero. Figure 2 plots student age versus grade point average. In this case, the student sees an example of random variables that are uncorrelated, and perhaps even independent.
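The first of these two patterns can be imitated with a short sketch. The temperature curve below is a hypothetical symmetric parabola, not the data behind Figure 1, and `pearson` is an illustrative helper; the point is only that a clear nonlinear trend can still give a correlation coefficient of essentially zero.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient of paired data, as in Equation (3)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings: temperature rises until hour 5, then falls symmetrically.
hours = list(range(11))
temps = [55 - 0.4 * (h - 5) ** 2 for h in hours]

r = pearson(hours, temps)
print(f"correlation = {r:.6f}")
```

Temperature here is completely determined by the hour, yet the rising and falling halves cancel in the covariance, so the coefficient says nothing about the dependence that the plot makes obvious.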

[Scatter plot omitted; horizontal axis: AGE (18 to 22); vertical axis: grade point average.]

Figure 2. Uncorrelated data: student grade point average versus age.

[Scatter plot omitted; horizontal axis: TIME; vertical axis: rpm.]

Figure 3. y = 2sin(x), 0 ≤ x ≤ 10.

In the second problem that we give the students, we take the function y = 2sin(x) and extract subsets of X and Y as random variables. The students are told that the dependent variable is revolutions per minute (rpm) of a turbine measured against distinct time units. Each small group of students receives a different sample of the random variables (as shown in Figures 3 and 4) and is asked to describe the association between the random variables. We tell the students that each data set is extracted from the same population. The answers vary from "the association is nonlinear" with a $\rho_{XY}$ value of -0.096 for the data in Figure 3, to an "almost strong negative linear association" with a $\rho_{XY}$ value of -0.893 for the data of Figure 4.

[Scatter plot omitted; horizontal axis: TIME (0 to 6); vertical axis: rpm.]

Figure 4. y = 2sin(x), 0 ≤ x ≤ 6.

We ask the students who calculated a $\rho_{XY}$ value of -0.893 to support their conclusion of a linear association. They all agree that the value of $\rho_{XY}$ influenced their decision more than the plot. In fact, the plot of the data indicates that the relationship is nonlinear, and, in some cases, students do not even examine the plot! At this point, the students are ready to look before they compute. They start to think about why the computed answer may differ from what they expected. We are trying to create inquisitive students who can experiment, conjecture, and examine information on their own. We believe problems like this can stimulate the students' minds.
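The dependence of the coefficient on the sampling window can be reproduced with a small sketch. It assumes evenly spaced points on each interval, whereas the students worked with small extracted samples, so the numbers will not match -0.096 and -0.893 exactly; the helper names `pearson`, `grid`, and `correlation_on_window` are illustrative.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient of paired data, as in Equation (3)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def grid(start, stop, n):
    """n evenly spaced points from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

def correlation_on_window(upper, n=201):
    """Correlation of x and 2*sin(x) for x on [0, upper] (assumed even spacing)."""
    xs = grid(0.0, upper, n)
    ys = [2 * math.sin(x) for x in xs]
    return pearson(xs, ys)

r_wide = correlation_on_window(10.0)   # window of Figure 3
r_narrow = correlation_on_window(6.0)  # window of Figure 4

print(f"0 <= x <= 10: r = {r_wide:.3f}")
print(f"0 <= x <= 6:  r = {r_narrow:.3f}")
```

The same function yields a coefficient near zero on the wider window and a markedly negative one on the narrower window, which is the contrast the two student groups encountered.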

The third example problem is an extension of the previous problem. Now, we start with the function y = sin(x) + x, and generate similar subsets of data as shown in Figures 5 and 6:

The students, of course, do not initially have access to these plots. They take the data and calculate the sample covariance and correlation coefficient for their respective data set. They quickly calculate $\rho_{XY} = 0.941$ for the first data set shown in Figure 5 and $\rho_{XY} = 0.979$ for the data set shown in Figure 6. When asked to comment on the linear relationship between random variables, the students invariably answer with a chorus, "What does the plot of the data look like?" Their suspicions are validated when they see the plots of data with a "linear trend" that is nevertheless obviously non-linear in nature.
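A similar sketch illustrates how a clearly nonlinear function can still produce a coefficient close to 1. As before, the evenly spaced grid is an assumption made for illustration (the students received extracted samples), and `pearson` and `correlation_on_window` are hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient of paired data, as in Equation (3)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_on_window(upper, n=201):
    """Correlation of x and sin(x) + x for x on [0, upper] (assumed even spacing)."""
    step = upper / (n - 1)
    xs = [i * step for i in range(n)]
    ys = [math.sin(x) + x for x in xs]
    return pearson(xs, ys)

r_short = correlation_on_window(6.0)   # window of Figure 5
r_long = correlation_on_window(12.0)   # window of Figure 6

print(f"0 <= x <= 6:  r = {r_short:.3f}")
print(f"0 <= x <= 12: r = {r_long:.3f}")
```

Both coefficients come out high because the x term dominates the sine term, and the longer window gives the larger value even though the oscillation is just as present in the data.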

[Scatter plot omitted; horizontal axis: X (0 to 6); vertical axis: Y.]

Figure 5. y = sin(x) + x, 0 ≤ x ≤ 6.

[Scatter plot omitted; horizontal axis: X (0 to 10); vertical axis: Y.]

Figure 6. y = sin(x) + x, 0 ≤ x ≤ 12.

The fourth example illustrates the misconception that a large covariance implies a strong value for the correlation coefficient, and a small covariance implies a weak value for the correlation coefficient. In this example, we examine data for biathlons where one random variable describes the run distance and the other random variable describes the swim distance. The students are given the data in two different forms. The first data set describes the data in miles while the second data set describes the exact same races measured in meters. Using MINITAB, the students calculate the covariance of each data set as follows:

             run_mile    swim_mil
run_mile    2.1348896
swim_mil    0.3535925   0.0928295

             run_metr    swim_met
run_metr      5530141
swim_met       915979      240580

So, one group states that the covariance between run distance and swim distance for their data set is 0.35359, while the next group states that the covariance between the two random variables for their data set is 915,979. The plots of each data set can then be shown side by side. The students quickly see that the two plots are equivalent. Finally, we reinforce the learning point of this lesson by calculating the correlation coefficient for each data set, yielding a common value of 0.794.
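The common value can be checked directly from the two covariance matrices with Equation (3). The sketch below uses only the variances and covariances printed in the MINITAB output above; the function name `correlation_from_cov` is illustrative. Converting both variables from miles to meters multiplies every entry of the matrix by the square of the conversion factor, and that factor cancels in the ratio.

```python
import math

def correlation_from_cov(var_x, var_y, cov_xy):
    """Equation (3), using sigma_X = sqrt(Var[X]) and sigma_Y = sqrt(Var[Y])."""
    return cov_xy / math.sqrt(var_x * var_y)

# Entries read from the MINITAB covariance output for each unit system.
r_miles = correlation_from_cov(var_x=2.1348896, var_y=0.0928295, cov_xy=0.3535925)
r_meters = correlation_from_cov(var_x=5530141, var_y=240580, cov_xy=915979)

print(f"miles:  r = {r_miles:.3f}")
print(f"meters: r = {r_meters:.3f}")
```

Both coefficients round to 0.794, even though the two covariances differ by a factor of more than two million.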

A final example illustrates another common misconception about the correlation coefficient. Students tend to think that a calculated value of $\rho_{XY} = 0.6$ is "twice as good" as a value of $\rho_{XY} = 0.3$. We start this problem with a familiar scenario. All students take a calculus sequence prior to our course in probability and statistics. In the multivariate calculus portion, they examine an optimization model for a Cobb-Douglas production function of two inputs, capital and labor. Now, we introduce money as a potential third variable. We wish to examine the relationship of input of money to production output. We use the model:

where Y is output, K is capital input, L is labor input, and M is monetary input.

Now, taking logarithms of both sides yields a model in which log(Y) is a linear function of the logarithms of the three input variables. We then isolate the variable M and give the students a sample of log(output) and log(money) to examine. The two views they eventually see are shown in Figures 7 and 8. The students first receive the sample data for Figure 7. They generate the plot of the data and calculate the correlation coefficient as $\rho_{XY} = 0.411$. Next, they receive information that the money, output pair (39.284, 123.337) was incorrectly coded and the correct pairing should be (39.284, 231.337). The new plot is shown in Figure 8. The new correlation coefficient is calculated as $\rho_{XY} = 0.693$. So, a correction of one data point yielded a 69% increase in the value of the correlation coefficient. The data set, as a whole, is not "more linear" in nature. The students see the effects of outliers in the calculations that they have been performing. Correlation coefficients calculated from sample data are ordinal in nature. The students see that a correlation coefficient for one data sample that is twice the value of the correlation coefficient for another sample does not imply that the data in one is "two times closer" to being linear than the other!
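The sensitivity to a single miscoded point is easy to reproduce. The ten pairs below are hypothetical and are not the article's Cobb-Douglas sample; one value is deliberately miscoded and then corrected, and `pearson` is an illustrative helper.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient of paired data, as in Equation (3)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a mildly noisy increasing trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys_correct = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

# The same data with the last y value miscoded as 0 instead of 9.
ys_miscoded = ys_correct[:-1] + [0]

r_miscoded = pearson(xs, ys_miscoded)
r_correct = pearson(xs, ys_correct)

print(f"with miscoded point:  r = {r_miscoded:.3f}")
print(f"with corrected point: r = {r_correct:.3f}")
```

Nine of the ten points are identical in the two runs, yet the coefficient more than doubles once the single entry is corrected, so the ratio of two coefficients should not be read as a ratio of "linearity".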

[Scatter plot omitted; horizontal axis: Money (30 to 50).]

Figure 7. Plot of log(Y) vs. log(M) from the Cobb-Douglas model.

[Scatter plot omitted; horizontal axis: Money (30 to 50).]

Figure 8. Plot of log(Y) vs. log(M) from the Cobb-Douglas model with corrected data point (39.284, 231.337).


CONCLUSION

Our students are exposed to covariance and correlation several times during the semester. These sample problems, worked in class in small groups, help the students gain an understanding of two important but frequently misunderstood concepts. The students then demonstrate understanding during our end-of-course capstone project. We take "real world" data on such topics as student scores, movement of equipment as a scheduling and transportation problem, or repair parts in maintenance cycles. Two examples are a fictional project to assist the admissions department in evaluating potential candidates, and a scheduling problem to determine if military units can be moved by sea in a sufficient amount of time with defined assets. We typically generate 1000 rows of data for this project with 7-9 random variables forming the columns. Each student is then assigned 400 random rows of data as "their" data set for the project. Among the many requirements are questions concerning the degree of linear association between two of the variables. The students spend some time analyzing the data, plotting the data, and then computing the correlation coefficient before passing their judgment on the strength, or even existence, of a linear association. Our students gain a greater appreciation and understanding of the concepts of covariance and correlation. They do not have all the insights into these concepts, but they have been exposed to more than just the computational aspects. They have been asked to experiment, conjecture, and make a decision about the relationship between random variables. These examples strengthen the students' understanding of the subject as well as strengthen them as self-learners. We have observed vast student improvement in explaining the possible relationship between the variables of interest. The students have to explain these relationships between random variables in essay questions on the comprehensive final exam. Grading of these questions on the final exams indicates that the students have gained a greater understanding of covariance and correlation. We attribute this "grasping the concept" to the illustrative examples in our course.

REFERENCES

1. Arney, David C., William P. Fox, Kelley B. Mohrmann, Joseph D. Myers, and Richard A. West. 1995. CORE Mathematics at the United States Military Academy: Leading into the 21st Century. PRIMUS. 5(4): 343-365.

2. Devore, Jay. 1991. Probability and Statistics for Engineering and the Sciences. 3rd Edition. Monterey: Brooks/Cole Publishing.


3. Ryan, Barbara, Brian Joiner, and Thomas Ryan. 1985. MINITAB Handbook. 2nd Edition. Boston: PWS-Kent.

4. Walpole, Ronald E. and Raymond H. Myers. 1978. Probability and Statistics for Engineers and Scientists. 2nd Edition. New York: Macmillan Publishing.

BIOGRAPHICAL SKETCHES

William P. Fox currently is a Professor of Operations Research in the Department of Mathematical Sciences at the United States Military Academy. He has a BS from USMA, an MS from the Naval Postgraduate School, and a Ph.D. from Clemson University. He has taught mathematics for over ten years and has been a co-author of several books on mathematical modeling. He has taught courses in discrete dynamical systems, probability and statistics, calculus, differential equations, nonlinear optimization, mathematical programming, and mathematical modeling. His interests are in the above subjects and in their interdisciplinary applications.

Christopher W. Fowler was an assistant professor in the Department of Mathematical Sciences at the United States Military Academy for three years. He taught multivariable calculus and directed our probability and statistics core course. He has a BS from USMA and an MS from Georgia Tech. He is currently finishing his dissertation at Georgia Tech in Operations Research/Systems Engineering.
