QBA Term Report
SECTION A
A New Way to Compute Pearson’s r Without Reliance on Cross-Products
Submitted by Irfan Junejo, Kantesh Rathi, Vali Mohammad
Instructor: Sir Rizwan Ahmed


7/31/2019 QBA Report

http://slidepdf.com/reader/full/qba-report 1/11

 



Table of Contents

INTRODUCTION
    BY IRFAN JUNEJO
THE TEACHING/COMPUTING STRATEGY
    BY KANTESH RATHI
COMMENTS ABOUT THE FORMULA
    BY VALI MOHAMMAD
EXAMPLE
    BY IRFAN JUNEJO
CONCLUSION
    BY VALI MOHAMMAD
INDEX
REFERENCES


INTRODUCTION

BY IRFAN JUNEJO

The article under review is taken from ‘Teaching Statistics’, an international journal for teachers published under the banner of the ‘Teaching Statistics Trust’. The Trust has been a registered charity since 1979 and has published the journal three times a year since then.

In a recent article in the same journal entitled ‘Correlation: From Picture to Formula’, Peter Holmes¹ (2001) points out that scatter diagrams are very useful when introducing students to correlation, because they make it easier to judge the relation between the X and Y variables.

A scatter diagram is a tool for examining the potential relation between two variables, i.e. how one variable changes as the other changes. It does not give the exact relation, but it does indicate whether or not the variables are connected.

For example, the scatter diagram below shows no relation between X and Y, because the data points do not form any distinguishable pattern.

In the next pair of graphs, however, the one on the left shows a positive relation between the two variables, because as the value of one variable increases the other also increases, whereas the scatter diagram on the right shows a negative relation.

¹ Circulation Manager for ‘Teaching Statistics’

[Scatter diagram: Data Points plotted against X-Axis and Y-Axis]


In this manner scatter diagrams help to indicate the relation between two variables, which is explained further under the next heading.

The scatter diagram helps in making a rough guess of the value of r, which always lies between -1 and +1. This r is the coefficient of correlation. Karl Pearson developed it from a similar but slightly different idea due to Francis Galton. The coefficient is also written ρ when it refers to a population rather than a sample. The diagram that follows shows how a scatter diagram helps students make a rough guess of the value of r.

Holmes states that a typical student can make reasonably fair predictions about the value of r from a scatter diagram, but has difficulty seeing how the formula for Pearson’s r quantifies the understanding that comes from observing the diagram. Holmes tries to bridge the gap between Pearson’s formula and the scatter diagram in a step-by-step fashion.

[Two scatter diagrams: Data Points plotted against X-Axis and Y-Axis]

Figure 1 - http://en.wikipedia.org/wiki/Correlation_and_dependence


In the book ‘Comprehending Behavioral Statistics’, Dr. Russell Hurlbert² also tries to bridge the gap between the scatter diagram and r. Hurlbert first demonstrates how a tic-tac-toe grid can be superimposed on the data of the scatter diagram.

Following this superimposition, Hurlbert argues that the data in the four corner cells of the grid (1, 3, 7 and 9) have the greatest impact on the sign and magnitude of r. Lastly, Hurlbert computes the Z-score cross-products and states that Pearson’s correlation is equal to the mean of these ZxZy values.

Here is how to compute the ZxZy values:

X      Zx       Y      Zy       ZxZy
25     0.15     80     0.00     0.00
14    -1.53     98     1.10    -1.68
33     1.38     50    -1.84    -2.53
28     0.61     82     0.12     0.08
20    -0.61     90     0.61    -0.38

∑ = 120         ∑ = 400         ∑ = -4.51
Mean = 24       Mean = 80       r = -0.90
SD = 6.54       SD = 16.3

A Z score is computed by subtracting the mean³ of the column from the cell value and dividing the result by the column’s standard deviation⁴. For example, the Z score for the first class size is (25 - 24)/6.54 = 0.15.

² Professor of psychology, University of Nevada

³ For a data set, the mean is the sum of the observations divided by the number of observations.

⁴ Standard deviation shows how much variation there is from the "average" (mean).

Figure 2 


To obtain the value of r, we sum the ZxZy values and divide the sum by the total number of observations, i.e. 5 in this case.
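The calculation in Figure 2 can be sketched in Python as follows. This is only an illustration: the function names `z_scores` and `pearson_r_from_z` are hypothetical, and the sketch assumes the population standard deviation (dividing by N), which is what reproduces the SD values 6.54 and 16.3 in the table.

```python
from math import sqrt

def z_scores(values):
    """Convert raw scores to Z scores using the population SD (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_r_from_z(xs, ys):
    """Pearson's r as the mean of the Z-score cross-products."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# X and Y columns from the table in Figure 2
x = [25, 14, 33, 28, 20]
y = [80, 98, 50, 82, 90]

r = pearson_r_from_z(x, y)
print(round(r, 2))  # -0.9
```

Dividing by N rather than N - 1 in the standard deviation matters here: it is what makes the mean of the cross-products equal to r.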

Both Holmes and Hurlbert try to connect the scatter diagram with the correlation coefficient by means of Z scores. Although the sum of the cross-products forms the numerator of r, the authors argue that there is a better way to show that the formula for r truly quantifies the qualitative understanding one gets from looking at the scatter diagram. The advantage of this alternative approach is that it does not rely on the cross-products of the Z scores; instead it involves creating separate ‘direct’ and ‘indirect’ components of each score. These components, it is argued, accord far better with the intuitive ‘feel’ that one gets when looking at a scatter diagram.

THE TEACHING/COMPUTING STRATEGY

BY KANTESH RATHI

Showing the ‘direct’ and ‘indirect’ influence of each data point is straightforward, and it helps students understand the nature and strength of the relationship between two related variables.

The procedure has four steps:

  First, convert all the given scores on X and on Y into Z scores. This conversion does not affect the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC and typically denoted by r), which measures the linear dependence between two variables X and Y and takes a value between −1 and +1 inclusive. Students will appreciate this important property if asked: ‘If we correlate height and weight, does measuring in metres, centimetres, feet or inches affect the value of the correlation coefficient? Does measuring temperature in Centigrade or Fahrenheit?’

  Second, draw a scatter diagram of the Z scores. Inside this scatter diagram, draw a line at a 45° angle through the centroid, rising from lower left to upper right (the centroid is the point of means, which is the origin of a Z-score plot). This line represents the positive (direct) relationship; label it D. Then draw a second line through the centroid at 90° to D. This second line represents the negative (indirect) relationship; label it I.

  Third, project each data point onto the D and I lines and measure the distance from each projection to the centroid. Denote the distance along D by d and the distance along I by i. The distance d indicates the point’s direct influence on r, and the distance i indicates its indirect influence on r.

  Finally, once the d and i distances are known, square them, sum the squares, and put the two sums into the formula for the Pearson product-moment correlation coefficient r.
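The four steps can be sketched in Python. The sketch rests on two assumptions that are not stated in the report: that the signed distances are d = (zx + zy)/√2 and i = (zy − zx)/√2, which follow from projecting onto lines at 45° and 135° through the origin, and that the formula takes the form r = (Σd² − Σi²)/(Σd² + Σi²), which is consistent with r being positive when the d distances are large and the i distances small. The function names are hypothetical.

```python
from math import sqrt

def z_scores(values):
    """Step 1: convert raw scores to Z scores (population SD, divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def r_from_d_and_i(xs, ys):
    """Compute r from squared direct (d) and indirect (i) distances."""
    zx, zy = z_scores(xs), z_scores(ys)
    # Steps 2-3: signed distance from the centroid (the origin) of each
    # point's projection onto line D (45 degrees) and line I (135 degrees).
    d = [(a + b) / sqrt(2) for a, b in zip(zx, zy)]
    i = [(b - a) / sqrt(2) for a, b in zip(zx, zy)]
    # Step 4: square, sum, and combine (assumed form of the formula).
    sum_d2 = sum(v * v for v in d)
    sum_i2 = sum(v * v for v in i)
    return (sum_d2 - sum_i2) / (sum_d2 + sum_i2)

# Same data as the table in Figure 2
x = [25, 14, 33, 28, 20]
y = [80, 98, 50, 82, 90]

r_di = r_from_d_and_i(x, y)
print(round(r_di, 2))  # -0.9
```

For the Figure 2 data this gives the same value as the mean of the Z-score cross-products, without forming any cross-product.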

[Figure: two scatter diagrams, one labelled Positive and one labelled Negative]


COMMENTS ABOUT THE FORMULA

BY VALI MOHAMMAD

From the formula we can see that r will be positive when the distances d are large and the distances i are small. This happens when the data points form a compact band running from lower left to upper right, so that the i distances are much smaller than the d distances. Conversely, r will be negative when the data points cluster along the line I, in other words perpendicular to D (as can be seen from the figure above).

Another notable feature is that both Σd and Σi will equal zero no matter what the data are or what the relationship between X and Y is. It would therefore be pointless to calculate Σd and Σi, just as it is pointless to calculate the sum of the unsquared deviation scores when measuring dispersion in the univariate case. This answers anyone who wondered whether they could find the ds and is and then work directly with the difference between their sums: the distances must be squared first.

At first glance the diagram above may look confusing, and it may seem difficult to find the values of d and i, because they are measured along the oblique D and I lines rather than along the horizontal and vertical axes labeled zx and zy. On closer inspection, however, d and i turn out to be simple functions of zx and zy (as shown below).
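One way to write these functions is sketched here, assuming that D points along the unit direction (1, 1)/√2 and I along (−1, 1)/√2 in the Z-score plane. The orientation chosen for I is an assumption; reversing it changes only the sign of i.

```latex
\begin{aligned}
d &= \frac{z_x + z_y}{\sqrt{2}}, & i &= \frac{z_y - z_x}{\sqrt{2}},\\[4pt]
d^2 &= \frac{(z_x + z_y)^2}{2}, & i^2 &= \frac{(z_y - z_x)^2}{2},\\[4pt]
d^2 - i^2 &= 2\,z_x z_y, & d^2 + i^2 &= z_x^2 + z_y^2 .
\end{aligned}
```

Summing over the N data points, and using Σzx² = Σzy² = N for Z scores based on the population standard deviation, gives Σd² − Σi² = 2Σzxzy and Σd² + Σi² = 2N. The ratio of the two sums is therefore Σzxzy / N, the mean of the cross-products. Likewise Σd = Σi = 0, because the Z scores in each column sum to zero.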


With these formulas we can easily calculate the values of d² and i² and plug them into the formula to get the final answer. Keep in mind that r requires only d² and i², not d and i themselves, so there is no need to spend time on the latter. If desired, however, one can calculate d and i first and then square them to get d² and i².

To calculate the values of d and i we must first determine what sign (positive or negative) each d or i possesses. The signs of d and i can be determined using a set of rules:

The sign of d for any data point will be positive if that data point’s Z scores meet any one of these three conditions:
(a) both zx and zy are positive,
(b) zy is positive, zx is negative, and zy > |zx|, or
(c) zx is positive, zy is negative, and zx > |zy|.
If none of those conditions holds, then d will be negative. A similar set of rules can be applied to determine the sign of the i values. So we can see that it is much easier to calculate the values of d² and i² than those of d and i.
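The three conditions can be checked against a simpler test in a short Python sketch. It assumes, as an interpretation rather than something stated in the report, that d is proportional to zx + zy, so that d is positive exactly when zx + zy > 0. The function name and the sample points are hypothetical.

```python
def d_is_positive(zx, zy):
    """Sign rule for d, written out as conditions (a), (b) and (c)."""
    return (
        (zx > 0 and zy > 0)                      # (a) both positive
        or (zy > 0 and zx < 0 and zy > abs(zx))  # (b) zy outweighs negative zx
        or (zx > 0 and zy < 0 and zx > abs(zy))  # (c) zx outweighs negative zy
    )

# Hypothetical (zx, zy) pairs covering all four quadrants
points = [(1.2, 0.5), (-0.4, 0.9), (0.8, -0.3),
          (-1.1, -0.2), (-0.9, 0.4), (0.3, -0.7)]

rule_signs = [d_is_positive(zx, zy) for zx, zy in points]
sum_signs = [(zx + zy) > 0 for zx, zy in points]
print(rule_signs == sum_signs)  # True
```

A point with a Z score of exactly zero falls outside all three conditions as written, so the sample points above avoid that case.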

EXAMPLE

BY IRFAN JUNEJO


INDEX

C
centroid
Correlation

F
Francis Galton

H
Hurlbert

K
Karl Pearson

P
Pearson
Pythagoras

S
scatter diagram

T
Teaching Statistics
Teaching Statistics Trust

Z
Z scores

REFERENCES

1. http://en.wikipedia.org/wiki/Karl_Pearson
2. http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
3. http://wps.prenhall.com/wps/media/objects/2497/2557809/MEDIA/Ch3/learnmorech3.pdf
4. Research Methods and Statistics: A Critical Thinking Approach by Sherri L. Jackson
5. Statistics for People Who (Think They) Hate Statistics: Excel 2007 Edition by Neil J. Salkind