Bivariate data - iiNetmembers.ozemail.com.au/~rdunlop/CoplandMain/MathsApps/Modell… · the total attendance at Carlton football matches D position in a queue at the pie stall E

VCEcoverageArea of studyUnits 3 & 4 • Data analysis

In thisIn this chachapterpter2A Types of data2B Back-to-back stem plots2C Parallel boxplots2D The two-way frequency

table2E The scatterplot2F The q-correlation

coefficient2G Pearson’s product–

moment correlation coefficient

2H Calculating r and the coefficient of determination

2

Bivariate data

Ch 02 FM YR 12 Page 57 Friday, November 10, 2000 9:34 AM

58

F u r t h e r M a t h e m a t i c s

Types of data

In this chapter we look at sets of data which contain two variables. We look at ways ofdisplaying the data and of measuring relationships between the two variables.

The methods we employ to do this depend entirely on the type of variables we aredealing with.

Numerical and categorical data

Examples of numerical data are:1. the heights of a group of teenagers2. the marks for a maths test3. the number of universities in a country4. ages5. salaries.

As the name suggests, numerical data involve quantities which are, broadlyspeaking,

measurable

or

countable

.Examples of categorical data are:

1. genders (sexes)2. AFL football teams3. religious denominations4. finishing positions in the Melbourne Cup5. municipalities6. ratings of 1–5 to indicate preferences for 5 different cars7. age groups, for example 0–9, 10–19, 20–29 8. hair colours.

Such categorical data, as the name suggests, have categories like masculine, feminineand neuter for gender, or Catholic, Anglican, Uniting, Baptist, Buddhist and so on forreligious denomination, or 1st, 2nd, 3rd for finishing position in the Melbourne Cup.

Note

: Some numbers may

look

like numerical data, but really be

names

or

titles

(forexample, ratings of 1 to 5 given to different samples of cake — ‘This one’s a 4’; thenumbers on netball players’ uniforms — ‘she’s number 7’). These ‘titles’ are

not count-able

; they place the subject in a category (with a name) so are

categorial

.In this chapter we look at ways of measuring the relationship between:

1. a numerical variable and a categorical variable (for example, weight and nationality)2. two categorical variables (for example, gender and religious denomination)3. two numerical variables (for example, height and weight).

Dependent and independent variables

When a relationship between two sets of variables is being examined, it is useful toknow if one of the variables depends on the other. Often we can make a judgment aboutthis but sometimes we can’t.

Consider the case where a study compared the heights of company employeesagainst their annual salaries. Common sense would suggest that the height of a com-pany employee would not depend on the person’s annual salary nor would the annualsalary of a company employee depend on the person’s height. In this case, it is notappropriate to designate one variable as independent and one as dependent.

In the case where the ages of company employees are compared with their annualsalaries, you might reasonably expect that the annual salary of an employee woulddepend on the person’s age. In this case, the age of the employee is the independentvariable and the salary of the employee is the dependent variable.

It is useful to identify the independent and dependent variables where possible, sinceit is the usual practice when displaying data on a graph to place the independent vari-able on the horizontal axis and the dependent variable on the vertical axis.


C h a p t e r 2 B i v a r i a t e d a t a

59

Types of data

1

Write down whether each of the following represents numerical or categorical data.

a

The heights in centimetres of a group of children

b

The diameters in millimetres of a tray of ball-bearings

c

The numbers of visitors to a display each day

d

The modes of transport that students in Year 11 take to school

e

The 10 most-watched television programs in a week

f

The occupations of a group of 30-year-olds

g

The numbers of subjects offered to VCE students at various schools

2

For each of the following pairs of variables, write down which is independent andwhich is dependent. If it is not possible to identify this, then write ‘not appropriate’.

a

The age of an AFL footballer and his annual salary

b

The weight of a businessman and the number of business lunches he attends each week

c

The growth of a plant and the amount of fertiliser it receives

d

The number of books read in a week and the eye colour of the readers

e

The voting intentions of a woman and her weekly consumption of red meat

f

The number of members in a household and the size of their house

3

An example of a numerical variable is:

4

In a study on mice, the dependent variable was the time (in days) for which the miceremained alive. The independent variable would most likely have been:

A

the weight of the mice

B

the amount of food eaten each day by the mice

C

the daily dosage of an experimental drug given to the mice

D

the number of mice

E

the sex of the mice

h

Life expectancies

i

Species of fish

j

Blood groups

k

Years of birth

l

Countries of birth

m

Tax brackets

A

attitude to 4-yearly elections (for or against)

B

year level of students

C

the total attendance at Carlton football matches

D

position in a queue at the pie stall

E

television channel numbers shown on a dial

remember1. Bivariate data are data with two variables.2. Numerical data involve quantities that are measurable or countable.3. Categorical data, as the name suggests, are data which are divided into categories.4. In a relationship involving two variables, if the values of one variable ‘depend’ on

the values of another variable, then the former variable is referred to as the dependent variable and the latter variable is referred to as the independent variable.

5. It is the usual practice when displaying data on a graph to place the independent variable on the horizontal axis and the dependent variable on the vertical axis.

remember

2A

mmultiple choiceultiple choice



60

F u r t h e r M a t h e m a t i c s

Back-to-back stem plots

In chapter 1, we saw how to construct a stem plot for a set of univariate data. We canalso extend a stem plot so that it displays bivariate data. Specifically, we shall create astem plot that displays the relationship between a numerical variable and a categoricalvariable. We shall limit ourselves in this section to categorical variables with just twocategories, for example sex. The two categories are used to provide two, back-to-backleaves of a stem plot.

A back-to-back stem plot is used to display bivariate data, involving a numerical variable and a categorical variable with 2 categories.

The girls and boys in Grade 4 at Kingston Primary School submitted projects on the Olympic Games. The marks they obtained out of 20 are given below.

Display the data on a back-to-back stem plot.

Girls’ marks 16 17 19 15 12 16 17 19 19 16

Boys’ marks 14 15 16 13 12 13 14 13 15 14

THINK WRITE

Identify the highest and lowest scores in order to decide on the stems.

Highest score = 19Lowest score = 12Use a stem of 1, divide into fifths.

1

1WORKEDExample


C h a p t e r 2 B i v a r i a t e d a t a 61

The back-to-back stem plot allows us to make some visual comparisons of the twodistributions. In the above example the centre of the distribution for the girls is higherthan the centre of the distribution for the boys. The spread of each of the distributionsseems to be about the same. For the boys, the marks are grouped around the 12–15marks; for the girls, they are grouped around the 16–19 marks. On the whole, we canconclude that the girls obtained better marks than the boys did.

To get a more precise picture of the centre and spread of each of the distributions wecan use the summary statistics discussed in chapter 1. Specifically, we are interested in:1. the mean and the median (to measure the centre of the distributions), and2. the interquartile range and the standard deviation (to measure the spread of the

distributions).We saw in chapter 1 that the calculation of these summary statistics is very straight-

forward and rapid using a graphics calculator.

THINK WRITE

Create an unordered stem plot first. Put the boys’ scores on the left, and the girls’ scores on the right.

Key: 1|2 = 12Leaf Stem LeafBoys Girls

13 2 3 3 1 2

4 5 4 5 4 1 56 1 6 7 6 7 6

1 9 9 9Now order the stem plot. The scores on the left should increase in value from right to left, while the scores on the right should increase in value from left to right.

Key: 1|2 = 12Leaf Stem LeafBoys Girls

3 3 3 2 1 25 5 4 4 4 1 5

6 1 6 6 6 7 71 9 9 9

2

3

The number of ‘how to vote’ cards handed out by various Australian Labor Party and Liberal party volunteers during the course of a polling day is shown below.

Display the data using a back-to-back stem plot and use this, together with summary statistics, to compare the distributions of the number of cards handed out by the Labor and Liberal volunteers.

Labor 180193

233202

246210

252222

263257

270247

229234

238226

226214

211204

Liberal 204287

215273

226266

253233

263244

272250

285261

245272

267280

275279

2WORKEDExample

Continued over page


62 F u r t h e r M a t h e m a t i c s

THINK WRITEConstruct the stem plot. Key: 18|0 = 180

Leaf Stem LeafLabor Liberal

0 183 19

4 2 20 44 1 0 21 5

9 6 6 2 22 68 4 3 23 3

7 6 24 4 57 2 25 0 3

3 26 1 3 6 70 27 2 2 3 5 9

28 0 5 7Use a graphics calculator to calculate the summary statistics: the mean, the median, the standard deviation and the interquartile range. Enter each set of data as a separate list. (See chapter 1 on how to use your graphics calculator to calculate these values.)

For the Labor volunteers:Mean = 227.9Median = 227.5Interquartile range = 36Standard deviation = 23.9

For the Liberal volunteers:Mean = 257.5Median = 264.5Interquartile range = 29.5Standard deviation = 23.4

Comment on the relationship. From the stem plot we see that the Labor distribution is symmetric and therefore the mean and the median are very close, whereas the Liberal distribution is negatively skewed.

Since the distribution is skewed, the median is a better indicator of the centre of the distribution than is the mean.

Comparing the medians therefore, we have the median number of cards handed out for Labor at 228 and for Liberal at 265, which is a big difference.

The standard deviations were similar as were the interquartile ranges. There was not a lot of difference in the spread of the data.

In essence, the Liberal party volunteers handed out a lot more ‘how to vote’ cards than the Labor party volunteers did.

1

2

3

remember1. A back-to-back stem plot displays bivariate data involving a numerical variable

and a categorical variable with two categories.2. In the ordered stem plot, the scores on the left side of the stem increase in value

from right to left.3. Together with summary statistics, back-to-back stem plots can be used for

comparing two distributions.

remember



Back-to-back stem plots

1 The marks (out of 50), obtained for the end-of-term test by the students in German andFrench classes are given below. Display the data on a back-to-back stem plot.

2 The birth masses of 10 boys and 10 girls (in kilograms, to the nearest 100 grams) arerecorded in the table below. Display the data on a back-to-back stem plot.

3 The number of delivery trucks making deliveries to a supermarket each day over a2-week period was recorded for two neighbouring supermarkets —supermarket A andsupermarket B. The data are shown below.

a Display the data on a back-to-back stem plot.b Use the stem plot, together with some summary statistics, to compare the distribu-

tions of the number of trucks delivering to supermarkets A and B.

4 The marks out of 20 for males and females on a science test for a Year-10 class aregiven below.


tions of the marks of the males and the females.

5 The end-of-year English marks for 10 students in an English class were compared over2 years. The marks for 1998 and for the same students in 1999 are shown below.


tions of the marks obtained by the students in 1998 and 1999.

German 20 38 45 21 30 39 41 22 27 33 30 21 25 32 37 42 26 31 25 37

French 23 25 36 46 44 39 38 24 25 42 38 34 28 31 44 30 35 48 43 34

Boys 3.4 5.0 4.2 3.7 4.9 3.4 3.8 4.8 3.6 4.3

Girls 3.0 2.7 3.7 3.3 4.0 3.1 2.6 3.2 3.6 3.1

A 11 15 20 25 12 16 21 27 16 17 17 22 23 24

B 10 15 20 25 30 35 16 31 32 21 23 26 28 29

Females 12 13 14 14 15 15 16 17

Males 10 12 13 14 14 15 17 19

1998 30 31 35 37 39 41 41 42 43 46

1999 22 26 27 28 30 31 31 33 34 36

2BWORKEDExample

1

WORKEDExample

2



6 The age and sex of a group of people attending a fitness class are recorded below.


tions of the ages of the female to male members of the fitness class.

7 The scores on a board game are recorded for a group of kindergarten children and for agroup of children in a preparatory school.

a Display the data on a back-to-back stem plot.b Use the stem plot, together with some summary statistics, to compare the distributions

of the scores of the kindergarten children compared to the preparatory school children.

8The pair of variables that could be displayed on a back-to-back stem plot is:A the height of student and the number of people in the student’s householdB the time put into completing an assignment and a pass or fail score on the assignmentC the weight of a businessman and his ageD the religion of an adult and the person’s head circumferenceE the income bracket of an employees and the time the employee has worked for the

company

9A back-to-back stem plot is a useful way of displaying the relationship between:A the proximity to markets (km) and the cost of fresh foods on average per kilogramB height and head circumferenceC age and attitude to gambling (for or against)D weight and ageE the money spent during a day of shopping and the number of shops visited on that day

Female 23 24 25 26 27 28 30 31

Male 22 25 30 31 36 37 42 46

Kindergarten 3 13 14 25 28 32 36 41 47 50

Prep. School 5 12 17 25 27 32 35 44 46 52





Parallel boxplotsWe saw in the previous section that we could display relationships between a numericalvariable and a categorical variable with just two categories, using a back-to-back stem plot.

When we want to display a relationship between a numerical variable and acategorical variable with more than two categories, a parallel boxplot can be used.

A parallel boxplot is obtained by constructing individual boxplots for eachdistribution, using the common scale.

Construction of individual boxplots was discussed in detail in chapter 1 on univariatedata. In this section we concentrate on comparing distributions represented by anumber of boxplots (that is, on the interpretation of parallel boxplots).

The four Year-7 classes at Western Secondary College complete the same end-of-year maths test. The marks, expressed as percentages for each of the students in the four classes, are given below.

Display the data using a parallel boxplot and use this to describe any similarities or differences in the distributions of the marks between the four classes.

7A 7B 7C 7D 7A 7B 7C 7D

40 60 50 40 69 78 70 69

43 62 51 42 63 82 72 73

45 63 53 43 63 85 73 74

47 64 55 45 68 87 74 75

50 70 57 50 70 89 76 80

52 73 60 53 75 90 80 81

53 74 63 55 80 92 82 82

54 76 65 59 85 95 82 83

57 77 67 60 89 97 85 84

60 77 69 61 90 97 89 90

THINK WRITE/DISPLAYCreate the first boxplot (for class 7A) on a graphics calculator using [STAT PLOT] and appropriate WINDOW settings. Using to show key values, sketch the first boxplot using pen and paper, leaving room for three additional plots.

12nd

TRACE

FM Fig SD 02.01a FM Fig SD 02.01b

3WORKEDExample

Continued over page



THINK WRITE

Repeat step 1 for the other three classes. All four boxplots share the common scale.

Describe the similarities and differences between the four distributions.

Class 7B had the highest median mark and the range of the distribution was only 37. The lowest mark in 7B was 60.

We notice that the median of 7A’s marks is approximately 60. So, 50% of students in 7A received less than 60. This means that half of 7A had scores that were less than the lowest score in 7B.

The range of marks in 7A was about the same as that of 7D with the highest scores in each about equal, and the lowest scores in each about equal. However, the median mark in 7D was higher than the median mark in 7A so, des-pite a similar range, more students in 7D received a higher mark than in 7A.

While 7D had a top score that was higher than that of 7C, the median score in 7C was higher than that of 7D and the bottom 25% of scores in 7D were less than the lowest score in 7C. In summary, 7B did best, followed by 7C then 7D and finally 7A.

2

30 40 50 60 70 80 90 100

7D

7C

7B

7A

Maths mark (%)

3

remember1. A relationship between a numerical variable and a categorical variable with

more than two categories can be displayed using a parallel boxplot.2. A parallel boxplot is obtained by constructing individual boxplots for each

distribution, using a common scale.

remember



Parallel boxplots

1 The heights (in cm) of students in 9A, 10A and 11A were recorded andare shown in the table below.

a Construct a parallel boxplot to show the data.b Use the boxplot to compare the distributions of height for the 3 classes.

2 The amounts of money contributed annually to superannuation schemes by people in3 different age groups are shown below.

a Construct a parallel boxplot to show the data.b Use the boxplot to comment on the distributions.

9A 10A 11A 9A 10A 11A 9A 10A 11A

120 140 151 146 153 164 158 168 175

126 143 153 147 156 166 160 170 180

131 146 154 150 162 167 162 173 187

138 147 158 156 164 169 164 175 189

140 149 160 157 165 169 165 176 193

143 151 163 158 167 172 170 180 199

20–29 30–39 40–49 20–29 30–39 40–49

2000 4000 10 000 6500 7000 13 700

3100 5200 11 200 6700 8000 13 900

5000 6000 12 000 7000 9000 14 000

5500 6300 13 300 9200 10 300 14 300

6200 6800 13 500 10 000 12 000 15 000

2CWORKEDExample

3

EXCEL Spreadsheet

Parallelboxplots

GC program

UV stats



3 The numbers of jars of vitamin A, B, C and multi-vitamins sold per week by a localchemist are shown below.

Construct a parallel boxplot to display the data and use it to compare the distributionsof sales for the 4 types of vitamin.

4The ages of the employees at 5 different companies of the same size are comparedusing the parallel boxplots shown below.

For each of the following, select from:

a Which company has the greatest range of ages?b Which company has the greatest interquartile range of ages?c Which company has the lowest median age?d Which company has the greatest range of ages among their oldest 25% of employees?

Vitamin A 5 6 7 7 8 8 9 11 13 14

Vitamin B 10 10 11 12 14 15 15 15 17 19

Vitamin C 8 8 9 9 9 10 11 12 12 13

Multi-vitamins 12 13 13 15 16 16 17 19 19 20

A company A B company B C company CD company D E company E


20 25 30 35 40 45 50 55 60

Company A

Company B

Company C

Company E

Company D

WorkS

HEET 2.1



The two-way frequency tableWhen we are examining the relationship between two categorical variables, the two-way frequency table is an excellent tool.

Consider the following example.

In the above example we have a very clear breakdown of data. We know how manyfemales preferred Labor, how many females preferred the Liberals, how many malespreferred Labor and how many males preferred the Liberals.

If we wish to compare the number of females who prefer Labor with the number ofmales who prefer Labor, we must be careful. While 12 males preferred Labor comparedto 18 females, there were, of course, fewer males than females being asked. That is,only 23 males were asked for their opinion, compared to 34 females.

To overcome this problem, we can express the figures in the table as percentages.

At a local shopping centre, 34 females, and 23 males were asked which of the two major political parties they preferred. Eighteen females and 12 males preferred Labor. Display these data in a two-way table.

THINK WRITE

Draw a table. Record the respondent’s sex in the columns and party preference in the rows of the table.

(a) We know that 34 female and 23 males were asked. Put this information into the table and fill in the total.(b) We also know that 18 females and 12 males preferred Labor. Put this information in the table and find the total of people who preferred Labor.

Fill in the remaining cells. For example, to find the number of females who preferred the Liberals, subtract the number of females preferring Labor from the total number of females asked:34 − 18 = 16.

1Party preference Female Male Total

Labor

Liberal

Total


Labor 18 12 30

Liberal

Total 34 23 57


Labor 18 12 30

Liberal 16 11 27

Total 34 23 57

4WORKEDExample



We could have calculated percentages from the table rows, rather than columns. To dothat we would, for example, have divided the number of females who preferred Labor(18) by the total number of people who preferred labor (30) and so on. The table belowshows this:

By doing this we have obtained the percentage of people who were female and pre-ferred Labor (60%), and the percentage of people who were male and preferred Labor(40%), and so on. This highlights facts different from those shown in the previoustable. In other words, different results can be obtained by calculating percentages froma table in different ways.

As a general rule, when the independent variable (in this case the respondent’s sex)is placed in the columns of the table, then the percentages should be calculated incolumns.

Party preference Female Male Total

Labor 60.0 40.0 100

Liberal 59.3 40.7 100

Party preference Female Male Total

Labor 18 12 30

Liberal 16 11 27

Total 34 23 57

THINK WRITE

Draw the table, omitting the ‘total’ column.

Fill in the table by expressing the number in each cell as a percentage of its column’s total. For example, to obtain the percentage of males who prefer Labor, we divide the number of males who prefer Labor by the total number of males and multiply by 100%.

× 100% = 52.5% (correct to 1 decimal place)

1Party preference Female Male

Labor

Liberal

Total

2

1223------

Party preference Female Male

Labor 52.9 52.2

Liberal 47.1 47.8

Total 100.0 100.0

5WORKEDExample

Fifty-seven people in a local shoppingcentre were asked whether they preferredthe Australian Labor Party or the LiberalParty. The results are given at right.Convert the numbers in this table topercentages.



Sixty-seven primary and 47 secondary school students were asked their attitude to the number of school holidays which should be given. They were asked whether there should be more, fewer or the same number. Five primary students and 2 secondary students wanted fewer holidays, 29 primary and 9 secondary students thought they had enough holidays (that is, they chose the same number) and the rest thought they needed to be given more holidays.

Present these data in percentage form in a two-way frequency table and use it to compare the opinions of the primary and the secondary students.

THINK WRITE

Put the data in a table. First fill in the given information, then find the missing information by subtracting the appropriate numbers from the totals.

Calculate the percentages. Since the independent variable (the level of the student, Primary or Secondary) has been placed in the columns of the table, we calculate the percentages in columns. For example, to obtain the percentage of primary students who wanted fewer holidays, divide the number of such students by the total number of primary students and multiply by 100%. That is, × 100% = 7.5%.Comment on the results. Secondary students were much keener on

having more holidays than were primary students.

1Attitude Primary Secondary Total

Fewer 5 2 7

Same 29 9 38

More 33 36 69

Total 67 47 114

2

567------

Attitude Primary Secondary

Fewer 7.5 4.3

Same 43.3 19.1

More 49.2 76.6

Total 100.0 100.0

3

6WORKEDExample

remember1. The two-way frequency table is an excellent tool for examining the relationship

between two categorical variables.2. If the total number of scores in each of the two categories is unequal,

percentages should be calculated in order to be able to analyse the table properly. When the independent variable is placed in the columns of the table, the percentages should be calculated in columns. That is, the numbers in each column should be expressed as a percentage of that column’s total.

remember



The two-way frequency table

1 In a survey, 139 women and 102 men were asked whether they approved or disapprovedof a proposed freeway. Thirty-seven women and 79 men approved of the freeway.Display these data in a two-way table (not as percentages).

2 Students at a secondary school were asked whether the length of lessons should be45 minutes or 1 hour. Ninety-three senior students (Years 10–12) were asked and 60preferred 1-hour lessons, whereas of the 86 junior students (Years 7–9), 36 preferred1-hour periods. Display these data in a two-way table (not as percentages).

3 For each of the following two-way frequency tables, complete the entries.

a

b

c

4 Sixty single men and women were asked whether they prefer to live alone, or to shareaccommodation with friends. The results are shown below.

Convert the numbers in this table to percentages.

Attitude Female Male Total

For 25 i 47

Against ii iii iv

Total 51 v 92

Attitude Female Male Total

For i ii 21

Against iii 21 iv

Total v 30 63

Party preference Female Male

Labor i 42%

Liberal 53% ii

Total iii iv

Rent preference Men Women Total

Live alone 12 23 35

Share with friends 9 16 25

Total 21 39 60

2D

EXCEL

Spreadsheet

Two-wayfrequencytable

WORKEDExample

4

WORKEDExample

5

SkillSH

EET 2.1



The information in the followingtwo-way frequency table relates toquestions 5 and 6. The data show thereactions of administrative staff andtechnical staff to an upgrade of thecomputer systems at a largecorporation.

5From the above table, we can conclude that:A 53% of administrative staff were for the upgradeB 37% of administrative staff were for the upgradeC 37% of administrative staff were against the upgradeD 59% of administrative staff were for the upgradeE 54% of administrative staff were against the upgrade

6From the above table, we can conclude that:A 98% of technical staff were for the upgradeB 65% of technical staff were for the upgradeC 76% of technical staff were for the upgradeD 31% of technical staff were against the upgradeE 14% of technical staff were against the upgrade

7 Delegates at the respective Liberal Party and Australian Labor Party conferences weresurveyed on whether or not they believed that marijuana should be legalised. Sixty-twoLiberal delegates were surveyed and 40 were against legalisation. Seventy-one Labordelegates were surveyed and 43 were against legalisation.

Present the data in percentage form in a two-way frequency table. Comment on anydifferences between the reactions of the Liberal and Labor delegates.

8 Sixty-one union workers were surveyed and asked whether the number of publicholidays should be reduced. Thirty-five supported a reduction. Fifty-nine non-unionworkers were also asked and 31 supported a reduction.

Present the data in percentage form in a two-way frequency table. Comment on anydifference between the reactions of the union and non-union workers.

AttitudeAdministrative

staffTechnical

staff Total

For 53 98 151

Against 37 31 68

Total 90 129 219



WORKEDExample

6



The scatterplotWe often want to know if there is some sort of relationship between two numericalvariables. A scatterplot, which gives a visual display of the relationship between twovariables, provides a good starting point.

Consider the data obtained from last year’s 12B class at Northbank Secondary Col-lege. Each student in this class of 29 students was asked to give an estimate of theaverage number of hours of study per week they did during Year 12. They were alsoasked the TER score they obtained.

The figure at right shows the data plotted on a scatterplot.It is reasonable to think that the number of hours of

study put in each week by students would affect theirTER scores and so the number of hours of study perweek is the independent variable and appears on thehorizontal axis. The TER score is the dependent variableand appears on the vertical axis.

There are 29 points on the scatterplot. Each pointrepresents the hours studied and the TER score of onestudent.

In analysing the scatterplot we look for a pattern inthe way the points lie. Certain patterns tell us that cer-tain relationships exist between the two variables. Thisis referred to as correlation. We look at what type ofcorrelation exists and how strong it is.

In the figure above right we see some sort of pattern: the points are spread in a roughcorridor from bottom left to top right. We refer to data following such a direction ashaving a positive relationship. This tells us that as the average number of hours studiedper week increases, the TER score increases.

Averagehours

of study TERscore

Averagehours

of study TERscore

Averagehours

of study TERscore

18 59 14 54 17 59

16 67 17 72 16 76

22 74 14 63 14 59

27 90 19 72 29 89

15 62 20 58 30 93

28 89 10 47 30 96

18 71 28 85 23 82

19 60 25 75 26 35

22 84 18 63 22 78

30 98 19 61

TE

R s

core

40

50

60

70

80

90

100

10 15 20 25 30

Average number of hours of study per week

(26, 35)


C h a p t e r 2 B i v a r i a t e d a t a 75The point (26, 35) is an outlier. It stands out because

it is well away from the other points and clearly is notpart of the ‘corridor’ referred to above. This outlier mayhave occurred because a student worked very hard butfound the VCE pretty tough or perhaps the student exag-gerated the number of hours he or she worked in a weekor perhaps there was a recording error. This needs to bechecked.

We could describe the rest of the data as having alinear form as the straight line in the diagram at rightindicates.

When describing the relationship between two vari-ables displayed on a scatterplot, we need to comment on:

(a) the direction — whether it is positive or negative(b) the form — whether it is linear or non-linear(c) the strength — whether it is strong, moderate or weak.Below is a gallery of scatterplots showing the various patterns we look for.

TE

R s

core

40

50

60

70

80

90

100

10 15 20 25 30

Average number of hours of study per week

Weak, positivelinear relationship

Moderate, positivelinear relationship

Strong, positivelinear relationship

Weak, negativelinear relationship

Moderate, negativelinear relationship

Strong, negativelinear relationship

Perfect, negativelinear relationship

No relationship Perfect, positivelinear relationship



The scatterplot at right shows the number of hours peoplespend at work each week and the number of hours peopleget to spend on recreational activities during the week.

Decide whether or not a relationship exists between thevariables and, if it does, comment on whether it is positiveor negative; weak, moderate or strong; and whether or notit has a linear form.THINK WRITE(a) The points on a scatterplot are spread in a

certain pattern, namely in a rough corridor from the top left to the bottom right corner. This tells us that as the work hours increase, the recreation hours decrease.

(b) The corridor is straight (that is, it would be reasonable to fit a straight line into it).

(c) The points are not too tight and not too dispersed either.

(d) The pattern resembles the central diagram in the gallery of scatterplots shown previously.

There is a moderate, negative linear relation-ship between the two variables.

Hou

rs f

or r

ecre

atio

n

5

10

15

20

25

10 20 30 40 50

Hours worked

60 70

7WORKEDExample

Data giving the average weekly number of hours studied by each student in 12B at Northbank Secondary College and the corresponding height of each student (to the nearest tenth of a metre) are given in the table below.

Construct a scatterplot for the data and use it to comment on the direction, form and strength of any relationship between the number of hours studied and the height of the students.

Averagehours

ofstudy

Height(m)

Averagehours

ofstudy

Height(m)

Averagehours

ofstudy

Height(m)

Averagehours

ofstudy

Height(m)

18 1.5 19 2.0 20 1.9 16 1.6

16 1.9 22 1.9 10 1.9 14 1.9

22 1.7 30 1.6 28 1.5 29 1.7

27 2.0 14 1.5 25 1.7 30 1.8

15 1.9 17 1.7 18 1.8 30 1.5

28 1.8 14 1.8 19 1.8 23 1.5

18 2.1 19 1.7 17 2.1 22 2.1

8WORKEDExample



Clearly, the number of hours you study for your VCE has no effect on how tall youmight be!

Note that when working with the scatterplot, to change settings at any time use. To identify the coordinates of individual points, use the key with

the arrow keys.

THINK WRITE/DISPLAY

Construct the scatterplot. In this case it is almost impossible to decide which is the independent variable and which is the dependent variable, and therefore on which axis we will place the variables. In such cases, placing either variable on either axis is reasonable.The scatterplot can be constructed using a graphics calculator:(a) Press and CLEAR any functions.(b) Press and select

4:PlotsOff. Press .(c) Press and select 1:Edit. Press .(d) Clear any existing lists and enter the list of

hours of study in L1 and the list of heights in L2.

(e) Press and select 1:Plot 1.(f) Press to turn the plot ON, and select

the first icon which indicates a scatterplot.(g) For Xlist, select L1 and for Ylist select L2 and

select the first symbol in Mark.(h) Press and select 9:ZoomStat.(i) Press to see the scatterplot.Comment on the direction of any relationship. There is no relationship; the points appear

to be randomly placed.Comment on the form of the relationship. There is no form, no linear trend, no

quadratic trend, just a random placement of points.

Comment on the strength of any relationship. Since there is no relationship, strength is not relevant.

1

2

Y=2nd [STAT PLOT]

ENTERSTAT ENTER

2nd [STAT PLOT]ENTER

ZOOMENTER

FM Fig 02.07

Average number of hours studied each week

Hei

ght (

m)

1.4

1.5

1.6

1.7

1.8

1.9

2.0

2.1

2.2

10 12 14 16 18 20 22 24 26 28 30

3

4

5

WINDOW TRACE

�

�



The scatterplot

Have your graphics calculator at hand for the following exercise questions.1 For each of the following pairs of variables, write down whether or not you would

reasonably expect a relationship to exist between the pair and, if so, comment onwhether it would be a positive or negative association.a Time spent in a supermarket and money spentb Income and value of car drivenc Number of children living in a house and time spent cleaning the housed Age and number of hours of competitive sport played per weeke Amount spent on petrol each week and distance travelled by car each weekf Number of hours spent in front of a computer each week and time spent playing the

piano each weekg Amount spent on weekly groceries and time spent gardening each week

remember1. When we are investigating if there is any sort of relationship between two

numerical variables, a scatterplot provides a useful starting point. It gives a visual display of the relationship between two such variables.

2. In analysing the scatterplot we look for a pattern in the way the points lie. Certain patterns tell us that certain relationships exist between the two variables. This is referred to as a correlation. We look at what type of correlation exists and how strong it is.

3. When describing the relationship between two variables displayed on a scatterplot, we need to comment on:(a) the direction — whether it is positive or negative(b) the form — whether it is linear or non-linear(c) the strength — whether it is strong, moderate or weak.

remember

2E


C h a p t e r 2 B i v a r i a t e d a t a 792 For each of the scatterplots below, describe whether or not a relationship exists between

the variables and, if it does, comment on whether it is positive or negative, whether it isweak, moderate or strong and whether or not it has a linear form.

3From the scatterplot shown at right, it would be reasonable to observe that:

A as the value of x increases, the value of y increases

B as the value of x increases, the value of y decreases

C as the value of x increases, the value of y remains the same

D as the value of x remains the same, the value of y increases

E there is no relationship between x and y

4 The population of a municipality (to the nearest hundred thousand) together withthe number of primary schools in that particular municipality is given below for11 municipalities.

Construct a scatterplot for the data and use it to comment on the direction, form andstrength of any relationship between the population and the number of primaryschools.

Population(000) 110 130 130 140 150 160 170 170 180 180 190

No. of primary schools 4 4 6 5 6 8 6 7 8 9 8

WORKEDExample

7

FM Fig 02.08a

Age

Hae

mog

lobi

n co

unt

10

1214

20 40 60 80

8

FM Fig 02.08b

Cigarettes smoked

Fitn

ess

leve

l

80100

120

0 10 20

60

Weekly hours of study

Mar

ks a

t sch

ool (

%)

0

20

4060

80

100

4 8 12 16

a b c

Hours spent gardening per week

Wee

kly

expe

nditu

re o

n ga

rden

ing

mag

azin

es (

$)

5

10

1520

25

0 5 10 15

Hours spentcooking per week

Hou

rs s

pent

usi

ng a

co

mpu

ter

per

wee

k

2

4

68

10

12

14

2 4 6 810121416Age

Tim

e un

der

wat

er (

s)

10

20

3040

50

60

70

5 10 15 20 25

d e f


y

x

WORKEDExample

8

EXCEL Spreadsheet

Scatterplot



5 The table below contains data giving the time taken for a paving job and the cost of the job.

Construct a scatterplot for the data. Comment on whether a relationship exists betweenthe time taken and the cost. If there is a relationship, describe it.

6 The table below shows the time of booking (how many days in advance) of the ticketsfor a musical performance and the corresponding row number in A-Reserve.

Construct a scatterplot for the data. Comment on whether a relationship exists betweenthe time of booking and the number of the row and, if there is a relationship, describe it.

Time taken(hours) 5 7 5 8 10 13 15 20 18 25 23

Cost ofjob ($) 1000 1000 1500 1200 2000 2500 2800 3200 2800 4000 3000

Time ofbooking

RowNo.

Time ofbooking

RowNo.

5 15 20 10

6 15 21 8

7 15 22 5

7 14 24 4

8 14 25 3

11 13 28 2

13 13 29 2

14 12 29 1

14 10 30 1

17 11 31 1



The q-correlation coefficientThe q-correlation coefficient is a measure of the strength of the association betweentwo variables. In the previous section we estimated the strength of association bylooking at a scatterplot and forming a judgment about whether the correlation betweenthe variables was positive or negative and whether the correlation was weak, moderateor strong. The calculation of the q-correlation coefficient aids us considerably inmaking that judgment.

To calculate the q-correlation coefficient:Step 1. Draw a scatterplot of the data.Step 2. Locate the median of the x-values. (If there are n points, the median is located

at the th place.) Draw a vertical line through this median value.

Step 3. Locate the median of the y-values and draw a horizontalline through this median value.

Step 4. The scatterplot is now divided into 4 sections orquadrants (hence the name ‘q’-correlation coefficient).(a) Label these sections A, B, C and D.(b) Count the number of points in each section.(c) Do not count points which are on the lines.(d) The number of points in section A is denoted by a, the number of points in

section B is denoted by b, and so on.Step 5. Calculate the q-correlation coefficient using the formula:

n 1+2

------------y

x

AB

C D

qa c+( ) b d+( )–a b c d+ + +

----------------------------------------=

Calculate the q-correlation coefficient for the data shown in the scatterplot at right.THINK WRITE

(a) Locate the median of the x-values. Note that we are talking here about the x-values of the data observations given. In the scatterplot shown there are 15 points. Each point has an x-value and a y-value. To find the median of the x-values we look for the horizontal middle point; that is, we

look for the = 8th point from

the left (from the right, the point will be the same).

(b) Draw a vertical line through this median value. Note that there are 7 points to the right of this line and 7 to the left.

y

x

1

15 1+2

---------------y

x

Medianof

x-values

9WORKEDExample

Continued over page



The value of the q-correlation coefficient in the above example indicates a strongcorrelation. The diagram below gives a rough guide to the strength of the correlationsuggested by the value of q.

THINK WRITE

(a) Locate the median of the y-values. This is done in a similar way to finding the median of the x-values except, instead of counting from the left or right, we count from the top or bottom to find the 8th point.

(b) Draw a horizontal line through this median value. Note that there are 7 points above this line and 7 below.

(a) Label the quadrants A, B, C and D.

(b) Count the number of points in each section. Do not count points that are on the lines.

a = 6, b = 0, c = 6, d = 1

Write the formula for calculating the q-coefficient.

Substitute the values of a, b, c and d into the formula and evaluate.

= 0.85 (correct to 2 decimal places)

2

y

x

3 y

x

A

DC

Ba = 6

d = 1c = 6

b = 0

4 qa c+( ) b d+( )–a b c d+ + +

----------------------------------------=

5q

6 6+( ) 0 1+( )–6 0 6 1+ + +

----------------------------------------=

1113------=

Strong positive association

–0.5

–0.25

0

0.25

0.5

0.75

1

–1

–0.75

}}}

}}}}

Moderate positive association

Weak positive association

No association

Weak negative association

Moderate negative association

Strong negative association

Val

ue o

f q


C h a p t e r 2 B i v a r i a t e d a t a 83The scatterplots below show three special values of the q-correlation coefficient.

q = q = q =

= 1 = –1 = 0The sign of the q-value indicates the direction of the relationship; that is, whether

there is a negative or positive association.In the cases shown above left and centre, the q-values are at both extremes. That is,

q = 1 and −1 respectively. We would describe the variables as showing a very strongassociation. Having said that, the points are not showing a strong linear form or, forthat matter, any linear form.

The q-correlation coefficient merely gives us an idea of which quadrants contain themost points; but beyond that, the points can be in any position in the quadrants. In thatsense, the q-correlation coefficient is a rather blunt instrument.

y

x

A

DC

B y

x

A

DC

B y

x

A

DC

B

(8 + 8) – (0 + 0)8 + 0 + 8 + 0

-------------------------------------- (0 + 0) – (8 + 8)0 + 8 + 0 + 8

-------------------------------------- (3 + 3) – (3 + 3)3 + 3 + 3 + 3

--------------------------------------

An investigation was made into the relationship between the time spent watching television in the week preceding a Maths test and the mark obtained (out of 20) in that Maths test. The following data were recorded.

Draw a scatterplot and calculate the q-correlation coefficient. Comment on the relationship between the two variables.

Time (h) Mark Time (h) Mark Time (h) Mark4 15 10 8 12 105 16 20 5 5 85 20 5 12 20 8

10 12 15 4 15 1015 8 15 12 20 10

THINK WRITE/DISPLAYDraw a scatterplot. We can use a graphics calculator to draw the scatterplot.(a) On the lists screen (press , select

and 1:Edit), enter the two lists of data into L1 and L2.

(b) Press and select 4:PlotsOff.(c) Press .(d) Press and select 1:Plot1.(e) Select On, and for Type, select the first icon

(scatterplot).(f) For Xlist, type in L1 (use [L1]); for Ylist,

type in L2; for Mark, select the first symbol.(g) Press and select 9:ZoomStat. The display

now shows the scatterplot.

1

STAT EDIT

2nd [STAT PLOT]ENTER2nd [STAT PLOT]

2nd

ZOOM

Time watching TV (hours)

Mat

hs m

ark

4

8

12

20

16

5 10 15 20 25

10WORKEDExample

Continued over page



THINK WRITE/DISPLAY

We can also use the graphicscalculator to help calculate q.(a) Press and

return to the home screen.(b) Press and

select 4:Vertical.(c) Press .(d) From the MATH menu,

select 4:median(.(e) Type L1 (use 2nd [L1])

at the prompt, then , and the scatterplot appears with the vertical median line drawn.

(f) Similarly, to create the horizontal median line, press and return to the home screen.

(g)Press and select 3:Horizontal.

(h) Press and from the MATH menu, select 4:median(.

(i) Type L2 at the prompt, press and the scatterplot appears with the horizontal median line drawn as well.

Count and record the number of points in each quadrant.

a = 1, b = 5, c = 2, d = 4

Write the formula for calculating the q-correlation coefficient.

Substitute the values of a, b, c and d into the formula and evaluate.

Comment on the relationship. There is moderate, negative association between the hours of television watched and the Maths mark obtained.The negative association means that as the number of hours of television watched prior to the test increased, the marks in the Maths test decreased. The moderate association suggests that it may be worth further investigating the association.

2

2nd [QUIT]

2nd [DRAW]

2nd [LIST]

ENTER

2nd [QUIT]

2nd [DRAW]

2nd [LIST]

ENTER

3

4 qa c+( ) b d+( )–a b c d+ + +

----------------------------------------=

5 q1 2+( ) 5 4+( )–1 5 2 4+ + +

----------------------------------------=

612------–=

0.5–=6



The q-correlation coefficient

1 Calculate the q-correlation coefficient for each of the sets of data shown on the scatter-plots below.

remember1. The q-correlation coefficient is a measure of the strength of the association

between two variables.2. To calculate the q-correlation coefficient:

Step 1. Draw a scatterplot of the data.Step 2. Locate the median of the x-values and draw a vertical line through this

median value.Step 3. Locate the median of the y-values and draw a horizontal line through

this median value.Step 4. (a) Label the sections thus formed A, B, C and D.

(b) Count the number of points in each section.(c) Do not count points which are on the lines.(d) (The number of points in section A is denoted

by a, and so on.)Step 5. Calculate the q-correlation coefficient using the formula:

3. The sign of the q-value indicates the direction of the relationship (whether there is a negative association or a positive association) while the size of it indicates the strength (whether the relationship is strong, moderate or weak).

4. The q-correlation coefficient gives us an idea of into which quadrants the points fall, but beyond that the points can be in any position in the quadrants. In that sense, the q-correlation coefficient is a rather blunt instrument.

y

x

AB

C D

qa c+( ) b d+( )–a b c d+ + +

----------------------------------------=

remember

2FWORKEDExample

9EXC

EL Spreadsheet

q-correlationy

x

y

x

y

x

a b c

y

x

y

x

y

x

d e f



2 The data given in the table below show the results of an investigation intothe mass and the height of a certain breed of dog.a Draw a scatterplot and calculate the q-correlation coefficient.b Comment on the relationship between the height and the mass of this

breed of dog.

3 The data in the table below show the number of hours spent by students who arelearning touch-typing and their corresponding speed in words per minute (wpm).a Using a graphics calculator or otherwise, calculate the q-correlation coefficient for

these data.b Comment on the relationship between the number of hours spent on learning and

the speed of typing.

4The q-correlation coefficient for data shown in the scatterplot at right is:

5

A researcher calculates the q-correlation coefficient for the relationship between time(in days) and the diameter (measured in mm) of a crystal that is changing in size. Thevalue is 0.82. Based on this, the correlation between time and the diameter of thecrystal could be described as:A strong and negativeB strong and positiveC weak and positiveD weak and negativeE moderate and positive

Height (cm) 41 40 35 38 43 44 37 39 42 44 31

Mass (kg) 4.5 5 4 3.5 5.5 5 5 4 4 6 3.5

Time(h) 20 33 22 39 40 37 46 44 24 36 50 48 29

Speed (wpm) 34 46 38 53 52 49 60 58 36 42 65 63 40

A B C D E

WORKEDExample

10


x

y

111------– 1

9---– 5

11------– 5

9--- 9

9---


WorkS

HEET 2.2



Pearson’s product–moment correlation coefficient

We saw in the previous exercise that the q-correlation coefficient was a rather bluntinstrument for measuring correlation between variables. A more precise tool isPearson’s product–moment correlation coefficient. This coefficient is used to measurethe strength of linear relationships between variables; the q-correlation coefficient, onthe other hand, can be used for both linear and non-linear relationships.

Pearson’s coefficient is therefore more specialised and can give us a much moreprecise picture of the strength of the linear relationship between two variables.

The symbol for Pearson’s product–moment correlation coefficient is r.Below is a gallery of scatterplots with the corresponding value of r for each.

The two extreme values of r (1 and −1) are shown in the first two diagrams respec-tively. It is interesting to compare these two scatterplots with those showing extremevalues (1 and −1) of q.

In the four diagrams above, the scatterplots thatshow matching values of q and r are placed sideby side. We see just how differently the points onthe scatterplots are arranged and note from thisthat the r value gives us a much sharperimpression of the relationship between thevariables. That is, a value of r = 1 means that thereis perfect linear association between the variables,which is not necessarily the case when q = 1!

r = 1 r = –1 r = 0 r = 0.7 r = –0.5

r = –0.9 r = 0.8 r = 0.3 r = –0.2

q = 1 r = 1 q = –1 r = –1

Strong positive linear association

–0.5

–0.25

0

0.25

0.5

0.75

1

–1

–0.75

}}}

}}}}

Moderate positive linear association

Weak positive linear association

No linear association

Weak negative linear association

Moderate negative linear association

Strong negative linear association

Val

ue o

f r



In describing the strength of the relationship between the variables, the rough guidewe used with the q-correlation coefficient can also be used with Pearson’s coefficient.The difference, of course, is that the value of r gives us a measure of the strength oflinear relationships specifically.

Note that the symbol ≈ means ‘aproximately equal to’. We use it instead of the = signto emphasise that the value (in this case r) is only an estimate.

In completing the worked example above, we notice that estimating the value ofr from a scatterplot is rather like making an informed guess. In the next section ofwork we will see how to obtain the actual value of r.

For each of the following:

i Estimate the value of Pearson’s product–moment correlation coefficient (r) from thescatterplot.

ii Use this to comment on the strength and direction of the relationship between the twovariables.

THINK WRITE

a Compare these scatterplots with those in the gallery of scatterplots shown previously and estimate the value of r.

a i r ≈ 0.9

Comment on the strength and direction of the relationship.

ii The relationship can be described as a strong, positive, linear relationship.

b Repeat steps 1 and 2 as in a. b i r ≈ −0.7ii The relationship can be described as a

moderate, negative, linear relationship.

c Repeat steps 1 and 2 as in a. c i r ≈ −0.1ii There is no linear relationship.

a b c

1

2

11WORKEDExample

remember1. Pearson’s product–moment correlation coefficient is used to measure the

strength of a linear relationship between two variables.2. The symbol for Pearson’s product–moment correlation coefficient is r.3. The estimate of r can be obtained from the scatterplot.

remember



Pearson’s product–moment correlation coefficient

1 What type of linear relationship does each of the following values of r suggest?

2 For each of the following:i Estimate the value of Pearson’s product–moment correlation coefficient (r), from

the scatterplot.ii Use this to comment on the strength and direction of the relationship between the

two variables.

3A set of data relating the variables x and y is found to have an r value of 0.62. Thescatterplot that could represent the data is:

4A set of data relating the variables x and y is found to have an r value of −0.45. A truestatement about the relationship between x and y is:A There is a strong linear relationship between x and y and when the x-values

increase, the y-values tend to increase also.B There is a moderate linear relationship between x and y and when the x-values

increase, the y-values tend to increase also.C There is a moderate linear relationship between x and y and when the x-values

increase, the y-values tend to decrease.D There is a weak linear relationship between x and y and when the x-values increase,

the y-values tend to increase also.E There is a weak linear relationship between x and y and when the x-values increase,

the y-values tend to decrease.

a 0.21 b 0.65 c −1 d −0.78e 1 f 0.9 g −0.34 h −0.1

A B C D E

2G

WORKEDExample

11

a b c d

e f g h





Calculating r and the coefficient of determination

Pearson’s product–moment correlation coefficientThe formula for calculating Pearson’s correlation coefficient r is as follows:

where n is the number of pairs of data in the setsx is the standard deviation of the x-valuessy is the standard deviation of the y-values

is the mean of the x-values is the mean of the y-values.

The calculation of r by hand using this formula is unnecessary. The calculation ofr is done far more efficiently using a graphics calculator.

There are two important limitations on the use of r. First, since r measures thestrength of a linear relationship, it would be inappropriate to calculate r for data whichare not linear — for example, data which a scatterplot shows to be in a quadratic form.

Second, outliers can bias the value of r. Consequently, if a set of linear data containsan outlier, then r is not a reliable measure of the strength of that linear relationship.

The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which do not have outliers.

With those two provisos, it is good practice to draw a scatterplot for a set of data tocheck for a linear form and an absence of outliers before r is calculated. Having a scat-terplot in front of you is also useful because it enables you to estimate what the valueof r will be — as you did in exercise 2G, and thus you can check that your workings onthe calculator are correct.

r1

n 1–------------

xi x–

sx-------------

yi y–

sy-------------

i 1=

n

∑=

xy

The heights (in centimetres) of 21 football players were recorded against the number of marks they took in a game of football. The data are shown in the table below.

Height (cm)Number of

marks taken Height (cm)Number of

marks taken184 6 182 7194 11 185 5185 3 183 9175 2 191 9186 7 177 3183 5 184 8174 4 178 4200 10 190 10188 9 193 12184 7 204 14188 6

12WORKEDExample



Correlation and causationIn worked example 12 we saw that r = 0.86. While we are entitled to say that there is astrong association between the height of a footballer and the number of marks he takes,we cannot assert that the height of a footballer causes him to take a lot of marks. Beingtall might assist in the taking of marks, but there will be many other factors whichcome into play — for example skill level, accuracy of passes from teammates, abilitiesof the opposing team, and so on.

So, while establishing a high degree of correlation between two variables is veryinteresting and can often flag the need for further, more detailed investigation, it in noway gives us any basis to comment on whether or not one variable causes particularvalues in another variable.

a Construct a scatterplot for the data.b Comment on the correlation between the heights of players and the number of marks

that they take, and estimate the value of r.c Calculate r and use it to comment on the relationship between the heights of players

and the number of marks they take in a game.

THINK WRITE/DISPLAY

a Using a graphics calculator, construct a scatterplot. Refer to worked example 8 in the section on scatterplots for directions on how to use the graphics calculator to draw a scatterplot.

a

b Comment on the correlation between the variables and estimate the value of r.

b The data show what appears to be a linear form of moderate strength.We might expect r ≈ 0.6.

c Because there is a linear form and there are no outliers, the calculation of r is appropriate.Calculate r, using a graphics calculator. The lists are in place from the scatterplot. Firstly press and select DiagnosticOn and press .Press and select CALC and 4:LinReg(ax+b).Press .LinReg(ax+b) appears. Type L1, L2. Press .

c

r = 0.86

The value of r = 0.86 indicates a strong positive linear relationship.

There is a strong positive linear association between the height of a player and the number of marks he takes in a game. That is, the taller the player the more marks we might expect him to take.

1

2nd [CATALOG]ENTER

STAT

ENTER

ENTER

2



The coefficient of determinationThe coefficient of determination is given by r2. Obviously, it is very easy to calculate — we merely square Pearson’s product–moment correlation coefficient (r).

1. The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.

2. The coefficient of determination provides a measure of how well the linear rule linking the two variables (x and y) predicts the value of y when we are given the value of x.

A set of data giving the number of police traffic patrols on duty and the number of fatalities for the region was recorded and a correlation coefficient of r = −0.8 was found. Calculate the coefficient of determination and interpret its value.THINK WRITE

Calculate the coefficient of determination by squaring the given value of r.

Coefficient of determination = r2

= (−0.8)2

= 0.64Interpret your result. We can conclude from this that 64% of the

variation in the number of fatalities can be explained by the variation in the number of police traffic patrols on duty. This means that the number of police traffic patrols on duty is a major factor in predicting the number of fatalities.

1

2

13WORKEDExample

remember1. The formula for calculating Pearson’s correlation coefficient r is as follows:

where n is the number of pairs of data in the setsx is the standard deviation of the x valuessy is the standard deviation of the y values

is the mean of the x-values is the mean of the y-values.

2. The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.

3. The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which do not have outliers.

4. Even if we find that two variables have a very high degree of correlation, for example r = 0.95, we cannot say that the value of one variable is caused by the value of the other variable.

5. The coefficient of determination = r2.6. The coefficient of determination is useful when we have two variables which

have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.

r1

n 1–------------

xi x–

sx-------------

yi y–

sy-------------

i 1=

n

∑=

xy

remember



Calculating r and the coefficient of determination

1 The yearly salary ($’000) and the number of votes polled in theBrownlow medal count are given below for 10 leading footballers.

a Construct a scatterplot for the data.b Comment on the correlation of salary and the number of votes and make an

estimate of r.c Calculate r and use it to comment on the relationship between yearly salary and

number of votes.

2 A set of data, obtained from 40 smokers, gives the number of cigarettes smoked per dayand the number of visits per year to the doctor. The Pearson’s correlation coefficient forthese data was found to be 0.87. Calculate the coefficient of determination for the dataand interpret its value.

3 Data giving the annual advertising budgets ($’000) and the yearly profit increases (%)of 8 companies are shown below.

a Construct a scatterplot for these data.b Comment on the correlation of the advertising budget and profit increase and make

an estimate of r.c Calculate r.d Calculate the coefficient of determination.e Write down the proportion of the variation in the yearly profit increase that can be

explained by the variation in the advertising budget.

4 Data showing the number of tourists visiting a small country in a month and thecorresponding average monthly exchange rate for the country’s currency against theAmerican dollar are given below.

Yearly salary($’000)

180 200 160 250 190 210 170 150 140 180

Numberof votes 24 15 33 10 16 23 14 21 31 28

Annual advertising budget ($’000) 11 14 15 17 20 25 25 27

Yearly profit increase (%) 2.2 2.2 3.2 4.6 5.7 6.9 7.9 9.3

Number of tourists (’000) 2 3 4 5 7 8 8 10

Exchange rate 1.2 1.1 0.9 0.9 0.8 0.8 0.7 0.6

2HWORKEDExample

12

EXCEL Spreadsheet

Pearson’sproduct-moment

correlation

GC program

BV stats

WORKEDExample

13



a Construct a scatterplot for the data.b Comment on the correlation between the number of tourists and the exchange rate

and give an estimate of r.c Calculate r.d Calculate the coefficient of determination.e Write down the proportion of the variation in the number of tourists that can be

explained by the exchange rate.

5 Data showing the number of people in 9 households against weekly grocery costs aregiven below.

a Construct a scatterplot for the data.b Comment on the correlation of the number of people in a household and the weekly

grocery costs and give an estimate of r.c Calculate r.d Calculate the coefficient of determination.e Write down the proportion of the variation in the weekly grocery costs that can be

explained by the variation in the number of people in a household.

6 Data showing the number of people on 8 fundraising committees and the annual fundsraised are given below.

a Construct a scatterplot for these data.b Comment on the correlation between the number of people on a committee and the

funds raised and make an estimate of r.c Calculate r.d Calculate the coefficient of determination.e Write down the proportion of the variation in the funds raised that can be explained

by the variation in the number of people on a committee.

The following information applies to questions 7 and 8. A set of data was obtained froma large group of women with children under 5 years of age. They were asked the numberof hours they worked per week and the amount of money they spent on childcare. The resultswere recorded and the value of Pearson’s correlation coefficient was found to be 0.92.

Number of people in household

2 5 6 3 4 5 2 6 3

Weekly grocery costs ($’s)

60 180 210 120 150 160 65 200 90

Number of people on committee

3 6 4 8 5 7 3 6

Annual funds raised ($’s)

4500 8500 6100 12 500 7200 10 000 4700 8800



Which of the following is not true?A The relationship between the number of working hours and the amount of money

spent on child-care is linear.B There is a positive correlation between the number of working hours and the

amount of money spent on child-care.C The correlation between the number of working hours and the amount of money

spent on child-care can be classified as strong.D As the number of working hours increases, the amount spent on child-care increases

as well.E The increase in the number of hours causes the increase in the amount of money

spent on child-care.

8Which of the following is not true?A The coefficient of determination is about 0.85.B The number of working hours is the major factor in predicting the amount of money

spent on child-care.C About 85% of the variation in the number of hours worked can be explained by the

variation in the amount of money spent on child-care.D Apart from number of hours worked, there could be other factors affecting the

amount of money spent on child-care.E About of the variation in the amount of money spent on child-care can be

explained by the variation in the number of hours worked.



1720------



Types of data• Bivariate data are data with two variables.• Numerical data involve quantities which are measurable or countable.• Categorical data are data divided into categories.• In a relationship involving two variables, if the values of one variable depend on the

values of another variable, then the former variable is referred to as the dependent variable and the latter variable is referred to as the independent variable.

• When data are displayed on a graph, the independent variable is placed on the horizontal axis and the dependent variable is placed on the vertical axis.

Back-to-back stem plots• A back-to-back stem plot displays bivariate data involving a numerical variable and

a categorical variable with two categories.• Together with summary statistics, back-to-back stem plots can be used to compare

the two distributions.

Parallel boxplots• To display a relationship between a numerical variable and a categorical variable

with more than two categories, we can use a parallel boxplot.• A parallel boxplot is obtained by constructing individual boxplots for each

distribution, using a common scale.

The two-way frequency table• The two-way frequency table is a tool for examining the relationship between two

categorical variables.• If the total number of scores in each of the two categories is unequal, percentages

should be calculated in order to be able to analyse the table properly.• When the independent variable is placed in the columns of the table, the numbers in

each column should be expressed as a percentage of that column’s total.

The scatterplot• A scatterplot gives a visual display of the relationship between two numerical

variables.• In analysing the scatterplot we look for a pattern in the way the points lie. Certain

patterns tell us that certain relationships exist between the two variables. This is referred to as a correlation. We look at what type of correlation exists and how strong it is.

• When describing the relationship between two variables displayed on a scatterplot, we need to comment on:(a) the direction — whether it is positive or negative(b) the form — whether it is linear or non-linear(c) the strength — whether it is strong, moderate or weak.

summary


C h a p t e r 2 B i v a r i a t e d a t a 97The q-correlation coefficient• The q-correlation coefficient gives us a measure of the strength of the association

between two variables.• To calculate the q-correlation coefficient:

Step 1. Draw a scatterplot of the data.Step 2. Locate the median of the x-values. Draw a vertical line through this median

value.Step 3. Locate the median of the y-values. Draw a horizontal line through this

median value.Step 4. The scatterplot is now divided into 4 sections or quadrants.

(a) Label these sections A, B, C and D.(b) Count the number of points in each section.(c) Do not count points which are on the lines.(d) The number of points in section A is denoted by a, the number of

points in section B is denoted by b, and so on.Step 5. Calculate the q-correlation coefficient, using the formula:

• The sign of the q-value indicates the direction of the relationship; that is, whether there is a negative association or a positive association. The magnitude of q indicates whether the relationship is strong, moderate or weak.

• The q-correlation coefficient gives us an idea of into which quadrants the points fall, but beyond that the points can be in any position in the quadrants. The q-correlation coefficient in that sense is a rather blunt instrument.

Pearson’s product–moment correlation coefficient• Pearson’s product–moment correlation coefficient is used to measure the strength of

a linear relationship between two variables.• The symbol for Pearson’s product–moment correlation coefficient is r.• The calculation of r is applicable to sets of bivariate data which are known to be

linear in form and which don’t have outliers.• The value of r can be estimated from the scatterplot.• The formula for calculating Pearson’s correlation coefficient r is as follows:

where n is the number of pairs of data in the setsx is the standard deviation of the x-valuessy is the standard deviation of the y-values

is the mean of the x-values is the mean of the y-values

• The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.

• Even if we find that two variables have a very high degree of correlation, for example r = 0.95, we cannot say that the value of one variable is caused by the value of the other variable.

Calculating the coefficient of determination• The coefficient of determination = r2.• The coefficient of determination is useful when we have two variables which have

a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.

y

x

AB

C D

qa c+( ) b d+( )–a b c d+ + +

----------------------------------------=

r1

n 1–------------

xi x–

sx-------------

yi y–

sy-------------

i 1=

n

∑=

xy



Multiple choice1 An example of a categorical variable is:

A the membership number of a clubB the number of students at each year level of a schoolC the total attendance at Hawthorn football matchesD the breathalyser reading of people in a restaurantE the monthly income for a group of people

2 In a study on the growth of plants, conducted in controlled surroundings, the dependent variable was the height of the plants. The independent variable in the study would be most likely:A the number of people caring for the plantsB the amount of light presentC the number of plants in the studyD whether the plants were deciduous or evergreenE rainfall

3 One of the following pairs of variables could not be displayed on a back-to-back stem plot. It is:A the heights of a group of students and whether or not they like footballB the kilometres travelled in a week and the mode of transport (car or train)C the weights of a group of students and their eye colour (blue or brown)D the annual number of trips to a doctor and whether or not the person is a smokerE the amount spent by each child at the tuckshop and the age of the child

4 A back-to-back stem plot is a useful way of displaying the relationship between:A the number of children attending a day care centre and whether or not the centre has

federal fundingB height and wrist circumferenceC age and weekly incomeD weight and the number of takeaway meals eaten each weekE the age of a car and amount spent each year on servicing it

The information below relates to questions 5 and 6. The salaries of people working at five different advertising companies are shown below on the parallel boxplots.

5 The company with the largest interquartile range is:A Company A B Company B C Company CD Company D E Company E

CHAPTERreview

2A

2A

2B

2B

100 150

Company D

50

Annual salary (× $1000)

Company E

Company C

Company B

Company A

10 20 30 40 60 70 80 90 110 120 130 140

2C


C h a p t e r 2 B i v a r i a t e d a t a 996 The company with the lowest median is:

Questions 7 and 8 relate to the following information. Data showing reactions of junior staff and senior staff to a relocation of offices are given below in a two-way frequency table.

7 From this table, we can conclude that:A 23% of junior staff were for the relocationB 42.6% of junior staff were for the relocationC 31% of junior staff were against the relocationD 62.1% of junior staff were for the relocationE 28.4% of junior staff were against the relocation

8 From this table, we can conclude that:A 14% of senior staff were for the relocationB 37.8% of senior staff were for the relocationC 12.8% of senior staff were for the relocationD 72% of senior staff were against the relocationE 74.5% of senior staff were against the relocation

9 The relationship between the variables x and y is shown on the scatterplot below.That correlation between x and y would be best described as:A a weak positive associationB a weak negative associationC a strong positive associationD a strong negative associationE non-existent

10 An investigation is made into the number of freckles on the back of a hand and the age of the subject. A strong association was found to exist. In this investigation, age is the independent variable and the number of freckles is the dependent variable. You would expect the association to be:

11 The q-correlation coefficient for data shown in the scatterplot above is:

12 A researcher calculates the q-correlation coefficient for the relationship between time (in days) and the growth of the root of a bean plant (measured in millimetres). The value is 0.62. Based on this, the correlation between time and the growth of the roots could be described as:

A Company A B Company B C Company CD Company D E Company E

Attitude Junior staff Senior staff Total

For 23 14 37

Against 31 41 72

Total 54 55 109

A negative B positive C bivariate D weak E categorical

A – B – C D E

A strong and negative B strong and positive C weak and positiveD weak and negative E moderate and positive

2C

2D

2D

2Ey

x

2E

y

x

2F511------ 5

9--- 5

11------ 5

9--- 2

9---

2F



13 A set of data relating the variables x and y is found to have an r value of −0.83. The scatterplot that could represent this data set is:

14 A set of data relating the variables x and y is found to have an r value of 0.65. A true statement about the relationship between x and y is:A There is a strong linear relationship between x and y and when the x-values increase, the

y-values tend to increase also.B There is a moderate linear relationship between x and y and when the x-values increase,

the y-values tend to increase also.C There is a moderate linear relationship between x and y and when the x-values increase,

the y-values tend to decrease.D There is a weak linear relationship between x and y and when the x-values increase, the

y-values tend to increase also.E There is a weak linear relationship between x and y and when the x-values increase, the

y-values tend to decrease.

15 A set of data comparing age with blood pressure is found to have a Pearson’s correlation coefficient of 0.86. The coefficient of determination for this data would be closest to:

16 The coefficient of determination for a set of data relating age and pulse rate is 0.7. This means that:A The correlation coefficient, r, for age against pulse rate is 0.7.B 70% of the variation in pulse rate can be explained by the variation in age.C 30% of the variation in pulse rate can be explained by the variation in age.D 49% of the variation in pulse rate can be explained by the variation in age.E 70% of those in the study had a pulse rate over 0.7.

Short answer1 For each of the following, write down:

i whether each variable in the pair is an example of numerical or categorical dataii which is a dependent and which is an independent variable or whether it is not appropriate

to classify the variables as such.a The number of injuries in a netball season and the age of a netball playerb The suburb and the size of a home mortgagec IQ and weight

A B C

D E

A −0.86 B −0.74 C −0.43 D 0.43 E 0.74

2Gy

x

y

x

y

x

y

x

y

x

2G

2H

2H

2A


C h a p t e r 2 B i v a r i a t e d a t a 1012 The number of hours of counselling received by a group of 9 full-time firefighters and

9 volunteer firefighters after a serious bushfire is given below.

a Construct a back-to-back stem plot to display the data.b Comment on the distributions of the number of hours of counselling of the full-time

firefighters and the volunteers.

3 The IQ of 8 players in 3 different football teams were recorded and are shown below.

Display the data in parallel boxplots.

4 Delegates at the respective Liberal and Labor Party conferences were surveyed on whether or not they believed that uranium mining should continue. Forty-five Liberal delegates were surveyed and 15 were against continuation. Fifty-three Labor delegates were surveyed and 43 were against continuation.a Present data in percentages in a two-way frequency table.b Comment on any difference between the reactions of the Liberal and Labor delegates.

5 a Construct a scatterplot for the data given in the table below.b Use the scatterplot to comment on any relationship which exists between the variables.

6 For the data given in question 5, calculate the q-correlation coefficient and use this to comment on the relationship between the two variables. (Compare your response about the relationship in this question to your response about the relationship in question 5 when you didn’t know the q-value).

7 For the variables shown on the scatterplot at right, give an estimate of the value of r and use it to comment on the nature of the relationship between the two variables.

8 The table below gives data relating the percentage of lectures attended by students in a semester and the corresponding mark for each student in the exam for that subject.

Full-time 2 4 3 5 2 4 6 1 3

Volunteer 8 10 11 11 12 13 13 14 15

Team A 120 105 140 116 98 105 130 102

Team B 110 104 120 109 106 95 102 100

Team C 121 115 145 130 120 114 116 123

Age 15 17 18 16 19 19 17 15 17

Pulse rate 79 74 75 85 82 76 77 72 70

Lectures attended (%) 70 59 85 93 78 85 84 69 70 82

Exam result(%) 80 62 89 98 84 91 83 72 75 85

2B

2C

2D

2E

2F

y

x

2G

2H



a Construct a scatterplot for these data.b Comment on the correlation between the lectures attended and the examination results and

make an estimate of r.c Calculate r.d Calculate the coefficient of determination.e Write down the proportion of the variation in the examination results that can be explained

by the variation in the number of lectures attended.

Analysis1 An investigation into the relationship between age and salary bracket among some employees

of a large computer company is made and the results are shown below.

a State, for each of the variables (age and salary bracket) whether they represent categorical or numerical data.

b State which is the independent variable and which is the dependent variable.c State which of the following you could use to display the data:

i back-to-back stem plotii parallel boxplotiii scatterplotiv two-way frequency table in percentage form

d State which of the following you could calculate in order to find out more about the relationship between age and salary bracket:

i the q-correlation coefficientii r, the Pearson product–moment correlation coefficientiii the coefficient of determination

2 An investigation similar to that in analysis task 1 is undertaken at an accounting firm to explore the relationship between age and salary. The data are shown below.

a State, for each of the variables (age and salary) whether they represent categorical or numerical data.

b Display the data on a scatterplot.c Describe the association between the two variables in terms of direction, form and

strength.d Calculate the value of q.e Explain whether or not it is appropriate to use Pearson’s correlation coefficient to explain

the relationship between age and salary.f Estimate the value of Pearson’s correlation coefficient from the scatterplot.g Calculate the value of this coefficient.h Explain whether or not the salary of the employees is determined by their age.i Calculate the value of the coefficient of determination.j Explain what the coefficient of determination tells us about the relationship between age

and salary at this accounting firm.

Salary bracket ($’000) Age

20–3940–5960–7980–99

100–120

32 21 43 23 22 27 3729 31 37 26 33 3741 29 39 42 47 45 43 3843 48 38 37 49 51 53 5948 37 55 61

Age 20 20 30 35 50 45 35 45 55 55 42 50 25 30 40

Salary (nearest thousand $’s) 20 40 20 30 40 80 40 60 100 70 45 85 30 60 60

testtest

CHAPTERyyourselfourselftestyyourselfourself

2


Documents

Bivariate data - iiNetmembers.ozemail.com.au/~rdunlop/CoplandMain/MathsApps/Modell… · the total attendance at Carlton football matches D position in a queue at the pie stall E