VCE coverage
Area of study: Units 3 & 4 • Data analysis

In this chapter
2A Types of data
2B Back-to-back stem plots
2C Parallel boxplots
2D The two-way frequency table
2E The scatterplot
2F The q-correlation coefficient
2G Pearson’s product–moment correlation coefficient
2H Calculating r and the coefficient of determination

2
Bivariate data
Ch 02 FM YR 12 Page 57 Friday, November 10, 2000 9:34 AM
F u r t h e r M a t h e m a t i c s
Types of data
In this chapter we look at sets of data which contain two variables. We look at ways of displaying the data and of measuring relationships between the two variables.
The methods we employ to do this depend entirely on the type of variables we are dealing with.
Numerical and categorical data
Examples of numerical data are:
1. the heights of a group of teenagers
2. the marks for a maths test
3. the number of universities in a country
4. ages
5. salaries.
As the name suggests, numerical data involve quantities which are, broadly speaking, measurable or countable.
Examples of categorical data are:
1. genders (sexes)
2. AFL football teams
3. religious denominations
4. finishing positions in the Melbourne Cup
5. municipalities
6. ratings of 1–5 to indicate preferences for 5 different cars
7. age groups, for example 0–9, 10–19, 20–29
8. hair colours.
Such categorical data, as the name suggests, have categories like masculine, feminine and neuter for gender, or Catholic, Anglican, Uniting, Baptist, Buddhist and so on for religious denomination, or 1st, 2nd, 3rd for finishing position in the Melbourne Cup.
Note: Some numbers may look like numerical data, but really be names or titles (for example, ratings of 1 to 5 given to different samples of cake, as in ‘This one’s a 4’; or the numbers on netball players’ uniforms, as in ‘she’s number 7’). These ‘titles’ are not countable; they place the subject in a category (with a name) and so are categorical.
In this chapter we look at ways of measuring the relationship between:
1. a numerical variable and a categorical variable (for example, weight and nationality)
2. two categorical variables (for example, gender and religious denomination)
3. two numerical variables (for example, height and weight).
Dependent and independent variables
When a relationship between two sets of variables is being examined, it is useful to know if one of the variables depends on the other. Often we can make a judgment about this but sometimes we can’t.
Consider the case where a study compared the heights of company employees against their annual salaries. Common sense would suggest that the height of a company employee would not depend on the person’s annual salary, nor would the annual salary of a company employee depend on the person’s height. In this case, it is not appropriate to designate one variable as independent and one as dependent.
In the case where the ages of company employees are compared with their annual salaries, you might reasonably expect that the annual salary of an employee would depend on the person’s age. In this case, the age of the employee is the independent variable and the salary of the employee is the dependent variable.
It is useful to identify the independent and dependent variables where possible, since it is the usual practice when displaying data on a graph to place the independent variable on the horizontal axis and the dependent variable on the vertical axis.
C h a p t e r 2 B i v a r i a t e d a t a
remember
1. Bivariate data are data with two variables.
2. Numerical data involve quantities that are measurable or countable.
3. Categorical data, as the name suggests, are data which are divided into categories.
4. In a relationship involving two variables, if the values of one variable ‘depend’ on the values of another variable, then the former variable is referred to as the dependent variable and the latter variable is referred to as the independent variable.
5. It is the usual practice when displaying data on a graph to place the independent variable on the horizontal axis and the dependent variable on the vertical axis.

Exercise 2A: Types of data
1 Write down whether each of the following represents numerical or categorical data.
a The heights in centimetres of a group of children
b The diameters in millimetres of a tray of ball-bearings
c The numbers of visitors to a display each day
d The modes of transport that students in Year 11 take to school
e The 10 most-watched television programs in a week
f The occupations of a group of 30-year-olds
g The numbers of subjects offered to VCE students at various schools
h Life expectancies
i Species of fish
j Blood groups
k Years of birth
l Countries of birth
m Tax brackets
2 For each of the following pairs of variables, write down which is independent and which is dependent. If it is not possible to identify this, then write ‘not appropriate’.
a The age of an AFL footballer and his annual salary
b The weight of a businessman and the number of business lunches he attends each week
c The growth of a plant and the amount of fertiliser it receives
d The number of books read in a week and the eye colour of the readers
e The voting intentions of a woman and her weekly consumption of red meat
f The number of members in a household and the size of their house
3 (multiple choice) An example of a numerical variable is:
A attitude to 4-yearly elections (for or against)
B year level of students
C the total attendance at Carlton football matches
D position in a queue at the pie stall
E television channel numbers shown on a dial
4 (multiple choice) In a study on mice, the dependent variable was the time (in days) for which the mice remained alive. The independent variable would most likely have been:
A the weight of the mice
B the amount of food eaten each day by the mice
C the daily dosage of an experimental drug given to the mice
D the number of mice
E the sex of the mice
Back-to-back stem plots
In chapter 1, we saw how to construct a stem plot for a set of univariate data. We can also extend a stem plot so that it displays bivariate data. Specifically, we shall create a stem plot that displays the relationship between a numerical variable and a categorical variable. We shall limit ourselves in this section to categorical variables with just two categories, for example sex. The two categories are used to provide two back-to-back leaves of a stem plot.
A back-to-back stem plot is used to display bivariate data, involving a numerical variable and a categorical variable with 2 categories.
WORKED Example 1
The girls and boys in Grade 4 at Kingston Primary School submitted projects on the Olympic Games. The marks they obtained out of 20 are given below.
Girls’ marks  16 17 19 15 12 16 17 19 19 16
Boys’ marks   14 15 16 13 12 13 14 13 15 14
Display the data on a back-to-back stem plot.

THINK / WRITE
1 THINK: Identify the highest and lowest scores in order to decide on the stems.
  WRITE: Highest score = 19
         Lowest score = 12
         Use a stem of 1, divided into fifths.
2 THINK: Create an unordered stem plot first. Put the boys’ scores on the left, and the girls’ scores on the right.
  WRITE:
  Key: 1|2 = 12
       Leaf          Leaf
       Boys   Stem   Girls
    3 2 3 3 |  1  | 2
  4 5 4 5 4 |  1  | 5
          6 |  1  | 6 7 6 7 6
            |  1  | 9 9 9
3 THINK: Now order the stem plot. The scores on the left should increase in value from right to left, while the scores on the right should increase in value from left to right.
  WRITE:
  Key: 1|2 = 12
       Leaf          Leaf
       Boys   Stem   Girls
    3 3 3 2 |  1  | 2
  5 5 4 4 4 |  1  | 5
          6 |  1  | 6 6 6 7 7
            |  1  | 9 9 9

The back-to-back stem plot allows us to make some visual comparisons of the two distributions. In the above example the centre of the distribution for the girls is higher than the centre of the distribution for the boys. The spread of each of the distributions seems to be about the same. For the boys, the marks are grouped around the 12–15 marks; for the girls, they are grouped around the 16–19 marks. On the whole, we can conclude that the girls obtained better marks than the boys did.
To get a more precise picture of the centre and spread of each of the distributions we can use the summary statistics discussed in chapter 1. Specifically, we are interested in:
1. the mean and the median (to measure the centre of the distributions), and
2. the interquartile range and the standard deviation (to measure the spread of the distributions).
We saw in chapter 1 that the calculation of these summary statistics is very straightforward and rapid using a graphics calculator.
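The ordering step in worked example 1 can also be sketched in code. The following Python sketch is an illustration only, not the textbook's method; the function name back_to_back and its split parameter are assumptions introduced here. It sorts the leaves outward from the stem on each side and divides each stem into fifths.

```python
from collections import defaultdict

def back_to_back(left, right, split=5):
    """Return the rows of an ordered back-to-back stem plot.

    Hypothetical helper: each stem (tens digit) is divided into `split`
    rows, so split=5 gives rows for leaves 0-1, 2-3, 4-5, 6-7 and 8-9.
    """
    width = 10 // split                      # leaf digits covered by one row
    rows = defaultdict(lambda: ([], []))     # row index -> (left leaves, right leaves)
    for side, data in enumerate((left, right)):
        for value in data:
            rows[value // width][side].append(value % 10)
    lines = []
    for idx in range(min(rows), max(rows) + 1):
        left_leaves, right_leaves = rows[idx]
        # Left leaves increase from right to left, right leaves from left to right.
        left_text = " ".join(str(d) for d in sorted(left_leaves, reverse=True))
        right_text = " ".join(str(d) for d in sorted(right_leaves))
        lines.append(f"{left_text:>12} | {idx // split} | {right_text}")
    return lines

girls = [16, 17, 19, 15, 12, 16, 17, 19, 19, 16]
boys = [14, 15, 16, 13, 12, 13, 14, 13, 15, 14]

plot = back_to_back(boys, girls)   # boys on the left, girls on the right
for line in plot:
    print(line)
```

With the boys passed as the left-hand data and the girls as the right-hand data, the rows produced should match the ordered stem plot in the worked example.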
WORKED Example 2
The number of ‘how to vote’ cards handed out by various Australian Labor Party and Liberal Party volunteers during the course of a polling day is shown below.
Labor    180 233 246 252 263 270 229 238 226 211
         193 202 210 222 257 247 234 226 214 204
Liberal  204 215 226 253 263 272 285 245 267 275
         287 273 266 233 244 250 261 272 280 279
Display the data using a back-to-back stem plot and use this, together with summary statistics, to compare the distributions of the number of cards handed out by the Labor and Liberal volunteers.

THINK / WRITE
1 THINK: Construct the stem plot.
  WRITE:
  Key: 18|0 = 180
      Leaf           Leaf
     Labor   Stem   Liberal
         0 |  18  |
         3 |  19  |
       4 2 |  20  | 4
     4 1 0 |  21  | 5
   9 6 6 2 |  22  | 6
     8 4 3 |  23  | 3
       7 6 |  24  | 4 5
       7 2 |  25  | 0 3
         3 |  26  | 1 3 6 7
         0 |  27  | 2 2 3 5 9
           |  28  | 0 5 7
2 THINK: Use a graphics calculator to calculate the summary statistics: the mean, the median, the standard deviation and the interquartile range. Enter each set of data as a separate list. (See chapter 1 on how to use your graphics calculator to calculate these values.)
  WRITE:
  For the Labor volunteers:
  Mean = 227.9
  Median = 227.5
  Interquartile range = 36
  Standard deviation = 23.9
  For the Liberal volunteers:
  Mean = 257.5
  Median = 264.5
  Interquartile range = 29.5
  Standard deviation = 23.4
3 THINK: Comment on the relationship.
  WRITE: From the stem plot we see that the Labor distribution is symmetric and therefore the mean and the median are very close, whereas the Liberal distribution is negatively skewed.
  Since the Liberal distribution is skewed, the median is a better indicator of the centre of the distribution than is the mean.
  Comparing the medians, therefore, we have the median number of cards handed out for Labor at 228 and for Liberal at 265, which is a big difference.
  The standard deviations were similar, as were the interquartile ranges. There was not a lot of difference in the spread of the data.
  In essence, the Liberal Party volunteers handed out a lot more ‘how to vote’ cards than the Labor Party volunteers did.
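The summary statistics quoted in worked example 2 can be checked away from the calculator. The following is a minimal Python sketch, not the textbook's method. The function name iqr and the variable names are illustrative assumptions, and the quartiles are taken as the medians of the lower and upper halves of the ordered data, which is one common convention among several.

```python
from statistics import mean, median, stdev

def iqr(data):
    """Interquartile range, assuming quartiles are the medians of the
    lower and upper halves of the ordered data (hypothetical helper)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])

labor = [180, 233, 246, 252, 263, 270, 229, 238, 226, 211,
         193, 202, 210, 222, 257, 247, 234, 226, 214, 204]
liberal = [204, 215, 226, 253, 263, 272, 285, 245, 267, 275,
           287, 273, 266, 233, 244, 250, 261, 272, 280, 279]

for name, cards in (("Labor", labor), ("Liberal", liberal)):
    # stdev is the sample standard deviation (divisor n - 1).
    print(f"{name}: mean = {mean(cards):.2f}, median = {median(cards)}, "
          f"IQR = {iqr(cards)}, standard deviation = {stdev(cards):.2f}")
```

Other quartile conventions can give slightly different interquartile ranges, so small differences from a particular calculator are possible.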
remember
1. A back-to-back stem plot displays bivariate data involving a numerical variable and a categorical variable with two categories.
2. In the ordered stem plot, the scores on the left side of the stem increase in value from right to left.
3. Together with summary statistics, back-to-back stem plots can be used for comparing two distributions.
Exercise 2B: Back-to-back stem plots
1 The marks (out of 50) obtained for the end-of-term test by the students in German and French classes are given below. Display the data on a back-to-back stem plot.
German  20 38 45 21 30 39 41 22 27 33 30 21 25 32 37 42 26 31 25 37
French  23 25 36 46 44 39 38 24 25 42 38 34 28 31 44 30 35 48 43 34
2 The birth masses of 10 boys and 10 girls (in kilograms, to the nearest 100 grams) are recorded in the table below. Display the data on a back-to-back stem plot.
Boys   3.4 5.0 4.2 3.7 4.9 3.4 3.8 4.8 3.6 4.3
Girls  3.0 2.7 3.7 3.3 4.0 3.1 2.6 3.2 3.6 3.1
3 The number of delivery trucks making deliveries to a supermarket each day over a 2-week period was recorded for two neighbouring supermarkets, supermarket A and supermarket B. The data are shown below.
A  11 15 20 25 12 16 21 27 16 17 17 22 23 24
B  10 15 20 25 30 35 16 31 32 21 23 26 28 29
a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the number of trucks delivering to supermarkets A and B.
4 The marks out of 20 for males and females on a science test for a Year-10 class are given below.
Females  12 13 14 14 15 15 16 17
Males    10 12 13 14 14 15 17 19
a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the marks of the males and the females.
5 The end-of-year English marks for 10 students in an English class were compared over 2 years. The marks for 1998 and for the same students in 1999 are shown below.
1998  30 31 35 37 39 41 41 42 43 46
1999  22 26 27 28 30 31 31 33 34 36
a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the marks obtained by the students in 1998 and 1999.
6 The age and sex of a group of people attending a fitness class are recorded below.
Female  23 24 25 26 27 28 30 31
Male    22 25 30 31 36 37 42 46
a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the ages of the female and male members of the fitness class.
7 The scores on a board game are recorded for a group of kindergarten children and for a group of children in a preparatory school.
Kindergarten  3 13 14 25 28 32 36 41 47 50
Prep. School  5 12 17 25 27 32 35 44 46 52
a Display the data on a back-to-back stem plot.
b Use the stem plot, together with some summary statistics, to compare the distributions of the scores of the kindergarten children and the preparatory school children.
8 (multiple choice) The pair of variables that could be displayed on a back-to-back stem plot is:
A the height of a student and the number of people in the student’s household
B the time put into completing an assignment and a pass or fail score on the assignment
C the weight of a businessman and his age
D the religion of an adult and the person’s head circumference
E the income bracket of an employee and the time the employee has worked for the company
9 (multiple choice) A back-to-back stem plot is a useful way of displaying the relationship between:
A the proximity to markets (km) and the cost of fresh foods on average per kilogram
B height and head circumference
C age and attitude to gambling (for or against)
D weight and age
E the money spent during a day of shopping and the number of shops visited on that day
Parallel boxplots
We saw in the previous section that we could display relationships between a numerical variable and a categorical variable with just two categories, using a back-to-back stem plot.
When we want to display a relationship between a numerical variable and a categorical variable with more than two categories, a parallel boxplot can be used.
A parallel boxplot is obtained by constructing individual boxplots for each distribution, using a common scale.
Construction of individual boxplots was discussed in detail in chapter 1 on univariate data. In this section we concentrate on comparing distributions represented by a number of boxplots (that is, on the interpretation of parallel boxplots).
WORKED Example 3
The four Year-7 classes at Western Secondary College complete the same end-of-year maths test. The marks, expressed as percentages for each of the students in the four classes, are given below.
7A  40 43 45 47 50 52 53 54 57 60 69 63 63 68 70 75 80 85 89 90
7B  60 62 63 64 70 73 74 76 77 77 78 82 85 87 89 90 92 95 97 97
7C  50 51 53 55 57 60 63 65 67 69 70 72 73 74 76 80 82 82 85 89
7D  40 42 43 45 50 53 55 59 60 61 69 73 74 75 80 81 82 83 84 90
Display the data using a parallel boxplot and use this to describe any similarities or differences in the distributions of the marks between the four classes.

THINK / WRITE/DISPLAY
1 THINK: Create the first boxplot (for class 7A) on a graphics calculator using 2nd [STAT PLOT] and appropriate WINDOW settings. Using TRACE to show key values, sketch the first boxplot using pen and paper, leaving room for three additional plots.
  DISPLAY: [Calculator screens: FM Fig SD 02.01a and FM Fig SD 02.01b]
2 THINK: Repeat step 1 for the other three classes. All four boxplots share the common scale.
  WRITE: [Figure: parallel boxplots for classes 7A, 7B, 7C and 7D on a common scale; horizontal axis: Maths mark (%), from 30 to 100]
3 THINK: Describe the similarities and differences between the four distributions.
  WRITE: Class 7B had the highest median mark and the range of the distribution was only 37. The lowest mark in 7B was 60.
  We notice that the median of 7A’s marks is approximately 60. So, 50% of students in 7A received less than 60. This means that half of 7A had scores that were less than the lowest score in 7B.
  The range of marks in 7A was about the same as that of 7D, with the highest scores in each about equal and the lowest scores in each about equal. However, the median mark in 7D was higher than the median mark in 7A so, despite a similar range, more students in 7D received a higher mark than in 7A.
  While 7D had a top score that was higher than that of 7C, the median score in 7C was higher than that of 7D and the bottom 25% of scores in 7D were less than the lowest score in 7C.
  In summary, 7B did best, followed by 7C, then 7D and finally 7A.
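Each boxplot in worked example 3 is drawn from a five-number summary, so the comparison can be cross-checked numerically. The Python sketch below is an illustration under stated assumptions: the helper name five_number_summary is hypothetical, and quartiles are taken as the medians of the lower and upper halves of the ordered data, which may differ slightly from a particular calculator's convention.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of scores.

    Hypothetical helper; quartiles are medians of the lower and upper halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    return (ordered[0], median(ordered[:half]), median(ordered),
            median(ordered[-half:]), ordered[-1])

classes = {
    "7A": [40, 43, 45, 47, 50, 52, 53, 54, 57, 60,
           69, 63, 63, 68, 70, 75, 80, 85, 89, 90],
    "7B": [60, 62, 63, 64, 70, 73, 74, 76, 77, 77,
           78, 82, 85, 87, 89, 90, 92, 95, 97, 97],
    "7C": [50, 51, 53, 55, 57, 60, 63, 65, 67, 69,
           70, 72, 73, 74, 76, 80, 82, 82, 85, 89],
    "7D": [40, 42, 43, 45, 50, 53, 55, 59, 60, 61,
           69, 73, 74, 75, 80, 81, 82, 83, 84, 90],
}

summaries = {name: five_number_summary(marks) for name, marks in classes.items()}
for name, summary in summaries.items():
    print(name, summary)

# Rank the classes by median mark, highest first.
ranking = sorted(summaries, key=lambda name: summaries[name][2], reverse=True)
print(ranking)
```

Ranking by median is only one way to summarise the comparison; the spread shown by each box and its whiskers still needs to be read from the plot or from the summaries.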
remember
1. A relationship between a numerical variable and a categorical variable with more than two categories can be displayed using a parallel boxplot.
2. A parallel boxplot is obtained by constructing individual boxplots for each distribution, using a common scale.
Exercise 2C: Parallel boxplots
1 The heights (in cm) of students in 9A, 10A and 11A were recorded and are shown in the table below.
9A   120 126 131 138 140 143 146 147 150 156 157 158 158 160 162 164 165 170
10A  140 143 146 147 149 151 153 156 162 164 165 167 168 170 173 175 176 180
11A  151 153 154 158 160 163 164 166 167 169 169 172 175 180 187 189 193 199
a Construct a parallel boxplot to show the data.
b Use the boxplot to compare the distributions of height for the 3 classes.
2 The amounts of money contributed annually to superannuation schemes by people in 3 different age groups are shown below.
20–29  2000 3100 5000 5500 6200 6500 6700 7000 9200 10 000
30–39  4000 5200 6000 6300 6800 7000 8000 9000 10 300 12 000
40–49  10 000 11 200 12 000 13 300 13 500 13 700 13 900 14 000 14 300 15 000
a Construct a parallel boxplot to show the data.
b Use the boxplot to comment on the distributions.
3 The numbers of jars of vitamin A, B, C and multi-vitamins sold per week by a local chemist are shown below.
Vitamin A       5 6 7 7 8 8 9 11 13 14
Vitamin B       10 10 11 12 14 15 15 15 17 19
Vitamin C       8 8 9 9 9 10 11 12 12 13
Multi-vitamins  12 13 13 15 16 16 17 19 19 20
Construct a parallel boxplot to display the data and use it to compare the distributions of sales for the 4 types of vitamin.
4 (multiple choice) The ages of the employees at 5 different companies of the same size are compared using the parallel boxplots shown below.
[Figure: parallel boxplots of employee ages for Company A, Company B, Company C, Company D and Company E on a common scale from 20 to 60]
For each of the following, select from:
A company A
B company B
C company C
D company D
E company E
a Which company has the greatest range of ages?
b Which company has the greatest interquartile range of ages?
c Which company has the lowest median age?
d Which company has the greatest range of ages among their oldest 25% of employees?
The two-way frequency table
When we are examining the relationship between two categorical variables, the two-way frequency table is an excellent tool.
Consider the following example.

WORKED Example 4
At a local shopping centre, 34 females and 23 males were asked which of the two major political parties they preferred. Eighteen females and 12 males preferred Labor. Display these data in a two-way table.

THINK / WRITE
1 THINK: Draw a table. Record the respondent’s sex in the columns and party preference in the rows of the table.
  WRITE:
  Party preference  Female  Male  Total
  Labor
  Liberal
  Total
2 THINK: (a) We know that 34 females and 23 males were asked. Put this information into the table and fill in the total.
  (b) We also know that 18 females and 12 males preferred Labor. Put this information in the table and find the total of people who preferred Labor.
  WRITE:
  Party preference  Female  Male  Total
  Labor               18     12    30
  Liberal
  Total               34     23    57
3 THINK: Fill in the remaining cells. For example, to find the number of females who preferred the Liberals, subtract the number of females preferring Labor from the total number of females asked: 34 − 18 = 16.
  WRITE:
  Party preference  Female  Male  Total
  Labor               18     12    30
  Liberal             16     11    27
  Total               34     23    57

In the above example we have a very clear breakdown of data. We know how many females preferred Labor, how many females preferred the Liberals, how many males preferred Labor and how many males preferred the Liberals.
If we wish to compare the number of females who prefer Labor with the number of males who prefer Labor, we must be careful. While 12 males preferred Labor compared to 18 females, there were, of course, fewer males than females being asked. That is, only 23 males were asked for their opinion, compared to 34 females.
To overcome this problem, we can express the figures in the table as percentages.

WORKED Example 5
Fifty-seven people in a local shopping centre were asked whether they preferred the Australian Labor Party or the Liberal Party. The results are given below. Convert the numbers in this table to percentages.
  Party preference  Female  Male  Total
  Labor               18     12    30
  Liberal             16     11    27
  Total               34     23    57

THINK / WRITE
1 THINK: Draw the table, omitting the ‘total’ column.
  WRITE:
  Party preference  Female  Male
  Labor
  Liberal
  Total
2 THINK: Fill in the table by expressing the number in each cell as a percentage of its column’s total. For example, to obtain the percentage of males who prefer Labor, we divide the number of males who prefer Labor by the total number of males and multiply by 100%.
  WRITE: 12/23 × 100% = 52.2% (correct to 1 decimal place)
  Party preference  Female  Male
  Labor              52.9   52.2
  Liberal            47.1   47.8
  Total             100.0  100.0

We could have calculated percentages from the table rows, rather than columns. To do that we would, for example, have divided the number of females who preferred Labor (18) by the total number of people who preferred Labor (30) and so on. The table below shows this:
  Party preference  Female  Male  Total
  Labor              60.0   40.0   100
  Liberal            59.3   40.7   100
By doing this we have obtained the percentage of those preferring Labor who were female (60%), the percentage of those preferring Labor who were male (40%), and so on. This highlights facts different from those shown in the previous table. In other words, different results can be obtained by calculating percentages from a table in different ways.
As a general rule, when the independent variable (in this case the respondent’s sex) is placed in the columns of the table, then the percentages should be calculated in columns.
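The column percentages in worked example 5 can be reproduced with a few lines of code. This Python sketch is illustrative only; the nested-dictionary layout and the function name column_percentages are assumptions made for the example, with each column of the table stored under the category of the independent variable.

```python
# Counts from the two-way table: one inner dictionary per column
# (respondent's sex), keyed by row (party preference).
counts = {
    "Female": {"Labor": 18, "Liberal": 16},
    "Male": {"Labor": 12, "Liberal": 11},
}

def column_percentages(table):
    """Express each cell as a percentage of its column total (1 decimal place)."""
    result = {}
    for column, cells in table.items():
        total = sum(cells.values())
        result[column] = {row: round(100 * n / total, 1) for row, n in cells.items()}
    return result

percentages = column_percentages(counts)
for column, cells in percentages.items():
    print(column, cells)
```

Storing the table by column keeps the calculation aligned with the general rule of working in columns when the independent variable is in the columns; row percentages would need the totals taken across the inner dictionaries instead.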
WORKED Example 6
Sixty-seven primary and 47 secondary school students were asked their attitude to the number of school holidays which should be given. They were asked whether there should be more, fewer or the same number. Five primary students and 2 secondary students wanted fewer holidays, 29 primary and 9 secondary students thought they had enough holidays (that is, they chose the same number) and the rest thought they needed to be given more holidays.
Present these data in percentage form in a two-way frequency table and use it to compare the opinions of the primary and the secondary students.

THINK / WRITE
1 THINK: Put the data in a table. First fill in the given information, then find the missing information by subtracting the appropriate numbers from the totals.
  WRITE:
  Attitude  Primary  Secondary  Total
  Fewer        5         2         7
  Same        29         9        38
  More        33        36        69
  Total       67        47       114
2 THINK: Calculate the percentages. Since the independent variable (the level of the student, Primary or Secondary) has been placed in the columns of the table, we calculate the percentages in columns. For example, to obtain the percentage of primary students who wanted fewer holidays, divide the number of such students by the total number of primary students and multiply by 100%. That is, 5/67 × 100% = 7.5%.
  WRITE:
  Attitude  Primary  Secondary
  Fewer       7.5       4.3
  Same       43.3      19.1
  More       49.2      76.6
  Total     100.0     100.0
3 THINK: Comment on the results.
  WRITE: Secondary students were much keener on having more holidays than were primary students.
remember
1. The two-way frequency table is an excellent tool for examining the relationship between two categorical variables.
2. If the total number of scores in each of the two categories is unequal, percentages should be calculated in order to be able to analyse the table properly. When the independent variable is placed in the columns of the table, the percentages should be calculated in columns. That is, the numbers in each column should be expressed as a percentage of that column’s total.
Exercise 2D: The two-way frequency table
1 In a survey, 139 women and 102 men were asked whether they approved or disapproved of a proposed freeway. Thirty-seven women and 79 men approved of the freeway. Display these data in a two-way table (not as percentages).
2 Students at a secondary school were asked whether the length of lessons should be 45 minutes or 1 hour. Ninety-three senior students (Years 10–12) were asked and 60 preferred 1-hour lessons, whereas of the 86 junior students (Years 7–9), 36 preferred 1-hour periods. Display these data in a two-way table (not as percentages).
3 For each of the following two-way frequency tables, complete the entries.
a
  Attitude  Female  Male  Total
  For         25     i     47
  Against     ii    iii    iv
  Total       51     v     92
b
  Attitude  Female  Male  Total
  For          i     ii    21
  Against    iii     21    iv
  Total        v     30    63
c
  Party preference  Female  Male
  Labor                i     42%
  Liberal            53%     ii
  Total              iii     iv
4 Sixty single men and women were asked whether they prefer to live alone, or to share accommodation with friends. The results are shown below.
  Rent preference     Men  Women  Total
  Live alone           12    23     35
  Share with friends    9    16     25
  Total                21    39     60
Convert the numbers in this table to percentages.
The information in the following two-way frequency table relates to questions 5 and 6. The data show the reactions of administrative staff and technical staff to an upgrade of the computer systems at a large corporation.
  Attitude  Administrative staff  Technical staff  Total
  For                53                  98          151
  Against            37                  31           68
  Total              90                 129          219
5 (multiple choice) From the above table, we can conclude that:
A 53% of administrative staff were for the upgrade
B 37% of administrative staff were for the upgrade
C 37% of administrative staff were against the upgrade
D 59% of administrative staff were for the upgrade
E 54% of administrative staff were against the upgrade
6 (multiple choice) From the above table, we can conclude that:
A 98% of technical staff were for the upgrade
B 65% of technical staff were for the upgrade
C 76% of technical staff were for the upgrade
D 31% of technical staff were against the upgrade
E 14% of technical staff were against the upgrade
7 Delegates at the respective Liberal Party and Australian Labor Party conferences were surveyed on whether or not they believed that marijuana should be legalised. Sixty-two Liberal delegates were surveyed and 40 were against legalisation. Seventy-one Labor delegates were surveyed and 43 were against legalisation.
Present the data in percentage form in a two-way frequency table. Comment on any differences between the reactions of the Liberal and Labor delegates.
8 Sixty-one union workers were surveyed and asked whether the number of public holidays should be reduced. Thirty-five supported a reduction. Fifty-nine non-union workers were also asked and 31 supported a reduction.
Present the data in percentage form in a two-way frequency table. Comment on any difference between the reactions of the union and non-union workers.
The scatterplot
We often want to know if there is some sort of relationship between two numerical variables. A scatterplot, which gives a visual display of the relationship between two variables, provides a good starting point.
Consider the data obtained from last year’s 12B class at Northbank Secondary College. Each student in this class of 29 students was asked to give an estimate of the average number of hours of study per week they did during Year 12. They were also asked the TER score they obtained.

Average hours  TER    Average hours  TER    Average hours  TER
of study       score  of study       score  of study       score
18             59     14             54     17             59
16             67     17             72     16             76
22             74     14             63     14             59
27             90     19             72     29             89
15             62     20             58     30             93
28             89     10             47     30             96
18             71     28             85     23             82
19             60     25             75     26             35
22             84     18             63     22             78
30             98     19             61

The figure at right shows the data plotted on a scatterplot.
[Figure: scatterplot of TER score (vertical axis) against average number of hours of study per week (horizontal axis, 10 to 30); the point (26, 35) is labelled]
It is reasonable to think that the number of hours of study put in each week by students would affect their TER scores and so the number of hours of study per week is the independent variable and appears on the horizontal axis. The TER score is the dependent variable and appears on the vertical axis.
There are 29 points on the scatterplot. Each point represents the hours studied and the TER score of one student.
In analysing the scatterplot we look for a pattern in the way the points lie. Certain patterns tell us that certain relationships exist between the two variables. This is referred to as correlation. We look at what type of correlation exists and how strong it is.
In the figure above right we see some sort of pattern: the points are spread in a rough corridor from bottom left to top right. We refer to data following such a direction as having a positive relationship. This tells us that as the average number of hours studied per week increases, the TER score increases.
Ch 02 FM YR 12 Page 74 Friday, November 10, 2000 9:34 AM
C h a p t e r 2 B i v a r i a t e d a t a 75The point (26, 35) is an outlier. It stands out because
it is well away from the other points and clearly is notpart of the ‘corridor’ referred to above. This outlier mayhave occurred because a student worked very hard butfound the VCE pretty tough or perhaps the student exag-gerated the number of hours he or she worked in a weekor perhaps there was a recording error. This needs to bechecked.
We could describe the rest of the data as having alinear form as the straight line in the diagram at rightindicates.
When describing the relationship between two vari-ables displayed on a scatterplot, we need to comment on:
(a) the direction — whether it is positive or negative(b) the form — whether it is linear or non-linear(c) the strength — whether it is strong, moderate or weak.Below is a gallery of scatterplots showing the various patterns we look for.
[Figure: the same scatterplot of TER score against average number of hours of study per week, with a straight line drawn through the corridor of points.]
[Figure: gallery of scatterplots. Panels: weak, positive linear relationship; moderate, positive linear relationship; strong, positive linear relationship; weak, negative linear relationship; moderate, negative linear relationship; strong, negative linear relationship; perfect, negative linear relationship; no relationship; perfect, positive linear relationship.]
Worked example 7
The scatterplot at right shows the number of hours people spend at work each week and the number of hours people get to spend on recreational activities during the week.
Decide whether or not a relationship exists between the variables and, if it does, comment on whether it is positive or negative; weak, moderate or strong; and whether or not it has a linear form.
[Figure: scatterplot of hours for recreation (vertical axis, 5 to 25) against hours worked (horizontal axis, 10 to 70).]
THINK
(a) The points on the scatterplot are spread in a certain pattern, namely in a rough corridor from the top left to the bottom right corner. This tells us that as the work hours increase, the recreation hours decrease.
(b) The corridor is straight (that is, it would be reasonable to fit a straight line into it).
(c) The points are not too tight and not too dispersed either.
(d) The pattern resembles the central diagram in the gallery of scatterplots shown previously.
WRITE
There is a moderate, negative linear relationship between the two variables.
Worked example 8
Data giving the average weekly number of hours studied by each student in 12B at Northbank Secondary College and the corresponding height of each student (to the nearest tenth of a metre) are given in the table below.

Average hours   Height   Average hours   Height   Average hours   Height   Average hours   Height
of study        (m)      of study        (m)      of study        (m)      of study        (m)
18              1.5      19              2.0      20              1.9      16              1.6
16              1.9      22              1.9      10              1.9      14              1.9
22              1.7      30              1.6      28              1.5      29              1.7
27              2.0      14              1.5      25              1.7      30              1.8
15              1.9      17              1.7      18              1.8      30              1.5
28              1.8      14              1.8      19              1.8      23              1.5
18              2.1      19              1.7      17              2.1      22              2.1

Construct a scatterplot for the data and use it to comment on the direction, form and strength of any relationship between the number of hours studied and the height of the students.
THINK
1 Construct the scatterplot. In this case it is almost impossible to decide which is the independent variable and which is the dependent variable, and therefore on which axis we will place the variables. In such cases, placing either variable on either axis is reasonable.
2 The scatterplot can be constructed using a graphics calculator:
(a) Press Y= and CLEAR any functions.
(b) Press 2nd [STAT PLOT] and select 4:PlotsOff. Press ENTER.
(c) Press STAT and select 1:Edit. Press ENTER.
(d) Clear any existing lists and enter the list of hours of study in L1 and the list of heights in L2.
(e) Press 2nd [STAT PLOT] and select 1:Plot 1.
(f) Press ENTER to turn the plot ON, and select the first icon, which indicates a scatterplot.
(g) For Xlist, select L1 and for Ylist select L2, and select the first symbol in Mark.
(h) Press ZOOM and select 9:ZoomStat.
(i) Press ENTER to see the scatterplot.
3 Comment on the direction of any relationship.
4 Comment on the form of the relationship.
5 Comment on the strength of any relationship.
WRITE/DISPLAY
1, 2 [Figure: scatterplot of height (m) (vertical axis, 1.4 to 2.2) against average number of hours studied each week (horizontal axis, 10 to 30).]
3 There is no relationship; the points appear to be randomly placed.
4 There is no form, no linear trend, no quadratic trend, just a random placement of points.
5 Since there is no relationship, strength is not relevant.

Clearly, the number of hours you study for your VCE has no effect on how tall you might be!
Note that when working with the scatterplot, to change settings at any time use WINDOW. To identify the coordinates of individual points, use the TRACE key with the arrow keys.
remember
1. When we are investigating if there is any sort of relationship between two numerical variables, a scatterplot provides a useful starting point. It gives a visual display of the relationship between two such variables.
2. In analysing the scatterplot we look for a pattern in the way the points lie. Certain patterns tell us that certain relationships exist between the two variables. This is referred to as a correlation. We look at what type of correlation exists and how strong it is.
3. When describing the relationship between two variables displayed on a scatterplot, we need to comment on:
(a) the direction: whether it is positive or negative
(b) the form: whether it is linear or non-linear
(c) the strength: whether it is strong, moderate or weak.

Exercise 2E The scatterplot
Have your graphics calculator at hand for the following exercise questions.
1 For each of the following pairs of variables, write down whether or not you would reasonably expect a relationship to exist between the pair and, if so, comment on whether it would be a positive or negative association.
a Time spent in a supermarket and money spent
b Income and value of car driven
c Number of children living in a house and time spent cleaning the house
d Age and number of hours of competitive sport played per week
e Amount spent on petrol each week and distance travelled by car each week
f Number of hours spent in front of a computer each week and time spent playing the piano each week
g Amount spent on weekly groceries and time spent gardening each week
2 For each of the scatterplots below, describe whether or not a relationship exists between the variables and, if it does, comment on whether it is positive or negative, whether it is weak, moderate or strong and whether or not it has a linear form.
3 multiple choice
From the scatterplot shown at right, it would be reasonable to observe that:
A as the value of x increases, the value of y increases
B as the value of x increases, the value of y decreases
C as the value of x increases, the value of y remains the same
D as the value of x remains the same, the value of y increases
E there is no relationship between x and y
4 The population of a municipality (to the nearest ten thousand) together with the number of primary schools in that particular municipality is given below for 11 municipalities.

Population (’000)        110  130  130  140  150  160  170  170  180  180  190
No. of primary schools   4    4    6    5    6    8    6    7    8    9    8

Construct a scatterplot for the data and use it to comment on the direction, form and strength of any relationship between the population and the number of primary schools.
[Figures for question 2: six scatterplots. a Haemoglobin count against age. b Fitness level against cigarettes smoked. c Marks at school (%) against weekly hours of study. d Weekly expenditure on gardening magazines ($) against hours spent gardening per week. e Hours spent using a computer per week against hours spent cooking per week. f Time under water (s) against age.]
[Figure for question 3: scatterplot of y against x.]
5 The table below contains data giving the time taken for a paving job and the cost of the job.

Time taken (hours)   5     7     5     8     10    13    15    20    18    25    23
Cost of job ($)      1000  1000  1500  1200  2000  2500  2800  3200  2800  4000  3000

Construct a scatterplot for the data. Comment on whether a relationship exists between the time taken and the cost. If there is a relationship, describe it.
6 The table below shows the time of booking (how many days in advance) of the tickets for a musical performance and the corresponding row number in A-Reserve.

Time of booking   Row No.   Time of booking   Row No.
5                 15        20                10
6                 15        21                8
7                 15        22                5
7                 14        24                4
8                 14        25                3
11                13        28                2
13                13        29                2
14                12        29                1
14                10        30                1
17                11        31                1

Construct a scatterplot for the data. Comment on whether a relationship exists between the time of booking and the number of the row and, if there is a relationship, describe it.
The q-correlation coefficient
The q-correlation coefficient is a measure of the strength of the association between two variables. In the previous section we estimated the strength of association by looking at a scatterplot and forming a judgment about whether the correlation between the variables was positive or negative and whether the correlation was weak, moderate or strong. The calculation of the q-correlation coefficient aids us considerably in making that judgment.
To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values. (If there are n points, the median is located at the $\frac{n + 1}{2}$th place.) Draw a vertical line through this median value.
Step 3. Locate the median of the y-values and draw a horizontal line through this median value.
Step 4. The scatterplot is now divided into 4 sections or quadrants (hence the name ‘q’-correlation coefficient).
(a) Label these sections A, B, C and D.
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) The number of points in section A is denoted by a, the number of points in section B is denoted by b, and so on.
[Figure: x- and y-axes with the two median lines dividing the plot into four quadrants: A upper right, B upper left, C lower left, D lower right.]
Step 5. Calculate the q-correlation coefficient using the formula:
$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
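The five steps above can be sketched in Python as a rough check on a hand calculation. This is a minimal sketch, not the textbook's method: the function names and the seven sample points are hypothetical, and the quadrant labelling (A upper right, B upper left, C lower left, D lower right) is an assumption chosen so that a positive q corresponds to a positive association.

```python
from statistics import median


def quadrant_counts(points):
    """Return (a, b, c, d), the number of points in quadrants A, B, C and D.

    Assumed labelling: A upper right, B upper left, C lower left,
    D lower right. Points on either median line are not counted (Step 4c).
    """
    x_med = median(x for x, _ in points)  # Step 2: median of the x-values
    y_med = median(y for _, y in points)  # Step 3: median of the y-values
    a = b = c = d = 0
    for x, y in points:
        if x == x_med or y == y_med:
            continue  # the point lies on a median line
        if x > x_med and y > y_med:
            a += 1
        elif x < x_med and y > y_med:
            b += 1
        elif x < x_med and y < y_med:
            c += 1
        else:
            d += 1
    return a, b, c, d


def q_correlation(points):
    """Step 5: q = ((a + c) - (b + d)) / (a + b + c + d)."""
    a, b, c, d = quadrant_counts(points)
    total = a + b + c + d
    if total == 0:
        raise ValueError("every point lies on a median line")
    return ((a + c) - (b + d)) / total


# Hypothetical illustrative data, not taken from the text.
sample = [(1, 2), (2, 6), (3, 4), (4, 3), (5, 1), (6, 5), (7, 7)]
print(quadrant_counts(sample))
print(q_correlation(sample))
```

For an even number of points, `statistics.median` averages the two middle values, which matches locating the median at the (n + 1)/2th place.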
Worked example 9
Calculate the q-correlation coefficient for the data shown in the scatterplot at right.
[Figure: scatterplot of 15 points, y against x.]
THINK
1 (a) Locate the median of the x-values. Note that we are talking here about the x-values of the data observations given. In the scatterplot shown there are 15 points. Each point has an x-value and a y-value. To find the median of the x-values we look for the horizontal middle point; that is, we look for the $\frac{15 + 1}{2}$ = 8th point from the left (from the right, the point will be the same).
(b) Draw a vertical line through this median value. Note that there are 7 points to the right of this line and 7 to the left.
2 (a) Locate the median of the y-values. This is done in a similar way to finding the median of the x-values except, instead of counting from the left or right, we count from the top or bottom to find the 8th point.
(b) Draw a horizontal line through this median value. Note that there are 7 points above this line and 7 below.
3 (a) Label the quadrants A, B, C and D.
(b) Count the number of points in each section. Do not count points that are on the lines.
4 Write the formula for calculating the q-coefficient.
5 Substitute the values of a, b, c and d into the formula and evaluate.
WRITE
1 [Figure: the scatterplot with a vertical line through the median of the x-values.]
2 [Figure: the scatterplot with a horizontal line through the median of the y-values as well.]
3 [Figure: the scatterplot with the quadrants labelled A, B, C and D.] a = 6, b = 0, c = 6, d = 1
4 $$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
5 $$q = \frac{(6 + 6) - (0 + 1)}{6 + 0 + 6 + 1} = \frac{11}{13} = 0.85 \text{ (correct to 2 decimal places)}$$

The value of the q-correlation coefficient in the above example indicates a strong correlation. The guide below gives a rough indication of the strength of the correlation suggested by the value of q.
Value of q          Strength of association
0.75 to 1           Strong positive association
0.5 to 0.75         Moderate positive association
0.25 to 0.5         Weak positive association
−0.25 to 0.25       No association
−0.5 to −0.25       Weak negative association
−0.75 to −0.5       Moderate negative association
−1 to −0.75         Strong negative association
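The rough guide above can be written as a small lookup, which may help when checking descriptions of many values at once. This is only a sketch: the function name is hypothetical, and the treatment of values that fall exactly on a boundary (placed in the stronger category) is an assumption, since the guide itself is only approximate.

```python
def describe_association(value):
    """Describe direction and strength for a correlation value in [-1, 1].

    Assumption: a value exactly on a boundary (0.25, 0.5 or 0.75 in size)
    is placed in the stronger category.
    """
    if not -1 <= value <= 1:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    size = abs(value)
    if size < 0.25:
        return "no association"
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if value > 0 else "negative"
    return f"{strength} {direction} association"


for q in (0.85, -0.5, 0.1):
    print(q, describe_association(q))
```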
The scatterplots below show three special values of the q-correlation coefficient.
[Figure: three scatterplots, each divided into quadrants A, B, C and D by the median lines.]
$$q = \frac{(8 + 8) - (0 + 0)}{8 + 0 + 8 + 0} = 1$$
$$q = \frac{(0 + 0) - (8 + 8)}{0 + 8 + 0 + 8} = -1$$
$$q = \frac{(3 + 3) - (3 + 3)}{3 + 3 + 3 + 3} = 0$$
The sign of the q-value indicates the direction of the relationship; that is, whether there is a negative or positive association.
In the cases shown above left and centre, the q-values are at both extremes. That is, q = 1 and −1 respectively. We would describe the variables as showing a very strong association. Having said that, the points are not showing a strong linear form or, for that matter, any linear form.
The q-correlation coefficient merely gives us an idea of which quadrants contain the most points; but beyond that, the points can be in any position in the quadrants. In that sense, the q-correlation coefficient is a rather blunt instrument.
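The three special values can be checked directly from the quadrant counts. A minimal sketch, assuming a hypothetical helper `q_from_counts` that applies the formula to the counts a, b, c and d; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def q_from_counts(a, b, c, d):
    """Apply q = ((a + c) - (b + d)) / (a + b + c + d) to quadrant counts."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("no points lie off the median lines")
    return Fraction((a + c) - (b + d), total)


# Quadrant counts (a, b, c, d) read from the three special cases above.
print(q_from_counts(8, 0, 8, 0))  # all points in A and C
print(q_from_counts(0, 8, 0, 8))  # all points in B and D
print(q_from_counts(3, 3, 3, 3))  # points shared evenly
```

Because only the counts enter the formula, any arrangement of points within the quadrants gives the same value, which is the sense in which q is a blunt instrument.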
Worked example 10
An investigation was made into the relationship between the time spent watching television in the week preceding a Maths test and the mark obtained (out of 20) in that Maths test. The following data were recorded.

Time (h)   Mark   Time (h)   Mark   Time (h)   Mark
4          15     10         8      12         10
5          16     20         5      5          8
5          20     5          12     20         8
10         12     15         4      15         10
15         8      15         12     20         10

Draw a scatterplot and calculate the q-correlation coefficient. Comment on the relationship between the two variables.
THINK
1 Draw a scatterplot. We can use a graphics calculator to draw the scatterplot.
(a) On the lists screen (press STAT, select EDIT and 1:Edit), enter the two lists of data into L1 and L2.
(b) Press 2nd [STAT PLOT] and select 4:PlotsOff.
(c) Press ENTER.
(d) Press 2nd [STAT PLOT] and select 1:Plot1.
(e) Select On, and for Type, select the first icon (scatterplot).
(f) For Xlist, type in L1 (use 2nd [L1]); for Ylist, type in L2; for Mark, select the first symbol.
(g) Press ZOOM and select 9:ZoomStat. The display now shows the scatterplot.
2 We can also use the graphics calculator to help calculate q.
(a) Press 2nd [QUIT] and return to the home screen.
(b) Press 2nd [DRAW] and select 4:Vertical.
(c) Press 2nd [LIST].
(d) From the MATH menu, select 4:median(.
(e) Type L1 (use 2nd [L1]) at the prompt, then ENTER, and the scatterplot appears with the vertical median line drawn.
(f) Similarly, to create the horizontal median line, press 2nd [QUIT] and return to the home screen.
(g) Press 2nd [DRAW] and select 3:Horizontal.
(h) Press 2nd [LIST] and from the MATH menu, select 4:median(.
(i) Type L2 at the prompt, press ENTER and the scatterplot appears with the horizontal median line drawn as well.
3 Count and record the number of points in each quadrant.
4 Write the formula for calculating the q-correlation coefficient.
5 Substitute the values of a, b, c and d into the formula and evaluate.
6 Comment on the relationship.
WRITE/DISPLAY
1 [Figure: scatterplot of Maths mark (vertical axis, 4 to 20) against time watching TV in hours (horizontal axis, 5 to 25).]
2 [Figure: the scatterplot with the vertical and horizontal median lines drawn.]
3 a = 1, b = 5, c = 2, d = 4
4 $$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
5 $$q = \frac{(1 + 2) - (5 + 4)}{1 + 5 + 2 + 4} = -\frac{6}{12} = -0.5$$
6 There is a moderate, negative association between the hours of television watched and the Maths mark obtained. The negative association means that as the number of hours of television watched prior to the test increased, the marks in the Maths test decreased. The moderate association suggests that it may be worth further investigating the association.
remember
1. The q-correlation coefficient is a measure of the strength of the association between two variables.
2. To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values and draw a vertical line through this median value.
Step 3. Locate the median of the y-values and draw a horizontal line through this median value.
Step 4. (a) Label the sections thus formed A, B, C and D.
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) (The number of points in section A is denoted by a, and so on.)
Step 5. Calculate the q-correlation coefficient using the formula:
$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$
3. The sign of the q-value indicates the direction of the relationship (whether there is a negative association or a positive association) while the size of it indicates the strength (whether the relationship is strong, moderate or weak).
4. The q-correlation coefficient gives us an idea of into which quadrants the points fall, but beyond that the points can be in any position in the quadrants. In that sense, the q-correlation coefficient is a rather blunt instrument.

Exercise 2F The q-correlation coefficient
1 Calculate the q-correlation coefficient for each of the sets of data shown on the scatterplots below.
[Figures for question 1: six scatterplots of y against x, labelled a to f.]
2 The data given in the table below show the results of an investigation into the mass and the height of a certain breed of dog.

Height (cm)   41   40   35   38   43   44   37   39   42   44   31
Mass (kg)     4.5  5    4    3.5  5.5  5    5    4    4    6    3.5

a Draw a scatterplot and calculate the q-correlation coefficient.
b Comment on the relationship between the height and the mass of this breed of dog.
3 The data in the table below show the number of hours spent by students who are learning touch-typing and their corresponding speed in words per minute (wpm).

Time (h)       20  33  22  39  40  37  46  44  24  36  50  48  29
Speed (wpm)    34  46  38  53  52  49  60  58  36  42  65  63  40

a Using a graphics calculator or otherwise, calculate the q-correlation coefficient for these data.
b Comment on the relationship between the number of hours spent on learning and the speed of typing.
4 multiple choice
The q-correlation coefficient for data shown in the scatterplot at right is:
[Figure: scatterplot of y against x.]
A $-\frac{1}{11}$   B $-\frac{1}{9}$   C $-\frac{5}{11}$   D $\frac{5}{9}$   E $\frac{9}{9}$
5 multiple choice
A researcher calculates the q-correlation coefficient for the relationship between time (in days) and the diameter (measured in mm) of a crystal that is changing in size. The value is 0.82. Based on this, the correlation between time and the diameter of the crystal could be described as:
A strong and negative
B strong and positive
C weak and positive
D weak and negative
E moderate and positive
Pearson’s product–moment correlation coefficient
We saw in the previous exercise that the q-correlation coefficient was a rather blunt instrument for measuring correlation between variables. A more precise tool is Pearson’s product–moment correlation coefficient. This coefficient is used to measure the strength of linear relationships between variables; the q-correlation coefficient, on the other hand, can be used for both linear and non-linear relationships.
Pearson’s coefficient is therefore more specialised and can give us a much more precise picture of the strength of the linear relationship between two variables.
The symbol for Pearson’s product–moment correlation coefficient is r.
Below is a gallery of scatterplots with the corresponding value of r for each.
[Figure: gallery of nine scatterplots with r = 1, r = −1, r = 0, r = 0.7, r = −0.5, r = −0.9, r = 0.8, r = 0.3 and r = −0.2.]
The two extreme values of r (1 and −1) are shown in the first two diagrams respectively. It is interesting to compare these two scatterplots with those showing extreme values (1 and −1) of q.
[Figure: four scatterplots in matching pairs: q = 1 beside r = 1, and q = −1 beside r = −1.]
In the four diagrams above, the scatterplots that show matching values of q and r are placed side by side. We see just how differently the points on the scatterplots are arranged and note from this that the r value gives us a much sharper impression of the relationship between the variables. That is, a value of r = 1 means that there is perfect linear association between the variables, which is not necessarily the case when q = 1!
Value of r          Strength of linear association
0.75 to 1           Strong positive linear association
0.5 to 0.75         Moderate positive linear association
0.25 to 0.5         Weak positive linear association
−0.25 to 0.25       No linear association
−0.5 to −0.25       Weak negative linear association
−0.75 to −0.5       Moderate negative linear association
−1 to −0.75         Strong negative linear association
In describing the strength of the relationship between the variables, the rough guide we used with the q-correlation coefficient can also be used with Pearson’s coefficient. The difference, of course, is that the value of r gives us a measure of the strength of linear relationships specifically.

Worked example 11
For each of the following:
i Estimate the value of Pearson’s product–moment correlation coefficient (r) from the scatterplot.
ii Use this to comment on the strength and direction of the relationship between the two variables.
[Figure: three scatterplots labelled a, b and c.]
THINK
a 1 Compare these scatterplots with those in the gallery of scatterplots shown previously and estimate the value of r.
2 Comment on the strength and direction of the relationship.
b Repeat steps 1 and 2 as in a.
c Repeat steps 1 and 2 as in a.
WRITE
a i r ≈ 0.9
ii The relationship can be described as a strong, positive, linear relationship.
b i r ≈ −0.7
ii The relationship can be described as a moderate, negative, linear relationship.
c i r ≈ −0.1
ii There is no linear relationship.

Note that the symbol ≈ means ‘approximately equal to’. We use it instead of the = sign to emphasise that the value (in this case r) is only an estimate.
In completing the worked example above, we notice that estimating the value of r from a scatterplot is rather like making an informed guess. In the next section of work we will see how to obtain the actual value of r.

remember
1. Pearson’s product–moment correlation coefficient is used to measure the strength of a linear relationship between two variables.
2. The symbol for Pearson’s product–moment correlation coefficient is r.
3. An estimate of r can be obtained from the scatterplot.
Exercise 2G Pearson’s product–moment correlation coefficient
1 What type of linear relationship does each of the following values of r suggest?
a 0.21   b 0.65   c −1   d −0.78
e 1   f 0.9   g −0.34   h −0.1
2 For each of the following:
i Estimate the value of Pearson’s product–moment correlation coefficient (r) from the scatterplot.
ii Use this to comment on the strength and direction of the relationship between the two variables.
[Figures for question 2: eight scatterplots labelled a to h.]
3 multiple choice
A set of data relating the variables x and y is found to have an r value of 0.62. The scatterplot that could represent the data is:
[Figures for question 3: five scatterplots labelled A to E.]
4 multiple choice
A set of data relating the variables x and y is found to have an r value of −0.45. A true statement about the relationship between x and y is:
A There is a strong linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
B There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
C There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to decrease.
D There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
E There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to decrease.
Calculating r and the coefficient of determination
Pearson’s product–moment correlation coefficient
The formula for calculating Pearson’s correlation coefficient r is as follows:
$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
where n is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values.
The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.
There are two important limitations on the use of r. First, since r measures the strength of a linear relationship, it would be inappropriate to calculate r for data which are not linear; for example, data which a scatterplot shows to be in a quadratic form.
Second, outliers can bias the value of r. Consequently, if a set of linear data contains an outlier, then r is not a reliable measure of the strength of that linear relationship.
The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which do not have outliers.
With those two provisos, it is good practice to draw a scatterplot for a set of data to check for a linear form and an absence of outliers before r is calculated. Having a scatterplot in front of you is also useful because it enables you to estimate what the value of r will be, as you did in exercise 2G, and thus you can check that your workings on the calculator are correct.
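For readers without a graphics calculator, the formula for r can be applied term by term in a few lines of Python. This is a minimal sketch rather than a replacement for the calculator method: the function name `pearson_r` is hypothetical, and it assumes that the standard deviations in the formula are sample standard deviations (divisor n − 1), which is what `statistics.stdev` computes. The data are the 21 height and marks pairs from the football worked example in this section.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """r = 1/(n - 1) * sum of ((x_i - x_bar)/s_x) * ((y_i - y_bar)/s_y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (assumption)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Heights (cm) and marks taken for the 21 football players.
heights = [184, 194, 185, 175, 186, 183, 174, 200, 188, 184, 188,
           182, 185, 183, 191, 177, 184, 178, 190, 193, 204]
marks = [6, 11, 3, 2, 7, 5, 4, 10, 9, 7, 6,
         7, 5, 9, 9, 3, 8, 4, 10, 12, 14]

# The text reports r = 0.86 for these data.
print(round(pearson_r(heights, marks), 2))
```

As the text stresses, a value computed this way is only meaningful once a scatterplot has confirmed a linear form and the absence of outliers.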
Worked example 12
The heights (in centimetres) of 21 football players were recorded against the number of marks they took in a game of football. The data are shown in the table below.

Height (cm)   Number of marks taken   Height (cm)   Number of marks taken
184           6                       182           7
194           11                      185           5
185           3                       183           9
175           2                       191           9
186           7                       177           3
183           5                       184           8
174           4                       178           4
200           10                      190           10
188           9                       193           12
184           7                       204           14
188           6

a Construct a scatterplot for the data.
b Comment on the correlation between the heights of players and the number of marks that they take, and estimate the value of r.
c Calculate r and use it to comment on the relationship between the heights of players and the number of marks they take in a game.
THINK
a Using a graphics calculator, construct a scatterplot. Refer to worked example 8 in the section on scatterplots for directions on how to use the graphics calculator to draw a scatterplot.
b Comment on the correlation between the variables and estimate the value of r.
c 1 Because there is a linear form and there are no outliers, the calculation of r is appropriate. Calculate r, using a graphics calculator. The lists are in place from the scatterplot. Firstly press 2nd [CATALOG] and select DiagnosticOn and press ENTER. Press STAT and select CALC and 4:LinReg(ax+b). Press ENTER. LinReg(ax+b) appears. Type L1, L2. Press ENTER.
2 The value of r = 0.86 indicates a strong positive linear relationship.
WRITE/DISPLAY
a [Figure: calculator display of the scatterplot.]
b The data show what appears to be a linear form of moderate strength. We might expect r ≈ 0.6.
c 1 [Figure: calculator display of the LinReg(ax+b) results.] r = 0.86
2 There is a strong positive linear association between the height of a player and the number of marks he takes in a game. That is, the taller the player the more marks we might expect him to take.

Correlation and causation
In worked example 12 we saw that r = 0.86. While we are entitled to say that there is a strong association between the height of a footballer and the number of marks he takes, we cannot assert that the height of a footballer causes him to take a lot of marks. Being tall might assist in the taking of marks, but there will be many other factors which come into play; for example, skill level, accuracy of passes from teammates, abilities of the opposing team, and so on.
So, while establishing a high degree of correlation between two variables is very interesting and can often flag the need for further, more detailed investigation, it in no way gives us any basis to comment on whether or not one variable causes particular values in another variable.
The coefficient of determination
The coefficient of determination is given by r². Obviously, it is very easy to calculate: we merely square Pearson’s product–moment correlation coefficient (r).
1. The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.
2. The coefficient of determination provides a measure of how well the linear rule linking the two variables (x and y) predicts the value of y when we are given the value of x.

Worked example 13
A set of data giving the number of police traffic patrols on duty and the number of fatalities for the region was recorded and a correlation coefficient of r = −0.8 was found. Calculate the coefficient of determination and interpret its value.
THINK
1 Calculate the coefficient of determination by squaring the given value of r.
2 Interpret your result.
WRITE
1 Coefficient of determination = r²
= (−0.8)²
= 0.64
2 We can conclude from this that 64% of the variation in the number of fatalities can be explained by the variation in the number of police traffic patrols on duty. This means that the number of police traffic patrols on duty is a major factor in predicting the number of fatalities.
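The arithmetic in the worked example above is a single squaring, which can be sketched as follows. The function name is hypothetical; the rounding step is there only because floating-point squaring of −0.8 does not give exactly 0.64.

```python
def coefficient_of_determination(r):
    """Return r squared for a correlation coefficient r in [-1, 1]."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    return r * r


r = -0.8
r_squared = coefficient_of_determination(r)
percent = round(100 * r_squared)  # proportion of variation, as a percentage

print(f"r = {r}, r squared = {r_squared:.2f}, about {percent}% of the variation")
```

Note that the sign of r is lost on squaring, so the direction of the relationship has to be read from r itself.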
remember
1. The formula for calculating Pearson’s correlation coefficient r is as follows:
$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
where n is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values.
2. The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.
3. The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which do not have outliers.
4. Even if we find that two variables have a very high degree of correlation, for example r = 0.95, we cannot say that the value of one variable is caused by the value of the other variable.
5. The coefficient of determination = r².
6. The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.
Exercise 2H Calculating r and the coefficient of determination
1 The yearly salary ($’000) and the number of votes polled in the Brownlow medal count are given below for 10 leading footballers.

Yearly salary ($’000)   180  200  160  250  190  210  170  150  140  180
Number of votes         24   15   33   10   16   23   14   21   31   28

a Construct a scatterplot for the data.
b Comment on the correlation of salary and the number of votes and make an estimate of r.
c Calculate r and use it to comment on the relationship between yearly salary and number of votes.
2 A set of data, obtained from 40 smokers, gives the number of cigarettes smoked per day and the number of visits per year to the doctor. The Pearson’s correlation coefficient for these data was found to be 0.87. Calculate the coefficient of determination for the data and interpret its value.
3 Data giving the annual advertising budgets ($’000) and the yearly profit increases (%) of 8 companies are shown below.

Annual advertising budget ($’000)   11   14   15   17   20   25   25   27
Yearly profit increase (%)          2.2  2.2  3.2  4.6  5.7  6.9  7.9  9.3

a Construct a scatterplot for these data.
b Comment on the correlation of the advertising budget and profit increase and make an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the yearly profit increase that can be explained by the variation in the advertising budget.
4 Data showing the number of tourists visiting a small country in a month and the corresponding average monthly exchange rate for the country’s currency against the American dollar are given below.

Number of tourists (’000)   2    3    4    5    7    8    8    10
Exchange rate               1.2  1.1  0.9  0.9  0.8  0.8  0.7  0.6

a Construct a scatterplot for the data.
b Comment on the correlation between the number of tourists and the exchange rate and give an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the number of tourists that can be explained by the exchange rate.
5 Data showing the number of people in 9 households against weekly grocery costs are given below.

Number of people in household   2    5    6    3    4    5    2    6    3
Weekly grocery costs ($)        60   180  210  120  150  160  65   200  90

a Construct a scatterplot for the data.
b Comment on the correlation of the number of people in a household and the weekly grocery costs and give an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the weekly grocery costs that can be explained by the variation in the number of people in a household.

6 Data showing the number of people on 8 fundraising committees and the annual funds raised are given below.

Number of people on committee   3     6     4     8       5     7       3     6
Annual funds raised ($)         4500  8500  6100  12 500  7200  10 000  4700  8800

a Construct a scatterplot for these data.
b Comment on the correlation between the number of people on a committee and the funds raised and make an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the funds raised that can be explained by the variation in the number of people on a committee.

The following information applies to questions 7 and 8. A set of data was obtained from a large group of women with children under 5 years of age. They were asked the number of hours they worked per week and the amount of money they spent on childcare. The results were recorded and the value of Pearson’s correlation coefficient was found to be 0.92.
7 (Multiple choice) Which of the following is not true?
A The relationship between the number of working hours and the amount of money spent on child-care is linear.
B There is a positive correlation between the number of working hours and the amount of money spent on child-care.
C The correlation between the number of working hours and the amount of money spent on child-care can be classified as strong.
D As the number of working hours increases, the amount spent on child-care increases as well.
E The increase in the number of hours causes the increase in the amount of money spent on child-care.

8 (Multiple choice) Which of the following is not true?
A The coefficient of determination is about 0.85.
B The number of working hours is the major factor in predicting the amount of money spent on child-care.
C About 85% of the variation in the number of hours worked can be explained by the variation in the amount of money spent on child-care.
D Apart from number of hours worked, there could be other factors affecting the amount of money spent on child-care.
E About 17/20 of the variation in the amount of money spent on child-care can be explained by the variation in the number of hours worked.
Types of data
• Bivariate data are data with two variables.
• Numerical data involve quantities which are measurable or countable.
• Categorical data are data divided into categories.
• In a relationship involving two variables, if the values of one variable depend on the values of another variable, then the former variable is referred to as the dependent variable and the latter variable is referred to as the independent variable.
• When data are displayed on a graph, the independent variable is placed on the horizontal axis and the dependent variable is placed on the vertical axis.

Back-to-back stem plots
• A back-to-back stem plot displays bivariate data involving a numerical variable and a categorical variable with two categories.
• Together with summary statistics, back-to-back stem plots can be used to compare the two distributions.

Parallel boxplots
• To display a relationship between a numerical variable and a categorical variable with more than two categories, we can use a parallel boxplot.
• A parallel boxplot is obtained by constructing individual boxplots for each distribution, using a common scale.
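
Each boxplot in a parallel boxplot is drawn from a five-number summary, so a sketch of that calculation may help. The Python below is a minimal illustration only: the function name `five_number_summary` and the group data are hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data (excluding the middle value when the count is odd), which is one common convention and may differ from what a particular calculator reports.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least 2 values")
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical groups, for illustration only.
groups = {
    "Group 1": [12, 15, 11, 19, 14, 16, 13, 18],
    "Group 2": [22, 17, 25, 20, 19, 24, 21],
}
for name, values in groups.items():
    print(name, five_number_summary(values))
```

Listing the summaries side by side plays the same role as the common scale: the medians and interquartile ranges of the groups can be compared directly.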

The two-way frequency table
• The two-way frequency table is a tool for examining the relationship between two categorical variables.
• If the total number of scores in each of the two categories is unequal, percentages should be calculated in order to be able to analyse the table properly.
• When the independent variable is placed in the columns of the table, the numbers in each column should be expressed as a percentage of that column’s total.
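
The column-percentage rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `column_percentages`, the dictionary layout (column label mapped to row counts) and the counts themselves are all hypothetical.

```python
def column_percentages(table):
    """Express each count as a percentage of its column total.

    `table` maps each column label (a category of the independent variable)
    to a dictionary of row label -> count.
    """
    result = {}
    for column, counts in table.items():
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"column {column!r} has no observations")
        result[column] = {row: round(100 * count / total, 1)
                          for row, count in counts.items()}
    return result


# Hypothetical counts, for illustration only.
counts = {
    "Group 1": {"Yes": 30, "No": 20},
    "Group 2": {"Yes": 10, "No": 30},
}
for column, percentages in column_percentages(counts).items():
    print(column, percentages)
```

Because the two hypothetical groups have different totals (50 and 40), comparing the raw counts would mislead; the percentages (60% versus 25% answering "Yes") are directly comparable.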

The scatterplot
• A scatterplot gives a visual display of the relationship between two numerical variables.
• In analysing the scatterplot we look for a pattern in the way the points lie. Certain patterns tell us that certain relationships exist between the two variables. This is referred to as a correlation. We look at what type of correlation exists and how strong it is.
• When describing the relationship between two variables displayed on a scatterplot, we need to comment on:
(a) the direction: whether it is positive or negative
(b) the form: whether it is linear or non-linear
(c) the strength: whether it is strong, moderate or weak.
summary

The q-correlation coefficient
• The q-correlation coefficient gives us a measure of the strength of the association between two variables.
• To calculate the q-correlation coefficient:
Step 1. Draw a scatterplot of the data.
Step 2. Locate the median of the x-values. Draw a vertical line through this median value.
Step 3. Locate the median of the y-values. Draw a horizontal line through this median value.
Step 4. The scatterplot is now divided into 4 sections or quadrants.
(a) Label these sections A, B, C and D.
(b) Count the number of points in each section.
(c) Do not count points which are on the lines.
(d) The number of points in section A is denoted by a, the number of points in section B is denoted by b, and so on.
[Figure: scatterplot of y against x divided by the median lines into four quadrants labelled A, B, C and D]
Step 5. Calculate the q-correlation coefficient, using the formula:

$$ q = \frac{(a + c) - (b + d)}{a + b + c + d} $$

• The sign of the q-value indicates the direction of the relationship; that is, whether there is a negative association or a positive association. The magnitude of q indicates whether the relationship is strong, moderate or weak.
• The q-correlation coefficient tells us which quadrants the points fall into, but beyond that the points can be in any position within the quadrants. In that sense the q-correlation coefficient is a rather blunt instrument.

Pearson’s product–moment correlation coefficient
• Pearson’s product–moment correlation coefficient is used to measure the strength of a linear relationship between two variables.
• The symbol for Pearson’s product–moment correlation coefficient is r.
• The calculation of r is applicable to sets of bivariate data which are known to be linear in form and which don’t have outliers.
• The value of r can be estimated from the scatterplot.
• The formula for calculating Pearson’s correlation coefficient r is as follows:

$$ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

where n is the number of pairs of data in the set
$s_x$ is the standard deviation of the x-values
$s_y$ is the standard deviation of the y-values
$\bar{x}$ is the mean of the x-values
$\bar{y}$ is the mean of the y-values
• The calculation of r by hand using this formula is unnecessary. The calculation of r is done far more efficiently using a graphics calculator.
• Even if we find that two variables have a very high degree of correlation, for example r = 0.95, we cannot say that the value of one variable is caused by the value of the other variable.

Calculating the coefficient of determination
• The coefficient of determination = r².
• The coefficient of determination is useful when we have two variables which have a linear relationship. It tells us the proportion of variation in one variable which can be explained by the variation in the other variable.
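
The five steps for the q-correlation coefficient listed in this summary can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation. It assumes that quadrant A is the upper right, B the upper left, C the lower left and D the lower right, which is consistent with a and c counting towards a positive association in the formula; the function name `q_correlation` and the sample data are hypothetical.

```python
from statistics import median


def q_correlation(xs, ys):
    """q = ((a + c) - (b + d)) / (a + b + c + d), using the median lines."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    median_x = median(xs)  # position of the vertical line
    median_y = median(ys)  # position of the horizontal line
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == median_x or y == median_y:
            continue  # points on either line are not counted
        if x > median_x and y > median_y:
            a += 1  # assumed quadrant A: upper right
        elif x < median_x and y > median_y:
            b += 1  # assumed quadrant B: upper left
        elif x < median_x and y < median_y:
            c += 1  # assumed quadrant C: lower left
        else:
            d += 1  # assumed quadrant D: lower right
    counted = a + b + c + d
    if counted == 0:
        raise ValueError("every point lies on a median line")
    return ((a + c) - (b + d)) / counted


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
print(f"q = {q_correlation(xs, ys):.2f}")
```

For these made-up points, four fall in quadrants A and C and two fall in B and D, so q = (4 − 2)/6, a positive value. Moving a point anywhere within its quadrant leaves q unchanged, which is the bluntness the summary describes.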

Chapter review

Multiple choice
1 [2A] An example of a categorical variable is:
A the membership number of a club
B the number of students at each year level of a school
C the total attendance at Hawthorn football matches
D the breathalyser reading of people in a restaurant
E the monthly income for a group of people

2 [2A] In a study on the growth of plants, conducted in controlled surroundings, the dependent variable was the height of the plants. The independent variable in the study would be most likely:
A the number of people caring for the plants
B the amount of light present
C the number of plants in the study
D whether the plants were deciduous or evergreen
E rainfall

3 [2B] One of the following pairs of variables could not be displayed on a back-to-back stem plot. It is:
A the heights of a group of students and whether or not they like football
B the kilometres travelled in a week and the mode of transport (car or train)
C the weights of a group of students and their eye colour (blue or brown)
D the annual number of trips to a doctor and whether or not the person is a smoker
E the amount spent by each child at the tuckshop and the age of the child

4 [2B] A back-to-back stem plot is a useful way of displaying the relationship between:
A the number of children attending a day care centre and whether or not the centre has federal funding
B height and wrist circumference
C age and weekly income
D weight and the number of takeaway meals eaten each week
E the age of a car and amount spent each year on servicing it

The information below relates to questions 5 and 6. The salaries of people working at five different advertising companies are shown below on the parallel boxplots.

[Figure: parallel boxplots for Company A, Company B, Company C, Company D and Company E; horizontal axis: Annual salary (× $1000), scale from 10 to 150]

5 [2C] The company with the largest interquartile range is:
A Company A
B Company B
C Company C
D Company D
E Company E

6 [2C] The company with the lowest median is:
A Company A
B Company B
C Company C
D Company D
E Company E

Questions 7 and 8 relate to the following information. Data showing reactions of junior staff and senior staff to a relocation of offices are given below in a two-way frequency table.

Attitude   Junior staff   Senior staff   Total
For        23             14             37
Against    31             41             72
Total      54             55             109

7 [2D] From this table, we can conclude that:
A 23% of junior staff were for the relocation
B 42.6% of junior staff were for the relocation
C 31% of junior staff were against the relocation
D 62.1% of junior staff were for the relocation
E 28.4% of junior staff were against the relocation

8 [2D] From this table, we can conclude that:
A 14% of senior staff were for the relocation
B 37.8% of senior staff were for the relocation
C 12.8% of senior staff were for the relocation
D 72% of senior staff were against the relocation
E 74.5% of senior staff were against the relocation

9 [2E] The relationship between the variables x and y is shown on the scatterplot below.

[Figure: scatterplot of y against x]

The correlation between x and y would be best described as:
A a weak positive association
B a weak negative association
C a strong positive association
D a strong negative association
E non-existent

10 [2E] An investigation is made into the number of freckles on the back of a hand and the age of the subject. A strong association was found to exist. In this investigation, age is the independent variable and the number of freckles is the dependent variable. You would expect the association to be:
A negative
B positive
C bivariate
D weak
E categorical

[Figure: scatterplot of y against x]

11 [2F] The q-correlation coefficient for data shown in the scatterplot above is:
A −5/11
B −5/9
C 5/11
D 5/9
E 2/9

12 [2F] A researcher calculates the q-correlation coefficient for the relationship between time (in days) and the growth of the root of a bean plant (measured in millimetres). The value is 0.62. Based on this, the correlation between time and the growth of the roots could be described as:
A strong and negative
B strong and positive
C weak and positive
D weak and negative
E moderate and positive

13 [2G] A set of data relating the variables x and y is found to have an r value of −0.83. The scatterplot that could represent this data set is:

[Figure: five scatterplots of y against x, labelled A, B, C, D and E]

14 [2G] A set of data relating the variables x and y is found to have an r value of 0.65. A true statement about the relationship between x and y is:
A There is a strong linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
B There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
C There is a moderate linear relationship between x and y and when the x-values increase, the y-values tend to decrease.
D There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to increase also.
E There is a weak linear relationship between x and y and when the x-values increase, the y-values tend to decrease.

15 [2H] A set of data comparing age with blood pressure is found to have a Pearson’s correlation coefficient of 0.86. The coefficient of determination for these data would be closest to:
A −0.86
B −0.74
C −0.43
D 0.43
E 0.74

16 [2H] The coefficient of determination for a set of data relating age and pulse rate is 0.7. This means that:
A The correlation coefficient, r, for age against pulse rate is 0.7.
B 70% of the variation in pulse rate can be explained by the variation in age.
C 30% of the variation in pulse rate can be explained by the variation in age.
D 49% of the variation in pulse rate can be explained by the variation in age.
E 70% of those in the study had a pulse rate over 0.7.

Short answer
1 [2A] For each of the following, write down:
i whether each variable in the pair is an example of numerical or categorical data
ii which is a dependent and which is an independent variable or whether it is not appropriate to classify the variables as such.
a The number of injuries in a netball season and the age of a netball player
b The suburb and the size of a home mortgage
c IQ and weight

2 [2B] The number of hours of counselling received by a group of 9 full-time firefighters and 9 volunteer firefighters after a serious bushfire is given below.

Full-time   2  4   3   5   2   4   6   1   3
Volunteer   8  10  11  11  12  13  13  14  15

a Construct a back-to-back stem plot to display the data.
b Comment on the distributions of the number of hours of counselling of the full-time firefighters and the volunteers.

3 [2C] The IQs of 8 players in each of 3 different football teams were recorded and are shown below.

Team A   120  105  140  116  98   105  130  102
Team B   110  104  120  109  106  95   102  100
Team C   121  115  145  130  120  114  116  123

Display the data in parallel boxplots.

4 [2D] Delegates at the respective Liberal and Labor Party conferences were surveyed on whether or not they believed that uranium mining should continue. Forty-five Liberal delegates were surveyed and 15 were against continuation. Fifty-three Labor delegates were surveyed and 43 were against continuation.
a Present the data in percentages in a two-way frequency table.
b Comment on any difference between the reactions of the Liberal and Labor delegates.

5 [2E] a Construct a scatterplot for the data given in the table below.
b Use the scatterplot to comment on any relationship which exists between the variables.

Age          15  17  18  16  19  19  17  15  17
Pulse rate   79  74  75  85  82  76  77  72  70

6 [2F] For the data given in question 5, calculate the q-correlation coefficient and use this to comment on the relationship between the two variables. (Compare your response about the relationship in this question to your response about the relationship in question 5 when you didn’t know the q-value.)

7 [2G] For the variables shown on the scatterplot below, give an estimate of the value of r and use it to comment on the nature of the relationship between the two variables.

[Figure: scatterplot of y against x]

8 [2H] The table below gives data relating the percentage of lectures attended by students in a semester and the corresponding mark for each student in the exam for that subject.

Lectures attended (%)   70  59  85  93  78  85  84  69  70  82
Exam result (%)         80  62  89  98  84  91  83  72  75  85

a Construct a scatterplot for these data.
b Comment on the correlation between the lectures attended and the examination results and make an estimate of r.
c Calculate r.
d Calculate the coefficient of determination.
e Write down the proportion of the variation in the examination results that can be explained by the variation in the number of lectures attended.

Analysis
1 An investigation into the relationship between age and salary bracket among some employees of a large computer company is made and the results are shown below.

Salary bracket ($’000)   Age
20–39                    32 21 43 23 22 27 37
40–59                    29 31 37 26 33 37
60–79                    41 29 39 42 47 45 43 38
80–99                    43 48 38 37 49 51 53 59
100–120                  48 37 55 61

a State, for each of the variables (age and salary bracket), whether they represent categorical or numerical data.
b State which is the independent variable and which is the dependent variable.
c State which of the following you could use to display the data:
i back-to-back stem plot
ii parallel boxplot
iii scatterplot
iv two-way frequency table in percentage form
d State which of the following you could calculate in order to find out more about the relationship between age and salary bracket:
i the q-correlation coefficient
ii r, the Pearson product–moment correlation coefficient
iii the coefficient of determination

2 An investigation similar to that in analysis task 1 is undertaken at an accounting firm to explore the relationship between age and salary. The data are shown below.

Age                           20  20  30  35  50  45  35  45  55   55  42  50  25  30  40
Salary (nearest thousand $)   20  40  20  30  40  80  40  60  100  70  45  85  30  60  60

a State, for each of the variables (age and salary), whether they represent categorical or numerical data.
b Display the data on a scatterplot.
c Describe the association between the two variables in terms of direction, form and strength.
d Calculate the value of q.
e Explain whether or not it is appropriate to use Pearson’s correlation coefficient to explain the relationship between age and salary.
f Estimate the value of Pearson’s correlation coefficient from the scatterplot.
g Calculate the value of this coefficient.
h Explain whether or not the salary of the employees is determined by their age.
i Calculate the value of the coefficient of determination.
j Explain what the coefficient of determination tells us about the relationship between age and salary at this accounting firm.