BEA140 Module 2: Quantitative Methods
Summer Semester 2009
By Leon Jiang, University of Tasmania



Why this unit?

In particular, why does this module require us to study statistics?


A case to think about!

Suppose you work in R&D at IBM. As you know, IBM is competing with a number of very strong rivals.

Can you create a laptop computer for IBM just as you like?


It is not as simple as 'as you like'! Knowing clearly what types of laptop computers customers want should be the guiding direction for a successful R&D department.

But can we know the number of customers and the types of laptop computers they would actually buy?


Of course…

The best situation would be to know exactly how many customers in the world are willing to buy, exactly what types of laptop computers they want, and, don't forget, exactly how many of these customers can afford your products.


However, can we know this information? In this case, it is nearly impossible to know it exactly!

But can we estimate this information?


We can actually know this information to some extent!

This is what we are going to learn in this module!


* What is statistics?

The word 'statistics' is derived from the Latin word 'status', which means a "political state."

Statistics was therefore closely linked with the administrative affairs of a state, such as facts and figures regarding the defense force, population, housing, food, financial resources, etc.

What is true of a government is also true of industrial administration units, and even of one's personal life.


* The meaning of statistics!

* The word statistics, as a singular noun, is used to describe a branch of applied mathematics, whose purpose is to provide methods of dealing with a collection of data and extracting information from them in compact form by tabulating, summarizing and analyzing the numerical data or a set of observations.

* The various methods used are termed as statistical methods and the person using them is known as a statistician. A statistician is concerned with the analysis and interpretation of the data and drawing valid worthwhile conclusions from the same.


* Some important terms~!

Population (or universe) - the complete collection of objects (individuals or members) to be considered.

The total number in a population is known as the size of the population---which may be finite or infinite.

The population can refer to things as well as people.


* Examples of a “population”!

All members of the student union of UTAS.

All students of the QM unit in your class.

All the people who drank beer at least twice a week over the past year.

Heights of teaching staff at UTAS.

Weights of all the citizens of Hobart above 20 years of age.


* Population - finite and infinite

A population is finite if it contains a finite number of individuals. For example, the students in your class.

A population is infinite if it contains an infinite number of individuals. For example, the pressures at various points in the atmosphere.


* Infinite population~!

Often, statisticians want to know things about a population, but they fail to do so because in almost every case data for every individual of the population are not available.

In the case above, can we know how many people are willing to buy and can actually afford to buy your products?

Thus, whenever we want to study the characteristics of a certain population, it is difficult to study the whole population. It is often expensive and time consuming, and many times we lack the resources to study the whole population. In any science we can rarely study more than a part of the population. A part or small section selected from the population is called a sample.


This can be the starting point for the case: take a sample in order to know something about the population!


* Sample~!

A finite set of objects drawn from the population with an aim is called a sample.


* Even in everyday life we make many of our decisions based on samples, though we are not aware of it. We take a little rice from a gunny bag, judge its quality, and then purchase the whole bag.

If we want to taste milk, we just take a glassful of milk from the can and taste it.

Note that taking a sample is easy in many cases where the population is uniform or homogeneous.

When the population is heterogeneous (not uniform), selecting a sample is not so easy.


* Parameter & statistic *

• The word 'parameter' is associated with the population. It is a measure of a characteristic of the population, such as the mean or the standard deviation.

• The word 'statistic' is associated with a random sample. It is a measure of a characteristic of the random sample, such as the mean or the standard deviation.


* Different symbols are used to denote parameters and statistics * For example, μ and σ denote the population mean and standard deviation, while X̄ and s denote the sample mean and standard deviation.


* For instance, say:

Aim to know: The average (mean) income of families living in the area "Salamanca" in the year 2007-2008, e.g. A$50,000. This is the population parameter.

Work out this way: Draw a random sample of 200 families and compute their average income, e.g. A$52,000. This is the sample statistic.

Conclusion: The sample mean (statistic) serves as an estimate of the population mean (parameter); here the two are close.


* The process of doing statistics * - 3 steps ~!

1. Design – gathering data !

2. Description – summarizing, studying features and characteristics of data, providing useful and effective information. (including graphical tools, tables, summary measures.)

3. Inference ( conclusion) – requiring the application of probability concepts.


Importance of statistics

‘hard data’

Scientific evidence

Postgraduate study for masters or PhD.

Necessary for high-quality essays.


* Sources of data *

Primary source – the original collector of the data, e.g. the National Population Census Bureau.

Secondary source – a subsequent user of the data.


* How to collect primary data?

• Survey – interview, questionnaire, etc. Designing a survey requires skill.

• Observation – observing and recording behaviors.

• Experiment – use of experimental and control groups. Appropriate design is important.


* Where to find secondary data? *

Those published sources of data, e.g. trade journals, any relevant kinds of media.

Secondary data collection is usually cheaper and less time-consuming than primary data collection.


* Survey errors * - Four kinds of errors to anticipate beforehand

1. Coverage error → selection bias. The frame does not cover all of the population, or excludes some groups, because the population frame is not clearly defined. (The random probability sample selected will provide an estimate of the characteristics of the frame, not of the actual population.)

2. Non-response error → bias. Those who do not respond might have very different views.

3. Sampling error. The sample is not representative of the population.

4. Measurement error. Poor questionnaire design or poor interviewing skills.


* Sampling methods * - 2 basic kinds!

1. Probability sampling (random sampling)

* Only random sampling is valid for statistical inference.

2. Non-probability sampling - two broad types: accidental or purposive.


* Convenience sample *

A convenience sample is a sample where the subjects are selected, in part or in whole, at the convenience of the researcher. The researcher makes no attempt, or only a limited attempt, to ensure that this sample is an accurate representation of some larger group or population.

The classic example of a convenience sample is standing at a shopping mall and selecting shoppers as they walk by to fill out a survey.


- More about convenience sample -

In general, the Statistics community frowns on convenience samples. You will often have great difficulty in generalizing the results of a convenience sample to any population that has practical relevance.

Still, convenience samples can provide you with useful information, especially in a pilot study. To interpret the findings from a convenience sample properly, you have to characterize (usually in a qualitative sense) how your sample would differ from an ideal sample that was randomly selected. In particular, pay attention to who might be left out of your convenience sample or who might be underrepresented in your sample.


* Random sample *

In contrast, a random sample is one where the researcher ensures (usually through the use of random numbers applied to a list of the entire population) that each member of that population has an equal probability of being selected.

Random samples are an important foundation of Statistics. Almost all of the mathematical theory upon which Statistics is based relies on assumptions which are consistent with a random sample. This theory is inconsistent with data collected from a convenience sample.


* Judgment sample *

- A non-probability sample that is often called a purposive sample because the sample elements are handpicked and because they are expected to serve the research purpose.


* Simple random sampling *

A sampling procedure that assures that each element in the population has an equal chance of being selected is referred to as simple random sampling.

Let us assume you had a school with 1000 students, divided equally into boys and girls, and you wanted to select 100 of them for further study. You might put all their names in a drum and then pull 100 names out. Not only does each person have an equal chance of being selected, we can also easily calculate the probability of a given person being chosen, since we know the sample size (n) and the population size (N). It becomes a simple matter of division:

n/N x 100 = 100/1000 x 100 = 10%

This means that every student in the school has a 10%, or 1 in 10, chance of being selected using this method.
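The drum-of-names procedure can be sketched in a few lines of Python. This is only an illustration: the student labels, the function name `simple_random_sample` and the fixed seed are assumptions made for the example, not part of the lecture.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that each member is equally likely to be chosen."""
    rng = random.Random(seed)          # a seed only makes the example repeatable
    return rng.sample(population, n)   # sampling without replacement


# Hypothetical school of N = 1000 students, as in the example above
students = [f"student_{i}" for i in range(1, 1001)]
sample = simple_random_sample(students, 100, seed=42)

# Probability that any given student is selected: n / N
selection_probability = len(sample) / len(students)
print(len(sample), f"{selection_probability:.0%}")
```

Sampling without replacement mirrors pulling names out of the drum: no student can be drawn twice.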


Systematic Sampling

At first sight this is very different. Suppose that the N units in the population are numbered 1 to N in some order. To select a systematic sample of n units, compute k = N/n; then every k-th unit is selected, commencing with a randomly chosen number between 1 and k. Hence the selection of the first unit determines the whole sample. For example, N = 5,000 and n = 250 give k = 5000/250 = 20, so we select every 20th item commencing with (say) 6: units 6, 26, 46, and so on.
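A minimal sketch of this selection rule in Python follows. It assumes that N is an exact multiple of n, as in the example; the function name `systematic_sample` and the seed are illustrative choices.

```python
import random


def systematic_sample(N, n, seed=None):
    """Return the unit numbers (1..N) of a systematic sample of size n.

    Assumes N is an exact multiple of n, so the interval k = N / n is a whole number.
    """
    k = N // n                      # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)       # random start between 1 and k, inclusive
    return [start + i * k for i in range(n)]


units = systematic_sample(5000, 250, seed=1)
print(units[:3], "...", units[-1])
```

Only the starting unit is random; every later unit follows from it by adding k repeatedly.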


* Samples from a subdivided population *

* Quota sampling usually refers to the process whereby a researcher attempts to match in a sample the exact makeup of the population with regard to certain demographic characteristics deemed important (such as gender, age, race, income, etc ).

* Quota sampling is non-probability.


* Stratified random sampling*

Stratified sampling is used if sampled area (or volume) is heterogeneous.

The whole population is first divided into mutually exclusive subgroups, or strata, and then units are selected randomly from each stratum.
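As a rough sketch, stratified random sampling can be written as a simple random sample taken inside each stratum. The strata below (500 boys and 500 girls, echoing the school example), the 10% sampling fraction in every stratum, and the function name `stratified_sample` are all assumptions made for illustration; the lecture does not prescribe how many units to take from each stratum.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, units in strata.items():
        size = round(len(units) * fraction)   # same fraction in each stratum (an assumption)
        chosen[name] = rng.sample(units, size)
    return chosen


# Hypothetical strata: mutually exclusive subgroups of the population
strata = {
    "boys": [f"boy_{i}" for i in range(500)],
    "girls": [f"girl_{i}" for i in range(500)],
}
chosen = stratified_sample(strata, 0.10, seed=7)
print({name: len(units) for name, units in chosen.items()})
```

Unlike a single draw of 100 names from the whole school, this guarantees that both strata appear in the sample in the chosen proportion.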


* Cluster sampling *

Cluster sampling is used when "natural" groupings are evident in the population. The total population is divided into groups or clusters.


* Properties of data *

The phenomena or characteristics observed are random variables.

Variables have a range of values and are random, for example: eye color, height, weight, income per month, car accidents per day…


* Two types of variables!

Categorical – variables that describe a quality (a category).

Numerical – variables that describe a quantity.


* Categorical variables! - Yielding categorical responses -

* Nominal scale and ordinal scale *

- Nominal scale: variables have no relation to order and can only be analyzed by their names.

- Arithmetic is limited to counting.

- Example: Degree - law, commerce, economics, arts, science, etc.

- Ordinal scale: variables are also nominal, but there is an ordering or ranking in them.

- Example: House numbers in a street: 121, 122, 123, 124, 125, etc.

- Nominal measures plus positional measures, including the median in particular.


* Numerical variables! - Yielding numerical responses -

* Interval scale and ratio scale *

- Interval scale: variables have an order, and the difference between values is a meaningful quantity.

- The zero value here is arbitrary.

- Example: temperature – the difference between 4 °C and 6 °C is the same as between 6 °C and 8 °C, but 8 °C is not twice as hot as 4 °C.

- Or, the measure of your eyesight: is 1.5 two times 0.75?

- Ratio scale: like the interval scale, variables in this scale have an order or ranking, but there is a true zero here.

- Example: 100 kg is twice as heavy as 50 kg.


* Numerical or quantitative variables can be further subdivided into continuous and discrete ones.

Continuous variables - such as time.

Discrete variables – such as family size.


* Example - variable types & scales of measurement

variable            example value    type / scale
country of birth    Australia        categorical, nominal
judo belt           Blue             categorical, ordinal
mortgage            $125,000         (continuous) numerical, ratio
class size          302              (discrete) numerical, ratio


* Two more terms!

Raw data: collected but unsorted.

Array : ordered data, increasing or decreasing.


* Describing and presenting data!

Data can be described and communicated in three main ways:

Tabular (in the form of tables) – frequency tables, contingency tables and super tables, etc.

Graphical- various forms of charts.

Summary (descriptive)- mean, standard deviation, median, etc.


Stem and leaf display

Stem   Leaf                                       Frq   Cum.
  1    .8                                           1      1
  2    .0 .2 .4 .5 .5 .5 .7 .8 .9 .9               10     11
  3    .1 .1 .2 .2 .3 .4 .4 .4 .6 .8 .8 .9 .9      13     24
  4    .0 .2 .3 .5 .6 .6                            6     30
  5    .0 .0 .1 .9                                  4     34
  6    .0 .1 .2 .5 .7 .7                            6     40
  7    .0 .2 .5 .6 .6                               5     45
  8    .0 .1 .5 .9                                  4     49
  9    .2                                           1     50
 10    .1                                           1     51
 11                                                 0     51
 12    .4                                           1     52
 13    .6                                           1     53
 14                                                 0     53
 15                                                 0     53
 16                                                 0     53
 17    .7                                           1     54
Total                                              54


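Frequency table!

Time (minutes)    Number of Calls   Class Mark   Cum. Freq   Cum. %
-1 to under 1            0               0            0        0.00%
 1 to under 3           11               2           11       20.37%
 3 to under 5           19               4           30       55.56%
 5 to under 7           10               6           40       74.07%
 7 to under 9            9               8           49       90.74%
 9 to under 11           2              10           51       94.44%
11 to under 13           1              12           52       96.30%
13 to under 15           1              14           53       98.15%
15 to under 17           0              16           53       98.15%
17 to under 19           1              18           54      100.00%
19 to under 21           0              20           54      100.00%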


Histogram

[Figure: Histogram of Call Durations. Horizontal axis: Duration in Minutes, in classes from "-1 and under 1" to "19 and under 21". Vertical axis: Number of Calls, 0 to 25.]


Frequency Polygon

[Figure: Frequency Polygon of Call Durations. Horizontal axis: Duration in Minutes, 0 to 20. Vertical axis: Number of Calls, 0 to 25.]


Ascending Ogive

[Figure: Ogive of Call Durations. Horizontal axis: Duration in Minutes, 0 to 22. Vertical axis: Proportion of Calls, 0% to 100%.]


Bar Chart

[Figure: Number of Calls Handled on an Average Weekday. Horizontal axis: Shift (Morning, Day, Evening, Night). Vertical axis: Number of Calls, 0 to 500.]


Pareto

[Figure: Pareto Diagram of Causes of Dissatisfaction with Consultants. Horizontal axis: Cause (Rude, Poor Knowledge, Didn't Listen, Poor Grammar, Too Formal, Too Familiar, Others). Vertical axis: % of Responses, 0% to 100%.]


* Summary measures~!

Central tendency: typical or representative value – a measure of location.

Dispersion: indicating the variation or spread in the data.

Shape of the grouped data.


* Presenting data in tables and charts

* Summary Measures


Univariate Data

Single variable


* Learning objectives *

1. Organize numerical data

2. Develop tables and charts for numerical data

3. Develop tables and charts for categorical data

4. Understand the principles of proper graphical presentation


* Two ways to organize numerical data

The ordered array

The stem-and-leaf display


* The ordered array !

An ordered array puts the raw data in rank order from the smallest to the largest.

An ordered array makes it easier to pick out extremes, typical values, and the area where the majority of the values are concentrated.


* The stem-and-leaf display !

This valuable data-organizing tool helps show how the values distribute and cluster in the data set.

The stem-and-leaf display is constructed, as its name suggests, from two parts:

- the stem

- the leaf


* Construct a stem-and-leaf display *

Example: 12, 45, 67, 26, 89, 56, 13, 15, 44, 36, 32, 20, 11, 10


Stem   Leaf         Frequency   Cumulative
 0                      0            0
 1     0 1 2 3 5        5            5
 2     0 6              2            7
 3     2 6              2            9
 4     4 5              2           11
 5     6                1           12
 6     7                1           13
 7                      0           13
 8     9                1           14
Total                  14


A stem-and-leaf chart makes the data more informative: it is useful for indicating the range, concentration and structure of the data.
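The construction can be sketched in Python for the fourteen example values. The sketch assumes two-digit whole numbers, so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is an illustrative choice, and the rows start at the smallest stem that actually occurs.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return rows of (stem, leaves, frequency, cumulative frequency)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)     # tens digit = stem, units digit = leaf
    rows = []
    cumulative = 0
    # Include empty stems so that gaps in the data stay visible
    for stem in range(min(groups), max(groups) + 1):
        leaves = groups.get(stem, [])
        cumulative += len(leaves)
        rows.append((stem, leaves, len(leaves), cumulative))
    return rows


data = [12, 45, 67, 26, 89, 56, 13, 15, 44, 36, 32, 20, 11, 10]
rows = stem_and_leaf(data)
for stem, leaves, freq, cum in rows:
    print(f"{stem:>2} | {' '.join(map(str, leaves)):<10} {freq:>2} {cum:>3}")
```

The last cumulative frequency should equal the number of observations, which is a quick check that no value was lost.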


* Tables and charts for numerical data *

1. The frequency distribution

2. The histogram

3. The polygon


* The frequency distribution table *

For large data sets, it is not convenient to analyze the observations by using an ordered array or a stem-and-leaf display. Instead we can arrange the observations into different groups (class groupings) to provide a more effective presentation.

This arrangement of data in tabular form is called a frequency distribution.


* A frequency distribution table * - dividing each count by the total number of funds gives the relative frequency distribution

5-year annualized percentage return     number of funds
-------------------------------------------------------
-10.0 to under -5.0                            1
 -5.0 to under  0.0                            3
  0.0 to under  5.0                           14
  5.0 to under 10.0                           58
 10.0 to under 15.0                           61
 15.0 to under 20.0                           17
 20.0 to under 25.0                            3
 25.0 to under 30.0                            1
Total                                        158
-------------------------------------------------------


* The procedure for establishing a frequency distribution table *

1. Select the number of classes.

2. Decide the class interval (width of interval).

3. Decide the boundaries of the classes.

Then establish the frequency distribution table.


* Selecting the number of classes *

Usually, at least 5 classes and at most 15 classes.

This means we can decide the number of classes ourselves, between 5 and 15.

Of course, larger data sets call for more classes than smaller ones.


* Deciding the class interval *

First find the range of the set of data: range = largest value – smallest value.

Width of interval = range / number of desired class groupings


* Deciding the boundaries *

Boundaries are the lower and upper ends of each class in the frequency distribution table.

The basic rule for deciding the boundaries is that the classes must include the entire range of the data but must not overlap.
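The three decisions (number of classes, width, boundaries) can be sketched in Python. The data are the fourteen values from the stem-and-leaf example; the choice of 5 classes, a lower boundary of 0 and a rounded width of 20 are illustrative assumptions, as are the function names.

```python
def suggested_width(values, num_classes):
    """Range divided by the desired number of classes, before rounding."""
    return (max(values) - min(values)) / num_classes


def frequency_distribution(values, lower, width, num_classes):
    """Count the values falling in each class 'lo to under hi'."""
    table = []
    for i in range(num_classes):
        lo = lower + i * width
        hi = lo + width
        count = sum(1 for v in values if lo <= v < hi)   # classes cannot overlap
        table.append((lo, hi, count))
    # The classes must include the entire range of the data
    if sum(count for _, _, count in table) != len(values):
        raise ValueError("classes do not cover the entire range of the data")
    return table


data = [12, 45, 67, 26, 89, 56, 13, 15, 44, 36, 32, 20, 11, 10]
print(suggested_width(data, 5))   # raw width; rounded up to a convenient 20 below
table = frequency_distribution(data, lower=0, width=20, num_classes=5)
for lo, hi, count in table:
    print(f"{lo:>3} to under {hi:<3} {count}")
```

Using "lo to under hi" classes means a value that sits exactly on a boundary is counted once, in the higher class.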


* Some highlights here *

1. Of course, you can choose 10 classes or 6, or any number between 5 and 15.

2. Of course, you can also just use 4 as the width of interval, or even 6.

3. But remember, the purpose of statistics is to make things simpler, and this is why we often subjectively choose a round number such as 5 or 10 as the width of interval.


* The relative frequency distribution, the percentage distribution, and the cumulative distribution *

The cumulative percentage is the percentage of funds below the lower boundary of the class interval.

5-year annualized       number     percentage   cumulative
percentage return       of funds   of funds     percentage
----------------------------------------------------------------------------------------
-10.0 to under -5.0        1          0.6          0.0
 -5.0 to under  0.0        3          1.9          0.6
  0.0 to under  5.0       14          8.9          2.5 = 0.6+1.9
  5.0 to under 10.0       58         36.7         11.4 = 0.6+1.9+8.9
 10.0 to under 15.0       61         38.6         48.1 = 0.6+1.9+8.9+36.7
 15.0 to under 20.0       17         10.8         86.7 = 0.6+1.9+8.9+36.7+38.6
 20.0 to under 25.0        3          1.9         97.5 = 0.6+1.9+8.9+36.7+38.6+10.8
 25.0 to under 30.0        1          0.6         99.4 = 0.6+1.9+8.9+36.7+38.6+10.8+1.9
Total                    158        100.0        100.0 = 0.6+1.9+8.9+36.7+38.6+10.8+1.9+0.6
----------------------------------------------------------------------------------------
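The percentage and cumulative percentage columns can be reproduced from the counts alone. The sketch below is one way to do it in Python; the variable names are illustrative, and the cumulative figure follows the table's convention of counting funds below the lower boundary of each class.

```python
# Number of funds in each class, from -10.0 to under -5.0 up to 25.0 to under 30.0
counts = [1, 3, 14, 58, 61, 17, 3, 1]
total = sum(counts)

# Percentage of funds in each class
percentages = [100 * c / total for c in counts]

# Cumulative percentage: funds below the lower boundary of each class
cumulative_below = []
running = 0
for c in counts:
    cumulative_below.append(100 * running / total)
    running += c

for pct, cum in zip(percentages, cumulative_below):
    print(f"{pct:5.1f} {cum:6.1f}")
```

Accumulating the counts and dividing once per class avoids adding up already-rounded percentages, which is why the sketch works from `counts` rather than from the rounded column.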


* Histogram *

Although tabular displays such as the ordered array, the stem-and-leaf display, and the frequency distribution table are effective for describing a large set of data, graphs (pictures) can present its features more vividly.

A picture is worth 1,000 words!


* What is a histogram?

A histogram is used to describe numerical data that have been grouped into frequency, relative frequency, or percentage distributions.

This means that the histogram comes into play after the frequency distribution has been established.


Histogram

[Figure: Histogram of Call Durations. Horizontal axis: Duration in Minutes, in classes from "-1 and under 1" to "19 and under 21". Vertical axis: Number of Calls, 0 to 25.]


* Frequency polygon *

A frequency polygon is formed by connecting the midpoints of all the classes in the frequency distribution!


Frequency Polygon

[Figure: Frequency Polygon of Call Durations. Horizontal axis: Duration in Minutes, 0 to 20. Vertical axis: Number of Calls, 0 to 25.]


Ascending Ogive – based on cumulative percentage

[Figure: Ogive of Call Durations. Horizontal axis: Duration in Minutes, 0 to 22. Vertical axis: Proportion of Calls, 0% to 100%.]


Tables and charts for categorical data

- The summary table

- Bar chart

- Pareto chart

- Pie chart

- Run chart


* The summary table *

A summary table is very similar to a frequency distribution table, since both are the basis for building the other graphs (or pictures).

However, the summary table is for categorical data, and the frequency distribution is for numerical data.


* Constructing a summary table *

• In the “Funds” example, there are altogether 259 mutual funds: 158 of them are growth funds and the other 101 are value funds.

• Previously, we sorted out only the 158 growth funds. The returns of these 158 growth funds are numerical data.

• Now, we classify all 259 funds into 5 groups by risk: very low, low, average, high, and very high.

• These five groups give us a categorical set of data to analyze.


Now, do it!

fund risk level     number of funds     percentage
--------------------------------------------------
very low                   6                2.32
low                       76               29.34
average                   82               31.66
high                      80               30.89
very high                 15                5.79
--------------------------------------------------
Total                    259              100.0


* The bar chart *

Based on the previous summary table, by using Microsoft Excel, we can build up a bar chart.

The bar chart presents the number of funds in each risk category.


Bar Chart – used for categorical data

[Figure: Number of Calls Handled on an Average Weekday. Horizontal axis: Shift (Morning, Day, Evening, Night). Vertical axis: Number of Calls, 0 to 500.]


* The pie chart *

Likewise, the pie chart is built from the summary table.

The pie chart represents the percentage column of the summary table.


[Figure: pie chart with segments Singles, Married / No kids, Full Nest 1, Full Nest 2, Full Nest 3, Empty Nest.]


* The properties of a pareto diagram *

1. When there are many groupings, we prefer using a Pareto diagram.

2. A Pareto diagram presents the most significant grouping first, followed by the others in descending order.

3. For the cumulative percentage polygon, the points are plotted at the midpoint of each category.


* The pareto diagram *

The Pareto diagram, also based on the summary table, is similar to the bar chart.

The differences are:

1. The Pareto diagram arranges the categories in descending order of their frequencies.

2. The Pareto diagram also includes a cumulative percentage polygon on the same graph.

3. Left side – percentage; right side – cumulative percentage.
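The ordering and the cumulative percentages behind a Pareto diagram can be sketched as follows. Applying it to the fund risk summary table is purely an illustration (the lecture's own Pareto diagram uses consultant complaint data, whose counts are not given here), and the function name `pareto_table` is an assumption.

```python
def pareto_table(counts):
    """Return (category, percentage, cumulative percentage) in descending order of frequency."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, count in ordered:
        running += count
        rows.append((category, 100 * count / total, 100 * running / total))
    return rows


# Counts taken from the fund risk summary table
risk_counts = {"very low": 6, "low": 76, "average": 82, "high": 80, "very high": 15}
rows = pareto_table(risk_counts)
for category, pct, cum in rows:
    print(f"{category:<10} {pct:6.2f} {cum:7.2f}")
```

The percentage column would be drawn as bars against the left axis and the cumulative column as the polygon against the right axis.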


[Figure: Pareto Diagram of Causes of Dissatisfaction with Consultants. Horizontal axis: Cause (Rude, Poor Knowledge, Didn't Listen, Poor Grammar, Too Formal, Too Familiar, Others). Vertical axis: % of Responses, 0% to 100%.]


[Figure: History of Chocolate Sales. Horizontal axis: Month, Jan-95 to Jul-99 in quarterly steps. Vertical axis: Sales ($'000), 0 to 120.]


Run Chart

A good tool for illustrating one or more (numerical) variables over time.

A run chart allows the identification of trends and periodicity.


* Summary measures *

To describe characteristics of a set of data by “Numbers + words”!


*Measuring from ungrouped (raw) data*

1. Central tendency (location)

2. Variation

3. shape


Central tendency

Most sets of data show a central point, around which the data group or cluster.

This central point is a typical or representative value for the whole set of data.

Three measures here:

1. The arithmetic mean

2. The median

3. The mode


* The arithmetic mean *

• Easy to calculate!

• Caution: the arithmetic mean is greatly affected by any extreme value or values.

• Therefore, when reporting an arithmetic mean for data with extreme values, the median and mode should be reported as well.


Mean for population - “N” is the population size.

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$


Mean for sample - “n” is the sample size.

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$


The median

The median is the value for which 50% of the observations are smaller and the other 50% are larger.

Caution with an even number of observations: the median in this case is the average of the two middle values.

Median = the $\frac{n+1}{2}$-th ordered observation


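Example

1, 2, 3, 4, 5 – odd number of data

The median is the (5+1)/2 = 3rd ordered observation, which is 3.

1, 2, 3, 4, 5, 6 – even number of data

The position is (6+1)/2 = 3.5, so the median is the average of the 3rd and 4th ordered observations: (3+4)/2 = 3.5.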


* Significance of median *

Whenever a set of data includes large extremes, the median is adopted, since extremes seriously distort the mean as a representative value.

The median is not affected by extreme values in a set of data.


The mode

Easy~! No calculation at all~!

Definition: the value in a set of data that appears most frequently!

Caution: there are different cases when reporting the mode:

1. Data with one mode

2. Data with no mode

3. Data that are bimodal or multimodal
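The three measures of central tendency can be checked with Python's standard `statistics` module. This is a small sketch: the data sets with an extreme value and with two modes are made up for illustration, and `multimode` requires Python 3.8 or later.

```python
import statistics

odd = [1, 2, 3, 4, 5]
even = [1, 2, 3, 4, 5, 6]

# Median: the middle value, or the average of the two middle values
median_odd = statistics.median(odd)
median_even = statistics.median(even)

# An extreme value pulls the mean but leaves the median unchanged
with_extreme = [1, 2, 3, 4, 100]          # hypothetical data
mean_extreme = statistics.mean(with_extreme)
median_extreme = statistics.median(with_extreme)

# Mode: the most frequent value(s); multimode returns every value tied for most frequent
bimodal = statistics.multimode([1, 2, 2, 3, 3])   # two modes
no_mode = statistics.multimode([1, 2, 3])         # every value appears once

print(median_odd, median_even, mean_extreme, median_extreme, bimodal, no_mode)
```

When every value appears exactly once, `multimode` returns all of them, which corresponds to the "data with no mode" case above.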


Midrange

Numerical data only!

$\text{Midrange} = \frac{X_{\text{largest}} + X_{\text{smallest}}}{2}$


* Dispersion or spread *

1. Range

2. Variance

3. Standard deviation


Dispersion, or spread

After noting the central value (central tendency), we should pay attention to how, and how much, the data spread out from that central value.

Measures of variation are used to quantify the dispersion (spread) of a set of data.

There are five common measures of variation: range, interquartile range, variance, standard deviation and coefficient of variation.


Range

Range = largest value – smallest value


Variance and standard deviation

• The range is a measure of the total spread.

• The variance and standard deviation, in contrast, take into account how all the values of the data are distributed around the mean.


Variance

• The variance is roughly the average of the squared differences between each of the values in a set of data and the mean.


For population

$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$


For sample

$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$


* Standard deviation *

The standard deviation is the square root of the variance.

As a rule of thumb, for roughly bell-shaped data about 95% of the values lie within two standard deviations of the mean.
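The population and sample formulas differ only in the divisor (N versus n - 1). The sketch below writes both out by hand and takes the square root for the standard deviation; the five data values and the function names are assumptions made for the example.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [1, 2, 3, 4, 5]          # hypothetical data
sigma_squared = population_variance(data)
s_squared = sample_variance(data)

# Standard deviation is the square root of the variance
sigma = math.sqrt(sigma_squared)
s = math.sqrt(s_squared)
print(sigma_squared, s_squared, round(sigma, 3), round(s, 3))
```

Python's standard `statistics` module provides the same quantities as `pvariance`, `variance`, `pstdev` and `stdev`, which can serve as a cross-check.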


This is the end of today’s lecture!
