CHAPTER 2

• 2.1 - Basic Definitions and Properties
    Population Characteristics = “Parameters”
    Sample Characteristics = “Statistics”
    Random Variables (Numerical vs. Categorical)

• 2.2, 2.3 - Exploratory Data Analysis
    Graphical Displays
    Descriptive Statistics
      • Measures of Center (mode, median, mean)
      • Measures of Spread (range, variance, standard deviation)

“Classical Scientific Method”

• Hypothesis – Define the study population... What’s the question?
• Experiment – Designed to test hypothesis
• Observations – Collect sample measurements
• Analysis – Do the data formally tend to support or refute the hypothesis, and with what strength? (Lots of juicy formulas...)
• Conclusion – Reject or retain hypothesis; is the result statistically significant?
• Interpretation – Translate findings in context!

Statistics is implemented in each step of the classical scientific method!

• Analysis – Do the data formally tend to support or refute the hypothesis, and with what strength? (Lots of juicy formulas...)

To help answer this question, we should first try to obtain an informal “feel” for the sample data we have collected, and see if it suggests anything about the population distribution.

~ Exploratory Data Analysis ~

1. Visual Displays (charts, tables, graphs, etc.) “What do the data look like?”
2. “Descriptive Statistics” (measures of center, measures of spread, proportions, etc.) “How can the data be summarized?”

Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

From these values, we can construct a table which consists of the frequencies of each age-interval in the dataset, i.e., a frequency table.

Class Interval    Frequency
[10, 20)          4
[20, 30)          8
[30, 40)          5
[40, 50)          2
[50, 60)          1
Total             n = 20

“Endpoint convention”: Here, the left endpoint is included, but not the right.

Note! Stay away from “10-20,” “20-30,” “30-40,” etc. Such labels leave it unclear which class contains a boundary value such as 20.

In published journal articles, the original data are almost never shown, but displayed in tabular form as above. This summary is called “grouped data.”

Frequency Histogram (bar heights 4, 8, 5, 2, 1): Suggests population may be skewed to the right (i.e., positively skewed).
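The frequency table above can be rebuilt from the raw sample with a few lines of Python. This is only a minimal sketch: the names `ages`, `edges`, and `frequency_table` are illustrative assumptions, not part of the lecture. It applies the same endpoint convention, counting a value in a class when left <= value < right.

```python
# Sample of n = 20 ages from the example.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]

# Class boundaries; each class is the half-open interval [left, right).
edges = [10, 20, 30, 40, 50, 60]


def frequency_table(values, edges):
    """Count how many values fall in each half-open class [left, right)."""
    table = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Left endpoint included, right endpoint excluded.
        count = sum(1 for v in values if left <= v < right)
        table.append(((left, right), count))
    return table


freq = frequency_table(ages, edges)
for (left, right), count in freq:
    print(f"[{left}, {right})  {count}")
print("Total n =", sum(count for _, count in freq))
```

With this convention an age of exactly 20 is counted in [20, 30) and nowhere else, which is the ambiguity that labels like “10-20” fail to resolve.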


Often though, it is preferable to work with proportions, i.e., relative frequencies… Divide frequencies by n = 20.

Class Interval    Frequency    Relative Frequency
[10, 20)          4            4/20 = 0.20
[20, 30)          8            8/20 = 0.40
[30, 40)          5            5/20 = 0.25
[40, 50)          2            2/20 = 0.10
[50, 60)          1            1/20 = 0.05
Total             n = 20       20/20 = 1.00

Relative frequencies are always between 0 and 1, and sum to 1.

Relative Frequency Histogram (bar heights 0.20, 0.40, 0.25, 0.10, 0.05; vertical axis from 0.0 to 0.4)
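As a quick check of the two properties just stated, the relative frequencies can be computed with exact fractions. This is a sketch only; the dictionary `frequencies` and the other variable names are assumptions chosen for illustration.

```python
from fractions import Fraction

# Frequencies from the table, keyed by class interval.
frequencies = {
    "[10, 20)": 4,
    "[20, 30)": 8,
    "[30, 40)": 5,
    "[40, 50)": 2,
    "[50, 60)": 1,
}

n = sum(frequencies.values())

# Relative frequency = frequency / n, kept as an exact fraction.
relative = {interval: Fraction(count, n) for interval, count in frequencies.items()}

for interval, p in relative.items():
    print(f"{interval}  {float(p):.2f}")

total = sum(relative.values())
print("Sum of relative frequencies:", total)
```

Using `Fraction` avoids floating-point roundoff, so the sum comes out as exactly 1.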


Class Interval    Frequency    Relative Frequency    Cumulative Relative Frequency
                                                     (0.00)
[10, 20)          4            4/20 = 0.20           0.20
[20, 30)          8            8/20 = 0.40           0.60
[30, 40)          5            5/20 = 0.25           0.85
[40, 50)          2            2/20 = 0.10           0.95
[50, 60)          1            1/20 = 0.05           1.00
Total             n = 20       20/20 = 1.00

For example, the first cumulative value says: “0.20 of the sample is under 20 yrs old.”

Example: What proportion of the sample is under 34 years old? From the grouped data, this can be answered only approximately, not exactly.



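Cumulative relative frequencies always increase from 0 to 1.

Solution: Assuming the ages are spread evenly within each class, [30, 34) covers 4/10 of the class [30, 40), so it contains 4/10 of 0.25 = 0.1. The classes below 30, i.e., [0, 30), contain 0.6. Sum = 0.7.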
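The interpolation in the solution can be written as a small function and compared with the proportion computed directly from the raw ages. This is a sketch under the stated assumption that values are spread evenly within each class; `approx_proportion_below` and the variable names are hypothetical.

```python
edges = [10, 20, 30, 40, 50, 60]
rel_freq = [0.20, 0.40, 0.25, 0.10, 0.05]


def approx_proportion_below(x, edges, rel_freq):
    """Approximate the proportion of the sample below x from grouped data.

    Assumes values are spread evenly within each class (linear interpolation).
    """
    total = 0.0
    for left, right, p in zip(edges[:-1], edges[1:], rel_freq):
        if x >= right:
            # The whole class lies below x.
            total += p
        elif x > left:
            # Only the fraction of the class from left up to x is counted.
            total += p * (x - left) / (right - left)
    return total


approx = approx_proportion_below(34, edges, rel_freq)

# The raw data allow an exact answer for comparison.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]
exact = sum(1 for a in ages if a < 34) / len(ages)

print(f"approximate (grouped data): {approx:.2f}")
print(f"exact (raw data):           {exact:.2f}")
```

The grouped-data estimate is 0.70, while 13 of the 20 raw values are below 34, giving 0.65. The gap reflects the even-spread assumption: within [30, 40) only one value (31) is actually below 34.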



But alas, there is a major problem….


Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older, and wish to “lump” them into a single class interval.

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Class Interval    Frequency    Relative Frequency
[10, 20)          4 values     4/20 = 0.20
[20, 30)          8 values     8/20 = 0.40
[30, 60)          8 values     8/20 = 0.40
Total                          20/20 = 1.00

What effect will this have on the histogram?

Relative Frequency Histogram (bar heights 0.20, 0.40, 0.40; vertical axis from 0.0 to 0.4)

The skew no longer appears. The histogram is distorted because of the presence of an outlier (59) in the data, creating the need for unequal class widths.

OUTLIERS (A Pain in the Tuches)

• What are they?
    Informally, an outlier is a sample data value that is either “much” smaller or larger than the other values.

• How do they arise?
    o experimental error
    o measurement error
    o recording error
    o not an error; genuine

• What can we do about them?
    o double-check them if possible
    o delete them?
    o include them… somehow
    o perform analysis both ways

IDEA: Instead of having height of each class rectangle = relative frequency, make area of each class rectangle = relative frequency.

Since area = height × width, the height must be “Density” = relative frequency / width.

Class Interval    Relative Frequency    Width    Density (= height)
[10, 20)          0.20                  10       0.20/10 = 0.020
[20, 30)          0.40                  10       0.40/10 = 0.040
[30, 60)          0.40                  30       0.40/30 = 0.0133…
Total             20/20 = 1.00

Density Histogram (bar heights 0.02, 0.04, 0.0133…; bar areas 0.20, 0.40, 0.40): Total Area = 1!

The outlier is included, and the overall skewed appearance is restored.

Exercise: What if the outlier was 99 instead of 59?
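The density calculation can be sketched in Python as follows. The list `classes` and the other names are illustrative assumptions; the numbers are those of the lumped table.

```python
# (class interval, relative frequency) for the lumped classes.
classes = [((10, 20), 0.20), ((20, 30), 0.40), ((30, 60), 0.40)]

densities = []
for (left, right), p in classes:
    width = right - left
    # Height of the rectangle so that area = relative frequency.
    densities.append(p / width)

# Area of each rectangle = height * width; the areas should sum to 1.
total_area = sum(
    d * (right - left) for d, ((left, right), _) in zip(densities, classes)
)

for ((left, right), p), d in zip(classes, densities):
    print(f"[{left}, {right})  rel. freq. = {p:.2f}  density = {d:.4f}")
print(f"Total area = {total_area:.2f}")
```

For the exercise, changing the last class to `(30, 100)` in this sketch is one way to see how a more extreme outlier flattens the final rectangle while its area stays 0.40.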


“Measures of Center”

Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}

          Data values $x_i$    Frequencies $f_i$
i = 1     70                   1
i = 2     80                   4
i = 3     90                   2
i = 4     100                  3
Total                          n = 10

• sample mode: most frequent value = 80

• sample median: “middle” value = (80 + 90) / 2 = 85

• sample mean: average value $\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{10} [(70)(1) + (80)(4) + (90)(2) + (100)(3)] = 87$

(Quartiles are found similarly: Q1 = 80, Q2 = 85, Q3 = 100)
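The three measures of center can be checked with Python's standard `statistics` module. This is a minimal sketch; the dictionary `table`, mapping each data value to its frequency, is an assumed representation of the frequency table.

```python
import statistics

# Frequency table: data value x_i -> frequency f_i.
table = {70: 1, 80: 4, 90: 2, 100: 3}

n = sum(table.values())

# Sample mean from the frequency table: (1/n) * sum of x_i * f_i.
mean = sum(x * f for x, f in table.items()) / n

# Expand the table back into the raw list of scores.
scores = [x for x, f in table.items() for _ in range(f)]

mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # average of the two middle values when n is even

print("n =", n)
print("mode =", mode)
print("median =", median)
print("mean =", mean)
```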


“Notation, notation, notation.”

Data values $x_i$    Frequencies $f_i$    Relative Frequencies $f(x_i) = f_i / n$
70                   1                    1/10 = 0.1
80                   4                    4/10 = 0.4
90                   2                    2/10 = 0.2
100                  3                    3/10 = 0.3
Total                n = 10               10/10 = 1.0

• sample mean

$\bar{x} = \frac{1}{n} \sum x_i f_i = \sum x_i f(x_i) = (70)\frac{1}{10} + (80)\frac{4}{10} + (90)\frac{2}{10} + (100)\frac{3}{10} = 87$

“Measures of Spread”

Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}, with sample mean $\bar{x}$ = 87

… but how do we measure the “spread” of a set of values?

First attempt:

• sample range = $x_n - x_1$ = 100 – 70 = 30. Simple, but…
    Ignores all of the data except the extreme points, thus far too sensitive to outliers to be of any practical value.
    Example: Company employee salaries, including CEO

Can modify with…

• sample interquartile range (IQR) = Q3 – Q1 = 100 – 80 = 20.

We would still prefer a measure that uses all of the data.
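The range and IQR for the exam scores can be computed as below. Quartile conventions differ between textbooks and software; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data, which reproduces the values Q1 = 80 and Q3 = 100 quoted in the slides. The variable names are illustrative.

```python
import statistics

scores = [70, 80, 80, 80, 80, 90, 90, 100, 100, 100]
ordered = sorted(scores)

# Sample range: largest value minus smallest value.
sample_range = ordered[-1] - ordered[0]

# Assumed convention: quartiles are medians of the lower and upper halves
# (the middle value is excluded from both halves when n is odd).
half = len(ordered) // 2
lower = ordered[:half]
upper = ordered[(len(ordered) + 1) // 2:]

q1 = statistics.median(lower)
q3 = statistics.median(upper)
iqr = q3 - q1

print("range =", sample_range)
print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
```

Other quartile methods, such as the defaults of `statistics.quantiles`, may give slightly different Q1 and Q3 for the same data.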

Better attempt: Calculate the average of the “deviations from the mean.”

Data values $x_i$    Frequencies $f_i$    Deviations from mean $x_i - \bar{x}$
70                   1                    70 – 87 = –17
80                   4                    80 – 87 = –7
90                   2                    90 – 87 = +3
100                  3                    100 – 87 = +13
Total                n = 10

$\frac{1}{n} \sum (x_i - \bar{x}) f_i = \frac{1}{10} [(-17)(1) + (-7)(4) + (3)(2) + (13)(3)] = 0$ ????????

This is not a coincidence – the deviations always sum to 0* – so it is not a good measure of variability.

* Physically, the sample mean is a “balance point” for the data.
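Why the deviations always sum to 0 follows directly from the definition of the sample mean. A short derivation, using the same symbols as the slides and the fact that the frequencies sum to n:

```latex
\[
\frac{1}{n}\sum_i (x_i - \bar{x})\, f_i
  = \frac{1}{n}\sum_i x_i f_i \;-\; \bar{x}\cdot\frac{1}{n}\sum_i f_i
  = \bar{x} - \bar{x}\cdot\frac{n}{n}
  = 0 .
\]
```

The negative deviations below the mean exactly cancel the positive deviations above it, which is the “balance point” interpretation.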

Calculate a modified average of the “squared deviations from the mean.”

• sample variance

$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 f_i = \frac{1}{9} [(-17)^2 (1) + (-7)^2 (4) + (3)^2 (2) + (13)^2 (3)] = 112.22$

• sample standard deviation

$s = \sqrt{s^2} = 10.59$
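The variance and standard deviation can be verified numerically from the frequency table. This is a sketch only; `table` and the other names are assumptions for illustration, and the formula is the n – 1 version given in the slides.

```python
import math

# Frequency table: data value x_i -> frequency f_i.
table = {70: 1, 80: 4, 90: 2, 100: 3}

n = sum(table.values())
mean = sum(x * f for x, f in table.items()) / n

# Numerator of s^2: sum of squared deviations from the mean, weighted by frequency.
sum_of_squares = sum((x - mean) ** 2 * f for x, f in table.items())

# Divide by n - 1 (the "modified" average), not by n.
variance = sum_of_squares / (n - 1)
std_dev = math.sqrt(variance)

print(f"sum of squares = {sum_of_squares:.0f}")
print(f"s^2 = {variance:.2f}")
print(f"s   = {std_dev:.2f}")
```

The sum of squares is 289 + 196 + 18 + 507 = 1010, so s² = 1010/9 ≈ 112.22 and s ≈ 10.59, matching the slide.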

Comments

$\bar{x}$ is an unbiased estimator of the population mean $\mu$, and $s^2$ is an unbiased estimator of the population variance $\sigma^2$. (Their “expected values” are $\mu$ and $\sigma^2$, respectively.)

Beware of roundoff error!!! There is an alternate, more computationally stable formula for sample variance $s^2$.

The numerator of $s^2$ is called a sum of squares (SS); the denominator “n – 1” is the number of degrees of freedom (df) of the n deviations $x_i - \bar{x}$, because they must satisfy a constraint (sum = 0), hence 1 degree of freedom is “lost.”

A natural setting for these formulas and concepts is geometric, specifically, the Pythagorean Theorem: $a^2 + b^2 = c^2$. See lecture notes appendix…
