Epidemiology 9509: Principles of Biostatistics
textbook - The Wonders of Biostatistics, Chapter 2
Descriptive Statistics
John Koval
Department of Epidemiology and Biostatistics
University of Western Ontario

Epidemiology 9509 - Principles of Biostatistics textbook - The …publish.uwo.ca/~jkoval/courses/Epid9509/chapter2/descriptive.pdf · Principles of Biostatistics textbook - The Wonders


What we are learning today

How to describe data

1. numerically

2. pictorially


statistics for discrete data

{current smoker, former smoker, never smoker}
e.g. C, F, N, C, N

1. frequency: two C's, one F, two N's

2. relative frequency: C 0.40, F 0.20, N 0.40

3. percentage: C 40%, F 20%, N 40%
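The three summaries above can be reproduced with a short sketch. The function name `frequency_table` is illustrative only and not part of the course material.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    n = len(values)
    return {
        category: (count, count / n, 100 * count / n)
        for category, count in counts.items()
    }

# Smoking-status example from the slide: C, F, N, C, N
table = frequency_table(["C", "F", "N", "C", "N"])
for category, (freq, rel, pct) in table.items():
    print(f"{category}: frequency={freq}, relative={rel:.2f}, percentage={pct:.0f}%")
```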


statistics for ordinal data

in addition to frequency: cumulative frequency
(only makes sense for an ordinal scale)

{non-smoker, occasional smoker, regular smoker}
e.g. N, O, R, N, R

1. cumulative frequency: two N's; three N's or O's; five N's, O's or R's

2. cumulative relative frequency: 0.40 N's; 0.60 N's or O's; 1.00 N's, O's or R's

3. cumulative percentage: 40%, 60%, 100%
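A minimal sketch of the cumulative summaries, assuming the caller supplies the category order explicitly (the order is what makes the scale ordinal). The name `cumulative_table` is hypothetical.

```python
from collections import Counter
from itertools import accumulate

def cumulative_table(values, ordered_levels):
    """Return (level, cumulative freq, cumulative rel. freq, cumulative %) rows."""
    counts = Counter(values)
    freqs = [counts[level] for level in ordered_levels]
    n = len(values)
    return [
        (level, c, c / n, 100 * c / n)
        for level, c in zip(ordered_levels, accumulate(freqs))
    ]

# Ordinal example from the slide: N, O, R, N, R with order N < O < R
rows = cumulative_table(["N", "O", "R", "N", "R"], ["N", "O", "R"])
for row in rows:
    print(row)
```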


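statistics for continuous data
e.g. 70, 80, 90, 100, 110

◮ mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{70+80+90+100+110}{5} = 90$$

◮ variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{20^2+10^2+0^2+10^2+20^2}{4} = \frac{400+100+0+100+400}{4} = 250$$

◮ standard deviation

$$s = \sqrt{s^2} = \sqrt{250} = 15.8$$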
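The same arithmetic can be checked with a few lines of code. This is a sketch using the n−1 divisor shown above; the function names are illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Sum of squared deviations from the mean, divided by n - 1
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    return math.sqrt(sample_variance(xs))

data = [70, 80, 90, 100, 110]
print(mean(data), sample_variance(data), round(sample_sd(data), 1))
```

Python's standard `statistics` module offers `mean`, `variance` and `stdev`, which use the same n−1 divisor for the sample variance.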


nonparametric statistics

70, 80, 90, 100, 110

◮ median

1. rank data: put in ascending order x[1], x[2], ..., x[n]

2. if n odd, use x[(n+1)/2]

   if n even, compute average (x[n/2] + x[n/2+1]) / 2

example: n = 5, median is x[3] = 90
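The median rule above can be sketched as follows. The slide's ranks are 1-based, so each index is shifted by one for Python's 0-based lists.

```python
def median(data):
    xs = sorted(data)          # step 1: rank the data
    n = len(xs)
    if n % 2 == 1:
        # n odd: x[(n+1)/2] in 1-based notation
        return xs[(n + 1) // 2 - 1]
    # n even: average of x[n/2] and x[n/2 + 1] in 1-based notation
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

print(median([70, 80, 90, 100, 110]))
```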


nonparametric spread

◮ range
e.g. 70, 80, 90, 100, 110

1. rank data

2. minimum and maximum: x[1], x[n]

3. range: (minimum, maximum) = (x[1], x[n])

for example, x[1] = 70 and x[n] = 110, so the range is (70, 110)


quartile

1. first (lower) quartile: Q1
   25% of the data are less than Q1

2. second quartile (median): Q2
   50% of the data are less than Q2

3. third (upper) quartile: Q3
   75% of the data are less than Q3


computing quartiles

1. choose the proportion q:
   q1 = 0.25 (first quartile)
   q2 = 0.50 (median)
   q3 = 0.75 (third quartile)

2. rank data

3. compute nq

   3.1 if nq is an integer, quartile = (x[nq] + x[nq+1]) / 2

   3.2 if nq is not an integer, quartile = x[⌊nq⌋+1]

where ⌊m⌋ means the greatest integer less than or equal to m


example

70, 80, 90, 100, 110

1. q1 = .25, nq1 = 1.25, ⌊nq1⌋ = 1; first quartile is x[2] = 80

2. q2 = .50, nq2 = 2.5, ⌊nq2⌋ = 2; median is x[3] = 90

3. q3 = .75, nq3 = 3.75, ⌊nq3⌋ = 3; third quartile is x[4] = 100
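The quartile rule (average x[nq] and x[nq+1] when nq is an integer, otherwise take x[⌊nq⌋+1]) can be sketched as below. The function name `quartile` and the choice to pass k = 1, 2, 3 rather than q itself are assumptions made here so that the integer test uses exact integer arithmetic instead of floating point.

```python
def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) using the rule on the slides, with q = k/4."""
    xs = sorted(data)                 # rank the data
    n = len(xs)
    floor_nq, remainder = divmod(n * k, 4)
    if remainder == 0:
        # nq is an integer: average x[nq] and x[nq+1] (1-based ranks)
        return (xs[floor_nq - 1] + xs[floor_nq]) / 2
    # nq is not an integer: x[floor(nq) + 1] in 1-based ranks
    return xs[floor_nq]

data = [70, 80, 90, 100, 110]
print([quartile(data, k) for k in (1, 2, 3)])
```

Statistical packages implement several different quartile definitions, so library results may differ slightly from this rule on small samples.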


Ranges

1. quartile range: (Q1, Q3)
   example: (80, 100)

2. interquartile range (IQR): IQR = Q3 − Q1
   example: IQR = 100 − 80 = 20


example two
65, 70, 80, 85, 90, 100, 110, 120

1. mean

$$\bar{x} = \frac{65 + 70 + 80 + 85 + 90 + 100 + 110 + 120}{8} = 90$$

2. variance

$$s^2 = \frac{25^2+20^2+10^2+5^2+0^2+10^2+20^2+30^2}{7} = \frac{625+400+100+25+0+100+400+900}{7} = \frac{2550}{7} = 364.2857$$

3. standard deviation

$$s = \sqrt{364.2857} = 19.1$$


nonparametrics
65, 70, 80, 85, 90, 100, 110, 120

1. first quartile (Q1): q1 = 0.25, so nq = 2

   Q1 = (x[2] + x[3]) / 2 = (70 + 80) / 2 = 75

2. median (Q2): q2 = 0.5, so nq = 4

   median = (x[4] + x[5]) / 2 = (85 + 90) / 2 = 87.5

3. third quartile (Q3): q3 = 0.75, so nq = 6

   Q3 = (x[6] + x[7]) / 2 = (100 + 110) / 2 = 105


nonparametrics (continued)

IQR = Q3 − Q1 = 105 − 75 = 30


Empirical Intervals

◮ x̄ ± sd
  e.g. 90.0 ± 19.1 = (70.9, 109.1)
  68% of data (if data normally distributed)

◮ x̄ ± 2sd
  e.g. 90.0 ± 38.2 = (51.8, 128.2)
  95% of data (if data normally distributed)

◮ x̄ ± 3sd
  e.g. 90.0 ± 57.3 = (32.7, 147.3)
  almost all data (if data normally distributed)
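As a rough check, the intervals can be computed for the eight-value example and compared with the share of observations that actually falls inside each one. This is a sketch; `empirical_interval` and `fraction_inside` are illustrative names.

```python
import statistics

def empirical_interval(mean, sd, k):
    """Interval mean ± k·sd."""
    return (mean - k * sd, mean + k * sd)

def fraction_inside(data, interval):
    lo, hi = interval
    return sum(lo <= x <= hi for x in data) / len(data)

data = [65, 70, 80, 85, 90, 100, 110, 120]
mean = statistics.mean(data)
sd = statistics.stdev(data)      # sample standard deviation (n - 1 divisor)

for k in (1, 2, 3):
    lo, hi = empirical_interval(mean, sd, k)
    print(k, round(lo, 1), round(hi, 1), fraction_inside(data, (lo, hi)))
```

With only eight observations the observed shares need not match the 68% and 95% figures, which describe a normal distribution rather than any particular small sample.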


Empirical Intervals - continued

1. human data are not usually normally distributed

2. these intervals are often confused with the distribution of the sample mean (to be discussed later)


Graphical Summaries

1. histograms (bar charts)

2. stem-and-leaf plots

3. box-and-whisker (box) plots

show distribution of the data graphically


histogram/bar chart

plot:

◮ y-axis is frequency/relative frequency/percentage

◮ x-axis is range of values of data

for discrete data: at each possible value, draw a box whose height is proportional to frequency/relative frequency/percentage


sample bar chart

[Figure: bar chart. x-axis: Smoking Status (C, F, N); y-axis: Frequency (ticks at 1 and 2)]


histogram

continuous data: like a bar chart, but the bars are contiguous

1. choose number of bins (5-10); rule of thumb: log2(n)

2. create boundaries for bins that DO NOT include data values

3. count number of data points in each bin

4. plot as for bar chart


example two

65, 70, 80, 85, 90, 100, 110, 120

1. n = 8, use 3 bins

2. range is 65 to 120, a width of 55
   use 3 bins of width 20: 63-83; 83-103; 103-123

3. 3 in first bin, 3 in second bin, 2 in last bin
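The binning steps can be sketched as follows. The helper names are hypothetical, and rounding log2(n) to the nearest whole number is an assumption about how the rule of thumb is applied.

```python
import math
from bisect import bisect_right

def suggested_bins(n):
    # Rule of thumb: about log2(n) bins (rounded; at least one bin)
    return max(1, round(math.log2(n)))

def bin_counts(data, edges):
    """Count data points per bin; edges must be sorted and must not equal any data value."""
    if any(x in edges for x in data):
        raise ValueError("bin boundaries must not coincide with data values")
    if min(data) < edges[0] or max(data) > edges[-1]:
        raise ValueError("data fall outside the bin boundaries")
    counts = [0] * (len(edges) - 1)
    for x in data:
        # bisect_right gives the number of edges <= x; subtract 1 for the bin index
        counts[bisect_right(edges, x) - 1] += 1
    return counts

data = [65, 70, 80, 85, 90, 100, 110, 120]
edges = [63, 83, 103, 123]
print(suggested_bins(len(data)), bin_counts(data, edges))
```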


sample histogram

[Figure: histogram. x-axis: Blood pressure (bin boundaries 63, 83, 103, 123); y-axis: Frequency (ticks at 1, 2, 3)]


Stem-and-leaf plot

suggested by John Tukey: a quick histogram on its side

1. sort (order, rank) data

2. count number of values sharing the same first digit (or first and second digits)

65, 70, 80, 85, 90, 100, 110, 120

 6| 5
 7| 0
 8| 0 5
 9| 0
10| 0
11| 0
12| 0
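A stem-and-leaf display like the one above can be generated with a short sketch. It assumes whole-number data with the last digit as the leaf, and the name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(data, leaf_unit=10):
    """Return display lines; stem = value // leaf_unit, leaf = value % leaf_unit."""
    groups = defaultdict(list)
    for x in sorted(data):                       # step 1: sort the data
        stem, leaf = divmod(x, leaf_unit)
        groups[stem].append(leaf)                # step 2: collect leaves per stem
    lines = []
    # Include empty stems so that gaps in the data stay visible
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2}| {leaves}".rstrip())
    return lines

for line in stem_and_leaf([65, 70, 80, 85, 90, 100, 110, 120]):
    print(line)
```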


second attempt

 6- 7| 5 0
 8- 9| 0 5 0
10-11| 0 0
12   | 0


Tukey box-and-whisker plot

1. centred at median
2. lower hinge = Q1; upper hinge = Q3
3. rectangle from lower hinge to upper hinge
4. IQR is distance between lower and upper hinges
5. lower inner fence = Q1 − 1.5(IQR)
6. upper inner fence = Q3 + 1.5(IQR)
7. lower outer fence = Q1 − 3.0(IQR)
8. upper outer fence = Q3 + 3.0(IQR)
9. lower adjacent value is smallest value greater than the lower inner fence
10. upper adjacent value is largest value lower than the upper inner fence
11. whiskers drawn from hinges to the adjacent values
12. values between inner and outer fences: possible outliers; marked with an asterisk
13. values beyond the outer fences: probable outliers; marked with a zero


example two
65, 70, 80, 85, 90, 100, 110, 120

1. median is 87.5

2. lower hinge is 75; upper hinge is 105

3. IQR is 30

4. lower inner fence is 75 - 1.5(30) = 30

5. upper inner fence is 105 + 1.5(30) = 150

6. lower outer fence is 75 - 3.0(30) = -15

7. upper outer fence is 105 + 3.0(30) = 195

8. lower adjacent value is 65

9. upper adjacent value is 120

10. NO values between inner and outer fences: no possible outliers

11. NO values beyond outer fences: no probable outliers
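The fence calculations above can be sketched in code. The function name `tukey_summary` is hypothetical, and the treatment of values lying exactly on a fence (counted here as possible outliers) is an assumption the slides do not settle.

```python
def tukey_summary(data, q1, q3):
    """Fences, adjacent values and outliers for a Tukey boxplot, given the hinges."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    # Values strictly inside the inner fences determine the whisker ends
    inside = [x for x in data if inner[0] < x < inner[1]]
    possible = [x for x in data
                if outer[0] <= x <= inner[0] or inner[1] <= x <= outer[1]]
    probable = [x for x in data if x < outer[0] or x > outer[1]]
    return {
        "inner_fences": inner,
        "outer_fences": outer,
        "adjacent_values": (min(inside), max(inside)),
        "possible_outliers": possible,
        "probable_outliers": probable,
    }

data = [65, 70, 80, 85, 90, 100, 110, 120]
summary = tukey_summary(data, q1=75, q3=105)
for name, value in summary.items():
    print(name, value)
```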


sample Tukey boxplot

SAS Proc BOXPLOT schematic

[Figure: uni-dimensional Tukey boxplot. Axis: Blood pressure, 60 to 130 in steps of 10]


Modern box-and-whisker plot

1. centred at median

2. lower hinge = Q1; upper hinge = Q3

3. rectangle from lower hinge to upper hinge

4. IQR is distance between lower and upper hinges

5. whiskers drawn from minimum to maximum

SAS Proc BOXPLOT skeletal


sample modern boxplot

[Figure: modern boxplot. Axis: Blood pressure, 60 to 130 in steps of 10]


Why graphical display

1. is data distribution symmetrical (Normal)?

2. is data distribution lop-sided? skewed? left or right? negative or positive?

3. are there outliers? (not handled by modern boxplot)


Histogram with right skew

[Figure: histogram with right skew. x-axis: Blood pressure (bin boundaries 63, 83, 103, 123, 143, 163, 183); y-axis: Frequency (ticks at 4, 8, 12)]


Histogram with left skew

[Figure: histogram with left skew. x-axis: Blood pressure (bin boundaries 63, 83, 103, 123, 143, 163, 183); y-axis: Frequency (ticks at 4, 8, 12)]


Histogram with possible outlier

[Figure: histogram with a possible outlier. x-axis: Blood pressure (bin boundaries 63, 83, 103, 123, 143, 163, 183); y-axis: Frequency (ticks at 4, 8, 12)]


Using stem-and-leaf plots

65, 70, 80, 85, 90, 100, 110, 120, 140, 160

 6| 5
 7| 0
 8| 0 5
 9| 0
10| 0
11| 0
12| 0
13|
14| 0
15|
16| 0


second attempt

 6- 7| 5 0
 8- 9| 0 5 0
10-11| 0 0
12-13| 0
14-15| 0
16-17| 0


boxplot with right skew

[Figure: boxplot with right skew. Axis: Blood pressure, 60 to 130 in steps of 10]


boxplot with left skew

[Figure: boxplot with left skew. Axis: Blood pressure, 60 to 130 in steps of 10]


boxplot with outliers

[Figure: boxplot with outliers. Axis: Blood pressure, 60 to 130 in steps of 10]


skewness

measure of asymmetry (lopsidedness) of data

1. for symmetric data, sample skewness is close to zero

2. for data skewed to the right, skewness has a positive value

3. for data skewed to the left, skewness has a negative value


calculating sample skewness

1. accurate value

$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$

2. approximation

$$3\left(\frac{\bar{x} - \text{Median}}{s}\right)$$

3. second approximation (poorer than preceding)

$$\frac{\bar{x} - \text{Mode}}{s}$$


Example

For the dataset {70, 80, 90, 100, 110}, sample skewness is

$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3} = \frac{(-20)^3 + (-10)^3 + 0^3 + 10^3 + 20^3}{5(250)(15.8)} = 0$$


Example two

For the dataset {65, 70, 80, 85, 90, 100, 110, 120}, sample skewness is

$$\frac{(-25)^3 + (-20)^3 + (-10)^3 + (-5)^3 + 0^3 + 10^3 + 20^3 + 30^3}{8(364.2857)(19.086)} = \frac{-15625 - 8000 - 1000 - 125 + 0 + 1000 + 8000 + 27000}{55622.8416} = \frac{11250}{55622.8416} = 0.202$$

or, using the median approximation,

$$3\left(\frac{90 - 87.5}{19.086}\right) = \frac{3(2.5)}{19.086} = 0.393$$
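Both skewness calculations can be checked with a short sketch. The function names are illustrative, and s is the sample standard deviation with the n−1 divisor, as in the earlier slides.

```python
import statistics

def sample_skewness(data):
    """Sum of cubed deviations from the mean, divided by n * s**3."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return sum((x - mean) ** 3 for x in data) / (n * s ** 3)

def median_skewness(data):
    """Approximation: 3 * (mean - median) / s."""
    s = statistics.stdev(data)
    return 3 * (statistics.mean(data) - statistics.median(data)) / s

one = [70, 80, 90, 100, 110]
two = [65, 70, 80, 85, 90, 100, 110, 120]
print(round(sample_skewness(one), 3))
print(round(sample_skewness(two), 3), round(median_skewness(two), 3))
```

The two values for the second dataset differ noticeably, which illustrates that the median-based formula is only an approximation.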
