1 Descriptive statistics A means of organizing, summarizing observations An overview of the general features of a data set For asking questions!!


Page 1: 1 Descriptive statistics  A means of organizing, summarizing observations  An overview of the general features of a data set  For asking questions!!

Descriptive statistics

A means of organizing and summarizing observations
An overview of the general features of a data set
For asking questions!!

Page 2:

Descriptive Statistics

Data Presentation
  Types of numerical data
  Tables
  Graphs

Numerical Summary Measures
  Measures of central tendency
  Measures of dispersion
  Grouped data

Page 3:

Ex1: sexual dysfunction 1

Page 4:

Ex1: sexual dysfunction 2

X1 Y1

Page 5:

Ex1: sexual dysfunction 3

X1 Y1

Page 6:

Ex1: sexual dysfunction 4

X2 Y1

Page 7:

Ex1: sexual dysfunction 5

Y1 Y2

Table head: long thick rule
Table head: long thin rule
Between groups: long thin rule
Within groups: short thin rule
Table foot: long thin rule

Page 8:

Measurement

"All science is measurement"
  Helmholtz, German physiologist

Good measurement
  Great scientific breakthroughs are always about measurement improvement
  To do everything well is to measure it well

Definition
  Definition of variables
  Definition of denominator and numerator

Page 9:

Types of data measurements

Nominal scale: ( =, ≠ )
  Race, gender, etc.
  Also called categorical data or qualitative scale

Ordinal scale: ( =, ≠, >, < )
  Your satisfaction score about YMU

Numerical scale:
  Interval scale: ( =, ≠, >, <, +, − )
    Temperature: zero does NOT mean "nothing" of the quantity being measured
  Ratio scale: ( =, ≠, >, <, +, −, ×, ÷ )
    Weight: zero means "nothing" of the quantity being measured

Page 10:

Data structure

ID   Y1    Y2   X1   X2   …   …
1    110   1    1    20
2    140   1    1    25
3    124   2    0    30
4    110   0    0    21
5    100   2    1    19
6    95    0    1    23
7    90    1    0    21

Page 11:

Tables

Frequency distribution
  Categorical data
  Discrete or continuous data
    Data breakdown: distinct, non-overlapping intervals
    Number of intervals: improve the summary or lose information?

A general rule for table presentation: self-explanatory
  Labeled clearly: table, columns, measurement specified

Page 12:

Example: cholesterol levels

                  Ages 25-34                   Ages 55-64
Cholesterol   Number    Relative           Number    Relative
(mg/dL)       of men    frequency (%)      of men    frequency (%)
80-119            13        1.2                 5        0.4
120-159          150       14.1                48        3.9
160-199          442       41.4               265       21.6
200-239          299       28.0               458       37.3
240-279          115       10.8               281       22.9
280-319           34        3.2               128       10.4
320-359            9        0.8                35        2.9
360-399            5        0.5                 7        0.6
Total           1067      100.0              1227      100.0

Table 2.7 Absolute and relative frequencies of serum cholesterol for 2294 US males, 1976-1980

Page 13:

Examples: cholesterol levels

Table 2.8 Relative and cumulative relative frequencies of serum cholesterol for 2294 US males, 1976-1980

                  Ages 25-34                        Ages 55-64
Cholesterol   Relative         Cumulative       Relative         Cumulative
(mg/dL)       frequency (%)    relative         frequency (%)    relative
                               frequency (%)                     frequency (%)
80-119             1.2              1.2              0.4              0.4
120-159           14.1             15.3              3.9              4.3
160-199           41.4             56.7             21.6             25.9
200-239           28.0             84.7             37.3             63.2
240-279           10.8             95.5             22.9             86.1
280-319            3.2             98.7             10.4             96.6
320-359            0.8             99.5              2.9             99.4
360-399            0.5            100.0              0.6            100.0
Total            100.0                             100.0
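
The relative and cumulative relative frequencies for ages 25-34 can be recomputed from the counts in Table 2.7. The following is a minimal sketch in Python; the function name `relative_and_cumulative` is illustrative and not part of the slides, and the cumulative column is computed from the running counts rather than by adding rounded percentages.

```python
# Counts of men per cholesterol interval, copied from Table 2.7 (ages 25-34).
intervals = ["80-119", "120-159", "160-199", "200-239",
             "240-279", "280-319", "320-359", "360-399"]
counts_25_34 = [13, 150, 442, 299, 115, 34, 9, 5]


def relative_and_cumulative(counts):
    """Return (relative %, cumulative relative %) lists, rounded to 1 decimal."""
    total = sum(counts)
    relative, cumulative = [], []
    running = 0
    for c in counts:
        running += c  # running count up to and including this interval
        relative.append(round(100 * c / total, 1))
        cumulative.append(round(100 * running / total, 1))
    return relative, cumulative


rel, cum = relative_and_cumulative(counts_25_34)
for label, r, c in zip(intervals, rel, cum):
    print(f"{label:>8}  {r:5.1f}  {c:5.1f}")
```

The same function can be applied to the ages 55-64 counts to reproduce the other half of the table.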

Page 14:

Graphs: Bar Charts

For nominal or ordinal data
Composition
  Horizontal axis: categories
  Vertical bar: freq./relative freq. of each category
  Bars separated: so as not to imply continuity

Page 15:

Ex2: Head-turning asymmetry

Page 16:

Types of Numerical Data

Nominal data (the magnitude of the numbers is not important)
  Sex (dichotomous, binary data): male=1; female=0
  Blood type: O=1; A=2; B=3; AB=4

Ordinal data (order matters, but the magnitude of the numbers still does not)
  Level of severity: fatal=1; severe=2; moderate=3; minor=4
  Oncology Group's classification of patient performance status:

  Status  Definition
  0       Fully active
  1       Restricted in physically strenuous activity
  2       Ambulatory and capable of all self-care
  3       Capable of only limited self-care
  4       Completely disabled

Page 17:

Trend of reported HIV/AIDS cases in Taiwan by year, 1984-2009

More than 50% of newly reported cases had already developed AIDS: on 2009/11/30, Ko Fei (喀飛), a standing director of the Hotline (熱線), clashed with CDC officials over this phenomenon

[Figure: trend of HIV infection among Taiwanese nationals, 1984 to December 2009 (by date of diagnosis); horizontal axis: year, 1984 to 2009; vertical axis: number of persons, 0 to 4000; series: HIV-infected persons, AIDS cases]

Page 18:

Example: Kaposi's sarcoma in AIDS patients

First 2560 AIDS patients reported to CDC, USA

Data structure: presence of sarcoma
  0: absent; 1: present

Analysis method:
  Excel: pivot table
  NCSS: frequency table (discrete variable)

Page 19:

Bar chart: Cigarette consumption

Cigarette consumption per person 18 years or older, USA, 1900-1990
  Excel: column chart (remove the variable name of YEAR)
  NCSS: percentile plot (grouping variable: YEAR)

[Figure: bar chart; horizontal axis: year, 1900 to 1990 by decade; vertical axis: number, 0 to 5000]

Page 20:

Graphs: Histograms

For discrete or continuous data
Composition
  Horizontal axis: the intervals, marked at their true limits
  Vertical bar: freq./relative freq. of each interval
  True interval limit: the point midway between adjacent intervals
    e.g. 119.5 separates 80-119 and 120-159 mg/dL
  Bar area, NOT bar height, represents frequency
  Same shape: absolute & relative frequency histograms
  Unequal interval widths: bar height must be adjusted so that bar area stays proportional to frequency
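
To illustrate why area rather than height carries the frequency, the sketch below computes bar heights as relative frequency divided by interval width. It is only an illustration: the function name `histogram_heights` is made up here, and the merged last interval (280-399 mg/dL, combining the last three rows of the ages 25-34 counts) is a hypothetical regrouping chosen to create unequal widths.

```python
def histogram_heights(true_limits, counts):
    """Bar heights such that bar area equals the relative frequency.

    true_limits has one more entry than counts: the boundaries
    between adjacent intervals (e.g. 119.5 between 80-119 and 120-159).
    """
    total = sum(counts)
    heights = []
    for lo, hi, c in zip(true_limits, true_limits[1:], counts):
        width = hi - lo
        heights.append((c / total) / width)
    return heights


# Ages 25-34 counts, with the last three intervals merged into 280-399
# (34 + 9 + 5 = 48) so that one bar is three times as wide as the others.
true_limits = [79.5, 119.5, 159.5, 199.5, 239.5, 279.5, 399.5]
counts = [13, 150, 442, 299, 115, 48]

heights = histogram_heights(true_limits, counts)
areas = [h * (hi - lo)
         for h, lo, hi in zip(heights, true_limits, true_limits[1:])]
print(sum(areas))  # the bar areas add up to the whole sample
```

With this scaling the wide bar is drawn lower than a naive count-height bar would be, so its area still matches the 48 men it represents.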

Page 21:

Example: Cholesterol levels

Analysis method:
  Excel: line chart
  NCSS: frequency table
    Discrete variable: cholesterol (080-119, 120-159, …)
    Frequency variable: No_ages_25_34

[Figure: relative frequency (0% to 50%) by cholesterol interval, 80-119 to 360-399; series: Relative freq1, Relative freq2]

Page 22:

Frequency Polygons

Similar to a histogram
  Place a point at the center of each interval
  Connect those points with straight lines
Frequency representation: same as the histogram
  Polygon and histogram become indistinguishable as the number of observations increases and the interval widths decrease
Relative & cumulative frequency polygons
  Percentiles for describing the shape of a distribution
    Symmetric distribution
    Skewed distribution

Page 23:

One-Way Scatter Plots

For discrete or continuous data
Single horizontal axis
  Displays the relative position of each data point
Information and interpretation
  No information is lost
  Hard to read if many data points lie close together

[Figure: one-way scatter plot of crude death rates for 50 states and DC, USA, 1992; extremes labeled 391.8 (Alaska) and 1214.9 (DC)]

Page 24:

Box Plots

A single axis
Similar to a one-way scatter plot, but displays a summary of the data
  Percentiles and quartiles (25th, 75th percentiles)
The lower and upper sides of the box: 25th & 75th percentiles
Whisker lines
  Extend to the adjacent values: the most extreme observations that lie no more than 1.5 times the height of the box beyond either quartile
Outliers
  Data points outside the whisker lines, represented by circles

Page 25:

Box Plots, cont.

[Figure: box plot of crude death rates for 50 states and DC, USA, 1992; vertical axis: rate, 200 to 1400]

Annotations on the plot:
  25th percentile: 772
  50th percentile: 872
  75th percentile: 933
  1.5 × box height: 1.5 × (933.3 − 772.0) = 242
  Lower whisker limit: 772 − 242 = 530
  Upper whisker limit: 933 + 242 = 1175
  Outliers: Alaska (391), DC (1214)
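
The whisker arithmetic on this slide can be checked with a few lines of Python. This is a sketch that uses only the quartiles and the two extreme rates quoted on the slides; the function name `whisker_limits` is illustrative.

```python
def whisker_limits(q1, q3, k=1.5):
    """Limits beyond which observations are drawn as outliers."""
    step = k * (q3 - q1)  # k times the height of the box
    return q1 - step, q3 + step


# Quartiles of the 1992 crude death rates, as quoted on the slide.
lower, upper = whisker_limits(772.0, 933.3)
print(round(lower), round(upper))

# The two extreme rates named on the one-way scatter plot.
rates = {"Alaska": 391.8, "DC": 1214.9}
outliers = [name for name, r in rates.items() if r < lower or r > upper]
print(outliers)
```

Both Alaska and DC fall outside the limits, which is why they appear as separate points on the box plot.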

Page 26:

Two-Way Scatter Plots

For two different continuous variables
Example: FEV1 and FVC (forced vital capacity)

[Figure: two-way scatter plot of FVC (L) against FEV1 (L) for 19 asthmatics]

Page 27:

Line graphs

[Figure: line graph of reported rates of malaria by year, USA, 1940-1989; horizontal axis: year; vertical axis: reported rates per 100,000, logarithmic scale from .01 to 100]

Annotations on the plot:
  Relapses among Korean War veterans
  Returning Vietnam veterans
  Immigration

Page 28:

Numerical Summary Measures

Measures of Central Tendency

Mean
  Population: $\mu = \frac{1}{N}\sum_{i} x_i = \sum_{i} x_i P(x_i)$, or $\mu = \int x f(x)\,dx$ for a continuous variable
  Sample: $\bar{x} = \frac{1}{n}\sum_{i} x_i$

Median: 50th percentile of a set of measurements
  n odd: the $\left(\frac{n+1}{2}\right)$th ordered value
  n even: the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th ordered values
  For ordinal and discrete/continuous data
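
As a small worked example of the sample mean and the odd/even median rule, the sketch below uses the seven Y1 values from the "Data structure" slide. The function names are illustrative, not from the slides.

```python
def mean(values):
    """Sample mean: the sum of the observations divided by n."""
    return sum(values) / len(values)


def median(values):
    """50th percentile, using the odd/even rule for n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n+1)/2)th ordered value (index mid, 0-based)
        return ordered[mid]
    # n even: average of the (n/2)th and (n/2 + 1)th ordered values
    return (ordered[mid - 1] + ordered[mid]) / 2


y1 = [110, 140, 124, 110, 100, 95, 90]  # Y1 column, IDs 1 to 7
print(mean(y1), median(y1))
```

Here n = 7 is odd, so the median is the 4th ordered value; the large value 140 pulls the mean but not the median.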

Page 29:

Measure of central tendency, cont.

Mode

[Figure: five panels of distribution shapes]
  A: unimodal
  B: bimodal
  C: right-skewed
  D: left-skewed
  E: two distributions with identical means, medians, and modes

Page 30:

Measure of Dispersion

Range: (Max − Min)
Interquartile range
  75th − 25th percentile: the middle 50% of the observations
Variance
  Population: $\sigma^2 = E[(x-\mu)^2] = E(x^2) - [E(x)]^2$
  Sample: $S^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$
Coefficient of variation: SD/mean
  Why use C.V.?
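
The sample variance (with the n − 1 divisor) and the coefficient of variation can be sketched as follows. The Y1 values come from the "Data structure" slide; the rescaled copy `y1_scaled` is a hypothetical change of units, added only to suggest one answer to "Why use C.V.?": it does not depend on the unit of measurement.

```python
import math


def sample_variance(values):
    """S^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def coefficient_of_variation(values):
    """C.V. = SD / mean (unit-free)."""
    xbar = sum(values) / len(values)
    return math.sqrt(sample_variance(values)) / xbar


y1 = [110, 140, 124, 110, 100, 95, 90]
y1_scaled = [x / 10 for x in y1]  # hypothetical: same data in another unit

print(sample_variance(y1), sample_variance(y1_scaled))
print(coefficient_of_variation(y1), coefficient_of_variation(y1_scaled))
```

The variance changes by a factor of 100 under the rescaling while the C.V. stays the same, so the C.V. allows variability to be compared across measurements on different scales.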

Page 31:

Summarizing the distribution of values

Empirical Rule
  68.27% of values lie within μ ± 1σ
  95.45% of values lie within μ ± 2σ
  ** $Z_{1-\alpha/2} = 1.96$ for α = 0.05
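
The percentages in the empirical rule follow from the standard normal distribution, for which P(|Z| < k) = erf(k/√2). The sketch below checks the three figures with Python's standard library; the function name `prob_within` is illustrative.

```python
import math


def prob_within(k):
    """P(|Z| < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 1.96):
    print(f"within {k} SD of the mean: {100 * prob_within(k):.2f}%")
```

The value 1.96 is the multiplier that captures 95% of the distribution, which is why it is paired with α = 0.05.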

Page 32:

Grouped Data (the formulas are not important)

Grouped Mean
  Formula: $\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$
  $f_i$: the frequency; $m_i$: the midpoint of the ith interval

Grouped Variance
  Formula: $S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\left[\sum_{i=1}^{k} f_i\right] - 1}$
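
A minimal sketch of the grouped mean and grouped variance, applied to the ages 25-34 cholesterol counts from Table 2.7. The midpoints (99.5 for 80-119, and so on in steps of 40) are an assumption made here for illustration, and the function names are not from the slides.

```python
def grouped_mean(midpoints, freqs):
    """Sum of m_i * f_i divided by the total frequency."""
    return sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)


def grouped_variance(midpoints, freqs):
    """Sum of (m_i - mean)^2 * f_i divided by (total frequency - 1)."""
    xbar = grouped_mean(midpoints, freqs)
    numerator = sum((m - xbar) ** 2 * f for m, f in zip(midpoints, freqs))
    return numerator / (sum(freqs) - 1)


# Assumed interval midpoints: 99.5, 139.5, ..., 379.5 mg/dL.
midpoints = [99.5 + 40 * i for i in range(8)]
freqs_25_34 = [13, 150, 442, 299, 115, 34, 9, 5]

chol_mean = grouped_mean(midpoints, freqs_25_34)
chol_sd = grouped_variance(midpoints, freqs_25_34) ** 0.5
print(round(chol_mean, 1), round(chol_sd, 1))
```

Because every observation in an interval is treated as if it sat at the midpoint, these are approximations to the mean and variance of the ungrouped data.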

Page 33:

Probability
Review of probability concepts
Central Limit Theorem

Page 34:

Outlines

Probability
  Law of Probability
  Independent & Exclusive
  Bayes' theorem

Probability Density Function
  Normal density function
  Characteristics of N Distribution
  Standard normal distribution

Page 35:

Outlines (cont.)

Population Parameters & Sample Statistics
  Parameters and Statistics
  Sampling methods
  Sampling distribution
  Central Limit Theorem
  Standard Error & Standard Deviation

Page 36:

Probability: key terms

Definition:
  An estimate of the likelihood of an event's occurring
  Expressed as a fraction, a proportion, or a percent

Outcome (simple event):
  A single possible result of a random experiment

Event:
  Any collection of outcomes

Sample space (S):
  The set of all possible outcomes

Page 37:

Law of Probability:

The Law of Large Numbers
  1. $\frac{f}{n} \to P(A)$ as $n \to \infty$
  2. $P(A) = \lim_{n\to\infty} \frac{\text{Frequency}}{n}$

Law of Probability
  1. P(S) = 1
  2. 0 ≤ P(Ei) ≤ 1
  3. ∑P(Ei) = 1
  4. P(non_A) = P(A^C) = 1 − P(A)
  5. For any events A and B:
     P(A&B) = P(A)P(B|A)
     P(A or B) = P(A) + P(B) − P(A&B)
     P(B) = P(B|A)P(A) + P(B|A^C)P(A^C)

Page 38:

Independent & Exclusive

If events A and B are independent, then
  P(A∩B) = P(A)P(B)
  The product of the marginal probabilities equals the joint probability.

If A and B are mutually exclusive events, then
  P(A∩B) = 0
  P(A or B) = P(A) + P(B)

Page 39:

Bayes' theorem

Law of total probability
  $P(A) = \sum_{j} P(B_j)\,P(A \mid B_j)$

Conditional law of total probability
  $P(A \mid C) = \sum_{j} P(B_j \mid C)\,P(A \mid B_j \cap C)$

Bayes' Theorem
  $P(B_i \mid A) = \frac{P(B_i \cap A)}{P(A)} = \frac{P(B_i)\,P(A \mid B_i)}{\sum_{i} P(B_i)\,P(A \mid B_i)}$

[Figure: sample space partitioned into B1, B2, B3, B4, B5, with event A overlapping the partition]
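
A short numerical sketch of the law of total probability and Bayes' theorem, where the B's partition the sample space. The priors and likelihoods below are hypothetical numbers chosen for illustration; they do not come from the slides.

```python
def total_probability(priors, likelihoods):
    """P(A) = sum over j of P(B_j) * P(A | B_j), for a partition B_1..B_k."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def bayes_posteriors(priors, likelihoods):
    """P(B_i | A) = P(B_i) * P(A | B_i) / P(A), for every i."""
    p_a = total_probability(priors, likelihoods)
    return [p * l / p_a for p, l in zip(priors, likelihoods)]


# Hypothetical partition with three events.
priors = [0.5, 0.3, 0.2]        # P(B_1), P(B_2), P(B_3); they sum to 1
likelihoods = [0.1, 0.2, 0.4]   # P(A | B_1), P(A | B_2), P(A | B_3)

p_a = total_probability(priors, likelihoods)
posteriors = bayes_posteriors(priors, likelihoods)
print(p_a, posteriors)
```

In this example B_3 has the smallest prior but the largest posterior, because A is most likely when B_3 occurs.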

Page 40:

Probability Density Function & Normal Distribution

Page 41:

Probability Density Function

Area under the probability distribution
For continuous random variables:
  $f(x) \ge 0$ for all $x$
  $\int_{-\infty}^{\infty} f(x)\,dx = 1$
  $P(a \le x \le b) = \int_a^b f(x)\,dx$

[Figure: density curve f(x) with a and b marked on the horizontal axis]

Page 42:

Normal density function

$x \sim N(\mu, \sigma^2)$
$Z = \frac{x - \mu}{\sigma}$
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2}Z^2\right)$

[Figure: normal curve centered at μ, with a point of inflection one σ on either side of μ]

Page 43:

Characteristics of N Distribution

Area under curve = 1
Symmetric about the mean
mean = median = mode
Points of inflection: μ ± σ

Page 44:

Standard normal distribution

$x \sim N(\mu, \sigma^2)$
$Z = \frac{x - \mu}{\sigma}$

Z transformation and Z value:
  (Observed − Expected) in terms of UNITS of SD
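
The Z transformation, Z = (x − μ)/σ, and the normal density can be written out directly. In the sketch below the values μ = 100 and σ = 15 are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def z_score(x, mu, sigma):
    """(Observed - Expected), expressed in units of SD."""
    return (x - mu) / sigma


def normal_pdf(x, mu, sigma):
    """Normal density f(x), written in terms of the Z value."""
    z = z_score(x, mu, sigma)
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 100.0, 15.0          # hypothetical population values
print(z_score(130.0, mu, sigma))  # 130 lies two SDs above the mean
print(normal_pdf(mu - 15.0, mu, sigma), normal_pdf(mu + 15.0, mu, sigma))
```

The last line shows the symmetry about the mean: points the same distance below and above μ have the same density.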

Page 45:

Population Parameters & Sample Statistics

Page 46:

Parameters and Statistics

                      Parameter symbol    Statistical symbol
Mean                  μ                   X̄
Standard deviation    σ                   SD
Variance              σ²                  S²
Correlation           ρ                   r
Proportion            π                   p

Tab. 4-5, p72

Page 47:

Sampling methods

Simple random sampling
Stratified random sampling
Cluster sampling
Systematic sampling
Multi-stage sampling
Convenience sampling

Page 48:

Sampling distribution

Sampling distribution
  Occurs in REPEATED SAMPLING
  Distribution of the values of X̄ over all possible samples
  Using sample statistics to infer population parameters

Quiz:
  If we want to select 6 students from a class of 46 students, how many possible samples would we have?
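
One way to approach the quiz, assuming that students are drawn without replacement and that the order of selection does not matter, is to count combinations. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later).

```python
import math

class_size = 46
sample_size = 6

# Number of distinct samples of 6 students out of 46,
# assuming no replacement and ignoring order.
n_samples = math.comb(class_size, sample_size)
print(n_samples)
```

Each of these possible samples would give its own sample mean; the distribution of those means is the sampling distribution described above.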

Page 49:

Central Limit Theorem

The beauty of CLT: Easy to calculate V
The ugliness of CLT: Hard to explain p

Standard Error:
  SD of the means

Review:
  $x \sim N(\mu, \sigma^2)$
  For large n, $\bar{X}_n \sim N(\mu, \sigma^2/n)$
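
The statement that the sample mean has variance σ²/n can be explored with a small simulation. The sketch below is illustrative only: the choice of a uniform(0, 1) population, n = 30, 5000 repetitions, and the random seed are all assumptions made here, not part of the slides.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30        # size of each sample
reps = 5000   # number of repeated samples

# Population: uniform on (0, 1), which is not normal.
# Its mean is 0.5 and its SD is sqrt(1/12).
sample_means = []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]
    sample_means.append(sum(sample) / n)

observed_se = statistics.stdev(sample_means)  # SD of the sample means
expected_se = math.sqrt(1 / 12) / math.sqrt(n)  # sigma / sqrt(n)

print(statistics.mean(sample_means), observed_se, expected_se)
```

The SD of the simulated means should land close to σ/√n, which is the standard error, even though the underlying population is flat rather than bell-shaped.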

Page 50:

History of Central Limit Theorem

De Moivre (1733)
  Proposed the earliest version of the CLT, based on the Bernoulli distribution

Laplace (1749-1827)
  Observed that measurement errors tend to be normally distributed, and extended the theorem to arbitrary values of p
  The CLT was originally called the "Law of frequency of error"

Galton (Natural Inheritance, 1889)
  The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason.

Liapounoff (1901)
  Gave the first reasonably complete proof of the CLT

Lindeberg JW, Levy P
  Gave complete proofs of the CLT in the 1920s
  Proof for a random sample from an arbitrary distribution

Note: Einstein proposed the special theory of relativity in 1905

Page 51:

Standard Error & SD

SE:
  The standard error of the mean is the SD of the means in a sampling distribution.
  It tells us how much variability can be expected among means in future samples.

SD:
  The standard deviation is based on measurements of individuals.
  It tells us how much variability can be expected among individuals.

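Page 52:

Home Work

NCSS software exercise: textbook exercise 3-2
  Reproduce textbook Table 3-23 (p58), adding a column for the standard deviation
  Students will be called on at random to do this in class next week!!
  File name: Gebel
  Research question:
    Gebel (1997) studied heart rate variation in 580 patients to assess possible dysfunction of the cardiovascular autonomic nervous system.
    Deep breathing increases heart rate variation, but the size of the variation decreases with age.

Please explain the Central Limit Theorem
  Students will be called on at random to answer in class next week!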