
Descriptive Statistics and Exploratory Data Analysis - Bivariate


Page 1: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Fall 2002 Biostat 511 1

Descriptive Statistics and Exploratory Data Analysis - Bivariate

• Quantitative (continuous) variables
1. Scatterplots (two variables; use color or symbol to add a 3rd variable)
2. Starplots
3. Correlation coefficient

• Qualitative (categorical) variables
1. Contingency (two-way) tables
2. Joint, marginal, conditional distribution
3. Simpson’s paradox (confounding)
4. Interaction

Page 2: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Scatterplot

A scatterplot offers a convenient way of visualizing the relationship between pairs of quantitative variables.

Many interesting features can be seen in a scatterplot, including the overall pattern (e.g. linear, nonlinear, periodic), the strength and direction of the relationship, and outliers (values which are far from the bulk of the data).

[Figure: scatterplot of thigh circumference (cm) against knee circumference (cm).]

Page 3: Descriptive Statistics and Exploratory Data Analysis - Bivariate

[Figure: scatterplot of y against x.] Scatterplot showing nonlinear relationship

[Figure: scatterplot of s1 against s3, with three points marked O.] Scatterplot showing daily rainfall amount (mm) at nearby stations in SW Australia. Note outliers (O). Are they data errors … or interesting science?!

Page 4: Descriptive Statistics and Exploratory Data Analysis - Bivariate

[Figure: three panels, each plotting monthly accidental deaths (US) against time.]

Presentation matters!

Page 5: Descriptive Statistics and Exploratory Data Analysis - Bivariate

- Important information can be seen in two dimensions that isn’t obvious in one dimension

[Figure: three panels. Histogram (fraction) of No. of eggs; histogram (fraction) of weight; scatterplot of No. of eggs against weight.]

Page 6: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Use symbols or colors to add a third variable

Page 7: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Plots for Multivariate data

Star plots are used to display multivariate data

• Each ray corresponds to a variable

• Rays scaled from smallest to largest value in dataset

[Figure: star plots for nine cars (Concord, Pacer, Century, Electra, LeSabre, Regal, Riviera, Skylark, Deville). Rays: Price, Mileage (mpg), Repair Record 1978, Headroom (in.), Weight (lbs.), Turn Circle (ft.), Displacement (cu. in.), Gear Ratio.]

Page 8: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Correlation

How can we summarize the “strength of association” between two variables in a scatterplot?

[Figure: scatterplot of thigh circumference (cm) against knee circumference (cm).]

Page 9: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Correlation

When two variables are measured on a scale in which order is meaningful, you can calculate a correlation coefficient that measures the strength of the association between the two variables.

There are two common correlation measures:

1. Pearson Correlation Coefficient: Based on the actual data values. Measure of linear association. Natural when each variable has a normal distribution.

2. Spearman Rank Correlation: Based on ranks of each variable (ranks assigned separately). Useful measure of the monotone association, which may not be linear.

Page 10: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Pearson’s Correlation Coefficient

The correlation between two variables X and Y is:

$$R = \frac{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{s_X s_Y} = \frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right)$$

Properties:

• No distinction between x and y.

• The correlation is constrained: -1 ≤ R ≤ +1

• | R | = 1 means “perfect linear relationship”

• The correlation is a scale free measure (correlation doesn’t change if there is a linear change in units).

• Pearson’s correlation only measures strength of linear relationship.

• Pearson’s correlation is sensitive to outliers.
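
The definition of Pearson's correlation and two of its listed properties (symmetry in x and y, invariance to a linear change of units) can be checked numerically. The following is a minimal plain-Python sketch; the function name `pearson_r` and the knee/thigh values are invented for illustration and are not data from the course.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation using the N-1 (sample) standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of products of standardized deviations, divided by N-1.
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical measurements (cm), made up for this example only.
knee = [32.0, 35.5, 37.0, 40.0, 44.5, 48.0]
thigh = [48.0, 52.0, 60.0, 63.0, 75.0, 82.0]

r = pearson_r(knee, thigh)
r_swapped = pearson_r(thigh, knee)                      # x and y exchanged
r_inches = pearson_r([k / 2.54 for k in knee], thigh)   # knee in inches

print(round(r, 4), round(r_swapped, 4), round(r_inches, 4))
```

All three printed values should agree up to floating-point error, which is what "no distinction between x and y" and "scale free" mean in practice.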

Page 11: Descriptive Statistics and Exploratory Data Analysis - Bivariate

[Figure: three scatterplots of y against x.]

Perfect positive correlation (R = 1)

Perfect negative correlation (R = -1)

Uncorrelated (R = 0) but dependent
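
The three cases can be reproduced with a few points. This sketch assumes the third panel shows a symmetric curve such as y = x² (the axis ranges suggest it, but the figure itself is not available), and `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; the N-1 factors cancel in this equivalent form."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [-2, -1, 0, 1, 2]
r_pos = pearson_r(xs, [x for x in xs])        # y = x
r_neg = pearson_r(xs, [-x for x in xs])       # y = -x
r_quad = pearson_r(xs, [x * x for x in xs])   # y = x^2: y is fully determined by x

print(r_pos, r_neg, r_quad)
```

In the last case y depends on x exactly, yet the linear association is zero, so R = 0 does not imply independence.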

Page 13: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Pearson’s Correlation Coefficient

[Figure: scatterplot of Y against X1 over the full range of X1.] Correlation = .8776

Suppose we restrict the range of X …

[Figure: scatterplot of Y against X1 with X1 restricted to roughly .5 to 2.] Correlation = .5111

• relationship between LSAT and GPA among law school students

• relationship between height and basketball ability among NBA players
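
The effect of restricting the range of X can be illustrated by simulation. The sketch below does not use the data behind the plots; the slope, noise levels, window and seed are all assumptions chosen only to show the qualitative effect.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

random.seed(511)  # arbitrary seed so the sketch is reproducible

# Assumed model: a linear relationship plus noise.
xs = [random.gauss(0, 1.5) for _ in range(5000)]
ys = [2 * x + random.gauss(0, 1.6) for x in xs]
r_full = pearson_r(xs, ys)

# Keep only the observations whose X falls in a narrow window.
kept = [(x, y) for x, y in zip(xs, ys) if 0.5 <= x <= 2.0]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"full range: r = {r_full:.3f} (n = {len(xs)})")
print(f"restricted: r = {r_restricted:.3f} (n = {len(kept)})")
```

The underlying relationship is the same in both subsets; only the spread of X changes, and the correlation drops with it. This is the situation in the law school and NBA examples, where the sample has already been selected on X.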

Page 14: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Spearman Rank Correlation

• A nonparametric analogue to Pearson’s correlation coefficient is Spearman’s rank correlation coefficient. Use Spearman’s correlation when the assumption of normality of X and Y is not met.

• A measure of monotonic association (not necessarily linear)

• Based on the ranked data

• Rank each sample separately (1 … N)

• Compute Pearson’s correlation on the ranks

• -1 ≤ Rs ≤ 1
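
The recipe above (rank each sample separately, then compute Pearson's correlation on the ranks) can be sketched as follows. Giving tied values the average of their ranks is a common convention assumed here, not something stated on the slide; the function names and data are illustrative.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Ranks 1..N; tied values share the average of their ranks (assumed convention)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_r(xs, ys):
    """Pearson's correlation computed on the separately ranked samples."""
    return pearson_r(ranks(xs), ranks(ys))

# A monotone but nonlinear relationship: y = exp(x).
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
ys = [math.exp(x) for x in xs]

print("Pearson :", round(pearson_r(xs, ys), 3))
print("Spearman:", round(spearman_r(xs, ys), 3))
```

Because y increases whenever x does, the ranks agree exactly and Spearman's coefficient is 1, while Pearson's coefficient is below 1 because the relationship is not linear.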

Page 15: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Two-way (Contingency) Tables

Now we turn our attention to relationships between pairs of qualitative (categorical, discrete) measures.

Types of Categorical Data:

•Nominal

•Ordinal

Often we wish to assess whether two factors are related. To do so we construct an R x C table that cross-classifies the observations according to the two factors. Such a table is called a two-way or contingency table.

Page 16: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Two-way tables

Example. Education versus willingness to participate in a study of a vaccine to prevent HIV infection if the study was to start tomorrow. Counts, percents and row and column totals are given.

                    definitely not   probably not   probably       definitely    Total
< high school       52 (1.1%)        79 (1.6%)      342 (7.0%)     226 (4.6%)    699
high school         62 (1.3%)        153 (3.2%)     417 (8.6%)     262 (5.4%)    894
some college        53 (1.1%)        213 (4.4%)     629 (13.0%)    375 (7.7%)    1270
college             54 (1.1%)        231 (4.8%)     571 (11.8%)    244 (5.0%)    1100
some post college   18 (0.4%)        46 (0.9%)      139 (2.9%)     74 (1.5%)     277
graduate/prof       25 (0.5%)        139 (2.9%)     330 (6.8%)     116 (2.4%)    610
Total               264              861            2428           1297          4850

The table displays the joint distribution of education and willingness to participate.

Page 17: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Two-way tables

The marginal distributions of a two-way table are simply the distributions of each measure summed over the other.

E.g. Willingness to participate

Definitely not   Probably not   Probably   Definitely
264              861            2428       1297
5.4%             17.8%          50.1%      26.7%

[Figure: bar chart of counts by willingness to participate (Def not, Prob not, Prob, Def).]

Page 18: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Two-way tables

A conditional distribution is the distribution of one measure conditional on (given) the value of the other measure.

E.g. Willingness to participate among those with a college education.

Definitely not   Probably not   Probably   Definitely
54               231            571        244
4.9%             21.0%          51.9%      22.2%
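
The joint, marginal and conditional distributions all come from the same table of counts; only the denominator changes. A minimal sketch using the counts from the education/willingness table (the variable names are chosen for this example):

```python
levels = ["definitely not", "probably not", "probably", "definitely"]
counts = {
    "< high school":     [52, 79, 342, 226],
    "high school":       [62, 153, 417, 262],
    "some college":      [53, 213, 629, 375],
    "college":           [54, 231, 571, 244],
    "some post college": [18, 46, 139, 74],
    "graduate/prof":     [25, 139, 330, 116],
}

total = sum(sum(row) for row in counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {edu: [c / total for c in row] for edu, row in counts.items()}

# Marginal distribution of willingness: column totals over the grand total.
marginal_willing = [sum(row[j] for row in counts.values()) / total
                    for j in range(len(levels))]

# Conditional distribution of willingness given a college education:
# the college row divided by its own row total.
college = counts["college"]
cond_college = [c / sum(college) for c in college]

for name, dist in [("marginal", marginal_willing), ("given college", cond_college)]:
    print(name, [f"{100 * p:.1f}%" for p in dist])
```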

Page 19: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Two-way tables

                    definitely not   probably not   probably   definitely   Total
< high school       52               79             342        226          699
high school         62               153            417        262          894
some college        53               213            629        375          1270
college             54               231            571        244          1100
some post college   18               46             139        74           277
graduate/prof       25               139            330        116          610
Total               264              861            2428       1297         4850

What proportion of individuals …

• will definitely participate?

• have less than college education?

• will probably or definitely participate given less than college education?

• who will probably or definitely participate have less than college education?

• have a graduate/prof degree and will definitely not participate?
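
One way to work through these questions is sketched below. It assumes that "less than college education" means the first three rows (< high school, high school, some college); the slide does not define the grouping, so treat that as an assumption. Note that the third and fourth questions condition in opposite directions and give different answers.

```python
counts = {
    "< high school":     [52, 79, 342, 226],
    "high school":       [62, 153, 417, 262],
    "some college":      [53, 213, 629, 375],
    "college":           [54, 231, 571, 244],
    "some post college": [18, 46, 139, 74],
    "graduate/prof":     [25, 139, 330, 116],
}
# Column order: definitely not, probably not, probably, definitely.
DEF_NOT, PROB_NOT, PROB, DEF = range(4)

# Assumption: these rows make up "less than college education".
less_than_college = ["< high school", "high school", "some college"]

total = sum(sum(row) for row in counts.values())
n_less = sum(sum(counts[e]) for e in less_than_college)
n_willing = sum(row[PROB] + row[DEF] for row in counts.values())
n_less_and_willing = sum(counts[e][PROB] + counts[e][DEF] for e in less_than_college)

p_definitely = sum(row[DEF] for row in counts.values()) / total   # marginal
p_less = n_less / total                                           # marginal
p_willing_given_less = n_less_and_willing / n_less                # conditional
p_less_given_willing = n_less_and_willing / n_willing             # conditional (reversed)
p_grad_and_def_not = counts["graduate/prof"][DEF_NOT] / total     # joint

print(round(p_definitely, 3), round(p_less, 3),
      round(p_willing_given_less, 3), round(p_less_given_willing, 3),
      round(p_grad_and_def_not, 4))
```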

Page 20: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Three-way tables

There are two phenomena that can confuse our interpretation of two-way tables. In each case a third measure is involved.

Simpson’s Paradox - Also known as confounding in the epidemiology literature. MM refer to this as the “lurking variable” problem. Aggregating over a third (lurking) variable results in incorrect interpretation of the association between the two primary variables of interest.

Interaction - Also known as effect modification in the epidemiology literature. The degree of association between the two primary variables depends on a third variable.

Page 21: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Simpson’s Paradox (aka Confounding)

“Condom Use increases the risk of STD”

Condom Use   STD rate
Yes          55/95 (58%)
No           45/105 (43%)

BUT ...

# Partners < 5
Condom Use   STD rate
Yes          5/15 (33%)
No           30/82 (37%)

# Partners > 5
Condom Use   STD rate
Yes          50/80 (62%)
No           15/23 (65%)

Explanation: Individuals with more partners are more likely to use condoms. But individuals with more partners are also more likely to get STD.
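
The reversal can be verified directly from the counts. This sketch pools the two partner strata and compares STD rates by condom use, both overall and within each stratum; the dictionary layout is just one convenient way to hold the numbers.

```python
# (number with STD, group size) by stratum and condom use, from the slide.
strata = {
    "partners < 5": {"yes": (5, 15), "no": (30, 82)},
    "partners > 5": {"yes": (50, 80), "no": (15, 23)},
}

def rate(cases, n):
    return cases / n

# Aggregate over the third variable (number of partners).
pooled = {}
for use in ("yes", "no"):
    cases = sum(strata[s][use][0] for s in strata)
    n = sum(strata[s][use][1] for s in strata)
    pooled[use] = (cases, n)

print("pooled:", {use: round(rate(*cn), 3) for use, cn in pooled.items()})
for name, groups in strata.items():
    print(name, {use: round(rate(*cn), 3) for use, cn in groups.items()})

# Condom users are concentrated in the high-risk stratum, which drives the reversal.
share_users_high = strata["partners > 5"]["yes"][1] / pooled["yes"][1]
share_nonusers_high = strata["partners > 5"]["no"][1] / pooled["no"][1]
print(round(share_users_high, 2), round(share_nonusers_high, 2))
```

Within each stratum the rate is lower for condom users, but the pooled rate is higher, because most users fall in the stratum with the higher baseline risk.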

Page 22: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Interaction (aka Effect Modification)

Impact Speed     < 40 mph              > 40 mph
Seat belt        worn     not worn     worn     not worn
Driver dead      2        3            7        18
Driver alive     18       27           13       12
Total            20       30           20       30
Fatality Rate    10%      10%          35%      60%

Seat Belt        Worn     Not worn
Driver dead      9        21
Driver alive     31       39
Total            40       60
Fatality Rate    22.5%    35%
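
Interaction shows up as an association that differs between levels of the third variable. The sketch below computes the fatality rates from the counts and the difference in rates (not worn minus worn) at each impact speed; the risk difference is one of several possible measures and is chosen here only for simplicity.

```python
import math

# (dead, alive) by impact speed and seat belt use, from the slide.
table = {
    "< 40 mph": {"worn": (2, 18), "not worn": (3, 27)},
    "> 40 mph": {"worn": (7, 13), "not worn": (18, 12)},
}

def fatality_rate(dead, alive):
    return dead / (dead + alive)

# Association between seat belt use and death within each speed group.
risk_diff = {
    speed: fatality_rate(*groups["not worn"]) - fatality_rate(*groups["worn"])
    for speed, groups in table.items()
}

# Collapsed table, ignoring impact speed.
pooled = {}
for belt in ("worn", "not worn"):
    dead = sum(table[s][belt][0] for s in table)
    alive = sum(table[s][belt][1] for s in table)
    pooled[belt] = fatality_rate(dead, alive)

print(risk_diff)
print(pooled)
```

At low speed the difference is zero, at high speed it is 25 percentage points, and the collapsed table reports a single in-between figure that describes neither group.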

Page 23: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Summary

• Qualitative (categorical) variables

Contingency table – shows the joint distribution of the two variables, the marginal distributions of each variable and the conditional distribution of one variable for a fixed level of the other variable.

Simpson’s paradox and interactions can occur if a third variable influences the association between the two variables of interest.

• Quantitative (continuous) variables

Scatterplots - display relationship between two quantitative measures. Use colors or symbols to add a third (categorical) dimension.

Starplots - display multivariate data.

Correlation coefficient - summarizes the strength of the linear (Pearson’s) or monotonic (Spearman’s) relationship between two quantitative measures.

Page 24: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Guidelines for Tables and Graphs

• Tables
1. Good for showing exact values, small amounts of data
2. Guidelines

• Graphs
1. Good for showing qualitative trends, large amounts of data
2. Guidelines for graphical integrity

Page 25: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Tables and Graphs

• Compact presentation of data

• Visual appeal; readers feel that they are “seeing the data”

• Tables are better for showing exact numerical values, small amounts of data and/or multiple localized comparisons

• Graphs are better for highlighting qualitative aspects of the data and displaying large amounts of data.

Page 26: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Guidelines for Tables (Ehrenberg, 1977)

1. Give marginal averages to provide a visual focus.

2. Order rows/columns by marginal averages or some other measure of size.

3. Put groups to be compared in rows (i.e. scanning down columns for comparisons).

4. Round to 2 effective digits.

5. Use layout to facilitate comparisons.

6. Give brief verbal summaries to lead the reader to patterns and exceptions.

7. Clearly label rows and columns; give units, source (if appropriate), and title.

Page 27: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Unemployment in Great Britain (source: Facts in Focus, CSO, 1974).

                               1966    ‘68     ‘70     ‘73
Total unemployed (thousands)   330.9   549.4   582.2   597.9
Male                           259.6   460.7   495.3   499.4
Female                         71.3    88.8    86.9    98.5

Note use of marginal averages and rounding. Table has been reordered so the reader can scan down the column for a time trend.

Unemployed (000’s)
        Total   Male   Female
1966    330     260    71
1968    550     460    89
1970    580     500    87
1973    600     500    99
Ave.    520     430    86
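
The rounded table can be reproduced from the unrounded figures. The sketch below treats "2 effective digits" as rounding to two significant digits with halves rounded up, which matches the numbers shown here but is an assumption about Ehrenberg's rule; `round_sig` is a helper written for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_sig(value, digits=2):
    """Round to `digits` significant digits, with halves rounded up."""
    d = Decimal(str(value))
    if d == 0:
        return d
    # adjusted() is the power of ten of the leading digit (2 for 330.9).
    quantum = Decimal(1).scaleb(d.adjusted() - (digits - 1))
    return d.quantize(quantum, rounding=ROUND_HALF_UP)

years = ["1966", "1968", "1970", "1973"]
unemployed = {  # thousands, as strings so Decimal sees the exact printed values
    "Total":  ["330.9", "549.4", "582.2", "597.9"],
    "Male":   ["259.6", "460.7", "495.3", "499.4"],
    "Female": ["71.3", "88.8", "86.9", "98.5"],
}

rounded = {}
for group, values in unemployed.items():
    nums = [Decimal(v) for v in values]
    average = sum(nums) / len(nums)          # marginal average over the years
    rounded[group] = [int(round_sig(v)) for v in nums + [average]]

# Years go down the rows so a time trend can be read by scanning a column.
for i, label in enumerate(years + ["Ave."]):
    print(label, *(rounded[g][i] for g in ("Total", "Male", "Female")))
```

Python's built-in `round` uses round-half-to-even on binary floats, so `round(98.5)` gives 98; `Decimal` with `ROUND_HALF_UP` is used to get the 99 shown in the table.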

Page 29: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Statistical Graphics

“Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers - even a very large set - is to look at pictures of those numbers.”

Edward R. Tufte, The Visual Display of Quantitative Information, Graphics Press, 1983

Page 30: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Graphical Integrity

1. The representation of numbers, as physically measured on the surface of the graphic, should be directly proportional to the numerical quantities represented (e.g. purchasing power).

2. Clear, detailed and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data. (e.g. Minard’s graphic)

3. Focus on the data, not the design and maximize the data:ink ratio (counter e.g. USA Today)

4. The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data (e.g. OPEC Oil)

5. Do not quote data out of context (e.g. traffic deaths)

Page 32: Descriptive Statistics and Exploratory Data Analysis - Bivariate

$$\text{Lie factor} = \frac{\text{Perceived change}}{\text{Real change}} = \frac{(7.0 - 1.125)/1.125}{(1.0 - .44)/.44} = 4.1$$
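
The arithmetic behind the lie factor is a ratio of two relative changes. In this sketch the four numbers are taken from the slide's equation; what they measure in the original graphic is not recoverable from the text, so the parameter names are generic placeholders.

```python
def relative_change(first, second):
    """Change from `first` to `second`, as a fraction of `first`."""
    return (second - first) / first

def lie_factor(shown_first, shown_second, data_first, data_second):
    """Relative change shown in the graphic over relative change in the data."""
    perceived = relative_change(shown_first, shown_second)
    real = relative_change(data_first, data_second)
    return perceived / real

# Values as they appear in the slide's equation.
factor = lie_factor(1.125, 7.0, 0.44, 1.0)
print(round(factor, 1))
```

A lie factor near 1 means the graphic represents the change faithfully; values well above 1 mean the graphic exaggerates it.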

A less distorted view …

Page 34: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Data density - Compare ...

Page 38: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Summary

• Tables

1. Good for showing exact values, small amounts of data

2. Guidelines

• Graphs

1. Good for showing qualitative trends, large amounts of data

2. Guidelines for graphical integrity

Page 39: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Designing Studies

• Design issues
1. Types of studies
a. Experimental studies - Control, randomization, replication
b. Observational
2. Controls
3. Blinding
4. Hawthorne effect
5. Longitudinal/cross-sectional
6. Dropout

• Population vs Sample
1. Bias
2. Variability

Page 40: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Experimental Design

“Obtaining valid results from a test program calls for commitment to sound statistical design. In fact, proper experimental design is more important than sophisticated statistical analysis. Results of a well-planned experiment are often evident from simple graphical analyses. However, the world’s best statistical analysis cannot rescue a poorly planned experiment.”

Gerald Hahn, Encyclopedia of Statistical Science, page 359, entry for Design of Experiments

Page 41: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Types of Studies

Most scientific studies can be classified into one of two broad categories:

1) Experimental Studies

The investigator deliberately sets one or more factors to a specific level.

2) Observational Studies

The investigator collects data from an existing situation and does not (intentionally) interfere with the running of the system.

Page 42: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Experimental Studies

• Sources of (major) variability are controlled by the researcher

• Randomization is often used to ensure that uncontrolled factors do not bias results

• The experiment is replicated on many subjects (units) to reduce the effect of chance variation

• Easier to make the case for causation

Examples

• effect of pesticide exposure on hatching of eggs

• comparison of two treatments for preventing perinatal transmission of HIV

Page 43: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Example: control of variability by matching

Hypothesis: Lotions A and B equally effective at softening skin

Page 44: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Design 1: Ignore pairing, randomly assign half of the hands to each lotion. What is the distribution of the sample mean difference in softness, if the “true” difference is 3?

[Figure: histogram (fraction) of the sample mean difference under the unmatched design; horizontal axis from -40 to 40.]

Design 2: Randomly assign lotion to one hand within each pair. What is the distribution of the sample mean difference in softness, if the true difference is 3?

[Figure: histogram (fraction) of the sample mean difference under the matched design; horizontal axis from -40 to 40.]
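
The two designs can be compared by simulation. The sketch below is not the simulation behind the histograms; the number of people, the person-to-person and hand-to-hand spreads, the baseline of 50 and the seed are all assumptions. Only the true difference of 3 comes from the slide.

```python
import random
import statistics

random.seed(2002)  # arbitrary seed so the sketch is reproducible

TRUE_DIFF = 3.0    # lotion B minus lotion A (from the slide)
N_PEOPLE = 10      # assumed number of people, each contributing two hands
PERSON_SD = 20.0   # assumed spread in softness between people
HAND_SD = 2.0      # assumed spread between the two hands of one person

def simulate_once():
    # Two hands per person share a person-level baseline.
    hands = []
    for _ in range(N_PEOPLE):
        base = random.gauss(50.0, PERSON_SD)
        hands.append((base + random.gauss(0, HAND_SD),
                      base + random.gauss(0, HAND_SD)))

    # Design 1: ignore pairing, randomly give half of all hands lotion B.
    flat = [h for pair in hands for h in pair]
    random.shuffle(flat)
    half = len(flat) // 2
    group_a = flat[:half]
    group_b = [h + TRUE_DIFF for h in flat[half:]]
    unmatched = statistics.mean(group_b) - statistics.mean(group_a)

    # Design 2: within each pair, a randomly chosen hand gets B, the other A.
    diffs = []
    for left, right in hands:
        if random.random() < 0.5:
            left, right = right, left
        diffs.append((left + TRUE_DIFF) - right)
    matched = statistics.mean(diffs)
    return unmatched, matched

results = [simulate_once() for _ in range(2000)]
unmatched_estimates = [u for u, _ in results]
matched_estimates = [m for _, m in results]

sd_unmatched = statistics.stdev(unmatched_estimates)
sd_matched = statistics.stdev(matched_estimates)
print(f"unmatched: mean {statistics.mean(unmatched_estimates):.2f}, sd {sd_unmatched:.2f}")
print(f"matched:   mean {statistics.mean(matched_estimates):.2f}, sd {sd_matched:.2f}")
```

Both designs center on the true difference, but pairing removes the person-to-person variation from the comparison, so the matched estimates are much more tightly concentrated.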

Page 45: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Observational Studies

• Sources of variability (in the outcome) are not controlled by the researcher

• Adjustment for imbalances between groups, if possible, occurs at the analysis phase

• Randomization usually not an option; samples are assumed to be “representative”

• Can identify association, but usually difficult to infer causation

Examples

• natural history of HIV infection

• study of partners of individuals with gonorrhea

• condom use and STD prevention

• association between chess playing and reading skill in elementary school children

Page 46: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Other Study Design Issues

•Selection of controls

•Blinding

•Hawthorne effect

•Longitudinal vs Cross-sectional

•Dropouts

Page 47: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Longitudinal vs Cross-sectional Studies

• Longitudinal studies are more expensive and involve additional analytical complications.

• Longitudinal studies allow one to study changes over time in individuals and populations (similar to idea of pairing or matching)

Page 48: Descriptive Statistics and Exploratory Data Analysis - Bivariate

[Figure: three panels plotting reading ability against age.]

Hypothetical data on the relationship between reading ability and age.

Page 49: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Populations vs Samples

Population

•set of all “units”

•real or hypothetical

•parameters

Sample

•a subset of “units”

•estimates/statistics

e.g. population - all US households with a TV (~95 million)

sample - Nielsen sample (~5000)

The objective of statistics is to make valid inferences about the population from the sample.

So far we haven’t thought very hard about where our data come from. However, in almost all cases there is an implicit assumption that the conclusions we draw from our data analysis apply to some larger group than just the individuals we measured.

Page 50: Descriptive Statistics and Exploratory Data Analysis - Bivariate

[Figure: a population of X’s (true proportion = p) from which five samples of size n are drawn, each yielding its own estimate p̂.]
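
The picture can be mimicked by drawing repeated samples from a population with a known proportion. In this sketch the true proportion, the sample size and the seed are assumptions chosen for illustration; `sample_proportion` is a helper written for this example.

```python
import random

random.seed(50)   # arbitrary seed so the sketch is reproducible

P_TRUE = 0.3      # assumed true population proportion p
N = 100           # assumed sample size n

def sample_proportion(p, n):
    """Draw n independent 0/1 observations with success probability p; return p-hat."""
    return sum(random.random() < p for _ in range(n)) / n

# Five samples, five different estimates of the same p.
p_hats = [sample_proportion(P_TRUE, N) for _ in range(5)]
print(p_hats)

# Over many repetitions the estimates vary around p but average close to it.
many = [sample_proportion(P_TRUE, N) for _ in range(2000)]
mean_p_hat = sum(many) / len(many)
print(round(mean_p_hat, 3), min(many), max(many))
```

Each p̂ differs from p and from the other estimates, which is sampling variability; the average of the estimates staying near p reflects the absence of bias under random sampling.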

Page 51: Descriptive Statistics and Exploratory Data Analysis - Bivariate

In making such inferences, there are two ways we can go wrong …

Bias

• Do I expect that, on average, the estimate from my sample will equal the parameter of the population of interest? If so, the estimate is unbiased.

e.g. Ann Landers survey

Pap smear study

• In general, statistical methods do not correct for bias

(Sampling) Variability

• If I repeat an experiment (draw a new sample), I don’t expect to get exactly the same results. The sample estimates are variable.

• The aim of experimental design and statistical analysis is to quantify/control effects of variability.

Page 53: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Types of samples in medical studies - a hierarchy

1) Probability samples (e.g. simple random sample, stratified samples, multistage samples)

2) Representative samples (no obvious bias, but …)

3) Convenience samples (biases likely …)

4) Anecdotal, Case reports

Page 54: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Problems in Design/Data Collection

Example:

33% reduction in blood pressure after treatment with medication in a sample of 60 hypertensive men.

Problem:

Example:

Daytime telephone interview of voting preferences

Problem:

Example:

Higher proportion of “abnormal” values on tests performed in 1990 than a comparable sample taken in 1980.

Problem:

Page 55: Descriptive Statistics and Exploratory Data Analysis - Bivariate

Summary

1. Statistics plays a role from study conception to study reporting.

2. Statistics is concerned with making valid inferences about populations from samples that are subject to various sources of variability.

3. Different studies require different statistical approaches. You must understand the study design and sampling procedures before you can hope to interpret the data!!