Statistics: Unlocking the Power of Data STAT 250 Nathaniel Cannon Describing Data: Categorical Variables SECTIONS 2.1 One categorical variable Two categorical variables


STAT 250
Nathaniel Cannon

Describing Data: Categorical Variables

SECTIONS 2.1
• One categorical variable
• Two categorical variables


Vaccinations in California

What proportion of children in California are vaccinated?

California law requires students to provide proof of immunization for school, unless they have an approved exception:
• Medical exception
• Personal belief exception

Let’s look at the data!


Frequency Table

Category                     Count
Vaccines up to date         480014
Medical Exception             1009
Personal Belief Exception    13229
Other                        36391
TOTAL                       530643

Data from the California Department of Public Health: all kindergartens in California that reported data (required), 2014 – 2015

Do you think schools that reported may differ from schools that didn’t report? Does sampling bias exist?

• A frequency table shows the number of cases that fall in each category.

Minitab: Stat -> Tables -> Tally Individual Variables -> Counts
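Outside Minitab, a frequency table is just a tally of cases by category. The following is a minimal Python sketch, assuming the raw data arrive as one category label per case. The short list `cases` is made up for illustration and is not the California data.

```python
from collections import Counter

# Hypothetical raw data: one category label per case (not the real data set).
cases = [
    "Vaccines up to date", "Vaccines up to date", "Other",
    "Personal Belief Exception", "Vaccines up to date", "Medical Exception",
    "Vaccines up to date", "Other",
]

# Counter tallies how many cases fall in each category.
frequency_table = Counter(cases)

# Print the categories from most to least common, then the total.
for category, count in frequency_table.most_common():
    print(f"{category:<28}{count:>4}")
print(f"{'TOTAL':<28}{sum(frequency_table.values()):>4}")
```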


Bar Chart/Plot/Graph

In a bar chart, the height of each bar is the number of cases falling in that category

Minitab: Graph -> Bar chart


Histogram vs Bar Chart

This is a

a) Histogram
b) Bar chart
c) Other
d) I have no idea


Histogram vs Bar Chart

This is a

a) Histogram
b) Bar chart
c) Other
d) I have no idea


Histogram vs Bar Chart

A bar chart is for categorical data, and the x-axis has no numeric scale

A histogram is for quantitative data, and the x-axis is numeric

For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed

For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different numbers of bars
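To see why a histogram's appearance depends on the number of bars, here is a small Python sketch that bins the same values two ways. The function `histogram_counts` and the `ages` data are hypothetical, written only for this illustration. It uses equal-width bins and assumes the values are not all identical.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall in each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than past the end.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical quantitative data, made up for illustration.
ages = [18, 19, 19, 20, 20, 20, 21, 22, 24, 30]

print(histogram_counts(ages, 3))  # [7, 2, 1]
print(histogram_counts(ages, 6))  # [3, 4, 1, 1, 0, 1]
```

The same ten values give three bars with one shape and six bars with another. A bar chart of a categorical variable has no such choice to make.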


Proportion

The proportion in a category is found by

$$\text{proportion} = \frac{\text{number in that category}}{\text{total number}}$$

Proportion for a sample: $\hat{p}$ (“p-hat”)

Proportion for a population: $p$


Proportion

What proportion of children in the sample have their vaccinations up to date?

480014/530643 = 0.9046

A proportion of 0.90 is the same as 90%

Category                     Count
Vaccines up to date         480014
Medical Exception             1009
Personal Belief Exception    13229
Other                        36391
TOTAL                       530643
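The calculation above can be checked with a few lines of Python. This is a minimal sketch. The helper name `proportion` is an assumption made for illustration, and the counts are the ones in the frequency table.

```python
def proportion(count_in_category, total):
    """Proportion = number in the category / total number of cases."""
    return count_in_category / total

up_to_date = 480014  # children with vaccines up to date
total = 530643       # all children in the table

p_hat = proportion(up_to_date, total)
print(f"p-hat = {p_hat:.4f}")         # p-hat = 0.9046
print(f"as a percent: {p_hat:.1%}")   # as a percent: 90.5%
```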


Relative Frequency Table

A relative frequency table shows the proportion of cases that fall in each category

All the numbers in a relative frequency table sum to 1 (up to rounding)

Category                    Proportion
Vaccines up to date              0.905
Medical Exception                0.002
Personal Belief Exception        0.025
Other                            0.069
TOTAL                            1

Minitab: Stat -> Tables -> Tally Individual Variables -> Percents
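A relative frequency table can be derived from the counts by dividing each one by the total. The sketch below does this in Python using the counts from the frequency table. The variable names are illustrative assumptions.

```python
# Counts from the frequency table.
counts = {
    "Vaccines up to date": 480014,
    "Medical Exception": 1009,
    "Personal Belief Exception": 13229,
    "Other": 36391,
}

total = sum(counts.values())

# Divide every count by the total to get the proportion in each category.
relative_frequencies = {category: n / total for category, n in counts.items()}

for category, p in relative_frequencies.items():
    print(f"{category:<28}{p:.3f}")
print(f"{'TOTAL':<28}{sum(relative_frequencies.values()):.3f}")
```

Because each entry is rounded separately for display, the rounded values can add up to slightly more or less than 1 even though the unrounded proportions sum to 1.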


Pie Chart

In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category

Minitab: Graph -> Pie Chart


Summary: One Categorical Variable

Summary Statistics
• Proportion
• Frequency table
• Relative frequency table

Visualization
• Bar chart
• Pie chart


Two Categorical Variables

Look at the relationship between two categorical variables

1. Relationship status

2. Gender


Two-Way Table

                    Female  Male  Total
In a Relationship       32    10     42
It’s Complicated        12     7     19
Single                  63    45    108
Total                  107    62    169

It doesn’t matter which variable is displayed in the rows and which in the columns

Minitab: Stat -> Tables -> Tally Individual Variables -> Counts


Two-Way Table

What proportion of students in this sample are in a relationship?

a) 42/169 = 25%
b) 32/107 = 30%
c) 10/62 = 16%
d) 32/42 = 76%

                    Female  Male  Total
In a Relationship       32    10     42
It’s Complicated        12     7     19
Single                  63    45    108
Total                  107    62    169


Two-Way Table

What proportion of females in this sample are in a relationship?

a) 42/169 = 25%
b) 32/107 = 30%
c) 10/62 = 16%
d) 32/42 = 76%

                    Female  Male  Total
In a Relationship       32    10     42
It’s Complicated        12     7     19
Single                  63    45    108
Total                  107    62    169


Male and Female Proportions

30% of females in the sample say they are in a relationship

16% of males in the sample say they are in a relationship

Why the difference???


Difference in Proportions

A difference in proportions compares the proportion for one categorical variable across different levels of the other categorical variable

Example: proportion of females in a relationship – proportion of males in a relationship
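The example can be worked through numerically. Below is a minimal Python sketch using the counts from the two-way table. The helper names `column_total` and `proportion_within` are hypothetical, chosen for this illustration.

```python
# Counts from the two-way table: relationship status by gender.
table = {
    "In a Relationship": {"Female": 32, "Male": 10},
    "It's Complicated": {"Female": 12, "Male": 7},
    "Single": {"Female": 63, "Male": 45},
}

def column_total(table, column):
    """Total number of cases in one column (e.g. all females)."""
    return sum(row[column] for row in table.values())

def proportion_within(table, row, column):
    """Proportion of the column's cases that fall in the given row."""
    return table[row][column] / column_total(table, column)

p_female = proportion_within(table, "In a Relationship", "Female")  # 32/107
p_male = proportion_within(table, "In a Relationship", "Male")      # 10/62
difference = p_female - p_male

print(f"{p_female:.3f} - {p_male:.3f} = {difference:.3f}")  # 0.299 - 0.161 = 0.138
```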


Two-Way Table

What proportion of people in a relationship in this sample are female?

a) 42/169 = 25%
b) 32/107 = 30%
c) 10/62 = 16%
d) 32/42 = 76%

                    Female  Male  Total
In a Relationship       32    10     42
It’s Complicated        12     7     19
Single                  63    45    108
Total                  107    62    169


Two-Way Table

CAUTION: The proportion of females in a relationship is NOT THE SAME AS the proportion of people in a relationship who are female!

30% ≠ 76%!
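The two proportions share a numerator (32 females in a relationship) but use different denominators. A short Python sketch with the counts from the two-way table makes the distinction explicit. The variable names are illustrative assumptions.

```python
female_in_relationship = 32
male_in_relationship = 10
total_female = 107
total_in_relationship = female_in_relationship + male_in_relationship  # 42

# Of all females, what fraction are in a relationship? (denominator: females)
p_relationship_among_females = female_in_relationship / total_female

# Of all people in a relationship, what fraction are female? (denominator: people in a relationship)
p_female_among_relationship = female_in_relationship / total_in_relationship

print(f"{p_relationship_among_females:.0%}")  # 30%
print(f"{p_female_among_relationship:.0%}")   # 76%
```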


Side-by-Side Bar Chart

Minitab: Graph -> Bar Chart -> Cluster

The height of each bar is the number of the corresponding cell in the two-way table


Segmented Bar Chart

A segmented bar chart is like a side-by-side bar chart, but the bars are stacked instead of side-by-side

Minitab: Graph -> Bar Chart -> Stack


Vitamin D Injections

Many kidney dialysis patients get vitamin D injections to correct for a lack of calcium. Two forms of vitamin D injections are used: calcitriol and paricalcitol. The records of 67,000 dialysis patients were examined; half received one drug and half received the other. After three years, 58.7% of those getting paricalcitol had survived, while only 51.5% of those getting calcitriol had survived.

Construct an approximate two-way table of the data (due to rounding of the percentages we can’t recover the exact counts – round to whole numbers).

Source: Teng, M., et al., “Survival of patients undergoing hemodialysis with paricalcitol or calcitriol therapy,” New England Journal of Medicine, July 31, 2003; 349(5): 446-456.


Vitamin D Injections
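One way to build the approximate two-way table is sketched below in Python. It assumes exactly 33,500 patients per drug (half of 67,000) and rounds half up to whole patients. Both are choices made for this illustration, and the exact counts in the study cannot be recovered from the rounded percentages.

```python
TOTAL_PATIENTS = 67_000
PER_GROUP = TOTAL_PATIENTS // 2  # assumption: exactly half received each drug

# Survival percentages expressed in tenths of a percent (58.7% -> 587),
# so that the arithmetic stays in whole numbers.
survival_per_mille = {"Paricalcitol": 587, "Calcitriol": 515}

table = {}
for drug, per_mille in survival_per_mille.items():
    # Round half up to the nearest whole patient (one possible rounding choice).
    survived = (per_mille * PER_GROUP + 500) // 1000
    table[drug] = {"Survived": survived, "Did not survive": PER_GROUP - survived}

for drug, row in table.items():
    print(f"{drug:<14}{row['Survived']:>8}{row['Did not survive']:>8}{PER_GROUP:>8}")
```

Under these assumptions the table has about 19,665 survivors and 13,835 non-survivors for paricalcitol, and about 17,253 survivors and 16,247 non-survivors for calcitriol.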


Getting a Dataset from a Table

If you were to write the data from the two-way table out as an entire data set, what would it look like?

How many columns would there be? What would they represent?

How many rows would there be? Give an example of one of the rows.
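To make the idea concrete, the Python sketch below expands a two-way table back into one row per case. It uses the relationship-status table from earlier as an example, which is an assumption made for illustration. The same approach applies to any two-way table of counts.

```python
# Counts from the relationship status by gender two-way table.
table = {
    "In a Relationship": {"Female": 32, "Male": 10},
    "It's Complicated": {"Female": 12, "Male": 7},
    "Single": {"Female": 63, "Male": 45},
}

# One row per case, one column per categorical variable.
dataset = [
    {"Relationship status": status, "Gender": gender}
    for status, by_gender in table.items()
    for gender, count in by_gender.items()
    for _ in range(count)  # repeat the row once for every case in the cell
]

print(len(dataset))     # 169 rows, one per case
print(len(dataset[0]))  # 2 columns, one per variable
print(dataset[0])       # {'Relationship status': 'In a Relationship', 'Gender': 'Female'}
```

The number of rows equals the grand total of the table, and the number of columns equals the number of variables (two), not the number of categories.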


Kidney Stones

              Success  Failure
Treatment A       273       77
Treatment B       289       61

Which treatment is better at removing kidney stones?

a) Treatment A
b) Treatment B

Source: R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy". Br Med J (Clin Res Ed) 292 (6524): 879–882


Kidney Stones

SMALL STONES  Success  Failure
Treatment A        81        6
Treatment B       234       36

Which treatment is better at removing small kidney stones?

a) Treatment A
b) Treatment B


Kidney Stones

LARGE STONES  Success  Failure
Treatment A       192       71
Treatment B        55       25

Which treatment is better at removing large kidney stones?

a) Treatment A
b) Treatment B


Kidney Stones

• Treatment A is more effective for both small and large kidney stones, but the combined data show Treatment B to be more effective overall!

• How is this possible?


Kidney Stones – Simpson’s Paradox

Large Stones  Success  Failure  Success Rate
Treatment A       192       71           73%
Treatment B        55       25           69%

Small Stones  Success  Failure  Success Rate
Treatment A        81        6           93%
Treatment B       234       36           87%

ALL STONES    Success  Failure  Success Rate
Treatment A       273       77           78%
Treatment B       289       61           83%


Kidney Stones

• Treatment A is used more often on large stones, which are harder to treat.

• This is an example of Simpson’s Paradox: an observed relationship between two variables can change (or even reverse!) when a third variable is considered
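The reversal can be reproduced directly from the counts. The Python sketch below computes the success rate for each treatment within each stone size, then again after adding the counts together. The data structure and names are illustrative assumptions, and the counts are the ones in the tables.

```python
# (successes, failures) for each treatment, split by stone size.
results = {
    "Small stones": {"A": (81, 6), "B": (234, 36)},
    "Large stones": {"A": (192, 71), "B": (55, 25)},
}

def success_rate(successes, failures):
    """Proportion of cases that were successes."""
    return successes / (successes + failures)

# Within each stone size, Treatment A has the higher success rate.
for size, by_treatment in results.items():
    for treatment, (s, f) in by_treatment.items():
        print(f"{size}, Treatment {treatment}: {success_rate(s, f):.0%}")

# Ignoring stone size: add the counts across sizes for each treatment.
combined = {}
for treatment in ("A", "B"):
    s = sum(results[size][treatment][0] for size in results)
    f = sum(results[size][treatment][1] for size in results)
    combined[treatment] = (s, f)
    print(f"All stones, Treatment {treatment}: {success_rate(s, f):.0%}")
```

Treatment A was used on 263 large-stone cases out of its 350, while Treatment B was used on only 80 large-stone cases out of its 350. The combined rate for A is therefore pulled toward the lower large-stone rate.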


Kidney Stones


Combined      Treatment A  Treatment B
Successful    273 (78%)    289 (83%)
Unsuccessful  77           61


Summary: Two Categorical Variables

Summary Statistics
• Two-way table
• Difference in proportions

Visualization
• Side-by-side bar chart
• Segmented bar chart


Variable(s), Visualization, and Summary Statistics

Categorical
  Visualization: bar chart, pie chart
  Summary statistics: frequency table, relative frequency table, proportion, odds

Quantitative
  Visualization: dotplot, histogram, boxplot
  Summary statistics: mean, median, max, min, standard deviation, range, IQR, five number summary

Categorical vs Categorical
  Visualization: side-by-side bar chart, segmented bar chart
  Summary statistics: two-way table, difference in proportions, odds ratio

Quantitative vs Categorical
  Visualization: side-by-side boxplots
  Summary statistics: statistics by group, difference in means

Quantitative vs Quantitative
  Visualization: scatterplot
  Summary statistics: correlation


Descriptive Statistics

Think of a topic or question you would like to use data to help you answer.

What would the cases be?

What would the variables be?

(Limit to one or two variables)


Descriptive Statistics

How would you visualize and summarize the variable or relationship between variables?

a) bar chart/pie chart, proportions, frequency table/relative frequency table

b) dotplot/histogram/boxplot, mean/median, sd/range/IQR, five number summary

c) side-by-side or segmented bar charts, difference in proportions, two-way table

d) side-by-side boxplot, difference in means

e) scatterplot, correlation


To Do

Read Section 2.1

Do HW 2.1 (due Friday, 2/13)

Study for Exam 1 (Friday, 2/13)