
Sociology 6Z03
Topic 17: Statistical Inference for Contingency Tables

John Fox

McMaster University

Fall 2016

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 1 / 38

Outline: Statistical Inference for Contingency Tables

Introduction

The Chi-Square Test for Two Proportions

The Chi-Square Test for r × c Contingency Tables

The Chi-Square Goodness-of-Fit Test


Introduction

Last lecture’s difference-of-proportions z-test can be thought of as a method for testing the statistical significance of the relationship between two categorical variables, each with two categories; that is, a contingency table in which the explanatory variable and the response variable each have two categories.

Here is an example that we looked at much earlier in the course, for white and black defendants convicted of murdering white victims (in a study of the application of the death penalty in the U.S.):

                      Death Penalty?
Race of Defendant     Yes     No    Total
White                  19    132      151
Black                  11     52       63
Total                  30    184      214


Introduction

Recall that contingency tables are percentaged within categories of the explanatory variable (race of defendant) and across categories of the response (death penalty):

                      Death Penalty?
Race of Defendant     Yes     No    Total   Number
White                12.6   87.4    100.0      151
Black                17.5   82.5    100.0       63

We want to test the null hypothesis that white and black defendants convicted of killing whites are equally likely to receive the death penalty

H0: p1 = p2 or H0: p1 − p2 = 0

against the alternative hypothesis that whites are less likely than blacks to receive the death penalty

Ha: p1 < p2 or Ha: p1 − p2 < 0


Introduction

In this case, the pooled sample proportion receiving the death penalty is

\[ \hat{p} = \frac{19 + 11}{151 + 63} = \frac{30}{214} = 0.140 \]

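and the difference-of-proportions z test statistic is

\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{.126 - .175}{\sqrt{.140(1 - .140)\left(\dfrac{1}{151} + \dfrac{1}{63}\right)}} = -0.94 \]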
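As a cross-check, the calculation can be sketched in a few lines of Python. This is only an illustration (the variable names are my own, and the slides do not use Python); it works with unrounded proportions, so it should agree with the slide to within rounding.

```python
import math

# Counts from the death-penalty table.
yes_white, n_white = 19, 151
yes_black, n_black = 11, 63

p1 = yes_white / n_white                                   # white defendants
p2 = yes_black / n_black                                   # black defendants
p_pooled = (yes_white + yes_black) / (n_white + n_black)   # pooled proportion

# Standard error of the difference under H0: p1 = p2.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_white + 1 / n_black))
z = (p1 - p2) / se

# One-sided P-value for Ha: p1 < p2, the standard normal area below z.
p_one_sided = 0.5 * math.erfc(-z / math.sqrt(2))

print(round(z, 2), round(p_one_sided, 3))
```

The P-value printed here is based on the unrounded z, so it can differ in the third decimal from a value looked up for z = −0.94.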


Introduction

Thought Question

The one-sided P-value for z = −0.94 is P = .1736.
TRUE or FALSE: Consequently, we do not have strong evidence against the null hypothesis.

A TRUE.

B FALSE.

C I don’t know.


Introduction

Suppose, however, that we have more than two samples to compare, or that the response variable has more than two categories.

In this case, a simple difference-of-proportions test will not suffice.


The Chi-Square Test for Two Proportions
Finding Expected Counts Under Independence

Here is another way of calculating a difference-of-proportions test that will generalize to any number of categories for the explanatory and response variables.

The first step is to find the expected count in each of the four cells of the table under the assumption that the null hypothesis is true and the row and column variables in the table are independent (unrelated).

Under this assumption of independence, we would expect the proportion receiving the death penalty to be the same for white and black defendants. We estimate this common quantity to be

\[ \hat{p} = \frac{19 + 11}{151 + 63} = \frac{30}{214} = 0.140 \]


The Chi-Square Test for Two Proportions
Finding Expected Counts Under Independence

Then we would expect

\[ 151 \times .140 = \frac{151 \times 30}{214} = 21.2 \]

of the 151 white defendants to receive the death penalty, and the remainder,

\[ 151 \times .860 = \frac{151 \times 184}{214} = 129.8 \]

not to receive the death penalty.

These expected counts follow from the mean of a binomial random variable.

For example, in 151 independent trials when p = .140, the expected number of “successes” (death sentences) is 151 × .140 = 21.2.


The Chi-Square Test for Two Proportions
Finding Expected Counts Under Independence

The total number of blacks is 63, the proportion overall getting the death penalty is .140 = 30/214, and the proportion not getting the death penalty is .860 = 184/214.

Thought Question

(A) TRUE, (B) FALSE, or (C) I don’t know: The expected counts for blacks are

\[ 63 \times .140 = \frac{63 \times 30}{214} = 8.8 \]

expected to receive the death penalty, and

\[ 63 \times .860 = \frac{63 \times 184}{214} = 54.2 \]

expected not to receive the death penalty.


The Chi-Square Test for Two Proportions
Finding Expected Counts Under Independence

These expected counts are the counts that we would observe on average over many samples of 151 whites and 63 blacks if the probability of receiving the death penalty were p = .14 for both black and white defendants.


The Chi-Square Test for Two Proportions
Finding Expected Counts Under Independence: Formula

Here is a simple formula for calculating the expected count in each cell of the table:

\[ \text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} \]

The table total is just the overall sample size, n.
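The formula translates directly into code. The following is a minimal sketch, assuming the table is stored as a list of rows of counts; the function name `expected_counts` is hypothetical.

```python
def expected_counts(table):
    """Estimated expected counts under independence for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)                      # table total
    # Each cell: row total x column total / table total.
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[19, 132],   # white defendants: death penalty yes, no
            [11, 52]]    # black defendants: death penalty yes, no

for row in expected_counts(observed):
    print([round(e, 1) for e in row])
```

Because the function only uses the row and column totals, it applies unchanged to tables with more than two rows or columns.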

A note on terminology: The expected counts under independence are more accurately called estimated expected counts, because they are based on the sample (not population) marginal distributions for the two variables.


The Chi-Square Test for Two Proportions
Expected Counts by the Multiplication Rule for Independent Events

We can also derive the formula from the multiplication rule for independent events. For example:

The proportion overall receiving the death penalty is

\[ \hat{p}_D = \frac{30}{214} = .140 \]

The proportion overall of white defendants is

\[ \hat{p}_W = \frac{151}{214} = .706 \]

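If the two events are independent, then the joint proportion would be

\[ \hat{p}_{DW} = \hat{p}_D \, \hat{p}_W = .140 \times .706 = .0988 \]

and the expected count under independence of whites receiving the death penalty is

\[ n\hat{p}_{DW} = 214 \times \frac{30}{214} \times \frac{151}{214} = 21.2 \]

(using the unrounded proportions; multiplying 214 by the rounded value .0988 gives 21.1).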
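The same argument works for any cell. As a sketch, write \( R_i \) for the total of row \( i \) and \( C_j \) for the total of column \( j \) (this notation is introduced here for illustration and is not used elsewhere in the slides):

\[ n \, \hat{p}_{ij} = n \times \frac{R_i}{n} \times \frac{C_j}{n} = \frac{R_i \times C_j}{n} \]

which is the row total times the column total divided by the table total.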


The Chi-Square Test for Two Proportions
Chi-Square Test Statistic

The chi-square test statistic compares the expected counts to the observed counts:

\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \]

The name of the test comes from the Greek letter χ (chi), which looks like the Roman letter X.


The Chi-Square Test for Two Proportions
Chi-Square Test Statistic: Example

For our example, we have the following computation:

observed   expected   obs. − exp.   (obs. − exp.)²/exp.
      19       21.2          −2.2                  0.23
     132      129.8           2.2                  0.04
      11        8.8           2.2                  0.55
      52       54.2          −2.2                  0.09
     214      214.0 ✓         0.0 ✓          X² = 0.91

The P-value for the test is P = .3422; I’ll explain later how the P-value is obtained.
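The arithmetic in the table can be reproduced with a short Python sketch. It uses the rounded expected counts from the slide, so the total is only expected to match 0.91 to within rounding. The P-value line relies on a standard fact that the slides have not yet introduced: with 1 degree of freedom, the upper-tail chi-square probability equals `erfc(sqrt(x / 2))`.

```python
import math

observed = [19, 132, 11, 52]
expected = [21.2, 129.8, 8.8, 54.2]   # rounded expected counts from the slide

# Sum of (observed - expected)^2 / expected over the four cells.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail probability for a chi-square variable with 1 df
# (assumption: standard closed form, not taken from the slides).
p_value = math.erfc(math.sqrt(x2 / 2))

print(round(x2, 2), round(p_value, 2))
```

Small discrepancies from the slide's figures come from where the rounding is done, not from a different method.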


The Chi-Square Test for Two Proportions
Chi-Square Test vs. Difference-of-Proportions Test

The chi-square test for a two-by-two table is really just a more complicated way of calculating the difference-of-proportions z-test: In this instance, X² = z².

Note that this is true for the example (within rounding error): (−0.94)² = 0.88.

The only difference between the two tests is that the z-test can be used for a directional alternative hypothesis, but the chi-square test is inherently nondirectional.

This is why in this instance the P-value for the chi-square test is twice (within rounding error) the one-sided P-value for the z-test.
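A numerical check of both claims, sketched in Python with unrounded intermediate values (variable names are illustrative). When nothing is rounded, X² and z² should agree to floating-point precision, and the chi-square P-value should be exactly twice the one-sided z-test P-value.

```python
import math

observed = [[19, 132],   # white defendants: yes, no
            [11, 52]]    # black defendants: yes, no
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic from unrounded expected counts.
x2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        x2 += (count - expected) ** 2 / expected

# Difference-of-proportions z statistic with the pooled proportion.
p1 = observed[0][0] / row_totals[0]
p2 = observed[1][0] / row_totals[1]
pooled = col_totals[0] / n
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled)
                          * (1 / row_totals[0] + 1 / row_totals[1]))

p_one_sided = 0.5 * math.erfc(-z / math.sqrt(2))   # normal area below z
p_chi_square = math.erfc(math.sqrt(x2 / 2))        # chi-square upper tail, 1 df

print(round(x2, 3), round(z ** 2, 3))
print(round(p_chi_square, 3), round(2 * p_one_sided, 3))
```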


The Chi-Square Test for r × c Contingency Tables

More generally, a contingency table has r rows and c columns.

Recall, for example, the following contingency table of counts, using data from the U.S. General Social Surveys:

                          Date of Survey
Premarital Sex       1970s   1980s   1990s   2000s    Total
Not Wrong             2423    3348    3647    4547   13 965
Sometimes Wrong       1692    1789    1797    1738     7016
Wrong                 3207    3017    3035    3114   12 373
Total                 7322    8154    8479    9399   33 354

In this case, r = 3 and c = 4.


The Chi-Square Test for r × c Contingency Tables

To interpret this table, we percentaged it by columns (because the column variable, date, is the explanatory variable):

                          Date of Survey
Premarital Sex       1970s   1980s   1990s   2000s
Not Wrong             33.1    41.1    43.0    48.4
Sometimes Wrong       23.1    21.9    21.2    18.5
Wrong                 43.8    37.0    35.8    33.1
Total                100.0   100.0   100.0   100.0
Number                7322    8154    8479    9399

We cannot summarize the distribution for each decade with a single percentage because the response variable (attitude towards premarital sex) has more than two categories.

A difference-of-proportions z-test would therefore be inappropriate here, even if there were only two categories of the explanatory variable (and there are four).


The Chi-Square Test for r × c Contingency Tables
Finding Expected Counts Under Independence

We can, however, calculate expected counts assuming that date and attitude towards premarital sex are unrelated, and then use the chi-square test statistic to compare the expected and observed counts.

The expected count for those who think premarital sex is “not wrong” in the 1970s, for example, is

\[ \text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} = \frac{13{,}965 \times 7322}{33{,}354} = 3065.7 \]


The Chi-Square Test for r × c Contingency Tables
Finding Expected Counts Under Independence

Thought Question

What is the expected count for “sometimes wrong” in the 1980s?

A 7016 × 7322 / 33,354.

B 13,965 × 8154 / 33,354.

C 7016 × 8154 / 33,354.

D I don’t know.


The Chi-Square Test for r × c Contingency Tables
Finding Expected Counts Under Independence

The complete table of expected counts is as follows:

                          Date of Survey
Premarital Sex        1970s    1980s    1990s    2000s      Total
Not Wrong            3065.7   3414.0   3550.1   3935.3   13 965.1
Sometimes Wrong      1540.2   1715.2   1783.5   1977.1     7016.0
Wrong                2716.2   3024.8   3145.4   3486.7   12 373.1
Total                7322.1   8154.0   8479.0   9399.1   33 354.2

Note that, within rounding error, the expected counts sum to the observed row and column marginals.
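The whole table of expected counts can be generated in one pass. This is a sketch in Python; the dictionary layout and names are my own choices for illustration.

```python
observed = {
    "Not Wrong":       [2423, 3348, 3647, 4547],
    "Sometimes Wrong": [1692, 1789, 1797, 1738],
    "Wrong":           [3207, 3017, 3035, 3114],
}
decades = ["1970s", "1980s", "1990s", "2000s"]

col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

# Expected count = row total x column total / table total, for every cell.
expected = {
    label: [sum(row) * c / n for c in col_totals]
    for label, row in observed.items()
}

print(" " * 16, *decades)
for label, row in expected.items():
    print(f"{label:<16}", *[f"{e:.1f}" for e in row])
```

Computed without intermediate rounding, each row of expected counts sums exactly (up to floating-point error) to the observed row total.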


The Chi-Square Test for r × c Contingency Tables
Chi-Square Test Statistic

The chi-square test statistic is calculated as before:

\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \frac{(2423 - 3065.7)^2}{3065.7} + \frac{(3348 - 3414.0)^2}{3414.0} + \cdots + \frac{(3114 - 3486.7)^2}{3486.7} = 413.3 \]
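Summing all twelve terms by hand is tedious, so here is a sketch of the full computation in Python (the function name `chi_square_statistic` is hypothetical). It uses unrounded expected counts.

```python
def chi_square_statistic(table):
    """X^2 for a two-way table of counts, with expected counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (count - expected) ** 2 / expected
    return x2

gss = [[2423, 3348, 3647, 4547],   # not wrong
       [1692, 1789, 1797, 1738],   # sometimes wrong
       [3207, 3017, 3035, 3114]]   # wrong

x2 = chi_square_statistic(gss)
print(round(x2, 1))
```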


The Chi-Square Test for r × c Contingency Tables
The Family of Chi-Square Distributions

Like the t-distributions, the chi-square distributions are a family of density curves indexed by degrees of freedom.

There is a different chi-square distribution for each number of degrees of freedom, 1, 2, 3, ....

Unlike the t and normal distributions, the chi-square distribution is positively skewed.

As degrees of freedom grow, the distribution becomes less skewed.

The average value of a chi-square random variable is equal to the number of degrees of freedom, E(X²) = df; the variance is twice the number of degrees of freedom, Var(X²) = 2 × df.
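These two facts can be checked by simulation. The sketch below relies on a standard result that the slides do not state: a chi-square variable with df degrees of freedom has the same distribution as a sum of df squared independent standard normal variables. The seed and sample size are arbitrary choices.

```python
import random
import statistics

random.seed(6203)      # arbitrary seed, for reproducibility only
df = 5
n_draws = 20000

# Each draw is a sum of df squared standard normal values
# (assumption: standard characterization of the chi-square distribution).
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(n_draws)]

mean = statistics.mean(draws)
variance = statistics.variance(draws)
print(round(mean, 2), round(variance, 2))
```

The simulated mean and variance should land near df = 5 and 2 × df = 10, though not exactly on them.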


The Chi-Square Test for r × c Contingency Tables
The Family of Chi-Square Distributions: Some Examples

[Figure: density curves of the chi-square distribution for df = 1, 2, 5, and 10. Horizontal axis: X², from 0 to 20. Vertical axis: Density, from 0.0 to 0.7.]


The Chi-Square Test for r × c Contingency Tables
The Chi-Square Distribution and the Chi-Square Test Statistic

Chi-square values cannot be smaller than zero, but they can be indefinitely large.

The chi-square test statistic is calculated from squared differences between observed and expected frequencies, so it can’t be negative.


Thought Question

(A) TRUE, (B) FALSE, or (C) I don’t know:

If the observed and expected frequencies are identical then the value of the chi-square statistic is 0.

The larger the difference between observed and expected frequencies (that is, the larger the departure from independence), the smaller the value of the chi-square test statistic.


The Chi-Square Test for r × c Contingency Tables
The Chi-Square Distribution and the Chi-Square Test Statistic

The chi-square test is therefore inherently one-tailed: We reject the null hypothesis of no relationship between the two variables in a table when the value of the test statistic is sufficiently large.

The alternative hypothesis, however, is nondirectional: a relationship of any sort between the two variables.


The Chi-Square Test for r × c Contingency Tables
Degrees of Freedom for the Chi-Square Test

The degrees of freedom for a chi-square test of the hypothesis of independence (no relationship) in a two-way contingency table are

df = (r − 1)(c − 1)

In our first example, therefore, where r = 2 and c = 2, df = (2− 1)(2− 1) = 1.


The Chi-Square Test for r × c Contingency Tables
Degrees of Freedom for the Chi-Square Test

Thought Question

In the second example, where r = 3 and c = 4, what are the df for the chi-square test statistic?

A 12.

B 6.

C 4.

D 1.

E I don’t know.


The Chi-Square Test for r × c Contingency Tables
Degrees of Freedom for the Chi-Square Test

The degrees of freedom for the chi-square test of independence follow from the constraint that the expected frequencies sum to the same row and column marginals as the observed frequencies.

This means that if we fill in all but the last row and column of the expected counts, the remaining values can be calculated by subtraction.

For the second example, we only have to fill in six expected counts before all are known:

                          Date of Survey
Premarital Sex        1970s    1980s    1990s    2000s    Total
Not Wrong            3065.7   3414.0   3550.1        ?   13 965
Sometimes Wrong      1540.2   1715.2   1783.5        ?     7016
Wrong                     ?        ?        ?        ?   12 373
Total                  7322     8154     8479     9399   33 354
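The subtraction step can be sketched in Python, starting from the six filled-in cells and the marginal totals (variable names are illustrative). Because the six starting values are already rounded, the recovered cells can differ from the full table of expected counts by 0.1.

```python
row_totals = [13965, 7016, 12373]
col_totals = [7322, 8154, 8479, 9399]

# The (r - 1)(c - 1) = 6 expected counts that are filled in directly.
free_cells = [[3065.7, 3414.0, 3550.1],
              [1540.2, 1715.2, 1783.5]]

# Last column: each row total minus the cells already filled in that row.
table = [row + [total - sum(row)] for row, total in zip(free_cells, row_totals)]

# Last row: each column total minus the cells above it.
table.append([total - sum(col) for total, col in zip(col_totals, zip(*table))])

for row in table:
    print([round(e, 1) for e in row])
```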


The Chi-Square Test for r × c Contingency Tables
Finding the P-Value for the Chi-Square Test

P-values for chi-square statistics can be obtained from many computer programs or, approximately, from the chi-square table.

The chi-square table (Table E in the text) is set up like the t table, with degrees of freedom (df) down the side and right-tail probabilities (p) across the top.

Here’s an extract from the table for one degree of freedom:

               p
df     .25    .20    .15
1     1.32   1.64   2.07

Thus, in our first example, where X² = 0.91 with 1 df, P > .25.


The Chi-Square Test for r × c Contingency Tables
Finding the P-Value for the Chi-Square Test

Here’s another extract from the table with six degrees of freedom:

                p
df    .0025    .001   .0005
6     20.25   22.46   24.10

For our second example, X² = 413.3 with 6 df, and so P ≪ .0005.
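For readers who want a number rather than a bound, the right-tail probability can be computed directly. The sketch below uses standard finite-series closed forms for integer degrees of freedom; these formulas are not from the slides, and the function name `chi2_upper_tail` is hypothetical. It needs only the Python standard library.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square variable with integer df exceeds x)."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k = 0 .. df/2 - 1 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus
    # exp(-x/2) * sum over k = 1 .. (df-1)/2 of (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

print(chi2_upper_tail(1.32, 1))    # table entry for p = .25, df = 1
print(chi2_upper_tail(22.46, 6))   # table entry for p = .001, df = 6
print(chi2_upper_tail(0.91, 1))    # first example
print(chi2_upper_tail(413.3, 6))   # second example
```

The first two calls should come out close to the tabled probabilities, which is a useful sanity check on the function before trusting it for the two examples.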


The Chi-Square Test for r × c Contingency Tables
Finding the P-Value for the Chi-Square Test

Thought Question

TRUE or FALSE: Given this very small P-value, P ≪ .0005, we can reject the null hypothesis of independence and conclude that attitude towards premarital sex changed over time.

A TRUE.

B FALSE.

C I don’t know.


The Chi-Square Test for r × c Contingency Tables
When is the Chi-Square Test Appropriate?

The chi-square test is appropriate when we want to test association in a contingency table.

We should have independent simple random samples from the populations defined by the categories of the explanatory variable, or an SRS from the entire population, classified by the explanatory and response variables.

For the P-value for the chi-square test to be accurate, the population should be at least 10 times larger than the sample.

Chi-square tests are also appropriate when subjects are randomly assigned to experimental treatments and when the response variable is categorical.


The Chi-Square Test for r × c Contingency Tables
When is the Chi-Square Test Appropriate?

Many statisticians also argue that significance tests (including the chi-square test) are appropriate when we have data on a whole population and want to test whether an observed pattern of association could easily have been the product of chance.

I agree with this argument, by the way.

The chi-square distribution is an approximation to the exact distribution of the X² test statistic.

The approximation is good as long as the expected counts are not too small: No more than 20 percent of the expected counts should be smaller than 5, and no expected count should be smaller than 1. This requirement was met for both of our examples.
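The rule of thumb is easy to automate. A minimal sketch in Python follows; the function name is hypothetical, and the second table is a made-up illustration, not data from the course.

```python
def approximation_ok(expected):
    """Apply the rule: at most 20% of expected counts below 5, none below 1."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and min(cells) >= 1

# Expected counts from the death-penalty example.
death_penalty = [[21.2, 129.8], [8.8, 54.2]]
print(approximation_ok(death_penalty))

# Hypothetical sparse table: one of four cells is below 5 (25%) and below 1.
sparse = [[0.5, 10.0], [10.0, 10.0]]
print(approximation_ok(sparse))
```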


The Chi-Square Goodness-of-Fit Test

The chi-square statistic can also be used to test whether the distribution of a categorical variable is consistent with some probability distribution.

Suppose that the categorical variable in question has k categories; the null hypothesis gives specific probabilities for all of the categories of the variable:

\[ H_0\colon p_1 = p_{1,0},\; p_2 = p_{2,0},\; \ldots,\; p_k = p_{k,0} \]

where \( \sum_i p_{i,0} = 1 \).

Suppose that the observed counts are \( n_1, n_2, \ldots, n_k \), and that \( n = \sum_i n_i \). Then the expected counts are each \( n p_{i,0} \).

The chi-square goodness-of-fit test statistic is

\[ X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \]

Under the null hypothesis, this statistic has a chi-square distribution with k − 1 degrees of freedom.


The Chi-Square Goodness-of-Fit Test
Example

A gambler wants to test whether or not a die is fair, and throws the die 100 times, obtaining the following distribution of observed counts:

          Number of Dots
  1    2    3    4    5    6   Total
 16    9   24   22   11   18     100

The null and alternative hypotheses are

\[ H_0\colon p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \tfrac{1}{6} \approx .1667 \]

\[ H_a\colon \text{not all } p_i = \tfrac{1}{6} \]


The Chi-Square Goodness-of-Fit Test
Example

Here, all 6 expected counts are \( n p_{i,0} = 100 \times \tfrac{1}{6} = 16.67 \), and the chi-square test statistic is

\[ X^2 = \frac{(16 - 16.67)^2}{16.67} + \frac{(9 - 16.67)^2}{16.67} + \cdots + \frac{(18 - 16.67)^2}{16.67} = 10.52 \]

with df = 6 − 1 = 5, for which .05 < P < .10.

We therefore have only weak evidence against the null hypothesis that the die is fair.
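The die example can be reproduced with a short Python sketch. The P-value line uses a standard closed form for the chi-square upper tail with 5 degrees of freedom; that formula is an addition here, not something the slides derive, and the variable names are illustrative.

```python
import math

observed = [16, 9, 24, 22, 11, 18]
n = sum(observed)
expected = n / len(observed)          # n * (1/6) for every face
df = len(observed) - 1

x2 = sum((o - expected) ** 2 / expected for o in observed)

# Upper-tail probability for 5 df (assumption: standard closed form):
# erfc(sqrt(h)) + exp(-h) * (h^0.5 / Gamma(1.5) + h^1.5 / Gamma(2.5)), h = x2 / 2
h = x2 / 2
p_value = math.erfc(math.sqrt(h)) + math.exp(-h) * (
    h ** 0.5 / math.gamma(1.5) + h ** 1.5 / math.gamma(2.5)
)

print(round(x2, 2), df, round(p_value, 3))
```

The computed P-value should fall between the two table bounds quoted on the slide.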

Note: The data were generated by using R to simulate throwing a fair die 100 times; thus H0 is true.
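A simulation in the same spirit can be sketched in Python. This is not the R code used to produce the slide's data: the seed, the number of repetitions, and the helper name are arbitrary assumptions. It repeatedly throws a fair die 100 times and records how often X² is at least as large as the observed 10.52, which gives a simulation-based counterpart to the P-value.

```python
import random

random.seed(2016)            # arbitrary seed, for reproducibility only
n_throws, n_reps = 100, 5000
expected = n_throws / 6
observed_x2 = 10.52          # statistic reported on the slide

def simulated_x2():
    """X^2 for one simulated set of throws of a fair die."""
    throws = random.choices(range(1, 7), k=n_throws)
    counts = [throws.count(face) for face in range(1, 7)]
    return sum((c - expected) ** 2 / expected for c in counts)

# Share of simulated statistics at least as large as the observed one
# (the small tolerance guards against floating-point ties).
share = sum(simulated_x2() >= observed_x2 - 1e-9 for _ in range(n_reps)) / n_reps
print(share)
```

Since H0 is true in the simulation, this share estimates how often a fair die produces a statistic this large by chance.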

Important Point

Do not confuse the chi-square goodness-of-fit test with the chi-square test of independence in a two-way contingency table.
