Biostatistics course, Part 12

Association between two categorical variables

Dr. Sc. Nicolas Padilla Raygoza

Department of Nursing and Obstetrics

Division of Health Sciences and Engineering, Campus Celaya-Salvatierra

University of Guanajuato, Mexico

Biosketch

Medical Doctor, Autonomous University of Guadalajara.

Pediatrician, certified by the Mexican Council of Certification in Pediatrics.

Postgraduate Diploma in Epidemiology, London School of Hygiene and Tropical Medicine, University of London.

Master of Science in Epidemiology, Atlantic International University.

Doctor of Science in Epidemiology, Atlantic International University.

Associate Professor B, Department of Nursing and Obstetrics, Division of Health Sciences, University of Guanajuato, Campus Celaya-Salvatierra.

[email protected]

Competencies

The reader will analyze the relationship between two categorical variables with two or more categories.

The reader will apply the chi-squared test.

The reader will know the chi-squared test for trends and when to apply it.

Introduction

In Part 3, we learned how to tabulate a frequency distribution for a categorical variable. Such a table shows how individuals are distributed across the categories of a variable.

For example, in a rural community in Celaya, a random sample of 200 people was asked about their socioeconomic level.

The table shows the distribution of individuals across the categories of the Socioeconomic Index Level (SEIL).

SEIL       n     %
Low        50    25
Regular    110   55
High       40    20
Total      200   100

When we examine the relationship between two categorical variables, we tabulate one against the other.

This is a two-way table, or cross-tabulation.

                Place of residence
SEIL       South   Center   North
Low        33      7        10
Regular    9       81       20
High       2       8        30
Total      44      96       60

Interpretation of a two-way table

There is an association between two categorical variables if the distribution of one variable varies according to the value of the other.

The question we are interested in is: does the level of SEIL vary by place of residence?

To answer this question we need to assess a cross-tabulation.


To compare the distributions in the table, we need to consider percentages. To answer the question of interest, should we consider column or row percentages? Because we want to compare the distribution of SEIL between places of residence, we use the column percentages.

                       Place of residence
           South          Center         North
SEIL       n     %        n     %        n     %
Low        33    75.0     7     7.3      10    16.7
Regular    9     20.5     81    84.4     20    33.3
High       2     4.5      8     8.3      30    50.0
Total      44    100.0    96    100.0    60    100.0
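The column percentages can be reproduced with a few lines of code. The following is a minimal sketch in plain Python; the variable names (`observed`, `column_totals`, `column_pct`) are illustrative choices, not part of the course material.

```python
# Observed counts: rows are SEIL levels, columns are places of residence.
observed = {
    "Low":     {"South": 33, "Center": 7,  "North": 10},
    "Regular": {"South": 9,  "Center": 81, "North": 20},
    "High":    {"South": 2,  "Center": 8,  "North": 30},
}
places = ["South", "Center", "North"]

# Column totals: number of people sampled in each place of residence.
column_totals = {p: sum(row[p] for row in observed.values()) for p in places}

# Column percentages: share of each SEIL level within a place of residence.
column_pct = {
    level: {p: round(100 * row[p] / column_totals[p], 1) for p in places}
    for level, row in observed.items()
}

for level, pcts in column_pct.items():
    print(level, pcts)
```

Each column of percentages adds up to 100 (apart from rounding), which is what makes the three places of residence directly comparable.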

Expected frequencies

If the null hypothesis is true, that is, if there is no association between SEIL and place of residence, the percentages for each level of SEIL in each place of residence should be the same as the percentages in the Total column.

Example of expected frequencies

In the total sample, 50 people (25%) have low SEIL.

If the null hypothesis is true, we should expect 25% of the people living in the Center to have low SEIL: 25% of 96 = 24.

                       Place of residence
           South          Center         North          Total
SEIL       n     %        n     %        n     %        n     %
Low        33    75.0     7     7.3      10    16.7     50    25.0
Regular    9     20.5     81    84.4     20    33.3     110   55.0
High       2     4.5      8     8.3      30    50.0     40    20.0
Total      44    100.0    96    100.0    60    100.0    200   100.0


If there are no differences in the distribution of SEIL between places of residence, we should expect the percentage of people with low SEIL to be the same in each place of residence.

Note that the expected frequencies do not have to be integers.

Using the row and column totals, we can calculate the expected number in each cell: expected frequency = (row total x column total) / overall total. For low SEIL in the Center, this gives 50 x 96 / 200 = 24.
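The calculation from row and column totals can be sketched as follows. This is an illustrative plain-Python example; the function name `expected_frequencies` is hypothetical and the counts are those of the SEIL by place of residence table.

```python
# Observed counts: rows are SEIL levels (Low, Regular, High),
# columns are places of residence (South, Center, North).
observed = [
    [33, 7, 10],
    [9, 81, 20],
    [2, 8, 30],
]

def expected_frequencies(table):
    """Expected count per cell under the null hypothesis of no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = row total * column total / overall total.
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

The expected counts in each row and each column add up to the same totals as the observed counts, which is a useful check on the arithmetic.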

Chi-squared test

Expected frequencies are those that we should expect if the null hypothesis were true.

To test the null hypothesis, we must compare the expected frequencies with observed frequencies, using the following formula.

$$X^2 = \sum \frac{(O - E)^2}{E}$$

From the formula we can see that:

If there is a large difference between the observed and expected values, X2 will be large.

If there is a small difference between the observed and expected values, X2 will be small.

If X2 is large, the data do not support the null hypothesis, because the observed values are not what we would expect under the null hypothesis.

If X2 is small, the data are consistent with the null hypothesis, because the observed values are similar to those expected under the null hypothesis.

                       Place of residence
           South          Center         North          Total
SEIL       O     E        O     E        O     E        n
Low        33    11       7     24       10    15       50
Regular    9     24.2     81    52.8     20    33       110
High       2     8.8      8     19.2     30    12       40
Total      44             96             60             200

SEIL       Place of residence   Observed   Expected   O - E    (O-E)2    (O-E)2/E
Low        South                33         11         22       484       44.00
Low        Center               7          24         -17      289       12.04
Low        North                10         15         -5       25        1.67
Regular    South                9          24.2       -15.2    231.04    9.55
Regular    Center               81         52.8       28.2     795.24    15.06
Regular    North                20         33         -13      169       5.12
High       South                2          8.8        -6.8     46.24     5.25
High       Center               8          19.2       -11.2    125.44    6.53
High       North                30         12         18       324       27.00
Total                                                                    126.2
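The cell-by-cell calculation can be checked with a short script. This is a minimal sketch in plain Python that pairs each observed count with the expected count of the same cell of the cross-tabulation. The function name `chi_squared` is hypothetical, and the degrees of freedom, (rows - 1) x (columns - 1), are a standard result that the slides do not state explicitly.

```python
# Observed counts: rows are SEIL levels (Low, Regular, High),
# columns are places of residence (South, Center, North).
observed = [
    [33, 7, 10],
    [9, 81, 20],
    [2, 8, 30],
]

def chi_squared(table):
    """Return (X2, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count, same cell
            x2 += (o - e) ** 2 / e
    # Assumption (not in the slides): df = (rows - 1) * (columns - 1).
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

x2, df = chi_squared(observed)
print(f"X2 = {x2:.1f} on {df} degrees of freedom")
```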

Chi-squared test in 2 x 2 tables

When both variables are binary, the cross-tabulation becomes a 2 x 2 table.

The X2 test is applied in the same way as for a larger table.

Example

A study compared the bacteriological efficacy of clarithromycin and penicillin in children with acute pharyngotonsillitis caused by group A beta-hemolytic Streptococcus.

The results are shown below.

Drug              Cured   Not cured   Total
Clarithromycin    91      9           100
Penicillin        82      18          100
Total             173     27          200

To use the chi-squared test, we first state the null hypothesis. In this case: there is no difference in bacteriological efficacy against group A beta-hemolytic Streptococcus between the two treatments.

To test the null hypothesis, we first calculate the expected number in each cell of the table.

                  Cured          Not cured
Drug              O     E        O     E        Total
Clarithromycin    91    86.5     9     13.5     100
Penicillin        82    86.5     18    13.5     100
Total             173            27             200

Drug              Effect      Observed   Expected   O - E   (O-E)2   (O-E)2/E
Clarithromycin    Cured       91         86.5       4.5     20.25    0.234
Clarithromycin    Not cured   9          13.5       -4.5    20.25    1.5
Penicillin        Cured       82         86.5       -4.5    20.25    0.234
Penicillin        Not cured   18         13.5       4.5     20.25    1.5
Total                                                                3.47

A quick formula for 2 x 2 tables

X2 can be calculated directly from the observed frequencies in the table and the marginal totals.

Label the cells and marginal totals as follows:

              Result
Exposure      Yes      No       Total
Yes           a        b        a + b
No            c        d        c + d
Total         a + c    b + d    N

$$X^2 = \frac{(ad - bc)^2 \, N}{(a+b)(c+d)(a+c)(b+d)}$$
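As a check, the quick formula can be applied to the clarithromycin versus penicillin example and compared with the cell-by-cell sum of (O - E)2 / E. This is a minimal sketch in plain Python; the function names are hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """Quick formula for a 2 x 2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))

def chi_squared_cellwise(a, b, c, d):
    """Same statistic computed as the sum of (O - E)^2 / E over the four cells."""
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    cells = [[a, b], [c, d]]
    return sum(
        (cells[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )

# Clarithromycin: 91 cured, 9 not cured; penicillin: 82 cured, 18 not cured.
x2_quick = chi_squared_2x2(91, 9, 82, 18)
x2_cells = chi_squared_cellwise(91, 9, 82, 18)
print(round(x2_quick, 2), round(x2_cells, 2))
```

Both routes give the same value, so the quick formula is only a shortcut and not a different test.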

Trend test in 2 x c tables

We have used the chi-squared test to evaluate whether two categorical variables are associated in the population.

When one variable is binary and the other is ordered categorical (ordinal), we may be interested in testing whether their association follows a trend.


                                       SEIL
                        Low            Regular         High
                        O     E        O     E         O     E        Total
Hypertension            18    38.5     54    54.1      78    57.4     150
Without hypertension    100   79.5     112   111.9     98    118.6    310
Total                   118            166             176            460

Hypertension   SEIL      Observed   Expected   O - E    (O-E)2    (O-E)2/E
Yes            Low       18         38.5       -20.5    420.25    10.9
Yes            Regular   54         54.1       -0.1     0.01      0.0002
Yes            High      78         57.4       20.6     424.36    7.4
No             Low       100        79.5       20.5     420.25    5.3
No             Regular   112        111.9      0.1      0.01      0.00009
No             High      98         118.6      -20.6    424.36    3.6
Total                                                             27.2


To calculate the test for trend, assign a numerical score to each socioeconomic level.

                                  SEIL
                        Low     Regular   High    Total
Hypertension            18      54        78      150
Without hypertension    100     112       98      310
Total                   118     166       176     460
Score                   1       2         3

Chi-squared test for trends

We conduct a chi-squared test for trend when we want to assess whether a binary variable varies linearly across the levels of another variable, that is, whether there is a dose-response effect.

The null hypothesis for this test is that the mean scores in the two groups (defined by the binary variable) are the same.

Thus, the chi-squared test becomes a test comparing two means, and it therefore has only one degree of freedom.

$$X^2 = \frac{(\bar{X}_{Yes} - \bar{X}_{No})^2}{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$

where:

$\bar{X}_{Yes}$ = mean score in the hypertension group

$\bar{X}_{No}$ = mean score in the non-hypertension group

$n_1$ = total number of people in the hypertension group

$n_2$ = total number of people in the non-hypertension group

$s$ = standard deviation of the scores of both groups combined
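The trend statistic can be sketched for the hypertension by SEIL data with scores 1, 2 and 3. This is an illustrative plain-Python example with hypothetical function names. One assumption is made that the slides do not settle: the overall variance of the scores is computed with the N - 1 divisor (sample variance); using N instead changes the result only slightly.

```python
scores = [1, 2, 3]             # Low, Regular, High
hypertension = [18, 54, 78]    # counts per SEIL level, n1 = 150
no_hypertension = [100, 112, 98]  # counts per SEIL level, n2 = 310

def mean_score(counts, scores):
    """Mean score of a group given the number of people at each score."""
    return sum(c * s for c, s in zip(counts, scores)) / sum(counts)

def trend_chi_squared(yes, no, scores):
    """Chi-squared for trend: squared difference in mean scores over its variance."""
    n1, n2 = sum(yes), sum(no)
    n = n1 + n2
    totals = [y + x for y, x in zip(yes, no)]
    overall_mean = mean_score(totals, scores)
    # Assumption: overall sample variance of the scores, divisor n - 1.
    s2 = sum(t * (s - overall_mean) ** 2 for t, s in zip(totals, scores)) / (n - 1)
    diff = mean_score(yes, scores) - mean_score(no, scores)
    return diff ** 2 / (s2 * (1 / n1 + 1 / n2))

x2_trend = trend_chi_squared(hypertension, no_hypertension, scores)
print(f"mean score (Yes) = {mean_score(hypertension, scores):.2f}")
print(f"mean score (No)  = {mean_score(no_hypertension, scores):.2f}")
print(f"X2 for trend     = {x2_trend:.2f} on 1 degree of freedom")
```

The scores 1, 2, 3 assume equal spacing between SEIL levels; other scores could be used if the spacing were known to be different.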

Validity of Chi-squared tests

The chi-squared tests that we have reviewed are based on the assumption that the test statistic approximately follows the X2 distribution.

This is reasonable for large samples; for small samples, use the following guidelines.

For 2 x 2 tables:

If the total sample size is greater than 40, X2 can be used.

If n is between 20 and 40 and the smallest expected value is at least 5, X2 can be used.

Otherwise, use Fisher's exact test.

For 2 x c tables:

The X2 test is valid if no more than 20% of the expected values are less than 5 and none is less than 1.
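These guidelines can be written as a small check on the table of expected frequencies. This is a minimal sketch in plain Python, not a definitive rule set: the function name `chi_squared_is_valid` is hypothetical, "the smallest expected value is 5" is read as "at least 5", and the 2 x c rule is applied to any table that is not 2 x 2, which is an assumption beyond the text.

```python
def chi_squared_is_valid(expected, n):
    """Return True if the X2 approximation is acceptable under the guidelines.

    expected: table (list of rows) of expected frequencies
    n: total sample size
    """
    cells = [e for row in expected for e in row]
    if len(expected) == 2 and len(expected[0]) == 2:
        if n > 40:
            return True
        if 20 <= n <= 40:
            return min(cells) >= 5   # assumption: "is 5" read as "at least 5"
        return False                 # otherwise use Fisher's exact test
    # Larger tables: at most 20% of expected values below 5, none below 1.
    share_small = sum(1 for e in cells if e < 5) / len(cells)
    return share_small <= 0.20 and min(cells) >= 1

# Clarithromycin vs penicillin example: n = 200.
print(chi_squared_is_valid([[86.5, 13.5], [86.5, 13.5]], 200))

# Hypertension by SEIL example: n = 460.
print(chi_squared_is_valid([[38.5, 54.1, 57.4], [79.5, 111.9, 118.6]], 460))

# A small hypothetical 2 x 2 table with n = 30 and an expected value of 3.
print(chi_squared_is_valid([[3.0, 7.0], [7.0, 13.0]], 30))
```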

