4. Categorical Data Analysis 2014 - UMass


  • PubHlth 640 - Spring 2014 4. Categorical Data Analysis Page 1 of 85

    Unit 4

    Categorical Data Analysis

    "Don't ask what it means, but rather how it is used"

    - L. Wittgenstein

    Is frequency of exercise associated with better health? Is the proportion of adults who visit their doctor more than once a year significantly lower among frequent exercisers than among non-exercisers? Is alcohol associated with a higher risk of lung cancer? Is the apparent association a fluke because we have failed to account for the relationship between drinking and smoking? Is greater exposure to asbestos associated with the development of pleural plaques? Is more exposure associated with more pleural plaques?

    Units 3 (Discrete Distributions) and 4 (Categorical Data Analysis) pertain to questions such as these and are an introduction to the analysis of count data that can be represented in a contingency table (a two-way cross-tabulation of the counts of individuals with each profile of traits; e.g., non-drinker and lung cancer).

    Data that are counts are categorical data. A categorical variable is measured on a scale that is nominal (e.g., religion) or ordinal (e.g., diagnosis coded as benign, suspicious, or malignant). An example of a two-way cross-tabulation of categorical data is a cross-tabulation of frequency of visits to the doctor (1 = less often than every five years, 2 = annually, and 3 = every six months) by diagnosis, coded as above. A categorical data analysis of these data might explore the nature and significance, if any, of the association between the two variables. Thus, there are many uses for categorical data analyses, especially in epidemiology and public health.

    Unit 4 (Categorical Data Analysis) is an introduction to some basic methods for the analysis of categorical data: (1) association in a 2x2 table; (2) variation of a 2x2 table association, depending on the level of another variable; and (3) trend in outcome in a contingency table.

    Tip - These methods require minimal assumptions for their validity and, in particular, do not assume a regression model. In contrast to regression approaches, they have the added advantage of giving us a much closer look at the data than is generally afforded by regression techniques.

    Tip - Always precede a logistic regression analysis with contingency table analyses.


    Table of Contents

    Topics

    1. Learning Objectives
    2. Examples of Categorical Data
    3. Hypotheses of Independence, No Association, Homogeneity
    4. The Chi Square Test of No Association in an RxC Table
    5. Rejection of Independence: The Chi Square Residual
    6. Confidence Interval Estimation of RR and OR
    7. Strategies for Controlling Confounding
    8. Multiple 2x2 Tables - Stratified Analysis of Rates
       A. Woolf Test of Homogeneity of Odds Ratios
       B. Breslow-Day-Tarone Test of Homogeneity of Odds Ratios
       C. Mantel-Haenszel Test of No Association
    9. The R x C Table Test for (Linear) Trend
    10. Factors Associated with Mammography Screening
    11. The Chi Square Goodness-of-Fit Test

    Appendices
       A. The Chi Square Distribution
       B. Probability Models for the 2x2 Table
       C. Concepts of Observed and Expected
       D. Review: Measures of Association in a 2x2 Table
       E. Review: Confounding of Rates


    1. Learning Objectives

    When you have finished this unit, you should be able to:

    Perform and interpret the chi square test of association in a single 2x2 table.

    Define and distinguish between exposure-outcome associations that are confounded versus effect modified.

    Perform and interpret an analysis of stratified 2x2 tables, using Mantel-Haenszel methods.

    Perform and interpret the test of trend for RxC tables of counts of ordinal data that are suitable for explorations of dose-response.

    Perform and interpret a chi square goodness-of-fit (GOF) test.

    Note - Currently, this unit does not discuss matched pairs or matched data.


    2. Examples of Categorical Data

    Source: Fisher LD and Van Belle G. Biostatistics: A Methodology for the Health Sciences. New York: John Wiley, 1993, page 235, problem #14.

    Is there a relationship between coffee consumption and cardiovascular risk? What about the observation that many coffee drinkers are also smokers and smoking is itself a risk factor for heart disease? Suppose we wish to estimate the nature and strength of a relationship between coffee and myocardial infarction (MI), independent of the role of smoking. We can do this by looking at coffee-heart disease data separately within groups (strata) of non-smokers, smokers, etc.

    Consider the following bar graph summaries that compare low coffee drinkers (left bar) with high coffee drinkers (right bar) with respect to the proportion suffering an MI. The comparison is made for each of several categories of smokers (each row).

    [Figure: bar graphs of the proportion with MI (variable micase) by coffee consumption (0 = fewer than 5 cups/day, 1 = 5 or more cups/day), one panel per smoking category. Panels shown, with the vertical-axis label printed on each: Never Smoked (.184211), Former Smoker (.28), some rows omitted, 45+ cigarettes/day (.666667).]

    Among never smokers, the data suggest a positive coffee-MI relationship. Among former smokers, the coffee-MI association is less strong. Among frequent smokers, there is no longer evidence of a coffee-MI association.


    In Unit 3 (Discrete Distributions) we learned some probability distributions for discrete data: Binomial, Poisson, and Hypergeometric. These probability distributions are often used to model the chances (the likelihood, which we abbreviate as L) of obtaining the observations that we have in our data. Here are some examples.

    Example - Binomial for One Group Count of Events of Success

    Does minoxidil show promise for the treatment of hair loss?

    N=13 volunteers

    Administer minoxidil

    Wait 6 months

    Count occurrences of new hair growth. Call this X.

    Suppose we observe X=12. Possible values of X = count of occurrences of new hair growth are 0, 1, 2, ..., 13. Thus,

    IF: (1) $\pi$ = probability[new hair growth] for all 13 volunteers, and (2) the outcomes for each of the 13 volunteers are independent

    THEN: X is distributed Binomial (N=13, $\pi$)

    The likelihood (chances) L of the outcomes in the one group intervention study design data is modeled as a binomial probability, with $\pi$ the probability of new hair growth:

    $$L_X(x) = \Pr[X=x] = \binom{13}{x}\,\pi^{x}\,(1-\pi)^{13-x}$$

    Example - The probability of X=12 events of new hair growth in N=13 trials (study participants) is

    $$\Pr[X=12] = \binom{13}{12}\,\pi^{12}\,(1-\pi)^{1}$$
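    As a rough illustration of the one-group binomial likelihood, the sketch below evaluates Pr[X=12] for N=13 at a few candidate values of the success probability. The function name `binomial_likelihood` and the candidate values 0.5 and 0.8 are assumptions made here for illustration only; they do not come from the minoxidil example. The third value is simply the observed proportion 12/13.

```python
from math import comb


def binomial_likelihood(x: int, n: int, pi: float) -> float:
    """Return Pr[X = x] when X is distributed Binomial(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


N = 13           # number of volunteers in the example
X_OBSERVED = 12  # observed count of new hair growth

# Candidate values of pi are hypothetical, chosen only to show how the
# likelihood of the same observed data changes with pi.
for pi in (0.5, 0.8, X_OBSERVED / N):
    likelihood = binomial_likelihood(X_OBSERVED, N, pi)
    print(f"pi = {pi:.4f}  ->  L(12) = {likelihood:.4f}")
```

    Of the three values tried, the likelihood is largest at the observed proportion 12/13, which is one way to see why the sample proportion is the natural estimate of the success probability.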

    Example - The Product of 2 Binomials is used for the 2 Independent Counts in a Cohort Trial

    In a randomized controlled trial, is minoxidil better than standard care for the treatment of hair loss?

    Consent of N=30 volunteers

    Randomization

    Standard Care (N1 = 17): Administer standard care. Wait 6 months. Count occurrences of new hair growth. Call this X1.

    Minoxidil (N2 = 13): Administer minoxidil. Wait 6 months. Count occurrences of new hair growth. Call this X2.

    This design produces a 2x2 table array of count data that is correctly modeled using the product of two binomial distributions.

                     New Growth    Not New Growth    Total
    Minoxidil        X2 = 12       N2 - X2 = 1       N2 = 13
    Standard care    X1 = 6        N1 - X1 = 11      N1 = 17

    IF:
    (1) $\pi_1$ = probability[new hair growth] on standard care
    (2) $\pi_2$ = probability[new hair growth] on minoxidil
    (3) The outcomes for all 30 trial participants are independent

    THEN:
    (1) X1 is distributed Binomial (N1 = 17, $\pi_1$)
    (2) X2 is distributed Binomial (N2 = 13, $\pi_2$)

    The likelihood (chances) L of the outcomes in the two group cohort study design data is modeled as the product of 2 binomial probabilities:

    $$L_{X_1,X_2}(x_1,x_2) = \Pr[X_1=x_1 \text{ and } X_2=x_2] = \binom{17}{x_1}\,\pi_1^{x_1}\,(1-\pi_1)^{17-x_1} \times \binom{13}{x_2}\,\pi_2^{x_2}\,(1-\pi_2)^{13-x_2}$$
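    The two-group likelihood can be sketched numerically in the same way. The code below multiplies two binomial probabilities for the observed counts X1=6 of N1=17 (standard care) and X2=12 of N2=13 (minoxidil). The function names and the choice of comparison values are assumptions made here for illustration: one evaluation uses the two observed proportions, the other uses a single pooled proportion 18/30 for both groups.

```python
from math import comb


def binomial_pmf(x: int, n: int, pi: float) -> float:
    """Return Pr[X = x] when X is distributed Binomial(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


N1, N2 = 17, 13  # group sizes: standard care, minoxidil
X1, X2 = 6, 12   # observed counts of new hair growth


def two_group_likelihood(x1, x2, pi1, pi2, n1=N1, n2=N2):
    """Product of two independent binomial probabilities."""
    return binomial_pmf(x1, n1, pi1) * binomial_pmf(x2, n2, pi2)


# Illustrative comparison (an assumption of this sketch, not part of the
# original example): separate observed proportions vs one pooled proportion.
separate = two_group_likelihood(X1, X2, X1 / N1, X2 / N2)
pooled_pi = (X1 + X2) / (N1 + N2)
pooled = two_group_likelihood(X1, X2, pooled_pi, pooled_pi)

print(f"separate proportions: L = {separate:.6f}")
print(f"pooled proportion:    L = {pooled:.6f}")
```

    The observed data are more likely when each group is allowed its own success probability than when both groups share the pooled value, which hints at the comparison that a test of no association formalizes.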