Categorical Data Analysis - PCU Teaching


  • Categorical Data Analysis

    References:

    Agresti, A., Categorical Data Analysis, Wiley-Interscience, New Jersey, 2002.

    Bhattacharyya, G. K., and Johnson, R. A., Statistical Concepts and Methods, Wiley, 1977.

  • Outline

    – Categorical Response Data

    – Distributions for Categorical Data

    – Pearson's Test for Goodness of Fit

    – Contingency Tables

    – Test of Homogeneity and Exact Test

  • Categorical Response Data

    • A categorical variable has a measurement scale consisting of a set of categories. For instance:

    – political philosophy is often measured as liberal, moderate, or conservative;

    – religious affiliation is measured with the categories Protestant, Catholic, Muslim, Hindu, Buddhist, etc.

  • Nominal – Ordinal Scale Distinction

    • Categorical variables have two primary types of scales.

    – Nominal: variables having categories without a natural ordering. Examples:

      • Mode of transportation to work: automobile, bicycle, bus, walk

      • Favorite type of music: jazz, classical, rock, pop, dangdut, keroncong

    – Ordinal: many categorical variables do have ordered categories. Examples:

      • Size of automobile: subcompact, compact, midsize, large

      • Social class: upper, middle, lower

      • Political philosophy: liberal, moderate, conservative

  • Nominal – Ordinal Scale Distinction

    • An interval variable is one that does have numerical distances between any two values.

    – For example, blood pressure level, functional life length of a TV set, length of prison term, and annual income are interval variables.

  • Nominal – Ordinal Scale Distinction

    • The way that a variable is measured determines its classification.

    • For example, "education" is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor's, master's, and doctorate. It is interval when measured by number of years of education, using the integers 0, 1, 2, ...

  • Nominal – Ordinal Scale Distinction

    • A variable's measurement scale determines which statistical methods are appropriate.

    • The measurement hierarchy from high to low: Interval – Ordinal – Nominal.

    • Methods for ordinal variables cannot be used with nominal variables, since their categories have no meaningful ordering.

    • It is usually best to apply methods appropriate for the actual scale.

  • Data Type

    – Quantitative (Numerical): Discrete or Continuous

    – Qualitative (Categorical): Discrete

  • Quantitative vs. Qualitative

    Quantitative Data

    – Variables recorded in numbers that we use as numbers are called quantitative.

    – Examples: incomes, heights, weights, ages, and counts.

    – Quantitative variables have measurement units.

    Qualitative Data

    – The numbers here are just labels and their values are arbitrary. They represent categories of the variables. We call such variables categorical.

    – Examples: sex, area code, production group in a certain location.

  • Discrete vs. Continuous

    Discrete Data

    – The data are integers, and usually they come from a counting process.

    – Examples: number of employees, number of rejected lots.

    Continuous Data

    – The data are usually on an interval scale. They are measurement data.

    – Examples: temperature, heights, weights.

  • Discrete Data

    Nominal

    – The rank of the data is not important.

    – Example: production group, coded 1 → Group A, 2 → Group B, 3 → Group C.

    Ordinal

    – The rank of the data is meaningful.

    – Example: frequency of smoking, coded 1 → very often, 2 → often, 3 → rare, 4 → never.
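The practical consequence of the nominal/ordinal split can be sketched in a few lines of Python. The codings are the ones on the slide; the two small data sets are hypothetical, made up only for illustration. For nominal codes only the most frequent category (the mode) is meaningful, while for ordinal codes the ordering also makes the median meaningful.

```python
from statistics import median_low, mode

# Codings taken from the slide.
group_labels = {1: "Group A", 2: "Group B", 3: "Group C"}
smoking_labels = {1: "very often", 2: "often", 3: "rare", 4: "never"}

# Hypothetical observations (not from the slides).
production_group = [1, 3, 3, 2, 3, 1]
smoking = [4, 2, 3, 4, 1, 3, 4]

# Nominal: the codes are arbitrary labels, so only frequencies make sense.
modal_group = group_labels[mode(production_group)]

# Ordinal: the codes carry an order, so a median category is meaningful.
# median_low always returns one of the observed codes.
median_smoking = smoking_labels[median_low(smoking)]

print("Most frequent production group:", modal_group)
print("Median smoking frequency:", median_smoking)
```

Averaging the production-group codes would give a number that changes if the groups were relabeled, which is why it is not computed here.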

  • Distributions for Categorical Data

    Binomial Distribution

    Let $y_1, y_2, \ldots, y_n$ denote responses for $n$ independent and identical trials such that

    $$P(Y_i = 1) = \pi \quad\text{and}\quad P(Y_i = 0) = 1 - \pi.$$

    Identical trials means that the probability of "success", $\pi$, is the same for each trial. Independent trials means that the $\{Y_i\}$ are independent random variables. These are often called Bernoulli trials.

    The total number of successes, $Y = \sum_{i=1}^{n} Y_i$, has the binomial distribution with index $n$ and parameter $\pi$, denoted by $\text{bin}(n, \pi)$.

  • Distributions for Categorical Data

    The probability mass function for the possible outcome $y$ for $Y$ is

    $$p(y) = \binom{n}{y}\, \pi^{y} (1-\pi)^{n-y}, \qquad y = 0, 1, 2, \ldots, n.$$

    The binomial distribution for $Y = \sum_i Y_i$ has mean and variance

    $$\mu = E(Y) = n\pi \quad\text{and}\quad \sigma^2 = \operatorname{var}(Y) = n\pi(1-\pi).$$

    There is no guarantee that successive binary observations are independent or identical.
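As a quick numerical check of the binomial mass function, mean nπ, and variance nπ(1 − π), here is a minimal Python sketch. The function name `binomial_pmf` and the values n = 10, π = 0.3 are illustrative assumptions, not part of the slides.

```python
from math import comb

def binomial_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ bin(n, pi): C(n, y) * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

n, pi = 10, 0.3  # illustrative values, not from the slides

# Probabilities of every possible outcome y = 0, 1, ..., n.
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]

total = sum(pmf)                                        # should be 1
mean = sum(y * p for y, p in enumerate(pmf))            # should be n*pi
variance = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))  # n*pi*(1-pi)

print(f"sum of pmf = {total:.6f}")
print(f"mean       = {mean:.6f}  (n*pi         = {n * pi:.6f})")
print(f"variance   = {variance:.6f}  (n*pi*(1-pi) = {n * pi * (1 - pi):.6f})")
```

The moments are computed directly from the mass function rather than by simulation, so they match the closed-form expressions up to floating-point rounding.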

  • Distributions for Categorical Data

    Multinomial Distribution

    Some trials have more than two possible outcomes. Suppose that each of $n$ independent, identical trials can have an outcome in any of $c$ categories.

    Let

    $$y_{ij} = \begin{cases} 1 & \text{if trial } i \text{ has outcome in category } j \\ 0 & \text{otherwise.} \end{cases}$$

    Then $y_i = (y_{i1}, y_{i2}, \ldots, y_{ic})$ represents a multinomial trial, with $\sum_j y_{ij} = 1$.

  • Distributions for Categorical Data

    Let $n_j = \sum_i y_{ij}$ denote the number of trials having outcome in category $j$. The counts $(n_1, n_2, \ldots, n_c)$ have the multinomial distribution.

    Let $\pi_j = P(Y_{ij} = 1)$ denote the probability of outcome in category $j$ for each trial. Since $n_c = n - (n_1 + \cdots + n_{c-1})$, the multinomial probability mass function is

    $$p(n_1, n_2, \ldots, n_{c-1}) = \left( \frac{n!}{n_1!\, n_2! \cdots n_c!} \right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c},$$

    with

    $$E(n_j) = n\pi_j, \qquad \operatorname{var}(n_j) = n\pi_j(1-\pi_j).$$
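The multinomial mass function n!/(n_1! ... n_c!) · π_1^{n_1} ... π_c^{n_c} and the moments E(n_j) = nπ_j, var(n_j) = nπ_j(1 − π_j) can be checked by brute-force enumeration for a small case. This is a sketch under assumed values (n = 4 trials, c = 3 categories, probabilities 0.2, 0.3, 0.5); the function name `multinomial_pmf` is hypothetical.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of the count vector (n_1, ..., n_c) with cell probabilities probs."""
    n = sum(counts)
    coef = factorial(n)
    for n_j in counts:
        coef //= factorial(n_j)  # stays an exact integer at every step
    return coef * prod(p**n_j for p, n_j in zip(probs, counts))

n = 4                      # illustrative number of trials
probs = (0.2, 0.3, 0.5)    # illustrative category probabilities (sum to 1)

# Every count vector (n1, n2, n3) with n1 + n2 + n3 = n.
outcomes = [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]

total = sum(multinomial_pmf(c, probs) for c in outcomes)
mean_n1 = sum(c[0] * multinomial_pmf(c, probs) for c in outcomes)
var_n1 = sum((c[0] - mean_n1) ** 2 * multinomial_pmf(c, probs) for c in outcomes)

print(f"sum of pmf = {total:.6f}")
print(f"E(n1)   = {mean_n1:.6f}  (n*pi1         = {n * probs[0]:.6f})")
print(f"var(n1) = {var_n1:.6f}  (n*pi1*(1-pi1) = {n * probs[0] * (1 - probs[0]):.6f})")
```

With c = 2 the same function reduces to the binomial mass function, since the count vector is then (y, n − y).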

  • Distributions for Categorical Data

    Poisson Distribution

    Sometimes, count data do not result from a fixed number of trials. There is no upper limit $n$ for $y$. Since $y$ must be a nonnegative integer, its distribution should place its mass on that range. The simplest such distribution is the Poisson.

    The Poisson mass function is

    $$p(y) = \frac{e^{-\mu} \mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$

    with

    $$E(Y) = \operatorname{var}(Y) = \mu.$$

    The distribution approaches normality as $\mu$ increases.
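The Poisson property that the mean and variance both equal μ can be checked numerically from the mass function e^{−μ} μ^y / y!. The sketch below assumes μ = 2.5 and truncates the infinite sum at y = 59, where the remaining tail probability is negligible for this μ; both choices are illustrative.

```python
from math import exp, factorial

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson variable with mean mu."""
    return exp(-mu) * mu**y / factorial(y)

mu = 2.5            # illustrative mean, not from the slides
ys = range(60)      # truncation point: the tail beyond 59 is negligible here

total = sum(poisson_pmf(y, mu) for y in ys)
mean = sum(y * poisson_pmf(y, mu) for y in ys)
variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in ys)

print(f"sum of pmf = {total:.6f}")
print(f"mean       = {mean:.6f}")
print(f"variance   = {variance:.6f}")
```

For a much larger μ the truncation point would have to be raised accordingly.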

  • Pearson's Test for GoF

    Null hypothesis: $H_0: p_1 = p_{10}, \ldots, p_k = p_{k0}$

    The Pearson $X^2$ test statistic:

    $$X^2 = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}} = \sum_{\text{cells}} \frac{(O - E)^2}{E}$$

    Distribution: $X^2$ is asymptotically chi-squared with df = $k - 1$.

    Rejection region: $X^2 \ge \chi^2_{\alpha}$, where $\chi^2_{\alpha}$ is the upper $\alpha$ point of the $\chi^2$ distribution with df = $k - 1$.
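A minimal Python sketch of the goodness-of-fit computation follows. The observed counts and null probabilities are hypothetical, chosen with k = 3 categories so that df = 2. That choice is deliberate: for df = 2 only, the chi-squared upper tail has the closed form P(X² ≥ x) = exp(−x/2), so the critical value and p-value need no statistical library. For other df a chi-squared table or a statistics package would be needed.

```python
from math import exp, log

def pearson_x2(observed, null_probs):
    """Pearson X^2 = sum over cells of (n_i - n*p_i0)^2 / (n*p_i0)."""
    n = sum(observed)
    expected = [n * p0 for p0 in null_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [30, 50, 20]           # hypothetical counts n_1, n_2, n_3
null_probs = [0.25, 0.50, 0.25]   # hypothetical null values p_10, p_20, p_30
alpha = 0.05

x2 = pearson_x2(observed, null_probs)   # expected counts are 25, 50, 25
df = len(observed) - 1                  # k - 1 = 2

# Closed forms below are valid ONLY for df = 2.
critical = -2 * log(alpha)   # upper alpha point of chi-squared(2)
p_value = exp(-x2 / 2)
reject = x2 >= critical

print(f"X^2 = {x2:.3f}, df = {df}")
print(f"critical value = {critical:.3f}, p-value = {p_value:.3f}")
print("reject H0" if reject else "do not reject H0")
```

Here X² = 2.0, which is below the 5% critical value of about 5.99, so these hypothetical counts are consistent with the null proportions.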

  • Contingency Table

    Observed counts:

                    B1     B2     …     Bc     Row Total
    A1              n11    n12    …     n1c    n10
    A2              n21    n22    …     n2c    n20
    …               …      …      …     …      …
    Ar              nr1    nr2    …     nrc    nr0
    Column Total    n01    n02    …     n0c    n

    Cell probabilities:

                    B1     B2     …     Bc     Row Total
    A1              p11    p12    …     p1c    p10
    A2              p21    p22    …     p2c    p20
    …               …      …      …     …      …
    Ar              pr1    pr2    …     prc    pr0
    Column Total    p01    p02    …     p0c    1

    – $p_{ij} = P(A_i B_j)$: probability of the joint occurrence of $A_i$ and $B_j$

    – $p_{i0} = P(A_i)$: total probability in the $i$th row

    – $p_{0j} = P(B_j)$: total probability in the $j$th column

  • Contingency Table

    The null hypothesis of independence:

    $$H_0: p_{ij} = p_{i0}\, p_{0j} \quad \text{for all cells } (i, j)$$

    Estimation:

    $$\hat p_{i0} = \frac{n_{i0}}{n}, \qquad \hat p_{0j} = \frac{n_{0j}}{n}, \qquad \hat p_{ij} = \hat p_{i0}\, \hat p_{0j} = \frac{n_{i0}\, n_{0j}}{n^2}$$

    Expectation:

    $$E_{ij} = n\, \hat p_{ij} = \frac{n_{i0}\, n_{0j}}{n}$$

    The test statistic then becomes:

    $$X^2 = \sum_{\text{all } rc \text{ cells}} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$

    which has an approximate $\chi^2$ distribution with d.f. = $(r-1)(c-1)$.
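The independence test can be sketched in plain Python: expected counts come from row total × column total / n, and X² sums (n_ij − E_ij)² / E_ij over all cells. The 2 × 3 table below is hypothetical, chosen so that d.f. = (2 − 1)(3 − 1) = 2, where the chi-squared upper tail is exactly exp(−x/2); that shortcut does not hold for other degrees of freedom. The function name `independence_x2` is an assumption for illustration.

```python
from math import exp

def independence_x2(table):
    """Return (X^2, df) for an r x c table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]          # n_i0
    col_totals = [sum(col) for col in zip(*table)]    # n_0j
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n  # E_ij = n_i0 * n_0j / n
            x2 += (n_ij - e_ij) ** 2 / e_ij
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

# Hypothetical 2 x 3 table of counts (rows A1, A2; columns B1, B2, B3).
table = [
    [20, 30, 50],
    [30, 30, 40],
]

x2, df = independence_x2(table)
p_value = exp(-x2 / 2)   # valid ONLY because df = 2 for this table

print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

A table whose rows are exactly proportional, such as [[10, 20], [20, 40]], gives X² = 0, since every observed count then equals its expected count.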

  • Test of Homogeneity

    The $\chi^2$ test of independence is based on the sampling scheme in which a single random sample of size $n$ is class