Categorical Data Analysis

References:

Agresti, A., Categorical Data Analysis, Wiley-Interscience, New Jersey, 2002

Bhattacharyya, G.K., Johnson, R.A., Statistical Concepts and Methods, Wiley, 1977

Outline

• Categorical Response Data
• Distributions for Categorical Data
• Pearson's Test for Goodness of Fit
• Contingency Tables
• Test of Homogeneity and Exact Test

Categorical Response Data

• A categorical variable has a measurement scale consisting of a set of categories. For instance:
– political philosophy is often measured as liberal, moderate, or conservative
– religious affiliation with the categories Protestant, Catholic, Muslim, Hindu, Buddhist, etc.

Nominal – Ordinal Scale Distinction

• Categorical variables have two primary types of scales.
– Nominal: variables having categories without natural ordering. Examples:
  • Mode of transportation to work: automobile, bicycle, bus, walk
  • Favorite type of music: jazz, classical, rock, pop, dangdut, keroncong
– Ordinal: many categorical variables do have ordered categories. Examples:
  • Size of automobile: subcompact, compact, midsize, large
  • Social class: upper, middle, lower
  • Political philosophy: liberal, moderate, conservative

• An interval variable is one that does have numerical distances between any two values.
– For example, blood pressure level, functional life length of a TV set, length of prison term, and annual income are interval variables.

• The way that a variable is measured determines its classification.
• For example, "education" is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor's, master's, and doctorate. It is interval when measured by number of years of education, using the integers 0, 1, 2, ...

A variable's measurement scale determines which statistical methods are appropriate. The measurement hierarchy from high to low:

Interval – Ordinal – Nominal

Methods for ordinal variables cannot be used with nominal variables, since their categories have no meaningful ordering. It is usually best to apply methods appropriate for the actual scale.

Data Type

• Quantitative (Numerical)
  – Discrete
  – Continuous
• Qualitative (Categorical)
  – Discrete

Quantitative vs. Qualitative

Quantitative Data

Variables recorded in numbers that we use as numbers are called quantitative. Quantitative variables have measurement units.

Examples:
♣ Incomes, Heights
♣ Weights, Ages and Counts

Qualitative Data

The numbers here are just labels and their values are arbitrary. They represent categories of the variables. We call such variables categorical.

Examples:
♣ Sex
♣ Area Code
♣ Production group in a certain location

Discrete vs. Continuous

Discrete Data

The data are integers, and they usually come from a counting process.

Examples:
♣ Number of employees
♣ Number of rejected lots

Continuous Data

The data are usually on an interval scale. They are measurement data.

Examples:
♣ Temperature
♣ Heights, Weights

Discrete Data

Nominal

The rank of the data is not important.

Example: Production Group
♣ 1 → Group A
♣ 2 → Group B
♣ 3 → Group C

Ordinal

The rank of the data is meaningful.

Example: Frequency of smoking
♣ 1 → very often
♣ 2 → often
♣ 3 → rare
♣ 4 → never

Distributions for Categorical Data

Binomial Distribution

Let y1, y2, ..., yn denote responses for n independent and identical trials such that

P(Yi = 1) = π and P(Yi = 0) = 1 − π

Identical trials means that the probability of "success", π, is the same for each trial. Independent trials means that the {Yi} are independent random variables. These are often called Bernoulli trials. The total number of successes, Y = ∑i Yi, has the binomial distribution with index n and parameter π, denoted by bin(n, π).

The probability mass function for the possible outcome y for Y is

$$p(y) = \binom{n}{y} \pi^{y} (1 - \pi)^{n - y}, \quad y = 0, 1, 2, \ldots, n$$

The binomial distribution for Y = ∑i Yi has mean and variance

$$\mu = E(Y) = n\pi, \quad \sigma^2 = \mathrm{var}(Y) = n\pi(1 - \pi)$$

There is no guarantee that successive binary observations are independent or identical.
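
As a quick numerical check of the binomial formulas, the following minimal sketch evaluates the probability mass function of bin(n, π) and confirms that the mean and variance computed from it agree with nπ and nπ(1 − π). The function name `binomial_pmf` and the values n = 10, π = 0.3 are illustrative assumptions, not taken from the slides.

```python
from math import comb


def binomial_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ bin(n, pi): C(n, y) * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


# Illustrative (assumed) index and success probability.
n, pi = 10, 0.3

# Probabilities for every possible outcome y = 0, 1, ..., n.
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]

# Mean and variance computed directly from the mass function.
mean = sum(y * p for y, p in enumerate(pmf))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

print(f"sum of pmf = {sum(pmf):.6f}")
print(f"mean       = {mean:.6f}  (n*pi = {n * pi:.6f})")
print(f"variance   = {variance:.6f}  (n*pi*(1-pi) = {n * pi * (1 - pi):.6f})")
```

The check only verifies the formulas; it says nothing about whether real binary observations satisfy the independence and identical-trial assumptions.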

Multinomial Distribution

Some trials have more than two possible outcomes. Suppose that each of n independent, identical trials can have outcome in any of c categories. Let

$$y_{ij} = \begin{cases} 1 & \text{if trial } i \text{ has outcome in category } j \\ 0 & \text{otherwise} \end{cases}$$

Then $y_i = (y_{i1}, y_{i2}, \ldots, y_{ic})$ represents a multinomial trial, with $\sum_j y_{ij} = 1$.

Let nj = ∑i yij denote the number of trials having outcome in category j. The counts (n1, n2, ..., nc) have the multinomial distribution.

Let πj = P(Yij = 1) denote the probability of outcome in category j for each trial. Since ∑j nj = n, the count nc is determined by the other c − 1 counts. The multinomial probability mass function is

$$p(n_1, n_2, \ldots, n_{c-1}) = \left( \frac{n!}{n_1! \, n_2! \cdots n_c!} \right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c}$$

with

$$E(n_j) = n\pi_j, \quad \mathrm{var}(n_j) = n\pi_j(1 - \pi_j)$$
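
The multinomial mass function can be evaluated directly from its definition. The sketch below is one way to do so with the Python standard library; the function name `multinomial_pmf` and the example counts and probabilities are assumptions chosen for illustration.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of the counts (n_1, ..., n_c) given category
    probabilities (pi_1, ..., pi_c), with n = n_1 + ... + n_c."""
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! n_2! ... n_c!), kept in exact integers.
    coef = factorial(n)
    for nj in counts:
        coef //= factorial(nj)
    return coef * prod(p**nj for p, nj in zip(probs, counts))


# Illustrative (assumed) example: n = 4 trials, c = 3 categories.
probs = (0.5, 0.3, 0.2)
counts = (2, 1, 1)
p = multinomial_pmf(counts, probs)  # 12 * 0.5^2 * 0.3 * 0.2 = 0.18
print(f"p{counts} = {p:.4f}")
```

With c = 2 categories the same function reduces to the binomial mass function, which is a convenient sanity check.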

Poisson Distribution

Sometimes, count data do not result from a fixed number of trials. There is no upper limit n for y. Since y must be a nonnegative integer, its distribution should place its mass on that range. The simplest such distribution is the Poisson.

The Poisson mass function is

$$p(y) = \frac{e^{-\mu} \mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots$$

with

$$E(Y) = \mathrm{var}(Y) = \mu$$

The distribution approaches normality as μ increases.
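
The equality of the Poisson mean and variance can be checked numerically. The following sketch evaluates the mass function on a truncated range of y; the function name `poisson_pmf`, the value μ = 2.5, and the truncation point 60 are assumptions made for illustration (the neglected tail is negligible for this μ).

```python
from math import exp, factorial


def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson variable with mean mu: e^(-mu) * mu^y / y!."""
    return exp(-mu) * mu**y / factorial(y)


# Illustrative (assumed) mean; the support is truncated at y = 59.
mu = 2.5
support = range(60)
pmf = [poisson_pmf(y, mu) for y in support]

# Mean and variance computed from the (truncated) mass function.
mean = sum(y * p for y, p in zip(support, pmf))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, pmf))

print(f"sum of pmf = {sum(pmf):.6f}")
print(f"mean = {mean:.6f}, variance = {variance:.6f}, mu = {mu}")
```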

Pearson's Test for GoF

Null hypothesis: H0 : p1 = p10, …, pk = pk0

The Pearson X² test statistic:

$$X^2 = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}} = \sum_{\text{cells}} \frac{(O - E)^2}{E}$$

where O = n_i is the observed count in cell i and E = n p_{i0} is the count expected under H0.

Distribution: X² is asymptotically chi-squared with df = k − 1

Rejection region: X² ≥ χ²α, where χ²α is the upper α point of the χ² distribution with df = k − 1
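
A minimal sketch of the goodness-of-fit computation, assuming hypothetical observed counts and null probabilities (none of these numbers come from the slides). The function name `pearson_gof` is also an assumption. The critical value 5.991 is the tabulated upper 0.05 point of the χ² distribution with 2 degrees of freedom.

```python
def pearson_gof(observed, null_probs):
    """Return (X2, df) for Pearson's goodness-of-fit test.

    observed   -- cell counts n_1, ..., n_k
    null_probs -- hypothesized cell probabilities p_10, ..., p_k0
    """
    n = sum(observed)
    expected = [n * p0 for p0 in null_probs]  # E_i = n * p_i0
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return x2, df


# Hypothetical data: n = 100 observations in k = 3 cells.
observed = [18, 55, 27]
null_probs = [0.25, 0.50, 0.25]

x2, df = pearson_gof(observed, null_probs)

# Tabulated upper 0.05 point of chi-squared with df = 2.
critical_value = 5.991
reject = x2 >= critical_value

print(f"X2 = {x2:.2f}, df = {df}, reject H0 at alpha = 0.05: {reject}")
```

Here the expected counts are 25, 50, 25, so X² = 49/25 + 25/50 + 4/25 = 2.62, which is below the critical value, and H0 is not rejected.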

Contingency Table

Observed counts:

                B1     B2     …     Bc     Row Total
A1              n11    n12    …     n1c    n10
A2              n21    n22    …     n2c    n20
…               …      …      …     …      …
Ar              nr1    nr2    …     nrc    nr0
Column Total    n01    n02    …     n0c    n

Cell probabilities:

                B1     B2     …     Bc     Row Total
A1              p11    p12    …     p1c    p10
A2              p21    p22    …     p2c    p20
…               …      …      …     …      …
Ar              pr1    pr2    …     prc    pr0
Column Total    p01    p02    …     p0c    1

where

• $p_{ij} = P(A_i B_j)$ is the probability of the joint occurrence of Ai and Bj
• $p_{i0} = P(A_i)$ is the total probability in the ith row
• $p_{0j} = P(B_j)$ is the total probability in the jth column

The null hypothesis of independence:

$$H_0 : p_{ij} = p_{i0} \, p_{0j} \quad \text{for all cells } (i, j)$$

Estimation:

$$\hat{p}_{i0} = \frac{n_{i0}}{n}, \quad \hat{p}_{0j} = \frac{n_{0j}}{n}, \quad \hat{p}_{ij} = \hat{p}_{i0} \, \hat{p}_{0j} = \frac{n_{i0} \, n_{0j}}{n^2}$$

Expectation:

$$E_{ij} = n \hat{p}_{ij} = \frac{n_{i0} \, n_{0j}}{n}$$

The test statistic then becomes:

$$X^2 = \sum_{\text{all } rc \text{ cells}} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$

which has an approximate χ² distribution with d.f. = (r − 1)(c − 1).
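
The independence test can be sketched in a few lines: expected counts come from the row and column totals, and the statistic sums over all r × c cells. The function name `chi_square_independence` and the 2 × 2 table below are hypothetical, chosen so the arithmetic is easy to follow. The critical value 3.841 is the tabulated upper 0.05 point of the χ² distribution with 1 degree of freedom.

```python
def chi_square_independence(table):
    """Return (X2, df, expected) for an r x c table of counts n_ij."""
    row_totals = [sum(row) for row in table]        # n_i0
    col_totals = [sum(col) for col in zip(*table)]  # n_0j
    n = sum(row_totals)

    # E_ij = n_i0 * n_0j / n under the null hypothesis of independence.
    expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]

    x2 = sum(
        (nij - eij) ** 2 / eij
        for row, erow in zip(table, expected)
        for nij, eij in zip(row, erow)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected


# Hypothetical 2 x 2 table (rows A1, A2; columns B1, B2).
table = [[20, 30],
         [30, 20]]

x2, df, expected = chi_square_independence(table)

# Tabulated upper 0.05 point of chi-squared with df = 1.
critical_value = 3.841
reject = x2 >= critical_value

print(f"X2 = {x2:.2f}, df = {df}, reject H0 at alpha = 0.05: {reject}")
```

Every expected count here is 50 · 50 / 100 = 25, so each cell contributes 25/25 = 1 and X² = 4, which exceeds 3.841.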

Test of Homogeneity

The χ² test of independence is based on the sampling scheme in which a single random sample of size n is class