Categorical Data Analysis
References: Alan Agresti, Categorical Data Analysis, Wiley-Interscience, New Jersey, 2002
Bhattacharyya, G.K., Johnson, R.A., Statistical Concepts and Methods, Wiley, 1977
Outline
• Categorical Response Data
• Distributions for Categorical Data
• Pearson's Test for Goodness of Fit
• Contingency Tables
• Test of Homogeneity and Exact Test
Categorical Response Data
• A categorical variable has a measurement scale consisting of a set of categories. For instance:
– political philosophy is often measured as liberal, moderate, or conservative
– religious affiliation, with the categories Protestant, Catholic, Muslim, Hindu, Buddhist, etc.
Nominal – Ordinal Scale Distinction
• Categorical variables have two primary types of scales.
– Nominal: variables having categories without a natural ordering. Examples:
• Mode of transportation to work: automobile, bicycle, bus, walk
• Favorite type of music: jazz, classical, rock, pop, dangdut, keroncong
– Ordinal: variables whose categories are ordered. Examples:
• Size of automobile: subcompact, compact, midsize, large
• Social class: upper, middle, lower
• Political philosophy: liberal, moderate, conservative
• An interval variable is one that has numerical distances between any two values.
– For example, blood pressure level, functional life length of a TV set, length of prison term, and annual income are interval variables.
• The way that a variable is measured determines its classification.
• For example, "education" is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor's, master's, and doctorate; it is interval when measured by number of years of education, using the integers 0, 1, 2, ...
• A variable's measurement scale determines which statistical methods are appropriate.
• The measurement hierarchy, from high to low, is: Interval – Ordinal – Nominal
• Methods for ordinal variables cannot be used with nominal variables, since their categories have no meaningful ordering.
• It is usually best to apply methods appropriate for the actual scale.
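As a small illustration of why the scale matters, the Python sketch below computes a mode (valid for any categorical variable) and a median category (meaningful only when the categories are ordered). The response list is invented for illustration and does not come from the slides.

```python
from collections import Counter

# Ordered categories of an ordinal variable (order taken from the slides' example).
ORDER = ["liberal", "moderate", "conservative"]

# Hypothetical responses, made up for this sketch.
responses = ["moderate", "liberal", "conservative", "moderate",
             "liberal", "moderate", "conservative"]

# The mode needs no ordering, so it is valid for nominal and ordinal data alike.
mode_category = Counter(responses).most_common(1)[0][0]

# The median needs an ordering: sort by position in ORDER, then take the middle value.
ranked = sorted(responses, key=ORDER.index)
median_category = ranked[len(ranked) // 2]

print(mode_category, median_category)
```

For a nominal variable such as mode of transportation there is no defensible ORDER list, so only the mode would be meaningful.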
Data Type
• Quantitative (Numerical): discrete or continuous
• Qualitative (Categorical): discrete
Quantitative vs. Qualitative
Quantitative Data
• Variables recorded in numbers that we use as numbers are called quantitative.
• Quantitative variables have measurement units.
• Examples: ♣ Incomes, heights ♣ Weights, ages, and counts
Qualitative Data
• The numbers here are just labels and their values are arbitrary. They represent categories of the variables. We call such variables categorical.
• Examples: ♣ Sex ♣ Area code ♣ Production group in a certain location
Discrete vs. Continuous
Discrete Data
• The data are integers and usually come from a counting process.
• Examples: ♣ Number of employees ♣ Number of rejected lots
Continuous Data
• The data are usually on an interval scale. They are measurement data.
• Examples: ♣ Temperature ♣ Heights, weights
Discrete Data
Nominal
• The rank of the data is not important.
• Example: production group
♣ 1 → Group A ♣ 2 → Group B ♣ 3 → Group C
Ordinal
• The rank of the data is meaningful.
• Example: frequency of smoking
♣ 1 → very often ♣ 2 → often ♣ 3 → rare ♣ 4 → never
Distributions for Categorical Data
Binomial Distribution
Let y1, y2, ..., yn denote responses for n independent and identical trials such that
P(Yi = 1) = π and P(Yi = 0) = 1 − π
Identical trials means that the probability of "success", π, is the same for each trial. Independent trials means that the {Yi} are independent random variables. These are often called Bernoulli trials. The total number of successes, Y = ∑i Yi, has the binomial distribution with index n and parameter π, denoted by bin(n, π).
The probability mass function for the possible outcome y for Y is
\[ p(y) = \binom{n}{y} \pi^{y} (1-\pi)^{n-y}, \quad y = 0, 1, 2, \ldots, n \]
The binomial distribution for Y = ∑i Yi has mean and variance
\[ \mu = E(Y) = n\pi \quad \text{and} \quad \sigma^{2} = \operatorname{var}(Y) = n\pi(1-\pi) \]
There is no guarantee that successive binary observations are independent or identical.
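The binomial formulas can be checked numerically. The following is a minimal Python sketch using only the standard library; the values n = 10 and π = 0.3 are arbitrary illustrative choices, and the function name binomial_pmf is introduced here, not taken from the slides.

```python
from math import comb

def binomial_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ bin(n, pi): C(n, y) * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

n_trials, success_prob = 10, 0.3  # illustrative values only

# Evaluate the mass function on its full support y = 0, 1, ..., n.
binom_probs = [binomial_pmf(y, n_trials, success_prob) for y in range(n_trials + 1)]

# Mean and variance computed directly from the mass function.
binom_mean = sum(y * p for y, p in enumerate(binom_probs))
binom_var = sum((y - binom_mean) ** 2 * p for y, p in enumerate(binom_probs))

# These should agree with n*pi and n*pi*(1 - pi) up to floating-point error.
print(binom_mean, n_trials * success_prob)
print(binom_var, n_trials * success_prob * (1 - success_prob))
```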
Multinomial Distribution
Some trials have more than two possible outcomes. Suppose that each of n independent, identical trials can have outcome in any of c categories. Let
\[ y_{ij} = \begin{cases} 1 & \text{if trial } i \text{ has outcome in category } j \\ 0 & \text{otherwise} \end{cases} \]
Then
\[ y_i = (y_{i1}, y_{i2}, \ldots, y_{ic}) \]
represents a multinomial trial, with ∑j yij = 1.
Let nj = ∑i Yij denote the number of trials having outcome in category j. The counts (n1, n2, ..., nc) have the multinomial distribution.
Let πj = P(Yij = 1) denote the probability of outcome in category j for each trial. The multinomial probability mass function is
\[ p(n_1, n_2, \ldots, n_{c-1}) = \left( \frac{n!}{n_1!\, n_2! \cdots n_c!} \right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c} \]
with
\[ E(n_j) = n\pi_j, \quad \operatorname{var}(n_j) = n\pi_j(1-\pi_j) \]
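A short Python sketch of the multinomial mass function follows, using only the standard library. The counts and probabilities in the example are invented for illustration, and multinomial_pmf is a name introduced for this sketch.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Multinomial probability of the cell counts (n_1, ..., n_c)
    given category probabilities (pi_1, ..., pi_c)."""
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! n_2! ... n_c!), kept in exact integers.
    coefficient = factorial(n)
    for n_j in counts:
        coefficient //= factorial(n_j)
    return coefficient * prod(p**n_j for p, n_j in zip(probs, counts))

# Illustrative values: n = 4 trials over c = 3 categories.
example_prob = multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2))
print(example_prob)
```

With c = 2 the function reduces to the binomial mass function, which gives a quick consistency check.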
Poisson Distribution
Sometimes, count data do not result from a fixed number of trials. There is no upper limit n for y. Since y must be a nonnegative integer, its distribution should place its mass on that range. The simplest such distribution is the Poisson.
The Poisson mass function is
\[ p(y) = \frac{e^{-\mu} \mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots \]
with
\[ E(Y) = \operatorname{var}(Y) = \mu \]
The distribution approaches normality as μ increases.
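The equality of the Poisson mean and variance can be verified numerically. This is a minimal Python sketch; μ = 4 is an arbitrary choice, and the infinite support is truncated at y = 59, an assumption that is harmless here because the omitted tail mass is negligible for this μ.

```python
from math import exp, factorial

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson variable with mean mu: e^(-mu) * mu^y / y!."""
    return exp(-mu) * mu**y / factorial(y)

poisson_mu = 4.0  # illustrative value only

# Truncate the support at 59; the remaining tail is far below floating-point precision.
poisson_probs = [poisson_pmf(y, poisson_mu) for y in range(60)]

poisson_mean = sum(y * p for y, p in enumerate(poisson_probs))
poisson_var = sum((y - poisson_mean) ** 2 * p for y, p in enumerate(poisson_probs))

# Both values should be close to mu.
print(poisson_mean, poisson_var)
```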
Pearson's Test for Goodness of Fit
Null hypothesis: H0: p1 = p10, ..., pk = pk0
The Pearson X2 test statistic:
\[ X^{2} = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^{2}}{n p_{i0}} = \sum_{\text{cells}} \frac{(O - E)^{2}}{E} \]
where ni is the observed count in cell i, n is the total count, and O and E denote the observed and expected cell counts.
Distribution: X2 is asymptotically chi-squared with df = k − 1.
Rejection region: X2 ≥ χ2α, where χ2α is the upper α point of the χ2 distribution with df = k − 1.
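The test can be sketched in a few lines of Python. The die-roll counts below are invented for illustration, the function name pearson_chi_square is introduced here, and the critical value 11.07 (upper 5% point of χ2 with 5 df) is taken from a standard chi-squared table rather than computed.

```python
def pearson_chi_square(observed, null_probs):
    """Return (X2, df) for Pearson's goodness-of-fit test.

    observed   : cell counts n_1, ..., n_k
    null_probs : hypothesized cell probabilities p_10, ..., p_k0
    """
    n = sum(observed)
    expected = [n * p0 for p0 in null_probs]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, len(observed) - 1

# Hypothetical data: 120 rolls of a die, H0: every face has probability 1/6.
die_counts = [18, 23, 16, 21, 18, 24]
x2_gof, df_gof = pearson_chi_square(die_counts, [1 / 6] * 6)

CRITICAL_5PCT_DF5 = 11.07  # table value, upper 5% point of chi-squared with 5 df
reject_gof = x2_gof >= CRITICAL_5PCT_DF5
print(x2_gof, df_gof, reject_gof)
```

Here every expected count is 20, so X2 = (4 + 9 + 16 + 1 + 4 + 16) / 20 = 2.5, well below the critical value, and H0 is not rejected at the 5% level.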
Contingency Table
Observed counts:
                B1    B2    …    Bc    Row Total
A1              n11   n12   …    n1c   n10
A2              n21   n22   …    n2c   n20
…               …     …     …    …     …
Ar              nr1   nr2   …    nrc   nr0
Column Total    n01   n02   …    n0c   n

Cell probabilities:
                B1    B2    …    Bc    Row Total
A1              p11   p12   …    p1c   p10
A2              p21   p22   …    p2c   p20
…               …     …     …    …     …
Ar              pr1   pr2   …    prc   pr0
Column Total    p01   p02   …    p0c   1

pij = P(Ai Bj): probability of the joint occurrence of Ai and Bj
pi0 = P(Ai): total probability in the ith row
p0j = P(Bj): total probability in the jth column
The null hypothesis of independence:
\[ H_0 : p_{ij} = p_{i0}\, p_{0j} \quad \text{for all cells } (i, j) \]
Estimation:
\[ \hat{p}_{i0} = \frac{n_{i0}}{n}, \quad \hat{p}_{0j} = \frac{n_{0j}}{n}, \quad \hat{p}_{ij} = \hat{p}_{i0}\, \hat{p}_{0j} = \frac{n_{i0}\, n_{0j}}{n^{2}} \]
Expectation:
\[ E_{ij} = n\, \hat{p}_{ij} = \frac{n_{i0}\, n_{0j}}{n} \]
The test statistic then becomes:
\[ X^{2} = \sum_{\text{all } rc \text{ cells}} \frac{(n_{ij} - E_{ij})^{2}}{E_{ij}} \]
which has an approximate χ2 distribution with d.f. = (r − 1)(c − 1).
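The independence test can be sketched in Python as follows. The 2 × 2 table is invented for illustration, the function name independence_test is introduced here, and the critical value 3.841 (upper 5% point of χ2 with 1 df) is a table value rather than something computed by the code.

```python
def independence_test(table):
    """Return (X2, df, expected) for the chi-squared test of independence.

    table : list of rows, where table[i][j] is the observed count n_ij.
    """
    row_totals = [sum(row) for row in table]        # n_i0
    col_totals = [sum(col) for col in zip(*table)]  # n_0j
    n = sum(row_totals)
    # Expected counts under independence: E_ij = n_i0 * n_0j / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum(
        (n_ij - e_ij) ** 2 / e_ij
        for obs_row, exp_row in zip(table, expected)
        for n_ij, e_ij in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected

# Hypothetical 2 x 2 table of counts.
counts_2x2 = [[20, 30],
              [30, 20]]
x2_ind, df_ind, expected_2x2 = independence_test(counts_2x2)

CRITICAL_5PCT_DF1 = 3.841  # table value, upper 5% point of chi-squared with 1 df
reject_ind = x2_ind >= CRITICAL_5PCT_DF1
print(x2_ind, df_ind, reject_ind)
```

Every expected count in this example is 25, so X2 = 4 × 25 / 25 = 4.0 with 1 df, which exceeds 3.841 and leads to rejecting independence at the 5% level.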
Test of Homogeneity
The χ2 test of independence is based on the sampling scheme in which a single random sample of size n is class