Introduction
STAC51: Categorical data Analysis
Mahinda Samarakoon
January 21, 2016
Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 21
Table of contents
1 Introduction
Basic Concepts
Categorical data analysis is concerned with the statistical methods for analysis of categorical response (dependent) variables. Explanatory variables may be categorical, continuous, or both; for example, the explanatory variables can be income, education, gender, race, etc. There are two types of categorical variables:
Types of variables
Nominal - unordered categories. Examples:
Major: Mathematics, Statistics or Computer Science
Favorite music: rock, classical, jazz, country, folk, pop
Criminal offense convictions: murder, robbery, assault
Ordinal - ordered categories, but the exact distances between categories are unknown. Examples:
Patient condition: excellent, good, fair, poor
Government spending: too high, about right, too low
Highest attained education level: HS, BS, MS, PhD
Types of variables
Binary variables
A binary variable is a special case of a categorical variable, taking only two values (categories), such as success and failure, or true and false.
For binary variables the nominal-ordinal distinction is not important.
Types of variables
Interval variables
An interval variable is one that has meaningful numerical distances between any two values.
Examples: annual income, height, weight, systolic blood pressure level.
Probability Distributions for Categorical Data
In categorical data analysis, the binomial distribution (and its generalization, the multinomial distribution) plays the role that the Normal distribution does for continuous responses.
Recall that for a Bin(n, π) random variable Y:
P(Y = y) = (n choose y) π^y (1 − π)^(n−y) for y = 0, 1, . . . , n, and zero otherwise.
E(Y) = nπ
Var(Y) = nπ(1 − π)
If X1, . . . , Xn are i.i.d. Bernoulli random variables, i.e. P(X1 = 1) = π and P(X1 = 0) = 1 − π, then Y = X1 + · · · + Xn ∼ Bin(n, π). In other words, Y is the number of successes (i.e. 1's) in n independent Bernoulli trials.
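The pmf and moment formulas above are easy to check directly. A minimal Python sketch (the deck's own code examples use R; the values n = 5, π = 0.08 here are just illustrative):

```python
from math import comb

def binom_pmf(y, n, pi):
    # P(Y = y) = (n choose y) * pi^y * (1 - pi)^(n - y)
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 5, 0.08  # illustrative values
pmf = [binom_pmf(y, n, pi) for y in range(n + 1)]

# moments computed from the pmf match the formulas E(Y) = n*pi and Var(Y) = n*pi*(1 - pi)
mean = sum(y * p for y, p in enumerate(pmf))
var = sum((y - mean)**2 * p for y, p in enumerate(pmf))
```

Summing the pmf over y = 0, . . . , n gives 1, which is a quick way to confirm the support starts at 0.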
Binomial Distribution
Example: According to published statistics, 8% of people ages 14-24 are school dropouts, i.e. persons who are not in regular school and who have not completed the 12th grade or any higher degree. Suppose you pick five people at random from this age group. What is the probability that exactly two of them will be school dropouts?
Solution: Let Y denote the number of school dropouts in this sample of 5 people; then Y ∼ Bin(n = 5, π = 0.08). The question asks for P(Y = 2), and using the formula
P(Y = 2) = (5 choose 2) (0.08)^2 (1 − 0.08)^(5−2)
= 10 × (0.08)^2 × (0.92)^3
= 0.049836032.
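The same calculation in Python (the deck uses R elsewhere); math.comb supplies the binomial coefficient:

```python
from math import comb

# P(Y = 2) for Y ~ Bin(5, 0.08), as in the dropout example
p2 = comb(5, 2) * 0.08**2 * (1 - 0.08)**3
print(p2)  # ≈ 0.049836032
```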
Multinomial Distribution
In some trials more than two outcomes are possible. Suppose n independent trials can each have an outcome in any of c categories. Let y_ij = 1 if the i-th trial results in category j, and zero otherwise. Let n_j = Σ_{i=1}^{n} y_ij; then (n1, n2, . . . , nc) is an observed value (vector) from a multinomial distribution. The probability mass function of the multinomial distribution is given by:
p(n1, n2, . . . , nc) = (n! / (n1! n2! . . . nc!)) π1^n1 π2^n2 . . . πc^nc, (1)
where πj is the probability of an outcome in category j (for any trial).
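Equation (1) can be written as a short function. This Python sketch computes the multinomial coefficient by repeated integer division (the counts and probabilities used for checking are illustrative):

```python
from math import factorial

def multinom_pmf(counts, probs):
    # p(n1, ..., nc) = n!/(n1! ... nc!) * prod(pi_j ^ n_j), as in equation (1)
    n = sum(counts)
    coef = factorial(n)
    for nj in counts:
        coef //= factorial(nj)  # remains an integer at every step
    p = float(coef)
    for nj, pj in zip(counts, probs):
        p *= pj**nj
    return p
```

For counts that put all n trials in one category, the value reduces to πj^n, which makes a quick sanity check.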
Multinomial Distribution: Example
Suppose we have a bowl with 10 marbles - 2 red marbles, 3 green marbles, and 5 blue marbles. We randomly select 4 marbles from the bowl, with replacement. What is the probability of selecting 2 green marbles and 2 blue marbles?
Solution: Let Y1, Y2 and Y3 denote the numbers of red, green and blue marbles respectively. Then (Y1, Y2, Y3) has a multinomial distribution with n = 4, π1 = 0.2, π2 = 0.3 and π3 = 0.5, and
P(Y1 = 0, Y2 = 2, Y3 = 2) = (4! / (0! 2! 2!)) × 0.2^0 × 0.3^2 × 0.5^2 = 6 × 0.0225 = 0.135.
R commands:
> dmultinom(x = c(0, 2, 2), size = 4, prob = c(0.2, 0.3, 0.5))
[1] 0.135
>
Multinomial Distribution
Some properties of the Multinomial Distribution
If (Y1, Y2, . . . , Yc) has a multinomial (n, π1, π2, . . . , πc) distribution, then
Yj ∼ Bin(n, πj)
µj = E(Yj) = nπj
Var(Yj) = nπj(1 − πj)
Cov(Yj, Yk) = E((Yj − µj)(Yk − µk)) = −nπjπk for j ≠ k.
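These moment formulas can be verified by brute-force enumeration in a tiny case. A Python sketch with n = 3 trials and illustrative probabilities (not from the slides):

```python
from itertools import product

n, probs = 3, (0.2, 0.3, 0.5)  # illustrative

EY1 = EY2 = EY1sq = EY1Y2 = 0.0
for outcome in product(range(3), repeat=n):  # category drawn on each trial
    p = 1.0
    for cat in outcome:
        p *= probs[cat]
    y1, y2 = outcome.count(0), outcome.count(1)  # counts in categories 1 and 2
    EY1 += y1 * p
    EY2 += y2 * p
    EY1sq += y1**2 * p
    EY1Y2 += y1 * y2 * p

var1 = EY1sq - EY1**2      # should equal n*pi1*(1 - pi1)
cov12 = EY1Y2 - EY1 * EY2  # should equal -n*pi1*pi2
```

The negative covariance reflects that the counts compete for the same n trials: more outcomes in one category leaves fewer for the others.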
Poisson Distribution
Sometimes count data do not result from a fixed number of trials; for example, the number of accidents during a particular period in a particular city. Such random variables often have a Poisson distribution. The probability mass function of the Poisson distribution is given by
p(y) = e^(−µ) µ^y / y!, y = 0, 1, . . . (2)
The parameter µ represents the mean of the distribution. That is, if Y ∼ Po(µ), then E(Y) = µ. It can also be shown that Var(Y) = µ.
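Equation (2) in Python, with a numerical check that the mean and variance both equal µ (truncating the infinite sum at y = 50, where the tail is negligible for µ = 1.8):

```python
from math import exp, factorial

def pois_pmf(y, mu):
    # p(y) = e^(-mu) * mu^y / y!, as in equation (2)
    return exp(-mu) * mu**y / factorial(y)

mu = 1.8  # illustrative value
mean = sum(y * pois_pmf(y, mu) for y in range(50))
var = sum((y - mu)**2 * pois_pmf(y, mu) for y in range(50))
```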
Poisson Distribution: Example
Births in a hospital occur randomly at an average rate of 1.8 births per hour. It is reasonable to assume the distribution of the number of births in any particular hour to be Poisson with mean 1.8.
What is the probability of observing 4 births in a given hour at the hospital?
Solution: Let Y be the number of births in this interval. Then Y ∼ Po(1.8) and so P(Y = 4) = e^(−1.8) × 1.8^4 / 4! = 0.0723.
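The same arithmetic in Python:

```python
from math import exp, factorial

# P(Y = 4) for Y ~ Po(1.8), as in the births example
p4 = exp(-1.8) * 1.8**4 / factorial(4)
print(round(p4, 4))  # 0.0723
```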
Poisson Approximation to the Binomial distribution
If n is large (n ≥ 100) and π is small (usually π ≤ 0.01), with nπ ≤ 20, then we can use the Poisson(µ = nπ) distribution to approximate binomial probabilities.
Poisson Approximation to the Binomial distribution:Example
Suppose that 1 in 5000 light bulbs is defective. Let Y denote the number of defective bulbs in a batch of 10000 bulbs. What is the chance that at most three bulbs will be defective?
Solution: Y ∼ Bin(n = 10000, π = 1/5000 = 0.0002).
P(Y ≤ 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)
= (10000 choose 0) 0.0002^0 (1 − 0.0002)^10000 + (10000 choose 1) 0.0002^1 (1 − 0.0002)^9999 + (10000 choose 2) 0.0002^2 (1 − 0.0002)^9998 + (10000 choose 3) 0.0002^3 (1 − 0.0002)^9997 = ?
Or we can use the Poisson approximation: Y approx∼ Po(µ = nπ = 10000 × 0.0002 = 2).
P(Y ≤ 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) ≈ e^(−2) 2^0/0! + e^(−2) 2^1/1! + e^(−2) 2^2/2! + e^(−2) 2^3/3! = 0.8571235.
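Both the exact binomial sum and the Poisson approximation can be computed directly in Python:

```python
from math import comb, exp, factorial

n, pi = 10000, 0.0002
mu = n * pi  # = 2

# exact binomial tail P(Y <= 3)
exact = sum(comb(n, y) * pi**y * (1 - pi)**(n - y) for y in range(4))

# Poisson approximation with mu = n*pi
approx = sum(exp(-mu) * mu**y / factorial(y) for y in range(4))

print(exact, approx)  # ≈ 0.8571415 and ≈ 0.8571235
```

The two values agree to about four decimal places, which is the point of the approximation.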
Poisson Approximation to the Binomial distribution:Example
Here are the R commands for calculating P(Y ≤ 3) using the two distributions:
> pbinom(3, 10000, 0.0002)
[1] 0.8571415
> ppois(3, 2)
[1] 0.8571235
The Chi-squared Distribution
Another distribution that we often come across in categorical data analysis is the chi-squared distribution.
Definition: Let Z1, Z2, . . . , Zν be ν i.i.d. random variables, each having a N(0, 1) distribution. Then the distribution of the random variable Y = Z1^2 + Z2^2 + · · · + Zν^2 is called a chi-squared distribution with ν degrees of freedom.
Some properties of the Chi-squared distribution
1 If Z ∼ N(0, 1), then E(Z^2) = Var(Z) + (E(Z))^2 = 1 + 0^2 = 1.
2 If X ∼ N(µ, σ^2), then it can be shown that for any integer p ≥ 0,
E(X − µ)^(2p) = ((2p)! / (p! 2^p)) σ^(2p) and E(X − µ)^(2p+1) = 0.
3 Var(Z^2) = E(Z^4) − (E(Z^2))^2 = ((2 × 2)! / (2! 2^2)) × 1^(2×2) − 1^2 = 3 − 1 = 2.
4 If Y = Z1^2 + Z2^2 + · · · + Zν^2, where Z1, Z2, . . . , Zν are i.i.d. N(0, 1), then E(Y) = E(Z1^2) + E(Z2^2) + · · · + E(Zν^2) = ν.
5 If Y = Z1^2 + Z2^2 + · · · + Zν^2, where Z1, Z2, . . . , Zν are i.i.d. N(0, 1), then Var(Y) = Var(Z1^2) + Var(Z2^2) + · · · + Var(Zν^2) = 2ν.
6 If Y1 ∼ χ²_ν1 and Y2 ∼ χ²_ν2, and Y1 and Y2 are independent, then Y1 + Y2 ∼ χ²_(ν1+ν2).
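Properties 1-3 are plain arithmetic once the even-moment formula from property 2 is in hand. A small Python check:

```python
from math import factorial

def even_central_moment(p, sigma2=1.0):
    # E(X - mu)^(2p) = (2p)! / (p! * 2^p) * sigma^(2p), property 2 with sigma^2 = sigma2
    return factorial(2 * p) / (factorial(p) * 2**p) * sigma2**p

EZ2 = even_central_moment(1)   # E(Z^2) = 1
EZ4 = even_central_moment(2)   # E(Z^4) = 3
var_Z2 = EZ4 - EZ2**2          # Var(Z^2) = 2, matching property 3
```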
Inference for Proportions
Let Y be the number of successes (i.e. 1's) in n independent Bernoulli trials with success probability π. The probability of a success π is usually an unknown parameter, and we estimate it by the sample proportion of successes:
π̂ = Y / n. (3)
Some properties of π̂
1 π̂ is an unbiased estimator of π (i.e. E (π̂) = π).
2 Var(π̂) = π(1 − π)/n
3 π̂ → π in probability, by the WLLN
4 π̂ approx∼ N(π, π(1 − π)/n) for large n, by the CLT
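A minimal Python sketch of these properties in use (y = 42 successes in n = 100 trials are made-up numbers; the 95% interval rests on the large-sample normal approximation in property 4):

```python
from math import sqrt

y, n = 42, 100  # illustrative data
pi_hat = y / n  # sample proportion, equation (3)

# estimated standard error, from Var(pi_hat) = pi*(1 - pi)/n with pi_hat plugged in
se = sqrt(pi_hat * (1 - pi_hat) / n)

# large-sample 95% confidence interval for pi
ci = (pi_hat - 1.96 * se, pi_hat + 1.96 * se)
```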
Definition (Likelihood function): The likelihood function is the probability of the observed data, expressed as a function of the parameter value.
Definition (Maximum Likelihood Estimate): The maximum likelihood estimate (MLE) is the parameter value at which the likelihood function takes its maximum.