STAC51: Categorical data Analysis

Introduction

Mahinda Samarakoon

January 21, 2016

Basic Concepts

Categorical data analysis is concerned with statistical methods for the analysis of categorical response (dependent) variables. Explanatory variables may be categorical, continuous, or both. For example, the explanatory variables can be income, education, gender, race, etc. There are two types of categorical variables:

Types of variables

Nominal - unordered categories. Examples:

  • Major: Mathematics, Statistics or Computer Science
  • Favorite music: rock, classical, jazz, country, folk, pop
  • Criminal offense convictions: murder, robbery, assault

Ordinal - ordered categories, but the exact distances between categories are unknown. Examples:

  • Patient condition: excellent, good, fair, poor
  • Government spending: too high, about right, too low
  • Highest attained education level: HS, BS, MS, PhD
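In R, the nominal/ordinal distinction can be recorded with factors. The following is only an illustrative sketch: the data values and the variable names `music` and `condition` are made up for the example and do not come from the slides.

```r
# Hypothetical responses; the values are invented for illustration.
music <- factor(c("rock", "jazz", "pop", "jazz"))        # nominal: no ordering

condition <- factor(c("good", "poor", "excellent", "fair"),
                    levels = c("poor", "fair", "good", "excellent"),
                    ordered = TRUE)                       # ordinal: ordered levels

is.ordered(music)      # FALSE
is.ordered(condition)  # TRUE

# Order comparisons are defined only for ordered factors.
worse_than_good <- condition < "good"
worse_than_good

# Category counts for the nominal variable.
music_counts <- table(music)
music_counts
```

The order of `levels` carries the ranking; the distances between adjacent levels are still not defined, which matches the definition of an ordinal variable.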

Binary variables

A binary variable is a special case of a categorical variable, taking only two values (categories), such as success and failure, or true and false.

For binary variables, the nominal-ordinal distinction is not important.

Interval variables

An interval variable is one that has meaningful distances between any two values.

Examples: annual income, height, weight, systolic blood pressure level.

Probability Distributions for Categorical Data

In categorical data analysis, the binomial distribution (and its generalization, the multinomial distribution) plays the role that the normal distribution does for continuous responses.

Recall that for a $\mathrm{Bin}(n, \pi)$ random variable $Y$,

$$P(Y = y) = p_Y(y) = \binom{n}{y} \pi^{y} (1-\pi)^{n-y} \quad \text{for } y = 0, 1, \ldots, n,$$

and zero otherwise. Its mean and variance are

$$E(Y) = n\pi, \qquad \mathrm{Var}(Y) = n\pi(1-\pi).$$

If $X_1, \ldots, X_n$ are i.i.d. Bernoulli random variables, i.e. $P(X_1 = 1) = \pi$ and $P(X_1 = 0) = 1 - \pi$, then $Y = X_1 + \cdots + X_n \sim \mathrm{Bin}(n, \pi)$. In other words, $Y$ is the number of successes (i.e. 1's) in $n$ independent Bernoulli trials.
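The binomial formulas above can be checked numerically in R. This is a minimal sketch with illustrative values ($n = 10$, $\pi = 0.3$, chosen here and not taken from the slides); it compares the hand-written pmf with `dbinom` and recomputes the mean and variance directly from the pmf.

```r
# Illustrative parameter values (assumed for this sketch).
n <- 10
p <- 0.3
y <- 0:n                      # the support of Bin(n, p)

pmf_formula <- choose(n, y) * p^y * (1 - p)^(n - y)  # pmf written out by hand
pmf_builtin <- dbinom(y, size = n, prob = p)         # R's binomial pmf

all.equal(pmf_formula, pmf_builtin)  # the two versions should agree
sum(pmf_builtin)                     # probabilities over the support sum to 1

mean_y <- sum(y * pmf_builtin)               # E(Y) computed from the pmf
var_y  <- sum((y - mean_y)^2 * pmf_builtin)  # Var(Y) computed from the pmf

c(mean_y, n * p)             # compare with n * pi
c(var_y, n * p * (1 - p))    # compare with n * pi * (1 - pi)
```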

Binomial Distribution

Example: According to published statistics, 8% of people ages 14-24 are school dropouts, i.e. persons who are not in regular school and who have not completed the 12th grade or any higher degree. Suppose you pick five people at random from this age group. What is the probability that exactly two of them will be school dropouts?

Solution: Let $Y$ denote the number of school dropouts in this sample of 5 people. Then $Y \sim \mathrm{Bin}(n = 5, \pi = 0.08)$. The question asks for $P(Y = 2)$. Using the formula,

$$P(Y = 2) = \binom{5}{2} (0.08)^{2} (1 - 0.08)^{5-2} = 10 \times (0.08)^{2} \times (0.92)^{3} = 0.049836032.$$
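The same probability can be obtained in R with `dbinom`. The variable name `p_two` below is just a label chosen for this sketch.

```r
# P(Y = 2) for Y ~ Bin(5, 0.08), as in the dropout example.
p_two <- dbinom(2, size = 5, prob = 0.08)
p_two

# The same quantity written out from the binomial formula.
p_two_formula <- choose(5, 2) * 0.08^2 * 0.92^3
p_two_formula
```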

Multinomial Distribution

In some trials more than two outcomes are possible. Suppose each of $n$ independent trials can result in any one of $c$ categories. Let $y_{ij} = 1$ if the $i$th trial results in category $j$, and zero otherwise. Let

$$n_j = \sum_{i=1}^{n} y_{ij}.$$

Then $(n_1, n_2, \ldots, n_c)$ is an observed value (vector) from a multinomial distribution. The probability mass function of the multinomial distribution is given by

$$p(n_1, n_2, \ldots, n_c) = \left( \frac{n!}{n_1!\, n_2! \cdots n_c!} \right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c}, \tag{1}$$

where $\pi_j$ is the probability of an outcome in category $j$ (for any trial).
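The indicator construction of the counts $n_j$ can be sketched in R as follows. The vector `outcomes` is a hypothetical record of six trials with three categories, invented for illustration.

```r
# Hypothetical category labels (1, 2 or 3) for n = 6 trials.
outcomes <- c(2, 3, 3, 1, 3, 2)
n_categories <- 3

# Indicator matrix: y[i, j] = 1 if trial i fell in category j, else 0.
y <- outer(outcomes, seq_len(n_categories), "==") * 1
y

# Category counts n_j are the column sums of the indicators.
n_j <- colSums(y)
n_j

# The counts add up to the number of trials.
sum(n_j) == length(outcomes)
```

In practice `tabulate(outcomes, nbins = n_categories)` gives the same counts without building the indicator matrix; the explicit matrix is shown only to mirror the definition.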

Multinomial Distribution: Example

Suppose we have a bowl with 10 marbles: 2 red marbles, 3 green marbles, and 5 blue marbles. We randomly select 4 marbles from the bowl, with replacement. What is the probability of selecting 2 green marbles and 2 blue marbles?

Solution: Let $Y_1$, $Y_2$ and $Y_3$ denote the numbers of red, green and blue marbles selected, respectively. Then $(Y_1, Y_2, Y_3)$ has a multinomial distribution with $n = 4$, $\pi_1 = 0.2$, $\pi_2 = 0.3$ and $\pi_3 = 0.5$, and

$$P(Y_1 = 0, Y_2 = 2, Y_3 = 2) = \left( \frac{4!}{0!\, 2!\, 2!} \right) 0.2^{0} \times 0.3^{2} \times 0.5^{2} = 6 \times 0.0225 = 0.135.$$

R commands

> dmultinom(x = c(0, 2, 2), size = 4, prob = c(0.2, 0.3, 0.5))

[1] 0.135
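For comparison, the multinomial formula can also be evaluated by hand in R. This is a small sketch; the names `counts`, `prob`, `coef` and `p_manual` are chosen here for illustration.

```r
counts <- c(0, 2, 2)        # red, green, blue counts from the marble example
prob   <- c(0.2, 0.3, 0.5)  # category probabilities
n      <- sum(counts)       # number of draws

# Multinomial coefficient n! / (n1! n2! n3!)
coef <- factorial(n) / prod(factorial(counts))

# Probability from the multinomial pmf written out term by term.
p_manual <- coef * prod(prob^counts)
p_manual

# Should agree with the built-in function.
all.equal(p_manual, dmultinom(counts, size = n, prob = prob))
```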

Multinomial Distribution

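Some properties of the multinomial distribution: if $(Y_1, Y_2, \ldots, Y_c)$ has a multinomial $(n, \pi_1, \pi_2, \ldots, \pi_c)$ distribution, then

  • $Y_j \sim \mathrm{Bin}(n, \pi_j)$
  • $\mu_j = E(Y_j) = n\pi_j$
  • $\mathrm{Var}(Y_j) = n\pi_j(1 - \pi_j)$
  • $\mathrm{Cov}(Y_j, Y_k) = E\big((Y_j - \mu_j)(Y_k - \mu_k)\big) = -n\pi_j\pi_k$ for $j \neq k$.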
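These moment formulas can be verified exactly for a small case by enumerating every possible outcome. The sketch below reuses $n = 4$ and $\pi = (0.2, 0.3, 0.5)$ from the marble example; the enumeration approach and all variable names are choices made for this illustration.

```r
n    <- 4
prob <- c(0.2, 0.3, 0.5)

# Enumerate all (y1, y2, y3) with y1 + y2 + y3 = n.
grid <- expand.grid(y1 = 0:n, y2 = 0:n)
grid <- grid[grid$y1 + grid$y2 <= n, ]
grid$y3 <- n - grid$y1 - grid$y2

# Exact multinomial probability of each outcome.
outcomes <- as.matrix(grid[, c("y1", "y2", "y3")])
grid$p <- apply(outcomes, 1, function(x) dmultinom(x, size = n, prob = prob))

total_prob <- sum(grid$p)                  # should be 1

mu1   <- sum(grid$y1 * grid$p)             # E(Y1)
mu2   <- sum(grid$y2 * grid$p)             # E(Y2)
var1  <- sum((grid$y1 - mu1)^2 * grid$p)   # Var(Y1)
cov12 <- sum((grid$y1 - mu1) * (grid$y2 - mu2) * grid$p)  # Cov(Y1, Y2)

c(mu1, n * prob[1])                        # compare with n * pi_1
c(var1, n * prob[1] * (1 - prob[1]))       # compare with n * pi_1 * (1 - pi_1)
c(cov12, -n * prob[1] * prob[2])           # compare with -n * pi_1 * pi_2
```

The negative covariance reflects the constraint that the counts sum to $n$: more outcomes in one category leave fewer for the others.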

Poisson Distribution

Sometimes, count data do not result from a fixed number of trials. An example is the number of accidents during a particular period in a particular city. Random variables of this type often have a Poisson distribution. The probability mass function of the Poisson distribution is given by

$$p(y) = \frac{e^{-\mu} \mu^{y}}{y!}, \quad y = 0, 1, \ldots \tag{2}$$

The parameter $\mu$ represents the mean of the distribution. That is, if $Y \sim \mathrm{Po}(\mu)$, then $E(Y) = \mu$. It can also be shown that $\mathrm{Var}(Y) = \mu$.
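A quick numerical check of the Poisson pmf, mean and variance in R. This is a sketch with an illustrative value $\mu = 3$ (an assumption, not from the slides); the infinite support is truncated at 50, beyond which the probability mass is negligible for this $\mu$.

```r
mu <- 3        # illustrative mean (assumed for this sketch)
y  <- 0:50     # truncated support; the tail beyond 50 is negligible here

pmf_formula <- exp(-mu) * mu^y / factorial(y)  # pmf from equation (2)
pmf_builtin <- dpois(y, lambda = mu)           # R's Poisson pmf

all.equal(pmf_formula, pmf_builtin)

mean_y <- sum(y * pmf_builtin)               # approximately mu
var_y  <- sum((y - mean_y)^2 * pmf_builtin)  # approximately mu
c(mean_y, var_y)
```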

Poisson Distribution: Example

Births in a hospital occur randomly at an average rate of 1.8 births per hour. It is reasonable to assume that the number of births in any particular hour has a Poisson distribution with mean 1.8.

What is the probability of observing 4 births in a given hour at the hospital?

Solution: Let $Y$ be the number of births in this interval. Then $Y \sim \mathrm{Po}(1.8)$, and so

$$P(Y = 4) = \frac{e^{-1.8}\, 1.8^{4}}{4!} = 0.0723.$$
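In R this probability is available from `dpois`. The name `p_four` is only a label for this sketch.

```r
# P(Y = 4) for Y ~ Po(1.8), as in the births example.
p_four <- dpois(4, lambda = 1.8)
round(p_four, 4)

# The same value from the pmf written out.
p_four_formula <- exp(-1.8) * 1.8^4 / factorial(4)
```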

Poisson Approximation to the Binomial distribution

If $n$ is large ($n \ge 100$) and $\pi$ is small (usually $\pi \le 0.01$), with $n\pi \le 20$, then we can use Poisson($\mu = n\pi$) to approximate the binomial probabilities.

Poisson Approximation to the Binomial distribution: Example

Suppose that 1 in 5000 light bulbs is defective. Let $Y$ denote the number of defective bulbs in a batch of 10000 bulbs. What is the chance that at most three bulbs will be defective?

Solution: $Y \sim \mathrm{Bin}(n = 10000, \pi = 1/5000 = 0.0002)$, so

$$P(Y \le 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) = \sum_{y=0}^{3} \binom{10000}{y}\, 0.0002^{y}\, (1 - 0.0002)^{10000 - y} = \; ?$$

Or we can use the Poisson approximation: $Y \overset{\text{approx}}{\sim} \mathrm{Po}(\mu = n\pi = 10000 \times 0.0002 = 2)$, so

$$P(Y \le 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) \approx e^{-2}\frac{2^{0}}{0!} + e^{-2}\frac{2^{1}}{1!} + e^{-2}\frac{2^{2}}{2!} + e^{-2}\frac{2^{3}}{3!} = 0.8571235.$$

Here are the R commands calculating P(Y ≤ 3) using the two distributions:

> pbinom(3, 10000, 0.0002)

[1] 0.8571415

> ppois(3, 2)

[1] 0.8571235
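To see where the small difference between the two answers comes from, the individual probabilities can be compared term by term. This is a sketch using the light bulb values; the variable names are chosen here for illustration.

```r
n <- 10000
p <- 0.0002
y <- 0:3

exact       <- dbinom(y, size = n, prob = p)  # exact binomial probabilities
pois_approx <- dpois(y, lambda = n * p)       # Poisson approximation, mu = n * pi

# Side-by-side comparison with the absolute error of each term.
cbind(y, exact, pois_approx, abs_error = abs(exact - pois_approx))

# Cumulative probabilities P(Y <= 3) under each distribution.
cum_exact <- sum(exact)
cum_pois  <- sum(pois_approx)
c(binomial = cum_exact, poisson = cum_pois)
```

Summing the individual terms gives the same values as `pbinom(3, 10000, 0.0002)` and `ppois(3, 2)` above.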

The Chi-squared Distribution

Another distribution that we often come across in categorical data analysis is the chi-squared distribution.

Definition: Let $Z_1, Z_2, \ldots, Z_\nu$ be $\nu$ iid random variables, each having a $N(0, 1)$ distribution. Then the distribution of the random variable $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$ is called a chi-squared distribution with $\nu$ degrees of freedom.
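The definition can be illustrated by simulation in R. This is a rough sketch under assumed settings: $\nu = 5$, 100000 replications and the seed are arbitrary choices made here, not values from the slides.

```r
set.seed(51)     # arbitrary seed, for reproducibility only
nu   <- 5        # illustrative degrees of freedom
reps <- 100000   # number of simulated values of Y

# Each row holds nu independent N(0, 1) draws; Y is the row sum of squares.
z <- matrix(rnorm(reps * nu), nrow = reps, ncol = nu)
y <- rowSums(z^2)

# Compare the empirical distribution with the chi-squared cdf at one point:
# the theoretical 95th percentile of the chi-squared(nu) distribution.
q95 <- qchisq(0.95, df = nu)
empirical_cdf <- mean(y <= q95)
empirical_cdf    # should be close to 0.95
```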

Some properties of the Chi-squared distribution

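1. If $Z \sim N(0, 1)$, then $E(Z^2) = \mathrm{Var}(Z) + (E(Z))^2 = 1 + 0^2 = 1$.

2. If $X \sim N(\mu, \sigma^2)$, then it can be shown that for any integer $p \ge 0$,

$$E(X - \mu)^{2p} = \frac{(2p)!}{p!\, 2^{p}}\, \sigma^{2p} \quad \text{and} \quad E(X - \mu)^{2p+1} = 0.$$

3. $\mathrm{Var}(Z^2) = E(Z^4) - (E(Z^2))^2 = \dfrac{(2 \times 2)!}{2!\, 2^{2}} \times 1^{2 \times 2} - 1^{2} = 3 - 1 = 2$.

4. If $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$, where $Z_1, Z_2, \ldots, Z_\nu$ are iid $N(0, 1)$, then $E(Y) = E(Z_1^2) + E(Z_2^2) + \cdots + E(Z_\nu^2) = \nu$.

5. If $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$, where $Z_1, Z_2, \ldots, Z_\nu$ are iid $N(0, 1)$, then $\mathrm{Var}(Y) = \mathrm{Var}(Z_1^2) + \mathrm{Var}(Z_2^2) + \cdots + \mathrm{Var}(Z_\nu^2) = 2\nu$.

6. If $Y_1 \sim \chi^2_{\nu_1}$ and $Y_2 \sim \chi^2_{\nu_2}$, and if $Y_1$ and $Y_2$ are independent, then $Y_1 + Y_2 \sim \chi^2_{\nu_1 + \nu_2}$.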
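The moment results in properties 1, 3, 4 and 5 can be checked by numerical integration in R. This is only a sketch: $\nu = 5$ is an illustrative choice, and `integrate` returns approximations, so the comparisons hold up to numerical error.

```r
nu <- 5   # illustrative degrees of freedom (assumed for this sketch)

# Properties 1 and 3: moments of Z ~ N(0, 1).
ez2 <- integrate(function(z) z^2 * dnorm(z), -Inf, Inf)$value  # E(Z^2)
ez4 <- integrate(function(z) z^4 * dnorm(z), -Inf, Inf)$value  # E(Z^4)
var_z2 <- ez4 - ez2^2                                          # Var(Z^2)

# Properties 4 and 5: mean and variance of Y ~ chi-squared(nu).
ey  <- integrate(function(y) y   * dchisq(y, df = nu), 0, Inf)$value
ey2 <- integrate(function(y) y^2 * dchisq(y, df = nu), 0, Inf)$value
var_y <- ey2 - ey^2

# Compare with 1, 2, nu and 2 * nu respectively.
c(ez2 = ez2, var_z2 = var_z2, ey = ey, var_y = var_y)
```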

Inference for Proportions

Let $Y$ be the number of successes (i.e. 1's) in $n$ independent Bernoulli trials with success probability $\pi$. The probability of a success, $\pi$, is usually an unknown parameter, and we estimate it by the sample proportion of successes:

$$\hat{\pi} = \frac{Y}{n}. \tag{3}$$

Some properties of π̂

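1. $\hat{\pi}$ is an unbiased estimator of $\pi$ (i.e. $E(\hat{\pi}) = \pi$).

2. $\mathrm{Var}(\hat{\pi}) = \dfrac{\pi(1 - \pi)}{n}$

3. $\hat{\pi} \overset{Pr}{\to} \pi$ by the WLLN.

4. $\hat{\pi} \overset{\text{approx}}{\sim} N\!\left(\pi, \dfrac{\pi(1 - \pi)}{n}\right)$ for large $n$, by the CLT.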
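Properties 1, 2 and 4 can be examined exactly for a given $n$ and $\pi$, because the distribution of $\hat{\pi} = Y/n$ follows directly from the binomial pmf. The sketch below uses illustrative values $n = 50$ and $\pi = 0.3$, which are assumptions made for this example.

```r
n <- 50     # illustrative sample size (assumed)
p <- 0.3    # illustrative true success probability (assumed)

y     <- 0:n
pmf   <- dbinom(y, size = n, prob = p)
p_hat <- y / n                       # possible values of the sample proportion

# Properties 1 and 2: exact mean and variance of p_hat.
e_p_hat   <- sum(p_hat * pmf)
var_p_hat <- sum((p_hat - e_p_hat)^2 * pmf)
c(e_p_hat, p)                        # compare with pi
c(var_p_hat, p * (1 - p) / n)        # compare with pi * (1 - pi) / n

# Property 4: exact P(p_hat <= 0.35) versus the normal approximation.
exact_prob  <- pbinom(floor(0.35 * n), size = n, prob = p)
normal_prob <- pnorm(0.35, mean = p, sd = sqrt(p * (1 - p) / n))
c(exact = exact_prob, normal = normal_prob)
```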

Definition (Likelihood function): The likelihood function is the probability of the observed data, expressed as a function of the parameter value.

Definition (Maximum Likelihood Estimate): The maximum likelihood estimate (MLE) is the parameter value at which the likelihood function takes its maximum.
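As an illustration of these two definitions, the sketch below writes the binomial log-likelihood for a hypothetical observation of $y = 7$ successes in $n = 20$ trials (made-up data) and maximizes it numerically with `optimize`. Maximizing the log-likelihood gives the same maximizer as maximizing the likelihood itself.

```r
# Hypothetical data: 7 successes in 20 Bernoulli trials (invented for illustration).
y <- 7
n <- 20

# Log-likelihood of the success probability p given the observed data.
loglik <- function(p) dbinom(y, size = n, prob = p, log = TRUE)

# Numerical maximization over the interval (0, 1).
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
mle_numeric <- fit$maximum
mle_numeric

# For the binomial model the maximizer is the sample proportion y / n.
y / n
```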
