Models for discrete data
1 / 20
Number game
C — a simple arithmetical concept, e.g. "prime number" or "a number between 1 and 10"

$D = \{x_1, \ldots, x_N\}$ — a set of examples drawn from C. Problem: does a new $x$ belong to C?

$$D = \{16, 8, 2, 64\}$$
2 / 20
Likelihood
$h_{two}$ — "powers of two"

$h_{even}$ — "even numbers"

$$p(D|h) = \left[\frac{1}{\mathrm{size}(h)}\right]^N = \left[\frac{1}{|h|}\right]^N$$

Size principle (Occam's razor): smaller hypotheses consistent with the data get higher likelihood, so for $D = \{16, 8, 2, 64\}$

$$p(D|h_{two}) \gg p(D|h_{even})$$
3 / 20
Posterior
$$p(h|D) = \frac{p(h)\,p(D|h)}{\sum_{h' \in H} p(D, h')} = \frac{p(h)\,I(D \in h)/|h|^N}{\sum_{h' \in H} p(h')\,I(D \in h')/|h'|^N}$$

With enough data the posterior concentrates on the MAP estimate:

$$p(h|D) \to \delta_{h_{MAP}}(h), \qquad h_{MAP} = \operatorname*{argmax}_h \, p(D|h)\,p(h)$$

As $N \to \infty$ the likelihood dominates the prior, so the MAP estimate converges towards the maximum likelihood estimate, or MLE:

$$h_{mle} = \operatorname*{argmax}_h \, p(D|h) = \operatorname*{argmax}_h \, \log p(D|h)$$
4 / 20
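The size principle and the MAP estimate can be sketched numerically. This is a toy version of the number game with just two illustrative hypotheses (numbers up to 100), not the full hypothesis space used in the lecture:

```python
# Toy number game: Bayes' rule over a tiny hypothesis space,
# with the size-principle likelihood p(D|h) = (1/|h|)^N.
hypotheses = {
    "powers of two": {2, 4, 8, 16, 32, 64},
    "even numbers": set(range(2, 101, 2)),
}
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}  # uniform prior
D = {16, 8, 2, 64}

def likelihood(D, ext):
    """p(D|h) = (1/|h|)^N if D is inside h's extension, else 0."""
    return (1.0 / len(ext)) ** len(D) if D <= ext else 0.0

unnorm = {h: prior[h] * likelihood(D, ext) for h, ext in hypotheses.items()}
Z = sum(unnorm.values())
posterior = {h: v / Z for h, v in unnorm.items()}
h_map = max(posterior, key=posterior.get)   # "powers of two" wins decisively
```

Even though both hypotheses are consistent with $D$, the smaller one carries almost all posterior mass, as the size principle predicts.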
Posterior predictive distribution
Bayes model averaging
$$p(x \in C|D) = \sum_h p(y=1|x, h)\,p(h|D)$$

Plug-in approximation

$$p(x \in C|D) \approx \sum_h p(x|h)\,\delta_{\hat{h}}(h) = p(x|\hat{h})$$

A more complex prior

$$p(h) = \pi_0\,p_{rules}(h) + (1-\pi_0)\,p_{interval}(h)$$
5 / 20
The beta-binomial model
$$X_i \sim \mathrm{Ber}(\theta)$$

$$p(D|\theta) = \theta^{N_1}(1-\theta)^{N_0}$$

Sufficient statistics:

$$N_1 = \sum_{i=1}^N I(x_i = 1), \qquad N_0 = \sum_{i=1}^N I(x_i = 0)$$

Count of the number of heads:

$$\mathrm{Bin}(k|n, \theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$$
6 / 20
Prior
If the prior is conjugate:

$$p(\theta) \propto \theta^{\gamma_1}(1-\theta)^{\gamma_2}$$

then the posterior has the same functional form:

$$p(\theta|D) \propto p(D|\theta)\,p(\theta) = \theta^{N_1}(1-\theta)^{N_0}\,\theta^{\gamma_1}(1-\theta)^{\gamma_2} = \theta^{N_1+\gamma_1}(1-\theta)^{N_0+\gamma_2}$$
7 / 20
Posterior
$$p(\theta|D) \propto \mathrm{Bin}(N_1|\theta, N_0+N_1)\,\mathrm{Beta}(\theta|a, b) \propto \mathrm{Beta}(\theta|N_1+a, N_0+b)$$

For two datasets $D_a$, $D_b$ processed in one batch:

$$p(\theta|D_a, D_b) \propto \mathrm{Bin}(N_1|\theta, N_1+N_0)\,\mathrm{Beta}(\theta|a, b) \propto \mathrm{Beta}(\theta|N_1+a, N_0+b)$$

In sequential mode:

$$p(\theta|D_a, D_b) \propto p(D_b|\theta)\,p(\theta|D_a) \propto \mathrm{Beta}(\theta|N_1^a+N_1^b+a,\; N_0^a+N_0^b+b)$$
8 / 20
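The equivalence of batch and sequential updating is easy to verify with toy coin-flip data (the counts and prior below are illustrative):

```python
# Conjugate Beta-Bernoulli updating: processing Da then Db sequentially
# yields the same posterior hyperparameters as one batch update on Da + Db.
a, b = 2.0, 2.0                      # Beta(a, b) prior
Da = [1, 0, 1, 1]                    # coin flips, dataset a
Db = [0, 0, 1]                       # coin flips, dataset b

def update(a, b, data):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + N1, b + N0) posterior."""
    n1 = sum(data)
    return a + n1, b + (len(data) - n1)

batch = update(a, b, Da + Db)        # one pass over all data
a1, b1 = update(a, b, Da)            # sequential: first Da ...
seq = update(a1, b1, Db)             # ... then Db, starting from p(θ|Da)
```

Both routes end at the same $\mathrm{Beta}(\theta\,|\,N_1+a,\,N_0+b)$, which is what makes the beta-binomial model convenient for online learning.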
Posterior mean and mode
$$\theta_{MAP} = \frac{a+N_1-1}{a+b+N-2}, \qquad \theta_{MLE} = \frac{N_1}{N}$$

Posterior mean:

$$\bar{\theta} = \frac{a+N_1}{a+b+N}$$

If $\alpha_0 = a+b$ and the prior mean is $m_1 = a/\alpha_0$:

$$E[\theta|D] = \frac{\alpha_0 m_1 + N_1}{N+\alpha_0} = \frac{\alpha_0}{N+\alpha_0}\,m_1 + \frac{N}{N+\alpha_0}\cdot\frac{N_1}{N} = \lambda m_1 + (1-\lambda)\,\theta_{MLE}$$

with $\lambda = \alpha_0/(N+\alpha_0)$.
9 / 20
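The convex-combination identity above can be checked directly; the prior and counts here are arbitrary toy values:

```python
# Posterior mean of θ as a shrinkage estimate:
# E[θ|D] = λ·m1 + (1 − λ)·θ_MLE, with λ = α0 / (N + α0) and α0 = a + b.
a, b = 2.0, 2.0                       # Beta prior hyperparameters
N1, N0 = 7, 3                         # observed heads / tails
N = N1 + N0

post_mean = (a + N1) / (a + b + N)    # direct posterior mean

alpha0 = a + b
m1 = a / alpha0                       # prior mean
lam = alpha0 / (N + alpha0)           # weight on the prior
theta_mle = N1 / N
mixed = lam * m1 + (1 - lam) * theta_mle
```

As $N$ grows, $\lambda \to 0$ and the posterior mean shrinks towards the MLE rather than the prior.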
Posterior variance
$$\mathrm{var}[\theta|D] = \frac{(a+N_1)(b+N_0)}{(a+N_1+b+N_0)^2\,(a+N_1+b+N_0+1)}$$

If $N \gg a, b$ and $\hat{\theta}$ is the MLE:

$$\mathrm{var}[\theta|D] \approx \frac{N_1 N_0}{N \cdot N \cdot N} = \frac{\hat{\theta}(1-\hat{\theta})}{N}$$
10 / 20
Posterior predictive distribution
$$p(x=1|D) = \int_0^1 p(x=1|\theta)\,p(\theta|D)\,d\theta = \int_0^1 \theta\,\mathrm{Beta}(\theta|a, b)\,d\theta = E[\theta|D] = \frac{a}{a+b}$$

(here $\mathrm{Beta}(\theta|a, b)$ denotes the posterior, i.e. $a$ and $b$ already include the observed counts).
11 / 20
Predicting the outcome of multiple future trials
$$p(x|D, M) = \int_0^1 \mathrm{Bin}(x|\theta, M)\,\mathrm{Beta}(\theta|a, b)\,d\theta = \binom{M}{x}\frac{1}{B(a, b)}\int_0^1 \theta^x(1-\theta)^{M-x}\,\theta^{a-1}(1-\theta)^{b-1}\,d\theta$$

Beta-binomial distribution:

$$Bb(x|a, b, M) = \binom{M}{x}\frac{B(x+a,\, M-x+b)}{B(a, b)}$$

$$E[x] = \frac{Ma}{a+b}, \qquad \mathrm{var}[x] = \frac{Mab\,(a+b+M)}{(a+b)^2\,(a+b+1)}$$
12 / 20
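A minimal implementation of the beta-binomial pmf, using only the standard library (log-gamma for the Beta function); the hyperparameters below are toy values:

```python
# Beta-binomial pmf Bb(x|a, b, M) computed in log space for stability.
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_pmf(x, a, b, M):
    """Bb(x|a, b, M) = C(M, x) · B(x + a, M − x + b) / B(a, b)."""
    return math.exp(
        math.log(math.comb(M, x))
        + log_beta(x + a, M - x + b)
        - log_beta(a, b)
    )

a, b, M = 2.0, 3.0, 10
pmf = [betabinom_pmf(x, a, b, M) for x in range(M + 1)]
mean = sum(x * p for x, p in enumerate(pmf))   # should match M·a/(a+b)
```

The pmf sums to one and its mean matches the closed form $Ma/(a+b)$ from the slide, which is a quick sanity check on the implementation.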
The Dirichlet-multinomial model
$$D = \{x_1, \ldots, x_N\}, \qquad x_i \in \{1, \ldots, K\}$$

Likelihood:

$$p(D|\theta) = \prod_{k=1}^K \theta_k^{N_k}$$

Prior:

$$\mathrm{Dir}(\theta|\alpha) = \frac{1}{B(\alpha)}\prod_{k=1}^K \theta_k^{\alpha_k-1}\,I(\theta \in S_K)$$

($S_K$ is the probability simplex).
13 / 20
Posterior
$$p(\theta|D) \propto p(D|\theta)\,p(\theta) \propto \prod_{k=1}^K \theta_k^{N_k}\,\theta_k^{\alpha_k-1} = \prod_{k=1}^K \theta_k^{\alpha_k+N_k-1}$$

$$p(\theta|D) = \mathrm{Dir}(\theta|\alpha_1+N_1, \ldots, \alpha_K+N_K)$$

MAP estimate (with $\alpha_0 = \sum_k \alpha_k$):

$$\hat{\theta}_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$$

MLE estimate:

$$\hat{\theta}_k = \frac{N_k}{N}$$
14 / 20
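The MAP and MLE formulas are one-liners given the category counts; the counts and prior below are toy values:

```python
# Dirichlet-multinomial point estimates from category counts.
counts = [5, 0, 3, 2]                 # Nk for K = 4 categories
alpha = [2.0, 2.0, 2.0, 2.0]          # Dirichlet prior hyperparameters
N = sum(counts)
alpha0 = sum(alpha)
K = len(counts)

# MAP: (Nk + αk − 1) / (N + α0 − K); the prior smooths away zero counts.
theta_map = [(n + a - 1) / (N + alpha0 - K) for n, a in zip(counts, alpha)]
# MLE: Nk / N; assigns probability zero to unseen categories.
theta_mle = [n / N for n in counts]
```

Note how the empty category gets $\hat\theta_2 = 0$ under the MLE but positive mass under the MAP estimate, which is the usual argument for smoothing sparse counts.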
Posterior predictive
$$p(X=j|D) = \int p(X=j|\theta)\,p(\theta|D)\,d\theta = \int p(X=j|\theta_j)\left[\int p(\theta_{-j}, \theta_j|D)\,d\theta_{-j}\right]d\theta_j$$

$$= \int \theta_j\,p(\theta_j|D)\,d\theta_j = E[\theta_j|D] = \frac{\alpha_j+N_j}{\sum_k(\alpha_k+N_k)} = \frac{\alpha_j+N_j}{\alpha_0+N}$$

$\theta_{-j}$ — all components except $\theta_j$.
15 / 20
Example. Language modeling
Mary had a little lamb, little lamb, little lamb,
Mary had a little lamb, its fleece as white as snow

Vocabulary (word : id):
mary 1, lamb 2, little 3, big 4, fleece 5, white 6, black 7, snow 8, rain 9, unk 10

Encoded sentence: 1 10 3 2 3 2 3 2 1 10 3 2 10 5 10 6 8

$$p(X=j|D) = E[\theta_j|D] = \frac{\alpha_j+N_j}{\sum_{j'}(\alpha_{j'}+N_{j'})} = \frac{1+N_j}{10+17}$$

for $\alpha = 1$:

$$p(X=j|D) = (3/27,\, 5/27,\, 5/27,\, 1/27,\, 2/27,\, 2/27,\, 1/27,\, 2/27,\, 1/27,\, 5/27)$$
16 / 20
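The predictive vector can be recomputed from the word counts of the sentence (counts taken from the encoded sentence, with $N = 17$ tokens):

```python
# Add-one (α = 1) Dirichlet smoothing over the 10-word vocabulary,
# reproducing the slide's predictive distribution for "Mary had a little lamb".
vocab = ["mary", "lamb", "little", "big", "fleece",
         "white", "black", "snow", "rain", "unk"]
counts = [2, 4, 4, 0, 1, 1, 0, 1, 0, 4]   # Nj per word; N = 17 tokens
alpha = 1.0
N = sum(counts)
alpha0 = alpha * len(vocab)               # α0 = 10

predictive = [(alpha + n) / (alpha0 + N) for n in counts]
# → (3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27)
```

Unseen words ("big", "black", "rain") still receive probability $1/27$ rather than zero, which is exactly why smoothed language models avoid the zero-count problem of the MLE.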
Naive Bayes classifiers
$$x \in \{1, \ldots, K\}^D$$

$$p(x|y=c, \theta) = \prod_{j=1}^D p(x_j|y=c, \theta_{jc})$$

Real-valued features:

$$p(x|y=c, \theta) = \prod_{j=1}^D \mathcal{N}(x_j|\mu_{jc}, \sigma_{jc}^2)$$

Binary features, $x_j \in \{0, 1\}$:

$$p(x|y=c, \theta) = \prod_{j=1}^D \mathrm{Ber}(x_j|\mu_{jc})$$

Categorical features, $x_j \in \{1, \ldots, K\}$:

$$p(x|y=c, \theta) = \prod_{j=1}^D \mathrm{Cat}(x_j|\mu_{jc})$$

17 / 20
MLE for NBC
$$p(x_i, y_i|\theta) = p(y_i|\pi)\prod_j p(x_{ij}|\theta_j) = \prod_c \pi_c^{I(y_i=c)}\,\prod_j\prod_c p(x_{ij}|\theta_{jc})^{I(y_i=c)}$$

$$\log p(D|\theta) = \sum_{c=1}^C N_c\log\pi_c + \sum_{j=1}^D\sum_{c=1}^C\sum_{i:\,y_i=c}\log p(x_{ij}|\theta_{jc})$$

$$\hat{\pi}_c = \frac{N_c}{N}$$

If $x_j|y=c \sim \mathrm{Ber}(\theta_{jc})$:

$$\hat{\theta}_{jc} = \frac{N_{jc}}{N_c}$$
18 / 20
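For binary features the MLE decomposes into simple per-class counting; here is a sketch on a tiny made-up dataset (two features, two classes):

```python
# MLE for a binary-feature naive Bayes classifier:
# class priors π_c = N_c / N and Bernoulli parameters θ_jc = N_jc / N_c.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 1]]   # N = 5 examples, D = 2 features
y = [0, 0, 1, 1, 1]                             # class labels, C = 2

classes = sorted(set(y))
N = len(y)
pi = {c: y.count(c) / N for c in classes}       # π_c = N_c / N
theta = {                                       # θ_jc = N_jc / N_c
    c: [
        sum(x[j] for x, yi in zip(X, y) if yi == c) / y.count(c)
        for j in range(len(X[0]))
    ]
    for c in classes
}
```

Each $\theta_{jc}$ depends only on the examples of class $c$, which is why the log-likelihood above splits into independent per-class, per-feature terms.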
Bayesian naive Bayes
$$p(\theta) = p(\pi)\prod_{j=1}^D\prod_{c=1}^C p(\theta_{jc})$$

$$\pi \sim \mathrm{Dir}(\alpha), \qquad \theta_{jc} \sim \mathrm{Beta}(\beta_0, \beta_1)$$

$$p(\theta|D) = p(\pi|D)\prod_{j=1}^D\prod_{c=1}^C p(\theta_{jc}|D)$$

$$p(\pi|D) = \mathrm{Dir}(N_1+\alpha_1, \ldots, N_C+\alpha_C)$$

$$p(\theta_{jc}|D) = \mathrm{Beta}((N_c-N_{jc})+\beta_0,\; N_{jc}+\beta_1)$$
19 / 20
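Because the posterior factorizes, the Bayesian update is again just counting; a minimal sketch with toy counts, following the parameterization on the slide ($(N_c - N_{jc}) + \beta_0$ first, $N_{jc} + \beta_1$ second):

```python
# Posterior hyperparameters for Bayesian naive Bayes with binary features.
alpha = [1.0, 1.0]                    # Dirichlet prior on class probabilities
beta0, beta1 = 1.0, 1.0               # Beta prior on each θ_jc
Nc = [2, 3]                           # class counts N_c
Njc = [[2, 1], [1, 2]]                # Njc[c][j]: count of x_j = 1 in class c

# p(π|D) = Dir(N_1 + α_1, ..., N_C + α_C)
dir_post = [n + a for n, a in zip(Nc, alpha)]
# p(θ_jc|D) = Beta((N_c − N_jc) + β_0, N_jc + β_1)
beta_post = [
    [((Nc[c] - Njc[c][j]) + beta0, Njc[c][j] + beta1) for j in range(2)]
    for c in range(2)
]
```

The posterior over each $\theta_{jc}$ uses only the counts from class $c$, mirroring the factored prior.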
Classifying documents using bag of words
20 / 20