Models for discrete data
1 / 20
Number game
C — a simple arithmetical concept, e.g. "prime number" or "a number between 1 and 10"

$D = \{x_1, \ldots, x_N\}$ — a set of examples drawn from C. Problem: does a new $x$ belong to C?

$$D = \{16, 8, 2, 64\}$$
2 / 20
Likelihood
$h_{two}$ — "powers of two"

$h_{even}$ — "even numbers"

$$p(D|h) = \left[\frac{1}{\mathrm{size}(h)}\right]^N = \left[\frac{1}{|h|}\right]^N$$

Size principle (Occam's razor): smaller hypotheses consistent with the data get higher likelihood, so for $D = \{16, 8, 2, 64\}$

$$p(D|h_{two}) \gg p(D|h_{even})$$
3 / 20
Posterior
$$p(h|D) = \frac{p(h)\,p(D|h)}{\sum_{h' \in H} p(D, h')} = \frac{p(h)\,I(D \in h)/|h|^N}{\sum_{h' \in H} p(h')\,I(D \in h')/|h'|^N}$$

With enough data the posterior concentrates on the MAP estimate:

$$p(h|D) \to \delta_{h_{MAP}}(h), \qquad h_{MAP} = \operatorname*{argmax}_h \, p(D|h)\,p(h)$$

As $N \to \infty$ the likelihood dominates the prior, so the MAP estimate converges towards the maximum likelihood estimate, or MLE:

$$h_{mle} = \operatorname*{argmax}_h \, p(D|h) = \operatorname*{argmax}_h \, \log p(D|h)$$
4 / 20
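The size principle and the MAP estimate can be sketched numerically. This is a toy version of the number game with just two illustrative hypotheses (numbers up to 100), not the full hypothesis space used in the lecture:

```python
# Toy number game: Bayes' rule over a tiny hypothesis space,
# with the size-principle likelihood p(D|h) = (1/|h|)^N.
hypotheses = {
    "powers of two": {2, 4, 8, 16, 32, 64},
    "even numbers": set(range(2, 101, 2)),
}
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}  # uniform prior
D = {16, 8, 2, 64}

def likelihood(D, ext):
    """p(D|h) = (1/|h|)^N if D is inside h's extension, else 0."""
    return (1.0 / len(ext)) ** len(D) if D <= ext else 0.0

unnorm = {h: prior[h] * likelihood(D, ext) for h, ext in hypotheses.items()}
Z = sum(unnorm.values())
posterior = {h: v / Z for h, v in unnorm.items()}
h_map = max(posterior, key=posterior.get)   # "powers of two" wins decisively
```

Even though both hypotheses are consistent with $D$, the smaller one carries almost all posterior mass, as the size principle predicts.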
Posterior predictive distribution
Bayes model averaging
$$p(x \in C|D) = \sum_h p(y=1|x, h)\,p(h|D)$$

Plug-in approximation

$$p(x \in C|D) \approx \sum_h p(x|h)\,\delta_{\hat{h}}(h) = p(x|\hat{h})$$

A more complex prior

$$p(h) = \pi_0\,p_{rules}(h) + (1-\pi_0)\,p_{interval}(h)$$
5 / 20
The beta-binomial model
$$X_i \sim \mathrm{Ber}(\theta)$$

$$p(D|\theta) = \theta^{N_1}(1-\theta)^{N_0}$$

Sufficient statistics:

$$N_1 = \sum_{i=1}^N I(x_i = 1), \qquad N_0 = \sum_{i=1}^N I(x_i = 0)$$

Count of the number of heads:

$$\mathrm{Bin}(k|n, \theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$$
6 / 20
Prior
If the prior is conjugate:

$$p(\theta) \propto \theta^{\gamma_1}(1-\theta)^{\gamma_2}$$

then the posterior has the same functional form:

$$p(\theta|D) \propto p(D|\theta)\,p(\theta) = \theta^{N_1}(1-\theta)^{N_0}\,\theta^{\gamma_1}(1-\theta)^{\gamma_2} = \theta^{N_1+\gamma_1}(1-\theta)^{N_0+\gamma_2}$$
7 / 20
Posterior
$$p(\theta|D) \propto \mathrm{Bin}(N_1|\theta, N_0+N_1)\,\mathrm{Beta}(\theta|a, b) \propto \mathrm{Beta}(\theta|N_1+a, N_0+b)$$

For two datasets $D_a$, $D_b$ processed in one batch:

$$p(\theta|D_a, D_b) \propto \mathrm{Bin}(N_1|\theta, N_1+N_0)\,\mathrm{Beta}(\theta|a, b) \propto \mathrm{Beta}(\theta|N_1+a, N_0+b)$$

In sequential mode:

$$p(\theta|D_a, D_b) \propto p(D_b|\theta)\,p(\theta|D_a) \propto \mathrm{Beta}(\theta|N_1^a+N_1^b+a,\; N_0^a+N_0^b+b)$$
8 / 20
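The equivalence of batch and sequential updating is easy to verify with toy coin-flip data (the counts and prior below are illustrative):

```python
# Conjugate Beta-Bernoulli updating: processing Da then Db sequentially
# yields the same posterior hyperparameters as one batch update on Da + Db.
a, b = 2.0, 2.0                      # Beta(a, b) prior
Da = [1, 0, 1, 1]                    # coin flips, dataset a
Db = [0, 0, 1]                       # coin flips, dataset b

def update(a, b, data):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + N1, b + N0) posterior."""
    n1 = sum(data)
    return a + n1, b + (len(data) - n1)

batch = update(a, b, Da + Db)        # one pass over all data
a1, b1 = update(a, b, Da)            # sequential: first Da ...
seq = update(a1, b1, Db)             # ... then Db, starting from p(θ|Da)
```

Both routes end at the same $\mathrm{Beta}(\theta\,|\,N_1+a,\,N_0+b)$, which is what makes the beta-binomial model convenient for online learning.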
Posterior mean and mode
$$\theta_{MAP} = \frac{a+N_1-1}{a+b+N-2}, \qquad \theta_{MLE} = \frac{N_1}{N}$$

Posterior mean:

$$\bar{\theta} = \frac{a+N_1}{a+b+N}$$

If $\alpha_0 = a+b$ and the prior mean is $m_1 = a/\alpha_0$:

$$E[\theta|D] = \frac{\alpha_0 m_1 + N_1}{N+\alpha_0} = \frac{\alpha_0}{N+\alpha_0}\,m_1 + \frac{N}{N+\alpha_0}\cdot\frac{N_1}{N} = \lambda m_1 + (1-\lambda)\,\theta_{MLE}$$

with $\lambda = \alpha_0/(N+\alpha_0)$.
9 / 20
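The convex-combination identity above can be checked directly; the prior and counts here are arbitrary toy values:

```python
# Posterior mean of θ as a shrinkage estimate:
# E[θ|D] = λ·m1 + (1 − λ)·θ_MLE, with λ = α0 / (N + α0) and α0 = a + b.
a, b = 2.0, 2.0                       # Beta prior hyperparameters
N1, N0 = 7, 3                         # observed heads / tails
N = N1 + N0

post_mean = (a + N1) / (a + b + N)    # direct posterior mean

alpha0 = a + b
m1 = a / alpha0                       # prior mean
lam = alpha0 / (N + alpha0)           # weight on the prior
theta_mle = N1 / N
mixed = lam * m1 + (1 - lam) * theta_mle
```

As $N$ grows, $\lambda \to 0$ and the posterior mean shrinks towards the MLE rather than the prior.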
Posterior variance
$$\mathrm{var}[\theta|D] = \frac{(a+N_1)(b+N_0)}{(a+N_1+b+N_0)^2\,(a+N_1+b+N_0+1)}$$

If $N \gg a, b$ and $\hat{\theta}$ is the MLE:

$$\mathrm{var}[\theta|D] \approx \frac{N_1 N_0}{N \cdot N \cdot N} = \frac{\hat{\theta}(1-\hat{\theta})}{N}$$
10 / 20
Posterior predictive distribution
$$p(x=1|D) = \int_0^1 p(x=1|\theta)\,p(\theta|D)\,d\theta = \int_0^1 \theta\,\mathrm{Beta}(\theta|a, b)\,d\theta = E[\theta|D] = \frac{a}{a+b}$$

(here $\mathrm{Beta}(\theta|a, b)$ denotes the posterior, i.e. $a$ and $b$ already include the observed counts).
11 / 20
Predicting the outcome of multiple future trials
$$p(x|D, M) = \int_0^1 \mathrm{Bin}(x|\theta, M)\,\mathrm{Beta}(\theta|a, b)\,d\theta = \binom{M}{x}\frac{1}{B(a, b)}\int_0^1 \theta^x(1-\theta)^{M-x}\,\theta^{a-1}(1-\theta)^{b-1}\,d\theta$$

Beta-binomial distribution:

$$Bb(x|a, b, M) = \binom{M}{x}\frac{B(x+a,\, M-x+b)}{B(a, b)}$$

$$E[x] = \frac{Ma}{a+b}, \qquad \mathrm{var}[x] = \frac{Mab\,(a+b+M)}{(a+b)^2\,(a+b+1)}$$
12 / 20
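A minimal implementation of the beta-binomial pmf, using only the standard library (log-gamma for the Beta function); the hyperparameters below are toy values:

```python
# Beta-binomial pmf Bb(x|a, b, M) computed in log space for stability.
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_pmf(x, a, b, M):
    """Bb(x|a, b, M) = C(M, x) · B(x + a, M − x + b) / B(a, b)."""
    return math.exp(
        math.log(math.comb(M, x))
        + log_beta(x + a, M - x + b)
        - log_beta(a, b)
    )

a, b, M = 2.0, 3.0, 10
pmf = [betabinom_pmf(x, a, b, M) for x in range(M + 1)]
mean = sum(x * p for x, p in enumerate(pmf))   # should match M·a/(a+b)
```

The pmf sums to one and its mean matches the closed form $Ma/(a+b)$ from the slide, which is a quick sanity check on the implementation.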
The Dirichlet-multinomial model
$$D = \{x_1, \ldots, x_N\}, \qquad x_i \in \{1, \ldots, K\}$$

Likelihood:

$$p(D|\theta) = \prod_{k=1}^K \theta_k^{N_k}$$

Prior:

$$\mathrm{Dir}(\theta|\alpha) = \frac{1}{B(\alpha)}\prod_{k=1}^K \theta_k^{\alpha_k-1}\,I(\theta \in S_K)$$

($S_K$ is the probability simplex).
13 / 20
Posterior
$$p(\theta|D) \propto p(D|\theta)\,p(\theta) \propto \prod_{k=1}^K \theta_k^{N_k}\,\theta_k^{\alpha_k-1} = \prod_{k=1}^K \theta_k^{\alpha_k+N_k-1}$$

$$p(\theta|D) = \mathrm{Dir}(\theta|\alpha_1+N_1, \ldots, \alpha_K+N_K)$$

MAP estimate (with $\alpha_0 = \sum_k \alpha_k$):

$$\hat{\theta}_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$$

MLE estimate:

$$\hat{\theta}_k = \frac{N_k}{N}$$
14 / 20
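The MAP and MLE formulas are one-liners given the category counts; the counts and prior below are toy values:

```python
# Dirichlet-multinomial point estimates from category counts.
counts = [5, 0, 3, 2]                 # Nk for K = 4 categories
alpha = [2.0, 2.0, 2.0, 2.0]          # Dirichlet prior hyperparameters
N = sum(counts)
alpha0 = sum(alpha)
K = len(counts)

# MAP: (Nk + αk − 1) / (N + α0 − K); the prior smooths away zero counts.
theta_map = [(n + a - 1) / (N + alpha0 - K) for n, a in zip(counts, alpha)]
# MLE: Nk / N; assigns probability zero to unseen categories.
theta_mle = [n / N for n in counts]
```

Note how the empty category gets $\hat\theta_2 = 0$ under the MLE but positive mass under the MAP estimate, which is the usual argument for smoothing sparse counts.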
Posterior predictive
$$p(X=j|D) = \int p(X=j|\theta)\,p(\theta|D)\,d\theta = \int p(X=j|\theta_j)\left[\int p(\theta_{-j}, \theta_j|D)\,d\theta_{-j}\right]d\theta_j$$

$$= \int \theta_j\,p(\theta_j|D)\,d\theta_j = E[\theta_j|D] = \frac{\alpha_j+N_j}{\sum_k(\alpha_k+N_k)} = \frac{\alpha_j+N_j}{\alpha_0+N}$$

$\theta_{-j}$ — all components except $\theta_j$.
15 / 20
Example. Language modeling
Mary had a little lamb, little lamb, little lamb,
Mary had a little lamb, its fleece as white as snow

Vocabulary (word : id):
mary 1, lamb 2, little 3, big 4, fleece 5, white 6, black 7, snow 8, rain 9, unk 10

Encoded sentence: 1 10 3 2 3 2 3 2 1 10 3 2 10 5 10 6 8

$$p(X=j|D) = E[\theta_j|D] = \frac{\alpha_j+N_j}{\sum_{j'}(\alpha_{j'}+N_{j'})} = \frac{1+N_j}{10+17}$$

for $\alpha = 1$:

$$p(X=j|D) = (3/27,\, 5/27,\, 5/27,\, 1/27,\, 2/27,\, 2/27,\, 1/27,\, 2/27,\, 1/27,\, 5/27)$$
16 / 20
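The predictive vector can be recomputed from the word counts of the sentence (counts taken from the encoded sentence, with $N = 17$ tokens):

```python
# Add-one (α = 1) Dirichlet smoothing over the 10-word vocabulary,
# reproducing the slide's predictive distribution for "Mary had a little lamb".
vocab = ["mary", "lamb", "little", "big", "fleece",
         "white", "black", "snow", "rain", "unk"]
counts = [2, 4, 4, 0, 1, 1, 0, 1, 0, 4]   # Nj per word; N = 17 tokens
alpha = 1.0
N = sum(counts)
alpha0 = alpha * len(vocab)               # α0 = 10

predictive = [(alpha + n) / (alpha0 + N) for n in counts]
# → (3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27)
```

Unseen words ("big", "black", "rain") still receive probability $1/27$ rather than zero, which is exactly why smoothed language models avoid the zero-count problem of the MLE.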
Naive Bayes classifiers
$$x \in \{1, \ldots, K\}^D$$

$$p(x|y=c, \theta) = \prod_{j=1}^D p(x_j|y=c, \theta_{jc})$$

Real-valued features:

$$p(x|y=c, \theta) = \prod_{j=1}^D \mathcal{N}(x_j|\mu_{jc}, \sigma_{jc}^2)$$

Binary features, $x_j \in \{0, 1\}$:

$$p(x|y=c, \theta) = \prod_{j=1}^D \mathrm{Ber}(x_j|\mu_{jc})$$

Categorical features, $x_j \in \{1, \ldots, K\}$:

$$p(x|y=c, \theta) = \prod_{j=1}^D \mathrm{Cat}(x_j|\mu_{jc})$$

17 / 20
MLE for NBC
$$p(x_i, y_i|\theta) = p(y_i|\pi)\prod_j p(x_{ij}|\theta_j) = \prod_c \pi_c^{I(y_i=c)}\,\prod_j\prod_c p(x_{ij}|\theta_{jc})^{I(y_i=c)}$$

$$\log p(D|\theta) = \sum_{c=1}^C N_c\log\pi_c + \sum_{j=1}^D\sum_{c=1}^C\sum_{i:\,y_i=c}\log p(x_{ij}|\theta_{jc})$$

$$\hat{\pi}_c = \frac{N_c}{N}$$

If $x_j|y=c \sim \mathrm{Ber}(\theta_{jc})$:

$$\hat{\theta}_{jc} = \frac{N_{jc}}{N_c}$$
18 / 20
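For binary features the MLE decomposes into simple per-class counting; here is a sketch on a tiny made-up dataset (two features, two classes):

```python
# MLE for a binary-feature naive Bayes classifier:
# class priors π_c = N_c / N and Bernoulli parameters θ_jc = N_jc / N_c.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 1]]   # N = 5 examples, D = 2 features
y = [0, 0, 1, 1, 1]                             # class labels, C = 2

classes = sorted(set(y))
N = len(y)
pi = {c: y.count(c) / N for c in classes}       # π_c = N_c / N
theta = {                                       # θ_jc = N_jc / N_c
    c: [
        sum(x[j] for x, yi in zip(X, y) if yi == c) / y.count(c)
        for j in range(len(X[0]))
    ]
    for c in classes
}
```

Each $\theta_{jc}$ depends only on the examples of class $c$, which is why the log-likelihood above splits into independent per-class, per-feature terms.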
Bayesian naive Bayes
$$p(\theta) = p(\pi)\prod_{j=1}^D\prod_{c=1}^C p(\theta_{jc})$$

$$\pi \sim \mathrm{Dir}(\alpha), \qquad \theta_{jc} \sim \mathrm{Beta}(\beta_0, \beta_1)$$

$$p(\theta|D) = p(\pi|D)\prod_{j=1}^D\prod_{c=1}^C p(\theta_{jc}|D)$$

$$p(\pi|D) = \mathrm{Dir}(N_1+\alpha_1, \ldots, N_C+\alpha_C)$$

$$p(\theta_{jc}|D) = \mathrm{Beta}((N_c-N_{jc})+\beta_0,\; N_{jc}+\beta_1)$$
19 / 20
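Because the posterior factorizes, the Bayesian update is again just counting; a minimal sketch with toy counts, following the parameterization on the slide ($(N_c - N_{jc}) + \beta_0$ first, $N_{jc} + \beta_1$ second):

```python
# Posterior hyperparameters for Bayesian naive Bayes with binary features.
alpha = [1.0, 1.0]                    # Dirichlet prior on class probabilities
beta0, beta1 = 1.0, 1.0               # Beta prior on each θ_jc
Nc = [2, 3]                           # class counts N_c
Njc = [[2, 1], [1, 2]]                # Njc[c][j]: count of x_j = 1 in class c

# p(π|D) = Dir(N_1 + α_1, ..., N_C + α_C)
dir_post = [n + a for n, a in zip(Nc, alpha)]
# p(θ_jc|D) = Beta((N_c − N_jc) + β_0, N_jc + β_1)
beta_post = [
    [((Nc[c] - Njc[c][j]) + beta0, Njc[c][j] + beta1) for j in range(2)]
    for c in range(2)
]
```

The posterior over each $\theta_{jc}$ uses only the counts from class $c$, mirroring the factored prior.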
Classifying documents using bag of words
20 / 20