7/31/2019 BNP1models Cvpr2012 Applied Bayesian Nonparametrics
http://slidepdf.com/reader/full/bnp1models-cvpr2012-applied-bayesian-nonparametrics 1/98
Applied Bayesian Nonparametrics
1. Models & Inference
Tutorial at CVPR 2012
Erik Sudderth, Brown University
Additional detail & citations in background chapter:
E. B. Sudderth, Graphical Models for Visual Object Recognition and Tracking, PhD Thesis, MIT, 2006.
Applied Bayesian Nonparametric Statistics
• Learning probabilistic models of visual data.
• Clustering & features, space & time, mostly unsupervised.
• Complex data motivates models of unbounded complexity: we often need to learn the structure of the model itself.
• Not "no parameters"! Models with infinitely many parameters: distributions on uncertain functions, distributions, ...
• Focus on those models which are most useful in practice. To understand those models, we'll start with theory!
Applied BNP: Part I
• Review of parametric Bayesian models
  - Finite mixture models
  - Beta and Dirichlet distributions
• Canonical Bayesian nonparametric (BNP) model families
  - Dirichlet & Pitman-Yor processes for infinite clustering
  - Beta processes for infinite feature induction
• Key representations for BNP learning
  - Infinite-dimensional stochastic processes
  - Stick-breaking constructions
  - Partitions and Chinese restaurant processes
  - Infinite limits of finite, parametric Bayesian models
• Learning and inference algorithms
  - Representation and truncation of infinite models
  - MCMC methods and Gibbs samplers
  - Variational methods and mean field
Coffee Break
Applied BNP: Part II
Bayes Rule (Bayes' Theorem)
• θ: unknown parameters (many possible models)
• D: observed data available for learning
• p(θ): prior distribution (domain knowledge)
• p(D | θ): likelihood function (measurement model)
• p(θ | D): posterior distribution (learned information)

p(θ, D) = p(θ) p(D | θ) = p(D) p(θ | D)

p(θ | D) = p(θ, D) / p(D) = p(D | θ) p(θ) / Σ_{θ′∈Θ} p(D | θ′) p(θ′) ∝ p(D | θ) p(θ)
Gaussian Mixture Models
• Observed feature vectors: x_i ∈ R^d, i = 1, 2, ..., N
• Hidden cluster labels: z_i ∈ {1, 2, ..., K}, i = 1, 2, ..., N
• Hidden mixture means: μ_k ∈ R^d, k = 1, 2, ..., K
• Hidden mixture covariances: Σ_k ∈ R^{d×d}, k = 1, 2, ..., K
• Hidden mixture probabilities: π_k, with Σ_{k=1}^K π_k = 1
• Gaussian mixture marginal likelihood, summing over cluster labels:
p(x_i | z_i, π, μ, Σ) = N(x_i | μ_{z_i}, Σ_{z_i})
p(x_i | π, μ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | μ_{z_i}, Σ_{z_i})
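To make the marginal likelihood concrete, the sum over cluster labels can be evaluated directly. This is a minimal numpy sketch added for illustration (not from the original slides); the function names and three-component example parameters are arbitrary:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def mixture_density(x, pi, mus, Sigmas):
    """Marginal likelihood p(x | pi, mu, Sigma): sum over the K cluster labels."""
    return sum(pi[k] * gaussian_density(x, mus[k], Sigmas[k])
               for k in range(len(pi)))

# Arbitrary 3-component mixture in 2D
pi = np.array([0.5, 0.3, 0.2])
mus = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2), np.eye(2)]
p = mixture_density(np.array([0.0, 0.0]), pi, mus, Sigmas)
```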
Gaussian Mixture Models
Mixture of 3 Gaussian distributions in 2D.
[Figure: contour plot of the joint density, marginalizing cluster assignments]
p(x_i | z_i, π, μ, Σ) = N(x_i | μ_{z_i}, Σ_{z_i})
p(x_i | π, μ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | μ_{z_i}, Σ_{z_i})
Gaussian Mixture Models
[Figure: surface plot of the joint density, marginalizing cluster assignments]
Clustering with Gaussian Mixtures
[Figure (a): complete data labeled by true cluster assignments; (b): incomplete data, points to be clustered]
C. Bishop, Pattern Recognition & Machine Learning
Inference Given Cluster Parameters
[Figure (b): incomplete data, points to be clustered; (c): posterior probabilities of assignment to each cluster]
r_ik = p(z_i = k | x_i, π, θ) = π_k p(x_i | θ_k) / Σ_{ℓ=1}^K π_ℓ p(x_i | θ_ℓ)
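A minimal numpy sketch (an illustration added to this transcript, not from the slides) of computing these responsibilities, working in log space for numerical stability; the example parameters are arbitrary:

```python
import numpy as np

def responsibilities(x, pi, mus, Sigmas):
    """Posterior probabilities r_ik of assigning point x to each Gaussian cluster."""
    K, d = len(pi), len(x)
    logp = np.empty(K)
    for k in range(K):
        diff = x - mus[k]
        _, logdet = np.linalg.slogdet(Sigmas[k])
        quad = diff @ np.linalg.solve(Sigmas[k], diff)
        logp[k] = np.log(pi[k]) - 0.5 * (quad + logdet + d * np.log(2 * np.pi))
    logp -= logp.max()          # log-sum-exp trick avoids underflow
    r = np.exp(logp)
    return r / r.sum()

# Arbitrary two-cluster example: the point lies near the second cluster mean
r = responsibilities(np.array([2.9, 0.1]),
                     np.array([0.5, 0.5]),
                     [np.zeros(2), np.array([3.0, 0.0])],
                     [np.eye(2), np.eye(2)])
```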
Learning Binary Probabilities
Bernoulli Distribution: Single toss of a (possibly biased) coin
Ber(x | θ) = θ^{I(x=1)} (1 − θ)^{I(x=0)}, 0 ≤ θ ≤ 1
• Suppose we observe N samples from a Bernoulli distribution with unknown mean: X_i ∼ Ber(θ), i = 1, ..., N
p(x_1, ..., x_N | θ) = θ^{N_1} (1 − θ)^{N_0}
• What is the maximum likelihood parameter estimate?
θ̂ = arg max_θ log p(x | θ) = N_1 / N
Beta Distributions
Probability density function on x ∈ [0, 1]:
Beta(x | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] x^{a−1} (1 − x)^{b−1}
where Γ(k) = (k − 1)! for positive integers k, and Γ(x + 1) = x Γ(x).
Beta Distributions
E[x] = a / (a + b)
V[x] = ab / ((a + b)^2 (a + b + 1))
Mode[x] = arg max_{x∈[0,1]} Beta(x | a, b) = (a − 1) / ((a − 1) + (b − 1)), for a, b > 1
Bayesian Learning of Probabilities
Bernoulli Likelihood: Single toss of a (possibly biased) coin
Ber(x | θ) = θ^{I(x=1)} (1 − θ)^{I(x=0)}, 0 ≤ θ ≤ 1
p(x_1, ..., x_N | θ) = θ^{N_1} (1 − θ)^{N_0}
Beta Prior Distribution:
p(θ) = Beta(θ | a, b) ∝ θ^{a−1} (1 − θ)^{b−1}
Posterior Distribution:
p(θ | x) ∝ θ^{N_1+a−1} (1 − θ)^{N_0+b−1} ∝ Beta(θ | N_1 + a, N_0 + b)
• This is a conjugate prior, because the posterior is in the same family
• Estimate θ by the posterior mode (MAP) or mean (preferred)
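In practice this conjugate update just adds the observed counts to the prior pseudo-counts. A minimal sketch added for illustration (not from the slides; prior and data are arbitrary):

```python
def beta_posterior(a, b, xs):
    """Conjugate update: Beta(a, b) prior plus Bernoulli data gives Beta(a + N1, b + N0)."""
    n1 = sum(xs)
    n0 = len(xs) - n1
    return a + n1, b + n0

def beta_mean(a, b):
    """Posterior mean estimate of theta."""
    return a / (a + b)

def beta_mode(a, b):
    """Posterior mode (MAP) estimate, valid for a, b > 1."""
    return (a - 1) / ((a - 1) + (b - 1))

# Arbitrary example: Beta(2, 2) prior, then 6 heads and 2 tails
a_post, b_post = beta_posterior(2.0, 2.0, [1, 1, 0, 1, 1, 1, 0, 1])
```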
Sequence of Beta Posteriors
[Figure: Beta posterior densities after n = 5, 50, and 100 observations, concentrating around the true parameter value]
Murphy, 2012
Multinomial Simplex
Learning Categorical Probabilities
Categorical Distribution: Single roll of a (possibly biased) die
Cat(x | θ) = Π_{k=1}^K θ_k^{x_k}, x ∈ X = {0, 1}^K with Σ_{k=1}^K x_k = 1
• If we have N_k observations of outcome k in N trials:
p(x_1, ..., x_N | θ) = Π_{k=1}^K θ_k^{N_k}
• The maximum likelihood parameter estimates are then:
θ̂ = arg max_θ log p(x | θ), with θ̂_k = N_k / N
• Will this produce sensible predictions when K is large?
For nonparametric models we let K approach infinity!
Dirichlet Distributions
[Figure: Dirichlet density surfaces on the probability simplex for α = 10.00 and α = 0.10]
Moments: E[θ_k] = α_k / α_0, where α_0 = Σ_{j=1}^K α_j
Marginal Distributions: θ_k ∼ Beta(α_k, α_0 − α_k)
Dirichlet Probability Densities
Dirichlet Samples
Dir(θ | 0.1, 0.1, 0.1, 0.1, 0.1) versus Dir(θ | 1.0, 1.0, 1.0, 1.0, 1.0)
Bayesian Learning of Probabilities
Categorical Distribution: Single roll of a (possibly biased) die
Cat(x | θ) = Π_{k=1}^K θ_k^{x_k}, x ∈ X = {0, 1}^K with Σ_{k=1}^K x_k = 1
p(x_1, ..., x_N | θ) = Π_{k=1}^K θ_k^{N_k}
Dirichlet Prior Distribution:
p(θ) = Dir(θ | α) ∝ Π_{k=1}^K θ_k^{α_k−1}
Posterior Distribution:
p(θ | x) ∝ Π_{k=1}^K θ_k^{N_k+α_k−1} ∝ Dir(θ | N_1 + α_1, ..., N_K + α_K)
• This is a conjugate prior, because the posterior is in the same family
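The Dirichlet update mirrors the Beta case, adding per-outcome counts to the prior. A minimal sketch added for illustration (not from the slides; the die example is arbitrary):

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dir(alpha) prior plus categorical counts N_k gives Dir(alpha + N)."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def predictive(alpha_post):
    """Posterior mean, i.e. the predictive probability of each outcome."""
    return alpha_post / alpha_post.sum()

# Arbitrary example: uniform Dir(1, 1, 1) prior over a 3-sided die,
# with counts N = (5, 0, 1)
post = dirichlet_posterior([1.0, 1.0, 1.0], [5, 0, 1])
pred = predictive(post)
```

Unlike the maximum likelihood estimate, the never-observed outcome keeps positive predictive probability, which is the behavior needed as K grows large.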
Directed Graphical Models
• nodes → random variables
• a(t) → parents with edges pointing to node t
The chain rule implies that any joint distribution equals:
p(z_1, ..., z_T) = Π_{t=1}^T p(z_t | z_1, ..., z_{t−1})
A directed graphical model implies a restricted factorization:
p(z_1, ..., z_T) = Π_{t=1}^T p(z_t | z_{a(t)})
Valid for any directed acyclic graph (DAG): equivalent to dropping conditional dependencies in the standard chain rule.
Plates: Learning with Priors
• Boxes, or plates, indicate replication of variables
• Variables which are observed, or fixed, are often shaded
• Prior distributions may themselves have hyperparameters λ
Gaussian Mixture Models
[Figure: contour plot of a 2D Gaussian mixture density]
p(x_i | z_i, π, μ, Σ) = N(x_i | μ_{z_i}, Σ_{z_i})
p(x_i | π, μ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | μ_{z_i}, Σ_{z_i})
Finite Bayesian Mixture Models
• Cluster frequencies: Symmetric Dirichlet prior on π
• Cluster shapes: Any valid prior on the chosen family (e.g., Gaussian mean & covariance)
• Data: Assign each data item to a cluster, and sample from that cluster's likelihood:
z_i ∼ Cat(π)
x_i ∼ F(θ_{z_i})
Generative Gaussian Mixture Samples
Learning is simplest with conjugate priors on cluster shapes:
• Gaussian with known variance: Gaussian prior on the mean
• Gaussian with unknown mean & variance: normal-inverse-Wishart prior
Mixtures as Discrete Measures
• Define the mixture via a discrete probability measure on cluster parameters:
G(θ) = Σ_{k=1}^K π_k δ_{θ_k}(θ)
where δ_{θ_k} is an atom (point mass, Dirac delta) at θ_k.
• Generate data via repeated draws from G: θ̄_i = θ_{z_i}
Toy visualization: 1D Gaussian mixture with unknown cluster means and fixed variance.
Mixtures Induce Partitions
• If our goal is clustering, the output grouping is defined by assignment indicator variables: z_i ∼ Cat(π)
• The number of ways of assigning N data points to K mixture components is K^N
• If K ≥ N, this is much larger than the number of ways of partitioning that data:
N=3: 5 partitions versus 3^3 = 27 assignments
Mixtures Induce Partitions
• If our goal is clustering, the output grouping is defined by assignment indicator variables: z_i ∼ Cat(π)
• The number of ways of assigning N data points to K mixture components is K^N
• If K ≥ N, this is much larger than the number of ways of partitioning that data:
N=5: 52 partitions versus 5^5 = 3125 assignments
For any clustering, there is a unique partition, but many ways to label that partition's blocks.
Courtesy Wikipedia
Dirichlet Process Mixtures
• The Dirichlet Process (DP): A distribution on countably infinite discrete probability measures. Sampling yields a Polya urn.
• Infinite Mixture Models: An infinite limit of finite mixtures with Dirichlet weight priors.
• Stick-Breaking: An explicit construction for the weights in DP realizations.
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a DP prior.
Dirichlet Process Mixtures
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a DP prior.
Nonparametric Clustering
• Large support: All partitions of the data, from one giant cluster to N singletons, have positive probability under the prior
• Exchangeable: Partition probabilities are invariant to permutations of the data
• Desirable: Good asymptotics, computational tractability, flexibility, and ease of generalization!
[Figure: binary assignment matrix, clusters (arbitrarily ordered) versus observations (arbitrarily ordered)]
Ghahramani, BNP 2009
Chinese Restaurant Process (CRP)
• Visualize clustering as a sequential process of customers sitting at tables in an (infinitely large) restaurant:
customers: observed data to be clustered
tables: distinct blocks of partition, or clusters
• The first customer sits at a table. Subsequent customers randomly select a table according to:
p(z_{N+1} = z | z_1, ..., z_N) = 1/(α + N) [ Σ_{k=1}^K N_k δ(z, k) + α δ(z, k̄) ]
K: number of tables occupied by the first N customers
N_k: number of customers seated at table k
k̄: a new, previously unoccupied table
α: positive concentration parameter
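The seating rule is easy to simulate directly. A minimal sketch added for illustration (not from the slides; function names and parameters are arbitrary):

```python
import random

def sample_crp(N, alpha, seed=0):
    """Sample a partition of N customers from the CRP with concentration alpha."""
    rng = random.Random(seed)
    counts = []                  # counts[k] = N_k, customers at table k
    labels = []
    for _ in range(N):
        # P(existing table k) proportional to N_k; P(new table) proportional to alpha
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)     # open a new table
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

labels, counts = sample_crp(100, alpha=2.0)
```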
Chinese Restaurant Process (CRP)
CRPs & Exchangeable Partitions
• The probability of a seating arrangement of N customers is independent of the order they enter the restaurant:
p(z_1, ..., z_N | α) = [Γ(α) / Γ(N + α)] α^K Π_{k=1}^K Γ(N_k)
- Normalization constants from each arriving customer: (1/α) · (1/(1 + α)) · (1/(2 + α)) ··· (1/(N − 1 + α)) = Γ(α) / Γ(N + α)
- First customer to sit at each table: one factor of α per table, giving α^K
- Other customers joining each table: 1 · 2 ··· (N_k − 1) = (N_k − 1)! = Γ(N_k)
• The CRP is thus a prior on infinitely exchangeable partitions
De Finetti's Theorem
• Finitely exchangeable random variables satisfy, for any permutation τ of {1, ..., N}:
p(x_1, ..., x_N) = p(x_{τ(1)}, ..., x_{τ(N)})
• A sequence is infinitely exchangeable if every finite subsequence is exchangeable
• Exchangeable variables need not be independent, but always have a representation with conditional independencies:
p(x_1, ..., x_N) = ∫ [ Π_{i=1}^N p(x_i | G) ] dP(G)
An explicit construction is useful in hierarchical modeling!
De Finetti’s Directed Graph
What distribution underlies the infinitely exchangeable CRP?
Dirichlet Process Mixtures
• The Dirichlet Process (DP): A distribution on countably infinite discrete probability measures. Sampling yields a Polya urn.
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a DP prior.
Dirichlet Processes (Ferguson 1973)
• Given a base measure (distribution) H and a concentration parameter α > 0
• Then for any finite partition (T_1, ..., T_K) of the parameter space, the distribution of the measure of those cells is Dirichlet:
(G(T_1), ..., G(T_K)) ∼ Dir(αH(T_1), ..., αH(T_K))
Dirichlet Processes (Ferguson 1973)
• The marginalization properties of finite Dirichlet distributions satisfy Kolmogorov's extension theorem for stochastic processes, so a process G with these Dirichlet finite-dimensional marginals exists.
DP Posteriors and Conjugacy
G ∼ DP(α, H), θ̄_i ∼ G, i = 1, ..., N
• Does the posterior distribution of G have a tractable form?
• For any partition, the posterior mean given N observations interpolates between H and the data's empirical distribution.
• In fact, the posterior distribution is another Dirichlet process, with mean that depends on the data's empirical distribution:
G | θ̄_1, ..., θ̄_N ∼ DP(α + N, (αH + Σ_{i=1}^N δ_{θ̄_i}) / (α + N))
DPs and Polya Urns
G ∼ DP(α, H), θ̄_i ∼ G, i = 1, ..., N
• Can we simulate observations without constructing G?
• Yes, by a variation on the classical balls-in-urns analogy:
- Consider an urn containing α pounds of very tiny, colored sand (the space of possible colors is Θ)
- Take out one grain of sand, and record its color as θ̄_1
- Put that grain back, and add 1 extra pound of sand of that color
- Repeat this process!
Dirichlet Process Mixtures
• The Dirichlet Process (DP): A distribution on countably infinite discrete probability measures. Sampling yields a Polya urn.
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a DP prior.
• Stick-Breaking: An explicit construction for the weights in DP realizations.
A Stick-Breaking Construction (Sethuraman 1994)
• Dirichlet process realizations are discrete with probability one:
G ∼ DP(α, H), G(θ) = Σ_{k=1}^∞ π_k δ_{θ_k}(θ)
• Cluster shape parameters drawn from the base measure: θ_k ∼ H
• Cluster weights drawn from a stick-breaking process with concentration parameter α:
β_k ∼ Beta(1, α), π_k = β_k Π_{ℓ=1}^{k−1} (1 − β_ℓ)
Dirichlet Stick-Breaking
β_k ∼ Beta(1, α), E[β_k] = 1 / (1 + α)
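Stick-breaking is also how truncated DP weights are simulated in practice. A minimal sketch added for illustration (not from the slides; the truncation level is arbitrary):

```python
import numpy as np

def stick_breaking(alpha, K_trunc, seed=0):
    """Truncated stick-breaking sample of DP weights: beta_k ~ Beta(1, alpha)."""
    rng = np.random.default_rng(seed)
    beta = rng.beta(1.0, alpha, size=K_trunc)
    # pi_k = beta_k * prod_{l < k} (1 - beta_l): the piece broken off what remains
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    return beta * remaining

pi = stick_breaking(alpha=1.0, K_trunc=50)
```

The unbroken remainder Π_k (1 − β_k) shrinks geometrically, so a moderate truncation captures almost all of the mass.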
DPs and Stick Breaking
Dirichlet Process Mixtures
• The Dirichlet Process (DP): A distribution on countably infinite discrete probability measures. Sampling yields a Polya urn.
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a DP prior.
• Stick-Breaking: An explicit construction for the weights in DP realizations.
• Infinite Mixture Models: An infinite limit of finite mixtures with Dirichlet weight priors.
DP Mixture Models
G ∼ DP(α, H), G(θ) = Σ_{k=1}^∞ π_k δ_{θ_k}(θ)
θ̄_i ∼ G, x_i ∼ F(θ̄_i)
Equivalently:
π ∼ Stick(α), θ_k ∼ H(λ)
z_i ∼ Cat(π), x_i ∼ F(θ_{z_i})
• Stick-breaking: Explicit size-biased sampling of the weights π
• Chinese restaurant process: Indicator sequence z_1, z_2, z_3, ...
• Polya urn: Corresponding parameter sequence θ̄_1, θ̄_2, θ̄_3, ...
Samples from DP Mixture Priors
N=50
Samples from DP Mixture Priors
N=200
Samples from DP Mixture Priors
N=1000
Finite versus DP Mixtures
Finite Mixture:
π ∼ Dir(α/K, ..., α/K), θ_k ∼ H
z_i ∼ Cat(π), x_i ∼ F(θ_{z_i})
G_K(θ) = Σ_{k=1}^K π_k δ_{θ_k}(θ)
DP Mixture:
π ∼ Stick(α), θ_k ∼ H, so that G ∼ DP(α, H)
z_i ∼ Cat(π), x_i ∼ F(θ_{z_i})
THEOREM: For any measurable function f, as K → ∞ the finite mixture measures converge to the DP: ∫ f(θ) G_K(dθ) converges in distribution to ∫ f(θ) G(dθ).
Finite versus CRP Partitions
Finite Dirichlet: π ∼ Dir(α/K, ..., α/K), z_i ∼ Cat(π)
p(z_1, ..., z_N | α) = [Γ(α) / Γ(N + α)] (α/K)^{K₊} Π_{k=1}^{K₊} Π_{j=1}^{N_k−1} (j + α/K)
Chinese Restaurant Process:
p(z_1, ..., z_N | α) = [Γ(α) / Γ(N + α)] α^{K₊} Π_{k=1}^{K₊} (N_k − 1)!
K₊: number of occupied blocks (clusters) in the partition
• The probability of any particular set of Dirichlet indicators approaches zero as K → ∞
• The probability of the induced Dirichlet partition approaches the CRP as K → ∞
Dirichlet Process Mixtures
• The Dirichlet Process (DP): A distribution on countably infinite discrete probability measures. Sampling yields a Polya urn.
• Infinite Mixture Models: An infinite limit of finite mixtures with Dirichlet weight priors.
• Stick-Breaking: An explicit construction for the weights in DP realizations.
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a DP prior.
Pitman-Yor Process Mixtures
• Infinite Mixture Models: But not an infinite limit of finite mixtures with symmetric weight priors.
• Stick-Breaking: An explicit construction for the weights in PY realizations.
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a PY prior.
Pitman-Yor Processes
The Pitman-Yor process PY(a, b, H) defines a distribution on infinite discrete measures, or partitions, with discount parameter 0 ≤ a < 1 and concentration parameter b > −a.
Dirichlet process: recovered as the special case a = 0, b = α.
Dirichlet Stick-Breaking
All stick indices k share the same proportion distribution: β_k ∼ Beta(1, α)
Pitman-Yor Stick-Breaking
β_k ∼ Beta(1 − a, b + ka): later sticks keep more mass on average, producing heavier-tailed weights.
Chinese Restaurant Process (CRP)
customers: observed data to be clustered
tables: distinct blocks of partition, or clusters
• Partitions sampled from the PY process can be generated via a generalized CRP, which remains exchangeable
• The first customer sits at a table. Subsequent customers randomly select a table according to:
p(z_{N+1} = z | z_1, ..., z_N) = 1/(b + N) [ Σ_{k=1}^K (N_k − a) δ(z, k) + (b + Ka) δ(z, k̄) ]
K: number of tables occupied by the first N customers
N_k: number of customers seated at table k
k̄: a new, previously unoccupied table
a, b: discount & concentration parameters, 0 ≤ a < 1, b > −a
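The generalized seating rule simulates just as easily as the DP case; setting a = 0 recovers the ordinary CRP. A minimal sketch added for illustration (not from the slides; parameters are arbitrary):

```python
import random

def sample_py_crp(N, a, b, seed=0):
    """Generalized (Pitman-Yor) CRP; a = 0, b = alpha recovers the DP case."""
    assert 0 <= a < 1 and b > -a
    rng = random.Random(seed)
    counts, labels = [], []
    for _ in range(N):
        K = len(counts)
        # P(table k) proportional to N_k - a; P(new table) proportional to b + K*a
        weights = [c - a for c in counts] + [b + K * a]
        k = rng.choices(range(K + 1), weights=weights)[0]
        if k == K:
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

labels_py, counts_py = sample_py_crp(200, a=0.5, b=1.0)
```

The discount a subtracts mass from every occupied table and adds it to the new-table probability, which is what produces power-law cluster growth.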
Human Image Segmentations
Labels for more than 29,000 segments in 2,688 images of natural scenes
Statistics of Human Segments
How many objects are in this image? Many small objects, some large objects.
Object sizes follow a power law.
Statistics of Semantic Labels
How frequent are text region labels?
[Figure: label counts ranked by frequency, from common (sky, trees, person) to rare (lichen, rainbow, wheelbarrow, waterfall)]
Why Pitman-Yor?
Jim Pitman & Marc Yor
Generalizing the Dirichlet Process
- The distribution on partitions leads to a generalized Chinese restaurant process
- Special cases of interest in probability: Markov chains, Brownian motion, ...
Power Law Distributions (DP versus PY)
- Heaps' Law: the number of unique clusters in N observations grows logarithmically for the DP, but as a power of N for the PY process
- Zipf's Law: the size of the k-th largest cluster weight decays as a power law under the PY process
Natural Language Statistics: Goldwater, Griffiths, & Johnson, 2005; Teh, 2006
An Aside: Toy Dataset Bias
[Figure: six low-dimensional toy clustering benchmarks, each annotated with its number of clusters; panel labels garbled in extraction]
Ng, Jordan, & Weiss, NIPS 2001
Pitman-Yor Process Mixtures
• Infinite Mixture Models: But not an infinite limit of finite mixtures with symmetric weight priors.
• Stick-Breaking: An explicit construction for the weights in PY realizations.
• Chinese Restaurant Process (CRP): The distribution on partitions induced by a PY prior.
Dirichlet processes and finite Dirichlet distributions do not produce heavy-tailed, power law distributions.
Latent Feature Models
• Latent feature modeling: Each group of observations is associated with a subset of the possible latent features
• Factorial power: There are 2^K combinations of K features, while accurate mixture modeling may require many more clusters
• Question: What is the analog of the DP for feature modeling?
A binary matrix indicates feature presence/absence. Depending on the application, features can be associated with any parameter value of interest.
Nonparametric Binary Features
• The Beta Process (BP): A Lévy process whose realizations are countably infinite collections of atoms, with mass between 0 and 1.
• Infinite Feature Models: An infinite limit of a finite beta-Bernoulli binary feature model.
• Stick-Breaking: An explicit construction for the feature frequencies in BP realizations.
• Indian Buffet Process (IBP): The distribution on sparse binary matrices induced by a BP.
Poisson Distribution for Counts
Poi(x | θ) = e^{−θ} θ^x / x!, x ∈ X = {0, 1, 2, 3, ...}, θ > 0
Indian Buffet Process (IBP)
• Visualize feature assignment as a sequential process of customers sampling dishes from an (infinitely long) buffet:
customers: observed data to be modeled
dishes: binary features to be selected
• The first customer chooses Poisson(α) dishes, with α > 0
• Subsequent customer i randomly samples each previously tasted dish k with probability f_ik ∼ Ber(m_k / i), where m_k is the number of previous customers to sample dish k
• That customer also samples Poisson(α/i) new dishes
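The buffet can be simulated row by row, building up the binary feature matrix. A minimal sketch added for illustration (not from the slides; names and parameters are arbitrary):

```python
import numpy as np

def sample_ibp(N, alpha, seed=0):
    """Sample a binary feature matrix Z from the Indian buffet process."""
    rng = np.random.default_rng(seed)
    m = []                        # m[k] = customers who have tried dish k
    rows = []
    for i in range(1, N + 1):
        row = [int(rng.random() < mk / i) for mk in m]   # old dishes: Ber(m_k / i)
        for k, f in enumerate(row):
            m[k] += f
        new = int(rng.poisson(alpha / i))                # new dishes: Poisson(alpha / i)
        m.extend([1] * new)
        row.extend([1] * new)
        rows.append(row)
    Z = np.zeros((N, len(m)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(20, alpha=3.0)
```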
Binary Feature Realizations
• The IBP is exchangeable, up to a permutation of the order in which dishes are listed in the binary feature matrix
• Clustering models like the DP have one "feature" per customer
• The number of features sampled at least once is O(α log N)
Ghahramani, BNP 2009
Finite Beta-Bernoulli Features
π_k ∼ Beta(α/K, 1), z_ik ∼ Ber(π_k)
• The expected number of active feature entries for N customers is Nα / (1 + α/K), which approaches Nα as K → ∞
• The marginal probability of the realized binary matrix, with m_k = Σ_i z_ik, equals:
p(Z | α) = Π_{k=1}^K (α/K) Γ(m_k + α/K) Γ(N − m_k + 1) / Γ(N + 1 + α/K)
Beta-Bernoulli and the IBP
• We can show that the limit of the finite beta-Bernoulli model, and the IBP, produce the same distribution on left-ordered-form equivalence classes of binary matrices
• The Poisson distribution in the IBP arises from the law of rare events: flip K coins, each with probability α/K of coming up heads; as K → ∞, the distribution of the total number of heads approaches Poisson(α)
Nonparametric Binary Features
• The Beta Process (BP): A Lévy process whose realizations are countably infinite collections of atoms, with mass between 0 and 1.
• Infinite Feature Models: An infinite limit of a finite beta-Bernoulli binary feature model.
• Stick-Breaking: An explicit construction for the feature frequencies in BP realizations.
• Indian Buffet Process (IBP): The distribution on sparse binary matrices induced by a BP.
Extensions: Additional control over feature sharing, power laws, ...
Nonparametric Learning
• Finite Bayesian Models: Set the finite model order to be larger than the expected number of clusters or features.
• Stick-Breaking: Truncate stick-breaking to produce a provably accurate approximation.
• CRP & IBP: Tractably learn via finite summaries of the true, infinite model.
• Infinite Stochastic Processes: Conceptually useful, but usually impractical or impossible for learning algorithms.
Markov Chain Monte Carlo

•! At each time point, the state z(t) is a configuration of all the variables in the model: parameters, hidden variables, etc. New states are generated by a transition distribution:

z(0) → z(1) → z(2) → · · ·        z(t+1) ∼ q(z | z(t))

•! We design the transition distribution q(z | z(t)) so that the chain is irreducible and ergodic, with a unique stationary distribution p∗(z):

p∗(z) = ∫ q(z | z′) p∗(z′) dz′

•! For learning, the target equilibrium distribution is usually the posterior distribution given data x:

p∗(z) = p(z | x)

•! Popular recipes: Metropolis-Hastings and Gibbs samplers
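As an illustration of the Metropolis-Hastings recipe, here is a minimal random-walk sampler for a generic log-density (a sketch with names and target of my choosing, not code from the tutorial):

```python
import math
import random

def metropolis_hastings(log_target, z0, num_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler: propose z' ~ Normal(z, step_size^2)
    and accept with probability min(1, p*(z') / p*(z)).  The symmetric
    proposal makes the Hastings correction cancel."""
    rng = random.Random(seed)
    z, chain = z0, []
    for _ in range(num_steps):
        proposal = z + rng.gauss(0.0, step_size)
        log_accept = log_target(proposal) - log_target(z)
        if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
            z = proposal                  # accept; otherwise keep old state
        chain.append(z)
    return chain

# Target the standard normal, p*(z) proportional to exp(-z^2 / 2).
chain = metropolis_hastings(lambda z: -0.5 * z * z, z0=3.0, num_steps=30000)
kept = chain[3000:]                       # discard burn-in
mean = sum(kept) / len(kept)
var = sum((z - mean) ** 2 for z in kept) / len(kept)
```

After burn-in the chain's sample mean and variance approach the target's values of 0 and 1, even though the chain was initialized far in the tail.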
Gibbs Sampler for a 2D Gaussian

C. Bishop, Pattern Recognition & Machine Learning

[Figure: Gibbs sampling trajectory over a correlated 2D Gaussian, axes z1 and z2]

General Gibbs Sampler

Each step resamples one variable i = i(t) from its full conditional, holding all others fixed:

z(t)_i ∼ p(z_i | z(t−1)_\i)    for i = i(t)
z(t)_j = z(t−1)_j              for j ≠ i(t)

Under mild conditions, converges assuming all variables are resampled infinitely often (order can be fixed or random).
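For the 2D Gaussian example the full conditionals are themselves Gaussian, so the general recipe specializes to a two-line sweep. An illustrative sketch (assuming zero means, unit variances, and correlation rho, which is the standard setup for this example; not code from the tutorial):

```python
import math
import random

def gibbs_bivariate_gaussian(rho, num_sweeps, seed=0):
    """Gibbs sampler for a zero-mean bivariate Gaussian with unit variances
    and correlation rho.  Each full conditional is Gaussian:
    z1 | z2 ~ Normal(rho * z2, 1 - rho^2), and symmetrically for z2."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    z1 = z2 = 0.0
    samples = []
    for _ in range(num_sweeps):
        z1 = rng.gauss(rho * z2, cond_sd)   # resample z1 | z2
        z2 = rng.gauss(rho * z1, cond_sd)   # resample z2 | z1
        samples.append((z1, z2))
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8, num_sweeps=30000)
```

The stronger the correlation, the smaller the conditional standard deviation relative to the marginal one, and the slower the zig-zag path explores the joint distribution; this is the mixing pathology Bishop's figure illustrates.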
Finite Mixture Gibbs Sampler
Most basic approach: alternately sample the assignments z, mixture weights π, and cluster parameters θ
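One sweep of this basic (uncollapsed) sampler, sketched for a 1D Gaussian mixture with known variance. This is a simplified illustration: the symmetric Dirichlet(α) prior on weights, the Normal(0, τ²) prior on means, and all names are my assumptions, not the tutorial's exact model.

```python
import math
import random

def _categorical(rng, w):
    """Sample an index proportional to the (unnormalized) weights w."""
    u = rng.random() * sum(w)
    for k, wk in enumerate(w):
        u -= wk
        if u <= 0.0:
            return k
    return len(w) - 1

def gibbs_sweep(x, z, mu, pi, alpha=1.0, sigma=1.0, tau=10.0, rng=None):
    """One sweep of the basic Gibbs sampler for a K-component 1D Gaussian
    mixture: resample assignments z, then weights pi, then means mu."""
    rng = rng or random.Random(0)
    K, N = len(mu), len(x)
    # 1) Resample assignments z_n | pi, mu.
    for n in range(N):
        w = [pi[k] * math.exp(-0.5 * ((x[n] - mu[k]) / sigma) ** 2)
             for k in range(K)]
        z[n] = _categorical(rng, w)
    # 2) Resample weights pi | z from the Dirichlet posterior.
    g = [rng.gammavariate(alpha + z.count(k), 1.0) for k in range(K)]
    total = sum(g)
    pi[:] = [gi / total for gi in g]
    # 3) Resample means mu_k | z, x from the Gaussian posterior.
    for k in range(K):
        xs = [x[n] for n in range(N) if z[n] == k]
        prec = 1.0 / tau ** 2 + len(xs) / sigma ** 2
        mu[k] = rng.gauss((sum(xs) / sigma ** 2) / prec,
                          math.sqrt(1.0 / prec))
    return z, mu, pi
```

Iterating this sweep on well-separated data quickly locks in the correct partition and moves the component means to the cluster centers, which is exactly the behavior shown on the following iteration slides.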
Standard Finite Mixture Sampler
Standard Sampler: 2 Iterations
Standard Sampler: 10 Iterations
Standard Sampler: 50 Iterations
Collapsed Finite Bayesian Mixture

•! Conjugate priors allow analytic integration of some parameters
•! Resulting sampler operates on a reduced space of cluster assignments (implicitly considers all possible cluster shapes)
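The collapsed assignment update can be written explicitly. A standard form under a symmetric Dirichlet(α/K) prior on the weights (my notation, with λ the hyperparameters of the conjugate prior on cluster shapes; not copied from the slides):

```latex
p(z_n = k \mid z_{\setminus n}, x, \alpha, \lambda)
  \;\propto\;
  \left( N_k^{\setminus n} + \frac{\alpha}{K} \right)
  \, p\!\left( x_n \mid \{ x_m : z_m = k,\; m \neq n \}, \lambda \right)
```

Here N_k^{\setminus n} counts the other points currently assigned to cluster k, and the second factor is the posterior predictive likelihood of x_n given those points, available in closed form for conjugate priors.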
Collapsed Finite Mixture Sampler
Standard versus Collapsed Samplers
DP Mixture Models
Collapsed DP Mixture Sampler
Collapsed DP Sampler: 2 Iterations
Standard Sampler: 10 Iterations
Standard Sampler: 50 Iterations
DP versus Finite Mixture Samplers
DP Posterior Number of Clusters
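The DP prior over the number of clusters is easy to explore by simulating the Chinese restaurant process directly. An illustrative sketch (my own code; the logarithmic growth E[K] = Σᵢ α/(α + i − 1) ≈ α log N is a standard DP property):

```python
import random

def crp_num_clusters(N, alpha, seed=0):
    """Simulate a Chinese restaurant process with concentration alpha and
    return the number of occupied tables after N customers."""
    rng = random.Random(seed)
    tables = []                            # customers seated at each table
    for n in range(N):
        u = rng.random() * (n + alpha)
        for k in range(len(tables)):
            u -= tables[k]
            if u < 0.0:
                tables[k] += 1             # join table k w.p. c_k / (n + alpha)
                break
        else:
            tables.append(1)               # new table w.p. alpha / (n + alpha)
    return len(tables)

# Prior mean number of clusters: E[K] = sum_i alpha / (alpha + i - 1).
N, alpha = 1000, 2.0
expected = sum(alpha / (alpha + i - 1) for i in range(1, N + 1))
draws = [crp_num_clusters(N, alpha, seed=s) for s in range(200)]
avg = sum(draws) / len(draws)
```

With N = 1000 and α = 2, the prior concentrates around roughly a dozen clusters; the posterior then adjusts this count in light of the data, as the surrounding slides illustrate.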
Bayesian Ockham's Razor

William of Ockham: "Plurality must never be posited without necessity."

[Figure: graphical model in which a model index m, with parameters θ, generates data D_m]

Even with uniform p(m), the marginal likelihood provides a model selection bias.
Example: Is this coin fair?

M0: Tosses are from a fair coin: θ = 1/2
M1: Tosses are from a coin of unknown bias: θ ∼ Unif(0, 1)

[Figure: marginal likelihoods of M0 and M1 versus the number of heads in N = 5 tosses]

•! ML: Always prefer M1
•! Bayes: Unbalanced counts are much more likely with a biased coin, so favor M1
•! Bayes: Balanced counts only happen with some biased coins, so favor M0
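Both marginal likelihoods are available in closed form: under M1, integrating the binomial likelihood against the uniform prior gives exactly 1/(N + 1) for every head count h. A small sketch reproducing the comparison (function names are mine):

```python
from math import comb, factorial

def marginal_m0(h, N):
    """P(h heads | M0): Binomial(N, 1/2) likelihood of a fair coin."""
    return comb(N, h) * 0.5 ** N

def marginal_m1(h, N):
    """P(h heads | M1): integrate C(N,h) theta^h (1-theta)^(N-h) against
    the Unif(0,1) prior; the Beta integral gives exactly 1 / (N + 1)."""
    return comb(N, h) * factorial(h) * factorial(N - h) / factorial(N + 1)

N = 5
for h in range(N + 1):
    print(h, marginal_m0(h, N), marginal_m1(h, N))
```

Balanced counts (h = 2 or 3) get higher marginal likelihood under M0, while extreme counts (h = 0 or 5) favor M1, matching the bullets above.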
Variational Approximations

[Derivation: multiply the marginal likelihood by one, q(z)/q(z), then apply Jensen's inequality to obtain a lower bound on log p(x)]

•! Minimizing KL divergence maximizes a likelihood bound
•! Variational EM algorithms, which maximize over q(z) within some tractable family, retain BNP model selection behavior
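The two annotated steps ("multiply by one", then Jensen's inequality) can be written out explicitly. Using z for the hidden variables and x for the data, consistent with the MCMC slides:

```latex
\log p(x)
  \;=\; \log \int q(z)\, \frac{p(x, z)}{q(z)}\, dz        % multiply by one
  \;\geq\; \int q(z)\, \log \frac{p(x, z)}{q(z)}\, dz     % Jensen's inequality
  \;=\; \log p(x) \;-\; \mathrm{KL}\!\left( q(z) \,\big\|\, p(z \mid x) \right)
```

The final equality shows why minimizing the KL divergence from q(z) to the posterior is the same as maximizing the lower bound: the gap between the bound and log p(x) is exactly that KL divergence.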
Mean Field for DP Mixtures (Blei & Jordan, 2006)

•! Truncate stick-breaking at some upper bound K on the true number of occupied clusters: stick-breaking proceeds as usual for k = 1, . . . , K − 1, and the final weight absorbs the remainder:

π_K = 1 − Σ_{k=1}^{K−1} π_k

•! Priors encourage assigning data to fewer than K clusters
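The truncated stick-breaking construction itself can be sketched directly (an illustration of the truncation, with my own names; not Blei & Jordan's variational updates):

```python
import random

def truncated_stick_breaking(alpha, K, seed=0):
    """Draw mixture weights from a stick-breaking construction truncated at
    K components: break off v_k ~ Beta(1, alpha) of the remaining stick for
    k < K, and let the final weight absorb whatever stick is left."""
    rng = random.Random(seed)
    pi, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, alpha)
        pi.append(v * remaining)
        remaining *= 1.0 - v
    pi.append(remaining)             # pi_K = 1 - sum of the first K-1 weights
    return pi

pi = truncated_stick_breaking(alpha=1.0, K=20)
```

The expected leftover stick mass shrinks geometrically, as (α/(1 + α))^(K−1), so a modest truncation level K already yields a provably accurate approximation to the full DP.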
MCMC & Variational Learning

Finite Bayesian Models: Set the finite model order to be larger than the expected number of clusters or features.

Stick-Breaking: Truncate stick-breaking to produce a provably accurate approximation.

CRP & IBP: Tractably learn via finite summaries of the true, infinite model.

Infinite Stochastic Processes: Conceptually useful, but usually impractical or impossible for learning algorithms.
Applied BNP: Part II