On Bayesian Computation
Michael I. Jordan
with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang
Previous Work: Information Constraints on Inference
- Minimize the minimax risk under constraints:
  - privacy constraint
  - communication constraint
  - memory constraint
- Yields tradeoffs linking statistical quantities (amount of data, complexity of model, dimension of parameter space, etc.) to "externalities" (privacy level, bandwidth, storage, etc.)
Ongoing Work: Computational Constraints on Inference
- Tradeoffs via convex relaxations
- Tradeoffs via concurrency control
- Bounds via optimization oracles
- Bounds via communication complexity
- Tradeoffs via subsampling
- All of this work has been frequentist; how about Bayesian inference?
Today’s talk: Bayesian Inference and Computation
- Integration rather than optimization
- Part I: Mixing times for MCMC
- Part II: Distributed MCMC
Part I: Mixing Times for MCMC
MCMC and Sparse Regression
- Statistical problem: predict/explain a response variable Y based on a very large number of features X_1, ..., X_d.
- Formally:

    Y = X_1 β_1 + X_2 β_2 + ... + X_d β_d + noise, or
    Y_{n×1} = X_{n×d} β_{d×1} + noise   (matrix form),

  where the number of features d ≫ the sample size n.
- Sparsity condition: the number of non-zero β_j's ≪ n.
- Goal: variable selection, i.e., identify the influential features
- Can this be done efficiently (in a computational sense)?
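The setup above can be sketched in a few lines of NumPy; the sizes and coefficient values below are illustrative choices, not the regimes analyzed in the talk.

```python
# Toy instance of the sparse regression setup: d >> n, few non-zero betas.
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 50, 200, 3              # sample size n << number of features d
beta = np.zeros(d)
beta[:s] = [2.0, -1.5, 1.0]       # only s coefficients are non-zero (sparsity)

X = rng.standard_normal((n, d))   # design matrix
Y = X @ beta + 0.1 * rng.standard_normal(n)   # Y = X beta + noise

# Variable-selection goal: recover the support {j : beta_j != 0}.
support = np.flatnonzero(beta)
```

Here `support` is what a variable-selection procedure would try to recover from `(X, Y)` alone.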
Statistical Methodology
- Frequentist approach: penalized optimization

    f(β) = (1/2n) ‖Y − Xβ‖₂²  +  P_λ(β)
           [goodness of fit]     [penalty term]

- Bayesian approach: Monte Carlo sampling

    P(β | Y) ∝ P(Y | β) × P(β)
               [likelihood]  [prior probability]
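For concreteness, the two formulations can be written side by side on toy data; the ℓ₁ penalty, noise scale, and Gaussian prior below are arbitrary illustrative choices, not the specific prior analyzed later in the talk.

```python
# Side-by-side sketch: penalized objective vs. unnormalized log posterior.
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 5
X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)
lam = 0.1

def objective(beta):
    """Frequentist: f(beta) = (1/2n)||Y - X beta||_2^2 + P_lambda(beta)."""
    fit = np.sum((Y - X @ beta) ** 2) / (2 * n)   # goodness of fit
    penalty = lam * np.sum(np.abs(beta))          # e.g. an l1 penalty
    return fit + penalty

def log_posterior(beta, sigma2=0.01, tau2=1.0):
    """Bayesian: log P(Y | beta) + log P(beta), up to a normalizing constant."""
    log_lik = -np.sum((Y - X @ beta) ** 2) / (2 * sigma2)   # Gaussian likelihood
    log_prior = -np.sum(beta ** 2) / (2 * tau2)             # Gaussian prior
    return log_lik + log_prior
```

The frequentist route minimizes `objective`; the Bayesian route samples from the distribution whose log density is `log_posterior`.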
Computation and Bayesian inference
Posterior probability:

    P(β | Y) = C(Y) × P(Y | β) × P(β)
               [normalizing constant] [likelihood] [prior probability]

- The most widely used tool for fitting Bayesian models is the sampling technique based on Markov chain Monte Carlo (MCMC)
- The theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods
Computational Complexity
- Central object of interest: the mixing time t_mix of the Markov chain
- Bound t_mix as a function of the problem parameters (n, d):
  - rapidly mixing: t_mix grows at most polynomially
  - slowly mixing: t_mix grows exponentially
- It has long been believed by many statisticians that MCMC for Bayesian variable selection is necessarily slowly mixing; efforts have been underway to find lower bounds, but such bounds have not yet been found
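To make the definition concrete, the ε-mixing time can be computed exactly for a small hypothetical two-state chain (the chains in the talk live on exponentially large model spaces, where such direct computation is infeasible).

```python
# Epsilon-mixing time of a toy two-state Markov chain, computed directly.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition matrix
pi = np.array([2/3, 1/3])           # stationary distribution: pi @ P == pi

def mixing_time(P, pi, start, eps=0.25, max_steps=1000):
    """Smallest t with total-variation distance ||mu_t - pi||_TV <= eps."""
    mu = np.zeros(len(pi))
    mu[start] = 1.0                  # point mass at the starting state
    for t in range(1, max_steps + 1):
        mu = mu @ P                  # one step of the chain
        if 0.5 * np.abs(mu - pi).sum() <= eps:
            return t
    return max_steps

t_mix = mixing_time(P, pi, start=1)
```

For this chain the distance to stationarity contracts by the second eigenvalue (0.7) at each step, so the chain is rapidly mixing in the trivial sense; the hard question in the talk is how t_mix scales with (n, d).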
Model
M_γ: linear model   Y = X_γ β_γ + w,   w ~ N(0, φ⁻¹ I_n)

Precision prior:    π(φ) ∝ 1/φ

Regression prior:   (β_γ | γ) ~ N(0, g φ⁻¹ (X_γ^T X_γ)⁻¹)

Sparsity prior:     π(γ) ∝ (1/p)^{κ|γ|} · I[|γ| ≤ s₀].
Assumption A (Conditions on β*)
The true regression vector has components β* = (β*_S, β*_{S^c}) that satisfy the bounds

Full β* condition:          ‖(1/√n) X β*‖₂² ≤ g σ₀² (log p)/n
Off-support S^c condition:  ‖(1/√n) X_{S^c} β*_{S^c}‖₂² ≤ L σ₀² (log p)/n,

for some L ≥ 0.
Assumption B (Conditions on the design matrix)
The design matrix has been normalized so that ‖X_j‖₂² = n for all j = 1, ..., p; moreover, letting Z ~ N(0, I_n), there exist constants ν > 0 and L̃ < ∞ such that L̃ν ≥ 4 and

Lower restricted eigenvalue RE(s):   min_{|γ|≤s} λ_min((1/n) X_γ^T X_γ) ≥ ν,
Sparse projection condition SI(s):   E_Z[ max_{|γ|≤s} max_{k∈[p]\γ} (1/√n) |⟨(I − Φ_γ) X_k, Z⟩| ] ≤ (1/2) √(L̃ ν log p),

where Φ_γ denotes projection onto the span of {X_j, j ∈ γ}.

This is a mild assumption, needed in the information-theoretic analysis of variable selection.
Assumption C (Choices of prior hyperparameters)
The noise hyperparameter g and the sparsity penalty hyperparameter κ > 0 are chosen such that

    g ≍ p^{2α} for some α > 0, and
    κ + α ≥ C₁(L + L̃) + 2 for some universal constant C₁ > 0.
Assumption D (Sparsity control)
For a constant C₀ > 8, one of the two following conditions holds:

Version D(s*): We set s₀ := p in the sparsity prior, and the true sparsity s* is bounded as

    max{1, s*} ≤ (1 / (8 C₀ K)) { n/(log p) − 16 L σ₀² }

for some constant K ≥ 4 + α + cL, where c is a universal constant.

Version D(s₀): The sparsity parameter s₀ satisfies the sandwich relation

    max{ 1, (2 ν⁻² ω(X) + 1) s* } ≤ s₀ ≤ (1 / (8 C₀ K)) { n/(log p) − 16 L σ₀² },

where ω(X) := max_γ |||(X_γ^T X_γ)⁻¹ X_γ^T X_{γ*\γ}|||²_op.
Our Results
Theorem (Posterior concentration)
Given Assumption A, Assumption B with s = K s*, Assumption C, and Assumption D(s*), if C_β satisfies

    C_β² ≥ c₀ ν⁻² (L + L̃ + α + κ) σ₀² (log p)/n,

then π_n(γ* | Y) ≥ 1 − c₁ p⁻¹ with high probability.

Compare to ℓ₁-based approaches, which require an irrepresentable condition:

    max_{k∈[d], S: |S|=s*} ‖X_k^T X_S (X_S^T X_S)⁻¹‖₁ < 1
Our Results
Theorem (Rapid mixing)
Suppose that Assumption A, Assumption B with s = s₀, Assumption C, and Assumption D(s₀) all hold. Then there are universal constants c₁, c₂ such that, for any ε ∈ (0, 1), the ε-mixing time of the Metropolis–Hastings chain is upper bounded as

    τ_ε ≤ c₁ p s₀² ( c₂ α (n + s₀) log p + log(1/ε) + 2 )

with probability at least 1 − c₃ p^{−c₄}.
High Level Proof Idea
- A Metropolis–Hastings random walk on the d-dimensional hypercube {0, 1}^d
- Canonical path ensemble argument: for any model γ ≠ γ*, find a path from γ to γ*, along which acceptance ratios are high, where γ* is the true model
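A minimal sketch of such a single-flip walk on the hypercube, assuming a toy target that simply rewards agreement with γ*; in the actual sampler, `log_target` would be the log of the variable-selection posterior π(γ | Y).

```python
# Single-flip Metropolis-Hastings random walk on {0,1}^d (toy target).
import numpy as np

rng = np.random.default_rng(0)
d = 10
gamma_star = np.zeros(d, dtype=int)
gamma_star[:2] = 1                    # hypothetical "true model"

def log_target(gamma):
    # Toy stand-in for log pi(gamma | Y): penalize disagreement with gamma*.
    return -5.0 * np.sum(gamma != gamma_star)

gamma = rng.integers(0, 2, size=d)    # random initial model
for _ in range(2000):
    j = rng.integers(d)               # propose flipping a single coordinate
    proposal = gamma.copy()
    proposal[j] ^= 1
    log_ratio = log_target(proposal) - log_target(gamma)
    if np.log(rng.random()) < log_ratio:   # Metropolis-Hastings accept step
        gamma = proposal
```

The canonical-path argument in the talk bounds how many such flips are needed before the chain concentrates near γ*, as a function of (n, p, s₀).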
Part II: Distributed MCMC
Traditional MCMC
- Serial, iterative algorithm for generating samples
- Slow for two reasons:
  (1) Large number of iterations required to converge
  (2) Each iteration depends on the entire dataset
- Most research on MCMC has targeted (1)
- Recent threads of work target (2)
Serial MCMC

    Data → Single core → Samples

Data-parallel MCMC

    Data → Parallel cores → "Samples" → Aggregate

Aggregate samples from across partitions, but how?
Factorization motivates a data-parallel approach

    π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} π(θ)^{1/J} π(x_{(j)} | θ)
    [posterior]  [prior] [likelihood]    [product of sub-posteriors]

- Partition the data as x_{(1)}, ..., x_{(J)} across J cores
- The jth core samples from a distribution proportional to the jth sub-posterior (a "piece" of the full posterior)
- Aggregate the sub-posterior samples to form approximate full posterior samples
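The partition-then-sample step can be sketched for a conjugate Gaussian model, chosen here purely because its sub-posteriors have closed form; the model, shard count, and sizes are all illustrative assumptions.

```python
# Data-parallel sub-posterior sampling for x_i ~ N(theta, sigma2) with a
# N(0, tau2) prior on theta. Each "core" is simulated by a loop iteration.
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2, J = 1.0, 100.0, 4
x = rng.normal(2.0, np.sqrt(sigma2), size=10_000)   # full data set
shards = np.array_split(x, J)                        # x_(1), ..., x_(J)

def sub_posterior_samples(xj, n_samples=1000):
    # Sub-posterior ∝ pi(theta)^(1/J) * pi(x_(j) | theta); here it is
    # Gaussian, with the prior precision fractionated across the J shards.
    prec = 1.0 / (J * tau2) + len(xj) / sigma2
    mean = (xj.sum() / sigma2) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec), size=n_samples)

samples = [sub_posterior_samples(shard) for shard in shards]  # one per core
```

Each array in `samples` approximates one "piece" of the posterior; the aggregation strategies below differ in how these pieces are recombined.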
Aggregation strategies for sub-posterior samples

- Sub-posterior density estimation (Neiswanger et al., UAI 2014)
- Weierstrass samplers (Wang & Dunson, 2013)
- Weighted averaging of sub-posterior samples
  - Consensus Monte Carlo (Scott et al., Bayes 250, 2013)
  - Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)
Aggregate ‘horizontally’ across partitions

    Data → Parallel cores → "Samples" → Aggregate
Naïve aggregation = average

    θ = 0.5 · θ₁ + 0.5 · θ₂

Less naïve aggregation = weighted average

    θ = 0.58 · θ₁ + 0.42 · θ₂
Consensus Monte Carlo (Scott et al., 2013)

    θ = W₁ θ₁ + W₂ θ₂   (matrix-weighted average of sub-posterior samples)

- Weights are inverse covariance matrices
- Motivated by Gaussian assumptions
- Designed at Google for the MapReduce framework
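The inverse-covariance weighting rule can be sketched in one dimension; the two sub-posterior sample arrays below are hypothetical stand-ins with deliberately different spreads.

```python
# Consensus aggregation in 1-d: weight each sub-posterior by its inverse
# sample variance (the Gaussian-motivated CMC rule), then average samplewise.
import numpy as np

rng = np.random.default_rng(0)
sub1 = rng.normal(1.0, 0.1, size=5000)   # tight sub-posterior
sub2 = rng.normal(1.2, 0.3, size=5000)   # diffuse sub-posterior

w1, w2 = 1.0 / sub1.var(), 1.0 / sub2.var()        # inverse-variance weights
consensus = (w1 * sub1 + w2 * sub2) / (w1 + w2)    # weighted average
```

The tight sub-posterior dominates the average, which is exactly the behavior the Gaussian motivation predicts; VCMC instead learns the weights by optimization.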
Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

    F = aggregation function
    q_F = approximate distribution

    L(F) = E_{q_F}[log π(X, θ)] + H[q_F]
    [objective = likelihood term + entropy]

In practice, the entropy H[q_F] is replaced by a tractable relaxed entropy H̃[q_F].

No mean field assumption
Variational Consensus Monte Carlo
    θ = W₁ θ₁ + W₂ θ₂   (learned matrix-weighted aggregation)

- Optimize over the weight matrices
- Restrict to valid solutions when the parameter vectors are constrained
Variational Consensus Monte Carlo
Theorem (Entropy relaxation)
Under mild structural assumptions, we can choose

    H̃[q_F] = c₀ + (1/K) Σ_{k=1}^{K} h_k(F),

with each h_k a concave function of F, such that

    H[q_F] ≥ H̃[q_F].

We therefore have L(F) ≥ L̃(F).
Variational Consensus Monte Carlo
Theorem (Concavity of the variational Bayes objective)
Under mild structural assumptions, the relaxed variational Bayes objective

    L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]

is concave in F.
Empirical evaluation
- Compare three aggregation strategies:
  - Uniform average
  - Gaussian-motivated weighted average (CMC)
  - Optimized weighted average (VCMC)
- For each algorithm A, report the approximation error of some expectation E_π[f], relative to serial MCMC:

    ε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|

- Preliminary speedup results
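The error metric above is straightforward to implement; the sample arrays in the usage line are hypothetical stand-ins for algorithm output, not results from the talk.

```python
# The relative-error metric eps_A(f) from the slide, as a plain function.
import numpy as np

def relative_error(samples_A, samples_mcmc, f=lambda x: x):
    """eps_A(f) = |E_A[f] - E_MCMC[f]| / |E_MCMC[f]|, with expectations
    estimated by sample means over each algorithm's output."""
    e_A = np.mean(f(samples_A))
    e_mcmc = np.mean(f(samples_mcmc))
    return abs(e_A - e_mcmc) / abs(e_mcmc)

err = relative_error(np.array([1.1, 0.9, 1.0]), np.array([2.0, 2.0, 2.0]))
```

Note the metric is undefined when E_MCMC[f] = 0, which is why the plots truncate or restrict the test functions f.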
Example 1: High-dimensional Bayesian probit regression
#data = 100,000, d = 300

[Figure: first moment estimation error, relative to serial MCMC; error truncated at 2.0]
Example 2: High-dimensional covariance estimation
Normal-inverse-Wishart model

#data = 100,000, #dim = 100 ⇒ 5,050 parameters

[Figures: (L) first moment estimation error; (R) eigenvalue estimation error]
Example 3: Mixture of 8 Gaussians in 8 dimensions

Error relative to serial MCMC, for cluster comembership probabilities of pairs of test data points

- VCMC reduces CMC error at the cost of speedup (∼2x)
- VCMC speedup is approximately linear
Discussion
Contributions

- Convex optimization framework for Consensus Monte Carlo
- Structured aggregation accounting for constrained parameters
- Entropy relaxation
- Empirical evaluation