On Bayesian Computation
Michael I. Jordan
with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang
Previous Work: Information Constraints on Inference
- Minimize the minimax risk under constraints:
  - privacy constraint
  - communication constraint
  - memory constraint
- Yields tradeoffs linking statistical quantities (amount of data, complexity of model, dimension of parameter space, etc.) to "externalities" (privacy level, bandwidth, storage, etc.)
Ongoing Work: Computational Constraints on Inference
- Tradeoffs via convex relaxations
- Tradeoffs via concurrency control
- Bounds via optimization oracles
- Bounds via communication complexity
- Tradeoffs via subsampling
- All of this work has been frequentist; how about Bayesian inference?
Today’s talk: Bayesian Inference and Computation
- Integration rather than optimization
- Part I: Mixing times for MCMC
- Part II: Distributed MCMC
Part I: Mixing Times for MCMC
MCMC and Sparse Regression
- Statistical problem: predict/explain a response variable Y based on a very large number of features X_1, ..., X_d.
- Formally:

    Y = X_1 β_1 + X_2 β_2 + ... + X_d β_d + noise, or
    Y_{n×1} = X_{n×d} β_{d×1} + noise   (matrix form),

  where the number of features d ≫ the sample size n.
- Sparsity condition: the number of non-zero β_j's ≪ n.
- Goal: variable selection, i.e., identify the influential features
- Can this be done efficiently (in a computational sense)?
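The setup above can be sketched in a few lines of NumPy; the sizes and coefficient values below are illustrative choices, not the regimes analyzed in the talk.

```python
# Toy instance of the sparse regression setup: d >> n, few non-zero betas.
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 50, 200, 3              # sample size n << number of features d
beta = np.zeros(d)
beta[:s] = [2.0, -1.5, 1.0]       # only s coefficients are non-zero (sparsity)

X = rng.standard_normal((n, d))   # design matrix
Y = X @ beta + 0.1 * rng.standard_normal(n)   # Y = X beta + noise

# Variable-selection goal: recover the support {j : beta_j != 0}.
support = np.flatnonzero(beta)
```

Here `support` is what a variable-selection procedure would try to recover from `(X, Y)` alone.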
Statistical Methodology
- Frequentist approach: penalized optimization

    f(β) = (1/2n) ‖Y − Xβ‖₂²  +  P_λ(β)
           [goodness of fit]     [penalty term]

- Bayesian approach: Monte Carlo sampling

    P(β | Y) ∝ P(Y | β) × P(β)
               [likelihood]  [prior probability]
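For concreteness, the two formulations can be written side by side on toy data; the ℓ₁ penalty, noise scale, and Gaussian prior below are arbitrary illustrative choices, not the specific prior analyzed later in the talk.

```python
# Side-by-side sketch: penalized objective vs. unnormalized log posterior.
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 5
X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)
lam = 0.1

def objective(beta):
    """Frequentist: f(beta) = (1/2n)||Y - X beta||_2^2 + P_lambda(beta)."""
    fit = np.sum((Y - X @ beta) ** 2) / (2 * n)   # goodness of fit
    penalty = lam * np.sum(np.abs(beta))          # e.g. an l1 penalty
    return fit + penalty

def log_posterior(beta, sigma2=0.01, tau2=1.0):
    """Bayesian: log P(Y | beta) + log P(beta), up to a normalizing constant."""
    log_lik = -np.sum((Y - X @ beta) ** 2) / (2 * sigma2)   # Gaussian likelihood
    log_prior = -np.sum(beta ** 2) / (2 * tau2)             # Gaussian prior
    return log_lik + log_prior
```

The frequentist route minimizes `objective`; the Bayesian route samples from the distribution whose log density is `log_posterior`.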
Computation and Bayesian inference
Posterior probability:

    P(β | Y) = C(Y) × P(Y | β) × P(β)
               [normalizing constant] [likelihood] [prior probability]

- The most widely used tool for fitting Bayesian models is the sampling technique based on Markov chain Monte Carlo (MCMC)
- The theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods
Computational Complexity
- Central object of interest: the mixing time t_mix of the Markov chain
- Bound t_mix as a function of the problem parameters (n, d):
  - rapidly mixing: t_mix grows at most polynomially
  - slowly mixing: t_mix grows exponentially
- It has long been believed by many statisticians that MCMC for Bayesian variable selection is necessarily slowly mixing; efforts have been underway to find lower bounds, but such bounds have not yet been found
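To make the definition concrete, the ε-mixing time can be computed exactly for a small hypothetical two-state chain (the chains in the talk live on exponentially large model spaces, where such direct computation is infeasible).

```python
# Epsilon-mixing time of a toy two-state Markov chain, computed directly.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition matrix
pi = np.array([2/3, 1/3])           # stationary distribution: pi @ P == pi

def mixing_time(P, pi, start, eps=0.25, max_steps=1000):
    """Smallest t with total-variation distance ||mu_t - pi||_TV <= eps."""
    mu = np.zeros(len(pi))
    mu[start] = 1.0                  # point mass at the starting state
    for t in range(1, max_steps + 1):
        mu = mu @ P                  # one step of the chain
        if 0.5 * np.abs(mu - pi).sum() <= eps:
            return t
    return max_steps

t_mix = mixing_time(P, pi, start=1)
```

For this chain the distance to stationarity contracts by the second eigenvalue (0.7) at each step, so the chain is rapidly mixing in the trivial sense; the hard question in the talk is how t_mix scales with (n, d).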
Model
M_γ: linear model   Y = X_γ β_γ + w,   w ~ N(0, φ⁻¹ I_n)

Precision prior:    π(φ) ∝ 1/φ

Regression prior:   (β_γ | γ) ~ N(0, g φ⁻¹ (X_γ^T X_γ)⁻¹)

Sparsity prior:     π(γ) ∝ (1/p)^{κ|γ|} · I[|γ| ≤ s₀].
Assumption A (Conditions on β*)
The true regression vector has components β* = (β*_S, β*_{S^c}) that satisfy the bounds

Full β* condition:          ‖(1/√n) X β*‖₂² ≤ g σ₀² (log p)/n
Off-support S^c condition:  ‖(1/√n) X_{S^c} β*_{S^c}‖₂² ≤ L σ₀² (log p)/n,

for some L ≥ 0.
Assumption B (Conditions on the design matrix)
The design matrix has been normalized so that ‖X_j‖₂² = n for all j = 1, ..., p; moreover, letting Z ~ N(0, I_n), there exist constants ν > 0 and L̃ < ∞ such that L̃ν ≥ 4 and

Lower restricted eigenvalue RE(s):   min_{|γ|≤s} λ_min((1/n) X_γ^T X_γ) ≥ ν,
Sparse projection condition SI(s):   E_Z[ max_{|γ|≤s} max_{k∈[p]\γ} (1/√n) |⟨(I − Φ_γ) X_k, Z⟩| ] ≤ (1/2) √(L̃ ν log p),

where Φ_γ denotes projection onto the span of {X_j, j ∈ γ}.

This is a mild assumption, needed in the information-theoretic analysis of variable selection.
Assumption C (Choices of prior hyperparameters)
The noise hyperparameter g and the sparsity penalty hyperparameter κ > 0 are chosen such that

    g ≍ p^{2α} for some α > 0, and
    κ + α ≥ C₁(L + L̃) + 2 for some universal constant C₁ > 0.
Assumption D (Sparsity control)
For a constant C₀ > 8, one of the two following conditions holds:

Version D(s*): We set s₀ := p in the sparsity prior, and the true sparsity s* is bounded as

    max{1, s*} ≤ (1 / (8 C₀ K)) { n/(log p) − 16 L σ₀² }

for some constant K ≥ 4 + α + cL, where c is a universal constant.

Version D(s₀): The sparsity parameter s₀ satisfies the sandwich relation

    max{ 1, (2 ν⁻² ω(X) + 1) s* } ≤ s₀ ≤ (1 / (8 C₀ K)) { n/(log p) − 16 L σ₀² },

where ω(X) := max_γ |||(X_γ^T X_γ)⁻¹ X_γ^T X_{γ*\γ}|||²_op.
Our Results
Theorem (Posterior concentration)
Given Assumption A, Assumption B with s = K s*, Assumption C, and Assumption D(s*), if C_β satisfies

    C_β² ≥ c₀ ν⁻² (L + L̃ + α + κ) σ₀² (log p)/n,

then π_n(γ* | Y) ≥ 1 − c₁ p⁻¹ with high probability.

Compare to ℓ₁-based approaches, which require an irrepresentable condition:

    max_{k∈[d], S: |S|=s*} ‖X_k^T X_S (X_S^T X_S)⁻¹‖₁ < 1
Our Results
Theorem (Rapid mixing)
Suppose that Assumption A, Assumption B with s = s₀, Assumption C, and Assumption D(s₀) all hold. Then there are universal constants c₁, c₂ such that, for any ε ∈ (0, 1), the ε-mixing time of the Metropolis–Hastings chain is upper bounded as

    τ_ε ≤ c₁ p s₀² ( c₂ α (n + s₀) log p + log(1/ε) + 2 )

with probability at least 1 − c₃ p^{−c₄}.
High Level Proof Idea
- A Metropolis–Hastings random walk on the d-dimensional hypercube {0, 1}^d
- Canonical path ensemble argument: for any model γ ≠ γ*, find a path from γ to γ*, along which acceptance ratios are high, where γ* is the true model
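A minimal sketch of such a single-flip walk on the hypercube, assuming a toy target that simply rewards agreement with γ*; in the actual sampler, `log_target` would be the log of the variable-selection posterior π(γ | Y).

```python
# Single-flip Metropolis-Hastings random walk on {0,1}^d (toy target).
import numpy as np

rng = np.random.default_rng(0)
d = 10
gamma_star = np.zeros(d, dtype=int)
gamma_star[:2] = 1                    # hypothetical "true model"

def log_target(gamma):
    # Toy stand-in for log pi(gamma | Y): penalize disagreement with gamma*.
    return -5.0 * np.sum(gamma != gamma_star)

gamma = rng.integers(0, 2, size=d)    # random initial model
for _ in range(2000):
    j = rng.integers(d)               # propose flipping a single coordinate
    proposal = gamma.copy()
    proposal[j] ^= 1
    log_ratio = log_target(proposal) - log_target(gamma)
    if np.log(rng.random()) < log_ratio:   # Metropolis-Hastings accept step
        gamma = proposal
```

The canonical-path argument in the talk bounds how many such flips are needed before the chain concentrates near γ*, as a function of (n, p, s₀).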
Part II: Distributed MCMC
Traditional MCMC
- Serial, iterative algorithm for generating samples
- Slow for two reasons:
  (1) Large number of iterations required to converge
  (2) Each iteration depends on the entire dataset
- Most research on MCMC has targeted (1)
- Recent threads of work target (2)
Serial MCMC

    Data → Single core → Samples

Data-parallel MCMC

    Data → Parallel cores → "Samples" → Aggregate

Aggregate samples from across partitions, but how?
Factorization motivates a data-parallel approach

    π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} π(θ)^{1/J} π(x_{(j)} | θ)
    [posterior]  [prior] [likelihood]    [product of sub-posteriors]

- Partition the data as x_{(1)}, ..., x_{(J)} across J cores
- The jth core samples from a distribution proportional to the jth sub-posterior (a "piece" of the full posterior)
- Aggregate the sub-posterior samples to form approximate full posterior samples
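The partition-then-sample step can be sketched for a conjugate Gaussian model, chosen here purely because its sub-posteriors have closed form; the model, shard count, and sizes are all illustrative assumptions.

```python
# Data-parallel sub-posterior sampling for x_i ~ N(theta, sigma2) with a
# N(0, tau2) prior on theta. Each "core" is simulated by a loop iteration.
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2, J = 1.0, 100.0, 4
x = rng.normal(2.0, np.sqrt(sigma2), size=10_000)   # full data set
shards = np.array_split(x, J)                        # x_(1), ..., x_(J)

def sub_posterior_samples(xj, n_samples=1000):
    # Sub-posterior ∝ pi(theta)^(1/J) * pi(x_(j) | theta); here it is
    # Gaussian, with the prior precision fractionated across the J shards.
    prec = 1.0 / (J * tau2) + len(xj) / sigma2
    mean = (xj.sum() / sigma2) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec), size=n_samples)

samples = [sub_posterior_samples(shard) for shard in shards]  # one per core
```

Each array in `samples` approximates one "piece" of the posterior; the aggregation strategies below differ in how these pieces are recombined.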
Aggregation strategies for sub-posterior samples

- Sub-posterior density estimation (Neiswanger et al., UAI 2014)
- Weierstrass samplers (Wang & Dunson, 2013)
- Weighted averaging of sub-posterior samples
  - Consensus Monte Carlo (Scott et al., Bayes 250, 2013)
  - Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)
Aggregate ‘horizontally’ across partitions

    Data → Parallel cores → "Samples" → Aggregate
Naïve aggregation = average

    θ = 0.5 · θ₁ + 0.5 · θ₂

Less naïve aggregation = weighted average

    θ = 0.58 · θ₁ + 0.42 · θ₂
Consensus Monte Carlo (Scott et al., 2013)

    θ = W₁ θ₁ + W₂ θ₂   (matrix-weighted average of sub-posterior samples)

- Weights are inverse covariance matrices
- Motivated by Gaussian assumptions
- Designed at Google for the MapReduce framework
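The inverse-covariance weighting rule can be sketched in one dimension; the two sub-posterior sample arrays below are hypothetical stand-ins with deliberately different spreads.

```python
# Consensus aggregation in 1-d: weight each sub-posterior by its inverse
# sample variance (the Gaussian-motivated CMC rule), then average samplewise.
import numpy as np

rng = np.random.default_rng(0)
sub1 = rng.normal(1.0, 0.1, size=5000)   # tight sub-posterior
sub2 = rng.normal(1.2, 0.3, size=5000)   # diffuse sub-posterior

w1, w2 = 1.0 / sub1.var(), 1.0 / sub2.var()        # inverse-variance weights
consensus = (w1 * sub1 + w2 * sub2) / (w1 + w2)    # weighted average
```

The tight sub-posterior dominates the average, which is exactly the behavior the Gaussian motivation predicts; VCMC instead learns the weights by optimization.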
Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

    F = aggregation function
    q_F = approximate distribution

    L(F) = E_{q_F}[log π(X, θ)] + H[q_F]
    [objective = likelihood term + entropy]

In practice, the entropy H[q_F] is replaced by a tractable relaxed entropy H̃[q_F].

No mean field assumption
Variational Consensus Monte Carlo
    θ = W₁ θ₁ + W₂ θ₂   (learned matrix-weighted aggregation)

- Optimize over the weight matrices
- Restrict to valid solutions when the parameter vectors are constrained
Variational Consensus Monte Carlo
Theorem (Entropy relaxation)
Under mild structural assumptions, we can choose

    H̃[q_F] = c₀ + (1/K) Σ_{k=1}^{K} h_k(F),

with each h_k a concave function of F, such that

    H[q_F] ≥ H̃[q_F].

We therefore have L(F) ≥ L̃(F).
Variational Consensus Monte Carlo
Theorem (Concavity of the variational Bayes objective)
Under mild structural assumptions, the relaxed variational Bayes objective

    L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]

is concave in F.
Empirical evaluation
- Compare three aggregation strategies:
  - Uniform average
  - Gaussian-motivated weighted average (CMC)
  - Optimized weighted average (VCMC)
- For each algorithm A, report the approximation error of some expectation E_π[f], relative to serial MCMC:

    ε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|

- Preliminary speedup results
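The error metric above is straightforward to implement; the sample arrays in the usage line are hypothetical stand-ins for algorithm output, not results from the talk.

```python
# The relative-error metric eps_A(f) from the slide, as a plain function.
import numpy as np

def relative_error(samples_A, samples_mcmc, f=lambda x: x):
    """eps_A(f) = |E_A[f] - E_MCMC[f]| / |E_MCMC[f]|, with expectations
    estimated by sample means over each algorithm's output."""
    e_A = np.mean(f(samples_A))
    e_mcmc = np.mean(f(samples_mcmc))
    return abs(e_A - e_mcmc) / abs(e_mcmc)

err = relative_error(np.array([1.1, 0.9, 1.0]), np.array([2.0, 2.0, 2.0]))
```

Note the metric is undefined when E_MCMC[f] = 0, which is why the plots truncate or restrict the test functions f.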
Example 1: High-dimensional Bayesian probit regression
#data = 100,000, d = 300

[Figure: first moment estimation error, relative to serial MCMC; error truncated at 2.0]
Example 2: High-dimensional covariance estimation
Normal-inverse-Wishart model

#data = 100,000, #dim = 100 ⇒ 5,050 parameters

[Figures: (L) first moment estimation error; (R) eigenvalue estimation error]
Example 3: Mixture of 8 Gaussians in 8 dimensions

Error relative to serial MCMC, for cluster comembership probabilities of pairs of test data points

- VCMC reduces CMC error at the cost of speedup (∼2x)
- VCMC speedup is approximately linear
Discussion
Contributions

- Convex optimization framework for Consensus Monte Carlo
- Structured aggregation accounting for constrained parameters
- Entropy relaxation
- Empirical evaluation