Understanding and Accelerating Particle-Based Variational Inference

Chang Liu†, Jingwei Zhuo†, Pengyu Cheng‡, Ruiyi Zhang‡, Jun Zhu†§, Lawrence Carin‡§
†: Department of Computer Science and Technology, Tsinghua University
‡: Department of Electrical and Computer Engineering, Duke University
§: Corresponding authors

ICML 2019
C. Liu et al. Understanding and Accelerating ParVIs 1 / 31
1 Introduction
2 Preliminaries
  The Wasserstein Space P2(X)
  Particle-Based Variational Inference Methods
3 ParVIs as Approximations to the P2(X) Gradient Flow
  SVGD Approximates the P2(X) Gradient Flow
  ParVIs Approximate the P2(X) Gradient Flow by Smoothing
  New ParVIs with Smoothing
4 Accelerated First-Order Methods on P2(X)
5 Bandwidth Selection via the Heat Equation
6 Experiments
Introduction

Particle-based Variational Inference Methods (ParVIs):
- Represent the variational distribution q by particles; update the particles to minimize KL_p(q).
- More flexible than classical VIs; more particle-efficient than MCMCs.

What is known:
- Stein Variational Gradient Descent (SVGD) [13] simulates the gradient flow (steepest descending curves) of KL_p on P_H(X) [12].
- The Blob and w-SGLD methods [5] simulate the gradient flow of KL_p on the Wasserstein space P2(X).

What remains unknown:
- Do ParVIs make assumptions when simulating the gradient flow? Does one assume more than another?
- Is it possible to accelerate the gradient flow?
- Is there a principle for selecting the bandwidth parameter?
Contributions

Findings:
- SVGD approximates the gradient flow on P2(X).
- ParVIs approximate the P2(X) gradient flow via a compulsory smoothing treatment.
- Various ParVIs either smooth the density or smooth functions, and the two are equivalent.

Methods:
- Two novel ParVIs.
- An acceleration framework for general ParVIs.
- A principled bandwidth selection method for the smoothing kernel.
Basic Concepts

Spaces:
- Metric space: a set M with a distance function d: M × M → R.
- Riemannian manifold: a topological space M that locally behaves like a Euclidean space (manifold), equipped with an inner product ⟨·,·⟩_{T_xM} on each tangent space T_xM (Riemannian).

Tangent spaces give manifolds their structure. Riemannian manifolds are metric spaces:
d(x, y) := inf_{(γ_t)_t: γ_0=x, γ_1=y} √( ∫_0^1 ⟨γ̇_t, γ̇_t⟩_{T_{γ_t}M} dt ).
Basic Concepts

Gradient flow {(x_t)_t} of a function f: steepest descending curves.

On metric spaces: various definitions ([1], Def. 11.1.1; [18], Def. 23.7), e.g., the Minimizing Movement Scheme (MMS) ([1], Def. 2.0.6):
x_{t+ε} = argmin_{x∈M} f(x) + (1/2ε) d²(x, x_t).

On Riemannian manifolds: ẋ_t = −grad f(x_t), where
⟨grad f(x), v⟩_{T_xM} = v[f] := Σ_i v^i ∂_i f, ∀v ∈ T_xM,
which is equivalent to:
grad f(x) = max · argmax_{v∈T_xM, ‖v‖_{T_xM}=1} (d/dt) f(x_t).

The two notions coincide on Riemannian manifolds.
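As a sanity check in the Euclidean case M = R, a single MMS step can be computed numerically and compared against an explicit gradient step; the objective f below is an illustrative choice, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy check on M = R: one Minimizing Movement Scheme (MMS) step
# x_{t+eps} = argmin_x f(x) + d^2(x, x_t) / (2 eps)
# agrees with the explicit gradient step x_t - eps * f'(x_t) up to o(eps).

def f(x):                  # illustrative objective, not from the slides
    return (x - 2.0) ** 2

def mms_step(x_t, eps):
    res = minimize_scalar(lambda x: f(x) + (x - x_t) ** 2 / (2 * eps))
    return res.x

x_t, eps = 0.0, 1e-3
x_mms  = mms_step(x_t, eps)
x_grad = x_t - eps * 2 * (x_t - 2.0)   # explicit gradient step
print(abs(x_mms - x_grad))             # small, O(eps^2)
```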
The Wasserstein Space P2(X)

P2(X) := {q: distribution on X | ∃x_0 ∈ X s.t. E_q[d(x_0, x)²] < +∞}.

Consider the Euclidean support space X = R^D hereafter.

P2(X) as a metric space ([18], Def. 6.4):
d_W(q, p) := ( inf_{π∈Π(q,p)} E_{π(x,y)}[d(x, y)²] )^{1/2},
where
Π(q, p) := {π: distribution on X × X | ∫_X π(x, y) dy = q(x), ∫_X π(x, y) dx = p(y)}.
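For two equal-weight empirical distributions, the infimum over couplings reduces to an assignment problem over the particles, which gives a small illustrative way to compute d_W; the helper below is a sketch, not part of the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative only: d_W between two equal-weight empirical distributions
# reduces to an assignment problem over their particles.
def w2_empirical(X, Y):
    """X, Y: (N, D) arrays of particles; returns d_W(q, p) for the
    empirical measures q = mean_i delta_{x_i}, p = mean_j delta_{y_j}."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared costs
    r, c = linear_sum_assignment(C)                     # optimal matching
    return np.sqrt(C[r, c].mean())

X = np.array([[0.0], [1.0]])
Y = np.array([[0.0], [3.0]])
print(w2_empirical(X, Y))  # optimal matching 0->0, 1->3: sqrt((0 + 4)/2)
```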
The Wasserstein Space P2(X) (cont.)

P2 as a Riemannian manifold [17, 18, 1] (X = R^D):
Tangent vector ∂_t q_t on P2(X) ⟺ vector field v_t on X:
{x^(i)}_{i=1}^N ∼ q_t ⟹ {x^(i) + ε v_t(x^(i))}_{i=1}^N ∼ (id + ε v_t)_# q_t = q_{t+ε} + o(ε). ([1], Prop. 8.1.8)
The Wasserstein Space P2(X) (cont.)

P2 as a Riemannian manifold [17, 18, 1] (X = R^D):
Tangent space: T_q P2 := closure of {∇φ | φ ∈ C_c^∞} in L²_q,
where L²_q := {u: R^D → R^D | ∫_X ‖u(x)‖²_2 dq < ∞} and φ: R^D → R.
([18], Thm. 13.8; [1], Thm. 8.3.1, Def. 8.4.1, Prop. 8.4.5)

Riemannian metric: ⟨v, u⟩_{T_q P2} := ∫_X v(x) · u(x) q(x) dx
(consistent with the Wasserstein distance d_W [3]).
The Wasserstein Space P2(X) (cont.)

Gradient flow on P2(X) for KL_p(q) := E_q[log(q/p)]:

P2(X) as a Riemannian manifold:
v^GF := −grad KL_p(q) = −∇( (δ/δq) KL_p(q) ) = ∇log p − ∇log q.
([18], Thm. 23.18; [1], Example 11.1.2)

Minimizing Movement Scheme (MMS) ([1], Def. 2.0.6):
q_{t+ε} = argmin_{q∈P2(X)} KL_p(q) + (1/2ε) d_W²(q, q_t).

The two coincide under the Riemannian structure.
([18], Prop. 23.1, Rem. 23.4; [1], Thm. 11.1.6; [8], Lem. 2.7)

Remark 1: The Langevin dynamics dx = ∇log p(x) dt + √2 dB_t (B_t is the Brownian motion) is also the gradient flow of KL_p on P2(X) [9].
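Remark 1 can be illustrated by simulating the Langevin dynamics with an Euler-Maruyama discretization; the target p = N(0, 1), step size, and particle count below are illustrative choices, not from the slides.

```python
import numpy as np

# Euler-Maruyama simulation of the Langevin dynamics in Remark 1,
# dx = grad log p(x) dt + sqrt(2) dB_t, targeting p = N(0, 1).
rng = np.random.default_rng(0)

def grad_log_p(x):        # score of the standard normal target
    return -x

x = rng.standard_normal(5000)          # particles
dt = 0.01
for _ in range(2000):
    x = x + grad_log_p(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(x.shape)

print(x.mean(), x.var())  # approximately 0 and 1
```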
Particle-Based Variational Inference Methods (ParVIs)

Stein Variational Gradient Descent (SVGD) [13]:
v^SVGD(·) := max · argmax_{v∈H^D, ‖v‖_{H^D}=1} −(d/dε) KL_p((id + εv)_# q) |_{ε=0}
           = E_{q(x)}[K(x, ·) ∇log p(x) + ∇_x K(x, ·)],
where H is the reproducing kernel Hilbert space (RKHS) of the kernel K.

v^SVGD is the gradient flow of KL_p on a kernel-related distribution manifold P_H [12].

Blob (w-SGLD-B) [5]:
v^Blob := −∇( (δ/δq) E_q[log(q̃/p)] ) = ∇log p − ∇log q̃ − ∇((q/q̃) ∗ K), where q̃ := q ∗ K.
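A minimal SVGD step, estimating the expectation in v^SVGD over the current particles with an RBF kernel, might look as follows; the bandwidth, step size, and Gaussian target are illustrative choices, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

# One SVGD step: v_SVGD(.) = E_q[K(x, .) grad log p(x) + grad_x K(x, .)],
# with the expectation estimated over the current particles.
def svgd_step(x, grad_log_p, h=1.0, eps=0.1):
    """x: (N, D) particles; grad_log_p: (N, D) -> (N, D)."""
    diff = x[:, None, :] - x[None, :, :]             # (N, N, D), x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))       # (N, N) RBF kernel
    gK = -diff / h * K[:, :, None]                   # grad_{x_i} K(x_i, x_j)
    v = (K @ grad_log_p(x) + gK.sum(0)) / x.shape[0] # kernel-averaged velocity
    return x + eps * v

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 1)) + 5.0              # init far from target
for _ in range(500):
    x = svgd_step(x, lambda x: -x)                   # target p = N(0, 1)
print(x.mean())                                      # drifts toward 0
```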
Particle-Based Variational Inference Methods (ParVIs) (cont.)

Particle Optimization (PO) [4]: uses MMS; estimates d_W by solving the dual optimal transport problem.
x_k^(i) = x_{k−1}^(i) + ε( v^SVGD(x_{k−1}^(i)) + N(0, σ²I) ) + μ( x_{k−1}^(i) − x_{k−2}^(i) ).

w-SGLD [5]: uses MMS; estimates d_W by solving the primal problem. Similar update rule.
SVGD Approximates the P2(X) Gradient Flow

Reformulate v^GF as:
v^GF = max · argmax_{v∈L²_q, ‖v‖_{L²_q}=1} ⟨v^GF, v⟩_{L²_q}.   (1)

We find:

Theorem 2 (v^SVGD approximates v^GF)
v^SVGD = max · argmax_{v∈H^D, ‖v‖_{H^D}=1} ⟨v^GF, v⟩_{L²_q}.

- H^D is a subspace of L²_q, so v^SVGD is the projection of v^GF onto H^D.
- Compared with the P_H(X)-gradient-flow interpretation of SVGD: P_H(X) is not a very nice manifold.
- All ParVIs approximate the P2(X) gradient flow.
ParVIs Approximate the P2(X) Gradient Flow by Smoothing: Smoothing Functions

SVGD restricts the optimization domain from L²_q to H^D.

Theorem 3 (H^D smooths L²_q)
For X = R^D, a Gaussian kernel K on X and an absolutely continuous q, the vector-valued RKHS H^D of K is isometrically isomorphic to the closure G := closure of {φ ∗ K : φ ∈ C_c^∞} in L²_q.

Since the closure of C_c^∞ in L²_q is L²_q itself ([11], Thm. 2.11), G is roughly the kernel-smoothed L²_q.

PO solves the dual problem by restricting the optimization domain from Lipschitz functions to quadratic functions.
ParVIs Approximate the P2(X) Gradient Flow by Smoothing: Smoothing the Density

Blob partially smooths the density:
v^GF = −∇( (δ/δq) E_q[log(q/p)] ) ⟹ v^Blob = −∇( (δ/δq) E_q[log(q̃/p)] ), q̃ := q ∗ K.

w-SGLD adds an entropy regularizer to the primal objective function:
d_W²({x^(i)}_{i=1}^N, {y^(j)}_{j=1}^N) ≈ min_{π_ij} Σ_{i,j} π_ij d²_ij + λ Σ_{i,j} π_ij log π_ij,
s.t. Σ_i π_ij = 1/N, Σ_j π_ij = 1/N.
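The entropy-regularized transport problem above is the form solved by Sinkhorn iterations (Cuturi [6]); below is a minimal sketch with uniform marginals 1/N, where the entropy weight lam and iteration count are illustrative choices.

```python
import numpy as np

# Sinkhorn iterations for the entropy-regularized transport problem:
# min_pi sum_ij pi_ij d_ij^2 + lam * sum_ij pi_ij log pi_ij,
# with uniform marginals 1/N on both sides.
def sinkhorn_w2(X, Y, lam=0.1, iters=500):
    N = len(X)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / lam)                  # Gibbs kernel
    a = b = np.full(N, 1.0 / N)           # uniform marginals 1/N
    v = np.ones(N)
    for _ in range(iters):
        u = a / (K @ v)                   # alternate marginal scaling
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]      # approximate optimal plan
    return (pi * C).sum()                 # transport cost under pi

X = np.array([[0.0], [1.0]])
Y = np.array([[0.0], [3.0]])
print(sinkhorn_w2(X, Y))                  # close to the unregularized 2.0
```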
ParVIs Approximate the P2(X) Gradient Flow by Smoothing (cont.)

Equivalence: a smoothing-function objective takes the form E_q[L(v)] with L: L²_q → L²_q linear, so smoothing the density with q̃ = q ∗ K gives
E_{q̃}[L(v)] = E_{q∗K}[L(v)] = E_q[L(v) ∗ K] = E_q[L(v ∗ K)],
which is the objective with the function v smoothed.

Necessity: grad KL_p(q) is undefined at the empirical distribution q = q̂ := (1/N) Σ_{i=1}^N δ_{x^(i)}.

Theorem 4 (Necessity of smoothing for SVGD)
For q = q̂ and v ∈ L²_p, problem (1),
max_{v∈L²_p, ‖v‖_{L²_p}=1} ⟨v^GF, v⟩_{L²_q},
has no optimal solution. In fact the supremum of the objective is infinite, indicating that a maximizing sequence of v tends to be ill-posed.

ParVIs rely on the smoothing assumption! No free lunch!
New ParVIs with Smoothing

Gradient Flow with Smoothed Density (GFSD): fully smooth the density:
v^GFSD := ∇log p − ∇log q̃, q̃ := q ∗ K.

Gradient Flow with Smoothed test Functions (GFSF):
v^GF = ∇log p − ∇log q
⟹ v^GF = ∇log p + argmin_{u∈L²_q} max_{φ∈C_c^∞, ‖φ‖_{L²_q}=1} ( E_q[φ · u − ∇ · φ] )².

Smooth φ: take φ from H^D:
v^GFSF := ∇log p + argmin_{u∈L²_q} max_{φ∈H^D, ‖φ‖_{H^D}=1} ( E_q[φ · u − ∇ · φ] )².

Solution: v^GFSF = g + K′K⁻¹, where
g_{:,i} = ∇_{x^(i)} log p(x^(i)), K_ij = K(x^(i), x^(j)), K′_{:,i} = Σ_j ∇_{x^(j)} K(x^(j), x^(i)).
(Note v^SVGD = v^GFSF K.)
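The matrix-form GFSF solution can be sketched directly; the RBF kernel and the small ridge term added before inverting K are implementation choices made here for numerical stability, not prescribed by the slides.

```python
import numpy as np

# Sketch of the GFSF velocity v_GFSF = g + K' K^{-1} with an RBF kernel;
# a small ridge is added to K before solving, an implementation choice
# for numerical stability (not from the slides).
def gfsf_velocity(x, grad_log_p, h=1.0, ridge=1e-6):
    """x: (N, D); returns the (N, D) velocity, row i = v_GFSF at x_i."""
    N = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))      # K_ij = K(x_i, x_j)
    # K'_{:,i} = sum_j grad_{x_j} K(x_j, x_i) = sum_j -(x_j - x_i)/h K_ji
    Kp = (diff / h * K[:, :, None]).sum(1)          # (N, D), row i = K'_{:,i}
    g = grad_log_p(x)
    return g + np.linalg.solve(K + ridge * np.eye(N), Kp)

x = np.array([[-1.0], [0.0], [1.0]])
v = gfsf_velocity(x, lambda x: -x)                  # target p = N(0, 1)
print(v)   # antisymmetric for this symmetric configuration
```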
Nesterov's Acceleration Methods on Riemannian Manifolds

r_k ∈ P2(X): auxiliary variable. v_k := −grad KL(r_k).

Riemannian Accelerated Gradient (RAG) [14] (with simplification):
q_k = Exp_{r_{k−1}}(ε v_{k−1}),
r_k = Exp_{q_k}[ −Γ^{q_k}_{r_{k−1}}( ((k−1)/k) Exp⁻¹_{r_{k−1}}(q_{k−1}) − ((k+α−2)/k) ε v_{k−1} ) ].

Riemannian Nesterov's method (RNes) [20] (with simplification):
q_k = Exp_{r_{k−1}}(ε v_{k−1}),
r_k = Exp_{q_k}{ c_1 Exp⁻¹_{q_k}[ Exp_{r_{k−1}}( (1−c_2) Exp⁻¹_{r_{k−1}}(q_{k−1}) + c_2 Exp⁻¹_{r_{k−1}}(q_k) ) ] }.

Required:
- Exponential map Exp_q: T_q P2(X) → P2(X) and its inverse.
- Parallel transport Γ^r_q: T_q P2(X) → T_r P2(X).
Leveraging the Riemannian Structure of P2(X)

Exponential map ([18], Coro. 7.22; [1], Prop. 8.4.6; [8], Prop. 2.1):
Exp_q(v) = (id + v)_# q, i.e., {x^(i)}_i ∼ q ⟹ {x^(i) + v(x^(i))}_i ∼ Exp_q(v).

Inverse exponential map: requires the optimal transport map. Sinkhorn methods [6, 19] appear costly and unstable. We make approximations when {x^(i)}_i and {y^(i)}_i are pairwise close:
d(x^(i), y^(i)) ≪ min{ min_{j≠i} d(x^(i), x^(j)), min_{j≠i} d(y^(i), y^(j)) }.

Proposition 5 (Inverse exponential map)
For pairwise close samples {x^(i)}_i of q and {y^(i)}_i of r, we have (Exp⁻¹_q(r))(x^(i)) ≈ y^(i) − x^(i).

Parallel transport: analytical results [15, 16] are hard to implement. Use Schild's ladder method [7, 10] for approximation.

Proposition 6 (Parallel transport)
For pairwise close samples {x^(i)}_i of q and {y^(i)}_i of r, we have (Γ^r_q(v))(y^(i)) ≈ v(x^(i)), ∀v ∈ T_q P2.
Acceleration Framework for ParVIs

Algorithm 1: The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes)

1: WAG: select acceleration factor α > 3; WNes: select or calculate c_1, c_2 ∈ R₊;
2: Initialize {x_0^(i)}_{i=1}^N distinctly; let y_0^(i) = x_0^(i);
3: for k = 1, 2, ..., k_max do
4:   for i = 1, ..., N do
5:     Find v(y_{k−1}^(i)) by SVGD/Blob/GFSD/GFSF;
6:     x_k^(i) = y_{k−1}^(i) + ε v(y_{k−1}^(i));
7:     y_k^(i) = x_k^(i) + { WAG: ((k−1)/k)(y_{k−1}^(i) − x_{k−1}^(i)) + ((k+α−2)/k) ε v(y_{k−1}^(i));
                            WNes: c_1(c_2 − 1)(x_k^(i) − x_{k−1}^(i)); }
8:   end for
9: end for
10: Return {x_{k_max}^(i)}_{i=1}^N.
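The WNes branch of this framework can be sketched as follows, here driving an SVGD velocity on the auxiliary particles; eps, c1, c2, and the Gaussian target are illustrative choices, not the authors' settings.

```python
import numpy as np

def svgd_velocity(x, grad_log_p, h=1.0):
    """SVGD velocity at each particle, RBF kernel with bandwidth h."""
    diff = x[:, None, :] - x[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))
    gK = -diff / h * K[:, :, None]
    return (K @ grad_log_p(x) + gK.sum(0)) / x.shape[0]

def wnes_parvi(x0, grad_log_p, eps=0.1, c1=0.5, c2=1.5, k_max=500):
    x = x0.copy()                         # x_{k-1}
    y = x0.copy()                         # auxiliary particles y_{k-1}
    for _ in range(k_max):
        v = svgd_velocity(y, grad_log_p)  # velocity at y_{k-1}
        x_new = y + eps * v               # particle update
        y = x_new + c1 * (c2 - 1) * (x_new - x)   # WNes momentum branch
        x = x_new
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((100, 1)) + 5.0
x = wnes_parvi(x0, lambda x: -x)          # target p = N(0, 1)
print(x.mean())                           # near 0
```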
Bandwidth Selection via the Heat Equation

Note: under the dynamics dx = −∇log q_t(x) dt, q_t evolves following the heat equation (HE): ∂_t q_t(x) = Δq_t(x).

Smoothing the density: q_t(x) ≈ q̃(x) = q̃(x; {x^(i)}_{i=1}^N). Then for q_{t+ε}(x):
- Due to the HE, q_{t+ε}(x) ≈ q̃(x) + ε Δq̃(x).
- Due to the dynamics, the updated particles {x^(i) − ε ∇log q̃(x^(i))}_{i=1}^N approximate q_{t+ε}, so q_{t+ε}(x) ≈ q̃(x; {x^(i) − ε ∇log q̃(x^(i))}_{i=1}^N).

Objective:
Σ_k ( q̃(x^(k)) + ε Δq̃(x^(k)) − q̃(x^(k); {x^(i) − ε ∇log q̃(x^(i))}_{i=1}^N) )².

Take ε → 0 and make the objective dimensionless (h/x² is dimensionless):
(1/h^{D+2}) Σ_k [ Δq̃(x^(k); {x^(i)}_i) + Σ_j ∇_{x^(j)} q̃(x^(k); {x^(i)}_i) · ∇log q̃(x^(j); {x^(i)}_i) ]².

Also applicable to smoothing functions.
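In one dimension (D = 1, so the prefactor is 1/h³), the HE objective can be evaluated on a grid of bandwidths with a Gaussian KDE for q̃; the discretization details below (KDE form, data, grid) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# 1-D sketch of the heat-equation (HE) bandwidth objective above, with a
# Gaussian KDE q~(x) = mean_i K_h(x - x_i), where h plays the variance.
def he_objective(h, x):
    N = len(x)
    d = x[:, None] - x[None, :]                       # d_kj = x_k - x_j
    Kh = np.exp(-d ** 2 / (2 * h)) / np.sqrt(2 * np.pi * h)
    q = Kh.mean(1)                                    # q~(x_k)
    dq = (-d / h * Kh).mean(1)                        # q~'(x_k)
    lap = ((d ** 2 / h ** 2 - 1 / h) * Kh).mean(1)    # Laplacian of q~ at x_k
    score = dq / q                                    # grad log q~ at x_j
    # sum_j grad_{x_j} q~(x_k) * grad log q~(x_j), with
    # grad_{x_j} q~(x_k) = (x_k - x_j)/h * K_h(x_k - x_j) / N
    cross = (d / h * Kh / N) @ score
    return ((lap + cross) ** 2).sum() / h ** 3        # D = 1: 1/h^{D+2}

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
grid = np.logspace(-2, 1, 60)
h_star = grid[np.argmin([he_objective(h, x) for h in grid])]
print(h_star)    # selected bandwidth
```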
Toy Experiments

Figure: Comparison of HE (bottom row) with the median method (top row) for bandwidth selection; columns: SVGD, Blob, GFSD, GFSF. [plots omitted]
Bayesian Logistic Regression (BLR)

Figure: Acceleration effect of WAG and WNes on BLR on the Covertype dataset, measured by prediction accuracy on the test dataset (accuracy 0.70-0.76 over 7500 iterations). Four panels compare WGD, PO, WAG, and WNes under SVGD, Blob, GFSD, and GFSF respectively. Each curve is averaged over 10 runs. [plots omitted]
Bayesian Neural Networks (BNNs)

Table: Results on BNN on the Kin8nm dataset (one of the UCI datasets [2]). Results are averaged over 20 runs.

Avg. Test RMSE (×10⁻²):
Method   SVGD       Blob       GFSD       GFSF
WGD      8.4±0.2    8.2±0.2    8.0±0.3    8.3±0.2
PO       7.8±0.2    8.1±0.2    8.1±0.2    8.0±0.2
WAG      7.0±0.2    7.0±0.2    7.1±0.1    7.0±0.1
WNes     6.9±0.1    7.0±0.2    6.9±0.1    6.8±0.1

Avg. Test LL:
Method   SVGD          Blob          GFSD          GFSF
WGD      1.042±0.016   1.079±0.021   1.087±0.029   1.044±0.016
PO       1.114±0.022   1.070±0.020   1.067±0.017   1.073±0.016
WAG      1.167±0.015   1.169±0.015   1.167±0.017   1.190±0.014
WNes     1.171±0.014   1.168±0.014   1.173±0.016   1.193±0.014
Latent Dirichlet Allocation (LDA)

Figure: Acceleration effect of WAG and WNes on LDA, measured by hold-out perplexity (1020-1100 over about 400 iterations). Four panels compare WGD, PO, WAG, and WNes under SVGD, Blob, GFSD, and GFSF respectively. Curves are averaged over 10 runs. [plots omitted]

Figure: Comparison of SVGD (WGD and WNes) and SGNHT (sequential and parallel) on LDA, as representatives of ParVIs and MCMCs. Averaged over 10 runs. [plot omitted]
Thank you!
References

[1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.
[2] Arthur Asuncion and David Newman. UCI Machine Learning Repository, 2007.
[3] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375-393, 2000.
[4] Changyou Chen and Ruiyi Zhang. Particle optimization in stochastic gradient MCMC. arXiv preprint arXiv:1711.10927, 2017.
[5] Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.
[6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292-2300, 2013.
[7] J. Ehlers, F. Pirani, and A. Schild. The geometry of free fall and light propagation. In General Relativity (Papers in Honour of J. L. Synge), pages 63-84, 1972.
[8] Matthias Erbar et al. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 46, pages 1-23, 2010.
[9] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.
[10] Arkady Kheyfets, Warner A. Miller, and Gregory A. Newton. Schild's ladder parallel transport procedure for an arbitrary connection. International Journal of Theoretical Physics, 39(12):2891-2898, 2000.
[11] Ondřej Kováčik and Jiří Rákosník. On spaces L^{p(x)} and W^{k,p(x)}. Czechoslovak Mathematical Journal, 41(4):592-618, 1991.
[12] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118-3126, 2017.
[13] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378-2386, 2016.
[14] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875-4884, 2017.
[15] John Lott. Some geometric calculations on Wasserstein space. Communications in Mathematical Physics, 277(2):423-437, 2008.
[16] John Lott. An intrinsic parallel transport in Wasserstein space. Proceedings of the American Mathematical Society, 145(12):5329-5340, 2017.
[17] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. 2001.
[18] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[19] Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.
[20] Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Conference On Learning Theory, pages 1703-1723, 2018.