
Page 1

Understanding and Accelerating Particle-Based Variational Inference

Chang Liu†, Jingwei Zhuo†, Pengyu Cheng‡, Ruiyi Zhang‡, Jun Zhu†§, Lawrence Carin‡§

†: Department of Computer Science and Technology, Tsinghua University
‡: Department of Electrical and Computer Engineering, Duke University

§: Corresponding authors

[email protected]

ICML 2019


Page 2

1 Introduction

2 Preliminaries
    The Wasserstein Space P2(X)
    Particle-Based Variational Inference Methods

3 ParVIs as Approximations to the P2(X) Gradient Flow
    SVGD Approximates P2(X) Gradient Flow
    ParVIs Approximate P2(X) Gradient Flow by Smoothing
    New ParVIs with Smoothing

4 Accelerated First-Order Methods on P2(X)

5 Bandwidth Selection via the Heat Equation

6 Experiments


Page 3

Introduction

Page 4

Introduction

Particle-based Variational Inference Methods (ParVIs):

Represent the variational distribution q by particles; update the particles to minimize KL_p(q).

More flexible than classical VIs; more particle-efficient than MCMCs.

What is known:

Stein Variational Gradient Descent (SVGD) [13] simulates the gradient flow (the steepest descending curve) of KL_p on P_H(X) [12].

The Blob and w-SGLD methods [5] simulate the gradient flow of KL_p on the Wasserstein space P2(X).

What remains unknown:

Do ParVIs make assumptions when simulating the gradient flow? Does one method make stronger assumptions than another?

Is it possible to accelerate the gradient flow?

Is there a principle for selecting the bandwidth parameter?



Page 7

Contributions

Findings:

SVGD approximates the gradient flow on P2(X).

ParVIs approximate the P2(X) gradient flow by a compulsory smoothing treatment.

Various ParVIs either smooth the density or smooth functions, and the two treatments are equivalent.

Methods:

Two novel ParVIs.

An acceleration framework for general ParVIs.

A principled bandwidth selection method for the smoothing kernel.


Page 8

Preliminaries

Page 9

Basic Concepts

Spaces:

Metric space: a set M with a distance function d : M × M → R.

Riemannian manifold: a topological space M that locally behaves like a Euclidean space (manifold), equipped with an inner product ⟨·, ·⟩_{T_xM} on each tangent space T_xM (Riemannian).

The tangent space gives the local linear structure of a manifold. Riemannian manifolds are metric spaces:

d(x, y) := inf_{(γ_t)_t : γ_0 = x, γ_1 = y} √( ∫_0^1 ⟨γ̇_t, γ̇_t⟩_{T_{γ_t}M} dt ).

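A worked instance, assuming M = R^D with the Euclidean inner product on every tangent space: by Cauchy–Schwarz, any curve from x to y satisfies ∫_0^1 ⟨γ̇_t, γ̇_t⟩ dt ≥ ‖y − x‖_2², with equality for the straight line γ_t = (1−t)x + ty, so

d(x, y) = ‖y − x‖_2,

recovering the Euclidean distance.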

Page 10

Basic Concepts

Gradient flow {(x_t)_t} of a function f: the steepest descending curve.

On metric spaces: various definitions ([1], Def. 11.1.1; [18], Def. 23.7), e.g., the Minimizing Movement Scheme (MMS) ([1], Def. 2.0.6):

x_{t+ε} = argmin_{x ∈ M} f(x) + (1/(2ε)) d²(x, x_t).

On Riemannian manifolds: ẋ_t = −grad f(x_t), where

⟨grad f(x), v⟩_{T_xM} = v[f] := Σ_i v^i ∂_i f, ∀v ∈ T_xM,

which is equivalent to:

grad f(x) = max · argmax_{v ∈ T_xM, ‖v‖_{T_xM}=1} (d/dt) f(x_t).

It coincides with MMS on Riemannian manifolds.

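A sanity check in the Euclidean case M = R^D: setting the gradient of the MMS objective to zero gives

0 = ∇f(x_{t+ε}) + (1/ε)(x_{t+ε} − x_t) ⟹ x_{t+ε} = x_t − ε ∇f(x_{t+ε}),

an implicit (backward-Euler) gradient-descent step, which recovers ẋ_t = −∇f(x_t) as ε → 0.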

Page 11

The Wasserstein Space P2(X)

P2(X) := { q: distribution on X | ∃ x_0 ∈ X s.t. E_q[d(x_0, x)²] < +∞ }.

We consider the Euclidean support space X = R^D hereafter.

P2(X) as a metric space ([18], Def. 6.4):

d_W(q, p) := ( inf_{π ∈ Π(q,p)} E_{π(x,y)}[d(x, y)²] )^{1/2},

where

Π(q, p) := { π: distribution on X × X | ∫_X π(x, y) dy = q(x), ∫_X π(x, y) dx = p(y) }.

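For intuition, a minimal sketch (not from the slides) of d_W between two equally weighted empirical distributions: with N atoms on each side, the infimum over couplings is attained at a permutation, so it reduces to a linear assignment problem. The helper name empirical_w2 is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(x, y):
    """2-Wasserstein distance between (1/N) sum_i delta_{x_i} and (1/N) sum_j delta_{y_j}.

    For equal-size, uniformly weighted empirical measures the optimal coupling
    is a permutation: d_W^2 = min_sigma (1/N) sum_i ||x_i - y_{sigma(i)}||^2.
    """
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # N x N squared distances
    rows, cols = linear_sum_assignment(cost)               # optimal permutation
    return np.sqrt(cost[rows, cols].mean())

# Example: two Gaussian clouds in R^2 whose means differ by (2, 2)
rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))
print(empirical_w2(x, y))  # close to the mean shift 2*sqrt(2) ~ 2.83
```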

Page 12

The Wasserstein Space P2(X)

P2 as a Riemannian manifold [17, 18, 1] (X = R^D):

Tangent vector ∂_t q_t on P2(X) ⟺ vector field v_t on X:

{x^{(i)}}_{i=1}^N ∼ q_t ⟹ {x^{(i)} + ε v_t(x^{(i)})}_{i=1}^N ∼ (id + ε v_t)_# q_t = q_{t+ε} + o(ε). ([1], Prop. 8.1.8)


Page 13

The Wasserstein Space P2(X)

P2 as a Riemannian manifold [17, 18, 1] (X = R^D):

Tangent space: T_qP2 := the L_q²-closure of {∇φ | φ ∈ C_c^∞}, where L_q² := { u : R^D → R^D | ∫_X ‖u(x)‖_2² dq < ∞ } and φ : R^D → R. ([18], Thm. 13.8; [1], Thm. 8.3.1, Def. 8.4.1, Prop. 8.4.5)

Riemannian metric: ⟨v, u⟩_{T_qP2} := ∫_X v(x) · u(x) q(x) dx (consistent with the Wasserstein distance d_W [3]).

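Since the metric is an expectation under q, it can be estimated directly from samples; a small sketch (illustrative names), checking that ⟨∇log q, ∇log q⟩_{T_qP2} = D for q = N(0, I_D):

```python
import numpy as np

def tangent_inner(v, u, samples):
    """Monte Carlo estimate of <v, u>_{T_q P2} = E_{x~q}[ v(x) . u(x) ],
    for vector fields v, u : R^D -> R^D and samples drawn from q."""
    return np.mean(np.sum(v(samples) * u(samples), axis=1))

rng = np.random.default_rng(0)
xs = rng.normal(size=(100_000, 2))  # q = N(0, I_2)
v = lambda x: -x                    # equals grad log q for this q
print(tangent_inner(v, v, xs))      # E[||x||^2] = D = 2
```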

Page 14

The Wasserstein Space P2(X)

Gradient flow on P2(X) for KL_p(q) := E_q[log(q/p)]:

P2(X) as a Riemannian manifold:

v_GF := −grad KL_p(q) = −∇( (δ/δq) KL_p(q) ) = ∇log p − ∇log q. ([18], Thm. 23.18; [1], Example 11.1.2)

Minimizing Movement Scheme (MMS) ([1], Def. 2.0.6):

q_{t+ε} = argmin_{q ∈ P2(X)} KL_p(q) + (1/(2ε)) d_W²(q, q_t).

They coincide under the Riemannian structure. ([18], Prop. 23.1, Rem. 23.4; [1], Thm. 11.1.6; [8], Lem. 2.7)

Remark 1
The Langevin dynamics dx = ∇log p(x) dt + √2 dB_t (B_t is Brownian motion) is also the gradient flow of KL_p on P2(X) [9].

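Remark 1 can be simulated directly: an Euler–Maruyama discretization of the Langevin dynamics moves samples along this gradient flow. A minimal sketch, assuming a standard-normal target (function names are illustrative):

```python
import numpy as np

def langevin_sample(grad_log_p, x0, step=1e-2, n_steps=5000, seed=0):
    """Euler-Maruyama discretization of dx = grad log p(x) dt + sqrt(2) dB_t,
    applied independently to each row of x0."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

# Target p = N(0, I_2): grad log p(x) = -x; final samples approximate p.
samples = langevin_sample(lambda x: -x, np.zeros((100, 2)))
print(samples.mean(0), samples.var(0))  # roughly [0, 0] and [1, 1]
```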

Page 15

Particle-Based Variational Inference Methods (ParVIs)

Stein Variational Gradient Descent (SVGD) [13]:

v_SVGD(·) := max · argmax_{v ∈ H^D, ‖v‖_{H^D}=1} −(d/dε) KL_p((id + εv)_# q) |_{ε=0}
           = E_{q(x)}[ K(x, ·) ∇log p(x) + ∇_x K(x, ·) ],

where H is the reproducing kernel Hilbert space (RKHS) of the kernel K.

v_SVGD is the gradient flow of KL_p on a kernel-related distribution manifold P_H [12].

Blob (w-SGLD-B) [5]:

v_Blob := −∇( (δ/δq) E_q[log(q̃/p)] ) = ∇log p − ∇log q̃ − ∇((q/q̃) ∗ K), where q̃ := q ∗ K.

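A minimal NumPy sketch of the SVGD update above, with the expectation over q replaced by the particle average; the RBF kernel K(x, y) = exp(−‖x − y‖²/h) and the median-heuristic bandwidth are common choices, not fixed by the slide:

```python
import numpy as np

def svgd_step(x, grad_log_p, step=0.1):
    """One SVGD update: x_i <- x_i + eps * v(x_i) with
    v(.) = E_q[ K(x, .) grad log p(x) + grad_x K(x, .) ],
    the expectation estimated by averaging over the particles themselves."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    h = np.median(sq) / np.log(len(x) + 1)                # median-heuristic bandwidth
    K = np.exp(-sq / h)                                   # RBF kernel matrix
    # repulsion[i] = sum_j grad_{x_j} K(x_j, x_i) = sum_j (2/h)(x_i - x_j) K_ji
    repulsion = (2.0 / h) * (K[:, :, None] * (x[None, :, :] - x[:, None, :])).sum(0)
    v = (K @ grad_log_p(x) + repulsion) / len(x)          # driving + repulsive terms
    return x + step * v

# Example: 50 particles driven toward p = N(0, I_2), grad log p(x) = -x
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2)) + 5.0
for _ in range(500):
    x = svgd_step(x, lambda z: -z)
# particles now spread around 0 with roughly unit scale
```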

Page 16

Particle-Based Variational Inference Methods (ParVIs)

Particle Optimization (PO) [4]: using MMS; estimate d_W by solving the dual optimal transport problem.

x_k^{(i)} = x_{k−1}^{(i)} + ε( v_SVGD(x_{k−1}^{(i)}) + N(0, σ²I) ) + µ( x_{k−1}^{(i)} − x_{k−2}^{(i)} ).

w-SGLD [5]: using MMS; estimate d_W by solving the primal optimal transport problem. Similar update rule.


Page 17

ParVIs as Approximations to the P2(X) Gradient Flow

Page 18

SVGD Approximates P2(X) Gradient Flow

Reformulate v_GF as:

v_GF = max · argmax_{v ∈ L_q², ‖v‖_{L_q²}=1} ⟨v_GF, v⟩_{L_q²}.    (1)

We find:

Theorem 2 (v_SVGD approximates v_GF)

v_SVGD = max · argmax_{v ∈ H^D, ‖v‖_{H^D}=1} ⟨v_GF, v⟩_{L_q²}.

H^D is a subspace of L_q², so v_SVGD is the projection of v_GF onto H^D.

An advantage over the P_H(X)-gradient-flow interpretation of SVGD: P_H(X) is not a very nice manifold.

All ParVIs approximate the P2(X) gradient flow.


Page 19

ParVIs Approximate P2(X) Gradient Flow by Smoothing

Smoothing Functions

SVGD restricts the optimization domain from L_q² to H^D.

Theorem 3 (H^D smooths L_q²)

For X = R^D, a Gaussian kernel K on X, and an absolutely continuous q, the vector-valued RKHS H^D of K is isometrically isomorphic to G := the L_q²-closure of {φ ∗ K : φ ∈ C_c^∞}.

Since C_c^∞ is dense in L_q² ([11], Thm. 2.11), G is roughly the kernel-smoothed L_q².

PO solves the dual problem by restricting the optimization domain from Lipschitz functions to quadratic functions.


Page 20

ParVIs Approximate P2(X) Gradient Flow by Smoothing

Smoothing the Density

Blob partially smooths the density.

v_GF = −∇( (δ/δq) E_q[log(q/p)] ) ⟹ v_Blob = −∇( (δ/δq) E_q[log(q̃/p)] ).

w-SGLD adds an entropy regularizer to the primal objective:

d_W²({x^{(i)}}_{i=1}^N, {y^{(j)}}_{j=1}^N) ≈ min_{π_{ij}} Σ_{i,j} π_{ij} d_{ij}² + λ Σ_{i,j} π_{ij} log π_{ij},
s.t. Σ_i π_{ij} = 1/N, Σ_j π_{ij} = 1/N.

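A small sketch of how this entropy-regularized problem is typically solved with Sinkhorn iterations [6] on the Gibbs kernel, under the uniform 1/N marginals above (the λ value and names are illustrative; log-domain stabilization is omitted for brevity):

```python
import numpy as np

def sinkhorn_w2sq(x, y, lam=0.5, n_iters=200):
    """Entropy-regularized d_W^2 between uniform empirical measures:
    min_pi <pi, d^2> + lam <pi, log pi>  s.t. both marginals equal 1/N.
    Solved by Sinkhorn fixed-point updates of the scaling vectors u, v."""
    N = len(x)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-cost / lam)              # Gibbs kernel
    u = np.full(N, 1.0 / N)
    for _ in range(n_iters):
        v = (1.0 / N) / (K.T @ u)        # enforce column marginals = 1/N
        u = (1.0 / N) / (K @ v)          # enforce row marginals = 1/N
    pi = u[:, None] * K * v[None, :]     # regularized optimal plan
    return (pi * cost).sum()             # transport-cost part of the objective
```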

Page 21

ParVIs Approximate P2(X) Gradient Flow by Smoothing

Equivalence: a smoothing-function objective takes the form E_q[L(v)] with L : L_q² → L_q² linear

⟹ E_q̃[L(v)] = E_{q∗K}[L(v)] = E_q[L(v) ∗ K] = E_q[L(v ∗ K)].

Necessity: grad KL_p(q) is undefined at the empirical distribution q = q̂ := (1/N) Σ_{i=1}^N δ_{x^{(i)}}.

Theorem 4 (Necessity of smoothing for SVGD)

For q = q̂ and v ∈ L_p², problem (1),

max_{v ∈ L_p², ‖v‖_{L_p²}=1} ⟨v_GF, v⟩_{L_q̂²},

has no optimal solution. In fact the supremum of the objective is infinite, indicating that any maximizing sequence of v tends to be ill-posed.

ParVIs rely on the smoothing assumption! No free lunch!


Page 22

New ParVIs with Smoothing

Gradient Flow with Smoothed Density (GFSD): fully smooth the density:

v_GFSD := ∇log p − ∇log q̃.

Gradient Flow with Smoothed test Functions (GFSF):

v_GF = ∇log p − ∇log q
⟹ v_GF = ∇log p + argmin_{u ∈ L²} max_{φ ∈ C_c^∞, ‖φ‖_{L_q²}=1} ( E_q[φ · u − ∇ · φ] )².

Smooth φ: take φ from H^D:

v_GFSF := ∇log p + argmin_{u ∈ L²} max_{φ ∈ H^D, ‖φ‖_{H^D}=1} ( E_q[φ · u − ∇ · φ] )².

Solution: v_GFSF = g + K′K^{−1} (note v_SVGD = v_GFSF K), where
g_{:,i} = ∇_{x^{(i)}} log p(x^{(i)}), K_{ij} = K(x^{(i)}, x^{(j)}), K′_{:,i} = Σ_j ∇_{x^{(j)}} K(x^{(j)}, x^{(i)}).

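A NumPy sketch of this GFSF solution with particles stored as rows, so the column-convention v = g + K′K^{−1} becomes V = G + K^{−1}K′ (K is symmetric); the RBF kernel, median-heuristic fallback, and the small jitter added before inversion are stabilizing assumptions, not part of the slide:

```python
import numpy as np

def gfsf_update(x, grad_log_p, h=None, jitter=1e-6):
    """GFSF vector field, row convention: V = G + K^{-1} Kp, where
    G[i] = grad log p(x_i), K[i, j] = K(x_i, x_j) (RBF), and
    Kp[i] = sum_j grad_{x_j} K(x_j, x_i), the SVGD-style repulsive term."""
    diff = x[None, :, :] - x[:, None, :]        # diff[j, i] = x_i - x_j
    sq = (diff ** 2).sum(-1)
    if h is None:
        h = np.median(sq) / np.log(len(x) + 1)  # median-heuristic fallback
    K = np.exp(-sq / h)
    Kp = (2.0 / h) * (K[:, :, None] * diff).sum(0)     # Kp[i] = sum_j (2/h)(x_i - x_j) K_ji
    Kinv = np.linalg.inv(K + jitter * np.eye(len(x)))  # jitter stabilizes the inverse
    return grad_log_p(x) + Kinv @ Kp
```

Right-multiplying this field by K recovers the SVGD field, matching the relation v_SVGD = v_GFSF K above.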

Page 23

Accelerated First-Order Methods on P2(X)

Page 24

Nesterov’s Acceleration Methods on Riemannian Manifolds

r_k ∈ P2(X): auxiliary variable; v_k := −grad KL_p(r_k).

Riemannian Accelerated Gradient (RAG) [14] (with simplification):

q_k = Exp_{r_{k−1}}(ε v_{k−1}),
r_k = Exp_{q_k}[ −Γ_{r_{k−1}}^{q_k}( ((k−1)/k) Exp_{r_{k−1}}^{−1}(q_{k−1}) − ((k+α−2)/k) ε v_{k−1} ) ].

Riemannian Nesterov’s method (RNes) [20] (with simplification):

q_k = Exp_{r_{k−1}}(ε v_{k−1}),
r_k = Exp_{q_k}{ c_1 Exp_{q_k}^{−1}[ Exp_{r_{k−1}}( (1−c_2) Exp_{r_{k−1}}^{−1}(q_{k−1}) + c_2 Exp_{r_{k−1}}^{−1}(q_k) ) ] }.

Required:
  Exponential map Exp_q : T_qP2(X) → P2(X) and its inverse.
  Parallel transport Γ_q^r : T_qP2(X) → T_rP2(X).


Page 25

Leveraging the Riemannian Structure of P2(X)

Exponential map ([18], Coro. 7.22; [1], Prop. 8.4.6; [8], Prop. 2.1): Exp_q(v) = (id + v)_# q, i.e., {x^{(i)}}_i ∼ q ⟹ {x^{(i)} + v(x^{(i)})}_i ∼ Exp_q(v).

Inverse exponential map: requires the optimal transport map.
  Sinkhorn methods [6, 19] appear costly and unstable.
  Instead, make approximations when {x^{(i)}}_i and {y^{(i)}}_i are pairwise close:
  d(x^{(i)}, y^{(i)}) ≪ min{ min_{j≠i} d(x^{(i)}, x^{(j)}), min_{j≠i} d(y^{(i)}, y^{(j)}) }.

Proposition 5 (Inverse exponential map)
For pairwise close samples {x^{(i)}}_i of q and {y^{(i)}}_i of r, we have (Exp_q^{−1}(r))(x^{(i)}) ≈ y^{(i)} − x^{(i)}.

Parallel transport: analytical results [15, 16] are hard to implement; use Schild’s ladder method [7, 10] for approximation.

Proposition 6 (Parallel transport)
For pairwise close samples {x^{(i)}}_i of q and {y^{(i)}}_i of r, we have (Γ_q^r(v))(y^{(i)}) ≈ v(x^{(i)}), ∀v ∈ T_qP2.


Page 26

Acceleration Framework for ParVIs

Algorithm 1: The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov’s method (WNes)

1: WAG: select acceleration factor α > 3; WNes: select or calculate c_1, c_2 ∈ R_+;
2: Initialize {x_0^{(i)}}_{i=1}^N distinctly; let y_0^{(i)} = x_0^{(i)};
3: for k = 1, 2, ..., k_max do
4:   for i = 1, ..., N do
5:     Find v(y_{k−1}^{(i)}) by SVGD/Blob/GFSD/GFSF;
6:     x_k^{(i)} = y_{k−1}^{(i)} + ε v(y_{k−1}^{(i)});
7:     y_k^{(i)} = x_k^{(i)} + ((k−1)/k)(y_{k−1}^{(i)} − x_{k−1}^{(i)}) + ((k+α−2)/k) ε v(y_{k−1}^{(i)})   [WAG],
       or y_k^{(i)} = x_k^{(i)} + c_1(c_2 − 1)(x_k^{(i)} − x_{k−1}^{(i)})   [WNes];
8:   end for
9: end for
10: Return {x_{k_max}^{(i)}}_{i=1}^N.

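A direct NumPy transcription of Algorithm 1, a sketch that wraps any ParVI vector field v(·) passed as a callable (e.g., the v computed inside the earlier SVGD sketch); the default ε, α, c_1, c_2 are illustrative values:

```python
import numpy as np

def wag(x, parvi_v, step=0.05, alpha=3.5, n_iters=1000):
    """WAG: Algorithm 1 with the WAG branch of line 7."""
    y = x.copy()
    for k in range(1, n_iters + 1):
        v = parvi_v(y)                               # line 5: v(y_{k-1})
        x_new = y + step * v                         # line 6
        y = x_new + (k - 1) / k * (y - x) \
                  + (k + alpha - 2) / k * step * v   # line 7, WAG
        x = x_new
    return x

def wnes(x, parvi_v, step=0.05, c1=0.9, c2=1.5, n_iters=1000):
    """WNes: the same loop with the WNes branch of line 7."""
    y = x.copy()
    for _ in range(n_iters):
        v = parvi_v(y)
        x_new = y + step * v
        y = x_new + c1 * (c2 - 1.0) * (x_new - x)    # line 7, WNes
        x = x_new
    return x

# Usage: x_final = wnes(x0, some_parvi_field), where some_parvi_field(y)
# returns the chosen ParVI vector field evaluated at the particles y.
```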

Page 27

Bandwidth Selection via the Heat Equation

Page 28

Bandwidth Selection via the Heat Equation

Note: under the dynamics dx = −∇log q_t(x) dt, q_t evolves following the heat equation (HE): ∂_t q_t(x) = Δq_t(x).

Smoothing the density: q_t(x) ≈ q̃(x) = q̃(x; {x^{(i)}}_{i=1}^N). Then for q_{t+ε}(x):

  Due to the HE, q_{t+ε}(x) ≈ q̃(x) + ε Δq̃(x).

  Due to the effect of the dynamics, the updated particles {x^{(i)} − ε ∇log q̃(x^{(i)})}_{i=1}^N approximate q_{t+ε}, so q_{t+ε}(x) ≈ q̃(x; {x^{(i)} − ε ∇log q̃(x^{(i)})}_{i=1}^N).

Objective:

Σ_k ( q̃(x^{(k)}) + ε Δq̃(x^{(k)}) − q̃(x^{(k)}; {x^{(i)} − ε ∇log q̃(x^{(i)})}_{i=1}^N) )².

Taking ε → 0 and making the objective dimensionless (h/x² is dimensionless):

(1/h^{D+2}) Σ_k [ Δq̃(x^{(k)}; {x^{(i)}}_i) + Σ_j ∇_{x^{(j)}} q̃(x^{(k)}; {x^{(i)}}_i) · ∇log q̃(x^{(j)}; {x^{(i)}}_i) ]².

Also applicable to smoothing functions.

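A sketch of this dimensionless objective for an isotropic Gaussian KDE q̃(x) = (1/N) Σ_i N(x; x^{(i)}, h I_D); parameterizing the kernel by its variance h is an assumption consistent with h/x² being dimensionless, and the grid search at the end is one illustrative way to pick h:

```python
import numpy as np

def he_bandwidth_objective(x, h):
    """Heat-equation bandwidth objective for a Gaussian KDE with variance h:
    (1/h^{D+2}) sum_k [ Lap q~(x_k)
        + sum_j grad_{x_j} q~(x_k) . grad log q~(x_j) ]^2."""
    N, D = x.shape
    diff = x[:, None, :] - x[None, :, :]             # diff[k, i] = x_k - x_i
    sq = (diff ** 2).sum(-1)
    kern = np.exp(-sq / (2 * h)) / (2 * np.pi * h) ** (D / 2)
    q = kern.mean(1)                                 # q~(x_k)
    grad_q = -(diff / h * kern[:, :, None]).mean(1)  # grad q~(x_k)
    grad_log_q = grad_q / q[:, None]                 # grad log q~ at the particles
    lap_q = ((sq / h ** 2 - D / h) * kern).mean(1)   # Laplacian of q~ at x_k
    # cross[k, j] = grad_{x_j} q~(x_k) = (1/N) ((x_k - x_j)/h) k_h(x_k - x_j)
    cross = diff / h * kern[:, :, None] / N
    inner = np.einsum('kjd,jd->k', cross, grad_log_q)
    return ((lap_q + inner) ** 2).sum() / h ** (D + 2)

# Hypothetical usage: choose h on a log grid by minimizing the objective
x = np.random.default_rng(0).normal(size=(100, 2))
grid = np.logspace(-2, 1, 30)
h_star = grid[np.argmin([he_bandwidth_objective(x, h) for h in grid])]
```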

Page 29

Experiments

Page 30

Toy Experiments

Figure: Comparison of HE (bottom row) with the median method (top row) for bandwidth selection; columns: SVGD, Blob, GFSD, GFSF.

Page 31

Bayesian Logistic Regression (BLR)

[Four panels: test accuracy (0.70–0.76) vs. iteration (0–7500), one panel per method (SVGD, Blob, GFSD, GFSF), each comparing the WGD, PO, WAG, and WNes variants.]

Figure: Acceleration effect of WAG and WNes on BLR on the Covertype dataset, measured by prediction accuracy on the test dataset. Each curve is averaged over 10 runs.

Page 32

Bayesian Neural Networks (BNNs)

Table: Results on BNN on the Kin8nm dataset (one of the UCI datasets [2]). Results are averaged over 20 runs.

Avg. Test RMSE (×10⁻²):

Method   SVGD      Blob      GFSD      GFSF
WGD      8.4±0.2   8.2±0.2   8.0±0.3   8.3±0.2
PO       7.8±0.2   8.1±0.2   8.1±0.2   8.0±0.2
WAG      7.0±0.2   7.0±0.2   7.1±0.1   7.0±0.1
WNes     6.9±0.1   7.0±0.2   6.9±0.1   6.8±0.1

Avg. Test LL:

Method   SVGD          Blob          GFSD          GFSF
WGD      1.042±0.016   1.079±0.021   1.087±0.029   1.044±0.016
PO       1.114±0.022   1.070±0.020   1.067±0.017   1.073±0.016
WAG      1.167±0.015   1.169±0.015   1.167±0.017   1.190±0.014
WNes     1.171±0.014   1.168±0.014   1.173±0.016   1.193±0.014

Page 33

Latent Dirichlet Allocation (LDA)

[Four panels: hold-out perplexity (1020–1100) vs. iteration (up to 400), one panel per method (SVGD, Blob, GFSD, GFSF), each comparing the WGD, PO, WAG, and WNes variants.]

Figure: Acceleration effect of WAG and WNes on LDA. Inference results are measured by the hold-out perplexity. Curves are averaged over 10 runs.

[One panel: hold-out perplexity (1025–1150) vs. iteration (100–400) for SGNHT-seq, SGNHT-para, SVGD-WGD, and SVGD-WNes.]

Figure: Comparison of SVGD and SGNHT on LDA, as representatives of ParVIs and MCMCs. Averaged over 10 runs.

Page 34

Thank you!


Pages 35–38

References

[1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.

[2] Arthur Asuncion and David Newman. UCI Machine Learning Repository, 2007.

[3] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

[4] Changyou Chen and Ruiyi Zhang. Particle optimization in stochastic gradient MCMC. arXiv preprint arXiv:1711.10927, 2017.

[5] Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.

[6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[7] J. Ehlers, F. Pirani, and A. Schild. The geometry of free fall and light propagation. In General Relativity (Papers in Honour of J. L. Synge), pages 63–84, 1972.

[8] Matthias Erbar. The heat equation on manifolds as a gradient flow in the Wasserstein space. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 46:1–23, 2010.

[9] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

[10] Arkady Kheyfets, Warner A. Miller, and Gregory A. Newton. Schild's ladder parallel transport procedure for an arbitrary connection. International Journal of Theoretical Physics, 39(12):2891–2898, 2000.

[11] Ondřej Kováčik and Jiří Rákosník. On spaces L^{p(x)} and W^{k,p(x)}. Czechoslovak Mathematical Journal, 41(4):592–618, 1991.

[12] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118–3126, 2017.

[13] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.

[14] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875–4884, 2017.

[15] John Lott. Some geometric calculations on Wasserstein space. Communications in Mathematical Physics, 277(2):423–437, 2008.

[16] John Lott. An intrinsic parallel transport in Wasserstein space. Proceedings of the American Mathematical Society, 145(12):5329–5340, 2017.

[17] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1–2):101–174, 2001.

[18] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[19] Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.

[20] Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Conference on Learning Theory, pages 1703–1723, 2018.