Understanding and Accelerating Particle-Based Variational Inference

Chang Liu†, Jingwei Zhuo†, Pengyu Cheng‡, Ruiyi Zhang‡, Jun Zhu†§, Lawrence Carin‡§
†: Department of Computer Science and Technology, Tsinghua University
‡: Department of Electrical and Computer Engineering, Duke University
§: Corresponding authors

ICML 2019
C. Liu et al. Understanding and Accelerating ParVIs 1 / 31
1 Introduction
2 Preliminaries
  The Wasserstein Space P2(X)
  Particle-Based Variational Inference Methods
3 ParVIs as Approximations to the P2(X) Gradient Flow
  SVGD Approximates the P2(X) Gradient Flow
  ParVIs Approximate the P2(X) Gradient Flow by Smoothing
  New ParVIs with Smoothing
4 Accelerated First-Order Methods on P2(X)
5 Bandwidth Selection via the Heat Equation
6 Experiments
Introduction

Particle-based Variational Inference Methods (ParVIs):
- Represent the variational distribution q by particles; update the particles to minimize KL_p(q).
- More flexible than classical VIs; more particle-efficient than MCMCs.

What is known:
- Stein Variational Gradient Descent (SVGD) [13] simulates the gradient flow (steepest descending curves) of KL_p on P_H(X) [12].
- The Blob and w-SGLD methods [5] simulate the gradient flow of KL_p on the Wasserstein space P2(X).

What remains unknown:
- Do ParVIs make assumptions when simulating the gradient flow? Does one assume more than another?
- Is it possible to accelerate the gradient flow?
- Is there a principle for selecting the bandwidth parameter?
Contributions

Findings:
- SVGD approximates the gradient flow on P2(X).
- ParVIs approximate the P2(X) gradient flow via a compulsory smoothing treatment.
- Various ParVIs either smooth the density or smooth functions, and the two are equivalent.

Methods:
- Two novel ParVIs.
- An acceleration framework for general ParVIs.
- A principled bandwidth selection method for the smoothing kernel.
Basic Concepts

Spaces:
- Metric space: a set M with a distance function d: M × M → R.
- Riemannian manifold: a topological space M that locally behaves like a Euclidean space (manifold), equipped with an inner product ⟨·,·⟩_{T_xM} on each tangent space T_xM (Riemannian).

Tangent spaces give manifolds their structure. Riemannian manifolds are metric spaces:
d(x, y) := inf_{(γ_t)_t: γ_0=x, γ_1=y} √( ∫_0^1 ⟨γ̇_t, γ̇_t⟩_{T_{γ_t}M} dt ).
Basic Concepts

Gradient flow {(x_t)_t} of a function f: steepest descending curves.

On metric spaces: various definitions ([1], Def. 11.1.1; [18], Def. 23.7), e.g., the Minimizing Movement Scheme (MMS) ([1], Def. 2.0.6):
x_{t+ε} = argmin_{x∈M} f(x) + (1/2ε) d²(x, x_t).

On Riemannian manifolds: ẋ_t = −grad f(x_t), where
⟨grad f(x), v⟩_{T_xM} = v[f] := Σ_i v^i ∂_i f, ∀v ∈ T_xM,
which is equivalent to:
grad f(x) = max · argmax_{v∈T_xM, ‖v‖_{T_xM}=1} (d/dt) f(x_t).

The two notions coincide on Riemannian manifolds.
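As a sanity check in the Euclidean case M = R, a single MMS step can be computed numerically and compared against an explicit gradient step; the objective f below is an illustrative choice, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy check on M = R: one Minimizing Movement Scheme (MMS) step
# x_{t+eps} = argmin_x f(x) + d^2(x, x_t) / (2 eps)
# agrees with the explicit gradient step x_t - eps * f'(x_t) up to o(eps).

def f(x):                  # illustrative objective, not from the slides
    return (x - 2.0) ** 2

def mms_step(x_t, eps):
    res = minimize_scalar(lambda x: f(x) + (x - x_t) ** 2 / (2 * eps))
    return res.x

x_t, eps = 0.0, 1e-3
x_mms  = mms_step(x_t, eps)
x_grad = x_t - eps * 2 * (x_t - 2.0)   # explicit gradient step
print(abs(x_mms - x_grad))             # small, O(eps^2)
```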
The Wasserstein Space P2(X)

P2(X) := {q: distribution on X | ∃x_0 ∈ X s.t. E_q[d(x_0, x)²] < +∞}.

Consider the Euclidean support space X = R^D hereafter.

P2(X) as a metric space ([18], Def. 6.4):
d_W(q, p) := ( inf_{π∈Π(q,p)} E_{π(x,y)}[d(x, y)²] )^{1/2},
where
Π(q, p) := {π: distribution on X × X | ∫_X π(x, y) dy = q(x), ∫_X π(x, y) dx = p(y)}.
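For two equal-weight empirical distributions, the infimum over couplings reduces to an assignment problem over the particles, which gives a small illustrative way to compute d_W; the helper below is a sketch, not part of the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative only: d_W between two equal-weight empirical distributions
# reduces to an assignment problem over their particles.
def w2_empirical(X, Y):
    """X, Y: (N, D) arrays of particles; returns d_W(q, p) for the
    empirical measures q = mean_i delta_{x_i}, p = mean_j delta_{y_j}."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared costs
    r, c = linear_sum_assignment(C)                     # optimal matching
    return np.sqrt(C[r, c].mean())

X = np.array([[0.0], [1.0]])
Y = np.array([[0.0], [3.0]])
print(w2_empirical(X, Y))  # optimal matching 0->0, 1->3: sqrt((0 + 4)/2)
```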
The Wasserstein Space P2(X) (cont.)

P2 as a Riemannian manifold [17, 18, 1] (X = R^D):
Tangent vector ∂_t q_t on P2(X) ⟺ vector field v_t on X:
{x^(i)}_{i=1}^N ∼ q_t ⟹ {x^(i) + ε v_t(x^(i))}_{i=1}^N ∼ (id + ε v_t)_# q_t = q_{t+ε} + o(ε). ([1], Prop. 8.1.8)
The Wasserstein Space P2(X) (cont.)

P2 as a Riemannian manifold [17, 18, 1] (X = R^D):
Tangent space: T_q P2 := closure of {∇φ | φ ∈ C_c^∞} in L²_q,
where L²_q := {u: R^D → R^D | ∫_X ‖u(x)‖²_2 dq < ∞} and φ: R^D → R.
([18], Thm. 13.8; [1], Thm. 8.3.1, Def. 8.4.1, Prop. 8.4.5)

Riemannian metric: ⟨v, u⟩_{T_q P2} := ∫_X v(x) · u(x) q(x) dx
(consistent with the Wasserstein distance d_W [3]).
The Wasserstein Space P2(X) (cont.)

Gradient flow on P2(X) for KL_p(q) := E_q[log(q/p)]:

P2(X) as a Riemannian manifold:
v^GF := −grad KL_p(q) = −∇( (δ/δq) KL_p(q) ) = ∇log p − ∇log q.
([18], Thm. 23.18; [1], Example 11.1.2)

Minimizing Movement Scheme (MMS) ([1], Def. 2.0.6):
q_{t+ε} = argmin_{q∈P2(X)} KL_p(q) + (1/2ε) d_W²(q, q_t).

The two coincide under the Riemannian structure.
([18], Prop. 23.1, Rem. 23.4; [1], Thm. 11.1.6; [8], Lem. 2.7)

Remark 1: The Langevin dynamics dx = ∇log p(x) dt + √2 dB_t (B_t is the Brownian motion) is also the gradient flow of KL_p on P2(X) [9].
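Remark 1 can be illustrated by simulating the Langevin dynamics with an Euler-Maruyama discretization; the target p = N(0, 1), step size, and particle count below are illustrative choices, not from the slides.

```python
import numpy as np

# Euler-Maruyama simulation of the Langevin dynamics in Remark 1,
# dx = grad log p(x) dt + sqrt(2) dB_t, targeting p = N(0, 1).
rng = np.random.default_rng(0)

def grad_log_p(x):        # score of the standard normal target
    return -x

x = rng.standard_normal(5000)          # particles
dt = 0.01
for _ in range(2000):
    x = x + grad_log_p(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(x.shape)

print(x.mean(), x.var())  # approximately 0 and 1
```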
Particle-Based Variational Inference Methods (ParVIs)

Stein Variational Gradient Descent (SVGD) [13]:
v^SVGD(·) := max · argmax_{v∈H^D, ‖v‖_{H^D}=1} −(d/dε) KL_p((id + εv)_# q) |_{ε=0}
           = E_{q(x)}[K(x, ·) ∇log p(x) + ∇_x K(x, ·)],
where H is the reproducing kernel Hilbert space (RKHS) of the kernel K.

v^SVGD is the gradient flow of KL_p on a kernel-related distribution manifold P_H [12].

Blob (w-SGLD-B) [5]:
v^Blob := −∇( (δ/δq) E_q[log(q̃/p)] ) = ∇log p − ∇log q̃ − ∇((q/q̃) ∗ K), where q̃ := q ∗ K.
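A minimal SVGD step, estimating the expectation in v^SVGD over the current particles with an RBF kernel, might look as follows; the bandwidth, step size, and Gaussian target are illustrative choices, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

# One SVGD step: v_SVGD(.) = E_q[K(x, .) grad log p(x) + grad_x K(x, .)],
# with the expectation estimated over the current particles.
def svgd_step(x, grad_log_p, h=1.0, eps=0.1):
    """x: (N, D) particles; grad_log_p: (N, D) -> (N, D)."""
    diff = x[:, None, :] - x[None, :, :]             # (N, N, D), x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))       # (N, N) RBF kernel
    gK = -diff / h * K[:, :, None]                   # grad_{x_i} K(x_i, x_j)
    v = (K @ grad_log_p(x) + gK.sum(0)) / x.shape[0] # kernel-averaged velocity
    return x + eps * v

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 1)) + 5.0              # init far from target
for _ in range(500):
    x = svgd_step(x, lambda x: -x)                   # target p = N(0, 1)
print(x.mean())                                      # drifts toward 0
```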
Particle-Based Variational Inference Methods (ParVIs) (cont.)

Particle Optimization (PO) [4]: uses MMS; estimates d_W by solving the dual optimal transport problem.
x_k^(i) = x_{k−1}^(i) + ε( v^SVGD(x_{k−1}^(i)) + N(0, σ²I) ) + μ( x_{k−1}^(i) − x_{k−2}^(i) ).

w-SGLD [5]: uses MMS; estimates d_W by solving the primal problem. Similar update rule.
SVGD Approximates the P2(X) Gradient Flow

Reformulate v^GF as:
v^GF = max · argmax_{v∈L²_q, ‖v‖_{L²_q}=1} ⟨v^GF, v⟩_{L²_q}.   (1)

We find:

Theorem 2 (v^SVGD approximates v^GF)
v^SVGD = max · argmax_{v∈H^D, ‖v‖_{H^D}=1} ⟨v^GF, v⟩_{L²_q}.

- H^D is a subspace of L²_q, so v^SVGD is the projection of v^GF onto H^D.
- Compared with the P_H(X)-gradient-flow interpretation of SVGD: P_H(X) is not a very nice manifold.
- All ParVIs approximate the P2(X) gradient flow.
ParVIs Approximate the P2(X) Gradient Flow by Smoothing: Smoothing Functions

SVGD restricts the optimization domain from L²_q to H^D.

Theorem 3 (H^D smooths L²_q)
For X = R^D, a Gaussian kernel K on X and an absolutely continuous q, the vector-valued RKHS H^D of K is isometrically isomorphic to the closure G := closure of {φ ∗ K : φ ∈ C_c^∞} in L²_q.

Since the closure of C_c^∞ in L²_q is L²_q itself ([11], Thm. 2.11), G is roughly the kernel-smoothed L²_q.

PO solves the dual problem by restricting the optimization domain from Lipschitz functions to quadratic functions.
ParVIs Approximate the P2(X) Gradient Flow by Smoothing: Smoothing the Density

Blob partially smooths the density:
v^GF = −∇( (δ/δq) E_q[log(q/p)] ) ⟹ v^Blob = −∇( (δ/δq) E_q[log(q̃/p)] ), q̃ := q ∗ K.

w-SGLD adds an entropy regularizer to the primal objective function:
d_W²({x^(i)}_{i=1}^N, {y^(j)}_{j=1}^N) ≈ min_{π_ij} Σ_{i,j} π_ij d²_ij + λ Σ_{i,j} π_ij log π_ij,
s.t. Σ_i π_ij = 1/N, Σ_j π_ij = 1/N.
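The entropy-regularized transport problem above is the form solved by Sinkhorn iterations (Cuturi [6]); below is a minimal sketch with uniform marginals 1/N, where the entropy weight lam and iteration count are illustrative choices.

```python
import numpy as np

# Sinkhorn iterations for the entropy-regularized transport problem:
# min_pi sum_ij pi_ij d_ij^2 + lam * sum_ij pi_ij log pi_ij,
# with uniform marginals 1/N on both sides.
def sinkhorn_w2(X, Y, lam=0.1, iters=500):
    N = len(X)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / lam)                  # Gibbs kernel
    a = b = np.full(N, 1.0 / N)           # uniform marginals 1/N
    v = np.ones(N)
    for _ in range(iters):
        u = a / (K @ v)                   # alternate marginal scaling
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]      # approximate optimal plan
    return (pi * C).sum()                 # transport cost under pi

X = np.array([[0.0], [1.0]])
Y = np.array([[0.0], [3.0]])
print(sinkhorn_w2(X, Y))                  # close to the unregularized 2.0
```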
ParVIs Approximate the P2(X) Gradient Flow by Smoothing (cont.)

Equivalence: a smoothing-function objective takes the form E_q[L(v)] with L: L²_q → L²_q linear, so smoothing the density with q̃ = q ∗ K gives
E_{q̃}[L(v)] = E_{q∗K}[L(v)] = E_q[L(v) ∗ K] = E_q[L(v ∗ K)],
which is the objective with the function v smoothed.

Necessity: grad KL_p(q) is undefined at the empirical distribution q = q̂ := (1/N) Σ_{i=1}^N δ_{x^(i)}.

Theorem 4 (Necessity of smoothing for SVGD)
For q = q̂ and v ∈ L²_p, problem (1),
max_{v∈L²_p, ‖v‖_{L²_p}=1} ⟨v^GF, v⟩_{L²_q},
has no optimal solution. In fact the supremum of the objective is infinite, indicating that a maximizing sequence of v tends to be ill-posed.

ParVIs rely on the smoothing assumption! No free lunch!
New ParVIs with Smoothing

Gradient Flow with Smoothed Density (GFSD): fully smooth the density:
v^GFSD := ∇log p − ∇log q̃, q̃ := q ∗ K.

Gradient Flow with Smoothed test Functions (GFSF):
v^GF = ∇log p − ∇log q
⟹ v^GF = ∇log p + argmin_{u∈L²_q} max_{φ∈C_c^∞, ‖φ‖_{L²_q}=1} ( E_q[φ · u − ∇ · φ] )².

Smooth φ: take φ from H^D:
v^GFSF := ∇log p + argmin_{u∈L²_q} max_{φ∈H^D, ‖φ‖_{H^D}=1} ( E_q[φ · u − ∇ · φ] )².

Solution: v^GFSF = g + K′K⁻¹, where
g_{:,i} = ∇_{x^(i)} log p(x^(i)), K_ij = K(x^(i), x^(j)), K′_{:,i} = Σ_j ∇_{x^(j)} K(x^(j), x^(i)).
(Note v^SVGD = v^GFSF K.)
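The matrix-form GFSF solution can be sketched directly; the RBF kernel and the small ridge term added before inverting K are implementation choices made here for numerical stability, not prescribed by the slides.

```python
import numpy as np

# Sketch of the GFSF velocity v_GFSF = g + K' K^{-1} with an RBF kernel;
# a small ridge is added to K before solving, an implementation choice
# for numerical stability (not from the slides).
def gfsf_velocity(x, grad_log_p, h=1.0, ridge=1e-6):
    """x: (N, D); returns the (N, D) velocity, row i = v_GFSF at x_i."""
    N = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))      # K_ij = K(x_i, x_j)
    # K'_{:,i} = sum_j grad_{x_j} K(x_j, x_i) = sum_j -(x_j - x_i)/h K_ji
    Kp = (diff / h * K[:, :, None]).sum(1)          # (N, D), row i = K'_{:,i}
    g = grad_log_p(x)
    return g + np.linalg.solve(K + ridge * np.eye(N), Kp)

x = np.array([[-1.0], [0.0], [1.0]])
v = gfsf_velocity(x, lambda x: -x)                  # target p = N(0, 1)
print(v)   # antisymmetric for this symmetric configuration
```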
Nesterov's Acceleration Methods on Riemannian Manifolds

r_k ∈ P2(X): auxiliary variable. v_k := −grad KL(r_k).

Riemannian Accelerated Gradient (RAG) [14] (with simplification):
q_k = Exp_{r_{k−1}}(ε v_{k−1}),
r_k = Exp_{q_k}[ −Γ^{q_k}_{r_{k−1}}( ((k−1)/k) Exp⁻¹_{r_{k−1}}(q_{k−1}) − ((k+α−2)/k) ε v_{k−1} ) ].

Riemannian Nesterov's method (RNes) [20] (with simplification):
q_k = Exp_{r_{k−1}}(ε v_{k−1}),
r_k = Exp_{q_k}{ c_1 Exp⁻¹_{q_k}[ Exp_{r_{k−1}}( (1−c_2) Exp⁻¹_{r_{k−1}}(q_{k−1}) + c_2 Exp⁻¹_{r_{k−1}}(q_k) ) ] }.

Required:
- Exponential map Exp_q: T_q P2(X) → P2(X) and its inverse.
- Parallel transport Γ^r_q: T_q P2(X) → T_r P2(X).
Leveraging the Riemannian Structure of P2(X)

Exponential map ([18], Coro. 7.22; [1], Prop. 8.4.6; [8], Prop. 2.1):
Exp_q(v) = (id + v)_# q, i.e., {x^(i)}_i ∼ q ⟹ {x^(i) + v(x^(i))}_i ∼ Exp_q(v).

Inverse exponential map: requires the optimal transport map. Sinkhorn methods [6, 19] appear costly and unstable. We make approximations when {x^(i)}_i and {y^(i)}_i are pairwise close:
d(x^(i), y^(i)) ≪ min{ min_{j≠i} d(x^(i), x^(j)), min_{j≠i} d(y^(i), y^(j)) }.

Proposition 5 (Inverse exponential map)
For pairwise close samples {x^(i)}_i of q and {y^(i)}_i of r, we have (Exp⁻¹_q(r))(x^(i)) ≈ y^(i) − x^(i).

Parallel transport: analytical results [15, 16] are hard to implement. Use Schild's ladder method [7, 10] for approximation.

Proposition 6 (Parallel transport)
For pairwise close samples {x^(i)}_i of q and {y^(i)}_i of r, we have (Γ^r_q(v))(y^(i)) ≈ v(x^(i)), ∀v ∈ T_q P2.
Acceleration Framework for ParVIs

Algorithm 1: The acceleration framework with Wasserstein Accelerated Gradient (WAG) and Wasserstein Nesterov's method (WNes)

1: WAG: select acceleration factor α > 3; WNes: select or calculate c_1, c_2 ∈ R₊;
2: Initialize {x_0^(i)}_{i=1}^N distinctly; let y_0^(i) = x_0^(i);
3: for k = 1, 2, ..., k_max do
4:   for i = 1, ..., N do
5:     Find v(y_{k−1}^(i)) by SVGD/Blob/GFSD/GFSF;
6:     x_k^(i) = y_{k−1}^(i) + ε v(y_{k−1}^(i));
7:     y_k^(i) = x_k^(i) + { WAG: ((k−1)/k)(y_{k−1}^(i) − x_{k−1}^(i)) + ((k+α−2)/k) ε v(y_{k−1}^(i));
                            WNes: c_1(c_2 − 1)(x_k^(i) − x_{k−1}^(i)); }
8:   end for
9: end for
10: Return {x_{k_max}^(i)}_{i=1}^N.
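The WNes branch of this framework can be sketched as follows, here driving an SVGD velocity on the auxiliary particles; eps, c1, c2, and the Gaussian target are illustrative choices, not the authors' settings.

```python
import numpy as np

def svgd_velocity(x, grad_log_p, h=1.0):
    """SVGD velocity at each particle, RBF kernel with bandwidth h."""
    diff = x[:, None, :] - x[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))
    gK = -diff / h * K[:, :, None]
    return (K @ grad_log_p(x) + gK.sum(0)) / x.shape[0]

def wnes_parvi(x0, grad_log_p, eps=0.1, c1=0.5, c2=1.5, k_max=500):
    x = x0.copy()                         # x_{k-1}
    y = x0.copy()                         # auxiliary particles y_{k-1}
    for _ in range(k_max):
        v = svgd_velocity(y, grad_log_p)  # velocity at y_{k-1}
        x_new = y + eps * v               # particle update
        y = x_new + c1 * (c2 - 1) * (x_new - x)   # WNes momentum branch
        x = x_new
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((100, 1)) + 5.0
x = wnes_parvi(x0, lambda x: -x)          # target p = N(0, 1)
print(x.mean())                           # near 0
```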
Bandwidth Selection via the Heat Equation

Note: under the dynamics dx = −∇log q_t(x) dt, q_t evolves following the heat equation (HE): ∂_t q_t(x) = Δq_t(x).

Smoothing the density: q_t(x) ≈ q̃(x) = q̃(x; {x^(i)}_{i=1}^N). Then for q_{t+ε}(x):
- Due to the HE, q_{t+ε}(x) ≈ q̃(x) + ε Δq̃(x).
- Due to the dynamics, the updated particles {x^(i) − ε ∇log q̃(x^(i))}_{i=1}^N approximate q_{t+ε}, so q_{t+ε}(x) ≈ q̃(x; {x^(i) − ε ∇log q̃(x^(i))}_{i=1}^N).

Objective:
Σ_k ( q̃(x^(k)) + ε Δq̃(x^(k)) − q̃(x^(k); {x^(i) − ε ∇log q̃(x^(i))}_{i=1}^N) )².

Take ε → 0 and make the objective dimensionless (h/x² is dimensionless):
(1/h^{D+2}) Σ_k [ Δq̃(x^(k); {x^(i)}_i) + Σ_j ∇_{x^(j)} q̃(x^(k); {x^(i)}_i) · ∇log q̃(x^(j); {x^(i)}_i) ]².

Also applicable to smoothing functions.
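In one dimension (D = 1, so the prefactor is 1/h³), the HE objective can be evaluated on a grid of bandwidths with a Gaussian KDE for q̃; the discretization details below (KDE form, data, grid) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# 1-D sketch of the heat-equation (HE) bandwidth objective above, with a
# Gaussian KDE q~(x) = mean_i K_h(x - x_i), where h plays the variance.
def he_objective(h, x):
    N = len(x)
    d = x[:, None] - x[None, :]                       # d_kj = x_k - x_j
    Kh = np.exp(-d ** 2 / (2 * h)) / np.sqrt(2 * np.pi * h)
    q = Kh.mean(1)                                    # q~(x_k)
    dq = (-d / h * Kh).mean(1)                        # q~'(x_k)
    lap = ((d ** 2 / h ** 2 - 1 / h) * Kh).mean(1)    # Laplacian of q~ at x_k
    score = dq / q                                    # grad log q~ at x_j
    # sum_j grad_{x_j} q~(x_k) * grad log q~(x_j), with
    # grad_{x_j} q~(x_k) = (x_k - x_j)/h * K_h(x_k - x_j) / N
    cross = (d / h * Kh / N) @ score
    return ((lap + cross) ** 2).sum() / h ** 3        # D = 1: 1/h^{D+2}

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
grid = np.logspace(-2, 1, 60)
h_star = grid[np.argmin([he_objective(h, x) for h in grid])]
print(h_star)    # selected bandwidth
```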
Toy Experiments

Figure: Comparison of HE (bottom row) with the median method (top row) for bandwidth selection; columns: SVGD, Blob, GFSD, GFSF. [plots omitted]
Bayesian Logistic Regression (BLR)

Figure: Acceleration effect of WAG and WNes on BLR on the Covertype dataset, measured by prediction accuracy on the test dataset (accuracy 0.70-0.76 over 7500 iterations). Four panels compare WGD, PO, WAG, and WNes under SVGD, Blob, GFSD, and GFSF respectively. Each curve is averaged over 10 runs. [plots omitted]
Bayesian Neural Networks (BNNs)

Table: Results on BNN on the Kin8nm dataset (one of the UCI datasets [2]). Results are averaged over 20 runs.

Avg. Test RMSE (×10⁻²):
Method   SVGD       Blob       GFSD       GFSF
WGD      8.4±0.2    8.2±0.2    8.0±0.3    8.3±0.2
PO       7.8±0.2    8.1±0.2    8.1±0.2    8.0±0.2
WAG      7.0±0.2    7.0±0.2    7.1±0.1    7.0±0.1
WNes     6.9±0.1    7.0±0.2    6.9±0.1    6.8±0.1

Avg. Test LL:
Method   SVGD          Blob          GFSD          GFSF
WGD      1.042±0.016   1.079±0.021   1.087±0.029   1.044±0.016
PO       1.114±0.022   1.070±0.020   1.067±0.017   1.073±0.016
WAG      1.167±0.015   1.169±0.015   1.167±0.017   1.190±0.014
WNes     1.171±0.014   1.168±0.014   1.173±0.016   1.193±0.014
Latent Dirichlet Allocation (LDA)

Figure: Acceleration effect of WAG and WNes on LDA, measured by hold-out perplexity (1020-1100 over about 400 iterations). Four panels compare WGD, PO, WAG, and WNes under SVGD, Blob, GFSD, and GFSF respectively. Curves are averaged over 10 runs. [plots omitted]

Figure: Comparison of SVGD (WGD and WNes) and SGNHT (sequential and parallel) on LDA, as representatives of ParVIs and MCMCs. Averaged over 10 runs. [plot omitted]
Thank you!
References

[1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.
[2] Arthur Asuncion and David Newman. UCI Machine Learning Repository, 2007.
[3] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375-393, 2000.
[4] Changyou Chen and Ruiyi Zhang. Particle optimization in stochastic gradient MCMC. arXiv preprint arXiv:1711.10927, 2017.
[5] Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659, 2018.
[6] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292-2300, 2013.
[7] J. Ehlers, F. Pirani, and A. Schild. The geometry of free fall and light propagation. In General Relativity (Papers in Honour of J. L. Synge), pages 63-84, 1972.
[8] Matthias Erbar et al. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 46, pages 1-23, 2010.
[9] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.
[10] Arkady Kheyfets, Warner A. Miller, and Gregory A. Newton. Schild's ladder parallel transport procedure for an arbitrary connection. International Journal of Theoretical Physics, 39(12):2891-2898, 2000.
[11] Ondřej Kováčik and Jiří Rákosník. On spaces L^{p(x)} and W^{k,p(x)}. Czechoslovak Mathematical Journal, 41(4):592-618, 1991.
[12] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3118-3126, 2017.
[13] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378-2386, 2016.
[14] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875-4884, 2017.
[15] John Lott. Some geometric calculations on Wasserstein space. Communications in Mathematical Physics, 277(2):423-437, 2008.
[16] John Lott. An intrinsic parallel transport in Wasserstein space. Proceedings of the American Mathematical Society, 145(12):5329-5340, 2017.
[17] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. 2001.
[18] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[19] Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.
[20] Hongyi Zhang and Suvrit Sra. An estimate sequence for geodesically convex optimization. In Conference On Learning Theory, pages 1703-1723, 2018.