
Optimal Transport

http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html

Acknowledgement: these slides are based on Prof. Gabriel Peyré's lecture notes.


Applications: comparing measures

→ images, vision, graphics and machine learning, ...

[Figure: comparing two measures — the $L^2$ mean vs. the optimal transport mean.]


Applications: toward high-dimensional OT



Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Kantorovitch’s Formulation

Discrete Optimal Transport
Input: two discrete probability measures
\[ \alpha = \sum_{i=1}^m a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^n b_j \delta_{y_j}. \tag{1} \]
$X = \{x_i\}_i$, $Y = \{y_j\}_j$: given point clouds; the $x_i$, $y_j$ are vectors.
$a_i$, $b_j$: positive weights, $\sum_{i=1}^m a_i = \sum_{j=1}^n b_j = 1$.
$C_{ij}$: costs, $C_{ij} = c(x_i, y_j) \ge 0$.

Couplings
\[ U(a, b) \stackrel{\mathrm{def}}{=} \{ \Pi \in \mathbb{R}_+^{m \times n} ;\ \Pi \mathbf{1}_n = a,\ \Pi^\top \mathbf{1}_m = b \} \tag{2} \]
is called the set of couplings between $a$ and $b$.


Kantorovitch’s Formulation

Discrete Optimal Transport
In optimal transport we want to compute the following quantity [Kantorovich 1942].

Optimal transport distance
\[ L_C(a, b) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j} \Pi_{i,j} ;\ \Pi \in U(a, b) \Big\}. \tag{3} \]
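As a concrete illustration of (3), the coupling and its cost can be computed with the exact LP solver of the POT library (cited in the references); the point clouds, weights, and cost below are illustrative choices, not data from these slides.

    import numpy as np
    import ot

    rng = np.random.RandomState(0)
    m, n = 5, 6
    X = rng.randn(m, 2)                 # support {x_i} of alpha
    Y = rng.randn(n, 2)                 # support {y_j} of beta
    a = np.ones(m) / m                  # weights a, sum(a) = 1
    b = np.ones(n) / n                  # weights b, sum(b) = 1

    C = ot.dist(X, Y)                   # cost C_ij = ||x_i - y_j||^2
    P = ot.emd(a, b, C)                 # optimal coupling solving (3)
    print(np.sum(C * P))                # the value L_C(a, b)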


Push Forward

Radon measures $(\alpha, \beta)$ on $(\mathcal{X}, \mathcal{Y})$.
Transfer of measure by a map $T: \mathcal{X} \to \mathcal{Y}$: the push forward.
The measure $T_\# \alpha$ on $\mathcal{Y}$ is defined by
\[ T_\# \alpha(Y) = \alpha(T^{-1}(Y)), \quad \text{for all measurable } Y \subset \mathcal{Y}. \tag{4} \]
Equivalently,
\[ \int_{\mathcal{Y}} g(y)\, \mathrm{d}T_\#\alpha(y) \stackrel{\mathrm{def}}{=} \int_{\mathcal{X}} g(T(x))\, \mathrm{d}\alpha(x). \tag{5} \]
Discrete measures: $T_\# \alpha = \sum_i a_i \delta_{T(x_i)}$.
Smooth densities $\mathrm{d}\alpha = \rho(x)\,\mathrm{d}x$, $\mathrm{d}\beta = \xi(y)\,\mathrm{d}y$:
\[ T_\# \alpha = \beta \iff \xi(T(x))\, |\det(\partial T(x))| = \rho(x). \tag{6} \]


Monge problem

The Monge problem seeks a map that associates to each point $x_i$ a single point $y_j$, and which must push the mass of $\alpha$ toward the mass of $\beta$, namely
\[ \forall j, \quad b_j = \sum_{i : T(x_i) = y_j} a_i. \]
Discrete case:
\[ \min_T \sum_i c(x_i, T(x_i)), \quad \text{s.t. } T_\# \alpha = \beta. \]
Arbitrary measures:
\[ \min_T \int_{\mathcal{X}} c(x, T(x))\, \mathrm{d}\alpha(x), \quad \text{s.t. } T_\# \alpha = \beta. \]


Couplings between General Measures

Projectors:
\[ P_{\mathcal{X}}: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto x \in \mathcal{X}, \qquad P_{\mathcal{Y}}: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto y \in \mathcal{Y}. \tag{7} \]

Couplings between General Measures
\[ U(\alpha, \beta) \stackrel{\mathrm{def}}{=} \{ \pi \in \mathcal{M}_+(\mathcal{X} \times \mathcal{Y}) ;\ P_{\mathcal{X}\#} \pi = \alpha,\ P_{\mathcal{Y}\#} \pi = \beta \} \tag{8} \]
is called the set of couplings between $\alpha$ and $\beta$.


Cases of Couplings

[Figure: examples of couplings between $\alpha$ and $\beta$ in the discrete, semi-discrete, and continuous cases.]


More Examples

[Figure: further examples of couplings $\pi$ between $\alpha$ and $\beta$.]


Kantorovitch Problem for General Measures

Optimal transport distance between general measures
\[ L(\alpha, \beta, c) \stackrel{\mathrm{def}}{=} \min_{\pi \in U(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\pi(x, y). \tag{9} \]
Probabilistic interpretation:
\[ \min_{(X, Y)} \{ \mathbb{E}_{(X, Y)}(c(X, Y)) ;\ X \sim \alpha,\ Y \sim \beta \}. \tag{10} \]


Wasserstein Distance

Metric space $\mathcal{X} = \mathcal{Y}$.
Distance $d(x, y)$ (nonnegative, symmetric, identity of indiscernibles, triangle inequality).
Cost $c(x, y) = d(x, y)^p$, $p \ge 1$.

Wasserstein Distance
\[ W_p(\alpha, \beta) \stackrel{\mathrm{def}}{=} L(\alpha, \beta, d^p)^{1/p}. \tag{11} \]

Theorem
$W_p$ is a distance, and
\[ W_p(\alpha_n, \alpha) \to 0 \iff \alpha_n \xrightarrow{\mathrm{weak}} \alpha. \tag{12} \]

Example
\[ W_p(\delta_x, \delta_y) = d(x, y). \tag{13} \]


Dual form

Dual problem (discrete case):
\[ \max_{f \in \mathbb{R}^m,\, g \in \mathbb{R}^n} f^\top a + g^\top b, \quad \text{subject to } f_i + g_j \le C_{ij} \ \forall (i, j). \tag{14} \]
Relation between any primal and dual solutions (complementary slackness):
\[ P_{ij} > 0 \Rightarrow f_i + g_j = C_{ij}. \]
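POT can also return the dual potentials of (14): a hedged sketch, assuming (per the POT documentation) that ot.emd with log=True stores them under the keys 'u' and 'v'.

    import numpy as np
    import ot

    rng = np.random.RandomState(0)
    a, b = np.ones(3) / 3, np.ones(4) / 4
    C = rng.rand(3, 4)

    P, log = ot.emd(a, b, C, log=True)
    f, g = log['u'], log['v']                        # dual potentials of (14)
    print(np.isclose(f @ a + g @ b, np.sum(C * P)))  # strong duality holds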


Wasserstein barycenter

Define $C \stackrel{\mathrm{def}}{=} M_{XY}$, where $(M_{XY})_{ij} = d(x_i, y_j)^p$. The Wasserstein distance reads
\[ L(a, b, C) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j} \Pi_{i,j} ;\ \Pi \in U(a, b) \Big\}. \tag{15} \]
Given a set of point clouds and their corresponding probability vectors $\{(Y^{(k)}, b^{(k)})\}$, $k = 1, \ldots, N$, find a support $X = \{x_i\}$ with a probability vector $a$ such that $(X, a)$ is an optimal solution of
\[ \min_{X, a} \frac{1}{N} \sum_{k=1}^N L(a, b^{(k)}, M_{X Y^{(k)}}). \]
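A hedged sketch of this free-support barycenter problem, assuming POT's ot.lp.free_support_barycenter helper (which optimizes the support X for fixed uniform weights a); the point clouds are illustrative.

    import numpy as np
    import ot

    rng = np.random.RandomState(0)
    N, n, d = 3, 20, 2
    Ys = [rng.randn(n, d) + k for k in range(N)]   # point clouds Y^(k)
    bs = [np.ones(n) / n for _ in range(N)]        # probability vectors b^(k)

    X0 = rng.randn(10, d)                          # initial support X
    X = ot.lp.free_support_barycenter(Ys, bs, X0)  # optimized barycenter support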


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Applications: Grayscale image equalization

[Figure: grayscale image equalization — interpolation shown for t = 0, 0.25, 0.5, 0.75, 1.]


Applications: image color adaptation

Example: https://github.com/rflamary/POT/blob/master/notebooks/plot_otda_color_images.ipynb

Given two color images stored in RGB format, I1 and I2 (the helpers im2mat, idx1, idx2 and the imports are defined in the linked notebook):

    # Convert each image to a matrix (one pixel per row)
    X1 = im2mat(I1)
    X2 = im2mat(I2)
    # Subsample pixels
    Xs = X1[idx1, :]
    Xt = X2[idx2, :]
    # Scatter plot of the sampled colors
    pl.scatter(Xs[:, 0], Xs[:, 2], c=Xs)
    # Sinkhorn transport in color space
    ot_sinkhorn = ot.da.SinkhornTransport(reg_e=1e-1)
    ot_sinkhorn.fit(Xs=Xs, Xt=Xt)
    # Map each image's colors onto the other's palette
    transp_Xs_sinkhorn = ot_sinkhorn.transform(Xs=X1)
    transp_Xt_sinkhorn = ot_sinkhorn.inverse_transform(Xt=X2)


Applications: image color palette equalization

[Figure: optimal transport between the color palettes of two photographs.]


Applications: shape interpolation


Applications: MRI Data Processing

[Figure: comparison of the $L^2$ barycenter and the $W_2^2$ barycenter of MRI activation maps.]

Ground cost $c = d_{\mathcal{M}}$: geodesic distance on the cortical surface $\mathcal{M}$.


Applications: shape registration

Shape registration: minimize a data-fitting loss plus a regularity term over diffeomorphisms $\varphi$:
\[ \min_{\varphi\ \mathrm{diffeo}} D(\varphi(\mu), \nu) + R(\varphi). \]
Two choices of loss:
Hilbertian loss (MMD/RKHS): $D(\mu, \nu) = \| k_\sigma \star (\mu - \nu) \|_{L^2}^2$.
Sinkhorn divergence: $D(\mu, \nu) = W_\varepsilon(\mu, \nu)$.

[Figure: input and output shapes at iterations 0 and 250.]

→ Sinkhorn's iterates "propagate" a small-bandwidth kernel.
→ Automatic differentiation: a game changer for advanced losses and models.
→ Do not use OT for registration ... but as a loss.

Joint work with J. Feydy, B. Charlier, F.-X. Vialard.


Applications: word mover’s distance

Ingredients: normalized bag-of-words (nBOW) representations, a word travel cost $c(i, j)$ (word2vec distance), a document distance $\sum_{ij} T_{ij}\, c(i, j)$, computed as a transportation problem.

[Figure: two documents $D_1$, $D_2$ as clouds of word embeddings, with $\mathrm{dist}(D_1, D_2) = W_2(\mu, \nu)$.]


Applications: word mover’s distance

\[ \min_{\Pi \ge 0} \sum_{ij} \Pi_{ij} c_{ij} \quad \text{s.t.} \quad \sum_{j=1}^n \Pi_{ij} = d_i, \qquad \sum_{i=1}^n \Pi_{ij} = d'_j. \]
$x_i$: word2vec embedding of word $i$; $c_{ij} = \| x_i - x_j \|_2$.
If word $i$ appears $w_i$ times in the document, we set $d_i = w_i / \sum_j w_j$.
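A minimal sketch of this LP using POT; the four-word vocabulary and its 2-D "embeddings" are toy stand-ins for word2vec vectors.

    import numpy as np
    import ot

    emb = {"obama": [0.0, 1.0], "president": [0.1, 0.9],
           "speaks": [1.0, 0.0], "greets": [0.9, 0.1]}
    vocab = list(emb)
    X = np.array([emb[w] for w in vocab])

    def nbow(words):
        d = np.array([words.count(w) for w in vocab], dtype=float)
        return d / d.sum()                        # d_i = w_i / sum_j w_j

    C = ot.dist(X, X, metric="euclidean")         # c_ij = ||x_i - x_j||_2
    d1 = nbow(["obama", "speaks"])
    d2 = nbow(["president", "greets"])
    print(ot.emd2(d1, d2, C))                     # word mover's distance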


Applications: topic models

Top-left topic: competitions. Top-right: time. Bottom-left: soccer actions. Bottom-right: drugs.

[Figure: four topics visualized as clouds of embedded words.]

Applications: shape analysis

Use the coupling $T$ to define registrations between color distributions and between shapes.

[Figure: a source shape matched to a family of target shapes; pipeline: geodesic distances → GW distances → MDS visualization in 2-D and 3-D.]


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Discrete OT Review

Given an integer $n > 1$, we write $\Sigma_n$ for the discrete probability simplex
\[ \Sigma_n \stackrel{\mathrm{def}}{=} \Big\{ a \in \mathbb{R}_+^n ;\ \sum_{i=1}^n a_i = 1 \Big\}. \tag{16} \]
Given $a \in \Sigma_m$, $b \in \Sigma_n$, the optimal transport problem is to compute
\[ L(a, b, C) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j} P_{i,j} ;\ P \in U(a, b) \Big\}, \tag{17} \]
where $U(a, b)$ is the set of couplings between $a$ and $b$.


Entropy

The discrete entropy of a positive matrix $P$ (with $\sum_{ij} P_{ij} = 1$) is defined as
\[ H(P) \stackrel{\mathrm{def}}{=} -\sum_{i,j} P_{i,j} (\log(P_{i,j}) - 1). \tag{18} \]
For a positive vector $u \in \Sigma_n$, the entropy is defined analogously:
\[ H(u) \stackrel{\mathrm{def}}{=} -\sum_i u_i (\log(u_i) - 1). \tag{19} \]
For two positive vectors $u, v \in \Sigma_n$, the Kullback-Leibler (KL) divergence is defined as
\[ \mathrm{KL}(u \| v) \stackrel{\mathrm{def}}{=} -\sum_{i=1}^n u_i \log\Big( \frac{v_i}{u_i} \Big). \tag{20} \]
The KL divergence is always non-negative, $\mathrm{KL}(u \| v) \ge 0$, by Jensen's inequality $\mathbb{E}[f(g(X))] \ge f(\mathbb{E}[g(X)])$ applied to the convex function $f = -\log$.
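A small numeric check of definitions (19)-(20); the two vectors are illustrative.

    import numpy as np

    u = np.array([0.2, 0.3, 0.5])
    v = np.array([0.4, 0.4, 0.2])
    H = -np.sum(u * (np.log(u) - 1))      # entropy (19)
    KL = -np.sum(u * np.log(v / u))       # KL divergence (20)
    print(H, KL, KL >= 0)                 # KL(u||v) >= 0, as claimed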


Entropic regularization

Given $a \in \Sigma_m$, $b \in \Sigma_n$ and a cost matrix $C \in \mathbb{R}_+^{m \times n}$, the entropic regularization of the transportation problem reads
\[ L_\varepsilon(a, b, C) = \min_{P \in U(a, b)} \langle P, C \rangle - \varepsilon H(P). \tag{21} \]
The case $\varepsilon = 0$ corresponds to the classic (linear) optimal transport problem.
For $\varepsilon > 0$, problem (21) has an $\varepsilon$-strongly convex objective and therefore admits a unique optimal solution $P_\varepsilon^\star$.
This is not (necessarily) true for $\varepsilon = 0$, but we have the following proposition.


Entropic regularization

Proposition
When $\varepsilon \to 0$, the unique solution $P_\varepsilon$ of (21) converges to the optimal solution with maximal entropy within the set of all optimal solutions of the unregularized transportation problem, namely
\[ P_\varepsilon \xrightarrow{\varepsilon \to 0} \operatorname*{argmax}_P \{ H(P) ;\ P \in U(a, b),\ \langle P, C \rangle = L_0(a, b, C) \}. \tag{22} \]
This proposition motivates solving the problems in (21) sequentially while letting $\varepsilon \to 0$.


Entropic regularization

Proof
Consider a sequence $(\varepsilon_\ell)_\ell$ with $\varepsilon_\ell > 0$ and $\varepsilon_\ell \to 0$, and denote $P_\ell = P_{\varepsilon_\ell}^\star$. Since $U(a, b)$ is bounded, we can extract a subsequence (which we do not relabel, for simplicity) such that $P_\ell \to P^\star$; since $U(a, b)$ is closed, $P^\star \in U(a, b)$. Consider any $P$ such that $\langle C, P \rangle = L_0(a, b, C)$. By optimality of $P$ and $P_\ell$ for their respective optimization problems (for $\varepsilon = 0$ and $\varepsilon = \varepsilon_\ell$), one has
\[ 0 \le \langle C, P_\ell \rangle - \langle C, P \rangle \le \varepsilon_\ell (H(P_\ell) - H(P)). \tag{23} \]
Since $H$ is continuous, taking the limit $\ell \to +\infty$ in this expression shows that $\langle C, P^\star \rangle = \langle C, P \rangle$. Furthermore, dividing (23) by $\varepsilon_\ell$ and taking the limit shows that $H(P) \le H(P^\star)$. The result then follows from the strict convexity of $-H$.


Entropic regularization

By the concavity of entropy, for $\alpha > 0$ we can introduce the convex set
\[ U_\alpha(a, b) \stackrel{\mathrm{def}}{=} \{ P \in U(a, b) \ |\ \mathrm{KL}(P \| a b^\top) \le \alpha \} = \{ P \in U(a, b) \ |\ H(P) \ge H(a) + H(b) - 1 - \alpha \}. \tag{24} \]

Definition: Sinkhorn Distance
\[ d_{C,\alpha}(a, b) \stackrel{\mathrm{def}}{=} \min_{P \in U_\alpha(a, b)} \langle C, P \rangle. \tag{25} \]

Proposition
For $\alpha \ge 0$, $d_{C,\alpha}(a, b)$ is symmetric and satisfies all triangle inequalities. Moreover, $\mathbf{1}_{a \ne b}\, d_{C,\alpha}(a, b)$ satisfies all three distance axioms.


Entropic regularization

Proposition
For $\alpha$ large enough, the Sinkhorn distance $d_{C,\alpha}$ coincides with the transport distance $d_C$.

Proof.
Note that for any $P \in U(a, b)$ we have
\[ H(P) \ge \frac{1}{2} (H(a) + H(b)), \tag{26} \]
so for $\alpha \ge \frac{1}{2}(H(a) + H(b)) - 1$ we have $U_\alpha(a, b) = U(a, b)$.


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Sinkhorn’s algorithm

To solve (21), consider its Lagrangian
\[ L_C^\varepsilon(P, \alpha, \beta) = \langle C, P \rangle - \varepsilon H(P) + \alpha^\top (P \mathbf{1}_n - a) + \beta^\top (P^\top \mathbf{1}_m - b). \tag{27} \]
Setting $\partial L_C^\varepsilon / \partial p_{ij} = 0$ gives
\[ p_{ij} = e^{-\frac{c_{ij} + \alpha_i + \beta_j}{\varepsilon}}, \tag{28} \]
so we can write
\[ P_\varepsilon = \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}})\, e^{-\frac{C}{\varepsilon}}\, \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}). \tag{29} \]
Noting that
\[ P_\varepsilon \mathbf{1}_n = a, \qquad P_\varepsilon^\top \mathbf{1}_m = b, \tag{30} \]
we can then use Sinkhorn's algorithm to find $P_\varepsilon$!


Sinkhorn’s algorithm

Let $u = e^{-\frac{\alpha}{\varepsilon}}$, $v = e^{-\frac{\beta}{\varepsilon}}$ and $K = e^{-C/\varepsilon}$. We restate the KKT system of (21):
\[ P_\varepsilon = \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad a = \operatorname{diag}(u) K v, \qquad b = \operatorname{diag}(v) K^\top u. \tag{31} \]
Sinkhorn's algorithm then amounts to alternating updates of the form
\[ u^{(k+1)} = \operatorname{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \operatorname{diag}(K^\top u^{(k+1)})^{-1} b. \tag{32} \]


Sinkhorn’s algorithm

Sinkhorn's algorithm
1. Compute $K = e^{-\frac{C}{\varepsilon}}$.
2. Compute $\tilde{K} = \operatorname{diag}(a)^{-1} K$.
3. Initialize the scaling factor $u \in \mathbb{R}_+^m$.
4. Iteratively update $u$,
\[ u = \mathbf{1} ./ (\tilde{K} (b ./ (K^\top u))), \]
until a stopping criterion is reached.
5. Compute $v = b ./ (K^\top u)$, and finally $P_\varepsilon = \operatorname{diag}(u)\, K \operatorname{diag}(v)$.
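A minimal NumPy sketch of the five steps above; the regularization, tolerance, and iteration cap are illustrative choices.

    import numpy as np

    def sinkhorn(a, b, C, eps=0.1, tol=1e-9, max_iter=10000):
        K = np.exp(-C / eps)                 # step 1: Gibbs kernel
        Kt = K / a[:, None]                  # step 2: K~ = diag(a)^{-1} K
        u = np.full(a.size, 1.0 / a.size)    # step 3: initial scaling factor
        for _ in range(max_iter):            # step 4: fixed-point iteration
            u_next = 1.0 / (Kt @ (b / (K.T @ u)))
            if np.max(np.abs(u_next - u)) < tol:
                u = u_next
                break
            u = u_next
        v = b / (K.T @ u)                    # step 5: recover v and P_eps
        return u[:, None] * K * v[None, :]   # P_eps = diag(u) K diag(v)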


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Sinkhorn-Newton method

The dual problem of (21) is
\[ \min_{\alpha, \beta} \ \langle a, \alpha \rangle + \langle b, \beta \rangle + \varepsilon \langle e^{-\frac{\alpha}{\varepsilon}}, K e^{-\frac{\beta}{\varepsilon}} \rangle, \tag{33} \]
with $\alpha, \beta$ the dual variables; its first-order optimality conditions are the system
\[ \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}}) K e^{-\frac{\beta}{\varepsilon}} = a, \qquad \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}) K^\top e^{-\frac{\alpha}{\varepsilon}} = b. \]
[1] proposes using Newton's method to solve this system.


Sinkhorn-Newton method

Let
\[ F(\alpha, \beta) = \begin{pmatrix} \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}}) K e^{-\frac{\beta}{\varepsilon}} - a \\ \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}) K^\top e^{-\frac{\alpha}{\varepsilon}} - b \end{pmatrix}. \tag{34} \]
We want to find $\alpha, \beta$ such that $F(\alpha, \beta) = 0$, so that
\[ P_\varepsilon = \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}})\, e^{-\frac{C}{\varepsilon}}\, \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}). \tag{35} \]
The Newton iteration is given by
\[ \begin{pmatrix} \alpha^{(k+1)} \\ \beta^{(k+1)} \end{pmatrix} = \begin{pmatrix} \alpha^{(k)} \\ \beta^{(k)} \end{pmatrix} - J_F^{-1}(\alpha^{(k)}, \beta^{(k)})\, F(\alpha^{(k)}, \beta^{(k)}), \tag{36} \]
where
\[ J_F = \frac{1}{\varepsilon} \begin{pmatrix} \operatorname{diag}(P \mathbf{1}_n) & P \\ P^\top & \operatorname{diag}(P^\top \mathbf{1}_m) \end{pmatrix}. \tag{37} \]
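A hedged sketch of one iteration (36)-(37). Since $J_F$ is singular along $\operatorname{span}\{(\mathbf{1}_m, -\mathbf{1}_n)\}$ (see the next proposition), the sketch solves the Newton system in the least-squares sense; this pseudo-inverse choice is illustrative, not the variant analyzed in [1].

    import numpy as np

    def sinkhorn_newton_step(alpha, beta, K, a, b, eps):
        u, v = np.exp(-alpha / eps), np.exp(-beta / eps)
        P = u[:, None] * K * v[None, :]                 # current plan (35)
        F = np.concatenate([P.sum(axis=1) - a,          # residual F (34)
                            P.sum(axis=0) - b])
        J = np.block([[np.diag(P.sum(axis=1)), P],      # Jacobian (37)
                      [P.T, np.diag(P.sum(axis=0))]]) / eps
        d = np.linalg.lstsq(J, F, rcond=None)[0]        # least-squares direction
        return alpha - d[:a.size], beta - d[a.size:]    # Newton update (36)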


Sinkhorn-Newton method: Convergence

Proposition
For $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$, the Jacobian matrix $J_F(\alpha, \beta)$ is symmetric positive semidefinite, and its kernel is given by
\[ \ker(J_F(\alpha, \beta)) = \operatorname{span}\Big\{ \begin{pmatrix} \mathbf{1}_m \\ -\mathbf{1}_n \end{pmatrix} \Big\}. \tag{38} \]

Proof
$J_F$ is clearly symmetric. For arbitrary $\gamma \in \mathbb{R}^m$ and $\phi \in \mathbb{R}^n$, one has
\[ \begin{pmatrix} \gamma^\top & \phi^\top \end{pmatrix} J_F \begin{pmatrix} \gamma \\ \phi \end{pmatrix} = \frac{1}{\varepsilon} \sum_{ij} P_{ij} (\gamma_i + \phi_j)^2 \ge 0, \]
which holds with equality if and only if $\gamma_i + \phi_j = 0$ for all $i, j$, leading to (38).


Sinkhorn-Newton method: Convergence

Lemma
Let $F: D \to \mathbb{R}^n$ be a continuously differentiable mapping with $D \subset \mathbb{R}^n$ open and convex. Suppose that $F'(x)$ is invertible for each $x \in D$, and assume that the following affine covariant Lipschitz condition holds for all $x, y \in D$:
\[ \| F'(x)^{-1} (F'(y) - F'(x)) (y - x) \| \le \omega \| y - x \|^2. \tag{39} \]
Let $F(x) = 0$ have a solution $x^*$. For the initial guess $x^{(0)}$, assume that $B(x^*, \|x^{(0)} - x^*\|) \subset D$ and that
\[ \omega \| x^{(0)} - x^* \| < 2. \]
Then the ordinary Newton iterates remain in the open ball $B(x^*, \|x^{(0)} - x^*\|)$ and converge to $x^*$ at an estimated quadratic rate
\[ \| x^{(k+1)} - x^* \| \le \frac{\omega}{2} \| x^{(k)} - x^* \|^2. \tag{40} \]
Moreover, the solution $x^*$ is unique in the open ball $B(x^*, 2/\omega)$.


Sinkhorn-Newton method: Convergence

Proof
Denote $e^{(k)} = x^{(k)} - x^*$. We prove the contraction by induction:
\[ \begin{aligned} \| e^{(k+1)} \| &= \| x^{(k)} - F'(x^{(k)})^{-1} (F(x^{(k)}) - F(x^*)) - x^* \| \\ &= \| e^{(k)} - F'(x^{(k)})^{-1} (F(x^{(k)}) - F(x^*)) \| \\ &= \| F'(x^{(k)})^{-1} \big( (F(x^*) - F(x^{(k)})) + F'(x^{(k)}) e^{(k)} \big) \| \\ &= \Big\| F'(x^{(k)})^{-1} \int_0^1 \big( F'(x^{(k)} - s e^{(k)}) - F'(x^{(k)}) \big) e^{(k)}\, \mathrm{d}s \Big\| \\ &\le \omega \int_0^1 s\, \mathrm{d}s\, \| e^{(k)} \|^2 = \frac{\omega}{2} \| e^{(k)} \|^2 < \| e^{(k)} \|, \end{aligned} \tag{41} \]
using (39) in the last inequality. Also,
\[ \omega \| e^{(k+1)} \| \le \omega \| e^{(k)} \| < 2. \tag{42} \]
For the uniqueness part, let $x^{**} \ne x^*$ be another solution and take $x^{(0)} = x^{**}$; then $x^{(1)} = x^{**}$ (Newton iterates are stationary at a solution), and (40) with $k = 0$ forces $\omega \| x^{**} - x^* \| \ge 2$, so no second solution lies in $B(x^*, 2/\omega)$.


Sinkhorn-Newton method: Convergence

Proposition
For any $k \in \mathbb{N}$ with $P_{\varepsilon, ij}^{(k)} > 0$, the affine covariant Lipschitz condition holds in the $\ell_\infty$-norm, whenever $\| y - x \|_\infty \le 1$, for
\[ \omega \le (e^{\frac{1}{\varepsilon}} - 1) \Big( 1 + 2e\, \frac{\max\{ \| P_\varepsilon^{(k)} \mathbf{1}_n \|_\infty,\ \| (P_\varepsilon^{(k)})^\top \mathbf{1}_m \|_\infty \}}{\min_{ij} P_{\varepsilon, ij}^{(k)}} \Big). \tag{43} \]

The proof of this proposition is tedious, so we refer the interested reader to the paper [1].


Relationship with Sinkhorn’s algorithm

Let $u = e^{-\frac{\alpha}{\varepsilon}}$, $v = e^{-\frac{\beta}{\varepsilon}}$ and $K = e^{-C/\varepsilon}$. We again state the KKT system of (21):
\[ P_\varepsilon = \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad a = \operatorname{diag}(u) K v, \qquad b = \operatorname{diag}(v) K^\top u. \tag{44} \]
Sinkhorn's algorithm amounts to alternating updates of the form
\[ u^{(k+1)} = \operatorname{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \operatorname{diag}(K^\top u^{(k+1)})^{-1} b. \tag{45} \]


Relationship with Sinkhorn’s algorithm

Define
\[ G(u, v) = \begin{pmatrix} \operatorname{diag}(u) K v - a \\ \operatorname{diag}(v) K^\top u - b \end{pmatrix}. \tag{46} \]
Proceeding analogously to the Sinkhorn-Newton method just discussed, note that
\[ J_G(u, v) = \begin{pmatrix} \operatorname{diag}(K v) & \operatorname{diag}(u) K \\ \operatorname{diag}(v) K^\top & \operatorname{diag}(K^\top u) \end{pmatrix}. \tag{47} \]
If we neglect the off-diagonal blocks, i.e., take
\[ \tilde{J}_G(u, v) = \begin{pmatrix} \operatorname{diag}(K v) & 0 \\ 0 & \operatorname{diag}(K^\top u) \end{pmatrix}, \tag{48} \]
and perform the Newton iteration
\[ \begin{pmatrix} u^{(k+1)} \\ v^{(k+1)} \end{pmatrix} = \begin{pmatrix} u^{(k)} \\ v^{(k)} \end{pmatrix} - \tilde{J}_G^{-1}(u^{(k)}, v^{(k)})\, G(u^{(k)}, v^{(k)}), \tag{49} \]


Relationship with Sinkhorn’s algorithm

we get
\[ u^{(k+1)} = \operatorname{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \operatorname{diag}(K^\top u^{(k)})^{-1} b. \tag{50} \]
So Sinkhorn's algorithm simply approximates one Newton step, neglecting the off-diagonal blocks and then replacing $u^{(k)}$ by $u^{(k+1)}$ in the update of $v$.
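A quick numeric check that one diagonal-Newton step (49) indeed reproduces the update (50); the data are illustrative.

    import numpy as np

    rng = np.random.RandomState(0)
    m, n, eps = 3, 4, 0.5
    C, a, b = rng.rand(m, n), np.ones(m) / m, np.ones(n) / n
    K = np.exp(-C / eps)
    u, v = np.ones(m), np.ones(n)

    # updates (50), both using the current u^(k), v^(k)
    u_s, v_s = a / (K @ v), b / (K.T @ u)
    # diagonal-Newton step (49): subtract J~^{-1} G blockwise
    u_n = u - (u * (K @ v) - a) / (K @ v)
    v_n = v - (v * (K.T @ u) - b) / (K.T @ u)
    print(np.allclose(u_s, u_n), np.allclose(v_s, v_n))   # True True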


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Shielding Neighborhood Method

Main features:
- Proposed by Bernhard Schmitzer [2] in 2016.
- Exploits a hierarchical multiscale scheme.
- Solves a sequence of sparse subproblems instead of one large dense problem.
- Any existing discrete solver can be used as an internal black box.
- Significant improvements in both runtime and memory requirements have been observed.


Shielding Neighborhood Method

Definition
For some $\mathcal{N} \subset X \times Y$, we call $\mathcal{N}$ a neighborhood.

In this method we repeatedly consider the problem restricted to $\mathcal{N}$. It is important that $\mathcal{N}$ is small (sparse) compared to $X \times Y$.

Definition (Short-Cut)
For a neighborhood $\mathcal{N} \subset X \times Y$ and a coupling $\pi$ with $\operatorname{spt} \pi \subset \mathcal{N}$, let $((x_2, y_2), \ldots, (x_{n-1}, y_{n-1}))$ be an ordered tuple of pairs in $\operatorname{spt} \pi$. We say $((x_2, y_2), \ldots, (x_{n-1}, y_{n-1}))$ is a short-cut for $(x_1, y_n) \in X \times Y$ if $(x_i, y_{i+1}) \in \mathcal{N}$ for $i = 1, \ldots, n-1$ and
\[ c(x_1, y_n) \ge c(x_1, y_2) + \sum_{i=2}^{n-1} \big( c(x_i, y_{i+1}) - c(x_i, y_i) \big). \tag{51} \]


Shielding Neighborhood Method

The goal is to find a clever way to choose a small neighborhood $\mathcal{N}$ such that there is a short-cut for every $(x, y) \notin \mathcal{N}$.

Definition (Shielding Condition)
Let $x \in X$, $y \in Y$ and $(x_s, y_s) \in X \times Y$. We say $(x_s, y_s)$ shields $x$ from $y$ when
\[ c(x, y) + c(x_s, y_s) > c(x, y_s) + c(x_s, y). \]

Definition (Shielding Neighborhood)
For a given coupling $\pi$, we say that a neighborhood $\mathcal{N} \subset X \times Y$ with $\mathcal{N} \supset \operatorname{spt} \pi$ is a shielding neighborhood if for every pair $(x, y) \notin \mathcal{N}$ there exists some $(x_s, y_s) \in \operatorname{spt} \pi$ with $(x_s, y_s) \in \mathcal{N}$ that shields $x$ from $y$.


Multiscale Scheme

Multiscale Scheme (Hierarchical Partition and Multiscale Measure Approximation)
For a discrete set $X$, a hierarchical partition is an ordered tuple $(X_0, \ldots, X_K)$ of partitions of $X$, where $X_0 = \{\{x\} : x \in X\}$ is the trivial partition of $X$ into singletons and each subsequent level is generated by merging cells from the previous level.

For simplicity, we assume that the coarsest level is the trivial partition into one set, $X_K = \{X\}$. We call $K > 0$ the depth of $X$.

This induces a directed tree graph with vertex set $\cup_{k'=0}^K X_{k'}$, and for $k \in \{1, \ldots, K\}$ we say $x' \in X_j$, $j < k$, is a descendant of $x \in X_k$ when $x' \subset x$. We call $x'$ a child of $x$ when $j = k - 1$, and a leaf when $j = 0$.


Main Algorithm

Assume that we already have two subalgorithms:
- solveLocal: $2^{X \times Y} \to \Pi(\mu, \nu)$, such that for $\mathcal{N} \in 2^{X \times Y}$ the coupling solveLocal($\mathcal{N}$) is locally primal optimal w.r.t. $\mathcal{N}$. When $\mathcal{N}$ is sparse, any discrete OT (DOT) solver can quickly provide an answer.
- shield: $\Pi(\mu, \nu) \to 2^{X \times Y}$, such that for $\pi \in \Pi(\mu, \nu)$ the neighborhood shield($\pi$) is shielding for $\pi$.


Main Algorithm

Algorithm 1: Main algorithm
1: $k \leftarrow K$
2: $\pi \leftarrow$ solveDense($k$)    // this can be done immediately
3: while $k > 0$ do
4:     $k \leftarrow k - 1$
5:     $\mathcal{N}_1 \leftarrow \{\}$
6:     for $(x, y) \in \operatorname{spt} \pi$ do
7:         $\mathcal{N}_1 \leftarrow \mathcal{N}_1 \cup (\mathrm{children}(x) \times \mathrm{children}(y))$
8:     $\pi \leftarrow$ solveSparse($k$, $\mathcal{N}_1$)    // solveSparse is presented below
9: return $\pi$


solveSparse

solveSparse uses the current scale level $k$ and a feasible neighborhood $\mathcal{N}_1$ initialized from the coarser level to compute the global optimizer $\pi$ and a corresponding neighborhood $\mathcal{N}$ at the current hierarchical level (a Python transcription of both algorithms follows below).

Algorithm 2: solveSparse
1: Input: current scale level $k$, initial feasible neighborhood $\mathcal{N}_1$
2: $i \leftarrow 1$
3: while $i = 1$ or $C(\pi_i) \ne C(\pi_{i-1})$ do
4:     $\pi_{i+1} \leftarrow$ solveLocal($\mathcal{N}_i$)    // use any DOT solver
5:     $\mathcal{N}_{i+1} \leftarrow$ shield($\pi_{i+1}$, $k$)    // shield is discussed later
6:     $i \leftarrow i + 1$
7: return $\pi_i$    // returning $\mathcal{N}_i$ is not necessary
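A hedged Python transcription of Algorithms 1 and 2. Here solve_dense, solve_local, shield, children, and cost are hypothetical callables standing in for the coarsest-level solve, the black-box DOT solver, the shielding construction of [2], the hierarchy, and the transport cost of a coupling; a coupling is represented as a dict {(x, y): mass}.

    def solve_sparse(k, N, solve_local, shield, cost):
        prev = None
        while True:
            pi = solve_local(N)                # locally optimal coupling w.r.t. N
            if prev is not None and cost(pi) == prev:
                return pi                      # cost stagnated: pi is globally optimal
            prev = cost(pi)
            N = shield(pi, k)                  # shielding neighborhood for pi

    def multiscale_ot(K, solve_dense, solve_local, shield, children, cost):
        pi = solve_dense(K)                    # dense solve is cheap at level K
        for k in range(K - 1, -1, -1):         # refine level by level
            N1 = {(xc, yc)
                  for (x, y), mass in pi.items() if mass > 0
                  for xc in children(x) for yc in children(y)}
            pi = solve_sparse(k, N1, solve_local, shield, cost)
        return pi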


References

G. Peyré and M. Cuturi, Computational Optimal Transport, arXiv:1803.00567, 2018. POT library: https://github.com/rflamary/POT

[1] C. Brauer, C. Clason, D. Lorenz, and B. Wirth, A Sinkhorn-Newton method for entropic optimal transport, arXiv:1710.06635, 2017.

[2] B. Schmitzer, A sparse multiscale algorithm for dense optimal transport, Journal of Mathematical Imaging and Vision, 56 (2016), pp. 238-259.