
Optimal Transport

http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html

Acknowledgement: these slides are based on Prof. Gabriel Peyré's lecture notes.


Applications: comparing measures

→ images, vision, graphics and machine learning, ...

[Figure: comparing two measures — the $L^2$ mean vs. the optimal transport mean.]


Applications: toward high-dimensional OT



Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Kantorovitch’s Formulation

Discrete Optimal Transport
Input: two discrete probability measures
\[ \alpha = \sum_{i=1}^m a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^n b_j \delta_{y_j}. \tag{1} \]
$X = \{x_i\}_i$, $Y = \{y_j\}_j$: given point clouds; the $x_i$, $y_j$ are vectors.
$a_i$, $b_j$: positive weights, $\sum_{i=1}^m a_i = \sum_{j=1}^n b_j = 1$.
$C_{ij}$: costs, $C_{ij} = c(x_i, y_j) \ge 0$.

Couplings
\[ U(a, b) \stackrel{\mathrm{def}}{=} \{ \Pi \in \mathbb{R}_+^{m \times n} ;\ \Pi \mathbf{1}_n = a,\ \Pi^\top \mathbf{1}_m = b \} \tag{2} \]
is called the set of couplings between $a$ and $b$.


Kantorovitch’s Formulation

Discrete Optimal Transport
In optimal transport we want to compute the following quantity [Kantorovich 1942].

Optimal transport distance
\[ L_C(a, b) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j} \Pi_{i,j} ;\ \Pi \in U(a, b) \Big\}. \tag{3} \]
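As a concrete illustration of (3), the coupling and its cost can be computed with the exact LP solver of the POT library (cited in the references); the point clouds, weights, and cost below are illustrative choices, not data from these slides.

    import numpy as np
    import ot

    rng = np.random.RandomState(0)
    m, n = 5, 6
    X = rng.randn(m, 2)                 # support {x_i} of alpha
    Y = rng.randn(n, 2)                 # support {y_j} of beta
    a = np.ones(m) / m                  # weights a, sum(a) = 1
    b = np.ones(n) / n                  # weights b, sum(b) = 1

    C = ot.dist(X, Y)                   # cost C_ij = ||x_i - y_j||^2
    P = ot.emd(a, b, C)                 # optimal coupling solving (3)
    print(np.sum(C * P))                # the value L_C(a, b)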


Push Forward

Radon measures $(\alpha, \beta)$ on $(\mathcal{X}, \mathcal{Y})$.
Transfer of measure by a map $T: \mathcal{X} \to \mathcal{Y}$: the push forward.
The measure $T_\# \alpha$ on $\mathcal{Y}$ is defined by
\[ T_\# \alpha(Y) = \alpha(T^{-1}(Y)), \quad \text{for all measurable } Y \subset \mathcal{Y}. \tag{4} \]
Equivalently,
\[ \int_{\mathcal{Y}} g(y)\, \mathrm{d}T_\#\alpha(y) \stackrel{\mathrm{def}}{=} \int_{\mathcal{X}} g(T(x))\, \mathrm{d}\alpha(x). \tag{5} \]
Discrete measures: $T_\# \alpha = \sum_i a_i \delta_{T(x_i)}$.
Smooth densities $\mathrm{d}\alpha = \rho(x)\,\mathrm{d}x$, $\mathrm{d}\beta = \xi(y)\,\mathrm{d}y$:
\[ T_\# \alpha = \beta \iff \xi(T(x))\, |\det(\partial T(x))| = \rho(x). \tag{6} \]


Monge problem

The Monge problem seeks a map that associates to each point $x_i$ a single point $y_j$, and which must push the mass of $\alpha$ toward the mass of $\beta$, namely
\[ \forall j, \quad b_j = \sum_{i : T(x_i) = y_j} a_i. \]
Discrete case:
\[ \min_T \sum_i c(x_i, T(x_i)), \quad \text{s.t. } T_\# \alpha = \beta. \]
Arbitrary measures:
\[ \min_T \int_{\mathcal{X}} c(x, T(x))\, \mathrm{d}\alpha(x), \quad \text{s.t. } T_\# \alpha = \beta. \]


Couplings between General Measures

Projectors:
\[ P_{\mathcal{X}}: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto x \in \mathcal{X}, \qquad P_{\mathcal{Y}}: (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto y \in \mathcal{Y}. \tag{7} \]

Couplings between General Measures
\[ U(\alpha, \beta) \stackrel{\mathrm{def}}{=} \{ \pi \in \mathcal{M}_+(\mathcal{X} \times \mathcal{Y}) ;\ P_{\mathcal{X}\#} \pi = \alpha,\ P_{\mathcal{Y}\#} \pi = \beta \} \tag{8} \]
is called the set of couplings between $\alpha$ and $\beta$.


Cases of Couplings

[Figure: examples of couplings between $\alpha$ and $\beta$ in the discrete, semi-discrete, and continuous cases.]


More Examples

[Figure: further examples of couplings $\pi$ between $\alpha$ and $\beta$.]


Kantorovitch Problem for General Measures

Optimal transport distance between general measures
\[ L(\alpha, \beta, c) \stackrel{\mathrm{def}}{=} \min_{\pi \in U(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\pi(x, y). \tag{9} \]
Probabilistic interpretation:
\[ \min_{(X, Y)} \{ \mathbb{E}_{(X, Y)}(c(X, Y)) ;\ X \sim \alpha,\ Y \sim \beta \}. \tag{10} \]


Wasserstein Distance

Metric space $\mathcal{X} = \mathcal{Y}$.
Distance $d(x, y)$ (nonnegative, symmetric, identity of indiscernibles, triangle inequality).
Cost $c(x, y) = d(x, y)^p$, $p \ge 1$.

Wasserstein Distance
\[ W_p(\alpha, \beta) \stackrel{\mathrm{def}}{=} L(\alpha, \beta, d^p)^{1/p}. \tag{11} \]

Theorem
$W_p$ is a distance, and
\[ W_p(\alpha_n, \alpha) \to 0 \iff \alpha_n \xrightarrow{\mathrm{weak}} \alpha. \tag{12} \]

Example
\[ W_p(\delta_x, \delta_y) = d(x, y). \tag{13} \]


Dual form

Dual problem (discrete case):
\[ \max_{f \in \mathbb{R}^m,\, g \in \mathbb{R}^n} f^\top a + g^\top b, \quad \text{subject to } f_i + g_j \le C_{ij} \ \forall (i, j). \tag{14} \]
Relation between any primal and dual solutions (complementary slackness):
\[ P_{ij} > 0 \Rightarrow f_i + g_j = C_{ij}. \]
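POT can also return the dual potentials of (14): a hedged sketch, assuming (per the POT documentation) that ot.emd with log=True stores them under the keys 'u' and 'v'.

    import numpy as np
    import ot

    rng = np.random.RandomState(0)
    a, b = np.ones(3) / 3, np.ones(4) / 4
    C = rng.rand(3, 4)

    P, log = ot.emd(a, b, C, log=True)
    f, g = log['u'], log['v']                        # dual potentials of (14)
    print(np.isclose(f @ a + g @ b, np.sum(C * P)))  # strong duality holds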


Wasserstein barycenter

Define $C \stackrel{\mathrm{def}}{=} M_{XY}$, where $(M_{XY})_{ij} = d(x_i, y_j)^p$. The Wasserstein distance reads
\[ L(a, b, C) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j} \Pi_{i,j} ;\ \Pi \in U(a, b) \Big\}. \tag{15} \]
Given a set of point clouds and their corresponding probability vectors $\{(Y^{(k)}, b^{(k)})\}$, $k = 1, \ldots, N$, find a support $X = \{x_i\}$ with a probability vector $a$ such that $(X, a)$ is an optimal solution of
\[ \min_{X, a} \frac{1}{N} \sum_{k=1}^N L(a, b^{(k)}, M_{X Y^{(k)}}). \]
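A hedged sketch of this free-support barycenter problem, assuming POT's ot.lp.free_support_barycenter helper (which optimizes the support X for fixed uniform weights a); the point clouds are illustrative.

    import numpy as np
    import ot

    rng = np.random.RandomState(0)
    N, n, d = 3, 20, 2
    Ys = [rng.randn(n, d) + k for k in range(N)]   # point clouds Y^(k)
    bs = [np.ones(n) / n for _ in range(N)]        # probability vectors b^(k)

    X0 = rng.randn(10, d)                          # initial support X
    X = ot.lp.free_support_barycenter(Ys, bs, X0)  # optimized barycenter support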


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Applications: Grayscale image equalization

[Figure: grayscale image equalization — interpolation shown for t = 0, 0.25, 0.5, 0.75, 1.]


Applications: image color adaptation

Example: https://github.com/rflamary/POT/blob/master/notebooks/plot_otda_color_images.ipynb

Given two color images stored in RGB format, I1 and I2 (the helpers im2mat, idx1, idx2 and the imports are defined in the linked notebook):

    # Convert each image to a matrix (one pixel per row)
    X1 = im2mat(I1)
    X2 = im2mat(I2)
    # Subsample pixels
    Xs = X1[idx1, :]
    Xt = X2[idx2, :]
    # Scatter plot of the sampled colors
    pl.scatter(Xs[:, 0], Xs[:, 2], c=Xs)
    # Sinkhorn transport in color space
    ot_sinkhorn = ot.da.SinkhornTransport(reg_e=1e-1)
    ot_sinkhorn.fit(Xs=Xs, Xt=Xt)
    # Map each image's colors onto the other's palette
    transp_Xs_sinkhorn = ot_sinkhorn.transform(Xs=X1)
    transp_Xt_sinkhorn = ot_sinkhorn.inverse_transform(Xt=X2)


Applications: image color palette equalization

[Figure: optimal transport between the color palettes of two photographs.]


Applications: shape interpolation


Applications: MRI Data Processing

[Figure: comparison of the $L^2$ barycenter and the $W_2^2$ barycenter of MRI activation maps.]

Ground cost $c = d_{\mathcal{M}}$: geodesic distance on the cortical surface $\mathcal{M}$.


Applications: shape registration

Shape registration: minimize a data-fitting loss plus a regularity term over diffeomorphisms $\varphi$:
\[ \min_{\varphi\ \mathrm{diffeo}} D(\varphi(\mu), \nu) + R(\varphi). \]
Two choices of loss:
Hilbertian loss (MMD/RKHS): $D(\mu, \nu) = \| k_\sigma \star (\mu - \nu) \|_{L^2}^2$.
Sinkhorn divergence: $D(\mu, \nu) = W_\varepsilon(\mu, \nu)$.

[Figure: input and output shapes at iterations 0 and 250.]

→ Sinkhorn's iterates "propagate" a small-bandwidth kernel.
→ Automatic differentiation: a game changer for advanced losses and models.
→ Do not use OT for registration ... but as a loss.

Joint work with J. Feydy, B. Charlier, F.-X. Vialard.


Applications: word mover’s distance

Ingredients: normalized bag-of-words (nBOW) representations, a word travel cost $c(i, j)$ (word2vec distance), a document distance $\sum_{ij} T_{ij}\, c(i, j)$, computed as a transportation problem.

[Figure: two documents $D_1$, $D_2$ as clouds of word embeddings, with $\mathrm{dist}(D_1, D_2) = W_2(\mu, \nu)$.]


Applications: word mover’s distance

\[ \min_{\Pi \ge 0} \sum_{ij} \Pi_{ij} c_{ij} \quad \text{s.t.} \quad \sum_{j=1}^n \Pi_{ij} = d_i, \qquad \sum_{i=1}^n \Pi_{ij} = d'_j. \]
$x_i$: word2vec embedding of word $i$; $c_{ij} = \| x_i - x_j \|_2$.
If word $i$ appears $w_i$ times in the document, we set $d_i = w_i / \sum_j w_j$.
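A minimal sketch of this LP using POT; the four-word vocabulary and its 2-D "embeddings" are toy stand-ins for word2vec vectors.

    import numpy as np
    import ot

    emb = {"obama": [0.0, 1.0], "president": [0.1, 0.9],
           "speaks": [1.0, 0.0], "greets": [0.9, 0.1]}
    vocab = list(emb)
    X = np.array([emb[w] for w in vocab])

    def nbow(words):
        d = np.array([words.count(w) for w in vocab], dtype=float)
        return d / d.sum()                        # d_i = w_i / sum_j w_j

    C = ot.dist(X, X, metric="euclidean")         # c_ij = ||x_i - x_j||_2
    d1 = nbow(["obama", "speaks"])
    d2 = nbow(["president", "greets"])
    print(ot.emd2(d1, d2, C))                     # word mover's distance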


Applications: topic models

Top-left topic: competitions. Top-right: time. Bottom-left: soccer actions. Bottom-right: drugs.

[Figure: four topics visualized as clouds of embedded words.]

Applications: shape analysis

Use the coupling $T$ to define registrations between color distributions and between shapes.

[Figure: a source shape matched to a family of target shapes; pipeline: geodesic distances → GW distances → MDS visualization in 2-D and 3-D.]


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Discrete OT Review

Given an integer $n > 1$, we write $\Sigma_n$ for the discrete probability simplex
\[ \Sigma_n \stackrel{\mathrm{def}}{=} \Big\{ a \in \mathbb{R}_+^n ;\ \sum_{i=1}^n a_i = 1 \Big\}. \tag{16} \]
Given $a \in \Sigma_m$, $b \in \Sigma_n$, the optimal transport problem is to compute
\[ L(a, b, C) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j} P_{i,j} ;\ P \in U(a, b) \Big\}, \tag{17} \]
where $U(a, b)$ is the set of couplings between $a$ and $b$.


Entropy

The discrete entropy of a positive matrix $P$ (with $\sum_{ij} P_{ij} = 1$) is defined as
\[ H(P) \stackrel{\mathrm{def}}{=} -\sum_{i,j} P_{i,j} (\log(P_{i,j}) - 1). \tag{18} \]
For a positive vector $u \in \Sigma_n$, the entropy is defined analogously:
\[ H(u) \stackrel{\mathrm{def}}{=} -\sum_i u_i (\log(u_i) - 1). \tag{19} \]
For two positive vectors $u, v \in \Sigma_n$, the Kullback-Leibler (KL) divergence is defined as
\[ \mathrm{KL}(u \| v) \stackrel{\mathrm{def}}{=} -\sum_{i=1}^n u_i \log\Big( \frac{v_i}{u_i} \Big). \tag{20} \]
The KL divergence is always non-negative, $\mathrm{KL}(u \| v) \ge 0$, by Jensen's inequality $\mathbb{E}[f(g(X))] \ge f(\mathbb{E}[g(X)])$ applied to the convex function $f = -\log$.
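A small numeric check of definitions (19)-(20); the two vectors are illustrative.

    import numpy as np

    u = np.array([0.2, 0.3, 0.5])
    v = np.array([0.4, 0.4, 0.2])
    H = -np.sum(u * (np.log(u) - 1))      # entropy (19)
    KL = -np.sum(u * np.log(v / u))       # KL divergence (20)
    print(H, KL, KL >= 0)                 # KL(u||v) >= 0, as claimed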


Entropic regularization

Given $a \in \Sigma_m$, $b \in \Sigma_n$ and a cost matrix $C \in \mathbb{R}_+^{m \times n}$, the entropic regularization of the transportation problem reads
\[ L_\varepsilon(a, b, C) = \min_{P \in U(a, b)} \langle P, C \rangle - \varepsilon H(P). \tag{21} \]
The case $\varepsilon = 0$ corresponds to the classic (linear) optimal transport problem.
For $\varepsilon > 0$, problem (21) has an $\varepsilon$-strongly convex objective and therefore admits a unique optimal solution $P_\varepsilon^\star$.
This is not (necessarily) true for $\varepsilon = 0$, but we have the following proposition.


Entropic regularization

Proposition
When $\varepsilon \to 0$, the unique solution $P_\varepsilon$ of (21) converges to the optimal solution with maximal entropy within the set of all optimal solutions of the unregularized transportation problem, namely
\[ P_\varepsilon \xrightarrow{\varepsilon \to 0} \operatorname*{argmax}_P \{ H(P) ;\ P \in U(a, b),\ \langle P, C \rangle = L_0(a, b, C) \}. \tag{22} \]
This proposition motivates solving the problems in (21) sequentially while letting $\varepsilon \to 0$.


Entropic regularization

Proof
Consider a sequence $(\varepsilon_\ell)_\ell$ with $\varepsilon_\ell > 0$ and $\varepsilon_\ell \to 0$, and denote $P_\ell = P_{\varepsilon_\ell}^\star$. Since $U(a, b)$ is bounded, we can extract a subsequence (which we do not relabel, for simplicity) such that $P_\ell \to P^\star$; since $U(a, b)$ is closed, $P^\star \in U(a, b)$. Consider any $P$ such that $\langle C, P \rangle = L_0(a, b, C)$. By optimality of $P$ and $P_\ell$ for their respective optimization problems (for $\varepsilon = 0$ and $\varepsilon = \varepsilon_\ell$), one has
\[ 0 \le \langle C, P_\ell \rangle - \langle C, P \rangle \le \varepsilon_\ell (H(P_\ell) - H(P)). \tag{23} \]
Since $H$ is continuous, taking the limit $\ell \to +\infty$ in this expression shows that $\langle C, P^\star \rangle = \langle C, P \rangle$. Furthermore, dividing (23) by $\varepsilon_\ell$ and taking the limit shows that $H(P) \le H(P^\star)$. The result then follows from the strict convexity of $-H$.


Entropic regularization

By the concavity of entropy, for $\alpha > 0$ we can introduce the convex set
\[ U_\alpha(a, b) \stackrel{\mathrm{def}}{=} \{ P \in U(a, b) \ |\ \mathrm{KL}(P \| a b^\top) \le \alpha \} = \{ P \in U(a, b) \ |\ H(P) \ge H(a) + H(b) - 1 - \alpha \}. \tag{24} \]

Definition: Sinkhorn Distance
\[ d_{C,\alpha}(a, b) \stackrel{\mathrm{def}}{=} \min_{P \in U_\alpha(a, b)} \langle C, P \rangle. \tag{25} \]

Proposition
For $\alpha \ge 0$, $d_{C,\alpha}(a, b)$ is symmetric and satisfies all triangle inequalities. Moreover, $\mathbf{1}_{a \ne b}\, d_{C,\alpha}(a, b)$ satisfies all three distance axioms.


Entropic regularization

Proposition
For $\alpha$ large enough, the Sinkhorn distance $d_{C,\alpha}$ coincides with the transport distance $d_C$.

Proof.
Note that for any $P \in U(a, b)$ we have
\[ H(P) \ge \frac{1}{2} (H(a) + H(b)), \tag{26} \]
so for $\alpha \ge \frac{1}{2}(H(a) + H(b)) - 1$ we have $U_\alpha(a, b) = U(a, b)$.


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Sinkhorn’s algorithm

To solve (21), consider its Lagrangian
\[ L_C^\varepsilon(P, \alpha, \beta) = \langle C, P \rangle - \varepsilon H(P) + \alpha^\top (P \mathbf{1}_n - a) + \beta^\top (P^\top \mathbf{1}_m - b). \tag{27} \]
Setting $\partial L_C^\varepsilon / \partial p_{ij} = 0$ gives
\[ p_{ij} = e^{-\frac{c_{ij} + \alpha_i + \beta_j}{\varepsilon}}, \tag{28} \]
so we can write
\[ P_\varepsilon = \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}})\, e^{-\frac{C}{\varepsilon}}\, \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}). \tag{29} \]
Noting that
\[ P_\varepsilon \mathbf{1}_n = a, \qquad P_\varepsilon^\top \mathbf{1}_m = b, \tag{30} \]
we can then use Sinkhorn's algorithm to find $P_\varepsilon$!


Sinkhorn’s algorithm

Let $u = e^{-\frac{\alpha}{\varepsilon}}$, $v = e^{-\frac{\beta}{\varepsilon}}$ and $K = e^{-C/\varepsilon}$. We restate the KKT system of (21):
\[ P_\varepsilon = \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad a = \operatorname{diag}(u) K v, \qquad b = \operatorname{diag}(v) K^\top u. \tag{31} \]
Sinkhorn's algorithm then amounts to alternating updates of the form
\[ u^{(k+1)} = \operatorname{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \operatorname{diag}(K^\top u^{(k+1)})^{-1} b. \tag{32} \]


Sinkhorn’s algorithm

Sinkhorn's algorithm
1. Compute $K = e^{-\frac{C}{\varepsilon}}$.
2. Compute $\tilde{K} = \operatorname{diag}(a)^{-1} K$.
3. Initialize the scaling factor $u \in \mathbb{R}_+^m$.
4. Iteratively update $u$,
\[ u = \mathbf{1} ./ (\tilde{K} (b ./ (K^\top u))), \]
until a stopping criterion is reached.
5. Compute $v = b ./ (K^\top u)$, and finally $P_\varepsilon = \operatorname{diag}(u)\, K \operatorname{diag}(v)$.
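A minimal NumPy sketch of the five steps above; the regularization, tolerance, and iteration cap are illustrative choices.

    import numpy as np

    def sinkhorn(a, b, C, eps=0.1, tol=1e-9, max_iter=10000):
        K = np.exp(-C / eps)                 # step 1: Gibbs kernel
        Kt = K / a[:, None]                  # step 2: K~ = diag(a)^{-1} K
        u = np.full(a.size, 1.0 / a.size)    # step 3: initial scaling factor
        for _ in range(max_iter):            # step 4: fixed-point iteration
            u_next = 1.0 / (Kt @ (b / (K.T @ u)))
            if np.max(np.abs(u_next - u)) < tol:
                u = u_next
                break
            u = u_next
        v = b / (K.T @ u)                    # step 5: recover v and P_eps
        return u[:, None] * K * v[None, :]   # P_eps = diag(u) K diag(v)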


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Sinkhorn-Newton method

The dual problem of (21) is
\[ \min_{\alpha, \beta} \ \langle a, \alpha \rangle + \langle b, \beta \rangle + \varepsilon \langle e^{-\frac{\alpha}{\varepsilon}}, K e^{-\frac{\beta}{\varepsilon}} \rangle, \tag{33} \]
with $\alpha, \beta$ the dual variables; its first-order optimality conditions are the system
\[ \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}}) K e^{-\frac{\beta}{\varepsilon}} = a, \qquad \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}) K^\top e^{-\frac{\alpha}{\varepsilon}} = b. \]
[1] proposes using Newton's method to solve this system.


Sinkhorn-Newton method

Let
\[ F(\alpha, \beta) = \begin{pmatrix} \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}}) K e^{-\frac{\beta}{\varepsilon}} - a \\ \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}) K^\top e^{-\frac{\alpha}{\varepsilon}} - b \end{pmatrix}. \tag{34} \]
We want to find $\alpha, \beta$ such that $F(\alpha, \beta) = 0$, so that
\[ P_\varepsilon = \operatorname{diag}(e^{-\frac{\alpha}{\varepsilon}})\, e^{-\frac{C}{\varepsilon}}\, \operatorname{diag}(e^{-\frac{\beta}{\varepsilon}}). \tag{35} \]
The Newton iteration is given by
\[ \begin{pmatrix} \alpha^{(k+1)} \\ \beta^{(k+1)} \end{pmatrix} = \begin{pmatrix} \alpha^{(k)} \\ \beta^{(k)} \end{pmatrix} - J_F^{-1}(\alpha^{(k)}, \beta^{(k)})\, F(\alpha^{(k)}, \beta^{(k)}), \tag{36} \]
where
\[ J_F = \frac{1}{\varepsilon} \begin{pmatrix} \operatorname{diag}(P \mathbf{1}_n) & P \\ P^\top & \operatorname{diag}(P^\top \mathbf{1}_m) \end{pmatrix}. \tag{37} \]
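A hedged sketch of one iteration (36)-(37). Since $J_F$ is singular along $\operatorname{span}\{(\mathbf{1}_m, -\mathbf{1}_n)\}$ (see the next proposition), the sketch solves the Newton system in the least-squares sense; this pseudo-inverse choice is illustrative, not the variant analyzed in [1].

    import numpy as np

    def sinkhorn_newton_step(alpha, beta, K, a, b, eps):
        u, v = np.exp(-alpha / eps), np.exp(-beta / eps)
        P = u[:, None] * K * v[None, :]                 # current plan (35)
        F = np.concatenate([P.sum(axis=1) - a,          # residual F (34)
                            P.sum(axis=0) - b])
        J = np.block([[np.diag(P.sum(axis=1)), P],      # Jacobian (37)
                      [P.T, np.diag(P.sum(axis=0))]]) / eps
        d = np.linalg.lstsq(J, F, rcond=None)[0]        # least-squares direction
        return alpha - d[:a.size], beta - d[a.size:]    # Newton update (36)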


Sinkhorn-Newton method: Convergence

Proposition
For $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$, the Jacobian matrix $J_F(\alpha, \beta)$ is symmetric positive semidefinite, and its kernel is given by
\[ \ker(J_F(\alpha, \beta)) = \operatorname{span}\Big\{ \begin{pmatrix} \mathbf{1}_m \\ -\mathbf{1}_n \end{pmatrix} \Big\}. \tag{38} \]

Proof
$J_F$ is clearly symmetric. For arbitrary $\gamma \in \mathbb{R}^m$ and $\phi \in \mathbb{R}^n$, one has
\[ \begin{pmatrix} \gamma^\top & \phi^\top \end{pmatrix} J_F \begin{pmatrix} \gamma \\ \phi \end{pmatrix} = \frac{1}{\varepsilon} \sum_{ij} P_{ij} (\gamma_i + \phi_j)^2 \ge 0, \]
which holds with equality if and only if $\gamma_i + \phi_j = 0$ for all $i, j$, leading to (38).


Sinkhorn-Newton method: Convergence

Lemma
Let $F: D \to \mathbb{R}^n$ be a continuously differentiable mapping with $D \subset \mathbb{R}^n$ open and convex. Suppose that $F'(x)$ is invertible for each $x \in D$, and assume that the following affine covariant Lipschitz condition holds for all $x, y \in D$:
\[ \| F'(x)^{-1} (F'(y) - F'(x)) (y - x) \| \le \omega \| y - x \|^2. \tag{39} \]
Let $F(x) = 0$ have a solution $x^*$. For the initial guess $x^{(0)}$, assume that $B(x^*, \|x^{(0)} - x^*\|) \subset D$ and that
\[ \omega \| x^{(0)} - x^* \| < 2. \]
Then the ordinary Newton iterates remain in the open ball $B(x^*, \|x^{(0)} - x^*\|)$ and converge to $x^*$ at an estimated quadratic rate
\[ \| x^{(k+1)} - x^* \| \le \frac{\omega}{2} \| x^{(k)} - x^* \|^2. \tag{40} \]
Moreover, the solution $x^*$ is unique in the open ball $B(x^*, 2/\omega)$.


Sinkhorn-Newton method: Convergence

Proof
Denote $e^{(k)} = x^{(k)} - x^*$. We prove the contraction by induction:
\[ \begin{aligned} \| e^{(k+1)} \| &= \| x^{(k)} - F'(x^{(k)})^{-1} (F(x^{(k)}) - F(x^*)) - x^* \| \\ &= \| e^{(k)} - F'(x^{(k)})^{-1} (F(x^{(k)}) - F(x^*)) \| \\ &= \| F'(x^{(k)})^{-1} \big( (F(x^*) - F(x^{(k)})) + F'(x^{(k)}) e^{(k)} \big) \| \\ &= \Big\| F'(x^{(k)})^{-1} \int_0^1 \big( F'(x^{(k)} - s e^{(k)}) - F'(x^{(k)}) \big) e^{(k)}\, \mathrm{d}s \Big\| \\ &\le \omega \int_0^1 s\, \mathrm{d}s\, \| e^{(k)} \|^2 = \frac{\omega}{2} \| e^{(k)} \|^2 < \| e^{(k)} \|, \end{aligned} \tag{41} \]
using (39) in the last inequality. Also,
\[ \omega \| e^{(k+1)} \| \le \omega \| e^{(k)} \| < 2. \tag{42} \]
For the uniqueness part, let $x^{**} \ne x^*$ be another solution and take $x^{(0)} = x^{**}$; then $x^{(1)} = x^{**}$ (Newton iterates are stationary at a solution), and (40) with $k = 0$ forces $\omega \| x^{**} - x^* \| \ge 2$, so no second solution lies in $B(x^*, 2/\omega)$.


Sinkhorn-Newton method: Convergence

Proposition
For any $k \in \mathbb{N}$ with $P_{\varepsilon, ij}^{(k)} > 0$, the affine covariant Lipschitz condition holds in the $\ell_\infty$-norm, whenever $\| y - x \|_\infty \le 1$, for
\[ \omega \le (e^{\frac{1}{\varepsilon}} - 1) \Big( 1 + 2e\, \frac{\max\{ \| P_\varepsilon^{(k)} \mathbf{1}_n \|_\infty,\ \| (P_\varepsilon^{(k)})^\top \mathbf{1}_m \|_\infty \}}{\min_{ij} P_{\varepsilon, ij}^{(k)}} \Big). \tag{43} \]

The proof of this proposition is tedious, so we refer the interested reader to the paper [1].


Relationship with Sinkhorn’s algorithm

Let $u = e^{-\frac{\alpha}{\varepsilon}}$, $v = e^{-\frac{\beta}{\varepsilon}}$ and $K = e^{-C/\varepsilon}$. We again state the KKT system of (21):
\[ P_\varepsilon = \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad a = \operatorname{diag}(u) K v, \qquad b = \operatorname{diag}(v) K^\top u. \tag{44} \]
Sinkhorn's algorithm amounts to alternating updates of the form
\[ u^{(k+1)} = \operatorname{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \operatorname{diag}(K^\top u^{(k+1)})^{-1} b. \tag{45} \]


Relationship with Sinkhorn’s algorithm

Define
\[ G(u, v) = \begin{pmatrix} \operatorname{diag}(u) K v - a \\ \operatorname{diag}(v) K^\top u - b \end{pmatrix}. \tag{46} \]
Proceeding analogously to the Sinkhorn-Newton method just discussed, note that
\[ J_G(u, v) = \begin{pmatrix} \operatorname{diag}(K v) & \operatorname{diag}(u) K \\ \operatorname{diag}(v) K^\top & \operatorname{diag}(K^\top u) \end{pmatrix}. \tag{47} \]
If we neglect the off-diagonal blocks, i.e., take
\[ \tilde{J}_G(u, v) = \begin{pmatrix} \operatorname{diag}(K v) & 0 \\ 0 & \operatorname{diag}(K^\top u) \end{pmatrix}, \tag{48} \]
and perform the Newton iteration
\[ \begin{pmatrix} u^{(k+1)} \\ v^{(k+1)} \end{pmatrix} = \begin{pmatrix} u^{(k)} \\ v^{(k)} \end{pmatrix} - \tilde{J}_G^{-1}(u^{(k)}, v^{(k)})\, G(u^{(k)}, v^{(k)}), \tag{49} \]


Relationship with Sinkhorn’s algorithm

we get
\[ u^{(k+1)} = \operatorname{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \operatorname{diag}(K^\top u^{(k)})^{-1} b. \tag{50} \]
So Sinkhorn's algorithm simply approximates one Newton step, neglecting the off-diagonal blocks and then replacing $u^{(k)}$ by $u^{(k+1)}$ in the update of $v$.
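A quick numeric check that one diagonal-Newton step (49) indeed reproduces the update (50); the data are illustrative.

    import numpy as np

    rng = np.random.RandomState(0)
    m, n, eps = 3, 4, 0.5
    C, a, b = rng.rand(m, n), np.ones(m) / m, np.ones(n) / n
    K = np.exp(-C / eps)
    u, v = np.ones(m), np.ones(n)

    # updates (50), both using the current u^(k), v^(k)
    u_s, v_s = a / (K @ v), b / (K.T @ u)
    # diagonal-Newton step (49): subtract J~^{-1} G blockwise
    u_n = u - (u * (K @ v) - a) / (K @ v)
    v_n = v - (v * (K.T @ u) - b) / (K.T @ u)
    print(np.allclose(u_s, u_n), np.allclose(v_s, v_n))   # True True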


Outline

1 Problem Formulation

2 Applications

3 Entropic Regularization

4 Sinkhorn’s Algorithm

5 Sinkhorn-Newton method

6 Shielding Neighborhood Method


Shielding Neighborhood Method

Main features:
- Proposed by Bernhard Schmitzer [2] in 2016.
- Exploits a hierarchical multiscale scheme.
- Solves a sequence of sparse subproblems instead of one large dense problem.
- Any existing discrete solver can be used as an internal black box.
- Significant improvements in both runtime and memory requirements have been observed.


Shielding Neighborhood Method

Definition
For some $\mathcal{N} \subset X \times Y$, we call $\mathcal{N}$ a neighborhood.

In this method we repeatedly consider the problem restricted to $\mathcal{N}$. It is important that $\mathcal{N}$ is small (sparse) compared to $X \times Y$.

Definition (Short-Cut)
For a neighborhood $\mathcal{N} \subset X \times Y$ and a coupling $\pi$ with $\operatorname{spt} \pi \subset \mathcal{N}$, let $((x_2, y_2), \ldots, (x_{n-1}, y_{n-1}))$ be an ordered tuple of pairs in $\operatorname{spt} \pi$. We say $((x_2, y_2), \ldots, (x_{n-1}, y_{n-1}))$ is a short-cut for $(x_1, y_n) \in X \times Y$ if $(x_i, y_{i+1}) \in \mathcal{N}$ for $i = 1, \ldots, n-1$ and
\[ c(x_1, y_n) \ge c(x_1, y_2) + \sum_{i=2}^{n-1} \big( c(x_i, y_{i+1}) - c(x_i, y_i) \big). \tag{51} \]


Shielding Neighborhood Method

The goal is to find a clever way to choose a small neighborhood $\mathcal{N}$ such that there is a short-cut for every $(x, y) \notin \mathcal{N}$.

Definition (Shielding Condition)
Let $x \in X$, $y \in Y$ and $(x_s, y_s) \in X \times Y$. We say $(x_s, y_s)$ shields $x$ from $y$ when
\[ c(x, y) + c(x_s, y_s) > c(x, y_s) + c(x_s, y). \]

Definition (Shielding Neighborhood)
For a given coupling $\pi$, we say that a neighborhood $\mathcal{N} \subset X \times Y$ with $\mathcal{N} \supset \operatorname{spt} \pi$ is a shielding neighborhood if for every pair $(x, y) \notin \mathcal{N}$ there exists some $(x_s, y_s) \in \operatorname{spt} \pi$ with $(x_s, y_s) \in \mathcal{N}$ that shields $x$ from $y$.


Multiscale Scheme

Multiscale Scheme (Hierarchical Partition and Multiscale Measure Approximation)
For a discrete set $X$, a hierarchical partition is an ordered tuple $(X_0, \ldots, X_K)$ of partitions of $X$, where $X_0 = \{\{x\} : x \in X\}$ is the trivial partition of $X$ into singletons and each subsequent level is generated by merging cells from the previous level.

For simplicity, we assume that the coarsest level is the trivial partition into one set, $X_K = \{X\}$. We call $K > 0$ the depth of $X$.

This induces a directed tree graph with vertex set $\cup_{k'=0}^K X_{k'}$, and for $k \in \{1, \ldots, K\}$ we say $x' \in X_j$, $j < k$, is a descendant of $x \in X_k$ when $x' \subset x$. We call $x'$ a child of $x$ when $j = k - 1$, and a leaf when $j = 0$.


Main Algorithm

Assume that we already have two subalgorithms:
- solveLocal: $2^{X \times Y} \to \Pi(\mu, \nu)$, such that for $\mathcal{N} \in 2^{X \times Y}$ the coupling solveLocal($\mathcal{N}$) is locally primal optimal w.r.t. $\mathcal{N}$. When $\mathcal{N}$ is sparse, any discrete OT (DOT) solver can quickly provide an answer.
- shield: $\Pi(\mu, \nu) \to 2^{X \times Y}$, such that for $\pi \in \Pi(\mu, \nu)$ the neighborhood shield($\pi$) is shielding for $\pi$.


Main Algorithm

Algorithm 1: Main algorithm
1: $k \leftarrow K$
2: $\pi \leftarrow$ solveDense($k$)    // this can be done immediately
3: while $k > 0$ do
4:     $k \leftarrow k - 1$
5:     $\mathcal{N}_1 \leftarrow \{\}$
6:     for $(x, y) \in \operatorname{spt} \pi$ do
7:         $\mathcal{N}_1 \leftarrow \mathcal{N}_1 \cup (\mathrm{children}(x) \times \mathrm{children}(y))$
8:     $\pi \leftarrow$ solveSparse($k$, $\mathcal{N}_1$)    // solveSparse is presented below
9: return $\pi$


solveSparse

solveSparse uses the current scale level $k$ and a feasible neighborhood $\mathcal{N}_1$ initialized from the coarser level to compute the global optimizer $\pi$ and a corresponding neighborhood $\mathcal{N}$ at the current hierarchical level (a Python transcription of both algorithms follows below).

Algorithm 2: solveSparse
1: Input: current scale level $k$, initial feasible neighborhood $\mathcal{N}_1$
2: $i \leftarrow 1$
3: while $i = 1$ or $C(\pi_i) \ne C(\pi_{i-1})$ do
4:     $\pi_{i+1} \leftarrow$ solveLocal($\mathcal{N}_i$)    // use any DOT solver
5:     $\mathcal{N}_{i+1} \leftarrow$ shield($\pi_{i+1}$, $k$)    // shield is discussed later
6:     $i \leftarrow i + 1$
7: return $\pi_i$    // returning $\mathcal{N}_i$ is not necessary
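A hedged Python transcription of Algorithms 1 and 2. Here solve_dense, solve_local, shield, children, and cost are hypothetical callables standing in for the coarsest-level solve, the black-box DOT solver, the shielding construction of [2], the hierarchy, and the transport cost of a coupling; a coupling is represented as a dict {(x, y): mass}.

    def solve_sparse(k, N, solve_local, shield, cost):
        prev = None
        while True:
            pi = solve_local(N)                # locally optimal coupling w.r.t. N
            if prev is not None and cost(pi) == prev:
                return pi                      # cost stagnated: pi is globally optimal
            prev = cost(pi)
            N = shield(pi, k)                  # shielding neighborhood for pi

    def multiscale_ot(K, solve_dense, solve_local, shield, children, cost):
        pi = solve_dense(K)                    # dense solve is cheap at level K
        for k in range(K - 1, -1, -1):         # refine level by level
            N1 = {(xc, yc)
                  for (x, y), mass in pi.items() if mass > 0
                  for xc in children(x) for yc in children(y)}
            pi = solve_sparse(k, N1, solve_local, shield, cost)
        return pi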


References

G. Peyré and M. Cuturi, Computational Optimal Transport, arXiv:1803.00567, 2018. POT library: https://github.com/rflamary/POT

[1] C. Brauer, C. Clason, D. Lorenz, and B. Wirth, A Sinkhorn-Newton method for entropic optimal transport, arXiv:1710.06635, 2017.

[2] B. Schmitzer, A sparse multiscale algorithm for dense optimal transport, Journal of Mathematical Imaging and Vision, 56 (2016), pp. 238-259.