
Gromov-Wasserstein Learning for Graph Matching and Node Embedding

Hongteng Xu 1 2 Dixin Luo 2 Hongyuan Zha 3 Lawrence Carin 2

Abstract

A novel Gromov-Wasserstein learning framework is proposed to jointly match (align) graphs and learn embedding vectors for the associated graph nodes. Using Gromov-Wasserstein discrepancy, we measure the dissimilarity between two graphs and find their correspondence, according to the learned optimal transport. The node embeddings associated with the two graphs are learned under the guidance of the optimal transport, the distance of which not only reflects the topological structure of each graph but also yields the correspondence across the graphs. These two learning steps are mutually-beneficial, and are unified here by minimizing the Gromov-Wasserstein discrepancy with structural regularizers. This framework leads to an optimization problem that is solved by a proximal point method. We apply the proposed method to matching problems in real-world networks, and demonstrate its superior performance compared to alternative approaches.

1. Introduction

Real-world entities and their interactions are often represented as graphs. Given two or more graphs created in different domains, graph matching aims to find a correspondence across different graphs. This task is important for many applications, e.g., matching the protein networks from different species (Sharan & Ideker, 2006; Singh et al., 2008), linking accounts in different social networks (Zhang & Philip, 2015), and feature matching in computer vision (Cordella et al., 2004). However, because it is NP-hard, graph matching is challenging and often solved heuristically. Further complicating matters, the observed graphs may be noisy (e.g., containing unreliable edges), which leads to unsatisfying matching results using traditional methods.

1 Infinia ML, Inc., Durham, NC, USA  2 Department of ECE, Duke University, Durham, NC, USA  3 College of Computing, Georgia Institute of Technology, Atlanta, GA, USA. Correspondence to: Hongteng Xu <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

[Figure 1: two graphs with nodes {A, B, C, D, E} and {1, ..., 8}, the embedding space of their nodes, and the optimal transport between them. Relational matching between graphs has cost |d(A, D) − d(1, 2)|; absolute matching between graphs has cost K(A, 1). The embeddings are updated based on the optimal transport and graph topology, and the optimal transport is updated based on the embeddings and graph topology.]

Figure 1. An illustration of the proposed method.

A problem related to graph matching is the learning of node embeddings, which aims to learn a latent vector for each graph node; the collection of embeddings approximates the topology of the graph, with similar/related nodes nearby in embedding space. Learning suitable node embeddings is beneficial for graph matching, as one may seek to align two or more graphs according to the metric structure associated with their node embeddings. Although graph matching and node embedding are highly related tasks, in practice they are often treated and solved independently. Existing node embedding methods (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016) are designed for a single graph, and applying such methods separately to multiple graphs doesn't share information across the graphs, and, hence, is less helpful for graph matching. Most graph matching methods rely purely on topological information (i.e., adjacency matrices of graphs) and ignore the potential functionality of node embeddings (Kuchaiev et al., 2010; Neyshabur et al., 2013; Nassar et al., 2018). Although some methods consider first deriving embeddings for each graph and then learning a transformation between the embeddings, their results are often unsatisfying because their embeddings are predefined and the transformations are limited to orthogonal projections (Grave et al., 2018) or rigid/non-rigid deformations (Myronenko & Song, 2010).

This paper considers the joint goal of graph matching and learning node embeddings, seeking to achieve improvements in both tasks. As illustrated in Figure 1, to achieve this goal we propose a novel Gromov-Wasserstein learning framework. The dissimilarity between two graphs is measured by the Gromov-Wasserstein discrepancy (GW discrepancy) (Peyre et al., 2016), which compares the distance matrices of different graphs in a relational manner, and learns an optimal transport between the nodes of different graphs. The learned optimal transport indicates the correspondence between the graphs. The embeddings of the nodes from different graphs are learned jointly: the distance between the embeddings within the same graph should approach the distance matrix derived from data, and the distance between the embeddings across different graphs should reflect the correspondence indicated by the learned optimal transport. As a result, the objectives of graph matching and node embedding are unified as minimizing the Gromov-Wasserstein discrepancy between two graphs, with structural regularizers. This framework leads to an optimization problem that is solved via an iterative process. In each iteration, the embeddings are used to estimate distance matrices when learning the optimal transport, and the learned optimal transport regularizes the learning of embeddings in the next iteration.

There are two important benefits to tackling graph matching and node embedding jointly. First, the observed graphs often contain spurious edges or miss some useful edges, leading to noisy adjacency matrices and unreliable graph matching results. Treating the distance between learned node embeddings as information complementary to the observed edges, we can approximate the topology of a graph more robustly and, accordingly, match noisy graphs. Second, our method regularizes the GW discrepancy and learns embeddings of different graphs on the same manifold, instead of learning an explicit transformation between the embeddings with predefined constraints. Therefore, the proposed method is more flexible and has lower risk of model misspecification (i.e., imposing incorrect constraints on the transformation); the distance between the embeddings of different graphs can be calculated directly without any additional transformation. We test our method on real-world matching problems and analyze its performance in depth. Experiments show that our method obtains encouraging matching results, with comparisons made to alternative approaches.

2. Gromov-Wasserstein Learning Framework

Assume we have two sets of entities (nodes), denoted as source set $\mathcal{V}_s$ and target set $\mathcal{V}_t$. Without loss of generality, we assume that $|\mathcal{V}_s| \le |\mathcal{V}_t|$. For each set, we observe a set of interactions between its entities, i.e., $\mathcal{E}_k = \{(v_i, v_j, w_{ij}) \,|\, v_i, v_j \in \mathcal{V}_k\}$, where $k = s$ or $t$, and $w_{ij}$ counts the appearances of the interaction $(v_i, v_j)$. Accordingly, the data of these entities can be represented as two graphs, denoted as $G(\mathcal{V}_s, \mathcal{E}_s)$ and $G(\mathcal{V}_t, \mathcal{E}_t)$, and we focus on the following two tasks: i) Find a correspondence between the graphs. ii) Obtain node embeddings of the two graphs, i.e., $X_s = [\boldsymbol{x}_i^s] \in \mathbb{R}^{D\times|\mathcal{V}_s|}$ and $X_t = [\boldsymbol{x}_i^t] \in \mathbb{R}^{D\times|\mathcal{V}_t|}$. These two tasks are unified in a framework based on Gromov-Wasserstein discrepancy.

2.1. Gromov-Wasserstein discrepancy

Gromov-Wasserstein discrepancy was proposed in (Peyre et al., 2016); it is a natural extension of the Gromov-Wasserstein distance (Memoli, 2011). Specifically, the definition of the Gromov-Wasserstein distance is as follows:

Definition 2.1. Let $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$ be two metric measure spaces, where $(X, d_X)$ is a compact metric space and $\mu_X$ is a Borel probability measure on $X$ (with $(Y, d_Y, \mu_Y)$ defined in the same way). The Gromov-Wasserstein distance $d_{GW}(\mu_X, \mu_Y)$ is defined as
$$
\inf_{\pi \in \Pi(\mu_X, \mu_Y)} \iint_{X\times Y \times X \times Y} L(x, y, x', y')\, d\pi(x, y)\, d\pi(x', y'),
$$
where $L(x, y, x', y') = |d_X(x, x') - d_Y(y, y')|$ is the loss function and $\Pi(\mu_X, \mu_Y)$ is the set of all probability measures on $X \times Y$ with $\mu_X$ and $\mu_Y$ as marginals.

This defines an optimal transport-like distance (Villani, 2008) for metric spaces: it calculates distances between pairs of samples within each domain and measures how these distances compare to those in the other domain. It does not require one to directly compare the samples across different spaces, and the target spaces can have different dimensions. When $d_X$ and $d_Y$ are replaced with dissimilarity measurements rather than strict distance metrics, and the loss function $L$ is defined more flexibly, e.g., as the mean-square-error (MSE) or KL-divergence, we relax the Gromov-Wasserstein distance to the proposed Gromov-Wasserstein discrepancy. These relaxations make the proposed Gromov-Wasserstein learning framework suitable for a wide range of machine learning tasks, including graph matching.

In graph matching, a metric-measure space corresponds to the pair $(C, \mu) \in \mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|} \times \Sigma^{|\mathcal{V}|}$ of a graph $G(\mathcal{V}, \mathcal{E})$, where $C = [c_{ij}] \in \mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|}$ represents a distance/dissimilarity matrix derived according to the interaction set $\mathcal{E}$, i.e., each $c_{ij}$ is a function of $w_{ij}$. The empirical distribution of nodes, i.e., $\mu = [\mu_i] \in \Sigma^{|\mathcal{V}|}$, is calculated based on the normalized degrees of the graph. It reflects the probability of each node appearing in observed interactions. Given two graphs $G(\mathcal{V}_s, \mathcal{E}_s)$ and $G(\mathcal{V}_t, \mathcal{E}_t)$, the Gromov-Wasserstein discrepancy between $(C_s, \mu_s)$ and $(C_t, \mu_t)$ is defined as
$$
d_{GW}(\mu_s, \mu_t) := \min_{T \in \Pi(\mu_s, \mu_t)} \sum_{i,j,i',j'} L(c^s_{ij}, c^t_{i'j'})\, T_{ii'} T_{jj'} = \min_{T \in \Pi(\mu_s, \mu_t)} \langle L(C_s, C_t, T), T \rangle. \tag{1}
$$
Here, $\Pi(\mu_s, \mu_t) = \{ T \in \mathbb{R}^{|\mathcal{V}_s|\times|\mathcal{V}_t|} \,|\, T \mathbf{1}_{|\mathcal{V}_t|} = \mu_s,\ T^\top \mathbf{1}_{|\mathcal{V}_s|} = \mu_t \}$. $L(\cdot,\cdot)$ is an element-wise loss function, with typical choices the square loss $L(a, b) = (a - b)^2$ and the KL-divergence $L(a, b) = a \log \frac{a}{b} - a + b$. Accordingly, $L(C_s, C_t, T) = [\mathcal{L}_{jj'}] \in \mathbb{R}^{|\mathcal{V}_s|\times|\mathcal{V}_t|}$, where each $\mathcal{L}_{jj'} = \sum_{i,i'} L(c^s_{ij}, c^t_{i'j'})\, T_{ii'}$, and $\langle\cdot,\cdot\rangle$ represents the inner product of matrices. $T$ is the optimal transport between the nodes of the two graphs, and its element $T_{ij}$ represents the probability that $v_i \in \mathcal{V}_s$ matches $v_j \in \mathcal{V}_t$. By choosing the largest $T_{ij}$ for each $i$, we find the correspondence that minimizes the GW discrepancy between the two graphs.
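For illustration, the following minimal NumPy sketch (our own illustration, not the released implementation) evaluates the objective of (1) for a given transport matrix with the square loss, and reads off the correspondence by a row-wise argmax as described above; the direct four-fold sum is written for clarity only, and a faster formulation appears in Section 3.3.

```python
import numpy as np

def gw_objective(Cs, Ct, T):
    """Evaluate <L(Cs, Ct, T), T> in Eq. (1) with the square loss
    L(a, b) = (a - b)^2, via a direct sum over node quadruples."""
    val = 0.0
    for i in range(Cs.shape[0]):
        for j in range(Cs.shape[0]):
            for ip in range(Ct.shape[0]):
                for jp in range(Ct.shape[0]):
                    val += (Cs[i, j] - Ct[ip, jp]) ** 2 * T[i, ip] * T[j, jp]
    return val

def correspondence(T):
    """Match each source node i to the target node with the largest T[i, j]."""
    return {i: int(np.argmax(T[i])) for i in range(T.shape[0])}
```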

However, such a graph matching strategy raises several issues. First, for each graph, its observed interaction set can be noisy, which leads to an unreliable distance matrix. Minimizing the GW discrepancy based on such distance matrices has a negative influence on matching results. Second, the Gromov-Wasserstein discrepancy compares different graphs relationally based on their edges (i.e., the distance between a pair of nodes within each graph), while most existing graph matching methods consider the information of nodes and edges jointly (Neyshabur et al., 2013; Vijayan et al., 2015; Sun et al., 2015). Therefore, to make a successful graph matching method, we further consider the learning of node embeddings and derive the proposed framework.

2.2. Proposed model

We propose to not only learn the optimal transport indicating the correspondence between graphs, but also simultaneously learn the node embeddings for each graph, which leads to a regularized Gromov-Wasserstein discrepancy. The corresponding optimization problem is
$$
\min_{X_s, X_t} \min_{T \in \Pi(\mu_s, \mu_t)} \underbrace{\langle L(C_s(X_s), C_t(X_t), T), T \rangle}_{\text{Gromov-Wasserstein discrepancy}} + \alpha \underbrace{\langle K(X_s, X_t), T \rangle}_{\text{Wasserstein discrepancy}} + \beta \underbrace{R(X_s, X_t)}_{\text{prior information}}. \tag{2}
$$

The first term in (2) corresponds to the GW discrepancy defined in (1), which measures the relational dissimilarity between the two graphs. The difference here is that the proposed distance matrices consider both the information of observed data and that of embeddings:
$$
C_k(X_k) = (1 - \alpha) C_k + \alpha K(X_k, X_k), \quad \text{for } k = s, t. \tag{3}
$$
Here $K(X_k, X_k) = [\kappa(\boldsymbol{x}_i^k, \boldsymbol{x}_j^k)] \in \mathbb{R}^{|\mathcal{V}_k|\times|\mathcal{V}_k|}$ is a distance matrix, with element $\kappa(\boldsymbol{x}_i^k, \boldsymbol{x}_j^k)$ a function measuring the distance between the node embeddings within the same graph; $\alpha \in [0, 1]$ is a hyperparameter controlling the contribution of the embedding-based distance to $C_k(X_k)$.
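As a small sketch of (3) (our illustration; embeddings are stored column-wise as $D\times|\mathcal{V}|$ matrices, and `kappa` is any of the embedding-based distances defined in Section 3.3):

```python
import numpy as np

def blended_distance(C, X, alpha, kappa):
    """Eq. (3): mix the data-driven distance matrix C with the
    embedding-based distance matrix K(X, X) = [kappa(x_i, x_j)]."""
    n = X.shape[1]
    K = np.array([[kappa(X[:, i], X[:, j]) for j in range(n)]
                  for i in range(n)])
    return (1.0 - alpha) * C + alpha * K
```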

The second term in (2) represents the Wasserstein discrepancy between the nodes of the two graphs. Similar to the first term, the distance matrix is also derived based on the node embeddings, i.e., $K(X_s, X_t) = [\kappa(\boldsymbol{x}_i^s, \boldsymbol{x}_j^t)] \in \mathbb{R}^{|\mathcal{V}_s|\times|\mathcal{V}_t|}$, and its contribution is controlled by the same hyperparameter $\alpha$. This term measures the absolute dissimilarity between the two graphs, which connects the target optimal transport with the node embeddings. By adding this term, the optimal transport minimizes both the Gromov-Wasserstein discrepancy based directly on observed data and the Wasserstein discrepancy based on the embeddings (which are indirectly also a function of the data). Furthermore, the embeddings of different graphs can be learned jointly under the guidance of the optimal transport: the distance between the embeddings of different graphs should be consistent with the relationship indicated by the optimal transport.

Because the target optimal transport is often sparse, purely considering its guidance leads to overfitting or trivial solutions when learning embeddings. To mitigate this problem, the third term in (2) represents a regularization of the embeddings, based on the prior information provided by $C_s$ and $C_t$. We require the embedding-based distance matrices to be close to the observed ones, and $R(X_s, X_t)$ is
$$
\sum_{k=s,t} L(K(X_k, X_k), C_k) + \underbrace{L(K(X_s, X_t), C_{st})}_{\text{optional}}, \tag{4}
$$
where the definition of the loss function $L(\cdot,\cdot)$ is the same as that used in (1). Note that if we observe partial correspondences between different graphs, i.e., $\mathcal{E}_{st} = \{(v_i, v_j, w_{ij}) \,|\, v_i \in \mathcal{V}_s, v_j \in \mathcal{V}_t\}$, we can calculate a distance matrix for the nodes of different graphs, denoted as $C_{st} \in \mathbb{R}^{|\mathcal{V}_s|\times|\mathcal{V}_t|}$, and require the distance between the embeddings to match $C_{st}$, as shown in the optional term of (4). This term is available only when $\mathcal{E}_{st}$ is given.

The proposed method unifies (optimal transport-based) graph matching and node embedding in the same framework, making them beneficial to each other. For the original GW discrepancy term, introducing the embedding-based distance matrices can suppress the noise in the data-driven distance matrices, improving robustness. Additionally, based on node embeddings, we can calculate the Wasserstein discrepancy between graphs, which further regularizes the target optimal transport directly. When learning node embeddings, the Wasserstein discrepancy term works as the regularizer of node embeddings: the values of the learned optimal transport indicate which pairs of nodes should be close to each other.

3. Learning Algorithm

3.1. Learning optimal transport

Although (2) is a complicated nonconvex optimization problem, we can solve it effectively by alternately learning the optimal transport and the embeddings. In particular, the proposed method applies nested iterative optimization. In the $m$-th outer iteration, given current embeddings $X_s^{(m)}$ and $X_t^{(m)}$, we solve the following sub-problem:
$$
\min_{T \in \Pi(\mu_s, \mu_t)} \langle L(C_s(X_s^{(m)}), C_t(X_t^{(m)}), T), T \rangle + \alpha \langle K(X_s^{(m)}, X_t^{(m)}), T \rangle. \tag{5}
$$


This sub-problem is still nonconvex because of the quadratic term $\langle L(C_s(X_s^{(m)}), C_t(X_t^{(m)}), T), T \rangle$. We solve it iteratively with the help of a proximal point method. Inspired by the method in (Xie et al., 2018), in the $n$-th inner iteration we update the target optimal transport via
$$
\min_{T \in \Pi(\mu_s, \mu_t)} \langle L(C_s(X_s^{(m)}), C_t(X_t^{(m)}), T), T \rangle + \alpha \langle K(X_s^{(m)}, X_t^{(m)}), T \rangle + \gamma\, \mathrm{KL}(T \,\|\, T^{(n)}). \tag{6}
$$

Here, a proximal term based on the Kullback-Leibler (KL) divergence, $\mathrm{KL}(T \,\|\, T^{(n)}) = \sum_{ij} T_{ij} \log \frac{T_{ij}}{T^{(n)}_{ij}} - T_{ij} + T^{(n)}_{ij}$, is added as a regularizer.

We use projected gradient descent to solve (6), in which both the gradient and the projection are based on the KL metric. When the learning rate is set as $\frac{1}{\gamma}$, the projected gradient descent is equivalent to solving the following optimal transport problem with an entropy regularizer (Benamou et al., 2015; Peyre et al., 2016):
$$
\min_{T \in \Pi(\mu_s, \mu_t)} \langle C^{(m,n)} - \gamma \log T^{(n)}, T \rangle + \gamma H(T), \tag{7}
$$
where $C^{(m,n)} = L(C_s, C_t, T^{(n)}) + \alpha K(X_s^{(m)}, X_t^{(m)}) + \gamma$, and $H(T) = \sum_{i,j} T_{ij} \log T_{ij}$. This problem can be solved via the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967; Cuturi, 2013) with linear convergence.
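The inner update can be sketched in a few lines of NumPy (our illustration, not the authors' code): given the cost matrix $C^{(m,n)}$, one proximal step of (6)/(7) multiplies the entropic kernel by the previous plan and then runs Sinkhorn-Knopp projections; the constant $\gamma$ offset in $C^{(m,n)}$ is omitted here because the marginal scalings absorb it.

```python
import numpy as np

def proximal_sinkhorn_step(cost, T_prev, mu_s, mu_t, gamma, n_sinkhorn=10):
    """One proximal update of the transport plan, i.e., the solution of (7)
    for cost = C^(m,n), via Sinkhorn-Knopp projections."""
    G = np.exp(-cost / gamma) * T_prev        # kernel exp(-C/gamma) * T^(n)
    a = np.copy(mu_s)
    for _ in range(n_sinkhorn):               # alternate marginal scalings
        b = mu_t / (G.T @ a)
        a = mu_s / (G @ b)
    return np.diag(a) @ G @ np.diag(b)        # T^(n+1)
```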

In summary, we decompose (5) into a series of updating steps. Each updating step (6) can be solved via projected gradient descent, which is a solution to a regularized optimal transport problem (7). Essentially, the proposed method can be viewed as a special case of successive upper-bound minimization (SUM) (Razaviyayn et al., 2013), whose global convergence is guaranteed:

Proposition 3.1. Every limit point generated by our proximal point method, i.e., $\lim_{n\to\infty} T^{(n)}$, is a stationary point of the problem (5).

Note that besides our proximal point method, another method for solving (5) involves replacing the KL-divergence $\mathrm{KL}(T \,\|\, T^{(n)})$ in (6) with an entropy regularizer $H(T)$ and minimizing an entropic GW discrepancy via iterative Sinkhorn projection (Peyre et al., 2016). However, its performance (i.e., its convergence and numerical stability) is more sensitive to the choice of the hyperparameter $\gamma$. The details of our proximal point method, the proof of Proposition 3.1, and its comparison with the Sinkhorn method (Peyre et al., 2016) are shown in the Supplementary Material.

Parameter $\alpha$ controls the influence of node embeddings on the GW discrepancy and the Wasserstein discrepancy. When training the proposed model from scratch, the embeddings $X_s$ and $X_t$ are initialized randomly and thus are unreliable in the beginning. Therefore, we initialize $\alpha$ with a small value and increase it with respect to the number of outer iterations. We apply a simple linear strategy to adjust $\alpha$: with the maximum number of outer iterations set as $M$, in the $m$-th iteration we set $\alpha_m = \frac{m}{M}$.

Algorithm 1 Gromov-Wasserstein Learning (GWL)
1: Input: $\{C_s, C_t\}$, $\{\mu_s, \mu_t\}$, $\beta$, $\gamma$, the dimension $D$, the number of outer/inner iterations $\{M, N\}$.
2: Output: $X_s$, $X_t$ and $T$.
3: Initialize $X_s^{(0)}$, $X_t^{(0)}$ randomly, $T^{(0)} = \mu_s \mu_t^\top$.
4: For $m = 0 : M - 1$
5:     Set $\alpha_m = \frac{m}{M}$.
6:     For $n = 0 : N - 1$
7:         Update optimal transport $T^{(m+1)}$ via solving (6).
8:     Obtain $X_s^{(m+1)}$, $X_t^{(m+1)}$ via solving (8).
9: $X_s = X_s^{(M)}$, $X_t = X_t^{(M)}$ and $T = T^{(M)}$.
10: \\ Graph matching:
11: Initialize correspondence set $\mathcal{P} = \emptyset$.
12: For $v_i \in \mathcal{V}_s$
13:     $j = \arg\max_j T_{ij}$. $\mathcal{P} = \mathcal{P} \cup \{(v_i \in \mathcal{V}_s, v_j \in \mathcal{V}_t)\}$.
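For completeness, a compact Python skeleton of the outer loop of Algorithm 1 (a sketch under our own conventions: `ot_step` and `emb_step` are caller-supplied routines that perform one update of (6) and solve problem (8), respectively):

```python
import numpy as np

def gwl_outer_loop(Cs, Ct, mu_s, mu_t, ot_step, emb_step, D=100, M=30, N=200):
    """Skeleton of Algorithm 1: alternate the proximal OT update and the
    embedding update, with the linear schedule alpha_m = m / M."""
    Xs = 0.01 * np.random.randn(D, len(mu_s))   # random initial embeddings
    Xt = 0.01 * np.random.randn(D, len(mu_t))
    T = np.outer(mu_s, mu_t)                    # T^(0) = mu_s mu_t^T
    for m in range(M):
        alpha = m / M
        for _ in range(N):                      # inner proximal point steps
            T = ot_step(Cs, Ct, Xs, Xt, T, mu_s, mu_t, alpha)
        Xs, Xt = emb_step(Xs, Xt, T, Cs, Ct, alpha)
    matches = {i: int(np.argmax(T[i])) for i in range(T.shape[0])}
    return Xs, Xt, T, matches
```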

3.2. Updating embeddings

Given the optimal transport $T^{(m)}$, we update the embeddings by solving the following optimization problem:
$$
\min_{X_s, X_t} \alpha_m \langle K(X_s, X_t), T^{(m)} \rangle + \beta R(X_s, X_t). \tag{8}
$$
This problem can be solved effectively by (stochastic) gradient descent. In summary, the proposed learning algorithm is shown in Algorithm 1.
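A minimal PyTorch sketch of the objective in (8) (our illustration, not the authors' code; it assumes the RBF-based distance from Section 3.3, a square loss for $R$, and omits the optional $C_{st}$ term of (4)):

```python
import torch

def rbf_dist(X, Y, sigma):
    """Pairwise distance 1 - exp(-||x_i - y_j||_2 / (2 sigma^2)) between columns."""
    d = torch.cdist(X.t(), Y.t())
    return 1.0 - torch.exp(-d / (2.0 * sigma ** 2))

def embedding_objective(Xs, Xt, T, Cs, Ct, alpha_m, beta, sigma):
    """alpha_m <K(Xs, Xt), T> + beta * R(Xs, Xt), with a square loss for R."""
    wasserstein = (rbf_dist(Xs, Xt, sigma) * T).sum()
    prior = ((rbf_dist(Xs, Xs, sigma) - Cs) ** 2).mean() \
          + ((rbf_dist(Xt, Xt, sigma) - Ct) ** 2).mean()
    return alpha_m * wasserstein + beta * prior

# Xs, Xt are D x |V| tensors with requires_grad=True; optimize, e.g., with Adam:
# opt = torch.optim.Adam([Xs, Xt], lr=1e-3)
# loss = embedding_objective(Xs, Xt, T, Cs, Ct, alpha_m, beta, sigma)
# opt.zero_grad(); loss.backward(); opt.step()
```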

3.3. Implementation details and analysis

Distance matrix The distance matrix plays an important role in our Gromov-Wasserstein learning framework. For a graph, the data-driven distance matrix should reflect its structure. Based on the fact that the counts of interactions in many real-world graphs are characterized by Zipf's law (Powers, 1998), we treat the counts as the weights of edges and define the element of the data-driven distance matrix as
$$
c^k_{ij} =
\begin{cases}
\frac{1}{w_{ij} + 1}, & (v_i, v_j) \in \mathcal{E}_k, \\
1, & (v_i, v_j) \notin \mathcal{E}_k,
\end{cases}
\quad \text{for } k = s, t. \tag{9}
$$
This definition assigns a short distance to pairs of nodes with many interactions. Additionally, we hope that the embedding-based distance matrix can fit the data-driven distance matrix easily. In the following experiments, we test two kinds of embedding-based distance: 1) the cosine-based distance $\kappa(\boldsymbol{x}_i, \boldsymbol{x}_j) = 1 - \exp(-\sigma(1 - \frac{\boldsymbol{x}_i^\top \boldsymbol{x}_j}{\|\boldsymbol{x}_i\|_2 \|\boldsymbol{x}_j\|_2}))$; and 2) the radial basis function (RBF)-based distance $\kappa(\boldsymbol{x}_i, \boldsymbol{x}_j) = 1 - \exp(-\frac{\|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2}{2\sigma^2})$. When applying the cosine-based distance, we choose $\sigma = 10$ so that the maximum $\kappa(\boldsymbol{x}_i, \boldsymbol{x}_j)$ approaches 1. When applying the RBF-based distance, we choose $\sigma = D$. The following experiments show that these two distances work well in various matching tasks.
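The distances above can be sketched as follows (our illustration; `weights` holds interaction counts and embeddings are $D\times|\mathcal{V}|$ column matrices):

```python
import numpy as np

def data_distance(weights):
    """Eq. (9): c_ij = 1 / (w_ij + 1) for observed interactions, 1 otherwise."""
    C = np.ones_like(weights, dtype=float)
    mask = weights > 0
    C[mask] = 1.0 / (weights[mask] + 1.0)
    return C

def cosine_distance(X, Y, sigma=10.0):
    """Cosine-based distance 1 - exp(-sigma * (1 - cos(x_i, y_j)))."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    return 1.0 - np.exp(-sigma * (1.0 - Xn.T @ Yn))

def rbf_distance(X, Y, sigma):
    """RBF-based distance 1 - exp(-||x_i - y_j||_2 / (2 sigma^2))."""
    d = np.sqrt(((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0))
    return 1.0 - np.exp(-d / (2.0 * sigma ** 2))
```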

Complexity and Scalability When learning the optimal transport, one of the most time-consuming steps is computing the loss matrix $L(C_s, C_t, T)$, which involves a tensor-matrix multiplication. Fortunately, as shown in (Peyre et al., 2016), when the loss function $L(a, b)$ can be written as $L(a, b) = f_1(a) + f_2(b) - h_1(a) h_2(b)$ for functions $(f_1, f_2, h_1, h_2)$, which is satisfied by our MSE/KL loss, the loss matrix can be calculated as $L(C_s, C_t, T) = f_1(C_s)\, \mu_s \mathbf{1}_{|\mathcal{V}_t|}^\top + \mathbf{1}_{|\mathcal{V}_s|} \mu_t^\top f_2(C_t)^\top - h_1(C_s)\, T\, h_2(C_t)^\top$. Because $T$ tends to become sparse quickly during the learning process, the computational complexity of this calculation is $\mathcal{O}(V^3)$, where $V = \max\{|\mathcal{V}_s|, |\mathcal{V}_t|\}$. For $D$-dimensional node embeddings, the complexity of the embedding-based distance matrix $K(X_s, X_t)$ is $\mathcal{O}(V^2 D)$. Additionally, we can apply the inexact proximal point method (Xie et al., 2018; Chen et al., 2018a), running a one-step Sinkhorn-Knopp projection in each inner iteration. Therefore, the complexity of learning the optimal transport is $\mathcal{O}(V^2 D + N V^3)$. When learning node embeddings, we can apply stochastic gradient descent to solve (8). In our experiments, we select the size of the node batch as $B \ll V$, and the objective function of (8) converges quickly after a few epochs. Therefore, the computational complexity of the embedding-based distance sub-matrix is just $\mathcal{O}(B^2 D)$, which may be ignored compared to that of learning the optimal transport. In summary, the overall complexity of our method is $\mathcal{O}(M(V^2 D + N V^3))$, and both the learning of optimal transport and that of node embeddings can be done in parallel on GPUs.
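For the square loss, the decomposition above can be coded directly (a sketch assuming symmetric distance matrices, with $f_1(a) = a^2$, $f_2(b) = b^2$, $h_1(a) = a$, $h_2(b) = 2b$):

```python
import numpy as np

def loss_matrix_square(Cs, Ct, T, mu_s, mu_t):
    """L(Cs, Ct, T) for L(a, b) = (a - b)^2 without forming the 4-D tensor."""
    term1 = (Cs ** 2) @ mu_s[:, None] @ np.ones((1, len(mu_t)))
    term2 = np.ones((len(mu_s), 1)) @ mu_t[None, :] @ (Ct ** 2).T
    return term1 + term2 - 2.0 * Cs @ T @ Ct.T
```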

Note that the proposed method has lower complexity than many existing graph matching methods. For example, GRAAL and its variants (Malod-Dognin & Przulj, 2015) have $\mathcal{O}(V^5)$ complexity, which is much slower than the proposed method. Additionally, the complexity of our method is independent of the number of edges (denoted as $E = \max\{|\mathcal{E}_s|, |\mathcal{E}_t|\}$). Compared to other well-known alternatives, e.g., NETAL (Neyshabur et al., 2013) with $\mathcal{O}(V \log V + E^2 + EV \log V)$, our method has at least comparable complexity for dense graphs ($E \gg V$).

4. Related Work

Gromov-Wasserstein learning Gromov-Wasserstein discrepancy extends optimal transport (Villani, 2008) to the case when the target domains are not registered well. It can also be viewed as a relaxation of the Gromov-Hausdorff distance (Memoli, 2008; Bronstein et al., 2010) when pairwise distance between entities is defined. The GW discrepancy is suitable for solving matching problems like shape and object matching (Memoli, 2009; 2011). Besides graphics and computer vision, recently its potential for other applications has been investigated, e.g., matching vocabulary sets between different languages (Alvarez-Melis & Jaakkola, 2018) and matching weighted directed networks (Chowdhury & Memoli, 2018). The work in (Peyre et al., 2016) considers the Gromov-Wasserstein barycenter and proposes a fast Sinkhorn projection-based algorithm to compute GW discrepancy (Cuturi, 2013). Similar to our method, the work in (Vayer et al., 2018) proposes a fused Gromov-Wasserstein distance, combining GW discrepancy with Wasserstein discrepancy. However, it does not consider the learning of embeddings and requires the distance between the entities in different domains to be known, which is inapplicable to matching problems. In (Bunne et al., 2018), an adversarial learning method is proposed to learn a pair of generative models for incomparable spaces, which uses GW discrepancy as the objective function. This method imposes an orthogonal assumption on the transformation between the sample and its embedding; it is designed for fuzzy matching between distributions, rather than the graph matching task that requires point-to-point correspondence.

Graph matching Graph matching has been studied extensively, with a wide range of applications. Focusing on protein-protein interaction (PPI) networks, many methods have been proposed, including methods based on local neighborhood information like GRAAL (Kuchaiev et al., 2010), and its variants MI-GRAAL (Kuchaiev & Przulj, 2011) and L-GRAAL (Malod-Dognin & Przulj, 2015); as well as methods based on global structural information, like IsoRank (Singh et al., 2008), MAGNA++ (Vijayan et al., 2015), NETAL (Neyshabur et al., 2013), HubAlign (Hashemifar & Xu, 2014) and WAVE (Sun et al., 2015). Among these methods, MAGNA++ and WAVE consider both edge and node information. Besides bioinformatics, network alignment techniques are also applied to computer vision (Jun et al., 2017; Yu et al., 2018), document analysis (Bayati et al., 2009) and social network analysis (Zhang & Philip, 2015). For small graphs, e.g., the graph of feature points in computer vision, graph matching is often solved as a quadratic assignment problem (Yan et al., 2015). For large graphs, e.g., social networks and PPI networks, existing methods either depend on a heuristic searching strategy or leverage domain knowledge for specific cases. None of these methods consider graph matching and node embedding jointly from the viewpoint of Gromov-Wasserstein discrepancy.

Node embedding Node embedding techniques have been widely used to represent and analyze graph/network structures. The representative methods include LINE (Tang et al., 2015), Deepwalk (Perozzi et al., 2014), and node2vec (Grover & Leskovec, 2016). Most of these embedding methods first generate sequential observations of nodes through a random-walk procedure, and then learn the embeddings by maximizing the coherency between each observation and its context (Mikolov et al., 2013). The distance between the learned embeddings can reflect the topological structure of the graph. More recently, many new embedding methods have been proposed, e.g., the anonymous walk embedding in (Ivanov & Burnaev, 2018) and the mixed membership word embedding (Foulds, 2018), which help to improve the representations of complicated graphs and their nodes. However, none of these methods consider jointly learning embeddings for multiple graphs.

5. Experiments

We apply the Gromov-Wasserstein learning (GWL) method to both synthetic and real-world matching tasks, and compare it with state-of-the-art methods. In our experiments, we set hyperparameters as follows: the number of outer iterations is $M = 30$, the number of inner iterations is $N = 200$, $\gamma = 0.01$, and $L(\cdot,\cdot)$ is the MSE loss. We tried $\beta \in \{0, 1, 10, 100, 1000\}$, and $\beta \in [1, 100]$ achieves stable performance; therefore, we empirically set $\beta = 10$. When solving (8), we use Adam (Kingma & Ba, 2014) with learning rate 0.001, set the number of epochs to 5, and set the batch size to 100. The proposed method based on cosine and RBF distances is denoted GWL-C and GWL-R, respectively. Additionally, to highlight the benefit from joint graph matching and node-embedding learning, we consider a baseline that purely minimizes GW discrepancy based on data-driven distance matrices (denoted as GWD). The code is available at https://github.com/HongtengXu/gwl.

5.1. Synthetic data

We verify the feasibility of our GWL method by first considering two kinds of synthetic datasets. The graphs in the first dataset imitate $K$-NN graphs with certain randomness, which is common in practical data science. The graphs in the second dataset follow the Barabasi-Albert (BA) model (Barabasi et al., 2016), which matches the statistics of real-world networks well. For the $K$-NN graph dataset, we simulate the source graph $G(\mathcal{V}_s, \mathcal{E}_s)$ as follows: for each $v_i \in \mathcal{V}_s$, we select $K \sim \mathrm{Poisson}(0.1 \times |\mathcal{V}_s|)$ nodes randomly from $\mathcal{V}_s \setminus v_i$, denoted as $\{v_i^k\}_{k=1}^K$. For each selected edge $(v_i, v_i^k)$, there are $w \sim \mathrm{Poisson}(10)$ interactions between these two nodes. Accordingly, $\mathcal{E}_s$ is the union of all simulated $\{(v_i, v_i^k, w)\}_{i,k}$. The target graph $G(\mathcal{V}_t, \mathcal{E}_t)$ is constructed by first adding $q\%$ noisy nodes to the source graph, i.e., $|\mathcal{V}_t| = (1 + q\%)|\mathcal{V}_s|$, and then generating $q\%$ noisy edges between the nodes in $\mathcal{V}_t$ via the simulation method mentioned above, i.e., $|\mathcal{E}_t| = (1 + q\%)|\mathcal{E}_s|$. Similarly, for the BA graph dataset, we first simulate the source graph with $|\mathcal{V}_s|$ nodes, and then simulate the target graph by adding $q\%|\mathcal{V}_s|$ more nodes and corresponding edges.
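A minimal NumPy sketch of the K-NN-style source-graph simulation described above (our reading of the procedure; the target graph is built analogously by adding q% noisy nodes and edges):

```python
import numpy as np

def simulate_knn_graph(n_nodes, rng):
    """Each node picks K ~ Poisson(0.1 * n_nodes) random neighbors; each
    selected edge carries w ~ Poisson(10) interactions."""
    edges = {}
    for i in range(n_nodes):
        K = rng.poisson(0.1 * n_nodes)
        others = [j for j in range(n_nodes) if j != i]
        chosen = rng.choice(others, size=min(K, len(others)), replace=False)
        for j in chosen:
            edges[(i, int(j))] = int(rng.poisson(10))
    return edges

rng = np.random.default_rng(0)
source_edges = simulate_knn_graph(50, rng)   # e.g., |Vs| = 50
```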

[Figure 2: node correctness (%) and GW discrepancy versus the percentage of noisy nodes and edges, comparing GWD, GWL-C (OT), and GWL-C (Embedding); panel (a) K-NN graphs with $|\mathcal{V}_s| = 50$, panel (b) BA graphs with $|\mathcal{V}_s| = 50$.]

Figure 2. The performance of our method on synthetic data.

We set $|\mathcal{V}_s| \in \{50, 100\}$ and $q \in \{0, 10, 20, 30, 40, 50\}$. For each configuration, we simulate the source graph and the target one in 100 trials. For each trial, we apply our method (and its baseline GWD) to match the graphs and calculate node correctness as our measurement: given the learned correspondence set $\mathcal{P}$ and the ground-truth set of correspondences $\mathcal{P}_{real}$, we calculate the percent node correctness as $\mathrm{NC} = \frac{|\mathcal{P} \cap \mathcal{P}_{real}|}{|\mathcal{P}|} \times 100\%$. To analyze the rationality of the learned node embeddings, we construct $\mathcal{P}$ in two ways: for each $v_i^s \in \mathcal{V}_s$, we find its matched node $v_j^t \in \mathcal{V}_t$ via (i) $\arg\max_j T_{ij}$ (as shown in line 13 of Algorithm 1) or (ii) $\arg\min_j \kappa(\boldsymbol{x}_i^s, \boldsymbol{x}_j^t)$. Additionally, the corresponding GW discrepancy is calculated as well. Assuming that the results in different trials are Gaussian distributed, we calculate the 95% confidence interval for each measurement.
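For completeness, the node correctness measurement can be sketched in one line (with the learned and ground-truth correspondences represented as sets of node pairs):

```python
def node_correctness(matches, ground_truth):
    """NC = |P intersect P_real| / |P| * 100%."""
    return 100.0 * len(matches & ground_truth) / len(matches)
```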

Figure 2 visualizes the performance of our GWL-C method and its baseline GWD when $|\mathcal{V}_s| = 50$. More results are in the Supplementary Material. When the target graph is identical to the source one (i.e., $q = 0$), the proposed Gromov-Wasserstein learning framework can achieve almost 100% node correctness, and the GW discrepancy approaches zero. With the increase of $q$, the noise in the target graph becomes more serious, and the GW discrepancy increases accordingly. This means that the GW discrepancy indeed reflects the dissimilarity between the graphs. Although GWD is comparable to our GWL-C when the noise level is low, it becomes much worse when $q > 20$. This phenomenon supports our claim that learning node embeddings can improve the robustness of graph matching. Moreover, we find that the node correctness based on the optimal transport (blue curves) and that based on the embeddings (orange curves) are almost the same. This demonstrates that the embeddings of different graphs are on the same manifold, and their distances indicate the correspondences between graphs.

5.2. MC3: Matching communication networks

MC3 is a dataset used in the Mini-Challenge 3 of VAST Challenge 2018, which records the communication behavior among a company's employees on different networks.1 The communications are categorized into two types: phone calls and emails between employees. According to the types of the communications, we obtain two networks, denoted as CallNet and EmailNet. Because an employee has two independent accounts in these two networks, we aim to link the accounts belonging to the same employee. We test our method on a subset of the MC3 dataset, which contains 622 employees and their communications through phone calls and emails. In this subset, for each selected employee there is at least one employee in a network (either CallNet or EmailNet) having over 10 communications with him/her, which ensures that each node has at least one reliable edge. Additionally, for each network, we can control the density of its edges by thresholding the count of interactions. When we only keep the edges corresponding to communications happening more than 8 times, we obtain two sparse graphs: the CallNet contains 1,228 edges and the EmailNet contains 1,235 edges. When we keep all the communications and the corresponding edges, we obtain two dense graphs: the CallNet contains 141,846 edges and the EmailNet contains 115,782 edges. Generally, experience indicates that matching dense graphs is much more difficult than matching sparse ones.

1 http://vacommunity.org/VAST+Challenge+2018+MC3

We compare our methods (GWL-R and GWL-C) with well-known graph matching methods: the graduated assignment algorithm (GAA) (Gold & Rangarajan, 1996), the low-rank spectral alignment (LRSA) (Nassar et al., 2018), TAME (Mohammadi et al., 2017), GRAAL,2 MI-GRAAL,3 MAGNA++,4 HubAlign and NETAL.5 These alternatives achieve state-of-the-art performance on matching large-scale graphs, e.g., protein networks. Table 1 lists the matching results obtained by the different methods.6 For the alternative methods, their best results in 10 trials are listed. We find that their performance on sparse and dense graphs is inconsistent. For example, GRAAL works almost as well as our GWL-R and GWL-C for sparse graphs, but its matching result becomes much worse for dense graphs. The baseline GWD is inferior to most graph-matching methods on node correctness, because it purely minimizes the GW discrepancy based on the information of pairwise interactions (i.e., edges). Additionally, GWD merely relies on data-driven distance matrices, which are sensitive to the noise in the graphs. However, when we take node embeddings (with dimension $D = 100$) into account, the proposed GWL-R and GWL-C outperform GWD and the other considered approaches consistently, on both sparse and dense graphs.

2 http://www0.cs.ucl.ac.uk/staff/natasa/GRAAL
3 http://www0.cs.ucl.ac.uk/staff/natasa/MI-GRAAL
4 https://www3.nd.edu/~cone/MAGNA++
5 http://ttic.uchicago.edu/~hashemifar
6 For GWD, GWL-R and GWL-C, we show the node correctness calculated based on the learned optimal transport.

[Figure 3: (a) stability and convergence, plotting OT-based NC (%), embedding-based NC (%), and GW discrepancy versus the number of outer iterations; (b) the learned embeddings.]

Figure 3. Visualization of typical experimental results.

Table 1. Communication network matching results.

Method       Call→Email (Sparse)       Call→Email (Dense)
             Node Correctness (%)      Node Correctness (%)
GAA          34.22                     0.53
LRSA         38.20                     2.93
TAME         37.39                     2.67
GRAAL        39.67                     0.48
MI-GRAAL     35.53                     0.64
MAGNA++       7.88                     0.09
HubAlign     36.21                     3.86
NETAL        36.87                     1.77
GWD          23.16±0.46                1.77±0.22
GWL-R        39.64±0.57                3.80±0.23
GWL-C        40.45±0.53                4.23±0.27

To demonstrate the convergence and the stability of our method, we run GWD, GWL-R and GWL-C in 10 trials with different initializations. For each method, its node correctness is calculated based on the optimal transport and on the embedding-based distance matrix. The 95%-confidence interval of the node correctness is estimated as well, as shown in Table 1. We find that the proposed method has good stability and outperforms other methods with high confidence. Figure 3(a) visualizes the GW discrepancy and the node correctness with respect to the number of outer iterations; the 95%-confidence intervals are shown as well. In Figure 3(a), we find that the GW discrepancy decreases and the two kinds of node correctness increase accordingly and become consistent as the iterations proceed, which means that the embeddings we learn and their distances indeed reflect the correspondence between the two graphs. Figure 3(b) visualizes the learned embeddings with the help of t-SNE (Maaten & Hinton, 2008). We find that the learned node embeddings of different graphs are on the same manifold and the overlapping embeddings indicate matched pairs.

5.3. MIMIC-III: Procedure recommendation

Besides typical graph matching, our method has potential for other applications, like recommendation systems. Such systems recommend items to users according to the distance/similarity between their embeddings. Traditional methods (Rendle et al., 2009; Chen et al., 2018b) learn the embeddings of users and items purely based on their interactions. Recent work (Monti et al., 2017; Ying et al., 2018) shows that considering the user network and/or item network is beneficial for improving recommendation results. Such a strategy is also applicable to our Gromov-Wasserstein learning framework: given the network of users, the network of items, and the observed interactions between them (i.e., partial correspondences between the graphs), we learn the embeddings of users and items and the optimal transport between them via minimizing the GW discrepancy between the networks. Because the learned embeddings are on the same manifold, we can calculate the distance between a user and an item directly via the cosine-based distance or the RBF-based distance. Accordingly, we recommend to each user the items with the shortest distances. For our method, the only difference between the recommendation task and the previous graph matching task is that we observe some interactions, i.e., the $w_{ij}$ between source node $v_i \in \mathcal{V}_s$ and target node $v_j \in \mathcal{V}_t$. In such a situation, we take the optional regularizer in (4) into account. Based on the observed $w_{ij}$'s, the elements of $C_{st}$ in (4) are calculated via (9).

We test the feasibility of our method on the MIMIC-III dataset (Johnson et al., 2016), which contains patient admissions in a hospital. Each admission is represented as a sequence of ICD (International Classification of Diseases) codes of the diseases and the procedures. The diseases (procedures) appearing in the same admission construct the interactions of the disease (procedure) graph. We aim to recommend suitable procedures for patients, according to their disease characteristics. To achieve this, we learn the embeddings of the ICD codes for the diseases and the procedures with the help of various methods, and measure the distance between the embeddings. We compare the proposed GWL method with the following baselines: i) treating the admission sequences as sentences and learning the embeddings of ICD codes via traditional word embedding methods like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014); ii) the distilled Wasserstein learning (DWL) method in (Xu et al., 2018), which trains the embeddings from scratch or fine-tunes Word2Vec's embeddings based on a Wasserstein topic model; and iii) the GWD method that minimizes the GW discrepancy purely based on the data-driven distance matrices, and then learns the embeddings regularized by the learned optimal transport. The GWD method is equivalent to applying our GWL method with the number of outer iterations set to $M = 1$. For the GWD method, we also consider the cosine- and RBF-based distances when learning embeddings, denoted as GWD-C and GWD-R, respectively.

For fairness of comparison, we use a subset of the MIMIC-III dataset provided by (Xu et al., 2018), which contains 11,086 patient admissions, corresponding to 56 diseases and 25 procedures. For all the methods, we use 50% of the admissions for training, 25% for validation, and the remaining 25% for testing. In the testing phase, for the $i$-th admission, $i = 1, \ldots, I$, we recommend a list of procedures of length $L$, denoted as $E_i$, based on its diseases, and evaluate the recommendation results against the ground-truth list of procedures, denoted as $T_i$. Given $\{E_i, T_i\}$, we calculate the top-$L$ precision, recall and F1-score as follows:
$$
P = \frac{1}{I}\sum_{i=1}^{I} P_i = \frac{1}{I}\sum_{i=1}^{I} \frac{|E_i \cap T_i|}{|E_i|}, \quad
R = \frac{1}{I}\sum_{i=1}^{I} R_i = \frac{1}{I}\sum_{i=1}^{I} \frac{|E_i \cap T_i|}{|T_i|}, \quad
F1 = \frac{1}{I}\sum_{i=1}^{I} \frac{2 P_i R_i}{P_i + R_i}.
$$
Table 2 shows the results of various methods with $L = 1$ and $5$. We find that our GWL method outperforms the alternatives, especially on the top-1 measurements.

Table 2. Top-N procedure recommendation results.

Method            Top-1 (%)                  Top-5 (%)
                  P      R      F1           P      R      F1
Word2Vec          39.95  13.27  18.25        28.89  46.98  32.59
GloVe             32.66  13.01  17.22        27.93  44.79  31.47
DWL (Scratch)     37.89  12.42  17.16        27.39  43.81  30.81
DWL (Finetune)    40.00  13.76  18.71        30.59  48.56  34.28
GWD-R             46.29  17.01  22.32        31.82  43.81  33.77
GWD-C             43.16  15.79  20.77        31.42  42.99  33.25
GWL-R             46.20  16.93  22.22        32.03  44.75  34.18
GWL-C             47.46  17.25  22.71        32.09  45.64  34.31
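A small sketch of these metrics (our illustration; each $E_i$ and $T_i$ is represented as a set of procedure codes, and the scores are averaged over admissions as above):

```python
def top_L_metrics(recommended, ground_truth):
    """Average top-L precision, recall and F1 (in %) over admissions."""
    P = R = F1 = 0.0
    for E, T in zip(recommended, ground_truth):
        p = len(E & T) / len(E)
        r = len(E & T) / len(T)
        P += p
        R += r
        F1 += (2 * p * r / (p + r)) if (p + r) > 0 else 0.0
    n = len(recommended)
    return 100 * P / n, 100 * R / n, 100 * F1 / n
```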

We analyze the learned optimal transport between diseases and procedures from a clinical viewpoint. In particular, we normalize the transport matrix, ensuring its maximum value is 1. For each disease, we find the corresponding procedures i) with the maximum optimal transport value and ii) with $T_{ij} > 0.15$. We asked two clinical researchers to check the pairs we found; they confirmed that for over 77.42% of the pairs, either the procedures are clearly related to the treatments of the diseases, or the procedures clearly lead to the diseases as side effects or complications (other relationships may be less clear, but are implied by the data), e.g., "(dV3001) Single liveborn, born in hospital, delivered by cesarean section ↔ (p640) Circumcision". The learned optimal transport, and all pairs of ICD codes and their evaluation results, are shown in the Supplementary Material.

6. Conclusions and Future Work

We have proposed a Gromov-Wasserstein learning method to unify graph matching and the learning of node embeddings into a single framework. We show that such joint learning is beneficial to each of the objectives, obtaining superior performance in various matching tasks. In the future, we plan to extend our method to multi-graph matching (Yan et al., 2016), which may be related to the Gromov-Wasserstein barycenter (Peyre et al., 2016) and its learning method. Additionally, to improve the scalability of our method, we will explore new Gromov-Wasserstein learning algorithms.

Acknowledgments This research was supported in part by DARPA, DOE, NIH, ONR and NSF. We thank Dr. Matthew Engelhard and Rachel Draelos for evaluating our results. We also thank Wenlin Wang for helpful discussions.


References

Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634, 2017.

Alvarez-Melis, D. and Jaakkola, T. Gromov-Wasserstein alignment of word embedding spaces. In EMNLP, pp. 1881–1890, 2018.

Barabasi, A.-L. et al. Network science. Cambridge University Press, 2016.

Bayati, M., Gerritsen, M., Gleich, D. F., Saberi, A., and Wang, Y. Algorithms for large, sparse network alignment problems. In ICDM, pp. 705–710, 2009.

Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyre, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

Bronstein, A. M., Bronstein, M. M., Kimmel, R., Mahmoudi, M., and Sapiro, G. A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-rigid shape matching. International Journal of Computer Vision, 89(2-3):266–286, 2010.

Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. Learning generative models across incomparable spaces. NeurIPS Workshop on Relational Representation Learning, 2018.

Chen, L., Dai, S., Tao, C., Zhang, H., Gan, Z., Shen, D., Zhang, Y., Wang, G., Zhang, R., and Carin, L. Adversarial text generation via feature-mover's distance. In NIPS, pp. 4671–4682, 2018a.

Chen, X., Xu, H., Zhang, Y., Tang, J., Cao, Y., Qin, Z., and Zha, H. Sequential recommendation with user memory networks. In WSDM, pp. 108–116, 2018b.

Chowdhury, S. and Memoli, F. The Gromov-Wasserstein distance between networks and stable network invariants. arXiv preprint arXiv:1808.04337, 2018.

Cordella, L. P., Foggia, P., Sansone, C., and Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372, 2004.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pp. 2292–2300, 2013.

Foulds, J. Mixed membership word embeddings for computational social science. In AISTATS, pp. 86–95, 2018.

Gold, S. and Rangarajan, A. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377–388, 1996.

Grave, E., Joulin, A., and Berthet, Q. Unsupervised alignment of embeddings with Wasserstein Procrustes. arXiv preprint arXiv:1805.11222, 2018.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In KDD, pp. 855–864, 2016.

Hashemifar, S. and Xu, J. HubAlign: An accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics, 30(17):i438–i444, 2014.

Ivanov, S. and Burnaev, E. Anonymous walk embeddings. In ICML, 2018.

Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

Jun, S.-H., Wong, S. W., Zidek, J., and Bouchard-Cote, A. Sequential graph matching with sequential Monte Carlo. In AISTATS, pp. 1075–1084, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kuchaiev, O. and Przulj, N. Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics, 27(10):1390–1396, 2011.

Kuchaiev, O., Milenkovic, T., Memisevic, V., Hayes, W., and Przulj, N. Topological network alignment uncovers biological function and phylogeny. Journal of the Royal Society Interface, pp. rsif20100063, 2010.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

Malod-Dognin, N. and Przulj, N. L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics, 31(13):2182–2189, 2015.

Memoli, F. Gromov-Hausdorff distances in Euclidean spaces. In CVPR Workshops, pp. 1–8, 2008.

Memoli, F. Spectral Gromov-Wasserstein distances for shape matching. In ICCV Workshops, pp. 256–263, 2009.

Memoli, F. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Mohammadi, S., Gleich, D. F., Kolda, T. G., and Grama, A. Triangular alignment TAME: A tensor-based approach for higher-order network alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 14(6):1446–1458, 2017.

Monti, F., Bronstein, M., and Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. In NIPS, pp. 3697–3707, 2017.

Myronenko, A. and Song, X. Point set registration: Coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, 2010.

Nassar, H., Veldt, N., Mohammadi, S., Grama, A., and Gleich, D. F. Low rank spectral network alignment. In WWW, pp. 619–628, 2018.

Neyshabur, B., Khadem, A., Hashemifar, S., and Arab, S. S. NETAL: A new graph-based method for global alignment of protein–protein interaction networks. Bioinformatics, 29(13):1654–1662, 2013.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In EMNLP, pp. 1532–1543, 2014.

Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In KDD, pp. 701–710, 2014.

Peyre, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML, pp. 2664–2672, 2016.

Powers, D. M. Applications and explanations of Zipf's law. In Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning, pp. 151–160, 1998.

Razaviyayn, M., Hong, M., and Luo, Z.-Q. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.

Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pp. 452–461, 2009.

Sharan, R. and Ideker, T. Modeling cellular machinery through biological network comparison. Nature Biotechnology, 24(4):427, 2006.

Singh, R., Xu, J., and Berger, B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences, 2008.

Sinkhorn, R. and Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.

Sun, Y., Crawford, J., Tang, J., and Milenkovic, T. Simultaneous optimization of both node and edge conservation in network alignment via WAVE. In International Workshop on Algorithms in Bioinformatics, pp. 16–39, 2015.

Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. LINE: Large-scale information network embedding. In WWW, pp. 1067–1077, 2015.

Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834, 2018.

Vijayan, V., Saraph, V., and Milenkovic, T. MAGNA++: Maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics, 31(14):2409–2411, 2015.

Villani, C. Optimal transport: Old and new, volume 338. Springer Science & Business Media, 2008.

Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.

Xu, H., Wang, W., Liu, W., and Carin, L. Distilled Wasserstein learning for word embedding and topic modeling. In NIPS, pp. 1723–1732, 2018.

Yan, J., Xu, H., Zha, H., Yang, X., Liu, H., and Chu, S. A matrix decomposition perspective to multiple graph matching. In ICCV, pp. 199–207, 2015.

Yan, J., Cho, M., Zha, H., Yang, X., and Chu, S. M. Multi-graph matching via affinity optimization with graduated consistency regularization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1228–1242, 2016.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. arXiv preprint arXiv:1806.01973, 2018.

Yu, T., Yan, J., Wang, Y., Liu, W., et al. Generalizing graph matching beyond quadratic assignment model. In NIPS, pp. 861–871, 2018.

Zhang, J. and Philip, S. Y. Multiple anonymized social networks alignment. In ICDM, pp. 599–608, 2015.


7. Supplementary Material

7.1. The scheme of the proposed proximal point method

In the m-th outer iteration, we learn the optimal transport iteratively. In particular, in the n-th inner iteration, we update the target optimal transport by solving (6) with the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967; Cuturi, 2013). Algorithm 2 gives the details of our proximal point method in the m-th outer iteration, where $\mathrm{diag}(\cdot)$ converts a vector to a diagonal matrix, and $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively.

Algorithm 2 Proximal Point Method for GW Discrepancy
1: Input: $\{C_s, C_t\}$, $\{\mu_s, \mu_t\}$, current embeddings $\{X_s^{(m)}, X_t^{(m)}\}$, $\gamma$, the number of inner iterations $N$.
2: Output: $T^{(m+1)}$.
3: Initialize $T^{(0)} = \mu_s \mu_t^{\top}$ and $a = \mu_s$.
4: for $n = 0 : N-1$ do
5:   Calculate the $C^{(m,n)}$ in (7).
6:   Set $G = \exp(-\frac{C^{(m,n)}}{\gamma}) \odot T^{(n)}$.
7:   // Sinkhorn-Knopp algorithm:
8:   for $j = 1 : J$ do
9:     $b = \mu_t \oslash (G^{\top} a)$
10:    $a = \mu_s \oslash (G b)$
11:  end for
12:  $T^{(n+1)} = \mathrm{diag}(a)\, G\, \mathrm{diag}(b)$.
13: end for
14: $T^{(m+1)} = T^{(N)}$.
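For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 2, not the released implementation: compute_cost is a hypothetical placeholder for the cost matrix $C^{(m,n)}$ defined in (7), which depends on the current embeddings and is not re-derived here.

import numpy as np

def proximal_point_gw(mu_s, mu_t, compute_cost, gamma, N, J=1):
    # mu_s, mu_t: 1-D marginal vectors of the two graphs.
    # compute_cost(T): caller-supplied function returning C^(m,n) in (7).
    T = np.outer(mu_s, mu_t)            # line 3: T^(0) = mu_s mu_t^T
    a = mu_s.copy()                     # line 3: a = mu_s
    for n in range(N):
        C = compute_cost(T)             # line 5: C^(m,n) in (7) (placeholder)
        G = np.exp(-C / gamma) * T      # line 6: proximal kernel exp(-C/gamma) ⊙ T^(n)
        for _ in range(J):              # lines 8-11: Sinkhorn-Knopp projections
            b = mu_t / (G.T @ a)
            a = mu_s / (G @ b)
        T = a[:, None] * G * b[None, :] # line 12: diag(a) G diag(b)
    return T                            # line 14: T^(m+1) = T^(N)

With J = 1 this matches the setting used in practice (cf. Section 7.4).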

7.2. The convergence of each updating step

The proposed proximal point method decomposes a nonconvex optimization problem into a series of convex updating steps. Each updating step corresponds to the solution of a regularized optimal transport problem, which is solved via J Sinkhorn projections. The work in (Altschuler et al., 2017) proves that solving the regularized optimal transport problem with Sinkhorn projections converges linearly. The work in (Xie et al., 2018) further proves that the linear convergence holds even when only a one-step Sinkhorn projection is applied in each updating step (i.e., J = 1). Therefore, the updating steps of the proposed method converge linearly.

7.3. Global convergence: The proof of Proposition 3.1

Proposition 3.1 Every limit point generated by our proximal point method, i.e., $\lim_{n\to\infty} T^{(n)}$, is a stationary point of the problem (5).

Proof. When learning the target optimal transport, the original optimization problem (5) has a nonconvex and differentiable objective function
$$f(T) = \langle L(C_s(X_s^{(m)}), C_t(X_t^{(m)}), T), T\rangle + \alpha\langle K(X_s^{(m)}, X_t^{(m)}), T\rangle, \quad (10)$$

and a closed convex set $\mathcal{T} = \Pi(\mu_s, \mu_t)$ as the constraint of $T$. As a special case of successive upper-bound minimization (SUM), our proximal point method solves (5) via optimizing a sequence of approximate objective functions: starting from a feasible point $T^{(0)}$, the algorithm generates a sequence $\{T^{(n)}\}$ according to the update rule
$$T^{(n+1)} = \arg\min_{T\in\mathcal{T}} u(T, T^{(n)}), \quad (11)$$
where
$$u(T, T^{(n)}) = f(T) + \gamma\,\mathrm{KL}(T\,\|\,T^{(n)}) \quad (12)$$
is an approximation of $f(T)$ at the $n$-th iteration, and $T^{(n)}$ is the point generated in the previous iteration.

Obviously, we have


C1: $u(T, T')$ is continuous in $(T, T')$.

Additionally, because $\mathrm{KL}(T\,\|\,T') = \sum_{ij} \bigl(T_{ij}\log\frac{T_{ij}}{T'_{ij}} - T_{ij} + T'_{ij}\bigr) \ge 0$ and the equality holds only when $T' = T$, we have

C2: $u(T, T) = f(T)$ $\forall\, T \in \mathcal{T}$.

C3: $u(T, T') \ge f(T)$ $\forall\, T, T' \in \mathcal{T}$.
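Both conditions follow directly from (12): the gap between the surrogate and the original objective is exactly the scaled KL term,
$$u(T, T') - f(T) = \gamma\,\mathrm{KL}(T\,\|\,T') \ge 0, \qquad u(T, T) - f(T) = \gamma\,\mathrm{KL}(T\,\|\,T) = 0,$$
which gives C3 in general and C2 at $T' = T$.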

According to Proposition 1 in (Razaviyayn et al., 2013), when the conditions C2 and C3 are satisfied, then for the differentiable function $f(T)$ and its global upper bound $u(T, T')$ we have

C4: $u'(T, T'; D)|_{T = T'} = f'(T'; D)$ $\forall\, D$ with $T' + D \in \mathcal{T}$, where
$$f'(T; D) := \liminf_{\delta\to 0} \frac{f(T + \delta D) - f(T)}{\delta}$$
is the directional derivative of $f(T)$ along the direction $D$, and $u'(T, T'; D)$ is the directional derivative only with respect to $T$.

According to Theorem 1 in (Razaviyayn et al., 2013), when the approximate objective function $u(\cdot, T^{(n)})$ in each iteration satisfies C1–C4, every limit point generated by the proposed method, $\lim_{n\to\infty} T^{(n)}$, is a stationary point of the original problem (5).

7.4. Connections and comparisons with existing methods

Note that when replacing the KL-divergence $\mathrm{KL}(T\,\|\,T^{(n)})$ in (6) with an entropy regularizer $H(T)$, we derive an entropic GW discrepancy, which can also be solved by the Sinkhorn-Knopp algorithm. Accordingly, the $G$ in Algorithm 2 (line 6) is replaced with $\exp(-\frac{C^{(m,n)}}{\gamma})$. In such a situation, the proposed algorithm becomes the Sinkhorn projection method in (Peyré et al., 2016).
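To make the connection concrete, here is a one-line sketch of the only step that differs between the two solvers, reusing the notation of Algorithm 2; the function name is ours, purely illustrative:

import numpy as np

def sinkhorn_kernel(C, gamma, T_prev, proximal=True):
    # Proximal point update: G = exp(-C / gamma) * T^(n), i.e., the kernel is
    # reweighted by the previous transport plan (KL(T || T^(n)) regularizer).
    # Entropic GW (Peyré et al., 2016): G = exp(-C / gamma); the previous plan is ignored.
    G = np.exp(-C / gamma)
    return G * T_prev if proximal else G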

[Figure 4 (plots omitted): Gromov-Wasserstein discrepancy versus the number of inner iterations N, comparing the proposed proximal point method ("proximal") with the Sinkhorn projection method ("sinkhorn"), for J ∈ {1, 10, 100} and γ ∈ {1e−3, 1e−2, 1e−1, 1} (panels (a)–(l)).]

Figure 4. Comparisons for our method (blue curves) and that in (Peyré et al., 2016) (orange curves).

For both methods, the number of Sinkhorn iterations J and the weight γ of the (proximal or entropic) regularizer are two significant hyperparameters. Figure 4 shows the empirical convergence of the two methods for different hyperparameters with respect to the number of inner iterations (N in Algorithm 2). We find that the Sinkhorn method obtains a smaller GW discrepancy than our proximal point method when the weight of the regularizer is very small (γ = 1e−3) and J = 1. However, in that regime both methods suffer from a high risk of numerical instability. When γ is enlarged, the stability of our method improves markedly, and it is still able to obtain a small GW discrepancy with a good convergence rate.

The Sinkhorn method, by contrast, converges slowly when γ > 1e−2. In other words, our method is more robust to the choice of γ: we can choose γ from a wide range and easily trade off convergence against stability. Additionally, although increasing J slightly improves the stability of our method, i.e., it suppresses the numerical fluctuations after the GW discrepancy converges, the improvement is not as significant as the added computational cost. Therefore, in practice we set J = 1.

7.5. Runtime

We implement our method in PyTorch. For the synthetic graphs with 100 nodes, matching takes about 15 seconds on a CPU. For the MC3 graphs (622 nodes), it takes about 8 minutes on the CPU. This is even faster than some baselines implemented in C++ (Vijayan et al., 2015; Kuchaiev & Przulj, 2011).

[Figure 5 (plots omitted): node correctness (%) and GW discrepancy versus the percentage of noisy nodes and edges, with curves for GWD, GWL-C (OT), GWL-C (Embedding), and the GW discrepancy, on (a) K-NN graphs with |Vs| = 50, (b) K-NN graphs with |Vs| = 100, (c) BA graphs with |Vs| = 50, and (d) BA graphs with |Vs| = 100.]

Figure 5. The performance of our method on synthetic data.

7.6. More details of experiments

For the synthetic data, Figure 5 shows all experimental results on node correctness. We find that the proposed method is consistently better than the baseline method.

For the MIMIC-III data, the enlarged optimal transport between diseases and procedures learned by our method is shown in Figure 6. The transport matrix is normalized so that its maximum value is 1. For each disease, the procedures with the maximum transport value or with the corresponding $T_{ij} \ge 0.15$ are listed in Table 3. Additionally, we asked two clinical researchers to evaluate these pairs: for each pair, each researcher independently checked whether the procedure is potentially related to the disease. The columns "CR1" and "CR2" in Table 3 give the evaluation results. For each pair, one mark indicates that the procedure is potentially related to the treatment of the disease, while the other mark indicates that the procedure can lead to the disease as a side effect or complication. We find that 1) the evaluation results from the two clinical researchers are highly consistent; and 2) over 73.6% of the pairs are reasonable: they correspond to either "diseases and their treatments" or "procedures and their complications". These observations demonstrate that the learned optimal transport is clinically meaningful to some extent and reflects relationships between diseases and procedures. Table 4 lists the ICD codes of the diseases and procedures with their detailed descriptions.
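As an illustration of how the rows of Table 3 can be read off the learned transport, the following sketch (the variable names and helper are ours, purely hypothetical) keeps, for each disease, the procedure with the maximum transport value together with every procedure whose normalized $T_{ij} \ge 0.15$:

import numpy as np

def matched_pairs(T, disease_codes, procedure_codes, threshold=0.15):
    # Rows of T index diseases, columns index procedures.
    T = T / T.max()                                  # normalize so the maximum value is 1
    pairs = []
    for i, disease in enumerate(disease_codes):
        keep = {int(np.argmax(T[i]))}                # procedure with the maximum transport value
        keep |= set(np.flatnonzero(T[i] >= threshold).tolist())  # plus all with T_ij >= 0.15
        for j in sorted(keep):
            pairs.append((float(T[i, j]), disease, procedure_codes[j]))
    return pairs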


Table 3. Matched disease–procedure pairs based on optimal transport

Tij  Disease ↔ Procedure  CR1 CR2
1.00 d4019: Unspecified essential hypertension ↔ p9604: Insertion of endotracheal tube
0.22 d4019: Unspecified essential hypertension ↔ p966: Enteral infusion of concentrated nutritional substances
0.17 d4280: Congestive heart failure, unspecified ↔ p966: Enteral infusion of concentrated nutritional substances X X
0.64 d4280: Congestive heart failure, unspecified ↔ p9671: Continuous invasive mechanical ventilation for less than 96 consecutive hours X X
0.36 d42731: Atrial fibrillation ↔ p3961: Extracorporeal circulation auxiliary to open heart surgery X XX
0.18 d42731: Atrial fibrillation ↔ p8856: Coronary arteriography using two catheters X XX
0.16 d42731: Atrial fibrillation ↔ p8872: Diagnostic ultrasound of heart X X
0.34 d41401: Coronary atherosclerosis of native coronary artery ↔ p3961: Extracorporeal circulation auxiliary to open heart surgery X X
0.29 d41401: Coronary atherosclerosis of native coronary artery ↔ p8856: Coronary arteriography using two catheters X X
0.42 d5849: Acute kidney failure, unspecified ↔ p9672: Continuous invasive mechanical ventilation for 96 consecutive hours or more X
0.44 d25000: Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled ↔ p3615: Single internal mammary-coronary artery bypass X (a)
0.45 d2724: Other and unspecified hyperlipidemia ↔ p8853: Angiocardiography of left heart structures X
0.20 d51881: Acute respiratory failure ↔ p3893: Venous catheterization, not elsewhere classified X X
0.22 d51881: Acute respiratory failure ↔ p9904: Transfusion of packed cells X XX
0.29 d5990: Urinary tract infection, site not specified ↔ p3893: Venous catheterization, not elsewhere classified X X
0.22 d53081: Esophageal reflux ↔ p9390: Non-invasive mechanical ventilation X
0.23 d2720: Pure hypercholesterolemia ↔ p3891: Arterial catheterization X
0.48 dV053: Need for prophylactic vaccination and inoculation against viral hepatitis ↔ p9955: Prophylactic administration of vaccine against other diseases X X
0.53 dV290: Observation for suspected infectious condition ↔ p9955: Prophylactic administration of vaccine against other diseases X
0.30 d2859: Anemia, unspecified ↔ p9915: Parenteral infusion of concentrated nutritional substances X (b)
0.25 d2449: Unspecified acquired hypothyroidism ↔ p9915: Parenteral infusion of concentrated nutritional substances
0.24 d486: Pneumonia, organism unspecified ↔ p9671: Continuous invasive mechanical ventilation for less than 96 consecutive hours XX XX
0.18 d2851: Acute posthemorrhagic anemia ↔ p9904: Transfusion of packed cells X X
0.18 d2762: Acidosis ↔ p966: Enteral infusion of concentrated nutritional substances XX
0.28 d496: Chronic airway obstruction, not elsewhere classified ↔ p3722: Left heart cardiac catheterization
0.16 d99592: Severe sepsis ↔ p3893: Venous catheterization, not elsewhere classified X X
0.26 d0389: Unspecified septicemia ↔ p966: Enteral infusion of concentrated nutritional substances X
0.26 d5070: Pneumonitis due to inhalation of food or vomitus ↔ p3893: Venous catheterization, not elsewhere classified X X
0.33 dV3000: Single liveborn, born in hospital, delivered without mention of cesarean section ↔ p331: Incision of lung
0.17 d5859: Chronic kidney disease, unspecified ↔ p9904: Transfusion of packed cells X X
0.13 d311: Depressive disorder, not elsewhere classified ↔ p4513: Other endoscopy of small intestine
0.14 d40390: Hypertensive chronic kidney disease ↔ p3324: Closed biopsy of bronchus
0.11 d3051: Tobacco use disorder ↔ p5491: Percutaneous abdominal drainage
0.16 d412: Old myocardial infarction ↔ p8853: Angiocardiography of left heart structures X X
0.18 d2875: Thrombocytopenia, unspecified ↔ p3893: Venous catheterization, not elsewhere classified X X
0.10 dV4581: Aortocoronary bypass status ↔ p3891: Arterial catheterization X
0.25 d41071: Subendocardial infarction, initial episode of care ↔ p3723: Combined right and left heart cardiac catheterization XX XX
0.09 d2761: Hyposmolality and/or hyponatremia ↔ p966: Enteral infusion of concentrated nutritional substances XX
0.21 d4240: Mitral valve disorders ↔ p9904: Transfusion of packed cells X
0.31 dV3001: Single liveborn, born in hospital, delivered by cesarean section ↔ p640: Circumcision X X
0.08 d5119: Unspecified pleural effusion ↔ p9604: Insertion of endotracheal tube X
0.07 dV4582: Percutaneous transluminal coronary angioplasty status ↔ p9907: Transfusion of other serum X
0.23 d40391: Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease ↔ p3893: Venous catheterization, not elsewhere classified X X
0.17 d78552: Septic shock ↔ p9904: Transfusion of packed cells X
0.05 d4241: Aortic valve disorders ↔ p8872: Diagnostic ultrasound of heart X X
0.06 dV5867: Long-term (current) use of insulin ↔ p3995: Hemodialysis
0.07 d42789: Other specified cardiac dysrhythmias ↔ p9604: Insertion of endotracheal tube
0.05 d32723: Obstructive sleep apnea (adult)(pediatric) ↔ p9390: Non-invasive mechanical ventilation X X
0.17 d9971: Cardiac complications, not elsewhere classified ↔ p8856: Coronary arteriography using two catheters XX XX
0.07 d5845: Acute kidney failure with lesion of tubular necrosis ↔ p9904: Transfusion of packed cells X XX
0.05 d2760: Hyperosmolality and/or hypernatremia ↔ p966: Enteral infusion of concentrated nutritional substances XX
0.27 d7742: Neonatal jaundice associated with preterm delivery ↔ p9983: Other phototherapy X X
0.12 d49390: Asthma, unspecified type, unspecified ↔ p5491: Percutaneous abdominal drainage
0.10 d2767: Hyperpotassemia ↔ p3893: Venous catheterization, not elsewhere classified X X
0.09 d5180: Pulmonary collapse ↔ p3893: Venous catheterization, not elsewhere classified X X
0.08 d4168: Other chronic pulmonary heart diseases ↔ p9907: Transfusion of other serum X
0.13 d45829: Other iatrogenic hypotension ↔ p9907: Transfusion of other serum X
0.14 d2749: Gout, unspecified ↔ p3995: Hemodialysis X
0.10 d4589: Hypotension, unspecified ↔ p966: Enteral infusion of concentrated nutritional substances
0.25 dV502: Routine or ritual circumcision ↔ p9955: Prophylactic administration of vaccine against other diseases X (c)

(a) The relationship here is that usually people with diabetes also have heart disease, and heart disease can require a coronary artery bypass.
(b) The relationship here is that if someone has a chronic disease they can develop anemia of chronic disease, and they may also require parenteral nutrition for some specific condition.
(c) This procedure is not inherently related to the disease, but they appear together frequently in the same medical record because both happen to newborn babies.
X The procedure is related to the treatment of the disease.
X The procedure can lead to the disease as side effect or complication.


Figure 6. The optimal transport from diseases to procedures (enlarged version).


Table 4. The map between ICD codes and diseases/procedures

ICD code  Disease/Procedure
d4019  Unspecified essential hypertension
d41401  Coronary atherosclerosis of native coronary artery
d4241  Aortic valve disorders
dV4582  Percutaneous transluminal coronary angioplasty status
d2724  Other and unspecified hyperlipidemia
d486  Pneumonia, organism unspecified
d99592  Severe sepsis
d51881  Acute respiratory failure
d5990  Urinary tract infection, site not specified
d5849  Acute kidney failure, unspecified
d78552  Septic shock
d25000  Diabetes mellitus without mention of complication, type II or unspecified type
d2449  Unspecified acquired hypothyroidism
d41071  Subendocardial infarction, initial episode of care
d4280  Congestive heart failure, unspecified
d4168  Other chronic pulmonary heart diseases
d412  Pneumococcus infection in conditions classified elsewhere and of unspecified site
d2761  Hyposmolality and/or hyponatremia
d2720  Pure hypercholesterolemia
d2762  Acidosis
d389  Unspecified septicemia
d4589  Hypotension, unspecified
d42731  Atrial fibrillation
d2859  Anemia, unspecified
d311  Cutaneous diseases due to other mycobacteria
dV3001  Single liveborn, born in hospital, delivered by cesarean section
dV053  Need for prophylactic vaccination and inoculation against viral hepatitis
d4240  Mitral valve disorders
dV3000  Single liveborn, born in hospital, delivered without mention of cesarean section
d7742  Neonatal jaundice associated with preterm delivery
d42789  Other specified cardiac dysrhythmias
d5070  Pneumonitis due to inhalation of food or vomitus
dV502  Routine or ritual circumcision
d2760  Hyperosmolality and/or hypernatremia
dV1582  Personal history of tobacco use
d40390  Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage I through stage IV, or unspecified
dV4581  Aortocoronary bypass status
dV290  Observation for suspected infectious condition
d5845  Acute kidney failure with lesion of tubular necrosis
d2875  Thrombocytopenia, unspecified
d2767  Hyperpotassemia
d32723  Obstructive sleep apnea (adult)(pediatric)
dV5861  Long-term (current) use of anticoagulants
d2851  Acute posthemorrhagic anemia
d53081  Esophageal reflux
d496  Chronic airway obstruction, not elsewhere classified
d40391  Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease
d9971  Gross hematuria
d5119  Unspecified pleural effusion
d2749  Gout, unspecified
d5859  Chronic kidney disease, unspecified
d49390  Asthma, unspecified type, unspecified
d45829  Other iatrogenic hypotension
d3051  Tobacco use disorder
dV5867  Long-term (current) use of insulin
d5180  Pulmonary collapse
p9604  Insertion of endotracheal tube
p9671  Continuous invasive mechanical ventilation for less than 96 consecutive hours
p3615  Single internal mammary-coronary artery bypass
p3961  Extracorporeal circulation auxiliary to open heart surgery
p8872  Diagnostic ultrasound of heart
p9904  Transfusion of packed cells
p9907  Transfusion of other serum
p9672  Continuous invasive mechanical ventilation for 96 consecutive hours or more
p331  Spinal tap
p3893  Venous catheterization, not elsewhere classified
p966  Enteral infusion of concentrated nutritional substances
p3995  Hemodialysis
p9915  Parenteral infusion of concentrated nutritional substances
p8856  Coronary arteriography using two catheters
p9955  Prophylactic administration of vaccine against other diseases
p3891  Arterial catheterization
p9390  Non-invasive mechanical ventilation
p9983  Other phototherapy
p640  Circumcision
p3722  Left heart cardiac catheterization
p8853  Angiocardiography of left heart structures
p3723  Combined right and left heart cardiac catheterization
p5491  Percutaneous abdominal drainage
p3324  Closed (endoscopic) biopsy of bronchus
p4513  Other endoscopy of small intestine