
Massive MapReduce Matrix Computations & Multicore Graph Algorithms


Page 1: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Massive MapReduce Matrix Computations & Multicore Graph Algorithms DAVID F. GLEICH COMPUTER SCIENCE PURDUE UNIVERSITY


Page 2: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

It's a pleasure … I was an Intel intern in 2005 in the Application Research Lab in Santa Clara, resulting in one of my favorite papers!

Internet Mathematics Vol. 3, No. 3: 257-294

Approximating Personalized PageRank with Minimal Use of Web Graph Data. David Gleich and Marzia Polito.

Abstract. In this paper, we consider the problem of calculating fast and accurate approximations to the personalized PageRank score of a webpage. We focus on techniques to improve speed by limiting the amount of web graph data we need to access.

Our algorithms provide both the approximation to the personalized PageRank score as well as guidance in using only the necessary information—and therefore sensibly reduce not only the computational cost of the algorithm but also the memory and memory bandwidth requirements. We report experiments with these algorithms on web graphs of up to 118 million pages and prove a theoretical approximation bound for all. Finally, we propose a local, personalized web-search system for a future client system using our algorithms.

1. Introduction and Motivation

To have web search results that are personalized, we claim that there is no need to access data from the whole web. In fact, it is likely that the majority of the webpages are totally unrelated to the interests of any one user.

In the original PageRank paper [Brin and Page 98], Brin and Page proposed a personalized version of the algorithm for the goal of user-specific page ranking. While the PageRank algorithm models a random surfer that teleports everywhere in the web graph, the random surfer in the personalized PageRank Markov chain only teleports to a few pages of personal interest. As a consequence, the personalization vector is usually sparse, and the value of a personalized score will be negligible or zero on most of the web.


Could you run your own search engine and crawl the web to compute your own PageRank vector if you are highly concerned with privacy? Yes! Theory, experiments, implementation!


Page 3: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Yangyang Hou (Purdue, CS); Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University); James Demmel (UC Berkeley); Joe Ruthruff, Jeremy Templeton (Sandia CA)

Massive MapReduce Matrix Computations

Funded by Sandia National Labs CSAR project.

[Figure: a tall matrix split into row blocks A1, A2, A3, A4.]


Page 4: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

By 2013(?) all Fortune 500 companies will have a data computer


Page 5: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Data computers I’ve worked with …


Nebula Cluster @ Sandia CA: 2 TB/core storage, 64 nodes, 256 cores, GB ethernet. Cost: $150k.

Student Cluster @ Stanford: 3 TB/core storage, 11 nodes, 44 cores, GB ethernet. Cost: $30k.

Magellan Cluster @ NERSC: 128 GB/core storage, 80 nodes, 640 cores, InfiniBand.

These systems are good for working with enormous matrix data!


Page 6: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

How do you program them?


Page 7: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

MapReduce and Hadoop overview


Page 8: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

MapReduce in a picture

Like an MPI all-to-all. [Figure: map tasks run in parallel, a shuffle, then reduce tasks run in parallel.]

Page 9: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Computing a histogram: a simple MapReduce example

Input: Key = ImageId, Value = pixels

Map(ImageId, pixels): for each pixel, emit Key = (r, g, b), Value = 1

Reduce(Color, Values): emit Key = Color, Value = sum(Values)

Output: Key = Color, Value = # of pixels

[Figure: sample pixel values map to (color, 1) pairs, the shuffle groups the pairs by color, and the reduce sums each group.]
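As a concrete (hypothetical) illustration, here is a minimal sketch of this map/reduce pair in plain Python, with the shuffle simulated by grouping map output by key; no Hadoop is involved and the pixel data is made up:

# A minimal local sketch of the histogram example in plain Python;
# the shuffle is simulated by grouping map output by key.
from collections import defaultdict

def mapper(image_id, pixels):
    for (r, g, b) in pixels:
        yield (r, g, b), 1            # Key = color, Value = 1

def reducer(color, values):
    yield color, sum(values)          # Key = color, Value = count

def run(records):
    groups = defaultdict(list)        # simulated shuffle: group by key
    for image_id, pixels in records:
        for key, value in mapper(image_id, pixels):
            groups[key].append(value)
    return dict(kv for key, values in groups.items()
                for kv in reducer(key, values))

print(run([('img1', [(0, 0, 0), (255, 255, 255), (0, 0, 0)])]))
# {(0, 0, 0): 2, (255, 255, 255): 1}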


Page 10: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Why a limited computational model? Data scalability, fault tolerance.

The idea: bring the computations to the data. MapReduce can schedule map functions without moving data. [Figure: input blocks 1–5 with map tasks scheduled on the nodes that hold them, then a shuffle and reduce tasks.]

After waiting in the queue for a month and after 24 hours of finding eigenvalues, one node randomly hiccups. The last page of a 136-page error dump.


Page 11: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

[Figure: a tall-and-skinny matrix A, from the tinyimages collection.]

Tall-and-skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000).

Used in: regression and general linear models with many samples; block iterative methods; panel factorizations; simulation data analysis; big-data SVD/PCA.


Page 12: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Scientific simulations as tall-and-skinny matrices

[Figure: a run maps input parameters s to the time history of the simulation, f(s), ~100 GB.]

$$f(s) = \begin{bmatrix} q(x_1, t_1, s) \\ \vdots \\ q(x_n, t_1, s) \\ q(x_1, t_2, s) \\ \vdots \\ q(x_n, t_2, s) \\ \vdots \\ q(x_n, t_k, s) \end{bmatrix}$$

The simulation as a vector; the simulation as a matrix (space by time). A database of simulations, s1 → f1, s2 → f2, …, sk → fk, gives a space-by-time by parameters matrix: the database is a very tall-and-skinny matrix.


Page 13: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

A Large Scale Example

Nonlinear heat transfer model: 80k nodes, 300 time-steps, 104 basis runs. SVD of a 24M x 104 data matrix. 500x reduction in wall-clock time (100x including the SVD).

Model reduction


Constantine & Gleich, ICASSP 2012


Page 14: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

PCA of 80,000,000 images

A is 80,000,000 images by 1000 pixels. MapReduce stage: zero-mean the rows, then TSQR produces R. Post-processing: an SVD of R gives the singular values and V. [Figure: the first 16 columns of V shown as images; the top 100 singular values (principal components).]

Constantine & Gleich, MapReduce 2010.
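A serial numpy sketch of this pipeline (my illustration, not the talk's code): a local QR stands in for the distributed TSQR step, and the SVD of the small R recovers the singular values and V.

import numpy as np

# Stand-in for the MapReduce stage: zero-mean the rows, then get TSQR's R.
A = np.random.randn(100000, 50)           # small stand-in for 80M x 1000
A -= A.mean(axis=1, keepdims=True)        # zero-mean each row
R = np.linalg.qr(A, mode='r')             # on the cluster, R comes from TSQR

# Post-processing on one node: the SVD of the small R gives the PCA pieces.
# If A = QR and R = U S V^T, then A = (QU) S V^T, so S and V are also A's.
_, S, Vt = np.linalg.svd(R)
print(S[:5])            # leading singular values (principal component scales)
print(Vt[:2].shape)     # first principal directions, as rows of V^T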

Page 15: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

All that these applications need is tall-and-skinny QR.


Page 16: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Quick review of QR

Current MapReduce algorithms use the normal equations, which can limit numerical accuracy.

QR factorization: let $A$ be $m \times n$, $m \ge n$, real. Then $A = QR$, where $Q$ is $m \times n$ orthogonal ($Q^T Q = I$) and $R$ is $n \times n$ upper triangular.

Using QR for regression: the least-squares solution of $\min_x \|Ax - b\|$ is given by the solution of $Rx = Q^T b$.

QR is block normalization: "normalizing" a vector usually generalizes to computing $Q$ in the QR factorization.

The normal-equations route:
$$A = QR, \qquad A^T A \xrightarrow{\text{Cholesky}} R^T R, \qquad Q = A R^{-1}.$$
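As a small aside (my sketch, not from the talk), numpy shows the accuracy limit of the normal-equations route: for an ill-conditioned A, the Q from Cholesky QR drifts from orthogonality by roughly cond(A)² · ε, while Householder QR stays near machine precision.

import numpy as np

# Build an ill-conditioned tall-and-skinny matrix with a chosen spectrum.
np.random.seed(0)
m, n = 1000, 10
U, _ = np.linalg.qr(np.random.randn(m, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
A = U @ np.diag(np.logspace(0, -6, n)) @ V.T   # cond(A) ~ 1e6

# Normal-equations route (Cholesky QR): R from A^T A, then Q = A R^{-1}.
L = np.linalg.cholesky(A.T @ A)                # A^T A = L L^T, so R = L^T
Q_chol = A @ np.linalg.inv(L.T)

# Householder QR for comparison.
Q_house, _ = np.linalg.qr(A)

I = np.eye(n)
print(np.linalg.norm(Q_chol.T @ Q_chol - I))    # roughly cond(A)^2 * eps
print(np.linalg.norm(Q_house.T @ Q_house - I))  # near machine epsilon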

Page 17: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

There are good MPI implementations. Why MapReduce?


Page 18: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        # factor the locally buffered rows, keeping only the R factor
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)


Page 19: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Tall-and-skinny matrix storage in MapReduce: A is m x n with m ≫ n. The key is an arbitrary row-id; the value is the 1 x n array for a row. Each submatrix Ai is the input to a map task.

[Figure: A split into row blocks A1, A2, A3, A4, one block per map task.]


Page 20: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Numerical stability was a problem for prior approaches

[Figure: $\|Q^T Q - I\|$ versus condition number ($10^5$ to $10^{20}$) for $AR^{-1}$, $AR^{-1}$ + iterative refinement, and Direct TSQR.]

Prior work ($AR^{-1}$): Constantine & Gleich, MapReduce 2010. Iterative refinement and Direct TSQR: Benson, Gleich, Demmel, submitted.

Previous methods couldn't ensure that the matrix Q was orthogonal.


Page 21: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

[Figure: Mapper 1 runs serial TSQR on blocks A1–A4 and emits R4; Mapper 2 runs serial TSQR on blocks A5–A8 and emits R8; Reducer 1 runs serial TSQR on R4 and R8 and emits the final R.]

Algorithm: Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows.

Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2010)


A "manual reduce" can make it faster by adding a second iteration. This computes only R and not Q; we can get Q via $Q = AR^{-1}$ with another MapReduce iteration. Or use the standard Householder method?

Page 22: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Taking care of business by keeping track of Q

[Figure: each mapper factors its blocks $A_i = Q_i R_i$ and writes the local $Q_i$ and $R_i$ to separate outputs; Task 2 collects $R_1, \dots, R_4$, computes the final $R$ and the small factors $Q_{11}, \dots, Q_{41}$; Mapper 3 forms each piece of the true $Q$ as $Q_i Q_{i1}$.]

1. Output local Q and R in separate files

2. Collect R on one node, compute Qs for each piece

3. Distribute the pieces of Q*1 and form the true Q
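A serial numpy sketch of these three steps (my illustration, not the talk's implementation), with four in-memory row blocks standing in for the mappers:

import numpy as np

A = np.random.randn(4000, 20)
blocks = np.array_split(A, 4)                 # one block per "mapper"

# Step 1: each mapper computes a local QR and keeps Qi and Ri separately.
Qs, Rs = zip(*(np.linalg.qr(Ai) for Ai in blocks))

# Step 2: one task stacks the Ri, factors them, and splits the small Q.
Q_small, R = np.linalg.qr(np.vstack(Rs))      # R is the final n x n factor
Q_pieces = np.array_split(Q_small, 4)         # the Qi1 pieces, one per mapper

# Step 3: each mapper forms its slice of the true Q as Qi * Qi1.
Q = np.vstack([Qi @ Qi1 for Qi, Qi1 in zip(Qs, Q_pieces)])

print(np.linalg.norm(Q.T @ Q - np.eye(20)))   # near machine epsilon
print(np.linalg.norm(A - Q @ R))              # A = QR holds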


Page 23: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

The price is right! Based on a performance model and tests.

[Figure: running times in seconds (roughly 500–2500 s) for four matrices: 800M-by-10, 7.5B-by-4, 150M-by-100, and 500M-by-50.]

Direct TSQR is faster than refinement for few columns, and not any slower for many columns.

Experiment on the NERSC Magellan computer: 80 nodes, 640 processors, 80 TB disk.


Page 24: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Ongoing work

Make $AR^{-1}$ stable with targeted quad-precision arithmetic to get a numerically orthogonal Q. The performance model says it's feasible!

How to handle more than ~10,000 columns? Some randomized methods?

Do we need quad-precision for big data? The standard error analysis gives an $n\varepsilon$ bound for computing a sum of $n$ numbers. I've seen this matter with PageRank computations!


Page 25: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Assefaw Gebremedhin, Arif Khan, Alex Pothen, Ryan Rossi (Purdue, CS); Mahantesh Halappanavar (PNNL); Chen Greif, David Kurokawa (Univ. British Columbia); Mohsen Bayati, Amin Saberi, Ying Wang, now at Google (Stanford)

Multicore Graph Algorithms

Funded by DOE CSCAPES Institute grant (DE-FC02-08ER25864), NSF CAREER grant 1149756-CCF, and the Center for Adaptive Super Computing Software Multithreaded Architectures (CASS-MT) at PNNL.

[Figure: a multicore node with several CPU–memory domains.]


Page 26: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Network alignment: what is the best way of matching graph A to B?

[Figure: example graphs A and B with vertices r, s, t, u, v, w.]


Page 27: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Network alignment

[Excerpt from a review article shown on the slide; Communications of the ACM, May 2012, Vol. 55, No. 5, p. 91:]

…subgraph under some mapping of the proteins between the two species) or inexact, allowing unmatched nodes on either subnetwork. This problem was first studied by Kelley et al. [17] in the context of local network alignment; its later development accompanied the growth in the number of mapped organisms [5,7,9,33]. The third problem that has been considered is global network alignment (Figure 1c), where one wishes to align whole networks, one against the other [4,34]. In its simplest form, the problem calls for identifying a 1-1 mapping between the proteins of two species so as to optimize some conservation criterion, such as the number of conserved interactions between the two networks.

All these problems are NP-hard as they generalize graph and subgraph isomorphism problems. However, heuristic, parameterized, and ILP approaches for solving them have worked remarkably well in practice. Here, we review these approaches and demonstrate their good performance in practice both in terms of solution quality and running time.

Heuristic Approaches. As in other applied fields, many problems in network biology are amenable to heuristic approaches that perform well in practice. Here, we highlight two such methods: a local search heuristic for local network alignment and an eigenvector-based heuristic for global network alignment.

NetworkBLAST [32] is an algorithm for local network alignment that aims to identify significant subnetwork matches across two or more networks. It searches for conserved paths and conserved dense clusters of interactions; we focus on the latter in our description. To facilitate the detection of conserved subnetworks, NetworkBLAST first forms a network alignment graph [17,23], in which nodes correspond to pairs of sequence-similar proteins, one from each species, and edges correspond to conserved interactions (see Figure 2). The definition of the latter is flexible and allows, for instance, a direct interaction between the proteins of one species versus an indirect interaction (via a common network neighbor) in the other species. Any subnetwork of the alignment graph naturally corre…

Figure 2. The NetworkBLAST local network alignment algorithm. Given two input networks, a network alignment graph is constructed. Nodes in this graph correspond to pairs of sequence-similar proteins, one from each species, and edges correspond to conserved interactions. A search algorithm identifies highly similar subnetworks that follow a prespecified interaction pattern. Adapted from Sharan and Ideker [30].

Figure 3. Performance comparison of computational approaches. (a) An evaluation of the quality of NetworkBLAST's output clusters. NetworkBLAST was applied to a yeast network from Yu et al. [39]. For every protein that served as a seed for an output cluster, the weight of this cluster was compared to the optimal weight of a cluster containing this protein, as computed using an ILP approach. The plot shows the % of protein seeds (y-axis) as a function of the deviation of the resulting clusters from the optimal attainable weight (x-axis). (b) A comparison of the running times of the dynamic programming (DP) and ILP approaches employed by Torque [7]. The % of protein complexes (queries, y-axis) that were completed in a given time (x-axis) is plotted for the two algorithms. The shift to the left of the ILP curve (red) compared with that of the dynamic programming curve (blue) indicates the ILP formulation tends to be faster than the dynamic programming implementation.

From Sharan and Ideker, Modeling cellular machinery through biological network comparison. Nat. Biotechnol. 24, 4 (Apr. 2006), 427–433.


Page 28: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Network alignment: what is the best way of matching graph A to B using only the edges in L?

[Figure: graphs A and B joined by the candidate-match graph L.]


Page 29: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Network alignment. Matching? A 1-1 relationship. Best? Highest weight and overlap.

[Figure: graphs A and B joined by L, with a matched pair of edges highlighted as overlap.]


Page 30: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Our contributions: a new belief propagation method (Bayati et al. 2009, 2013) that outperformed state-of-the-art PageRank- and optimization-based heuristic methods. High-performance C++ implementations (Khan et al. 2012): 40 times faster overall (C++ ~3x, complexity ~2x, threading ~8x); 5 million edge alignments in ~10 sec. www.cs.purdue.edu/~dgleich/codes/netalignmc


Page 31: Massive MapReduce Matrix Computations & Multicore Graph Algorithms


Page 32: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Each iteration involves: matrix-vector-ish computations with a sparse matrix (e.g., sparse matrix-vector products in a semiring, dot products, axpy, etc.) and a bipartite max-weight matching using a different weight vector at each iteration. No "convergence": 100–1000 iterations.

Let x[i] be the score for each pair-wise match in L

for i = 1 to ...
    update x[i] to y[i]
    compute a max-weight match with y
    update y[i] to x[i] (using match in MR)


Page 33: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

The methods. Each iteration involves: matrix-vector-ish computations with a sparse matrix (e.g., sparse matrix-vector products in a semiring, dot products, axpy, etc.) and a bipartite max-weight matching using a different weight vector at each iteration.

Belief Propagation

Listing 2. A belief-propagation message-passing procedure for network alignment. See the text for a description of othermax and the rounding heuristic.

 1  y(0) = 0, z(0) = 0, d(0) = 0, S(0) = 0
 2  for k = 1 to niter
 3      F = bound_{0,β}[ βS + S(k−1)ᵀ ]                Step 1: compute F
 4      d = αw + Fe                                    Step 2: compute d
 5      y(k) = d − othermaxcol(z(k−1))                 Step 3: othermax
 6      z(k) = d − othermaxrow(y(k−1))
 7      S(k) = diag(y(k) + z(k) − d) S − F             Step 4: update S
 8      (y(k), z(k), S(k)) ← γk (y(k), z(k), S(k))
 9                         + (1 − γk)(y(k−1), z(k−1), S(k−1))   Step 5: damping
10      round heuristic(y(k))                          Step 6: matching
11      round heuristic(z(k))
12  end
13  return y(k) or z(k) with the largest objective value

In this interpretation, the weight vectors are usually called messages as they communicate the "beliefs" of each "agent." In this particular problem, the neighborhood of an agent represents all of the other edges in graph L incident on the same vertex in graph A (1st vector), all edges in L incident on the same vertex in graph B (2nd vector), or the edges in L that are part of an overlap. The message vectors do not generally converge, and thus, the iteration is artificially damped to enforce convergence. We only describe one type of damping. See [13] for other variations.

After each update to the messages, we round the messages to a matching using a bipartite maximum weight matching procedure, and then evaluate the objective function.

We present a pseudo-code for the method in Figure 2. This code uses the mildly curious function othermaxrow. Suppose that g is a weight vector on the edges of a bipartite graph L. This means we can index g with the edges of L such that $g_{ii'}$ is the weight on the edge $(i, i') \in E_L$. The othermaxrow function then computes a new weight for each edge in L:

$$[\text{othermaxrow}(g)]_{ii'} = \text{bound}_{0,\infty}\Big[ \max_{(i,k') \in E_L,\; k' \neq i'} g_{ik'} \Big].$$

This function computes something rather simple. Given a row, replace all non-zeros in that row with the maximum value for the row; except, for the element that is the maximum value, replace it with the second largest value. The othermaxcol function works on columns instead of rows.
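A small Python sketch of othermaxrow based on my reading of the definition above (not the authors' code); g is stored as a dict from edges (i, i') to weights:

from collections import defaultdict

def othermaxrow(g):
    # For each edge, return the largest weight among the OTHER edges in
    # the same row, clipped into [0, inf); the row maximum thus gets the
    # second-largest value and every other edge gets the row maximum.
    rows = defaultdict(list)
    for (i, ip), w in g.items():
        rows[i].append((ip, w))
    out = {}
    for i, entries in rows.items():
        for ip, _ in entries:
            others = [w for (kp, w) in entries if kp != ip]
            out[(i, ip)] = max(0.0, max(others)) if others else 0.0
    return out

print(othermaxrow({(0, 0): 3.0, (0, 1): 5.0, (1, 0): 2.0}))
# {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 0.0}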

C. Stopping Criteria. Both algorithms generate a sequence of heuristic weight vectors whose solution quality varies continually. There is no monotonicity in the solution quality, which can vary greatly between iterations. Thus, no simple stopping criterion is possible. Due to the shrinking step length in Klau's method and the artificial damping in BP, there is also no point in running for more than 500–1000 iterations with reasonable choices of these two parameters.

D. Complexity. The complexity of each iteration of Klau's method and the BP method is $O(\text{nnz}(S) + |E_L| + \text{matching})$, where $O(\text{matching})$ is the complexity of the bipartite matching in step 5. Let $N = |V_A| + |V_B|$. Currently, the best known algorithm for computing an optimal edge-weighted matching has complexity $O(|E_L| N + N^2 \log N)$ [20]. Practical implementations have complexity $O(|E_L| N \log N)$ [21]. The half-approximate matching discussed below has complexity $O(|E_L|)$. Thus, when we replace the exact matching step with approximate matching for our experiments, the complexity of each iteration will be $O(\text{nnz}(S) + |E_L|)$.

IV. PARALLEL NETWORK ALIGNMENT IMPLEMENTATIONS

We now consider a shared-memory multi-core implementation of these procedures with OpenMP. All required memory is pre-allocated before the first iteration and there are no dynamic memory allocations. We avoid computing intermediate results whenever possible; see the online codes for details.

A. Matrix computations in both methods. All matrices are sparse and are stored as compressed sparse row arrays. All non-zero patterns and structures remain fixed throughout iterations. We found using simple OpenMP "parallel for" loops faster than using a matrix library such as Intel's Math Kernel Library. This result is due to the simplicity of the matrix computations. For instance, because S and U are structurally symmetric with the same structure, the transposes have the same row pointer and column index arrays, but the value array is permuted. So we compute the permutation, and whenever we need to transpose one of these matrices, we just permute the values array according to the permutation. Since these matrices do not change structure during the algorithm, we can compute the permutation once. Sometimes – such as line 5 of Klau's method or line 3 of BP – we simply use the permutation array to pull elements from appropriate memory locations without any intermediate write.

The matrix S can be highly imbalanced (some rows are empty and others have many non-zeros), so we found that using a dynamic schedule in OpenMP's "parallel for" construction yielded better performance than a static schedule. After some experimentation, we found that a chunk size of 1000 seemed to produce the best performance for these operations. Indeed, we found this observation to hold for all operations involving the matrix S. Synchronization only occurs at the end of each "parallel for" loop.

B. Specifics about Klau's method. In the first step of the iteration, we need to solve a bipartite matching problem for each row of the matrix S with weights that change based on U(k). We compute (β/2)S + U(k) − U(k)ᵀ using the permutation trick, and then we parallelize the operation over rows. Each of these matching problems is small because there are only a few non-zeros in each row of S, so we do not consider using the parallel approximation here. We precompute the maximum memory required for p threads…


Page 34: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

The NEW methods. Each iteration involves: matrix-vector-ish computations with a sparse matrix (e.g., sparse matrix-vector products in a semiring, dot products, axpy, etc.). Approximate bipartite max-weight matching is used here instead!

[This slide repeats Listing 2 and the paper excerpt from the previous slide; the change is that Step 6 now uses parallel approximate matching instead of exact matching.]

Page 35: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Approximation doesn’t hurt the belief propagation algorithm

…problem from Klau [7] (homo-musm). The first is an alignment between protein interactions in a fly (D. melanogaster) and yeast (S. cerevisiae). The second is an alignment between humans (H. sapiens) and mice (M. musculus). We utilize these problems solely for the instances of a network alignment problem and do not focus on the biological insights suggested. The graph L and associated weights are from the original papers.

C. Ontology alignment. We consider two problems in ontology alignment from [13]. The first is an alignment between the Library of Congress subject headings and Wikipedia categories (lcsh-wiki). While both ontologies have a core hierarchical tree, they also have many cross edges for other types of relationships. Thus we can think of them as general graphs. The second problem is an alignment between the Library of Congress subject headings and its counterpart in the French National Library: Rameau. In both cases, the edges and weights in L are computed via a text-matching of the subject heading strings (and via translated strings in the case of Rameau). These problems are larger than the bioinformatics ones.

VII. NETWORK ALIGNMENT WITH APPROXIMATE MATCHING

In this section we address the question: how does the behavior of Klau's method and the BP method change when we substitute the approximate matching procedure from Section V for the bipartite matching step in each algorithm? Note that we always use exact matching in the first step of Klau's method (Step 1: row match) because the problems in each row tend to be small and we parallelize over rows. Note also that the bipartite matching is much more integral to Klau's method than the BP procedure. For the BP procedure, we only solve a bipartite matching problem to evaluate the quality of an iterate, whereas in Klau's method, the results of the matching determine the update to the Lagrange multipliers U. Put another way, the set of iterates from the BP method is independent of the choice of matching algorithm. At the end of the iteration, each of the methods returns the best heuristic it computed, and we perform one final step of exact maximum weight matching to convert this into the returned matching.

We begin by evaluating the solution quality on synthetic power-law problems. We use α = 1, β = 2 and 1000 iterations of each method. We evaluate each solution by comparison with the identity alignment. Note that the identity alignment – which assigns each vertex in graph A to its mirror image in graph B based on the original graph G – may not be the optimal alignment because the perturbations to the graph could introduce a better solution. This seems to occur because we compute objective values larger than the identity alignment for d̄ > 10. We also study how many of the correct matches each method generates with respect to the identity matching. The results are shown in Figure 2. In the top plot, we show the fraction of the objective from the identity matching achieved (y-axis) as the expected degree d̄ of random edges in L

varies from 2 to 20 (x-axis). In the bottom plot, we show the fraction of correct matches (y-axis), again as the expected degree d̄ varies. Problems with more random edges are more challenging. The figures demonstrate that Klau's method is sensitive to using an approximate matching routine, whereas the BP method with exact and approximate matching are nearly indistinguishable.

We also evaluate how the matching weight ($w^T x$, plotted on the x-axis) and overlap ($x^T S x / 2$, plotted on the y-axis) change for a bioinformatics problem (dmela-scere) and an ontology problem (lcsh-wiki) in the upper and lower plots in Figure 3. Again, the BP results with and without approximate matching are virtually indistinguishable. Klau's method, however, produces results that are considerably worse.

[Figure 2: two panels plotting rounded objective values and the fraction of correct matches (y-axes) against the expected degree of noise in L (x-axis, 0–20); curves: MR-upper, MR, ApproxMR, BP, ApproxBP.]

Fig. 2. Alignment with a power-law graph shows the large effect that approximate rounding can have on solutions from Klau's method (MR). With that method, using exact rounding will yield the identity matching for all problems (bottom figure), whereas using the approximation results in over a 50% error rate. The results from the BP method with and without approximate matching are indistinguishable. Small differences were randomly added to show both lines in the figure.

Randomly perturb one power-law graph to get A and B. Generate L by the true match + random edges. BP and ApproxBP are indistinguishable. [Plot annotations: the x-axis is the amount of randomness in L in average expected degree; the y-axis is the fraction of correct matches.]


Page 36: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

A local dominating edge method for bipartite matching

[Figure: a bipartite candidate-edge graph L between graphs A and B.]

The method guarantees:
• a ½ approximation
• a maximal matching
Based on work by Preis (1999), Manne and Bisseling (2008), and Halappanavar et al. (2012).

A locally dominating edge is an edge heavier than all neighboring edges. For bipartite graphs, work on the smaller side only.


Page 37: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

A local dominating edge method for bipartite matching

[Figure: the same bipartite graph L.]

Queue all vertices.

Until the queue is empty, in parallel over vertices:
    match to the heaviest edge; if there's a conflict, check the winner and find an alternative for the loser;
    add the endpoints of non-dominating edges to the queue.

(A serial sketch of the idea follows below.)
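The serial sketch referenced above (my illustration; the talk's implementation is parallel C++/OpenMP). It assumes globally unique vertex names and distinct edge weights, so a locally dominant edge always exists:

def dominating_edge_matching(edges):
    # Half-approximate matching by locally dominant edges.
    # edges: dict mapping (u, v) -> weight.
    match = {}
    remaining = dict(edges)
    while remaining:
        # each vertex "points" at its heaviest remaining edge
        best = {}
        for (u, v), w in remaining.items():
            for x, y in ((u, v), (v, u)):
                if x not in best or w > best[x][0]:
                    best[x] = (w, y)
        # an edge is locally dominant when its endpoints point at each other
        for (u, v), w in list(remaining.items()):
            if best[u][1] == v and best[v][1] == u:
                match[u], match[v] = v, u
        # drop edges touching matched vertices; reconsider the rest
        remaining = {e: w for e, w in remaining.items()
                     if e[0] not in match and e[1] not in match}
    return match

print(dominating_edge_matching({('a', 'x'): 3.0, ('a', 'y'): 5.0, ('b', 'x'): 4.0}))
# {'a': 'y', 'y': 'a', 'b': 'x', 'x': 'b'}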


Page 38: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

A local dominating edge method for bipartite matching


Customized first iteration (with all vertices). Use OpenMP locks to update choices. Use __sync_fetch_and_add for queue updates.


Page 39: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Remaining multi-threading procedures are straightforward. Standard OpenMP for the matrix computations; use schedule(dynamic) to handle skew. We can batch the matching procedures in the BP method for additional parallelism:

for i = 1 to ...
    update x[i] to y[i]
    save y[i] in a buffer
    when the buffer is full, compute max-weight matches for everything in the buffer and save the best


Page 40: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Performance evaluation: 2×4 sockets of 10-core Intel E7-8870 at 2.4 GHz (80 cores); 16 GB memory per processor (128 GB total).

Scaling study:
1. Thread binding: scattered vs. compact
2. Memory binding: interleaved vs. bind

[Figure: eight CPU–memory domains.]


Page 41: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Scaling

[Figure: two speedup-versus-threads plots (up to 80 threads), under scatter thread binding and interleaved memory binding.]

BP with no batching on lcsh-rameau, 400 iterations: 1450 seconds on 1 thread; 115 seconds on 40 threads (about 12.6x).


Page 42: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Ongoing work

Better memory handling! numactl and affinity are insufficient for full scaling.

Better models! These get to be much bigger computations.

Distributed memory: trying to get an MPI version; looking into GraphLab.


Page 43: Massive MapReduce Matrix Computations & Multicore Graph Algorithms


PageRank details

[Figure: a six-node web graph, nodes 1–6.]

$$P = \begin{bmatrix} 1/6 & 1/2 & 0 & 0 & 0 & 0 \\ 1/6 & 0 & 0 & 1/3 & 0 & 0 \\ 1/6 & 1/2 & 0 & 1/3 & 0 & 0 \\ 1/6 & 0 & 1/2 & 0 & 0 & 0 \\ 1/6 & 0 & 1/2 & 1/3 & 0 & 1 \\ 1/6 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}, \qquad P_{ij} \ge 0, \quad e^T P = e^T.$$

The "jump": $v = [\,1/n\ \cdots\ 1/n\,]^T$, with $v \ge 0$ and $e^T v = 1$.

Markov chain: $\big[\alpha P + (1-\alpha)\, v e^T\big]\, x = x$ has a unique $x$ with $x_j \ge 0$ and $e^T x = 1$.

Linear system: $(I - \alpha P)\, x = (1-\alpha)\, v$. (Dangling nodes are ignored here and patched back to v; algorithms later.)
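As an illustration (mine, not the talk's prpack code), the linear-system form can be solved by the Richardson iteration x ← αPx + (1−α)v, using the six-node P above:

import numpy as np

# Column-stochastic P for the six-node example (column 1 is the dangling
# node, patched back to v = e/6).
P = np.array([
    [1/6, 1/2, 0,   0,   0, 0],
    [1/6, 0,   0,   1/3, 0, 0],
    [1/6, 1/2, 0,   1/3, 0, 0],
    [1/6, 0,   1/2, 0,   0, 0],
    [1/6, 0,   1/2, 1/3, 0, 1],
    [1/6, 0,   0,   0,   1, 0],
])
alpha, n = 0.85, 6
v = np.full(n, 1/n)

# Richardson iteration for (I - alpha P) x = (1 - alpha) v.
x = v.copy()
for _ in range(200):
    x_next = alpha * (P @ x) + (1 - alpha) * v
    if np.abs(x_next - x).sum() < 1e-12:
        break
    x = x_next
print(x, x.sum())   # PageRank vector; entries sum to 1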

PageRank by Google

[Figure: the same six-node graph.]

The model:
1. Follow edges uniformly with probability α, and
2. randomly jump with probability 1 − α; we'll assume everywhere is equally likely.

The places we find the surfer most often are important pages.

PageRank was created by Google to rank web pages.

Page 44: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Other uses for PageRank: what else people use PageRank to do

GeneRank

[Figure: a gene-expression heatmap; rows labeled by gene accession numbers.]

Use $(I - \alpha G D^{-1})\, x = w$ to find "nearby" important genes.

ProteinRank, ObjectRank, EventRank, IsoRank, clustering (graph partitioning), sports ranking, food webs, centrality, teaching.

Note: conjectured new papers: TweetRank (done, WSDM 2010), WaveRank, BeachRank, PaperRank, UniversityRank, LabRank. I think the last one involves a random scientist!

Morrison et al., GeneRank, 2005.


Which sensitivity?

$(I - \alpha P)\, x = (1 - \alpha)\, v$

Sensitivity to the links: examined and understood.
Sensitivity to the jump: examined, understood, and useful.
Sensitivity to α: less well understood.

Page 45: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Multicore PageRank

… similar story … Serialized preprocessing. Parallelize the linear algebra via an asynchronous Gauss-Seidel iterative method. ~10x scaling on the same (80-core) machine (1M nodes, 15M edges, synthetic).


Page 46: Massive MapReduce Matrix Computations & Multicore Graph Algorithms

Questions? Papers on my webpage: www.cs.purdue.edu/homes/dgleich

Codes:
github.com/arbenson/mrtsqr
www.cs.purdue.edu/homes/dgleich/codes/netalignmc
github.com/dgleich/prpack