Graph Partitioning using Bayesian Inference on GPU
Carl Yang, Steven Dalton, Maxim Naumov, Michael Garland, Aydın Buluç, John D. Owens
UC Davis / NVIDIA internship
March 26, 2018

Carl Yang, Steven Dalton, Maxim Naumov, Michael Garland, Aydın Buluç, John D. Owens (NVIDIA) — Final Presentation, March 26, 2018
Overview
1 Introduction
2 Stochastic Block Model
3 Bayesian inference for graph partitioning
4 Parallelization strategy
5 Experiments
Problem: How can we break this graph up into smaller pieces so we can understand it?
Problem definition

Problem 1
Can MCMC be sped up by using a GPU?
Problem 2
How is convergence affected?
Problem 3
Is MCMC a scalable solution to the graph clustering problem?
Related work
Minimum-cut method
Hierarchical clustering
Girvan–Newman algorithm
Modularity maximization
Clique-based methods
Generative models
Idea
Before thinking of how to partition, we should come up with a model that generates what we are looking for.
Want:
The parameters should describe block structure in a graph.
The parameter values are unknown, but can be inferred from the data and the current state in a principled, statistical way.
Stochastic Block Model (SBM)

Holland, Laskey, and Leinhardt. "Stochastic blockmodels: First steps." Social Networks 5.2 (1983)

Parameters: η_i → probability a node belongs to block i
            M_rs → probability an edge exists between block r and block s

Rules for placing N nodes in B blocks:
1. Sample b_i ~ Cat(η) to obtain each node's colour.
2. Sample e_ij ~ Poisson(M) to determine which two blocks r and s the edge connects.
3. Sample i ~ Uniform(n_r) and j ~ Uniform(n_s) to get two nodes in blocks r and s respectively for edge e_ij.
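The three sampling rules above can be sketched in Python. This is a minimal illustrative sketch, not the authors' implementation; the helper names (`sample_poisson`, `generate_sbm`) are mine, and I sample each block pair's edge count independently, which is one common reading of rule 2.

```python
import math
import random
from collections import defaultdict

def sample_poisson(lam, rng):
    # Knuth's algorithm: multiply uniforms until the product drops below exp(-lam).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def generate_sbm(N, eta, M, seed=0):
    """Generate a multigraph from the SBM rules above.

    eta: length-B list, eta[r] = probability a node lands in block r.
    M:   B x B matrix, M[r][s] = expected edge count between blocks r and s.
    Returns (b, edges): b[i] is node i's block, edges is a list of (i, j).
    """
    rng = random.Random(seed)
    B = len(eta)
    # Rule 1: b_i ~ Cat(eta).
    b = [rng.choices(range(B), weights=eta)[0] for _ in range(N)]
    blocks = defaultdict(list)
    for i, r in enumerate(b):
        blocks[r].append(i)
    edges = []
    for r in range(B):
        for s in range(B):
            if not blocks[r] or not blocks[s]:
                continue
            # Rule 2: number of edges between blocks r and s ~ Poisson(M_rs).
            for _ in range(sample_poisson(M[r][s], rng)):
                # Rule 3: endpoints drawn uniformly within each block.
                edges.append((rng.choice(blocks[r]), rng.choice(blocks[s])))
    return b, edges
```

With η = [0.5, 0.5] and a diagonal-heavy M, most sampled edges land inside blocks, which is exactly the block structure the model is meant to describe.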
Formulate clustering as an exact recovery problem

1. Given G and b^(t), find M^(t).
2. Given G and M^(t), find arg max_b P(b | G, M). This becomes b^(t+1).
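Step 1 of this loop is just counting: given the graph and a partition b, M is recovered by tallying edges between each pair of blocks. A minimal sketch (the function name is mine):

```python
def count_interblock_edges(edges, b, B):
    """Step 1 of the recovery loop: given edges of G and partition b,
    recover M by counting edges between each ordered pair of blocks."""
    M = [[0] * B for _ in range(B)]
    for i, j in edges:
        M[b[i]][b[j]] += 1
    return M
```

Step 2, the arg max over b, is the hard part and is what the MCMC machinery in the following slides addresses.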
Bayesian inference

We want to find the partition b that maximizes:

    P(b | G, M) = P(G | b, M) P(b, M) / P(G)

Taking negative logs of both sides, we want to minimize Σ:

    Σ = −log P(G | b, M) − log P(b, M) + log P(G)
           (S)               (L)          (constant)

S is the amount of information required to describe the graph when the model is known. L is the amount of information required to describe the model.
Computing terms

S can be found by counting the number of configurations Ω of the graph. The fewer configurations, the better our model fits the graph:

    S = log(1/Ω) = log( ∏_rs M_rs! / (∏_r k_r^+! ∏_r k_r^-!) )^(−1)

L can be found by counting:

    L = log((B N)) + log N! − Σ_r log n_r!   (b term)
        + log((B² E))                        (M term)

Design decision: Ignore L for now in the prototype, but leave room for it to be added in the future.
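The S term above can be evaluated in log space with `lgamma` (log n! = lgamma(n + 1)), which avoids overflowing the factorials. A sketch under my own assumptions: I take k_r^+ and k_r^- to be block r's total out- and in-edge counts (row and column sums of M), and the function name is mine.

```python
import math

def description_length_S(M):
    """S = log( prod_rs M_rs! / (prod_r k_r^+! prod_r k_r^-!) )^(-1),
    computed in log space via lgamma(n + 1) = log n!.
    Assumption: k_r^+ / k_r^- are block r's total out/in edge counts."""
    B = len(M)
    k_out = [sum(M[r][s] for s in range(B)) for r in range(B)]  # k_r^+
    k_in = [sum(M[r][s] for r in range(B)) for s in range(B)]   # k_r^-
    log_num = sum(math.lgamma(M[r][s] + 1) for r in range(B) for s in range(B))
    log_den = (sum(math.lgamma(k + 1) for k in k_out)
               + sum(math.lgamma(k + 1) for k in k_in))
    # The ^(-1) inside the log flips the sign of (log_num - log_den).
    return -(log_num - log_den)
```

A strongly block-diagonal M yields fewer configurations, hence a smaller S than a uniform M with the same edge total, which matches the "fewer configurations = better fit" intuition above.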
Intuition
Combinatorial optimization problem
So we want to find the partition b such that Σ is minimized.

However, for a graph of B blocks and N nodes, there are B^N possible partitions b for which we would need to compute that quantity.

We need an efficient way to traverse this large state space.
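To see why exhaustive search is hopeless, here is a toy brute-force minimizer over all B^N partitions (names and objective are hypothetical): it is fine for a handful of nodes and astronomically infeasible beyond that.

```python
from itertools import product

def brute_force_best(N, B, objective):
    """Exhaustively score all B^N partitions b; only feasible for toy N."""
    return min(product(range(B), repeat=N), key=objective)

def num_partitions(B, N):
    """Size of the state space the search must traverse."""
    return B ** N
```

Even a modest B = 4, N = 50 gives more than 10^30 candidate partitions, which is why the deck turns to MCMC sampling next.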
MCMC sampling
1 Propose move.
2 Calculate move acceptance probability.
3 Commit move.
Upside: The stationary distribution will converge to the probability distribution we are trying to find.
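The propose / accept / commit loop above can be sketched as one generic step. This is my own minimal sketch, assuming a symmetric proposal (so the proposal ratio cancels) and the exp(−β ΔΣ) acceptance rule used later in the deck; all names are hypothetical.

```python
import math
import random

def mcmc_step(state, propose, delta_obj, beta, rng):
    """One propose -> acceptance-probability -> commit step.

    delta_obj(state, move) returns the change in the objective (Sigma);
    moves that lower it are always accepted, others with prob exp(-beta*dS).
    Assumes a symmetric proposal distribution."""
    move = propose(state, rng)                 # 1. Propose move.
    dS = delta_obj(state, move)                # 2. Change in objective.
    p_accept = min(1.0, math.exp(-beta * dS))  #    Acceptance probability.
    if rng.random() < p_accept:                # 3. Commit (or reject) move.
        state = move
    return state
```

Running many such steps produces a chain whose stationary distribution is the target distribution, which is the "upside" noted above.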
Merge phase
Nodal (MCMC) phase
MCMC sampling applied to solve graph partitioning

Merge phase:
1. Propose move.
2. Calculate change in objective function.
3. Get the block move that improves the objective function the most.
4. Commit move.
5. Go to 1) until n_blocks_initial / r blocks are left.

MCMC phase:
1. Propose move.
2. Calculate change in objective function.
3. Calculate move acceptance probability.
4. Commit move.
5. Go to 1) until the MCMC chain has converged.

Alternate Merge phase and MCMC phase until the target cluster count has been reached.
1. Propose move
A counter-based RNG allows O(1) skip-ahead for each thread.

This allows independent random numbers to be generated within a device function.
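The idea behind a counter-based RNG (Philox and friends) is that the i-th random number is a pure function of (key, counter = i), so any thread can jump straight to its own position in the stream without iterating shared state. A CPU sketch of the concept using SHA-256 as the mixing function (not the actual Philox algorithm, and not the deck's implementation):

```python
import hashlib

def counter_rng(key, counter):
    """Counter-based RNG sketch: output depends only on (key, counter),
    so skip-ahead to any counter value is O(1) with no shared state.
    Real GPU implementations use a cheap block cipher (e.g. Philox)
    instead of SHA-256."""
    msg = key.to_bytes(8, "little") + counter.to_bytes(8, "little")
    h = hashlib.sha256(msg).digest()
    return int.from_bytes(h[:8], "little") / 2**64  # uniform in [0, 1)
```

Each GPU thread t would use counter = base + t, giving every thread an independent, reproducible value inside a device function, exactly the property claimed above.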
2. Calculate objective function

Problem: How do we compute the objective function as if we had already made the move, but without actually changing our graph?

Key insight: The merge move and the node move can both be expressed as the simultaneous element-wise addition of rows and columns of a matrix.
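The row/column insight above can be sketched on the interblock matrix M. This is my own illustration, not the deck's kernel: `d_out[t]` / `d_in[t]` are hypothetical helper arrays holding the moving node's edge counts to and from block t, and self-block edges (edges within the node's old or new block) would need extra care that this sketch omits.

```python
def apply_node_move(M, d_out, d_in, r, s):
    """Move one node from block r to block s by element-wise addition
    on rows r,s and columns r,s of the interblock edge-count matrix M.
    d_out[t]/d_in[t]: the node's edge counts to/from block t (assumed given).
    Caveat: self-block edge bookkeeping is omitted in this sketch."""
    B = len(M)
    for t in range(B):
        M[r][t] -= d_out[t]; M[s][t] += d_out[t]   # node's out-edges: rows
        M[t][r] -= d_in[t];  M[t][s] += d_in[t]    # node's in-edges: columns
    return M
```

Because the update is pure row/column addition, the objective can be evaluated on the hypothetically moved M without ever touching the graph itself, which is exactly what the slide's "as if we had already made the move" requires.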
We have a graph
How do we express, in matrix notation, node 1 being moved from blue to yellow?
Elementwise, move node 1's out-edge contribution from blue to yellow.
Elementwise, move node 1's in-edge contribution from blue to yellow.
Move complete
2. Calculate objective function
For sparse matrices, elementwise addition is equivalent to performing a set union.

A warp-wide sorting network allows us to do set unions in register memory.
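The equivalence above is easy to see on one sparse row stored as sorted (column_index, value) pairs: elementwise addition is a merge (set union) of the two index lists. A sequential CPU sketch of what the warp-wide sorting network computes in registers (the function name is mine):

```python
def sparse_add(a, b):
    """Element-wise addition of two sparse rows, each a sorted list of
    (column_index, value) pairs: a set union on the index sets, summing
    values where an index appears in both rows."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        elif a[i][0] > b[j][0]:
            out.append(b[j]); j += 1
        else:  # shared column index: values add
            out.append((a[i][0], a[i][1] + b[j][1])); i += 1; j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

On the GPU, a warp sorts the concatenated index lists with a sorting network and reduces equal neighbours, achieving the same union entirely in register memory.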
3. Commit move
A triple matrix product is used to update the model between the Merge and MCMC phases.

Hypothesis 1: Committing merge moves in parallel does not affect the convergence rate.

Hypothesis 2: Committing MCMC moves in parallel does not affect the convergence rate.
Parallelization summary
                       CPU Seq (ref.)  CPU Par  GPU Seq  GPU Par (ours)
Merge  Propose move    par             par      par      par
Merge  Calculate obj   par             par      par      par
Merge  Commit move     seq             seq      seq      par
MCMC   Propose move    seq             par      par      par
MCMC   Calculate obj   seq             par      par      par
MCMC   Commit move     seq             seq      par      par
Experimental Setup

Hardware:
CPU: Intel Core i7-5820K @ 3.30 GHz, 32 GB RAM
GPU: Titan Xp, 12 GB RAM

Datasets: Synthetic datasets with ground-truth partitions for each node.

Nodes   50    100   1K    5K     20K    50K   500K
Edges   319   6K    20K   102K   409K   1M    10M
Speedup comparison
[Figure: Speedup comparison across the four implementations (CPU Seq, CPU Par, GPU Seq, GPU Par); speedup (0-18x) vs. number of nodes (50-500000).]
Runtime breakdown
[Figure: Runtime breakdown (Build, Merge, MCMC) for the four implementations (CPU Seq, CPU Par, GPU Seq, GPU Par) at 50-50000 nodes, normalized runtime.]
Rate of convergence
[Figure: Change in objective function plotted against number of moves (0-2000000) for GPU, CPU Seq, and CPU Par.]
Rate of convergence (in runtime)
[Figure: Change in objective function plotted against runtime in seconds (0-250) for GPU, CPU Seq, and CPU Par.]
Raw runtime numbers and accuracy
        CPU Seq            CPU Par            GPU Seq            GPU Par
Nodes   Time (s)  Acc (%)  Time (s)  Acc (%)  Time (s)  Acc (%)  Time (s)  Acc (%)
50      0.519     100      0.519     100      0.0876    100      0.0603    100
100     0.802     100      0.531     82       0.2249    100      0.1779    100
1000    5.193     81.41    0.939     100      3.153     100      1.5649    100
5000    16.443    90       2.255     81.7     27.093    92.943   3.113     87.6
20000   118.201   94.6     29.97     93.93    51.519    96.5     7.671     88.5
50000   272.249   89.8     97.68     87.15    2902.4    97.6     23.707    89.2
Takeaways
It is surprisingly easy to make MCMC converge.
However, it’s a different story to make MCMC scalable.
Future work
Use a specialized triple matrix product kernel to take advantage of knowledge about the matrix structure.

Use load-balancing methods such as TWC to handle unbalanced data.

Try newer Bayesian inference methods such as minibatch MCMC and ADVI (automatic differentiation variational inference), which claim to scale better with data size than standard MCMC.
Add multi-GPU support.
Questions?
Stochastic Block Model (SBM)

Holland, Laskey, and Leinhardt. "Stochastic blockmodels: First steps." Social Networks 5.2 (1983)

Given N nodes in B blocks:

State: b_i → block node i belongs to
Parameters: η_i → probability a node belongs in block i
            λ_rs → probability an edge exists between block r and block s

1. Sample each node i.i.d. over η_i to obtain each node's colour.
2. Sample each edge i.i.d. over Poi(λ_rs) to obtain the blocks r and s that it connects. For each edge, sample one node in block r with probability 1/n_r and one node in block s with probability 1/n_s to determine which two nodes the edge connects.
Stochastic Block Model (SBM)

Holland, Laskey, and Leinhardt. "Stochastic blockmodels: First steps." Social Networks 5.2 (1983)

Given N nodes in B blocks:

State: b_i → block node i belongs to
Parameters: η_i → probability a node belongs in block i
            λ_rs → probability an edge exists between block r and block s

The probability of generating a graph G and partition b given parameters η, λ, assuming a Bernoulli edge distribution, is:

    P(G | b, M) = ∏_i η_{b_i} · ∏_{i<j} λ_{b_i b_j}^{A_ij} (1 − λ_{b_i b_j})^{1 − A_ij}
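The Bernoulli generation probability above is straightforward to evaluate in log space for a small undirected graph. A minimal sketch (function name mine; assumes A is a symmetric 0/1 adjacency matrix and all λ entries lie strictly between 0 and 1, or the corresponding term is never hit):

```python
import math

def log_likelihood(A, b, eta, lam):
    """log P(G, b | eta, lambda) for the Bernoulli SBM above:
    sum_i log eta_{b_i} + sum_{i<j} [A_ij log lam_{b_i b_j}
                                     + (1 - A_ij) log(1 - lam_{b_i b_j})]."""
    N = len(A)
    ll = sum(math.log(eta[b[i]]) for i in range(N))
    for i in range(N):
        for j in range(i + 1, N):
            p = lam[b[i]][b[j]]
            ll += math.log(p) if A[i][j] else math.log(1 - p)
    return ll
```

Maximizing this quantity over b (or, in the Bayesian variant, the posterior built from it) is exactly the inference target of the main slides.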
Variant of SBM we will use
Non-parametric: use a Bayesian formulation instead of maximum likelihood.

This solves the over-fitting problem.

Degree-corrected: add additional parameters k_i for every node i, representing its propensity for high degree.

This accounts for the power-law degree distribution that many real-world graphs exhibit.
Expression
Taking negative logs of both sides:

    −log P(b | G, M) = −log P(G | b, M) − log P(b, M) + log P(G)
                          (S)               (L)          (constant)
Sequential MCMC for graph partitioning
Input: b: N × 1 current block assignment vector, M: B × B interblock edge count matrix, A: N × N adjacency matrix

1: procedure MCMCSequential(b, M, A)
2:   for node i do
3:     Propose random move for i: block r → s
4:     Acceptance probability:
5:       p_accept = min[ exp(−β ΔS) · p_{s→r} / p_{r→s}, 1 ]
6:     Perform move by updating b, M
Generative models
Idea
Before thinking of how to partition, we should come up with a model of what we are looking for.

The parameters should describe block structure.

The parameter values are unknown, but can be inferred from the data and the current state in a principled, statistical way.
Generative models: Sketch of algorithm
Given data G and an initial guess of the partition b^(0), we can compute M^(1) and b^(1):

1. Compute model parameters M^(1) using G and b^(0).
2. Make a better guess for partition b^(1) using Bayesian inference:

    arg max_b P(b | G, M) = arg max_b P(G | b, M) P(b, M) / P(G)
Computing terms
S can be found by counting the number of configurations Ω of the graph. The fewer configurations, the better our model fits the graph:

    S = 1/Ω = ( ∏_rs M_rs! / (∏_r k_r^+! ∏_r k_r^-!) )^(−1)

L can be found by counting:

    L = log((B N)) + log N! − Σ_r log n_r!   (b term)
        + log((B² E))                        (M term)
Variable-at-a-time Metropolis-Hastings
Algorithm 1 Sequential MCMC.

Input: b^0: N × 1 state vector initialized randomly
Output: b^T: N × 1 vector distributed as the stationary distribution
1: for iteration t = 1, 2, ... do
2:   for node i = 1, 2, ..., N do
3:     Propose: b_i^(cand) ~ q(b_i^t | b^(t−1))
4:     Acceptance probability:
         α = min( [q(b_i^(t−1) | b_i^(cand)) π(b_i^(cand))] / [q(b_i^(cand) | b_i^(t−1)) π(b_i^(t−1))], 1 )
5:     u ~ Uniform(0, 1)
6:     if u < α then
7:       Accept proposal: b_i^t ← b_i^(cand)
8:     else
9:       Reject proposal: b_i^t ← b_i^(t−1)
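Algorithm 1 can be made runnable for a toy target. A minimal sketch (names mine): the proposal is a uniform, symmetric resample of one coordinate, so the q-ratio in α cancels and only the π-ratio remains; `pi(b)` is any unnormalized target over block vectors.

```python
import random

def mh_sweeps(pi, B, N, T, seed=0):
    """Variable-at-a-time Metropolis-Hastings over block vectors b in {0..B-1}^N.
    Symmetric uniform proposal per coordinate, so alpha = min(pi(cand)/pi(b), 1).
    Runs T full sweeps over the N coordinates and returns the final state."""
    rng = random.Random(seed)
    b = [rng.randrange(B) for _ in range(N)]   # b^0 initialized randomly
    for _ in range(T):
        for i in range(N):
            cand = b.copy()
            cand[i] = rng.randrange(B)         # propose b_i^(cand)
            alpha = min(1.0, pi(cand) / pi(b)) # acceptance probability
            if rng.random() < alpha:           # u ~ Uniform(0,1); accept if u < alpha
                b = cand
    return b
```

With a target that rewards homogeneous block vectors, the chain drifts toward states where most coordinates agree, mirroring how the graph-partitioning chain drifts toward low-Σ partitions.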
Where SBM fits into machine learning
Hidden Markov Model
Latent Variable Model
Variational auto-encoders