
Talk Norway Aug2016


Page 1: Talk Norway Aug2016

Fast Convolutional Neural Networks for Graph-Structured Data

Xavier Bresson

Swiss Federal Institute of Technology (EPFL)

Joint work with Michaël Defferrard (EPFL) and Pierre Vandergheynst (EPFL)

Organizers: Xue-Cheng Tai, Egil Bae, Marius Lysaker, Alexander Vasiliev

Aug 29th 2016

Nanyang Technological University (NTU) from Dec. 2016

Page 2: Talk Norway Aug2016

Data Science and Artificial Neural Networks

• Data Science is an interdisciplinary field that aims at transforming raw data into meaningful knowledge to provide smart decisions for real-world problems.

• In the news:

Page 3: Talk Norway Aug2016

A Brief History of Artificial Neural Networks

[Timeline figure; recoverable milestones:]
• 1958: Perceptron (Rosenblatt)
• 1959: Primary visual cortex (Hubel & Wiesel)
• 1962: Birth of Data Science, split from Statistics (Tukey)
• 1982: Backprop (Werbos)
• 1987: First NIPS
• 1989: First KDD
• Neocognitron (Fukushima)
• 1995: SVM/kernel techniques (Vapnik)
• 1997: RNN (Schmidhuber)
• 1998: CNN (LeCun)
• 1999: First NVIDIA GPU
• 2006-2015: auto-encoders (LeCun, Hinton, Bengio), deep learning breakthrough (Hinton), first Amazon cloud center, "data scientist" becomes the 1st job in the US, Facebook AI center, OpenAI center
• 2010: Kaggle platform
• Eras: AI hope, AI "winter" [1960's-2012] (kernel techniques, handcrafted features, graphical models), AI resurgence
• Drivers: hardware (GPU speed doubles every year), Big Data (volume doubles every 1.5 years), software (Google AI: TensorFlow, Facebook AI: Torch)
• 4th industrial revolution? A digital intelligence revolution, or a new AI bubble?

Page 4: Talk Norway Aug2016

What happened in 2012?

• ImageNet [Fei-Fei et al.'09]: international image classification challenge with 1,000 object classes and 1,431,167 images.

• Before 2012: handcrafted filters (e.g. SIFT [Lowe, 1999], the most cited paper in computer vision, and Histograms of Oriented Gradients (HOG) [Dalal & Triggs, 2005]) + SVM classification.
  After 2012: filters learned with neural networks. The classification error was divided by 2!

[Figure credits: Fei-Fei et al.'09, Russakovsky et al.'14, Fei-Fei Li et al.'15, T.Q. Hoan'13, Karpathy & Johnson'16]

Page 5: Talk Norway Aug2016

Outline

Ø Data Science and ANN

Ø Convolutional Neural Networks
  - Why Are CNNs Good?
  - Local Stationarity and Multiscale Hierarchical Features

Ø Convolutional Neural Networks on Graphs
  - CNNs Only Process Euclidean-Structured Data
  - CNNs for Graph-Structured Data
  - Graph Spectral Theory for Convolution on Graphs
  - Balanced Cut Model for Graph Coarsening
  - Fast Graph Pooling

Ø Numerical Experiments

Ø Conclusion

Page 6: Talk Norway Aug2016

Outline

→ Next section: Convolutional Neural Networks (Why Are CNNs Good?, Local Stationarity and Multiscale Hierarchical Features)

Page 7: Talk Norway Aug2016

Convolutional Neural Networks

• CNNs [LeCun et al.'98] are very successful for computer vision tasks:
  - Image/object recognition [Krizhevsky-Sutskever-Hinton'12]
  - Image captioning [Karpathy-FeiFei'15]
  - Image inpainting [Pathak-Efros et al.'16]
  - Etc.

• CNNs are used by several (big) IT companies:
  - Facebook (Torch software)
  - Google (TensorFlow software, Google Brain, DeepMind)
  - Microsoft
  - Tesla (OpenAI)
  - Amazon (DSSTNE software)
  - Apple
  - IBM

Page 8: Talk Norway Aug2016

Why Are CNNs Good?

• CNNs are extremely efficient at extracting meaningful statistical patterns in large-scale, high-dimensional datasets.

• Key idea: learn local stationary structures and compose them to form multiscale hierarchical patterns.

• Why does this work? Proving the efficiency of CNNs is an open (mathematical) question.

Note: despite the lack of theory, the entire ML and CV communities have shifted to deep learning techniques! E.g. NIPS'16: 2,326 submissions, 328 on deep learning (14%), 90 on convex optimization (3.8%).

Page 9: Talk Norway Aug2016

Local Stationarity

• Assumption: data are locally stationary ⇒ similar local patches are shared across the data domain.

• How to extract local stationary patterns? Convolutional filters (filters with compact support).

[Figure: convolutional filters F1, F2, F3 applied to an input x, producing x * F1, x * F2, x * F3.]

Page 10: Talk Norway Aug2016

Multiscale Hierarchical Features

• Assumption: local stationary patterns can be composed to form more abstract, complex patterns.

• How to extract multiscale hierarchical patterns? Downsampling of the data domain (s.a. the image grid) with pooling (s.a. max, average).

• Other advantage: reduces the computational complexity while increasing the number of filters.

[Figure: 2x2 max pooling between Layer 1, Layer 2, Layer 3, Layer 4, ...; deep/hierarchical features go from simple to abstract. [Karpathy-Johnson'16]]

Page 11: Talk Norway Aug2016

Classification Function

• Classifier: after extracting multiscale locally stationary features, use them to design a classification function with the training labels.

• How to design a (linear) classifier? Fully connected neural networks.

[Figure: fully connected layer mapping the features to the output signal of K class labels (Class 1, Class 2, ..., Class K): x_class = W x_feat.]

Page 12: Talk Norway Aug2016

Full Architecture of CNNs

[Figure: standard CNN architecture. Input signal x (e.g. an image) → convolutional layers: convolutional filters F1, F2, F3 produce x * F1, x * F2, x * F3, followed by ReLU activation, grid downsampling and pooling; these layers extract local stationary features and compose them via downsampling and pooling → fully connected layers (classification function) → output signal y ∈ R^{n_c} (class labels).]

Page 13: Talk Norway Aug2016

CNNs Only Process Euclidean-Structured Data

• CNNs are designed for data lying on Euclidean spaces:
  (1) Convolution on Euclidean grids (FFT)
  (2) Downsampling on Euclidean grids
  (3) Pooling on Euclidean grids
  Everything is mathematically well defined and computationally fast!

• But not all data lie on Euclidean grids!

[Examples of Euclidean data: sound (1D), images (2D, 3D), videos (2+1D).]

Page 14: Talk Norway Aug2016

Non-Euclidean Data

• Examples of irregular/graph-structured data:
  (1) Social networks (Facebook, Twitter)
  (2) Biological networks (genes, brain connectivity)
  (3) Communication networks (wireless, traffic)

• Main challenges:
  (1) How to define convolution, downsampling and pooling on graphs?
  (2) And how to make them numerically fast?

• Current solution: map graph-structured data to regular/Euclidean grids, e.g. with kernel methods, and apply standard CNNs.
  Limitation: handcrafting the mapping is (in my opinion) against the CNN principle!

[Figure: graph/network-structured data: social networks, brain structure, telecommunication networks.]

Page 15: Talk Norway Aug2016

Outline

→ Next section: Convolutional Neural Networks on Graphs: CNNs for Graph-Structured Data

Page 16: Talk Norway Aug2016

CNNs for Graph-Structured Data

• Our contribution: generalizing CNNs to any graph-structured data with the same computational complexity as standard CNNs!

• What tools for this generalization?
  (1) Graph spectral theory for convolution on graphs,
  (2) Balanced cut model for graph coarsening,
  (3) Graph pooling with a binary-tree structure of the coarsened graphs.

Page 17: Talk Norway Aug2016

Related Works

• Categories of graph CNNs:
  (1) Spatial approach
  (2) Spectral (Fourier) approach

• Spatial approach:
  - Local receptive fields [Coates-Ng'11, Gregor-LeCun'10]: find compact groups of similar features, but convolution is not defined.
  - Locally Connected Networks [Bruna-Zaremba-Szlam-LeCun'13]: exploit the multiresolution structure of graphs, but convolution is not defined.
  - ShapeNet [Bronstein et al.'15,'16]: generalization of CNNs to 3D meshes. Convolution is well defined on these smooth low-dimensional non-Euclidean spaces. Handles multiple graphs. Obtained state-of-the-art results for 3D shape recognition.

• Spectral approach:
  - Deep Spectral Networks [Henaff-Bruna-LeCun'15]: related to this work. We will compare later.

Page 18: Talk Norway Aug2016

Outline

→ Next section: Graph Spectral Theory for Convolution on Graphs

Page 19: Talk Norway Aug2016

Convolution on Graphs 1/3

• Graphs: G = (V, E, W), with V the set of vertices, E the set of edges, W the similarity (weight) matrix, and |V| = n.

[Figure: a graph G with vertices i, j ∈ V, an edge e_ij ∈ E, and edge weight W_ij = 0.9.]

• Graph Laplacian (the core operator of spectral graph theory [1]): a 2nd-order derivative operator on graphs (see the sketch below):

    L = D - W                        (unnormalized)
    L = I_n - D^{-1/2} W D^{-1/2}    (normalized)

[1] Chung, 1997
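
A minimal numpy sketch of the two Laplacian variants above, on a hypothetical 3-vertex weighted graph (dense matrices for illustration only; real graphs would use sparse storage):

import numpy as np

def graph_laplacians(W):
    """Both Laplacian variants from a symmetric similarity matrix W (dense sketch)."""
    d = W.sum(axis=1)                                  # vertex degrees d_i = sum_j W_ij
    L_unnorm = np.diag(d) - W                          # L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard against isolated vertices
    L_norm = np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    return L_unnorm, L_norm                            # L = I_n - D^{-1/2} W D^{-1/2}

# Tiny example: a weighted triangle graph.
W = np.array([[0.0, 0.9, 0.2],
              [0.9, 0.0, 0.5],
              [0.2, 0.5, 0.0]])
L_unnorm, L_norm = graph_laplacians(W)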

Page 20: Talk Norway Aug2016

Convolution on Graphs 2/3

• Fourier transform on graphs [2]: L is a symmetric positive semidefinite matrix
  ⇒ it has a set of orthonormal eigenvectors {u_l}, known as the graph Fourier modes, associated with nonnegative eigenvalues {λ_l}, known as the graph frequencies.

The graph Fourier transform of f ∈ R^n is

    F_G f = f̂ = U^T f ∈ R^n,

whose value at frequency λ_l is

    f̂(λ_l) = f̂_l := ⟨f, u_l⟩ = Σ_{i=0}^{n-1} f(i) u_l(i).

The inverse GFT is defined as (see the sketch below)

    F_G^{-1} f̂ = U f̂ = U U^T f = f,

whose value at vertex i is

    (U f̂)(i) = Σ_{l=0}^{n-1} f̂_l u_l(i).

[2] Hammond, Vandergheynst, Gribonval, 2011
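
A hedged sketch of the GFT: diagonalize the symmetric Laplacian to obtain the Fourier modes U and frequencies λ_l, then transform and invert. The dense eigendecomposition is O(n^3) and only for illustration (it is exactly what the Chebyshev filters later in the talk avoid); the small graph here is hypothetical.

import numpy as np

n = 3
W = np.array([[0.0, 0.9, 0.2],
              [0.9, 0.0, 0.5],
              [0.2, 0.5, 0.0]])
L = np.diag(W.sum(1)) - W            # unnormalized Laplacian of the toy graph

lam, U = np.linalg.eigh(L)           # lam: graph frequencies lambda_l, U: Fourier modes u_l

f = np.random.randn(n)               # an arbitrary signal on the vertices
f_hat = U.T @ f                      # GFT: f_hat_l = <f, u_l>
f_rec = U @ f_hat                    # inverse GFT: U U^T f = f since U is orthonormal
assert np.allclose(f, f_rec)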

Page 21: Talk Norway Aug2016

Convolution on Graphs 3/3

• Convolution on graphs (in the Fourier domain) [2]:

    f *_G g = F_G^{-1}(F_G f ⊙ F_G g) ∈ R^n,

whose value at vertex i is

    (f *_G g)(i) = Σ_{l=0}^{n-1} f̂_l ĝ_l u_l(i).

It is also convenient to see that f *_G g = ĝ(L) f (see the sketch below), as

    f *_G g = U((U^T f) ⊙ (U^T g)) = U diag(ĝ(λ_0), ..., ĝ(λ_{n-1})) U^T f = U ĝ(Λ) U^T f = ĝ(L) f.

[2] Hammond, Vandergheynst, Gribonval, 2011
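
A small sketch of the two equivalent views of graph convolution above, on a hypothetical random graph; the spectral kernel ĝ(λ) = exp(-λ) (a heat kernel) is only an arbitrary concrete choice:

import numpy as np

n = 8
rng = np.random.default_rng(0)
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)   # random weighted graph
L = np.diag(W.sum(1)) - W
lam, U = np.linalg.eigh(L)

f = rng.standard_normal(n)
g_hat = np.exp(-lam)                             # spectral kernel g_hat(lambda), e.g. a heat kernel

conv_fourier = U @ (g_hat * (U.T @ f))           # f *_G g evaluated in the Fourier domain
conv_operator = (U @ np.diag(g_hat) @ U.T) @ f   # the same thing written as g_hat(L) f
assert np.allclose(conv_fourier, conv_operator)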

Page 22: Talk Norway Aug2016

Translation on Graphs 1/2

• Translation on graphs: a signal f defined on the graph can be translated to any vertex i using the graph convolution operator [3]:

    T_i f := f *_G δ_i,

where T_i is the graph translation operator with vertex shift i. The function f, translated to vertex i, has the following value at vertex j (see the sketch below):

    (T_i f)(j) = f(j -_G i) = (f *_G δ_i)(j) = Σ_{l=0}^{n-1} f̂_l u_l(i) u_l(j).

This formula is the graph counterpart of the continuous translation operator

    (T_s f)(x) = f(x - s) = (f * δ_s)(x) = ∫_R f̂(ξ) e^{-2πiξs} e^{2πiξx} dξ,

where f̂(ξ) = ⟨f, e^{2πiξx}⟩, and the e^{2πiξx} are the eigenfunctions of the continuous Laplace-Beltrami operator Δ, i.e. the continuous counterparts of the graph Fourier modes u_l.

[3] Shuman, Ricaud, Vandergheynst, 2016
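
A hedged check of the translation formula above on the same kind of hypothetical random graph: T_i f computed from (T_i f)(j) = Σ_l f̂_l u_l(i) u_l(j) agrees with the convolution f *_G δ_i.

import numpy as np

n = 8
rng = np.random.default_rng(0)
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(1)) - W
lam, U = np.linalg.eigh(L)
f = rng.standard_normal(n)

i = 3                                       # target vertex (arbitrary choice)
T_i_f = U @ ((U.T @ f) * U[i, :])           # (T_i f)(j) = sum_l f_hat_l u_l(i) u_l(j)

delta_i = np.zeros(n); delta_i[i] = 1.0     # same thing via the convolution f *_G delta_i
assert np.allclose(T_i_f, U @ ((U.T @ f) * (U.T @ delta_i)))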

Page 23: Talk Norway Aug2016

Translation on Graphs 2/2

• Note: translations on graphs are easier to carry out with the spectral approach than directly in the spatial/graph domain.

[Figure 1: Translated signals in the continuous R^2 domain (a-c: T_s f, T_s' f, T_s'' f) and in the graph domain (d-f: T_i f, T_i' f, T_i'' f). The component of the translated signal at the center vertex is highlighted in green. [Shuman-Ricaud-Vandergheynst'16]]

Page 24: Talk Norway Aug2016

Localized Filters on Graphs 1/3

• Localized convolutional kernels: as in standard CNNs, we must define localized filters on graphs.

• Laplacian polynomial kernels [2]: we consider a family of spectral kernels defined as

    ĝ(λ_l) = p_K(λ_l) := Σ_{k=0}^{K} a_k λ_l^k,     (1)

where p_K is a K-th order polynomial function of the Laplacian eigenvalues λ_l. This class of kernels defines spatially localized filters, as stated below (a small numerical check follows).

Theorem 1. Laplacian-based polynomial kernels (1) are K-localized in the sense that

    (T_i p_K)(j) = 0  if  d_G(i, j) > K,     (2)

where d_G(i, j) is the discrete geodesic distance on the graph, i.e. the length of the shortest path between vertices i and j.

[2] Hammond, Vandergheynst, Gribonval, 2011
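
A small numerical check of Theorem 1 in an illustrative, assumed setup (a path graph, where the geodesic distance is simply |i - j|): a K-th order polynomial in L applied to δ_i vanishes exactly outside the K-ball around vertex i.

import numpy as np

n, K, i = 10, 2, 4
W = np.zeros((n, n))
for v in range(n - 1):
    W[v, v + 1] = W[v + 1, v] = 1.0          # path graph 0-1-2-...-9
L = np.diag(W.sum(1)) - W

delta_i = np.zeros(n); delta_i[i] = 1.0
a = np.random.randn(K + 1)                   # arbitrary polynomial coefficients a_k
T_i_pK = sum(a[k] * (np.linalg.matrix_power(L, k) @ delta_i) for k in range(K + 1))

hops = np.abs(np.arange(n) - i)              # geodesic distance d_G(i, j) on a path graph
assert np.allclose(T_i_pK[hops > K], 0.0)    # zero outside the K-ball B_i^K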

Page 25: Talk Norway Aug2016

Localized Filters on Graphs 2/3

Corollary 1. Consider the function φ_ij such that

    φ_ij = (T_i p_K)(j) = (p_K *_G δ_i)(j) = (p_K(L) δ_i)(j) = (Σ_{k=0}^{K} a_k L^k δ_i)(j).

Then φ_ij = 0 if d_G(i, j) > K.

[Figure 2: Illustration of localized filters on graphs. The spatial profile of the polynomial filter is given by φ_ij; Laplacian-based polynomial kernels are exactly localized in the K-ball B_i^K centered at vertex i (B_i^K = support of the polynomial filter at vertex i).]

Page 26: Talk Norway Aug2016

Localized Filters on Graphs 3/3

Corollary 2. Localized filters on graphs are defined according to the principle:

    Frequency smoothness ⇔ Spatial graph localization.

This is Heisenberg's uncertainty principle extended to the graph setting; recent papers have studied the uncertainty principle on graphs.

Page 27: Talk Norway Aug2016

Fast Chebyshev Polynomial Kernels 1/2

• Graph filtering: let y be a signal x filtered by a Laplacian-based polynomial kernel:

    y = x *_G p_K = p_K(L) x = Σ_{k=0}^{K} a_k L^k x.

• The monomial basis {1, x, x², x³, ..., x^K} provides localized spatial filters, but does not form an orthogonal basis (e.g. ⟨1, x⟩ = ∫_0^1 1·x dx = x²/2 |_0^1 = 1/2 ≠ 0), which limits its ability to learn good spectral filters.

• Chebyshev polynomials: let T_k(x) be the Chebyshev polynomial of order k, generated by the fundamental recurrence T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x) with T_0 = 1 and T_1 = x. The Chebyshev basis {T_0, T_1, ..., T_K} forms an orthogonal basis on [-1, 1].

[Figure 3: the first six Chebyshev polynomials.]

Page 28: Talk Norway Aug2016

Fast Chebyshev Polynomial Kernels 2/2

• Graph filtering with Chebyshev polynomials [2]: the filtered signal y is defined as

    y = x *_G q_K = Σ_{k=0}^{K} θ_k T_k(L) x,

with the Chebyshev spectral kernel

    q_K(λ) = Σ_{k=0}^{K} θ_k T_k(λ).

• Fast filtering: denote X_k := T_k(L) x and rewrite y = Σ_{k=0}^{K} θ_k X_k. Then all the {X_k} are generated by the recurrence X_k = 2 L X_{k-1} - X_{k-2} (see the sketch below). As L is sparse, all matrix multiplications are between a sparse matrix and a vector. The computational complexity is O(|E| K), which reduces to linear complexity O(n) for k-NN graphs.

• GPU parallel implementation: these linear algebra operations can be done in parallel, allowing a fast GPU implementation of Chebyshev filtering.

[2] Hammond, Vandergheynst, Gribonval, 2011
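
A hedged sketch of the fast Chebyshev filtering with scipy sparse matrices. One assumption taken from the NIPS'16 paper (not spelled out on this slide): the Laplacian is rescaled to L_tilde = 2L/lambda_max - I so that its spectrum lies in [-1, 1], the interval on which the Chebyshev basis is orthogonal.

import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import eigsh

def chebyshev_filter(L, x, theta):
    """y = sum_k theta_k T_k(L_tilde) x via the sparse recurrence (assumes len(theta) >= 2)."""
    n = L.shape[0]
    lmax = eigsh(L, k=1, return_eigenvectors=False)[0]    # largest eigenvalue of L
    L_tilde = (2.0 / lmax) * L - sparse.identity(n, format='csr')

    X_prev, X = x, L_tilde @ x                            # X_0 = x,  X_1 = L_tilde x
    y = theta[0] * X_prev + theta[1] * X
    for k in range(2, len(theta)):
        X_next = 2 * (L_tilde @ X) - X_prev               # X_k = 2 L_tilde X_{k-1} - X_{k-2}
        y = y + theta[k] * X_next
        X_prev, X = X, X_next
    return y                                              # only sparse mat-vec products: O(K|E|)

# Tiny usage example on a 10-node path graph.
n = 10
W = sparse.diags([np.ones(n - 1), np.ones(n - 1)], offsets=[-1, 1], format='csr')
L = sparse.diags(np.asarray(W.sum(axis=1)).ravel()) - W
y = chebyshev_filter(L.tocsr(), np.random.randn(n), theta=np.array([0.5, 0.3, 0.2]))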

Page 29: Talk Norway Aug2016

Outline

→ Next section: Balanced Cut Model for Graph Coarsening

Page 30: Talk Norway Aug2016

Graph Coarsening

• Graph coarsening: as in standard CNNs, we must define a grid-coarsening process for graphs. It is essential for pooling similar features together.

• Graph partitioning: graph coarsening is equivalent to graph clustering, which is an NP-hard combinatorial problem.

[Figure 4: Illustration of graph coarsening: G^{l=0} = G → G^{l=1} → G^{l=2}, each arrow being a graph coarsening/clustering step.]

Page 31: Talk Norway Aug2016

Graph Partitioning

• Balanced cuts [4]: two powerful measures of graph clustering are the Normalized Cut and the Normalized Association, defined as

    Normalized Cut (partitioning by minimum edge cut):
        min_{C_1,...,C_K} Σ_{k=1}^{K} Cut(C_k, C_k^c) / Vol(C_k)

    Normalized Association (partitioning by maximum vertex matching):
        max_{C_1,...,C_K} Σ_{k=1}^{K} Assoc(C_k) / Vol(C_k)

where Cut(A, B) := Σ_{i∈A, j∈B} W_ij, Assoc(A) := Σ_{i∈A, j∈A} W_ij, Vol(A) := Σ_{i∈A} d_i, and d_i := Σ_{j∈V} W_ij is the degree of vertex i. The two problems are equivalent by complementarity.

[Figure 5: Equivalence between NCut and NAssoc partitioning.]

[4] Shi, Malik, 2000

Page 32: Talk Norway Aug2016

Graclus Graph Partitioning

• Graclus [5]: a greedy (fast) technique that computes clusters which locally maximize the Normalized Association (see the sketch below).

(P1) Vertex matching: for each unmatched vertex i, pick the unmatched neighbour

    j = argmax_j (W^l_ii + 2 W^l_ij + W^l_jj) / (d^l_i + d^l_j).

Matched vertices {i, j} are merged into a super-vertex at the next coarsening level.

(P2) Graph coarsening: the coarser graph G^{l+1} is given by

    W^{l+1}_ij = Cut(C^l_i, C^l_j),   W^{l+1}_ii = Assoc(C^l_i).

Partition energy at level l:

    Σ_{matched {i,j}} (W^l_ii + 2 W^l_ij + W^l_jj) / (d^l_i + d^l_j) = Σ_{k=1}^{K_l} Assoc(C^l_k) / Vol(C^l_k),

where C^l_k is a super-vertex computed by (P1), i.e. C^l_k := matched {i, j}; it is also a cluster containing at most 2^{l+1} original vertices.

[Figure 6: Graph coarsening with Graclus (G^{l-1} → G^l → G^{l+1} → G^{l+2}). Graclus proceeds by two successive steps, (P1) vertex matching and (P2) graph coarsening, which provide a local solution to the Normalized Association clustering problem at each coarsening level l.]

[5] Dhillon, Guan, Kulis, 2007
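
A hedged, dense-matrix sketch of one Graclus coarsening level following the formulas above: greedy vertex matching by the local normalized association (P1), then construction of the coarser weight matrix (P2). The real implementation visits vertices in random order and uses sparse structures; this is illustration only.

import numpy as np

def graclus_one_level(W):
    """One level of greedy vertex matching (Graclus-style sketch) on a dense W."""
    n = W.shape[0]
    d = W.sum(axis=1)
    matched = -np.ones(n, dtype=int)          # matched[i] = cluster id of vertex i
    clusters = []
    for i in range(n):
        if matched[i] >= 0:
            continue
        best_j, best_score = -1, -np.inf
        for j in range(n):
            if j != i and matched[j] < 0 and W[i, j] > 0:
                # local normalized association of the candidate pair {i, j}
                score = (W[i, i] + 2 * W[i, j] + W[j, j]) / (d[i] + d[j])
                if score > best_score:
                    best_j, best_score = j, score
        members = [i] if best_j < 0 else [i, best_j]   # singleton if no free neighbour
        for v in members:
            matched[v] = len(clusters)
        clusters.append(members)
    # (P2) Coarser graph: W^{l+1}_{kk'} = Cut(C_k, C_k'), W^{l+1}_{kk} = Assoc(C_k).
    m = len(clusters)
    W_coarse = np.zeros((m, m))
    for k, Ck in enumerate(clusters):
        for k2, Ck2 in enumerate(clusters):
            W_coarse[k, k2] = W[np.ix_(Ck, Ck2)].sum()
    return W_coarse, matched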

Page 33: Talk Norway Aug2016

Outline

→ Next section: Fast Graph Pooling

Page 34: Talk Norway Aug2016

Fast Graph Pooling 1/2

• Graph pooling: as in standard CNNs, we must define a pooling operation such as max or average pooling. This operation is performed many times during the optimization task.

• Unstructured pooling is inefficient: the graph and its coarsened versions indexed by Graclus require storing a table of all matched vertices ⇒ memory-consuming and not (easily) parallelizable.

• Structured pooling is as efficient as 1D grid pooling: start from the coarsest level, then propagate the ordering to the next finer level such that node k has nodes 2k and 2k+1 as children ⇒ a binary-tree arrangement of the nodes in which adjacent nodes are hierarchically merged at the next coarser level (see the sketch below).
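
A minimal sketch of the resulting pooling, assuming the vertices have already been re-indexed so that node k of G^{l+1} has children 2k and 2k+1 in G^l (with fake disconnected vertices, set to -inf for max pooling, padded in where a node has a single child): graph pooling becomes plain 1D pooling.

import numpy as np

def graph_max_pool(x, p=2):
    """1D max pooling of a re-indexed graph signal x (length divisible by p)."""
    return x.reshape(-1, p).max(axis=1)

# Signal on the 8 re-indexed vertices of G^{l=0}; pooling twice reaches G^{l=2}.
x = np.array([3., 1., 7., 2., 5., 4., 0., 6.])
print(graph_max_pool(x))                  # signal on G^{l=1}: [3. 7. 5. 6.]
print(graph_max_pool(graph_max_pool(x)))  # signal on G^{l=2}: [7. 6.]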

Page 35: Talk Norway Aug2016

Fast Graph Pooling 2/2

[Figure 7: Fast graph pooling using the graph coarsening structure. Left: unstructured arrangement of the vertices after coarsening (G^{l=0} = G → G^{l=1} → G^{l=2}), which requires an index table of matched vertices for pooling. Right: after re-indexing the vertices w.r.t. the coarsening structure, matched vertices are adjacent (node k at level l+1 has children 2k and 2k+1 at level l). This binary-tree arrangement of the vertices allows very efficient pooling on graphs, as fast as pooling on a regular 1D Euclidean grid.]

Page 36: Talk Norway Aug2016

Full Architecture of Graph CNNs

[Figure 8: Architecture of the proposed CNNs on graphs. Input signal x ∈ R^n on a graph G = G^{l=0} (e.g. a social, biological, or telecommunication graph; the coarsened graphs G^{l=1}, G^{l=2}, ... and the pooling structure are pre-computed) → graph convolutional layers: spectral filters g_θ^{K_l} with O(K) parameters and O(|E|·K) operations (GPUs), ReLU activation, graph coarsening by a factor 2^{p_l} with pooling; these layers extract multiscale local stationary features on graphs → fully connected layers → output signal y ∈ R^{n_c} (class labels).]

Notation: l is the coarsening level, x^l are the downsampled signals at layer l (e.g. x^{l=1} ∈ R^{n_1 F_1} with n_1 = n_0 / 2^{p_1}, x^{l=5} ∈ R^{n_5 F_5}), G^l is the coarser graph, g_θ^{K_l} are the spectral filters at layer l, x^l_g are the filtered signals (e.g. x^{l=0}_g ∈ R^{n_0 F_1}), p_l is the coarsening exponent, n_c is the number of classes, y is the output signal, and θ^l are the parameters to learn at layer l (e.g. θ^{l=1} ∈ R^{K_1 F_1}, θ^{l=5} ∈ R^{K_5 F_1...F_5}, θ^{l=6} ∈ R^{n_5 n_c}).

Page 37: Talk Norway Aug2016

Optimization

• Backpropagation [6] = the chain rule applied to the neurons at each layer, with

    y_j = Σ_{i=1}^{F_in} g_{θ_ij}(L) x_i.

Loss function (cross-entropy over the labeled samples S), see the sketch below:

    E = - Σ_{s∈S} l_s log y_s.

Gradient descents:

    θ^{n+1}_ij = θ^n_ij - τ ∂E/∂θ_ij,
    x^{n+1}_i = x^n_i - τ ∂E/∂x_i.

Local gradients:

    ∂E/∂θ_ij = (∂E/∂y_j)(∂y_j/∂θ_ij) = Σ_{s∈S} [X_{0,s}, ..., X_{K-1,s}]^T ∂E/∂y_{j,s},
    ∂E/∂x_i = Σ_j (∂E/∂y_j)(∂y_j/∂x_i) = Σ_{j=1}^{F_out} g_{θ_ij}(L) ∂E/∂y_j.

[Figure: local and accumulative computations in backpropagation between the input features x_i, the parameters θ_ij, and the output features y_j.]

[6] Werbos, 1982, and Rumelhart, Hinton, Williams, 1985
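
A small, generic sketch of the two ingredients on this slide: the cross-entropy loss E = -Σ_s l_s log y_s and the gradient-descent update. The gradients themselves would come from backpropagation, i.e. the chain-rule formulas above; names and shapes here are illustrative assumptions.

import numpy as np

def cross_entropy(y, labels):
    """y: (S, n_c) softmax outputs; labels: (S,) integer class labels."""
    return -np.sum(np.log(y[np.arange(len(labels)), labels] + 1e-12))

def gradient_step(theta, grad_theta, tau=0.1):
    """One gradient-descent update: theta <- theta - tau * dE/dtheta."""
    return theta - tau * grad_theta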

Page 38: Talk Norway Aug2016

Outline

→ Next section: Numerical Experiments

Page 39: Talk Norway Aug2016

Numerical Experiments

• Platforms and hardware: all experiments are carried out with
  (1) TensorFlow (Google AI software) [7]
  (2) an NVIDIA K40c GPU (CUDA)

• Types of experiments:
  (1) Euclidean CNNs
  (2) Non-Euclidean CNNs

• Code will be released soon!

[7] Abadi et al., 2015

Page 40: Talk Norway Aug2016

Revisiting Euclidean CNNs 1/2

• Sanity check: MNIST is the most popular dataset in deep learning [8]. It contains 70,000 images of handwritten digits (0 to 9), each represented on a 2D grid of size 28x28 (data dimension = 28² = 784).

• Graph: a k-NN graph (k = 8) of the Euclidean grid, with edge weights (see the sketch below)

    W_ij = exp(-‖x_i - x_j‖₂² / σ).

[8] LeCun, Bottou, Bengio, 1998
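
A hedged sketch of this graph construction, assuming the vertices are the 784 pixels with their 2D grid coordinates as the features x_i, and taking σ as the mean squared distance to the k nearest neighbours (a common heuristic; the paper may scale σ differently):

import numpy as np

def knn_graph(coords, k=8, sigma=None):
    """Symmetric k-NN similarity matrix with Gaussian weights exp(-||x_i - x_j||^2 / sigma)."""
    n = coords.shape[0]
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(dist2, axis=1)[:, 1:k + 1]           # k nearest neighbours (skip self)
    if sigma is None:
        sigma = dist2[np.arange(n)[:, None], idx].mean()   # heuristic bandwidth (assumption)
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, idx.ravel()] = np.exp(-dist2[rows, idx.ravel()] / sigma)
    return np.maximum(W, W.T)                              # symmetrize

# Example: the 28x28 MNIST grid, giving a 784 x 784 similarity matrix W.
ys, xs = np.meshgrid(np.arange(28), np.arange(28), indexing='ij')
coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
W = knn_graph(coords, k=8)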

Page 41: Talk Norway Aug2016

Revisiting Euclidean CNNs 2/2

• Results: classification accuracies

    Algorithm                                      Accuracy (%)
    Linear SVM                                     91.76
    Softmax                                        92.36
    CNNs [LeNet5]                                  99.33
    graph CNNs: CN32-P4-CN64-P4-FC512-softmax      99.18

• CPU vs. GPU: a GPU implementation of graph CNNs is 8x faster than a CPU implementation, the same order as for standard CNNs. Graph CNNs only use matrix multiplications, which are efficiently implemented in CUDA BLAS.

    Architecture                          Time      Speedup
    CNNs with CPU                         210 ms    -
    CNNs with GPU NVIDIA K40c             31 ms     6.77x
    graph CNNs with CPU                   200 ms    -
    graph CNNs with GPU NVIDIA K40c       25 ms     8.00x

Page 42: Talk Norway Aug2016

Non-Euclidean CNNs 1/2

• Text categorization with 20NEWS: a benchmark dataset introduced at CMU [9]. It contains 20,000 text documents across 20 topics (data dimension = 33,000, the number of words in the dictionary).

[Table 1: the 20 topics of 20NEWS, with example documents from the topics Auto and Medicine.]

[9] Lang, 1995

Page 43: Talk Norway Aug2016

Non-Euclidean CNNs 2/2

• Results: classification accuracies (word2vec features)

    Algorithm                         Accuracy (%)
    Linear SVM                        65.90
    Multinomial NB                    68.51
    Softmax                           66.28
    FC + softmax + dropout            64.64
    FC + FC + softmax + dropout       65.76
    graph CNNs: CN32-softmax          68.26

• Influence of graph quality:

    Architecture    G1 = learned        G2 = normalized      G3 = pre-trained    G4 = ANN    G5 = random
                    graph [word2vec]    word count graph     graph               graph       graph
    CN32-softmax    68.26               67.50                66.98               67.86       67.75

  ⇒ The recognition rate depends on the graph quality.

Page 44: Talk Norway Aug2016

Comparison: Our Model vs. [Henaff-Bruna-LeCun'15]

• Advantages over the (inspiring) pioneering work [HBL15]:

(1) Computational complexity O(n) vs. O(n²): [HBL15] filters in the Fourier domain,

    y = x *_G h_K = ĥ_K(L) x = U(ĥ_K(Λ)(U^T x)),

where the ĥ_K are spectral filters based on K-th order splines, Λ and U are the eigenvalue and eigenvector matrices of the graph Laplacian L, and U^T, U are full O(n²) matrices.

(2) No eigenvalue decomposition, which costs O(n³), unlike [HBL15].

(3) Accuracy (MNIST):

    Architecture                       Accuracy (%)
    CN10 [HBL15]                       97.26
    CN10 [Ours]                        97.48
    CN32-P4-CN64-P4-FC512 [HBL15]      97.75
    CN32-P4-CN64-P4-FC512 [Ours]       99.18

(4) Faster learning.

Page 45: Talk Norway Aug2016

Outline

→ Next section: Conclusion

Page 46: Talk Norway Aug2016

Conclusion

• Proposed contributions:
(1) Generalization of CNNs to non-Euclidean domains / graph-structured data
(2) Localized filters on graphs
(3) Same learning complexity as CNNs while being universal to any graph
(4) GPU implementation

• Future applications:
(i) Social networks (Facebook, Twitter)
(ii) Biological networks (genes, brain connectivity)
(iii) Communication networks (Internet, wireless, traffic)

• Paper (accepted at NIPS'16): M. Defferrard, X. Bresson, P. Vandergheynst, "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering", arXiv:1606.09375, 2016

Page 47: Talk Norway Aug2016

Thank you.