Graph mining 2: Statistical approaches for graph mining

Nathalie Villa-Vialaneix

nathalie.villa@toulouse.inra.fr

http://www.nathalievilla.org

Advanced mathematics for network analysis

Luchon, May 3rd 2016

Talk map...

Who am I? Statistician working in biostatistics at INRA Toulouse. My research interests are: data mining, network inference and mining, machine learning.

Purpose of this talk: presenting a few statistical tools for graph mining (graph structure, important vertices) and clustering.

Background

Unless otherwise stated, G is:

- an undirected and connected graph;

- with vertices $V = \{x_1, \dots, x_n\}$;

- with set of edges $E$;

- possibly with (positive and symmetric) weights on edges, $w_{ij}$ (such that $w_{ii} = 0$, no self-loop);

- with adjacency matrix $A = (w_{ij})_{i,j=1,\dots,n}$.

Examples are made with:

- the toy example “Les Misérables” (co-appearance network in Hugo’s novel);

[Figure: the “Les Misérables” co-appearance network, with vertices labeled by character name (Myriel, Valjean, Fantine, Cosette, Javert, Marius, Gavroche, Enjolras, ...).]

- the R software, and especially the R package igraph;

- the full script and the dataset, which are available on my website at: http://www.nathalievilla.org/teaching/toconet.html

Basic description of the graph

lesmis

## IGRAPH U--- 77 254 --
## + attr: layout (g/n), id (v/n), label (v/c), value (e/n)
## + edges:
##  [1]  1-- 2  1-- 3  1-- 4  3-- 4  1-- 5  1-- 6  1-- 7  1-- 8  1-- 9  1--10
## [11] 11--12  4--12  3--12  1--12 12--13 12--14 12--15 12--16 17--18 17--19
## [21] 18--19 17--20 18--20 19--20 17--21 18--21 19--21 20--21 17--22 18--22
## [31] 19--22 20--22 21--22 17--23 18--23 19--23 20--23 21--23 22--23 17--24
## [41] 18--24 19--24 20--24 21--24 22--24 23--24 13--24 12--24 24--25 12--25
## [51] 25--26 24--26 12--26 25--27 12--27 17--27 26--27 12--28 24--28 26--28
## [61] 25--28 27--28 12--29 28--29 24--30 28--30 12--30 24--31 31--32 12--32
## [71] 24--32 28--32 12--33 12--34 28--34 12--35 30--35 12--36 35--36 30--36
## + ... omitted several edges

U--- means: Undirected, not Named (no name attribute for the vertices), not Weighted (no weight attribute for the edges) and not Bipartite.

System information

## R version 3.2.5 (2016-04-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] igraph_1.0.1 knitr_1.12.3
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5   formatR_1.3    tools_3.2.5    stringi_1.0-1
## [5] highr_0.5.1    stringr_1.0.0  evaluate_0.8.3

Outline

Numerical characteristics

Clustering

- Modularity optimization

- Spectral clustering

- Model based clustering

Sketch of this section

Issue at stake:

- a graph is given;

- numerical characteristics describing the graph and its nodes are a standard approach to describe it;

- how can we tell whether the observed values are unexpected according to a so-called “null model”?

Standard (global) characteristics

- density: $|E| / (n(n-1)/2)$ (graph.density);

- number of triangles: triangles (see also motifs);

- transitivity: ratio of closed triplets to connected triplets, i.e. three times the number of triangles divided by the number of triplets with at least two edges counted once per center vertex (transitivity);

- diameter: length of the longest shortest path between two nodes (diameter);

- radius: minimal length, over all vertices in the graph, of the longest shortest path linking this vertex to another vertex (radius);

- girth: length of the shortest cycle in the graph (girth);

- cohesion: minimum number of vertices to remove to disconnect the graph.
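The characteristics listed above can be computed in a few lines with igraph. The block below is only a sketch: since the lesmis object comes from the author's script, it uses Zachary's karate club graph (shipped with igraph) as a stand-in, and the names g and global_stats are illustrative assumptions. It also uses the current igraph function names (edge_density, count_triangles, vertex_connectivity) rather than the ones shown on the slides.

```r
library(igraph)

# Stand-in graph (assumption): Zachary's karate club, shipped with igraph.
# Replace g by the graph under study (e.g. lesmis).
g <- make_graph("Zachary")

global_stats <- c(
  density      = edge_density(g),                 # |E| / (n(n-1)/2)
  triangles    = sum(count_triangles(g)) / 3,     # each triangle is seen from its 3 vertices
  transitivity = transitivity(g, type = "global"),
  diameter     = diameter(g),
  radius       = radius(g),
  girth        = girth(g)$girth,                  # length of the shortest cycle
  cohesion     = vertex_connectivity(g)           # min. number of vertices to remove
)
print(global_stats)
```

Collecting the values in a single named vector makes it easy to apply the same computation later to randomly generated graphs.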

Standard (global) characteristics for “Les Misérables”

graph.density(lesmis); triangles(lesmis); length(triangles(lesmis))/3

## [1] 0.08680793
## + 1401/77 vertices:
##   [1] 12  1  3 12  1  4 12  3  4 12 24 32 12 24 13 12 24 25 12 24 30 12 25
##  [24] 71 12 25 70 12 25 69 12 25 27 12 26 24 12 26 25 12 26 27 12 26 72 12
##  [47] 26 71 12 26 70 12 26 69 12 27 73 12 27 52 12 27 50 12 27 44 12 28 73
##  [70] 12 28 24 12 28 25 12 28 26 12 28 27 12 28 29 12 28 30 12 28 32 12 28
##  [93] 34 12 28 44 12 28 72 12 28 59 12 28 69 12 28 70 12 28 71 12 29 45 12
## [116] 30 39 12 30 38 12 30 37 12 30 35 12 30 36 12 35 39 12 35 38 12 35 36
## [139] 12 35 37 12 36 39 12 36 38 12 36 37 12 37 39 12 37 38 12 38 39 12 49
## [162] 26 12 49 28 12 49 56 12 49 59 12 49 65 12 49 69 12 49 70 12 49 72 12
## [185] 50 52 12 56 26 12 56 27 12 56 65 12 56 50 12 56 52 12 56 59 12 59 71
## [208] 12 59 65 12 69 72 12 69 71 12 69 70 12 70 72 12 70 71 12 71 72 49 26
## + ... omitted several vertices
## [1] 467

transitivity(lesmis); diameter(lesmis); radius(lesmis); girth(lesmis)

## [1] 0.4989316
## [1] 5
## [1] 3
## $girth
## [1] 3
##
## $circle
## + 3/77 vertices:
## [1] 3 1 4

Comparison with random graphs...

Erdős-Rényi model with the same number of nodes and the same number of edges as the original graph (uniform probability to observe an edge between two given nodes).

Method: compare the observed values with those of a large number of randomly generated graphs (with no loop; only connected graphs are kept):

sample_gnm(vcount(lesmis), ecount(lesmis))
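The comparison method described above could be sketched as follows. This is not the author's script: the stand-in graph (Zachary's karate club), the number of replicates B = 200, the seed and the helper graph_stats are assumptions made for illustration, and only three of the characteristics are tracked to keep the example short.

```r
library(igraph)
set.seed(1)

g <- make_graph("Zachary")   # stand-in graph (assumption); use lesmis in practice
B <- 200                     # number of connected random graphs to keep (assumption)

# Hypothetical helper: the characteristics compared between graphs
graph_stats <- function(graph) {
  c(triangles    = sum(count_triangles(graph)) / 3,
    transitivity = transitivity(graph, type = "global"),
    diameter     = diameter(graph))
}

random_stats <- matrix(NA_real_, nrow = B, ncol = 3,
                       dimnames = list(NULL, names(graph_stats(g))))
kept <- 0
while (kept < B) {
  # Erdős-Rényi G(n, m): same numbers of nodes and edges, no loop
  r <- sample_gnm(vcount(g), ecount(g))
  if (is_connected(r)) {     # only connected graphs are kept
    kept <- kept + 1
    random_stats[kept, ] <- graph_stats(r)
  }
}

summary(random_stats)        # distribution under the null model
graph_stats(g)               # observed values, to be compared with the summary
```

Rejecting disconnected graphs means the loop runs for more than B iterations; for sparse graphs the rejection rate can be high.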

Results of the comparison with random graphs...

For B = 500 graphs (only connected graphs are kept), we have:

##     density          triangles      transitivity        diameter
##  Min.   :0.08681   Min.   :31.00   Min.   :0.05834   Min.   :4.000
##  1st Qu.:0.08681   1st Qu.:43.00   1st Qu.:0.07907   1st Qu.:4.000
##  Median :0.08681   Median :47.00   Median :0.08701   Median :5.000
##  Mean   :0.08681   Mean   :47.55   Mean   :0.08660   Mean   :4.627
##  3rd Qu.:0.08681   3rd Qu.:52.00   3rd Qu.:0.09415   3rd Qu.:5.000
##  Max.   :0.08681   Max.   :67.00   Max.   :0.11793   Max.   :6.000
##      radius          girth       cohesion
##  Min.   :3.000   Min.   :3   Min.   :1.000
##  1st Qu.:3.000   1st Qu.:3   1st Qu.:1.000
##  Median :3.000   Median :3   Median :2.000
##  Mean   :3.004   Mean   :3   Mean   :1.599
##  3rd Qu.:3.000   3rd Qu.:3   3rd Qu.:2.000
##  Max.   :4.000   Max.   :3   Max.   :3.000

compared to: 0.0868079, 467, 0.4989316, 5, 3, 3, 1

⇒ all values are standard except for:

- the number of triangles and the transitivity, which are larger: local connectivity is stronger than expected in Erdős-Rényi random graphs;

- the cohesion, which lies among the lowest values expected in Erdős-Rényi random graphs: this again indicates a stronger local connectivity.

Standard (local) characteristics

... for the vertex $x_i$:

- degree: $\left|\{x_j : (x_i, x_j) \in E,\ j \neq i\}\right|$ (degree), or strength for the weighted version, $\sum_{j \neq i} w_{ij}$;

- betweenness (or betweenness centrality): number of shortest paths between any pair of vertices in the graph which pass through $x_i$ (betweenness);

- eccentricity: maximal length of all the shortest paths going from $x_i$ to any other vertex in the graph (eccentricity);

- closeness (or closeness centrality): $\frac{1}{\sum_{j \neq i} d(x_i, x_j)}$, in which $d(x_i, x_j)$ is the length of the shortest path between $x_i$ and $x_j$ (closeness);

...and their distributions among all vertices.
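As an illustration of the vertex-level characteristics defined above, the sketch below gathers them in a data frame and summarizes their distributions. The graph is again a stand-in (Zachary's karate club) and the name local_stats is an assumption; the last lines check the closeness definition directly against the shortest-path distances.

```r
library(igraph)

g <- make_graph("Zachary")   # stand-in graph (assumption); use lesmis in practice

# One row per vertex, one column per local characteristic
local_stats <- data.frame(
  degree       = degree(g),
  betweenness  = betweenness(g),
  eccentricity = eccentricity(g),
  closeness    = closeness(g)
)

summary(local_stats)     # distributions among all vertices
colMeans(local_stats)    # averages, as used when comparing with random graphs

# Closeness recomputed from its definition: 1 / sum over j != i of d(x_i, x_j)
# (the diagonal of the distance matrix is 0, so rowSums skips j = i implicitly)
closeness_by_hand <- 1 / rowSums(distances(g))
all.equal(as.numeric(local_stats$closeness), as.numeric(closeness_by_hand))
```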

Standard (local) characteristics for “Les misérables”

summary(degree(lesmis))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   6.000   6.597  10.000  36.000

summary(betweenness(lesmis))

summary(eccentricity(lesmis))

summary(closeness(lesmis))

Method: compare the observed values (average betweenness and degree) with those of a large number of randomly generated graphs (with no loop; only connected graphs are kept):

sample_gnm(vcount(lesmis), ecount(lesmis))

Results of the comparison with random graphs...

For B = 500 graphs (only connected graphs are kept), we have:

##      degree       betweenness     eccentricity     closeness
##  Min.   :6.597   Min.   :54.64   Min.   :3.597   Min.   :0.005249
##  1st Qu.:6.597   1st Qu.:55.93   1st Qu.:3.779   1st Qu.:0.005322
##  Median :6.597   Median :56.32   Median :3.857   Median :0.005340
##  Mean   :6.597   Mean   :56.36   Mean   :3.863   Mean   :0.005340
##  3rd Qu.:6.597   3rd Qu.:56.71   3rd Qu.:3.909   3rd Qu.:0.005361
##  Max.   :6.597   Max.   :58.79   Max.   :4.688   Max.   :0.005430

compared to: 6.597, 62.364, 4.13, 0.00512

⇒ the observed average betweenness is higher, and the observed average closeness is smaller, than in all the randomly generated graphs: this seems to indicate that, on average, shortest paths in the graph are longer than expected for graphs with a uniform distribution of the edges.

Degree distribution for “Les Misérables”

[Figure: degree distribution of the “Les Misérables” graph; horizontal axis: log(k).]

Estimation of power law fit (left: α = 1.49) with

fit_power_law(degree(lesmis) + 1, implementation = "R.mle")

Scale-free model with a parameter for the power law identical to the one previously estimated and the same number of nodes. The Barabási and Albert model is used with a number of edges added at each step chosen so that the final number of edges resembles that of the original graph (3 edges, which gives 225 edges in the final graph, compared to 254).

$\mathbb{P}(\text{degree} = k) \propto k^{-\alpha}$

Method: compare the observed values with those of a large number of randomly generated graphs:

sample_pa(vcount(lesmis), m = 3, power = ..., directed = FALSE)
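To see why adding 3 edges per step yields fewer edges than the 254 of the original graph, one preferential attachment graph with 77 vertices can be generated and inspected. This is a sketch under assumptions: the slides elide the value passed to power, so the argument is left at igraph's default here, and the seed and the object name pa are illustrative.

```r
library(igraph)
set.seed(1)

n <- 77   # number of vertices of the "Les Misérables" graph, as stated in the slides

# Barabási-Albert graph with 3 edges added per step.
# Assumption: `power` is left at igraph's default because its value is elided in the slides.
pa <- sample_pa(n, m = 3, directed = FALSE)

ecount(pa)         # the first vertices cannot receive 3 edges each, hence fewer than 3 * n edges
edge_density(pa)   # density of the generated graph
summary(degree(pa))
```

Because the number of nodes and m are fixed, the number of edges, and therefore the density and the average degree, are the same for every generated graph, which is consistent with the constant density and degree columns in the results.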

Results of the comparison with random graphs...

For B = 500 graphs, we have:

##     density         triangles     transitivity       diameter
##  Min.   :0.0769   Min.   : 72   Min.   :0.1075   Min.   :3.000
##  1st Qu.:0.0769   1st Qu.:102   1st Qu.:0.1250   1st Qu.:4.000
##  Median :0.0769   Median :112   Median :0.1307   Median :4.000
##  Mean   :0.0769   Mean   :113   Mean   :0.1303   Mean   :3.988
##  3rd Qu.:0.0769   3rd Qu.:124   3rd Qu.:0.1359   3rd Qu.:4.000
##  Max.   :0.0769   Max.   :153   Max.   :0.1530   Max.   :5.000
##      radius          girth      cohesion      degree       betweenness
##  Min.   :2.000   Min.   :3   Min.   :3   Min.   :5.844   Min.   :41.86
##  1st Qu.:2.000   1st Qu.:3   1st Qu.:3   1st Qu.:5.844   1st Qu.:47.88
##  Median :2.000   Median :3   Median :3   Median :5.844   Median :49.55
##  Mean   :2.314   Mean   :3   Mean   :3   Mean   :5.844   Mean   :49.35
##  3rd Qu.:3.000   3rd Qu.:3   3rd Qu.:3   3rd Qu.:5.844   3rd Qu.:50.97
##  Max.   :3.000   Max.   :3   Max.   :3   Max.   :5.844   Max.   :55.73
##   eccentricity     closeness
##  Min.   :2.935   Min.   :0.005407
##  1st Qu.:3.130   1st Qu.:0.005695
##  Median :3.221   Median :0.005788
##  Mean   :3.234   Mean   :0.005805
##  3rd Qu.:3.325   3rd Qu.:0.005901
##  Max.   :3.662   Max.   :0.006334

compared to: 0.087, 467, 0.499, 5, 3, 3, 1, 6.597, 62.364, 4.13, 0.00512

⇒ the number of triangles, the transitivity, the radius, the average degree, the average betweenness and the eccentricity are larger than in power law graphs with power 1.495, whereas the cohesion and the closeness are smaller.

Limits of the previous approaches

Until now, we have compared the real graph to graphs randomly generated according to a given random model but:

- this approach only gives information about global characteristics of the observed graph;

- none of the distributions of the current characteristics is preserved during the process, especially not the degree distribution, which is central for controlling local/global connectivity, counts of specific patterns...

A null model closer to the real graph...

Sketch of statistical tests on graphs

1. sample at random within the set of graphs with the same degree distribution as the observed graph (B times);

2. compute a numerical statistic for each of these randomly generated graphs;

3. by comparing the observed value of the statistic with its distribution over the random graphs, a p-value can be derived (for B large enough).

Two main approaches to sample at random with fixed degrees:

- configuration model [Bender and Canfield, 1978];

- permutation approach [Rao et al., 1996, Roberts Jr., 2000].

Sampling at random within the set of graphs with a given degree distribution

Aim:

- all graphs can exhaustively be sampled;

- all graphs have the same probability to be sampled.

⇒ MCMC approach

Method:

Start from the observed graph $G$
for $t = 1 \to T$ do
    Select uniformly at random two edges $e^1 = (x_i^1, x_j^1)$ and $e^2 = (x_i^2, x_j^2) \in E$
    $E' \leftarrow E \setminus \{e^1, e^2\} \cup \{e_s^1, e_s^2\}$ with $e_s^1 = (x_i^1, x_j^2)$ and $e_s^2 = (x_i^2, x_j^1)$
    if $G' = (V, E')$ is simple and connected then
        $G \leftarrow G'$
    end if
end for
return $G$
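The edge-switching procedure above can be written directly in R. The function below is a minimal sketch, not the author's code: the name switch_edges, the argument n_iter (playing the role of T) and the stand-in graph are assumptions. One detail is added that the pseudocode does not state: since igraph stores each undirected edge with an arbitrary orientation, the second edge is flipped at random so that both possible switches can occur.

```r
library(igraph)

# Hypothetical helper implementing the edge-switching MCMC sketched above
switch_edges <- function(graph, n_iter = 100) {
  for (t in seq_len(n_iter)) {
    el <- as_edgelist(graph, names = FALSE)
    picked <- sample(nrow(el), 2)          # two distinct edges, uniformly at random
    e1 <- el[picked[1], ]
    e2 <- el[picked[2], ]
    if (runif(1) < 0.5) e2 <- rev(e2)      # random orientation (assumption, see lead-in)

    # Switch: (a, b), (c, d) become (a, d), (c, b)
    candidate <- delete_edges(graph, picked)
    candidate <- add_edges(candidate, c(e1[1], e2[2], e2[1], e1[2]))

    # Accept only if the new graph is simple (no loop, no multi-edge) and connected
    if (is_simple(candidate) && is_connected(candidate)) {
      graph <- candidate
    }
  }
  graph
}

set.seed(42)
g <- make_graph("Zachary")                 # stand-in graph (assumption)
shuffled <- switch_edges(g, n_iter = 200)
all(degree(shuffled) == degree(g))         # each vertex keeps its degree
```

Each switch removes one edge end and adds one edge end at every vertex involved, which is why the degree of each vertex is preserved.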

In practice...

This method is used in [Milo et al., 2004] with T = 100. It can be performed using

rewire(lesmis, keeping_degseq(niter = 100))

[Figure: two panels, “Number of triangles” (axis ticks 200, 300, 400) and “transitivity” (axis ticks 0.25, 0.35, 0.45).]
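The test procedure with a degree-preserving null model can be sketched with rewire. Several choices below are assumptions made for illustration: the stand-in graph, B = 200 replicates, the number of switches per replicate, and the "+1" correction in the empirical p-value, which is a common convention and not something stated in the slides. Note also that rewire with keeping_degseq preserves the degrees but does not enforce connectivity.

```r
library(igraph)
set.seed(1)

g <- make_graph("Zachary")   # stand-in graph (assumption); use lesmis in practice
B <- 200                     # number of random graphs (assumption)

observed <- transitivity(g, type = "global")

# Statistic computed on B degree-preserving randomizations of g
null_values <- replicate(B, {
  r <- rewire(g, keeping_degseq(niter = 10 * ecount(g)))
  transitivity(r, type = "global")
})

# One-sided empirical p-value: how often is the null value at least as large
# as the observed one? The "+1" terms are an assumed, commonly used correction.
p_upper <- (1 + sum(null_values >= observed)) / (B + 1)

summary(null_values)
p_upper
```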

In practice... for the vertex characteristics

For each vertex, find an empirical p-value indicating whether its betweenness is higher or lower than expected given its degree, by comparing the observed betweenness with the betweennesses of the corresponding vertex in the random graphs: a vertex is highlighted when its observed betweenness is higher (resp. lower) than 95% of these values.
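A vertex-level version of the test could look like the sketch below. It relies on the fact that degree-preserving rewiring keeps vertex identities, so row i of the simulated matrix always refers to the same vertex with the same degree. The stand-in graph, B, the number of switches, the 5% threshold applied to the empirical p-values and the "+1" correction are assumptions made for illustration.

```r
library(igraph)
set.seed(1)

g <- make_graph("Zachary")   # stand-in graph (assumption); use lesmis in practice
B <- 200                     # number of random graphs (assumption)

observed <- betweenness(g)

# vcount(g) x B matrix: row i holds the betweenness of vertex i in each random graph
null_btw <- replicate(B, {
  betweenness(rewire(g, keeping_degseq(niter = 10 * ecount(g))))
})

# Empirical p-values per vertex (the comparison recycles `observed` down the rows)
p_high <- (1 + rowSums(null_btw >= observed)) / (B + 1)
p_low  <- (1 + rowSums(null_btw <= observed)) / (B + 1)

which(p_high <= 0.05)   # vertices with a betweenness higher than expected
which(p_low  <= 0.05)   # vertices with a betweenness lower than expected
```

Since one test is run per vertex, a multiple-testing correction may be worth considering before interpreting individual vertices.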

[Figure: the “Les Misérables” network in which only the following vertices are labeled: Myriel, Valjean, Listolier, Fameuil, Blacheville, Favourite, Dahlia, Zephine, Fantine, Judge, Champmathieu, Brevet, Chenildieu, Cochepaille, LtGillenormand, Marius, Combeferre, Prouvaire, Feuilly, Courfeyrac, Bahorel, Joly, Grantaire, Gueulemer, Babet, Claquesous, Montparnasse, Brujon, MmeHucheloup.]
