Link Prediction and Topological Feature Importance in ... · ers results regarding the model performance on each dataset, and the analysis of feature importance. We conclude in Section

Link Prediction and Topological Feature Importance in SocialNetworks

Stephan A. Curiskis Thomas R. Osborn Paul J. Kennedy

Centre for Quantum Computation and Intelligent SystemsFaculty of Engineering and Information Technology

University of Technology, Sydney,15 Broadway, Ultimo, NSW 2007,

Email: [email protected]

Abstract

The problem of link prediction describes how to ac-count for the development of connection structure ina graph. There are many applications of link predic-tion, such as predicting missing links and future linksin online social networks. Much of the literature hasfocused on limited characteristics of the graph topol-ogy or on node attributes, rather than a broad rangeof measures. There is a rich spectrum of topologi-cal features associated with a graph, such as neigh-bourhood similarity scores, node centrality measures,community structure and path-based distance mea-sures. In this paper we formulate a supervised learn-ing approach to link prediction using a feature setof graph measures chosen to capture a wide rangeof topological structure. This approach has the ad-vantage that it can be applied to any graph wherethe connection structure is known. Random forestlearning models are used for their high accuracy andmeasures of feature importance. The feature impor-tance scores reveal the strength of contribution of thetopological predictors for link prediction in a varietyof synthetically generated network datasets, as wellas three real world citation networks. We investigateboth undirected and directed cases. Our results showthat this approach can deliver very high model pre-cision and recall performance in certain graphs, andgood performance generally. Our models also con-sistently outperform a simpler comparison model wedeveloped to resemble earlier work. In addition, ouranalysis of variable importance for each dataset re-veals meaningful information regarding deep networkproperties.

Keywords: social networks, link prediction, super-vised learning, centrality, community structure,graph topology

1 Introduction

Link prediction describes how the likelihood for a linkexisting between two nodes in a complex network canbe estimated. Many approaches have been proposed,but most require non-topological, node specific infor-mation to achieve high accuracy. A key goal of linkprediction is the development of an accurate model

Copyright c©2015, Australian Computer Society, Inc. Thispaper appeared at the Thirteenth Australasian Data MiningConference, Sydney, Australia. Conferences in Research andPractice in Information Technology, Vol. 168. Md Zahidul Is-lam, Ling Chen, Kok-Leong Ong, Yanchang Zhao, Richi Nayak,Paul Kennedy, Ed. Reproduction for academic, not-for profitpurposes permitted provided this text is included.

that can be applied universally to any social networkdataset (Shibata et al. 2012).

There are many applications of link prediction.For instance, these methods can be applied for linkrecommendation to users in online social networks.Another application identified for link predictionmodels is the evaluation of evolving social networkmodels (Lu & Zhou 2011). Link prediction algorithmsestimate the influence of a set of features on the like-lihood of link formation. A wide range of topologicalfeatures can provide information regarding emergentproperties of a social network through their predic-tive importance. This information may provide anempirical basis for the derivation of the rule set of anevolving social network mode. The research questionof using a link prediction model to discover informa-tion about the deep structure of a social network isstill open in the literature.

In this paper, we present a novel link predictionframework that can be applied universally to any net-work where the topology is known. We define a setof topologically derived features which capture a widerange of network properties, and apply a random for-est supervised learning model. Our method is testedagainst a variety of synthetic networks and real worlddatasets, for both the undirected and directed cases.We also evaluate whether feature importance scoringcan provide information about global and emergentproperties of a network.

2 Related Work

There are three common frameworks to link predic-tion: the similarity based approach, methods basedon maximum likelihood estimation, and probabilisticmodelling approaches.

The simplest framework for link prediction is thesimilarity-based approach. In this method, each pairof nodes i and j is assigned a score sij , defined asthe similarity between i and j. All unobserved edgesare then ranked according to their scores, with higherranks having a greater link likelihood. Much of theearly literature on link prediction focussed on the useof singular similarity features, or small sets of fea-tures. For example, Adamic & Adar (2003) developeda similarity measure based on common neighboursto predict connections amongst web pages. Liben-Nowell & Kleinberg (2007) experimented with a widerrange of similarity based measures, but still used eachin isolation to rank node pairs with the highest scores.Many similarity indices have been proposed, see Lu &Zhou (2011) for more details. As these methods arerelatively simple, each similarity index only considersa limited amount of information regarding the graphtopology. As such, the accuracy of these indices isgenerally quite low.

Proceedings of the 13-th Australasian Data Mining Conference (AusDM 2015), Sydney, Australia

39

A more recent framework is to use algorithmsbased on maximum likelihood estimation. This ap-proach assumes some topological form of the network,e.g. an exponential random graph (Zaccarin & Riv-ellini 2010). An algorithm then estimates the modelparameters against the dataset. Another method wasproposed by Clauset et al. (2008) and starts by infer-ring the hierarchical organisation of a network. Thehierarchical structure is assumed to extend acrossthe network, and is applied to predict missing links.A modelling approach known as a stochastic blockmodel has also been developed. These models parti-tion nodes into groups, which strongly influence linkprobabilty. However, this method is known to ignoreheterogeneity in node degree. Zhang et al. (2014)recently extended the stochastic block model by cor-recting for variable node degree, with improved re-sults. Heterogeneity across other network propertiesmay still be unaccounted for in this approach, limit-ing its performance in real world datasets. For thesemethods to work well in practice, the network struc-ture should be first understood and matched to theclosest topological form before a model is be built.

The third common framework uses probabilisticmodelling. This methodology aims to abstract theunderlying link structure of the network throughtraining a probabilistic model, commonly a super-vised learning model. For instance, Backstrom &Leskovec (2011) developed a supervised random walkapproach. This method combines node attributes as asupervised learning problem to guide a random walkto nodes which are more likely to be connected. Thereare many types of probabilistic link prediction mod-els which can be applied. Wang et al. (2011) tested arange of supervised learning models on a social net-work derived from mobile phone data. They foundthat a decision-tree model performed most effectivelywhen trained on a mixture of both network and dataspecific node attribute measures.

Recent publications in this area have focussed onwider sets of features used in a supervised learningframework to predict link formation. More diversetopological features can capture different types ofcomplex structure. Shibata et al. (2012) applied asupport vector machine learning model to a varietyof features on citation networks, including similarityscores, some centrality measures, community classifi-cation and node attributes. This approach achievedhigh performance, and the model weights for eachfeature were provided as importance measures. Thenon-topological node attribute features strongly in-fluenced most of their models, and it was found thatdifferent models were required for different citationnetworks. Bliss et al. (2014) analysed link predictionon a large Twitter social network using a wide varietyof topological and node attribute similarity features.An evolutionary algorithm was used to estimate thecoefficients for a linear combination of features, withgood results. Both of these publications, however,utilised non-topological attributes. This limits thenetwork datasets to which their application can beapplied to, and are difficult to compare across othernetworks. There is a great deal of interest in this area,so for a detailed review see Lu & Zhou (2011) and,more recently, Wang et al. (2014).

The method we present in this paper is motivatedby the above publications using the supervised learn-ing framework. However, we only consider topologi-cal features to ensure that our approach can be ap-plied universally to any network where the topologyis known. We contribute to the existing literature byextending the range of topological features used. Wealso show that a model with a wider range of topolog-

ical features consistently outperforms a reduced fea-ture set model. A random forest model is used as itcan deliver high accuracy and analysis of feature im-portance. The results outlined in this work provideinformation regarding emergent properties of synthet-ically generated networks, as well as three real worldcitation networks.

The rest of this paper is organised as follows. Sec-tion 3 outlines the research methods. Section 4 deliv-ers results regarding the model performance on eachdataset, and the analysis of feature importance. Weconclude in Section 5, and outline the future direc-tions for this work.

3 Research Methods

In this section, we outline our supervised learningframework for link prediction using a purely topo-logical feature set. We proceed by describing thedatasets used and the data preparation and modellingapproach. The features are then defined, followed bythe model evaluation methodology.

We define a graph G as an ordered pair G = (V,E)comprising a set of nodes V and edges E. Thegraph is endowed with nodes vi and edges eij , wherei, j = 1, . . . , n. Edges are symmetric for undirectednetworks, i.e. eij = eji, but eij 6= eji for directednetworks. For directed networks, we will refer to the“to” node as y, and the “from” node as x.

3.1 Data Description

Our model is applied over three types of syntheticallygenerated networks: an Erdos-Renyi random graph, asmall-world network, and a scale-free network. Whileidealised and simplistic, these three graphs can pro-vide topology often observed in real world networks(Newman 2003). These networks are generated usingthe igraph package with R statistical software (Csardi& Nepusz 2006). For each model type, we define avariable m which varies the number of edges in thenetwork. The parameter m roughly gives the num-ber of edges as m × V , and we take m ∈ {1, 2, 3}.Higher values of parameter m tend to give unrealisticproperties, such as much higher graph densities thanthe real world networks. The model performance alsodoes not vary substantially with m > 3.

• Scale-free networks are generated by the prefer-ential attachment mechanism, where new nodesare connected preferentially to existing nodeswith higher degrees. We generate scale-free net-works SFm with 2,000 nodes. Parameter m isdefined as the number of edges added per node.

• Small-world networks start with a regular lattice,then proceed to rewire edges randomly acrossthe network. We generate small-world networksSWm with starting lattice dimension equal to 1,nodes equal to 2,000 and rewiring probability of0.05. Parameter m is defined as the lattice con-nection neighbourhood distance.

• Erdos-Renyi random graphs start with a fixednumber of nodes, and edges are created randomlywith uniform probability. We generate Erdos-Renyi random graphs ERm with 2,000 nodes.The uniform connection probability is defined asm

1,000 .

We have chosen three real world network data setsto apply the link prediction model to: Cora, Cite-seer and WebKB (Sen et al. 2008). These data sets

CRPIT Volume 168 - Data Mining and Analytics 2015

40

have been chosen as the full connection structure isprovided, and they have been used in recent studies(De et al. 2013). All three data sets represent citationnetworks, are directed, and all contain the full connec-tion structure and node attributes. Table 1 outlinesthe key properties of each graph. Di gives the diam-eter of the network, APL is the average path length,Cls gives the clustering coefficient of the network, andDns describes the graph density.

Table 1: Key properties of graph datasets

Graph V E Di APL Cls DnsSynthetic

SF1 2,000 1,999 17 7.8 0 0.001SF2 2,000 3,997 7 3.8 0.006 0.002SF3 2,000 5,994 6 3 0.01 0.003SW1 2,000 2,000 132 51.4 0 0.001SW2 2,000 4,000 20 10.1 0.362 0.002SW3 2,000 6,000 13 7 0.436 0.003ER1 2,000 1,913 33 11 0.001 0.001ER2 2,000 3,917 12 5.7 0.002 0.002ER3 2,000 5,951 8 4.5 0.003 0.003

Real WorldWebKB 878 1,388 8 3.1 0.036 0.004Cora 2,709 5,278 19 6.3 0.093 0.001

Citeseer 3,328 4,552 28 9.3 0.13 0.001

In addition to the key graph measures, the net-works are plotted with a Fruchterman-Reingold lay-out, which gives a visual indication of their struc-ture. Figures 1(a) and 1(b) show the plots of the firsttwo scale-free networks SF1 and SF2, respectively.SF3 has been excluded as the structure becomes dif-ficult to discern visually. The branch-like structureis clearly visible in SF1. Similarly, Figures 1(c) and1(d) depict the structure of the small-world networksSW1 and SW2, and Figures 1(e) and 1(f) plot ER1and ER2. It is noted that these latter graphs arenot fully connected, however the largest componentmakes up the majority of the network in both cases.

Figures 1(g), 1(h) and 1(i) show the structure ofthe real world datasets. Anecdotally, we can see somecommon patterns amongst the three real world graphsand our generated graphs. For example, the WebKBgraph shown in Figure 1(g) appears to have a similarbranching structure to the scale-free network in Fig-ure 1(a). In terms of graph measures, WebKB pos-sesses similar properties to SF2 and SW2, althoughthe scale-free network has a lower clustering coeffi-cient, and the small-world network has a smaller di-ameter.

While the link prediction method outlined in thispaper can be applied to both directed and undirectednetworks, we note that there may be differences inperformance in each case. We therefore consider bothdirected and undirected interpretations of the abovenetworks in our application.

3.2 Data Preparation and Learning Method

Link prediction is known to be a very unbalanced clas-sification problem (Wang et al. 2014). There are usu-ally a large number of node pairs to predict over, anda small number of actual links. We address these is-sues through implementing a sampling process. Wealso adopt a random forest learning model. This mod-elling approach has been chosen because it can be ef-fectively trained to distinguish between unbalancedclasses (Breiman 2001). It is proven to be particu-larly robust to data outliers and also very accurate.Lastly, it can provide measures for the importance of

Table 2: Link prediction features

Feature Category1. Jaccard coefficient Similarity2. Adamic-Adar index3. Dice similarity4. Degree Centrality5. Betweenness6. Closeness7. Eigenvector8. PageRank9. In same community Community10. Community density11. Community clustering12. Cross-community edge weight13. Node pair geodesic Distance14. Community pair geodesic

each feature on the likelihood of link formation, whichwe evaluate for each data set.

In all data sets, we have prepared the data formodelling in the following way:

• We create the feature set for all node pairs in G,as defined in Section 3.3

• The data is split into 70%/30% training/testingsets using random sampling.

• In the training data set, we select all the link ob-servations (True class), and sample the remain-ing set of node pairs (False class) such that thereis a 1:100 ratio between the True and False linkclasses. Through experimentation, we found thatoversampling the False cases in the training dataproduces a more accurate model. 1:100 providesa reasonably representative sample of False cases,while still providing a proportion of True caseshigh enough for the random forest to model ac-curately.

• The random forest model is trained using 50trees, since we found through experimentationthat the model accuracy did not improve withadditional trees. We also set the number of fea-tures included in each tree to the floor of thesquare root of the total number of features.

3.3 Model Features

We present four categories of measures in this paper:similarity scores, node centrality, community struc-ture and distance measures. While the real worlddatasets have available certain node attributes, ouraim is to use only topological features in our model.This allows for our method to be applied in a standardway to a wide variety of networks in future. We alsoconsider directed cases of each measure. Table 2 out-lines the features used by their category. We presentthe directed version of the features as the generalisa-tion to undirected graphs is trivial.

We refer to node pairs x, y ∈ G, where x 6= y asxy, and classify x as the from node and y as the tonode.

3.3.1 Similarity Measures

Similarity measures have been applied extensively tolink prediction. A simple approach to link predictionis to rank all pairs of nodes by their score accordingto a specific similarity index, and take those nodepairs with the highest score to be the most likelyconnected pairs (Lu & Zhou 2011). However, these


41

Figure 1: Plots of all network datasets used, in the undirected interpretation. Node size is scaled to represent the degree of thenode. (a) represents the scale-free network SF1, and SF2 is shown in (b). The branch-like structure is clearly visible in (a), whereeach node is only connected to one existing node. (c) and (d) depict the plots of the small world networks, SW1 and SW2. (e)and (f) show the plots of the Erdos-Renyi networks with m = 1 and m = 2 respectively. It is clear that there is a large connectedcomponent in ER1, with a number of smaller components around the edges. With m = 2 in (f), most of the network is connected.(g), (h) and (i) show plots of the real world networks WebKB, Cora and Citeseer, respectively. It is evident that WebKB consistsof four connected components, each with a highly connected hub node, indicating scale-free structure. Cora and Citeseer, shownin (h) and (i), both show a large connected component with a number of smaller components, similar to (e).

measures can also be used as features in the su-pervised learning approach to link prediction. Weconsider the following three similarity measures touse as features in the supervised learning problem.

(1) Jaccard similarity coefficient (SimJaccxy ):

The Jaccard similarity coefficient of two ver-tices is the number of common neighbours dividedby the union of the neighbours of both vertices. Fora node x, let Γ(x) denote the set of neighbours of x.For directed networks, Γ(x) defines the set of nodes

with a link from node x. The Jaccard similaritycoefficient is then defined as

SimJaccxy =

|Γ(x) ∩ Γ(y)||Γ(x) ∪ Γ(y)|

.

(2) Adamic-Adar index (SimAAxy ) (Adamic &

Adar 2003):

This index extends the simple counting of com-mon neighbours to include a term which gives less


42

connected neighbours higher weight. Letting kout,zbe the out-degree of node z, the Adamic-Adar Indexis defined as

SimAAxy =

∑z∈Γ(x)∪Γ(y)

1

log kout,z.

(3) Dice similarity coefficient (SimDicexy ):

This similarity coefficient, also known as theSørensen index, is similar to the Jaccard index inform, but has been applied to ecological communitiesand is known to be robust to outliers (Sørensen1948). The Dice similarity coefficient is given by

SimDicexy = 2

|Γ(x) ∩ Γ(y)|kout,x + kout,y

.

3.3.2 Node Centrality Measures

Centrality measures have been used for a long timewithin the field of social network analysis to describethe relative influence of a node over the network. In-deed, scale-free networks arose from the insight thatindividuals may connect preferentially to more highlyconnected individuals, i.e. those with high degree cen-trality (Barabasi & Albert 1999). Shibata et al.(2007) showed that in citation networks, the between-ness centrality measure was positively correlated be-tween pairs of nodes where a connection existed, andproved to be a significant predictor for future connec-tions. Further to these results, we expect that in so-cial networks individuals may connect preferentiallyto other individuals according to a variety of central-ity measures. We therefore include a broad range offeatures derived from centrality measures into our su-pervised learning model and evaluate their predictivepower and importance.

The question of how the centrality measuresfor each node pair are used as features requiresconsideration. A number of approaches have beenapplied previously. For example, Shibata et al. (2007)derive a feature based on the difference between thein-degrees of the two nodes, CnPA2

xy = kin,x − kin,y.This measure will capture differences in the in-degree between the nodes, but may not distinguishbetween high and low in-degree nodes. Anotherfeature derivation commonly used for preferentialattachment is defined as the product of each node’sdegree, i.e. CnPA1

xy = kx × ky (Lu & Zhou 2011,Barabasi & Albert 1999). This method gives moreweight to either node having high degree, howeverthe formula does not distinguish between the to andfrom nodes. Additional consideration of this measureis therefore required for directed networks. However,this approach is more common in the literature(Wang et al. 2014) so we adopt the same convention.To accomodate the directed network case we alsoinclude as a separate feature the centrality measurefor the to node, y. The undirected centrality productfeatures are labelled as CnMeasure

x×y , and the additional

directed centrality features are labelled CnMeasurey .

The list of centrality features are outlined as follows.

(4) Degree centrality (CnDgrex×y , CnDgrey ):

The degree centrality is defined simply as thenumber of connections of each node in an undirectednetwork, or the in-degree in a directed network. We

construct a feature for the product of the in-degreesof both nodes, and another feature for the in-degreeof the to node:

CnDgrex×y = kout,x × kin,y,CnDgrey = kin,y

We expect that a higher degree product indicateshigh connection likelihood in the undirected case. Inthe directed case, we also expect that from nodesx with a higher out-degree and to nodes y with ahigher in-degree are more likely to be connected.The in-degree of the to node y as a separate featurefor directed networks.

(5) Betweenness centrality (CnBtwnx×y , CnBtwny ):

Previous studies have revealed that betweennesscentrality can be a useful predictor of links incitation networks (Shibata et al. 2012). This measurerepresents the extent to which a node lies on theshortest paths (geodesics) between other nodes,which can be a useful indicator of influence on net-work flow. Nodes with high betweenness centralitiestend to bridge otherwise unconnected subsets of anetwork. Formally, for nodes i, s and t, let

nist =

{1 if i lies on the geodesic path from s and t

0 otherwise.

The betweenness centrality of a node i is then definedas:

CnBtwni =∑st

nistgst

,

where gst gives the total number of geodesic pathsfrom s to t. Our features are then defined as:

CnBtwnx×y = CnBtwnx × CnBtwny ,

with CnBtwny included in the directed case.

(6) Closeness centrality (CnClsex×y , CnClsey ):

Closeness centrality measures the inverse of themean distance from a node to all other nodes (New-man 2010). Letting dij be the length of the geodesicpath from node i to j, closeness centrality is definedas:

CnClsei =n∑j dij

.

We therefore construct our features as

CnClsex×y = CnClsex × CnClsey

=n∑i dxi

× n∑i dyi

,

with CnClsey included as a separate feature in thedirected case.

(7) Eigenvector centrality (CnEgnvx×y , CnEgnvy ):

Eigenvector centrality extends from degree cen-trality with the acknowledgement that not allneighbours are equal. This measure awards eachnode with a score proportional to the sum of thescores of its neighbours. Derivations of eigenvectorcentrality have been applied effectively to link pre-diction (Symeonidis et al. 2013), so we expect that it


43

will be a useful feature. The eigenvector centralitymeasure for node i, vi, is defined as:

vi = κ−11

∑j

Aijvj ,

where κ1 is the leading eigenvalue of A, Aij is the ijth

element of A, and vj is the eigenvector centrality ofnode j. We construct our feature as:

CnEgnvx×y = CnEgnvx × CnEgnvy

=

(κ−1

1

∑i

Axivi

)×

(κ−1

1

∑i

Ayivi

).

As before, we also use CnEgnvy as a feature in thedirected case.

(8) PageRank (CnPgRkx×y , CnPgRky ):

PageRank was originally designed as a measureof web page importance (Page & Brin 1998). It issimilar to the eigenvector centrality in form, howeverthe neighbour centrality score for each neighbouri is divided by that node’s out-degree, kouti . Thispenalises the influence of neighbours with very highout-degree. We construct our PageRank features asfollows:

CnPgRkx×y = CnPgRkx × CnPgRky

=

(α∑i

Axivikouti

+ β

)(α∑i

Ayivikouti

+ β

),

where α and β are constants. We also include CnPgRky

as before in the directed case.

3.3.3 Community Measures

Community detection describes the problem of parti-tioning a graph into densely connected subsets, com-monly referred to as communities. A graph’s commu-nity structure has been shown to be predictive of linkformation, as nodes within the same community aremore likely to be connected. For example, Shibataet al. (2012) include a feature for whether two nodesare in the same community in their supervised learn-ing model for link prediction, with good performance.

The problem of community detection has receiveda great deal of interest, see Fortunato (2010) fora comprehensive review. However, many of thecommunity detection methods have been developedfor undirected graphs, and some have a high com-putational cost. We utilise the infomap communitydetection algorithm as it has computational timeO(V (V × E)) and can handle directed graphs(Rosvall et al. 2009). Once the graph is partitionedinto communities, we create the community graphGC by aggregating the vertices of G to their com-munity partitions. Each node in GC is therefore acommunity partition of G, and we use the symbolsµ and ν to refer to the from and to nodes in GC ,respectively. We also let µ, ν = 1, . . . , k, where k isthe total number of community partitions in G. Weassign edge weights according to the number of linksbetween each community pair in G. The communityfeatures are defined as follows:

(9) In same community (CmCommxy ):

We expect that a pair of nodes xy in the samecommunity should have a higher likelihood of beingconnected given that the communities are partitionedaccording to connection density. This measure isdefined for node pairs simply as:

CmCommxy =

{1 if x is in the same community as y

0 otherwise.

Including a feature for two links being in the samecommunity should effectively reduce the link like-lihood space dramatically. This will yield moreaccurate link prediction in networks where thecommunity clustering is strong.

(10) Community density (CmDensxy ):

The density of a graph is simply the number ofedges divided by the number of possible edges. Weapply this measure to each community µ, representedas induced subgraphs of G, µ ⊂ G. Letting E(G)define the set of edges in G, we define |E(µ)| as thenumber of edges xy, with x, y ∈ µ,G. The density ofµ is then defined as

Densµ =|E(µ)|∑x,y∈µ,G 1

.

We then construct our feature vector for each nodepair in G as

CmDensxy =

{Densµ if x and y share community µ

0 otherwise.

We expect that two nodes in the same communitywith higher density will have a greater likelihood ofbeing connected than those in a different communitywith lower density, or in separate communities.

(11) Community clustering coefficient (CmClstxy ):

The clustering coefficient, also known as transi-tivity, for community subgraph µ is defined as

Clstµ =(number of closed paths of length two)

(number of paths of length two).

Similarly to the community density, we expect thatnodes in the same community with higher clusteringare more likely to be connected. The feature vectoris constructed as

CmClstxy =

{Clstµ if x and y share community µ

0 otherwise.

(12) Cross-community edge weight (CmEwgtxy ):

The community graph GC provides a more coarsenetwork from which we can take attributes to usefor link prediction of node pairs in G. One straight-forward measure is the edge weight between eachcommunity pair. We construct a feature vector forthe node pairs in G based on the edge weights be-tween their respective communities, where the nodesare not in the same community. Let AC denote theadjacency matrix for GC , and let µ, ν ∈ GC denotethe community classifications for nodes x, y ∈ Grespectively. The cross-community edge weight isthen defined as:

CmEwgtxy =

{∑xy∈GACµν if µ 6= ν

0 otherwise.


44

3.3.4 Distance Measures

Distance measures indicate the number of linksbetween a pair of nodes. Our expectation is thatnodes which are closer together are more likely tobe connected. For the purposes of link prediction,we only consider distance greater than one for eachnode pair. This ensures any direct connections arenot counted, as links in the graph have distance ofone. We construct distance measures for both theindividual node pairs, and the distance between theircommunities in GC .

(13) Node pair geodesic (DNodexy ):

For a pair of nodes x, y ∈ G, let dxy,2 be theshortest path (geodesic) from x to y of length greaterthan or equal to 2. Our feature vector is then simplygiven by

DNodexy = dxy,2

(14) Community pair geodesic (DCommxy ):

For a pair of nodes x, y ∈ G lying in communi-ties µ, ν ∈ GC respectively, our feature vector foreach node pair is defined as the distance from µ to ν:

DCommxy = dµν,2

An issue that arises with the distance features, par-ticularly on small networks, is that there may not bea path with length greater than one for node pairswhich are connected. If the network is not fully con-nected, then there will also be node pairs that do nothave a geodesic. These issues effectively introducemissing values into the observations. To account forthis, we simply replace any missing distance valueswith the mean across the dataset which allows forthese observations to be included in the model. Moresophisticated approachs may be developed to accountfor missing distance values, but we leave this to futurework.

3.4 Model Evaluation

Given the unbalanced nature of the link predictionproblem, measuring the performance of a model re-quires consideration. A common approach is to usethe model precision, recall, and the associated F1measure (Wang et al. 2014). With our model trainedon the training data set, we apply the following eval-uation measures on the testing data set only. Weabbreviate true positives to TP, false positives to FP,and false negatives to FN.

Precision =TP

TP + FP,

Recall =TP

TP + FN,

F1 =TP

TP + FP + FN.

We have chosen precision, recall and F1 specifi-cally because they are useful in unbalanced problemssince they ignore true negatives. There are likely tobe a very large number of records classified as truenegatives due to the high number of node pairs that

Table 3: Link prediction features for comparison model

Feature Category1. Jaccard coefficient Similarity2. Adamic-Adar index3. Dice similarity4. Degree Centrality5. Betweenness6. In same community Community7. Community density

Table 4: Model performance on all undirected networkdatasets

Network Precision Recall F1 AUCSynthetic

SF1 1 1 1 1SF2 0.552 0.729 0.628 0.9932SF3 0.529 0.527 0.528 0.9729SW1 1 0.998 0.999 1.000SW2 0.548 0.851 0.666 0.9932SW3 0.56 0.829 0.669 0.9858ER1 0.82 0.964 0.886 0.9999ER2 0.622 0.476 0.54 0.951ER3 0.463 0.267 0.339 0.8206

Real WorldWebKB 0.709 0.791 0.748 0.9867Cora 0.402 0.727 0.518 0.983

Citeseer 0.406 0.876 0.555 0.9969

do not have links, which distort the performance re-sults. In addition to these evaluation measures, wealso provide the precision and recall charts.

In this paper, we expect that a wider range oftopological features can produce a more accurate linkprediction model, as well as revealing diverse informa-tion about deeper graph properties. To demonstratethe performance improvement, we create a compari-son model based on the same topological feature setused by Shibata et al. (2012), with a random forestmodel as the learning algorithm rather than the sup-port vector machine approach. The list of featuresfor the comparison models is given in Table 3. Itis noted that the approach by Shibata et al. (2012)includes non-topological features, so we are not mak-ing a direct comparison between the two different ap-proaches. We also modify the comparison model fea-tures to be applicable to undirected networks. Thedirected and undirected cases are treated in the sameway as outlined in Section 3.3. We compare the per-formance of models trained with the full feature setto the comparison set by producing precision recallcharts for both.

The last aspect of model evaluation we consider isto determine the relative importance of the topolog-ical features. We provide an analysis of the randomforest mean decrease accuracy importance measure,and discuss the results with respect to model accu-racy.

4 Results

4.1 Model Performance: Undirected Graphs

Table 4 outlines the model performance on all datasets, interpreted as undirected networks. The mod-els trained on the synthetic networks with parame-ter m = 1 give very high performance. The modeltrained on the scale-free network with m = 1 actuallyclassifies every node pair correctly. However, as weincrease the value of parameter m, the model perfor-


45

Figure 2: Plot of the precision and recall charts for the undirected scale-free networks (a), small-world networks (b), Erdos-Renyirandom graphs (c), and the real world networks (d). Plots (e) to (h) depict the performance of the comparison models trained onthe same network as opposite with a smaller range of topological features.

mance decreases in all cases. The scale-free networkmodels lose precision with higher values of m, andrecall drops substantially as well. The small-worldnetwork maintains good recall and F1, although pre-cision does drop with m > 1. The Erdos-Renyi ran-dom graph loses precision as m is increased, but therecall drops much more rapidly to 0.267 with m = 3.

Figure 2(a-c) depicts the precision and recallcurves for the undirected synthetic networks, whererecall is plotted on the x-axis and precision on they-axis. It is clear from Figure 2(a) that the modelperforms well for the scale-free graphs with m = 1, 2and 3, given that these curves remain in the top tri-

angle of the plot, and their concavity is down. Thesmall-world networks in Figure 2(b) retain good per-formance as well. It is interesting that the model onthe network with m = 3 actually seems to outper-form that with m = 2. We can see the former modelachieves a slightly higher precision of 0.56, comparedto 0.548. The Erdos-Renyi random graphs in Figure2(c) show good performance for the case with m = 1,but a large difference to the case with m = 2, andagain with m = 3. All models, except ER3, deliverAUC higher than 0.97. These results indicate thatour learning model can accurately describe the net-work generating processes of these three undirected


46

synthetic networks, however this accuracy diminishesas nodes are added to the network with more edges.

On the real world datasets the model gives con-sistently high recall and AUC values. This indicatesthat a high proportion of the actual links are beingclassed correctly. Precision gives how many of thepredicted links are actually links, and our model per-forms well on the WebKB dataset with precision of0.709. However, the precision values for Cora andCiteseer are much lower at around 0.4. Figure 2(d)gives the precision against recall curves for the realworld datasets. We can see that our model performsthe best on WebKB, followed closely by Citeseer. Themodel classification gives a low precision score to themodel trained on Citeseer, however the recall is muchhigher than Cora. The precision recall curve for Cite-seer indicates that the model could deliver higher pre-cision with a small drop in recall.

Table 5: Model performance on all directed network datasets

Network Precision Recall F1 AUCSynthetic

SF1 0.201 0.926 0.33 0.9994SF2 0.237 0.522 0.326 0.9443SF3 0.206 0.194 0.2 0.7293ER1 0.29 0.45 0.353 0.9855ER2 0.345 0.154 0.213 0.9401ER3 0.35 0.075 0.123 0.8391

Real WorldWebKB 0.487 0.796 0.604 0.9988Cora 0.297 0.798 0.433 0.9988

Citeseer 0.237 0.955 0.379 0.9997

Figure 2(e) to 2(h) show the precision and recallcurves for the comparison model with a reduced topo-logical feature set, trained on the same graphs as thefull model. The key observation from these plots isthat these models fail to achieve high performance inboth recall and precision. Many of these curves arealso not smooth or monotonic. This is likely due tothe feature sampling of the random forest over the re-duced feature set; the curve may change sharply whenan important feature is excluded or included. Wetherefore conclude that by including a wider range oftopological features in our supervised learning model,we achieve much higher performance than a smallerset of features.

4.2 Model Performance: Directed Graphs

When we apply the same methodology to the directedversions of these networks, including the centralitymeasures for the to nodes y, the results differ sub-stantially. Table 5 gives the model performance onthe directed interpretation of the networks. igraphdoes not currently support a directed version of thesmall-world network, so we have removed it from thisanalysis. It is clear that the approach is not as ac-curate on directed networks as their undirected inter-pretation. Precision is down significantly for all net-works, however recall remains high for the real worldnetwork models and the synthetic graphs with m = 1.This result suggests that the model can still identifythe node pairs that have links, but cannot accuratelydistinguish the direction of the link. We will explorethis more through an analysis of the feature impor-tance on each network.

Figure 3(a-c) show the precision and recall chartsfor the directed networks. We can clearly see thatprecision is not as high in the undirected case for

the real world graphs, although recall is slightly im-proved. However, it is clear that the synthetic net-works have not performed as well. To compare, Fig-ure 3(d-f) show the precision and recall charts for thecomparison model with the reduced feature set. Itis clearly evident that the full model outperforms thesimpler model. However, it is noted that the compari-son model seems to perform slightly better on the realworld networks than the synthetic. This may be dueto a more balanced feature importance. We explorefeature importance in Section 4.4.

4.3 Feature Importance: Undirected Graphs

As mentioned earlier, one of the advantages of usinga random forest model is that features which are oflesser use in prediction tend to be effectively down-weighted. One measure of the importance of the fea-tures is the mean decrease accuracy, which is a scaledaverage of the prediction accuracy of each feature. Iteffectively measures the decrease in model accuracywhen values of each feature are randomly permuted(Breiman 2001). Figure 4 shows a matrix of the meandecrease accuracy for each feature and graph, for boththe undirected and directed cases. To save space, wehave only shown the synthetic networks with m = 2as these networks are the most similar to the realworld graphs.

The undirected models and graphs are given inFigure 4(a). In the model for Citeseer, communitybased features and distance measures hold the high-est importance. The most important feature in thisdataset is the community density, followed by thesame community flag and the node geodesic. The im-portance of this feature set may be due to the largenumber of unconnected components in the Citeseergraph; nodes in these minor components are likely tobe attributed to the same community. The model hastherefore limited the possible connection space dra-matically by assigning a higher importance to com-munity based features. The analysis is very similar forthe Cora dataset as well, however the node geodesicis flagged as the most important feature. In bothgraphs, the centrality measures are also important.

Of all the networks analysed, the model trainedon WebKB delivered the best performance. In thisdataset, we see a broader distribution of predictiveimportance across the feature classes. All the fea-tures have a mean decrease accuracy in the same or-der of magnitude. Community based features havethe strongest importance, however centrality mea-sures are very close in magnitude, as are the similar-ity features. These results are not surprising, giventhe existence of four key connected components, eachwith a strongly connected hub node.

The Erdos-Renyi random graph model produced alow score for the recall measure, and we see centralitymeasures strongly influencing the predictions. De-gree centrality was the most important, followed byeigenvector and closeness centralities. It seems thatthe model has trained closely to the node influencefeatures, given the lack of strong neighbourhood andcommunity structure. This may also explain the lowmodel performance, since node influence is not likelyto show huge variation in a random graph.

The feature importance in the small-world net-work is assigned primarily to the community and nodegeodesic measures. Again, this makes sense given thatlocal connections are far more common in this net-work type. Centrality measures are marked with lowimportance, given that highly influential nodes arerarer.


47

Figure 3: Plot of the precision and recall charts for the directed scale-free networks (a), Erdos-Renyi random graphs (b), andthe real world networks (c). Plots (d) to (f) depict the performance of the comparison models trained on the same network asopposite with a smaller range of topological features.

In the scale-free network centrality measures areimportant as we expect given the preferential attach-ment growth mechanism. Closeness, PageRank anddegree centralities all score highly. However, the mostimportant feature is actually the node geodesic. Com-munity features are also flagged with high impor-tance. These results suggest that community struc-ture and local, densely connected sets of nodes haveself-organised in this artificial network.

4.4 Feature Importance: Directed Graphs

For the directed case, Figure 4(b) depicts the featureimportance per network. As discussed in Section 4.2,the drop in precision described in Section 3.1 maybe due to the model assigning more importance tomeasures that don’t necessarily discriminate betweenthe two nodes. In other words, they will identify thenode pairs most likely to be connected, but will notaccurately determine the link direction.

For the Cora and Citeseer directed networks inFigure 4(b), the model has scored the communitybased measures highly, particularly the communitydensity feature. This implies that many of the pre-dicted links will likely be in high density communities.

However, this measure does not carry any informa-tion regarding the direction of the link within a com-munity. These models all deliver high recall, whichshows that the community based features are predic-tive of a link existing between two nodes in eitherdirection. The only other highly important featureis the to node’s in-degree. It seems that with onlyone important centrality measure carrying link direc-tion, these models deliver low precision, but manageto correctly classify a majority of the links.

Similarly to the undirected case, the WebKBmodel achieved the best performance. Again, thereis a distribution of predictive importance across thefeature categories. Community structure and nodecentrality measures are assigned high scores, for boththe centrality products and the to node centralities.The higher precision for this model relative to themodels for Cora and Citeseer can be explained by thefact that most of the centrality features are impor-tant. However, the node neighbourhood features arenot important in the directed case, which likely ex-plains the drop in precision relative to the undirectedmodel.

The synthetic networks, on the other hand, showstrong influence of node centrality product features.


48

Figure 4: Plot of the feature importance for both the undirected and directed models. Feature importance is determined bythe Random Forest mean decrease accuracy measure. Higher mean decrease accuracy is given darker colour.

Betweenness is highly important for the ER2 model,and both degree features in the SF2 model. Rela-tive to the undirected feature importance for thesemodels, similarity scores receive very low importance.Community based features are also less important inthe directed version. The lower precision and recallfor these models is likely due to the bias towards cen-trality features.

The results for the directed case clearly suggestthat more work is required on the feature set. Par-ticularly, more features are required that carry infor-mation regarding the direction of the link. As wellas this, additional variations of measures based onin-links and out-links should be considered.

5 Conclusion

We have described a new approach to link predictionusing a broad range of graph topological features anda random forest learning model. This approach hasthe distinct advantage of being applicable to any net-work where the connection structure is known. It alsocan discover global or emergent properties of the net-work through analysis of feature importance.

Our method was tested on three types of synthet-ically generated datasets, as well as three real worldcitation networks. In the undirected case, the modelperforms very well in terms of precision, recall, F1 andAUC. It was shown that the model can achieve per-

fect classification of classes on a scale-free networkwith one edge added per node. However, our ap-proach clearly performs more accurately on syntheticnetworks with less complex structure, i.e. where thenumber of edges added per new node is small. Wealso found that the model in the directed case deliverslower precision as it does not accurately distinguishthe direction of the link. Modifying the feature setin the directed case to account for the link directionmore strongly may address this issue. To demonstratethat including a broader range of topological featurescan give higher performance, we compared our ap-proach to a model with fewer features. It was shownthat a larger set of features consistently gives higherperformance.

It was found that the importance of the input fea-tures vary significantly with the different networks.The model performs best when the features are moreevenly important across the feature categories, as inthe WebKB network. Generally, the model tends toperform very well when strong neighbourhood andcommunity structure is present, in addition to highcentrality importance. Finally, the analysis of fea-ture importance provides a method for the discoveryof complex, emergent properties within a social net-work. This was demonstrated through the varyingimportance of neighbourhood and community-basedfeatures in many of the datasets considered.


49

5.1 Further Work

In future, we aim to extend our approach and ap-ply it to more network datasets, particularly largesocial networks with complex structure. To reducethe computational cost associated with several of thefeatures used, we will consider reducing the numberof features used to those with the highest predictiveimportance, while retaining as much accuracy as pos-sible. Alternative supervised learning methods willbe considered as well. The directed version of ourapproach also needs further development so that themodel can more accurately distinguish the link direc-tion.

References

Adamic, L. A. & Adar, E. (2003), ‘Friends and neigh-bors on the web’, Social Networks 25, 211–230.

Backstrom, L. & Leskovec, J. (2011), Supervised ran-dom walks: predicting and recommending linksin social networks, in ‘WSDM Conference’, HongKong, pp. 635–644.

Barabasi, A. & Albert, R. (1999), ‘Emergence of scal-ing in random networks’, Science 286(5439), 509–512.

Bliss, C., Frank, M., Danforth, C. & Dodds, P. (2014),‘An evolutionary algorithm approach to link predic-tion in dynamic social networks’, Journal of Com-putational Science 5, 750–764.

Breiman, L. (2001), ‘Random forests’, MachineLearning 45(1), 5–32.

Clauset, A., Moore, C. & Newman, M. E. J. (2008),‘Hierarchical structure and the prediction of miss-ing links in networks’, Nature 453, 98–101.

Csardi, G. & Nepusz, T. (2006), ‘The igraph softwarepackage for complex network research’, InterJour-nal Complex Systems, 1695.URL: http://igraph.org

De, A., Ganguly, N. & Chakrabarti, S. (2013), Dis-criminative link prediction using local links, nodefeatures and community structure, in ‘2013 IEEE13th International Conference on Data Mining’,pp. 1009–1014.

Fortunato, S. (2010), ‘Community detection ingraphs’, Physics Reports 486, 75–174.

Liben-Nowell, D. & Kleinberg, J. (2007), ‘The link-prediction problem for social networks.’, Journal ofthe American Society for Information Science andTechnology 58(7), 1019–1031.

Lu, L. & Zhou, T. (2011), ‘Link prediction in complexnetworks: a survey’, Physica A 390, 1150–1170.

Newman, M. E. J. (2003), ‘The structure and functionof complex networks’, SIAM Review 45(2), 167–256.

Newman, M. E. J. (2010), Networks: an introduction,Oxford University Press.

Page, L. & Brin, S. (1998), ‘The anatomy of a large-scale hypertextual web search engine’, Proceedingsof the Seventh International World Wide Web Con-ference 30(1–7), 107–117.

Rosvall, M., Axelsson, D. & Bergstrom, C. T. (2009),‘The map equation’, The European Physical Jour-nal Special Topics 178(1), 13–23.

Sen, P., Namata, G. M., Bilgic, M., Getoor, L.,Gallagher, B. & Eliassi-Rad, T. (2008), ‘Collec-tive classification in network data’, AI Magazine29(3), 93–106.

Shibata, N., Kajikawa, Y. & Matsushima, K. (2007),‘Topological analysis of citation networks to dis-cover the future core articles’, Journal of the Amer-ican Society for Information Science and Technol-ogy 56(6), 872–882.

Shibata, N., Kajikawa, Y. & Sakata, I. (2012),‘Link prediction in citation networks’, Journal ofthe American Society for Information Science andTechnology 63(1), 78–85.

Sørensen, T. (1948), ‘A method of establishing groupsof equal amplitude in plant sociology based on simi-larity of species content and its application to anal-yses of the vegetation on danish commons’, Biolo-giske Skrifter 5(4).

Symeonidis, P., Iakovidou, N., Mantas, N. &Manolopoulos, Y. (2013), ‘From biological to so-cial networks: Link prediction based on multi-wayspectral clustering’, Data and Knowledge Engineer-ing 87, 226–242.

Wang, D., Pedreschi, D., Song, C., Giannotti, F.& Barabasi, A.-L. (2011), Human mobility, socialties, and link prediction, in ‘Proceedings of the17th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining’, KDD ’11,ACM, New York, NY, USA, pp. 1100–1108.

Wang, P., Xu, B., Wu, Y. & Zhou, X. (2014), ‘Linkprediction in social networks: the state-of-the-art’,Science China 58(1–38).

Zaccarin, S. & Rivellini, G. (2010), Modelling net-work data: An introduction to exponential ran-dom graph models, in F. Palumbo, C. N. Lauro& M. J. Greenacre, eds, ‘Data Analysis and Clas-sification’, Studies in Classification, Data Analysis,and Knowledge Organization, Springer Berlin Hei-delberg, pp. 297–305.

Zhang, X., Wang, X., Zhao, C., Yi, D. & Xie,Z. (2014), ‘Degree-corrected stochastic block mod-els and reliability in social networks’, Physica A393, 553–559.


50

Documents

Link Prediction and Topological Feature Importance in ... · ers results regarding the model performance on each dataset, and the analysis of feature importance. We conclude in Section