Leonid E. Zhukov - Leonid Zhukov · Evaluation metrics Precision and Recall, F-measure Precision =...

Preview:

Citation preview

Link prediction

Leonid E. Zhukov

School of Data Analysis and Artificial IntelligenceDepartment of Computer Science

National Research University Higher School of Economics

Structural Analysis and Visualization of Networks

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 1 / 23

Lecture outline

1 Link prediction problem

2 Proximity measures

3 Prediction by supervised learning

4 Other methods

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 2 / 23

Link prediction

Link prediction. A network is changing over time. Given a snapshotof a network at time t, predict edges added in the interval (t, t ′)

Link completion (missing links identification). Given a network, inferlinks that are consistent with the structure, but missing (findunobserved edges)

Link reliability. Estimate the reliability of given links in the graph.

Predictions: link existence, link weight, link type

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 3 / 23

Link prediction

Graph G(V,E)

Number of ”missing edges”: |V |(|V | − 1)/2− |E |In sparse graphs |E | � |V |2, Prob. of correct random guess O( 1

|V |2 )

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 4 / 23

Scoring algorithm

Link prediction by proximity scoring

1 For each pair of nodes compute proximity (similarity) score c(v1, v2)

2 Sort all pairs by the decreasing score

3 Select top n pairs (or above some threshold) as new links

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 5 / 23

Scoring functions

Local neighborhood of vi and vj

Number of common neighbors:

|N (vi ) ∩N (vj)|

Jaccard’s coefficient:|N (vi ) ∩N (vj)||N (vi ) ∪N (vj)|

Adamic/Adar: ∑v∈N (vi )∩N (vj )

1

log |N (v)|

Liben-Nowell and Kleinberg, 2003

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 6 / 23

Scoring functions

Paths and ensembles of paths between vi and vj

Shortest path:−min

s{pathsij > 0}

Katz score:

∞∑l=1

βl |paths(l)(vi , vj)| =∞∑l=1

(βA)lij = (I − βA)−1 − I

Personalized (rooted) PageRank:

PR = α(D−1A)TPR + (1− α)|

Liben-Nowell and Kleinberg, 2003

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 7 / 23

Scoring functions

Expected number of random walk steps:– hitting time: −Hij

– commute time −(Hij + Hji )– normalized hitting/commute time (Hijπj + Hjiπi )

SimRank:

SimRank(vi , vj) =C

|N (vi )| · |N (vj)|∑

m∈N (vi )

∑n∈N (vj )

SimRank(m, n)

Liben-Nowell and Kleinberg, 2003

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 8 / 23

Vertex feature aggregations

Preferential attachment:

ki · kj = |N (vi )| · |N (vj)|

orki + kj = |N (vi )|+ |N (vj)|

Clustering coefficient:CC (vi ) · CC (vj)

orCC (vi ) + CC (vj)

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 9 / 23

Low-rank approximations

Low-rank approximation (truncated SVD)

A ≈∑k

UkSkVTk

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 10 / 23

Link prediction

Liben-Nowell and Kleinberg, 2007

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 11 / 23

Link prediction

Liben-Nowell and Kleinberg, 2007

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 12 / 23

Binary classification

Challenging classification problem:

Computational cost of evaluating of very large number of possibleedges (quadratic in number of nodes)

Highly imbalanced class distribution: number of positive examples(existing edges) grows linearly and negative quadratically withnumber on nodes

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 13 / 23

Prediction difficulty

Actual and possible collaborations between DBLP authors

Extreme class imbalance

from Rattigan and Jensen, 2005

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 14 / 23

Link prediction with supervised learning

Supervised learning:1 Features generation2 Model training3 Testing (model application)

Features:

Topological proximity features

Aggregated features

Content based node proximity features

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 15 / 23

Features

Discriminative abilities of features

from Rattigan and Jensen, 2005

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 16 / 23

Simple evaluation

Simple ”hold out set” evaluation

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 17 / 23

Training and testing

Evaluation for evolving networks

image from Y. Yang et.al, 2014

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 18 / 23

Evaluation metrics

Precision and Recall, F-measure

Precision =TP

TP + FP, Recall =

TP

TP + FN

F =2 · Precision · RecallPrecision + Recall

True positive rate (TPR), False positive rate (FPR), ROC curve, AUC

TPR =TP

TP + FN, FPR =

FP

FP + TN

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 19 / 23

Performance of classification algorithms

BIOBASE database (research publications)

DBLP dataset (research publications)

from M. Al Hasan, 2010

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 20 / 23

ROC curves

from Lichtenwalter, 2010

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 21 / 23

Probabilistic models

Local model, Markov random fields [Wang, 2007]

Hierarchical probabilistic model [Clauset, 2008]

Probabilistic relations models:– Bayesian networks [Getoor, 2002]– relational Markov networks [Tasker, 2003, 2007]

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 22 / 23

References

D. Liben-Nowell and J. Kleinberg. The link prediction problem forsocial networks. Journal of the American Society for InformationScience and Technology, 58(7):1019?1031, 2007

R. Lichtenwalter, J.Lussier, and N. Chawla. New perspectives andmethods in link prediction. KDD 10: Proceedings of the 16th ACMSIGKDD, 2010

M. Al Hasan, V. Chaoji, S. Salem, M. Zaki, Link prediction usingsupervised learning. Proceedings of SDM workshop on link analysis,2006

M. Rattigan, D. Jensen. The case for anomalous link discovery. ACMSIGKDD Explorations Newsletter. v 7, n 2, pp 41-47, 2005

M. Al. Hasan, M. Zaki. A survey of link prediction in social networks.In Social Networks Data Analytics, Eds C. Aggarwal, 2011.

Leonid E. Zhukov (HSE) Lecture 16 21.05.2016 23 / 23

Recommended