Link Prediction Topics in Data Mining Fall 2015 Bruno Ribeiro

Page 1: Link Prediction

Topics in Data Mining, Fall 2015

Bruno Ribeiro

Page 2: Overview

• Deterministic Heuristic Methods
• Matrix / Probabilistic Methods
• Supervised Learning Approaches

Page 3: Goal

• Input: snapshot of a social network at time t
• Output: predict the edges that will be added to the network by time t’ > t

Page 4: Example of a link prediction application

Product Recommendation

[Figure: graph with nodes labeled i and j and edge labels 2, 2, 5]

Page 5: Another example of a link prediction application

Twitter’s Who to Follow

Page 6: Other Applications

A few more examples:

• Fraud detection (Beutel, Akoglu, Faloutsos, KDD’15)
  ◦ Focuses on discovering surprising link patterns
• Anomaly detection (Rattigan et al., 2005)
  ◦ Focuses on finding unlikely existing links

Page 7: Deterministic Heuristics

Page 8: A Very Naïve Approach

• Every entity is assigned a score
• Score(v) = number of friends that person v already has

Page 9: General Approach

• Assign a connection weight score(u, v) to each non-existing edge (u, v)
• Rank the candidate edges and recommend the ones with the highest scores
• The score can be viewed as a measure of proximity or “similarity” between nodes u and v
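The score-then-rank recipe can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the toy graph, the function names `rank_candidates` and `shared_neighbors`, and the use of a shared-neighbor count as a stand-in score are all assumptions made for the example.

```python
from itertools import combinations

def rank_candidates(graph, score):
    """Return all non-adjacent node pairs, highest score first."""
    candidates = [
        (u, v)
        for u, v in combinations(sorted(graph), 2)
        if v not in graph[u]  # keep only edges that do not exist yet
    ]
    return sorted(candidates, key=lambda pair: score(graph, *pair), reverse=True)

def shared_neighbors(graph, u, v):
    """Stand-in proximity score: number of neighbors u and v have in common."""
    return len(graph[u] & graph[v])

# Hypothetical undirected toy graph stored as node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

ranking = rank_candidates(graph, shared_neighbors)
top_pair = ranking[0]  # the pair that would be recommended first
```

Any of the heuristics that follow can be passed in as `score` without changing the ranking step.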

Page 10: Shortest Path

• score(u, v) = negated length of the shortest path between u and v
• All pairs of nodes that share a neighbor receive the same top score, so all of them are predicted to link
• Problem: the network diameter is often very small, so many pairs tie
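A breadth-first search is enough to compute this score on an unweighted graph. The sketch below is illustrative only; the toy graph and the name `shortest_path_score` are assumptions, and unreachable pairs are given a score of negative infinity by choice.

```python
from collections import deque

def shortest_path_score(graph, u, v):
    """Negated BFS distance from u to v; -inf if v cannot be reached."""
    if u == v:
        return 0
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph[node]:
            if nbr == v:
                return -(dist + 1)
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return float("-inf")

# Hypothetical undirected toy graph stored as node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

score_ad = shortest_path_score(graph, "a", "d")
score_be = shortest_path_score(graph, "b", "e")
score_ae = shortest_path_score(graph, "a", "e")
```

In this toy graph the pairs (a, d) and (b, e) both sit at distance 2 and therefore tie, which is the weakness the slide points out.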

Page 11: Common Neighbors

• The score is the number of neighbors that vertices u and v have in common
• The score can be high simply because the vertices have many neighbors

Page 12: Jaccard Similarity

• Fixes the “too many neighbors” problem of common neighbors by dividing the size of the intersection of the neighbor sets by the size of their union
• The score is between 0 and 1
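A small example makes the contrast between the two scores concrete. Writing Γ(u) for the neighbor set of u (notation assumed here, since the slide formulas did not survive extraction), common neighbors is |Γ(u) ∩ Γ(v)| and Jaccard is |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|. The graph and all names below are hypothetical.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected graph as node -> set of neighbors."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

def common_neighbors(graph, u, v):
    return len(graph[u] & graph[v])

def jaccard(graph, u, v):
    union = graph[u] | graph[v]
    if not union:  # both nodes isolated: define the score as 0
        return 0.0
    return len(graph[u] & graph[v]) / len(union)

# Two hubs share 3 of their 10 neighbors each; p and q share both of theirs.
edges = [(hub, s) for hub in ("hub1", "hub2") for s in ("s1", "s2", "s3")]
edges += [("hub1", f"a{i}") for i in range(7)]
edges += [("hub2", f"b{i}") for i in range(7)]
edges += [("p", "m"), ("p", "n"), ("q", "m"), ("q", "n")]
graph = build_graph(edges)

cn_hubs = common_neighbors(graph, "hub1", "hub2")
cn_small = common_neighbors(graph, "p", "q")
jac_hubs = jaccard(graph, "hub1", "hub2")
jac_small = jaccard(graph, "p", "q")
```

Common neighbors prefers the hub pair (3 versus 2), while Jaccard prefers the pair whose neighborhoods overlap completely.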

Page 13: Adamic / Adar

This score gives more weight to common neighbors that are not shared with many others.

[Figure: Alice, Bob and Carol. A common neighbor with 50 neighbors is too active: less evidence of a missing link between Alice and Bob. A common neighbor with 5 neighbors is less active: more evidence of a missing link between Bob and Carol.]
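The Alice, Bob and Carol picture can be reproduced with the usual form of the Adamic/Adar score, which sums 1 / log(degree) over the common neighbors. That formula is the standard definition rather than something recovered from the slide, and the node names `busy` and `quiet` are invented for the example.

```python
import math
from collections import defaultdict

def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over the common neighbors of u and v."""
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])

graph = defaultdict(set)

def add_edge(u, v):
    graph[u].add(v)
    graph[v].add(u)

# "busy" has 50 neighbors, two of which are Alice and Bob.
for name in ("alice", "bob"):
    add_edge("busy", name)
for i in range(48):
    add_edge("busy", f"fan{i}")

# "quiet" has 5 neighbors, two of which are Bob and Carol.
for name in ("bob", "carol"):
    add_edge("quiet", name)
for i in range(3):
    add_edge("quiet", f"friend{i}")

score_alice_bob = adamic_adar(graph, "alice", "bob")
score_bob_carol = adamic_adar(graph, "bob", "carol")
```

Both pairs have exactly one common neighbor, yet the pair connected through the less active node receives the higher score.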

Page 14: Preferential Attachment

The probability of co-authorship between u and v is proportional to the product of the number of collaborators of u and the number of collaborators of v.
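As a score, this is simply the product of the two degrees. The snippet below is a minimal sketch on a hypothetical toy graph; the function name is an assumption.

```python
def preferential_attachment(graph, u, v):
    """Product of the number of neighbors of u and of v."""
    return len(graph[u]) * len(graph[v])

# Hypothetical undirected toy graph stored as node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

pa_ad = preferential_attachment(graph, "a", "d")
pa_ae = preferential_attachment(graph, "a", "e")
```

Note that the score ignores how far apart u and v are; it only rewards well-connected endpoints.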

Page 15: Katz (1953)

• The score is an exponentially weighted sum over l of the number of paths of length l between u and v
• For small α << 1, the predictions are approximately those of common neighbors
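A truncated version of the Katz sum can be computed with repeated matrix products, since entry (u, v) of the l-th power of the adjacency matrix counts the length-l paths (walks, which may revisit nodes) from u to v. The sketch below assumes the weighting α^l and cuts the sum off at `max_len`; both the truncation and the names are choices made for this example.

```python
def katz_scores(graph, alpha, max_len):
    """Truncated Katz: sum over l = 1..max_len of alpha**l * (number of length-l walks)."""
    nodes = sorted(graph)
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    adj = [[1 if nodes[j] in graph[nodes[i]] else 0 for j in range(n)] for i in range(n)]

    power = [row[:] for row in adj]                   # A^1
    total = [[alpha * x for x in row] for row in adj]
    for length in range(2, max_len + 1):
        # power becomes A^length; entry (i, j) counts walks of that length.
        power = [
            [sum(power[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        for i in range(n):
            for j in range(n):
                total[i][j] += alpha ** length * power[i][j]
    return {(u, v): total[index[u]][index[v]] for u in nodes for v in nodes}

# Hypothetical undirected toy graph stored as node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

scores = katz_scores(graph, alpha=0.01, max_len=4)
```

With α = 0.01 the length-2 term dominates for non-adjacent pairs, so the pair (a, d) with two common neighbors outranks (b, e) with one, which in turn outranks (a, e) with none.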

Page 16: Hitting Time

The score is defined in terms of H_{u,v}, the random walk hitting time between u and v.
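The hitting time H_{u,v} is the expected number of steps a random walk started at u needs to first reach v. It satisfies h(v) = 0 and h(x) = 1 + average of h over the neighbors of x, which the sketch below solves by simple fixed-point iteration. The code assumes a connected graph; a common choice in the literature is to use the negated hitting time as the score, but since the slide's formula did not survive, that sign convention is an assumption here.

```python
def hitting_time(graph, source, target, tol=1e-10, max_iter=100_000):
    """Expected number of random-walk steps to first reach target from source."""
    h = {node: 0.0 for node in graph}  # h[target] stays 0 throughout
    for _ in range(max_iter):
        delta = 0.0
        for node in graph:
            if node == target:
                continue
            # One step, then the average remaining time from a random neighbor.
            new = 1.0 + sum(h[nbr] for nbr in graph[node]) / len(graph[node])
            delta = max(delta, abs(new - h[node]))
            h[node] = new
        if delta < tol:
            break
    return h[source]

# Hypothetical undirected toy graph stored as node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

h_ed = hitting_time(graph, "e", "d")  # e's only neighbor is d
h_de = hitting_time(graph, "d", "e")  # the walk can wander before finding e
score_de = -h_de                      # assumed sign convention for ranking
```

The two directions differ, so the hitting time is not symmetric in u and v.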

Page 17: Personalized PageRank

score(u, v) = stationary probability that the walker is at v under the following random walk:
  ◦ With probability α, jump back to u
  ◦ With probability 1 – α, go to a random neighbor of the current node
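The stationary distribution of this walk can be approximated by power iteration. The following is a minimal sketch; the default α = 0.15, the iteration count, and the toy graph are assumptions, and every node is assumed to have at least one neighbor.

```python
def personalized_pagerank(graph, u, alpha=0.15, iters=200):
    """Approximate stationary distribution of the walk that restarts at u."""
    p = {node: 0.0 for node in graph}
    p[u] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in graph}
        nxt[u] += alpha * sum(p.values())  # restart: jump back to u
        for node, mass in p.items():
            share = (1.0 - alpha) * mass / len(graph[node])
            for nbr in graph[node]:        # otherwise move to a random neighbor
                nxt[nbr] += share
        p = nxt
    return p

# Hypothetical undirected toy graph stored as node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

ppr = personalized_pagerank(graph, "a")
# ppr[v] is the score of candidate v from the point of view of node a.
```

Because the restart always returns to u, nodes that are easy to reach from u accumulate more probability than distant ones.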

Page 18: SimRank

• N(u) and N(v) are the in-degrees of nodes u and v
• Defined only for directed graphs
• score(u, v) ∊ [0, 1]; if u = v, then score(u, v) = 1
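SimRank is usually defined recursively: two nodes are similar if their in-neighbors are similar. The sketch below uses the common form score(u, v) = C / (N(u) N(v)) times the sum of score(a, b) over in-neighbors a of u and b of v, with a decay constant C. The slide's own formula did not survive extraction, so this form, the value C = 0.8, and the example graph are assumptions.

```python
def simrank(in_neighbors, decay=0.8, iters=10):
    """Iterate the SimRank recursion on a directed graph given as node -> set of in-neighbors."""
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif not in_neighbors[u] or not in_neighbors[v]:
                    new[(u, v)] = 0.0  # no in-neighbors: nothing to compare
                else:
                    total = sum(
                        sim[(a, b)] for a in in_neighbors[u] for b in in_neighbors[v]
                    )
                    new[(u, v)] = decay * total / (
                        len(in_neighbors[u]) * len(in_neighbors[v])
                    )
        sim = new
    return sim

# Hypothetical directed graph: univ -> prof_a, univ -> prof_b,
# prof_a -> student_a, prof_b -> student_b.
in_neighbors = {
    "univ": set(),
    "prof_a": {"univ"},
    "prof_b": {"univ"},
    "student_a": {"prof_a"},
    "student_b": {"prof_b"},
}

sim = simrank(in_neighbors)
```

The two professors are similar because they are pointed to by the same node, and that similarity propagates, discounted by the decay, to their students.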

Page 19: Latent Space Models

Page 20: Low Rank Reconstruction

• Represent the adjacency matrix $A$ with a lower-rank matrix $A_k$:

$A \approx V_{n \times k} \, U_{k \times n} = A_k$

• If $A_k(u, v)$ has a large value for a missing link, $A(u, v) = 0$, then recommend link (u, v)
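One way to obtain a rank-k matrix is a truncated singular value decomposition. The slide does not say which factorization it has in mind, so the use of SVD here is an assumption, as are the function names and the toy adjacency matrix. The sketch relies on NumPy.

```python
import numpy as np

def low_rank_scores(adj, k):
    """Best rank-k approximation of adj (in the least-squares sense) via SVD."""
    u, s, vt = np.linalg.svd(adj)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def recommend(adj, k):
    """Return the non-adjacent pair (i, j), i < j, with the largest reconstructed value."""
    approx = low_rank_scores(adj, k)
    n = adj.shape[0]
    candidates = [
        (approx[i, j], i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if adj[i, j] == 0
    ]
    return max(candidates)[1:]

# Hypothetical undirected toy graph on 5 nodes (rows and columns 0..4).
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

approx = low_rank_scores(adj, k=2)
pair = recommend(adj, k=2)
```

Choosing k is a modeling decision: a small k smooths the matrix aggressively, while k equal to the number of nodes reproduces the original matrix and recommends nothing new.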

Page 21: Product Recommendations (Non-negative Matrix Factorization)

• Users have a propensity rate to buy from a certain category of products (W)
• Within a category, some products have a propensity rate to be bought (H)

[Figure: graph with nodes labeled i and j and edge labels 2, 2, 5]
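A non-negative factorization of a user-by-product matrix into W (users by categories) and H (categories by products) can be sketched with the classic multiplicative update rules for squared error. The slide does not specify a fitting algorithm, so the update rule, the purchase matrix, and every name below are assumptions. The sketch relies on NumPy.

```python
import numpy as np

def nmf(purchases, n_categories, iters=500, seed=0, eps=1e-9):
    """Factor purchases ~ W @ H with W, H >= 0 using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_users, n_products = purchases.shape
    w = rng.random((n_users, n_categories))     # user propensity per category
    h = rng.random((n_categories, n_products))  # product propensity per category
    for _ in range(iters):
        # Each update multiplies by a non-negative ratio, so signs never flip.
        h *= (w.T @ purchases) / (w.T @ w @ h + eps)
        w *= (purchases @ h.T) / (w @ h @ h.T + eps)
    return w, h

# Hypothetical purchase counts: rows are users, columns are products.
purchases = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 5.0],
    [0.0, 0.0, 4.0, 4.0],
])

w, h = nmf(purchases, n_categories=2)
predicted = w @ h  # entries for unobserved (user, product) pairs act as scores
```

Entries of `predicted` at positions where `purchases` is zero can then be ranked to suggest products a user has not bought yet.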

Page 22: Example Euclidean Generative Model

• Vertices are uniformly distributed in a latent unit hyperplane
• Vertices that are close in the latent space are more likely to be connected
• Probability of an edge = f(distance in latent space)

The problem of link prediction is to infer distances in the latent space, i.e., to find the nearest neighbor in latent space that is not currently linked.

[Figure: vertices on the unit plane]

Hoff, P. D., Raftery, A. E., & Handcock, M. S. (2002). Latent space approaches to social network analysis. J. Amer. Stat. Assoc., 97(460), 1090–1098.

Page 23: Bayesian Networks

• Directed graphical models: Bayesian networks
• Undirected graphical models: Markov networks

Bayesian networks and Probabilistic Relational Models (Getoor et al., 2001; Neville & Jensen, 2003):

• Capture the dependence of link existence on attributes of entities
• Constrain probabilistic dependencies to a directed acyclic graph

Page 24: Example of Today’s Approaches

Whom To Follow and Why

Credit: N. Barbieri, F. Bonchi, G. Manco, KDD’14

Page 25: Relational Markov Networks

• A Relational Markov Network (RMN) specifies cliques and the potentials between attributes of related entities at the template level
• A single model provides a distribution for the entire graph
• Train the model to maximize the probability of the observed edges and labels
• Use the trained model to predict edges and unknown attributes

Page 26: Latent Space Rationale

[Figure labels: Generative model; Model parameters θ (latent space); Observed edges; Link Prediction Heuristic; Node u; Node v; Most likely neighbor of node v?]

Page 27: Relationship between Deterministic & Probabilistic Methods?

Yes (Sarkar, Chakrabarti, Moore, COLT’10)

Page 28: Empirically

[Figure: link prediction accuracy* for Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths]

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

How do we justify these observations? Especially if the graph is sparse?

Credit: Sarkar, Chakrabarti, Moore

Page 29: Raftery’s Model (cont)

Two sources of randomness:
• Point positions: uniform in D-dimensional space
• Linkage probability: logistic with parameters α, r

α, r and D are known.

[Figure: logistic linkage probability versus distance, with axis marks 1 and ½; labels: “radius r”, “α determines the steepness”, “Higher probability of linking”]

Credit: Sarkar, Chakrabarti, Moore
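The two sources of randomness can be simulated directly. The sketch below assumes the logistic form P(link | d) = 1 / (1 + exp(α (d − r))), which equals ½ at distance r and becomes steeper as α grows; the exact parameterization on the slide did not survive extraction, so treat this form, the unit hypercube for the positions, and all names as assumptions.

```python
import math
import random

def link_probability(distance, alpha, radius):
    """Logistic link probability: 1/2 at distance == radius, steeper for larger alpha."""
    return 1.0 / (1.0 + math.exp(alpha * (distance - radius)))

def sample_graph(n, dim, alpha, radius, seed=0):
    """Sample latent positions, then sample each edge independently given the distance."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]  # uniform positions
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if rng.random() < link_probability(d, alpha, radius):
                edges.add((i, j))
    return points, edges

points, edges = sample_graph(n=30, dim=2, alpha=10.0, radius=0.3)
p_near = link_probability(0.1, alpha=10.0, radius=0.3)
p_far = link_probability(0.5, alpha=10.0, radius=0.3)
```

Link prediction in this model amounts to working backwards: from the sampled `edges` alone, guess which unlinked pairs are closest in the hidden `points`.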

Page 30: Raftery’s Model: Probability of New Edge

Logistic function integrated over the volume determined by d_ij

[Figure: nodes i and j]

Credit: Sarkar, Chakrabarti, Moore

Page 31: Connection to Common Neighbors

[Figure and equation not recoverable; surviving labels: nodes i, j, k; radius r; “No. common neigh.”; “Net size”; “Decreases with N”]

• Empirical Bernstein bounds (Maurer & Pontil, 2009)
• Distinct radius per node gives Adamic/Adar

Page 32: Supervised Learning Approaches

Create a binary classifier that predicts whether an edge exists between two nodes.

Page 33: Types of Features

Features:
• Label features: keywords, frequency of an item
• Topological features: shortest distance, degree centrality, PageRank index

Link Prediction Heuristic

Classifier:
• SVM
• Decision Trees
• Deep Belief Network (DBN)
• K-Nearest Neighbors (KNN)
• Naive Bayes
• Radial Basis Function (RBF)
• Logistic Regression
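The supervised setup can be sketched end to end with two topological features and a logistic regression trained by plain gradient descent. Everything here is a toy assumption: the snapshot graph, the set of edges that appear later, the choice of features (common neighbors and degree product), and the hand-written trainer, which stands in for whichever classifier from the list one prefers.

```python
import math
from itertools import combinations

def pair_features(graph, u, v):
    """Bias term plus two topological features for the pair (u, v)."""
    common = len(graph[u] & graph[v])
    degree_product = len(graph[u]) * len(graph[v])
    return [1.0, float(common), float(degree_product)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def log_loss(weights, rows, labels):
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict(weights, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)

def train(rows, labels, lr=0.1, epochs=2000):
    """Logistic regression fitted by batch gradient descent on the mean log-loss."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict(weights, x) - y
            for k, xi in enumerate(x):
                grad[k] += error * xi / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

# Hypothetical snapshot at time t and the edges that appear by time t'.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
future_edges = {("a", "d")}

# One training example per pair that is not yet linked at time t.
pairs = [(u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]]
rows = [pair_features(graph, u, v) for u, v in pairs]
labels = [1 if pair in future_edges else 0 for pair in pairs]

weights = train(rows, labels)
probabilities = {pair: predict(weights, x) for pair, x in zip(pairs, rows)}
```

In practice the features would be computed on one snapshot and the labels taken from a later one, exactly as in the Goal slide, and the model would be evaluated on pairs it was not trained on.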