
Local Approximation of PageRank and Reverse PageRank Li-Tal Mashiach Advisor: Dr. Ziv Bar-Yossef 13/03/08


Page 1

Local Approximation of PageRank and Reverse PageRank

Li-Tal Mashiach
Advisor: Dr. Ziv Bar-Yossef

13/03/08

Page 2

Overview

Review of PageRank
Local PageRank approximation
Algorithm
Lower bounds
PageRank vs. Reverse PageRank
Applications of Reverse PageRank

Page 3

PageRank

Most search engines analyze the hyperlink structure to order search results

PageRank
• Important measure of ranking for all major search engines

Page 4

Review of PageRank

$$PR(p) = \frac{1-\alpha}{n} + \alpha \sum_{q \to p} \frac{PR(q)}{OutDeg(q)}$$

• $\alpha$: damping factor
• $\frac{1-\alpha}{n}$: base rank
• $\sum_{q \to p}$: sum of the in-neighbors' ranks
• $\frac{PR(q)}{OutDeg(q)}$: rank divided among all out-neighbors

Page 5

PageRank as a Random Walk

A random surfer is visiting the web:
• With probability $\alpha$, selects a random out-link
• With probability $1-\alpha$, jumps to a random web page

Page 6

Global PageRank Computation

Run power method
• Initialize: $x^0 = \left(\frac{1}{n}, \ldots, \frac{1}{n}\right)$
• Repeat until convergence: $x^{k+1} = x^k M$

$$M(u,v) = \begin{cases} \frac{1-\alpha}{n} + \frac{\alpha}{outdeg(u)}, & \text{if } u \to v \text{ is an edge} \\ \frac{1-\alpha}{n}, & \text{otherwise} \end{cases}$$

Challenges:
• Holding the whole web graph
• Multiplying a matrix by a vector
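The power method on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `pagerank_power_method`, the dictionary representation, and the tolerance are assumptions, and every node is assumed to have at least one out-link so that each row of M sums to 1.

```python
def pagerank_power_method(out_links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate x <- x M until the L1 change drops below tol.

    out_links maps every node to a non-empty list of its out-neighbors
    (assumption: no dangling nodes, so each row of M sums to 1).
    """
    nodes = list(out_links)
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}  # x^0 = (1/n, ..., 1/n)
    for _ in range(max_iter):
        # The (1 - alpha)/n entries of M contribute (1 - alpha)/n * sum(x),
        # and sum(x) stays 1 when there are no dangling nodes.
        nxt = {v: (1.0 - alpha) / n for v in nodes}
        for u in nodes:
            share = alpha * x[u] / len(out_links[u])
            for v in out_links[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - x[v]) for v in nodes)
        x = nxt
        if delta < tol:
            break
    return x


# Tiny hypothetical graph: a <-> b, and c links to a.
graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = pagerank_power_method(graph)
```

Node `c` has no in-links, so its rank is just the base rank (1 - alpha)/n. Even this toy version touches every node and edge in each iteration, which is the cost the slide lists under "Challenges".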

Page 7

Local PR Approximation

Global PR calculates the PR of all pages
Sometimes we are interested in the PR of a small number of pages
• A person interested in the PR of their homepage
• An online business interested in the PR of its own website and its competitors' websites
Do we need to calculate the PR of the whole graph for that?

Page 8

Problem Statement [Chen, Gan, Suel, 2004]

Given: local access to a directed graph G and a target node $u \in G$

Output: PR(u)

Local access: a link server that, given a node $v \in G$, returns neighbors(v)

Cost: number of queries to the link server
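To make the cost model concrete, here is a small sketch of a link server that counts queries. The class name `LinkServer` and the choice to return both in-neighbors and out-neighbors per query are assumptions for illustration; the slide only says that the server returns neighbors(v).

```python
class LinkServer:
    """Hypothetical in-memory link server that counts neighbor queries."""

    def __init__(self, edges):
        self._in = {}
        self._out = {}
        for src, dst in edges:
            self._out.setdefault(src, set()).add(dst)
            self._in.setdefault(dst, set()).add(src)
        self.queries = 0  # the cost measure: number of queries issued

    def neighbors(self, v):
        """Return (in-neighbors, out-neighbors) of v and charge one query."""
        self.queries += 1
        return sorted(self._in.get(v, ())), sorted(self._out.get(v, ()))


server = LinkServer([("a", "u"), ("b", "u"), ("u", "c")])
in_nbrs, out_nbrs = server.neighbors("u")
```

A local algorithm is then judged by the final value of `server.queries`, not by its running time.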

Page 9

Overview

Review of PageRank
Local PageRank approximation
Algorithm
Lower bounds
PageRank vs. Reverse PageRank
Applications of Reverse PageRank

Page 10

Another Characterization of PR [Jeh, Widom, 2003]

$inf_t(v,u)$ – the fraction of the PR score of v that flows to u on paths of length t

$$inf_t(v,u) = \sum_{p \in paths_t(v,u)} \prod_{i=0}^{t-1} \frac{1}{outdeg(u_i)}$$

where each path is $p = (u_0, u_1, \ldots, u_t)$ with $u_0 = v$ and $u_t = u$

$$PR(p) = \frac{1-\alpha}{n} + \alpha \sum_{q \to p} \frac{PR(q)}{OutDeg(q)}$$

[Figure: example graph with nodes v, u1, u2 and u]

Page 11

Another Characterization of PR [Jeh, Widom, 2003]

$PR_r(u)$ – PR score that flows into u from nodes at distance at most r from u

$$PR_r(u) = \frac{1-\alpha}{n} \sum_{t=0}^{r} \alpha^t \sum_{v \in G} inf_t(v,u)$$

Theorem:

$$PR(u) = \lim_{r \to \infty} PR_r(u)$$

[Figure: example graph with nodes v, u1, u2 and u]

Page 12

Local PR Brute Force Algorithm [Chen, Gan, Suel, 2004]

Goal: calculate $PR_r(u)$ for a sufficiently large r

Algorithm:
• Crawl backwards the sub-graph of radius r around u
• For each node v at layer t calculate $inf_t(v,u)$:

$$inf_t(v,u) = \frac{1}{outdeg(v)} \sum_{w : v \to w} inf_{t-1}(w,u)$$

• Sum up the weighted influence values:

$$PR_r(u) = \frac{1-\alpha}{n} \sum_{t=0}^{r} \alpha^t \sum_{v \in G} inf_t(v,u)$$

[Figure: node v with out-neighbors w1 and w2 on the way to u]
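The recurrence and the weighted sum on this slide can be sketched as a layer-by-layer backward crawl. This is an illustrative reading of the algorithm, not the implementation of Chen, Gan and Suel: the helper names `build_graph` and `local_pagerank` are made up, and in-neighbors and out-degrees are read from in-memory dictionaries instead of a real link server.

```python
from collections import defaultdict


def build_graph(edges):
    """Index a list of (src, dst) edges by in-neighbors and out-degree."""
    in_nbrs, out_deg = defaultdict(list), defaultdict(int)
    for src, dst in edges:
        in_nbrs[dst].append(src)
        out_deg[src] += 1
    return in_nbrs, out_deg


def local_pagerank(in_nbrs, out_deg, u, n, r, alpha=0.85):
    """Compute PR_r(u) by crawling backwards from u for r layers."""
    layer = {u: 1.0}  # inf_0(u, u) = 1
    total = 0.0
    for t in range(r + 1):
        total += alpha ** t * sum(layer.values())
        nxt = defaultdict(float)
        # inf_{t+1}(v, u) = (1 / outdeg(v)) * sum over edges v -> w of inf_t(w, u)
        for w, inf_w in layer.items():
            for v in in_nbrs.get(w, ()):
                nxt[v] += inf_w / out_deg[v]
        layer = nxt
    return (1 - alpha) / n * total


# Hypothetical 5-node graph: a -> u, b -> u, b -> c, d -> a.
in_nbrs, out_deg = build_graph([("a", "u"), ("b", "u"), ("b", "c"), ("d", "a")])
estimate = local_pagerank(in_nbrs, out_deg, "u", n=5, r=2)
```

Here layer 1 contributes 1 + 1/2 (b splits its rank between u and c) and layer 2 contributes 1 (d reaches u through a), so the estimate is (0.15/5)(1 + 0.85 · 1.5 + 0.85² · 1).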

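Page 13

Local PR Brute Force Algorithm

Example with $\alpha = 0.85$ and $n = 14$:

$$PR_0(u) = \frac{1-\alpha}{n} = 0.0107$$

$$PR_1(u) = PR_0(u) + \frac{1-\alpha}{n}\,\alpha\left(\frac{1}{2} + \frac{1}{4}\right) = 0.0175$$

$$PR_2(u) = PR_1(u) + \frac{1-\alpha}{n}\,\alpha^2\left(\frac{1}{2}\cdot\frac{1}{2} + \frac{1}{4}\cdot 1\right) = 0.0214$$

$$PR_3(u) = PR_2(u) + \frac{1-\alpha}{n}\,\alpha^3\left(\frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{2} + \frac{1}{4}\cdot\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot\frac{1}{3}\cdot 1\right) = 0.0236$$

[Figure: example graph around the target node u]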
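The arithmetic of this example can be checked with a short script. It assumes the values shown on the slide, alpha = 0.85 and n = 14, and takes the per-layer influence sums as they are read from the slide's fractions (the underlying figure is not available, so the grouping of the factors into paths is an assumption; only the layer totals matter here).

```python
alpha, n = 0.85, 14
base = (1 - alpha) / n

# Total influence on u per layer t = 0..3, as read from the slide's fractions.
layer_influence = [
    1.0,
    1/2 + 1/4,
    1/2 * 1/2 + 1/4 * 1,
    1/2 * 1/2 * 1/2 + 1/4 * 1/2 * 1 + 1/4 * 1/3 * 1,
]

partial_pr = []
total = 0.0
for t, influence in enumerate(layer_influence):
    total += base * alpha ** t * influence  # adds the layer-t term of PR_r(u)
    partial_pr.append(round(total, 4))
```

The rounded partial sums reproduce the slide's values 0.0107, 0.0175, 0.0214 and 0.0236, and each extra layer adds less than the previous one.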

Page 14

Optimization by Pruning

Heuristic to improve the cost
• Prune all nodes whose influence is below some threshold
• Was shown empirically to be sometimes better [Chen, Gan, Suel, 2004]
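One possible reading of the pruning heuristic is sketched below: a node whose influence falls below the threshold is dropped from its layer, so it is neither counted nor expanded. The slide does not say whether pruned nodes still contribute their own influence, so this behaviour, the function name `pruned_local_pagerank`, and the way queries are counted (one per expanded node) are all assumptions.

```python
from collections import defaultdict


def pruned_local_pagerank(in_nbrs, out_deg, u, n, r, alpha=0.85, threshold=0.0):
    """Backward crawl from u that drops nodes with influence below threshold.

    Returns (estimate of PR_r(u), number of nodes whose in-links were fetched).
    """
    layer = {u: 1.0}
    total, queries = 0.0, 0
    for t in range(r + 1):
        # Assumption: pruned nodes are removed before they are counted.
        layer = {v: x for v, x in layer.items() if x >= threshold}
        total += alpha ** t * sum(layer.values())
        if t == r:
            break  # the last layer is counted but not expanded
        nxt = defaultdict(float)
        for w, inf_w in layer.items():
            queries += 1
            for v in in_nbrs.get(w, ()):
                nxt[v] += inf_w / out_deg[v]
        layer = nxt
    return (1 - alpha) / n * total, queries


# Hypothetical graph: a -> u, b -> u, b -> c, d -> a (n = 5 nodes).
in_nbrs = {"u": ["a", "b"], "c": ["b"], "a": ["d"]}
out_deg = {"a": 1, "b": 2, "d": 1}
full, full_queries = pruned_local_pagerank(in_nbrs, out_deg, "u", n=5, r=2)
pruned, pruned_queries = pruned_local_pagerank(
    in_nbrs, out_deg, "u", n=5, r=2, threshold=0.6
)
```

With threshold 0.6, node b (influence 1/2) is dropped, which saves one query and lowers the estimate from about 0.0899 to about 0.0772. This is the trade-off the heuristic makes between cost and accuracy.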

Page 15

Analysis of the Algorithm

This algorithm requires at most $d^r$ queries
• r – number of iterations until the PR random walk almost converges
• d – maximum in-degree of the graph

In case of slow PR convergence or high in-degree, the algorithm is not feasible

Page 16

Limitations of the Algorithm

In the web graph there are a lot of web pages with high in-degree

Conclusion: The algorithm is frequently unsuitable for the web graph

Is this a limitation of this specific algorithm only?

Page 17

Lower Bounds

Local PR approx. is hard for graphs with:
• High in-degree nodes
• Slow convergence of the PR random walk

[Table: lower bounds on the number of queries for randomized and deterministic algorithms (rows), in the high in-degree case and in the slow PageRank convergence case (columns)]

Page 18

Proof

By reduction from the OR problem

Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$

Output: $x_1 \vee x_2 \vee x_3 \vee \ldots \vee x_m$

$\Omega(m)$ queries are needed even for randomized algorithms

Page 19

The Reduction

A - Alg. that calculates local PR
B - Alg. that computes the OR function

[Figure: an example input x = 1 0 1 … of length m and the corresponding graph $G_x$ with target node u]

Page 20

The Reduction

Claim 1: Let |x| be the number of 1's in x. Then,

$$PR(u) = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2 k\,|x|\right)$$

In particular, for |x| = 0 and |x| = 1:

$$p_0 = \frac{1-\alpha}{n}\left(1 + \alpha m\right)$$

$$p_1 = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2 k\right)$$

Claim 2: When $m = k = \sqrt{n}$, $p_1 \ge c \cdot p_0$

[Figure: an example input x = 1 0 1 … of length m and the corresponding graph $G_x$ with target node u]

Page 21

Proof Cont.

Given an input x, B simulates A on $G_x$, u
• If $PR_x(u) \ge p_1$ => OR=1
• If $PR_x(u) \le p_0$ => OR=0

It means that the maximum number of queries A uses is $\Omega(m) = \Omega(\sqrt{n})$
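The closed form in Claim 1 can be checked numerically on a small instance. The sketch below assumes one construction of $G_x$ that is consistent with the claim, since the original figure is not available: u has m in-neighbors v_0, …, v_{m-1}, and each v_i with x_i = 1 has k in-neighbors of its own. The function names, node labels, and the concrete values of k and n are illustrative only.

```python
from collections import defaultdict


def build_gx(x, k):
    """Assumed reduction graph: v_i -> u for every bit, plus k extra
    in-neighbors w_i_j -> v_i for every bit x_i that equals 1."""
    edges = []
    for i, bit in enumerate(x):
        edges.append((f"v{i}", "u"))
        if bit:
            edges.extend((f"w{i}_{j}", f"v{i}") for j in range(k))
    return edges


def pr_by_influence(edges, target, n, r, alpha):
    """PR_r(target) via the backward influence recurrence."""
    in_nbrs, out_deg = defaultdict(list), defaultdict(int)
    for src, dst in edges:
        in_nbrs[dst].append(src)
        out_deg[src] += 1
    layer, total = {target: 1.0}, 0.0
    for t in range(r + 1):
        total += alpha ** t * sum(layer.values())
        nxt = defaultdict(float)
        for w, inf_w in layer.items():
            for v in in_nbrs.get(w, ()):
                nxt[v] += inf_w / out_deg[v]
        layer = nxt
    return (1 - alpha) / n * total


def claim1(x, k, n, alpha):
    """Closed form from Claim 1: (1-a)/n * (1 + a*m + a^2 * k * |x|)."""
    m = len(x)
    return (1 - alpha) / n * (1 + alpha * m + alpha ** 2 * k * sum(x))


x, k, n, alpha = [1, 0, 1], 4, 100, 0.85
computed = pr_by_influence(build_gx(x, k), "u", n, r=2, alpha=alpha)
p0 = claim1([0, 0, 0], k, n, alpha)
p1 = claim1([1, 0, 0], k, n, alpha)
```

Because every v_i and w_i_j has out-degree 1, each contributes influence 1, which is why the bits of x shift PR(u) by exactly α²k each. An algorithm that separates p_1 from p_0 therefore decides the OR of x.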

Page 22

Conclusion

Local PageRank approximation is frequently infeasible on the web graph

Page 23

PageRank vs. Reverse PageRank

The local approximation algorithm should perform better on the Reverse Web Graph

[Figure: the Web Graph and the Reverse Web Graph compared on two properties: fast convergence and bounded in-degree]
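Reverse PageRank is PageRank computed on the graph with every link reversed, so in-degrees and out-degrees swap roles. A minimal sketch of that swap follows; the helper names `reverse_graph` and `max_in_degree` and the toy edge list are hypothetical.

```python
from collections import Counter


def reverse_graph(edges):
    """Reverse every link: (src, dst) becomes (dst, src)."""
    return [(dst, src) for src, dst in edges]


def max_in_degree(edges):
    """Largest number of incoming links over all nodes."""
    counts = Counter(dst for _, dst in edges)
    return max(counts.values()) if counts else 0


# Toy graph: three pages link to a popular hub, which links to one page.
web = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "d")]
reverse_web = reverse_graph(web)
```

In this toy graph the maximum in-degree drops from 3 to 1 after reversal. The backward crawl's cost grows with in-degree, so a graph whose out-degrees are small is cheap to crawl backwards once it is reversed.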

Page 24

Experimental Setup

280,000 page crawl of the www.stanford.edu domain

22,000 page crawl of the www.cnn.com site

Page 25

Convergence Rate

Page 26

Crawl Growth Rate

In-deg: 38,606
Out-deg: 255

Page 27

Performance of the Algorithm

Page 28

Applications of Reverse PageRank

TrustRank
Influencers in social networks
Hub web pages
Measuring semantic relatedness
Finding crawl seeds

[Legend: Local RPR app.; Novel app.]

Page 29

Influencers in Social Networks

Goal: Market a new product to be adopted by a large fraction of a social network

Method:
• Initially target a few influential members
• Trigger a word of mouth process
• Results in a large number of users

How should we choose these seed members?

Page 30

Why RPR? [Java et al. 2006]

Nodes with high RPR
• Have short paths to many other nodes in the network
• Are frequently the only gateways to these nodes

Page 31

Influencers in Social Networks

Page 32

Influencers in Social Networks

www.Livejournal.com, 3.5 million nodes

[Figure panels: 1-level BFS crawl; 4-level BFS crawl]

Page 33

Hub Web Pages

Goal: Find good starting points for search
• Difficult to formulate queries
• Broad search tasks
• Need to understand the surrounding context

Method: Find pages with short paths to many relevant pages

Page 34

Why RPR? [Fogaras, 2003]

High RPR pages tend to have short paths to many authorities

Page 35

Hub Web Pages

Meta-search engine over Yahoo! search

Fraction of hubs in the top 20 results for the queries:
1. “computer scientists”
2. “global warming”
3. “folk dancing”
4. “queen Elizabeth”

Page 36

Measuring Semantic Relatedness

Goal: Find the relatedness between two concepts
• For natural language processing applications

Method: Use a taxonomy like the ODP or Wikipedia

Page 37

Why RPR?

b is a strong sub-concept of a in a taxonomy if
• there are many short paths from a to b

RPR - measure of b as sub-concept of a
• $RPR_a(b)$

RPR Similarity - two concepts will be similar in case they have significant overlap between their RPR vectors
• similarity between the vectors $RPR_a$ and $RPR_b$
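The slide does not say which similarity measure is applied to the two RPR vectors. As one plausible choice, the sketch below uses cosine similarity over sparse vectors stored as dictionaries; the function name `cosine_similarity`, the measure itself, and the example scores are all assumptions, not taken from the talk.

```python
import math


def cosine_similarity(rpr_a, rpr_b):
    """Cosine similarity of two sparse score vectors given as dicts.

    Assumption: cosine is only one possible way to quantify the
    "overlap" between RPR vectors mentioned on the slide.
    """
    dot = sum(score * rpr_b.get(node, 0.0) for node, score in rpr_a.items())
    norm_a = math.sqrt(sum(s * s for s in rpr_a.values()))
    norm_b = math.sqrt(sum(s * s for s in rpr_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector overlaps with nothing
    return dot / (norm_a * norm_b)


# Made-up RPR vectors over taxonomy nodes, for illustration only.
rpr_physics = {"Science": 0.4, "Physics": 0.5, "Nuclear": 0.1}
rpr_pizza = {"Food": 0.7, "Italy": 0.3}
same = cosine_similarity(rpr_physics, rpr_physics)
disjoint = cosine_similarity(rpr_physics, rpr_pizza)
```

Vectors that put their weight on the same taxonomy nodes score close to 1, and vectors with no shared nodes score 0, matching the "significant overlap" idea on the slide.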

Page 38

Measuring Semantic Relatedness

www.dmoz.org taxonomy
WordSimilarity-353

Relatedness to “Computer”

Human judgement     RPR          RPR similarity   Path-based
Software (8.5)      Software     Software         Laboratory
Keyboard (7.58)     Internet     Internet         Keyboard
Internet (7.62)     Keyboard     Keyboard         News
Laboratory (6.78)   News         Laboratory       Software
News (4.47)         Laboratory   News             Internet
                    0.6          0.6              -0.4

Relatedness to “Einstein”

RPR             RPR similarity   Path-based
Physics Prize   Newton Isaac     Agriculture
Physics         Physics Prize    Internet
Newton Isaac    Physics          Nuclear
Nuclear         Nuclear          Physics Prize
Agriculture     World War II     Pizza
United States   Agriculture      United States
World War II    Helicopter       Physics
Internet        Ronald Reagan    Newton Isaac
Helicopter      Italy            Italy
Ronald Reagan   Internet         World War II
Italy           United States    Ronald Reagan
Pizza           Sodoku           Helicopter
Sodoku          Pizza            Sodoku

Page 39

Finding Crawl Seeds

Goal: Quickly discover new content on the web while incurring as little overhead as possible
• Overhead: old pages / new pages

Method: Find good seeds

Page 40

Why RPR?

A page p has high RPR if
• Many pages are reachable from p by short paths
• These pages are not reachable from many other pages

[Figure: example with nodes u and v; legend: known page, unknown page]

Page 41

Finding Crawl Seeds

WebBase project, two crawls of ~1,000,000 pages, one week apart
4-level BFS crawl

[Figure panels: Fraction of new pages discovered; Overhead]

Page 42

Summary

Two graph properties make local PageRank approximation hard
The Web Graph is not suitable for local PR approximation
The Reverse Web graph is suitable for local PR approximation
RPR finds nodes that
• have short paths to many other nodes
• are frequently the only gateways to these nodes
Applications of RPR

Page 43

Thanks!

Page 44

Appendix

Page 45

Proof – High in-degree, Deterministic algorithms

By reduction from the majority-by-a-margin problem

Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$ with at least $\left(\frac{1}{2}+\epsilon\right)m$ 0's or at least $\left(\frac{1}{2}+\epsilon\right)m$ 1's

Output: the majority

At least $(1-2\epsilon)m$ queries are needed

Page 46

The Reduction

A - Alg. that calculates local PR
B - Alg. that computes majority-by-a-margin

[Figure: an example input x = 1 0 1 … of length m and the corresponding graph $G_x$ with target node u and nodes V1, V2, V3, … and W1, W2, …, Wm]

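Page 47

The Reduction

Claim 1: Let |x| be the number of 1's in x. Then,

$$PR(u) = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2 |x|\right)$$

$$p_0 = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2\left(\frac{1}{2}-\epsilon\right)m\right)$$

$$p_1 = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2\left(\frac{1}{2}+\epsilon\right)m\right)$$

Claim 2: When $m = (n-1)/2$, $p_1 \ge c \cdot p_0$

[Figure: an example input x = 1 0 1 … of length m and the corresponding graph $G_x$ with target node u and nodes V1, V2, V3, … and W1, W2, …, Wm]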

Page 48

Proof Cont.

Given an input x, B simulates A on $G_x$, u
• If $PR_x(u) \ge p_1$ => The majority bit of x is 1
• If $PR_x(u) \le p_0$ => The majority bit of x is 0

It means that the maximum number of queries A uses ≥ $(1-2\epsilon)m = \Omega(n)$

Page 49

Proof – Slow PR Convergence, Randomized algorithms

By reduction from the OR problem

Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$

Output: $x_1 \vee x_2 \vee x_3 \vee \ldots \vee x_m$

$\Omega(m)$ queries are needed even for randomized algorithms

Page 50

The Reduction

A - Alg. that calculates local PR
B - Alg. that computes the OR function

[Figure: an example input x = 1 0 0 … of length m and the corresponding graph $G_x$ with target node u, a tree T, and sub-trees $S_1, \ldots, S_m$]

Page 51

The Reduction

Claim 2: When ,

1 log

2m n

Claim 1: Let |x| be the number of 1's in x. Then,

log 10

1(2 ) 1

(2 1)mp

n

1 0p c p

1 0 0X=

Gx=

m

u

……

T

S

1

S

m

log 1 log log 11

1(2 ) 1 ((2 ) 1)

(2 1)m m kp

n

Page 52

Proof Cont.

Given an input x, B simulates A on $G_x$, u
• If $PR_x(u) \ge p_1$ => OR=1
• If $PR_x(u) \le p_0$ => OR=0

It means that the maximum number of queries A uses ≥

1 log

2( )m n

Page 53

Proof – Slow PR Convergence, Deterministic algorithms

By reduction from the majority-by-a-margin problem

Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$ with at least $\left(\frac{1}{2}+\epsilon\right)m$ 0's or at least $\left(\frac{1}{2}+\epsilon\right)m$ 1's

Output: the majority

At least $(1-2\epsilon)m$ queries are needed

Page 54

The Reduction

A - Alg. that calculates local PR
B - Alg. that computes majority-by-a-margin

[Figure: an example input x = 1 0 1 … of length m and the corresponding graph $G_x$ with target node u and nodes w1, w2, w3, w4, …, wm-1, wm]

Page 55

The Reduction

Claim 2:Claim 2: When ,/ 2m n1 0p c p

1 0 1X=

m

u

……

……

……

……

w1 w2 w3 w4 wm-1 wm

Claim 1: Let |x| be the number of 1's in x. Then,

log 1log 1

0

1 (2 ) 1 1( ( ) )

2 1 2

nnp m

n

log 1log 1

1

1 (2 ) 1 1( ( ) )

2 1 2

nnp m

n

log 1log 11 (2 ) 1

( ) ( | | )2 1

nn

xPR u xn

Page 56

Proof Cont.

Given an input x, B simulates A on $G_x$, u
• If $PR_x(u) \ge p_1$ => The majority bit of x is 1
• If $PR_x(u) \le p_0$ => The majority bit of x is 0

It means that the maximum number of queries A uses ≥ $(1-2\epsilon)m = \Omega(n)$