1
Local Approximation of PageRank and Reverse PageRank
Li-Tal Mashiach
Advisor: Dr. Ziv Bar-Yossef
13/03/08
2
Overview
• Review of PageRank
• Local PageRank approximation
• Algorithm
• Lower bounds
• PageRank vs. Reverse PageRank
• Applications of Reverse PageRank
3
PageRank
Most search engines analyze the hyperlink structure of the web to order search results
PageRank
• An important ranking measure for all major search engines
4
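Review of PageRank
$$PR(p) = \frac{1-\alpha}{n} + \alpha \sum_{q \to p} \frac{PR(q)}{\mathrm{OutDeg}(q)}$$
• $\alpha$: damping factor
• $\frac{1-\alpha}{n}$: base rank
• $\sum_{q \to p}$: sum of the in-neighbors' ranks
• $\frac{PR(q)}{\mathrm{OutDeg}(q)}$: rank divided among all out-neighbors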
5
PageRank as a Random Walk
A random surfer is visiting the web:
• With probability $\alpha$, selects a random out-link
• With probability $1-\alpha$, jumps to a random web page
6
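Global PageRank Computation
Run the power method:
• Initialize: $x_0 = \left(\frac{1}{n}, \ldots, \frac{1}{n}\right)$
• Repeat until convergence: $x_{k+1} = x_k M$
$$M(u,v) = \begin{cases} \alpha \cdot \frac{1}{\mathrm{outdeg}(u)} + (1-\alpha) \cdot \frac{1}{n}, & \text{if } u \to v \text{ is an edge} \\ (1-\alpha) \cdot \frac{1}{n}, & \text{otherwise} \end{cases}$$
Challenges:
• Holding the whole web graph
• Multiplying a matrix by a vector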
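The power method on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function name `pagerank_power`, the dictionary graph format, the tolerance, and the toy graph are all assumptions, and the sketch assumes every node has at least one out-link so that each row of $M$ sums to 1.

```python
def pagerank_power(out_links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration x_{k+1} = x_k M on a graph given as {node: [out-neighbors]}.

    Assumption: every node has at least one out-link (no dangling nodes).
    """
    nodes = list(out_links)
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}  # x_0 = (1/n, ..., 1/n)
    for _ in range(max_iter):
        # Every node receives the base rank (1 - alpha) / n ...
        nxt = {v: (1.0 - alpha) / n for v in nodes}
        # ... plus alpha times each in-neighbor's rank, split over its out-links.
        for u in nodes:
            share = alpha * x[u] / len(out_links[u])
            for v in out_links[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - x[v]) for v in nodes)
        x = nxt
        if delta < tol:
            break
    return x


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power(graph)
print(ranks)
```

The sketch never builds the matrix $M$ explicitly, but it still touches every node and edge in each iteration, which is the cost the "Challenges" bullets point at.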
7
Local PR Approximation
Global PR computation calculates the PR of all pages
Sometimes we are interested in the PR of only a small number of pages:
• A person interested in the PR of their homepage
• An online business interested in the PR of its own website and its competitors' websites
Do we need to calculate the PR of the whole graph for that?
8
Problem Statement [Chen, Gan, Suel, 2004]
Given: local access to a directed graph $G$ and a target node $u \in G$
Output: $PR(u)$
Local access: a link server that, given a node $v \in G$, returns $\mathrm{neighbors}(v)$
Cost: number of queries to the link server
9
Overview
• Review of PageRank
• Local PageRank approximation
• Algorithm
• Lower bounds
• PageRank vs. Reverse PageRank
• Applications of Reverse PageRank
10
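Another Characterization of PR [Jeh, Widom, 2003]
$\mathrm{inf}_t(v,u)$: the fraction of the PR score of $v$ that flows to $u$ on paths of length $t$
$$\mathrm{inf}_t(v,u) = \sum_{p \in \mathrm{paths}_t(v,u)} \prod_{i=0}^{t-1} \frac{1}{\mathrm{outdeg}(u_i)}, \qquad p = (v = u_0, u_1, \ldots, u_t = u)$$
$$PR(p) = \frac{1-\alpha}{n} + \alpha \sum_{q \to p} \frac{PR(q)}{\mathrm{OutDeg}(q)}$$
[Figure: example graph with nodes $v$, $u_1$, $u_2$, and $u$]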
11
Another Characterization of PR [Jeh, Widom, 2003]
$PR_r(u)$: the PR score that flows into $u$ from nodes at distance at most $r$ from $u$
$$PR_r(u) = \frac{1-\alpha}{n} \sum_{t=0}^{r} \alpha^t \sum_{v \in G} \mathrm{inf}_t(v,u)$$
Theorem:
$$PR(u) = \lim_{r \to \infty} PR_r(u)$$
[Figure: example graph with nodes $v$, $u_1$, $u_2$, and $u$]
12
Local PR Brute Force Algorithm [Chen, Gan, Suel, 2004]
Goal: calculate $PR_r(u)$ for a sufficiently large $r$
Algorithm:
• Crawl backwards the sub-graph of radius $r$ around $u$
• For each node $v$ at layer $t$, calculate $\mathrm{inf}_t(v,u)$:
$$\mathrm{inf}_t(v,u) = \frac{1}{\mathrm{outdeg}(v)} \sum_{w:\, v \to w} \mathrm{inf}_{t-1}(w,u)$$
• Sum up the weighted influence values:
$$PR_r(u) = \frac{1-\alpha}{n} \sum_{t=0}^{r} \alpha^t \sum_{v \in G} \mathrm{inf}_t(v,u)$$
[Figure: example with nodes $v$, $w_1$, $w_2$, and $u$]
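The backward crawl can be sketched as follows. This is a hedged illustration rather than the implementation of Chen, Gan, and Suel: the function name `local_pagerank`, the callbacks `in_neighbors` and `out_degree` (standing in for link-server queries), and the toy graph are assumptions made for the example.

```python
def local_pagerank(u, in_neighbors, out_degree, n, r, alpha=0.85):
    """Approximate PR(u) by PR_r(u) with a backward crawl of radius r.

    in_neighbors(v) and out_degree(v) are hypothetical stand-ins for
    link-server queries. Returns (PR_r(u), number of in_neighbors calls).
    A node reached at several layers is queried once per layer (no caching).
    """
    layer = {u: 1.0}   # inf_0(u, u) = 1
    total = 1.0        # running sum of alpha^t * sum_v inf_t(v, u)
    queries = 0
    for t in range(1, r + 1):
        nxt = {}
        for w, inf_w in layer.items():
            queries += 1
            for v in in_neighbors(w):
                # inf_t(v,u) = (1/outdeg(v)) * sum over out-neighbors w of inf_{t-1}(w,u)
                nxt[v] = nxt.get(v, 0.0) + inf_w / out_degree(v)
        total += alpha ** t * sum(nxt.values())
        layer = nxt
    return (1.0 - alpha) / n * total, queries


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_links = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
approx, cost = local_pagerank(
    "c", lambda v: in_links[v], lambda v: len(out_links[v]), n=3, r=200
)
print(approx, cost)
```

Only the in-neighborhood of the target node is explored, so the cost depends on how fast the layers grow and on how large $r$ must be.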
13
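Local PR Brute Force Algorithm
Example with $\alpha = 0.85$ and $n = 14$:
$$PR_0(u) = \frac{1-\alpha}{n} = 0.0107$$
$$PR_1(u) = PR_0(u) + \frac{1-\alpha}{n}\,\alpha\left(\frac{1}{2} + \frac{1}{4}\right) = 0.0175$$
$$PR_2(u) = PR_1(u) + \frac{1-\alpha}{n}\,\alpha^2\left(\frac{1}{2}\cdot\frac{1}{2} + \frac{1}{4}\cdot 1\right) = 0.0214$$
$$PR_3(u) = PR_2(u) + \frac{1-\alpha}{n}\,\alpha^3\left(\frac{1}{2}\cdot\frac{1}{2}\cdot\frac{1}{2} + \frac{1}{4}\cdot\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot\frac{1}{3}\cdot 1\right) = 0.0236$$
[Figure: example graph around $u$]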
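The numbers in this example can be re-derived with a short calculation. The sketch below assumes $\alpha = 0.85$ and $n = 14$, and takes the per-layer influence sums to be $3/4$, $1/2$, and $1/3$; these sums are a reading of the fractions on the slide, not values stated explicitly in the source.

```python
alpha, n = 0.85, 14
base = (1 - alpha) / n

# Assumed per-layer influence sums, read off the fractions in the example:
# layer 0: 1, layer 1: 1/2 + 1/4, layer 2: 1/2*1/2 + 1/4*1,
# layer 3: 1/2*1/2*1/2 + 1/4*1/2*1 + 1/4*1/3*1
layer_influence = [
    1.0,
    1 / 2 + 1 / 4,
    1 / 2 * 1 / 2 + 1 / 4 * 1,
    1 / 2 * 1 / 2 * 1 / 2 + 1 / 4 * 1 / 2 * 1 + 1 / 4 * 1 / 3 * 1,
]

pr = 0.0
partial = []
for t, inf_sum in enumerate(layer_influence):
    pr += base * alpha ** t * inf_sum  # add the contribution of layer t
    partial.append(round(pr, 4))

print(partial)  # [0.0107, 0.0175, 0.0214, 0.0236]
```

Each additional layer adds less than the previous one, because its contribution is scaled by a further factor of $\alpha$.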
14
Optimization by Pruning
A heuristic to improve the cost:
• Prune all nodes whose influence is below some threshold
• Shown empirically to sometimes perform better [Chen, Gan, Suel, 2004]
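One possible reading of the pruning heuristic is sketched below. The exact rule used by Chen, Gan, and Suel is not given on the slide, so the choice to compare the unweighted influence $\mathrm{inf}_t(v,u)$ against the threshold, the function name `local_pagerank_pruned`, and the toy graph are all assumptions.

```python
def local_pagerank_pruned(u, in_neighbors, out_degree, n, r, threshold, alpha=0.85):
    """Backward crawl in which a node whose influence on u is below
    `threshold` is neither counted nor expanded (assumed pruning rule)."""
    layer = {u: 1.0}
    total = 1.0
    queries = 0
    for t in range(1, r + 1):
        nxt = {}
        for w, inf_w in layer.items():
            queries += 1
            for v in in_neighbors(w):
                nxt[v] = nxt.get(v, 0.0) + inf_w / out_degree(v)
        # Pruning step: keep only nodes with enough influence on u.
        layer = {v: inf for v, inf in nxt.items() if inf >= threshold}
        if not layer:
            break
        total += alpha ** t * sum(layer.values())
    return (1.0 - alpha) / n * total, queries


# Hypothetical toy graph with n = 14 pages: a and b link to u; b also links
# to nine other pages that are not shown, so its out-degree is 10.
in_links = {"u": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
out_deg = {"a": 1, "b": 10, "c": 1, "d": 1}

full_pr, full_cost = local_pagerank_pruned(
    "u", in_links.get, out_deg.get, n=14, r=3, threshold=0.0
)
pruned_pr, pruned_cost = local_pagerank_pruned(
    "u", in_links.get, out_deg.get, n=14, r=3, threshold=0.5
)
print(full_pr, full_cost, pruned_pr, pruned_cost)
```

In this toy case pruning drops the low-influence branch through `b`, so it issues fewer queries and returns a slightly lower estimate.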
15
Analysis of the Algorithm
This algorithm requires at most $d^r$ queries
• $r$: number of iterations until the PR random walk almost converges
• $d$: maximum in-degree of the graph
In case of slow PR convergence or high in-degree, the algorithm is not feasible
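To get a feel for how quickly this bound grows, here is a small illustrative calculation. The values of `d` and `r` are arbitrary examples, not measurements from the talk, and the helper name `query_bound` is hypothetical.

```python
def query_bound(d, r):
    """Upper bound d**r on the number of link-server queries,
    for maximum in-degree d and crawl radius r."""
    return d ** r


for d, r in [(2, 10), (10, 5), (100, 5)]:
    print(f"d={d:>3}, r={r:>2}: at most {query_bound(d, r):,} queries")
```

Even moderate in-degrees make the bound exceed the size of any realistic crawl after a handful of layers.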
16
Limitations of the Algorithm
The web graph contains many pages with high in-degree
Conclusion: The algorithm is frequently unsuitable for the web graph
Is this a limitation of this specific algorithm only?
17
Lower Bounds
Local PR approximation is hard for graphs with:
• High in-degree nodes
• Slow convergence of the PR random walk
High in-degree | Slow PageRank convergence
Randomized alg.
Deterministic alg. n
1 log
2( )n
n d n
log iterationsn
n d n
18
Proof
By reduction from the OR problem
Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$
Output: $x_1 \vee x_2 \vee x_3 \vee \ldots \vee x_m$
$\Omega(m)$ queries are needed even for randomized algorithms
19
The Reduction
[Figure: the graph $G_x$ for the example input $x = 1\,0\,1$, with target node $u$ and a dimension labeled $m$]
A: an algorithm that calculates local PR
B: an algorithm that computes the OR function
20
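The Reduction
[Figure: the graph $G_x$ for the example input $x = 1\,0\,1$, with target node $u$ and a dimension labeled $m$]
Claim 1: Let $|x|$ be the number of 1's in $x$. Then,
$$PR(u) = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2 k\,|x|\right)$$
Claim 2: When $m = k = \sqrt{n}$,
$$p_1 \ge c \cdot p_0$$
where
$$p_0 = \frac{1-\alpha}{n}\left(1 + \alpha m\right), \qquad p_1 = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2 k\right)$$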
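Claim 1 can be sanity-checked numerically under an assumed reading of the construction. In the sketch below, $u$ has $m$ in-neighbors, and for every bit $x_i = 1$ the $i$-th in-neighbor gets $k$ in-neighbors of its own; this structure, the closed form $\frac{1-\alpha}{n}(1 + \alpha m + \alpha^2 k |x|)$, and the helper names `build_gx` and `pr_by_influence` are assumptions, since the figure itself did not survive. PR is computed through the influence characterization, so no rank is redistributed from nodes without out-links.

```python
def pr_by_influence(u, in_links, out_deg, n, r, alpha=0.85):
    """PR_r(u) = (1 - alpha)/n * sum_t alpha^t * sum_v inf_t(v, u)."""
    layer, total = {u: 1.0}, 1.0
    for t in range(1, r + 1):
        nxt = {}
        for w, inf_w in layer.items():
            for v in in_links.get(w, []):
                nxt[v] = nxt.get(v, 0.0) + inf_w / out_deg[v]
        total += alpha ** t * sum(nxt.values())
        layer = nxt
    return (1.0 - alpha) / n * total


def build_gx(x, k):
    """Hypothetical G_x: node ('w', i) links to u for every i; if x[i] == 1,
    k extra nodes ('s', i, j) link to ('w', i)."""
    in_links, out_deg = {"u": []}, {}
    for i, bit in enumerate(x):
        w = ("w", i)
        in_links["u"].append(w)
        out_deg[w] = 1
        if bit:
            in_links[w] = [("s", i, j) for j in range(k)]
            for s in in_links[w]:
                out_deg[s] = 1
    n = 1 + len(out_deg)  # u plus every node that has an out-link
    return in_links, out_deg, n


alpha, k = 0.85, 3
x = [1, 0, 1]
in_links, out_deg, n = build_gx(x, k)
computed = pr_by_influence("u", in_links, out_deg, n, r=2, alpha=alpha)
closed_form = (1 - alpha) / n * (1 + alpha * len(x) + alpha ** 2 * k * sum(x))
print(computed, closed_form)
```

Under these assumptions the two values agree, and flipping a single bit of $x$ from 0 to 1 raises $PR(u)$ by $\frac{1-\alpha}{n}\alpha^2 k$, which is the gap the reduction relies on.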
21
Proof Cont.
Given an input $x$, B simulates A on $G_x$, $u$:
• If $PR_x(u) \ge p_1$ ⇒ OR = 1
• If $PR_x(u) \le p_0$ ⇒ OR = 0
It means that the maximum number of queries A uses is $\Omega(m) = \Omega(\sqrt{n})$
22
Conclusion
Local PageRank approximation is frequently infeasible on the web graph
23
PageRank vs. Reverse PageRank
The local approximation algorithm should perform better on the Reverse Web Graph
Web Graph
Fast convergence
bounded in-degree
Reverse Web Graph
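Reverse PageRank is PageRank computed on the graph with every edge reversed. A minimal sketch follows; the helper names `reverse_graph` and `pagerank`, the fixed iteration count, and the toy graph are assumptions, and the example graph is chosen so that no node is left without out-links after reversal.

```python
def reverse_graph(out_links):
    """Return the graph with every edge u -> v replaced by v -> u."""
    rev = {v: [] for v in out_links}
    for u, targets in out_links.items():
        for v in targets:
            rev[v].append(u)
    return rev


def pagerank(out_links, alpha=0.85, iters=300):
    """Plain power iteration; assumes every node has at least one out-link."""
    n = len(out_links)
    x = {v: 1.0 / n for v in out_links}
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) / n for v in out_links}
        for u, targets in out_links.items():
            for v in targets:
                nxt[v] += alpha * x[u] / len(targets)
        x = nxt
    return x


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(graph)
rpr = pagerank(reverse_graph(graph))
print("PR :", pr)
print("RPR:", rpr)
```

In the reversed graph, a page's in-degree equals its out-degree in the original graph, which is why a bounded out-degree on the web translates into a bounded in-degree for the backward crawl.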
24
Experimental Setup
280,000 page crawl of the www.stanford.edu domain
22,000 page crawl of the www.cnn.com site
25
Convergence Rate
26
Crawl Growth Rate
In-deg: 38,606
Out-deg: 255
27
Performance of the Algorithm
28
Applications of Reverse PageRank
• TrustRank
• Influencers in social networks
• Hub web pages
• Measuring semantic relatedness
• Finding crawl seeds
[Legend: Local RPR app.; Novel app.]
29
Influencers in Social Networks
Goal: Market a new product so that it is adopted by a large fraction of a social network
Method:
• Initially target a few influential members
• Trigger a word-of-mouth process
• Results in a large number of users
How should we choose these seed members?
30
Why RPR? [Java et al. 2006]
Nodes with high RPR:
• Have short paths to many other nodes in the network
• Are frequently the only gateways to these nodes
31
Influencers in Social Networks
32
Influencers in Social Networks
www.Livejournal.com, 3.5 million nodes
[Figure panels: 1-level BFS crawl; 4-level BFS crawl]
33
Hub Web Pages
Goal: Find good starting points for search
• Difficult to formulate queries
• Broad search tasks
• Need to understand the surrounding context
Method: Find pages with short paths to many relevant pages
34
Why RPR? [Fogaras, 2003]
High RPR pages tend to have short paths to many authorities
35
Hub Web Pages
Meta-search engine over Yahoo! search
Fraction of hubs in the top 20 results for the queries:
1. “computer scientists”
2. “global warming”
3. “folk dancing”
4. “Queen Elizabeth”
36
Measuring Semantic Relatedness
Goal: Find the relatedness between two concepts
• For Natural language processing applications
Method: Use a taxonomy like the ODP or Wikipedia
37
Why RPR?
$b$ is a strong sub-concept of $a$ in a taxonomy if
• there are many short paths from $a$ to $b$
RPR: a measure of $b$ as a sub-concept of $a$
• $RPR_a(b)$
RPR Similarity: two concepts are similar if their RPR vectors overlap significantly
• similarity between the vectors $RPR_a$ and $RPR_b$
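The slide does not say which similarity measure is applied to the two RPR vectors. As one plausible choice, the sketch below uses cosine similarity over sparse vectors stored as dictionaries; the measure, the function name `cosine_similarity`, and the example vectors are all assumptions.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {node: score}."""
    dot = sum(score * vec_b.get(node, 0.0) for node, score in vec_a.items())
    norm_a = math.sqrt(sum(score * score for score in vec_a.values()))
    norm_b = math.sqrt(sum(score * score for score in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical RPR vectors of two concepts over taxonomy nodes
rpr_a = {"x": 0.5, "y": 0.3, "z": 0.2}
rpr_b = {"y": 0.4, "z": 0.4, "w": 0.2}
similarity = cosine_similarity(rpr_a, rpr_b)
print(similarity)
```

Two concepts whose RPR mass falls on the same taxonomy nodes score close to 1, while concepts with disjoint supports score 0.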
38
Measuring Semantic Relatedness
www.dmoz.org taxonomy, WordSimilarity-353

Relatedness to “Computer”
Human judgement | RPR | RPR similarity | Path-based
Software (8.5) | Software | Software | Laboratory
Keyboard (7.58) | Internet | Internet | Keyboard
Internet (7.62) | Keyboard | Keyboard | News
Laboratory (6.78) | News | Laboratory | Software
News (4.47) | Laboratory | News | Internet
 | 0.6 | 0.6 | -0.4

Relatedness to “Einstein”
RPR | RPR similarity | Path-based
Physics Prize | Newton Isaac | Agriculture
Physics | Physics Prize | Internet
Newton Isaac | Physics | Nuclear
Nuclear | Nuclear | Physics Prize
Agriculture | World War II | Pizza
United States | Agriculture | United States
World War II | Helicopter | Physics
Internet | Ronald Reagan | Newton Isaac
Helicopter | Italy | Italy
Ronald Reagan | Internet | World War II
Italy | United States | Ronald Reagan
Pizza | Sodoku | Helicopter
Sodoku | Pizza | Sodoku
39
Finding Crawl Seeds
Goal: Quickly discover new content on the web while incurring as little overhead as possible
• Overhead: old pages / new pages
Method: Find good seeds
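The overhead ratio can be computed directly from a crawl result. In the sketch below, the function name `crawl_overhead`, the page identifiers, and the convention of returning infinity when no new page is found are assumptions made for illustration.

```python
def crawl_overhead(fetched_pages, known_pages):
    """Overhead = number of already-known (old) pages fetched
    divided by the number of new pages discovered."""
    fetched = set(fetched_pages)
    old = fetched & set(known_pages)
    new = fetched - old
    if not new:
        return float("inf")  # assumed convention: nothing new was discovered
    return len(old) / len(new)


# Hypothetical crawl: six pages fetched, two of which were already known
fetched = ["p1", "p2", "p3", "p4", "p5", "p6"]
known = {"p1", "p2"}
print(crawl_overhead(fetched, known))  # 0.5
```

A good seed keeps this ratio low: most of the pages reached from it are ones the crawler has not seen before.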
40
Why RPR?
A page $p$ has high RPR if:
• Many pages are reachable from $p$ by short paths
• These pages are not reachable from many other pages
[Figure: known and unknown pages, with nodes $u$ and $v$]
41
Finding Crawl Seeds
WebBase project, two crawls of ~1,000,000 pages, one week apart
4-level BFS crawl
[Figure panels: Fraction of new pages discovered; Overhead]
42
Summary
• Two graph properties make local PageRank approximation hard
• The Web Graph is not suitable for local PR approximation
• The Reverse Web Graph is suitable for local PR approximation
• RPR finds nodes that
  • have short paths to many other nodes
  • are frequently the only gateways to these nodes
• Applications of RPR
43
Thanks!
44
Appendix
45
Proof – High in-degree, Deterministic algorithms
By reduction from the majority-by-a-margin problem
Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$ with at least $\left(\frac{1}{2}+\varepsilon\right)m$ 0's or at least $\left(\frac{1}{2}+\varepsilon\right)m$ 1's
Output: the majority
At least $(1-2\varepsilon)m$ queries are needed
46
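The Reduction
[Figure: the graph $G_x$ for the example input $x = 1\,0\,1$, with target node $u$, nodes $W_1, W_2, \ldots, W_m$, and nodes $V_1, V_2, V_3$]
A: an algorithm that calculates local PR
B: an algorithm that computes majority-by-a-margin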
47
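The Reduction
[Figure: the graph $G_x$ for the example input $x = 1\,0\,1$, with target node $u$, nodes $W_1, W_2, \ldots, W_m$, and nodes $V_1, V_2, V_3$]
Claim 1: Let $|x|$ be the number of 1's in $x$. Then,
$$PR(u) = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2 |x|\right)$$
Claim 2: When $m = (n-1)/2$,
$$p_1 \ge c \cdot p_0$$
where
$$p_0 = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2\left(\tfrac{1}{2}-\varepsilon\right)m\right), \qquad p_1 = \frac{1-\alpha}{n}\left(1 + \alpha m + \alpha^2\left(\tfrac{1}{2}+\varepsilon\right)m\right)$$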
48
Proof Cont.
Given an input $x$, B simulates A on $G_x$, $u$:
• If $PR_x(u) \ge p_1$ ⇒ the majority bit of $x$ is 1
• If $PR_x(u) \le p_0$ ⇒ the majority bit of $x$ is 0
It means that the maximum number of queries A uses is at least $(1-2\varepsilon)m$, which is linear in $n$
49
Proof – Slow PR Convergence, Randomized algorithms
By reduction from the OR problem
Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$
Output: $x_1 \vee x_2 \vee x_3 \vee \ldots \vee x_m$
$\Omega(m)$ queries are needed even for randomized algorithms
50
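The Reduction
[Figure: the graph $G_x$ for the example input $x = 1\,0\,0$, with target node $u$, a part labeled $T$ spanning $m$ nodes, and parts labeled $S_1, \ldots, S_m$]
A: an algorithm that calculates local PR
B: an algorithm that computes the OR function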
51
The Reduction
Claim 2: When
1 log
2m n
Claim 1: Let $|x|$ be the number of 1's in $x$. Then,
log 10
1(2 ) 1
(2 1)mp
n
1 0p c p
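[Figure: the graph $G_x$ for the example input $x = 1\,0\,0$, with target node $u$, a part labeled $T$ spanning $m$ nodes, and parts labeled $S_1, \ldots, S_m$]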
log 1 log log 11
1(2 ) 1 ((2 ) 1)
(2 1)m m kp
n
52
Proof Cont.
Given an input $x$, B simulates A on $G_x$, $u$:
• If $PR_x(u) \ge p_1$ ⇒ OR = 1
• If $PR_x(u) \le p_0$ ⇒ OR = 0
It means that the maximum number of queries A uses ≥
1 log
2( )m n
53
Proof – Slow PR Convergence, Deterministic algorithms
By reduction from the majority-by-a-margin problem
Input: $x = (x_1, x_2, x_3, \ldots, x_m) \in \{0,1\}^m$ with at least $\left(\frac{1}{2}+\varepsilon\right)m$ 0's or at least $\left(\frac{1}{2}+\varepsilon\right)m$ 1's
Output: the majority
At least $(1-2\varepsilon)m$ queries are needed
54
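The Reduction
[Figure: the graph $G_x$ for the example input $x = 1\,0\,1$: a layered graph with target node $u$ and $m$ nodes $w_1, w_2, w_3, w_4, \ldots, w_{m-1}, w_m$]
A: an algorithm that calculates local PR
B: an algorithm that computes majority-by-a-margin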
55
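The Reduction
[Figure: the graph for the example input $x = 1\,0\,1$: a layered graph with target node $u$ and $m$ nodes $w_1, w_2, w_3, w_4, \ldots, w_{m-1}, w_m$]
Claim 1: Let $|x|$ be the number of 1's in $x$. Then,
$$PR_x(u) = \frac{1-\alpha}{n}\left(\frac{(2\alpha)^{\log n - 1} - 1}{2\alpha - 1} + \alpha^{\log n - 1}\,|x|\right)$$
Claim 2: When $m = n/2$,
$$p_1 \ge c \cdot p_0$$
where
$$p_0 = \frac{1-\alpha}{n}\left(\frac{(2\alpha)^{\log n - 1} - 1}{2\alpha - 1} + \alpha^{\log n - 1}\left(\tfrac{1}{2}-\varepsilon\right)m\right)$$
$$p_1 = \frac{1-\alpha}{n}\left(\frac{(2\alpha)^{\log n - 1} - 1}{2\alpha - 1} + \alpha^{\log n - 1}\left(\tfrac{1}{2}+\varepsilon\right)m\right)$$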
56
Proof Cont.
Given an input $x$, B simulates A on $G_x$, $u$:
• If $PR_x(u) \ge p_1$ ⇒ the majority bit of $x$ is 1
• If $PR_x(u) \le p_0$ ⇒ the majority bit of $x$ is 0
It means that the maximum number of queries A uses is at least $(1-2\varepsilon)m$, which is linear in $n$