Predictive Ranking - Handling missing data on the web
Haixuan Yang
Group Meeting, November 04, 2004

Outline
Introduction
Related Work
Predictive Ranking Model
Block Predictive Ranking Model
Experiment Setup

Introduction
PageRank (1998)
  – It uses link information to rank web pages.
  – The importance of a page depends on the number of pages that point to it.
  – The importance of a page also depends on the importance of the pages that point to it.
  – If $x$ is the rank vector, then $x = [(1-\alpha) f e^{T} + \alpha A]\,x$.
Problems
  – Manipulation
  – The “richer-get-richer” phenomenon
  – Computation efficiency
  – Dangling nodes problem

Introduction
Nodes that either have no out-link or for which no out-link is known are called dangling nodes.
Dangling nodes problem
  – It is hard to sample the entire web.
    • Page et al. (1998) reported 51 million URLs not yet downloaded at a time when 24 million pages had been downloaded.
    • Handschuh et al. (2003) estimated that there are 100 times more dynamic pages than static pages.
    • Eiron et al. (2004) reported that the number of uncrawled pages (3.75 billion) still far exceeds the number of crawled pages (1.1 billion).
  – Including dangling nodes in the overall ranking may change not only the rank values of non-dangling nodes but also their order.

An example
If we ignore the dangling node 3, then the ranks for nodes 1 and 2 are $(x_1, x_2) = (0.5, 0.5)$.
If we consider the dangling node 3, then the ranks given by the revised PageRank algorithm (Kamvar 2003) are $(x_1, x_2, x_3) = (0.6571, 0.1714, 0.1714)$.
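The graph for this example is not preserved in this copy. As a sanity check, the sketch below assumes a hypothetical graph (self-loops on nodes 1 and 2, plus a link from node 2 to node 3) that is consistent with both rank vectors, and applies the uniform jump from dangling nodes attributed to Kamvar (2003). The graph, the uniform teleportation, the damping value 0.85, the function name and the iteration count are all assumptions made for illustration.

```python
def revised_pagerank(out_links, alpha=0.85, iters=200):
    """Power iteration; dangling nodes jump uniformly to all nodes."""
    nodes = sorted(out_links)
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation share, identical for every node (uniform f).
        nxt = {v: (1.0 - alpha) / n for v in nodes}
        for u in nodes:
            # A node with no out-link spreads its rank over all nodes.
            targets = out_links[u] or nodes
            share = alpha * x[u] / len(targets)
            for v in targets:
                nxt[v] += share
        x = nxt
    return [x[v] for v in nodes]

# Hypothetical graph (assumed; the slide's figure is missing).
without_3 = {1: [1], 2: [2]}
with_3 = {1: [1], 2: [2, 3], 3: []}

ranks_without = revised_pagerank(without_3)
ranks_with = revised_pagerank(with_3)
```

Under these assumptions the iteration gives (0.5, 0.5) without node 3 and approximately (0.6571, 0.1714, 0.1714) with it, so nodes 1 and 2 are no longer tied once the dangling node is included.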

Introduction
Classes of dangling nodes
  – Nodes that have been found but not yet visited at the current time are called dangling nodes of class 1.
  – Nodes that have been tried but not visited successfully are called dangling nodes of class 2.
  – Nodes that have been visited successfully but from which no out-link is found are called dangling nodes of class 3.
Different kinds of dangling nodes should be handled in different ways. Our work focuses on dangling nodes of class 1, which cause missing information.
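The three classes can be read as a decision on a crawler's per-node state. The following is a minimal sketch, assuming the crawler records whether a fetch was attempted, whether it succeeded, and how many out-links were extracted; the names `NodeStatus` and `classify` and the three parameters are hypothetical.

```python
from enum import Enum

class NodeStatus(Enum):
    NON_DANGLING = 0
    CLASS_1 = 1  # found but not yet visited
    CLASS_2 = 2  # tried, but not visited successfully
    CLASS_3 = 3  # visited successfully, no out-link found

def classify(attempted, fetched_ok, out_degree):
    """Map a node's crawl state to its dangling class."""
    if not attempted:
        return NodeStatus.CLASS_1
    if not fetched_ok:
        return NodeStatus.CLASS_2
    if out_degree == 0:
        return NodeStatus.CLASS_3
    return NodeStatus.NON_DANGLING
```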

Illustration of dangling nodes
[Figure: an example graph on nodes 1 to 7, shown at two crawl times.]
At time 1
  – Visited node: 1.
  – Dangling nodes of class 1: 2, 4, 5, 7.
At time 2
  – Visited nodes: 1, 7, 2.
  – Dangling nodes of class 3: 7.
  – Dangling nodes of class 1: 3, 4, 5, 6.
  – Known information at time 2: red links.
  – Missing information at time 2: white links.

Related work
Page (1998): Simply removes dangling nodes; afterwards they can be added back in. The details are missing.
Amati (2003): Handles dangling nodes robustly based on a modified graph.
Kamvar (2003): Adds a uniform random jump from dangling nodes to all nodes.
Eiron (2004): Speeds up the model in Kamvar (2003) at the cost of accuracy. Furthermore, suggests an algorithm that penalizes the nodes that link to dangling nodes of class 2.

Related work - Amati (2003)

Related work - Kamvar (2003)

Related work - Eiron (2004)

Predictive Ranking Model
For dangling nodes of class 3, we use the same technique as Kamvar (2003).
For dangling nodes of class 2, we ignore them in the current model, although it is possible to combine the push-back algorithm (Eiron 2004) with our model. (Penalizing nodes is a subjective matter.)
For dangling nodes of class 1, we try to predict the missing information about the link structure.

Predictive Ranking Model
Suppose that all the nodes $V$ can be partitioned into three subsets: $C$, $D_1$, $D_2$.
  – $C$ denotes the set of all nodes that have been crawled successfully and have at least one out-link;
  – $D_1$ denotes the set of all dangling nodes of class 3;
  – $D_2$ denotes the set of all dangling nodes of class 1.
For each node $v$ in $V$, the real in-degree of $v$ is not known.
For each node $v$ in $C$, the real out-degree of $v$ is known.
For each node $v$ in $D_1$, the real out-degree of $v$ is known to be zero.
For each node $v$ in $D_2$, the real out-degree of $v$ is unknown.

Predictive Ranking Model
We predict the real in-degree of $v$ by the number of found links from $C$ to $v$.
  – Assumption: the number of found links from $C$ to $v$ is proportional to the real number of links from $V$ to $v$.
    • For example, if $C$ and $D_1$ have 100 nodes, $V$ has 1000 nodes, and the number of links from $C$ to $v$ is 5, then we estimate that the number of links from $V$ to $v$ is 50.
The difference between these two numbers is distributed uniformly to the nodes in $D_2$.
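The arithmetic of this example can be sketched as follows. The function and variable names are hypothetical, and the scaling factor (total nodes divided by visited nodes, where the visited nodes are those in C and D_1) is simply what the slide's numbers imply.

```python
def predict_in_degree(found_links, n_visited, n_total):
    """Scale the found in-degree up to the whole node set V."""
    return found_links * n_total / n_visited

found = 5          # links from C to v seen so far
n_visited = 100    # |C| + |D_1|
n_total = 1000     # |V|

predicted = predict_in_degree(found, n_visited, n_total)
missing = predicted - found             # links not yet observed
n_unvisited = n_total - n_visited       # |D_2|, the unvisited nodes
per_unvisited = missing / n_unvisited   # uniform share per D_2 node
```

With the slide's numbers the prediction is 50 links, so the 45 unobserved links are spread over the 900 unvisited nodes, 0.05 per node.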

Predictive Ranking Model
$$A = \begin{pmatrix} C & P & M \\ D & Q & N \end{pmatrix}$$
  – $C$, $D$: model the known link information as in Page (1998): links from $C$ to $V$.
  – $P$, $Q$: model the user's behavior as in Kamvar (2003) when facing dangling nodes of class 3.
  – $M$, $N$: model the missing information from unvisited nodes to nodes in $V$.
[The entry-wise definitions of $M$ and $N$ are garbled in this copy of the slide and are omitted here.]
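
Predictive Ranking Model
$$f = (f_1, f_2, \ldots, f_n)^{T}, \qquad e = (1, 1, \ldots, 1)^{T}, \qquad f e^{T} = \begin{pmatrix} f_1 & f_1 & \cdots & f_1 \\ f_2 & f_2 & \cdots & f_2 \\ \vdots & \vdots & & \vdots \\ f_n & f_n & \cdots & f_n \end{pmatrix}$$
$f e^{T}$ models users' behavior (called “teleportation”) as in Page (1998) and Kamvar (2003): when users get bored of following actual links, they may jump to some nodes randomly.
$$x = [(1-\alpha) f e^{T} + \alpha A]\,x, \qquad \alpha = 0.85$$
$x = (x_1, x_2, \ldots, x_n)^{T}$ is the rank vector.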
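One consequence of this form is worth spelling out. If the rank vector is normalized so that $e^{T}x = 1$ (an assumption; the slide does not state the normalization), the dense rank-one term collapses to the vector $f$, so each iteration only needs a product with $A$:

```latex
\begin{aligned}
x &= \left[(1-\alpha)\, f e^{T} + \alpha A\right] x \\
  &= (1-\alpha)\, f\,(e^{T} x) + \alpha A x \\
  &= (1-\alpha)\, f + \alpha A x
     \qquad \text{(using } e^{T}x = 1\text{)}.
\end{aligned}
```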

Block Predictive Ranking Model
Goal: predict the in-degree of $v$ more accurately.
Divide all nodes into $p$ blocks ($V[1], V[2], \ldots, V[p]$) according to their top-level domains (for example, edu), their domains (for example, stanford.edu), or their countries (for example, cn).
Assumption: the number of found links from $C[i]$ to $v$ is proportional to the real number of links from $V[i]$ to $v$, where $C[i]$ is the intersection of $C$ and $V[i]$. Consequently, the matrix $A$ is changed.
Other parts are the same as in the Predictive Ranking Model.
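A minimal sketch of the per-block estimate follows. It assumes that the proportionality constant for block i is |V[i]| / |C[i]|, by analogy with the single-block case (the slide does not give the constant), and it uses made-up block sizes and link counts. All names are hypothetical.

```python
def predict_in_degree_by_block(found, crawled_size, block_size):
    """Sum per-block estimates of the real in-degree of one node v.

    found[i]        -- found links from C[i] to v
    crawled_size[i] -- |C[i]|, crawled nodes in block i (assumed > 0)
    block_size[i]   -- |V[i]|, all nodes in block i
    """
    total = 0.0
    for block, links in found.items():
        # Assumed scaling: each block is scaled by its own coverage.
        scale = block_size[block] / crawled_size[block]
        total += links * scale
    return total

# Hypothetical numbers, for illustration only.
found = {"edu": 4, "cn": 1}
crawled_size = {"edu": 90, "cn": 10}
block_size = {"edu": 300, "cn": 700}

estimate = predict_in_degree_by_block(found, crawled_size, block_size)
```

With these numbers the per-block estimate is about 83.3 links, whereas pooling all nodes into one block (5 found links, 100 crawled nodes, 1000 nodes in total) would give 50. The single link from the sparsely crawled block carries much more weight.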

Block Predictive Ranking Model
$$A = \begin{pmatrix} C & P & M_1 & M_2 & \cdots & M_p \\ D & Q & N_1 & N_2 & \cdots & N_p \end{pmatrix}$$
$M_1$, $N_1$: model the missing information from unvisited nodes in the 1st block to nodes in $V$.

Experiment Setup
Get two datasets (by Patrick Lau): one is within the domain cuhk.edu.hk, the other is outside this domain. In the first dataset, we take 11 snapshots of the matrix during crawling; in the second dataset, we take 9 snapshots.
Apply both the Predictive Ranking Model and the revised PageRank Model (Kamvar 2003) to these snapshots.
Compare the results of both models at time t with the future results of both models.
  – The future results rank more nodes than the current results, so it is difficult to make a direct comparison.

Illustration for comparison
[Figure: the future result is cut to the nodes of the current result and normalized; the two results are then compared by the 1-norm.]
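The cut, normalize and compare steps can be sketched as below. This assumes that both results are stored as node-to-rank mappings, that every node in the current result also appears in the future result, and that both vectors are rescaled to sum to 1 before the difference is taken; the function name and the sample values are hypothetical.

```python
def compare_by_1_norm(current, future):
    """Cut `future` to the nodes in `current`, normalize, take the 1-norm."""
    # Cut: keep only nodes that the current result also ranks.
    cut = {v: future[v] for v in current}
    # Normalize both vectors so that each sums to 1.
    cut_sum = sum(cut.values())
    cur_sum = sum(current.values())
    return sum(abs(current[v] / cur_sum - cut[v] / cut_sum) for v in current)

# Hypothetical rank values, for illustration only.
current = {"a": 0.5, "b": 0.5}
future = {"a": 0.4, "b": 0.2, "c": 0.4}

distance = compare_by_1_norm(current, future)
```

A smaller distance means the current ranking is closer to what the later, more complete crawl produces.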

Within domain cuhk.edu.hk
Data description

Time t   1       2       3       4       5       6       7       8       9       10      11
Vnum[t]  7712    78662   109383  160019  252522  301707  373579  411724  444974  471684  502610
Tnum[t]  18542   120970  157196  234701  355720  404728  476961  515534  549162  576139  607170

Number of iterations

Time t              1   2   3   4   5   6   7   8   9   10  11
Predictive Ranking  5   3   2   2   2   2   2   2   2   2   2
PageRank            12  4   3   2   2   2   2   2   2   2   2

Within domain cuhk.edu.hk
Comparison based on the future PageRank result at time 11.

Within domain cuhk.edu.hk
Comparison based on the future PreRank result at time 11.

Outside cuhk.edu.hk
Data description

Time t   1       2       3       4       5       6       7       8       9
Vnum[t]  4611    6785    10310   16690   20318   23453   25417   28847   39824
Tnum[t]  87930   121961  164701  227682  290731  322774  362440  413053  882254

Number of iterations

Time t              1   2   3   4   5   6   7   8   9
Predictive Ranking  2   2   2   1   1   1   1   1   1
PageRank            3   3   3   2   2   2   2   2   1

Outside cuhk.edu.hk
Comparison based on the future PageRank result at time 9.

Outside cuhk.edu.hk
Comparison based on the future PreRank result at time 9.

Q & A