1
Hyperlink Analysis
A Survey (In Progress)
2
Overview of This Talk
Introduction to Hyperlink Analysis
Classification of Hyperlink Analysis
Two sub-topics:
- Measures and Metrics
- Interesting Web Structures
3
Definition of Hyperlink Analysis
Hyperlink Analysis can be defined as an
area of Web Information Retrieval using
the hyperlink structure of the Web.
4
Motivation
Hyperlinks serve two main purposes:
- Pure navigation.
- Pointing to pages with authority* on the same topic as the page containing the link.
The second purpose can be used to retrieve useful information from the Web.
* Authority: a set of ideas or statements supporting a topic.
5
What Information Can Be Retrieved?
Quality of a Web page.
- The authority of a page on a topic.
- Ranking of Web pages.
Interesting Web structures.
- Graph patterns such as co-citation, social choice, and complete bipartite graphs.
Web page classification.
- Classifying Web pages according to various topics.
6
What Information Can Be Retrieved? (Cont…)
Which pages to crawl.
- Deciding which Web pages to add to the collection of Web pages.
Finding related pages.
- Given one relevant page, find all related pages.
Detection of duplicated pages.
- Detection of near-mirror sites to eliminate duplication.
7
Classification of Hyperlink Analysis Research
Hyperlink Analysis
- Measures and Metrics
- Interesting Web Structures
- Web Page Classification
- Web Search
(Still needs to be refined. Suggestions Welcome)
8
Measures/metrics
Standards for measuring properties of a page or a web structure.
- Quality of a page.
- Distance between pages.
- Web page reputation.
9
PageRank Citation Ranking [1]
Aim. A ranking metric for hypertext documents.
Approach. A page has a high rank if the sum of the ranks of its backlinks is high.
10
Authoritative Sources in a Hyperlinked Environment [3]
Aim. Determine the relative “authority” of pages.
Approach.
- A good authority page is one pointed to by many good hubs.
- A good hub page is one that points to many good authorities.
Results. Efficient when the query topic is sufficiently “broad”.
Benefits. Locates dense bipartite communities.
11
Does “Authority” Mean Quality? [4]
Aim. Are any of the metrics we compute for Web documents good predictors of document quality?
Approach.
- Do experts agree in their quality judgments?
- Are different link-based metrics different?
  o In-degree, PageRank and Authority.
- Can we predict human quality judgments?
Compute correlations between each pair of metrics, and also compare the metrics with expert judgment.
12
Does “Authority” Mean Quality? [4] (Cont…)
Results.
- Experts agree on the nature of quality within a topic.
- No significant difference between link-based metrics.
- In-degree performed as well as PageRank and Authority.
13
Web Page Reputations [5]
Aim. Input: URL, Output: Ranked set of topics for
which the page has a reputation.
Approach.
- A page can acquire a high reputation on a topic because it is pointed to by many pages on that topic, or because it is pointed to by some high-reputation pages on that topic.
- A page is deemed an authority on a topic if it is pointed to by good hubs on the topic, and a good hub is one that points to good authorities.
14
One-level Influence Propagation
The reputation of a page p on a topic t is the probability that a random surfer looking for topic t will visit page p.
At each step, the surfer:
- with probability d > 0, jumps to a random page that contains term t, or
- with probability (1-d), follows a random link from the current page.
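\[
R^{n}(p,t) =
\begin{cases}
\dfrac{d}{N_t} + (1-d) \sum\limits_{(q,p) \in G} \dfrac{R^{n-1}(q,t)}{\mathrm{Out}(q)} & \text{if term } t \text{ appears in page } p \\[2ex]
(1-d) \sum\limits_{(q,p) \in G} \dfrac{R^{n-1}(q,t)}{\mathrm{Out}(q)} & \text{otherwise}
\end{cases}
\]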
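The one-level propagation on this slide can be sketched as a simple fixed-point iteration. This is a minimal illustration, not the implementation from [5]: the function name, the dictionary-based graph format, the default d = 0.15, the fixed iteration count, and the starting values (1/N_t on pages containing the term, 0 elsewhere) are all assumptions made for the example. N_t is taken to be the number of pages containing term t, and Out(q) the number of links leaving q.

```python
def one_level_reputation(graph, pages_with_term, d=0.15, iterations=100):
    """Iterate R(p,t) = [d/N_t if t appears in p] + (1-d) * sum over q->p of R(q,t)/Out(q).

    graph: dict mapping every page to the list of pages it links to
           (hypothetical format; every link target must also be a key).
    pages_with_term: set of pages that contain the term t.
    """
    n_t = len(pages_with_term)
    # Assumed starting point: mass spread evenly over pages containing t.
    rank = {p: (1.0 / n_t if p in pages_with_term else 0.0) for p in graph}
    for _ in range(iterations):
        # Random-jump part: only pages containing t can be jumped to.
        new_rank = {p: (d / n_t if p in pages_with_term else 0.0) for p in graph}
        # Link-following part: q passes (1-d) * R(q,t) / Out(q) to each target.
        for q, targets in graph.items():
            if targets:
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(one_level_reputation(toy_graph, {"a"}))
```

In the toy cycle only page a contains the term, so it receives the jump mass directly and the reputation decays along the chain a, b, c.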
15
Two Level Influence Propagation
At each step, the surfer:
- with probability d > 0, jumps to a random page that contains term t, or
- with probability (1-d), follows a random link forward or backward from the current page, alternating directions.
Authority reputation of a page p on a topic t: the probability that a random surfer looking for topic t makes a forward visit to page p.
Hub reputation of a page p on a topic t: the probability that a random surfer looking for topic t makes a backward visit to page p.
16
Two Level Influence Propagation
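\[
A^{n}(p,t) =
\begin{cases}
\dfrac{d}{2N_t} + (1-d) \sum\limits_{q \to p} \dfrac{H^{n-1}(q,t)}{\mathrm{Out}(q)} & \text{if term } t \text{ appears in page } p \\[2ex]
(1-d) \sum\limits_{q \to p} \dfrac{H^{n-1}(q,t)}{\mathrm{Out}(q)} & \text{otherwise}
\end{cases}
\]
\[
H^{n}(p,t) =
\begin{cases}
\dfrac{d}{2N_t} + (1-d) \sum\limits_{p \to q} \dfrac{A^{n-1}(q,t)}{\mathrm{In}(q)} & \text{if term } t \text{ appears in page } p \\[2ex]
(1-d) \sum\limits_{p \to q} \dfrac{A^{n-1}(q,t)}{\mathrm{In}(q)} & \text{otherwise}
\end{cases}
\]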
A(p,t) = probability of a forward visit to page p when searching for term t = Authority rank of page p on term t
H(p,t) = probability of a backward visit to page p when searching for term t = Hub rank of page p on term t
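A rough sketch of the two-level propagation follows, with A and H updated together from the previous step's values. The function name, the edge-list input, the default d = 0.15, the iteration count, and the starting values of d/(2N_t) are assumptions for illustration only; Out(q) and In(q) are taken to be the out-degree and in-degree of q. The scores are not renormalized, so on graphs with pages lacking in-links or out-links the totals sum to less than 1.

```python
from collections import Counter


def two_level_reputation(edges, pages_with_term, d=0.15, iterations=100):
    """Return (A, H): authority and hub reputations for one term t.

    edges: list of (source, target) links (hypothetical input format).
    pages_with_term: set of pages that contain the term t.
    """
    nodes = {u for edge in edges for u in edge} | set(pages_with_term)
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    n_t = len(pages_with_term)
    # Jump term d/(2 N_t), given only to pages that contain t.
    base = {p: (d / (2 * n_t) if p in pages_with_term else 0.0) for p in nodes}
    auth, hub = dict(base), dict(base)  # assumed starting values
    for _ in range(iterations):
        new_auth, new_hub = dict(base), dict(base)
        for src, dst in edges:
            # Forward visit: dst gains authority from the hub score of src.
            new_auth[dst] += (1 - d) * hub[src] / out_deg[src]
            # Backward visit: src gains hub score from the authority of dst.
            new_hub[src] += (1 - d) * auth[dst] / in_deg[dst]
        auth, hub = new_auth, new_hub
    return auth, hub


if __name__ == "__main__":
    links = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
    A, H = two_level_reputation(links, {"h1", "h2", "a1", "a2"})
    print(A)
    print(H)
```

In the toy graph, a1 is pointed to by both hubs and a2 by only one, so a1 ends up with the higher authority reputation; h1 points to both authorities and ends up with the higher hub reputation.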
17
Factors Affecting Page Reputation
- How well a topic is represented.
- How well pages on a topic are connected.
18
Link Analysis and Stability [6]
Aim. Determine when to expect stable rankings under small perturbations to hyperlink patterns.
Approach.
- HITS: the eigengap directly affects the stability of the eigenvectors.
- PageRank: coupled Markov chain theory(?). So long as the perturbed Web pages do not have high overall PageRank scores, the perturbed PageRank scores will not be far from the original.
Result. HITS: unstable; PageRank: stable.
19
Stable Algorithms [7]
Aim. Stable link analysis methods.
Approach.
- Randomized HITS: merges the hubs-and-authorities notion with the “reset” mechanism from PageRank.
- Subspace HITS: combines multiple eigenvectors from HITS to yield aggregate authority scores.
Results. Both approaches are more stable than HITS; the latter is a little worse than PageRank.
20
Average Clicks [8]
Aim. A new definition of distance between two pages.
Approach. Based on the probability of clicking a link during random surfing.
Benefit. A good justification for fetching neighboring pages in practical search.
Result. Distance by average clicks seems to fit intuition well.
21
Interesting Web Structure
Analyzing interesting graph patterns or Web Structures.
Helpful in identification of ‘Web Communities.’
22
Interesting Web Structures [11]
- Endorsement
- Mutual Reinforcement
- Co-Citation
- Social Choice
- Transitive Endorsement
23
Interesting Web Structures [11]
- Directed complete bipartite graph
- NK-clan with N=2, K=10
An NK-clan is a set of K nodes in which there is a path of length N or less (ignoring edge directions) between every pair of nodes.
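The NK-clan definition can be checked mechanically with a bounded breadth-first search. The sketch below is only an illustration: the function name and edge-list format are hypothetical, and it assumes that the connecting paths must stay inside the candidate set, which the slide does not state explicitly.

```python
from collections import deque


def is_nk_clan(edges, members, n):
    """True if every pair in `members` is joined by a path of length <= n,
    ignoring edge directions and using only nodes inside `members`
    (the restriction to `members` is an assumption of this sketch)."""
    members = set(members)
    # Undirected adjacency restricted to the candidate set.
    adjacent = {m: set() for m in members}
    for u, v in edges:
        if u in members and v in members and u != v:
            adjacent[u].add(v)
            adjacent[v].add(u)

    def reachable_within_n(source):
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if distance[u] == n:
                continue  # do not expand beyond path length n
            for w in adjacent[u]:
                if w not in distance:
                    distance[w] = distance[u] + 1
                    queue.append(w)
        return set(distance)

    return all(reachable_within_n(m) == members for m in members)


if __name__ == "__main__":
    star = [("c", "x"), ("y", "c"), ("c", "z")]
    print(is_nk_clan(star, {"c", "x", "y", "z"}, 2))
    print(is_nk_clan(star, {"c", "x", "y", "z"}, 1))
```

A star with a hub and three leaves is a clan for N=2 (any two leaves are joined through the hub) but not for N=1.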
24
Interesting Web Structures [11]
- In-Tree
- Out-Tree
25
Interesting Web Structures
Web Communities
26
Friends and Neighbors [9]
Aim. Techniques to mine information in order to predict relationships between individuals.
Approach. Similarity measured by analyzing text, in-links, out-links and mailing lists.
Result. In-links were ‘good’ predictors.
27
References
[1] S. Brin and L. Page (1998), The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, available at http://www-db.stanford.edu/~backrub/pageranksub.ps, January 1998.
[2] T. Haveliwala (1999), Efficient Computation of PageRank, Technical Report, Stanford University, CA.
[3] J. M. Kleinberg (1998), Authoritative Sources in a Hyperlinked Environment.
28
References (contd…)
[4] B. Amento, L. Terveen, and W. Hill (2000), Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents, ACM 2000.
[5] D. Rafiei and A. O. Mendelzon (2000), What is this Page Known for? Computing Web Page Reputations, Proceedings of the Ninth International WWW Conference.
29
References (contd…)
[6] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Link Analysis, Eigenvectors and Stability, IJCAI-01.
[7] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Stable Algorithms for Link Analysis, Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR), 2001.
[8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001), Average-clicks: A New Measure of Distance on the WWW, WI-2001, 2001.
30
References (contd…)
[9] L. A. Adamic and E. Adar (2000), Friends and Neighbors on the Web, Xerox Palo Alto Research Center, Palo Alto, CA 94304.
[10] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas (2000), Finding Authorities and Hubs From Link Structures on the World Wide Web, WWW10 Proceedings.
31
References (contd…)
[11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin (2000), The Shape of the Web and Its Implications for Searching the Web, International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, Proceedings at http://www.ssgrr.it/en/ssgrr2000/proceedings.htm, Rome, Italy, Jul.-Aug. 2000.
[12] M. Henzinger (2000), Link Analysis in Web Information Retrieval, ICDE Bulletin, Sept. 2000, Vol. 23, No. 3.
32
PageRank Approach
PageRank of a page p:
\[
PR(p) = \frac{d}{n} + (1-d) \sum_{(q,p) \in G} \frac{PR(q)}{\mathrm{outdegree}(q)}
\]
• d is the damping factor (the probability that a page is chosen uniformly at random from all pages).
• n is the number of nodes in graph G.
• outdegree(q) is the number of edges leaving page q.
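As a minimal sketch of how the PageRank recurrence PR(p) = d/n + (1-d) Σ PR(q)/outdegree(q) can be iterated, the following uses d as this slide defines it (the probability of a uniform random jump). The function name, the dictionary-based graph format, the default d = 0.15, and the fixed iteration count are assumptions for illustration. The sketch also assumes every page has at least one out-link; pages without out-links would simply leak rank here.

```python
def pagerank(graph, d=0.15, iterations=50):
    """Iterate PR(p) = d/n + (1-d) * sum over (q,p) in G of PR(q)/outdegree(q).

    graph: dict mapping every page to the list of pages it links to
           (hypothetical format; every link target must also be a key).
    """
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}  # start from the uniform distribution
    for _ in range(iterations):
        # Random-jump part: every page receives d/n.
        new_rank = {p: d / n for p in graph}
        # Link-following part: q splits (1-d) * PR(q) evenly over its out-links.
        for q, targets in graph.items():
            if targets:
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 4))
```

In the toy graph, c is linked from both a and b while b is linked only from a, so c receives the higher rank.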
33
HITS Approach
Let z denote the vector (1, 1, 1, …, 1).
Initially set x ← z and y ← z.
For i = 1, 2, 3, …:
- Apply the I operation: \( x_p \leftarrow \sum_{q : (q,p) \in E} y_q \)
- Apply the O operation: \( y_p \leftarrow \sum_{q : (p,q) \in E} x_q \)
- Normalize x and y.
The sequence of (x, y) pairs produced converges to a limit (x*, y*).
Return (x*, y*) as the authority and hub weights.
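The HITS iteration on this slide can be sketched in a few lines. The function names, the edge-list input, and the fixed iteration count are assumptions for the example, and normalization is taken to mean scaling each vector to unit Euclidean length; x holds the authority weights and y the hub weights.

```python
import math


def normalize(weights):
    """Scale a weight dict to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {p: w / norm for p, w in weights.items()}


def hits(edges, iterations=50):
    """Return (x, y): authority and hub weights for a list of (source, target) edges."""
    nodes = {u for edge in edges for u in edge}
    x = {p: 1.0 for p in nodes}  # authority weights, x <- z
    y = {p: 1.0 for p in nodes}  # hub weights, y <- z
    for _ in range(iterations):
        # I operation: x_p is the sum of y_q over all q that point to p.
        x = {p: 0.0 for p in nodes}
        for q, p in edges:
            x[p] += y[q]
        # O operation: y_p is the sum of x_q over all q that p points to.
        y = {p: 0.0 for p in nodes}
        for p, q in edges:
            y[p] += x[q]
        x, y = normalize(x), normalize(y)
    return x, y


if __name__ == "__main__":
    links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
    authority, hub = hits(links)
    print(authority)
    print(hub)
```

In the toy graph, a1 is pointed to by three hubs and a2 by two, so a1 receives the larger authority weight; h3 points to only one authority and receives the smallest hub weight among the three hubs.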
34
Friends and Neighbors
Predicting Friendship
\[
\mathrm{similarity}(A,B) = \sum_{\text{shared items}} \frac{1}{\log[\mathrm{frequency}(\text{shared item})]}
\]
Items that are unique to a few users are weighted more than commonly occurring items:
- 2 people mention an item: weight = 1/log(2) = 1.4
- 5 people mention an item: weight = 1/log(5) = 0.62
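The similarity score on this slide is easy to compute directly. In the sketch below, the function name and the input format (two item collections plus a frequency table) are assumptions for illustration. The natural logarithm is used because it reproduces the slide's example weights of about 1.4 and 0.62.

```python
import math


def similarity(items_a, items_b, frequency):
    """Sum 1/log(frequency(item)) over the items shared by A and B.

    frequency: dict mapping each item to the number of people who mention it
               (hypothetical input format).
    """
    shared = set(items_a) & set(items_b)
    # An item shared by A and B is mentioned by at least 2 people; the guard
    # only protects against inconsistent frequency tables (log(1) == 0).
    return sum(1.0 / math.log(frequency[item]) for item in shared if frequency[item] > 1)


if __name__ == "__main__":
    mentions = {"rare_item": 2, "common_item": 5, "unshared_item": 3}
    alice = ["rare_item", "common_item", "unshared_item"]
    bob = ["rare_item", "common_item"]
    print(similarity(alice, bob, mentions))
```

The rare item, mentioned by only 2 people, contributes more than twice as much to the score as the item mentioned by 5.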