COMP5331 6
HITS Algorithm

[Figure: a page v shown twice – once with incoming edges (as an authority) and once with outgoing edges (as a hub).]

Each page has two weights:
1. Authority weight a(v)
2. Hub weight h(v)
HITS Algorithm

Each vertex has two weights:
- Authority weight a(v)
- Hub weight h(v)

  a(v) = Σ_{u→v} h(u)        h(v) = Σ_{v→u} a(u)

A good authority has many incoming edges from good hubs.
A good hub has many outgoing edges to good authorities.
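The two update rules above can be sketched as a single pass over a small directed graph (the graph itself is a hypothetical example, not the one used later in the slides):

```python
# One HITS update pass. edges[u] lists the pages that u links to.
edges = {"p": ["q", "r"], "q": ["r"], "r": ["p"]}
nodes = list(edges)

h = {v: 1.0 for v in nodes}  # initial hub weights

# a(v) = sum of h(u) over all edges u -> v
a = {v: sum(h[u] for u in nodes if v in edges[u]) for v in nodes}
# h(v) = sum of a(u) over all edges v -> u
h = {v: sum(a[u] for u in edges[v]) for v in nodes}
```

Note that the hub update uses the freshly computed authority weights, matching the mutual-reinforcement idea: good hubs point at good authorities and vice versa.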
HITS Algorithm

HITS involves two major steps:
- Step 1: Sampling Step
- Step 2: Iteration Step
Step 1 – Sampling Step

Given a user query with several terms:
- Collect a set of pages that are very relevant to the query – called the base set.
- How to find the base set? First, retrieve all webpages that contain the query terms. This set of webpages is called the root set.
- Next, find the linked pages: pages that either have a hyperlink to some page in the root set, or are pointed to by a hyperlink from some page in the root set.
- All pages found form the base set.
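The sampling step can be sketched as follows. Here search, out_links and in_links are assumed helper functions (they are not part of the original slides): search(query) returns the pages containing the query terms, and out_links(p) / in_links(p) return the pages p links to / the pages linking to p.

```python
# Sketch of the sampling step: root set -> base set.
def build_base_set(query, search, out_links, in_links):
    root = set(search(query))          # the root set
    base = set(root)
    for p in root:
        base |= set(out_links(p))      # pages the root set links to
        base |= set(in_links(p))       # pages that link into the root set
    return base
```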
Step 2 – Iteration Step

Goal: find the pages in the base set that are good hubs and good authorities.
Step 2 – Iteration Step

[Figure: a 3-page web graph. N: Netscape, MS: Microsoft, A: Amazon.com.
N links to N, MS and A; MS links to A; A links to N and MS.]

h(N)  = a(N) + a(MS) + a(A)
h(MS) = a(A)
h(A)  = a(N) + a(MS)

In matrix form: h ← M a

                N  MS  A
         N   [  1   1  1 ]
     M = MS  [  0   0  1 ]     (adjacency matrix M)
         A   [  1   1  0 ]

     [ h(N)  ]       [ a(N)  ]
     [ h(MS) ]  = M  [ a(MS) ]
     [ h(A)  ]       [ a(A)  ]
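The h ← M a step for the N/MS/A example can be written out with plain lists (order N, MS, A; M[i][j] = 1 iff page i links to page j):

```python
# Adjacency matrix of the N/MS/A example graph.
M = [[1, 1, 1],   # N  links to N, MS, A
     [0, 0, 1],   # MS links to A
     [1, 1, 0]]   # A  links to N, MS

def matvec(M, x):
    # Multiply matrix M by vector x.
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

a = [1, 1, 1]        # start from unit authority weights
h = matvec(M, a)     # h = (3, 1, 2): N receives a(N)+a(MS)+a(A), etc.
```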
Step 2 – Iteration Step

a(N)  = h(N) + h(A)
a(MS) = h(N) + h(A)
a(A)  = h(N) + h(MS)

In matrix form: a ← M^T h

                  N  MS  A
           N   [  1   0  1 ]
     M^T = MS  [  1   0  1 ]     (transpose of the adjacency matrix M)
           A   [  1   1  0 ]

     [ a(N)  ]         [ h(N)  ]
     [ a(MS) ]  = M^T  [ h(MS) ]
     [ a(A)  ]         [ h(A)  ]
Step 2 – Iteration Step

Substituting one update into the other gives h ← M M^T h and a ← M^T M a:

               N  MS  A                    N  MS  A
        N   [  1   1  1 ]          N   [   1   0  1 ]
    M = MS  [  0   0  1 ]    M^T = MS  [   1   0  1 ]
        A   [  1   1  0 ]          A   [   1   1  0 ]

                 N  MS  A                      N  MS  A
          N   [  3   1  2 ]            N   [   2   2  1 ]
  M M^T = MS  [  1   1  0 ]    M^T M = MS  [   2   2  1 ]
          A   [  2   0  2 ]            A   [   1   1  2 ]
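The two product matrices can be checked with a few lines of code (a sketch, assuming the N, MS, A ordering of rows and columns):

```python
# Verify M M^T and M^T M for the N/MS/A graph with plain lists.
M = [[1, 1, 1], [0, 0, 1], [1, 1, 0]]

def matmul(A, B):
    # Standard matrix product of A and B.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

MT = [list(col) for col in zip(*M)]     # transpose of M
MMT = matmul(M, MT)
MTM = matmul(MT, M)
```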
Step 2 – Iteration Step

Hub update: h ← M M^T h, where

                 N  MS  A
          N   [  3   1  2 ]
  M M^T = MS  [  1   1  0 ]
          A   [  2   0  2 ]

Hub (non-normalized):
  Iteration No.   1      2      3      4      5      6      7
  N               1      6      7      7.071  7.091  7.096  7.098
  MS              1      2      2      1.929  1.909  1.904  1.902
  A               1      4      5      5.143  5.182  5.192  5.195

Hub (normalized so that the sum of all elements in the vector = 3):
  Iteration No.   1      2      3      4      5      6      7
  N               1      1.5    1.5    1.5    1.5    1.5    1.5
  MS              1      0.5    0.429  0.409  0.404  0.402  0.402
  A               1      1      1.071  1.091  1.096  1.098  1.098

Converged hub vector: Hub = (N, MS, A) = (1.5, 0.402, 1.098)
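The hub table above can be reproduced with a short loop (a sketch; the rescaling to a sum of 3 matches the normalization used in the slide):

```python
# Hub iteration h <- M M^T h, rescaled each step so the elements sum to 3.
MMT = [[3, 1, 2], [1, 1, 0], [2, 0, 2]]   # order N, MS, A
h = [1.0, 1.0, 1.0]
for _ in range(7):
    h = [sum(row[j] * h[j] for j in range(3)) for row in MMT]
    s = sum(h)
    h = [3 * x / s for x in h]
# h is now approximately (1.5, 0.402, 1.098)
```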
Step 2 – Iteration Step

Authority update: a ← M^T M a, where

                 N  MS  A
          N   [  2   2  1 ]
  M^T M = MS  [  2   2  1 ]
          A   [  1   1  2 ]

Authority (non-normalized):
  Iteration No.   1      2      3      4      5      6      7
  N               1      5      5.143  5.182  5.192  5.195  5.196
  MS              1      5      5.143  5.182  5.192  5.195  5.196
  A               1      4      3.857  3.818  3.808  3.805  3.804

Authority (normalized so that the sum of all elements in the vector = 3):
  Iteration No.   1      2      3      4      5      6      7
  N               1      1.071  1.091  1.096  1.098  1.098  1.098
  MS              1      1.071  1.091  1.096  1.098  1.098  1.098
  A               1      0.857  0.818  0.808  0.805  0.804  0.804

Final vectors:
  Hub       = (N, MS, A) = (1.5, 0.402, 1.098)
  Authority = (N, MS, A) = (1.098, 1.098, 0.804)
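The matching authority iteration uses M^T M with the same rescaling:

```python
# Authority iteration a <- M^T M a, rescaled each step to sum to 3.
MTM = [[2, 2, 1], [2, 2, 1], [1, 1, 2]]   # order N, MS, A
a = [1.0, 1.0, 1.0]
for _ in range(7):
    a = [sum(row[j] * a[j] for j in range(3)) for row in MTM]
    s = sum(a)
    a = [3 * x / s for x in a]
# a is now approximately (1.098, 1.098, 0.804)
```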
How to Rank

There are many ways to rank pages using the two scores:
- Rank in descending order of the hub value only
- Rank in descending order of the authority value only
- Rank in descending order of a value computed from both the hub value and the authority value (e.g., their sum)

For the example:
  Hub       = (N, MS, A) = (1.5, 0.402, 1.098)
  Authority = (N, MS, A) = (1.098, 1.098, 0.804)
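Using the third option (sum of hub and authority) on the converged weights from the example:

```python
# Rank the three pages by hub value + authority value, descending.
hub = {"N": 1.5, "MS": 0.402, "A": 1.098}
authority = {"N": 1.098, "MS": 1.098, "A": 0.804}
ranking = sorted(hub, key=lambda p: hub[p] + authority[p], reverse=True)
# ranking == ['N', 'A', 'MS']: N scores 2.598, A scores 1.902, MS scores 1.5
```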
PageRank Algorithm (Google)

Disadvantage of HITS: since there are two concepts, namely hubs and authorities, we do not know which concept is more important for ranking.

Advantage of PageRank: PageRank involves only one concept for ranking.
PageRank Algorithm (Google)

The PageRank algorithm uses a stochastic approach to rank the pages.

Link structure of the web (at the time of the original study):
- 150 million web pages, 1.7 billion links

Backlinks and forward links:
- If A and B link to C, then A and B are C's backlinks, and C is a forward link of A and B.

Intuitively, a webpage is important if it has a lot of backlinks. But what if a webpage has only one backlink, and it comes from www.yahoo.com?
A Simple Version of PageRank

  R(u) = c Σ_{v ∈ B_u} R(v) / N_v

- u: a web page
- B_u: the set of u's backlinks
- N_v: the number of forward links of page v
- c: the normalization factor to make ||R||_L1 = 1  (||R||_L1 = |R_1| + … + |R_n|)
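The simplified update can be sketched on a toy 3-page graph (the graph itself is an illustrative assumption):

```python
# Simplified PageRank: R(u) = c * sum over v in B_u of R(v) / N_v.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # forward links
pages = list(links)
R = {p: 1.0 / len(pages) for p in pages}
for _ in range(50):
    new = {u: sum(R[v] / len(links[v]) for v in pages if u in links[v])
           for u in pages}
    c = 1.0 / sum(new.values())      # normalization: keep ||R||_L1 = 1
    R = {u: c * new[u] for u in pages}
# R converges to about {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

Since every page here has at least one forward link, the total rank is preserved each step and c stays 1; c matters when rank leaks out of the system.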
A Problem with the Simplified PageRank

A loop: during each iteration, the pages in the loop accumulate rank but never distribute any rank to pages outside the loop!
Random Walks in Graphs

The Random Surfer Model
- The simplified model: the standing probability distribution of a random walk on the graph of the web. The surfer simply keeps clicking successive links at random.

The Modified Model
- The modified model: the "random surfer" simply keeps clicking successive links at random, but periodically "gets bored" and jumps to a random page based on the distribution E.
Modified Version of PageRank

  R'(u) = c ( Σ_{v ∈ B_u} R'(v) / N_v  +  E(u) )

- E(u): a distribution of ranks of web pages that "users" jump to when they "get bored" after following successive links at random.
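A sketch of the modified version with a jump distribution E. The weighting of the jump term (0.85 vs. 0.15) and the uniform E are common choices assumed here, and the toy graph is illustrative:

```python
# Modified PageRank with a uniform "bored surfer" jump distribution E.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
pages = list(links)
E = {p: 1.0 / len(pages) for p in pages}   # uniform jump distribution
R = {p: 1.0 / len(pages) for p in pages}
d = 0.85                                   # assumed damping weight
for _ in range(100):
    new = {u: d * sum(R[v] / len(links[v]) for v in pages if u in links[v])
           + (1 - d) * E[u] for u in pages}
    s = sum(new.values())
    R = {u: new[u] / s for u in pages}     # keep ||R||_L1 = 1
# C has no backlinks, so its rank comes only from E: R['C'] is about 0.05
```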
Dangling Links

- Links that point to a page with no outgoing links.
- Most such pages simply have not been downloaded yet.
- They affect the model, since it is not clear where their weight should be distributed.
- They do not affect the ranking of any other page directly.
- They can simply be removed before the PageRank calculation and added back afterwards.
PageRank Implementation

- Convert each URL into a unique integer and store each hyperlink in a database, using the integer IDs to identify pages.
- Sort the link structure by ID.
- Remove all the dangling links from the database.
- Make an initial assignment of ranks and start the iteration. Choosing a good initial assignment can speed up the PageRank computation.
- Add the dangling links back.
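The first two implementation steps can be sketched as follows (the URLs are placeholders):

```python
# Map each URL to a unique integer ID and store hyperlinks as
# (source_id, target_id) pairs sorted by source ID.
urls = ["http://a.example", "http://b.example", "http://c.example"]
ids = {u: i for i, u in enumerate(urls)}
raw_links = [("http://b.example", "http://a.example"),
             ("http://a.example", "http://c.example")]
link_table = sorted((ids[s], ids[t]) for s, t in raw_links)
# link_table == [(0, 2), (1, 0)]
```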