35
1 COMP4332 Web Data Thanks for Raymond Wong’s slides

1 COMP4332 Web Data Thanks for Raymond Wong’s slides

Embed Size (px)

Citation preview

1

COMP4332

Web Data

Thanks for Raymond Wong’s slides

2

Web Databases

Raymond Wong

COMP5331 3

How to rank the webpages?

4

Ranking Methods

HITS Algorithm PageRank Algorithm

COMP5331 5

HITS Algorithm

HITS is a ranking algorithm which ranks “hubs” and “authorities”.

COMP5331 6

HITS Algorithm

Authority

v v

Hub

Each page has two weights 1. Authority weight a(v)2. Hub weight h(v)

COMP5331 7

HITS Algorithm Each vertex has two weights

Authority weight Hub weight

Authority Weight

v v

Hub Weight

a(v) = uv h(u) h(v) = vu a(u)

A good authority has many edges from good hubs

A good hub has many outgoing edges to good authorities

COMP5331 8

HITS Algorithm

HITS involves two major steps. Step 1: Sampling Step Step 2: Iteration Step

COMP5331 9

Step 1 – Sampling Step Given a user query with several terms

Collect a set of pages that are very relevant – called the base set

How to find base set? We retrieve all webpages that contain the query

terms. The set of webpages is called the root set. Next, find the link pages, which are either pages

with a hyperlink to some page in the root set orsome page in the root set has hyperlink to these pages

All pages found form the base set.

COMP5331 10

HITS Algorithm

HITS involves two major steps. Step 1: Sampling Step Step 2: Iteration Step

COMP5331 11

Step 2 – Iteration Step

Goal: to find the base pages that are good hubs and good authorities

COMP5331 12

Step 2 – Iteration Step

N

A

M

N: NetscapeMS: MicrosoftA: Amazon.com

h(N) = a(N) + a(MS) + a(A)

h(MS) = a(A)

h(A) = a(N) + a(MS)

=

011

100

111

h(N)

h(MS)

h(A)

a(N)

a(MS)

a(A)

Adjacency matrix M

=

011

100

111N

MS

A

N MS A

aMh

h(N)

h(MS)

h(A)

h

a(N)

a(MS)

a(A)

a

COMP5331 13

Step 2 – Iteration Step

N

A

M

N: NetscapeMS: MicrosoftA: Amazon.com

a(N) = h(N) + h(A)

a(MS) = h(N) + h(A)

a(A) = h(N) + h(MS)

=

011

101

101

a(N)

a(MS)

a(A)

h(N)

h(MS)

h(A)

Adjacency matrix M

=

011

100

111N

MS

A

N MS A

h(N)

h(MS)

h(A)

h

a(N)

a(MS)

a(A)

a

hMa T

COMP5331 14

Step 2 – Iteration Step

aMh

hMa T

hMMh T

aMMa T

We have

We derive

COMP5331 15

Step 2 – Iteration Step

N

A

M

=

011

100

111N

MS

A

N MS A

M =

011

101

101N

MS

A

N MS A

MT

=

202

011

213N

MS

A

N MS A

MMT =

211

122

122N

MS

A

N MS A

MTM

COMP5331 16

Step 2 – Iteration Step

hMMh T

=

202

011

213N

MS

A

N MS A

MMT

Iteration No. 1 2 3 4 5 6 7

Hub (non-normalized)

111

NMSA

Iteration No. 1 2 3 4 5 6 7

Hub (normalized)

NMSA

111

624

1.50.51

725

7.0711.9295.143

1.50.4291.071

1.50.4091.091

7.0911.9095.182

7.0961.9045.192

1.50.4041.096

1.50.4021.098

7.0981.9025.195

1.50.4021.098

The sum of all elements in the vector = 3

NMSA

Hub

=

1.50.4021.098

COMP5331 17

Step 2 – Iteration Step

Iteration No. 1 2 3 4 5 6 7

Authority (non-normalized)

111

NMSA

Iteration No. 1 2 3 4 5 6 7

Authority (normalized)

NMSA

111

554

1.0711.0710.857

5.1435.1433.857

5.1825.1823.818

1.0911.0910.818

1.0961.0960.808

5.1925.1923.808

5.1955.1953.805

1.0981.0980.805

1.0981.0980.804

5.1965.1963.804

1.0981.0980.804

The sum of all elements in the vector = 3

aMMa T

=

211

122

122N

MS

A

N MS A

MTM

NMSA

Hub

=

1.50.4021.098

NMSA

Authority

=

1.0981.0980.804

COMP5331 18

How to Rank

Many ways Rank in descending order of hub only Rank in descending order of authority

only Rank in descending order of the value

computed from both hub and authority (e.g., the sum of the hub value and the authority value)

NMSA

Hub

=

1.50.4021.098

NMSA

Authority

=

1.0981.0980.804

COMP5331 19

Ranking Methods

HITS Algorithm PageRank Algorithm

COMP5331 20

PageRank Algorithm (Google)

Disadvantage of HITS: Since there are two concepts, namely

hubs and authorities, we do not know which concept is more important for ranking.

Advantage of PageRank: PageRank involves only one concept

for ranking

COMP5331 21

PageRank Algorithm (Google)

PageRank Algorithm makes use of Stochastic approach to rank the pages

Link Structure of the Web

150 million web pages 1.7 billion links

Backlinks and Forward links:A and B are C’s backlinksC is A and B’s forward link

Intuitively, a webpage is important if it has a lot of backlinks.

What if a webpage has only one link off www.yahoo.com?

A Simple Version of PageRank

u: a web page Bu: the set of u’s backlinks Nv: the number of forward links of page v c: the normalization factor to make ||R||

L1 = 1 (||R||L1= |R1 + … + Rn|)

An example of Simplified PageRank

PageRank Calculation: first iteration

An example of Simplified PageRank

PageRank Calculation: second iteration

An example of Simplified PageRank

Convergence after some iterations

A Problem with Simplified PageRank A loop:

During each iteration, the loop accumulates rank but never distributes rank to other pages!

An example of the Problem

An example of the Problem

An example of the Problem

Random Walks in Graphs The Random Surfer Model

The simplified model: the standing probability distribution of a random walk on the graph of the web. simply keeps clicking successive links at random

The Modified Model The modified model: the “random surfer”

simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page based on the distribution of E

Modified Version of PageRank

E(u): a distribution of ranks of web pages that “users” jump to when they “gets bored” after successive links at random.

An example of Modified PageRank

33

Dangling Links Links that point to any page with no

outgoing links Most are pages that have not been

downloaded yet Affect the model since it is not clear

where their weight should be distributed Do not affect the ranking of any other

page directly Can be simply removed before pagerank

calculation and added back afterwards

PageRank Implementation Convert each URL into a unique integer and

store each hyperlink in a database using the integer IDs to identify pages

Sort the link structure by ID

Remove all the dangling links from the database

Make an initial assignment of ranks and start iteration Choosing a good initial assignment can speed up the pagerank

Adding the dangling links back.