Page 1:

Web Searching & Ranking

Zachary G. Ives, University of Pennsylvania

CIS 455/555 – Internet and Web Systems

April 20, 2023

Some content based on slides by Marti Hearst, Ray Larson

Page 2:

Recall Where We Left Off

We were discussing information retrieval ranking models

The Boolean model captures some intuitions of what we want – AND, OR

But it’s too restrictive, and has no real ranking between returned answers


Page 3:

Vector Model

sim(q, dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)

Since wij > 0 and wiq > 0, 0 ≤ sim(q, dj) ≤ 1

A document is retrieved even if it matches the query terms only partially

[Figure: document vector dj and query vector q in term space, separated by angle θ]
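To make the cosine formula concrete, here is a minimal Python sketch (not from the slides; the function and the example vectors are illustrative only) that scores a document against a query given their term-weight vectors:

    import math

    def cosine_similarity(doc_weights, query_weights):
        """Cosine of the angle between a document vector and a query vector.

        Both arguments are lists of term weights (wij and wiq) over the same
        vocabulary order; with non-negative weights the result lies in [0, 1].
        """
        dot = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
        norm_d = math.sqrt(sum(w * w for w in doc_weights))
        norm_q = math.sqrt(sum(w * w for w in query_weights))
        if norm_d == 0 or norm_q == 0:
            return 0.0  # an empty document or query gets similarity 0
        return dot / (norm_d * norm_q)

    # A document matching only some query terms still gets a non-zero score
    print(cosine_similarity([1, 0, 1], [1, 2, 3]))  # ~0.76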

Page 4:

Weights in the Vector Model

sim(q, dj) = [Σi wij * wiq] / (|dj| * |q|)

How do we compute the weights wij and wiq? A good weight must take into account two effects:
quantification of intra-document content (similarity): the tf factor, the term frequency within a document
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)

Page 5:

TF and IDF Factors

Let:
N be the total number of docs in the collection
ni be the number of docs which contain ki
freq(i,j) be the raw frequency of ki within dj

A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j),
where the maximum is computed over all terms which occur within the document dj

The idf factor is computed as idf(i) = log(N / ni)
The log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term ki
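As a small illustration (names are mine; the slide does not fix a log base, so the natural log here is an assumption), these factors translate directly into Python:

    import math

    def normalized_tf(freq_ij, max_freq_in_doc):
        """f(i,j) = freq(i,j) / max_l freq(l,j)."""
        return freq_ij / max_freq_in_doc

    def idf(N, n_i):
        """idf(i) = log(N / n_i), where n_i documents contain term k_i."""
        return math.log(N / n_i)

    def tf_idf_weight(freq_ij, max_freq_in_doc, N, n_i):
        """w_ij = f(i,j) * idf(i)."""
        return normalized_tf(freq_ij, max_freq_in_doc) * idf(N, n_i)

    # Example: a term appearing 3 times in a document whose most frequent term
    # appears 6 times, where the term occurs in 100 of 10,000 documents.
    print(tf_idf_weight(3, 6, 10_000, 100))  # 0.5 * log(100) ≈ 2.30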

Page 6:

Vector Model, Example II

[Figure: documents d1–d7 and query q plotted in the space of terms k1, k2, k3]

      k1  k2  k3 | q · dj
  d1   1   0   1 |   4
  d2   1   0   0 |   1
  d3   0   1   1 |   5
  d4   1   0   0 |   1
  d5   1   1   1 |   6
  d6   1   1   0 |   3
  d7   0   1   0 |   2
  q    1   2   3 |

Page 7:

Vector Model, Example III

[Figure: documents d1–d7 and query q plotted in the space of terms k1, k2, k3]

      k1  k2  k3 | q · dj
  d1   2   0   1 |   5
  d2   1   0   0 |   1
  d3   0   1   3 |  11
  d4   2   0   0 |   2
  d5   1   2   4 |  17
  d6   1   2   0 |   5
  d7   0   5   0 |  10
  q    1   2   3 |
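The rightmost column is the raw dot product of the document and query term weights; a quick sketch (illustrative only) recomputes the Example III scores and the resulting ranking:

    # Term-weight vectors over (k1, k2, k3), taken from the Example III table
    docs = {
        "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
        "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
    }
    q = [1, 2, 3]

    # q . dj for every document, then sort by decreasing score
    scores = {name: sum(wd * wq for wd, wq in zip(vec, q)) for name, vec in docs.items()}
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(name, score)  # d5=17, d3=11, d7=10, d1=5, d6=5, d4=2, d2=1

Normalizing by |dj| and |q|, as in the cosine formula, can change the relative order of long and short documents.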

Page 8:

Vector Model, Summarized

The best term-weighting schemes use tf-idf weights: wij = f(i,j) * log(N / ni)

For the query term weights, a suggestion is:
wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / ni)

This model is very good in practice:
tf-idf works well with general collections
Simple and fast to compute
The vector model is usually as good as the known ranking alternatives
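A one-function sketch of that suggested query weighting (again with the natural log as an assumed base; names are illustrative):

    import math

    def query_weight(freq_iq, max_freq_in_query, N, n_i):
        """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
        return (0.5 + 0.5 * freq_iq / max_freq_in_query) * math.log(N / n_i)

    # Example: a query term used once (the most repeated query term appears twice),
    # occurring in 50 of 10,000 documents.
    print(query_weight(1, 2, 10_000, 50))  # 0.75 * log(200) ≈ 3.97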

Page 9:

Pros & Cons of Vector Model

Advantages:
term-weighting improves quality of the answer set
partial matching allows retrieval of docs that approximate the query conditions
the cosine ranking formula sorts documents according to degree of similarity to the query

Disadvantages:
assumes independence of index terms; not clear if this is a good or bad assumption

Page 10:

Comparison of Classic Models

Boolean model does not provide for partial matches and is considered to be the weakest classic model

Experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general

Generally we use a variation of the vector model in most text search systems

Page 11:

Switching Our Sights to the Web

Information retrieval is more heterogeneous in nature:
No editor to control quality
Deliberately misleading information (“web spam”)
Great variety in types of information: phone books, catalogs, technical reports, news, slide shows, …
Many languages; partial duplication; jargon
Diverse user goals
Very short queries: ~2.35 words on average (Aug 2000; Google results)

And much larger scale!

Page 12:

Handling Short Queries & Mixed-Quality Information

Human processing
Web directories: Yahoo, Open Directory, …
Human-created answers: about.com, Search Wikia
(Still not clear that automated question-answering works)
Capitalism: “paid placement”
Advertisers pay to be associated with certain keywords
Clicks / page popularity: pages visited most often
Link analysis: use link structure to determine credibility

… combination of all?

Page 13:

Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google)

Assumptions:
Credible sources will mostly point to credible sources
Names of hyperlinks suggest meaning
Ranking is a function of the query terms and of the hyperlink structure

An example of why this makes sense:
The official Olympics site will be linked to by most high-quality sites about sports, Olympics, etc.
A spammer who adds “Olympics” to his/her web site probably won’t have many links to it
Caveat: “Search engine optimization”

Page 14:

Google’s PageRank (Brin/Page 98)

Mine structure of web graph independently of the query!

Each web page is a node, each hyperlink is a directed edge

Assumes a random walk (surf) through the web:
Start at a random page
At each step, the surfer proceeds:
to a randomly chosen web page with probability d
to a randomly chosen successor of the current page with probability 1 - d

The PageRank of a page p is the fraction of steps the surfer spends at p in the limit
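This definition can be simulated directly. A rough Monte Carlo sketch (the tiny graph, the choice d = 0.15, and all names are illustrative assumptions, not from the slides):

    import random

    def simulate_pagerank(graph, d=0.15, steps=200_000):
        """Estimate PageRank by simulating the random surfer.

        graph: dict mapping each page to the list of pages it links to.
        With probability d the surfer jumps to a random page; otherwise it
        follows a random out-link (it also jumps if the page has no out-links).
        Returns the fraction of steps spent at each page.
        """
        pages = list(graph)
        visits = {p: 0 for p in pages}
        current = random.choice(pages)
        for _ in range(steps):
            visits[current] += 1
            if random.random() < d or not graph[current]:
                current = random.choice(pages)
            else:
                current = random.choice(graph[current])
        return {p: count / steps for p, count in visits.items()}

    # Made-up three-page graph for illustration
    print(simulate_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))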

Page 15:

Link Counts Aren’t Everything…

[Figure: example link graph connecting the “A-Team” page with Hollywood’s “Series to Recycle” page, the Yahoo Directory, Wikipedia, Mr. T’s page, a Team Sports page, and a Cheesy TV Shows page]

Page 16:

PageRank

xi = Σ_{j in Bi} xj / Nj

where xi is the rank of page i, the sum runs over every page j that links to i (the set Bi), xj is the rank of page j, and Nj is the number of links out from page j

Importance of page i is governed by pages linking to it

Page 17:

Computing PageRank (Simple version)

Initialize so total rank sums to 1.0:
xi(0) = 1/n

Iterate until convergence:
xi(k+1) = Σ_{j in Bi} xj(k) / Nj
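A direct Python sketch of this simple iteration (no decay factor yet). The three-page example graph is my own, chosen so the ranks land on the 0.2 / 0.4 / 0.4 values shown on the convergence slide, assuming a similar graph is used there:

    def naive_pagerank(graph, iterations=50):
        """Simple PageRank iteration: x_i <- sum over back-linking pages j of x_j / N_j.

        graph: dict mapping each page to the list of pages it links to.
        No decay factor yet, so dead ends and rank "hogs" (later slides) are not handled.
        """
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}        # x_i(0) = 1/n, total rank sums to 1.0
        for _ in range(iterations):
            new_rank = {p: 0.0 for p in pages}
            for j, out_links in graph.items():
                if not out_links:
                    continue                      # a dead end simply leaks its rank
                share = rank[j] / len(out_links)  # x_j / N_j
                for i in out_links:
                    new_rank[i] += share
            rank = new_rank
        return rank

    # A links to B and C, B links to C, C links to A
    print(naive_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
    # -> roughly {'A': 0.4, 'B': 0.2, 'C': 0.4}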

Page 18:

Computing PageRank (Step 0)

Initialize so total rank sums to 1.0: xi(0) = 1/n

[Figure: three-node example graph, each node starting with rank 0.33]

Page 19:

Computing PageRank (Step 1)

Propagate weights across out-edges: xi(k+1) = Σ_{j in Bi} xj(k) / Nj

[Figure: each node’s rank of 0.33 propagated across its out-edges; a node with two out-edges sends 0.17 along each]

Page 20:

Computing PageRank (Step 2)

Compute weights based on in-edges: xi(1) = Σ_{j in Bi} xj(0) / Nj

[Figure: node ranks after this step: 0.17, 0.50, 0.33]

Page 21:

Computing PageRank (Convergence)

xi(k+1) = Σ_{j in Bi} xj(k) / Nj

[Figure: converged node ranks: 0.2, 0.4, 0.4]

Page 22:

Naïve PageRank Algorithm Restated

Let:
N(p) = number of outgoing links from page p
B(p) = the set of pages with back-links to page p

Each page b distributes its importance to all of the pages it points to (so we scale by N(b))

Page p’s importance is increased by the importance of its back set:

PageRank(p) = Σ_{b in B(p)} PageRank(b) / N(b)

Page 23:

In Linear Algebra Terms

Create an m x m matrix M to capture links:
M(i, j) = 1 / nj if page i is pointed to by page j, where page j has nj outgoing links
M(i, j) = 0 otherwise

Initialize all PageRanks to 1, multiply by M repeatedly until all values converge:

[ PageRank(p1') ]       [ PageRank(p1) ]
[ PageRank(p2') ]  = M  [ PageRank(p2) ]
[      ...      ]       [      ...     ]
[ PageRank(pm') ]       [ PageRank(pm) ]

(Computes the principal eigenvector via power iteration)
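In code, this is just repeated matrix-vector multiplication; a minimal numpy sketch (illustrative; no convergence test, just a fixed iteration count):

    import numpy as np

    def power_iteration(M, iterations=100):
        """Repeatedly multiply the rank vector by the link matrix M.

        M[i, j] = 1 / n_j if page j (with n_j out-links) points to page i, else 0.
        Returns an approximation of the principal eigenvector.
        """
        ranks = np.ones(M.shape[0])   # initialize all PageRanks to 1
        for _ in range(iterations):
            ranks = M @ ranks
        return ranks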

Page 24:

A Brief Example

[Figure: link graph among Google, Yahoo, and Amazon]

[ g' ]     [ 0    0.5   0.5 ]   [ g ]
[ y' ]  =  [ 0    0     0.5 ] * [ y ]
[ a' ]     [ 1    0.5   0   ]   [ a ]

Total rank sums to number of pages

Running for multiple iterations:

[ g ]   [ 1 ]   [ 1   ]   [ 1    ]        [ 1    ]
[ y ] = [ 1 ] , [ 0.5 ] , [ 0.75 ] , … →  [ 0.67 ]
[ a ]   [ 1 ]   [ 1.5 ]   [ 1.25 ]        [ 1.33 ]
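Feeding this matrix into a power iteration like the sketch above reproduces those iterates (an illustration; the limit works out to about 1, 0.67, 1.33):

    import numpy as np

    M = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.5],
                  [1.0, 0.5, 0.0]])
    v = np.array([1.0, 1.0, 1.0])   # (g, y, a); total rank = number of pages
    for step in range(30):
        v = M @ v
        if step < 2:
            print(v)                # [1. 0.5 1.5], then [1. 0.75 1.25]
    print(v.round(2))               # about [1. 0.67 1.33]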

Page 25:

Oops #1 – PageRank Sinks: Dead Ends

[Figure: link graph among Google, Yahoo, and Amazon, where one page has no outgoing links]

[ g' ]     [ 0     0    0.5 ]   [ g ]
[ y' ]  =  [ 0.5   0    0.5 ] * [ y ]
[ a' ]     [ 0.5   0    0   ]   [ a ]

Running for multiple iterations:

[ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]
[ y ] = [ 1 ] , [ 1   ] , [ 0.5  ] , … →  [ 0 ]
[ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]

Page 26:

Oops #2 – Hogging all the PageRank

[Figure: link graph among Google, Yahoo, and Amazon, where one page links only to itself]

[ g' ]     [ 0     0    0.5 ]   [ g ]
[ y' ]  =  [ 0.5   1    0.5 ] * [ y ]
[ a' ]     [ 0.5   0    0   ]   [ a ]

Running for multiple iterations:

[ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]
[ y ] = [ 1 ] , [ 2   ] , [ 2.5  ] , … →  [ 3 ]
[ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]

Page 27:

Improved PageRank

Remove out-degree-0 nodes (or consider them to refer back to their referrers)

Add a decay factor to deal with sinks:
PageRank(p) = d * Σ_{b in B(p)} (PageRank(b) / N(b)) + (1 – d)

Intuition in the idea of the “random surfer”: the surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 - d
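A Python sketch of the damped update (the graph below is the self-loop “hog” reconstructed from the earlier matrix example; with d = 0.8 its first iteration produces the 0.6 / 1.8 / 0.6 values shown on the next slide):

    def damped_pagerank(graph, d=0.8, iterations=100):
        """PageRank with a decay factor:
        PageRank(p) = d * sum over b in B(p) of PageRank(b) / N(b) + (1 - d).

        graph: dict mapping each page to the list of pages it links to.
        With this form of the (1 - d) term, total rank sums to the number of pages.
        """
        pages = list(graph)
        rank = {p: 1.0 for p in pages}              # start with rank 1 per page
        for _ in range(iterations):
            new_rank = {p: 1.0 - d for p in pages}  # the (1 - d) "random jump" share
            for b, out_links in graph.items():
                for p in out_links:
                    new_rank[p] += d * rank[b] / len(out_links)
            rank = new_rank
        return rank

    # g links to y and a; y links only to itself (the hog); a links to g and y
    graph = {"g": ["y", "a"], "y": ["y"], "a": ["g", "y"]}
    print(damped_pagerank(graph, d=0.8))  # -> roughly {'g': 0.33, 'y': 2.33, 'a': 0.33}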

Page 28:

Stopping the Hog

[Figure: link graph among Google, Yahoo, and Amazon]

[ g' ]           [ 0     0    0.5 ]   [ g ]   [ 0.2 ]
[ y' ]  =  0.8 * [ 0.5   1    0.5 ] * [ y ] + [ 0.2 ]
[ a' ]           [ 0.5   0    0   ]   [ a ]   [ 0.2 ]

Running for multiple iterations:

[ g ]   [ 0.6 ]   [ 0.44 ]   [ 0.38 ]   [ 0.35 ]
[ y ] = [ 1.8 ] , [ 2.12 ] , [ 2.25 ] , [ 2.30 ] , …
[ a ]   [ 0.6 ]   [ 0.44 ]   [ 0.38 ]   [ 0.35 ]

… though does this seem right?

Page 29:

Summary of Link Analysis

Use back-links as a means of adjusting the “worthiness” or “importance” of a page

Use iterative process over matrix/vector values to reach a convergence point

PageRank is query-independent and considered relatively stable, but vulnerable to SEO

Page 30:

Can We Go Beyond?

PageRank assumes a “random surfer” who starts at any node and estimates likelihood that the surfer will end up at a particular page

A more general notion: label propagation
Take a set of start nodes, each with a different label
Estimate, for every node, the distribution of arrivals from each label
In essence, this captures the relatedness or influence of nodes
Used in YouTube video matching, schema matching, … (a sketch of the idea follows below)

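One simple way to realize this (a rough sketch of the general idea only, not the specific algorithms used in those systems; all names and the damping value are assumptions): run a damped walk from each label’s seed set, much like personalized PageRank, and record how much mass from each label reaches every node.

    def propagate_labels(graph, seeds, d=0.85, iterations=50):
        """Sketch of label propagation via damped walks from labeled seed sets.

        graph: dict mapping each node to the list of its successor nodes.
        seeds: dict mapping each label to the list of start nodes for that label.
        Returns, for every node, a per-label score: roughly how much walk mass
        originating from that label's seeds arrives at the node.
        """
        scores = {node: {} for node in graph}
        for label, start_nodes in seeds.items():
            # restart distribution concentrated on this label's seed nodes
            restart = {n: (1.0 / len(start_nodes) if n in start_nodes else 0.0) for n in graph}
            mass = dict(restart)
            for _ in range(iterations):
                new_mass = {n: (1 - d) * restart[n] for n in graph}
                for u, out_links in graph.items():
                    if not out_links:
                        continue
                    share = d * mass[u] / len(out_links)
                    for v in out_links:
                        new_mass[v] += share
                mass = new_mass
            for node in graph:
                scores[node][label] = mass[node]
        return scores

    # Tiny made-up graph: which label "influences" v2 more?
    graph = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
    print(propagate_labels(graph, seeds={"music": ["v1"], "sports": ["v3"]}))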

Page 31:

Overall Ranking Strategies in Web Search Engines

Everybody has their own “secret sauce” that uses:
Vector model (TF/IDF)
Proximity of terms
Where terms appear (title vs. body vs. link)
Link analysis
Info from directories
Page popularity
gorank.com, a “search engine optimization” site, compares these factors

Some alternative approaches:
Some new engines (Vivisimo, Teoma, Clusty) try to do clustering
A few engines (Dogpile, Mamma.com) try to do meta-search