Page 1:

Web Searching & Ranking

Zachary G. Ives, University of Pennsylvania

CIS 455/555 – Internet and Web Systems

April 20, 2023

Some content based on slides by Marti Hearst, Ray Larson

Page 2:

Recall Where We Left Off

We were discussing information retrieval ranking models

The Boolean model captures some intuitions of what we want – AND, OR

But it’s too restrictive, and has no real ranking between returned answers


Page 3:

Vector Model

sim(q, dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)

Since wij > 0 and wiq > 0, 0 ≤ sim(q, dj) ≤ 1

A document is retrieved even if it matches the query terms only partially

[Figure: document vector dj and query vector q in term space, separated by angle θ]
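To make the cosine formula concrete, here is a minimal Python sketch (not from the slides; the function and the example vectors are illustrative only) that scores a document against a query given their term-weight vectors:

    import math

    def cosine_similarity(doc_weights, query_weights):
        """Cosine of the angle between a document vector and a query vector.

        Both arguments are lists of term weights (wij and wiq) over the same
        vocabulary order; with non-negative weights the result lies in [0, 1].
        """
        dot = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
        norm_d = math.sqrt(sum(w * w for w in doc_weights))
        norm_q = math.sqrt(sum(w * w for w in query_weights))
        if norm_d == 0 or norm_q == 0:
            return 0.0  # an empty document or query gets similarity 0
        return dot / (norm_d * norm_q)

    # A document matching only some query terms still gets a non-zero score
    print(cosine_similarity([1, 0, 1], [1, 2, 3]))  # ~0.76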

Page 4:

Weights in the Vector Model

sim(q, dj) = [Σi wij * wiq] / (|dj| * |q|)

How do we compute the weights wij and wiq? A good weight must take into account two effects:
quantification of intra-document content (similarity): the tf factor, the term frequency within a document
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)

Page 5:

TF and IDF Factors

Let:
N be the total number of docs in the collection
ni be the number of docs which contain ki
freq(i,j) be the raw frequency of ki within dj

A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j),
where the maximum is computed over all terms which occur within the document dj

The idf factor is computed as idf(i) = log(N / ni)
The log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term ki
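As a small illustration (names are mine; the slide does not fix a log base, so the natural log here is an assumption), these factors translate directly into Python:

    import math

    def normalized_tf(freq_ij, max_freq_in_doc):
        """f(i,j) = freq(i,j) / max_l freq(l,j)."""
        return freq_ij / max_freq_in_doc

    def idf(N, n_i):
        """idf(i) = log(N / n_i), where n_i documents contain term k_i."""
        return math.log(N / n_i)

    def tf_idf_weight(freq_ij, max_freq_in_doc, N, n_i):
        """w_ij = f(i,j) * idf(i)."""
        return normalized_tf(freq_ij, max_freq_in_doc) * idf(N, n_i)

    # Example: a term appearing 3 times in a document whose most frequent term
    # appears 6 times, where the term occurs in 100 of 10,000 documents.
    print(tf_idf_weight(3, 6, 10_000, 100))  # 0.5 * log(100) ≈ 2.30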

Page 6:

Vector Model, Example II

[Figure: documents d1–d7 and query q plotted in the space of terms k1, k2, k3]

      k1  k2  k3 | q · dj
  d1   1   0   1 |   4
  d2   1   0   0 |   1
  d3   0   1   1 |   5
  d4   1   0   0 |   1
  d5   1   1   1 |   6
  d6   1   1   0 |   3
  d7   0   1   0 |   2
  q    1   2   3 |

Page 7:

Vector Model, Example III

[Figure: documents d1–d7 and query q plotted in the space of terms k1, k2, k3]

      k1  k2  k3 | q · dj
  d1   2   0   1 |   5
  d2   1   0   0 |   1
  d3   0   1   3 |  11
  d4   2   0   0 |   2
  d5   1   2   4 |  17
  d6   1   2   0 |   5
  d7   0   5   0 |  10
  q    1   2   3 |
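The rightmost column is the raw dot product of the document and query term weights; a quick sketch (illustrative only) recomputes the Example III scores and the resulting ranking:

    # Term-weight vectors over (k1, k2, k3), taken from the Example III table
    docs = {
        "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
        "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
    }
    q = [1, 2, 3]

    # q . dj for every document, then sort by decreasing score
    scores = {name: sum(wd * wq for wd, wq in zip(vec, q)) for name, vec in docs.items()}
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(name, score)  # d5=17, d3=11, d7=10, d1=5, d6=5, d4=2, d2=1

Normalizing by |dj| and |q|, as in the cosine formula, can change the relative order of long and short documents.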

Page 8:

Vector Model, Summarized

The best term-weighting schemes use tf-idf weights: wij = f(i,j) * log(N / ni)

For the query term weights, a suggestion is:
wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / ni)

This model is very good in practice:
tf-idf works well with general collections
Simple and fast to compute
The vector model is usually as good as the known ranking alternatives
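A one-function sketch of that suggested query weighting (again with the natural log as an assumed base; names are illustrative):

    import math

    def query_weight(freq_iq, max_freq_in_query, N, n_i):
        """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
        return (0.5 + 0.5 * freq_iq / max_freq_in_query) * math.log(N / n_i)

    # Example: a query term used once (the most repeated query term appears twice),
    # occurring in 50 of 10,000 documents.
    print(query_weight(1, 2, 10_000, 50))  # 0.75 * log(200) ≈ 3.97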

Page 9:

Pros & Cons of Vector Model

Advantages:
term-weighting improves quality of the answer set
partial matching allows retrieval of docs that approximate the query conditions
the cosine ranking formula sorts documents according to degree of similarity to the query

Disadvantages:
assumes independence of index terms; not clear if this is a good or bad assumption

Page 10:

Comparison of Classic Models

Boolean model does not provide for partial matches and is considered to be the weakest classic model

Experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general

Generally we use a variation of the vector model in most text search systems

Page 11:

Switching Our Sights to the Web

Information retrieval is more heterogeneous in nature:
No editor to control quality
Deliberately misleading information (“web spam”)
Great variety in types of information: phone books, catalogs, technical reports, news, slide shows, …
Many languages; partial duplication; jargon
Diverse user goals
Very short queries: ~2.35 words on average (Aug 2000; Google results)

And much larger scale!

Page 12:

Handling Short Queries & Mixed-Quality Information

Human processing
Web directories: Yahoo, Open Directory, …
Human-created answers: about.com, Search Wikia
(Still not clear that automated question-answering works)
Capitalism: “paid placement”
Advertisers pay to be associated with certain keywords
Clicks / page popularity: pages visited most often
Link analysis: use link structure to determine credibility

… combination of all?

Page 13:

Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google)

Assumptions:
Credible sources will mostly point to credible sources
Names of hyperlinks suggest meaning
Ranking is a function of the query terms and of the hyperlink structure

An example of why this makes sense:
The official Olympics site will be linked to by most high-quality sites about sports, Olympics, etc.
A spammer who adds “Olympics” to his/her web site probably won’t have many links to it
Caveat: “Search engine optimization”

Page 14:

Google’s PageRank (Brin/Page 98)

Mine structure of web graph independently of the query!

Each web page is a node, each hyperlink is a directed edge

Assumes a random walk (surf) through the web:
Start at a random page
At each step, the surfer proceeds:
to a randomly chosen web page with probability d
to a randomly chosen successor of the current page with probability 1 - d

The PageRank of a page p is the fraction of steps the surfer spends at p in the limit
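This definition can be simulated directly. A rough Monte Carlo sketch (the tiny graph, the choice d = 0.15, and all names are illustrative assumptions, not from the slides):

    import random

    def simulate_pagerank(graph, d=0.15, steps=200_000):
        """Estimate PageRank by simulating the random surfer.

        graph: dict mapping each page to the list of pages it links to.
        With probability d the surfer jumps to a random page; otherwise it
        follows a random out-link (it also jumps if the page has no out-links).
        Returns the fraction of steps spent at each page.
        """
        pages = list(graph)
        visits = {p: 0 for p in pages}
        current = random.choice(pages)
        for _ in range(steps):
            visits[current] += 1
            if random.random() < d or not graph[current]:
                current = random.choice(pages)
            else:
                current = random.choice(graph[current])
        return {p: count / steps for p, count in visits.items()}

    # Made-up three-page graph for illustration
    print(simulate_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))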

Page 15:

Link Counts Aren’t Everything…

[Figure: example link graph connecting the “A-Team” page with Hollywood’s “Series to Recycle” page, the Yahoo Directory, Wikipedia, Mr. T’s page, a Team Sports page, and a Cheesy TV Shows page]

Page 16:

PageRank

xi = Σ_{j in Bi} xj / Nj

where xi is the rank of page i, the sum runs over every page j that links to i (the set Bi), xj is the rank of page j, and Nj is the number of links out from page j

Importance of page i is governed by pages linking to it

Page 17:

Computing PageRank (Simple version)

Initialize so total rank sums to 1.0:
xi(0) = 1/n

Iterate until convergence:
xi(k+1) = Σ_{j in Bi} xj(k) / Nj
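A direct Python sketch of this simple iteration (no decay factor yet). The three-page example graph is my own, chosen so the ranks land on the 0.2 / 0.4 / 0.4 values shown on the convergence slide, assuming a similar graph is used there:

    def naive_pagerank(graph, iterations=50):
        """Simple PageRank iteration: x_i <- sum over back-linking pages j of x_j / N_j.

        graph: dict mapping each page to the list of pages it links to.
        No decay factor yet, so dead ends and rank "hogs" (later slides) are not handled.
        """
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}        # x_i(0) = 1/n, total rank sums to 1.0
        for _ in range(iterations):
            new_rank = {p: 0.0 for p in pages}
            for j, out_links in graph.items():
                if not out_links:
                    continue                      # a dead end simply leaks its rank
                share = rank[j] / len(out_links)  # x_j / N_j
                for i in out_links:
                    new_rank[i] += share
            rank = new_rank
        return rank

    # A links to B and C, B links to C, C links to A
    print(naive_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
    # -> roughly {'A': 0.4, 'B': 0.2, 'C': 0.4}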

Page 18:

Computing PageRank (Step 0)

Initialize so total rank sums to 1.0: xi(0) = 1/n

[Figure: three-node example graph, each node starting with rank 0.33]

Page 19:

Computing PageRank (Step 1)

Propagate weights across out-edges: xi(k+1) = Σ_{j in Bi} xj(k) / Nj

[Figure: each node’s rank of 0.33 propagated across its out-edges; a node with two out-edges sends 0.17 along each]

Page 20:

Computing PageRank (Step 2)

Compute weights based on in-edges: xi(1) = Σ_{j in Bi} xj(0) / Nj

[Figure: node ranks after this step: 0.17, 0.50, 0.33]

Page 21:

Computing PageRank (Convergence)

xi(k+1) = Σ_{j in Bi} xj(k) / Nj

[Figure: converged node ranks: 0.2, 0.4, 0.4]

Page 22:

Naïve PageRank Algorithm Restated

Let:
N(p) = number of outgoing links from page p
B(p) = the set of pages with back-links to page p

Each page b distributes its importance to all of the pages it points to (so we scale by N(b))

Page p’s importance is increased by the importance of its back set:

PageRank(p) = Σ_{b in B(p)} PageRank(b) / N(b)

Page 23:

In Linear Algebra Terms

Create an m x m matrix M to capture links:
M(i, j) = 1 / nj if page i is pointed to by page j, where page j has nj outgoing links
M(i, j) = 0 otherwise

Initialize all PageRanks to 1, multiply by M repeatedly until all values converge:

[ PageRank(p1') ]       [ PageRank(p1) ]
[ PageRank(p2') ]  = M  [ PageRank(p2) ]
[      ...      ]       [      ...     ]
[ PageRank(pm') ]       [ PageRank(pm) ]

(Computes the principal eigenvector via power iteration)
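In code, this is just repeated matrix-vector multiplication; a minimal numpy sketch (illustrative; no convergence test, just a fixed iteration count):

    import numpy as np

    def power_iteration(M, iterations=100):
        """Repeatedly multiply the rank vector by the link matrix M.

        M[i, j] = 1 / n_j if page j (with n_j out-links) points to page i, else 0.
        Returns an approximation of the principal eigenvector.
        """
        ranks = np.ones(M.shape[0])   # initialize all PageRanks to 1
        for _ in range(iterations):
            ranks = M @ ranks
        return ranks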

Page 24:

A Brief Example

[Figure: link graph among Google, Yahoo, and Amazon]

[ g' ]     [ 0    0.5   0.5 ]   [ g ]
[ y' ]  =  [ 0    0     0.5 ] * [ y ]
[ a' ]     [ 1    0.5   0   ]   [ a ]

Total rank sums to number of pages

Running for multiple iterations:

[ g ]   [ 1 ]   [ 1   ]   [ 1    ]        [ 1    ]
[ y ] = [ 1 ] , [ 0.5 ] , [ 0.75 ] , … →  [ 0.67 ]
[ a ]   [ 1 ]   [ 1.5 ]   [ 1.25 ]        [ 1.33 ]
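Feeding this matrix into a power iteration like the sketch above reproduces those iterates (an illustration; the limit works out to about 1, 0.67, 1.33):

    import numpy as np

    M = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.5],
                  [1.0, 0.5, 0.0]])
    v = np.array([1.0, 1.0, 1.0])   # (g, y, a); total rank = number of pages
    for step in range(30):
        v = M @ v
        if step < 2:
            print(v)                # [1. 0.5 1.5], then [1. 0.75 1.25]
    print(v.round(2))               # about [1. 0.67 1.33]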

Page 25:

Oops #1 – PageRank Sinks: Dead Ends

[Figure: link graph among Google, Yahoo, and Amazon, where one page has no outgoing links]

[ g' ]     [ 0     0    0.5 ]   [ g ]
[ y' ]  =  [ 0.5   0    0.5 ] * [ y ]
[ a' ]     [ 0.5   0    0   ]   [ a ]

Running for multiple iterations:

[ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]
[ y ] = [ 1 ] , [ 1   ] , [ 0.5  ] , … →  [ 0 ]
[ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]

Page 26:

Oops #2 – Hogging all the PageRank

[Figure: link graph among Google, Yahoo, and Amazon, where one page links only to itself]

[ g' ]     [ 0     0    0.5 ]   [ g ]
[ y' ]  =  [ 0.5   1    0.5 ] * [ y ]
[ a' ]     [ 0.5   0    0   ]   [ a ]

Running for multiple iterations:

[ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]
[ y ] = [ 1 ] , [ 2   ] , [ 2.5  ] , … →  [ 3 ]
[ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]        [ 0 ]

Page 27:

Improved PageRank

Remove out-degree-0 nodes (or consider them to refer back to their referrers)

Add a decay factor to deal with sinks:
PageRank(p) = d * Σ_{b in B(p)} (PageRank(b) / N(b)) + (1 – d)

Intuition in the idea of the “random surfer”: the surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 - d
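A Python sketch of the damped update (the graph below is the self-loop “hog” reconstructed from the earlier matrix example; with d = 0.8 its first iteration produces the 0.6 / 1.8 / 0.6 values shown on the next slide):

    def damped_pagerank(graph, d=0.8, iterations=100):
        """PageRank with a decay factor:
        PageRank(p) = d * sum over b in B(p) of PageRank(b) / N(b) + (1 - d).

        graph: dict mapping each page to the list of pages it links to.
        With this form of the (1 - d) term, total rank sums to the number of pages.
        """
        pages = list(graph)
        rank = {p: 1.0 for p in pages}              # start with rank 1 per page
        for _ in range(iterations):
            new_rank = {p: 1.0 - d for p in pages}  # the (1 - d) "random jump" share
            for b, out_links in graph.items():
                for p in out_links:
                    new_rank[p] += d * rank[b] / len(out_links)
            rank = new_rank
        return rank

    # g links to y and a; y links only to itself (the hog); a links to g and y
    graph = {"g": ["y", "a"], "y": ["y"], "a": ["g", "y"]}
    print(damped_pagerank(graph, d=0.8))  # -> roughly {'g': 0.33, 'y': 2.33, 'a': 0.33}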

Page 28:

Stopping the Hog

[Figure: link graph among Google, Yahoo, and Amazon]

[ g' ]           [ 0     0    0.5 ]   [ g ]   [ 0.2 ]
[ y' ]  =  0.8 * [ 0.5   1    0.5 ] * [ y ] + [ 0.2 ]
[ a' ]           [ 0.5   0    0   ]   [ a ]   [ 0.2 ]

Running for multiple iterations:

[ g ]   [ 0.6 ]   [ 0.44 ]   [ 0.38 ]   [ 0.35 ]
[ y ] = [ 1.8 ] , [ 2.12 ] , [ 2.25 ] , [ 2.30 ] , …
[ a ]   [ 0.6 ]   [ 0.44 ]   [ 0.38 ]   [ 0.35 ]

… though does this seem right?

Page 29:

Summary of Link Analysis

Use back-links as a means of adjusting the “worthiness” or “importance” of a page

Use iterative process over matrix/vector values to reach a convergence point

PageRank is query-independent and considered relatively stable, but vulnerable to SEO

Page 30:

Can We Go Beyond?

PageRank assumes a “random surfer” who starts at any node and estimates likelihood that the surfer will end up at a particular page

A more general notion: label propagation
Take a set of start nodes, each with a different label
Estimate, for every node, the distribution of arrivals from each label
In essence, this captures the relatedness or influence of nodes
Used in YouTube video matching, schema matching, … (a sketch of the idea follows below)

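One simple way to realize this (a rough sketch of the general idea only, not the specific algorithms used in those systems; all names and the damping value are assumptions): run a damped walk from each label’s seed set, much like personalized PageRank, and record how much mass from each label reaches every node.

    def propagate_labels(graph, seeds, d=0.85, iterations=50):
        """Sketch of label propagation via damped walks from labeled seed sets.

        graph: dict mapping each node to the list of its successor nodes.
        seeds: dict mapping each label to the list of start nodes for that label.
        Returns, for every node, a per-label score: roughly how much walk mass
        originating from that label's seeds arrives at the node.
        """
        scores = {node: {} for node in graph}
        for label, start_nodes in seeds.items():
            # restart distribution concentrated on this label's seed nodes
            restart = {n: (1.0 / len(start_nodes) if n in start_nodes else 0.0) for n in graph}
            mass = dict(restart)
            for _ in range(iterations):
                new_mass = {n: (1 - d) * restart[n] for n in graph}
                for u, out_links in graph.items():
                    if not out_links:
                        continue
                    share = d * mass[u] / len(out_links)
                    for v in out_links:
                        new_mass[v] += share
                mass = new_mass
            for node in graph:
                scores[node][label] = mass[node]
        return scores

    # Tiny made-up graph: which label "influences" v2 more?
    graph = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
    print(propagate_labels(graph, seeds={"music": ["v1"], "sports": ["v3"]}))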

Page 31:

Overall Ranking Strategies in Web Search Engines

Everybody has their own “secret sauce” that uses:
Vector model (TF/IDF)
Proximity of terms
Where terms appear (title vs. body vs. link)
Link analysis
Info from directories
Page popularity
gorank.com, a “search engine optimization” site, compares these factors

Some alternative approaches:
Some new engines (Vivisimo, Teoma, Clusty) try to do clustering
A few engines (Dogpile, Mamma.com) try to do meta-search