Relating Web Characteristics Ricardo Baeza-Yates Carlos Castillo Universidad de Chile

Relating Web Characteristics with Link-Based Ranking

Page 1: Relating Web Characteristics with Link-Based Ranking

Relating Web Characteristics

Ricardo Baeza-Yates

Carlos Castillo

Universidad de Chile

Page 2: Relating Web Characteristics with Link-Based Ranking

Agenda

• Introduction

• Link-based ranking

• Web structure

• Web characteristics

• Web usage

• Web dynamics

• Conclusions

Page 3: Relating Web Characteristics with Link-Based Ranking

Introduction: Sample

• Web sample: .CL domain in the year 2000

• 670,000 pages in 7,500 domains

• 15 KB average page size

• Collection from the TodoCL Web search engine

Page 4: Relating Web Characteristics with Link-Based Ranking

Introduction: Emphasis

• Broder et al.: Graph Structure in the Web (2000)

– Page-based structure based on strongly connected components

– The Web graph is not a random graph

– Process: cut & paste model

• Ours is mostly a site-based analysis

– Trying to make Web structure meaningful

Page 5: Relating Web Characteristics with Link-Based Ranking

Introduction: The Empire

Page 6: Relating Web Characteristics with Link-Based Ranking

Introduction: One Map

Page 7: Relating Web Characteristics with Link-Based Ranking

Link ranking: Pagerank

$$\mathrm{Pagerank}(p) = \frac{q}{N} + (1 - q) \sum_{i=1}^{k} \mathrm{Pagerank}(r_i)$$

• r_1, ..., r_k: pages that point to page p

• q/N: probability of a random jump (q) over the number of pages (N)

• Currently used by Google (Brin & Page, 1998)
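
The Pagerank formula on this slide can be evaluated by repeated substitution. The following is a minimal sketch, not the authors' implementation. It assumes the usual Brin & Page normalization, in which each Pagerank(r_i) is divided by the number of out-links of r_i (the slide leaves this implicit), and it assumes that pages without out-links jump uniformly to any page. The names `pagerank`, `links`, `q`, `iterations` and the four-page graph `toy_links` are hypothetical.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate Pagerank(p) = q/N + (1 - q) * sum(Pagerank(r) / outdegree(r))."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: q / n + (1 - q) * dangling / n for p in pages}
        for source, targets in links.items():
            for target in targets:
                # Each page shares its rank equally among its out-links.
                new_rank[target] += (1 - q) * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy graph: page -> list of pages it links to.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
```

In this toy graph, page "d" has no in-links, so its score stays at the random-jump term q/N, while "c", which is linked from three pages, ends up with the highest score.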

Page 8: Relating Web Characteristics with Link-Based Ranking

Link ranking: Hubs & Authorities

• HITS algorithm (Kleinberg, 1998)

• A good authority is a page pointed to by good hubs, so we assume that it has good content

• A good hub is a page that points to good authorities, so we assume it is a good set of links

• Linear system calculated by numerical iteration
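
The mutual definition of hubs and authorities can be sketched as a numerical iteration. This is an illustrative sketch rather than Kleinberg's exact procedure: it runs on a whole toy graph instead of a query-focused subgraph, and the names `hits`, `links`, `iterations` and `toy_links` are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy graph: two link pages pointing to two content pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy_links)
```

Here "a1" is pointed to by both link pages and gets the highest authority score, and "h1" points to both content pages and gets the highest hub score.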

Page 9: Relating Web Characteristics with Link-Based Ranking

Link ranking: Distribution

• 9% with relevant hub score

• 2-3% with relevant authority score

• <2% with relevant Pagerank

Page 10: Relating Web Characteristics with Link-Based Ranking

Link ranking: Correlation

Hub score, authority score and Pagerank do not seem to be correlated
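
One way to check a statement like this is to compute a correlation coefficient between two score vectors over the same pages. The sketch below uses the Pearson coefficient; the slides do not say which measure was used, so this choice is an assumption, and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without the common 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 or -1 indicates a strong linear relation between the two scores, and a value near 0 indicates none. Calling `pearson(hub_scores, pagerank_scores)` on per-page lists (hypothetical names) would give one such number per pair of measures.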

Page 11: Relating Web Characteristics with Link-Based Ranking

Link ranking: Sites

• Which measure to use for sites?

• Average score

– But good sites can have lots of bad pages

• Maximum score

– But one good page cannot be all that is needed to be a good site

• Sum of the scores of all pages

– Natural for Pagerank
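
The three candidate site measures can be compared side by side. This is a minimal sketch that assumes a site is identified by the host name of the URL, which the slides do not specify; the function name `site_scores` and the example URLs and scores are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_scores(page_scores):
    """Aggregate per-page scores into average, maximum and sum per site."""
    by_site = defaultdict(list)
    for url, score in page_scores.items():
        # Assumption: the host name of the URL identifies the site.
        by_site[urlsplit(url).netloc].append(score)
    return {
        site: {
            "average": sum(values) / len(values),
            "maximum": max(values),
            "sum": sum(values),
        }
        for site, values in by_site.items()
    }


# Hypothetical page scores, for illustration only.
toy_scores = {
    "http://a.example/1": 0.5,
    "http://a.example/2": 0.1,
    "http://b.example/": 0.3,
}
sites = site_scores(toy_scores)
```

In this toy input, "a.example" and "b.example" tie on the average score, while the maximum and the sum both favor "a.example", which shows how the choice of measure changes the site ranking.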

Page 12: Relating Web Characteristics with Link-Based Ranking

Link ranking: Sites Graph

90% relevant site-Pagerank

It’s harder to have a good hub than a good authority (site)

Page 13: Relating Web Characteristics with Link-Based Ranking

Web Structure: Basis

• The Web graph has structure:

[Figure: Web graph components IN, MAIN, OUT and ISLANDS]
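
The component names can be made concrete with a small reachability computation. The sketch below assumes definitions in the style of Broder et al., which the slide itself does not spell out: MAIN is the strongly connected component containing a given seed node, IN can reach MAIN, OUT is reachable from MAIN, and everything else is lumped into ISLANDS. The names `reachable`, `classify`, `seed` and `toy_links` are hypothetical, and the caller is assumed to know one node of the main component.

```python
from collections import defaultdict


def reachable(graph, start_nodes):
    """Return all nodes reachable from start_nodes by following graph edges."""
    seen = set(start_nodes)
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def classify(links, seed):
    """Split nodes into MAIN, IN, OUT and ISLANDS relative to the seed's component."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # Build the reversed graph to find nodes that can reach the seed.
    reverse = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            reverse[target].append(source)
    forward = reachable(links, [seed])
    backward = reachable(reverse, [seed])
    main = forward & backward  # mutually reachable with the seed
    return {
        "MAIN": main,
        "IN": backward - main,
        "OUT": forward - main,
        "ISLANDS": nodes - forward - backward,  # assumption: everything else
    }


# Hypothetical toy graph: i -> m1 <-> m2 -> o, plus a disconnected pair x -> y.
toy_links = {"i": ["m1"], "m1": ["m2"], "m2": ["m1", "o"], "x": ["y"]}
parts = classify(toy_links, seed="m1")
```

For the toy graph, "m1" and "m2" form MAIN, "i" falls in IN, "o" in OUT, and the disconnected pair "x" and "y" in ISLANDS.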

Page 14: Relating Web Characteristics with Link-Based Ranking

Web Structure: Basis (cont.)

• The MAIN component has structure:

[Figure: MAIN subdivided into MAIN-MAIN, MAIN-IN, MAIN-OUT and MAIN-NORM, shown together with IN and OUT]

Page 15: Relating Web Characteristics with Link-Based Ranking

Web Structure: Sketch

Page 16: Relating Web Characteristics with Link-Based Ranking

Web Structure: Degree

Page 17: Relating Web Characteristics with Link-Based Ranking

Web Structure: Sizes

Page 18: Relating Web Characteristics with Link-Based Ranking

Web Structure: Preferences

Page 19: Relating Web Characteristics with Link-Based Ranking

Web Structure: Preferences

[Figure: three panels labeled Real, ODP and TodoCL, each showing the MAIN and OUT components]

Page 20: Relating Web Characteristics with Link-Based Ranking

Web Structure: Various

Page 21: Relating Web Characteristics with Link-Based Ranking

Web Structure: Link Scores

Page 22: Relating Web Characteristics with Link-Based Ranking

Web Dynamics: Ages

• The kernel of the Web comes from the past

Page 23: Relating Web Characteristics with Link-Based Ranking

Web Dynamics: By Component

Page 24: Relating Web Characteristics with Link-Based Ranking

Web Dynamics: Pagerank

Pagerank is biased against newer pages

Page 25: Relating Web Characteristics with Link-Based Ranking

Web Dynamics: Hubs & Authorities

[Figure: authority score and hub score versus age (months)]

Page 26: Relating Web Characteristics with Link-Based Ranking

Conclusions

• Pagerank/HITS do not seem to be correlated

– And Pagerank is biased toward older pages

• Site ranking can help to make good human-selected directories

• Finding good pages is not so simple

• Characterizing Web structure gives valuable insight

– Web Graph Mining is just starting