Nurturing content-based collaborative communities on the Web

Soumen Chakrabarti

Center for Intelligent Internet Research, Computer Science and Engineering

Indian Institute of Technology Bombay

http://www.cse.iitb.ernet.in/~soumen

http://www.cse.iitb.ernet.in/~cfiir

EMNLP/VLC 2000

Generic search engines

Struggle to cover the expanding Web
• 35% coverage in 1997 (Bharat and Broder)
• 18% in 1999 (Lawrence and Lee Giles)
• Google rebounds to 50% in 2000
• Moore’s law vs. Web population
• Search quality, index freshness

Cannot afford advanced processing
• Alta Vista serves >40 million queries / day
• Cannot even afford to seek on disk (8 ms)
• Limits intelligence of search engines

Scale vs. quality

[Figure: approaches placed on axes of scale vs. quality]
• Keyword-based search engines: HotBot, Alta Vista
• Link-assisted ranking: Google, Clever
• Resource discovery: focused crawling, topic distillation
• Lexical networks, parsing, semantic indexing

The case for vertical portals

“Portals and search pages are changing rapidly, in part because their biggest strength — massive size and reach — can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical — and useful — than trying to cover the entire universe.”

(San Jose Mercury News, 1/1999)


Scaling through specialization

The Web shows content-based locality
• Link-based clusters correlated with content
• Content-based communities emerge in a spontaneous, decentralized fashion

Can learn and exploit locality patterns
• Analyze page visits and bookmarks
• Automatically construct a “focused portal” with resources that
– Have high relevance and quality
– Are up-to-date and collectively comprehensive

Roadmap

Hyperlink mining: a short history

Resource discovery
• Content-based locality in hypertext
• Taxonomy models, topic distillation
• Strategies for focused crawling

Data capture and mining architecture
• The Memex collaboration system
– Collaborative construction of vertical portals
• Link metadata management architecture
– Surfing backwards on the Web

Historical background

First generation Web search engines
• Delete ‘stopwords’ from queries
• Can only do syntactic matching

Users stopped asking good questions!
• TREC queries: tens to hundreds of words
• Alta Vista: at most 2–3 words

Crisis of abundance
• Relevance ranking for very short queries
• Quality complements relevance: that’s where hand-made topic directories shine

Hyperlink Induced Topic Search (HITS)

[Figure: a query goes to a keyword search engine; the response set is grown into an expanded graph whose nodes carry hub (h) and authority (a) scores]

‘Hubs’ and ‘authorities’: $a = E h$, $h = E^{T} a$
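The hub and authority updates on this slide can be sketched as a simple power iteration. This is a minimal illustration, not the original HITS or Clever code: it assumes E has a nonzero entry in row i, column j when page j links to page i (the reading under which the authority update adds up the hub scores of citing pages), and the function names, the toy edge list, and the iteration count are all hypothetical.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {node: s / norm for node, s in scores.items()}


def hits(links, iterations=50):
    """links: list of (source, target) edges of the expanded graph."""
    nodes = sorted({n for edge in links for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a = E h: a page's authority is the sum of hub scores of pages citing it.
        new_auth = {n: 0.0 for n in nodes}
        for source, target in links:
            new_auth[target] += hub[source]
        # h = E^T a: a page's hub score is the sum of authority scores it cites.
        new_hub = {n: 0.0 for n in nodes}
        for source, target in links:
            new_hub[source] += new_auth[target]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Toy expanded graph (hypothetical page names).
EDGES = [
    ("hub1", "auth1"), ("hub1", "auth2"),
    ("hub2", "auth1"), ("hub2", "auth2"),
    ("hub3", "auth1"),
]
hub, auth = hits(EDGES)
best_authority = max(auth, key=auth.get)
```

In the toy graph, `auth1` is cited by all three hubs and ends up as the top authority, while `hub3`, which cites only one authority, scores below the other two hubs.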

PageRank and Google

Prestige of a page is proportional to sum of prestige of citing pages
Standard bibliometric measure of influence
Simulate a random walk on the Web to precompute prestige of all pages
Sort keyword-matched responses by decreasing prestige

[Figure: pages p1, p2 and p3 each link to p4, so $p_4 \propto p_1 + p_2 + p_3$; i.e., $p = E p$. The random surfer follows a random outlink from the current page.]
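The random-walk view of prestige can be sketched as a power iteration over the link graph. This is a rough sketch under stated assumptions: the damping factor (0.85) and the uniform treatment of pages without outlinks are common conventions that do not appear on the slide, and the function name and toy graph are hypothetical.

```python
def pagerank(outlinks, damping=0.85, iterations=100):
    """outlinks: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(outlinks) | {t for ts in outlinks.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption (not on the slide): with probability 1 - damping the
        # surfer jumps to a uniformly random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = outlinks.get(page, [])
            if targets:
                # Follow a random outlink: split prestige evenly among targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outlinks: spread its prestige over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph mirroring the figure: p1, p2, p3 all cite p4 (p4 -> p1 is made up).
WEB = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]}
rank = pagerank(WEB)
ranked = sorted(rank, key=rank.get, reverse=True)
```

Because the scores are computed once for the whole graph, a query-time ranker only needs to sort the keyword-matched pages by their stored `rank` value.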

Observations

HITS
• Uses text initially to select Web subgraph
• Expands subgraph by radius 1 … magic!
• h and a scores independent of content
• Iterations required at query time

Google/PageRank
• Precomputed query-independent prestige
• No iterations needed at query time, faster
• Keyword query selects subgraph to rank
• No notion of hub or bipartite reinforcement

Limitations

Artificial decoupling of text and links

Connectivity-based topic drift (HITS)
• “movie awards” → “movies”
• Expanders at www.web-popularity.com

Feature diffusion (Google)
• “more evil than evil” → www.microsoft.com
• New threat of anchor-text spamming

Decoupled ranking (Google)
• “harvard mother” → Bill Gates’s bio page!

Genealogy

[Figure: genealogy of hyperlink-mining techniques. Nodes: Bibliometry; Google; HITS; Clever@IBM; Exploiting anchor text; Topic distillation@Compaq; Outlier elimination; Relaxation labeling; Text classification; Hypertext classification; Learning topic paths; Focused crawling; Crawling context graphs]

Reducing topic drift: anchor text

Page modeled as sequence of tokens and outlinks
“Radius of influence” around each token
Query term matching token increases link weight
Favors hubs and authorities near relevant pages
Better answers than HITS
Ad-hoc “spreading activation”, but no formal model as yet

[Figure: a query term occurring near an outlink on a page]
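One way to picture the “radius of influence” idea is to raise a link's weight for every query term that occurs within a fixed number of tokens of it. The sketch below is only an illustration: the page representation, the base weight of 1, the additive boost, and the radius value are all assumptions, since the slide itself notes that there is no formal model yet.

```python
def link_weights(tokens, query_terms, radius=5):
    """tokens: the page as a list of ("text", word) or ("link", url) items."""
    query = {term.lower() for term in query_terms}
    weights = {}
    for position, (kind, value) in enumerate(tokens):
        if kind != "link":
            continue
        # Tokens within `radius` positions on either side of the outlink.
        window = tokens[max(0, position - radius): position + radius + 1]
        matches = sum(
            1 for k, word in window if k == "text" and word.lower() in query
        )
        # Assumed scheme: base weight 1, plus 1 per nearby query-term match.
        weights[value] = 1.0 + matches
    return weights


# Hypothetical page: one link sits next to the query term, the other does not.
PAGE = (
    [("text", "cheap"), ("text", "cycling"), ("link", "http://a.example/"),
     ("text", "gear")]
    + [("text", "filler")] * 10
    + [("link", "http://b.example/"), ("text", "news")]
)
weights = link_weights(PAGE, ["cycling"], radius=3)
```

The resulting weights could replace the 0/1 entries of the link matrix used in the hub and authority iteration, so that links near query terms contribute more.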

Reducing topic drift: Outlier detection

Search response is usually ‘purer’ than radius=1 expansion
Compute document term vectors
Compute centroid of response vectors
Eliminate far-away expanded vectors
Results improve
Why stop at radius=1?

[Figure: vector-space document model showing the keyword search response, its centroid (×), the cut-off radius, and the expanded graph]
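The pruning steps listed on this slide can be sketched with raw term counts and cosine similarity. This is a minimal sketch, not the original implementation: the whitespace tokenizer, the use of a minimum cosine similarity in place of a distance cut-off, the threshold value, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (assumed tokenizer: lowercase, split on spaces)."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def prune_expansion(response_docs, expanded_docs, min_similarity=0.2):
    """Keep only expanded pages whose vectors lie close to the response centroid."""
    center = centroid([term_vector(doc) for doc in response_docs.values()])
    return {
        url: doc
        for url, doc in expanded_docs.items()
        if cosine(term_vector(doc), center) >= min_similarity
    }


# Hypothetical keyword-search response and radius-1 expansion.
RESPONSE = {"r1": "movie awards oscar ceremony", "r2": "oscar awards winners movie"}
EXPANDED = {"e1": "oscar awards nominees", "e2": "cheap flights hotel deals"}
kept = prune_expansion(RESPONSE, EXPANDED)
```

Here `e2` shares no terms with the response centroid and is dropped before the link analysis runs, while `e1` is kept.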

Resource discovery

Given
• Yahoo-like topic tree with example URLs
• A selection of good topics to explore

Examples, not queries, define topics
• Need 2-way decision, not ad-hoc cut-off

Goal
• Start from the good / relevant examples
• Crawl to collect additional relevant URLs
• Fetch as few irrelevant URLs as possible

A model for relevance

[Figure: topic taxonomy with nodes All; Arts; Bus&Econ; Recreation; Companies; Cycling; Bike Shops; Mt. Biking; Clubs; ... Legend: path class, good classes, subsumed classes, blocked class]

$$R(d) = \Pr(d \text{ is good}) = \sum_{c:\ \mathrm{good}(c)} \Pr(c \mid d)$$

Pr(c|d) from Pr(d|c) using Bayes rule

Decide topic; topic c is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
Each c has parameters $\theta(c,t)$ for terms t
Coin with face probabilities $\sum_t \theta(c,t) = 1$
Fix document length n(d) and toss coin
Naïve yet effective; can use other algos
Given c, probability of document is

$$\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
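As a rough illustration of how the class posterior and the relevance score fit together, the sketch below applies Bayes rule to the multinomial model and then sums the posterior over the good classes. The priors, term probabilities, class names, and function names are invented for the example; the term probabilities are assumed to be smoothed so that none is zero.

```python
import math


def posterior(doc_counts, prior, theta):
    """Pr(c|d) by Bayes rule under the multinomial (coin-tossing) model.

    The multinomial coefficient is the same for every class, so it cancels
    in the normalization and is omitted.
    """
    log_joint = {}
    for c in prior:
        log_joint[c] = math.log(prior[c]) + sum(
            n * math.log(theta[c][t]) for t, n in doc_counts.items()
        )
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_joint.values())
    unnormalized = {c: math.exp(v - top) for c, v in log_joint.items()}
    z = sum(unnormalized.values())
    return {c: u / z for c, u in unnormalized.items()}


def relevance(class_posterior, good_classes):
    """R(d): total posterior probability of the classes marked good."""
    return sum(p for c, p in class_posterior.items() if c in good_classes)


# Hypothetical two-class model with smoothed (nonzero) term probabilities.
PRIOR = {"cycling": 0.5, "finance": 0.5}
THETA = {
    "cycling": {"bike": 0.6, "fund": 0.1, "club": 0.3},
    "finance": {"bike": 0.1, "fund": 0.7, "club": 0.2},
}
doc = {"bike": 2, "club": 1}  # term counts n(d, t)
post = posterior(doc, PRIOR, THETA)
r = relevance(post, {"cycling"})
```

Working in log space matters in practice because the product over terms underflows quickly for documents of realistic length.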

Enhanced models for hypertext

c=class, d=text, N=neighbors
Text-only model: Pr(d|c)
Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
Better recursive model: Pr(d, c(N) | c)
Relaxation labeling over Markov random fields
Or, EM formulation

Hyperlink modeling boosts accuracy

9600 patents from 12 classes marked by USPTO
Patents have text and prior art links
Expand test patent to include neighborhood
‘Forget’ and re-estimate fraction of neighbors’ classes
(Even better for Yahoo)

[Figure: %Error (0 to 40) vs. %Neighborhood known (0 to 100) for three classifiers: Text, Link, Text+Link]

Resource discovery: basic approach

Topic taxonomy with examples and ‘good’ topics specified
Crawler coupled to hypertext classifier
Crawl frontier expanded in relevance order
Neighbors of good hubs expanded with high priority

[Figure: crawl growing outward from the example URLs, illustrating the radius-1 rule and the radius-2 rule]
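Expanding the crawl frontier in relevance order can be sketched with a priority queue. This is a simplified sketch, not the focused crawler described in the talk: `fetch` and `relevance` are hypothetical callbacks standing in for the page fetcher and the hypertext classifier, an unvisited link simply inherits the relevance of the page that cites it, and the toy web is made up. The hub-based priority boost is left out.

```python
import heapq


def focused_crawl(seeds, fetch, relevance, budget):
    """Best-first crawl.

    fetch(url) -> (text, outlinks); relevance(text) -> score in [0, 1].
    Both callbacks are placeholders supplied by the caller.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        score = relevance(text)
        visited.append((url, score))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Assumption: a link is as promising as the page citing it.
                heapq.heappush(frontier, (-score, link))
    return visited


# Hypothetical toy web: url -> (page text, outlinks).
TOY_WEB = {
    "seed": ("cycling club", ["bike-shop", "casino"]),
    "bike-shop": ("cycling gear", ["trail-guide"]),
    "casino": ("poker odds", ["slots"]),
    "trail-guide": ("cycling trails", []),
    "slots": ("jackpot", []),
}


def toy_fetch(url):
    return TOY_WEB[url]


def toy_relevance(text):
    return 1.0 if "cycling" in text else 0.0


crawl = focused_crawl(["seed"], toy_fetch, toy_relevance, budget=4)
```

With a budget of four fetches the crawl reaches `trail-guide` through the relevant `bike-shop` page and never fetches `slots`, whose only citing page was judged irrelevant.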

Focused crawler block diagram

[Figure: block diagram with components Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, and Topic Distiller, connected by a Feedback path]

Focused crawling evaluation

Harvest rate
• What fraction of crawled pages are relevant

Robustness across seed sets
• Perform separate crawls with random disjoint samples
• Measure overlap in URLs, server IP addresses, and best-rated resources

Evidence of non-trivial work
• Path length to the best resources
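Harvest rate is typically tracked as a moving average of per-page relevance in fetch order. The sketch below assumes each fetched page already has a relevance score between 0 and 1; the window size and the function name are illustrative choices.

```python
from collections import deque


def moving_harvest_rate(relevances, window=100):
    """Average relevance over the last `window` fetched pages, per fetch."""
    recent = deque(maxlen=window)
    total = 0.0
    rates = []
    for score in relevances:
        if len(recent) == window:
            # The oldest score is about to be dropped by the bounded deque.
            total -= recent[0]
        recent.append(score)
        total += score
        rates.append(total / len(recent))
    return rates


# Hypothetical relevance scores in fetch order.
rates = moving_harvest_rate([1.0, 1.0, 0.0, 0.0], window=2)
```

A short window reacts quickly when the crawler drifts off topic, while a long window gives a smoother curve.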

Harvest rate

[Figure, two panels plotting average relevance (0 to 1) against #URLs fetched. “Harvest Rate (Cycling, Unfocused)”: 0 to 10000 URLs, average over 100. “Harvest Rate (Cycling, Soft Focus)”: 0 to 6000 URLs, averages over 100 and over 1000.]

Crawl robustness

[Figure, two panels titled “Crawl Robustness (Cycling)”, each plotting overlap between Crawl 1 and Crawl 2 (series Overlap1 and Overlap2) against #URLs crawled (0 to 3000): URL overlap and server overlap.]

Robustness of resource quality

Sample disjoint sets of starting URLs
Two separate crawls
Run HITS/Clever
Find best authorities
Order by rank
Find overlap in the top-rated resources

[Figure: “Resource Robustness (Cycling)”, server overlap (0 to 1) vs. #top resources (0 to 25), series Overlap1 and Overlap2]

Distance to best resources

[Figure, two panels plotting #servers in top 100 against minimum distance from crawl seed (#links): “Resource Distance (Mutual Funds)”, distances 1 to 15; “Resource Distance (Cycling)”, distances 1 to 12.]

Cycling: cooperative
Mutual funds: competitive

A top hub on ‘airlines’ after half an hour of focused crawling

A top hub on ‘bicycling’ after one hour of focused crawling

Learning context graphs

Topics form connected cliques
• “heart disease” ↔ ‘swimming’, ‘hiking’
• ‘cycling’ ↔ “first-aid”!

Radius-1 rule can be myopic
• Trapped within boundaries of related topics

From short pre-crawled paths
• Can learn frequent chains of related topics
• Use this knowledge to circumvent local “topic traps”

Context improves focused crawling


Roadmap

Hyperlink mining: a short history

Resource discovery
• Content-based locality in hypertext
• Taxonomy models, topic distillation
• Strategies for focused crawling

Data capture and mining architecture
• The Memex collaboration system
– Collaborative construction of vertical portals
• Link metadata management architecture
– Surfing backwards on the Web

Memex project goals

Infrastructure to support spontaneous formation of topic-based communities
Mining algorithms for personal and community level topic management and collaborative resource discovery
Extensible API for plugging in additional hypertext analysis tools

Memex project status

Java applet client
• Netscape 4.5+ (Javascript) available
• IE4+ (ActiveX) planned

Server code for Unix and Windows
• Servlets + IBM Universal Database
• Berkeley DB lightweight storage manager
• Simple-to-install RPMs for Linux planned

About a dozen alpha testers
First beta available 12/2000

Creating personal topic spaces

Valuable user input and feedback on topics and associated examples

[Screenshot annotations]
• File manager-like interface
• Privacy choice
• ‘?’ indicates automatic placement by Memex classifier
• User cuts and pastes to correct or reinforce the Memex classifier

Replaying topic-based contexts

“Where was I when last surfing around /Software/Programming?”

[Screenshot annotations]
• Choice of topic context
• Replay of recent browsing context restricted to chosen topic
• Active browser monitoring and dynamic layout of new/incremental context graph
• Better mobility than one-dimensional history provided by popular browsers

Synthesis of a community taxonomy

Users classify URLs into folders
How to synthesize personal folders into common taxonomy?
Combine multiple similarity hints

[Figure: personal folders (Entertainment, Studios, Broadcasting, Media) holding URLs such as kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com and miramax.com. Similarity hints: share document, share folder, share terms. Resulting themes: ‘Radio’, ‘Television’, ‘Movies’]

Setting up the focused crawler

[Screenshot annotations: Taxonomy Editor; Current Examples; Suggested Additional Examples; Drag]

Monitoring harvest rate

[Figure: relevance/harvest rate plotted against time, with one point per URL and a moving average]

Overview of the Memex system

[Figure: architecture diagram. Browser side: the browser visits the Memex server, downloads the client JAR, and attaches a running client applet. Client and server talk through the Memex client-server protocol with workload-sharing negotiations. Server side: event-handler servlets (Search, Folder, Context, Archive); storage for relational metadata, text index, and topic models; mining demons for taxonomy synthesis, resource discovery, recommendation, classification, and clustering.]

Surfing backwards using contexts

Space-bounded referrer log
HTTP extension to query backlink data

[Figure: client C follows a link from http://S1/P1 to http://S2/P2, sending server S2 the request

GET /P2 HTTP/1.0
Referer: http://S1/P1

The referrer is recorded in a backlink database, kept locally or on the Memex server. A later client C’ asks “Who points to S2/P2?”]
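A space-bounded referrer log can be sketched as a small store that records the Referer header of each request and evicts the oldest entries once a capacity is reached. Everything here is an assumption made for illustration: the class and method names, the oldest-first eviction policy, and the in-process `who_points_to` lookup, which merely stands in for the proposed HTTP extension whose wire format the slide does not give.

```python
from collections import OrderedDict


class ReferrerLog:
    """Bounded store of (target, referrer) pairs, evicting the oldest first."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._backlinks = OrderedDict()  # (target, referrer) -> True

    def record(self, target_url, referrer_url):
        key = (target_url, referrer_url)
        self._backlinks.pop(key, None)  # re-recording refreshes the entry
        self._backlinks[key] = True
        while len(self._backlinks) > self.capacity:
            self._backlinks.popitem(last=False)  # drop the oldest pair

    def who_points_to(self, target_url):
        """Hypothetical lookup standing in for the backlink query."""
        return [ref for (tgt, ref) in self._backlinks if tgt == target_url]


def parse_request(raw, host):
    """Extract the requested URL and the Referer header from a raw request."""
    lines = raw.split("\r\n")
    _method, path, _version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return "http://" + host + path, headers.get("referer")


log = ReferrerLog(capacity=2)
target, referrer = parse_request(
    "GET /P2 HTTP/1.0\r\nReferer: http://S1/P1\r\n\r\n", host="S2"
)
log.record(target, referrer)
backlinks = log.who_points_to("http://S2/P2")
```

Bounding the log keeps the storage cost fixed, at the price of forgetting referrers that have not been seen recently.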

Surfing backwards 1

Surfing backwards 2

Surfing backwards 3

Surfing backwards 4

User study and analysis

(1999) Significant improvement in finding comprehensive resource lists
• Six broad information needs, 25 volunteers
• Find good resources within limited time
• Backlinks faked using search engines
• Blind-reviewed by three other volunteers

(2000) Average path length of the undirected Web graph is much smaller than that of the directed Web graph

(2000) Better focused crawls using backlinks

Proposal to W3C

Backlinks improve focused crawling

Follow forward HREF as before
Also expand backlinks using ‘link:’ queries
Classify pages as before
Sometimes distracts in unrewarding work…
…but pays off in the end
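The effect of adding backlinks to the expansion step can be illustrated on a toy graph. In this sketch the `backlink_lookup` callback is a hypothetical stand-in for a ‘link:’ query to a search engine, and the graph and hop limit are made up; classification of the reached pages is omitted.

```python
def expand(url, forward_links, backlink_lookup=None):
    """Neighbors of a page: forward HREFs, plus backlinks when available."""
    neighbors = list(forward_links.get(url, []))
    if backlink_lookup is not None:
        neighbors += [u for u in backlink_lookup(url) if u not in neighbors]
    return neighbors


def reachable(seed, forward_links, backlink_lookup=None, max_hops=2):
    """Pages reachable from the seed within max_hops expansion steps."""
    frontier, seen = [seed], {seed}
    for _ in range(max_hops):
        next_frontier = []
        for url in frontier:
            for neighbor in expand(url, forward_links, backlink_lookup):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen


# Hypothetical graph: a hub cites three shops; the shops cite nothing.
FORWARD = {"hub": ["shopA", "shopB", "shopC"], "shopA": [], "shopB": [], "shopC": []}


def toy_backlinks(url):
    # Stand-in for a 'link:' query against a search engine.
    return [src for src, targets in FORWARD.items() if url in targets]


forward_only = reachable("shopA", FORWARD)
with_backlinks = reachable("shopA", FORWARD, toy_backlinks)
```

Starting from `shopA`, forward links lead nowhere, but one backward step reaches the hub and one more forward step reaches the sibling shops.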

Surfing backwards: summary

“Life must be lived forwards, but it can only be understood backwards” (Søren Kierkegaard)

Hubs are everywhere!
• To find them, look backwards

Bidirectional surfing is a valuable means to seed focused resource discovery
• Even if one has to depend on search engines initially for link:… queries

Conclusion

Architecture for topic-specific web resource discovery
Driven by examples collected from surfing and bookmarking activity
Reduced dependence on large crawlers
Modest desktop hardware adequate
Variable radius goal-directed crawling
High harvest rate
High quality resources found far from keyword query response nodes
