Nurturing content-based collaborative communities on the Web
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
Indian Institute of Technology Bombay
www.cse.iitb.ernet.in/~soumen
www.cse.iitb.ernet.in/~cfiir
EMNLP/VLC 2000
Generic search engines
Struggle to cover the expanding Web
• 35% coverage in 1997 (Bharat and Broder)
• 18% in 1999 (Lawrence and Lee Giles)
• Google rebounds to 50% in 2000
• Moore’s law vs. Web population
• Search quality, index freshness
Cannot afford advanced processing
• Alta Vista serves >40 million queries / day
• Cannot even afford to seek on disk (8 ms)
• Limits intelligence of search engines
Scale vs. quality
[Figure: approaches placed on Scale and Quality axes. Labels: keyword-based search engines (HotBot, Alta Vista); link-assisted ranking (Google, Clever); resource discovery (focused crawling, topic distillation); lexical networks, parsing, semantic indexing.]
The case for vertical portals
“Portals and search pages are changing rapidly, in part because their biggest strength — massive size and reach — can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical — and useful — than trying to cover the entire universe.”
(San Jose Mercury News, 1/1999)
Scaling through specialization
The Web shows content-based locality
• Link-based clusters correlated with content
• Content-based communities emerge in a spontaneous, decentralized fashion
Can learn and exploit locality patterns
• Analyze page visits and bookmarks
• Automatically construct a “focused portal” with resources that
  – Have high relevance and quality
  – Are up-to-date and collectively comprehensive
Roadmap
Hyperlink mining: a short history
Resource discovery
• Content-based locality in hypertext
• Taxonomy models, topic distillation
• Strategies for focused crawling
Data capture and mining architecture
• The Memex collaboration system
  – Collaborative construction of vertical portals
• Link metadata management architecture
  – Surfing backwards on the Web
Historical background
First generation Web search engines
• Delete ‘stopwords’ from queries
• Can only do syntactic matching
Users stopped asking good questions!
• TREC queries: tens to hundreds of words
• Alta Vista: at most 2–3 words
Crisis of abundance
• Relevance ranking for very short queries
• Quality complements relevance: that’s where hand-made topic directories shine
Hyperlink Induced Topic Search
‘Hubs’ and ‘authorities’: $a = Eh$, $h = E^{T}a$
[Figure: a query goes to a keyword search engine; the response is grown into an expanded graph whose nodes carry hub (h) and authority (a) scores.]
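The mutual reinforcement between hubs and authorities can be sketched as a short power iteration. This is a minimal illustration and not the original HITS implementation: the toy link graph, the page names, the fixed iteration count and the L2 normalisation are all assumptions made for the example.

```python
import math

def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it points to.

    Assumes the graph has at least one link, so the norms are non-zero.
    """
    pages = set(links) | {v for targets in links.values() for v in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing at it.
        auth = {p: 0.0 for p in pages}
        for u, targets in links.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[v] for v in links.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(x * x for x in auth.values()))
        h_norm = math.sqrt(sum(x * x for x in hub.values()))
        auth = {p: x / a_norm for p, x in auth.items()}
        hub = {p: x / h_norm for p, x in hub.items()}
    return hub, auth

# Hypothetical expanded graph: two hub pages citing two candidate authorities.
toy_links = {
    "hub1": ["bikes.example", "trails.example"],
    "hub2": ["bikes.example"],
}
hub, auth = hits(toy_links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page cited by both hubs ends up as the top authority, and the hub citing both authorities ends up as the top hub.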
PageRank and Google
Prestige of a page is proportional to sum of prestige of citing pages
Standard bibliometric measure of influence
Simulate a random walk on the Web to precompute prestige of all pages
• Follow random outlink from page
Sort keyword-matched responses by decreasing prestige
$p_4 \propto p_1 + p_2 + p_3$, i.e., $p = Ep$
[Figure: pages p1, p2 and p3 each link to page p4.]
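The random-walk view of prestige can be sketched as a power iteration over a small link graph. This is a hedged illustration, not Google's implementation: the damping factor of 0.85, the uniform handling of dangling pages and the extra link from p4 back to p1 are assumptions added so that the toy walk is well defined.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: dict mapping each page to the list of pages it points to."""
    pages = sorted(set(links) | {v for t in links.values() for v in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumed teleport step: with probability 1 - damping, jump anywhere.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for u in pages:
            targets = links.get(u, [])
            if targets:
                # Follow a random outlink: split prestige over the outlinks.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page (assumption): spread its prestige uniformly.
                share = damping * rank[u] / n
                for v in pages:
                    new_rank[v] += share
        rank = new_rank
    return rank

# p1, p2 and p3 cite p4, as in the slide; the p4 -> p1 link is hypothetical.
toy_links = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]}
prestige = pagerank(toy_links)
ranked = sorted(prestige, key=prestige.get, reverse=True)
```

Because the scores are independent of any query, they can be computed once and reused to sort keyword-matched responses.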
Observations
HITS
• Uses text initially to select Web subgraph
• Expands subgraph by radius 1 … magic!
• h and a scores independent of content
• Iterations required at query time
Google/PageRank
• Precomputed query-independent prestige
• No iterations needed at query time, faster
• Keyword query selects subgraph to rank
• No notion of hub or bipartite reinforcement
Limitations
Artificial decoupling of text and links
Connectivity-based topic drift (HITS)
• “movie awards” → “movies”
• Expanders at www.web-popularity.com
Feature diffusion (Google)
• “more evil than evil” → www.microsoft.com
• New threat of anchor-text spamming
Decoupled ranking (Google)
• “harvard mother” → Bill Gates’s bio page!
Genealogy
[Figure: genealogy of hyperlink mining techniques. Node labels: Bibliometry; Google; HITS; Clever@IBM; Exploiting anchor text; Topic distillation@Compaq; Outlier elimination; Relaxation labeling; Text classification; Hypertext classification; Learning topic paths; Focused crawling; Crawling context graphs.]
Reducing topic drift: anchor text
Page modeled as sequence of tokens and outlinks
“Radius of influence” around each token
Query term matching token increases link weight
Favors hubs and authorities near relevant pages
Better answers than HITS
Ad-hoc “spreading activation”, but no formal model as yet
[Figure: page diagram, with the label “Query term”.]
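One way to picture the radius of influence is a window of tokens around each outlink, with every query term inside the window adding to that link's weight. The sketch below is only an assumed reading of the idea: the window size, the base weight of 1 and the additive boost are illustrative choices, not the weighting actually used by Clever.

```python
def link_weights(page_items, query_terms, radius=5):
    """page_items: list of ("token", word) or ("link", url) in document order."""
    query = {t.lower() for t in query_terms}
    weights = {}
    for i, (kind, value) in enumerate(page_items):
        if kind != "link":
            continue
        # Window of `radius` items on either side of the outlink.
        lo, hi = max(0, i - radius), min(len(page_items), i + radius + 1)
        matches = sum(
            1 for k, v in page_items[lo:hi] if k == "token" and v.lower() in query
        )
        # Base weight 1 per occurrence of the link, plus one per nearby query term.
        weights[value] = weights.get(value, 0.0) + 1.0 + matches
    return weights

# Hypothetical page: one link sits next to the query term, the other does not.
page = [
    ("token", "best"), ("token", "cycling"), ("token", "clubs"),
    ("link", "http://clubs.example/"),
    ("token", "and"), ("token", "weather"), ("token", "reports"),
    ("link", "http://weather.example/"),
]
weights = link_weights(page, ["cycling"], radius=2)
```

The weighted edges would then replace the 0/1 entries of the link matrix in the hub and authority iterations, so that links surrounded by query terms count for more.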
Reducing topic drift: Outlier detection
Search response is usually ‘purer’ than radius=1 expansion
Compute document term vectors
Compute centroid of response vectors
Eliminate far-away expanded vectors
Results improve
Why stop at radius=1?
[Figure: vector-space document model showing the keyword search response, the expanded graph, the centroid (×) and the cut-off radius.]
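The pruning steps can be sketched with plain term-frequency vectors. This is a minimal sketch under stated assumptions: whitespace tokenisation, unit-length term-count vectors, cosine similarity to the centroid as the measure of closeness, and an arbitrary threshold of 0.2 standing in for the cut-off radius.

```python
import math
from collections import Counter, defaultdict

def term_vector(text):
    """Unit-length term-frequency vector of a document."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(u, v):
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_expansion(response_docs, expanded_docs, min_similarity=0.2):
    """Keep only expanded documents that lie close to the response centroid."""
    center = centroid([term_vector(d) for d in response_docs])
    return [d for d in expanded_docs
            if cosine(term_vector(d), center) >= min_similarity]

# Hypothetical keyword-search response and radius-1 expansion.
response = ["movie awards oscar ceremony", "oscar awards winners movie"]
expanded = ["oscar awards nominees", "cheap flights hotel deals"]
kept = prune_expansion(response, expanded)
```

Only the surviving documents would take part in the subsequent hub and authority iterations.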
Resource discovery
Given
• Yahoo-like topic tree with example URLs
• A selection of good topics to explore
Examples, not queries, define topics
• Need 2-way decision, not ad-hoc cut-off
Goal
• Start from the good / relevant examples
• Crawl to collect additional relevant URLs
• Fetch as few irrelevant URLs as possible
A model for relevance
$$R(d) = \Pr(d \text{ is good}) = \sum_{\text{good}(c)} \Pr(c \mid d)$$
[Figure: topic taxonomy with nodes All, Bus&Econ, Recreation, Arts, Companies, Cycling, Bike Shops, Mt.Biking and Clubs, annotated with path, good, subsumed and blocked classes.]
Pr(c|d) from Pr(d|c) using Bayes rule
Decide topic; topic c is picked with prior probability $\pi(c)$; $\sum_c \pi(c) = 1$
Each c has parameters $\theta(c,t)$ for terms t
Coin with face probabilities $\theta(c,t)$; $\sum_t \theta(c,t) = 1$
Fix document length n(d) and toss coin
Naïve yet effective; can use other algos
Given c, probability of document is
$$\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
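The generative model and the Bayes-rule inversion can be sketched in a few lines, together with the relevance score R(d), taken here as the sum of Pr(c|d) over the good classes. The class names, priors and term probabilities below are invented for illustration, and the term probabilities are assumed to be smoothed so that none is zero.

```python
import math
from collections import Counter

def log_likelihood(doc_counts, theta_c):
    """log Pr[d|c]: log multinomial coefficient plus sum of n(d,t) * log theta(c,t)."""
    n_d = sum(doc_counts.values())
    log_coeff = math.lgamma(n_d + 1) - sum(
        math.lgamma(n + 1) for n in doc_counts.values()
    )
    return log_coeff + sum(n * math.log(theta_c[t]) for t, n in doc_counts.items())

def posterior(doc_counts, prior, theta):
    """Bayes rule: Pr[c|d] is proportional to prior(c) * Pr[d|c]."""
    log_joint = {
        c: math.log(prior[c]) + log_likelihood(doc_counts, theta[c]) for c in prior
    }
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def relevance(post, good_classes):
    """R(d): total posterior mass on the classes marked good."""
    return sum(p for c, p in post.items() if c in good_classes)

# Hypothetical two-class model with smoothed term probabilities.
prior = {"Cycling": 0.5, "Arts": 0.5}
theta = {
    "Cycling": {"bike": 0.6, "trail": 0.3, "paint": 0.1},
    "Arts": {"bike": 0.1, "trail": 0.1, "paint": 0.8},
}
doc = Counter("bike trail bike".split())
post = posterior(doc, prior, theta)
score = relevance(post, {"Cycling"})
```

Working in log space avoids underflow on long documents; the multinomial coefficient is the same for every class and cancels in the posterior, but it is kept so the likelihood matches the formula.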
Enhanced models for hypertext
c=class, d=text, N=neighbors
Text-only model: Pr(d|c)
Using neighbors’ text to judge my topic: Pr(d, d(N) | c)
Better recursive model: Pr(d, c(N) | c)
Relaxation labeling over Markov random fields
Or, EM formulation
Hyperlink modeling boosts accuracy
9600 patents from 12 classes marked by USPTO
Patents have text and prior art links
Expand test patent to include neighborhood
‘Forget’ and re-estimate fraction of neighbors’ classes
(Even better for Yahoo)
[Figure: % error (0 to 40) vs. % neighborhood known (0 to 100) for the Text, Link and Text+Link classifiers.]
Resource discovery: basic approach
Topic taxonomy with examples and ‘good’ topics specified
Crawler coupled to hypertext classifier
Crawl frontier expanded in relevance order
Neighbors of good hubs expanded with high priority
[Figure: example URLs expanded under the radius-1 rule and the radius-2 rule.]
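Expanding the crawl frontier in relevance order amounts to a best-first search driven by a priority queue. The sketch below is a simplified illustration, not the focused crawler itself: `fetch_outlinks` and `relevance` are hypothetical callbacks standing in for the page fetcher and the hypertext classifier, each unvisited link simply inherits the relevance of the page that cites it, and the extra priority for neighbors of good hubs is left out.

```python
import heapq

def focused_crawl(seeds, fetch_outlinks, relevance, budget):
    """Best-first crawl; returns URLs in the order they were fetched."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        score = relevance(url)  # classifier's estimate for the fetched page
        for target in fetch_outlinks(url):
            if target not in visited:
                # An outlink is as promising as the page that cites it.
                heapq.heappush(frontier, (-score, target))
    return order

# Hypothetical toy Web and classifier scores.
web = {
    "seed": ["bike1", "news1"],
    "bike1": ["bike2"],
    "news1": ["news2"],
    "bike2": [],
    "news2": [],
}
scores = {"seed": 1.0, "bike1": 0.9, "news1": 0.1, "bike2": 0.8, "news2": 0.05}
order = focused_crawl(["seed"], lambda u: web[u], lambda u: scores[u], budget=4)
```

In the toy run the off-topic page cited by the seed is still fetched, but the page behind it stays at the bottom of the queue and is never reached within the budget.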
Focused crawler block diagram
[Figure: block diagram with components Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller and Feedback.]
Focused crawling evaluation
Harvest rate
• What fraction of crawled pages are relevant
Robustness across seed sets
• Perform separate crawls with random disjoint samples
• Measure overlap in URLs, server IP addresses, and best-rated resources
Evidence of non-trivial work
• Path length to the best resources
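Harvest rate can be tracked as a moving average of per-page relevance over the most recent fetches. The sketch below assumes each fetched page already has a relevance score between 0 and 1; the function name and the tiny window in the example are illustrative.

```python
def harvest_rate(relevances, window=100):
    """Moving average of relevance over the last `window` fetched pages."""
    rates, running = [], 0.0
    for i, r in enumerate(relevances):
        running += r
        if i >= window:
            # Drop the page that just fell out of the window.
            running -= relevances[i - window]
        rates.append(running / min(i + 1, window))
    return rates

# Hypothetical crawl: relevant, irrelevant, relevant, relevant.
rates = harvest_rate([1, 0, 1, 1], window=2)
```

A larger window smooths the curve, at the cost of reacting more slowly when the crawl drifts off topic.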
Harvest rate
[Figure, two panels: “Harvest Rate (Cycling, Unfocused)”, average relevance vs. #URLs fetched (0 to 10000), averaged over 100; and “Harvest Rate (Cycling, Soft Focus)”, average relevance vs. #URLs fetched (0 to 6000), averaged over 100 and over 1000.]
Crawl robustness
[Figure, two panels titled “Crawl Robustness (Cycling)”: URL overlap and server overlap (Overlap1, Overlap2) between Crawl 1 and Crawl 2 vs. #URLs crawled (0 to 3000).]
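Overlap between two crawls can be computed at the URL level or at the server level. The sketch below is one plausible reading of the measurement, not the exact definition behind the plots: overlap is taken as the fraction of one crawl's items also found in the other, and host names stand in for server IP addresses so the example needs no DNS lookups.

```python
from urllib.parse import urlsplit

def overlap(crawl_a, crawl_b, key=lambda url: url):
    """Fraction of crawl_a's distinct items (under `key`) also seen in crawl_b."""
    a = {key(u) for u in crawl_a}
    b = {key(u) for u in crawl_b}
    return len(a & b) / len(a) if a else 0.0

def server(url):
    """Host name of a URL, used here as a stand-in for the server IP address."""
    return urlsplit(url).netloc.lower()

# Hypothetical crawls started from disjoint seed sets.
crawl1 = ["http://a.example/1", "http://a.example/2", "http://b.example/"]
crawl2 = ["http://a.example/2", "http://a.example/3", "http://c.example/"]
url_overlap = overlap(crawl1, crawl2)
server_overlap = overlap(crawl1, crawl2, key=server)
```

Swapping the two arguments gives the overlap measured from the other crawl's side, which need not be the same number.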
Robustness of resource quality
Sample disjoint sets of starting URLs
Two separate crawls
Run HITS/Clever
Find best authorities
Order by rank
Find overlap in the top-rated resources
[Figure: “Resource Robustness (Cycling)”, server overlap (Overlap1, Overlap2) vs. #top resources (0 to 25).]
Distance to best resources
[Figure, two panels: “Resource Distance (Mutual Funds)” and “Resource Distance (Cycling)”, each plotting #servers in top 100 vs. min. distance from crawl seed (#links).]
Cycling: cooperative
Mutual funds: competitive
Learning context graphs
Topics form connected cliques
• “heart disease” ↔ ‘swimming’, ‘hiking’
• ‘cycling’ ↔ “first-aid”!
Radius-1 rule can be myopic
• Trapped within boundaries of related topics
From short pre-crawled paths
• Can learn frequent chains of related topics
• Use this knowledge to circumvent local “topic traps”
Roadmap
Hyperlink mining: a short history
Resource discovery
• Content-based locality in hypertext
• Taxonomy models, topic distillation
• Strategies for focused crawling
Data capture and mining architecture
• The Memex collaboration system
  – Collaborative construction of vertical portals
• Link metadata management architecture
  – Surfing backwards on the Web
Memex project goals
Infrastructure to support spontaneous formation of topic-based communities
Mining algorithms for personal and community level topic management and collaborative resource discovery
Extensible API for plugging in additional hypertext analysis tools
Memex project status
Java applet client
• Netscape 4.5+ (Javascript) available
• IE4+ (ActiveX) planned
Server code for Unix and Windows
• Servlets + IBM Universal Database
• Berkeley DB lightweight storage manager
• Simple-to-install RPMs for Linux planned
About a dozen alpha testers
First beta available 12/2000
Creating personal topic spaces
Valuable user input and feedback on topics and associated examples
[Screenshot callouts: file manager-like interface; privacy choice; ‘?’ indicates automatic placement by Memex classifier; user cuts and pastes to correct or reinforce the Memex classifier.]
Replaying topic-based contexts
“Where was I when last surfing around /Software/Programming?”
Better mobility than one-dimensional history provided by popular browsers
[Screenshot callouts: choice of topic context; replay of recent browsing context restricted to chosen topic; active browser monitoring and dynamic layout of new/incremental context graph.]
Synthesis of a community taxonomy
Users classify URLs into folders
How to synthesize personal folders into common taxonomy?
Combine multiple similarity hints
[Figure: personal folders (Entertainment, Studios, Broadcasting, Media) holding URLs such as kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com and miramax.com; similarity hints are shared documents, shared folders and shared terms; resulting themes are ‘Radio’, ‘Television’ and ‘Movies’.]
Setting up the focused crawler
[Screenshot callouts: taxonomy editor; current examples; suggested additional examples; drag.]
Overview of the Memex system
[Figure: architecture diagram. Browser side: visit, download client JAR, attach running client applet. Memex server side: event-handler servlets (search, folder, context, archive) reached through the Memex client-server protocol and workload sharing negotiations; stores for relational metadata, text index and topic models; mining demons for taxonomy synthesis, resource discovery, recommendation, classification and clustering.]
Surfing backwards using contexts
Space-bounded referrer log
HTTP extension to query backlink data
[Figure: client C follows a link from http://S1/P1 to http://S2/P2 and sends server S2 the request below; a backlink database, local or on the Memex server, lets a client C’ ask “Who points to S2/P2?”]
GET /P2 HTTP/1.0
Referer: http://S1/P1
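A space-bounded referrer log can be sketched as a small bounded store of (target, referrer) pairs built from incoming Referer headers. This is an illustrative sketch only: the class name, the capacity, and the policy of evicting the least recently recorded pair are assumptions, and the proposed HTTP extension for querying the log is not modeled.

```python
from collections import OrderedDict

class BacklinkLog:
    """Bounded store of (target, referrer) pairs, oldest entries evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # (target, referrer) -> None, oldest first

    def record(self, target, referrer):
        key = (target, referrer)
        self._entries.pop(key, None)  # re-recording a pair refreshes its position
        self._entries[key] = None
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the oldest pair

    def backlinks(self, target):
        """Answer 'who points to this page?' from the retained entries."""
        return [ref for (tgt, ref) in self._entries if tgt == target]

# Hypothetical requests seen by a server, each carrying a Referer header.
log = BacklinkLog(capacity=2)
log.record("http://S2/P2", "http://S1/P1")
log.record("http://S2/P2", "http://S3/P3")
log.record("http://S4/P4", "http://S1/P1")  # exceeds capacity, evicts the first pair
who_points = log.backlinks("http://S2/P2")
```

Bounding the log keeps storage constant per server, at the price of forgetting referrers that have not been seen recently.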
User study and analysis
(1999) Significant improvement in finding comprehensive resource lists
• Six broad information needs, 25 volunteers
• Find good resources within limited time
• Backlinks faked using search engines
• Blind-reviewed by three other volunteers
(2000) Average path length of the undirected Web graph is much smaller than that of the directed Web graph
(2000) Better focused crawls using backlinks
Proposal to W3C
Backlinks improve focused crawling
Follow forward HREF as before
Also expand backlinks using ‘link:’ queries
Classify pages as before
Sometimes distracts into unrewarding work…
…but pays off in the end
Surfing backwards: summary
“Life must be lived forwards, but it can only be understood backwards” (Søren Kierkegaard)
Hubs are everywhere!
• To find them, look backwards
Bidirectional surfing is a valuable means to seed focused resource discovery
• Even if one has to depend on search engines initially for link:… queries
Conclusion
Architecture for topic-specific web resource discovery
Driven by examples collected from surfing and bookmarking activity
Reduced dependence on large crawlers
Modest desktop hardware adequate
Variable radius goal-directed crawling
High harvest rate
High quality resources found far from keyword query response nodes