
WebBase: Building a Web Warehouse

Hector Garcia-Molina
Stanford University

Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley

The Web

• A universal information resource
– Model weak, strong agreement

• How to exploit it?

WebBase

WEB PAGE

WebBase Goals

• Manage very large collections of Web pages
– Today: 1500 GB HTML, 200 M pages
• Enable large-scale Web-related research
• Locally provide a significant portion of the Web
• Efficient wide-area Web data distribution

WebBase Architecture

WebBase Remote Users

• Berkeley
• Columbia
• U. Washington
• Harvey Mudd
• Università degli Studi di Milano
• U. of Arizona
• California Digital Library
• Cornell
• U. of Houston
• Learning Lab Lower Saxony (L3S)
• France Telecom
• U. Texas

Outline

• Technical Challenges
• WebBase Use
• The Future

Challenges

• Scalability
– crawling
– archive distribution
– index construction
– storage
• Consistency
– freshness
– versions
• Dissemination
• Archiving
– “units”
– coordination
• IP Management
– copy access
– link access
– access control
• Hidden Web
• Topic-Specific

Collection Building

What is a Crawler?

[Figure: crawler loop. Steps: get next url, get page, extract urls. Data: initial urls, to visit urls, visited urls, web pages.]
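The loop in the figure can be sketched in a few lines. This is a minimal illustration, not the WebBase crawler: the names `crawl`, `LinkExtractor`, `fetch`, and `TOY_WEB` are hypothetical, and the page fetcher is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(initial_urls, fetch, max_pages=100):
    """Basic crawl loop: get next url, get page, extract urls."""
    to_visit = deque(initial_urls)  # "to visit urls", seeded with "initial urls"
    visited = set()                 # "visited urls"
    pages = {}                      # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()    # get next url
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)           # get page (None means "could not fetch")
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)           # extract urls
        for href in parser.links:
            link = urljoin(url, href)
            if link not in visited:
                to_visit.append(link)
    return pages


# Toy in-memory "web" (an assumption for illustration only).
TOY_WEB = {
    "http://a.example/": '<a href="/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": '<a href="/">home</a>',
    "http://b.example/": "no links here",
}
pages = crawl(["http://a.example/"], TOY_WEB.get)
print(sorted(pages))
```

A real crawler would also need politeness delays, robots.txt handling, and URL normalization, none of which are modeled here.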

Parallel Crawling

Independent Crawlers

[Figure: site 1, site 2]

Partition: Firewall

[Figure: site 1, site 2, partition]

Partition options:
· URL hash
· Site hash
· Hierarchical
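The three partitioning options named above can be sketched as functions that map a URL to a crawling process. This is an illustrative sketch under assumptions: the function names, the choice of MD5 as the hash, and the reading of "hierarchical" as partitioning by top-level domain are not taken from the slides.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key, num_procs):
    """Stable hash of a string into one of num_procs buckets."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_procs


def url_hash_partition(url, num_procs):
    """URL hash: pages of one site may be spread over many processes."""
    return _bucket(url, num_procs)


def site_hash_partition(url, num_procs):
    """Site hash: every page of a site goes to the same process."""
    return _bucket(urlsplit(url).hostname or "", num_procs)


def hierarchical_partition(url, tld_to_proc, default=0):
    """Hierarchical (assumed here to mean: split by top-level domain)."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld_to_proc.get(tld, default)


u1 = "http://www.stanford.edu/a.html"
u2 = "http://www.stanford.edu/b.html"
print(site_hash_partition(u1, 4) == site_hash_partition(u2, 4))
print(hierarchical_partition(u1, {"edu": 1, "com": 2}))
```

With a site hash, links within one site never cross a partition boundary, which matters for the partition modes on the following slides.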

Partition: Cross-Over

[Figure: site 1, site 2, partition]

Partition: Exchange

[Figure: site 1, site 2, partition]
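The slides only name the three partition modes. In the parallel-crawling literature they are usually described as follows: in firewall mode a process ignores links that leave its partition; in cross-over mode it follows such links only once its own partition is exhausted; in exchange mode it forwards them to the process that owns them. A minimal sketch under those assumptions, where `CrawlProcess`, `owner_of`, and the queue names are hypothetical:

```python
from collections import deque

MODES = ("firewall", "cross-over", "exchange")


class CrawlProcess:
    """One crawling process that owns a single partition of the URL space."""

    def __init__(self, proc_id, mode, owner_of):
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.proc_id = proc_id
        self.mode = mode
        self.owner_of = owner_of  # function: url -> id of the owning process
        self.to_visit = deque()   # URLs inside this process's partition
        self.foreign = deque()    # cross-over only: URLs owned by other processes
        self.outbox = {}          # exchange only: owner id -> URLs to send

    def handle_link(self, link):
        """Decide what to do with a link extracted from a downloaded page."""
        owner = self.owner_of(link)
        if owner == self.proc_id:
            self.to_visit.append(link)
        elif self.mode == "firewall":
            pass  # inter-partition link is dropped
        elif self.mode == "cross-over":
            self.foreign.append(link)  # kept as a fallback
        else:  # exchange
            self.outbox.setdefault(owner, []).append(link)

    def next_url(self):
        """Own partition first; foreign URLs only in cross-over mode."""
        if self.to_visit:
            return self.to_visit.popleft()
        if self.mode == "cross-over" and self.foreign:
            return self.foreign.popleft()
        return None


def owner_of(url):
    # Toy ownership rule (assumption): process 0 owns site1, process 1 the rest.
    return 0 if "site1" in url else 1


p = CrawlProcess(0, "exchange", owner_of)
p.handle_link("http://site1.example/a")
p.handle_link("http://site2.example/b")
print(list(p.to_visit), p.outbox)
```

The trade-off this exposes is the one plotted on the next slide: firewall mode never downloads a page twice but can miss pages, while cross-over mode reaches more pages at the cost of duplicate downloads.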

Coverage vs Overlap

[Figure: coverage vs. overlap for a cross-over crawler; 5 random seeds per C-proc]
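The slide does not define its two axes. A common definition in the parallel-crawling literature, assumed here, is overlap = (N - I) / I and coverage = I / U, where N is the total number of downloads across all processes, I is the number of unique pages downloaded, and U is the number of pages the crawler as a whole should download. A small sketch with hypothetical function names and toy data:

```python
def unique_pages(downloads_per_proc):
    """Union of the pages downloaded by all crawling processes."""
    return set().union(*downloads_per_proc)


def overlap(downloads_per_proc):
    """(N - I) / I: extra downloads per unique page (assumed definition)."""
    n = sum(len(d) for d in downloads_per_proc)
    i = len(unique_pages(downloads_per_proc))
    if i == 0:
        raise ValueError("no pages downloaded")
    return (n - i) / i


def coverage(downloads_per_proc, target_pages):
    """I / U: fraction of the target pages that were downloaded (assumed definition)."""
    targets = set(target_pages)
    if not targets:
        raise ValueError("empty target set")
    return len(unique_pages(downloads_per_proc) & targets) / len(targets)


# Toy data: two processes, one page (p3) downloaded by both.
downloads = [["p1", "p2", "p3"], ["p3", "p4"]]
targets = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
print(overlap(downloads), coverage(downloads, targets))
```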

WebBase Parallel Crawling

[Figure: WebBase parallel crawler. Labels: site queues, process (two shown), computer, other computers, coordinator.]

WebBase Parallel Crawling

[Figure: pages/sec, CPU utilization, and sites being crawled plotted against number of processes (1 to 17)]

Challenges

• Scalability
– crawling
– archive distribution
– index construction
– storage
• Consistency
– freshness
– versions
• Dissemination
• Archiving
– “units”
– coordination
• IP Management
– copy access
– link access
– access control
• Hidden Web
• Topic-Specific

Collection Building

How to Refresh?

[Figure: web repository holding pages a and b]

a changes daily
b changes once a week
can visit one page per week

• How should we visit pages?
– a a a a a a a ...
– b b b b b b b ...
– a b a b a b a b ... [uniform]
– a a a a a a b a a a ... [proportional]
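The four visiting patterns above can be compared numerically. The sketch below assumes a model the slide does not state: each page changes as a Poisson process, visits to a page are evenly spaced, and freshness is the long-run fraction of time the stored copy is up to date. Under that model a page with change rate λ visited every I time units has average freshness (1 - e^(-λI)) / (λI). All names are hypothetical.

```python
import math

RATES = {"a": 7.0, "b": 1.0}  # changes per week: a daily, b weekly
BUDGET = 1.0                  # total page visits per week


def avg_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (Poisson changes, periodic visits)."""
    if visit_rate == 0:
        return 0.0  # never refreshed: the copy is stale in the long run
    x = change_rate / visit_rate  # equals lambda * I
    return (1 - math.exp(-x)) / x


def policy_freshness(shares):
    """Average freshness over both pages for a given split of the visit budget."""
    total = sum(avg_freshness(RATES[p], BUDGET * shares[p]) for p in RATES)
    return total / len(RATES)


policies = {
    "only a": {"a": 1.0, "b": 0.0},
    "only b": {"a": 0.0, "b": 1.0},
    "uniform": {"a": 0.5, "b": 0.5},
    "proportional": {"a": 7 / 8, "b": 1 / 8},  # visits proportional to change rate
}
results = {name: policy_freshness(shares) for name, shares in policies.items()}
for name, value in results.items():
    print(f"{name:12s} {value:.3f}")
```

Under these assumptions the uniform pattern scores higher than the proportional one, and spending the whole budget on the slowly changing page b scores highest, because the single weekly visit cannot keep a page that changes daily fresh for long.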

Using WebBase

• Fast PageRank
• Complex Queries

Structure of the Web

Color the nodes by their domain:
red = stanford.edu
green = berkeley.edu
blue = mit.edu

Structure of the Web

stanford.edu berkeley.edu

mit.edu

Nested Block Structure of the Web

Berkeley

Stanford

Personalized PageRank
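The slide gives only the title. In its usual formulation, personalized PageRank replaces the uniform random jump of ordinary PageRank with a jump to a user-chosen set of preferred pages. The sketch below is plain power iteration under that formulation, not the fast block-based computation the talk refers to; the function name, parameters, and toy graph are assumptions.

```python
def personalized_pagerank(out_links, preference, damping=0.85, iterations=50):
    """Power iteration with a non-uniform teleport (preference) vector.

    out_links: dict node -> list of nodes it links to (every target must be a key).
    preference: dict node -> non-negative weight; missing nodes get weight 0.
    """
    nodes = list(out_links)
    total = sum(preference.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("preference vector must have positive total weight")
    teleport = {n: preference.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its rank back via the teleport vector.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


# Toy graph (assumption): s links to b, b links to s and m, m has no out-links.
graph = {"s": ["b"], "b": ["s", "m"], "m": []}
ranks = personalized_pagerank(graph, {"s": 1.0})
print({n: round(r, 3) for n, r in ranks.items()})
```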

Complex Queries

Stanford WebBase Repository

Text search
E.g., search for “SARS Symptoms”

Bulk/Streaming access
Large-scale mining & indexing
E.g., compute PageRank, extract communities

Complex queries
Declarative analysis interface

Example of a Complex Query

Goal: find universities collaborating with Stanford on mobile networking

• Compute S = stanford.edu pages containing phrase “Mobile networking”
• Rank pages in S by PageRank
• Compute R = set of all “.edu” domains pointed to by pages in S
• Rank domains in R by sum (incoming ranks)
• List top 10 domains in R

[Figure: the entire Web, the stanford.edu pages within it, the “Mobile networking” pages forming S, and the domains R that S points to]
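The query steps above can be traced on toy data. This sketch is only an in-memory illustration, not the WebBase query interface: the function names, the data layout, the two-label domain rule, and the decision to exclude stanford.edu itself from R are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain(url):
    """Last two labels of the host name, e.g. cs.stanford.edu -> stanford.edu."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def collaborating_domains(pages, links, pagerank, phrase, home="stanford.edu", k=10):
    phrase = phrase.lower()
    # S: pages in the home domain that contain the phrase.
    s = {u for u, text in pages.items()
         if domain(u) == home and phrase in text.lower()}
    # R: .edu domains pointed to by pages in S, scored by the summed
    # PageRank of the pages in S that point to them.
    scores = defaultdict(float)
    for src, dst in links:
        d = domain(dst)
        if src in s and d.endswith(".edu") and d != home:  # excluding home is an assumption
            scores[d] += pagerank.get(src, 0.0)
    # Top k domains by score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy data (made up for illustration).
m1 = "http://cs.stanford.edu/m1"
m2 = "http://ee.stanford.edu/m2"
db = "http://cs.stanford.edu/db"
pages = {m1: "Mobile networking lab", m2: "A mobile networking course", db: "Databases"}
links = [
    (m1, "http://www.berkeley.edu/x"),
    (m2, "http://www.berkeley.edu/y"),
    (m1, "http://www.mit.edu/z"),
    (db, "http://www.cmu.edu/"),
    (m1, "http://example.com/"),
]
pagerank = {m1: 0.5, m2: 0.25, db: 0.25}
top = collaborating_domains(pages, links, pagerank, "Mobile networking")
print(top)
```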

Supernodes

[Figure: a Web graph with its pages grouped into {N1, N2, N3}, and the corresponding supernode graph with superedges E1,2, E3,2, E1,3, E3,1. Also shown: IntraNode1, SEdgePos1, IntraNode3, SEdgeNeg3.]
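One way to read the figure: pages are grouped into supernodes, a superedge Ei,j exists when some page in Ni links to some page in Nj, and links that stay inside a group are kept in that group's intranode graph. A minimal sketch under that reading; the function name and toy graph are hypothetical, and the positive and negative superedge graphs (SEdgePos, SEdgeNeg) from the figure are not modeled.

```python
from collections import defaultdict


def build_supernode_graph(links, partition_of):
    """Split page-level links into superedges and per-group intranode links.

    links: iterable of (source_page, target_page).
    partition_of: dict page -> supernode label.
    """
    super_edges = set()
    intranode = defaultdict(set)
    for src, dst in links:
        ni, nj = partition_of[src], partition_of[dst]
        if ni == nj:
            intranode[ni].add((src, dst))  # link stays inside one supernode
        else:
            super_edges.add((ni, nj))      # one superedge per ordered pair
    return super_edges, dict(intranode)


# Toy graph (assumption) chosen to reproduce the superedges named in the figure.
partition_of = {"a": "N1", "b": "N1", "c": "N2", "d": "N3"}
links = [("a", "b"), ("a", "c"), ("d", "c"), ("a", "d"), ("d", "a"), ("b", "c")]
super_edges, intranode = build_supernode_graph(links, partition_of)
print(sorted(super_edges), intranode)
```

Many page-level links collapse into one superedge (here a→c and b→c both become E1,2), which is why the supernode graph stays small as the page count grows.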

Growth of Supernode Graph

[Figure: size of the supernode graph plotted against number of pages (millions, 0 to 120). Annotation: 82 MB, 115M pages (830 GB of raw HTML).]

Query Execution Times

[Figure: execution times for Query 1 to Query 6 under four representations: S-Node representation, relational DB, Connectivity Server, files of adjacency lists]

Query Optimization

[Figure: query plan with predicates on pDepth (constants 4 and 5) and a LIKE predicate on pURL with a ".net/%" pattern]

Impact of cluster-based optimization

[Figure: execution times for sample queries Q1 to Q9, no optimization vs. optimization enabled]

• 35-million page dataset
• 600 million links
• 300 GB of HTML
• 40-45% reduction in query execution times

Conclusion (So Far)

• Web is a universal information resource
• WebBase exploits this resource
• WebBase Challenges:
– scalability, consistency, complex queries...
• The Future for WebBase (and clones)??

Will WebBase Scale?

[Figure: web content (indexable), WebBase capacity (pessimistic), and WebBase capacity (optimistic) plotted against time, with today marked]

Pessimistic Scenario

• Specialized WebBases
– sports
– shopping
– ...

[Figure: WebBase capacity (pessimistic) plotted against time, with today marked]

Optimistic Scenario

• Web in a Box
– web delivered in “CD” monthly
– search engine handles updates

[Figure: WebBase capacity (optimistic) plotted against time, with today marked]

Legal Issues?

• Is WebBase legal?
– copies
– links, deep linking
• International regulations

Biasing Results

• How long will Google, AltaVista, etc. resist “temptations”?
• Biasing Crawler
• Link and Content Spam

Access Data

• WebBase does not capture access patterns

? WebBase

Semantic Web?

• Will tags be generated?
• By whom?
• Agreement?

? WebBase

semantic tags

Future Technical Challenges

• Incremental Updates
• Query Optimization
• Crawling Deep Web

Final Conclusion

• Many challenges ahead...
• Additional information:
– Google: Stanford WebBase

WEB PAGE
