Needle in the Haystack: The Technology of Internet Search

Page 1: Needle in the Haystack: The Technology of Internet Search

1

Needle in the Haystack: The Technology of Internet Search

Randy H. Katz
The United Microelectronics Corporation Distinguished Professor

Computer Science Division, EECS Department
University of California, Berkeley
Berkeley, CA 94720-1776 USA

[email protected]

Page 2: Needle in the Haystack: The Technology of Internet Search

2

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 3: Needle in the Haystack: The Technology of Internet Search

3

Search is BIG!

Page 4: Needle in the Haystack: The Technology of Internet Search

4

And the World is Going Digital

Page 5: Needle in the Haystack: The Technology of Internet Search

5

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 6: Needle in the Haystack: The Technology of Internet Search

6

Historical Background: The Perfect Storm

• ARPANet 1969
• NSFNet 1985
• Commercial Internet 1995
• World Wide Web: Marc Andreessen, NCSA Mosaic 1993
• Jim Clark, Netscape 1995
• Vannevar Bush, “As We May Think”, MEMEX 1945
• Ted Nelson, Xanadu Hypertext 1965-1990 (Autodesk)
• SGML 1986
• Bill Atkinson, HyperCard 1987
• Tim Berners-Lee, URL/HTTP/HTML 1989

Est. $15.5 billion spent on-line from Thanksgiving to Christmas 2004, up 28% from 2003

Page 7: Needle in the Haystack: The Technology of Internet Search

7

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 8: Needle in the Haystack: The Technology of Internet Search

8

Information Tsunami

• Bit: Binary digit, either a 0 or 1
• Byte: 8 bits
  – 1 byte: Single character
  – 10 bytes: A single word
  – 100 bytes: Telegram or punched card
• Kilobyte: 1,000 or 10^3 bytes
  – 1 kilobyte: Very short story
  – 2 kilobytes: Typewritten page
  – 10 kilobytes: Encyclopedia page
  – 50 kilobytes: Compressed document image page
  – 100 kilobytes: Low-res photo
  – 200 kilobytes: Box of punched cards

http://www.sims.berkeley.edu/research/projects/how-much-info/index.html

Page 9: Needle in the Haystack: The Technology of Internet Search

9

Information Tsunami

• Megabyte: 1,000,000 or 10^6 bytes
  – 1 megabyte: Small novel or 3.5in floppy disk
  – 2 megabytes: Hi-res photo
  – 5 megabytes: Complete works of Shakespeare
  – 10 megabytes: Minute of hi-fi sound
  – 100 megabytes: 1 meter of shelved books
  – 500 megabytes: CD-ROM
• Gigabyte: 1,000,000,000 or 10^9 bytes
  – 1 gigabyte: Pickup truck filled with paper
  – 2 gigabytes: Movie on a DVD
  – 50 gigabytes: Floor of books
  – 100 gigabytes: Floor of academic journals
  – 500 gigabytes: Biggest FTP site

http://www.sims.berkeley.edu/research/projects/how-much-info/index.html

Page 10: Needle in the Haystack: The Technology of Internet Search

10

Information Tsunami

• Terabyte: 1,000,000,000,000 or 10^12 bytes
  – 1 terabyte: 50,000 trees made into paper and printed, or 1 day of Earth Observing System (EOS) data
  – 2 terabytes: Academic research library
  – 10 terabytes: Printed collection of the U.S. Library of Congress
  – 50 terabytes: Contents of a large mass storage system
  – 400 terabytes: National Climate Data Center (NOAA) database
• Petabyte: 1,000,000,000,000,000 or 10^15 bytes
  – 1 petabyte: 3 years of EOS data
  – 2 petabytes: All U.S. academic research libraries
  – 8 petabytes: All information available on the Web
  – 200 petabytes: All printed material (2001)

http://www.sims.berkeley.edu/research/projects/how-much-info/index.html

Page 11: Needle in the Haystack: The Technology of Internet Search

11

Information Tsunami

• Exabyte: 1,000,000,000,000,000,000 or 10^18 bytes
  – 2 exabytes: Total volume of information generated worldwide annually
  – 5 exabytes: All words ever spoken by humans
• Zettabyte: 1,000,000,000,000,000,000,000 or 10^21 bytes
• Yottabyte: 1,000,000,000,000,000,000,000,000 or 10^24 bytes

http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
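The prefix ladder on these slides is a fixed factor of 1,000 per step, so a byte count can be turned into its largest fitting unit with a short loop. The sketch below is only an illustration; the function name `describe` and the list `PREFIXES` are made up for this example and use the decimal prefixes from the slides (not the 1,024-based binary ones).

```python
# Decimal prefixes as listed on the slides: each step is a factor of 1,000.
PREFIXES = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
            "petabyte", "exabyte", "zettabyte", "yottabyte"]


def describe(num_bytes: int) -> str:
    """Express num_bytes in the largest unit that keeps the value at 1 or more."""
    value = float(num_bytes)
    index = 0
    # Divide by 1,000 until the value drops below 1,000 or units run out.
    while value >= 1000 and index < len(PREFIXES) - 1:
        value /= 1000
        index += 1
    plural = "" if value == 1 else "s"
    return f"{value:g} {PREFIXES[index]}{plural}"


# Two of the slide's data points, written out as raw byte counts.
web_size = describe(8 * 10**15)       # "All information available on the Web"
yearly_output = describe(2 * 10**18)  # "Total volume of information generated worldwide annually"
```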

Page 12: Needle in the Haystack: The Technology of Internet Search

12

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 13: Needle in the Haystack: The Technology of Internet Search

13

Anatomy of a Web Page: Randy’s Home Page

• URL: Uniform Resource Locator
• Images
• Text

Page 14: Needle in the Haystack: The Technology of Internet Search

14

Anatomy of a Web Page: Randy’s Home Page

<html>
<head>
<title>Professor Randy Howard Katz University of California Berkeley Computer Science Division Home Page</title>
<meta name="description" content="Home Page of Berkeley Computer Science Professor Randy Howard Katz">
<meta name="keywords" content="Katz Randy Howard Berkeley Professor University California Electrical Engineering Computer Science Department RAID Redundant Arrays Inexpensive Disks SPUR Snoop Wireless Communications Networks Programmable Network Elements">
</head>
<body>
<p><img height="269" src="Randy_2004.jpg" width="182" align="bottom" naturalsizeflag="0">&nbsp;&nbsp; <img height="269" src="RHK85a.jpg" width="177" align="bottom" naturalsizeflag="0">&nbsp;&nbsp; </p>
<p><font size="-1">2005 vs. 1985 ... The hair is grayer, but the smirk remains the same!<br>
<br>
"... Katz, a thin, almost gaunt man with horn-rimmed glasses magnifying sunken eyes. ..."<br>
--George Johnson, WIRED Magazine, (January 2000), page 150.</font></p>
<p><img src="VISIONAR.JPG" align="bottom"> </p>
…

Page 15: Needle in the Haystack: The Technology of Internet Search

15

• Text
• Images
• Links!

Page 16: Needle in the Haystack: The Technology of Internet Search

16

Anatomy of a Web Page: Randy’s Web Page

<hr align="left">
<h1>Professor Randy H. Katz</h1>
<h3>Electrical Engineering and Computer Science Department</h3>
<p><a href="http://www.umc.com.tw/"><img hspace="6" src="UMCLogo.gif" align="left"> </a><b><font size="+1">The <a href="http://www.umc.com.tw/">United Microelectronics Corporation</a> Distinguished Professor</font></b></p>
<p><font size="-1"><br clear="left">
Ph.D., University of California, Berkeley, 1980.<br>
M.S., University of California, Berkeley, 1978.<br>
A.B., Cornell University, 1976.<br>
</font></p>
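To make the anatomy concrete, here is a minimal sketch of how a program might pull the title, meta tags, images, and links out of a page like this one. It uses Python's standard `html.parser` module; the class name `PageAnatomy` and the shortened `SAMPLE` markup are made up for this illustration and are not taken from any search engine's code.

```python
from html.parser import HTMLParser


class PageAnatomy(HTMLParser):
    """Collects the title, named meta tags, link targets, and image sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text between <title> and </title> is kept.
        if self._in_title:
            self.title += data


# Shortened stand-in for the home page markup shown on the slides.
SAMPLE = """<html><head><title>Professor Randy Howard Katz</title>
<meta name="description" content="Home Page of Berkeley Computer Science Professor Randy Howard Katz">
</head><body>
<p><img src="Randy_2004.jpg" height="269" width="182"></p>
<p><a href="http://www.umc.com.tw/"><img src="UMCLogo.gif"></a></p>
</body></html>"""

page = PageAnatomy()
page.feed(SAMPLE)
```

After `feed`, `page.links` holds the link URLs a crawler could follow next, and `page.title` and `page.meta` hold text an indexer could use as search terms.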

Page 17: Needle in the Haystack: The Technology of Internet Search

17

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 18: Needle in the Haystack: The Technology of Internet Search

18

Anatomy of Web Access

[Diagram, steps (1) to (4): Web Browser; Link URL http://www.umc.com.tw/; Naming System (DNS) performing name-to-address mapping and returning an IP address; Web Server; Web Page in HTML. Location shown: Taiwan.]
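The pieces named on this slide (link URL, DNS name-to-address mapping, request to the Web server) can be sketched in a few lines. This is a simplified illustration, not a full browser: the function name `plan_access` is hypothetical, the resolver is passed in as a parameter so the example can run without a network, and the request text is a bare-bones HTTP/1.1 GET.

```python
import socket
from urllib.parse import urlsplit


def plan_access(url, resolve=socket.gethostbyname):
    """Return (IP address, port, HTTP request text) for fetching url.

    resolve is the name-to-address step; by default it asks the system's
    DNS resolver, but any function from host name to address will do.
    """
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)
    address = resolve(host)  # DNS: name-to-address mapping
    path = parts.path or "/"
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    return address, port, request


# Offline demonstration with a stand-in resolver. 192.0.2.10 is a
# documentation-only address, not the real address of the site.
def fake_resolver(name):
    return "192.0.2.10"


address, port, request = plan_access("http://www.umc.com.tw/", resolve=fake_resolver)
```

A real browser would then open a connection to `address` on `port`, send `request`, and render the HTML that comes back.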

Page 19: Needle in the Haystack: The Technology of Internet Search

19

Anatomy of Web Access: Content Caching

[Diagram, steps (5) to (8): Web Browser; Link URL …/English/about/index.asp; Naming System (DNS) returning the Origin IP; Content Network DNS returning the Edge Cache IP; Edge Cache; Origin Web Server; Content Distribution; Web Page in HTML. Locations shown: Taiwan and San Jose.]

Page 20: Needle in the Haystack: The Technology of Internet Search

20

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 21: Needle in the Haystack: The Technology of Internet Search

21

Challenges of Search

• How to find all the pages on the Web?
• How to order the pages by relevance?
• How to make the content on those pages searchable?
• How to keep it all up-to-date?

• Web Crawlers/SpiderBots
  – Network software executing in parallel that follows links in the Web to find content
  – Web pages “scraped” for more links to follow
  – Web revisited on the order of once every two to three days
• Indexers
  – Web pages “scraped” for search terms to build indexes
  – (Google) Page rank algorithm: order a page within the index based (roughly) on how many pages refer to it
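The crawler and indexer roles can be sketched on a toy scale. The example below is an assumption-heavy illustration: the "Web" is a small in-memory dictionary called `WEB` with made-up example URLs, the crawl is sequential rather than parallel, and the index is a plain inverted index from term to pages.

```python
import re
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("needle in the haystack", ["b.example/", "c.example/"]),
    "b.example/": ("internet search technology", ["c.example/"]),
    "c.example/": ("search for a needle", ["a.example/"]),
    "d.example/": ("unreachable page", []),  # nothing links here
}


def crawl(seed):
    """Breadth-first crawl: follow links from the seed to discover pages."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        _, links = WEB[url]
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def build_index(urls):
    """Inverted index: search term -> set of URLs whose text contains it."""
    index = {}
    for url in urls:
        text, _ = WEB[url]
        for term in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(term, set()).add(url)
    return index


pages = crawl("a.example/")
index = build_index(pages)
```

Note that `d.example/` never enters the index because no crawled page links to it, which is one reason finding all the pages on the Web is hard.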

Page 22: Needle in the Haystack: The Technology of Internet Search

22

Quick (and Incomplete) History of Search Engines

[Timeline figure; axis: Pre-Web, 1993, 1995, 1997, 1999, 2001, 2003, 2005. Labels, in the order they appear:]

• UMinn: Veronica & Archie services for gopher & ftp
• MIT: Wandex/WWW Wanderer
• Aliweb
• CMU: Lycos
• 1st Commercial Search Engine
• Stanford: Yahoo! (directories)
• Battle for Popularity: Webcrawler (UWash), HotBot (Wired), Excite (Stanford), Infoseek (ABC), Inktomi (Berkeley), AltaVista (DEC)
• Google (Stanford)
• Yahoo! acquires Inktomi
• Yahoo! acquires Overture (AlltheWeb, AltaVista)
• Yahoo! deploys joint technology
• Also shown: a9.com, AlltheWeb, Ask Jeeves, Clusty, Gigablast, Ez2Find, Teoma, WiseNut, GoHook, Walhello, Kartoo

Page 23: Needle in the Haystack: The Technology of Internet Search

23

Search Challenges and Issues

• Web growing faster than search engines can index
• Web pages updated frequently, forcing frequent revisits
• Keyword-only searches result in many false positives
• Difficult to index dynamically generated sites: the so-called “invisible web”
• Some search engines order results by financial “placement” considerations rather than relevance
• Some sites trick search engines into displaying them first for some keywords, which pollutes search results and pushes more relevant links down among the results

Page 24: Needle in the Haystack: The Technology of Internet Search

24

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 25: Needle in the Haystack: The Technology of Internet Search

25

Page Ranking Algorithms

• Web page relevancy
  – Many hits: how to ensure the best/most relevant web pages are presented first in answer to a search
• Location and Frequency of Keywords
  – Index terms in page title raise its relevance for that term
  – Keywords near “top” of page more relevant than bottom
  – High keyword frequency boosts relevance
• If search engine strategy is known, page developers will “game” the strategy to get their pages ranked higher
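A location-and-frequency scorer along these lines fits in a few lines, which also shows why it is easy to game. The weights below (a bonus of 3 for a title hit, up to 1 for an early first occurrence, plus the term's share of the words) are arbitrary assumptions chosen for illustration, and `relevance` is a hypothetical name, not any engine's actual formula.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(term, title, body):
    """Score a page for one search term using title, position, and frequency."""
    term = term.lower()
    words = tokenize(body)
    score = 0.0
    if term in tokenize(title):
        score += 3.0  # assumed bonus: term appears in the page title
    if term in words:
        first = words.index(term)
        score += 1.0 - first / len(words)        # earlier on the page scores higher
        score += words.count(term) / len(words)  # share of words that are the term
    return score


honest = relevance("search", "Home", "the web is big and search is hard")
on_topic = relevance("search", "Internet Search", "search engines index the web")
stuffed = relevance("search", "Home", "search search search search")
```

Here `stuffed` outscores `honest` even though the stuffed page says nothing useful, which is the "gaming" problem the last bullet describes.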

Page 26: Needle in the Haystack: The Technology of Internet Search

26

Google’s Page Rank Algorithm

• Which is the most important page?

Page 27: Needle in the Haystack: The Technology of Internet Search

27

Google’s Page Rank Algorithm

• Googlese from their web page:
  – PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Page 28: Needle in the Haystack: The Technology of Internet Search

28

Google Page Rank Algorithm

• Basic idea:
  – Page’s rank determined by the number of links to the page (also known as citations)
  – If citing page is more important (has a high page rank/authority page) then the pages it cites are more important
  – If citing page has many links, then cited page is less important (normalize for number of links on citing page)

PR(P) is the page rank of page P; T1, …, Tn are the pages that cite P; C(T) is the number of links going out of page T; d is a damping (“decay”) factor, e.g., 0.85. Then:

PR(P) = (1 – d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

• See http://www-db.stanford.edu/~backrub/google.html
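Because each page's rank depends on the ranks of the pages citing it, the formula is usually evaluated by repeated substitution until the values settle. The sketch below does exactly that on a three-page graph. It is a minimal illustration under stated assumptions, not Google's implementation: the function name `page_rank`, the starting rank of 1.0, the fixed iteration count, and the tiny graph are all made up here, and every page is assumed to have at least one outgoing link with no duplicate links.

```python
def page_rank(links, d=0.85, iterations=100):
    """Iterate PR(P) = (1 - d) + d * sum(PR(T) / C(T)) over pages T citing P.

    links maps each page to the list of pages it cites (its outgoing links).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}  # assumed starting value for every page
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each citing page T passes on its rank divided by its link count C(T).
            incoming = sum(rank[t] / len(links[t])
                           for t in pages
                           if p in links.get(t, ()))
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical graph: A cites B and C, B cites C, C cites A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = page_rank(links)
```

In this graph C ends up ranked highest: it is cited by both A and B, and B passes its whole vote to C because C is B's only link. A comes second because it receives C's entire vote.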

Page 29: Needle in the Haystack: The Technology of Internet Search

29

Google Conceptual Architecture

Page 30: Needle in the Haystack: The Technology of Internet Search

30

Google Server Architecture

• Index servers: search term partitioned and mapped to doc list
• Intersect to find document list, sort by page rank
• Document IDs used to extract text from Doc Servers
• Over 100,000 processors (and growing) in Googleplex

[Diagram labels: Google Web Server, Spell Checker, Ad Server, Index Server, Doc Servers (multiple)]
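The first three bullets describe a lookup pipeline that can be mimicked with dictionaries. In this sketch, `INDEX`, `PAGE_RANK`, and `DOCS` are hypothetical stand-ins for the index servers, precomputed page ranks, and doc servers; the data is invented, and everything runs in one process rather than across a partitioned cluster.

```python
# Stand-in for the index servers: search term -> set of document IDs.
INDEX = {
    "internet": {1, 2, 4},
    "search": {1, 2, 3},
    "needle": {2, 3},
}
# Stand-in for precomputed page ranks, keyed by document ID (invented values).
PAGE_RANK = {1: 0.4, 2: 1.3, 3: 0.9, 4: 0.2}
# Stand-in for the doc servers: document ID -> stored text.
DOCS = {
    1: "Internet search basics",
    2: "Needle in the haystack: Internet search",
    3: "Search for a needle",
    4: "Internet history",
}


def answer(query):
    """Look up each term, intersect the doc lists, sort by page rank, fetch text."""
    terms = query.lower().split()
    doc_lists = [INDEX.get(term, set()) for term in terms]
    if not doc_lists:
        return []
    matches = set.intersection(*doc_lists)  # documents containing every term
    ranked = sorted(matches, key=lambda doc_id: PAGE_RANK[doc_id], reverse=True)
    return [(doc_id, DOCS[doc_id]) for doc_id in ranked]


results = answer("internet search")
```

Intersecting before sorting keeps the ranking step small: only documents that contain every query term are ever sorted or fetched.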

Page 31: Needle in the Haystack: The Technology of Internet Search

31

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 32: Needle in the Haystack: The Technology of Internet Search

32

Fun and Games

• Google Scholar
• Googling Someone
• Google News
• Comparison Shopping
• Google Whacks

Page 33: Needle in the Haystack: The Technology of Internet Search

33

Google Scholar

Page 34: Needle in the Haystack: The Technology of Internet Search

34

Google Randy

Page 35: Needle in the Haystack: The Technology of Internet Search

35

Google Randy Katz “Google Index”

AdvertisingPlacement

Page 36: Needle in the Haystack: The Technology of Internet Search

36

Google News

Page 37: Needle in the Haystack: The Technology of Internet Search

37

Comparison Shopping

Page 38: Needle in the Haystack: The Technology of Internet Search

38

elgooG

Page 39: Needle in the Haystack: The Technology of Internet Search

39

Google Whacks

Page 40: Needle in the Haystack: The Technology of Internet Search

40

Business Model: Ad Placement and Click-Thru

Old data (2002): Google is now market leader in ad revenue
2004 revenue through 9/30/04: $2.1B

Page 41: Needle in the Haystack: The Technology of Internet Search

41

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Page 42: Needle in the Haystack: The Technology of Internet Search

42

Top 10 Search Engines

10. DMOZ.org
9. Alltheweb.com
8. KartOO.com
7. MSN.com
6. Dogpile.com
5. AskJeeves.com
4. About.com
3. Yahoo.com
2. Vivisimo.com
1. Google.com

Page 43: Needle in the Haystack: The Technology of Internet Search

43

Clustering

Page 44: Needle in the Haystack: The Technology of Internet Search

44

Google Video Search

Page 45: Needle in the Haystack: The Technology of Internet Search

45

Google Video Search

Page 46: Needle in the Haystack: The Technology of Internet Search

46

Amazon’s A9

Page 47: Needle in the Haystack: The Technology of Internet Search

47

Amazon’s A9

Page 48: Needle in the Haystack: The Technology of Internet Search

48

A9’s Yellow Pages

Page 49: Needle in the Haystack: The Technology of Internet Search

49

A9’s Yellow Pages

Page 50: Needle in the Haystack: The Technology of Internet Search

50

Innovations Now and Yet to Come

• Index ever larger portions of the Web, even beyond traditional web pages, e.g., video
• Better quality/higher relevance searches
• Better presentation of results, e.g., clustering, site information
• Better exploitation of semantic relationships for improved page ranking, more personalization, e.g., user’s zip code
• More services (Web, news groups, blogs, comparison shopping, video/audio, yellow pages, etc.)
• Integrate with desktop machine

Page 51: Needle in the Haystack: The Technology of Internet Search

51

Parting Thoughts

Page 52: Needle in the Haystack: The Technology of Internet Search

52

Parting Thoughts

Page 53: Needle in the Haystack: The Technology of Internet Search

53

“Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?”

T.S. Eliot, “Choruses from ‘The Rock’”, Selected Poems, NY: Harvest / Harcourt, 1962, p. 107.

Page 54: Needle in the Haystack: The Technology of Internet Search

54

Needle in the Haystack: The Technology of Internet Search

Thanks for Your Patience & Attention!
Questions?