The Search Engine Architecture CSCI 572: Information Retrieval and Search Engines Summer 2010


The Search Engine Architecture

CSCI 572: Information Retrieval and Search Engines

Summer 2010


May-20-10 CS572-Summer2010 CAM-2

Outline
• Introduction
• Google
  – The PageRank algorithm
  – The Google Architecture
    – Architectural components
    – Architectural interconnections
    – Architectural data structures
  – Evaluation of Google
• Summary


Problems with search engines circa the last decade
• Human maintenance
  – Subjective
    • Example: Ranking hits based on $$$
• Automated search engines
  – Quality of result
    • Neglect to take user’s context into account
• Searching process
  – High quality results aren’t always at the top of the list


The Typical Search Engine Process

[Flowchart: Search Engine Page, Filter Through Results, Select Result, Assess Page Quality; decision points "Exhausted Results" and "Good Match", each with Yes/No branches]

In what stages is the most time spent?


How to scale to modern times?
• Currently
  – Efficient index
  – Petabyte scale storage space
  – Efficient Crawling
  – Cost effectiveness of hardware
• Future
  – Qualitative context
    • Maintaining localization data
  – Perhaps send indexing to clients
  – Client computers help gather Google’s index in a distributed, decentralized fashion?


Google
• The whole idea is to keep up with the growth of the web
• Design Goals:
  – Remove Junk Results
  – Scalable document indices
• Use of link structure to improve quality filtering
• Use as an academic digital library
  – Provide search engine datasets
  – Search engine infrastructure and evolution


Google
• Archival of information
  – Use of compression
  – Efficient data structures
  – Proprietary file system
• Leverage of usage data
• PageRank algorithm
  – Sort of a “lineage” of a source of information
    • Citation graph


PageRank Algorithm
• Numerical method to calculate a page’s importance
  – This approach might well be followed by people doing research
• PageRank of a page A
  – With damping factor d
  – Where PR(X) = PageRank of page X
  – Where C(X) = the number of outgoing links from page X
  – Where T1…Tn is the set of pages that link to page A
  – PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• It’s actually a bit more complicated than it first looks
  – For instance, what’s PR(T1) and PR(T2) and so on?


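PageRank Algorithm
• An excellent explanation
  – http://www.iprcom.com/papers/pagerank/
• The PR(A) equation is often described as a probability distribution over web pages…
  – Because of the (1-d) term and the d*(PR…) term
  – With the (1-d) form used here, the PageRanks of all N web pages sum to N, so the average PageRank is 1 (assuming every page has at least one outgoing link)
  – Dividing the first term by N, i.e. using (1-d)/N, gives the normalized form whose PageRanks sum to 1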
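A quick check on that total, as a sketch: sum the PR(A) equation exactly as written (with the (1-d) term) over all N pages. This assumes every page has at least one outgoing link, which the slides do not state; S is notation introduced here for the total.

```latex
\begin{aligned}
S &= \sum_{A} PR(A)
   = N(1-d) + d \sum_{A} \sum_{T \to A} \frac{PR(T)}{C(T)} \\
  &= N(1-d) + d \sum_{T} C(T)\,\frac{PR(T)}{C(T)}
   = N(1-d) + d\,S \\
\Rightarrow\; S &= N
\end{aligned}
```

Each page T hands PR(T)/C(T) to exactly C(T) pages, which is why the double sum collapses to S. Under this form the average PageRank is therefore 1; repeating the same steps with (1-d)/N in place of (1-d) gives S = 1.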


PageRank: Example
• So, where do you start?
• It turns out that you can effectively “guess” what the PageRanks for the web pages are initially
  – In our example, guess 0 for all of the pages
• Then you run the PR function to calculate PR for all the web pages iteratively
• You do this until…
  – The page ranks for each web page stop changing in each iteration
  – They “settle down”
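The guess-then-iterate loop described above can be sketched for an arbitrary link graph. This is a minimal illustration, not Google's implementation: the `links` input format, the tolerance `tol`, the iteration cap, and the two-page graph are all assumptions made for the example, and every page that appears is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) until the values settle.

    links maps each page to the list of pages it links to (assumed format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # incoming[p] lists the pages T that link to p.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    pr = {p: 0.0 for p in pages}  # initial guess: 0 for every page
    for _ in range(max_iter):
        new_pr = {
            p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[p])
            for p in pages
        }
        # "Settled down" here means no page moved by more than tol.
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tol:
            break
    return pr


# Hypothetical two-page graph: x and y link only to each other.
ranks = pagerank({"x": ["y"], "y": ["x"]})
print(ranks)
```

Both pages in this symmetric graph settle near 1, which matches the intuition that neither is more important than the other.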


PageRank: Example

PR(a) = 1 - $damp + $damp * PR(c);

PR(b) = 1 - $damp + $damp * (PR(a)/2);

PR(c) = 1 - $damp + $damp * (PR(a)/2 + PR(b) + PR(d));

PR(d) = 1 - $damp;

The following slides show the iterative calculation that we would run
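A minimal Python sketch of that calculation for the four equations above. Read off the equations, the link structure is: a links to b and c, b links to c, c links to a, and d links to c. The damping value 0.85 is an assumption (the slides never state it, but it is consistent with PR(d) settling at 0.15), and each new value is used immediately within the same round, which is also an assumption about how the slide numbers were produced.

```python
DAMP = 0.85  # assumed damping factor; consistent with PR(d) = 0.15


def iterate(rounds):
    """Return the (a, b, c, d) values after each round, starting from all zeros."""
    a = b = c = d = 0.0
    rows = [(a, b, c, d)]
    for _ in range(rounds):
        # Updated values are used right away in the equations that follow.
        a = 1 - DAMP + DAMP * c
        b = 1 - DAMP + DAMP * (a / 2)
        c = 1 - DAMP + DAMP * (a / 2 + b + d)
        d = 1 - DAMP
        rows.append((a, b, c, d))
    return rows


rows = iterate(40)
for a, b, c, d in rows[:3] + rows[-1:]:
    print(f"a: {a:.5f} b: {b:.5f} c: {c:.5f} d: {d:.5f}")
print(f"Average pagerank = {sum(rows[-1]) / 4:.4f}")
```

Under these assumptions the early rounds agree with the slide values to within rounding, and the final values average to 1.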


PageRank Algorithm: First 18 iterations

a: 0.00000 b: 0.00000 c: 0.00000 d: 0.00000
a: 0.15000 b: 0.21375 c: 0.39544 d: 0.15000
a: 0.48612 b: 0.35660 c: 0.78721 d: 0.15000
a: 0.81913 b: 0.49813 c: 1.04904 d: 0.15000
a: 1.04169 b: 0.59272 c: 1.22403 d: 0.15000
a: 1.19042 b: 0.65593 c: 1.34097 d: 0.15000
a: 1.28982 b: 0.69818 c: 1.41912 d: 0.15000
a: 1.35626 b: 0.72641 c: 1.47136 d: 0.15000
a: 1.40065 b: 0.74528 c: 1.50626 d: 0.15000
a: 1.43032 b: 0.75789 c: 1.52959 d: 0.15000
a: 1.45015 b: 0.76632 c: 1.54518 d: 0.15000
a: 1.46341 b: 0.77195 c: 1.55560 d: 0.15000
a: 1.47226 b: 0.77571 c: 1.56257 d: 0.15000
a: 1.47818 b: 0.77823 c: 1.56722 d: 0.15000
a: 1.48214 b: 0.77991 c: 1.57033 d: 0.15000
a: 1.48478 b: 0.78103 c: 1.57241 d: 0.15000
a: 1.48655 b: 0.78178 c: 1.57380 d: 0.15000
a: 1.48773 b: 0.78228 c: 1.57473 d: 0.15000

Still changing too much


PageRank: next 13 iterations

a: 1.48852 b: 0.78262 c: 1.57535 d: 0.15000
a: 1.48904 b: 0.78284 c: 1.57576 d: 0.15000
a: 1.48940 b: 0.78299 c: 1.57604 d: 0.15000
a: 1.48963 b: 0.78309 c: 1.57622 d: 0.15000
a: 1.48979 b: 0.78316 c: 1.57635 d: 0.15000
a: 1.48990 b: 0.78321 c: 1.57643 d: 0.15000
a: 1.48997 b: 0.78324 c: 1.57649 d: 0.15000
a: 1.49001 b: 0.78326 c: 1.57652 d: 0.15000
a: 1.49004 b: 0.78327 c: 1.57655 d: 0.15000
a: 1.49007 b: 0.78328 c: 1.57656 d: 0.15000
a: 1.49008 b: 0.78328 c: 1.57657 d: 0.15000
a: 1.49009 b: 0.78329 c: 1.57658 d: 0.15000
a: 1.49009 b: 0.78329 c: 1.57659 d: 0.15000

Starting to stabilize


PageRank: Last 9 iterations

a: 1.49010 b: 0.78329 c: 1.57659 d: 0.15000
a: 1.49010 b: 0.78329 c: 1.57659 d: 0.15000
a: 1.49010 b: 0.78329 c: 1.57659 d: 0.15000
a: 1.49010 b: 0.78329 c: 1.57659 d: 0.15000
a: 1.49011 b: 0.78329 c: 1.57660 d: 0.15000
a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000
a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000
a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000
a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000

Average pagerank = 1.0000

Stabilized


Google Architecture
• Key components
• Interconnections
• Data structures
• A reference architecture for search engines?


Google Data Components
• BigFiles
• Repository
  – Use zlib to compress
• Lexicon
  – Word base
• Hit Lists
  – Word->document ID map
• Document Indexing
  – Forward Index
  – Inverted Index


Google File System (GFS)
• BigFiles
  – A.k.a. Google’s Proprietary Filesystem
  – 64-bit addressable
  – Compression
  – Conventional operating systems don’t suffice
    • No explanation of why?
  – GFS: http://labs.google.com/papers/gfs.html


Google Key Data Components
• Repository
  – Stores full text of web pages
  – Use zlib to compress
    • zlib compresses less tightly than bzip
    • Tradeoff of time complexity versus space efficiency
    • bzip is more space efficient, but slower
  – Why is it important to compress the pages?
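One way to get a feel for the time-versus-space tradeoff is to compress the same page with both codecs using Python's standard `zlib` and `bz2` modules. This is only a sketch: the sample page below is a made-up stand-in, and sizes and timings on real crawled pages will differ.

```python
import bz2
import time
import zlib

# Hypothetical stand-in for a crawled page; real pages vary widely.
page = (
    "<html><body>"
    + "<p>The quick brown fox jumps over the lazy dog.</p>" * 200
    + "</body></html>"
).encode("utf-8")


def timed(compress, data):
    """Return (compressed bytes, elapsed seconds) for one compression call."""
    start = time.perf_counter()
    out = compress(data)
    return out, time.perf_counter() - start


zlib_bytes, zlib_secs = timed(zlib.compress, page)
bz2_bytes, bz2_secs = timed(bz2.compress, page)

print(f"original: {len(page)} bytes")
print(f"zlib:     {len(zlib_bytes)} bytes in {zlib_secs:.6f} s")
print(f"bz2:      {len(bz2_bytes)} bytes in {bz2_secs:.6f} s")

# Both codecs are lossless: the stored page can be recovered exactly.
restored_from_zlib = zlib.decompress(zlib_bytes)
restored_from_bz2 = bz2.decompress(bz2_bytes)
```

Running this over a representative sample of pages, rather than one synthetic string, is what would actually inform the choice between the two.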


Google Lexicon
• Lexicon
  – Contains 14 million words
  – Implemented as a hash table of pointers to words
  – Full explanation beyond the scope of this discussion
• Why is it important to have a lexicon?
  – Tokenization
  – Analysis
  – Language Identification
  – SPAM
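As a rough illustration of what a lexicon does, the sketch below maps each distinct word to a small integer wordID using a Python dict (itself a hash table). The class and method names are hypothetical, and lowercasing every word is a simplifying assumption; the slides do not describe how Google normalizes words.

```python
class Lexicon:
    """Maps each distinct word to an integer wordID (illustrative only)."""

    def __init__(self):
        self._word_to_id = {}   # hash table: word -> wordID
        self._id_to_word = []   # wordID -> word

    def word_id(self, word):
        """Return the wordID for word, assigning a new one if it is unseen."""
        key = word.lower()
        if key not in self._word_to_id:
            self._word_to_id[key] = len(self._id_to_word)
            self._id_to_word.append(key)
        return self._word_to_id[key]

    def lookup(self, word):
        """Return the wordID, or None if the word is unknown (query time)."""
        return self._word_to_id.get(word.lower())

    def word(self, word_id):
        """Return the word stored for a wordID."""
        return self._id_to_word[word_id]


lexicon = Lexicon()
ids = [lexicon.word_id(w) for w in "Bill Clinton bill".split()]
print(ids)
```

Working with integer wordIDs instead of strings keeps the index structures compact and makes comparisons cheap.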


Mapping queries to hits
• HitLists
  – wordID->(docID,position,font,capitalization) mapping
  – Takes up most of the space in the forward and inverted indices
  – Types: Fancy, Plain, Anchor
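Because hit lists take up most of the index space, a compact per-hit encoding matters. Here is a minimal sketch that packs one hit into 16 bits. The field widths (1 capitalization bit, 3 font bits, 12 position bits) and the names `pack_hit` and `unpack_hit` are assumptions for illustration, not something the slides specify, and the docID is assumed to be stored once per document rather than inside every hit.

```python
MAX_POSITION = 4095  # largest value that fits in the assumed 12 position bits


def pack_hit(capitalized, font, position):
    """Pack (capitalization, font, position) into a single 16-bit integer."""
    if not 0 <= font < 8:
        raise ValueError("font must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, MAX_POSITION)  # clamp positions that are too large
    return (int(capitalized) << 15) | (font << 12) | position


def unpack_hit(hit):
    """Recover (capitalized, font, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_hit(capitalized=True, font=3, position=42)
print(hit, unpack_hit(hit))
```

Two bytes per hit instead of a full record per occurrence is the kind of saving that adds up across millions of documents.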


Document Indexing
• Document Indexing
  – Forward Index
    • docIDs->wordIDs
    • Partially sorted
    • Duplicated doc IDs
      – Makes it easier for final indexing and coding
  – Inverted Index
    • wordIDs->docIDs
    • 2 sets of inverted barrels
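The relationship between the two indices can be sketched in a few lines: the forward index goes docID -> wordIDs, and inverting it yields wordID -> docIDs. The tiny index below, the `(wordID, position)` pairs, and the `invert` helper are made up for illustration; barrels and the on-disk layout are left out.

```python
from collections import defaultdict

# Hypothetical forward index: docID -> list of (wordID, position) occurrences.
forward_index = {
    1: [(10, 0), (11, 1)],
    2: [(11, 0), (12, 1), (10, 2)],
}


def invert(forward):
    """Build wordID -> list of (docID, position) from a forward index."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):  # visit docs in docID order
        for word_id, position in forward[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(inverted)


inverted_index = invert(forward_index)
print(inverted_index)
```

Visiting documents in docID order means each word's postings come out already sorted by docID, which is convenient when intersecting postings for multi-word queries.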


Crawling and Indexing
• Crawling
  – Distributed, Parallel
  – Social issues
    • Bringing down web servers: politeness
    • Copyright issues
    • Text versus code
• Indexing
  – Developed their own web page parser
  – Barrels
    • Distribution of compressed documents
  – Sorting


Google’s Query Evaluation
• 1: Parse the query
• 2: Convert words into WordIDs
  – Using Lexicon
• 3: Select the barrels that contain documents which match the WordIDs
• 4: Search through documents in the selected barrels until one is discovered that matches all the search terms
• 5: Compute that document’s rank (using PageRank as one of the components)
• 6: Repeat from step 4 until no more documents are found and we’ve gone through all the barrels
• 7: Sort the set of returned documents by document rank and return the top k documents
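The steps above can be sketched in simplified form. This is not Google's implementation: barrels are collapsed into a single in-memory inverted index, the lexicon is a plain dict, and `rank` is a hypothetical precomputed score per document standing in for the combined rank (of which PageRank would be one component).

```python
def evaluate_query(query, lexicon, inverted_index, rank, k=10):
    """Return up to k docIDs that contain every query word, best rank first."""
    # Steps 1-2: parse the query and convert words into wordIDs.
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # an unknown word cannot match any document
        word_ids.append(lexicon[word])
    if not word_ids:
        return []

    # Steps 3-4 and 6: collect the documents that match all the search terms.
    doc_sets = [set(inverted_index.get(wid, ())) for wid in word_ids]
    matches = set.intersection(*doc_sets)

    # Steps 5 and 7: rank each match, sort by rank, return the top k.
    return sorted(matches, key=lambda doc: rank.get(doc, 0.0), reverse=True)[:k]


# Hypothetical toy data.
lexicon = {"bill": 0, "clinton": 1}
inverted_index = {0: [1, 2, 3], 1: [2, 3]}
rank = {1: 0.2, 2: 0.5, 3: 0.9}

results = evaluate_query("Bill Clinton", lexicon, inverted_index, rank)
print(results)  # [3, 2]
```

Document 1 contains only "bill", so it drops out at the intersection step; the remaining two are ordered by their rank score.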


Google Evaluation
• Performed by generating numerical results
  – Query satisfaction
    • Bill Clinton Example
  – Storage requirements
    • 55GB Total
  – System Performance
    • 9 days to download 26 million pages
    • 63 hours to get the final 11 million (at the time)
  – Search Performance
    • Between 1 and 10 seconds for most queries (at the time)
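To put the crawl figures in perspective, they can be converted to pages per second. The arithmetic below uses only the numbers on the slide; the function name is made up for the example.

```python
def pages_per_second(pages, seconds):
    """Average crawl throughput."""
    return pages / seconds


overall = pages_per_second(26_000_000, 9 * 24 * 3600)   # 26 million pages in 9 days
final_stretch = pages_per_second(11_000_000, 63 * 3600)  # final 11 million in 63 hours

print(f"overall:       {overall:.1f} pages/s")        # about 33.4
print(f"final stretch: {final_stretch:.1f} pages/s")  # about 48.5
```

The final stretch ran noticeably faster than the nine-day average.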


Wrapup
• Loads of future work
  – Even at that time, there were issues of:
    • Information extraction from semi-structured sources (such as web pages)
      – Still an active area of research
    • Search engines as a digital library
      – What services, APIs and toolkits should a search engine provide?
      – What storage methods are the most efficient?
      – From 2005 to 2010 to ???
    • Enhancing metadata
      – Automatic markup and generation
      – What are the appropriate fields?
    • Automatic Concept Extraction
      – Present the Searcher with a context
    • Searching languages: beyond context-free queries
    • Other types of search: Facet, GIS, etc.


The Future?
• User poses keyword query search
  – “Google-like” result page comes back
  – Along with each link returned, there will be
    • A “Concept Map” outlining, using extraction methods, what the “real” content of the document is
      – This basically allows you to “visually” see what the page rank is
  – Discover information visually
  – Existing evidence that this works well
    • http://vivisimo.com/
    • Carrot2/3 clustering


Concept Map

[Figure: concept map for Chris’s Homepage (http://sunset.usc.edu/~mattmann) with nodes Data, Publications, Software, Data Grid, Science Data Systems, Software Architecture]