The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page


Page 1: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page


Page 2: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

The Original Google Paper

Google is a common spelling of googol, or 10^100, which fit well with the authors’ goal of building very large-scale search engines.

Page 3: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Outline

Design goals

System features

System anatomy

Results and performance

Paper analysis

Page 4: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Design Goals

Page 5: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Design Goals

1. Scale with the rapid growth of the web

[Chart: webpages indexed and queries per day, 1994 to 2000]

Year    Webpages indexed    Queries/day
1994    110,000             1,500
1997    100,000,000         20,000,000
2000    1,000,000,000       100,000,000

Page 6: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Design Goals

2. Improved Search Quality

The number of documents on the web is increasing rapidly, but users’ ability to look through them is not.

Search engines of the time returned many “junk” results with little relevance. (Note: this refers to 1998.)

3. Academic Search Engine Research

Push more development and understanding into the academic realm.

Build systems that a reasonable number of people can actually use.

Build an architecture that supports novel research activities on large-scale web data.

Page 7: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

System Features

Page 8: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

System Features

1. Makes use of the link structure of the Web to calculate a quality ranking for each page, called the PageRank.

A probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.

It considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.

Page 9: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

PageRank: Bringing Order to the Web

PR(A): PageRank of a webpage A

PR(Ti): PageRank of a webpage Ti pointing to A

C(Ti): number of outbound links on webpage Ti

L(A): set of webpages linking to A

d: damping factor, a value between 0 and 1; the probability that a random surfer keeps following links rather than stopping

Note that PageRanks form a probability distribution over webpages, so the PageRanks of all webpages sum to 1.
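For reference, the symbols above fit the PageRank formula as stated in the paper, shown first below. The second line is the commonly used normalized variant, where N (an assumed symbol, not defined above) is the total number of pages; it is this variant whose values sum to 1.

```latex
PR(A) = (1 - d) + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}

PR(A) = \frac{1 - d}{N} + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}
```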

Page 10: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

PageRank: Bringing Order to the Web

Assume a universe of 4 webpages: A, B, C, and D

Because a random surfer will eventually stop clicking, we introduce a damping factor, d, which is generally set to 0.85.
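A minimal sketch of the iteration for a four-page universe follows. The link graph is hypothetical, chosen only for illustration, and the code assumes the normalized (1 - d)/N form of the formula so that the ranks sum to 1. The function name `pagerank` and its parameters are likewise assumptions, not anything taken from the paper.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d)/N + d * sum(PR(T)/C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to. Every page is
    assumed to have at least one outbound link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each page T that links here contributes PR(T) / C(T).
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) / n + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical link structure, chosen only for illustration.
example_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
example_ranks = pagerank(example_links)
```

In this made-up graph, D has no inbound links, so it keeps only the baseline (1 - d)/N, while C, which three pages link to, ends up with the highest rank.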

Page 11: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

System Features

2. Makes use of the anchor text of links on webpages, e.g. <a href=http://www.yahoo.com>Yahoo!</a>

The text of a link is associated not only with the webpage it appears on but also with the webpage it points to, and it sometimes describes the target more accurately than the target page itself does.

Anchors may exist for documents which generally cannot be indexed by text-based search engines, such as images, programs, and databases.
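The idea of crediting anchor text to the link target can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not the paper's implementation: the class `AnchorCollector`, the function `index_anchor_text`, and the example page URL are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # list of (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """Map each link target to the anchor texts that point at it."""
    anchor_index = defaultdict(list)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.anchors:
            anchor_index[href].append(text)
    return anchor_index


# Hypothetical one-page crawl containing the link from the slide.
pages = {
    "http://example.com/links.html":
        '<p>Try <a href="http://www.yahoo.com">Yahoo!</a></p>',
}
anchor_index = index_anchor_text(pages)
```

The text "Yahoo!" is filed under the target URL, so the target can be matched by that word even if its own content was never fetched or cannot be indexed as text.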

Page 12: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

System Features

3. Uses location information for all hits and thus makes extensive use of proximity in search.

4. Keeps track of the visual presentation of text on webpages, such as font size. Words in a larger or bolder font are given more importance.

5. Stores the complete raw HTML of webpages in the repository.
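To illustrate why storing word positions (item 3) enables proximity search, the sketch below computes the smallest gap between occurrences of two query words in one document. It is only the raw distance from which a proximity score could be derived, not the paper's scoring function, and the name `min_distance` is an assumption.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of word A and any of word B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


# Word positions are made up: in the first document the two words are adjacent.
adjacent = min_distance([4, 90], [5, 200])
far_apart = min_distance([4], [50])
```

A document where the two words sit next to each other gets a much smaller distance than one where they merely co-occur.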

Page 13: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

System Anatomy

Page 14: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page
Page 15: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Major Data Structures

1. BigFiles

Virtual files spanning multiple file systems, addressable by 64-bit integers.

2. Repository

Contains the full compressed HTML of all pages.

Pages are stored one after another, each prefixed with its docID, length, and URL.

Compressed with zlib, chosen for its speed over the higher compression ratio of bzip.
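A minimal sketch of such a repository record, using Python's standard `struct` and `zlib` modules. The exact header layout is an assumption made for illustration (8-byte docID, 2-byte URL length, 4-byte compressed page length), as are the function names.

```python
import struct
import zlib

# Assumed header layout: docID (8 bytes), URL length (2 bytes),
# compressed page length (4 bytes), all big-endian with no padding.
HEADER = struct.Struct(">QHI")


def pack_record(doc_id, url, html):
    """Build one record: header, then the URL, then the zlib-compressed page."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(compressed)) + url_bytes + compressed


def read_records(blob):
    """Yield (doc_id, url, html) for records stored back to back in `blob`."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, page_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + page_len]).decode("utf-8")
        offset += page_len
        yield doc_id, url, html


# Two made-up pages stored one after another.
repository = (
    pack_record(1, "http://a.example/", "<html>first</html>")
    + pack_record(2, "http://b.example/", "<html>second</html>")
)
records = list(read_records(repository))
```

Because every record carries its own lengths, the repository can be scanned sequentially without any other index.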

Page 16: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Major Data Structures

3. Document Index

Keeps information about each document.

It is a fixed-width index, ordered by docID.

Stores the document status, a pointer into the repository, and a checksum.

If the document is indexed, it points to a variable-width file, docinfo, which contains the URL and title. Otherwise it points to the URLlist, which contains only the URL.

4. Lexicon

Contains a list of null-separated words (about 14 million) and a hash table of pointers.

Page 17: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Major Data Structures

5. Hit Lists

A list of occurrences of a particular word in a particular document, including position, font, and capitalization information.

Hit lists account for most of the space used in both the forward and the inverted indices.

6. Forward Index

Stored in a number of barrels.

If a document contains words that fall into a particular barrel, the docID is recorded in the barrel, followed by a list of wordIDs with their hit lists.
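Because hit lists (item 5) dominate index size, the paper packs a plain hit into two bytes: one capitalization bit, three bits of font size, and twelve bits of word position. The sketch below shows one way to do that packing. The function names, the order of the fields within the 16 bits, and the clamping of out-of-range values are assumptions.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one plain hit into a 16-bit integer: 1 + 3 + 12 bits."""
    font_size = max(0, min(font_size, 6))   # 7 (0b111) is reserved for fancy hits
    position = max(0, min(position, 4095))  # 12 bits of word position
    return (int(bool(capitalized)) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Return (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_hit(True, 3, 42)
```

Keeping each hit this small is what makes it affordable to store position and presentation data for every word occurrence.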

Page 18: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Major Data Structures

7. Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter.
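The relationship between the two indices can be sketched as a sort: the forward barrel is keyed by docID, and regrouping its postings by wordID yields the inverted barrel. The in-memory dictionaries, IDs, and hit values below are made up for illustration; the real barrels are on-disk structures.

```python
from collections import defaultdict

# Forward barrel: docID -> {wordID: hit list}. IDs and hits are made up.
forward_barrel = {
    1: {7: [3, 18], 9: [5]},
    2: {7: [1]},
    3: {9: [2, 4], 12: [0]},
}


def invert(forward):
    """Regroup (docID, wordID, hits) postings by wordID, as the sorter does."""
    postings = [
        (word_id, doc_id, hits)
        for doc_id, words in forward.items()
        for word_id, hits in words.items()
    ]
    postings.sort(key=lambda p: (p[0], p[1]))  # by wordID, then docID
    inverted = defaultdict(list)
    for word_id, doc_id, hits in postings:
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


inverted_barrel = invert(forward_barrel)
```

After inversion, looking up a wordID gives its doclist directly, which is what query processing needs.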

Page 19: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Crawling the Web

1. Several distributed crawlers.

A URLserver serves lists of URLs to the crawlers.

Each crawler keeps ~300 connections open at once.

At peak, a system of 4 crawlers can crawl ~100 pages per second, or roughly 600 KB of data per second.

Each crawler maintains its own DNS cache for fast lookup.

2. The parser handles a huge array of possible errors, including HTML errors, non-ASCII characters, and HTML tags nested hundreds deep.

Page 20: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Indexing the Web

3. Indexing Documents into Barrels

After each document is parsed, it is encoded into a number of barrels.

Every word is converted into a wordID using an in-memory hash table, the lexicon.

Once words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels.

4. Sorting

The sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel.

Page 21: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Searching

1. Parse the query.

2. Convert words into wordIDs.

3. Seek to the start of the doclist in the short barrel for every word.

4. Scan through the doclists until there is a document that matches all the search terms.

5. Compute the rank of that document for the query.

6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.

7. If we are not at the end of any doclist, go to step 4.

8. Sort the documents that have matched by rank and return the top k.
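The steps above can be sketched in a few lines. This is a simplification under stated assumptions: barrels are in-memory dictionaries mapping wordID to {docID: hits}, doclists are intersected as sets rather than scanned in parallel, and `toy_rank` is a placeholder scoring function. All names and data are hypothetical.

```python
import heapq


def matching_docs(barrel, word_ids):
    """Documents that appear in the doclist of every query word."""
    doclists = [barrel.get(word_id, {}) for word_id in word_ids]
    if not doclists:
        return set()
    common = set(doclists[0])
    for doclist in doclists[1:]:
        common &= set(doclist)
    return common


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                         # step 1
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]                # step 2
    scores = {}
    for barrel in (short_barrel, full_barrel):            # steps 3 and 6
        for doc_id in matching_docs(barrel, word_ids):    # step 4
            if doc_id not in scores:
                scores[doc_id] = rank(doc_id, word_ids)   # step 5
    return heapq.nlargest(k, scores, key=scores.get)      # step 8


lexicon = {"bill": 1, "clinton": 2}
short_barrel = {1: {10: [0]}, 2: {10: [1]}}                    # title and anchor hits
full_barrel = {1: {10: [4], 20: [7]}, 2: {10: [5], 20: [30]}}  # all hits


def toy_rank(doc_id, word_ids):
    # Placeholder score: total number of hits for the query words.
    return sum(
        len(barrel.get(w, {}).get(doc_id, []))
        for barrel in (short_barrel, full_barrel)
        for w in word_ids
    )


results = search("Bill Clinton", lexicon, short_barrel, full_barrel, toy_rank)
```

Document 10 matches in the short barrel as well as the full one, so it outranks document 20, which matches only in the full text.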

Page 22: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Results and Performance

Page 23: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Results and Performance

Users’ qualitative assessment of the search results has generally been positive.

The current version of Google answers most queries in between 1 and 10 seconds.

Because Google takes the proximity of word occurrences into account, its results tend to be more relevant than those of search engines that simply return pages containing all the query words. (E.g. a search for ‘bill clinton’ gives lower importance to results where ‘bill’ and ‘clinton’ appear independently.)

Page 24: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Future Work

Search times in the current version of Google are dominated by disk I/O. Planned improvements include query caching and hardware, software, and algorithmic optimizations.

Improve search efficiency and scale quickly to ~100 million web pages.

Develop Google into a large-scale research tool and resource for searchers and researchers.

Page 25: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Analyses of the Research Paper

Pros

One of the first descriptions of the PageRank algorithm, which changed how search engines rank and index the web.

Ranking pages by the citation graph and anchor text closely resembles how users themselves judge websites.

Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

The paper states that Google does not compromise PageRank for monetary gain, which gives its search results more credibility. This holds true to date.

Page 26: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Analyses of the Research Paper

Cons

One of the first flaws found in the ranking scheme was the “Google bomb”: because of PageRank and the use of anchor text, a page is ranked higher if the sites that link to it use consistent anchor text.

A Google bomb is created when a large number of sites link to a page in this manner.

Ranking quality is insufficient when only PageRank and anchor text are used. (Google today uses more than 200 different parameters to judge the quality of a webpage.)

Page 27: The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

Thank You

Presented by: Nilay Khandelwal