The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

The Original Google Paper

Google is a common spelling of googol, or 10^100, which fit well with the authors’ goal of building very large-scale search engines.

Outline

Design goals

System features

System anatomy

Results and performance

Paper analysis

Design Goals

1. Scale with the rapid growth of the web

[Chart: growth of the web, 1994 to 2000; the 2000 figures are projections]

Year    Webpages Indexed    Queries/day
1994    110,000             1,500
1997    100,000,000         20,000,000
2000    1,000,000,000       100,000,000

Design Goals

2. Improved Search Quality

The number of documents on the web is increasing rapidly, but users’ ability to look at them is not.

Current search engines return lots of “junk” results, with little relevance. (Note: We’re talking about the year 1998)

3. Academic Search Engine Research

Push more development and understanding into the academic realm.

Build a system that a reasonable number of people can actually use.

Build an architecture to support novel research activities on large-scale web data.

System Features

1. Makes use of the link structure of the Web to calculate a quality ranking for each page, called its PageRank.

A probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.

It considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.

PageRank: Bringing Order to the Web

PR(A): PageRank of a webpage A

PR(Ti): PageRank of a webpage Ti pointing to A

C(Ti): Number of outbound links on webpage Ti

L(A): Set of webpages linking to A

d: damping factor, a value between 0 and 1; it is the probability that a random surfer keeps following links (with probability 1 - d the surfer stops clicking and jumps to a random page)

Note that PageRanks form a probability distribution over webpages, so the PageRanks of all webpages sum to 1.
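Using the symbols defined above, the PageRank formula can be written as below. The first form is the one given in the original paper. The second, normalized form is an assumption about what this slide intends: it introduces N, the total number of webpages (a symbol not defined above), and it is the variant under which the PageRanks sum to 1, provided every page has at least one outbound link.

```latex
% Form given in the original paper
PR(A) = (1 - d) + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}

% Normalized form (N = total number of webpages)
PR(A) = \frac{1 - d}{N} + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}
```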

PageRank: Bringing Order to the Web

Assume a universe of 4 webpages: A, B, C, and D

Because a random surfer will eventually stop clicking, the model includes a damping factor, d, which is usually set to 0.85.
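A minimal sketch of how PageRank could be computed for a four-page universe by repeated iteration. The link graph in `LINKS` is hypothetical (it is not the graph from the slide), and the code assumes the normalized form of the formula, with N the total number of pages, so that the scores sum to 1.

```python
# Hypothetical link graph for the four pages: each key links to the pages
# in its list. This is an illustrative assumption, not the slide's graph.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
}


def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d)/N + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}            # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: (1.0 - d) / n for p in pages}
        for src, targets in links.items():
            share = d * pr[src] / len(targets)  # each outbound link gets an equal share
            for dst in targets:
                new_pr[dst] += share
        pr = new_pr
    return pr


ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.4f}")
```

In this assumed graph, D has no inbound links, so its score settles at the floor value (1 - d)/N, while pages that receive links from well-ranked pages score higher.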

System Features

2. Makes use of the anchor text of links on webpages, e.g. <a href=http://www.yahoo.com>Yahoo!</a>

The text of a link is associated not only with the webpage it is on, but also with the webpage it points to, about which it sometimes gives more relevant information.

Anchors may exist for documents which generally cannot be indexed by text-based search engines, such as images, programs, and databases.
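The idea of crediting anchor text to the page a link points to can be sketched as follows. This is an illustration using Python's standard `html.parser`, not the parser described in the paper; `AnchorCollector`, `index_anchors`, and the source page URL are hypothetical names introduced for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # list of (href, text) pairs found so far
        self._href = None     # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchors(pages):
    """Map each target URL to the (source URL, anchor text) pairs pointing at it."""
    anchor_index = defaultdict(list)
    for source_url, html in pages.items():
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.anchors:
            anchor_index[href].append((source_url, text))
    return anchor_index


# Hypothetical source page containing the example link.
pages = {"http://example.com/portal": '<p>Try <a href="http://www.yahoo.com">Yahoo!</a></p>'}
anchor_index = index_anchors(pages)
print(anchor_index["http://www.yahoo.com"])
```

Because the index is keyed by the target URL, a target that cannot itself be parsed as text (an image, say) still ends up with searchable words attached to it.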

System Features

3. Uses location information for all hits, and thus makes extensive use of proximity in search.

4. Keeps track of the visual presentation of text on webpages, such as font size. Words in a larger or bolder font are given more weight.

5. Stores complete raw HTML of webpages in repository.
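To illustrate why storing the location of every hit matters, the sketch below computes how close two query words come to each other within a document. The function names and the scoring rule (the reciprocal of the smallest gap) are assumptions made for illustration; they are not the ranking function Google uses.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of word A and any occurrence of word B.

    Both inputs are sorted lists of word positions within one document.
    """
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        # Advance whichever pointer is behind to look for a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_score(positions_a, positions_b):
    """Hypothetical score: 1.0 for adjacent words, lower as they drift apart."""
    return 1.0 / max(1, min_distance(positions_a, positions_b))


# Document 1 contains the two words side by side; document 2 has them far apart.
adjacent = proximity_score([4, 90], [5])
scattered = proximity_score([3], [250])
print(adjacent, scattered)
```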

System Anatomy

Major Data Structures

1. BigFiles

Virtual files spanning multiple file systems and addressable by 64-bit integers.

2. Repository

Contains the full compressed HTML of every page.

Pages are stored one after another, each prefixed with its docID, length, and URL.

Compressed with zlib, chosen for its speed, rather than bzip, which offers a higher compression ratio.
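A minimal sketch of a repository laid out as described above: records stored back to back, each carrying a docID, lengths, the URL, and zlib-compressed HTML. The exact header layout (`HEADER`) and the function names are assumptions; the slide does not specify field widths.

```python
import struct
import zlib

# Assumed header layout: docID as an unsigned 64-bit integer, then the URL
# length and the compressed-page length as unsigned 32-bit integers.
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Build one repository record: header, URL bytes, zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(compressed)) + url_bytes + compressed


def read_records(repository):
    """Walk records stored one after another and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(repository):
        doc_id, url_len, page_len = HEADER.unpack_from(repository, offset)
        offset += HEADER.size
        url = repository[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(repository[offset:offset + page_len]).decode("utf-8")
        offset += page_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://a.example/", "<html>first</html>")
    + pack_record(2, "http://b.example/", "<html>second</html>")
)
records = list(read_records(repository))
print(records)
```

Because every record states its own lengths, the whole repository can be rebuilt into the other data structures with a single sequential pass.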

Major Data Structures

3. Document Index

Keeps information about each document.

It is a fixed-width index, ordered by docID.

Stores the document status, a pointer into the repository, and a checksum.

If the document has been crawled, the entry points to a variable-width file, docinfo, which contains its URL and title. Otherwise it points to the URLlist, which contains only the URL.

4. Lexicon

Contains a list of null-separated words (about 14 million) and a hash table of pointers.
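The benefit of a fixed-width document index ordered by docID is that a lookup is a single offset computation. The sketch below shows this with an in-memory byte array; the field widths in `RECORD` and the function names are assumptions made for illustration.

```python
import struct

# Assumed record layout: status (1 byte), repository pointer (64-bit),
# checksum (32-bit), pointer into docinfo or the URLlist (64-bit).
RECORD = struct.Struct("<BQIQ")


def write_entry(index, doc_id, status, repo_ptr, checksum, info_ptr):
    """Store an entry in the slot determined by its docID."""
    start = doc_id * RECORD.size
    if len(index) < start + RECORD.size:
        # Grow the index with zero bytes so the slot exists.
        index.extend(bytes(start + RECORD.size - len(index)))
    RECORD.pack_into(index, start, status, repo_ptr, checksum, info_ptr)


def read_entry(index, doc_id):
    """Fetch an entry directly from offset docID * record size."""
    return RECORD.unpack_from(index, doc_id * RECORD.size)


index = bytearray()
write_entry(index, 0, 1, 0, 0xDEADBEEF, 0)
write_entry(index, 2, 1, 4096, 0x12345678, 128)
print(read_entry(index, 2))
```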

Major Data Structures

5. Hit Lists

A list of occurrences of a particular word in a particular document including position, font, and capitalization information.

Hit lists account for most of the space used in both the forward and the inverted indices.

6. Forward Index

Stored in a number of barrels.

If a document contains words that fall into a particular barrel, the docID is recorded in that barrel, followed by a list of wordIDs with their hit lists.
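Since hit lists dominate the size of both indices, each hit needs a compact encoding. The sketch below packs one hit into 16 bits, following the plain-hit layout described in the original paper (one capitalization bit, three bits of font size, twelve bits of position). The order of the fields within the 16 bits and the clamping of large positions to 4095 are assumptions.

```python
POSITION_BITS = 12
FONT_BITS = 3
MAX_POSITION = (1 << POSITION_BITS) - 1   # 4095


def pack_hit(capitalized, font_size, position):
    """Pack one hit into a 16-bit integer: [cap:1][font:3][position:12]."""
    if not 0 <= font_size < (1 << FONT_BITS):
        raise ValueError("font_size must fit in 3 bits")
    position = min(position, MAX_POSITION)   # clamp positions that do not fit
    return (int(capitalized) << 15) | (font_size << POSITION_BITS) | position


def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> POSITION_BITS) & 0b111, hit & MAX_POSITION


hit = pack_hit(True, 3, 42)
print(hit, unpack_hit(hit))
```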

Major Data Structures

7. Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter.
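The relationship between the two indices can be sketched as a regrouping step: the same (docID, wordID, hit list) entries, reordered by wordID. The in-memory dictionaries and the name `invert` are illustrative assumptions; the real sorter works on barrels stored on disk.

```python
from collections import defaultdict

# Hypothetical forward barrel: docID -> list of (wordID, hit list) pairs.
# Hit lists are shown here as plain word positions.
forward_barrel = {
    7: [(101, [1, 15]), (205, [2])],
    3: [(205, [4, 9]), (330, [5])],
}


def invert(forward):
    """Regroup a forward barrel by wordID, producing wordID -> doclist."""
    postings = [
        (word_id, doc_id, hits)
        for doc_id, entries in forward.items()
        for word_id, hits in entries
    ]
    postings.sort(key=lambda p: (p[0], p[1]))   # sort by wordID, then docID
    inverted = defaultdict(list)
    for word_id, doc_id, hits in postings:
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


inverted_barrel = invert(forward_barrel)
print(inverted_barrel)
```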

Crawling the Web

1. Several distributed crawlers.

A URLserver serves lists of URLs to the crawlers.

Each crawler keeps ~300 connections open at once.

At peak, a system of 4 crawlers can crawl ~100 pages/sec, or ~600 KB/sec of data.

Each crawler maintains its own DNS cache for fast lookups.

2. The parser must handle a huge array of possible errors, including HTML errors, non-ASCII characters, and HTML tags nested hundreds deep.

Indexing the Web

3. Indexing Documents into Barrels

After each document is parsed, it is encoded into a number of barrels.

Every word is converted into a wordID using an in-memory hash table – the lexicon.

Once words are converted into wordIDs, their occurrences in the current document are translated into hit lists and are written into the forward barrels.

4. Sorting

The sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel.
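Step 3 above (indexing documents into barrels) can be sketched as follows. The barrel count, the wordID range per barrel, the tokenizer, and the use of bare word positions as hits are all simplifying assumptions; `word_id_for` and `index_document` are hypothetical names.

```python
import re

NUM_BARRELS = 4            # assumed; kept small for the example
WORDS_PER_BARREL = 1000    # assumed wordID range covered by each barrel

lexicon = {}               # in-memory hash table: word -> wordID
forward_barrels = [dict() for _ in range(NUM_BARRELS)]


def word_id_for(word):
    """Look up a word's wordID, assigning the next free ID to unseen words."""
    return lexicon.setdefault(word, len(lexicon))


def index_document(doc_id, text):
    """Convert words to wordIDs, build hit lists, and write them to forward barrels."""
    hit_lists = {}         # wordID -> list of positions in this document
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        hit_lists.setdefault(word_id_for(word), []).append(position)
    for word_id, hits in hit_lists.items():
        barrel = forward_barrels[(word_id // WORDS_PER_BARREL) % NUM_BARRELS]
        barrel.setdefault(doc_id, []).append((word_id, hits))


index_document(1, "the web is large and the web grows")
index_document(2, "search the web")
print(forward_barrels[0])
```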

Searching

1. Parse the query.

2. Convert words into wordIDs.

3. Seek to the start of the doclist in the short barrel for every word.

4. Scan through the doclists until there is a document that matches all the search terms.

5. Compute the rank of that document for the query.

6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.

7. If we are not at the end of any doclist, go to step 4.

8. Sort the documents that have matched by rank and return the top k.
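The query steps above can be approximated in a few lines. This sketch simplifies in two ways that should be read as assumptions: it uses set intersection in place of a merged scan over sorted doclists, and it always consults the full barrel after the short one. The barrel layout (wordID -> {docID: hit positions}) and the ranking function `count_hits` are hypothetical.

```python
import heapq


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    """Simplified version of the query steps; returns the top-k docIDs."""
    words = query.lower().split()                     # parse the query
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]            # convert words into wordIDs

    scores = {}
    for barrel in (short_barrel, full_barrel):        # short barrel first, then full
        doclists = [barrel.get(w, {}) for w in word_ids]
        # Documents that appear in the doclist of every query word.
        matches = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in matches:                        # rank each matching document
            hits = [doclist[doc_id] for doclist in doclists]
            scores[doc_id] = max(scores.get(doc_id, 0.0), rank(doc_id, hits))
    # Sort matched documents by rank and return the top k.
    return heapq.nlargest(k, scores, key=scores.get)


def count_hits(doc_id, hits):
    """Hypothetical ranking function: total number of hits across query words."""
    return float(sum(len(h) for h in hits))


lexicon = {"bill": 0, "clinton": 1}
short_barrel = {0: {10: [0]}, 1: {10: [1]}}           # title and anchor hits
full_barrel = {0: {10: [0, 7], 20: [3]}, 1: {10: [1], 20: [90]}}

results = search("Bill Clinton", lexicon, short_barrel, full_barrel, count_hits, k=5)
print(results)
```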

Results and Performance

A qualitative analysis of the search results by users has generally been positive.

The current version of Google answers most queries in 1 to 10 seconds.

Because Google takes the proximity of word occurrences into account, its results are more relevant than those of search engines that simply return every page containing all the query words. (E.g., a search for ‘bill clinton’ gives lower importance to pages where ‘bill’ and ‘clinton’ appear independently of each other.)

Future Work

Search times in the current version of Google are dominated by disk I/O.

Introduce query caching, along with hardware, software, and algorithmic optimizations.

Improve search efficiency and quickly scale to ~100 million web pages.

Develop Google as a large-scale research tool and a resource for searchers and researchers.

Analyses of the Research Paper

Pros

One of the first descriptions of the PageRank algorithm, which changed how search engines rank and index the web.

Using the citation (link) graph and anchor text to rank pages closely resembles the way users themselves rank websites.

Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

The paper states that Google does not compromise PageRank for monetary gain, which gives its search results more credibility. This holds true to date.

Analyses of the Research Paper

Cons

One of the first flaws found in the PageRank algorithm was the “Google bomb”: a page is ranked higher for a phrase if the sites that link to it consistently use that phrase as anchor text.

A Google bomb is created if a large number of sites link to the page in this manner.

Ranking quality is insufficient using only PageRank and anchor text. (Google today uses more than 200 different parameters to judge quality of a webpage.)

Thank You

Presented by: Nilay Khandelwal