
THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL

WEB SEARCH ENGINE

Presented by Asim, University of Peshawar

Authors: Sergey Brin, Lawrence Page

ABSTRACT

Google Search Engine as Prototype

Anatomy

Web Users: Queries (tens of millions)

Academic research

Building a large scale search engine

Heavy use of hypertextual information

(anchor text, hyperlinks)

INTRODUCTION

Web (as a dynamic entity)

Irrelevant Search Results

Human maintained Indices, Table of Contents

Too many low-quality results

Address many problems of users (Page Ranking)

CONT…

Google: Scaling with the Web

Google’s Fast Crawling Technology

Storage space availability

Indexing system processing hundreds of gigabytes of data

Minimized query response time

DESIGN GOALS

Improved Search Quality.

Indexing alone does not guarantee relevant search results.

Keeping the percentage of junk results as low as possible.

Users show interest in top ranked results.

Notion is to provide relevant results.

Google makes use of link structure & anchor text.

CONT…

Academic search engine results.

User accessibility & availability of the desired results.

Supports novel research.

All problem-solving solutions to be given in a single place.

SYSTEM FEATURES

Google search engine has two important features.

Link structure of the web (PageRank).

Utilization of links (anchor text) to improve search results.

<A href="http://www.yahoo.com/">Yahoo!</A>

The text of a hyperlink (anchor text) is associated

not only with the page that the link is on,

but also with the page the link points to.
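This can be made concrete with a minimal sketch using Python's standard `html.parser`; the `AnchorExtractor` class is an illustrative name, not part of the original system. It credits the anchor text to the target URL, as described above.

```python
# Sketch: extract (target URL, anchor text) pairs so the text can be
# associated with the page the link points to, not just the source page.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.anchors = []        # (target_url, anchor_text) pairs
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text)))
            self._href = None

parser = AnchorExtractor()
parser.feed('<A href="http://www.yahoo.com/">Yahoo!</A>')
# parser.anchors now holds the slide's example link with its anchor text.
```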

PAGE RANK

Page Rank: bringing order to the web

Academic citation literature is applied to calculate PageRank

PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

In the equation, 't1 - tn' are pages linking to page A, 'C' is the number of outbound links that a page has, and 'd' is a damping factor, usually set to 0.85.
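The formula above can be iterated to a fixed point. A minimal sketch, assuming a tiny hypothetical three-page link graph (not from the paper):

```python
# Iterative PageRank following PR(A) = (1-d) + d * sum(PR(t)/C(t)),
# where t ranges over pages linking to A and C(t) is t's outlink count.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}              # start every page at rank 1
    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum the contributions PR(t)/C(t) of pages t linking to p
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if p in links[t])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

With this (non-normalized) variant of the formula, the ranks sum to the number of pages, and "C", with two inbound links, ends up highest.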

PAGE RANK (INTUITIVE JUSTIFICATION)

A page gets a high rank if many pages point to it,

or if a page that itself has high PageRank points to it

Broken or low-quality links are rarely listed on highly ranked sites

The text of a link provides more description; Google

utilizes such information

This provides more accurate results for images, graphs,

and databases

SYSTEM ANATOMY


URL Server:

provides list of URLs to the Crawlers for fetching information from web

Distributed Crawlers (Downloading WebPages)

Store Server:

Compression and Storage in Repository

docIDs are used to distinguish web pages

Indexer

Indexing, Sorting, Uncompressing, Parsing

Hits record word occurrences, positions, and text format information in

documents

Hits are organized into barrels, which creates a partially sorted forward index

FORWARD INDEX

Document: Words

Document 1: the, cow, says, moo

Document 2: the, cat, and, the, hat

Document 3: the, dish, ran, away, with, the, spoon
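The table above maps each docID to the ordered list of words it contains; a one-line sketch in Python:

```python
# Forward index: docID -> ordered list of the words in that document.
docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}
```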

INVERTED INDEX

T0 = "it is what it is"

T1 = "what is it"

T2 = "it is a banana"

A term search for the terms "what", "is" and "it" would give the set {0, 1}.

If we run a phrase search for "what is it" we get hits for all the words in both document 0

and 1. But the terms occur consecutively only in document 1.

Word: {(docID, position)}

a: {(2, 2)}

banana: {(2, 3)}

is: {(0, 1), (0, 4), (1, 1), (2, 1)}

it: {(0, 0), (0, 3), (1, 2), (2, 0)}

what: {(0, 2), (1, 0)}
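The (docID, position) postings above are what make phrase search possible: a phrase matches only where positions are consecutive. A minimal sketch using the slide's three documents:

```python
# Build an inverted index of word -> [(docID, position), ...],
# then use the positions to run a phrase search.
texts = ["it is what it is", "what is it", "it is a banana"]

inverted = {}
for doc_id, text in enumerate(texts):
    for pos, word in enumerate(text.split()):
        inverted.setdefault(word, []).append((doc_id, pos))

def phrase_search(phrase):
    """Return docIDs where the phrase's words occur consecutively."""
    words = phrase.split()
    hits = set()
    for doc_id, pos in inverted.get(words[0], []):
        # the i-th following word must appear at position pos + i
        if all((doc_id, pos + i) in inverted.get(w, [])
               for i, w in enumerate(words[1:], start=1)):
            hits.add(doc_id)
    return hits
```

As the slide notes, document 0 contains all three words, but only document 1 has them consecutively.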

CONT…

Indexer:

Anchor files, produced by parsing, hold link information (in- & out-links)

URL resolver:

Reads anchor files, converts relative URLs to absolute URLs and in turn into docIDs

Puts anchor text in forward index

Database of links, necessary to compute PageRanks
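The relative-to-absolute step can be sketched with the standard library; the hash-based docID here is a hypothetical stand-in for the system's real docID assignment:

```python
# Sketch of the URL resolver step: resolve a relative link against the
# page it appeared on, then assign an identifier for the absolute URL.
from urllib.parse import urljoin
import hashlib

def resolve(base_url, relative_url):
    absolute = urljoin(base_url, relative_url)
    # Hypothetical docID: a truncated checksum of the absolute URL.
    doc_id = hashlib.md5(absolute.encode()).hexdigest()[:8]
    return absolute, doc_id

absolute, doc_id = resolve("http://www.stanford.edu/class/", "../index.html")
# absolute == "http://www.stanford.edu/index.html"
```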

Sorter :

Takes the barrels which are sorted by docID and resorts them by wordID to generate inverted index.

It produces a list of wordIDs and offsets into the inverted index.

CONT…

DumpLexicon

A program DumpLexicon takes this list together with the

lexicon produced by the indexer and generates a new

lexicon to be used by the searcher.

Searcher:

The searcher is run by a web server and uses the

lexicon built by DumpLexicon together with the inverted

index and the PageRanks to answer queries.

CONT…

Major data structures

Data is stored in BigFiles which are virtual files and it supports compression.

Half of the storage used by raw html repository.

Having compressed html of every page and its small header.

Document index keep information of each document.

The ISAM(Index sequential access mode) index is ordered by docID.

Each stored entry includes information of current status, pointer into the repository, document checksum, URL and title information.

They all are memory-based hash tables with varying values attached with each word.

CONT…

Hit list encoding

Uses a compact, hand-optimized encoding

It requires less space and less bit manipulation.

It uses two bytes for every hit.

To save space, the length of a hit list is combined with

the wordID in the forward index and with the docID in the

inverted index.

The forward index is stored in a number of barrels (64).

Each barrel holds a range of wordIDs.

If a document contains words falling into a particular barrel, its

docID is recorded into the barrel, followed by a list of wordIDs

with hit lists corresponding to those words
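A two-byte hit can be sketched with bit fields; the layout below (1 bit capitalization, 3 bits font size, 12 bits position) follows the paper's description of plain hits, but the functions themselves are illustrative:

```python
# Pack a hit into 16 bits: capitalization | font size | word position.
def encode_hit(capitalized, font_size, position):
    assert font_size < 8 and position < 4096   # 3-bit and 12-bit fields
    return (int(capitalized) << 15) | (font_size << 12) | position

def decode_hit(hit):
    # Unpack the three fields with shifts and masks.
    return (hit >> 15) & 1, (hit >> 12) & 0b111, hit & 0xFFF

hit = encode_hit(True, 3, 100)
# Any hit fits in two bytes: 0 <= hit < 65536.
```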

CONT…

The inverted index consists of the same barrels as

the forward index; it is produced by the sorter

A pointer is kept for each wordID in the lexicon, pointing into the barrels.

The pointer points to a list of docIDs with their hit lists;

this is called a doclist

CRAWLING

Web Crawling (downloading pages)

Crawlers (3 to 4)

Each crawler keeps about three hundred open

connections at once

Social issues

Efficiency
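Keeping hundreds of connections in flight per crawler can be sketched with `asyncio`; the `fetch` stub and URLs below are hypothetical placeholders, not the paper's implementation:

```python
# Sketch: bound the number of simultaneously open connections with a
# semaphore while fetching many pages concurrently.
import asyncio

MAX_CONNECTIONS = 300

async def fetch(url):
    await asyncio.sleep(0)          # stand-in for real network I/O
    return f"<html>content of {url}</html>"

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_CONNECTIONS)   # cap open connections
    async def bounded_fetch(url):
        async with sem:
            return url, await fetch(url)
    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)

pages = asyncio.run(crawl([f"http://example.com/{i}" for i in range(10)]))
```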

ARCHITECTURE OF THE GOOGLE SEARCH ENGINE

DESCRIPTION OF THE PICTORIAL COMPONENTS

Component: Description

Crawlers: There are several distributed crawlers; they parse the pages and extract links and keywords.

URL Server: Provides the crawlers a list of URLs to scan. The crawlers send collected data to a store server.

Store Server: It compresses the pages and places them in the repository. Each page is stored with an identifier, a docID.

Repository: Contains a copy of the pages and images, allowing comparisons and caching.

Indexer: It decompresses documents and converts them into sets of word occurrences called "hits". It distributes hits among a set of "barrels". This provides a partially sorted index. It also creates a list of URLs on each page. A hit contains the following information: the word, its position in the document, font size, capitalization.

Barrels: These "barrels" are databases that classify documents by docID. They are created by the indexer and used by the sorter.

Anchors: The bank of anchors created by the indexer contains internal links and the text associated with each link.

CONT…

Component: Description

URL Resolver: It takes the contents of anchors, converts relative URLs into absolute addresses and finds or creates a docID. It builds an index of documents and a database of links.

Doc Index: Contains the text relative to each URL.

Links: The database of links associates each one with a docID (and so with a real document on the Web).

PageRank: The software uses the database of links to compute the PageRank of each page.

Sorter: It interacts with the barrels. It takes documents classified by docID and creates an inverted list sorted by wordID.

Lexicon: A program called DumpLexicon takes the list provided by the sorter (classified by wordID), combines it with the lexicon created by the indexer (the sets of keywords in each page), and produces a new lexicon for the searcher.

Searcher: It runs on a web server, uses the lexicon built by DumpLexicon in combination with the index classified by wordID, taking PageRank into account, and produces a results page.

RESULTS, PROBLEMS & CONCLUSION

Most important issue is quality of search results

Google performance is better compared to other commercial engines

Need for relevant and exact query results

Up to date information processing

Performing search queries

Crawling technologies

Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information.

“The ultimate search engine would understand exactly what you mean and give back exactly what you want.” by Larry Page


“The absolute search engine’s query generation would be based

on information, not on repository records; query results will be

real-time, and it will change the whole Internet and web

architecture.” by Asim

Thanks!