How the Google Search Engine Works
The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. Google is designed to crawl and index the Web efficiently and to produce much more satisfying search results than other search engines.
Many web pages are unscrupulous and try to fool search engines into ranking them at the top.
Google uses the PageRank and TrustRank techniques to give accurate results for queries.
Introduction
What is a search engine? A tool designed to search for information on the web.
It works with the help of:
• Crawler
• Indexer
• Search algorithms
It gives precise results on the basis of different ranking procedures.
WEB CRAWLER
Indexer
It collects, parses and stores data to facilitate fast and accurate information retrieval for a search query
The inverted index stores a list of the documents containing each word associated with the query
Word   Documents containing the word
apple  Document 1, Document 2, Document 3
is     Document 2, Document 4
red    Document 5
The search engine then matches the query with each document indexed.
Then it filters the matching results.
Since the uncompressed index would require approx. 250 GB of memory, compression techniques are used to reduce it to a fraction of this size.
The index is regularly updated with the help of index merging.
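The inverted-index idea described above can be sketched in a few lines; this is an illustrative toy, not Google's actual code, and the class and method names are invented:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: maps each word to the set of documents containing it."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> set of document IDs

    def add_document(self, doc_id, text):
        for word in text.lower().split():
            self.index[word].add(doc_id)

    def search(self, query):
        # Return documents containing every query word (AND semantics).
        words = query.lower().split()
        if not words:
            return set()
        result = self.index[words[0]].copy()
        for word in words[1:]:
            result &= self.index[word]
        return result

idx = InvertedIndex()
idx.add_document(1, "apple is red")
idx.add_document(2, "apple is green")
idx.search("apple red")  # -> {1}
```

A real engine stores postings on disk with compression; the AND-intersection of posting lists, however, is exactly the matching step described above.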
SEARCH ALGORITHM
When a search for a query returns millions of important or authoritative pages, the engine's search algorithm decides which of them appear at the top of the listing.
There are two key drivers in web search: content analysis and linkage analysis.
Famous algorithms used by different search engines are:
1. PageRank
2. TrustRank
3. Hilltop algorithm
4. Binary search
Different search engines use different algorithms to rank the priority of pages.
Different engines look for different things to determine search relevancy; things that help you rank in one engine could preclude you from ranking in another.
OVERALL RANKING FACTORS

Positive ranking factors:
• 73% Keyword-focused anchor text from external links
• 71% External link popularity
• 64% Diversity of link sources
• 56% Keyword use anywhere in the title tag
• 51% Trustworthiness of the domain based on link distance from trusted …

Negative ranking factors:
• 68% Cloaking with malicious intent
• 56% Link acquisition from known link brokers
• 51% Links from the page to web spam pages
• 51% Cloaking by user agent
• 46% Frequent server downtime & site inaccessibility
Google architecture
• Web crawling is done by several distributed crawlers.
• The fetched web pages are then sent to the store server, which compresses them and stores them in a repository.
• The indexer then reads the repository, uncompresses the documents, and parses them.
• Every web page has an associated ID number called a docID, which is assigned during parsing.
• Each document is converted into a set of word occurrences called hits, which the indexer distributes into barrels, creating a partially sorted index.
• The indexer also parses out all the links in every page and stores important information about them in an anchors file, which records where each link points from and to, and the text of the link.
• The URLresolver reads the anchors file, retrieves the anchor text, puts it into the forward index, and generates a database of links.
• The link database is used to compute the PageRank of every document.
• The sorter takes the barrels, which are sorted by docID, re-sorts them by wordID, and produces a list of wordIDs and offsets into the inverted index.
• A program called DumpLexicon takes this list together with the lexicon produced by the indexer and creates a new lexicon for the searcher.
• The searcher uses this lexicon together with the inverted index and PageRank to answer our queries.
Crawling deeper into Google's architecture
• Major data structures
• Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost.
• Although CPU speeds and bulk input/output rates have improved dramatically, a disk seek is still slow, so Google is designed to avoid disk seeks whenever possible.
• Repository
• It contains the full HTML of every web page, compressed using zlib (a tradeoff between compression speed and ratio).
• The documents are stored one after the other, each prefixed by docID, length, and URL, as shown:
docID | ecode | urllen | pagelen | url | page
• Document index
• It keeps information about each document, including the current document status, a pointer into the repository, a document checksum, and various statistics.
• URLs are converted into docIDs in batch by doing a merge with this file.
• To find the docID of a particular URL, the URL's checksum is computed and a binary search is performed.
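The checksum-plus-binary-search lookup can be sketched as follows; the checksum function, in-memory layout, and URLs are assumptions for illustration, not Google's actual on-disk format:

```python
import bisect
import zlib

def checksum(url):
    # Any deterministic checksum works for the sketch; CRC32 is a convenient stand-in.
    return zlib.crc32(url.encode())

# The "checksum file": (checksum, docID) pairs kept sorted by checksum.
checksum_file = sorted(
    (checksum(url), doc_id)
    for doc_id, url in enumerate(["http://a.example/", "http://b.example/"])
)

def lookup_docid(url):
    target = checksum(url)
    keys = [c for c, _ in checksum_file]
    i = bisect.bisect_left(keys, target)  # binary search over sorted checksums
    if i < len(keys) and keys[i] == target:
        return checksum_file[i][1]
    return None
```

Keeping the pairs sorted by checksum is what makes the O(log n) binary search possible; batch URL-to-docID conversion is then a single merge pass over the same sorted file.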
• Lexicon
• It is used by the indexer as a word storage system and fits in machine memory at a reasonable price.
• The current lexicon contains 14 million words and takes only 256 MB of main memory.
• Hit list
• It is a list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
• It also accounts for most of the space used in both the forward and the inverted indices.
• Types of hits: fancy hits and plain hits. Fancy hits are hits occurring in a URL, title, anchor text, or meta tag; plain hits include everything else.
• The length of the hit list is combined with the wordID in the forward index and with the docID in the inverted index.
• Forward index
• It is partially sorted and stored in a number of barrels. Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordIDs.
• Inverted index
• It consists of the same barrels as the forward index, except that they have been processed by the sorter.
• For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into; it points to a doclist of docIDs together with their corresponding hit lists.
• There are two sets of inverted barrels: one set for hit lists that include title or anchor hits, and another set for all hit lists.
• The searcher checks the first set of barrels first and, if there are not enough matches within those barrels, checks the larger set.
• Indexing the web
• Any parser designed to run on the entire Web must handle a huge array of possible errors. For maximum speed, Google uses flex to generate a lexical analyzer; making it run at a reasonable speed and remain very robust involved a fair amount of work.
Searching techniques
The goal of searching is to provide quality search results efficiently. Once a certain number (currently 40,000) of matching documents is found, the searcher automatically sorts the matched documents by rank and returns the top results.
Google considers each type of hit (title, anchor, URL, large font, small font), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list.
Every count is converted into a count-weight. The dot product of the vector of count-weights with the vector of type-weights is used to compute an IR score for the document.
The IR score is then combined with PageRank to give the document its final rank.
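The type-weight and count-weight scoring above can be sketched as a dot product; the weight values, the count-to-weight curve, and the IR/PageRank combination formula are invented for illustration, since Google's real parameters are not public:

```python
# Hypothetical type-weights: one entry per hit type (vector indexed by type).
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "small_font": 1.0}

def count_weight(count, cap=8):
    # Weight grows with the hit count but tapers off: counts past `cap` don't help.
    return min(count, cap)

def ir_score(hit_counts):
    # Dot product of the count-weight vector with the type-weight vector.
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank, alpha=0.5):
    # One simple way to combine IR score with PageRank (the real formula is unpublished).
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank
```

The capped count-weight mirrors the intuition stated in the original paper: more hits help, but only up to a point, so keyword stuffing alone cannot dominate the score.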
• PageRank is based on mutual reinforcement between pages.
• It is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents.
• A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, there is no support for that page.
• A recent analysis of the algorithm showed that the total Page Rank score PR (t) of a group t of pages depends on four factors:
PR(t) = PRstatic(t)+PRin(t)-PRout(t)-PRsink(t)
Page rank
Page C has a higher PageRank than Page E, even though fewer pages link to it; the one link it has is of much higher value.
A web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web) is going to be on Page E for 8.1% of the time.
(The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. Page A is assumed to link to all pages in the web, because it has no outgoing links.
Figure: Mathematical PageRanks
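The random-surfer model with a damping factor of 0.85 can be sketched as a simple power iteration; the example graph and iteration count are illustrative, and dangling pages are treated as linking to every page, matching the assumption about Page A above:

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = 0.0
            for q in pages:
                out = links[q] or pages  # dangling page: treat as linking everywhere
                if p in out:
                    incoming += rank[q] / len(out)
            # (1 - damping) is the probability of the random jump.
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Because every page redistributes its full rank each round (teleport plus links), the scores stay a probability distribution: they always sum to 1.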
Trust rank
Google and Web Spam
• All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web spam. It is also defined as "any attempt to deceive a search engine's relevancy algorithm".
• There are three types of web spam on the web:
Content spam: Maliciously crafting the content of Web pages, for instance by inserting a large number of keywords.
Link spam: Changes to the link structure of sites, e.g. by creating link farms. A link farm is a densely connected set of pages created explicitly to deceive a link-based ranking algorithm.
Cloaking: Creating a rogue copy of a popular website that shows content similar to the original to a web crawler but redirects web surfers to unrelated or malicious websites. Spammers can use this technique to achieve high rankings in result pages for certain keywords.
Link based web spam
• The foundation of the spam detection system is a cost-sensitive decision tree. It incorporates a combined approach based on link and content analysis to detect different types of Web spam pages.
Content Based Features
• Number of words in the page
• Fraction of anchor text
• Fraction of visible text
• A comparative study of the content-based features in the figures below shows the following results:
• Figure 1: the average word length is much higher in spam pages than in non-spam pages.
• Figure 2: the number of words in a spam page is much higher than in a non-spam page.
Web spam detection and result
Thus, based on these features, content-based spam pages can be detected by a Naïve Bayesian classifier, which focuses on the number of times each word is repeated in the content of the page.
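A minimal multinomial Naïve Bayesian classifier over word counts, in the spirit described above; the training documents, class names, and Laplace smoothing are illustrative choices, not the paper's actual model:

```python
import math
from collections import Counter

class NaiveBayesSpam:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # per-class word counts
        self.docs = {"spam": 0, "ham": 0}                    # per-class document counts

    def train(self, label, text):
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def classify(self, text):
        total_docs = sum(self.docs.values())
        vocab = set().union(*self.counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.counts.items():
            total = sum(counts.values())
            score = math.log(self.docs[label] / total_docs)  # class prior
            for word in text.lower().split():
                # Laplace smoothing so unseen words do not zero out the probability.
                score += math.log((counts[word] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayesSpam()
nb.train("spam", "cheap cheap pills buy cheap pills now")
nb.train("ham", "research on web search ranking algorithms")
nb.classify("buy cheap pills")  # -> "spam"
```

Word repetition drives the decision exactly as the text suggests: heavily repeated keywords push the spam-class likelihood up.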
Link Based Features
The data set is obtained by using a web crawler. For each page, its links and contents are obtained. From the data set, a full graph is built, and certain features are computed for each host and page. Link-based features are extracted from the host graph.
The link-based classifier operates on the following features of the link farm:
• Estimation of supporters
• TrustRank and PageRank
It has been observed that for a normal web page, the graph of supporters grows exponentially: the number of supporters increases with distance.
In the case of web spam, however, the graph shows a sudden increase in supporters over a small distance and then drops to zero beyond some distance.
The distribution of supporters over distance is shown in the figure: non-spam page (left) vs. spam page (right).
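The supporter counts over distance can be computed exactly with a breadth-first expansion over reversed links; the graph, function name, and exact counting are illustrative (real systems estimate supporter counts probabilistically rather than enumerating them):

```python
def supporters_by_distance(in_links, page, max_dist):
    """in_links maps a page to the pages that link TO it.
    Returns, for d = 1..max_dist, the number of new supporters at distance d."""
    seen = {page}
    frontier = {page}
    counts = []
    for _ in range(max_dist):
        # Expand one hop backwards along incoming links, skipping pages already counted.
        frontier = {q for p in frontier for q in in_links.get(p, []) if q not in seen}
        seen |= frontier
        counts.append(len(frontier))
    return counts

in_links = {"t": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
supporters_by_distance(in_links, "t", 2)  # -> [2, 2]
```

Plotting these counts against distance gives exactly the curves discussed above: sustained growth for normal pages, a short spike that collapses to zero for link-farm spam.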
System performance
It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly
In total it took roughly 9 days to download the 26 million pages (including errors), with the last 11 million pages downloaded in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second.
The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
Google’s immediate goals are to improve search efficiency and to scale to approximately 100 million web pages.
They are planning to add simple features supported by commercial search engines, such as boolean operators, negation, and stemming, and to extend the use of link structure and link text.
PageRank can be personalized by increasing the weight of a user's home page or bookmarks.
Google is also planning to use other centrality measures. The centrality measures of a node are:
– Degree centrality
– Betweenness centrality
– Closeness centrality
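Two of these centrality measures can be sketched on a small undirected example graph (betweenness needs shortest-path counting and is omitted for brevity); the graph and function names are illustrative:

```python
from collections import deque

graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}

def degree_centrality(g, v):
    # Number of direct neighbours of v.
    return len(g[v])

def bfs_distances(g, v):
    # Shortest-path distance from v to every reachable node.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in g[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(g, v):
    # Inverse of the average shortest-path distance to all other nodes.
    d = bfs_distances(g, v)
    return (len(g) - 1) / sum(x for x in d.values() if x > 0)

degree_centrality(graph, "C")     # -> 3
closeness_centrality(graph, "C")  # -> 1.0
```

Node C scores highest on both measures here, which matches the intuition that it sits at the hub of this small graph.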
Future work
Conclusion
Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information.
Google keeps us away from spammy link exchange hubs and other sources of junk links. It gives more importance to .gov and .edu web pages.
We applied algorithms for Web spam detection based on the above features: content-based (Naïve Bayesian classifier) and link-based (PageRank algorithm).
Thank You All !!