How the Google Search Engine Works
The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. Google is designed to crawl and index the Web efficiently and to produce much more satisfying search results than other search engines.
Many web pages are unscrupulous and try to fool search engines into ranking them at the top.
Google uses the PageRank and TrustRank techniques to give accurate results for queries.
Introduction
What is a search engine? A tool designed to search for information on the web.
It works with the help of:
• Crawler
• Indexer
• Search algorithms
It gives precise results on the basis of different ranking procedures.
WEB CRAWLER
Indexer
It collects, parses and stores data to facilitate fast and accurate information retrieval for a search query
The inverted index stores a list of the documents containing each word associated with the query
Word   Documents containing the word
apple  Document 1, Document 2, Document 3
is     Document 2, Document 4
red    Document 5
The search engine then matches the query with each document indexed.
Then it filters the matching results.
Since the uncompressed index would require approx. 250 GB of memory, compression techniques are used to reduce it to a fraction of this size.
The index is regularly updated with the help of index merging.
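The inverted-index idea described above can be sketched in a few lines; this is an illustrative toy, not Google's actual code, and the class and method names are invented:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: maps each word to the set of documents containing it."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> set of document IDs

    def add_document(self, doc_id, text):
        for word in text.lower().split():
            self.index[word].add(doc_id)

    def search(self, query):
        # Return documents containing every query word (AND semantics).
        words = query.lower().split()
        if not words:
            return set()
        result = self.index[words[0]].copy()
        for word in words[1:]:
            result &= self.index[word]
        return result

idx = InvertedIndex()
idx.add_document(1, "apple is red")
idx.add_document(2, "apple is green")
idx.search("apple red")  # -> {1}
```

A real engine stores postings on disk with compression; the AND-intersection of posting lists, however, is exactly the matching step described above.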
SEARCH ALGORITHM
When a search for a query returns millions of important or authoritative pages, the engine's search algorithm decides which of them appear at the top of the listing.
There are two key drivers in web search: content analysis and linkage analysis.
Famous algorithms used by different search engines are:
1. PageRank
2. TrustRank
3. Hilltop algorithm
4. Binary search
Different search engines use different algorithms to rank the priority of pages.
Different engines look for different things to determine search relevancy; things that help you rank in one engine could preclude you from ranking in another.
OVERALL RANKING FACTORS

Positive ranking factors:
• 73% Keyword-focused anchor text from external links
• 71% External link popularity
• 64% Diversity of link sources
• 56% Keyword use anywhere in the title tag
• 51% Trustworthiness of the domain based on link distance from trusted …

Negative ranking factors:
• 68% Cloaking with malicious intent
• 56% Link acquisition from known link brokers
• 51% Links from the page to web spam pages
• 51% Cloaking by user agent
• 46% Frequent server downtime & site inaccessibility
Google architecture
• Web crawling is done by several distributed crawlers.
• The fetched web pages are then sent to the store server, which compresses them and stores them in a repository.
• The indexer then reads the repository, uncompresses the documents, and parses them.
• Every web page has an associated ID number called a docID, which is assigned during parsing.
• Each document is converted into a set of word occurrences called hits, which the indexer distributes into barrels, creating a partially sorted index.
• The indexer also parses out all the links in every page and stores important information about them in an anchors file, which records where each link points from and to, and the text of the link.
• The URLresolver reads the anchors file, retrieves the anchor text, puts it into the forward index, and generates a database of links.
• The link database is used to compute the PageRank of every document.
• The sorter takes the barrels, which are sorted by docID, re-sorts them by wordID, and produces a list of wordIDs and offsets into the inverted index.
• A program called DumpLexicon takes this list together with the lexicon produced by the indexer and creates a new lexicon for the searcher.
• The searcher uses this lexicon together with the inverted index and PageRank to answer our queries.
Crawling deeper into Google's architecture
• Major data structures
• Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost.
• Although CPU speeds and bulk input/output rates have improved dramatically, a disk seek is still slow, so Google is designed to avoid disk seeks whenever possible.
• Repository
• It contains the full HTML of every web page, compressed using zlib (a tradeoff between compression speed and ratio).
• The documents are stored one after the other, each prefixed by docID, length, and URL, as shown:
docID | ecode | urllen | pagelen | url | page
• Document index
• It keeps information about each document, including the current document status, a pointer into the repository, a document checksum, and various statistics.
• URLs are converted into docIDs in batch by doing a merge with this file.
• To find the docID of a particular URL, the URL's checksum is computed and a binary search is performed.
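The checksum-plus-binary-search lookup can be sketched as follows; the checksum function, in-memory layout, and URLs are assumptions for illustration, not Google's actual on-disk format:

```python
import bisect
import zlib

def checksum(url):
    # Any deterministic checksum works for the sketch; CRC32 is a convenient stand-in.
    return zlib.crc32(url.encode())

# The "checksum file": (checksum, docID) pairs kept sorted by checksum.
checksum_file = sorted(
    (checksum(url), doc_id)
    for doc_id, url in enumerate(["http://a.example/", "http://b.example/"])
)

def lookup_docid(url):
    target = checksum(url)
    keys = [c for c, _ in checksum_file]
    i = bisect.bisect_left(keys, target)  # binary search over sorted checksums
    if i < len(keys) and keys[i] == target:
        return checksum_file[i][1]
    return None
```

Keeping the pairs sorted by checksum is what makes the O(log n) binary search possible; batch URL-to-docID conversion is then a single merge pass over the same sorted file.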
• Lexicon
• It is used by the indexer as a word storage system and fits in machine memory at a reasonable price.
• The current lexicon contains 14 million words and takes only 256 MB of main memory.
• Hit list
• It is a list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
• It also accounts for most of the space used in both the forward and the inverted indices.
• Types of hits: fancy hits and plain hits. Fancy hits are hits occurring in a URL, title, anchor text, or meta tag; plain hits include everything else.
• The length of the hit list is combined with the wordID in the forward index and with the docID in the inverted index.
• Forward index
• It is partially sorted and stored in a number of barrels. Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordIDs.
• Inverted index
• It consists of the same barrels as the forward index, except that they have been processed by the sorter.
• For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into; it points to a doclist of docIDs together with their corresponding hit lists.
• There are two sets of inverted barrels: one set for hit lists that include title or anchor hits, and another set for all hit lists.
• The searcher checks the first set of barrels first and, if there are not enough matches within those barrels, checks the larger set.
• Indexing the web
• Any parser designed to run on the entire Web must handle a huge array of possible errors. For maximum speed, Google uses flex to generate a lexical analyzer; making it run at a reasonable speed and remain very robust involved a fair amount of work.
Searching techniques
The goal of searching is to provide quality search results efficiently. Once a certain number (currently 40,000) of matching documents is found, the searcher automatically sorts the matched documents by rank and returns the top results.
Google considers each type of hit (title, anchor, URL, large font, small font), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list.
Every count is converted into a count-weight. The dot product of the vector of count-weights with the vector of type-weights is used to compute an IR score for the document.
The IR score is then combined with PageRank to give the document its final rank.
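The type-weight and count-weight scoring above can be sketched as a dot product; the weight values, the count-to-weight curve, and the IR/PageRank combination formula are invented for illustration, since Google's real parameters are not public:

```python
# Hypothetical type-weights: one entry per hit type (vector indexed by type).
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "small_font": 1.0}

def count_weight(count, cap=8):
    # Weight grows with the hit count but tapers off: counts past `cap` don't help.
    return min(count, cap)

def ir_score(hit_counts):
    # Dot product of the count-weight vector with the type-weight vector.
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())

def final_rank(hit_counts, pagerank, alpha=0.5):
    # One simple way to combine IR score with PageRank (the real formula is unpublished).
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank
```

The capped count-weight mirrors the intuition stated in the original paper: more hits help, but only up to a point, so keyword stuffing alone cannot dominate the score.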
• PageRank is based on mutual reinforcement between pages.
• It is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents.
• A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, there is no support for that page.
• A recent analysis of the algorithm showed that the total Page Rank score PR (t) of a group t of pages depends on four factors:
PR(t) = PRstatic(t)+PRin(t)-PRout(t)-PRsink(t)
Page rank
Page C has a higher PageRank than Page E, even though fewer pages link to it; the one link it has is of much higher value.
A web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web) is going to be on Page E for 8.1% of the time.
(The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. Page A is assumed to link to all pages in the web, because it has no outgoing links.
Figure: Mathematical PageRanks
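The random-surfer model with a damping factor of 0.85 can be sketched as a simple power iteration; the example graph and iteration count are illustrative, and dangling pages are treated as linking to every page, matching the assumption about Page A above:

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = 0.0
            for q in pages:
                out = links[q] or pages  # dangling page: treat as linking everywhere
                if p in out:
                    incoming += rank[q] / len(out)
            # (1 - damping) is the probability of the random jump.
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Because every page redistributes its full rank each round (teleport plus links), the scores stay a probability distribution: they always sum to 1.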
Trust rank
Google and Web Spam
• All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web spam. It is also defined as "any attempt to deceive a search engine's relevancy algorithm".
• There are three types of web spam on the web:
Content spam: Maliciously crafting the content of Web pages, for instance by inserting a large number of keywords.
Link spam: Changes to the link structure of sites, e.g. by creating link farms. A link farm is a densely connected set of pages created explicitly to deceive a link-based ranking algorithm.
Cloaking: Creating a rogue copy of a popular website that shows content similar to the original to a web crawler but redirects web surfers to unrelated or malicious websites. Spammers can use this technique to achieve high rankings in result pages for certain keywords.
Link based web spam
• The foundation of the spam detection system is a cost-sensitive decision tree. It incorporates a combined approach based on link and content analysis to detect different types of Web spam pages.
Content Based Features
• Number of words in the page
• Fraction of anchor text
• Fraction of visible text
• A comparative study of the content-based features in the figures below shows the following results:
• Figure 1: the average word length is much higher in spam pages than in non-spam pages.
• Figure 2: the number of words in a spam page is much higher than in a non-spam page.
Web spam detection and result
Thus, based on these features, content-based spam pages can be detected by a Naïve Bayesian classifier, which focuses on the number of times each word is repeated in the content of the page.
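A minimal multinomial Naïve Bayesian classifier over word counts, in the spirit described above; the training documents, class names, and Laplace smoothing are illustrative choices, not the paper's actual model:

```python
import math
from collections import Counter

class NaiveBayesSpam:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # per-class word counts
        self.docs = {"spam": 0, "ham": 0}                    # per-class document counts

    def train(self, label, text):
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def classify(self, text):
        total_docs = sum(self.docs.values())
        vocab = set().union(*self.counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.counts.items():
            total = sum(counts.values())
            score = math.log(self.docs[label] / total_docs)  # class prior
            for word in text.lower().split():
                # Laplace smoothing so unseen words do not zero out the probability.
                score += math.log((counts[word] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayesSpam()
nb.train("spam", "cheap cheap pills buy cheap pills now")
nb.train("ham", "research on web search ranking algorithms")
nb.classify("buy cheap pills")  # -> "spam"
```

Word repetition drives the decision exactly as the text suggests: heavily repeated keywords push the spam-class likelihood up.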
Link Based Features
The data set is obtained by using a web crawler. For each page, its links and contents are obtained. From the data set, a full graph is built, and certain features are computed for each host and page. Link-based features are extracted from the host graph.
The link-based classifier operates on the following features of the link farm:
• Estimation of supporters
• TrustRank and PageRank
It has been observed that for a normal web page, the graph of supporters grows exponentially: the number of supporters increases with distance.
In the case of web spam, however, the graph shows a sudden increase in supporters over a small distance and then drops to zero beyond some distance.
The distribution of supporters over distance is shown in the figure: non-spam page (left) vs. spam page (right).
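The supporter counts over distance can be computed exactly with a breadth-first expansion over reversed links; the graph, function name, and exact counting are illustrative (real systems estimate supporter counts probabilistically rather than enumerating them):

```python
def supporters_by_distance(in_links, page, max_dist):
    """in_links maps a page to the pages that link TO it.
    Returns, for d = 1..max_dist, the number of new supporters at distance d."""
    seen = {page}
    frontier = {page}
    counts = []
    for _ in range(max_dist):
        # Expand one hop backwards along incoming links, skipping pages already counted.
        frontier = {q for p in frontier for q in in_links.get(p, []) if q not in seen}
        seen |= frontier
        counts.append(len(frontier))
    return counts

in_links = {"t": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
supporters_by_distance(in_links, "t", 2)  # -> [2, 2]
```

Plotting these counts against distance gives exactly the curves discussed above: sustained growth for normal pages, a short spike that collapses to zero for link-farm spam.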
System performance
It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly
In total it took roughly 9 days to download the 26 million pages (including errors), with the last 11 million pages downloaded in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second.
The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
Google’s immediate goals are to improve search efficiency and to scale to approximately 100 million web pages.
They are planning to add simple features supported by commercial search engines, such as boolean operators, negation, and stemming, and to extend the use of link structure and link text.
PageRank can be personalized by increasing the weight of a user's home page or bookmarks.
Google is also planning to use other centrality measures. The centrality measures of a node are:
– Degree centrality
– Betweenness centrality
– Closeness centrality
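Two of these centrality measures can be sketched on a small undirected example graph (betweenness needs shortest-path counting and is omitted for brevity); the graph and function names are illustrative:

```python
from collections import deque

graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}

def degree_centrality(g, v):
    # Number of direct neighbours of v.
    return len(g[v])

def bfs_distances(g, v):
    # Shortest-path distance from v to every reachable node.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in g[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_centrality(g, v):
    # Inverse of the average shortest-path distance to all other nodes.
    d = bfs_distances(g, v)
    return (len(g) - 1) / sum(x for x in d.values() if x > 0)

degree_centrality(graph, "C")     # -> 3
closeness_centrality(graph, "C")  # -> 1.0
```

Node C scores highest on both measures here, which matches the intuition that it sits at the hub of this small graph.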
Future work
Conclusion
Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information.
Google keeps us away from spammy link exchange hubs and other sources of junk links. It gives more importance to .gov and .edu web pages.
We applied algorithms for Web spam detection based on the above features: content-based (Naïve Bayesian classifier) and link-based (PageRank algorithm).
Thank You All !!