Web Intelligence Search and Ranking

Web Intelligence Search and Ranking. Today The anatomy of search engines (read it yourself) The key design goal(s) for search engines Why google is good:

Today

The anatomy of search engines (read it yourself)

The key design goal(s) for search engines

Why Google is good: PageRank and anchor text

Coursework 3 (last slide)

The Anatomy of a Search Engine

A classic paper from the founders of Google; available at the site.

• Challenges faced by a www search system

• Design Goals

• Google’s key ideas (for improved search/relevance)

• System design of Google

The one-slide guide to search engines

A search engine’s back end is simply an index of the pages on the web, in precisely the same way that an index of a book is an index of the pages in the book!

In a book, to find the pages that discuss ‘apples’, you look up ‘apples’ in the index, and get page numbers. In the www, you look up ‘apples’ and get URLs.

To give more appropriate URLs, web indexes are more sophisticated, but it’s still the same idea.

So if you want to start a search engine company, you need:

– to build and maintain an index to the web (how?)

– to provide an interface (of course)

– to have a routine that takes a search query and, making good use of your index, finds the most relevant URLs.
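The book-index analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real search engine stores its index: the function names `build_index` and `lookup`, and the three example pages and their URLs, are invented for this sketch.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, word):
    """Return the URLs listed under a word, like page numbers in a book index."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical pages, invented purely for illustration.
pages = {
    "http://example.com/fruit": "apples and pears",
    "http://example.com/pie": "how to bake apples in a pie",
    "http://example.com/cars": "used cars for sale",
}

index = build_index(pages)
apple_urls = lookup(index, "apples")
print(apple_urls)
```

Looking up a word costs one dictionary access however many pages were indexed, which is the property that makes an index worth building before any query arrives.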

Growth of the WWW

Month    # of Web sites    % .com sites    Hosts per Web server
6/93     130               1.5             13,000
12/93    623               4.6             3,475
6/94     2,738             13.5            1,095
12/94    10,022            18.3            451
6/95     23,500            31.3            270
1/96     100,000           50.0            94

Figures for 1993 to 1996 from http://www.mit.edu/people/mkgray/growth/

Figures for 2005 from http://www.cs.uiowa.edu/~asignori/web-size/

The indexable web in Jan 2005: 11.5 billion pages. Coverage: Google = 76.16%, MSN Beta = 61.90%, Ask/Teoma = 57.62%, Yahoo! = 69.32%

Based on: 438,141 in 75 different languages

How can we make search better?

Reduce (ideally eliminate) results that are irrelevant to our requirements.

This started to be a problem in 1997, when the www (c. 100,000,000 pages at the time) had grown so large that relevant responses to queries were often swamped by irrelevant ones.

The number of potential pages to list against a query is always growing, but people’s ability to filter them is static: you don’t want to look at more than a few tens of documents. So search engines need to get continually better at ranking.

Two key metrics

Suppose you enter a query into a search engine.

Suppose that there are precisely 20 documents on the www that are fully relevant to your query.

Suppose that the search engine returns a list of 10 documents; 2 of these are relevant, the rest are not.

Precision: the percentage of retrieved documents that are relevant to the query.

Recall: the percentage of the truly relevant documents that have been retrieved by the search engine.

What are Precision and Recall in this case?
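The question can be checked with a little arithmetic. The sketch below simply applies the two definitions to the numbers on this slide (10 retrieved, 2 of them relevant, 20 relevant in total); the function names are chosen for this example.

```python
def precision(relevant_retrieved, retrieved):
    """Percentage of the retrieved documents that are relevant."""
    return 100.0 * relevant_retrieved / retrieved


def recall(relevant_retrieved, relevant_total):
    """Percentage of all relevant documents that were retrieved."""
    return 100.0 * relevant_retrieved / relevant_total


# Numbers from the slide: 10 documents returned, 2 relevant, 20 relevant overall.
p = precision(relevant_retrieved=2, retrieved=10)
r = recall(relevant_retrieved=2, relevant_total=20)
print(f"precision = {p}%, recall = {r}%")  # precision = 20.0%, recall = 10.0%
```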

Precision and Recall Speculations

• Is it possible, and/or desirable, to design a search engine so that it achieves 100% Recall?

• Is it possible, and/or desirable, to design a search engine so that it achieves 100% Precision?

• Which is better:

– 10% precision, 0.001% recall

– 0.001% precision, 10% recall

Search engine design goals

High precision is the number one priority for a good search engine.

Fast response is very important for a good search engine.

Scalability is very important. As the www continues to grow, precision and response should not significantly degrade.

Comparing Search Engines

The main area in which search engines differ is in their Precision.

This is the area in which most research effort is expended.

The key research question always has been, and shall remain:

How can we automatically estimate the relevance and usefulness of a web document given a particular search query?

Basic notes on queries:

Suppose your search query is: “flower shops in Edinburgh”

If a document contains all of these words, maybe several times, does that mean it is relevant?

If a document contains this precise phrase, does that mean it is relevant?

If the answer to these questions is no, or not necessarily, then how can we work out what documents are relevant?
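To see why the answer is "not necessarily", here is a minimal sketch of the two naive tests. Both example documents are invented for illustration, and the helper names are hypothetical; the point is only that a keyword-stuffed page passes both tests just as easily as a genuine florist’s page.

```python
def contains_all_words(query, document):
    """True if every query word appears somewhere in the document."""
    words = set(document.lower().split())
    return all(word in words for word in query.lower().split())


def contains_phrase(query, document):
    """True if the query appears in the document as an exact phrase."""
    return query.lower() in document.lower()


query = "flower shops in Edinburgh"

# Hypothetical documents: a genuine page and a keyword-stuffed one.
florist = "We are one of the oldest flower shops in Edinburgh and deliver daily"
spam = "cheap watches flower shops in Edinburgh cheap watches buy now"

for name, doc in [("florist", florist), ("spam", spam)]:
    print(name, contains_all_words(query, doc), contains_phrase(query, doc))
```

Neither test can tell the two pages apart, so some further signal of relevance is needed.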

Google’s key points w.r.t. relevance

PageRank: a way of calculating a document’s overall importance. Important documents containing the keywords are more likely to be relevant than ‘unimportant’ documents that also contain the keywords. PageRank is based very much on the link graph of the web.

Anchor Text: as well as attaching extra weight to query words that appear in headings and similar, Google associates the text in a hyperlink (the anchor text) with the document the link points to. So, for example, if a page contains “edinburgh flower shop”, that doesn’t mean it is relevant to the query. But if the text of a link to that page contains “edinburgh flower shop”, then the page is far more likely to be relevant.

Google’s Page Ranking method

Problem: given perhaps hundreds of pages that contain the words in a search query, in what order should these be displayed to the user?

PageRank was the method that made Google different from (and better than) other search engines of the time.

It makes use of the directed network defined by links between pages.

The PageRank Calculation

To find the PageRank of page A, written PR(A):

Suppose that pages T1, T2, …, Tn all point to A; so A has n inlinks

Let C(X) be the number of outlinks from a page X.

Let d be a “damping factor” (set perhaps at 0.85)

PR(A) = (1 – d) + d( PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn) )

PageRank Notes

The PR of A is worked out in terms of the PRs of all of the pages that link to it. Although simple to define, it is not simple to calculate; it requires some slightly advanced mathematics.

Technically, the PR of a page corresponds to an entry in the principal eigenvector of the link matrix.

Note that PR gets a positive contribution from:

• inlinks from highly ranked pages that have low numbers of outlinks.

• having many inlinks.

PR(A) = (1 – d) + d( PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn) )

PageRank Very Simple Example

[Figure: link graph over the six pages A, B, C, D, E and F]

Set d = 0.8; when C(page) = 0, set C(page) = 1.

PR(A) = 0.2
PR(B) = 0.2 + 0.8*(PR(A)/4) = 0.24
PR(C) = 0.2 + 0.8*(PR(A)/4) = 0.24
PR(D) = 0.2 + 0.8*(PR(A)/4 + PR(B)/2 + PR(C)/2) = 0.432
PR(E) = 0.2 + 0.8*(PR(A)/4 + PR(D)/1) = 0.5856
PR(F) = 0.2 + 0.8*(PR(B)/2 + PR(C)/2) = 0.392
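The numbers on this slide can be reproduced by repeatedly applying the PageRank formula until the values stop changing. The sketch below is one simple way to do that, not Google’s implementation. The edge list is an assumption: it is inferred from the terms in the equations (A links to B, C, D and E; B and C each link to D and F; D links to E), since the figure itself is not reproduced here. The function name `pagerank` and the fixed iteration count are also choices made for this example.

```python
def pagerank(outlinks, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over the inlinks T of A."""
    pages = list(outlinks)
    # C(X): number of outlinks of X, with 0 treated as 1 as on the slide.
    c = {p: max(len(outlinks[p]), 1) for p in pages}
    # Invert the outlink lists so each page knows which pages link to it.
    inlinks = {p: [] for p in pages}
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].append(source)
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / c[t] for t in inlinks[p]) for p in pages}
    return pr


# Edges inferred from the slide's equations (an assumption, see above).
graph = {
    "A": ["B", "C", "D", "E"],
    "B": ["D", "F"],
    "C": ["D", "F"],
    "D": ["E"],
    "E": [],
    "F": [],
}

ranks = pagerank(graph, d=0.8)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

Because this small graph has no cycles, the values settle after a handful of passes; on the real web, with its many cycles, the same iteration only approaches the answer gradually.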

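PageRank Notes II: The random surfer model

Imagine randomly surfing the web, clicking links at random, and never clicking Back. Occasionally, you click a “random page” button and start again.

PageRank corresponds to the probability that a random surfer will find the page. In the formula above, d is the probability of following a link from the current page and 1 – d is the probability of clicking the “random page” button; a page that is not linked to from anywhere gets only the 1 – d term.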

Important pages are likely to have very many inlinks, or maybe just a few inlinks, but each of those from important pages.

There is much more detail about PageRank, but we have seen the basic ideas.

The Google paper is essential reading rather than lecture material, but if there’s time, here are some slides anyway.

Inside Google

High level anatomy

Several web-crawlers on different servers continually crawl the www searching for new pages and links, to help Google maintain its view of the www’s link structure.

Pages are sent to the storeserver, which compresses and stores them in the repository.

Every unique www page is given a unique docID.

The Links database is just a collection of pairs of docIDs. PageRanks are calculated using this.

The indexer

A key component that takes a www page from the repository, and then stores information about it in a way that supports fast search.

The indexer:

– parses the document (recognises tags, headings, links, text, etc.)

– converts the document into a set of hits, where each hit is characterised by type (plain (ordinary text) or fancy (within a tag)), the wordID, position in the document, and some other info.

The hits are distributed among the barrels.

The Barrels contain the main index that supports the front end of the search process.

When you type a query into Google, info in the barrels is used to identify which documents actually contain the query words.

Most barrels contain a straightforward index, like this:

wordID, docID, hit, hit, hit …, docID, hit, hit, hit …, …
wordID, docID, hit, hit, hit …, docID, hit, hit, hit …, …

i.e. when you look up a particular word in a barrel, the entry for that word is a list of docIDs (these are the documents on the www that contain the word), each followed by the list of hits for that word in that document.

Barrel contents example

E.g. these may be partial entries for two successive words in a barrel:

potato: doc12852; 3; 101, 178, 2009; doc12990; 1; 809; …
quake: doc07828; 1; 16; doc10023; 4; 3, 11, 12, 678; …

When you search on “quake” in Google, this barrel tells Google all of the documents on the www (that it knows about so far) in which the word quake occurs. At the beginning of the entry, we see that quake occurs in document doc07828, just once, as the 16th word in that document. It also occurs in doc10023, 4 times, in positions 3, 11, 12 and 678; and so on.
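As a rough picture of the structure, the two partial entries above can be written as a Python dictionary from word to doclist. This is a sketch only: the real barrels use wordIDs and a compact binary encoding rather than strings and lists, only the entries shown on the slide are included, and the helper names are invented for this example.

```python
# Each word maps to its doclist: (docID, positions of the word in that document).
# Only the partial entries shown on the slide are included.
barrel = {
    "potato": [("doc12852", [101, 178, 2009]), ("doc12990", [809])],
    "quake": [("doc07828", [16]), ("doc10023", [3, 11, 12, 678])],
}


def docs_containing(barrel, word):
    """DocIDs of the documents in which the word occurs."""
    return [doc_id for doc_id, _positions in barrel.get(word, [])]


def hit_counts(barrel, word):
    """Number of occurrences of the word in each document that contains it."""
    return {doc_id: len(positions) for doc_id, positions in barrel.get(word, [])}


quake_docs = docs_containing(barrel, "quake")
quake_counts = hit_counts(barrel, "quake")
print(quake_docs)    # ['doc07828', 'doc10023']
print(quake_counts)  # {'doc07828': 1, 'doc10023': 4}
```

Keeping the positions, rather than just the docIDs, is what later allows a ranking function to reward query words that appear close together.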

Anchor Text, Types of Barrels

Short barrels contain a reduced version of the index, in which the hitlists refer only to Anchor text and other special text such as headings.

The Full barrels contain the entire index (including what is in the short barrels).

The Lexicon contains each word (14,000,000 words) recognised by Google, and a pointer into the short and full barrels that contain the doclists for that word.

Suppose “see here for potato info” is a hyperlink in doc1 that points to doc2. Google considers the hyperlink text to be contained in doc2 for search/retrieval purposes.
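A minimal sketch of this idea follows: when indexing the words of a hyperlink, credit them to the page the link points to rather than the page the link sits on. The function name and the (source, target, anchor text) tuple layout are assumptions made for this example.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Index anchor-text words under the link's target document.

    `links` is an iterable of (source_doc, target_doc, anchor_text) tuples.
    """
    anchor_index = defaultdict(set)
    for _source, target, text in links:
        for word in text.lower().split():
            # The word is credited to the target of the link, not its source.
            anchor_index[word].add(target)
    return anchor_index


# The example from the slide: a link in doc1 pointing to doc2.
links = [("doc1", "doc2", "see here for potato info")]
anchor_index = index_anchor_text(links)
print(sorted(anchor_index["potato"]))  # ['doc2']
```

A search for "potato" can then retrieve doc2 even if the word never appears in doc2’s own text.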

Google Query Evaluation

1. The query is parsed (removing common words etc.)
2. Words are converted into wordIDs
3. For each word in the query, find the beginning of its doclist in the short barrels
4. Scan the doclists until a document D is found that matches all the search wordIDs
5. Compute Rank of D for the query (this is PageRank combined with other things)
6. If we are in the short barrels but have reached the end of any one of the doclists, then move into the full barrels, positioning ourselves at the beginning of each doclist, and go to 4.
7. If we are not at the end of any doclist, go to step 4. Otherwise, sort by rank all the documents that have matched all wordIDs, and show them to the user.
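The overall flow can be sketched as follows, under several simplifying assumptions that are not from the paper: plain words stand in for wordIDs, set intersection replaces the incremental scan of the doclists, the stop-word list is made up, and the rank is simply PageRank plus a fixed bonus for documents matched in the short (anchor and heading) index. The barrels and PageRank values are invented test data.

```python
STOP_WORDS = {"in", "the", "a", "of"}  # hypothetical list of common words


def matching_docs(barrel, words):
    """Documents that appear in the doclist of every query word (step 4)."""
    doclists = [set(barrel.get(word, [])) for word in words]
    return set.intersection(*doclists) if doclists else set()


def evaluate(query, short_barrel, full_barrel, pagerank):
    # Step 1: parse the query and drop common words (step 2 is skipped here).
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    ranks = {}
    # Steps 3 to 5: match in the short index first and rank each match.
    for doc in matching_docs(short_barrel, words):
        ranks[doc] = pagerank[doc] + 1.0  # assumed bonus for anchor/heading hits
    # Step 6: then make a second pass over the full index.
    for doc in matching_docs(full_barrel, words):
        ranks.setdefault(doc, pagerank[doc])
    # Step 7: sort all matching documents by rank, best first.
    return sorted(ranks, key=lambda doc: ranks[doc], reverse=True)


# Invented data: d2 matches in the short index, d1 only in the full text.
short_barrel = {"flower": ["d2"], "shops": ["d2"], "edinburgh": ["d2", "d3"]}
full_barrel = {
    "flower": ["d1", "d2"],
    "shops": ["d1", "d2"],
    "edinburgh": ["d1", "d2", "d3"],
}
pagerank = {"d1": 0.9, "d2": 0.3, "d3": 0.5}

results = evaluate("flower shops in Edinburgh", short_barrel, full_barrel, pagerank)
print(results)  # ['d2', 'd1']
```

In this toy data d2 outranks d1 despite its lower PageRank, because its match comes from anchor or heading text; how much weight such hits should carry is exactly the kind of tuning the "other things" in step 5 covers.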

Coursework 3

Search Engine Optimization (SEO) is the task of designing a web page in such a way that it is more likely to be highly ranked (i.e. appear high on the list) in the set of pages returned after suitable queries to a search engine. E.g. if you have an online presence via which you sell PacMan-style games software, you would wish to do SEO, so that when a user searches Google with the query “pacman games”, your page is high on the list.

You will find lots of information on the WWW about SEO.

Write a <=250 word statement (the word limit is strict), which explains how to do effective SEO without spending any money (e.g. without paying for Google AdWords). In addition, list up to 5 URLs that were key to your research.

Marking: Content 70% / Clarity 30%

As usual, email PDF to [email protected]