Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval


Page 1:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 1: Boolean retrieval

Page 2:
Information Retrieval (IR)
- Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
- Started in the 1950s. SIGIR (1980), TREC (1992)
- The field of IR also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents
  - clustering
  - classification
- Scale: from web search to personal information retrieval

Page 3:
How good are the retrieved docs?
- Precision: fraction of retrieved docs that are relevant to the user's information need
- Recall: fraction of relevant docs in the collection that are retrieved
- More precise definitions and measurements to follow in later lectures

Page 4:
Boolean retrieval
- Queries are Boolean expressions
  - e.g., Brutus AND Caesar
- Shakespeare's Collected Works
  - Which plays of Shakespeare contain the words Brutus AND Caesar?
- The search engine returns all documents satisfying the Boolean expression.
  - Does Google use the Boolean model?
- http://www.rhymezone.com/shakespeare/

Page 5:
Term-document incidence
1 if the play contains the word, 0 otherwise

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        1
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0

Query: Brutus AND Caesar but NOT Calpurnia
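A minimal sketch (my own illustration, not part of the slides) of answering the query above by combining the rows of the incidence matrix with bitwise AND, complementing the Calpurnia row:

    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]
    incidence = {
        "Brutus":    [1, 1, 0, 1, 0, 0],
        "Caesar":    [1, 1, 0, 1, 1, 1],
        "Calpurnia": [0, 1, 0, 0, 0, 0],
    }
    # Brutus AND Caesar AND NOT Calpurnia, one bit per play
    answer = [b & c & (1 - p) for b, c, p in
              zip(incidence["Brutus"], incidence["Caesar"], incidence["Calpurnia"])]
    print([play for play, hit in zip(plays, answer) if hit])
    # ['Antony and Cleopatra', 'Hamlet']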

Page 6:
Inverted index
- For each term T, we must store a list of all documents that contain T.

  Brutus    -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  Caesar    -> 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
  Calpurnia -> 13 -> 16

- Dictionary on the left; postings lists on the right, sorted by docID
- Each docID entry in a postings list is a posting

Page 7:
Boolean query processing: AND
- Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the dictionary; retrieve its postings.
  - Locate Caesar in the dictionary; retrieve its postings.
  - "Merge" (intersect) the two postings lists (see the sketch below):

  Brutus -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  Caesar -> 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
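A minimal sketch (my own, consistent with the merge described above) of the linear-time postings intersection; both lists must be sorted by docID:

    def intersect(p1, p2):
        # walk both docID-sorted lists in step, keeping only common docIDs
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))  # [2, 8]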

Page 8:
Example: WestLaw (http://www.westlaw.com/)
- Commercially successful Boolean retrieval
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still use Boolean queries
- Example query:
  - What is the statute of limitations in cases involving the federal tort claims act?
  - LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  - /3 = within 3 words, /S = in same sentence

Page 9:
Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then AND them together.

  Brutus    -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  Caesar    -> 1 -> 2 -> 3 -> 5 -> 8 -> 16 -> 21 -> 34
  Calpurnia -> 13 -> 16

Query: Brutus AND Calpurnia AND Caesar
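The standard heuristic from the textbook (the answer to the question posed above, not spelled out on this slide) is to process terms in order of increasing document frequency, so the smallest postings lists are intersected first and intermediate results stay small. A minimal sketch:

    from functools import reduce

    postings = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
        "Calpurnia": [13, 16],
    }
    by_length = sorted(postings, key=lambda t: len(postings[t]))  # Calpurnia first
    result = reduce(lambda a, b: sorted(set(a) & set(b)),
                    (postings[t] for t in by_length))
    print(by_length, result)  # ['Calpurnia', 'Brutus', 'Caesar'] [16]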

Page 10:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 2: The term vocabulary and postings lists

Page 11:
Parsing a document
- Before we start worrying about terms, we need to know the format and language of each document
- What format is it in?
  - pdf / word / excel / html?
- What language is it in?
- What character set is in use?
- Each of these is a classification problem, but often done heuristically

Page 12:
What is the unit of a document?
- A file?
  - Traditional Unix stores a sequence of emails in one file, but you might want to regard each email as a separate document
- An email with 5 attachments?
- Indexing granularity, e.g. a collection of books
  - Each book as a document? Each chapter? Each paragraph? Each sentence?
- Precision/recall tradeoff
  - Small unit: good precision, poor recall
  - Big unit: good recall, poor precision

Page 13:
Tokenization
- Input: "Friends, Romans, Countrymen"
- Output: tokens
  - Friends
  - Romans
  - Countrymen
- Each such token is now a candidate for an index entry, after further processing (described below)
- But what are valid tokens to emit?

Page 14:
Common terms: stop words
- Stop words = extremely common words that appear to be of little value in helping select documents matching a user need
  - a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
- There are a lot of them: ~30% of postings for the top 30 words
- Stop word elimination used to be standard in older IR systems
  - Size of stop list: 200-300 terms, or as few as 7-12 terms

Page 15:
Current trend
- The trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space needed for including stop words in a system is very small
  - Good query optimization techniques mean you pay little at query time for including stop words
- You need them for:
  - Phrase queries: "King of Denmark"
  - Various song titles, etc.: "Let it be", "To be or not to be"
  - "Relational" queries: "flights to London"
- Nowadays search engines generally do not eliminate stop words

Page 16:
Normalization
- Need to "normalize" terms in indexed text as well as query terms into the same form
  - We want to match U.S.A. and USA
- We most commonly implicitly define equivalence classes of terms
  - e.g., by deleting periods in a term
- The alternative is to do asymmetric expansion:
  - Enter: window   Search: window, windows
  - Enter: windows  Search: Windows, windows, window
  - Enter: Windows  Search: Windows (no expansion)
- Two approaches for the (more powerful) alternative
  - Index unnormalized tokens and expand query terms
  - Expand during index construction
  - Both are less efficient than equivalence classing

Page 17:
Case folding
- Reduce all letters to lower case
  - Exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
- Often best to lower-case everything, since users will use lowercase regardless of 'correct' capitalization

Page 18:
Lemmatization
- Reduce inflectional/variant forms to the base form
  - am, are, is → be
  - car, cars, car's, cars' → car
  - the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing "proper" reduction to the dictionary headword form

Page 19:
Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat
- Example:
  - for example compressed and compression are both accepted as equivalent to compress.
  - becomes: for exampl compress and compress ar both accept as equival to compress

Page 20:
Porter's algorithm (1980)
- Most common algorithm for stemming English
- Results suggest it is at least as good as other stemming options
- 5 phases of reductions
  - phases applied sequentially
  - within each phase, there are various conventions for selecting rules
  - e.g., sample convention: of the rules in a compound command, select the one that applies to the longest suffix
- http://www.tartarus.org/~martin/PorterStemmer/
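A small illustration of the Porter stemmer in practice, assuming the NLTK library is installed (NLTK is my choice here, not part of the slides); it roughly reproduces the example from the stemming slide:

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    words = ("for example compressed and compression are both "
             "accepted as equivalent to compress").split()
    print(" ".join(stemmer.stem(w) for w in words))
    # roughly: "for exampl compress and compress ar both accept as equival to compress"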

Page 21:
Phrase queries
- Want to be able to answer queries such as "stanford university" – as a phrase
- "The inventor Stanford Ovshinsky never went to university" is not a match.
- The concept of phrase queries has proven easily understood by users
  - 10% of web queries are phrase queries
- For this, it no longer suffices to store only <term : docs> entries
- Any ideas?

Page 22:
Solution 2: Positional indexes
- In the postings, store, for each term, entries of the form (see the sketch below):
  <term, number of docs containing term;
   doc1: position1, position2, ... ;
   doc2: position1, position2, ... ;
   etc.>
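A hedged sketch (my own, not from the slides) of a positional index built as a Python dictionary, and a phrase check for "stanford university" via adjacent positions:

    from collections import defaultdict

    docs = {1: "the inventor stanford ovshinsky never went to university",
            2: "stanford university is in california"}

    index = defaultdict(dict)                       # term -> {docID: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)

    def phrase_hits(t1, t2):
        # docs where some position of t1 is immediately followed by t2
        hits = []
        for doc_id in index.get(t1, {}).keys() & index.get(t2, {}).keys():
            if any(p + 1 in index[t2][doc_id] for p in index[t1][doc_id]):
                hits.append(doc_id)
        return hits

    print(phrase_hits("stanford", "university"))   # [2]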

Page 23:
Proximity queries: same idea
- employment /3 place
  - Find all documents that contain employment and place within 3 words of each other
- "Employment agencies that place healthcare workers are seeing growth"
  - a hit
- "Employment agencies that help place healthcare workers are seeing growth"
  - not a hit
- Clearly, positional indexes can be used for such queries; biword indexes cannot.

Page 24:
Positional index size
- You can compress position values/offsets: covered in chapter 5
- Nevertheless, a positional index expands postings storage substantially
  - Need an entry for each occurrence, not just one per document
  - Compare to biword indexes: "index blowup due to a bigger dictionary"
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries

Page 25:
Rules of thumb
- A positional index is 2-4 times as large as a non-positional index
- Positional index size is 35-50% of the volume of the original text
- Caveat: the above holds for English-like languages

Page 26:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 3: Dictionaries and tolerant retrieval
Chapter 4: Index construction
Chapter 5: Index compression

Page 27:
Dictionary
- The dictionary is the data structure for storing the term vocabulary
- For each term, we need to store:
  - its document frequency
  - a pointer to its postings list

Page 28:
Dictionary data structures
- Two main choices:
  - Hash table
  - Tree
- Some IR systems use hashes, some use trees
- Criteria in choosing hash vs. tree:
  - Is the number of terms fixed, or does it keep growing?
  - Relative frequencies with which various keys are accessed
  - How many terms?

Page 29:
Distributed indexing
- For web-scale indexing (don't try this at home!): must use a distributed computing cluster
- Individual machines are fault-prone
  - Can unpredictably slow down or fail
- How do we exploit such a pool of machines?

Page 30:
Google data centers
- Google data centers mainly use commodity machines
- Data centers are distributed around the world
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
- Estimate: Google installs 100,000 servers each quarter
  - Based on expenditures of $200-250 million per year
- This would be 10% of the computing capacity of the world!?

Page 31:
Data flow
[Figure: MapReduce index-construction data flow. The master assigns splits of the input to parsers (map phase); each parser writes its key-value pairs into segment files partitioned by term range (a-f, g-p, q-z); each inverter (reduce phase) collects one term partition and writes its postings.]

Page 32:
MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple architecture for distributed computing ...
- ... without having to write code for the distribution part.

Page 33:
MapReduce
- MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs
  - For indexing: (termID, docID)
- Map: map splits of the input data to key-value pairs
- Reduce: all values for a given key are stored close together, so that they can be read and processed quickly
  - This is achieved by partitioning the keys into j term partitions and having the parsers write key-value pairs for each term partition into a separate segment file
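A toy, single-machine sketch (my own simplification, not the book's or Google's implementation) of the map and reduce steps for index construction over (docID, text) pairs:

    def map_phase(doc_id, text):
        # emit one (term, docID) pair per token
        return [(term, doc_id) for term in text.lower().split()]

    def reduce_phase(pairs):
        # group by term; sorting brings all pairs for a term together
        index = {}
        for term, doc_id in sorted(pairs):
            index.setdefault(term, [])
            if not index[term] or index[term][-1] != doc_id:
                index[term].append(doc_id)
        return index

    pairs = map_phase(1, "Caesar was killed") + map_phase(2, "Brutus killed Caesar")
    print(reduce_phase(pairs))
    # {'brutus': [2], 'caesar': [1, 2], 'killed': [1, 2], 'was': [1]}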

Page 34:
MapReduce
- Dean and Ghemawat describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
- Index construction was just one phase.
- Another phase: transforming a term-partitioned index into a document-partitioned index.
  - Term-partitioned: one machine handles a subrange of terms
  - Document-partitioned: one machine handles a subrange of documents
- (As we discuss in the web part of the course) most search engines use a document-partitioned index (better load balancing, etc.)

Page 35:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 6: Scoring, term weighting and the vector space model

Page 36:
Ranked retrieval
- Thus far, our queries have all been Boolean.
  - Documents either match or don't
- Good for expert users with a precise understanding of their needs and the collection.
- Also good for applications, which can easily consume 1000s of results
- Not good for the majority of users.
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work)
  - Most users don't want to wade through 1000s of results.
  - This is particularly true of web search.

Page 37:
Problem with Boolean search: feast or famine
- Boolean queries often result in either too few (=0) or too many (1000s) results.
- Query 1: "standard user dlink 650"
  - 200,000 hits
- Query 2: "standard user dlink 650 no card found"
  - 0 hits
- It takes skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many

Page 38:
Take 1: Jaccard coefficient
- A commonly used measure of the overlap of two sets A and B
- jaccard(A,B) = |A ∩ B| / |A ∪ B|
- jaccard(A,A) = 1
- jaccard(A,B) = 0 if A ∩ B = ∅
- Always assigns a number between 0 and 1.
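A minimal sketch (my own illustration) of scoring a query against a document with the Jaccard coefficient, treating both as sets of terms:

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    print(jaccard("ides of march".split(),
                  "caesar died in march".split()))   # 1/6 ≈ 0.167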

Page 39:
Issues with Jaccard for scoring
- It doesn't consider term frequency (how many times a term occurs in a document)
  - → tf weight
- Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information
  - → idf weight
- We need a more sophisticated way of normalizing for length
  - → cosine

Page 40:
Bag of words model
- The vector representation doesn't consider the ordering of words in a document
- "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
- This is called the bag of words model.

Page 41:
Term frequency
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use term frequency when computing query-document match scores. But how?
- Raw term frequency may not be what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
  - Relevance does not increase proportionally with term frequency.

Page 42:
Term frequency (tf) weight
- There are many variants of the tf weight; log-frequency weighting is a common one, dampening the effect of raw tf (the raw count):

  w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}

- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- The score is 0 if none of the query terms is present in the document.

Page 43:
Document frequency
- Rare terms are more informative than frequent terms
  - Recall stop words
- Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
- → We want a high weight for rare terms like arachnocentric.

Page 44:
Document frequency, continued
- Consider a query term that is frequent in the collection (e.g., high, increase, line)
- A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
- For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
- We will use document frequency (df) to capture this in the score.
- df (≤ N) is the number of documents that contain the term

Page 45:
Inverse document frequency (idf) weight
- df_t is the document frequency of t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
  - Inverse document frequency is a direct measure of the informativeness of t
- We define the idf (inverse document frequency) of t by

  \mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)

  - use log to dampen the effect of N/df_t
  - this is the most common variant of the idf weight

Page 46:
idf example, suppose N = 1 million

term       df_t        idf_t
calpurnia  1           6
animal     100         4
sunday     1,000       3
fly        10,000      2
under      100,000     1
the        1,000,000   0

There is one idf value for each term t in a collection.
\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)

Page 47:
Collection vs. document frequency
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example: which word is a better search term (and should get a higher weight)?

Word       Collection frequency   Document frequency
insurance  10440                  3997
try        10422                  8760

- The example suggests that df is better for weighting than cf

Page 48:
tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight: tf weight(t,d) × idf weight(t)
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
- Best known instantiation of tf-idf weighting:

  \mathrm{tf\text{-}idf}_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)
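A small sketch (my own, based directly on the formulas above) computing tf-idf weights over a toy collection of N documents:

    import math

    docs = ["caesar and brutus", "brutus killed caesar", "the tempest"]
    N = len(docs)

    df = {}                                           # document frequency per term
    for text in docs:
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1

    def tf_idf(term, text):
        tf = text.split().count(term)
        if tf == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df[term])

    print(round(tf_idf("brutus", docs[1]), 3))  # (1 + log10 1) * log10(3/2) ≈ 0.176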

Page 49:
Recall: binary term-document incidence matrix

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        0
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|

Page 50:
Term-document count matrices
- Consider the number of occurrences of a term in a document
- Each document is a count vector in ℕ^|V|: a column below

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      157                   73             0            0       0        0
Brutus      4                     157            0            1       0        0
Caesar      232                   227            0            2       1        1
Calpurnia   0                     10             0            0       0        0
Cleopatra   57                    0              0            0       0        0
mercy       2                     0              3            5       5        1
worser      2                     0              1            1       1        0

Page 51:
Binary → count → weight matrix

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      5.25                  2.44           0            0       0        0
Brutus      0.16                  6.10           0            0.04    0        0
Caesar      8.59                  8.40           0            0.07    0.04     0.04
Calpurnia   0                     1.54           0            0       0        0
Cleopatra   2.85                  0              0            0       0        0
mercy       1.51                  0              2.27         3.78    3.78     0.76
worser      1.37                  0              0.69         0.69    0.69     0

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Page 52:
Documents as vectors
- So we have a |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
- Very high-dimensional
  - hundreds of millions of dimensions when you apply this to a web search engine
- Each document vector is very sparse
  - most entries are zero

Page 53:
Queries as vectors
- Key idea 1: Do the same for queries: represent them as vectors in the space
- Key idea 2: Rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: We do this because we want to get away from the either-in-or-out Boolean model.
  - Instead: rank more relevant documents higher than less relevant documents

Page 54:
Why Euclidean distance is bad
- The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Page 55:
Use angle instead of distance
- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.

Page 56:
Length normalization
- A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

  \|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}

- Dividing a vector by its L2 norm makes it a unit (length) vector
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- The cosine of the angle between two normalized vectors is the dot product of the two

Page 57:
cosine(query, document)

  \cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{\|\vec{q}\|\,\|\vec{d}\|} = \frac{\vec{q}}{\|\vec{q}\|}\cdot\frac{\vec{d}}{\|\vec{d}\|} = \frac{\sum_{t=1}^{|V|} q_t d_t}{\sqrt{\sum_{t=1}^{|V|} q_t^2}\,\sqrt{\sum_{t=1}^{|V|} d_t^2}}

  (dot product of the two unit vectors)

- q_t is the tf-idf weight of term t in the query
- d_t is the tf-idf weight of term t in the document
- cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
- The cosine similarity can be seen as a method of normalizing document length during comparison

Page 58:
Cosine similarity example

      d     q     normalized d   normalized q
t1    0.5   1.5   0.51           0.83
t2    0.8   1     0.81           0.555
t3    0.3   0     0.30           0

sim(d, q) = (0.5×1.5 + 0.8×1 + 0.3×0) / (sqrt(0.5² + 0.8² + 0.3²) × sqrt(1.5² + 1² + 0²))
          = 1.55 / (0.99 × 1.8) = 0.87

sim(d, q) = 0.51×0.83 + 0.81×0.555 + 0.30×0 = 0.87
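A quick numeric check (my own) of the worked example above:

    import math

    d, q = [0.5, 0.8, 0.3], [1.5, 1.0, 0.0]
    dot = sum(x * y for x, y in zip(d, q))
    cos = dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))
    print(round(cos, 2))   # 0.87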

Page 59:
More variants of tf-idf weighting
- SMART notation: columns headed 'n' are acronyms for weight schemes.

Page 60:
Summary: vector space ranking
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top k (e.g., k = 10) to the user

Page 61:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 7: Computing scores in a complete search system

Page 62:
Contents
- Speeding up vector space ranking
- Putting together a complete search system

Page 63:
Cluster pruning: query processing
- Process a query as follows:
  - Given query Q, find its nearest leader L.
  - Seek the K nearest docs from among L's followers.

Page 64:
Visualization
[Figure: cluster pruning in the vector space – the query, the cluster leaders, and their followers.]

Page 65:

Putting it all together

Page 66:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 8: Evaluation and result summaries

Page 67:
Summaries
- The title is typically automatically extracted from document metadata. What about the summaries?
  - This description is crucial.
  - The user can identify good/relevant hits based on the description.
- Two basic kinds:
  - Static
  - Dynamic
- A static summary of a document is always the same, regardless of the query that hit the doc
- A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand

Page 68:
Dynamic summaries
- Present one or more "windows" within the document that contain several of the query terms
- "KWIC" snippets: keyword-in-context presentation
- Generated in conjunction with scoring
  - If the query is found as a phrase, show all or some occurrences of the phrase in the doc
  - If not, show document windows that contain multiple query terms
- The summary itself gives the entire content of the window – all terms, not only the query terms – how?

Page 69:

Evaluating search engines

Page 70:
Relevance to what?
- Relevance is assessed relative to the information need, not the query
- E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- Query: wine red white heart attack effective
- You evaluate whether the doc addresses the information need, not whether it has these words
- Our terminology is sloppy: we talk about query-document relevance judgments although we mean information-need-document relevance judgments

Page 71:
Unranked retrieval evaluation: precision and recall
- Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
- Precision P = tp / (tp + fp)
- Recall R = tp / (tp + fn)

                Relevant   Nonrelevant
Retrieved       tp         fp
Not retrieved   fn         tn

Page 72:
Should we instead use the accuracy measure for evaluation?
- Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant"
- The accuracy of an engine: the fraction of these classifications that are correct
- Accuracy is a commonly used evaluation measure in machine learning classification work
- Why is this not a very useful evaluation measure in IR?

Page 73:
Why not just use accuracy?
- How to build a 99.9999% accurate search engine on a low budget...
  [Mock search box: any query returns "0 matching results found."]
- People doing information retrieval want to find something and have a certain tolerance for junk.

Page 74:
Precision/recall tradeoff
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
- In a good system, precision decreases as either the number of docs retrieved or recall increases
  - This is not a theorem, but a result with strong empirical confirmation

Page 75:
A combined measure: F
- The combined measure that assesses the precision/recall tradeoff is the F measure, a weighted harmonic mean of P and R:

  F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha}

- People usually use the balanced F measure, F1 (i.e., β = 1 or α = 1/2), the harmonic mean of P and R:

  \frac{1}{F_1} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)

- Does β < 1 emphasize P or R?
- If either P or R is bad, F is bad.
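A small numeric check (my own example) of precision, recall, and the balanced F measure:

    def precision_recall_f1(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f1 = 2 * p * r / (p + r)          # harmonic mean of P and R
        return p, r, f1

    # e.g., 40 relevant docs retrieved, 60 nonrelevant retrieved, 40 relevant docs missed
    print(precision_recall_f1(tp=40, fp=60, fn=40))   # (0.4, 0.5, ≈0.444)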

Page 76:
Evaluating ranked results
- P/R/F are measured for unranked sets
- We can easily turn set measures into measures for ranked results
- The system can return any number of results
- Just use the set measures for each "prefix": the top 1, top 2, top 3, top 4, etc., results
- Doing this for precision and recall produces a precision-recall curve, where a "prefix" corresponds to a level of recall

Page 77:
A precision-recall curve
[Figure: precision (y-axis) plotted against recall (x-axis), both from 0.0 to 1.0.]
- Sawtooth shape:
  - If the (k+1)th doc is non-relevant, R is the same as for the top k docs, but P has dropped
  - If it is relevant, then both P and R increase, and the curve jags up and to the right
- Often useful to remove the jiggles: interpolation
  - Take the maximum precision of all future points
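A brief sketch (my own) of that interpolation rule: at each point on the curve, take the maximum precision at that recall level or any higher one:

    def interpolate(precisions):
        # precisions are given in order of increasing recall
        out, best = [], 0.0
        for p in reversed(precisions):    # scan from highest recall back to lowest
            best = max(best, p)
            out.append(best)
        return list(reversed(out))

    print(interpolate([1.0, 0.5, 0.67, 0.5, 0.4]))   # [1.0, 0.67, 0.67, 0.5, 0.4]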

Page 78:
11-point interpolated average precision
- The entire precision-recall graph is very informative, but there is often a desire to boil this information down to a few numbers, or even a single number
- 11-point interpolated average precision
  - The standard measure in the early TREC competitions
  - Take the interpolated precision at 11 levels of recall, varying from 0 to 1 by tenths, and average over queries
  - Evaluates performance at all recall levels

Page 79:
Typical (good) 11-point precisions
- SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: 11-point precision-recall curve; precision on the y-axis, recall on the x-axis, both from 0 to 1.]

Page 80:
Mean average precision (MAP)
- Recently, other measures have become more common. The most standard among the TREC community is MAP
  - A single-figure measure of quality across recall levels
  - Good discrimination and stability
- For a single information need, average precision is the average of the precision values obtained for the top k docs each time a relevant doc is retrieved
  - Approximates the area under the un-interpolated precision-recall curve
- Then, this value (average precision) is averaged over many information needs to get MAP
  - Approximates the average area under the precision-recall curve for a set of queries
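A compact sketch (my own) of average precision for one ranked result list, given relevance judgments as 0/1 flags in rank order (assuming, for simplicity, that all relevant docs appear somewhere in the ranking):

    def average_precision(rels):
        hits, total = 0, 0.0
        for k, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                total += hits / k          # precision at this rank
        return total / hits if hits else 0.0

    print(round(average_precision([1, 0, 1, 0, 0, 1]), 3))
    # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722; averaging this over queries gives MAP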

Page 81:
Yet more evaluation measures...
- The above measures factor in precision at all recall levels
- For many prominent applications, e.g., web search, this may not be appropriate: what matters is rather how many good results there are on the first page or the first 3 pages
  - Leads to measuring precision at fixed low levels (e.g., 10 or 30) of retrieved results
- Precision at k: precision of the top k results
  - Standard for web search
  - Cons: the least stable among commonly used measures; does not average well, because the total number of relevant docs for a query has a strong influence on precision at k
- R-precision alleviates this problem
  - But may not be feasible for web search

Page 82:
R-precision
- If we have a known (though perhaps incomplete) set of relevant documents of size |Rel|, then calculate the precision of the top |Rel| docs returned
  - Averaging this measure across queries makes more sense
- If there are |Rel| relevant docs for a query, we examine the top |Rel| results and find that r are relevant. Then:
  - recall = precision = r / |Rel|
  - Thus, R-precision is identical to the break-even point
- Empirically, highly correlated with MAP

Page 83:
Critique of pure relevance
- Assumption: the relevance of one doc is treated as independent of the relevance of other docs in the collection
  - But a document can be redundant (e.g., duplicates) even if it is highly relevant
- Marginal relevance: concerns whether a doc still has distinctive usefulness after the user has looked at certain other documents (Carbonell and Goldstein 1998)
- Maximizing marginal relevance requires returning documents that exhibit diversity and novelty

Page 84:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 19: Web search basics

Page 85:
1. Brief history and overview
- Early keyword-based engines
  - Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997
- A hierarchy of categories
  - Yahoo!
  - Many problems, popularity declined. Existing variants are About.com and the Open Directory Project
- Classical IR techniques continue to be necessary for web search, but are by no means sufficient
  - E.g., classical IR measures relevancy; web search also needs to measure authoritativeness

Page 86:
Web search overview
[Figure: overview of a web search engine. The Web is crawled by a web spider and fed to the indexer, which builds the indexes (alongside separate ad indexes); the user's query goes to the search component, which returns a results page (illustrated with Google results for the query "miele", showing algorithmic results plus sponsored links).]

Page 87:
Web IR: differences from traditional IR
- Links: the web is a hyperlinked document collection
- Queries: web queries are different, more varied, and there are a lot of them
  - How many? 10^8 every day, approaching 10^9
- Users: users are different, more varied, and there are a lot of them
  - How many? 10^9
- Documents: documents are different, more varied, and there are a lot of them
  - How many? ~10^11; about 10^10 indexed
- Context: context is more important on the web than in many other IR applications
- Ads and spam

Page 88:
Duplicate documents
- Significant duplication: 30-40% duplicates in some studies
- Duplicates in search results were common in the early days of the Web
- Today's search engines eliminate duplicates very effectively
  - Key for high user satisfaction

Page 89:
Duplicate detection
- The web is full of duplicated content
- Strict duplicate detection = exact match
  - Not as common
- But many, many cases of near duplicates
  - E.g., the last-modified date is the only difference between two copies of a page
- Various techniques
  - Fingerprints, shingles, sketches

Page 90:
Size of the web: issues
- How to define size? Number of web servers? Number of pages? Terabytes of data available?
- Some servers are seldom connected
  - Example: your laptop running a web server
  - Is it part of the web?
- The "dynamic" web is infinite
  - Any sum of two numbers is its own dynamic page on Google (e.g., "2+4")

Page 91:
Goal of spamming on the web
- You have a page that will generate lots of revenue for you if people visit it
- Therefore, you'd like to redirect visitors to this page
- One way of doing this: get your page ranked highly in search results

Page 92:
Simplest forms
- First-generation engines relied heavily on tf-idf
- Hidden text: dense repetitions of chosen keywords
  - Often, the repetitions would be in the same color as the background of the web page, so that repeated terms got indexed by crawlers but were not visible to humans in browsers
- Keyword stuffing: misleading meta-tags with excessive repetition of chosen keywords
- These used to be effective; most search engines now catch them
- Spammers responded with a richer set of spam techniques

Page 93:
Link spam
- Create lots of links pointing to the page you want to promote
- Put these links on pages with high (or at least non-zero) PageRank
  - Newly registered domains (domain flooding)
  - A set of pages pointing to each other to boost each other's PageRank (mutual admiration society)
  - Pay somebody to put your link on their highly ranked page (the "schuetze horoskop" example)
    - http://www-csli.stanford.edu/~hinrich/horoskop-schuetze.html
  - Leave comments that include the link on blogs
  - Link farms

Page 94:
Search engine optimization
- Promoting a page is not necessarily spam
- It can also be a legitimate business, which is called SEO
  - You can hire an SEO firm to get your page highly ranked
- Motives
  - Commercial, political, religious, lobbies
  - Promotion funded by advertising budget
- Operators
  - Contractors (search engine optimizers) for lobbies, companies
  - Web masters
  - Hosting services
- Forums
  - E.g., Webmaster World (www.webmasterworld.com)

Page 95:
3. Advertising as economic model
- Sponsored search ranking: Goto.com (morphed into Overture.com → Yahoo!)
  - Your search ranking depended on how much you paid
  - Auction for keywords: casino was expensive!
  - No separation of ads/docs
- 1998+: link-based ranking pioneered by Google
  - Blew away all early engines
  - Google added paid-placement "ads" to the side, independent of search results
  - Strict separation of ads and results

Page 96:
First generation of search ads: Goto (1996)
- No separation of ads/docs. Just one result list!
- Buddy Blake bid the maximum ($0.38) for this search
- He paid $0.38 to Goto every time somebody clicked on the link
- Upfront and honest. No relevance ranking, but Goto did not pretend there was any.

Page 97:
[Figure: a modern results page with algorithmic results clearly separated from the ads.]

Page 98:
The appeal of search ads to advertisers
- Why is web search potentially more attractive for advertisers than TV spots, newspaper ads, or radio spots?
- Someone who just searched for "Saturn Aura Sport Sedan" is infinitely more likely to buy one than a random person watching TV.
- Most importantly, the advertiser only pays if the customer took an action indicating interest (i.e., clicking on the ad)

Page 99:
Users of web search
- Use short queries (average < 3 terms)
- Rarely use operators
- Don't want to spend a lot of time composing a query
- Only look at the first couple of results
- Want a simple UI, not a search engine start page overloaded with graphics
- Extreme variability in terms of user needs, user expectations, experience, knowledge, ...
  - Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class
- One interface for hugely divergent needs

Page 100:
User query needs
- Need [Brod02, RL04]
  - Informational – want to learn about something (~40% / 65%), e.g., low hemoglobin
    - Not a single page containing the info
  - Navigational – want to go to that page (~25% / 15%), e.g., United Airlines
  - Transactional – want to do something (web-mediated) (~35% / 20%)
    - Access a service, e.g., Seattle weather
    - Downloads, e.g., Mars surface images
    - Shop, e.g., Canon S410
  - Gray areas
    - Find a good hub, e.g., car rental Brasil
    - Exploratory search: "see what's there"

Page 101:

Query distribution (1)

Page 102:
Query distribution (2)
- Queries have a power-law distribution
- Recall Zipf's law: a few very frequent words, a large number of very rare words
- Same here: a few very frequent queries, a large number of very rare queries
- Examples of rare queries: searches for names, towns, books, etc.
- The proportion of adult queries is much lower than 1/3

Page 103:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 21: Link analysis

Page 104:
The Web as a directed graph
- Assumption 1: a hyperlink is a quality signal
  - A hyperlink between pages denotes author-perceived relevance
- Assumption 2: the anchor text describes the target page
  - We use "anchor text" somewhat loosely here
  - Extended anchor text: a window of text surrounding the anchor text
  - Example: You can find cheap cars <a href= ...>here</a>
- [Figure: Page A links to Page B via a hyperlink; the anchor text is on Page A.]

Page 105:
Google bombs
- Indexing anchor text can have unexpected side effects: Google bombs.
  - What else does not have side effects?
- A Google bomb is a search with "bad" results due to maliciously manipulated anchor text
- Google introduced a new weighting function in January 2007 that fixed many Google bombs

Page 106:

Google bomb example

Page 107:
Cocitation similarity on Google: similar pages
Origins of PageRank: citation analysis (1)

Page 108:

Origins of PageRank: Citation analysis (2)

Page 109:
Query-independent ordering
- First-generation link-based ranking for web search
  - Using link counts as simple measures of popularity
  - Simple link popularity: number of in-links
- First, retrieve all pages meeting the text query (say, venture capital)
- Then, order these by their simple link popularity
- Easy to spam. Why?

Page 110: Introduction to Information Retrievaljg66/teaching/7312/notes/...Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector

Basics for PageRank: random walk

n Imagine a web surfer doing a random walk on web pages:
n start at a random page
n at each step, go out of the current page along one of the links on that page, equiprobably
n In the steady state each page has a long-term visit rate - use this as the page's score
n So, pagerank = steady state probability = long-term visit rate

[Diagram: a page with three out-links, each followed with probability 1/3]
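To make "long-term visit rate" concrete, here is a minimal sketch (not from the slides) that simulates the plain random walk on a tiny hand-made link graph and counts how often each page is visited; the graph, step count, and function name are illustrative assumptions, and the graph deliberately has no dead ends so the walk never gets stuck.

```python
import random

# Hypothetical 3-page graph: each page lists the pages it links to.
links = {0: [1, 2], 1: [0, 2], 2: [0]}

def estimate_visit_rates(links, steps=100_000, seed=0):
    """Simulate a pure random walk and count how often each page is visited."""
    rng = random.Random(seed)
    counts = {page: 0 for page in links}
    current = rng.choice(list(links))          # start at a random page
    for _ in range(steps):
        current = rng.choice(links[current])   # follow an out-link uniformly at random
        counts[current] += 1
    return {page: c / steps for page, c in counts.items()}

print(estimate_visit_rates(links))  # long-term visit rates, i.e. pageranks (no teleport yet)
```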

Page 111:

Not quite enough

n The web is full of dead ends
n the random walk can get stuck in dead ends
n then it makes no sense to talk about long-term visit rates

Page 112:

Teleporting
n Teleport operation: the surfer jumps from a node to any other node in the web graph, chosen uniformly at random from all web pages
n Used in two ways:
n At a dead end, jump to a random web page
n At any non-dead end, with teleportation probability 0 < α < 1 (say, α = 0.1), jump to a random web page; with remaining probability 1 - α (0.9), go out on a random link
n Now the walk cannot get stuck locally
n There is a long-term rate at which any page is visited
n Not obvious; explained later
n How do we compute this visit rate?

Page 113:

Markov chains

n A Markov chain consists of n states, plus an n×n transition probability matrix P.
n At each step, we are in exactly one of the states.
n For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i.
n Clearly, for each i, ∑j=1..n Pij = 1.
n Markov chains are abstractions of random walks
n State = page

[Diagram: states i and j connected by an edge labeled Pij]

Page 114:

Exercise
Represent the teleporting random walk as a Markov chain for the following case (three pages A, B, C), using a transition probability matrix with α = 0.3:

[Figure: link structure and resulting state diagram for the three pages]

Transition matrix:
    0.1   0.45  0.45
    1/3   1/3   1/3
    0.45  0.45  0.1
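A sketch of how the exercise's matrix can be built programmatically. The adjacency matrix below is an assumption chosen so that it reproduces the numbers on the slide (α = 0.3, one dead-end page); the helper name is made up for illustration.

```python
import numpy as np

alpha = 0.3  # teleportation probability from the exercise

# Adjacency assumed to match the slide's matrix: page 0 links to pages 1 and 2,
# page 1 is a dead end, page 2 links to pages 0 and 1.
A = np.array([[0, 1, 1],
              [0, 0, 0],
              [1, 1, 0]], dtype=float)

def transition_matrix(A, alpha):
    """Row-stochastic matrix for the teleporting random walk."""
    n = A.shape[0]
    P = np.empty((n, n))
    for i in range(n):
        out = A[i].sum()
        if out == 0:                           # dead end: teleport uniformly
            P[i] = 1.0 / n
        else:                                  # teleport with prob alpha, else follow a random out-link
            P[i] = alpha / n + (1 - alpha) * A[i] / out
    return P

P = transition_matrix(A, alpha)
print(P)                                  # rows: [0.1 0.45 0.45], [1/3 1/3 1/3], [0.45 0.45 0.1]
assert np.allclose(P.sum(axis=1), 1.0)    # each row sums to 1 (Markov chain property)
```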

Page 115:

Ergodic Markov chains

n A Markov chain is ergodic iff it is irreducible and aperiodic
n Irreducibility: roughly, there is a path from any state to any other
n Aperiodicity: roughly, the nodes cannot be partitioned such that the random walker visits the partitions sequentially
n A non-ergodic Markov chain:

[Figure: two states connected by edges of probability 1 in each direction - a periodic, hence non-ergodic, chain]

Page 116:

Ergodic Markov chains

n Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
n This is the steady-state probability distribution.

n Over a long time-period, we visit each state in proportion to this rate.

n It doesn’t matter where we start.

Page 117:

Formalization of visit: probability vector

n A probability (row) vector x = (x1, … xn) tells us where the walk is at any point.

n e.g., (0 0 0 … 1 … 0 0 0) means we're in state i (the 1 is in position i)
n More generally, the vector x = (x1, … xn) means the walk is in state i with probability xi, where ∑i=1..n xi = 1

Page 118:

Change in probability vector

n If the probability vector is x = (x1, … xn) at this step, what is it at the next step?

n Recall that row i of the transition prob. matrix P tells us where we go next from state i.

n So from x, our next state is distributed as xP

Page 119:

Steady state example

n The steady state is simply a vector of probabilities a = (a1, … an):
n ai is the probability that we are in state i
n ai is the long-term visit rate (or pagerank) of state (page) i
n so we can think of pagerank as a long vector, one entry per page

Page 120:

How do we compute this vector?

n Let a = (a1, … an) denote the row vector of steady-state probabilities.

n If our current position is described by a, then the next step is distributed as aP

n But a is the steady state, so a = aP
n Solving this matrix equation gives us a
n so a is a (left) eigenvector of P
n it corresponds to the principal eigenvector of P, the one with the largest eigenvalue
n transition probability matrices always have largest eigenvalue 1
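As a sketch of this eigenvector view, the steady state a with a = aP can be obtained as the left eigenvector of P for eigenvalue 1 (equivalently, a right eigenvector of the transpose of P). The matrix reused here is the one from the earlier exercise; the function name is illustrative.

```python
import numpy as np

def steady_state_eig(P):
    """Solve a = aP: the left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)     # left eigenvectors of P = right eigenvectors of P^T
    k = np.argmin(np.abs(eigvals - 1.0))      # pick the eigenvalue closest to 1
    a = np.real(eigvecs[:, k])
    return a / a.sum()

# The 3x3 teleporting matrix from the exercise above:
P = np.array([[0.1, 0.45, 0.45],
              [1/3, 1/3, 1/3],
              [0.45, 0.45, 0.1]])
print(steady_state_eig(P))    # the pagerank vector a, satisfying a = aP
```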

Page 121:

One way of computing

n Recall, regardless of where we start, we eventually reach the steady state a

n Start with any distribution, say x = (1 0 … 0).
n After one step, we're at xP; after two steps at xP^2, then xP^3 and so on.
n "Eventually" means: for "large" k, xP^k = a
n Algorithm: multiply x by increasing powers of P until the product looks stable

n This is called the power method
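A minimal power-method sketch, assuming the same 3×3 teleporting matrix from the exercise; the tolerance and iteration cap are arbitrary illustrative choices.

```python
import numpy as np

def power_method(P, tol=1e-10, max_iter=1000):
    """Repeatedly multiply a start distribution by P until it stops changing."""
    n = P.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                        # start anywhere, e.g. x = (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ P                # one step of the walk: x -> xP
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

P = np.array([[0.1, 0.45, 0.45],
              [1/3, 1/3, 1/3],
              [0.45, 0.45, 0.1]])
print(power_method(P))                # same vector as the eigenvector computation above
```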

Page 122:

Power method: example

Page 123:

Pagerank summary
n Preprocessing:
n Given the graph of links, build the transition probability matrix P
n From it, compute a
n The entry ai is a number between 0 and 1: the pagerank of page i
n Query processing:
n Retrieve pages meeting the query
n Rank them by their pagerank
n Order is query-independent
n In practice, pagerank alone wouldn't work
n Google paper: http://infolab.stanford.edu/~backrub/google.html

Page 124:

In practice

n Consider the query "video service"
n Yahoo! has very high pagerank, and contains both words
n With simple pagerank alone, Yahoo! would be top-ranked
n Clearly not desirable
n In practice, a composite score is used in ranking
n Pagerank, cosine similarity, term proximity, etc.
n May apply machine-learned scoring
n Many other clever heuristics are used

Page 125:

How important is PageRank?

Page 126:

Pagerank: Issues and Variants

n How realistic is the random surfer model?
n What if we modeled the back button?
n Surfer behavior is sharply skewed towards short paths
n Search engines, bookmarks & directories make jumps non-random
n Biased Surfer Models
n Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection)
n Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)
n Non-uniform teleportation allows topic-specific pagerank and personalized pagerank

Page 127:

Topic Specific Pagerank
n Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
n Select a category (say, one of the 16 top-level ODP categories) based on a query- & user-specific distribution over the categories
n Teleport to a page uniformly at random within the chosen category
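A sketch of how the teleport step changes for topic-specific pagerank: instead of jumping uniformly over all pages, the surfer jumps according to a distribution concentrated on the chosen category's pages. The adjacency handling mirrors the earlier teleporting sketch; the names, parameters, and example graph are illustrative assumptions, not the exact weighting used by any production system.

```python
import numpy as np

def topic_sensitive_transition(A, alpha, topic_pages):
    """Transition matrix where teleports land only on pages in the chosen topic.

    A: adjacency matrix; alpha: teleport probability; topic_pages: indices of on-topic pages.
    """
    n = A.shape[0]
    v = np.zeros(n)
    v[topic_pages] = 1.0 / len(topic_pages)    # teleport distribution concentrated on the topic
    P = np.empty((n, n))
    for i in range(n):
        out = A[i].sum()
        if out == 0:
            P[i] = v                           # dead end: teleport into the topic
        else:
            P[i] = alpha * v + (1 - alpha) * A[i] / out
    return P

# Same illustrative 3-page graph as before; all teleport mass now lands on page 2.
A = np.array([[0, 1, 1],
              [0, 0, 0],
              [1, 1, 0]], dtype=float)
print(topic_sensitive_transition(A, alpha=0.1, topic_pages=[2]))
```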

Page 128:

Pagerank applications beyond web search

n A person is reputable if s/he receives many references from reputable people.

n How to compute reputation for people?

n Renting a room in an exhibition center: find the one with the highest visit rate.

Page 129:

Hyperlink-Induced Topic Search (HITS)

n In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
n Hub pages are good lists of links to pages answering the information need
n e.g., "Bob's list of cancer-related links"

n Authority pages are direct answers to the information need

n occur recurrently on good hubs for the subject

n Most approaches to search do not make the distinction between the two sets

Page 130:

Hubs and Authorities

n Thus, a good hub page for a topic points to many authoritative pages for that topic

n A good authority page for a topic is pointed to by many good hubs for that topic

n Circular definition - will turn this into an iterative computation

Page 131:

Examples of hubs and authorities

[Figure: hubs Alice and Bob linking to authorities AT&T, Sprint and MCI - long-distance telephone companies]

Page 132:

High-level scheme

n Do a regular web search first
n Call the search results the root set
n Add in any page that either
n points to a page in the root set, or
n is pointed to by a page in the root set
n Call this the base set
n From these, identify a small set of top hub and authority pages
n Iterative algorithm

Page 133:

Visualization

[Figure: the root set nested inside the larger base set]

Page 134:

Assembling the base set

n Root set typically 200-1000 nodes
n Base set may have up to 5000 nodes
n How do you find the base set nodes?
n Follow out-links by parsing root set pages
n Get in-links from a connectivity server, then fetch those pages

n This assumes our inverted index supports searches for links, in addition to terms

Page 135:

Distilling hubs and authorities

n Compute, for each page x in the base set, a hub score h(x) and an authority score a(x)

n Initialize: for all x, h(x) ← 1; a(x) ← 1
n Iteratively update all h(x), a(x)
n After convergence:
n output pages with highest h() scores as top hubs
n output pages with highest a() scores as top authorities
n so we output two ranked lists

Page 136:

Iterative update

n Iterate these two steps until convergence:

for all x:  h(x) ← ∑_{x→y} a(y)

for all x:  a(x) ← ∑_{y→x} h(y)
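A compact sketch of this iteration in matrix form (h ← Aa, a ← Aᵀh, with rescaling each round as described on the next slide); the tiny base-set adjacency matrix is a made-up illustration.

```python
import numpy as np

def hits(A, iters=50):
    """HITS on a base set: A[i, j] = 1 if page i links to page j.

    h(x) <- sum of a(y) over pages y that x points to
    a(x) <- sum of h(y) over pages y that point to x
    Scores are rescaled each iteration; the scaling factor does not matter.
    """
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        h = A @ a                 # hub score: sum of authority scores of out-neighbours
        a = A.T @ h               # authority score: sum of hub scores of in-neighbours
        h /= h.sum()              # scale down so values do not blow up
        a /= a.sum()
    return h, a

# Tiny illustrative base set: pages 0 and 1 are hubs pointing at pages 2 and 3.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
h, a = hits(A)
print("hubs:", h)          # pages 0 and 1 get the high hub scores
print("authorities:", a)   # pages 2 and 3 get the high authority scores
```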

Page 137:

Scaling

n To prevent the h() and a() values from getting too big, can scale down after each iteration

n Scaling factor doesn't really matter:
n we only care about the relative values of the scores

Page 138:

How many iterations?

n Relative values of scores will converge after a few iterations

n In fact, suitably scaled, h() and a() scores settle into a steady state!
n proof of this comes later

n In practice, ~5 iterations get you close to stability

Page 139:

Japan Elementary Schools

Authorities:
n The American School in Japan
n The Link Page
n Kids' Space
n KEIMEI GAKUEN Home Page (Japanese)
n Shiranuma Home Page
n fuzoku-es.fukui-u.ac.jp
n welcome to Miasa E&J school
n http://www...p/~m_maru/index.html
n fukui haruyama-es HomePage
n Torisu primary school
n goo
n Yakumo Elementary, Hokkaido, Japan
n FUZOKU Home Page
n Kamishibun Elementary School...
(plus several Japanese-language school home pages whose titles did not survive the text encoding)

Hubs:
n schools
n LINK Page-13
n 100 Schools Home Pages (English)
n K-12 from Japan 10/...rnet and Education )
n http://www...iglobe.ne.jp/~IKESAN
n Koulutus ja oppilaitokset
n TOYODA HOMEPAGE
n Education
n Cay's Homepage (Japanese)
n UNIVERSITY
n DRAGON97-TOP
(plus several Japanese-language entries whose titles did not survive the text encoding)

Page 140:

Things to note

n Pulled together good pages regardless of language of page content.

n Use only link analysis after the base set is assembled
n Is HITS query-independent?
n In typical use, no
n Iterative computation after text index retrieval - significant overhead

Page 141:

PageRank vs. HITS: Discussion
n PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to which the formalization is applied
n These two choices are orthogonal
n We could also apply HITS to the entire web and PageRank to a small base set
n On the web, a good hub almost always is also a good authority
n The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect

Page 142:

HITS applications beyond web search

n Researchers publish/present papers in conferences. A conference is reputable if it hosts many reputable researchers to publish/present their papers. A researcher is reputable if s/he publishes/presents many papers in reputable conferences.

n How to compute reputation for conferences? How to compute reputation for researchers?