Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval


Page 1:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 1: Boolean retrieval

Page 2:
Information Retrieval (IR)
- Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
- Started in the 1950s. SIGIR (1980), TREC (1992)
- The field of IR also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents
  - clustering
  - classification
- Scale: from web search to personal information retrieval

Page 3:
How good are the retrieved docs?
- Precision: fraction of retrieved docs that are relevant to the user's information need
- Recall: fraction of relevant docs in the collection that are retrieved
- More precise definitions and measurements to follow in later lectures

Page 4:
Boolean retrieval
- Queries are Boolean expressions
  - e.g., Brutus AND Caesar
- Shakespeare's Collected Works
  - Which plays of Shakespeare contain the words Brutus AND Caesar?
- The search engine returns all documents satisfying the Boolean expression.
  - Does Google use the Boolean model?
- http://www.rhymezone.com/shakespeare/

Page 5:
Term-document incidence
1 if the play contains the word, 0 otherwise

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        1
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0

Query: Brutus AND Caesar but NOT Calpurnia
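A minimal sketch (my own illustration, not part of the slides) of answering the query above by combining the rows of the incidence matrix with bitwise AND, complementing the Calpurnia row:

    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]
    incidence = {
        "Brutus":    [1, 1, 0, 1, 0, 0],
        "Caesar":    [1, 1, 0, 1, 1, 1],
        "Calpurnia": [0, 1, 0, 0, 0, 0],
    }
    # Brutus AND Caesar AND NOT Calpurnia, one bit per play
    answer = [b & c & (1 - p) for b, c, p in
              zip(incidence["Brutus"], incidence["Caesar"], incidence["Calpurnia"])]
    print([play for play, hit in zip(plays, answer) if hit])
    # ['Antony and Cleopatra', 'Hamlet']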

Page 6:
Inverted index
- For each term T, we must store a list of all documents that contain T.

  Brutus    -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  Caesar    -> 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
  Calpurnia -> 13 -> 16

- Dictionary on the left; postings lists on the right, sorted by docID
- Each docID entry in a postings list is a posting

Page 7:
Boolean query processing: AND
- Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the dictionary; retrieve its postings.
  - Locate Caesar in the dictionary; retrieve its postings.
  - "Merge" (intersect) the two postings lists (see the sketch below):

  Brutus -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  Caesar -> 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
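A minimal sketch (my own, consistent with the merge described above) of the linear-time postings intersection; both lists must be sorted by docID:

    def intersect(p1, p2):
        # walk both docID-sorted lists in step, keeping only common docIDs
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))  # [2, 8]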

Page 8:
Example: WestLaw (http://www.westlaw.com/)
- Commercially successful Boolean retrieval
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still use Boolean queries
- Example query:
  - What is the statute of limitations in cases involving the federal tort claims act?
  - LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  - /3 = within 3 words, /S = in same sentence

Page 9:
Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then AND them together.

  Brutus    -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  Caesar    -> 1 -> 2 -> 3 -> 5 -> 8 -> 16 -> 21 -> 34
  Calpurnia -> 13 -> 16

Query: Brutus AND Calpurnia AND Caesar
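The standard heuristic from the textbook (the answer to the question posed above, not spelled out on this slide) is to process terms in order of increasing document frequency, so the smallest postings lists are intersected first and intermediate results stay small. A minimal sketch:

    from functools import reduce

    postings = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
        "Calpurnia": [13, 16],
    }
    by_length = sorted(postings, key=lambda t: len(postings[t]))  # Calpurnia first
    result = reduce(lambda a, b: sorted(set(a) & set(b)),
                    (postings[t] for t in by_length))
    print(by_length, result)  # ['Calpurnia', 'Brutus', 'Caesar'] [16]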

Page 10:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 2: The term vocabulary and postings lists

Page 11:
Parsing a document
- Before we start worrying about terms, we need to know the format and language of each document
- What format is it in?
  - pdf / word / excel / html?
- What language is it in?
- What character set is in use?
- Each of these is a classification problem, but often done heuristically

Page 12:
What is the unit of a document?
- A file?
  - Traditional Unix stores a sequence of emails in one file, but you might want to regard each email as a separate document
- An email with 5 attachments?
- Indexing granularity, e.g. a collection of books
  - Each book as a document? Each chapter? Each paragraph? Each sentence?
- Precision/recall tradeoff
  - Small unit: good precision, poor recall
  - Big unit: good recall, poor precision

Page 13:
Tokenization
- Input: "Friends, Romans, Countrymen"
- Output: tokens
  - Friends
  - Romans
  - Countrymen
- Each such token is now a candidate for an index entry, after further processing (described below)
- But what are valid tokens to emit?

Page 14:
Common terms: stop words
- Stop words = extremely common words that appear to be of little value in helping select documents matching a user need
  - a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
- There are a lot of them: ~30% of postings for the top 30 words
- Stop word elimination used to be standard in older IR systems
  - Size of stop list: 200-300 terms, or as few as 7-12 terms

Page 15:
Current trend
- The trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space needed for including stop words in a system is very small
  - Good query optimization techniques mean you pay little at query time for including stop words
- You need them for:
  - Phrase queries: "King of Denmark"
  - Various song titles, etc.: "Let it be", "To be or not to be"
  - "Relational" queries: "flights to London"
- Nowadays search engines generally do not eliminate stop words

Page 16:
Normalization
- Need to "normalize" terms in indexed text as well as query terms into the same form
  - We want to match U.S.A. and USA
- We most commonly implicitly define equivalence classes of terms
  - e.g., by deleting periods in a term
- The alternative is to do asymmetric expansion:
  - Enter: window   Search: window, windows
  - Enter: windows  Search: Windows, windows, window
  - Enter: Windows  Search: Windows (no expansion)
- Two approaches for the (more powerful) alternative
  - Index unnormalized tokens and expand query terms
  - Expand during index construction
  - Both are less efficient than equivalence classing

Page 17:
Case folding
- Reduce all letters to lower case
  - Exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
- Often best to lower-case everything, since users will use lowercase regardless of 'correct' capitalization

Page 18:
Lemmatization
- Reduce inflectional/variant forms to the base form
  - am, are, is → be
  - car, cars, car's, cars' → car
  - the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing "proper" reduction to the dictionary headword form

Page 19:
Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat
- Example:
  - for example compressed and compression are both accepted as equivalent to compress.
  - becomes: for exampl compress and compress ar both accept as equival to compress

Page 20:
Porter's algorithm (1980)
- Most common algorithm for stemming English
- Results suggest it is at least as good as other stemming options
- 5 phases of reductions
  - phases applied sequentially
  - within each phase, there are various conventions for selecting rules
  - e.g., sample convention: of the rules in a compound command, select the one that applies to the longest suffix
- http://www.tartarus.org/~martin/PorterStemmer/
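A small illustration of the Porter stemmer in practice, assuming the NLTK library is installed (NLTK is my choice here, not part of the slides); it roughly reproduces the example from the stemming slide:

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    words = ("for example compressed and compression are both "
             "accepted as equivalent to compress").split()
    print(" ".join(stemmer.stem(w) for w in words))
    # roughly: "for exampl compress and compress ar both accept as equival to compress"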

Page 21:
Phrase queries
- Want to be able to answer queries such as "stanford university" – as a phrase
- "The inventor Stanford Ovshinsky never went to university" is not a match.
- The concept of phrase queries has proven easily understood by users
  - 10% of web queries are phrase queries
- For this, it no longer suffices to store only <term : docs> entries
- Any ideas?

Page 22:
Solution 2: Positional indexes
- In the postings, store, for each term, entries of the form (see the sketch below):
  <term, number of docs containing term;
   doc1: position1, position2, ... ;
   doc2: position1, position2, ... ;
   etc.>
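A hedged sketch (my own, not from the slides) of a positional index built as a Python dictionary, and a phrase check for "stanford university" via adjacent positions:

    from collections import defaultdict

    docs = {1: "the inventor stanford ovshinsky never went to university",
            2: "stanford university is in california"}

    index = defaultdict(dict)                       # term -> {docID: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)

    def phrase_hits(t1, t2):
        # docs where some position of t1 is immediately followed by t2
        hits = []
        for doc_id in index.get(t1, {}).keys() & index.get(t2, {}).keys():
            if any(p + 1 in index[t2][doc_id] for p in index[t1][doc_id]):
                hits.append(doc_id)
        return hits

    print(phrase_hits("stanford", "university"))   # [2]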

Page 23:
Proximity queries: same idea
- employment /3 place
  - Find all documents that contain employment and place within 3 words of each other
- "Employment agencies that place healthcare workers are seeing growth"
  - a hit
- "Employment agencies that help place healthcare workers are seeing growth"
  - not a hit
- Clearly, positional indexes can be used for such queries; biword indexes cannot.

Page 24:
Positional index size
- You can compress position values/offsets: covered in chapter 5
- Nevertheless, a positional index expands postings storage substantially
  - Need an entry for each occurrence, not just one per document
  - Compare to biword indexes: "index blowup due to a bigger dictionary"
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries

Page 25:
Rules of thumb
- A positional index is 2-4 times as large as a non-positional index
- Positional index size is 35-50% of the volume of the original text
- Caveat: the above holds for English-like languages

Page 26:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 3: Dictionaries and tolerant retrieval
Chapter 4: Index construction
Chapter 5: Index compression

Page 27:
Dictionary
- The dictionary is the data structure for storing the term vocabulary
- For each term, we need to store:
  - its document frequency
  - a pointer to its postings list

Page 28:
Dictionary data structures
- Two main choices:
  - Hash table
  - Tree
- Some IR systems use hashes, some use trees
- Criteria in choosing hash vs. tree:
  - Is the number of terms fixed, or does it keep growing?
  - Relative frequencies with which various keys are accessed
  - How many terms?

Page 29:
Distributed indexing
- For web-scale indexing (don't try this at home!): must use a distributed computing cluster
- Individual machines are fault-prone
  - Can unpredictably slow down or fail
- How do we exploit such a pool of machines?

Page 30:
Google data centers
- Google data centers mainly use commodity machines
- Data centers are distributed around the world
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
- Estimate: Google installs 100,000 servers each quarter
  - Based on expenditures of $200-250 million per year
- This would be 10% of the computing capacity of the world!?

Page 31:
Data flow
[Figure: MapReduce index-construction data flow. The master assigns splits of the input to parsers (map phase); each parser writes its key-value pairs into segment files partitioned by term range (a-f, g-p, q-z); each inverter (reduce phase) collects one term partition and writes its postings.]

Page 32:
MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple architecture for distributed computing ...
- ... without having to write code for the distribution part.

Page 33:
MapReduce
- MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs
  - For indexing: (termID, docID)
- Map: map splits of the input data to key-value pairs
- Reduce: all values for a given key are stored close together, so that they can be read and processed quickly
  - This is achieved by partitioning the keys into j term partitions and having the parsers write key-value pairs for each term partition into a separate segment file
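A toy, single-machine sketch (my own simplification, not the book's or Google's implementation) of the map and reduce steps for index construction over (docID, text) pairs:

    def map_phase(doc_id, text):
        # emit one (term, docID) pair per token
        return [(term, doc_id) for term in text.lower().split()]

    def reduce_phase(pairs):
        # group by term; sorting brings all pairs for a term together
        index = {}
        for term, doc_id in sorted(pairs):
            index.setdefault(term, [])
            if not index[term] or index[term][-1] != doc_id:
                index[term].append(doc_id)
        return index

    pairs = map_phase(1, "Caesar was killed") + map_phase(2, "Brutus killed Caesar")
    print(reduce_phase(pairs))
    # {'brutus': [2], 'caesar': [1, 2], 'killed': [1, 2], 'was': [1]}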

Page 34:
MapReduce
- Dean and Ghemawat describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.
- Index construction was just one phase.
- Another phase: transforming a term-partitioned index into a document-partitioned index.
  - Term-partitioned: one machine handles a subrange of terms
  - Document-partitioned: one machine handles a subrange of documents
- (As we discuss in the web part of the course) most search engines use a document-partitioned index (better load balancing, etc.)

Page 35:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 6: Scoring, term weighting and the vector space model

Page 36:
Ranked retrieval
- Thus far, our queries have all been Boolean.
  - Documents either match or don't
- Good for expert users with a precise understanding of their needs and the collection.
- Also good for applications, which can easily consume 1000s of results
- Not good for the majority of users.
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work)
  - Most users don't want to wade through 1000s of results.
  - This is particularly true of web search.

Page 37:
Problem with Boolean search: feast or famine
- Boolean queries often result in either too few (=0) or too many (1000s) results.
- Query 1: "standard user dlink 650"
  - 200,000 hits
- Query 2: "standard user dlink 650 no card found"
  - 0 hits
- It takes skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many

Page 38:
Take 1: Jaccard coefficient
- A commonly used measure of the overlap of two sets A and B
- jaccard(A,B) = |A ∩ B| / |A ∪ B|
- jaccard(A,A) = 1
- jaccard(A,B) = 0 if A ∩ B = ∅
- Always assigns a number between 0 and 1.
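A minimal sketch (my own illustration) of scoring a query against a document with the Jaccard coefficient, treating both as sets of terms:

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    print(jaccard("ides of march".split(),
                  "caesar died in march".split()))   # 1/6 ≈ 0.167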

Page 39:
Issues with Jaccard for scoring
- It doesn't consider term frequency (how many times a term occurs in a document)
  - → tf weight
- Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information
  - → idf weight
- We need a more sophisticated way of normalizing for length
  - → cosine

Page 40:
Bag of words model
- The vector representation doesn't consider the ordering of words in a document
- "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
- This is called the bag of words model.

Page 41:
Term frequency
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use term frequency when computing query-document match scores. But how?
- Raw term frequency may not be what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
  - Relevance does not increase proportionally with term frequency.

Page 42:
Term frequency (tf) weight
- There are many variants of the tf weight; log-frequency weighting is a common one, dampening the effect of raw tf (the raw count):

  w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}

- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- The score is 0 if none of the query terms is present in the document.

Page 43:
Document frequency
- Rare terms are more informative than frequent terms
  - Recall stop words
- Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
- → We want a high weight for rare terms like arachnocentric.

Page 44:
Document frequency, continued
- Consider a query term that is frequent in the collection (e.g., high, increase, line)
- A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
- For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
- We will use document frequency (df) to capture this in the score.
- df (≤ N) is the number of documents that contain the term

Page 45:
Inverse document frequency (idf) weight
- df_t is the document frequency of t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
  - Inverse document frequency is a direct measure of the informativeness of t
- We define the idf (inverse document frequency) of t by

  \mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)

  - use log to dampen the effect of N/df_t
  - this is the most common variant of the idf weight

Page 46:
idf example, suppose N = 1 million

term       df_t        idf_t
calpurnia  1           6
animal     100         4
sunday     1,000       3
fly        10,000      2
under      100,000     1
the        1,000,000   0

There is one idf value for each term t in a collection.
\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)

Page 47:
Collection vs. document frequency
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example: which word is a better search term (and should get a higher weight)?

Word       Collection frequency   Document frequency
insurance  10440                  3997
try        10422                  8760

- The example suggests that df is better for weighting than cf

Page 48:
tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight: tf weight(t,d) × idf weight(t)
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
- Best known instantiation of tf-idf weighting:

  \mathrm{tf\text{-}idf}_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)
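A small sketch (my own, based directly on the formulas above) computing tf-idf weights over a toy collection of N documents:

    import math

    docs = ["caesar and brutus", "brutus killed caesar", "the tempest"]
    N = len(docs)

    df = {}                                           # document frequency per term
    for text in docs:
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1

    def tf_idf(term, text):
        tf = text.split().count(term)
        if tf == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df[term])

    print(round(tf_idf("brutus", docs[1]), 3))  # (1 + log10 1) * log10(3/2) ≈ 0.176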

Page 49:
Recall: binary term-document incidence matrix

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        0
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|

Page 50:
Term-document count matrices
- Consider the number of occurrences of a term in a document
- Each document is a count vector in ℕ^|V|: a column below

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      157                   73             0            0       0        0
Brutus      4                     157            0            1       0        0
Caesar      232                   227            0            2       1        1
Calpurnia   0                     10             0            0       0        0
Cleopatra   57                    0              0            0       0        0
mercy       2                     0              3            5       5        1
worser      2                     0              1            1       1        0

Page 51:
Binary → count → weight matrix

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      5.25                  2.44           0            0       0        0
Brutus      0.16                  6.10           0            0.04    0        0
Caesar      8.59                  8.40           0            0.07    0.04     0.04
Calpurnia   0                     1.54           0            0       0        0
Cleopatra   2.85                  0              0            0       0        0
mercy       1.51                  0              2.27         3.78    3.78     0.76
worser      1.37                  0              0.69         0.69    0.69     0

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Page 52:
Documents as vectors
- So we have a |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
- Very high-dimensional
  - hundreds of millions of dimensions when you apply this to a web search engine
- Each document vector is very sparse
  - most entries are zero

Page 53:
Queries as vectors
- Key idea 1: Do the same for queries: represent them as vectors in the space
- Key idea 2: Rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: We do this because we want to get away from the either-in-or-out Boolean model.
  - Instead: rank more relevant documents higher than less relevant documents

Page 54:
Why Euclidean distance is bad
- The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Page 55:
Use angle instead of distance
- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.

Page 56:
Length normalization
- A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

  \|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}

- Dividing a vector by its L2 norm makes it a unit (length) vector
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- The cosine of the angle between two normalized vectors is the dot product of the two

Page 57:
cosine(query, document)

  \cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{\|\vec{q}\|\,\|\vec{d}\|} = \frac{\vec{q}}{\|\vec{q}\|}\cdot\frac{\vec{d}}{\|\vec{d}\|} = \frac{\sum_{t=1}^{|V|} q_t d_t}{\sqrt{\sum_{t=1}^{|V|} q_t^2}\,\sqrt{\sum_{t=1}^{|V|} d_t^2}}

  (dot product of the two unit vectors)

- q_t is the tf-idf weight of term t in the query
- d_t is the tf-idf weight of term t in the document
- cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
- The cosine similarity can be seen as a method of normalizing document length during comparison

Page 58:
Cosine similarity example

      d     q     normalized d   normalized q
t1    0.5   1.5   0.51           0.83
t2    0.8   1     0.81           0.555
t3    0.3   0     0.30           0

sim(d, q) = (0.5×1.5 + 0.8×1 + 0.3×0) / (sqrt(0.5² + 0.8² + 0.3²) × sqrt(1.5² + 1² + 0²))
          = 1.55 / (0.99 × 1.8) = 0.87

sim(d, q) = 0.51×0.83 + 0.81×0.555 + 0.30×0 = 0.87
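A quick numeric check (my own) of the worked example above:

    import math

    d, q = [0.5, 0.8, 0.3], [1.5, 1.0, 0.0]
    dot = sum(x * y for x, y in zip(d, q))
    cos = dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))
    print(round(cos, 2))   # 0.87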

Page 59:
More variants of tf-idf weighting
- SMART notation: columns headed 'n' are acronyms for weight schemes.

Page 60:
Summary: vector space ranking
- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top k (e.g., k = 10) to the user

Page 61:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 7: Computing scores in a complete search system

Page 62:
Contents
- Speeding up vector space ranking
- Putting together a complete search system

Page 63:
Cluster pruning: query processing
- Process a query as follows:
  - Given query Q, find its nearest leader L.
  - Seek the K nearest docs from among L's followers.

Page 64:
Visualization
[Figure: cluster pruning in the vector space – the query, the cluster leaders, and their followers.]

Page 65:

Putting it all together

Page 66:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 8: Evaluation and result summaries

Page 67:
Summaries
- The title is typically automatically extracted from document metadata. What about the summaries?
  - This description is crucial.
  - The user can identify good/relevant hits based on the description.
- Two basic kinds:
  - Static
  - Dynamic
- A static summary of a document is always the same, regardless of the query that hit the doc
- A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand

Page 68:
Dynamic summaries
- Present one or more "windows" within the document that contain several of the query terms
- "KWIC" snippets: keyword-in-context presentation
- Generated in conjunction with scoring
  - If the query is found as a phrase, show all or some occurrences of the phrase in the doc
  - If not, show document windows that contain multiple query terms
- The summary itself gives the entire content of the window – all terms, not only the query terms – how?

Page 69:

Evaluating search engines

Page 70:
Relevance to what?
- Relevance is assessed relative to the information need, not the query
- E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- Query: wine red white heart attack effective
- You evaluate whether the doc addresses the information need, not whether it has these words
- Our terminology is sloppy: we talk about query-document relevance judgments although we mean information-need-document relevance judgments

Page 71:
Unranked retrieval evaluation: precision and recall
- Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
- Precision P = tp / (tp + fp)
- Recall R = tp / (tp + fn)

                Relevant   Nonrelevant
Retrieved       tp         fp
Not retrieved   fn         tn

Page 72:
Should we instead use the accuracy measure for evaluation?
- Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant"
- The accuracy of an engine: the fraction of these classifications that are correct
- Accuracy is a commonly used evaluation measure in machine learning classification work
- Why is this not a very useful evaluation measure in IR?

Page 73:
Why not just use accuracy?
- How to build a 99.9999% accurate search engine on a low budget...
  [Mock search box: any query returns "0 matching results found."]
- People doing information retrieval want to find something and have a certain tolerance for junk.

Page 74:
Precision/recall tradeoff
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
- In a good system, precision decreases as either the number of docs retrieved or recall increases
  - This is not a theorem, but a result with strong empirical confirmation

Page 75:
A combined measure: F
- The combined measure that assesses the precision/recall tradeoff is the F measure, a weighted harmonic mean of P and R:

  F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha}

- People usually use the balanced F measure, F1 (i.e., β = 1 or α = 1/2), the harmonic mean of P and R:

  \frac{1}{F_1} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)

- Does β < 1 emphasize P or R?
- If either P or R is bad, F is bad.
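A small numeric check (my own example) of precision, recall, and the balanced F measure:

    def precision_recall_f1(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f1 = 2 * p * r / (p + r)          # harmonic mean of P and R
        return p, r, f1

    # e.g., 40 relevant docs retrieved, 60 nonrelevant retrieved, 40 relevant docs missed
    print(precision_recall_f1(tp=40, fp=60, fn=40))   # (0.4, 0.5, ≈0.444)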

Page 76:
Evaluating ranked results
- P/R/F are measured for unranked sets
- We can easily turn set measures into measures for ranked results
- The system can return any number of results
- Just use the set measures for each "prefix": the top 1, top 2, top 3, top 4, etc., results
- Doing this for precision and recall produces a precision-recall curve, where a "prefix" corresponds to a level of recall

Page 77:
A precision-recall curve
[Figure: precision (y-axis) plotted against recall (x-axis), both from 0.0 to 1.0.]
- Sawtooth shape:
  - If the (k+1)th doc is non-relevant, R is the same as for the top k docs, but P has dropped
  - If it is relevant, then both P and R increase, and the curve jags up and to the right
- Often useful to remove the jiggles: interpolation
  - Take the maximum precision of all future points
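A brief sketch (my own) of that interpolation rule: at each point on the curve, take the maximum precision at that recall level or any higher one:

    def interpolate(precisions):
        # precisions are given in order of increasing recall
        out, best = [], 0.0
        for p in reversed(precisions):    # scan from highest recall back to lowest
            best = max(best, p)
            out.append(best)
        return list(reversed(out))

    print(interpolate([1.0, 0.5, 0.67, 0.5, 0.4]))   # [1.0, 0.67, 0.67, 0.5, 0.4]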

Page 78:
11-point interpolated average precision
- The entire precision-recall graph is very informative, but there is often a desire to boil this information down to a few numbers, or even a single number
- 11-point interpolated average precision
  - The standard measure in the early TREC competitions
  - Take the interpolated precision at 11 levels of recall, varying from 0 to 1 by tenths, and average over queries
  - Evaluates performance at all recall levels

Page 79:
Typical (good) 11-point precisions
- SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: 11-point precision-recall curve; precision on the y-axis, recall on the x-axis, both from 0 to 1.]

Page 80:
Mean average precision (MAP)
- Recently, other measures have become more common. The most standard among the TREC community is MAP
  - A single-figure measure of quality across recall levels
  - Good discrimination and stability
- For a single information need, average precision is the average of the precision values obtained for the top k docs each time a relevant doc is retrieved
  - Approximates the area under the un-interpolated precision-recall curve
- Then, this value (average precision) is averaged over many information needs to get MAP
  - Approximates the average area under the precision-recall curve for a set of queries
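A compact sketch (my own) of average precision for one ranked result list, given relevance judgments as 0/1 flags in rank order (assuming, for simplicity, that all relevant docs appear somewhere in the ranking):

    def average_precision(rels):
        hits, total = 0, 0.0
        for k, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                total += hits / k          # precision at this rank
        return total / hits if hits else 0.0

    print(round(average_precision([1, 0, 1, 0, 0, 1]), 3))
    # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722; averaging this over queries gives MAP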

Page 81:
Yet more evaluation measures...
- The above measures factor in precision at all recall levels
- For many prominent applications, e.g., web search, this may not be appropriate: what matters is rather how many good results there are on the first page or the first 3 pages
  - Leads to measuring precision at fixed low levels (e.g., 10 or 30) of retrieved results
- Precision at k: precision of the top k results
  - Standard for web search
  - Cons: the least stable among commonly used measures; does not average well, because the total number of relevant docs for a query has a strong influence on precision at k
- R-precision alleviates this problem
  - But may not be feasible for web search

Page 82:
R-precision
- If we have a known (though perhaps incomplete) set of relevant documents of size |Rel|, then calculate the precision of the top |Rel| docs returned
  - Averaging this measure across queries makes more sense
- If there are |Rel| relevant docs for a query, we examine the top |Rel| results and find that r are relevant. Then:
  - recall = precision = r / |Rel|
  - Thus, R-precision is identical to the break-even point
- Empirically, highly correlated with MAP

Page 83:
Critique of pure relevance
- Assumption: the relevance of one doc is treated as independent of the relevance of other docs in the collection
  - But a document can be redundant (e.g., duplicates) even if it is highly relevant
- Marginal relevance: concerns whether a doc still has distinctive usefulness after the user has looked at certain other documents (Carbonell and Goldstein 1998)
- Maximizing marginal relevance requires returning documents that exhibit diversity and novelty

Page 84:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 19: Web search basics

Page 85:
1. Brief history and overview
- Early keyword-based engines
  - Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997
- A hierarchy of categories
  - Yahoo!
  - Many problems, popularity declined. Existing variants are About.com and the Open Directory Project
- Classical IR techniques continue to be necessary for web search, but are by no means sufficient
  - E.g., classical IR measures relevancy; web search also needs to measure authoritativeness

Page 86:
Web search overview
[Figure: overview of a web search engine. The Web is crawled by a web spider and fed to the indexer, which builds the indexes (alongside separate ad indexes); the user's query goes to the search component, which returns a results page (illustrated with Google results for the query "miele", showing algorithmic results plus sponsored links).]

Page 87:
Web IR: differences from traditional IR
- Links: the web is a hyperlinked document collection
- Queries: web queries are different, more varied, and there are a lot of them
  - How many? 10^8 every day, approaching 10^9
- Users: users are different, more varied, and there are a lot of them
  - How many? 10^9
- Documents: documents are different, more varied, and there are a lot of them
  - How many? ~10^11; about 10^10 indexed
- Context: context is more important on the web than in many other IR applications
- Ads and spam

Page 88:
Duplicate documents
- Significant duplication: 30-40% duplicates in some studies
- Duplicates in search results were common in the early days of the Web
- Today's search engines eliminate duplicates very effectively
  - Key for high user satisfaction

Page 89:
Duplicate detection
- The web is full of duplicated content
- Strict duplicate detection = exact match
  - Not as common
- But many, many cases of near duplicates
  - E.g., the last-modified date is the only difference between two copies of a page
- Various techniques
  - Fingerprints, shingles, sketches

Page 90:
Size of the web: issues
- How to define size? Number of web servers? Number of pages? Terabytes of data available?
- Some servers are seldom connected
  - Example: your laptop running a web server
  - Is it part of the web?
- The "dynamic" web is infinite
  - Any sum of two numbers is its own dynamic page on Google (e.g., "2+4")

Page 91:
Goal of spamming on the web
- You have a page that will generate lots of revenue for you if people visit it
- Therefore, you'd like to redirect visitors to this page
- One way of doing this: get your page ranked highly in search results

Page 92:
Simplest forms
- First-generation engines relied heavily on tf-idf
- Hidden text: dense repetitions of chosen keywords
  - Often, the repetitions would be in the same color as the background of the web page, so that repeated terms got indexed by crawlers but were not visible to humans in browsers
- Keyword stuffing: misleading meta-tags with excessive repetition of chosen keywords
- These used to be effective; most search engines now catch them
- Spammers responded with a richer set of spam techniques

Page 93:
Link spam
- Create lots of links pointing to the page you want to promote
- Put these links on pages with high (or at least non-zero) PageRank
  - Newly registered domains (domain flooding)
  - A set of pages pointing to each other to boost each other's PageRank (mutual admiration society)
  - Pay somebody to put your link on their highly ranked page (the "schuetze horoskop" example)
    - http://www-csli.stanford.edu/~hinrich/horoskop-schuetze.html
  - Leave comments that include the link on blogs
  - Link farms

Page 94:
Search engine optimization
- Promoting a page is not necessarily spam
- It can also be a legitimate business, which is called SEO
  - You can hire an SEO firm to get your page highly ranked
- Motives
  - Commercial, political, religious, lobbies
  - Promotion funded by advertising budget
- Operators
  - Contractors (search engine optimizers) for lobbies, companies
  - Web masters
  - Hosting services
- Forums
  - E.g., Webmaster World (www.webmasterworld.com)

Page 95:
3. Advertising as economic model
- Sponsored search ranking: Goto.com (morphed into Overture.com → Yahoo!)
  - Your search ranking depended on how much you paid
  - Auction for keywords: casino was expensive!
  - No separation of ads/docs
- 1998+: link-based ranking pioneered by Google
  - Blew away all early engines
  - Google added paid-placement "ads" to the side, independent of search results
  - Strict separation of ads and results

Page 96:
First generation of search ads: Goto (1996)
- No separation of ads/docs. Just one result list!
- Buddy Blake bid the maximum ($0.38) for this search
- He paid $0.38 to Goto every time somebody clicked on the link
- Upfront and honest. No relevance ranking, but Goto did not pretend there was any.

Page 97:
[Figure: a modern results page with algorithmic results clearly separated from the ads.]

Page 98:
The appeal of search ads to advertisers
- Why is web search potentially more attractive for advertisers than TV spots, newspaper ads, or radio spots?
- Someone who just searched for "Saturn Aura Sport Sedan" is infinitely more likely to buy one than a random person watching TV.
- Most importantly, the advertiser only pays if the customer took an action indicating interest (i.e., clicking on the ad)

Page 99:
Users of web search
- Use short queries (average < 3 terms)
- Rarely use operators
- Don't want to spend a lot of time composing a query
- Only look at the first couple of results
- Want a simple UI, not a search engine start page overloaded with graphics
- Extreme variability in terms of user needs, user expectations, experience, knowledge, ...
  - Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class
- One interface for hugely divergent needs

Page 100:
User query needs
- Need [Brod02, RL04]
  - Informational – want to learn about something (~40% / 65%), e.g., low hemoglobin
    - Not a single page containing the info
  - Navigational – want to go to that page (~25% / 15%), e.g., United Airlines
  - Transactional – want to do something (web-mediated) (~35% / 20%)
    - Access a service, e.g., Seattle weather
    - Downloads, e.g., Mars surface images
    - Shop, e.g., Canon S410
  - Gray areas
    - Find a good hub, e.g., car rental Brasil
    - Exploratory search: "see what's there"

Page 101:

Query distribution (1)

Page 102:
Query distribution (2)
- Queries have a power-law distribution
- Recall Zipf's law: a few very frequent words, a large number of very rare words
- Same here: a few very frequent queries, a large number of very rare queries
- Examples of rare queries: searches for names, towns, books, etc.
- The proportion of adult queries is much lower than 1/3

Page 103:
Introduction to Information Retrieval (Manning, Raghavan, Schutze)
Chapter 21: Link analysis

Page 104:
The Web as a directed graph
- Assumption 1: a hyperlink is a quality signal
  - A hyperlink between pages denotes author-perceived relevance
- Assumption 2: the anchor text describes the target page
  - We use "anchor text" somewhat loosely here
  - Extended anchor text: a window of text surrounding the anchor text
  - Example: You can find cheap cars <a href= ...>here</a>
- [Figure: Page A links to Page B via a hyperlink; the anchor text is on Page A.]

Page 105:
Google bombs
- Indexing anchor text can have unexpected side effects: Google bombs.
  - What else does not have side effects?
- A Google bomb is a search with "bad" results due to maliciously manipulated anchor text
- Google introduced a new weighting function in January 2007 that fixed many Google bombs

Page 106:

Google bomb example

Page 107:
Cocitation similarity on Google: similar pages
Origins of PageRank: citation analysis (1)

Page 108:

Origins of PageRank: Citation analysis (2)

Page 109:
Query-independent ordering
- First-generation link-based ranking for web search
  - Using link counts as simple measures of popularity
  - Simple link popularity: number of in-links
- First, retrieve all pages meeting the text query (say, venture capital)
- Then, order these by their simple link popularity
- Easy to spam. Why?

Page 110: Introduction to Information Retrievaljg66/teaching/7312/notes/...Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector

Basics for PageRank: random walk

n Imagine a web surfer doing a random walk on web pages:
n start at a random page
n at each step, go out of the current page along one of the links on that page, equiprobably
n In the steady state each page has a long-term visit rate - use this as the page's score
n So, pagerank = steady state probability = long-term visit rate

[Diagram: a page with three out-links, each followed with probability 1/3]
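To make "long-term visit rate" concrete, here is a minimal sketch (not from the slides) that simulates the plain random walk on a tiny hand-made link graph and counts how often each page is visited; the graph, step count, and function name are illustrative assumptions, and the graph deliberately has no dead ends so the walk never gets stuck.

```python
import random

# Hypothetical 3-page graph: each page lists the pages it links to.
links = {0: [1, 2], 1: [0, 2], 2: [0]}

def estimate_visit_rates(links, steps=100_000, seed=0):
    """Simulate a pure random walk and count how often each page is visited."""
    rng = random.Random(seed)
    counts = {page: 0 for page in links}
    current = rng.choice(list(links))          # start at a random page
    for _ in range(steps):
        current = rng.choice(links[current])   # follow an out-link uniformly at random
        counts[current] += 1
    return {page: c / steps for page, c in counts.items()}

print(estimate_visit_rates(links))  # long-term visit rates, i.e. pageranks (no teleport yet)
```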

Page 111:

Not quite enough

n The web is full of dead ends
n the random walk can get stuck in dead ends
n then it makes no sense to talk about long-term visit rates

Page 112:

Teleporting
n Teleport operation: the surfer jumps from a node to any other node in the web graph, chosen uniformly at random from all web pages
n Used in two ways:
n At a dead end, jump to a random web page
n At any non-dead end, with teleportation probability 0 < α < 1 (say, α = 0.1), jump to a random web page; with remaining probability 1 - α (0.9), go out on a random link
n Now the walk cannot get stuck locally
n There is a long-term rate at which any page is visited
n Not obvious; explained later
n How do we compute this visit rate?

Page 113:

Markov chains

n A Markov chain consists of n states, plus an n×n transition probability matrix P.
n At each step, we are in exactly one of the states.
n For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i.
n Clearly, for each i, ∑j=1..n Pij = 1.
n Markov chains are abstractions of random walks
n State = page

[Diagram: states i and j connected by an edge labeled Pij]

Page 114:

Exercise
Represent the teleporting random walk as a Markov chain for the following case (three pages A, B, C), using a transition probability matrix with α = 0.3:

[Figure: link structure and resulting state diagram for the three pages]

Transition matrix:
    0.1   0.45  0.45
    1/3   1/3   1/3
    0.45  0.45  0.1
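A sketch of how the exercise's matrix can be built programmatically. The adjacency matrix below is an assumption chosen so that it reproduces the numbers on the slide (α = 0.3, one dead-end page); the helper name is made up for illustration.

```python
import numpy as np

alpha = 0.3  # teleportation probability from the exercise

# Adjacency assumed to match the slide's matrix: page 0 links to pages 1 and 2,
# page 1 is a dead end, page 2 links to pages 0 and 1.
A = np.array([[0, 1, 1],
              [0, 0, 0],
              [1, 1, 0]], dtype=float)

def transition_matrix(A, alpha):
    """Row-stochastic matrix for the teleporting random walk."""
    n = A.shape[0]
    P = np.empty((n, n))
    for i in range(n):
        out = A[i].sum()
        if out == 0:                           # dead end: teleport uniformly
            P[i] = 1.0 / n
        else:                                  # teleport with prob alpha, else follow a random out-link
            P[i] = alpha / n + (1 - alpha) * A[i] / out
    return P

P = transition_matrix(A, alpha)
print(P)                                  # rows: [0.1 0.45 0.45], [1/3 1/3 1/3], [0.45 0.45 0.1]
assert np.allclose(P.sum(axis=1), 1.0)    # each row sums to 1 (Markov chain property)
```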

Page 115:

Ergodic Markov chains

n A Markov chain is ergodic iff it is irreducible and aperiodic
n Irreducibility: roughly, there is a path from any state to any other
n Aperiodicity: roughly, the nodes cannot be partitioned such that the random walker visits the partitions sequentially
n A non-ergodic Markov chain:

[Figure: two states connected by edges of probability 1 in each direction - a periodic, hence non-ergodic, chain]

Page 116:

Ergodic Markov chains

n Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
n This is the steady-state probability distribution.

n Over a long time-period, we visit each state in proportion to this rate.

n It doesn’t matter where we start.

Page 117:

Formalization of visit: probability vector

n A probability (row) vector x = (x1, … xn) tells us where the walk is at any point.

n e.g., (0 0 0 … 1 … 0 0 0) means we're in state i (the 1 is in position i)
n More generally, the vector x = (x1, … xn) means the walk is in state i with probability xi, where ∑i=1..n xi = 1

Page 118:

Change in probability vector

n If the probability vector is x = (x1, … xn) at this step, what is it at the next step?

n Recall that row i of the transition prob. matrix P tells us where we go next from state i.

n So from x, our next state is distributed as xP

Page 119:

Steady state example

n The steady state is simply a vector of probabilities a = (a1, … an):
n ai is the probability that we are in state i
n ai is the long-term visit rate (or pagerank) of state (page) i
n so we can think of pagerank as a long vector, one entry per page

Page 120:

How do we compute this vector?

n Let a = (a1, … an) denote the row vector of steady-state probabilities.

n If our current position is described by a, then the next step is distributed as aP

n But a is the steady state, so a = aP
n Solving this matrix equation gives us a
n so a is a (left) eigenvector of P
n it corresponds to the principal eigenvector of P, the one with the largest eigenvalue
n transition probability matrices always have largest eigenvalue 1
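As a sketch of this eigenvector view, the steady state a with a = aP can be obtained as the left eigenvector of P for eigenvalue 1 (equivalently, a right eigenvector of the transpose of P). The matrix reused here is the one from the earlier exercise; the function name is illustrative.

```python
import numpy as np

def steady_state_eig(P):
    """Solve a = aP: the left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)     # left eigenvectors of P = right eigenvectors of P^T
    k = np.argmin(np.abs(eigvals - 1.0))      # pick the eigenvalue closest to 1
    a = np.real(eigvecs[:, k])
    return a / a.sum()

# The 3x3 teleporting matrix from the exercise above:
P = np.array([[0.1, 0.45, 0.45],
              [1/3, 1/3, 1/3],
              [0.45, 0.45, 0.1]])
print(steady_state_eig(P))    # the pagerank vector a, satisfying a = aP
```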

Page 121:

One way of computing

n Recall, regardless of where we start, we eventually reach the steady state a

n Start with any distribution, say x = (1 0 … 0).
n After one step, we're at xP; after two steps at xP^2, then xP^3 and so on.
n "Eventually" means: for "large" k, xP^k = a
n Algorithm: multiply x by increasing powers of P until the product looks stable

n This is called the power method
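A minimal power-method sketch, assuming the same 3×3 teleporting matrix from the exercise; the tolerance and iteration cap are arbitrary illustrative choices.

```python
import numpy as np

def power_method(P, tol=1e-10, max_iter=1000):
    """Repeatedly multiply a start distribution by P until it stops changing."""
    n = P.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                        # start anywhere, e.g. x = (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ P                # one step of the walk: x -> xP
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

P = np.array([[0.1, 0.45, 0.45],
              [1/3, 1/3, 1/3],
              [0.45, 0.45, 0.1]])
print(power_method(P))                # same vector as the eigenvector computation above
```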

Page 122:

Power method: example

Page 123:

Pagerank summary
n Preprocessing:
n Given the graph of links, build the transition probability matrix P
n From it, compute a
n The entry ai is a number between 0 and 1: the pagerank of page i
n Query processing:
n Retrieve pages meeting the query
n Rank them by their pagerank
n Order is query-independent
n In practice, pagerank alone wouldn't work
n Google paper: http://infolab.stanford.edu/~backrub/google.html

Page 124:

In practice

n Consider the query "video service"
n Yahoo! has very high pagerank, and contains both words
n With simple pagerank alone, Yahoo! would be top-ranked
n Clearly not desirable
n In practice, a composite score is used in ranking
n Pagerank, cosine similarity, term proximity, etc.
n May apply machine-learned scoring
n Many other clever heuristics are used

Page 125:

How important is PageRank?

Page 126:

Pagerank: Issues and Variants

n How realistic is the random surfer model?
n What if we modeled the back button?
n Surfer behavior is sharply skewed towards short paths
n Search engines, bookmarks & directories make jumps non-random
n Biased Surfer Models
n Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection)
n Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest)
n Non-uniform teleportation allows topic-specific pagerank and personalized pagerank

Page 127:

Topic Specific Pagerank
n Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
n Select a category (say, one of the 16 top-level ODP categories) based on a query- & user-specific distribution over the categories
n Teleport to a page uniformly at random within the chosen category
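A sketch of how the teleport step changes for topic-specific pagerank: instead of jumping uniformly over all pages, the surfer jumps according to a distribution concentrated on the chosen category's pages. The adjacency handling mirrors the earlier teleporting sketch; the names, parameters, and example graph are illustrative assumptions, not the exact weighting used by any production system.

```python
import numpy as np

def topic_sensitive_transition(A, alpha, topic_pages):
    """Transition matrix where teleports land only on pages in the chosen topic.

    A: adjacency matrix; alpha: teleport probability; topic_pages: indices of on-topic pages.
    """
    n = A.shape[0]
    v = np.zeros(n)
    v[topic_pages] = 1.0 / len(topic_pages)    # teleport distribution concentrated on the topic
    P = np.empty((n, n))
    for i in range(n):
        out = A[i].sum()
        if out == 0:
            P[i] = v                           # dead end: teleport into the topic
        else:
            P[i] = alpha * v + (1 - alpha) * A[i] / out
    return P

# Same illustrative 3-page graph as before; all teleport mass now lands on page 2.
A = np.array([[0, 1, 1],
              [0, 0, 0],
              [1, 1, 0]], dtype=float)
print(topic_sensitive_transition(A, alpha=0.1, topic_pages=[2]))
```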

Page 128:

Pagerank applications beyond web search

n A person is reputable if s/he receives many references from reputable people.

n How to compute reputation for people?

n Renting a room in an exhibition center: find the one with the highest visit rate.

Page 129:

Hyperlink-Induced Topic Search (HITS)

n In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
n Hub pages are good lists of links to pages answering the information need
n e.g., "Bob's list of cancer-related links"

n Authority pages are direct answers to the information need

n occur recurrently on good hubs for the subject

n Most approaches to search do not make the distinction between the two sets

Page 130:

Hubs and Authorities

n Thus, a good hub page for a topic points to many authoritative pages for that topic

n A good authority page for a topic is pointed to by many good hubs for that topic

n Circular definition - will turn this into an iterative computation

Page 131:

Examples of hubs and authorities

[Figure: hubs Alice and Bob linking to authorities AT&T, Sprint and MCI - long-distance telephone companies]

Page 132:

High-level scheme

n Do a regular web search first
n Call the search results the root set
n Add in any page that either
n points to a page in the root set, or
n is pointed to by a page in the root set
n Call this the base set
n From these, identify a small set of top hub and authority pages
n Iterative algorithm

Page 133:

Visualization

[Figure: the root set nested inside the larger base set]

Page 134:

Assembling the base set

n Root set typically 200-1000 nodes
n Base set may have up to 5000 nodes
n How do you find the base set nodes?
n Follow out-links by parsing root set pages
n Get in-links from a connectivity server, then fetch those pages

n This assumes our inverted index supports searches for links, in addition to terms

Page 135:

Distilling hubs and authorities

n Compute, for each page x in the base set, a hub score h(x) and an authority score a(x)

n Initialize: for all x, h(x) ← 1; a(x) ← 1
n Iteratively update all h(x), a(x)
n After convergence:
n output pages with highest h() scores as top hubs
n output pages with highest a() scores as top authorities
n so we output two ranked lists

Page 136:

Iterative update

n Iterate these two steps until convergence:

for all x:  h(x) ← ∑_{x→y} a(y)

for all x:  a(x) ← ∑_{y→x} h(y)
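A compact sketch of this iteration in matrix form (h ← Aa, a ← Aᵀh, with rescaling each round as described on the next slide); the tiny base-set adjacency matrix is a made-up illustration.

```python
import numpy as np

def hits(A, iters=50):
    """HITS on a base set: A[i, j] = 1 if page i links to page j.

    h(x) <- sum of a(y) over pages y that x points to
    a(x) <- sum of h(y) over pages y that point to x
    Scores are rescaled each iteration; the scaling factor does not matter.
    """
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        h = A @ a                 # hub score: sum of authority scores of out-neighbours
        a = A.T @ h               # authority score: sum of hub scores of in-neighbours
        h /= h.sum()              # scale down so values do not blow up
        a /= a.sum()
    return h, a

# Tiny illustrative base set: pages 0 and 1 are hubs pointing at pages 2 and 3.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
h, a = hits(A)
print("hubs:", h)          # pages 0 and 1 get the high hub scores
print("authorities:", a)   # pages 2 and 3 get the high authority scores
```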

Page 137:

Scaling

n To prevent the h() and a() values from getting too big, can scale down after each iteration

n Scaling factor doesn't really matter:
n we only care about the relative values of the scores

Page 138:

How many iterations?

n Relative values of scores will converge after a few iterations

n In fact, suitably scaled, h() and a() scores settle into a steady state!
n proof of this comes later

n In practice, ~5 iterations get you close to stability

Page 139:

Japan Elementary Schools

Authorities:
n The American School in Japan
n The Link Page
n Kids' Space
n KEIMEI GAKUEN Home Page (Japanese)
n Shiranuma Home Page
n fuzoku-es.fukui-u.ac.jp
n welcome to Miasa E&J school
n http://www...p/~m_maru/index.html
n fukui haruyama-es HomePage
n Torisu primary school
n goo
n Yakumo Elementary, Hokkaido, Japan
n FUZOKU Home Page
n Kamishibun Elementary School...
(plus several Japanese-language school home pages whose titles did not survive the text encoding)

Hubs:
n schools
n LINK Page-13
n 100 Schools Home Pages (English)
n K-12 from Japan 10/...rnet and Education )
n http://www...iglobe.ne.jp/~IKESAN
n Koulutus ja oppilaitokset
n TOYODA HOMEPAGE
n Education
n Cay's Homepage (Japanese)
n UNIVERSITY
n DRAGON97-TOP
(plus several Japanese-language entries whose titles did not survive the text encoding)

Page 140:

Things to note

n Pulled together good pages regardless of language of page content.

n Use only link analysis after the base set is assembled
n Is HITS query-independent?
n In typical use, no
n Iterative computation after text index retrieval - significant overhead

Page 141:

PageRank vs. HITS: Discussion
n PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to which the formalization is applied
n These two choices are orthogonal
n We could also apply HITS to the entire web and PageRank to a small base set
n On the web, a good hub almost always is also a good authority
n The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect

Page 142:

HITS applications beyond web search

n Researchers publish/present papers in conferences. A conference is reputable if it hosts many reputable researchers to publish/present their papers. A researcher is reputable if s/he publishes/presents many papers in reputable conferences.

n How to compute reputation for conferences? How to compute reputation for researchers?