CSE 494/598 Lecture-3: Indexing, Retrieval, Tolerant Dictionaries
LYDIA MANIKONDA
HTTP://WWW.PUBLIC.ASU.EDU/~LMANIKON/
**Content adapted from last year's slides



Page 1

CSE 494/598 Lecture-3: Indexing, Retrieval, Tolerant Dictionaries
LYDIA MANIKONDA
HTTP://WWW.PUBLIC.ASU.EDU/~LMANIKON/

**Content adapted from last year’s slides

Page 2

Announcements
• Project-1 released. Due: February 15th, 2016

• Analysis report: Before 4 pm (Hard copy)

• Code: Before 11.59 pm (Through email: [email protected] )

• Homework-1 will be released shortly

• Office hours: Monday 3:00 pm – 4:00 pm and Wednesday 11:00 am – 12:00 pm

• Office hours location: M1-38 Brickyard (Mezzanine floor)

• TA office hours: Thursday & Friday 5:00 – 6:00 pm

• Weekly summary

Page 3

Today
• Similarity models/metrics

• Jaccard Similarity

• Vector model

• …

• TF-IDF

• Indexing

• Tolerant dictionaries

Page 4

Background of Information Retrieval
• Traditional Model
  • Given
    • A set of documents
    • A query expressed as a set of keywords
  • Returns
    • A ranked set of documents most relevant to the query
  • Evaluation
    • Precision: Fraction of returned documents that are relevant
    • Recall: Fraction of relevant documents that are returned
    • Efficiency

• Web-induced headaches
  • Scale: billions of documents
  • Hypertext: inter-document connections

• Consequently
  • Ranking that takes link structure into account
    • Authority/Hub
  • Indexing and retrieval algorithms that are ultra fast

Page 5

Jaccard Similarity Metric
• Estimates the degree of overlap between sets (or bags)
• For bags, intersection and union are defined in terms of max & min
• Ex:
  • Document A contains 5 oranges, 8 apples; Document B contains 3 oranges, 12 apples
  • A ∩ B is 3 oranges and 8 apples
  • A ∪ B is 5 oranges and 12 apples
  • Jaccard similarity is (3+8)/(5+12) = 11/17 = 0.65

Can be used with set semantics
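
A minimal Python sketch of the bag Jaccard computation defined above; the function name and the Counter representation are our choices, and the fruit counts reproduce the slide's example.

from collections import Counter

def bag_jaccard(bag_a, bag_b):
    # Bag (multiset) Jaccard: intersection/union via per-term min/max counts
    terms = set(bag_a) | set(bag_b)
    intersection = sum(min(bag_a[t], bag_b[t]) for t in terms)
    union = sum(max(bag_a[t], bag_b[t]) for t in terms)
    return intersection / union if union else 0.0

doc_a = Counter({"orange": 5, "apple": 8})
doc_b = Counter({"orange": 3, "apple": 12})
print(bag_jaccard(doc_a, doc_b))  # (3+8)/(5+12) = 11/17 ≈ 0.65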

Page 6

Exercise: Documents as bags of words
t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear

Similarity(d1,d2) = (24+10+5)/(32+21+9+3+3) = 0.57

• What about d1 and d1d1 (which is a twice-concatenated version of d1)?
  • Need to normalize the docs (e.g., divide coeffs by doc size)
  • Also can better differentiate the coeffs (tf/idf metrics)

Page 7

The Effect of Bag Size
• If you have 2 documents
  • Document 1: 5 apples, 8 oranges
  • Document 2: 9 apples, 4 oranges
  • Jaccard: (5+4)/(9+8) = 9/17 = 0.53
• If you triple the size of Document 1: 15 apples, 24 oranges
  • Jaccard: (9+4)/(15+24) = 13/39 ≈ 0.33 – Similarity has changed!!
• How do we address this?
  • Normalize all documents to the same size
  • A document of 5 apples and 8 oranges can be normalized as: 5/(5+8) apples; 8/(5+8) oranges

Page 8

The Vector Model
• Documents and queries (bags) are seen as vectors over the keyword space
  • vec(dj) = (w1j, w2j, …, wtj) – each vector holds a place for every term in the collection, leading to sparsity
  • vec(q) = (w1q, w2q, …, wtq)
  • wiq >= 0 is the weight associated with the pair (ki, q)
  • wij > 0 whenever ki ∈ dj
• To each term ki is associated a unitary vector vec(i)
  • The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
• The t unitary vectors vec(i) form an orthonormal basis for a t-dimensional space

Page 9

Similarity Function
• The similarity or closeness of a document d = {w1, w2, …, wk} with respect to a query (or another document) q = {q1, q2, …, qk} is computed using a similarity (distance) function.
• Many similarity functions exist:
  • Euclidean distance
  • Dot product
  • Normalized dot product (cosine-theta)
  • …

Page 10

Euclidean distance
Given two document vectors d1 and d2, it is the straight-line distance between these two vectors in a Euclidean space:

Dist(d1,d2) = sqrt( Σi (wi,d1 − wi,d2)^2 )

Page 11

Dot Product
• Given a document vector d and a query vector q
  • Sim(q,d) = dot(q,d) = q1*w1 + q2*w2 + … + qk*wk
• Properties of the dot-product function:
  • Documents having more common terms with a query have higher similarities with the given query
  • For terms that appear in both q and d, those with higher weights contribute more to sim(q,d) than those with lower weights
  • It favors long documents over short documents
  • Computed similarities have no clear upper bound
• Given a document vector d = (0.2, 0, 0.3, 1) and a query vector q = (0.75, 0.75, 0, 1)
  • Sim(q,d) = ??
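
A small sketch of the dot-product similarity; it also evaluates the Sim(q,d) asked just above for the slide's d and q vectors.

def dot(q, d):
    # Dot-product similarity: sum of products of corresponding term weights
    return sum(qi * wi for qi, wi in zip(q, d))

d = (0.2, 0.0, 0.3, 1.0)
q = (0.75, 0.75, 0.0, 1.0)
print(dot(q, d))  # 0.2*0.75 + 0*0.75 + 0.3*0 + 1*1 = 1.15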

Page 12

A normalized similarity metric
• Sim(q, dj) = cos(θ) = (vec(dj) · vec(q)) / (|dj| * |q|) = (Σi wij*wiq) / (|dj| * |q|)
• Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query terms only partially

(Figure: documents a, b, c plotted in the term space spanned by system, interface, user; and the angle between a document vector dj and a query vector q)

cos(A,B) = (A · B) / (|A| * |B|)

Term counts:
            a   b   c
Interface   0   0   1
User        0   1   1
System      2   1   1
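
A short sketch of the cosine computation on the term counts above; the query vector here is a made-up example (a query mentioning interface and user), not part of the slide.

import math

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return dot / norm if norm else 0.0

# Term order: (interface, user, system), counts taken from the table above
a, b, c = (0, 0, 2), (0, 1, 1), (1, 1, 1)
q = (1, 1, 0)  # hypothetical query containing "interface" and "user"
for name, d in [("a", a), ("b", b), ("c", c)]:
    print(name, round(cosine(q, d), 3))  # a scores 0, c scores highest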

Page 13

t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear

(Figure: comparison of the Euclidean and cosine distance metrics as pairwise document-similarity matrices; whiter => more similar)

Page 14

Answering Queries
• Represent query as vector
• Compute distances to all documents
• Rank according to distance
• Example: "database index"
  • Given query Q = {database, index}
  • Query vector q = (1,0,1,0,0,0)

t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear

Page 15

Term Weights in the Vector Model
• Sim(q, dj) = (Σi wij*wiq) / (|dj| * |q|)
• How to compute the weights wij and wiq?
  • Simple keyword frequencies tend to favor common words
    • E.g. query: The Computer Tomography
• Ideally, term weighting should solve the "feature selection problem"
  • Viewing retrieval as a "classification of documents" into those relevant/irrelevant to the query
• A good weight must take two effects into account:
  • Quantification of intra-document contents (similarity)
    • tf factor – term frequency within a document
  • Quantification of inter-document separation (dissimilarity)
    • idf factor – inverse document frequency
• wij = tf(i,j) * idf(i)

Page 16

TF-IDF
• Let
  • N – total number of documents in the collection
  • ni – number of documents that contain ki
  • freq(i,j) – raw frequency of ki within dj
• A normalized tf factor is given by
  • f(i,j) = freq(i,j) / maxl freq(l,j)
  • where the maximum is computed over all terms that occur within the document dj
• The idf factor is computed as
  • idf(i) = log(N/ni)
  • The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki
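
A hedged sketch of these tf and idf formulas in Python; the tokenized toy documents are invented for illustration, and terms are assumed to be already normalized.

import math
from collections import Counter

def tfidf_weights(doc_tokens, all_docs_tokens):
    # w_ij = f(i,j) * log(N / n_i), with f(i,j) = freq(i,j) / max_l freq(l,j)
    N = len(all_docs_tokens)
    df = Counter()                      # n_i: number of documents containing term i
    for toks in all_docs_tokens:
        df.update(set(toks))
    freq = Counter(doc_tokens)          # freq(i,j): raw frequency of term i in this doc
    max_freq = max(freq.values())
    return {t: (f / max_freq) * math.log(N / df[t]) for t, f in freq.items()}

docs = [["database", "sql", "index"],
        ["index", "index", "regression"],
        ["likelihood", "linear"]]
print(tfidf_weights(docs[1], docs))  # "index" has tf 1.0 but a lower idf than "regression"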

Page 17

Document/Query Representation using TF-IDF
• The best term-weighting schemes use weights given by
  • wij = f(i,j) * log(N/ni)
  • this strategy is called a tf-idf weighting scheme
• For the query term weights, several possibilities:
  • wiq = (0.5 + 0.5 * [freq(i,q) / maxl freq(l,q)]) * log(N/ni)
  • Alternatively, just use the idf weights (to give preference to rare words)
  • Let the user give weights to the keywords to reflect her real preferences
    • Easier said than done
    • Help them with "relevance feedback" techniques

Page 18

t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear

Given Q={database, index}= {1,0,1,0,0,0}

Note: In this case, the weights used in query were 1 for t1 and t3, and 0 for the rest.

Page 19

The Vector Model: Summary
• The vector model with tf-idf weights is a good ranking strategy with general collections
• The vector model is usually as good as the known ranking alternatives
• Simple and fast to compute
• Advantages:
  • Term-weighting improves quality of the answer set
  • Partial matching allows retrieval of docs that approximate the query conditions
  • Cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
  • Assumes independence of index terms
  • Does not handle synonymy/polysemy
  • Query weighting may not reflect user relevance criteria

Page 20

Indexing

Page 21

Efficient Retrieval
• Document-term matrix

t1 t2 . . . tj . . . tm nf

d1 w11 w12 . . . w1j . . . w1m 1/|d1|

d2 w21 w22 . . . w2j . . . w2m 1/|d2|

. . . . . . . . . . . . . .

di wi1 wi2 . . . wij . . . wim 1/|di|

. . . . . . . . . . . . . .

dn wn1 wn2 . . . wnj . . . wnm 1/|dn|

• wij is the weight of term tj in document di

• Most wij’s will be zero

Page 22

Naïve Retrieval
• Consider query q = (q1, q2, …, qm), nf = 1/|q|
• How to evaluate q (i.e., compute the similarity between q and every document)?
• Method 1: Compare q with every document directly
• Document data structure:
  di : ((t1, wi1), (t2, wi2), …, (tj, wij), …, (tm, wim), 1/|di|)
  • Only terms with positive weight are kept
  • Terms are in alphabetic order
• Query data structure:
  q : ((t1, q1), (t2, q2), …, (tj, qj), …, (tm, qm), 1/|q|)

Page 23

Naïve Retrieval – Method 1: Compare q with documents directly

Algorithm:
  initialize all sim(q, di) = 0;
  for each document di (i = 1, …, n)
  {
    for each term tj (j = 1, …, m)
      if tj appears in both q and di
        sim(q, di) += qj * wij;
    sim(q, di) = sim(q, di) * (1/|q|) * (1/|di|);
  }
  sort documents in descending similarities and display the top k to the user;

Page 24

Observation
• Method 1 is not efficient
  • Needs to access most non-zero entries in the doc-term matrix
• Solution: Inverted index
  • Data structure to permit fast searching
  • Like an index in the back of a textbook
    • Keywords – page numbers
    • Ex: Precision, 40, 55, 60-63, 89, 220
• Lexicon
• Occurrences

Page 25

Search Processing (Overview)
• Lexicon search
  • Ex: Looking in the index to find the entry
• Retrieval of occurrences
  • Seeing where the term occurs
• Manipulation of occurrences
  • Going to the right page

Page 26

Inverted Files

A file is a list of words by position:
• First entry is the word in position 1 (first word)
• Entry 4562 is the word in position 4562 (4562nd word)
• Last entry is the last word

An inverted file is a list of positions by word!

a (1, 4, 36)
entry (10, 20, 30)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (44)
word (10, 20, 24, 30, 35, 45)
words (7)
4562 (21, 27)

(Figure: a file laid out by position vs. its inverted file)

Page 27

Inverted Index for Multiple Documents

LEXICON: entries of the form (WORD, NDOCS, PTR), e.g. jezebel 20, jezer 3, jezerit 1, jeziah 1, jeziel 1, jezliah 1, jezoar 1, jezrahliah 1, jezreel 39, …

OCCURRENCE INDEX: each lexicon pointer leads to entries of the form (DOCID, OCCUR, POS 1, POS 2, …)

E.g. "jezebel" occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, …

Can also store precomputed “tf”

Positional information is useful for (a) proximity queries and (b) snippet construction

Page 28

Many variations possible
• Address space (flat, hierarchical)

• Position

• TF/IDF info precalculated

• Header, font, tag info stored

• Compression strategies

Page 29

Inverted Files: Several data structures

• For each term tj, create a list (inverted file list) that contains all document ids that have tj

I(tj) = { (d1, w1j), (d2, w2j), …, (di, wij), …, (dn, wnj) }

• di is the document id of the ith document

• Weights come from frequency of the term in document

• Only entries with non-zero weights should be kept
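
A minimal sketch of building such inverted file lists I(tj); here the stored weight is simply the raw term frequency, a simplification of the tf-idf weighting discussed earlier.

from collections import Counter, defaultdict

def build_inverted_index(docs):
    # For each term tj, keep a postings list I(tj) of (doc_id, weight) pairs
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs, start=1):
        for term, freq in Counter(tokens).items():
            index[term].append((doc_id, freq))
    return index

docs = [["database", "sql", "database"], ["index", "sql"], ["database", "index"]]
print(build_inverted_index(docs)["database"])  # [(1, 2), (3, 1)]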

Page 30

Inverted Files (cont.): More data structures
• Normalization factors of documents are pre-computed and stored in an array: nf[i] stores 1/|di|
• Lexicon: a hash table for all terms in the collection, mapping each term tj to a pointer to I(tj)
• Inverted file lists are typically stored on disk
• The number of distinct terms is usually very large

Page 31

Retrieval using Inverted Files

Algorithm:
  initialize all sim(q, di) = 0;
  for each term tj in q
  {
    find I(tj) using the hash table;
    for each (di, wij) in I(tj)
      sim(q, di) += qj * wij;
  }
  for each document di (i = 1, …, n)
    sim(q, di) = sim(q, di) * nf[i];
  sort documents in descending similarities and display the top k to the user;
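
A Python sketch of this scoring loop; the query weights, postings lists, and nf values below are hypothetical, and, following the slide, only nf[i] is applied (the 1/|q| factor is constant across documents, so it does not change the ranking).

from collections import defaultdict

def retrieve(query_weights, index, nf, k=10):
    # Accumulate qj * wij only over the postings of the query terms, then scale by nf[i] = 1/|di|
    sim = defaultdict(float)
    for term, q_w in query_weights.items():
        for doc_id, w in index.get(term, []):
            sim[doc_id] += q_w * w
    ranked = sorted(((d, s * nf[d]) for d, s in sim.items()),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

index = {"database": [("d1", 2), ("d3", 1)], "index": [("d1", 1), ("d2", 1)]}
nf = {"d1": 0.41, "d2": 0.71, "d3": 0.71}   # hypothetical precomputed 1/|di| values
print(retrieve({"database": 1, "index": 1}, index, nf))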

Page 32

Observations about Method 2
• If a document d does not contain any terms of a given query q, then d will not be involved in the evaluation of q
• Only non-zero entries in the columns of the document-term matrix corresponding to the query terms are used to evaluate the query
• Computes the similarities of multiple documents simultaneously (w.r.t. each query word)

Page 33

Efficient Retrieval
q = { (t1, 1), (t3, 1) }, 1/|q| = 0.7071

d1 = { (t1, 2), (t2, 1), (t3, 1) }, nf[1] = 0.4082

d2 = { (t2, 2), (t3, 1), (t4, 1) }, nf[2] = 0.4082

d3 = { (t1, 1), (t3, 1), (t4, 1) }, nf[3] = 0.5774

d4 = { (t1, 2), (t2, 1), (t3, 2), (t4, 2) }, nf[4] = 0.2774

d5 = { (t2, 2), (t4, 1), (t5, 2) }, nf[5] = 0.3333

I(t1) = { (d1, 2), (d3, 1), (d4, 2) }

I(t2) = { (d1, 1), (d2, 2), (d4, 1), (d5, 2) }

I(t3) = { (d1, 1), (d2, 1), (d3, 1), (d4, 2) }

I(t4) = { (d2, 1), (d3, 1), (d4, 1), (d5, 1) }

I(t5) = { (d5, 2) }

Page 34

Efficient Retrieval
After t1 is processed:

sim(q, d1) = 2, sim(q, d2) = 0, sim(q, d3) = 1

sim(q, d4) = 2, sim(q, d5) = 0

After t3 is processed:

sim(q, d1) = 3, sim(q, d2) = 1, sim(q, d3) = 2

sim(q, d4) = 4, sim(q, d5) = 0

After normalization:

sim(q, d1) = .87, sim(q, d2) = .29, sim(q, d3) = .82

sim(q, d4) = .78, sim(q, d5) = 0
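
These numbers can be verified with a few lines of Python; this is a check written for these notes, not part of the original slides.

import math

q = {"t1": 1, "t3": 1}
docs = {"d1": {"t1": 2, "t2": 1, "t3": 1},
        "d2": {"t2": 2, "t3": 1, "t4": 1},
        "d3": {"t1": 1, "t3": 1, "t4": 1},
        "d4": {"t1": 2, "t2": 1, "t3": 2, "t4": 2},
        "d5": {"t2": 2, "t4": 1, "t5": 2}}

nq = 1 / math.sqrt(sum(v * v for v in q.values()))       # 1/|q| = 0.7071
for name, d in docs.items():
    raw = sum(q[t] * d.get(t, 0) for t in q)              # accumulated over t1 and t3
    nf = 1 / math.sqrt(sum(v * v for v in d.values()))    # precomputed 1/|di|
    print(name, round(raw * nq * nf, 2))  # d1 0.87, d2 0.29, d3 0.82, d4 0.78, d5 0.0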

Page 35

Approximate Ranking

Motivation: We want to further reduce the documents for which we compute the query distance, without affecting the top-10 results too much (even at the expense of ranking the 2014th document as 2019th!)

We are willing to do this because (a) most users want high precisions at low recalls (except of course the guy who wrote the 2014th-ranked doc), and (b) the ranking process, based as it is on vector similarity, is not all that sacrosanct anyway…

Page 36

Approximate Ranking

Query-based ideas
• Idea 1: Don't consider documents that have less than k of the query words
• Idea 2: Don't consider documents that don't have query words with IDF above a threshold
  • Idea 2 generalizes Idea 1

Document corpus-based ideas
• Split documents into different (at least two) barrels of decreasing importance but increasing size (e.g. 20% top docs in the short barrel and 80% remaining docs in the long barrel). Focus on the short barrel first when looking for the top 10 matches
• How to split into barrels?
  • Based on some intrinsic measure of importance of the document
  • E.g. short barrel contains articles published in prestigious journals
  • E.g. short barrel contains pages with high PageRank

Can combine both

Page 37

What is indexed?
• Traditional IR only indexed "keywords"

• Which are either manually given or automatically generated through text operations like stemming, stopword elimination, etc.

---------------------------------------------------------------------------------------------------------------------------------

• Modern search engines index the full text

Page 38

Generating Keywords (Index Terms) in Traditional IR
• Stop-word elimination
• Noun phrase detection
• Stemming (Porter stemming for English) – rules like: if the suffix is "ization" and the prefix contains at least one vowel followed by a consonant, then replace the suffix with "ize". Ex: Binarization → Binarize
• Generating index terms
• Improving quality of terms – synonyms, co-occurrence detection, latent semantic indexing

(Figure: text-operations pipeline – docs → structure → accents/spacing/stopwords → noun groups → stemming → manual indexing → full text / index terms)
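
A toy sketch of just the single "ization → ize" rule quoted above; this is not the full Porter stemmer, and the vowel/consonant check is a rough reading of the rule.

VOWELS = set("aeiou")

def stem_ization(word):
    # If the word ends in "ization" and the remaining prefix contains a vowel
    # followed by a consonant, replace the suffix with "ize"
    suffix = "ization"
    if word.endswith(suffix):
        prefix = word[:-len(suffix)]
        has_vowel_consonant = any(a in VOWELS and b not in VOWELS
                                  for a, b in zip(prefix, prefix[1:]))
        if has_vowel_consonant:
            return prefix + "ize"
    return word

print(stem_ization("binarization"))  # binarize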

Page 39

Example of Stemming and Stopword Elimination

The number of Web pages on the World Wide Web was estimated to be over 800 million in 1999.

Stop word elimination
Stemming

So does Google use stemming? All kinds of stemming?
Stopword elimination? Any non-obvious stop-words?

Page 40

Why do stop-word elimination?
• Reduces the index size
• Improves answer relevance
  • "the rose" is converted to "rose"

Modern search engines don't care about the size of the index. But how about the relevance part? The idf weight takes care of it to a certain extent.

Page 41

K-gram Indexes for Spelling Correction
• Enumerate all k-grams in the query term
• Use the k-gram index to retrieve "correct" words that match query-term k-grams
• Threshold by number of matching k-grams
  • E.g., only vocabulary terms that differ by at most 3 k-grams
• Example: bigram index, misspelled word bordroom
  • Bigrams: bo, or, rd, dr, ro, oo, om

How do we decide what is a correct word?
• Webster dictionary: would be good, but it may not have all the special-purpose words
• So, use the lexicon of the inverted index itself. The postings list contains the number of times a word appears in the corpus; if it is "high" you can assume it is a correct word.
• Correction can also be done w.r.t. query logs rather than the document corpus.
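
A rough sketch of the bigram-overlap filtering described above; the vocabulary list and the overlap threshold are made up for illustration (a real system would consult the k-gram index rather than scanning the whole lexicon).

def kgrams(word, k=2):
    # The set of character k-grams of a word (bigrams by default)
    return {word[i:i + k] for i in range(len(word) - k + 1)}

def candidates(misspelled, vocabulary, k=2, min_overlap=4):
    # Keep vocabulary words sharing at least min_overlap k-grams with the query term
    query_grams = kgrams(misspelled, k)
    return [w for w in vocabulary
            if len(query_grams & kgrams(w, k)) >= min_overlap]

vocab = ["boardroom", "border", "bedroom", "broom"]
print(candidates("bordroom", vocab))  # ['boardroom', 'bedroom'] pass the threshold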

Page 42

Edit Distance
• The edit distance between strings s1 and s2 is the minimum number of basic operations needed to convert s1 to s2
• Levenshtein distance: the admissible basic operations are insert, delete, and replace
• Levenshtein distance:
  • dog-do: 1
  • cat-cart: 1
  • cat-cut: 1
  • cat-act: 2
• Damerau-Levenshtein distance: cat-act: 1
  • Damerau-Levenshtein includes transposition as a fourth possible operation

Page 43

Exercise: Edit Distance
What is the Levenshtein distance between "kitten" and "sitting"?

= 3

kitten → sitten (substitute "s" for "k")
sitten → sittin (substitute "i" for "e")
sittin → sitting (insert "g" at the end)
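
A standard dynamic-programming implementation of Levenshtein distance (insert, delete, replace at unit cost), which reproduces the distances used on these slides.

def levenshtein(s1, s2):
    # Row-by-row dynamic programming over the edit-distance table
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            cur.append(min(prev[j] + 1,          # delete c1
                           cur[j - 1] + 1,       # insert c2
                           prev[j - 1] + cost))  # replace (or match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))    # 3
print(levenshtein("umbrella", "mbkrella")) # 2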

Page 44

Weighted Edit Distance
• As above, but the weight of an operation depends on the characters involved
• Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q
• Therefore, replacing m by n has a smaller edit cost than replacing it by q
• So how do we get the weights?
  • Learn…

Page 45

Edit Distance and Optimal Alignment
• Finding the Levenshtein distance between two strings is non-trivial, since we need to find the minimum number of changes needed. For this you need to align the strings correctly first
• E.g. consider umbrella and mbkrella
  • If you compare them position by position, the leading characters all look wrong (u vs m, m vs b, b vs k), over-counting the distance under that naive alignment
  • If you instead shift the alignment appropriately, you will see that the distance is 2 (u deleted and k inserted)
• Conceptually you want to first find the best alignment and compute the distance w.r.t. it. It turns out that these two phases can be done together using dynamic programming algorithms
• See Manning et al., chapter on tolerant retrieval

Similar to the sequence alignment task in genetics and dynamic time warping in speech recognition

Page 46

Using Edit Distance
Motivation: To reduce computation, we want to focus not on all words in the dictionary but on a subset of them
• Given the query, first enumerate all character sequences within a preset (possibly weighted) edit distance
• Intersect this set with the list of "correct" words
• Then suggest the terms you found to the user
  • Or do automatic correction – but this is potentially expensive and disempowers the user

Page 47

Peter Norvig’s Spell Corrector

http://norvig.com/spell-correct.html

Page 48

Bayesian Account of Spelling Correction

Given a dictionary of words and a partial or complete typing of a "word", complete/correct the word:

argmaxc P(c|w) = argmaxc P(w|c) P(c) / P(w)

P(w|c): error model
• What is the probability that you will type w when you meant c?
• Different kinds of errors (e.g. letter swapping) have different probabilities
• Consider edit distance

P(c): "language model"
• How frequent is c in the language that is used?
• "Fallacy of the prior": how people get all hypochondriac because they think they have symptoms of a fatal disease (which turns out to be extremely rare)

http://norvig.com/spell-correct.html
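
A much-simplified sketch in the spirit of Norvig's corrector: the word counts standing in for P(c) are invented, and the error model P(w|c) is flattened to "prefer any known word within one edit", so the prior alone decides among candidates.

from collections import Counter

WORD_COUNTS = Counter({"the": 1000, "they": 120, "then": 80, "thaw": 2})  # hypothetical corpus counts
TOTAL = sum(WORD_COUNTS.values())

def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(w):
    # argmax_c P(c|w) ~ argmax_c P(w|c) * P(c); with the flattened error model,
    # only the language model P(c) ranks the surviving candidates
    cands = {c for c in edits1(w) if c in WORD_COUNTS} or {w}
    return max(cands, key=lambda c: WORD_COUNTS[c] / TOTAL)

print(correct("thw"))  # "the" wins because it is by far the most frequent known candidate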

Page 49

Lessons Learned Today
• Vector space models

• TF-IDF

• Inverted Indexing

• Generating keywords – Stemming, Spelling Corrections