Modern Information Retrieval Lecture 3: Boolean Retrieval

  • View
    220

  • Download
    4

Embed Size (px)

Transcript

Modern Information Retrieval

Modern Information RetrievalLecture 3: Boolean Retrieval1Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferences

Marjan Ghazvininejad2Sharif University Spring 2012Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferences

Marjan Ghazvininejad3Sharif University Spring 2012Types of dataUnstructuredSemi-structuredStructuredMarjan Ghazvininejad4Sharif University Spring 2012Unstructured dataTypically refers to free textAllowsKeyword queries including operatorsMore sophisticated concept queries e.g.,find all web pages dealing with drug abuseClassic model for searching text documents

5Marjan GhazvininejadSharif University Spring 2012Unstructured (text) vs. structured (database) data in 19966

Marjan GhazvininejadSharif University Spring 2012Unstructured (text) vs. structured (database) data in 2009

7Marjan GhazvininejadSharif University Spring 2012Unstructured data in 1680Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?One could grep all of Shakespeares plays for Brutus and Caesar, then strip out lines containing Calpurnia?Why is that not the answer?Slow (for large corpora)NOT Calpurnia is non-trivialOther operations (e.g., find the word Romans near countrymen) not feasibleRanked retrieval (best documents to return)Later lectures8Sec. 1.1Marjan GhazvininejadSharif University Spring 2012Grep is line-oriented; IR is document oriented.8Semi-structured dataIn fact almost no data is unstructuredE.g., this slide has distinctly identified zones such as the Title and BulletsFacilitates semi-structured search such asTitle contains data AND Bullets contain search

to say nothing of linguistic structure9Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferences

Marjan Ghazvininejad10Sharif University Spring 2012Term-document incidence

1 if play contains word, 0 otherwiseBrutus AND Caesar BUT NOT CalpurniaSec. 1.1Marjan Ghazvininejad11Sharif University Spring 201211Incidence vectorsSo we have a 0/1 vector for each term.To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND.110100 AND 110111 AND 101111 = 100100. 12Sec. 1.1Marjan GhazvininejadSharif University Spring 2012Bigger collectionsConsider N = 1 million documents, each with about 1000 words.Avg 6 bytes/word including spaces/punctuation 6GB of data in the documents.Say there are M = 500K distinct terms among these.13Sec. 1.1Marjan GhazvininejadSharif University Spring 2012Cant build the matrix500K x 1M matrix has half-a-trillion 0s and 1s.But it has no more than one billion 1s.matrix is extremely sparse.Whats a better representation?We only record the 1 positions.14Why?Sec. 1.1Marjan GhazvininejadSharif University Spring 2012Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferences

Marjan Ghazvininejad15Sharif University Spring 2012Inverted indexFor each term t, we must store a list of all documents that contain t.Identify each by a docID, a document serial numberCan we used fixed-size arrays for this?

16What happens if the word Caesar is added to document 14? Sec. 1.2BrutusCalpurniaCaesar12456165713212411314517323117454101Marjan GhazvininejadSharif University Spring 2012Inverted indexWe need variable-size postings listsOn disk, a continuous run of postings is normal and bestIn memory, can use linked lists or variable length arraysSome tradeoffs in size/ease of insertion17DictionaryPostingsSorted by docID (more later on why).PostingSec. 1.2BrutusCalpurniaCaesar12456165713212411314517323117454101Linked lists generally preferred to arraysDynamic space allocationInsertion of terms into documents easySpace overhead of pointers17TokenizerToken streamFriendsRomansCountrymenInverted index constructionLinguistic modulesModified tokensfriendromancountrymanIndexerInverted indexfriendromancountryman24213161

Documents tobe indexedFriends, Romans, countrymen.Sec. 1.2Indexer steps: Token sequenceSequence of (Modified token, Document ID) pairs.I did enact JuliusCaesar I was killed i' the Capitol; Brutus killed me.Doc 1So let it be withCaesar. The nobleBrutus hath told youCaesar was ambitiousDoc 2

Sec. 1.2Marjan Ghazvininejad19Sharif University Spring 2012Indexer steps: SortSort by termsAnd then docID

Core indexing stepSec. 1.2Marjan Ghazvininejad20Sharif University Spring 2012Indexer steps: Dictionary & PostingsMultiple term entries in a single document are merged.Split into Dictionary and PostingsDoc. frequency information is added.

Sec. 1.2

Marjan Ghazvininejad21Sharif University Spring 2012

Where do we pay in storage?22PointersTerms and countsLater in the course:How do we index efficiently?How much storage do we need?Sec. 1.2Lists of docIDsMarjan GhazvininejadSharif University Spring 2012Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferences

Marjan Ghazvininejad23Sharif University Spring 2012Query processing: ANDConsider processing the query:Brutus AND CaesarLocate Brutus in the Dictionary;Retrieve its postings.Locate Caesar in the Dictionary;Retrieve its postings.Merge the two postings:2412834248163264123581321BrutusCaesarSec. 1.3Marjan GhazvininejadSharif University Spring 2012The mergeWalk through the two postings simultaneously, in time linear in the total number of postings entries253412824816326412358132112834248163264123581321BrutusCaesar28If list lengths are x and y, merge takes O(x+y) operations.Crucial: postings sorted by docID.Sec. 1.3Marjan GhazvininejadSharif University Spring 2012Boolean queries: Exact matchThe Boolean retrieval model is being able to ask a query that is a Boolean expression:Boolean Queries use AND, OR and NOT to join query termsViews each document as a set of wordsIs precise: document matches condition or not.Perhaps the simplest model to build an IR system onPrimary commercial retrieval tool for 3 decades. Many search systems you still use are Boolean:Email, library catalog, Mac OS X Spotlight26Sec. 1.3Marjan GhazvininejadSharif University Spring 2012Boolean queriesMany professional searchers still like Boolean searchYou know exactly what you are gettingBut that doesnt mean it actually works better.

Marjan Ghazvininejad27Sharif University Spring 2012Boolean queries: More general mergesExercise: Adapt the merge for the queries:Brutus AND NOT CaesarBrutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?What can we achieve?

28Sec. 1.3Marjan GhazvininejadSharif University Spring 2012MergingWhat about an arbitrary Boolean formula?(Brutus OR Caesar) AND NOT(Antony OR Cleopatra)Can we always merge in linear time?Linear in what?Can we do better?29Sec. 1.3Marjan GhazvininejadSharif University Spring 2012Lecture OverviewData TypesIncidence VectorsInverted IndexesQuery Processing Query OptimizationDiscussionReferences

Marjan Ghazvininejad30Sharif University Spring 2012Query optimizationWhat is the best order for query processing?Consider a query that is an AND of n terms.For each of the n terms, get its postings, then AND them together.

Query: Brutus AND Calpurnia AND Caesar31Sec. 1.3Marjan Ghazvininejad31Sharif University Spring 2012BrutusCalpurniaCaesar123581621342481632641281316Query optimization exampleProcess in order of increasing freq:start with smallest set, then keep cutting further.32This is why we keptdocument freq. in dictionaryExecute the query as (Calpurnia AND Brutus) AND Caesar.Sec. 1.3BrutusCaesarCalpurnia123581621342481632641281316Marjan GhazvininejadSharif University Spring 2012More general optimizatione.g., (madding OR crowd) AND (ignoble OR strife)Get doc. freq.s for all terms.Estimate the size of each OR by the sum of its doc. freq.s (conservative).Process in increasing order of OR sizes.33Sec. 1.3Marjan GhazvininejadSharif University Spring 2012ExerciseRecommend a query processing order for

34(tangerine OR trees) AND(marmalade OR skies) AND(kaleidoscope OR eyes)

Marjan GhazvininejadSharif University Spring 2012Query processing exercisesExercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.

35Marjan GhazvininejadSharif University Spring 2012ExerciseTry the search feature at http://www.rhymezone.com/shakespeare/Write down five search features you think it could do better36Marjan GhazvininejadSharif University Spring 2012Whats ahead in IR?Beyond term searchWhat about phrases?Stanford UniversityProximity: Find Gates NEAR Microsoft.Need index to capture position information in docs.Zones in documents: Find documents with (author = Ullman) AND (text contains automata).37Marjan GhazvininejadSharif University Spring 2012Evidence accumulation1 vs. 0 occurrence of a search term2 vs. 1 occurrence3 vs. 2 occurrences, etc.Usually more seems betterNeed term frequency information in docs38Marjan GhazvininejadSharif University Spring 2012Ranking search resultsBoolean queries give inclusion or exclusion of docs.Often we want to rank/group resultsNeed to measure proximity from query to each doc.Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.39Marjan GhazvininejadSharif University Spring 2012Clustering, classification and rankingClustering: Given a set of docs, group them into clusters based