Search Engines & Question Answering
Indexing

Giuseppe Attardi
Università di Pisa
(some slides borrowed from C. Manning, H. Schütze)
Topics

Indexing and Search
– Indexing and inverted files
– Compression
– Postings lists
– Query processing
Indexing

Inverted index storage
– Compressing dictionaries in memory
Processing Boolean queries
– Optimizing term processing
– Skip list encoding
Wild-card queries
Positional/phrase/proximity queries
Query

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Could we grep all of Shakespeare’s plays for Brutus and Caesar, then strip out the lines containing Calpurnia?
– Slow (for large corpora)
– NOT is non-trivial
– Other operations (e.g., find the phrase Romans and countrymen) are not feasible
Term-document incidence

1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1             0          0        0        0
Brutus             1               1             0          1        0        0
Caesar             1               1             0          1        1        1
Calpurnia          0               1             0          0        0        0
Cleopatra          1               0             0          0        0        0
mercy              1               0             1          1        1        1
worser             1               0             1          1        1        0
Incidence vectors

So we have a 0/1 vector for each term
To answer the query:
– take the vectors for Brutus, Caesar and Calpurnia (complemented)
– perform a bitwise AND
110100 AND 110111 AND 101111 = 100100
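The bitwise step above can be sketched with std::bitset (a toy illustration, not a practical representation for large corpora; the six bits follow the play columns of the incidence table, leftmost first):

```cpp
#include <bitset>

// One bit per play, in the column order of the incidence table:
// Antony&Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
// std::bitset parses the constructor string leftmost-character-first.
using Plays = std::bitset<6>;

// Brutus AND Caesar AND (NOT Calpurnia)
Plays answer(const Plays& brutus, const Plays& caesar, const Plays& calpurnia) {
    return brutus & caesar & ~calpurnia;
}
```

E.g. answer(Plays("110100"), Plays("110111"), Plays("010000")) gives 100100: Antony and Cleopatra, and Hamlet.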
Answers to query

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger corpora

Consider n = 1 M documents, each with about 1 K terms
Avg 6 bytes/term, including spaces/punctuation
– 6 GB of data
Say there are m = 500 K distinct terms among these
Can’t build the matrix

A 500 K × 1 M matrix has half a trillion 0’s and 1’s
But it has no more than one billion 1’s (why? 1 M docs × at most 1 K distinct terms each)
– the matrix is extremely sparse
What’s a better representation?
Documents are parsed to extract words, and these are saved with the document ID:

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

(Term, doc #) pairs, in order of occurrence:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
Inverted index

After all documents have been parsed, the inverted file is sorted by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2
Multiple term entries in a single document are merged, and frequency information is added (term, doc #, freq):
ambitious 2 1, be 2 1, brutus 1 1, brutus 2 1, capitol 1 1, caesar 1 1, caesar 2 2, did 1 1, enact 1 1, hath 2 1, I 1 2, i' 1 1, it 2 1, julius 1 1, killed 1 2, let 2 1, me 1 1, noble 2 1, so 2 1, the 1 1, the 2 1, told 2 1, you 2 1, was 1 1, was 2 1, with 2 1
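The whole parse–sort–merge pipeline can be sketched in a few lines (a minimal illustration, not the production indexer; tokenization here is plain whitespace splitting):

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// A posting: one (doc, within-doc frequency) entry.
struct Posting { int doc; int freq; };

// Parse (docID, text) pairs into an inverted file mapping
// term -> list of (doc, freq) postings, with duplicate (term, doc)
// entries merged into a frequency, as in the slides.
std::map<std::string, std::vector<Posting>>
build_index(const std::vector<std::pair<int, std::string>>& docs) {
    // term -> doc -> freq (std::map keeps terms sorted, like the sorted file)
    std::map<std::string, std::map<int, int>> counts;
    for (const auto& [id, text] : docs) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) ++counts[word][id];
    }
    std::map<std::string, std::vector<Posting>> index;
    for (const auto& [term, perdoc] : counts)
        for (const auto& [doc, freq] : perdoc)
            index[term].push_back({doc, freq});
    return index;
}
```

Applied to (lower-cased) Doc 1 and Doc 2 above, caesar gets the postings (1, 1) and (2, 2).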
The file is commonly split into a Dictionary and a Postings file

Dictionary (term, # docs, total freq):
ambitious 1 1, be 1 1, brutus 2 2, capitol 1 1, caesar 2 3, did 1 1, enact 1 1, hath 1 1, I 1 2, i' 1 1, it 1 1, julius 1 1, killed 1 2, let 1 1, me 1 1, noble 1 1, so 1 1, the 2 2, told 1 1, you 1 1, was 2 2, with 1 1

Postings ((doc #, freq) entries, one run per term):
(2,1) (2,1) (1,1) (2,1) (1,1) (1,1) (2,2) (1,1) (1,1) (2,1) (1,2) (1,1) (2,1) (1,1) (1,2) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (2,1) (1,1) (2,1) (2,1)
Where do we pay in storage?

Terms: the dictionary entries (term, # docs, total freq)
Pointers: the postings entries (doc #, freq), one run per term
Two conflicting forces

A term like Calpurnia occurs in maybe one doc out of a million; we would like to store this pointer using log2 1M ≈ 20 bits
A term like the occurs in virtually every doc, so 20 bits/pointer is too expensive
– Prefer the 0/1 vector in this case
Postings file entry

Store the list of docs containing a term in increasing order of doc id
– Brutus: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps
– 33, 14, 107, 5, 43 …
Hope: most gaps can be encoded with far fewer than 20 bits
Variable encoding

For Calpurnia, use ~20 bits/gap entry
For the, use ~1 bit/gap entry
If the average gap for a term is G, we want to use ~log2 G bits/gap entry
γ codes for gap encoding

Represent a gap G as the pair <length, offset>
length is in unary and uses ⌊log2 G⌋ + 1 bits to specify the length of the binary encoding of
offset = G − 2^⌊log2 G⌋
e.g., 9 is represented as 1110 001 (length = 1110, offset = 001)
Encoding G takes 2⌊log2 G⌋ + 1 bits
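A sketch of the γ code as bit strings (illustrative; a real implementation would pack bits, not characters):

```cpp
#include <cstdint>
#include <string>

// Gamma-encode a gap g >= 1: floor(log2 g) ones, a terminating zero,
// then the binary offset g - 2^floor(log2 g) in floor(log2 g) bits.
std::string gamma_encode(uint64_t g) {
    int len = 0;
    while ((g >> (len + 1)) != 0) ++len;   // len = floor(log2 g)
    std::string out(len, '1');
    out += '0';
    for (int i = len - 1; i >= 0; --i)     // offset bits, MSB first
        out += ((g >> i) & 1) ? '1' : '0';
    return out;
}

// Decode a single gamma code back to the gap.
uint64_t gamma_decode(const std::string& bits) {
    size_t len = 0;
    while (bits[len] == '1') ++len;        // unary length part
    uint64_t g = 1;                        // implicit leading 1 bit
    for (size_t i = len + 1; i < bits.size(); ++i)
        g = (g << 1) | (bits[i] == '1');
    return g;
}
```

gamma_encode(9) yields "1110001", the 2⌊log2 9⌋ + 1 = 7-bit code from the slide.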
What we’ve just done

Encoded each gap as tightly as possible, to within a factor of 2
For better tuning (and a simple analysis) we need some handle on the distribution of gap values
Zipf’s law

The kth most frequent term has frequency proportional to 1/k
Use this for a crude analysis of the space used by our postings file pointers
Rough analysis based on Zipf

The most frequent term occurs in n docs
– n gaps of 1 each
The second most frequent term occurs in n/2 docs
– n/2 gaps of 2 each …
The kth most frequent term occurs in n/k docs
– n/k gaps of k each, using 2⌊log2 k⌋ + 1 bits per gap
– net of ~(2n/k) log2 k bits for the kth most frequent term
Sum over k from 1 to 500 K

Do this by breaking the values of k into groups:
– group i consists of 2^(i−1) ≤ k < 2^i
Group i has 2^(i−1) components in the sum, each contributing at most (2ni)/2^(i−1)
Summing over i from 1 to 19 gives a net estimate of ~340 Mbits, ~45 MB for our index
(Work out the calculation.)
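The calculation can be worked out directly (a sketch with n = 1 M and m = 500 K; the grouped sum is an upper bound, while summing term by term comes out somewhat lower — both in the few-hundred-Mbit range of the estimate above):

```cpp
// Direct sum: the kth most frequent term (k = 1..m) contributes n/k gaps
// of ~k, each gamma-encoded in 2*floor(log2 k) + 1 bits.
double zipf_postings_bits(double n, int m) {
    double total = 0;
    for (int k = 1; k <= m; ++k) {
        int fl = 0;
        while ((k >> (fl + 1)) != 0) ++fl;   // fl = floor(log2 k)
        total += (n / k) * (2 * fl + 1);
    }
    return total;  // bits
}

// Grouped upper bound from the slide: group i (2^(i-1) <= k < 2^i)
// contributes at most 2*n*i bits; summing i = 1..19 gives 2n*190 bits.
double grouped_bound_bits(double n, int groups) {
    double total = 0;
    for (int i = 1; i <= groups; ++i) total += 2 * n * i;
    return total;
}
```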
Caveats

This is not the entire space for our index:
– it does not account for dictionary storage
– as we get further, we’ll store even more stuff in the index
It assumes Zipf’s law applies to the occurrence of terms in docs
All gaps for a term are taken to be the same
It does not address query processing
Issues with the index we just built

How do we process a query?
What terms in a doc do we index?
– All words, or only “important” ones?
Stopword list: terms so common that they’re ignored for indexing
– e.g., the, a, an, of, to …
– language-specific
Exercise: repeat the postings size calculation if the 100 most frequent terms are not indexed.
Issues in what to index

Cooper’s vs. Cooper vs. Coopers
Full-text vs. full text vs. {full, text} vs. fulltext
Accents: résumé vs. resume

Cooper’s concordance of Wordsworth was published in 1911. The applications of full-text retrieval are legion: they include résumé scanning, litigation support and searching published journals on-line.
Punctuation

Ne’er: use a language-specific, handcrafted “locale” to normalize
State-of-the-art: break up the hyphenated sequence
U.S.A. vs. USA: use the locale
a.out
Numbers

3/12/91
Mar. 12, 1991
55 B.C.
B-52
100.2.86.144
– Generally, don’t index numbers as text
– Creation dates for docs
Case folding

Reduce all letters to lower case
– exception: upper case in mid-sentence
  • e.g., General Motors
  • Fed vs. fed
  • SAIL vs. sail
Thesauri and soundex

Handle synonyms and homonyms
– Hand-constructed equivalence classes
  • e.g., car = automobile
  • your = you’re
Index such equivalences, or expand the query?
– More later ...
Spell correction

Look for all words within (say) edit distance 3 (insert/delete/replace) at query time
– e.g. Alanis Morisette
Spell correction is expensive and slows the query (up to a factor of 100)
– Invoke it only when the index returns zero matches
– What if docs contain misspellings?
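The edit-distance measure assumed here is the classic Levenshtein dynamic program (insert/delete/replace, each at cost 1); a correction candidate is any lexicon word within distance ≤ 3 of the query term:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance between a and b via the standard DP table:
// d[i][j] = cost of transforming the first i chars of a into the
// first j chars of b.
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = (int)i;
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = (int)j;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,                       // delete
                                d[i][j - 1] + 1,                       // insert
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1])}); // replace
    return d[a.size()][b.size()];
}
```

E.g. "morisette" is within distance 1 of "morissette", so the correctly spelled name would be found.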
Lemmatization

Reduce inflectional/variant forms to the base form
E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Stemming

Reduce terms to their “roots” before indexing
– language dependent
– e.g. automate(s), automatic, automation all reduced to automat

for example compressed and compression are both accepted as equivalent to compress
→ for exampl compres and compres are both accept as equival to compres
Porter’s algorithm

The commonest algorithm for stemming English
Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: of the rules in a compound command, select the one that applies to the longest suffix
Typical rules in Porter

sses → ss
ies → i
ational → ate
tional → tion
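The “longest suffix wins” convention can be sketched as follows (only the convention — not the real Porter algorithm, which also checks measure conditions on the stem):

```cpp
#include <string>
#include <vector>

// One rewrite rule: replace a matching suffix with a replacement.
struct Rule { std::string suffix, replacement; };

// Apply the one rule (if any) whose suffix matches and is longest,
// per Porter's compound-command convention.
std::string apply_longest(const std::string& word,
                          const std::vector<Rule>& rules) {
    const Rule* best = nullptr;
    for (const auto& r : rules) {
        if (word.size() >= r.suffix.size() &&
            word.compare(word.size() - r.suffix.size(),
                         r.suffix.size(), r.suffix) == 0 &&
            (!best || r.suffix.size() > best->suffix.size()))
            best = &r;
    }
    if (!best) return word;
    return word.substr(0, word.size() - best->suffix.size())
           + best->replacement;
}
```

With the four rules above, relational → relate (ational beats the shorter match tional), while conditional → condition.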
Other stemmers

Other stemmers exist, e.g., the Lovins stemmer http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
– single-pass, longest suffix removal (about 250 rules)
– motivated by linguistics as well as IR
Full morphological analysis: modest benefits for retrieval
Beyond term search

What about phrases?
Proximity: Find Gates NEAR Microsoft
– Need the index to capture position information in docs
Zones in documents: Find documents with (author = Ullman) AND (text contains automata)
Dictionary and postings files: a fast, compact inverted index

Dictionary (term, # docs, total freq):
Term N docs Tot Freq
ambitious 1 1
be 1 1
brutus 2 2
capitol 1 1
caesar 2 3
did 1 1
enact 1 1
hath 1 1
I 1 2
i' 1 1
it 1 1
julius 1 1
killed 1 2
let 1 1
me 1 1
noble 1 1
so 1 1
the 2 2
told 1 1
you 1 1
was 2 2
with 1 1
The dictionary is usually kept in memory; the postings are gap-encoded, on disk.
Inverted index storage

Dictionary storage
– Dictionary in main memory, postings on disk
  • This is common, especially for something like a search engine where high throughput is essential; one can also store most of the dictionary on disk with a small in-memory index
Tradeoffs between compression and query processing speed
– Cascaded family of techniques
How big is the lexicon V?

It grows (but more slowly) with corpus size
An empirically okay model (Heaps’ law):
V = kN^b
where b ≈ 0.5, k ≈ 30–100; N = # tokens
For instance, the TREC collection (2 GB; 750,000 newswire articles) has ~500,000 terms
The number is decreased by case-folding and stemming
Indexing all numbers could make it extremely large (so some search engines don’t)
Spelling errors contribute a fair bit of the size
Exercise: can one derive this from Zipf’s law?
Dictionary storage - first cut

Array of fixed-width entries
– 500,000 terms; 28 bytes/term = 14 MB

Terms (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
a                  999,712           •
aardvark           71                •
….                 ….                ….
zzzz               99                •

Allows for fast binary search into the dictionary
Exercises

Is binary search really a good idea?
What are the alternatives?
Fixed-width terms are wasteful

Most of the bytes in the Terms column are wasted: we allot 20 bytes even for 1-letter terms
– and still can’t handle supercalifragilisticexpialidocious
Written English averages ~4.5 characters per word
– Exercise: why is/isn’t this the number to use for estimating the dictionary size?
– Short words dominate token counts
The average word type in English is ~8 characters
Store the dictionary as a string of characters:
– The pointer to the next word marks the end of the previous one
– Hope to save up to 60% of dictionary space
Compressing the term list

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.   Postings ptr.   Term ptr.
33      •               •
29      •               •
44      •               •
126     •               •

Binary search these term pointers
Total string length = 500 K × 8 = 4 MB
Pointers resolve 4 M positions: log2 4M = 22 bits = 3 bytes
Total space for compressed list

4 bytes per term for Freq
4 bytes per term for the pointer to Postings
3 bytes per term pointer
Avg. 8 bytes per term in the term string
500 K terms → 9.5 MB

Now avg. 11 bytes/term, not 20.
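A sketch of this layout (a hypothetical structure, not IXE’s): all terms concatenated into one string, each table entry holding a frequency, a postings pointer and a term pointer; a term ends where the next entry’s term pointer begins.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Fixed-width table entry pointing into the shared term string.
struct Entry { uint32_t freq; uint32_t postings; uint32_t term; };

struct StringDict {
    std::string pool;            // concatenated terms
    std::vector<Entry> entries;  // in sorted term order

    void add(const std::string& t, uint32_t freq, uint32_t postings) {
        entries.push_back({freq, postings, (uint32_t)pool.size()});
        pool += t;
    }
    // Recover term i: it ends where the next term pointer begins.
    std::string term(size_t i) const {
        uint32_t begin = entries[i].term;
        uint32_t end = (i + 1 < entries.size()) ? entries[i + 1].term
                                                : (uint32_t)pool.size();
        return pool.substr(begin, end - begin);
    }
    // Binary search on the term pointers; returns entry index or -1.
    long find(const std::string& t) const {
        long lo = 0, hi = (long)entries.size() - 1;
        while (lo <= hi) {
            long mid = (lo + hi) / 2;
            std::string m = term(mid);
            if (m == t) return mid;
            if (m < t) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }
};
```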
Blocking

Store pointers to every kth term on the term string
Need to store term lengths (1 extra byte)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Freq.   Postings ptr.   Term ptr.
33      •               • (block start)
29      •
44      •
126     •
7       •               • (next block)

With k = 4: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.
Exercise

Estimate the space usage (and the savings compared to 9.5 MB) with blocking, for block sizes of k = 4, 8 and 16
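One way to set up the estimate, assuming the unblocked cost of 19 bytes/term from the previous slide (4 freq + 4 postings ptr + 3 term ptr + 8 string), adding 1 length byte per term and amortizing the 3-byte term pointer over a block of k:

```cpp
// Blocked-dictionary space in MB, under the stated assumptions:
// per term we pay 4 (freq) + 4 (postings ptr) + 8 (avg string) + 1
// (length byte), plus 3/k bytes for the once-per-block term pointer.
double blocked_dict_mb(int k, double nterms) {
    double per_term = 4 + 4 + 8 + 1 + 3.0 / k;  // bytes
    return nterms * per_term / 1e6;             // MB
}
```

With 500 K terms this gives ~8.88 MB at k = 4, shrinking slightly further at k = 8 and 16, versus the 9.5 MB unblocked figure.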
Design Goals

A specialized tool (indexing and search)
A C++ framework with high-level primitives
– Applications built with few lines of C++
– Specialization by inheritance
High performance
Scalability
Simple to maintain
– Hard to deal with autoconf, autoheader, automake, configure, libtool, …
– Developed my own Make templates

Keep it as simple as possible, but not simpler.
— Albert Einstein
Lexicon

[Figure: a bigram index into the word index, which points into the postings]
ab: 24930
ac: 24931
ad: 24932
ae: 24933
Word index: ate, cent, cute, rial, … (null-separated)
Extreme compression (see MG)

Front-coding:
– Sorted words commonly have a long common prefix: store differences only (for 3 out of 4)
Use perfect hashing to store terms “within” their pointers
– not good for vocabularies that change
Partition the dictionary into pages
– use a B-tree on the first terms of pages
– pay a disk seek to grab each page
– if we’re paying 1 disk seek anyway to get the postings, “only” another seek per query term
Is it worth it?

Average lexicon search time:
– IXE: 8 msec
– Front coding: 6 msec
Average query time: 300 msec
Number Encoding

There is a whole chapter on this in Managing Gigabytes
Best solution: local Bernoulli, using Golomb coding
– Roughly: quotient (unary) + remainder (binary)
– Compression: ~1 bit per posting
Quick and clean solution: eptacode
Eptacode

Use 7 bits in a byte; the sign bit is the continuation

Largest value representable in n bytes:

Bytes   Eptacode         Golomb
1       127              127
2       16,129           16,384
3       2,048,383        2,097,152
4       260,144,641      268,435,456
5       33,038,369,407   34,359,738,368
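A minimal variable-byte codec in this spirit (an assumption: plain 7-bit var-byte with a continuation bit, giving 2^(7n) per n bytes; the 127^n eptacode figures suggest the actual scheme reserves a byte value, which is not modeled here):

```cpp
#include <cstdint>
#include <vector>

// Encode n into 7-bit groups, least significant first; the high bit of
// a byte set means "more bytes follow" (the continuation bit).
std::vector<uint8_t> vbyte_encode(uint64_t n) {
    std::vector<uint8_t> out;
    while (n >= 128) {
        out.push_back((uint8_t)(n & 0x7F) | 0x80);  // continuation set
        n >>= 7;
    }
    out.push_back((uint8_t)n);                      // last byte, high bit clear
    return out;
}

uint64_t vbyte_decode(const std::vector<uint8_t>& bytes) {
    uint64_t n = 0;
    int shift = 0;
    for (uint8_t b : bytes) {
        n |= (uint64_t)(b & 0x7F) << shift;         // 7 data bits per byte
        shift += 7;
    }
    return n;
}
```

Small gaps cost one byte; 2 bytes cover values up to 2^14 − 1, and so on.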
Golomb drawbacks

Need to store the base for the quotient
Postings are non-consecutive
Result: 30% increase in the size of the index

[Charts: index size (0–50%) and query time (0–25) for no compression vs. Golomb vs. eptacode]
Fundamental Ideas

Rely on hardware caching and mmap
– Keep data as compact as possible
– Structure on disk same as used by the algorithms
Rely on good data structures and algorithms
– STL
Specialize data structures
– For indexing
– For search
Indexing

Posting lists are created in memory
– Provide as much memory as possible to the indexing machines
When the size of the lists reaches a threshold, dump the partial index to disk
Perform a final merging of the partial indexes
The merging operation is also used for:
– Incremental indexing
– Distributed indexing
Search

Search mmaps the index:
– the lexicon completely
– postings on demand (too big)
This can’t be done while indexing, and vice versa
However, one can:
– dynamically add a collection with new documents to search
– mark documents as deleted
Full-text Index File Structure

FileHeader
Column_0 (Lexicon)
…
Column_n (Lexicon)
StopWords
Colors
Colors

A generalization of the Google hit properties (anchor, size, capitalization)
Similar to Fulcrum zones
Used for ranking
– E.g. title words contribute more to the rank of a document
and for selective queries:
  text matches author = attardi
Query processing exercises

If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
How can we perform the AND of two postings entries without explicitly building the 0/1 term-doc incidence vector?
Boolean queries: Exact match

An algebra of queries using AND, OR and NOT together with query words
– Uses a “set of words” document representation
– Precise: a document matches the condition or it does not
The primary commercial retrieval tool for 3 decades
– Researchers had long argued the superiority of ranked IR systems, but these were not much used in practice until the spread of web search engines
– Professional searchers still like boolean queries: you know exactly what you’re getting
  • Cf. Google’s boolean AND criterion
Query optimization

Consider a query that is an AND of t terms
The idea: for each of the t terms, get its postings, then AND them together
Process the terms in order of increasing freq:
– start with the smallest set, then keep cutting further
(This is why we kept freq in the dictionary.)
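A sketch of frequency-ordered conjunctive processing over sorted doc-id lists (a toy in-memory version):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Intersect two sorted doc-id lists.
std::vector<int> intersect(const std::vector<int>& a,
                           const std::vector<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// AND query: sort the postings lists by length (i.e., by freq), start
// from the smallest, and keep cutting the candidate set further.
std::vector<int> and_query(std::vector<std::vector<int>> postings) {
    if (postings.empty()) return {};
    std::sort(postings.begin(), postings.end(),
              [](const auto& x, const auto& y) { return x.size() < y.size(); });
    std::vector<int> result = postings[0];
    for (size_t i = 1; i < postings.size() && !result.empty(); ++i)
        result = intersect(result, postings[i]);
    return result;
}
```

Starting from the rarest term means every later intersection scans a candidate set no larger than the rarest term’s postings.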
Small Adaptive Set Intersection

Query compiler
– One cursor on the posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– Returns the first result r >= min
A single operator for all kinds of queries: e.g. proximity
Speeding up postings merges

Insert skip pointers
Say our current list of candidate docs for an AND query is 8, 13, 21
– (having done a bunch of ANDs)
We want to AND it with the following postings entry:
2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22
A linear scan is slow
Skip pointers or skip lists

At query time: as we walk the current candidate list, concurrently walk the inverted file entry; we can skip ahead
– (e.g., to 8, then 21)
Skip size: about √(list length) is recommended
2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, ...
At indexing time: augment the postings with skip pointers
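A sketch of intersection with skip pointers placed every √n entries (simplified: the skips live in a side table and next_geq scans them; a real implementation embeds skip pointers in the postings themselves):

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct SkipList {
    std::vector<int> docs;      // sorted doc ids
    std::vector<size_t> skips;  // indexes of the skip points
    explicit SkipList(std::vector<int> d) : docs(std::move(d)) {
        size_t step = (size_t)std::sqrt((double)docs.size());
        if (step > 1)
            for (size_t i = 0; i < docs.size(); i += step)
                skips.push_back(i);
    }
    // First position p >= from with docs[p] >= target
    // (docs.size() if none): jump via skips, then scan linearly.
    size_t next_geq(size_t from, int target) const {
        size_t pos = from;
        for (size_t s : skips)
            if (s > pos && docs[s] <= target) pos = s;
        while (pos < docs.size() && docs[pos] < target) ++pos;
        return pos;
    }
};

// Merge with skipping: whenever the lists disagree, the list that is
// behind jumps forward to the other list's current doc id.
std::vector<int> intersect(const SkipList& a, const SkipList& b) {
    std::vector<int> out;
    size_t i = 0, j = 0;
    while (i < a.docs.size() && j < b.docs.size()) {
        if (a.docs[i] == b.docs[j]) { out.push_back(a.docs[i]); ++i; ++j; }
        else if (a.docs[i] < b.docs[j]) i = a.next_geq(i, b.docs[j]);
        else j = b.next_geq(j, a.docs[i]);
    }
    return out;
}
```

On the slide’s example, ANDing 8, 13, 21 against 2, 4, …, 22 skips most of the longer list instead of scanning it.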
General query optimization

e.g. (madding OR crowd) AND (ignoble OR strife)
– Can put any boolean query into CNF
Get the freq’s for all terms
Estimate the size of each OR by the sum of its freq’s (conservative)
Process in increasing order of OR sizes
Exercise

Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term           Freq
eyes           213,312
kaleidoscope   87,009
marmalade      107,913
skies          271,658
tangerine      46,653
trees          316,812
IXE Architecture

[Figure: the Crawler fills a Table<DocInfo>; the Indexer builds the Lexicon and Postings, accessed via mmap; the DocStore is mmapped with a local cache. DocInfo records carry name, time and size, optionally extended with title, summary and type.]
Storing Objects in Relational Tables

SQL:
create table video (name varchar(256),
                    caption varchar(2048),
                    format INT,
                    PRIMARY KEY(name))
Template Metaprogramming

class Video : public DocInfo {
  char* name;
  char* caption;
  int   format;

  META(Video, (SUPERCLASS(DocInfo),
       VARKEY(name, 256),        // attribute (for indexing)
       VARFIELD(caption, 2048),
       FIELD(format)));
};
Programming Applications (C++)

Collection<Video> videos("CNN");
videos.insert(video1);

Query q("caption MATCHES Jordan and format = wav");
Cursor<Video> cursor(videos, q);

while (cursor.MoveNext())
  cout << cursor.Current();
Single cursor operator

struct QueryResult {
  CollectionID cid;
  DocID        did;
  Position     pos;
};
QueryResult qmin;
cursor.next(qmin);

Returns the next result qr (a document, or a word within a document) such that qr >= qmin
– Normal search: pos = 0
– Proximity search: pos = i
– Multiple collections search (increment cid or select cid)
– ‘where’ clauses (e.g. date > 1/1/2002)
– Boolean combinations
An independent benchmark

[Charts: indexing throughput (doc/sec) and retrieval throughput (query/sec) on Intel hardware, AltaVista vs. IXE]
Independent evaluations

Major portals in Germany, France and Italy
– Stress test with 300 concurrent queries
– Verity crashed in several cases
Microsoft Redmond
TREC Terabyte 2004

GOV2 collection:
– ~25 million documents from the .gov domain
– ~500 GB of documents
IXE index split into 23 shards

Data Structure                      Size
Lexicon                             4.2 GB
Posting Lists (including offsets)   62.0 GB
Metadata                            26.0 GB
Document cache (optional)           84.0 GB
Total                               176.2 GB
TREC Terabyte 2004

[Chart: average query time (sec, 0–2) for Monash U., Pisa, MSR Asia, U. Amsterdam, CMU, Sabir, Dublin C.U., RMIT]
Distributed Search Architecture

[Figure: a multi-threaded HTTP server passes each query to broker modules, which dispatch it with async I/O to query servers, each holding one shard of the index]
Other Features

Snippets
Document cache
Colors
Multiple collections
– Sorted by page rank
– Authoritativeness
– Popularity
Filter/group by similarity
Impact on search

Binary search down to a 4-term block
Then linear search through the terms in the block
8 documents: binary tree, avg. = 2.6 compares
= (1 + 2·2 + 4·3 + 4)/8
Blocks of 4 (binary tree), avg. = 3 compares
= (1 + 2·2 + 2·3 + 2·4 + 5)/8
[Tree diagrams omitted]
Compression: Two alternatives

Lossless compression: all information is preserved, but we try to encode it compactly
– What IR people mostly do
Lossy compression: discard some information
– Using a stoplist can be thought of in this way
– Techniques such as Latent Semantic Indexing can be viewed as lossy compression
– One could prune from postings the entries unlikely to turn up in the top-k list for any query on the word
  • Especially applicable to web search, with huge numbers of documents but short queries
  • e.g., Carmel et al., SIGIR 2002
Caching

If 25% of your users are searching for
  Britney Spears
then you probably do need spelling correction, but you don’t need to keep on intersecting those two postings lists
The web query distribution is extremely skewed, and you can usefully cache results for common queries
Query vs. index expansion

Recall:
– thesauri for term equivalents
– soundex for homonyms
How do we use these?
– Can “expand” the query to include equivalences
  • Query car tyres → car tyres automobile tires
– Can expand the index
  • Index docs containing car under automobile as well
Query expansion

Usually we do query expansion
– No index blowup
– Query processing slowed down
  • Docs frequently contain equivalences
– May retrieve more junk
  • puma → jaguar
– Carefully controlled wordnets
Wild-card queries: *Wild-card queries: *
mon*: find all docs containing any word beginning with "mon"
Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
*mon: find words ending in "mon": harder
Permuterm index: for word hello, index it under:
– hello$, ello$h, llo$he, lo$hel, o$hell
Queries:
– X: lookup on X$; X*: lookup on X*$
– *X: lookup on X$*; *X*: lookup on X*
– X*Y: lookup on Y$X*; X*Y*Z: ??? Exercise!
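The rotation trick can be sketched in a few lines. This is an illustrative stand-in, not the lecture's implementation: a sorted list searched with `bisect` plays the role of the B-tree lexicon, and the names (`permuterm_index`, `wildcard_xy`) are invented for the example.

```python
import bisect

def permuterm_index(vocab):
    """All rotations of term + '$', stored as sorted (rotation, term) pairs.
    A sorted list with binary search stands in for the B-tree lexicon."""
    entries = []
    for term in vocab:
        aug = term + "$"
        for i in range(len(aug)):
            entries.append((aug[i:] + aug[:i], term))
    entries.sort()
    return entries

def prefix_lookup(index, prefix):
    """Terms having some rotation that starts with `prefix`."""
    lo = bisect.bisect_left(index, (prefix,))
    hi = bisect.bisect_left(index, (prefix + "\uffff",))
    return {term for _, term in index[lo:hi]}

def wildcard_xy(index, x, y):
    """Answer X*Y: rotate so the * lands at the end, i.e. lookup on Y$X*."""
    return prefix_lookup(index, y + "$" + x)
```

For example, `wildcard_xy(idx, "h", "o")` answers the query h*o with a single prefix lookup on "o$h".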
Wild-card queries
Permuterm problem: ≈ quadruples lexicon size
Another way: index all k-grams occurring in any word (any sequence of k chars)
e.g., from text "April is the cruelest month" we get the 2-grams (bigrams)
– $ is a special word boundary symbol
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,h$
Processing n-gram wild-cards
Query mon* can now be run as
– $m AND mo AND on
Fast, space efficient
But we'd get a match on moon
Must post-filter these results against the query
Further wild-card refinements
– Cut down on pointers by using blocks
– Wild-card queries tend to have few bigrams
• keep postings on disk
– Exercise: given a trigram index, how do you process an arbitrary wild-card query?
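A minimal sketch of the bigram scheme, including the post-filter step that removes false matches like moon for mon*. All names here are illustrative, and a real system would keep the gram postings on disk.

```python
import re
from collections import defaultdict

def kgrams(s, k=2):
    """All k-grams of s (pass strings that already carry $ boundaries)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_index(vocab, k=2):
    index = defaultdict(set)
    for term in vocab:
        for g in kgrams("$" + term + "$", k):
            index[g].add(term)
    return index

def wildcard_lookup(index, pattern, k=2):
    """e.g. mon* -> grams of '$mon': {$m, mo, on}; intersect, then post-filter."""
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        grams |= kgrams(piece, k)
    candidates = None
    for g in grams:
        hits = index.get(g, set())
        candidates = hits if candidates is None else candidates & hits
    # post-filter: 'moon' survives the gram intersection for mon* but fails here
    rx = re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")
    return {t for t in (candidates or set()) if rx.match(t)}
```

`wildcard_lookup(build_index(["month", "moon", "mood"]), "mon*")` keeps only "month": "moon" shares the bigrams $m, mo, on but fails the final regex check.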
Phrase search
Search for "to be or not to be"
No longer suffices to store only <term: docs> entries
But could just do this anyway, and then post-filter [i.e., grep] for phrase matches
– Viable if phrase matches are uncommon
Alternatively, store, for each term, entries
– <number of docs containing term;
– doc1: position1, position2 …;
– doc2: position1, position2 …;
– etc.>
Positional index example
Can compress position values/offsets as we did with docs in the last lecture
Nevertheless, this expands the postings list in size substantially
<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>
Which of these docs could contain "to be or not to be"?
Processing a phrase query
Extract inverted index entries for each distinct term: to, be, or, not
Merge their doc:position lists to enumerate all positions where "to be or not to be" begins.
to:
2: 1, 17, 74, 222, 551; 4: 8, 27, 101, 429, 433; 7: 13, 23, 191; ...
be:
1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
Same general method for proximity searches
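The merge of doc:position lists can be sketched as follows. This is a simplified in-memory version: the assumed postings layout is a dict of dicts (real postings would be compressed on disk), and it checks that the i-th term of the phrase occurs at offset start + i.

```python
def phrase_docs(postings, phrase):
    """Docs where `phrase` occurs: term i of the phrase must appear
    at position start + i for some start in the first term's list."""
    terms = phrase.split()
    common = set(postings[terms[0]])
    for t in terms[1:]:
        common &= set(postings[t])          # docs containing every term
    hits = set()
    for doc in sorted(common):
        for start in postings[terms[0]][doc]:
            if all(start + i in postings[t][doc] for i, t in enumerate(terms)):
                hits.add(doc)
                break
    return hits

# The slide's (truncated) postings for 'to' and 'be':
postings = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 27, 101, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(postings, "to be"))   # {4}: 'to' at 429 is followed by 'be' at 430
```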
Example: WestLaw  http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
About 7 terabytes of data; 700,000 users
Majority of users still use boolean queries
Example query:
– What is the statute of limitations in cases involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
Long, precise queries; proximity operators; incrementally developed; not like web search
Index size
Stemming/case folding cut
– number of terms by ~40%
– number of pointers by 10-20%
– total space by ~30%
Stop words
– Rule of 30: ~30 words account for ~30% of all term occurrences in written text
– Eliminating the 150 commonest terms from indexing will cut almost 25% of space
Positional index size
Need an entry for each occurrence, not just once per document
Index size depends on average document size
– Average web page has <1000 terms
– SEC filings, books, even some epic poems … easily 100,000 terms
Consider a term with frequency 0.1%
Why?

Document size | Postings | Positional postings
1000          | 1        | 1
100,000       | 1        | 100
Rules of thumb
Positional index size: factor of 2-4 over non-positional index
Positional index size: 35-50% of the volume of the original text
Caveat: all of this holds for "English-like" languages
Index construction
Thus far, considered index space
What about index construction time?
What strategies can we use with limited main memory?
Somewhat bigger corpus
Number of docs = n = 40M
Number of terms = m = 1M
Use Zipf to estimate number of postings entries:
n + n/2 + n/3 + … + n/m ≈ n ln m = 560M entries
No positional info yet
Check for yourself
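To check for yourself: the harmonic sum n(1 + 1/2 + … + 1/m) is about n(ln m + 0.5772), which lands in the same ballpark as the slide's figure.

```python
import math

n, m = 40_000_000, 1_000_000
# Zipf: the k-th commonest term occurs in ~ n/k docs, so the postings total is
# n * (1 + 1/2 + ... + 1/m) ~ n * (ln m + 0.5772)
entries = n * (math.log(m) + 0.5772156649)
print(f"~{entries / 1e6:.0f}M postings entries")   # ~576M, close to the slide's 560M
```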
Documents are parsed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Recall index construction

As parsed (term, doc #):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2
Key step
After all documents have been parsed, the inverted file is sorted by terms
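The key step on the toy collection can be sketched like this (the document text is the slide's two-document example):

```python
# The slide's two-document collection
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Parsing emits (term, doc id) pairs in document order...
pairs = [(term.lower(), doc_id)
         for doc_id, text in docs.items()
         for term in text.split()]

# ...and the key step sorts them by term (then by doc id within each term)
pairs.sort()
print(pairs[:3])   # [('ambitious', 2), ('be', 2), ('brutus', 1)]
```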
Index construction
As we build up the index, we cannot exploit compression tricks
– parse docs one at a time; the final postings entry for any term is incomplete until the end
– (actually you can exploit compression, but this becomes a lot more complex)
At 10-12 bytes per postings entry, this demands several temporary gigabytes
System parameters for design
Disk seek ~ 1 millisecond
Block transfer from disk ~ 1 microsecond per byte (following a seek)
All other ops ~ 10 microseconds
Bottleneck
Parse and build postings entries one doc at a time
To now turn this into a term-wise view, must sort postings entries by term (then by doc within each term)
Doing this with random disk seeks would be too slow

If every comparison took 1 disk seek, and n items could be sorted with n log2 n comparisons, how long would this take?
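Plugging in the numbers from the previous slides (n ≈ 560M entries, 1 ms per seek) shows why: n log2 n comparisons at one seek each is months of pure seek time.

```python
import math

n = 560_000_000                       # postings entries, from the Zipf estimate
seek = 1e-3                           # 1 ms per disk seek
seconds = n * math.log2(n) * seek
print(f"{seconds / 86400:.0f} days")  # ~188 days of seeking alone
```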
Sorting with fewer disk seeks
12-byte (4+4+4) records (term, doc, freq)
These are generated as we parse docs
Must now sort 560M such 12-byte records by term
Define a Block = 10M such records
– can "easily" fit a couple into memory
Will sort within blocks first, then merge the blocks into one long sorted order
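The block-sorting phase can be sketched like this (an illustrative in-memory stand-in: real runs of 10M records would each be written to a temporary file on disk):

```python
import itertools

def sorted_runs(records, block_size):
    """Sort one block at a time in memory and yield each sorted run.
    (In the real setting each run would be written to a temporary disk file.)"""
    it = iter(records)
    while True:
        block = list(itertools.islice(it, block_size))
        if not block:
            return
        block.sort()          # in-memory sort of one block
        yield block

# Tiny (term, doc) records in parse order, block size 2:
runs = list(sorted_runs([(3, 1), (1, 2), (2, 1), (1, 1)], 2))
print(runs)   # [[(1, 2), (3, 1)], [(1, 1), (2, 1)]]
```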
Sorting 56 blocks of 10M records
First, read each block and sort within:
– Quicksort takes about 2 × (10M ln 10M) steps
Exercise: estimate the total time to read each block from disk and quicksort it
56 times this estimate gives us 56 sorted runs of 10M records each
Need 2 copies of the data on disk throughout
Merging 56 sorted runs
Merge tree of log2 56 ≈ 6 layers
During each layer, read runs into memory in blocks of 10M, merge, write back
[Diagram: sorted runs on disk merged pairwise, layer by layer]
Merging 56 runs
Time estimate for disk transfer:
6 layers × 56 runs × (120MB per run × 10^-6 s/byte block transfer) × 2 for read+write ≈ 22 hours
At each stage the run size doubles while the number of runs halves
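The 22-hour figure follows directly from the system parameters; a quick check of the arithmetic:

```python
layers = 6                         # log2(56) rounded up
runs = 56
run_bytes = 10_000_000 * 12        # 10M records x 12 bytes = 120MB per run
transfer = 1e-6                    # block transfer: 1 microsecond per byte
seconds = layers * runs * run_bytes * transfer * 2   # x2 for read + write
print(f"~{seconds / 3600:.0f} hours")   # ~22 hours
```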
Exercise: fill in this table

Step | Operation                                    | Time
1    | 56 initial quicksorts of 10M records each    | ?
2    | read 2 sorted blocks for merging, write back | ?
3    | merge 2 sorted blocks                        | ?
4    | add (2) + (3) = time to read/merge/write     | ?
5    | 56 times (4) = total merge time              | ?
Large memory indexing
Suppose instead that we had 16GB of memory for the above indexing task.
Exercise: how much time to index?
Repeat with a couple of values of n, m.
In practice, spidering is interlaced with indexing.
– Spidering is bottlenecked by WAN speed and many other factors; more on this later
Improving on the merge tree
Compressed temporary files
– compress terms in temporary dictionary runs
Merge more than 2 runs at a time
– maintain a heap of candidates, one from each run
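Python's heapq.merge implements exactly this multiway scheme: it keeps one candidate per run in a heap and repeatedly pops the smallest, so any number of runs is merged in a single pass.

```python
import heapq

# Illustrative sorted runs; in practice these are iterators over files on disk
runs = [[1, 5], [2, 4], [3, 6]]

# One candidate from each run sits in a heap; the smallest is popped each time
merged = list(heapq.merge(*runs))
print(merged)   # [1, 2, 3, 4, 5, 6]
```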
Indexing speed in practice
From TREC TeraByte 2004:
24-38 GB/hour on a 1GHz Pentium PC (depending on the HTML parser)
Dynamic indexing
Docs come in over time
– postings updates for terms already in the dictionary
– new terms added to the dictionary
Docs get deleted
Simplest approach
Maintain a "big" main index
New docs go into a "small" auxiliary index
Search across both, merge results
Deletions
– Invalidation bit-vector for deleted docs
– Filter docs output on a search result by this invalidation bit-vector
Periodically, re-index into one main index
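A sketch of the main + auxiliary scheme with an invalidation set (class and method names are illustrative; the bit-vector is stood in for by a Python set):

```python
class DynamicIndex:
    """Illustrative main + auxiliary index with deletion filtering."""
    def __init__(self):
        self.main = {}        # term -> set of doc ids (big, rebuilt rarely)
        self.aux = {}         # term -> set of doc ids (small, in memory)
        self.deleted = set()  # stands in for the invalidation bit-vector

    def add(self, doc_id, terms):
        for t in terms:
            self.aux.setdefault(t, set()).add(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # search across both indices, then filter by the invalidation set
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return hits - self.deleted

    def reindex(self):
        # periodically fold the auxiliary index into the main one
        for t, docs in self.aux.items():
            self.main.setdefault(t, set()).update(docs - self.deleted)
        self.aux.clear()
```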
More complex approach
Fully dynamic updates
Only one index at all times
– No big and small indices
Active management of a pool of space
Fully dynamic updates
Inserting a (variable-length) record
– e.g., a typical postings entry
Maintain a pool of (say) 64KB chunks
Chunk header maintains metadata on the records in the chunk, and its free space
[Diagram: chunk layout - header, records, free space]
Global tracking
In memory, maintain a global record address table that says, for each record, the chunk it's in.
Define one chunk to be current.
Insertion
– if the current chunk has enough free space
• extend the record and update metadata
– else look in other chunks for enough space
– else open a new chunk
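The insertion policy can be sketched as follows (illustrative names; records are byte strings and chunks are dicts rather than raw 64KB buffers):

```python
CHUNK_SIZE = 64 * 1024   # 64KB chunks, as on the slide

class ChunkPool:
    """Illustrative chunked record storage with a global address table."""
    def __init__(self):
        self.chunks = []     # each chunk: {"free": bytes left, "records": {}}
        self.address = {}    # record id -> chunk index (the global table)
        self.current = self._new_chunk()

    def _new_chunk(self):
        self.chunks.append({"free": CHUNK_SIZE, "records": {}})
        return len(self.chunks) - 1

    def insert(self, rec_id, payload):
        size = len(payload)
        if self.chunks[self.current]["free"] >= size:    # 1. try current chunk
            target = self.current
        else:
            target = None
            for i, c in enumerate(self.chunks):          # 2. scan other chunks
                if c["free"] >= size:
                    target = i
                    break
            if target is None:                           # 3. open a new chunk
                target = self._new_chunk()
                self.current = target
        chunk = self.chunks[target]
        chunk["records"][rec_id] = payload               # update chunk metadata
        chunk["free"] -= size
        self.address[rec_id] = target                    # update global table
```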
Changes to dictionary
New terms appear over time
– cannot use a static perfect hash for the dictionary
OK to use term character strings with pointers from postings, as in lecture 2
Index on disk vs. memory
Most retrieval systems keep the dictionary in memory and the postings on disk
Web search engines frequently keep both in memory
– massive memory requirement
– feasible for large web service installations, less so for standard usage where
• query loads are lighter
• users are willing to wait 2 seconds for a response
More on this when discussing deployment models
Distributed indexing
Suppose we had several machines available to do the indexing
– how do we exploit the parallelism?
Two basic approaches
– stripe by dictionary as the index is built up
– stripe by documents
Indexing in the real world
Typically, we don't have all documents sitting on a local filesystem
– Documents need to be spidered
– Could be dispersed over a WAN with varying connectivity
– Must schedule distributed spiders/indexers
– Could be (secure content) in
• Databases
• Content management applications
• Email applications
http is often not the most efficient way of fetching these documents; use native API fetching
Indexing in the real world
Documents come in a variety of formats
– word processing formats (e.g., MS Word)
– spreadsheets
– presentations
– publishing formats (e.g., pdf)
Generally handled using format-specific "filters"
– convert the format into text + meta-data
Documents come in a variety of languages
– automatically detect the language(s) in a document
– tokenization and stemming are language-dependent