Search Engines & Question Answering
Indexing

Giuseppe Attardi
Università di Pisa
(some slides borrowed from C. Manning, H. Schütze)
Topics

Indexing and Search
– Indexing and inverted files
– Compression
– Postings lists
– Query processing
Indexing

Inverted index storage
– Compressing dictionaries in memory
Processing Boolean queries
– Optimizing term processing
– Skip list encoding
Wild-card queries
Positional/phrase/proximity queries
Query

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Could we grep all of Shakespeare’s plays for Brutus and Caesar, then strip out the lines containing Calpurnia?
– Slow (for large corpora)
– NOT is non-trivial
– Other operations (e.g., find the phrase Romans and countrymen) are not feasible
Term-document incidence

1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1             0          0        0        0
Brutus             1               1             0          1        0        0
Caesar             1               1             0          1        1        1
Calpurnia          0               1             0          0        0        0
Cleopatra          1               0             0          0        0        0
mercy              1               0             1          1        1        1
worser             1               0             1          1        1        0
Incidence vectors

So we have a 0/1 vector for each term
To answer the query:
– take the vectors for Brutus, Caesar and Calpurnia (complemented)
– perform a bitwise AND
110100 AND 110111 AND 101111 = 100100
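The bitwise step above can be sketched with std::bitset (a toy illustration, not a practical representation for large corpora; the six bits follow the play columns of the incidence table, leftmost first):

```cpp
#include <bitset>

// One bit per play, in the column order of the incidence table:
// Antony&Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
// std::bitset parses the constructor string leftmost-character-first.
using Plays = std::bitset<6>;

// Brutus AND Caesar AND (NOT Calpurnia)
Plays answer(const Plays& brutus, const Plays& caesar, const Plays& calpurnia) {
    return brutus & caesar & ~calpurnia;
}
```

E.g. answer(Plays("110100"), Plays("110111"), Plays("010000")) gives 100100: Antony and Cleopatra, and Hamlet.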
Answers to query

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger corpora

Consider n = 1 M documents, each with about 1 K terms
Avg 6 bytes/term, including spaces/punctuation
– 6 GB of data
Say there are m = 500 K distinct terms among these
Can’t build the matrix

A 500 K × 1 M matrix has half a trillion 0’s and 1’s
But it has no more than one billion 1’s (why? 1 M docs × at most 1 K distinct terms each)
– the matrix is extremely sparse
What’s a better representation?
Documents are parsed to extract words, and these are saved with the document ID:

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

(Term, doc #) pairs, in order of occurrence:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
Inverted index

After all documents have been parsed, the inverted file is sorted by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2
Multiple term entries in a single document are merged, and frequency information is added (term, doc #, freq):
ambitious 2 1, be 2 1, brutus 1 1, brutus 2 1, capitol 1 1, caesar 1 1, caesar 2 2, did 1 1, enact 1 1, hath 2 1, I 1 2, i' 1 1, it 2 1, julius 1 1, killed 1 2, let 2 1, me 1 1, noble 2 1, so 2 1, the 1 1, the 2 1, told 2 1, you 2 1, was 1 1, was 2 1, with 2 1
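The whole parse–sort–merge pipeline can be sketched in a few lines (a minimal illustration, not the production indexer; tokenization here is plain whitespace splitting):

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// A posting: one (doc, within-doc frequency) entry.
struct Posting { int doc; int freq; };

// Parse (docID, text) pairs into an inverted file mapping
// term -> list of (doc, freq) postings, with duplicate (term, doc)
// entries merged into a frequency, as in the slides.
std::map<std::string, std::vector<Posting>>
build_index(const std::vector<std::pair<int, std::string>>& docs) {
    // term -> doc -> freq (std::map keeps terms sorted, like the sorted file)
    std::map<std::string, std::map<int, int>> counts;
    for (const auto& [id, text] : docs) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) ++counts[word][id];
    }
    std::map<std::string, std::vector<Posting>> index;
    for (const auto& [term, perdoc] : counts)
        for (const auto& [doc, freq] : perdoc)
            index[term].push_back({doc, freq});
    return index;
}
```

Applied to (lower-cased) Doc 1 and Doc 2 above, caesar gets the postings (1, 1) and (2, 2).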
The file is commonly split into a Dictionary and a Postings file

Dictionary (term, # docs, total freq):
ambitious 1 1, be 1 1, brutus 2 2, capitol 1 1, caesar 2 3, did 1 1, enact 1 1, hath 1 1, I 1 2, i' 1 1, it 1 1, julius 1 1, killed 1 2, let 1 1, me 1 1, noble 1 1, so 1 1, the 2 2, told 1 1, you 1 1, was 2 2, with 1 1

Postings ((doc #, freq) entries, one run per term):
(2,1) (2,1) (1,1) (2,1) (1,1) (1,1) (2,2) (1,1) (1,1) (2,1) (1,2) (1,1) (2,1) (1,1) (1,2) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (2,1) (1,1) (2,1) (2,1)
Where do we pay in storage?

Terms: the dictionary entries (term, # docs, total freq)
Pointers: the postings entries (doc #, freq), one run per term
Two conflicting forces

A term like Calpurnia occurs in maybe one doc out of a million; we would like to store this pointer using log2 1M ≈ 20 bits
A term like the occurs in virtually every doc, so 20 bits/pointer is too expensive
– Prefer the 0/1 vector in this case
Postings file entry

Store the list of docs containing a term in increasing order of doc id
– Brutus: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps
– 33, 14, 107, 5, 43 …
Hope: most gaps can be encoded with far fewer than 20 bits
Variable encoding

For Calpurnia, use ~20 bits/gap entry
For the, use ~1 bit/gap entry
If the average gap for a term is G, we want to use ~log2 G bits/gap entry
γ codes for gap encoding

Represent a gap G as the pair <length, offset>
length is in unary and uses ⌊log2 G⌋ + 1 bits to specify the length of the binary encoding of
offset = G − 2^⌊log2 G⌋
e.g., 9 is represented as 1110 001 (length = 1110, offset = 001)
Encoding G takes 2⌊log2 G⌋ + 1 bits
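A sketch of the γ code as bit strings (illustrative; a real implementation would pack bits, not characters):

```cpp
#include <cstdint>
#include <string>

// Gamma-encode a gap g >= 1: floor(log2 g) ones, a terminating zero,
// then the binary offset g - 2^floor(log2 g) in floor(log2 g) bits.
std::string gamma_encode(uint64_t g) {
    int len = 0;
    while ((g >> (len + 1)) != 0) ++len;   // len = floor(log2 g)
    std::string out(len, '1');
    out += '0';
    for (int i = len - 1; i >= 0; --i)     // offset bits, MSB first
        out += ((g >> i) & 1) ? '1' : '0';
    return out;
}

// Decode a single gamma code back to the gap.
uint64_t gamma_decode(const std::string& bits) {
    size_t len = 0;
    while (bits[len] == '1') ++len;        // unary length part
    uint64_t g = 1;                        // implicit leading 1 bit
    for (size_t i = len + 1; i < bits.size(); ++i)
        g = (g << 1) | (bits[i] == '1');
    return g;
}
```

gamma_encode(9) yields "1110001", the 2⌊log2 9⌋ + 1 = 7-bit code from the slide.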
What we’ve just done

Encoded each gap as tightly as possible, to within a factor of 2
For better tuning (and a simple analysis) we need some handle on the distribution of gap values
Zipf’s law

The kth most frequent term has frequency proportional to 1/k
Use this for a crude analysis of the space used by our postings file pointers
Rough analysis based on Zipf

The most frequent term occurs in n docs
– n gaps of 1 each
The second most frequent term occurs in n/2 docs
– n/2 gaps of 2 each …
The kth most frequent term occurs in n/k docs
– n/k gaps of k each, using 2⌊log2 k⌋ + 1 bits per gap
– net of ~(2n/k) log2 k bits for the kth most frequent term
Sum over k from 1 to 500 K

Do this by breaking the values of k into groups:
– group i consists of 2^(i−1) ≤ k < 2^i
Group i has 2^(i−1) components in the sum, each contributing at most (2ni)/2^(i−1)
Summing over i from 1 to 19 gives a net estimate of ~340 Mbits, ~45 MB for our index
(Work out the calculation.)
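The calculation can be worked out directly (a sketch with n = 1 M and m = 500 K; the grouped sum is an upper bound, while summing term by term comes out somewhat lower — both in the few-hundred-Mbit range of the estimate above):

```cpp
// Direct sum: the kth most frequent term (k = 1..m) contributes n/k gaps
// of ~k, each gamma-encoded in 2*floor(log2 k) + 1 bits.
double zipf_postings_bits(double n, int m) {
    double total = 0;
    for (int k = 1; k <= m; ++k) {
        int fl = 0;
        while ((k >> (fl + 1)) != 0) ++fl;   // fl = floor(log2 k)
        total += (n / k) * (2 * fl + 1);
    }
    return total;  // bits
}

// Grouped upper bound from the slide: group i (2^(i-1) <= k < 2^i)
// contributes at most 2*n*i bits; summing i = 1..19 gives 2n*190 bits.
double grouped_bound_bits(double n, int groups) {
    double total = 0;
    for (int i = 1; i <= groups; ++i) total += 2 * n * i;
    return total;
}
```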
Caveats

This is not the entire space for our index:
– it does not account for dictionary storage
– as we get further, we’ll store even more stuff in the index
It assumes Zipf’s law applies to the occurrence of terms in docs
All gaps for a term are taken to be the same
It does not address query processing
Issues with the index we just built

How do we process a query?
What terms in a doc do we index?
– All words, or only “important” ones?
Stopword list: terms so common that they’re ignored for indexing
– e.g., the, a, an, of, to …
– language-specific
Exercise: repeat the postings size calculation if the 100 most frequent terms are not indexed.
Issues in what to index

Cooper’s vs. Cooper vs. Coopers
Full-text vs. full text vs. {full, text} vs. fulltext
Accents: résumé vs. resume

Cooper’s concordance of Wordsworth was published in 1911. The applications of full-text retrieval are legion: they include résumé scanning, litigation support and searching published journals on-line.
Punctuation

Ne’er: use a language-specific, handcrafted “locale” to normalize
State-of-the-art: break up the hyphenated sequence
U.S.A. vs. USA: use the locale
a.out
Numbers

3/12/91
Mar. 12, 1991
55 B.C.
B-52
100.2.86.144
– Generally, don’t index numbers as text
– Creation dates for docs
Case folding

Reduce all letters to lower case
– exception: upper case in mid-sentence
  • e.g., General Motors
  • Fed vs. fed
  • SAIL vs. sail
Thesauri and soundex

Handle synonyms and homonyms
– Hand-constructed equivalence classes
  • e.g., car = automobile
  • your = you’re
Index such equivalences, or expand the query?
– More later ...
Spell correction

Look for all words within (say) edit distance 3 (insert/delete/replace) at query time
– e.g. Alanis Morisette
Spell correction is expensive and slows the query (up to a factor of 100)
– Invoke it only when the index returns zero matches
– What if docs contain misspellings?
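The edit-distance measure assumed here is the classic Levenshtein dynamic program (insert/delete/replace, each at cost 1); a correction candidate is any lexicon word within distance ≤ 3 of the query term:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance between a and b via the standard DP table:
// d[i][j] = cost of transforming the first i chars of a into the
// first j chars of b.
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = (int)i;
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = (int)j;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,                       // delete
                                d[i][j - 1] + 1,                       // insert
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1])}); // replace
    return d[a.size()][b.size()];
}
```

E.g. "morisette" is within distance 1 of "morissette", so the correctly spelled name would be found.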
Lemmatization

Reduce inflectional/variant forms to the base form
E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Stemming

Reduce terms to their “roots” before indexing
– language dependent
– e.g. automate(s), automatic, automation all reduced to automat

for example compressed and compression are both accepted as equivalent to compress
→ for exampl compres and compres are both accept as equival to compres
Porter’s algorithm

The commonest algorithm for stemming English
Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: of the rules in a compound command, select the one that applies to the longest suffix
Typical rules in Porter

sses → ss
ies → i
ational → ate
tional → tion
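The “longest suffix wins” convention can be sketched as follows (only the convention — not the real Porter algorithm, which also checks measure conditions on the stem):

```cpp
#include <string>
#include <vector>

// One rewrite rule: replace a matching suffix with a replacement.
struct Rule { std::string suffix, replacement; };

// Apply the one rule (if any) whose suffix matches and is longest,
// per Porter's compound-command convention.
std::string apply_longest(const std::string& word,
                          const std::vector<Rule>& rules) {
    const Rule* best = nullptr;
    for (const auto& r : rules) {
        if (word.size() >= r.suffix.size() &&
            word.compare(word.size() - r.suffix.size(),
                         r.suffix.size(), r.suffix) == 0 &&
            (!best || r.suffix.size() > best->suffix.size()))
            best = &r;
    }
    if (!best) return word;
    return word.substr(0, word.size() - best->suffix.size())
           + best->replacement;
}
```

With the four rules above, relational → relate (ational beats the shorter match tional), while conditional → condition.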
Other stemmers

Other stemmers exist, e.g., the Lovins stemmer http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
– single-pass, longest suffix removal (about 250 rules)
– motivated by linguistics as well as IR
Full morphological analysis: modest benefits for retrieval
Beyond term search

What about phrases?
Proximity: Find Gates NEAR Microsoft
– Need the index to capture position information in docs
Zones in documents: Find documents with (author = Ullman) AND (text contains automata)
Dictionary and postings files: a fast, compact inverted index

Dictionary (term, # docs, total freq):
Term N docs Tot Freq
ambitious 1 1
be 1 1
brutus 2 2
capitol 1 1
caesar 2 3
did 1 1
enact 1 1
hath 1 1
I 1 2
i' 1 1
it 1 1
julius 1 1
killed 1 2
let 1 1
me 1 1
noble 1 1
so 1 1
the 2 2
told 1 1
you 1 1
was 2 2
with 1 1
The dictionary is usually kept in memory; the postings are gap-encoded, on disk.
Inverted index storage

Dictionary storage
– Dictionary in main memory, postings on disk
  • This is common, especially for something like a search engine where high throughput is essential; one can also store most of the dictionary on disk with a small in-memory index
Tradeoffs between compression and query processing speed
– Cascaded family of techniques
How big is the lexicon V?

It grows (but more slowly) with corpus size
An empirically okay model (Heaps’ law):
V = kN^b
where b ≈ 0.5, k ≈ 30–100; N = # tokens
For instance, the TREC collection (2 GB; 750,000 newswire articles) has ~500,000 terms
The number is decreased by case-folding and stemming
Indexing all numbers could make it extremely large (so some search engines don’t)
Spelling errors contribute a fair bit of the size
Exercise: can one derive this from Zipf’s law?
Dictionary storage - first cut

Array of fixed-width entries
– 500,000 terms; 28 bytes/term = 14 MB

Terms (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
a                  999,712           •
aardvark           71                •
….                 ….                ….
zzzz               99                •

Allows for fast binary search into the dictionary
Exercises

Is binary search really a good idea?
What are the alternatives?
Fixed-width terms are wasteful

Most of the bytes in the Terms column are wasted: we allot 20 bytes even for 1-letter terms
– and still can’t handle supercalifragilisticexpialidocious
Written English averages ~4.5 characters per word
– Exercise: why is/isn’t this the number to use for estimating the dictionary size?
– Short words dominate token counts
The average word type in English is ~8 characters
Store the dictionary as a string of characters:
– The pointer to the next word marks the end of the previous one
– Hope to save up to 60% of dictionary space
Compressing the term list

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.   Postings ptr.   Term ptr.
33      •               •
29      •               •
44      •               •
126     •               •

Binary search these term pointers
Total string length = 500 K × 8 = 4 MB
Pointers resolve 4 M positions: log2 4M = 22 bits = 3 bytes
Total space for compressed list

4 bytes per term for Freq
4 bytes per term for the pointer to Postings
3 bytes per term pointer
Avg. 8 bytes per term in the term string
500 K terms → 9.5 MB

Now avg. 11 bytes/term, not 20.
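A sketch of this layout (a hypothetical structure, not IXE’s): all terms concatenated into one string, each table entry holding a frequency, a postings pointer and a term pointer; a term ends where the next entry’s term pointer begins.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Fixed-width table entry pointing into the shared term string.
struct Entry { uint32_t freq; uint32_t postings; uint32_t term; };

struct StringDict {
    std::string pool;            // concatenated terms
    std::vector<Entry> entries;  // in sorted term order

    void add(const std::string& t, uint32_t freq, uint32_t postings) {
        entries.push_back({freq, postings, (uint32_t)pool.size()});
        pool += t;
    }
    // Recover term i: it ends where the next term pointer begins.
    std::string term(size_t i) const {
        uint32_t begin = entries[i].term;
        uint32_t end = (i + 1 < entries.size()) ? entries[i + 1].term
                                                : (uint32_t)pool.size();
        return pool.substr(begin, end - begin);
    }
    // Binary search on the term pointers; returns entry index or -1.
    long find(const std::string& t) const {
        long lo = 0, hi = (long)entries.size() - 1;
        while (lo <= hi) {
            long mid = (lo + hi) / 2;
            std::string m = term(mid);
            if (m == t) return mid;
            if (m < t) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }
};
```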
Blocking

Store pointers to every kth term on the term string
Need to store term lengths (1 extra byte)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Freq.   Postings ptr.   Term ptr.
33      •               • (block start)
29      •
44      •
126     •
7       •               • (next block)

With k = 4: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.
Exercise

Estimate the space usage (and the savings compared to 9.5 MB) with blocking, for block sizes of k = 4, 8 and 16
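One way to set up the estimate, assuming the unblocked cost of 19 bytes/term from the previous slide (4 freq + 4 postings ptr + 3 term ptr + 8 string), adding 1 length byte per term and amortizing the 3-byte term pointer over a block of k:

```cpp
// Blocked-dictionary space in MB, under the stated assumptions:
// per term we pay 4 (freq) + 4 (postings ptr) + 8 (avg string) + 1
// (length byte), plus 3/k bytes for the once-per-block term pointer.
double blocked_dict_mb(int k, double nterms) {
    double per_term = 4 + 4 + 8 + 1 + 3.0 / k;  // bytes
    return nterms * per_term / 1e6;             // MB
}
```

With 500 K terms this gives ~8.88 MB at k = 4, shrinking slightly further at k = 8 and 16, versus the 9.5 MB unblocked figure.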
Design Goals

A specialized tool (indexing and search)
A C++ framework with high-level primitives
– Applications built with few lines of C++
– Specialization by inheritance
High performance
Scalability
Simple to maintain
– Hard to deal with autoconf, autoheader, automake, configure, libtool, …
– Developed my own Make templates

Keep it as simple as possible, but not simpler.
— Albert Einstein
Lexicon

[Figure: a bigram index into the word index, which points into the postings]
ab: 24930
ac: 24931
ad: 24932
ae: 24933
Word index: ate, cent, cute, rial, … (null-separated)
Extreme compression (see MG)

Front-coding:
– Sorted words commonly have a long common prefix: store differences only (for 3 out of 4)
Use perfect hashing to store terms “within” their pointers
– not good for vocabularies that change
Partition the dictionary into pages
– use a B-tree on the first terms of pages
– pay a disk seek to grab each page
– if we’re paying 1 disk seek anyway to get the postings, “only” another seek per query term
Is it worth it?

Average lexicon search time:
– IXE: 8 msec
– Front coding: 6 msec
Average query time: 300 msec
Number Encoding

There is a whole chapter on this in Managing Gigabytes
Best solution: local Bernoulli, using Golomb coding
– Roughly: quotient (unary) + remainder (binary)
– Compression: ~1 bit per posting
Quick and clean solution: eptacode
Eptacode

Use 7 bits in a byte; the sign bit is the continuation

Largest value representable in n bytes:

Bytes   Eptacode         Golomb
1       127              127
2       16,129           16,384
3       2,048,383        2,097,152
4       260,144,641      268,435,456
5       33,038,369,407   34,359,738,368
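A minimal variable-byte codec in this spirit (an assumption: plain 7-bit var-byte with a continuation bit, giving 2^(7n) per n bytes; the 127^n eptacode figures suggest the actual scheme reserves a byte value, which is not modeled here):

```cpp
#include <cstdint>
#include <vector>

// Encode n into 7-bit groups, least significant first; the high bit of
// a byte set means "more bytes follow" (the continuation bit).
std::vector<uint8_t> vbyte_encode(uint64_t n) {
    std::vector<uint8_t> out;
    while (n >= 128) {
        out.push_back((uint8_t)(n & 0x7F) | 0x80);  // continuation set
        n >>= 7;
    }
    out.push_back((uint8_t)n);                      // last byte, high bit clear
    return out;
}

uint64_t vbyte_decode(const std::vector<uint8_t>& bytes) {
    uint64_t n = 0;
    int shift = 0;
    for (uint8_t b : bytes) {
        n |= (uint64_t)(b & 0x7F) << shift;         // 7 data bits per byte
        shift += 7;
    }
    return n;
}
```

Small gaps cost one byte; 2 bytes cover values up to 2^14 − 1, and so on.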
Golomb drawbacks

Need to store the base for the quotient
Postings are non-consecutive
Result: 30% increase in the size of the index

[Charts: index size (0–50%) and query time (0–25) for no compression vs. Golomb vs. eptacode]
Fundamental Ideas

Rely on hardware caching and mmap
– Keep data as compact as possible
– Structure on disk same as used by the algorithms
Rely on good data structures and algorithms
– STL
Specialize data structures
– For indexing
– For search
Indexing

Posting lists are created in memory
– Provide as much memory as possible to the indexing machines
When the size of the lists reaches a threshold, dump the partial index to disk
Perform a final merging of the partial indexes
The merging operation is also used for:
– Incremental indexing
– Distributed indexing
Search

Search mmaps the index:
– the lexicon completely
– postings on demand (too big)
This can’t be done while indexing, and vice versa
However, one can:
– dynamically add a collection with new documents to search
– mark documents as deleted
Full-text Index File Structure

FileHeader
Column_0 (Lexicon)
…
Column_n (Lexicon)
StopWords
Colors
Colors

A generalization of the Google hit properties (anchor, size, capitalization)
Similar to Fulcrum zones
Used for ranking
– E.g. title words contribute more to the rank of a document
and for selective queries:
  text matches author = attardi
Query processing exercises

If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
How can we perform the AND of two postings entries without explicitly building the 0/1 term-doc incidence vector?
Boolean queries: Exact match

An algebra of queries using AND, OR and NOT together with query words
– Uses a “set of words” document representation
– Precise: a document matches the condition or it does not
The primary commercial retrieval tool for 3 decades
– Researchers had long argued the superiority of ranked IR systems, but these were not much used in practice until the spread of web search engines
– Professional searchers still like boolean queries: you know exactly what you’re getting
  • Cf. Google’s boolean AND criterion
Query optimization

Consider a query that is an AND of t terms
The idea: for each of the t terms, get its postings, then AND them together
Process the terms in order of increasing freq:
– start with the smallest set, then keep cutting further
(This is why we kept freq in the dictionary.)
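A sketch of frequency-ordered conjunctive processing over sorted doc-id lists (a toy in-memory version):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Intersect two sorted doc-id lists.
std::vector<int> intersect(const std::vector<int>& a,
                           const std::vector<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// AND query: sort the postings lists by length (i.e., by freq), start
// from the smallest, and keep cutting the candidate set further.
std::vector<int> and_query(std::vector<std::vector<int>> postings) {
    if (postings.empty()) return {};
    std::sort(postings.begin(), postings.end(),
              [](const auto& x, const auto& y) { return x.size() < y.size(); });
    std::vector<int> result = postings[0];
    for (size_t i = 1; i < postings.size() && !result.empty(); ++i)
        result = intersect(result, postings[i]);
    return result;
}
```

Starting from the rarest term means every later intersection scans a candidate set no larger than the rarest term’s postings.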
Small Adaptive Set Intersection

Query compiler
– One cursor on the posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– Returns the first result r >= min
A single operator for all kinds of queries: e.g. proximity
Speeding up postings merges

Insert skip pointers
Say our current list of candidate docs for an AND query is 8, 13, 21
– (having done a bunch of ANDs)
We want to AND it with the following postings entry:
2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22
A linear scan is slow
Skip pointers or skip lists

At query time: as we walk the current candidate list, concurrently walk the inverted file entry; we can skip ahead
– (e.g., to 8, then 21)
Skip size: about √(list length) is recommended
2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, ...
At indexing time: augment the postings with skip pointers
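A sketch of intersection with skip pointers placed every √n entries (simplified: the skips live in a side table and next_geq scans them; a real implementation embeds skip pointers in the postings themselves):

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct SkipList {
    std::vector<int> docs;      // sorted doc ids
    std::vector<size_t> skips;  // indexes of the skip points
    explicit SkipList(std::vector<int> d) : docs(std::move(d)) {
        size_t step = (size_t)std::sqrt((double)docs.size());
        if (step > 1)
            for (size_t i = 0; i < docs.size(); i += step)
                skips.push_back(i);
    }
    // First position p >= from with docs[p] >= target
    // (docs.size() if none): jump via skips, then scan linearly.
    size_t next_geq(size_t from, int target) const {
        size_t pos = from;
        for (size_t s : skips)
            if (s > pos && docs[s] <= target) pos = s;
        while (pos < docs.size() && docs[pos] < target) ++pos;
        return pos;
    }
};

// Merge with skipping: whenever the lists disagree, the list that is
// behind jumps forward to the other list's current doc id.
std::vector<int> intersect(const SkipList& a, const SkipList& b) {
    std::vector<int> out;
    size_t i = 0, j = 0;
    while (i < a.docs.size() && j < b.docs.size()) {
        if (a.docs[i] == b.docs[j]) { out.push_back(a.docs[i]); ++i; ++j; }
        else if (a.docs[i] < b.docs[j]) i = a.next_geq(i, b.docs[j]);
        else j = b.next_geq(j, a.docs[i]);
    }
    return out;
}
```

On the slide’s example, ANDing 8, 13, 21 against 2, 4, …, 22 skips most of the longer list instead of scanning it.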
General query optimization

e.g. (madding OR crowd) AND (ignoble OR strife)
– Can put any boolean query into CNF
Get the freq’s for all terms
Estimate the size of each OR by the sum of its freq’s (conservative)
Process in increasing order of OR sizes
Exercise

Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term           Freq
eyes           213,312
kaleidoscope   87,009
marmalade      107,913
skies          271,658
tangerine      46,653
trees          316,812
IXE Architecture

[Figure: the Crawler fills a Table<DocInfo>; the Indexer builds the Lexicon and Postings, accessed via mmap; the DocStore is mmapped with a local cache. DocInfo records carry name, time and size, optionally extended with title, summary and type.]
Storing Objects in Relational Tables

SQL:
create table video (name varchar(256),
                    caption varchar(2048),
                    format INT,
                    PRIMARY KEY(name))
Template Metaprogramming

class Video : public DocInfo {
  char* name;
  char* caption;
  int   format;

  META(Video, (SUPERCLASS(DocInfo),
       VARKEY(name, 256),        // attribute (for indexing)
       VARFIELD(caption, 2048),
       FIELD(format)));
};
Programming Applications (C++)

Collection<Video> videos("CNN");
videos.insert(video1);

Query q("caption MATCHES Jordan and format = wav");
Cursor<Video> cursor(videos, q);

while (cursor.MoveNext())
  cout << cursor.Current();
Single cursor operator

struct QueryResult {
  CollectionID cid;
  DocID        did;
  Position     pos;
};
QueryResult qmin;
cursor.next(qmin);

Returns the next result qr (a document, or a word within a document) such that qr >= qmin
– Normal search: pos = 0
– Proximity search: pos = i
– Multiple collections search (increment cid or select cid)
– ‘where’ clauses (e.g. date > 1/1/2002)
– Boolean combinations
An independent benchmark

[Charts: indexing throughput (doc/sec) and retrieval throughput (query/sec) on Intel hardware, AltaVista vs. IXE]
Independent evaluations

Major portals in Germany, France and Italy
– Stress test with 300 concurrent queries
– Verity crashed in several cases
Microsoft Redmond
TREC Terabyte 2004

GOV2 collection:
– ~25 million documents from the .gov domain
– ~500 GB of documents
IXE index split into 23 shards

Data Structure                      Size
Lexicon                             4.2 GB
Posting Lists (including offsets)   62.0 GB
Metadata                            26.0 GB
Document cache (optional)           84.0 GB
Total                               176.2 GB
TREC Terabyte 2004

[Chart: average query time (sec, 0–2) for Monash U., Pisa, MSR Asia, U. Amsterdam, CMU, Sabir, Dublin C.U., RMIT]
Distributed Search Architecture

[Figure: a multi-threaded HTTP server passes each query to broker modules, which dispatch it with async I/O to query servers, each holding one shard of the index]
Other Features

Snippets
Document cache
Colors
Multiple collections
– Sorted by page rank
– Authoritativeness
– Popularity
Filter/group by similarity
Impact on search

Binary search down to a 4-term block
Then linear search through the terms in the block
8 documents: binary tree, avg. = 2.6 compares
= (1 + 2·2 + 4·3 + 4)/8
Blocks of 4 (binary tree), avg. = 3 compares
= (1 + 2·2 + 2·3 + 2·4 + 5)/8
[Tree diagrams omitted]
Compression: Two alternatives

Lossless compression: all information is preserved, but we try to encode it compactly
– What IR people mostly do
Lossy compression: discard some information
– Using a stoplist can be thought of in this way
– Techniques such as Latent Semantic Indexing can be viewed as lossy compression
– One could prune from postings the entries unlikely to turn up in the top-k list for any query on the word
  • Especially applicable to web search, with huge numbers of documents but short queries
  • e.g., Carmel et al., SIGIR 2002
Caching

If 25% of your users are searching for
  Britney Spears
then you probably do need spelling correction, but you don’t need to keep on intersecting those two postings lists
The web query distribution is extremely skewed, and you can usefully cache results for common queries
Query vs. index expansion

Recall:
– thesauri for term equivalents
– soundex for homonyms
How do we use these?
– Can “expand” the query to include equivalences
  • Query car tyres → car tyres automobile tires
– Can expand the index
  • Index docs containing car under automobile as well
Query expansion

Usually we do query expansion
– No index blowup
– Query processing slowed down
  • Docs frequently contain equivalences
– May retrieve more junk
  • puma → jaguar
– Carefully controlled wordnets
Wild-card queries: *Wild-card queries: *
mon*: find all docs containing any word beginning with "mon"
Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
*mon: find words ending in "mon": harder
Permuterm index: for word hello, index it under:
– hello$, ello$h, llo$he, lo$hel, o$hell
Queries:
– X: lookup on X$; X*: lookup on X*$
– *X: lookup on X$*; *X*: lookup on X*
– X*Y: lookup on Y$X*; X*Y*Z: ??? Exercise!
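The rotation trick can be sketched in a few lines. This is an illustrative stand-in, not the lecture's implementation: a sorted list searched with `bisect` plays the role of the B-tree lexicon, and the names (`permuterm_index`, `wildcard_xy`) are invented for the example.

```python
import bisect

def permuterm_index(vocab):
    """All rotations of term + '$', stored as sorted (rotation, term) pairs.
    A sorted list with binary search stands in for the B-tree lexicon."""
    entries = []
    for term in vocab:
        aug = term + "$"
        for i in range(len(aug)):
            entries.append((aug[i:] + aug[:i], term))
    entries.sort()
    return entries

def prefix_lookup(index, prefix):
    """Terms having some rotation that starts with `prefix`."""
    lo = bisect.bisect_left(index, (prefix,))
    hi = bisect.bisect_left(index, (prefix + "\uffff",))
    return {term for _, term in index[lo:hi]}

def wildcard_xy(index, x, y):
    """Answer X*Y: rotate so the * lands at the end, i.e. lookup on Y$X*."""
    return prefix_lookup(index, y + "$" + x)
```

For example, `wildcard_xy(idx, "h", "o")` answers the query h*o with a single prefix lookup on "o$h".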
Wild-card queries
Permuterm problem: ≈ quadruples lexicon size
Another way: index all k-grams occurring in any word (any sequence of k chars)
e.g., from text "April is the cruelest month" we get the 2-grams (bigrams)
– $ is a special word boundary symbol
$a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,h$
Processing n-gram wild-cards
Query mon* can now be run as
– $m AND mo AND on
Fast, space efficient
But we'd get a match on moon
Must post-filter these results against the query
Further wild-card refinements
– Cut down on pointers by using blocks
– Wild-card queries tend to have few bigrams
• keep postings on disk
– Exercise: given a trigram index, how do you process an arbitrary wild-card query?
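A minimal sketch of the bigram scheme, including the post-filter step that removes false matches like moon for mon*. All names here are illustrative, and a real system would keep the gram postings on disk.

```python
import re
from collections import defaultdict

def kgrams(s, k=2):
    """All k-grams of s (pass strings that already carry $ boundaries)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_index(vocab, k=2):
    index = defaultdict(set)
    for term in vocab:
        for g in kgrams("$" + term + "$", k):
            index[g].add(term)
    return index

def wildcard_lookup(index, pattern, k=2):
    """e.g. mon* -> grams of '$mon': {$m, mo, on}; intersect, then post-filter."""
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        grams |= kgrams(piece, k)
    candidates = None
    for g in grams:
        hits = index.get(g, set())
        candidates = hits if candidates is None else candidates & hits
    # post-filter: 'moon' survives the gram intersection for mon* but fails here
    rx = re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")
    return {t for t in (candidates or set()) if rx.match(t)}
```

`wildcard_lookup(build_index(["month", "moon", "mood"]), "mon*")` keeps only "month": "moon" shares the bigrams $m, mo, on but fails the final regex check.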
Phrase search
Search for "to be or not to be"
No longer suffices to store only <term: docs> entries
But could just do this anyway, and then post-filter [i.e., grep] for phrase matches
– Viable if phrase matches are uncommon
Alternatively, store, for each term, entries
– <number of docs containing term;
– doc1: position1, position2 …;
– doc2: position1, position2 …;
– etc.>
Positional index example
Can compress position values/offsets as we did with docs in the last lecture
Nevertheless, this expands the postings list in size substantially
<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>
Which of these docs could contain "to be or not to be"?
Processing a phrase query
Extract inverted index entries for each distinct term: to, be, or, not
Merge their doc:position lists to enumerate all positions where "to be or not to be" begins.
to:
2: 1, 17, 74, 222, 551; 4: 8, 27, 101, 429, 433; 7: 13, 23, 191; ...
be:
1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
Same general method for proximity searches
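The merge of doc:position lists can be sketched as follows. This is a simplified in-memory version: the assumed postings layout is a dict of dicts (real postings would be compressed on disk), and it checks that the i-th term of the phrase occurs at offset start + i.

```python
def phrase_docs(postings, phrase):
    """Docs where `phrase` occurs: term i of the phrase must appear
    at position start + i for some start in the first term's list."""
    terms = phrase.split()
    common = set(postings[terms[0]])
    for t in terms[1:]:
        common &= set(postings[t])          # docs containing every term
    hits = set()
    for doc in sorted(common):
        for start in postings[terms[0]][doc]:
            if all(start + i in postings[t][doc] for i, t in enumerate(terms)):
                hits.add(doc)
                break
    return hits

# The slide's (truncated) postings for 'to' and 'be':
postings = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 27, 101, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(postings, "to be"))   # {4}: 'to' at 429 is followed by 'be' at 430
```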
Example: WestLaw  http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
About 7 terabytes of data; 700,000 users
Majority of users still use boolean queries
Example query:
– What is the statute of limitations in cases involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
Long, precise queries; proximity operators; incrementally developed; not like web search
Index size
Stemming/case folding cut
– number of terms by ~40%
– number of pointers by 10-20%
– total space by ~30%
Stop words
– Rule of 30: ~30 words account for ~30% of all term occurrences in written text
– Eliminating the 150 commonest terms from indexing will cut almost 25% of space
Positional index size
Need an entry for each occurrence, not just once per document
Index size depends on average document size
– Average web page has <1000 terms
– SEC filings, books, even some epic poems … easily 100,000 terms
Consider a term with frequency 0.1%
Why?

Document size | Postings | Positional postings
1000          | 1        | 1
100,000       | 1        | 100
Rules of thumb
Positional index size: factor of 2-4 over non-positional index
Positional index size: 35-50% of the volume of the original text
Caveat: all of this holds for "English-like" languages
Index construction
Thus far, considered index space
What about index construction time?
What strategies can we use with limited main memory?
Somewhat bigger corpus
Number of docs = n = 40M
Number of terms = m = 1M
Use Zipf to estimate number of postings entries:
n + n/2 + n/3 + … + n/m ≈ n ln m = 560M entries
No positional info yet
Check for yourself
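To check for yourself: the harmonic sum n(1 + 1/2 + … + 1/m) is about n(ln m + 0.5772), which lands in the same ballpark as the slide's figure.

```python
import math

n, m = 40_000_000, 1_000_000
# Zipf: the k-th commonest term occurs in ~ n/k docs, so the postings total is
# n * (1 + 1/2 + ... + 1/m) ~ n * (ln m + 0.5772)
entries = n * (math.log(m) + 0.5772156649)
print(f"~{entries / 1e6:.0f}M postings entries")   # ~576M, close to the slide's 560M
```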
Documents are parsed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Recall index construction

As parsed (term, doc #):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2
Key step
After all documents have been parsed, the inverted file is sorted by terms
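The key step on the toy collection can be sketched like this (the document text is the slide's two-document example):

```python
# The slide's two-document collection
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Parsing emits (term, doc id) pairs in document order...
pairs = [(term.lower(), doc_id)
         for doc_id, text in docs.items()
         for term in text.split()]

# ...and the key step sorts them by term (then by doc id within each term)
pairs.sort()
print(pairs[:3])   # [('ambitious', 2), ('be', 2), ('brutus', 1)]
```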
Index construction
As we build up the index, we cannot exploit compression tricks
– parse docs one at a time; the final postings entry for any term is incomplete until the end
– (actually you can exploit compression, but this becomes a lot more complex)
At 10-12 bytes per postings entry, this demands several temporary gigabytes
System parameters for design
Disk seek ~ 1 millisecond
Block transfer from disk ~ 1 microsecond per byte (following a seek)
All other ops ~ 10 microseconds
Bottleneck
Parse and build postings entries one doc at a time
To now turn this into a term-wise view, must sort postings entries by term (then by doc within each term)
Doing this with random disk seeks would be too slow

If every comparison took 1 disk seek, and n items could be sorted with n log2 n comparisons, how long would this take?
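Plugging in the numbers from the previous slides (n ≈ 560M entries, 1 ms per seek) shows why: n log2 n comparisons at one seek each is months of pure seek time.

```python
import math

n = 560_000_000                       # postings entries, from the Zipf estimate
seek = 1e-3                           # 1 ms per disk seek
seconds = n * math.log2(n) * seek
print(f"{seconds / 86400:.0f} days")  # ~188 days of seeking alone
```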
Sorting with fewer disk seeks
12-byte (4+4+4) records (term, doc, freq)
These are generated as we parse docs
Must now sort 560M such 12-byte records by term
Define a Block = 10M such records
– can "easily" fit a couple into memory
Will sort within blocks first, then merge the blocks into one long sorted order
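The block-sorting phase can be sketched like this (an illustrative in-memory stand-in: real runs of 10M records would each be written to a temporary file on disk):

```python
import itertools

def sorted_runs(records, block_size):
    """Sort one block at a time in memory and yield each sorted run.
    (In the real setting each run would be written to a temporary disk file.)"""
    it = iter(records)
    while True:
        block = list(itertools.islice(it, block_size))
        if not block:
            return
        block.sort()          # in-memory sort of one block
        yield block

# Tiny (term, doc) records in parse order, block size 2:
runs = list(sorted_runs([(3, 1), (1, 2), (2, 1), (1, 1)], 2))
print(runs)   # [[(1, 2), (3, 1)], [(1, 1), (2, 1)]]
```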
Sorting 56 blocks of 10M records
First, read each block and sort within:
– Quicksort takes about 2 × (10M ln 10M) steps
Exercise: estimate the total time to read each block from disk and quicksort it
56 times this estimate gives us 56 sorted runs of 10M records each
Need 2 copies of the data on disk throughout
Merging 56 sorted runs
Merge tree of log2 56 ≈ 6 layers
During each layer, read runs into memory in blocks of 10M, merge, write back
[Diagram: sorted runs on disk merged pairwise, layer by layer]
Merging 56 runs
Time estimate for disk transfer:
6 layers × 56 runs × (120MB per run × 10^-6 s/byte block transfer) × 2 for read+write ≈ 22 hours
At each stage the run size doubles while the number of runs halves
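The 22-hour figure follows directly from the system parameters; a quick check of the arithmetic:

```python
layers = 6                         # log2(56) rounded up
runs = 56
run_bytes = 10_000_000 * 12        # 10M records x 12 bytes = 120MB per run
transfer = 1e-6                    # block transfer: 1 microsecond per byte
seconds = layers * runs * run_bytes * transfer * 2   # x2 for read + write
print(f"~{seconds / 3600:.0f} hours")   # ~22 hours
```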
Exercise: fill in this table

Step | Operation                                    | Time
1    | 56 initial quicksorts of 10M records each    | ?
2    | read 2 sorted blocks for merging, write back | ?
3    | merge 2 sorted blocks                        | ?
4    | add (2) + (3) = time to read/merge/write     | ?
5    | 56 times (4) = total merge time              | ?
Large memory indexing
Suppose instead that we had 16GB of memory for the above indexing task.
Exercise: how much time to index?
Repeat with a couple of values of n, m.
In practice, spidering is interlaced with indexing.
– Spidering is bottlenecked by WAN speed and many other factors; more on this later
Improving on the merge tree
Compressed temporary files
– compress terms in temporary dictionary runs
Merge more than 2 runs at a time
– maintain a heap of candidates, one from each run
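Python's heapq.merge implements exactly this multiway scheme: it keeps one candidate per run in a heap and repeatedly pops the smallest, so any number of runs is merged in a single pass.

```python
import heapq

# Illustrative sorted runs; in practice these are iterators over files on disk
runs = [[1, 5], [2, 4], [3, 6]]

# One candidate from each run sits in a heap; the smallest is popped each time
merged = list(heapq.merge(*runs))
print(merged)   # [1, 2, 3, 4, 5, 6]
```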
Indexing speed in practice
From TREC TeraByte 2004:
24-38 GB/hour on a 1GHz Pentium PC (depending on the HTML parser)
Dynamic indexing
Docs come in over time
– postings updates for terms already in the dictionary
– new terms added to the dictionary
Docs get deleted
Simplest approach
Maintain a "big" main index
New docs go into a "small" auxiliary index
Search across both, merge results
Deletions
– Invalidation bit-vector for deleted docs
– Filter docs output on a search result by this invalidation bit-vector
Periodically, re-index into one main index
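A sketch of the main + auxiliary scheme with an invalidation set (class and method names are illustrative; the bit-vector is stood in for by a Python set):

```python
class DynamicIndex:
    """Illustrative main + auxiliary index with deletion filtering."""
    def __init__(self):
        self.main = {}        # term -> set of doc ids (big, rebuilt rarely)
        self.aux = {}         # term -> set of doc ids (small, in memory)
        self.deleted = set()  # stands in for the invalidation bit-vector

    def add(self, doc_id, terms):
        for t in terms:
            self.aux.setdefault(t, set()).add(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # search across both indices, then filter by the invalidation set
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return hits - self.deleted

    def reindex(self):
        # periodically fold the auxiliary index into the main one
        for t, docs in self.aux.items():
            self.main.setdefault(t, set()).update(docs - self.deleted)
        self.aux.clear()
```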
More complex approach
Fully dynamic updates
Only one index at all times
– No big and small indices
Active management of a pool of space
Fully dynamic updates
Inserting a (variable-length) record
– e.g., a typical postings entry
Maintain a pool of (say) 64KB chunks
Chunk header maintains metadata on the records in the chunk, and its free space
[Diagram: chunk layout - header, records, free space]
Global tracking
In memory, maintain a global record address table that says, for each record, the chunk it's in.
Define one chunk to be current.
Insertion
– if the current chunk has enough free space
• extend the record and update metadata
– else look in other chunks for enough space
– else open a new chunk
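The insertion policy can be sketched as follows (illustrative names; records are byte strings and chunks are dicts rather than raw 64KB buffers):

```python
CHUNK_SIZE = 64 * 1024   # 64KB chunks, as on the slide

class ChunkPool:
    """Illustrative chunked record storage with a global address table."""
    def __init__(self):
        self.chunks = []     # each chunk: {"free": bytes left, "records": {}}
        self.address = {}    # record id -> chunk index (the global table)
        self.current = self._new_chunk()

    def _new_chunk(self):
        self.chunks.append({"free": CHUNK_SIZE, "records": {}})
        return len(self.chunks) - 1

    def insert(self, rec_id, payload):
        size = len(payload)
        if self.chunks[self.current]["free"] >= size:    # 1. try current chunk
            target = self.current
        else:
            target = None
            for i, c in enumerate(self.chunks):          # 2. scan other chunks
                if c["free"] >= size:
                    target = i
                    break
            if target is None:                           # 3. open a new chunk
                target = self._new_chunk()
                self.current = target
        chunk = self.chunks[target]
        chunk["records"][rec_id] = payload               # update chunk metadata
        chunk["free"] -= size
        self.address[rec_id] = target                    # update global table
```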
Changes to dictionary
New terms appear over time
– cannot use a static perfect hash for the dictionary
OK to use term character strings with pointers from postings, as in lecture 2
Index on disk vs. memory
Most retrieval systems keep the dictionary in memory and the postings on disk
Web search engines frequently keep both in memory
– massive memory requirement
– feasible for large web service installations, less so for standard usage where
• query loads are lighter
• users are willing to wait 2 seconds for a response
More on this when discussing deployment models
Distributed indexing
Suppose we had several machines available to do the indexing
– how do we exploit the parallelism?
Two basic approaches
– stripe by dictionary as the index is built up
– stripe by documents
Indexing in the real world
Typically, we don't have all documents sitting on a local filesystem
– Documents need to be spidered
– Could be dispersed over a WAN with varying connectivity
– Must schedule distributed spiders/indexers
– Could be (secure content) in
• Databases
• Content management applications
• Email applications
http is often not the most efficient way of fetching these documents; use native API fetching
Indexing in the real world
Documents come in a variety of formats
– word processing formats (e.g., MS Word)
– spreadsheets
– presentations
– publishing formats (e.g., pdf)
Generally handled using format-specific "filters"
– convert the format into text + meta-data
Documents come in a variety of languages
– automatically detect the language(s) in a document
– tokenization and stemming are language-dependent