Web Search and Information Retrieval

Definition of information retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Structured vs unstructured data

Structured data: information in “tables”

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Unstructured data

Typically refers to free text

Allows:
  Keyword-based queries, including operators
  More sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse

Ultimate Focus of IR

Satisfying the user's information need

Emphasis is on retrieval of information (not data)

Predicting which documents are relevant, and then linearly ranking them.

SIGIR 2005

Basic assumptions of Information Retrieval

Collection: Fixed set of documents

Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

The classic search model

[Figure: the path from a user's task to search results, with the error that can arise at each step]

TASK: Get rid of mice in a politically correct way
  (possible mis-conception)
Info Need: Info about removing mice without killing them
  (possible mis-translation)
Verbal form: How do I trap mice alive?
  (possible mis-formulation)
Query: mouse trap

The SEARCH ENGINE runs the query against the Corpus and returns Results, which feed Query Refinement back into the Query.

Boolean Queries

Some simple query examples:
  Documents containing the word “Java”
  Documents containing the word “Java” but not the word “coffee”
  Documents containing the phrase “Java beans” or the term “API”
  Documents where “Java” and “island” occur in the same sentence

The last two queries are called proximity queries.

Before processing the queries…

Documents in the collection should be tokenized in a suitable manner

We need to decide what terms should be put in the index

Tokens and Terms

Tokenization

Input: “Friends, Romans and Countrymen”

Output: Tokens
  Friends
  Romans
  Countrymen

Each such token is now a candidate for an index entry, after further processing (described below).

Why tokenization is difficult – even in English

Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.

Tokenize this sentence
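A minimal sketch of why this sentence is hard, assuming two deliberately naive tokenizers (both function names are illustrative and not from the slides). Neither output is "correct"; each makes a different choice for O'Neill, boys', Chile's and aren't. The code uses straight apostrophes for simplicity.

```python
import re

def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays attached to the tokens."""
    return text.split()

def tokenize_alnum(text):
    """Keep maximal runs of ASCII letters and digits; everything else separates tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

by_space = tokenize_whitespace(sentence)   # keeps "O'Neill", "boys'", "aren't", "amusing."
by_alnum = tokenize_alnum(sentence)        # splits into "O", "Neill", ..., "aren", "t"

print(by_space)
print(by_alnum)
```

A real tokenizer sits somewhere between these two extremes, and the same choice has to be applied to documents and queries alike.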

One word or two? (or several)
  fault-finder
  co-education
  state-of-the-art
  data base
  San Francisco
  cheap San Francisco-Los Angeles fares

Tokenization: language issues

Chinese and Japanese have no spaces between words:
  莎拉波娃現在居住在美國東南部的佛羅里達。

Not always guaranteed a unique tokenization

Ambiguous segmentation in Chinese

The two characters 和尚 can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.

Normalization

Need to “normalize” terms in indexed text as well as query terms into the same form.

Example: We want to match U.S.A. and USA

Two general solutions:

We most commonly implicitly define equivalence classes of terms.

Alternatively: do asymmetric expansion
  window → window, windows
  windows → Windows, windows
  Windows (no expansion)

More powerful, but less efficient
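The two approaches can be sketched as follows. This is only an illustration: the period-stripping rule in `normalize` is an assumed example of an equivalence-class mapping, and the `EXPANSIONS` table simply copies the three entries listed above.

```python
def normalize(term):
    """Equivalence classing: map every term to one canonical form.
    Assumed rule for illustration: drop periods, so U.S.A. and USA coincide."""
    return term.replace(".", "")

# Asymmetric expansion: the query term decides which index terms are searched.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows"],
    "Windows": ["Windows"],          # no expansion
}

def expand_query_term(term):
    """Return the index terms to look up for one query term."""
    return EXPANSIONS.get(term, [term])

print(normalize("U.S.A.") == normalize("USA"))
print(expand_query_term("windows"))
```

Equivalence classing is applied once at index time and once per query term; expansion costs one extra postings lookup per expanded term, which is where the loss of efficiency comes from.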

Case folding

Reduce all letters to lower case
  Exception: upper case in mid-sentence? (e.g., Fed vs. fed)

Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…

Lemmatization

Reduce inflectional/variant forms to base form

E.g.,
  am, are, is → be
  car, cars, car's, cars' → car
  the boy's cars are different colors → the boy car be different color

Lemmatization implies doing “proper” reduction to dictionary headword form

Stemming

Definition of stemming: a crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge

Reduce terms to their “roots” before indexing

“Stemming” suggests crude affix chopping
  Language dependent
  e.g., automate(s), automatic, automation all reduced to automat.

Before stemming: for example compressed and compression are both accepted as equivalent to compress.

After stemming: for exampl compress and compress ar both accept as equival to compress

Porter algorithm

Most common algorithm for stemming English
  Results suggest that it is at least as good as other stemming options

Phases are applied sequentially

Each phase consists of a set of commands.
  Sample command: Delete final “ement” if what remains is longer than 1 character
    replacement → replac
    cement → cement
  Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Porter stemmer: A few rules

Rule         Example
SSES → SS    caresses → caress
IES → I      ponies → poni
SS → SS      caress → caress
S →          cats → cat
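The four rules in the table, together with the “ement” sample command, can be sketched as below. This is not the full Porter algorithm, only these rules; the function names are illustrative. The rules are tried longest suffix first, which is the sample convention for compound commands.

```python
# (suffix, replacement) pairs from the rule table, ordered longest suffix first.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_plural_rules(word):
    """Apply the first (i.e. longest-suffix) rule that matches, if any."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def strip_ement(word):
    """Delete final 'ement' if what remains is longer than 1 character."""
    stem = word[:-5]
    if word.endswith("ement") and len(stem) > 1:
        return stem
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_plural_rules(w))
for w in ["replacement", "cement"]:
    print(w, "->", strip_ement(w))
```

Note that without the longest-suffix ordering, "caress" would match the plain S rule and lose its final letter.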

Other stemmers

Other stemmers exist, e.g., Lovins stemmer
  http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  Single-pass, longest suffix removal (about 250 rules)

Full morphological analysis – at most modest benefits for retrieval

Do stemming and other normalizations help?
  English: very mixed results. Helps recall for some queries but harms precision on others
    E.g., Porter Stemmer equivalence class oper contains all of operate operating operates operation operative operatives operational
  Definitely useful for Spanish, German, Finnish, …

Thesauri

Handle synonyms and homonyms
  Hand-constructed equivalence classes
    e.g., car = automobile
          color = colour

Rewrite to form equivalence classes

Index such equivalences
  When the document contains automobile, index it under car as well (usually, also vice-versa)

Or expand query?
  When the query contains automobile, look under car as well

Stop words (1)

Stop words = extremely common words which would appear to be of little value in helping select documents matching a user need
  They have little semantic content
  Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with

Without suitable compression techniques, indexing stop words takes a lot of space.

Stop word elimination used to be standard in older IR systems.

Stop words (2)

But the trend is away from doing this:
  Good compression techniques mean the space for including stop words in a system is very small
  Good query optimization techniques mean you pay little at query time for including stop words.

You need them for:
  Phrase queries: “King of Denmark”
  Various song titles, etc.: “Let it be”, “To be or not to be”
  ‘can’ as a verb is not very useful for keyword queries, but ‘can’ as a noun could be central to a query

Most web search engines index stop words

Start to process Boolean queries (1)

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

The information contained in Doc 1 and Doc 2 can be represented in the following table (tid = term, did = document ID, pos = position of the token within the document).

tid        did  pos
I          1    1
did        1    2
enact      1    3
julius     1    4
caesar     1    5
I          1    6
was        1    7
killed     1    8
i'         1    9
the        1    10
capitol    1    11
brutus     1    12
killed     1    13
me         1    14
so         2    1
let        2    2
it         2    3
be         2    4
with       2    5
caesar     2    6
the        2    7
noble      2    8
brutus     2    9
hath       2    10
told       2    11
you        2    12
caesar     2    13
was        2    14
ambitious  2    15

Start to process Boolean queries (2)

The table mentioned above is called POSTING

By using a table like this, it is simple to answer the queries using SQL

Documents containing the word “Java”
  select did from POSTING where tid = 'java'

Documents containing the word “Java” but not the word “coffee”
  (select did from POSTING where tid = 'java')
  except
  (select did from POSTING where tid = 'coffee')

Start to process Boolean queries (3)

Documents containing the phrase “Java beans” or the term “API”
  with D_JAVA(did, pos) as
         (select did, pos from POSTING where tid = 'java'),
       D_BEANS(did, pos) as
         (select did, pos from POSTING where tid = 'beans'),
       D_JAVABEANS(did) as
         (select D_JAVA.did from D_JAVA, D_BEANS
          where D_JAVA.did = D_BEANS.did and D_JAVA.pos + 1 = D_BEANS.pos),
       D_API(did) as
         (select did from POSTING where tid = 'api')
  (select did from D_JAVABEANS) union (select did from D_API)

Documents where “Java” and “island” occur in the same sentence
  If sentence terminators are well defined, one can keep a sentence counter and maintain sentence positions as well as token positions in the POSTING table.
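The three SQL queries can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the four toy documents are invented for illustration, tokens are produced by a plain whitespace split, and the parentheses around the individual selects are dropped because SQLite does not accept them around the operands of EXCEPT and UNION.

```python
import sqlite3

# Hypothetical toy collection (not from the slides).
docs = {
    1: "java beans are a component model",
    2: "java is an island and coffee grows there",
    3: "the api is documented",
    4: "beans and java",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POSTING (tid TEXT, did INTEGER, pos INTEGER)")
for did, text in docs.items():
    for pos, token in enumerate(text.split(), start=1):
        conn.execute("INSERT INTO POSTING VALUES (?, ?, ?)", (token, did, pos))

def run(sql):
    """Run a query and return the distinct document IDs in sorted order."""
    return sorted({row[0] for row in conn.execute(sql)})

java = run("SELECT did FROM POSTING WHERE tid = 'java'")

java_not_coffee = run(
    "SELECT did FROM POSTING WHERE tid = 'java' "
    "EXCEPT SELECT did FROM POSTING WHERE tid = 'coffee'"
)

phrase_or_api = run("""
    WITH D_JAVA(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'java'),
         D_BEANS(did, pos) AS (SELECT did, pos FROM POSTING WHERE tid = 'beans'),
         D_JAVABEANS(did) AS (
             SELECT D_JAVA.did FROM D_JAVA, D_BEANS
             WHERE D_JAVA.did = D_BEANS.did AND D_JAVA.pos + 1 = D_BEANS.pos),
         D_API(did) AS (SELECT did FROM POSTING WHERE tid = 'api')
    SELECT did FROM D_JAVABEANS UNION SELECT did FROM D_API
""")

print(java, java_not_coffee, phrase_or_api)
```

Document 4 contains both words but not as the adjacent phrase, so the position test `D_JAVA.pos + 1 = D_BEANS.pos` excludes it.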

Is it efficient?

Although the three-column table makes it easy to write keyword queries, it wastes a great deal of space.

To reduce the storage space:
  Document-term matrix -> term-document matrix
  Inverted index

For each term T, we must store a list of all documents that contain T.

Brutus     2 4 8 16 32 64 128
Calpurnia  13 16
Caesar     1 2 3 5 8 13 21 34

Inverted index: the basic concept

Inverted index

Linked lists generally preferred to arrays
  Dynamic space allocation
  Insertion of terms into documents easy
  Space overhead of pointers

Dictionary    Postings lists (sorted by docID)
Brutus        2 4 8 16 32 64 128
Calpurnia     13 16
Caesar        1 2 3 5 8 13 21 34

Each docID entry in a postings list is a posting.

Query processing: AND

Consider processing the query: Brutus AND Caesar
  Locate Brutus in the Dictionary; retrieve its postings.
  Locate Caesar in the Dictionary; retrieve its postings.
  “Merge” the two postings:

Brutus  2 4 8 16 32 64 128
Caesar  1 2 3 5 8 13 21 34

The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus  2 4 8 16 32 64 128
Caesar  1 2 3 5 8 13 21 34
Result  2 8

If the list lengths are x and y, the merge takes O(x+y) operations.

Crucial: postings sorted by docID.
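The merge can be sketched as a two-pointer walk over the sorted lists. The function name `intersect` is illustrative; Python lists stand in for the postings lists.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID.
    Each step advances at least one pointer, so the cost is O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

If the lists were not sorted by docID, the pointer that is advanced could skip a match, which is why the sort order is crucial.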

Index construction

Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

Sort by terms.

Core indexing step.

Large scale indexer:
  External sort is used
  N-way merge sort

The (term, doc #) pairs, previously listed in order of occurrence, after sorting by term:

Term       Doc #
ambitious  2
be         2
brutus     1
brutus     2
capitol    1
caesar     1
caesar     2
caesar     2
did        1
enact      1
hath       2
I          1
I          1
i'         1
it         2
julius     1
killed     1
killed     1
let        2
me         1
noble      2
so         2
the        1
the        2
told       2
you        2
was        1
was        2
with       2

Multiple term entries in a single document are merged.

Frequency information is added.

Term       Doc #  Term freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
capitol    1      1
caesar     1      1
caesar     2      2
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
you        2      1
was        1      1
was        2      1
with       2      1

Why frequency? Will discuss later.

The result is split into a Dictionary file and a Postings file.

Dictionary file columns: Term, N docs, Coll freq. Postings file entries: (Doc #, Freq), stored in dictionary order.

Term       N docs  Coll freq  Postings (Doc #, Freq)
ambitious  1       1          (2, 1)
be         1       1          (2, 1)
brutus     2       2          (1, 1) (2, 1)
capitol    1       1          (1, 1)
caesar     2       3          (1, 1) (2, 2)
did        1       1          (1, 1)
enact      1       1          (1, 1)
hath       1       1          (2, 1)
I          1       2          (1, 2)
i'         1       1          (1, 1)
it         1       1          (2, 1)
julius     1       1          (1, 1)
killed     1       2          (1, 2)
let        1       1          (2, 1)
me         1       1          (1, 1)
noble      1       1          (2, 1)
so         1       1          (2, 1)
the        2       2          (1, 1) (2, 1)
told       1       1          (2, 1)
you        1       1          (2, 1)
was        2       2          (1, 1) (2, 1)
with       1       1          (2, 1)
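The construction steps (emit pairs, sort, merge duplicates with frequencies, split into dictionary and postings) can be sketched in memory as follows. This is a toy version under assumptions: the regular expression tokenizer is illustrative, and unlike the slide tables it lowercases every token, so "I" becomes "i". A large-scale indexer would use an external sort instead of `list.sort`.

```python
import re
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokens(text):
    """Lowercase and keep runs of letters and apostrophes, so that i' survives."""
    return re.findall(r"[a-z']+", text.lower())

# Step 1: sequence of (term, docID) pairs in order of occurrence.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokens(text)]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge multiple entries of a term in one document, adding the term frequency.
term_doc_freq = Counter(pairs)            # (term, docID) -> frequency

# Step 4: split into postings lists and a dictionary.
postings = {}                              # term -> [(docID, freq), ...]
for (term, doc_id), freq in sorted(term_doc_freq.items()):
    postings.setdefault(term, []).append((doc_id, freq))

dictionary = {                             # term -> (number of docs, collection frequency)
    term: (len(plist), sum(freq for _, freq in plist))
    for term, plist in postings.items()
}

print(dictionary["caesar"], postings["caesar"])
```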

Distributed indexing

For web-scale indexing (don’t try this at home!): must use a distributed computing cluster

Individual machines are fault-prone
  Can unpredictably slow down or fail

How do we exploit such a pool of machines?

Google data centers

Google data centers mainly contain commodity machines.

Data centers are distributed around the world.

Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)

Estimate: Google installs 100,000 servers each quarter.

Distributed indexing

Maintain a master machine directing the indexing job – considered “safe”.

Break up indexing into sets of (parallel) tasks.

Master machine assigns each task to an idle machine from a pool.

Parallel tasks

We will use two sets of parallel tasks
  Parsers
  Inverters

Break the input document corpus into splits

Each split is a subset of documents

Parsers

Master assigns a split to an idle parser machine

Parser reads a document at a time and emits (term, doc) pairs

Parser writes pairs into j partitions

Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j=3.

Now to complete the index inversion

Inverters

An inverter collects all (term, doc) pairs (= postings) for one term-partition.

Sorts and writes to postings lists
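The parser and inverter roles can be simulated in a single process as below. This is only a sketch: the splits are invented toy data, there is no master or real parallelism, and `partition_of` assumes every term starts with a lowercase letter a to z.

```python
from collections import defaultdict

PARTITIONS = ["a-f", "g-p", "q-z"]    # j = 3, as in the example above

def partition_of(term):
    """Choose the term partition from the first letter (assumes a lowercase a-z start)."""
    first = term[0]
    if first <= "f":
        return "a-f"
    if first <= "p":
        return "g-p"
    return "q-z"

def parse(split):
    """Parser: read one split, emit (term, doc) pairs into per-partition segments."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Inverter: sort the pairs of one term partition and write postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

# Hypothetical splits: each is a subset of the documents.
splits = [
    [(1, "caesar was killed")],
    [(2, "brutus killed caesar"), (3, "so let it be")],
]

segment_files = [parse(split) for split in splits]        # parser tasks
index = {}
for name in PARTITIONS:                                    # one inverter task per partition
    pairs = [p for seg in segment_files for p in seg.get(name, [])]
    index.update(invert(pairs))

print(index)
```

Each inverter sees only the pairs of its own term range, so the inverter tasks are independent of one another and can run on different machines.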

Data flow

[Figure: the master assigns splits to the parsers (map phase). Each parser writes segment files partitioned by term range (a-f, g-p, q-z). The master then assigns each term range to an inverter (reduce phase), which collects the segment files for its range and produces the postings for a-f, g-p or q-z.]

MapReduce

The index construction algorithm we just described is an instance of MapReduce.

MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … without having to write code for the distribution part.

MapReduce

Index construction was just one phase.

Another phase: transforming a term-partitioned index into a document-partitioned index.
  Term-partitioned: one machine handles a subrange of terms
  Document-partitioned: one machine handles a subrange of documents

(As we discuss in the web part of the course, most search engines use a document-partitioned index … better load balancing, etc.)

Dynamic indexing

Up to now, we have assumed that collections are static.

They rarely are:
  Documents come in over time and need to be inserted.
  Documents are deleted and modified.

This means that the dictionary and postings lists have to be modified:
  Postings updates for terms already in dictionary
  New terms added to dictionary

Simplest approach

Maintain “big” main index

Insertions
  New docs go into “small” auxiliary index
  Search across both, merge results

Deletions
  Invalidation bit-vector for deleted docs
  Filter docs output on a search result by this invalidation bit-vector

Periodically, re-index into one main index
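The main-plus-auxiliary scheme can be sketched with a small class. The class and method names are illustrative, dictionaries of lists stand in for the two indexes, and a Python set stands in for the invalidation bit-vector.

```python
class DynamicIndex:
    def __init__(self, main_index):
        self.main = {term: list(plist) for term, plist in main_index.items()}
        self.aux = {}            # "small" auxiliary index for new documents
        self.deleted = set()     # stands in for the invalidation bit-vector

    def add(self, doc_id, terms):
        """Insertion: new docs go into the auxiliary index only."""
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        """Deletion: mark the doc as invalid; postings are left untouched for now."""
        self.deleted.add(doc_id)

    def search(self, term):
        """Search both indexes, merge the results, filter out invalidated docs."""
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(hits - self.deleted)

    def reindex(self):
        """Periodic re-index: fold everything into one main index."""
        merged = {}
        for index in (self.main, self.aux):
            for term, plist in index.items():
                live = [d for d in plist if d not in self.deleted]
                merged.setdefault(term, set()).update(live)
        self.main = {term: sorted(ids) for term, ids in merged.items() if ids}
        self.aux = {}
        self.deleted = set()

idx = DynamicIndex({"caesar": [1, 2], "brutus": [1]})
idx.add(3, ["caesar", "calpurnia"])
idx.delete(1)
before = idx.search("caesar")
idx.reindex()
after = idx.search("caesar")
print(before, after, idx.main)
```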

Dynamic indexing at search engines

All the large search engines now do dynamic indexing

Their indices have frequent incremental changes
  News items, new topical web pages

But (sometimes/typically) they also periodically reconstruct the index from scratch
  Query processing is then switched to the new index, and the old index is then deleted

About the dictionary

A naïve dictionary

An array of struct:

Field          Type        Size
Term           char[20]    20 bytes
Freq.          int         4/8 bytes
Postings ptr.  Postings *  4/8 bytes

How do we quickly look up elements at query time?

How do we store a dictionary in memory efficiently?

Dictionary data structures

Two main choices:
  Hash table
  Tree

Some IR systems use hashes, some trees

Hashes

Each vocabulary term is hashed to an integer
  (We assume you’ve seen hashtables before)

Pros:
  Lookup is faster than for a tree: O(1)

Cons:
  No easy way to find minor variants: judgment/judgement
  No prefix search [tolerant retrieval]
  If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

Trees

Simplest: binary tree

More usual: B+-tree

Pros:
  Solves the prefix problem (terms starting with hyp)

Cons:
  Slower: O(log M), where M is the number of terms [and this requires balanced tree]
  Rebalancing binary trees is expensive
    But B+-trees mitigate the rebalancing problem
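The difference between the two choices shows up in a prefix lookup. In this sketch a sorted Python list searched with `bisect` stands in for the ordered tree, and a `set` stands in for the hash table; the vocabulary is invented for illustration, and the `"\uffff"` sentinel assumes no term contains that character.

```python
from bisect import bisect_left

# Hypothetical vocabulary (not from the slides).
terms = sorted(["hydrogen", "hyperbole", "hypnosis", "hypothesis", "ibis", "judgment"])
hashed = set(terms)               # hash-based dictionary: exact match only

def prefix_search(sorted_terms, prefix):
    """Return all terms starting with prefix, using two binary searches."""
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\uffff")   # just past the last match
    return sorted_terms[lo:hi]

print(prefix_search(terms, "hyp"))      # terms starting with hyp
print("judgement" in hashed)            # the minor variant is simply a miss
```

With the hash table, answering the same prefix query would require scanning every term, because hashing destroys the ordering that keeps terms with a shared prefix next to each other.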

Other issues

Wild-card query
  Example: mon*: find all docs containing any word beginning with “mon”

Spell correction
  Two main flavors:
    Isolated word
      Check each word on its own for misspelling
      Will not catch typos resulting in correctly spelled words, e.g., from → form
    Context-sensitive
      Look at surrounding words, e.g., I flew form Heathrow to Narita.

Why compress the dictionary Must keep in memory

Search begins with the dictionary

Embedded/mobile devices

Dictionary storage - first cut

Array of fixed-width entries
  ~400,000 terms; 28 bytes/term = 11.2 MB.

Dictionary search structure:

Terms     Freq.     Postings ptr.
a         656,265
aachen    65
….        ….
zulu      221

Width: Terms 20 bytes; Freq. and Postings ptr. 4 bytes each

Fixed-width terms are wasteful

Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms.
  And we still can’t handle supercalifragilisticexpialidocious.

Ave. dictionary word in English: ~8 characters
  How do we use ~8 characters per dictionary term?

Compressing the term list: Dictionary-as-a-String

Store dictionary as a (long) string of characters:
  Pointer to next word shows end of current word
  Hope to save up to 60% of dictionary space.

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.  Postings ptr.  Term ptr.
33
29
44
126

Total string length = 400K x 8B = 3.2MB

Pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits, i.e., 3 bytes
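A small sketch of the dictionary-as-a-string layout, using the seven terms visible in the example string. Python integers stand in for the 3-byte term pointers, and the helper names are illustrative.

```python
terms = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]

# One long string plus, per term, the offset at which the term starts.
big_string = "".join(terms)
term_ptrs = []
offset = 0
for t in terms:
    term_ptrs.append(offset)
    offset += len(t)

def term_at(i):
    """The pointer to the next term marks the end of the current one."""
    start = term_ptrs[i]
    end = term_ptrs[i + 1] if i + 1 < len(term_ptrs) else len(big_string)
    return big_string[start:end]

def lookup(term):
    """Binary search over the term pointers; returns the entry number or -1."""
    lo, hi = 0, len(term_ptrs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = term_at(mid)
        if candidate == term:
            return mid
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(term_at(3), lookup("szczecin"), lookup("syzygies"))
```

No term lengths are stored: each term costs only its own characters plus one pointer, instead of a fixed 20-byte field.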

Blocking

Store pointers to every kth term string.
  Example below: k=4.

Need to store term lengths (1 extra byte)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Freq.  Postings ptr.  Term ptr.
33
29
44
126
7

Save 9 bytes on 3 pointers.

Lose 4 bytes on term lengths.

Net

Where we used 3 bytes/pointer without blocking
  3 x 4 = 12 bytes for k=4 pointers,
now we use 3+4=7 bytes for 4 pointers.

Shaved another ~0.5MB; can save more with larger k.

Why not go with larger k?
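The ~0.5MB figure follows from the numbers used earlier (about 400,000 terms, 3-byte term pointers, 1-byte term lengths). A short check, assuming for simplicity that k divides the number of terms:

```python
NUM_TERMS = 400_000
PTR_BYTES = 3        # one term pointer
LEN_BYTES = 1        # one term-length byte per term when blocking

def term_pointer_bytes(k):
    """Bytes spent on term pointers and lengths with one pointer per block of k terms."""
    blocks = NUM_TERMS // k
    return blocks * PTR_BYTES + NUM_TERMS * LEN_BYTES

unblocked = NUM_TERMS * PTR_BYTES          # one pointer per term, no lengths
saved_k4 = unblocked - term_pointer_bytes(4)
saved_k8 = unblocked - term_pointer_bytes(8)

print(saved_k4, saved_k8)
```

Per block of four terms the cost drops from 12 to 7 bytes, so 100,000 blocks save 500,000 bytes. Larger k saves more space but lengthens the linear scan inside each block.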

Dictionary search without blocking

Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons = (1+2*2+4*3+4)/8 ≈ 2.6

Dictionary search with blocking

Binary search down to 4-term block; then linear search through terms in block.

Blocks of 4 (binary tree), avg. = (1+2*2+2*3+2*4+5)/8 = 3 compares
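The two-stage lookup can be sketched as follows. Lists of Python strings stand in for the blocked term string, and the function names are illustrative; comparison counts are not reproduced here because they depend on the exact shape of the search tree.

```python
from bisect import bisect_right

def build_blocks(sorted_terms, k=4):
    """Group the sorted terms into blocks of k; only each block's first term gets a pointer."""
    return [sorted_terms[i:i + k] for i in range(0, len(sorted_terms), k)]

def blocked_lookup(blocks, term):
    """Binary search over the block leaders, then a linear scan inside one block."""
    leaders = [block[0] for block in blocks]
    b = bisect_right(leaders, term) - 1     # last block whose leader is <= term
    if b < 0:
        return False                        # term sorts before the first leader
    for candidate in blocks[b]:
        if candidate == term:
            return True
    return False

terms = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]
blocks = build_blocks(terms, k=4)
print(blocked_lookup(blocks, "syzygy"), blocked_lookup(blocks, "szz"))
```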

Front coding

Front-coding:
  Sorted words commonly have long common prefix – store differences only
  (for last k-1 in a block of k)

8automata8automate9automatic10automation

→ 8automat*a1◊e2◊ic3◊ion

“8automat*” encodes automat (the * marks the end of the shared prefix). For each later term, the number gives the extra length beyond automat and ◊ stands for the prefix.

Begins to resemble general string compression.
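Front coding of one block can be sketched as below. The `◊` separator in the printed form is an assumption about the notation (the marker that stands for the shared prefix in later terms); the encoder itself keeps the prefix and suffixes as separate values to avoid any parsing ambiguity.

```python
import os

def front_encode(block):
    """Split a block of sorted terms into a shared prefix and per-term suffixes."""
    prefix = os.path.commonprefix(block)
    return prefix, [term[len(prefix):] for term in block]

def front_decode(prefix, suffixes):
    """Rebuild the original terms."""
    return [prefix + s for s in suffixes]

def as_string(prefix, suffixes):
    """Render the block in the compact notation: full length and prefix for the
    first term, then only the extra length and extra characters for the rest."""
    first, rest = suffixes[0], suffixes[1:]
    out = f"{len(prefix) + len(first)}{prefix}*{first}"
    for s in rest:
        out += f"{len(s)}◊{s}"
    return out

block = ["automata", "automate", "automatic", "automation"]
prefix, suffixes = front_encode(block)
print(as_string(prefix, suffixes))
```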

Appendix

B+-tree

Records must be ordered over an attribute

Queries: exact match and range queries over the indexed attribute: “find the name of the student with ID=087-34-7892” or “find all students with gpa between 3.00 and 3.5”

B+-tree: properties

Insert/delete at log_F(N/B) cost; keep tree height-balanced. (F = fanout)

Two types of nodes: index nodes and data nodes; each node is 1 page (disk based method)

Index node

Keys: 57, 81, 95

Pointers:
  to keys k < 57
  to keys 57 ≤ k < 81
  to keys 81 ≤ k < 95
  to keys k ≥ 95

Data node

Keys: 57, 81, 95

Pointers:
  Incoming: from non-leaf node
  To record with key 57
  To record with key 81
  To record with key 95
  To next leaf in sequence

EX: B+ Tree of order 3.

(a) Initial tree

Index level:
  Root: [60]
  Below the root: [20, 40]  [80]

Data level:
  [5, 10]  [20]  [40, 50]  [60]  [80, 100]

Query Example

Range [32, 160]

Root: [100]

Index level: [30]  [120, 150, 180]

Data level: [3, 5, 11]  [30, 35]  [100, 101, 110]  [120, 130]  [150, 156, 179]  [180, 200]
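The range query on this example can be sketched as follows. This is not a B+-tree implementation: the leaf contents are read off the figure, a flat list of separator keys stands in for the index levels, and moving to the next list element stands in for following the next-leaf pointer.

```python
from bisect import bisect_right

# Leaf (data) level of the example tree, left to right.
leaves = [[3, 5, 11], [30, 35], [100, 101, 110], [120, 130], [150, 156, 179], [180, 200]]

# Smallest key of every leaf except the first, standing in for the index levels.
separators = [leaf[0] for leaf in leaves[1:]]

def range_query(lo, hi):
    """Descend to the leaf that may hold lo, then scan leaves in sequence until a key exceeds hi."""
    i = bisect_right(separators, lo)
    result = []
    while i < len(leaves):
        for key in leaves[i]:
            if key > hi:
                return result
            if key >= lo:
                result.append(key)
        i += 1                     # follow the pointer to the next leaf
    return result

print(range_query(32, 160))
```

Only one descent through the index is needed; the rest of the answer comes from the sequential scan, which is why the leaves are linked.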