
Page 1: Information Retrieval

INFORMATION RETRIEVAL

Yu Hong and Heng Ji
[email protected]
October 15, 2014

Page 2: Information Retrieval

Outline

• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Page 3: Information Retrieval

Information

Page 4: Information Retrieval

Basic Function of Information
• Information = transmission of thought

[Diagram: at the source, Thoughts → Words → Sounds (encoding); at the destination, Sounds → Words → Thoughts (decoding); the transmission channels are speech and writing (telepathy?)]

Page 5: Information Retrieval

Information Theory
• Better called "communication theory"
• Developed by Claude Shannon in the 1940s
• Concerned with the transmission of electrical signals over wires
  • How do we send information quickly and reliably?
• Underlies modern electronic communication:
  • Voice and data traffic…
  • Over copper, fiber optic, wireless, etc.
• Famous result: Channel Capacity Theorem
• Formal measure of information in terms of entropy
• Information = "reduction in surprise"

Page 6: Information Retrieval

The Noisy Channel Model
• Information transmission = producing the same message at the destination as was sent at the source
• The message must be encoded for transmission across a medium (called the channel)
• But the channel is noisy and can distort the message

[Diagram: Source → Transmitter → message over a noisy channel → Receiver → Destination]

Page 7: Information Retrieval

A Synthesis
• Information retrieval as communication over time and space, across a noisy channel

[Diagram: the noisy-channel model recast for IR — Sender → Encoding (indexing/writing) → storage → Decoding (acquisition/reading) → Recipient, with noise affecting the stored message]

Page 8: Information Retrieval

What is Information Retrieval?

• Most people equate IR with web search
  • highly visible, commercially successful endeavors
  • leverage 3+ decades of academic research
• IR: finding any kind of relevant information
  • web pages, news events, answers, images, …
  • "relevance" is a key notion


Page 10: Information Retrieval

Interesting Examples
• Google image search: http://images.google.com/
• Google video search: http://video.google.com/
• People search: http://www.intelius.com
• Social network search: http://arnetminer.org/


Page 12: Information Retrieval

IR System

[Diagram: a Query String and a Document Corpus feed into the IR System, which outputs Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, …); shown alongside the sender/recipient encoding–decoding view from the synthesis slide]

Page 13: Information Retrieval

The IR Black Box

[Diagram: Documents and a Query enter the black box; Results come out]

Page 14: Information Retrieval

Inside The IR Black Box

[Diagram: the Query passes through a Representation Function to a Query Representation; Documents pass through a Representation Function into an Index (the Document Representation); a Comparison Function matches the two and produces Results]

Page 15: Information Retrieval

Building the IR Black Box
• Fetching model
• Comparison model
• Representation model
• Indexing model

Page 16: Information Retrieval

Building the IR Black Box
• Fetching models
  • Crawling model
  • Gentle crawling model
• Comparison models
  • Boolean model
  • Vector space model
  • Probabilistic models
  • Language models
  • PageRank
• Representation models
  • How do we capture the meaning of documents?
  • Is meaning just the sum of all terms?
• Indexing models
  • How do we actually store all those words?
  • How do we access indexed terms quickly?

Page 17: Information Retrieval

Outline

• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Page 18: Information Retrieval

Fetching model: Crawling

[Diagram: a crawler fetches web pages from the web; they become the documents that search engines index]

Page 19: Information Retrieval

[Diagram: the IR black box again, now with a Fetching Function (crawling) that pulls documents from the World Wide Web before they reach the Representation Function and the Index]

Page 20: Information Retrieval

Fetching model: Crawling
• Q1: How many web pages should we fetch?
  • As many as we can.
  • More web pages = richer knowledge = a more intelligent search engine

Page 21: Information Retrieval

Fetching model: Crawling
• Q1: How many web pages should we fetch?
  • As many as we can.
  • The fetching model enriches the knowledge in the "brain" of the search engine

[Cartoon: the IR system, fed by the fetching function, boasts "I know everything now, hahahaha!"]

Page 22: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • First, we should know the basic network structure of the web
  • Basic structure: nodes and links (hyperlinks)

Page 23: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • The crawling program (crawler) visits each node in the web by following hyperlinks.

Page 24: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • Q2-1: What are the known nodes?
    • The crawler knows the addresses of these nodes
    • The nodes are web pages, so the addresses are URLs (URL: Uniform Resource Locator)
    • Examples: www.yahoo.com, www.sohu.com, www.sina.com, etc.
  • Q2-2: What are the unknown nodes?
    • The crawler does not know the addresses of these nodes
  • The seed nodes are the earliest known ones
    • Before dispatching the crawler, the search engine hands it a few web page addresses; these pages are the earliest known nodes (the so-called seeds)

Page 25: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • Q2-3: How can the crawler find the unknown nodes?

[Diagram: starting from a known node (a fetched document), the crawler discovers the unknown nodes it links to]


Page 29: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • Q2-3: How can the crawler find the unknown nodes?

[Diagram: the crawler runs a PARSER over a known node's page; the previously unknown nodes it links to become known]

Page 30: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • Q2-3: How can the crawler find the unknown nodes?
    • If you introduce a web page to the crawler (i.e., let it know the web address), the crawler will use a source-code parser to mine many new web pages, whose addresses then become known to it.
    • But if you don't tell the crawler anything, it will go on strike because it can do nothing.
    • That is why we need the seed nodes (seed web pages) to wake the crawler up.

Page 31: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • To traverse the whole web network, the crawler needs some auxiliary equipment:
    • A FIFO (first in, first out) register, i.e., a queue
    • An Access Control Program (ACP)
    • A Source Code Parser (SCP)
    • Seed nodes

Page 32: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • Robotic crawling procedure (only five steps; see the sketch below)
    • Initialization: push the seed nodes (known web pages) into the empty queue
    • Step 1: Take a node out of the queue (FIFO) and visit it (ACP)
    • Step 2: Extract the necessary information from the node's source code (SCP)
    • Step 3: Send the extracted text (title, body, keywords and language) back to the search engine for storage (ACP)
    • Step 4: Push the newly found nodes into the queue
    • Step 5: Repeat Steps 1–5
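A minimal sketch of this five-step loop, assuming only Python's standard library; the names `crawl`, `LinkParser` and the `max_pages` cap are illustrative and not part of the slides.

```python
# Minimal sketch of the queue-based crawling procedure described above.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """SCP role: collect the href targets of <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=100):
    queue = deque(seeds)          # Initialization: push seeds into the queue
    seen = set(seeds)
    storage = {}                  # stands in for the search engine's storage
    while queue and len(storage) < max_pages:
        url = queue.popleft()     # Step 1: take a node out (FIFO) and visit it
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue              # unreachable node: skip it
        parser = LinkParser()
        parser.feed(html)         # Step 2: parse the node's source code
        storage[url] = html       # Step 3: send the text back for storage
        for link in parser.links: # Step 4: push the newly found nodes
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return storage                # Step 5 is the while-loop itself
```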

Page 33: Information Retrieval

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • Through these steps, the number of known nodes grows continuously
    • This is the underlying reason why the crawler can traverse the whole web
  • The crawler stops working when the register is empty
    • By the time the register is empty, the information from all reachable nodes has been extracted and stored on the search engine's server.

[Diagram: the FIFO register over time — the seeds fill the first slots, and as each node is processed, the newly discovered nodes are pushed into the remaining slots]

Page 34: Information Retrieval

Fetching model: Crawling
• Problems
  • 1) In practice, the crawler cannot traverse the whole web.
    • For example, it can fall into an infinite loop when it enters a partially closed circle of links (a snare) in the web

[Diagram: a ring of nodes that only link to each other — once inside, the crawler keeps circling]

Page 35: Information Retrieval

Fetching model: Crawling

• Problems
  • 2) Crude crawling.
    • A portal web site injects a series of homologous nodes into the register (e.g., https://www.yahoo.com together with its subsites such as https://finance.yahoo.com/, https://answers.yahoo.com/, https://shopping.yahoo.com/, https://weather.yahoo.com/, …). Under the FIFO rule, crawling these nodes one after another keeps hitting the same server. This is crude crawling.

[Diagram: a class of homologous web pages, all linking to one portal site, flooding the FIFO register]

Page 36: Information Retrieval

Fetching model: Crawling
• Homework
  • 1) How can we overcome the infinite loop caused by a partially closed-circle network in the web?
  • 2) Please find a way to crawl the web like a gentleman (not crudely).
  • Please select one of the problems as the topic of your homework. Write a short paper of no more than 500 words, including at least your idea and a methodology. The methodology can be described in natural language, as a flow diagram, or as an algorithm.
  • Send it to me. Email: [email protected]. Thanks.

Page 37: Information Retrieval

Building the IR Black Box
• Fetching models
  • Crawling model
  • Gentle crawling model
• Comparison models
  • Boolean model
  • Vector space model
  • Probabilistic models
  • Language models
  • PageRank
• Representation models
  • How do we capture the meaning of documents?
  • Is meaning just the sum of all terms?
• Indexing models
  • How do we actually store all those words?
  • How do we access indexed terms quickly?

Page 38: Information Retrieval

[Diagram: the black-box architecture again — Query and Documents pass through Representation Functions; the Comparison Function matches the query representation against the Index to produce Results]

Page 39: Information Retrieval

[Diagram: the same black-box architecture, with the document-representation/index side marked "ignore for now" — the focus turns to the comparison function]

Page 40: Information Retrieval

A heuristic formula for IR (Boolean model)
• Rank docs by similarity to the query
  • suppose the query is "spiderman film"
• Relevance = # query words in the doc
  • favors documents with both "spiderman" and "film"
  • mathematically: sim(D,Q) = Σ_{q∈Q∩D} 1
• Logical variations (set-based), with O(q,D) = 1 if q occurs in D and 0 otherwise
  • Boolean AND (require all words): AND(D,Q) = ∏_{q∈Q} O(q,D)
  • Boolean OR (any of the words): OR(D,Q) = 1 − ∏_{q∈Q} (1 − O(q,D))

Page 41: Information Retrieval

Term Frequency (TF)
• Observation:
  • key words tend to be repeated in a document
• Modify our similarity measure:
  • give more weight if a word occurs multiple times
  • sim(D,Q) = Σ_{q∈Q∩D} tf(q)
• Problem:
  • biased towards long documents
  • spurious occurrences
  • normalize by length:
  • sim(D,Q) = Σ_{q∈Q∩D} tf(q) / |D|

Page 42: Information Retrieval

Inverse Document Frequency (IDF)
• Observation:
  • rare words carry more meaning: cryogenic, apollo
  • frequent words are linguistic glue: of, the, said, went
• Modify our similarity measure:
  • give more weight to rare words … but don't be too aggressive (why?)
  • sim(D,Q) = Σ_{q∈Q∩D} ( tf(q) / |D| ) · log( |C| / df(q) )
  • |C| … total number of documents
  • df(q) … total number of documents that contain q

Page 43: Information Retrieval

TF normalization
• Observation:
  • D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}
  • which document is more relevant?
  • which one is ranked higher? (df(labs) > df(cryogenic))
• Correction:
  • first occurrence more important than a repeat (why?)
  • "squash" the linearity of TF: tf(q) / ( tf(q) + K )

[Plot: the squashed TF curve flattens out as tf grows from 1 to 2 to 3]

Page 44: Information Retrieval

State-of-the-art Formula

sim(D,Q) = Σ_{q∈Q∩D} [ tf(q) / ( tf(q) + K_D ) ] · log( |C| / df(q) )

• More query words → good (the sum over q∈Q∩D)
• Repetitions of query words → good (the tf(q) term)
• Penalize very long documents (K_D grows with document length)
• Common words less important (the log |C|/df(q) term)
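A minimal sketch of this scoring formula over an in-memory corpus of tokenized documents. The length-scaled `k_d` is an assumption (the slide only says K_D penalizes very long documents), and the names `score`, `corpus` and `k` are illustrative.

```python
# Sketch of the tf/(tf+K_D) * log(|C|/df) formula above.
import math
from collections import Counter

def score(query_terms, doc_terms, corpus, k=1.5):
    """Score one document (list of tokens) against a bag-of-words query."""
    tf = Counter(doc_terms)
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    k_d = k * len(doc_terms) / avg_len           # assumed length-scaled K_D
    s = 0.0
    for q in set(query_terms):
        if tf[q] == 0:
            continue                             # only q ∈ Q ∩ D contributes
        df = sum(1 for d in corpus if q in d)    # document frequency of q
        s += tf[q] / (tf[q] + k_d) * math.log(n_docs / df)
    return s

corpus = [["spiderman", "film", "review"],
          ["spiderman", "spiderman", "comic"],
          ["weather", "report"]]
for doc in corpus:
    print(doc, score(["spiderman", "film"], doc, corpus))
```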

Page 45: Information Retrieval

Strengths and Weaknesses
• Strengths
  • Precise, if you know the right strategies
  • Precise, if you have an idea of what you're looking for
  • Implementations are fast and efficient
• Weaknesses
  • Users must learn Boolean logic
  • Boolean logic is insufficient to capture the richness of language
  • No control over the size of the result set: either too many hits or none
  • When do you stop reading? All documents in the result set are considered "equally good"
  • What about partial matches? Documents that "don't quite match" the query may be useful also

Page 46: Information Retrieval

Vector-space approach to IR

[Diagram: documents plotted as vectors in a space with axes cat, pig, dog — e.g., "cat cat pig dog dog", "cat cat", "cat cat cat", "cat pig", "pig cat" — with the angle θ between vectors measuring closeness]

Assumption: Documents that are “close together” in vector space “talk about” the same things

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

Page 47: Information Retrieval

Some formulas for Similarity

Dot product: Sim(D,Q) = Σ_i (a_i · b_i)

Cosine: Sim(D,Q) = Σ_i (a_i · b_i) / ( √Σ_i a_i² · √Σ_i b_i² )

Dice: Sim(D,Q) = 2 · Σ_i (a_i · b_i) / ( Σ_i a_i² + Σ_i b_i² )

Jaccard: Sim(D,Q) = Σ_i (a_i · b_i) / ( Σ_i a_i² + Σ_i b_i² − Σ_i (a_i · b_i) )

(where D = (a_1, a_2, …) and Q = (b_1, b_2, …) are the document and query vectors)

[Diagram: D and Q as vectors over terms t1 and t2]

Page 48: Information Retrieval

An Example
• A document space is defined by three terms:
  • hardware, software, users — the vocabulary
• A set of documents is defined as:
  • A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
  • A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
  • A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is "hardware and software"
  • what documents should be retrieved?

Page 49: Information Retrieval

An Example (cont.)
• In Boolean query matching:
  • documents A4, A7 will be retrieved ("AND")
  • retrieved: A1, A2, A4, A5, A6, A7, A8, A9 ("OR")
• In similarity matching (cosine):
  • q = (1, 1, 0)
  • S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
  • S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
  • S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
  • Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
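A small sketch that reproduces the cosine scores of this example with plain Python (no external libraries; the vectors are over the vocabulary hardware, software, users).

```python
# Cosine similarity over the toy document space from the example.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = {"A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
        "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
        "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1)}
q = (1, 1, 0)  # "hardware and software"

for d in sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True):
    print(d, round(cosine(q, docs[d]), 2))
# A4=1.0, A7=0.82, A1=A2=0.71, A5=A6=A8=A9=0.5, A3=0.0 — matching the slide
```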

Page 50: Information Retrieval

Probabilistic model

• Given D, estimate P(R|D) and P(NR|D)
• P(R|D) = P(D|R)·P(R)/P(D)   (P(D), P(R) constant)
• Represent a document by term presence/absence: D = {t1=x1, t2=x2, …}, where x_i = 1 if t_i is present in D and x_i = 0 if absent

P(D|R) = ∏_i P(t_i = x_i | R) = ∏_i P(t_i=1|R)^{x_i} · P(t_i=0|R)^{1−x_i} = ∏_i p_i^{x_i} (1 − p_i)^{1−x_i}

P(D|NR) = ∏_i P(t_i=1|NR)^{x_i} · P(t_i=0|NR)^{1−x_i} = ∏_i q_i^{x_i} (1 − q_i)^{1−x_i}

(with p_i = P(t_i=1|R) and q_i = P(t_i=1|NR))

Page 51: Information Retrieval

Prob. model (cont'd)

For document ranking, use the log-odds:

Odd(D) = log [ P(D|R) / P(D|NR) ]
       = Σ_i x_i · log [ p_i (1 − q_i) / ( q_i (1 − p_i) ) ] + Σ_i log [ (1 − p_i) / (1 − q_i) ]

The second sum does not depend on the document, so documents can be ranked by

Σ_i x_i · log [ p_i (1 − q_i) / ( q_i (1 − p_i) ) ]

Page 52: Information Retrieval

Prob. model (cont'd)

• How to estimate p_i and q_i?
• A set of N relevant and irrelevant samples:

                 Relevant       Irrelevant             Total
  with t_i       r_i            n_i − r_i              n_i
  without t_i    R_i − r_i      N − R_i − n_i + r_i    N − n_i
  Total          R_i            N − R_i                N

  p_i = r_i / R_i        q_i = (n_i − r_i) / (N − R_i)

Page 53: Information Retrieval

Prob. model (cont'd)

• Smoothing (Robertson–Sparck Jones formula)
• When no sample is available:
  p_i = 0.5, q_i = (n_i + 0.5) / (N + 0.5) ≈ n_i / N
• May be implemented as a VSM-style sum of term weights:

Odd(D) = Σ_i x_i · log [ p_i (1 − q_i) / ( q_i (1 − p_i) ) ]
       = Σ_i x_i · log [ (r_i + 0.5)(N − R_i − n_i + r_i + 0.5) / ( (R_i − r_i + 0.5)(n_i − r_i + 0.5) ) ]
       = Σ_{t_i ∈ D} w_i

Page 54: Information Retrieval

An Appraisal of Probabilistic Models

Among the oldest formal models in IR
  Maron & Kuhns, 1960: Since an IR system cannot predict with certainty which document is relevant, we should deal with probabilities

Assumptions for getting reasonable approximations of the needed probabilities:
  • Boolean representation of documents/queries/relevance
  • Term independence
  • Out-of-query terms do not affect retrieval
  • Document relevance values are independent

Page 55: Information Retrieval

An Appraisal of Probabilistic Models

The difference between 'vector space' and 'probabilistic' IR is not that great:
  • In either case you build an information retrieval scheme in the exact same way.
  • Difference: for probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Page 56: Information Retrieval

Language-modeling Approach
• the query is a random sample from a "perfect" document
• words are "sampled" independently of each other
• rank documents by the probability of generating the query

[Illustration: P(query|D) = P(w1|D) · P(w2|D) · P(w3|D) · P(w4|D) = 4/9 · 2/9 · 4/9 · 3/9 for a toy 9-word document]

Page 57: Information Retrieval

Naive Bayes and LM generative models

Naive Bayes: we want to classify document d. Classes are, e.g., geographical regions like China, UK, Kenya. Assume that d was generated by the generative model of a class. Key question: which class is most likely to have generated the document? (For which class do we have the most evidence?)

LM for IR: we want to "classify" a query q. Each document in the collection is a different class. Assume that q was generated by a generative model. Which document (= class) is most likely to have generated the query q? (For which document, as the source of the query, do we have the most evidence?)

Page 58: Information Retrieval

Using language models (LMs) for IR

❶ LM = language model
❷ We view the document as a generative model that generates the query.
❸ What we need to do:
❹ Define the precise generative model we want to use
❺ Estimate parameters (different parameters for each document's model)
❻ Smooth to avoid zeros
❼ Apply to the query and find the document most likely to have generated the query
❽ Present the most likely document(s) to the user
❾ Note that x – y is pretty much what we did in Naive Bayes.

Page 59: Information Retrieval

What is a language model?

We can view a finite state automaton as a deterministic language model.

[Diagram: a two-state automaton that generates "I wish I wish I wish I wish . . ."]

It cannot generate "wish I wish" or "I wish I". Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

Page 60: Information Retrieval

A probabilistic language model

This is a one-state probabilistic finite-state automaton – a unigram language model – and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops.

Example string: frog said that toad likes frog STOP
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 0.0000000000048

Page 61: Information Retrieval

A different language model for each document

String: frog said that toad likes frog STOP
P(string|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 0.0000000000048 = 4.8 · 10⁻¹²
P(string|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.02 = 0.0000000000120 = 12 · 10⁻¹²
P(string|Md1) < P(string|Md2)

Thus, document d2 is “more relevant” to the string “frog said that toad likes frog STOP” than d1 is.

Page 62: Information Retrieval

Using language models in IR

Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q).

P(d|q) = P(q|d) · P(d) / P(q)
  • P(q) is the same for all documents, so ignore it
  • P(d) is the prior – often treated as the same for all d
    • But we can give a prior to "high-quality" documents, e.g., those with high PageRank.
  • P(q|d) is the probability of q given d.

So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

Page 63: Information Retrieval


Where we are

In the LM approach to IR, we attempt to model the query generation process.

Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.

That is, we rank according to P(q|d). Next: how do we compute P(q|d)?

Page 64: Information Retrieval

How to compute P(q|d)

We will make the same conditional independence assumption as for Naive Bayes:

P(q|Md) = ∏_{1≤k≤|q|} P(t_k|Md)

(|q|: length of q; t_k: the token occurring at position k in q)

This is equivalent to:

P(q|Md) = ∏_{distinct t in q} P(t|Md)^{tf_{t,q}}

(tf_{t,q}: term frequency, i.e. # occurrences, of t in q)

Multinomial model (omitting the constant factor)

Page 65: Information Retrieval

Parameter estimation

Missing piece: Where do the parameters P(t|Md) come from?
Start with maximum likelihood estimates (as we did for Naive Bayes):

P̂(t|Md) = tf_{t,d} / |d|

(|d|: length of d; tf_{t,d}: # occurrences of t in d)

As in Naive Bayes, we have a problem with zeros. A single t with P(t|Md) = 0 will make P(q|Md) zero. We would give a single term "veto power". For example, for the query [Michael Jackson top hits], a document about "top songs" (but not using the word "hits") would have P(t|Md) = 0. That's bad.

We need to smooth the estimates to avoid zeros.

Page 66: Information Retrieval

Smoothing

Key intuition: A nonoccurring term is possible (even though it didn't occur), . . .
. . . but no more likely than would be expected by chance in the collection.

Notation: Mc: the collection model; cf_t: the number of occurrences of t in the collection; T: the total number of tokens in the collection.

We will use P̂(t|Mc) = cf_t / T to "smooth" P(t|d) away from zero.

Page 67: Information Retrieval

Mixture model

P(t|d) = λ·P(t|Md) + (1 − λ)·P(t|Mc)

Mixes the probability from the document with the general collection frequency of the word.
  • High value of λ: "conjunctive-like" search – tends to retrieve documents containing all query words.
  • Low value of λ: more disjunctive, suitable for long queries.
  • Correctly setting λ is very important for good performance.

Page 68: Information Retrieval


Mixture model: Summary

What we model: The user has a document in mind and generates the query from this document.

The equation represents the probability that the document that the user had in mind was in fact this one.

Page 69: Information Retrieval

Example

Collection: d1 and d2
d1: Jackson was one of the most talented entertainers of all time
d2: Michael Jackson anointed himself King of Pop

Query q: Michael Jackson
Use the mixture model with λ = 1/2

P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013

Ranking: d2 > d1
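A minimal sketch of the λ-mixture scoring used in this example, assuming whitespace tokenization and case-insensitive matching; the function name `p_query` is illustrative.

```python
# P(q|d) = Π_t [ λ·P(t|Md) + (1-λ)·P(t|Mc) ] with maximum-likelihood estimates.
def p_query(query, doc, collection, lam=0.5):
    doc_tokens = doc.lower().split()
    coll_tokens = [t for d in collection for t in d.lower().split()]
    p = 1.0
    for t in query.lower().split():
        p_doc = doc_tokens.count(t) / len(doc_tokens)     # P(t|Md)
        p_coll = coll_tokens.count(t) / len(coll_tokens)  # P(t|Mc)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

d1 = "Jackson was one of the most talented entertainers of all time"
d2 = "Michael Jackson anointed himself King of Pop"
q = "Michael Jackson"
print(p_query(q, d1, [d1, d2]))   # ≈ 0.003
print(p_query(q, d2, [d1, d2]))   # ≈ 0.013 → d2 ranked above d1
```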

Page 70: Information Retrieval

Exercise: Compute ranking

Collection: d1 and d2
d1: Xerox reports a profit but revenue is down
d2: Lucene narrows quarter loss but revenue decreases further

Query q: revenue down
Use the mixture model with λ = 1/2

P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256

Ranking: d1 > d2

Page 71: Information Retrieval

LMs vs. vector space model (1)

LMs have some things in common with vector space models.
  • Term frequency is directly in the model. But it is not scaled in LMs.
  • Probabilities are inherently "length-normalized". Cosine normalization does something similar for vector space.
  • Mixing document and collection frequencies has an effect similar to idf.
    • Terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.

Page 72: Information Retrieval

LMs vs. vector space model (2)

LMs vs. vector space model: commonalities
  • Term frequency is directly in the model.
  • Probabilities are inherently "length-normalized".
  • Mixing document and collection frequencies has an effect similar to idf.

LMs vs. vector space model: differences
  • LMs: based on probability theory
  • Vector space: based on similarity, a geometric / linear algebra notion
  • Collection frequency vs. document frequency
  • Details of term frequency, length normalization, etc.

Page 73: Information Retrieval

Language models for IR: Assumptions

• Simplifying assumption: Queries and documents are objects of the same type. Not true!
  • There are other LMs for IR that do not make this assumption.
  • The vector space model makes the same assumption.
• Simplifying assumption: Terms are conditionally independent.
  • Again, the vector space model (and Naive Bayes) makes the same assumption.
• Cleaner statement of assumptions than vector space
  • Thus, better theoretical foundation than vector space
  • … but "pure" LMs perform much worse than "tuned" LMs.

Page 74: Information Retrieval

Relevance Using Hyperlinks
• The number of documents relevant to a query can be enormous if only term frequencies are taken into account
• Using term frequencies makes "spamming" easy
  • E.g., a travel agency can add many occurrences of the word "travel" to its page to make its rank very high
• Most of the time people are looking for pages from popular sites
• Idea: use the popularity of a Web site (e.g., how many people visit it) to rank site pages that match given keywords
• Problem: hard to find the actual popularity of a site
  • Solution: next slide

Page 75: Information Retrieval

Relevance Using Hyperlinks (Cont.)
• Solution: use the number of hyperlinks to a site as a measure of the popularity or prestige of the site
  • Count only one hyperlink from each site (why? – see previous slide)
  • The popularity measure is for the site, not for individual pages
    • But most hyperlinks are to the root of a site
    • Also, the concept of "site" is difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity
• Refinements
  • When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige
    • The definition is circular
    • Set up and solve a system of simultaneous linear equations
• The above idea is the basis of the Google PageRank ranking mechanism

Page 76: Information Retrieval

PageRank in Google

Page 77: Information Retrieval

PageRank in Google (Cont')

• Assign a numeric value to each page
  • The more a page is referred to by important pages, the more important this page is

PR(A) = (1 − d) + d · Σ_{I_i ∈ B_A} PR(I_i) / C(I_i)

  • B_A: the set of pages I_i that link to A; C(I_i): the number of outgoing links of I_i
  • d: damping factor (0.85)

• Many other criteria: e.g. proximity of query words
  • "…information retrieval …" better than "… information … retrieval …"
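A minimal power-iteration sketch of the PageRank formula above, in the (1 − d) + d·Σ PR(I)/C(I) form from the slide (so scores sum to the number of pages rather than to 1); the toy `web` graph is illustrative.

```python
# Iterative PageRank over a small link graph.
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                     # initial value for every page
    out_degree = {p: len(links[p]) for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = [q for q in pages if p in links[q]]   # B_p
            new_pr[p] = (1 - d) + d * sum(pr[q] / out_degree[q] for q in incoming)
        pr = new_pr
    return pr

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(web))   # C ends up highest: it collects prestige from both A and B
```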

Page 78: Information Retrieval

Relevance Using Hyperlinks (Cont.)
• Connections to social-networking theories that ranked the prestige of people
  • E.g., the president of the U.S.A. has high prestige since many people know him
  • Someone known by multiple prestigious people has high prestige
• Hub and authority based ranking
  • A hub is a page that stores links to many pages (on a topic)
  • An authority is a page that contains actual information on a topic
  • Each page gets a hub prestige based on the prestige of the authorities that it points to
  • Each page gets an authority prestige based on the prestige of the hubs that point to it
  • Again, the prestige definitions are cyclic, and can be obtained by solving linear equations
  • Use authority prestige when ranking answers to a query

Page 79: Information Retrieval


HITS: Hubs and authorities

Page 80: Information Retrieval

HITS update rules

A: link matrix; h: vector of hub scores; a: vector of authority scores

HITS algorithm:
  • Compute h = A·a
  • Compute a = Aᵀ·h
  • Iterate until convergence
  • Output (i) the list of hubs ranked according to hub score and (ii) the list of authorities ranked according to authority score
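A minimal sketch of these HITS update rules in plain Python; the L2 normalization of the score vectors each round is an assumption added so the iteration converges, and the toy link matrix is illustrative.

```python
# HITS: alternate the a = Aᵀh and h = Aa updates until convergence.
import math

def hits(A, iterations=50):
    """A[i][j] = 1 if page i links to page j."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = Aᵀ h
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = A a
        for v in (a, h):                                               # normalize
            norm = math.sqrt(sum(x * x for x in v)) or 1.0
            for i in range(n):
                v[i] /= norm
    return h, a

# Page 0 links to 1 and 2; page 1 links to 2; page 2 links to nothing.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
hubs, authorities = hits(A)
print(hubs, authorities)   # page 0 is the best hub, page 2 the best authority
```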

Page 81: Information Retrieval

Outline

• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Page 82: Information Retrieval

Keyword Search
• The simplest notion of relevance is that the query string appears verbatim in the document.
• A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Page 83: Information Retrieval

Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  • "restaurant" vs. "café"
  • "PRC" vs. "China"
• May retrieve irrelevant documents that include ambiguous terms.
  • "bat" (baseball vs. mammal)
  • "Apple" (company vs. fruit)
  • "bit" (unit of data vs. act of eating)

Page 84: Information Retrieval

Query Expansion
• http://www.lemurproject.org/lemur/IndriQueryLanguage.php
• Most errors are caused by vocabulary mismatch
  • query: "cars", document: "automobiles"
  • solution: automatically add highly-related words
• Thesaurus / WordNet lookup:
  • add semantically-related words (synonyms)
  • cannot take context into account:
    • "rail car" vs. "race car" vs. "car and cdr"
• Statistical expansion:
  • add statistically-related words (co-occurrence)
  • very successful

Page 85: Information Retrieval

Indri Query Examples

• <parameters><query>#combine( #weight( 0.063356 #1(explosion) 0.187417 #1(blast) 0.411817 #1(wounded) 0.101370 #1(injured) 0.161191 #1(death) 0.074849 #1(deaths)) #weight( 0.311760 #1(Davao City international airport) 0.311760 #1(Tuesday) 0.103044 #1(DAVAO) 0.195505 #1(Philippines) 0.019817 #1(DXDC) 0.058113 #1(Davao Medical Center)))</query></parameters>

Page 86: Information Retrieval

Synonyms and Homonyms
• Synonyms
  • E.g., document: "motorcycle repair", query: "motorcycle maintenance"
  • Need to realize that "maintenance" and "repair" are synonyms
  • The system can extend the query as "motorcycle and (repair or maintenance)"
• Homonyms
  • E.g., "object" has different meanings as noun/verb
  • Can disambiguate meanings (to some extent) from the context
• Extending queries automatically using synonyms can be problematic
  • Need to understand the intended meaning in order to infer synonyms
    • Or verify synonyms with the user
  • Synonyms may have other meanings as well

Page 87: Information Retrieval

Concept-Based Querying
• Approach
  • For each word, determine the concept it represents from context
  • Use one or more ontologies:
    • Hierarchical structure showing relationships between concepts
    • E.g., the ISA relationship that we saw in the E-R model
• This approach can be used to standardize terminology in a specific field
• Ontologies can link multiple languages
• Foundation of the Semantic Web (not covered here)

Page 88: Information Retrieval

Outline

• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Page 89: Information Retrieval

Indexing of Documents
• An inverted index maps each keyword Ki to the set of documents Si that contain the keyword
  • Documents are identified by identifiers
• The inverted index may record
  • Keyword locations within the document, to allow proximity-based ranking
  • Counts of the number of occurrences of the keyword, to compute TF
• and operation: finds documents that contain all of K1, K2, ..., Kn
  • Intersection S1 ∩ S2 ∩ ... ∩ Sn
• or operation: finds documents that contain at least one of K1, K2, …, Kn
  • Union S1 ∪ S2 ∪ ... ∪ Sn
• Each Si is kept sorted to allow efficient intersection/union by merging
  • "not" can also be efficiently implemented by merging of sorted lists

Page 90: Information Retrieval

Indexing of Documents
• Goal = find the important meanings and create an internal representation
• Factors to consider:
  • Accuracy in representing meanings (semantics)
  • Exhaustiveness (cover all the contents)
  • Facility for the computer to manipulate
• What is the best representation of contents?
  • Character string (char trigrams): not precise enough
  • Word: good coverage, not precise
  • Phrase: poor coverage, more precise
  • Concept: poor coverage, precise

[Diagram: moving from String → Word → Phrase → Concept, coverage (recall) decreases while accuracy (precision) increases]

Page 91: Information Retrieval

Indexer steps
• Sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term / Doc #: I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Page 92: Information Retrieval

• Multiple term entries in a single document are merged.

• Frequency information is added.

After sorting (Term, Doc #): ambitious 2; be 2; brutus 1; brutus 2; capitol 1; caesar 1; caesar 2; caesar 2; did 1; enact 1; hath 2; I 1; I 1; i' 1; it 2; julius 1; killed 1; killed 1; let 2; me 1; noble 2; so 2; the 1; the 2; told 2; you 2; was 1; was 2; with 2

After merging (Term, Doc #, Term freq): ambitious 2 1; be 2 1; brutus 1 1; brutus 2 1; capitol 1 1; caesar 1 1; caesar 2 2; did 1 1; enact 1 1; hath 2 1; I 1 2; i' 1 1; it 2 1; julius 1 1; killed 1 2; let 2 1; me 1 1; noble 2 1; so 2 1; the 1 1; the 2 1; told 2 1; you 2 1; was 1 1; was 2 1; with 2 1
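A minimal sketch of these indexer steps — tokenize, emit (term, docID) pairs, sort, then merge duplicates while counting term frequency — assuming a very crude tokenizer that just strips "." and ";".

```python
# Build a toy inverted index with term frequencies from the two Caesar documents.
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# Step 1: sequence of (modified token, document ID) pairs
pairs = []
for doc_id, text in docs.items():
    for token in text.replace(";", " ").replace(".", " ").lower().split():
        pairs.append((token, doc_id))

# Step 2: sort by term, then by document ID
pairs.sort()

# Step 3: merge multiple entries per (term, doc) and add frequency information
index = defaultdict(list)          # term -> postings list of (doc_id, term_freq)
for term, doc_id in pairs:
    postings = index[term]
    if postings and postings[-1][0] == doc_id:
        postings[-1] = (doc_id, postings[-1][1] + 1)
    else:
        postings.append((doc_id, 1))

print(index["caesar"])   # [(1, 1), (2, 2)]
print(index["killed"])   # [(1, 2)]
```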

Page 93: Information Retrieval

An example

Page 94: Information Retrieval

Stopwords / Stoplist
• Function words do not bear useful information for IR: of, in, about, with, I, although, …
• Stoplist: contains stopwords, which are not to be used as index terms
  • Prepositions
  • Articles
  • Pronouns
  • Some adverbs and adjectives
  • Some frequent words (e.g. "document")
• The removal of stopwords usually improves IR effectiveness
• A few "standard" stoplists are commonly used.

Page 95: Information Retrieval

Stemming
• Reason:
  • Different word forms may bear similar meanings (e.g. search, searching): create a "standard" representation for them
• Stemming:
  • Removing some endings of words, e.g. computer, compute, computes, computing, computed, computation → comput

Page 96: Information Retrieval

Lemmatization
• Transform to the standard form according to syntactic category.
  E.g. verb + ing → verb; noun + s → noun
• Needs POS tagging
• More accurate than stemming, but needs more resources
• It is crucial to choose the stemming/lemmatization rules: noise vs. recognition rate
  • compromise between precision and recall

light/no stemming: −recall, +precision          severe stemming: +recall, −precision

Page 97: Information Retrieval

Simple conjunctive query (two terms)

Consider the query: BRUTUS AND CALPURNIA
To find all matching documents using the inverted index:
❶ Locate BRUTUS in the dictionary
❷ Retrieve its postings list from the postings file
❸ Locate CALPURNIA in the dictionary
❹ Retrieve its postings list from the postings file
❺ Intersect the two postings lists
❻ Return the intersection to the user

Page 98: Information Retrieval


Intersecting two posting lists

This is linear in the length of the postings lists.

Note: This only works if postings lists are sorted.
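A minimal sketch of the merge-style intersection referred to above: walk both sorted postings lists with two pointers in linear time. The example postings lists are illustrative.

```python
# Intersect two sorted postings lists of document IDs.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer sitting on the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))   # [2, 31]
```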

Page 99: Information Retrieval

Does Google use the Boolean model?

On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn

Cases where you get hits that do not contain one of the wi:
  • anchor text
  • the page contains a variant of wi (morphology, spelling correction, synonym)
  • long queries (n large)
  • the Boolean expression generates very few hits

Simple Boolean vs. ranking of the result set:
  • Simple Boolean retrieval returns matching documents in no particular order.
  • Google (and most well-designed Boolean engines) rank the result set – they rank good hits (according to some estimator of relevance) higher than bad hits.

Page 100: Information Retrieval

Outline

• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Page 101: Information Retrieval

IR Evaluation
• Efficiency: time, space
• Effectiveness:
  • How capable is a system of retrieving relevant documents?
  • Is one system better than another?
• Metrics often used (together):
  • Precision = retrieved relevant docs / retrieved docs
  • Recall = retrieved relevant docs / relevant docs

[Diagram: Venn diagram of the relevant set and the retrieved set; their overlap is the set of retrieved relevant documents]

Page 102: Information Retrieval

IR Evaluation (Cont')
• Information-retrieval systems save space by using index structures that support only approximate retrieval. This may result in:
  • false negative (false drop) – some relevant documents may not be retrieved
  • false positive – some irrelevant documents may be retrieved
  • For many applications a good index should not permit any false drops, but may permit a few false positives.
• Relevant performance metrics:
  • precision – what percentage of the retrieved documents are relevant to the query
  • recall – what percentage of the documents relevant to the query were retrieved

Page 103: Information Retrieval

IR Evaluation (Cont')
• Recall vs. precision tradeoff:
  • Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
• Measures of retrieval effectiveness:
  • Recall as a function of the number of documents fetched, or
  • Precision as a function of recall
    • Equivalently, as a function of the number of documents fetched
  • E.g., "precision of 75% at recall of 50%, and 60% at a recall of 75%"
• Problem: which documents are actually relevant, and which are not

Page 104: Information Retrieval

General form of precision/recall

[Plot: precision (y-axis, up to 1.0) decreases as recall (x-axis, up to 1.0) increases]

• Precision changes w.r.t. recall (not a fixed point)
• Systems cannot be compared at a single precision/recall point
• Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)

Page 105: Information Retrieval

An illustration of P/R calculation

Ranked list (relevant?): Doc1 (Y), Doc2 (N), Doc3 (Y), Doc4 (Y), Doc5 (N), …
Assume: 5 relevant docs in total.

[Plot of (recall, precision) points as the list is read: (0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6), …]

Page 106: Information Retrieval

MAP (Mean Average Precision)

MAP = (1/n) · Σ_i [ (1/|R_i|) · Σ_j ( j / r_ij ) ]

• r_ij = rank of the j-th relevant document for Q_i
• |R_i| = # relevant docs for Q_i
• n = # test queries

E.g. two queries, with relevant documents found at the following ranks:
  Q1: 1 (1st rel. doc.), 5 (2nd rel. doc.), 10 (3rd rel. doc.)
  Q2: 4 (1st rel. doc.), 8 (2nd rel. doc.)

MAP = 1/2 · [ 1/3 · (1/1 + 2/5 + 3/10) + 1/2 · (1/4 + 2/8) ]
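A minimal sketch of the MAP computation above: the precision at each relevant document's rank (j / r_ij) is averaged per query, then averaged over queries; the function names are illustrative.

```python
# Mean Average Precision from per-query lists of relevant-document ranks.
def average_precision(relevant_ranks):
    """relevant_ranks: sorted ranks at which the relevant docs were retrieved."""
    return sum((j + 1) / r for j, r in enumerate(relevant_ranks)) / len(relevant_ranks)

def mean_average_precision(per_query_ranks):
    return sum(average_precision(r) for r in per_query_ranks) / len(per_query_ranks)

# The slide's example: Q1 finds its relevant docs at ranks 1, 5, 10; Q2 at ranks 4, 8.
print(mean_average_precision([[1, 5, 10], [4, 8]]))
# ≈ 0.408, i.e. 1/2 · [ 1/3·(1/1 + 2/5 + 3/10) + 1/2·(1/4 + 2/8) ]
```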

Page 107: Information Retrieval

Some other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence = non-retrieved relevant docs / relevant docs
  • Noise = 1 – Precision; Silence = 1 – Recall
• Fallout = retrieved irrelevant docs / irrelevant docs
• Single-value measures:
  • F-measure = 2·P·R / (P + R)
  • Average precision = average at 11 points of recall
  • Precision at n documents (often used for Web IR)
  • Expected search length (number of irrelevant documents to read before obtaining n relevant docs)

Page 108: Information Retrieval

Interactive system’s evaluation

• Definition:

Evaluation = the process of systematically collecting data that informs us about what it is like for a particular user or group of users to use a product/system for a particular task in a certain type of environment.

Page 109: Information Retrieval

Problems
• Attitudes:
  • Designers assume that if they and their colleagues can use the system and find it attractive, others will too
  • Features vs. usability or security
  • Executives want the product on the market yesterday
    • Problems "can" be addressed in versions 1.x
  • Consumers accept low levels of usability
    • "I'm so silly"

Page 110: Information Retrieval

Two main types of evaluation
• Formative evaluation is done at different stages of development to check that the product meets users' needs.
  • Part of the user-centered design approach
  • Supports design decisions at various stages
  • May test parts of the system or alternative designs
• Summative evaluation assesses the quality of a finished product.
  • May test the usability or the output quality
  • May compare competing systems

Page 111: Information Retrieval

What to evaluate
Iterative design & evaluation is a continuous process that examines:
• Early ideas for the conceptual model
• Early prototypes of the new system
• Later, more complete prototypes

Designers need to check that they understand users’ requirements and that the design assumptions hold.

Page 112: Information Retrieval

Four evaluation paradigms
• 'quick and dirty'

• usability testing

• field studies

• predictive evaluation

Page 113: Information Retrieval

Quick and dirty
• 'Quick & dirty' evaluation describes the common practice in which designers informally get feedback from users or consultants to confirm that their ideas are in line with users' needs and are liked.
• Quick & dirty evaluations can be done at any time.
• The emphasis is on fast input to the design process rather than carefully documented findings.

Page 114: Information Retrieval

Usability testing
• Usability testing involves recording typical users' performance on typical tasks in controlled settings. Field observations may also be used.
• As the users perform these tasks they are watched & recorded on video & their key presses are logged.
• This data is used to calculate performance times, identify errors & help explain why the users did what they did.
• User satisfaction questionnaires & interviews are used to elicit users' opinions.

Page 115: Information Retrieval

Usability testing
• It is very time-consuming to conduct and analyze
  • Explain the system, do some training
  • Explain the task, do a mock task
  • Questionnaires before and after the test & after each task
  • A pilot test is usually needed
• Insufficient number of subjects for 'proper' statistical analysis
• In laboratory conditions, subjects do not behave exactly like in a normal environment

Page 116: Information Retrieval

Field studies
• Field studies are done in natural settings
• The aim is to understand what users do naturally and how technology impacts them.
• In product design, field studies can be used to:
  - identify opportunities for new technology
  - determine design requirements
  - decide how best to introduce new technology
  - evaluate technology in use

Page 117: Information Retrieval

Predictive evaluation
• Experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems.
• Another approach involves theoretically based models.
• A key feature of predictive evaluation is that users need not be present
• Relatively quick & inexpensive

Page 118: Information Retrieval

The TREC experiments
• Once per year
• A set of documents and queries is distributed to the participants (the standard answers are unknown) (April)
• Participants work (very hard) to construct and fine-tune their systems, and submit the answers (1000/query) at the deadline (July)
• NIST people manually evaluate the answers and provide correct answers (and a classification of IR systems) (July – August)
• TREC conference (November)

Page 119: Information Retrieval

TREC evaluation methodology
• Known document collection (>100K) and query set (50)
• Submission of 1000 documents for each query by each participant
• Merge the first 100 documents of each participant -> global pool
• Human relevance judgment of the global pool
• The other documents are assumed to be irrelevant
• Evaluation of each system (with 1000 answers)
  • Partial relevance judgments
  • But stable for system ranking

Page 120: Information Retrieval

Tracks (tasks)
• Ad hoc track: given document collection, different topics
• Routing (filtering): stable interests (user profile), incoming document flow
• CLIR: ad hoc, but with queries in a different language
• Web: a large set of Web pages
• Question answering: When did Nixon visit China?
• Interactive: put users into action with the system
• Spoken document retrieval
• Image and video retrieval
• Information tracking: new topic / follow up

Page 121: Information Retrieval

CLEF and NTCIR
• CLEF = Cross-Language Evaluation Forum
  • for European languages
  • organized by Europeans
  • once per year (March – Oct.)
• NTCIR:
  • organized by NII (Japan)
  • for Asian languages
  • cycle of 1.5 years

Page 122: Information Retrieval

Impact of TREC
• Provides large collections for further experiments
• Compares different systems/techniques on realistic data
• Develops new methodology for system evaluation
• Similar experiments are organized in other areas (NLP, machine translation, summarization, …)

Page 123: Information Retrieval

Outline

• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Page 124: Information Retrieval

IR on the Web
• No stable document collection (spider, crawler)
• Invalid documents, duplication, etc.
• Huge number of documents (partial collection)
• Multimedia documents
• Great variation in document quality
• Multilingual problem
• …

Page 125: Information Retrieval

Web Search

• Application of IR to HTML documents on the World Wide Web.
• Differences:
  • Must assemble the document corpus by spidering the web.
  • Can exploit the structural layout information in HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.

Page 126: Information Retrieval

Web Search System

[Diagram: a Web Spider crawls the web to build the document corpus; the IR system takes the query string and the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, …)]

Page 127: Information Retrieval

Challenges
• Scale, distribution of documents
• Controversy over the unit of indexing
  • What is a document? (hypertext)
  • What does the user expect to be retrieved?
• High heterogeneity
  • Document structure, size, quality, level of abstraction / specialization
  • User search or domain expertise, expectations
• Retrieval strategies
  • What do people want?
• Evaluation

Page 128: Information Retrieval

Web documents / data
• No traditional collection
  • Huge
    • Time and space to crawl and index
    • IR systems cannot store copies of documents
  • Dynamic, volatile, anarchic, un-controlled
  • Homogeneous sub-collections
• Structure
  • In documents (un-/semi-/fully-structured)
  • Between docs: a network of inter-connected nodes
    • Hyper-links – conceptual vs. physical documents

Page 129: Information Retrieval

Web documents / data
• Mark-up
  • HTML – look & feel
  • XML – structure, semantics
  • Dublin Core Metadata
  • Can webpage authors be trusted to correctly mark up / index their pages?
• Multi-lingual documents
• Multi-media

Page 130: Information Retrieval

Theoretical models for indexing / searching
• Content-based weighting
  • As in traditional IR systems, but trying to incorporate
    • hyperlinks
    • the dynamic nature of the Web (page validity, page caching)
• Link-based weighting
  • Quality of webpages
    • Hubs & authorities
    • Bookmarked pages
    • Iterative estimation of quality

Page 131: Information Retrieval

Architecture
• Centralized
  • The main server contains the index, built by an indexer, searched by a query engine
  • Advantage: control, easy update
  • Disadvantage: system requirements (memory, disk, safety/recovery)
• Distributed
  • Brokers & gatherers
  • Advantage: flexibility, load balancing, redundancy
  • Disadvantage: software complexity, update

Page 132: Information Retrieval

User variability
• Power and flexibility for expert users vs. intuitiveness and ease of use for novice users
• Multi-modal user interface
  • Distinguish between experts and beginners, offer distinct interfaces (functionality)
  • Advantage: can make assumptions about users
  • Disadvantage: habit formation, cognitive shift
• Uni-modal interface
  • Make essential functionality obvious
  • Make advanced functionality accessible

Page 133: Information Retrieval

Search strategies
• Web directories
• Query-based searching
• Link-based browsing (provided by the browser, not the IR system)
• "More like this"
• Known site (bookmarking)
• A combination of the above

Page 134: Information Retrieval

Support for Relevance Feedback
• RF can improve search effectiveness … but is rarely used

• Voluntary vs. forced feedback

• At document vs. word level

• “Magic” vs. control

Page 135: Information Retrieval

Some techniques to improve IR effectiveness

• Interaction with the user (relevance feedback)
  - Keywords only cover part of the contents
  - The user can help by indicating relevant/irrelevant documents
• The use of relevance feedback to improve the query expression:

  Q_new = α·Q_old + β·Rel_d − γ·Nrel_d

  where Rel_d = centroid of relevant documents
        Nrel_d = centroid of non-relevant documents
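A minimal sketch of this Rocchio-style feedback formula over dense term-weight vectors; the α, β, γ values below are common illustrative defaults, not values from the slides.

```python
# Q_new = α·Q_old + β·centroid(relevant) − γ·centroid(non-relevant)
def rocchio(q_old, rel_docs, nrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    dims = len(q_old)
    centroid = lambda docs: [sum(d[i] for d in docs) / len(docs) if docs else 0.0
                             for i in range(dims)]
    rel_c, nrel_c = centroid(rel_docs), centroid(nrel_docs)
    return [alpha * q_old[i] + beta * rel_c[i] - gamma * nrel_c[i]
            for i in range(dims)]

# Toy vocabulary: (car, automobile, weather)
q_old = [1.0, 0.0, 0.0]                        # the user typed "car"
relevant = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # judged-relevant docs mention "automobile"
non_relevant = [[0.0, 0.0, 1.0]]
print(rocchio(q_old, relevant, non_relevant))  # weight shifts toward "automobile"
```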

Page 136: Information Retrieval

Modified relevance feedback
• Users usually do not cooperate (e.g. AltaVista in its early years)
• Pseudo-relevance feedback (blind RF)
  • Use the top-ranked documents as if they were relevant:
    • Select m terms from the n top-ranked documents
• One can usually obtain about 10% improvement

Page 137: Information Retrieval

Term clustering
• Based on 'similarity' between terms
  • Collocation in documents, paragraphs, sentences
• Based on document clustering
  • Terms specific to bottom-level document clusters are assumed to represent a topic
• Uses
  • Thesauri
  • Query expansion

Page 138: Information Retrieval

User modelling
• Build a model / profile of the user by recording
  • the 'context'
  • topics of interest
  • preferences
  based on interpreting his/her actions:
  • Implicit or explicit relevance feedback
  • Recommendations from 'peers'
  • Customization of the environment

Page 139: Information Retrieval

Personalised systems
• Information filtering
  • Ex: in a TV guide, only show programs of interest
• Use the user model to disambiguate queries
  • Query expansion
  • Update the model continuously
• Customize the functionality and the look-and-feel of the system
  • Ex: skins; remember the levels of the user interface

Page 140: Information Retrieval

Autonomous agents
• Purpose: find relevant information on behalf of the user
• Input: the user profile
• Output: pull vs. push
• Positive aspects:
  • Can work in the background, implicitly
  • Can update the master with new, relevant info
• Negative aspects: control
• Integration with collaborative systems

Page 141: Information Retrieval

Outline

• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Page 142: Information Retrieval

Document Representation

<html>
<head><title>Department Descriptions</title></head>
<body>The following list describes … <h1>Agriculture</h1> … <h1>Chemistry</h1> … <h1>Computer Science</h1> … <h1>Electrical Engineering</h1> … <h1>Zoology</h1></body>
</html>

• <title> context: "department descriptions"; <title> extents: 1. department descriptions
• <h1> context: "agriculture", "chemistry", …, "zoology"; <h1> extents: 1. agriculture, 2. chemistry, …, 36. zoology
• <body> context: "the following list describes … <h1>agriculture</h1> …"; <body> extents: 1. the following list describes <h1>agriculture</h1> …

Page 143: Information Retrieval

Model

• Based on the original inference network retrieval framework [Turtle and Croft '91]
• Casts retrieval as inference in a simple graphical model
• Extensions made to the original model
  • Incorporation of probabilities based on language modeling rather than tf.idf
  • Multiple language models allowed in the network (one per indexed context)

Page 144: Information Retrieval

Model

[Diagram: the INDRI inference network — the observed document node D and the observed model hyperparameters (α,β)title, (α,β)body, (α,β)h1 feed the context language models θtitle, θbody, θh1; each context model generates representation nodes r1 … rN (terms, phrases, etc.); these feed belief nodes q1, q2 (#combine, #not, #max), which combine into the information need node I (itself a belief node)]

Page 145: Information Retrieval

Model

[Diagram: the inference network again, shown before discussing P( r | θ )]

Page 146: Information Retrieval

P( r | θ )
• Probability of observing a term, phrase, or "concept" given a context language model
  • r_i nodes are binary
• Assume r ~ Bernoulli( θ )
  • "Model B" – [Metzler, Lavrenko, Croft '04]
• Nearly any model may be used here
  • tf.idf-based estimates (INQUERY)
  • Mixture models

Page 147: Information Retrieval

Model

[Diagram: the inference network again, shown before discussing P( θ | α, β, D )]

Page 148: Information Retrieval

P( θ | α, β, D )

• Prior over the context language model determined by α, β
• Assume P( θ | α, β ) ~ Beta( α, β )
  • Bernoulli's conjugate prior
  • α_w = μP( w | C ) + 1
  • β_w = μP( ¬w | C ) + 1
  • μ is a free parameter

P( r_i | α, β, D ) = ∫ P( r_i | θ ) P( θ | α, β, D ) dθ = ( tf_{w,D} + μP( w | C ) ) / ( |D| + μ )

Page 149: Information Retrieval

Model

[Diagram: the inference network again, shown before discussing P( q | r ) and P( I | r )]

Page 150: Information Retrieval

P( q | r ) and P( I | r )
• Belief nodes are created dynamically based on the query
• Belief node CPTs are derived from standard link matrices
  • Combine evidence from parents in various ways
  • Allows fast inference by making marginalization computationally tractable
• The information need node is simply a belief node that combines all network evidence into a single value
• Documents are ranked according to: P( I | α, β, D )

Page 151: Information Retrieval

Example: #AND

[Diagram: belief node Q with parents A and B]

Link matrix (CPT) for #AND:

  P(Q=true|a,b)   A       B
  0               false   false
  0               false   true
  0               true    false
  1               true    true

p_and = P(Q=true) = Σ_{a,b} P(Q=true | A=a, B=b) · P(A=a) · P(B=b)
      = 0·(1−p_A)(1−p_B) + 0·(1−p_A)p_B + 0·p_A(1−p_B) + 1·p_A·p_B
      = p_A · p_B

Page 152: Information Retrieval

Query Language
• Extension of the INQUERY query language
• Structured query language
  • Term weighting
  • Ordered / unordered windows
  • Synonyms
• Additional features
  • Language-modeling motivated constructs
  • Added flexibility to deal with fields via contexts
  • Generalization of passage retrieval (extent retrieval)
• Robust query language that handles many current language modeling tasks

Page 153: Information Retrieval

Terms

Type | Example | Matches
Stemmed term | dog | All occurrences of dog (and its stems)
Surface term | "dogs" | Exact occurrences of dogs (without stemming)
Term group (synonym group) | <"dogs" canine> | All occurrences of dogs (without stemming) or canine (and its stems)
Extent match | #any:person | Any occurrence of an extent of type person

Page 154: Information Retrieval

Date / Numeric Fields

Type | Example | Matches
#less | #less(URLDEPTH 3) | Any URLDEPTH numeric field extent with value less than 3
#greater | #greater(READINGLEVEL 3) | Any READINGLEVEL numeric field extent with value greater than 3
#between | #between(SENTIMENT 0 2) | Any SENTIMENT numeric field extent with value between 0 and 2
#equals | #equals(VERSION 5) | Any VERSION numeric field extent with value equal to 5
#date:before | #date:before(1 Jan 1900) | Any DATE field before 1900
#date:after | #date:after(June 1 2004) | Any DATE field after June 1, 2004
#date:between | #date:between(1 Jun 2000 1 Sep 2001) | Any DATE field in summer 2000

Page 155: Information Retrieval

Proximity

Type | Example | Matches
#odN(e1 … em) or #N(e1 … em) | #od5(saddam hussein) or #5(saddam hussein) | All occurrences of saddam and hussein appearing ordered within 5 words of each other
#uwN(e1 … em) | #uw5(information retrieval) | All occurrences of information and retrieval that appear in any order within a window of 5 words
#uw(e1 … em) | #uw(john kerry) | All occurrences of john and kerry that appear in any order within any sized window
#phrase(e1 … em) | #phrase(#1(willy wonka) #uw3(chocolate factory)) | System-dependent implementation (defaults to #odm)

Page 156: Information Retrieval

Context Restriction

Example | Matches
yahoo.title | All occurrences of yahoo appearing in the title context
yahoo.title,paragraph | All occurrences of yahoo appearing in both a title and a paragraph context (may not be possible)
<yahoo.title yahoo.paragraph> | All occurrences of yahoo appearing in either a title context or a paragraph context
#5(apple ipod).title | All matching windows contained within a title context

Page 157: Information Retrieval

Context Evaluation

Example | Evaluated
google.(title) | The term google evaluated using the title context as the document
google.(title, paragraph) | The term google evaluated using the concatenation of the title and paragraph contexts as the document
google.figure(paragraph) | The term google restricted to figure tags within the paragraph context

Page 158: Information Retrieval

Belief Operators

INQUERY | INDRI
#sum / #and | #combine
#wsum* | #weight
#or | #or
#not | #not
#max | #max

* #wsum is still available in INDRI, but should be used with discretion

Page 159: Information Retrieval

Extent / Passage Retrieval

Example | Evaluated
#combine[section](dog canine) | Evaluates #combine(dog canine) for each extent associated with the section context
#combine[title, section](dog canine) | Same as previous, except it is evaluated for each extent associated with either the title context or the section context
#combine[passage100:50](white house) | Evaluates #combine(white house) over 100-word passages, treating every 50 words as the beginning of a new passage
#sum(#sum[section](dog)) | Returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent
#max(#sum[section](dog)) | Same as previous, except returns the maximum score

Page 160: Information Retrieval

Extent Retrieval Example

<document>
  <section><head>Introduction</head> Statistical language modeling allows formal methods to be applied to information retrieval. ...</section>
  <section><head>Multinomial Model</head> Here we provide a quick review of multinomial language models. ...</section>
  <section><head>Multiple-Bernoulli Model</head> We now examine two formal methods for statistically modeling documents and queries based on the multiple-Bernoulli distribution. ...</section>
  …
</document>

Query: #combine[section]( dirichlet smoothing )

1. Treat each section extent as a "document"
2. Score each "document" according to #combine( … )
3. Return a ranked list of extents.

SCORE | DOCID  | BEGIN | END
0.50  | IR-352 | 51    | 205
0.35  | IR-352 | 405   | 548
0.15  | IR-352 | 0     | 50
…     | …      | …     | …

Page 161: Information Retrieval

Other Operators

Type | Example | Description
Filter require | #filreq( #less(READINGLEVEL 10) ben franklin) | Requires that documents have a reading level less than 10. Documents are then ranked by the query ben franklin
Filter reject | #filrej( #greater(URLDEPTH 1) microsoft) | Rejects (does not score) documents with a URL depth greater than 1. Documents are then ranked by the query microsoft
Prior | #prior( DATE ) | Applies the document prior specified for the DATE field

Page 162: Information Retrieval

System Overview
• Indexing
  • Inverted lists for terms and fields
  • The repository consists of inverted lists, parsed documents, and document vectors
• Query processing
  • Local or distributed
  • Computing local / global statistics
• Features