2 Information Retrieval

Prof. Dr. Knut Hinkelmann
Information Retrieval and Knowledge Organisation

Page 1:

2 Information Retrieval

Page 2:

Motivation

Information Retrieval has been a field of activity for many years.

It was long seen as an area of narrow interest.

The advent of the Web changed this perception:
  universal repository of knowledge
  free (low cost) universal access
  no central editorial board
  many problems though: IR is seen as the key to finding the solutions!

Page 3:

Motivation

Information Retrieval: representation, storage, organization of, and access to information items

Emphasis on the retrieval of information (not data)

Focus is on the user's information need.

Example of a user information need: Find all documents containing information about car accidents which happened in Vienna and in which people were injured.

The information need is expressed as a query

Page 4:

Generic Schema of an Information System

[Figure: generic schema of an information system. The information resources are mapped to a representation of the resources (index/meta-data); the user's information need is mapped to a representation of the information need (query); the two representations are compared for ranking.]

Information Retrieval systems do not search through the documents themselves but through the representation (also called index, meta-data or description).

source: (Ferber 2004)

Page 5:

Example

Information need: documents containing information about accidents with heavy vehicles in Vienna

Query: accident heavy vehicles vienna

D1: Heavy accident. Because of a heavy car accident 4 people died yesterday morning in Vienna.
D2: More vehicles. In this quarter more cars became registered in Vienna.
D3: Truck causes accident. In Vienna a trucker drove into a crowd of people. Four people were injured.

Expected result: document D3
but: not all terms of the query occur in that document, and the query terms „accident“ and „heavy“ also occur in D1

Page 6:

Retrieval System

Each document is represented by a set of representative keywords or index terms.

An index term is a document word useful for remembering the document's main themes.

[Figure: architecture of a retrieval system. Document resources are assigned IDs and stored together with their IDs; indexing extracts terms from the text and builds an index of weighted documents. At the interface, the user's information need is formulated as a query; query processing maps the query to terms, and retrieval (search) with ranking answers it with a sorted list of document IDs.]

The index is stored in an efficient data structure.

Queries are answered using the index.

With the ID, the document can be retrieved.

Page 7:

Indexing

Manual indexing – key words
  The user specifies key words that he/she assumes useful.
  Usually, key words are nouns, because nouns have meaning by themselves.
  There are two possibilities:
    1. the user can assign any terms
    2. the user can select from a predefined set of terms (→ controlled vocabulary)

Automatic indexing – full text search
  Search engines assume that all words are index terms (full text representation).
  The system generates index terms from the words occurring in the text.

Page 8:

Automatic Indexing: 1. Decompose a Document into Terms

Rules determine how texts are decomposed into terms by defining separators like punctuation marks, blanks or hyphens.

Additional preprocessing, e.g.:
  exclude specific strings (stop words, numbers)
  generate normal form:
    stemming
    substitute characters (e.g. upper case – lower case, Umlaut)

D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna

D2: more vehicles in this quarter more cars became registered in vienna

D3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

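The decomposition step can be pictured with a small Python sketch; the separator set, the stop word list and the lower-casing rule below are illustrative assumptions, not the exact rules used in the lecture:

    import re

    # illustrative choices, not the lecture's exact rule set
    STOP_WORDS = {"a", "of", "in", "this", "were", "because", "into", "the"}

    def decompose(text):
        """Split a text into terms at punctuation/blank separators,
        lower-case everything and drop stop words and pure numbers."""
        tokens = re.split(r"[\s.,;:!?()\-]+", text.lower())
        return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]

    print(decompose("Heavy accident: Because of a heavy car accident 4 people died yesterday morning in Vienna."))
    # ['heavy', 'accident', 'heavy', 'car', 'accident', 'people', 'died', 'yesterday', 'morning', 'vienna']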

Page 9:

Automatic Indexing: 2. Index Represented as an Inverted List

For each term: a list of the documents in which the term occurs.

Additional information can be stored with each document, like
  frequency of occurrence
  positions of occurrence

Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2
…

An inverted list is similar to the index in a book.
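A minimal Python sketch of how such an inverted list (including the term frequencies shown on the next slide) could be built; the document contents are abbreviated here:

    from collections import defaultdict

    def build_inverted_index(docs):
        """docs: document ID -> list of terms.
        Returns: term -> sorted list of (doc_id, frequency) postings."""
        index = defaultdict(dict)
        for doc_id, terms in docs.items():
            for term in terms:
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        return {term: sorted(postings.items()) for term, postings in index.items()}

    docs = {
        "D1": ["heavy", "accident", "heavy", "car", "accident", "people", "died", "vienna"],
        "D2": ["more", "vehicles", "quarter", "more", "cars", "became", "registered", "vienna"],
        "D3": ["truck", "causes", "accident", "vienna", "trucker", "people", "injured"],
    }
    index = build_inverted_index(docs)
    print(index["accident"])   # [('D1', 2), ('D3', 1)]
    print(index["vienna"])     # [('D1', 1), ('D2', 1), ('D3', 1)]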

Page 10:

Index as Inverted List with Frequency

term        (document, frequency)
a           (D1,1) (D3,2)
accident    (D1,2) (D3,1)
became      (D2,1)
because     (D1,1)
car         (D1,1)
cars        (D2,1)
died        (D1,1)
heavy       (D1,2)
in          (D1,1) (D2,1) (D3,1)
more        (D2,1)
of          (D1,1)
people      (D1,1) (D3,2)
quarter     (D2,1)
registered  (D2,1)
truck       (D3,1)
vehicles    (D2,1)
...

In this example the inverted list contains the document identifier and the frequency of the term in the document.

Page 11:

Problems of Information Retrieval

Word form: A word can occur in different forms, e.g. singular or plural.
Example: A query for „car“ should also find documents containing the word „cars“.

Meaning: A single term can have different meanings; on the other hand, the same meaning can be expressed using different terms.
Example: when searching for „car“, documents containing „vehicle“ should also be found.

Wording, phrases: The same issue can be expressed in various ways.
Example: searching for „motorcar“ should also find documents containing „motorized car“.

Page 12:

Word Forms

Flexion: conjugation and declension of a word
  car – cars
  run – ran – running

Derivations: words having the same stem
  form – format – formation

Compositions: statements such as
  information management – management of information

In German, compositions are written as single words, sometimes with a hyphen:
  Informationsmanagement
  Informations-Management

Page 13:

Word Meaning and Phrases

Synonyms
  record – file – dossier
  seldom – not often

Variants in spelling (e.g. BE vs. AE)
  organisation – organization
  night – nite

Abbreviations
  UN – United Nations

Polysemes: words with multiple meanings
  Bank

Dealing with words having the same or similar meaning

Page 14:

2.1 Dealing with Word Forms and Phrases

We distinguish two ways to deal with word forms and phrases

Indexing without preprocessing
  All occurring word forms are included in the index.
  Different word forms are unified at search time.
  → string operations

Indexing with preprocessing
  Word forms are unified during indexing.
  The index terms are normal forms of the occurring word forms.
  The index is largely independent of the concrete formulation of the text.
  → computer-linguistic approach

Page 15:

2.1.1 Indexing Without Preprocessing

Index: contains all the word forms occurring in the documents

Query: searching for specific word forms is possible (e.g. searching for „cars“ but not for „car“)

To search for different word forms, string operations can be applied:

Operators for truncation and masking, e.g.
  ?  covers exactly one character
  *  covers an arbitrary number of characters

Context operators, e.g.
  [n]  exact distance between terms
  <n>  maximal distance between terms

Page 16:

Index Without Preprocessing and Query

Query: vehicle? car? people

Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2
…

Page 17:

Truncation and Masking: Searching for Different Word Forms

Truncation: wildcards cover characters at the beginning or end of words (prefix or suffix)
  schreib* finds schreiben, schreibt, schreibst, schreibe, …
  ??schreiben finds anschreiben, beschreiben, but not verschreiben

Masking deals with characters inside words – in particular in German, declension and conjugation affect not only suffix and prefix
  schr??b* can find schreiben, schrieb
  h??s* can find Haus, Häuser

Disadvantage: with truncation and masking, not only the intended words are found
  schr??b* also finds schrauben
  h??s* also finds Hans, Hanse, hausen, hassen, and also words in other languages like horse
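One way such truncation and masking operators could be implemented is by translating them into regular expressions that are matched against the index terms; the following Python sketch is an assumption about the implementation, only the operator syntax is taken from the slides:

    import re

    def wildcard_to_regex(pattern):
        """'?' covers exactly one character, '*' an arbitrary number of characters."""
        body = "".join("." if ch == "?" else ".*" if ch == "*" else re.escape(ch)
                       for ch in pattern)
        return re.compile(rf"^{body}$")

    terms = ["schreiben", "schrieb", "schrauben", "haus", "häuser", "hans", "horse"]
    rx = wildcard_to_regex("schr??b*")
    print([t for t in terms if rx.match(t)])   # ['schreiben', 'schrieb', 'schrauben']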

Page 18:

Context Operators

Context operators allow searching for variations of text phrases

Exact word distance: Bezug [3] Telefonat
  finds: Bezug nehmend auf unser Telefonat

Maximal word distance: text <2> retrieval
  finds: text retrieval, text and fact retrieval

For context operators to be applicable, the positions of the words must be stored in the index

Page 19:

Indexing Without Preprocessing

Efficiency
  efficient indexing
  overhead at retrieval time to apply the string operators

Word forms
  the user has to codify all possible word forms and phrases in the query using truncation and masking operators
  no support is given by the search engine
  the retrieval engine is language independent

Phrases
  variants of text phrases can be coded using context operators

Page 20:

2.1.2 Preprocessing of the Index – Computer-Linguistic Approach

Each document is represented by a set of representative keywords or index terms

An index term is a document word useful for remembering the document’s main themes

The index contains standard forms of useful terms:
  1. Restrict the allowed terms
  2. Normalisation: map terms to a standard form

[Figure: in the example document, some words become index terms while others are not used for the index.]

Page 21:

Restricting allowed Index Terms

Objective:

increase efficiency and effectiveness by neglecting terms that do not contribute to the assessment of a document's relevance

There are two possibilities to restrict the allowed index terms:

1. Explicitly specify the allowed index terms → controlled vocabulary

2. Specify terms that are not allowed as index terms → stop words

Page 22:

Stop Words

Stop words are terms that are not stored in the index. Candidates for stop words are:

  words that occur very frequently
    A term occurring in every document is useless as an index term, because it does not tell anything about which document the user might be interested in.
    (In contrast, a word which occurs in only 0.001% of the documents is quite useful, because it narrows down the space of documents which might be of interest to the user.)

  words with no/little meaning

  terms that are not words (e.g. numbers)

Examples:
  General: articles, conjunctions, prepositions, auxiliary verbs (to be, to have) occur very often and in general have no meaning as a search criterion.
  Application-specific stop words are also possible.

Page 23:

Normalisation of Terms

There are various possibilities to compute standard forms:
  N-grams
  stemming: removing suffixes or prefixes

Page 24:

N-Grams

Index: sequences of characters of length N

Example: „persons“
  3-grams (N=3): per, ers, rso, son, ons
  4-grams (N=4): pers, erso, rson, sons

N-grams can also cross word boundaries.
Example: „persons from switzerland“
  3-grams (N=3): per, ers, rso, son, ons, ns_, s_f, _fr, fro, rom, om_, m_s, _sw, swi, wit, itz, tze, zer, erl, rla, lan, and
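A small Python sketch of N-gram generation; representing blanks with '_' so that the grams can cross word boundaries is an assumption about the notation:

    def ngrams(text, n):
        """Character n-grams of a text; spaces are kept (shown as '_')."""
        s = text.replace(" ", "_")
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    print(ngrams("persons", 3))    # ['per', 'ers', 'rso', 'son', 'ons']
    print(ngrams("persons", 4))    # ['pers', 'erso', 'rson', 'sons']
    print(ngrams("persons from switzerland", 3)[:9])
    # ['per', 'ers', 'rso', 'son', 'ons', 'ns_', 's_f', '_fr', 'fro']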

Page 25:

Stemming

Stemming: remove suffixes and prefixes to find a common stem, e.g.
  remove -ing and -ed for verbs
  remove the plural -s for nouns

There are a number of exceptions, e.g.
  -ing and -ed may belong to the stem, as in red or ring
  irregular verbs like go – went – gone, run – ran – run

Approaches for stemming:
  rule-based approach
  lexicon-based approach

Page 26:

Rules for Stemming in English

Kuhlen (1977) derived a rule set for stemming of most English words:

     Ending   Replacement   Condition
 1   ies      y
 2   XYes     XY            XY = Co, ch, sh, ss, zz or Xx
 3   XYs      XY            XY = XC, Xe, Vy, Vo, oa or ea
 4   ies'     y
 5   Xes'     X
 6   Xs'      X
 7   X's      X
 8   X'       X
 9   XYing    XY            XY = CC, XV, Xx
10   XYing    XYe           XY = VC
11   ied      y
12   XYed     XY            XY = CC, XV, Xx
13   XYed     XYe           XY = VC

X and Y are any letters, C stands for a consonant, V stands for any vowel.

Source: (Ferber 2003)
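As an illustration of the rule-based approach, here is a deliberately much simplified suffix-stripping sketch in Python; it is not the Kuhlen rule set above and ignores its conditions, which is exactly why it over-stems words like red or running:

    import re

    # much simplified illustration, NOT the Kuhlen (1977) rules above
    RULES = [
        (r"ies$", "y"),   # bodies -> body
        (r"ing$", ""),    # running -> runn (over-stemming: no CC/VC condition)
        (r"ed$", ""),     # registered -> register, but also red -> r
        (r"s$", ""),      # cars -> car
    ]

    def stem(word):
        for pattern, replacement in RULES:
            if re.search(pattern, word):
                return re.sub(pattern, replacement, word)
        return word

    print([stem(w) for w in ["cars", "bodies", "registered", "running", "red"]])
    # ['car', 'body', 'register', 'runn', 'r']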

Page 27:

Problems for Stemming

In English a small number of rules covers most of the words.

In German it is more difficult because for many words the stem changes as well:
  insertion of Umlauts, e.g. Haus – Häuser
  new prefixes, e.g. laufen – gelaufen
  separation/retaining of the prefix, e.g.
    mitbringen – er brachte den Brief mit
    überbringen – er überbrachte den Brief
  irregular insertion of linking elements („Fugen“) when building composita: Schwein-kram, Schwein-s-haxe, Schwein-e-braten

These are problems that cannot easily be dealt with by general rules operating only on strings.

Page 28:

Lexicon-based Approaches for Stemming

Basic idea: a lexicon contains the stems for word forms.

Complete lexicon: for each possible word form the stem is stored
  persons – person
  went – go
  running – run
  going – go
  ran – run
  gone – go

Word stem lexicon: for each stem, all the data necessary to derive all word forms are stored
  distinction of different flexion classes, specification of anomalies
  Example: To compute the stem of Flüssen, the last characters are removed successively and the Umlaut is exchanged until a valid stem is found (Lezius 1995):

  Fall/Endung   -          n          en         sen        ...
  normal        Flüssen-   Flüsse-n   Flüss-en   Flüs-sen   ...
  Umlaut        Flussen-   Flusse-n   Fluss-en   Flus-sen   ...

Source: (Ferber 2003)

Page 29:

Index with Stemming and Stop Word Elimination

Documents (decomposed):

D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
D2: more vehicles in this quarter more cars became registered in vienna
D3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

Index:

Term       Document IDs
accident   D1, D3
car        D1, D2
cause      D3
crowd      D3
die        D1
drive      D3
four       D3
heavy      D1
injur      D3
more       D2
morning    D1
people     D1, D3
quarter    D2
register   D2
truck      D3
trucker    D3
vehicle    D2
vienna     D1, D2, D3
yesterday  D1

Page 30:

2.2 Classical Information Retrieval Models

Classical models:
  Boolean model
  Vector space model
  Probabilistic model

Alternative models:
  user preferences
  associative search
  social filtering

Page 31:

Classic IR Models - Basic Concepts

Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents

The importance of the index terms is represented by weights associated with them.

Let ti be an index term, dj be a document, and wij a weight associated with (ti, dj).

The weight wij quantifies the importance of the index term for describing the document contents.

(Stop words can be regarded as terms where wij = 0 for every document.)

(Baeza-Yates & Ribeiro-Neto 1999)

Page 32:

Classic IR Models - Basic Concepts

ti is an index term

dj is a document

n is the total number of docs

T = (t1, t2, …, tk) is the set of all index terms

wij >= 0 is a weight associated with (ti,dj)

wij = 0 indicates that term does not belong to doc

vec(dj) = (w1j, w2j, …, wkj) is a weighted vector associated with the document dj

gi(vec(dj)) = wij is a function which returns the weight associated with pair (ti,dj)

fi is the number of documents containing term ti

source: teaching material of Ribeiro-Neto

Page 33:

Index vectors as Matrix

The vectors vec(dj) = (w1j, w2j, …, wkj) associated with the documents dj can be represented as a matrix:

      d1     d2     d3     d4
t1    w1,1   w1,2   w1,3   w1,4
t2    w2,1   w2,2   w2,3   w2,4
t3    w3,1   w3,2   w3,3   w3,4
...
tn    wn,1   wn,2   wn,3   wn,4

Each column represents a document vector vec(dj) = (w1j, w2j, …, wkj); the document dj contains the term ti if wij > 0.

Each row represents a term vector tvec(ti) = (wi1, wi2, …, win); the term ti is in document dj if wij > 0.

Page 34:

Boolean Document Vectors

            d1  d2  d3
accident    1   0   1
car         1   1   0
cause       0   0   1
crowd       0   0   1
die         1   0   0
drive       0   0   1
four        0   0   1
heavy       1   0   0
injur       0   0   1
more        0   1   0
morning     1   0   0
people      1   0   1
quarter     0   1   0
register    0   1   0
truck       0   0   1
trucker     0   0   1
vehicle     0   1   0
vienna      1   1   1
yesterday   1   0   0

d1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
d2: more vehicles in this quarter more cars became registered in vienna
d3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

Page 35:

2.2.1 The Boolean Model

Simple model based on set theory
  precise semantics
  neat formalism

Binary index: terms are either present or absent, thus wij ∈ {0,1}.

Queries are specified as Boolean expressions using the operators AND (∧), OR (∨) and NOT (¬), e.g.
  q = ta ∧ (tb ∨ tc)
  (vehicle OR car) AND accident

Page 36:

Boolean Retrieval Function

The retrieval function can be defined recursively:

R(ti, dj) = TRUE, if wij = 1 (i.e. ti is in dj)
R(ti, dj) = FALSE, if wij = 0 (i.e. ti is not in dj)

R(q1 AND q2, dj) = R(q1, dj) AND R(q2, dj)

R(q1 OR q2, dj) = R(q1, dj) OR R(q2, dj)

R(NOT q, dj) = NOT R(q, dj)

The Boolean function computes only the values 1 or 0, i.e. Boolean retrieval classifies documents into two categories: relevant (R = 1) and irrelevant (R = 0).
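A minimal Python sketch of this recursive evaluation; the binary index is an excerpt of the running example, while the representation of queries as nested tuples is an illustrative assumption:

    # binary index: term -> set of documents containing it (excerpt of the example)
    index = {
        "accident": {"d1", "d3"}, "car": {"d1", "d2"},
        "vehicle": {"d2"}, "vienna": {"d1", "d2", "d3"},
    }

    def R(query, d):
        """Boolean retrieval function; query is a term or a tuple
        ('AND', q1, q2), ('OR', q1, q2) or ('NOT', q1)."""
        if isinstance(query, str):
            return d in index.get(query, set())
        op, *args = query
        if op == "AND":
            return R(args[0], d) and R(args[1], d)
        if op == "OR":
            return R(args[0], d) or R(args[1], d)
        if op == "NOT":
            return not R(args[0], d)
        raise ValueError(op)

    q = ("AND", ("OR", "vehicle", "car"), "accident")   # (vehicle OR car) AND accident
    print({d: R(q, d) for d in ["d1", "d2", "d3"]})
    # {'d1': True, 'd2': False, 'd3': False}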

Page 37:

Example of Boolean Retrieval

Document vectors:

            d1  d2  d3
accident    1   0   1
car         1   1   0
cause       0   0   1
crowd       0   0   1
die         1   0   0
drive       0   0   1
four        0   0   1
heavy       1   0   0
injur       0   0   1
more        0   1   0
morning     1   0   0
people      1   0   1
quarter     0   1   0
register    0   1   0
truck       0   0   1
trucker     0   0   1
vehicle     0   1   0
vienna      1   1   1
yesterday   1   0   0

Query: (vehicle OR car) AND accident

R((vehicle OR car) AND accident, d1) = 1
R((vehicle OR car) AND accident, d2) = 0
R((vehicle OR car) AND accident, d3) = 0

Query: (vehicle AND car) OR accident

R((vehicle AND car) OR accident, d1) = 1
R((vehicle AND car) OR accident, d2) = 1
R((vehicle AND car) OR accident, d3) = 1

Page 38:

Drawbacks of the Boolean Model

Retrieval is based on a binary decision criterion
  no notion of partial matching
  no ranking of the documents is provided (absence of a grading scale)
  the query q = t1 OR t2 OR t3 is satisfied by documents containing one, two or three of the terms t1, t2, t3

No weighting of terms: wij ∈ {0,1}

The information need has to be translated into a Boolean expression, which most users find awkward.

The Boolean queries formulated by the users are most often too simplistic.

As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.

Page 39:

2.2.2 Vector Space Model

The index can be regarded as an n-dimensional space
  wij > 0 whenever ti ∈ dj
  each term corresponds to a dimension
  to each term ti a unitary vector vec(i) is associated
  the unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

A document can be regarded as
  a vector starting from (0,0,0)
  a point in space

Example:

            d1  d2
accident    4   3
car         3   2
vehicle     1   3

[Figure: d1 = (4,3,1) and d2 = (3,2,3) shown as vectors/points in the three-dimensional space spanned by the axes accident, car and vehicle.]

Page 40:

2.2.2.1 Coordinate Matching

Documents and query are represented as
  document vectors vec(dj) = (w1j, w2j, …, wkj)
  query vector vec(q) = (w1q, …, wkq)

The vectors have binary values:
  wij = 1 if term ti occurs in document dj
  wij = 0 otherwise

Ranking:
  return the documents containing at least one query term
  rank by the number of occurring query terms

Ranking function: the scalar product

  R(q, d) = q * d = Σ (i=1..n) qi * di

(multiply the components and sum them up)
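A minimal Python sketch of coordinate matching over the stemmed terms of the running example; the abbreviated per-document term sets are an assumption:

    def coordinate_match(query_terms, doc_terms):
        """Binary scalar product: number of query terms occurring in the document."""
        return sum(1 for t in set(query_terms) if t in doc_terms)

    docs = {
        "d1": {"heavy", "accident", "car", "people", "die", "vienna"},
        "d2": {"more", "vehicle", "quarter", "car", "register", "vienna"},
        "d3": {"truck", "cause", "accident", "vienna", "people", "injur"},
    }
    query = {"accident", "heavy", "vehicle", "vienna"}
    ranking = sorted(((coordinate_match(query, d), name) for name, d in docs.items()),
                     reverse=True)
    print(ranking)   # [(3, 'd1'), (2, 'd3'), (2, 'd2')]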

Page 41:

Coordinate Matching: Example

Query: accident heavy vehicles vienna

The query vector represents the terms of the query (cf. stemming):

            d1  d2  d3  q
accident    1   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       1   0   0   1
injur       0   0   1   0
more        0   1   0   0
morning     1   0   0   0
people      1   0   1   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Result:
  q * d1 = 3
  q * d2 = 2
  q * d3 = 2

Page 42:

Assessment of Coordinate Matching

Advantage compared to Boolean Model: Ranking

Three main drawbacks:
  the frequency of terms in the documents is not considered
  no weighting of terms
  larger documents are privileged

Page 43:

2.2.2.2 Term Weighting

The use of binary weights is too limiting.

Non-binary weights provide consideration for partial matches.

These term weights are used to compute a degree of similarity between a query and each document.

How to compute the weights wij and wiq?

A good weight must take into account two effects:
  quantification of intra-document content (similarity): the tf factor, the term frequency within a document
  quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)

(Baeza-Yates & Ribeiro-Neto 1999)

Page 44:

TF - Term Frequency

Let freq(i,j) be the raw frequency of term ti within document dj (i.e. the number of occurrences of term ti in document dj).

A simple tf factor can be computed as

  f(i,j) = freq(i,j)

A normalized tf factor is given by

  f(i,j) = freq(i,j) / max(freq(l,j))

where the maximum is computed over all terms l which occur within the document dj.

            d1  d2  d3  q
accident    2   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injur       0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

(Baeza-Yates & Ribeiro-Neto 1999)

For reasons of simplicity, in this example f(i,j) = freq(i,j).

Page 45:

IDF – Inverse Document Frequency

IDF can also be interpreted as the amount of information associated with the term ti . A term occurring in few documents is more useful as an index term than a term occurring in nearly every document

Let ni be the number of documents containing term ti, and N be the total number of documents.

A simple idf factor can be computed as idf(i) = 1/ni

A normalized idf factor is given by idf(i) = log (N/ni)

the log is used to make the values of tf and idf comparable.
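The tf and idf variants above can be written down directly; in this Python sketch the counts come from the example tables, while the function names are just illustrative:

    import math

    def tf(freq_ij, max_freq_j=None):
        """Raw tf, or normalized tf if the maximum frequency in the document is given."""
        return freq_ij if max_freq_j is None else freq_ij / max_freq_j

    def idf(n_i, N, normalized=True):
        """Simple idf = 1/ni, or normalized idf = log(N/ni)."""
        return math.log(N / n_i) if normalized else 1 / n_i

    N = 3                                  # total number of documents in the example
    print(idf(2, N, normalized=False))     # 'accident' occurs in 2 of 3 docs -> 0.5
    print(round(idf(3, N), 2))             # 'vienna' occurs in all 3 docs -> log(1) = 0.0
    print(tf(2, max_freq_j=2))             # 'accident' in d1: freq 2, max freq 2 -> 1.0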

Page 46:

Example with TF and IDF

In this example, a simple tf factor f(i,j) = freq(i,j) and a simple idf factor idf(i) = 1/ni are used.

It is advantageous to store IDF and TF separately.

            IDF    d1  d2  d3
accident    0.5     2   0   1
car         0.5     1   1   0
cause       1       0   0   1
crowd       1       0   0   1
die         1       1   0   0
drive       1       0   0   1
four        1       0   0   1
heavy       1       2   0   0
injur       1       0   0   1
more        1       0   2   0
morning     1       1   0   0
people      0.5     1   0   2
quarter     1       0   1   0
register    1       0   1   0
truck       1       0   0   1
trucker     1       0   0   1
vehicle     1       0   1   0
vienna      0.33    1   1   1
yesterday   1       1   0   0

Page 47:

Indexing a new Document

Changes to the index when adding a new document d:
  a new document vector with tf factors for d is created
  the idf factors for the terms occurring in d are adapted
  all other document vectors remain unchanged

Page 48:

Ranking

Scalar product: computes the co-occurrences of terms in document and query
  Drawback: the scalar product privileges large documents over small ones.

Euclidean distance between the endpoints of the vectors
  Drawback: the Euclidean distance privileges small documents over large ones.

Angle between the vectors
  the smaller the angle between query and document vector, the more similar they are
  the angle is independent of the size of the document
  the cosine is a good measure of the angle

[Figure: query vector q and document vector d in the plane spanned by the terms t1 and t2; the angle between them measures their similarity.]

Page 49:

Cosine Ranking Formula

The more the directions of query q and document dj coincide, the more relevant is dj.

The cosine formula takes into account the ratio of the terms, not their concrete number.

Let α be the angle between q and dj. Because all values wij >= 0, the angle is between 0° and 90°:
  the larger α, the smaller cos α
  the smaller α, the larger cos α
  cos 0° = 1, cos 90° = 0

cos(q, dj) = (q · dj) / (|q| · |dj|)

[Figure: query vector q and document vector dj in the plane spanned by t1 and t2, with the angle α between them.]
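A minimal Python sketch of the cosine ranking for the small three-term example from the vector space slide; the query vector used here is an assumption:

    import math

    def cosine(q, d):
        """Cosine of the angle between query vector q and document vector d."""
        dot = sum(qi * di for qi, di in zip(q, d))
        return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

    # term space (accident, car, vehicle); d1 = (4,3,1), d2 = (3,2,3) as on Page 39
    d1, d2 = [4, 3, 1], [3, 2, 3]
    q = [1, 0, 1]                         # assumed query: accident, vehicle
    print(round(cosine(q, d1), 2), round(cosine(q, d2), 2))   # 0.69 0.9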

Page 50:

The Vector Model

The best term-weighting schemes use weights which are given by

  wij = f(i,j) * log(N/ni)

This strategy is called a tf-idf weighting scheme.

For the query term weights, a suggestion is

  wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) * log(N/ni)

(Baeza-Yates & Ribeiro-Neto 1999)

Page 51:

The Vector Model

The vector model with tf-idf weights is a good ranking strategy for general collections.

The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.

Advantages:
  term weighting improves the quality of the answer set
  partial matching allows retrieval of documents that approximate the query conditions
  the cosine ranking formula sorts the documents according to their degree of similarity to the query

Disadvantages:
  assumes independence of index terms; it is not clear that this is bad, though

(Baeza-Yates & Ribeiro-Neto 1999)

Page 52:

2.2.3 Extensions of the Classical Models

Combination of
  Boolean model
  vector model
  indexing with and without preprocessing

Extended index with additional information like
  document format (.doc, .pdf, …)
  language

Using information about links in hypertext
  link structure
  anchor text

Page 53:

Boolean Operators in the Vector Model

Many search engines allow queries with Boolean operators

Retrieval:
  Boolean operators are used to select the relevant documents
  in the example, only documents containing „accident“ and either „vehicle“ or „car“ are considered relevant
  ranking of the relevant documents is based on the vector model
    tf-idf weighting
    cosine ranking formula

Example query: (vehicle OR car) AND accident

            d1  d2  d3  q
accident    2   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injur       0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Page 54:

Queries with Wild Cards in the Vector Model

Vector model based on an index without preprocessing
  the index contains all word forms occurring in the documents

Queries allow wildcards (masking and truncation), e.g.

  accident heavy vehicle* vienna

Principle of query answering:
  first, wildcards are expanded to all matching terms (here vehicle* matches „vehicles“)
  ranking according to the vector model

            d1  d2  d3  q
accident    2   0   1   1
car         1   0   0   0
cars        0   1   0   0
causes      0   0   1   0
crowd       0   0   1   0
died        1   0   0   0
drove       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injured     0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
registered  0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicles    0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Page 55:

Using Link Information in Hypertext

Ranking: the link structure is used to calculate a quality ranking for each web page
  PageRank®
  HITS – Hypertext Induced Topic Selection (authority and hub)
  Hilltop

Indexing: the text of a link (anchor text) is associated both with the page the link is on and with the page the link points to

Page 56:

The PageRank Calculation

PageRank was developed by Sergey Brin and Lawrence Page at Stanford University and published in 1998 1)

PageRank uses the link structure of web pages.

Original version of the PageRank calculation:

  PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

with
  PR(A) being the PageRank of page A,
  PR(Ti) being the PageRank of the pages Ti that contain a link to page A,
  C(Ti) being the number of links going out of page Ti,
  d being a damping factor with 0 <= d <= 1

1) S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Computer Networks and ISDN Systems, Vol. 30, 1998, pages 107-117. http://www-db.stanford.edu/~backrub/google.html or http://infolab.stanford.edu/pub/papers/google.pdf

Page 57:

The PageRank Calculation - Explanation

The PageRank of page A is recursively defined by the PageRanks of those pages which link to page A

The PageRank of a page Ti is always weighted by the number of outbound links C(Ti) on page Ti: This means that the more outbound links a page Ti has, the less will page A benefit from a link to it on page Ti.

The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.

Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1.

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Source: http://pr.efactory.de/e-pagerank-algorithm.shtml

Page 58:

Damping Factor and the Random Surfer Model

The PageRank algorithm and the damping factor are motivated by the model of a random surfer. The random surfer finds a page A
  by following a link from a page Ti to page A, or
  by random choice of a web page (e.g. typing the URL).

The probability that the random surfer clicks on a particular link is given by the number of links on that page: if a page Ti contains C(Ti) links, the probability for each link is 1/C(Ti).

The justification of the damping factor is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.

d is the probability for the random surfer not stopping to click on links – this is why the sum of PageRanks is multiplied by d.

(1-d) is the probability that the surfer jumps to another page at random after he stopped clicking links. Regardless of inbound links, the probability of the random surfer jumping to a page is always (1-d), so a page always has a minimum PageRank.

(According to Brin and Page d = 0.85 is a good value)

Source: http://pr.efactory.de/e-pagerank-algorithm.shtml

Page 59:

Calculation of the PageRank - Example

We regard a small web consisting of only three pages A, B and C with the link structure shown in the figure (A links to B and C, B links to C, C links to A).

To keep the calculation simple, d is set to 0.5.

These are the equations for the PageRank calculation:

  PR(A) = 0.5 + 0.5 PR(C)
  PR(B) = 0.5 + 0.5 (PR(A) / 2)
  PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Solving these equations, we get the following PageRank values for the single pages:

  PR(A) = 14/13 = 1.07692308
  PR(B) = 10/13 = 0.76923077
  PR(C) = 15/13 = 1.15384615

Source: http://pr.efactory.de/e-pagerank-algorithmus.shtml

Page 60:

Iterative Calculation of the PageRank - Example

Because of the size of the actual web, the Google search engine uses an approximative, iterative computation of the PageRank values:
  each page is assigned an initial starting value
  the PageRanks of all pages are then calculated in several computation cycles

Iteration   PR(A)         PR(B)         PR(C)
0           1             1             1
1           1             0.75          1.125
2           1.0625        0.765625      1.1484375
3           1.07421875    0.76855469    1.15283203
4           1.07641602    0.76910400    1.15365601
5           1.07682800    0.76920700    1.15381050
6           1.07690525    0.76922631    1.15383947
7           1.07691973    0.76922993    1.15384490
8           1.07692245    0.76923061    1.15384592
9           1.07692296    0.76923074    1.15384611
10          1.07692305    0.76923076    1.15384615
11          1.07692307    0.76923077    1.15384615
12          1.07692308    0.76923077    1.15384615

According to Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.

Source: http://pr.efactory.de/d-pagerank-algorithmus.shtml
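A minimal Python sketch of this iterative computation for the three-page example (d = 0.5, pages updated in place one after the other); the link structure A→B, A→C, B→C, C→A is reconstructed from the equations on the previous slide:

    def pagerank(links, d=0.5, iterations=12):
        """PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.
        links maps each page to the set of pages it links to."""
        pr = {page: 1.0 for page in links}          # initial starting value
        for _ in range(iterations):
            for page in pr:                          # update in place
                pr[page] = (1 - d) + d * sum(
                    pr[t] / len(links[t]) for t in links if page in links[t])
        return pr

    links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
    print({p: round(v, 4) for p, v in pagerank(links).items()})
    # {'A': 1.0769, 'B': 0.7692, 'C': 1.1538}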

Page 61:

Alternative Link Analysis Algorithms (I): HITS

Jon Kleinberg: Authoritative sources in a hyperlinked environment. In: Journal of the ACM, Vol. 46, No. 5, pp. 604-632, 1999, http://www.cs.cornell.edu/home/kleinber/auth.pdf

Hypertext-Induced Topic Selection (HITS) is a link analysis algorithm proposed by J. Kleinberg in 1999.

HITS rates web pages by their authority and hub values:

  The authority value estimates the value of the content of the page; a good authority is a page that is pointed to by many good hubs.

  The hub value estimates the value of its links to other pages; a good hub is a page that points to many good authorities (examples of hubs are good link collections).

Every page i is assigned a hub weight hi and an authority weight ai, which are computed iteratively from each other: ai is the sum of the hub weights of the pages linking to i, and hi is the sum of the authority weights of the pages that i links to.

Page 62:

Alternative Link Analysis Algorithms (II): Hilltop

The Hilltop algorithm 1) rates documents based on their incoming links from so-called expert pages.

Expert pages are defined as pages that are about a topic and have links to many non-affiliated pages on that topic.

Pages are defined as non-affiliated if they are from authors of non-affiliated organisations.

Websites which have backlinks from many of the best expert pages are authorities and are ranked high.

A good directory page is an example of an expert page (cp. hubs).

Determination of expert pages is a central point of the hilltop algorithm.

1) The Hilltop algorithm was developed by Bharat and Mihaila and published in 1999: Krishna Bharat, George A. Mihaila: Hilltop: A Search Engine based on Expert Documents. In 2003 Google bought the patent of the algorithm (see also http://pagerank.suchmaschinen-doktor.de/hilltop.html).

Page 63:

Anchor Text

The Google search engine uses the text of links twice:
  First, the text of a link is associated with the page that the link is on.
  In addition, it is associated with the page the link points to.

Advantages:
  Anchors provide an additional description of a web page – from a user's point of view.
  Documents without text can be indexed, such as images, programs, and databases.

Disadvantage: search results can be manipulated (cf. Google Bombing 1))

Example anchor text: „The polar bear Knut was born in the zoo of Berlin“

1) A Google bomb influences the ranking of the search engine. It is created if a large number of sites link to a page with anchor text that often makes humorous, political or defamatory statements. In the meantime, Google bombs have been defused by Google.

Page 64:

Natural Language Queries

Natural language queries are treated like any other query:
  stop word elimination
  stemming
  but no interpretation of the meaning of the query

i need information about accidents with cars and other vehicles

is equivalent to

information accident car vehicle

Page 65:

Searching Similar Documents

It is often difficult to express the information need as a query.

An alternative search method is to search for documents that are similar to a given document d.

Page 66:

Finding Similar Documents – Principle and Example

Principle: use a given document d as a query and compare all documents di with d.
  The approach is the same as for a query: same index, same ranking function.

Example: find the documents most similar to d1 (scalar product with IDF weights):

            IDF    d1  d2  d3
accident    0.5     2   0   1
car         0.5     1   1   0
cause       1       0   0   1
crowd       1       0   0   1
die         1       1   0   0
drive       1       0   0   1
four        1       0   0   1
heavy       1       2   0   0
injur       1       0   0   1
more        1       0   2   0
morning     1       1   0   0
people      0.5     1   0   2
quarter     1       0   1   0
register    1       0   1   0
truck       1       0   0   1
trucker     1       0   0   1
vehicle     1       0   1   0
vienna      0.33    1   1   1
yesterday   1       1   0   0

IDF * d1 * d2 = 0.83
IDF * d1 * d3 = 2.33

Page 67:

The Vector Space Model

The vector space model …
  … is relatively simple and clear,
  … is efficient,
  … ranks documents,
  … can be applied to any collection of documents.

The model has many heuristic components and parameters, e.g.
  determination of the index terms
  calculation of tf and idf
  ranking function

The best parameter setting depends on the document collection

Page 68:

2.3 Implementation of the Index

The vector space model is usually implemented with an inverted index.

For each term, a pointer references a „posting list“ with an entry for each document containing the term.

The posting lists can be implemented as
  linked lists, or
  more efficient data structures that reduce the storage requirements (index pruning).

To answer a query, the corresponding posting lists are retrieved and the documents are ranked, i.e. efficient retrieval of the posting lists is essential.

Source: D. Grossman, O. Frieder (2004) Information Retrieval, Springer-Verlag

Page 69:

Implementing the Term Structure as a Trie

Sequentially scanning the index for query terms/posting lists is inefficient.

A trie is a tree structure:
  each node is an array with one element for each character
  each element contains a link to another node

Example: structure of a node in a trie *)

*) The characters and their order are identical for each node; therefore they do not need to be stored explicitly.

Source: G. Saake, K.-U. Sattler: Algorithmen und Datenstrukturen – Eine Einführung mit Java. dpunkt Verlag 2004

Page 70:

The Index as a Trie

The leaves of the trie are the index terms, pointing to the corresponding posting lists.

Searching for a term in a trie:
  the search starts at the root
  subsequently, for each character of the term, the reference to the corresponding subtree is followed until
    a leaf with the term is found, or
    the search stops without success

(Saake, Sattler 2004)
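A small Python sketch of a trie holding index terms with their posting lists; using a dictionary per node instead of the fixed character array described above is a simplification:

    class TrieNode:
        def __init__(self):
            self.children = {}     # character -> TrieNode (the slide uses a fixed array instead)
            self.postings = None   # posting list if an index term ends here

    def insert(root, term, postings):
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.postings = postings

    def lookup(root, term):
        node = root
        for ch in term:
            if ch not in node.children:
                return None        # search stops without success
            node = node.children[ch]
        return node.postings

    root = TrieNode()
    insert(root, "car", ["D1", "D2"])
    insert(root, "cars", ["D2"])
    print(lookup(root, "cars"))    # ['D2']
    print(lookup(root, "truck"))   # None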

Page 71:

Patricia Trees

Idea: Skip irrelevant parts of terms

This is achieved by storing in each node the number of characters to be skipped.

Example:

(Saake, Sattler 2004)

Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric

Page 72:

2.4 Evaluating Search Methods

[Figure: Venn diagram over the set of all documents, showing the set of documents found, the relevant documents found, and the relevant documents that are not found.]

Page 73:

Performance Measures of Information Retrieval: Recall and Precision

Several different measures for evaluating the performance of information retrieval systems have been proposed; two important ones are recall and precision.

Let D be the set of all documents, DA the answer set, DR the set of relevant documents, and DRA the set of relevant documents in the answer set.

Recall: the fraction of the relevant documents that are successfully retrieved.

  R = |DRA| / |DR|

Precision: the fraction of the retrieved documents that are relevant to the user's information need.

  P = |DRA| / |DA|

Page 74:

F-Measure

The F-measure is the harmonic mean of precision and recall:

  F = 2 * P * R / (P + R)

In this version, precision and recall are equally weighted.

The more general version Fβ allows giving preference to recall or precision:

  Fβ = (1 + β²) * P * R / (β² * P + R)

F2 weights recall twice as much as precision.

F0.5 weights precision twice as much as recall.
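A small Python sketch of computing these measures for a single query; the example document sets are illustrative:

    def precision_recall_f(retrieved, relevant, beta=1.0):
        """Precision, recall and F-measure; retrieved/relevant are sets of document IDs."""
        hits = len(retrieved & relevant)                       # |DRA|
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f

    retrieved = {"D1", "D3", "D7"}     # answer set DA
    relevant = {"D3", "D4"}            # relevant documents DR
    p, r, f = precision_recall_f(retrieved, relevant)
    print(round(p, 2), round(r, 2), round(f, 2))   # 0.33 0.5 0.4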

Page 75:

Computing Recall and Precision Evaluation: Perform a predefined set of queries

The search engines delivers a ranked set of documents Use the first X documents of the result list as answer set Compute recall and precision for the frist X documents of the ranked

result list.

How do you know which documents are relevant?

1. A general reference collection of documents can be used. For example, TREC (Text REtrieval Conference) is an annual event where large test collections in different domains are used to measure and compare the performance of information retrieval systems

2. For companies it is more important to evaluate information retrieval systems using their own documents:
1. Collect a representative set of documents
2. Specify queries and the associated relevant documents
3. Evaluate search engines by computing recall and precision for the query results
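A possible sketch of this evaluation loop in Python; the toy search engine, the queries and the relevance judgments are placeholders, not real data.

# Run predefined queries, cut the ranked result list after X documents and
# compute recall and precision against given relevance judgments.

def evaluate(search_engine, judged_queries, x=10):
    results = {}
    for query, relevant in judged_queries.items():
        ranked = search_engine(query)            # ranked list of document IDs
        answer_set = set(ranked[:x])             # first X documents
        hit = answer_set & set(relevant)
        recall = len(hit) / len(relevant) if relevant else 0.0
        precision = len(hit) / len(answer_set) if answer_set else 0.0
        results[query] = (recall, precision)
    return results

# Hypothetical usage with a hard-coded toy "search engine":
def toy_engine(query):
    return {"accident vienna": ["d1", "d3", "d2"],
            "car registration": ["d3", "d2"]}.get(query, [])

judgments = {"accident vienna": ["d1", "d2"], "car registration": ["d3"]}
print(evaluate(toy_engine, judgments, x=2))
# {'accident vienna': (0.5, 0.5), 'car registration': (1.0, 0.5)}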

Page 76

2.5 User Adaptation

Take into account information about a user in order to filter documents that are particularly relevant to this user

Relevance Feedback: retrieval in multiple passes; in each pass the user refines the query based on the results of previous queries

Explicit User Profiles: subscriptions, user-specific weights of terms

Social Filtering: similar users get similar documents

Page 77

2.5.1 Relevance Feedback given by the User

The user specifies the relevance of each document. Example: for the query "Pisa", only the documents about the education assessment are regarded as relevant

In the next pass, the top-ranked documents are only about the education assessment

This example is from the SmartFinder system from empolis. The mindaccess system from Insiders GmbH uses the same technology.

Page 78

Relevance Feedback: Probabilistic Model

Assumption: Given a user query, there is an ideal answer set

Idea: An initial answer is iteratively improved based on user feedback

Approach:

An initial set of documents is retrieved somehow

The user inspects these documents looking for the relevant ones (usually, only the top 10-20 need to be inspected)

The IR system uses this information to refine the description of the ideal answer set

By repeating this process, it is expected that the description of the ideal answer set will improve

The description of the ideal answer set is modeled in probabilistic terms

(Baeza-Yates & Ribeiro-Neto 1999)

Page 79

Probabilistic Ranking

Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).

The model assumes that this probability of relevance depends on the query and the document representations only.

Probabilistic ranking is:

sim(dj, q) = P(R | d⃗j) / P(R̄ | d⃗j)

Definitions:

wij ∈ {0,1} (i.e. weights are binary)

sim(dj, q): the similarity of document dj to the query q

d⃗j: the document vector of dj

P(R | d⃗j): the probability that document dj is relevant

P(R̄ | d⃗j): the probability that document dj is not relevant

(Baeza-Yates & Ribeiro-Neto 1999)

Page 80

Computing Probabilistic Ranking

Probabilistic ranking can be computed as:

sim(dj, q) ≈ Σi wiq · wij · ( log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) )

where

P(ki|R) stands for the probability that the index term ki is present in a document randomly selected from the set R of relevant documents

1 − P(ki|R) accordingly stands for the probability that the index term ki is not present in a document randomly selected from the set R of relevant documents (R̄ denotes the set of non-relevant documents)

wiq is the weight of term ki in the query

wij is the weight of term ki in document dj

(Baeza-Yates & Ribeiro-Neto 1999)

Page 81

Relevance Feedback: Probabilistic Model

The probabilities that a term ki is (not) present in the set of relevant documents can be estimated as:

P(ki|R) ≈ Vi / V    P(ki|R̄) ≈ (ni − Vi) / (N − V)

(in practice a small smoothing adjustment, e.g. adding 0.5, is commonly used to avoid zero probabilities)

N total number of documents

ni number of documents containing term ki

V number of relevant documents retrieved by the probabilistic model

Vi number of relevant documents containing term ki

There are different ways to determine the set V of relevant documents:

Automatically: V can be specified as the top r documents found

By user feedback: the user specifies for each retrieved document whether it is relevant or not
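Putting the last two slides together, a Python sketch of probabilistic ranking with relevance-feedback estimates (binary weights wiq = wij = 1 for terms that occur). The +0.5 / +1 smoothing is the adjustment commonly used to avoid zero probabilities and is an assumption here, as are the toy data.

import math

def term_weight(n_i, v_i, n_docs, v):
    p_rel = (v_i + 0.5) / (v + 1)                     # estimate of P(ki|R)
    p_nonrel = (n_i - v_i + 0.5) / (n_docs - v + 1)   # estimate of P(ki|R-bar)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def rank(query_terms, documents, relevant_ids, doc_freq, n_docs):
    """documents: doc_id -> set of terms; relevant_ids: feedback set V."""
    v = len(relevant_ids)
    scores = {}
    for doc_id, terms in documents.items():
        score = 0.0
        for t in query_terms:
            if t in terms:                            # wiq = wij = 1
                n_i = doc_freq[t]                     # documents containing t
                v_i = sum(1 for d in relevant_ids if t in documents[d])
                score += term_weight(n_i, v_i, n_docs, v)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy collection: after the user marks d1 as relevant, terms of d1 are boosted
# and d1-like documents move to the top of the ranking.
docs = {"d1": {"accident", "heavy", "vienna"},
        "d2": {"truck", "accident", "vienna"},
        "d3": {"car", "vienna", "register"}}
freq = {"accident": 2, "heavy": 1, "vienna": 3, "truck": 1, "car": 1, "register": 1}
print(rank(["accident", "heavy", "vienna"], docs, {"d1"}, freq, n_docs=3))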

Page 82

2.5.2 Explicit User Profiles

Idea: Use knowledge about the user to provide information that is particularly relevant for him/her

users specify topics of interest as a set of terms
these terms represent the user profile
documents containing the terms of the user profile are preferred

[Figure: architecture of profile-based retrieval. Profile acquisition turns the user's information need/preferences into a user profile; document representation turns the documents into the index; a ranking function matches the user profile against the index.]

Page 83

User profiles for subscribing to information: user profiles are treated as queries

Example: news feed. As soon as a new document arrives, it is tested for similarity with the user profiles

Vector space model can be applied

A document is regarded relevant if the ranking reaches a specified threshold

Example: User 1 is interested in any car accident; User 2 is interested in deadly car accidents with trucks

term       IDF   d1  d2  d3  U1  U2
accident   0.5    2   0   1   1   1
car        0.5    1   1   0   1   0
cause      1      0   0   1   0   0
crowd      1      0   0   1   0   0
die        1      1   0   0   0   1
drive      1      0   0   1   0   0
four       1      0   0   1   0   0
heavy      1      2   0   0   0   0
injur      1      0   0   1   0   0
more       1      0   2   0   0   0
morning    1      1   0   0   0   0
people     0.5    1   0   2   0   0
quarter    1      0   1   0   0   0
register   1      0   1   0   0   0
truck      1      0   0   1   1   1
trucker    1      0   0   1   0   0
vehicle    1      0   1   0   1   0
vienna     0.33   1   1   1   0   0
yesterday  1      1   0   0   0   0
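A sketch of this subscription filtering in Python, using the profiles U1 and U2 from the table: a newly arriving document is scored against each profile with a simple IDF-weighted dot product and delivered to every user whose score reaches a threshold. The threshold of 1.0 is an arbitrary assumption.

# Profile-based filtering of incoming documents (simplified vector space model).

IDF = {"accident": 0.5, "car": 0.5, "die": 1.0, "heavy": 1.0,
       "truck": 1.0, "vehicle": 1.0, "vienna": 0.33}

profiles = {
    "user1": {"accident": 1, "car": 1, "truck": 1, "vehicle": 1},   # any car accident
    "user2": {"accident": 1, "die": 1, "truck": 1},                 # deadly truck accidents
}

def score(doc_terms, profile):
    return sum(IDF.get(t, 1.0) * tf * profile.get(t, 0)
               for t, tf in doc_terms.items())

def notify(doc_terms, threshold=1.0):
    return [u for u, p in profiles.items() if score(doc_terms, p) >= threshold]

# New document d1 ("heavy car accident ... died ... in Vienna"), term frequencies:
d1 = {"accident": 2, "car": 1, "die": 1, "heavy": 2, "morning": 1, "vienna": 1}
print(notify(d1))   # ['user1', 'user2'] - both profiles reach the threshold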

Page 84

User Profiles for Individual Queries

Users specify the importance of terms

User profiles are used as additional term weights

Different rankings for different users

Example: user profiles with term weights; the query q contains the terms accident, heavy, vehicle and vienna

term       IDF   d1  d2  d3   U1   U2   q
accident   0.5    2   0   1   1.0  1.0  1
car        0.5    1   1   0   0.8  0.2  0
cause      1      0   0   1   0    0    0
crowd      1      0   0   1   0    0    0
die        1      1   0   0   0    0.8  0
drive      1      0   0   1   0    0    0
four       1      0   0   1   0    0    0
heavy      1      2   0   0   0.2  0.6  1
injur      1      0   0   1   0    0    0
more       1      0   2   0   0    0    0
morning    1      1   0   0   0    0    0
people     0.5    1   0   2   0.5  0.8  0
quarter    1      0   1   0   0    0    0
register   1      0   1   0   0    0    0
truck      1      0   0   1   0.6  1.0  0
trucker    1      0   0   1   0    0.6  0
vehicle    1      0   1   0   1.0  0.1  1
vienna     0.33   1   1   1   0    0    1
yesterday  1      1   0   0   0    0    0

Ranking for user 1:
IDF * d1 * U1 * q = 1.4
IDF * d2 * U1 * q = 1.0
IDF * d3 * U1 * q = 0.5

Ranking for user 2:
IDF * d1 * U2 * q = 2.2
IDF * d2 * U2 * q = 0.1
IDF * d3 * U2 * q = 0.5
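The rankings above can be recomputed with a few lines of Python; only the query terms matter, since q is zero for all other terms.

# score(d, U) = sum over terms of IDF * document weight * profile weight * query weight

IDF = {"accident": 0.5, "heavy": 1.0, "vehicle": 1.0, "vienna": 0.33}
q   = {"accident": 1, "heavy": 1, "vehicle": 1, "vienna": 1}

docs = {   # weights of the query terms in d1, d2, d3 (taken from the table)
    "d1": {"accident": 2, "heavy": 2, "vehicle": 0, "vienna": 1},
    "d2": {"accident": 0, "heavy": 0, "vehicle": 1, "vienna": 1},
    "d3": {"accident": 1, "heavy": 0, "vehicle": 0, "vienna": 1},
}
profiles = {
    "U1": {"accident": 1.0, "heavy": 0.2, "vehicle": 1.0, "vienna": 0.0},
    "U2": {"accident": 1.0, "heavy": 0.6, "vehicle": 0.1, "vienna": 0.0},
}

def score(doc, profile):
    return sum(IDF[t] * doc[t] * profile[t] * q[t] for t in q)

for user, profile in profiles.items():
    print(user, {d: round(score(vec, profile), 2) for d, vec in docs.items()})
# U1 {'d1': 1.4, 'd2': 1.0, 'd3': 0.5}
# U2 {'d1': 2.2, 'd2': 0.1, 'd3': 0.5}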

Page 85

Acquisition and Maintenance of User Profiles

There are different ways to specify user profiles

Manual: the user specifies topics of interest (and weights) explicitly, e.g. by selecting predefined terms or a query. Problem: maintenance

User feedback: the user collects relevant documents; the terms in the selected documents are regarded as important. Problem: how to motivate the user to give feedback (a similar approach is used by spam filters for classification)

Heuristics: observing user behaviour. Example: if a user has had a document open for a long time, it is assumed that he/she has read it and therefore it might be relevant. Problem: heuristics might be wrong

Page 86

Social Filtering

Idea: Information is relevant if other users who showed similar behaviour regarded the information as relevant; relevance is specified by the users

User profiles are compared

Example: A simple variant can be found at Amazon: purchases of books and CDs are stored

„people who bought this book also bought …“
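A minimal sketch of the idea, assuming per-user purchase sets and a simple overlap similarity (all data invented): items bought by the most similar other user are recommended.

# User-based social (collaborative) filtering with Jaccard overlap.

purchases = {
    "alice": {"book_ir", "book_ai", "cd_jazz"},
    "bob":   {"book_ir", "book_ai", "book_db"},
    "carol": {"cd_pop"},
}

def similarity(u, v):
    a, b = purchases[u], purchases[v]
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user):
    # take the most similar other user and suggest what this user does not own yet
    others = [u for u in purchases if u != user]
    best = max(others, key=lambda u: similarity(user, u))
    return purchases[best] - purchases[user]

print(recommend("alice"))   # {'book_db'} - "people who bought this also bought ..."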