2 Information Retrieval

Prof. Dr. Knut Hinkelmann
Information Retrieval and Knowledge Organisation

Page 1:

2 Information Retrieval

Page 2:

Motivation

Information Retrieval has been a field of activity for many years.

It was long seen as an area of narrow interest.

The advent of the Web changed this perception:
  universal repository of knowledge
  free (low cost) universal access
  no central editorial board
  many problems though: IR is seen as the key to finding the solutions!

Page 3:

Motivation

Information Retrieval: representation, storage, organization of, and access to information items

Emphasis on the retrieval of information (not data)

Focus is on the user's information need.

Example of a user information need: Find all documents containing information about car accidents which happened in Vienna and in which people were injured.

The information need is expressed as a query

Page 4:

Generic Schema of an Information System

[Figure: generic schema of an information system. The information resources are mapped to a representation of the resources (index/meta-data); the user's information need is mapped to a representation of the information need (query); the two representations are compared for ranking.]

Information Retrieval systems do not search through the documents themselves but through the representation (also called index, meta-data or description).

source: (Ferber 2004)

Page 5:

Example

Information need: documents containing information about accidents with heavy vehicles in Vienna

Query: accident heavy vehicles vienna

D1: Heavy accident. Because of a heavy car accident 4 people died yesterday morning in Vienna.
D2: More vehicles. In this quarter more cars became registered in Vienna.
D3: Truck causes accident. In Vienna a trucker drove into a crowd of people. Four people were injured.

Expected result: document D3
but: not all terms of the query occur in that document, and the query terms „accident“ and „heavy“ also occur in D1

Page 6:

Retrieval System

Each document is represented by a set of representative keywords or index terms.

An index term is a document word useful for remembering the document's main themes.

[Figure: architecture of a retrieval system. Document resources are assigned IDs and stored together with their IDs; indexing extracts terms from the text and builds an index of weighted documents. At the interface, the user's information need is formulated as a query; query processing maps the query to terms, and retrieval (search) with ranking answers it with a sorted list of document IDs.]

The index is stored in an efficient data structure.

Queries are answered using the index.

With the ID, the document can be retrieved.

Page 7:

Indexing

Manual indexing – key words
  The user specifies key words that he/she assumes useful.
  Usually, key words are nouns, because nouns have meaning by themselves.
  There are two possibilities:
    1. the user can assign any terms
    2. the user can select from a predefined set of terms (→ controlled vocabulary)

Automatic indexing – full text search
  Search engines assume that all words are index terms (full text representation).
  The system generates index terms from the words occurring in the text.

Page 8:

Automatic Indexing: 1. Decompose a Document into Terms

Rules determine how texts are decomposed into terms by defining separators like punctuation marks, blanks or hyphens.

Additional preprocessing, e.g.:
  exclude specific strings (stop words, numbers)
  generate normal form:
    stemming
    substitute characters (e.g. upper case – lower case, Umlaut)

D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna

D2: more vehicles in this quarter more cars became registered in vienna

D3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

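The decomposition step can be pictured with a small Python sketch; the separator set, the stop word list and the lower-casing rule below are illustrative assumptions, not the exact rules used in the lecture:

    import re

    # illustrative choices, not the lecture's exact rule set
    STOP_WORDS = {"a", "of", "in", "this", "were", "because", "into", "the"}

    def decompose(text):
        """Split a text into terms at punctuation/blank separators,
        lower-case everything and drop stop words and pure numbers."""
        tokens = re.split(r"[\s.,;:!?()\-]+", text.lower())
        return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]

    print(decompose("Heavy accident: Because of a heavy car accident 4 people died yesterday morning in Vienna."))
    # ['heavy', 'accident', 'heavy', 'car', 'accident', 'people', 'died', 'yesterday', 'morning', 'vienna']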

Page 9:

Automatic Indexing: 2. Index Represented as an Inverted List

For each term: a list of the documents in which the term occurs.

Additional information can be stored with each document, like
  frequency of occurrence
  positions of occurrence

Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2
…

An inverted list is similar to the index in a book.
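A minimal Python sketch of how such an inverted list (including the term frequencies shown on the next slide) could be built; the document contents are abbreviated here:

    from collections import defaultdict

    def build_inverted_index(docs):
        """docs: document ID -> list of terms.
        Returns: term -> sorted list of (doc_id, frequency) postings."""
        index = defaultdict(dict)
        for doc_id, terms in docs.items():
            for term in terms:
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        return {term: sorted(postings.items()) for term, postings in index.items()}

    docs = {
        "D1": ["heavy", "accident", "heavy", "car", "accident", "people", "died", "vienna"],
        "D2": ["more", "vehicles", "quarter", "more", "cars", "became", "registered", "vienna"],
        "D3": ["truck", "causes", "accident", "vienna", "trucker", "people", "injured"],
    }
    index = build_inverted_index(docs)
    print(index["accident"])   # [('D1', 2), ('D3', 1)]
    print(index["vienna"])     # [('D1', 1), ('D2', 1), ('D3', 1)]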

Page 10:

Index as Inverted List with Frequency

term        (document, frequency)
a           (D1,1) (D3,2)
accident    (D1,2) (D3,1)
became      (D2,1)
because     (D1,1)
car         (D1,1)
cars        (D2,1)
died        (D1,1)
heavy       (D1,2)
in          (D1,1) (D2,1) (D3,1)
more        (D2,1)
of          (D1,1)
people      (D1,1) (D3,2)
quarter     (D2,1)
registered  (D2,1)
truck       (D3,1)
vehicles    (D2,1)
...

In this example the inverted list contains the document identifier and the frequency of the term in the document.

Page 11:

Problems of Information Retrieval

Word form: A word can occur in different forms, e.g. singular or plural.
Example: A query for „car“ should also find documents containing the word „cars“.

Meaning: A single term can have different meanings; on the other hand, the same meaning can be expressed using different terms.
Example: when searching for „car“, documents containing „vehicle“ should also be found.

Wording, phrases: The same issue can be expressed in various ways.
Example: searching for „motorcar“ should also find documents containing „motorized car“.

Page 12:

Word Forms

Flexion: conjugation and declension of a word
  car – cars
  run – ran – running

Derivations: words having the same stem
  form – format – formation

Compositions: statements such as
  information management – management of information

In German, compositions are written as single words, sometimes with a hyphen:
  Informationsmanagement
  Informations-Management

Page 13:

Word Meaning and Phrases

Synonyms
  record – file – dossier
  seldom – not often

Variants in spelling (e.g. BE vs. AE)
  organisation – organization
  night – nite

Abbreviations
  UN – United Nations

Polysemes: words with multiple meanings
  Bank

Dealing with words having the same or similar meaning

Page 14:

2.1 Dealing with Word Forms and Phrases

We distinguish two ways to deal with word forms and phrases

Indexing without preprocessing
  All occurring word forms are included in the index.
  Different word forms are unified at search time.
  → string operations

Indexing with preprocessing
  Word forms are unified during indexing.
  The index terms are normal forms of the occurring word forms.
  The index is largely independent of the concrete formulation of the text.
  → computer-linguistic approach

Page 15:

2.1.1 Indexing Without Preprocessing

Index: contains all the word forms occurring in the documents

Query: searching for specific word forms is possible (e.g. searching for „cars“ but not for „car“)

To search for different word forms, string operations can be applied:

Operators for truncation and masking, e.g.
  ?  covers exactly one character
  *  covers an arbitrary number of characters

Context operators, e.g.
  [n]  exact distance between terms
  <n>  maximal distance between terms

Page 16:

Index Without Preprocessing and Query

Query: vehicle? car? people

Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2
…

Page 17:

Truncation and Masking: Searching for Different Word Forms

Truncation: wildcards cover characters at the beginning or end of words (prefix or suffix)
  schreib* finds schreiben, schreibt, schreibst, schreibe, …
  ??schreiben finds anschreiben, beschreiben, but not verschreiben

Masking deals with characters inside words – in particular in German, declension and conjugation affect not only suffix and prefix
  schr??b* can find schreiben, schrieb
  h??s* can find Haus, Häuser

Disadvantage: with truncation and masking, not only the intended words are found
  schr??b* also finds schrauben
  h??s* also finds Hans, Hanse, hausen, hassen, and also words in other languages like horse
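One way such truncation and masking operators could be implemented is by translating them into regular expressions that are matched against the index terms; the following Python sketch is an assumption about the implementation, only the operator syntax is taken from the slides:

    import re

    def wildcard_to_regex(pattern):
        """'?' covers exactly one character, '*' an arbitrary number of characters."""
        body = "".join("." if ch == "?" else ".*" if ch == "*" else re.escape(ch)
                       for ch in pattern)
        return re.compile(rf"^{body}$")

    terms = ["schreiben", "schrieb", "schrauben", "haus", "häuser", "hans", "horse"]
    rx = wildcard_to_regex("schr??b*")
    print([t for t in terms if rx.match(t)])   # ['schreiben', 'schrieb', 'schrauben']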

Page 18:

Context Operators

Context operators allow searching for variations of text phrases

Exact word distance: Bezug [3] Telefonat
  finds: Bezug nehmend auf unser Telefonat

Maximal word distance: text <2> retrieval
  finds: text retrieval, text and fact retrieval

For context operators to be applicable, the positions of the words must be stored in the index

Page 19:

Indexing Without Preprocessing

Efficiency
  efficient indexing
  overhead at retrieval time to apply the string operators

Word forms
  the user has to codify all possible word forms and phrases in the query using truncation and masking operators
  no support is given by the search engine
  the retrieval engine is language independent

Phrases
  variants of text phrases can be coded using context operators

Page 20:

2.1.2 Preprocessing of the Index – Computer-Linguistic Approach

Each document is represented by a set of representative keywords or index terms

An index term is a document word useful for remembering the document’s main themes

The index contains standard forms of useful terms:
  1. Restrict the allowed terms
  2. Normalisation: map terms to a standard form

[Figure: in the example document, some words become index terms while others are not used for the index.]

Page 21:

Restricting allowed Index Terms

Objective:

increase efficiency and effectiveness by neglecting terms that do not contribute to the assessment of a document's relevance

There are two possibilities to restrict the allowed index terms:

1. Explicitly specify the allowed index terms → controlled vocabulary

2. Specify terms that are not allowed as index terms → stop words

Page 22:

Stop Words

Stop words are terms that are not stored in the index. Candidates for stop words are:

  words that occur very frequently
    A term occurring in every document is useless as an index term, because it does not tell anything about which document the user might be interested in.
    (In contrast, a word which occurs in only 0.001% of the documents is quite useful, because it narrows down the space of documents which might be of interest to the user.)

  words with no/little meaning

  terms that are not words (e.g. numbers)

Examples:
  General: articles, conjunctions, prepositions, auxiliary verbs (to be, to have) occur very often and in general have no meaning as a search criterion.
  Application-specific stop words are also possible.

Page 23:

Normalisation of Terms

There are various possibilities to compute standard forms:
  N-grams
  stemming: removing suffixes or prefixes

Page 24:

N-Grams

Index: sequences of characters of length N

Example: „persons“
  3-grams (N=3): per, ers, rso, son, ons
  4-grams (N=4): pers, erso, rson, sons

N-grams can also cross word boundaries.
Example: „persons from switzerland“
  3-grams (N=3): per, ers, rso, son, ons, ns_, s_f, _fr, fro, rom, om_, m_s, _sw, swi, wit, itz, tze, zer, erl, rla, lan, and
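A small Python sketch of N-gram generation; representing blanks with '_' so that the grams can cross word boundaries is an assumption about the notation:

    def ngrams(text, n):
        """Character n-grams of a text; spaces are kept (shown as '_')."""
        s = text.replace(" ", "_")
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    print(ngrams("persons", 3))    # ['per', 'ers', 'rso', 'son', 'ons']
    print(ngrams("persons", 4))    # ['pers', 'erso', 'rson', 'sons']
    print(ngrams("persons from switzerland", 3)[:9])
    # ['per', 'ers', 'rso', 'son', 'ons', 'ns_', 's_f', '_fr', 'fro']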

Page 25:

Stemming

Stemming: remove suffixes and prefixes to find a common stem, e.g.
  remove -ing and -ed for verbs
  remove the plural -s for nouns

There are a number of exceptions, e.g.
  -ing and -ed may belong to the stem, as in red or ring
  irregular verbs like go – went – gone, run – ran – run

Approaches for stemming:
  rule-based approach
  lexicon-based approach

Page 26:

Rules for Stemming in English

Kuhlen (1977) derived a rule set for stemming of most English words:

     Ending   Replacement   Condition
 1   ies      y
 2   XYes     XY            XY = Co, ch, sh, ss, zz or Xx
 3   XYs      XY            XY = XC, Xe, Vy, Vo, oa or ea
 4   ies'     y
 5   Xes'     X
 6   Xs'      X
 7   X's      X
 8   X'       X
 9   XYing    XY            XY = CC, XV, Xx
10   XYing    XYe           XY = VC
11   ied      y
12   XYed     XY            XY = CC, XV, Xx
13   XYed     XYe           XY = VC

X and Y are any letters, C stands for a consonant, V stands for any vowel.

Source: (Ferber 2003)
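As an illustration of the rule-based approach, here is a deliberately much simplified suffix-stripping sketch in Python; it is not the Kuhlen rule set above and ignores its conditions, which is exactly why it over-stems words like red or running:

    import re

    # much simplified illustration, NOT the Kuhlen (1977) rules above
    RULES = [
        (r"ies$", "y"),   # bodies -> body
        (r"ing$", ""),    # running -> runn (over-stemming: no CC/VC condition)
        (r"ed$", ""),     # registered -> register, but also red -> r
        (r"s$", ""),      # cars -> car
    ]

    def stem(word):
        for pattern, replacement in RULES:
            if re.search(pattern, word):
                return re.sub(pattern, replacement, word)
        return word

    print([stem(w) for w in ["cars", "bodies", "registered", "running", "red"]])
    # ['car', 'body', 'register', 'runn', 'r']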

Page 27:

Problems for Stemming

In English a small number of rules covers most of the words.

In German it is more difficult because for many words the stem changes as well:
  insertion of Umlauts, e.g. Haus – Häuser
  new prefixes, e.g. laufen – gelaufen
  separation/retaining of the prefix, e.g.
    mitbringen – er brachte den Brief mit
    überbringen – er überbrachte den Brief
  irregular insertion of linking elements („Fugen“) when building composita: Schwein-kram, Schwein-s-haxe, Schwein-e-braten

These are problems that cannot easily be dealt with by general rules operating only on strings.

Page 28:

Lexicon-based Approaches for Stemming

Basic idea: a lexicon contains the stems for word forms.

Complete lexicon: for each possible word form the stem is stored
  persons – person
  went – go
  running – run
  going – go
  ran – run
  gone – go

Word stem lexicon: for each stem, all the data necessary to derive all word forms are stored
  distinction of different flexion classes, specification of anomalies
  Example: To compute the stem of Flüssen, the last characters are removed successively and the Umlaut is exchanged until a valid stem is found (Lezius 1995):

  Fall/Endung   -          n          en         sen        ...
  normal        Flüssen-   Flüsse-n   Flüss-en   Flüs-sen   ...
  Umlaut        Flussen-   Flusse-n   Fluss-en   Flus-sen   ...

Source: (Ferber 2003)

Page 29:

Index with Stemming and Stop Word Elimination

Documents (decomposed):

D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
D2: more vehicles in this quarter more cars became registered in vienna
D3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

Index:

Term       Document IDs
accident   D1, D3
car        D1, D2
cause      D3
crowd      D3
die        D1
drive      D3
four       D3
heavy      D1
injur      D3
more       D2
morning    D1
people     D1, D3
quarter    D2
register   D2
truck      D3
trucker    D3
vehicle    D2
vienna     D1, D2, D3
yesterday  D1

Page 30:

2.2 Classical Information Retrieval Models

Classical models:
  Boolean model
  Vector space model
  Probabilistic model

Alternative models:
  user preferences
  associative search
  social filtering

Page 31:

Classic IR Models - Basic Concepts

Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents

The importance of the index terms is represented by weights associated with them.

Let ti be an index term, dj be a document, and wij a weight associated with (ti, dj).

The weight wij quantifies the importance of the index term for describing the document contents.

(Stop words can be regarded as terms where wij = 0 for every document.)

(Baeza-Yates & Ribeiro-Neto 1999)

Page 32:

Classic IR Models - Basic Concepts

ti is an index term

dj is a document

n is the total number of docs

T = (t1, t2, …, tk) is the set of all index terms

wij >= 0 is a weight associated with (ti,dj)

wij = 0 indicates that term does not belong to doc

vec(dj) = (w1j, w2j, …, wkj) is a weighted vector associated with the document dj

gi(vec(dj)) = wij is a function which returns the weight associated with pair (ti,dj)

fi is the number of documents containing term ti

source: teaching material of Ribeiro-Neto

Page 33:

Index vectors as Matrix

The vectors vec(dj) = (w1j, w2j, …, wkj) associated with the documents dj can be represented as a matrix:

      d1     d2     d3     d4
t1    w1,1   w1,2   w1,3   w1,4
t2    w2,1   w2,2   w2,3   w2,4
t3    w3,1   w3,2   w3,3   w3,4
...
tn    wn,1   wn,2   wn,3   wn,4

Each column represents a document vector vec(dj) = (w1j, w2j, …, wkj); the document dj contains the term ti if wij > 0.

Each row represents a term vector tvec(ti) = (wi1, wi2, …, win); the term ti is in document dj if wij > 0.

Page 34:

Boolean Document Vectors

            d1  d2  d3
accident    1   0   1
car         1   1   0
cause       0   0   1
crowd       0   0   1
die         1   0   0
drive       0   0   1
four        0   0   1
heavy       1   0   0
injur       0   0   1
more        0   1   0
morning     1   0   0
people      1   0   1
quarter     0   1   0
register    0   1   0
truck       0   0   1
trucker     0   0   1
vehicle     0   1   0
vienna      1   1   1
yesterday   1   0   0

d1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
d2: more vehicles in this quarter more cars became registered in vienna
d3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured

Page 35:

2.2.1 The Boolean Model

Simple model based on set theory
  precise semantics
  neat formalism

Binary index: terms are either present or absent, thus wij ∈ {0,1}.

Queries are specified as Boolean expressions using the operators AND (∧), OR (∨) and NOT (¬), e.g.
  q = ta ∧ (tb ∨ tc)
  (vehicle OR car) AND accident

Page 36:

Boolean Retrieval Function

The retrieval function can be defined recursively:

R(ti, dj) = TRUE, if wij = 1 (i.e. ti is in dj)
R(ti, dj) = FALSE, if wij = 0 (i.e. ti is not in dj)

R(q1 AND q2, dj) = R(q1, dj) AND R(q2, dj)

R(q1 OR q2, dj) = R(q1, dj) OR R(q2, dj)

R(NOT q, dj) = NOT R(q, dj)

The Boolean function computes only the values 1 or 0, i.e. Boolean retrieval classifies documents into two categories: relevant (R = 1) and irrelevant (R = 0).
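A minimal Python sketch of this recursive evaluation; the binary index is an excerpt of the running example, while the representation of queries as nested tuples is an illustrative assumption:

    # binary index: term -> set of documents containing it (excerpt of the example)
    index = {
        "accident": {"d1", "d3"}, "car": {"d1", "d2"},
        "vehicle": {"d2"}, "vienna": {"d1", "d2", "d3"},
    }

    def R(query, d):
        """Boolean retrieval function; query is a term or a tuple
        ('AND', q1, q2), ('OR', q1, q2) or ('NOT', q1)."""
        if isinstance(query, str):
            return d in index.get(query, set())
        op, *args = query
        if op == "AND":
            return R(args[0], d) and R(args[1], d)
        if op == "OR":
            return R(args[0], d) or R(args[1], d)
        if op == "NOT":
            return not R(args[0], d)
        raise ValueError(op)

    q = ("AND", ("OR", "vehicle", "car"), "accident")   # (vehicle OR car) AND accident
    print({d: R(q, d) for d in ["d1", "d2", "d3"]})
    # {'d1': True, 'd2': False, 'd3': False}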

Page 37:

Example of Boolean Retrieval

Document vectors:

            d1  d2  d3
accident    1   0   1
car         1   1   0
cause       0   0   1
crowd       0   0   1
die         1   0   0
drive       0   0   1
four        0   0   1
heavy       1   0   0
injur       0   0   1
more        0   1   0
morning     1   0   0
people      1   0   1
quarter     0   1   0
register    0   1   0
truck       0   0   1
trucker     0   0   1
vehicle     0   1   0
vienna      1   1   1
yesterday   1   0   0

Query: (vehicle OR car) AND accident

R((vehicle OR car) AND accident, d1) = 1
R((vehicle OR car) AND accident, d2) = 0
R((vehicle OR car) AND accident, d3) = 0

Query: (vehicle AND car) OR accident

R((vehicle AND car) OR accident, d1) = 1
R((vehicle AND car) OR accident, d2) = 1
R((vehicle AND car) OR accident, d3) = 1

Page 38:

Drawbacks of the Boolean Model

Retrieval is based on a binary decision criterion
  no notion of partial matching
  no ranking of the documents is provided (absence of a grading scale)
  the query q = t1 OR t2 OR t3 is satisfied by documents containing one, two or three of the terms t1, t2, t3

No weighting of terms: wij ∈ {0,1}

The information need has to be translated into a Boolean expression, which most users find awkward.

The Boolean queries formulated by the users are most often too simplistic.

As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.

Page 39:

2.2.2 Vector Space Model

The index can be regarded as an n-dimensional space
  wij > 0 whenever ti ∈ dj
  each term corresponds to a dimension
  to each term ti a unitary vector vec(i) is associated
  the unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

A document can be regarded as
  a vector starting from (0,0,0)
  a point in space

Example:

            d1  d2
accident    4   3
car         3   2
vehicle     1   3

[Figure: d1 = (4,3,1) and d2 = (3,2,3) shown as vectors/points in the three-dimensional space spanned by the axes accident, car and vehicle.]

Page 40:

2.2.2.1 Coordinate Matching

Documents and query are represented as
  document vectors vec(dj) = (w1j, w2j, …, wkj)
  query vector vec(q) = (w1q, …, wkq)

The vectors have binary values:
  wij = 1 if term ti occurs in document dj
  wij = 0 otherwise

Ranking:
  return the documents containing at least one query term
  rank by the number of occurring query terms

Ranking function: the scalar product

  R(q, d) = q * d = Σ (i=1..n) qi * di

(multiply the components and sum them up)
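A minimal Python sketch of coordinate matching over the stemmed terms of the running example; the abbreviated per-document term sets are an assumption:

    def coordinate_match(query_terms, doc_terms):
        """Binary scalar product: number of query terms occurring in the document."""
        return sum(1 for t in set(query_terms) if t in doc_terms)

    docs = {
        "d1": {"heavy", "accident", "car", "people", "die", "vienna"},
        "d2": {"more", "vehicle", "quarter", "car", "register", "vienna"},
        "d3": {"truck", "cause", "accident", "vienna", "people", "injur"},
    }
    query = {"accident", "heavy", "vehicle", "vienna"}
    ranking = sorted(((coordinate_match(query, d), name) for name, d in docs.items()),
                     reverse=True)
    print(ranking)   # [(3, 'd1'), (2, 'd3'), (2, 'd2')]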

Page 41:

Coordinate Matching: Example

Query: accident heavy vehicles vienna

The query vector represents the terms of the query (cf. stemming):

            d1  d2  d3  q
accident    1   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       1   0   0   1
injur       0   0   1   0
more        0   1   0   0
morning     1   0   0   0
people      1   0   1   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Result:
  q * d1 = 3
  q * d2 = 2
  q * d3 = 2

Page 42:

Assessment of Coordinate Matching

Advantage compared to Boolean Model: Ranking

Three main drawbacks:
  the frequency of terms in the documents is not considered
  no weighting of terms
  larger documents are privileged

Page 43:

2.2.2.2 Term Weighting

The use of binary weights is too limiting.

Non-binary weights provide consideration for partial matches.

These term weights are used to compute a degree of similarity between a query and each document.

How to compute the weights wij and wiq?

A good weight must take into account two effects:
  quantification of intra-document content (similarity): the tf factor, the term frequency within a document
  quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)

(Baeza-Yates & Ribeiro-Neto 1999)

Page 44:

TF - Term Frequency

Let freq(i,j) be the raw frequency of term ti within document dj (i.e. the number of occurrences of term ti in document dj).

A simple tf factor can be computed as

  f(i,j) = freq(i,j)

A normalized tf factor is given by

  f(i,j) = freq(i,j) / max(freq(l,j))

where the maximum is computed over all terms l which occur within the document dj.

            d1  d2  d3  q
accident    2   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injur       0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

(Baeza-Yates & Ribeiro-Neto 1999)

For reasons of simplicity, in this example f(i,j) = freq(i,j).

Page 45:

IDF – Inverse Document Frequency

IDF can also be interpreted as the amount of information associated with the term ti . A term occurring in few documents is more useful as an index term than a term occurring in nearly every document

Let ni be the number of documents containing term ti, and N be the total number of documents.

A simple idf factor can be computed as idf(i) = 1/ni

A normalized idf factor is given by idf(i) = log (N/ni)

the log is used to make the values of tf and idf comparable.
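The tf and idf variants above can be written down directly; in this Python sketch the counts come from the example tables, while the function names are just illustrative:

    import math

    def tf(freq_ij, max_freq_j=None):
        """Raw tf, or normalized tf if the maximum frequency in the document is given."""
        return freq_ij if max_freq_j is None else freq_ij / max_freq_j

    def idf(n_i, N, normalized=True):
        """Simple idf = 1/ni, or normalized idf = log(N/ni)."""
        return math.log(N / n_i) if normalized else 1 / n_i

    N = 3                                  # total number of documents in the example
    print(idf(2, N, normalized=False))     # 'accident' occurs in 2 of 3 docs -> 0.5
    print(round(idf(3, N), 2))             # 'vienna' occurs in all 3 docs -> log(1) = 0.0
    print(tf(2, max_freq_j=2))             # 'accident' in d1: freq 2, max freq 2 -> 1.0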

Page 46:

Example with TF and IDF

In this example, a simple tf factor f(i,j) = freq(i,j) and a simple idf factor idf(i) = 1/ni are used.

It is advantageous to store IDF and TF separately.

            IDF    d1  d2  d3
accident    0.5     2   0   1
car         0.5     1   1   0
cause       1       0   0   1
crowd       1       0   0   1
die         1       1   0   0
drive       1       0   0   1
four        1       0   0   1
heavy       1       2   0   0
injur       1       0   0   1
more        1       0   2   0
morning     1       1   0   0
people      0.5     1   0   2
quarter     1       0   1   0
register    1       0   1   0
truck       1       0   0   1
trucker     1       0   0   1
vehicle     1       0   1   0
vienna      0.33    1   1   1
yesterday   1       1   0   0

Page 47:

Indexing a new Document

Changes to the index when adding a new document d:
  a new document vector with tf factors for d is created
  the idf factors for the terms occurring in d are adapted
  all other document vectors remain unchanged

Page 48:

Ranking

Scalar product: computes the co-occurrences of terms in document and query
  Drawback: the scalar product privileges large documents over small ones.

Euclidean distance between the endpoints of the vectors
  Drawback: the Euclidean distance privileges small documents over large ones.

Angle between the vectors
  the smaller the angle between query and document vector, the more similar they are
  the angle is independent of the size of the document
  the cosine is a good measure of the angle

[Figure: query vector q and document vector d in the plane spanned by the terms t1 and t2; the angle between them measures their similarity.]

Page 49:

Cosine Ranking Formula

The more the directions of query q and document dj coincide, the more relevant is dj.

The cosine formula takes into account the ratio of the terms, not their concrete number.

Let α be the angle between q and dj. Because all values wij >= 0, the angle is between 0° and 90°:
  the larger α, the smaller cos α
  the smaller α, the larger cos α
  cos 0° = 1, cos 90° = 0

cos(q, dj) = (q · dj) / (|q| · |dj|)

[Figure: query vector q and document vector dj in the plane spanned by t1 and t2, with the angle α between them.]
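A minimal Python sketch of the cosine ranking for the small three-term example from the vector space slide; the query vector used here is an assumption:

    import math

    def cosine(q, d):
        """Cosine of the angle between query vector q and document vector d."""
        dot = sum(qi * di for qi, di in zip(q, d))
        return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

    # term space (accident, car, vehicle); d1 = (4,3,1), d2 = (3,2,3) as on Page 39
    d1, d2 = [4, 3, 1], [3, 2, 3]
    q = [1, 0, 1]                         # assumed query: accident, vehicle
    print(round(cosine(q, d1), 2), round(cosine(q, d2), 2))   # 0.69 0.9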

Page 50:

The Vector Model

The best term-weighting schemes use weights which are given by

  wij = f(i,j) * log(N/ni)

This strategy is called a tf-idf weighting scheme.

For the query term weights, a suggestion is

  wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) * log(N/ni)

(Baeza-Yates & Ribeiro-Neto 1999)

Page 51:

The Vector Model

The vector model with tf-idf weights is a good ranking strategy for general collections.

The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.

Advantages:
  term weighting improves the quality of the answer set
  partial matching allows retrieval of documents that approximate the query conditions
  the cosine ranking formula sorts the documents according to their degree of similarity to the query

Disadvantages:
  assumes independence of index terms; it is not clear that this is bad, though

(Baeza-Yates & Ribeiro-Neto 1999)

Page 52:

2.2.3 Extensions of the Classical Models

Combination of
  Boolean model
  vector model
  indexing with and without preprocessing

Extended index with additional information like
  document format (.doc, .pdf, …)
  language

Using information about links in hypertext
  link structure
  anchor text

Page 53:

Boolean Operators in the Vector Model

Many search engines allow queries with Boolean operators

Retrieval:
  Boolean operators are used to select the relevant documents
  in the example, only documents containing „accident“ and either „vehicle“ or „car“ are considered relevant
  ranking of the relevant documents is based on the vector model
    tf-idf weighting
    cosine ranking formula

Example query: (vehicle OR car) AND accident

            d1  d2  d3  q
accident    2   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injur       0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Page 54:

Queries with Wild Cards in the Vector Model

Vector model based on an index without preprocessing
  the index contains all word forms occurring in the documents

Queries allow wildcards (masking and truncation), e.g.

  accident heavy vehicle* vienna

Principle of query answering:
  first, wildcards are expanded to all matching terms (here vehicle* matches „vehicles“)
  ranking according to the vector model

            d1  d2  d3  q
accident    2   0   1   1
car         1   0   0   0
cars        0   1   0   0
causes      0   0   1   0
crowd       0   0   1   0
died        1   0   0   0
drove       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injured     0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
registered  0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicles    0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Page 55:

Using Link Information in Hypertext

Ranking: the link structure is used to calculate a quality ranking for each web page
  PageRank®
  HITS – Hypertext Induced Topic Selection (authority and hub)
  Hilltop

Indexing: the text of a link (anchor text) is associated both with the page the link is on and with the page the link points to

Page 56:

The PageRank Calculation

PageRank was developed by Sergey Brin and Lawrence Page at Stanford University and published in 1998 1)

PageRank uses the link structure of web pages.

Original version of the PageRank calculation:

  PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

with
  PR(A) being the PageRank of page A,
  PR(Ti) being the PageRank of the pages Ti that contain a link to page A,
  C(Ti) being the number of links going out of page Ti,
  d being a damping factor with 0 <= d <= 1

1) S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Computer Networks and ISDN Systems, Vol. 30, 1998, pages 107-117. http://www-db.stanford.edu/~backrub/google.html or http://infolab.stanford.edu/pub/papers/google.pdf

Page 57:

The PageRank Calculation - Explanation

The PageRank of page A is recursively defined by the PageRanks of those pages which link to page A

The PageRank of a page Ti is always weighted by the number of outbound links C(Ti) on page Ti: This means that the more outbound links a page Ti has, the less will page A benefit from a link to it on page Ti.

The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.

Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1.

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Source: http://pr.efactory.de/e-pagerank-algorithm.shtml

Page 58:

Damping Factor and the Random Surfer Model

The PageRank algorithm and the damping factor are motivated by the model of a random surfer. The random surfer finds a page A
  by following a link from a page Ti to page A, or
  by random choice of a web page (e.g. typing the URL).

The probability that the random surfer clicks on a particular link is given by the number of links on that page: if a page Ti contains C(Ti) links, the probability for each link is 1/C(Ti).

The justification of the damping factor is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.

d is the probability for the random surfer not stopping to click on links – this is why the sum of PageRanks is multiplied by d.

(1-d) is the probability that the surfer jumps to another page at random after he stopped clicking links. Regardless of inbound links, the probability of the random surfer jumping to a page is always (1-d), so a page always has a minimum PageRank.

(According to Brin and Page d = 0.85 is a good value)

Source: http://pr.efactory.de/e-pagerank-algorithm.shtml

Page 59:

Calculation of the PageRank - Example

We regard a small web consisting of only three pages A, B and C with the link structure shown in the figure (A links to B and C, B links to C, C links to A).

To keep the calculation simple, d is set to 0.5.

These are the equations for the PageRank calculation:

  PR(A) = 0.5 + 0.5 PR(C)
  PR(B) = 0.5 + 0.5 (PR(A) / 2)
  PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Solving these equations, we get the following PageRank values for the single pages:

  PR(A) = 14/13 = 1.07692308
  PR(B) = 10/13 = 0.76923077
  PR(C) = 15/13 = 1.15384615

Source: http://pr.efactory.de/e-pagerank-algorithmus.shtml

Page 60:

Iterative Calculation of the PageRank - Example

Because of the size of the actual web, the Google search engine uses an approximative, iterative computation of the PageRank values:
  each page is assigned an initial starting value
  the PageRanks of all pages are then calculated in several computation cycles

Iteration   PR(A)         PR(B)         PR(C)
0           1             1             1
1           1             0.75          1.125
2           1.0625        0.765625      1.1484375
3           1.07421875    0.76855469    1.15283203
4           1.07641602    0.76910400    1.15365601
5           1.07682800    0.76920700    1.15381050
6           1.07690525    0.76922631    1.15383947
7           1.07691973    0.76922993    1.15384490
8           1.07692245    0.76923061    1.15384592
9           1.07692296    0.76923074    1.15384611
10          1.07692305    0.76923076    1.15384615
11          1.07692307    0.76923077    1.15384615
12          1.07692308    0.76923077    1.15384615

According to Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.

Source: http://pr.efactory.de/d-pagerank-algorithmus.shtml
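A minimal Python sketch of this iterative computation for the three-page example (d = 0.5, pages updated in place one after the other); the link structure A→B, A→C, B→C, C→A is reconstructed from the equations on the previous slide:

    def pagerank(links, d=0.5, iterations=12):
        """PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.
        links maps each page to the set of pages it links to."""
        pr = {page: 1.0 for page in links}          # initial starting value
        for _ in range(iterations):
            for page in pr:                          # update in place
                pr[page] = (1 - d) + d * sum(
                    pr[t] / len(links[t]) for t in links if page in links[t])
        return pr

    links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
    print({p: round(v, 4) for p, v in pagerank(links).items()})
    # {'A': 1.0769, 'B': 0.7692, 'C': 1.1538}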

Page 61:

Alternative Link Analysis Algorithms (I): HITS

Jon Kleinberg: Authoritative sources in a hyperlinked environment. In: Journal of the ACM, Vol. 46, No. 5, pp. 604-632, 1999, http://www.cs.cornell.edu/home/kleinber/auth.pdf

Hypertext-Induced Topic Selection (HITS) is a link analysis algorithm proposed by J. Kleinberg in 1999.

HITS rates web pages by their authority and hub values:

  The authority value estimates the value of the content of the page; a good authority is a page that is pointed to by many good hubs.

  The hub value estimates the value of its links to other pages; a good hub is a page that points to many good authorities (examples of hubs are good link collections).

Every page i is assigned a hub weight hi and an authority weight ai, which are computed iteratively from each other: ai is the sum of the hub weights of the pages linking to i, and hi is the sum of the authority weights of the pages that i links to.

Page 62:

Alternative Link Analysis Algorithms (II): Hilltop

The Hilltop algorithm 1) rates documents based on their incoming links from so-called expert pages.

Expert pages are defined as pages that are about a topic and have links to many non-affiliated pages on that topic.

Pages are defined as non-affiliated if they are from authors of non-affiliated organisations.

Websites which have backlinks from many of the best expert pages are authorities and are ranked high.

A good directory page is an example of an expert page (cp. hubs).

Determination of expert pages is a central point of the hilltop algorithm.

1) The Hilltop algorithm was developed by Bharat and Mihaila and published in 1999: Krishna Bharat, George A. Mihaila: Hilltop: A Search Engine based on Expert Documents. In 2003 Google bought the patent of the algorithm (see also http://pagerank.suchmaschinen-doktor.de/hilltop.html).

Page 63:

Anchor Text

The Google search engine uses the text of links twice:
  First, the text of a link is associated with the page that the link is on.
  In addition, it is associated with the page the link points to.

Advantages:
  Anchors provide an additional description of a web page – from a user's point of view.
  Documents without text can be indexed, such as images, programs, and databases.

Disadvantage: search results can be manipulated (cf. Google Bombing 1))

Example anchor text: „The polar bear Knut was born in the zoo of Berlin“

1) A Google bomb influences the ranking of the search engine. It is created if a large number of sites link to a page with anchor text that often makes humorous, political or defamatory statements. In the meantime, Google bombs have been defused by Google.

Page 64:

Natural Language Queries

Natural language queries are treated like any other query:
  stop word elimination
  stemming
  but no interpretation of the meaning of the query

i need information about accidents with cars and other vehicles

is equivalent to

information accident car vehicle

Page 65:

Searching Similar Documents

It is often difficult to express the information need as a query.

An alternative search method is to search for documents that are similar to a given document d.

Page 66:

Finding Similar Documents – Principle and Example

Principle: use a given document d as a query and compare all documents di with d.
  The approach is the same as for a query: same index, same ranking function.

Example: find the documents most similar to d1 (scalar product with IDF weights):

            IDF    d1  d2  d3
accident    0.5     2   0   1
car         0.5     1   1   0
cause       1       0   0   1
crowd       1       0   0   1
die         1       1   0   0
drive       1       0   0   1
four        1       0   0   1
heavy       1       2   0   0
injur       1       0   0   1
more        1       0   2   0
morning     1       1   0   0
people      0.5     1   0   2
quarter     1       0   1   0
register    1       0   1   0
truck       1       0   0   1
trucker     1       0   0   1
vehicle     1       0   1   0
vienna      0.33    1   1   1
yesterday   1       1   0   0

IDF * d1 * d2 = 0.83
IDF * d1 * d3 = 2.33

Page 67:

The Vector Space Model

The vector space model …
  … is relatively simple and clear,
  … is efficient,
  … ranks documents,
  … can be applied to any collection of documents.

The model has many heuristic components and parameters, e.g.
  determination of the index terms
  calculation of tf and idf
  ranking function

The best parameter setting depends on the document collection

Page 68:

2.3 Implementation of the Index

The vector space model is usually implemented with an inverted index.

For each term, a pointer references a „posting list“ with an entry for each document containing the term.

The posting lists can be implemented as
  linked lists, or
  more efficient data structures that reduce the storage requirements (index pruning).

To answer a query, the corresponding posting lists are retrieved and the documents are ranked, i.e. efficient retrieval of the posting lists is essential.

Source: D. Grossman, O. Frieder (2004) Information Retrieval, Springer-Verlag

Page 69:

Implementing the Term Structure as a Trie

Sequentially scanning the index for query terms/posting lists is inefficient.

A trie is a tree structure:
  each node is an array with one element for each character
  each element contains a link to another node

Example: structure of a node in a trie *)

*) The characters and their order are identical for each node; therefore they do not need to be stored explicitly.

Source: G. Saake, K.-U. Sattler: Algorithmen und Datenstrukturen – Eine Einführung mit Java. dpunkt Verlag 2004

Page 70:

The Index as a Trie

The leaves of the trie are the index terms, pointing to the corresponding posting lists.

Searching for a term in a trie:
  the search starts at the root
  subsequently, for each character of the term, the reference to the corresponding subtree is followed until
    a leaf with the term is found, or
    the search stops without success

(Saake, Sattler 2004)
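A small Python sketch of a trie holding index terms with their posting lists; using a dictionary per node instead of the fixed character array described above is a simplification:

    class TrieNode:
        def __init__(self):
            self.children = {}     # character -> TrieNode (the slide uses a fixed array instead)
            self.postings = None   # posting list if an index term ends here

    def insert(root, term, postings):
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.postings = postings

    def lookup(root, term):
        node = root
        for ch in term:
            if ch not in node.children:
                return None        # search stops without success
            node = node.children[ch]
        return node.postings

    root = TrieNode()
    insert(root, "car", ["D1", "D2"])
    insert(root, "cars", ["D2"])
    print(lookup(root, "cars"))    # ['D2']
    print(lookup(root, "truck"))   # None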

Page 71:

Patricia Trees

Idea: Skip irrelevant parts of terms

This is achieved by storing in each node the number of characters to be skipped.

Example:

(Saake, Sattler 2004)

Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric

Page 72:

2.4 Evaluating Search Methods

[Figure: Venn diagram over the set of all documents, showing the set of documents found, the relevant documents found, and the relevant documents that are not found.]

Page 73:

Performance Measures of Information Retrieval: Recall and Precision

Several different measures for evaluating the performance of information retrieval systems have been proposed; two important ones are recall and precision.

Let D be the set of all documents, DA the answer set, DR the set of relevant documents, and DRA the set of relevant documents in the answer set.

Recall: the fraction of the relevant documents that are successfully retrieved.

  R = |DRA| / |DR|

Precision: the fraction of the retrieved documents that are relevant to the user's information need.

  P = |DRA| / |DA|

Page 74:

F-Measure

The F-measure is the harmonic mean of precision and recall:

  F = 2 * P * R / (P + R)

In this version, precision and recall are equally weighted.

The more general version Fβ allows giving preference to recall or precision:

  Fβ = (1 + β²) * P * R / (β² * P + R)

F2 weights recall twice as much as precision.

F0.5 weights precision twice as much as recall.
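A small Python sketch of computing these measures for a single query; the example document sets are illustrative:

    def precision_recall_f(retrieved, relevant, beta=1.0):
        """Precision, recall and F-measure; retrieved/relevant are sets of document IDs."""
        hits = len(retrieved & relevant)                       # |DRA|
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f

    retrieved = {"D1", "D3", "D7"}     # answer set DA
    relevant = {"D3", "D4"}            # relevant documents DR
    p, r, f = precision_recall_f(retrieved, relevant)
    print(round(p, 2), round(r, 2), round(f, 2))   # 0.33 0.5 0.4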

Page 75:

Computing Recall and Precision Evaluation: Perform a predefined set of queries

The search engines delivers a ranked set of documents Use the first X documents of the result list as answer set Compute recall and precision for the frist X documents of the ranked

result list.

How do you know which documents are relevant?

1. A general reference collection of documents can be used. For example, TREC (Text REtrieval Conference) is an annual event where large test collections in different domains are used to measure and compare the performance of information retrieval systems

2. For companies it is more important to evaluate information retrieval systems using their own documents:
1. Collect a representative set of documents
2. Specify queries and the associated relevant documents
3. Evaluate search engines by computing recall and precision for the query results
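A possible sketch of this evaluation loop in Python; the toy search engine, the queries and the relevance judgments are placeholders, not real data.

# Run predefined queries, cut the ranked result list after X documents and
# compute recall and precision against given relevance judgments.

def evaluate(search_engine, judged_queries, x=10):
    results = {}
    for query, relevant in judged_queries.items():
        ranked = search_engine(query)            # ranked list of document IDs
        answer_set = set(ranked[:x])             # first X documents
        hit = answer_set & set(relevant)
        recall = len(hit) / len(relevant) if relevant else 0.0
        precision = len(hit) / len(answer_set) if answer_set else 0.0
        results[query] = (recall, precision)
    return results

# Hypothetical usage with a hard-coded toy "search engine":
def toy_engine(query):
    return {"accident vienna": ["d1", "d3", "d2"],
            "car registration": ["d3", "d2"]}.get(query, [])

judgments = {"accident vienna": ["d1", "d2"], "car registration": ["d3"]}
print(evaluate(toy_engine, judgments, x=2))
# {'accident vienna': (0.5, 0.5), 'car registration': (1.0, 0.5)}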

Page 76

2.5 User Adaptation

Take into account information about a user in order to filter documents that are particularly relevant to this user

Relevance Feedback: retrieval in multiple passes; in each pass the user refines the query based on the results of previous queries

Explicit User Profiles: subscriptions, user-specific weights of terms

Social Filtering: similar users get similar documents

Page 77

2.5.1 Relevance Feedback given by the User

The user specifies the relevance of each document. Example: for the query "Pisa", only the documents about the education assessment are regarded as relevant

In the next pass, the top-ranked documents are only about the education assessment

This example is from the SmartFinder system from empolis. The mindaccess system from Insiders GmbH uses the same technology.

Page 78

Relevance Feedback: Probabilistic Model

Assumption: Given a user query, there is an ideal answer set

Idea: An initial answer is iteratively improved based on user feedback

Approach:

An initial set of documents is retrieved somehow

The user inspects these documents looking for the relevant ones (usually, only the top 10-20 need to be inspected)

The IR system uses this information to refine the description of the ideal answer set

By repeating this process, it is expected that the description of the ideal answer set will improve

The description of the ideal answer set is modeled in probabilistic terms

(Baeza-Yates & Ribeiro-Neto 1999)

Page 79

Probabilistic Ranking

Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).

The model assumes that this probability of relevance depends on the query and the document representations only.

Probabilistic ranking is:

sim(dj, q) = P(R | d⃗j) / P(R̄ | d⃗j)

Definitions:

wij ∈ {0,1} (i.e. weights are binary)

sim(dj, q): the similarity of document dj to the query q

d⃗j: the document vector of dj

P(R | d⃗j): the probability that document dj is relevant

P(R̄ | d⃗j): the probability that document dj is not relevant

(Baeza-Yates & Ribeiro-Neto 1999)

Page 80

Computing Probabilistic Ranking

Probabilistic ranking can be computed as:

sim(dj, q) ≈ Σi wiq · wij · ( log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) )

where

P(ki|R) stands for the probability that the index term ki is present in a document randomly selected from the set R of relevant documents

1 − P(ki|R) accordingly stands for the probability that the index term ki is not present in a document randomly selected from the set R of relevant documents (R̄ denotes the set of non-relevant documents)

wiq is the weight of term ki in the query

wij is the weight of term ki in document dj

(Baeza-Yates & Ribeiro-Neto 1999)

Page 81

Relevance Feedback: Probabilistic Model

The probabilities that a term ki is (not) present in the set of relevant documents can be estimated as:

P(ki|R) ≈ Vi / V    P(ki|R̄) ≈ (ni − Vi) / (N − V)

(in practice a small smoothing adjustment, e.g. adding 0.5, is commonly used to avoid zero probabilities)

N total number of documents

ni number of documents containing term ki

V number of relevant documents retrieved by the probabilistic model

Vi number of relevant documents containing term ki

There are different ways to determine the set V of relevant documents:

Automatically: V can be specified as the top r documents found

By user feedback: the user specifies for each retrieved document whether it is relevant or not
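Putting the last two slides together, a Python sketch of probabilistic ranking with relevance-feedback estimates (binary weights wiq = wij = 1 for terms that occur). The +0.5 / +1 smoothing is the adjustment commonly used to avoid zero probabilities and is an assumption here, as are the toy data.

import math

def term_weight(n_i, v_i, n_docs, v):
    p_rel = (v_i + 0.5) / (v + 1)                     # estimate of P(ki|R)
    p_nonrel = (n_i - v_i + 0.5) / (n_docs - v + 1)   # estimate of P(ki|R-bar)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def rank(query_terms, documents, relevant_ids, doc_freq, n_docs):
    """documents: doc_id -> set of terms; relevant_ids: feedback set V."""
    v = len(relevant_ids)
    scores = {}
    for doc_id, terms in documents.items():
        score = 0.0
        for t in query_terms:
            if t in terms:                            # wiq = wij = 1
                n_i = doc_freq[t]                     # documents containing t
                v_i = sum(1 for d in relevant_ids if t in documents[d])
                score += term_weight(n_i, v_i, n_docs, v)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy collection: after the user marks d1 as relevant, terms of d1 are boosted
# and d1-like documents move to the top of the ranking.
docs = {"d1": {"accident", "heavy", "vienna"},
        "d2": {"truck", "accident", "vienna"},
        "d3": {"car", "vienna", "register"}}
freq = {"accident": 2, "heavy": 1, "vienna": 3, "truck": 1, "car": 1, "register": 1}
print(rank(["accident", "heavy", "vienna"], docs, {"d1"}, freq, n_docs=3))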

Page 82

2.5.2 Explicit User Profiles

Idea: Use knowledge about the user to provide information that is particularly relevant for him/her

users specify topics of interest as a set of terms
these terms represent the user profile
documents containing the terms of the user profile are preferred

[Figure: architecture of profile-based retrieval. Profile acquisition turns the user's information need/preferences into a user profile; document representation turns the documents into the index; a ranking function matches the user profile against the index.]

Page 83

User profiles for subscribing to information: user profiles are treated as queries

Example: news feed. As soon as a new document arrives, it is tested for similarity with the user profiles

Vector space model can be applied

A document is regarded relevant if the ranking reaches a specified threshold

Example: User 1 is interested in any car accident; User 2 is interested in deadly car accidents with trucks

term       IDF   d1  d2  d3  U1  U2
accident   0.5    2   0   1   1   1
car        0.5    1   1   0   1   0
cause      1      0   0   1   0   0
crowd      1      0   0   1   0   0
die        1      1   0   0   0   1
drive      1      0   0   1   0   0
four       1      0   0   1   0   0
heavy      1      2   0   0   0   0
injur      1      0   0   1   0   0
more       1      0   2   0   0   0
morning    1      1   0   0   0   0
people     0.5    1   0   2   0   0
quarter    1      0   1   0   0   0
register   1      0   1   0   0   0
truck      1      0   0   1   1   1
trucker    1      0   0   1   0   0
vehicle    1      0   1   0   1   0
vienna     0.33   1   1   1   0   0
yesterday  1      1   0   0   0   0
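A sketch of this subscription filtering in Python, using the profiles U1 and U2 from the table: a newly arriving document is scored against each profile with a simple IDF-weighted dot product and delivered to every user whose score reaches a threshold. The threshold of 1.0 is an arbitrary assumption.

# Profile-based filtering of incoming documents (simplified vector space model).

IDF = {"accident": 0.5, "car": 0.5, "die": 1.0, "heavy": 1.0,
       "truck": 1.0, "vehicle": 1.0, "vienna": 0.33}

profiles = {
    "user1": {"accident": 1, "car": 1, "truck": 1, "vehicle": 1},   # any car accident
    "user2": {"accident": 1, "die": 1, "truck": 1},                 # deadly truck accidents
}

def score(doc_terms, profile):
    return sum(IDF.get(t, 1.0) * tf * profile.get(t, 0)
               for t, tf in doc_terms.items())

def notify(doc_terms, threshold=1.0):
    return [u for u, p in profiles.items() if score(doc_terms, p) >= threshold]

# New document d1 ("heavy car accident ... died ... in Vienna"), term frequencies:
d1 = {"accident": 2, "car": 1, "die": 1, "heavy": 2, "morning": 1, "vienna": 1}
print(notify(d1))   # ['user1', 'user2'] - both profiles reach the threshold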

Page 84

User Profiles for Individual Queries

Users specify the importance of terms

User profiles are used as additional term weights

Different rankings for different users

Example: user profiles with term weights; the query q contains the terms accident, heavy, vehicle and vienna

term       IDF   d1  d2  d3   U1   U2   q
accident   0.5    2   0   1   1.0  1.0  1
car        0.5    1   1   0   0.8  0.2  0
cause      1      0   0   1   0    0    0
crowd      1      0   0   1   0    0    0
die        1      1   0   0   0    0.8  0
drive      1      0   0   1   0    0    0
four       1      0   0   1   0    0    0
heavy      1      2   0   0   0.2  0.6  1
injur      1      0   0   1   0    0    0
more       1      0   2   0   0    0    0
morning    1      1   0   0   0    0    0
people     0.5    1   0   2   0.5  0.8  0
quarter    1      0   1   0   0    0    0
register   1      0   1   0   0    0    0
truck      1      0   0   1   0.6  1.0  0
trucker    1      0   0   1   0    0.6  0
vehicle    1      0   1   0   1.0  0.1  1
vienna     0.33   1   1   1   0    0    1
yesterday  1      1   0   0   0    0    0

Ranking for user 1:
IDF * d1 * U1 * q = 1.4
IDF * d2 * U1 * q = 1.0
IDF * d3 * U1 * q = 0.5

Ranking for user 2:
IDF * d1 * U2 * q = 2.2
IDF * d2 * U2 * q = 0.1
IDF * d3 * U2 * q = 0.5
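The rankings above can be recomputed with a few lines of Python; only the query terms matter, since q is zero for all other terms.

# score(d, U) = sum over terms of IDF * document weight * profile weight * query weight

IDF = {"accident": 0.5, "heavy": 1.0, "vehicle": 1.0, "vienna": 0.33}
q   = {"accident": 1, "heavy": 1, "vehicle": 1, "vienna": 1}

docs = {   # weights of the query terms in d1, d2, d3 (taken from the table)
    "d1": {"accident": 2, "heavy": 2, "vehicle": 0, "vienna": 1},
    "d2": {"accident": 0, "heavy": 0, "vehicle": 1, "vienna": 1},
    "d3": {"accident": 1, "heavy": 0, "vehicle": 0, "vienna": 1},
}
profiles = {
    "U1": {"accident": 1.0, "heavy": 0.2, "vehicle": 1.0, "vienna": 0.0},
    "U2": {"accident": 1.0, "heavy": 0.6, "vehicle": 0.1, "vienna": 0.0},
}

def score(doc, profile):
    return sum(IDF[t] * doc[t] * profile[t] * q[t] for t in q)

for user, profile in profiles.items():
    print(user, {d: round(score(vec, profile), 2) for d, vec in docs.items()})
# U1 {'d1': 1.4, 'd2': 1.0, 'd3': 0.5}
# U2 {'d1': 2.2, 'd2': 0.1, 'd3': 0.5}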

Page 85

Acquisition and Maintenance of User Profiles

There are different ways to specify user profiles

Manual: the user specifies topics of interest (and weights) explicitly, e.g. by selecting predefined terms or a query. Problem: maintenance

User feedback: the user collects relevant documents; the terms in the selected documents are regarded as important. Problem: how to motivate the user to give feedback (a similar approach is used by spam filters for classification)

Heuristics: observing user behaviour. Example: if a user has had a document open for a long time, it is assumed that he/she has read it and therefore it might be relevant. Problem: heuristics might be wrong

Page 86

Social Filtering

Idea: Information is relevant if other users who showed similar behaviour regarded the information as relevant; relevance is specified by the users

User profiles are compared

Example: A simple variant can be found at Amazon: purchases of books and CDs are stored

„people who bought this book also bought …“
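A minimal sketch of the idea, assuming per-user purchase sets and a simple overlap similarity (all data invented): items bought by the most similar other user are recommended.

# User-based social (collaborative) filtering with Jaccard overlap.

purchases = {
    "alice": {"book_ir", "book_ai", "cd_jazz"},
    "bob":   {"book_ir", "book_ai", "book_db"},
    "carol": {"cd_pop"},
}

def similarity(u, v):
    a, b = purchases[u], purchases[v]
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user):
    # take the most similar other user and suggest what this user does not own yet
    others = [u for u in purchases if u != user]
    best = max(others, key=lambda u: similarity(user, u))
    return purchases[best] - purchases[user]

print(recommend("alice"))   # {'book_db'} - "people who bought this also bought ..."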