AG Corporate Semantic Web
Freie Universität Berlin
http://www.inf.fu-berlin.de/groups/ag-csw/
Information Retrieval (IR)
Mohammed Al-Mashraee
Corporate Semantic Web (AG-CSW)
Institute for Computer Science, Freie Universität Berlin
http://www.inf.fu-berlin.de/groups/ag-csw/
Agenda
• Introduction / Motivation
• Data structures and general representations
• IR Definition
• IR Models
  • Set theoretic / Boolean
  • Weighting Methods
  • Algebraic / Vector
• IR Evaluation
Introduction

(Diagram: an IR system takes a query string and a document corpus as input and returns a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, …)
Introduction - IR Tasks
Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.
Find:
• A ranked set of documents that are relevant to the query.
Introduction
Motivation
These days we frequently think first of web
search, but there are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval
Introduction - Motivation
Unstructured (text) vs. structured (database) data
in the mid-nineties
Introduction - Motivation
Unstructured (text) vs. structured (database) data today
Data structures and
general representations
Data structures and representations
[Almashraee, 2013]
Data representation is the process of organizing data so that it can be accessed and manipulated efficiently. There are different structures:
• Database schema structure
• Semi-structured data
• Semantic representation / RDF representation
• Feature vector representation
Data structures and representations
Database schema structure
A fully structured representation that provides efficient access to and manipulation of the data stored under its schema, e.g., Oracle, MySQL.

Semi-structured data representation
Allows direct storage and manipulation of the data, but offers only limited querying ability, e.g., XML.
Data structures and representations
Semantic representation / RDF representation
A more recent structured representation. It is typically used by Semantic Web applications to interpret their related information and store it in triple (subject, predicate, object) format.
Data structures and representations
Feature vector representation
The most commonly used representation, in which extracted features are presented as a vector. This representation allows many different methods (such as information retrieval, support vector machines, Naïve Bayes, association rule mining, decision trees, hidden Markov models, maximum entropy models, etc.) to build useful models to solve related problems.
IR vs. databases:
Structured vs unstructured data
Structured data tends to refer to information
in “tables”
Employee Manager Salary
Smith Jones 50000
Chang Smith 60000
Ivy Smith 50000
Typically allows numerical range and exact match
(for text) queries, e.g.,
Salary < 60000 AND Manager = Smith
Unstructured data
Typically refers to free text
Allows
o Keyword queries including operators
o More sophisticated “concept” queries e.g.,
• find all web pages dealing with drug abuse
Classic model for searching text documents
More General
Structured vs. Unstructured data
Search vs. Discovery
Definition
[Manning et al. 2008]
Information Retrieval (IR)
Finding material (usually documents) of an
unstructured nature (usually text) that satisfies
an information need from within large
collections (usually stored on computers)
Representation of Documents
[Paschke notes]
Set of terms T = {t1, …, tn}
Each document dj is represented as a vector of weighted terms:
dj = (w1,j, …, wn,j)
where wi,j is the weight of term ti in document dj.
D is the set of documents.
A similarity measure sim describes the similarity of a document to the query.
IR Models
IR Models
• Boolean models / Set theoretic
• Vector space models (Statistical/Algebraic)
• Probabilistic models
Boolean models/
Set theoretic
Set theoretic / Boolean Retrieval
Documents are represented as vectors of index terms: an entry is true if the term exists in the document, false otherwise.
The weight wi,j is 0 or 1 (a Boolean truth weight), interpreted as a Boolean variable.
Queries are represented as Boolean expressions: single terms are queries, and (q1 AND q2), (q1 OR q2), (NOT q1) are queries.
A document is relevant if the query expression evaluates to true over the document's term variables.
The similarity measure is also Boolean.
Example.1: Set theoretic/Boolean Retrieval
T = („today“, „is“, „Monday“, „lecture“, „no“)
Documents: d1 = „today is Monday“, d2 = „today is lecture“, d3 = „Monday is lecture“

Incidence matrix:

      today  is  Monday  lecture  no
d1      1    1     1        0     0
d2      1    1     0        1     0
d3      0    1     1        1     0

Query results (1 = relevant):

      is   Monday AND lecture   today OR Monday   NOT lecture
d1     1            0                  1               1
d2     1            0                  1               0
d3     1            1                  1               0
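The two tables above can be reproduced with a minimal sketch of Boolean retrieval. The document strings come from the example; the function and variable names are illustrative.

```python
# A minimal sketch of Boolean retrieval over the example documents.
DOCS = {
    "d1": "today is Monday",
    "d2": "today is lecture",
    "d3": "Monday is lecture",
}

# Build the binary incidence information: each document maps to its term set.
index = {d: set(text.split()) for d, text in DOCS.items()}

def matches(doc, query):
    """Evaluate a Boolean query function against one document's term set."""
    return query(index[doc])

# Queries as Boolean expressions over term membership.
q_is = lambda terms: "is" in terms
q_monday_and_lecture = lambda terms: "Monday" in terms and "lecture" in terms
q_today_or_monday = lambda terms: "today" in terms or "Monday" in terms
q_not_lecture = lambda terms: "lecture" not in terms

results = {d: matches(d, q_monday_and_lecture) for d in DOCS}
print(results)  # only d3 satisfies "Monday AND lecture"
```

As in the table, only d3 contains both "Monday" and "lecture".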
Example.2: Set theoretic/Boolean Retrieval
Binary incidence matrix:

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0
A document is represented by a binary vector ∈ {0, 1}^|V|.
The size of the vector depends on the size of the vocabulary (dictionary) |V|.
Weighting Schemes
Weighting Schemes
Weighting schemes assign a score to each document with respect to a particular query, in order to rank the documents returned:
• Bag-of-words model
• Term frequency (tf)
• Document frequency (df)
• Inverse document frequency (idf)
• Term frequency - inverse document frequency (tf-idf)
Term-document count matrices
The number of occurrences of a term in a document is considered.
A document is represented by a count vector ∈ N^|V|; the size of the vector depends on the size of the vocabulary (dictionary) |V|.

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              157                73             0          0       0        0
Brutus                4               157             0          1       0        0
Caesar              232               227             0          2       1        1
Calpurnia             0                10             0          0       0        0
Cleopatra            57                 0             0          0       0        0
mercy                 2                 0             3          5       5        1
worser                2                 0             1          1       1        0
Bag of words model
The vector representation doesn't consider the ordering of words in a document.
E.g., "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
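This order-insensitivity is easy to see with term counts; `Counter` simply builds the bag of words (a small sketch):

```python
from collections import Counter

# Two sentences with different word order produce identical
# bag-of-words vectors, because only term counts are kept.
v1 = Counter("John is quicker than Mary".split())
v2 = Counter("Mary is quicker than John".split())
print(v1 == v2)  # True: the ordering of words is lost
```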
Term Frequency (tf) weighting
The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
tf is used to compute query-document match scores.
Raw term frequency is not what we want:
• A document with 10 occurrences of the term is more relevant than a document with 1 occurrence.
• But it is not 10 times more relevant.
Relevance does not increase proportionally with term frequency (relevance goes up, but not linearly).
In IR, "frequency" denotes the count of a word in the document.
Log-frequency weighting
The log-frequency weight of term t in document d is:

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
            = 0                     otherwise

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

The score for a document-query pair sums over the terms t that appear in both the query q and the document d:

    Score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

The score is 0 if none of the query terms is present in the document.
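The weight and the score can be sketched as follows (function names are illustrative):

```python
import math

def log_tf_weight(tf):
    """w = 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum log-tf weights over terms occurring in both query and document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# The slide's sample values: 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```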
Inverse Document Frequency (idf)
Another score used for ranking the matches of documents
to a query
Idea: Terms that appear in many different documents are
less indicative of overall topic
Rare terms are more informative than frequent terms.
df_t is the document frequency of t: the number of documents that contain t.
• df_t is an inverse measure of the informativeness of t
• df_t ≤ N
We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
Inverse Document Frequency (idf)
We define the idf (inverse document frequency) of term t as:

    idf_t = log10(N / df_t)

where df_t is the document frequency of term t (the number of documents containing t) and N is the total number of documents.
Inverse Document Frequency (idf)
Example: suppose N = 1 million (total number of documents). Then idf_t = log10(10^6 / df_t):

term             df_t    idf_t
calpurnia           1      6
animal            100      4
sunday          1,000      3
fly            10,000      2
under         100,000      1
the         1,000,000      0
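The table above follows directly from the definition (a small sketch; names are illustrative):

```python
import math

def idf(N, df):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, round(idf(N, df)))
```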
Tf-idf Weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

    Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}

• A term occurring frequently in the document but rarely in the rest of the collection is given a high weight.
• The weight increases with the number of occurrences within a document.
• The weight increases with the rarity of the term in the collection.
Experimentally, tf-idf has been found to work well and is the best-known weighting scheme in information retrieval.
Tf-idf Weighting
Binary → count → weight matrix
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             5.25               3.18            0        0        0       0.35
Brutus             1.21               6.1             0        1        0       0
Caesar             8.59               2.54            0        1.51     0.25    0
Calpurnia          0                  1.54            0        0        0       0
Cleopatra          2.85               0               0        0        0       0
mercy              1.51               0               1.9      0.12     5.25    0.88
worser             1.37               0               0.11     4.15     0.25    1.95
Tf-idf Weighting
Example
Given a document containing terms with the following frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the document frequencies of these terms are:
A(50), B(1300), C(250)
Then (this example uses a max-normalized tf, i.e. tf divided by the maximum term frequency in the document, and a base-2 idf):
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
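The arithmetic of this example can be checked with a short sketch. Note the max-normalized tf and base-2 idf used here, unlike the log10 formulation on the earlier slides; variable names are illustrative.

```python
import math

# Reproducing the slide's arithmetic.
N = 10_000
tf = {"A": 3, "B": 2, "C": 1}
df = {"A": 50, "B": 1300, "C": 250}
max_tf = max(tf.values())

# tf-idf with max-normalized tf and base-2 idf, as in the example.
tfidf = {t: (tf[t] / max_tf) * math.log2(N / df[t]) for t in tf}
for t in "ABC":
    print(t, round(tfidf[t], 1))
```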
Vector space model
Vector space model
Documents are represented as vectors
Queries are also represented as vectors
Now we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
(Diagram: documents and the query as vectors in a 3-dimensional term space with axes T1, T2, T3.)

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3
Vector space model
How to measure the similarity between
Di and Q?
Vector space model
Similarity measure
Documents are ranked according to their proximity (similarity) to the query in a given space.
A similarity measure is a function that computes the degree of similarity between two vectors:
• Scalar product (inner product)
• Cosine measure
Vector space model
Similarity measure: inner product

The similarity between document dj and query q can be computed as the vector inner product (dot product):

    sim(dj, q) = dj · q = Σ_{i=1}^{t} w_{ij} · w_{iq}

where w_{ij} is the weight of term i in document j and w_{iq} is the weight of term i in the query.

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
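A minimal sketch of the inner-product similarity for the example above (names are illustrative):

```python
def inner_product(d, q):
    """sim(d, q) = sum_i w_id * w_iq for weight vectors of equal length."""
    return sum(wd * wq for wd, wq in zip(d, q))

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]
print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2
```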
Vector space model
Why the inner product is not a good solution for vector similarities:
unnormalized measures are misleading for vectors of different lengths. For example, the Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Vector space model
Similarity measure
Cosine measure
1. Compute the weights for D and Q
2. Normalize the vector lengths of D and Q
3. Compute the cosine similarity
Vector space model
Cosine measure
Length normalization
• A vector can be length normalized by
dividing each of its components by its length
• Long and short documents now have
comparable weights
Vector space model
Cosine measure
Cosine similarity:

    cos(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i^2) × sqrt(Σ_{i=1}^{|V|} d_i^2) )

q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
Vector space model
Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (scalar product):

    cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i   for q, d length-normalized.
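Both forms can be sketched and compared; `cosine` and `normalize` are illustrative helper names:

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd)

def normalize(v):
    """Length-normalize a vector by dividing each component by its norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# For length-normalized vectors, cosine reduces to the dot product.
q, d = [0, 0, 2], [2, 3, 5]
lhs = cosine(q, d)
rhs = sum(a * b for a, b in zip(normalize(q), normalize(d)))
print(round(lhs, 4), round(rhs, 4))  # the two agree
```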
Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Note: To simplify this example, we don’t do idf weighting.
Example (contd.)

Log-frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69
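The whole pipeline of this example (log-tf weighting, length normalization, dot product) can be reproduced in a few lines; names are illustrative:

```python
import math

# Term counts from the SaS / PaP / WH example.
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    """Length-normalize a term->weight vector."""
    n = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / n for t, w in vec.items()}

norm = {doc: normalize({t: log_tf(c) for t, c in cnt.items()})
        for doc, cnt in counts.items()}

def cos(a, b):
    """Dot product of the two length-normalized vectors."""
    return sum(norm[a][t] * norm[b][t] for t in norm[a])

print(round(cos("SaS", "PaP"), 2))  # 0.94
```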
Evaluation
(Diagram: the entire document collection is split along two dimensions, retrieved vs. not retrieved and relevant vs. irrelevant, giving four regions: retrieved & relevant, not retrieved but relevant, retrieved & irrelevant, not retrieved & irrelevant.)
Precision and Recall
Measure     Formula
Precision   TP / (TP + FP)
Recall      TP / (TP + FN)
Precision and Recall
Precision
The ability to retrieve top-ranked documents that
are mostly relevant.
Recall
The ability of the search to find all of the relevant
items in the corpus.
Precision and Recall
Example
Assume the following:
• A database contains 80 records on a particular topic
• A search was conducted on that topic and 60 records were retrieved.
• Of the 60 records retrieved, 45 were relevant.
Calculate the precision and recall scores for the search
Using the designations above:
• A = The number of relevant records retrieved,
• B = The number of relevant records not retrieved, and
• C = The number of irrelevant records retrieved.
In this example A = 45, B = 35 (80-45) and C = 15 (60-45).
Recall = (45 / (45 + 35)) * 100% => 45/80 * 100% = 56%
Precision = (45 / (45 + 15)) * 100% => 45/60 * 100% = 75%
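The same computation as a sketch (function names are illustrative):

```python
def precision(tp, fp):
    """Fraction of retrieved records that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant records that were retrieved."""
    return tp / (tp + fn)

# Values from the example: 80 relevant records exist in the database,
# 60 records were retrieved, and 45 of those were relevant.
tp = 45        # relevant and retrieved (A)
fn = 80 - 45   # relevant but not retrieved (B)
fp = 60 - 45   # retrieved but irrelevant (C)

print(f"recall    = {recall(tp, fn):.0%}")     # 56%
print(f"precision = {precision(tp, fp):.0%}")  # 75%
```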
Thank you! Questions?
References
• Manning, Christopher et al.: Introduction to Information Retrieval. 2008.
• Baeza-Yates, Ricardo et al.: Modern Information Retrieval. 1999.
• Adrian Paschke: "Web Based Information Systems", lecture notes, FU Berlin.
• Rohit Kate: "Natural Language Processing", lecture notes, University of Wisconsin-Milwaukee.