AG Corporate Semantic Web
Freie Universität Berlin
http://www.inf.fu-berlin.de/groups/ag-csw/
Information Retrieval (IR)
Mohammed Al-Mashraee
Corporate Semantic Web (AG-CSW)
Institute for Computer Science, Freie Universität Berlin
http://www.inf.fu-berlin.de/groups/ag-csw/
Agenda
• Introduction / Motivation
• Data structures and general representations
• IR Definition
• IR Models
  • Set theoretic / Boolean
  • Weighting Methods
  • Algebraic / Vector
• IR Evaluation
Introduction

(Diagram: an IR system takes a query string and a document corpus as input and returns a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, …)
Introduction - IR Tasks
Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.
Find:
• A ranked set of documents that are relevant to the query.
Introduction
Motivation
These days we frequently think first of web
search, but there are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval
Introduction - Motivation
Unstructured (text) vs. structured (database) data
in the mid-nineties
Introduction - Motivation
Unstructured (text) vs. structured (database) data today
Data structures and
general representations
Data structures and representations
[Almashraee, 2013]
Data representation is the process of organizing data so that it can be accessed and manipulated efficiently. There are different structures:
• Database schema structure
• Semi-structured data
• Semantic representation / RDF representation
• Feature vector representation
Data structures and representations
Database schema structure
A fully structured representation that provides efficient access to and manipulation of the data stored under its schema, e.g., Oracle, MySQL.

Semi-structured data representation
Allows direct storage and manipulation of the data, but offers only limited querying ability, e.g., XML.
Data structures and representations
Semantic representation / RDF representation
A more recent structured representation. It is typically used by Semantic Web applications to interpret their related information and store it in triple (subject, predicate, object) format.
Data structures and representations
Feature vector representation
The most commonly used representation, in which extracted features are presented as a vector. This representation allows many different methods (such as information retrieval, support vector machines, Naïve Bayes, association rule mining, decision trees, hidden Markov models, maximum entropy models, etc.) to build useful models to solve related problems.
IR vs. databases:
Structured vs unstructured data
Structured data tends to refer to information
in “tables”
Employee Manager Salary
Smith Jones 50000
Chang Smith 60000
Ivy Smith 50000
Typically allows numerical range and exact match
(for text) queries, e.g.,
Salary < 60000 AND Manager = Smith
Unstructured data
Typically refers to free text
Allows
o Keyword queries including operators
o More sophisticated “concept” queries e.g.,
• find all web pages dealing with drug abuse
Classic model for searching text documents
More General
Structured vs. Unstructured data
Search vs. Discovery
Definition
[Manning et al. 2008]
Information Retrieval (IR)
Finding material (usually documents) of an
unstructured nature (usually text) that satisfies
an information need from within large
collections (usually stored on computers)
Representation of Documents
[Paschke notes]
Set of terms T = {t1, …, tn}
Each document dj is represented as a vector of weighted terms:
dj = (w1,j, …, wn,j)
where wi,j is the weight of term ti in document dj.
D is the set of documents.
A similarity measure sim describes the similarity of a document to the query.
IR Models
IR Models
• Boolean models / Set theoretic
• Vector space models (Statistical/Algebraic)
• Probabilistic models
Boolean models/
Set theoretic
Set theoretic / Boolean Retrieval
Documents are represented as vectors of index terms: an entry is true if the term exists in the document, false otherwise.
The weight wi,j is 0 or 1 (a Boolean truth weight), interpreted as a Boolean variable.
Queries are represented as Boolean expressions: single terms are queries, and (q1 AND q2), (q1 OR q2), (NOT q1) are queries.
A document is relevant if the query expression evaluates to true over the document's term variables.
The similarity measure is also Boolean.
Example.1: Set theoretic/Boolean Retrieval
T = („today“, „is“, „Monday“, „lecture“, „no“)
Documents: d1 = „today is Monday“, d2 = „today is lecture“, d3 = „Monday is lecture“

Incidence matrix:

      today  is  Monday  lecture  no
d1      1    1     1        0     0
d2      1    1     0        1     0
d3      0    1     1        1     0

Query results (1 = relevant):

      is   Monday AND lecture   today OR Monday   NOT lecture
d1     1            0                  1               1
d2     1            0                  1               0
d3     1            1                  1               0
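The two tables above can be reproduced with a minimal sketch of Boolean retrieval. The document strings come from the example; the function and variable names are illustrative.

```python
# A minimal sketch of Boolean retrieval over the example documents.
DOCS = {
    "d1": "today is Monday",
    "d2": "today is lecture",
    "d3": "Monday is lecture",
}

# Build the binary incidence information: each document maps to its term set.
index = {d: set(text.split()) for d, text in DOCS.items()}

def matches(doc, query):
    """Evaluate a Boolean query function against one document's term set."""
    return query(index[doc])

# Queries as Boolean expressions over term membership.
q_is = lambda terms: "is" in terms
q_monday_and_lecture = lambda terms: "Monday" in terms and "lecture" in terms
q_today_or_monday = lambda terms: "today" in terms or "Monday" in terms
q_not_lecture = lambda terms: "lecture" not in terms

results = {d: matches(d, q_monday_and_lecture) for d in DOCS}
print(results)  # only d3 satisfies "Monday AND lecture"
```

As in the table, only d3 contains both "Monday" and "lecture".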
Example.2: Set theoretic/Boolean Retrieval
Binary incidence matrix:

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0
A document is represented by a binary vector ∈ {0, 1}^|V|.
The size of the vector depends on the size of the vocabulary (dictionary) |V|.
Weighting Schemes
Weighting Schemes
Weighting schemes assign a score to each document with respect to a particular query, in order to rank the documents returned:
• Bag-of-words model
• Term frequency (tf)
• Document frequency (df)
• Inverse document frequency (idf)
• Term frequency - inverse document frequency (tf-idf)
Term-document count matrices
The number of occurrences of a term in a document is considered.
A document is represented by a count vector ∈ N^|V|; the size of the vector depends on the size of the vocabulary (dictionary) |V|.

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              157                73             0          0       0        0
Brutus                4               157             0          1       0        0
Caesar              232               227             0          2       1        1
Calpurnia             0                10             0          0       0        0
Cleopatra            57                 0             0          0       0        0
mercy                 2                 0             3          5       5        1
worser                2                 0             1          1       1        0
Bag of words model
The vector representation doesn't consider the ordering of words in a document.
E.g., "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
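This order-insensitivity is easy to see with term counts; `Counter` simply builds the bag of words (a small sketch):

```python
from collections import Counter

# Two sentences with different word order produce identical
# bag-of-words vectors, because only term counts are kept.
v1 = Counter("John is quicker than Mary".split())
v2 = Counter("Mary is quicker than John".split())
print(v1 == v2)  # True: the ordering of words is lost
```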
Term Frequency (tf) weighting
The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
tf is used to compute query-document match scores.
Raw term frequency is not what we want:
• A document with 10 occurrences of the term is more relevant than a document with 1 occurrence.
• But it is not 10 times more relevant.
Relevance does not increase proportionally with term frequency (relevance goes up, but not linearly).
In IR, "frequency" denotes the count of a word in the document.
Log-frequency weighting
The log-frequency weight of term t in document d is:

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
            = 0                     otherwise

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

The score for a document-query pair sums over the terms t that appear in both the query q and the document d:

    Score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

The score is 0 if none of the query terms is present in the document.
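The weight and the score can be sketched as follows (function names are illustrative):

```python
import math

def log_tf_weight(tf):
    """w = 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum log-tf weights over terms occurring in both query and document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# The slide's sample values: 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```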
Inverse Document Frequency (idf)
Another score used for ranking the matches of documents
to a query
Idea: Terms that appear in many different documents are
less indicative of overall topic
Rare terms are more informative than frequent terms.
df_t is the document frequency of t: the number of documents that contain t.
• df_t is an inverse measure of the informativeness of t
• df_t ≤ N
We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
Inverse Document Frequency (idf)
We define the idf (inverse document frequency) of term t as:

    idf_t = log10(N / df_t)

where df_t is the document frequency of term t (the number of documents containing t) and N is the total number of documents.
Inverse Document Frequency (idf)
Example: suppose N = 1 million (total number of documents). Then idf_t = log10(10^6 / df_t):

term             df_t    idf_t
calpurnia           1      6
animal            100      4
sunday          1,000      3
fly            10,000      2
under         100,000      1
the         1,000,000      0
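The table above follows directly from the definition (a small sketch; names are illustrative):

```python
import math

def idf(N, df):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, round(idf(N, df)))
```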
Tf-idf Weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

    Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}

• A term occurring frequently in the document but rarely in the rest of the collection is given a high weight.
• The weight increases with the number of occurrences within a document.
• The weight increases with the rarity of the term in the collection.
Experimentally, tf-idf has been found to work well and is the best-known weighting scheme in information retrieval.
Tf-idf Weighting
Binary → count → weight matrix
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             5.25               3.18            0        0        0       0.35
Brutus             1.21               6.1             0        1        0       0
Caesar             8.59               2.54            0        1.51     0.25    0
Calpurnia          0                  1.54            0        0        0       0
Cleopatra          2.85               0               0        0        0       0
mercy              1.51               0               1.9      0.12     5.25    0.88
worser             1.37               0               0.11     4.15     0.25    1.95
Tf-idf Weighting
Example
Given a document containing terms with the following frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the document frequencies of these terms are:
A(50), B(1300), C(250)
Then (this example uses a max-normalized tf, i.e. tf divided by the maximum term frequency in the document, and a base-2 idf):
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
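The arithmetic of this example can be checked with a short sketch. Note the max-normalized tf and base-2 idf used here, unlike the log10 formulation on the earlier slides; variable names are illustrative.

```python
import math

# Reproducing the slide's arithmetic.
N = 10_000
tf = {"A": 3, "B": 2, "C": 1}
df = {"A": 50, "B": 1300, "C": 250}
max_tf = max(tf.values())

# tf-idf with max-normalized tf and base-2 idf, as in the example.
tfidf = {t: (tf[t] / max_tf) * math.log2(N / df[t]) for t in tf}
for t in "ABC":
    print(t, round(tfidf[t], 1))
```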
Vector space model
Vector space model
Documents are represented as vectors
Queries are also represented as vectors
Now we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
(Diagram: documents and the query as vectors in a 3-dimensional term space with axes T1, T2, T3.)

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3
Vector space model
How to measure the similarity between
Di and Q?
Vector space model
Similarity measure
Documents are ranked according to their proximity (similarity) to the query in a given space.
A similarity measure is a function that computes the degree of similarity between two vectors:
• Scalar product (inner product)
• Cosine measure
Vector space model
Similarity measure: inner product

The similarity between document dj and query q can be computed as the vector inner product (dot product):

    sim(dj, q) = dj · q = Σ_{i=1}^{t} w_{ij} · w_{iq}

where w_{ij} is the weight of term i in document j and w_{iq} is the weight of term i in the query.

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
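A minimal sketch of the inner-product similarity for the example above (names are illustrative):

```python
def inner_product(d, q):
    """sim(d, q) = sum_i w_id * w_iq for weight vectors of equal length."""
    return sum(wd * wq for wd, wq in zip(d, q))

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]
print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2
```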
Vector space model
Why the inner product is not a good solution for vector similarities:
unnormalized measures are misleading for vectors of different lengths. For example, the Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Vector space model
Similarity measure
Cosine measure
1. Compute the weights for D and Q
2. Normalize the vector lengths of D and Q
3. Compute the cosine similarity
Vector space model
Cosine measure
Length normalization
• A vector can be length normalized by
dividing each of its components by its length
• Long and short documents now have
comparable weights
Vector space model
Cosine measure
Cosine similarity:

    cos(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i^2) × sqrt(Σ_{i=1}^{|V|} d_i^2) )

q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
Vector space model
Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (scalar product):

    cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i   for q, d length-normalized.
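Both forms can be sketched and compared; `cosine` and `normalize` are illustrative helper names:

```python
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd)

def normalize(v):
    """Length-normalize a vector by dividing each component by its norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# For length-normalized vectors, cosine reduces to the dot product.
q, d = [0, 0, 2], [2, 3, 5]
lhs = cosine(q, d)
rhs = sum(a * b for a, b in zip(normalize(q), normalize(d)))
print(round(lhs, 4), round(rhs, 4))  # the two agree
```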
Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Note: To simplify this example, we don’t do idf weighting.
Example (contd.)

Log-frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69
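The whole pipeline of this example (log-tf weighting, length normalization, dot product) can be reproduced in a few lines; names are illustrative:

```python
import math

# Term counts from the SaS / PaP / WH example.
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    """Length-normalize a term->weight vector."""
    n = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / n for t, w in vec.items()}

norm = {doc: normalize({t: log_tf(c) for t, c in cnt.items()})
        for doc, cnt in counts.items()}

def cos(a, b):
    """Dot product of the two length-normalized vectors."""
    return sum(norm[a][t] * norm[b][t] for t in norm[a])

print(round(cos("SaS", "PaP"), 2))  # 0.94
```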
Evaluation
(Diagram: the entire document collection is split along two dimensions, retrieved vs. not retrieved and relevant vs. irrelevant, giving four regions: retrieved & relevant, not retrieved but relevant, retrieved & irrelevant, not retrieved & irrelevant.)
Precision and Recall
Measure     Formula
Precision   TP / (TP + FP)
Recall      TP / (TP + FN)
Precision and Recall
Precision
The ability to retrieve top-ranked documents that
are mostly relevant.
Recall
The ability of the search to find all of the relevant
items in the corpus.
Precision and Recall
Example
Assume the following:
• A database contains 80 records on a particular topic
• A search was conducted on that topic and 60 records were retrieved.
• Of the 60 records retrieved, 45 were relevant.
Calculate the precision and recall scores for the search
Using the designations above:
• A = The number of relevant records retrieved,
• B = The number of relevant records not retrieved, and
• C = The number of irrelevant records retrieved.
In this example A = 45, B = 35 (80-45) and C = 15 (60-45).
Recall = (45 / (45 + 35)) * 100% => 45/80 * 100% = 56%
Precision = (45 / (45 + 15)) * 100% => 45/60 * 100% = 75%
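The same computation as a sketch (function names are illustrative):

```python
def precision(tp, fp):
    """Fraction of retrieved records that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant records that were retrieved."""
    return tp / (tp + fn)

# Values from the example: 80 relevant records exist in the database,
# 60 records were retrieved, and 45 of those were relevant.
tp = 45        # relevant and retrieved (A)
fn = 80 - 45   # relevant but not retrieved (B)
fp = 60 - 45   # retrieved but irrelevant (C)

print(f"recall    = {recall(tp, fn):.0%}")     # 56%
print(f"precision = {precision(tp, fp):.0%}")  # 75%
```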
Thank you! Questions?
References
• Manning, Christopher et al.: Introduction to Information Retrieval. 2008.
• Baeza-Yates, Ricardo et al.: Modern Information Retrieval. 1999.
• Adrian Paschke: "Web Based Information Systems", lecture notes, FU Berlin.
• Rohit Kate: "Natural Language Processing", lecture notes, University of Wisconsin-Milwaukee.