Information Retrieval
Introduction and basic concepts
Luca Bondi
Introduction
What is Information Retrieval?
• Information Retrieval deals with the representation, storage,
organization of, and access to information items [Baeza-Yates,
Ribeiro-Neto, 1999]
• The user need determines what information is relevant
• How can the user's information need be characterized in a way
that computers can handle?
Information Retrieval
• A user expresses a need in natural language and expects a
computer system to return results relevant to that need
Data Retrieval
• A user specifies a query in a formal language and expects a
computer system to return results that exactly match the query
statement (e.g., querying a database)
Why do we need Information Retrieval?
• Huge document collections due to cheap and easy generation,
storage, and processing
  • 968,882,453 (almost 1 billion) websites at the end of 2014
  • 1.5 billion Facebook users in Q2 2015
  • 316 million active Twitter users in Q2 2015
  • 30 billion Instagram photos up to August 2015
    • 70 million new photos per day
  • Every minute, 300 hours of new video are uploaded to YouTube
• We need Information Retrieval to find, as fast as possible,
something (or everything) that is relevant to our needs!
The Information Retrieval challenges
• User needs are not uniquely definable
• Example
• I need a new mouse for my workstation
• Let’s Google for it!
• Mmm... not exactly the kind of mouse I was looking for. But the
second result might be helpful…
The nature of Information Retrieval
• Retrieving all objects which might be useful or relevant to the
user’s information need
• Given unstructured user queries
• Errors in the results are tolerated
• What is relevant to a user?
• What is the trade-off between precision and recall?
• How to rank results to make users happy?
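To make the precision/recall trade-off concrete, here is a minimal Python sketch. It assumes the standard definitions, which the slides do not spell out: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The function name `precision_recall`, the document identifiers, and the relevance judgments are all invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids the user judges relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked result list and relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d2", "d3", "d5"}

# Returning more results cannot lower recall, but precision may drop.
for k in (1, 3, 6):
    p, r = precision_recall(ranked[:k], relevant)
    print(f"top-{k}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, showing only the top result gives perfect precision but low recall, while showing all six results raises recall at the cost of precision.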
Typical tasks covered in IR
• Search
  • Static document collection
  • Dynamic queries
  • Example
    • Searching for a file on your PC
• Filtering
  • Dynamic document collection
  • Static queries
  • Example
    • Automatic e-mail filtering to separate students' e-mail from
    department spam
• Clustering, Categorization, Recommendation, Browsing,
Summarization, Question answering
Definitions
Information Retrieval Model
An Information Retrieval Model is the triple
$IRM \triangleq \langle D, Q, R(q_i, d_j) \rangle$
• $D$: set of logical views of the documents, with $d_j \in D$
• $Q$: set of logical views of the user's needs (aka queries),
with $q_i \in Q$
• $R(q_i, d_j) \to \mathbb{R}$: ranking function that associates a
real number with a document representation $d_j$ with respect to
a query $q_i$
The ranking defines an ordering among all the documents with
regard to the query $q_i$
Relevance
• The relevance of a document with respect to the user's needs is
  • Subjective
    • different users may express the same information need in
    different ways, thus issuing different queries that lead to
    different document rankings
    • two users with the same information need, expressed by the
    same query, may give different judgments on the same retrieved
    document
  • Dynamic
    • in time: documents retrieved and displayed now could influence
    the user's judgment on documents that will be displayed later
    • in space: documents relevant to a user in a specific location
    may not be relevant to a user in another geographic location
  • Not known to the system prior to the user's judgment
• The system guesses the document relevance by computing the
ranking function, which depends on the adopted IRM
Ranking vs Relevance
Ranking
• deterministic
• independent of user context (at least in simple cases)
• time and space invariant (at least in simple cases)
Relevance
• subjective
• user dependent
• time variant
• space variant
Basic concepts
Similarity
[Figure: a document $d_j$ and a query $q_i$ enter the IRM, which
outputs $SC(q_i, d_j)$]
Given a document $d_j$ and a query $q_i$, an Information Retrieval
Model assigns a measure of similarity, the Similarity Coefficient
$SC(q_i, d_j)$, between the document and the query.
An idea of what similarity means:
The more often terms are found in both the document and the
query, the more relevant the document is with regard to the
query
Given a query $q$ and a set of documents
$D = \{d_1, d_2, \dots, d_N\}$, a retrieval strategy is an algorithm
that computes the Similarity Coefficient $SC(q, d_j)$ for each
document $d_j$, $\forall j \in [1, N]$.
Similarity and Rank
Similarity
• independent of the document collection
• depends on the model
• the higher, the better
Rank
• depends on the document collection
• depends on the model
• the lower, the better
Similarity and Rank: a trivial example
$q$ = “black cat”
$SC(q, d_j)$ = number of words in query $q$ that also appear in document $d_j$

| $d_j$ | $SC(q, d_j)$ | $R(q, d_j)$ |
|---|---|---|
| I’m scared by black cats | 2 | 1 |
| she’s missing her cat | 1 | 2 (tie) |
| too many cats in my neighbourhood | 1 | 2 (tie) |
| what a beautiful flower! | 0 | 4 |
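The trivial example above can be reproduced with a short Python sketch. Two things here are assumptions rather than part of the slides: the crude `normalize` step, which strips a trailing "s" so that "cats" matches "cat" (the table only works out if plurals are matched), and the tie handling in `competition_ranks`, chosen so that tied documents share a position and the next position is skipped (1, 2, 2, 4). All function names are hypothetical.

```python
import re


def normalize(word):
    """Crude normalization (assumption): lowercase, strip a trailing plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def terms(text):
    """Split text into a set of normalized words."""
    return {normalize(w) for w in re.findall(r"[a-z']+", text.lower())}


def similarity(query, document):
    """SC(q, d_j): number of query words that also appear in the document."""
    return len(terms(query) & terms(document))


def competition_ranks(scores):
    """Rank 1 is the highest score; tied scores share the same rank."""
    return [1 + sum(other > s for other in scores) for s in scores]


query = "black cat"
documents = [
    "I'm scared by black cats",
    "she's missing her cat",
    "too many cats in my neighbourhood",
    "what a beautiful flower!",
]

scores = [similarity(query, d) for d in documents]
ranks = competition_ranks(scores)
for d, sc, r in zip(documents, scores, ranks):
    print(f"{d!r}: SC={sc} R={r}")
```

The similarity of each document is computed from the query and that document alone, while its rank only makes sense relative to the scores of the other documents in the collection.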
Index terms
Each document $d_j$ is represented by a set of keywords called
index terms $t_i$
Index terms are used to index and summarize the document
content
Distinct index terms have varying relevance (to the user) when
used to describe document contents. This effect is modelled by
assigning a numerical weight $w_{i,j}$ to each index term $t_i$ for
each document $d_j$
$w_{i,j}$ quantifies the importance of the index term for describing
the document's semantic content
Each document $d_j$ is associated with an index term vector
$\mathbf{d}_j = (w_{1,j}, w_{2,j}, \dots, w_{M,j})^T$
$M$ = total number of index terms
$w_{i,j} = g(t_i, d_j)$, where $g$ is a function that computes the
weight of term $t_i$ in document $d_j$
$w_{i,j} = 0$ if term $t_i$ does not appear in document $d_j$
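As an illustration of the index term vector, the following Python sketch builds one vector per document over a shared vocabulary. The slides leave the weighting function g unspecified; raw term frequency is used here purely as an assumption, and the whitespace tokenizer, the function names, and the two toy documents are likewise hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_vocabulary(documents):
    """Sorted list of the M distinct index terms t_1..t_M in the collection."""
    return sorted({t for doc in documents for t in tokenize(doc)})


def term_frequency(term, document):
    """One possible g(t_i, d_j): how many times the term occurs in the document."""
    return Counter(tokenize(document))[term]


def index_term_vector(document, vocabulary, g=term_frequency):
    """Return d_j = (w_1j, ..., w_Mj) with w_ij = g(t_i, d_j)."""
    return [g(t, document) for t in vocabulary]


documents = ["black cat black dog", "white cat"]
vocabulary = build_vocabulary(documents)
vectors = [index_term_vector(d, vocabulary) for d in documents]

print(vocabulary)
for v in vectors:
    print(v)
```

Every vector has length M, and a term that does not appear in a document gets weight 0, as required. Passing a different `g` swaps the weighting scheme without changing the rest of the sketch.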
Index term weights are usually assumed to be mutually
independent
This means that knowing the weight $w_{i,j}$ associated with the
pair $(t_i, d_j)$ tells us nothing about the weight $w_{i+1,j}$
associated with the pair $(t_{i+1}, d_j)$
This is clearly a simplification, because occurrences of index
terms in a document are not uncorrelated
• In a telecommunication book, for instance, the terms
computer and network are likely to appear together, so the
weights of those terms are clearly correlated
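One way to see why independence is only a simplification is to measure how the weights of two terms vary together across a collection. The sketch below computes the Pearson correlation between term weights. The weights are invented term frequencies over five hypothetical documents, chosen so that "computer" and "network" tend to occur in the same documents. They are not data from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented weights w_ij: one list per term, one entry per document.
weights = {
    "computer": [5, 4, 0, 1, 0],
    "network": [4, 5, 0, 0, 1],
    "flower": [0, 0, 3, 0, 2],
}

r_related = pearson(weights["computer"], weights["network"])
r_unrelated = pearson(weights["computer"], weights["flower"])
print(f"computer vs network: {r_related:.2f}")
print(f"computer vs flower:  {r_unrelated:.2f}")
```

On this toy data the first correlation is strongly positive and the second is negative, so knowing one weight does say something about the other. This is exactly what the independence assumption ignores.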