Information Retrieval
Information Retrieval constructs an index for a given corpus and responds to queries by retrieving all the relevant documents and as few non-relevant documents as possible.
Index a collection of documents (access efficiency); given a user's query, rank documents by importance (accuracy).
THE TYPICAL IR PROBLEM
A query and a document collection are each turned into representations (a query representation and a document representation), which are matched to produce the query answer. Key questions:
How exact is the representation of the document?
How exact is the representation of the query?
How well is the query matched to the data?
How relevant is the result to the query?
History of IR Systems
Role of documentalists
Role of database researchers
Role of researchers in information retrieval systems and knowledge management systems
Sources of Information on IR
Top Tier Journals:
Journal of the American Society for Information Science and Technology (JASIST)
Information Processing & Management (IPM)
Information Retrieval (IR)
Information Sciences (IS)
Journal of Documentation (JDoc)
IEEE Transactions on Knowledge and Data Engineering (TKDE)
ACM Transactions on Information Systems (TOIS)
Top Tier Conferences:
ACM SIGIR (Special Interest Group on Information Retrieval)
ACM CIKM (Int. Conf. on Information and Knowledge Management)
AAAI Conference on Artificial Intelligence
Annual Meeting of the Association for Computational Linguistics
European Conference on Information Retrieval (ECIR)
TREC (Text REtrieval Conference) *
ACM SIGKDD (Special Interest Group on Knowledge Discovery, Data Mining, Large-scale Data Analytics and Big Data)
Typical IR Task
Given: a corpus of textual natural-language documents, and a user query in the form of a textual string.
Find: a ranked set of documents that are relevant to the query.
Traditional IR System
A query string is submitted to the IR system, which searches the document corpus and returns ranked documents:
1. Doc1  2. Doc2  3. Doc3  …
Web Search System
The same pipeline, except that a Web crawler builds the document corpus: the query string goes to the IR system, which returns ranked pages:
1. Page1  2. Page2  3. Page3  …
Retrieval Models
A retrieval model specifies the details of: Document representation Query representation Retrieval function
Information Retrieval Models Three ‘classic’ models:
Boolean Model
Vector Space Model
Probabilistic Model
Additional models Extended Boolean
Fuzzy matching
Cluster-based retrieval
Language models
“Classic” Retrieval Models
Boolean: documents and queries are sets of index terms (‘set theoretic’)
Vector: documents and queries are vectors in n-dimensional space (‘algebraic’)
Probabilistic: based on probability theory
Documents
A document is a stored data record in any form.
Examples: book, journal article, report, dissertation, encyclopedia; part of a text, e.g. a paragraph or an encyclopedia article; also: Web page, image, music, sound, video, video clip.
Are Queries Documents?
Similarities: text based, similar terminology.
Differences: queries are usually shorter, linguistically less well formed, and differ in text statistics.
Simpler to think of queries as documents.
Retrieval as a “matching” process
Sample TREC Topic (Query)
<top> <num> Number: 327 <title> Topic: Windows Longhorn <desc> Description: Microsoft is currently developing its newest incarnation of the Windows operating system: Longhorn. <narr> Narrative: As the competition against Microsoft increases, the company is also seeking out new battlefields with its new version of Windows, such as improved file-searching technology. Including this new searching technology, what improvements will be added to Windows, and how is the competition responding?
<related-text> Relevant Longhorn will include a database-like storage engine called Windows Future Storage (WinFS), which is based on technology from SQL Server 2003 (code-named Yukon). This storage engine builds on NTFS and will abstract physical file locations from the user and allow for the sorts of complex data searching that are impossible today. For example, today, your email messages, contacts, Word documents, and music files are all completely separate. That won't be the case in Longhorn. WinFS requires NTFS. </top>
The topic is expressed in SGML markup; the title is a short phrase, the description a sentence (fragment), and the narrative a paragraph.
Retrieval Matching Process
Binary: D = (1, 1, 1, 0, 1, 1, 0), Q = (1, 0, 1, 0, 0, 1, 1); sim(D, Q) = 3.
Size of vector = size of vocabulary = 7; a 0 means the corresponding term is not found in the document or query.
Weighted: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10; sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
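The arithmetic above can be checked with a one-line similarity function; a minimal sketch (names are illustrative):

```python
def inner_product(d, q):
    """Sum of products of corresponding term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

sim_b = inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1])  # binary: 3 shared terms
sim1  = inner_product([2, 3, 5], [0, 0, 2])  # D1 vs Q -> 10
sim2  = inner_product([3, 7, 1], [0, 0, 2])  # D2 vs Q -> 2
```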
Document Processing
Document Processing in IR Systems
Assign identifier, store document
Identify “words”
Record positional information
Word stemming
Term weighting
Relevance Feedback in IR
After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
Use this feedback information to reformulate the query.
Produce new results based on the reformulated query. This allows a more interactive, multi-pass process.
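One classical way to implement the reformulation step (not spelled out in the slides) is the Rocchio formula: move the query vector toward the centroid of documents marked relevant and away from those marked non-relevant. A minimal sketch with conventional default coefficients; the function name and toy vectors are illustrative:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Revised query = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant),
    with negative weights clipped to zero (a common convention)."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    q = [alpha * w for w in query]
    if relevant:
        q = [qi + beta * mi for qi, mi in zip(q, mean(relevant))]
    if nonrelevant:
        q = [qi - gamma * mi for qi, mi in zip(q, mean(nonrelevant))]
    return [max(qi, 0.0) for qi in q]

# toy term-weight vectors: one doc judged relevant, one non-relevant
q_new = rocchio([0, 0, 2], relevant=[[2, 3, 5]], nonrelevant=[[3, 7, 1]])
```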
Relevance Feedback Architecture
The IR system first returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …) for the query string over the document corpus. The user marks each as relevant (⇑) or non-relevant (⇓); this feedback drives query reformulation, and the revised query produces re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, …).
Query Reformulation
Boolean Information Retrieval
Boolean Model
Based on set theory and Boolean algebra
Queries are specified as Boolean expressions
Widely used in commercial IR systems (Dialog, Lexis/Nexis)
Based on an inverted index file; usually supplemented with proximity operators.
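A minimal sketch of the inverted index underlying such systems: each term maps to the list of documents containing it, and a Boolean AND then becomes a postings-list intersection. Function and variable names are illustrative; the toy documents reappear later in the slides:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index({1: "computer information retrieval",
                              2: "computer retrieval",
                              3: "information",
                              4: "computer information"})
# a Boolean AND is an intersection of postings lists:
hits = sorted(set(index["information"]) & set(index["retrieval"]))
```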
Boolean Model
Output: a document is either relevant or not; there are no partial matches or ranking, and an exact match is required.
A document is represented as a set of keywords.
Queries are Boolean expressions of keywords, connected by logical AND, OR, and NOT, including the use of brackets to indicate scope, e.g.:
[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
Logical AND (∧) (Set Intersection)
A ∧ B
is the set of things in common, i.e., in both sets A and B
Example: A = Aged, B = Blind. A ∧ B = people who are both Aged and Blind.
Logical OR (∨) (Set Union)
A ∨ B
is the set of things in either A, B, or both.
Example: A = Aged, B = Blind. A ∨ B = people who are either Aged or Blind, or both.
Logical NOT (¬) (Set Complement)
¬ B
is the set of things outside the set B
Example: B = Blind. ¬B = people who aren’t blind.
Example Combination
A ∧ (¬ B)
Example: A = Aged, B = Blind. A ∧ (¬B) = aged people who aren’t blind.
More Examples
D1 = “computer information retrieval” D2 = “computer retrieval” D3 = “information” D4 = “computer information”
Q1 = “information ∧ retrieval” Q2 = “information ∧ ¬ computer”
Q1 retrieves D1 (the only document containing both “information” and “retrieval”); Q2 retrieves D3 (the only document containing “information” but not “computer”).
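The two queries above can be checked mechanically by treating each document as a set of keywords: AND is set intersection, OR is union, and NOT is complement against the document universe. A small Python sketch (names are illustrative):

```python
# each document as a set of keywords; the universe of doc ids is needed for NOT
docs = {"D1": {"computer", "information", "retrieval"},
        "D2": {"computer", "retrieval"},
        "D3": {"information"},
        "D4": {"computer", "information"}}
universe = set(docs)

def matching(term):
    """Documents containing the given term."""
    return {d for d, terms in docs.items() if term in terms}

q1 = matching("information") & matching("retrieval")              # information AND retrieval
q2 = matching("information") & (universe - matching("computer"))  # information AND NOT computer
```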
Popular retrieval model because:
Easy to understand for simple queries.
Clean formalism.
Reasonably efficient implementations are possible for normal queries.
Boolean Retrieval Model
Very rigid: AND means all; OR means any.
Difficult to express complex user requests.
Difficult to control the number of documents retrieved: all matched documents will be returned.
Difficult to rank output: all matched documents logically satisfy the query.
Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?
Drawbacks of the Boolean Model
Retrieval based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Vector Space Information Retrieval
Vector Space Model
Based on the idea of an n-dimensional document space.
The query is also located in the document space.
Documents are ranked in order of their “closeness” to the query.
Many possible matching functions.
Issues for the Vector Space Model
How to determine the important words in a document? Word senses? Word n-grams (and phrases, idioms, …) as terms?
How to determine the degree of importance of a term within a document and within the entire collection?
How to determine the degree of similarity between a document and the query?
In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?
Vector-Space Model
Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. These “orthogonal” terms form a vector space with dimension = t = |vocabulary|.
Each term i in a document or query j is given a real-valued weight, wij.
Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
Graphic Representation
Example: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3, plotted as vectors along the axes T1, T2, T3.
Is D1 or D2 more similar to Q? How should the degree of similarity be measured: distance, angle, or projection?
Inner Product -- Examples
Binary: D = (1, 1, 1, 0, 1, 1, 0), Q = (1, 0, 1, 0, 0, 1, 1); sim(D, Q) = 3.
Size of vector = size of vocabulary = 7; a 0 means the corresponding term is not found in the document or query.
Weighted: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10; sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
Document Collection
A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply doesn’t occur in it.

      T1   T2   …  Tt
D1    w11  w21  …  wt1
D2    w12  w22  …  wt2
:     :    :       :
Dn    w1n  w2n  …  wtn
Term Weights: Term Frequency
More frequent terms in a document are more important, i.e. more indicative of the topic: fij = frequency of term i in document j.
May want to normalize term frequency (tf) by the most frequent term in the document: tfij = fij / maxi{fij}
Term Weights: Inverse Document Frequency
Terms that appear in many different documents are less indicative of overall topic.
dfi = document frequency of term i = number of documents containing term i
idfi = inverse document frequency of term i = log2(N / dfi), where N is the total number of documents.
This is an indication of a term’s discrimination power; the log is used to dampen the effect relative to tf.
TF-IDF Weighting
A typical combined term importance indicator is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)
A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
Many other ways of determining term weights have been proposed.
Experimentally, tf-idf has been found to work well.
Computing TF-IDF -- An Example
Given a document containing terms with frequencies A(3), B(2), C(1). Assume the collection contains 10,000 documents, and the document frequencies of these terms are A(50), B(1300), C(250). Then (the numbers below correspond to the natural log):
A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
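The example's numbers can be reproduced directly with `math.log` (natural logarithm, which is what the stated values match); variable names are illustrative:

```python
import math

freqs = {"A": 3, "B": 2, "C": 1}        # term counts in the document
df    = {"A": 50, "B": 1300, "C": 250}  # document frequencies in the collection
N = 10_000

# tf normalized by the most frequent term in the document, times idf
max_f = max(freqs.values())
tfidf = {t: (f / max_f) * math.log(N / df[t]) for t, f in freqs.items()}
```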
Query Vector
Query vector is typically treated as a document and also tf-idf weighted.
Alternative is for the user to supply weights for the given query terms.
Similarity Measure
A similarity measure is a function that computes the degree of similarity between two vectors. Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in the order of presumed relevance.
It is possible to enforce a threshold so that the size of the retrieved set can be controlled.
Similarity Measure - Inner Product
Similarity between the vectors for document dj and query q can be computed as the vector inner product:
sim(dj, q) = dj · q = Σ_{i=1..t} wij · wiq
where wij is the weight of term i in document j and wiq is the weight of term i in the query.
For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection).
For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
D1 = 2T1 + 3T2 + 5T3: CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3: CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q plotted along axes t1, t2, t3; θ1 is the angle between D1 and Q, θ2 the angle between D2 and Q.]
D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σ_{i=1..t} (wij · wiq) / √(Σ_{i=1..t} wij² · Σ_{i=1..t} wiq²)
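The worked values for D1 and D2 can be verified with a direct implementation of cosine similarity; a minimal sketch (function name is illustrative):

```python
import math

def cos_sim(d, q):
    """Cosine similarity: inner product normalized by the vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / math.sqrt(sum(w * w for w in d) * sum(w * w for w in q))

c1 = cos_sim([2, 3, 5], [0, 0, 2])  # D1 vs Q -> ~0.81
c2 = cos_sim([3, 7, 1], [0, 0, 2])  # D2 vs Q -> ~0.13
```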
Outline
Probabilistic Information Retrieval
System Evaluation
Web Mining
Probabilistic Information Retrieval
The Basics
Bayesian probability formulas:
p(a ∩ b) = p(a|b) p(b) = p(b|a) p(a)
p(a|b) = p(b|a) p(a) / p(b)
Odds:
O(y) = p(y) / p(ȳ) = p(y) / (1 − p(y))
The Basics
Document Relevance:
p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)
Note: p(R|x) + p(NR|x) = 1
Binary Independence Model
“Binary” = Boolean: documents are represented as binary vectors of terms, x = (x1, …, xn), where xi = 1 iff term i is present in document x.
“Independence”: terms occur in documents independently.
Different documents can be modeled as the same vector.
Binary Independence Model
Queries: binary vectors of terms. Given query q, for each document d we need to compute p(R|q,d); replace this with computing p(R|q,x), where x is the vector representing d. Since we are interested only in ranking, we will use odds:
O(R|q,x) = p(R|q,x) / p(NR|q,x) = [p(R|q) / p(NR|q)] · [p(x|R,q) / p(x|NR,q)]
Binary Independence Model
Using the independence assumption:
p(x|R,q) / p(x|NR,q) = ∏_{i=1..n} p(xi|R,q) / p(xi|NR,q)
So:
O(R|q,x) = O(R|q) · ∏_{i=1..n} p(xi|R,q) / p(xi|NR,q)
where O(R|q) is constant for each query, and the product needs estimation.
Binary Independence Model
Since each xi is either 0 or 1:
O(R|q,x) = O(R|q) · ∏_{xi=1} [p(xi=1|R,q) / p(xi=1|NR,q)] · ∏_{xi=0} [p(xi=0|R,q) / p(xi=0|NR,q)]
Let pi = p(xi=1|R,q) and ri = p(xi=1|NR,q). Then the first product runs over all matching terms and the second over the non-matching query terms.
Binary Independence Model
Assuming pi = ri for terms not occurring in the query, the products can be restricted to query terms (qi = 1):
O(R|q,x) = O(R|q) · ∏_{xi=qi=1} (pi / ri) · ∏_{xi=0, qi=1} [(1 − pi) / (1 − ri)]
         = O(R|q) · ∏_{xi=qi=1} [pi (1 − ri) / (ri (1 − pi))] · ∏_{qi=1} [(1 − pi) / (1 − ri)]
The first product is over all matching terms; the second is over all query terms.
Binary Independence Model
O(R|q,x) = O(R|q) · ∏_{qi=1} [(1 − pi) / (1 − ri)] · ∏_{xi=qi=1} [pi (1 − ri) / (ri (1 − pi))]
The first two factors are constant for each query; only the last product needs to be estimated for ranking.
Retrieval Status Value:
RSV = log ∏_{xi=qi=1} [pi (1 − ri) / (ri (1 − pi))] = Σ_{xi=qi=1} log [pi (1 − ri) / (ri (1 − pi))]
Binary Independence Model
Everything boils down to computing the RSV:
RSV = Σ_{xi=qi=1} ci, where ci = log [pi (1 − ri) / (ri (1 − pi))]
So, how do we compute the ci’s from our data?
Binary Independence Model
Estimating the RSV coefficients: for each term i, look at the following table of document counts:

Documents   Relevant   Non-Relevant    Total
xi = 1      r          n − r           n
xi = 0      R − r      N − n − R + r   N − n
Total       R          N − R           N

Estimates: pi ≈ r / R and ri ≈ (n − r) / (N − R), so
ci ≈ K(N, n, R, r) = log [ r (N − n − R + r) / ((R − r)(n − r)) ]
To avoid zeros, add 0.5 to every expression.
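The smoothed coefficient can be written directly from the table above; a minimal sketch (the function name is illustrative). Note that with no relevance information (R = r = 0) the weight reduces to log((N − n + 0.5) / (n + 0.5)), i.e. roughly an idf weight:

```python
import math

def bim_ci(N, n, R, r, k=0.5):
    """Smoothed BIM term weight:
    c_i = log[(r+k)(N-n-R+r+k) / ((R-r+k)(n-r+k))]
    N docs in total, n contain term i, R are relevant, r of the relevant contain term i."""
    return math.log(((r + k) * (N - n - R + r + k)) /
                    ((R - r + k) * (n - r + k)))

# no relevance information: behaves like an idf weight
c = bim_ci(N=10_000, n=50, R=0, r=0)
```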
System Evaluation
Why System Evaluation?
There are many retrieval models/algorithms/systems; which one is the best?
What is the best component for:
Ranking function (dot-product, cosine, …)
Term selection (stemming, …)
Term weighting (TF, TF-IDF, …)
How far down the ranked list will a user need to look to find some/all relevant documents?
What Can We Measure?
Algorithm (Efficiency):
Speed of the algorithm
Update potential of the indexing scheme
Size of storage required
Potential for distribution & parallelism
User Experience (Effectiveness):
How many of all relevant docs were found
How many were missed
How many errors in selection
How many docs must be scanned before good ones are found
Measures Based on Relevance
Relative to the entire document collection, the document set divides into four parts: retrieved relevant (RR), retrieved not relevant (RN), not retrieved relevant (NR), and not retrieved not relevant (NN).

recall = (number of relevant documents retrieved) / (total number of relevant documents)
precision = (number of relevant documents retrieved) / (total number of documents retrieved)

The retrieved set thus contains both relevant and irrelevant documents, while some relevant documents are never retrieved.
Precision and Recall
Trade-off between Recall and Precision: plotting precision against recall, the ideal sits at the top-right corner (recall = 1, precision = 1). One extreme returns relevant documents but misses many useful ones too (high precision, low recall); the other returns most relevant documents but includes lots of junk (high recall, low precision).
Computing Recall/Precision Points: An Example

n    doc #   relevant
1    588     x
2    589     x
3    576
4    590     x
5    986
6    592     x
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x
14   990

Let the total number of relevant docs = 6 and check each new recall point:
R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/2 = 1
R = 3/6 = 0.5;   P = 3/4 = 0.75
R = 4/6 = 0.667; P = 4/6 = 0.667
R = 5/6 = 0.833; P = 5/13 = 0.38
One relevant document is missing, so 100% recall is never reached.
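These recall/precision points can be recomputed from the ranked list; a minimal sketch (the relevance of document 592 is inferred from the stated P = 4/6 and R-Precision values, since its mark was evidently lost in extraction):

```python
def recall_precision_points(ranking, relevant, total_relevant):
    """Emit a (recall, precision) point at each rank where a relevant doc appears."""
    points, found = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / total_relevant, found / rank))
    return points

# ranked list from the example; 6 relevant docs exist, only 5 are retrieved
ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}
pts = recall_precision_points(ranking, relevant, total_relevant=6)
# pts -> [(1/6, 1.0), (2/6, 1.0), (3/6, 0.75), (4/6, 4/6), (5/6, 5/13)]
```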
R-Precision
Precision at the R-th position in the ranking of results for a query that has R relevant documents.
Using the same ranked list as before (relevant docs at ranks 1, 2, 4, 6, and 13):
R = number of relevant docs = 6
R-Precision = precision at rank 6 = 4/6 = 0.67
Compare Two or More Systems
An Example Precision-Recall Curve: plotting precision against recall (both from 0 to 1) for two systems, e.g. with and without stemming (Stem vs. NoStem), the curve closest to the upper right-hand corner of the graph indicates the best performance.
Famous Examples of System Evaluation
• The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957–1968 (hundreds of docs)
• Okapi System, Jimmy Huang and Stephen Robertson, York University & Microsoft
• SMART System, Gerald Salton, Cornell University
• TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992– (millions of docs; 100k to 7.5M per set; training and test queries, 150 each)
Evaluating Retrieval Systems: Text REtrieval Conference (TREC)
An annual bake-off for text retrieval systems, sponsored by NIST.
Roughly 2.5 gigabytes of text (428 gigabytes of Web data)
50 “topics” (queries); the top 1000 documents are returned for each topic
Results judged by retired CIA and NSA analysts; no-gloat rule
Numerous tracks, including text routing, very large corpus, and cross-language retrieval
Web Mining
Contents: What is Web mining? What can Web mining do? What are the challenges for Web mining? Web mining categories:
Web usage mining Web content mining Web structure mining
Applications of Web mining Examples
What is Web Mining? Web Mining is
the use of data mining techniques to automatically discover and extract information from the Web documents.
What is Web Mining?
With the development of computer technology, more and more data have become available on the Web, but interesting things remain buried in it. Our objective is to find the valuable knowledge hidden among the data.
Web Mining Techniques - Navigation Patterns
Consider the Web page hierarchy of a Web site with pages A, B, C, D, and E. If visitors who reach page C frequently continue on to page E, a link could be provided directly from C to E.
What Can Web Mining Do? An Example: a plot of sales per month mined from Web data.
What Is the Challenge for Web Mining?
The Web is a huge collection of documents, and it is very dynamic.
Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms.
Categories of Web Mining
Web Usage Mining
Web Content Mining (text, multimedia)
Web Structure Mining
References:
R. Kosala and H. Blockeel, “Web Mining Research: A Survey”, SIGKDD Explorations, vol. 2, issue 1, 2000.
J. Srivastava et al., “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, SIGKDD Explorations, vol. 2, issue 1, 1999.
Web Usage Mining Process
Raw Logs → Preprocessing → User Session File → Mining Patterns (with Background Knowledge) → Rules & Patterns → Pattern Analysis → Interesting rules & patterns
Web Usage Mining
Discovers information about how Web pages are being accessed: by whom, for how long, when, and in what order pages are referenced.
Can be used to determine a better way to organize the Web site.
Web Usage Mining - Pattern Discovery
Applies Web mining techniques to generate rules and patterns
Web Mining Techniques: statistical analysis, association rule generation, clustering, classification, sequential patterns.

Web Usage Mining - Statistical Analysis
Generates simple statistical reports: a report of hits and bytes transferred, a list of top requested URLs, a list of top referrers.
Learn who is visiting your site, how much time visitors spend on each page, and the most common starting page.
Web Usage Mining - Statistical Analysis
Statistical analysis is useful for:
Improving system performance
Enhancing system security
Facilitating the site modification task
Providing support for marketing decisions
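As an illustration of the simplest such report, a sketch that tallies the top requested URLs from Web server log lines; assuming the Common Log Format layout (the function name and sample lines are illustrative):

```python
from collections import Counter

def top_urls(log_lines, k=3):
    """Count requests per URL from Common Log Format lines and
    return the k most requested URLs."""
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) >= 2:
            request = parts[1].split()  # e.g. ['GET', '/index.html', 'HTTP/1.0']
            if len(request) >= 2:
                counts[request[1]] += 1
    return counts.most_common(k)

logs = ['1.2.3.4 - - [10/Oct/2000:13:55:36] "GET /index.html HTTP/1.0" 200 2326',
        '1.2.3.5 - - [10/Oct/2000:13:56:01] "GET /about.html HTTP/1.0" 200 1024',
        '1.2.3.4 - - [10/Oct/2000:13:57:12] "GET /index.html HTTP/1.0" 200 2326']
```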