Indexing
• Indexing is a technique borrowed from databases
• An index is a data structure that supports efficient lookups in a large data set
  – E.g., hash indexes, R-trees, B-trees, etc.
Document Retrieval
• In search engines, the lookups have to find all documents that contain query terms.
• What’s the problem with using a tree-based index?
• A hash index?
Inverted Index
An inverted index stores an entry for every word, and a pointer to every document where that word is seen.
Vocabulary   Postings List
Word1        Document17, Document45123, …
…            …
WordN        Document991, Document123001
Example
Example:
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”
Vocabulary   Postings List
yes          D1, D3
we           D1
got          D1, D2, D3
no           D1
bananas      D1
what         D2, D3
you          D2, D3
I            D3
like         D3
Query “you got”:
“you” → {D2, D3}
“got” → {D1, D2, D3}
The whole query gives the intersection:
“you got” → {D2, D3} ∩ {D1, D2, D3} = {D2, D3}
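As a concrete sketch, here is a minimal record-level inverted index and boolean AND query in Python (the function names `build_index` and `boolean_and` are ours, for illustration):

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, query):
    """Intersect the postings lists of all query terms."""
    postings = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}
index = build_index(docs)
print(boolean_and(index, "you got"))  # {'D2', 'D3'}
```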
Variations
• Record-level index stores just document identifiers in the postings list
• Word-level index stores document IDs and offsets for the positions of the words in each document
  – Supports phrase-based searches (why?)
Real search engines add all kinds of other information to their postings lists (see below).
Index Construction
Algorithm:
1. Scan through each document, word by word
   - Write a (term, docID) pair for each word to a TempIndex file
2. Sort TempIndex by term
3. Iterate through the sorted TempIndex:
   merge all entries for the same term into one postings list.
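A minimal in-memory sketch of this algorithm in Python, with a list standing in for the on-disk TempIndex file (the function name is ours):

```python
from collections import defaultdict

def construct_index(docs):
    # 1. Scan each document word by word, writing a (term, docID)
    #    pair per word (to a TempIndex file in the real algorithm).
    temp_index = [(word, doc_id)
                  for doc_id, text in docs.items()
                  for word in text.lower().split()]
    # 2. Sort TempIndex by term (then docID).
    temp_index.sort()
    # 3. Merge all entries for the same term into one postings list.
    index = defaultdict(list)
    for term, doc_id in temp_index:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)
```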
Efficient Index Construction
Problem: Indexes can be huge. How can we efficiently build them?
- Blocked Sort-Based Indexing (BSBI)
- Single-Pass In-Memory Indexing (SPIMI)
What’s the difference?
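One key difference, sketched under our own naming below: BSBI collects (term, docID) pairs and sorts them block by block, while SPIMI skips the pair-sorting step entirely and instead grows an in-memory dictionary of postings per block, writing each block’s index out when memory fills and merging the blocks at the end. A minimal sketch of the SPIMI side:

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    """Yield one in-memory block index at a time (SPIMI-style sketch).

    token_stream is an iterable of (term, doc_id) pairs.
    """
    block, count = defaultdict(list), 0
    for term, doc_id in token_stream:
        # Postings are appended directly to each term's list;
        # there is no global sort of (term, docID) pairs as in BSBI.
        block[term].append(doc_id)
        count += 1
        if count >= max_postings:  # memory "full": emit this block
            yield dict(sorted(block.items()))
            block, count = defaultdict(list), 0
    if block:
        yield dict(sorted(block.items()))

# A real implementation would write each block to disk
# and merge the block indexes in a final pass.
```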
Problem: Too many matching results for every query
Using an inverted index is all well and good, but if your document collection has 10^12 documents and someone searches for “banana”, they’ll get 90 million results.
We need to be able to return the “most relevant” results.
We need to rank the results.
Documents as Vectors
Example:
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”
            yes  we  got  no  bananas  what  you  I  like
Vector V1:   1    1    1    1     1       0     0   0    0
Vector V2:   0    0    1    0     0       1     1   0    0
Vector V3:   1    0    1    0     0       1     1   1    1
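A minimal sketch of how these vectors can be built over a fixed vocabulary (names are ours):

```python
def to_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocabulary = ["yes", "we", "got", "no", "bananas", "what", "you", "i", "like"]
v1 = to_vector("yes we got no bananas", vocabulary)
print(v1)  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```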
What about queries?
In the vector space model, queries are treated as (very short) documents.
Example query: “bananas”
            yes  we  got  no  bananas  what  you  I  like
Query Q1:    0    0    0    0     1       0     0   0    0
Measuring Similarity
Similarity metric:
the angle between document vectors; the smaller the angle, the more similar the documents.
“Cosine Similarity” measures the cosine of that angle:
CS(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} = \cos\theta_{v_1, v_2}
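A direct Python translation of this formula (a sketch; the function name is ours):

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```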
Ranking documents
            yes  we  got  no  bananas  what  you  I  like
Query Q1:    0    0    1    0     0       0     1   0    0
Vector V1:   1    1    1    1     1       0     0   0    0
Vector V2:   0    0    1    0     0       1     1   0    0
Vector V3:   1    0    1    0     0       1     1   1    1
CS(q_1, v_1) = \frac{q_1 \cdot v_1}{\lVert q_1 \rVert \, \lVert v_1 \rVert} = \frac{1}{\sqrt{2}\sqrt{5}}

CS(q_1, v_2) = \frac{q_1 \cdot v_2}{\lVert q_1 \rVert \, \lVert v_2 \rVert} = \frac{2}{\sqrt{2}\sqrt{3}}

CS(q_1, v_3) = \frac{q_1 \cdot v_3}{\lVert q_1 \rVert \, \lVert v_3 \rVert} = \frac{2}{\sqrt{2}\sqrt{6}}
All words are equal?
The TF-IDF measure weights different words more or less heavily, depending on how informative they are.
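In its most common form (our addition; the slides do not spell it out), a term’s weight is its frequency in the document times the log of its inverse document frequency:

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \frac{N}{\text{df}(t)}

where N is the number of documents in the collection and df(t) is the number of documents containing t. Terms that appear in nearly every document get weight near zero; rare, informative terms get high weight.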
Compare Document Classification and Document Retrieval/Ranking
• Similarities:
• Differences:
Handling Synonymy in Retrieval
Problem:
Straightforward search for a term may miss the most relevant results, because those documents use a synonym of the term.
Examples:
Search for “Burma” will miss documents containing only “Myanmar”
Search for “document classification” will miss results for “text classification”
Search for “scientists” will miss results for “physicists”, “chemists”, etc.
Two approaches
1. Convert retrieval into a classification or clustering problem
- Relevance Feedback (classification)
- Pseudo-relevance Feedback (clustering)
2. Expand the query to include synonyms or other relevant terms
- Thesaurus-based
- Automatic query expansion
Relevance Feedback
Algorithm:
1. User issues a query q
2. System returns initial results D1
3. User labels some results (relevant or not)
4. System learns a classifier/ranker for relevance
5. System returns new result set D2
Relevance Feedback as Text Classification
• The system gets a set of labeled documents (+ = relevant, - = not relevant)
  – This is exactly the input to a standard text classification problem
• Solution: convert labeled documents into vectors, then apply standard learning (see the Rocchio sketch below)
  – Rocchio, Naïve Bayes, k-NN, SVM, …
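A minimal sketch of the Rocchio variant, assuming the standard update with weights α, β, γ (numpy-based; names and default values are ours):

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant documents and
    away from the centroid of non-relevant ones (standard Rocchio)."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if relevant:
        q += beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
    if nonrelevant:
        q -= gamma * np.mean(np.asarray(nonrelevant, dtype=float), axis=0)
    return np.maximum(q, 0.0)  # negative term weights are usually clipped
```

The updated query vector is then used to re-rank documents, e.g., by cosine similarity.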
Details
• In relevance feedback, there are few labeled examples
• Efficiency is a concern – the user is waiting online during training and testing
• Output is a ranking, not a binary classification
  – But most classifiers can be converted into rankers, e.g., Naïve Bayes can rank according to the probability score, and an SVM can rank according to wᵀx + b
Pseudo Relevance Feedback
IDEA: instead of waiting for the user to provide relevance judgments, just use the top-K documents to represent the + (relevant) class
• It’s a somewhat mind-bending thought, but this actually works in practice.
• Essentially, this is like one iteration of K-means clustering!
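A standalone sketch of the pseudo-relevance feedback idea (an assumed top-K, Rocchio-style expansion; names are ours):

```python
import numpy as np

def pseudo_relevance_feedback(query_vec, ranked_doc_vecs, k=10, beta=0.75):
    """Treat the top-k retrieved documents as if the user had marked
    them relevant, and move the query toward their centroid."""
    top_k = np.asarray(ranked_doc_vecs[:k], dtype=float)
    return np.asarray(query_vec, dtype=float) + beta * top_k.mean(axis=0)
```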
Clickstream Mining
(a.k.a. “indirect relevance feedback”)
IDEA: use the clicks that users make as proxies for relevance judgments
For example, if the search engine returns 10 documents for “bananas”, and users consistently click on the third link first, then increase the rank of that document and similar ones.
Query Expansion
IDEA: help users formulate “better” queries
“better” can mean
- More precise, to exclude more unrelated stuff
- More inclusive, to increase recall of documents that wouldn’t match a basic query
Query Term Suggestion
Problem:
Given a base query q, suggest a list of terms T={t1, …, tK} that could help the user refine the query.
One common technique is to suggest terms that frequently “co-occur” with terms already in the base query.
Co-occurrence
Terms t1 and t2 “co-occur” if they occur near each other in the same document.
There are many measures of co-occurrence, including:
PMI, MI, LSI-based scores, and others
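For reference, the standard pointwise mutual information formula (our addition; not spelled out on the slides):

\text{PMI}(t_1, t_2) = \log \frac{P(t_1, t_2)}{P(t_1)\,P(t_2)}

where P(t_1, t_2) is the probability that t_1 and t_2 co-occur in a document, and P(t) is the probability that a document contains t. High PMI means the terms appear together far more often than chance would predict.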
Computing Co-occurrence Example
A(t,d) =

      d1  d2  d3  d4
t1     2   0   1   0
t2     1   0   0   1
t3     0   1   2   0
t4     1   2   0   2
Computing Co-occurrence Example
Aᵀ (the same counts with rows and columns swapped):

      t1  t2  t3  t4
d1     2   1   0   1
d2     0   0   1   2
d3     1   0   2   0
d4     0   1   0   2

The term-term co-occurrence counts are then the product of A with its transpose:

C(t,t′) = A Aᵀ

(With A indexed as A(t,d), the product A Aᵀ is t×t: entry (t, t′) sums, over documents, the product of the two terms’ counts.)
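A quick numpy check of this product (our verification, not on the slides):

```python
import numpy as np

# Term-document count matrix A(t,d) from the example above.
A = np.array([[2, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 2, 0],
              [1, 2, 0, 2]])

C = A @ A.T  # term-term co-occurrence counts
print(C)
# [[5 2 2 2]
#  [2 2 0 3]
#  [2 0 5 2]
#  [2 3 2 9]]
```

High off-diagonal entries (e.g., C[t4, t2] = 3) flag term pairs that tend to occur in the same documents, making them candidates for query term suggestion.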