Techniques of Information Retrieval
Tariq Hassan & Sabahat
Road Map:
• What is IR?
• Why & how it works
• Evaluation Techniques
• Global & Local Methods
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Relevance Feedback
4. Rocchio Algorithm
5. Linear Classifiers
6. Naïve Bayes Text Classification
Question & Discussion
What is IR? Why & How?
• Finding the information needed to satisfy a user.
• Why? Data comes in many different formats.
• How?
– Stop list
– Stemming
– Inverse document frequency
– Word counts
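The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the stop list, the toy suffix stripper `crude_stem`, and the sample documents are all assumptions made for the example, and the IDF shown is the common log(N / df) form.

```python
import math
import re
from collections import Counter

# Hypothetical stop list for illustration; real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}


def crude_stem(word):
    """Toy suffix stripper (an assumption; NOT a real stemmer such as Porter's)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def word_counts(text):
    """Count how often each preprocessed term occurs in one document."""
    return Counter(preprocess(text))


def inverse_document_frequency(docs):
    """idf(t) = log(N / df(t)), where df(t) is the number of documents containing t."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(preprocess(doc)))
    return {term: math.log(n_docs / count) for term, count in df.items()}


docs = [
    "The cat is chasing the mice",
    "Dogs chased the cat",
    "Indexing of documents",
]
counts = word_counts(docs[0])
idf = inverse_document_frequency(docs)
```

Terms that occur in fewer documents (here "mice") receive a higher IDF than terms that occur in many (here "cat"), which is why IDF is used to down-weight common words.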
What is IR? Why & How?
IR is generally used in 3 scenarios:
1. Web search
2. Personal IR (text classification)
3. Enterprise level
Evaluation Techniques
• Why?
• How? Relevant & non-relevant documents
Precision and Recall
P = #(relevant items retrieved) / #(retrieved items)
R = #(relevant items retrieved) / #(relevant items)
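As a small worked example of the two measures, the sketch below computes precision and recall from sets of document IDs. The function name `precision_recall` and the IDs are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d1", "d3", "d5", "d6", "d7"]

# Two of the four retrieved documents are relevant (d1, d3),
# and two of the five relevant documents were retrieved.
p, r = precision_recall(retrieved_ids, relevant_ids)
```

Here P = 2/4 and R = 2/5: retrieving more documents can raise recall but typically lowers precision.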
Methods:
1. Global Methods: query reformulation
2. Local Methods: relative to the initial results returned for a query
Local Methods
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Relevance Feedback
1. Relevance Feedback
Feedback given by the user about the relevance of the documents in the initial set of results.
1. Relevance Feedback
2. Probabilistic Relevance Feedback
PRF is implemented by building a classifier.
1. Relevance Feedback
2. Probabilistic Relevance Feedback
3. Indirect Relevance Feedback
Works without user intervention:
1. By using user actions
2. By using user histories or logs
Conclusion : Relevance Feedback
Assumption: the user has initial knowledge
Issues:
• Misspelling
• Cross-language retrieval
• Vocabulary mismatch
Rocchio Algorithm
• Incorporates the relevance feedback mechanism into the vector space model.
• Also uses the cosine similarity function
• Euclidean mechanism
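A minimal sketch of the Rocchio update in the vector space model, assuming the standard form q_m = α·q_0 + β·centroid(relevant) − γ·centroid(non-relevant). The weights α = 1.0, β = 0.75, γ = 0.15 are commonly quoted defaults rather than values from these slides, and the three-term vectors are made up for illustration.

```python
import math


def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    dim = len(query)
    c_rel = centroid(relevant, dim)
    c_non = centroid(non_relevant, dim)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, c_rel, c_non)]
    # Negative term weights are usually clipped to zero.
    return [max(0.0, w) for w in updated]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical 3-term vocabulary; the user marked two results relevant, one not.
q0 = [1.0, 0.0, 0.0]
rel = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
non = [[0.0, 0.0, 1.0]]
q1 = rocchio(q0, rel, non)
```

After the update, the modified query `q1` has a higher cosine similarity to the relevant document `rel[1]` than the original query `q0` did, which is the intended effect of the feedback step.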
Example
Outcome
• Relevance feedback plays an important role in understanding the user's requirements.
• The Rocchio algorithm is not the best-performing option, but it is a good practical choice because of its simplicity and good results.
• It is of significant importance for content-based systems.
Classification Problems
• Given:
– A document d
– A fixed set of categories: Sports, Informatics, Literature, Medical, Entertainment
– A training set of documents, each labeled with its class
• Determine:
– A learning method or algorithm that enables us to learn a classifier
– The category of a test document dT
Classification Techniques
• Manual (a.k.a. Knowledge Engineering)
– typically, rule-based expert systems
• Machine Learning
– Naïve Bayesian (Probabilistic)
– Decision Trees (Decision Structures)
– Support Vector Machines (Linear Classification)
Document Representation
• Binary Representation
• Frequency Representation
• TF*IDF Representation
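The three representations can be compared on a tiny corpus. The sketch below is illustrative only: the tokenized documents and function names are assumptions, and the TF*IDF variant shown is raw term frequency times log(N / df), one of several weightings in use.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of all distinct terms across the tokenized documents."""
    return sorted({term for doc in docs for term in doc})


def binary_vector(doc, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def frequency_vector(doc, vocab):
    """Number of times each term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tfidf_vector(doc, vocab, docs):
    """Raw term frequency multiplied by log(N / df) for each vocabulary term."""
    counts = Counter(doc)
    n_docs = len(docs)
    vector = []
    for term in vocab:
        df = sum(1 for d in docs if term in d)  # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(counts[term] * idf)
    return vector


# Hypothetical, already-tokenized documents.
docs = [["ball", "goal", "ball"], ["goal", "match"], ["virus", "doctor"]]
vocab = build_vocabulary(docs)

binary = binary_vector(docs[0], vocab)
frequency = frequency_vector(docs[0], vocab)
tfidf = tfidf_vector(docs[0], vocab, docs)
```

For the first document, "ball" appears twice and in no other document, so it dominates the TF*IDF vector, while "goal" is down-weighted because it also appears in the second document.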
Naïve Bayes document classification example
• Probabilistic
– Prior vs Posterior
• Bernoulli Model
– Feature vector with binary elements
• Multinomial Model
– Integers representing frequency of words
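A minimal sketch of the multinomial model, assuming add-one (Laplace) smoothing and log probabilities to avoid underflow. The training documents, labels, and function names are hypothetical. A Bernoulli variant would instead use binary presence features and also account for absent terms.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(labeled_docs):
    """Estimate log priors and per-class term counts from (tokens, label) pairs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_doc_counts.values())
    log_priors = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    return log_priors, term_counts, vocab


def classify_nb(tokens, log_priors, term_counts, vocab):
    """Pick the class with the highest posterior score (log prior + log likelihoods)."""
    best_label, best_score = None, -math.inf
    for label, log_prior in log_priors.items():
        total_terms = sum(term_counts[label].values())
        score = log_prior
        for term in tokens:
            if term not in vocab:
                continue  # ignore terms never seen in training
            # Add-one smoothing so unseen (term, class) pairs do not zero the score.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set using two of the categories named earlier.
training = [
    (["goal", "match", "team"], "Sports"),
    (["team", "score", "goal"], "Sports"),
    (["doctor", "virus", "patient"], "Medical"),
    (["patient", "treatment", "doctor"], "Medical"),
]
model = train_multinomial_nb(training)
predicted = classify_nb(["goal", "team", "doctor"], *model)
```

With equal priors, the test document is assigned to "Sports" because two of its three terms are frequent in that class; the single "doctor" token is not enough to outweigh them.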
Classify the document
Naïve Bayes classification
• Very fast learning and testing
– Why?
• Low storage requirements
• Very good in domains with many equally important features
• More robust to irrelevant features than many learning methods
Linear Classification
• Documents as labeled vectors
• Documents in the same class form a contiguous region of space
• Documents from different classes don’t overlap (much)
• Learning a classifier: build surfaces to delineate classes in the space
Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
• One possible solution: B1
• Another possible solution: B2
• Other possible solutions
• Which one is better, B1 or B2?
• How do you define better?
• Find the hyperplane that maximizes the margin
[Figure: decision boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; the margin and the support vectors are marked]
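Decision boundary B1: $\vec{w} \cdot \vec{x} + b = 0$
Margin hyperplanes b11 and b12: $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$
$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$
$\text{Margin} = \frac{2}{\|\vec{w}\|}$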
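The decision rule and margin of a linear SVM can be illustrated numerically. The sketch below assumes a weight vector and bias that have already been learned (the values of `w` and `b` are hypothetical), and it uses the sign of w·x + b so that points falling inside the margin are still assigned a class; the margin width is computed as 2 / ||w||.

```python
import math


def decision_value(w, x, b):
    """Signed score w . x + b; zero on the decision boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_predict(w, x, b):
    """Class +1 or -1 from the sign of the decision value."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def margin_width(w):
    """Distance between the hyperplanes w . x + b = 1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical learned parameters for a 2-D problem.
w = [3.0, 4.0]
b = -5.0

on_margin = decision_value(w, [2.0, 0.0], b)   # lies on w . x + b = 1
positive = svm_predict(w, [3.0, 4.0], b)
negative = svm_predict(w, [0.0, 0.0], b)
width = margin_width(w)                        # ||w|| = 5, so width = 2 / 5
```

Because the width is inversely proportional to ||w||, maximizing the margin is equivalent to minimizing ||w|| subject to every training point being on the correct side of its margin hyperplane.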
Questions & Discussion
Bottom Line
• Which classifier do I use for a given document classification problem?
Answer: it depends
– How much training data is available?
– How simple/complex is the problem?
– How noisy is the data?
– How stable is the problem over time?
For an unstable problem, it's better to use a simple and robust classifier.