Techniques of information retrieval

Tariq Hassan & Sabahat

Road Map:
• What is IR?
• Why & how does it work?
• Evaluation Techniques
• Global & Local Methods
  1. Relevance Feedback
  2. Probabilistic Relevance Feedback
  3. Indirect Relevance Feedback
  4. Rocchio Algorithm
  5. Linear Classifiers
  6. Naïve Bayes Text Classification
• Questions & Discussion

What is IR? Why & How?

• What? Retrieving the information needed to satisfy the user.
• Why? Because data comes in different formats.
• How?
  – Stop list
  – Stemming
  – Inverse document frequency
  – Word counts
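
The four steps listed under "How?" can be sketched in a few lines of Python. This is only an illustrative sketch: the stop list, the crude suffix-stripping `naive_stem` helper (a stand-in for a real stemmer), and the toy documents are all assumptions, not part of the slides.

```python
import math
from collections import Counter

# Hypothetical stop list, chosen only for illustration.
STOP_LIST = {"the", "a", "of", "is", "and", "in"}

def naive_stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer such as Porter's.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Lowercase, drop stop-list words, then stem what remains.
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOP_LIST]

def word_counts(text):
    # Term frequencies for a single document.
    return Counter(tokenize(text))

def inverse_document_frequency(docs):
    # idf(t) = log(N / df(t)), where df(t) is the number of documents containing t.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) for term, count in df.items()}

# Toy corpus (assumed for the example).
docs = ["the ranking of documents", "indexing documents", "ranking in search"]
counts = word_counts(docs[0])
idf = inverse_document_frequency(docs)
```

Terms that occur in fewer documents (here `index` and `search`) receive a higher IDF than terms that occur in many.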

What is IR? Why & How?

Generally, IR is used in 3 scenarios:
1. Web search
2. Personal IR (text classification)
3. Enterprise level

Evaluation Techniques

• Why?
• How? Relevant & non-relevant documents

Precision and Recall Methods

P = #(relevant items retrieved) / #(retrieved items)

R = #(relevant items retrieved) / #(relevant items)
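
As a worked illustration of the two ratios, the following sketch computes precision and recall from a retrieved set and a relevant set. The function name and the document IDs are assumptions made up for the example.

```python
def precision_recall(retrieved, relevant):
    # Both arguments are collections of document IDs.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Here P = 2/4 and R = 2/3: half of what was retrieved is relevant, and two thirds of the relevant documents were found.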

Methods:
1. Global Methods
   Query reformulation

2. Local Methods
   Relative to the initial results returned for a query

Local Methods

1. Relevance Feedback

2. Probabilistic Relevance Feedback

3. Indirect Relevance Feedback

1. Relevance Feedback
   Feedback given by the user about the relevance of the documents in the initial set of results.

2. Probabilistic Relevance Feedback
   PRF is implemented by building a classifier.

3. Indirect Relevance Feedback
   Works without user intervention:
   1. By using user actions
   2. By using user histories or logs

Conclusion : Relevance Feedback

Assumption: The user has initial knowledge

Issues:
• Misspellings
• Cross-language retrieval
• Vocabulary mismatch

Rocchio Algorithm
• Incorporates the relevance feedback mechanism into the vector space model.
• Also uses:
  – Cosine similarity function
  – Euclidean mechanism
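
A minimal sketch of a Rocchio-style query update, together with cosine similarity, may make the idea concrete. The slides give no formula, so this uses the commonly cited form q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant). The weights (1.0, 0.75, 0.15), the clipping of negative weights to zero, and the toy vectors are assumptions.

```python
import math

def centroid(vectors, dim):
    # Mean vector of a list of equal-length vectors; zero vector if the list is empty.
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query toward relevant documents and away from non-relevant ones.
    dim = len(query)
    rel_c = centroid(relevant, dim)
    non_c = centroid(non_relevant, dim)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, rel_c, non_c)]
    # Assumed convention: negative term weights are clipped to zero.
    return [max(0.0, w) for w in updated]

def cosine(u, v):
    # Cosine similarity between two vectors; 0.0 if either has zero length.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-term vector space (assumed): the user marked two documents relevant, one not.
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]
new_query = rocchio(query, relevant, non_relevant)
```

After the update the query is closer, in cosine terms, to the documents the user marked as relevant than the original query was.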

Example

Outcome
• Relevance feedback plays an important role in understanding the user's requirements.
• The Rocchio algorithm is not the best-performing method, but it is a practical choice because of its simplicity and good results.
• It is of significant importance for content-based systems.

Classification Problems
• Given:
  – A document d
  – A fixed set of categories: sports, informatics, literature, medical, entertainment
  – A training set of documents, each labeled with its class
• Determine:
  – A learning method or algorithm that enables us to learn a classifier
  – For a test document dT, its category

Classification Techniques

• Manual (a.k.a. Knowledge Engineering)

– typically, rule-based expert systems

• Machine Learning

– Naïve Bayesian (Probabilistic)

– Decision Trees (Decision Structures)

– Support Vector Machines (Linear Classification)

Document Representation

• Binary Representation
• Frequency Representation
• TF*IDF Representation
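
The three representations can be compared on a toy corpus. This is a sketch under stated assumptions: the tokenized documents are invented, and TF*IDF is taken here as raw term frequency times log(N / df), which is only one of several common weighting variants.

```python
import math
from collections import Counter

# Toy tokenized corpus (assumed for the example).
docs = [["ball", "goal", "ball"], ["virus", "goal"], ["virus", "vaccine"]]
vocab = sorted({term for doc in docs for term in doc})

def frequency_vector(doc):
    # Frequency representation: how often each vocabulary term occurs.
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def binary_vector(doc):
    # Binary representation: 1 if the term occurs at all, else 0.
    return [1 if count > 0 else 0 for count in frequency_vector(doc)]

def idf(term):
    # Inverse document frequency over the toy corpus.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf_vector(doc):
    # TF*IDF representation: term frequency weighted by idf.
    return [tf * idf(term) for tf, term in zip(frequency_vector(doc), vocab)]

binary = binary_vector(docs[0])
frequency = frequency_vector(docs[0])
tfidf = tfidf_vector(docs[0])
```

In the first document, `ball` gets a higher TF*IDF weight than `goal` both because it occurs twice and because it appears in fewer documents.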

Naïve Bayes document classification example

• Probabilistic
  – Prior vs. posterior
• Bernoulli Model
  – Feature vector with binary elements
• Multinomial Model
  – Integers representing the frequency of words
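
A small sketch of the multinomial model may help show how prior and word frequencies combine into a posterior score. The training documents and labels are invented, and add-one (Laplace) smoothing is an assumption that the slides do not mention.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    # Collect per-class document counts (for priors) and per-class term counts.
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, term_counts, vocab

def classify(tokens, model):
    # Pick the class with the highest log prior + sum of log term likelihoods.
    class_doc_counts, term_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for term in tokens:
            if term not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing (an assumption) avoids zero probabilities.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (assumed for the example).
training = [
    (["goal", "match", "team"], "sports"),
    (["match", "score"], "sports"),
    (["virus", "vaccine", "dose"], "medical"),
]
model = train_multinomial_nb(training)
label_a = classify(["match", "goal", "virus"], model)
label_b = classify(["dose", "vaccine"], model)
```

Training is a single counting pass over the documents and classification is one pass over the test document per class, which is one reason this kind of classifier is fast to learn and test.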

Classify the document

Naïve Bayes classification

• Very fast learning and testing
  – Why?
• Low storage requirements
• Very good in domains with many equally important features
• More robust to irrelevant features than many learning methods

Linear Classification

• Documents as labeled vectors
• Documents in the same class form a contiguous region of space
• Documents from different classes don’t overlap (much)
• Learning a classifier: build surfaces to delineate classes in the space
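
One simple way to learn such a separating surface is the perceptron rule, sketched below. The slides do not name a specific training algorithm, so the perceptron is swapped in purely for illustration; the two-dimensional document vectors and labels are assumptions.

```python
def predict(w, b, x):
    # Which side of the hyperplane w.x + b = 0 does x fall on?
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def train_perceptron(data, epochs=20, lr=1.0):
    # data is a list of (vector, label) pairs with labels +1 or -1.
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            if predict(w, b, x) != y:
                # Nudge the hyperplane toward classifying x correctly.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:
            break  # every training point is on the correct side
    return w, b

# Toy labeled document vectors (assumed): class +1 has a larger first component.
data = [
    ([2.0, 0.0], 1),
    ([3.0, 1.0], 1),
    ([0.0, 2.0], -1),
    ([1.0, 3.0], -1),
]
w, b = train_perceptron(data)
new_label = predict(w, b, [4.0, 1.0])
```

The learned `w` and `b` define one separating hyperplane among many possible ones; the perceptron makes no attempt to choose the one with the widest margin.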

Support Vector Machines

• Find a linear hyperplane (decision boundary) that will separate the data


• One possible solution (figure: boundary B1)
• Another possible solution (figure: boundary B2)
• Other possible solutions


• Which one is better? B1 or B2?
• How do you define better?


• Find the hyperplane that maximizes the margin

[Figure: decision boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; the margin is marked]

Support Vector Machines

[Figure: decision boundaries B1 and B2 with margin hyperplanes b11, b12 and b21, b22; the margin and the support vectors are marked]


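[Figure: decision boundary B1 with margin hyperplanes b11 and b12]

Decision boundary B1: $\vec{w} \cdot \vec{x} + b = 0$

Margin hyperplanes b11 and b12: $\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$

$$f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$$

$$\text{Margin} = \frac{2}{\|\vec{w}\|}$$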
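
The decision rule and the margin width can be checked numerically. In this sketch the weight vector `w` and bias `b` are hypothetical values picked by hand, not the result of training an SVM, and the margin is computed as 2 / ||w||, the distance between the hyperplanes w.x + b = 1 and w.x + b = -1.

```python
import math

def decision_value(w, b, x):
    # Signed score w.x + b; its sign gives the predicted class.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_classify(w, b, x):
    # +1 on one side of the decision boundary, -1 on the other.
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    # Distance between the two margin hyperplanes: 2 / ||w||.
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane, chosen by hand for illustration.
w, b = [1.0, -1.0], 0.0

# Points lying exactly on the margin hyperplanes act as support vectors.
support_pos = [1.0, 0.0]  # w.x + b = 1
support_neg = [0.0, 1.0]  # w.x + b = -1

margin = margin_width(w)
```

Scaling `w` and `b` up shrinks the computed margin, which is why maximizing the margin amounts to keeping ||w|| small while still satisfying the two constraints.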



Questions & Discussion

Bottom Line
• Which classifier do I use for a given document classification problem?
  Answer: It depends.
  – How much training data is available?
  – How simple/complex is the problem?
  – How noisy is the data?
  – How stable is the problem over time?

For an unstable problem, it's better to use a simple and robust classifier.
