Meta-Search Engine based on
Query-Expansion Using Latent Semantic Analysis and
Probabilistic Latent Semantic Analysis
DISSERTATION
SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF TECHNOLOGY IN INFORMATION TECHNOLOGY
(SOFTWARE ENGINEERING)
Under the Supervision of Dr. Sudip Sanyal, Associate Professor
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY – ALLAHABAD
(DEEMED UNIVERSITY)
DEOGHAT, JHALWA
ALLAHABAD- 211011, (U.P.)
INDIA
IIIT-Allahabad
Submitted by Anand Arun Atre
MS200504 M.Tech. IT (Software Engineering)
IIIT-Allahabad
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD
(A University Established under sec. 3 of UGC Act, 1956 vide Notification No. F.9-4/99-U.3 Dated 04.08.2000 of the Govt. of India)
(A Centre of Excellence in Information Technology Established by Govt. of India)
Date: ______________
We do hereby recommend that the thesis work prepared
under my/our supervision by Anand Arun Atre entitled
“Meta-Search Engine based on Query-Expansion Using
Latent Semantic Analysis and Probabilistic Latent
Semantic Analysis” be accepted in partial fulfillment of
the requirements of the degree of Master of Technology in
Information Technology (Software Engineering) for
examination.
COUNTERSIGNED
Dr. Sudip Sanyal ______________________________ THESIS ADVISER
Dr. U. S. Tiwary DEAN (ACADEMICS)
CERTIFICATE OF APPROVAL*
The foregoing thesis is hereby approved as a creditable
study in the area of Information Technology carried out
and presented in a manner satisfactory to warrant its
acceptance as a pre-requisite to the degree for which it
has been submitted. It is understood that by this
approval the undersigned do not necessarily endorse or
approve any statement made, opinion expressed or
conclusion drawn therein but approve the thesis only
for the purpose for which it is submitted.
COMMITTEE ON
FINAL EXAMINATION
FOR EVALUATION
OF THE THESIS
*Only in case the recommendation is concurred in
Acknowledgement
The satisfaction and bliss that accompany the successful completion of any task would be incomplete without mentioning the people who made it possible, because success is not only the result of hard work and perseverance but also of encouraging guidance and tremendous help. This thesis, while an achievement that bears my name, would not have been possible without the help of others. I gladly take this opportunity to thank the people who helped me make this work possible.
At the outset, I thank Almighty God for the divine grace and blessings showered on me, giving me the strength and courage to complete this thesis work and, in turn, my course successfully.
It is my privilege to study at the Indian Institute of Information Technology, Allahabad, where students and professors are always eager to learn new things and to make continuous improvements by providing innovative solutions. I am highly grateful to the honorable Director of IIIT-Allahabad, Dr. M. D. Tiwari, for his ever-helpful attitude and for encouraging us to excel in our studies. I am also thankful to Dr. U. S. Tiwary, Dean Academics, IIIT-Allahabad, for providing all the necessary requirements and his moral support for this dissertation work.
Regarding this thesis work, first and foremost, I would like to heartily thank my supervisor, Dr. Sudip Sanyal, for his able guidance. His fruitful suggestions, valuable comments and support were an immense help to me. In spite of his hectic schedule, he took pains, with a smile, in various discussions which enriched me with new enthusiasm and vigour.
I owe my gratitude to Mr. Mithilesh Mishra, Coordinator, IIIT-A Network Development, Engineering & Management (INDEM), for granting me special access privileges on the IIIT-A network, bypassing the proxy, for the successful execution of the search engines’ APIs. I am also thankful to Mr. Balwant Singh, in-charge of the Maintenance Cell, for issuing me a computer system with all the necessary accessories and configuration.
Now, I would like to mention my classmates who, directly or indirectly, helped me a lot. First, my thanks go to one of my best friends ever, Mr. Mohd. Imran Khan. From the beginning to the completion of this project, he discussed with me many of the intricacies and complexities of the project, and he always directed me towards writing efficient and more maintainable code. He is fully aware of all the thicks and thins of my project, which is evidence of how much the success of the whole thesis depends on him. Next, I would like to thank Mr. Kamal Sawan for helping me execute the Google API during the early days of the project; Mr. Prabhat Saheja, who is well familiar with the .NET framework and C#, for the successful execution of the MSN API; and Mr. Nilesh Chandra Shukla and Mr. Vineet Chauhan, who helped me a lot during the coding phase.
It has been a wonderful experience to spend the two most exciting years of my life at IIIT-A with friends like Mr. Adish Singh, Mr. Abhay Sukhdeo Pawane, Mr. Pankaj Kandpal, Mr. Sampath Kumar Mada and Mr. Sahab Nath Yadav, who always motivated me to pursue this project with sincerity and patience.
I am also profoundly thankful to Mr. Parikshit Totawar and Mr. Shrikant Mantri for assisting me in learning Perl.
I also wish to extend my thanks to Mr. K. Ashwin Kumar and Mr. Prabhash Dhyani, members of the INDEM team, for solving the network-access problems related to the extra privileges assigned to my IIITA account. Students of the pre-final year of B.Tech., especially Mr. Animesh Nayan and Mr. B. Ravikiran Rao, helped me greatly with the efficient usage of the “NER” library.

I also owe my thanks to Mr. Dhirendra Pratap Singh, Mr. Mahindra Giri Vasireddy and Ms. Megha Thakkar for suggesting some nice improvements during implementation.
Lastly, I would like to express my warm gratitude to my grandmother, my parents and my maternal uncle, Mr. M. A. Vetal, for their unbounded love and priceless support throughout my life. Their support has kept me striving for success. I hope that with the completion of this course, I have made them proud.
Anand Arun Atre
14-July-2007
DECLARATION
This is to certify that this thesis work entitled “Meta-Search Engine based on Query-Expansion using Latent Semantic Analysis and Probabilistic Latent Semantic Analysis”, which is submitted by me in partial fulfillment of the requirements for the completion of the M.Tech. in Information Technology with specialization in Software Engineering at the Indian Institute of Information Technology, Allahabad, comprises only my original work, and due acknowledgement has been made in the text to all other materials used.
Name : Anand Arun Atre
M.Tech.-IT: Software Engineering
Enrolment No: MS200504
Abstract
As a result of the rapid advancements in Information Technology, Information Retrieval on the Internet (Internet searching) is gaining importance day by day. Search engines are admittedly essential tools for this purpose. But, like the two sides of the same coin, search engines’ performance degrades due to some critical issues. This fact motivates another solution, namely the implementation of a Meta-Search Engine. This thesis presents an analysis of the applicability of the Probabilistic Latent Semantic Analysis technique for performing Query Expansion in the context of Meta-Search Engines. The basic idea is to refine results using query expansion. Our experiments clearly demonstrate that the technique gives excellent results for query expansion, with distinct senses of the query keywords being grouped into different topics. Moreover, the applied method converges very rapidly, thus providing an efficient and extremely pragmatic method for query expansion. We also compare our results with those obtained using Latent Semantic Analysis.
Keywords: Meta-Search Engine, Query-Expansion, Latent Semantic Analysis,
Probabilistic Latent Semantic Analysis, Convergence.
Table of Contents
Acknowledgement ..........................................................................................................i
DECLARATION .......................................................................................................... iv
Abstract.......................................................................................................................... v
Table of Contents .........................................................................................................vi
List of Tables ..............................................................................................................viii
List of Figures ..............................................................................................................ix
Introduction.................................................................................1
1.1 Overview.................................................................................1
1.2 Objective................................................................................1
1.3 Motivation...............................................................................2
1.4 Problem Statement........................................................................4
1.5 Contribution of Thesis...................................................................5
1.6 Structure of Thesis......................................................................5
1.7 Summary..................................................................................6
Literature Survey............................................................................7
2.1 Current Trends in Meta-Search Engine....................................................7
2.2 Vector Space Model.......................................................................8
2.3 Latent Semantic Analysis (LSA)..........................................................12
2.3.1 Concept of LSA........................................................................12
2.3.2 Limitations of LSA....................................................................19
2.3.3 Advantages and Applications of LSA....................................................20
2.4 Probabilistic Latent Semantic Analysis (PLSA)...........................................21
2.4.1 Concept of PLSA.......................................................................21
2.4.2 PLSA Algorithm........................................................................23
2.4.3 Advantages and Applications of PLSA...................................................26
2.5 Summary.................................................................................27
Proposed Meta-Search Engine.................................................................28
3.1 Basic Theme.............................................................................28
3.2 Architecture of Proposed MSE............................................................29
3.3 Implementation Details..................................................................31
3.4 Features of Proposed System.............................................................34
3.5 Summary.................................................................................35
Result and Analysis.........................................................................36
4.1 Result Analysis of LSA..................................................................36
4.1.1 Value of ‘k’ for Optimal Rank Approximation of Term-Document Matrix...................36
4.1.2 Comparison of “Tf-IDf Measure” to “Term-Count Measure”................................38
4.2 Result Analysis of PLSA.................................................................40
4.2.1 Optimal Value for Number of Topics (a)................................................41
4.2.2 Convergence...........................................................................43
4.2.3 Number of Iterations for Convergence..................................................46
4.2.4 PLSA Slide-Shots......................................................................46
4.3 Convergence in Number of Unique Links after Some Iterations.............................48
4.4 Comparison between LSA and PLSA Results.................................................48
4.5 Comparison with Dogpile Search-Engine...................................................49
Improvements from NER (Named-Entity Recognizer).............................................52
5.1 Introduction............................................................................52
5.2 Modified Architecture of Meta-Search Engine.............................................54
5.3 Modified High-Level Design..............................................................55
5.4 Results of NER..........................................................................57
5.5 Summary.................................................................................58
Conclusion and Future Enhancements..........................................................59
6.1 Conclusion..............................................................................59
6.2 Future Enhancements.....................................................................59
Appendix-A: Search Engines’ API.............................................................61
Appendix-B: Parser’s API....................................................................65
Appendix-C: List of Stop Words..............................................................67
Appendix-D: JAMA API (for SVD)..............................................................70
References..................................................................................72
List of Tables
Table 2.1 Titles representing a small corpus. ......................................................17
Table 2.2 Term-Document Representation of corpus (T). .................................................17
Table 2.3 Complete SVD of T. ..............................................................................................18
Table 2.4 Reconstruction of Original Matrix. ....................................................................19
Table 2.5 Four aspects (topics) that are most likely to generate the term ‘Cricket’. ..........25
Table 4.1 Next keywords for query “India Tourism” for different values of ‘K’. ................37
Table 4.2 Next Keywords for query “Thread” using Term-Count Measure. ..................38
Table 4.3 Next Keywords for query “Thread” using Tf-IDf Measure..............................38
Table 4.4 Results of PLSA for query “Thread” ..................................................................40
Table 4.5 Results of PLSA for query “Australian University”..........................................40
Table 4.6 Results of PLSA for query “India Tourism” for different values of number of topics ‘a’ = 1, 2, 3. ....................42
Table A.1 Classes and Methods of Google SOAP Search API..........................................61
Table A.2 Classes and Methods of Yahoo Search Web Service API................................63
Table C.1 List of stop words. ...............................................................................................67
Table D.1 Classes and Methods of JAMA API ..................................................................70
List of Figures
Fig. 1.1. Anatomy of Crawler Based Search Engine.............................................................3
Fig. 2.1 Document Representation in Term Space................................................................9
Fig. 2.2 SVD of Term-Document Matrix ‘T’ .......................................................................14
Fig. 2.3 Rank K approximation of original matrix T..........................................................15
Fig. 2.4 Two Matrix Formations from PLSA. ......................................................24
Fig. 3.1 Architecture of proposed Meta-Search Engine .....................................................29
Fig. 3.2 High-Level Design (Package Diagram) ..................................................................31
Fig 4.1 Graphical User Interface for LSA ...........................................................................39
Fig 4.2 Convergence in Term- Topic Matrix computed by Absolute Measure ................44
Fig 4.3 Convergence in Topic-Document Matrix computed by Absolute Measure .........44
Fig 4.4 Convergence in Term-Topic Matrix computed by Average Measure..................45
Fig 4.5 Convergence in Topic-Document Matrix computed by Average Measure..........46
Fig 4.6 GUI representing results for query “Thread”........................................................47
Fig 4.7 GUI representing results for query “India Tourism”............................................47
Fig 4.8 Behavior of num. of unique web-links to iterations for Query Expansion...........48
Fig 4.9 Top ten results of Meta-Search Engine “Dogpile” for Query “Thread”............50
Fig 4.10 Top ten results of proposed MSE for Expanded Query “Thread Package Java”
.........................................................................................................................................50
Fig 4.11 Top ten results of proposed MSE for Expanded Query “Thread Dress” ..........51
Fig. 5.1 Architecture of Modified Meta-Search Engine......................................................54
Fig. 5.2 High-Level Design with NER (Package Diagram).................................................55
Fig. 5.3 A text file before and after Named-Entity Recognition .......................................56
Fig. 5.4 GUI after applying NER for query “India Tourism” ...........................................58
Chapter 1
Introduction
This chapter presents an overview of the thesis. It gives the reader an insight into the current situation in the field of searching on the Internet, and describes the objective, motivation and problem statement of the thesis. Finally, it presents the organization of the thesis.
1.1 Overview
The focus of this thesis to add a new dimension to Internet-Searching and that is
to apply semantic aspects towards it. In precise words,“the search must be what user
wish, not what he/she types”. In the current scenario users are flooded with numerous
web-links (urls) given by Search-Engines (SEs). Hence the users waste their useful
time in navigating through undesired links, searching the needed one. The prime
reason for this is that the SEs index the pages on the basis of key-words. On the other
hand, when we are searching the internet we quite often may not know the correct and
complete set of key words that might have led us to the desired url. In order to
overcome this shortcoming we need to devise a method that will allow the user to find
the relevant key words starting from the few key words that he/she may actually
know. In other words, we need to look into the semantics of the key words. This
thesis suggests a new approach that is based on some algorithms which considers
semantic aspects and uses them to implement a Meta-Search Engine (MSE).
1.2 Objective
‘To develop a Meta-Search Engine for refining the search results of existing Search Engines by Query Expansion using the Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA) algorithms.’
The project activity basically consists of an implementation of both the LSA and PLSA algorithms on the results of basic search engines in order to refine them by Query Expansion (QE). LSA has already been recommended for QE in Internet searching [1]. An essential component of this thesis is to compare the performance of these two algorithms empirically and to analyze the various factors which affect the results. The thesis concludes that PLSA outperforms LSA.
1.3 Motivation
In the current scenario, Information Technology is advancing rapidly. World Wide
Web or Internet is one of the best achievements of it. Internet can be treated as a huge
repository of information and sophisticated methods are always required to extract
needed information. Search Engines like Google, Yahoo and MSN are really
necessary tools to retrieve needed information.
Most of such search-engines basically perform Crawler-Based Search. These SEs
generally consist of a WebCrawler - a program that crawls the web, an Indexing
Technique, some Encoding Mechanism and a huge Database. These SEs use crawlers
(spiders) for information collection on the web. Then indexing, encoding and storing
of collected data are performed subsequently [2,3]. Following diagram represents the
anatomy of search-engine.
Steps of Crawler-Based Search Engines [2]:
1. Web Crawling: Search engines use a special program called a Robot or Spider which crawls (travels) the web from one page to another. It visits popular sites and then follows each link available at those sites.
2. Information Collection: The spider records all the words and their respective positions on the visited web page. Some search engines do not consider common words such as articles (‘a’, ‘an’, ‘the’) and prepositions (‘of’, ‘on’).
Fig. 1.1 Anatomy of Crawler Based Search Engine
3. Build Index: After collecting the data, search engines build an index to store it so that users can access pages quickly. Different search engines use different approaches for indexing, which is why they give different results for the same query. Important considerations for building indexes include: the frequency with which a term appears in a web page, the part of the web page where the term appears, and the font size of the term (whether capitalized or not). In fact, Google ranks a page higher if a greater number of pages vote for (link to) that particular page.
4. Data Encoding: Before the indexing information is stored in databases, it is encoded into a reduced size to speed up the response time of the search engine.
5. Store Data: The last step is to store this indexing information in databases.
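The indexing step above can be sketched as a small inverted index that records, for each term, the pages it occurs in and its positions there, skipping common stop words. This is only an illustrative sketch under invented data (the page texts and the stop-word list are hypothetical), not the implementation of any particular search engine.

```python
from collections import defaultdict

# Articles and prepositions commonly skipped, as noted in step 2 above
STOP_WORDS = {"a", "an", "the", "of", "on"}

def build_index(pages):
    """pages: {url: text}. Returns an inverted index: term -> {url: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][url].append(pos)
    return index

# Toy pages standing in for crawled data
pages = {"url1": "The anatomy of a search engine",
         "url2": "search the web"}
index = build_index(pages)
```

A lookup such as `index["search"]` then yields every page containing the term together with its positions, which is exactly the frequency and position information the ranking considerations above rely on.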
However, extracting desired information quickly and easily is a common problem that users face [4]. Keyword selection for searching is also a critical issue, and very few users utilize the full power of SEs [5]. Along with all of the above, some surveys also suggest the following facts:
1. No search engine is capable of covering more than one third of the web pages available on the Internet [6].
2. Sometimes search engines give results which contain obsolete or dead links [6].
A study was performed to evaluate the overlap among the first-page results of three SEs, namely Google, Yahoo and AskJeeves [7]. The study reveals that 85% of the links are unique, while 12% of the links were found common to any two of the search engines. Only 3% of the links were common to all three search engines. This very small amount of overlap shows significant differences in the ranking and retrieval policies of search engines. From these data we can infer that internet users who use only a single search engine may miss needed and relevant results [8].
These facts motivate the implementation of an Indirect Search Engine (also called a Meta-Search Engine), which combines the results of existing search engines and refines them using some algorithm, or presents them in a format which is more user-friendly. A simple distinction between Search Engines and a Meta-Search Engine (MSE) is that the latter does not require crawling the web, and hence needs neither indexing nor databases. The main directions for implementing an MSE are to improve the user interface and to filter results according to user needs. All such current trends are illustrated, with full details and their limitations, in Section 2.1 of Chapter 2 Literature Survey. None of them is based on peering into the semantics (meaning) of content and refining results by query expansion. All these factors give strong motivation to implement an MSE which addresses all the shortcomings explained above.
1.4 Problem Statement
Approaches for implementing a Meta-Search Engine that are suggested till now,
are not refining the search-results up to the desired level. Such approaches are based
on either extracting user preferences or maintaining user profile. They also do not
address the problem of Synonymy (where more than two terms can be used to
represent same object) and Polysemy (same term may represent different meaning in
different context). The one and only reason behind this is that the current Meta-Search
Engines do not consider the semantic aspect of a term. To apply some algorithm on
search-results may provide better solutions to explained problem. So the main
problem of concern is to choose appropriate algorithms which can solve the above
mentioned problem and to use them for implementing a Meta-Search Engine.
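The two problems can be made concrete with a toy sketch (the documents below are invented for illustration only): naive keyword matching misses a synonymous document entirely, while a polysemous keyword matches documents from unrelated senses indiscriminately.

```python
def keyword_match(query, doc):
    # Naive keyword-based retrieval: a document matches if any query word occurs in it
    doc_words = set(doc.lower().split())
    return any(w in doc_words for w in query.lower().split())

# Synonymy: "car" and "automobile" denote the same object, yet the match fails
print(keyword_match("car", "automobile dealers and showrooms"))   # False

# Polysemy: "thread" matches both the programming sense and the textile sense
print(keyword_match("thread", "java thread synchronization"))     # True
print(keyword_match("thread", "cotton thread and needle"))        # True
```

Because the match sees only surface strings, it has no way to rank the two senses of “thread” or to recover the “automobile” document; this is precisely the gap that semantic techniques must fill.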
1.5 Contribution of Thesis
In this thesis we have designed and implemented a complete MSE. The design of
the MSE is quite flexible and provides the facility to add the results of new SEs, when
they become available. We have developed a method for performing QE on the results
returned by the SEs using PLSA and also incorporated the existing methodologies
available using LSA. Extensive experiments are performed to compare the results
obtained with PLSA and LSA for the task of QE. Analysis of these results clearly
demonstrates that PLSA outperforms LSA. Further analysis also reveals some
shortcomings. Methods for overcoming these shortcomings are also discussed.
1.6 Structure of Thesis
The thesis comprises of various chapters. An overview and objective of thesis is
presented in Chapter1 Introduction. It then demonstrates all the factors responsible
for motivation of thesis. Problem statement, contribution and structure of this thesis
are illustrated next to that.
Chapter 2 Literature Survey describes the current trends in the design and implementation of Meta-Search Engines. It further describes how the proposed idea overcomes their problems and then gives exhaustive details of the algorithms used by the proposed MSE, namely the Vector Space Model, Latent Semantic Analysis and Probabilistic Latent Semantic Analysis.
Chapter 3 Proposed Meta-Search Engine illustrates the architecture of the system with all its components. It also presents the necessary requirements and the corresponding implementation details.

Chapter 4 Result and Analysis first shows the results of LSA for different critical factors. Results of PLSA are then presented for different cases, followed by a comparison between the two. This chapter contains elucidating examples with diagrams, slide-shots and graphs which reinforce the superiority of PLSA over LSA. Finally, the results of the suggested MSE are compared to the open-source Meta-Search Engine “Dogpile”.
Chapter 5 Improvements from NER demonstrates the addition of a new module to the Meta-Search Engine, namely the “Named Entity Recognizer”. It then illustrates the modifications, effects and consequences of adding this new component.

Chapter 6 Conclusion and Future Enhancements presents the conclusion of the thesis and suggests some future enhancements to it.

The appendix section of the thesis contains details about the various search-engine APIs, the HTML parser, the PDF-to-text converter, the JAMA API (for SVD), etc.
1.7 Summary
The deriving reasons for implementing a Meta-Search Engine (MSE) have been
illustrated. Proposed MSE is based on the phenomena of Query-Expansion which
relies on the fact that after few successive iterations of firing expanded query the
results will be automatically refined. For the purpose of query-expansion some
algorithms have been proposed that try to focus on semantic aspect behind a query.
The next chapter explains all the algorithms used and their respective effects.
Chapter 2
Literature Survey
2.1 Current Trends in Meta-Search Engine
In this chapter we survey some of the methods that have been used in developing Meta-Search Engines (MSEs), and examine the strengths and weaknesses of the various approaches. There are three main directions for implementing a Meta-Search Engine [4]:
1. Improvement of the user interface
2. Filtering the results of a query
3. Applying algorithms for the indexing of web pages.
A heavy emphasis on user requirements is recommended in the architecture of a Meta-Search Engine [9]. A personalized Meta-Search Engine, Excalibur [10], has already been proposed that provides quick responses with re-ranked results after extracting user preferences. It uses a Naïve Bayesian Classifier for re-ranking.
Some MSEs use proxy log records to extract users’ access patterns and store these patterns in a database. A relevance score is estimated, using some heuristic, for each user and each URL that he/she visited. A profile is maintained for each user which contains the most relevant currently visited URLs. The relevance of these URLs, with their respective relative positions, is updated in the profile when the user visits those links in the future [6].

Current research also suggests a framework for a Meta-Search Engine based on Agent Technology [11]. An enhanced version of the open-source Helios Meta-Search Engine takes input keywords along with a specified context or language and gives refined results as per the user’s need [8].
All the proposed solutions refine search results to some extent, but they have a serious drawback: the user profile is not stationary. A user who is currently new to the context of a search topic may become experienced over the course of time, and his/her requirements may vary accordingly. Managing and consulting the log of previous searches may thus lead to inappropriate search results. This observation leads us to consider alternative methods of re-ranking, offered by purely statistical methods like Latent Semantic Analysis (interchangeably called Latent Semantic Indexing) and the newly introduced Probabilistic Latent Semantic Analysis (interchangeably called Probabilistic Latent Semantic Indexing), which promises to give results that are more accurate than those of Latent Semantic Analysis. Thus, the emergence of these algorithms and the need for robust meta-search engines are the catalysts of the present thesis.
Tests performed by various groups reveal that Latent Semantic Analysis (LSA)
and Probabilistic Latent Semantic Analysis (PLSA) give robust results for
Information Retrieval when the task is to find the most relevant documents in a
given corpus for a given query. In the present thesis we contend that LSA and
PLSA can also be used to perform query expansion, i.e. given a keyword set as a
query, we would like these algorithms to automatically suggest additional keywords
that help refine the search. As both methods build on the Vector Space Model, the
following sections first examine the vector space model. We then extend this model
to LSA and PLSA and examine how these can be used to perform query expansion.
2.2 Vector Space Model
Most text-retrieval techniques are based on indexing keywords. Since keywords
alone cannot sufficiently capture the whole content of a document, they result in
poor retrieval performance. Indexing keywords is nevertheless still the most
practical way to process large corpora of text. After the significant index terms have
been identified, a document can be matched to a given query by a Boolean model or
a statistical model. A Boolean model matches according to the extent to which an
index term satisfies a Boolean expression, while a statistical model uses statistical
properties to measure the similarity between query and document [12].
In 1975, Gerard Salton [13] proposed a statistically based "Vector Space Model",
built on the idea of placing the documents in an n-dimensional space, where n is the
number of distinct terms or words (t1, t2, …, tn) that constitute the whole vocabulary
of the corpus or text collection. Each dimension corresponds to a particular term.
Each document is represented as a vector D1, D2, …, Dr, where r is the total number
of documents in the corpus. A document vector can be written as follows:
Dr = { d1r , d2r , d3r , …, dnr }
where dir is the ith component of the vector representing the rth document [12].
[Figure: documents Doc 1 and Doc 2 and a query plotted in the space of the three
terms "Information", "Retrieval" and "System".]
Fig 2.1 Document Representation in Term Space
The above figure shows the representation of documents Doc 1 and Doc 2 in the
space of three terms, namely "Information", "Retrieval" and "System". The three
mutually perpendicular dimensions, one per term, represent "term independence".
This independence can be of two types, namely linguistic and statistical.
When the occurrence of one term does not depend on the appearance of another
term, it is called statistical independence. Under linguistic independence, the
interpretation of a term does not rely on any other term [14].
This assumption of pair-wise term orthogonality is not realistic, but it is
acceptable as a first approximation [15].
The Vector Space Model is traditionally used when a collection of documents is
placed in term space and the most relevant document for a given query must be
found. A query is treated just like a very short document. The similarity between
the query and all documents in the collection is computed, and the best matching
documents are returned.
Various similarity measures have been proposed; one that is very frequently
used is Cosine Similarity:
cos θ = (Q · D) / (|Q| * |D|)
The above expression is the cosine of the angle between the two vectors in the
term space. The most relevant document is the one nearest to the given query. In
the same way, two documents are considered related if they lie in each other's
neighbourhood.
Other similarity measures are [14]:
• Inner Product = Σj Qj * Dj
• Dice Coefficient = 2 Σj Qj * Dj / { Σj Qj^2 + Σj Dj^2 }
• Jaccard Coefficient = Σj Qj * Dj / { Σj Qj^2 + Σj Dj^2 − Σj Qj * Dj }
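To make these measures concrete, the following sketch computes each of them for a query and two documents in a three-term space. It is an illustration only (Python is used here for brevity; the thesis implementation itself is in Java, and the toy vectors are invented):

```python
import math

def inner(q, d):
    # Inner product: sum of pairwise products of the components
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    # Cosine similarity: inner product normalized by the vector lengths
    return inner(q, d) / (math.sqrt(inner(q, q)) * math.sqrt(inner(d, d)))

def dice(q, d):
    return 2 * inner(q, d) / (inner(q, q) + inner(d, d))

def jaccard(q, d):
    return inner(q, d) / (inner(q, q) + inner(d, d) - inner(q, d))

# Hypothetical query and documents in the 3-term space
# ("information", "retrieval", "system") of Fig 2.1.
q  = [1, 1, 0]
d1 = [2, 1, 0]   # "information" twice, "retrieval" once
d2 = [0, 1, 3]   # mostly about "system"
print(cosine(q, d1) > cosine(q, d2))  # True: d1 is nearer to the query
```

All four measures agree that d1 is the better match here; they differ mainly in how they normalize for vector length.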
Each component of a document vector is associated with a numeric factor called
the weight of the respective term in the document. This weight, wi, can simply be
the term count or term frequency (tfi). This assignment leads to a variation of the
model called the "Term Count Model". This model is sensitive to term repetition:
long documents score higher simply because they are longer and may repeat many
terms very often, not because they are more relevant [16]. Lee, Chuang and
Seamons compared different term-weight models and concluded that the term-count
model does not give the best result in any situation [12].
Hence, in the traditional Vector Space Model, the tf x idf method is used to
determine the weight of a term in a given document vector. It is based on two factors:
1. The number of occurrences of term 'b' in document 'a' (term frequency tf a, b)
2. The number of documents in the collection that contain term 'b' (document
frequency df b) [12]
So, the weight of a term ‘b’ in given document ‘a’ can be written as
W a, b = tf a, b * idf b = tf a, b * log (N / df b)
where,
N= number of documents in the document-collection
idf= inverse document frequency
tf = term frequency
This model incorporates both local and global information. The first factor, tf a,b,
accounts for the local weight. The ratio (df b / N) is the probability of selecting,
from the collection, a document that contains the queried term; it can be treated as a
global probability for the whole collection. Thus log (N / df b), the Inverse
Document Frequency, accounts for the global information. This measure gives
better results than other term-weight measures [16].
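The weighting formula above can be sketched in a few lines. The collection size and term statistics below are invented for illustration (natural log is assumed; the base only rescales all weights uniformly):

```python
import math

def tfidf_weight(tf, df, n_docs):
    # w(a, b) = tf(a, b) * log(N / df(b)); natural log used here
    return tf * math.log(n_docs / df)

# Hypothetical collection of N = 1000 documents:
# a near-ubiquitous term gets almost no weight even at high tf,
# while a rare term gets a large weight from a few occurrences.
print(tfidf_weight(tf=10, df=990, n_docs=1000))  # ~0.10 (non-discriminative)
print(tfidf_weight(tf=3, df=20, n_docs=1000))    # ~11.74 (discriminative)
```

This shows how the idf factor suppresses terms that occur in almost every document, exactly the global information the paragraph above describes.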
The Vector Space Model suffers from the following limitations:
1. It assumes term independence.
2. It is computationally intensive and requires long processing times.
3. Vectors must be recalculated whenever a new term is added.
4. Long documents make similarity measures difficult [16].
All these limitations can be mitigated by some means, but the main disadvantage
of this model is its lack of treatment of synonymy and polysemy. Synonymy means
separate terms having the same meaning; it describes the fact that there are many
ways to refer to the same object. Users with different needs and knowledge describe
the same information using different terms. For example, the terms car and
automobile are often used interchangeably. Such synonyms tend to decrease
"recall": when we use the term "car" as a query term, the Vector Space Model will
only look for documents containing the term "car" and will ignore those documents
that contain the term "automobile". Polysemy stands for terms with multiple distinct
meanings (homography). The use of such a term in a search query does not
necessarily mean that a document containing or labeled by the same term is of
interest; polysemy reduces the precision of the results. This can be understood by
considering the word "bank". We can have a "river bank", a "bank" as a financial
institution, or even the "banking of an airplane". If the word "bank" is given as a
query term, the Vector Space Model will pick up documents that contain the word
"bank" regardless of the sense in which it is used, whereas the user was obviously
interested in only one of the possible senses. The search results will therefore
contain a large number of results that do not match what the user desired.
All these problems arise because there is no connection between topics and
terms: the vector space model does not allow searching based on terms in a specific
topic or context. Topics are not directly visible as terms; they are latent and related
to semantic meaning. Most search engines perform only term matching and are not
based on the semantic aspects of terms. Thus, we need methods that try to capture
the semantic aspects of the terms. This leads us to the latent semantic analysis
methods described next.
2.3 Latent Semantic Analysis (LSA)
2.3.1 Concept of LSA
In 1990, Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K.
Landauer and Richard Harshman [17] proposed a method that can be used to judge
the similarity of meaning of terms and paragraphs by analyzing a large text corpus.
This method is called Latent Semantic Analysis (LSA).
LSA is a statistical/mathematical technique for eliciting and inferring
relationships among the usage of words in a paragraph for a given context. It uses
no artificial intelligence or natural language processing technique; its functioning is
not based on grammars, parsers or dictionaries [17]. It tries to discover something
about the meaning of the words and about the topics in text documents [18, 19].
LSA is based on the principle of dimensionality reduction. As Internet
technology advances, the number of electronically available documents increases at
an exponential rate, so efficient tools for document organization, summarization,
clustering, navigation and retrieval are always needed. Document clustering is a
daunting task due to high dimensionality. As mentioned previously, the
dimensionality of the space is determined by the number of terms in the document
collection, i.e. the number of distinct words in the corpus. Thus, the dimensionality
can be extremely high for even a modest number of documents of average size. In
most such applications, a document is expressed as a vector in term space (as in the
Vector Space Model). Even a short passage may contain hundreds of distinct terms,
so in real applications that process large text corpora the number of term dimensions
will be enormous. This high dimensionality significantly reduces the discriminative
power of distance measures [20].
To solve this problem, various dimension reduction techniques have already
been proposed. These techniques can be treated as a promising way to extract
"concepts" from unstructured data [21, 22]. They can be classified into two
categories:
1. Feature Selection
2. Feature Transformation
Feature Selection methods sort all the terms using a suitable mathematical
measure computed from the documents. Examples of such methods are Document
Frequency, Mean TF-IDF and Term Frequency Variance.
Feature Transformation methods map the vector-space representation of the
collection of documents into a lower-dimensional subspace in which the new
dimensions are linear combinations of the original ones. Well-known methods of
this kind are Latent Semantic Analysis (LSA), Random Projection (RP),
Independent Component Analysis (ICA) and Principal Component Analysis (PCA)
[23, 24].
As an initial step in LSA, the text is represented as a matrix. Each row of this
matrix stands for a unique term or word and each column stands for a paragraph of
context or a document. Each cell contains the frequency with which the word of its
row appears in the document denoted by its column. LSA takes this term-document
matrix as input and applies Singular Value Decomposition (SVD) to it [17].
In SVD, a matrix is decomposed into the product of three other matrices. One
component matrix R0 explains the original row entities as vectors of derived
orthogonal factor values while another component matrix C0 describes the original
column entities in the same way. The third component is a diagonal matrix S0
containing scaling values. When the three components are matrix-multiplied, the
original matrix is reconstructed [17, 18].
T = R0 S0 C0'
(t x d) = (t x m) (m x m) (m x d)

Fig. 2.2 SVD of Term-Document Matrix 'T'
Singular Value Decomposition of the term-document matrix, T, where:
R0 has orthogonal, unit–length columns (R0’ R0= I)
C0 has orthogonal, unit-length columns (C0’ C0= I)
S0 is the diagonal matrix of singular values
m is the rank of T (<= min (t, d))
SVD is a very useful technique because it provides a simple procedure for an
optimal approximate fit using smaller matrices. If the singular values in S0 are
arranged by size, the k largest may be kept and the remaining smaller ones set to
zero. Multiplying the resulting matrices gives a matrix Tnew of rank k which is
nearly equal to T; it can be shown that Tnew is the closest rank-k approximation to T
in the least-squares sense [18].
Tnew = R S C'
(t x d) = (t x k) (k x k) (k x d)

Fig. 2.3 Rank k approximation of original matrix T
The value of k is a parameter whose choice is of critical importance, because it
decides the amount of dimension reduction. Ideally it should be small enough that
sampling errors can be ignored, but large enough to capture all the real structure in
the data [18]. Each value in the new representation is a linear combination of the
original cell values; as a result, any change in a cell of the original matrix is
reflected in the values of the reconstructed matrix with reduced dimensions. The
dimension reduction step cuts down the matrices in such a way that terms that
occurred in some contexts now appear with larger or smaller predicted frequency,
and some words that did not actually appear now do appear, fractionally [17].
There are three kinds of comparisons that can be made by this reduced dimension
matrix [17].
(1.) Term-Term Comparison- The dot product between two row vectors of
Tnew shows the extent to which two terms have similar patterns of occurrence
across the given set of documents. The matrix Tnew * Tnew' is the square
symmetric matrix that contains all term-to-term dot products. This can be verified:
Tnew * Tnew' = (R * S * C') * (R * S * C')'
= (R * S * C') * (C * S' * R')   because (A * B)' = B' * A'
= R * S * (C' * C) * S' * R'
= R * S * S' * R'   because C is orthogonal
= R * S^2 * R'   because S is diagonal
(2.) Document-Document Comparison- The dot product between two
column vectors of Tnew shows the extent to which two documents have similar
term patterns. The matrix Tnew' * Tnew is the square symmetric matrix
containing all document-to-document dot products. This can be verified:
Tnew' * Tnew = (R * S * C')' * (R * S * C')
= (C * S' * R') * (R * S * C')   because (A * B)' = B' * A'
= C * S' * (R' * R) * S * C'
= C * S' * S * C'   because R is orthogonal
= C * S^2 * C'   because S is diagonal
(3.) Term-Document Comparison- The comparison between a term and a
document is the value of an individual cell of Tnew. The (i, j) cell of Tnew is
obtained by taking the dot product between the ith row of the matrix R * S^(1/2)
and the jth row of the matrix C * S^(1/2).
Using term-term and document-document similarity we can easily find,
respectively, all the terms and all the documents that are highly related to each
other. This similarity measure gives an approach to query expansion using
term-term similarity; in the next chapters we show in detail how a query is
expanded in the context of the MSE to refine results. Another way to measure
term-term or document-document similarity is the correlation coefficient between
two terms or two documents [18]. The following example demonstrates how, after
dimension reduction using LSA, two terms move closer together or further apart in
the semantic space according to their semantics.
Table 2.1 Titles that represent the small corpus.
Document Title
Doc 1 Design and Analysis of Algorithm
Doc 2 Satellite Imagery
Doc 3 Image Processing
Doc 4 Digital Signal Processing
Doc 5 Data Structure and Algorithm
Of the above, Doc 2, Doc 3 and Doc 4 belong to the domain of "Image
Processing", while the remaining documents, Doc 1 and Doc 5, are from "Design of
Algorithms". The term-document matrix T is shown in Table 2.2.
Table 2.2 Term-Document Representation of corpus (T).
DOC 1 DOC 2 DOC 3 DOC 4 DOC 5
Signal 0 0 0 1 0
Structure 0 0 0 0 1
Image 0 0 1 0 0
Imagery 0 1 0 0 0
Analysis 1 0 0 0 0
Digital 0 0 0 1 0
Data 0 0 0 0 1
Processing 0 0 1 1 0
Design 1 0 0 0 0
Algorithm 1 0 0 0 1
Satellite 0 1 0 0 0
If we calculate the correlation coefficients between (image, processing) and
(image, algorithm) we get the following results:
r (image, processing) = 0.61
r (image, algorithm) = -0.40
where r is Pearson's correlation coefficient between the corresponding term rows.
Now we perform SVD on this matrix to obtain the following:
Table 2.3 Complete SVD of T
R =
0 0.45 0 0 0.45
0.35 0 0 0.5 0
0 0.28 0 0 -0.72
0 0 -0.71 0 0
0.35 0 0 -0.5 0
0 0.45 0 0 0.45
0.35 0 0 0.5 0
0 0.72 0 0 0.28
0.35 0 0 -0.5 0
0.71 0 0 0 0
0 0 -0.71 0 0
S =
2.0 0 0 0 0
0 1.9 0 0 0
0 0 1.4 0 0
0 0 0 1.4 0
0 0 0 0 1.2
C =
0.71 0 0 -0.71 0
0 0 -1 0 0
0 0.53 0 0 -0.85
0 0.85 0 0 0.53
0.71 0 0 0.71 0
When we reconstruct our original matrix by taking only three significant values
from matrix S (i.e. considering rank 3 approximation) we recover T new as follows:
T new =
Table 2.4 Reconstruction of Original Matrix.
DOC 1 DOC 2 DOC 3 DOC 4 DOC 5
Signal 0 0 0.45 0.72 0
Structure 0.5 0 0 0 0.5
Image 0 0 0.28 0.45 0
Imagery 0 1 0 0 0
Analysis 0.5 0 0 0 0.5
Digital 0 0 0.45 0.72 0
Data 0.5 0 0 0 0.5
Processing 0 0 0.72 1.17 0
Design 0.5 0 0 0 0.5
Algorithm 1 0 0 0 1
Satellite 0 1 0 0 0
Now if we calculate the correlation coefficients we get the following values:
r (image, processing) = 0.99
r (image, algorithm) = -0.63
This clearly shows that in the given corpus 'image' and 'processing' are strongly
related, while there is little relation between 'image' and 'algorithm'. The high
correlation between 'image' and 'processing' supports the fact that they belong to
the Image Processing context, while 'algorithm' belongs to Design of Algorithms
and is hence less related.
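The reported coefficients on the raw matrix can be reproduced with Pearson's correlation between the term rows of Table 2.2. The short sketch below (Python, for illustration only; the row vectors are copied from Table 2.2) computes them:

```python
import math

def pearson(x, y):
    # Pearson correlation between two term rows of the term-document matrix
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rows of the raw term-document matrix T (Table 2.2)
image      = [0, 0, 1, 0, 0]
processing = [0, 0, 1, 1, 0]
algorithm  = [1, 0, 0, 0, 1]
print(round(pearson(image, processing), 2))  # 0.61
print(round(pearson(image, algorithm), 2))   # -0.41 (reported as -0.40)
```

Running the same function on the rows of the reconstructed matrix Tnew (Table 2.4) would likewise reproduce the post-SVD values.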
2.3.2 Limitations of LSA
LSA maps terms and documents onto a fixed number of concepts (or topics)
which are orthogonal (unrelated) to each other. In practice this is not the case,
because a document may contain a number of related concepts. In addition, LSA
lacks a firm statistical foundation.
Computational complexity, storage, and the sparseness of the term-document
matrix are critical issues wherever LSA is to be used.
LSA makes no use of morphology, word order or syntactic relations, so it may
occasionally produce incomplete or erroneous output [18].
Least-squares methods are developed for normally distributed data, and SVD is
based on the same principle. The term-by-document matrix that serves as input to
LSA, however, consists of count data, for which such a distribution is inappropriate.
This is another objection to the use of LSA [25].
2.3.3 Advantages and Applications of LSA
1. LSA models human conceptual knowledge very well. It is able to develop
summarization skills [26] and text comprehension [18].
2. LSA can be used in essay-scoring techniques [27], for example to predict the
extent to which a student has learnt from a specific text [28].
3. Since LSA does not depend on literal matching, it performs well on noisy
text, e.g. optical character recognition output or text with spelling errors [25].
4. The LSA technique makes no use of English syntax or semantics; it is based
on a "bag of words" approach. It is therefore applicable to any language, and
hence to Cross-Lingual Information Retrieval [25, 27].
5. LSA gives good results in the context of relevance feedback and information
filtering.
6. Sparseness and storage are certainly big hurdles to its usage. In the context of
a Meta-Search Engine, however, the term-document matrix is small compared
to the traditional information retrieval domain, so these problems are not so
significant. Moreover, since LSA is used here only for query expansion, the
problem of SVD updating does not arise.
2.4 Probabilistic Latent Semantic Analysis (PLSA)
2.4.1 Concept of PLSA
In 1999, Thomas Hofmann [29, 30] proposed this technique. The basis of PLSA
is the Aspect Model, a latent variable model for co-occurrence data which
associates a hidden class variable a ∈ A = {a1, a2, …} with each observation, i.e.
with each occurrence of a term t ∈ T = {t1, t2, …} in a document d ∈ D = {d1,
d2, …}. The parameters in this context are defined in the following way:
P (d) = probability of selecting a document d,
P (a | d) = probability of picking a hidden class a,
P (t | a) = probability of generating a term t.
An observed pair (d, t) is obtained by summing out the hidden class variable 'a'.
Expressing the whole process as a joint probability model yields the following
expressions:
P (d, t) = P (d) * P (t | d) , ------------ (2.1)
where
P (t | d) = Σa P (t | a) * P (a | d) ------------ (2.2)
PLSA uses this idea in the following way. As in the Vector Space Model and
LSA, the term-document matrix acts as the input to this model. This matrix T(t, d)
has terms t = 1:m (i.e. ranging from 1 to m) as rows and documents d = 1:n as
columns, together with a number of topics A to be sought; T(t, d) is the entry in the
specified row and column.
By a random sequence model, the probability of a document can be written as
P (d) = P (t1 | d) * P (t2 | d) * … * P (tm | d)
      = Π_{t=1..m} P (t | d)^T(t,d) ------------ (2.3)
Now if we have A topics as well:
P (tm | d) = Σ_{a=1..A} P (tm | topic_a) * P (topic_a | d) ------------ (2.4)
The same, written using shorthand:
P (t | d) = Σ_{a=1..A} P (t | a) * P (a | d) ------------ (2.5)
So, by substitution, for any document in the collection,
P (d) = Π_{t=1..m} { Σ_{a=1..A} P (t | a) * P (a | d) }^T(t,d) ------------ (2.6)
Now, P (t | a) and P (a | d) are the two parameter sets of this model. Equations to
compute these parameters can be derived by Maximum Likelihood. After doing so
we obtain:
• P (t | a), for all t and a: a term-by-topic matrix
(gives the terms which make up a topic)
• P (a | d), for all a and d: a topic-by-document matrix
(gives the topics of a document)
The log-likelihood of this model is the log-probability of the entire collection:
Σ_{d=1..n} log P(d) = Σ_{d=1..n} Σ_{t=1..m} T(t,d) * log Σ_{a=1..A} P(t | a) * P(a | d) ------------ (2.7)
which is to be maximized with respect to the parameters P (t | a) and P (a | d),
subject to the constraints
Σ_{t=1..m} P (t | a) = 1 and Σ_{a=1..A} P (a | d) = 1
In cases where hidden or missing data must be handled, an ideal approach to
computing the Maximum-Likelihood (ML) estimate is Expectation Maximization
(EM). In ML estimation, the parameters are chosen so as to make the observed data
most likely.
There are two steps in the EM algorithm:
1. In the Expectation step, the current estimates of the parameters are used to
compute the posterior probabilities of the hidden variables.
2. In the Maximization step, the posterior probabilities computed in the
Expectation step are used to update the parameters [31].
One admirable property of EM is that convergence is assured, i.e. the algorithm
is guaranteed never to decrease the likelihood from one iteration to the next. The
PLSA algorithm below specifies the exact input, processing steps and output.
2.4.2 PLSA Algorithm
• Inputs: term to document matrix T(t , d), t=1:m, d=1:n and the number A of
topics sought [19]
• Initialize arrays P1 and P2 randomly with numbers between [0,1] and
normalize them row-wise to 1 [19]
• Iterate until convergence:
For d = 1 to n, for t = 1 to m, for a = 1 to A,
P1 (t, a) = P1 (t, a) * Σ_{d=1..n} { T(t, d) * P2(a, d) / Σ_{a'=1..A} P1(t, a') * P2(a', d) } ---------- (2.8)
P2 (a, d) = P2 (a, d) * Σ_{t=1..m} { T(t, d) * P1(t, a) / Σ_{a'=1..A} P1(t, a') * P2(a', d) } ---------- (2.9)
P1 (t, a) = P1 (t, a) / Σ_{t=1..m} P1(t, a) ---------- (2.10)
P2 (a, d) = P2 (a, d) / Σ_{a=1..A} P2(a, d) ---------- (2.11)
• Output: arrays P1 and P2, which hold the estimated parameters P (t |a) and
P (a| d) respectively [19].
Equations (2.8) and (2.9) are the expectation steps, in which posterior
probabilities are calculated from the currently estimated values. In the initial step
these parameters are assigned by a uniform random number generator producing
numbers between 0 and 1. Equations (2.10) and (2.11) are the maximization steps,
where the parameters are re-normalized using the values resulting from the
expectation step. The outputs are the two matrices P1 and P2, holding the
probabilities of the term distribution per topic and the topic distribution per
document, respectively.
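One common way to implement these updates is the standard E-step/M-step formulation of PLSA, which is equivalent to the multiplicative form of equations (2.8)-(2.11). The following pure-Python sketch (illustrative only, not the thesis's Java implementation; the tiny "cricket" corpus is invented for the demonstration) folds both steps into one pass:

```python
import random

def plsa(T, n_topics, n_iter=50, seed=0):
    """EM for PLSA on a term-document count matrix T (list of rows, T[t][d])."""
    rng = random.Random(seed)
    m, n = len(T), len(T[0])
    # P1[t][a] ~ P(t|a), P2[a][d] ~ P(a|d), randomly initialized in (0, 1)
    P1 = [[rng.random() for _ in range(n_topics)] for _ in range(m)]
    P2 = [[rng.random() for _ in range(n)] for _ in range(n_topics)]

    def norm_cols(M):
        # Normalize every column of M to sum to 1, enforcing the
        # constraints sum_t P(t|a) = 1 and sum_a P(a|d) = 1
        for j in range(len(M[0])):
            s = sum(row[j] for row in M)
            for row in M:
                row[j] /= s

    norm_cols(P1)
    norm_cols(P2)
    for _ in range(n_iter):
        newP1 = [[0.0] * n_topics for _ in range(m)]
        newP2 = [[0.0] * n for _ in range(n_topics)]
        for t in range(m):
            for d in range(n):
                if T[t][d] == 0:
                    continue
                # E-step: posterior P(a | t, d) from the current estimates
                denom = sum(P1[t][a] * P2[a][d] for a in range(n_topics))
                for a in range(n_topics):
                    post = T[t][d] * P1[t][a] * P2[a][d] / denom
                    # M-step accumulation (numerators of eqs. 2.8 / 2.9)
                    newP1[t][a] += post
                    newP2[a][d] += post
        # Normalization corresponding to eqs. (2.10) and (2.11)
        norm_cols(newP1)
        norm_cols(newP2)
        P1, P2 = newP1, newP2
    return P1, P2

# Toy corpus: two "sport" documents, two "insect" documents; the ambiguous
# term "cricket" (last row) occurs in all four.
T = [[2, 1, 0, 0],   # bat
     [1, 2, 0, 0],   # wicket
     [0, 0, 2, 1],   # insecticide
     [0, 0, 1, 2],   # chirp
     [1, 1, 1, 1]]   # cricket
P1, P2 = plsa(T, n_topics=2)
print([[round(p, 2) for p in row] for row in P2])  # topic mixture per document
```

On this separable toy data the two recovered topics typically split along the sport/insect block structure, with "cricket" receiving weight under both, which is exactly the polysemy behaviour discussed in Section 2.4.3.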
[Figure: the observed term distribution P (t | d) is factored into the term
distribution per topic P (t | a) and the topic distribution per document P (a | d).]
Fig. 2.4 Two Matrix Formations from PLSA.
The accuracy of the PLSA algorithm depends on two crucial factors:
1. the number of topics (the value of A), and
2. the number of iterations.
The number of topics is context-specific. If the documents in the text corpus
contain different concepts or belong to various domains, then an exact estimate of
this parameter is critical; the closer the chosen number of topics is to the actual
value, the better the results.
The number of iterations to convergence is another decisive factor, and it must
be chosen carefully: too many iterations overfit (over-tune) the data, while too few
do not yield true results. Early stopping, in which the iterations are halted after a
specific number of steps, can be used to prevent overfitting. The Results and
Analysis section of this thesis demonstrates the effect of all these factors with
empirical results.
The following example shows the term distributions within topics yielded by
the algorithm.
Table 2.5 Four aspects (topics) that are most likely to generate the term
'Cricket'.
Aspect 1 Aspect 2 Aspect 3 Aspect 4
Sports earlier Cricket nice
Cricket disease using girl
play swimming insecticides saying
makes state insect dull
indoors lots kill student
outdoors cycling small hockey
games cured harmful school
person healthy nice study
The above example is derived from an A = 4 aspect model of a document
collection containing different aspects related to "Cricket". The displayed words are
the most probable words in the class-conditional distribution P (t | a), listed from
top to bottom in descending order.
2.4.3 Advantages and Applications of PLSA
The results of PLSA are better than those of LSA because PLSA has a firm
statistical foundation. PLSA uses the principles of conditional probability together
with EM, which is guaranteed to converge, and hence produces better results.
LSA resolves the problem of synonymy, but in the case of polysemy its results
are still doubtful. PLSA resolves both problems efficiently: it distributes the
term-to-topic data in such a manner that a polysemous term is clubbed with different
terms, with a different probability under each topic, and therefore represents
different topics. In the example above, Aspect 1 seems related to "Sport - Cricket";
other terms such as play, outdoors and games support the idea of an outdoor/indoor
classification of games. Aspect 2 concerns the "disease prevention" aspect of
outdoor games, which makes one healthier; "swimming" and "cycling" are other
examples of such games. Aspect 3 shows relevance to "Cricket - Insect":
"insecticides", "insect", "kill", "small" and "harmful" are the most probable words
in this context. Aspect 4 represents the concept of "Sports in School"; the other
terms in this concept, such as girl, dull, hockey and study, support this
interpretation. The different positions of the term "Cricket" show its respective
probability of appearing in each context, which is quite understandable.
PLSA is already in use in some applications and is contributing fruitful results.
Apart from the already explained domain, where relevant documents are retrieved
for a given query, PLSA is used in "Web Page Grouping" [32] and in the
construction of "Community Web Directories". This thesis suggests a new
direction: implementing a Meta-Search Engine that uses PLSA for query expansion.
2.5 Summary
Current approaches for implementing MSEs have been presented. After
illustrating their drawbacks, we discussed in detail the algorithms (LSA/PLSA) that
are used in the proposed MSE. The applications, advantages and limitations of
these algorithms were presented, which establishes the superiority of PLSA over
LSA. The architecture and design of the proposed MSE are presented next.
Chapter 3
Proposed Meta-Search Engine

3.1 Basic Theme
One of the most important points of concern is how the idea of query expansion
refines existing search-engine results. This can be justified by the following
observation.
All search engines perform literal matching for a given query and retrieve web
pages which contain that term or combination of terms. These terms may be
synonymous or polysemous in nature, so the results contain pages belonging both to
the intended context and to others. At any particular time, a user is interested in
some specific topic or domain. Since at present no search engine classifies the
results according to topic, a common user navigates through all the links and wastes
time until the needed information is found. For the query keyword "Cricket", for
example, the first-page links returned by Google and Yahoo contained web pages
solely related to the game of cricket; not a single result concerned the cricket that is
an insect. For such a query, a user will evidently have to search through all the
pages about the sport before reaching a web page where cricket is described as an
insect.
The proposed MSE eases the user's task by suggesting other terms that are likely
to occur in a particular context. The user can then select a term or group of terms
from his/her area of interest and fire a new, expanded query to the MSE. After a
few such iterations, all the links will belong to the same topic; in this way the links
are confined to the user's need.
3.2 Architecture of Proposed MSE
This section describes the different components of the proposed Meta-Search
Engine and illustrates their responsibilities, significance and interaction with each
other.
[Figure: the user's query flows from the User Interface through the Common
Interface to Search-Engines to Search Engines 1 … n; the returned web pages pass
through Baseline Establishment (naïve algorithm), the Page Retriever and the
Pre-Processing Unit to the LSA/PLSA algorithms, which produce the next
keywords and ranked links (URLs) from the processed text corpora.]
Fig. 3.1 Architecture of proposed Meta-Search Engine
1. User Interface
The user poses a query to the MSE through the User Interface (UI). The user
interface must be easily understandable by any novice user, so that it can be used
with ease, and all the results and information in the different parts of the UI must be
self-explanatory to a user from any domain.
2. Common Interface to Search-Engines
This component takes the query keywords as input. It contains the APIs
(Application Program Interfaces) or libraries for the different search engines. These
APIs accept the query at the front end and pass it to the corresponding search
engines at the back end.
3. Search-Engines
This component represents the set of underlying search engines. The most
frequently used search engines are Google, Yahoo, MSN, AltaVista, etc. Each of
these performs crawling, indexing and ranking according to its own mechanism; the
anatomy of such search engines has already been explained in the Motivation
section of Chapter 1. The result of a search engine is a list of web links ranked
against the user query.
4. Baseline Establishment
This part establishes a local ranking over the retrieved results before presenting
them to the user. The local ranking forms a baseline that respects the rankings of
the individual search engines. A naïve technique for this baseline establishment is
to rank a link higher the higher it appears in the results of several search engines,
and lower if it appears in the result of only a single search engine.
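One plausible reading of this naïve heuristic is a Borda-count-style merge, sketched below. This is an illustration only (the URLs and engine lists are invented, and the thesis does not commit to this exact scoring scheme):

```python
def baseline_rank(result_lists):
    # result_lists: one ranked URL list per search engine.
    # Borda-count style: position i in a list of length n contributes n - i,
    # so a URL returned high up by several engines accumulates a high score,
    # while a URL found by only a single engine scores low.
    scores = {}
    for results in result_lists:
        n = len(results)
        for i, url in enumerate(results):
            scores[url] = scores.get(url, 0) + (n - i)
    return sorted(scores, key=scores.get, reverse=True)

# Invented result lists from two engines
google = ["a.com", "b.com", "c.com"]
yahoo  = ["b.com", "d.com", "a.com"]
print(baseline_rank([google, yahoo]))  # ['b.com', 'a.com', 'd.com', 'c.com']
```

Note that "b.com" and "a.com", returned by both engines, outrank "c.com" and "d.com", each of which appears in only one result list.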
5. Page-Retriever
This section downloads the web pages, in baseline-ranking order, to a local
machine. These pages may be in different formats such as .txt, .pdf, .html, etc.
6. Preprocessing Unit
This component builds a text corpus from the downloaded web-pages. Note that the corpus changes with the query. On this corpus we perform preprocessing steps such as stop-word removal and stemming. Stop-words are common words, such as articles and prepositions, that carry no specific meaning in any context; they act as noise in the corpus and must be removed before the algorithms are applied. Examples of stop-words are 'a', 'an', 'the', etc. The stemming process reduces each term to its root word, so that terms sharing a root are represented by a single entity to which an appropriate weight can be assigned. For example, the corpus may contain the words 'boy' and 'boys' in different places. Rather than counting these as separate terms, stemming reduces both to the root form 'boy'. This step is necessary because the two words are morphological forms of the same word and are therefore semantically similar, so they should be reduced to the same term. This has two advantages: it reflects the term statistics more accurately, and it reduces the number of terms, which leads to higher efficiency.
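These two preprocessing steps can be sketched as follows. The stop-word list is a small illustrative sample, and the toyStem method is a deliberately crude stand-in for the Porter stemmer actually used in the system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the pre-processing unit: stop-word removal followed by stemming.
public class Preprocessor {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "in", "on", "and"));

    // Crude illustration only: strip a plural "s" (e.g. "boys" -> "boy").
    // The real system applies Porter's full suffix-stripping algorithm.
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss"))
            return term.substring(0, term.length() - 1);
        return term;
    }

    public static List<String> preprocess(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (tok.isEmpty() || STOP_WORDS.contains(tok)) continue; // drop noise
            out.add(toyStem(tok));
        }
        return out;
    }
}
```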
7. Algorithms
This module contains both of the algorithms explained in the previous chapters. The processed text corpus is converted into a suitable input for the algorithms, which finally yield the next probable query keywords.
3.3 Implementation Details

This section presents the design and implementation of the suggested solution. Java is used as the programming language because it is a purely object-oriented language. The proposed solution must be easily modifiable and extendable, which is a basic need in such a system, and the object-oriented methodology fulfils this purpose. A package diagram, as part of the high-level design, is shown in Fig. 3.2.
[Figure: package diagram with packages MetaSearchEngine, LSA/PLSA Algorithm, ReadingAndStemming, Parser, SearchEngineInterface, GUI and BaseLine.]
Fig. 3.2 High-Level Design (Package Diagram)
1. GUI
This package contains all the classes responsible for the user interface. It has two classes, one for LSA and another for PLSA. These classes have different components, as required, and present the next probable query keywords in different ways.
2. Meta-Search Engine
The class in this package is responsible for passing control from one package to another and acts as a common interface for most of the packages. It takes the query as input and returns links and next probable query keywords in the desired format.
3. Search-Engine Interface
The classes in this package interact with the corresponding search-engines. They use the APIs or libraries of the search-engines so that the query can be fired on each SE at the back-end. Currently there are three classes, each communicating with its respective SE:
• GoogleApi.java
• YahooApi.java
• MSN Api.java
The first two classes use googleapi.jar [33] and yahoo-search-2.0.0.jar [34] respectively. The MSN Api.java class executes the .exe file of ConsoleWebSeach.cs [35], which is based on the .NET framework.
Both the Google and MSN APIs provide the top ten results. The Yahoo API can yield any number of links if results are available. Currently the Meta-Search Engine extracts the top ten results from all three SEs, so it is not biased toward any search-engine.
4. BaseLine
This package is responsible for the local ranking of the links returned by all the SEs. A very basic approach is implemented: a link is ranked higher if it appears in the results of all three search-engines and lower if it appears in the results of two or only a single search-engine. After this, all the links are downloaded to the local system.
5. Parser
The downloaded links may yield files of different formats, such as .txt, .html, .pdf, .ppt, .ps, etc. Currently, the classes in this package deal only with the .txt, .html and .pdf formats. The Jericho HTML-2.3 parser [36] is used to parse HTML pages and extract their text content. The PDFBox-0.7.0 [37] libraries are used to convert a .pdf file into a text file. For .pdf to text conversion the following Java archives are used:
• ant.jar
• checkstyle-all-3.5.jar
• log4j-1.2.9.jar
• PDFBox-0.7.0.jar
The HTML parser uses jericho-html-2.3.jar.
6. Reading and Stemming
This package is responsible for all the functionality of the Pre-Processing Unit component described in the architectural details. First, the stopwordrem.java class removes all the stop-words from the text files. After stop-word removal, a stemming algorithm is applied to each term of the text files by porter.java [38], which implements Porter's stemming algorithm. Other stemmers are available, such as the Lovins and Dawson stemmers [39], but Porter's stemmer is the one most widely used in information retrieval and language processing problems, and since its performance is good it is also used in the present context. Once all the text files are stemmed, the text corpus is ready to be converted into the desired input format.
7. LSA/PLSA Algorithm
The LSA and PLSA packages replace each other depending on the case: if LSA is applied for query-expansion the LSA package is used, otherwise the PLSA package.
The Term-Doc.java class converts the whole text corpus into term-document matrix format. After different term-weight factors are applied, this matrix is passed to another class which performs singular-value decomposition of the input matrix. By varying the value of k, the reduced matrix is generated. The query term and all other terms, with their matrix entries, are then passed to the Corr.java class, which computes correlation coefficients so that new probable words can be sent to the GUI. This package uses Jama-1.0.2.jar for the SVD computation [40].
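The construction of the term-document matrix can be sketched as follows. This is an illustrative implementation using raw term counts; the actual Term-Doc.java also applies term-weight factors, and the SVD step (delegated to the Jama library in the real system) is omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of term-document matrix construction: each row is a term from the
// processed corpus, each column a document, and each cell the raw count of
// that term in that document.
public class TermDocMatrix {
    public final List<String> terms = new ArrayList<>();
    public final int[][] counts;          // rows = terms, columns = documents

    public TermDocMatrix(List<List<String>> docs) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (List<String> doc : docs)
            for (String t : doc)
                index.computeIfAbsent(t, k -> { terms.add(k); return terms.size() - 1; });
        counts = new int[terms.size()][docs.size()];
        for (int d = 0; d < docs.size(); d++)
            for (String t : docs.get(d))
                counts[index.get(t)][d]++;
    }
}
```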
In the case of PLSA, the term-document matrix is passed to PLSA.class, which applies the algorithm to this matrix. To check convergence, in each iteration all the previous and new matrix elements are passed to Conv.class, which computes the average and absolute convergence measures. Early stopping is used to prevent the results from over-tuning. From the term-topic matrix, the terms with the highest probabilities are extracted and suggested on the GUI as new probable query keywords, which are classified on the basis of different aspects.
3.4 Features of Proposed System

The suggested Meta-Search Engine tries to refine the results by query-expansion. Apart from this, some other features add a new dimension to the presented Meta-Search Engine. These features are compiled next:
• After receiving the next keywords on the User Interface, a user can add a combination of terms to the previous query instead of a single term, and may thereby reach the needed URLs much sooner. A user can also replace terms, or exchange their positions, according to his or her understanding and need.
• A novice user may lack information about the domains related to the given keywords from which he or she wants to extract information. In this scenario, the next keywords suggest most of the concepts available on the Internet regarding that query.
• If the suggested keywords do not belong to the area from which the user desires to retrieve information, this indicates that very limited information is available on the Internet for the given query. Such a situation motivates the user to reformulate the query.
• The proposed solution does not use any sort of "Thesaurus" or "Dictionary" for query-expansion; it utilizes the content itself to recommend the next keywords. This feature enhances the scope of the Meta-Search Engine to a significant extent. For example, for the query "CAT", a thesaurus provides all the synonyms of "cat" but suggests nothing related to the "Common Admission Test", whose acronym is also CAT. There are plenty of such cases on the Internet. The proposed MSE extracts information from the web-pages themselves and hence avoids this problem easily.
3.5 Summary

The architecture and design of the proposed Meta-Search Engine have been discussed in this chapter. Extendibility and maintainability are two admirable features of the proposed MSE. All the components and their significance have been shown. The next chapter will present the results and analysis of the thesis, with various elucidating examples and their justifications.
Chapter 4
Result and Analysis
This chapter presents all the experiments that have been performed and their analysis. Various queries from different contexts have been fired on the proposed Meta-Search Engine for both LSA and PLSA. The important and critical factors were varied, and the resulting variations in the results were noted carefully. The results are in accordance with the theoretical facts presented in the literature-survey sections. Full details of these experiments, with elucidating examples, are described next; they in turn demonstrate the success of this idea for designing a Meta-Search Engine. First the results of the LSA experiments are shown, and then those of PLSA. A comparison between LSA and PLSA is then presented, which demonstrates the superiority of PLSA over LSA.
4.1 Result-Analysis of LSA

4.1.1 Value of 'k' for Optimal Rank Approximation of the Term-Document Matrix

As explained in Chapter 3, the value of 'k' is a decisive factor because it governs the dimensionality of the vector space: the higher k is, the larger the vector space, and vice-versa. A very high value of k leads to a large vector space that may contain noisy data, while a very small value of k yields a small vector space that loses some important information. This fact is illustrated by the following example, performed for the query "India Tourism" with 2447 terms and 20 documents. The results are shown in Table 4.1 below.
Table 4.1 Next keywords for query "India Tourism" for different values of 'k'

k=2 (50%)    | k=6 (75%)   | k=8 (85%)   | k=10 (90%)  | k=14 (98%)
Festivals    | Royal       | Varanasi    | Varanasi    | Varanasi
Kerala       | Varanasi    | Best        | Best        | Best
Tiger        | Best        | Jaipur      | Forts       | Forts
National     | Jaipur      | Royal       | Kerala      | Kerala
Boat         | Policy      | Travel      | Travel      | Travel
Sariska      | Travel      | Forts       | Royal       | Bharatpur
Carbett      | Kerala      | Kerala      | Explorer    | Jaipur
Golden       | Forts       | Explorer    | Bharatpur   | Rajasthan
Bird         | Explorer    | Policy      | Rajasthan   | Indian
Backwaters   | Palace      | Bharatpur   | Jaipur      | Golden
General      | Himachal    | Himachal    | Himachal    | Himachal
Australia    | Queen       | Andhra      | Andhra      | Andhra
Kingdom      | Andhra      | Jammu       | Northern    | Northern
Glorious     | part        | Orissa      | Kashmir     | Sikkim
Gangtok      | Pilgrimage  | Kashmir     | Orissa      | Hill
Sri          | Hill        | Northern    | Jammu       | TamilNadu
Chandigarh   | Inc         | Sikkim      | Sikkim      | September
Ayurveda     | TamilNadu   | Bengal      | Pilgrimage  | major
France       | related     | related     | Hill        | winter
Italy        | Jammu       | Pilgrimage  | TamilNadu   | Orissa
The table shows the terms most related to the query "India Tourism" for different extents of vector-space coverage. In each column, the first ten entries are terms most related to "India" and the next ten are terms most related to "Tourism". Terms judged non-relevant to the given context were distinguished (italicized in the original table) from the relevant ones.
From the table it is clear that for k=2 (50% of the vector space), terms like "Festivals", "General", "Australia", "Kingdom", "Sri", "France" and "Italy" appear that are not particularly relevant. As k increases to 6, 75% of the vector space is covered and only a few non-relevant terms remain: "Policy", "Queen", "part", "Inc" and "related". For k=8 the number of such unwanted terms reduces further, and for k=10 (i.e. 90% of the vector space) all the displayed terms can be treated as relevant. Further increase in k degrades the result: for k=14, the terms "September", "major" and "winter" appear, demonstrating the presence of noisy data. These results firmly support choosing a value of k that covers 90% of the vector space. The suggested keywords also show good prospects for query-expansion.
4.1.2 Comparison of the "Tf-IDf Measure" with the "Term-Count Measure"

Various term-weight measures assign a weight (i.e. an importance) to the terms of the term-document matrix. The literature very often states that the "Tf-IDf measure" is more effective than the simple "term-count measure". The results of the proposed Meta-Search Engine were assembled for both measures for the query "Thread".
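The two weighting schemes can be sketched as follows. The formula tf * log(N/df), where N is the number of documents and df the number of documents containing the term, is the standard Tf-IDf form assumed here.

```java
// Sketch of the two term-weight measures being compared. Term-count uses the
// raw frequency tf; Tf-IDf scales it by log(N/df), so a term that occurs in
// every document receives weight zero.
public class TermWeights {
    public static double termCount(int tf) {
        return tf;
    }

    public static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }
}
```

Note that when df is close to N for most surviving terms, as in the small corpora used here, the log factor is close to zero for exactly the terms it is meant to suppress, and the two measures rank terms almost identically.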
Table 4.2 Next keywords for query "Thread" using the Term-Count Measure

50% Vector-Space | 75% Vector-Space | 85% Vector-Space | 90% Vector-Space
Actually         | Exists           | Active           | Active
allows           | natural          | String           | String
unique           | Active           | natural          | Package
debugging        | machine          | length           | SE
liked            | developers       | machine          | lower
copy             | depends          | called           | called
errors           | length           | Package          | machine
print            | String           | SE               | allows
which            | raised           | lower            | fact
In both Table 4.2 and Table 4.3, the italicized and bold terms (in the original tables) are the ones assumed to be relevant. As before, for 90% vector-space coverage the results are good in both cases.
Table 4.3 Next keywords for query "Thread" using the Tf-IDf Measure

50% Vector-Space | 75% Vector-Space | 85% Vector-Space | 90% Vector-Space
liked            | Exists           | Active           | Active
debugging        | developers       | String           | String
Actually         | natural          | natural          | Package
unique           | Active           | length           | lower
allows           | machine          | called           | SE
appropriate      | depends          | machine          | called
action           | raised           | SE               | allows
executing        | String           | lower            | machine
copy             | length           | Package          | fact
In this particular case, the results of the term-count and Tf-IDf measures are in fact not different. This does not imply that Tf-IDf is not better than the term-count measure. In the present context we have a limited number of documents (around 20-25), and all the stop-words (which occur in most documents) have already been removed in the pre-processing step. The factor log(N/df) therefore has little effect, and the results of the two measures are almost the same. As the number of retrieved documents increases, Tf-IDf should show improved performance over the term-count measure. The following snapshot of the GUI shows the results for the query "India Tourism" when the LSA algorithm is used for query expansion.
Fig 4.1 Graphical User Interface for LSA
4.2 Result Analysis of PLSA

Various tests have been performed to check the performance of PLSA, as was done for LSA. In the context of the Meta-Search Engine it produces appreciable and quite useful results. Since PLSA classifies the suggested next keywords by topic, it gives an extra edge for expanding the query toward a specific domain. Some examples of PLSA results are illustrated in the following tables:
Table 4.4 Results of PLSA for query “Thread”
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
Thread Fangohr Auckland Gigalink Showcase
System Firms Zealand BridesMaid Prices
Java Hungry NZ ThreadDesign Boutiques
Class Stock Necklace www Dj
lang Flavor Crochet dress Thousands
process sure Paris collection Cocktails
Table 4.5 Results of PLSA for query "Australian University"

Topic 1       Topic 2   Topic 3   Topic 4        Topic 5
University Forum Museum Museum AIU
Australian Study Forum Forum whistleblowing
Australia UNDA England large Security
ANU JCU images books Below
Research CQU large architecture Counter
Page SCU above here Sells
Student CDU books churches Login
International ECU here images Buy
Table 4.4 shows next keywords grouped into five different topics. The first topic is related to the "Thread" class of the Java language. The second topic seems related to the country "Hungry" (Hungary) and some firms. Topic three relates solely to "Thread", a New Zealand fashion and culture magazine and on-line store. Topic 4 contains terms like "dress", "bridesmaid" and "collection", which reflect the fabrics aspect of the term "thread".
Similarly, Table 4.5 shows next keywords for the query "Australian University". Topic 1 contains general terms such as "student", "international", "ANU" and "research" that are related to Australian universities. Topic 2 contains terms like "UNDA", "JCU", "CQU", "SCU", "CDU" and "ECU", which are acronyms of the University of Notre Dame Australia, James Cook University, Central Queensland University and so on; hence the second topic represents a list of Australian universities. The other topics can be understood in the same way. These terms can be used for query-expansion and will in turn yield a focused search.
4.2.1 Optimal Value for the Number of Topics (a)

The number of topics, 'a', is one of the most important factors in PLSA, and its value must be optimal. A large value of 'a' produces redundant topics that are not informative enough, while a small value hides useful concepts. Various tests suggest that this value should lie between 3 and 7 for most cases of the current Meta-Search Engine, because it handles at most 24 to 27 documents, and for that many documents a range of 3 to 7 topics is appropriate. An example with increasing values of 'a' is shown for the same query, "India Tourism". The terms in the different topics reflect different aspects and significance.
Table 4.6 Results of PLSA for query "India Tourism" for different values of the number of topics 'a' = 1, 2, 3

a = 1:
  Topic 1: India, Tourism, Tour, Travel, Kerala, Rajasthan

a = 2:
  Topic 1: India, Tourism, Tour, Travel, Kerala, Rajasthan
  Topic 2: yimg, Hyatt, directly, JS, Regency, suggest

a = 3:
  Topic 1: India, Tourism, Tour, Travel, Kerala, Rajasthan
  Topic 2: yimg, Hyatt, directly, JS, Regency, Mariott
  Topic 3: Kalpa, Munsiyari, demanding, manmade, Wing, Interzigm
4.2.2 Convergence

Since PLSA uses EM for maximum-likelihood estimation, it guarantees convergent behavior of the iterative procedure: it always seeks a local maximum for the given data distribution. In the context of the Meta-Search Engine, PLSA also shows this converging nature. To check it, two measures are used:
• Absolute Measure
• Average Measure
4.2.2.1 Absolute Measure

It is computed by the following formula:

    Max_{i,j} = | P_{i,j}^(n+1) - P_{i,j}^(n) |

where P_{i,j}^(n) is the value in the i-th row and j-th column of the term-topic matrix (or topic-document matrix) after the n-th iteration.
In PLSA, random values are first assigned to both the term-topic and the topic-document matrix. After one iteration of the E and M steps, the algorithm generates new versions of these matrices. The new versions act as input for the next iteration, and this iterative procedure continues until convergence.
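One EM iteration of this procedure can be sketched as follows. This is a minimal illustrative implementation in the asymmetric formulation, with a term-topic matrix P(w|z) and a topic-document matrix P(z|d), both randomly initialised as described above; the exact parameterization used by the system's PLSA.class may differ.

```java
import java.util.Random;

// Sketch of one PLSA EM iteration on a term-document count matrix n[t][d].
public class PlsaSketch {

    // Random matrix whose columns each sum to 1 (column-stochastic).
    static double[][] randomStochastic(int rows, int cols, Random r) {
        double[][] m = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double s = 0;
            for (int i = 0; i < rows; i++) { m[i][j] = r.nextDouble() + 1e-9; s += m[i][j]; }
            for (int i = 0; i < rows; i++) m[i][j] /= s;
        }
        return m;
    }

    // One combined E and M step; returns { new P(w|z), new P(z|d) }.
    static double[][][] emStep(double[][] n, double[][] pwz, double[][] pzd) {
        int T = n.length, D = n[0].length, K = pwz[0].length;
        double[][] newPwz = new double[T][K];
        double[][] newPzd = new double[K][D];
        for (int t = 0; t < T; t++) {
            for (int d = 0; d < D; d++) {
                if (n[t][d] == 0) continue;
                // E-step: posterior P(z | t, d), up to the normalising constant
                double[] post = new double[K];
                double norm = 0;
                for (int z = 0; z < K; z++) { post[z] = pwz[t][z] * pzd[z][d]; norm += post[z]; }
                // M-step: accumulate the expected counts for each topic
                for (int z = 0; z < K; z++) {
                    double c = n[t][d] * post[z] / norm;
                    newPwz[t][z] += c;
                    newPzd[z][d] += c;
                }
            }
        }
        normaliseColumns(newPwz);   // columns of P(w|z) sum to 1 over terms
        normaliseColumns(newPzd);   // columns of P(z|d) sum to 1 over topics
        return new double[][][] { newPwz, newPzd };
    }

    static void normaliseColumns(double[][] m) {
        for (int j = 0; j < m[0].length; j++) {
            double s = 0;
            for (double[] row : m) s += row[j];
            if (s > 0) for (double[] row : m) row[j] /= s;
        }
    }
}
```

Calling emStep repeatedly on its own output, until the cell entries stop moving, yields the converged model.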
To measure convergence, we compute the maximum difference Max_{i,j} over all corresponding cell entries of the term-topic matrix and its newer version. This calculation is performed for each iteration and the maximum value is noted. The results show that this maximum difference decreases steadily and then converges. The same procedure is performed for the topic-document matrix, which shows the same behavior: it also converges, possibly earlier or later. The following graphs give evidence of this behavior of PLSA. The experiment was performed for the query keyword "IIT" with three topics, using the absolute measure. For clear visualization of such small values, the negative natural logarithm is plotted on the y-axis; thus a value of 20 on the y-axis represents exp(-20).
[Figure: plot of the negative log of the maximum cell difference against iteration number (1 to 22) for the term-topic matrix.]
Fig 4.2 Convergence in Term-Topic Matrix computed by Absolute Measure
[Figure: plot of the negative log of the maximum cell difference against iteration number (1 to 27) for the topic-document matrix.]
Fig 4.3 Convergence in Topic-Document Matrix computed by Absolute Measure
4.2.2.2 Average Measure

The average measure is computed by the following formula:

    Max_{i,j} = | P_{i,j}^(n+1) - P_{i,j}^(n) | / ( ( | P_{i,j}^(n+1) | + | P_{i,j}^(n) | ) / 2 )

where P_{i,j}^(n) is the value in the i-th row and j-th column of the term-topic matrix (or topic-document matrix) after the n-th iteration.
The same procedure as explained previously is used here, with the average measure in place of the absolute measure. The following graphs again show convergent behavior under this measure. The experiment was performed for the query keyword "IIT" with three topics, as in the previous experiment.
[Figure: plot of the negative log of the maximum relative cell difference against iteration number (1 to 27) for the term-topic matrix.]
Fig 4.4 Convergence in Term-Topic Matrix computed by Average Measure
[Figure: plot of the negative log of the maximum relative cell difference against iteration number (1 to 25) for the topic-document matrix.]
Fig 4.5 Convergence in Topic-Document Matrix computed by Average Measure
4.2.3 Number of Iterations for Convergence

The number of iterations for convergence is also an important issue, and it must be optimal: it should not be so small that the algorithm stops in a non-converged state, nor so large that it over-tunes the values (the probabilities in both matrices). The technique of "early stopping" is used to handle this, and the algorithm is implemented so that it takes care of these situations automatically. The maximum difference between corresponding cell values of the old and new matrices is computed for each iteration; if this difference becomes small enough (say < 0.001), the iterations are stopped automatically.
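The two convergence measures and the early-stopping test can be sketched as follows. The average measure is reconstructed here as a symmetric relative difference, which is an assumption based on the formula given in Section 4.2.2.2; the 0.001 threshold matches the one quoted above.

```java
// Sketch of the absolute and average convergence measures over two
// successive versions of a probability matrix, plus the early-stopping test.
public class ConvergenceSketch {

    // Absolute measure: largest per-cell difference between the versions.
    public static double absoluteMeasure(double[][] oldM, double[][] newM) {
        double max = 0;
        for (int i = 0; i < oldM.length; i++)
            for (int j = 0; j < oldM[0].length; j++)
                max = Math.max(max, Math.abs(newM[i][j] - oldM[i][j]));
        return max;
    }

    // Average measure: per-cell difference divided by the cells' mean magnitude
    // (assumed reconstruction of the formula in Section 4.2.2.2).
    public static double averageMeasure(double[][] oldM, double[][] newM) {
        double max = 0;
        for (int i = 0; i < oldM.length; i++)
            for (int j = 0; j < oldM[0].length; j++) {
                double denom = (Math.abs(newM[i][j]) + Math.abs(oldM[i][j])) / 2;
                if (denom > 0)
                    max = Math.max(max, Math.abs(newM[i][j] - oldM[i][j]) / denom);
            }
        return max;
    }

    // Early stopping: the EM loop halts once the difference falls below eps.
    public static boolean converged(double[][] oldM, double[][] newM, double eps) {
        return absoluteMeasure(oldM, newM) < eps;
    }
}
```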
4.2.4 PLSA Screen-shots

Following are two examples of the user interface, showing that the results for a query are grouped according to their context. The queries are
1. Thread
2. India Tourism
Fig 4.6 GUI representing results for query “Thread”
Fig 4.7 GUI representing results for query “India Tourism”
4.3 Convergence in the Number of Unique Links over Iterations

An important question is how many times a query should be expanded to obtain refined and relevant results. The following graph clearly shows that after a certain number of query-expansion iterations (around 5-6), the number of unique links (and hence unique documents) converges. The test was performed for the query "Thread", expanded gradually to "Thread Package", "Thread Package lang", "Thread Package lang Java", etc. A web-page link and its respective sub-page link are counted once and treated as unique.
[Figure: plot of the number of unique web-links (0 to 30) against the query-expansion iteration (1 to 7).]
Fig 4.8 Behavior of the number of unique web-links with iterations of Query Expansion
4.4 Comparison between LSA and PLSA Results

The results of both LSA and PLSA demonstrated in the previous sections are quite good. In the case of LSA, the suggested next keywords generally belong to one dominating context but point in a very clear direction for search by query expansion. For example, in the search for "India Tourism", terms such as "Forts" and "Hill" surely direct the search toward a focused area of need.
On the other hand, PLSA suggests next keywords by classifying them into the number of topics requested by the user. This feature of PLSA makes it much easier to refine results: a keyword can be selected from a specific topic and then used for query-expansion. For the same query, "India Tourism", the next keywords are grouped into topics: the first topic shows the simple aspect of tourism and displays famous places to visit, while another group contains famous hotel and restaurant names such as "Hyatt", "Mariott" and "Regency", which represent another important aspect of tourism in India.
Since PLSA presents results in a more organized way, with the capability of distributing terms according to various logical aspects, it is better than LSA for query-expansion in a Meta-Search Engine. Its firm statistical foundation and its use of EM for convergence are two sufficient reasons for PLSA's commendable results.
4.5 Comparison with the "Dogpile" Meta-Search Engine

A comparative study was performed between the top ten results of the Meta-Search Engine "Dogpile" and the results of the proposed MSE after query expansion. The results demonstrate that while Dogpile mixes results about both concepts related to the query "Thread" (namely "dress" and "Java"), the results of the proposed MSE for the expanded queries "Thread Package Java" and "Thread Dress" are confined entirely to their respective concepts. The long web-links (URLs) in the results show that the proposed MSE reaches sub-pages of web-sites, further confirming that its results are more focused.
Fig 4.9 Top ten results of Meta-Search Engine “Dogpile” for Query
“Thread”
Fig 4.10 Top ten results of proposed MSE for Expanded Query “Thread
Package Java”
Fig 4.11 Top ten results of proposed MSE for Expanded Query “Thread
Dress”
Summary

In this chapter we have reviewed the results obtained with the proposed Meta-Search Engine and compared them with those of "Dogpile". Various experiments confirm that LSA and PLSA can indeed provide effective query expansion, and PLSA appears to outperform LSA. However, there are some shortcomings in the present version; we shall discuss them, and how they can be overcome, in the next chapter.
Chapter 5
Improvements from NER (Named-Entity Recognizer)
5.1 Introduction
This chapter presents a significant improvement to the results of the MSE, obtained through the use of a "Named-Entity Recognizer". The chapter introduces the Named-Entity Recognizer, its role, and hence its influence in the context of the MSE. It then demonstrates the changes to the previous architecture needed to incorporate this extra module, and their consequences. Finally, results are shown with illustrative examples.
In the previous chapter, the results of LSA and PLSA were shown with clarifying examples. These results are all appreciable, but they are lacking in one respect. For example, in Table 4.1 terms like "Corbett", "Golden", "Royal" and "North" appear. These terms are relevant to the search but do not convey the real meaning, because each represents only part of a collection of words; the words reveal their real meaning only when grouped together. Similarly, in Table 4.5, which shows the PLSA results for the query "Australian University", terms like "Australian", "International" and "University" appear independently although they are actually parts of the single entity "Australian International University". The same scenario occurs with terms such as "Jammu Kashmir" and "Taj Mahal". Because of this, the query has to be expanded for a few more iterations to obtain refined results, and the inference drawn from the classified keywords may be erroneous.
This is not actually a problem of the information-retrieval techniques used in the proposed MSE. The reason for such partially correct results is the search-engines' literal matching mechanism: if we fire a query "A B", the search-engines retrieve all the pages that contain A, B, or both. Since our MSE is built on these SEs and parses each term individually, this type of error in the results is unavoidable.
The essence of the problem is that we are treating each word as a separate term. This may be incorrect in cases where a group of words should be treated as a single term because they represent a single semantic unit. The problem is complicated by the fact that the individual words may also have acceptable semantics of their own. For example, in the entity "Banaras Hindu University" the individual words have distinct meanings of their own, quite different from the meaning obtained when all three words are taken together as a single unit. This is a well-known problem of natural language processing called "Named-Entity Recognition". A named entity may be the name of a person, organization, institute, place, or any proper noun, such as Mr. Albert Einstein, Carnegie Mellon University, or President of India. A named entity is thus a collection of words which, when grouped together, represent a single meaning.
A "Named-Entity Recognizer (NER)" is a system which can recognize all the named entities in a given passage. Thus, if an NER is introduced into our system, we will be able to find all the named entities in our corpus and treat each of them as a single term. Some NER packages are already freely available for English and other languages; therefore, rather than implementing our own NER module, it was considered better to use an available one. Since our design of the MSE supports easy extendibility, the new module could be incorporated easily.
We expected two positive outcomes from adding NER. First, the number of terms in the term-document matrix is reduced, which increases the responsiveness of the system, particularly for PLSA; this is because PLSA uses an iterative procedure for convergence, and the per-iteration complexity of the algorithm depends on the number of terms. Second, we obtain a greater variety of terms as next probable query keywords, because some of the most related terms are already grouped within their respective named entities. In fact, the whole procedure adds a notion of meaning to the next keywords.
Empirical results show that, on average, the number of terms without NER was 3096, while after the addition of the NER module this number reduced to 3079, reflecting the presence of 17 named entities. With NER, these entities are treated as single terms and provide more meaningful next probable keywords for expansion.
5.2 Modified Architecture of the Meta-Search Engine

Figure 5.1 below shows the modified MSE. Since the architecture is highly modular, the NER module could be plugged in easily.
[Figure: the architecture of Fig. 3.1 with an added NER module between the Page Retriever (Web-Pages) and the Pre-Processing Unit (Processed Text-Corpora); the remaining components (User Interface, Common Interface to Search-Engines, Search-Engines 1 through n, Baseline Establishment, and the LSA/PLSA Algorithms module) are unchanged.]
Fig. 5.1 Architecture of Modified Meta-Search Engine
From the User Interface component to the Page Retriever, everything is the same as presented in the previous chapter. The named-entity recognition task is performed after the text has been extracted from the retrieved pages and before the pre-processing unit. Named entities are recognized in all the text files and stored in the desired term-document format with a suitable term-weight factor. Moreover, the identified named entities must not be passed through the stop-word removal and stemming phases: if, for example, the word "of" were removed from the named entity "Indian Institute of Information Technology" by the stop-word removal process, the whole named entity would be distorted. Therefore, the NER belongs just before the preprocessing unit. The remaining terms are stored in the term-document matrix, which then serves as input to the rest of the system where LSA or PLSA is executed. As mentioned previously, these algorithms yield the next keywords for the given query.
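The effect of treating a recognized entity as a single term can be sketched as follows. This is an illustrative token-merging step, not the Stanford NER itself: given the entity phrases the recognizer has tagged, each occurrence is collapsed into one term before the term-document matrix is built, so the phrase bypasses stop-word removal and stemming as a unit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of post-NER token merging: every occurrence of a recognized entity
// phrase in the token stream is replaced by a single joined term.
public class EntityMerger {
    public static List<String> mergeEntities(List<String> tokens, Set<List<String>> entities) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); ) {
            List<String> matched = null;
            for (List<String> e : entities)
                if (i + e.size() <= tokens.size()
                        && tokens.subList(i, i + e.size()).equals(e)
                        && (matched == null || e.size() > matched.size()))
                    matched = e;   // prefer the longest entity at this position
            if (matched != null) {
                out.add(String.join(" ", matched));  // one term for the whole entity
                i += matched.size();
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```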
5.3 Modified High-level Design

[Figure: the package diagram of Fig. 3.2 with an added NER package alongside MetaSearchEngine, LSA/PLSA Algorithm, ReadingAndStemming, Parser, SearchEngineInterface, GUI and BaseLine.]
Fig. 5.2 High-Level Design with NER (Package Diagram)
The package diagram in Fig. 5.2 shows the position of the NER package and its interactions with the remaining packages. The advantage of the modular design is evident here: the NER module could be added to the previous design with few changes, and the new idea works well.
The NER package uses the Named-Entity Recognizer library developed by the Natural Language Processing Group at Stanford University. This library is freely available under the GNU license. It uses a Conditional Random Field (CRF) classifier: the library provides an implementation of a Conditional Random Field sequence model, coupled with a feature extractor for NER. It recognizes three types of named entities: person, location and organization. The library also contains some other models and versions, with and without additional similarity features; these features improve performance but require a considerable amount of memory. For the proposed MSE, the classifier with the smallest memory requirement is used. The library is implemented in Java and is available as a .jar file called stanford-ner.jar [41]. The following example shows text data after named-entity recognition.
Fig. 5.3 A text file before and after Named-Entity Recognition
From Fig. 5.3 it is evident that named entities such as "Indian Institute of Information Technology", "Allahabad" and "Dr. M. D. Tiwari" are properly identified and enclosed in their respective tags.
Among the classes of the NER package, NamedEntity.java converts all the text files into the named-entity format, while NEExtraction.java extracts the named entities and stores them in the term-document matrix; non-entity terms are written back to the corresponding files for further processing. For proper and efficient use of NER, a small modification was made to the Jericho HTML parser.
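The storage step performed by NEExtraction.java can be sketched as follows. The class name and the entity boost factor are assumptions for illustration; the thesis states only that a "suitable term-weight factor" is applied, not its exact value.

```java
import java.util.*;

// Illustrative sketch of building one document's row of the term-document
// matrix: named-entity terms receive a boosted weight, ordinary terms count 1.
public class TermWeighting {

    public static Map<String, Double> termWeights(List<String> terms,
                                                  Set<String> namedEntities,
                                                  double entityBoost) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : terms) {
            double w = namedEntities.contains(term) ? entityBoost : 1.0;
            weights.merge(term, w, Double::sum);   // accumulate weighted frequency
        }
        return weights;
    }
}
```

Each document's weight map then becomes one column of the term-document matrix fed to LSA or PLSA.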
5.4 Results of NER
We repeated the experiments of the previous chapter, which did not use NER, this time with the NER module enabled. The results obtained were as per our expectations. The following user interface displays one of the results for the query "India Tourism" after applying the Named-Entity Recognizer. The result contains named entities such as "Golden City", "Indian Wildlife", "Corbett National Park Tour", "Taj & Wildlife Tour", "Discover North India" and "Discover Forts and Palaces". It is instructive to compare the results shown below with those obtained for the same query without NER in the previous chapter. Referring to those results, it is quite clear that the term "Corbett" is related to the national park, "North" appears in the context of "Discover North India", and forts and palaces belong to "Discover Forts and Palaces".
Now we can use these named entities for query expansion and obtain refined results within the next one or two iterations. Similarly, other closely related keywords from different contexts can be obtained. However, it should be emphasized that the improvement in the results would be less dramatic if the results of the original query did not contain a significant number of named entities, or if the named entities were not very relevant to the original query.
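The expansion step itself can be as simple as the following sketch (a hypothetical helper, not a class of the implemented system): the suggested keyword is appended to the original query, with multi-word named entities quoted so the underlying engines treat them as a single phrase.

```java
// Naive query-expansion sketch: append the chosen suggestion to the query,
// phrase-quoting multi-word named entities.
public class QueryExpansion {

    public static String expand(String originalQuery, String suggestedKeyword) {
        String keyword = suggestedKeyword.contains(" ")
                ? "\"" + suggestedKeyword + "\""   // keep a multi-word entity intact
                : suggestedKeyword;
        return originalQuery + " " + keyword;
    }
}
```

For example, expanding "India Tourism" with the suggested entity "Corbett National Park" yields the refined query for the next iteration.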
Fig. 5.4 GUI after applying NER for query “India Tourism”
5.5 Summary
In this chapter we explored thoroughly the effect of introducing a new module
“Named Entity Recognizer”. We examined its importance, consequences and results.
From the results it is quite evident that the new module provides significant
improvements, particularly for those queries where the named entities are likely to
have a high relevance. This chapter also demonstrated the strength of the design of
our software because we could add the new module quite easily into the existing
system.
Chapter 6
Conclusion and Future Enhancements
6.1 Conclusion
In the present scenario, search engines are very useful tools for extracting needed information from the Internet. Meta-search engines serve the same purpose with a wider span of coverage and advanced features such as maintaining user profiles and filtering results. The proposed MSE is based on refining the results using query expansion, where the next keywords are suggested by the MSE itself without using any thesaurus or dictionary. We can conclude that both algorithms, LSA and PLSA, work well for suggesting next keywords for the MSE.
The results and analysis demonstrate that PLSA outperforms LSA and presents all the results in a well-classified and easily understandable format. Further, incorporating the Named Entity Recognizer into the MSE improves the results. It can therefore be concluded that designing an MSE that uses LSA/PLSA for query expansion is a fruitful idea.
6.2 Future Enhancements
The following points outline future enhancements to the proposed MSE:
• The current meta-search engine uses only the results of Google, Yahoo and MSN. Other search engines, such as AltaVista and Ask Jeeves, could be added to the proposed MSE. This would increase its coverage span and could provide even more acceptable results; the design of the proposed MSE supports such easy modification.
• The APIs used for implementing the MSE provide only a limited number of results from the respective search engines. If the number of retrieved results could be increased, the suggestions would improve accordingly.
• Parsers for additional file formats could be added, giving this MSE an admirable extra feature.
• Web pages may contain advertisements and images to a large extent. These are of no use from the algorithm and query-expansion points of view, so a good provision could be made in the MSE to deal with such content effectively.
• Maintaining information about user profiles could be an extended feature of the proposed MSE. If a provision is made to keep (user, URL) information for each user, then the next time that user issues a query, the results could be filtered or categorized before being displayed on the user interface.
• A pronoun in context reduces the weight of the noun it refers to. For example, consider the following passage:
"The Taj Mahal is one of the most famous historical monuments of India. It is one among the seven wonders of the world. It was built by Shahjahan."
'It' in sentences 2 and 3 refers to "Taj Mahal", so both occurrences should be counted towards it; i.e., the frequency count of "Taj Mahal" should be three. In the present technique, however, it is only one, so the resultant weight of "Taj Mahal" is underestimated. This is a well-known problem in NLP, called anaphora resolution. To solve it, we would need to add a module for anaphora resolution. Such a module can easily be added to the design of the MSE, and even more accurate results can be expected.
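As a rough illustration of the intended effect, the following deliberately naive Java sketch (in no way a real anaphora resolver) replaces each standalone "It"/"it" with the earliest-mentioned entity from a supplied list, so the entity's frequency count rises accordingly.

```java
import java.util.*;

// Naive anaphora-resolution sketch: pick the entity mentioned earliest in the
// text as the antecedent and substitute it for every standalone "It"/"it".
// A real resolver must consider gender, number, salience and syntax.
public class AnaphoraSketch {

    public static String resolve(String text, List<String> entities) {
        String antecedent = null;
        int earliest = Integer.MAX_VALUE;
        for (String e : entities) {
            int pos = text.indexOf(e);
            if (pos >= 0 && pos < earliest) {   // simplistic antecedent choice
                earliest = pos;
                antecedent = e;
            }
        }
        if (antecedent == null) return text;
        return text.replaceAll("\\b[Ii]t\\b",
                java.util.regex.Matcher.quoteReplacement(antecedent));
    }
}
```

Applied to the passage above with the entity list {"Taj Mahal", "India", "Shahjahan"}, both occurrences of "It" become "Taj Mahal", so its frequency count rises from one to three.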
Appendix-A:
Search Engines’ API
A.1 Google SOAP Search API (beta)
Software developers can now build their own programs that query a large number of web pages, using the Google SOAP Search API for this purpose. To provide this facility, Google uses the Simple Object Access Protocol (SOAP) and the Web Service Description Language (WSDL). The availability of this API for various languages and platforms, such as Perl, Java and Visual Studio .NET, gives developers the liberty to choose their favorite environment [33].
Good example code and complete documentation come with the developer's kit. A license key and a Google account are needed to access the services of this API; with both, one is entitled to fire 1000 queries per day, and at most 10 links can be retrieved per request [33]. Google has since replaced this API with its newer Google AJAX Search API. Sometimes a proxy does not allow SOAP requests to pass; to eliminate this problem, the user must have privileges so that his or her request can be passed through the proxy. The essential classes and methods of this API that are used in the proposed MSE are given in Table A.1 below:
Table A.1 Classes and Methods of Google SOAP Search API

Class GoogleSearch:
• public GoogleSearch()
  Constructs a new instance of a GoogleSearch client.
• public void setKey(String key)
  Sets the user key used for authorization by the Google SOAP server. This is a mandatory attribute for all requests.
• public void setQueryString(String q)
  Sets the query string for this search.
• public byte[] doGetCachedPage(String url) throws GoogleSearchFault
  Retrieves a cached web page from Google. The key attribute must be set.
• public String doSpellingSuggestion(String phrase) throws GoogleSearchFault
  Asks Google to return a spelling suggestion for a word or phrase.
• public GoogleSearchResult doSearch() throws GoogleSearchFault
  Invokes the Google search. Note: the key and query attributes must already be set.

Class GoogleSearchResult:
• public GoogleSearchResult()
  Constructor.
• public String toString()
  Returns a nicely formatted representation of a Google search result.
A.2 Yahoo Search Web Service API
The Yahoo! Developer Network provides various web services for application developers, who can use them to build new and customized applications. These services are based on REST (Representational State Transfer): Yahoo! Web Services operate over HTTP requests in which the URL must be encoded. All the libraries and example code for accessing the Yahoo! Search Web Services are bundled together as a Software Development Kit (SDK), which can easily be downloaded from the Yahoo! website [34]. The SDK includes code in Java, Lua, JavaScript, Perl and other languages, so a developer can easily choose a language and platform of his or her choice. To access Yahoo! Web Services one must register and obtain an application ID, which is tied to the application and must accompany each web service request.
Table A.2 Classes and Methods of Yahoo Search Web Service API

Class SearchClient:
• public SearchClient(String appId)
  Constructs a new SearchClient with the given application ID, using the default settings.
• public WebSearchResults webSearch(WebSearchRequest request) throws IOException, SearchException
  Searches the Yahoo database for web results.

Class WebSearchRequest:
• public WebSearchRequest(String query)
  Constructs a new web search request.
• public void setResults(int results)
  Sets the maximum number of results to return. Fewer results may be returned if there aren't enough in the database. At the time of writing, the default value is 10 and the maximum value is 50.

Interface WebSearchResult:
• String getTitle()
  The title of the web page.
• String getUrl()
  The URL of the web page.

Interface WebSearchResults:
• BigInteger getTotalResultsAvailable()
  The number of query matches in the database.
• BigInteger getTotalResultsReturned()
  The number of query matches returned. This may be lower than the number of results requested if fewer total results were available.
• WebSearchResult[] listResults()
  The list (in order) of results from the search.
A.3 MSN Search SDK (beta)
The MSN Search SDK beta gives a user the ability to send queries to MSN Live Search and receive results. The documentation shipped with the SDK explains the essential concepts, guidelines and library of the MSN Search Web Service, and the SDK also contains example code that illustrates techniques for application development.
This SDK requires one of the Windows platforms (Windows 2000, Server 2003, XP or Vista) and a computer capable of sending requests via SOAP 1.1 and HTTP 1.1 and of parsing XML. Microsoft Visual Studio .NET 2003 or 2005 and the Microsoft .NET Framework must be installed on the deployment computer to build and run the applications. An application ID must always accompany each request. For a given query, the top 10 MSN results can be received in the user program [35].
Appendix-B:
Parser’s API
B.1 Jericho HTML Parser
Jericho HTML Parser is a powerful Java library that analyses and manipulates parts of an HTML document [36]. It also contains functions that can manipulate high-level HTML forms. Since it is available as an open-source library, it can easily be used even in commercial applications. The library has the following major features that distinguish it from other HTML parsers:
• It is not a tree-based parser; it is based entirely on simple text search and efficient recognition of tags.
• Its memory and resource requirements are far lower than those of DOM-based parsers.
• Each parsed segment can be easily accessed, and modifications to selected segments can be performed efficiently.
• It provides an easy way to define and register custom tags so that the parser can recognize them.
B.2 PDFBOX-0.7.0
PDFBox provides functionality for creating new PDF documents, manipulating them and extracting content from them. It is available as open source and also comprises several utilities [37]. Some essential features of PDFBox are:
• Text extraction from PDF
• Merging of PDF documents
• Encryption/decryption of PDF documents
• Integration with the Lucene search engine
• Creation of a PDF from a text file
• Creation of images from PDF pages
For the proposed MSE, only the PDF-to-text extraction feature is used.
Appendix-C:
List of Stop Words
The following stop-word list is available on the LSI web site of the Computer Science Department of the University of Tennessee [42].
Table C.1 List of stop words.
A appear C doing former a appreciate c'mon don't formerly
a's appropriate c's done forth able are came down four
about aren't can downwards from above around can't during further
according as cannot E furthermore accordingly aside cant each G
across ask cause edu get actually asking causes eg gets
after associated certain eight getting afterwards at certainly either given
again available changes else gives against away clearly elsewhere go
ain't awfully co enough goes all B com entirely going
allow be come especially gone allows became comes et got almost because concerning etc gotten alone become consequently even greetings along becomes consider ever H
already becoming considering every had also been contain everybody hadn't
although before containing everyone happens always beforehand contains everything hardly
am behind corresponding everywhere has among being could ex hasn't
amongst believe couldn't exactly have an below course example haven't and beside currently except having
another besides D F he any best definitely far he's
anybody better described few hello
anyhow between despite fifth help anyone beyond did first hence
anything both didn't five her anyway brief different followed here anyways but do following here's anywhere by does follows hereafter
apart doesn't for hereby
herein K N others saying hereupon keep name otherwise says
hers keeps namely ought second herself kept nd our secondly
hi know near ours see him knows nearly ourselves seeing
himself known necessary out seem his L need outside seemed
hither last needs over seeming hopefully lately neither overall seems
how later never own seen howbeit latter nevertheless P self however latterly new particular selves
I least next particularly sensible i'd less nine per sent i'll lest no perhaps serious i'm let nobody placed seriously i've let's non please seven ie like none plus several if liked noone possible shall
ignored likely nor presumably she immediate little normally probably should
in look not provides shouldn't inasmuch looking nothing Q since
inc looks novel que six indeed ltd now quite so indicate M nowhere qv some indicated mainly O R somebody indicates many obviously rather somehow
inner may of rd someone insofar maybe off re something instead me often really sometime
into mean oh reasonably sometimes inward meanwhile ok regarding somewhat
is merely okay regardless somewhere isn't might old regards soon
it more on relatively sorry it'd moreover once respectively specified it'll most one right specify it's mostly ones S specifying
its much only said still itself must onto same sub
J my or saw such just myself other say sup
sure they U we whole T they'd un we'd whom t's they'll under we'll whose
take they're unfortunately we're why taken they've unless we've will tell think unlikely welcome willing
tends third until well wish th this unto went with
than thorough up were within thank thoroughly upon weren't without thanks those us what won't thanx though use what's wonder that three used whatever would
that's through useful when would thats throughout uses whence wouldn't the thru using whenever X
their thus usually where Y theirs to uucp where's yes them together V whereafter yet
themselves too value whereas you then took various whereby you'd
thence toward very wherein you'll there towards via whereupon you're
there's tried viz wherever you've thereafter tries vs whether your thereby truly W which yours
therefore try want while yourself therein trying wants whither yourselves theres twice was who Z
thereupon two wasn't who's zero these way whoever
Appendix-D:
JAMA API (for SVD)
The classes of the Jama package are listed below [40]:
• CholeskyDecomposition
• EigenvalueDecomposition
• LUDecomposition
• Matrix
• QRDecomposition
• SingularValueDecomposition
The classes and their respective methods used in the proposed MSE for SVD are:
Table D.1 Classes and Methods of JAMA API

Class Matrix:
• public Matrix(double[][] A)
  Construct a matrix from a 2-D array.
• public int getColumnDimension()
  Get the column dimension.
• public int getRowDimension()
  Get the row dimension.
• public double[][] getArray()
  Access the internal two-dimensional array.
• public double[][] getArrayCopy()
  Copy the internal two-dimensional array.
• public Matrix getMatrix(int i0, int i1, int j0, int j1)
  Get a submatrix.
• public Matrix transpose()
  Matrix transpose.
• public Matrix times(Matrix B)
  Linear algebraic matrix multiplication, A * B.
• public void print(int w, int d)
  Print the matrix to stdout, lining the elements up in columns with a Fortran-like 'Fw.d' style format.
• public SingularValueDecomposition svd()
  Singular value decomposition.

Class SingularValueDecomposition:
• public SingularValueDecomposition(Matrix Arg)
  Construct the singular value decomposition.
• public Matrix getU()
  Return the left singular vectors.
• public Matrix getV()
  Return the right singular vectors.
• public Matrix getS()
  Return the diagonal matrix of singular values.
References
[1] Jae Hyun Lim, Young-Chan Kim, Hyonwoo Seung, Jun Hwang, Heung-Nam Kim, "Query Expansion for Intelligent Information Retrieval on Internet", Proceedings of the International Conference on Parallel and Distributed Systems, pp. 652-656, 1997.
[2] Boston University, "How Search Engines Work", www.bu.edu
[3] Danny Sullivan, "How Search Engines Work", www.searchenginewatch.com
[4] Zheng Li, Yuanqiong Wang, Vincent Oria, "A New Architecture to Web Meta-Search Engine", CIS Department, New Jersey Institute of Technology, Seventh Americas Conference on Information Systems, 2001.
[5] J. H. Abawajy, M. J. Hu, "A New Internet Meta-Search Engine and Implementation", The 3rd ACS/IEEE International Conference on Computer Systems and Applications, p. 103, 2005.
[6] B. Shanmukha Rao, S. V. Rao, G. Sajith, "A User-Profile Assisted Meta Search Engine", TENCON 2003 Conference on Convergent Technologies for Asia-Pacific Region, Volume 2, pp. 713-717, 15-17 Oct. 2003.
[7] A. Spink, B. J. Jansen, C. Blakely, S. Koshman, "Overlap Among Major Web Search Engines", ITNG 2006 Third International Conference on Information Technology: New Generations, pp. 370-374, 10-12 April 2006.
[8] A. Gulli, A. Signorini, "Building an Open Source Meta-Search Engine", Special interest tracks and posters of the 14th International Conference on World Wide Web (WWW '05), ACM Press, May 2005.
[9] Eric J. Glover, Steve Lawrence, William P. Birmingham, C. Lee Giles, "Architecture of a Meta-Search Engine that Supports User Information Needs", Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM '99), ACM Press, pp. 210-216, 1999.
[10] L. Yuen, M. Chang, Y. K. Lai, Chung Keung Poon, "Excalibur: A Personalized Meta Search Engine", Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC 2004), Volume 2, pp. 49-50, 2004.
[11] Junjie Chen, Wei Liu, "A Framework for Intelligent Meta-Search Engine Based on Agents", Third International Conference on Information Technology and Applications (ICITA 2005), Volume 1, pp. 276-279, 4-7 July 2005.
[12] D. L. Lee, Huei Chuang, K. Seamons, "Document Ranking and the Vector-Space Model", IEEE Software, Volume 14, Issue 2, pp. 67-75, Mar/Apr 1997.
[13] G. Salton, A. Wong, C. S. Yang, "A Vector Space Model for Automatic Indexing", 1975.
[14] Website: http://mingo.infoscience.uiowa.edu/courses/230/Lectures/Vector1.html#1d
[15] Vijay V. Raghavan, S. K. M. Wong, "A Critical Analysis of Vector Space Model for Information Retrieval", Journal of the American Society for Information Science, vol. 35, no. 5, pp. 279-287, 1986.
[16] Website: http://www.miislita.com/tervector/term-vector-2.html
[17] Scott Deerwester et al., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[18] Thomas K. Landauer, Peter W. Foltz, Darrell Laham, "An Introduction to Latent Semantic Analysis", Discourse Processes, 25, pp. 259-284, 1998.
[19] Website: www.cs.bham.ac.uk/~axk/ML_PLSA.ppt
[20] Bin Tang, Xiao Luo, Malcolm I. Heywood, Michael Shepherd, "A Comparative Study of Dimension Reduction Techniques for Document Clustering", Technical Report CS-2004-14, December 6, 2004.
[21] Holger Bast, "Dimension Reduction: A Powerful Principle for Automatically Finding Concepts in Unstructured Data", Max Planck Institute for Informatics.
[22] Cambridge University Press, "Dimensionality Reduction and Latent Semantic Indexing", draft, October 13, 2006.
[23] Vishwa Vinay, Ingemar J. Cox, Ken Wood, Natasa Milic-Frayling, "A Comparison of Dimensionality Reduction Techniques for Text Retrieval", Proceedings of the Fourth IEEE International Conference on Machine Learning and Applications (ICMLA '05), 2005.
[24] Bin Tang, Michael Shepherd, Evangelos Milios, Malcolm I. Heywood, "Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering", January 21, 2005.
[25] Barbara Rosario, "Latent Semantic Indexing: An Overview", INFOSYS 240 Final Paper, Spring 2000.
[26] Eileen Kintsch, Dave Steinhart, Gerry Stahl, Cindy Matthews, Ronald Lamb, "Developing Summarization Skills through the Use of LSA-Based Feedback", Interactive Learning Environments (in press).
[27] Thomas K. Landauer, Darrell Laham, Peter Foltz, "Learning Human-like Knowledge by Singular Value Decomposition: A Progress Report".
[28] Bob Rehder, M. E. Schreiner, Michael B. W. Wolfe, Darrell Laham, Thomas K. Landauer, Walter Kintsch, "Using Latent Semantic Analysis to Assess Knowledge".
[29] Thomas Hofmann, "Probabilistic Latent Semantic Indexing", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, California, United States, pp. 50-57, 1999.
[30] Thomas Hofmann, "Probabilistic Latent Semantic Analysis", Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99), 1999.
[31] Sean Borman, "The Expectation Maximization Algorithm: A Short Tutorial", June 28, 2006.
[32] Guandong Xu, Yanchun Zhang, Xiaofang Zhou, "Using Probabilistic Latent Semantic Analysis for Web Page Grouping", Proceedings of the 15th IEEE International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA '05), 2005.
[33] Website: http://code.google.com
[34] Website: http://developer.yahoo.com/download/download.html
[35] Website: http://www.microsoft.com/downloads/details.aspx
[36] Website: http://sourceforge.net/projects/jerichohtml/
[37] Website: http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
[38] Website: http://www.dcs.gla.ac.in/idom/ir_resources/linguistic_util/porter.java
[39] Website: http://www.comp.lancs.ac.uk/computing/research/stemming/paice/article.htm
[40] Website: http://math.nist.gov/javanumerics/jama/
[41] Website: http://nlp.stanford.edu/software/CRF-NER.shtml
[42] Website: http://www.cs.utk.edu/~lsi/