Meta-Search Engine based on
Query-Expansion Using Latent Semantic Analysis and
Probabilistic Latent Semantic Analysis
DISSERTATION
SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF TECHNOLOGY IN INFORMATION TECHNOLOGY
(SOFTWARE ENGINEERING)
Under the Supervision of Dr. Sudip Sanyal, Associate Professor
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY – ALLAHABAD
(DEEMED UNIVERSITY)
DEOGHAT, JHALWA
ALLAHABAD- 211011, (U.P.)
INDIA
IIIT-Allahabad
Submitted by Anand Arun Atre
MS200504 M.Tech. IT (Software Engineering)
IIIT-Allahabad
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD
(A University Established under sec. 3 of UGC Act, 1956 vide Notification No. F.9-4/99-U.3 Dated 04.08.2000 of the Govt. of India)
(A Centre of Excellence in Information Technology Established by Govt. of India)
Date: ______________
We do hereby recommend that the thesis work prepared
under my/our supervision by Anand Arun Atre entitled
“Meta-Search Engine based on Query-Expansion Using
Latent Semantic Analysis and Probabilistic Latent
Semantic Analysis” be accepted in partial fulfillment of
the requirements of the degree of Master of Technology in
Information Technology (Software Engineering) for
examination.
COUNTERSIGNED
Dr. Sudip Sanyal ______________________________ THESIS ADVISER
Dr. U. S. Tiwary DEAN (ACADEMICS)
CERTIFICATE OF APPROVAL*
The foregoing thesis is hereby approved as a creditable
study in the area of Information Technology carried out
and presented in a manner satisfactory to warrant its
acceptance as a pre-requisite to the degree for which it
has been submitted. It is understood that by this
approval the undersigned do not necessarily endorse or
approve any statement made, opinion expressed or
conclusion drawn therein but approve the thesis only
for the purpose for which it is submitted.
COMMITTEE ON
FINAL EXAMINATION
FOR EVALUATION
OF THE THESIS
*Only in case the recommendation is concurred in
Acknowledgement
The satisfaction and bliss that accompany the successful completion of any task would be incomplete without mentioning the people who made it possible, because success is not only the result of hard work and perseverance but also of encouraging guidance and tremendous help. This thesis, while an achievement that bears my name, would not have been possible without the help of others. I gladly take this opportunity to thank the people who helped me make this work possible.
At the outset, I thank Almighty God for the divine grace and blessings showered on me, giving me the strength and courage to complete this thesis work and, in turn, my course successfully.
It is my privilege to study at the Indian Institute of Information Technology, Allahabad, where students and professors are always eager to learn new things and to make continuous improvements by providing innovative solutions. I am highly grateful to the honorable Director of IIIT-Allahabad, Dr. M. D. Tiwari, for his ever-helpful attitude and for encouraging us to excel in our studies. I am also thankful to Dr. U. S. Tiwary, Dean Academics, IIIT-Allahabad, for providing all the necessary requirements and his moral support for this dissertation work.
Regarding this thesis work, first and foremost, I would like to heartily thank my supervisor, Dr. Sudip Sanyal, for his able guidance. His fruitful suggestions, valuable comments and support were an immense help to me. In spite of his hectic schedule, he took pains, with a smile, in various discussions which enriched me with new enthusiasm and vigour.
I owe my gratitude to Mr. Mithilesh Mishra, Coordinator, IIIT-A Network Development, Engineering & Management (INDEM), for granting me special access privileges on the IIIT-A network, bypassing the proxy, for the successful execution of the search engines’ APIs. I am also thankful to Mr. Balwant Singh, in-charge of the Maintenance Cell, for issuing me a computer system with all the necessary accessories and configuration.
Now, I would like to mention my classmates who, directly or indirectly, helped me a lot. First, my thanks go to one of my best friends ever, Mr. Mohd. Imran Khan. From the beginning to the completion of this project, he discussed with me many of the intricacies and complexities of the project, and he always directed me towards writing efficient and more maintainable code. He is fully aware of all the thicks and thins of my project, which is evidence of how much the success of the whole thesis depends on him. Next, I would like to thank Mr. Kamal Sawan for helping me execute the Google API during the early days of the project; Mr. Prabhat Saheja, who is well familiar with the .NET framework and C#, for the successful execution of the MSN API; and Mr. Nilesh Chandra Shukla and Mr. Vineet Chauhan, who helped me a lot during the coding phase.
It has been a wonderful experience to spend the two most exciting years of my life at IIIT-A with friends like Mr. Adish Singh, Mr. Abhay Sukhdeo Pawane, Mr. Pankaj Kandpal, Mr. Sampath Kumar Mada and Mr. Sahab Nath Yadav, who always motivated me to pursue this project with sincerity and patience.
I am also profoundly thankful to Mr. Parikshit Totawar and Mr. Shrikant Mantri for assisting me in learning Perl.
I also wish to extend my thanks to Mr. K. Ashwin Kumar and Mr. Prabhash Dhyani, members of the INDEM team, for solving the network-access problems related to the extra privileges assigned to my IIITA account. Students of the pre-final year of B.Tech., especially Mr. Animesh Nayan and Mr. B. Ravikiran Rao, helped me greatly with the efficient usage of the “NER” library.

I also owe my thanks to Mr. Dhirendra Pratap Singh, Mr. Mahindra Giri Vasireddy and Ms. Megha Thakkar for suggesting some nice improvements during implementation.
Lastly, I would like to express my warm gratitude to my grandmother, my parents and my maternal uncle, Mr. M. A. Vetal, for their unbounded love and priceless support throughout my life. Their support has kept me striving for success. I hope that with the completion of this course, I have made them proud.
Anand Arun Atre
14-July-2007
DECLARATION
This is to certify that this thesis work entitled “Meta-Search Engine based on Query-Expansion using Latent Semantic Analysis and Probabilistic Latent Semantic Analysis”, which is submitted by me in partial fulfillment of the requirements for the completion of the M.Tech. in Information Technology with specialization in Software Engineering at the Indian Institute of Information Technology, Allahabad, comprises only my original work, and due acknowledgement has been made in the text to all other materials used.
Name : Anand Arun Atre
M.Tech.-IT: Software Engineering
Enrolment No: MS200504
Abstract
As a result of the rapid advancements in Information Technology, Information Retrieval on the Internet (Internet searching) is gaining importance day by day. Search engines are admittedly essential tools for this purpose. But, like the two sides of the same coin, search engines’ performance degrades due to some critical issues. This fact motivates another solution, namely the implementation of a Meta-Search Engine. This thesis presents an analysis of the applicability of the Probabilistic Latent Semantic Analysis technique for performing Query Expansion in the context of Meta-Search Engines. The basic idea is to refine results using query expansion. Our experiments clearly demonstrate that the technique gives excellent results for query expansion, with distinct senses of the query keywords being grouped into different topics. Moreover, the applied method converges very rapidly, thus providing an efficient and extremely pragmatic method for query expansion. We also compare our results with those obtained using Latent Semantic Analysis.
Keywords: Meta-Search Engine, Query-Expansion, Latent Semantic Analysis,
Probabilistic Latent Semantic Analysis, Convergence.
Table of Contents
Acknowledgement ..........................................................................................................i
DECLARATION .......................................................................................................... iv
Abstract.......................................................................................................................... v
Table of Contents .........................................................................................................vi
List of Tables ..............................................................................................................viii
List of Figures ..............................................................................................................ix
Introduction.................................................................................1
1.1 Overview.................................................................................1
1.2 Objective................................................................................1
1.3 Motivation...............................................................................2
1.4 Problem Statement........................................................................4
1.5 Contribution of Thesis...................................................................5
1.6 Structure of Thesis......................................................................5
1.7 Summary..................................................................................6
Literature Survey............................................................................7
2.1 Current Trends in Meta-Search Engine....................................................7
2.2 Vector Space Model.......................................................................8
2.3 Latent Semantic Analysis (LSA)..........................................................12
2.3.1 Concept of LSA........................................................................12
2.3.2 Limitations of LSA....................................................................19
2.3.3 Advantages and Applications of LSA....................................................20
2.4 Probabilistic Latent Semantic Analysis (PLSA)...........................................21
2.4.1 Concept of PLSA.......................................................................21
2.4.2 PLSA Algorithm........................................................................23
2.4.3 Advantages and Applications of PLSA...................................................26
2.5 Summary.................................................................................27
Proposed Meta-Search Engine.................................................................28
3.1 Basic Theme.............................................................................28
3.2 Architecture of Proposed MSE............................................................29
3.3 Implementation Details..................................................................31
3.4 Features of Proposed System.............................................................34
3.5 Summary.................................................................................35
Result and Analysis.........................................................................36
4.1 Result Analysis of LSA..................................................................36
4.1.1 Value of ‘k’ for Optimal Rank Approximation of Term-Document Matrix...................36
4.1.2 Comparison of “Tf-IDf Measure” to “Term-Count Measure”................................38
4.2 Result Analysis of PLSA.................................................................40
4.2.1 Optimal Value for Number of Topics (a)................................................41
4.2.2 Convergence...........................................................................43
4.2.3 Number of Iterations for Convergence..................................................46
4.2.4 PLSA Slide-Shots......................................................................46
4.3 Convergence in Number of Unique Links after Some Iterations.............................48
4.4 Comparison between LSA and PLSA Results.................................................48
4.5 Comparison with Dogpile Search-Engine...................................................49
Improvements from NER (Named-Entity Recognizer).............................................52
5.1 Introduction............................................................................52
5.2 Modified Architecture of Meta-Search Engine.............................................54
5.3 Modified High-Level Design..............................................................55
5.4 Results of NER..........................................................................57
5.5 Summary.................................................................................58
Conclusion and Future Enhancements..........................................................59
6.1 Conclusion..............................................................................59
6.2 Future Enhancements.....................................................................59
Appendix-A: Search Engines’ API.............................................................61
Appendix-B: Parser’s API....................................................................65
Appendix-C: List of Stop Words..............................................................67
Appendix-D: JAMA API (for SVD)..............................................................70
References..................................................................................72
List of Tables
Table 2.1 Titles representing a small corpus. ......................................................17
Table 2.2 Term-Document Representation of corpus (T). .................................................17
Table 2.3 Complete SVD of T. ..............................................................................................18
Table 2.4 Reconstruction of Original Matrix. ....................................................................19
Table 2.5 Four aspects (topics) that are most likely to generate the term ‘Cricket’. ..........25
Table 4.1 Next keywords for query “India Tourism” for different values of ‘K’. ................37
Table 4.2 Next Keywords for query “Thread” using Term-Count Measure. ..................38
Table 4.3 Next Keywords for query “Thread” using Tf-IDf Measure..............................38
Table 4.4 Results of PLSA for query “Thread” ..................................................................40
Table 4.5 Results of PLSA for query “Australian University”..........................................40
Table 4.6 Results of PLSA for query “India Tourism” for different values of number of topics ‘a’ = 1, 2, 3. ....................42
Table A.1 Classes and Methods of Google SOAP Search API..........................................61
Table A.2 Classes and Methods of Yahoo Search Web Service API................................63
Table C.1 List of stop words. ...............................................................................................67
Table D.1 Classes and Methods of JAMA API ..................................................................70
List of Figures
Fig. 1.1. Anatomy of Crawler Based Search Engine.............................................................3
Fig. 2.1 Document Representation in Term Space................................................................9
Fig. 2.2 SVD of Term-Document Matrix ‘T’ .......................................................................14
Fig. 2.3 Rank K approximation of original matrix T..........................................................15
Fig. 2.4 Two Matrix Formations from PLSA. ......................................................24
Fig. 3.1 Architecture of proposed Meta-Search Engine .....................................................29
Fig. 3.2 High-Level Design (Package Diagram) ..................................................................31
Fig 4.1 Graphical User Interface for LSA ...........................................................................39
Fig 4.2 Convergence in Term- Topic Matrix computed by Absolute Measure ................44
Fig 4.3 Convergence in Topic-Document Matrix computed by Absolute Measure .........44
Fig 4.4 Convergence in Term-Topic Matrix computed by Average Measure..................45
Fig 4.5 Convergence in Topic-Document Matrix computed by Average Measure..........46
Fig 4.6 GUI representing results for query “Thread”........................................................47
Fig 4.7 GUI representing results for query “India Tourism”............................................47
Fig 4.8 Behavior of num. of unique web-links to iterations for Query Expansion...........48
Fig 4.9 Top ten results of Meta-Search Engine “Dogpile” for Query “Thread”............50
Fig 4.10 Top ten results of proposed MSE for Expanded Query “Thread Package Java”
.........................................................................................................................................50
Fig 4.11 Top ten results of proposed MSE for Expanded Query “Thread Dress” ..........51
Fig. 5.1 Architecture of Modified Meta-Search Engine......................................................54
Fig. 5.2 High-Level Design with NER (Package Diagram).................................................55
Fig. 5.3 A text file before and after Named-Entity Recognition .......................................56
Fig. 5.4 GUI after applying NER for query “India Tourism” ...........................................58
Chapter 1
Introduction
This chapter presents an overview of the thesis. It gives the reader an insight into the current situation in the field of searching on the Internet, and describes the objective, motivation and problem statement of the thesis. Finally, it presents the organization of the thesis.
1.1 Overview
The focus of this thesis to add a new dimension to Internet-Searching and that is
to apply semantic aspects towards it. In precise words,“the search must be what user
wish, not what he/she types”. In the current scenario users are flooded with numerous
web-links (urls) given by Search-Engines (SEs). Hence the users waste their useful
time in navigating through undesired links, searching the needed one. The prime
reason for this is that the SEs index the pages on the basis of key-words. On the other
hand, when we are searching the internet we quite often may not know the correct and
complete set of key words that might have led us to the desired url. In order to
overcome this shortcoming we need to devise a method that will allow the user to find
the relevant key words starting from the few key words that he/she may actually
know. In other words, we need to look into the semantics of the key words. This
thesis suggests a new approach that is based on some algorithms which considers
semantic aspects and uses them to implement a Meta-Search Engine (MSE).
1.2 Objective
‘To develop a Meta-Search Engine for refining the search results of existing Search Engines by Query Expansion using the Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA) algorithms.’
The project activity basically consists of an implementation of both the LSA and PLSA algorithms on the results of basic search engines in order to refine them by Query Expansion (QE). LSA has already been recommended for QE in Internet searching [1]. An essential component of this thesis is to compare the performance of these two algorithms empirically and to analyze the various factors which affect the results. The thesis concludes that PLSA outperforms LSA.
1.3 Motivation
In the current scenario, Information Technology is advancing rapidly. World Wide
Web or Internet is one of the best achievements of it. Internet can be treated as a huge
repository of information and sophisticated methods are always required to extract
needed information. Search Engines like Google, Yahoo and MSN are really
necessary tools to retrieve needed information.
Most of such search-engines basically perform Crawler-Based Search. These SEs
generally consist of a WebCrawler - a program that crawls the web, an Indexing
Technique, some Encoding Mechanism and a huge Database. These SEs use crawlers
(spiders) for information collection on the web. Then indexing, encoding and storing
of collected data are performed subsequently [2,3]. Following diagram represents the
anatomy of search-engine.
Steps of Crawler-Based Search Engines [2]:
1. Web Crawling: Search engines use a special program called a Robot or Spider which crawls (travels) the web from one page to another. It visits popular sites and then follows each link available at those sites.
2. Information Collection: The spider records all the words and their respective positions on the visited web page. Some search engines do not consider common words such as articles (‘a’, ‘an’, ‘the’) and prepositions (‘of’, ‘on’).
Fig. 1.1 Anatomy of Crawler Based Search Engine
3. Build Index: After collecting the data, search engines build an index to store it so that users can access pages quickly. Different search engines use different approaches for indexing, which is why they give different results for the same query. Important considerations for building indexes include: the frequency with which a term appears in a web page, the part of the web page where the term appears, and the font size of the term (whether capitalized or not). In fact, Google ranks a page higher if a greater number of pages vote for (link to) that particular page.
4. Data Encoding: Before the indexing information is stored in databases, it is encoded into a reduced size to speed up the response time of the search engine.
5. Store Data: The last step is to store this indexing information in databases.
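The indexing step above can be sketched as a small inverted index that records, for each term, the pages it occurs in and its positions there, skipping common stop words. This is only an illustrative sketch under invented data (the page texts and the stop-word list are hypothetical), not the implementation of any particular search engine.

```python
from collections import defaultdict

# Articles and prepositions commonly skipped, as noted in step 2 above
STOP_WORDS = {"a", "an", "the", "of", "on"}

def build_index(pages):
    """pages: {url: text}. Returns an inverted index: term -> {url: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][url].append(pos)
    return index

# Toy pages standing in for crawled data
pages = {"url1": "The anatomy of a search engine",
         "url2": "search the web"}
index = build_index(pages)
```

A lookup such as `index["search"]` then yields every page containing the term together with its positions, which is exactly the frequency and position information the ranking considerations above rely on.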
However, extracting desired information quickly and easily is a common problem that users face [4]. Keyword selection for searching is also a critical issue, and very few users utilize the full power of SEs [5]. Along with all of the above, some surveys also suggest the following facts:
1. No search engine is capable of covering more than one third of the web pages available on the Internet [6].
2. Sometimes search engines give results which contain obsolete or dead links [6].
A study was performed to evaluate the overlap among the first-page results of three SEs, namely Google, Yahoo and AskJeeves [7]. The study reveals that 85% of the links are unique, while 12% of the links were found common to any two of the search engines. Only 3% of the links were common to all three search engines. This very small amount of overlap shows significant differences in the ranking and retrieval policies of search engines. From these data we can infer that internet users who use only a single search engine may miss needed and relevant results [8].
These facts motivate the implementation of an Indirect Search Engine (also called a Meta-Search Engine), which combines the results of existing search engines and refines them using some algorithm, or presents them in a format which is more user-friendly. A simple distinction between Search Engines and a Meta-Search Engine (MSE) is that the latter does not require crawling the web, and hence needs neither indexing nor databases. The main directions for implementing an MSE are to improve the user interface and to filter results according to user needs. All such current trends are illustrated, with full details and their limitations, in Section 2.1 of Chapter 2 Literature Survey. None of them is based on peering into the semantics (meaning) of content and refining results by query expansion. All these factors give strong motivation to implement an MSE which addresses all the shortcomings explained above.
1.4 Problem Statement
Approaches for implementing a Meta-Search Engine that are suggested till now,
are not refining the search-results up to the desired level. Such approaches are based
on either extracting user preferences or maintaining user profile. They also do not
address the problem of Synonymy (where more than two terms can be used to
represent same object) and Polysemy (same term may represent different meaning in
different context). The one and only reason behind this is that the current Meta-Search
Engines do not consider the semantic aspect of a term. To apply some algorithm on
search-results may provide better solutions to explained problem. So the main
problem of concern is to choose appropriate algorithms which can solve the above
mentioned problem and to use them for implementing a Meta-Search Engine.
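The two problems can be made concrete with a toy sketch (the documents below are invented for illustration only): naive keyword matching misses a synonymous document entirely, while a polysemous keyword matches documents from unrelated senses indiscriminately.

```python
def keyword_match(query, doc):
    # Naive keyword-based retrieval: a document matches if any query word occurs in it
    doc_words = set(doc.lower().split())
    return any(w in doc_words for w in query.lower().split())

# Synonymy: "car" and "automobile" denote the same object, yet the match fails
print(keyword_match("car", "automobile dealers and showrooms"))   # False

# Polysemy: "thread" matches both the programming sense and the textile sense
print(keyword_match("thread", "java thread synchronization"))     # True
print(keyword_match("thread", "cotton thread and needle"))        # True
```

Because the match sees only surface strings, it has no way to rank the two senses of “thread” or to recover the “automobile” document; this is precisely the gap that semantic techniques must fill.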
1.5 Contribution of Thesis
In this thesis we have designed and implemented a complete MSE. The design of
the MSE is quite flexible and provides the facility to add the results of new SEs, when
they become available. We have developed a method for performing QE on the results
returned by the SEs using PLSA and also incorporated the existing methodologies
available using LSA. Extensive experiments are performed to compare the results
obtained with PLSA and LSA for the task of QE. Analysis of these results clearly
demonstrates that PLSA outperforms LSA. Further analysis also reveals some
shortcomings. Methods for overcoming these shortcomings are also discussed.
1.6 Structure of Thesis
The thesis comprises of various chapters. An overview and objective of thesis is
presented in Chapter1 Introduction. It then demonstrates all the factors responsible
for motivation of thesis. Problem statement, contribution and structure of this thesis
are illustrated next to that.
Chapter 2 Literature Survey describes the current trends in the design and implementation of Meta-Search Engines. It further describes how the proposed idea overcomes their problems and then gives exhaustive details of the algorithms used by the proposed MSE, namely the Vector Space Model, Latent Semantic Analysis and Probabilistic Latent Semantic Analysis.
Chapter 3 Proposed Meta-Search Engine illustrates the architecture of the system with all its components. It also presents the necessary requirements and the corresponding implementation details.

Chapter 4 Result and Analysis first shows the results of LSA for different critical factors. Results of PLSA are then presented for different cases, followed by a comparison between the two. This chapter contains elucidating examples with diagrams, slide-shots and graphs which reinforce the superiority of PLSA over LSA. Finally, the results of the suggested MSE are compared to the open-source Meta-Search Engine “Dogpile”.
Chapter 5 Improvements from NER demonstrates the addition of a new module to the Meta-Search Engine, namely the “Named Entity Recognizer”. It then illustrates the modifications, effects and consequences of adding this new component.

Chapter 6 Conclusion and Future Enhancements presents the conclusion of the thesis and suggests some future enhancements to it.

The appendix section of the thesis contains details about the various search-engine APIs, the HTML parser, the PDF-to-text converter, the JAMA API (for SVD), etc.
1.7 Summary
The deriving reasons for implementing a Meta-Search Engine (MSE) have been
illustrated. Proposed MSE is based on the phenomena of Query-Expansion which
relies on the fact that after few successive iterations of firing expanded query the
results will be automatically refined. For the purpose of query-expansion some
algorithms have been proposed that try to focus on semantic aspect behind a query.
The next chapter explains all the algorithms used and their respective effects.
Chapter 2
Literature Survey
2.1 Current Trends in Meta-Search Engine
In this chapter we survey some of the methods that have been used in developing Meta-Search Engines (MSEs), and examine the strengths and weaknesses of the various approaches. There are three main directions for implementing a Meta-Search Engine [4]:
1. Improvement of the user interface
2. Filtering the results of a query
3. Applying algorithms for the indexing of web pages.
A heavy emphasis on user requirements is recommended in the architecture of a Meta-Search Engine [9]. A personalized Meta-Search Engine, Excalibur [10], has already been proposed that provides quick responses with re-ranked results after extracting user preferences. It uses a Naïve Bayesian Classifier for re-ranking.
Some MSEs use proxy log records to extract users’ access patterns and store these patterns in a database. A relevance score is estimated, using some heuristic, for each user and each URL that he/she visited. A profile is maintained for each user which contains the most relevant currently visited URLs. The relevance of these URLs, with their respective relative positions, is updated in the profile when the user visits those links in the future [6].

Current research also suggests a framework for a Meta-Search Engine based on Agent Technology [11]. An enhanced version of the open-source Helios Meta-Search Engine takes input keywords along with a specified context or language and gives refined results as per the user’s need [8].
All the proposed solutions refine search results to some extent, but they have a serious drawback: the user profile is not stationary. A user who is currently new to the context of a search topic may become experienced over the course of time, and his/her requirements may vary accordingly. Managing and consulting the log of previous searches may thus lead to inappropriate search results. This observation leads us to consider alternative methods of re-ranking, offered by purely statistical methods like Latent Semantic Analysis (interchangeably called Latent Semantic Indexing) and the newly introduced Probabilistic Latent Semantic Analysis (interchangeably called Probabilistic Latent Semantic Indexing), which promises to give results that are more accurate than those of Latent Semantic Analysis. Thus, the emergence of these algorithms and the need for robust meta-search engines are the catalysts of the present thesis.
Tests performed by various groups reveal that Latent Semantic Analysis (LSA)
and Probabilistic Latent Semantic Analysis (PLSA) give robust results for
Information Retrieval when the task is to find the most relevant documents in a
given corpus for a given query. In the present thesis we contend that LSA and
PLSA can also be used to perform query expansion, i.e. given a keyword set as a
query, we would like these algorithms to automatically suggest additional keywords
that help refine the search. As both methods build on the Vector Space Model, the
following sections first examine the vector space model. We then extend this model
to LSA and PLSA and examine how these can be used to perform query expansion.
2.2 Vector Space Model
Most text-retrieval techniques are based on indexing keywords. Since keywords
alone cannot sufficiently capture the whole content of a document, they result in
poor retrieval performance. Indexing keywords is nevertheless still the most
practical way to process large corpora of text. After the significant index terms have
been identified, a document can be matched to a given query by a Boolean model or
a statistical model. A Boolean model matches according to the extent to which an
index term satisfies a Boolean expression, while a statistical model uses statistical
properties to measure the similarity between query and document [12].
In 1975, Gerard Salton [13] proposed a statistically based "Vector Space Model",
built on the idea of placing the documents in an n-dimensional space, where n is the
number of distinct terms or words (t1, t2, …, tn) that constitute the whole vocabulary
of the corpus or text collection. Each dimension corresponds to a particular term.
Each document is represented as a vector D1, D2, …, Dr, where r is the total number
of documents in the corpus. A document vector can be written as follows:
Dr = { d1r , d2r , d3r , …, dnr }
where dir is the ith component of the vector representing the rth document [12].
[Figure: documents Doc 1 and Doc 2 and a query plotted in the space of the three
terms "Information", "Retrieval" and "System".]
Fig 2.1 Document Representation in Term Space
The above figure shows the representation of documents Doc 1 and Doc 2 in the
space of three terms, namely "Information", "Retrieval" and "System". The three
mutually perpendicular dimensions, one per term, represent "term independence".
This independence can be of two types, namely linguistic and statistical.
When the occurrence of one term does not depend on the appearance of another
term, it is called statistical independence. Under linguistic independence, the
interpretation of a term does not rely on any other term [14].
This assumption of pair-wise term orthogonality is not realistic, but it is
acceptable as a first approximation [15].
The Vector Space Model is traditionally used when a collection of documents is
placed in term space and the most relevant document for a given query must be
found. A query is treated just like a very short document. The similarity between
the query and all documents in the collection is computed, and the best matching
documents are returned.
Various similarity measures have been proposed; one that is very frequently
used is Cosine Similarity:
cos θ = (Q · D) / (|Q| * |D|)
The above expression is the cosine of the angle between the two vectors in the
term space. The most relevant document is the one nearest to the given query. In
the same way, two documents are considered related if they lie in each other's
neighbourhood.
Other similarity measures are [14]:
• Inner Product = Σj Qj * Dj
• Dice Coefficient = 2 Σj Qj * Dj / { Σj Qj^2 + Σj Dj^2 }
• Jaccard Coefficient = Σj Qj * Dj / { Σj Qj^2 + Σj Dj^2 − Σj Qj * Dj }
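To make these measures concrete, the following sketch computes each of them for a query and two documents in a three-term space. It is an illustration only (Python is used here for brevity; the thesis implementation itself is in Java, and the toy vectors are invented):

```python
import math

def inner(q, d):
    # Inner product: sum of pairwise products of the components
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    # Cosine similarity: inner product normalized by the vector lengths
    return inner(q, d) / (math.sqrt(inner(q, q)) * math.sqrt(inner(d, d)))

def dice(q, d):
    return 2 * inner(q, d) / (inner(q, q) + inner(d, d))

def jaccard(q, d):
    return inner(q, d) / (inner(q, q) + inner(d, d) - inner(q, d))

# Hypothetical query and documents in the 3-term space
# ("information", "retrieval", "system") of Fig 2.1.
q  = [1, 1, 0]
d1 = [2, 1, 0]   # "information" twice, "retrieval" once
d2 = [0, 1, 3]   # mostly about "system"
print(cosine(q, d1) > cosine(q, d2))  # True: d1 is nearer to the query
```

All four measures agree that d1 is the better match here; they differ mainly in how they normalize for vector length.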
Each component of a document vector is associated with a numeric factor called
the weight of the respective term in the document. This weight, wi, can simply be
the term count or term frequency (tfi). This assignment leads to a variation of the
model called the "Term Count Model". This model is sensitive to term repetition:
long documents score higher simply because they are longer and may repeat many
terms very often, not because they are more relevant [16]. Lee, Chuang and
Seamons compared different term-weight models and concluded that the term-count
model does not give the best result in any situation [12].
Hence, in the traditional Vector Space Model, the tf x idf method is used to
determine the weight of a term in a given document vector. It is based on two factors:
1. The number of occurrences of term 'b' in document 'a' (term frequency tf a, b)
2. The number of documents in the collection that contain term 'b' (document
frequency df b) [12]
So, the weight of a term ‘b’ in given document ‘a’ can be written as
W a, b = tf a, b * idf b = tf a, b * log (N / df b)
where,
N= number of documents in the document-collection
idf= inverse document frequency
tf = term frequency
This model incorporates both local and global information. The first factor, tf a,b,
accounts for the local weight. The ratio (df b / N) is the probability of selecting,
from the collection, a document that contains the queried term; it can be treated as a
global probability for the whole collection. Thus log (N / df b), the Inverse
Document Frequency, accounts for the global information. This measure gives
better results than other term-weight measures [16].
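The weighting formula above can be sketched in a few lines. The collection size and term statistics below are invented for illustration (natural log is assumed; the base only rescales all weights uniformly):

```python
import math

def tfidf_weight(tf, df, n_docs):
    # w(a, b) = tf(a, b) * log(N / df(b)); natural log used here
    return tf * math.log(n_docs / df)

# Hypothetical collection of N = 1000 documents:
# a near-ubiquitous term gets almost no weight even at high tf,
# while a rare term gets a large weight from a few occurrences.
print(tfidf_weight(tf=10, df=990, n_docs=1000))  # ~0.10 (non-discriminative)
print(tfidf_weight(tf=3, df=20, n_docs=1000))    # ~11.74 (discriminative)
```

This shows how the idf factor suppresses terms that occur in almost every document, exactly the global information the paragraph above describes.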
The Vector Space Model suffers from the following limitations:
1. It assumes term independence.
2. It is computationally intensive and requires long processing times.
3. Vectors must be recalculated whenever a new term is added.
4. Long documents make similarity measures difficult [16].
All these limitations can be mitigated by some means, but the main disadvantage
of this model is its lack of treatment of synonymy and polysemy. Synonymy means
separate terms having the same meaning; it describes the fact that there are many
ways to refer to the same object. Users with different needs and knowledge describe
the same information using different terms. For example, the terms car and
automobile are often used interchangeably. Such synonyms tend to decrease
"recall": when we use the term "car" as a query term, the Vector Space Model will
only look for documents containing the term "car" and will ignore those documents
that contain the term "automobile". Polysemy stands for terms with multiple distinct
meanings (homography). The use of such a term in a search query does not
necessarily mean that a document containing or labeled by the same term is of
interest; polysemy reduces the precision of the results. This can be understood by
considering the word "bank". We can have a "river bank", a "bank" as a financial
institution, or even the "banking of an airplane". If the word "bank" is given as a
query term, the Vector Space Model will pick up documents that contain the word
"bank" regardless of the sense in which it is used, whereas the user was obviously
interested in only one of the possible senses. The search results will therefore
contain a large number of results that do not match what the user desired.
All these problems arise because there is no connection between topics and
terms: the vector space model does not allow searching based on terms in a specific
topic or context. Topics are not directly visible as terms; they are latent and related
to semantic meaning. Most search engines perform only term matching and are not
based on the semantic aspects of terms. Thus, we need methods that try to capture
the semantic aspects of the terms. This leads us to the latent semantic analysis
methods described next.
2.3 Latent Semantic Analysis (LSA)
2.3.1 Concept of LSA
In 1990, Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K.
Landauer and Richard Harshman [17] proposed a method that can be used to judge
the similarity of meaning of terms and paragraphs by analyzing a large text corpus.
This method is called Latent Semantic Analysis (LSA).
LSA is a statistical/mathematical technique for eliciting and inferring
relationships among the usage of words in a paragraph for a given context. It uses
no artificial intelligence or natural language processing technique; its functioning is
not based on grammars, parsers or dictionaries [17]. It tries to discover something
about the meaning of the words and about the topics in text documents [18, 19].
LSA is based on the principle of dimensionality reduction. As Internet
technology advances, the number of electronically available documents increases at
an exponential rate, so efficient tools for document organization, summarization,
clustering, navigation and retrieval are always needed. Document clustering is a
daunting task due to high dimensionality. As mentioned previously, the
dimensionality of the space is determined by the number of terms in the document
collection, i.e. the number of distinct words in the corpus. Thus, the dimensionality
can be extremely high for even a modest number of documents of average size. In
most such applications, a document is expressed as a vector in term space (as in the
Vector Space Model). Even a short passage may contain hundreds of distinct terms,
so in real applications that process large text corpora the number of term dimensions
will be enormous. This high dimensionality significantly reduces the discriminative
power of distance measures [20].
To solve this problem, various dimension reduction techniques have already
been proposed. These techniques can be treated as a promising way to extract
"concepts" from unstructured data [21, 22]. They can be classified into two
categories:
1. Feature Selection
2. Feature Transformation
Feature Selection methods sort all the terms using a suitable mathematical
measure computed from the documents. Examples of such methods are Document
Frequency, Mean TF-IDF and Term Frequency Variance.
Feature Transformation methods map the vector-space representation of the
collection of documents into a lower-dimensional subspace in which the new
dimensions are linear combinations of the original ones. Well-known methods of
this kind are Latent Semantic Analysis (LSA), Random Projection (RP),
Independent Component Analysis (ICA) and Principal Component Analysis (PCA)
[23, 24].
As an initial step in LSA, the text is represented as a matrix. Each row of this
matrix stands for a unique term or word and each column stands for a paragraph of
context or a document. Each cell contains the frequency with which the word of its
row appears in the document denoted by its column. LSA takes this term-document
matrix as input and applies Singular Value Decomposition (SVD) to it [17].
In SVD, a matrix is decomposed into the product of three other matrices. One
component matrix R0 explains the original row entities as vectors of derived
orthogonal factor values while another component matrix C0 describes the original
column entities in the same way. The third component is a diagonal matrix S0
containing scaling values. When the three components are matrix-multiplied, the
original matrix is reconstructed [17, 18].
T = R0 S0 C0'
(t x d) = (t x m) (m x m) (m x d)

Fig. 2.2 SVD of Term-Document Matrix 'T'
Singular Value Decomposition of the term-document matrix, T, where:
R0 has orthogonal, unit–length columns (R0’ R0= I)
C0 has orthogonal, unit-length columns (C0’ C0= I)
S0 is the diagonal matrix of singular values
m is the rank of T (<= min (t, d))
SVD is a very useful technique because it provides a simple procedure for an
optimal approximate fit using smaller matrices. If the singular values in S0 are
arranged by size, the k largest may be kept and the remaining smaller ones set to
zero. Multiplying the resulting matrices gives a matrix Tnew of rank k which is
nearly equal to T; it can be shown that Tnew is the closest rank-k approximation to T
in the least-squares sense [18].
Tnew = R S C'
(t x d) = (t x k) (k x k) (k x d)

Fig. 2.3 Rank k approximation of original matrix T
The value of k is a parameter whose choice is of critical importance, because it
decides the amount of dimension reduction. Ideally it should be small enough that
sampling errors can be ignored, but large enough to capture all the real structure in
the data [18]. Each value in the new representation is a linear combination of the
original cell values; as a result, any change in a cell of the original matrix is
reflected in the values of the reconstructed matrix with reduced dimensions. The
dimension reduction step cuts down the matrices in such a way that terms that
occurred in some contexts now appear with larger or smaller predicted frequency,
and some words that did not actually appear now do appear, fractionally [17].
There are three kinds of comparisons that can be made by this reduced dimension
matrix [17].
(1.) Term-Term Comparison- The dot product between two row vectors of
Tnew shows the extent to which two terms have similar patterns of occurrence
across the given set of documents. The matrix Tnew * Tnew' is the square
symmetric matrix that contains all term-to-term dot products. This can be verified:
Tnew * Tnew' = (R * S * C') * (R * S * C')'
= (R * S * C') * (C * S' * R')   because (A * B)' = B' * A'
= R * S * (C' * C) * S' * R'
= R * S * S' * R'   because C is orthogonal
= R * S^2 * R'   because S is diagonal
(2.) Document-Document Comparison- The dot product between two
column vectors of Tnew shows the extent to which two documents have similar
term patterns. The matrix Tnew' * Tnew is the square symmetric matrix
containing all document-to-document dot products. This can be verified:
Tnew' * Tnew = (R * S * C')' * (R * S * C')
= (C * S' * R') * (R * S * C')   because (A * B)' = B' * A'
= C * S' * (R' * R) * S * C'
= C * S' * S * C'   because R is orthogonal
= C * S^2 * C'   because S is diagonal
(3.) Term-Document Comparison- The comparison between a term and a
document is the value of an individual cell of Tnew. The (i, j) cell of Tnew is
obtained by taking the dot product between the ith row of the matrix R * S^(1/2)
and the jth row of the matrix C * S^(1/2).
Using term-term and document-document similarity we can easily find,
respectively, all the terms and all the documents that are highly related to each
other. This similarity measure gives an approach to query expansion using
term-term similarity; in the next chapters we show in detail how a query is
expanded in the context of the MSE to refine results. Another way to measure
term-term or document-document similarity is the correlation coefficient between
two terms or two documents [18]. The following example demonstrates how, after
dimension reduction using LSA, two terms move closer together or further apart in
the semantic space according to their semantics.
Table 2.1 Titles that represent the small corpus.
Document Title
Doc 1 Design and Analysis of Algorithm
Doc 2 Satellite Imagery
Doc 3 Image Processing
Doc 4 Digital Signal Processing
Doc 5 Data Structure and Algorithm
Of the above, Doc 2, Doc 3 and Doc 4 belong to the domain of "Image
Processing", while the remaining documents, Doc 1 and Doc 5, are from "Design of
Algorithms". The term-document matrix T is shown in Table 2.2.
Table 2.2 Term-Document Representation of corpus (T).
DOC 1 DOC 2 DOC 3 DOC 4 DOC 5
Signal 0 0 0 1 0
Structure 0 0 0 0 1
Image 0 0 1 0 0
Imagery 0 1 0 0 0
Analysis 1 0 0 0 0
Digital 0 0 0 1 0
Data 0 0 0 0 1
Processing 0 0 1 1 0
Design 1 0 0 0 0
Algorithm 1 0 0 0 1
Satellite 0 1 0 0 0
If we calculate the correlation coefficients between (image, processing) and
(image, algorithm) we get the following results:
r (image, processing) = 0.61
r (image, algorithm) = -0.40
where r is Pearson's correlation coefficient between the corresponding term rows.
Now we perform SVD on this matrix to obtain the following:
Table 2.3 Complete SVD of T
R =
0 0.45 0 0 0.45
0.35 0 0 0.5 0
0 0.28 0 0 -0.72
0 0 -0.71 0 0
0.35 0 0 -0.5 0
0 0.45 0 0 0.45
0.35 0 0 0.5 0
0 0.72 0 0 0.28
0.35 0 0 -0.5 0
0.71 0 0 0 0
0 0 -0.71 0 0
S =
2.0 0 0 0 0
0 1.9 0 0 0
0 0 1.4 0 0
0 0 0 1.4 0
0 0 0 0 1.2
C =
0.71 0 0 -0.71 0
0 0 -1 0 0
0 0.53 0 0 -0.85
0 0.85 0 0 0.53
0.71 0 0 0.71 0
When we reconstruct our original matrix by taking only three significant values
from matrix S (i.e. considering rank 3 approximation) we recover T new as follows:
T new =
Table 2.4 Reconstruction of Original Matrix.
DOC 1 DOC 2 DOC 3 DOC 4 DOC 5
Signal 0 0 0.45 0.72 0
Structure 0.5 0 0 0 0.5
Image 0 0 0.28 0.45 0
Imagery 0 1 0 0 0
Analysis 0.5 0 0 0 0.5
Digital 0 0 0.45 0.72 0
Data 0.5 0 0 0 0.5
Processing 0 0 0.72 1.17 0
Design 0.5 0 0 0 0.5
Algorithm 1 0 0 0 1
Satellite 0 1 0 0 0
Now if we calculate the correlation coefficients we get the following values:
r (image, processing) = 0.99
r (image, algorithm) = -0.63
This clearly shows that in the given corpus 'image' and 'processing' are strongly
related, while there is little relation between 'image' and 'algorithm'. The high
correlation between 'image' and 'processing' supports the fact that they belong to
the Image Processing context, while 'algorithm' belongs to Design of Algorithms
and is hence less related.
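The reported coefficients on the raw matrix can be reproduced with Pearson's correlation between the term rows of Table 2.2. The short sketch below (Python, for illustration only; the row vectors are copied from Table 2.2) computes them:

```python
import math

def pearson(x, y):
    # Pearson correlation between two term rows of the term-document matrix
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rows of the raw term-document matrix T (Table 2.2)
image      = [0, 0, 1, 0, 0]
processing = [0, 0, 1, 1, 0]
algorithm  = [1, 0, 0, 0, 1]
print(round(pearson(image, processing), 2))  # 0.61
print(round(pearson(image, algorithm), 2))   # -0.41 (reported as -0.40)
```

Running the same function on the rows of the reconstructed matrix Tnew (Table 2.4) would likewise reproduce the post-SVD values.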
2.3.2 Limitations of LSA
LSA maps terms and documents onto a fixed number of concepts (or topics)
which are orthogonal (unrelated) to each other. In practice this is not the case,
because a document may contain a number of related concepts. In addition, LSA
lacks a firm statistical foundation.
Computational complexity, storage, and the sparseness of the term-document
matrix are critical issues wherever LSA is to be used.
LSA makes no use of morphology, word order or syntactic relations, so it may
occasionally produce incomplete or erroneous output [18].
Least-squares methods are developed for normally distributed data, and SVD is
based on the same principle. The term-by-document matrix that serves as input to
LSA, however, consists of count data, for which such a distribution is inappropriate.
This is another objection to the use of LSA [25].
2.3.3 Advantages and Applications of LSA
1. LSA models human conceptual knowledge very well. It is able to develop
summarization skills [26] and text comprehension [18].
2. LSA can be used in essay-scoring techniques [27], for example to predict the
extent to which a student has learnt from a specific text [28].
3. Since LSA does not depend on literal matching, it performs well on noisy
text, e.g. optical character recognition output or text with spelling errors [25].
4. The LSA technique makes no use of English syntax or semantics; it is based
on a "bag of words" approach. It is therefore applicable to any language, and
hence to Cross-Lingual Information Retrieval [25, 27].
5. LSA gives good results in the context of relevance feedback and information
filtering.
6. Sparseness and storage are certainly big hurdles to its usage. In the context of
a Meta-Search Engine, however, the term-document matrix is small compared
to the traditional information retrieval domain, so these problems are not so
significant. Moreover, since LSA is used here only for query expansion, the
problem of SVD updating does not arise.
2.4 Probabilistic Latent Semantic Analysis (PLSA)
2.4.1 Concept of PLSA
In 1999, Thomas Hofmann [29, 30] proposed this technique. The basis of PLSA
is the Aspect Model, a latent variable model for co-occurrence data which
associates a hidden class variable a ∈ A = {a1, a2, …} with each observation, i.e.
with each occurrence of a term t ∈ T = {t1, t2, …} in a document d ∈ D = {d1,
d2, …}. The parameters in this context are defined in the following way:
P (d) = probability of selecting a document d,
P (a | d) = probability of picking a hidden class a,
P (t | a) = probability of generating a term t.
An observed pair (d, t) is obtained by summing out the hidden class variable 'a'.
Expressing the whole process as a joint probability model yields the following
expressions:
P (d, t) = P (d) * P (t | d) , ------------ (2.1)
where
P (t | d) = Σa P (t | a) * P (a | d) ------------ (2.2)
PLSA uses this idea in the following way. As in the Vector Space Model and
LSA, the term-document matrix acts as the input to this model. This matrix T(t, d)
has terms t = 1:m (i.e. ranging from 1 to m) as rows and documents d = 1:n as
columns, together with a number of topics A to be sought; T(t, d) is the entry in the
specified row and column.
By a random sequence model, the probability of a document can be written as
P (d) = P (t1 | d) * P (t2 | d) * … * P (tm | d)
      = Π_{t=1..m} P (t | d)^T(t,d) ------------ (2.3)
Now if we have A topics as well:
P (tm | d) = Σ_{a=1..A} P (tm | topic_a) * P (topic_a | d) ------------ (2.4)
The same, written using shorthand:
P (t | d) = Σ_{a=1..A} P (t | a) * P (a | d) ------------ (2.5)
So, by substitution, for any document in the collection,
P (d) = Π_{t=1..m} { Σ_{a=1..A} P (t | a) * P (a | d) }^T(t,d) ------------ (2.6)
Now, P (t | a) and P (a | d) are the two parameter sets of this model. Equations to
compute these parameters can be derived by Maximum Likelihood. After doing so
we obtain:
• P (t | a), for all t and a: a term-by-topic matrix
(gives the terms which make up a topic)
• P (a | d), for all a and d: a topic-by-document matrix
(gives the topics of a document)
The log-likelihood of this model is the log-probability of the entire collection:
Σ_{d=1..n} log P(d) = Σ_{d=1..n} Σ_{t=1..m} T(t,d) * log Σ_{a=1..A} P(t | a) * P(a | d) ------------ (2.7)
which is to be maximized with respect to the parameters P (t | a) and P (a | d),
subject to the constraints
Σ_{t=1..m} P (t | a) = 1 and Σ_{a=1..A} P (a | d) = 1
In cases where hidden or missing data must be handled, an ideal approach to
computing the Maximum-Likelihood (ML) estimate is Expectation Maximization
(EM). In ML estimation, the parameters are chosen so as to make the observed data
most likely.
There are two steps in the EM algorithm:
1. In the Expectation step, the current estimates of the parameters are used to
compute the posterior probabilities of the hidden variables.
2. In the Maximization step, the posterior probabilities computed in the
Expectation step are used to update the parameters [31].
One admirable property of EM is that convergence is assured, i.e. the algorithm
is guaranteed never to decrease the likelihood from one iteration to the next. The
PLSA algorithm below specifies the exact input, processing steps and output.
2.4.2 PLSA Algorithm
• Inputs: term to document matrix T(t , d), t=1:m, d=1:n and the number A of
topics sought [19]
• Initialize arrays P1 and P2 randomly with numbers between [0,1] and
normalize them row-wise to 1 [19]
• Iterate until convergence:
For d = 1 to n, for t = 1 to m, for a = 1 to A,
P1 (t, a) = P1 (t, a) * Σ_{d=1..n} { T(t, d) * P2(a, d) / Σ_{a'=1..A} P1(t, a') * P2(a', d) } ---------- (2.8)
P2 (a, d) = P2 (a, d) * Σ_{t=1..m} { T(t, d) * P1(t, a) / Σ_{a'=1..A} P1(t, a') * P2(a', d) } ---------- (2.9)
P1 (t, a) = P1 (t, a) / Σ_{t=1..m} P1(t, a) ---------- (2.10)
P2 (a, d) = P2 (a, d) / Σ_{a=1..A} P2(a, d) ---------- (2.11)
• Output: arrays P1 and P2, which hold the estimated parameters P (t |a) and
P (a| d) respectively [19].
Equations (2.8) and (2.9) are the expectation steps, in which posterior
probabilities are calculated from the currently estimated values. In the initial step
these parameters are assigned by a uniform random number generator producing
numbers between 0 and 1. Equations (2.10) and (2.11) are the maximization steps,
where the parameters are re-normalized using the values resulting from the
expectation step. The outputs are the two matrices P1 and P2, holding the
probabilities of the term distribution per topic and the topic distribution per
document, respectively.
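One common way to implement these updates is the standard E-step/M-step formulation of PLSA, which is equivalent to the multiplicative form of equations (2.8)-(2.11). The following pure-Python sketch (illustrative only, not the thesis's Java implementation; the tiny "cricket" corpus is invented for the demonstration) folds both steps into one pass:

```python
import random

def plsa(T, n_topics, n_iter=50, seed=0):
    """EM for PLSA on a term-document count matrix T (list of rows, T[t][d])."""
    rng = random.Random(seed)
    m, n = len(T), len(T[0])
    # P1[t][a] ~ P(t|a), P2[a][d] ~ P(a|d), randomly initialized in (0, 1)
    P1 = [[rng.random() for _ in range(n_topics)] for _ in range(m)]
    P2 = [[rng.random() for _ in range(n)] for _ in range(n_topics)]

    def norm_cols(M):
        # Normalize every column of M to sum to 1, enforcing the
        # constraints sum_t P(t|a) = 1 and sum_a P(a|d) = 1
        for j in range(len(M[0])):
            s = sum(row[j] for row in M)
            for row in M:
                row[j] /= s

    norm_cols(P1)
    norm_cols(P2)
    for _ in range(n_iter):
        newP1 = [[0.0] * n_topics for _ in range(m)]
        newP2 = [[0.0] * n for _ in range(n_topics)]
        for t in range(m):
            for d in range(n):
                if T[t][d] == 0:
                    continue
                # E-step: posterior P(a | t, d) from the current estimates
                denom = sum(P1[t][a] * P2[a][d] for a in range(n_topics))
                for a in range(n_topics):
                    post = T[t][d] * P1[t][a] * P2[a][d] / denom
                    # M-step accumulation (numerators of eqs. 2.8 / 2.9)
                    newP1[t][a] += post
                    newP2[a][d] += post
        # Normalization corresponding to eqs. (2.10) and (2.11)
        norm_cols(newP1)
        norm_cols(newP2)
        P1, P2 = newP1, newP2
    return P1, P2

# Toy corpus: two "sport" documents, two "insect" documents; the ambiguous
# term "cricket" (last row) occurs in all four.
T = [[2, 1, 0, 0],   # bat
     [1, 2, 0, 0],   # wicket
     [0, 0, 2, 1],   # insecticide
     [0, 0, 1, 2],   # chirp
     [1, 1, 1, 1]]   # cricket
P1, P2 = plsa(T, n_topics=2)
print([[round(p, 2) for p in row] for row in P2])  # topic mixture per document
```

On this separable toy data the two recovered topics typically split along the sport/insect block structure, with "cricket" receiving weight under both, which is exactly the polysemy behaviour discussed in Section 2.4.3.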
[Figure: the observed term distribution P (t | d) is factored into the term
distribution per topic P (t | a) and the topic distribution per document P (a | d).]
Fig. 2.4 Two Matrix Formations from PLSA.
The accuracy of the PLSA algorithm depends on two crucial factors:
1. the number of topics (the value of A), and
2. the number of iterations.
The number of topics is context-specific. If the documents in the text corpus
contain different concepts or belong to various domains, then an exact estimate of
this parameter is critical; the closer the chosen number of topics is to the actual
value, the better the results.
The number of iterations to convergence is another decisive factor, and it must
be chosen carefully: too many iterations overfit (over-tune) the data, while too few
do not yield true results. Early stopping, in which the iterations are halted after a
specific number of steps, can be used to prevent overfitting. The Results and
Analysis section of this thesis demonstrates the effect of all these factors with
empirical results.
The following example shows the term distributions within topics yielded by
the algorithm.
Table 2.5 Four aspects (topics) that are most likely to generate the term
'Cricket'.
Aspect 1 Aspect 2 Aspect 3 Aspect 4
Sports earlier Cricket nice
Cricket disease using girl
play swimming insecticides saying
makes state insect dull
indoors lots kill student
outdoors cycling small hockey
games cured harmful school
person healthy nice study
The above example is derived from an A = 4 aspect model of a document
collection containing different aspects related to "Cricket". The displayed words are
the most probable words in the class-conditional distribution P (t | a), listed from
top to bottom in descending order.
2.4.3 Advantages and Applications of PLSA
The results of PLSA are better than those of LSA because PLSA has a firm
statistical foundation. PLSA uses the principles of conditional probability together
with EM, which is guaranteed to converge, and hence produces better results.
LSA resolves the problem of synonymy, but in the case of polysemy its results
are still doubtful. PLSA resolves both problems efficiently: it distributes the
term-to-topic data in such a manner that a polysemous term is clubbed with different
terms, with a different probability under each topic, and therefore represents
different topics. In the example above, Aspect 1 seems related to "Sport - Cricket";
other terms such as play, outdoors and games support the idea of an outdoor/indoor
classification of games. Aspect 2 concerns the "disease prevention" aspect of
outdoor games, which makes one healthier; "swimming" and "cycling" are other
examples of such games. Aspect 3 shows relevance to "Cricket - Insect":
"insecticides", "insect", "kill", "small" and "harmful" are the most probable words
in this context. Aspect 4 represents the concept of "Sports in School"; the other
terms in this concept, such as girl, dull, hockey and study, support this
interpretation. The different positions of the term "Cricket" show its respective
probability of appearing in each context, which is quite understandable.
PLSA is already in use in some applications and is contributing fruitful results.
Apart from the already explained domain, where relevant documents are retrieved
for a given query, PLSA is used in "Web Page Grouping" [32] and in the
construction of "Community Web Directories". This thesis suggests a new
direction: implementing a Meta-Search Engine that uses PLSA for query expansion.
2.5 Summary
Current approaches for implementing MSEs have been presented. After
illustrating their drawbacks, we discussed in detail the algorithms (LSA/PLSA) that
are used in the proposed MSE. The applications, advantages and limitations of
these algorithms were presented, which establishes the superiority of PLSA over
LSA. The architecture and design of the proposed MSE are presented next.
Chapter 3
Proposed Meta-Search Engine

3.1 Basic Theme
One of the most important points of concern is how the idea of query expansion
refines existing search-engine results. This can be justified by the following
observation.
All search engines perform literal matching for a given query and retrieve web
pages which contain that term or combination of terms. These terms may be
synonymous or polysemous in nature, so the results contain pages belonging both to
the intended context and to others. At any particular time, a user is interested in
some specific topic or domain. Since at present no search engine classifies the
results according to topic, a common user navigates through all the links and wastes
time until the needed information is found. For the query keyword "Cricket", for
example, the first-page links returned by Google and Yahoo contained web pages
solely related to the game of cricket; not a single result concerned the cricket that is
an insect. For such a query, a user will evidently have to search through all the
pages about the sport before reaching a web page where cricket is described as an
insect.
The proposed MSE eases the user's task by suggesting other terms that are likely
to occur in a particular context. The user can then select a term or group of terms
from his/her area of interest and fire a new, expanded query to the MSE. After a
few such iterations, all the links will belong to the same topic; in this way the links
are confined to the user's need.
3.2 Architecture of Proposed MSE
This section describes the different components of the proposed Meta-Search
Engine and illustrates their responsibilities, significance and interaction with each
other.
[Figure: the user's query flows from the User Interface through the Common
Interface to Search-Engines to Search Engines 1 … n; the returned web pages pass
through Baseline Establishment (naïve algorithm), the Page Retriever and the
Pre-Processing Unit to the LSA/PLSA algorithms, which produce the next
keywords and ranked links (URLs) from the processed text corpora.]
Fig. 3.1 Architecture of proposed Meta-Search Engine
1. User Interface
The user poses a query to the MSE through the User Interface (UI). The user
interface must be easily understandable by any novice user, so that it can be used
with ease, and all the results and information in the different parts of the UI must be
self-explanatory to a user from any domain.
2. Common Interface to Search-Engines
This component takes the query keywords as input. It contains the APIs
(Application Program Interfaces) or libraries for the different search engines. These
APIs accept the query at the front end and pass it to the corresponding search
engines at the back end.
3. Search-Engines
This component represents the set of underlying search engines. The most
frequently used search engines are Google, Yahoo, MSN, AltaVista, etc. Each of
these performs crawling, indexing and ranking according to its own mechanism; the
anatomy of such search engines has already been explained in the Motivation
section of Chapter 1. The result of a search engine is a list of web links ranked
against the user query.
4. Baseline Establishment
This part establishes a local ranking over the retrieved results before presenting
them to the user. The local ranking forms a baseline that respects the rankings of
the individual search engines. A naïve technique for this baseline establishment is
to rank a link higher the higher it appears in the results of several search engines,
and lower if it appears in the result of only a single search engine.
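One plausible reading of this naïve heuristic is a Borda-count-style merge, sketched below. This is an illustration only (the URLs and engine lists are invented, and the thesis does not commit to this exact scoring scheme):

```python
def baseline_rank(result_lists):
    # result_lists: one ranked URL list per search engine.
    # Borda-count style: position i in a list of length n contributes n - i,
    # so a URL returned high up by several engines accumulates a high score,
    # while a URL found by only a single engine scores low.
    scores = {}
    for results in result_lists:
        n = len(results)
        for i, url in enumerate(results):
            scores[url] = scores.get(url, 0) + (n - i)
    return sorted(scores, key=scores.get, reverse=True)

# Invented result lists from two engines
google = ["a.com", "b.com", "c.com"]
yahoo  = ["b.com", "d.com", "a.com"]
print(baseline_rank([google, yahoo]))  # ['b.com', 'a.com', 'd.com', 'c.com']
```

Note that "b.com" and "a.com", returned by both engines, outrank "c.com" and "d.com", each of which appears in only one result list.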
5. Page-Retriever
This section downloads the web pages, in baseline-ranking order, to a local
machine. These pages may be in different formats such as .txt, .pdf, .html, etc.
6. Preprocessing Unit
This component builds a text corpus from the downloaded web-pages. Note that the corpus changes with the query. On this corpus we perform preprocessing steps such as stop-word removal and stemming. Stop-words are common words, such as articles and prepositions, that carry no specific meaning in any context; they act as noise in the corpus and must be removed before the algorithms are applied. Examples of stop-words are 'a', 'an', 'the', etc. The stemming process reduces each term to its root word, so that terms sharing a root are represented by a single entity to which an appropriate weight can be assigned. For example, the corpus may contain the words 'boy' and 'boys' in different places. Rather than counting these as separate terms, stemming reduces both to the root form 'boy'. This step is necessary because the two words are morphological forms of the same word and are therefore semantically similar, so they should be reduced to the same term. This has two advantages: it reflects the term statistics more accurately, and it reduces the number of terms, which leads to higher efficiency.
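These two preprocessing steps can be sketched as follows. The stop-word list is a small illustrative sample, and the toyStem method is a deliberately crude stand-in for the Porter stemmer actually used in the system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the pre-processing unit: stop-word removal followed by stemming.
public class Preprocessor {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "in", "on", "and"));

    // Crude illustration only: strip a plural "s" (e.g. "boys" -> "boy").
    // The real system applies Porter's full suffix-stripping algorithm.
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss"))
            return term.substring(0, term.length() - 1);
        return term;
    }

    public static List<String> preprocess(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (tok.isEmpty() || STOP_WORDS.contains(tok)) continue; // drop noise
            out.add(toyStem(tok));
        }
        return out;
    }
}
```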
7. Algorithms
This module contains both of the algorithms explained in the previous chapters. The processed text corpus is converted into a suitable input for the algorithms, which finally yield the next probable query keywords.
3.3 Implementation Details

This section presents the design and implementation of the suggested solution. Java is used as the programming language because it is a purely object-oriented language. The proposed solution must be easily modifiable and extendable, which is a basic need in such a system, and the object-oriented methodology fulfils this purpose. A package diagram, as part of the high-level design, is shown in Fig. 3.2.
[Figure: package diagram with packages MetaSearchEngine, LSA/PLSA Algorithm, ReadingAndStemming, Parser, SearchEngineInterface, GUI and BaseLine.]
Fig. 3.2 High-Level Design (Package Diagram)
1. GUI
This package contains all the classes responsible for the user interface. It has two classes, one for LSA and another for PLSA. These classes have different components, as required, and present the next probable query keywords in different ways.
2. Meta-Search Engine
The class in this package is responsible for passing control from one package to another and acts as a common interface for most of the packages. It takes the query as input and returns links and next probable query keywords in the desired format.
3. Search-Engine Interface
The classes in this package interact with the corresponding search-engines. They use the APIs or libraries of the search-engines so that the query can be fired on each SE at the back-end. Currently there are three classes, each communicating with its respective SE:
• GoogleApi.java
• YahooApi.java
• MSN Api.java
The first two classes use googleapi.jar [33] and yahoo-search-2.0.0.jar [34] respectively. The MSN Api.java class executes the .exe file of ConsoleWebSeach.cs [35], which is based on the .NET framework.
Both the Google and MSN APIs provide the top ten results. The Yahoo API can yield any number of links if results are available. Currently the Meta-Search Engine extracts the top ten results from all three SEs, so it is not biased toward any search-engine.
4. BaseLine
This package is responsible for the local ranking of the links returned by all the SEs. A very basic approach is implemented: a link is ranked higher if it appears in the results of all three search-engines and lower if it appears in the results of two or only a single search-engine. After this, all the links are downloaded to the local system.
5. Parser
The downloaded links may yield files of different formats, such as .txt, .html, .pdf, .ppt, .ps, etc. Currently, the classes in this package deal only with the .txt, .html and .pdf formats. The Jericho HTML-2.3 parser [36] is used to parse HTML pages and extract their text content. The PDFBox-0.7.0 [37] libraries are used to convert a .pdf file into a text file. For .pdf to text conversion the following Java archives are used:
• ant.jar
• checkstyle-all-3.5.jar
• log4j-1.2.9.jar
• PDFBox-0.7.0.jar
The HTML parser uses jericho-html-2.3.jar.
6. Reading and Stemming
This package is responsible for all the functionality of the Pre-Processing Unit component described in the architectural details. First, the stopwordrem.java class removes all the stop-words from the text files. After stop-word removal, a stemming algorithm is applied to each term of the text files by porter.java [38], which implements Porter's stemming algorithm. Other stemmers are available, such as the Lovins and Dawson stemmers [39], but Porter's stemmer is the one most widely used in information retrieval and language processing problems, and since its performance is good it is also used in the present context. Once all the text files are stemmed, the text corpus is ready to be converted into the desired input format.
7. LSA/PLSA Algorithm
The LSA and PLSA packages replace each other depending on the case: if LSA is applied for query-expansion the LSA package is used, otherwise the PLSA package.
The Term-Doc.java class converts the whole text corpus into term-document matrix format. After different term-weight factors are applied, this matrix is passed to another class which performs singular-value decomposition of the input matrix. By varying the value of k, the reduced matrix is generated. The query term and all other terms, with their matrix entries, are then passed to the Corr.java class, which computes correlation coefficients so that new probable words can be sent to the GUI. This package uses Jama-1.0.2.jar for the SVD computation [40].
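The construction of the term-document matrix can be sketched as follows. This is an illustrative implementation using raw term counts; the actual Term-Doc.java also applies term-weight factors, and the SVD step (delegated to the Jama library in the real system) is omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of term-document matrix construction: each row is a term from the
// processed corpus, each column a document, and each cell the raw count of
// that term in that document.
public class TermDocMatrix {
    public final List<String> terms = new ArrayList<>();
    public final int[][] counts;          // rows = terms, columns = documents

    public TermDocMatrix(List<List<String>> docs) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (List<String> doc : docs)
            for (String t : doc)
                index.computeIfAbsent(t, k -> { terms.add(k); return terms.size() - 1; });
        counts = new int[terms.size()][docs.size()];
        for (int d = 0; d < docs.size(); d++)
            for (String t : docs.get(d))
                counts[index.get(t)][d]++;
    }
}
```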
In the case of PLSA, the term-document matrix is passed to PLSA.class, which applies the algorithm to this matrix. To check convergence, in each iteration all the previous and new matrix elements are passed to Conv.class, which computes the average and absolute convergence measures. Early stopping is used to prevent the results from over-tuning. From the term-topic matrix, the terms with the highest probabilities are extracted and suggested on the GUI as new probable query keywords, which are classified on the basis of different aspects.
3.4 Features of Proposed System

The suggested Meta-Search Engine tries to refine the results by query-expansion. Apart from this, some other features add a new dimension to the presented Meta-Search Engine. These features are compiled next:
• After receiving the next keywords on the User Interface, a user can add a combination of terms to the previous query instead of a single term, and may thereby reach the needed URLs much sooner. A user can also replace terms, or exchange their positions, according to his or her understanding and need.
• A novice user may lack information about the domains related to the given keywords from which he or she wants to extract information. In this scenario, the next keywords suggest most of the concepts available on the Internet regarding that query.
• If the suggested keywords do not belong to the area from which the user desires to retrieve information, this indicates that very limited information is available on the Internet for the given query. Such a situation motivates the user to reformulate the query.
• The proposed solution does not use any sort of "Thesaurus" or "Dictionary" for query-expansion; it utilizes the content itself to recommend the next keywords. This feature enhances the scope of the Meta-Search Engine to a significant extent. For example, for the query "CAT", a thesaurus provides all the synonyms of "cat" but suggests nothing related to the "Common Admission Test", whose acronym is also CAT. There are plenty of such cases on the Internet. The proposed MSE extracts information from the web-pages themselves and hence avoids this problem easily.
3.5 Summary

The architecture and design of the proposed Meta-Search Engine have been discussed in this chapter. Extendibility and maintainability are two admirable features of the proposed MSE. All the components and their significance have been shown. The next chapter will present the results and analysis of the thesis, with various elucidating examples and their justifications.
Chapter 4
Result and Analysis
This chapter presents all the experiments that have been performed and their analysis. Various queries from different contexts have been fired on the proposed Meta-Search Engine for both LSA and PLSA. The important and critical factors were varied, and the resulting variations in the results were noted carefully. The results are in accordance with the theoretical facts presented in the literature-survey sections. Full details of these experiments, with elucidating examples, are described next; they in turn demonstrate the success of this idea for designing a Meta-Search Engine. First the results of the LSA experiments are shown, and then those of PLSA. A comparison between LSA and PLSA is then presented, which demonstrates the superiority of PLSA over LSA.
4.1 Result-Analysis of LSA

4.1.1 Value of 'k' for Optimal Rank Approximation of the Term-Document Matrix

As explained in Chapter 3, the value of 'k' is a decisive factor because it governs the dimensionality of the vector space: the higher k is, the larger the vector space, and vice-versa. A very high value of k leads to a large vector space that may contain noisy data, while a very small value of k yields a small vector space that loses some important information. This fact is illustrated by the following example, performed for the query "India Tourism" with 2447 terms and 20 documents. The results are shown in Table 4.1 below.
Table 4.1 Next keywords for query "India Tourism" for different values of 'k'

k=2 (50%)    | k=6 (75%)   | k=8 (85%)   | k=10 (90%)  | k=14 (98%)
Festivals    | Royal       | Varanasi    | Varanasi    | Varanasi
Kerala       | Varanasi    | Best        | Best        | Best
Tiger        | Best        | Jaipur      | Forts       | Forts
National     | Jaipur      | Royal       | Kerala      | Kerala
Boat         | Policy      | Travel      | Travel      | Travel
Sariska      | Travel      | Forts       | Royal       | Bharatpur
Carbett      | Kerala      | Kerala      | Explorer    | Jaipur
Golden       | Forts       | Explorer    | Bharatpur   | Rajasthan
Bird         | Explorer    | Policy      | Rajasthan   | Indian
Backwaters   | Palace      | Bharatpur   | Jaipur      | Golden
General      | Himachal    | Himachal    | Himachal    | Himachal
Australia    | Queen       | Andhra      | Andhra      | Andhra
Kingdom      | Andhra      | Jammu       | Northern    | Northern
Glorious     | part        | Orissa      | Kashmir     | Sikkim
Gangtok      | Pilgrimage  | Kashmir     | Orissa      | Hill
Sri          | Hill        | Northern    | Jammu       | TamilNadu
Chandigarh   | Inc         | Sikkim      | Sikkim      | September
Ayurveda     | TamilNadu   | Bengal      | Pilgrimage  | major
France       | related     | related     | Hill        | winter
Italy        | Jammu       | Pilgrimage  | TamilNadu   | Orissa
The table shows the terms most related to the query "India Tourism" for different extents of vector-space coverage. In each column, the first ten entries are terms most related to "India" and the next ten are terms most related to "Tourism". Terms judged non-relevant to the given context were distinguished (italicized in the original table) from the relevant ones.
From the table it is clear that for k=2 (50% of the vector space), terms like "Festivals", "General", "Australia", "Kingdom", "Sri", "France" and "Italy" appear that are not particularly relevant. As k increases to 6, 75% of the vector space is covered and only a few non-relevant terms remain: "Policy", "Queen", "part", "Inc" and "related". For k=8 the number of such unwanted terms reduces further, and for k=10 (i.e. 90% of the vector space) all the displayed terms can be treated as relevant. Further increase in k degrades the result: for k=14, the terms "September", "major" and "winter" appear, demonstrating the presence of noisy data. These results firmly support choosing a value of k that covers 90% of the vector space. The suggested keywords also show good prospects for query-expansion.
4.1.2 Comparison of the "Tf-IDf Measure" with the "Term-Count Measure"

Various term-weight measures assign a weight (i.e. an importance) to the terms of the term-document matrix. The literature very often states that the "Tf-IDf measure" is more effective than the simple "term-count measure". The results of the proposed Meta-Search Engine were assembled for both measures for the query "Thread".
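The two weighting schemes can be sketched as follows. The formula tf * log(N/df), where N is the number of documents and df the number of documents containing the term, is the standard Tf-IDf form assumed here.

```java
// Sketch of the two term-weight measures being compared. Term-count uses the
// raw frequency tf; Tf-IDf scales it by log(N/df), so a term that occurs in
// every document receives weight zero.
public class TermWeights {
    public static double termCount(int tf) {
        return tf;
    }

    public static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }
}
```

Note that when df is close to N for most surviving terms, as in the small corpora used here, the log factor is close to zero for exactly the terms it is meant to suppress, and the two measures rank terms almost identically.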
Table 4.2 Next keywords for query "Thread" using the Term-Count Measure

50% Vector-Space | 75% Vector-Space | 85% Vector-Space | 90% Vector-Space
Actually         | Exists           | Active           | Active
allows           | natural          | String           | String
unique           | Active           | natural          | Package
debugging        | machine          | length           | SE
liked            | developers       | machine          | lower
copy             | depends          | called           | called
errors           | length           | Package          | machine
print            | String           | SE               | allows
which            | raised           | lower            | fact
In both Table 4.2 and Table 4.3, the italicized and bold terms (in the original tables) are the ones assumed to be relevant. As before, for 90% vector-space coverage the results are good in both cases.
Table 4.3 Next keywords for query "Thread" using the Tf-IDf Measure

50% Vector-Space | 75% Vector-Space | 85% Vector-Space | 90% Vector-Space
liked            | Exists           | Active           | Active
debugging        | developers       | String           | String
Actually         | natural          | natural          | Package
unique           | Active           | length           | lower
allows           | machine          | called           | SE
appropriate      | depends          | machine          | called
action           | raised           | SE               | allows
executing        | String           | lower            | machine
copy             | length           | Package          | fact
In this particular case, the results of the term-count and Tf-IDf measures are in fact not different. This does not imply that Tf-IDf is not better than the term-count measure. In the present context we have a limited number of documents (around 20-25), and all the stop-words (which occur in most documents) have already been removed in the pre-processing step. The factor log(N/df) therefore has little effect, and the results of the two measures are almost the same. As the number of retrieved documents increases, Tf-IDf should show improved performance over the term-count measure. The following snapshot of the GUI shows the results for the query "India Tourism" when the LSA algorithm is used for query expansion.
Fig 4.1 Graphical User Interface for LSA
4.2 Result Analysis of PLSA

Various tests have been performed to check the performance of PLSA, as was done for LSA. In the context of the Meta-Search Engine it produces appreciable and quite useful results. Since PLSA classifies the suggested next keywords by topic, it gives an extra edge for expanding the query toward a specific domain. Some examples of PLSA results are illustrated in the following tables:
Table 4.4 Results of PLSA for query “Thread”
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
Thread Fangohr Auckland Gigalink Showcase
System Firms Zealand BridesMaid Prices
Java Hungry NZ ThreadDesign Boutiques
Class Stock Necklace www Dj
lang Flavor Crochet dress Thousands
process sure Paris collection Cocktails
Table 4.5 Results of PLSA for query "Australian University"

Topic 1       Topic 2   Topic 3   Topic 4        Topic 5
University Forum Museum Museum AIU
Australian Study Forum Forum whistleblowing
Australia UNDA England large Security
ANU JCU images books Below
Research CQU large architecture Counter
Page SCU above here Sells
Student CDU books churches Login
International ECU here images Buy
Table 4.4 shows next keywords grouped into five different topics. The first topic is related to the "Thread" class of the Java language. The second topic seems related to the country "Hungry" (Hungary) and some firms. Topic three relates solely to "Thread", a New Zealand fashion and culture magazine and on-line store. Topic 4 contains terms like "dress", "bridesmaid" and "collection", which reflect the fabrics aspect of the term "thread".
Similarly, Table 4.5 shows next keywords for the query "Australian University". Topic 1 contains general terms such as "student", "international", "ANU" and "research" that are related to Australian universities. Topic 2 contains terms like "UNDA", "JCU", "CQU", "SCU", "CDU" and "ECU", which are acronyms of the University of Notre Dame Australia, James Cook University, Central Queensland University and so on; hence the second topic represents a list of Australian universities. The other topics can be understood in the same way. These terms can be used for query-expansion and will in turn yield a focused search.
4.2.1 Optimal Value for the Number of Topics (a)

The number of topics, 'a', is one of the most important factors in PLSA, and its value must be optimal. A large value of 'a' produces redundant topics that are not informative enough, while a small value hides useful concepts. Various tests suggest that this value should lie between 3 and 7 for most cases of the current Meta-Search Engine, because it handles at most 24 to 27 documents, and for that many documents a range of 3 to 7 topics is appropriate. An example with increasing values of 'a' is shown for the same query, "India Tourism". The terms in the different topics reflect different aspects and significance.
Table 4.6 Results of PLSA for query "India Tourism" for different values of the number of topics 'a' = 1, 2, 3

a = 1:
  Topic 1: India, Tourism, Tour, Travel, Kerala, Rajasthan

a = 2:
  Topic 1: India, Tourism, Tour, Travel, Kerala, Rajasthan
  Topic 2: yimg, Hyatt, directly, JS, Regency, suggest

a = 3:
  Topic 1: India, Tourism, Tour, Travel, Kerala, Rajasthan
  Topic 2: yimg, Hyatt, directly, JS, Regency, Mariott
  Topic 3: Kalpa, Munsiyari, demanding, manmade, Wing, Interzigm
4.2.2 Convergence

Since PLSA uses EM for maximum-likelihood estimation, it guarantees convergent behavior of the iterative procedure: it always seeks a local maximum for the given data distribution. In the context of the Meta-Search Engine, PLSA also shows this converging nature. To check it, two measures are used:
• Absolute Measure
• Average Measure
4.2.2.1 Absolute Measure

It is computed by the following formula:

    Max_{i,j} = | P_{i,j}^(n+1) - P_{i,j}^(n) |

where P_{i,j}^(n) is the value in the i-th row and j-th column of the term-topic matrix (or topic-document matrix) after the n-th iteration.
In PLSA, random values are first assigned to both the term-topic and the topic-document matrix. After one iteration of the E and M steps, the algorithm generates new versions of these matrices. The new versions act as input for the next iteration, and this iterative procedure continues until convergence.
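One EM iteration of this procedure can be sketched as follows. This is a minimal illustrative implementation in the asymmetric formulation, with a term-topic matrix P(w|z) and a topic-document matrix P(z|d), both randomly initialised as described above; the exact parameterization used by the system's PLSA.class may differ.

```java
import java.util.Random;

// Sketch of one PLSA EM iteration on a term-document count matrix n[t][d].
public class PlsaSketch {

    // Random matrix whose columns each sum to 1 (column-stochastic).
    static double[][] randomStochastic(int rows, int cols, Random r) {
        double[][] m = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double s = 0;
            for (int i = 0; i < rows; i++) { m[i][j] = r.nextDouble() + 1e-9; s += m[i][j]; }
            for (int i = 0; i < rows; i++) m[i][j] /= s;
        }
        return m;
    }

    // One combined E and M step; returns { new P(w|z), new P(z|d) }.
    static double[][][] emStep(double[][] n, double[][] pwz, double[][] pzd) {
        int T = n.length, D = n[0].length, K = pwz[0].length;
        double[][] newPwz = new double[T][K];
        double[][] newPzd = new double[K][D];
        for (int t = 0; t < T; t++) {
            for (int d = 0; d < D; d++) {
                if (n[t][d] == 0) continue;
                // E-step: posterior P(z | t, d), up to the normalising constant
                double[] post = new double[K];
                double norm = 0;
                for (int z = 0; z < K; z++) { post[z] = pwz[t][z] * pzd[z][d]; norm += post[z]; }
                // M-step: accumulate the expected counts for each topic
                for (int z = 0; z < K; z++) {
                    double c = n[t][d] * post[z] / norm;
                    newPwz[t][z] += c;
                    newPzd[z][d] += c;
                }
            }
        }
        normaliseColumns(newPwz);   // columns of P(w|z) sum to 1 over terms
        normaliseColumns(newPzd);   // columns of P(z|d) sum to 1 over topics
        return new double[][][] { newPwz, newPzd };
    }

    static void normaliseColumns(double[][] m) {
        for (int j = 0; j < m[0].length; j++) {
            double s = 0;
            for (double[] row : m) s += row[j];
            if (s > 0) for (double[] row : m) row[j] /= s;
        }
    }
}
```

Calling emStep repeatedly on its own output, until the cell entries stop moving, yields the converged model.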
To measure convergence, we compute the maximum difference Max_{i,j} over all corresponding cell entries of the term-topic matrix and its newer version. This calculation is performed for each iteration and the maximum value is noted. The results show that this maximum difference decreases steadily and then converges. The same procedure is performed for the topic-document matrix, which shows the same behavior: it also converges, possibly earlier or later. The following graphs give evidence of this behavior of PLSA. The experiment was performed for the query keyword "IIT" with three topics, using the absolute measure. For clear visualization of such small values, the negative natural logarithm is plotted on the y-axis; thus a value of 20 on the y-axis represents exp(-20).
[Figure: plot of the negative log of the maximum cell difference against iteration number (1 to 22) for the term-topic matrix.]
Fig 4.2 Convergence in Term-Topic Matrix computed by Absolute Measure
[Figure: plot of the negative log of the maximum cell difference against iteration number (1 to 27) for the topic-document matrix.]
Fig 4.3 Convergence in Topic-Document Matrix computed by Absolute Measure
4.2.2.2 Average Measure

The average measure is computed by the following formula:

    Max_{i,j} = | P_{i,j}^(n+1) - P_{i,j}^(n) | / ( ( | P_{i,j}^(n+1) | + | P_{i,j}^(n) | ) / 2 )

where P_{i,j}^(n) is the value in the i-th row and j-th column of the term-topic matrix (or topic-document matrix) after the n-th iteration.
The same procedure as explained previously is used here, with the average measure in place of the absolute measure. The following graphs again show convergent behavior under this measure. The experiment was performed for the query keyword "IIT" with three topics, as in the previous experiment.
[Figure: plot of the negative log of the maximum relative cell difference against iteration number (1 to 27) for the term-topic matrix.]
Fig 4.4 Convergence in Term-Topic Matrix computed by Average Measure
[Figure: plot of the negative log of the maximum relative cell difference against iteration number (1 to 25) for the topic-document matrix.]
Fig 4.5 Convergence in Topic-Document Matrix computed by Average Measure
4.2.3 Number of Iterations for Convergence

The number of iterations for convergence is also an important issue, and it must be optimal: it should not be so small that the algorithm stops in a non-converged state, nor so large that it over-tunes the values (the probabilities in both matrices). The technique of "early stopping" is used to handle this, and the algorithm is implemented so that it takes care of these situations automatically. The maximum difference between corresponding cell values of the old and new matrices is computed for each iteration; if this difference becomes small enough (say < 0.001), the iterations are stopped automatically.
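The two convergence measures and the early-stopping test can be sketched as follows. The average measure is reconstructed here as a symmetric relative difference, which is an assumption based on the formula given in Section 4.2.2.2; the 0.001 threshold matches the one quoted above.

```java
// Sketch of the absolute and average convergence measures over two
// successive versions of a probability matrix, plus the early-stopping test.
public class ConvergenceSketch {

    // Absolute measure: largest per-cell difference between the versions.
    public static double absoluteMeasure(double[][] oldM, double[][] newM) {
        double max = 0;
        for (int i = 0; i < oldM.length; i++)
            for (int j = 0; j < oldM[0].length; j++)
                max = Math.max(max, Math.abs(newM[i][j] - oldM[i][j]));
        return max;
    }

    // Average measure: per-cell difference divided by the cells' mean magnitude
    // (assumed reconstruction of the formula in Section 4.2.2.2).
    public static double averageMeasure(double[][] oldM, double[][] newM) {
        double max = 0;
        for (int i = 0; i < oldM.length; i++)
            for (int j = 0; j < oldM[0].length; j++) {
                double denom = (Math.abs(newM[i][j]) + Math.abs(oldM[i][j])) / 2;
                if (denom > 0)
                    max = Math.max(max, Math.abs(newM[i][j] - oldM[i][j]) / denom);
            }
        return max;
    }

    // Early stopping: the EM loop halts once the difference falls below eps.
    public static boolean converged(double[][] oldM, double[][] newM, double eps) {
        return absoluteMeasure(oldM, newM) < eps;
    }
}
```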
4.2.4 PLSA Screen-shots

Following are two examples of the user interface, showing that the results for a query are grouped according to their context. The queries are
1. Thread
2. India Tourism
Fig 4.6 GUI representing results for query “Thread”
Fig 4.7 GUI representing results for query “India Tourism”
4.3 Convergence in the Number of Unique Links over Iterations

An important question is how many times a query should be expanded to obtain refined and relevant results. The following graph clearly shows that after a certain number of query-expansion iterations (around 5-6), the number of unique links (and hence unique documents) converges. The test was performed for the query "Thread", expanded gradually to "Thread Package", "Thread Package lang", "Thread Package lang Java", etc. A web-page link and its respective sub-page link are counted once and treated as unique.
[Figure: plot of the number of unique web-links (0 to 30) against the query-expansion iteration (1 to 7).]
Fig 4.8 Behavior of the number of unique web-links with iterations of Query Expansion
4.4 Comparison between LSA and PLSA Results

The results of both LSA and PLSA demonstrated in the previous sections are quite good. In the case of LSA, the suggested next keywords generally belong to one dominating context but point in a very clear direction for search by query expansion. For example, in the search for "India Tourism", terms such as "Forts" and "Hill" surely direct the search toward a focused area of need.
On the other hand, PLSA suggests next keywords by classifying them into the number of topics requested by the user. This feature of PLSA makes it much easier to refine results: a keyword can be selected from a specific topic and then used for query-expansion. For the same query, "India Tourism", the next keywords are grouped into topics: the first topic shows the simple aspect of tourism and displays famous places to visit, while another group contains famous hotel and restaurant names such as "Hyatt", "Mariott" and "Regency", which represent another important aspect of tourism in India.
Since PLSA presents results in a more organized way, with the capability of distributing terms according to various logical aspects, it is better than LSA for query-expansion in a Meta-Search Engine. Its firm statistical foundation and its use of EM for convergence are two sufficient reasons for PLSA's commendable results.
4.5 Comparison with the "Dogpile" Meta-Search Engine

A comparative study was performed between the top ten results of the Meta-Search Engine "Dogpile" and the results of the proposed MSE after query expansion. The results demonstrate that while Dogpile mixes results about both concepts related to the query "Thread" (namely "dress" and "Java"), the results of the proposed MSE for the expanded queries "Thread Package Java" and "Thread Dress" are confined entirely to their respective concepts. The long web-links (URLs) in the results show that the proposed MSE reaches sub-pages of web-sites, further confirming that its results are more focused.
Fig 4.9 Top ten results of Meta-Search Engine “Dogpile” for Query
“Thread”
Fig 4.10 Top ten results of proposed MSE for Expanded Query “Thread
Package Java”
Fig 4.11 Top ten results of proposed MSE for Expanded Query “Thread
Dress”
Summary

In this chapter we have reviewed the results obtained with the proposed Meta-Search Engine and compared them with those of "Dogpile". Various experiments confirm that LSA and PLSA can indeed provide effective query expansion, and PLSA appears to outperform LSA. However, there are some shortcomings in the present version; we shall discuss them, and how they can be overcome, in the next chapter.
Chapter 5
Improvements from NER (Named-Entity Recognizer)
5.1 Introduction
This chapter presents a significant improvement to the results of the MSE, obtained through the use of a "Named-Entity Recognizer". The chapter introduces the Named-Entity Recognizer, its role, and hence its influence in the context of the MSE. It then demonstrates the changes to the previous architecture needed to incorporate this extra module, and their consequences. Finally, results are shown with illustrative examples.
In the previous chapter, the results of LSA and PLSA were shown with clarifying examples. These results are all appreciable, but they are lacking in one respect. For example, in Table 4.1 terms like "Corbett", "Golden", "Royal" and "North" appear. These terms are relevant to the search but do not convey the real meaning, because each represents only part of a collection of words; the words reveal their real meaning only when grouped together. Similarly, in Table 4.5, which shows the PLSA results for the query "Australian University", terms like "Australian", "International" and "University" appear independently although they are actually parts of the single entity "Australian International University". The same scenario occurs with terms such as "Jammu Kashmir" and "Taj Mahal". Because of this, the query has to be expanded for a few more iterations to obtain refined results, and the inference drawn from the classified keywords may be erroneous.
This is not actually a problem of the information-retrieval techniques used in the proposed MSE. The reason for such partially correct results is the search-engines' literal matching mechanism: if we fire a query "A B", the search-engines retrieve all the pages that contain A, B, or both. Since our MSE is built on these SEs and parses each term individually, this type of error in the results is unavoidable.
The essence of the problem is that we are treating each word as a separate term. This may be incorrect in cases where a group of words should be treated as a single term because they represent a single semantic unit. The problem is complicated by the fact that the individual words may also have acceptable semantics of their own. For example, in the entity "Banaras Hindu University" the individual words have distinct meanings of their own, quite different from the meaning obtained when all three words are taken together as a single unit. This is a well-known problem of natural language processing called "Named-Entity Recognition". A named entity may be the name of a person, organization, institute, place, or any proper noun, such as Mr. Albert Einstein, Carnegie Mellon University, or President of India. A named entity is thus a collection of words which, when grouped together, represent a single meaning.
A "Named-Entity Recognizer (NER)" is a system which can recognize all the named entities in a given passage. Thus, if an NER is introduced into our system, we will be able to find all the named entities in our corpus and treat each of them as a single term. Some NER packages are already freely available for English and other languages; therefore, rather than implementing our own NER module, it was considered better to use an available one. Since our design of the MSE supports easy extendibility, the new module could be incorporated easily.
We expected two positive outcomes from adding NER. First, the number of terms in the term-document matrix is reduced, which increases the responsiveness of the system, particularly for PLSA; this is because PLSA uses an iterative procedure for convergence, and the per-iteration complexity of the algorithm depends on the number of terms. Second, we obtain a greater variety of terms as next probable query keywords, because some of the most related terms are already grouped within their respective named entities. In fact, the whole procedure adds a notion of meaning to the next keywords.
Empirical results show that, on average, the number of terms without NER was 3096, while after the addition of the NER module this number reduced to 3079, reflecting the presence of 17 named entities. With NER, these entities are treated as single terms and provide more meaningful next probable keywords for expansion.
5.2 Modified Architecture of the Meta-Search Engine

Figure 5.1 below shows the modified MSE. Since the architecture is highly modular, the NER module could be plugged in easily.
[Figure: the architecture of Fig. 3.1 with an added NER module between the Page Retriever (Web-Pages) and the Pre-Processing Unit (Processed Text-Corpora); the remaining components (User Interface, Common Interface to Search-Engines, Search-Engines 1 through n, Baseline Establishment, and the LSA/PLSA Algorithms module) are unchanged.]
Fig. 5.1 Architecture of Modified Meta-Search Engine
From the User Interface component to the Page Retriever, everything is the same as presented in the previous chapter. The named-entity recognition task is performed after the text has been extracted from the retrieved pages and before the pre-processing unit. Named entities are recognized in all the text files and stored in the desired term-document format with a suitable term-weight factor. Moreover, the identified named entities must not be passed through the stop-word removal and stemming phases: if, for example, the word "of" were removed from the named entity "Indian Institute of Information Technology" by the stop-word removal process, the whole named entity would be distorted. Therefore, the NER belongs just before the preprocessing unit. The remaining terms are stored in the term-document matrix, which then serves as input to the rest of the system where LSA or PLSA is executed. As mentioned previously, these algorithms yield the next keywords for the given query.
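The effect of treating a recognized entity as a single term can be sketched as follows. This is an illustrative token-merging step, not the Stanford NER itself: given the entity phrases the recognizer has tagged, each occurrence is collapsed into one term before the term-document matrix is built, so the phrase bypasses stop-word removal and stemming as a unit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of post-NER token merging: every occurrence of a recognized entity
// phrase in the token stream is replaced by a single joined term.
public class EntityMerger {
    public static List<String> mergeEntities(List<String> tokens, Set<List<String>> entities) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); ) {
            List<String> matched = null;
            for (List<String> e : entities)
                if (i + e.size() <= tokens.size()
                        && tokens.subList(i, i + e.size()).equals(e)
                        && (matched == null || e.size() > matched.size()))
                    matched = e;   // prefer the longest entity at this position
            if (matched != null) {
                out.add(String.join(" ", matched));  // one term for the whole entity
                i += matched.size();
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```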
5.3 Modified High-level Design

[Figure: the package diagram of Fig. 3.2 with an added NER package alongside MetaSearchEngine, LSA/PLSA Algorithm, ReadingAndStemming, Parser, SearchEngineInterface, GUI and BaseLine.]
Fig. 5.2 High-Level Design with NER (Package Diagram)
The package diagram in Fig. 5.2 shows the position of the NER package and its interactions with the remaining packages. The advantage of the modular design is evident here: the NER module could be added to the previous design with few changes, and the new idea works well.
The NER package uses the Named-Entity Recognizer library developed by the Natural Language Processing Group at Stanford University. This library is freely available under the GNU license. It uses a Conditional Random Field (CRF) classifier: the library provides an implementation of a Conditional Random Field sequence model, coupled with a feature extractor for NER. It recognizes three types of named entities: person, location and organization. The library also contains some other models and versions, with and without additional similarity features; these features improve performance but require a considerable amount of memory. For the proposed MSE, the classifier with the smallest memory requirement is used. The library is implemented in Java and is available as a .jar file called stanford-ner.jar [41]. The following example shows text data after named-entity recognition.
Fig. 5.3 A text file before and after Named-Entity Recognition
From Fig. 5.3 it is evident that named entities such as "Indian Institute of Information Technology", "Allahabad" and "Dr. M. D. Tiwari" are properly identified and enclosed in their respective tags.
Among the classes of the NER package, NamedEntity.java converts all the text files into the named-entity format, while NEExtraction.java extracts the named entities and stores them in the term-document matrix; non-entity terms are written back to the corresponding files for further processing. For proper and efficient use of NER, a small modification was made to the Jericho HTML parser.
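The storage step performed by NEExtraction.java can be sketched as follows. The class name and the entity boost factor are assumptions for illustration; the thesis states only that a "suitable term-weight factor" is applied, not its exact value.

```java
import java.util.*;

// Illustrative sketch of building one document's row of the term-document
// matrix: named-entity terms receive a boosted weight, ordinary terms count 1.
public class TermWeighting {

    public static Map<String, Double> termWeights(List<String> terms,
                                                  Set<String> namedEntities,
                                                  double entityBoost) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : terms) {
            double w = namedEntities.contains(term) ? entityBoost : 1.0;
            weights.merge(term, w, Double::sum);   // accumulate weighted frequency
        }
        return weights;
    }
}
```

Each document's weight map then becomes one column of the term-document matrix fed to LSA or PLSA.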
5.4 Results of NER
We repeated the experiments of the previous chapter, which did not use NER, this time with the NER module enabled. The results obtained were as per our expectations. The following user interface displays one of the results for the query "India Tourism" after applying the Named-Entity Recognizer. The result contains named entities such as "Golden City", "Indian Wildlife", "Corbett National Park Tour", "Taj & Wildlife Tour", "Discover North India" and "Discover Forts and Palaces". It is instructive to compare the results shown below with those obtained for the same query without NER in the previous chapter. Referring to those results, it is quite clear that the term "Corbett" is related to the national park, "North" appears in the context of "Discover North India", and forts and palaces belong to "Discover Forts and Palaces".
Now we can use these named entities for query expansion and obtain refined results within the next one or two iterations. Similarly, other closely related keywords from different contexts can be obtained. However, it should be emphasized that the improvement in the results would be less dramatic if the results of the original query did not contain a significant number of named entities, or if the named entities were not very relevant to the original query.
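The expansion step itself can be as simple as the following sketch (a hypothetical helper, not a class of the implemented system): the suggested keyword is appended to the original query, with multi-word named entities quoted so the underlying engines treat them as a single phrase.

```java
// Naive query-expansion sketch: append the chosen suggestion to the query,
// phrase-quoting multi-word named entities.
public class QueryExpansion {

    public static String expand(String originalQuery, String suggestedKeyword) {
        String keyword = suggestedKeyword.contains(" ")
                ? "\"" + suggestedKeyword + "\""   // keep a multi-word entity intact
                : suggestedKeyword;
        return originalQuery + " " + keyword;
    }
}
```

For example, expanding "India Tourism" with the suggested entity "Corbett National Park" yields the refined query for the next iteration.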
Fig. 5.4 GUI after applying NER for query “India Tourism”
5.5 Summary
In this chapter we explored thoroughly the effect of introducing a new module
“Named Entity Recognizer”. We examined its importance, consequences and results.
From the results it is quite evident that the new module provides significant
improvements, particularly for those queries where the named entities are likely to
have a high relevance. This chapter also demonstrated the strength of the design of
our software because we could add the new module quite easily into the existing
system.
Chapter 6
Conclusion and Future Enhancements
6.1 Conclusion
In the present scenario, search engines are very useful tools for extracting needed information from the Internet. Meta-search engines serve the same purpose with a wider span of coverage and advanced features such as maintaining user profiles and filtering results. The proposed MSE is based on refining the results using query expansion, where the next keywords are suggested by the MSE itself without using any thesaurus or dictionary. We can conclude that both algorithms, LSA and PLSA, work well for suggesting next keywords for the MSE.
The results and analysis demonstrate that PLSA outperforms LSA and presents all the results in a well-classified and easily understandable format. Further, incorporating the Named Entity Recognizer into the MSE improves the results. It can therefore be concluded that designing an MSE that uses LSA/PLSA for query expansion is a fruitful idea.
6.2 Future Enhancements
The following points outline future enhancements to the proposed MSE:
• The current meta-search engine uses only the results of Google, Yahoo and MSN. Other search engines, such as AltaVista and Ask Jeeves, could be added to the proposed MSE. This would increase its coverage span and could provide even more acceptable results; the design of the proposed MSE supports such easy modification.
• The APIs used for implementing the MSE provide only a limited number of results from the respective search engines. If the number of retrieved results could be increased, the suggestions would improve accordingly.
• Parsers for additional file formats could be added, giving this MSE an admirable extra feature.
• Web pages may contain advertisements and images to a large extent. These are of no use from the algorithm and query-expansion points of view, so a good provision could be made in the MSE to deal with such content effectively.
• Maintaining information about user profiles could be an extended feature of the proposed MSE. If a provision is made to keep (user, URL) information for each user, then the next time that user issues a query, the results could be filtered or categorized before being displayed on the user interface.
• A pronoun in context reduces the weight of the noun it refers to. For example, consider the following passage:
"The Taj Mahal is one of the most famous historical monuments of India. It is one among the seven wonders of the world. It was built by Shahjahan."
'It' in sentences 2 and 3 refers to "Taj Mahal", so both occurrences should be counted towards it; i.e., the frequency count of "Taj Mahal" should be three. In the present technique, however, it is only one, so the resultant weight of "Taj Mahal" is underestimated. This is a well-known problem in NLP, called anaphora resolution. To solve it, we would need to add a module for anaphora resolution. Such a module can easily be added to the design of the MSE, and even more accurate results can be expected.
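As a rough illustration of the intended effect, the following deliberately naive Java sketch (in no way a real anaphora resolver) replaces each standalone "It"/"it" with the earliest-mentioned entity from a supplied list, so the entity's frequency count rises accordingly.

```java
import java.util.*;

// Naive anaphora-resolution sketch: pick the entity mentioned earliest in the
// text as the antecedent and substitute it for every standalone "It"/"it".
// A real resolver must consider gender, number, salience and syntax.
public class AnaphoraSketch {

    public static String resolve(String text, List<String> entities) {
        String antecedent = null;
        int earliest = Integer.MAX_VALUE;
        for (String e : entities) {
            int pos = text.indexOf(e);
            if (pos >= 0 && pos < earliest) {   // simplistic antecedent choice
                earliest = pos;
                antecedent = e;
            }
        }
        if (antecedent == null) return text;
        return text.replaceAll("\\b[Ii]t\\b",
                java.util.regex.Matcher.quoteReplacement(antecedent));
    }
}
```

Applied to the passage above with the entity list {"Taj Mahal", "India", "Shahjahan"}, both occurrences of "It" become "Taj Mahal", so its frequency count rises from one to three.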
Appendix-A:
Search Engines’ API
A.1 Google SOAP Search API (beta)
Software developers can now build their own programs that query a large number of web pages, using the Google SOAP Search API for this purpose. To provide this facility, Google uses the Simple Object Access Protocol (SOAP) and the Web Service Description Language (WSDL). The availability of this API for various languages and platforms, such as Perl, Java and Visual Studio .NET, gives developers the liberty to choose their favorite environment [33].
Good example code and complete documentation come with the developer's kit. A license key and a Google account are needed to access the services of this API; with both, one is entitled to fire 1000 queries per day, and at most 10 links can be retrieved per request [33]. Google has since replaced this API with its newer Google AJAX Search API. Sometimes a proxy does not allow SOAP requests to pass; to eliminate this problem, the user must have privileges so that his or her request can be passed through the proxy. The essential classes and methods of this API that are used in the proposed MSE are given in Table A.1 below:
Table A.1 Classes and Methods of Google SOAP Search API

Class GoogleSearch:
• public GoogleSearch()
  Constructs a new instance of a GoogleSearch client.
• public void setKey(String key)
  Sets the user key used for authorization by the Google SOAP server. This is a mandatory attribute for all requests.
• public void setQueryString(String q)
  Sets the query string for this search.
• public byte[] doGetCachedPage(String url) throws GoogleSearchFault
  Retrieves a cached web page from Google. The key attribute must be set.
• public String doSpellingSuggestion(String phrase) throws GoogleSearchFault
  Asks Google to return a spelling suggestion for a word or phrase.
• public GoogleSearchResult doSearch() throws GoogleSearchFault
  Invokes the Google search. Note: the key and query attributes must already be set.

Class GoogleSearchResult:
• public GoogleSearchResult()
  Constructor.
• public String toString()
  Returns a nicely formatted representation of a Google search result.
A.2 Yahoo Search Web Service API
The Yahoo! Developer Network provides various web services for application developers, who can use them to build new and customized applications. These services are based on REST (Representational State Transfer): Yahoo! Web Services operate over HTTP requests in which the URL must be encoded. All the libraries and example code for accessing the Yahoo! Search Web Services are bundled together as a Software Development Kit (SDK), which can easily be downloaded from the Yahoo! website [34]. The SDK includes code in Java, Lua, JavaScript, Perl and other languages, so a developer can easily choose a language and platform of his or her choice. To access Yahoo! Web Services one must register and obtain an application ID, which is tied to the application and must accompany each web service request.
Table A.2 Classes and Methods of Yahoo Search Web Service API

Class SearchClient:
• public SearchClient(String appId)
  Constructs a new SearchClient with the given application ID, using the default settings.
• public WebSearchResults webSearch(WebSearchRequest request) throws IOException, SearchException
  Searches the Yahoo database for web results.

Class WebSearchRequest:
• public WebSearchRequest(String query)
  Constructs a new web search request.
• public void setResults(int results)
  Sets the maximum number of results to return. Fewer results may be returned if there aren't enough in the database. At the time of writing, the default value is 10 and the maximum value is 50.

Interface WebSearchResult:
• String getTitle()
  The title of the web page.
• String getUrl()
  The URL of the web page.

Interface WebSearchResults:
• BigInteger getTotalResultsAvailable()
  The number of query matches in the database.
• BigInteger getTotalResultsReturned()
  The number of query matches returned. This may be lower than the number of results requested if fewer total results were available.
• WebSearchResult[] listResults()
  The list (in order) of results from the search.
A.3 MSN Search SDK (beta)
The MSN Search SDK beta gives a user the ability to send queries to MSN Live Search and receive results. The documentation shipped with the SDK explains the essential concepts, guidelines and library of the MSN Search Web Service, and the SDK also contains example code that illustrates techniques for application development.
This SDK requires one of the Windows platforms (Windows 2000, Server 2003, XP or Vista) and a computer capable of sending requests via SOAP 1.1 and HTTP 1.1 and of parsing XML. Microsoft Visual Studio .NET 2003 or 2005 and the Microsoft .NET Framework must be installed on the deployment computer to build and run the applications. An application ID must always accompany each request. For a given query, the top 10 MSN results can be received in the user program [35].
Appendix-B:
Parser’s API
B.1 Jericho HTML Parser
Jericho HTML Parser is a powerful Java library that analyses and manipulates parts of an HTML document [36]. It also contains functions that can manipulate high-level HTML forms. Since it is available as an open-source library, it can easily be used even in commercial applications. The library has the following major features that distinguish it from other HTML parsers:
• It is not a tree-based parser; it is based entirely on simple text search and efficient recognition of tags.
• Its memory and resource requirements are far lower than those of DOM-based parsers.
• Each parsed segment can be easily accessed, and modifications to selected segments can be performed efficiently.
• It provides an easy way to define and register custom tags so that the parser can recognize them.
B.2 PDFBOX-0.7.0
PDFBox provides functionality for creating new PDF documents, manipulating them and extracting content from them. It is available as open source and also comprises several utilities [37]. Some essential features of PDFBox are:
• Text extraction from PDF
• Merging of PDF documents
• Encryption/decryption of PDF documents
• Integration with the Lucene search engine
• Creation of a PDF from a text file
• Creation of images from PDF pages
For the proposed MSE, only the PDF-to-text extraction feature is used.
Appendix-C:
List of Stop Words
The following stop-word list is available on the LSI web site of the Computer Science Department of the University of Tennessee [42].
Table C.1 List of stop words.
A appear C doing former a appreciate c'mon don't formerly
a's appropriate c's done forth able are came down four
about aren't can downwards from above around can't during further
according as cannot E furthermore accordingly aside cant each G
across ask cause edu get actually asking causes eg gets
after associated certain eight getting afterwards at certainly either given
again available changes else gives against away clearly elsewhere go
ain't awfully co enough goes all B com entirely going
allow be come especially gone allows became comes et got almost because concerning etc gotten alone become consequently even greetings along becomes consider ever H
already becoming considering every had also been contain everybody hadn't
although before containing everyone happens always beforehand contains everything hardly
am behind corresponding everywhere has among being could ex hasn't
amongst believe couldn't exactly have an below course example haven't and beside currently except having
another besides D F he any best definitely far he's
anybody better described few hello
anyhow between despite fifth help anyone beyond did first hence
anything both didn't five her anyway brief different followed here anyways but do following here's anywhere by does follows hereafter
apart doesn't for hereby
herein K N others saying hereupon keep name otherwise says
hers keeps namely ought second herself kept nd our secondly
hi know near ours see him knows nearly ourselves seeing
himself known necessary out seem his L need outside seemed
hither last needs over seeming hopefully lately neither overall seems
how later never own seen howbeit latter nevertheless P self however latterly new particular selves
I least next particularly sensible i'd less nine per sent i'll lest no perhaps serious i'm let nobody placed seriously i've let's non please seven ie like none plus several if liked noone possible shall
ignored likely nor presumably she immediate little normally probably should
in look not provides shouldn't inasmuch looking nothing Q since
inc looks novel que six indeed ltd now quite so indicate M nowhere qv some indicated mainly O R somebody indicates many obviously rather somehow
inner may of rd someone insofar maybe off re something instead me often really sometime
into mean oh reasonably sometimes inward meanwhile ok regarding somewhat
is merely okay regardless somewhere isn't might old regards soon
it more on relatively sorry it'd moreover once respectively specified it'll most one right specify it's mostly ones S specifying
its much only said still itself must onto same sub
J my or saw such just myself other say sup
sure they U we whole T they'd un we'd whom t's they'll under we'll whose
take they're unfortunately we're why taken they've unless we've will tell think unlikely welcome willing
tends third until well wish th this unto went with
than thorough up were within thank thoroughly upon weren't without thanks those us what won't thanx though use what's wonder that three used whatever would
that's through useful when would thats throughout uses whence wouldn't the thru using whenever X
their thus usually where Y theirs to uucp where's yes them together V whereafter yet
themselves too value whereas you then took various whereby you'd
thence toward very wherein you'll there towards via whereupon you're
there's tried viz wherever you've thereafter tries vs whether your thereby truly W which yours
therefore try want while yourself therein trying wants whither yourselves theres twice was who Z
thereupon two wasn't who's zero these way whoever
Appendix-D:
JAMA API (for SVD)
The classes of the Jama package are listed below [40]:
• CholeskyDecomposition
• EigenvalueDecomposition
• LUDecomposition
• Matrix
• QRDecomposition
• SingularValueDecomposition
The classes and their respective methods used in the proposed MSE for SVD are:
Table D.1 Classes and Methods of JAMA API

Class Matrix:
• public Matrix(double[][] A)
  Construct a matrix from a 2-D array.
• public int getColumnDimension()
  Get the column dimension.
• public int getRowDimension()
  Get the row dimension.
• public double[][] getArray()
  Access the internal two-dimensional array.
• public double[][] getArrayCopy()
  Copy the internal two-dimensional array.
• public Matrix getMatrix(int i0, int i1, int j0, int j1)
  Get a submatrix.
• public Matrix transpose()
  Matrix transpose.
• public Matrix times(Matrix B)
  Linear algebraic matrix multiplication, A * B.
• public void print(int w, int d)
  Print the matrix to stdout, lining the elements up in columns with a Fortran-like 'Fw.d' style format.
• public SingularValueDecomposition svd()
  Singular value decomposition.

Class SingularValueDecomposition:
• public SingularValueDecomposition(Matrix Arg)
  Construct the singular value decomposition.
• public Matrix getU()
  Return the left singular vectors.
• public Matrix getV()
  Return the right singular vectors.
• public Matrix getS()
  Return the diagonal matrix of singular values.
References
[1] Jae Hyun Lim, Young-Chan Kim, Hyonwoo Seung, Jun Hwang, Heung-Nam Kim, "Query Expansion for Intelligent Information Retrieval on Internet", Proceedings of the International Conference on Parallel and Distributed Systems, pp. 652-656, 1997.
[2] Boston University, "How Search Engines Work", www.bu.edu
[3] Danny Sullivan, "How Search Engines Work", www.searchenginewatch.com
[4] Zheng Li, Yuanqiong Wang, Vincent Oria, "A New Architecture to Web Meta-Search Engine", CIS Department, New Jersey Institute of Technology, Seventh Americas Conference on Information Systems, 2001.
[5] J. H. Abawajy, M. J. Hu, "A New Internet Meta-Search Engine and Implementation", The 3rd ACS/IEEE International Conference on Computer Systems and Applications, p. 103, 2005.
[6] B. Shanmukha Rao, S. V. Rao, G. Sajith, "A User-Profile Assisted Meta Search Engine", TENCON 2003 Conference on Convergent Technologies for Asia-Pacific Region, Volume 2, pp. 713-717, 15-17 Oct. 2003.
[7] A. Spink, B. J. Jansen, C. Blakely, S. Koshman, "Overlap Among Major Web Search Engines", ITNG 2006 Third International Conference on Information Technology: New Generations, pp. 370-374, 10-12 April 2006.
[8] A. Gulli, A. Signorini, "Building an Open Source Meta-Search Engine", Special interest tracks and posters of the 14th International Conference on World Wide Web (WWW '05), ACM Press, May 2005.
[9] Eric J. Glover, Steve Lawrence, William P. Birmingham, C. Lee Giles, "Architecture of a Meta-Search Engine that Supports User Information Needs", Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM '99), ACM Press, pp. 210-216, 1999.
[10] L. Yuen, M. Chang, Y. K. Lai, Chung Keung Poon, "Excalibur: A Personalized Meta Search Engine", Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC 2004), Volume 2, pp. 49-50, 2004.
[11] Junjie Chen, Wei Liu, "A Framework for Intelligent Meta-Search Engine Based on Agents", Third International Conference on Information Technology and Applications (ICITA 2005), Volume 1, pp. 276-279, 4-7 July 2005.
[12] D. L. Lee, Huei Chuang, K. Seamons, "Document Ranking and the Vector-Space Model", IEEE Software, Volume 14, Issue 2, pp. 67-75, Mar/Apr 1997.
[13] G. Salton, A. Wong, C. S. Yang, "A Vector Space Model for Automatic Indexing", 1975.
[14] Website: http://mingo.infoscience.uiowa.edu/courses/230/Lectures/Vector1.html#1d
[15] Vijay V. Raghavan, S. K. M. Wong, "A Critical Analysis of Vector Space Model for Information Retrieval", Journal of the American Society for Information Science, vol. 35, no. 5, pp. 279-287, 1986.
[16] Website: http://www.miislita.com/tervector/term-vector-2.html
[17] Scott Deerwester et al., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[18] Thomas K. Landauer, Peter W. Foltz, Darrell Laham, "An Introduction to Latent Semantic Analysis", Discourse Processes, 25, pp. 259-284, 1998.
[19] Website: www.cs.bham.ac.uk/~axk/ML_PLSA.ppt
[20] Bin Tang, Xiao Luo, Malcolm I. Heywood, Michael Shepherd, "A Comparative Study of Dimension Reduction Techniques for Document Clustering", Technical Report CS-2004-14, December 6, 2004.
[21] Holger Bast, "Dimension Reduction: A Powerful Principle for Automatically Finding Concepts in Unstructured Data", Max Planck Institute for Informatics.
[22] Cambridge University Press, "Dimensionality Reduction and Latent Semantic Indexing", draft, October 13, 2006.
[23] Vishwa Vinay, Ingemar J. Cox, Ken Wood, Natasa Milic-Frayling, "A Comparison of Dimensionality Reduction Techniques for Text Retrieval", Proceedings of the Fourth IEEE International Conference on Machine Learning and Applications (ICMLA '05), 2005.
[24] Bin Tang, Michael Shepherd, Evangelos Milios, Malcolm I. Heywood, "Comparing and Combining Dimension Reduction Techniques for Efficient Text Clustering", January 21, 2005.
[25] Barbara Rosario, "Latent Semantic Indexing: An Overview", INFOSYS 240 Final Paper, Spring 2000.
[26] Eileen Kintsch, Dave Steinhart, Gerry Stahl, Cindy Matthews, Ronald Lamb, "Developing Summarization Skills through the Use of LSA-Based Feedback", Interactive Learning Environments (in press).
[27] Thomas K. Landauer, Darrell Laham, Peter Foltz, "Learning Human-like Knowledge by Singular Value Decomposition: A Progress Report".
[28] Bob Rehder, M. E. Schreiner, Michael B. W. Wolfe, Darrell Laham, Thomas K. Landauer, Walter Kintsch, "Using Latent Semantic Analysis to Assess Knowledge".
[29] Thomas Hofmann, "Probabilistic Latent Semantic Indexing", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, California, United States, pp. 50-57, 1999.
[30] Thomas Hofmann, "Probabilistic Latent Semantic Analysis", Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99), 1999.
[31] Sean Borman, "The Expectation Maximization Algorithm: A Short Tutorial", June 28, 2006.
[32] Guandong Xu, Yanchun Zhang, Xiaofang Zhou, "Using Probabilistic Latent Semantic Analysis for Web Page Grouping", Proceedings of the 15th IEEE International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA '05), 2005.
[33] Website: http://code.google.com
[34] Website: http://developer.yahoo.com/download/download.html
[35] Website: http://www.microsoft.com/downloads/details.aspx
[36] Website: http://sourceforge.net/projects/jerichohtml/
[37] Website: http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377
[38] Website: http://www.dcs.gla.ac.in/idom/ir_resources/linguistic_util/porter.java
[39] Website: http://www.comp.lancs.ac.uk/computing/research/stemming/paice/article.htm
[40] Website: http://math.nist.gov/javanumerics/jama/
[41] Website: http://nlp.stanford.edu/software/CRF-NER.shtml
[42] Website: http://www.cs.utk.edu/~lsi/