Chapter 1
Web Mining and Information Retrieval
1.1 Introduction
The World Wide Web or simply the web may be seen as a huge collection
of documents freely produced and published by a very large number of people,
without any solid editorial control. This is probably the most democratic – and
anarchic – widespread means for anyone to express feelings, comments,
convictions and ideas, independently of ethnicity, sex, religion or any other
characteristic of human societies. The web constitutes a comprehensive, dynamic,
permanently up-to-date repository of information regarding most of the areas of
human knowledge (Hu, 2002) and supporting an increasingly important part of
commercial, artistic, scientific and personal transactions, which gives rise to a
very strong interest from individuals, as well as from institutions, at a universal
scale. However, the web also exhibits some characteristics that are adverse to the
process of collecting information from it in order to satisfy specific needs: the
large volume of data it contains; its dynamic nature; being mainly constituted by
unstructured or semi-structured data; content and format heterogeneity and
irregular data quality are some of these adverse characteristics. End-users also
introduce some additional difficulties in the retrieval process: information needs
are often imprecisely defined, generating a semantic gap between user needs and
their specification. The satisfaction of a specific information need on the web is
supported by search engines and other tools aimed at helping users gather
information from the web. The user is usually not assisted in the subsequent tasks
of organizing, analyzing and exploring the answers produced. These answers are
usually flat lists of large sets of web pages which demand significant user effort to
be explored. Satisfying information needs on the web is usually seen as an
ephemeral one-step process of information search (the traditional search engine
paradigm). Given these characteristics, it is highly demanding to satisfy private or
institutional information needs on the web. The web itself, and the interests it
promotes, are growing and changing rapidly, at a global scale, both as a means of
divulgation and dissemination and also as a source of generic and specialized
information. Web users have already realized the potential of this huge
information source and use it for many purposes, mainly in order to satisfy
specific information needs. Simultaneously the web provides a ubiquitous
environment for executing many activities, regardless of place and time.
1.2 Web Mining
Web mining is a very active research topic that combines two major research
areas: Data Mining and the World Wide Web. Web mining
research relates to several research communities such as Database, Information
Retrieval and Artificial Intelligence [1].
Web mining is defined by [Coo97] as the discovery and analysis of useful
information from the WWW. Web mining is used to extract interesting and
potentially useful patterns and implicit information from artefacts or activity
related to the WWW. Web mining in relation to other forms of data mining and
retrieval is illustrated using Figure 1.1. The diagram demonstrates the fact that
web mining is performed on an unstructured source, i.e. web sites.
Figure 1.1: Web mining in relation to other forms of data mining and retrieval
1.2.1 Web Content Mining
Web content mining is the automatic search of information resources
available online [Coo97]. As a process, web content mining goes beyond keyword
extraction, since web documents present no machine-readable semantics. The two
groups of web content mining approaches concentrate on different aspects: the
agent-based approach directly mines document contents, while the database approach
improves the search strategy of the search engine with regard to the database it uses.
1.2.2 Web Structure Mining
While web content mining focuses on the internal structure of a web document,
web structure mining tries to discover the link structure of hyperlinks at the
inter-document level.
1.2.3 Web Usage Mining
Web usage mining is defined as the discovery of user access patterns from
web servers. Web servers record and accumulate user interaction data each time a
user makes a request for resources. Analyzing these web access logs can reveal
patterns regarding a user's browsing habits through the web server [2].
Figure 1.2: Taxonomy of Web Mining
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services. Web mining
methodologies can generally be classified into one of three distinct categories:
Web structure, Web content and Web usage mining.
The goal of Web structure mining is to categorize the Web pages and
generate information such as the similarity and relationship between them, taking
advantage of their hyperlink topology. In recent years, the area of Web
structure mining has focused on the identification of authorities, i.e. pages that are
considered important sources of information by many people in the Web
community.
Web content mining has to do with retrieving the information (content)
available on the Web into more structured forms, as well as indexing it so that
information locations can be tracked easily. Web content may be unstructured (plain text),
semi-structured (HTML documents), or structured (extracted from databases into
dynamic Web pages). Such dynamic data cannot be indexed and constitute what is
called "the hidden Web". A research area closely related to content mining is text
mining. Web content mining is nowadays strongly interrelated with Web structure
mining, since usually both are used in combination for extracting and organizing
information from the Web. Web content mining provides methods enabling the
automated discovery, retrieval, organization, and management of the vast amount
of information and resources available in the Web. Cooley et al. [CMS97]
categorize the main research efforts in the area of Content Mining into two
approaches: the Information Retrieval (IR) approach and the Database (DB) approach. The
IR approach involves the development of sophisticated AI systems that can act
autonomously or semi-autonomously on behalf of a particular user, to discover
and organize Web-based information.
Web usage mining is the process of identifying browsing patterns by
analyzing the user’s navigational behavior. This process takes as input the
usage data, i.e. the data residing in the Web server logs, recording the visits of the
users to a Web site. Extensive research in the area of Web usage mining led to the
appearance of a related research area, that of Web personalization. Web
personalization utilizes the results produced after performing Web usage mining,
in order to dynamically provide recommendations to each user.
Web mining is moving the World Wide Web toward a more useful
environment in which users can quickly and easily find the information they need.
It includes the discovery and analysis of data, documents, and multimedia from
the World Wide Web. Web mining uses document content, hyperlink structure,
and usage statistics to assist users in meeting their information needs.
The Web itself and search engines contain relationship information about
documents. Web mining is the discovery of these relationships and is
accomplished within three sometimes overlapping areas. The first is content mining.
Search engines define content by keywords; finding a page’s keywords and
finding the relationship between a Web page’s content and a user’s query content
is content mining. Hyperlinks provide information about other documents on the
Web thought to be important to another document. These links add depth to the
document, providing the multi-dimensionality that characterizes the Web. Mining
this link structure is the second area of Web mining. Finally, there is a
relationship to other documents on the Web that are identified by previous
searches. These relationships are recorded in logs of searches and accesses.
Mining these logs is the third area of Web mining.
Understanding the user is also an important part of Web mining. Analysis
of the user’s previous sessions, preferred display of information, and expressed
preferences may influence the Web pages returned in response to a query.
Web mining is interdisciplinary in nature, spanning across such fields as
information retrieval, natural language processing, information extraction,
machine learning, database, data mining, data warehousing, user interface design,
and visualization. Techniques for mining the Web have practical application in m-
commerce, e-commerce, e-government, e-learning, distance learning,
organizational learning, virtual organizations, knowledge management, and digital
libraries.
1.3 Web Mining and Information Retrieval
Web IR is the application of IR to the web. In classical IR, users specify
queries, in some query language, representing their information needs. The
system selects the set of documents in its collection that seem the most relevant to
the query and presents them to the user. Users may then refine their queries to
improve the answer. In the web environment user intents are not static and stable
as they usually are in traditional IR. In the web, the information need is associated
with a given task (Broder, 2002) that is not known in advance and may be quite
different from user to user, even if the query specification is the same. The
identification of this task and the mental process of deriving a query from an
information need are crucial aspects in web IR. Web IR is related to web mining
– the automatic discovery of interesting and valuable information from the web
(Chakrabarti, 2003). It is generally accepted that web mining is currently being
developed towards three main research directions, related to the type of data they
mine: web content mining, web structure mining and web usage mining (Kosala
et al., 2000). Recently another type of data – document change, page age and
information recency – is generating research interest: it is related to a temporal
dimension and allows for analyzing the growth and dynamics – over time – of the
Web (Baeza-Yates, 2003; Cho et al., 2000; Lim et al., 2001). This categorization
is merely conceptual; these areas are not mutually exclusive and some techniques
dedicated to one may use data that is typically associated with others.
Web content mining concerns the discovery of useful information from
web page content which is available in many different formats (Baeza-Yates,
2003) – textual, metadata, links, multimedia objects, hidden and dynamic pages
and semantic data.
Web structure mining tries to infer knowledge from the link structure on
the web (Chakrabarti et al., 1999a). Web documents typically point at related
documents through a link forming a social network. This network can be
represented by a directed graph where nodes represent documents and arcs
represent the links between them. The analysis of this graph is the main goal of
web structure mining (Donato et al., 2000; Kumar et al., 2000). In this field, two
algorithms, which rank web pages according to their relevance, have received
special attention: PageRank (Brin et al., 1998) and Hyperlink-Induced Topic
Search, or HITS (Kleinberg, 1998).
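
To make the link-analysis idea concrete, the following is a minimal Python sketch of the basic PageRank power iteration, run on a tiny invented link graph; the damping factor of 0.85 and the graph itself are illustrative assumptions, not taken from the cited works.

# Minimal PageRank sketch: power iteration over a toy link graph.
# The graph and the damping factor (0.85) are illustrative assumptions.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:            # dangling page: spread its rank evenly
                share = rank[page] / n
                for p in pages:
                    new_rank[p] += damping * share
            else:
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_graph).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))

Pages that receive many (or highly ranked) in-links end up with higher scores, which is the intuition behind treating hyperlinks as endorsements.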
Web usage mining tries to explore user behavior on the web by analyzing
data originated from user interaction and automatically recorded in web server
logs. The applications of web usage mining usually intend to learn user profiles or
navigation patterns. Web usage mining is essentially aimed at predicting the next
user request based on the analysis of previous requests. Markov models are very
common in modeling user requests or user paths within a site (Borges, 2000).
Association rules and other standard data mining and OLAP techniques are also
explored. (Cooley et al., 1997) presents an overview of the most relevant work in
web usage mining [3].
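
As a small illustration of the Markov-model approach mentioned above, the following Python sketch estimates first-order transition probabilities from a few invented sessions and predicts the most likely next page; the session data and page names are hypothetical.

# First-order Markov model of user navigation: estimate transition
# probabilities from (hypothetical) sessions and predict the next request.
from collections import defaultdict

def train_transitions(sessions):
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    # Normalize counts into conditional probabilities P(next | current).
    model = {}
    for page, nexts in counts.items():
        total = sum(nexts.values())
        model[page] = {p: c / total for p, c in nexts.items()}
    return model

def predict_next(model, current_page):
    candidates = model.get(current_page)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    sessions = [["home", "products", "cart"],
                ["home", "about", "products", "cart"],
                ["home", "products", "products", "checkout"]]
    model = train_transitions(sessions)
    print(predict_next(model, "products"))   # most likely page after 'products'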
IR is the automatic retrieval of all relevant documents while at the same
time retrieving as few of the non-relevant documents as possible (Rijsbergen, 1979). Some
have claimed that resource or document discovery (IR) on the Web is an instance
of Web content mining, while others associate web mining with intelligent IR.
Actually IR has the primary goals of indexing text and searching for useful
documents in a collection and nowadays research in IR includes modeling,
document classification and categorization, user interfaces, data visualization,
filtering, etc. (Baeza-Yates & Berthier, 1999). The task that can be considered to
be an instance of Web mining is Web document classification or categorization,
which could be used for indexing. Viewed in this respect, Web mining is part of
the (Web) IR process (Kosala & Blockeel, 2000) [4].
1.4 Web Mining and Information Extraction
IE has the goal of transforming a collection of documents, usually with the
help of an IR system, into information that is more readily digested and analyzed
(Cowie & Lehnert, 1996). IE aims to extract relevant facts from the documents
while IR aims to select relevant documents (Pazienza, 1997). While IE is
interested in the structure or representation of a document, IR views the text in a
document just as a bag of unordered words (Wilks, 1997). Thus, in general, IE
works at a finer level of granularity on the documents than IR does. Building IE
systems manually is neither feasible nor scalable for a medium as dynamic and
diverse as the web (Muslea, Minton & Knoblock, 1998). Due to this
nature of the Web, most IE systems focus on extracting from specific web sites. Others
use machine learning or data mining techniques to learn the extraction patterns or
rules for Web documents semi-automatically or automatically (Kushmerick,
1999). Within this view, Web mining is used to improve Web IE (Web mining is
part of IE) (Kosala & Blockeel, 2000). An example of IE without Web mining is
the work done by El-Beltagy, Rafea and Abdelhamid, who built a model for
automatically augmenting segmented documents with metadata using dynamically
acquired background domain knowledge, in order to assist users in easily locating
information within these documents through a structured front end [5]. Web mining
can be divided into four subtasks:
1.4.1 Information Retrieval/Resource Discovery (IR)
Find all relevant documents on the web. The goal of IR is to automatically
find all relevant documents, while at the same time filter out the non-relevant
ones. Search engines are a major tool people use to find web information. Search
engines use keywords as the index to perform queries. Users have more control in
searching web content. Automated programs such as crawlers and robots are used
to search the web. Such programs traverse the web to recursively retrieve all
relevant documents. A search engine consists of three components: a crawler
which visits web sites, an index which is updated when the crawler finds a site, and
a ranking algorithm which orders the relevant web sites. However, current
search engines have a major problem, low precision, which is often manifested by
the irrelevance of search results.
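
The three-component view described above can be sketched in a few lines of Python. In the toy example below the "web" is an in-memory dictionary of hypothetical pages (so the example stays self-contained rather than fetching over HTTP), the crawler is a breadth-first traversal of the link graph, and the index is a simple keyword-to-URL map.

# Toy crawler sketch: breadth-first traversal of a link graph,
# building a simple keyword index as pages are visited.
# The in-memory "web" below stands in for real HTTP fetching.
from collections import deque

WEB = {
    "http://example.org/":  ("welcome page about web mining", ["http://example.org/a"]),
    "http://example.org/a": ("web content mining overview",   ["http://example.org/b"]),
    "http://example.org/b": ("usage mining and server logs",  ["http://example.org/"]),
}

def crawl(seed):
    index = {}                      # keyword -> set of URLs (the "index" component)
    visited = set()
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in WEB:
            continue
        visited.add(url)
        text, links = WEB[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        frontier.extend(links)      # follow hyperlinks recursively
    return index

if __name__ == "__main__":
    print(sorted(crawl("http://example.org/").get("mining", set())))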
1.4.2 Information Extraction (IE): automatically extract specific
fragments of a document from web resources retrieved from the IR step. Building
a uniform IE system is difficult because the web content is dynamic and diverse.
Most IE systems use the "wrapper" [33] technique to extract specific
information for a particular site. Machine learning techniques are also used to
learn the extraction rules.
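
A "wrapper" in this sense is simply a site-specific extraction pattern. The Python sketch below uses a hand-written regular expression as a wrapper for one assumed HTML layout; the markup and field names are invented for illustration.

# Hand-written "wrapper" sketch: a site-specific regular expression that
# extracts product name and price fields from one assumed HTML layout.
import re

SAMPLE_HTML = """
<div class="item"><span class="name">USB cable</span><span class="price">3.99</span></div>
<div class="item"><span class="name">Keyboard</span><span class="price">24.50</span></div>
"""

WRAPPER = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>'
    r'<span class="price">(?P<price>[^<]+)</span>'
)

def extract(html):
    return [(m.group("name"), float(m.group("price"))) for m in WRAPPER.finditer(html)]

if __name__ == "__main__":
    print(extract(SAMPLE_HTML))   # [('USB cable', 3.99), ('Keyboard', 24.5)]

Because such patterns are tied to one page layout, they break as soon as the site changes, which is why the learning approaches mentioned above are attractive.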
1.4.3 Generalization: discover information patterns at retrieved web
sites. The purpose of this task is to study users' behavior and interest. Data mining
techniques such as clustering and association rules are utilized here. Several
problems exist during this task. Because web data are heterogeneous, imprecise
and vague, it is difficult to apply conventional clustering and association rule
techniques directly on the raw web data.
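
As a small, simplified example of this data mining step, the Python sketch below derives pairwise association rules of the form "page A => page B", with their support and confidence, from a handful of invented sessions; the thresholds and data are assumptions made purely for illustration.

# Minimal association-rule sketch over web sessions: for pairs of pages,
# compute support and confidence of the rule A => B.
from itertools import permutations

def pair_rules(sessions, min_support=0.3, min_confidence=0.6):
    n = len(sessions)
    page_count = {}
    pair_count = {}
    for session in sessions:
        pages = set(session)
        for p in pages:
            page_count[p] = page_count.get(p, 0) + 1
        for a, b in permutations(pages, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                    # fraction of sessions containing both pages
        confidence = count / page_count[a]     # P(B in session | A in session)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, round(support, 2), round(confidence, 2)))
    return rules

if __name__ == "__main__":
    sessions = [["home", "products", "cart"],
                ["home", "products"],
                ["home", "about"],
                ["products", "cart"]]
    for a, b, s, c in pair_rules(sessions):
        print(f"{a} => {b}  support={s} confidence={c}")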
1.4.4 Analysis/Validation: analyze, interpret and validate the potential
information from the information patterns. The objective of this task is to discover
knowledge from the information provided by former tasks. Based on web data,
one can build models to simulate and validate web information [6].
1.5 Information Retrieval and Web
The meaning of the term information retrieval can be very broad. Just
getting a credit card out of your wallet so that you can type in the card number is a
form of information retrieval. However, as an academic field of study,
information retrieval might be defined thus: information retrieval (IR) is finding
material (usually documents) of an unstructured nature (usually text) that satisfies
an information need from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only a few people
engaged in: reference librarians, paralegals, and similar professional searchers.
Now the world has changed, and hundreds of millions of people engage in
information retrieval every day when they use a web search engine or search their
email. Information retrieval is fast becoming the dominant form of information
access, overtaking traditional database-style searching.
IR can also cover other kinds of data and information problems beyond
that specified in the core definition above. The term "unstructured data" refers to
data which does not have clear, semantically overt, easy-for-a-computer structure.
It is the opposite of structured data, the canonical example of which is a relational
database, of the sort companies usually use to maintain product inventories and
personnel records. In reality, almost no data are truly "unstructured". This is
definitely true of all text data if you count the latent linguistic structure of human
languages. But even accepting that the intended notion of structure is overt
structure, most text has structure, such as headings and paragraphs and footnotes,
which is commonly represented in documents by explicit markup (such as the
coding underlying web pages). IR is also used to facilitate "semi-structured"
search, such as finding a document where the title contains Java and the body
contains threading. The field of information retrieval also covers supporting users
in browsing or filtering document collections or further processing a set of
retrieved documents. Given a set of documents, clustering is the task of coming
up with a good grouping of the documents based on their contents. It is similar to
arranging books on a bookshelf according to their topic. Given a set of topics,
standing information needs, or other categories (such as suitability of texts for
different age groups), classification is the task of deciding which class(es), if any,
each of a set of documents belongs to. It is often approached by first manually
classifying some documents and then hoping to be able to classify new documents
automatically.
Information retrieval systems can also be distinguished by the scale at
which they operate, and it is useful to distinguish three prominent scales. In web
search, the system has to provide search over billions of documents stored on
millions of computers. Distinctive issues are needing to gather documents for
indexing, being able to build systems that work efficiently at this enormous scale,
and handling particular aspects of the web, such as the exploitation of hypertext
and not being fooled by site providers manipulating page content in an attempt to
boost their search engine rankings, given the commercial importance of the web
[7].
1.6 The Web
The web is a public service constituted by a set of applications aimed at
extracting documents from computers accessible on the Internet – the Internet being a
network of computer networks. One can also describe the web as an information
repository distributed over millions of computers interconnected through the Internet
(Baldi et al., 2003). The W3C defines the web in a broad way: "the World Wide Web
is the universe of network-accessible information, an embodiment of human
knowledge". Due to its comprehensiveness, with contents related to most subjects
of human activity, and global public acceptance, either at a personal or
institutional level, the web is widely explored as an information source. The web's
dimension and dynamic nature become serious drawbacks when it comes to
retrieving information. Another relevant characteristic of the web is the absence
of any global editorial control over its content and format. This contributes largely
to web success but also contributes to a high degree of heterogeneity in content,
language, structure, correctness and validity. Although the problems raised by the
size of the web, around 11.5 × 10⁹ pages (Gulli et al., 2005), and its dynamics
require special treatment, it seems that the major difficulties concerning the
processing of web documents are generated by the lack of editorial rules and the
lack of a common ontology, which would allow for unambiguous document
specification and interpretation. In the absence of such normative rules, each
document has to be treated as unique. In this scenario, document processing
cannot be based on any underlying structure. Although HTML already involves
some structure, its use is not mandatory. Therefore, the highest level of abstraction
that may assure compatibility with a generic web document is the common
bag-of-words representation (Chakrabarti, 2003). This low abstraction level is not very helpful for
automatic processing, requiring significant computational costs. The web is a vast
and popular repository, containing information related to almost all human
activities and being used to perform an ever growing set of distinct activities
(bank transactions, shopping, chatting, government transactions, weather report
and getting geographic directions, just to name a few). Despite the difficulties this
medium poses to automatic as well as to non-automatic processing, it has been
increasingly explored and has been motivating efforts, from both academic and
industry, which aim to facilitate this exploration. Currently the web is a repository
of documents, the majority of them HTML documents, that can be automatically
presented to users but that do not have a base model that might be used by
computers to acquire semantic information on the objects being manipulated. The
semantic web is a formal attempt from the W3C to transform the web into a huge
database that might be easier to process automatically than our current syntactic
web. However, despite many initiatives on the semantic web (Lu et al., 2002), the
web has its own dynamics and web citizens are pushing the web towards the social plane.
Collaborative systems, radical trust and participation are the main characteristics
of Web 2.0, a new paradigm emerging since 2004 (O’Reilly, 2004).
1.6.1 A Retrospective View of Web Information Retrieval
In the early 1950s, technical librarianship faced a crisis. The scientific
boom sparked by the Second World War had released a flood of publications,
approaching a million new articles each year. Scientists could no longer stay
abreast of current research by general reading alone. Papers relevant to a new
project, but not previously known to the researcher, had to be retrieved at the
project’s outset and the librarian had to facilitate this retrieval. A variety of
cataloguing schemes had been suggested as tools for retrieval, but none had been
rigorously tested for effectiveness, and all were labour-intensive to implement.
In responding to technical information’s rapid growth, librarians and
information scientists developed the field of information retrieval. The defining
discovery of the field was that complex schemes for organizing and cataloguing
information into hierarchical taxonomies did little better than simply indexing the
plain words occurring in the text: the crucial part of information retrieval lay in
the process of retrieval. The finding that taxonomy was redundant was little short
of scandalous—after all, Western information science had since Aristotle been
founded on subdividing knowledge by genus and species. But the effect was
liberating. Word occurrences are readily indexed by computer, and retrieval
technology could be constructed on top of such indexes without having to solve
deep problems in human language analysis and semantics. Significantly, the
sufficiency of word occurrence indexing was not argued theoretically (which,
after centuries of such theoretical dispute, would hardly have had an impact), but
demonstrated empirically, through careful evaluation.
In the mid 1990s, users of the newly-emerged web faced a crisis. The
number of web sites was growing rapidly, and finding information by following a
trail of links from a few popular central sites was no longer an adequate access
method. Manually curated directories such as that of Yahoo! were popular, but
manual curation was expensive and scaled poorly. Experienced users could not
keep up with the growth in the number of sites, even in areas of personal interest
to them; and, for novice users, the task of finding useful information on the web
was daunting.
Faced with the mushrooming growth of the web in the second half of the
1990s, a new kind of service provider turned to the decades-old technology of
information retrieval, producing the web search engine. Web search transformed
information retrieval from the rarefied activity of librarians, researchers, journalist
fact-checkers, and intelligence analysts, to the daily activity of almost the entire
computer-enabled population. In doing so, search providers finally bridged a
long-established gap between theory and practice. As early as the 1960s,
researchers had developed statistical techniques for effectively retrieving and
ranking documents against plain keyword queries. The retrieval technology
deployed in practice, though, used logical, Boolean query languages that relied
upon the patience and expertise of the querier to formulate complex query
expressions, precisely specifying their information need. But web users had little
expertise, and less patience, for constructing complex queries. Search engines
therefore turned to simple queries and sophisticated retrieval, finally deploying,
on a massive scale, the techniques developed three decades earlier, so creating the
modern search engine. To the surprise once more of some search technologists,
simple keyword search simply worked. In an increasingly competitive search
market, though, how could a provider verify the effectiveness of their search
results, and compare their offering with that of their competitors?
Search technology connects simple queries with unannotated documents,
relieving both the producer and the consumer of information from the complexity
of matching information resources to information needs. The result is tools that
allow neophyte users to find relevant information, across billions of web
documents, in a fraction of a second. But in doing away with complex, formal
information representations in favour of rough approximations, statistical
information retrieval introduced an important problem. It is not possible to
objectively and deterministically state that an information object matches an
information request, even in the terms in which the request is formulated. One can
say that a document has been manually assigned a certain classification under a
hierarchical taxonomy; one can even say that a document contains a Boolean
combination of terms; but one cannot conclusively say that an uncategorized
document meets a user’s information need as expressed by a handful of keywords.
The contemporary retrieval system sits at the interface between computational
formalism on the one hand, and the ambiguity of human cognition on the other.
There is uncertainty in what the retrieval system should do, and therefore in how
correct a set of results is.
The ambiguity of the retrieval task makes the question of retrieval
effectiveness a crucial and contested one. Methods for evaluating effectiveness
are therefore essential, in both research and deployment. Retrieval evaluation
relies fundamentally on human assessment of result quality. The
noncomputability of effectiveness makes information retrieval a deeply empirical
discipline, closer to natural or even social science than to formal computational
theory. The complex, interlocked relations that connect imprecise queries,
uncurated documents, and inchoate information needs, are not given, but must be
hypothesized and tested on observed search behavior.
The importance of empirical evaluation in information retrieval has been
recognized since the field began; the initial work that established the primacy of
retrieval over indexing gained much of its impact from the meticulous and
painstaking experimental work on which it was based. But the same scale of data
that makes retrieval technology necessary, also makes manual assessment costly.
While result quality can be measured by directly assessing user satisfaction with,
or utility gained from, retrieval results, such direct measurement of the user’s
satisfaction with the results lists as a whole is neither reusable nor reliably
repeatable. Assessing the results of any single system is time-consuming, and
there are many competing retrieval algorithms, each tuned by numerous
parameters. A parameter change that takes a few minutes to decide upon, and a
few seconds to run, could take days to manually assess. Moreover, if each
research group produces its own, independent assessments of retrieval quality,
then not only is much effort duplicated, but also reproducibility is impaired, and
the potential for bias is introduced. And tuning nowadays is often performed
automatically through machine learning; fitting a manual review stage into each
learning iteration would be unworkable.
The need for scale and automatability, plus the desire for repeatability and
objectivity, has led the information retrieval community to develop hybrid
evaluation technologies, part manual, part automated. The most important of the
evaluation tools is the test collection: a corpus of documents, with a set of queries
(known as topics) to run against the corpus, and judgments of which documents
are (independently) relevant to each query. These relevance judgments must be
manually formed: but once made, the test collection can in principle be reused
indefinitely for fully automated evaluation. The result is an automated and re-
usable evaluation method, based on a simplified model of retrieval.
Test collection evaluation has been the bedrock of retrieval research for
half a century. Collection-based experimentation has grown even more in
importance since the arrival, beginning in the early 1990s, of large scale,
collaboratively developed, and readily obtainable test collections. And (to judge
from publicly available information) the test collection method is also core to the
quality assurance and improvement methods of commercial web search engines.
The practice of retrieval evaluation, though, has run well ahead of the
theory. It was only at the end of the 1990s that the reliability, efficiency, and
interpretability of evaluation results began to be formally investigated. The delay
was in part because it was only after large-scale collaborative experiments had
been running for several years that the datasets needed for a critical investigation
of evaluation became available. Initial enquiries, while foundational, tended to be
either ad-hoc, or else applied statistical methodology developed in other areas to
retrieval evaluation without considering the field’s distinctive features. These
omissions are currently being remedied by the research community.
It is in the context of the effort for greater reliability, accuracy, robustness,
and efficiency in collection-based retrieval evaluation that this thesis is presented.
Building on the foundational work in the area, and employing the large evaluation
datasets now available, major advances in the accuracy and comparability of
evaluation scores can be made in the design of efficient and reliable experiments,
in the extensibility of test collections in dynamic evaluation environments, and in
the measurement of retrieval similarity without relevance assessment. Technical
contributions with awareness of the wider context of evaluation, and of the
necessity of mixing experimental rigour with research innovation can also be
offered.
The need to store and retrieve written information became increasingly
important over centuries, especially with inventions like paper and the printing
press. Soon after computers were invented, people realized that they could be
used for storing and mechanically retrieving large amounts of information. In
1945, Vannevar Bush published a groundbreaking article titled "As We May
Think" that gave birth to the idea of automatic access to large amounts of stored
knowledge [8]. In the 1950s, this idea materialized into more concrete descriptions
of how archives of text could be searched automatically. Several works emerged
in the mid 1950s that elaborated upon the basic idea of searching text with a
computer. One of the most influential methods was described by H.P. Luhn in
1957, in which (put simply) he proposed using words as indexing units for
documents and measuring word overlap as a criterion for retrieval [9].
Several key developments in the field happened in the 1960s. Most
notable were the development of the SMART system by Gerard Salton and his
students, first at Harvard University and later at Cornell University [10]; and the
Cranfield evaluations done by Cyril Cleverdon and his group at the College of
Aeronautics in Cranfield [11]. The Cranfield tests developed an evaluation
methodology for retrieval systems that is still in use by IR systems today. The
SMART system, on the other hand, allowed researchers to experiment with ideas
to improve search quality. A system for experimentation coupled with good
evaluation methodology allowed rapid progress in the field, and paved the way for
many critical developments.
The 1970s and 1980s saw many developments built on the advances of the
1960s. Various models for doing document retrieval were developed and
advances were made along all dimensions of the retrieval process. These new
models/techniques were experimentally proven to be effective on small text
collections (several thousand articles) available to researchers at the time.
However, due to lack of availability of large text collections, the question whether
these models and techniques would scale to larger corpora remained unanswered.
This changed in 1992 with the inception of the Text Retrieval Conference, or
TREC [12]. TREC is a series of evaluation conferences sponsored by various US
Government agencies under the auspices of NIST, which aims at encouraging
research in IR from large text collections. With large text collections available
under TREC, many old techniques were modified, and many new techniques were
developed (and are still being developed) to do effective retrieval over large
collections [13].
The evolution of IR systems may be organized into four distinct periods,
with significant differences among the methods that were applied and the sources
used during each one. During an initial period, up to the 50s, the indexing and
searching processes were handled manually. Indexes were based on taxonomies
or alphabetical lists of previously specified concepts. During this phase, IR
systems were mainly used by librarians and scientists.
During a second period, between around 1950 and the advent of web in
the early 90s, the pressure on the field and the evolution of computer and
database technology allowed for significant improvements. The process went from
manual to automated annotation of documents; however, indexes were still built
from restricted descriptions of documents (mainly abstracts and document titles).
IR was viewed as finding the right information in text databases. Operating IR
systems frequently required specific learning. IR systems utilization was
expensive and available only to restricted groups. During a third period, covering
the 90s, the process of indexing and searching becomes fully automated. Full text
indexes are built; web mining evolves and explores not only content but also
structure and usage. IR systems become unrestricted, cheap, widely available and
widely used. From around 2000 on, in the fourth and current period, other sources of
evidence have been explored in an attempt to improve systems’ performance.
Searching and browsing are the two basic IR paradigms on the web
(Baeza-Yates et al., 1999). Three approaches to IR seem to have emerged (Broder
et al., 2005):
The search-centric approach argues that free search has become so good
and the search user-interface so common, that users can satisfy all their needs
through simple queries. Search engines follow this approach;
The taxonomy navigation approach claims that users have difficulties
expressing their information needs; organizing information on a hierarchical
structure might help finding relevant information. Directory search systems
follow this approach;
The meta-data centric approach advocates the use of meta-data for
narrowing large sets of results (multi-faceted search); third-generation search
engines are trying to improve the quality of their answers by merging several
sources of evidence.
IR systems also have to solve problems related to their sources and how to
build their databases/indexes. Several crawling algorithms have been explored, in
order to overcome problems of scale arising from web dimension, such as focused
crawling (Chakrabarti et al., 1999b), intelligent crawling (Aggarwal et al., 2001)
and collaborative crawling (Aggarwal et al., 2004) that explores user behavior
registered in server logs. Other approaches have also been proposed: meta-search
explores the small overlap among search engines’ indexes sending the same query
to a set of search engines and merging their answers – a few specific problems
arise from this approximation (Wang et al., 2003); dynamic search engines try to
deal with web dynamics: such search engines do not have any permanent index
but instead crawl for their answers at query time (Hersovici et al., 1998);
interactive search (Bruza et al., 2000) wraps a general-purpose search engine into
an interface that allows users to navigate towards their goal through a query-by-
navigation process. At present, IR research seems to be focused on retrieval of
high quality, integration of several sources of evidence and multimedia
retrieval[3].
TREC has also branched IR into related but important fields like retrieval
of spoken information, non-English language retrieval, information filtering,
user interactions with a retrieval system, and so on.
1.7 Basic Processes of Information Retrieval
There are three basic processes an information retrieval system has to
support: the representation of the content of the documents, the representation of
the user’s information need, and the comparison of the two representations. The
processes are visualized in Figure 1.3 (Croft 1993). In the figure, squared boxes
represent data and rounded boxes represent processes.
Figure 1.3: Information Retrieval Process (Croft 1993)
Representing the documents is usually called the indexing process. The
process takes place off-line, that is, the end user of the information retrieval
system is not directly involved. The indexing process results in a formal
representation of the document: the index representation or document
representation. Often, full text retrieval systems use a rather trivial algorithm to
derive the index representations, for instance an algorithm that identifies words in
an English text and puts them in lower case. The indexing process may include
the actual storage of the document in the system, but often documents are only
stored partly, for instance only title and abstract, plus information about the actual
location of the document.
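
A minimal Python sketch of such an indexing step, assuming the trivial "identify words and lower-case them" algorithm described above, might look as follows (the sample documents are invented):

# Indexing process sketch: identify words in each document, put them in
# lower case, and build an inverted index from terms to document ids.
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """documents: dict mapping document id -> full text."""
    index = {}
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

if __name__ == "__main__":
    docs = {1: "Family entertainment for everyone.",
            2: "An entertainment guide.",
            3: "Family law and legal advice."}
    index = build_index(docs)
    print(index["entertainment"])   # {1, 2}
    print(index["family"])          # {1, 3}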
The process of representing the information problem or need is often
referred to as the query formulation process. The resulting formal representation
is the query. In a broad sense, query formulation might denote the complete
interactive dialogue between system and user, leading not only to a suitable query but
possibly also to a better understanding by the user of his/her information need. In
this thesis however, query formulation generally denotes the automatic
formulation of the query when there are no previously retrieved documents to
guide the search, that is, the formulation of the initial query. The automatic
formulation of successive queries is called relevance feedback in this thesis. The
user and the system communicate the information need by respectively queries
and retrieved sets of documents. This is not the most natural form of
communication. Humans would use natural language to communicate the
information need amongst each other. Such a natural language statement of the
information need is called a request. Automatic query formulation inputs the
request and outputs an initial query. In practice, this means that some or all of the
words in the request are converted to query terms, for instance by the rather trivial
algorithm that puts words in lower case. Relevance feedback inputs a query or a
request and some previously retrieved relevant and non-relevant documents to
output a successive query. The comparison of the query against the document
representations is also called the matching process. The matching process results
in a ranked list of relevant documents. Users will walk down this document list in
search of the information they need. Ranked retrieval will hopefully put the
relevant documents somewhere in the top of the ranked list, minimizing the time
the user has to invest on reading the documents. Simple but effective ranking
algorithms use the frequency distribution of terms over documents. For instance,
the words "family" and "entertainment" mentioned in the first section occur
relatively infrequently in the whole book, which indicates that this book should not
receive a top ranking for the request "family entertainment". Ranking algorithms
based on statistical approaches easily halve the time the user has to spend on
reading documents.
1.7.1 Basic Models of Information Retrieval: A Brief Overview
A mathematical model of information retrieval guides the implementation
of information retrieval systems. In the traditional information retrieval systems,
which are usually operated by professional searchers, only the matching process is
automated; indexing and query formulation are manual processes. For these
systems, mathematical models of information retrieval therefore only have to
model the matching process. In practice, traditional information retrieval systems
use the Boolean model of information retrieval.
1.7.1.1 The Boolean model
The Boolean model is an exact matching model, that is, it either retrieves documents or not,
without ranking them. The model supports the use of structured queries, which do
not only contain query terms, but also relations between the terms defined by the
query operators AND, OR and NOT.
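
Using an inverted index of the kind sketched earlier in this chapter, Boolean matching reduces to set operations on posting sets. The Python sketch below handles only simple two-term queries, as an assumed simplification; the index contents are invented.

# Boolean model sketch: exact matching via set operations on an inverted
# index. Only simple two-term queries are handled, for illustration.

def boolean_query(index, term_a, operator, term_b):
    a = index.get(term_a, set())
    b = index.get(term_b, set())
    if operator == "AND":
        return a & b
    if operator == "OR":
        return a | b
    if operator == "NOT":            # interpreted here as: term_a AND NOT term_b
        return a - b
    raise ValueError("unknown operator: " + operator)

if __name__ == "__main__":
    index = {"family": {1, 3}, "entertainment": {1, 2}}
    print(boolean_query(index, "family", "AND", "entertainment"))  # {1}
    print(boolean_query(index, "family", "NOT", "entertainment"))  # {3}

Note that the result is an unranked set: every matching document is returned and no ordering among them is implied, which is exactly the exact-matching behaviour described above.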
In modern information retrieval systems, which are usually operated by
nonprofessional users, query formulation is automated as well. However,
candidate mathematical models for these systems still only model the matching
process. There are many candidate models for the matching process of ranked
retrieval systems. These models are so-called approximate matching models, that
is, they use the frequency distribution of terms over documents to compute the
ranking of the retrieved sets. Each of these models has its own advantages and
disadvantages. However, there are two classical candidate models for
approximate matching: the vector space model and the probabilistic model. They
are classical models, not only because they were introduced already in the early
70’s, but also because they represent classical problems in information retrieval.
1.7.1.2 The vector space model
The vector space model represents the problem of ranking the documents given the initial query.
The Vector model, probably the most commonly used, assigns real non-negative
weights to index terms in documents and queries. In this model, documents are
represented by vectors in a multi-dimensional Euclidean space. Each dimension in
this space corresponds to a relevant term/word contained in the document
collection. The degree of similarity of documents with regard to queries is
evaluated as the correlation between the vectors representing the document and
the query which can be, and usually is, quantified by the cosine of the angle
between the two vectors.
In the vector model, index term weights are usually obtained as a function
of two factors: the term frequency factor, TF, a measure of intra-cluster
similarity, computed as the number of times the term occurs in the document and
normalized so as to make it independent of document length; and the inverse
document frequency factor, IDF, a measure of inter-cluster dissimilarity, which weights each
term according to its discriminative power in the entire collection. This model’s
main advantages are related to improvements in retrieval performance due to term
weighting and to partial matching, which allows retrieval of documents that approximate
the query conditions. The index term independency assumption is probably its
main disadvantage.
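
In standard notation (a generic textbook formulation rather than one taken from the works cited here), the weight of term t in document d and the cosine similarity used to rank document d against query q can be written as:

\[
w_{t,d} = tf_{t,d} \times idf_t , \qquad idf_t = \log \frac{N}{n_t}
\]

\[
sim(d, q) = \frac{\sum_{t} w_{t,d}\, w_{t,q}}{\sqrt{\sum_{t} w_{t,d}^{2}}\;\sqrt{\sum_{t} w_{t,q}^{2}}}
\]

where tf_{t,d} is the (length-normalized) number of occurrences of t in d, N is the number of documents in the collection and n_t is the number of documents containing t.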
1.7.1.3 The probabilistic model
The probabilistic model represents the problem of ranking the documents after some feedback is
gathered. Probabilistic models compute the similarity between documents and
queries as the odds of a document being relevant to a query. Index term weights
are binary. This model ranks documents in decreasing order of their probability of
being relevant, which is an advantage. Its main disadvantages are: the need to
guess the initial separation of documents into relevant and non-relevant; weights
are binary; index terms are assumed to be independent.
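
In the classical binary independence formulation (again a standard textbook presentation, not specific to the sources cited in this chapter), documents are ranked by a retrieval status value of the form:

\[
RSV(d, q) = \sum_{t \in q \cap d} \log \frac{p_t \,(1 - u_t)}{u_t \,(1 - p_t)}
\]

where p_t is the probability that term t occurs in a relevant document and u_t the probability that it occurs in a non-relevant one; the need to estimate these probabilities is precisely the "initial separation" problem noted above.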
From a practical point of view, the Boolean model, the vector space model
and the probabilistic model represent three classical problems of information
retrieval, respectively structured queries, initial term weighting, and relevance
feedback. The Boolean model provides the query operators AND, OR and NOT to
formulate structured queries. The vector space model was used by Salton and his
colleagues for hundreds of term weighting experiments in order to find algorithms
that predict which documents the user will find relevant given the initial query
(Salton and Buckley 1988). The probabilistic model provides a theory of
optimum ranking if examples of relevant documents are available [14].
1.7.2 Evaluation of Information Retrieval Systems
Evaluation studies investigate the degree to which the stated goals or
expectations have been achieved or the degree to which these can be achieved.
The three major purposes given for evaluating an information retrieval system
were the need for measures with which to make merit comparisons within a single
test situation, the need for measures with which to make comparisons between
results obtained in different test situations, and the need for assessing the merit of a
real-life system. A number of studies have been conducted to measure the
performance of the information retrieval system. Some criteria have been
proposed by several researchers for the evaluation of information retrieval
systems [CC66, LFW68, and SG83]. These criteria include: coverage of the
system, form of presentation of the search output, user effort, the response time of
the system, and recall and precision. Retrieval effectiveness is defined in terms of
retrieving relevant documents and not retrieving non-relevant documents. Two
traditional factors of measuring effectiveness are Recall and Precision.
1.7.2.1 Evaluation Criteria
Recall indicates the ability of a system to present all relevant items or
documents. In reality it may not be possible to retrieve all the relevant items from
a collection, especially when the collection is large. A system may be able to
retrieve a proportion of the total relevant documents. Thus, the performance of a
system is often measured by the recall ratio, which denotes the percentage of
relevant items retrieved in a given situation.
Precision implies the ability of a system to present only relevant items or
documents and therefore not to retrieve non-relevant documents. This factor, that
is, how far the system is able to withhold unwanted items in a given situation, is
measured in terms of the precision ratio. These two measures are denoted by the
following formulas:
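
In their standard form, recall and precision are given by:

\[
Recall = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents in the collection}}
\]

\[
Precision = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}
\]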