Chapter 1
Web Mining and Information Retrieval
1.1 Introduction
The World Wide Web or simply the web may be seen as a huge collection
of documents freely produced and published by a very large number of people,
without any solid editorial control. This is probably the most democratic – and
anarchic – widespread means for anyone to express feelings, comments,
convictions and ideas, independently of ethnicity, sex, religion or any other
characteristic of human societies. The web constitutes a comprehensive, dynamic,
permanently up-to-date repository of information regarding most of the areas of
human knowledge (Hu, 2002) and supporting an increasingly important part of
commercial, artistic, scientific and personal transactions, which gives rise to a
very strong interest from individuals, as well as from institutions, at a universal
scale. However, the web also exhibits some characteristics that are adverse to the
process of collecting information from it in order to satisfy specific needs: the
large volume of data it contains; its dynamic nature; being mainly constituted by
unstructured or semi-structured data; content and format heterogeneity and
irregular data quality are some of these adverse characteristics. End-users also
introduce some additional difficulties in the retrieval process: information needs
are often imprecisely defined, generating a semantic gap between user needs and
their specification. The satisfaction of a specific information need on the web is
supported by search engines and other tools aimed at helping users gather
information from the web. The user is usually not assisted in the subsequent tasks
of organizing, analyzing and exploring the answers produced. These answers are
usually flat lists of large sets of web pages which demand significant user effort to
be explored. Satisfying information needs on the web is usually seen as an
ephemeral one-step process of information search (the traditional search engine
paradigm). Given these characteristics, it is highly demanding to satisfy private or
institutional information needs on the web. The web itself, and the interests it
promotes, are growing and changing rapidly, at a global scale, both as a means of
divulgation and dissemination and also as a source of generic and specialized
information. Web users have already realized the potential of this huge
information source and use it for many purposes, mainly in order to satisfy
specific information needs. Simultaneously the web provides a ubiquitous
environment for executing many activities, regardless of place and time.
1.2 Web Mining
Web mining is a very active research topic that combines two major research
areas: Data Mining and the World Wide Web. Web mining
research relates to several research communities such as Database, Information
Retrieval and Artificial Intelligence [1].
Web mining is defined by [Coo97] as the discovery and analysis of useful
information from the WWW. Web mining is used to extract interesting and
potentially useful patterns and implicit information from artefacts or activity
related to the WWW. Web mining in relation to other forms of data mining and
retrieval is illustrated using Figure 1.1. The diagram demonstrates the fact that
web mining is performed on an unstructured source, i.e. web sites.
Figure 1.1: Web mining in relation to other forms of data mining and retrieval
1.2.1 Web Content Mining
Web content mining is the automatic search of information resources
available online [Coo97]. As a process, web content mining goes beyond keyword
extraction, since web documents present no machine-readable semantics. The two
groups of web content mining approaches concentrate on different aspects: the
agent-based approach directly mines document contents, while the database approach
improves the search strategy of the search engine with regard to the database it uses.
1.2.2 Web Structure Mining
While web content mining focuses on the internal structure of a web document,
web structure mining tries to discover the link structure of hyperlinks at the
inter-document level.
1.2.3 Web Usage Mining
Web usage mining is defined as the discovery of user access patterns from
web servers. Web servers record and accumulate user interaction data each time a
user makes a request for resources. Analyzing these web access logs can reveal
patterns regarding a user's browsing habits through the web server [2].
Figure 1.2: Taxonomy of Web Mining
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services. Web mining
methodologies can generally be classified into one of three distinct categories:
Web structure, Web content and Web usage mining.
The goal of Web structure mining is to categorize the Web pages and
generate information such as the similarity and relationship between them, taking
advantage of their hyperlink topology. In recent years, the area of Web
structure mining has focused on the identification of authorities, i.e. pages that are
considered important sources of information by many people in the Web
community.
Web content mining has to do with retrieving the information (content)
available on the Web into more structured forms, as well as indexing it so that
information locations can be tracked easily. Web content may be unstructured (plain text),
semi-structured (HTML documents), or structured (extracted from databases into
dynamic Web pages). Such dynamic data cannot be indexed and constitute what is
called "the hidden Web". A research area closely related to content mining is text
mining. Web content mining is nowadays strongly interrelated with Web structure
mining, since usually both are used in combination for extracting and organizing
information from the Web. Web content mining provides methods enabling the
automated discovery, retrieval, organization, and management of the vast amount
of information and resources available in the Web. Cooley et al. [CMS97]
categorize the main research efforts in the area of Content Mining into two
approaches: the Information Retrieval (IR) approach and the Database (DB) approach. The
IR approach involves the development of sophisticated AI systems that can act
autonomously or semi-autonomously on behalf of a particular user, to discover
and organize Web-based information.
Web usage mining is the process of identifying browsing patterns by
analyzing the user’s navigational behavior. This process takes as input the
usage data, i.e. the data residing in the Web server logs, recording the visits of the
users to a Web site. Extensive research in the area of Web usage mining led to the
appearance of a related research area, that of Web personalization. Web
personalization utilizes the results produced after performing Web usage mining,
in order to dynamically provide recommendations to each user.
Web mining is moving the World Wide Web toward a more useful
environment in which users can quickly and easily find the information they need.
It includes the discovery and analysis of data, documents, and multimedia from
the World Wide Web. Web mining uses document content, hyperlink structure,
and usage statistics to assist users in meeting their information needs.
The Web itself and search engines contain relationship information about
documents. Web mining is the discovery of these relationships and is
accomplished within three sometimes overlapping areas. The first is content mining.
Search engines define content by keywords; finding a page’s keywords and
finding the relationship between a Web page’s content and a user’s query content
is content mining. Hyperlinks provide information about other documents on the
Web thought to be important to another document. These links add depth to the
document, providing the multi-dimensionality that characterizes the Web. Mining
this link structure is the second area of Web mining. Finally, there is a
relationship to other documents on the Web that are identified by previous
searches. These relationships are recorded in logs of searches and accesses.
Mining these logs is the third area of Web mining.
Understanding the user is also an important part of Web mining. Analysis
of the user’s previous sessions, preferred display of information, and expressed
preferences may influence the Web pages returned in response to a query.
Web mining is interdisciplinary in nature, spanning across such fields as
information retrieval, natural language processing, information extraction,
machine learning, database, data mining, data warehousing, user interface design,
and visualization. Techniques for mining the Web have practical application in m-
commerce, e-commerce, e-government, e-learning, distance learning,
organizational learning, virtual organizations, knowledge management, and digital
libraries.
1.3 Web Mining and Information Retrieval
Web IR is the application of IR to the web. In classical IR, users specify
queries, in some query language, representing their information needs. The
system selects the set of documents in its collection that seem the most relevant to
the query and presents them to the user. Users may then refine their queries to
improve the answer. In the web environment user intents are not static and stable
as they usually are in traditional IR. In the web, the information need is associated
with a given task (Broder, 2002) that is not known in advance and may be quite
different from user to user, even if the query specification is the same. The
identification of this task and the mental process of deriving a query from an
information need are crucial aspects in web IR. Web IR is related to web mining
– the automatic discovery of interesting and valuable information from the web
(Chakrabarti, 2003). It is generally accepted that web mining is currently being
developed towards three main research directions, related to the type of data they
mine: web content mining, web structure mining and web usage mining (Kosala
et al., 2000). Recently another type of data – document change, page age and
information recency – is generating research interest: it is related to a temporal
dimension and allows for analyzing the growth and dynamics – over time – of the
Web (Baeza-Yates, 2003; Cho et al., 2000; Lim et al., 2001). This categorization
is merely conceptual; these areas are not mutually exclusive and some techniques
dedicated to one may use data that is typically associated with others.
Web content mining concerns the discovery of useful information from
web page content which is available in many different formats (Baeza-Yates,
2003) – textual, metadata, links, multimedia objects, hidden and dynamic pages
and semantic data.
Web structure mining tries to infer knowledge from the link structure on
the web (Chakrabarti et al., 1999a). Web documents typically point at related
documents through a link forming a social network. This network can be
represented by a directed graph where nodes represent documents and arcs
represent the links between them. The analysis of this graph is the main goal of
web structure mining (Donato et al., 2000; Kumar et al., 2000). In this field, two
algorithms, which rank web pages according to their relevance, have received
special attention: PageRank (Brin et al., 1998) and Hyperlink-Induced Topic
Search, or HITS (Kleinberg, 1998).
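
To make the link-analysis idea concrete, the following is a minimal Python sketch of the basic PageRank power iteration, run on a tiny invented link graph; the damping factor of 0.85 and the graph itself are illustrative assumptions, not taken from the cited works.

# Minimal PageRank sketch: power iteration over a toy link graph.
# The graph and the damping factor (0.85) are illustrative assumptions.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:            # dangling page: spread its rank evenly
                share = rank[page] / n
                for p in pages:
                    new_rank[p] += damping * share
            else:
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_graph).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))

Pages that receive many (or highly ranked) in-links end up with higher scores, which is the intuition behind treating hyperlinks as endorsements.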
Web usage mining tries to explore user behavior on the web by analyzing
data originated from user interaction and automatically recorded in web server
logs. The applications of web usage mining usually intend to learn user profiles or
navigation patterns. Web usage mining is essentially aimed at predicting the next
user request based on the analysis of previous requests. Markov models are very
common in modeling user requests or user paths within a site (Borges, 2000).
Association rules and other standard data mining and OLAP techniques are also
explored. (Cooley et al., 1997) presents an overview of the most relevant work in
web usage mining [3].
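
As a small illustration of the Markov-model approach mentioned above, the following Python sketch estimates first-order transition probabilities from a few invented sessions and predicts the most likely next page; the session data and page names are hypothetical.

# First-order Markov model of user navigation: estimate transition
# probabilities from (hypothetical) sessions and predict the next request.
from collections import defaultdict

def train_transitions(sessions):
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    # Normalize counts into conditional probabilities P(next | current).
    model = {}
    for page, nexts in counts.items():
        total = sum(nexts.values())
        model[page] = {p: c / total for p, c in nexts.items()}
    return model

def predict_next(model, current_page):
    candidates = model.get(current_page)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    sessions = [["home", "products", "cart"],
                ["home", "about", "products", "cart"],
                ["home", "products", "products", "checkout"]]
    model = train_transitions(sessions)
    print(predict_next(model, "products"))   # most likely page after 'products'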
IR is the automatic retrieval of all relevant documents while at the same
time retrieving as few of the non-relevant documents as possible (Rijsbergen, 1979). Some
have claimed that resource or document discovery (IR) on the Web is an instance
of Web content mining, while others associate web mining with intelligent IR.
Actually IR has the primary goals of indexing text and searching for useful
documents in a collection and nowadays research in IR includes modeling,
document classification and categorization, user interfaces, data visualization,
filtering, etc. (Baeza-Yates & Berthier, 1999). The task that can be considered to
be an instance of Web mining is Web document classification or categorization,
which could be used for indexing. Viewed in this respect, Web mining is part of
the (Web) IR process (Kosala & Blockeel, 2000) [4].
1.4 Web Mining and Information Extraction
IE has the goal of transforming a collection of documents, usually with the
help of an IR system, into information that is more readily digested and analyzed
(Cowie & Lehnert, 1996). IE aims to extract relevant facts from the documents
while IR aims to select relevant documents (Pazienza, 1997). While IE is
interested in the structure or representation of a document, IR views the text in a
document just as a bag of unordered words (Wilks, 1997). Thus, in general, IE
works at a finer level of granularity on the documents than IR does. Building IE
systems manually is neither feasible nor scalable for a medium as dynamic and
diverse as the web (Muslea, Minton & Knoblock, 1998). Due to this
nature of the Web, most IE systems focus on extracting from specific web sites. Others
use machine learning or data mining techniques to learn the extraction patterns or
rules for Web documents semi-automatically or automatically (Kushmerick,
1999). Within this view, Web mining is used to improve Web IE (Web mining is
part of IE) (Kosala & Blockeel, 2000). An example of IE without Web mining is
the work done by El-Beltagy, Rafea and Abdelhamid, who built a model for
automatically augmenting segmented documents with metadata using dynamically
acquired background domain knowledge, in order to assist users in easily locating
information within these documents through a structured front end [5]. Web mining
can be divided into four subtasks:
1.4.1 Information Retrieval/Resource Discovery (IR)
Find all relevant documents on the web. The goal of IR is to automatically
find all relevant documents, while at the same time filter out the non-relevant
ones. Search engines are a major tool people use to find web information. Search
engines use keywords as the index to perform queries. Users have more control in
searching web content. Automated programs such as crawlers and robots are used
to search the web. Such programs traverse the web to recursively retrieve all
relevant documents. A search engine consists of three components: a crawler
which visits web sites, an index which is updated when the crawler finds a site, and
a ranking algorithm which orders the relevant web sites. However, current
search engines have a major problem, low precision, which is often manifested by
the irrelevance of search results.
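
The three-component view described above can be sketched in a few lines of Python. In the toy example below the "web" is an in-memory dictionary of hypothetical pages (so the example stays self-contained rather than fetching over HTTP), the crawler is a breadth-first traversal of the link graph, and the index is a simple keyword-to-URL map.

# Toy crawler sketch: breadth-first traversal of a link graph,
# building a simple keyword index as pages are visited.
# The in-memory "web" below stands in for real HTTP fetching.
from collections import deque

WEB = {
    "http://example.org/":  ("welcome page about web mining", ["http://example.org/a"]),
    "http://example.org/a": ("web content mining overview",   ["http://example.org/b"]),
    "http://example.org/b": ("usage mining and server logs",  ["http://example.org/"]),
}

def crawl(seed):
    index = {}                      # keyword -> set of URLs (the "index" component)
    visited = set()
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in WEB:
            continue
        visited.add(url)
        text, links = WEB[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        frontier.extend(links)      # follow hyperlinks recursively
    return index

if __name__ == "__main__":
    print(sorted(crawl("http://example.org/").get("mining", set())))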
1.4.2 Information Extraction (IE): automatically extract specific
fragments of a document from web resources retrieved from the IR step. Building
a uniform IE system is difficult because the web content is dynamic and diverse.
Most IE systems use the "wrapper" [33] technique to extract specific
information for a particular site. Machine learning techniques are also used to
learn the extraction rules.
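
A "wrapper" in this sense is simply a site-specific extraction pattern. The Python sketch below uses a hand-written regular expression as a wrapper for one assumed HTML layout; the markup and field names are invented for illustration.

# Hand-written "wrapper" sketch: a site-specific regular expression that
# extracts product name and price fields from one assumed HTML layout.
import re

SAMPLE_HTML = """
<div class="item"><span class="name">USB cable</span><span class="price">3.99</span></div>
<div class="item"><span class="name">Keyboard</span><span class="price">24.50</span></div>
"""

WRAPPER = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>'
    r'<span class="price">(?P<price>[^<]+)</span>'
)

def extract(html):
    return [(m.group("name"), float(m.group("price"))) for m in WRAPPER.finditer(html)]

if __name__ == "__main__":
    print(extract(SAMPLE_HTML))   # [('USB cable', 3.99), ('Keyboard', 24.5)]

Because such patterns are tied to one page layout, they break as soon as the site changes, which is why the learning approaches mentioned above are attractive.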
1.4.3 Generalization: discover information patterns at retrieved web
sites. The purpose of this task is to study users' behavior and interest. Data mining
techniques such as clustering and association rules are utilized here. Several
problems exist during this task. Because web data are heterogeneous, imprecise
and vague, it is difficult to apply conventional clustering and association rule
techniques directly on the raw web data.
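
As a small, simplified example of this data mining step, the Python sketch below derives pairwise association rules of the form "page A => page B", with their support and confidence, from a handful of invented sessions; the thresholds and data are assumptions made purely for illustration.

# Minimal association-rule sketch over web sessions: for pairs of pages,
# compute support and confidence of the rule A => B.
from itertools import permutations

def pair_rules(sessions, min_support=0.3, min_confidence=0.6):
    n = len(sessions)
    page_count = {}
    pair_count = {}
    for session in sessions:
        pages = set(session)
        for p in pages:
            page_count[p] = page_count.get(p, 0) + 1
        for a, b in permutations(pages, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                    # fraction of sessions containing both pages
        confidence = count / page_count[a]     # P(B in session | A in session)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, round(support, 2), round(confidence, 2)))
    return rules

if __name__ == "__main__":
    sessions = [["home", "products", "cart"],
                ["home", "products"],
                ["home", "about"],
                ["products", "cart"]]
    for a, b, s, c in pair_rules(sessions):
        print(f"{a} => {b}  support={s} confidence={c}")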
1.4.4 Analysis/Validation: analyze, interpret and validate the potential
information from the information patterns. The objective of this task is to discover
knowledge from the information provided by former tasks. Based on web data,
one can build models to simulate and validate web information [6].
1.5 Information Retrieval and Web
The meaning of the term information retrieval can be very broad. Just
getting a credit card out of your wallet so that you can type in the card number is a
form of information retrieval. However, as an academic field of study,
information retrieval might be defined thus: information retrieval (IR) is finding
material (usually documents) of an unstructured nature (usually text) that satisfies
an information need from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only a few people
engaged in: reference librarians, paralegals, and similar professional searchers.
Now the world has changed, and hundreds of millions of people engage in
information retrieval every day when they use a web search engine or search their
email. Information retrieval is fast becoming the dominant form of information
access, overtaking traditional database-style searching.
IR can also cover other kinds of data and information problems beyond
that specified in the core definition above. The term "unstructured data" refers to
data which does not have clear, semantically overt, easy-for-a-computer structure.
It is the opposite of structured data, the canonical example of which is a relational
database, of the sort companies usually use to maintain product inventories and
personnel records. In reality, almost no data are truly "unstructured". This is
definitely true of all text data if you count the latent linguistic structure of human
languages. But even accepting that the intended notion of structure is overt
structure, most text has structure, such as headings and paragraphs and footnotes,
which is commonly represented in documents by explicit markup (such as the
coding underlying web pages). IR is also used to facilitate "semi-structured"
search, such as finding a document where the title contains Java and the body
contains threading. The field of information retrieval also covers supporting users
in browsing or filtering document collections or further processing a set of
retrieved documents. Given a set of documents, clustering is the task of coming
up with a good grouping of the documents based on their contents. It is similar to
arranging books on a bookshelf according to their topic. Given a set of topics,
standing information needs, or other categories (such as suitability of texts for
different age groups), classification is the task of deciding which class(es), if any,
each of a set of documents belongs to. It is often approached by first manually
classifying some documents and then hoping to be able to classify new documents
automatically.
Information retrieval systems can also be distinguished by the scale at
which they operate, and it is useful to distinguish three prominent scales. In web
search, the system has to provide search over billions of documents stored on
millions of computers. Distinctive issues are needing to gather documents for
indexing, being able to build systems that work efficiently at this enormous scale,
and handling particular aspects of the web, such as the exploitation of hypertext
and not being fooled by site providers manipulating page content in an attempt to
boost their search engine rankings, given the commercial importance of the web
[7].
1.6 The Web
The web is a public service constituted by a set of applications aimed at
extracting documents from computers accessible on the Internet – the Internet being a
network of computer networks. One can also describe the web as an information
repository distributed over millions of computers interconnected through the Internet
(Baldi et al., 2003). The W3C defines the web in a broad way: "the World Wide Web
is the universe of network-accessible information, an embodiment of human
knowledge". Due to its comprehensiveness, with contents related to most subjects
of human activity, and global public acceptance, either at a personal or
institutional level, the web is widely explored as an information source. The web's
dimension and dynamic nature become serious drawbacks when it comes to
retrieving information. Another relevant characteristic of the web is the absence
of any global editorial control over its content and format. This contributes largely
to web success but also contributes to a high degree of heterogeneity in content,
language, structure, correctness and validity. Although the problems raised by the
size of the web, around 11.5 × 10⁹ pages (Gulli et al., 2005), and its dynamics
require special treatment, it seems that the major difficulties concerning the
processing of web documents are generated by the lack of editorial rules and the
lack of a common ontology, which would allow for unambiguous document
specification and interpretation. In the absence of such normative rules, each
document has to be treated as unique. In this scenario, document processing
cannot be based on any underlying structure. Although HTML already involves
some structure, its use is not mandatory. Therefore, the highest level of abstraction
that may assure compatibility with a generic web document is the common
bag-of-words representation (Chakrabarti, 2003). This low abstraction level is not very helpful for
automatic processing, requiring significant computational costs. The web is a vast
and popular repository, containing information related to almost all human
activities and being used to perform an ever growing set of distinct activities
(bank transactions, shopping, chatting, government transactions, weather report
and getting geographic directions, just to name a few). Despite the difficulties this
medium poses to automatic as well as to non-automatic processing, it has been
increasingly explored and has been motivating efforts, from both academic and
industry, which aim to facilitate this exploration. Currently the web is a repository
of documents, the majority of them HTML documents, that can be automatically
presented to users but that do not have a base model that might be used by
computers to acquire semantic information on the objects being manipulated. The
semantic web is a formal attempt from the W3C to transform the web into a huge
database that might be easier to process automatically than our current syntactic
web. However, despite many initiatives on the semantic web (Lu et al., 2002), the
web has its own dynamics and web citizens are pushing the web towards the social plane.
Collaborative systems, radical trust and participation are the main characteristics
of Web 2.0, a new paradigm emerging since 2004 (O’Reilly, 2004).
1.6.1 A Retrospective View of Web Information Retrieval
In the early 1950s, technical librarianship faced a crisis. The scientific
boom sparked by the Second World War had released a flood of publications,
approaching a million new articles each year. Scientists could no longer stay
abreast of current research by general reading alone. Papers relevant to a new
project, but not previously known to the researcher, had to be retrieved at the
project’s outset and the librarian had to facilitate this retrieval. A variety of
cataloguing schemes had been suggested as tools for retrieval, but none had been
rigorously tested for effectiveness, and all were labour-intensive to implement.
In responding to technical information’s rapid growth, librarians and
information scientists developed the field of information retrieval. The defining
discovery of the field was that complex schemes for organizing and cataloguing
information into hierarchical taxonomies did little better than simply indexing the
plain words occurring in the text: the crucial part of information retrieval lay in
the process of retrieval. The finding that taxonomy was redundant was little short
of scandalous—after all, Western information science had since Aristotle been
founded on subdividing knowledge by genus and species. But the effect was
liberating. Word occurrences are readily indexed by computer, and retrieval
technology could be constructed on top of such indexes without having to solve
deep problems in human language analysis and semantics. Significantly, the
sufficiency of word occurrence indexing was not argued theoretically (which,
after centuries of such theoretical dispute, would hardly have had an impact), but
demonstrated empirically, through careful evaluation.
In the mid 1990s, users of the newly-emerged web faced a crisis. The
number of web sites was growing rapidly, and finding information by following a
trail of links from a few popular central sites was no longer an adequate access
method. Manually curated directories such as that of Yahoo! were popular, but
manual curation was expensive and scaled poorly. Experienced users could not
keep up with the growth in the number of sites, even in areas of personal interest
to them; and, for novice users, the task of finding useful information on the web
was daunting.
Faced with the mushrooming growth of the web in the second half of the
1990s, a new kind of service provider turned to the decades-old technology of
information retrieval, producing the web search engine. Web search transformed
information retrieval from the rarefied activity of librarians, researchers, journalist
fact-checkers, and intelligence analysts, to the daily activity of almost the entire
computer-enabled population. In doing so, search providers finally bridged a
long-established gap between theory and practice. As early as the 1960s,
researchers had developed statistical techniques for effectively retrieving and
ranking documents against plain keyword queries. The retrieval technology
deployed in practice, though, used logical, Boolean query languages that relied
upon the patience and expertise of the querier to formulate complex query
expressions, precisely specifying their information need. But web users had little
expertise, and less patience, for constructing complex queries. Search engines
therefore turned to simple queries and sophisticated retrieval, finally deploying,
on a massive scale, the techniques developed three decades earlier, so creating the
modern search engine. To the surprise once more of some search technologists,
simple keyword search simply worked. In an increasingly competitive search
market, though, how could a provider verify the effectiveness of their search
results, and compare their offering with that of their competitors?
Search technology connects simple queries with unannotated documents,
relieving both the producer and the consumer of information from the complexity
of matching information resources to information needs. The result is tools that
allow neophyte users to find relevant information, across billions of web
documents, in a fraction of a second. But in doing away with complex, formal
information representations in favour of rough approximations, statistical
information retrieval introduced an important problem. It is not possible to
objectively and deterministically state that an information object matches an
information request, even in the terms in which the request is formulated. One can
say that a document has been manually assigned a certain classification under a
hierarchical taxonomy; one can even say that a document contains a Boolean
combination of terms; but one cannot conclusively say that an uncategorized
document meets a user’s information need as expressed by a handful of keywords.
The contemporary retrieval system sits at the interface between computational
formalism on the one hand, and the ambiguity of human cognition on the other.
There is uncertainty in what the retrieval system should do, and therefore in how
correct a set of results is.
The ambiguity of the retrieval task makes the question of retrieval
effectiveness a crucial and contested one. Methods for evaluating effectiveness
are therefore essential, in both research and deployment. Retrieval evaluation
relies fundamentally on human assessment of result quality. The
noncomputability of effectiveness makes information retrieval a deeply empirical
discipline, closer to natural or even social science than to formal computational
theory. The complex, interlocked relations that connect imprecise queries,
uncurated documents, and inchoate information needs, are not given, but must be
hypothesized and tested on observed search behavior.
The importance of empirical evaluation in information retrieval has been
recognized since the field began; the initial work that established the primacy of
retrieval over indexing gained much of its impact from the meticulous and
painstaking experimental work on which it was based. But the same scale of data
that makes retrieval technology necessary, also makes manual assessment costly.
While result quality can be measured by directly assessing user satisfaction with,
or utility gained from, retrieval results, such direct measurement of the user’s
satisfaction with the results lists as a whole is neither reusable nor reliably
repeatable. Assessing the results of any single system is time-consuming, and
there are many competing retrieval algorithms, each tuned by numerous
parameters. A parameter change that takes a few minutes to decide upon, and a
few seconds to run, could take days to manually assess. Moreover, if each
research group produces its own, independent assessments of retrieval quality,
then not only is much effort duplicated, but also reproducibility is impaired, and
the potential for bias is introduced. And tuning nowadays is often performed
automatically through machine learning; fitting a manual review stage into each
learning iteration would be unworkable.
The need for scale and automatability, plus the desire for repeatability and
objectivity, has led the information retrieval community to develop hybrid
evaluation technologies, part manual, part automated. The most important of the
evaluation tools is the test collection: a corpus of documents, with a set of queries
(known as topics) to run against the corpus, and judgments of which documents
are (independently) relevant to each query. These relevance judgments must be
manually formed: but once made, the test collection can in principle be reused
indefinitely for fully automated evaluation. The result is an automated and re-
usable evaluation method, based on a simplified model of retrieval.
Test collection evaluation has been the bedrock of retrieval research for
half a century. Collection-based experimentation has grown even more in
importance since the arrival, beginning in the early 1990s, of large scale,
collaboratively developed, and readily obtainable test collections. And (to judge
from publicly available information) the test collection method is also core to the
quality assurance and improvement methods of commercial web search engines.
The practice of retrieval evaluation, though, has run well ahead of the
theory. It was only at the end of the 1990s that the reliability, efficiency, and
interpretability of evaluation results began to be formally investigated. The delay
was in part because it was only after large-scale collaborative experiments had
been running for several years that the datasets needed for a critical investigation
of evaluation became available. Initial enquiries, while foundational, tended to be
either ad-hoc, or else applied statistical methodology developed in other areas to
retrieval evaluation without considering the field’s distinctive features. These
omissions are currently being remedied by the research community.
It is in the context of the effort for greater reliability, accuracy, robustness,
and efficiency in collection-based retrieval evaluation that this thesis is presented.
Building on the foundational work in the area, and employing the large evaluation
datasets now available, major advances in the accuracy and comparability of
evaluation scores can be made in the design of efficient and reliable experiments,
in the extensibility of test collections in dynamic evaluation environments, and in
the measurement of retrieval similarity without relevance assessment. Technical
contributions with awareness of the wider context of evaluation, and of the
necessity of mixing experimental rigour with research innovation can also be
offered.
The need to store and retrieve written information became increasingly
important over centuries, especially with inventions like paper and the printing
press. Soon after computers were invented, people realized that they could be
used for storing and mechanically retrieving large amounts of information. In
1945, Vannevar Bush published a groundbreaking article titled "As We May
Think" that gave birth to the idea of automatic access to large amounts of stored
knowledge [8]. In the 1950s, this idea materialized into more concrete descriptions
of how archives of text could be searched automatically. Several works emerged
in the mid 1950s that elaborated upon the basic idea of searching text with a
computer. One of the most influential methods was described by H.P. Luhn in
1957, in which (put simply) he proposed using words as indexing units for
documents and measuring word overlap as a criterion for retrieval [9].
Several key developments in the field happened in the 1960s. Most
notable were the development of the SMART system by Gerard Salton and his
students, first at Harvard University and later at Cornell University [10]; and the
Cranfield evaluations done by Cyril Cleverdon and his group at the College of
Aeronautics in Cranfield [11]. The Cranfield tests developed an evaluation
methodology for retrieval systems that is still in use by IR systems today. The
SMART system, on the other hand, allowed researchers to experiment with ideas
to improve search quality. A system for experimentation coupled with good
evaluation methodology allowed rapid progress in the field, and paved the way for
many critical developments.
The 1970s and 1980s saw many developments built on the advances of the
1960s. Various models for doing document retrieval were developed and
advances were made along all dimensions of the retrieval process. These new
models/techniques were experimentally proven to be effective on small text
collections (several thousand articles) available to researchers at the time.
However, due to lack of availability of large text collections, the question whether
these models and techniques would scale to larger corpora remained unanswered.
This changed in 1992 with the inception of the Text Retrieval Conference, or
TREC [12]. TREC is a series of evaluation conferences sponsored by various US
Government agencies under the auspices of NIST, which aims at encouraging
research in IR from large text collections. With large text collections available
under TREC, many old techniques were modified, and many new techniques were
developed (and are still being developed) to do effective retrieval over large
collections [13].
The evolution of IR systems may be organized into four distinct periods,
with significant differences among the methods that were applied and the sources
used during each one. During an initial period, up to the 50s, the indexing and
searching processes were handled manually. Indexes were based on taxonomies
or alphabetical lists of previously specified concepts. During this phase, IR
systems were mainly used by librarians and scientists.
During a second period, between around 1950 and the advent of web in
the early 90s, the pressure on the field and the evolution of computer and
database technology allowed for significant improvements. The process went from
manual to automated annotation of documents; however, indexes were still built
from restricted descriptions of documents (mainly abstracts and document titles).
IR was viewed as finding the right information in text databases. Operating IR
systems frequently required specific learning. IR systems utilization was
expensive and available only to restricted groups. During a third period, covering
the 90s, the process of indexing and searching becomes fully automated. Full text
indexes are built; web mining evolves and explores not only content but also
structure and usage. IR systems become unrestricted, cheap, widely available and
widely used. From around 2000 on, in the fourth and current period, other sources of
evidence have been explored in an attempt to improve systems’ performance.
Searching and browsing are the two basic IR paradigms on the web
(Baeza-Yates et al., 1999). Three approaches to IR seem to have emerged (Broder
et al., 2005):
The search-centric approach argues that free search has become so good
and the search user-interface so common, that users can satisfy all their needs
through simple queries. Search engines follow this approach;
The taxonomy navigation approach claims that users have difficulties
expressing their information needs; organizing information on a hierarchical
structure might help finding relevant information. Directory search systems
follow this approach;
The meta-data centric approach advocates the use of meta-data for
narrowing large sets of results (multi-faceted search); third-generation search
engines are trying to improve the quality of their answers by merging several
sources of evidence.
IR systems also have to solve problems related to their sources and how to
build their databases/indexes. Several crawling algorithms have been explored, in
order to overcome problems of scale arising from web dimension, such as focused
crawling (Chakrabarti et al., 1999b), intelligent crawling (Aggarwal et al., 2001)
and collaborative crawling (Aggarwal et al., 2004) that explores user behavior
registered in server logs. Other approaches have also been proposed: meta-search
explores the small overlap among search engines’ indexes sending the same query
to a set of search engines and merging their answers – a few specific problems
arise from this approximation (Wang et al., 2003); dynamic search engines try to
deal with web dynamics: such search engines do not have any permanent index
but instead crawl for their answers at query time (Hersovici et al., 1998);
interactive search (Bruza et al., 2000) wraps a general-purpose search engine into
an interface that allows users to navigate towards their goal through a query-by-
navigation process. At present, IR research seems to be focused on retrieval of
high quality, integration of several sources of evidence and multimedia
retrieval[3].
TREC has also branched IR into related but important fields like retrieval
of spoken information, non-English language retrieval, information filtering,
user interactions with a retrieval system, and so on.
1.7 Basic Processes of Information Retrieval
There are three basic processes an information retrieval system has to
support: the representation of the content of the documents, the representation of
the user’s information need, and the comparison of the two representations. The
processes are visualized in Figure 1.3 (Croft 1993). In the figure, squared boxes
represent data and rounded boxes represent processes.
Figure 1.3: Information Retrieval Process (Croft 1993)
Representing the documents is usually called the indexing process. The
process takes place off-line, that is, the end user of the information retrieval
system is not directly involved. The indexing process results in a formal
representation of the document: the index representation or document
representation. Often, full text retrieval systems use a rather trivial algorithm to
derive the index representations, for instance an algorithm that identifies words in
an English text and puts them in lower case. The indexing process may include
the actual storage of the document in the system, but often documents are only
stored partly, for instance only title and abstract, plus information about the actual
location of the document.
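
A minimal Python sketch of such an indexing step, assuming the trivial "identify words and lower-case them" algorithm described above, might look as follows (the sample documents are invented):

# Indexing process sketch: identify words in each document, put them in
# lower case, and build an inverted index from terms to document ids.
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """documents: dict mapping document id -> full text."""
    index = {}
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

if __name__ == "__main__":
    docs = {1: "Family entertainment for everyone.",
            2: "An entertainment guide.",
            3: "Family law and legal advice."}
    index = build_index(docs)
    print(index["entertainment"])   # {1, 2}
    print(index["family"])          # {1, 3}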
The process of representing the information problem or need is often
referred to as the query formulation process. The resulting formal representation
is the query. In a broad sense, query formulation might denote the complete
interactive dialogue between system and user, leading not only to a suitable query but
possibly also to a better understanding by the user of his/her information need. In
this thesis however, query formulation generally denotes the automatic
formulation of the query when there are no previously retrieved documents to
guide the search, that is, the formulation of the initial query. The automatic
formulation of successive queries is called relevance feedback in this thesis. The
user and the system communicate the information need by respectively queries
and retrieved sets of documents. This is not the most natural form of
communication. Humans would use natural language to communicate the
information need amongst each other. Such a natural language statement of the
information need is called a request. Automatic query formulation inputs the
request and outputs an initial query. In practice, this means that some or all of the
words in the request are converted to query terms, for instance by the rather trivial
algorithm that puts words in lower case. Relevance feedback inputs a query or a
request and some previously retrieved relevant and non-relevant documents to
output a successive query. The comparison of the query against the document
representations is also called the matching process. The matching process results
in a ranked list of relevant documents. Users will walk down this document list in
search of the information they need. Ranked retrieval will hopefully put the
relevant documents somewhere in the top of the ranked list, minimizing the time
the user has to invest on reading the documents. Simple but effective ranking
algorithms use the frequency distribution of terms over documents. For instance,
the words "family" and "entertainment" mentioned in the first section occur
relatively infrequently in the whole book, which indicates that this book should not
receive a top ranking for the request "family entertainment". Ranking algorithms
based on statistical approaches easily halve the time the user has to spend on
reading documents.
1.7.1 Basic Models of Information Retrieval: A Brief Overview
A mathematical model of information retrieval guides the implementation
of information retrieval systems. In the traditional information retrieval systems,
which are usually operated by professional searchers, only the matching process is
automated; indexing and query formulation are manual processes. For these
systems, mathematical models of information retrieval therefore only have to
model the matching process. In practice, traditional information retrieval systems
use the Boolean model of information retrieval.
1.7.1.1 The Boolean model
The Boolean model is an exact matching model, that is, it either retrieves documents or not,
without ranking them. The model supports the use of structured queries, which do
not only contain query terms, but also relations between the terms defined by the
query operators AND, OR and NOT.
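
Using an inverted index of the kind sketched earlier in this chapter, Boolean matching reduces to set operations on posting sets. The Python sketch below handles only simple two-term queries, as an assumed simplification; the index contents are invented.

# Boolean model sketch: exact matching via set operations on an inverted
# index. Only simple two-term queries are handled, for illustration.

def boolean_query(index, term_a, operator, term_b):
    a = index.get(term_a, set())
    b = index.get(term_b, set())
    if operator == "AND":
        return a & b
    if operator == "OR":
        return a | b
    if operator == "NOT":            # interpreted here as: term_a AND NOT term_b
        return a - b
    raise ValueError("unknown operator: " + operator)

if __name__ == "__main__":
    index = {"family": {1, 3}, "entertainment": {1, 2}}
    print(boolean_query(index, "family", "AND", "entertainment"))  # {1}
    print(boolean_query(index, "family", "NOT", "entertainment"))  # {3}

Note that the result is an unranked set: every matching document is returned and no ordering among them is implied, which is exactly the exact-matching behaviour described above.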
In modern information retrieval systems, which are usually operated by
nonprofessional users, query formulation is automated as well. However,
candidate mathematical models for these systems still only model the matching
process. There are many candidate models for the matching process of ranked
retrieval systems. These models are so-called approximate matching models, that
is, they use the frequency distribution of terms over documents to compute the
ranking of the retrieved sets. Each of these models has its own advantages and
disadvantages. However, there are two classical candidate models for
approximate matching: the vector space model and the probabilistic model. They
are classical models, not only because they were introduced already in the early
70’s, but also because they represent classical problems in information retrieval.
1.7.1.2 The vector space model
The vector space model represents the problem of ranking the documents given the initial query.
The Vector model, probably the most commonly used, assigns real non-negative
weights to index terms in documents and queries. In this model, documents are
represented by vectors in a multi-dimensional Euclidean space. Each dimension in
this space corresponds to a relevant term/word contained in the document
collection. The degree of similarity of documents with regard to queries is
evaluated as the correlation between the vectors representing the document and
the query which can be, and usually is, quantified by the cosine of the angle
between the two vectors.
In the vector model, index term weights are usually obtained as a function
of two factors: the term frequency factor, TF, a measure of intra-cluster
similarity, computed as the number of times the term occurs in the document and
normalized so as to make it independent of document length; and the inverse
document frequency factor, IDF, a measure of inter-cluster dissimilarity, which weights each
term according to its discriminative power in the entire collection. This model’s
main advantages are related to improvements in retrieval performance due to term
weighting and to partial matching, which allows retrieval of documents that approximate
the query conditions. The index term independency assumption is probably its
main disadvantage.
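
In standard notation (a generic textbook formulation rather than one taken from the works cited here), the weight of term t in document d and the cosine similarity used to rank document d against query q can be written as:

\[
w_{t,d} = tf_{t,d} \times idf_t , \qquad idf_t = \log \frac{N}{n_t}
\]

\[
sim(d, q) = \frac{\sum_{t} w_{t,d}\, w_{t,q}}{\sqrt{\sum_{t} w_{t,d}^{2}}\;\sqrt{\sum_{t} w_{t,q}^{2}}}
\]

where tf_{t,d} is the (length-normalized) number of occurrences of t in d, N is the number of documents in the collection and n_t is the number of documents containing t.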
1.7.1.3 The probabilistic model
The probabilistic model represents the problem of ranking the documents after some feedback is
gathered. Probabilistic models compute the similarity between documents and
queries as the odds of a document being relevant to a query. Index term weights
are binary. This model ranks documents in decreasing order of their probability of
being relevant, which is an advantage. Its main disadvantages are: the need to
guess the initial separation of documents into relevant and non-relevant; weights
are binary; index terms are assumed to be independent.
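
In the classical binary independence formulation (again a standard textbook presentation, not specific to the sources cited in this chapter), documents are ranked by a retrieval status value of the form:

\[
RSV(d, q) = \sum_{t \in q \cap d} \log \frac{p_t \,(1 - u_t)}{u_t \,(1 - p_t)}
\]

where p_t is the probability that term t occurs in a relevant document and u_t the probability that it occurs in a non-relevant one; the need to estimate these probabilities is precisely the "initial separation" problem noted above.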
From a practical point of view, the Boolean model, the vector space model
and the probabilistic model represent three classical problems of information
retrieval, respectively structured queries, initial term weighting, and relevance
feedback. The Boolean model provides the query operators AND, OR and NOT to
formulate structured queries. The vector space model was used by Salton and his
colleagues for hundreds of term weighting experiments in order to find algorithms
that predict which documents the user will find relevant given the initial query
(Salton and Buckley 1988). The probabilistic model provides a theory of
optimum ranking if examples of relevant documents are available [14].
1.7.2 Evaluation of Information Retrieval Systems
Evaluation studies investigate the degree to which the stated goals or
expectations have been achieved or the degree to which these can be achieved.
The three major purposes given for evaluating an information retrieval system
were the need for measures with which to make merit comparisons within a single
test situation, the need for measures with which to make comparisons between
results obtained in different test situations, and the need for assessing the merit of a
real-life system. A number of studies have been conducted to measure the
performance of the information retrieval system. Some criteria have been
proposed by several researchers for the evaluation of information retrieval
systems [CC66, LFW68, and SG83]. These criteria include: coverage of the
system, form of presentation of the search output, user effort, the response time of
the system, and recall and precision. Retrieval effectiveness is defined in terms of
retrieving relevant documents and not retrieving non-relevant documents. Two
traditional factors of measuring effectiveness are Recall and Precision.
1.7.2.1 Evaluation Criteria
Recall indicates the ability of a system to present all relevant items or
documents. In reality it may not be possible to retrieve all the relevant items from
a collection, especially when the collection is large. A system may be able to
retrieve a proportion of the total relevant documents. Thus, the performance of a
system is often measured by the recall ratio, which denotes the percentage of
relevant items retrieved in a given situation.
Precision implies the ability of a system to present only relevant items or
documents and therefore not to retrieve non-relevant documents. This factor, that
is, how far the system is able to withhold unwanted items in a given situation, is
measured in terms of the precision ratio. These two measures are denoted by the
following formulas:
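
In their standard form, recall and precision are given by:

\[
Recall = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents in the collection}}
\]

\[
Precision = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}
\]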