
Web Mining: A Case Study

TEJAS GHALSASI 6483

Abstract

With an enormous amount of data stored in databases and data warehouses, it is increasingly important to develop powerful tools for analyzing such data and mining interesting knowledge from it. Data mining is the process of inferring knowledge from such large volumes of data. The main problem related to the retrieval of information from the World Wide Web is the enormous number of unstructured documents and resources, i.e., the difficulty of locating and tracking appropriate sources. This article surveys research in the area of web mining and suggests web mining categories and techniques. Furthermore, it presents a web mining environment generator that allows naive users to generate a web mining environment specific to a given domain by providing a set of specifications.

Introduction

Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three different types: Web usage mining, Web content mining, and Web structure mining. Web mining rapidly collects and integrates information from multiple web sites. This approach relies on computing power, rather than programmer brain power, to handle many of the complex issues that arise when gathering information from disparate sources. One solution is powered by advanced artificial intelligence algorithms. For instance, machine learning techniques can be used for extracting information from web sites. Based on a few examples, the software automatically determines how to extract different types of information from data sources.

Web structure mining

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:

1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location (a small sketch of this follows the list).

2. Mining the document structure: analysis of the tree-like structure of pages to describe HTML or XML tag usage.
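As a rough illustration of the first kind, the following is a minimal sketch in Python that models hyperlinks as a directed graph and ranks pages by the number of incoming links; the page names and links are made-up examples, not taken from the case study.

```python
# Minimal hyperlink-structure sketch: model the site as a directed graph
# and rank pages by incoming links. All page names/links are made-up examples.
from collections import Counter

# Adjacency list: page -> pages it links to (hyperlinks as graph edges).
links = {
    "home.html":     ["products.html", "about.html"],
    "products.html": ["item1.html", "item2.html"],
    "about.html":    ["home.html"],
    "item1.html":    ["home.html", "products.html"],
    "item2.html":    ["products.html"],
}

# Count incoming hyperlinks per page; heavily linked-to pages act as authorities.
in_degree = Counter(target for targets in links.values() for target in targets)
for page, count in in_degree.most_common():
    print(page, count)
```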


Web content mining

Web content mining is the mining, extraction and integration of useful data, information and knowledge from web page content. The heterogeneity and the lack of structure that permeates much of the ever-expanding information on the World Wide Web, such as hypertext documents, make automated discovery, organization, and search difficult. Indexing tools for the Internet and the World Wide Web such as Lycos, Alta Vista, WebCrawler, ALIWEB, MetaCrawler, and others provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.
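As a minimal content-mining sketch, the following uses only the Python standard library to strip HTML tags from a page and keep its text for later indexing or mining; the sample HTML string is a made-up example.

```python
# Minimal content-extraction sketch using the standard-library HTML parser.
# The sample HTML is a made-up example, not from the case study.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collect the text content that appears between tags.
        if data.strip():
            self.chunks.append(data.strip())

html = "<html><body><h1>Web Mining</h1><p>Patterns from page content.</p></body></html>"
extractor = TextExtractor()
extractor.feed(html)
print(" ".join(extractor.chunks))
```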

Cluster

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

Clustering

Clustering is the process of grouping abstract objects into classes of similar objects.

A cluster of data objects can be treated as one group.

While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
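A tiny sketch of this partition-then-label idea, assuming two hand-picked seed values and hypothetical group labels (none of which come from the case study):

```python
# Partition objects by similarity to a seed, then label the resulting groups.
# The seeds, labels, and data values are made-up examples.
def nearest(obj, seeds):
    return min(seeds, key=lambda s: abs(obj - s))

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
seeds = {1.0: "low spenders", 8.0: "high spenders"}   # hypothetical labels

groups = {label: [] for label in seeds.values()}
for obj in data:
    groups[seeds[nearest(obj, seeds)]].append(obj)
print(groups)
```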

The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis

Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.

Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns.

In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.

Clustering also helps in the identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.

Clustering also helps in classifying documents on the web for information discovery.

Clustering is also used in outlier detection applications such as detection of credit card fraud.


As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Requirements of Clustering in Data Mining

Here are the typical requirements of clustering in data mining:

Scalability - We need highly scalable clustering algorithms to deal with large databases.

Ability to deal with different kinds of attributes - Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.

Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound to distance measures that tend to find only small spherical clusters.

High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.

Ability to deal with noisy data - Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.

Interpretability - The clustering results should be interpretable, comprehensible and usable.

Clustering Methods

Clustering methods can be classified into the following categories:

Partitioning Method

Hierarchical Method

Density-based Method

Grid-Based Method

Model-Based Method

Constraint-based Method

Partitioning Method

Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data. Each partition represents a cluster, and k ≤ n. This means that it classifies the data into k groups, which satisfy the following requirements:

Each group contains at least one object.

Each object must belong to exactly one group.

Points to remember:


For a given number of partitions (say k), the partitioning method will create an initial partitioning.

Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
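As an illustration of iterative relocation, here is a minimal k-means-style sketch in Python; the sample points, the value of k, and the iteration count are made-up example values, not part of the case study.

```python
# Minimal k-means sketch illustrating iterative relocation.
# The data, k, and iteration count are made-up example values.
import random

def kmeans(points, k, iters=10):
    # Start with k randomly chosen points as the initial partitioning.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Relocation step: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster became empty
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.2), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```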

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed, as follows:

Agglomerative Approach

Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. The method keeps merging the objects or groups that are close to one another until all of the groups are merged into one or until the termination condition holds.

Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or until the termination condition holds.

Disadvantage

This method is rigid, i.e., once a merge or split is done, it can never be undone.
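A minimal bottom-up sketch of agglomerative merging using single linkage; the sample points and the target number of clusters are made-up example values, and a real implementation would use a more efficient data structure.

```python
# Minimal agglomerative (bottom-up) sketch with single-linkage merging.
# Sample points and the target cluster count are made-up examples.
def euclidean(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def agglomerative(points, target_clusters):
    # Start with each object in its own group.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:  # termination condition
        # Find the two closest clusters (single linkage: closest pair of members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge them; note that the merge cannot be undone later.
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 10)], target_clusters=2))
```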

Approaches to improve quality of Hierarchical clustering

Here are two approaches that are used to improve the quality of hierarchical clustering:

Perform careful analysis of object linkages at each hierarchical partitioning.

Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
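A minimal sketch of this density criterion (a DBSCAN-style core-point test); the radius eps and the minimum point count min_pts are made-up example parameters.

```python
# Minimal density-criterion sketch: a point keeps the cluster growing if its
# eps-neighbourhood contains at least min_pts points. Parameters are made up.
def euclidean(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def neighbourhood(point, points, eps):
    # All points within radius eps of the given point (including itself).
    return [q for q in points if euclidean(point, q) <= eps]

def is_dense(point, points, eps, min_pts):
    return len(neighbourhood(point, points, eps)) >= min_pts

data = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5)]
print([is_dense(p, data, eps=0.5, min_pts=3) for p in data])
```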

Grid-based Method

In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.

Advantage


The major advantage of this method is fast processing time.

It is dependent only on the number of cells in each dimension in the quantized space.
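A minimal sketch of the quantization step, mapping each object to the grid cell it falls into; the cell size and sample points are made-up example values.

```python
# Minimal grid-quantization sketch: map each point to the integer coordinates
# of its cell, so later steps work on cells rather than points. Values are made up.
from collections import defaultdict

def to_grid(points, cell_size):
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return cells

points = [(0.1, 0.2), (0.4, 0.9), (3.2, 3.8), (3.9, 3.1)]
print(dict(to_grid(points, cell_size=1.0)))
```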

Model-based methods

In this method, a model is hypothesized for each cluster and the best fit of the data to the given model is found. This method locates the clusters by clustering the density function, which reflects the spatial distribution of the data points.

This method also serves as a way of automatically determining the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
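One common model-based technique is to hypothesize a Gaussian component per cluster and pick the number of components with an information criterion; the following is a minimal sketch assuming scikit-learn is available, with made-up sample data.

```python
# Minimal model-based sketch: each cluster is hypothesized to be a Gaussian
# component, and BIC picks the number of clusters. Sample data are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.3, 0.1],
              [5.0, 5.1], [5.2, 4.9], [5.1, 5.0], [4.9, 5.2]])

# Try a few candidate numbers of components and keep the best-fitting model.
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in (1, 2, 3)]
best = min(models, key=lambda m: m.bic(X))
print(best.n_components, best.predict(X))
```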

Constraint-based Method

In this method, clustering is performed by incorporating user- or application-oriented constraints. A constraint refers to user expectations or the properties of the desired clustering results. Constraints give us an interactive way of communicating with the clustering process. They can be specified by the user or by the application requirements.
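One common form of constraint is must-link and cannot-link pairs; the following sketch checks whether a clustering result satisfies such pairs. The function name, labels, and example pairs are hypothetical, not from the case study.

```python
# Minimal sketch of checking must-link / cannot-link constraints against a
# clustering result. All names and example values are hypothetical.
def satisfies_constraints(labels, must_link=(), cannot_link=()):
    # labels maps each object id to the cluster it was assigned to.
    for a, b in must_link:      # these pairs must end up in the same cluster
        if labels[a] != labels[b]:
            return False
    for a, b in cannot_link:    # these pairs must end up in different clusters
        if labels[a] == labels[b]:
            return False
    return True

labels = {"doc1": 0, "doc2": 0, "doc3": 1}
print(satisfies_constraints(labels, must_link=[("doc1", "doc2")],
                            cannot_link=[("doc1", "doc3")]))
```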

Conclusion

The web offers prospecting and user relationship management opportunities that are limited only by the imagination. Web mining is a tool that can extract predictive information from large quantities of data, and it is data driven. It uses mathematical and statistical calculations to uncover trends and correlations among the large quantities of data stored in a database. It is a blend of artificial intelligence technology, statistics, data warehousing, and machine learning. This data mining technology is becoming more and more popular and is one of the fastest growing technologies in information systems today. Thus we have acquired knowledge of web mining.
