
Web Mining



1. INTRODUCTION

The World Wide Web is, without doubt, something most of us already know well and have used extensively. The World Wide Web (or the Web for short) has affected almost every aspect of our lives. It is the biggest and most widely known information source that is easily accessible and searchable. It consists of billions of interconnected documents (called Web pages) authored by millions of people. Since its inception, the Web has dramatically changed our information-seeking behaviour. Before the Web, finding information meant asking a friend or an expert, or buying or borrowing a book to read. With the Web, however, everything is only a few clicks away from the comfort of our homes or offices. Not only can we find needed information on the Web, but we can also easily share our information and knowledge with others. The Web has also become an important channel for conducting business. We can buy almost anything from online stores without needing to go to a physical shop. The Web also provides convenient means for us to communicate with each other, to express our views and opinions on anything, and to hold discussions with people from anywhere in the world. The Web is truly a virtual society. In this chapter, we introduce the Web, its history, and the topics that we will discuss in the seminar.

1.1 WHAT IS THE WORLD WIDE WEB?

The World Wide Web is officially defined as a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents. In simpler terms, the Web is an Internet-based computer network that allows users of one computer to access information stored on another through the world-wide network called the Internet. The Web's implementation follows a standard client-server model. In this model, a user relies on a program (called the client) to connect to a remote machine (called the server) where the data is stored. Navigating the Web is done by means of a client program called the browser, such as Netscape, Internet Explorer, or Firefox. A Web browser works by sending requests for information to remote servers, interpreting the returned documents written in HTML, and laying out the text and graphics on the user's computer screen on the client side.

The operation of the Web relies on the structure of its hypertext documents. Hypertext allows Web page authors to link their documents to other related documents residing on computers anywhere in the world. To view these documents, one simply follows the links (called hyperlinks). The idea of hypertext was invented by Ted Nelson in 1965; he also created the well-known hypertext system Xanadu (http://xanadu.com/). Hypertext that also allows other media (e.g., image, audio and video files) is called hypermedia.

1.2 A BRIEF HISTORY OF THE WEB AND THE INTERNET

CREATION OF THE WEB: The Web was invented in 1989 by Tim Berners-Lee, who at that time worked at CERN (Conseil Européen pour la Recherche Nucléaire, the European Laboratory for Particle Physics) in Switzerland. He coined the term World Wide Web, wrote the first Web server, httpd, and the first client program (a browser and editor), WorldWideWeb. It began in March 1989 when Tim Berners-Lee submitted a proposal titled Information Management: A Proposal to his superiors at CERN. In the proposal, he discussed the disadvantages of hierarchical information organization and outlined the advantages of a hypertext-based system. The proposal called for a simple protocol that could request information stored in remote systems through networks, and for a scheme by which information could be exchanged in a common format and documents of individuals could be linked by hyperlinks to other documents. It also proposed methods for reading text and graphics using the display technology at CERN at that time. The proposal essentially outlined a distributed hypertext system, which is the basic architecture of the Web. Initially, the proposal did not receive the needed support. However, in 1990, Berners-Lee re-circulated it and received the support to begin the work. With this project, Berners-Lee and his team at CERN laid the foundation for the future development of the Web as a distributed hypertext system. They introduced their server and browser, the protocol used for communication between clients and the server, the Hyper Text Transfer Protocol (HTTP), the Hyper Text Markup Language (HTML) used for authoring Web documents, and the Universal Resource Locator (URL). And so it began.

MOSAIC AND NETSCAPE BROWSERS: The next significant event in the development of the Web was the arrival of Mosaic. In February 1993, Marc Andreessen from the University of Illinois' NCSA (National Center for Supercomputing Applications) and his team released the first "Mosaic for X" graphical Web browser for UNIX. A few months later, versions of Mosaic were released for the Macintosh and Windows operating systems. This was an important event: for the first time, a Web client with a consistent and simple point-and-click graphical user interface was available for the three most popular operating systems of the time. Mosaic soon made a big splash outside the academic circle where it had begun. In mid-1994, Silicon Graphics founder Jim Clark collaborated with Marc Andreessen, and they founded the company Mosaic Communications (later renamed Netscape Communications). Within a few months, the Netscape browser was released to the public, which started the explosive growth of the Web. Internet Explorer from Microsoft entered the market in August 1995 and began to challenge Netscape. The creation of the World Wide Web by Tim Berners-Lee, followed by the release of the Mosaic browser, are often regarded as the two most significant contributing factors to the success and popularity of the Web.

INTERNET: The Web would not be possible without the Internet, which provides the communication network for the Web to function. The Internet started with the computer network ARPANET in the Cold War era. It was the result of a project in the United States aimed at maintaining control over its missiles and bombers after a nuclear attack. It was supported by the Advanced Research Projects Agency (ARPA), part of the Department of Defense in the United States. The first ARPANET connections were made in 1969, and in 1972 the network was demonstrated at the First International Conference on Computers and Communication, held in Washington D.C. At the conference, ARPA scientists linked computers together from 40 different locations.

In 1973, Vinton Cerf and Bob Kahn started to develop the protocol later to be called TCP/IP (Transmission Control Protocol/Internet Protocol). In the next year, they published the paper Transmission Control Protocol, which marked the beginning of TCP/IP. This new protocol allowed diverse computer networks to interconnect and communicate with each other. In subsequent years, many networks were built, and many competing techniques and protocols were proposed and developed. However, ARPANET was still the backbone of the entire system. During this period, the network scene was chaotic. In 1982, TCP/IP was finally adopted, and the Internet, which is a connected set of networks using the TCP/IP protocol, was born.

SEARCH ENGINES: With information being shared worldwide, there was a need for individuals to find information in an orderly and efficient manner. Thus began the development of search engines. The search system Excite was introduced in 1993 by six Stanford University students. EINet Galaxy was established in 1994 as part of the MCC Research Consortium at the University of Texas. Jerry Yang and David Filo created Yahoo! in 1994; it started out as a listing of their favourite Web sites and offered directory search. In subsequent years, many search systems emerged, e.g., Lycos, Infoseek, AltaVista, Inktomi, Ask Jeeves, and Northern Light. Google was launched in 1998 by Sergey Brin and Larry Page based on their research project at Stanford University. Microsoft began to commit to search in 2003 and launched the MSN search engine in spring 2005; before that, it had used search technology from other companies. Yahoo! provided its own general search capability in 2004, after purchasing Inktomi in 2003.

W3C (THE WORLD WIDE WEB CONSORTIUM): The W3C was formed in December 1994 by MIT and CERN as an international organization to lead the development of the Web. W3C's main objective was to promote standards for the evolution of the Web and interoperability between WWW products by producing specifications and reference software. The first International Conference on the World Wide Web (WWW) was also held in 1994, and it has been a yearly event ever since. From 1995 to 2001, the growth of the Web boomed. Investors saw commercial opportunities and became involved. Numerous businesses started on the Web, which led to irrational developments. Finally, the bubble burst in 2001. However, the development of the Web did not stop; it has only become more rational since.

1.3 WEB DATA MINING

The rapid growth of the Web in the last decade has made it the largest publicly accessible data source in the world. The Web has many unique characteristics, which make mining useful information and knowledge a fascinating and challenging task. Let us review some of these characteristics.

1. The amount of data/information on the Web is huge and still growing. The coverage of the information is also very wide and diverse. One can find information on almost anything on the Web.

2. Data of all types exist on the Web, e.g., structured tables, semi-structured Web pages, unstructured texts, and multimedia files (images, audio, and video).

3. Information on the Web is heterogeneous. Due to the diverse authorship of Web pages, multiple pages may present the same or similar information using completely different words and/or formats. This makes the integration of information from multiple pages a challenging problem.

4. A significant amount of information on the Web is linked. Hyperlinks exist among Web pages within a site and across different sites. Within a site, hyperlinks serve as information organization mechanisms. Across different sites, hyperlinks represent an implicit conveyance of authority to the target pages. That is, pages that are linked (or pointed) to by many other pages are usually high-quality or authoritative pages, simply because many people trust them.

5. The information on the Web is noisy. The noise comes from two main sources. First, a typical Web page contains many pieces of information, e.g., the main content of the page, navigation links, advertisements, copyright notices, and privacy policies. For a particular application, only part of the information is useful; the rest is considered noise. To perform fine-grained Web information analysis and data mining, the noise should be removed. Second, because the Web has no quality control of information, i.e., one can write almost anything one likes, a large amount of information on the Web is of low quality, erroneous, or even misleading.

6. The Web is also about services. Most commercial Web sites allow people to perform useful operations at their sites, e.g., to purchase products, to pay bills, and to fill in forms.

7. The Web is dynamic. Information on the Web changes constantly. Keeping up with the change and monitoring the change are important issues for many applications.

8. The Web is a virtual society. The Web is not only about data, information and services, but also about interactions among people, organizations and automated systems. One can communicate with people anywhere in the world easily and instantly, and express one's views on anything in Internet forums, blogs and review sites.

All these characteristics present both challenges and opportunities for mining and discovery of information and knowledge from the Web. To explore information mining on the Web, it is necessary to know data mining, which has been applied in many Web mining tasks. However, Web mining is not entirely an application of data mining. Due to the richness and diversity of information and the other Web-specific characteristics discussed above, Web mining has developed many of its own algorithms.

1.3.1 WHAT IS DATA MINING?

Data mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases, texts, images, and the Web. The patterns must be valid, potentially useful, and understandable. Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence, information retrieval, and visualization. There are many data mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule mining, and sequential pattern mining. We will discuss all of them in this seminar.

A data mining application usually starts with an understanding of the application domain by data analysts (data miners), who then identify suitable data sources and the target data. With the data, data mining can be performed, usually in three main steps:

Pre-processing: The raw data is usually not suitable for mining, for various reasons. It may need to be cleaned to remove noise or abnormalities. The data may also be too large and/or involve many irrelevant attributes, which calls for data reduction through sampling and attribute selection.

Data mining: The processed data is then fed to a data mining algorithm, which produces patterns or knowledge.

Post-processing: In many applications, not all discovered patterns are useful. This step identifies the useful ones for the application. Various evaluation and visualization techniques are used to make the decision.

The whole process (also called the data mining process) is almost always iterative. It usually takes many rounds to achieve satisfactory final results, which are then incorporated into real-world operational tasks. Traditional data mining uses structured data stored in relational tables, spreadsheets, or flat files in tabular form. With the growth of the Web and text documents, Web mining and text mining are becoming increasingly important and popular.

1.3.2 WHAT IS WEB MINING?

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, as mentioned above it is not purely an application of traditional data mining, due to the heterogeneity and the semi-structured or unstructured nature of Web data. Many new mining tasks and algorithms have been invented in the past decade. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three types: Web structure mining, Web content mining and Web usage mining. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining can be decomposed into these subtasks:

1. Resource finding: retrieving intended Web documents.

2. Information selection and pre-processing: automatically selecting and pre-processing specific information from the retrieved Web resources.

3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.

4. Analysis: validation and/or interpretation of the mined patterns.

Resource finding is the process of retrieving data from text sources available on the Web, such as electronic magazines and newsletters or the text contents of HTML documents. The information selection and pre-processing step transforms the original data retrieved in the resource-finding step, much as pre-processing does in information retrieval (IR). These transformations include removing stop words, finding phrases in the training corpus, and transforming the representation to a relational or first-order logic form (a minimal pre-processing sketch follows below). Data mining and machine learning techniques are often used for generalization. People play a very important role in the information and knowledge discovery process; this is especially true for validation and/or interpretation in the last step.
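To illustrate the selection and pre-processing step, the following minimal Python sketch tokenizes raw page text and removes stop words. The tiny stop-word list and the sample sentence are invented for illustration; real systems use far larger lists and richer transformations.

    import re

    # A tiny illustrative stop-word list; production systems use far larger ones.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def preprocess(raw_text):
        """Lowercase the text, tokenize it, and drop stop words."""
        tokens = re.findall(r"[a-z0-9]+", raw_text.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    page = "The Web is the biggest information source that is easily searchable."
    print(preprocess(page))
    # ['web', 'biggest', 'information', 'source', 'that', 'easily', 'searchable']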

1.4 WEB MINING CATEGORIES

Web mining is categorized into three areas of interest, based on which part of the Web is mined:

1. Web content mining: the discovery of useful information from Web contents, data and documents, viewed from two different perspectives: the information retrieval (IR) view and the database (DB) view.

2. Web structure mining: modelling the link structures and topology of hyperlinks, e.g., for categorizing Web pages.

3. Web usage mining: mining secondary data derived from user interactions with the Web.

Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web structure mining is the process of inferring knowledge from the Web's organization and the links between references and referents on the Web. Finally, Web usage mining, also known as Web log mining, is the process of extracting interesting patterns from Web access logs.

In this seminar, we will discuss all three types of mining. However, due to the richness and diversity of information on the Web, there are a large number of Web mining tasks, and we will not be able to cover them all. We will focus on some important tasks and their algorithms. The Web mining process is similar to the data mining process; the difference is usually in the data collection. In traditional data mining, the data is often already collected and stored in a data warehouse. For Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involves crawling a large number of target Web pages (a minimal crawler sketch follows below). We will devote a whole chapter to crawling. Once the data is collected, we go through the same three-step process: data pre-processing, Web data mining and post-processing. However, the techniques used for each step can be quite different from those used in traditional data mining.
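As an indication of what this data collection involves, here is a minimal breadth-first crawler sketch in Python using only the standard library. The seed URL and page limit are arbitrary placeholders; a real crawler would also need politeness delays, robots.txt handling, and robust error recovery.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        """Breadth-first crawl from a seed URL, returning the fetched URLs."""
        seen, frontier, fetched = {seed}, deque([seed]), []
        while frontier and len(fetched) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue  # skip pages that fail to download
            fetched.append(url)
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return fetched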



2. DATA MINING

Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention, to production control and science exploration.

Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has witnessed an evolutionary path in the development of the following functionalities: data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining). For instance, the early development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval, and query and transaction processing. With numerous database systems offering query and transaction processing as common practice, advanced data analysis has naturally become the next target.

Since the 1960s, database and information technology has been evolving systematically from primitive file processing systems to sophisticated and powerful database systems. The research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to the development of relational database systems, data modelling tools, and indexing and access methods. In addition, users gained convenient and flexible data access through query languages, user interfaces, optimized query processing, and transaction management. Efficient methods for on-line transaction processing (OLTP), where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data.

Database technology since the mid-1980s has been characterized by the popular adoption of relational technology and an upsurge of research and development activities on new and powerful database systems. These promote the development of advanced data models such as extended-relational, object-oriented, object-relational, and deductive models. Application-oriented database systems, including spatial, temporal, multimedia, active, stream, sensor, scientific and engineering databases, knowledge bases, and office information bases, have flourished. Issues related to the distribution, diversification, and sharing of data have been studied extensively. Heterogeneous database systems and Internet-based global information systems such as the World Wide Web (WWW) have also emerged and play a vital role in the information industry.

The steady and amazing progress of computer hardware technology in the past three decades has led to large supplies of powerful and affordable computers, data collection equipment, and storage media. This technology provides a great boost to the database and information industry, and makes a huge number of databases and information repositories available for transaction management, information retrieval, and data analysis. Data can now be stored in many different kinds of databases and information repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making. Data warehouse technology includes data cleaning, data integration, and on-line analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles. Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as data classification, clustering, and the characterization of data changes over time. In addition, huge volumes of data can be accumulated beyond databases and data warehouses. Typical examples include the World Wide Web and data streams, where data flow in and out like streams, as in applications such as video surveillance, telecommunication, and sensor networks. The effective and efficient analysis of data in such different forms has become a challenging task.

The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data-rich but information-poor situation. The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability to comprehend it without powerful tools. As a result, data collected in large data repositories become data tombs: data archives that are seldom visited. Consequently, important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker's intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data. In addition, consider expert system technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely time-consuming and costly. Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research. The widening gap between data and information calls for a systematic development of data mining tools that will turn data tombs into golden nuggets of knowledge.

Figure 1: DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE DISCOVERY



2.1 DATA MINING: A BRIEF OVERVIEW

Simply stated, data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should perhaps have been named "knowledge mining from data," which is unfortunately somewhat long. "Knowledge mining," a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process of finding a small set of precious nuggets in a great deal of raw material. Thus, such a misnomer, carrying both "data" and "mining," became a popular choice. Many other terms have a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is depicted in Figure 1 and consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be combined)

3. Data selection (where data relevant to the analysis task are retrieved from the database)

4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, e.g., by performing summary or aggregation operations)

5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)



Steps 1 to 4 are different forms of data pre-processing, in which the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation. We agree that data mining is a step in the knowledge discovery process. However, in industry, in the media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data. Therefore, in this seminar, we choose to use the term data mining. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Based on this view, the architecture of a typical data mining system may have the following major components (Figure 2):

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Figure 2: ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of the summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data analysis. Although there are many data mining systems on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data should be more appropriately categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases, should be more appropriately categorized as a database system, an information retrieval system, or a deductive database system.
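To make the knowledge discovery steps listed in Section 2.1 concrete, here is a toy end-to-end sketch in Python with pandas. The transaction data is fabricated, and the "mining" step (a simple co-occurrence count) merely stands in for a real mining algorithm.

    from collections import Counter
    from itertools import combinations

    import pandas as pd

    # Fabricated transactions standing in for an integrated data source (steps 1-2).
    raw = pd.DataFrame({
        "txn":  [1, 1, 2, 2, 3, 3, None],
        "item": ["beer", "diapers", "beer", "diapers", "beer", "bread", "milk"],
    })

    clean = raw.dropna(subset=["txn"])                    # step 1: data cleaning
    relevant = clean[["txn", "item"]]                     # step 3: data selection
    baskets = relevant.groupby("txn")["item"].apply(set)  # step 4: transformation

    # Step 5, "mining": count how often each pair of items occurs together.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )

    # Step 6, pattern evaluation: keep pairs supported by at least 2 transactions.
    patterns = {p: c for p, c in pair_counts.items() if c >= 2}
    print(patterns)  # step 7 would present this, e.g. {('beer', 'diapers'): 2}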

2.2 DATA MINING TECHNIQUES

Neural Networks/Pattern Recognition: Neural networks are used in a black-box fashion. One creates a training data set, lets the neural network learn patterns based on known outcomes, then sets the neural network loose on huge amounts of data. For example, suppose a credit card company has 3,000 records, 100 of which are known fraud records. The data set trains the neural network so that it knows the difference between the fraud records and the legitimate ones; the network learns the patterns of the fraud records. Then the network is run against the company's million-record data set, and it flags the records with patterns the same as, or similar to, the fraud records. Neural networks are known for not being very helpful in teaching analysts about the data; they just find matching patterns. Neural networks have been used for optical character recognition to help the Post Office automate the delivery process without having to use humans to read addresses.

Memory-Based Reasoning: This technique produces results similar to neural networks but goes about it differently. MBR looks for "neighbouring" data rather than patterns. If you look at insurance claims and want to know which ones the adjudicators should examine and which can just pass through the system, you would set up a set of claims you want adjudicated and let the technique find similar claims.
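Since memory-based reasoning is essentially nearest-neighbour classification, it can be sketched with scikit-learn's k-nearest-neighbours classifier (assuming scikit-learn is available; the two-feature fraud data below is fabricated to mirror the credit card example above).

    from sklearn.neighbors import KNeighborsClassifier

    # Fabricated records: [transaction amount, transactions in the last hour]
    known = [[20, 1], [35, 2], [900, 9], [15, 1], [850, 8]]
    labels = [0, 0, 1, 0, 1]          # 0 = legitimate, 1 = known fraud

    # "Memory-based": the model simply stores the labelled records and
    # classifies new ones by similarity to their nearest neighbours.
    mbr = KNeighborsClassifier(n_neighbors=3).fit(known, labels)

    new_records = [[25, 1], [880, 10]]
    print(mbr.predict(new_records))   # expected: [0 1]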

Cluster Detection/Market Basket Analysis: This is where the classic beer-and-diapers bought-together analysis came from. It finds groupings. Basically, this technique finds relationships among products or customers, or wherever you want to find associations in data.

Link Analysis: This is another technique for associating like records. It is not used much, but there are some tools created just for this. As the name suggests, the technique tries to find links, whether among customers, transactions, etc., and demonstrate those links.

Visualization: This technique helps users understand their data. Visualization makes the bridge from text-based to graphical presentation. Decision tree, rule, cluster and pattern visualization help users see data relationships rather than read about them. Many of the stronger data mining programs have made strides in improving their visual content over the past few years. This is really the vision of the future of data mining and analysis. Data volumes have grown to such huge levels that it will soon be impossible for humans to process them effectively by any text-based method. We will probably see an approach to data mining using visualization appear that will be something like Microsoft's Photosynth. The technology is there; it will just take an analyst with some vision to sit down and put it together.

Decision Tree/Rule Induction: Decision trees use real data mining algorithms. Decision trees help with classification and produce output that is very descriptive, helping users to understand their data. A decision tree process will generate the rules followed in a process (a sketch appears below). For example, a lender at a bank goes through a set of rules when approving a loan. Based on the loan data a bank has, the outcomes of the loans (default or paid), and the limits of acceptable default levels, a decision tree can set up the guidelines for the lending institution. These decision trees are very similar to the first decision support (or expert) systems.

Genetic Algorithms: GAs are techniques that act like bacteria growing in a Petri dish. You set up a data set and then give the GA the ability to judge whether a direction or outcome is favourable. The GA will move in a direction that will hopefully optimize the final result. GAs are used mostly for process optimization, such as scheduling, workflow, batching, and process re-engineering. Think of a GA as simulations run over and over to find optimal results, together with the infrastructure for running the simulations and specifying which results count as optimal.
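Returning to the lending example under Decision Tree/Rule Induction above, here is a minimal sketch with scikit-learn's decision tree. The loan records are fabricated and far too few for a real model; export_text prints the induced rules, which is exactly the kind of descriptive output the paragraph refers to.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Fabricated loan records: [annual income (k$), debt ratio (%)]
    X = [[30, 60], [80, 20], [25, 70], [90, 10], [45, 50], [100, 15]]
    y = [1, 0, 1, 0, 1, 0]            # 1 = defaulted, 0 = repaid

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["income", "debt_ratio"]))
    # Prints human-readable rules, e.g. "debt_ratio <= 35 -> class 0".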

OLAP (On-Line Analytical Processing): OLAP allows users to browse data by following logical questions about the data. OLAP generally includes the ability to drill down into data, moving from highly summarized views into more detailed views. This is generally achieved by moving along hierarchies in the data. For example, if one were analyzing populations, one could start with the most populous continent, then drill down to the most populous country, then to the state level, then to the city level, then to the neighbourhood level. OLAP also includes browsing up hierarchies (drill up), across different dimensions of data (drill across), and many other advanced techniques for browsing data, such as automatic time variation when drilling up or down time hierarchies. OLAP is by far the most implemented and used technique. It is also generally the most intuitive and easy to use.
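Drill-down along such a hierarchy can be mimicked with grouped aggregation; here is a toy pandas sketch with fabricated population figures.

    import pandas as pd

    pop = pd.DataFrame({
        "continent":    ["Asia", "Asia", "Asia", "Europe", "Europe"],
        "country":      ["China", "India", "Japan", "Germany", "France"],
        "population_m": [1412, 1408, 125, 83, 68],
    })

    # Summarized view: population per continent (the top of the hierarchy).
    print(pop.groupby("continent")["population_m"].sum())

    # "Drill down": per-country detail within the most populous continent.
    top = pop.groupby("continent")["population_m"].sum().idxmax()
    print(pop[pop["continent"] == top].sort_values("population_m", ascending=False))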

2.3 KINDS OF DATA

We examine a number of different data repositories on which mining can be performed. In principle, data mining should be applicable to any kind of data repository, as well as to transient data such as data streams. The scope of our examination thus includes relational databases, data warehouses, transactional databases, advanced database systems, flat files, data streams, and the World Wide Web. Advanced database systems include object-relational databases and specific application-oriented databases, such as spatial databases, time-series databases, text databases, and multimedia databases. The challenges and techniques of mining may differ for each repository system.

2.3.1 RELATIONAL DATABASES

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs provide mechanisms for defining database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access.



2.3.2 DATA WAREHOUSES

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. A data cube provides a multidimensional view of the data and allows the pre-computation and fast access of summarized data (a toy sketch of such a cube follows below).

2.3.3 TRANSACTIONAL DATABASES

In general, a transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction. The transactional database may have additional tables associated with it, which contain other information.
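The summarized views that a data cube pre-computes can be imitated with a pandas pivot table; the items, regions and amounts below are invented for illustration.

    import pandas as pd

    sales = pd.DataFrame({
        "item":   ["tv", "tv", "phone", "phone"],
        "region": ["east", "west", "east", "west"],
        "amount": [1200, 900, 700, 1100],
    })

    # A 2-D "cube" over the item and region dimensions, with totals as margins.
    cube = sales.pivot_table(index="item", columns="region",
                             values="amount", aggfunc="sum", margins=True)
    print(cube)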

2.4 CLASSIFICATION OF DATA MINING SYSTEMS

Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology. Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. It is therefore necessary to provide a clear classification of data mining systems, which may help potential users distinguish between such systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows:



Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, stream-data, or multimedia data mining system, or a World Wide Web mining system.

Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.

Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique that combines the merits of a few individual approaches.

Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they are adapted to. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.



2.5 DATA MINING TASK PRIMITIVES

Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. These primitives allow the user to communicate interactively with the data mining system during discovery, in order to direct the mining process or examine the findings from different angles or depths. The data mining primitives specify the following:

The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).

The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. User beliefs regarding relationships in the data are another form of background knowledge.

The interestingness measures and thresholds for pattern evaluation: These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures.

The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.



3. WEB MINING

The following figure shows the architecture of Web mining in brief. It is divided into two stages: stage 1 covers the collection and preparation of the data, and stage 2 covers the analysis. According to the analysis target, Web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining.

WEB MINING ARCHITECTURE

[Figure 3 diagram: In stage 1 (data preparation), the server data log and registration data pass through Data Cleaning (producing a clean log), Transaction Identification (producing transaction data), Data Integration, and Transformation, yielding documents and usage attributes. In stage 2 (analysis), Pattern Discovery (path analysis, association rules, sequential patterns, clusters and classification rules) is followed by Pattern Analysis (OLAP/visualisation tools, knowledge query mechanism, database query language, intelligent agent).]

Figure 3: ARCHITECTURE OF WEB MINING

Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad). The most commonly used techniques in data mining are artificial neural networks, decision trees, genetic algorithms, the nearest neighbour method, and rule induction. Data mining research has drawn on a number of other fields, such as inductive learning, machine learning and statistics.

Machine learning is the automation of a learning process, where learning is based on observations of environmental statistics and transitions. Machine learning examines previous examples and their outcomes and learns how to reproduce these outcomes and make generalizations about new cases.

Inductive learning: Induction means the inference of information from data, and inductive learning is a model-building process in which the database is analyzed to find patterns. The main strategies are supervised learning and unsupervised learning.

Statistics: Used to detect unusual patterns and to explain patterns using statistical models such as linear models.

A data mining model can be a discovery model, in which the system automatically discovers important information hidden in the data, or a verification model, which takes a hypothesis from the user and tests its validity against the data.

The Web contains a collection of pages that includes countless hyperlinks and huge volumes of access and usage information. Because of the ever-increasing amount of information in cyberspace, knowledge discovery and Web mining are becoming critical for successfully conducting business in the cyber world. Web mining is the discovery and analysis of useful information from the Web. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services (content, structure, and usage).

3.1 APPROACHES OF WEB MINING

Two different approaches were taken in initially defining Web mining:

i. Process-centric view: Web mining as a sequence of tasks.

ii. Data-centric view: Web mining defined in terms of the Web data being used in the mining process.

3.2 MINING TECHNIQUES

The important data mining techniques applied in the Web domain include association rule mining, sequential pattern discovery, clustering, path analysis, classification and outlier discovery.

1. Association Rule Mining: Predicts associations and correlations among sets of items, where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. That is, it:

1) Discovers the correlations between pages that are most often referenced together in a single server session/user session.

2) Provides information such as:

i. What sets of pages are frequently accessed together by Web users?

ii. What page will be fetched next?

iii. What paths are frequently accessed by Web users?

3) Associations and correlations:

i. Page associations from usage data: user sessions, user transactions.

ii. Page associations from content data: similarity based on content analysis.

iii. Page associations based on structure: link connectivity between pages.

Advantages: A guide for Web site restructuring, by adding links that interconnect pages often viewed together, and improved system performance, by prefetching Web data. (A minimal sketch of the underlying computation follows below.)
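Here is a minimal pure-Python sketch of the support/confidence computation over user sessions; the sessions and thresholds are invented, and a real system would use a full algorithm such as Apriori.

    from collections import Counter
    from itertools import combinations

    sessions = [
        {"/home", "/products", "/cart"},
        {"/home", "/products"},
        {"/home", "/blog"},
        {"/home", "/products", "/cart"},
    ]

    MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.6
    n = len(sessions)
    page_count = Counter(p for s in sessions for p in s)
    pair_count = Counter(pair for s in sessions
                         for pair in combinations(sorted(s), 2))

    for (a, b), cnt in pair_count.items():
        support = cnt / n
        confidence = cnt / page_count[a]     # confidence of the rule a -> b
        if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
            print(f"{a} -> {b}  support={support:.2f} confidence={confidence:.2f}")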

2. Sequential Pattern Discovery: Applied to Web access server transaction logs. The purpose is to discover sequential patterns that indicate user visit patterns over a certain period, that is, the order in which URLs tend to be accessed.

Advantages: Useful user trends can be discovered, and predictions concerning visit patterns can be made. It can be used to improve Web site navigation, to personalize advertisements, to dynamically reorganize the link structure and adapt Web site contents to individual client requirements, or to provide clients with automatic recommendations that best suit their profiles.

3. Clustering: Groups together items (users, pages, etc.) that have similar characteristics.

a) Page clusters: groups of pages that seem to be conceptually related according to the users' perception.



b) User clusters: groups of users that seem to behave similarly when navigating through a Web site.

4. Classification: Maps a data item into one of several predetermined classes, for example, describing each user's category using profiles. Classification algorithms include decision trees, the naive Bayesian classifier, and neural networks.

5. Path Analysis: A technique that involves the generation of some form of graph that represents a relation defined on Web pages. This can be the physical layout of a Web site, in which the Web pages are nodes and the links between these pages are directed edges. Most such graphs are used to determine frequent traversal patterns, i.e., the more frequently visited paths in a Web site. Example: what paths do users traverse before they go to a particular URL? (A sketch of path-frequency counting follows at the end of this subsection.)

To use data mining on our Web site, we have to establish and record visitor and item characteristics, and visitor interactions.

Visitor characteristics include:

i. Demographics: tangible attributes such as home address, income, and property.

ii. Psychographics: personality types, such as early technology adoption and buying tendencies.

iii. Technographics: attributes of the visitor's system, such as operating system, browser, and modem speed.

Item characteristics include:

i. Web content information: media type, content category, URL.

ii. Product information: product category, colour, size, price.

Visitor interactions include:

i. Visitor-item interactions: purchase history, advertising history, and preference information.

ii. Visitor site statistics: per-session characteristics, such as total time, pages viewed, and so on.
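The path-frequency counting promised above can be sketched by counting consecutive page-to-page transitions in ordered click streams; the logs below are fabricated.

    from collections import Counter

    # Fabricated ordered click streams, one list of URLs per visit.
    clickstreams = [
        ["/home", "/products", "/cart", "/checkout"],
        ["/home", "/products", "/cart"],
        ["/home", "/blog", "/products"],
    ]

    # Count consecutive page-to-page transitions (edges in the traversal graph).
    transitions = Counter(
        (path[i], path[i + 1])
        for path in clickstreams
        for i in range(len(path) - 1)
    )

    for (src, dst), cnt in transitions.most_common(3):
        print(f"{src} -> {dst}: {cnt} traversals")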

observed and easily correlated information. They rely on users to discover patterns and decide what to do with them. The information is even too complex for humans to discover these patterns using an OLAP system. To solve these problems, data mining techniques are utilized. The scope of data mining is i. Automated prediction of trends, and behaviours ii. Automated discovery of previously unknown patterns. Web mining is searches for i. Web access patterns, ii. Web structure, iii. Regularity and dynamics of web contents. The web mining research is a converging research area from several research communities, such as database, information retrieval, and AI research communities, especially from machine learning and natural language processing. World wide web is a popular and interactive medium to gather information today. The WWW provides every Internet citizen with access to an abundance of information. Users encounter some problems when interacting with the web. i. Finding relevant information (information overload Only a small portion of the web pages contain truly relevant/useful information): a) low precision (the abundance problem 99% of information of no interest to 99% of people) which is due to the irrelevance of many of the search results. This results in a difficulty of finding the relevant information. b) Low recall (limited coverage of the web-Internet sources hidden behind search interface) due to the inability to index all the information available on the web. This results in a difficulty of finding the unindexed

information that is relevant. ii. Discovery of existing but hidden knowledge (retrieve 1/3rd of the indexable web) iii. Personalization of the information (type & presentation of information) Limited customization to individual users. iv. Learning about customers/individual users.WEB MINING Page 26

v. Lack of feedback on human activities. vi. Lack of multidimensional analysis and data mining support. vii. The web constitutes a highly dynamic information source. Not only does the web continue to grow rapidly, the information I holds also receives constant updates. News, stock market, service centre, and corporate sites revise their web pages regularly. Linkage information and access records also undergo frequent updates. viii. The web serves a broad spectrum of user communities. The Internets rapidly expanding user community connects millions of workstations, and usage purposes. Many lack good knowledge of the information networks structure, are unaware of a particular searchs heavy cost, frequently get lost within the webs ocean of information and lengthy waits required to retrieve search results. ix. Web page complexity far exceeds the complexity of any traditional text document collection. Although the web functions as a huge digital library, the pages themselves lack a uniform structure and contain far more authoring style and content variations than any set of books or traditional text-based documents. Moreover, searching it is extremely difficult. Common problems web marketers want to solve are how to target advertisements (Targeting), Personalize web pages (Personalization), create web pages that show products often bought together (associations), classify articles automatically (Classification), characterize group of similar visitors (clustering), estimate missing data and predict future behaviour. In general web mining tasks are: i. Mining web search engine data ii. Analyzing the webs link structures iii. Classifying web document automatically iv. Mining web page semantic structure and page contents v. Mining web dynamics vi. Personalization. Thus, web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the web data. Web mining aimsWEB MINING Page 27

at finding and extracting relevant information that is hidden in web-related data, in particular in text documents that are published on the web like data mining is a multidisciplinary effort that draws technique from fields like information retrieval, statistics, machine learning, natural language processing and others. Web mining can be a

promising tool to address ineffective search engines that produce incomplete indexing, retrieval of irrelevant information/unverified reliability or retrieved information. It is essential to have a system that helps the user find relevant and reliable information easily and quickly on the web. Web mining discovers information from mounds of data on the www, but it also monitors and predicts user visit patterns. This gives designers more reliable information in structuring and designing a web site. Given the rate of growth of the web, scalability of search engines is a key issue, as the amount of hardware and network resources needed is large, and expensive. In

addition, search engines are popular tools, so they have heavy constraints on query answer time. So, the efficient use of resources can improve both scalability and answer time. One tool to achieve this goal is web mining.


3.3 WEB MINING TAXONOMY
Web mining can be broadly divided into three distinct categories, according to the kinds of data to be mined. Figure 4 shows the taxonomy.

Figure 4: WEB MINING TAXONOMY
3.3.1 Web Content Mining
Web content mining is the process of extracting useful information from the contents of web documents. Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables. The application of text mining to web content has been the most widely researched. Issues addressed in text mining include topic discovery and tracking, extracting association patterns, clustering of web documents, and classification of web pages. Research activities on this topic have drawn heavily on techniques developed in other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While there exists a significant body of work on extracting knowledge from images in the fields of image processing and computer vision, the application of these techniques to web content mining has been limited.
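As an illustration of the clustering issue mentioned above, the sketch below groups a few invented page texts using TF-IDF vectors and k-means. It assumes the scikit-learn library, which is one of several possible toolkits:

    # A minimal sketch of content mining by clustering, assuming scikit-learn
    # is installed; the document texts are invented placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "cheap flights and hotel deals for summer travel",
        "book flights online, compare airline prices",
        "python machine learning tutorial with examples",
        "introduction to machine learning and data mining",
    ]

    # Represent each page's text in the vector-space model mentioned above.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Group similar pages together: one simple form of topic discovery.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(labels)  # e.g. [0 0 1 1]: travel pages vs. machine-learning pages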

Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches suggest restructuring the document content into a representation that can be exploited by machines. The usual approach to exploiting known structure in documents is to use wrappers to map documents to some data model. Techniques using lexicons for content interpretation are yet to come. There are two groups of web content mining strategies: those that directly mine the content of documents, and those that improve on the content search of other tools such as search engines.
3.3.2 Web Structure Mining
The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web. It can be further divided into two kinds based on the kind of structure information used.
Hyperlinks: A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink.
Document Structure: In addition, the content within a web page can be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents (Wang and Liu 1998; Moh, Lim, and Ng 2000).
The World Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness, or perhaps the variety, of topics covered in the document. This can be compared to bibliographic citations: when a paper is cited often, it ought to be important. The PageRank and CLEVER methods take advantage of the information conveyed by links to find pertinent web


pages. By means of counters, higher levels accumulate the number of artefacts subsumed by the concepts they hold.
3.3.3 Web Usage Mining
Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications. Usage data captures the identity or origin of web users along with their browsing behaviour at a web site. Web usage mining itself can be classified further depending on the kind of usage data considered:
Web Server Data: User logs are collected by the web server and typically include IP address, page reference, and access time.
Application Server Data: Commercial application servers such as WebLogic and StoryServer have significant features that enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.
Application Level Data: New kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these events. It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the above categories.
Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help in understanding user behaviour and web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in web usage mining, driven by the applications of the discoveries: general access pattern tracking and customized usage tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends.
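As a concrete illustration of the web server data described above, the sketch below parses one access-log entry into the IP address, page reference, and access time. It assumes the Common Log Format, and the sample line is invented:

    # A sketch of reading server-side usage data, assuming the Common Log
    # Format; the sample line below is invented.
    import re

    CLF = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d{3}) \S+'
    )

    line = ('192.0.2.1 - - [10/Oct/2024:13:55:36 +0000] '
            '"GET /products/index.html HTTP/1.0" 200 2326')

    match = CLF.match(line)
    if match:
        # The IP address, page reference and access time listed above.
        print(match["ip"], match["page"], match["time"])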


3.4 THE AXES OF WEB MINING
3.4.1 WWW Impact
The World Wide Web has grown in the past few years from a small research community to the biggest and most popular medium of communication and information dissemination. Every day, the WWW grows by roughly a million electronic pages, adding to the hundreds of millions already online. The WWW serves as a platform for exchanging various kinds of information, ranging from research papers and educational content to multimedia content and software. The continuous growth in the size and use of the WWW demands new methods for processing these huge amounts of data. Because of its rapid and chaotic growth, the resulting network of information lacks organization and structure. Moreover, the content is published in various diverse formats.
3.4.2 Web Data
Web data are those that can be collected and used in the context of Web personalization. These data are classified into four categories: content, structure, usage, and user profile data.
Content data are presented to the end-user appropriately structured. They can be simple text, images, or structured data, such as information retrieved from databases.
Structure data represent the way content is organized. They can be either data entities used within a Web page, such as HTML or XML tags, or data entities used to put a Web site together, such as hyperlinks connecting one page to another.
Usage data represent a Web site's usage, such as a visitor's IP address, time and date of access, complete path accessed, referrer's address, and other attributes that can be included in a Web access log.
User profile data provide information about the users of a Web site. A user profile contains demographic information for each user of a Web site, as well as information about users' interests and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analyzing Web usage logs.
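One possible way to make these four data categories concrete is the following sketch of in-memory record types; the field names are illustrative only, not a standard schema:

    # Illustrative record types for the four Web data categories above.
    from dataclasses import dataclass, field

    @dataclass
    class ContentData:        # what the page presents: text, images, records
        text: str
        images: list[str] = field(default_factory=list)

    @dataclass
    class StructureData:      # how content is organized: tags and hyperlinks
        tags: list[str]
        outlinks: list[str]

    @dataclass
    class UsageData:          # how the site is used: one access-log entry
        ip: str
        timestamp: str
        path: str
        referrer: str

    @dataclass
    class UserProfile:        # who the user is: demographics and interests
        user_id: str
        interests: list[str]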


3.5 WEB MINING PROS AND CONS
PROS
Web mining has many advantages that make this technology attractive to corporations and government agencies. It has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes. Government agencies use this technology to classify threats and fight terrorism. The predictive capability of mining applications can benefit society by identifying criminal activities. Companies can establish better customer relationships by giving customers exactly what they need; they can understand customer needs better and react to those needs faster. Companies can find, attract, and retain customers; they can save on production costs by utilizing the acquired insight into customer requirements; and they can increase profitability by target pricing based on the profiles created. They can even identify a customer who might defect to a competitor, and try to retain that customer by providing promotional offers, thus reducing the risk of losing the customer.
CONS
Web mining itself does not create issues, but when this technology is used on data of a personal nature it can cause concerns. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent. The obtained data are analyzed and clustered to form profiles; the data are made anonymous before clustering so that no personal profiles result. These applications thus de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of their own individual characteristics and merits. Another important concern is that companies collecting data for a specific purpose might use the data for a totally different purpose, which essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their sites. This trend has increased the amount of data being captured and traded, increasing the likelihood of one's privacy being invaded. The companies that buy the data are obliged to make it anonymous, and these companies

are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of the release; any inaccuracies in the release will result in serious lawsuits, but there is no law preventing them from trading the data. Some mining algorithms might use controversial attributes such as sex, race, religion, or sexual orientation to categorize individuals. These practices might be against anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against the usage of such algorithms with such attributes. This process could result in the denial of a service or privilege to an individual based on race, religion, or sexual orientation; at present this situation can be avoided only by the high ethical standards maintained by the data mining company. The collected data is made anonymous so that the obtained data and patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy, but in fact much extra information can be inferred by combining separate pieces of seemingly harmless data about the user.


4. WEB CONTENT MINING
Web content mining is the process of extracting useful information from the content of Web documents. Logical structure, semantic content, and layout are contained in semi-structured Web page text. Topic discovery, extracting association patterns, clustering of Web documents, and classification of Web pages are some of the research issues in text mining. These activities use techniques from other disciplines such as IR, IE (information extraction), and NLP (natural language processing). Automatic extraction of semantic relations and structures from the Web is a growing application of Web content mining. In this area, several algorithms are used: hierarchical clustering algorithms on terms to create concept hierarchies, formal concept analysis and association rule mining to learn generalized conceptual relations, and automatic extraction of structured data records from semi-structured HTML pages. The primary goal of each algorithm is to create a set of formally defined domain ontologies that represent Web site content. Common representation approaches are vector-space models, descriptive logics, first-order logic, relational models, and probabilistic relational models.
Structured data extraction is one of the most widely studied research topics of Web content mining. Structured data on the Web are often very important as they represent their host pages' essential information. Extracting such data allows one to provide value-added services, e.g., shopping and meta-search. In contrast to unstructured texts, structured data are also easier to extract. This problem has been studied by researchers in AI, databases, and data mining.
Discovery of useful information from web contents/data/documents is the application of data mining techniques to content published on the Internet. The web contains many kinds and types of data. Basically, web content consists of several types of data, such as plain text (unstructured), image, audio, video, and metadata, as well as HTML (semi-structured) or XML (structured) documents, dynamic documents, and multimedia documents. Recent research on mining multiple types of data is termed multimedia data mining; thus we could consider multimedia data mining an instance of web content mining. The research on applying data mining techniques to unstructured text is termed knowledge discovery in texts, text data mining, or text mining; hence we could consider text mining an instance of web content mining as well. Research issues addressed in text mining are topic discovery, extracting association patterns, clustering of web documents, and classification of web pages.

4.1 ISSUES IN WEB CONTENT MINING
i. Developing intelligent tools for information retrieval
ii. Finding keywords and key phrases
iii. Discovering grammatical rules and collocations
iv. Hypertext classification/categorization
v. Extracting key phrases from text documents
vi. Learning extraction rules
vii. Hierarchical clustering
viii. Predicting relationships

4.2 WEB CONTENT MINING APPROACHES
Web content mining approaches fall into two groups:
Agent-Based Approach:
1. Intelligent Search Agents
2. Information Filtering/Categorization
3. Personalized Web Agents
Database Approach:
1. Multilevel Databases
2. Web Query Systems
4.2.1 AGENT BASED APPROACHES

The agent-based approach involves AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information. Agent-based approaches focus on intelligent and autonomous web mining tools built on agent technology.
i. Some intelligent web agents can use a user profile to search for relevant information, then organize and interpret the discovered information. Example: Harvest.

ii. Some use various information retrieval techniques and the characteristics of open hypertext documents to organize and filter retrieved information. Example: HyPursuit.
iii. Some learn user preferences and use those preferences to discover information sources for particular users.
Agent-based Web mining systems can be placed into the following three categories:
Intelligent Search Agents: Several intelligent Web agents have been developed that search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents such as Harvest, FAQ-Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources, to retrieve and interpret documents. Agents such as ShopBot and ILA (Internet Learning Agent) interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA learns models of various information sources and translates these into its own concept hierarchy.
Information Filtering/Categorization: A number of Web agents use various information retrieval techniques and the characteristics of open hypertext Web documents to automatically retrieve and categorize them. HyPursuit uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents and structure an information space. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information.
Personalized Web Agents: This category of Web agents learns user preferences and discovers Web information sources based on these preferences, and those of other individuals with similar interests.

4.2.2 DATABASE APPROACHES
The database approach focuses on integrating and organizing the heterogeneous and semi-structured data on the web into more structured, higher-level collections of resources. Metadata or generalizations are extracted from the data and organized into structured collections, which can then be accessed and analyzed.


Database approaches to Web mining have focused on techniques for organizing the semi-structured data on the Web into more structured collections of resources, and on using standard database querying mechanisms and data mining techniques to analyze them.
Multilevel Databases: The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various Web repositories, such as hypertext documents. At the higher levels, metadata or generalizations are extracted from lower levels and organized in structured collections, i.e., relational or object-oriented databases. For example, Han et al. use a multilayered database where each layer is obtained via generalization and transformation operations performed on the lower layers. Kholsa et al. propose the creation and maintenance of meta-databases at each information-providing domain and the use of a global schema for the meta-database, with incremental integration of a portion of the schema from each information source rather than reliance on a global heterogeneous database schema. The ARANEUS system extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views.
Web Query Systems: Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries used in World Wide Web searches. W3QL combines structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques. WebLog is a logic-based query language for restructuring and extracting information from Web information sources. Lorel and UnQL query heterogeneous and semi-structured information on the Web using a labelled graph data model. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.

4.3 WEB CONTENT MINING TASKS
4.3.1 Structured Data Extraction
This is perhaps the most widely studied research topic of Web content mining. One reason for its importance and popularity is that structured data on the Web often represent their host pages' essential information, e.g., lists of

products and services. Extracting such data allows one to provide value-added services, e.g., comparative shopping and meta-search. Structured data are also easier to extract than unstructured texts. This problem has been studied by researchers in the AI, database, data mining, and Web communities. There are several approaches to structured data extraction, which is also called wrapper generation. The first approach is to manually write an extraction program for each Web site based on observed format patterns of the site. This approach is very labour-intensive and time-consuming, and thus does not scale to a large number of sites. The second approach is wrapper induction or wrapper learning, which is currently the main technique. Wrapper learning works as follows: the user first manually labels a set of training pages; a learning system then generates rules from the training pages; the resulting rules are then applied to extract target items from Web pages. The third approach is the automatic approach: since structured data objects on the Web are normally database records retrieved from underlying databases and displayed in Web pages with fixed templates, automatic methods aim to find patterns/grammars in the Web pages and then use them to extract data.
4.3.2 Unstructured Text Extraction
Most Web pages can be seen as text documents. Extracting information from Web documents has also been studied by many researchers. The research is closely related to text mining, information retrieval, and natural language processing. Current techniques are mainly based on machine learning and natural language processing to learn extraction rules. Recently, a number of researchers have also made use of common language patterns (common sentence structures used to express certain facts or relations) and the redundancy of information on the Web to find concepts, relations among concepts, and named entities. The patterns can be learnt automatically or supplied by human users. Another direction of research in this area is Web question-answering. Although question-answering was first studied in the information retrieval literature, it has become very important on the Web, as the Web offers the largest source of information and the objective of many Web search queries is to obtain answers to simple questions. Question-answering is extended to the Web by query transformation, query expansion, and then selection.
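Sections 4.3.1 and 4.3.2 both revolve around extraction rules. To make the notion concrete, here is a toy wrapper in the spirit of the manually written approach of Section 4.3.1: a hand-crafted rule keyed to one site's observed template. The HTML snippet and product data are invented:

    # A toy wrapper: an extraction rule written against one site's observed
    # page template. The HTML below is invented for illustration.
    import re

    html = """
    <li class="product"><span class="name">USB cable</span><span class="price">$4.99</span></li>
    <li class="product"><span class="name">Mouse</span><span class="price">$12.50</span></li>
    """

    RULE = re.compile(
        r'<span class="name">(?P<name>[^<]+)</span>'
        r'<span class="price">\$(?P<price>[\d.]+)</span>'
    )

    # Apply the extraction rule to every record rendered from the template.
    records = [(m["name"], float(m["price"])) for m in RULE.finditer(html)]
    print(records)  # [('USB cable', 4.99), ('Mouse', 12.5)]

A wrapper-induction system would learn a rule of this kind from labelled training pages instead of having a human write it.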


4.3.3 Web Information Integration
Due to the sheer scale of the Web and its diverse authorships, various Web sites may use different syntaxes to express similar or related information. In order to make use of, or to extract, information from multiple sites to provide value-added services, e.g., meta-search and deep Web search, one needs to semantically integrate information from multiple sources. Recently, several researchers have attempted this task. Two popular problems related to the Web are (1) Web query interface integration, to enable querying multiple Web databases, and (2) schema matching, e.g., integrating Yahoo!'s and Google's directories to match concepts in the hierarchies. The ability to query multiple deep Web databases is attractive and interesting because the deep Web contains a huge amount of information or data that is not indexed by general search engines.
4.3.4 Building Concept Hierarchies
Because of the huge size of the Web, organization of information is obviously an important issue. Although it is hard to organize the whole Web, it is feasible to organize the Web search results of a given query. A linear list of ranked pages produced by search engines is insufficient for many applications. The standard method for information organization is a concept hierarchy and/or categorization. The popular technique for hierarchy construction is text clustering, which groups similar search results together in a hierarchical fashion. An alternative approach exploits existing organizational structures in the original Web documents, emphasizing tags and language patterns, to perform data mining that finds important concepts, sub-concepts, and their hierarchical relationships. In other words, it makes use of the information redundancy property and the semi-structured nature of the Web to find what concepts are important and what their relationships might be. Such work aims to compile a survey article or a book from the Web automatically.
4.3.5 Segmenting Web Pages and Detecting Noise
A typical Web page consists of many blocks or areas, e.g., main content areas, navigation areas, advertisements, etc. It is useful to separate these areas automatically for several practical applications. For example, in Web data mining tasks such as classification and clustering, identifying main content areas or removing noisy blocks (e.g., advertisements, navigation panels, etc.) enables one to produce much better results; it has been shown that the information contained in noisy blocks can seriously harm Web data mining. Another application is Web browsing on a small-screen device, such as a PDA. Identifying

different content blocks allows one to re-arrange the layout of the page so that the main contents can be seen easily without losing any other information from the page.
4.3.6 Mining Web Opinion Sources
Consumer opinions used to be very difficult to obtain before the Web was available. Companies usually conducted consumer surveys or engaged external consultants to find such opinions about their products and those of their competitors. Now much of this information is publicly available on the Web. There are numerous Web sites and pages containing consumer opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. This online word-of-mouth behaviour represents a new and measurable source of information for marketing intelligence. Techniques are now being developed to exploit these sources to help companies and individuals gain such information effectively and easily. For instance, one approach proposes a feature-based summarization method to automatically analyze consumer opinions in customer reviews from online merchant sites and dedicated review sites. The resulting summary is useful to both potential customers and product manufacturers.
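A very rough sketch of the feature-based summarization idea follows. The reviews, product features, and sentiment lexicon are all invented, and real systems use far more sophisticated NLP than this word-window heuristic:

    # A naive, lexicon-based sketch of feature-based opinion summarization.
    # Reviews, feature names and sentiment words are invented placeholders.
    import re
    from collections import defaultdict

    reviews = [
        "The battery life is great but the screen is dim.",
        "Love the screen, battery could be better.",
    ]
    features = {"battery", "screen"}
    positive = {"great", "love", "good"}
    negative = {"dim", "poor", "better"}   # "could be better" read as a complaint

    summary = defaultdict(lambda: {"pos": 0, "neg": 0})
    for review in reviews:
        words = re.findall(r"[a-z']+", review.lower())
        for i, word in enumerate(words):
            if word in features:
                # Naively attribute nearby opinion words to this feature.
                window = words[max(0, i - 3): i + 4]
                summary[word]["pos"] += sum(w in positive for w in window)
                summary[word]["neg"] += sum(w in negative for w in window)

    print(dict(summary))  # per-feature counts of positive and negative mentions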


5. WEB STRUCTURE MINING
Web structure mining operates on the web's hyperlink structure. This graph structure can provide information about page ranking or authoritativeness and enhance search results through filtering; that is, it tries to discover the model underlying the link structures of the web. This model is used to analyze the similarity and relationships between different web sites, using the hyperlink structure of the web as an additional information source. This type of mining can be further divided into two kinds based on the kind of structural data used.
a) HYPERLINKS: A hyperlink is a structural unit that connects a web page to a different location, either within the same web page (an intra-document hyperlink) or on a different web page (an inter-document hyperlink).
b) DOCUMENT STRUCTURE: In addition, the content within a web page can be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents.
Web link analysis is used for:
1. ordering documents matching a user query (ranking)
2. deciding what pages to add to a collection
3. page categorization
4. finding related pages
5. finding duplicated web sites
6. finding similarities between sites
Web structure mining uses the hyperlink structure of the Web to yield useful information, including definitive page specification, hyperlinked communities

identification, Web page categorization, and Web site completeness evaluation. Web structure mining can be divided into two categories based on the kind of structured data used:


1. Web graph mining: The Web provides additional information about how different documents are connected to each other via hyperlinks. The Web can be viewed as a (directed) graph whose nodes are Web pages and whose edges are hyperlinks between them.
2. Deep Web mining: The Web also contains a vast amount of non-crawlable content. This hidden part of the Web is referred to as the deep Web or the hidden Web. Compared to the static surface Web, the deep Web contains a much larger amount of high-quality structured information.
Most mining algorithms that improve the performance of Web search are based on two assumptions:
(a) Hyperlinks convey human endorsement. If there exists a link from page A to page B, and these two pages are authored by different people, then the first author found the second page valuable. Thus the importance of a page can be propagated to the pages it links to.
(b) Pages that are co-cited by a certain page are likely related to the same topic. The popularity or importance of a page is correlated with the number of incoming links to some extent, and related pages tend to be clustered together through dense linkages among them.
Web information extraction has the goal of pulling information out of a collection of Web pages and converting it into a homogeneous form that is more readily digested and analyzed, by both humans and machines. The result of IE can be used to improve the indexing process, because IE removes irrelevant information in Web pages, and it facilitates other advanced search functions due to the structured nature of the data. It is usually difficult or even impossible to directly obtain the structures of Web sites' back-end databases without cooperation from the sites. Instead, the sites present two other distinguishing structures: the interface schema and the result schema. The interface schema is the schema of the query interface, which exposes the attributes that can be queried in the back-end database. The result schema is the schema of the query results, which exposes the attributes that are shown to users.
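Assumption (a) is the basis of the PageRank method mentioned earlier. The following minimal power-iteration sketch, over an invented four-page graph, shows how endorsement propagates along hyperlinks:

    # A minimal power-iteration PageRank over an invented four-page graph,
    # illustrating assumption (a): endorsement propagates along hyperlinks.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    pages = list(links)
    damping = 0.85
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # iterate until the scores settle
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new

    # Page C, which collects links from the other three pages, scores highest.
    print(sorted(rank.items(), key=lambda kv: -kv[1]))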


6. WEB USAGE MINING
Web usage mining is a part of web mining, which, in turn, is a part of data mining. As data mining involves extracting meaningful and valuable information from large volumes of data, web usage mining involves mining the usage characteristics of the users of web applications. This extracted information can then be used in a variety of ways, such as improvement of the application, checking for fraudulent elements, etc.
Web usage mining is often regarded as part of the Business Intelligence of an organization rather than its technical aspect. It is used for deciding business strategies through the efficient use of web applications. It is also crucial for Customer Relationship Management (CRM), as it can ensure customer satisfaction as far as the interaction between the customer and the organization is concerned.
The major problem with web mining in general, and web usage mining in particular, is the nature of the data involved. With the upsurge of the Internet in this millennium, web data has become huge, and a lot of transactions and usages take place by the second. Apart from the volume, the data is not completely structured; it is in a semi-structured format and needs a great deal of preprocessing and parsing before the actual extraction of the required information. Here we take up a small part of the web usage mining process, which involves preprocessing, user identification, bot removal, and analysis of the usage data.

6.1 WEB USAGE MINING ARCHITECTURE
The WEBMINER is a system that implements parts of this general architecture. The architecture divides the Web usage mining process into two main parts. The first part includes the domain-dependent processes of transforming the Web data into a suitable transaction form, comprising preprocessing, transaction identification, and data integration components. The second part includes the largely domain-independent application of generic data mining and pattern matching techniques (such as the discovery of association rules and sequential patterns) as part of the system's data mining engine. Together these form the overall architecture for the Web usage mining process.
Data cleaning is the first step performed in the Web usage mining process. Some low-level data integration tasks may also be

performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. After the data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules. The goal of transaction identification is to create meaningful clusters of references for each user. The task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. The input and output transaction formats match, so that any number of modules can be combined in any order, as the data analyst sees fit. Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data mining task. For instance, the format of the data for the association rule discovery task may be different from the format necessary for mining sequential patterns. Finally, a query mechanism allows the user (analyst) to exert more control over the discovery process by specifying various constraints.
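The transaction identification step described above can be illustrated with one common heuristic: a new session starts whenever a user is silent for more than 30 minutes. The log entries below are invented and assumed to be already cleaned and sorted by time:

    # A sketch of one transaction-identification module: splitting a user's
    # page references into sessions with a 30-minute timeout heuristic.
    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)

    entries = [  # invented (ip, timestamp, page) records for one user
        ("192.0.2.1", datetime(2024, 10, 10, 9, 0), "/index.html"),
        ("192.0.2.1", datetime(2024, 10, 10, 9, 5), "/products.html"),
        ("192.0.2.1", datetime(2024, 10, 10, 11, 0), "/index.html"),
    ]

    sessions, last_seen = [], None
    for ip, ts, page in entries:
        # Start a new session on the first hit or after a long silence.
        if last_seen is None or ts - last_seen > TIMEOUT:
            sessions.append([])
        sessions[-1].append(page)
        last_seen = ts

    print(sessions)  # [['/index.html', '/products.html'], ['/index.html']]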

6.2 WEB DATA
In Web usage mining, data can be collected from server logs, browser logs, proxy logs, or an organization's database. These data collections differ in terms of the location of the data source, the kinds of data available, the segment of population from which the data was collected, and the methods of implementation. There are many kinds of data that can be used in Web mining:
1. Content: The visible data in the Web pages, i.e., the information that is meant to be imparted to the users. A major part of it consists of text and graphics (images).
2. Structure: Data that describes the organization of the website. It is divided into two types: intra-page structure information, which includes the arrangement of various HTML or XML tags within a given page, and inter-page structure information, of which the principal kind is the hyperlinks used for site navigation.
3. Usage: Data that describes the usage patterns of Web pages, such as IP addresses, page references, and the date and time of accesses, among various other attributes.