

WEB USAGE MINING USING APRIORI ALGORITHM: UUM LEARNING CARE PORTAL CASE

Azizul Azhar bin Ramli
Information System Department, Faculty of Information Technology and Multimedia,
Kolej Universiti Teknologi Tun Hussein Onn (KUiTTHO), 86400 Parit Raja, Batu Pahat, Johor D/T
Tel: +607-4538000 ext. 8056, Fax: +607-4532119
Email: [email protected]

Abstract

The enormous amount of information on the World Wide Web makes it an obvious candidate for data mining research. The application of data mining techniques to the World Wide Web is referred to as Web mining, a term that has been used in three distinct ways: Web Content Mining, Web Structure Mining and Web Usage Mining. E-Learning is one Web-based application that has to deal with large amounts of data. In order to discover the usage patterns and user behaviors of the university E-Learning portal (UUM Educare), this paper implements the high-level process of Web Usage Mining using a basic Association Rules algorithm called the Apriori algorithm. Web Usage Mining consists of three main phases, namely Data Preprocessing, Pattern Discovery and Pattern Analysis. Server log files form the set of raw data, which must go through all the Web Usage Mining phases to produce the final results. Here, the Web Usage Mining approach is combined with the basic Association Rules algorithm, Apriori, to optimize the content of the university E-Learning portal. Finally, this paper presents an overview of the results analysis; Web administrators can use the findings to take suitable actions.

KEY WORDS: server log file, data mining, Web mining, Web Usage Mining, Association Rules, Apriori algorithm.


1.0 PROJECT OVERVIEW

Data mining is a technique used to deduce useful and relevant information to guide professional decisions and other scientific research (Chen, Han and Yu, 1996). It is a cost-effective way of analyzing large amounts of data, especially when a human could not analyze such datasets. The mass adoption of the internet has made automatic knowledge extraction from Web log files a necessity. Information providers are interested in techniques that can learn Web users' information needs and preferences. This can improve the effectiveness of their Web sites by adapting the information structure of the sites to the users' behavior. Recently, the advent of data mining techniques for discovering usage patterns from Web data (Web Usage Mining) indicates that these techniques can be a viable alternative to traditional decision making tools (Srivastava et al., 2000).

Web Usage Mining is the process of applying data mining techniques to the discovery of usage patterns from Web data and is targeted towards applications (Srivastava et al., 2000). Web Usage Mining mines the secondary data (Web server access logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, mouse clicks and any other data resulting from interaction with the Web) derived from the interactions of the users during a certain period of Web sessions.

This paper explores the use of Web Usage Mining techniques to analyze Web log records collected from an E-Learning portal (UUM Educare). Using commercial Web data mining tools (WebLog Expert Lite 3.5 and Sawmill 7) and ARunner 1.0 (a prototype GUI for Christian Borgelt's Apriori tool by Shamrie Sainin, FTM, UUM), this study identified several Web access patterns by applying a well-known data mining technique (the Apriori algorithm) to the access logs of this educational portal. This includes descriptive statistics and Association Rules for the portal, including support and confidence, to represent the Web usage and user behavior for UUM Educare. The results and findings of this experimental analysis can be used by the Web administrators, and possibly by upper-level management in the UUM community, to plan upgrades and enhancements to the portal presentation.

Objective and Scopes of Project

Generally, the main objective of this project is to perform the Web Usage Mining process, specifically:

i. To preprocess the UUM Educare server log files from the university E-Learning Web servers in order to determine and discover the user access patterns.

ii. To apply the basic Association Rules algorithm, Apriori, in implementing the Web Usage Mining process, producing usage patterns by determining which of the options provided by the university E-Learning portal are of greatest interest to users.

iii. To analyze the output usage patterns and user behaviors for UUM Educare from the Web Usage Mining implementation process.

The scopes of this project are:

i. Research Organization: University E-Learning portal (UUM Educare)

ii. Focus of Project: Extract the server log files from the university E-Learning server for certain weeks within a semester, preprocess the set of raw data, select the data that contribute to pattern analysis, and implement pattern mining using the basic Association Rules algorithm, Apriori, in order to produce the final results (rules).


2.0 LITERATURE REVIEW

Data Mining, Web Mining and Web Usage Mining

Data mining (DM) is a step in the Knowledge Discovery in Databases (KDD) process, which is defined as the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data (Fayyad et al., 1996). The term pattern here refers to some abstract representation of a subset of the data, that is, an expression in some language describing a data subset or a model applicable to that subset.

Data mining efforts associated with the Web, called Web mining, can be broadly categorized into three areas of interest based on which part of the Web is mined: Web Content Mining, Web Structure Mining and Web Usage Mining (Kosala and Blockeel, 2000). In Web mining, data can be collected at the server side, the client side, proxy servers or a consolidated Web/business database (Srivastava et al., 2000). The information provided by these data sources can be used to construct several data abstractions, namely users, page-views, click-streams and server sessions.

Web Usage Mining is defined as the process of applying data mining techniques to the discovery of usage patterns from Web log data in order to identify Web users' behavior (Srivastava et al., 2000). It is the type of Web mining activity that involves the automatic discovery of user access patterns from one or more Web servers. As shown in Fig. 1, three main tasks are performed in Web Usage Mining: Preprocessing, Pattern Discovery and Pattern Analysis. Fig. 1 gives a brief description of the main tasks of the Web Usage Mining process.


Association Rules and Apriori Algorithm

The problem of deriving Association Rules from data was first formulated by Agrawal, Imielinski and Swami (1993) and is called the market-basket problem. We are given a set of items and a large collection of transactions, which are sets (baskets) of items. The task is to find relationships between the occurrences of various items within those baskets. Apart from the supermarket scenario, there are many other examples where Association Rules have been used, for example users' visits to WWW pages, whose structure and content can then be optimized. Xue et al. (2001) used a re-ranking method and generalized Association Rules to extract access patterns of Web site usage. Mannila et al. (1999) used page accesses from a Web server log as events for discovering frequent episodes. Chen et al. (1996) introduced the concept of using maximal forward references in order to break down user sessions into transactions for the mining of traversal patterns. Batista and Silva (2001) performed a mining process on the Web access logs of an online newspaper using the Apriori algorithm. The task in Association Rules mining involves finding all rules that satisfy user-defined constraints on minimum support and confidence with respect to a given dataset.
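To make the two constraints concrete, the following minimal sketch computes support and confidence over a handful of transactions. The session contents are hypothetical and only illustrate the calculation; they are not taken from the UUM Educare logs, and the function names are chosen here for illustration.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Hypothetical sessions: each set holds the pages requested in one visit.
sessions = [
    frozenset({"/main", "/dms"}),
    frozenset({"/main", "/dms", "/announcement"}),
    frozenset({"/main", "/assessment"}),
    frozenset({"/dms"}),
]

sup_main_dms = support({"/main", "/dms"}, sessions)        # 2 of 4 sessions
conf_main_to_dms = confidence({"/main"}, {"/dms"}, sessions)  # 2 of the 3 /main sessions
```

A rule such as /main => /dms would be kept only if both values reach the user-defined minimum support and minimum confidence.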


The most commonly used Association Rule discovery algorithm that utilizes the frequent itemset strategy is the Apriori algorithm (Agrawal et al., 1993). Apriori was the first scalable algorithm designed for association-rule mining. It is an improvement over the AIS and SETM algorithms (Agrawal and Srikant, 1994). The Apriori algorithm searches for large itemsets during its initial database pass and uses the result as the seed for discovering other large itemsets during subsequent passes. Itemsets having a support level above the minimum are called large or frequent itemsets, and those below are called small itemsets (Chen et al., 1996). The algorithm is based on the large itemset property, which states that any subset of a large itemset is large and that, if an itemset is not large, none of its supersets are large (Agrawal and Srikant, 1994).

Web Usage Mining in Educational Field

In a Web-based learning environment, where tutors and learners are separated spatially and physically, student modeling is one of the biggest challenges. Traditional student modeling techniques are inapplicable in these systems, where tutors are overwhelmed by the huge volumes of sequential data generated as learners browse through the Web pages (Agrawal and Srikant, 1995). Web mining techniques, including clustering and Association Rules mining, can be applied to extract hidden and interesting knowledge to facilitate instructional planning and student diagnosis. Web mining in education is not new. It has been applied to mine aggregate paths for learners engaged in a distance education environment (Ha, Bae and Park, 2000); to find relevant words for students based on text mining of the documents they browsed (Ochi et al., 1998); to recommend e-articles for students based on keyword-driven text mining (Tang et al., 2000); and to analyze learners' learning behaviors (Zaiane and Luo, 2001).
Previous research proposed going beyond usage mining to consider the content of the pages that have been visited. In an E-Learning system, both learners' browsing behaviors and course content are important for deriving learners' learning levels, intentions, goals, interests or abilities. Incorporating course content can aid in understanding learners' browsing habits. In particular, understanding the learners' browsing behaviors can facilitate course content personalization. The existing system, called Artificial Intelligence in Education (AIED), employs a knowledge base, a student model and instructional plans. For a Web-based AIED system, Web mining becomes part of student modeling. The system can relate its mined knowledge of page contents and student navigation patterns to students' level of understanding in order to decide upon appropriate feedback to them (Tang et al., 2001).

3.0 PROJECT METHODOLOGY AND IMPLEMENTATION

The Web Usage Mining process proposed by Srivastava et al. (2000) serves as the major guideline for the project implementation. Fig. 3 shows the general flow of the project methodology.

Server Log File

The server log files dated from 19 February 2004 until 13 March 2004 were selected for further analysis. The server log files were retrieved from the UUM Educare server, www.e-web.uum.edu.my. The total size of the server log files for that period is about 650 MB, and this large amount of data is the most challenging problem to handle during the Data Preprocessing phase. Each record in the server log file is a single line consisting of nine attributes, as shown in Fig. 4.
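As an illustration of what reading one such record might look like, the sketch below splits a space-separated log line into nine named fields. The field names, their order and the sample line are assumptions made for this example (loosely modeled on the W3C extended log format); the actual nine attributes are those shown in Fig. 4 and may differ.

```python
from typing import Dict, Optional

# ASSUMED field order; the real attributes are the nine listed in Fig. 4.
FIELDS = [
    "date", "time", "c_ip", "cs_username", "cs_method",
    "cs_uri_stem", "cs_uri_query", "sc_status", "cs_user_agent",
]


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return a field dictionary, or None for comments and malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or header/directive line
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None  # treat records with missing or extra fields as noise
    return dict(zip(FIELDS, parts))


# Hypothetical record; the path and address are invented for illustration.
sample = "2004-02-19 08:15:02 10.0.0.5 - GET /educare/dms/index.asp - 200 Mozilla/4.0"
record = parse_log_line(sample)
```

Returning None for malformed lines keeps the decision about incomplete records in one place, so later steps only see complete records.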

In the data selection phase, log files from 19 February 2004 until the end of the semester (13 March 2004) were selected. This period was chosen because, according to the university's academic calendar, these dates fall near the end of the second semester of the 2003/2004 session, with 13 March 2004 being the last day of final examinations. The selection of the server log files must be done carefully because UUM Educare is part of the GroupWeb facilities: besides Educare (My Desktop) as the E-Learning facility, there are several other facilities such as My Portfolio and Resources. Because the UUM Educare facilities provided by GroupWeb are mixed with other facilities or options, the server log file also contains a mix of log entries for every transaction across the facilities in the GroupWeb portal.

Data Preprocessing

The Data Preprocessing phase is one of the most challenging phases in this study. The major tasks in this phase include handling missing values, identifying outliers, smoothing out noisy data and correcting inconsistent data (Han and Kamber, 2001). Data Preprocessing consists of all the actions taken before the actual Pattern Analysis phase starts. The Data Preprocessing phase was carried out using commercially available software. In the early stage of this phase, the Macro tool in Microsoft Access was selected to assist the preprocessing tasks, and for the subsequent preprocessing tasks the filter tool in Microsoft Excel was used. Fig. 5 shows the data after the preprocessing phase is done.
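The study performed this step with Microsoft Access macros and Excel filters. Purely as an illustration of the same idea in code, the sketch below filters parsed records down to the portal options and groups them into transactions. The record keys (date, c_ip, cs_uri_stem, sc_status), the use of client address plus date as a stand-in for a session, and the sample paths are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Optional

# Option names mentioned in the paper for the UUM Educare portal.
OPTIONS = {
    "main", "dms", "profile", "resources", "announcement",
    "assessment", "calendar", "pnotes", "assignment", "forum",
}


def option_of(uri_stem: str) -> Optional[str]:
    """Map a requested path to the first portal option found in it, if any."""
    for segment in uri_stem.lower().strip("/").split("/"):
        if segment in OPTIONS:
            return "/" + segment
    return None


def build_transactions(records: Iterable[Dict[str, str]]) -> List[FrozenSet[str]]:
    """Group successful option requests into one item set per (client, day)."""
    sessions = defaultdict(set)
    for r in records:
        if r["sc_status"] != "200":
            continue  # drop failed requests
        option = option_of(r["cs_uri_stem"])
        if option is None:
            continue  # drop images and other non-option requests
        sessions[(r["c_ip"], r["date"])].add(option)  # ASSUMED session key
    return [frozenset(items) for items in sessions.values()]


# Hypothetical parsed records.
records = [
    {"date": "2004-02-19", "c_ip": "10.0.0.5", "cs_uri_stem": "/educare/main/default.asp", "sc_status": "200"},
    {"date": "2004-02-19", "c_ip": "10.0.0.5", "cs_uri_stem": "/educare/dms/file.asp", "sc_status": "200"},
    {"date": "2004-02-19", "c_ip": "10.0.0.5", "cs_uri_stem": "/images/logo.gif", "sc_status": "200"},
    {"date": "2004-02-19", "c_ip": "10.0.0.9", "cs_uri_stem": "/educare/forum/list.asp", "sc_status": "404"},
]
transactions = build_transactions(records)
```

Each resulting set of options plays the role of a basket when Association Rules are mined later.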

Pattern Analysis

During the Pattern Analysis phase, a descriptive method is used to analyze the data, producing a general summary of the Web usage and user behaviors. This general summary includes the most active users of the portal, whether from Malaysia or from other countries. For users from Malaysia, it also shows their locality, that is, whether they accessed the UUM Educare portal from the UUM Local Area Network (LAN) or from outside the UUM campus. The analysis also tries to find the top visitors for each facility or option provided by the UUM Educare portal. Several facilities or options are available in the UUM Educare portal, such as dms, profile, resources, announcement, assessment, calendar, pnotes, assignment and forum. The dms option can be analyzed to find the most requested documents in the UUM Educare portal. Besides the dms option analysis, the server log files also record information about the documents that were downloaded.

Pattern Discovery

Association Rules

Given the server log files that represent UUM Educare portal activities, the main purpose of Association Rules mining is to generate all Association Rules that have support and confidence greater than the user-specified minimum support (called min_sup) and minimum confidence (called min_conf) respectively. The algorithm used for finding all Association Rules is henceforth referred to as the Apriori algorithm (Agrawal and Srikant, 1994). The Apriori algorithm was selected because of its performance: it is able to run the mining process in a short period. Currently, the Apriori algorithm is commonly used for generating Association Rules in Web Usage Mining, and this experimental study focuses on the exploration of Web Usage Mining in a university E-Learning portal (UUM Educare).

Results

As stated above, this study focuses on Web Usage Mining of the UUM Educare portal. The results of this study are divided into two sections. The first section discusses the general description of the access patterns and user behaviors of the UUM Educare portal (descriptive statistics). The second section presents the supports and confidences of the different levels in the UUM Educare portal. All the results are displayed using charts, such as pie and bar charts, to make them easier to understand.

4.0 FINDINGS AND RESULTS

The Web Usage Mining for the Universiti Utara Malaysia E-Learning (UUM Educare) portal, whose main URL is www.e-web.uum.edu.my, is divided into two main stages. Each stage has its own phases with certain sub-activities. The first stage covers log data retrieval from the UUM Educare server, in which the Data Selection and Data Preprocessing phases are directly involved. The second stage is the mining stage, which involves Pattern Discovery by applying Association Rules and the Pattern Analysis phase in order to discover the UUM Educare portal usage patterns.


General Pattern Analysis Results (access pattern and users behaviors descriptive statistic)

The UUM Educare portal has several options (dms, resources, announcement, assessment, calendar and forum) that can be chosen by the users. Based on the Uniform Resource Locator (URL) stem, the figure below shows the accesses in which users reached only the UUM Educare options on the host www.e-web.uum.edu.my, without navigating to the other sub-options provided by the UUM Educare portal.

Association Rules Results (support and confidence of the different options)

The figure below shows the support and confidence for each option provided by the UUM Educare portal, where the main host is www.e-web.uum.edu.my, followed by the main URL path for the E-Learning portal. There are 14 options provided on UUM Educare and, for this analysis, a total of 10,578 transactions on the option paths were selected.


Based on Fig. 6 above, it can be concluded that the /main path is the most requested page, followed by the /dms path, where documents can be downloaded. The /main path is the top level of UUM Educare and displays the basic information about the portfolio/subject, including subject area, discipline, owner and subject description. From /main and the other option paths, users can also select the other options provided by the UUM Educare portal. Fig. 7 shows the support and confidence for each directory; the /dms option path has 36.45% support and confidence, which is the highest percentage for both support and confidence. It is followed by the /main and /assessment option paths, where the support and confidence are 15.02% and 13.05% respectively.

Association Rules induction is the extraction of rules of the form X => Y (if X then Y), quantified with a confidence (the proportion of occurrences that verify Y among the occurrences that verify X) and a support (the proportion of occurrences that verify both X and Y among all occurrences). The next result shows the Association Rules, including support and confidence, obtained by applying the Apriori algorithm to identify the patterns, with a threshold of 15% for the minimum support and a threshold of 70% for the minimum confidence. Fig. 8 shows the pages that are related to each other, that is, the options most frequently selected together with a given option.
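The study ran this step with ARunner 1.0 on top of Christian Borgelt's Apriori tool. As a rough illustration only, the sketch below shows a small level-wise Apriori search followed by rule generation, using the 15% and 70% thresholds quoted above. The sessions are hypothetical, the function names are chosen for this example, and the sketch makes no attempt to reproduce the tool's exact behavior or output.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def apriori(transactions: List[Itemset], min_sup: float) -> Dict[Itemset, float]:
    """Return every itemset whose support is at least min_sup, with its support."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[Itemset, float] = {}
    k = 1
    while candidates:
        # Count each candidate and keep those meeting the minimum support.
        level = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= min_sup:
                level[c] = sup
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune any candidate that has an
        # infrequent (k-1)-subset (the large itemset property).
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent: Dict[Itemset, float],
          min_conf: float) -> List[Tuple[Itemset, Itemset, float, float]]:
    """Return (antecedent, consequent, support, confidence) for strong rules."""
    found = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), size):
                antecedent = frozenset(ante)
                conf = sup / frequent[antecedent]  # subsets are always frequent
                if conf >= min_conf:
                    found.append((antecedent, itemset - antecedent, sup, conf))
    return found


# Hypothetical sessions, not taken from the portal logs.
sessions = [
    frozenset({"/announcement", "/main", "/dms"}),
    frozenset({"/announcement", "/main", "/dms"}),
    frozenset({"/main", "/dms"}),
    frozenset({"/main", "/assessment"}),
    frozenset({"/dms"}),
]
frequent = apriori(sessions, min_sup=0.15)
strong_rules = rules(frequent, min_conf=0.70)
```

On real data the minimum support threshold does most of the work: it limits how many candidates survive each pass, which is why the pruning step matters for log files of the size described earlier.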


Based on Fig. 8, it can be concluded that the rule with the highest support (22.0%) states that if, in a session, the user selected the /announcement and /main option paths, the user also selected the /dms option path; the rule with the highest confidence (99.1%) states that if, in a session, the user selected the /announcement and /dms option paths, the user also selected the /main option path. The next figure (Fig. 9) presents a graphical chart of the 6 most accepted rules for option relationships.

Fig. 10 below shows the Association Hyperedges for UUM Educare, which represent the portal pages that are orderly archived. A threshold of 10% for the minimum support and a threshold of 75% for the minimum confidence are used. It shows that the highest support, 11.7% of transactions containing all items appearing in the hyperedge, is for /assignment /announcement /main, with a confidence of 78.1%. The hyperedge /assignment /announcement /assessment /main, with 10.6% support, has a confidence of 92.2%, which is the highest average confidence of all rules that can be formed using the items in the hyperedge with all items appearing in the rule (average confidence of the rules including /main