Azizul Azhar Ramli_Web Usage Mining Using Apriori Algorithm

WEB USAGE MINING USING APRIORI ALGORITHM: UUM LEARNING CARE PORTAL CASE

Azizul Azhar bin Ramli

Information System Department

Faculty of Information Technology and Multimedia, Kolej Universiti Teknologi Tun Hussein Onn (KUiTTHO)

86400 Parit Raja, Batu Pahat, Johor D/T

Tel: +607-4538000 ext. 8056, Fax: +607-4532119

Email: [email protected]

Abstract

The enormous content of information on the World Wide Web makes it an obvious candidate for data mining research. The application of data mining techniques to the World Wide Web is referred to as Web mining, a term that has been used in three distinct ways: Web Content Mining, Web Structure Mining and Web Usage Mining. E-Learning is one Web-based application that has to deal with a large amount of data. In order to uncover the usage patterns and user behaviors of the university E-Learning portal (UUM Educare), this paper implements the high-level process of Web Usage Mining using a basic Association Rules algorithm called the Apriori algorithm. Web Usage Mining consists of three main phases, namely Data Preprocessing, Pattern Discovery and Pattern Analysis. Server log files serve as the set of raw data, which must go through all the Web Usage Mining phases to produce the final results. Here, the Web Usage Mining approach is combined with the basic Association Rules algorithm, Apriori, to optimize the content of the university E-Learning portal. Finally, this paper presents an overview of the analysis of the results, and the Web administrator can use the findings to take suitable actions.

KEY WORDS: server log file, data mining, Web mining, Web Usage Mining, Association

Rules, Apriori algorithm.

1.0 PROJECT OVERVIEW

Data mining is a technique used to deduce useful and relevant information to

guide professional decisions and other scientific research (Chen, Han and Yu, 1996). It

is a cost-effective way of analyzing large amounts of data, especially when a human

could not analyze such datasets.

The massification of Internet use has made automatic knowledge extraction from Web log files a necessity. Information providers are interested in techniques that can learn Web users’ information needs and preferences. Such techniques can improve the effectiveness of their Web sites by adapting the information structure of the sites to the users’ behavior.

Recently, the advent of data mining techniques for discovering usage patterns from Web data (Web Usage Mining) indicates that these techniques can be a viable alternative to traditional decision-making tools (Srivastava et al., 2000). Web Usage Mining is the process of applying data mining techniques to the discovery of usage patterns from Web data and is targeted towards applications (Srivastava et al., 2000). Web Usage Mining mines the secondary data (Web server access logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, mouse clicks and any other data resulting from interaction with the Web) derived from the interactions of users during a certain period of Web sessions.

This paper explores the use of Web Usage Mining techniques to analyze Web log records collected from the E-Learning portal (UUM Educare). Using commercial Web data mining tools (WebLog Expert Lite 3.5 and Sawmill 7) and ARunner 1.0 (a prototype GUI for Christian Borgelt’s Apriori tool by Shamrie Sainin, FTM, UUM), this study has identified several Web access patterns by applying a well-known data mining technique (the Apriori algorithm) to the access logs of this educational portal. The output includes descriptive statistics and Association Rules for the portal, with support and confidence, to represent the Web usage and user behavior for UUM Educare. The results and findings of this experimental analysis can be used by the Web administrator, and possibly by upper-level management in the UUM community, to plan upgrades and enhancements to the portal presentation.

Objective and Scopes of Project

Generally, the main objective of this project is to perform the Web Usage Mining process, specifically:

i. To preprocess the UUM Educare server log files from the university E-Learning Web servers in order to determine and discover the user access patterns.

ii. To apply the basic Association Rules algorithm, Apriori, in implementing the Web Usage Mining process, producing usage patterns by determining what interests users most among the options provided by the university E-Learning portal.

iii. To analyze the resulting usage patterns and user behaviors for UUM Educare from the Web Usage Mining implementation process.

The scopes of this project are:

i. Research Organization: University E-Learning portal (UUM Educare)

ii. Focus of Project: Extract the server log files from the university E-Learning server for certain weeks within a semester, preprocess the set of raw data, select the data that contribute to pattern analysis, and implement the pattern mining using the basic Association Rules algorithm, Apriori, in order to produce the final results (rules).

2.0 LITERATURE REVIEW

Data Mining, Web Mining and Web Usage Mining

Data mining (DM) is a step in the Knowledge Discovery in Databases (KDD) process, which is defined as a “nontrivial process of identifying valid, novel, potentially useful and ultimately understandable pattern in data” (Fayyad et al., 1996). The term pattern here refers to some abstract representation of a subset of the data, that is, an expression in some language describing a data subset or a model applicable to that subset.

Data mining efforts associated with the Web, called Web mining, can be broadly categorized into three areas of interest based on which part of the Web is mined: Web Content Mining, Web Structure Mining and Web Usage Mining (Kosala and Blockeel, 2000). In Web mining, data can be collected at the server side, the client side, proxy servers or a consolidated Web/business database (Srivastava et al., 2000). The information provided by the data sources described above can be used to construct several data abstractions, namely users, page-views, click-streams and server sessions.

Web Usage Mining is defined as the process of applying data mining techniques to the discovery of usage patterns from Web log data in order to identify Web users’ behavior (Srivastava et al., 2000). Web Usage Mining is the type of Web mining activity that involves the automatic discovery of user access patterns from one or more Web servers.

As shown in Fig. 1, three main tasks are performed in Web Usage Mining: Preprocessing, Pattern Discovery and Pattern Analysis. Fig. 1 gives a brief description of the main tasks of the Web Usage Mining process.

Association Rules and Apriori Algorithm

The problem of deriving Association Rules from data was first formulated in

(Agrawal, Imielinski and Swami, 1993) and is called the “market-basket problem”. The

problem is that we are given a set of items and a large collection of transactions which

are sets (baskets) of items. The task is to find relationships between the containments of

various items within those baskets.

Apart from the supermarket scenario, there are many other examples where Association Rules have been used, for example users’ visits to WWW pages, where the structure and content of a site can be optimized. Xue et al. (2001) used a re-ranking method and generalized Association Rules to extract access patterns of Web site usage. Mannila et al. (1999) use page accesses from a Web server log as events for discovering frequent episodes. Chen et al. (1996) introduce the concept of using maximal forward references in order to break down user sessions into transactions for the mining of traversal patterns. Batista and Silva (2001) mine the Web access logs of an online newspaper using the Apriori algorithm.

The task in Association Rules mining involves finding all rules that satisfy user-defined constraints on minimum support and confidence with respect to a given dataset. The most commonly used Association Rule discovery algorithm that utilizes the frequent itemset strategy is the Apriori algorithm (Agrawal et al., 1993).

Apriori was the first scalable algorithm designed for association-rule mining. Apriori is an improvement over the AIS and SETM algorithms (Agrawal and Srikant, 1994). The Apriori algorithm searches for large itemsets during its initial database pass and uses the result as the seed for discovering other large itemsets during subsequent passes. Itemsets having a support level above the minimum are called large or frequent itemsets and those below are called small itemsets (Chen et al., 1996). The algorithm is based on the large itemset property, which states that any subset of a large itemset is large, and that if an itemset is not large then none of its supersets are large (Agrawal and Srikant, 1994).
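The level-wise search described above can be sketched in Python as follows. This is only a minimal illustration of the join and prune steps that follow from the large itemset property; it is not the ARunner or Borgelt implementation used in this study. The function name apriori, the sample sessions and the 50% minimum support are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Initial pass: find the large 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_sup:
            current[frozenset([item])] = s

    large = dict(current)
    k = 2
    while current:
        # Join step: merge large (k-1)-itemsets into candidate k-itemsets.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: a candidate with a (k-1)-subset that is not large
        # cannot be large itself.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_sup:
                current[c] = s
        large.update(current)
        k += 1
    return large


# Hypothetical sessions; only the option names are taken from the portal paths.
sessions = [
    {"/main", "/dms", "/announcement"},
    {"/main", "/dms"},
    {"/main", "/assessment"},
    {"/dms", "/announcement", "/main"},
]
large_itemsets = apriori(sessions, min_sup=0.5)
```

In this toy data, /assessment appears in only one of four sessions, so it is discarded in the first pass and no candidate containing it is ever counted.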

Web Usage Mining in Educational Field

In a Web-based learning environment, where tutors and learners are separated spatially and physically, student modeling is one of the biggest challenges. Traditional student modeling techniques are inapplicable in these systems because tutors are overwhelmed by the huge volumes of sequential data generated as learners browse through the Web pages (Agrawal and Srikant, 1995). Web mining techniques, including clustering and Association Rules mining, can be applied to extract hidden and interesting knowledge to facilitate instructional planning and student diagnosis. Web mining in education is not new. It has been applied to mine aggregate paths for learners engaged in a distance education environment (Ha, Bae and Park, 2000); relevant words for students based on text mining of their browsed documents (Ochi et al., 1998); and e-articles for students based on keyword-driven text mining (Tang et al., 2000); and to analyze learners’ learning behaviors (Zaiane and Luo, 2001). Previous research proposed going beyond usage mining to consider the content of the pages that have been visited. In an E-Learning system, both learners’ browsing behaviors and course content are important for deriving learners’ learning levels, intentions, goals, interests or abilities. Incorporating course content can aid in understanding learners’ browsing habits. In particular, understanding learners’ browsing behaviors can facilitate the personalization of course content.

An existing type of system, Artificial Intelligence in Education (AIED), employs a knowledge base, a student model and instructional plans. For a Web-based AIED system, Web mining becomes part of student modeling. The system can relate its mined knowledge of page contents and student navigation patterns to students’ level of understanding in order to decide upon appropriate feedback for them (Tang et al., 2001).

3.0 PROJECT METHODOLOGY AND IMPLEMENTATION

The Web Usage Mining process proposed by Srivastava et al. (2000) serves as the major guideline for the project implementation. Fig. 3 shows the general flow of the project methodology.

Server Log File

The server log files dated from 19 February 2004 until 13 March 2004 were selected for further analysis. The server log files were retrieved from the UUM Educare server, www.e-web.uum.edu.my. The total size of the server log files for that period is about 650 MB, and this large amount of data was the most challenging problem to handle during the Data Preprocessing phase. Each record in the server log file is a single line consisting of nine attributes, as shown in Fig. 4.

In the data selection phase, log files from 19 February 2004 until the end of the semester (13 March 2004) were selected. The server log files must be selected carefully because UUM Educare is part of the GroupWeb facilities, which provide several facilities besides Educare (My Desktop) for E-Learning, such as My Portfolio and Resources. Because the UUM Educare facilities provided by GroupWeb are mixed with these other facilities or options, the server log file also mixes the log entries for every transaction across the facilities in the GroupWeb portal.

Data Preprocessing

The Data Preprocessing phase is one of the most challenging phases in this study. The major tasks in this phase include handling missing values, identifying outliers, smoothing out noisy data and correcting inconsistent data (Han and Kamber, 2001). Data Preprocessing consists of all the actions taken before the actual Pattern Analysis phase starts.
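As a rough illustration of such cleaning steps, the following Python sketch parses log lines and drops records that are inconsistent, incomplete or irrelevant. It assumes that the log follows the W3C extended log file layout, with a #Fields: directive naming the columns; the actual nine attributes of the UUM Educare log are not reproduced here, and the sample lines, the function names and the list of static file suffixes are hypothetical. The study itself used Microsoft Access macros and Microsoft Excel filters, not this code.

```python
def parse_w3c_log(lines):
    """Yield one dict per request, using the '#Fields:' directive as header."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives such as #Version or #Date
        values = line.split()
        if len(values) != len(fields):
            continue  # inconsistent record: wrong number of attributes
        yield dict(zip(fields, values))


# Assumed list of static resources that do not represent a page request.
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")


def clean(records):
    """Keep successful page requests that come from a known client."""
    for r in records:
        if r.get("c-ip", "-") == "-":
            continue  # missing client address
        if not r.get("sc-status", "").startswith("2"):
            continue  # failed or redirected request
        if r.get("cs-uri-stem", "").lower().endswith(STATIC_SUFFIXES):
            continue  # embedded image, stylesheet or script
        yield r


# Hypothetical sample lines, not taken from the UUM Educare logs.
sample = [
    "#Software: hypothetical sample",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2004-02-19 08:00:01 10.0.0.5 GET /main 200",
    "2004-02-19 08:00:02 10.0.0.5 GET /images/logo.gif 200",
    "2004-02-19 08:00:09 10.0.0.5 GET /dms 404",
    "2004-02-19 08:01:00 - GET /forum 200",
    "2004-02-19 08:02:00 10.0.0.7 GET /dms 200 extra",
    "2004-02-19 08:03:00 10.0.0.7 GET /dms 200",
]
cleaned = list(clean(parse_w3c_log(sample)))
```

Only the first and last data lines survive in this sample; the others are dropped as a static resource, a failed request, a record with a missing client address and a record with an extra attribute.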

The Data Preprocessing phase was carried out using software available on the market. In the early stage of this phase, the Macro tool in Microsoft Access was selected to assist the preprocessing tasks, and for the subsequent data preprocessing tasks, the filter tool in Microsoft Excel was used.

This period was selected because the university’s academic calendar shows that the selected dates are near the end of the second semester of the 2003/2004 session, with 13 March 2004 being the last day of the final examinations. Fig. 5 shows the data after the preprocessing phase is done.
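The Association Rules reported later are stated per session, so the preprocessed records have to be grouped into transactions before mining. The paper does not describe how sessions were delimited; the Python sketch below assumes one common heuristic, in which a gap of more than 30 minutes between two requests from the same client starts a new session. The function name, the timeout and the sample requests are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def build_transactions(requests, timeout=timedelta(minutes=30)):
    """Group (client, timestamp, option) requests into per-session option sets."""
    last_seen = {}     # client -> timestamp of the client's previous request
    open_session = {}  # client -> options seen in the client's current session
    transactions = []
    for client, ts, option in sorted(requests, key=lambda r: r[1]):
        # A long silence from this client closes its current session.
        if client in last_seen and ts - last_seen[client] > timeout:
            transactions.append(open_session.pop(client))
        open_session.setdefault(client, set()).add(option)
        last_seen[client] = ts
    # Sessions still open at the end of the log are also transactions.
    transactions.extend(open_session.values())
    return transactions


# Hypothetical requests; the clients and times are invented.
requests = [
    ("10.0.0.5", datetime(2004, 2, 19, 8, 0), "/main"),
    ("10.0.0.5", datetime(2004, 2, 19, 8, 5), "/dms"),
    ("10.0.0.7", datetime(2004, 2, 19, 8, 6), "/main"),
    ("10.0.0.5", datetime(2004, 2, 19, 9, 30), "/forum"),
]
transactions = build_transactions(requests)
```

Each resulting transaction is a set of option paths, which is the input form expected by an Association Rules algorithm.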

Pattern Analysis

During the Pattern Analysis phase, a descriptive method is used to analyze the data, producing a general summary of the Web usage and user behaviors. This general summary includes the most active users of the portal, whether from Malaysia or from other countries. For users from Malaysia, it also shows the locality of the users, that is, whether they access the UUM Educare portal from the UUM Local Area Network (LAN) or from outside the UUM campus.

The analysis also tries to find the top visitors for each facility or option provided by the UUM Educare portal. Several facilities or options are available in the UUM Educare portal, such as dms, profile, resources, announcement, assessment, calendar, pnotes, assignment and forum. The dms option can be analyzed to find the most requested documents in the UUM Educare portal. Besides the dms option analysis, the server log files also trace information about the documents that were downloaded.

Pattern Discovery – Association Rules

Given server log files that represent UUM Educare portal activities, the main purpose of Association Rules mining is to generate all Association Rules that have support and confidence greater than the user-specified minimum support (called min_sup) and minimum confidence (called min_conf) respectively. The algorithm used for finding all such Association Rules is henceforth referred to as the Apriori algorithm (Agrawal and Srikant, 1994).
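Once the itemsets that reach min_sup are known, rules can be derived from them and filtered by min_conf. The Python sketch below shows one way to do this, assuming that the support of every frequent itemset and of each of its subsets is already available in a dictionary. The function name generate_rules and the support values are hypothetical and are not results from the UUM Educare logs.

```python
from itertools import combinations


def generate_rules(supports, min_conf):
    """Return (body, head, support, confidence) for rules meeting min_conf.

    `supports` maps each frequent itemset (a frozenset) to its support and
    must also contain every non-empty subset of each itemset it lists.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for body in map(frozenset, combinations(sorted(itemset), r)):
                # confidence(body => head) = support(itemset) / support(body)
                conf = sup / supports[body]
                if conf >= min_conf:
                    rules.append((body, itemset - body, sup, conf))
    return rules


# Hypothetical supports, chosen only to illustrate the thresholds.
supports = {
    frozenset({"/main"}): 0.60,
    frozenset({"/dms"}): 0.40,
    frozenset({"/main", "/dms"}): 0.30,
}
rules = generate_rules(supports, min_conf=0.70)
```

With these numbers only the rule /dms => /main is kept, with a confidence of about 75%, while /main => /dms reaches only 50% and is discarded.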

The Apriori algorithm was selected because of its performance: it is able to run the mining process in a short period. Currently, the Apriori algorithm is commonly used for generating Association Rules in Web Usage Mining, and this experimental study focuses on exploring Web Usage Mining in a university E-Learning portal (UUM Educare).

Results

As stated above, this study focuses on Web Usage Mining of the UUM Educare portal. The results of this study are divided into two sections. The first section discusses the general description of the access patterns and user behaviors of the UUM Educare portal (descriptive statistics). The second section presents the support and confidence of the different levels in the UUM Educare portal. All the results are displayed using charts, such as pie and bar charts, to make them easier to understand.

4.0 FINDINGS AND RESULTS

The Web Usage Mining for the Universiti Utara Malaysia E-Learning (UUM Educare) portal, whose main URL is www.e-web.uum.edu.my, is divided into two main stages. Each stage has its own phases with certain sub-activities. The first stage covers retrieving log data from the UUM Educare server, and directly involves the Data Selection and Data Preprocessing phases. The second stage is the mining stage, which involves the Pattern Discovery phase, by applying Association Rules, and the Pattern Analysis phase, in order to discover the UUM Educare portal usage patterns.

General Pattern Analysis Results (access pattern and users behaviors – descriptive statistic)

The UUM Educare portal has several options (dms, resources, announcement, assessment, calendar and forum) that can be chosen by the users. Based on the Uniform Resource Locator (URL) stem, the figure below shows the users who accessed only the UUM Educare options on the host www.e-web.uum.edu.my without navigating to the other sub-options provided by the UUM Educare portal.

Association Rules Results (support and confidence of the different options)

The figure below shows the support and confidence for each option provided by the UUM Educare portal, starting from the main host, www.e-web.uum.edu.my, and continuing with the main URL path for the E-Learning portal. UUM Educare provides 14 options, and a total of 10 578 transactions on the option paths were selected for this analysis.

Based on Fig. 6 above, it can be concluded that the /main path is the most requested page, followed by the /dms path, where documents can be downloaded. The /main path is the top level of UUM Educare and displays the basic information about the portfolio/subject, including the subject area, discipline, owner and subject description. From /main and the other option paths, users can also select the other options provided by the UUM Educare portal.

Fig. 7 shows the support and confidence for each directory. The /dms option path has the highest percentage for both support and confidence, at 36.45%. It is followed by the /main and /assessment option paths, for which the support and confidence are 15.02% and 13.05% respectively.

Association Rules induction is the extraction of rules of the form X => Y (if X then Y), quantified with a confidence (the proportion of occurrences that verify Y among the occurrences that verify X) and a support (the proportion of occurrences that verify both X and Y among all occurrences).
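These two measures can be computed directly from a list of transactions, as the following Python sketch shows. It uses the definitions given above, with support counted over transactions that contain both X and Y. The function name and the four sample sessions are hypothetical.

```python
def support_and_confidence(transactions, x, y):
    """Support and confidence of the rule X => Y over a list of item sets."""
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= t)         # occurrences of X
    n_xy = sum(1 for t in transactions if (x | y) <= t)  # occurrences of X and Y
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical sessions; only the option names come from the portal.
sessions = [
    {"/main", "/dms", "/announcement"},
    {"/main", "/dms"},
    {"/main", "/assessment"},
    {"/announcement", "/forum"},
]
sup, conf = support_and_confidence(sessions, {"/announcement"}, {"/main"})
```

Here /announcement occurs in two sessions and is accompanied by /main in one of them, so the rule /announcement => /main has a support of 25% and a confidence of 50%.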

The next result shows the Association Rules, including support and confidence, obtained by applying the Apriori algorithm to identify the patterns, with a threshold of 15% for the minimum support and a threshold of 70% for the minimum confidence. Fig. 8 shows the pages that are related to each other, that is, the options most frequently selected when certain other options are selected.

Based on the figure above, it can be concluded that the rule with the highest support (22.0%) states that “if in a session the user selected the /announcement and /main option paths, the user also selected the /dms option path”; the rule with the highest confidence (99.1%) states that “if in a session the user selected the /announcement and /dms option paths, the user also selected the /main option path”. The next figure (Fig. 9) presents a graphical chart of the 6 most accepted rules for the relationships between options.

Fig. 10 below shows the Association Hyperedges for UUM Educare, which represent sets of portal pages that are related to each other. A threshold of 10% for the minimum support and a threshold of 75% for the minimum confidence are used. The highest support is 11.7%, which is the percentage of transactions that contain all items appearing in the hyperedge /assignment /announcement /main; the confidence of this hyperedge is 78.1%.

The hyperedge /assignment /announcement /assessment /main, with a confidence of 92.2% and a support of 10.6%, has the highest average confidence. This is the average confidence of all rules that can be formed using the items in the hyperedge with all items appearing in the rule, that is, the average confidence of the rules /main <- /assessment /announcement /assignment, /assessment <- /main /announcement /assignment, /announcement <- /main /assessment /assignment and /assignment <- /main /assessment /announcement.
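The averaging described above can be sketched in Python as follows: for a given set of items, each item in turn is taken as the head of a rule whose body is the remaining items, and the confidences of these rules are averaged. This follows the description in the text rather than the internals of the tool that was used; the function name and the sample sessions are hypothetical.

```python
def hyperedge_confidence(transactions, items):
    """Support of an item set and the average confidence of its one-head rules."""
    items = frozenset(items)
    # Transactions that contain every item of the hyperedge.
    n_all = sum(1 for t in transactions if items <= t)
    confidences = []
    for head in items:
        body = items - {head}
        n_body = sum(1 for t in transactions if body <= t)
        # Confidence of the rule head <- body.
        confidences.append(n_all / n_body if n_body else 0.0)
    support = n_all / len(transactions)
    return support, sum(confidences) / len(confidences)


# Hypothetical sessions, not the UUM Educare data.
sessions = [
    {"/main", "/announcement", "/assignment"},
    {"/main", "/announcement", "/assignment"},
    {"/main", "/announcement"},
    {"/main", "/assignment"},
    {"/dms"},
]
sup, avg_conf = hyperedge_confidence(
    sessions, {"/main", "/announcement", "/assignment"}
)
```

In this sample the three rules have confidences of 1, 2/3 and 2/3, so the hyperedge has a support of 40% and an average confidence of 7/9, or about 77.8%.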

The following figure (Fig. 11) shows the 6 most accepted Association Rule Hyperedges for the UUM Educare server log files during the selected period of time.

5.0 PROJECT SIGNIFICANCE

Generally, this project produces useful findings for analyzing the Web usage patterns of UUM E-Learning, www.e-web.uum.edu.my, and more specifically:

i. This study is a first step in analyzing the university E-Learning portal by applying the Web Usage Mining approach with the basic Association Rules algorithm, Apriori.

ii. The outcomes of this study can be used by the Web administrator to plan necessary improvements, enhancements and other valuable actions for the university E-Learning portal.

iii. The implementation of the Web Usage Mining process for the university E-Learning portal may become a guideline for system development purposes.

6.0 CONCLUSION AND RECOMMENDATION

Web Usage Mining is an active field of research and it will generate new hopes in Internet-based business. Web Usage Mining applications are being used in some famous Web sites, and this paper focuses entirely on the education field. This paper presents a brief introduction to the Web mining technique, which is a part of data mining technologies, and also the implementation of Web Usage Mining in an E-Learning portal. Server log files of UUM E-Learning (UUM Educare) on the server host www.e-web.uum.edu.my were selected for this project. To perform the Web Usage Mining, the methodology introduced by Srivastava et al. (2000) served as the major guide; it includes three main phases: Data Preprocessing, Pattern Discovery and Pattern Analysis. All of these phases were done carefully to produce quality results. The Data Preprocessing phase of Web Usage Mining is a challenging task, and the basic Association Rules algorithm, Apriori, was selected as the technique to produce the support and confidence of the different levels in the Web Usage Mining of the UUM Educare portal.

The Apriori algorithm was selected for performing Web Usage Mining on the UUM Educare portal because it is a common data mining technique for association-based analysis. By applying this algorithm to the Web log file, the relationships between the accessed pages can be mined. The Web usage patterns and user behavior on the UUM Educare portal can also be analyzed using this algorithm, whereas the descriptive statistics approach cannot perform this analysis. The results and findings of this analysis are reliable but less accurate, because of a property of the Apriori algorithm whereby the same selected itemsets are always counted. The results or findings from this experimental analysis are useful for the Web administrator in improving Web services and performance through the improvement of Web sites, including their contents, structure, presentation and delivery. The valuable actions may consist of making value-added modifications to the Web pages.

As a recommendation, in order to enhance and continue this project, the suggested methodology can be implemented for system development purposes. Such a system may implement the Web Usage Mining phases, including data selection, data preprocessing, pattern discovery and pattern analysis. The Apriori algorithm may be a part of the pattern discovery sub-function.

Besides that, certain important recommendations may be proposed for improving the results during project implementation. As discussed in Section 4.0, Findings and Results, Association Rules are presented with percentages of support and confidence for the 10 most requested options in the UUM Educare portal, along with the Association Rules for the /dms option. The figures for support and confidence in particular results are the same because of the calculation steps of the Apriori algorithm. According to Boon Lay et al. (1999), the percentages of support for Association Rules can be improved by finding all sets of items (attribute=value) that have transaction support above the minimum support. Itemsets with minimum support are called large itemsets, and the large itemsets are used to generate the desired rules.

Lastly, for future work, another method for analyzing sparse data could be used in the study of E-Learning Web log access, along with the use of different similarity Association Rules, in order to conclude which alternatives are most suitable for knowledge extraction from Web log data.

REFERENCES

Abd. Wahab, M. H, Siraj, F and Yusoff, N. (2004). Log Mining Using Generalize

Association Rules. In Proceedings of Master Final Project 2004 Presentation,

UUM, Malaysia.

Agrawal, R., Imielinski, T. and Swami, A. (1993). Mining Association Rules between Sets

of Items in Large Databases. In Proceedings of the International ACM SIGMOD

Conference, Washington DC, USA, pages 207–216.

Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules. Proc. of

the 20th VLDB Conference. Pp 487-499.

Agrawal, R., and Srikant, R. (1995). Mining Sequential Patterns. In Proc. of the Eleventh

International Conference on Data Engineering (ICDE), Taiwan. Pp 3-14.

Batista, P. and Silva, M. (2001). Prospeccao dos Dados de Acesso a um Servidor de

Noticias na Web, 2nd Conferencia sobre Redes de Computadores, Evora,

Portugal.

Boon Lay, C., Khalid, M. and Yusof, R. (1999). Intelligent Database by Neural Network

and Data Mining. In Proc. of Artificial Intelligent Applications in Industry, Kuala

Lumpur. Pp 201-219.

Borgelt, C. (2004). Apriori: Finding Association Rules/Hyperedges with the Apriori

Algorithm. School of Computer Science, University of Magdeburg.

Chen, M.-S., Han, J. and Yu, P. S. (1996). Data Mining: An Overview from a Database

Perspective. IEEE Transactions on Knowledge and Data Engineering, (8:6). Pp

866-883.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). The KDD Process for

Extracting Useful Knowledge from Volumes of Data. Communications of the

ACM, (39:11). Pp 27-34.

Han, J., Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan-Kaufmann

Academic Press, San Francisco.

Jiang, Q. (2003). Web Usage Mining: Process and Application. Presentation for CSE

8331.

Kosala, R., Blockeel, H. (2000). Web Mining Research: A Survey. ACM SIGKDD

(Special Interest Group on Knowledge Discovery and Data Mining) Explorations.

June, (2:1). Pp 1-10.

Mannila, H., Toivonen, H. and Verkamo, A. I. (1994). Efficient Algorithms for Discovering

Association Rules. In AAAI Workshop on Knowledge and Discovery in

Databases, Seattle, Washington, USA, Pp 181–192.

Srivastava, J., Cooley, R., Deshpande, M., and Tan, P. N. (2000). Web Usage Mining:

Discovery and Application of Web Usage Pattern from Web Data. Department of

Computer Science and Engineering, University of Minnesota.

Tang, C.; Lau, R.W.H.; Li, Q.; Yin, H.; Li, T.; and Kilis, D.(2000). Personalized

Courseware Construction Based on Web Data Mining. In Proc. of the First

International Conference on Web Information Systems Engineering (WISE 2000)

vol.2, Pp. 204-211.

Tang, Y. T. and McCalla, G. (2001). Student modeling for a Web based Learning

Environment: a Data Mining Approach. Department of Computer Science,

University of Saskatchewan, Canada.

Xue, G. R., Zeng, H. J., Ma, W. Y and Lu, C. J. (2002). Log Mining to Improve the

Performance of the Methods from statistic, Neural Nets, Machine Learning and

Experts System. Morgan Kaufman.

Zaiane, O. and Luo, J. (2001). Towards Evaluating Learners’ Behavior in a Web-based

Distance Learning Environment. In Proc. of IEEE International Conference on

Advanced Learning Technologies, Pp 357-360, Madison, WI.
