WEB USAGE MINING USING APRIORI ALGORITHM: UUM LEARNING CARE
PORTAL CASE
Azizul Azhar bin Ramli
Information System Department
Faculty of Information Technology and Multimedia,
Kolej Universiti Teknologi Tun Hussein Onn (KUiTTHO)
86400 Parit Raja, Batu Pahat, Johor D/T
Tel: +607-4538000 ext. 8056, Fax: +607-4532119
Email: [email protected]
Abstract
The enormous amount of information on the World Wide Web makes it an obvious
candidate for data mining research. The application of data mining techniques to the
World Wide Web is referred to as Web mining, a term that has been used in three
distinct ways: Web Content Mining, Web Structure Mining and Web Usage Mining.
E-Learning is one Web-based application that must deal with large amounts of data. In
order to discover the usage patterns and user behaviors of the university E-Learning
portal (UUM Educare), this paper implements the high-level process of Web Usage
Mining using a basic Association Rules algorithm called the Apriori algorithm. Web
Usage Mining consists of three main phases, namely Data Preprocessing, Pattern
Discovery and Pattern Analysis. Server log files are the raw data, which must go through
all the Web Usage Mining phases to produce the final results. Here, the Web Usage
Mining approach is combined with the basic Association Rules algorithm, Apriori, to
optimize the content of the university E-Learning portal. Finally, this paper presents an
overview of the analysis of the results, which the Web administrator can use to take
suitable actions.
KEY WORDS: server log file, data mining, Web mining, Web Usage Mining, Association
Rules, Apriori algorithm.
1.0 PROJECT OVERVIEW
Data mining is a technique used to deduce useful and relevant information to
guide professional decisions and other scientific research (Chen, Han and Yu, 1996). It
is a cost-effective way of analyzing large amounts of data, especially when a human
could not analyze such datasets.
The widespread use of the Internet has made automatic knowledge extraction
from Web log files a necessity. Information providers are interested in techniques that
can learn Web users’ information needs and preferences. Such techniques can improve
the effectiveness of their Web sites by adapting the information structure of the sites to
the users’ behavior.
Recently, the advent of data mining techniques for discovering usage patterns
from Web data (Web Usage Mining) indicates that these techniques can be a viable
alternative to traditional decision making tools (Srivastava et al., 2000). Web Usage
Mining is the process of applying data mining techniques to the discovery of usage
patterns from Web data and is targeted towards Web-based applications (Srivastava et
al., 2000). Web Usage Mining mines the secondary data (Web server access logs,
browser logs, user profiles, registration data, user sessions or transactions, cookies,
user queries, mouse clicks and any other data resulting from interaction with the Web)
derived from the interactions of the users during a certain period of Web sessions.
This paper explores the use of Web Usage Mining techniques to analyze Web
log records collected from the E-Learning portal (UUM Educare). Using commercial
Web data mining tools (WebLog Expert Lite 3.5 and Sawmill 7) and ARunner 1.0 (a
prototype GUI for Christian Borgelt's Apriori tool, by Shamrie Sainin, FTM, UUM), the
study identified several Web access patterns by applying a well-known data mining
technique (the Apriori algorithm) to the access logs of this educational portal. The output
includes descriptive statistics and Association Rules for the portal, with support and
confidence values, to represent the Web usage and user behavior of UUM Educare. The
results and findings of this experimental analysis can be used by the Web administrator,
and possibly by upper management in the UUM community, to plan upgrades and
enhancements to the portal presentation.
Objective and Scopes of Project
Generally, the main objective of this project is to perform the Web Usage Mining
process, specifically:
i. To preprocess the UUM Educare server log files from the university E-Learning
Web servers in order to determine and discover the user access patterns.
ii. To apply the basic Association Rules algorithm, Apriori, in the Web Usage
Mining process to produce usage patterns by determining what interests
users most among the options provided by the university E-Learning portal.
iii. To analyze the usage patterns and user behaviors for UUM Educare
produced by the Web Usage Mining process.
The scopes of this project are:
i. Research Organization: University E-Learning portal (UUM Educare)
ii. Focus of Project: Extract the server log files from the university E-Learning
server for certain weeks within a semester, preprocess the raw data, select
the data that contribute to pattern analysis, and perform pattern mining using
the basic Association Rules algorithm, Apriori, in order to produce the final
results (rules).
2.0 LITERATURE REVIEW
Data Mining, Web Mining and Web Usage Mining
Data mining (DM) is a step in the Knowledge Discovery in Databases (KDD)
process, which is defined as a “nontrivial process of identifying valid, novel, potentially
useful and ultimately understandable pattern in data” (Fayyad et al., 1996). The term
pattern here refers to some abstract representation of a subset of the data, that is, an
expression in some language describing a data subset or a model applicable to that
subset.
Data mining efforts associated with the Web, called Web mining, can be broadly
categorized into three areas of interest based on which part of the Web to mine; Web
Content mining, Web Structure mining, and Web Usage Mining (Kosala and Blockeel,
2000). In Web mining, data can be collected at the server-side, client-side, proxy servers
or a consolidated Web/business database (Srivastava et al., 2000). The information
provided by the data sources described above can be used to construct several data
abstractions, namely users, page-views, click-streams and server sessions.
Web Usage Mining is defined as the process of applying data mining techniques
to the discovery of usage patterns from Web log data in order to identify Web users’
behavior (Srivastava et al., 2000). It is the type of Web mining activity that involves the
automatic discovery of user access patterns from one or more Web servers.
As shown in Fig. 1, three main tasks are performed in Web Usage Mining:
Preprocessing, Pattern Discovery and Pattern Analysis. Fig. 1 gives a brief description
of each of these tasks.
Association Rules and Apriori Algorithm
The problem of deriving Association Rules from data was first formulated in
(Agrawal, Imielinski and Swami, 1993) and is called the “market-basket problem”. We
are given a set of items and a large collection of transactions, which are sets (baskets)
of items. The task is to find relationships between the occurrences of various items
within those baskets.
Apart from the supermarket scenario, there are many other examples where
Association Rules have been used, for example users’ visits to WWW pages, whose
structure and content can then be optimized. Xue et al. (2001) used a re-ranking
method and generalized Association Rules to extract the access patterns of Web site
usage. Mannila et al. (1999) use page accesses from a Web server log as events for
discovering frequent episodes. Chen et al. (1996) introduce the concept of using
maximal forward references in order to break down user sessions into transactions for
the mining of traversal patterns. Batista and Silva (2001) mine the Web access logs of
an online newspaper using the Apriori algorithm.
The task in Association Rules mining involves finding all rules that satisfy
user-defined constraints on minimum support and confidence with respect to a given
dataset. The most commonly used Association Rule discovery algorithm that utilizes the
frequent itemset strategy is the Apriori algorithm (Agrawal et al., 1993).
Apriori was the first scalable algorithm designed for association-rule mining. It is
an improvement over the AIS and SETM algorithms (Agrawal and Srikant, 1994). The
Apriori algorithm searches for large itemsets during its initial database pass and uses
the result as the seed for discovering other large itemsets during subsequent passes.
Itemsets having a support level above the minimum are called large or frequent itemsets
and those below are called small itemsets (Chen et al., 1996). The algorithm is based on
the large itemset property, which states that any subset of a large itemset is large, and
that if an itemset is not large then none of its supersets are large (Agrawal and Srikant,
1994).
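As an illustration of the level-wise search described above, the following minimal Python sketch finds frequent itemsets and uses the large itemset property to prune candidates. It is not the tool used in this study. The function name apriori_frequent_itemsets, the sample sessions and the 50% threshold are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: a candidate with an infrequent (k-1)-subset cannot be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical sessions, each a set of option paths visited by one user.
sessions = [
    {"/main", "/dms"},
    {"/main", "/dms", "/announcement"},
    {"/main", "/assessment"},
    {"/dms"},
]
frequent = apriori_frequent_itemsets(sessions, min_support=0.5)
```

With these hypothetical sessions only /main, /dms and the pair {/main, /dms} reach the threshold, so no three-item candidate is ever counted. This is the saving that the pruning step is meant to provide.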
Web Usage Mining in Educational Field
In a Web-based learning environment, where tutors and learners are separated
spatially and physically, student modeling is one of the biggest challenges. Traditional
student modeling techniques are inapplicable in these systems, because tutors are
overwhelmed by the huge volumes of sequential data generated as learners browse
through the Web pages (Agrawal and Srikant, 1995). Web mining techniques, including
clustering and Association Rules mining, can be applied to extract hidden and
interesting knowledge to facilitate instructional planning and student diagnosis. Web
mining in education is not new. It has been applied to mine aggregate paths for learners
engaged in a distance education environment (Ha, Bae and Park, 2000); relevant words
to students based on text mining from their browsed documents (Ochi et al., 1998);
e-articles for students based on key-word-driven text mining (Tang et al., 2000); and to
analyze learners’ learning behaviors (Zaiane and Luo, 2001). Previous research has
proposed going beyond usage mining to consider the content of the pages that have
been visited. In an E-Learning system, both learners’ browsing behaviors and course
content are important for deriving learners’ learning levels, intentions, goals, interests or
abilities. Incorporating course content can aid in understanding learners’ browsing
habits. In particular, understanding the learners’ browsing behaviors can facilitate
personalization of the course content.
An Artificial Intelligence in Education (AIED) system employs a knowledge base,
a student model and instructional plans. For a Web-based AIED system, Web mining
becomes part of student modeling. The system can relate its mined knowledge of page
contents and student navigation patterns to the students’ level of understanding in order
to decide upon appropriate feedback to them (Tang and McCalla, 2001).
3.0 PROJECT METHODOLOGY AND IMPLEMENTATION
The Web Usage Mining process proposed by Srivastava et al. (2000) is the main
guideline for the project implementation. Fig. 3 shows the general flow of the project
methodology.
Server Log File
The server log files dated from 19 February 2004 to 13 March 2004 were
selected for further analysis. They were retrieved from the UUM Educare server,
www.e-web.uum.edu.my. The total size of the server log files for that period is about
650 MB, and this large amount of data was the most challenging problem to handle
during the Data Preprocessing phase. Each record in the server log file is a single line
consisting of nine attributes, as shown in Fig. 4.
In the data selection phase, log files from 19 February 2004 until the end of the
semester (13 March 2004) were selected. The selection of the server log files must be
done carefully because UUM Educare is part of the GroupWeb facilities: besides
Educare (My Desktop) as the E-Learning facility, there are several other facilities such
as My Portfolio and Resources. Because the UUM Educare facilities provided by
GroupWeb are mixed with these other facilities or options, the server log file also
contains a mix of records for every transaction across the facilities in the GroupWeb
portal.
Data Preprocessing
Data Preprocessing is one of the most challenging phases in this study. The
major tasks in this phase include handling missing values, identifying outliers, smoothing
out noisy data and correcting inconsistent data (Han and Kamber, 2001). Data
Preprocessing consists of all the actions taken before the actual Pattern Analysis phase
starts.
The Data Preprocessing phase was carried out using commercially available
software. In the early stage of this phase, the Macro tool in Microsoft Access was used
to assist the preprocessing tasks; for the subsequent preprocessing tasks, the filter tool
in Microsoft Excel was used.
This period was selected because the university’s academic calendar shows that
the selected dates are near the end of the second semester of the 2003/2004 session,
with 13 March 2004 being the last day of the final examinations. Fig. 5 shows the data
after the preprocessing phase.
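To make the link between preprocessed log records and mining input concrete, the sketch below groups records into one transaction per client, where each transaction is the set of top-level option paths that client requested. This is an assumption about how transactions could be formed, not a description of the Access and Excel steps used in this study. The record layout (client IP, URL stem), the sample values and the function names option_of and build_transactions are hypothetical.

```python
from collections import defaultdict


def option_of(url_stem):
    """Return the top-level option path, e.g. '/dms/notes.pdf' -> '/dms'."""
    parts = [p for p in url_stem.split("/") if p]
    return "/" + parts[0] if parts else "/"


def build_transactions(records):
    """Group (client_ip, url_stem) records into one set of options per client."""
    by_client = defaultdict(set)
    for client_ip, url_stem in records:
        by_client[client_ip].add(option_of(url_stem))
    return dict(by_client)


# Hypothetical cleaned records: (client IP, URL stem).
records = [
    ("10.0.0.1", "/main/index.asp"),
    ("10.0.0.1", "/dms/notes.pdf"),
    ("10.0.0.2", "/dms/notes.pdf"),
    ("10.0.0.1", "/dms/slides.ppt"),
]
transactions = build_transactions(records)
```

Grouping by client IP alone is the simplest possible choice. A real log would usually also need a time-out rule to split one client's requests into separate sessions.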
Pattern Analysis
During the Pattern Analysis phase, descriptive methods are used to analyze the
data, producing a general summary of the Web usage and user behaviors. This general
summary includes the most active users of the portal, whether from Malaysia or from
other countries. For users from Malaysia, it also shows whether they accessed the UUM
Educare portal from the UUM Local Area Network (LAN) or from outside the UUM
campus.
The analysis also tries to find the top visitors for each facility or option provided
by the UUM Educare portal. The portal offers several facilities or options, such as dms,
profile, resources, announcement, assessment, calendar, pnotes, assignment and
forum. The dms option can be analyzed to find the most requested documents in the
UUM Educare portal. Besides the dms option analysis, the server log files also record
information about the documents that were downloaded.
Pattern Discovery – Association Rules
Given the server log files that represent UUM Educare portal activities, the main
purpose of Association Rules mining is to generate all Association Rules that have
support and confidence greater than the user-specified minimum support (called
min_sup) and minimum confidence (called min_conf) respectively. The algorithm used
for finding all such Association Rules is henceforth referred to as the Apriori algorithm
(Agrawal and Srikant, 1994).
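The role of min_sup and min_conf can be illustrated with a short Python sketch that enumerates the rules obtainable from one itemset and keeps those that meet both thresholds. The sketch accepts values equal to a threshold as well, which is an assumption. The function name association_rules and the sample sessions are hypothetical; the thresholds of 15% and 70% are the ones used later in this paper.

```python
from itertools import combinations


def association_rules(transactions, itemset, min_sup, min_conf):
    """Return (body, head, support, confidence) for rules body => head built from itemset."""
    transactions = [frozenset(t) for t in transactions]
    itemset = frozenset(itemset)
    joint = sum(1 for t in transactions if itemset <= t)
    support = joint / len(transactions)
    if joint == 0 or support < min_sup:
        return []
    rules = []
    # Every non-empty proper subset of the itemset can serve as the rule body.
    for r in range(1, len(itemset)):
        for body in combinations(sorted(itemset), r):
            body = frozenset(body)
            body_count = sum(1 for t in transactions if body <= t)
            confidence = joint / body_count
            if confidence >= min_conf:
                rules.append((set(body), set(itemset - body), support, confidence))
    return rules


# Hypothetical sessions, each a set of option paths.
sessions = [
    {"/main", "/dms", "/announcement"},
    {"/main", "/dms", "/announcement"},
    {"/main", "/dms"},
    {"/dms"},
    {"/main"},
]
rules = association_rules(
    sessions, {"/main", "/dms", "/announcement"}, min_sup=0.15, min_conf=0.70
)
```

In this hypothetical data every rule with /announcement in its body is kept, because each session containing /announcement also contains /main and /dms. Rules whose body omits /announcement fall below min_conf.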
The Apriori algorithm was selected because of its performance: it is able to run
the mining process in a short period. Currently, the Apriori algorithm is commonly used
for generating Association Rules in Web Usage Mining, and this experimental study
focuses on the exploration of Web Usage Mining in a university E-Learning portal (UUM
Educare).
Results
As stated above, this study focuses on Web Usage Mining of the UUM Educare
portal. The results of this study are divided into two sections. The first section discusses
the general description of the access patterns and user behaviors of the UUM Educare
portal (descriptive statistics). The second section presents the support and confidence
of the different levels in the UUM Educare portal. All the results are displayed using
charts, such as pie and bar charts, to make them easier to understand.
4.0 FINDINGS AND RESULTS
The Web Usage Mining for the Universiti Utara Malaysia E-Learning (UUM
Educare) portal, whose main URL is www.e-web.uum.edu.my, is divided into two main
stages. Each stage has its own phases with certain sub-activities. The first stage covers
log data retrieval from the UUM Educare server, in which the Data Selection and Data
Preprocessing phases are directly involved. The second stage is the mining stage,
which involves the Pattern Discovery phase (applying Association Rules) and the
Pattern Analysis phase in order to discover the UUM Educare portal usage patterns.
General Pattern Analysis Results (access pattern and users behaviors –
descriptive statistic)
The UUM Educare portal has several options (dms, resources, announcement,
assessment, calendar and forum) that can be chosen by the users. Based on the
Uniform Resource Locator (URL) stem, the figure below shows the users who accessed
only the UUM Educare options on the host www.e-web.uum.edu.my, without navigating
to the other sub-options provided by the UUM Educare portal.
Association Rules Results (support and confidence of the different options)
The figure below shows the support and confidence for each option provided by
the UUM Educare portal, starting from the main host, www.e-web.uum.edu.my, and
continuing with the main URL path of the E-Learning portal. There are 14 options
provided on UUM Educare and, for this analysis, a total of 10 578 transactions on the
option paths were selected.
Based on Fig. 6 above, it can be concluded that the /main path is the most
requested page, followed by the /dms path, where documents can be downloaded. The
/main path is the top level of UUM Educare and displays the basic information about the
portfolio/subject, including the subject area, discipline, owner and subject description.
From /main and the other option paths, the user can also select the other options
provided by the UUM Educare portal.
Fig. 7 shows the support and confidence for each directory. The /dms option
path, at 36.45%, has the highest percentage for both support and confidence. It is
followed by the /main and /assessment option paths, at 15.02% and 13.05%
respectively.
Association Rules induction is the extraction of rules of the form X => Y (if X
then Y), quantified by a confidence (the proportion of occurrences that satisfy Y among
the occurrences that satisfy X) and a support (the proportion of occurrences that satisfy
both X and Y among all occurrences).
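The two measures defined above can be written as formulas. The notation is an assumption introduced here for illustration: T denotes the set of all transactions (sessions) and t a single transaction. The counts in the worked example are hypothetical and are not taken from the UUM Educare logs.

```latex
\[
\mathrm{supp}(X \Rightarrow Y) = \frac{|\{t \in T : X \cup Y \subseteq t\}|}{|T|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{|\{t \in T : X \cup Y \subseteq t\}|}{|\{t \in T : X \subseteq t\}|}
\]
% Hypothetical counts: |T| = 200 sessions, 50 contain X, 40 contain both X and Y.
\[
\mathrm{supp}(X \Rightarrow Y) = \frac{40}{200} = 20\%,
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{40}{50} = 80\%
\]
```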
The next result shows the Association Rules, including support and confidence,
obtained by applying the Apriori algorithm to identify the patterns, with a threshold of
15% for the minimum support and a threshold of 70% for the minimum confidence.
Fig. 8 shows the pages that are related to each other, that is, the options most
frequently selected when certain other options are selected.
Based on the figure above, it can be concluded that the rule with the highest
support (22.0%) means “if in a session the user selected the /announcement and /main
option paths, the user also selected the /dms option path”; the rule with the highest
confidence (99.1%) says that “if in a session the user selected the /announcement and
/dms option paths, the user also selected the /main option path”. The next figure (Fig. 9)
is a graphical chart of the 6 most accepted rules for option relationships.
Fig. 10 below shows the Association Hyperedges for UUM Educare, which
represent the portal pages in the order they were archived. A threshold of 10% for the
minimum support and a threshold of 75% for the minimum confidence are used. The
highest support, 11.7%, is the percentage of transactions that contain all items
appearing in the hyperedge /assignment /announcement /main, for which the
confidence is 78.1%.
The hyperedge /assignment /announcement /assessment /main, with a
confidence of 92.2% and a support of 10.6%, has the highest average confidence, that
is, the average confidence of all rules that can be formed using the items in the
hyperedge with all items appearing in the rule. Here these rules are /main <-
/assessment /announcement /assignment, /assessment <- /main /announcement
/assignment, /announcement <- /main /assessment /assignment and /assignment
<- /main /assessment /announcement.
The following figure (Fig. 11) shows the 6 most accepted Association Rules
Hyperedges for the UUM Educare server log files during the selected period of time.
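The hyperedge confidence described above, the average confidence of the rules that place one item of the hyperedge in the head and all remaining items in the body, can be sketched in Python as follows. This is an illustration of the definition only and does not reproduce the tool used in this study. The function name hyperedge_measures and the sample sessions are hypothetical.

```python
def hyperedge_measures(transactions, items):
    """Return (support, average confidence) for the hyperedge made of items."""
    transactions = [frozenset(t) for t in transactions]
    items = frozenset(items)
    joint = sum(1 for t in transactions if items <= t)
    confidences = []
    for head in items:
        # One rule per item: head <- all other items of the hyperedge.
        body = items - {head}
        body_count = sum(1 for t in transactions if body <= t)
        confidences.append(joint / body_count if body_count else 0.0)
    support = joint / len(transactions)
    return support, sum(confidences) / len(confidences)


# Hypothetical sessions, each a set of option paths.
sessions = [
    {"/main", "/announcement", "/assignment"},
    {"/main", "/announcement", "/assignment"},
    {"/main", "/announcement"},
    {"/main"},
]
support, avg_conf = hyperedge_measures(
    sessions, {"/main", "/announcement", "/assignment"}
)
```

For these hypothetical sessions the three rule confidences are 1, 1 and 2/3, so the hyperedge confidence is their mean, 8/9, with a support of 50%.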
5.0 PROJECT SIGNIFICANCE
Generally, this project produces useful findings for analyzing the Web usage
patterns of UUM E-Learning, www.e-web.uum.edu.my, and more specifically:
i. This study is a first step in analyzing the university E-Learning portal by applying
the Web Usage Mining approach with the basic Association Rules algorithm,
Apriori.
ii. The outcomes of this study can be used by the Web administrator to plan
necessary improvements, enhancements and other valuable actions for the
university E-Learning portal.
iii. The implementation of the Web Usage Mining process for the university
E-Learning portal may become a guideline for system development purposes.
6.0 CONCLUSION AND RECOMMENDATION
Web Usage Mining is an active field of research and it will generate new hopes
in Internet-based business. Web Usage Mining applications are being used in some
famous Web sites, whereas this paper focuses entirely on the education field. This
paper presents a brief introduction to the Web mining technique, which is a part of data
mining technologies, and the implementation of Web Usage Mining in an E-Learning
portal. The server log files of UUM E-Learning (UUM Educare) on the server host
www.e-web.uum.edu.my were selected for this project. To perform the Web Usage
Mining, the methodology introduced by Srivastava et al. (2000) was the major guide; it
includes three main phases: Data Preprocessing, Pattern Discovery and Pattern
Analysis. All the phases were carried out carefully to produce quality results. The Data
Preprocessing phase of Web Usage Mining is a challenging task, and the basic
Association Rules algorithm, Apriori, was selected as the technique to produce the
support and confidence of the different levels in the Web Usage Mining of the UUM
Educare portal.
The Apriori algorithm was selected for performing Web Usage Mining on the
UUM Educare portal because it is a common data mining technique for
association-based analysis. By applying this algorithm to the Web log file, the
relationships between the accessed pages can be mined. The Web usage patterns and
user behavior on the UUM Educare portal can also be analyzed using this algorithm,
which the descriptive statistics approach cannot do. The results and findings of this
analysis are more reliable but less accurate, because of a property of the Apriori
algorithm whereby the same selected itemsets are always counted. The results and
findings of this experimental analysis are useful for the Web administrator in improving
Web services and performance through the improvement of the Web site, including its
contents, structure, presentation and delivery. The valuable actions may consist of
value-added modifications to the Web pages.
As a recommendation, in order to enhance and continue this project, the
suggested methodology can be implemented for system development purposes. Such
a system may implement the Web Usage Mining phases, including data selection, data
preprocessing, pattern discovery and pattern analysis. The Apriori algorithm may be
part of the pattern discovery sub-function.
Besides that, certain important recommendations can be proposed for improving
the results of the project implementation. As discussed in Section 4.0, Findings and
Results, the Association Rules are presented there with the percentages of support and
confidence for the 10 most requested options in the UUM Educare portal, along with the
Association Rules for the /dms option. The figures for support and confidence in
particular results are the same because of the calculation steps of the Apriori algorithm.
According to Boon Lay et al. (1999), the percentages of support for Association Rules
can be improved by finding all sets of items (attribute=value) that have transaction
support above the minimum support. Itemsets with minimum support are called large
itemsets, and the large itemsets are used to generate the desired rules.
Lastly, for future work, another method for analyzing sparse data can be used in
the study of E-Learning Web log access, different similarity Association Rules can be
used, and conclusions can be drawn about the most suitable alternatives for knowledge
extraction from Web log data.
REFERENCES
Abd. Wahab, M. H, Siraj, F and Yusoff, N. (2004). Log Mining Using Generalize
Association Rules. In Proceedings of Master Final Project 2004 Presentation,
UUM, Malaysia.
Agrawal, R., Imielinski, T. and Swami, A. (1993). Mining Association Rules between Sets
of Items in Large Databases. In Proceedings of the International ACM SIGMOD
Conference, Washington DC, USA, pages 207–216.
Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules. Proc.
of the 20th VLDB Conference. Pp 487-499.
Agrawal, R., and Srikant, R. (1995). Mining Sequential Patterns. In Proc. of the Eleventh
International Conference on Data Engineering (ICDE), Taiwan. Pp 3-14.
Batista, P. and Silva, M. (2001). Prospeccao dos Dados de Acesso a um Servidor de
Noticias na Web, 2nd Conferencia sobre Redes de Computadores, Evora,
Portugal.
Boon Lay, C., Khalid, M. and Yusof, R. (1999). Intelligent Database by Neural Network
and Data Mining. In Proc. of Artificial Intelligent Applications in Industry, Kuala
Lumpur. Pp 201-219.
Borgelt, C. (2004). Apriori: Finding Association Rules/Hyperedges with the Apriori
Algorithm. School of Computer Science, University of Magdeburg.
Chen, M.-S., Han, J., Yu, P.S. (1996). Data Mining: An Overview from a Database
Perspective. IEEE Transactions on Knowledge and Data Engineering, (8:6). Pp
866-883.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). The KDD Process for
Extracting Useful Knowledge from Volumes of Data. Communications of the
ACM, (39:11). Pp 27-34.
Han, J., Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan-Kaufmann
Academic Press, San Francisco.
Jiang, Q. (2003). Web Usage Mining: Process and Application. Presentation for CSE
8331.
Kosala, R., Blockeel, H. (2000). Web Mining Research: A Survey. ACM SIGKDD
(Special Interest Group on Knowledge Discovery and Data Mining) Explorations.
June, (2:1). Pp 1-10.
Mannila, H., Toivonen, H. and Verkamo, A. I. (1994). Efficient Algorithms for Discovering
Association Rules. In AAAI Workshop on Knowledge Discovery in
Databases, Seattle, Washington, USA, Pp 181–192.
Srivastava, J., Cooley, R., Deshpande, M., and Tan P. N. (2000). Web Usage Mining:
Discovery and Application of Web Usage Pattern from Web Data. Department of
Computer Science and Engineering, University of Minnesota.
Tang, C.; Lau, R.W.H.; Li, Q.; Yin, H.; Li, T.; and Kilis, D.(2000). Personalized
Courseware Construction Based on Web Data Mining. In Proc. of the First
International Conference on Web Information Systems Engineering (WISE 2000)
vol.2, Pp. 204-211.
Tang, Y. T. and McCalla, G. (2001). Student modeling for a Web based Learning
Environment: a Data Mining Approach. Department of Computer Science,
University of Saskatchewan, Canada.
Xue, G. R., Zeng, H. J., Ma, W. Y and Lu, C. J. (2002). Log Mining to Improve the
Performance of the Methods from statistic, Neural Nets, Machine Learning and
Experts System. Morgan Kaufman.
Zaiane, O. and Luo, J. (2001). Towards Evaluating Learners’ Behavior in a Web-based
Distance Learning Environment. In Proc. of IEEE International Conference on
Advanced Learning Technologies, Pp 357-360, Madison, WI.