International Review on Computers and Software (I.RE.CO.S.), Vol. 7, n. 3
Copyright © 2012 Praise Worthy Prize S.r.l. - All rights reserved
Interpreting Web Usage Patterns Generated Using a Hybrid SOM-
Based Clustering Technique
Ammar M. Huneiti
Abstract – The rapid and huge growth of the web has emphasized the need to monitor the
behavior of web users and to identify their interest, knowledge, preferences, goals, etc. This paper
introduces a methodology for classifying users and pages of an educational online hypermedia
using a hybrid clustering technique based on Self Organizing Map (SOM) neural networks. This
paper also introduces an analytical cluster validation and interpretation approach to verify and
explain the generated clusters of users and pages. The implemented cluster validation process
utilizes a silhouette-based quantitative measure, while a combined data visualization and statistical
cluster interpretation technique is proposed. Several experiments have been carried out using real
data collected through special lab sessions of real students navigating an online tutorial.
Experimental results indicated that the proposed methodology was able to prototype users and to
recognize the association between pages based on their usage. Moreover, the topic of interest and
the users interested in these topics were also identified.
Keywords: Web usage mining, SOM, Cluster validation and interpretation, User modelling,
Adaptive hypermedia
I. Introduction
Nowadays, we are witnessing a huge increase in the
resources and services hosted on the web. A similar
increase in web sites and web users and in their
diversity is also evident. In the last few years, the sharp
reduction in Internet subscription fees and in the
prices of online services and computer hardware has
resulted in an explosive growth in the number of
Internet users. Moreover, web sites are becoming more
complex with regard to their structure, the services
they provide, and the large number of documents that they
host, including diverse content of different media
types. This rapid growth has caused many concerns
related to the publishing of online hypermedia such as
cognitive overload and hypermedia disorientation. In
addition, issues such as author-enforced structures and
one-size-fits-all material are becoming more of a
concern and need to be reviewed. These problematic
issues are caused by the traditional and static methods
of authoring online hypermedia, where users are not or
cannot be considered in advance.
Until recently and according to [1], the author is
committed to the form as well as to the content of the
work, well in advance of the actual time at which it is
presented. There is a consensus among researchers that
powerful and advanced authoring tools and technologies
such as fast graphics, portable smart devices, cheap
internet, hypermedia editing and authoring software etc,
cannot help to improve the quality of the published
material unless similar powerful and advanced
authoring methods are utilised [2,3]. Thibeau [4] states
that “unless the information meets reader needs, in the
way reader needs to see it, these tools will never reach
their potential”. Technological advances in the field of
information processing, storage, presentation, and
retrieval have always had a great influence on the way
that online hypermedia is being authored and published.
In addition, data mining techniques are considered a
major tool for recognizing hidden usage patterns and
can therefore provide very useful feedback about pages
and/or users of online hypermedia. Extracting these
patterns will almost certainly contribute to the re-
authoring or re-structuring of the online hypermedia in a
user adaptive manner and will provide the author with a
very useful source of information about the real needs
of the end-users.
As a result, the need to monitor the behavior of users
and their interest is growing. Consequently, the
identification of the association relationship between
different pages on the web and even different web sites
based on their usage is also of great importance. The
above-mentioned issues were the real drivers behind the
research introduced in the field of web data mining. At
the present time, this field of research is becoming much
more mature and more specialized. Web usage mining
is a special type of web data mining which also includes
web content mining, web structure mining, and even
web opinion mining.
This paper introduces a methodology for classifying
users and pages of online hypermedia using a hybrid
clustering technique. At the core of this clustering
technique is a Self Organizing Map (SOM) neural
network. The paper also implements a cluster validation
and interpretation analysis in order to substantiate the
resulting clusters. Section 2 is a review of the literature
related to the application of web usage mining for
classifying online pages and users. Section 3 introduces
the SOM-based methodology for extracting patterns of
users and associated pages. Section 4 presents the
experimental results. Finally, section 5 presents future
work and concludes the paper.
II. Web Usage Mining for Classifying
Pages and Users
A common direction for most of the recent work
conducted on online hypermedia is to provide user-
centred and task-specific information to end users [5,6].
This direction was adopted by many state-of-the-art
hypermedia systems such as El-Tech [7], IPM [8],
MMA [9], and ADAPTS [10]. Users of online
hypermedia have different levels of knowledge,
expertise, and qualifications, and also different goals
and objectives. Many researchers such as those in [4,5]
argue that hypermedia authoring should be user-centric
rather than author-driven. As a result, data mining
techniques were utilized with many web-based
hypermedia systems to advise users and manage the way
they navigate through the online material [11,12].
Moreover, the information that is presented to users has
to vary in its focus, level of detail, and presentation
format. It has to be adapted to the information needs of
the users [13]. Therefore, the main objective of many
web-related research endeavors was to identify the
similarity or the strength of association between web
users and/or between pages in a quantitative manner.
Many mathematical models were utilized in order to
measure the similarity relationship between pages or
users.
Recognizing the similarity between users is part of an
ongoing research into user modeling which aims at
identifying the knowledge, interest, goal, or preferences
of users of a web site(s). User models are the
representation of the user’s state of mind [14]. Modeling
users can be used for personalizing and/or customizing
web sites to match the needs of individual users or a
group of users, respectively. In addition, user models
are an essential requirement for adaptive hypermedia
systems [15], which aim at delivering user-customized
information that is concise, specific, relevant, and easy
to understand. Adaptive hypermedia is defined in [16]
as “all hypermedia systems which reflect some features
of the user in the user model and apply this model to
adapt various visible aspects of the system to the user”.
Adaptive hypermedia attempts to solve several
problems associated with the static design of
hypermedia including author-enforced structures,
cognitive overload, disorientation in hyperspace, and
one-size-fits-all material [17]. As mentioned earlier,
many different features of the user can be used to adapt
the conveyed information including user knowledge,
goal, background, experience, preferences, etc. The first
two are the most commonly used features of the user in
adaptive hypermedia and they are, along with other
features, encapsulated in a user model. Moreover,
adaptive online hypermedia systems that improve their
organisation and presentation by learning from visitor
access patterns are reported in [18,19]. A common
characteristic of these systems is that they apply data
mining techniques to users’ access logs, which record
the behaviour of every user within the web site, in order
to fine-tune the web site and its information to the users’
needs.
In addition to user modelling, recognizing the
similarity between web pages is mainly an investigation
into the classification, clustering and organisation of
online pages [12, 20]. The classification and clustering
of online pages is normally associated with supervised
and unsupervised machine learning techniques,
respectively. Data mining techniques and in particular
clustering techniques are used to classify pages with
regard to their content, link structure, and usage [21,22].
Finding the content-based similarity between web pages
is very useful for categorisation of pages and topic
identification for rapid and accurate information
retrieval [23]. The content of pages can be text based
and any other type of media such as images, audio, and
video. In addition, similarity of pages can be assessed
with regard to their existing link structures which
include incoming and outgoing links. Pages that
reference the same set of pages can be considered
similar and on the other hand, pages that are referenced
by the same set of pages can also be considered similar
[24]. Moreover, usage-based similarity is concerned
with tracking the users’ interaction with the hypermedia
in order to identify the association between pages [12].
As a result, web data mining can be divided into
three main types: web usage mining, web content
mining, and web structure mining. Although it is beyond
the scope of this work, it is worth mentioning for
completeness that web opinion mining, which is mostly
associated with users of social networks, is an emerging
research direction related to web data mining.
According to Romero & Ventura (2007), web
data mining can be categorised as (i) clustering,
classification, and outlier detection (ii) association rule
and sequential pattern mining and (iii) text mining.
Normally, the content, the presentation, and the link
structure of online pages are designed and generated by
the author of the hypermedia. In contrast, the usage of
pages is influenced by the users of the hypermedia
who generate navigational paths through the
hypermedia which reflect patterns that need to be
discovered. These navigational paths were described by
[23] as the “footprints” of the users stored in log files.
Extracting and interpreting these patterns or footprints
generated by users of online hypermedia systems is the
main objective of web usage mining [25]. Web usage
mining is defined by [26] as the discovery and analysis
of patterns in data collected from the users’ interactions
with the website. The navigation of every user accessing
any web site is registered at the web server log file(s).
These are huge time-stamped files that record all details
of every interaction for every user with the web site.
Extracting meaningful patterns from these files can
support the development of online recommender
systems, improve the performance of information
retrieval systems, enable the building of online
user-adaptive systems, help to re-structure web sites in a
user-centred manner, identify customers’ interests, habits,
and preferences for online e-commerce systems,
optimise network performance and the configuration
of web/proxy servers, and serve many other
applications. The most popular application for web
usage mining is in educational hypermedia
[11,13,23,27] where logs are gathered for students
interacting with online educational material such as
tutorials, documentation, lessons, etc.
Kohonen’s Self Organizing Map (SOM) [28] is a
competitive artificial neural network (ANN) that is
classified as an unsupervised machine learning
technique. Many web data mining approaches have used
SOM as the main clustering technique [29], and in
particular for web usage mining [30,31]. As far as this
research is concerned, SOM was chosen as the primary
clustering technique because it is an unsupervised
learning technique which suits the nature of the
clustered data. It also enables the visualization of the
clustered data, normally, as a 2-D grid of location
sensitive clusters. SOM can deal with high-dimensional
data and map the results into a low-dimensional space
such as a user friendly grid topology that preserves the
spatial autocorrelation between clusters. It can also
handle a large number of clusters, reaching as many as
hundreds of generated clusters. SOM is very
suitable for generating recommendations to users
because it implements a neighborhood-based
organization of clusters where data vectors can belong
to a certain cluster and, although not as strong, still have
an associative relationship with other vectors in
neighboring clusters. To the best of our knowledge, the
LOGSOM system [31] is the work most related to our
research because it proposes a web usage classification
method that utilizes k-means clustering combined with
SOM. In contrast, our work utilises SOM to cluster
users while LOGSOM uses SOM to cluster pages.
As mentioned in [32], the existing clustering
techniques do not provide an indication of the quality of
their outcome. The most important steps in any
clustering technique are the validation and interpretation
of the resulting clusters. Validation and interpretation of
clusters are concerned with generating quality valid
clusters and explaining the outcome of these clusters,
respectively. The validation of clusters resolves certain
issues related to the clustering performance such as
identifying the best number of clusters to generate,
detecting the fitness of the resulting clustering scheme
to the data set, recognizing the suitability of the
partitioning for the data set, etc [32]. The silhouette
measure is used by many researchers to overcome the
clusters validation issue [33,34]. It provides a robust and
quantitative measure of how well each data point fits
within its assigned cluster and, hence, it can validate the
overall clustering procedure. On the other hand,
cluster interpretation is concerned with analyzing the
resulting clusters in order to identify useful patterns and
formulate trends. Most researchers deploy statistical
methods in order to interpret the clustering results [24].
In addition, the visualization of the resulting clusters,
where applicable, is a very useful aid in detecting
hidden patterns within the data [35]. As far as our work
is concerned, a combination of statistical and data
visualization techniques is used in order to interpret the
resulting clusters.
[Fig. 1 flow diagram: log file → user/page binary matrix construction → user/page matrix → k-means clustering of pages (with cluster validation) → clusters of pages → data normalization → normalized matrix → SOM clustering of users (with cluster validation) → clusters of users → interpretation of SOM clusters → user prototypes and lists of associated pages]
Fig. 1. Methodology for extracting usage patterns using SOM
III. SOM-Based Methodology for
Extracting Usage Patterns
The methodology used to extract patterns of
users and pages from web usage data is depicted in
Fig. 1. The methodology consists of three
main phases: (i) pre-SOM preparation,
(ii) SOM clustering of users, and (iii) post-SOM
analysis. The pre-SOM preparation phase
comprises all tasks included within the dashed
polygon in Fig. 1, which aim at preparing the log
file data for efficient SOM clustering. This
includes constructing a user/page navigation
matrix, reducing the dimension of this matrix
using k-means clustering, and data normalization.
The post-SOM analysis aims at interpreting the
generated SOM clusters in order to identify
prototypes of users and to extract lists of relevant
and associated pages based on the usage data
extracted from the log file.
The methodology also highlights the need for
validating the results of every clustering step
before proceeding to the next phase. Cluster
validation precedes every clustering step in order
to identify the best clustering parameters to adopt.
III.1 Data pre-processing for applying SOM
Online usage data come in huge textual log file(s)
which require extensive offline pre-processing
before any classification technique can be applied. The
pre-processing phase includes data cleansing,
reduction, transformation, normalization, and
modeling. In addition, individual users, pages and
users’ sessions must be identified.
The data used in this research is concerned with
students’ usage of an online hypermedia-based
tutorial of the Java© programming language
courses taught at the University of Jordan
including Object Oriented Programming I and II in
the spring semester of 2009/2010. The students
undertaking these courses were brought to special
Lab sessions and were requested to navigate
through the Java tutorial. These sessions were
approximately one hour each and all the students’
interactions were saved in centralised log files.
These files consist of students’ past usage records
which register attributes such as the usage time,
the IP address, the requested URL, the request
method, the transport protocol, etc. As far as this
work is concerned, only the first three attributes were
used, namely the usage time, the IP address, and the
requested URL. As a result of the controlled Lab
sessions, individual users were easily identified by
their unique IP addresses and users’ sessions were
also identified by the date and time of every Lab
session.
III.1.1 Building the User/Page Navigation Matrix
The centralised log files were gathered and
processed in order to construct the user/page
navigation matrix. The collected data consist of
438 different users accessing 195 distinct pages.
These users generated a total of 8441 transactions.
Table I illustrates the constructed user/page matrix
which is of dimension 438x195, where Ui refers to
user i and Pi refers to page i.
TABLE I
USER/PAGE MATRIX
U2 U3 U4 U5 …. U438
P1 1 0 1 1 ….. 1
P2 1 0 0 1 ….. 0
P3 0 0 1 1 ….. 1
….. ….. ….. ….. ….. ….. …..
P195 0 0 1 1 ….. 0
The constructed user/page matrix is a binary
matrix where 1 indicates a “visit” and 0 indicates a
“no visit”. For instance, the matrix depicted in
Table I shows that U1 has visited P1 and did not
visit P2, P3 and so forth. It is difficult to apply the
SOM technique using this user/page matrix as an
input, because of its high dimensionality. It would
require the number of input neurons to be
equal to the number of pages, i.e. 195, which
is a very high number of input neurons for a SOM.
This would distort the SOM and yield results
that are difficult to interpret. Moreover, the
required processing time would be unacceptable,
especially if the number of pages exceeds 195,
which is the case with many existing web sites.
Therefore, an initial classification phase is needed
in order to group similar pages together and enable
us to deal with groups of similar pages rather than
individual pages. This will significantly reduce the
dimensionality of the user/page matrix and
provide a more suitable data structure for SOM
clustering.
III.1.2 K-Means for Initial Classification of Pages
All visited pages are classified using the k-means
clustering technique in order to group individual
pages into k classes of pages. Due to its simplicity,
k-means clustering technique is widely used in
most hybrid clustering systems as an initial
classification technique [22,31]. This step is
normally followed by another clustering phase
using a more robust clustering technique.
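As an illustration of this initial classification step, the sketch below clusters binary page vectors with a k-means-style loop. Several choices here are assumptions made for the sake of a runnable example and are not stated in the paper: the distance is the Hamming distance (the function selected later in this subsection), centroids are updated by a per-position majority vote, and the names `kmeans_hamming`, `iterations`, and `seed` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def hamming(u: Sequence[int], v: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def kmeans_hamming(
    vectors: List[List[int]], k: int, iterations: int = 20, seed: int = 0
) -> Tuple[List[int], List[List[int]]]:
    """Cluster binary vectors into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumed initialisation: k distinct data vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid under the Hamming distance.
        labels = [
            min(range(k), key=lambda c: hamming(v, centroids[c])) for v in vectors
        ]
        # Update step (assumed): majority vote per position, ties resolved to 1.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [
                    1 if 2 * sum(col) >= len(members) else 0
                    for col in zip(*members)
                ]
    return labels, centroids


# Four hypothetical pages described by the visits of four users.
page_vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
labels, centroids = kmeans_hamming(page_vectors, k=2)
```

The first two pages end up in one group and the last two in the other, since each pair is visited by exactly the same users.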
As shown in Table I above, the data used for
clustering the pages has a significant number of
zero-valued entries, where zero indicates “no
visit”. Clustering this type of data using any
conventional distance function such as the
Euclidean distance, which treats all entries as
equally significant, will yield a distorted result.
This is because the significant information is
mainly embedded in the nonzero elements of the
dataset, while the zero-valued elements
imply, more or less, that no information was
acquired. Therefore, the clustering results will not be
accurate if the zero and nonzero elements are
given the same significance.
As a result, the distance function selected for
the k-means clustering step was the Hamming
distance function. This function employs the
percentage of elements that differ in order to
measure the similarity between any pair of data
vectors. The Hamming distance function is only
suitable for binary data where it calculates the
frequency of occurrence of the patterns 01 and 10
within the two vectors in comparison, compared to
the overall length of the vectors. In addition, it is
important to decide on the most suitable number
of clusters (k) to use for clustering the dataset.
This is done by comparing the sum of the
silhouette values (SumSil) for all data points using
different values for k. The silhouette value for
each data point is a similarity measure of that
point to points in its assigned cluster compared to
points in the other clusters. The silhouette value
Sil(i) for a data point i is computed in [34] as
follows:
$$\mathrm{Sil}(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (1)$$
where a(i) is the average distance of data point i from
all other points within its assigned cluster, and b(i) is
the average distance of data point i from all other points
in the nearest cluster to its original cluster. The lower
the value of a(i) and the higher the value of b(i), the
better i fits within its own cluster, and vice versa. As
can be deduced from equation (1) above, the silhouette value
for any data point i will range from +1, indicating points
that are very distant from neighbouring clusters, through
0, indicating points that are not distinctly in one cluster
or another, to -1, indicating points that are probably
assigned to the wrong cluster. From the above
description it is clear that:
$$-1 \le \mathrm{Sil}(i) \le +1 \qquad (2)$$
By comparing the sums of the silhouette values of the
whole dataset using different numbers of clusters (k), we
can determine the best number of clusters (k)
for this particular dataset. The sum of silhouette
values for k clusters, SumSil(k), is defined as:
$$\mathrm{SumSil}(k) = \sum_{i=1}^{D} \mathrm{Sil}(i) \qquad (3)$$
where Sil(i) is as defined in equation (1), and D is the
number of data points in the dataset. The higher the
SumSil the more suitable the number of clusters (k)
used. In theory, the highest positive value that the
SumSil can reach is equal to the number of data points
in the clustered dataset (+D), and this can only occur
when all data points score +1, and vice versa.
Considering the dataset used in this work where 195
pages are clustered, the highest value that SumSil can
reach is +195 (perfect clusters) and the lowest is -195
(completely wrong clusters). Table II shows the sum of
the silhouette values (SumSil) for different
numbers of k-means clusters (k). The SumSil values in
the table indicate that the most suitable number of
clusters, k, for this particular dataset is 3, where
SumSil is the highest.
TABLE II
SUMSIL VS K
k        3     4     5     6     7
SumSil   72.1  41.5  40.9  47.5  37.2
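The silhouette-based validation of equations (1) and (3) can be sketched as follows, using the Hamming distance described above. This is a minimal illustration under stated assumptions: the function names are hypothetical, a point that is alone in its cluster is given a silhouette of 0 by convention, and at least two clusters are assumed to exist.

```python
from typing import List, Sequence


def hamming(u: Sequence[int], v: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def silhouette(points: List[Sequence[int]], labels: List[int], i: int) -> float:
    """Sil(i) = (b(i) - a(i)) / max(a(i), b(i)), as in equation (1)."""
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    if not own:
        return 0.0  # assumed convention for a singleton cluster
    # a(i): average distance to the other points of the assigned cluster.
    a = sum(hamming(points[i], points[j]) for j in own) / len(own)
    # b(i): smallest average distance to the points of any other cluster.
    b = float("inf")
    for c in set(labels):
        if c == labels[i]:
            continue
        members = [j for j, lab in enumerate(labels) if lab == c]
        mean_dist = sum(hamming(points[i], points[j]) for j in members) / len(members)
        b = min(b, mean_dist)
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def sum_sil(points: List[Sequence[int]], labels: List[int]) -> float:
    """SumSil: sum of the silhouette values of all data points, equation (3)."""
    return sum(silhouette(points, labels, i) for i in range(len(points)))


# Four hypothetical binary page vectors and two candidate labelings.
pages = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
good_score = sum_sil(pages, [0, 0, 1, 1])  # identical pages grouped together
bad_score = sum_sil(pages, [0, 1, 0, 1])   # identical pages split apart
```

In this toy case the well-separated labeling reaches the theoretical maximum of +D = 4, while the mixed labeling scores -2. In practice, SumSil would be evaluated once per candidate k and the k with the highest value retained, as in Table II.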
III.1.3 Data normalization to improve clustering
Data normalization is an important prerequisite that
prepares the data resulting from the k-means phase for
better and more efficient SOM
clustering. It consists of two main
steps: (i) converting the user/page matrix into a
user/group-of-pages matrix, and (ii) percentage-based
normalization of the resulting users’ navigational
vectors.
Recall that the initial k-means clustering phase has
classified all pages into three clusters (Table II) or
groups of pages (GoP). These groupings are used to
reduce the dimensionality of the user/page matrix and to
convert it to user/GoP matrix. This matrix is constructed
by calculating Vij, which is the total number of visits
generated by user i, to all pages that belong to the group
of pages j (GoPj).
$$V_i = \sum_{j=1}^{k} V_{ij} \qquad (4)$$
where Vi is the total number of visits generated by user
i. Hence, the overall navigation of any user i can be
described in terms of the vector $\vec{V}_i$, where:
$$\vec{V}_i = \langle V_{i1}, V_{i2}, V_{i3}, \ldots, V_{ik} \rangle \qquad (5)$$
As shown in Table III, the user/GoP matrix is the
concatenation of all vectors $\vec{V}_i$. The number of
columns of this matrix equals the number of users and the
number of rows equals the number of GoPs.
TABLE III
USER/GoP MATRIX
U1 U2 U3 U4 U5 …… U438
GoP1 1 21 11 0 11 …… 1
GoP2 4 7 0 0 2 …… 0
GoP3 6 0 0 25 5 …… 2
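The conversion from the user/page matrix to the user/GoP matrix can be sketched as below. The function name `user_gop_matrix` and the toy input are hypothetical. Note that the sketch stores one row per user, so its output is the transpose of the layout printed in Table III.

```python
from typing import List


def user_gop_matrix(
    user_page: List[List[int]], page_group: List[int], k: int
) -> List[List[int]]:
    """Return counts where counts[i][j] is V_ij, the visits of user i to GoP j.

    user_page[i][p] is the binary visit flag of user i for page p, and
    page_group[p] is the index (0..k-1) of the group that page p belongs to.
    """
    counts = []
    for row in user_page:
        v = [0] * k
        for p, visited in enumerate(row):
            v[page_group[p]] += visited
        counts.append(v)
    return counts


# Two hypothetical users, four pages assigned to three groups of pages.
toy_user_page = [[1, 0, 1, 1], [0, 1, 0, 0]]
toy_page_group = [0, 0, 1, 2]
toy_counts = user_gop_matrix(toy_user_page, toy_page_group, k=3)
```

The first user touches one page in each group, whereas the second user only visits a page of the first group.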
The total number of visits made by a user does not
fully express the actual navigational pattern of this user.
It is more appropriate to consider the
number of visits made by a user to a GoP
relative to the overall visits made by the same user to
all GoPs in a single session. Therefore, the
percentage of the user’s visits to a group of pages in
comparison with his/her visits to other groups is more
representative of the overall navigational pattern of this
particular user.
For instance, assume that we have three GoPs
and that user A has generated 21, 7, and 0 visits to GoP1,
GoP2, and GoP3, respectively. Hence, the vector that
represents the user’s navigation is <21, 7, 0>. If user B
has generated the vector <3, 1, 0>, then calculating the
distance between these two raw vectors with any standard
distance function yields a comparatively high value, i.e.
low similarity between the two navigational patterns.
However, the percentage-normalized values of these two
vectors are both equal to <0.75, 0.25, 0>. Applying any
distance function to the normalized vectors
yields a distance of zero. Hence, the two
navigational patterns are identical, which is the
desired outcome when considering navigational “patterns”
rather than “counts of visits”. Table IV illustrates the
normalized user/GoP matrix, which is a normalized
version of Table III, above. Every column in the table
represents the navigational pattern of an individual user
which is described using a 3-D vector of visits
percentage.
TABLE IV
NORMALISED USER/GoP MATRIX
U1 U2 U3 U4 U5 …… U438
GoP1 0.09 0.75 1 0 0.61 …… 0.33
GoP2 0.36 0.25 0 0 0.11 …… 0
GoP3 0.55 0 0 1 0.28 …… 0.67
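The percentage-based normalization and the worked example of users A and B can be checked with the short sketch below. The function names are hypothetical, the Euclidean distance is used only as one example of a standard distance function, and returning an all-zero vector for a user with no visits is an assumption not discussed in the paper.

```python
import math
from typing import List, Sequence


def normalise(visits: Sequence[int]) -> List[float]:
    """Divide each GoP count by the user's total number of visits."""
    total = sum(visits)
    if total == 0:
        return [0.0] * len(visits)  # assumed handling of users without visits
    return [v / total for v in visits]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


user_a = [21, 7, 0]
user_b = [3, 1, 0]

raw_distance = euclidean(user_a, user_b)  # sqrt(18^2 + 6^2), roughly 18.97
norm_distance = euclidean(normalise(user_a), normalise(user_b))  # 0.0

# Column U1 of Table III, normalised and rounded as in Table IV.
u1_normalised = [round(x, 2) for x in normalise([1, 4, 6])]  # [0.09, 0.36, 0.55]
```

The raw vectors are far apart although both users split their visits 75%/25% between GoP1 and GoP2; after normalization the distance vanishes.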
III.2 Classification of Users Using SOM
A SOM neural network consists mainly of two layers,
namely, the input layer and the output layer. Every
neuron in the input layer is connected to all neurons in
the output layer. Even though the output layer can be a
1-D row of neurons or a 3-D mesh, it is normally
a 2-D lattice or grid of output neurons. As
illustrated in Fig. 2, the SOM’s input neurons are fully
connected to the output neurons using weighted
connections. This topology of neurons enables the
projection of the input patterns onto a 2-D grid of output
neurons. The location of the winning output neuron
corresponds to a particular feature of the input pattern.
To a lesser degree, the neurons in the pre-defined
neighborhood of the winning neuron also adapt to the input
patterns.
Fig. 3 outlines the pseudo code of the SOM
used in this study. The algorithm consists of two
phases (i) SOM initialization, and (ii) SOM
Learning. After a pre-defined number of learning
cycles, this topology will eventually correspond
with the principle of spatial autocorrelation, which
implies that the closer the output neurons to each
other, the more similar they are and the more they
resemble similar input patterns. In clustering
terminology, the spatial autocorrelation of SOM
provides a multi-level ranking mechanism for the
suitability of clusters to the corresponding input
patterns.
The data set that is used to train the SOM is the
one shown in Table IV. Unlike what was done at
the first phase of clustering using k-means, SOM
is used here to classify users rather than
classifying pages. All users are projected into a 2-
D grid according to their navigation patterns.
III.3 Interpretation of the SOM Clusters
In general, the interpretation of clusters is a
troublesome process which requires a great deal of
understanding of the nature of the original data.
The method deployed here is based on sorting the
original dataset, represented by the user/page
matrix of Table I, using the clusters resulting from
the SOM classification. The following hypothesis,
which outlines the concept of page usage similarity,
is adopted for interpreting the resulting SOM
clusters.
1. Initialization
1.1 Initialize the network
1.1.1 Set the number of input neurons
1.1.2 Set the number of output neurons
1.1.3 Decide on the topology of the output layer
1.2 Initialize neurons’ weights
1.2.1 Initialize neuron weights to small values (say, in the interval [0,1])
1.2.2 set the learning rate parameter α
1.3 Decide on the appropriate distance function DF
1.4 Decide on the appropriate neighborhood function NF
1.5 Set the number of learning cycles (Epochs) Eps
2. Learning
2.1 epoch =1
2.2 iteration =1
2.3 Apply data point di
2.4 Calculate the distance (D) from di to all neurons using DF
2.5 Identify the winning neuron Nw where D is minimum
2.6 Strengthen the weights of Nw towards di using α
2.7 Strengthen (although not as much) the weights of all neighboring neurons NN using NF
2.8 While iteration ≤ number of data points → increment iteration, go to 2.3
2.9 While epoch ≤ Eps → increment epoch and go to 2.2
Fig. 2. SOM topology
Fig. 3. SOM pseudo code
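The pseudo code of Fig. 3 can be turned into a small runnable sketch as shown below. It is an illustration under several assumptions rather than the implementation used in the study: the neighborhood function is taken to be Gaussian, the learning rate and neighborhood radius are assumed to decay linearly, and the names `train_som` and `best_matching_unit` are hypothetical. The default learning-rate bounds (0.9 and 0.02) and the Euclidean distance follow the values reported in Section IV.1.

```python
import math
import random
from typing import List, Sequence, Tuple


def best_matching_unit(weights, x: Sequence[float]) -> Tuple[int, int]:
    """Steps 2.4-2.5: return the grid position of the neuron closest to x."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))  # squared Euclidean
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data: List[Sequence[float]], rows: int, cols: int, epochs: int,
              alpha_start: float = 0.9, alpha_end: float = 0.02, seed: int = 0):
    """Train a rows x cols SOM and return its weight grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1.2.1: weights initialised to values in [0, 1).
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius_start = math.hypot(rows - 1, cols - 1)  # largest distance on the grid
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):                        # step 2.9: loop over epochs
        for x in data:                             # step 2.8: loop over data points
            frac = step / max(1, total_steps - 1)
            # Assumed linear decay of learning rate and neighborhood radius.
            alpha = alpha_start + (alpha_end - alpha_start) * frac
            radius = max(1.0, radius_start + (1.0 - radius_start) * frac)
            win_r, win_c = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d = math.hypot(r - win_r, c - win_c)
                    if d <= radius:
                        # Steps 2.6-2.7: the winner (d = 0) moves most,
                        # its neighbors move less (assumed Gaussian NF).
                        h = math.exp(-(d * d) / (2.0 * radius * radius))
                        w = weights[r][c]
                        for k in range(dim):
                            w[k] += alpha * h * (x[k] - w[k])
            step += 1
    return weights


# Hypothetical normalised user/GoP vectors in the style of Table IV.
sample_users = [[0.09, 0.36, 0.55], [0.75, 0.25, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
som = train_som(sample_users, rows=4, cols=4, epochs=30)
assignments = [best_matching_unit(som, u) for u in sample_users]
```

Each entry of `assignments` is the grid position of the output neuron, i.e. the cluster, to which the corresponding user is mapped.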
Hypothesis: Pages are deemed similar, related,
associated and relevant to each other if they are
frequently visited together in a single session by
many users. The more users visit the same
collection of pages the stronger the similarity
relationship between these pages.
The cluster interpretation technique used is
explained using the following example.
Let us assume that users U1, U2, and U3 have been
assigned to the same cluster (C1) and their
navigation patterns over 8 pages (P1...P8) are
summarized as follows:
              P1  P2  P3  P4  P5  P6  P7  P8   Non-zero average
C1   U1        1   0   1   1   0   0   1   1
     U2        1   0   0   1   0   0   1   0
     U3        1   1   0   0   0   0   1   1
     Total     3   1   1   2   0   0   3   2   2
Users 1, 2 and 3 have similar navigational
patterns because they belong to the same SOM
cluster. By considering the above-mentioned
hypothesis, it is clear that P1 and P7 are strongly
associated with each other because they were visited
by all users of C1. In order to comment
on the association of the rest of the pages we need
a quantitative measurement of the degree of
association.
As shown in the above example, the total number of
visits made to every page by the users belonging to
this cluster is calculated. Furthermore, a cluster
average of all “nonzero” totals of visits is also
calculated, which here is equal to 2. In order to
determine the degree of association between pages,
it is suggested that all pages whose total number of visits is
greater than or equal to this average be
considered associated. Considering the navigation
of users of this cluster, P1 and P7 can be
considered associated with an association score of
3. Similarly, P4 and P8 are also associated
within this cluster but with an association score of
2. The rest of the pages are not considered associated
because they score less than the cluster average.
On the other hand, it can be deduced that users of
this cluster are highly interested in Pages P1 and
P7 and, to a lesser degree, they are also
interested in pages P4 and P8. This method was
adopted for the analysis of the main clusters
generated by the SOM.
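The interpretation rule illustrated by the example of cluster C1 can be reproduced with the following sketch. The function name `associated_pages` is hypothetical, and returning an empty result for a cluster without any visits is an assumption; the data are the three navigation vectors of U1, U2, and U3 given in the example.

```python
from typing import Dict, List, Tuple


def associated_pages(
    cluster_rows: List[List[int]],
) -> Tuple[List[int], float, Dict[int, int]]:
    """Return (totals per page, non-zero average, {page number: association score})."""
    # Total number of visits per page over all users of the cluster.
    totals = [sum(col) for col in zip(*cluster_rows)]
    nonzero = [t for t in totals if t > 0]
    if not nonzero:
        return totals, 0.0, {}  # assumed handling of an empty cluster
    average = sum(nonzero) / len(nonzero)
    # Pages whose total reaches the non-zero average are considered associated.
    scores = {p: t for p, t in enumerate(totals, start=1) if t >= average}
    return totals, average, scores


# Navigation of U1, U2 and U3 over pages P1..P8 (cluster C1 of the example).
c1 = [
    [1, 0, 1, 1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 1],
]
totals, average, scores = associated_pages(c1)
# totals  == [3, 1, 1, 2, 0, 0, 3, 2]
# average == 2.0
# scores  == {1: 3, 4: 2, 7: 3, 8: 2}
```

The result agrees with the example: P1 and P7 are associated with a score of 3, and P4 and P8 with a score of 2.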
IV. Experimental Results
IV.1 SOM Validation
Different dimensions for the SOM’s grid topology
using different numbers of training cycles (epochs) were
tested. The sum of the silhouette values (SumSil) was
used in order to compare the results and to verify the
suitability of different grid arrangements and training
epochs. Table V shows the sum of the silhouette values
generated using different dimensions of the output grid
at different epochs. Since the number of data points is
438 and each silhouette value lies between -1 and +1, the
maximum value of SumSil is +438 (very good) and the
minimum is -438 (very bad). The oval-surrounded cells
in the table represent the highest SumSil value within
every grid arrangement. Overall, these values indicate
that the best arrangement for the SOM’s output grid is
21x21 trained using 60 epochs.
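As a rough illustration of how a SumSil value could be obtained, the sketch below sums the standard silhouette values of Rousseeuw over all data points. It assumes that each data point has already been labelled with its cluster (for example, the SOM cell it maps to); the function name `sum_silhouette` and the handling of singleton clusters (silhouette 0) are assumptions and may differ in detail from the implementation used in the experiments.

```python
import numpy as np


def sum_silhouette(points, labels):
    """Sum of silhouette values over all points (Euclidean distance).

    Hypothetical helper: `points` is an (n, d) array and `labels` holds one
    cluster label per point.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return 0.0  # the silhouette is undefined for a single cluster
    total = 0.0
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = int(own.sum())
        if n_own == 1:
            continue  # singleton cluster: silhouette taken as 0
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total


# Two compact, well-separated groups give a sum close to the number of points.
demo_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(sum_silhouette(demo_points, [0, 0, 1, 1]))
```

Because every silhouette value lies in [-1, +1], the returned sum is bounded by the number of data points in absolute value, which is the bound used when reading Table V.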
As a result, the SOM used in this work consists
of 3 input neurons and an output grid topology of
21x21 neurons with a total of 441 output neurons.
It utilizes an adaptive training strategy where the
values of the learning rate and the neighborhood
distance are altered starting from an initial
maximum value towards a final pre-set minimum
value for fine-tuning the network. The learning
rate α is initially set to 0.9 and is altered
adaptively towards 0.02 at the end of the learning
phase. Similarly the neighborhood distance is
initially set to equal the maximum distance
between two neurons and decreases towards 1.
The distance function used is the well-known
Euclidean distance.
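The adaptive training strategy described above can be sketched as follows. Only the grid size, the number of epochs, the start and end values of the learning rate (0.9 and 0.02), the neighborhood range (maximum inter-neuron distance down to 1) and the Euclidean distance come from the text. The linear decay schedule, the Gaussian neighborhood function, the random weight initialization, the online (per-sample) update and all function names are assumptions made for illustration.

```python
import numpy as np


def train_som(data, rows=21, cols=21, epochs=60,
              lr_start=0.9, lr_end=0.02, seed=0):
    """Train a rectangular SOM; returns weights of shape (rows*cols, d)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Grid position (row, col) of every output neuron, in row-major order.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    # Assumption: weights start at random positions inside the data range.
    lo, hi = data.min(axis=0), data.max(axis=0)
    weights = rng.uniform(lo, hi, size=(rows * cols, data.shape[1]))
    # Neighborhood distance shrinks from the largest grid distance to 1.
    radius_start = float(np.hypot(rows - 1, cols - 1))
    radius_end = 1.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / max(total_steps - 1, 1)
            # Assumption: both parameters decay linearly over training.
            lr = lr_start + (lr_end - lr_start) * frac
            radius = radius_start + (radius_end - radius_start) * frac
            # Best matching unit: neuron closest to x (Euclidean distance).
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Assumption: Gaussian neighborhood centred on the BMU.
            grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
            influence = np.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
            weights += lr * influence[:, None] * (x - weights)
            step += 1
    return weights


def best_matching_cells(data, weights):
    """1-based cell number of the closest neuron for every data point."""
    data = np.asarray(data, dtype=float)
    dist = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return dist.argmin(axis=1) + 1
```

With a data matrix of 438 rows and 3 columns, `train_som(data)` would correspond to the 21x21, 60-epoch configuration reported here, and `best_matching_cells` would give the cell (cluster) of each student under the assumptions listed above.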
Fig. 4 shows a 3-D plot of the 438 data points
(red dots) overlaid with the 441 SOM neurons
(line-connected blue stars) after training over 60 epochs. The
figure shows the distribution of the SOM output neurons
compared to the distribution of the original dataset. It
can be seen that after 60 training epochs the SOM
neurons have arranged themselves around the training
data points and fully cover the data. The
generated 21x21 grid of output neurons is depicted in
Fig. 5, where spatial closeness represents similarity. The
cells in the grid are numbered left to right starting from
cell 1 (top left corner) and finishing with cell 441 (bottom
right corner).
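The cell numbering can be made concrete with a small sketch that converts a cell number into a grid position and lists the adjacent cells. It assumes that numbering proceeds row by row and that direct neighbors include diagonal cells (8-connectivity on a rectangular grid); neither detail is stated explicitly in the text, and the function names are hypothetical.

```python
def cell_to_row_col(cell, cols=21):
    """Convert a 1-based, row-major cell number to a 0-based (row, col)."""
    return divmod(cell - 1, cols)


def adjacent_cells(cell, rows=21, cols=21):
    """Cell numbers directly adjacent to `cell`, diagonals included."""
    r, c = cell_to_row_col(cell, cols)
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                result.append(rr * cols + cc + 1)
    return sorted(result)


for cell in (1, 21, 428, 441):
    print(cell, cell_to_row_col(cell))
print(adjacent_cells(1))
```

Under these assumptions, cell 1 is the top left corner, cell 21 the top right corner, and cell 428 lies in the bottom row; a corner cell such as cell 1 has exactly three adjacent cells.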
Fig. 4. SOM’s 441 output neurons after 60 epochs compared to training data
TABLE V
SUM OF SILHOUETTE VALUES FOR DIFFERENT GRID DIMENSIONS AND EPOCHS
Grid \ Epochs      10       20       40       60       80      100      120
10x10           274.5    297.1    281.6    284.6    306.1    307.8*   279.3
15x15           324.1    328.9    359.4*   351.4    329.9    323.5    343.9
21x21           364.2    364.1    367.8    375.9*   364.3    358.1    368.8
* highest SumSil value for the grid arrangement
Fig. 5. The resulting 21x21 SOM output grid after 60 epochs
IV.2 Analysis of SOM clusters
The generated SOM clusters depicted in Fig. 5 are
interpreted in two steps. Firstly, the 2-D output grid of
clusters is inspected and analyzed visually. The grid
shows that the most significant clusters are the three
clusters depicted using the oval shapes because they
contain the highest number of students. These clusters
are represented in cells 1, 21, and 428 of the SOM grid
and they contain 15, 74, and 47 students, respectively.
These clusters are a valid representation of the overall
trend of all the students’ navigation through the tutorial.
In addition, each cluster represents a user prototype and
therefore it can be suggested that the users of the Java
tutorial are, mainly, of three prototypes. The grid also
has a substantial number of cells containing only one or
two students. These can be interpreted as students
generating random clicks over the tutorial and/or
students with no defined interest in any valid tutorial
topic. Perhaps, these are students that did not take the
exercise seriously or they had an individualized and
personal topic of interest that did not match with the
interests of the overall population of students.
Secondly, the statistical procedure described in
section (3.3) was used to interpret the resulting grid of
clusters. The three main clusters of the SOM grid
depicted in Fig. 5 were analyzed in more detail. Table
VI shows all the pages frequently visited by students of
cluster 1 and scoring above the average (average = 3.1).
The table shows that all these pages are sections of
chapters 8, 9, and 10 of the tutorial, which were shown to be
associated by the navigation of the students of cluster 1.
These chapters are also expected to be
related because the author of the tutorial organized
them in succession. In addition, it can be noticed that
the main topic of this navigational pattern is the “Array
Object”, with most pages either directly relevant to or
indirectly supporting this theme.
TABLE VI
CLUSTER 1 - ALL PAGES SCORING ABOVE AVERAGE
Chapter Page Name Visits Total
'/chap10_05.html' 'Array length' 6
'/chap10_03.html' 'for loops' 5
'/chap09_06.html' 'Printing an object' 5
'/chap09_13.html' 'Generalization' 5
'/chap09_14.html' 'Algorithms' 5
'/chap09_05.html' 'Creating a new object' 4
'/chap08_12.html' 'Objects and primitives' 4
'/chap09_01.html' 'Class definitions and object types' 4
'/chap10_02.html' 'Copying arrays' 4
'/chap10_04.html' 'Arrays and objects' 4
'/chap10_01.html' 'Accessing elements' 4
'/chap09_09.html' 'Modifiers' 4
'/chap08_01.html' 'What's interesting?' 4
'/chap08_02.html' 'Packages' 4
'/chap08_04.html' 'Instance variables' 4
'/chap09_03.html' 'Constructors' 4
'/chap09_04.html' 'More constructors' 4
'/chap08_11.html' 'Garbage collection' 4
'/chap10_06.html' 'Random numbers' 4
Table VII shows the ten highest ranked pages
out of twenty pages frequently visited by students
belonging to cluster 21 and scoring above the
average (average=21.5). All of these pages are
sections of author-generated chapters 1, 2, and 3
of the tutorial. The main theme of this navigational
pattern can be categorized as “Introduction to Java
programming”.
TABLE VII
CLUSTER 21 – HIGHEST 10 RANKED PAGES
Chapter Page Name Visits Total
'/chap01_01.html' 'What is a programming language' 39
'/chap01_03.html' 'What is debugging' 37
'/chap02_02.html' 'Variables' 35
'/chap01_05.html' 'The first program' 34
'/chap02_05.html' 'Keywords' 31
'/chap02_03.html' 'Assignment' 31
'/chap02_04.html' 'Printing variables' 31
'/chap02_01.html' 'More printing' 30
'/chap01_02.html' 'What is a program' 30
'/chap02_06.html' 'Operators' 30
Table VIII shows the fifteen highest ranked
pages out of 56 pages visited by students of cluster
428 and scoring above the average (average=5.8).
All of these pages are sections of chapters 14 to 19
of the tutorial. The main theme of this navigation
pattern can be categorized as “Advanced data
structures”.
TABLE VIII
CLUSTER 428 – HIGHEST 15 RANKED PAGES
Chapter Page Name Visits Total
'/chap17_01.html' 'A tree node' 12
'/chap14_02.html' 'The Node class' 11
'/chap15_03.html' 'The Java Stack Object' 11
'/chap15_05.html' 'Creating wrapper objects' 11
'/chap14_03.html' 'Lists as collections' 11
'/chap19_09.html' 'Performance of resizing' 11
'/chap17_02.html' 'Building trees' 11
'/chap15_02.html' 'The Stack ADT' 10
'/chap18_06.html' 'Definition of a Heap' 10
'/chap14_04.html' 'Lists and recursion' 9
'/chap14_10.html' 'The LinkedList class' 9
'/chap18_01.html' 'The Heap' 9
'/chap15_07.html' 'Getting the values out' 9
'/chap15_01.html' 'Abstract data types' 9
'/chap18_10.html' 'Heap sort' 9
The statistical analysis of the main clusters also
uncovered three main prototypes of students.
These prototypes are novice, intermediate, and
advanced, which correspond to clusters 21, 1, and
428, respectively. Students belonging to cluster 21
were interested in the first three chapters which
are introductory chapters that contain basic
information. Their navigation pattern has reflected
their knowledge of Java which can be classified as
novice. On the other hand, students of cluster 428
were interested in the last six chapters which
contain advanced information and cannot be
understood unless students have prior knowledge
in Java. Thus, their knowledge of Java can be
described as advanced. The navigation pattern of
students of cluster 1 showed that they were
interested in intermediate tutorial chapters.
A cluster neighborhood analysis shows that
clusters 21 and 428 have no direct neighbors.
Cluster 1 has three direct neighbors, of which
cluster 22 has the largest number of students.
Cluster 22 has 31 pages scoring above the average,
which include all of the 19 pages of the neighboring
cluster 1 (Table VI). The rest of the pages are
related to the main topic of cluster 1 which is the
“Array Object”. Table IX outlines a sample of
pages that are in cluster 22 and not in the
neighboring cluster 1. It is clear that pages of
cluster 22 complement and support the topic of the
pages of neighboring cluster 1.
Finally, the group of clusters highlighted by the
dark rectangle in Fig. 5 contains a total of 50
students. The main theme of the navigation of this
group is a combination of “Java classes and
objects” and “Java data structures”. These results
are compatible with the themes found for cluster 1
at the top of the grid and cluster 428 at the bottom
of the grid.
TABLE IX
PAGES IN CLUSTER 22 AND NOT IN CLUSTER 1
Chapter Page Name Visits Total
'/chap10_08.html' 'Array of random numbers' 4
'/chap08_03.html' 'Point objects' 3
'/chap08_05.html' 'Objects as parameters' 3
'/chap08_06.html' 'Rectangles' 3
'/chap08_07.html' 'Objects as return types' 3
'/chap08_08.html' 'Objects are mutable' 3
'/chap08_09.html' 'Aliasing' 3
'/chap08_10.html' 'null object' 3
V. Conclusion and Future Work
In this paper, we have proposed a hybrid k-means
and SOM clustering technique for classifying
users and extracting the association between
navigated pages. In addition, the paper suggested
techniques for the validation and interpretation of
clusters. Our results show that users were
classified with regard to their navigational patterns
and they were successfully prototyped. The results
also showed that pages can be classified and
associated with each other by grouping similar
user navigation paths and averaging the total hits
within each group. We noticed that data
visualization is very useful for interpreting
the clustering results, as it highlights the major
clusters and their neighborhoods and helps to detect
users with individual interests. Moreover, the
application of the silhouette measure to compare
and to validate the clustering results was effective.
In future work, we are interested in trying
different measurements to evaluate the interest of
users and thus the association of pages. The
measurement used in this paper is the hit count,
whereas other factors may better indicate the interest
of a user in a certain page or topic. These may
include the time spent reading (TSR) a page, the
scrolling time on a page, the link usage within a
page, etc. A comparative study is intended in the
near future in order to compare results using
different measurements of user interest. With
regard to user modelling, further research can be
conducted on investigating the application of
implicit user models that are automatically
detected by the system, in providing adaptive
information. The application of a real-time
recommendation system based on real-time
identification of the interest of the end-users can
also be further investigated.
References
[1] S. Aikat and D. Aikat, Shared Techniques
between Print and Online Documentation, Proc.
of the 14th
Annual International Conference on
Computer Documentation SIGDOC’96, pp. 125-
129, Research Triangle Park, NC, USA, 1996.
[2] J. Price, Introduction: Special Issue on
Structuring Complex Information for Electronic
Publication, IEEE Transaction on Professional
Communication, Vol. 40, n. 2, pp. 69-77, 1997.
[3] A. Csinger, K. S. Booth and D. Poole, AI Meets
Authoring: User Models for Intelligent
Multimedia, Artificial Intelligence Review, Vol.
8, n. 5-6, pp. 447-468, 1995.
[4] J. Thibeau, Making Information Work on the
World Wide Web, Proceedings of the 43rd
Annual Conference of the Society for Technical
Communication, pp. 374-378, 1996. [5] R. Z. Cabada, M. L. B. Estrada, C. A. R. García,
EDUCA: A web 2.0 authoring tool for
developing adaptive and intelligent tutoring
systems using a Kohonen network, Expert
Systems with Applications, Vol. 38, n. 8, pp.
9522-9529, August 2011.
[6] A. M. Huneiti, Data Models for Retrieving Task-
Specific and Technicians-Adaptive Hypermedia,
WSEAS Transactions on Computers, Vol. 7, n. 9,
pp. 1495-1504, Sept. 2008.
[7] J.W. Coffey, A. J. Canas, G. Hill, R. Carff, T.
Reichherzer and N. Suri, Knowledge Modelling
and the Creation of El-Tech: a Performance
Support and Training System for Electronic
Technicians, Expert Systems with Applications,
Vol. 25, pp. 483-492, 2003.
[8] D. T. Pham, and R. M. Setchi, Case-Based
Generation of Adaptive Product Manuals,
Proceedings of the Institution of Mechanical
Engineers IMech, Vol. 217 (B), pp. 313-322,
2003.
[9] L. Francisco-Revilla, and F.M. Shipman,
Adaptive Medical Information Delivery
Combining User, Task, and Situation Models,
Int. Conference on Intelligent User Interfaces,
ACM Press, pp. 94-97, 2000.
[10] P. Brusilovsky, D. W. and Cooper, ADAPTS:
Adaptive Hypermedia for a Web-based
Performance Support System, Proceedings of the
2nd
Workshop on Adaptive Systems and User
Modeling on the WWW, pp. 41-47, May 11-
14,1999.
[11] O. Mustapaşa, A. Karahoca, D. Karahoca, and
H. Uzunboylu, Hello World, Web Mining for E-
Learning, Procedia Computer Science, Vol. 3,
pp. 1381-1387, 2011.
[12] C. Dimopoulos, C. Makris, Y. Panagis, E.
Theodoridis, and A. Tsakalidis, A web page
usage prediction scheme using sequence
indexing and clustering techniques, Data &
Knowledge Engineering, Vol. 69, n. 4, pp. 371-
382, April 2010.
[13] Z. Shen, C. Miao, R. Gay, and C. P. Low,
Personalized e-Learning – a Goal Oriented
Approach, Proceedings of the 7th WSEAS
International Conference on Distance Learning
and Web Engineering (DIWEB '07), pp. 304 –
309, 2007.
[14] P. De Bra, Design Issues in Adaptive Web-Site
Development, Proceedings of the 2nd
Workshop
on Adaptive Systems and User Modelling on the
Web, pp. 29-39, 1999.
[15] E. Knutov, P. De Bra, and M. Pechenizkiy, AH
12 years later: a comprehensive survey of
adaptive hypermedia methods and techniques,
New review of hypermedia and multimedia, Vol.
15, n. 1, pp. 5-38, 2009.
[16] P. Brusilovsky, Methods and Techniques of
Adaptive Hypermedia, User Modeling and User-
Adapted Interaction, Vol. 6, pp. 87-129, 1996.
[17] D.W. Cooper, F. P. Veitch, M. M. Anderson and
M. J. Clifford, Adaptive Diagnostic and
Personalised Technical Support (ADAPTS),
Proceedings of the IEEE Aerospace Conference,
Vol.3, pp. 139-149,1999.
[18] X. He, H. Zha, C. H. Q. Ding, and H. D. Simon,
Web Document Clustering using Hyperlink
Structures, Computational Statistics & Data
Analysis, Vol. 41, n. 1, pp. 19-45, 2002.
[19] M. Perkowitz, and O. Etzioni, Towards Adaptive
Web Sites: Conceptual Framework and Case
Study, Artificial Intelligence, Vol. 118, n. 1-2,
pp. 245-275, 2000.
[20] M. H. Chehreghani, H. Abolhassani, M. H.
Chehreghani, Density link-based methods for
A. M. Huneiti
International Review on Computers and Software, Vol. 7, n. 3
clustering web pages, Decision Support Systems,
Vol. 47, n. 4, pp. 374-382, November 2009.
[21] A. Romero, S. Ventura, A. Zafra, and P. De Bra,
Applying Web usage mining for personalizing
hyperlinks in Web-based adaptive educational
systems, Computers & Education, Vol. 53, n. 3,
pp. 828-840, November 2009.
[22] S. Park, N.C. Suresh, and B. K. Jeong, Sequence-
based clustering for Web usage mining: A new
experimental framework and ANN-enhanced K-
means algorithm, Data & Knowledge
Engineering, Vol. 65, n. 3, pp. 512-543, June
2008.
[23] R. Farzan and P. Brusilovsky, Social Navigation
Support in E-Learning: What are the Real
Footprints?, Proceedings of the 3rd Workshop on
Intelligent Techniques for Web Personalization
(ITWP’05), pp. 49-56, 2005.
[24] J. Zhu, J. Hong, and J. G. Hughes, PageCluster:
Mining Conceptul Link Hierarchies from Web
Log Files for Adaptive Web Site Navigation,
ACM transactions on Internet Technologies, ,
Vol. 4, n. 2, pp. 185-208, 2004.
[25] L. Chen, S. S. Bhowmick, and W. Nejdl,
COWES: Web user clustering based on
evolutionary web sessions, Data & Knowledge
Engineering, Vol. 68, n. 10, pp. 867-885,
October 2009,.
[26] B. Mobasher, Data Mining for Web
Personalization, The Adaptive Web, Springer
Lecture Notes in Computer Science, Vol. 4321,
pp. 90-135, 2007.
[27] C. Romero, and S. Ventura, Educational data
mining: A survey from 1995 to 2005, Expert
Systems with Applications, Vol. 33, n. 1, pp. 135-
146, 2007.
[28] T. Kohonen, Self-organized formation of
topological correct feature maps, Biological
Cypernetics, Vol. 43, pp. 59-69, 1982.
[29] K. Etminani, A.R. Delui, N.R. Yanehsari, and M.
Rouhani, Web usage mining: Discovery of the
users' navigational patterns using SOM,
Networked Digital Technologies, NDT '09, pp.
224 – 249, 2009.
[30] C. Wei, W. Sen, Z. Yuan, and C. Lian-Chang,
Algorithm of mining sequential patterns for web
personalization services, SIGMIS Database, Vol.
40, n. 2, pp. 57-66, 2009.
[31] K. A. Smith, and Ng A., Web page clustering
using a self–organizing map of user navigation
patterns, Decision Support Systems, Vol. 35, pp.
245-256, 2003.
[32] G. Pallis, L. Angelis, and A. Vakali, Validation
and interpretation of Web users’ sessions
clusters, Information Processing &
Management, Vol. 43, n. 5, pp. 1348-1367,
September 2007.
[33] M.G.R. Sause, A. Gribov, A.R. Unwin, and S.
Horn, Pattern recognition approach to identify
natural clusters of acoustic emission signals,
Pattern Recognition Letters, Vol. 33, n. 1, pp.
17-23, January 2012.
[34] P. J. Rousseeuw, Silhouettes: a graphical aid to
the interpretation and validation of cluster
analysis, Journal of computational and applied
mathematics, Vol. 20, pp. 53-65, 1987.
[35] C. Shahabi, A.M. Zarkesh, J. Adibi, and V. Shah,
Knowledge discovery from users Web page
navigation, Proceedings of the 7th
international
workshop on research issues in data engineering
. pp. 20-29, 1997.
Authors Information
1Computer Information Systems Department, King Abdullah II School of Information Technology, Jordan University, Amman 11942,
Jordan, E-mail: a.huneiti@ju.edu.jo.
Ammar M. Huneiti received his BSc, MSc and PhD degrees from Cardiff University, UK. His BSc is in Computer Science (1991), his MSc is in Information Systems Technologies (1992) and his PhD is in Systems Engineering (2004). Between 1992 and 2000 he worked for several private and public sector organizations supervising the design and implementation of IT-related projects. Since 2005 he has been an assistant professor at the Department of Computer Information Systems, King Abdullah II School of Information Technology, the University of Jordan. His research interests include Intelligent Web Information Systems, Web Data Mining, Adaptive Hypermedia, and User Modelling.