FUZZY BICLUSTERING APPROACH FOR WEB COMMUNITIES
IDENTIFICATION AND WEB PERSONALIZATION
H. Hannah Inbarani1, K. Thangavel2
1Department of Computer Science, Periyar University, Salem-636 011, India, Email: [email protected]
2Department of Computer Science, Periyar University, Salem-636 011, India, Email: [email protected]
Abstract
Information overload appears to be a growing problem for people in the web era. The
rapid development of web technologies has made the World Wide Web a huge information
source. The existence of huge amounts of data and the lack of well-defined data models for the web
make information retrieval a tedious task. As a result, web users navigate through a web site
without finding relevant resources. Web personalization is a step to alleviate the information
overload problem, thereby helping users to make interest-driven visits. Web usage mining tries
to reveal the underlying access patterns from web transactions or user session data that are
recorded in web log files. Generally, web users navigate through web pages based on their
interest and the coherence between web pages. They may exhibit different types of access
interests during their surfing period. Thus, employing data mining techniques on the observed
usage data may lead to finding the underlying usage patterns. Hence there is a need to develop
an efficient technique for uncovering web user communities, on the basis of which relevant pages can be
recommended according to users' preferences. In this paper we propose a Robust Fuzzy Biclustering
approach which captures web communities and recommends pages based on the bicluster
patterns. Experiments were performed using the web log collected from the web server for a
leading IT Services and Solutions company. In order to show the effectiveness of the proposed
Robust Fuzzy Biclustering approach, recommendation results are compared with the existing
approaches such as Conventional Biclustering, CDK-Means, and spectral co-clustering.
Experimental results show the effectiveness of the proposed algorithm over the existing
biclustering approaches.
Keywords: Information overload, Web usage mining, Fuzzy Biclustering, Web Page
Recommendation, Personalization, Co-clustering
1. Introduction
The technology behind personalization has undergone tremendous changes, and several web-based personalization systems have been proposed in recent years. Although personalization can be accomplished in numerous ways, most web personalization techniques fall into four major categories: decision rule-based filtering, content-based filtering, collaborative filtering, and web usage mining. Decision rule-based filtering surveys users to obtain user demographics or static profiles, and then lets web sites manually specify rules based on them. Content-based filtering recommends items similar to those a user has liked previously. Collaborative filtering (CF), also called social or group filtering, is the most successful personalization technology to date. Most successful recommender systems on the web use explicit user ratings of products or preferences to sort user profile information into peer groups (Sung Ho Ha et al. 2002). They then tell users what products they might want to buy by combining their personal preferences with those of like-minded individuals. However, traditional collaborative and content-based filtering have problems, such as reliance on subjective user ratings and static profiles, and the inability to capture richer semantic relationships among web objects. To overcome these shortcomings, web personalization increasingly incorporates web usage mining techniques. Web usage mining can help improve the scalability, accuracy, and flexibility of recommender systems. It can also reduce the need for obtaining subjective user ratings or registration-based personal preferences.
Web usage mining (WUM) uses data mining algorithms to automatically discover and extract patterns from web usage data and to predict user behavior while users interact with the web; it also helps in the discovery of web communities. Although web usage mining has known limitations, such as sparsity in usage data and regular changes in site content, it also has several advantages over traditional techniques. The data source for web usage mining is generally the server access log, but sometimes a client-side agent collects data. An interesting problem associated with the web is the definition and delineation of so-called web communities. A community is loosely defined to be a collection of content creators that share a common interest or topic. The systematic extraction of emerging communities is useful for many reasons, including the fact that communities provide high-quality information to interested users (Jayson E. Rome 2005).
Discovery of web communities, groups of related web pages sharing common interests, is
important for assisting users' information retrieval from the web. There are several different
granularities of overlapping web communities, and this makes the identification of objective
boundaries of web communities difficult (Grieser et al. 2003). This paper focuses on the identification of web communities based on users' navigation behavior. These communities are used in web page prediction. Web prediction systems based on WUM obtain user profiles dynamically from usage patterns, and thus their performance does not degrade over time as the profiles age.
1.1. Motivation
The web usage mining tasks can involve the discovery of association rules, sequential patterns, page view clusters, user clusters, probabilistic models or any other pattern discovery method (Sarabjot Singh Anand et al. 2007). The discovered patterns are used by the online component to provide personalized content to users based on their current navigational activity. The personalized content can take the form of recommended links or products, targeted advertisements, or text and graphics tailored to the user's preferences. The web server keeps track of the active server session as the user's browser makes HTTP requests. The recommendation engine considers the active server session in conjunction with the discovered patterns to provide personalized content.
The primary motivation behind the use of clustering in collaborative filtering (Guandong Xu 2008) and web usage mining is to improve the efficiency and scalability of the real-time personalization tasks. In the context of web personalization, this task involves clustering the user sessions identified in the preprocessing stage. A variety of clustering techniques can be used for clustering similar users' sessions based on occurrence patterns of URL (Uniform Resource Locator) references. User sessions can be mapped into a multidimensional space as vectors of URL references.
Another approach for obtaining aggregate usage profiles is to directly compute (overlapping) clusters of page view references based on how often they occur together across user sessions (rather than clustering the sessions themselves). The usage profiles obtained in this way are called page clusters (Bamshad Mobasher et al. 2002).
However, both user clustering (UC) and page view clustering (PC) are one-sided approaches, in the sense that they examine similarities either only between users or only between pages, respectively. In this way, they ignore the clear duality that exists between users and pages. Furthermore, UC and PC algorithms cannot detect partial matching of preferences, because their similarity measures consider the entire set of pages or users, respectively. Another limitation of user or page clustering algorithms is that the number of clusters must be given as input based on the structure of the input patterns. Hence the first goal of this work is to simultaneously cluster users and pages based on their URL references.
The flow of information in a completely automated web personalization system can be
prone to significant amounts of error and uncertainty. This uncertainty pervades all stages from
the user’s web navigation patterns to the final recommendations, including the intermediate
stages of logging web usage, preprocessing, segmenting web log data into web user sessions, and
learning a usage model from this data (Olfa Nasraoui et al. 2003). Hence the second goal of this
work is to handle the uncertainty prevailing in the web pattern discovery process.
1.2. Contribution
The simultaneous clustering of users and pages discovers biclusters, which correspond to groups of users that exhibit high correlation on groups of pages. For page recommendation, biclusters allow the similarity between a test user and a bicluster to be computed only on the pages that are included in the bicluster. Thus, partial matching of preferences is taken into account too. Moreover, a user can be matched with several nearest biclusters and thus receive recommendations that cover the range of his or her various preferences. A simple and robust biclustering approach for web page recommendation was already proposed in our previous work (Hannah Inbarani et al. 2011).
The wide spectrum of uncertainties involved in the web navigation process can be modeled and handled using well-studied formal models of uncertainty in fuzzy set theory and soft computing. Hence, to address the second goal described above, i.e., to handle uncertainty in the web pattern discovery process, a fuzzy biclustering approach is introduced in this work for web personalization.
The contributions of this paper are summarized as follows:
(i) To capture the range of the user's preferences and to handle the uncertainty that prevails in web navigation patterns, we introduce for the first time, to our knowledge, the application of a Fuzzy Biclustering (FB) algorithm for web personalization. The effectiveness of this approach is compared with the spectral co-clustering approach proposed by Dhillon (2001) for co-clustering of words and documents, the CDK-Means approach proposed by Pensal et al. (2005), which is a K-Means-like approach for biclustering of categorical data, and the Conventional Biclustering (CB) approach proposed in our previous work (Hannah Inbarani et al. 2011), using recommendation evaluation metrics; the results are discussed in section 7.
(ii) A web user profiling approach and a recommendation approach based on fuzzy biclustering are also proposed for web page recommendation.
The rest of this paper is organized as follows: section 2 summarizes the related work, section 3 lists the research issues addressed in this paper, section 4 describes the methodology for the web page recommendation process and the proposed FB approach, and section 5 discusses the performance analysis of FB and the comparative analysis of FB with the CB, CDK-Means and spectral co-clustering approaches.
2. Related Work
Web clustering can involve either grouping of users who present similar browsing patterns or grouping of pages having related content, based on information derived from different sources. Specifically, user clustering approaches can be based on usage data recorded in web server log files and create web communities, i.e., groups of users with similar browsing behavior (Pallis, G. and Koutsonikola et al. 2006). On the other hand, in web page clustering approaches, information can be extracted from the pages' content (Hammouda, K.M. and Kamel, M.S. 2004), from structure, i.e., links between web pages or the pages' structure as described by the involved tags (Doru Tanasa and Brigitte Trousse 2004), and from usage data, i.e., which pages tend to be accessed by users with similar interests (Nakagawa and Mobasher 2003). Moreover, the clustering results may be beneficial for a wide range of applications such as web site personalization (Nasraoui, O., Soliman, M. Saka, 2008), web caching and prefetching (Li, H-Y et al. 2007), search engines (Liu et al. 2005) and Content Delivery Networks (Pallis, G. and Koutsonikola et al. 2006). In addition, the clustering results can contribute to the enhancement of recommendation engines (Chi, C-C et al. 2008) and to the design of collaborative filtering systems (Srinivasa, N. and Medasani, S. 2004).
In user/page clustering approaches, the exact user access patterns are not taken into account. Hence recent studies have used biclustering approaches to disclose the duality between users and pages, by grouping them in both dimensions simultaneously (Liu X., He P. and Yang Q., 2005 and Koutsonikola et al. 2009). The goal of these approaches is to identify groups of related web users and pages, which result from the tendency of some users to visit the same set of pages. This behaviour characterizes users' interests as similar and highly related to the topic that the specific set of pages involves. The obtained results are particularly useful for applications such as e-commerce and recommendation engines, since relations between clients and products may be revealed. These relations are more meaningful than the one-way clustering of users or pages (Koutsonikola et al. 2009).
Usually, the clusters (or biclusters) resulting from web usage mining algorithms do not necessarily have crisp boundaries; rather, they have fuzzy or rough boundaries (Hannah Inbarani et al. 2009). Koutsonikola et al. (2009) have proposed a fuzzy biclustering approach to cluster users and pages simultaneously. The limitation of this two-way clustering approach is that it is based on clustering, and so the exact user access patterns cannot be obtained. Hence it is not suitable for page recommendation, as the correlation between pages disappears when the user access patterns are merged in user and page clustering techniques. So, as noted in (Fu et al. 1999), the precision of the recommendation sets will be lower.
The concept of biclustering has been used in (Mirkin B et al. 1996) to perform grouping in a matrix by using both rows and columns. However, biclustering had been used previously in (Hartigan et al. 1972) under the name direct clustering. Recently, biclustering (also known as co-clustering, two-sided clustering, or two-way clustering) has been exploited by many researchers in diverse scientific fields for the discovery of useful knowledge (Cheng Y and Church 2000, Dhillon 2001, Dhillon et al. 2003, Long B et al. 2005). One of these fields is bioinformatics (Tang C 2001), and more specifically, microarray data analysis. The results of each microarray experiment are represented as a data matrix, with different samples as rows and different genes as columns. Other fields are text mining (Dhillon 2001) and web mining (Koutsonikola et al. 2009).
There are several approaches to deal with the biclustering problem. Many different algorithms for biclustering have already been proposed in the literature (Cheng, Y. and Church 2000 and Tang C 2001). In short, these methods can be classified by (i) the type of biclusters they find; (ii) the structure of these biclusters; and (iii) the way the biclusters are discovered.
The type of the biclusters is related to the concept of similarity between the elements of the matrix. For instance, some algorithms search for constant value biclusters, while others search for coherent values of the elements or even for coherent evolution biclusters (Pablo A. D. de Castro, Fabrício, 2007). The structure of the biclusters can be of many types. There are single bicluster algorithms, which find only one bicluster in the center of the matrix; exclusive columns and/or rows structures, in which the biclusters cannot overlap in either columns or rows of the matrix; arbitrarily positioned, overlapping biclusters; and overlapping biclusters with hierarchical structure. The way the biclusters are discovered refers to the number of biclusters discovered per run. Some algorithms find only one bicluster, others simultaneously find several biclusters, and some find small sets of biclusters at each run.
By performing co-clustering (biclustering or two-way clustering), the users and pages are simultaneously clustered into several co-clusters. Each co-cluster consists of a pair of a highly relevant user cluster and page cluster. Co-clustering offers some advantages such as dimensionality reduction, interpretable document clusters (Dhillon et al. 2003), and improvement in accuracy due to the local model of clustering (Madeira and Oliveira 2004). Fuzzy co-clustering further improves the representation of overlapping clusters using fuzzy membership functions. These advantages make fuzzy co-clustering a suitable option for categorizing users and pages, particularly those on the World Wide Web.
Web personalization systems have recently attempted to incorporate techniques for web mining. Web mining turns out to be an enabling mechanism to overcome the problems associated with more traditional web personalization techniques, such as collaborative or content-based filtering. Access information, coded into server logs, is processed by applying web mining techniques for several purposes, such as clustering users with similar browsing behavior, extracting interesting usage patterns, and discovering potential correlations between web pages and user groups (Flesca et al. 2005). In a collaborative filtering approach, such systems provide a user with personalized recommendations based on the similarity between his or her profile and those of other users with similar interests. User profiles, representing the information needs and preferences of users, can be inferred from the ratings that users provide on information items, explicitly or implicitly, through their interactions with a system (Girardi A et al. 2007). A user model, a representation of this profile, can be obtained implicitly through the application of web usage mining techniques. The personalization of services offered by a web site is an important step in the direction of alleviating information overload, making the web a friendlier environment for its individual users and hence creating trustworthy relationships between the web site and the visitor-customer (Pierrakos et al. 2003). One of the best illustrations of recommendations is Amazon's recommendation engine, where a user is informed that “Customers who bought this item also bought this” or “Customers who bought music by this artist also bought music by these artists” (Pabarskaite 2007).
In this paper, we propose a Robust Fuzzy Biclustering technique for simultaneous clustering of users and pages. The proposed approach is compared with the co-clustering approach proposed by Dhillon (2001) for co-clustering words and documents and with CDK-Means, proposed by Pensal et al. (2005), which is a K-Means-like approach for biclustering of categorical data. The results are discussed in section 5.
3. Research Issues
In this section, we examine the issues of web page recommendation systems. Table 1
summarizes the symbols that are used in the sequel.
Accuracy
Web personalization is viewed as a data mining task. Hence the accuracy of models learned for this purpose can be evaluated using a number of metrics that have been used in the machine learning and data mining literature, such as Mean Absolute Error (MAE) and the area under the Receiver Operating Characteristic (ROC) curve, depending on the formulation of the learning task. In this work, MAE is used to measure the accuracy of web page recommendation results.
Scalability
The performance and scalability dimension aims to measure the response time of a given recommendation algorithm and how easily it can scale to handle a large number of concurrent requests for recommendations. Typically, these systems need to be able to handle large volumes of recommendation requests without significantly adding to the response time of the web site on which they have been deployed. The proposed approach is scalable because only the recommendation is performed online, whereas user profile discovery is an offline process. The online part concerns the time it takes to create a recommendation list based on the pages visited in the active user session. As shown in (Symeonidis et al. 2008), the online part of biclustering approaches takes less execution time than that of user/page clustering approaches.
Sparsity
Sparsity refers to the fact that as the number of pages in a web site increases, even the most prolific users of the system will only visit a very small percentage of all pages. As a result, there will be many pairs of users that have no pages in common, and even those that do will not have a large number of common pages. Sparsity can be handled well by selecting an appropriate value for K.
Precision
Precision and recall are standard metrics used in information retrieval. While precision measures the probability that a selected item is relevant, recall measures the probability that a relevant item is selected. Precision and recall are commonly used in evaluating the selection task. Coverage measures the percentage of the universe of items that the recommendation system is capable of recommending. The F1-Measure, which combines precision and coverage, has also been used for this purpose. In this work, precision, coverage and F1-Measure are the metrics used for evaluating the prediction process.
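As a minimal sketch of these metrics, the functions below compute precision, coverage and an F1-Measure for a recommendation list. The function names, the sample page sets, and the harmonic-mean form of the F1-Measure are assumptions made for illustration; they are not definitions taken from this paper.

```python
def precision(recommended, relevant):
    """Fraction of recommended pages that the user actually visited."""
    recommended, relevant = set(recommended), set(relevant)
    if not recommended:
        return 0.0
    return len(recommended & relevant) / len(recommended)


def coverage(recommended_anywhere, all_pages):
    """Fraction of the site's pages that appear in at least one recommendation."""
    all_pages = set(all_pages)
    if not all_pages:
        return 0.0
    return len(set(recommended_anywhere) & all_pages) / len(all_pages)


def f1_measure(p, c):
    """Harmonic mean of precision p and coverage c (assumed form of F1)."""
    if p + c == 0:
        return 0.0
    return 2 * p * c / (p + c)


# Hypothetical example: four recommended pages, three pages actually visited.
p = precision(["p1", "p2", "p3", "p4"], ["p2", "p3", "p5"])
c = coverage(["p1", "p2", "p3", "p4"], ["p1", "p2", "p3", "p4", "p5"])
print(p, c, f1_measure(p, c))
```

The harmonic mean is used here because it penalizes a large gap between the two quantities, which is the usual motivation for an F1-style score.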
Similarity Measure
The most extensively used similarity measures are based on correlation and cosine similarity (Symeonidis et al. 2008). Specifically, user-based clustering algorithms mainly use Pearson's correlation, whereas for page view clustering algorithms, the Adjusted Cosine Measure (Mobasher et al. 2002) is preferred. The Adjusted Cosine Measure is a variation of the simple cosine formula that normalizes bias from subjective ratings of different users. In this work, the cosine similarity measure is used to find the similarity of users with patterns.
Table 1. Symbols
Symbol            Definition
nb                Number of biclusters
K                 Number of recommended biclusters
m                 Number of users
n                 Number of pages
UP                User profile matrix of size (nb x n)
P                 Pattern matrix of size (nb x n)
BU                Users in bicluster (nb x m)
BP                Pages in bicluster (nb x n)
µui               User membership values
µpj               Page membership values
nn                Active sub-session size
S                 Session matrix / user access matrix
p1, p2, …, pn     Pages / URLs
u1, u2, …, um     Users
4. Methodology
A web personalization system based on web usage mining discovers web usage profiles, which are then used by a recommendation system that can respond to the users' individual interests. The architecture of the proposed system is shown in Figure 1. In offline processing, user sessions are extracted, the fuzzy biclustering approach is used for extracting user access patterns, and the user profiles are generated. In online processing, the current session of the user is matched with the user profiles and the most similar profiles are used for page recommendation.
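Figure 1. System architecture (offline module: web server log, preprocessing, pattern discovery, extracting user profiles; online module: HTTP request, matching of the current session with user profiles, recommended pages)
The recommendation process consists of two phases.
Offline phase
The three steps of the offline phase are:
(i) Preprocessing
(ii) Pattern discovery (biclustering)
(iii) User profiling
Online phase
The two steps of the online phase are:
(i) Match the active session with user profiles
(ii) Recommend the top-N list of pages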
4.1 Preprocessing
The data cleaning operation is performed as defined in (Doru Tanasa and Brigitte Trousse 2004); it removes requests for image files and style sheet files. The access log of a web server is a record of all files (URLs) accessed by users on a web site. Each log entry consists of information components such as remotehost, rfc931, authuser, date, request, status and bytes. Sample entries from the web log file are listed in Figure 2.
218.248.30.146 - - [21/Nov/2009:03:10:51 +0530] "POST /make_slides.php HTTP/1.1" 200 740
216.104.15.130 - - [21/Nov/2009:03:20:37 +0530] "GET /messengerplus.php HTTP/1.0" 200 15202
Figure 2. Sample web log file
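The parsing and cleaning of such entries can be sketched as follows. This is only an illustration under stated assumptions: the regular expression targets the seven fields named above, the list of ignored file extensions is a guess at what "image files and style sheet files" covers, the helper names are hypothetical, and the third log line is invented to show a request that the cleaning step drops.

```python
import re

# Seven fields: remotehost rfc931 authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

# Assumed list of extensions to drop during cleaning (images and style sheets).
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css")


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    parts = entry["request"].split()
    # The request field is "METHOD URL PROTOCOL"; keep the URL if present.
    entry["url"] = parts[1] if len(parts) >= 2 else ""
    return entry


def clean(entries):
    """Drop unparsable entries and requests for image or style sheet files."""
    kept = []
    for entry in entries:
        if entry is None:
            continue
        path = entry["url"].split("?", 1)[0].lower()
        if path.endswith(IGNORED_EXTENSIONS):
            continue
        kept.append(entry)
    return kept


sample = [
    '218.248.30.146 - - [21/Nov/2009:03:10:51 +0530] "POST /make_slides.php HTTP/1.1" 200 740',
    '216.104.15.130 - - [21/Nov/2009:03:20:37 +0530] "GET /messengerplus.php HTTP/1.0" 200 15202',
    # Hypothetical style sheet request, added to show the cleaning step.
    '216.104.15.130 - - [21/Nov/2009:03:20:38 +0530] "GET /style.css HTTP/1.0" 200 512',
]
cleaned = clean(parse_line(line) for line in sample)
print([entry["url"] for entry in cleaned])
```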
In the next step, user sessions are identified using the user session identification process, and the session matrix is created. The user access matrix is S = {sij}, where sij = 1 if page j has been visited by user i and sij = 0 otherwise. The weight associated with each visited page is represented by W = {wij}, where each entry in the weight matrix specifies the number of hits on a specific page, as defined in (Claypool M., 2001). For each user i, the weight vector of a navigational session is represented as a sequence of visited pages with corresponding weights {wi1, wi2, wi3, …, win}, where wij denotes the weight of page j visited in the ith user session. The sample user access/session matrix is shown in Figure 3. Each row of the user access matrix is called a session vector, user access vector, or transaction.
      p1  p2  p3  p4  p5
u1    1   1   1   1   0
u2    0   1   1   1   0
u3    1   0   1   1   1
u4    1   1   1   0   0
Figure 3. Sample user access matrix
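A minimal sketch of how the access matrix S and the weight matrix W could be built from a list of page requests is given below. The function name and the hit list are hypothetical; the hits were chosen so that the resulting S equals the sample matrix in Figure 3.

```python
def build_matrices(hits, users, pages):
    """Build the binary access matrix S and the hit-count weight matrix W.

    hits is an iterable of (user, page) pairs, one per page request.
    """
    user_index = {u: i for i, u in enumerate(users)}
    page_index = {p: j for j, p in enumerate(pages)}
    W = [[0] * len(pages) for _ in users]
    for user, page in hits:
        W[user_index[user]][page_index[page]] += 1
    # s_ij = 1 if user i visited page j at least once, otherwise 0.
    S = [[1 if w > 0 else 0 for w in row] for row in W]
    return S, W


users = ["u1", "u2", "u3", "u4"]
pages = ["p1", "p2", "p3", "p4", "p5"]
# Hypothetical hit list; u1 requests p3 twice, so w13 = 2 while s13 = 1.
hits = [
    ("u1", "p1"), ("u1", "p2"), ("u1", "p3"), ("u1", "p3"), ("u1", "p4"),
    ("u2", "p2"), ("u2", "p3"), ("u2", "p4"),
    ("u3", "p1"), ("u3", "p3"), ("u3", "p4"), ("u3", "p5"),
    ("u4", "p1"), ("u4", "p2"), ("u4", "p3"),
]
S, W = build_matrices(hits, users, pages)
for user, row in zip(users, S):
    print(user, row)
```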
4.2 The Biclustering Process
The biclustering process on a User access matrix involves the determination of a set of
clusters taking into account both users and pages. Each bicluster is defined on a subset of users
and a subset of pages. Moreover, two biclusters may overlap, which means that several users or
pages of the session matrix may participate in multiple biclusters. Another important
characteristic of biclusters is that each bicluster should not be fully contained in another
determined bicluster. Overlapping is allowed in order not to miss important biclusters.
Three biclusters formed from the User access matrix in Figure 3 are listed in Figure 4.
Bicluster    Users in the bicluster    Pages in the bicluster
B1           BU1 = {u1, u4}            BP1 = {p1, p2, p3}
B2           BU2 = {u1, u2}            BP2 = {p2, p3, p4}
B3           BU3 = {u1, u2, u3}        BP3 = {p3, p4}
Figure 4. Biclusters of the sample user access matrix in Figure 3
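The user sets listed in Figure 4 can be checked with a short sketch. It assumes a crisp rule, namely that a user belongs to a bicluster when the user's session contains every page of that bicluster; the helper name is hypothetical.

```python
# Sample user access matrix from Figure 3.
S = [
    [1, 1, 1, 1, 0],  # u1
    [0, 1, 1, 1, 0],  # u2
    [1, 0, 1, 1, 1],  # u3
    [1, 1, 1, 0, 0],  # u4
]
users = ["u1", "u2", "u3", "u4"]
pages = ["p1", "p2", "p3", "p4", "p5"]


def users_of_bicluster(S, users, pages, bicluster_pages):
    """Return the users whose sessions contain every page of the bicluster."""
    columns = [pages.index(p) for p in bicluster_pages]
    return [u for u, row in zip(users, S) if all(row[j] == 1 for j in columns)]


# Page sets of the three biclusters in Figure 4.
biclusters = {
    "B1": ["p1", "p2", "p3"],
    "B2": ["p2", "p3", "p4"],
    "B3": ["p3", "p4"],
}
for name, bicluster_pages in biclusters.items():
    print(name, users_of_bicluster(S, users, pages, bicluster_pages), bicluster_pages)
```

Under this rule the sketch returns {u1, u4}, {u1, u2} and {u1, u2, u3} for B1, B2 and B3, matching the figure, and u1 appears in all three biclusters, which illustrates the overlap described above.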
4.3 User profiling
The first step in intelligent web personalization is the automatic identification of user profiles. This constitutes the knowledge discovery engine. The discovered user profiles are used to recommend relevant URLs to old and new anonymous users of a web site (Olfa Nasraoui and Chris Petenes 2003).
User profiling is the process of collecting information about the characteristics, preferences, and activities of web communities. The user profiling approach is an efficient and effective way of producing web recommendations; it builds on collaborative filtering techniques, which are commonly used in recommender systems.
This can be accomplished either explicitly or implicitly. Explicit collection of user
profile data is performed through the use of online registration forms, questionnaires, and the
like. The methods that are applied for implicit collection of user profile data vary from the use of
cookies or similar technologies to the analysis of the users’ navigational behavior that can be
performed using web log mining techniques (Jian-Guo Liu and Wei-Ping Wu 2004).
Mobasher B. et al. (2002) have proposed a potentially effective method, PACT (Profile Aggregations based on Clustering Transactions), to generate aggregate profiles based on the centroids of each transaction cluster. However, the centroid of a cluster may represent different groups of pages without much correlation. Hence in this paper, a robust fuzzy biclustering approach is proposed to generate profiles that reveal the implicit relationship that exists between the pages and users. Discovery of aggregate profiles based on biclustering was already proposed in our previous work (Hannah Inbarani et al. 2011).
4.4. Recommendation Process
The goal of personalization based on anonymous web usage data is to compute a recommendation set for the current (active) user session, consisting of the objects (links, ads, text, products, etc.) that most closely match the current user profile. The recommendation engine is the online component of a usage-based personalization system. The procedure for recommendation is described in Figure 8.
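The matching idea can be illustrated with a small sketch. It is not the procedure of Figure 8: the scoring rule (profile weight scaled by session similarity), the function names, the parameters k and n, and the profile weights are all assumptions chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally sized numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(active_session, profiles, pages, k=2, n=3):
    """Recommend up to n unvisited pages from the k most similar profiles."""
    ranked = sorted(
        profiles,
        key=lambda profile: cosine(active_session, profile),
        reverse=True,
    )[:k]
    scores = {}
    for profile in ranked:
        similarity = cosine(active_session, profile)
        for page, weight, visited in zip(pages, profile, active_session):
            if visited or weight == 0:
                continue
            # Assumed scoring rule: profile weight scaled by session similarity.
            scores[page] = max(scores.get(page, 0.0), similarity * weight)
    return sorted(scores, key=scores.get, reverse=True)[:n]


pages = ["p1", "p2", "p3", "p4", "p5"]
# Hypothetical profile weight vectors, one row per profile.
profiles = [
    [0.0, 0.9, 0.8, 0.7, 0.0],
    [0.9, 0.8, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.9, 0.5],
]
active = [0, 1, 1, 0, 0]  # the active user has visited p2 and p3 so far
print(recommend(active, profiles, pages))
```

Pages already visited in the active session are skipped, so the list contains only candidates the user has not yet seen.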
5. Proposed work
5.1 Fuzzy Biclustering (FB) approach
In contrast to traditional clustering, a biclustering method produces biclusters, each of which identifies a correlation between a set of users and a set of pages. The boundary of a bicluster is usually fuzzy in practice, as users and pages can belong to multiple biclusters at the same time but with different membership degrees. In contrast to a crisp bicluster, which either contains a user or a page completely or does not contain it at all, a fuzzy bicluster can contain a user or a page partially, with a membership degree between 0 and 1. To deal with the ambiguity and the uncertainty underlying web interaction data, fuzzy reasoning appears to be an effective tool.
The fuzzy biclustering algorithm works as follows. In the first step, the distinct patterns of the session matrix S are extracted using the Hadamard product defined in Definition 1. Given that S is made up of nb distinct patterns, the pattern matrix can be expressed as P = {plj}, where l = 1, 2, . . . , nb indexes the distinct patterns and j = 1, 2, . . . , n indexes the pages. In the second step, the pages of the patterns are inserted in the biclusters. The complete description of fuzzy biclustering is shown in Figure 6.
Definition 1: Hadamard product
The Hadamard product, named after the French mathematician Jacques Hadamard, is also known as the entrywise product. Both A and B need to be the same size, but not necessarily square. Formally, for two matrices A and B of the same dimensions, the Hadamard product A ∘ B is a matrix of the same dimensions with elements given by
$$(A \circ B)_{ij} = A_{ij} \cdot B_{ij}$$
Pattern Extraction:
A pattern can be extracted by taking the Hadamard product of each row of the user access matrix (considered as a 1 x n user access vector) with each of the other rows, denoted by Si ∘ Sj, where Si = {Si1, Si2, …, Sin} and Sj = {Sj1, Sj2, …, Sjn}. The patterns extracted by the Hadamard product for the sample user access matrix in Figure 3 are listed in Figure 5.
Pattern    Pages
P1         {p2, p3, p4}
P2         {p3, p4}
P3         {p1, p2, p3}
Figure 5. Patterns obtained from the user access matrix in Figure 3
From Figure 5, it can be observed that each extracted pattern gives all the pages of one bicluster, and the number of biclusters nb is equal to the number of patterns.
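Pattern extraction by pairwise Hadamard products can be sketched as below. The function names and the min_pages parameter (standing in for a minimum pattern size) are assumptions. On the sample matrix, enumerating all row pairs returns the three patterns of Figure 5 together with further pairwise products; the rule that narrows the list to those three is not spelled out in this section, so the sketch keeps every distinct product.

```python
from itertools import combinations

# Sample user access matrix from Figure 3.
S = [
    [1, 1, 1, 1, 0],  # u1
    [0, 1, 1, 1, 0],  # u2
    [1, 0, 1, 1, 1],  # u3
    [1, 1, 1, 0, 0],  # u4
]
pages = ["p1", "p2", "p3", "p4", "p5"]


def hadamard(a, b):
    """Entry-wise product of two equally sized vectors."""
    return [x * y for x, y in zip(a, b)]


def extract_patterns(S, min_pages=2):
    """Collect the distinct Hadamard products of all pairs of rows of S."""
    patterns = []
    for row_i, row_j in combinations(S, 2):
        pattern = hadamard(row_i, row_j)
        # Keep a pattern only if it is large enough and not seen before.
        if sum(pattern) >= min_pages and pattern not in patterns:
            patterns.append(pattern)
    return patterns


patterns = extract_patterns(S)
for pattern in patterns:
    print([page for page, bit in zip(pages, pattern) if bit])
```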
Algorithm 3: Fuzzy Biclustering(S, m, n, NU, BP, BU, nb)
Input:  Session matrix S(m, n)
        NU   - Number of users
        n    - Number of pages
        m    - Fuzzy index
        minp - Minimum number of pages allowed in a bicluster
Output: nb biclusters
nb = 0;  /* Index of bicluster */
Identify the distinct patterns of S and store them in matrix P
/* P - set of distinct patterns */
/* L - number of distinct patterns in S */
Step 1: Extract all the nb distinct patterns.
Step 2: Place all the pages of the extracted pattern l in bicluster BPi.
Step 3: If the extracted pattern occurs in the session of user j, place user j in bicluster BUi.
Step 4: Set the initial page membership μpij for each page in pattern/bicluster i as
        μpij = 1 if page j ∊ bicluster i
               0 otherwise
Step 5: Compute the similarity of user i with all the patterns.
Step 6: Compute the user membership μuij using Eqn. (2).
Step 7: Update the page membership of the pages in the pattern/bicluster using the user membership.
        Update each pattern using
        $$P(i,j) = \frac{\sum_{j=1}^{m} (\mu u_{ij})^{m} \, P(i,j)}{\sum_{j=1}^{m} (\mu u_{ij})^{m}}, \quad i = 1, 2, \ldots, L$$
Step 8: Stopping criterion: repeat steps 5 to 7 until the change $|P_{ij}^{(t+1)} - P_{ij}^{(t)}|$ between two successive iterations falls below a fixed threshold ε.
Step 9: Set μpij = updated P(i, j).
Output nb            /* Number of biclusters */
Output BU and BP     /* Users and pages in each bicluster */
Output user membership
Output page membership
Figure 6. Fuzzy Biclustering approach
Definition 2: The membership of a user in each bicluster is calculated by computing the similarity of the user's access pattern with the pattern of each bicluster:
$$\mu_{ij} = \frac{\mathrm{sim}(s_i, p_j)}{\sum \mathrm{sim}(s_i, p_j)}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, nb \qquad (2)$$
where si represents the user access pattern of the ith user of the bicluster, j specifies a page in the bicluster, n specifies the number of pages in the bicluster, nb represents the number of distinct patterns in the session matrix, and sim(si, pj) represents the similarity of session i with pattern j. Cosine similarity [1] is used for computing the similarity of the user with the pattern.
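The membership computation of Eqn. (2) can be sketched as follows, under the assumption that the sum in the denominator runs over all nb patterns, so that the memberships of each user add up to 1. The function names are hypothetical, and the sessions and patterns are taken from the sample data in Figures 3 and 5.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally sized numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def user_memberships(S, P):
    """Row i holds the memberships of user i in each pattern of P."""
    memberships = []
    for session in S:
        sims = [cosine(session, pattern) for pattern in P]
        total = sum(sims)
        # Assumption: normalize over all patterns so each row sums to 1.
        memberships.append([s / total if total > 0 else 0.0 for s in sims])
    return memberships


S = [
    [1, 1, 1, 1, 0],  # u1
    [0, 1, 1, 1, 0],  # u2
    [1, 0, 1, 1, 1],  # u3
    [1, 1, 1, 0, 0],  # u4
]
P = [
    [0, 1, 1, 1, 0],  # P1 = {p2, p3, p4}
    [0, 0, 1, 1, 0],  # P2 = {p3, p4}
    [1, 1, 1, 0, 0],  # P3 = {p1, p2, p3}
]
mu = user_memberships(S, P)
for user, row in zip(["u1", "u2", "u3", "u4"], mu):
    print(user, [round(value, 3) for value in row])
```

In this example u2, whose session equals P1, receives its highest membership in the first pattern, and u4, whose session equals P3, receives its highest membership in the third.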
The proposed Robust Fuzzy biclustering algorithm seems to be an effective tool for web
personalization because the membership of each page is optimized whenever a new user is added
to the bicluster. These memberships are then used as weights for web page prediction.
5.2 Discovery of Aggregate Profiles Based on Fuzzy Biclustering
In this method, the result of Fuzzy biclustering is used for obtaining user profiles.
Fuzzy membership values of pages in the page biclusters are used as weights, and low
support page views, i.e. pages with membership values below the threshold value α, are filtered
out. The steps for building user profiles based on Fuzzy biclustering are described in Figure 7.
Algorithm : Building user profile based on Fuzzy Biclustering
Input : A set of biclusters, membership values and threshold α
        nb – Number of biclusters
Output : Set of user profiles UPj, j = 1, 2, . . . , nb
Procedure
Figure 7. Building user profiles based on Fuzzy Biclustering
This fuzzy model generates robust profiles because the weights are determined from the
membership values of page views in the biclusters, and the membership values are
determined by the Fuzzy biclustering technique. The value of α is taken as 0.4.
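The filtering described above can be sketched as follows. The function name `build_profiles`, the dictionary representation of a bicluster and the example page names are assumptions for illustration only; the text specifies only that pages with membership below α are dropped and that the remaining memberships serve as weights.

```python
def build_profiles(page_memberships, alpha=0.4):
    """Turn each bicluster's page memberships into a weighted profile.

    page_memberships: list with one dict per bicluster, mapping a page
    URL to its fuzzy membership value in that bicluster.
    Pages whose membership is below alpha are filtered out.
    """
    return [
        {page: weight for page, weight in bicluster.items() if weight >= alpha}
        for bicluster in page_memberships
    ]

# Hypothetical memberships for two biclusters
profiles = build_profiles(
    [{"/support.php": 0.9, "/careers.php": 0.2},
     {"/register.php": 0.4, "/downloads.php": 0.35}],
    alpha=0.4,
)
```

Each resulting profile keeps the surviving pages together with their membership values, which are later used as weights during recommendation.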
5.3 Biclustering based Recommendation
Web personalization aims to provide intelligent online services, such as web
recommendations, based on past web user navigation patterns. The biclustering based
recommendation process is described in Figure 8.
Algorithm : Biclustering Based Recommendation
Input : Recommendation threshold α, a set of user profiles generated from the
        biclusters, and
        t - current sub session
        K - number of biclusters to be recommended
        N - number of pages to be recommended from K biclusters
Output : Recommendation vector R
Step 1 : Generate integrated user and page biclusters (co-clusters) using the biclustering
         algorithm.
Step 2 : Generate user profiles using the method specified in Figure 7.
Step 3 : Compute the similarity between the user's sub session vector t and the
         generated user profiles.
Step 4 : Sort each row of the similarity matrix in descending order based on weights.
Step 5 : Include N pages from the top K biclusters in the recommendation vector R
         if weight wti > threshold α.
Figure 8. Biclustering based Recommendation process
In order to provide recommendations, we have to find the biclusters containing users
with preferences that have strong partial similarity with the test user. This stage is executed
online and consists of two basic operations:
The formation of the test user's neighborhood, i.e., finding the K nearest biclusters.
The generation of the top-N recommendation list of pages.
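The two online operations can be sketched as below. This is a minimal illustration under stated assumptions: profiles are dictionaries of page weights, the active window is treated as a binary vector, cosine similarity is used to rank profiles, and a page's recommendation score is simply its largest profile weight among the K nearest profiles. The function names and the example pages are hypothetical, and the paper's exact scoring may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(page, 0.0) for page, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(profiles, window, K=2, N=4, alpha=0.4):
    """Return up to N (page, score) pairs drawn from the K nearest profiles."""
    active = {page: 1.0 for page in window}
    # Neighborhood formation: rank profiles by similarity to the active window
    ranked = sorted(((cosine(active, prof), idx)
                     for idx, prof in enumerate(profiles)), reverse=True)
    nearest = [idx for sim, idx in ranked[:K] if sim > 0]
    # Top-N generation: keep unseen pages whose weight exceeds alpha
    scores = {}
    for idx in nearest:
        for page, weight in profiles[idx].items():
            if page not in active and weight > alpha:
                scores[page] = max(scores.get(page, 0.0), weight)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:N]

# Hypothetical profiles and a one-page active window
example_profiles = [
    {"/a": 1.0, "/b": 0.9, "/c": 0.5},
    {"/x": 1.0, "/y": 0.8},
    {"/a": 0.6, "/d": 0.7},
]
top = recommend(example_profiles, ["/a"], K=2, N=2, alpha=0.4)
```

Here the second profile shares no page with the active window, so only the first and third profiles contribute candidate pages.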
6. Experimental Evaluation
6.1 Data set 1
The web access logs from http://www.technmantix.com were used for our experiments.
Technmantix is a leading IT Services and Solutions company. The actual web log contains nearly
31415 entries. After preprocessing (data cleaning) the web access logs and removing references
to image files and style sheet files, a total of 13375 log entries were identified, and after applying
data filtering and session identification, 2599 user sessions (the maximum data set size) were
identified. The total number of URLs representing pageviews was 362, and after eliminating the
image files and style sheet files, the total number of remaining pageview URLs in the training and
evaluation sets is 113. Approximately 25% of these transactions were randomly selected as
the testing set, and the remaining portion was used as the training set for page recommendation.
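The cleaning and splitting steps described above might look roughly like the following sketch. The list of static-file suffixes, the function names and the fixed random seed are assumptions made for illustration; the paper does not state which extensions were removed or how the random selection was performed.

```python
import random

# Assumed suffixes for image and style sheet requests
STATIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css")

def is_pageview(url):
    """True if the requested URL is a page view rather than an image or style sheet."""
    path = url.split("?", 1)[0].lower()
    return not path.endswith(STATIC_SUFFIXES)

def split_sessions(sessions, test_fraction=0.25, seed=0):
    """Randomly hold out roughly test_fraction of the sessions for testing."""
    shuffled = list(sessions)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, testing set)

train, test = split_sessions(range(100))
```

With 100 sessions and a test fraction of 0.25, the split yields 75 training sessions and 25 testing sessions.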
6.1.1 Evaluation Metrics
The performance of the FB, CB, CDK-Means and co-clustering methods is measured using four
standard measures, namely precision, coverage, F1-Measure and MAE, as defined in
(Bamshad Mobasher et al. 2002). These measures are adaptations of the standard measures,
precision and recall, often used in information retrieval. MAE is used for measuring the accuracy
of recommendation process. In this context, precision measures the degree to which the
recommendation engine produces accurate recommendations. On the other hand, coverage
measures the ability of the recommendation engine to produce all of the page views that are
likely to be visited by the user. The precision measure represents the ratio of matches between
the recommendation set and the target set to the size of recommendation set. The coverage
measure represents the ratio of matches to the size of the target set.
Suppose we have a transaction t (taken from the evaluation set), viewed as a set of pageviews, and
we use a window nn ⊆ t (of size |nn|) to produce a recommendation set R using the
recommendation engine. Then the precision of R with respect to t is defined as:
Precision(R, t) = |R ∩ (t − nn)| / |R| (3)
and the coverage of R with respect to t is defined as:
Coverage(R, t) = |R ∩ (t − nn)| / |t − nn| (4)
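Equations (3) and (4) translate directly into set operations. The sketch below is illustrative only; the function names and the example page identifiers are hypothetical, and returning 0.0 for an empty denominator is a convention assumed here rather than something the definitions specify.

```python
def precision(R, t, nn):
    """Eq. (3): fraction of recommended pages that appear in the target set t - nn."""
    R, target = set(R), set(t) - set(nn)
    return len(R & target) / len(R) if R else 0.0

def coverage(R, t, nn):
    """Eq. (4): fraction of the target set t - nn that was recommended."""
    R, target = set(R), set(t) - set(nn)
    return len(R & target) / len(target) if target else 0.0

# Transaction of five pages, window of the first two, four recommendations
t = ["a", "b", "c", "d", "e"]
nn = ["a", "b"]
R = ["c", "d", "x", "y"]
p = precision(R, t, nn)   # 2 matches out of 4 recommendations
c = coverage(R, t, nn)    # 2 matches out of 3 target pages
```

In this example two of the four recommended pages are in the target set {c, d, e}, giving a precision of 0.5 and a coverage of 2/3.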
6.1.2 Parameter Setting
The minimum number of pages and users in a bicluster is set to 2. For CB,
co-clustering and CDK-Means, the implicit ratings obtained from the hits of the users on different
pages are used as weights, and the weight of each page in the bicluster is determined as per the
user profiling algorithm discussed in (Hannah Inbarani et al. 2011). Unless otherwise specified,
the default values for the parameters are K = 4, N = 4, nn = 2. These optimum values were selected
after several runs, based on sensitivity analysis, for the best performance in terms of coverage and
precision.
6.2 Recommendation results of Fuzzy Biclustering
The recommendation engine takes a collection of user profiles as input and generates a
recommendation set by matching the current user's activity against the discovered patterns. We
use a fixed-size sliding window over the current active session to capture the current user's
history depth. Thus, a sliding window of size n over the active session allows only the last n
visited pages to influence the recommendation value of items in the recommendation set. This
sliding window is called the active session window.
In each iteration, each user session t in the evaluation set was divided into two parts. The
first nn page views were used for generating recommendations, whereas the remaining portion of
t (the target set) was used to evaluate the generated recommendations. For the recommendation
process we chose a session window size of 2. The recommendation results are given in Table 2
for the sample path.
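The evaluation split and the active session window described above can be sketched as follows. The function names are hypothetical and the page names in the example are only illustrative.

```python
def split_session(t, nn=2):
    """Split an evaluation session into (window, target set).

    The first nn page views drive the recommendation; the remaining
    page views are held back to evaluate it.
    """
    return list(t[:nn]), list(t[nn:])

def active_window(visited, n=2):
    """Last n pages of the current active session (the active session window)."""
    visited = list(visited)
    return visited[-n:] if n > 0 else []

window, target = split_session(
    ["/downloads.php", "/support.php", "/register.php", "/careers.php"], nn=2)
```

With a window size of 2, the first two page views form the window and the remaining two form the target set.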
Table 2. Recommendation Results for a Typical User Navigation Path
Pages of Active User session    Recommended Web pages    Recommendation score
/make_slides.php
/website-design-services.php
/consultancy-services- utomation.php
/software-development- ompany.php
/outsourcing.php
/support.php
/website-application- commerce.php
/web_hosting.html
/hosting/livezilla/server.php
/Products/billing-software.php
About-us/TechCmantiX- infrastructure.php
0.4512
0.5218
0.5477
0.5492
0.5492
0.3162
0.4776
0.5715
0.4666
0.5611
/downloads.php
/support.php
/leadership.php
/register.php
/testimonial.php
/careers.php
/support.php
/website-application-ecommerce.php
/content-writing.php
/billing-automation-with- counts.php
/leads-management-system.php
0.7530
0.5610
0.6073
0.6375
0.7218
0.4807
0.3986
6.3 Performance Analysis for Fuzzy Biclustering (FB)
The required input of the algorithm is the minimum number of pages to be included in a
bicluster. In order to discover the best biclusters it is important to fine-tune this input variable.
Figure 9 depicts the average number of pages in a bicluster, which increases with increasing
minp.
[Figure 9: average number of pages per bicluster (y-axis, 0 to 8) against the minimum number of pages per bicluster, minp (x-axis, 2 to 6)]
Figure 9. Average number of pages in the bicluster
Impact of Recommended Number of Biclusters
Figure 10 illustrates the values of F1-Measure, Precision and Coverage for varying values
of K. As shown, the best performance is attained for K = 2. When only a small number of
biclusters that are very similar to the current active session are recommended, the values of
F1-Measure, precision and coverage remain high.
[Figure 10: Precision, Coverage and F1-Measure (y-axis: recommendation measures, 0 to 0.8) against the number of recommended biclusters (x-axis, 2 to 6)]
Figure 10. Number of Recommended Biclusters versus F1-Measure, Precision and Coverage
Impact of membership values and Recommendation Threshold α
The recommendation score is computed based on membership values as explained in 6.
Figure 11 illustrates the values of F1-Measure, Precision and Coverage for varying values of the
recommendation threshold α. As shown, the best performance is attained for α = 0.8 and α = 1. As
the value of α is increased, the values of F1-Measure, Precision and Coverage increase.
[Figure 11: Precision, Coverage and F1-Measure (y-axis: recommendation measures, 0 to 0.8) against the recommendation threshold (x-axis, 0.2 to 1)]
Figure 11. Recommendation Threshold versus F1-Measure, Precision and Coverage for FB
Impact of sub session size nn
Figure 12 illustrates the values of F1-Measure, Precision and Coverage for varying values
of nn. As shown, the best performance is attained for nn = 1. When the value of nn is small,
Precision and F1-Measure remain high, whereas coverage increases as the value
of nn increases.
[Figure 12: Precision, Coverage and F1-Measure (y-axis, 0 to 0.8) against the active session size nn (x-axis, 1 to 5)]
Figure 12. Active session size versus F1-Measure, Precision and Coverage for FB
6.4 Comparative results for effectiveness
In this section, we compare the performance of Robust Fuzzy Biclustering (FB) with
conventional biclustering (CB), CDK-Means and spectral co-clustering. The parameters are set as
follows: the size of the recommendation list N is 4 (default value), the number of biclusters
recommended is 2 and the size of the training set is 75% (default value). The test set consists of
all remaining users, i.e., those not in the training set. Users in the test set are the basis for
measuring the examined metrics. The performance comparison of FB, CB, CDK-Means and
spectral co-clustering using F1-Measure, Precision and Coverage for the maximum data set size
is shown in Figure 13.
[Figure 13: Precision, Coverage and F1-Measure (y-axis: recommendation measure, 0 to 0.8) for CDK-Means, Co-clustering, CB and FB]
Figure 13. Comparative analysis of CDK-Means, Co-clustering, CB and FB
Table 3 shows the values of Precision, Coverage and F1-Measure for the maximum session size.
Table 3. Comparison between CDK-Means, spectral co-clustering, CB and FB
in terms of Precision, Coverage and F1-Measure
Approach                 Precision   Coverage   F1-Measure
CDK-Means                0.49        0.4362     0.4615
Spectral Co-clustering   0.3534      0.7003     0.4698
CB                       0.503       0.7375     0.5981
FB                       0.6512      0.7512     0.6977
In terms of precision, FB outperforms all the other methods (CDK-Means, co-clustering
and CB). The precision of CB is higher than that of CDK-Means, and the precision of CDK-Means is
higher than that of co-clustering. In terms of coverage, FB also performs better than the
other methods. The coverage of the co-clustering approach is slightly lower than that of CB and
significantly higher than that of CDK-Means. The overall performance is measured using the
F1-Measure, and FB again performs better than the other methods. The performance of CB in
terms of F1-Measure is better than that of co-clustering and CDK-Means. There is only a slight
difference between the F1-Measure of CDK-Means and that of co-clustering. The F1-Measure attains
its maximum value when both precision and coverage are maximum.
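As a quick consistency check, the F1-Measure column of Table 3 can be recomputed from the Precision and Coverage columns. The sketch assumes the standard definition of F1 as the harmonic mean of precision and coverage, which the text does not spell out; the function and variable names are illustrative.

```python
def f1_measure(precision, coverage):
    """Harmonic mean of precision and coverage (standard F1 definition)."""
    total = precision + coverage
    return 2 * precision * coverage / total if total else 0.0

# (precision, coverage, reported F1-Measure) taken from Table 3
table3 = {
    "CDK-Means": (0.49, 0.4362, 0.4615),
    "Spectral Co-clustering": (0.3534, 0.7003, 0.4698),
    "CB": (0.503, 0.7375, 0.5981),
    "FB": (0.6512, 0.7512, 0.6977),
}

recomputed = {name: f1_measure(p, c) for name, (p, c, _) in table3.items()}
```

Under this definition the recomputed values agree with the reported F1-Measure column to within about 0.001 for all four approaches, and FB has the highest value.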
Measure of Accuracy
The performance measure MAE indicates the degree of deviation of the user's desired
pages from the recommended set of pages. The MAE of the various biclustering approaches for
the maximum size of data set 1 is shown in Figure 14.
[Figure 14: MAE value (y-axis, 0 to 8) for the various biclustering approaches]
Figure 14. Mean Absolute Error (MAE)
Complexity of FB
The Robust Fuzzy Biclustering algorithm can be shown to have a complexity of O(m × n
× nb × τ), where m is the number of rows of the session matrix S, n is the number of columns in
S, nb is the number of biclusters and τ is the number of iterations taken for convergence.
Impact of Test data size
Training/Test data size: We now test the impact of the size of the training set, which is
expressed as a percentage of the total data set size. The results for F1 are given in Figure 15. As
expected, when the training set is small, performance degrades for FB. Therefore, care should
be taken to use adequately large training sets.
[Figure 15: F1-Measure (y-axis, 0 to 0.9) against training set size (x-axis, 50 to 80)]
Figure 15. F1-Measure for various Training set sizes
6.5 Data set 2
For the purpose of evaluating the performance and the effectiveness of the FB algorithm,
experiments were conducted with the preprocessed web access logs of www.microsoft.com, which are
available in the UCI repository [http://www.ics.uci.edu/]. This log file records the use of
www.microsoft.com by 5000 anonymous, randomly selected users who visited the web site
in a one-week timeframe in February 1998, with an average of 5.7 page views per user. The file
contains no personally identifiable information. The visits in this data set
are recorded in time order, and no preprocessing is required since the data set is given in sessions.
The 294 web pages are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g.
"/stream"). The algorithms are applied only to the testing instances available in the UCI repository,
comprising 294 web pages and 5000 users (the maximum data set size).
6.5.1 Comparative results for effectiveness
In this section, we compare the performance of Robust Fuzzy Biclustering (FB),
conventional biclustering (CB), CDK-Means and spectral co-clustering using data set 2. The
optimum values of the parameters are set to K = 6, N = 4 and nn = 3 after performing sensitivity
analysis.
Precision of page recommendation
The quality of the page recommendation results of data set 2 for 5000 users and 294 web
pages is measured by precision, coverage and F1-Measure and is shown in Figure 16.
[Figure 16: Precision, Coverage and F1-Measure (y-axis, 0 to 0.8) for FB, CB, CDK-Means and Co-clustering]
Figure 16. Precision of Page recommendation
It can be observed from Figure 16 that FB achieves higher performance than the other biclustering
approaches.
Accuracy of Page recommendation
The accuracy of the page recommendation results for data set 2 is illustrated in Figure 17. It can
be observed from the figure that FB yields a lower MAE than the other biclustering approaches.
[Figure 17: MAE (y-axis, 0 to 7) for FB, CB, CDK-Means and Co-clustering]
Figure 17. Accuracy of Page recommendation
Conclusion
The target of personalization based on web usage data is to compute a recommendation
set for the current user session based on the user's past navigation patterns. In this paper a new
personalized recommendation method based on biclustering is proposed to improve
web-personalized recommendation. An extensive experimental comparison of the Robust Fuzzy
Biclustering approach is made with CB, CDK-Means and spectral co-clustering using the
recommendation measures MAE, precision, coverage and F1-Measure. This work improves the
precision and coverage ratios and reduces MAE at the same time.
We highlight the following observations from our examination:
Our biclustering approaches show significant improvements over existing user and page
clustering algorithms in terms of effectiveness, because they exploit the duality of users
and pages.
In our experiments, FB slightly outperforms the other biclustering approaches in terms of
accurate recommendations. The reason is that the weights are computed based on the
membership values of pages in the bicluster, and the weights are optimized based on the
number of users and their access patterns, thereby making FB more suitable for
recommendation.
Summarizing the aforementioned conclusions, it can be inferred that the proposed Fuzzy
biclustering algorithm attains higher efficiency than the existing biclustering algorithms.
Hence the Robust Fuzzy Biclustering approach provides better web-personalized
recommendation than other biclustering approaches and is more suitable for those web sites in
which users navigate through the web pages with much uncertainty.
ACKNOWLEDGEMENT
We thank the anonymous reviewers and editors for their valuable suggestions on this work and
for the ideas which helped us improve the paper.
REFERENCES
Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa: Discovery and Evaluation of
Aggregate Usage Profiles for Web Personalization. Data Mining and Knowledge
Discovery (6) (2002) 61–82.
Cheng, Y., Church, G.: Biclustering of expression data. In: Proceedings of the ISMB
Conference (2000) 93–103.
Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph
partitioning. In: Proceedings of the ACM SIGKDD Conference (2001).
Doru Tanasa, Brigitte Trousse, Advanced Data Preprocessing for Intersites web usage
mining. IEEE Intelligent Systems (2004) 59-65.
Dimitrios Pierrakos, Georgios Paliouras, Christos Papatheodorou and Constantine D.
Spyropoulos, Web Usage Mining as a Tool for Personalization: A Survey, User Modeling
and User-Adapted Interaction 13: 311-372, 2003.
Guandong Xu, Web Mining Techniques for Recommendation and Personalization. Ph.D.
Thesis (2008).
Gunter Grieser, Yuzuru Tanaka and Akihiro Yamamoto, Discovery of Web Communities
from Positive and Negative Examples, Discovery Science, In: Proceedings of 6th
International Conference, DS 2003, Sapporo, Japan, October 17-19, Springer-Verlag, pp.
69-376 (2003).
Hammouda, K.M., Kamel, M.S, Efficient phrase-based document indexing for Web
document clustering, IEEE Transactions on Knowledge and Data Engineering. 10(6)
(2004) 1279–1296.
Hannah Inbarani, H., Thangavel, K, A Robust Biclustering Approach for Effective Web
Personalization, Visual Analytics and Interactive Technologies: Data, Text and Web
Mining Applications (2011).
Hartigan, J.A, Direct clustering of a data matrix, Journal of the American Statistical
Association 67(337) (1972) 123–129.
Jayson E. Rome and Robert M, Towards a formal concept analysis approach to exploring
communities on the WWW, B. Ganter and R. Godin (Eds), ICFCA (2005), LNAI 3403,
pp. 33-48, 2005.
Jian-Guo Liu, Wei-Ping Wu, Web Usage Mining For Electronic Business Applications,
In: Proceedings of the Third International Conference on Machine Learning and
Cybernetics, Shanghai (Aug 2004) 57-63.
Li, H.Y., Xie, C.S., Liu Y, A new method of prefetching I/O requests, In: Proceedings of
International Conference on Networking, Architecture and Storage, Guilin, China (July
2007) 217–224.
Liu, X., He, P., Yang, Q.: Mining user access patterns based on web logs, Canadian
Conference on Electrical and Computer Engineering, May, Saskatoon Inn Saskatoon,
Saskatchewan Canada 2280–2283 (2005).
Long, B., Zhang (Mark), Z., Yu, P.S, Co-clustering by block value decomposition, In:
Proceeding of the eleventh ACM SIGKDD International Conference on Knowledge
discovery in data mining. ACM Press, New York (2005) 635–640.
Mirkin, B.: Mathematical classification and clustering, Kluwer Academic Publishers
Dordrecht (1996).
Nakagawa, M., Mobasher, B, A hybrid web personalization model based on site
connectivity, In the fifth international WEBKDD workshop: Web mining as a premise to
effective and intelligent web applications (2003) 59–70.
Nasraoui, O., Soliman, M., Saka, E., Badia, A., Germain R, A web usage mining
framework for mining evolving user profiles in dynamic websites, IEEE Transactions on
Knowledge and Data Engineering, 20(2) (2008) 202–215.
Olfa Nasraoui, Chris Petenes, An Intelligent Web Recommendation Engine Based on
Fuzzy Approximate Reasoning, In: Proceedings of the IEEE International Conference on
Fuzzy Systems 1116-1121 (2003).
Pallis G., Vakali Koutsonikola A, Insight and perspectives for content delivery
networks, Communications of the ACM 49(1) (2006) 101–106.
Pablo A. D. de Castro, Fabrício 2007, Applying Biclustering to Text Mining: An Immune-
Inspired Approach, In: ICARIS, Vol. 4628, Springer (2007), pp. 83-94.
Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos N. Papadopoulos and
Yannis Manolopoulos, Nearest-biclusters collaborative filtering based on constant and
coherent values, Information Retrieval, 1(11), pp. 51-75.
Rosario Girardi.A, Leandro Balby Marinho E, A domain model of web recommender
systems based on usage mining and collaborative filtering, International Journal of
Requirements Engineering (2007) 12: 23–40, Springer-Verlag, London.
Ruggero G. Pensa, Céline Robardet, Jean-François Boulicaut, A Bi-clustering
Framework for Categorical Data, A. Jorge et al. (Eds.): PKDD 2005, LNAI 3721,
Springer-Verlag Berlin Heidelberg (2005) 643–650.
Sarabjot Singh Anand, Bamshad Mobasher, Intelligent Techniques for Web
Personalization. ACM Transactions on Internet Technology 7(4) (October 2007).
Sergio Flesca, Sergio Greco, Andrea Tagarelli, Ester Zumpano, Mining User Preferences,
Page Content and Usage to Personalize website navigation, World Wide Web: Internet
and web information systems, 8, 317–345, 2005, Springer Science + Business Media,
Inc.
Srinivasa N, Medasani S, Active fuzzy clustering for collaborative filtering, In:
Proceedings of IEEE International Conference on Fuzzy Systems, July, Budapest,
Hungary (2004) 1607–1702.
Sung Ho Ha, Helping Online Customers Decide through Web Personalization, IEEE
Intelligent Systems (2002) 34-43.
Tang, C., Zhang, L., Zhang, I., Ramanathan, M, Interrelated two-way clustering: An
unsupervised approach for gene expression data analysis, In: Proceedings of the 2nd
IEEE Int. Symposium on Bioinformatics and Bioengineering (2001) 41–48.
Tsuyoshi Murata, DOI: 10.1007, Discovery of Web Communities from Positive and
Negative Examples, Lecture Notes in Computer Science, 2003, Volume 2843/2003, 369-
376.
Vassiliki A. Koutsonikola and Athena I. Vakali (2009), A fuzzy bi-clustering approach to
correlate web users and pages, Int. J. Knowledge and Web Intelligence, 1(2), 3-23.
Zidrina Pabarskaite & Aistis Raudys, A process of knowledge discovery from web log
data: Systematization and critical review, Springer, Journal of Intelligent Information
Systems (2007) 28: 79–104.