
Clustering-Based Learning Approach for Ant Colony Optimization Model to Simulate Web User Behavior

Pablo Loyola∗, Pablo E. Román∗ and Juan D. Velásquez∗
∗Department of Industrial Engineering

Universidad de Chile, República 701, P.O. Box 8370439, Santiago, Chile
[email protected], [email protected], [email protected]

Abstract—In this paper we propose a novel methodology for analyzing web user behavior based on session simulation, using an Ant Colony Optimization algorithm which incorporates usage, structure and content data originating from a real web site. In the first place, artificial ants learn from a clustered web user session set through the modification of a text preference vector. Then, trained ants are released through a web graph and the generated artificial sessions are compared with real usage. The main result is that the proposed model explains approximately 80% of real usage in terms of a predefined similarity measure.

Keywords-Ant Colony Optimization, Web User Behavior, Web Usage Mining, Multi Agent Simulation, Text Preferences.

I. INTRODUCTION

Soft computing methodologies have gained a considerable amount of relevance in relation to Web Usage Mining (WUM) due to their flexible implementation and their results in the field of recommendation-based systems and adaptive web sites [1]. In particular, Ant Colony Optimization (ACO) is one of the tools used for these purposes. This metaheuristic is inspired by nature, where ants optimize their trails for food foraging by releasing chemical substances, called pheromones, into the environment. This simple idea is applied to web user trails of visited pages, also called sessions [2]. Artificial ants are trained through a web session clustering method, modifying an intrinsic text preference vector which represents the importance given by users to the set of most important keywords. The trained ants are then used to predict future browsing behavior.

This paper is organized as follows. Section 2 provides a brief summary of related research. Section 3 presents our ACO-based model for web user behavior. Section 4 outlines our experiments on a real web site. Section 5 provides some conclusions and suggests future research.

II. RELATED WORK

ACO is inspired by the behavior of insect species that present a social component in their organizational structure, which allows them to perform complex tasks through a cooperative and coordinated process. The key component behind this is the capability of generating indirect communication by modifying the environment, a mechanism called stigmergy. Specifically in ants, it is based on the use of chemical substances called pheromones, which can be deposited on the ground and detected by the colony in order to execute labors such as food foraging, cooperative transport and corpse grouping. The methodology is based on the progressive construction of pheromone trails from the nest to the food source while searching for the minimum path. This is achieved through an auto-catalytic process in which pheromone levels are reinforced in inverse proportion to the time needed to walk through the respective path. As ants choose paths with higher pheromone levels, suboptimal solutions evaporate and the algorithm returns the shortest path.

Dorigo proposed the first ACO model to solve the Travelling Salesman Problem (TSP) in 1992 [3]. This work stated the principles of the methodology in terms of four components: graph representation, problem restrictions, pheromone trails and solution construction. Most of the computation relies on two variables that govern trail generation. In the first place, a support measure η_ij, also called heuristic information, contains problem information that is not directly accessible to the ants and has been introduced externally; for instance, it can store either the cost or the utility associated with a decision. In the second place, a pheromone function τ_ij(t) is defined:

$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij} \qquad (1)$$

where $\Delta\tau_{ij} = Q\,\eta_{ij}$, with $Q$ a constant and $\rho$ an attenuation factor associated with pheromone evaporation. Finally, the trail preference $p_{ij}(t)$ is defined as:

$$p_{ij}(t) = \frac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{k=1}^{n}[\tau_{ik}(t)]^{\alpha}\,[\eta_{ik}]^{\beta}} \qquad (2)$$

where $\alpha$ and $\beta$ are the weights for pheromones and support, respectively.
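As a concrete illustration of equations (1) and (2), the following sketch (in Java, the language later used for the experiments) evaluates the pheromone update and the resulting trail preferences over the neighbors of a single node. The class and method names are illustrative, not part of the original formulation.

```java
// Minimal sketch of the classical ACO rules (1) and (2); the class and
// method names are illustrative, not taken from the paper.
public final class AcoRules {

    // Equation (1): tau_ij(t+1) = (1 - rho) * tau_ij(t) + delta_tau_ij
    static double updatePheromone(double tau, double rho, double deltaTau) {
        return (1.0 - rho) * tau + deltaTau;
    }

    // Equation (2): preference of moving from node i to each neighbour k,
    // given the pheromone row tau[i][*] and the heuristic row eta[i][*].
    static double[] trailPreferences(double[] tauRow, double[] etaRow,
                                     double alpha, double beta) {
        double[] p = new double[tauRow.length];
        double sum = 0.0;
        for (int k = 0; k < tauRow.length; k++) {
            p[k] = Math.pow(tauRow[k], alpha) * Math.pow(etaRow[k], beta);
            sum += p[k];
        }
        for (int k = 0; k < p.length; k++) {
            p[k] = (sum > 0.0) ? p[k] / sum : 0.0;  // normalize into a distribution
        }
        return p;
    }

    public static void main(String[] args) {
        double[] tau = {0.5, 0.2, 0.3};
        double[] eta = {1.0, 2.0, 0.5};
        System.out.println(java.util.Arrays.toString(trailPreferences(tau, eta, 1.0, 2.0)));
    }
}
```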

These components enabled the development of the metaheuristic and its subsequent application in fields ranging from vehicle routing problems to image processing [4].

The implementation of ACO in Web Intelligence tasks, together with other bio-inspired heuristics, has gained a considerable amount of attention due to the necessity of finding new ways to explain and simulate web user behavior.

Lin and Tseng [1] propose an ACO-based method for discovering navigational patterns, using a support measure based on real transitions between pages. User navigational preferences, represented by pheromone levels, are then directly proportional to the real frequency of usage. Abraham and Ramos [5] propose another approach based on the clustering of web user sessions. The model takes inspiration from the ability of certain kinds of ants to group corpses, generating ant cemeteries.

White et al. [6] use an ACO implementation to progressively find the best correlation between web content and a set of online advertisements in order to maximize the probability of profit. To achieve this, a vector space model is used together with the notion of a web user preference vector.

III. ACO MODEL FOR SIMULATING WEB USAGE

The key aspect of the proposed model is to enable artificial ants to learn from real user behavior by adjusting the intrinsic parameters of an ACO implementation, which have to be adapted to a web environment.

A. Web User Sessions Clustering

The learning algorithms used in this paper are based on a continuous comparison between the artificial sessions generated by the ants and the web user behavior represented through real web user sessions. A first approach could be to compare each artificial session with the complete set of real sessions available. This method could take a considerable amount of time due to the large number of sessions generated in the study time intervals. Thus, it is proposed to perform a clustering process on the web user sessions in order to reduce the number of subsequent comparisons.

1) Similarity Measure: To generate a valid similarity measure, two aspects must be considered: in the first place, the sets of web pages belonging to each session, and in the second place, the order in which those web pages were visited. It is proposed to use a degree of similarity [7] between sessions based on the calculation of the longest common subsequence (LCS) [8].

Given a real session r and an artificial session c, the similarity measure is defined by

$$sim(r, c) = \frac{LCS(r, c)}{\max\{\|r\|, \|c\|\}} \qquad (3)$$

where $\|r\|$ represents the length of a given session r and LCS(r, c) is the longest common subsequence between the two sessions.
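A minimal sketch of equation (3), assuming sessions are represented as lists of page identifiers; the class name SessionSimilarity is illustrative.

```java
import java.util.List;

// Sketch of the similarity measure of equation (3): LCS length divided by the
// length of the longer session. Sessions are assumed to be lists of page ids.
public final class SessionSimilarity {

    // Classical dynamic-programming LCS over two sequences of page ids.
    static int lcsLength(List<String> r, List<String> c) {
        int[][] dp = new int[r.size() + 1][c.size() + 1];
        for (int i = 1; i <= r.size(); i++) {
            for (int j = 1; j <= c.size(); j++) {
                if (r.get(i - 1).equals(c.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[r.size()][c.size()];
    }

    // sim(r, c) = LCS(r, c) / max(|r|, |c|)
    static double sim(List<String> r, List<String> c) {
        if (r.isEmpty() && c.isEmpty()) return 1.0;
        return (double) lcsLength(r, c) / Math.max(r.size(), c.size());
    }

    public static void main(String[] args) {
        List<String> real = List.of("home", "news", "courses", "staff");
        List<String> artificial = List.of("home", "courses", "staff");
        System.out.println(sim(real, artificial)); // 0.75
    }
}
```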

2) Clustering Method: Web user sessions are basically a categorical representation of the paths followed by users and do not have an intrinsic numerical value. Rather, it is their components, the web pages, which characterize them. Thus, the notion of a mean value for representing them does not exist [2]. We propose to use a hierarchical clustering method consisting of an agglomerative process that at every step groups sessions according to the similarity measure. This method uses only the similarity (or distance) between elements, which fits the categorical nature of web user sessions [2]. It also allows visualizing at every step the partial groups being generated, which is very useful for deciding the optimal number of clusters.

3) Centroid Calculation: It is necessary to define a representative session for each cluster, which can be labeled as a central session or centroid. We propose, for a given cluster, to compute for each session the accumulated value of the similarity measure with respect to the other sessions in the cluster, and to choose the session with the maximum value.
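A small sketch of this centroid selection, assuming an externally supplied similarity function such as equation (3); all names are illustrative.

```java
import java.util.List;
import java.util.function.BiFunction;

// Sketch of centroid selection: for each session, sum its similarity to every
// other session in the cluster and keep the session with the largest total.
public final class ClusterCentroid {

    static <S> S centroid(List<S> cluster, BiFunction<S, S, Double> sim) {
        S best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (S candidate : cluster) {
            double accumulated = 0.0;
            for (S other : cluster) {
                if (other != candidate) {
                    accumulated += sim.apply(candidate, other);
                }
            }
            if (accumulated > bestScore) {
                bestScore = accumulated;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy example with strings and a trivial similarity (shared prefix length).
        List<String> cluster = List.of("abc", "abd", "xyz");
        BiFunction<String, String, Double> prefixSim = (a, b) -> {
            int n = 0;
            while (n < Math.min(a.length(), b.length()) && a.charAt(n) == b.charAt(n)) n++;
            return (double) n;
        };
        System.out.println(centroid(cluster, prefixSim)); // "abc"
    }
}
```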

B. Web Graph Modeling

A web site is represented by a directed graph in which the web pages are the nodes and the links are the edges. As the process of web user session collection implies constant tracking during a certain period of time, changes may occur in both the structure and the content of the web graph. For this reason it is necessary to register those modifications. This results in the generation of many graphs, which capture the dynamics of the web site during the time interval under study.

1) Nodes: In the first place it is necessary to extract the most important keywords present in the corresponding web page. As TF-IDF has shown good results in terms of representing the relevance of keywords within a corpus [9], [10], we propose a method based on that statistical measure, consisting of the following steps. Given a web graph G:

• All keywords present must be extracted and stored in a container, the KeywordsGraph.
• For each keyword k in the KeywordsGraph, its TF-IDF value is calculated in relation to every page p in the web graph. If k is not present on a given page, its TF-IDF value is 0.
• For each keyword k, the maximum value of the TF-IDF is selected. Formally, for a given keyword k, tfidf_k = max{tfidf_k1, tfidf_k2, ..., tfidf_kN}.
• Limits tfidf_min and tfidf_max are defined in relation to the values obtained for the set of keywords and must be set in order to avoid extreme elements.
• All keywords whose representative TF-IDF value lies between tfidf_min and tfidf_max are stored in the MIK, a container for the Most Important Keywords. Formally, MIK = {k_i ∈ KeywordsGraph : tfidf_min ≤ tfidf_i ≤ tfidf_max}.
• Finally, this set can be converted into a vector whose components must be ordered decreasingly with respect to their TF-IDF values.

In the second place, we must generate a vector for each page p which represents in a quantitative fashion the correspondence between its own keywords and the set MIK. For each page p in the web graph, all keywords must be extracted and stored in a set called KeywordsPage_p. Then a vector L_p is generated, whose length has to be exactly the same as that of the MIK, and whose value at a given index must be associated with the keyword present at that index in the MIK vector. Values in L_p are set according to the following rule:

$$L_{pi} = \begin{cases} TF\text{-}IDF_{ip} & \text{if } MIK_i \in KeywordsPage_p \\ 0 & \text{otherwise} \end{cases}$$

Each keyword i in the MIK must be searched for within KeywordsPage_p. If keyword i is found, the value set at index i of vector L_p will be equal to the TF-IDF value of that keyword for page p; if not, the value will be 0. Thus, vector L_p has non-zero values only at the positions of the most important keywords that actually appear on page p. The idea behind this method is the construction of a same-length vector for all pages in the web graph. This standardization will be necessary in the subsequent comparison algorithms.
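The following sketch illustrates one possible construction of the MIK vector and the per-page L_p vectors, assuming TF-IDF values have already been computed and stored per page; the names tfidfByPage, buildMik and buildL, as well as the threshold values in the example, are assumptions for illustration only.

```java
import java.util.*;

// Sketch of the construction of the MIK vector and the per-page L vectors,
// assuming TF-IDF values have already been computed. All names are illustrative.
public final class KeywordVectors {

    // keyword -> max TF-IDF over all pages, filtered by [min, max], sorted decreasingly.
    static List<String> buildMik(Map<String, Map<String, Double>> tfidfByPage,
                                 double min, double max) {
        Map<String, Double> best = new HashMap<>();
        for (Map<String, Double> page : tfidfByPage.values()) {
            page.forEach((kw, v) -> best.merge(kw, v, Math::max));
        }
        return best.entrySet().stream()
                .filter(e -> e.getValue() >= min && e.getValue() <= max)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    // L_p[i] = TF-IDF of MIK[i] on page p if present, 0 otherwise.
    static double[] buildL(List<String> mik, Map<String, Double> pageTfidf) {
        double[] l = new double[mik.size()];
        for (int i = 0; i < mik.size(); i++) {
            l[i] = pageTfidf.getOrDefault(mik.get(i), 0.0);
        }
        return l;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> tfidf = Map.of(
                "home", Map.of("news", 0.7, "engineering", 0.8),
                "courses", Map.of("engineering", 0.65, "syllabus", 0.95));
        List<String> mik = buildMik(tfidf, 0.6, 0.9);  // [engineering, news]; syllabus filtered out
        System.out.println(mik);
        System.out.println(Arrays.toString(buildL(mik, tfidf.get("courses"))));
    }
}
```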

2) Links: Following the common representation, edges are used to model the links of the web graph. Initial pheromone trails must be placed in terms of the transition probability of a random surfer [11], defined by:

$$p_{ij} = \frac{p \cdot g_{ij}}{c_i} + \frac{1-p}{n} \qquad (4)$$

where n is the total number of nodes and p is an arbitrary value for the probability of continuing to surf within the web site (following the 80-20 rule). The adjacency matrix is represented by g_ij, which takes the value 1 if a link between nodes i and j exists and 0 otherwise, and c_i corresponds to the number of outgoing links of node i (node degree).
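A minimal sketch of equation (4) over an adjacency matrix, assuming the continuation probability p is supplied by the caller; names are illustrative.

```java
// Sketch of equation (4): initial pheromone levels set to the random-surfer
// transition probability. The adjacency matrix and the value of p are assumptions.
public final class InitialPheromone {

    // g[i][j] = 1 if a link from page i to page j exists, 0 otherwise.
    static double[][] initialize(int[][] g, double p) {
        int n = g.length;
        double[][] tau = new double[n][n];
        for (int i = 0; i < n; i++) {
            int outDegree = 0;
            for (int j = 0; j < n; j++) outDegree += g[i][j];
            for (int j = 0; j < n; j++) {
                double follow = (outDegree > 0) ? p * g[i][j] / outDegree : 0.0;
                tau[i][j] = follow + (1.0 - p) / n;   // equation (4)
            }
        }
        return tau;
    }

    public static void main(String[] args) {
        int[][] g = {{0, 1, 1}, {1, 0, 0}, {0, 1, 0}};
        double[][] tau = initialize(g, 0.8);          // p = 0.8, following the 80-20 rule
        for (double[] row : tau) System.out.println(java.util.Arrays.toString(row));
    }
}
```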

C. Artificial Ants

In order to adapt web user behavior to ACO, it is necessary to model each ant with elements that allow the management of the data extracted from a web site.

1) Text Preference Vector µ: For each ant, a text preference vector µ is defined as a measure of the relative importance that a particular agent gives to a set of keywords [12], [13]. Each component of every vector is initially an artificial representation of the relevance given to the corresponding component of the MIK vector. For example, in the text preference vector µ_i (belonging to ant i), the value of component j, µ_ij, is a quantitative expression of the importance that ant i gives to the keyword present at index j of the MIK vector.

Possible values lie between 0, which means no interest in the keyword, and 1, which represents full interest.

2) Ant Utility: The main goal of every ant is to travel through the graph and collect information in order to satisfy its need for content. Every time an ant i arrives at a specific page p, a comparison between its text preference vector µ_i and the corresponding vector L_p is performed. As a result of this comparison, a utility is computed, proportional to the level of text similarity between both vectors. As the ant passes from node to node, it accumulates utility until a predefined limit is reached. At that instant the ant finishes the session.

3) Ant Initialization: Each ant is assigned a µ vector, which defines its initial text preference. Additionally, a web user session cluster is associated with the ant. This is the core idea behind the learning algorithm: each ant will modify its preference vector in order to model the sessions belonging to the cluster it was associated with.

Initially, all µ vectors are constructed using a weighted random generation based on the following distribution:

• High preference: 20% of the vector components, with values between 0.8 and 1.
• Medium preference: 60% of the vector components, with values between 0.2 and 0.8.
• Low preference: 20% of the vector components, with values between 0.0 and 0.2.

This distribution was conceived based on the assumption that, for a given set of keywords, only a small group will concentrate a high preference value.
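A minimal sketch of this initialization; note that here the band of each component is drawn independently, so the 20/60/20 proportions hold only in expectation. The class name and the seed are illustrative.

```java
import java.util.Random;

// Sketch of the weighted random initialization of a text preference vector µ:
// roughly 20% of components in [0.8, 1.0], 60% in [0.2, 0.8) and 20% in [0.0, 0.2).
public final class PreferenceVector {

    static double[] initialMu(int length, Random rng) {
        double[] mu = new double[length];
        for (int i = 0; i < length; i++) {
            double band = rng.nextDouble();
            if (band < 0.2) {
                mu[i] = 0.8 + 0.2 * rng.nextDouble();   // high preference
            } else if (band < 0.8) {
                mu[i] = 0.2 + 0.6 * rng.nextDouble();   // medium preference
            } else {
                mu[i] = 0.2 * rng.nextDouble();         // low preference
            }
        }
        return mu;
    }

    public static void main(String[] args) {
        double[] mu = initialMu(165, new Random(42));   // 165 = |MIK| in the experiments
        System.out.println(java.util.Arrays.toString(java.util.Arrays.copyOf(mu, 10)));
    }
}
```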

4) First Node Election: While in a standard implementation of ACO the starting and ending points are well defined (for example, in a vehicle routing problem), in a web usage approach ants can start and end their sessions at any node.

As each ant is associated with a cluster of web user sessions, we propose to generate a set of possible first visited pages for that ant, based on the pages that appear in first place within the constituent sessions of the associated cluster. The selection of the first page for an artificial ant is performed using the frequencies of the first pages present in the corresponding cluster. These values are used as probabilities in a Monte Carlo approach.
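The following sketch shows one way to implement this Monte Carlo selection, assuming the cluster is given as a list of sessions (lists of page identifiers); names are illustrative.

```java
import java.util.*;

// Sketch of the first-node election: the first pages observed in the sessions of
// the associated cluster are sampled with probability proportional to their frequency.
public final class FirstNodeElection {

    static String sampleFirstPage(List<List<String>> clusterSessions, Random rng) {
        // Count how often each page appears as the first page of a session.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> session : clusterSessions) {
            if (!session.isEmpty()) counts.merge(session.get(0), 1, Integer::sum);
        }
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        // Roulette-wheel draw over the empirical frequencies.
        double r = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            cumulative += e.getValue();
            if (r <= cumulative) return e.getKey();
        }
        return null; // empty cluster
    }

    public static void main(String[] args) {
        List<List<String>> cluster = List.of(
                List.of("home", "news"), List.of("home", "courses"), List.of("staff"));
        System.out.println(sampleFirstPage(cluster, new Random()));
    }
}
```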

D. Learning Algorithm

The main idea behind the ant learning process is the progressive modification of each ant's µ text preference vector in order to generate artificial sessions which have an adequate similarity to the ones present in the corresponding session cluster.

1) µ Text Preference Vector Modification for Improving Generated Sessions: For a given ant, before it begins to travel through the web graph, a fixed increment is applied to one of the components of its µ vector. This can be understood as a guided increase in the importance that the ant gives to a determined keyword. Each of these changes will influence the behavior of the ant and the sessions it generates. Thus, it is necessary to find the modification which results in the sessions most similar to the real sessions in the respective cluster.

In the first place, a copy of each µ vector must be generated. Then, for each original µ vector, its copy is repeatedly modified, and for each modification the ant generates a session. Finally, the original µ vector is modified at the index whose modification in the copy produced the session most similar to the associated cluster.

2) Ant Arrival at the Web Graph: Once an ant has been set with a modified copy of its original µ vector, it is placed in the web graph according to the method described in III-C4. When arriving at the first node, the ant must compare its µ copy with the corresponding text vector L associated with the node. This step aims to simulate what happens when a user arrives at a web site: the user evaluates what the first page offers in terms of content, compares it with his/her own preferences, and then decides whether or not to continue through the link structure.

We propose to use the cosine similarity measure to compare the user preference vector µ and the corresponding text vector L_p of a given page p [13]. The resulting value is taken as the utility obtained by the user for visiting page p.

$$utility = \frac{\mu \cdot L_p}{\|\mu\|\,\|L_p\|} \qquad (5)$$

This utility accumulates as the ant travels through the graph until a limit is reached.
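A minimal sketch of equation (5), assuming µ and L_p are plain arrays of the same length; names are illustrative.

```java
// Sketch of equation (5): the utility of visiting page p is the cosine
// similarity between the ant's preference vector mu and the page vector L_p.
public final class PageUtility {

    static double cosineSimilarity(double[] mu, double[] lp) {
        double dot = 0.0, normMu = 0.0, normLp = 0.0;
        for (int i = 0; i < mu.length; i++) {
            dot += mu[i] * lp[i];
            normMu += mu[i] * mu[i];
            normLp += lp[i] * lp[i];
        }
        if (normMu == 0.0 || normLp == 0.0) return 0.0;  // page shares no MIK keywords
        return dot / (Math.sqrt(normMu) * Math.sqrt(normLp));
    }

    public static void main(String[] args) {
        double[] mu = {0.9, 0.1, 0.5};
        double[] lp = {0.8, 0.0, 0.6};
        System.out.println(cosineSimilarity(mu, lp));
    }
}
```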

3) Method for Selecting Next Page: After the calculation of the utility obtained on the first page, the ant must decide which page to follow. Each of the links to the neighbors carries a pheromone level which was initially calculated as a transition probability. This probability is used in the decision process as shown in Algorithm III.1.

Algorithm III.1: Next Node Selection
Data: current node i
    NeighbourNodes = GetNeighbourNodes(i)
    NeighbourPheromoneLevels = GetNeighbourPheromoneLevels(i)
    sum = 0
    r = GenerateRandom(0, 1)
    foreach node j ∈ NeighbourNodes do
        τij = NeighbourPheromoneLevels(i, j)
        sum = sum + τij
        if sum ≥ r then
            return NeighbourNodes(j)
    return −1

4) Pheromone Levels Update: The process of updating pheromone levels is divided into two steps. The first is the modification of the link which was chosen by the ant. The second is a general decrease of all levels, a process called evaporation.

Given a current node i, the process of selecting the next node results in the choice of a neighboring node j, and the pheromone level of the link, τ_ij, is updated. The increment added is directly proportional to the probability value given by the Logit model for discrete choices. This model has been widely used to simulate the way in which users search and decide over a discrete set of possibilities [12], [13].

$$\tau_{ij} = \tau_{ij} + \frac{e^{V_{ij}}}{\sum_{k \in Neighbours_i} e^{V_{ik}}} \cdot \varepsilon \qquad (6)$$

where V_ij is the utility obtained by the ant when passing from node i to node j, and ε is an arbitrary value (ε > 1) used to tune the magnitudes obtained.

This modification results in the sum of all pheromone levels on the links leaving node i being greater than 1. Therefore a normalization process must be performed.

Additionally, with the purpose of generating a stronger fit between the newly generated sessions and the set of sessions from the respective cluster, we propose to extend expression (6) with a boost component. If the transition between nodes i and j exists within the sessions that belong to the respective cluster, the modification of τ_ij is given by:

$$\tau_{ij} = \tau_{ij} + \frac{e^{V_{ij}}}{\sum_{k \in Neighbours_i} e^{V_{ik}}} \cdot \varepsilon + boost \qquad (7)$$

The next step is to evaporate the pheromone level on all links of the web graph. This process is standard in every ACO implementation and its main purpose is to progressively eliminate pheromone from unused links.
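The following sketch combines the reinforcement of equations (6) and (7) with the normalization and evaporation steps described above. The data structures, the ε, boost and evaporation values, and the interpretation of the evaporation factor as a multiplicative retention factor are assumptions for illustration.

```java
import java.util.Set;

// Sketch of the pheromone update of equations (6) and (7), followed by
// normalization of the outgoing links of node i and global evaporation.
public final class PheromoneUpdate {

    // Updates tau[i][j] after the ant moved from i to j, then renormalizes row i.
    static void reinforce(double[][] tau, int i, int j, double[] utilitiesFromI,
                          Set<Integer> neighboursOfI, boolean transitionInCluster,
                          double epsilon, double boost) {
        // Logit probability of the chosen transition among the neighbours of i.
        double denom = 0.0;
        for (int k : neighboursOfI) denom += Math.exp(utilitiesFromI[k]);
        double logit = Math.exp(utilitiesFromI[j]) / denom;

        tau[i][j] += logit * epsilon;                 // equation (6)
        if (transitionInCluster) tau[i][j] += boost;  // equation (7)

        // Normalize so that the outgoing pheromone levels of node i sum to 1.
        double rowSum = 0.0;
        for (int k : neighboursOfI) rowSum += tau[i][k];
        for (int k : neighboursOfI) tau[i][k] /= rowSum;
    }

    // Global evaporation: every link keeps only a fraction of its pheromone.
    static void evaporate(double[][] tau, double evaporationFactor) {
        for (int i = 0; i < tau.length; i++)
            for (int j = 0; j < tau[i].length; j++)
                tau[i][j] *= evaporationFactor;
    }

    public static void main(String[] args) {
        double[][] tau = {{0.0, 0.5, 0.5}, {1.0, 0.0, 0.0}, {0.5, 0.5, 0.0}};
        double[] utilities = {0.0, 0.4, 0.7};          // utilities of moving from node 0
        reinforce(tau, 0, 2, utilities, Set.of(1, 2), true, 2.2, 0.2);
        evaporate(tau, 0.8);
        System.out.println(java.util.Arrays.deepToString(tau));
    }
}
```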

5) End Conditions: Ants travel through the web graph until one of the following conditions is reached (a sketch of the corresponding checks follows the list):

• The utility limit is reached: each ant is set with a limit for the amount of utility to obtain by traveling through the web graph. If the limit is reached, the need for information can be considered fulfilled.
• The ant has fallen into a loop: if the ant begins to move in a circular way, this is perceived as non-genuine web user behavior and the session is stopped. To identify a feasible loop, each time a node is added to the current session its previous additions are inspected, and if a complete sequence is repeated twice, the session is stopped.
• Persistent visits to a determined page: if the ant visits the same page more than four consecutive times within a given session, the session is stopped.
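A sketch of the last two checks, under one possible reading of the loop rule (the session ends with the same subsequence occurring twice in a row) and of the repeated-visit rule; names and this interpretation are assumptions.

```java
import java.util.List;

// Sketch of the stopping checks on a partially built session: a repeated
// trailing subsequence ("loop") and more than four consecutive visits to a page.
public final class EndConditions {

    // True if the session ends with some subsequence repeated twice back to back.
    static boolean hasLoop(List<String> session) {
        int n = session.size();
        for (int len = 1; len <= n / 2; len++) {
            boolean repeated = true;
            for (int k = 0; k < len; k++) {
                if (!session.get(n - 1 - k).equals(session.get(n - 1 - len - k))) {
                    repeated = false;
                    break;
                }
            }
            if (repeated) return true;
        }
        return false;
    }

    // True if the last page has been visited more than maxRepeats times in a row.
    static boolean hasPersistentVisits(List<String> session, int maxRepeats) {
        int n = session.size(), run = 1;
        while (run < n && session.get(n - 1 - run).equals(session.get(n - 1))) run++;
        return run > maxRepeats;
    }

    public static void main(String[] args) {
        System.out.println(hasLoop(List.of("a", "b", "c", "b", "c")));                     // true
        System.out.println(hasPersistentVisits(List.of("a", "p", "p", "p", "p", "p"), 4)); // true
    }
}
```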

6) Generated Sessions Storage: When one of the previous conditions is met, the session is stopped and its components are associated with a unique ID. Finally, the session is stored together with the index at which the copy of the µ vector was modified.

7) Modification of the µ Vector: As mentioned before, for every modification made to the copy of the µ vector, the ant generates an artificial session. Each modification consists of an increment of one of the components of the vector, which represents the user preference for an associated keyword belonging to the set of most important keywords of the web graph. The next step is to choose one of the modifications made in the copy and reproduce it in the original vector. The method for finding it is to compare each artificial session with the central session of the respective cluster. The modification which produced the session most similar to the central session is selected, and the original µ vector is modified accordingly.

E. Model Validation

The main idea behind the model validation process is to collect structure, content and usage data from a subsequent time interval. Thus, both training and validation data must belong to the same web site, but come from disjoint periods. Then, sessions generated by the trained ants are compared with real web user behavior.

1) Data Preparation: Both content and structure data must be collected and processed in the same way as the training data. Usage data must be represented through web sessions, and the hierarchical clustering process must be performed. Thus, both training and validation sources will have an equivalent structure, which allows coherent comparisons.

2) Ant Initialization: The main output of the learning process is a set of trained text preference vectors, each one taken as a representation of the real preferences of a given web session cluster belonging to the training set. For each of these vectors, a group of ants is generated. Every member of the group incorporates the respective vector as its own preference. Additionally, each group of ants has its own web graph.

3) Ant Behavior: For each µ vector, its whole group of ants is released on its respective web graph at the same time. As all ants have the same preference, the same information will be searched for. This represents a standard implementation of ACO, where all agents cooperate indirectly in the search for food, updating pheromone levels on the edges of the graph. Thus, this algorithm is similar to the one presented in the training process, but without the learning component.

For a given ant that belongs to group H_j (associated with text preference vector µ_j), a schematic representation of the algorithm is the following:

Algorithm III.2: Ant Behavior
    Initialize SessionsContainer(), MAXIterations
    for i ← 0 to MAXIterations do
        session ← new Session(); utility ← 0
        CurrentNode ← getCurrentNode()
        session.add(CurrentNode)
        L ← getVectorL(CurrentNode)
        utility ← utility + CosineSim(µ, L)
        while utility < max AND !InvalidLoops AND !InvalidRep do
            Neighbours ← getNeighbours(CurrentNode)
            DestinationNode ← getDestinationNode(Neighbours)
            session.add(DestinationNode)
            L ← getVectorL(DestinationNode)
            utility ← utility + CosineSim(µ, L)
            τ(CurrentNode→DestinationNode) ← τ(CurrentNode→DestinationNode) + LogitProb(CurrentNode, DestinationNode)
            evaporateGraph()
            CurrentNode ← DestinationNode
        SessionsContainer.add(session, i)
        session.clear()

This algorithm allows each group of ants to fill its respective container with artificial sessions. Each ant travels until an ending condition is reached, the generated session is collected, and the ant starts again. Progressively, the changes made in the pheromone levels shape certain routes within the graph that are gradually reinforced by the respective group.

4) Generated Sessions Characterization: For each container of generated sessions, a convergence analysis must be performed. As the process of session storage is sequential, heterogeneity is expected at the beginning, because the pheromone levels have not yet been modified enough to influence the ants to follow a defined pattern. Homogeneity is expected near the end, when the collaborative work has reinforced some trails. Therefore, it is necessary to find a convergence threshold which allows the selection of the subset of generated sessions that can be understood as the representative behavior of the group of ants associated with a certain trained preference vector µ. Two methods are proposed:

• Select a proportion of sessions from the last inserted in the container. For instance, the last 10% of sessions can be extracted and labeled as homogeneous behavior, under the assumption of convergence inherent in the ACO algorithm.
• Inspect the consecutive similarity between stored sessions. Using equation (3) it is possible to calculate the similarity between sessions; if a tendency toward high similarity values is perceived, it can be assumed that convergence has been reached.

Both methods require expert assistance because the parameters depend on the intrinsic nature of the problem. A sketch of the second criterion, based on consecutive similarity, is given below.
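A minimal sketch of the consecutive-similarity criterion, assuming a generic similarity function such as equation (3); the threshold, window size and names are illustrative.

```java
import java.util.List;
import java.util.function.BiFunction;

// Sketch of the second criterion: convergence is declared once the similarity
// between consecutively stored sessions stays above a threshold for a whole window.
public final class ConvergenceDetector {

    static <S> int convergenceIndex(List<S> storedSessions, BiFunction<S, S, Double> sim,
                                    double threshold, int window) {
        int consecutive = 0;
        for (int t = 1; t < storedSessions.size(); t++) {
            double h = sim.apply(storedSessions.get(t), storedSessions.get(t - 1)); // h(t)
            consecutive = (h >= threshold) ? consecutive + 1 : 0;
            if (consecutive >= window) return t - window + 1;   // first index of the stable run
        }
        return -1;  // no convergence detected
    }

    public static void main(String[] args) {
        List<String> sessions = List.of("abc", "abd", "abd", "abd", "abd");
        BiFunction<String, String, Double> sim = (a, b) -> a.equals(b) ? 1.0 : 0.0;
        System.out.println(convergenceIndex(sessions, sim, 0.9, 3)); // 2
    }
}
```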

For every group of ants, after its respective subset of artificial sessions has been selected, it is necessary to choose a central session as a representative value, which can be done using the method described in III-A3. There is therefore one representative session for each group of ants.

5) Comparison Between Sessions: The final step in the validation process is the comparison between the following sets of sessions:

• Real web sessions corresponding to the web usage in the validation period. These sessions have been grouped into n clusters C_i, each with a central session c_i (i = 1...n).
• Artificial sessions generated through a standard ACO algorithm using the trained text preference vectors. These sessions are organized into m sets S_j, each one from a different group of trained ants, with a representative session s_j (j = 1...m).

It must be noted that in general n ≠ m. In fact, m is associated with the number of µ text preference vectors, which, in turn, comes from the clusters of user sessions of the training period. Thus, n and m correspond to the number of user session clusters in different time periods, and therefore do not have to be the same.

Each S_j must be compared with every C_i, using s_j and c_i and the similarity measure defined in III-A1. A similarity threshold must be defined; if it is reached, the artificial session is considered sufficiently similar to the real session to explain it.
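A small sketch of this matching step, counting how many real centroids are explained by at least one representative artificial session; the threshold and names are illustrative.

```java
import java.util.List;
import java.util.function.BiFunction;

// Sketch of the final validation step: a centroid counts as "explained" if at
// least one representative artificial session reaches the similarity threshold.
public final class ValidationMatcher {

    static <S> long explainedCentroids(List<S> centroids, List<S> representatives,
                                       BiFunction<S, S, Double> sim, double threshold) {
        return centroids.stream()
                .filter(c -> representatives.stream()
                        .anyMatch(s -> sim.apply(c, s) >= threshold))
                .count();
    }

    public static void main(String[] args) {
        BiFunction<String, String, Double> sim = (a, b) -> a.equals(b) ? 1.0 : 0.0;
        List<String> centroids = List.of("home>news", "home>courses", "staff");
        List<String> reps = List.of("home>news", "staff");
        System.out.println(explainedCentroids(centroids, reps, sim, 0.6)); // 2
    }
}
```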

IV. EXPERIMENTAL RESULTS

The presented model was implemented on the web site of the Industrial Engineering Department of the University of Chile (http://www.dii.uchile.cl). This site is the main source of information about the institution and its activities. In terms of content, it includes news, academic events, undergraduate and postgraduate information, and faculty and administrative staff pages. It also serves as a transition platform for several sub-sites belonging to postgraduate programs and research centers. Both the training and validation periods were set to one month; thus, data was extracted from September to October 2010.

A. Structure and Content Processing

To collect web site data, a web crawler approach was used. It allowed simultaneous storage of content and structure, linking them through the corresponding identifiers. Content data was pre-processed by applying standard cleaning methodologies, such as non-HTML element removal, stop-word extraction and stemming. Crawler capabilities enabled the identification of changes over time, which were explicitly registered in order to obtain a more realistic source. Finally, the collected data was transformed into a vector space model representation.

TF-IDF values for the training data set were calculated and normalized. Following the method described in III-B1, the set of most important keywords was generated using the keywords whose TF-IDF values were between 0.6 and 0.9, yielding 165 elements. Using this set, L vectors were calculated for each page of the web graph.

B. Web User Session Extraction

The extraction and processing of user behavior was performed using a proactive strategy consisting of the intensive use of a JavaScript file previously inserted into all web pages belonging to the web site under study. This JavaScript file is the key element for tracking and registering visited web pages. The method is based on the use of a cookie as a temporary storage object for the navigational data, with the purpose of completely identifying web user sessions, which are later sent to a dedicated server for final storage. When a user visits a page, the method collects the IP address, user agent, time stamp, URL, host, and a sequence indicator (entering or leaving the page). Additionally, the script assigns a unique identifier to the current session.

One of the advantages of this method is that the sessionization process is performed automatically, which represents a considerable saving of time. On the other hand, it has some issues related to its dependency on the onload and onbeforeunload JavaScript functions, which are used to identify when a user enters or leaves a page; these functions are handled in different ways depending on which web browser is being used.

1) Web User Session Clustering: The first step was to remove all sessions with length equal to one, because they lack the notion of a sequential path. Before running the clustering algorithm it is necessary to define a criterion for selecting the optimal number of clusters. We propose to use the ratio between extra-cluster and intra-cluster similarity [14], selecting the number of clusters that minimizes it.

For the training data set, consisting of 6,842 sessions, 806 clusters were found. For the validation data set, consisting of 5,701 sessions, 733 clusters were found. Although these numbers may seem high, it must be noted that the main purpose of the clustering was to reduce the dimensionality of the data due to the considerable number of comparisons that must be performed. Moreover, given the strictness of the similarity measure in use, which compares not only the elements within sessions but also the sequence in which they appear, the algorithm is expected to be conservative when merging clusters. Finally, using the method described in III-A3, a central session was obtained for each cluster.

Figure 1. Initial text preference values for vector µ10.

C. Training Algorithm

For each cluster found in the training set, an artificial ant was associated and initialized with an artificially generated µ text preference vector. Several tests were performed in order to obtain the model parameters:

• ε: 2.2
• evaporation factor: 0.8
• ant utility limit: 0.2
• boost: 0.2

Each of the 806 artificial ants was set to iterate 1,000 times on its respective web graph. This process was configured to be executed in parallel using the MASON framework for Java (http://cs.gmu.edu/eclab/projects/mason/), which is designed to manage multi-agent simulations. It was executed on an Intel Core 2 Duo 2.1 GHz computer with 3 GB of RAM and Windows 7 Professional Edition. The entire learning process took approximately four hours.

The output of the learning algorithm is the set of trained µ text preference vectors. Results showed that the initial vectors were significantly modified. It must be noted that the initial µ vectors are generated randomly using the rule described in III-C1. The algorithm gradually modifies the components: most of them decrease considerably while a few increase their values. For instance, Figure 1 shows the initial, highly heterogeneous values for a given vector µ, in this case the one associated with cluster 10. Figure 2 shows the final distribution of values after 1,000 iterations of the learning algorithm. The subset of keywords with the highest values is interpreted as the text preference of the users that generated the sessions contained in the respective cluster.

D. Model Validation

As was mentioned in III-E, the validation consists of two main processes. First, using content and structure data from a different period, a web graph structure is generated. Within it, groups of ants, each one associated with one of the trained µ vectors, generate a set of artificial sessions.

Figure 2. Text-preference component values for vector µ10 are influenced by only seven of the 165 terms.

Figure 3. Sequential comparison between generated sessions for µ10

The second part is the aggregated comparison between that group of artificial sessions and the clusters of sessions from real web usage in the validation period.

For each of the 806 trained µ vectors, a group of 200 ants was associated. Each group was released on its respective web graph and the generated sessions were inserted sequentially into a database. When a session is inserted, a similarity measure is calculated between the current session and the one stored immediately before:

$$h(t) = sim(session(t), session(t-1)) \qquad (8)$$

Figure 3 shows the evolution of h(t) for the sessions generated by the group of ants associated with µ10. In the beginning, the similarity between inserted sessions was almost zero. This is because the pheromone levels in the web graph had not been sufficiently modified to generate defined trails that force the ants to follow a determined pattern. As time passes, similarity progressively increases until approximately iteration 350, where a value of 1 is reached for the first time, which means identical consecutive sessions.

With the purpose of collecting a subset of sessions with a coherent level of homogeneity, it is proposed to extract the last 50 inserted sessions of each group of ants.

The next step is to compare the artificial sessions with the real sessions and calculate what proportion of the behavior can be explained by the model. For each cluster of real sessions, the centroid was calculated. For each set of artificial sessions, a representative value was selected. Both processes were performed following III-E5. The similarity threshold was set at t = 0.6, because this value statistically represents at least two matching elements between sessions, including sequence order. If this threshold is reached, a real session is explained by an artificial session.

Each representative artificial session was compared with every centroid. Results show that 611 of the 806 have a similarity greater than or equal to 0.6 with at least one of the 733 centroids. If t = 0.7, that number decreases to 484, and if t = 0.5, it increases to 639. Performing the comparison from the centroids' side, when t = 0.6, 598 centroids can be explained by the set of representative artificial sessions. With t = 0.7, 452 centroids are obtained, and with t = 0.5, 601 centroids (81%).

It must be noted that the results are not equivalent when performing the comparisons in both directions. The reason is that several representative artificial sessions can be associated with the same centroid, i.e., the same usage behavior, represented by a centroid and therefore by a cluster, may be reproduced by different groups of ants associated with different text preference vectors µ.

V. CONCLUSION

We conclude that the model presented in this paper is a plausible way to integrate the ACO metaheuristic with web mining methodologies. Its multi-variable approach, which uses content, structure and usage data from web sources, allows the building of a framework for web user simulation based on the construction of text preference vectors, whose fully parameterized structure allows the detection and incorporation of any change in the web environment. It must be noted that the simplicity of problem formulation which is intrinsic to ACO cannot be applied directly to web-based problems because of the complexity of the data and its time dependency; additional functionality is required to handle the interaction of data from different dimensions. For instance, while in standard ACO implementations, such as vehicle routing, both starting and ending points are well defined, web user sessions present a multiple starting and ending nature, which challenges ACO in terms of ensuring algorithm convergence. Another point is that, as this model is based on a collaborative multi-agent construction of solutions, results have to be understood in an aggregated fashion. Thus, this work provides a global estimation of web usage: specifically, 81% of the real usage could be explained by coherent matching between artificial and real sessions based on a similarity measure.

Future work is related to the refinement of the text preference model in order to incorporate different utilities when collecting information, which could better adapt to real user behavior. Changes in the collaborative process could also be made to obtain faster solutions, including comparison with other metaheuristics.

ACKNOWLEDGMENT

This work was partially supported by the FONDEF project DO8I-1015 and the Millennium Institute on Complex Engineering Systems (ICM: P-05-004-F, CONICYT: FBO16).

REFERENCES

[1] C.-C. Lin and L.-C. Tseng, "Website reorganization using an ant colony system," Expert Systems with Applications, vol. 37, no. 12, pp. 7598–7605, 2010.

[2] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer-Verlag Berlin Heidelberg, 2007.

[3] M. Dorigo and L. M. Gambardella, "Ant colonies for the traveling salesman problem," Biosystems, 1997.

[4] M. Dorigo and T. Stützle, Ant Colony Optimization. MIT Press, 2004.

[5] A. Abraham and V. Ramos, "Web usage mining using artificial ant colony clustering and genetic programming," CoRR, vol. abs/cs/0412071, 2004.

[6] T. White, A. Salehi-Abari, and B. Box, "On how ants put advertisements on the web," in IEA-AIE 2010: Proceedings of the 23rd International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems, Córdoba, España, May 2010.

[7] M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa, "A framework for the evaluation of session reconstruction heuristics in web-usage analysis," INFORMS Journal on Computing, vol. 15, no. 2, pp. 171–190, 2003.

[8] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1999.

[9] J. D. Velásquez and V. Palade, Adaptive Web Sites: A Knowledge Extraction from Web Data Approach. IOS Press, 2008.

[10] J. D. Velásquez, "Web site keywords: A methodology for improving gradually the web site text content," Intelligent Data Analysis, to appear, 2011.

[11] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in Computer Networks and ISDN Systems. Elsevier Science Publishers B.V., 1998, pp. 107–117.

[12] P. E. Román, "Web user behavior analysis," Ph.D. dissertation, Universidad de Chile, 2011.

[13] P. E. Román and J. D. Velásquez, "Stochastic simulation of web users," in 2010 IEEE/WIC/ACM International Conference, vol. 1. IEEE Computer Society Press, 2010, pp. 30–33.

[14] D. Ganguly, S. Mukherjee, S. Naskar, and P. Mukherjee, "A novel approach for determination of optimal number of cluster," in International Conference on Computer and Automation Engineering, pp. 113–117, 2009.