
Distributed and Parallel Databases, 11, 93–116, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Evolutionary Techniques for Web Caching

ATHENA VAKALI [email protected]
Department of Informatics, Aristotle University, Thessaloniki 54006, Greece

Recommended by: Ahmed Elmagarmid

Abstract. Web caching has been proposed as an effective solution to the problems of network traffic and congestion, Web object access and Web load balancing. This paper presents a model for optimizing Web cache content by applying either a genetic algorithm or an evolutionary programming scheme for Web cache content replacement. Three policies are proposed for each of the genetic algorithm and the evolutionary programming techniques, in relation to objects' staleness factors and retrieval rates. A simulation model is developed and long-term trace-driven simulation is used to experiment on the proposed techniques. The results indicate that all evolutionary techniques are beneficial to cache replacement, compared with the conventional replacement applied in most Web cache servers. Under an appropriate objective function the genetic algorithm has proven to be the best of all approaches with respect to cache hit and byte hit ratios.

Keywords: World Wide Web caching, cache replacement algorithms, genetic algorithms, evolutionary programming

1. Introduction

The rapid expansion of the World Wide Web has resulted in major network traffic and congestion. Web data circulation has been almost doubling every six months and, despite efforts to increase capacity, demand is not always met [11]. Improving response times and access latencies for clients has become an important and challenging issue. Web caching has been proposed as a technique to reduce both Internet traffic and the access times for (frequently) requested objects.

Many aspects of Web caching originate from the caching ideas implemented in various computer and network systems, yet Web caching introduces new issues in the management and retrieval of Web objects across the network. The overall process of accessing data is no longer dependent on the client/server interaction alone. A client requests object(s) residing at a server, but instead of accessing the specified server, its local storage medium is checked first. If the requested data resides in the local cache, it is retrieved from there with no extra network access cost; otherwise the original server needs to be contacted. Web caches are implemented such that information resides closer to the user(s), since clients retain a local cache for Web object storage. Therefore, both the load on the origin servers and the network traffic are reduced, since upon requesting Web objects the clients can access their local cache instead of fetching the data from the original server. In this paper, the problem of supporting effective Web object caching is addressed and certain evolutionary techniques are proposed.


1.1. Related work

World-Wide Web caching differs from traditional caching in a distributed file system mainly in its access patterns, since the Web is orders of magnitude larger than any distributed file system [5]. Furthermore, Web caching differs from conventional caching due to the non-homogeneity of object sizes [1]. The most critical research issues in Web caching concern cache replacement strategies as well as cache consistency and validation.

Performance improvements due to Web-based caching have been investigated in order to estimate the value and importance of Web caching. Research efforts have focused on maintaining Web object coherency by proposing effective data replacement policies. In [12] a Web-based dynamic data caching model is introduced and this model's design and performance are analyzed. A number of Web replacement policies are discussed in [4] and comparisons are made under trace-driven simulations. In [2] the importance of various workload characteristics for Web proxy cache replacement is analyzed and trace-driven simulation is used to evaluate the replacement effectiveness. Other issues have been discussed in relation to Web caching: caching and replication and cache hierarchy schemes have been proposed in the past to exploit Web object availability through redundancy and through the collaboration of hierarchical caches for object sharing and exchange [3, 7, 17]. Furthermore, the incorporation of locality with cost and size is discussed in [6], where cost-aware proxy caching algorithms are introduced towards performance improvement.

The proxy cache needs to support validation mechanisms such that the cached objects match their origin. An adaptive cache invalidation method for mobile client/server environments is presented in [13], where adaptable mechanisms adjust the size of the invalidation report to optimize the use of limited communication bandwidth. Furthermore, there are storage limitations since the cache has a certain storage capacity. The cache server's ability to keep data as "fresh" as possible (i.e. to eliminate Web information staleness) is another critical issue. The Web server is responsible for preserving an information object's "freshness" by adopting specific updating rules, towards balancing the server load and facilitating Internet information circulation. One of the key issues in Web caching is to keep cached objects consistent with their origin storage copy, such that the client receives proper, non-stale data. Cache consistency policies have been included in almost every proxy cache server (e.g. [17]) and their improvement has become a major research issue. A survey of contemporary cache consistency mechanisms in use on the Internet is presented in [10]. Trace-driven simulation shows that a weak cache consistency protocol reduces network bandwidth consumption and server load.

The idea of evolutionary algorithms has been used to solve many computational problems demanding optimization and adaptation to changing environments. Usually grouped under the term evolutionary algorithms (or evolutionary computation) are the domains of genetic algorithms, evolution programming and strategies, as well as genetic programming. Genetic algorithms are one of the main evolutionary methods, applied to many computational problems requiring either a search through a huge number of possible solutions or adaptation to a changing environment. More specifically, genetic algorithms have been applied in various research areas such as scientific modeling, machine learning as well as network infrastructure [8, 9, 14, 15]. A Web-based evolutionary model has been presented


in [20] where cache content is updated by evolving over a number of successive cache object populations, and it is shown by trace-driven simulation that cache content is improved. A genetic algorithm model is presented in [19] for performing Web object replication and caching, where caching is employed through the evolution of the cached Web object population, involving replication policies for the most recently accessed objects.

1.2. Research summary

Web caching is implemented by proxy server applications developed to support many users. Proxy applications act as an intermediary between Web clients and servers. Clients make their connection to proxy applications running on their hosts. The proxy connects to the server and relays data between the client and the server. At each request, the proxy server is contacted first to find whether it has a valid copy of the requested object. If the proxy has the requested object this is considered a cache hit; otherwise a cache miss occurs and the proxy must forward the request on behalf of the client. Upon receiving a new object, the proxy serves a copy to the end-user and keeps another copy in its local storage. When the cache is full, a specific technique is needed to remove some of the current copies in order to store more recently requested objects.
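The hit/miss flow just described can be sketched in a few lines (a minimal illustration, not Squid's actual implementation; the dictionary-backed store and the `fetch_from_origin` helper are hypothetical stand-ins):

```python
def fetch_from_origin(url):
    """Hypothetical stand-in for contacting the origin server."""
    return f"body-of-{url}"

class ProxyCache:
    """Toy proxy cache: serve from the local store on a hit, forward on a miss."""

    def __init__(self):
        self.store = {}    # url -> cached copy of the object
        self.hits = 0
        self.misses = 0

    def request(self, url):
        if url in self.store:          # cache hit: serve the local copy
            self.hits += 1
            return self.store[url]
        self.misses += 1               # cache miss: forward, keep a copy, serve
        body = fetch_from_origin(url)
        self.store[url] = body
        return body

proxy = ProxyCache()
proxy.request("/index.html")           # miss: fetched from the origin server
proxy.request("/index.html")           # hit: served from the local cache
print(proxy.hits, proxy.misses)        # 1 1
```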

This paper presents a model which introduces the ideas of genetic algorithms and evolutionary programming to the Web cache replacement process. The proposed models adapt the evolutionary computation idea to preserve a consistent cache "population" of World-Wide Web information objects. A Web cache is modeled as a population of information objects and the aim is to improve the cache population regarding its reliability and accessibility. The proposed Web caching replacement policies are based on either genetic algorithms or evolutionary programming in order to maintain in cache the "strongest" of the Web objects demanding to be cached.

Figure 1 represents the overall approach of the proposed evolutionary Web caching. The presented approach applies either the genetic algorithm or the evolutionary programming methodology in order to perform the cache replacement. As mentioned above, each cached

Figure 1. Evolutionary cache replacement—Research overview.


object is considered to be an individual in the cache population and the proposed schemes favor the "strongest" cached objects. Each object's "strength" is determined by specific criteria related to the object's staleness, access frequency and retrieval cost. The simulation model is based on the Squid proxy cache server and is experimented with under workloads of real cache traces and cache log files.

The remainder of the paper is organized as follows. The next section states the problem and specifies the most crucial cache content parameters. Sections 3 and 4 present the genetic algorithm and the evolutionary programming cache replacement models, respectively. Web cache servers and the simulation details are discussed in Section 5, whereas trace-driven experimentation and results are presented in Section 6. Section 7 summarizes the conclusions and discusses potential future work.

2. Problem statement

A Web cache is an application residing between Web servers and clients such that it watches requests for information objects identified as HTML pages, images, documents and files. Web cache servers reply to a user's request by sending the requested Web object and by (at the same time) saving a copy for the cache itself. If another request refers to the same object, the cache will use the copy it has, instead of asking the original server for it again. As pointed out in Section 1, the two main advantages of Web caches are the reduction in both latency (a request is satisfied by the cache, which is closer to the client) and traffic (each object is retrieved from the server once, thus reducing the bandwidth used by a client). A data object's freshness has to be determined by specific rules, and an object is considered stale when the original server must be contacted to validate the cached copy. An object's staleness results from the cache server's lack of awareness of changes to the original object. Each proxy cache server must therefore be reinforced with specific mechanisms to confront staleness.

Web cache content can be modeled by an information hash table consisting of a number of rows, each row associated with a particular cached object. The number of rows is bounded by the number of cached objects. Each object is identified by its corresponding stored object filename, along with a number of related attributes. The attributes are chosen such that cache replacement can be supported and employed. The most important parameters for each Web object are summarized in Table 1. More specifically, ti is the variable denoting the time when object i was cached, and lmi denotes the time of the object's last modification as reported at that cache time.

In order to formulate the problem we define a set of parameters that govern the proposed Web cache replacement process. The main introduced parameters are related to the object's staleness status, its frequency of access and its retrieval rate.

Definition 1. The cached object's staleness ratio is defined by

    sri = (ti − lmi) / (now − ti)

where the numerator corresponds to the time interval between the time of the object being cached and the time of the object's last modification, and the denominator is the cache "age"


Table 1. The most useful attributes of each cached object i.

    Parameter   Description
    C           total capacity of the cache area (MBytes)
    N           number of cached objects
    seri        the server on which the object resides
    si          object's size (KBytes)
    li          time the object was logged
    ti          time the object was cached
    lmi         time of object's last modification
    ai          number of accesses to other objects since object i was last accessed
    idi         object's original copy identification (e.g. its URL address)

of the object, i.e. it determines the time that the object has remained in cache. It is always true that sri ≥ 0 since ti − lmi ≥ 0 and now − ti > 0 (now is a variable representing the current time). It is always true that ti − lmi ≥ 0 since (by its definition) lmi is the time of the object's last modification as reported at the time ti when the object is cached, i.e. lmi captures the object's latest modification prior to the object's cache time (ti). The lower the value of sri, the more stale object i is, since a low value indicates that the object has remained in cache for a longer period (the denominator now − ti grows as the object ages in cache).
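As a concrete illustration, the staleness ratio of Definition 1 can be computed directly from the Table 1 attributes (a sketch; the timestamps in the example are hypothetical):

```python
def staleness_ratio(t_i, lm_i, now):
    """sr_i = (t_i - lm_i) / (now - t_i); assumes t_i >= lm_i and now > t_i."""
    return (t_i - lm_i) / (now - t_i)

# An object last modified at time 100, cached at time 130, observed at time 190:
print(staleness_ratio(130, 100, 190))  # 0.5
```

As the object remains in cache longer (now grows), the ratio shrinks, flagging the object as increasingly stale.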

Definition 2. The cached object's dynamic frequency is defined by

    dfi = 1 / ai

since ai is the metric for estimating an object's access frequency (Table 1). According to [1], ai is well-defined for all objects which have been accessed before. It holds that the higher the value of dfi, the more recently object i was accessed, since ai is the parameter identifying the number of accesses to other objects since object i was last referenced. When all references are for object i (i.e. ai = 0) the value of dfi becomes maximal, and object i is the most popular one.

Definition 3. The cached object’s retrieval rate is defined by

rri = lats × bands

where lats is the latency of opening a connection to server s and bands is the bandwidth of the connection to server s (measured in Megabits per second). rri represents the cost involved in retrieving an object from its original storage server.
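Definitions 2 and 3 translate just as directly (a sketch; the latency and bandwidth figures below are hypothetical example values, not measurements):

```python
def dynamic_frequency(a_i):
    """df_i = 1 / a_i; undefined for a_i = 0 (object i is then the most popular)."""
    return 1.0 / a_i

def retrieval_rate(lat_s, band_s):
    """rr_i = lat_s * band_s for the origin server s of object i."""
    return lat_s * band_s

print(dynamic_frequency(4))            # 0.25
print(retrieval_rate(0.2, 10.0))       # 2.0
```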

A Web cache server considers whether an object can be cached or not. In case there is not enough space, there is a need to remove one or more objects from the cache in order


to free sufficient space. The cache replacement process must guarantee enough space for the incoming objects. Therefore, there are two actions related to the replacement process: either the object will remain stored in cache or it will be purged from cache. A function is needed to identify the action that should be taken for each cached object.

Definition 4. The cached object’s action function is defined by

    acti =  { 0   if object i will be purged from cache
            { 1   otherwise

Problem statement: Suppose that N is the number of objects in cache and C is the total capacity of the cache area. The cache content optimization problem is to:

    MAXIMIZE    Σ (i=1..N)  acti × sri × dfi / rri        (1)

    subject to  Σ (i=1..N)  acti × si  ≤  C

In the optimization formula the fraction dfi/rri is used as a weight factor associated with each cached object, since it relates the object's access frequency to its retrieval rate. The basic aim of the proposed cache replacement problem is to maintain in cache the most non-stale, frequently accessed Web objects.

3. The genetic algorithm caching model

The Genetic Algorithm (GA) approach is considered for the cache replacement problem in order to optimize the cached objects such that the cache area is better exploited and utilized. The GA is used for two main reasons. First, the basic idea of GAs is the evolution of populations by the criterion "survival of the fittest", and the cache should contain the fittest (i.e. non-stale, frequently accessed) information objects. Second, GAs are applied to problems demanding optimization over spaces which are too large to be exhaustively searched, and the cache content usually consists of a large number of information objects (stored files). Figure 2 depicts the cycle of a GA applied to a space of individuals. A GA is an iterative procedure that maintains a constant-size population of individuals, each encoding a possible solution in a given problem space. In our problem we model the cached objects as the individuals considered for evolution. The individuals are assessed according to a predefined quality criterion, called the objective or fitness function. Two genetically-inspired operations, known as crossover and mutation, are applied to selected cached objects (considered to be the population individuals) to successively create stronger cache generations.

• Encoding and operators: Web objects are encoded appropriately in order to participate in the considered GA caching model. Each individual in the cache must be identified


Figure 2. The genetic algorithm process.

according to a predetermined encoded string. The encoding scheme is chosen such that the potential solution to our problem may be represented as a set of parameters. These parameters are joined together to form the encoded string. In order to consider the identification of each Web object individual, the object identifications (idi, Table 1) are mapped to the integer values 1, 2, . . . , S, where S is the total number of objects selected for caching. According to the problem statement (defined in Eq. (1)) the parameters acti, sri, dfi, rri and si are the ones that guide the optimization problem, therefore they are included in our encoding string. Each parameter is assigned a value and the presence of that parameter is signaled by the presence of that value in the ordered encoded string, such that each individual's characteristics are maintained. More specifically, the acti parameter reserves one bit in the encoded string in order to support the action to be performed on cached objects, whereas the other parameters reserve a number of bits according to their range of values. For example, dfi reserves another 2 bits in order to map its values to a bit representation such that the bits 00 correspond to dfi values between 0 and 0.25, the bits 01 correspond to dfi values between 0.25 and 0.5, and so on, to capture the range of its potential values.
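The range-to-bits mapping sketched for dfi might be implemented as follows (equal-width buckets over [0, 1] are assumed for illustration; the text only fixes the first two bucket boundaries):

```python
def encode_df(df, n_bits=2):
    """Map a df value in [0, 1] to an n_bits bucket label, e.g. '00' for
    values in [0, 0.25) and '01' for values in [0.25, 0.5)."""
    n_buckets = 2 ** n_bits
    bucket = min(int(df * n_buckets), n_buckets - 1)
    return format(bucket, f"0{n_bits}b")

print(encode_df(0.1))   # '00'
print(encode_df(0.3))   # '01'
print(encode_df(1.0))   # '11'
```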

The standard GA manipulations, namely crossover and mutation, mix and recombine "genes" of an initial cache population to form the new cached objects for the next generation.

Crossover is performed between two cache object individuals ("parents") with some probability, in order to produce two new individuals by exchanging parts of the parents' strings. The exchange is performed by cutting each individual at a specific position, producing two head and two tail segments. The tail segments are then swapped over to produce two new full-length individual strings. Figure 3 presents the crossover operation on an example of an 8-bit binary encoded string, partitioned after its 5th bit, resulting in two new 8-bit individuals.


Figure 3. GA operators: Crossover and mutation.

Mutation is introduced in order to prevent premature convergence to local optima by randomly sampling new points in the search space. Mutation is applied to each "child" individually after crossover. It randomly alters each individual with a (usually) small probability (e.g. 0.001). Figure 3 depicts the mutation operation on a binary 8-bit string where the 4th bit is mutated, resulting in a new individual.

The cache population will evolve over successive generations such that the fitness of the best and the average cached object individual in each cache generation improves towards the global optimum.
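The two operators of Figure 3 can be sketched on binary strings as follows (a minimal illustration; the cut position and the mutation probability are caller-chosen parameters, and the example strings are hypothetical):

```python
import random

def crossover(parent1, parent2, cut):
    """Single-point crossover: swap the tail segments after position `cut`."""
    return (parent1[:cut] + parent2[cut:],
            parent2[:cut] + parent1[cut:])

def mutate(individual, p_mutate, rng=random):
    """Flip each bit independently with probability p_mutate."""
    return "".join("10"[int(b)] if rng.random() < p_mutate else b
                   for b in individual)

child1, child2 = crossover("10110100", "00101110", 5)
print(child1, child2)                  # 10110110 00101100
print(mutate("10110100", 0.001))       # usually unchanged
```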

• The objective function: An objective (or fitness) function must be devised, based on the need to have a figure of merit proportional to the utility or ability of the encoded cache individual.

The following formula defines the fitness function F1(x), related to the staleness ratio, for a population x corresponding to a specific cache generation:

    F1(x) = Σ (i=1..N)  acti × sri × dfi        (2)

This objective function places emphasis on the object's dynamic frequency in relation to its staleness ratio. It is proposed in order to evaluate the role of both dynamic frequency and staleness ratio when the retrieval rate is not introduced. F1(x) considers these two parameters since they participate in the problem statement of Section 2.

Similarly, the following formula defines the fitness function F2(x), which relates to the retrieval rate of population x:

    F2(x) = Σ (i=1..N)  acti × dfi / rri        (3)

This objective function places emphasis on the object's dynamic frequency in relation to its retrieval rate. It is proposed in order to evaluate the role of both retrieval rate and dynamic frequency when the staleness ratio is not introduced. Again, F2(x) considers these two parameters since they participate in the problem statement of Section 2.

Finally, the following formula defines the fitness function F3(x), which considers staleness, access frequency and retrieval rate for population x:

    F3(x) = Σ (i=1..N)  acti × sri × dfi / rri        (4)

This objective function places emphasis on all of the parameters involved in the maximization problem statement (Section 2), in order to propose an objective function close to the problem statement and definition.

Therefore, the above three fitness functions have been introduced in the present research effort in order to consider the effect of staleness, access frequency and retrieval cost on the overall cache replacement process. The criteria for choosing these fitness functions were based on elaborating on the impact of either staleness or access frequency alone, before considering both of these factors in a more integrated objective function formula for the overall cache replacement.
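The three fitness functions of Eqs. (2)–(4) translate directly into code (a sketch; representing each cached object as an (acti, sri, dfi, rri) tuple is an assumption for illustration):

```python
def F1(pop):
    """Eq. (2): staleness ratio and dynamic frequency only."""
    return sum(act * sr * df for act, sr, df, rr in pop)

def F2(pop):
    """Eq. (3): dynamic frequency weighted by retrieval rate only."""
    return sum(act * df / rr for act, sr, df, rr in pop)

def F3(pop):
    """Eq. (4): all three parameters, matching Eq. (1)."""
    return sum(act * sr * df / rr for act, sr, df, rr in pop)

pop = [(1, 0.5, 0.25, 2.0), (0, 1.0, 1.0, 4.0)]   # hypothetical population
print(F1(pop), F2(pop), F3(pop))                  # 0.125 0.125 0.0625
```

Note that purged objects (acti = 0) contribute nothing to any of the three functions.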

• The GA cache replacement algorithm: The proposed GA cache replacement algorithm generates an initial population of cached objects. To make the search efficient, the initial population consists of cached objects characterized by being as non-stale and frequently accessed as possible. Then, the standard operators defined above mix and recombine the encoding strings of the initial population to form the offspring of the next generation. In this process of evolution, the fitter cached object individuals will create a larger number of offspring, and thus have a higher chance of surviving to subsequent generations. The GA caching evolves under a termination criterion, which is identified by the number of generations, i.e. the number of successive GA cycle runs. A pseudo-code version of the GA cache replacement approach follows:

    initialize()
    old_cache_pop <- initial cache population
    evaluate_fitness(old_cache_pop)
    generation <- 1
    while (generation <= maxgen) do
        par1 <- selection(popsize, fitness, old_cache_pop)
        par2 <- selection(popsize, fitness, old_cache_pop)
        crossover(par1, par2, old_cache_pop, new_cache_pop, p_cross)
        mutation(new_cache_pop, p_mutate)
        evaluate_fitness(new_cache_pop)
        statistical_report(new_cache_pop)
        old_cache_pop <- new_cache_pop
        generation <- generation + 1

In the above GA, maxgen corresponds to the maximum number of successive generation runs, popsize is the cache population size, fitness is the cache update factor (as described in the previous subsection), par1 and par2 are the parents chosen for the reform of each generation, and p_cross, p_mutate are the probabilities for the crossover and mutation operators, respectively. The old_cache_pop refers to the initial population in every GA cycle whereas the new_cache_pop is the resulting population of each GA run. The proposed GA model follows the simple GA proposed in [9].
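The selection() step is left abstract in the pseudo-code above; a common concrete choice, assumed here for illustration, is fitness-proportional ("roulette wheel") selection:

```python
import random

def selection(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = rng.uniform(0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if r <= acc:
            return individual
    return population[-1]          # guard against floating-point round-off

pop = ["10110100", "00101110", "11100011"]       # hypothetical encoded objects
par1 = selection(pop, [0.125, 0.125, 0.0625], random.Random(0))
print(par1 in pop)                               # True
```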

4. The evolutionary programming model

Evolutionary Programming (EP) is a technique similar to the GA, but it places emphasis on the behavioral linkage between cached objects chosen as "parents" and their offspring, instead of seeking to emulate specific operators such as crossover and mutation together. EP is applied to Web cache replacement since it has been proven quite useful in optimization problems and has been successfully applied to numerous problems from different domains. Furthermore, EP is introduced as an alternative evolutionary approach in order to be compared and commented on in relation to GA techniques. Figure 4 presents the overall EP evolution cycle. It is important to note that the EP approach does not use crossover as an operator for the evolution of the cache population. Furthermore, EP uses stochastic selection via a tournament. Each trial solution in the cache population faces competition against a preselected number of opponents and receives a "win" (i.e. remains in cache) if it is at least as good as its opponent in each encounter. Selection then eliminates those solutions with the fewest "wins". There are two important differences between the EP and the GA approach:

• there is no constraint on the cached objects' representation and encoding. In the GA there was a need for encoding the cache individuals, whereas in EP the representation follows from the problem.

Figure 4. The evolutionary programming process.


• the mutation operation simply changes aspects of the cache population according to a statistical distribution, and the severity of mutations is reduced as the global optimum is approached.

For EP there is an underlying assumption that an objective criterion can be characterized in terms of specific variables, and that there is an optimum solution (or multiple such optima) in terms of those variables. The specific variables related to EP cache replacement are determined by the parameters acti, sri, dfi, rri and si, as in the GA approach presented in the previous section. The basic EP cache replacement process is based on the following steps:

1. an initial cache population is chosen, based on random trial cache object population solutions. The number of solutions in a cache population is highly relevant to the speed of optimization, but there is no specific number of solutions known to be appropriate; several numbers can be examined.

2. each solution is replicated into a new cache population. Each of these populations is mutated according to a distribution of mutation types, which are judged on the basis of the functional change imposed on the cached objects considered as parents.

3. each offspring solution is assessed by computing its objective function, i.e. its fitness. Typically, a stochastic tournament is held to determine the S solutions to be retained for the population of solutions. There is no requirement that the cache population size remains constant.

The EP cache replacement technique is described by the following pseudo-code:

    generation <- 1
    initialize()
    old_cache_pop <- initial cache population
    evaluate_fitness(old_cache_pop)
    while (generation <= maxgen) do
        mutation(old_cache_pop, new_cache_pop, mutation_rate)
        evaluate_fitness(new_cache_pop)
        select_survive(new_cache_pop)
        statistical_report(new_cache_pop)
        old_cache_pop <- new_cache_pop
        generation <- generation + 1

The proposed EP cache replacement algorithm starts with an initial population of cached objects. Then, there is a perturbation (mutation operator) of the whole cache population, and the fitness of the new population is evaluated. The cache objects to "survive", i.e. remain in cache, are stochastically selected according to the defined objective (fitness) function. The fitness function is related to the most crucial parameters of the cache replacement process, as discussed in Section 3. Therefore the three functions F1(x), F2(x) and F3(x), formulated in Eqs. (2)–(4) respectively, will be applied for the fitness function estimation. The most crucial factor in EP is the selection of the mutation rate. The standard deviation of the mutation


applied in the present paper is related to the staleness rate of each individual, and the same mutation rate is applied equivalently to each parameter in a given solution.
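The tournament-based survival step of the EP cycle can be sketched as follows (the win-counting against q random opponents follows the description in this section; the toy solution representation, the value of q, and retaining the top half are illustrative assumptions):

```python
import random

def ep_select(population, fitness, q=3, rng=random):
    """Score each solution by its wins against q random opponents
    (a win = at least as fit as the opponent), then keep the top half."""
    wins = []
    for sol in population:
        opponents = [rng.choice(population) for _ in range(q)]
        wins.append(sum(fitness(sol) >= fitness(opp) for opp in opponents))
    ranked = sorted(range(len(population)), key=lambda i: wins[i], reverse=True)
    return [population[i] for i in ranked[:len(population) // 2]]

pop = [0.9, 0.1, 0.7, 0.3]                       # toy solutions; fitness = value
survivors = ep_select(pop, lambda s: s, q=3, rng=random.Random(1))
print(len(survivors))                            # 2
```

The fittest solution always scores the maximum number of wins, so it is guaranteed to survive; weaker solutions survive only stochastically, depending on the opponents drawn.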

5. The simulation model

A trace-driven proxy cache simulator was developed, based on a real proxy cache server implementation. The simulator was appropriately configured in order to employ the proposed evolutionary caching techniques. The simulator considers a main proxy cache server that services requests posed by a fixed number of clients. The main simulator modules of both server and client are depicted in Figure 5. Aspects of resource management are simplified in both client and server in order to support timing and synchronization. The most important modules support the cache replacement policies as they relate to request servicing, timing synchronization and performance evaluation:

• The Cache Server: consists of the Timing module, to model the timing devoted to request processing; the Caching module, to manage cache replacement and validation; and the Service module, to synchronize and coordinate the client request servicing. The cache replacement module supports the proposed evolutionary algorithms as well as the conventional LRU replacement conducted in most currently available proxy servers.

• The Client: is composed of the Timing module to simulate the timing requirements and the Request Manager, which formalizes the requests based on real workload traces. Each client submits requests to the proxy cache server; each request refers to a specific Web object and is appropriately structured.

The proposed simulation model is based on a real Web cache server implementation, namely the Squid proxy cache server. The simulator's modules are appropriately configured such that the performance metrics (cache hit ratio, byte hit ratio) can be estimated. Details about the most popular Web cache servers and the Squid Web cache, on which the simulation is based, are discussed next.

Figure 5. Simulation model overview.


5.1. Caching on the web

Nowadays, a variety of cache servers are available for World-Wide Web caching, and most recent Web servers include caching modules (e.g. Apache, Spinner, Jigsaw, Purveyor).

• The CERN proxy server has been widely adopted since a large infrastructure of CERN web servers was already installed. A heuristic known as time-to-live (TTL) was used to manage object staleness, and a TTL timing frame, based on request and expiry dates, accompanies each document in the cache [10, 21].

• The Netscape Proxy Server has been available commercially since 1995 and supports management of object staleness by TTL frames based on an object's age when it is cached. This server can also pre-emptively fetch groups of linked web pages according to a schedule, and has a variety of filtering options for use as a firewall proxy.

• The Harvest cache software was developed aiming at more effective use of the information available on the Internet, by sharing the load of information gathering and publishing between many servers. The newest Harvest developments are available commercially, whereas a team from the National Laboratory for Applied Network Research (NLANR) has continued to provide a free version under the name Squid [18]. Harvest and Squid have been widely adopted by many institutions and research organizations as a new proposal for efficient caching.

The Squid proxy cache server software has gained a lot of attention lately, since it is used on an experimental network of seven major co-operating servers across the U.S.A. This network is established under a project framework by NLANR and supports links to collaborating cache projects in other countries. Squid has evolved through additional features for object refreshment and purging, memory usage and hierarchical caching.

The following algorithm is used by Squid to determine whether an object is stale or fresh.

    if Age > Client_max_age then
        Return "STALE"
    else if Age <= min_age then
        Return "FRESH"
    else if (expires) then          // expires field exists
        if (expires <= NOW) then
            Return "STALE"
        else
            Return "FRESH"
    else if Age > max_age then
        Return "STALE"
    else if lm_factor < Percent then
        Return "FRESH"
    else
        Return "STALE"

Figure 6. Squid proxy cache structure.

The refresh parameters are identified as min_age, Percent and max_age. Age is how much the object has aged since it was retrieved, whereas lm_factor is the ratio of this age over how old the object was when it was retrieved. expires is an optional field used to mark an object's expiration date. Client_max_age is the (optional) maximum age the client will accept, as taken from the HTTP Cache-Control request header.
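The freshness test above can be rendered compactly in Python. This is a hypothetical sketch whose parameter names mirror the text, not Squid's actual source code:

```python
# Sketch of the Squid-style freshness check; returns True for FRESH.
def is_fresh(age, min_age, max_age, percent, lm_factor,
             expires=None, now=0, client_max_age=None):
    if client_max_age is not None and age > client_max_age:
        return False                 # STALE: older than the client allows
    if age <= min_age:
        return True                  # FRESH: within the minimum age
    if expires is not None:          # an explicit Expires header decides
        return expires > now
    if age > max_age:
        return False                 # STALE: past the configured maximum
    return lm_factor < percent       # FRESH while young relative to lm age

# Example: object aged 50s, no Expires header, min_age of 60s.
print(is_fresh(age=50, min_age=60, max_age=86400, percent=0.2,
               lm_factor=50 / 1000))  # True (age <= min_age)
```

Note the precedence the algorithm encodes: client constraints first, then the minimum age, then an explicit expiration date, and the last-modified heuristic only as a fallback.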

Figure 6 represents the organization of the Squid cache hierarchy structure, consisting of a two-level decomposition. Assuming approximately 256 objects per directory, there is a potential total of 1,048,576 (= 16 × 256 × 256) cached objects. Squid switched from the TTL-based expiration model to a Refresh-Rate model. Objects are no longer purged from the cache when they expire. Instead of assigning TTLs when the object enters the cache, a check of freshness requirements is now performed when objects are requested. Squid keeps the size of the disk cache relatively smooth, since objects are removed at the same rate they are added, and object purging is performed by the implementation of a Least-Recently-Used (LRU) replacement algorithm. Objects with large LRU values are forced to be removed prior to objects with smaller LRU "ages". Squid cache storage is implemented as a hash table with some number of hash "buckets", and store buckets are randomized so that the same buckets are not scanned at the same time of day [18].
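The capacity arithmetic for the two-level layout of Figure 6 can be checked with a small sketch. The mapping function below is an illustrative assumption (not Squid's actual scheme): 16 first-level directories, 256 second-level directories each, and roughly 256 objects per leaf directory.

```python
def object_path(file_number):
    """Map a cache file number to a hypothetical two-level directory path."""
    l1 = (file_number // (256 * 256)) % 16   # 16 first-level dirs: 00..0F
    l2 = (file_number // 256) % 256          # 256 second-level dirs: 00..FF
    return f"{l1:02X}/{l2:02X}/{file_number:08X}"

total = 16 * 256 * 256
print(total)                  # 1048576 potential cached objects
print(object_path(0))         # 00/00/00000000
print(object_path(total - 1)) # 0F/FF/000FFFFF
```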

5.2. Workload traces

The workload needs to be specified by identifying the parameters that are needed to support the cache replacement policies. These parameters are related to the fields of the Squid proxy access log files. Squid (in its default configuration) maintains four log files:

• logs/access.log: requests issued to the proxy server, indicating how many users use the cache, how much each requested, etc.

• logs/cache.log: operational information from Squid, such as errors, startup messages, etc.


Table 2. store.log: Fields of each individual object and related parameters.

  Store log field   Description                               Model parameter
  ---------------   ---------------------------------------   ---------------
  time              time this entry was logged                li
  action            RELEASE, SWAPIN, or SWAPOUT               acti
                      RELEASE: object removed from cache
                      SWAPOUT: object saved to disk
                      SWAPIN:  object swapped into memory
  status            HTTP reply code
  date-hdr          HTTP Date: reply header
  lastmod           HTTP Last-Modified: reply header          lmi
  expires           HTTP Expires: reply header
  type              HTTP Content-Type: reply header
  exp-len           HTTP Content-Length: reply header
  real-len          # bytes of content actually read          si
  method            HTTP request method
  key               cache key; often simply the URL           idi

• logs/store.log: information on what is happening with the cache disk-wise; it shows whenever an object is added to or removed from disk.

• cache/log: contains the mapping of objects to their disk location.

The workload specification is based on the parameters introduced in Section 2. Table 2 presents the most important caching parameters as they relate to specific fields of Squid's store.log file. The parameters that govern the cache replacement process, namely acti, sri, dfi, rri and si, are computed from the parameters related to the Squid access log files.
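Extracting the Table 2 parameters from a store.log-style line can be sketched as below. The whitespace-separated field layout assumed here is an illustration for the fields named in Table 2, not the exact Squid format.

```python
def parse_store_line(line):
    """Parse a simplified store.log-style line into the model parameters."""
    parts = line.split()
    # Assumed layout:
    # time action status date-hdr lastmod expires type exp-len real-len method key
    return {
        "li":   float(parts[0]),   # time the entry was logged
        "acti": parts[1],          # RELEASE / SWAPIN / SWAPOUT
        "lmi":  int(parts[4]),     # Last-Modified timestamp
        "si":   int(parts[8]),     # bytes of content actually read
        "idi":  parts[10],         # cache key (often the URL)
    }

sample = ("915148800.123 SWAPOUT 200 915148700 915140000 915200000 "
          "text/html 5120 5120 GET http://example.org/index.html")
rec = parse_store_line(sample)
print(rec["acti"], rec["si"])  # SWAPOUT 5120
```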

6. Experimentation—Results

The Aristotle University has installed Squid proxy caches (main and sibling caches) and supports a Squid mirror site. The present paper uses trace information provided by this cache installation for experimentation. The simulator was tested with the Squid cache traces and their corresponding log files. The traces refer to the period from January to April 1999, covering a total of 80,000,000 requests of more than 1,000 GB of content. Due to the extremely large access logs created by the proxy, a compact log was created to support an effective caching simulation. The reduced simulation log was constructed from the original Squid log fields needed for the overall simulation runs. All of the proposed techniques have been simulated and tested under these workload traces. The notations GA F1, GA F2, GA F3 refer to the GA cache replacement policies under objective function F1(x), F2(x) or F3(x), respectively. The crossover and mutation probability values are pcrossover = 0.6 and pmutation = 0.001, since these values are in the range of representative trial sets suggested for many GA optimizations [9, 16]. Similarly, EP F1, EP F2, EP F3 refer to the corresponding EP cache replacement policies. The initial population for the GA and EP schemes is the population produced by the Squid access log files. Furthermore, the typical LRU cache replacement policy, applied in most proxy caches such as Squid, has been simulated in order to serve as a basis for comparisons and discussion.
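One GA generation with the probabilities quoted above can be sketched as follows. The encoding (a bit-string marking which cached objects are kept), the fitness function and the roulette-wheel selection are illustrative assumptions; only pcrossover = 0.6 and pmutation = 0.001 come from the text.

```python
import random

P_CROSSOVER, P_MUTATION = 0.6, 0.001   # values used in the paper

def ga_generation(pop, fitness, rng):
    # Fitness-proportionate (roulette-wheel) parent selection.
    weights = [fitness(ind) for ind in pop]
    parents = rng.choices(pop, weights=weights, k=len(pop))
    nxt = []
    for a, b in zip(parents[::2], parents[1::2]):
        if rng.random() < P_CROSSOVER:          # single-point crossover
            cut = rng.randrange(1, len(a))
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        nxt += [a, b]
    # Bit-flip mutation with a small per-bit probability.
    return [[bit ^ (rng.random() < P_MUTATION) for bit in ind] for ind in nxt]

# Toy setup: 8 objects; fitness counts kept objects that are "fresh".
fresh = [1, 0, 1, 1, 0, 1, 0, 1]
fit = lambda ind: 1 + sum(b * f for b, f in zip(ind, fresh))
rng = random.Random(1)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
pop = ga_generation(pop, fit, rng)
print(len(pop), len(pop[0]))  # 10 8
```

Over repeated generations, selection pressure drives the population toward bit-strings that keep the fitter (e.g. fresher) objects in cache, which is the replacement decision the GA policies encode.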

The performance metrics used in this simulation model focus on the cached objects' cache-hit ratio and byte-hit ratio:

• Cache hit ratio: represents the percentage of all requests being serviced by a cached copy of the requested object, instead of contacting the original object's server.

• Byte hit ratio: represents the percentage of all data transferred from the cache, i.e. corresponds to the ratio of the size of objects retrieved from the cache server. The byte hit ratio provides an indication of the network bandwidth savings.

The above metrics are considered the most typical ones for capturing and analyzing the behavior of cache replacement policies (e.g. [1, 2, 4]).
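Both metrics are straightforward to compute from a request trace; a minimal sketch, where the (size, served-from-cache) trace format is a simplifying assumption:

```python
def hit_ratios(trace):
    """trace: list of (size_in_bytes, served_from_cache) tuples."""
    hits = sum(1 for _, hit in trace if hit)
    hit_bytes = sum(size for size, hit in trace if hit)
    total_bytes = sum(size for size, _ in trace)
    return hits / len(trace), hit_bytes / total_bytes

trace = [(1000, True), (500, False), (2000, True), (4000, False)]
cache_hit, byte_hit = hit_ratios(trace)
print(cache_hit, byte_hit)  # 0.5 0.4
```

The example also shows why the two ratios can diverge: half the requests hit the cache, but because the misses are large objects, only 40% of the bytes are served from it.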

Figures 7 and 8 depict the cache hit ratio for the three different GA policies with respect to the number of evolved cache generations and the cache size, respectively. More specifically, Figure 7 presents the cache hit ratio for a cache population reproduced over 20, 30, . . . , 100 generations, and Figure 8 presents the cache hit ratio for a cache size varying over 4, 8, 16, . . . , 256 GBytes. The cache hit ratio under the GA F1 policy outperforms the corresponding metric of the GA F2 policy, whereas both are worse than the GA F3 approach. This was expected due to the more integrated nature of objective function F3(x), as expressed in Eq. (4). The improvement in the cache hit ratio due to GA F3 is almost 15% when compared to GA F2, and about 10% when compared to GA F1. The three GA approaches show similar cache hit ratio curves as the cache size increases. It should be noted that the cache hit ratios seem to reach a peak and remain there as the cache size increases. This is explained by the fact that the larger the cache size, the fewer replacement actions need to be taken, since there is more space to store the cached objects and the cache server can "afford" to accommodate them with fewer replacement actions.

Figure 7. Cache hit; GA caching/generations.

Figure 8. Cache hit; GA caching/cache size.

Figures 9 and 10 depict the byte hit ratio for the three different GA policies with respect to the number of evolved cache generations and the cache size, respectively. It is interesting to note that both hit ratios improve as the number of generations increases. The byte hit ratios follow a similar skew to the corresponding cache hit ratios, but they never get as high as the cache hit ratios, and the byte hit ratios have a smoother curve as the cache size increases. Furthermore, it is quite interesting to note that GA F2 results in better byte hit ratios than the corresponding GA F1, while GA F3 remains the best approach with respect to byte hit ratios as well. The different behavior of GA F1 and GA F2 in relation to cache hit and byte hit ratios is explained by the different approach of each of these objective functions. More specifically, F1(x) is based on the cache staleness ratio, whereas F2(x) concerns the retrieval rate, as formulated in Eqs. (2) and (3) respectively. Therefore, it was expected that GA F1 would result in a better cache hit but worse byte hit than the corresponding ratios of the GA F2 technique.

Figures 11 and 12 represent the cache hit ratio for the EP-based techniques with respect to cache generations and cache size, respectively. The byte hit ratios of the EP-based techniques, with respect to either cache generations or cache size, are shown in Figures 13 and 14. The EP cache replacement behaves similarly to the GA cache replacement. EP F1 is better than EP F2 with respect to cache hit ratios, while EP F2 results in better byte hit ratios. EP F3 has always shown the best ratios in both cache hit and byte hit. An important remark is that the EP approach is less effective than the GA approach. For example, the GA cache hit ratios could reach a value of 0.8, whereas the EP cache hit ratios have an upper bound of 0.7 under the same trial set of 100 generations, when applying GA F3 and EP F3. GA results in better byte hit ratios than EP, but the difference between the two techniques for this metric is quite low, in the range of 1% to 2%.

Figure 9. Byte hit; GA caching/generations.

Figure 10. Byte hit; GA caching/cache size.


Figure 11. Cache hit; EP caching/generations.

Figure 12. Cache hit; EP caching/cache size.

The experimentation considered LRU as the conventional cache replacement strategy to be compared and commented on with respect to the proposed evolutionary approaches. The GA F3 and EP F3 techniques were used for the cache replacement strategy comparisons, since they have been proven the most effective of all the proposed evolutionary policies. Several workload traces were used for the experimentation. Figures 15 and 16 result from Squid workload traces of the period October to December 1998, whereas Figures 17 and 18 result from Squid workload traces of the period January to March 1999. Figures 15 and 17 represent the cache hit ratios of their trace periods, over an increasing cache size of 4, 8, 16, . . . , 256 GBytes. Both of the evolutionary approaches have been proven to significantly outperform the typical LRU with respect to cache hit rates. More specifically, the improvement in cache hit ratios between GA F3 and LRU ranges between 25% and almost 40%, whereas the corresponding difference in cache hit ratios between EP F3 and LRU is in the range of 20% to almost 30%. Similarly, Figures 16 and 18 represent the byte hit ratios of their trace periods, over a cache of 4, 8, 16, . . . , 256 GBytes. As pointed out in the experimentation above, the difference between GA and EP regarding byte hit ratios is not as large as in the cache hit ratios. This is also emphasized in Figures 16 and 18, where the GA and EP curves remain close to each other. The difference between LRU and the proposed evolutionary algorithms in relation to the byte hit ratios is of the same range as with the cache hit ratios discussed above. In conclusion, both EP and GA have been shown to improve the cache replacement process with respect to cache hit and byte hit ratios.

Figure 13. Byte hit; EP caching/generations.

Figure 14. Byte hit; EP caching/cache size.

Figure 15. Cache hit; replacement policies.

Figure 16. Byte hit; replacement policies.

Figure 17. Cache hit; replacement policies.

Figure 18. Byte hit; replacement policies.


7. Conclusions—Future work

This paper has presented a study of applying the evolutionary computation idea to Web-based proxy cache replacement. Genetic algorithms and evolutionary programming were the two evolutionary mechanisms applied to cache replacement. The proposed evolutionary approaches were guided by appropriate objective functions in relation to object staleness, retrieval and access costs. Trace-driven simulation was used in order to evaluate the performance of the proposed cache replacement techniques, and the simulation model was based on the Squid proxy cache server. The experimentation indicated that all of the proposed evolutionary approaches outperform the conventional Least-Recently-Used (LRU) policy adopted by most currently available proxies. Results have shown that, provided the evolution process is guided by an objective function integrating both the object's staleness and the retrieval rate, the cache hit and byte hit ratios are significantly improved.

Further research should experiment with the present scheme under different evolutionary and computational schemes, such as evolution strategies, simulated annealing and threshold acceptance. These methodologies could be introduced in the proposed cache replacement model in order to study the impact of such innovative replacement policies on the Web cache content.

Acknowledgments

The author would like to thank Panayotis Junakis (system administrator) and Savvas Anastasiades (technical staff) of the Network Operation Center at the Aristotle University, for providing access to the Squid cache traces and trace log files.

References

1. C. Aggarwal, J. Wolf, and P.S. Yu, "Caching on the World Wide Web," IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 1, pp. 94–107, 1999.

2. M. Arlitt, R. Friedrich, and T. Jin, "Performance evaluation of web proxy cache replacement policies," Performance Evaluation Journal, vol. 39, no. 1–4, pp. 149–164, 2000.

3. M. Baentsch et al., "Enhancing the Web's infrastructure: From caching to replication," IEEE Internet Computing, vol. 1, no. 2, pp. 18–27, 1997.

4. A. Belloum and L.O. Hertzberger, "Document replacement policies dedicated to Web caching," in Proceedings of the ISIC/CIRA/ISAS'98 Conference, Maryland, USA, Sept. 1998.

5. M.A. Blaze, Caching in Large-Scale Distributed File Systems, Ph.D. Thesis, Princeton University, Jan. 1993.

6. P. Cao and S. Irani, "Cost-aware WWW proxy caching algorithms," in Proceedings of the USENIX Symposium on Internet Technologies and Systems, USITS'97, Monterey, California, Dec. 1997.

7. A. Chankhunthod, P. Danzig, and C. Neerdaels, "A hierarchical internet object cache," in Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, Jan. 1996, pp. 153–163.

8. B. Dengiz, F. Altiparmak, and A.E. Smith, "Local search genetic algorithm for optimization of highly reliable communications networks," IEEE Transactions on Evolutionary Computation, vol. 1, no. 3, pp. 179–188, 1997.

9. D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley: Reading, MA, 1989.

10. J. Gwertzman and M. Seltzer, "World Wide Web cache consistency," in Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, Jan. 1996, pp. 141–151.

11. A.S. Heddaya, DynaCache: Weaving Caching into the Internet, InfoLibria, 1998.

12. A. Iyengar and J. Challenger, "Improving Web server performance by caching dynamic data," in Proceedings of the USENIX Symposium on Internet Technologies and Systems, USITS'97, Monterey, California, Dec. 1997.

13. J. Jing, A. Elmagarmid, A. Helal, and R. Alonso, "Bit-sequences: An adaptive cache invalidation method in mobile client/server environments," Mobile Networks and Applications, vol. 2, pp. 115–127, 1997.

14. K.T. Ko, K.S. Tang, C.Y. Chan, K.F. Man, and S. Kwong, "Using genetic algorithms to design mesh networks," IEEE Computer, vol. 30, no. 8, pp. 56–61, 1997.

15. A. Kumar, R.M. Pathak, Y.P. Gupta, and H.R. Parsaei, "A genetic algorithm for distributed system topology design," Computers & Industrial Engineering, vol. 28, pp. 659–670, 1995.

16. Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd ed., Springer-Verlag: Berlin, 1996.

17. O. Pearson, The Squid cache software, Squid Users Guide, http://www.auth.gr/SquidUsers/, 1998.

18. Squid, Squid internet object cache, http://www.auth.gr/Squid/, 1998.

19. A. Vakali, "A genetic algorithm scheme for Web replication and caching," in Proceedings of the 3rd IMACS/IEEE International Conference on Circuits, Systems, Communications and Computers, CSCC'99, World Scientific and Engineering Society Press, Athens, Greece, July 1999.

20. A. Vakali, "A Web-based evolutionary model for internet data caching," in Proceedings of the 2nd International Workshop on Network-Based Information Systems, NBIS'99, IEEE Computer Society Press, Florence, Italy, Aug. 1999.

21. D. Wessels, "Intelligent caching for World-Wide Web objects," in Proceedings of the INET'95 Conference, Jan. 1995.