A Dual Framework and Algorithms for Targeted Online Data Delivery

Haggai Roitman, Avigdor Gal, Senior Member, IEEE, and Louiqa Raschid

Abstract—A variety of emerging online data delivery applications challenge existing techniques for delivering data to human users, applications, or middleware that access data from multiple autonomous servers. In this paper, we develop a framework for formalizing and comparing pull-based solutions and present dual optimization approaches. The first approach, most commonly used today, maximizes user utility under the strict setting of meeting a priori constraints on the usage of system resources. We present an alternative and more flexible approach that maximizes user utility by satisfying all users while minimizing the usage of system resources. We discuss the benefits of this latter approach and develop an adaptive monitoring solution, Satisfy User Profiles (SUP). Through formal analysis, we identify sufficient optimality conditions for SUP. Using real (RSS feeds) and synthetic traces, we empirically analyze the behavior of SUP under varying conditions. Our experiments show that SUP achieves a high degree of user utility when its estimations closely track the real event stream, and that it has the potential to save a significant amount of system resources. We further show that SUP can exploit feedback to improve user utility with only a moderate increase in resource utilization.

Index Terms—Distributed databases, online information services, client/server multitier systems, online data delivery.


1 INTRODUCTION

The diversity of data sources and Web services currently available on the Internet and the computational Grid, as well as the diversity of clients and application requirements, poses significant infrastructure challenges. In this paper, we address the task of targeted data delivery. Users may have specific requirements for data delivery, e.g., how frequently or under what conditions they wish to be alerted about update events or update values, or their tolerance to delays or stale information. The challenge is to deliver relevant data to a client at the desired time, while conserving system resources. We consider a number of scenarios including RSS news feeds, stock prices and auctions on the commercial Internet, and scientific data sets and Grid computational resources. We consider an architecture of a proxy server that manages a set of user profiles specified with respect to a set of remote autonomous servers.

Push, pull, and hybrid protocols have been used to solve a variety of data delivery problems. Push-based technologies include BlackBerry [16] and JMS messaging, push-based policies for static Web content (e.g., [20]), and push-based consistency in the context of caching dynamic Web content (e.g., [30]). Push is typically not scalable, and reaching a large number of potentially transient clients is expensive. In some cases, pushing information may overwhelm the client with unsolicited information. Pull-based freshness policies have, therefore, been proposed in many contexts such as Web caching (e.g., [15]) and synchronizing collections of objects, e.g., Web crawlers (e.g., [4]). Several hybrid push-pull solutions have also been presented (e.g., [12]). We focus on pull-based resource monitoring and satisfying user profiles.

As an example, consider the setting of RSS feeds, which are supported by a pull-based protocol. Currently, the burden of deciding when to probe an RSS resource lies with the client. Although RSS providers use a Time-To-Live (TTL) measure to suggest a probing schedule, a study on Web feeds [21] shows that 55 percent of Web feeds are updated at a regular hourly rate. Further, due to heavy workloads that may be imposed by client probes (especially on popular Web feed providers such as CNN), about 80 percent of the feeds have an average size smaller than 10 KB, suggesting that items are promptly removed from the feeds. These statistics on refresh frequency and volatility illustrate the challenge faced by a proxy in satisfying user needs. As the number of users and servers grows, service personalization through targeted data delivery by a proxy can serve as a solution for better managing system resources. In addition, the use of profiles could lower the load on RSS servers by accessing them only to satisfy a user profile.

Much of the existing research in pull-based data delivery (e.g., [7], [24]) casts the problem of data delivery as follows: Given some set of limited system resources, maximize the utility of a set of user profiles. We refer to this problem as OptMon1. We consider the following two examples: a Grid performance monitor tracks computational resources and notifies users of changes in system load and availability. Excessive probing of these machines may increase their load and hurt their performance. As another example, consider a data source that charges users for access. Clearly, minimizing the number of probes to such a source is important to keep probing costs low.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011 5

. H. Roitman is with IBM Haifa Research Lab, Haifa, Israel. E-mail: [email protected].

. A. Gal is with the Faculty of Industrial Engineering and Management, Technion—Israel Institute of Technology, Technion City, Haifa 32000, Israel. E-mail: [email protected].

. L. Raschid is with the University of Maryland, College Park, MD 20742. E-mail: [email protected].

Manuscript received 17 Sept. 2008; revised 5 May 2009; accepted 12 Sept. 2009; published online 11 Jan. 2010. Recommended for acceptance by J. Liu. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2008-09-0488. Digital Object Identifier no. 10.1109/TKDE.2010.15.

1041-4347/11/$26.00 © 2011 IEEE Published by the IEEE Computer Society


A solution to OptMon1 is accompanied by the need to meet rigid a priori bounds on system resource constraints. This may not be adequate in environments where the demand for monitoring changes dynamically. Examples of changing needs in the literature include help desks and reservation systems. A rigid a priori setting may also have the unintended consequence of forcing excessive resource consumption even when there is no additional utility to the user.

To address some of the limitations of OptMon1, we propose a framework where we consider the dual of the previous optimization problem: Given some set of user profiles, minimize the consumption of system resources while satisfying all user profiles. We label this problem OptMon2; it will be formally defined in Section 2. With this class of problems, user needs are set as the constraining factor of the problem (and thus need to be satisfied), while resource consumption is dynamic and changes with needs. We present an optimal algorithm in the OptMon2 class, namely, Satisfy User Profiles (SUP). SUP is simple yet powerful in its ability to generate optimal scheduling of pull requests. SUP is an online algorithm; at each time point, it can receive additional requests for resource monitoring. Through formal analysis, we identify sufficient conditions for SUP to be optimal given a set of updates to resources; under those conditions, it satisfies all client needs with minimal resource consumption. We also show the conditions under which SUP can optimally solve OptMon1 as well.

SUP depends on an accurate model of when updates occur to perform correctly. In practice, however, such estimations suffer from two problems. First, the underlying update model is stochastic in nature, and therefore, updates deviate from the expected update times. Second, it is possible that the underlying update model is (temporarily or permanently) incorrect, and the real data stream behaves differently than expected. To accommodate changes to source behavior, compensating for stochastic behavior, correlations, and bursts, SUP exploits feedback from probes and can adapt the probing schedule dynamically to improve scheduling decisions. We also present SUP(λ), which addresses the second problem and can locally apply modifications to the update model parameters. Both SUP and SUP(λ) are shown empirically to work well under stochastic variations.

We present an extensive evaluation of solutions to the two monitoring problems. For our experimental comparison of OptMon1, we consider the WIC algorithm [24], which provides the best solution in the literature. For OptMon2, we consider the ubiquitous TTL algorithm [15]. We use real traces from an RSS server and synthetic data, and several example profiles. Our experiments show that we can achieve a high degree of satisfaction of user utility when the estimations of SUP closely estimate the real event stream, and can save a significant amount of system resources compared to solutions that have to meet strict a priori allocations of system resources. We further show that feedback improves the user utility of SUP(λ) with only a moderate increase in resource utilization.

The rest of the paper is organized as follows: Section 2 provides a description of the dual framework for targeted data delivery. We next present our model for targeted data delivery in Section 3. Sections 4-6 introduce SUP, an optimal dynamic algorithm for solving an OptMon2 problem, discuss its properties, and provide a heuristic variation, SUP(λ), that locally applies modifications to the update model parameters. We present our empirical analysis in Section 7. We conclude with a discussion of related work (Section 8) and conclusions (Section 9).

2 DUAL FRAMEWORK FOR TARGETED DATA DELIVERY

In what follows, let R = {r1, r2, ..., rn} be a set of resources; let T be an epoch; and let {T1, T2, ..., TN} be a set of equidistant chronons1 in T. A schedule S = {si,j}, i = 1...n, j = 1...N, is a set of binary decision variables, where si,j is set to 1 if resource ri is probed at time Tj, and 0 otherwise. For example, Fig. 1 illustrates a possible schedule S as a matrix with binary values, where rows represent resources and columns represent chronons. Thus, for example, at chronon T3, the illustrated schedule assigns probes to resources r2 and r4. We further denote by S the set of all possible schedules. Next, we define the dual approaches for targeted data delivery, namely, OptMon1 and OptMon2.
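For concreteness, a schedule can be encoded directly as such a binary matrix. The sketch below (Python; the matrix values are illustrative, chosen so that the lookup at chronon T3 matches the r2/r4 example above) shows the encoding and a per-chronon lookup:

```python
# Sketch of a schedule S as an n x N binary matrix (rows = resources,
# columns = chronons). The concrete 0/1 values are illustrative only.
n, N = 4, 5  # 4 resources r1..r4, 5 chronons T1..T5 (assumed sizes)

# S[i][j] = 1 iff resource r_{i+1} is probed at chronon T_{j+1}
S = [
    [0, 1, 0, 0, 1],  # r1
    [1, 0, 1, 0, 0],  # r2
    [0, 0, 0, 1, 0],  # r3
    [0, 0, 1, 0, 1],  # r4
]

def probed_at(S, j):
    """Return the (1-based) indices of resources probed at chronon T_j."""
    return [i + 1 for i, row in enumerate(S) if row[j - 1] == 1]

print(probed_at(S, 3))  # → [2, 4]: resources r2 and r4 at chronon T3
```

The binary-matrix view also makes the resource/chronon duality of the two constraint types discussed next easy to state: one bounds column sums, the other bounds row (or total) sums.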

2.1 OptMon1

OptMon1 can be roughly described as the following problem:

maximize   user utility
s.t.       satisfying system constraints.   (1)

The OptMon1 formulation assumes that system constraints are hard constraints whose assignment is, in general, a priori independent of the specific user utility maximization task. For example, in [24], [28], OptMon1 involves a system resource constraint on the maximum number of probes per chronon over all resources. In [24], this constraint represents the number of monitoring tasks that a Web monitoring system can allocate per chronon for the task of maximizing the utility gained from capturing updates to Web pages (see Section 7 for a more technical description of this algorithm). In [28], the same constraint represents the number of available crawling tasks for maximizing the freshness of Web repositories. Eckstein et al. [13] present a "politeness" constraint, which sets an upper bound on the total number of probes a proxy client is allowed to have for


Fig. 1. Examples of a schedule and system constraints.

1. A chronon is an indivisible unit of time. Our model is not restricted to equidistant chronons, yet such a representation avoids excessive use of notation.


the whole epoch. These two constraint types are illustrated in Fig. 1. The vertical oval represents the per-chronon constraint and the horizontal oval represents the politeness constraint. Both constraints are violated in the example given in Fig. 1.
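Both constraint types can be checked mechanically on a schedule matrix. A minimal sketch (the schedule values and bound values below are assumptions for illustration, not taken from Fig. 1):

```python
def violates_per_chronon(S, max_probes_per_chronon):
    """Per-chronon constraint: no column (chronon) may exceed the bound."""
    N = len(S[0])
    return any(sum(row[j] for row in S) > max_probes_per_chronon
               for j in range(N))

def violates_politeness(S, max_total_probes):
    """Politeness constraint [13]: total probes over the epoch are bounded."""
    return sum(sum(row) for row in S) > max_total_probes

# Illustrative 4-resource, 5-chronon schedule
S = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
]
print(violates_per_chronon(S, 1))  # → True: two chronons receive 2 probes
print(violates_politeness(S, 6))   # → True: 7 probes in total
```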

The benefits of OptMon1 are apparent whenever there are hard system constraints on resources, e.g., limited bandwidth for mobile users. In such a setting, OptMon1 can maximize user utility. On the downside, the OptMon1 formulation has two main limitations. First, with a diversity of server and client profiles, we expect that there will be periods of varying intensity, both in the updates at the server(s) and in the probes needed to satisfy client profiles. The second limitation is the rigidity of OptMon1 algorithms with respect to system resource allocation. It is generally not known a priori how many times we need to probe sources. An estimate that is too low will fail to satisfy the user profile, while an estimate that is too high may result in excessive and wasteful probes (e.g., as is the case with current RSS applications). Solutions to OptMon1 have not dynamically attempted to reduce resource consumption, even when doing so would not negatively impact client utility. For example, in a solution to OptMon1 [24], once the upper bound on bandwidth has been set (given in terms of how many data sources can be probed in parallel per chronon), bandwidth can no longer be adjusted and user needs may not be fully met. Moreover, while OptMon1 may be allocated additional system resources over time, this by itself cannot guarantee an efficient utilization of system resources that could improve the gain in user utility.

2.2 OptMon2

We propose a dual formulation, OptMon2, which reverses the roles of user utility and system constraints, setting the fulfillment of user needs as the hard constraint. OptMon2 assumes that the system resources consumed to satisfy user profiles should be determined by the specific profiles and the environment, e.g., the model of updates, and does not assume an a priori limitation of system resources. OptMon2 can be stated as the following general formulation:

minimize   system resource usage
s.t.       satisfying user profiles.   (2)

2.3 Comparison of OptMon1 and OptMon2

The dual problems are inherently different. Therefore, no solution for one problem can dominate a solution to the other for all possible problem instances. Each solution fits a different class of applications. For example, a crawler may have dedicated resources that must be consumed or will go to waste; OptMon1 fits this scenario. OptMon2 best suits scenarios in which system constraints are "soft" (e.g., more bandwidth can be added, or more disk space can be bought) and where the consumption of these system resources depends heavily on user specifications. For example, a proxy serving many clients may procure resources on demand; here, OptMon2 works better.

Fig. 2 illustrates the benefit of OptMon2. In the monitoring schedule, each vertical bar represents the number of probes needed in a chronon to fully satisfy client needs. The data represented in this figure are taken from one of the traces we use for our experiments (see Section 7) and are shown here for illustration purposes only. The two horizontal lines represent a fixed allocation of probes, one of a single resource per chronon and the other of three resources per chronon. For each such allocation, we compute the number of resources missed due to insufficient probes and the number of wasted probes. For a single resource per chronon, 95 resources are missed and 29 probes are wasted. For three resources per chronon, 24 resources are missed and 158 probes are wasted. This confirms that a flexible, needs-driven resource allocation can assist in efficient resource consumption while catering better to client needs.
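The missed/wasted trade-off of a fixed allocation can be computed directly from the per-chronon demand. A sketch of that computation, using a made-up demand trace (not the trace behind Fig. 2, so the numbers differ from those quoted above):

```python
def missed_and_wasted(demand, allocation):
    """demand[j]  = probes needed at chronon j to fully satisfy clients;
    allocation   = fixed number of probes available per chronon.
    Returns (missed, wasted) summed over the epoch: demand above the
    allocation is missed; allocation above the demand is wasted."""
    missed = sum(max(d - allocation, 0) for d in demand)
    wasted = sum(max(allocation - d, 0) for d in demand)
    return missed, wasted

demand = [0, 3, 1, 5, 0, 2, 4, 0]  # illustrative per-chronon demand
print(missed_and_wasted(demand, 1))  # → (10, 3)
print(missed_and_wasted(demand, 3))  # → (3, 12)
```

As in Fig. 2, raising the fixed allocation trades missed updates for wasted probes; only a demand-driven allocation can reduce both.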

3 MODEL FOR TARGETED DATA DELIVERY

The centerpiece of our model is the notion of execution intervals, a simple modeling tool for representing dynamically changing client needs. We discuss user profiles, server notifications, and monitoring. We also discuss how execution intervals are generated from user profiles. We then turn our attention to the formal definition of a schedule and the utility of probing.

To illustrate our model and algorithms, we present a case study using RSS, a popular format for publishing information summaries on the Web. Diverse data types are nowadays available as publications in RSS, including news and weather updates, blog postings, media postings, digital catalog notifications, promotions, white papers, and software updates. The use of RSS feeds is continuously growing and is supported by a pull-based protocol. RSS customization today is provided using specialized RSS readers (also known as RSS aggregators). A user of such a reader can customize her profile by specifying the rate of monitoring each RSS feed. Some readers even allow defining filtering rules over the RSS feed content, which supports further personalization. Recently, the RSS protocol was extended with special metatags such as server-side TTL that hint when new updates are expected. We note that while this improves customization, server-side hints such as TTL for static content delivery are not used often in other contexts and were shown to be inefficient [17, 95]. Despite these features, a client who is only interested in being alerted of updates for a particular item in some news category, whenever the rate of updates increases to at least twice the usual rate, cannot specify such a profile using standard available RSS readers. This scenario requires further refined personalization that is currently unavailable. Our case study is that of RSS monitoring of CNN News, providing publications of news updates by CNN on various topics such as world news, sports, finance, etc. Typically, only news article titles are provided in an RSS item and a link directs the reader to the original article. Each item also


Fig. 2. Resource allocation illustrated.


has a time stamp of its publication date, and sometimes, CNN also provides TTL tags.

3.1 User Profiles

Profiles are declarative user specifications for data delivery. A profile should be easy to specify and sufficiently rich to capture client requirements. A profile should have clear semantics and be simple to implement. To illustrate the basic elements of a profile, we next introduce an example of a profile template (Fig. 3). The profile is given in a profile language we have developed, whose full specification can be found in [25]. The language syntax uses XML, and its various elements are explained below. In particular, we assume that every resource r ∈ R has a unique identifier (e.g., URI) and can be described using some schema (e.g., Relational Schema, DTD, XML-Schema, RDF-Schema, etc.). A resource can be queried using a suitable query language (e.g., SQL, XQuery, SPARQL, etc.). For example, the profile in Fig. 3 queries an RSS resource using XQuery (the RSS schema is available in [27]). This profile can be stated in English as follows: "Return the title and description of items published on the CNN Top Stories RSS feed channel, once X new updates to the channel have occurred. Notifications received within Y minutes after each X new updates have occurred will have a value of 1, else 0. Notifications should take place during two months starting on 24 August 2008, 10:00:00 GMT."

In our model, we assume a setting in which a proxy monitors a set of resources in R given a set of client profiles P = {p1, p2, ..., pm}. A profile p ∈ P contains two element types, namely, Domain and Notification. Domain(p) ⊆ R is a set of resources of interest to the client. A notification rule ν is a rule defined over a subset of resources in Domain(p). A profile p contains a set N = {ν1, ν2, ..., νk} of notification rules. It is worth noting that user profiles can be dynamic, where the set of notification rules in a given profile may change over time, expressing changes in user interests.

3.2 Notifications

Clients use notification rules to describe their data needs and express the utility (see Section 3.4) they assign to data delivery. A notification rule extends the Event-Condition-Action (ECA) structure in active databases [1], [11] and can be modified dynamically by the user. A notification rule ν is a quadruple ⟨Q, Tr, T, U⟩. Q is a query written in some suitable query language (e.g., XQuery in the example profile in Fig. 3). Tr is a trigger. T is the epoch in which rules are evaluated. Finally, U is a utility expression specifying the utility a client gains from notifications of Q.

A notification query Q is specified over a set of resources from the profile domain, denoted by Domain(Q, ν). Queries are equivalent to actions in the ECA structure. Fig. 3 has an XQuery expression that selects all items from the CNN Top Stories RSS channel element and returns, for each item, its title and description.

A trigger Tr is an event-condition expression ⟨e, C⟩ specifying the triggering event e and a condition C. Tr is also specified over a set of resources from the profile domain, denoted by Domain(Tr, ν). It is worth noting that Domain(Q, ν) and Domain(Tr, ν) may overlap, that is, Domain(Q, ν) ∩ Domain(Tr, ν) ≠ ∅. Once an event e is detected, the condition C is immediately evaluated (immediate coupling mode [18]) and, if true, the query Q is scheduled. We note that other coupling modes are available in the literature [18].

We consider two event types. The first is an update to a resource in Domain(Tr, ν). The second is a temporal event, e.g., once an hour. The condition C is a Boolean value expression. For example, the trigger in Fig. 3 specifies an update event

AFTER UPDATE TO {$rss/channel},

with a condition

NUPDATE({$rss/channel}) % X = 0,

where % is the modulo operator. NUPDATE returns the total number of times the RSS channel has been updated since the start of the notification epoch.
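The event-condition pair above can be sketched as a small stateful checker: each update event increments a counter, and the condition fires when the count is a multiple of X. (This is a simplified stand-in for the profile language semantics, assuming updates arrive one at a time.)

```python
class UpdateTrigger:
    """Fires when NUPDATE % X == 0, i.e., on every X-th update since
    the start of the notification epoch (simplified sketch)."""

    def __init__(self, x):
        self.x = x
        self.nupdate = 0  # NUPDATE: updates seen so far in the epoch

    def on_update(self):
        """Process one AFTER UPDATE event; return True iff the
        condition NUPDATE % X == 0 holds, i.e., Q should be scheduled."""
        self.nupdate += 1
        return self.nupdate % self.x == 0

trig = UpdateTrigger(x=2)
fired = [trig.on_update() for _ in range(5)]
print(fired)  # → [False, True, False, True, False]: every 2nd update fires
```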

A notification utility expression U states the utility a client gains from notifications of Q. For example, Fig. 3 specifies the following utility expression:

WITHIN Y minutes 1 ELSE 0;

which means that the client assigns a utility of 1 to notifications of Q that are delivered with a maximum delay of Y minutes after X update events have occurred. In Section 3.4, we provide formal specifications of the utility of probing.

3.3 Execution Intervals and Monitoring

Once an event specified in the trigger part of the notification rule occurs, the trigger condition is immediately evaluated and, if it is true, the notification is said to be executable. The period in which a notification rule is executable has been referred to in the literature as life [24]. We emphasize here the difference between the executable period of a notification (life) and the period in which rules, in general, can be evaluated (epoch). We shall use two life settings in this paper: life = overwrite, in which an update is available for monitoring only until the next update to the same resource occurs, and life = window(Y), in which an update can be monitored up to Y chronons after it has occurred (where Y = 0 denotes a requirement for immediate notification).


Fig. 3. Profile example for RSS feeds.


The time period in which a notification is executable for some event defines an execution interval (interval, for short), during which monitoring (i.e., the query part of a notification) should take place. An execution interval starts with an event, and its length is determined by the relevant life policy. Each notification rule ν is associated with a set of intervals EI(ν). For each I ∈ EI(ν), we define τ(I) as the set of times Tj ∈ T at which the notification is executable for interval I. If, for example, I is determined using the life = window(Y) policy, then τ(I) contains exactly Y chronons. It is worth noting that execution intervals of a notification rule may overlap; thus, the execution of a notification query may occur at the same time for two or more events that cause the notification to become executable. It is also worth noting that execution intervals change dynamically once a notification rule has been modified.
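Deriving execution intervals from a stream of (estimated) event times is mechanical once the life policy is fixed. A sketch of the two life policies used in this paper, with illustrative event times (the concrete times and epoch length are assumptions):

```python
def intervals_window(events, y):
    """life = window(Y): each event at time t yields the interval
    [t, t + y], i.e., it is monitorable for Y chronons."""
    return [(t, t + y) for t in events]

def intervals_overwrite(events, epoch_end):
    """life = overwrite: an update is monitorable only until the next
    update to the same resource; the last one lasts to the epoch end."""
    return [(t, t_next)
            for t, t_next in zip(events, list(events[1:]) + [epoch_end])]

events = [3, 7, 8, 15]  # estimated update chronons (illustrative)
print(intervals_window(events, y=10))        # → [(3, 13), (7, 17), (8, 18), (15, 25)]
print(intervals_overwrite(events, epoch_end=20))  # → [(3, 7), (7, 8), (8, 15), (15, 20)]
```

Note how the window(10) intervals overlap (events at chronons 7 and 8), matching the observation above that a single probe may serve several executable events at once.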

Monitoring can be done using one of three methods, namely, push-based, pull-based, or hybrid. With push-based monitoring, the server pushes updates to clients, providing guarantees with respect to data freshness at a possibly considerable overhead at the server. With pull-based monitoring, content is delivered upon request, reducing overhead at servers, with limited effectiveness in estimating object freshness. The hybrid approach combines push and pull, based either on resource constraints [12] or on role definition. As an example of role-based hybrid push-pull, consider an architecture in which a mediator is positioned between clients and servers. The mediator can monitor servers by periodically pulling their content, and determine when to push data to clients based on their content delivery profiles. In this paper, we focus on pull-based monitoring.

For completeness, we describe in the online supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15, via an example, how execution intervals can be derived from notification rules using update models. For this purpose, we utilize Poisson update models. In [7], [14], [19], it was argued that an update model based on Poisson processes suits updates in a Web environment well. Poisson processes are suitable for modeling a world where data updates are independent from one another, which seems plausible in data sources with widely distributed access, such as updates to auction Web sites. Following [14], we devise an update model based on nonhomogeneous Poisson processes, capturing time-varying update intensities. Such a model reflects well scenarios in which, e.g., e-mails arrive more rapidly during work hours and more bids arrive toward the end of an auction.
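The paper's own update model is specified in the online supplement; as a generic illustration, an event stream with time-varying intensity λ(t) can be simulated with the standard thinning (Lewis-Shedler) method for nonhomogeneous Poisson processes. The intensity function below is an assumption chosen only to mimic "updates peak mid-epoch":

```python
import math
import random

def sample_nhpp(rate, rate_max, horizon, seed=0):
    """Sample event times from a nonhomogeneous Poisson process on
    [0, horizon] by thinning: draw candidates from a homogeneous
    process with intensity rate_max, and keep each candidate at time t
    with probability rate(t) / rate_max (requires rate(t) <= rate_max)."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate_max)  # next candidate arrival
        if t > horizon:
            return events
        if rng.random() < rate(t) / rate_max:
            events.append(t)

# Illustrative intensity: updates peak mid-epoch (cf. e-mails during
# work hours); rate(t) stays within (0, 3.5] on [0, 24].
rate = lambda t: 2.0 + 1.5 * math.sin(math.pi * t / 12.0)
events = sample_nhpp(rate, rate_max=3.5, horizon=24.0)
print(f"{len(events)} simulated update events over 24 chronons")
```

The realized event stream can then be fed to an interval-derivation step under a chosen life policy, as in Example 1 below.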

Example 1. As an example, we now assume that X = 2 and Y = 10; that is, the notification rule described in Fig. 3 requires delivering every other update, assuming life = window(10). Fig. 4 illustrates an example of an update event stream realization, estimated using some update model. The gray intervals are the derived execution intervals, and the black intervals in Fig. 4 illustrate the derived execution intervals in the case of life = overwrite.

Example 1 highlights one of the main aspects of our model, which is personalization. Note that the life parameter represents not only different server capabilities but also client preferences. Some clients are interested in receiving an update before the next update arrives (e.g., giving a purchase order before the next price change), while others are tolerant to some extent (represented by a time-based window). The SUP algorithm we introduce in this work proposes an efficient schedule given a variety of user profiles, represented by an abstract set of execution intervals. From now on, we assume the availability of a stream of execution intervals, possibly generated using the method suggested in the online supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15. It is worth noting that the derivation of execution intervals can be done online. Such online derivation can be used to delay the generation of execution intervals, thus utilizing feedback gathered during monitoring to improve future monitoring. We use this observation to derive adaptive monitoring schemes.

As a final note, in this work, we assume that once an execution interval is probed, the notification to the user is immediate. An extension in which notifications may be delayed is easy to model using execution intervals. In such cases, an execution interval I_k is computed to be I_k = [T_s, T_f − D], where D denotes the estimated delay in notification to the user. The interval is shortened (from the right) to ensure timely delivery of update events.
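This right-shortening adjustment can be sketched as (illustrative helper only):

```python
def shorten_for_delay(interval, D):
    """I_k = [T_s, T_f - D]: shrink the interval from the right by the
    estimated notification delay D so delivery remains timely."""
    T_s, T_f = interval
    return (T_s, T_f - D)
```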

3.4 Schedules and the Utility of Probing

Let N_k be the set of notification rules of profile p_k ∈ P. Let ρ ∈ N_k be a notification rule that utilizes resources from p_k's domain. The satisfiability of a schedule with respect to ρ is defined next.

Definition 1. Let S ∈ 𝕊 be a schedule, ρ be a notification rule with Domain(Q, ρ), and T be an epoch with N chronons. S is said to satisfy ρ in T (denoted by S ⊨_T ρ) if

∀I ∈ EI(ρ), ∀r_i ∈ Domain(Q, ρ) (∃T_j ∈ τ(I) : s_{i,j} = 1).

Definition 1 requires that in each execution interval, every resource referenced by ρ's query Q is probed at least once. It is worth noting that each execution interval I ∈ EI(ρ) is associated with some (either update or periodic) event, and therefore, a schedule that satisfies the notification rule ρ actually needs to "capture" every event required in ρ.
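Definition 1 can be checked mechanically. The sketch below uses our own illustrative encoding (a schedule as a set of (resource, chronon) probes, an interval as an inclusive chronon range) to test whether a schedule satisfies a rule:

```python
def satisfies(schedule, intervals, resources):
    """Definition 1: every resource in Domain(Q, rho) must be probed at
    least once inside every execution interval of the rule.

    schedule  -- set of (resource, chronon) pairs with s_{i,j} = 1
    intervals -- list of inclusive (start, end) execution intervals
    resources -- the rule's Domain(Q, rho)
    """
    return all(
        any((r, t) in schedule for t in range(s, e + 1))
        for (s, e) in intervals
        for r in resources
    )
```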

Whenever it is clear from the context, we use S ⊨ ρ instead of S ⊨_T ρ. Definition 1 is easily extended to a profile and a set of profiles, as follows:

Definition 2. Let S ∈ 𝕊 be a schedule, P = {p1, p2, ..., pm} be a set of profiles, and T be an epoch with N chronons. S is said to satisfy p_k ∈ P (denoted as S ⊨ p_k) if for each notification rule ρ ∈ N_k, S ⊨ ρ.

ROITMAN ET AL.: A DUAL FRAMEWORK AND ALGORITHMS FOR TARGETED ONLINE DATA DELIVERY 9

Fig. 4. Example execution intervals derived from an update model.


S is said to satisfy P (denoted as S ⊨ P) if for each profile p_k ∈ P, S ⊨ p_k.

Example 2. As an example of profile satisfaction, consider Fig. 5, which contains an example user profile and two possible schedules. On the left side of Fig. 5, we have an epoch with five chronons and three execution intervals, each associated with a different notification rule of the user profile. The first, I1, requires probing resource r1 during [T1, T2]; the second, I2, requires probing resource r2 during [T3, T4]; and finally, the third, I3, requires probing resource r3 during chronon T5. Both Schedule 1 and Schedule 2 in Fig. 5 probe each of the execution intervals; thus, they both satisfy the three notification rules and, therefore, the profile.

Given a notification rule ρ ∈ N_k and a resource r_i ∈ Domain(Q, ρ), a utility function u(r_i, ρ, T_j) describes the utility of probing resource r_i at chronon T_j according to notification rule ρ. Intuitively, probing a resource r at chronon T is useful (and therefore, should receive a positive utility value) if it is referred to in the Query part of the notification rule and if the condition in the Trigger part of that profile holds. It is important to emphasize again the difference in roles between the Query part and the Trigger part of the notification rule. In particular, probing a resource r is useful only if the data required by a notification rule (specified in the Query part) can be found at r. u(r_i, ρ, T_j) is derived by assigning positive utility when a condition is satisfied, and a utility of 0 otherwise. u is defined to be strict if it satisfies the following condition:

u(r_i, ρ, T_j) = { w, if T_j ∈ ∪_{I ∈ EI(ρ)} τ(I) ∧ r_i ∈ Domain(Q, ρ);
                   0, otherwise.                                          (3)

That is, u(r_i, ρ, T_j) assigns a constant value of w whenever there exists an execution interval for resource r_i, derived from notification rule ρ, and the probe of resource r_i, referenced by the query Q, coincides with the time constraints of the execution interval.

From now on, we shall assume the use of a binary utility, i.e., w = 1. Examples of strict utility functions include uniform (where utility is independent of delay) and sliding window (where utility is 1 within the window and 0 outside it). Examples of nonstrict utility functions are linear and nonlinear decay functions. Nonstrict utilities quantify tolerance toward delayed data delivery (or latency). We restrict ourselves in this work to strict utility functions. The case of nonstrict utility functions can be handled in the scope of OptMon2 problems by allowing users to define a threshold for the minimal utility required in the user profile (e.g., the maximum delay allowed on each notification to the user). We handle such utilities in our model using the restrictions of the life = window(Y) parameter.

The expected utility U accrued by executing monitoring schedule S in an epoch T is given by

U(S) = Σ_{ρ ∈ ∪_{l=1}^{m} N_l} Σ_{I ∈ EI(ρ)} Σ_{r_i ∈ Domain(Q, ρ)} min(1, Σ_{T_j ∈ I} s_{i,j} · u(r_i, T_j, ρ)).    (4)

The innermost summation ensures that utility is accumulated whenever a probe is performed within an execution interval. This utility cannot be more than 1, since probing a resource more than once within the same execution interval does not increase its utility. The utility is summed over all execution intervals, all relevant resources, and all notification rules in a profile.
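Equation (4) with a strict binary utility (w = 1) can be computed directly. The sketch below uses the same illustrative encoding as before (schedules as sets of (resource, chronon) probes; names are ours, not the paper's):

```python
def utility(schedule, rules):
    """Eq. (4) for strict binary utility (w = 1).

    rules -- list of (intervals, resources) pairs, one per notification rule.
    Each (interval, resource) pair contributes min(1, number of probes of
    that resource inside the interval), so repeated probes add nothing.
    """
    U = 0
    for intervals, resources in rules:
        for (s, e) in intervals:
            for r in resources:
                probed = sum(1 for t in range(s, e + 1) if (r, t) in schedule)
                U += min(1, probed)
    return U
```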

Example 3. As a concrete example of utility calculation, consider once more Fig. 5. We saw that both schedules satisfy the profile; each monitored execution interval credits a schedule with a utility of 1, and since both schedules satisfy the three notification rules, the total utility acquired by each schedule is 3.

4 THE SUP ALGORITHM

Let R = {r1, r2, ..., rn} be a set of n resources, {T1, T2, ..., TN} be a set of chronons in an epoch T, and S = (s_{i,j}) ∈ 𝕊 be a schedule. Let P = {p1, p2, ..., pm} be a set of user profiles. A concrete OptMon2 problem can be formalized as follows:

minimize Σ_{r_i ∈ R, T_j ∈ T} s_{i,j}
s.t. S ⊨ P.    (5)
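For tiny single-resource instances, the optimum of (5) can be found by exhaustive search, which is useful for validating a scheduler on toy inputs. This is an illustrative sketch only (exponential in the number of chronons, not part of the paper's algorithms):

```python
from itertools import combinations

def optmon2_bruteforce(intervals, chronons):
    """Smallest set of probe chronons hitting every inclusive (start, end)
    interval -- eq. (5) restricted to one resource. Tries probe sets of
    increasing size, so the first hitting set found is minimum."""
    for k in range(len(chronons) + 1):
        for probes in combinations(chronons, k):
            if all(any(s <= t <= e for t in probes) for (s, e) in intervals):
                return set(probes)
```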

Recall that a notification rule ρ is associated with a set of resources Domain(Q, ρ). Given a notification rule ρ and the set of its execution intervals EI(ρ), SUP identifies the set of resources Q_I^ρ ⊆ Domain(Q, ρ) that must be probed in an execution interval I ∈ EI(ρ).

The main intuition behind the SUP algorithm is to identify the best candidate chronon in which the assignment of probes to resources maximizes the number of execution intervals that can benefit from each probe. This then leads to a reduction in the number of resources in Q_I^ρ that need to be probed during each execution interval I. We identify the best candidate chronons by delaying the probes of execution intervals to the last possible chronon in which the utility is

10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011

Fig. 5. Example of a user profile and two schedules satisfying it.


still positive (and notifications can be safely delivered to users). As we prove later in Section 5.2, such probe assignments ensure a perfect elimination order of execution intervals that leads to an optimal solution.

Example 4. Fig. 6 illustrates the SUP execution philosophy, with two notification rules ρ1 (shown at the top of the figure) and ρ2 (shown at the bottom). Notification rule ρ1 is executable once every two updates with a life window of four chronons (life = window(4)). Notification rule ρ2 is executable once every five updates with life = overwrite. In this example, we assume that the queries of notification rules ρ1 and ρ2 refer to the same set of resources, that is, Domain(Q, ρ1) = Domain(Q, ρ2). Stars represent estimated update events, using some update model. The execution intervals, defined in Section 3.3 to be the times in which a notification rule is executable, are denoted as rectangles. The upper gray rectangles represent the derived execution intervals of notification rule ρ1, while the lower black rectangles represent those derived from notification rule ρ2. We number execution intervals (EIs) for convenience of exposition.

Suppose the last update occurred during chronon T1 and SUP also probed at T1 for the last time. After the second update, which occurred during chronon T2, notification rule ρ1 becomes executable and the EI I1 = [T2, T6] is generated and delivered to SUP. SUP delays the probe of each execution interval until the last possible chronon in which the notification rule ρ1 is executable (and still has some value to the user). Thus, EI I1 is scheduled for probing at chronon T6. After the fourth update event at chronon T5, notification rule ρ1 becomes executable again and a new execution interval I2 = [T4, T8] is generated (and scheduled for probing at T8). Meanwhile, at chronon T5, after the occurrence of the fifth update event, notification rule ρ2 also becomes executable for the first time and remains so for the next three chronons until the occurrence of the sixth update event, resulting in the generation of EI I3 = [T5, T8] (scheduled by SUP at chronon T8). At chronon T6, EI I1 is probed by SUP (according to the schedule). At that chronon, EI I1 overlaps with EIs I2 and I3 (which we prove in Section 5.2 to be the maximum possible overlap with I1); thus, by probing EI I1, SUP guarantees that the other two EIs are also probed (satisfied). Thus, using a single probe, SUP can satisfy three different EIs for the two notification rules. The same process occurs again with EI I4 at chronon T13, resulting in a total usage of two probes by SUP that satisfy all six EIs.

We now provide a description of the algorithm. The pseudocode of the SUP algorithm and the two routines, namely, AdaptiveEIsUpdate and UpdateNotificationEIs, that served in building our prototype are available in the online supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15. The algorithm builds a schedule iteratively. It starts with an empty schedule (∀s_{i,j} ∈ S, s_{i,j} = 0) and repeatedly adds probes. The algorithm generates an initial probing schedule, where the last chronon in the first I ∈ EI(ρ) is picked to execute the probe. It then determines the earliest chronon in which to probe, the notification rule associated with this monitoring task, and the specific execution interval. When probed, all resources in the query part of that notification rule are probed.
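Per resource, the core of this strategy reduces to greedily piercing a set of intervals: defer each probe to the deadline (last chronon) of the earliest-ending pending interval, so that one probe covers every overlapping interval containing that chronon. The sketch below is our simplification of this idea on precomputed intervals; the full algorithm also handles multi-resource queries, online interval arrival, and feedback:

```python
def sup_schedule(intervals_by_resource):
    """Greedy last-chronon probing, per resource: process intervals in order
    of end chronon; if an interval is not yet covered by the last probe,
    probe at its own deadline (minimum piercing of an interval set)."""
    schedule = set()
    for r, intervals in intervals_by_resource.items():
        last_probe = None
        for s, e in sorted(intervals, key=lambda iv: iv[1]):  # by end chronon
            if last_probe is None or s > last_probe:  # interval not covered yet
                last_probe = e                        # defer probe to deadline
                schedule.add((r, last_probe))
    return schedule
```

On the intervals of Example 4 (collapsed to one resource), a single probe at the deadline of the earliest-ending interval covers all intervals that overlap it.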

SUP depends on an accurate set of execution intervals to perform correctly. However, in practice, determining a set of execution intervals suffers from two main problems. First, the underlying update model that is used to compute the execution intervals is stochastic in nature, and therefore, updates deviate from the expected update times. Second, it is possible that the underlying update model is (temporarily or permanently) incorrect, and the real data stream behaves differently than expected.

To tackle these two problems, we propose to exploit feedback from probes to revise the probing schedule in a dynamic manner, after each monitoring task. Thus, execution interval generation is deferred to the last possible moment, and SUP responds to deviations in the expected update behavior of sources that it observes as a result of probing feedback, which is used to improve its next scheduling decisions. We first introduce the general scheme of SUP, which addresses the first problem and does not require changes to any parameters. In Section 6, we present a heuristic improvement, SUP(λ), that addresses the second problem and adds local online modifications to update model parameters.

Recall that given a notification rule ρ and an execution interval I ∈ EI(ρ), SUP probes the resources Q_I^ρ referenced in the notification rule query Q. Given a resource r_i ∈ Q_I^ρ that is probed by SUP and a notification rule ρ, we assume that we can use the feedback from probing Q_I^ρ to validate whether ρ was actually executable during I or not. For this purpose, we define a Boolean function C_ρ(feedback(r_i), T), set to true if ρ is executable at time T given the feedback gathered from probing resource r_i. feedback(r_i) is a feedback function, returning feedback data that were gathered from the last probing of resource r_i. In this work, we consider a feedback function feedback(r_i) that returns the actual number of update events of resource r_i. It is worth noting that such validation from feedback is possible (or required) only when the event e of the notification rule trigger Tr is an update-based event and when Domain(Tr, ρ) ∩ Domain(Q, ρ) ≠ ∅, implying that there is at least one resource that is referenced both by the trigger part and the query part of the notification rule. It is also noteworthy that using feedback from probing a resource r_i ∈ Q_I^ρ, SUP can conserve system resources when the validation of notification rule ρ fails (i.e., C_ρ(feedback(r_i), T) = False).

Given that a resource r_i ∈ Q_I^ρ was probed by SUP, we use the feedback feedback(r_i) to validate the notification rule ρ using C_ρ(feedback(r_i), T). When validation fails, SUP prunes the probing of the rest of the resources in Q_I^ρ \ {r_i}


Fig. 6. Illustrating example of SUP execution.


that were not probed yet, and makes adaptive modifications to the input execution intervals that also require probing resource r_i, including execution interval I itself.
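A minimal sketch of this validate-then-prune step, under our simplified reading that the feedback is the observed update count and C_ρ holds iff exactly the expected X updates occurred (names are illustrative):

```python
def on_probe_feedback(q_rho_I, probed_resource, observed_updates, X):
    """After probing one resource of Q_I^rho: validate the rule and, on
    failure, prune all not-yet-probed resources of the interval.
    Returns (validated, resources_still_to_probe)."""
    validated = observed_updates == X     # C_rho(feedback(r_i), T)
    if validated:
        return True, q_rho_I - {probed_resource}  # continue probing the rest
    return False, set()                            # prune Q_I^rho \ {r_i}
```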

SUP uses the AdaptiveEIsUpdate routine to apply the adaptive modifications. This routine first applies an adaptive modification to notification rule ρ by recalculating a new execution interval I* to be scheduled. Then, the routine determines a set of notification rules (denoted by N_dep(r_i)) that may be associated with execution intervals that need to be modified, by identifying those notification rules that reference resource r_i in their trigger part. For each such notification rule ρ' ∈ N_dep(r_i), the routine then identifies execution intervals I' ∈ EI(ρ') that need to be modified according to the feedback from probing resource r_i. For each execution interval that needs to be modified (where the notification rule ρ' is found to be invalid), a new execution interval is calculated using the feedback and replaces the invalid one.

The following example illustrates the adaptive nature of the SUP algorithm:

Example 5. As an example, consider our case study notification rule ρ and assume that at the time of monitoring resource r_i, only feedback(r_i) = l < X updates occur. Here, we define C_ρ(feedback(r_i), T) as follows:

C_ρ(feedback(r_i), T) = True ⇔ feedback(r_i) = X.

SUP will generate a new execution interval, checking for X − l updates ahead. To illustrate the mechanism for generating I*, consider Fig. 7. SUP has set the monitoring task for this execution interval at chronon T'_t. At chronon T'_t, the monitoring task has revealed that in the interval (T_s, T'_t], only l < X updates have occurred, and thus, C_ρ(feedback(r_i), T) = False. The last update occurred at chronon T'_s, where T_s < T'_s < T'_t. Therefore, a new execution interval is now computed.

We provide a more detailed description of this example using a specific update model in the online supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15. It is worth noting that while the example handles the case of feedback(r_i) = l < X, where fewer updates than expected occurred, the AdaptiveEIsUpdate routine is general and also handles the case of feedback(r_i) = l > X. In this case, there is at least one "missed" update, and the procedure revises the schedule to capture the next update on time.

The UpdateNotificationEIs routine is called to ensure that resources that belong to overlapping intervals are only probed once. This routine involves rather simple bookkeeping. We explain next the routine logic. Let l = ρ be the assignment of SUP, where ρ is the notification rule whose execution interval I is processed at time T_j, and all resources referenced in Q_I^ρ are scheduled for probing at time T_j. Given an execution interval I' of a notification rule ρ', this procedure removes from Q_{I'}^{ρ'} the (possibly empty) resource set Q_I^ρ ∩ Q_{I'}^{ρ'} if T_j ∈ τ(I'). By doing so, we ensure that resources that belong to overlapping execution intervals will be probed only once. In addition, this procedure removes any execution interval I for which Q_I^ρ = ∅, allowing SUP to consider only execution intervals for which monitoring is still needed. The process continues until the end of the epoch.
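The bookkeeping can be sketched as follows (illustrative encoding: each pending entry pairs an inclusive interval with its remaining resource set; names are ours):

```python
def update_notification_eis(pending, probed_resources, T_j):
    """Remove the just-probed resources from every execution interval that
    contains the probe chronon T_j, and drop intervals whose resource set
    becomes empty (no monitoring still needed for them)."""
    out = []
    for (s, e), res in pending:
        if s <= T_j <= e:                 # interval overlaps the probe chronon
            res = res - probed_resources  # these resources are already covered
        if res:                           # keep only intervals still needing probes
            out.append(((s, e), res))
    return out
```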

5 ALGORITHM PROPERTIES

SUP assumes the availability of a stream of execution intervals, generated using some update model. This abstraction allows the algorithm to focus on the monitoring of execution intervals; thus, SUP's optimal solution depends only on the number of execution intervals it is required to consider during the monitoring task. This implies that SUP can handle an arbitrary number of user profiles, depending only on the total number of execution intervals of all input profiles.

SUP is executed in an online fashion, where execution intervals are introduced right before they are required to be considered by SUP. This adds further flexibility to the monitoring scheme by allowing user profiles to change over time. Further, we can exploit the feedback gathered during monitoring to improve the probing of future scheduled execution intervals by adaptive monitoring.

SUP accesses O(K) execution intervals, where K is the total number of probes in a schedule, bounded by Nn (the number of resources multiplied by the number of chronons in an epoch). We expect, however, K to be much smaller than Nn, since K serves as a measure of the amount of data users expect to receive during the monitoring process.

We next provide a detailed analysis of three of SUP's properties. Section 5.1 analyzes SUP correctness. SUP optimality is given in Section 5.2. Finally, in Section 5.3, we discuss terms under which SUP is also optimal as an OptMon1 solution.

5.1 SUP Correctness

SUP correctness is given by the following theorem:

Theorem 1. Let S be the schedule generated by SUP. Given a set of profiles P = {p1, p2, ..., pm}, S ⊨ P.

Proof. Let S be the schedule generated by SUP. Let p ∈ P, ρ ∈ N(p), and I ∈ EI(ρ). We define

N(I) = {I' : I' ≠ I ∧ I ∩ I' ≠ ∅ ∧ ∃ρ' ∈ N : I' ∈ EI(ρ')},
T* = min_{I' ∈ N(I)} {max τ(I')}.

Let r_i ∈ Q_I^ρ. First, let us assume that r_i ∈ Q_I^ρ ∩ Q_{I'}^{ρ'}. If T* ∈ I, then according to the algorithm, s_{i,j} = 1 where T_j = T*. Else, the algorithm selects another I' ∈ N(I) and probes all resources of Q_{I'}^{ρ'}, including r_i. In both cases, ∃T_j ∈ T : s_{i,j} = 1 for resource r_i ∈ Q_I^ρ. Let Ī be the execution interval selected by SUP. In case Ī = I, then all resources in Q_I^ρ were probed and we finish. Else, I has some resources in Q_I^ρ that were not probed, and thus, I still remains in N(Ī) and we repeat the same process. At every such step, we are guaranteed that at least one execution interval will


Fig. 7. Illustrating example of SUP adaptivity.


be removed from N(I). Since, according to the algorithm, I will not be removed from N(I) until each resource r_i ∈ Q_I^ρ is probed, in the worst case, all resources of Q_I^ρ will be probed after |N(I)| steps. Thus, we are guaranteed to probe every resource r_i ∈ Q_I^ρ in some chronon inside I, and according to Definition 2, we get that S ⊨ P. □

SUP is surprisingly simple, given its ability to ensure

correctness, and in some clearly defined cases, efficiency (see below). We attribute the algorithm's simplicity to the OptMon2 formalism and the execution interval abstraction. Generally speaking, a new probe is set for a resource at the last possible chronon where a notification remains executable. That is, it is deferred to the last possible chronon where the utility is still 1. This, combined with the use of Procedure 2, is needed to develop an optimal schedule in terms of system resource (probe) utilization.

5.2 SUP Optimality

We now provide an optimality proof based on the graph-theoretic properties of SUP. We begin with the following definition:

Definition 3. Given N = ∪_{k=1}^{m} N_k, a resource r ∈ R, and two execution intervals I and I', we say that the intervals "r-intersect" (denoted by I ∩_r I') if the following two conditions are satisfied:

1. I ∩ I' ≠ ∅.
2. ∃ρ, ρ' ∈ N : r ∈ Q_I^ρ ∧ r ∈ Q_{I'}^{ρ'}.

According to Definition 3, two execution intervals r-intersect if the same resource r is required to be probed during some shared chronon of both execution intervals.

Given a set of profiles P = {p1, p2, ..., pm} and an epoch T = {T1, T2, ..., TN}, we construct an interval graph G(V, E) from the execution intervals derived from P during the epoch T, where V = {I | I ∈ N} and E = {(I, I') | ∃r ∈ R : I ∩_r I'}. It is worth noting that G can be defined as a union graph G(V, E) = ∪_{i=1}^{n} G_i(V_i, E_i), where for each subgraph G_i(V_i, E_i), V_i = {I | ∃ρ ∈ N : r_i ∈ Q_I^ρ} and E_i = {(I, I') | I ∩_{r_i} I'}.

It is known that for every interval graph (or, more generally, chordal graph), there always exists a perfect elimination ordering [5]. A perfect elimination ordering of a graph G is an order σ_G that assures that every vertex v selected by the order and the set of its neighbors (denoted by N(v)) that succeed v in σ_G jointly form a clique.

We first show that SUP provides a perfect elimination ordering for each subgraph G_i(V_i, E_i). Let σ_SUP be the SUP order; that is, given two execution intervals I and I', SUP prefers the interval with an earlier termination (see lines 8 and 24 of the algorithm pseudocode in the online supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15). Formally:

I ≤_SUP I' ⇔ max{τ(I)} ≤ max{τ(I')}.

Lemma 2. Given G_i(V_i, E_i), σ_SUP provides a perfect elimination ordering of G_i.

Proof. Let I be an interval selected by SUP order σ_SUP, and let N(I) be the set of neighbors of I in G_i. According to σ_SUP, every interval I' ∈ N(I) intersects with I and ends together with or after interval I. Thus, every two intervals in N(I) intersect. Therefore, N(I) ∪ {I} is a clique in G_i. Since I is an arbitrary interval selected by SUP, σ_SUP provides a perfect elimination ordering. □

We next show that the set of neighbors of an interval I that is selected by SUP is the largest possible for I, and thus, the clique N(I) ∪ {I} is the maximal possible clique that contains I.

Lemma 3. Let I be an interval selected by SUP for probing at chronon T = max_{t ∈ τ(I)}{t}; then the clique formed from {I} ∪ N(I) at chronon T is a maximal clique containing I.

Proof. SUP chooses to probe interval I at the last possible chronon for probing I, where at that chronon, I intersects with all of its possible neighbor intervals; therefore, the clique formed from N(I) ∪ {I} is maximal. □

Given a schedule S ⊨ P, we denote by K_i = Σ_{T_j ∈ T} s_{i,j} the total number of probes performed by schedule S during the epoch T by monitoring resource r_i ∈ R. Thus, the total number of probes of schedule S is given by K = Σ_{r_i ∈ R} K_i.

The following concludes the proof of SUP optimality:

Theorem 4. Let R = {r1, r2, ..., rn} be a set of n resources, {T1, T2, ..., TN} be a set of chronons in an epoch T, and S = (s_{i,j}) be a monitoring schedule, generated by Algorithm SUP, with K probes. Let S' ∈ 𝕊 be a schedule that satisfies S' ⊨ P with K' probes. Then, K ≤ K'.

Proof. SUP decision making is independent for each resource r_i ∈ R, and therefore, the problem is separable in the number of resources. Consider a resource r_i ∈ R. Let I be any execution interval probed by S with respect to resource r_i. I may or may not be probed in S'. Assume that I was probed by S' and let T be the probe chronon. Let N_S(I) and N_{S'}(I) denote the number of r_i-intersecting execution intervals captured by probing I by S and S', respectively. Obviously, T ≤ I.T_f, and according to Lemma 3, we get that N_S(I) ≥ N_{S'}(I). Now assume that I was not probed by S'. Therefore, since S' ⊨ P, there must exist some other r_i-intersecting execution interval I' that was probed by S'. Let T' be the chronon in which S' probed I'. Again, since S' ⊨ P, we have that T' ≤ I.T_f (otherwise, S' will not capture I). I' was not probed by S, since according to SUP order, the following holds: I.T_f ≤ I'.T_f. Therefore, for this case, we again have that the following must hold: N_S(I) ≥ N_{S'}(I'). Using this result, we have

K_i = |V_i| − Σ_{I ∈ S} N_S(I)
    ≤ |V_i| − ( Σ_{I ∈ S ∩ S'} N_{S'}(I) + Σ_{I' ∈ S' \ S} N_{S'}(I') ) = K'_i,

concluding that K ≤ K'. □

Probing at the last possible chronon ensures an optimal

usage of system resources (probes) while still satisfying user profiles. However, due to the stochastic nature of the process, probing later may decrease the probability of satisfying the



profile. This is true, for example, with hard deadlines, where once the deadline has passed, the utility is 0. Determining an optimal chronon for probing, i.e., the one that maximizes the probability of satisfying the profile, depends on the stochastic process of choice, and is itself an interesting optimization problem. We defer this analysis to future work.

5.3 Terms for SUP Dual Optimality

Generally speaking, the dual optimization problems OptMon1 and OptMon2 cannot be compared directly. Satisfying user profiles may violate system constraints, and satisfying system constraints may fail to satisfy user profiles. However, the following Theorem 5 provides an interesting observation. Theorem 4 shows that SUP provides a schedule with minimal system resource utilization. The following theorem (whose proof is immediate from (4)) shows that the schedule generated by SUP also has maximum utility for the class of strict utility functions (and hence, can maximize utility while minimizing system resource consumption).

Theorem 5. Let R = {r1, r2, ..., rn} be a set of n resources, {T1, T2, ..., TN} be a set of chronons in an epoch T, P = {p1, ..., pm} be a set of user profiles, and S = {s_{i,j}} be a monitoring schedule, generated by SUP, with a utility U(S). If for every notification rule ρ ∈ N, u(r_i, T_j, ρ) is strict, then U(S) ≥ U(S') for any schedule S' = {s'_{i,j}} ≠ S.

Proof. The maximal value of (4) is Σ_{ρ ∈ N} Σ_{I ∈ EI(ρ)} |Q_I^ρ|, and it is achieved when an arbitrary schedule S guarantees that S ⊨ P. According to Theorem 1, SUP generates such a schedule, and thus has maximal utility. □

Whenever the resources consumed by SUP satisfy the system constraints of OptMon1, SUP is guaranteed to solve the dual OptMon1 (as well as OptMon2) and maximize user utility, while at the same time minimizing resource utilization. As an example, consider an algorithm that sets an upper limit M on the number of probes in a chronon for all pages. Assume that in the schedule of SUP, the maximum number of probes in any chronon satisfies M. Since SUP utilizes in each chronon only the number of probes needed to satisfy the profile expressions, the total number of probes will never exceed N · M. Whenever strict utility functions are used, SUP can serve as a basis for solving the dual problem OptMon1. A schedule S, generated by SUP with no bound on system resource usage, and a set of desired system resource constraints can be used as a starting point in solving OptMon1, as illustrated in Fig. 2. S can be used to avoid overprobing in chronons when fewer updates are expected. System resources may be allocated to chronons that are more update intensive. In this situation, SUP may serve as a tentative initial solution to the OptMon1 problem, allowing local tuning at the cost of reduced utility. We defer a formal discussion of SUP under system constraints to future research.

6 SUP WITH LOCAL MODEL MODIFICATION

We next illustrate an approach to managing local errors in the update model. For purposes of illustration, we assume a piecewise constant Poisson update model [15], as follows: Let J = (J_1, J_2, ..., J_k) be a set of k intervals J_i = [T_s^i, T_f^i) aligned on the epoch T such that T_s^{i+1} = T_f^i, with T_s^1 = T_1 and T_f^k = T_N. The update model associates a constant intensity level λ_i with each interval J_i. Therefore, the model is given as a set of pairs M = (⟨J_i, λ_i⟩)_{i=1}^{k}. Fig. 8 provides an illustration of such a model with k = 3. The horizontal thin lines represent the three different intensities of the model.

SUP(λ) is an extension of SUP that utilizes the feedback gathered from the data delivery process to apply local adaptive modifications to the update model itself. In particular, if feedback indicates that many updates have been missed, SUP(λ) will locally compensate for such a change in update frequency by locally increasing the intensity, and vice versa. It is worth noting that this model modification technique is heuristic in nature. More statistically robust techniques would involve methods developed in research areas such as statistical process control (e.g., [29]).

The pseudocode of SUP(λ) is given in the online supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15. The algorithm works as follows: First, as in SUP, it validates the notification rule ρ given the feedback feedback(r_i). In case the validation fails, it modifies the update model of resource r_i (denoted by M(r_i)) by calling procedure AdaptUpdateModel (also available in the online supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.15). This procedure uses the feedback about the actual number of events and applies local modifications to the update model adaptively. Finally, as in SUP, SUP(λ) calls the procedure AdaptiveEIsUpdate to determine the revised schedule. We now describe the operation of the AdaptUpdateModel procedure, which is illustrated in Fig. 8. Using the current schedule S, it first locates the last chronon at which a probe was assigned to resource r_i in schedule S (denoted by T_prev). Then, SUP(λ) finds an interval J = [T_L, T_U] that includes the chronon T at which the current probe of resource r_i took place. The start and end chronons of J define a region in which the intensity remains constant with regard to the current intensity at chronon T. It is worth noting that the start point T_L is chosen from the interval [T_prev, T]; therefore, SUP(λ) adaptively modifies the update model by utilizing only the feedback that falls inside the constant intensity region to which chronon T belongs and does not use feedback that was gathered before chronon T_prev (the last chronon at which resource r_i was probed before T). We

14 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011

Fig. 8. Illustrating example of the SUP(λ) scheme.


term the region [TL, T] the Effective Feedback Region. We first extract from feedback(ri) the actual number of events nT that occurred during the interval [TL, T]. We then modify the update model by replacing the pair ⟨J, λ⟩ with two new pairs ⟨J′, λ′⟩ and ⟨J″, λ″⟩. For the first pair ⟨J′, λ′⟩, we use the feedback to determine a new estimated intensity λ′ = nT / (T − TL) during J′ = [TL, T).2 Then, we estimate the intensity λ″ during J″ = [T, TU) by smoothing the local intensity λ that corresponds to J with the intensity λ′ calculated from the feedback. The smoothing parameter α is defined as the portion of the feedback region out of the whole of J and is used to avoid overfitting to the new local feedback intensity. The revised intensities are represented in Fig. 8 as boldface lines.
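The local model-modification step described above can be sketched as follows, assuming the update model is kept as a list of constant-intensity pieces. The names (`Piece`, `adapt_update_model`) and the exact choice of TL are illustrative, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class Piece:
    start: float  # interval start chronon (inclusive)
    end: float    # interval end chronon (exclusive)
    rate: float   # constant estimated intensity over [start, end)

def adapt_update_model(model, t_prev, t, n_events):
    """Locally revise a piecewise-constant update model after a probe
    at chronon t, given feedback of n_events actual events.

    model  : list of Piece covering the epoch, in chronological order
    t_prev : last chronon at which this resource was probed before t
    """
    # Locate the constant-intensity piece J = [T_L, T_U) containing t.
    idx = next(i for i, p in enumerate(model) if p.start <= t < p.end)
    j = model[idx]
    # The effective feedback region starts no earlier than T_prev.
    t_l = max(j.start, t_prev)
    # Feedback-based intensity lambda' = n_T / (T - T_L) over [T_L, T).
    lam_fb = n_events / (t - t_l)
    # Smoothing weight alpha: the portion of J covered by feedback,
    # used to avoid overfitting to the local feedback intensity.
    alpha = (t - t_l) / (j.end - j.start)
    lam_after = alpha * lam_fb + (1.0 - alpha) * j.rate
    # Replace <J, lam> with the revised pieces (keeping any prefix of J
    # that lies before the effective feedback region).
    new = []
    if t_l > j.start:
        new.append(Piece(j.start, t_l, j.rate))
    new.append(Piece(t_l, t, lam_fb))        # J'  = [T_L, T)
    new.append(Piece(t, j.end, lam_after))   # J'' = [T, T_U)
    model[idx:idx + 1] = new
    return model
```

For example, a single piece [0, 100) with rate 0.5, a previous probe at chronon 20, and 12 events reported at a probe at chronon 60 yield λ′ = 0.3 over [20, 60) and a smoothed λ″ = 0.4 · 0.3 + 0.6 · 0.5 = 0.42 over [60, 100).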

7 EXPERIMENTS

We present empirical results analyzing the behavior of SUP and SUP(λ) under varying settings. We start in Section 7.1 with a description of the trace data sets and the experiment setup. We then analyze the impact of profile selection, life parameter, and update model on SUP performance (Sections 7.2-7.3). Section 7.4 presents an empirical comparison with existing solutions to the dual optimization problems. Finally, we study the impact of feedback (Section 7.5) and compare SUP(λ) with SUP (Section 7.6), showing that SUP(λ) improves on SUP's effective utility with a moderate increase in the number of probes.

7.1 Data Sets and Experiment Setup

We implemented SUP in Java, JDK version 1.4, and experimented with it on various trace data sets, profiles, life parameters, and update models. Traces of update events include real RSS feed traces and synthetic traces. We consider two different update models, FPN and Poisson (to be discussed shortly), to model the arrival of new update events in these traces. For comparison purposes, we also implemented WIC, as described in [24], to determine a schedule for OptMon1, and TTL [15] as another (yet very simple) OptMon2 solution. We briefly review the two solutions.

Web information collector (WIC). The WIC algorithm takes four decision parameters as input, namely pi,j, lifei(k, j), urgencyi(j − k), and a constraint M. pi,j denotes the probability that resource ri will be updated during chronon Tj; lifei(k, j) denotes the probability that an update that occurred to resource ri at chronon Tk will still be available at the server at chronon Tj; urgencyi(j − k) denotes the value of monitoring resource ri with a delay of (j − k) chronons. We initialized the pi,j probabilities for WIC using the two update models. We used the Overwrite and Window(Y) life instantiations as defined in [24]. We further defined a uniform urgency of updates by setting urgencyi(j − k) ≡ 1. WIC is a greedy algorithm: at each chronon Tj, it probes the M resources with the highest local gained utility, where utility is given as Ui(k, j) = pi,j · lifei(k, j) · urgencyi(j − k).
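The greedy selection at each chronon can be sketched as follows; passing the probability, life, and urgency inputs as a plain dictionary and functions is an illustrative simplification:

```python
def wic_probe_set(resources, j, last_probe, p, life, urgency, m):
    """One greedy WIC step at chronon T_j: probe the M resources with
    the highest local gained utility
        U_i(k, j) = p[i][j] * life(i, k, j) * urgency(j - k),
    where k = last_probe[i] is the chronon of resource i's last probe."""
    def utility(i):
        k = last_probe[i]
        return p[i][j] * life(i, k, j) * urgency(j - k)
    return sorted(resources, key=utility, reverse=True)[:m]
```

With uniform life and urgency (as in our setup), the step reduces to picking the M resources with the highest update probability at the current chronon.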

TTL. Given a TTL parameter, a probe is scheduled for each resource every TTL chronons. Using TTL, we can simulate a periodic poll of servers, such as the one used by standard RSS aggregators.
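A TTL schedule is simply a periodic probe sequence; a one-line sketch (function name illustrative):

```python
def ttl_schedule(n_chronons, ttl):
    """Probe chronons for one resource under a fixed TTL: a probe
    every `ttl` chronons, as a standard RSS aggregator would poll."""
    return list(range(ttl, n_chronons + 1, ttl))
```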

Table 1 summarizes the various dimensions of our experiments. We next discuss each parameter in more detail.

Trace data set. We used data from a real trace of RSS feeds. We collected RSS news feeds from several Web sites such as CNN and Yahoo!, recording the events of insertion of new items into the RSS files. In this paper, we present results for 2,873 updates to the CNN Top Stories RSS feed [9], collected over one month during September and October 2005.3 We also generated two types of synthetic data. The first set simulates an epoch with three different exponential interarrival intensities: medium (first half day), low (next two days), and high (last half day). This data set can model the arrival of bids in an auction (without the final bid sniping). The second data set has a stochastic cyclic model of one week, separating working days from weekends and working hours from night hours. Such a model is typical for many applications [14], including postings to newsgroups, reservation data, etc. Here, it can be representative of RSS data with varying update intensity. This data set was generated assuming an exponential (time-dependent) interarrival time. Table 2 summarizes the number of recorded events for each of the three data sets. The epoch size varies from one data set to another. Each epoch was partitioned into N = 10,000 chronons.
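A trace like the first synthetic data set can be generated by drawing exponential interarrival times under a piecewise-constant rate. This is an illustrative sketch: the regime durations and rates below are placeholders, and an interarrival that crosses a regime boundary is discarded with the clock restarting at the boundary (a simplification):

```python
import random

def gen_trace(regimes, seed=0):
    """Draw update-event times with exponential interarrivals whose
    rate is constant within each regime.

    regimes: list of (duration, rate) pairs, in chronological order.
    """
    rng = random.Random(seed)
    events, offset = [], 0.0
    for duration, rate in regimes:
        end = offset + duration
        t = offset  # restart at the regime boundary (overshoot discarded)
        while True:
            t += rng.expovariate(rate)  # mean interarrival = 1/rate
            if t >= end:
                break
            events.append(t)
        offset = end
    return events

# An epoch shaped like Synthetic Data 1 (placeholder rates, in days):
# medium (first half day), low (two days), high (last half day).
trace = gen_trace([(0.5, 100.0), (2.0, 5.0), (0.5, 200.0)])
```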

Profile and notification rule. We used the profile template "RSS_Monitoring" in Fig. 3 as a basis, with the "Num_Update_Watch" notification rule. We vary the values of X = 1, ..., 5. For the life parameter, we varied window(Y) with Y ∈ {0, ..., 100} chronons. We also consider a life parameter of overwrite.

Update model. As described in Section 3.3, we use updatemodels to estimate the update pattern at a server, and to


2. Note that this estimate may be higher or lower than the current parameter.
3. The trace is available at http://ie.technion.ac.il/~avigal/trace.zip.

TABLE 1. Summary of the Experiment Parameters

TABLE 2. Summary of the Data Sets


trigger monitoring of servers according to profiles. The estimated pattern may not coincide with the actual update events at the server. Thus, the choice of an update model has an impact on profile satisfaction. We used two different update models to represent updates at servers and modeled each of the three data sets with both models, as follows:

• Poisson update model: Following [14], we devised an update model as a nonhomogeneous Poisson process. We thus have a Poisson process with instantaneous arrival rate λ : ℝ → [0, ∞) as a model of the occurrence of update events. The number of update events occurring in any interval (s, f] is assumed to be a Poisson random variable with expected value Λ(s, f) = ∫_s^f λ(t) dt.
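For a piecewise-constant intensity, the expected value Λ(s, f) reduces to a sum of rate × overlap terms. A minimal sketch, where representing λ as (start, end, rate) triples is an assumption:

```python
def expected_updates(pieces, s, f):
    """Expected number of updates Lambda(s, f) = integral over (s, f]
    of lambda(t) dt, for a piecewise-constant intensity given as
    (start, end, rate) triples."""
    total = 0.0
    for start, end, rate in pieces:
        lo, hi = max(s, start), min(f, end)
        if lo < hi:
            total += rate * (hi - lo)  # constant rate over the overlap
    return total
```

For instance, with rate 2.0 on [0, 10) and 0.5 on [10, 20), the interval (5, 15] has expected value 2.0 · 5 + 0.5 · 5 = 12.5.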

• False positives and false negatives (FPN) update model: Following [24], we devised the FPN update model. Given a stream of updates, the probability pi,j is assigned the value 1 if resource ri is updated at chronon Tj. Once the probabilities are defined, we add noise to the probability model as follows: given an error factor Z ∈ [0, 1], the value of pi,j is switched from 1 to 0 with probability 1 − Z (so Z = 1 yields a noise-free model). Then, for each modified pi,j, a new chronon Tj′ is randomly selected and the value of pi,j′ is set to 1. Note that FPN can be applied to any data trace, regardless of its true stochastic pattern.
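A sketch of the FPN noise injection as described above, with the flip probability written as 1 − Z so that Z = 1.0 reproduces the trace exactly; function and parameter names are illustrative:

```python
import random

def fpn_model(update_chronons, n_chronons, z, seed=0):
    """Build an FPN probability vector from an actual update trace.

    Each true update chronon j gets p[j] = 1; with probability (1 - z)
    that '1' is moved to a randomly chosen chronon. Thus z = 1.0
    reproduces the trace exactly, and smaller z adds more noise.
    """
    rng = random.Random(seed)
    p = [0] * n_chronons
    for j in update_chronons:
        p[j] = 1
    for j in update_chronons:
        if rng.random() < 1.0 - z:            # inject a false negative ...
            p[j] = 0
            p[rng.randrange(n_chronons)] = 1  # ... and a false positive
    return p
```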

While the Poisson update model can be used to model real-world updates, where updates are predicted based on past observations (e.g., using update histories), the FPN model is actually a "synthetic" model that requires the complete stream of updates to construct. Therefore, the purpose of the FPN model in our experiments is to measure the sensitivity of SUP to update-model noise, i.e., noise attributed to an update model whose estimated updates sometimes deviate from the actual updates. Since the Poisson update model is generated from updates observed in the past in order to predict future updates, such noise may be present.

With three data sets, two update models (and parameter variations for FPN), and varying profile parameters, there is a large number of possible experiment configurations. In this work, we restrict ourselves to presenting results for the more "interesting" configurations.

Recall that an optimal schedule S* for SUP yields maximum utility. For the variety of update models and profile settings, the actual schedule S will possibly have a lower utility. We measure the effective utility of schedule S as the ratio of the utility of S to that of S*.

7.2 Impact of Profile Selection

In our first experiment, we report the impact of profile selection on the online effective utility of SUP. In this set of experiments, we do not allow the use of feedback, effectively setting the condition C(feedback(ri), T) to always be True.

Fig. 9 illustrates the results for variations of the profile template in Fig. 3. We vary the X value (the maximum number of updates a client can tolerate) from 1 to 5; this value is plotted on the x-axis. We also vary the life parameter, introducing four different life parameters: overwrite and window(Y) with Y = 0, 10, and 20 chronons. It is worth noting that Y = 0 generates execution intervals with a width of a single chronon, meaning that the event associated with each interval should be delivered immediately, without further delay, while the larger values Y = 10, 20 generate wider execution intervals, allowing some (constant) delay in notifications. Each life parameter is represented by a different curve. We chose the update model to be FPN with Z = 0.6. We present the results for the three data sets.

For all data sets and all values of X, as the value of Y increases, satisfying the profile becomes easier, since Y controls the window within which the profile can be satisfied. Hence, for higher Y, the effective utility increases. The value of X reflects the complexity of the profile. For example, for X = 4, the update model must accurately predict four updates. As the value of X increases, all of the update models have increasing cumulative error in estimating consecutive updates. Thus, for larger X, the effective utility decreases.

An interesting observation is that the performance of overwrite for Synthetic Data 1 and RSS was worse than that of window(Y = 10, 20), while for Synthetic Data 2, the performance was better. This indicates that with Synthetic Data 2, SUP is allowed more maneuvering space to monitor properly, probably due to the way updates are spread across the epoch. For the other two data sets, it seems that an average update event was overwritten within fewer than 10 chronons of its occurrence, while for the last data set, it took more than 20 chronons. Finally, as Y keeps increasing, the effective utility is expected to continue to increase as well, reaching the value of 1 as Y → N for any given X.

7.3 Impact of Update Model Selection

We next study how various parameter settings for the FPN model and the use of the Poisson model impact effective utility. Recall that we introduce stochastic variation into the FPN update model through its parameter Z whenever it is strictly less than 1. We present the performance of SUP for the "Num_Update_Watch" notification rule, with X = 1, ..., 5 and the overwrite life parameter. We use all three data sets to illustrate our results. Here, too, we do not allow the use of feedback.

In Fig. 10, SUP has 100 percent effective utility for Z = 1.0, since the FPN update model with Z = 1.0 accurately estimates all updates. As we decrease the parameter Z from 1.0 to 0.4, more variance is added, and effective utility is expected to decrease. We observe that for all update models, the effective utility decreases for higher X values. This is because all update models have increasing difficulty in predicting four or five consecutive updates.

We observe that the Poisson model behaves differently on the different data sets compared to the FPN model. For Synthetic Data 1, the effective utility of the Poisson model is more or less bounded by those of Z = 0.6 and Z = 0.8; this indicates that the Poisson model used reflects this data trace up to an error of about 20-40 percent. For Synthetic Data 2, the effective utility of the Poisson model is typically below the effective utility of FPN with Z = 0.4. This implies that the Poisson model had about 60 percent error. Finally, for the RSS data, the effective utility of the Poisson model appears to dominate all variations of the FPN model for which Z < 1.0 and X > 1, indicating that the Poisson model may best represent this RSS trace for complex profiles.


7.4 OptMon1 and OptMon2

Recall that while OptMon1 problems set hard constraints on system resources, OptMon2 aims at minimizing system resource utilization. Further, OptMon2 secures the full satisfaction of user specifications (given an accurate update model), while OptMon1 can only aim at maximizing it. Thus, we cannot compare solutions to OptMon1 and OptMon2 directly. Instead, we make the following indirect comparison: 1) we compare the system resource (probe) utilization of the different solutions; 2) given some level of system resource utilization, we compare the effective utility of the different solutions.

Both SUP and TTL are solutions to OptMon2. The TTL solution uses the server-provided TTL4 to determine when the next probe to a resource should occur in order to satisfy a profile. WIC [24] is a solution to the OptMon1 problem. Fig. 11 shows the system resource utilization and corresponding utility of the three algorithms. The experiment uses the Synthetic Data 1 data set, which contains 244 resources, with a profile of X = 1 and life = Overwrite for each resource. We add a parameter, denoted by M and used by WIC, to represent a system constraint on the total number of


Fig. 9. SUP performance for various profiles.

Fig. 10. SUP performance for various update models.

4. Such a TTL is part of the RSS 2.0 specification [28] and is used by RSS servers to suggest the next update time of an RSS channel.


probes allowed per chronon. Fig. 11a provides the analysis results for FPN with Z = 1.0, where updates occur at the expected update time as determined by the update model. Fig. 11b provides the execution results assuming a Poisson update model. It is worth noting that TTL does not take the update model into consideration, and therefore its performance remains the same in both Figs. 11a and 11b.

In Fig. 11, SUP and SUP(λ) are each represented by a single point in the graph. In Fig. 11a, SUP performs optimally, with an effective utility of 1.0. The optimal number of probes for SUP is 2,462 for this data set. We study WIC and TTL under various parameter settings; we consider 500, 1,000, and 2,000 for the number of chronons in an epoch T. The three curves WIC(N = {500, 1,000, 2,000}) represent these parameter settings for WIC, and TTL(N = {500, 1,000, 2,000}) for TTL. We also varied the M level for WIC. The x-axis represents the total number of probes, which is equal to N · M for WIC. Thus, for N = 500 chronons and M = 20, WIC consumes 10,000 probes. Similarly, with N = 1,000 chronons and M = 20, WIC consumes 20,000 probes. Given that TTL is allowed the same total number of probes as WIC (N · M), and assuming that there are n resources, all of equal importance, each resource was allocated N · M / n probes. The TTL value (in chronons) used for each resource's monitoring is then given by ⌊N / (N · M / n)⌋ = ⌊n / M⌋.

We observe that the effective utility of TTL is less than that of SUP, even with an increasing number of probes. The effective utility of WIC is less than that of both SUP and TTL.
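The probe accounting above can be checked in a few lines; the function names are illustrative:

```python
def total_probes(n_chronons, m):
    """WIC's probe budget: at most M probes in each of N chronons."""
    return n_chronons * m

def ttl_for_budget(m, n_resources):
    """TTL giving each of n equally important resources its fair share
    of the N*M budget: floor(N / (N*M/n)) = floor(n/M) chronons."""
    return n_resources // m

# From the text: N = 500, M = 20 gives 10,000 probes in total, and with
# n = 244 resources each one is probed every floor(244/20) = 12 chronons.
```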

We now focus on the same data set with the Poisson update model (Fig. 11b). For this data set and model, the effective utility of SUP is about 0.62 (about 62 percent of optimal). This corresponds to 3,904 probes; the effective utility is represented by a single point. In this case also, SUP performs better than both TTL and WIC for the same number of probes.

For all N values, WIC starts with low effective utility (less than 0.2), and as the number of probes increases, the utility increases monotonically for N = 500 and N = 1,000. For N = 2,000, WIC's effective utility sometimes drops as we increase the number of probes. In order to reach a utility of 0.62 (equivalent to that of SUP), WIC requires more than 20,000 probes, which is approximately five times the resource consumption of SUP. TTL also starts low, yet higher than WIC, and its effective utility likewise increases as the number of probes increases. TTL requires more than 7,000 probes to reach an effective utility of 0.62, which is approximately 1.8 times that of SUP. For both update models, we observe that TTL has better effective utility than WIC for the same total number of probes. The reason is that TTL, unlike WIC, has no upper bound of M resources per chronon and can actually probe all resources at once.

The relatively low effective utility indicates that predicting an update event may not be very accurate, serving as an empirical justification for the introduction of feedback into SUP. Fig. 11b shows that SUP(λ), which uses feedback more aggressively than SUP, manages to improve the effective utility by more than 15 percent, with an increase in the number of probes. We compare SUP and SUP(λ) in more detail in Section 7.6.

7.5 Impact of Adaptiveness

We performed experiments on all three data sets and the various update models, comparing SUP with and without the use of feedback.

Fig. 12 illustrates the impact of feedback on the RSS data set with life = overwrite. Fig. 12a presents the increase in relative utility when using feedback for four variations of FPN. For Z = 1.0, SUP's performance is optimal, and therefore feedback cannot improve the schedule. For smaller Z values, feedback does not improve the performance for X = 1. This is because the performance with and without feedback converges to generating the same execution intervals. Therefore, the execution intervals generated using feedback always coincide with SUP's existing execution intervals, and no additional monitoring tasks are issued by the SUP algorithm. For larger X values, however, feedback improves performance significantly (for this data set, by up to 200 percent for X = 4 and Z = 0.8).

The cost of feedback is presented in Fig. 12b. Again, for Z = 1.0, no modification to the schedule is needed and feedback adds no extra probes. For the other models, and for X > 1, the improvement in effective utility comes at a cost, albeit not a big one. For example, for Z = 0.4 and X = 5, the increase in the number of probes was 71 percent (compare with a 90 percent increase in effective utility). It is noteworthy that the increases in effective utility and in probing are not necessarily correlated. For example, for X = 4, the effective utility for Z = 0.4 drops while the number of probes slightly increases.


Fig. 11. SUP, WIC, and TTL for Synthetic Data 1 data set for (a) FPN(1) and (b) Poisson.


7.6 SUP versus SUP(λ)

Fig. 13 compares SUP and SUP(λ). We used the RSS data with life = window(Y = 50), 1 ≤ X ≤ 10, and the Poisson model. The light-colored line shows the improvement in terms of effective utility. The dark line shows the improvement in terms of the number of monitoring tasks, where improvement means fewer probes.

The results show that SUP(λ) consistently improves on SUP. It is worth noting that even in the case of X = 1, SUP(λ) manages to improve on SUP with a mild increase in the number of probes. Fig. 11b also shows this improvement. It also shows that SUP(λ) dominates WIC for all variants but one (WIC 500 with 65,000 probes), and many of the TTL variants as well.

As for the increase in probes, we observe that for X = 1, 2, 3, SUP(λ) requires slightly more probes than SUP, but for X > 4, SUP(λ) manages to produce higher effective utility while reducing SUP's cost. Although SUP(λ) is a heuristic solution with no performance guarantees, these results remain consistent for other parameter settings as well (not shown in this paper).

8 RELATED WORK

Pull-based freshness policies require clients to contact servers to check for updates to objects. Such policies have been proposed in many contexts, such as Web caching and synchronizing collections of objects, e.g., Web crawlers. There has been much research in the Web caching community on pull-based freshness policies [15]. These policies typically rely on heuristics to estimate the freshness of a cached object, for example, estimating freshness as a function of the last time the object was modified. Other works [19], [14] have proposed the use of an update model to represent update arrival in stochastic terms. Pull-based freshness has also been addressed in the context of synchronizing a large collection of objects, e.g., Web crawlers [4], [8]. These works propose policies for prefetching objects from remote sources to maximize the freshness of objects in the cache. The goal is to refresh a collection of objects offline, rather than to handle client requests online.

Quality-driven data delivery involves the design of efficient algorithms for data delivery subject to system and user constraints. Designing such algorithms is harder in pull-based settings than in push-based ones, since the update process is known only in stochastic terms. We next present a set of dimensions (see Fig. 14 for an illustrative comparison) to classify pull-based approaches, followed by an overview of some existing approaches. We then classify each approach along these dimensions and discuss the limitations of existing approaches and research challenges.

The first dimension we consider is when objects are refreshed: asynchronously, on demand, or some combination of the two. The approaches in [4], [7], [24] are purely asynchronous and refresh data independently of client requests. Others, e.g., L-R Profiles [15], [3], are purely on demand and only refresh objects when they are requested by clients. Finally, approaches such as Prevalidation [10], [23], [17] lie between these two extremes and perform both asynchronous and on-demand data access.

The second dimension is the objective and the constraints of the problem. We group these together along the y-axis in Fig. 14. The objective is the value to be optimized, e.g., data


Fig. 12. Impact of feedback for RSS data, life = overwrite.

Fig. 13. Relative performance of SUP and SUP(λ) for RSS data, life = window(50).

Fig. 14. Classification of existing pull-based policies along several dimensions.


recency or client utility, and the constraints are limitations, e.g., bandwidth. By utility we mean some client-specified function that measures the value of an object to a client, based on a metric such as data recency (e.g., [3]) or importance to the client (e.g., [6]). We now present several existing approaches and describe how we classify them along the above dimensions.

On-demand approaches.

• TTL: TTL [15] is commonly used to maintain freshness of object copies for applications such as on-demand Web access. Each object is assigned a Time-to-Live (either server-defined or estimated using heuristics), and any object requested after this time must be validated at a server to check for updates. TTL aims to maximize the recency of data and assumes no bandwidth constraints. Thus, we classify it as (on demand, recency, none).

• TTL with prevalidation (TTL-Prevalidation): Prevalidation [10] extends TTL by asynchronously validating expired cached objects in the background. As in TTL, the goal is to maximize data recency. This approach assumes limits on the amount of bandwidth for prevalidation but, as in TTL, assumes no bandwidth constraints for on-demand requests.

• Latency-Recency Profiles (L-R Profiles): Latency-recency profiles [3] are a generalization of TTL that allows clients to explicitly trade off data recency against latency using a utility function. The objective is to maximize the utility of all client requests. This policy assumes no bandwidth constraints. We classify it as (on demand, utility, none).

• Profile-Driven Cache Management (PDCM): Profile-driven cache management [6] enables data recharging for clients with intermittent connectivity. Clients specify profiles describing the utility of each object. The objective is to download a set of objects that maximizes client utility while the client is connected. PDCM does not consider updates to objects.

Asynchronous approaches.

• Cache Synchronization (Synch): The objective of cache synchronization [7] is to maximize the average recency of a set of objects in a cache, subject to constraints on the number of objects that can be synchronized (for simplicity, we express this as a bandwidth constraint). This approach does not incorporate client utility or preferences into the decision. Application-aware cache synchronization (AA-Synch) [4] improves upon this by taking object popularity into account. In [22], a cooperative approach between a cache and its data sources is presented that aims at offering best-effort cache synchronization under bandwidth constraints.

• WIC: WIC [24] aims to monitor updates to a set of information sources subject to bandwidth constraints. The objective is to capture updates to a set of objects, rather than to maximize the average freshness of a cache as in cache synchronization [7]. This approach does not consider client requests or client utility (utility is given only in terms of the server's ability to capture updates). Thus, we classify it as (asynchronous, recency, bandwidth).

SUP is also classified in Fig. 14. It is an asynchronous algorithm. Following the dual approach presented in this paper, SUP is classified as an algorithm that aims at minimizing bandwidth while keeping optimal utility as its constraint.

SUP(�) uses feedback to modify the model itself usinglocal and transient changes to the model. Alternatingbetween predefined models was suggested in [2], where amechanism to choose between two possible update modelsis established. Such a mechanism was suggested to handlebursts of updates.

9 CONCLUSIONS

In this work, we focused on pull-based data delivery that supports user profile diversity. Minimizing the number of probes to sources is important for pull-based applications in order to conserve resources and improve scalability. Solutions that can adapt to changes in source behavior are also important, given the difficulty of predicting when updates occur. In this paper, we have addressed these challenges through a new formalism of a dual optimization problem (OptMon2), reversing the roles of user utility and system resources. This revised specification leads naturally to a surprisingly simple yet powerful algorithm (SUP), which satisfies user specifications while minimizing system resource consumption. We have formally shown that SUP is optimal for OptMon2 and, under certain restrictions, can be optimal for OptMon1 as well. We have empirically shown, using RSS data traces as well as synthetic data, that SUP can satisfy user profiles and capture more updates than existing policies. SUP is adaptive and can dynamically change monitoring schedules. Our experiments show that using feedback in SUP improves performance with a moderate increase in the number of needed probes.

We believe that the main impact of this work will be in what is now known as the Internet of Things, where sensor data are collected, analyzed, and utilized in many different ways based on users' needs. With the Internet of Things, user profiles and their satisfaction dictate the way data are utilized, and monitoring sensor data efficiently is a mandatory prerequisite for the creation of any information system that is based on such data.

OptMon2 is defined in such a way that satisfaction of a user profile is a hard constraint. However, a profile may sometimes state preferences rather than hard constraints. Extending the problem to handle profile preferences poses a new challenge. Adding preferences was discussed in [26], where a trade-off was suggested between completeness (which is defined as a hard constraint in this work) and delay of information delivery. This specification yields a biobjective problem definition (both client satisfaction and utility maximization). The algorithmic solution changes to identifying the Pareto curve of feasible, pairwise nondominated solutions. Another way of adding preferences to this work is by redefining utility (set to completeness in this work) to include a variety of dimensions, combined through some linear or other combination. We consider this problem another challenge and an avenue for future research.

In future work, we shall also consider how to incorporate resource constraints into SUP. We shall investigate the optimal positioning of monitoring tasks in an execution


interval, maximizing the probability of satisfying user profiles given the stochastic nature of the update model. Finally, we shall investigate the changes to our algorithmic solution when nonstrict utilities are present.

REFERENCES

[1] A. Adi and O. Etzion, “Amit—The Situation Manager,” Int’l J.Very Large Data Bases, vol. 13, no. 2, pp. 177-203, May 2004.

[2] L. Bright, A. Gal, and L. Raschid, “Adaptive Pull-Based Policiesfor Wide Area Data Delivery,” ACM Trans. Database Systems,vol. 31, no. 2, pp. 631-671, 2006.

[3] L. Bright and L. Raschid, “Using Latency-Recency Profiles forData Delivery on the Web,” Proc. Int’l Conf. Very Large Data Bases(VLDB), pp. 550-561, Aug. 2002.

[4] D. Carney, S. Lee, and S. Zdonik, “Scalable Application-AwareData Freshening,” Proc. IEEE CS Int’l Conf. Data Eng., pp. 481-492,Mar. 2003.

[5] L.S. Chandran, L. Ibarra, F. Ruskey, and J. Sawada, “Generatingand Characterizing the Perfect Elimination Orderings of a ChordalGraph,” Theoretical Computer Science, vol. 307, no. 2, pp. 303-317,2003.

[6] M. Cherniack, E. Galvez, M. Franklin, and S. Zdonik, “Profile-Driven Cache Management,” Proc. IEEE CS Int’l Conf. Data Eng.,pp. 645-656, Mar. 2003.

[7] J. Cho and H. Garcia-Molina, “Synchronizing a Database toImprove Freshness,” Proc. ACM SIGMOD, pp. 117-128, May 2000.

[8] J. Cho and A. Ntoulas, “Effective Change Detection UsingSampling,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2002.

[9] “CNN Top Stories RSS Feed,” http://rss.cnn.com/services/rss/cnn_topstories.rss, 2010.

[10] E. Cohen and H. Kaplan, “Refreshment Policies for Web ContentCaches,” Proc. IEEE INFOCOM, pp. 1398-1406, Apr. 2001.

[11] U. Dayal et al., “The HiPAC Project: Combining Active Databasesand Timing Constraints,” SIGMOD Record, vol. 17, no. 1, pp. 51-70,Mar. 1988.

[12] P. Deolasee, A. Katkar, P. Panchbudhe, K. Ramamritham, and P.Shenoy, “Adaptive Push-Pull: Disseminating Dynamic Web Data,”Proc. Int’l World Wide Web Conf. (WWW), pp. 265-274, May 2001.

[13] J. Eckstein, A. Gal, and S. Reiner, “Optimal Information Monitor-ing under a Politeness Constraint,” Technical Report RRR 16-2005,RUTCOR, Rutgers Univ., May 2005.

[14] A. Gal and J. Eckstein, “Managing Periodically Updated Data inRelational Databases: A Stochastic Modeling Approach,” J. ACM,vol. 48, no. 6, pp. 1141-1183, 2001.

[15] J. Gwertzman and M. Seltzer, “World Wide Web CacheConsistency,” Proc. USENIX Ann. Technical Conf., pp. 141-152,Jan. 1996.

[16] “BlackBerry Wireless Handhelds,” http://www.blackberry.com,2010.

[17] Z. Jiang and L. Kleinrock, “Prefetching Links on the WWW,” Proc.IEEE Int’l Conf. Comm., 1997.

[18] G. Kappel, S. Rausch-Schott, and Retschitzegger, “BeyondCoupling Modes: Implementing Active Concepts on Top of aCommercial OODBMS,” Object-Oriented Methodologies and Systems,S. Urban and E. Bertino, eds., pp. 189-204. Springer-Verlag, 1994.

[19] J.-J. Lee, K.-Y. Whang, B.S. Lee, and J.-W. Chang, “An Update-RiskBased Approach to TTL Estimation in Web Caching,” Proc. Conf.Web Information Systems Eng. (WISE), pp. 21-29, Dec. 2002.

[20] C. Liu and P. Cao, “Maintaining Strong Cache Consistency on theWorld Wide Web,” Proc. Int’l Conf. Distributed Computing Systems(ICDCS), 1997.

[21] H. Liu, V. Ramasubramanian, and E.G. Sirer, “Client and Feed Characteristics of RSS, a Publish-Subscribe System for Web Micronews,” Proc. Internet Measurement Conf. (IMC), Oct. 2005.

[22] C. Olston and J. Widom, “Best-Effort Cache Synchronization with Source Cooperation,” Proc. ACM SIGMOD, pp. 73-84, 2002.

[23] V. Padmanabhan and J. Mogul, “Using Predictive Prefetching to Improve World Wide Web Latency,” ACM SIGCOMM Computer Comm. Rev., vol. 26, no. 3, pp. 22-36, July 1996.

[24] S. Pandey, K. Dhamdhere, and C. Olston, “WIC: A General-Purpose Algorithm for Monitoring Web Information Sources,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 360-371, Sept. 2004.

[25] “ProMo Language Specification,” http://ie.technion.ac.il/~avigal/ProMoLang.pdf, 2010.

[26] H. Roitman, A. Gal, and L. Raschid, “Capturing Approximated Data Delivery Tradeoffs,” Proc. IEEE CS Int’l Conf. Data Eng., 2008.

[27] “RSS,” http://www.rss-specifications.com, 2010.

[28] J.L. Wolf, M.S. Squillante, P.S. Yu, J. Sethuraman, and L. Ozsen, “Optimal Crawling Strategies for Web Search Engines,” Proc. Int’l World Wide Web Conf. (WWW), pp. 136-147, 2002.

[29] E. Yashchin, “Change-Point Models in Industrial Applications,” Nonlinear Analysis, vol. 30, pp. 3997-4006, 1997.

[30] J. Yin, L. Alvisi, M. Dahlin, and A. Iyengar, “Engineering Server-Driven Consistency for Large Scale Dynamic Web Services,” Proc. Int’l World Wide Web Conf. (WWW), pp. 45-57, May 2001.

Haggai Roitman received the BSc degree in information systems engineering and the PhD degree in information management engineering from the Technion in 2004 and 2009, respectively. He is a research staff member at IBM Haifa Research Lab (HRL), where he works in the Information Retrieval Solutions Group. His main research interests lie at the boundary between dynamic data management (e.g., Web monitoring) and content management (e.g., content analysis and content dissemination networks), Web 2.0 data management, and data integration. He is also an adjunct lecturer in the William Davidson Faculty of Industrial Engineering and Management, Technion. He has published several papers in leading conferences (e.g., VLDB, ICDE, SIGIR, CIKM, and JCDL). In his free time, he enjoys mastering his DJ skills.

Avigdor Gal received the DSc degree in the area of temporal active databases in 1995 from the Technion—Israel Institute of Technology, where he is an associate professor. He has published more than 80 papers in journals (e.g., Journal of the ACM (JACM), ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Internet Technology (TOIT), and VLDB Journal), books (Temporal Databases: Research and Practice), and conferences (ICDE, ER, CoopIS, and BPM) on the topics of data integration, temporal databases, information systems architectures, and active databases. He is a steering committee member of IFCIS, a member of IFIP WG 2.6, and a recipient of the IBM Faculty Award for 2002-2004. He is a member of the ACM and a senior member of the IEEE.

Louiqa Raschid received the bachelor’s degree from the Indian Institute of Technology, Chennai, in 1980, and the PhD degree from the University of Florida in 1987. She is a professor at the University of Maryland. She has published more than 140 papers in the leading conferences and journals in databases, scientific computing, Web data management, bioinformatics, and AI, including ACM SIGMOD, VLDB, AAAI, IEEE ICDE, ACM Transactions on Database Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Computers, and the Journal of Logic Programming. Her research has received multiple awards, including more than 25 grants from the US National Science Foundation (NSF) and the US Defense Advanced Research Projects Agency (DARPA). Papers that she coauthored have been nominated for or have won Best Paper Awards at the 1996 International Conference on Distributed Computing Systems, the 1998 International Conference on Cooperative Information Systems, and the 2008 International Conference on Data Integration in the Life Sciences. She has been recognized as an ACM Distinguished Scientist. She has chaired or served on multiple IEEE and ACM program committees and on the editorial boards of the VLDB Journal, ACM Computing Surveys, ACM Journal on Data and Information Quality, Proceedings of the VLDB, the INFORMS Journal on Computing, and IEEE Transactions on Knowledge and Data Engineering. She has played a key role in the Sahana FOSS project for disaster information management, including serving as its chief database architect and board chair (2006-2008). Sahana is the only comprehensive product for disaster information management.
