12
A Query Processor for Prediction-Based Monitoring of Data Streams Sergio Ilarri IIS Department Univ. of Zaragoza María de Luna 1 50018 Zaragoza, Spain [email protected] Ouri Wolfson CS Department Univ. of Illinois Chicago, IL, 60607 [email protected] Eduardo Mena IIS Department Univ. of Zaragoza María de Luna 1 50018 Zaragoza, Spain [email protected] Arantza Illarramendi LSI Department Univ. of the Basque Country Apdo. 649 20080 San Sebastián, Spain [email protected] Prasad Sistla CS Department Univ. of Illinois Chicago, IL, 60607 [email protected] ABSTRACT Networks of sensors are used in many different fields, from industrial applications to surveillance applications. A com- mon feature of these applications is the necessity of a mon- itoring infrastructure that analyzes a large number of data streams and outputs values that satisfy certain constraints. In this paper, we present a query processor for monitoring queries in a network of sensors with prediction functions. Sensors communicate their values according to a threshold policy, and the proposed query processor leverages predic- tion functions to compare tuples efficiently and to generate answers even in the absence of new incoming tuples. Two types of constraints are managed by the query processor: window-join constraints and value constraints. Uncertainty issues are considered to assign probabilistic values to the results returned to the user. Moreover, we have developed an appropriate buffer management strategy, that takes into account the contributions of the prediction functions con- tained in the tuples. We also present some experimental results that show the benefits of the proposal. 1. INTRODUCTION There has been a great interest in techniques for monitor- ing networks of sensors in a variety of contexts (e.g., fleet tracking, monitoring the levels of certain gases in a chem- ical environment, etc.). The common feature of monitor- ing applications for networks of sensors is the need of han- dling a continuous supply of data streams. For this pur- pose, many works propose the use of a Data Stream Manage- ment System (DSMS) [3], in a monitoring computer, which Permission to copy without fee all or part of this material is granted pro- vided that the copies are not made or distributed for direct commercial ad- vantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the ACM. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM. EDBT 2009, March 24–26, 2009, Saint Petersburg, Russia. Copyright 2009 ACM 978-1-60558-422-5/09/0003 ...$5.00 implements suitable non-blocking techniques to process un- bounded amounts of data. In this context, sensors send their data as continuous and in- dependent streams of data. As the values they measure can change very frequently (e.g., a GPS receiver in a car mea- sures its location, that is changing constantly), this could imply many communications of data from each sensor. On the one hand, this implies that the DSMS has to process a large amount of incoming tuples. On the other hand, there is a great usage of the network, which is specially important in wireless environments where communications are expensive and quickly drain the energy of wireless devices [22]. However, in many situations, a sensor can predict the val- ues that it will measure in the near future. For example, a location sensor (e.g., a GPS receiver) in a car could pre- dict future locations by considering the current speed and route. Similarly, the temperature in a room will probably evolve during the day following some predictable patterns. Taking into account this situation, this work proposes to as- sociate each sensor value with a prediction function in a way that the sensor will update its value on the monitoring com- puter only when it differs significantly from the predicted value. This strategy reduces both the communication and the query processing efforts. Due to the use of prediction functions together with an update policy where only signif- icant values are transmitted, a pair of sensors could start satisfying a required constraint even when no new tuple is received from any of the sensors. The use of prediction func- tions requires the definition of a suitable query processor able to handle them efficiently and effectively. The main contributions of our work are: A query processor is defined, that processes queries on data streams with prediction functions. Two types of constraints are considered: value con- straints and window-join constraints. In particular, window-join constraints have not been studied in other works. 415

A Query Processor for Prediction-Based Monitoring of Data ...boxu/domino/edbt09-data-stream.pdf · A Query Processor for Prediction-Based Monitoring of Data Streams Sergio Ilarri

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

A Query Processor for Prediction-Based Monitoring ofData Streams

Sergio IlarriIIS Department

Univ. of ZaragozaMaría de Luna 1

50018 Zaragoza, [email protected]

Ouri WolfsonCS DepartmentUniv. of Illinois

Chicago, IL, [email protected]

Eduardo MenaIIS Department

Univ. of ZaragozaMaría de Luna 1

50018 Zaragoza, [email protected]

Arantza IllarramendiLSI Department

Univ. of the Basque CountryApdo. 649

20080 San Sebastián, [email protected]

Prasad SistlaCS DepartmentUniv. of Illinois

Chicago, IL, [email protected]

ABSTRACTNetworks of sensors are used in many different fields, fromindustrial applications to surveillance applications. A com-mon feature of these applications is the necessity of a mon-itoring infrastructure that analyzes a large number of datastreams and outputs values that satisfy certain constraints.

In this paper, we present a query processor for monitoringqueries in a network of sensors with prediction functions.Sensors communicate their values according to a thresholdpolicy, and the proposed query processor leverages predic-tion functions to compare tuples efficiently and to generateanswers even in the absence of new incoming tuples. Twotypes of constraints are managed by the query processor:window-join constraints and value constraints. Uncertaintyissues are considered to assign probabilistic values to theresults returned to the user. Moreover, we have developedan appropriate buffer management strategy, that takes intoaccount the contributions of the prediction functions con-tained in the tuples. We also present some experimentalresults that show the benefits of the proposal.

1. INTRODUCTIONThere has been a great interest in techniques for monitor-ing networks of sensors in a variety of contexts (e.g., fleettracking, monitoring the levels of certain gases in a chem-ical environment, etc.). The common feature of monitor-ing applications for networks of sensors is the need of han-dling a continuous supply of data streams. For this pur-pose, many works propose the use of a Data Stream Manage-ment System (DSMS) [3], in a monitoring computer, which

Permission to copy without fee all or part of this material is granted pro-vided that the copies are not made or distributed for direct commercial ad-vantage, the ACM copyright notice and the title of the publication and itsdate appear, and notice is given that copying is by permission of the ACM.To copy otherwise, or to republish, to post on servers or to redistribute tolists, requires a fee and/or special permissions from the publisher, ACM.EDBT 2009, March 24–26, 2009, Saint Petersburg, Russia.Copyright 2009 ACM 978-1-60558-422-5/09/0003 ...$5.00

implements suitable non-blocking techniques to process un-bounded amounts of data.

In this context, sensors send their data as continuous and in-dependent streams of data. As the values they measure canchange very frequently (e.g., a GPS receiver in a car mea-sures its location, that is changing constantly), this couldimply many communications of data from each sensor. Onthe one hand, this implies that the DSMS has to process alarge amount of incoming tuples. On the other hand, there isa great usage of the network, which is specially important inwireless environments where communications are expensiveand quickly drain the energy of wireless devices [22].

However, in many situations, a sensor can predict the val-ues that it will measure in the near future. For example,a location sensor (e.g., a GPS receiver) in a car could pre-dict future locations by considering the current speed androute. Similarly, the temperature in a room will probablyevolve during the day following some predictable patterns.Taking into account this situation, this work proposes to as-sociate each sensor value with a prediction function in a waythat the sensor will update its value on the monitoring com-puter only when it differs significantly from the predictedvalue. This strategy reduces both the communication andthe query processing efforts. Due to the use of predictionfunctions together with an update policy where only signif-icant values are transmitted, a pair of sensors could startsatisfying a required constraint even when no new tuple isreceived from any of the sensors. The use of prediction func-tions requires the definition of a suitable query processorable to handle them efficiently and effectively. The maincontributions of our work are:

• A query processor is defined, that processes queries ondata streams with prediction functions.

• Two types of constraints are considered: value con-straints and window-join constraints. In particular,window-join constraints have not been studied in otherworks.

415

• Uncertainty issues are considered and managed by thequery processor.

• An appropriate buffer management is performed todeal with storage limitations.

• An experimental study shows the interest of the pro-posal in a variety of conditions.

The rest of this paper is as follows. In Section 2, we describethe use of prediction functions and the format of tuples con-sidered. In Section 3, we explain the types of constraints pro-cessed by the proposed query processor: value constraintsand window-join constraints. In Section 4, we detail themain components of the query processor. In Section 5, weexplain how the uncertainty in the values measured by thesensors is managed. We present an experimental evaluationshowing the feasibility and performance of the query proces-sor in Section 6. The paper finishes with some related worksin Section 7 and conclusions and future work in Section 8.

2. SENSORS AND PREDICTIONSThis paper argues that predictions can be reasonably usedin a variety of contexts. For example, a GPS in a car couldsend not only the current location of the car but also itsvector of movement or expected trajectory [35]. Similarly,the values measured by a temperature sensor indoors are notexpected to change in normal conditions. The level of fuelor the distance traveled by a vehicle, the temperature of anarea, the altitude of an airplane, the intensity detected by alight sensor during the day, the number of persons entering amall over a certain period, etc., are examples of values thatcan be estimated with a prediction function (which can bebuilt using different techniques [15,27]).

Thus, in our proposal, every value measured by a sensor isattached to a prediction function that will be used to pre-dict future values of that sensor. In this way, the sensorwill only send significant values (e.g., values that differ fromthe predicted value by more than a certain threshold) to themonitoring computer instead of sending them continually,saving a great amount of wireless communication efforts atthe sensors. This is very important, as wireless communica-tions are expensive and drain quickly the energy of wirelessdevices [22]. A reduction in the number of communicatedvalues also leads to a decrease in the processing overhead atthe monitoring computer, increasing the scalability of thequery processing.

Data streams from sensors are composed of update-tuples:tpj=<si, typei, tsj, fj(t)>, where si is the identifier of asensor (assumed unique, as in many other works on queryprocessing in sensor networks, such as [5, 12, 13, 26]), typei

is the type of value it measures, tsj is the (implicit or ex-plicit [33]) timestamp of the update, and fj(t) is a predic-tion function that, given a certain time instant t, retrievesthe expected sensor value at that time. The value of thesensor at the update time is given by fj(tsj). It should benoted that sometimes several prediction functions can be in-cluded in the same tuple (e.g., in the experiments presentedin Section 6, the sensors measure two-dimensional values,corresponding to the horizontal and vertical coordinates ofmoving objects). As an example, <locCar20, location, 10,

{15 + 4.5t, 3t}> is a tuple sent by a location sensor (identi-fied by locCar20) in car20, moving with a horizontal speedof 4.5 m/s and a vertical speed of 3 m/s, whose locationa t = 10 is (60, 30). This function can be maintained bysome dead-reckoning method [34]: each vehicle sends its lo-cation and current speed to the monitoring computer, andthe speed and current location are sent again when the devi-ation between the actual location and the location estimatedusing the predicted function exceeds a threshold.

For simplicity, we consider that the sensors can estimate fu-ture values based on past measurements. So, a sensor canobtain and communicate a prediction function to the mon-itoring computer. Although there may be some processingcost associated to the estimation of prediction functions, itsuse will save communications by the sensors, which is a keyelement to reduce energy consumption. Moreover, the pro-posed query processor does not contradict proposals wherea prediction model is computed and assigned to the sensorsby a third party, such as [14]. Indeed, this is required whenthe prediction model is based on data not available at thesensor (e.g., based on values measured by sensors nearby).

In order to predict the value of a sensor at a given timeinstant, the last prediction function received from the sen-sor before that time instant must be applied. In otherwords, the prediction function of a tuple tpk is applicableduring the time interval between the timestamp of that tu-ple and that of the next tuple from the same sensor, whichis called the Interval of Applicability of the Prediction Func-tion: IAPFtpk

=[IAPFtpk.start, IAPFtpk

.end).

Sensors can follow a number of update policies [34] in orderto decide when a value is significant and should be commu-nicated to the monitoring computer. We advocate a thresh-old policy with a maximum period between updates. Withthis policy, every sensor commits to communicate an updatewhenever: 1) the difference between its current value and thevalue estimated using the last prediction function exceeds acertain threshold (correction update), or 2) the time elapsedsince the last update has exceeded a certain period T (heart-beat update). For example, a temperature sensor could com-municate its current value whenever the difference betweenthe predicted value at a given time instant and its real valueexceeds two Celsius degrees with at least one update everythree minutes. The update period T indicates the maximumamount of time during which a prediction function can beapplied. Outside that period, the prediction function is notreliable and it can be assumed that the sensor is unable tocommunicate new updates (e.g., it is not alive). Appropriatethreshold values can be specified depending on the Qualityof Service (QoS) requirements. Moreover, the threshold up-date policy could also be adaptive, i.e., a sensor could dy-namically adjust its threshold as a result of an assessment ofthe tradeoff between the communication cost and the cost ofthe imprecision of predictions [35] (in this case, the formatof tuples should be extended with the value of the thresholdconsidered by the sensor at the timestamp of the tuple).

3. TYPES OF CONSTRAINTSTwo types of query constraints, which may appear simul-taneously in a query, are considered by the proposed queryprocessor: value constraints and window-join constraints.

416

Constraints such as “the selected temperature sensors mustmeasure a temperature under 50F degrees” are called valueconstraints and are represented by vConstr(type, comp, K),where type is a type of sensor value, comp is a compara-tor among ≤, <, >, ≥, <>, and =, and K is a constant.The sample constraint above would be expressed as vCon-str(temperature, <, 50F).

Window-join constraints such as “the selected pairs of gassensors must measure a similar concentration of carbon diox-ide within an interval of 10 seconds” are more complex andare represented by wJConstr(type1, type2, w, comp, K),where type1 and type2 are two types of sensor values, wis called the valid-time window and specifies the maximumtime difference allowed between two values to be joined,and the rest of parameters are as explained for value con-straints. The sample constraint above would be expressed,using this syntax, as wJConstr(CO2, CO2, 10 seconds, >,1%), considering similar values those that differ less than1%. The relative-timestamp condition allows comparisonsbetween values as long as they refer to approximately thesame time instant. Valid-time windows are useful: 1) incases where timestamps are uncertain (such as when thesensors’ clocks are not precisely synchronized), and 2) alsoin some specific queries (e.g., retrieving pairs of buses thatarrive at the same stop within 30 minutes of each other).Other scenarios where window-join constraints can be use-ful include: in network data processing to detect accessesfrom the same IP at approximately the same time (it mayindicate a DOS attack) or other traffic patterns that oc-cur simultaneously, in financial applications to find trendswhen comparing stock quotes, to compare segments of tra-jectories of objects which move unsynchronizely, to monitorcredit card transactions, etc.

4. QUERY PROCESSOR MODULESIn this section, we describe the components of the proposedquery processor (shown in Figure 1): the Tuple Evaluator,the Buffer Manager, and the Prediction Validator. At theend of the section, we also explain how they interact.

4.1 Tuple EvaluatorAs opposed to what happens in traditional databases, que-ries must be evaluated in an incremental way in order to copewith a high arrival rate of data sent by an arbitrarily highnumber of sensor devices. Thus, whenever a new tuple is re-ceived by the Tuple Evaluator, it must verify how that tupleaffects the active queries (i.e., the continuous queries in ex-ecution). For queries that consist only of a value constraint,the new tuple is considered alone. For queries that involvea window-join constraint, the new tuple is compared withprevious tuples corresponding to the other type of valuesinvolved in the window-join constraint (potential matches).This incremental approach will detect all the answers to theconstraint, as it is event-driven instead of being based onperiodic evaluations at specific times.

The comparison of tuples for window-join constraints is per-formed in two stages: a filter step and an evaluation step. Inthe filter step, possible matches are filtered out based on theidea that two predicted values cannot match if they are notcomparable due to the difference in their timestamps. Forthose pairs of tuples that are not filtered out, the window-

Query Processor

a class of sensors

Ouput Data Streams

validation period

PROCESSINGTRIGGER

VALIDATORPREDICTION

MANAGERBUFFERqueries TUPLE

EVALUATOR

SENSORS

events alarms

Predictions

(centralized computer)(tuples)

Input Data Streams

SAMPLING

tClock

T_refreshment

tuples

t

new tuple

tuples from

tuples

warnings

tentative predictions

snapshots

CLIENTS

of tuplesbuffers

maxDelay

w’s

Figure 1: Modules of the query processor

join constraint must be evaluated over them to determinewhen they will match (if ever). In the evaluation step, linearprediction functions are considered1, so that the problem ofprocessing window-join constraints translates to comparingprediction functions by solving a system of linear inequal-ities, which can be performed efficiently. The correspond-ing system of linear inequalities for window-join constraintsand a couple of prediction functions fk(tk) = a · tk + b andfl(tl) = c · tl + d is shown in Figure 2, where T is the periodof the threshold update policy (see Section 2). A graphi-cal resolution of the system of linear inequalities is possible,by representing each constraint with a line, as shown in theexample of Figure 3. As a result of the comparison of twotuples, predicted tuples that will satisfy the constraint aregenerated. The cost of processing an incoming tuple for awindow-join constraint obviously depends on the number oftuples stored for the other sensor class involved in the joinand, especially, on the number of potential matches thatpass the filter step; for each non-filtered potential match, asystem of linear inequalities like the one shown in Figure 2must be solved. For more details about this process, theinterested reader is referred to [18,19].

For each active query, the Tuple Evaluator generates an out-put data stream of tuples that are predicted to satisfy therequired constraints in the future. The format of these out-put tuples, which is explained in the following, depends onwhether the query includes a window-join constraint or not.

1Many well-known estimation techniques, such as the lin-ear extrapolation, the double exponential smoothing, or theKalman filter, are linear models. Moreover, non-linear func-tions can be linearized in many cases [23].

417

fk(tk) − fl(tl) ≤ K → a · tk − c · tl + b − d ≤ K (1)fl(tl) − fk(tk) ≤ K → c · tl − a · tk + d − b ≤ K (2)tk − tl ≤ w (3)tl − tk ≤ w (4)tk ≥ IAPFtpk

.start (5)tk ≤ MIN(IAPFtpk

.end, IAPFtpk.start + T ) (6)

tl ≥ IAPFtpl.start (7)

tl ≤ MIN(IAPFtpl.end, IAPFtpl

.start + T ) (8)

Figure 2: Linear inequalities for window-joins

k

llPrediction function 2: g( )=2 − 3k kPrediction function 1: f( )=

8

validity interval=[0,12]

4 10 12

−4

−6 −4

6

−2

4

constr

aint (4

)

constraint (5)

constraint (2)

constraint (1)

14

cons

train

t (6)

l10

WINDOW−JOIN CONSTRAINT

w=4K=1With comparator <

(IAPF.start=0, IAPF.end=14)

(IAPF.start=0, IAPF.end=10)

constraint (8)

cons

train

t (7)

constr

aint (3

)

timestamp−matchingboundary

t

t t t t

(0,2)

(12,8)

(10,6)

(0,1)

t

(−6,−2)

Figure 3: Timestamp-matching boundary

For queries that include a constraint about pairs of sen-sors, the format of an output predicted tuple is: <tpi, tpj ,VM=(I,P)>, where tpi and tpj are the input tuples thathave been joined and V M is the validity mark of the outputpredicted tuple. The validity mark indicates under whichconditions the predicted tuple applies; that is, when the pre-diction functions of tpi and tpj can be used to estimate thevalues of the corresponding sensors, values that will satisfythe constraint. The validity mark can be seen as a general-ization of the idea of validity period presented in [32], andit has two components:

• A validity interval I. It is the time interval duringwhich the values of the sensors match, according to thespecified constraint and the given prediction functions.

• A timestamp-matching boundary P . It is a polygonthat constraints the admissible combinations of timesti and tj for a match (as shown in Figure 3), where ti

is a time instant for the evaluation of the predictionfunction fi(t) in tpi, and tj is an evaluation time in-stant for fj(t) in tpj . This can be used, for example, tofind examples of values that match. More importantly,it is used to validate predicted tuples (see Section 4.3).

If the query consists only of a value constraint, the format ofpredicted tuples is: <tpi, VM=I>, where the validity markis given just by the validity interval I. For example, a tuple<s1, type1, t1, 3t+2, [5,15]> indicates that the sensor s1 sat-

isfies the constraint during the interval [5,15] by consideringthe prediction function 3t + 2 (with timestamp t1).

4.2 Buffer ManagerThe Tuple Evaluator communicates the tuples it receives toa module that is called the Buffer Manager. The BufferManager decides which input tuples will be stored (becausethey can be needed to answer monitoring queries) and whichones will be discarded. In this way, when the Tuple Evalua-tor receives a new tuple tpi from a sensor si of value type Si,it asks the Buffer Manager about tuples tpk of other types ofvalues involved in window-join constraints with Si, in orderto find possible matches. In the following, we first explainthe basic mechanism to decide whether it is interesting tostore a tuple or not. Then, we explain the policy appliedwhen there is insufficient storage space for all the tuples.

4.2.1 Temporal Width of a BufferThe tuples that should be stored in the buffers are deter-mined by the constraints of the active queries. As differenttypes of values are subjected to different query constraints, adifferent buffer is used for each type of value. In the buffer ofa certain type of value, only tuples with IAPFs intersectingwith a given time interval need to be stored. Thus, storingmore tuples would imply both a waste of storage space anda higher join processing cost. The temporal width of thebuffer for the type of value S is given by:

tWidth(S) =

MAX(w) + maxDelay if window-join(S)maxDelay otherwise

where window-join(S) returns true iff S is involved in awindow-join constraint, MAX(w) is the maximum w inthese constraints, and maxDelay is an estimation of themaximum delay of input tuples (i.e., a tuple with timestampt can arrive in the query processor as late as t+maxDelay).A tuple tpk for the type of value S must be kept in the bufferonly if:

([currentT ime−tWidth(S), currentT ime]T

IAPFtpk) 6= ∅

since, as mentioned in Section 2, a tuple is not only appli-cable at its timestamp but at any time instant within itsIAPF .

The Buffer Manager will periodically shift the windows oftuples stored in its buffers (removing the unneeded tuplesaccording to the previous considerations) in a process calledpurging. It must also manage problems derived from thelack of space to store new tuples. If there is not enoughspace to store an incoming tuple, purging is automaticallyactivated. If purging does not solve the problem, the relativeimportance of the prediction functions of the different tuplesis considered by following an appropriate replacement policy.This issue is explained in more detail in the following.

4.2.2 Dealing with Limited StorageIn a real environment, there will probably be a certain lim-ited amount of space available for tuples in the buffers. As

418

the rate of arrival of new tuples is unbounded, it is possiblethat some tuples that should be placed in the buffers (ac-cording to the previous explanation) cannot be physicallystored.

To deal with this problem, we propose a method that tradesthe need of additional storage space for a greater inaccuracy(which is unavoidable because of the storage limitation thatprevents the storage of all the tuples needed). In the eventthat the maximum allowed space for a buffer is reached,the Buffer Manager decides if a new incoming tuple shouldreplace another tuple in the buffer or be discarded. For thatpurpose, the contribution of each of the tuples stored in thebuffer, and also of the new tuple, is considered. In thisway, the tuple with the smallest contribution is discarded.Although several methods have been proposed to deal withstorage limitations in the context of data streams [3] (e.g.,random sampling, histograms and wavelets), none of themconsiders the peculiarities of data streams with predictionfunctions.

The proposed buffer replacement policy is based on the ideathat each tuple contributes to decrease the error of the sensorvalues predicted, as it provides an updated prediction func-tion. Thus, each tuple tpk can be associated to a value thatindicates its contribution Ctpk

. The contribution of a tu-ple tpk from sensor si is the decrease in the error committedwhen predicting values of si in the interval IAPFtpk

by usingthe prediction function in tpk instead of a prediction func-tion from a tuple with a contiguous IAPF . Thus, severalalternatives to predict values for the time interval IAPFtpk

are possible if tpk is removed (depending on whether a pre-vious tuple tpk−1 and/or a next tuple tpk+1 from the samesensor is available or not):

• If there is both a previous tuple tpk−1 and a next tupletpk+1 stored in the buffer, then there are two possibil-ities to choose:

1. The prediction function fk−1(t) of the previoustuple tpk−1 could be applied within IAPFtpk

. Inthat case, the IAPFtpk−1 should be modified asfollows:

IAPFtpk−1=[IAPFtpk−1.start, IAPFtpk.end)

By applying the out-of-date prediction functionfk−1(t) in tpk−1 instead of the updated predictionfunction fk(t) in tpk, a certain error ek−1

k arises,whose accumulated value can be computed as theabsolute value of the definite integral of the differ-ence of the prediction functions within IAPFtpk

:

ek−1k =

˛

˛

˛

˛

˛

Z IAPFtpk.end

IAPFtpk.start

fk(t) − fk−1(t) dt

˛

˛

˛

˛

˛

2. The prediction function fk+1(t) of the next tu-ple tpk+1 could be applied within IAPFtpk

. Inthat case, the IAPFtpk+1

should be modified asfollows:

IAPFtpk+1=[IAPFtpk

.start, IAPFtpk+1.end)

As the prediction function fk+1(t) in tpk+1 is usedto estimate values in IAPFtpk

(a past time in-terval), instead of using the prediction functionfk(t), an error ek+1

k arises, which can be com-puted similarly to the previous case:

ek+1k =

˛

˛

˛

˛

˛

Z IAPFtpk.end

IAPFtpk.start

fk(t) − fk+1(t) dt

˛

˛

˛

˛

˛

• If there is a next tuple tpk+1 but not a previous tu-ple tpk−1, then the next tuple tpk+1 could be appliedwithin the IAPFtpk

, as explained above.

• If there is a previous tuple tpk−1 but not a next tupletpk+1, then the previous tuple tpk−1 could be appliedwithin the IAPFtpk

, as explained above.

• Finally, if neither a previous nor a next tuple is avail-able in the buffer, then it is not possible to predictvalues within the IAPFtpk

(unless tpk is stored).

Based on the previous ideas, a priority queue is maintainedfor the tuples in the buffers, where the priority of a tupleis the inverse of its contribution. Thus, the tuple at thehead of the queue (one with the smallest contribution) isthe candidate to be removed in case of insufficient storage.According to the definition of contribution indicated beforeand the previous considerations, the contribution of a tupletpk is given by:

Ctpk=

8

<

:

MIN(ek−1k

, ek+1k

) if (∃ tpk+1 ) ∧ (∃ tpk−1)

ek+1k

if (∃ tpk+1) ∧ (¬∃ tpk−1)∞ otherwise

where an infinite contribution is assigned to a tuple if it is theonly tuple stored for the corresponding sensor (intuitively,if that tuple is removed then there is no way to predict thevalue for its sensor, as no other prediction function for thatsensor is available). A tuple with an infinite contributioncan only be selected as a victim to be removed from thebuffer in case there is no other tuple with a smaller con-tribution (the selection among several tuples with infinitecontribution is random). As shown in the previous formula,the contribution of a tuple tpk is also set to infinite when it isthe last tuple from that sensor (i.e., ¬∃ tpk+1), even though(as it has been indicated before) the previous tuple couldbe applied within the IAPF of such a tuple. The reason isthat the tuple with the greatest timestamp from each sensorshould be kept if possible in the buffer; otherwise, the pre-cision of future estimations will degrade. Thus, it should benoted that if the last tuple from a sensor is removed and theprevious tuple is extended to cover its IAPF , the predictionfunction of such a previous tuple will be the one consideredto compute the contribution of a new future incoming tuple(this is not precise but there is no other choice). By as-signing an infinite contribution to the latest tuple from eachsensor, this problem is minimized since such tuples will bestored if possible.

As the contribution of a tuple depends on its previous and/ornext tuple, the contribution is not fixed. On the contrary,it changes along time when new tuples are inserted or oldtuples are removed:

419

• Case 1: a new tuple tpk is inserted into the buffer. Inthis case, the contribution Ctpk

of the new tuple mustbe computed. In case there is a previous tuple tpk−1

(from the same sensor) in the buffer, its contributionCtpk−1

must also be updated (since the computationof the contribution of a tuple must consider the pre-diction function of its next tuple –if any–, as explainedbefore). Finally, if there is a next tuple tpk+1 (fromthe same sensor) its contribution must also be updated(as the contribution of a tuple may be affected by theprevious tuple).

• Case 2: a tuple tpk is removed from the buffer. Ifthere exists a previous tuple tpk−1 but not a next tupletpk+1 (for the same sensor) in the buffer, then only theprevious tuple can “cover the gap” and therefore theIAPFtpk−1

is extended by considering the IAPFtpk.

Similarly, if there exists a next tuple tpk+1 but not aprevious tuple tpk−1 in the buffer, then the IAPFtpk+1

is extended to cover the IAPFtpk. If there is both

a previous tuple and a next tuple in the buffer, theone considered to be applied within the IAPFtpk

isthe one that minimizes the accumulated error causedby the difference in the prediction functions (the ek−1

k

and ek+1k errors). After that, the contributions of the

previous and/or the next tuple (Ctpk−1, Ctpk+1

) mustbe recomputed. It should be noted that this wholeprocess is only applied when a tuple is removed due toinsufficient storage space. Thus, if a tuple is removeddue to purging (i.e., because it is not needed anymore),then only the next tuple must be checked to update itscontribution.

By recomputing the contributions as needed, the Buffer Ma-nager keeps up-to-date the information needed to select, as avictim to be evicted from a buffer, the tuple whose removal isexpected to have the smallest impact on the accuracy of thequery processor. The benefits of this buffer managementpolicy are evaluated experimentally in Section 6.3, whereother alternatives are also considered.

4.3 Prediction ValidatorThe Tuple Evaluator releases, for each query, predicted tu-ples about values that will satisfy the query. The communi-cation of another prediction function by any sensor involvedin an output predicted tuple could invalidate such predictedtuple. Therefore, the tuples obtained by the Tuple Evalua-tor are only tentative and will be considered validated onlywhen there is a guarantee that they cannot be found to bewrong later2. A predicted tuple from the Tuple Evaluatorcan be invalidated due to two reasons: 1) if it will not bevalid until a future time instant (according to the validityinterval of the predicted tuple), then an updated predictionfunction from any sensor involved will invalidate the previ-ous prediction function and, therefore, the predicted tuple;and 2) if a tuple that arrives late (i.e., it arrives “now” butit refers to a past time instant) invalidates the prediction.

A module called Prediction Validator can be plugged be-tween the output of the Tuple Evaluator and the input of a2Similar to the potential answers proposed in [25], which“may turn into current answers and be reported to theusers”.

client, with the goal of releasing only tuples that are guar-anteed to be valid3. Prediction functions in a predicted tu-ple allow to get predicted values that satisfy the query con-straints during the validity interval of the tuple. Assumingthat tuples are not received by the Tuple Evaluator laterthan maxDelay time units since they were released by sen-sors, a predicted value is considered committed/validatedmaxDelay time units after the timestamp of the predictionfunction that estimated it. Thus, the Prediction Validatorchecks the tuples it stores with a certain validation period(e.g., every second) to release valid tuples that can be in-ferred from them.

4.3.1 Validating Value ConstraintsFor predicted tuples with only one sensor identifier (i.e., cor-responding to queries with a single value constraint), thevalidity mark of the predicted tuple is just appropriatelymodified. As an example, for a predicted tuple <s1, type1,ts, 3t + 2, [5, 20]>, maxDelay=5, and the current time in-stant equals 15 time units, the tuple that would be releasedin the output data stream is <s1, type1, ts, 3t + 2, [5, 10]>.If the validation period is 1 time unit, then the next tuplewould be <s1, type1, ts, 3t + 2, [5, 11]> at time instant 16.If it is 2 time units, then it would be <s1, type1, ts, 3t + 2,[5, 12]> at time instant 17. A new tuple from sensor s1 couldinvalidate the predicted tuple at any time before t = 20 andstop the process.

4.3.2 Validating Window-Join ConstraintsPredicted tuples corresponding to queries with a window-join constraint (i.e., tentative tuples with two sensor iden-tifiers) are a little bit more difficult. Thus, the timestamp-matching boundary, which contains points (ti,tj) indicatingthe valid combinations of the timestamps for the sensorsmatched (as explained in Section 4.1), must be analyzed.In particular, the Prediction Validator must check if thesquare with vertices (t′ − w, t′ − w), (t′ − w, t′), (t′, t′ − w),(t′, t′), where t′ is the considered time instant (the currenttime instant minus maxDelay, since the Prediction Valida-tor releases at time t + maxDelay a tuple predicted validat time t) and w is the valid-time window, intersects thetimestamp-matching boundary. If there is an intersection,then one sensor at t′ matches the value of the other sensorat a time instant t′′, with t′ − w ≤ t′′ ≤ t′. Notice that itis not enough to check whether t′ is within the timestamp-matching boundary, since the value at t′ may match onlywith values of the other sensor at future time instants (there-fore, not validated yet). In case of match, the output tupleis released with a validity interval (and timestamp-matchingboundary) limited to that considered time instant t′.

As an example, tuples tp1=<s1, type1, ts, t> and tp2=<s2,type2, ts′, 2t− 3> match according to the window-join con-straint wJConstr(type1, type2, 4, ≤, 1) at t = 0; however,the value that matches for tuple tp2 is one second into thefuture; if that match is released at t = 0, a new predic-tion function from sensor s1, that invalidates it, could bereceived in the meanwhile. So, checking the validity intervalof a predicted tuple is not enough: the timestamp-matching

3In some contexts, releasing predicted (non-validated) re-sults could be interesting. Therefore, the Prediction Valida-tor can be turned off.

420

boundary must be checked too, as we have explained.

4.4 Interactions between ModulesWhen the Tuple Evaluator receives a new tuple tp1 fromsensor s1, of a type of value S1, it first communicates it tothe Buffer Manager, which computes the IAPFtp1

for thattuple and updates the IAPF of the previous tuple from thesame sensor. For each active query with constraints abouttype S1, the Tuple Evaluator generates an invalidation tu-ple for sensor s1 and the time interval IAPFtp1

, indicatingthat previous prediction functions for that sensor are notapplicable anymore in that interval. It also generates newpredicted tuples (if any), as explained in Section 4.1. Aquery could include several window-join constraints; for ex-ample, three sensor streams A, B, and C, related with twowindow-join constraints A-B and B-C can be processed byconsidering separately the joins A-B and B-C and then inte-grating the results. Possible optimizations for queries withmultiple joins are left as future work.

The tuples generated by the Tuple Evaluator are received bythe Prediction Validator. If the tuple received is an invali-dation tuple for a sensor, then the Prediction Validator willupdate the validity mark (validity interval and timestamp-matching boundary) of predicted tuples involving that sen-sor, in order not to include the invalidation interval for thatsensor (this could even invalidate the whole predicted tuple).If the tuple received is a predicted tuple, then the Predic-tion Validator will store it in the table of tentative tuplesand release it as required (see Section 4.3). It should benoted that the Tuple Evaluator could release either just aninvalidation tuple or also predicted tuples. The PredictionValidator also validates the tentative tuples according to therequired validation period (as explained in Section 4.3).

5. MANAGING UNCERTAINTYThe evaluation step as it is presented in Section 4.1 assumesthat there is no uncertainty in the predicted values. How-ever, the values predicted using the prediction functions oftuples are subject to a certain uncertainty, given by thethreshold update policy (see Section 2). This implies that ifthe value obtained using the prediction function is x and thethreshold used for updates is δ, then the real value can beany value in the interval [x−δ, x+δ]. In Figure 4, an exampleof how the uncertainty determines the possible real valuescorresponding to a certain predicted value is shown. Con-sidering the time interval between seconds three and four,it can be seen that the value of fk(t) will be in the range[3.2, 7] while fl(t) will be in the range [4.2, 7]. However, withno uncertainty the ranges would be [4.4, 5.4] and [5, 5.4] re-spectively.

The proposed query processor supports uncertainty man-agement. On the one hand, queries can be processed withdifferent semantics to take uncertainty into account. On theother hand, a tuple in an answer can be tagged with theprobability that the tuple satisfies the query constraints, ac-cording to the existing uncertainty. Moreover, the minimumprobability required for a tuple to be shown in the answercan be also specified as a query condition. Although someworks, such as [9], have studied probabilistic queries, westudy uncertainty issues in the context of data streams withprediction functions and focusing on the new window-join

f (t)

f (t)

3 4

3.2

7

uu

u =1

4.24.4

5.4

5.

value

time

l

k

u =1.212

u

u1

1

2

2

Figure 4: Uncertainty in the prediction functions

constraints. In the following, we explain how the uncer-tainty is managed by the query processor.

5.1 Query Semantics for UncertaintyTo take uncertainty into consideration, two different seman-tics are defined –may and must– for satisfaction of a query.For a given constraint, any of these two semantics could berequired. Such semantics were first proposed in [31] in thecontext of moving objects and were also used in [34]. Thissemantic distinction can be of great interest in sensor net-works. For example, we could need to know if the temper-ature increases above a certain level with logging purposes(must) or to detect the possibility of values of a dangerouschemical substance above a certain level (may).

In the following, we formally define the semantics of may andmust satisfaction for window-join queries, and we show howthe evaluation step (given in Section 4.1) can be modified toprocess the queries according to these two different seman-tics. The queries given by value constraints can be handledsimilarly under the two uncertainty semantics. Now, con-sider the following window-join constraint on two differentsensors s1, s2: |v1 − v2| ≤ K and |t1 − t2| ≤ w, where v1, v2

are the values of sensors s1 and s2 at times t1 and t2 respec-tively. Let [l1, u1] and [l2, u2] be the uncertainty intervalsfor the values of sensors v1 and v2 respectively (l1 and l2are the lower uncertainty bounds and u1 and u2 are the up-per uncertainty bounds). The values of sensor s1 at time t1and the value of sensor s2 at time t2 may satisfy the abovewindow-join constraint if |t1 − t2| ≤ w and ∃x ∈ [l1, u1] and∃y ∈ [l2, u2] such that |x−y| ≤ K. Note that this conditionis equivalent to requiring a non-empty intersection betweenthe intervals [l1 − K, u1 + K] and [l2, u2]. The followinglemma is easily seen from this observation.

Lemma 1: Assume that the uncertainty intervals for thevalues v1 and v2 of the sensors s1 and s2 at times t1 andt2 are given by [l1, u1] and [l2, u2], respectively. Then, theymay satisfy the window-join constraint |v1 − v2| ≤ K and|t1−t2| ≤ w iff l1−u2 ≤ K and l2−u1 ≤ K and |t1−t2| ≤ w.

In view of the above lemma, the evaluation step given in 4.1is modified for checking may satisfaction as follows. Let δk

and δl respectively be the thresholds used for the updates ofthe two sensors. Then, the intervals of uncertainties on the

421

predicted values of these two sensors are [fk(tk)−δk, fk(tk)+δk] and [fl(tl) − δl, fl(tl) + δl], respectively. Inequalities (1)and (2) in Section 4.1 (see Figure 2) are simply replaced bythe following inequalities:

a · tk − c · tl + b − d − δk − δl ≤ K

c · tl − a · tk + d − b − δl − δk ≤ K

Now, the must satisfaction is defined. The value of sensor s1

at time t1 and the value of sensor s2 at time t2 must satisfy(or definitely satisfy) the previous window-join constraint if|t1 − t2| ≤ w and ∀ x ∈ [l1, u1] and ∀ y ∈ [l2, u2] it is thecase that |x − y| ≤ K. In order for the above condition tobe satisfied it is necessary and sufficient that u2 ≤ l1 + K

and l2 ≥ u1 − K. Note that this condition implies that thelengths of the two intervals are bounded by 2K. From theseobservations, the following lemma is obtained.

Lemma 2: Assume that the uncertainty intervals for thevalues v1 and v2 of the sensors s1 and s2 at times t1 andt2 are given by [l1, u1] and [l2, u2], respectively. Then, theymust satisfy the window-join constraint |v1 − v2| ≤ K and|t1−t2| ≤ w iff |t1−t2| ≤ w and u2−l1 ≤ K and u1−l2 ≤ K.

Inequalities (1) and (2) in Section 4.1 (see Figure 2) aresimply replaced by the following inequalities for checkingthe must satisfaction:

c · tl − a · tk + d − b + δl + δk ≤ K

a · tk − c · tl + b − d + δk + δl ≤ K

5.2 Determining the Probability of a MatchAssuming a certain probability distribution of the real valuev1 ∈ [l1, u1] of a sensor s1 at time t1, the probability ofa match can also be computed. Let V1 denote the randomvariable corresponding to the value of s1 at time t1. Weassume that each value within the interval [l1, u1] is equallypossible for V1. So, the real value of sensor s1 is given bythe random variable V1 with the following density function:

PV1(v1) =

1u1−l1

if l1 ≤ v1 ≤ u1

0 otherwise

In order to compute the probability of satisfaction of a win-dow-join constraint for the values v1 and v2 of sensors s1

and s2 at time instants t1 and t2, v2 is considered to be arandom variable V2 whose density function has value 1

u2−l2

within the range [l2, u2] and zero otherwise. Now, a randomvariable Z = V1−V2 is defined. Assuming that the variablesare independent, the density function for this new variableis given by the convolution:

PZ(z) =

Z +∞

−∞

PV2(z − τ) · PV1

(τ) dτ

Since PV1(τ) is zero if τ is outside the interval [l1, u1], and

1u1−l1

otherwise, the following holds good:

PZ(z) =

Z u1

l1

PV2(z − τ) ·

1

u1 − l1dτ

Let us assume that u2− l2 ≤ u1− l1; if this is not satisfied v1

and v2 can be interchanged to satisfy this. It is not difficultto see that (l1 − u2) ≤ (l1 − l2) ≤ (u1 − u2) ≤ (u1 − l2).It can now be shown that the probability density functionPZ(z) has the shape of a trapezoid and is given as follows,where D = 1

(u1−l1)·(u2−l2):

PZ(z) =

8

>

>

>

>

<

>

>

>

>

:

0 if z ≤ l1 − u2

(z − l1 + u2) · D if l1 − u2 ≤ z ≤ l1 − l21

u1−l1if l1 − l2 ≤ z ≤ u1 − u2

(u1 − l2 − z) · D if u1 − u2 ≤ z ≤ u1 − l20 if z ≥ u1 − l2

Now, the probability that |Z| ≤ K must be computed. Thisis equal to the probability that −K ≤ Z ≤ K, which is:

Z K

−K

PZ(z) dvz

This value depends upon how the values −K and K arerelated to the values l1−u2, l1−l2, u1−u2, and u1−l2. Notethat these last four values are in increasing order. A carefulanalysis shows that there are only fifteen different possibleranges of values for K. One extreme is when K ≥ u1 − l2and −K ≤ l1 − u2; in this case, the above probability is 1and this corresponds to the must satisfiability. At the otherextreme, K ≤ l1−u2 or −K ≥ u1−l2; in this case, the aboveprobability (i.e., the probability of satisfaction) is zero. Itis not difficult to see that the complement of this conditioncorresponds to the condition of the may satisfaction and inthis case the probability of satisfaction is non-zero. Someother possible ranges are the following. If u1 − u2 ≤ K ≤u1 − l2 and −K ≤ l1 − u2, then the above integral and

hence the probability comes out to be 1 − D·(u1−l2−K)2

2. If

l1−l2 ≤ K ≤ u1−u2 and −K ≤ l1−u2, then the probabilityis 2K+u2+l2−2l1

2(u1−l1).

For each of the above cases, a set of linear equations can beset up as given in Section 4.1 after replacing the inequalities(1) and (2) by those given in Lemma 1, and for those tk

and tl satisfying the system of inequalities the probabilityof satisfaction is computed using the expression given above.

In this analysis, we have assumed that the random variablesV1 and V2 are uniformly distributed in their respective in-tervals, which implies that we have considered uncertaintyin the worst case. Another reasonable assumption is thatthey have a bounded normal distribution within their re-spective intervals; analysis of this case may be more difficultand could be a subject for future work.

6. EXPERIMENTAL EVALUATIONIn this section, we show some results obtained using the im-plemented prototype. The experimental settings are sum-marized in Table 1. In the experiments, the sensors are

422

location devices (GPS receivers) attached to objects movingat 60 mph (about 26.8 m/s) within an area of 10 squaredkilometers. Every object is initialized with a random lo-cation and an orientation that changes randomly, with a50% probability, every 10 seconds. Thus, a random mobil-ity model is considered to generate the datasets. The ob-jects’ trajectories are precomputed and stored in files to bereadily available in the different experiments. Every movingobject measures its location every three seconds, and com-municates input tuples (with two values and two predictionfunctions based on simple linear extrapolation, for the x andy coordinates) to the query processor according to the spec-ified threshold update policy: a threshold of 25 meters witha minimum of one update every 30 seconds is considered.The trajectory files are unified into a single trace file that isprocessed using the IBM Location Transponder [28], whichassigns threads as needed to meet the deadlines of the loca-tion updates.

Table 1: Parameters for the experimentsObjects’ speed 60 mph (∼ 26.8 m/s)

GPS sampling rate 3 secondsPeriod of angle change 10 secondsProb. of angle change 50%

Scenario size 10 squared kilometersThreshold update policy 25 meters

Max. period update policy (T) 30 secondsK of the window-join constraint 400 metersw of the window-join constraint 3 seconds

comp of the window-join constraint “≤”Validation period 1 second

Duration of the test 3 minutesIndexing mechanism R-Tree

In the described scenario, each monitoring test is run forthree minutes and new tuples are released into the outputdata stream every second (i.e., the validation period, definedin Section 4.3, is one second). The query that is evaluatedretrieves pairs of objects that are within 400 meters of eachother both in the horizontal and vertical dimensions. Forthis, a window-join constraint with K = 400 meters, a com-parator comp = “ ≤ ”, and a valid-time window w of threeseconds, is considered. The experiments were carried outon a computer with a 2xDual-Core AMD Opteron proces-sor running at 2.6 GHz, with 8 GB of RAM, and SunOS5.10. The Buffer Manager stores tuples in main memory.The current prototype, implemented in Java, does not useany specialized solver for systems of linear inequalities. AnR-Tree [16] is used to filter potential matches of an input tu-ple, previous to the filter step described in Section 4.1 (formore details, please see [19]).

In the rest of this section, we present experiments that ana-lyze: 1) the scalability of the query processor, 2) the benefitsof using prediction functions, and 3) the suitability of theproposed buffer management policy. The accuracy is mea-sured in terms of precision, recall, and F-measure, adoptingthese definitions from the field of Information Retrieval [4].In order to compute the ideal answer for a given time in-stant, the last GPS samples of the objects are considered.We focus on window-join constraints because processing thistype of constraint is more challenging.

6.1 Scalability of the Query ProcessorFigure 5 shows the impact of the total number of movingobjects/sensors on the accuracy of the query processor. As

can be seen in the figure, the accuracy is not perfect even inscenarios with a small number of moving objects, as there isalways some imprecision regarding the values that can be es-timated using the prediction functions (imprecision allowedby the threshold update policy). However, it should be em-phasized that the query processor exhibits a good scalabilitywhen the number of moving objects increases. The slightperformance degradation affects especially the recall, as thequery processor strives to process a large number of inputtuples so as to release new output tuples quickly. However,a large number of moving objects is needed to challenge asingle monitoring computer. It should also be noted thata worst-case situation is considered in this experiment, inthe sense that all the input tuples are of the type of valuein which the query is interested (GPS locations). This im-plies that any new tuple received could, in principle, matchwith any other tuple, as all the input tuples are stored inthe same buffer. Besides, as the size of the scenario is fixed,the object density increases with the number of moving ob-jects, reducing the efficiency of the R-Tree. Although it isnot shown in the figure, the total query processing time (ac-cumulated during the whole experiment) increases linearlywith the number of objects (ranging up to about 70 secondsin a scenario with 1000 objects); considering also a valueconstraint decreases this cost dramatically (20 seconds with1000 objects), as the value constraint is verified before per-forming the join, saving many tuple comparisons.

Figure 5: Effect of the number of objects: accuracy

6.2 Benefits of Prediction FunctionsIn this experiment, the accuracy of the approach presentedin this paper is compared with an alternative approach whereno prediction functions are used. In this latter case, thethreshold update policy described in Section 2 is modifiedaccordingly so as to communicate a new update when thedifference between the current value and the last communi-cated value exceeds the threshold (estimating values is nolonger possible). The general query processor architectureis maintained in both cases.

Figure 6 shows how the accuracy of the query processingis considerably better when prediction functions are used,except for scenarios with a small number of moving ob-jects/sensors (< 150). Thus, without prediction functionsthe values measured by the sensors cannot be estimated, soonly the values at update-time can be considered accurate.The decrease in the accuracy of the approach that does notuse prediction functions is explained by a high query pro-cessing overload, due to the huge increase in the number oftuples communicated by the moving objects when no pre-diction functions are used (about 300 tuples per second in a

423

scenario with 1000 objects): the Tuple Evaluator is not ableto cope with a very high update rate.

While the use of predictions and the accuracy of the queryprocessing may be affected by factors such as the speed of theobjects, the threshold of the update policy, the size of thevalid-time window, or the existence of bursts and conceptdrifts, we have obtained a good accuracy in a variety ofsettings [19].

Figure 6: Benefits of prediction functions

6.3 Buffer Management PolicyIn this section, the buffer management policy presented inSection 4.2.2 is evaluated. For this, a scenario with 300objects/sensors and a buffer limitation of 325 tuples is con-sidered. The GPS sampling rate is of one sampling everysecond. The period of angle change by a moving object isset to five seconds with a probability of 75%; in this way, therate of input tuples increases considerably, which leads to ahigher competition for the use of the limited space availablein the buffer for the GPS sensors of the moving objects. Forevaluation purposes, the proposed policy (that will be calledin this section discard-min-C, as it tries to maximize the con-tribution of the tuples stored) is compared with three otherpossible policies: 1) discard-if-full, where an incoming tupleis just discarded if there is no space to store it; 2) discard-oldest, where the oldest tuple stored is discarded to leavespace for the incoming tuple; and 3) discard-randomly, wherea victim is selected randomly. Like in the proposed pol-icy (see Section 4.2.2), existing tuples are used “to coverthe gaps” (time intervals for which an accurate predictionfunction is not available due to removals), except with thediscard-if-full policy. Doing this with the discard-if-full pol-icy would mean that purging would never have an oppor-tunity to discard tuples, as their IAPFs would never falloutside the buffer window, which would have a very nega-tive effect on the precision of the query processor when usingthat policy.

Figure 7 shows the recall achieved by the different policiesfor different values of the valid-time window. The policy pro-posed in this paper (discard-min-C) clearly achieves the bestrecall, the second best policy is to discard the oldest tuple,the third best policy is to discard tuples randomly, and fi-nally the worst policy is to discard incoming tuples while thebuffer is full (i.e., until some stored tuples are purged by theBuffer Manager). Similar conclusions are obtained by ana-lyzing the precision (for the sake of clarity, not shown in thefigure), although in this case the precision of discard-min-Cand discard-oldest is very similar. Thus, the advantage ofdiscard-min-C over discard-oldest is a significant increase inthe recall (at the expense of a slightly higher processing cost

due to the need of computing the contributions of tuplesand keeping them up-to-date). It is also remarkable that,in this experiment, the accuracy (in terms of both precisionand recall) of the proposed policy is in every case less than1% worse than in a situation where there is no buffer sizelimitation (this ideal situation increases only the F-measurein about 0.85% on average). We have performed other ex-periments simulating delayed incoming tuples, and similarconclusions can be drawn.

Figure 7: Comparison of different buffer policies

7. RELATED WORKDue to its importance, minimizing the size and number oftransmissions in sensor networks has been the focus of manyworks (e.g., see [29, 30]). In relation to this problem, thereare several works that (as we do) propose to use predictionsin the context of data streams:

• The most relevant work is [24], which also advocatesthe use of predictors to minimize the communicationcosts in a sensor network. The authors of that workstudy several prediction techniques, all of them basedon linear functions. A similar idea is proposed in [21],where a Kalman filter is specifically chosen among thelinear estimation methods. However, these works donot focus on query processing issues.

• In [14], the basic assumption is that in many con-texts the readings of nearby sensors are correlated.Therefore, the authors propose to analyze the exist-ing spatio-temporal correlations to compute predictionmodels in networks with static sensors. However, howto efficiently process the data received from sensors toanswer different types of queries is out of the scope oftheir work. Therefore, this work is complementary toours. There are other works that also exploit correla-tions among sensors’ values (e.g., [10, 30]).

• Another interesting work is [20], where Kalman filtersare used to adjust dynamically the sampling rate ofsensors: sensors for which the prediction error is higherwill sample at a higher rate. The main disadvantageis that unexpected events between samplings cannotbe detected, which may be a problem if the samplinginterval is large. This is also a complementary workto ours: it could be used with our query processor todynamically adjust the sampling rate of the sensors.

• Probabilistic models are used in [11] to capture cor-relations among sensors so as to provide more robustinterpretations of sensor readings and optimize their

424

acquisition. Its ultimate goal is similar to ours: toreduce energy consumption and processing overload.However, the approach is different, as it focuses ondata acquisition in a pull-based query model. Thus,for example, continuous queries or window-join con-straints are not considered.

Regarding the query constraints handled by the proposedquery processor, some relevant works are worth mentioning:

• In [17], time window constraints are used to limit thesets of sensor tuples that can be matched for a query,and a mechanism to process multi-way joins is pro-posed. However, the authors focus on a different prob-lem (tracking the motion of a moving object) and theirtechnique requires the specification of the names of theindividual sensors involved in the join, so they must beknown in advance.

• Also in the context of moving objects, CAMEL (Con-tinuous Active Monitor Engine for Location-based Ser-vices) [7, 8] is a location stream database with a loca-tion management engine to support intelligent locationaware services. It supports unary and binary movingobjects triggers, which can be seen as a special caseof the value and window-join constraints that we con-sider. However, it does not deal with prediction func-tions or temporal conditions. Moreover, as in [17], theobjects of interest must be referenced in the query.

• In [6], the Dynamic Time Warping distance is consid-ered, such that a point in a time series can be alignedwith multiple points of another time series in different(previous or later) temporal positions. So, althoughthe context of that work is different (the goal there isto compare time series), the importance of taking intoaccount temporal shifts is also highlighted (as withthe possibility to specify a valid-time window in thewindow-join constraints proposed in this paper).

• Finally, it is also interesting to mention that differenttypes of sliding windows have been proposed in the lit-erature (e.g., time-based or tuple-based windows [2]),which can be considered as complementary to the pro-posal presented in this paper: such windows should beconsidered when extracting the tuples from the buffersfor query processing and also as an additional criteriato purge old tuples from the buffers.

To conclude, [1] is a recent relevant work that also advocatesimplementing query operators by solving equation systemsderived from piecewise polynomial models. The main focusof such work is on query validation. A technique called queryinversion is proposed to translate an accuracy requirementon the output tuples into an accuracy requirement on theinput tuples. An accuracy validation drives the proposed on-line predictive query processing: only input tuples that maycause a change in the result are processed. An importantdifference between our proposal and this and other worksdescribed in this section is that we define a complete queryprocessor architecture for handling data streams with pre-diction functions, studying all the issues affecting the queryprocessing, from the sensors to the clients.

8. CONCLUSIONS AND FUTURE WORKIn this paper, we have described a query processor for han-dling data streams with prediction functions in a networkof sensors. The use of prediction functions allows to mini-mize communications from the sensors, which is an impor-tant concern due to energy and bandwidth limitations, andit also allows an efficient query processing.

Although the use of predictions has been already proposedin the literature of data streams, and some works such as [21,24] compare different prediction techniques, no other workfocuses on query processing aspects. The proposed incre-mental query processing approach detects all the answers,adapts to different types of clients (e.g., a trigger processingmodule that reacts to certain events, or a sampling mod-ule that presents periodically a snapshot of the answer toa user), and allows the processing of predicted future re-sults. Two types of interesting constraints are considered,with an emphasis on window-join constraints, that have notbeen considered so far in other works. An appropriate buffermanagement policy, that takes into account the contributionof the prediction functions contained in the tuples, has alsobeen proposed. Moreover, uncertainty issues are managedby supporting may and must queries and tagging the outputtuples with a probability of satisfaction. The experimentsperformed show the interest of our proposal.

As future work, some aspects of the proposed query proces-sor could be optimized. There is a wide variety of proposalsin the literature focusing on optimization issues for queryprocessing (join re-ordering for queries with more than twoinputs, sharing computation across similar queries, etc.) andsome of them could probably be adopted in our approach.

9. ACKNOWLEDGEMENTSResearch supported by NSF grants 0326284, 0330342, ITR-0086144, 0513736, 0209190, EIA-0320956, EIA-0220562,HRD-0317692, and CICYT project TIN2007-68091-C02-02.We thank Raquel Trillo for her useful comments.

10. REFERENCES[1] Y. Ahmad, O. Papaemmanouil, U. Cetintemel, and

J. Rogers. Simultaneous equation systems for queryprocessing on continuous-time data streams. In 24thInternational Conference on Data Engineering,ICDE’08, pages 666–675, Washington, 2008. IEEEComputer Society.

[2] A. Arasu, S. Babu, and J. Widom. The CQLcontinuous query language: Semantic foundations andquery execution. VLDB Journal, 15(2):121–142, 2006.

[3] B. Babcock, S. Babu, M. Datar, R. Motwani, andJ. Widom. Models and issues in data stream systems.In 21st ACM SIGMOD-SIGACT-SIGART Symposiumon Principles of Database Systems, PODS’02, pages1–16, New York, 2002. ACM.

[4] R. Baeza-Yates and B. Ribeiro-Neto. ModernInformation Retrieval. ACM / Addison-Wesley, 1999.

[5] P. Bonnet, J. Gehrke, and P. Seshadri. Towards sensordatabase systems. In 2nd International Conference onMobile Data Management, MDM’01, volume 1987 ofLecture Notes in Computer Science, pages 3–14,London, 2001. Springer.

425

[6] P. Capitani and P. Ciaccia. Warping the time on datastreams. Data & Knowledge Engineering,62(3):438–458, 2007.

[7] Y. Chen, F. Rao, X. Yu, and D. Liu. CAMEL: amoving object database approach for intelligentlocation aware services. In 4th InternationalConference on Mobile Data Management, MDM’03,volume 2574 of Lecture Notes in Computer Science,pages 331–334, London, 2003. Springer.

[8] Y. Chen, F. Rao, X. Yu, D. Liu, and L. Zhang.Managing location stream using moving objectdatabase. In 14th International Workshop on Databaseand Expert Systems Applications, DEXA’03, pages916–920, Washington, 2003. IEEE Computer Society.

[9] R. Cheng, D. Kalashnikov, and S. Prabhakar.Evaluation of probabilistic queries over imprecise datain constantly-evolving environments. InformationSystems, 32(1):104–130, 2007.

[10] D. Chu, A. Deshpande, J. Hellerstein, and W. Hong.Approximate data collection in sensor networks usingprobabilistic models. In 22nd International Conferenceon Data Engineering, ICDE’06, page 48, Washington,2006. IEEE Computer Society.

[11] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein,and W. Hong. Model-based approximate querying insensor networks. VLDB Journal, 14(4):417–443, 2005.

[12] T. Ghanem. Supporting predicate-window queries indata stream management systems. In 22ndInternational Conference on Data EngineeringWorkshops, ICDE’06 Workshops, page 139,Washington, 2006. IEEE Computer Society.

[13] T. Ghanem, W. Aref, and A. Elmagarmid. Exploitingpredicate-window semantics over data streams.SIGMOD Record, 35(1):3–8, 2006.

[14] S. Goel and T. Imielinski. Prediction-based monitoringin sensor networks: Taking lessons from MPEG. ACMComputer Communication Review, 31(5):82–98, 2001.

[15] J. D. Gooijer and R. Hyndman. 25 years of IIF timeseries forecasting: a selective review. Department ofEconometrics and Business Statistics, MonashUniversity, 2005.

[16] A. Guttman. R-trees: a dynamic index structure forspatial searching. In ACM SIGMOD InternationalConference on Management of Data, SIGMOD’84,pages 47–57, New York, 1984. ACM.

[17] M. Hammad, W. Aref, and A. Elmagarmid. Streamwindow join: tracking moving objects insensor-network databases. In 15th InternationalConference on Scientific and Statistical DatabaseManagement, SSDBM’03, pages 75–84, Washington,2003. IEEE Computer Society.

[18] S. Ilarri, O. Wolfson, E. Mena, A. Illarramendi, andN. Rishe. Processing of data streams with predictionfunctions. In 39th Hawaii International Conference onSystem Sciences, HICSS-39, page 237a, Washington,2006. IEEE Computer Society.

[19] S. Ilarri, O. Wolfson, E. Mena, A. Illarramendi, andP. Sistla. An architecture for prediction-basedmonitoring of data streams. Technical ReportRR-08-08, University of Zaragoza, September 2008.

[20] A. Jain and E. Chang. Adaptive data sampling forsensor networks. In 1st International Workshop on

Data Management for Sensor Networks, DMSN’04,pages 10–16, New York, 2004. ACM.

[21] A. Jain, E. Chang, and Y.-F. Wang. Adaptive streamresource management using Kalman filters. In ACMSIGMOD International Conference on Management ofData, SIGMOD’04, pages 11–22, New York, 2004.ACM.

[22] C. Jones, K. Sivalingam, P. Agrawal, and J. Chen. Asurvey of energy efficient network protocols for wirelessnetworks. Wireleless Networks, 7(4):343–358, 2001.

[23] K. Kowalski and W.-H. Steeb. Nonlinear DynamicalSystems and Carleman Linearization. World Scientific,Singapore, 1991.

[24] V. Kumar, B. Cooper, and S. Navathe. Predictivefiltering: a learning-based approach to data streamfiltering. In 1st International Workshop on DataManagement for Sensor Networks, DMSN’04, pages88–97, New York, 2004. ACM.

[25] D. Lin, B. Cui, and D. Yang. Optimizing movingqueries over moving object data streams. In 12thInternational Conference on Database Systems forAdvanced Applications, DAASFA’07, volume 4443 ofLecture Notes in Computer Science, pages 563–575,Berlin, 2007. Springer.

[26] S. Madden, M. Franklin, J. Hellerstein, and W. Hong.TinyDB: an acquisitional query processing system forsensor networks. ACM Transactions on DatabaseSystems, 30(1):122–173, 2005.

[27] J. Miles and M. Shevlin. Applying Regression andCorrelation: A Guide for Students and Researchers.SAGE Publications, London, 2001.

[28] J. Myllymaki and J. Kaufman. IBM LocationTransponder. IBM alphaworks,http://www.alphaworks.ibm.com/tech/transponder, 2002.

[29] M. Sharaf, J. Beaver, A. Labrinidis, and P. Chrysanthis.Balancing energy efficiency and quality of aggregate data insensor networks. The VLDB Journal, 13(4):384–403, 2004.

[30] A. Silberstein, R. Braynard, and J. Yang. Constraintchaining: on energy-efficient continuous monitoring insensor networks. In ACM SIGMOD InternationalConference on Management of Data, SIGMOD’06, pages157–168, New York, 2006. ACM.

[31] A. Sistla, O. Wolfson, S. Chamberlain, and S. Dao.Querying the uncertain position of moving objects. InTemporal Databases: Research and Practice, volume 1399of Lecture Notes in Computer Science, pages 310–337,Berlin, 1998. Springer.

[32] Y. Tao and D. Papadias. Time-parameterized queries inspatio-temporal databases. In ACM SIGMODInternational Conference on Management of Data,SIGMOD’02, pages 334–345, New York, 2002. ACM.

[33] K. Torp, C. Jensen, and R. Snodgrass. Effectivetimestamping in databases. The VLDB Journal,8(3-4):267–288, 2000.

[34] O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, andG. Mendez. Cost and imprecision in modeling the positionof moving objects. In 14th International Conference onData Engineering, ICDE’98, pages 588–596, Washington,1998. IEEE Computer Society.

[35] O. Wolfson, A. Sistla, S. Chamberlain, and Y. Yesha.Updating and querying databases that track mobile units.Distributed and Parallel Databases, 7(3):257–287, 1999.

426