1
Continuous Queries over Data Streams
Vitaly Kroivets, Lyan Marina
Presentation for the Seminar on Database and Internet, The Hebrew University of Jerusalem, Fall 2002
2
Contents of the lecture
• Introduction
• Proposed Architecture of Data Stream Management System
• Research problems
• Query Optimization
• Bibliography
3
Data Streams vs. Data Sets
Data Sets:
• Updates infrequent
• Old data required many times
• Example: employees' personal data table
Data Streams:
• Data changed constantly (sometimes additions only)
• Mostly only freshest data used
• Examples: financial tickers, data feeds from sensors, network monitoring, etc.
4
Using Traditional Database
[Diagram labels: User/Application, Loader; Query → Result, Query → Result, …]
6
Data Streams Paradigm
[Diagram: User/Application registers a query with the Stream Query Processor and receives a Result; the processor uses Scratch Space (memory and/or disk); together they form the Data Stream Management System (DSMS)]
7
What Is a Continuous Query?
A query that is issued once and then logically runs continuously.
Example: detect abnormalities in network traffic behavior in real time, together with their cause, such as link congestion due to hardware failure.
More examples: continuous queries are used to support load balancing and online automatic trading at a stock exchange.
10
Special Challenges
• Timely online answers even for rapid data streams
• Ability of fast access to large portions of data
• Processing of multiple streams simultaneously
11
Making Things Concrete
[Diagram: BOB calls ALICE through two Central Offices; each Central Office feeds the DSMS]
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
12
Making Things Concrete
• Database = two streams of mobile call records
  Outgoing (call_ID, caller, time, event)
  Incoming (call_ID, callee, time, event)
• Query language = SQL
• FROM clauses can refer to streams and/or relations
13
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes

SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
  AND O1.call_ID = O2.call_ID
  AND O1.event = start
  AND O2.event = end)

• Result requires unbounded storage
• Can provide result as data stream
• Can output after 2 min, without seeing end
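
The last point can be illustrated with a minimal Python sketch. It assumes that Outgoing tuples arrive in time order as (call_ID, caller, time, event) with time measured in minutes; the function name `long_calls` and the string values "start" and "end" are hypothetical choices for this illustration, not part of the lecture.

```python
def long_calls(outgoing, threshold=2):
    """Yield (call_ID, caller) for outgoing calls lasting more than `threshold` minutes."""
    open_calls = {}  # call_ID -> (caller, start time) for calls neither reported nor ended
    for call_id, caller, time, event in outgoing:
        # An open call older than the threshold already qualifies,
        # even though its end event has not been seen yet.
        overdue = [c for c, (_, start) in open_calls.items() if time - start > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        else:
            # Short call (or one already reported): forget it.
            open_calls.pop(call_id, None)
```

Only calls that are still in progress are kept in memory, and each result is appended to the output as soon as the call exceeds the threshold.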
14
Query 2 (join)
Pair up callers and callees

SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID

• Can still provide result as data stream
• Requires unbounded temporary storage … unless streams are near-synchronized
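
Why near-synchronized streams keep the temporary storage small can be seen in a sketch of a symmetric join. The Python below is only an illustration under simplifying assumptions: each call_ID appears once per stream, and the two streams are interleaved into one sequence of tagged records ("O" or "I"). The name `pair_calls` and the tagging scheme are hypothetical.

```python
def pair_calls(events):
    """events: iterable of ("O", call_ID, caller) or ("I", call_ID, callee) in arrival order."""
    waiting_out = {}  # call_ID -> caller, outgoing records still waiting for a match
    waiting_in = {}   # call_ID -> callee, incoming records still waiting for a match
    for side, call_id, party in events:
        if side == "O":
            if call_id in waiting_in:
                yield party, waiting_in.pop(call_id)
            else:
                waiting_out[call_id] = party
        else:
            if call_id in waiting_out:
                yield waiting_out.pop(call_id), party
            else:
                waiting_in[call_id] = party
```

A record stays buffered only until its partner arrives, so the buffers grow without bound only if one stream lags arbitrarily far behind the other.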
15
Query 3 (group-by aggregation)
Total connection time for each caller

SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
  AND O1.event = start
  AND O2.event = end)
GROUP BY O1.caller

Cannot provide result in (append-only) stream.
Alternatives:
• Output stream with updates
• Provide current value on demand
• Keep answer in memory
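
The first alternative, an output stream with updates, might look like the following Python sketch. It assumes time-ordered Outgoing tuples (call_ID, caller, time, event); the function name `connection_time_updates` is hypothetical. Each emitted pair supersedes the previous value for the same caller, which is exactly why the output is not append-only.

```python
from collections import defaultdict

def connection_time_updates(outgoing):
    """Yield (caller, total connection time so far) whenever a call ends."""
    starts = {}                # call_ID -> start time of calls still in progress
    totals = defaultdict(int)  # caller -> total connection time seen so far
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = time
        elif call_id in starts:
            totals[caller] += time - starts.pop(call_id)
            # This tuple replaces any earlier output for the same caller.
            yield caller, totals[caller]
```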
16
Conclusions
• Conventional DBMS technology is inadequate
• We need to reconsider all aspects of data management and processing in the presence of data streams
21
DBMS versus DSMS
In each line, the DBMS property comes first and the DSMS property second:
• Persistent relations vs. transient streams (and persistent relations)
• One-time queries vs. continuous queries
• Random access vs. sequential access
• Access plan determined by query processor and physical DB design vs. unpredictable data arrival and characteristics
• "Unbounded" disk store vs. bounded main memory
22
Related work
Tapestry system
• Content-based filtering of email messages
• Restricted subset of SQL
• Append-only query results
Chronicle data model
• Append-only ordered sequences of tuples
• Restricted view-definition language
• Does not store any chronicles
Alert system
• Event-condition-action triggers in a conventional SQL DB
• Continuous queries over append-only "active tables"
23
Related work
Materialized Views
Materialized views are queries that need to be reevaluated whenever the database changes.
Materialized Views vs. Continuous Queries
Continuous queries:
• May stream rather than store the result
• May deal with append-only relations
• May provide approximate answers
• Processing strategy may adapt to the characteristics of the data stream
24
Architecture for continuous queries
• Single stream of tuples D, single continuous query Q, and answer to the query A
• Q is issued once and operates continuously
[Diagram: data stream of tuples <A,B> flows into continuous query Q, which produces answer A]
25
Architecture for continuous queries
We consider data streams that adhere to the relational model (i.e., streams of tuples), although many of the ideas and techniques are independent of the data model being considered.
26
Architecture for continuous queries
Scenario 1 (simplest):
Data stream D is append-only: no updates or deletions. How to handle Q?
1) Always store the current answer A to Q.
   D is of unbounded size => A may be too.
2) Do not store A, but make new tuples in A available as another continuous stream.
   No need for unbounded storage for A, but may need unbounded storage to determine new tuples in A.
27
Architecture for continuous queries
Scenario 2
Input stream is append-only, but may cause updates and deletions in answer A.
=> May need to update/delete tuples in the output data stream.
Scenario 3 (most general)
Input stream D includes updates and deletions.
=> Much of the stream's data should be stored to determine the answer.
28
Architecture for continuous queries
How to solve?
1) Restrict the expressiveness of Q.
2) Impose constraints on the data stream to guarantee that the answer to Q, and the amount of data needed to compute Q, are bounded.
3) Provide an approximate answer.
29
Architecture for processing continuous queries
[Diagram: Stream 1, Stream 2, …, Stream N feed the Stream Query Processor, which uses four data areas: Stream, Store, Scratch, and Throw]
30
Architecture for continuous queries
• STREAM is a data stream containing tuples appended to A. It is an append-only stream (it should not include updates/deletions).
• STREAM and STORE together define the current answer A.
31
Architecture for continuous queries
When query Q is notified of a new tuple t in a relevant data stream, it can perform a number of actions, which are not mutually exclusive:
1) t causes new tuples in A. If tuple a will remain in A forever: send a to STREAM.
2) If a should be in A, but may be removed at some moment: add a to STORE.
32
Architecture for continuous queries
Actions on a new tuple t (continued):
3) t may cause update or deletion of answer tuples in STORE. Answer tuples may be moved from STORE to STREAM.
4) May need to save t or derived data to ensure that the query result can be computed in the future: send t to SCRATCH.
33
Architecture for continuous queries
Actions on a new tuple t (continued):
5) t is not needed and will not be needed: send it to THROW (unless we would like to archive it).
6) As a result of t, we may move data from STORE or SCRATCH to THROW.
34
Architecture for continuous queries
Scenario 1
Data stream D is append-only: no updates or deletions. Always store the current answer A to Q.
• STREAM: empty
• STORE: always contains A
• SCRATCH: contains whatever is needed to keep the answer in STORE up to date
35
Architecture for continuous queries
Scenario 2
Answer A is provided exclusively as a data stream.
• STREAM: streams the answer A
• STORE: empty
• SCRATCH: contains whatever is needed to determine new tuples in A
36
Architecture for continuous queries
Scenario 3
Input stream is append-only; answer A may have updates and deletions.
Example: Q is a group-by with a Min aggregation function.
• Answer A is maintained in STORE
• SCRATCH is empty
37
Architecture for continuous queries
Scenario 4
Input streams may include updates and deletions.
• Unbounded storage is required for SCRATCH to ensure that Min can always be computed
• In both Scenarios 3 and 4: data is moved to STREAM only once it is known that no further updates/deletions of tuples of this group will occur
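
To make the roles of STREAM, STORE, SCRATCH, and THROW concrete, here is a minimal Python sketch of the group-by Min query over an append-only input (Scenario 3). It is an illustration only: the class `MinPerGroup`, its method names, and in particular the `close_group` signal (a way of telling the query that a group will receive no further tuples) are hypothetical and not part of the proposed architecture.

```python
class MinPerGroup:
    """Group-by Min over an append-only stream of (group, value) tuples."""

    def __init__(self):
        self.stream = []  # STREAM: answer tuples that will never change again
        self.store = {}   # STORE: group -> current minimum (may still be updated)
        self.thrown = 0   # THROW: number of input tuples discarded after use
        # SCRATCH stays empty: with append-only input, the current minimum
        # is all the state needed to process the next tuple.

    def on_tuple(self, group, value):
        current = self.store.get(group)
        if current is None or value < current:
            self.store[group] = value  # insert or update an answer tuple in STORE
        self.thrown += 1               # the input tuple itself is no longer needed

    def close_group(self, group):
        # Hypothetical signal: no further tuples of this group will arrive,
        # so its answer tuple is final and moves from STORE to STREAM.
        self.stream.append((group, self.store.pop(group)))
```

If the input could also delete tuples (Scenario 4), the current minimum alone would not be enough: removing the minimum requires knowing the remaining values of the group, which is the data that would have to be kept in SCRATCH.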
38
The Architecture and Related Work
Implementing triggers in terms of the proposed architecture (for launching triggered actions, assume that actions are performed by SQL stored procedures):
• STREAM and STORE are empty
• SCRATCH is used for data required to monitor complex events
Benefits:
• Complex multi-table events and conditions can be monitored
• Trigger processing benefits from efficient data management/processing techniques (see below)
39
The Architecture and Related Work
Implementing materialized views in terms of the proposed architecture:
• The view itself is maintained in STORE
• Base data: in SCRATCH
• Data expiration: to expedite cleanup of SCRATCH
• No way to ensure a bound on the size of STORE and SCRATCH
40
End of Part IEnd of Part I
41
Research Problems
• Designing a query language
• Online processing of rapid streams
  • Approximation techniques
  • Storage constraints vs. performance requirements
  • Summarization
• Query planning / optimization
  • Building a good query plan
  • Scheduling
  • Sub-plan sharing
• Resource management
• Adaptation
42
Research Problems: Languages for Continuous Queries
• Bounding the size of SCRATCH/STORE
• Open problem: determining, for an arbitrary SQL query, whether these boundedness properties are satisfied
43
Query Language
The query language allows both streams and relations.
Assumptions:
Streams:
• Ordered
• Append-only
• Unbounded
• Multiple streams allowed
Relations:
• Unordered
• Support updates and deletions
44
SQL Extensions for Continuous Queries
• FROM may refer to both streams and relations
• Sliding window for the FROM clause (for streams):
  • Optional "partitioning" clause
  • Mandatory "window size"
  • Optional "filtering predicate"
45
Window specification
Using ROWS:
  ROWS 50 PRECEDING
Using RANGE:
  RANGE 15 minutes PRECEDING
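
The difference between the two window kinds can be sketched in Python. This is an illustration under assumptions, not the system's implementation: the class names `RowsWindow` and `RangeWindow` are hypothetical, timestamps are plain numbers in a single unit, tuples arrive in timestamp order, and a tuple exactly `span` units old is treated as still inside the range window.

```python
from collections import deque

class RowsWindow:
    """Keeps the last `n` tuples (ROWS n PRECEDING)."""

    def __init__(self, n):
        self.items = deque(maxlen=n)  # the oldest tuple falls out automatically

    def insert(self, tup):
        self.items.append(tup)


class RangeWindow:
    """Keeps tuples whose timestamp lies within `span` of the newest one (RANGE span PRECEDING)."""

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, tuple) pairs in arrival order

    def insert(self, timestamp, tup):
        self.items.append((timestamp, tup))
        # Expire tuples that are older than the window allows.
        while self.items[0][0] < timestamp - self.span:
            self.items.popleft()
```

A ROWS window has a fixed memory footprint, whereas the size of a RANGE window depends on the arrival rate of the stream.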
46
Example 1
[Diagram labels: Clients CL1 to CL5 and CL7; domains .com, .il, .NF; Internet; Web Server (CS web, Math web); DSMS]
S (client_id, URL, domain, time)
47
Example 1 (CQL): "From" with "Range"
Stream "Requests" of requests to a web server, with attributes:
(client_id, URL, domain, time)
Query counting the number of requests for pages from domain "cs.huji.ac.il" in the last day:

SELECT COUNT(*)
FROM Requests S [RANGE 1 DAY PRECEDING]
WHERE S.domain = "cs.huji.ac.il"
48
Partitioning Clause
• Partitions data into several groups
• Computes a separate window for each group
• Merges the windows into a single result
• Is syntactically the same as a GROUP BY clause
49
Example 2: "Partition By"
How many pages were served from the CS website, considering only each client's 10 most recent requests, for requests from domain CS.HUJI.AC.IL?

SELECT COUNT(*)
FROM Requests S [PARTITION BY S.client_id
                 ROWS 10 PRECEDING
                 WHERE S.domain = 'CS.HUJI.AC.IL']
WHERE S.URL LIKE 'http://cs.huji.Ac.Il/%'
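
One possible reading of this query is sketched below in Python. Several points are assumptions made for the illustration: the filtering predicate inside the window specification is applied before tuples enter a client's window, string comparisons ignore case, and the count is re-emitted after every arriving request. The function name `count_cs_pages` is hypothetical.

```python
from collections import defaultdict, deque

def count_cs_pages(requests, rows=10):
    """Yield the current count after each (client_id, url, domain, time) request."""
    # client_id -> that client's last `rows` requests from the CS domain
    windows = defaultdict(lambda: deque(maxlen=rows))
    for client_id, url, domain, time in requests:
        if domain.lower() == "cs.huji.ac.il":  # filtering predicate of the window
            windows[client_id].append(url)
        # Merge the per-client windows and apply the outer WHERE clause.
        yield sum(
            u.lower().startswith("http://cs.huji.ac.il/")
            for w in windows.values()
            for u in w
        )
```

Because each client has its own window, one very active client cannot push other clients' requests out of the result.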
50
Example 3: Join with relation
Classify domains by the primary type of web content they serve:
  .ac.il   EDUCATION
  .gov.il  GOVERNMENT
  .co.il   COMMERCE
  .com     COMMERCE
Count the number of requests from "commerce" domains out of the last 10000 records.
A 10% sample of the requests stream is used.
51
Example 3 (Cont.)

SELECT COUNT(*)
FROM (SELECT R.class
      FROM Requests S 10% SAMPLE, Domains R
      WHERE S.domain = R.domain) T [ROWS 10000 PRECEDING]
WHERE T.class = "commerce"

Note: the Requests stream is joined with the Domains relation, resulting in stream T, before the sliding window is applied.
52
Performance Challenge:
• Multiple rapid incoming data streams
• Multiple complex queries with timeliness requirements
• Finite resources
53
Solution: Approximation
• Approximate answers
• Graceful degradation
• Maximize precision based on available resources
54
Approximation: Static vs. Dynamic
Static:
• Queries modified at submission time to use fewer resources
• User is guaranteed certain query behavior
• User can configure the approximation mechanism
• Adaptation mechanisms not needed
Dynamic:
• Queries modified at run time
• Not suitable for some applications
55
Approximation Techniques
• Window reduction
• Sampling rate reduction
• Summarization (synopses)
56
Window reduction
• Decrease the size of the window
• Introduce a window where none was specified originally
• May increase the output rate (duplicate elimination, for example)
• Must detect bad cases statically
• Affects the resources used by the operator
57
Sampling rate reduction
• Introduce SAMPLE if not specified
• Reduce the sampling rate
• Will reduce the output rate
• Will not influence the resource requirements of the operation
58
Summarization
Summaries (data synopses): concise representation at the expense of accuracy
• Sampling, histograms, wavelets
Open questions:
• How to make guarantees about query results based on summaries?
• How to maintain summaries efficiently over rapid data streams?
• Which summarization techniques are better?
59
Dynamic approximation: Challenges
• Some applications will not tolerate unpredicted and variable accuracy
• Extend the language to specify tolerable imprecision
60
Dynamic approximation techniques
• Synopses compression
• Sampling
• Load shedding
61
Synopses compression
• Synopses: concise representation at the expense of accuracy
• Reducing memory overhead
• Methods: histograms, wavelets, etc.
62
Load shedding
• Drop tuples from queues when they grow too large
• Drops chunks of tuples at a time; this differs from sampling, which eliminates tuples probabilistically
• Load shedding is biased, but easier to implement
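
The contrast between the two techniques can be sketched as follows. This Python is only an illustration: the function names, the `capacity` and `chunk` parameters, and the choice to shed the oldest tuples are assumptions, since the slide does not say which tuples are dropped.

```python
import random
from collections import deque

def sample(stream, rate, rng):
    """Sampling: keep each tuple independently with probability `rate`."""
    return [t for t in stream if rng.random() < rate]

def enqueue_with_shedding(queue, tup, capacity, chunk):
    """Load shedding: when the queue is full, drop a whole chunk of tuples at once."""
    if len(queue) >= capacity:
        # Assumption: the oldest tuples are the ones shed.
        for _ in range(min(chunk, len(queue))):
            queue.popleft()
    queue.append(tup)
```

Sampling thins the stream evenly, while shedding removes contiguous runs of tuples, which is where the bias comes from.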
63
Query Plans: How Does a DSMS Process a Query?
Issues to consider:
• Separate query plan for each continuous query vs. one mega-query plan for all computations for all users
• Plan components may be shared
• Queries register before streams start to produce data
• What about adding queries over existing streams?
• Queries over archived/discarded data
64
STREAM System: Query Plans
Query operators
• An operator reads streams of tuples from a set of input queues, processes them, and writes output tuples into a single output queue
[Diagram: two input queues feed an Operator, which writes to one output queue]
65
Query Plans (Cont.)
Inter-operator queues
• Queues connect different operators and define the flow of tuples
Synopses
• A synopsis summarizes the tuples seen so far at an intermediate operator, as needed for future evaluation
66
When Are Synopses Needed?
• Join operator: must remember the tuples seen so far on each of its input streams; maintains a synopsis for each
• Filter operator (selection): does not maintain state; no need for synopses
67
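Example
[Diagram: Streams R and S enter through Queue1 and Queue2 into Operator O1 (join, with Synop1 and Synop2), which writes to Queue3. Queue3 is shared: Operator O2 (select) reads it to answer Query1 (selection over the join of R and S), and Operator O3 (join, with Synop3 and Synop4) combines it with Stream T from Queue 4 to answer Query2 (join of R, S, and T). A Scheduler controls the operators.]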
68
Explanations to Example
• Two plans (for Q1 and Q2) share a sub-plan joining streams R and S by sharing its output queue q3
• Execution of operators is controlled by a global scheduler
• When operator O is scheduled, control passes to O for a period determined by a number of tuples
• Time-slice-based scheduling is also possible
69
Resource Sharing for Query Plans
• Applies when continuous queries share common sub-expressions
• Similar to traditional DBMS
• Resource sharing and approximation are considered separately
• Do not share if sharing introduces approximation, such as merging sub-expressions with different window sizes
70
Implementation of Shared Queue
• The queue maintains a pointer to the first unread tuple for each operator
• Tuples are discarded once they have been read by all operators
[Diagram: shared queue holding tuples t1 to t8, with read pointers for Op1, Op2, Op3, and Op4]
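
A minimal Python sketch of such a shared queue follows. The class name `SharedQueue` and its methods are hypothetical, and the sketch assumes that the set of reading operators is fixed when the queue is created and that `None` is never stored as a tuple.

```python
class SharedQueue:
    """Queue read by several operators, each with its own read pointer."""

    def __init__(self, readers):
        self.tuples = []  # tuples not yet read by every operator
        self.base = 0     # absolute index of self.tuples[0]
        self.next_unread = {r: 0 for r in readers}  # absolute index per operator

    def put(self, tup):
        self.tuples.append(tup)

    def get(self, reader):
        """Return the next unread tuple for `reader`, or None if it is caught up."""
        pos = self.next_unread[reader]
        if pos - self.base >= len(self.tuples):
            return None
        tup = self.tuples[pos - self.base]
        self.next_unread[reader] = pos + 1
        # Discard the tuples that every operator has now read.
        done = min(self.next_unread.values()) - self.base
        del self.tuples[:done]
        self.base += done
        return tup
```

The queue can only shrink as fast as its slowest reader advances, which is why operators with very different consumption rates make sharing costly.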
71
Resource Sharing (cont.)
• A base data stream accessed by multiple queries is shared as a common sub-expression
• The number of tuples in a shared queue depends on:
  • Rate of addition to the queue
  • Rate at which the slowest operator consumes tuples
• Issue: a common sub-expression of 2 queries with very different consumption rates
72
Shared Queue Issues
• P1, P2: parents of operator J
• J will be scheduled frequently, for the sake of P1
• J should be scheduled less frequently for P2 (to avoid proliferation of tuples in q)
[Diagram: two streams feed Operator J (join), which writes to queue q; q is read by P1 (heavy consumer) and P2 (light consumer)]
73
Sub-Plan Sharing
Formally proven:
• Sub-plan sharing may be sub-optimal for common sub-expressions with joins
• For common sub-expressions without joins, sharing is always preferable
74
Synopses Sharing
Issues to consider:
• Which operator is responsible for managing shared synopses?
• When synopses are required by different operators, how to choose the size of the common synopsis?
• If synopses are identical, how to cope with different consumption rates?
75
Scheduling
Objectives for the scheduler:
• Stream-based variation of response time
• Throughput
• Weighted fairness among queues
• Minimize intermediate queue sizes
Granularity for the scheduler:
• Max number of tuples consumed by an operator
• Time unit
Parallelism in the scheduling algorithm?
76
Scheduling: Example
[Diagram: q1 → Op. O1 → q2 → Op. O2]
• O1 takes 1 time unit to operate on n tuples from q1; with 20% selectivity, it produces n/5 tuples in q2
• O2 takes 1 time unit to operate on n/5 tuples from q2, and it produces no tuples
77
Scheduling Example (Cont.)
Assume the average arrival rate on q1 is no more than n tuples per 2 time units, so the queues stay bounded
Arrivals may be bursty
Possible scheduling strategies:
Algorithm 1 (time-slicing): each operator in turn processes tuples for 1 time unit. O1 consumes n tuples, O2 consumes n/5; O1, O2, ...
Algorithm 2: O1 operates until its queue is empty, afterwards O2
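The two strategies can be compared with a small simulation. This is a sketch under stated assumptions, not the presenters' code: the function name `simulate`, the policy labels, the value n = 10, and the arrival schedule (2n tuples at time 0, then n tuples at times 4 and 6) are all chosen here for illustration. Queue sizes are recorded at the end of each time unit, and arrivals are assumed to be multiples of n so that n/5 stays an integer.

```python
def simulate(policy, arrivals, n, steps):
    """Return the total size of q1 + q2 at the end of each time unit.

    policy   -- "time_slicing" (O1 and O2 alternate, 1 time unit each)
                or "o1_first" (O1 runs whenever q1 is non-empty, else O2)
    arrivals -- dict mapping time unit -> number of tuples arriving on q1
    """
    q1 = q2 = 0
    o1_turn = True  # used only by time slicing
    totals = []
    for t in range(steps):
        q1 += arrivals.get(t, 0)
        if policy == "time_slicing":
            run_o1 = o1_turn
            o1_turn = not o1_turn
        else:
            run_o1 = q1 > 0
        if run_o1:
            taken = min(q1, n)         # O1 handles up to n tuples per time unit
            q1 -= taken
            q2 += taken // 5           # 20% selectivity
        else:
            q2 -= min(q2, n // 5)      # O2 handles up to n/5 tuples, outputs nothing
        totals.append(q1 + q2)
    return totals


n = 10
arrivals = {0: 2 * n, 4: n, 6: n}      # assumed bursty arrival pattern
for policy in ("time_slicing", "o1_first"):
    print(policy, simulate(policy, arrivals, n, steps=8))
```

Under these assumptions, running O1 first never leaves more tuples queued than time slicing does, because O1 shrinks the data by a factor of five before it is stored again.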
78
Algorithm 1
[Chart: queue size over time (time units 1 to 8). Orange: tuples in Q1; yellow: tuples in Q2. Arrivals marked: 2n tuples, n tuples, n tuples.]
79
Algorithm 2
[Chart: queue size over time (time units 1 to 8). Orange: tuples in Q1; yellow: tuples in Q2. Arrivals marked: 2n tuples, n tuples, n tuples.]
80
Comparison: which is better?
[Chart: total size of both queues over time (time units 1 to 8). Orange: Algorithm 1; yellow: Algorithm 2. Arrivals marked: 2n tuples, n tuples, n tuples.]
81
Greedy Scheduler Rule
Schedule the operator that consumes the largest number of tuples per time unit and is the most selective (produces the fewest tuples)
Operators with full batches in their input queues are favored over high-priority operators with under-full inputs (better utilization of the time slice)
A high-priority operator may be underutilized if its feeders are low priority: consider chains of operators
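A minimal sketch of the greedy rule follows. The scoring formula is an assumption made here, one simple way to combine the two criteria: the net number of tuples removed from the queues in one time unit, i.e. tuples consumed times (1 - selectivity). The names `Operator`, `net_reduction`, and `pick_next` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    batch: int          # max tuples consumed in one time unit
    selectivity: float  # output tuples produced per input tuple
    queued: int         # tuples currently waiting in the input queue


def net_reduction(op):
    """Tuples removed from the system if op runs for one time unit."""
    consumed = min(op.queued, op.batch)  # an under-full input lowers the score
    return consumed * (1.0 - op.selectivity)


def pick_next(ops):
    """Greedy choice: the ready operator with the largest net reduction."""
    ready = [op for op in ops if op.queued > 0]
    return max(ready, key=net_reduction) if ready else None


# O1 and O2 from the scheduling example, with n = 10.
o1 = Operator("O1", batch=10, selectivity=0.2, queued=10)
o2 = Operator("O2", batch=2, selectivity=0.0, queued=2)
print(pick_next([o1, o2]).name)

# With only one tuple waiting, O1 scores below O2, which has a full batch.
o1.queued = 1
print(pick_next([o1, o2]).name)
```

Because the score uses the tuples actually available rather than the nominal batch size, an operator with a full batch can win over a nominally better operator whose input is nearly empty. The sketch does not address chains of operators.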
82
Scheduling Algorithm Discussion
Queue size minimization:
Increased time to initial results
Algorithm 1 (time-slicing) would produce initial results faster
Incorporate response time and weighted fairness into the algorithm
Flexible time slices
Taking context switching into account
83
Resource Management
Relevant Resources:
Memory
CPU
I/O (if disk used)
Network (in a distributed DSMS)
Our Goal:
Maximize query precision by making the best use of available resources, and be able to do so dynamically and adaptively
84
Resource Management (Cont.)
Focus on memory used by synopses and queues
Algorithms developed in STREAM:
Allocating memory to the query plan
Incorporating known constraints on input streams to reduce synopses without compromising precision
Operator scheduling to minimize queue size
85
Resource Management Approaches (Cont.)
Exploiting constraints over data streams
When additional information about streams is available (gathered statistics, constraint specifications), reduce resource utilization while keeping the same result precision
86
Adaptation: why?
Queries are long-running
Parameters that may vary:
Stream flow rate
Stream data characteristics
Environment (available RAM)
How to adapt?
87
Exploiting Constraints over Data Streams
Query Q: join of stream Orders (O) and stream Fulfillments (F) on Order_ID and Item_ID, to monitor fulfillment delays
Answering requires synopses of unbounded size!
[Diagram: Stream Orders (O) and Stream Fulfillments (F) feed the join, with synopses Synop-O and Synop-F]
88
Constraints (Cont.)
Tuples for a given (orderID, itemID) arrive at stream O before the corresponding tuples arrive at F
No need to maintain a join synopsis for F!
Another constraint: tuples arrive at F clustered by orderID
We need only save the O tuples for a given orderID until the next orderID is seen
[Diagram: two example tuple sequences. First: (Ord1, item 4), (Ord1, item 2), (Ord1, item 1), (Ord1, item 3), (Ord3, item 4), (Ord3, item 1). Second: (Ord1, item 3), (Ord1, item 2), (Ord1, item 1), (Ord3, item 1), (Ord3, item 4), (Ord3, item 2). Label: "More RAM needed for synopsis".]
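The effect of these two constraints on memory can be sketched as follows. This is an illustration under explicit assumptions, not the STREAM implementation: the function name `join_with_constraints`, the event layout `(stream, order_id, item_id, timestamp)`, and the sample data are invented here. The sketch assumes that every Orders tuple arrives before its matching Fulfillments tuple, and that Fulfillments tuples arrive clustered by order ID.

```python
def join_with_constraints(events):
    """Join Orders (O) with Fulfillments (F) on (order_id, item_id).

    Keeps a synopsis for O only. Once F moves on to a new order ID, the saved
    O tuples of the previous order are dropped, because clustered arrival on F
    means no further matches for that order can arrive.
    Returns (list of (order_id, item_id, delay), peak synopsis size).
    """
    synop_o = {}            # (order_id, item_id) -> order timestamp
    current_f_order = None  # order ID of the F cluster being read
    peak = 0
    results = []
    for stream, order_id, item_id, ts in events:
        if stream == "O":
            synop_o[(order_id, item_id)] = ts
        else:
            if current_f_order is not None and order_id != current_f_order:
                # Previous F cluster is finished: purge its O tuples.
                synop_o = {k: v for k, v in synop_o.items()
                           if k[0] != current_f_order}
            current_f_order = order_id
            ordered_at = synop_o.get((order_id, item_id))
            if ordered_at is not None:
                results.append((order_id, item_id, ts - ordered_at))
        peak = max(peak, len(synop_o))
    return results, peak


events = [
    ("O", 1, 1, 0), ("O", 1, 2, 1), ("F", 1, 1, 3),
    ("O", 3, 1, 4), ("F", 1, 2, 5), ("F", 3, 1, 8),
]
delays, peak = join_with_constraints(events)
print(delays, peak)
```

No F tuples are stored at all, and the O synopsis stays bounded by the orders whose fulfillments are still pending, rather than growing with the length of the streams.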
89
Constraints (Cont.)
Referential integrity
Unique-value
Clustered-arrival
Ordered-arrival
90
Summary
Architecture for DSMS
Query Language
Common Design Problems
Tradeoff: efficiency, accuracy, storage
91
References
“Continuous Queries over Data Streams” by S. Babu, J. Widom (Stanford University)
“Query Processing, Approximation, and Resource Management in a Data Stream Management System” by R. Motwani, J. Widom and others (Stanford University)
http://www.db.stanford.edu/stream
Questions?
93