7/3/13
Distributed Data Management Summer Semester 2013
TU Kaiserslautern
Dr.-Ing. Sebastian Michel
[email protected]‐saarland.de
Distributed Data Management, SoSe 2013, S. Michel 1
(DISTRIBUTED) DATA STREAM PROCESSING
Lecture 9
Data Stream Management vs. Traditional Data Management
• Data is moving! Continuously generated (assumed infinite!)
• At high pace
• Queries are (mainly) continuous (aka standing): registered once, observed "forever".
• Answers to queries are (often) required in (near) real time
• Probabilistic methods for efficiency, or considering only part of the stream (sliding window)
[Figure: DATA STREAM, set of queries, results]
DBMS vs. DSMS
Database management system (DBMS) | Data stream management system (DSMS)
Persistent data (relations) | Volatile data streams
Random access | Sequential access
One-time queries | Continuous queries
(Theoretically) unlimited secondary storage | Limited main memory
Only the current state is relevant | Consideration of the order of the input
Relatively low update rate | Potentially extremely high update rate
Little or no time requirements | Real-time requirements
Assumes exact data | Assumes outdated/inaccurate data
Plannable query processing | Variable data arrival and data characteristics
Source: http://en.wikipedia.org/wiki/Data-stream_management_system
Data Stream Model
• Stream of data items is unbounded (available memory is not)
• No way to store the entire stream (how could we, it is (probably) not ending)
• To compute query results, we need to devise algorithms with little memory consumption
Overview of Data Stream Topics
• Synopses:
  – concise representations of stream content
  – tailored to tasks, e.g., counting distinct elements
  – usually not exact, but approximations (estimators) of true values
• Windows:
  – focus on a certain recent subset of the data
  – computation of functions/joins over window(s) content
  – Think: SQL over stream windows (ranges)
SYNOPSES (ESTIMATORS)
Counting Occurrences
• Consider a stream of elements a_i: …, a2, a84, a41, a2, a77, a231, a2, a4, a54, …
• How often does a2 occur?
• How to implement?
• Keep a counter for each id
• Required space: #ids (= N)
• Not feasible if N is very large
Probabilistic Counting: Count-Min Sketch
• Keep a 2-dimensional array with h rows and r columns
• h hash functions that map to the range 0…(r-1)
• Arriving item a
• For each j: array[j, hj(a)]++
[Figure: array with rows h1, h2, h3, h4 and columns 0 to 5]
Cormode, Muthukrishnan (2005). An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55(1): 58–75.
Count-Min Sketch: Counting
• How often did we see item a?
• h1(a) = 4, h2(a) = 5, h3(a) = 0, h4(a) = 2
• Take the minimum of the corresponding values in the 2-d array. Here: 4

      0  1  2  3  4  5
h1    5  3  4  4  9  3
h2    4  7  1  4  4  8
h3    8  4  6  7  2  1
h4    3  1  4  8  7  5

Selected counters: 9 (h1), 8 (h2), 8 (h3), 4 (h4)
• The estimate never underestimates the true count
• The overestimation is probabilistically bounded
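The update and query steps of the Count-Min sketch can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the Cormode and Muthukrishnan paper: the class name `CountMinSketch`, the default sizes, and the use of salted SHA-256 as a stand-in for a family of hash functions are all choices made here for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: h rows (hash functions), r columns."""

    def __init__(self, num_hashes=4, width=6):
        self.num_hashes = num_hashes
        self.width = width
        self.table = [[0] * width for _ in range(num_hashes)]

    def _bucket(self, j, item):
        # Assumption: salting SHA-256 with the row index j stands in for
        # the j-th hash function h_j, mapping to the range 0..width-1.
        digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # For each j: array[j, h_j(item)] += count
        for j in range(self.num_hashes):
            self.table[j][self._bucket(j, item)] += count

    def estimate(self, item):
        # Minimum over the counters the item maps to, one per row.
        return min(
            self.table[j][self._bucket(j, item)]
            for j in range(self.num_hashes)
        )


stream = ["a2", "a84", "a41", "a2", "a77", "a231", "a2", "a4", "a54"]
cms = CountMinSketch(num_hashes=4, width=6)
for item in stream:
    cms.add(item)
print(cms.estimate("a2"))  # at least 3, the true count of a2
```

Because every cell only ever receives increments, `estimate` can overshoot when other items collide with the queried one, but it cannot fall below the true count.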
Unbiased vs. Biased Estimators
• Given a real number $n$ and an estimator of it, denoted as $\hat{n}$
• E.g., the number of distinct elements in a set S
• $\hat{n}$ is called an unbiased estimator of $n$ if $E[\hat{n}] = n$
• and biased otherwise, in which case $\mathrm{Bias}[\hat{n}] = E[\hat{n}] - n$
Counting Distinct Elements
• Consider a stream of elements a_i: …, a2, a84, a41, a2, a77, a231, a2, a4, a54, …
• How to compute/estimate the number of distinct elements observed?
Usability
• Streams (one pass, little memory footprint)
• Distributed systems: compact data exchange (recall Bloom filter)
• Sketches for partial data can be merged for a global view
[Figure: two sketches; efficient counting and comparing]
Flajolet-Martin (FM) Sketch (aka Hash Sketch)
• Allocate a bit vector B of size m = log(N)
• Hash items to bit-vector positions according to a geometric distribution:
  – Hash each item i to an m-bit number h(i)
  – Compute the position k of the least-significant "1" of h(i)
  – Set the bit B[k] to "1"
• Example: S: 17, 5, 19, 211, 17, 5, 31 (positions counted from 1, starting at the right)
  – h(17) = 010100, so the least-significant 1 bit is at position 3
  – h(5) = 000101, so the least-significant 1 bit is at position 1
  – ...
• Proposed originally by Flajolet and Martin in 1985
FM-Sketch: Estimator
• In the end B might look like this: B: 111010
• Get the position t of the left-most "0" bit of B
• Count-distinct estimate of the real distinct number n: $\hat{n} = 2^t / 0.7735$
  – here, with t = 4: $\hat{n} = 2^4 / 0.7735 \approx 20.685$
• Improvement:
  – If you use more bitmaps and compute an average position t, you can improve the count-distinct estimate
Note: Be careful with the left-most bit; it depends on the interpretation of the bits
FM Sketch: Intuition/Idea
• B[0] is set approximately n/2 times
• B[1] is set approximately n/4 times
• B[i] = 0 if i >> log2(n)
• B[i] = 1 if i << log2(n)
• "Mix" of 1s and 0s around i ≈ log2(n)
• Use the left-most zero as an indicator for log2(n): $n \approx 2^{\text{position of left-most zero bit}}$
FM-Sketch: Union
• Given: two multisets S and T and their sketches $B_S$ and $B_T$ of size m
• Then: the sketch $B_{S \cup T} = B_S \lor B_T$ (bitwise OR) is the sketch of $S \cup T$
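The FM sketch (construction, estimate, and union) can be sketched in Python as below. This is a rough illustration under assumptions, not the authors' implementation: the function names, the vector size `M = 32`, and SHA-256 as the hash are hypothetical choices. Positions are counted from 0 here (bit k of the Python integer plays the role of B[k]), so the value of t differs by one from an example that counts positions from 1.

```python
import hashlib

M = 32        # bit-vector size m (assumed here)
PHI = 0.7735  # correction factor used in the estimator


def h(item, m=M):
    """Hash an item to an m-bit number (SHA-256 as a stand-in hash)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << m) - 1)


def lsb_position(x, m=M):
    """0-based position of the least-significant 1 bit (m - 1 if x == 0)."""
    if x == 0:
        return m - 1
    return (x & -x).bit_length() - 1


def fm_sketch(items, m=M):
    """Build B as a Python int; bit k of the int is B[k]."""
    bits = 0
    for item in items:
        bits |= 1 << lsb_position(h(item, m), m)
    return bits


def fm_estimate(bits, m=M):
    """Find the lowest-index zero bit t and return 2^t / PHI."""
    t = 0
    while t < m and (bits >> t) & 1:
        t += 1
    return 2 ** t / PHI


def fm_union(bits_s, bits_t):
    """Sketch of the union of two multisets: bitwise OR of the sketches."""
    return bits_s | bits_t


S = [17, 5, 19, 211, 17, 5, 31]
T = [5, 42, 99]
print(fm_estimate(fm_union(fm_sketch(S), fm_sketch(T))))
```

Duplicates set the same bit again and therefore leave the sketch unchanged, which is why the sketch counts distinct elements and why two partial sketches can be merged with a plain OR.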
K-Min Value (KMV) Synopsis
• Given a set S of values. Want the number of distinct elements n := D(S) (notation)
• Hashing outputs values uniformly in [0,1]
• KMV synopsis is the ordered set of the k smallest values: $L = \{U_{(1)}, U_{(2)}, \ldots, U_{(k)}\}$
[Figure: hashed values on the interval from 0 to 1, with the k-min values marked]
• Unbiased estimator: $\hat{n}_k^{UB} = (k-1) / U_{(k)}$
  – Exact error analysis based on theory of order statistics
  – Asymptotically optimal as k becomes large
Slide based on PPT slides from Beyer et al. '07
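A small Python sketch of the KMV synopsis and its estimator (k-1)/U(k) may help here. It is an illustration under assumptions rather than the method of Beyer et al.: the helper names are hypothetical, SHA-256 scaled to [0, 1) stands in for a uniform hash, and the synopsis is built by sorting instead of in one streaming pass.

```python
import hashlib


def hash01(value):
    """Map a value to a pseudo-uniform number in [0, 1) (SHA-256 stand-in)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(values, k):
    """Ordered list of the k smallest distinct hash values."""
    return sorted({hash01(v) for v in values})[:k]


def kmv_estimate(synopsis, k):
    """Estimate the number of distinct values as (k - 1) / U_(k)."""
    if len(synopsis) < k:
        # Fewer than k distinct values seen: the synopsis size is the count.
        return float(len(synopsis))
    return (k - 1) / synopsis[-1]


# 50,000 stream elements, but only 10,000 distinct values
values = [i % 10_000 for i in range(50_000)]
L = kmv_synopsis(values, k=256)
estimate = kmv_estimate(L, k=256)
print(round(estimate))  # expected to land near 10,000
```

The intuition: if n distinct values are hashed uniformly into [0, 1], the k-th smallest hash value sits at roughly k/n, so inverting U(k) recovers n.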
(Multiset) Union of Partitions
• Take the union of the values and consider again the k smallest ones
• Combine KMV synopses: L = LA ⊕ LB
• Theorem: L is a KMV synopsis of A∪B
[Figure: the k-min values of LA and LB on the interval from 0 to 1 are united into L; U(k) is the k-th smallest value in L]
Slide based on PPT slides from Beyer et al. '07
(Multiset) Intersection of Partitions
• L = LA ⊕ LB as before (union): contains k elements
  – L corresponds to a uniform random sample of DVs in A∪B
• K∩ = # values in L that are also in D(A∩B)
• K∩/k estimates the Jaccard coefficient: $\frac{D(A \cap B)}{D(A \cup B)}$
• $\hat{n}_\cup = (k-1)/U_{(k)}$ estimates $n_\cup = D(A \cup B)$
• Unbiased estimator of #DVs in the intersection: $\hat{n}_\cap = \frac{K_\cap}{k} \left( \frac{k-1}{U_{(k)}} \right)$
D(set) = distinct values
Slide based on PPT slides from Beyer et al. '07
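The union and intersection estimates over two KMV synopses can be sketched as follows. This is a hedged illustration, not code from Beyer et al.: the helper names are hypothetical, SHA-256 stands in for a uniform hash into [0, 1), and both sets are assumed to contain at least k distinct values.

```python
import hashlib


def hash01(value):
    """Map a value to a pseudo-uniform number in [0, 1) (SHA-256 stand-in)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(values, k):
    """Ordered list of the k smallest distinct hash values."""
    return sorted({hash01(v) for v in values})[:k]


def kmv_union(l_a, l_b, k):
    """LA (+) LB: unite the values and keep the k smallest again."""
    return sorted(set(l_a) | set(l_b))[:k]


def kmv_intersection_estimate(l_a, l_b, k):
    """Return (Jaccard estimate K/k, estimated #DVs in the intersection)."""
    l = kmv_union(l_a, l_b, k)
    in_both = set(l_a) & set(l_b)
    # K = number of values in L that stem from elements in both sets
    k_cap = sum(1 for u in l if u in in_both)
    jaccard = k_cap / k
    return jaccard, jaccard * (k - 1) / l[-1]


k = 256
A = range(0, 6000)       # 6,000 distinct values
B = range(4000, 10000)   # 6,000 distinct values, 2,000 shared with A
LA, LB = kmv_synopsis(A, k), kmv_synopsis(B, k)
jaccard, n_cap = kmv_intersection_estimate(LA, LB, k)
print(jaccard, n_cap)    # true values: 0.2 and 2000
```

A hash value among the k smallest of A∪B that belongs to an element of A∩B is necessarily among the k smallest of both A and B, which is why membership in both `LA` and `LB` is enough to count it towards K.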
Min-Hashing
• A hash function h maps the elements of a set to the integer space. Let's do that for two sets, A and B.
• Let hmin(A) and hmin(B) denote the minimum of these numbers for set A and B, respectively.
• Then $P[h_{min}(A) = h_{min}(B)] = \frac{|A \cap B|}{|A \cup B|}$
Min-Hashing (Cont'd)
• Why does it work? If the min values are the same, the element that causes the min value has to be in $A \cap B$; the probability of that is $\frac{|A \cap B|}{|A \cup B|}$
  – Unbiased estimator (but too high variance)
• As seen before for other estimators: improved estimation quality through multiple "rounds" of estimates (error is O(1/√k), with k rounds)
  – one min value, multiple hash functions
  – several min values, one hash function
Note: Hash functions have to be min-wise independent
Min-Hash: Estimator
count = 0
for each hash function h:
    if hmin(A) == hmin(B) then
        count++
end
Estimate of Jaccard is given as count/k
See exercise in assignment sheet 5
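The estimator loop can be made runnable in Python roughly as follows. The sketch uses the "one min value, multiple hash functions" variant; the function names are hypothetical, and salted SHA-256 is only an assumed stand-in for a min-wise independent family of hash functions.

```python
import hashlib


def make_hash(seed):
    """Return one hash function; the seed selects a member of the family."""
    def h(x):
        digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return h


def minhash_signature(items, hash_functions):
    """h_min(items) for every hash function."""
    return [min(h(x) for x in items) for h in hash_functions]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minima agree: count / k."""
    count = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return count / len(sig_a)


k = 200
hash_functions = [make_hash(seed) for seed in range(k)]
A = set(range(0, 100))
B = set(range(50, 150))   # |A ∩ B| = 50, |A ∪ B| = 150
sig_a = minhash_signature(A, hash_functions)
sig_b = minhash_signature(B, hash_functions)
estimate = estimate_jaccard(sig_a, sig_b)
print(estimate)           # true Jaccard coefficient is 1/3
```

The signatures are small and fixed in size (k numbers per set), so two sites can compare sets by exchanging signatures instead of the sets themselves.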
Literature
• Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)
• Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
• Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
• Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan: Counting distinct elements in a data stream. In Proc. RANDOM, pages 1–10, 2002.
• Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multiset operations. SIGMOD Conference 2007: 199-210
DATA STREAM MANAGEMENT SYSTEMS AND CQL
Thanks to Johannes Gehrke (Cornell) for providing some of the following material. Many of the slides are initially based on material by Jennifer Widom (Stanford).
Data Stream Model
• A stream S is a (possibly) infinite bag (multiset) of elements <s,τ> where s is a tuple belonging to the schema of S and τ is the timestamp of the element.
• Think: tuples of a relational DBMS extended with a timestamp, streaming in.
Data Streams: Example
• Monitoring of highway traffic: PosSpeedStr(vehicleId,speed,xPos,dir,hwy)
• E.g., for:
  – congestion prediction/warning
  – estimates of travel time
  – toll collection!
  – tickets for speeding
Data Streams: Example
• Environmental monitoring: StationStream(humidity, solarRadiation, windSpeed, snowHeight)
• Various application scenarios:
  – avalanche risk level computation
  – insights for agriculture
  – air pollution (urban) monitoring
Continuous Queries
• In contrast to ad-hoc, one-time queries in a (relational) DBMS.
• Queries over streams are considered continuous: registered once, run "forever":
  – "want to stay updated on the avalanche risk, not just check once"
• Also called standing queries or subscriptions (in the publish/subscribe context)
• For instance:
  – Compute the average temperature.
  – Select all orders of stock "Apple" with a quantity larger than 100.
What and How can we Compute DB-Style Queries?
• How to compute average values over an infinite stream? Block forever?
• How to join infinite streams if join partners can arrive at arbitrary times (or not at all)?
• Idea: keep a window that turns a continuous (infinite) stream into a snapshot/static relation
Sliding Window Concept
• Focus attention on the latest values of the stream
• Allows computation of aggregates
• Joins are computed across windows laid over other (or the same) streams
[Figure: time axis with past data, the current window, and the future]
Sliding Window: Example
18.3°C 13.5°C 27.0°C 11.6°C 29.6°C 39.7°C 24.2°C 11.5°C 12.7°C 27.9°C ….
• Window of size W
  – based on time (=> time-based)
  – or on the number of tuples inside (count-based)
• Shifted every t by B
Sliding Window Aggregates
• Output the average for each window when it slides.
• Stream: 18.3°C 13.5°C 27.0°C 11.6°C 29.6°C 39.7°C 24.2°C 11.5°C 12.7°C 27.9°C ….
• Here (count-based windows of 4 values, shifted by 3):
  – 17.6°C
  – 26.3°C
  – 19.1°C
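A count-based sliding-window average can be sketched in Python as below. The function name and parameters are hypothetical, and the window size of 4 tuples with a shift of 3 is an assumption that appears to be what the listed temperatures suggest, not something the slide states.

```python
from collections import deque


def sliding_averages(stream, size, slide):
    """Emit the average of the last `size` values every `slide` arrivals."""
    window = deque(maxlen=size)   # the oldest value drops out automatically
    averages = []
    for seen, value in enumerate(stream, start=1):
        window.append(value)
        # First output once the window is full, then after every slide.
        if seen >= size and (seen - size) % slide == 0:
            averages.append(sum(window) / size)
    return averages


temps = [18.3, 13.5, 27.0, 11.6, 29.6, 39.7, 24.2, 11.5, 12.7, 27.9]
averages = sliding_averages(temps, size=4, slide=3)
print([round(a, 1) for a in averages])
```

Setting `slide` equal to `size` turns the same function into a tumbling window, where consecutive windows do not overlap.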
Sliding Window Joins
• Join is executed over individual window contents.
[Figure: window 1 over stream 1 is joined with window 2 over stream 2]
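One way to picture a join over window contents is a symmetric join with a time-based window on each stream. The Python sketch below is illustrative only: the event layout `(timestamp, stream_id, key, payload)`, the function name, and the equality join on `key` are assumptions made here.

```python
from collections import deque


def window_join(events, range_seconds):
    """Join two streams over time-based windows of `range_seconds`.

    events: iterable of (timestamp, stream_id, key, payload), ordered by
    timestamp, with stream_id being 1 or 2.
    """
    windows = {1: deque(), 2: deque()}
    results = []
    for ts, sid, key, payload in events:
        other = 2 if sid == 1 else 1
        # Expire elements that have fallen out of either window.
        for window in windows.values():
            while window and window[0][0] <= ts - range_seconds:
                window.popleft()
        # Probe the other stream's window for join partners.
        for _, o_key, o_payload in windows[other]:
            if o_key == key:
                pair = (payload, o_payload) if sid == 1 else (o_payload, payload)
                results.append((ts, key) + pair)
        windows[sid].append((ts, key, payload))
    return results


events = [
    (0, 1, "o1", "order o1"),
    (2, 2, "o1", "Sue"),       # joins with order o1 (2 seconds apart)
    (3, 1, "o2", "order o2"),
    (20, 2, "o2", "Joe"),      # order o2 has already left the window
]
print(window_join(events, range_seconds=10))
```

Only the window contents are kept as state, so memory stays bounded even though both input streams are unbounded.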
Types of Sliding Windows
• Time-based window
  – window contains tuples within a certain time range; e.g., Twitter tweets of the last 10 minutes, stock market values of the last 10 seconds
  – size can change arbitrarily if the input rate changes
• Count-based window
  – window contains at any time a fixed number of items, say, the last 100 tweets or the last 10000 stock trades
  – newly arriving items kick out older ones (once the window is filled up), depending on strategy (next slide)
Types of Sliding Windows (Cont’d)
• Sliding window: move the window at certain ticks/times, continuously or in blocks
• Tumbling window: create a new window for each time range or size W (i.e., collect data until full or for W time; then reset)
• At each slide/"tumble", a function can be applied to the window content and the result output
• This is also called a "trigger".
Overview of Data Stream Management Systems (DSMSs)
• STREAM (Stanford University), Aurora (Brandeis/Brown/MIT), TelegraphCQ (UC Berkeley), Cayuga (Cornell), PIPES (Uni Marburg), …
• Large interest also from companies/startups: Oracle, Microsoft, IBM, StreamBase
• Lately, open-source products for big-data distributed streams: Yahoo! S4, Twitter Storm (will see in detail later)
StreamBase Example UI
[Figure: StreamBase UI, source: http://www.streambase.com]
STREAM
• Stanford Stream Data Manager • “General purpose” DSMS for streams and stored data
• Declarative query language to phrase continuous queries (SQL-like).
Arvind Arasu et al.: STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19-26 (2003)
Continuous Query Language – CQL
SQL with:
• Streams
• Windows
• New semantics (stream)
  – Three relation-to-stream operators: Istream, Dstream, Rstream
• Sampling
Slide based on material from Jennifer Widom.
Example Relation (Used Later)
Simplified Linear Road Setup:
• A single input stream: the stream of positions and speeds of vehicles
  PosSpeedStr(vehicleId,speed,xPos,dir,hwy)
• vehicleId: vehicle
• speed: speed in MPH
• xPos: position of the vehicle within the highway in feet
• dir: direction (east or west)
• hwy: highway number
Slide based on material from Jennifer Widom.
Example Query 1
• Two streams:
  – Orders (orderID, customer, cost)
  – Fulfillments (orderID, clerk)
• Total cost of orders fulfilled over the last day by clerk "Sue" for customer "Joe"

SELECT sum(O.cost)
FROM Orders O, Fulfillments F [Range 1 Day]
WHERE O.orderID = F.orderID and F.clerk = "Sue" and O.customer = "Joe"
Example Query 2
• Using a 10% sample of the fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost

SELECT F.clerk, max(O.cost)
FROM Orders O, Fulfillments F [PARTITION BY clerk ROWS 5] 10% SAMPLE
WHERE O.orderID = F.orderID
GROUP BY F.clerk
CQL: Relations and Streams
• T: discrete, ordered time domain
• A relation R is a mapping from time T to a bag of tuples belonging to the schema of R.
• That is, R(t) varies over time
• Updates carry timestamps, too!
• A stream is a set of (tuple, timestamp) elements
Streams ↔ Relations
• Streams → Relations: window specification
• Relations → Streams: special operators Istream, Dstream, Rstream
• Relations → Relations: any relational query language
Slide based on material from Jennifer Widom.
Stream → Relation
• S [W] is a relation: at time T it contains all tuples in window W applied to stream S, up to time T.
• When W = ∞, it contains all tuples in stream S up to time T
• Ways to construct these windows "[W]":
  – Time-based
  – Tuple-based
  – Partitioned
Slide based on material from Jennifer Widom.
Time-Based Window
• S [Range T]
  – S [Now]
  – S [Range Unbounded]
Examples:
• PosSpeedStr [RANGE 30 Seconds]
• PosSpeedStr [NOW]
• PosSpeedStr [RANGE Unbounded]
Note: variable number of records in the window
Slide based on material from Jennifer Widom.
Tuple-Based Window
• S [Rows N]
  – If the tuples form only a partial order, ties are broken arbitrarily
  – [Rows Unbounded]
Example:
• PosSpeedStr [ROWS 1]
Slide based on material from Jennifer Widom.
Partitioned Windows
• S [Partition By A1,...,Ak Rows N]
  1. Logically partition S into substreams (compare to SQL GROUP BY)
  2. Compute a tuple-based sliding window on each substream
  3. Take the union
Example:
• PosSpeedStr [PARTITION BY vehicleId ROWS 1]
Recall: PosSpeedStr(vehicleId,speed,xPos,dir,hwy)
Slide based on material from Jennifer Widom.
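The three window types can be emulated over a finite prefix of a stream as in the following Python sketch. It is not part of STREAM or CQL: the function names and the sample tuples are hypothetical, a stream is modelled as a list of `(tuple, timestamp)` pairs, and `[Range T]` is read as covering timestamps from `now - T` up to `now` inclusive.

```python
from collections import defaultdict


def range_window(stream, now, t):
    """S [Range T] at time `now`; t = 0 behaves like S [Now]."""
    return [s for s, ts in stream if max(now - t, 0) <= ts <= now]


def rows_window(stream, now, n):
    """S [Rows N] at time `now`: the n most recent tuples (n >= 1)."""
    seen = [s for s, ts in sorted(stream, key=lambda e: e[1]) if ts <= now]
    return seen[-n:]


def partitioned_window(stream, now, key, n):
    """S [Partition By key Rows N]: partition, window each part, unite."""
    groups = defaultdict(list)
    for s, ts in sorted(stream, key=lambda e: e[1]):
        if ts <= now:
            groups[key(s)].append(s)
    return [s for group in groups.values() for s in group[-n:]]


# Hypothetical PosSpeedStr elements: (vehicleId, speed, xPos, dir, hwy)
stream = [
    ((1, 60, 100, "east", 5), 0),
    ((2, 70, 200, "east", 5), 10),
    ((1, 66, 900, "east", 5), 25),
    ((2, 50, 950, "east", 5), 40),
]
print(range_window(stream, now=40, t=30))
print(rows_window(stream, now=40, n=1))
print(partitioned_window(stream, now=40, key=lambda s: s[0], n=1))
```

With `key=lambda s: s[0]` and `n=1`, the partitioned window yields the latest position report of every vehicle, which mirrors the `PARTITION BY vehicleId ROWS 1` example.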
Relation → Relation
• With the previous window transform we get a relation; now we can apply
• any query expressed in SQL
  – just that we now deal with time-varying relations
Example:
• SELECT distinct vehicleId FROM PosSpeedStr [RANGE 30 Seconds]
Computes the active vehicles
Slide based on material from Jennifer Widom.
Relation → Stream
• Istream(R) contains a stream element (r,t) whenever r in R(t) \ R(t-1) ("insert stream")
• Dstream(R) contains a stream element (r,t) whenever r in R(t-1) \ R(t) ("delete stream")
• Rstream(R) contains a stream element (r,t) whenever r in R(t) ("relation stream")
Slide based on material from Jennifer Widom.
Istream, Dstream, and Rstream
• Istream(R): contains all tuples in R that are new within the last time period, i.e., insert stream
• Dstream(R): contains all tuples that were in R before the last time period (and are no longer in it now), i.e., delete stream
• Rstream(R): contains all tuples in R
Note: Istream and Dstream are expressible with Rstream and suitable selections. How?
Relation → Stream: Examples

SELECT Istream(*)
FROM PosSpeedStr [RANGE Unbounded]
WHERE speed > 65

SELECT Rstream(*)
FROM PosSpeedStr [NOW]
WHERE speed > 65

Slide based on material from Jennifer Widom.
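The three relation-to-stream operators can be emulated in Python over a relation that is given explicitly for the time steps 0, 1, 2, …. This is only a sketch of the definitions, not STREAM's implementation: the list-of-bags representation, the function names, and the convention that R(-1) is empty are assumptions made here.

```python
from collections import Counter


def rstream(relation):
    """(r, t) for every tuple r in R(t)."""
    return [(r, t) for t, bag in enumerate(relation) for r in bag]


def istream(relation):
    """(r, t) whenever r is in R(t) but not in R(t-1) (bag difference)."""
    out, prev = [], Counter()
    for t, bag in enumerate(relation):
        cur = Counter(bag)
        out.extend((r, t) for r in (cur - prev).elements())
        prev = cur
    return out


def dstream(relation):
    """(r, t) whenever r is in R(t-1) but not in R(t) (bag difference)."""
    out, prev = [], Counter()
    for t, bag in enumerate(relation):
        cur = Counter(bag)
        out.extend((r, t) for r in (prev - cur).elements())
        prev = cur
    return out


# Hypothetical result relation "vehicles with speed > 65" at t = 0, 1, 2
R = [["car1"], ["car1", "car2"], ["car2"]]
print(istream(R))  # [('car1', 0), ('car2', 1)]
print(dstream(R))  # [('car1', 2)]
print(rstream(R))
```

`Counter` subtraction keeps only positive counts, which matches bag difference: a tuple that occurs twice in R(t) and once in R(t-1) is inserted once.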
Query Results at Time T
• Use all relations at time T
• Use all streams up to T, converted to relations
• Compute relational results
• Convert the result to streams if desired
Slide based on material from Jennifer Widom.
Examples
• What is the following query doing?
  SELECT Istream(Avg(A)) FROM S [Range 5 seconds]
  Intention maybe: emit the 5-second moving average on every timestep, but output is generated only if the average changes (Istream!)
• To emit a result on every timestep:
  SELECT Rstream(Avg(A)) FROM S [Range 5 seconds]
• To emit a result every second:
  SELECT Rstream(Avg(A)) FROM S [Range 5 seconds Slide 1 second]
Slide based on material from Jennifer Widom.
Examples (Cont'd)
Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)

SELECT F.clerk, max(O.cost)
FROM O [∞], F [Rows 1000]
WHERE O.orderID = F.orderID
GROUP BY F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T
Slide based on material from Jennifer Widom.
Examples (Cont'd)
Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)

SELECT Istream(F.clerk, max(O.cost))
FROM O [∞], F [Rows 1000]
WHERE O.orderID = F.orderID
GROUP BY F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T
• Streamed result: new result (<clerk, max>, T), whenever <clerk, max> changes from T-1
Slide based on material from Jennifer Widom.
Query Execution in STREAM
• When a continuous query is registered, generate a query execution plan
  – New plan merged with existing plans
  – Users can also create & manipulate plans directly
• Plans composed of three main components:
  – Operators
  – Queues (input and inter-operator)
  – State (windows, operators requiring history)
• Global scheduler for plan execution
Slide based on material from Jennifer Widom.
Simple Query Plan
[Figure: query plan for queries Q1 and Q2 over Stream1, Stream2, and Stream3, built from join (⋈) and selection (σ) operators with State1 to State4, driven by a scheduler]
Slide courtesy of Jennifer Widom.
More Topics
• So far we have seen only the formal model and standard concepts of data stream management systems
• There is of course much more to it
• Implementation, optimization (e.g., equivalences), load shedding, ...
• This would be a lecture of its own.
• Next, we look at system aspects in distributed data stream management systems and (mobile) sensor networks
Literature
• Arvind Arasu, Shivnath Babu, Jennifer Widom: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2): 121-142 (2006)
• Arvind Arasu et al.: STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19-26 (2003)
• http://infolab.stanford.edu/~widom/cql-talk.pdf
• Alan J. Demers, Johannes Gehrke, Biswanath Panda, Mirek Riedewald, Varun Sharma, Walker M. White: Cayuga: A General Purpose Event Monitoring System. CIDR 2007: 412-422
• Jürgen Krämer, Bernhard Seeger: Semantics and implementation of continuous sliding window queries over data streams. ACM Trans. Database Syst. 34(1) (2009)