
7/3/13  


Distributed Data Management, Summer Semester 2013

TU Kaiserslautern

Dr.-Ing. Sebastian Michel

[email protected]­‐saarland.de  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   1  

(DISTRIBUTED)  DATA  STREAM  PROCESSING    

Lecture  9  


Data Stream Management vs. Traditional Data Management

• Data is moving! Continuously generated (assumed infinite!)
• At high pace
• Queries are (mainly) continuous (a.k.a. standing): registered once, observed "forever".
• Answers to queries are (often) required in (near) real time
• Probabilistic methods for efficiency, or considering only part of the stream (sliding window)

[Figure: a DATA STREAM passes a set of queries, which produce results]

DBMS vs. DSMS

Database management system (DBMS)           | Data stream management system (DSMS)
--------------------------------------------|-----------------------------------------------
Persistent data (relations)                 | Volatile data streams
Random access                               | Sequential access
One-time queries                            | Continuous queries
(Theoretically) unlimited secondary storage | Limited main memory
Only the current state is relevant          | Consideration of the order of the input
Relatively low update rate                  | Potentially extremely high update rate
Little or no time requirements              | Real-time requirements
Assumes exact data                          | Assumes outdated/inaccurate data
Plannable query processing                  | Variable data arrival and data characteristics

Source: http://en.wikipedia.org/wiki/Data-stream_management_system

Data Stream Model

• The stream of data items is unbounded (the available memory is not)
• There is no way to store the entire stream (how could we? It is (probably) not ending)
• To compute query results, we need to devise algorithms with little memory consumption

Overview of Data Stream Topics

• Synopses:
  – concise representations of stream content
  – tailored to tasks, e.g., counting distinct elements
  – usually not exact, but approximations (estimators) of true values
• Windows:
  – focus on a certain recent subset of the data
  – computation of functions/joins over window(s) content
  – Think: SQL over stream windows (ranges)


SYNOPSES  (ESTIMATORS)  


Counting Occurrences

• Consider a stream of elements a_i:
      …, a_2, a_84, a_41, a_2, a_77, a_231, a_2, a_4, a_54, …
• How often does a_2 occur?
• How to implement?
  – Keep a counter for each id
  – Required space: #ids (= N)
  – Not feasible if N is very large

Probabilistic Counting: Count-Min Sketch

• Keep a 2-dimensional array of counters with h rows and r columns
• Use h hash functions h_1, …, h_h, each mapping items to the range 0…(r-1)
• For an arriving item a:
  – for each j: array[j, h_j(a)]++

[Figure: counter array with rows h1 to h4 and columns 0 to 5]

Cormode, Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55(1): 58-75 (2005).

Count-Min Sketch: Counting

• How often did we see item a?
• h1(a) = 4, h2(a) = 5, h3(a) = 0, h4(a) = 2
• Take the minimum of the corresponding values in the 2-d array. Here: min(9, 8, 8, 4) = 4
• The estimate never underestimates the true count
• The overestimation is probabilistically bounded

        0   1   2   3   4   5
  h1    5   3   4   4   9   3
  h2    4   7   1   4   4   8
  h3    8   4   6   7   2   1
  h4    3   1   4   8   7   5
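The update and lookup rules of the Count-Min sketch can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch`, its methods, and the use of SHA-256 salted with the row index as the hash family are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """rows x cols array of counters, one (assumed) hash function per row."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        self.table = [[0] * cols for _ in range(rows)]

    def _column(self, row: int, item) -> int:
        # Assumption: SHA-256 salted with the row index stands in for h_row.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.cols

    def add(self, item) -> None:
        # For each j: array[j, h_j(a)]++
        for j in range(self.rows):
            self.table[j][self._column(j, item)] += 1

    def estimate(self, item) -> int:
        # Minimum over the counters the item maps to; collisions can only
        # inflate counters, so the result never underestimates.
        return min(self.table[j][self._column(j, item)] for j in range(self.rows))


stream = ["a2", "a84", "a41", "a2", "a77", "a231", "a2", "a4", "a54"]
sketch = CountMinSketch(rows=4, cols=6)
for element in stream:
    sketch.add(element)
print(sketch.estimate("a2"))  # at least 3, the true count of a2
```

With only 6 columns, as in the slide's figure, collisions are likely; a wider array tightens the overestimation at the cost of more memory.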

Unbiased vs. Biased Estimators

• Given a real number $n$ and an estimator of it, denoted as $\hat{n}$
• E.g., number of distinct elements in a set S
• $\hat{n}$ is called an unbiased estimator if $E[\hat{n}] = n$
• and biased otherwise, in which case
      $\mathrm{Bias}[\hat{n}] = E[\hat{n}] - n$

Counting Distinct Elements

• Consider a stream of elements a_i:
      …, a_2, a_84, a_41, a_2, a_77, a_231, a_2, a_4, a_54, …
• How to compute/estimate the number of distinct elements observed?


Usability

• Streams (one pass, little memory footprint)
• Distributed systems: compact data exchange (recall Bloom filter)
• Sketches for partial data can be merged for a global view

[Figure: two sketches, labeled "Efficient Counting, Comparing"]

Flajolet-Martin (FM) Sketch (a.k.a. Hash Sketch)

• Proposed originally by Flajolet and Martin in 1985
• Allocate a bit vector B of size m = log2(N)
• Hash items to bit vector positions according to a geometric distribution:
  – Hash each item i to an m-bit number h(i)
  – Compute the position k of the least-significant "1" of h(i)
  – Set the bit B[k] to "1"

Example: S: 17, 5, 19, 211, 17, 5, 31
      h(17) = 010100, then least-sig. 1 bit = 3
      h(5)  = 000101, then least-sig. 1 bit = 1
      ...
(In this example, positions are counted from 1, starting at the right.)

FM-Sketch: Estimator

• In the end, B might look like this:   B: 111010
• Then get the position t of the left-most "0" bit of B
• Count-distinct estimate of the real distinct number n:
      $\hat{n} = 2^t / 0.7735$
  here, with t = 4:
      $\hat{n} = 2^4 / 0.7735 \approx 20.685$
• Improvement:
  – If you use more bitmaps and compute an average position t, you can improve the count-distinct estimate

Note: Be careful with the left-most bit; it depends on the interpretation of the bits.

FM Sketch: Intuition/Idea

• B[0] is set approximately n/2 times
• B[1] is set approximately n/4 times
• B[i] = 0 if i >> log2(n)
• B[i] = 1 if i << log2(n)
• "Mix" of 1s and 0s around i ≈ log2(n)
• Use the left-most zero as an indicator for log2(n): n ≈ 2^(position of the left-most zero bit)

FM-Sketch: Union

• Given: two multisets S and T and their sketches $B_S$ and $B_T$ of size m
• Then: the sketch $B_{S \cup T} = B_S \lor B_T$ (bitwise OR) is the sketch of $S \cup T$
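The FM sketch, its estimator, and the union can be tried out with a short Python sketch. It is an illustration under stated assumptions: the names `FMSketch` and `lowest_one_bit`, the default m = 32, and SHA-256 as the hash function are all choices made here, not part of the slides. Positions are counted from 0, as in the intuition slide (B[0] is hit by about n/2 items), so the bit vector 111010 gives t = 3 here, whereas the estimator slide counts from 1 and reads t = 4; this is the interpretation issue the slide's note warns about.

```python
import hashlib

PHI = 0.7735  # correction constant from the estimator slide


def lowest_one_bit(x: int, m: int) -> int:
    """0-based position of the least-significant 1 bit of x (m - 1 if x is 0)."""
    if x == 0:
        return m - 1
    return (x & -x).bit_length() - 1


class FMSketch:
    def __init__(self, m: int = 32) -> None:
        self.m = m
        self.bits = 0  # integer bit mask; bit k of the mask plays the role of B[k]

    def _hash(self, item) -> int:
        # Assumption: SHA-256 truncated to m bits stands in for h(i).
        digest = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") % (1 << self.m)

    def add(self, item) -> None:
        self.bits |= 1 << lowest_one_bit(self._hash(item), self.m)

    def first_zero(self) -> int:
        t = 0
        while t < self.m and (self.bits >> t) & 1:
            t += 1
        return t

    def estimate(self) -> float:
        return 2 ** self.first_zero() / PHI

    def union(self, other: "FMSketch") -> "FMSketch":
        merged = FMSketch(self.m)
        merged.bits = self.bits | other.bits  # bitwise OR of the two bit vectors
        return merged


sketch = FMSketch()
for item in [17, 5, 19, 211, 17, 5, 31]:
    sketch.add(item)
print(sketch.estimate())  # rough estimate of the 5 distinct values
```

Because duplicates set the same bit again, the sketch is insensitive to repetitions, and OR-ing two sketches gives exactly the sketch that a single pass over both inputs would have produced. A single bitmap is a coarse estimator; averaging over several bitmaps, as the slide suggests, is not shown.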

K-Min Value (KMV) Synopsis

• Given a set S of values. Want the number of distinct elements n := D(S) (notation)
• Hashing outputs values uniformly in [0,1]
• The KMV synopsis is the ordered set of the k smallest hash values:
      $L = \{U_{(1)}, U_{(2)}, \ldots, U_{(k)}\}$
• Unbiased estimator:
      $\hat{n}_k^{UB} = (k - 1) / U_{(k)}$
  – Exact error analysis based on the theory of order statistics
  – Asymptotically optimal as k becomes large

[Figure: unit interval from 0 to 1 with the k-min values marked]

Slide based on PPT slides from Beyer et al. '07
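A small Python sketch of building a KMV synopsis and applying the estimator (k - 1) / U_(k). The function names `unit_hash`, `kmv_synopsis`, and `estimate_distinct` are made up for this example, and SHA-256 scaled to the unit interval is only an assumed stand-in for a hash function with uniform output. Returning the exact count when fewer than k distinct values were seen is also a choice made here.

```python
import hashlib


def unit_hash(value) -> float:
    """Map a value to a pseudo-uniform number in [0, 1] (assumed stand-in hash)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(values, k: int) -> list:
    """Ordered list of the k smallest distinct hash values."""
    return sorted({unit_hash(v) for v in values})[:k]


def estimate_distinct(synopsis: list, k: int) -> float:
    if len(synopsis) < k:
        return float(len(synopsis))  # fewer than k distinct values: count is exact
    return (k - 1) / synopsis[k - 1]  # (k - 1) / U_(k)


k = 256
data = list(range(10_000))
synopsis = kmv_synopsis(data + data, k)  # every value occurs twice in the stream
print(estimate_distinct(synopsis, k))    # should be in the vicinity of 10000
```

Duplicates hash to the same value and therefore never change the synopsis, which is why the estimate reflects distinct values rather than stream length.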


(Multiset) Union of Partitions

• Combine KMV synopses: $L = L_A \oplus L_B$
  – Take the union of the values and consider again the k smallest ones
• Theorem: L is a KMV synopsis of $A \cup B$

[Figure: the k-min values of L_A and L_B on the unit interval from 0 to 1 are merged into L; U_(k) marks the k-th smallest value of L]

Slide based on PPT slides from Beyer et al. '07

(Multiset) Intersection of Partitions

• $L = L_A \oplus L_B$ as before (union): contains k elements
  – L corresponds to a uniform random sample of the distinct values (DVs) in $A \cup B$
• $K_\cap$ = # values in L that are also in $D(A \cap B)$
• $K_\cap / k$ estimates the Jaccard coefficient:
      $D(A \cap B) / D(A \cup B)$
• $\hat{n}_\cup = (k - 1) / U_{(k)}$ estimates $n_\cup = D(A \cup B)$
• Unbiased estimator of #DVs in the intersection:
      $\hat{n}_\cap = \frac{K_\cap}{k} \left( \frac{k - 1}{U_{(k)}} \right)$

D(set) = distinct values

Slide based on PPT slides from Beyer et al. '07
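The union and intersection estimators can be illustrated as follows. This is a sketch under assumptions: the function names are invented here, and the "hash function" is simulated by assigning each element a seeded pseudo-random number in [0, 1). K_∩ is computed as the number of values in L that occur in both L_A and L_B, which is one way to read "values in L that are also in D(A ∩ B)": a value among the k smallest of the union is necessarily among the k smallest of each set that contains it.

```python
import random


def kmv(hash_values, k: int) -> list:
    """Ordered list of the k smallest distinct hash values."""
    return sorted(set(hash_values))[:k]


def combine(la: list, lb: list, k: int) -> list:
    """L = L_A (+) L_B: union of the values, then keep the k smallest."""
    return sorted(set(la) | set(lb))[:k]


def estimate_union(la: list, lb: list, k: int) -> float:
    merged = combine(la, lb, k)  # assumes at least k values in total
    return (k - 1) / merged[k - 1]


def estimate_intersection(la: list, lb: list, k: int) -> float:
    merged = combine(la, lb, k)
    in_both = set(la) & set(lb)
    k_cap = sum(1 for v in merged if v in in_both)  # K_cap
    return (k_cap / k) * (k - 1) / merged[k - 1]


# Simulated uniform hashing: every element gets a fixed pseudo-random value.
rng = random.Random(7)
h = {x: rng.random() for x in range(3000)}
A = range(0, 2000)
B = range(1000, 3000)  # overlaps A in 1000 elements
k = 256
LA = kmv((h[x] for x in A), k)
LB = kmv((h[x] for x in B), k)
print(estimate_union(LA, LB, k), estimate_intersection(LA, LB, k))
```

Only the two synopses are needed for the estimates, which is what makes KMV attractive when the partitions A and B live on different nodes.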

Min-Hashing

• Hash function h maps elements of a set to integer space. Let's do that for two sets, A and B
• Let $h_{min}(A)$ and $h_{min}(B)$ denote the minimum of these numbers for set A and B, respectively.
• Then
      $P[h_{min}(A) = h_{min}(B)] = \frac{|A \cap B|}{|A \cup B|}$   (the Jaccard coefficient)

Min-Hashing (Cont'd)

• Why does it work? If the min values are the same, the element that causes the min value has to be in $A \cap B$; the probability is $\frac{|A \cap B|}{|A \cup B|}$
• As seen before for other estimators: improved estimation quality through multiple "rounds" of estimates (error is $O(1/\sqrt{k})$, with k rounds)
  – one min value, multiple hash functions
  – several min values, one hash function

Note: Hash functions have to be min-wise independent.

Note: Unbiased estimator (but too high variance)

Min-Hash: Estimator

    count = 0
    for each hash function h:
        if hmin(A) == hmin(B) then
            count++
    end

The estimate of the Jaccard coefficient is given as count/k, where k is the number of hash functions.

See exercise in assignment sheet 5
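The pseudocode above translates almost directly into Python. The following is a minimal sketch: the function names are invented, and SHA-256 salted with a seed is only an assumed approximation of a min-wise independent family of k hash functions.

```python
import hashlib


def seeded_hash(seed: int, value) -> int:
    """One member of the (assumed) hash family, selected by `seed`."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, k: int) -> list:
    """h_min under each of the k hash functions."""
    items = list(items)
    return [min(seeded_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    # count the hash functions for which both sets have the same minimum
    count = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return count / len(sig_a)


A = range(0, 1000)
B = range(500, 1500)  # true Jaccard coefficient: 500 / 1500
k = 400
sig_a = minhash_signature(A, k)
sig_b = minhash_signature(B, k)
print(estimate_jaccard(sig_a, sig_b))  # should be in the vicinity of 0.33
```

The signatures can be computed independently per set and compared later, so only k integers per set need to be stored or exchanged.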

Literature

• Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)
• Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
• Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3): 630-659 (2000)
• Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan: Counting distinct elements in a data stream. In Proc. RANDOM, pages 1-10, 2002
• Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multiset operations. SIGMOD Conference 2007: 199-210


DATA  STREAM  MANAGEMENT  SYSTEMS  AND  CQL  

Thanks to Johannes Gehrke (Cornell) for providing some of the following material. Many of the slides are initially based on material by Jennifer Widom (Stanford).

 Data  Stream  Model    

• A stream S is a (possibly) infinite bag (multiset) of elements <s, τ>, where s is a tuple belonging to the schema of S and τ is the timestamp of the element.
• Think: tuples of a relational DBMS extended with a timestamp, streaming in.

Data Streams: Example

• Monitoring of highway traffic:
      PosSpeedStr(vehicleId, speed, xPos, dir, hwy)
• E.g., for:
  – congestion prediction/warning
  – estimates of travel time
  – toll collection!
  – tickets for speeding

Data Streams: Example

• Environmental monitoring:
      StationStream(humidity, solarRadiation, windSpeed, snowHeight)
• Various application scenarios:
  – avalanche risk level computation
  – insights for agriculture
  – (urban) air pollution monitoring

Continuous Queries

• In contrast to ad-hoc, one-time queries in a (relational) DBMS.
• Queries over streams are considered continuous: registered once, run "forever":
  – "want to stay updated on the avalanche risk, not just check once"
• Also called standing queries or subscriptions (in a publish/subscribe context)
• For instance:
  – Compute the average temperature.
  – Select all orders of stock "Apple" with a quantity larger than 100.

What and How Can We Compute DB-Style Queries?

• How to compute average values over an infinite stream? Block forever?
• How to join infinite streams if join partners can arrive at arbitrary times (or not at all)?
• Idea: keep a window that turns a continuous (infinite) stream into a snapshot/static relation


Sliding Window Concept

• Focus attention on the latest values of the stream
• Allows computation of aggregates
• Joins are computed across windows laid over other (or the same) streams

[Figure: time axis divided into past data, the current window, and the future]

Sliding Window: Example

      18.3°C 13.5°C 27.0°C 11.6°C 29.6°C 39.7°C 24.2°C 11.5°C 12.7°C 27.9°C …

• Window of size W
  – based on time (=> time-based)
  – or on the number of tuples inside (count-based)
• Shifted every t by B

Sliding Window Aggregates

      18.3°C 13.5°C 27.0°C 11.6°C 29.6°C 39.7°C 24.2°C 11.5°C 12.7°C 27.9°C …

• Output the average for each window when it slides.
• Here:
  – 17.7°C
  – 26.3°C
  – 19.1°C
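The aggregate computation can be reproduced with a short Python sketch. The window size of 4 tuples and the slide of 3 tuples are assumptions inferred from the numbers, since the figure that marks the windows is not available; the function name is also made up. With these parameters the second and third averages match the 26.3 and 19.1 listed above, while the first comes out as 17.6 rather than 17.7, so the exact window boundaries on the slide may differ.

```python
def sliding_averages(values, size: int, slide: int) -> list:
    """Average of each count-based window of `size` items, shifted by `slide` items."""
    averages = []
    for start in range(0, len(values) - size + 1, slide):
        window = values[start:start + size]
        averages.append(sum(window) / len(window))
    return averages


temperatures = [18.3, 13.5, 27.0, 11.6, 29.6, 39.7, 24.2, 11.5, 12.7, 27.9]
print([round(avg, 1) for avg in sliding_averages(temperatures, size=4, slide=3)])
# prints [17.6, 26.3, 19.1]
```

Recomputing the sum for every window is fine for a sketch; a streaming implementation would typically maintain a running sum and subtract the values that leave the window.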

Sliding Window Joins

• The join is executed over the individual window contents.

[Figure: stream 1 with window 1 and stream 2 with window 2]

Types of Sliding Windows

• Time-based window
  – window contains tuples within a certain time range; e.g., Twitter tweets of the last 10 minutes, stock market values of the last 10 seconds
  – the number of tuples in the window can change arbitrarily if the input rate changes
• Count-based window
  – window contains a fixed number of items at any time, say, the last 100 tweets or the last 10000 stock trades
  – newly arriving items kick out older ones (once the window is filled up), depending on the strategy (next slide)

Types of Sliding Windows (Cont'd)

• Sliding window: move the window on certain ticks/times, continuously or in blocks
• Tumbling window: create a new window for each time range or size W (i.e., collect data until full or for W time; then reset)
• At each slide/"tumble", a function can be applied to the window content and the result output
• This is also called a "trigger".
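The window types described above can be sketched with Python's `collections.deque`. The class and function names (`CountWindow`, `TimeWindow`, `tumbling_windows`) are hypothetical, and the choice to keep an item whose timestamp lies exactly `span` time units back is an assumption of this example.

```python
from collections import deque


class CountWindow:
    """Count-based window: holds at most the `size` most recent items."""

    def __init__(self, size: int) -> None:
        self.items = deque(maxlen=size)  # a full deque drops its oldest item on append

    def insert(self, item) -> None:
        self.items.append(item)


class TimeWindow:
    """Time-based window: holds the items of the last `span` time units."""

    def __init__(self, span: float) -> None:
        self.span = span
        self.items = deque()  # (timestamp, item) pairs in arrival order

    def insert(self, timestamp: float, item) -> None:
        self.items.append((timestamp, item))
        # evict everything older than `span`; the new item itself always stays
        while self.items[0][0] < timestamp - self.span:
            self.items.popleft()


def tumbling_windows(values, size: int):
    """Yield consecutive, non-overlapping windows of `size` items (an incomplete tail is held back)."""
    for start in range(0, len(values) - size + 1, size):
        yield values[start:start + size]


last_three = CountWindow(3)
for tweet_id in [1, 2, 3, 4, 5]:
    last_three.insert(tweet_id)
print(list(last_three.items))                            # [3, 4, 5]
print(list(tumbling_windows([1, 2, 3, 4, 5, 6, 7], 3)))  # [[1, 2, 3], [4, 5, 6]]
```

The count-based window has a fixed memory bound, whereas the time-based window grows and shrinks with the input rate, which mirrors the remark on the previous slide.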


Overview of Data Stream Management Systems (DSMSs)

• STREAM (Stanford University), Aurora (Brandeis/Brown/MIT), TelegraphCQ (UC Berkeley), Cayuga (Cornell), PIPES (Uni Marburg), …
• Large interest also from companies/startups: Oracle, Microsoft, IBM, StreamBase
• Lately, open-source products for big-data distributed streams: Yahoo! S4, Twitter Storm (will see in detail later)

StreamBase Example UI

http://www.streambase.com

STREAM  

• Stanford Stream Data Manager
• "General purpose" DSMS for streams and stored data
• Declarative query language to phrase continuous queries (SQL-like).

Arvind Arasu et al.: STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19-26 (2003)

Continuous Query Language – CQL

SQL with:
• Streams
• Windows
• New semantics (stream)
  – Three relation-to-stream operators: Istream, Dstream, Rstream
• Sampling

Slide based on material from Jennifer Widom.

Example Relation (Used Later)

Simplified Linear Road setup:
• A single input stream: the stream of positions and speeds of vehicles
      PosSpeedStr(vehicleId, speed, xPos, dir, hwy)
• vehicleId: vehicle
• speed: speed in MPH
• xPos: position of the vehicle within the highway in feet
• dir: direction (east or west)
• hwy: highway number

Slide based on material from Jennifer Widom.

Example Query 1

• Two streams:
  – Orders (orderID, customer, cost)
  – Fulfillments (orderID, clerk)
• Total cost of orders fulfilled over the last day by clerk "Sue" for customer "Joe"

    SELECT sum(O.cost)
    FROM Orders O, Fulfillments F [Range 1 Day]
    WHERE O.orderID = F.orderID AND F.clerk = "Sue"
          AND O.customer = "Joe"


Example Query 2

• Using a 10% sample of the fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost

    SELECT F.clerk, max(O.cost)
    FROM orders O,
         fulfillments F [PARTITION BY clerk ROWS 5] 10% SAMPLE
    WHERE O.orderID = F.orderID
    GROUP BY F.clerk

CQL: Relations and Streams

• T: discrete, ordered time domain
• A relation R is a mapping from time T to a bag of tuples belonging to the schema of R.
• That is, R(t) varies over time
• Updates carry timestamps, too!
• A stream is a set of (tuple, timestamp) elements

Streams ↔ Relations

• Streams → Relations: window specification
• Relations → Streams: special operators Istream, Dstream, Rstream
• Relations → Relations: any relational query language

Slide based on material from Jennifer Widom.

Stream → Relation

• S [W] is a relation: at time T it contains all tuples in window W applied to stream S, up to time T.
• When W = ∞, it contains all tuples in stream S up to time T
• Ways to construct these windows "[W]":
  – Time-based
  – Tuple-based
  – Partitioned

Slide based on material from Jennifer Widom.

Time-Based Window

• S [Range T]
  – S [Now]
  – S [Range Unbounded]

Examples:
• PosSpeedStr [RANGE 30 Seconds]
• PosSpeedStr [NOW]
• PosSpeedStr [RANGE Unbounded]

Note: variable number of records in the window

Slide based on material from Jennifer Widom.

Tuple-Based Window

• S [Rows N]
  – If the tuples form only a partial order, ties are broken arbitrarily
  – [Rows Unbounded]

Example:
• PosSpeedStr [ROWS 1]

Slide based on material from Jennifer Widom.


Partitioned Windows

• S [Partition By A1,...,Ak Rows N]
  1. Logically partition S into substreams (compare to SQL GROUP BY)
  2. Compute a tuple-based sliding window on each substream
  3. Take the union

Example:
• PosSpeedStr [PARTITION BY vehicleId ROWS 1]

Slide based on material from Jennifer Widom.

Recall: PosSpeedStr(vehicleId,speed,xPos,dir,hwy)
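The three window constructions can be mimicked in Python to see which relation each one yields at a given time. This is a sketch, not CQL: the function names and the sample tuples are invented, the stream is modeled as a list of (tuple, timestamp) pairs in timestamp order, and the inclusive time bounds are an assumption of this example.

```python
from collections import defaultdict

# Hypothetical sample of PosSpeedStr: ((vehicleId, speed, xPos, dir, hwy), timestamp in seconds)
pos_speed_str = [
    ((1, 60, 100, "east", 5), 0),
    ((2, 70, 900, "west", 5), 10),
    ((1, 62, 2000, "east", 5), 25),
    ((3, 55, 300, "east", 7), 40),
    ((2, 71, 400, "west", 5), 45),
]


def range_window(stream, now, span=None):
    """S [Range span] at time `now`; span=None models Unbounded, span=0 models Now."""
    lower = 0 if span is None else max(now - span, 0)
    return [s for s, ts in stream if lower <= ts <= now]


def rows_window(stream, now, n):
    """S [Rows n] at time `now`: the n most recent tuples."""
    seen = [s for s, ts in stream if ts <= now]
    return seen[-n:]


def partitioned_rows_window(stream, now, key, n):
    """S [Partition By key Rows n]: per-partition tuple window, then union."""
    partitions = defaultdict(list)
    for s, ts in stream:
        if ts <= now:
            partitions[key(s)].append(s)
    return [s for group in partitions.values() for s in group[-n:]]


# PosSpeedStr [PARTITION BY vehicleId ROWS 1] at time 45: latest tuple per vehicle
print(partitioned_rows_window(pos_speed_str, 45, key=lambda s: s[0], n=1))
```

Each function returns an ordinary list, that is, a snapshot relation at time `now`, which is what allows plain relational operators to be applied afterwards.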

Relation → Relation

• With the previous window transform we get a relation; now we can apply
• any query expressed in SQL
  – we just deal with time-varying relations now

Example:
• SELECT DISTINCT vehicleId FROM PosSpeedStr [RANGE 30 Seconds]
  – Computes the active vehicles

Slide based on material from Jennifer Widom.

Relation → Stream

• Istream(R) ("insert stream") contains a stream element (r, t) whenever r is in R(t) \ R(t-1)
• Dstream(R) ("delete stream") contains a stream element (r, t) whenever r is in R(t-1) \ R(t)
• Rstream(R) ("relation stream") contains a stream element (r, t) whenever r is in R(t)

Slide based on material from Jennifer Widom.

Istream, Dstream, and Rstream

• Istream(R): contains all tuples in R that are new within the last time period, i.e., insert stream
• Dstream(R): contains all tuples that were in R before the last time period and are not in it anymore, i.e., delete stream
• Rstream(R): contains all tuples in R

Note: Istream and Dstream are expressible with Rstream and suitable selections. How?
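The three relation-to-stream operators can be written down directly from their definitions. In this sketch a time-varying relation is modeled as a Python list `R` whose entry `R[t]` is the bag of tuples at time t, with R(-1) taken to be empty; the function names and this representation are assumptions made for illustration.

```python
from collections import Counter


def istream(R, t):
    """Elements (r, t) for r in R(t) \\ R(t-1), using bag difference."""
    previous = Counter(R[t - 1]) if t > 0 else Counter()
    return [(r, t) for r in (Counter(R[t]) - previous).elements()]


def dstream(R, t):
    """Elements (r, t) for r in R(t-1) \\ R(t), using bag difference."""
    previous = Counter(R[t - 1]) if t > 0 else Counter()
    return [(r, t) for r in (previous - Counter(R[t])).elements()]


def rstream(R, t):
    """Elements (r, t) for every r in R(t)."""
    return [(r, t) for r in R[t]]


# A relation observed at times 0, 1, 2
R = [["a", "b"], ["b", "c"], ["b", "c"]]
print(istream(R, 1))  # [('c', 1)]
print(dstream(R, 1))  # [('a', 1)]
print(rstream(R, 2))  # [('b', 2), ('c', 2)]
```

At time 2 the relation did not change, so both `istream(R, 2)` and `dstream(R, 2)` are empty while `rstream(R, 2)` still reports every tuple.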

Relation → Stream: Examples

    SELECT Istream(*)
    FROM PosSpeedStr [RANGE Unbounded]
    WHERE speed > 65

    SELECT Rstream(*)
    FROM PosSpeedStr [NOW]
    WHERE speed > 65

Slide based on material from Jennifer Widom.

Query Results at Time T

• Use all relations at time T
• Use all streams up to T, converted to relations
• Compute relational results
• Convert the result to streams if desired

Slide based on material from Jennifer Widom.


Examples

• What is the following query doing?
      SELECT Istream(Avg(A)) FROM S [Range 5 seconds]
  – The intention may be to emit the 5-second moving average on every timestep, but output is generated only if the average changes (Istream!)
• To emit a result on every timestep:
      SELECT Rstream(Avg(A)) FROM S [Range 5 seconds]
• To emit a result every second:
      SELECT Rstream(Avg(A)) FROM S [Range 5 seconds Slide 1 second]

Slide based on material from Jennifer Widom.
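The difference between the Istream and Rstream variants can be simulated in Python. This is a sketch under assumptions: the stream is a list of (value, timestamp) pairs with integer timestamps in seconds, the window bounds are taken as inclusive, and the function names are invented. Since Avg yields a single-row relation, Istream emits an element exactly when that row differs from the one at the previous timestep.

```python
def avg_over_range(stream, t, span):
    """Avg(A) over S [Range span] at time t (None if the window is empty)."""
    values = [a for a, ts in stream if max(t - span, 0) <= ts <= t]
    return sum(values) / len(values) if values else None


def rstream_avg(stream, times, span=5):
    """Rstream(Avg(A)): one output element per timestep."""
    return [(avg_over_range(stream, t, span), t) for t in times]


def istream_avg(stream, times, span=5):
    """Istream(Avg(A)): output only when the average differs from the previous timestep."""
    output = []
    previous = object()  # sentinel: nothing has been reported before the first timestep
    for t in times:
        current = avg_over_range(stream, t, span)
        if current != previous:
            output.append((current, t))
        previous = current
    return output


S = [(10, 0), (10, 1), (10, 2), (20, 3)]
print(rstream_avg(S, range(5)))  # five elements, one per timestep
print(istream_avg(S, range(5)))  # [(10.0, 0), (12.5, 3)]
```

Between times 0 and 2 the average stays at 10.0, so the Istream variant is silent there even though new tuples keep arriving.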

Examples (Cont'd)

Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)

    SELECT F.clerk, max(O.cost)
    FROM O [∞], F [Rows 1000]
    WHERE O.orderID = F.orderID
    GROUP BY F.clerk

• At time T: the entire stream O and the last 1000 tuples of F as relations
• Evaluate the query, update the result relation at T

Slide based on material from Jennifer Widom.

Examples (Cont'd)

Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)

    SELECT Istream(F.clerk, max(O.cost))
    FROM O [∞], F [Rows 1000]
    WHERE O.orderID = F.orderID
    GROUP BY F.clerk

• At time T: the entire stream O and the last 1000 tuples of F as relations
• Evaluate the query, update the result relation at T
• Streamed result: new result (<clerk, max>, T) whenever <clerk, max> changes from T-1

Slide based on material from Jennifer Widom.

Query Execution in STREAM

• When a continuous query is registered, generate a query execution plan
  – New plan merged with existing plans
  – Users can also create & manipulate plans directly
• Plans are composed of three main components:
  – Operators
  – Queues (input and inter-operator)
  – State (windows, operators requiring history)
• Global scheduler for plan execution

Slide based on material from Jennifer Widom.

Simple Query Plan

[Figure: query plan for two queries Q1 and Q2 over the input streams Stream1, Stream2, and Stream3, built from join (⋈) and selection (σ) operators with their state (State1 to State4) and a scheduler]

Slide courtesy of Jennifer Widom.

More Topics

• We have seen only the formal model and the standard concepts of data stream management systems
• There is of course much more to it
• Implementation, optimization (e.g., equivalences), load shedding, ...
• This would be a lecture of its own.
• Next, we look at system aspects in distributed data stream management systems and (mobile) sensor networks


Literature

• Arvind Arasu, Shivnath Babu, Jennifer Widom: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2): 121-142 (2006)
• Arvind Arasu et al.: STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19-26 (2003)
• http://infolab.stanford.edu/~widom/cql-talk.pdf
• Alan J. Demers, Johannes Gehrke, Biswanath Panda, Mirek Riedewald, Varun Sharma, Walker M. White: Cayuga: A General Purpose Event Monitoring System. CIDR 2007: 412-422
• Jürgen Krämer, Bernhard Seeger: Semantics and implementation of continuous sliding window queries over data streams. ACM Trans. Database Syst. 34(1) (2009)