


Page 1: Making Big Data Analytics Interactive and Real-Time

Spark
Making Big Data Analytics Interactive and Real-Time

Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

spark-project.org

UC BERKELEY

Page 2: Making Big Data Analytics Interactive and Real-Time

Overview

Spark is a parallel framework that provides:
» Efficient primitives for in-memory data sharing
» Simple APIs in Scala, Java, SQL
» High generality (applicable to many emerging problems)

This talk will cover:
» What it does
» How people are using it (including some surprises)
» Current research

Page 3: Making Big Data Analytics Interactive and Real-Time

Motivation

MapReduce simplified data analysis on large, unreliable clusters

But as soon as it got popular, users wanted more:
» More complex, multi-pass applications (e.g. machine learning, graph algorithms)
» More interactive ad-hoc queries
» More real-time stream processing

One reaction: specialized models for some of these apps (e.g. Pregel, Storm)

Page 4: Making Big Data Analytics Interactive and Real-Time

Motivation

Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks:

Efficient primitives for data sharing

Page 5: Making Big Data Analytics Interactive and Real-Time

Examples

[Figure: an iterative job writes to and re-reads from HDFS between iterations (iter. 1, iter. 2, …), and each interactive query (query 1-3) re-reads the input from HDFS to produce its result]

Slow due to replication and disk I/O, but necessary for fault tolerance

Page 6: Making Big Data Analytics Interactive and Real-Time

Goal: Sharing at Memory Speed

[Figure: the same iterative job and queries share data in memory after one-time processing of the input]

10-100× faster than network/disk, but how to get FT?

Page 7: Making Big Data Analytics Interactive and Real-Time

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Page 8: Making Big Data Analytics Interactive and Real-Time

Existing Systems

Existing in-memory storage systems have interfaces based on fine-grained updates
» Reads and writes to cells in a table
» E.g. databases, key-value stores, distributed memory

Requires replicating data or logs across nodes for fault tolerance ⇒ expensive!
» 10-100× slower than memory write…

Page 9: Making Big Data Analytics Interactive and Real-Time

Solution: Resilient Distributed Datasets (RDDs)

Provide an interface based on coarse-grained operations (map, group-by, join, …)

Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
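The lineage idea can be sketched in plain Python (a hypothetical toy model, not Spark's API): each dataset records the single coarse-grained operation and parent that built it, so a lost partition is recomputed on demand instead of being replicated up front.

```python
# Toy model of lineage-based recovery (not the Spark API): one logged
# operation per dataset, applied to every element of a partition.

class ToyRDD:
    def __init__(self, parent, op):
        self.parent, self.op = parent, op          # lineage: parent + one op
        self.partitions = {}                        # partition id -> cached data

    def compute(self, pid):
        if pid not in self.partitions:              # lost or never materialized
            parent_data = self.parent.compute(pid)  # walk lineage upward
            self.partitions[pid] = [self.op(x) for x in parent_data]
        return self.partitions[pid]

class Source:
    """Stands in for stable storage such as HDFS."""
    def __init__(self, data):
        self.data = data
    def compute(self, pid):
        return self.data[pid]

source = Source({0: [1, 2], 1: [3, 4]})
doubled = ToyRDD(source, lambda x: x * 2)
doubled.compute(0); doubled.compute(1)              # materialize both partitions
del doubled.partitions[1]                           # simulate losing a partition
print(doubled.compute(1))                           # recomputed from lineage: [6, 8]
```

Note that if nothing fails, the only overhead is storing the small lineage record.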

Page 10: Making Big Data Analytics Interactive and Real-Time

RDD Recovery

[Figure: the one-time processing and iterative workloads from the earlier slides, with shared data kept in RDDs so lost partitions can be recovered]

Page 11: Making Big Data Analytics Interactive and Real-Time

Generality of RDDs

RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items

Capture many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …

Allow these models to be composed

Page 12: Making Big Data Analytics Interactive and Real-Time

Outline

Programming interface

Examples

User applications

Implementation

Demo

Current research: Spark Streaming

Page 13: Making Big Data Analytics Interactive and Real-Time

Spark Programming Interface

Language-integrated API in Scala*

Provides:
» Resilient distributed datasets (RDDs)
  • Partitioned collections with controllable caching
» Operations on RDDs
  • Transformations (define RDDs), actions (compute results)
» Restricted shared variables (broadcast, accumulators)

*Also Java, SQL and soon Python

Page 14: Making Big Data Analytics Interactive and Real-Time

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.persist()

[Figure: the driver sends tasks to workers and receives results; each worker reads one block of the file (Block 1-3) and caches a partition of messages (Msgs. 1-3); lines is the base RDD, messages a transformed RDD, and count an action]

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 15: Making Big Data Analytics Interactive and Real-Time

Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))

Page 16: Making Big Data Analytics Interactive and Real-Time

Fault Recovery Results

[Figure: iteration time (s) over 10 iterations; iteration 1 takes 119 s, normal iterations run 56-59 s, and the iteration where a failure happens takes 81 s]

Page 17: Making Big Data Analytics Interactive and Real-Time

Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: a scatter of + and - points, with a random initial line iteratively adjusted toward the target separator]

Page 18: Making Big Data Analytics Interactive and Real-Time

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).persist()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)

Page 19: Making Big Data Analytics Interactive and Real-Time

Logistic Regression Performance

[Figure: running time (min) vs number of iterations (1-30) for Hadoop and Spark; Hadoop takes 110 s/iteration, while Spark takes 80 s for the first iteration and 6 s for further iterations]

Page 20: Making Big Data Analytics Interactive and Real-Time

Example: Collaborative Filtering

Goal: predict users' movie ratings based on past ratings of other movies

[Figure: a sparse Movies × Users ratings matrix R, with known ratings (1-5) and many unknown entries (?)]

Page 21: Making Big Data Analytics Interactive and Real-Time

Model and Algorithm

Model R as product of user and movie feature matrices A and B of size U×K and M×K

Alternating Least Squares (ALS)
» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged

[Figure: R ≈ A · Bᵀ]
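As a concrete (hypothetical) illustration of the alternating updates, here is a rank-1 ALS in plain Python; the updateUser/updateMovie calls in the Scala sketches that follow generalize this to K-dimensional feature vectors.

```python
# Rank-1 ALS toy (hypothetical, not the talk's code): alternately apply the
# closed-form least-squares update for user factors a and movie factors b
# over the observed entries of R.

R = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 2): 1.0}  # (user, movie) -> rating
U, M = 2, 3
a = [1.0] * U                                # user feature "vectors" (K = 1)
b = [1.0] * M                                # movie feature "vectors" (K = 1)

for _ in range(20):
    for i in range(U):                       # optimize users given movies
        num = sum(r * b[m] for (u, m), r in R.items() if u == i)
        den = sum(b[m] ** 2 for (u, m), r in R.items() if u == i)
        a[i] = num / den
    for j in range(M):                       # optimize movies given users
        num = sum(r * a[u] for (u, m), r in R.items() if m == j)
        den = sum(a[u] ** 2 for (u, m), r in R.items() if m == j)
        b[j] = num / den

# predictions a[i] * b[j] now approximate the observed entries of R
```

Each half-step solves an ordinary least-squares problem exactly, which is why the overall squared error never increases from one iteration to the next.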

Page 22: Making Big Data Analytics Interactive and Real-Time

Serial ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}

((0 until U) and (0 until M) are Range objects)

Page 23: Making Big Data Analytics Interactive and Real-Time

Naïve Spark ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R re-sent to all nodes in each iteration

Page 24: Making Big Data Analytics Interactive and Real-Time

Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))
var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as broadcast variable

Result: 3× performance improvement

Page 25: Making Big Data Analytics Interactive and Real-Time

Scaling Up Broadcast

[Figure: iteration time (s), split into communication and computation, vs number of machines (10, 30, 60, 90), for the initial HDFS-based broadcast and for Cornet P2P broadcast]

[Chowdhury et al, SIGCOMM 2011]

Page 26: Making Big Data Analytics Interactive and Real-Time

Other RDD Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, …

Actions (return a result to driver program): collect, reduce, count, save, …

Page 27: Making Big Data Analytics Interactive and Real-Time

Spark in Java

Scala:
lines.filter(_.contains("error")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

Page 28: Making Big Data Analytics Interactive and Real-Time

Spark in Python (Coming Soon!)

lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)

Page 29: Making Big Data Analytics Interactive and Real-Time

Outline

Programming interface

Examples

User applications

Implementation

Demo

Current research: Spark Streaming

Page 30: Making Big Data Analytics Interactive and Real-Time

Spark Users

400+ user meetup, 20+ contributors

Page 31: Making Big Data Analytics Interactive and Real-Time

User Applications

Crowdsourced traffic estimation (Mobile Millennium)

Video analytics & anomaly detection (Conviva)

Ad-hoc queries from web app (Quantifind)

Twitter spam classification (Monarch)

DNA sequence analysis (SNAP)

…

Page 32: Making Big Data Analytics Interactive and Real-Time

Mobile Millennium Project

Estimate city traffic from GPS-equipped vehicles (e.g. SF taxis)

Page 33: Making Big Data Analytics Interactive and Real-Time

Sample Data

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

Page 34: Making Big Data Analytics Interactive and Real-Time

Challenge

Data is noisy and sparse (1 sample/minute)

Must infer path taken by each vehicle in addition to travel time distribution on each link

Page 35: Making Big Data Analytics Interactive and Real-Time

Challenge

Data is noisy and sparse (1 sample/minute)

Must infer path taken by each vehicle in addition to travel time distribution on each link

Page 36: Making Big Data Analytics Interactive and Real-Time

Solution

EM algorithm to estimate paths and travel time distributions simultaneously

[Figure: observations → (flatMap) → weighted path samples → (groupByKey) → link parameters, which are broadcast back for the next iteration]

Page 37: Making Big Data Analytics Interactive and Real-Time

Results

3× speedup from caching, 4.5× from broadcast

[Hunter et al, SOCC 2011]

Page 38: Making Big Data Analytics Interactive and Real-Time

Conviva GeoReport

SQL aggregations on many keys w/ same filter

40× gain over Hive from avoiding repeated I/O, deserialization and filtering

[Figure: query time (hours): Spark 0.5, Hive 20]

Page 39: Making Big Data Analytics Interactive and Real-Time

Other Programming Models

Pregel on Spark (Bagel)
» 200 lines of code

Iterative MapReduce
» 200 lines of code

Hive on Spark (Shark)
» 5000 lines of code
» Compatible with Apache Hive
» Machine learning ops. in Scala

Page 40: Making Big Data Analytics Interactive and Real-Time

Outline

Programming interface

Examples

User applications

Implementation

Demo

Current research: Spark Streaming

Page 41: Making Big Data Analytics Interactive and Real-Time

Implementation

Runs on Apache Mesos cluster manager to coexist w/ Hadoop

Supports any Hadoop storage system (HDFS, HBase, …)

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]

Easy local mode and EC2 launch scripts

No changes to Scala

Page 42: Making Big Data Analytics Interactive and Real-Time

Task Scheduler

Runs general DAGs

Pipelines functions within a stage

Cache-aware data reuse & locality

Partitioning-aware to avoid shuffles

[Figure: example job DAG over RDDs A-G with groupBy, map, union, and join operations, divided into Stages 1-3; cached data partitions are marked]

Page 43: Making Big Data Analytics Interactive and Real-Time

Language Integration

Scala closures are Serializable Java objects
» Serialize on master, load & run on workers

Not quite enough
» Nested closures may reference entire outer scope, pulling in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection

Page 44: Making Big Data Analytics Interactive and Real-Time

Interactive Spark

Modified Scala interpreter to allow Spark to be used interactively from the command line
» Track variables that each line depends on
» Ship generated classes to workers

Enables in-memory exploration of big data

Page 45: Making Big Data Analytics Interactive and Real-Time

Outline

Programming interface

Examples

User applications

Implementation

Demo

Current research: Spark Streaming

Page 46: Making Big Data Analytics Interactive and Real-Time

Outline

Programming interface

Examples

User applications

Implementation

Demo

Current research: Spark Streaming

Page 47: Making Big Data Analytics Interactive and Real-Time

Motivation

Many "big data" apps need to work in real time
» Site statistics, spam filtering, intrusion detection, …

To scale to 100s of nodes, need:
» Fault-tolerance: for both crashes and stragglers
» Efficiency: don't consume many resources beyond base processing

Challenging in existing streaming systems

Page 48: Making Big Data Analytics Interactive and Real-Time

Traditional Streaming Systems

Continuous processing model
» Each node has long-lived state
» For each record, update state & send new records

[Figure: input records are pushed through nodes 1-3, each holding mutable state]

Page 49: Making Big Data Analytics Interactive and Real-Time

Traditional Streaming Systems

Fault tolerance via replication or upstream backup:

[Figure: replication, with synchronized replica nodes 1'-3'; and upstream backup, with a single standby node]

Page 50: Making Big Data Analytics Interactive and Real-Time

Traditional Streaming Systems

Fault tolerance via replication or upstream backup:

[Figure: the same two schemes]

Replication: fast recovery, but 2× hardware cost
Upstream backup: only need 1 standby, but slow to recover

Page 51: Making Big Data Analytics Interactive and Real-Time

Traditional Streaming Systems

Fault tolerance via replication or upstream backup:

[Figure: replication (Borealis, Flux) with synchronized replicas; upstream backup (Hwang et al, 2005) with a standby node]

Neither approach can handle stragglers

Page 52: Making Big Data Analytics Interactive and Real-Time

Observation

Batch processing models, such as MapReduce, do provide fault tolerance efficiently
» Divide job into deterministic tasks
» Rerun failed/slow tasks in parallel on other nodes

Idea: run streaming computations as a series of small, deterministic batch jobs
» Same recovery schemes at much smaller timescale
» To make latency low, store state in RDDs
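The idea can be sketched in a few lines of plain Python (hypothetical, not Spark Streaming's API): cut the input stream into small batches and run an ordinary deterministic batch job on each, carrying state forward as a dataset.

```python
# Micro-batch sketch: each interval's work is a deterministic batch job, so
# a failed interval can simply be rerun, like any batch task.

def run_batch(state, batch):
    """One deterministic batch job: running word counts."""
    new_state = dict(state)             # treat state as an immutable dataset
    for word in batch:
        new_state[word] = new_state.get(word, 0) + 1
    return new_state

stream = [["spark", "rdd"], ["rdd"], ["spark", "spark"]]  # one list per interval
state = {}
for batch in stream:                    # t = 1, 2, 3, ...
    state = run_batch(state, batch)     # rerunnable on failure
print(state)                            # {'spark': 3, 'rdd': 2}
```

Because run_batch is a pure function of (state, batch), recomputing any interval yields exactly the same result, which is what makes the batch-style recovery schemes applicable.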

Page 53: Making Big Data Analytics Interactive and Real-Time

Discretized Stream Processing

[Figure: at t = 1, t = 2, …, input is pulled into an immutable dataset (stored reliably), and a batch operation produces an immutable dataset (output or state) stored as an RDD]

Page 54: Making Big Data Analytics Interactive and Real-Time

Fault Recovery

Checkpoint state RDDs periodically

If a node fails/straggles, rebuild lost RDD partitions in parallel on other nodes

[Figure: a map from an input dataset to an output dataset, with lost partitions recomputed in parallel]

Faster recovery than upstream backup, without the cost of replication

Page 55: Making Big Data Analytics Interactive and Real-Time

How Fast Can It Go?

Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency

Max throughput under a given latency (1 or 2 s):

[Figure: records/s (millions) vs nodes in cluster (0-100) for Grep and Top K Words, each at 1 s and 2 s latency targets]

Page 56: Making Big Data Analytics Interactive and Real-Time

Comparison with Storm

Storm limited to 100K records/s/node

Also tried S4: 10K records/s/node

Commercial systems: O(500K) total

[Figure: MB/s per node vs record size (100-1000 bytes) for Grep and Top K Words, Spark vs Storm]

These systems lack Spark's FT guarantees

Page 57: Making Big Data Analytics Interactive and Real-Time

How Fast Can It Recover?

Recovers from faults/stragglers within 1 second

Sliding WordCount on 20 nodes with 10 s checkpoint interval:

[Figure: interval processing time (s) over a 25 s run; processing time spikes when the failure happens, then returns to normal within a second]

Page 58: Making Big Data Analytics Interactive and Real-Time

Programming Interface

Extension to Spark: Spark Streaming
» All Spark operators plus new "stateful" ones

// Running count of pageviews by URL
views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

[Figure: at t = 1, t = 2, …, the views, ones, and counts datasets are RDDs (with their partitions) connected by map and reduce; such a sequence of RDDs is a "discretized stream" (D-stream)]

Page 59: Making Big Data Analytics Interactive and Real-Time

Incremental Operators

words.reduceByWindow("5s", max)

[Figure: an interval max is computed for each interval t-1 … t+4, and interval maxes are combined by max into a sliding max]

Associative function

words.reduceByWindow("5s", _+_, _-_)

[Figure: interval counts become sliding counts by adding (+) the newest interval and subtracting (-) the interval leaving the window]

Associative & invertible function
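The invertible case can be sketched in plain Python (a hypothetical toy, not the Spark Streaming API): keep a running total and, at each interval, add the newest interval's value and subtract the one that just left the window, instead of re-reducing the whole window.

```python
# Incremental sliding-window reduce for an associative & invertible function
# (here +/-): O(1) work per interval regardless of window size.

from collections import deque

def sliding_counts(interval_counts, window=5):
    total, buf, out = 0, deque(), []
    for c in interval_counts:
        total += c                     # "+" : fold in the new interval
        buf.append(c)
        if len(buf) > window:
            total -= buf.popleft()     # "-" : invert out the oldest interval
        out.append(total)
    return out

print(sliding_counts([3, 1, 4, 1, 5, 9, 2], window=3))  # [3, 4, 8, 6, 10, 15, 16]
```

A function like max is associative but not invertible, which is why the one-argument form on the slide must recombine the interval values inside each window instead.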

Page 60: Making Big Data Analytics Interactive and Real-Time

Applications

Conviva video dashboard (>50 session-level metrics)
[Figure: active video sessions (millions, axis up to 4.0) vs nodes in cluster (0-64)]

Mobile Millennium traffic estimation (online EM algorithm)
[Figure: GPS observations/sec (axis up to 2000) vs nodes in cluster (0-80)]

Page 61: Making Big Data Analytics Interactive and Real-Time

Unifying Streaming and Batch

D-streams and RDDs can seamlessly be combined
» Same execution and fault recovery models

Enables powerful features:
» Combining streams with historical data (used in the Mobile Millennium application):
  pageViews.join(historicCounts).map(...)
» Interactive ad-hoc queries on stream state (used in the Conviva app):
  pageViews.slice("21:00", "21:05").topK(10)

Page 62: Making Big Data Analytics Interactive and Real-Time

Benefits of a Unified Stack

Write each algorithm only once

Reuse data across streaming & batch jobs

Query stream state instead of waiting for import

Some users were doing this manually!
» Conviva anomaly detection, Quantifind dashboard

Page 63: Making Big Data Analytics Interactive and Real-Time

Conclusion

"Big data" is moving beyond one-pass batch jobs, to low-latency apps that need data sharing

RDDs offer fault-tolerant sharing at memory speed

Spark uses them to combine streaming, batch & interactive analytics in one system

www.spark-project.org

Page 64: Making Big Data Analytics Interactive and Real-Time

Related Work

DryadLINQ, FlumeJava
» Similar "distributed collection" API, but cannot reuse datasets efficiently across queries

GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes requiring replication or checkpoints

Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern

Relational databases
» Lineage/provenance, logical logging, materialized views

Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached

Page 65: Making Big Data Analytics Interactive and Real-Time

Behavior with Not Enough RAM

[Figure: iteration time (s) vs % of working set in memory: 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]

Page 66: Making Big Data Analytics Interactive and Real-Time

RDDs for Debugging

Debugging general distributed apps is very hard

However, Spark and other recent frameworks run deterministic tasks for fault tolerance

Leverage this determinism for debugging:
» Log lineage for all RDDs created (small)
» Let user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions
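The replay idea can be sketched in plain Python (hypothetical, not an actual Spark debugging tool): because tasks are deterministic, logging just (function, input partition) pairs is enough to re-execute any task exactly, e.g. under a debugger or with assertions enabled.

```python
# Deterministic-replay sketch: a tiny task log from which any past task can
# be re-run with identical results.

task_log = []

def run_task(fn, part):
    task_log.append((fn, part))        # cheap log: function + input reference
    return [fn(x) for x in part]

run_task(lambda x: x * x, [1, 2, 3])
run_task(lambda x: x + 1, [10, 20])

# Later: replay task 0 exactly as it originally ran
fn, part = task_log[0]
replayed = [fn(x) for x in part]
print(replayed)                        # [1, 4, 9]
```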