Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Indian Institute of Science: cds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf


A Brief History:

2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo!
2008 – Hadoop Summit
2010 – Spark paper
2014 – Apache Spark top-level


A Brief History: MapReduce

circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm

circa 2004 – Google
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
research.google.com/archive/mapreduce.html

circa 2006 – Apache
Hadoop, originating from the Nutch Project
Doug Cutting
research.yahoo.com/files/cutting.pdf

circa 2008 – Yahoo
web scale search indexing
Hadoop Summit, HUG, etc.
developer.yahoo.com/hadoop/

circa 2009 – Amazon AWS
Elastic MapReduce
Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
aws.amazon.com/elasticmapreduce/


A Brief History: MapReduce

MapReduce use cases showed two major limitations:

1. difficulty of programming directly in MR

2. performance bottlenecks, or batch not fitting the use cases

In short, MR doesn’t compose well for large applications

Therefore, people built specialized systems as workarounds…


A Brief History: MapReduce

General Batch Processing: MapReduce

Specialized Systems (iterative, interactive, streaming, graph, etc.): Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4

The State of Spark, and Where We're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nU6vO2EJAb4


A Brief History: Spark

Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
USENIX HotCloud (2010)
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
NSDI (2012)
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf


A Brief History: Spark

Unlike the various specialized systems, Spark’s goal was to generalize MapReduce to support new apps within the same engine

Two reasonably small additions are enough to express the previous models:

• fast data sharing
• general DAGs

This allows for an approach which is more efficient for the engine, and much simpler for the end users


A Brief History: Spark

The State of Spark, and Where We're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nU6vO2EJAb4

used as libs, instead of specialized systems


A Brief History: Spark

Some key points about Spark:

• handles batch, interactive, and real-time within a single framework

• native integration with Java, Python, Scala

• programming at a higher level of abstraction

• more general: map/reduce is just one set of supported constructs


https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/assets/dawh_0401.png


Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

UC Berkeley


Motivation

MapReduce greatly simplified “big data” analysis on large, unreliable clusters

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)


Motivation

Complex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage, which is slow!


Examples

[Figure: iterative job: Input, HDFS read, iter. 1, HDFS write, HDFS read, iter. 2, HDFS write, ...; interactive queries: Input, HDFS read, query 1 / query 2 / query 3, result 1 / result 2 / result 3, ...]

Slow due to replication and disk I/O, but necessary for fault tolerance


Goal: In-Memory Data Sharing

[Figure: iterative job: Input, iter. 1, iter. 2, ...; interactive queries: Input, one-time processing, query 1 / query 2 / query 3, ...]

10-100× faster than network/disk, but how to get fault tolerance (FT)?


Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?


Challenge

Existing storage abstractions have interfaces based on fine-grained updates to mutable state
» RAMCloud, databases, distributed mem, Piccolo

Requires replicating data or logs across nodes for fault tolerance
» Costly for data-intensive apps
» 10-100× slower than memory write


Solution: Resilient Distributed Datasets (RDDs)

Restricted form of distributed shared memory
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
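The lineage idea on this slide can be sketched in a few lines of plain Scala. This is a toy model, not the Spark API: `ToyRDD`, `input`, `cached`, and `recover` are hypothetical names, and the "cluster" is a single in-memory map. Each dataset stores only the recipe for building a partition, so a lost partition is rebuilt by replaying that recipe on the matching input partition.

```scala
// Toy model of lineage-based recovery in plain Scala; not the Spark API.
object LineageSketch {
  // A dataset is described by how to (re)build each partition, not by stored bytes.
  final case class ToyRDD[A](numPartitions: Int, compute: Int => Vector[A]) {
    // Coarse-grained transformations: one logged function applies to every element.
    def map[B](f: A => B): ToyRDD[B] =
      ToyRDD(numPartitions, p => compute(p).map(f))
    def filter(pred: A => Boolean): ToyRDD[A] =
      ToyRDD(numPartitions, p => compute(p).filter(pred))
  }

  // Hypothetical input: three immutable partitions standing in for stable storage.
  val input: Vector[Vector[Int]] =
    Vector(Vector(1, 2, 3), Vector(4, 5, 6), Vector(7, 8, 9))
  val base: ToyRDD[Int] = ToyRDD(input.length, p => input(p))
  val derived: ToyRDD[Int] = base.filter(_ % 2 == 1).map(_ * 10)

  // Materialize every partition in memory, as a cache would.
  val cached = scala.collection.mutable.Map.empty[Int, Vector[Int]]
  (0 until derived.numPartitions).foreach(p => cached(p) = derived.compute(p))

  // On failure, only the missing partition is rebuilt by replaying the lineage.
  def recover(p: Int): Vector[Int] = cached.getOrElseUpdate(p, derived.compute(p))

  def main(args: Array[String]): Unit = {
    cached.remove(1)     // simulate losing partition 1
    println(recover(1))  // rebuilt from input(1) via filter and map: Vector(50)
  }
}
```

Nothing extra is written while the job runs normally; the recipe is only replayed when a partition is actually missing, which is the "no cost if nothing fails" point.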


RDD Recovery

[Figure: in-memory data sharing diagram: iterative job: Input, iter. 1, iter. 2, ...; interactive queries: Input, one-time processing, query 1 / query 2 / query 3, ...]


Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items

Unify many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …

Support new apps that these models don’t


Tradeoff Space

[Figure: granularity of updates (fine to coarse) against write throughput (low to high), with network bandwidth and memory bandwidth marked. K-V stores, databases, RAMCloud: fine-grained, best for transactional workloads. HDFS, RDDs: coarse-grained, best for batch workloads.]


Spark Programming Interface

DryadLINQ-like API in the Scala language

Usable interactively from Scala interpreter

Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs), actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)


Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program): collect, reduce, count, save, lookupKey
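The split between transformations and actions can be illustrated without a cluster. The sketch below uses plain Scala iterators as a stand-in for RDDs (an assumption made for illustration; `LazySketch`, `pipeline`, and `evaluated` are hypothetical names): chaining `map` and `filter` only builds a recipe, and nothing is evaluated until something forces a result.

```scala
// Illustrative only: models laziness with plain Scala iterators, not Spark.
object LazySketch {
  var evaluated = 0 // counts how many elements the map function has touched

  // "Transformations": describe the computation; nothing runs yet.
  def pipeline(data: Seq[Int]): Iterator[Int] =
    data.iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ > 4)

  def main(args: Array[String]): Unit = {
    val recipe = pipeline(Seq(1, 2, 3, 4))
    println(evaluated)          // 0: only the recipe exists so far
    val result = recipe.toList  // "action": forces evaluation
    println(result)             // List(6, 8)
    println(evaluated)          // 4
  }
}
```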


Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")          // base RDD
val errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
val messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count          // action
messages.filter(_.contains("bar")).count

[Figure: the master sends tasks to three workers and receives results; each worker holds one block of the file (Block 1, Block 2, Block 3) and one cached partition of messages (Msgs. 1, Msgs. 2, Msgs. 3).]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
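To try the same pipeline without a cluster, the sketch below runs it over a plain Scala `Vector`. The sample log lines are invented for illustration, and `LogMiningSketch` and `countContaining` are hypothetical names; only the filter, map, and count steps mirror the slide.

```scala
// Plain Scala collections stand in for RDDs; the sample log lines are made up.
object LogMiningSketch {
  val lines: Vector[String] = Vector(
    "ERROR\t2013-01-01\tfoo failed to start",
    "INFO\t2013-01-01\tfoo started",
    "ERROR\t2013-01-02\tbar timed out",
    "ERROR\t2013-01-03\tfoo lost connection to bar"
  )

  val errors: Vector[String] = lines.filter(_.startsWith("ERROR"))
  // Field 2 (zero-based) of each tab-separated line is the message text.
  val messages: Vector[String] = errors.map(_.split('\t')(2))

  // Each interactive query is a filter followed by a count.
  def countContaining(pattern: String): Int = messages.count(_.contains(pattern))

  def main(args: Array[String]): Unit = {
    println(countContaining("foo")) // 2
    println(countContaining("bar")) // 2
  }
}
```

In Spark the `persist()` call is what keeps `messages` in memory between the two queries; here the `Vector` is simply already materialized.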


Task Scheduler

Dryad-like DAGs

Pipelines functions within a stage

Locality & data reuse aware

Partitioning-aware to avoid shuffles

[Figure: example DAG over RDDs A to G built from groupBy, map, union, and join, divided into Stage 1, Stage 2, and Stage 3; a legend marks cached data partitions.]


https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf


What is RDD?

Resilient Distributed Dataset: a big collection of data with the following properties

- Immutable
- Distributed
- Lazily evaluated
- Type inferred
- Cacheable

https://www.slideshare.net/datamantra/anatomy-of-rdd


Pseudo Monad

• Wraps iterator + partitions distribution

• Keeps track of history for fault tolerance

• Lazily evaluated, chaining of expressions

https://www.slideshare.net/deanchen11/scala-bay-spark-talk


Partitions

● Logical division of data
● Derived from Hadoop Map/Reduce
● All input, intermediate, and output data will be represented as partitions
● Partitions are the basic unit of parallelism
● RDD data is just a collection of partitions


Partition from Input Data

[Figure: data in HDFS (Chunk 1, Chunk 2, Chunk 3) passes through the input format to form an RDD (Partition 1, Partition 2, Partition 3).]


Partition and Immutability

● All partitions are immutable
● Every transformation generates a new partition
● Partition immutability is driven by the underlying storage, such as HDFS
● Partition immutability allows for fault recovery


Partitions and Distribution

● Partitions derived from HDFS are distributed by default
● Partitions are also location aware
● Location awareness of partitions allows for data locality
● For computed data, caching lets us distribute partitions in memory as well


Accessing partitions

● We can access a whole partition at once rather than a single row at a time
● The mapPartitions API of RDD allows this
● Accessing a partition at a time allows partition-wise operations that cannot be done by accessing a single row
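A rough sketch of the partition-at-a-time idea, in plain Scala rather than Spark: the helper below takes a function from a whole-partition iterator to an iterator, which is the same shape as the function passed to `mapPartitions`. The `partitions` value and the `MapPartitionsSketch` object are hypothetical stand-ins for an RDD.

```scala
// Plain-Scala stand-in for partition-wise processing; not the Spark API.
object MapPartitionsSketch {
  // Hypothetical stand-in for an RDD: a fixed list of partitions.
  val partitions: Vector[Vector[Int]] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6))

  // The supplied function sees one whole partition as an iterator and returns an iterator.
  def mapPartitions[A, B](parts: Vector[Vector[A]])(f: Iterator[A] => Iterator[B]): Vector[Vector[B]] =
    parts.map(p => f(p.iterator).toVector)

  // One output row per partition: a row-at-a-time map cannot produce this.
  val partitionSums: Vector[Vector[Int]] =
    mapPartitions(partitions)(it => Iterator(it.sum))

  def main(args: Array[String]): Unit =
    println(partitionSums) // Vector(Vector(6), Vector(9), Vector(6))
}
```

The same shape is useful when some setup (opening a connection, building a lookup table) should happen once per partition instead of once per row.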


Partition for transformed Data

● Partitioning will be different for key/value pairs that are generated by a shuffle operation
● Partitioning is driven by the partitioner specified
● By default, HashPartitioner is used
● You can also use your own partitioner


Hash Partitioning

[Figure: an input RDD (Partition 1, Partition 2, Partition 3) is mapped by Hash(key) % no. of partitions onto an output RDD (Partition 1, Partition 2).]
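The rule Hash(key) % no. of partitions can be tried out in plain Scala. This is a sketch under stated assumptions: `HashPartitionSketch`, `partitionFor`, and `repartition` are hypothetical names, and the adjustment for negative hash codes is added here because `hashCode` can be negative on the JVM.

```scala
// Toy hash partitioning over in-memory key/value pairs; not the Spark API.
object HashPartitionSketch {
  // Non-negative modulo, since hashCode may be negative on the JVM.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Redistribute key/value pairs into numPartitions output partitions.
  def repartition[K, V](input: Seq[Seq[(K, V)]], numPartitions: Int): Vector[Vector[(K, V)]] = {
    val grouped = input.flatten.groupBy { case (k, _) => partitionFor(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty).toVector)
  }

  def main(args: Array[String]): Unit = {
    // Three input partitions, two output partitions, as in the figure.
    val input = Seq(Seq(1 -> "a", 2 -> "b"), Seq(3 -> "c", 4 -> "d"), Seq(5 -> "e"))
    println(repartition(input, 2))
  }
}
```

With integer keys the hash code is the integer itself, so even keys land in output partition 0 and odd keys in output partition 1.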


Custom Partitioner

● Partition the data according to your data structure
● Custom partitioning allows control over the number of partitions and the distribution of data across them when grouping or reducing is done
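As an illustration of a custom rule, the sketch below defines a local `ToyPartitioner` trait with the same two members as Spark's `org.apache.spark.Partitioner` (`numPartitions` and `getPartition(key: Any)`), so it runs without Spark on the classpath. The first-letter routing rule and all other names are hypothetical.

```scala
// Toy custom partitioner in plain Scala; the routing rule is invented for illustration.
object CustomPartitionerSketch {
  // Local stand-in with the same two members as Spark's Partitioner.
  trait ToyPartitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Hypothetical rule: route keys by the first letter of their string form.
  object FirstLetterPartitioner extends ToyPartitioner {
    val numPartitions: Int = 3
    def getPartition(key: Any): Int =
      key.toString.headOption.map(_.toLower) match {
        case Some(c) if c >= 'a' && c <= 'm' => 0
        case Some(c) if c >= 'n' && c <= 'z' => 1
        case _                               => 2 // digits, symbols, empty keys
      }
  }

  // Group values per partition, as a shuffle before grouping or reducing would.
  def distribute[V](pairs: Seq[(String, V)], p: ToyPartitioner): Map[Int, Seq[V]] =
    pairs
      .groupBy { case (k, _) => p.getPartition(k) }
      .map { case (part, kvs) => part -> kvs.map(_._2) }

  def main(args: Array[String]): Unit =
    println(distribute(Seq("apple" -> 1, "zebra" -> 2, "Mango" -> 3, "42" -> 4), FirstLetterPartitioner))
}
```

Both the number of partitions and which keys end up together are decided by the partitioner, which is the control the slide refers to.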


Lookup operation

● Partitioning allows faster lookups
● The lookup operation looks up the values for a given key
● Using the partitioner, lookup determines which partition to look in
● Then it only needs to look in that partition
● If no partitioner is specified, it will fall back to filter
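The lookup behaviour described above can be sketched as follows. `ToyPairRDD`, `hashPartitioned`, and the `partitionsScanned` counter are hypothetical and exist only to show how many partitions each path touches; this is not Spark's implementation.

```scala
// Toy model of partition-aware lookup in plain Scala; not the Spark API.
object LookupSketch {
  // Partitions plus an optional key-to-partition function.
  final class ToyPairRDD[K, V](
      val partitions: Vector[Vector[(K, V)]],
      val partitioner: Option[K => Int]
  ) {
    var partitionsScanned = 0 // instrumentation for the example only

    def lookup(key: K): Seq[V] = {
      val toScan = partitioner match {
        case Some(p) => Vector(partitions(p(key))) // known partitioner: one partition
        case None    => partitions                 // otherwise: filter every partition
      }
      partitionsScanned += toScan.length
      toScan.flatMap(_.collect { case (k, v) if k == key => v })
    }
  }

  // Lay out integer-keyed pairs with a simple modulo partitioner.
  def hashPartitioned(pairs: Seq[(Int, String)], n: Int): ToyPairRDD[Int, String] = {
    val part: Int => Int = k => Math.floorMod(k, n)
    val byPart = pairs.groupBy { case (k, _) => part(k) }
    new ToyPairRDD[Int, String](
      Vector.tabulate(n)(i => byPart.getOrElse(i, Seq.empty).toVector),
      Some(part)
    )
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d", 5 -> "e", 6 -> "f")
    val partitioned = hashPartitioned(pairs, 3)
    println(partitioned.lookup(5))         // Vector(e)
    println(partitioned.partitionsScanned) // 1

    val unpartitioned = new ToyPairRDD[Int, String](partitioned.partitions, None)
    println(unpartitioned.lookup(5))         // Vector(e)
    println(unpartitioned.partitionsScanned) // 3
  }
}
```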


Parent(Dependency)

● Each RDD has access to its parent RDD
● Nil is the value of parent for the first RDD
● Before computing its value, it always computes its parent
● This chain of running allows for laziness


Subclassing

● Each Spark operator creates an instance of a specific subclass of RDD
● The map operator results in MappedRDD, flatMap in FlatMappedRDD, etc.
● The subclass allows the RDD to remember the operation that is performed in the transformation


RDD transformations

val dataRDD = sc.textFile(args(1))
val splitRDD = dataRDD.flatMap(value => value.split(" "))

[Figure: parent chain: splitRDD (FlatMappedRDD), then dataRDD (MappedRDD), then Hadoop RDD, then Nil.]


Compute

● Compute is the function for evaluation of each partition in an RDD
● Compute is an abstract method of RDD
● Each subclass of RDD, like MappedRDD or FilteredRDD, has to override this method


RDD actions

val dataRDD = sc.textFile(args(1))
val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
flatMapRDD.collect()

[Figure: collect() calls runJob, which triggers compute on the FlatMap RDD, then the Mapped RDD, then the Hadoop RDD, whose parent is Nil.]


runJob API

● The runJob API of RDD is the API used to implement actions
● runJob takes each partition and lets you evaluate it
● All Spark actions internally use the runJob API
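The parent, subclass, compute, and runJob slides fit together roughly as in the toy model below. It is a simplified sketch, not Spark's source: the class names (`ToyRDD`, `SourceToyRDD`, `MappedToyRDD`, `FlatMappedToyRDD`) are hypothetical, the data lives in memory, and `runJob` evaluates partitions one after another on a single thread.

```scala
// Toy RDD hierarchy in plain Scala; a simplified model, not Spark's implementation.
object RunJobSketch {
  // Each toy RDD knows its parent and how to compute one partition.
  abstract class ToyRDD[A](val parent: Option[ToyRDD[_]]) {
    def numPartitions: Int
    def compute(split: Int): Iterator[A]

    // Transformations only create a subclass instance that remembers the operation.
    def map[B](f: A => B): ToyRDD[B] = new MappedToyRDD(this, f)
    def flatMap[B](f: A => Iterable[B]): ToyRDD[B] = new FlatMappedToyRDD(this, f)

    // Actions are thin wrappers over runJob.
    def collect(): Vector[A] = runJob(this)(_.toVector).flatten
    def count(): Long = runJob(this)(_.size.toLong).sum
  }

  // Root of the chain: reads hypothetical in-memory blocks and has no parent.
  class SourceToyRDD[A](blocks: Vector[Vector[A]]) extends ToyRDD[A](None) {
    def numPartitions: Int = blocks.length
    def compute(split: Int): Iterator[A] = blocks(split).iterator
  }

  class MappedToyRDD[A, B](prev: ToyRDD[A], f: A => B) extends ToyRDD[B](Some(prev)) {
    def numPartitions: Int = prev.numPartitions
    // Computing a partition first asks the parent for the same partition.
    def compute(split: Int): Iterator[B] = prev.compute(split).map(f)
  }

  class FlatMappedToyRDD[A, B](prev: ToyRDD[A], f: A => Iterable[B]) extends ToyRDD[B](Some(prev)) {
    def numPartitions: Int = prev.numPartitions
    def compute(split: Int): Iterator[B] = prev.compute(split).flatMap(f)
  }

  // Evaluate every partition and hand its iterator to the supplied function.
  def runJob[A, R](rdd: ToyRDD[A])(func: Iterator[A] => R): Vector[R] =
    Vector.tabulate(rdd.numPartitions)(i => func(rdd.compute(i)))

  def main(args: Array[String]): Unit = {
    val data = new SourceToyRDD(Vector(Vector("a b", "c"), Vector("d e f")))
    val words = data.flatMap(line => line.split(" ").toVector)
    println(words.collect()) // Vector(a, b, c, d, e, f)
    println(words.count())   // 6
  }
}
```

No work happens when `flatMap` is called; `compute` runs only once an action reaches `runJob`, and each `compute` pulls from its parent, which is how the chain stays lazy.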


Caching

● cache internally uses the persist API
● persist sets a specific storage level for a given RDD
● Spark context tracks persistent RDDs
● When first evaluated, a partition will be put into memory by the block manager


Block manager

● Handles all in-memory data in Spark
● Responsible for
  ○ Cached data (BlockRDD)
  ○ Shuffle data
  ○ Broadcast data
● Each partition will be stored in a block with id (RDD.id, partition_index)


How does caching work?

● The partition iterator checks the storage level
● If a storage level is set, it calls cacheManager.getOrCompute(partition)
● As the iterator is run for each RDD evaluation, it is transparent to the user
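The get-or-compute pattern behind these bullets can be sketched with a mutable map keyed by (RDD id, partition index). Everything here is a hypothetical stand-in (`CacheSketch`, `blocks`, `compute`, the `persisted` flag); the real cache manager and block manager handle storage levels, eviction, and remote blocks that this toy ignores.

```scala
// Toy get-or-compute cache in plain Scala; not Spark's cache manager.
object CacheSketch {
  import scala.collection.mutable

  // Blocks are keyed by (RDD id, partition index).
  val blocks: mutable.Map[(Int, Int), Vector[Int]] = mutable.Map.empty
  var computeCalls = 0 // instrumentation for the example only

  // Hypothetical expensive computation for one partition.
  def compute(rddId: Int, partition: Int): Vector[Int] = {
    computeCalls += 1
    Vector.tabulate(3)(i => rddId * 100 + partition * 10 + i)
  }

  // Return the cached block if present; otherwise compute it and store it.
  def getOrCompute(rddId: Int, partition: Int): Vector[Int] =
    blocks.getOrElseUpdate((rddId, partition), compute(rddId, partition))

  // The partition iterator consults the cache only when a storage level is set.
  def iterator(rddId: Int, partition: Int, persisted: Boolean): Iterator[Int] =
    if (persisted) getOrCompute(rddId, partition).iterator
    else compute(rddId, partition).iterator

  def main(args: Array[String]): Unit = {
    iterator(1, 0, persisted = true).foreach(_ => ())
    iterator(1, 0, persisted = true).foreach(_ => ())
    println(computeCalls) // 1: the second evaluation was served from the cache
  }
}
```

Callers always go through `iterator`, so whether a partition comes from the cache or from recomputation is invisible to them.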


The State of Spark, and Where We're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nU6vO2EJAb4

A Brief History: Spark


Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Figure: lineage chain: HadoopRDD (path = hdfs://…), then FilteredRDD (func = _.contains(...)), then MappedRDD (func = _.split(…)).]


Fault Recovery Results

[Figure: bar chart of iteration time (s) for iterations 1 to 10: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. The failure happens at iteration 6 (81 s).]