Spark: Resilient Distributed Datasets for In-Memory Cluster Computing
Brad Karp, UCL Computer Science
(with slides contributed by Matei Zaharia)
CS M038 / GZ06, 9th March 2016
Motivation
MapReduce greatly simplified "big data" analysis on large, unreliable clusters

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)
Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage ⇒ slow!
Examples
[Figure: two workloads. Iterative job: each of iter. 1, iter. 2, … reads its input from HDFS and writes its result back to HDFS, so every iteration pays an HDFS read and an HDFS write. Interactive queries: query 1, 2, 3 each re-read the same input from HDFS to produce result 1, 2, 3.]

Slow because of replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing
[Figure: the same two workloads, with data shared through RAM instead of HDFS; the input is processed once, then iterations and queries 1-3 read from memory.]

RAM is 10-100× faster than network/disk, but how to get fault tolerance?
Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state
» RAMCloud, databases, distributed shared memory, Piccolo

Requires replicating data or logs across nodes for fault tolerance
» Costly for data-intensive apps
» 10-100× slower than memory write
Solution: Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
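Before the real API, a toy single-machine sketch may help make "recompute from lineage" concrete. Everything here (ToyRDD, Source, and friends) is invented for illustration and is not Spark's implementation; each dataset records only the coarse-grained operation and the parent that built it, so any partition can be rebuilt on demand:

// Toy model: an "RDD" is its lineage. compute(i) rebuilds partition i
// deterministically from its parent, so a lost partition costs one replay.
sealed trait ToyRDD[T] {
  def compute(partition: Int): Seq[T]
  def map[U](f: T => U): ToyRDD[U] = new Mapped(this, f)
  def filter(p: T => Boolean): ToyRDD[T] = new Filtered(this, p)
}

final class Source[T](parts: Vector[Seq[T]]) extends ToyRDD[T] {
  def compute(i: Int): Seq[T] = parts(i)            // stable storage: always available
}

final class Mapped[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U] {
  def compute(i: Int): Seq[U] = parent.compute(i).map(f)   // one logged op, many elements
}

final class Filtered[T](parent: ToyRDD[T], p: T => Boolean) extends ToyRDD[T] {
  def compute(i: Int): Seq[T] = parent.compute(i).filter(p)
}

val logs   = new Source(Vector(Seq("ERROR x", "INFO y"), Seq("ERROR z")))
val errors = logs.filter(_.startsWith("ERROR")).map(_.length)
println(errors.compute(1))   // "losing" partition 1 is harmless: replay rebuilds it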
RDD Recovery
[Figure, built up over several slides: the in-memory sharing picture again; one-time processing of the input, queries 1-3 served from RAM, and iterations iter. 1, iter. 2, … sharing data in memory, with lost partitions recomputed rather than replicated.]
Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items

Unify many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …

Support new apps that these models don't
Tradeoff Space
[Figure, built up over several slides: granularity of updates (fine ↔ coarse) on one axis vs. write throughput (low ↔ high) on the other. Fine-grained, network-bandwidth-bound corner: K-V stores, databases, RAMCloud; best for transactional workloads. Coarse-grained corner: HDFS (network bandwidth) and RDDs (memory bandwidth); best for batch workloads.]
Outline
Spark programming interface
Implementation
Demo
How people are using Spark
Spark Programming Interface
DryadLINQ-like API in the Scala language
Usable interactively from the Scala interpreter

Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs), actions (compute and output results)
» Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
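A minimal sketch of all three provisions against the open-source Spark API (the org.apache.spark package names postdate the research prototype these slides describe):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object InterfaceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("interface-sketch").setMaster("local[*]"))

    val nums  = sc.parallelize(1 to 1000, numSlices = 8) // partitioning: 8 partitions
    val evens = nums.filter(_ % 2 == 0)                  // transformation: lazily builds a new RDD
    evens.persist(StorageLevel.MEMORY_ONLY)              // persistence: keep in RAM once computed

    println(evens.count())                               // action: forces computation, prints 500
    sc.stop()
  }
}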
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count      // Action
messages.filter(_.contains("bar")).count

[Figure, built up over several slides: the Master ships tasks to three Workers, each holding one block of the input (Block 1-3); after the first action, each Worker caches its partition of messages in RAM (Msgs. 1-3) and returns results, so later queries never touch HDFS.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
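The same pipeline as a runnable spark-shell session (where a SparkContext named sc is predefined; the slide's spark handle plays the same role, and the HDFS path here is hypothetical):

// Any Hadoop-readable URI works; this path is made up for illustration.
val lines    = sc.textFile("hdfs://namenode:8020/logs/app.log")
val errors   = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))   // keep the third tab-separated field
messages.persist()                            // cache partitions in RAM on first use

// Each count is an action. Only the first scan touches HDFS; the second
// is answered from the cached messages partitions.
messages.filter(_.contains("foo")).count()
messages.filter(_.contains("bar")).count()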
Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

E.g.: messages = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))

[Lineage graph: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
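In today's Spark the lineage graph is inspectable from the shell via toDebugString. The class names differ from the prototype's (filter and map both build MapPartitionsRDDs now), and the output below is only approximate, but the chain is the same:

val messages = sc.textFile("hdfs://namenode:8020/logs/app.log")  // hypothetical path
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

println(messages.toDebugString)
// (2) MapPartitionsRDD[3] at map ...
//  |  MapPartitionsRDD[2] at filter ...
//  |  MapPartitionsRDD[1] at textFile ...
//  |  hdfs://... HadoopRDD[0] at textFile ...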
Fault Recovery Results
[Figure: iteration time (s) over 10 iterations: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. The failure happens during iteration 6, which slows only that iteration (81 s vs. the usual 57-59 s) while lost partitions are recomputed.]
Fault Tolerance vs. Performance
With RDDs, the programmer controls the tradeoff between fault tolerance and performance
• Persist frequently (e.g., with replication): fast recovery, but slower execution
• Persist infrequently: fast execution, but slow recovery
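In the open-source API this tradeoff is spelled as storage levels; the "_2" suffix means each cached partition is replicated on two nodes. A sketch:

import org.apache.spark.storage.StorageLevel

// Option A: replicated persistence. Recovery after a failure is immediate,
// at the cost of extra memory and network traffic during execution.
messages.persist(StorageLevel.MEMORY_ONLY_2)

// Option B (what plain persist()/cache() does): one in-memory copy.
// Fastest execution, but a lost partition must be recomputed from lineage.
// messages.persist(StorageLevel.MEMORY_ONLY)
// (A given RDD can carry only one storage level, so pick one of the two.)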
Why Wide vs. Narrow Dependencies?
RDD includes metadata:
• lineage (graph of RDDs and operations)
• wide dependency: requires a shuffle (and possibly entire preceding RDD(s) in the lineage)
• narrow dependency: no shuffle (only needs isolated partition(s) of preceding RDD(s) in the lineage)
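A small spark-shell illustration of the two dependency kinds:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow dependency: each output partition reads exactly one parent
// partition, so no data moves between nodes.
val narrow = pairs.mapValues(_ * 10)

// Wide dependency: any output partition may need records from every parent
// partition, so reduceByKey shuffles data across the cluster.
val wide = pairs.reduceByKey(_ + _)

println(wide.toDebugString)   // the printed lineage shows the shuffle boundary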
Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
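A runnable version of the same loop for spark-shell; the three-page link graph is made up for illustration:

val links = sc.parallelize(Seq(
  "a" -> Seq("b", "c"),
  "b" -> Seq("c"),
  "c" -> Seq("a"))).persist()             // reused every iteration, so keep in RAM

var ranks = links.mapValues(_ => 1.0)     // rule 1: every page starts at rank 1

for (_ <- 1 to 10) {                      // rule 2: redistribute rank along links
  ranks = links.join(ranks).flatMap {
    case (_, (dests, rank)) => dests.map(d => (d, rank / dests.size))
  }.reduceByKey(_ + _)
}
ranks.collect().foreach(println)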
Optimizing Placement
links & ranks repeatedly joined

Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy(new URLPartitioner())

[Figure: dataflow DAG. Links (url, neighbors) and Ranks0 (url, rank) feed a join producing Contribs0; a reduce yields Ranks1, which joins with Links again, and so on through Ranks2, …]
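URLPartitioner is not a built-in class; one plausible sketch hashes on the URL's host so all pages of a site land in the same partition:

import java.net.URI
import org.apache.spark.Partitioner

class URLPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val host = Option(new URI(key.toString).getHost).getOrElse("")
    val h = host.hashCode % numPartitions
    if (h < 0) h + numPartitions else h   // keep the bucket index non-negative
  }
  // Equal partitioners let Spark prove two RDDs are co-partitioned.
  override def equals(other: Any): Boolean = other match {
    case p: URLPartitioner => p.numPartitions == numPartitions
    case _                 => false
  }
}

// Partition once and persist; joins against any RDD partitioned the same
// way then avoid shuffling links.
// links = links.partitionBy(new URLPartitioner(64)).persist()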
PageRank Performance
[Figure: time per iteration (s): Hadoop 171; Basic Spark 72; Spark + Controlled Partitioning 23]
Memory Exhaustion
Least Recently Used (LRU) eviction of partitions
First evict non-persisted, used partitions; they vanish, as they are not stored on disk
Then evict persisted ones
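Cached data can also be released explicitly rather than waiting for LRU eviction; a one-liner against the open-source API:

messages.unpersist()   // explicitly drop the cached partitions of messages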
Behavior with Insufficient RAM
[Figure: iteration time (s) vs. percent of working set in memory: 0%: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; 100%: 11.5. Performance degrades gradually as less of the working set fits in RAM.]
Scalability
[Figure: iteration time (s) vs. number of machines (25 / 50 / 100), two benchmarks.
Logistic Regression: Hadoop 184 / 111 / 76; HadoopBinMem 116 / 80 / 62; Spark 15 / 6 / 3.
K-Means: Hadoop 274 / 157 / 106; HadoopBinMem 197 / 121 / 87; Spark 143 / 61 / 33.]
Contributors to Speedup
[Figure: iteration time (s) for text vs. binary input: In-mem HDFS 15.4 / 8.4; In-mem local file 13.1 / 6.9; Spark RDD 2.9 / 2.9]
Implementation
Runs on Mesos [NSDI '11] to share the cluster with Hadoop

Can read from any Hadoop input source (HDFS, S3, …)

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across the cluster's nodes]

No changes to Scala language or compiler
» Reflection + bytecode analysis to correctly ship code

www.spark-project.org
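For example, any Hadoop-supported filesystem URI can serve as input; the paths below are hypothetical, and the s3a scheme assumes the hadoop-aws package is on the classpath:

val fromHdfs = sc.textFile("hdfs://namenode:8020/data/part-*")
val fromS3   = sc.textFile("s3a://my-bucket/logs/2016/*")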
Programming Models Implemented on Spark
RDDs can express many existing parallel models
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)

All are based on coarse-grained operations

Enables apps to efficiently intermix these models
RDDs: Summary
RDDs offer a simple and efficient programming model for a broad range of applications
• Avoid disk I/O overhead for intermediate results of multi-phase computations
• Leverage lineage: cheaper than checkpointing when nothing fails, and low-cost on failure
• Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

Spark's scheduler is significantly more complicated than MapReduce's; we did not discuss it!