View
216
Download
0
Category
Preview:
Citation preview
Overview of Spark projectPresented by Yin Zhu (yinz@ust.hk)
Materials from
• http://spark-project.org/documentation/
• Hadoop in Practice by A. Holmes
• My demo code: https://bitbucket.org/blackswift/spark-example
25 March 2013
Review of MapReducePagerank implemented in Scala and Hadoop
Why Pagerank?
>More complicated than the “hello world” WordCount
>Widely used in search engines, very large scale input (the whole web graph!)
>Iterative algorithm (typical style for most numerical algorithms for data mining)
A functional implementationdef pagerank(links: Array[(UrlId, Array[UrlId])], numIters: Int):Map[UrlId, Double] = { val n = links.size var ranks = (Array.fromFunction(i => (i, 1.0)) (n)).toMap // init: each node has rank 1.0 for (iter <- 1 to numIters) { // Take some interactions val contrib = links .flatMap(node => {
val out_url = node._1 val in_urls = node._2 val score = ranks(out_url) / in_urls.size // the score each outer link
recieves in_urls.map(in_url => (in_url, score) )
} ) .groupBy(url_score => url_score._1) // group the (url, score) pairs by url .map(url_scoregroup => // sum the score for each unique url (url_scoregroup._1, url_scoregroup._2.foldLeft (0.0) ((sum,url_score) => sum+url_score._2))) ranks = ranks.map(url_score => (url_score._1, if (contrib.contains(url_score._1)) 0.85 * contrib(url_score._1) + 0.15 else url_score._2)) } ranks }
Map
Reduce
Hadoop implementationhttps://github.com/alexholmes/hadoop-book/tree/master/src/main/java/com/manning/hip/ch7/pagerank
Hadoop/MapReduce implementation
Fault tolerance
the result after each iteration is saved to Disk
Speed/Disk IO
disk IO is proportional to the # of iterations
ideally the link graph should be loaded for only once
Resilient Distributed DatasetsA Fault-Tolerant Abstraction forIn-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley,Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley UC BERKELEY
MotivationMapReduce greatly simplified “big data” analysis on large, unreliable clusters
But as soon as it got popular, users wanted more:
»More complex, multi-stage applications(e.g. iterative machine learning & graph processing)
»More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph
processing)
MotivationComplex apps and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage
slow!
Examples
iter. 1 iter. 2 . . .
Input
HDFSread
HDFSwrite
HDFSread
HDFSwrite
Input
query 1
query 2
query 3
result 1
result 2
result 3
. . .
HDFSread
Slow due to replication and disk I/O,but necessary for fault tolerance
iter. 1 iter. 2 . . .
Input
Goal: In-Memory Data Sharing
Input
query 1
query 2
query 3
. . .
one-timeprocessing
10-100× faster than network/disk, but how to get FT?
Solution: Resilient Distributed Datasets (RDDs)Restricted form of distributed shared memory
»Immutable, partitioned collections of records
»Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
Efficient fault recovery using lineage»Log one operation to apply to many
elements»Recompute lost partitions on failure»No cost if nothing fails
Three core concepts of RDDTransformations
define a new RDD from an existing RDD
Cache and Partitioner
Put the dataset into memory/other persist media, and specify the locations of the sub datasets
Actions
carry out the actual computation
RDD and its lazy transformationsRDD[T]: A sequence of objects of type T
Transformations are lazy:
https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/PairRDDFunctions.scala
/*** Return an RDD of grouped items.*/
def groupBy[K: ClassManifest](f: T => K, p: Partitioner): RDD[(K, Seq[T])] = { val cleanF = sc.clean(f) this.map(t => (cleanF(t), t)).groupByKey(p) }
/*** Group the values for each key in the RDD into a single sequence. Allows controlling the* partitioning of the resulting key-value pair RDD by passing a Partitioner.*/
def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = { def createCombiner(v: V) = ArrayBuffer(v)
def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
def mergeCombiners(b1: ArrayBuffer[V], b2: ArrayBuffer[V]) = b1 ++= b2
val bufs = combineByKey[ArrayBuffer[V]]( createCombiner _, mergeValue _, mergeCombiners _, partitioner) bufs.asInstanceOf[RDD[(K, Seq[V])]] }
Actions: carry out the actual computation
https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/SparkContext.scala
runJob
Example
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
messages.persist() or .cache()
messages.filter(_.contains(“foo”)).count
messages.filter(_.contains(“bar”)).count
Task Scheduler for actionsDryad-like DAGs
Pipelines functionswithin a stage
Locality & data reuse aware
Partitioning-awareto avoid shuffles
join
union
groupBy
map
Stage 3
Stage 1
Stage 2
A: B:
C: D:
E:
F:
G:
= cached data partition
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.:messages = textFile(...).filter(_.contains(“error”)) .map(_.split(‘\t’)(2))
HadoopRDDpath = hdfs://…
FilteredRDDfunc =
_.contains(...)
MappedRDDfunc = _.split(…)
Fault Recovery
HadoopRDD FilteredRDD MappedRDD
Fault Recovery Results
1 2 3 4 5 6 7 8 9 100
20406080
100120140
119
57 56 58 5881
57 59 57 59
Iteration
Itera
trio
n t
ime (
s) Failure happens
Generality of RDDsDespite their restrictions, RDDs can express surprisingly many parallel algorithms
»These naturally apply the same operation to many items
Unify many current programming models»Data flow models: MapReduce, Dryad, SQL, …»Specialized models for iterative apps: BSP
(Pregel), iterative MapReduce (Haloop), bulk incremental, …
Support new apps that these models don’t
Memorybandwidth
Networkbandwidth
Tradeoff Space
Granularityof Updates
Write Throughput
Fine
Coarse
Low High
K-V stores,databases,RAMCloud
Best for batchworkloads
Best fortransactional
workloads
HDFS RDDs
Spark Programming InterfaceDryadLINQ-like API in the Scala language
Usable interactively from Scala interpreter
Provides:»Resilient distributed datasets (RDDs)»Operations on RDDs: transformations (build
new RDDs), actions (compute and output results)
»Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc)
Example: Log MiningLoad error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
messages.persist()Block 1
Block 2
Block 3
Worker
Worker
Worker
Master
messages.filter(_.contains(“foo”)).count
messages.filter(_.contains(“bar”)).count
tasks
results
Msgs. 1
Msgs. 2
Msgs. 3
Base RDD
Transformed RDD
Action
Result: full-text search of Wikipedia in <1 sec (vs 20
sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
Example: PageRank1. Start each page with a rank of 12. On each iteration, update each page’s rank to
Σi∈neighbors ranki / |neighborsi|
links = // RDD of (url, neighbors) pairsranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) { ranks = links.join(ranks).flatMap { (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) }.reduceByKey(_ + _)}
Optimizing Placement
links & ranks repeatedly joined
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name
links = links.partitionBy( new URLPartitioner())
reduce
Contribs0
join
join
Contribs2
Ranks0
(url, rank)
Links(url,
neighbors)
. . .
Ranks2
reduce
Ranks1
PageRank Performance
01020304050607080 72
23
Hadoop
Basic Spark
Spark + Con-trolled Partition-ing
Tim
e p
er
iter-
ati
on
(s)
ImplementationRuns on Mesos [NSDI 11]to share clusters w/ Hadoop
Can read from any Hadoop input source (HDFS, S3, …)
SparkHadoo
pMPI
Mesos
Node Node Node Node
…
No changes to Scala language or compiler»Reflection + bytecode analysis to correctly ship
code
www.spark-project.org
Behavior with Insufficient RAM
0% 25% 50% 75% 100%0
20
40
60
80
10068.8
58.1
40.7
29.7
11.5
Percent of working set in memory
Itera
tion
tim
e (
s)
Scalability
25 50 1000
50
100
150
200
250
18
4
11
1
76
11
6
80
62
15
6 3
HadoopHadoopBinMemSpark
Number of machines
Ite
rati
on
tim
e (
s)
25 50 1000
50
100
150
200
250
300 27
4
15
7
10
6
19
7
12
1
87
14
3
61
33
Hadoop HadoopBinMemSpark
Number of machines
Ite
rati
on
tim
e (
s)
Logistic Regression K-Means
Breaking Down the Speedup
In-mem HDFS
In-mem local file
Spark RDD0
5
10
15
201
5.4
13
.1
2.9
8.4
6.9
2.9
Text InputBinary Input
Itera
tion
tim
e (
s)
Spark Operations
Transformations
(define a new RDD)
mapfilter
samplegroupByKeyreduceByKey
sortByKey
flatMapunionjoin
cogroupcross
mapValues
Actions(return a result
to driver program)
collectreducecountsave
lookupKey
Open Source Community15 contributors, 5+ companies using Spark,3+ applications projects at Berkeley
User applications:»Data mining 40x faster than Hadoop
(Conviva)»Exploratory log analysis (Foursquare)»Traffic prediction via EM (Mobile Millennium)»Twitter spam classification (Monarch)»DNA sequence analysis (SNAP)». . .
Related WorkRAMCloud, Piccolo, GraphLab, parallel DBs
»Fine-grained writes requiring replication for resilience
Pregel, iterative MapReduce»Specialized models; can’t run arbitrary / ad-hoc queries
DryadLINQ, FlumeJava»Language-integrated “distributed dataset” API, but
cannot share datasets efficiently across queries
Nectar [OSDI 10]»Automatic expression caching, but over distributed FS
PacMan [NSDI 12]»Memory cache for HDFS, but writes still go to
network/disk
ConclusionRDDs offer a simple and efficient programming model for a broad range of applications
Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
Try it out at www.spark-project.org
Recommended